Exponential inequalities for VLMC empirical trees

Published online by Cambridge University Press: 23 January 2008

Antonio Galves
Affiliation:
Instituto de Matemática e Estatística, Universidade de São Paulo, BP 66281, 05315-970 São Paulo, Brasil; galves@ime.usp.br
Véronique Maume-Deschamps
Affiliation:
Institut de Mathématiques de Bourgogne, BP 47870, 21078 Dijon cedex France; vmaume@u-bourgogne.fr; schmittb@u-bourgogne.fr
Bernard Schmitt
Affiliation:
Institut de Mathématiques de Bourgogne, BP 47870, 21078 Dijon cedex France; vmaume@u-bourgogne.fr; schmittb@u-bourgogne.fr

Abstract

A seminal paper by Rissanen, published in 1983, introduced the class of Variable Length Markov Chains (VLMC) and the algorithm Context, which estimates the probabilistic tree generating the chain. Although the subject has recently been considered in several papers, the central question of the rate of convergence of the algorithm has remained open. This is the question we address here. We provide an exponential upper bound for the probability of incorrect estimation of the probabilistic tree, as a function of the sample size. As a consequence, we prove the almost sure consistency of the algorithm Context. We also derive exponential upper bounds for type I errors and for the probability of underestimation of the context tree. The constants appearing in the bounds are all explicit and obtained constructively.
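
To make the idea of estimating a context tree from a sample concrete, here is a minimal Python sketch. It is not the authors' procedure or Rissanen's original algorithm: it is a simplified pruning rule, assumed for illustration only, that keeps the longest suffix of the past whose estimated next-symbol probabilities differ from those of its shorter suffix by more than a threshold. The toy binary model, the threshold `delta`, and all function names (`simulate_vlmc`, `word_counts`, `cond_prob`, `gap`, `estimate_tree`) are hypothetical.

```python
import random
from collections import Counter
from itertools import product

ALPHABET = (0, 1)


def simulate_vlmc(n, seed=0):
    """Simulate a toy binary VLMC (hypothetical model, not from the paper).

    Contexts, read from oldest to most recent symbol:
      past ends in 1    -> P(next = 1) = 0.2
      past ends in 1, 0 -> P(next = 1) = 0.9
      past ends in 0, 0 -> P(next = 1) = 0.3
    """
    rng = random.Random(seed)
    x = [0, 0]  # arbitrary initial past
    while len(x) < n:
        if x[-1] == 1:
            p_one = 0.2
        elif x[-2] == 1:
            p_one = 0.9
        else:
            p_one = 0.3
        x.append(1 if rng.random() < p_one else 0)
    return x


def word_counts(sample, max_len):
    """Count every word of length 1..max_len occurring in the sample."""
    counts = Counter()
    for k in range(1, max_len + 1):
        for i in range(len(sample) - k + 1):
            counts[tuple(sample[i:i + k])] += 1
    return counts


def cond_prob(counts, w, a):
    """Empirical probability that symbol a follows the word w."""
    total = sum(counts[w + (b,)] for b in ALPHABET)
    return counts[w + (a,)] / total if total else 0.0


def gap(counts, w):
    """Largest change in the estimated next-symbol law when the
    oldest symbol of w is dropped."""
    return max(
        abs(cond_prob(counts, w, a) - cond_prob(counts, w[1:], a))
        for a in ALPHABET
    )


def estimate_tree(sample, max_depth, delta):
    """Assumed simplified rule: for each observed past of length max_depth,
    keep its longest suffix whose gap exceeds delta."""
    counts = word_counts(sample, max_depth + 1)
    tree = set()
    for past in product(ALPHABET, repeat=max_depth):
        if counts[past] == 0:
            continue  # this past never occurs in the sample
        k = max(
            (j for j in range(1, max_depth + 1)
             if gap(counts, past[-j:]) > delta),
            default=0,
        )
        tree.add(past[max_depth - k:])
    return tree


sample = simulate_vlmc(50_000, seed=1)
print(sorted(estimate_tree(sample, max_depth=3, delta=0.1), key=len))
```

In this sketch the threshold `delta` plays the role of a tuning constant: too large a value prunes true contexts (underestimation), while too small a value keeps spurious long contexts. The sample size of 50,000 and the value 0.1 are arbitrary choices for the toy model.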

Type
Research Article
Copyright
© EDP Sciences, SMAI, 2008

References

Bejerano, G. and Yona, G., Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics 17 (2001) 23-43.
Bühlmann, P. and Wyner, A., Variable length Markov chains. Ann. Statist. 27 (1999) 480-513.
Csiszár, I., Large-scale typicality of Markov sample paths and consistency of MDL order estimators. Special issue on Shannon theory: perspective, trends, and applications. IEEE Trans. Inform. Theory 48 (2002) 1616-1628.
Csiszár, I. and Talata, Z., Context tree estimation for not necessarily finite memory processes via BIC and MDL, manuscript (2005).
Dedecker, J. and Doukhan, P., A new covariance inequality and applications. Stochastic Process. Appl. 106 (2003) 63-80.
Dedecker, J. and Prieur, C., New dependence coefficients. Examples and applications to statistics. Prob. Theory Relat. Fields 132 (2005) 203-236.
Ferrari, P. and Galves, A., Coupling and regeneration for stochastic processes. Notes for a minicourse presented at the XIII Escuela Venezolana de Matematicas (2000). Available at www.ime.usp.br/~pablo/book/abstract.html.
Ferrari, F. and Wyner, A., Estimation of general stationary processes by variable length Markov chains. Scand. J. Statist. 30 (2003) 459-480.
Leonardi, F. and Galves, A., Sequence motif identification and protein classification using probabilistic trees. Lect. Notes Comput. Sci. 3594 (2005) 190-193.
Maume-Deschamps, V., Exponential inequalities and estimation of conditional probabilities, in Dependence in probability and statistics, P. Bertail, P. Doukhan and P. Soulier Eds., Lect. Notes in Stat., Vol. 187. Springer (2006).
Rissanen, J., A universal data compression system. IEEE Trans. Inform. Theory 29 (1983) 656-664.
Tjalkens, T.J. and Willems, F.M.J., Implementing the context-tree weighting method: arithmetic coding. Recent advances in interdisciplinary mathematics (Portland, ME, 1997). J. Combin. Inform. System Sci. 25 (2000) 49-58.
Willems, F.M., Shtarkov, Y.M. and Tjalkens, T.J., The context-tree weighting method: basic properties. IEEE Trans. Inform. Theory 41 (1995) 653-664.