Estimating a discrete distribution via histogram selection

Nathalie Akakpo

doi:10.1051/ps/2009007

Published online by Cambridge University Press: 22 February 2011

Affiliation: Laboratoire de Probabilités et Statistiques, Université Paris Sud XI, Bâtiment 425, 91405 Orsay Cedex, France; nathalie.akakpo@math.u-psud.fr

Abstract

Our aim is to estimate the joint distribution of a finite sequence of independent categorical variables. We consider the collection of partitions into dyadic intervals and the associated histograms, and we select from the data the best histogram by minimizing a penalized least-squares criterion. The choice of the collection of partitions is inspired by approximation results due to DeVore and Yu. Our estimator satisfies a nonasymptotic oracle-type inequality and adaptivity properties in the minimax sense. Moreover, its computational complexity is only linear in the length of the sequence. We also use that estimator during the preliminary stage of a hybrid procedure for detecting multiple change-points in the joint distribution of the sequence. That second procedure still satisfies adaptivity properties and can be implemented efficiently. We provide a simulation study and apply the hybrid procedure to the segmentation of a DNA sequence.
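To make the selection step described in the abstract concrete, the following is a minimal sketch, not the author's implementation. It assumes that the sequence length is a power of two, that categories are coded 0, ..., r-1, and that the penalty is a hypothetical constant `pen` charged per interval; the abstract does not state the penalty actually used in the paper. The function name `select_dyadic_histogram` is likewise illustrative. The least-squares cost of fitting one frequency vector on an interval I works out to |I| minus the sum of squared category counts divided by |I|, and the recursion compares keeping a dyadic interval whole with the best partitions of its two halves.

```python
from typing import List, Tuple


def select_dyadic_histogram(y: List[int], r: int, pen: float) -> List[Tuple[int, int]]:
    """Select a partition into dyadic intervals by penalized least squares.

    y   : observed categories, coded 0..r-1 (assumed coding)
    r   : number of categories
    pen : hypothetical penalty charged per interval of the partition
    Returns the partition as half-open (start, end) index intervals.
    """
    n = len(y)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes len(y) is a power of two")

    def solve(lo: int, hi: int):
        # Returns (category counts on [lo, hi), best penalized value, best partition).
        if hi - lo == 1:
            counts = [0] * r
            counts[y[lo]] = 1
            # A single observation is fitted exactly: cost 0, plus one penalty.
            return counts, pen, [(lo, hi)]
        mid = (lo + hi) // 2
        counts_l, value_l, part_l = solve(lo, mid)
        counts_r, value_r, part_r = solve(mid, hi)
        counts = [a + b for a, b in zip(counts_l, counts_r)]
        size = hi - lo
        # Residual sum of squares of the empirical frequencies on [lo, hi).
        whole = size - sum(c * c for c in counts) / size + pen
        if whole <= value_l + value_r:
            return counts, whole, [(lo, hi)]
        return counts, value_l + value_r, part_l + part_r

    return solve(0, n)[2]


# Example: one change of distribution in the middle of the sequence.
print(select_dyadic_histogram([0, 0, 0, 0, 1, 1, 1, 1], r=2, pen=0.5))
```

The sketch visits each of the 2n - 1 dyadic intervals once and obtains the counts of an interval by adding those of its two halves, which is one simple way to keep the amount of work per interval small.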

Keywords

Adaptive estimator; approximation result; categorical variable; change-point detection; minimax estimation; model selection; nonparametric estimation; penalized least-squares estimation

Information

Type: Research Article
Information: ESAIM: Probability and Statistics, Volume 15: Supplement: In honor of Marc Yor, 2011, pp. 1–29

DOI: https://doi.org/10.1051/ps/2009007
Copyright: © EDP Sciences, SMAI, 2011

References

M. Aerts and N. Veraverbeke, Bootstrapping a nonparametric polytomous regression model. Math. Meth. Statist. 4 (1995) 189–200.

Y. Baraud and L. Birgé, Estimating the intensity of a random measure by histogram type estimators. Prob. Theory Relat. Fields 143 (2009) 239–284.

A. Barron, L. Birgé and P. Massart, Risk bounds for model selection via penalization. Prob. Theory Relat. Fields 113 (1999) 301–413.

C. Bennett and R. Sharpley, Interpolation of operators, volume 129 of Pure and Applied Mathematics. Academic Press Inc., Boston, MA (1988).

L. Birgé, Model selection via testing: an alternative to (penalized) maximum likelihood estimators. Ann. Inst. H. Poincaré Probab. Statist. 42 (2006) 273–325.

L. Birgé, Model selection for Poisson processes, in Asymptotics: Particles, Processes and Inverse Problems, Festschrift for Piet Groeneboom. IMS Lect. Notes Monograph Ser. 55. IMS, Beachwood, USA (2007) 32–64.

L. Birgé and P. Massart, Minimal penalties for Gaussian model selection. Prob. Theory Relat. Fields 138 (2007) 33–73.

J.V. Braun and H.-G. Müller, Statistical methods for DNA sequence segmentation. Stat. Sci. 13 (1998) 142–162.

J.V. Braun, R.K. Braun and H.-G. Müller, Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika 87 (2000) 301–314.

T.H. Cormen, C.E. Leiserson, R.L. Rivest and C. Stein, Introduction to algorithms. Second edition. MIT Press, Cambridge, MA (2001).

M. Csűrös, Algorithms for finding maximum-scoring segment sets, in Proc. of the 4th international workshop on algorithms in bioinformatics 2004. Lect. Notes Comput. Sci. 3240. Springer, Berlin, Heidelberg (2004) 62–73.

R.A. DeVore and G.G. Lorentz, Constructive approximation. Springer-Verlag, Berlin, Heidelberg (1993).

R.A. DeVore and R.C. Sharpley, Maximal functions measuring smoothness. Mem. Amer. Math. Soc. 47 (1984) 293.

R.A. DeVore and X.M. Yu, Degree of adaptive approximation. Math. Comp. 55 (1990) 625–635.

C. Durot, E. Lebarbier and A.-S. Tocquet, Estimating the joint distribution of independent categorical variables via model selection. Bernoulli 15 (2009) 475–507.

Y.-X. Fu and R.N. Curnow, Maximum likelihood estimation of multiple change points. Biometrika 77 (1990) 562–565.

S. Gey and E. Lebarbier, Using CART to detect multiple change-points in the mean for large samples. SSB preprint, Research Report No. 12 (2008).

M. Hoebeke, P. Nicolas and P. Bessières, MuGeN: simultaneous exploration of multiple genomes and computer analysis results. Bioinformatics 19 (2003) 859–864.

E. Lebarbier, Quelques approches pour la détection de ruptures à horizon fini. Ph.D. thesis, Université Paris Sud, Orsay (2002).

E. Lebarbier and E. Nédélec, Change-points detection for discrete sequences via model selection. SSB preprint, Research Report No. 9 (2007).

P. Massart, Concentration inequalities and model selection. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003. Lect. Notes Math. 1896. Springer, Berlin, Heidelberg (2007).

P. Nicolas et al., Mining Bacillus subtilis chromosome heterogeneities using hidden Markov models. Nucleic Acids Res. 30 (2002) 1418–1426.

W. Szpankowski, L. Szpankowski and W. Ren, An optimal DNA segmentation based on the MDL principle. Int. J. Bioinformatics Res. Appl. 1 (2005) 3–17.
