A non asymptotic penalized criterion for Gaussian mixture model selection

Published online by Cambridge University Press:  05 January 2012

Cathy Maugis
Affiliation:
Institut de Mathématiques de Toulouse, INSA de Toulouse, Université de Toulouse, 135 avenue de Rangueil, 31077 Toulouse Cedex 4, France; cathy.maugis@insa-toulouse.fr
Bertrand Michel
Affiliation:
Laboratoire de Statistique Théorique et Appliquée, Université Paris 6, 175 rue du Chevaleret, 75013 Paris, France; bertrand.michel@upmc.fr

Abstract

Specific Gaussian mixtures are considered to solve variable selection and clustering problems simultaneously. A non-asymptotic penalized criterion is proposed to choose the number of mixture components and the relevant variable subset. Because the associated Kullback-Leibler contrast is nonlinear on Gaussian mixtures, a general model selection theorem for maximum likelihood estimation proposed by [Massart, Concentration inequalities and model selection. Springer, Berlin (2007). Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23 (2003)] is used to obtain the form of the penalty function. This theorem requires controlling the bracketing entropy of Gaussian mixture families. Both the ordered and non-ordered variable selection cases are addressed in this paper.
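
As a reading aid, the generic shape of a penalized maximum likelihood criterion can be sketched as follows. The notation is an assumption introduced here and is not taken from the abstract: $\mathcal{M}$ is the model collection, each model $m$ corresponds to a number of components and a variable subset, $\hat{s}_m$ is the maximum likelihood estimator in model $m$, $D_m$ is the number of free parameters, and $X_1, \dots, X_n$ is the sample. The specific penalty derived in the paper is not stated in the abstract and is not reproduced here; the second line shows only the familiar BIC penalty of Schwarz (1978), rescaled by $1/(2n)$, as one instance of the general form.

```latex
% Generic penalized maximum likelihood selection (notation assumed, see lead-in)
\hat{m} \in \operatorname*{argmin}_{m \in \mathcal{M}}
  \left\{ -\frac{1}{n} \sum_{i=1}^{n} \ln \hat{s}_m(X_i) + \operatorname{pen}(m) \right\}

% Familiar instance for comparison only: BIC divided by 2n
\operatorname{pen}_{\mathrm{BIC}}(m) = \frac{D_m \ln n}{2n}
```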

Type
Research Article
Copyright
© EDP Sciences, SMAI, 2011

References

H. Akaike, Information theory and an extension of the maximum likelihood principle, in Second International Symposium on Information Theory (Tsahkadsor, 1971), Akadémiai Kiadó, Budapest (1973) 267–281.
S. Arlot and P. Massart, Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res. (2008) (to appear).
J.D. Banfield and A.E. Raftery, Model-based Gaussian and non-Gaussian clustering. Biometrics 49 (1993) 803–821.
A. Barron, L. Birgé and P. Massart, Risk bounds for model selection via penalization. Prob. Th. Rel. Fields 113 (1999) 301–413.
J.-P. Baudry, Clustering through model selection criteria. Poster session at One Day Statistical Workshop in Lisieux. http://www.math.u-psud.fr/ baudry, June (2007).
C. Biernacki, G. Celeux and G. Govaert, Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 719–725.
C. Biernacki, G. Celeux, G. Govaert and F. Langrognet, Model-based cluster and discriminant analysis with the mixmod software. Comput. Stat. Data Anal. 51 (2006) 587–600.
L. Birgé and P. Massart, Gaussian model selection. J. Eur. Math. Soc. 3 (2001) 203–268.
L. Birgé and P. Massart, A generalized Cp criterion for Gaussian model selection. Prépublication n° 647, Universités de Paris 6 et Paris 7 (2001).
L. Birgé and P. Massart, Minimal penalties for Gaussian model selection. Prob. Th. Rel. Fields 138 (2007) 33–73.
L. Birgé and P. Massart, From model selection to adaptive estimation, in Festschrift for Lucien Le Cam. Springer, New York (1997) 55–87.
C. Bouveyron, S. Girard and C. Schmid, High-Dimensional Data Clustering. Comput. Stat. Data Anal. 52 (2007) 502–519.
K.P. Burnham and D.R. Anderson, Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer-Verlag, New York, 2nd edition (2002).
G. Castellan, Modified Akaike's criterion for histogram density estimation. Technical report, Université Paris-Sud 11 (1999).
G. Castellan, Density estimation via exponential model selection. IEEE Trans. Inf. Theory 49 (2003) 2052–2060.
G. Celeux and G. Govaert, Gaussian parsimonious clustering models. Pattern Recogn. 28 (1995) 781–793.
A.P. Dempster, N.M. Laird and D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc., Ser. B 39 (1977) 1–38.
C.R. Genovese and L. Wasserman, Rates of convergence for the Gaussian mixture sieve. Ann. Stat. 28 (2000) 1105–1127.
S. Ghosal and A.W. van der Vaart, Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. Ann. Stat. 29 (2001) 1233–1263.
C. Keribin, Consistent estimation of the order of mixture models. Sankhyā. The Indian Journal of Statistics. Series A 62 (2000) 49–66.
M.H. Law, M.A.T. Figueiredo and A.K. Jain, Simultaneous feature selection and clustering using mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 1154–1166.
E. Lebarbier, Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Proc. 85 (2005) 717–736.
V. Lepez, Potentiel de réserves d'un bassin pétrolier: modélisation et estimation. Ph.D. thesis, Université Paris-Sud 11 (2002).
P. Massart, Concentration inequalities and model selection. Springer, Berlin (2007). Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23 (2003).
C. Maugis, Sélection de variables pour la classification non supervisée par mélanges gaussiens. Applications à l'étude de données transcriptomes. Ph.D. thesis, University Paris-Sud 11 (2008).
C. Maugis, G. Celeux and M.-L. Martin-Magniette, Variable Selection for Clustering with Gaussian Mixture Models. Biometrics (2008) (to appear).
C. Maugis and B. Michel, Slope heuristics for variable selection and clustering via Gaussian mixtures. Technical Report 6550, INRIA (2008).
A.E. Raftery and N. Dean, Variable Selection for Model-Based Clustering. J. Am. Stat. Assoc. 101 (2006) 168–178.
G. Schwarz, Estimating the dimension of a model. Ann. Stat. 6 (1978) 461–464.
D. Serre, Matrices. Springer-Verlag, New York (2002).
M. Talagrand, Concentration of measure and isoperimetric inequalities in product spaces. Publ. Math., Inst. Hautes Étud. Sci. 81 (1995) 73–205.
M. Talagrand, New concentration inequalities in product spaces. Invent. Math. 126 (1996) 505–563.
F. Villers, Tests et sélection de modèles pour l'analyse de données protéomiques et transcriptomiques. Ph.D. thesis, University Paris-Sud 11 (2007).