A graph-based estimator of the number of clusters

Published online by Cambridge University Press: 19 June 2007

Gérard Biau, Benoît Cadre and Bruno Pelletier
Affiliation:
Institut de Mathématiques et de Modélisation de Montpellier, UMR CNRS 5149, Équipe de Probabilités et Statistique, Université Montpellier II, CC 051, Place Eugène Bataillon, 34095 Montpellier Cedex 5, France; biau@math.univ-montp2.fr; cadre@math.univ-montp2.fr; pelletier@math.univ-montp2.fr

Abstract

Assessing the number of clusters of a statistical population is one of the essential issues of unsupervised learning. Given $n$ independent observations $X_1,\dots,X_n$ drawn from an unknown multivariate probability density $f$, we propose a new approach to estimate the number of connected components, or clusters, of the $t$-level set $\mathcal L(t)=\{x:f(x) \geq t\}$. The basic idea is to form a rough skeleton of the set $\mathcal L(t)$ using any preliminary estimator of $f$, and to count the number of connected components of the resulting graph. Under mild analytic conditions on $f$, and using tools from differential geometry, we establish the consistency of our method.
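
The idea described in the abstract can be illustrated with a minimal sketch. The abstract fixes neither the preliminary estimator nor the rule for building the graph, so everything specific here is an assumption made for illustration: a Gaussian kernel estimator with bandwidth `h`, a graph that joins two retained observations when they lie within distance `r` of each other, and the names `kde`, `count_clusters` and `sample`. Observations whose estimated density falls below $t$ are discarded before the components are counted.

```python
import math


def kde(points, x, h):
    """Gaussian kernel density estimate at x with bandwidth h (illustrative choice)."""
    d = len(x)
    norm = len(points) * (h * math.sqrt(2.0 * math.pi)) ** d
    total = 0.0
    for p in points:
        sq = sum((a - b) ** 2 for a, b in zip(p, x))
        total += math.exp(-sq / (2.0 * h * h))
    return total / norm


def count_clusters(points, t, h, r):
    """Count connected components of a graph built on the estimated t-level set.

    Assumed construction: keep the observations whose estimated density is at
    least t, then join two kept observations when their distance is at most r.
    """
    kept = [p for p in points if kde(points, p, h) >= t]
    parent = list(range(len(kept)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(kept)):
        for j in range(i + 1, len(kept)):
            sq = sum((a - b) ** 2 for a, b in zip(kept[i], kept[j]))
            if sq <= r * r:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj

    return len({find(i) for i in range(len(kept))})


# Synthetic data (not from the paper): two dense 5 x 5 grids plus one isolated point.
grid = [(0.2 * i, 0.2 * j) for i in range(5) for j in range(5)]
sample = grid + [(x + 10.0, y) for x, y in grid] + [(5.0, 5.0)]

if __name__ == "__main__":
    print(count_clusters(sample, t=0.1, h=0.3, r=0.5))
    print(count_clusters(sample, t=0.0, h=0.3, r=0.5))
```

On this synthetic sample the sketch reports two components at level `t=0.1`, because the isolated point has low estimated density and is discarded, and three components at `t=0.0`, where every observation is retained. The quadratic pairwise loop is kept for readability; it is not meant as an efficient implementation.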

Type
Research Article
Copyright
© EDP Sciences, SMAI, 2007
