An Examination of Procedures for Determining the Number of Clusters in a Data Set

Published online by Cambridge University Press:  01 January 2025

Glenn W. Milligan*
Affiliation:
The Ohio State University
Martha C. Cooper
Affiliation:
The Ohio State University
* Requests for reprints should be sent to Glenn W. Milligan, Faculty of Management Sciences, 301 Hagerty Hall, The Ohio State University, Columbus, OH 43210.

Abstract

A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets which contained either 2, 3, 4, or 5 distinct nonoverlapping clusters. To provide a variety of clustering solutions, the data sets were analyzed by four hierarchical clustering methods. External criterion measures indicated excellent recovery of the true cluster structure by the methods at the correct hierarchy level. Thus, the clustering present in the data was quite strong. The simulation results for the stopping rules revealed a wide range in their ability to determine the correct number of clusters in the data. Several procedures worked fairly well, whereas others performed rather poorly. Thus, the latter group of rules would appear to have little validity, particularly for data sets containing distinct clusters. Applied researchers are urged to select one or more of the better criteria. However, users are cautioned that the performance of some of the criteria may be data dependent.
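
As a rough illustration of the kind of experiment the abstract describes, the sketch below generates one artificial data set with a known number of well-separated clusters, builds a hierarchy by average linkage, and applies a single stopping rule at each hierarchy level. Everything in it is an assumption made for illustration: the data generator `make_blobs`, the choice of average linkage, and the variance-ratio index (between-cluster over within-cluster sum of squares, each divided by its degrees of freedom) are not taken from the paper and are not a statement about which rules it favors. A Monte Carlo study would repeat this over many generated data sets and tally how often the estimate matches the true number of clusters.

```python
import math
import random


def make_blobs(k, n_per, dim, seed):
    """Hypothetical generator: k tight Gaussian clusters spaced 10 units apart."""
    rng = random.Random(seed)
    points = []
    for c in range(k):
        center = [10.0 * c] + [0.0] * (dim - 1)
        for _ in range(n_per):
            points.append([rng.gauss(mu, 0.5) for mu in center])
    return points


def average_linkage_levels(points, max_k):
    """Agglomerate bottom-up; return {k: clusters} for 2 <= k <= max_k.

    Each cluster is a list of point indices. Naive implementation,
    suitable only for small data sets.
    """
    clusters = [[i] for i in range(len(points))]
    levels = {}
    while len(clusters) > 1:
        if len(clusters) <= max_k:
            levels[len(clusters)] = [list(c) for c in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise Euclidean distance between the two clusters.
                total = sum(math.dist(points[a], points[b])
                            for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return levels


def centroid(points, idx):
    """Mean vector of the points whose indices are in idx."""
    dim = len(points[0])
    return [sum(points[i][d] for i in idx) / len(idx) for d in range(dim)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def variance_ratio(points, clusters):
    """Between-cluster SS / (k - 1) divided by within-cluster SS / (n - k)."""
    n, k = len(points), len(clusters)
    grand = centroid(points, range(n))
    within = between = 0.0
    for idx in clusters:
        c = centroid(points, idx)
        within += sum(sq_dist(points[i], c) for i in idx)
        between += len(idx) * sq_dist(c, grand)
    return (between / (k - 1)) / (within / (n - k))


def estimate_number_of_clusters(points, max_k):
    """Pick the hierarchy level (2..max_k) that maximizes the variance ratio."""
    levels = average_linkage_levels(points, max_k)
    return max(levels, key=lambda k: variance_ratio(points, levels[k]))


if __name__ == "__main__":
    data = make_blobs(k=3, n_per=20, dim=2, seed=1)
    print("estimated number of clusters:", estimate_number_of_clusters(data, max_k=5))
```

Because the index is undefined for a single cluster, this sketch can only choose among two or more clusters; a rule of this form cannot report that the data contain no cluster structure at all.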

Type
Original Paper
Copyright
Copyright © 1985 The Psychometric Society

Footnotes

The authors would like to express their appreciation to a number of individuals who provided assistance during the conduct of this research. Those who deserve recognition include Roger Blashfield, John Crawford, John Gower, James Lingoes, Wansoo Rhee, F. James Rohlf, Warren Sarle, and Tom Soon.
