
An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets

Published online by Cambridge University Press:  01 January 2025

Evgenia Dimitriadou
Affiliation:
Institut für Statistik und Wahrscheinlichkeitstheorie, Technische Universität Wien
Sara Dolničar
Affiliation:
Institut für Tourismus und Freizeitwirtschaft, Wirtschaftsuniversität Wien
Andreas Weingessel*
Affiliation:
Institut für Statistik und Wahrscheinlichkeitstheorie, Technische Universität Wien
* Requests for reprints should be sent to A. Weingessel, Institut für Statistik, Technische Universität Wien, Wiedner Hauptstraße 8-10/1071, A-1040 Wien, AUSTRIA.

Abstract

The problem of choosing the correct number of clusters is as old as cluster analysis itself. A number of authors have suggested various indexes to facilitate this crucial decision. One of the most extensive comparative studies of indexes was conducted by Milligan and Cooper (1985). The present study pursues the same goal under different conditions. In contrast to Milligan and Cooper's work, the emphasis here is on high-dimensional empirical binary data. Binary artificial data sets are constructed to reflect features typically encountered in real-world data situations in the field of marketing research. The simulation includes 162 binary data sets, each clustered by two different algorithms, yielding a recommended number of clusters for each index under consideration. The index results are evaluated, and the performance of the indexes is compared and analyzed.
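As a rough illustration of what an index-based recommendation on the number of clusters can look like, the sketch below clusters artificial binary data for several candidate values of k and recommends the k with the largest Calinski-Harabasz index (Calinski & Harabasz, 1974, listed in the references). It is not the authors' procedure: the abstract does not name the indexes or the two algorithms, so the choice of this index, the plain k-means routine, the Bernoulli data generator, and all function names and parameter values are assumptions made for illustration only.

```python
import random


def calinski_harabasz(points, labels):
    """Ratio of between- to within-cluster dispersion, each divided by its
    degrees of freedom. Uses the number of non-empty clusters found in labels."""
    n, dim = len(points), len(points[0])
    clusters = sorted(set(labels))
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")
    overall = [sum(p[j] for p in points) / n for j in range(dim)]
    between = within = 0.0
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        between += len(members) * sum((m - o) ** 2 for m, o in zip(centroid, overall))
        within += sum(sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in members)
    if within == 0:
        return float("inf")  # every point coincides with its cluster centroid
    return (between / (k - 1)) / (within / (n - k))


def kmeans(points, k, rng, iterations=50):
    """Plain Lloyd-style k-means with squared Euclidean distance (an assumption;
    not necessarily one of the two algorithms used in the study)."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for step in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centers[c])))
            for p in points
        ]
        if step > 0 and new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def make_binary_data(profiles, per_cluster, rng):
    """Hypothetical generator: each profile lists, per variable, the probability
    of observing a 1 for members of that cluster."""
    return [
        [1 if rng.random() < prob else 0 for prob in profile]
        for profile in profiles
        for _ in range(per_cluster)
    ]


def recommend_k(points, candidates, seed=0, restarts=5):
    """Return the candidate k with the largest index value, plus all scores.
    For each k, the best index value over several random restarts is kept."""
    rng = random.Random(seed)
    scores = {}
    for k in candidates:
        for _ in range(restarts):
            labels = kmeans(points, k, rng)
            if len(set(labels)) < 2:
                continue  # degenerate run, index undefined
            score = calinski_harabasz(points, labels)
            scores[k] = max(scores.get(k, score), score)
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    rng = random.Random(42)
    # Three hypothetical segments over 12 binary variables.
    profiles = [
        [0.9] * 4 + [0.1] * 8,
        [0.1] * 4 + [0.9] * 4 + [0.1] * 4,
        [0.1] * 8 + [0.9] * 4,
    ]
    data = make_binary_data(profiles, per_cluster=30, rng=rng)
    best_k, scores = recommend_k(data, candidates=range(2, 7))
    print("recommended number of clusters:", best_k)
    for k in sorted(scores):
        print(k, round(scores[k], 2))
```

In a simulation of the kind the abstract describes, a routine like the hypothetical `recommend_k` would be run once per data set, algorithm, and index, and the recommended value compared with the number of clusters used to generate the data.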

Type
Articles
Copyright
Copyright © 2002 The Psychometric Society


Footnotes

Author names are listed in alphabetical order.

This piece of research was supported by the Austrian Science Foundation (FWF) under grant SFB#010 (“Adaptive Information Systems and Modeling in Economics and Management Science”).

The authors would like to thank the anonymous reviewers and especially the associate editor for their helpful comments and suggestions.

References

Aldenderfer, M.S., & Blashfield, R.K. (1996). Cluster analysis. London, U.K.: Sage Publications.
Andrews, D.F. (1972). Plots of high-dimensional data. Biometrics, 28, 125–136.
Arabie, P., & Hubert, L.J. (1996). Clustering and classification (pp. 5–63). River Edge, NJ: World Scientific.
Arratia, R., & Lander, E.S. (1990). The distribution of clusters in random graphs. Advances in Applied Mathematics, 11, 36–48.
Baker, F.B., & Hubert, L.J. (1975). Measuring the power of hierarchical cluster analysis. Journal of the American Statistical Association, 70, 31–38.
Ball, G.H., & Hall, D.J. (1965). ISODATA, A novel method of data analysis and pattern classification. Menlo Park, CA: Stanford Research Institute.
Baroni-Urbani, C., & Buser, M.W. (1976). Similarity of binary data. Systematic Zoology, 25, 251–259.
Baulieu, F. (1989). A classification of presence/absence based dissimilarity coefficients. Journal of Classification, 6, 233–246.
Calinski, R.B., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3, 1–27.
Cheetham, H., & Hazel, J. (1969). Binary (presence-absence) similarity coefficients. Journal of Paleontology, 43, 1130–1136.
Cox, D. (1970). The analysis of binary data. London, U.K.: Chapman and Hall.
Davies, D.L., & Bouldin, D.W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1, 224–227.
Dolnicar, S., Grabler, K., & Mazanec, J. (2000). A tale of three cities: Perceptual charting for analysing destination images. In Woodside, A. (Ed.), Consumer psychology of tourism, hospitality and leisure (pp. 39–62). London, U.K.: CAB International.
Dolnicar, S., Leisch, F., Weingessel, A., Buchta, C., & Dimitriadou, E. (1998). A comparison of several cluster algorithms on artificial binary data scenarios from tourism marketing. Wien, Austria: Adaptive Information Systems.
Edwards, A.W.F., & Cavalli-Sforza, L. (1965). A method for cluster analysis. Biometrics, 21, 362–375.
Formann, A.K. (1984). Die Latent-Class-Analyse: Einführung in die Theorie und Anwendung [Latent class analysis: Introduction into theory and application]. Weinheim, Germany: Beltz.
Friedman, H.P., & Rubin, J. (1967). On some invariant criteria for grouping data. Journal of the American Statistical Association, 62, 1159–1178.
Fritzke, B. (1997). Some competitive learning methods. Unpublished manuscript [On-line draft document available at http://www.ki.inf.tu-dresden.de/fritzke/JavaPaper/t.html or http://www.neuroinformatik.ruhr-unibochum.de/ini/VDM/research/gsn/].
Fukunaga, K., & Koontz, W.L.G. (1970). A criterion and an algorithm for grouping data. IEEE Transactions on Computers, C-19, 917–923.
Gower, J.C. (1985). Measures of similarity, dissimilarity, and distance. In Kotz, S., & Johnson, N.L. (Eds.), Encyclopedia of Statistical Sciences, Vol. 5 (pp. 397–405). New York, NY: Wiley.
Green, P.E., Tull, D.S., & Albaum, G. (1988). Research for Marketing Decisions (5th ed.). Englewood Cliffs, NJ: Prentice-Hall.
Hall, D.J., Duda, R.O., Huffman, D.A., & Wolf, E.E. (1973). Development of new pattern recognition methods. Los Angeles, CA: Aerospace Research Laboratories.
Hartigan, J.A. (1975). Clustering algorithms. New York, NY: Wiley.
Hubalek, L. (1982). Coefficients of association and similarity, based on binary (presence-absence) data: An evaluation. Biological Reviews, 57, 669–689.
Hubert, L.J., & Levin, J.R. (1976). A general statistical framework for assessing categorical clustering in free recall. Psychological Bulletin, 83, 1072–1080.
Kaufmann, H., & Pape, H. (1996). Multivariate statistische Verfahren [Multivariate statistical methods] (2nd ed.). Berlin: Walter de Gruyter.
Li, X., & Dubes, R.C. (1989). A probabilistic measure of similarity for binary data in pattern recognition. Pattern Recognition, 22(4), 397–409.
Linde, Y., Buzo, A., & Gray, R.M. (1980). An algorithm for vector quantizer design. IEEE Transactions on Communications, COM-28(1), 84–95.
Marriot, F.H.C. (1971). Practical problems in a method of cluster analysis. Biometrics, 27, 501–514.
McCutcheon, A.L. (1987). Latent class analysis. Beverly Hills, CA: Sage Publications.
Milligan, G.W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325–342.
Milligan, G.W. (1981). A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika, 46, 187–199.
Milligan, G.W., & Cooper, M.C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.
Orloci, L. (1967). An agglomerative method of classification of plant communities. Journal of Ecology, 55, 193–206.
Ramaswamy, W., Chatterjee, R., & Cohen, S.H. (1996). Joint segmentation on distinct interdependent bases with categorical data. Journal of Marketing Research, 33, 337–350.
Ratkowsky, D.A., & Lance, G.N. (1978). A criterion for determining the number of groups in a classification. Australian Computer Journal, 10, 115–117.
Rost, J. (1996). Testtheorie, Testkonstruktion [Theory and construction of tests]. Bern: Verlag Hans Huber.
Sarle, W.S. (1983). Cubic clustering criterion. Research Triangle Park, NC: SAS Institute.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Scott, A.J., & Symons, M.J. (1971). Clustering methods based on likelihood ratio criteria. Biometrics, 27, 387–397.
Thorndike, R.L. (1953). Who belongs in the family? Psychometrika, 18, 267–276.
Wedel, M., & Kamakura, W.A. (1998). Marketing segmentation. Conceptual and methodological foundations (pp. 89–92). Boston/Dordrecht/London: Kluwer Academic.
Wolfe, J.H. (1970). Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Research, 5, 329–350.
Xu, L. (1997). Bayesian Ying-Yang machine, clustering and number of clusters. Pattern Recognition Letters, 18, 1167–1178.
Yang, M.-S., & Yu, K.F. (1990). On stochastic convergence theorems for the fuzzy c-means clustering procedure. International Journal of General Systems, 16, 397–411.