An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets

Evgenia Dimitriadou; Sara Dolničar; Andreas Weingessel

doi:10.1007/BF02294713

Published online by Cambridge University Press: 01 January 2025

Evgenia Dimitriadou: Institut für Statistik und Wahrscheinlichkeitstheorie, Technische Universität Wien
Sara Dolničar: Institut für Tourismus und Freizeitwirtschaft, Wirtschaftsuniversität Wien
Andreas Weingessel*: Institut für Statistik und Wahrscheinlichkeitstheorie, Technische Universität Wien
*: Requests for reprints should be sent to A. Weingessel, Institut für Statistik, Technische Universität Wien, Wiedner Hauptstraße 8-10/1071, A-1040 Wien, AUSTRIA.

Abstract

The problem of choosing the correct number of clusters is as old as cluster analysis itself. A number of authors have suggested various indexes to facilitate this crucial decision. One of the most extensive comparative studies of indexes was conducted by Milligan and Cooper (1985). The present piece of work pursues the same goal under different conditions. In contrast to Milligan and Cooper's work, the emphasis here is on high-dimensional empirical binary data. Binary artificial data sets are constructed to reflect features typically encountered in real-world data situations in the field of marketing research. The simulation includes 162 binary data sets that are clustered by two different algorithms and lead to recommendations on the number of clusters for each index under consideration. Index results are evaluated and their performance is compared and analyzed.

Keywords

number of clusters; clustering indexes; binary data; artificial data sets; market segmentation

Information

Type: Articles
Information: Psychometrika, Volume 67, Issue 1, March 2002, pp. 137–159

DOI: https://doi.org/10.1007/BF02294713
Copyright: © 2002 The Psychometric Society

Footnotes

Author names are listed in alphabetical order.

This piece of research was supported by the Austrian Science Foundation (FWF) under grant SFB#010 (“Adaptive Information Systems and Modeling in Economics and Management Science”).

The authors would like to thank the anonymous reviewers and especially the associate editor for their helpful comments and suggestions.

References

Aldenderfer, M.S., & Blashfield, R.K. (1996). Cluster analysis. London, U.K.: Sage Publications.

Andrews, D.F. (1972). Plots of high-dimensional data. Biometrics, 28, 125–136.

Arabie, P., & Hubert, L.J. (1996). Clustering and classification (pp. 5–63). River Edge, NJ: World Scientific.

Arratia, R., & Lander, E.S. (1990). The distribution of clusters in random graphs. Advances in Applied Mathematics, 11, 36–48.

Baker, F.B., & Hubert, L.J. (1975). Measuring the power of hierarchical cluster analysis. Journal of the American Statistical Association, 70, 31–38.

Ball, G.H., & Hall, D.J. (1965). ISODATA, a novel method of data analysis and pattern classification. Menlo Park, CA: Stanford Research Institute.

Baroni-Urbani, C., & Buser, M.W. (1976). Similarity of binary data. Systematic Zoology, 25, 251–259.

Baulieu, F. (1989). A classification of presence/absence based dissimilarity coefficients. Journal of Classification, 6, 233–246.

Calinski, R.B., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3, 1–27.

Cheetham, H., & Hazel, J. (1969). Binary (presence-absence) similarity coefficients. Journal of Paleontology, 43, 1130–1136.

Cox, D. (1970). The analysis of binary data. London, U.K.: Chapman and Hall.

Davies, D.L., & Bouldin, D.W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1, 224–227.

Dolnicar, S., Grabler, K., & Mazanec, J. (2000). A tale of three cities: Perceptual charting for analysing destination images. In Woodside, A. (Ed.), Consumer psychology of tourism, hospitality and leisure (pp. 39–62). London, U.K.: CAB International.

Dolnicar, S., Leisch, F., Weingessel, A., Buchta, C., & Dimitriadou, E. (1998). A comparison of several cluster algorithms on artificial binary data scenarios from tourism marketing. Wien, Austria: Adaptive Information Systems.

Edwards, A.W.F., & Cavalli-Sforza, L. (1965). A method for cluster analysis. Biometrics, 21, 362–375.

Formann, A.K. (1984). Die Latent-Class-Analyse: Einführung in die Theorie und Anwendung [Latent class analysis: Introduction into theory and application]. Weinheim, Germany: Beltz.

Friedman, H.P., & Rubin, J. (1967). On some invariant criteria for grouping data. Journal of the American Statistical Association, 62, 1159–1178.

Fritzke, B. (1997). Some competitive learning methods. Unpublished manuscript [On-line draft document available at http://www.ki.inf.tu-dresden.de/fritzke/JavaPaper/t.html or http://www.neuroinformatik.ruhr-unibochum.de/ini/VDM/research/gsn/].

Fukunaga, K., & Koontz, W.L.G. (1970). A criterion and an algorithm for grouping data. IEEE Transactions on Computers, C-19, 917–923.

Gower, J.C. (1985). Measures of similarity, dissimilarity, and distance. In Kotz, S., & Johnson, N.L. (Eds.), Encyclopedia of Statistical Sciences, Vol. 5 (pp. 397–405). New York, NY: Wiley.

Green, P.E., Tull, D.S., & Albaum, G. (1988). Research for Marketing Decisions (5th ed.). Englewood Cliffs, NJ: Prentice-Hall.

Hall, D.J., Duda, R.O., Huffman, D.A., & Wolf, E.E. (1973). Development of new pattern recognition methods. Los Angeles, CA: Aerospace Research Laboratories.

Hartigan, J.A. (1975). Clustering algorithms. New York, NY: Wiley.

Hubalek, L. (1982). Coefficients of association and similarity, based on binary (presence-absence) data: An evaluation. Biological Reviews, 57, 669–689.

Hubert, L.J., & Levin, J.R. (1976). A general statistical framework for assessing categorical clustering in free recall. Psychological Bulletin, 83, 1072–1080.

Kaufmann, H., & Pape, H. (1996). Multivariate statistische Verfahren [Multivariate statistical methods] (2nd ed.). Berlin: Walter de Gruyter.

Li, X., & Dubes, R.C. (1989). A probabilistic measure of similarity for binary data in pattern recognition. Pattern Recognition, 22(4), 397–409.

Linde, Y., Buzo, A., & Gray, R.M. (1980). An algorithm for vector quantizer design. IEEE Transactions on Communications, COM-28(1), 84–95.

Marriott, F.H.C. (1971). Practical problems in a method of cluster analysis. Biometrics, 27, 501–514.

McCutcheon, A.L. (1987). Latent class analysis. Beverly Hills, CA: Sage Publications.

Milligan, G.W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325–342.

Milligan, G.W. (1981). A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika, 46, 187–199.

Milligan, G.W., & Cooper, M.C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.

Orloci, L. (1967). An agglomerative method of classification of plant communities. Journal of Ecology, 55, 193–206.

Ramaswamy, W., Chatterjee, R., & Cohen, S.H. (1996). Joint segmentation on distinct interdependent bases with categorical data. Journal of Marketing Research, 33, 337–350.

Ratkowsky, D.A., & Lance, G.N. (1978). A criterion for determining the number of groups in a classification. Australian Computer Journal, 10, 115–117.

Rost, J. (1996). Testtheorie, Testkonstruktion [Theory and construction of tests]. Bern: Verlag Hans Huber.

Sarle, W.S. (1983). Cubic clustering criterion. Research Triangle Park, NC: SAS Institute.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.

Scott, A.J., & Symons, M.J. (1971). Clustering methods based on likelihood ratio criteria. Biometrics, 27, 387–397.

Thorndike, R.L. (1953). Who belongs in the family? Psychometrika, 18, 267–276.

Wedel, M., & Kamakura, W.A. (1998). Market segmentation: Conceptual and methodological foundations (pp. 89–92). Boston/Dordrecht/London: Kluwer Academic.

Wolfe, J.H. (1970). Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Research, 5, 329–350.

Xu, L. (1997). Bayesian Ying-Yang machine, clustering and number of clusters. Pattern Recognition Letters, 18, 1167–1178.

Yang, M.-S., & Yu, K.F. (1990). On stochastic convergence theorems for the fuzzy c-means clustering procedure. International Journal of General Systems, 16, 397–411.
