A Comparison of Heuristic Procedures for Minimum Within-Cluster Sums of Squares Partitioning

Michael J. Brusco; Douglas Steinley

doi:10.1007/s11336-007-9013-4

A Comparison of Heuristic Procedures for Minimum Within-Cluster Sums of Squares Partitioning

Published online by Cambridge University Press: 01 January 2025

Michael J. Brusco and

Douglas Steinley

Show author details

Michael J. Brusco*: Affiliation:
Florida State University
Douglas Steinley: Affiliation:
University of Missouri-Columbia
*: Requests for reprints should be sent to Michael J. Brusco, Department of Marketing, College of Business, Florida State University, Tallahassee, FL 32306-1110, USA. E-mail: mbrusco@cob.fsu.edu

Article contents

Abstract
References

Get access

Rights & Permissions

Abstract

Perhaps the most common criterion for partitioning a data set is the minimization of the within-cluster sums of squared deviation from cluster centroids. Although optimal solution procedures for within-cluster sums of squares (WCSS) partitioning are computationally feasible for small data sets, heuristic procedures are required for most practical applications in the behavioral sciences. We compared the performances of nine prominent heuristic procedures for WCSS partitioning across 324 simulated data sets representative of a broad spectrum of test conditions. Performance comparisons focused on both percentage deviation from the “best-found” WCSS values, as well as recovery of true cluster structure. A real-coded genetic algorithm and variable neighborhood search heuristic were the most effective methods; however, a straightforward two-stage heuristic algorithm, HK-means, also yielded exceptional performance. A follow-up experiment using 13 empirical data sets from the clustering literature generally supported the results of the experiment using simulated data. Our findings have important implications for behavioral science researchers, whose theoretical conclusions could be adversely affected by poor algorithmic performances.

Keywords

combinatorial data analysis cluster analysis heuristics sum of squares criterion

Information

Type: Theory and Methods
Information: Psychometrika , Volume 72 , Issue 4 , December 2007 , pp. 583 - 600

DOI: https://doi.org/10.1007/s11336-007-9013-4 [Opens in a new window]
Copyright: Copyright © 2007 The Psychometric Society

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Article purchase

Temporarily unavailable

References

Al-Sultan, K. (1995). A tabu search approach to the clustering problem. Pattern Recognition, 28, 1443–1451.CrossRef Google Scholar

Anderson, E. (1935). The irises of the Gaspé peninsula. Bulletin of the American Iris Society, 59, 2–5.Google Scholar

Arabie, P., Hubert, L. (1992). Combinatorial data analysis. Annual Review of Psychology, 43, 169–203.CrossRef Google Scholar

Arabie, P., Hubert, L.J. (1996). An overview of combinatorial data analysis. In Arabie, P., Hubert, L.J., De Soete, G. (Eds.), Clustering and classification (pp. 5–63). River Edge: World Scientific.CrossRef Google Scholar

Babu, G.P., Murty, M.N. (1993). A near optimal initial seed value selection in k-means algorithm using genetic algorithms. Pattern Recognition Letters, 14, 763–769.CrossRef Google Scholar

Babu, G.P., Murty, M.N. (1994). Simulated annealing for selecting optimal initial seeds in the K-means algorithm. Indian Journal of Pure and Applied Mathematics, 25, 85–94.Google Scholar

Banfield, C.F., Bassil, L.C. (1977). A transfer algorithm for nonhierarchical classification. Applied Statistics, 26, 206–210.CrossRef Google Scholar

Banfield, J.D., Raftery, A.E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–821.CrossRef Google Scholar

Belew, R.K., Booker, J.B. (1991). Proceedings of the fourth international conference on genetic algorithms, San Mateo: Morgan-Kaufmann.Google Scholar

Brucker, F. (1978). On the complexity of clustering problems. In Beckmann, M., Kunzi, H.P. (Eds.), Optimization and operations research (pp. 45–54). Heidelberg: Springer.CrossRef Google Scholar

Brusco, M.J. (2004). Clustering binary data in the presence of masking variables. Psychological Methods, 9, 510–523.CrossRef Google Scholar PubMed

Brusco, M.J. (2006). A repetitive branch-and-bound procedure for minimum within-cluster sums of squares partitioning. Psychometrika, 71, 347–363.CrossRef Google Scholar PubMed

Brusco, M.J., Cradit, J.D. (2001). A variable selection heuristic for k-means cluster analysis. Psychometrika, 66, 249–270.CrossRef Google Scholar

Brusco, M.J., Cradit, J.D., Tashchian, A. (2003). Multicriterion clusterwise regression for joint segmentation settings: An application to customer value. Journal of Marketing Research, 40, 225–234.CrossRef Google Scholar

Cerny, V. (1985). A thermodynamical approach to the traveling salesman problem. Journal of Optimization Theory and Applications, 45, 41–51.CrossRef Google Scholar

Day, W.H.E. (1996). Complexity theory: An introduction for practitioners of classification. In Arabie, P., Hubert, L.J., De Soete, G. (Eds.), Clustering and classification (pp. 199–233). River Edge: World Scientific.CrossRef Google Scholar

Diehr, G. (1985). Evaluation of a branch and bound algorithm for clustering. SIAM Journal for Scientific and Statistical Computing, 6, 268–284.CrossRef Google Scholar

Dimitriadou, E., Dolniĉar, S., Weingessel, A. (2002). An examination of indexes for determining the number of clusters in binary data sets. Psychometrika, 67, 137–160.CrossRef Google Scholar

du Merle, O., Hansen, P., Jaumard, B., Mladenoviĉ, N. (2000). An interior point algorithm for minimum sum-of-squares clustering. SIAM Journal on Scientific Computing, 21, 1485–1505.CrossRef Google Scholar

Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.CrossRef Google Scholar

Forgy, E.W. (1965). Cluster analyses of multivariate data: Efficiency versus interpretability of classifications. Biometrics, 21, 768.Google Scholar

Forrest, S. (1993). Proceedings of the fifth international conference on genetic algorithms, San Mateo: Morgan-Kaufmann.Google Scholar

Glover, F. (1989). Tabu search—Part I. ORSA Journal on Computing, 1, 190–206.CrossRef Google Scholar

Glover, F. (1990). Tabu search—Part II. ORSA Journal on Computing, 2, 4–32.CrossRef Google Scholar

Glover, F., Laguna, M. (1993). Tabu search. In Reeves, C. (Eds.), Modern heuristic techniques for combinatorial problems (pp. 70–141). Oxford: Blackwell.Google Scholar

Glover, F., Taillard, E., Werra, D. (1993). A user’s guide to tabu search. Annals of Operations Research, 41, 3–28.CrossRef Google Scholar

Goldberg, D.E. (1989). Genetic algorithms in search, optimization, and machine learning, New York: Addison-Wesley.Google Scholar

Grötschel, M., Holland, O. (1991). Solution of large-scale symmetric traveling salesman problems. Mathematical Programming, 51, 141–202.CrossRef Google Scholar

Hair, J.F., Anderson, R.E., Tatham, R.L., Black, W.C. (1998). Multivariate data analysis, (5th ed.). Saddle River: Prentice-Hall.Google Scholar

Hand, D.J. (1981). Discrimination and classification, New York: Wiley.Google Scholar

Hand, D.J., Krzanowski, W.J. (2005). Optimising k-means clustering results with standard software packages. Computational Statistics and Data Analysis, 49, 969–973.CrossRef Google Scholar

Hansen, P., Mladenoviĉ, N. (2001). J-Means: A new local search heuristic for minimum sum of squares clustering. Pattern Recognition, 34, 405–413.CrossRef Google Scholar

Hartigan, J.A. (1975). Clustering algorithms, New York: Wiley.Google Scholar

Hartigan, J.A., Wong, M.A. (1979). Algorithm AS136: A k-means clustering program. Applied Statistics, 28, 100–128.CrossRef Google Scholar

Heinz, G., Peterson, L.J., Johnson, R.W., & Kerk, C.J. (2003). Exploring relationships in body dimensions. Journal of Statistics Education, 11, www.amstat.org/publications/jse/v11n2/datasets.heinz.html.Google Scholar

Holland, J.H. (1975). Adaptation in natural and artificial systems, Ann Arbor: University of Michigan Press.Google Scholar

Hubert, L., Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.CrossRef Google Scholar

Hubert, L., Arabie, P., Meulman, J. (2001). Combinatorial data analysis: Optimization by dynamic programming, Philadelphia: Society Industrial and Applied Mathematics.CrossRef Google Scholar

Jancey, R.C. (1966). Multidimensional group analysis. Australian Journal of Botany, 14, 127–130.CrossRef Google Scholar

Jones, D.R., Beltramo, M.A. (1991). Solving partitioning problems with genetic algorithms. In Belew, R.K., Booker, J.B. (Eds.), Proceedings of the fourth international conference on genetic algorithms, San Mateo: Morgan Kaufmann.Google Scholar

Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P. (1983). Optimization by simulated annealing. Science, 220, 671–680.CrossRef Google Scholar PubMed

Klein, R.W., Dubes, R.C. (1989). Experiments in projection and clustering by simulated annealing. Pattern Recognition, 22, 213–220.CrossRef Google Scholar

Koontz, W.L.G., Narendra, P.M., Fukunaga, K. (1975). A branch and bound clustering algorithm. IEEE Transaction on Computers, C-24, 908–915.CrossRef Google Scholar

Krishna, K., Murty, N.M. (1999). Genetic K-means algorithm. IEEE Transactions on Systems, Man, & Cybernetics—Part B: Cybernetics, 29, 433–439.CrossRef Google Scholar PubMed

MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate observations. In Le Cam, L.M. & Neyman, J. (Eds.), Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, pp. 281–297). Berkeley: University of California Press.Google Scholar

Maronna, R., Jacovkis, P.M. (1974). Multivariate clustering procedures with variable metrics. Biometrics, 30, 499–505.CrossRef Google Scholar

MathWorks, Inc. (2002). Using MATLAB (version 6), Natick: The MathWorks, Inc.Google Scholar

Maulik, U., Bandyopadhyay, S. (2000). Genetic algorithm-based clustering technique. Pattern Recognition, 33, 1455–1465.CrossRef Google Scholar

Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A., Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1092.CrossRef Google Scholar

Milligan, G.W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325–342.CrossRef Google Scholar

Milligan, G.W. (1980). The validation of four ultrametric clustering algorithms. Pattern Recognition, 12, 41–50.CrossRef Google Scholar

Milligan, G.W. (1985). An algorithm for generating artificial test clusters. Psychometrika, 50, 123–127.CrossRef Google Scholar

Milligan, G.W. (1989). A validation study of a variable-weighting algorithm for cluster analysis. Journal of Classification, 6, 53–71.CrossRef Google Scholar

Milligan, G.W., Cooper, M.C. (1986). A study of the comparability of external criteria for hierarchical cluster analysis. Multivariate Behavioral Research, 21, 441–458.CrossRef Google Scholar PubMed

Milligan, G.W., Cooper, M.C. (1988). A study of variable standardization. Journal of Classification, 5, 181–204.CrossRef Google Scholar

Pacheco, J., Valencia, O. (2003). Design of hybrids for the minimum sum-of-squares clustering problem. Computational Statistics and Data Analysis, 43, 235–248.CrossRef Google Scholar

Pal, S.K., Majumder, D.D. (1977). Fuzzy sets and decision making approaches in vowel and speaker recognition. IEEE Transactions on Systems, Man, and Cybernetics, 7, 625–629.Google Scholar

Reinelt, G. (2001). TSPLIB. http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95.Google Scholar

Scott, A.J., Symons, M.J. (1971). Clustering methods based on likelihood ratio criteria. Biometrics, 27, 387–398.CrossRef Google Scholar

Selim, S.Z., Al-Sultan, K. (1991). A simulated annealing algorithm for the clustering problem. Pattern Recognition, 24, 1003–1008.CrossRef Google Scholar

Späth, H. (1980). Cluster analysis algorithms for data reduction and classification of objects, New York: Wiley.Google Scholar

Steinley, D. (2003). Local optima in K-means clustering: What you don’t know may hurt you. Psychological Methods, 8, 294–304.CrossRef Google Scholar PubMed

Steinley, D. (2004). Properties of the Hubert–Arabie adjusted Rand index. Psychological Methods, 9, 386–396.CrossRef Google Scholar PubMed

Steinley, D. (2006). K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology, 59, 1–34.CrossRef Google Scholar PubMed

Steinley, D. (2006). Profiling local optima in K-means clustering: Developing a diagnostic technique. Psychological Methods, 11, 178–192.CrossRef Google Scholar PubMed

Steinley, D., Brusco, M. (2007). Initializing K-means batch clustering: A critical analysis of several techniques. Journal of Classification, 24, 99–121.CrossRef Google Scholar

Steinley, D., Henson, R. (2005). OCLUS: An algorithmic method for generating clusters with known overlap. Journal of Classification, 22, 221–250.CrossRef Google Scholar

Sun, L.-X., Xie, Y.-L., Song, X.-H., Wang, J.-H., Yu, R.-Q. (1994). Cluster analysis by simulated annealing. Computers in Chemistry, 18, 103–108.CrossRef Google Scholar

Sun, L.-X., Xu, F., Liang, Y.-Z., Xie, Y.-L., Yu, R.-Q. (1994). Cluster analysis by the K-means algorithm and simulated annealing. Chemometrics and Intelligent Laboratory Systems, 25, 51–60.CrossRef Google Scholar

Sung, C.S., Jin, H.W. (2000). A tabu-search-based heuristic for clustering. Pattern Recognition, 33, 849–858.CrossRef Google Scholar

van Os, B.J., Meulman, J.J. (2004). Improving dynamic programming strategies for partitioning. Journal of Classification, 21, 207–230.CrossRef Google Scholar

Article contents

A Comparison of Heuristic Procedures for Minimum Within-Cluster Sums of Squares Partitioning

Abstract

Keywords

Information

Access options

Article purchase

Temporarily unavailable

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests