
A Variable-Selection Heuristic for K-means Clustering

Published online by Cambridge University Press:  01 January 2025

Michael J. Brusco*
Affiliation:
Florida State University
J. Dennis Cradit
Affiliation:
Florida State University
*
Requests for reprints should be sent to Michael J. Brusco, Marketing Department, College of Business, Florida State University, Tallahassee, FL 32306-1110, E-Mail: mbrusco@cob.fsu.edu

Abstract

One of the most vexing problems in cluster analysis is the selection and/or weighting of variables in order to include those that truly define cluster structure, while eliminating those that might mask such structure. This paper presents a variable-selection heuristic for nonhierarchical (K-means) cluster analysis based on the adjusted Rand index for measuring cluster recovery. The heuristic was subjected to Monte Carlo testing across more than 2200 datasets with known cluster structure. The results indicate the heuristic is extremely effective at eliminating masking variables. A cluster analysis of real-world financial services data revealed that using the variable-selection heuristic prior to the K-means algorithm resulted in greater cluster stability.
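To make the abstract's idea concrete, the sketch below (Python with scikit-learn) shows one way single-variable K-means partitions can be compared with the adjusted Rand index (Hubert & Arabie, 1985) and variables retained greedily when their partitions agree on a common cluster structure. The selection rule, threshold, and function names are illustrative assumptions, not the authors' published heuristic.

```python
# Illustrative sketch only: a greedy variable-selection loop in the spirit of
# the paper's approach. The seeding rule, the agreement threshold, and all
# names here are assumptions for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


def kmeans_labels(X, k, seed=0):
    """Partition the rows of X into k clusters and return the labels."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)


def select_variables(X, k, threshold=0.5):
    """Greedy sketch: seed with the pair of variables whose single-variable
    partitions agree most (highest adjusted Rand index), then add any variable
    whose single-variable partition agrees with the current partition above
    an assumed cutoff `threshold`."""
    n_vars = X.shape[1]
    single = [kmeans_labels(X[:, [j]], k) for j in range(n_vars)]

    # Seed the selected set with the most mutually consistent pair of variables.
    best_pair, best_ari = (0, 1), -np.inf
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            ari = adjusted_rand_score(single[i], single[j])
            if ari > best_ari:
                best_pair, best_ari = (i, j), ari
    selected = list(best_pair)

    # Add remaining variables whose partitions agree with the current solution;
    # variables that never agree (potential masking variables) are left out.
    for j in range(n_vars):
        if j in selected:
            continue
        current = kmeans_labels(X[:, selected], k)
        if adjusted_rand_score(single[j], current) >= threshold:
            selected.append(j)
    return sorted(selected), kmeans_labels(X[:, sorted(selected)], k)
```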

Type
Articles
Copyright
Copyright © 2001 The Psychometric Society


Footnotes

We gratefully acknowledge the constructive comments of three anonymous reviewers, the Associate Editor, and Editor, which led to considerable improvements in this article. We note that our variable-selection heuristic evolved during the review process. This evolution was attributable to a variety of factors including: (a) the publication of the HINoV procedure (Carmone et al., 1999), (b) a thoughtful comment from an anonymous reviewer regarding correlated masking variables, and (c) a helpful suggestion from the Associate Editor concerning multiple true cluster structures in a single dataset.

References

Anderberg, M.R. (1973). Cluster analysis for applications. New York, NY: Academic Press.
Arabie, P., Hubert, L.J. (1994). Cluster analysis in marketing research. In Bagozzi, R.P. (Ed.), Advanced methods in marketing research (pp. 160–189). Oxford, England: Blackwell.
Arabie, P., Hubert, L.J. (1996). An overview of combinatorial data analysis. In Arabie, P., Hubert, L.J., De Soete, G. (Eds.), Clustering and classification (pp. 5–63). River Edge, NJ: World Scientific Publishing.
Art, D., Gnanadesikan, R., Kettenring, J.R. (1982). Data-based metrics for cluster analysis. Utilitas Mathematica, Series A, 21, 75–99.
Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., Lewis, P.A. (1994). A study of the classification capabilities of neural networks using unsupervised learning: A comparison with K-means clustering. Psychometrika, 59, 509–525.
Balasubramanian, S., Gupta, S., Kamakura, W., Wedel, M. (1998). Modelling large data sets in marketing. Statistica Neerlandica, 52, 303–323.
Berry, M.J.A., Linoff, G. (1997). Data mining techniques: For marketing, sales, and customer support. New York, NY: John Wiley & Sons.
Blattberg, R., Glazer, R., Little, J. (1994). The marketing information revolution. Boston, MA: Harvard Business School Press.
Box, G.E.P., Muller, M.E. (1958). A note on the generation of random normal deviates. Annals of Mathematical Statistics, 29, 610–611.
Breckenridge, J.N. (1989). Replicating cluster analysis: Method, consistency, and validity. Multivariate Behavioral Research, 24, 147–161.
Carmone, F.J., Kara, A., Maxwell, S. (1999). HINoV: A new model to improve market segmentation by identifying noisy variables. Journal of Marketing Research, 36, 501–509.
Chaturvedi, A., Carroll, J.D., Green, P.E., Rotondo, J.A. (1997). A feature-based approach to market segmentation via overlapping K-centroids clustering. Journal of Marketing Research, 34, 370–377.
Cheng, R., Milligan, G.W. (1996). K-means clustering methods with influence detection. Educational and Psychological Measurement, 56, 833–838.
Cormack, R.M. (1971). A review of classification (with discussion). Journal of the Royal Statistical Society, Series A, 134, 321–367.
DeSarbo, W.S., Carroll, J.D., Clark, L.A., Green, P.E. (1984). Synthesized clustering: A method for amalgamating alternative clustering bases with different weighting of variables. Psychometrika, 49, 57–78.
DeSarbo, W.S., Manrai, A.K., Manrai, L.A. (1993). Non-spatial tree models for the assessment of competitive market structure: An integrated review of the marketing and psychometric literature. In Eliashberg, J., Lilien, G. (Eds.), Handbook in operations research and management science: Marketing (pp. 193–257). New York, NY: Elsevier.
De Soete, G. (1986). Optimal variable weighting for ultrametric and additive tree clustering. Quality and Quantity, 20, 169–180.
De Soete, G. (1988). OVWTRE: A program for optimal variable weighting for ultrametric and additive tree fitting. Journal of Classification, 5, 101–104.
De Soete, G., DeSarbo, W.S., Carroll, J.D. (1985). Optimal variable weighting for hierarchical clustering: An alternating least-squares algorithm. Journal of Classification, 2, 173–192.
Fowlkes, E.B., Gnanadesikan, R., Kettenring, J.R. (1987). Variable selection in clustering and other contexts. In Mallows, C.L. (Ed.), Design, data, and analysis (pp. 13–34). New York, NY: John Wiley & Sons.
Fowlkes, E.B., Gnanadesikan, R., Kettenring, J.R. (1988). Variable selection in clustering. Journal of Classification, 5, 205–228.
Fowlkes, E.B., Mallows, C.L. (1983). A method for comparing two hierarchical clusterings (with comments and rejoinder). Journal of the American Statistical Association, 78, 553–584.
Friedman, H.P., Rubin, J. (1967). On some invariant criteria for grouping data. Journal of the American Statistical Association, 62, 1159–1178.
Gnanadesikan, R., Kettenring, J.R., Tsao, S.L. (1995). Weighting and selection of variables for cluster analysis. Journal of Classification, 12, 113–136.
Green, P.E., Carmone, F.J., Kim, J. (1990). A preliminary study of optimal variable weighting in K-means clustering. Journal of Classification, 7, 271–285.
Helsen, K., Green, P.E. (1991). A computational study of replicated clustering with an application to market segmentation. Decision Sciences, 22, 1124–1141.
Hubert, L., Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.
Knuth, D.E. (1997). The art of computer programming: Vol. 1. Fundamental algorithms. Reading, MA: Addison-Wesley.
Krieger, A., Green, P.E. (1999). A generalized Rand-index method for consensus clustering of separate partitions of the same data base. Journal of Classification, 16, 63–89.
Kruskal, J.B. (1972). Linear transformations of multivariate data to reveal clustering. In Shepard, R.N., Romney, A.K., Nerlove, S.B. (Eds.), Multidimensional scaling: Theory and applications in the behavioral sciences (pp. 181–191). New York, NY: Seminar Press.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281–297.
McIntyre, R.M., Blashfield, R.K. (1980). A nearest-centroid technique for evaluating the minimum variance clustering procedure. Multivariate Behavioral Research, 15, 225–238.
Milligan, G.W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325–342.
Milligan, G.W. (1985). An algorithm for generating artificial test clusters. Psychometrika, 50, 123–127.
Milligan, G.W. (1989). A validation study of a variable-weighting algorithm for cluster analysis. Journal of Classification, 6, 53–71.
Milligan, G.W. (1996). Clustering validation: Results and implications for applied analyses. In Arabie, P., Hubert, L.J., De Soete, G. (Eds.), Clustering and classification (pp. 341–375). River Edge, NJ: World Scientific Publishing.
Milligan, G.W., Cooper, M.C. (1986). A study of the comparability of external criteria for hierarchical cluster analysis. Multivariate Behavioral Research, 21, 441–458.
Milligan, G.W., Cooper, M.C. (1988). A study of the standardization of variables in cluster analysis. Journal of Classification, 5, 181–204.
Milligan, G.W., Soon, S.C., Sokol, L.M. (1983). The effect of cluster size, dimensionality, and the number of clusters on the recovery of true cluster structure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, 40–47.
Morey, L.C., Blashfield, R.K., Skinner, H.A. (1983). A comparison of cluster analysis techniques within a sequential validation framework. Multivariate Behavioral Research, 18, 309–329.
Rand, W.M. (1971). Objective criteria for evaluating clustering methods. Journal of the American Statistical Association, 66, 846–850.
Rohlf, F.J. (1970). Adaptive hierarchical clustering schemes. Systematic Zoology, 19, 58–82.
Salstone, R., Stange, K. (1996). A computer program to calculate Hubert and Arabie's adjusted Rand index. Journal of Classification, 13, 169–172.
Ward, J.H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236–244.
Waller, N.G., Kaiser, H.A., Illian, J.B., Manry, M. (1998). A comparison of the classification capabilities of the 1-dimensional Kohonen neural network with two partitioning and three hierarchical cluster analysis algorithms. Psychometrika, 63, 5–22.
Wedel, M., Kamakura, W.A. (1997). Market segmentation: Conceptual and methodological foundations. Boston, MA: Kluwer Academic Publishers.