
Variance-Based Cluster Selection Criteria in a K-Means Framework for One-Mode Dissimilarity Data

Published online by Cambridge University Press: 01 January 2025

J. Fernando Vera*
Affiliation:
University of Granada
Rodrigo Macías
Affiliation:
Centro de Investigación en Matemáticas, Unidad Monterrey
Correspondence should be made to J. Fernando Vera, Department of Statistics and O.R., Faculty of Sciences, University of Granada, 18071 Granada, Spain. Email: jfvera@ugr.es

Abstract

One of the main problems in cluster analysis is determining the number of groups in the data. In general, the approach taken depends on the clustering method used. For K-means, some of the most widely employed criteria are formulated in terms of the decomposition of the total point scatter for a two-mode data set of N points in p dimensions, optimally arranged into K classes. This paper addresses the formulation of criteria to determine the number of clusters in the general situation in which the available information for clustering is a one-mode N×N dissimilarity matrix describing the objects. In this framework, p and the coordinates of the points are usually unknown, and the application of criteria originally formulated for two-mode data sets depends on their possible reformulation for the one-mode situation. The decomposition of the variability of the clustered objects is proposed in terms of the corresponding block-shaped partition of the dissimilarity matrix. Within-block and between-block dispersion values for the partitioned dissimilarity matrix are derived, and variance-based criteria are subsequently formulated to determine the number of groups in the data. A Monte Carlo experiment was carried out to study the performance of the proposed criteria. For simulated clustered points in p dimensions, the criteria recover the number of clusters more efficiently when calculated from the associated Euclidean distances than from the known two-mode data set, in general for unequal-sized clusters and in low-dimensionality situations. For simulated dissimilarity data sets, the proposed criteria always outperform the results obtained when these criteria are calculated from their original formulation, using dissimilarities instead of distances.
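
As a point of reference for why such a reformulation is possible, the following is a minimal sketch, not taken from the paper itself: when the dissimilarities d_ij are Euclidean distances between the N points, the classical within-group sum of squares admits an exact pairwise expression, so scatter-based criteria can be computed without knowing p or the coordinates. Writing C_k for the k-th cluster and n_k for its size,

$$W_K=\sum_{k=1}^{K}\sum_{i\in C_k}\Vert x_i-\bar{x}_k\Vert^2=\sum_{k=1}^{K}\frac{1}{2n_k}\sum_{i\in C_k}\sum_{j\in C_k}d_{ij}^{2},\qquad B_K=T-W_K,\qquad T=\frac{1}{2N}\sum_{i=1}^{N}\sum_{j=1}^{N}d_{ij}^{2}.$$

A variance-based rule such as the Caliński–Harabasz index then selects the K maximizing $\bigl(B_K/(K-1)\bigr)/\bigl(W_K/(N-K)\bigr)$. For general (non-Euclidean) dissimilarities the same block-wise quantities can still be computed from the partitioned dissimilarity matrix, which is the setting studied in the paper.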

Type
Original Paper
Copyright
Copyright © 2017 The Psychometric Society

