
Variance-Based Cluster Selection Criteria in a K-Means Framework for One-Mode Dissimilarity Data

Published online by Cambridge University Press: 01 January 2025

J. Fernando Vera*
Affiliation:
University of Granada
Rodrigo Macías
Affiliation:
Centro de Investigación en Matemáticas, Unidad Monterrey
Correspondence should be made to J. Fernando Vera, Department of Statistics and O.R., Faculty of Sciences, University of Granada, 18071 Granada, Spain. Email: jfvera@ugr.es

Abstract

One of the main problems in cluster analysis is determining the number of groups in the data. In general, the approach taken depends on the clustering method used. For K-means, some of the most widely employed criteria are formulated in terms of the decomposition of the total point scatter for a two-mode data set of N points in p dimensions, optimally arranged into K classes. This paper addresses the formulation of criteria to determine the number of clusters in the general situation in which the available information for clustering is a one-mode N×N dissimilarity matrix describing the objects. In this framework, p and the coordinates of the points are usually unknown, and the application of criteria originally formulated for two-mode data sets depends on their possible reformulation for the one-mode situation. The decomposition of the variability of the clustered objects is proposed in terms of the corresponding block-shaped partition of the dissimilarity matrix. Within-block and between-block dispersion values for the partitioned dissimilarity matrix are derived, and variance-based criteria are subsequently formulated to determine the number of groups in the data. A Monte Carlo experiment was carried out to study the performance of the proposed criteria. For simulated clustered points in p dimensions, the criteria recover the number of clusters more efficiently when calculated from the associated Euclidean distances than from the known two-mode data set, in general for unequal-sized clusters and in low-dimensionality situations. For simulated dissimilarity data sets, the proposed criteria always outperform the results obtained when these criteria are calculated from their original formulation, using dissimilarities instead of distances.
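
As a point of reference for why such a reformulation is possible, the following is a minimal sketch, not taken from the paper itself: when the dissimilarities d_ij are Euclidean distances between the N points, the classical within-group sum of squares admits an exact pairwise expression, so scatter-based criteria can be computed without knowing p or the coordinates. Writing C_k for the k-th cluster and n_k for its size,

$$W_K=\sum_{k=1}^{K}\sum_{i\in C_k}\Vert x_i-\bar{x}_k\Vert^2=\sum_{k=1}^{K}\frac{1}{2n_k}\sum_{i\in C_k}\sum_{j\in C_k}d_{ij}^{2},\qquad B_K=T-W_K,\qquad T=\frac{1}{2N}\sum_{i=1}^{N}\sum_{j=1}^{N}d_{ij}^{2}.$$

A variance-based rule such as the Caliński–Harabasz index then selects the K maximizing $\bigl(B_K/(K-1)\bigr)/\bigl(W_K/(N-K)\bigr)$. For general (non-Euclidean) dissimilarities the same block-wise quantities can still be computed from the partitioned dissimilarity matrix, which is the setting studied in the paper.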

Type
Original Paper
Copyright
Copyright © 2017 The Psychometric Society

