A Model-Based Approach to Simultaneous Clustering and Dimensional Reduction of Ordinal Data

Published online by Cambridge University Press:  01 January 2025

Monia Ranalli*
Affiliation: The Pennsylvania State University
Roberto Rocci
Affiliation: University of Tor Vergata
* Correspondence should be made to Monia Ranalli, Department of Statistics, The Pennsylvania State University, State College, PA, USA. Email: mxr459@psu.edu

Abstract

The literature on clustering continuous data is rich and wide; by contrast, the literature on clustering categorical data is still limited. In some cases, the clustering problem is made more difficult by noise variables or dimensions that carry no information about the clustering structure and could mask it. This paper proposes a model for simultaneous clustering and dimensionality reduction of ordered categorical data that detects the discriminative dimensions and discards the noise ones. Following the underlying response variable approach, the observed variables are treated as a discretization of underlying first-order latent continuous variables distributed as a Gaussian mixture. To distinguish discriminative from noise dimensions, these variables are taken to be linear combinations of two independent sets of second-order latent variables, of which only one carries information about the cluster structure while the other contains the noise dimensions. The model specification involves multidimensional integrals that make maximum likelihood estimation cumbersome and in some cases infeasible. To overcome this issue, parameters are estimated through an EM-like algorithm that maximizes a composite log-likelihood based on low-dimensional margins. Applications to real and simulated data illustrate the effectiveness of the proposal.
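
The underlying response variable idea described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation: it assumes unit-variance, uncorrelated latent variables, a single shared set of thresholds, and an identity mapping from second-order to first-order latent variables, and the names `discretize` and `simulate_ordinal` are hypothetical. It only shows how ordinal categories can arise from thresholding a latent Gaussian mixture, with extra noise dimensions that are identically distributed in every cluster.

```python
import bisect
import random


def discretize(latent_value, thresholds):
    """Map a continuous latent value to an ordinal category.

    With sorted thresholds t_0 < ... < t_{k-1}, the result is an integer
    in 0..k: the number of thresholds less than or equal to the value.
    """
    return bisect.bisect_right(thresholds, latent_value)


def simulate_ordinal(n, weights, means, thresholds, noise_dims=1, seed=0):
    """Draw n ordinal observations from a thresholded Gaussian mixture.

    Assumptions (for illustration only): unit variances, no correlations,
    and the same thresholds for every variable.
    """
    rng = random.Random(seed)
    labels, data = [], []
    for _ in range(n):
        # Pick the cluster according to the mixing weights.
        g = rng.choices(range(len(weights)), weights=weights)[0]
        # Discriminative dimensions: the mean depends on the cluster.
        latent = [rng.gauss(mu, 1.0) for mu in means[g]]
        # Noise dimensions: same distribution in every cluster.
        latent += [rng.gauss(0.0, 1.0) for _ in range(noise_dims)]
        # Observed data: each latent value is cut into an ordinal category.
        data.append([discretize(v, thresholds) for v in latent])
        labels.append(g)
    return labels, data


if __name__ == "__main__":
    labels, data = simulate_ordinal(
        n=5,
        weights=[0.5, 0.5],
        means=[[-2.0, -2.0], [2.0, 2.0]],
        thresholds=[-1.0, 0.0, 1.0],
    )
    for g, row in zip(labels, data):
        print(g, row)
```

In this toy setup the first two columns carry the cluster signal and the last column is pure noise, which is the kind of structure the proposed model is designed to separate.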

Type
Original Paper
Copyright
Copyright © 2017 The Psychometric Society
