A Model-Based Approach to Simultaneous Clustering and Dimensional Reduction of Ordinal Data

Monia Ranalli; Roberto Rocci

doi:10.1007/s11336-017-9578-5


Published online by Cambridge University Press: 01 January 2025


Monia Ranalli*: The Pennsylvania State University
Roberto Rocci: University of Tor Vergata
*: Correspondence should be made to Monia Ranalli, Department of Statistics, The Pennsylvania State University, State College, PA, USA. Email: mxr459@psu.edu


Abstract

The literature on clustering continuous data is rich, whereas that on clustering categorical data is still limited. In some cases, the clustering problem is made more difficult by noise variables or dimensions that carry no information about the cluster structure and may mask it. This paper proposes a model for simultaneous clustering and dimensionality reduction of ordered categorical data that detects the discriminative dimensions and discards the noise ones. Following the underlying response variable approach, the observed variables are treated as a discretization of underlying first-order latent continuous variables distributed as a Gaussian mixture. To separate discriminative from noise dimensions, these latent variables are taken to be linear combinations of two independent sets of second-order latent variables: one set contains the information about the cluster structure, while the other contains only noise dimensions. The model specification involves multidimensional integrals that make maximum likelihood estimation cumbersome and, in some cases, infeasible. To overcome this issue, parameters are estimated through an EM-like algorithm that maximizes a composite log-likelihood based on low-dimensional margins. Applications to real and simulated data show the effectiveness of the proposal.
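
As a rough illustration of the data-generating idea described in the abstract, the following minimal sketch simulates ordinal data by discretizing latent continuous variables built from a discriminative set and a noise set of second-order latents. It is not the authors' implementation and does not perform any estimation. The function name `simulate_ordinal`, the unit variances, the shared thresholds, and all numeric values are assumptions chosen only for the example.

```python
import random
from bisect import bisect_right


def simulate_ordinal(n, weights, means, loadings, thresholds, seed=0):
    """Draw n ordinal observations from a toy underlying-response model.

    All arguments are hypothetical illustration values, not quantities
    taken from the paper:
      weights    : mixing proportions of the G components
      means      : G mean vectors for the Q discriminative latents
      loadings   : P x P matrix mapping [discriminative, noise] latents
                   to the P first-order latent variables
      thresholds : sorted cut points shared by all variables (assumption)
    """
    rng = random.Random(seed)
    p = len(loadings)
    q = len(means[0])
    data, labels = [], []
    for _ in range(n):
        # Pick a mixture component according to the mixing proportions.
        g = rng.choices(range(len(weights)), weights=weights)[0]
        # Discriminative latents: component-specific mean, unit variance.
        disc = [rng.gauss(mu, 1.0) for mu in means[g]]
        # Noise latents: the same standard normal in every component.
        noise = [rng.gauss(0.0, 1.0) for _ in range(p - q)]
        second_order = disc + noise
        # First-order latents are linear combinations of the second-order ones.
        first_order = [
            sum(a * z for a, z in zip(row, second_order)) for row in loadings
        ]
        # Discretize: category = number of thresholds at or below the value.
        data.append([bisect_right(thresholds, y) for y in first_order])
        labels.append(g)
    return data, labels


# Example: 3 observed variables, 1 discriminative and 2 noise dimensions.
example_loadings = [
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
example_data, example_labels = simulate_ordinal(
    n=200,
    weights=[0.5, 0.5],
    means=[[-2.0], [2.0]],
    loadings=example_loadings,
    thresholds=[-1.0, 0.0, 1.0],
)
```

In this toy setup only the first two observed variables load on the discriminative latent, so the third carries no cluster information; with three thresholds, each variable takes four ordered categories coded 0 to 3.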

Keywords

mixture models reduction ordinal data composite likelihood

Type: Original Paper
Information: Psychometrika, Volume 82, Issue 4, December 2017, pp. 1007–1034

DOI: https://doi.org/10.1007/s11336-017-9578-5
Copyright: Copyright © 2017 The Psychometric Society


