
A Two-Step Estimator for Multilevel Latent Class Analysis with Covariates

Published online by Cambridge University Press:  01 January 2025

Roberto Di Mari*
Affiliation:
University of Catania
Zsuzsa Bakk
Affiliation:
Leiden University
Jennifer Oser
Affiliation:
Ben-Gurion University
Jouni Kuha
Affiliation:
London School of Economics and Political Science
*
Correspondence should be made to Roberto Di Mari, Department of Economics and Business, University of Catania, Corso Italia 55, 95128 Catania, Italy. Email: roberto.dimari@unict.it

Abstract

We propose a two-step estimator for multilevel latent class analysis (LCA) with covariates. The measurement model for observed items is estimated in its first step, and in the second step covariates are added in the model, keeping the measurement model parameters fixed. We discuss model identification, and derive an Expectation Maximization algorithm for efficient implementation of the estimator. By means of an extensive simulation study we show that (1) this approach performs similarly to existing stepwise estimators for multilevel LCA but with much reduced computing time, and (2) it yields approximately unbiased parameter estimates with a negligible loss of efficiency compared to the one-step estimator. The proposal is illustrated with a cross-national analysis of predictors of citizenship norms.

Type
Theory and Methods
Creative Commons
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Copyright
Copyright © 2023 The Author(s)

Latent class analysis (LCA) is used to create a clustering of units based on a set of observed variables, expressed in terms of an underlying unobserved classification. When it is applied to hierarchical (multilevel) data where lower-level units are nested in higher-level ones, the basic latent class model can be extended to account for this data structure. This can be seen as a random coefficients multinomial logistic model (see, for instance, Agresti et al., 2000) for an unobserved categorical variable that is measured by several observed indicators, with a higher-level latent class variable in the role of a categorical random effect (Vermunt, 2003). Multilevel LCA has become more popular in the social sciences in recent years, for example in educational sciences (Fagginger Auer et al., 2016; Grilli et al., 2022, 2016; Grilli & Rampichini, 2011; Mutz & Daniel, 2013), economics (Paccagnella & Varriale, 2013), epidemiology (Tomczyk et al., 2015; Rindskopf, 2006; Zhang et al., 2012; Horn et al., 2008), sociology (Da Costa & Dias, 2015; Morselli & Glaeser, 2018), and political science (Ruelens & Nicaise, 2020). In most of these examples, the multilevel LCA model also includes covariates that are used as predictors of the clustering, and substantive research questions often focus on the coefficients of the covariates.

In estimation of models with covariates, for single-level LCA the current mainstream recommendation is to use stepwise methods that separate the estimation of the measurement model for the observed indicators from the estimation of the structural model for the latent variables given the covariates (see, e.g., Bakk & Kuha, 2018; Di Mari et al., 2020; Di Mari & Maruotti, 2022; Vermunt, 2010). This is practically convenient because when the set of covariates is changed, only the structural model rather than the full model needs to be re-estimated. Different structural models can be considered even by different researchers at different times. Stepwise estimation can also avoid biases which can arise when all the parameters are instead estimated together in a simultaneous (one-step) approach: in that case, misspecifications in one part of the model can also bias the parameter estimates in other parts (Bakk & Kuha, 2018).

In multilevel LCA, the one-step approach is particularly cumbersome because of increased estimation time, especially with multiple covariates possibly defined at different levels. In that context, there is still a need for further research on bias-adjusted efficient stepwise estimators. Recently, Bakk et al. (2022) and Di Mari et al. (2022) proposed a "two-stage" estimator for this purpose. The parameters of the measurement model are estimated in the first stage, without including the covariates. This stage is further broken down into three steps. In the first of them, initial estimates of the measurement model are obtained from a single-level LC model, ignoring the multilevel structure. The latent class probabilities of the multilevel LC model are then estimated, keeping the measurement parameters from the first step fixed. Third, to stabilize the estimated measurement model and to account for possible interaction effects, the multilevel model is estimated again, now keeping the latent class parameters fixed. The estimated measurement parameters from this last step of the first stage are then held fixed in the second stage, where the model for the latent classes given covariates is estimated.

This method has been shown to greatly simplify model construction and interpretation compared to the one-step estimator, with almost identical results if model assumptions are not violated, and with enhanced algorithmic stability and improved speed of convergence. In addition, the two-stage estimator exhibits an increased degree of robustness compared to the simultaneous approach in the presence of measurement noninvariance (Bakk et al., 2022).

A difficulty in this two-stage technique is deriving an asymptotic covariance matrix that takes into account the multi-step procedure. Conditioning on the first-stage estimates as if they were known, even though they are estimates with a sampling distribution, introduces a downward bias in the standard errors, a phenomenon that is well known also in the context of stepwise structural equation models (Skrondal & Kuha, 2012; Oberski & Satorra, 2013). For two-step single-level LCA, the standard errors can be corrected in a straightforward way (Bakk & Kuha, 2018), but this is more difficult for two-stage LCA due to conditioning on multiple steps.

The two-stage approach is still in some ways more involved than it needs to be. In this paper we show that it is possible to simplify it into a more straightforward two-step estimator, still retaining its good performance but with a further reduced computation time. This approach is closely motivated by two-step estimation as it is used for single-level LCA. In the first step, the full multilevel measurement model is estimated in one go, but without covariates. In the second step, covariates are included in the model, keeping the measurement model parameters fixed at their estimates from the first step.
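To make the mechanics of "keeping the measurement model parameters fixed" concrete, the following is a minimal numerical sketch, not the estimator developed in this paper: it runs an EM over the structural parameters of the multilevel LC model in the special case with no covariates, where the structural model reduces to the class probabilities ω and Π, while the measurement probabilities Φ (imagined as the output of the first step) are held fixed. All parameter values and the simulated data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical step-1 output: Phi[h, t] = P(Y_h = 1 | X = t), held fixed below.
Phi = np.array([[0.9, 0.2],
                [0.8, 0.3],
                [0.7, 0.1]])
H, T = Phi.shape
M = 2                                      # number of high-level classes

# Toy data: J groups, n low-level units each, H binary items (simulated arbitrarily).
J, n = 30, 5
Y = rng.integers(0, 2, size=(J, n, H))

def cond_lik(Yj):
    """L[i, t] = P(Y_ij | X_ij = t) under the fixed measurement model."""
    return np.prod(Phi.T[None, :, :] ** Yj[:, None, :]
                   * (1 - Phi.T)[None, :, :] ** (1 - Yj[:, None, :]), axis=2)

# EM over the structural parameters (omega, Pi), keeping Phi fixed.
omega = np.full(M, 1.0 / M)                # P(W_j = m)
Pi = rng.dirichlet(np.ones(T), size=M)     # Pi[m, t] = P(X_ij = t | W_j = m)

for _ in range(200):
    new_omega = np.zeros(M)
    num, den = np.zeros((M, T)), np.zeros(M)
    for j in range(J):
        L = cond_lik(Y[j])                 # (n, T)
        mix = L @ Pi.T                     # mix[i, m] = sum_t pi_{t|m} L[i, t]
        a = omega * np.prod(mix, axis=0)   # unnormalised P(W_j = m, Y_j)
        pW = a / a.sum()                   # E-step: P(W_j = m | Y_j)
        # E-step: P(X_ij = t, W_j = m | Y_j)
        pXW = pW[None, :, None] * Pi[None, :, :] * L[:, None, :] / mix[:, :, None]
        new_omega += pW
        num += pXW.sum(axis=0)
        den += n * pW
    omega = new_omega / J                  # closed-form M-step updates
    Pi = num / den[:, None]

print(omega)
print(Pi)
```

In the actual second step the intercept-only model for the classes is replaced by a model with covariates, but the structure of the computation (E-step posteriors for W and X with Φ frozen, then an M-step for the structural parameters only) is the same.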

With such a two-step estimator, we contribute to the existing literature in several ways: (1) we establish model identification for the multilevel LC model under standard assumptions, as a foundation for correct measurement model estimation; (2) we derive a step-by-step EM algorithm with closed-form formulas to handle the computation of the two-step estimator; and (3) we derive the correct asymptotic variance-covariance matrix of the second-step estimator of the structural model, drawing on the theory of pseudo maximum likelihood estimation (Gong & Samaniego, 1981).

We evaluate the finite sample properties of our proposal by means of an extensive simulation study. Cross-national data on citizenship norms from the International Association for the Evaluation of Educational Achievement survey are analyzed to illustrate the proposal, and possible extensions are discussed in the conclusions.

1. The Multilevel Latent Class Model with Covariates

Let $\textbf{Y}_{ij}=(Y_{ij1},\dots,Y_{ijH})'$ be a vector of observed responses, where $Y_{ijh}$ denotes the response of individual $i=1,\dots,n_{j}$ in group $j=1,\dots,J$ on the $h$-th categorical indicator variable ("item"), with $h=1,\dots,H$. The data have a hierarchical (multilevel) structure where the individuals are nested within the groups. In the following, we will also refer to individuals as the "low-level units", and groups as the "high-level units".
Let $\textbf{Y}_{j}=(\textbf{Y}_{1j},\dots,\textbf{Y}_{n_{j}j})'$ denote the set of responses for all the low-level units belonging to high-level unit $j$, with $\textbf{Y}_j$ for different $j$ taken to be independent of each other. For simplicity of exposition, we focus below on the case where the items $Y_{ijh}$ are dichotomous, but the idea and methods of two-step estimation proposed here apply in a straightforward way also for polytomous items.

Let $W_j$ be a categorical latent variable (i.e. a latent class (LC) variable) defined at the high level, with possible values $m=1,\dots,M$ and probabilities $P(W_j = m) = \omega_m > 0$, and let $\boldsymbol{\omega}=(\omega_{1},\dots,\omega_{M})'$. Given a realization of $W_j$, let $X_{ij}$ be a categorical latent variable defined at the low level, with possible values $t=1,\dots,T$, and conditional probabilities $P(X_{ij} = t \mid W_j = m) = \pi_{t|m} > 0$. We collect all the $\pi_{t|m}$ in the $M \times T$ matrix $\boldsymbol{\Pi}$. The $X_{ij}$ for the same $j$ are taken to be conditionally independent given $W_{j}$, so that

$$P(X_{1j},\dots,X_{n_{j}j}) = \sum_{m=1}^{M} P(W_{j}=m) \prod_{i=1}^{n_{j}} P(X_{ij} \mid W_{j}=m).$$

This shows that the high-level latent class $W_{j}$ serves as a categorical random effect which accounts for associations between the low-level latent classes $X_{ij}$ for different low-level units $i$ within the same high-level unit $j$.
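The dependence induced by the categorical random effect can be checked numerically. A small sketch with hypothetical values of $\omega$ and $\Pi$, showing that the joint probability of two same-group units sharing a class exceeds the product of their marginals:

```python
import numpy as np

# Hypothetical values: M = 2 high-level classes, T = 2 low-level classes.
omega = np.array([0.6, 0.4])         # P(W_j = m)
Pi = np.array([[0.8, 0.2],           # Pi[m, t] = P(X_ij = t | W_j = m)
               [0.3, 0.7]])

def joint_prob(x):
    """P(X_1j = x[0], ..., X_nj = x[-1]), marginalised over W_j."""
    return sum(w * np.prod(Pi[m, x]) for m, w in enumerate(omega))

p_00 = joint_prob([0, 0])            # two units of the same group both in class 0
p_0 = joint_prob([0])                # marginal probability of class 0
print(p_00, p_0 ** 2)                # the random effect makes p_00 > p_0**2
```

With these numbers, $P(X_{1j}=0, X_{2j}=0) = 0.42$ while $P(X_{ij}=0)^2 = 0.36$: the units are marginally dependent even though they are conditionally independent given $W_j$.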

The items $\textbf{Y}_{j}$ are treated as observed indicators of the latent classes. A multilevel latent class model specifies the probability of observing a particular response configuration for a high-level unit $j$ as

$$\begin{aligned} P(\textbf{Y}_j) &= \sum_{m=1}^M P(W_{j}=m) \prod_{i=1}^{n_j} \sum_{t=1}^{T} P(X_{ij}=t \mid W_{j}=m)\, P(\textbf{Y}_{ij} \mid X_{ij}=t, W_{j}=m) \\ &= \sum_{m=1}^M \omega_m \prod_{i=1}^{n_j} \sum_{t=1}^{T} \pi_{t|m} \prod_{h=1}^{H} P(Y_{ijh} \mid X_{ij}=t, W_{j}=m), \end{aligned} \tag{1}$$

where $P(Y_{ijh} \mid X_{ij}=t, W_j=m)$ denotes the conditional probability mass function of the $h$-th item, given the latent class variables $X_{ij}$ and $W_{j}$. The second line further assumes that the responses $Y_{ijh}$ for different items $h$ are conditionally independent given $(X_{ij},W_{j})$, a standard assumption which we make throughout.

Model (1) is a general formulation which is equivalent to an unrestricted multi-group latent class model. Most applications, however, use a more restricted version which assumes that the item response probabilities do not depend directly on the high-level latent class $W_{j}$ (Vermunt, 2003; Lukociene et al., 2010); this model is represented in Fig. 1, if we omit the covariates $Z_{ij}$ which will be introduced below. We also make this assumption throughout this paper. Model (1) is also similar to the multilevel item response model of Gnaldi et al. (2016), but with categorical latent variables at both levels. The response probabilities are then given by

$$P(\textbf{Y}_j) = \sum_{m=1}^M \omega_m \prod_{i=1}^{n_j} \sum_{t=1}^{T} \pi_{t|m} \prod_{h=1}^{H} P(Y_{ijh} \mid X_{ij}=t). \tag{2}$$

Therefore, within each high-level latent class $W_{j}$, the model for the items has the form of a standard (single-level) LC model with $X_{ij}$ as the latent class (McCutcheon, 1987; Goodman, 1974; Hagenaars, 1990). When the items $Y_{ijh}$ are binary with values 0 and 1, we denote $P(Y_{ijh}=1 \mid X_{ij}=t)=\phi_{h|t}$, so that $P(Y_{ijh}=y_{ijh} \mid X_{ij}=t)= \phi_{h|t}^{y_{ijh}} (1-\phi_{h|t})^{1-y_{ijh}}$, and denote by $\boldsymbol{\Phi}$ the $H\times T$ matrix of all the $\phi_{h|t}$.
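Equation (2) can be evaluated directly for small examples. A minimal sketch (all parameter values hypothetical) that computes $P(\textbf{Y}_j)$ for one group of binary responses:

```python
import numpy as np

# Hypothetical parameter values: M = 2, T = 2, H = 3 binary items.
omega = np.array([0.6, 0.4])              # high-level class probabilities
Pi = np.array([[0.8, 0.2],                # Pi[m, t] = P(X_ij = t | W_j = m)
               [0.3, 0.7]])
Phi = np.array([[0.9, 0.2],               # Phi[h, t] = P(Y_ijh = 1 | X_ij = t)
                [0.8, 0.3],
                [0.7, 0.1]])

def p_item_vector(y):
    """P(Y_ij = y | X_ij = t) for each t: product of Bernoulli terms over items."""
    y = np.asarray(y)[:, None]            # shape (H, 1)
    return np.prod(Phi ** y * (1 - Phi) ** (1 - y), axis=0)   # shape (T,)

def p_group(Y):
    """Equation (2): P(Y_j) = sum_m omega_m prod_i sum_t pi_{t|m} P(Y_ij | X_ij = t)."""
    L = np.array([p_item_vector(y) for y in Y])   # (n_j, T) class-conditional likelihoods
    per_m = np.prod(L @ Pi.T, axis=0)             # (M,): product over units within class m
    return float(omega @ per_m)

Y_j = [[1, 1, 1], [0, 0, 0]]              # responses of the n_j = 2 units in group j
print(p_group(Y_j))
```

Note the order of operations mirrors the formula: the sum over $t$ is taken per unit ($L \Pi'$), the product over $i$ within each $m$, and the mixture over $m$ last.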

Figure 1. Graphical representation of a multilevel latent class model which includes a low-level latent class variable $X_{ij}$ nested in a high-level latent class variable $W_j$, and covariates $Z_{ij}$ for $X_{ij}$. Here the response probabilities for items $Y_{ijh}$ depend directly only on $X_{ij}$.

It can be shown that the model is identified (in a generic sense, see Allman et al., 2009), under a standard set of assumptions:

Proposition 1.1

(Identification) Suppose that the following conditions hold: (A.1) $\phi_{h|t} \ne \phi_{h|s}$ for all $h=1,\dots,H$ and for $t \ne s$; and (A.2) the $M \times T$ matrix $\boldsymbol{\Pi}$ has rank $M$. Then the multilevel LC model (2) is identified when $M\le T$ and $n_j \ge 3$, for all $j=1,\dots,J$.

The proof of Proposition 1.1 follows the same lines as in Gassiat et al. (2016), who proved identification of finite state space nonparametric hidden Markov models, and applies the results of Theorem 9 of Allman et al. (2009). The fact that all $\phi_{h|t}$ are distinct is sufficient for linear independence of the Bernoulli random variables. For $n_j=3$, using the assumption of conditional independence of low-level units given the high-level class $W_{j}$, the distribution of $(\textbf{Y}_{1j},\textbf{Y}_{2j},\textbf{Y}_{3j})$ factorizes as the product of three terms $\mu_{ij|m} = \sum_{t} \pi_{t|m} P(\textbf{Y}_{ij} \mid X_{ij}=t)$ for $i=1,2,3$. Assumption (A.2) ensures that $\mu_{1j|m}$, $\mu_{2j|m}$ and $\mu_{3j|m}$ are linearly independent. Thus Theorem 9 of Allman et al. (2009) applies.
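Condition (A.2) is easy to check numerically for a candidate mixing matrix. A small sketch with hypothetical values, contrasting a full-rank $\Pi$ with a rank-deficient one:

```python
import numpy as np

# Hypothetical mixing matrices for M = 2, T = 2.
Pi_ok = np.array([[0.8, 0.2],
                  [0.3, 0.7]])    # rank 2: the rows are linearly independent
Pi_bad = np.array([[0.5, 0.5],
                   [0.5, 0.5]])   # rank 1: both high-level classes imply the
                                  # same within-group mixture, so they cannot
                                  # be separated from the data

print(np.linalg.matrix_rank(Pi_ok), np.linalg.matrix_rank(Pi_bad))
```

The rank-deficient case illustrates why (A.2) is needed: if two rows of $\Pi$ coincide, the corresponding high-level classes produce identical distributions for $\textbf{Y}_j$.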

We make three ancillary comments on Proposition 1.1. First, for the unrestricted multilevel LC model (1), if an assumption analogous to (A.1) holds, that is, if all success probabilities of the Bernoulli random variables are distinct, we can relax (A.2) and prove identification using Theorem 9 of Allman et al. (2009) (a similar argument is used by Di Zio et al., 2007, in the related context of mixtures of finite mixtures with Gaussian components). Second, for longitudinal and multilevel data, generic identification of the measurement model does not require any condition on the number of items, provided that conditions (A.1) and (A.2) are satisfied. Third, although we have discussed identification specifically for binary items and Bernoulli conditional distributions, the result extends also to polytomous items if we can assume, analogously to (A.1), that all conditional category-class response probabilities are distinct. This guarantees linear independence of the corresponding multinomial random variables.

Covariates can be included in the multilevel LC model to predict latent class membership at both the low and high levels. Let $\textbf{Z}_{ij}=(1,\textbf{Z}_{1j}',\textbf{Z}_{2ij}')'$ be a vector of $K$ covariates, which can include high-level ($\textbf{Z}_{1j}$) and low-level ($\textbf{Z}_{2ij}$) variables. For $X_{ij}$ we can consider the multinomial logistic model

(3) $$P(X_{ij}=t \vert W_j = m, \textbf{Z}_{ij})=\frac{\exp (\varvec{\gamma }_{tm}^{\prime } \textbf{Z}_{ij})}{1+\sum _{s=2}^{T}\exp (\varvec{\gamma }_{sm}^{\prime } \textbf{Z}_{ij})},$$

where $\varvec{\gamma }_{tm}$ is a $K$-vector of regression coefficients for each $t=2,\dots,T$ and $m=1,\dots,M$. When only the intercept term is included, so that $\textbf{Z}_{ij}=1$, then $\varvec{\gamma }_{tm}=\log (\pi _{t|m}/\pi _{1|m})$ in the notation of the model without covariates above.
We denote by $\varvec{\Gamma }$ the $(T-1)M\times K$ matrix of all the parameters in the $\varvec{\gamma }_{tm}$ vectors.
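As a concrete illustration (not the authors' implementation), the class-membership probabilities in Equation (3) can be evaluated with a reference-category softmax. The following sketch assumes binary-free, generic inputs; the array names, the example coefficient values, and the choice of class 1 as the reference are illustrative assumptions consistent with the parameterization above.

```python
import numpy as np

def low_level_class_probs(Gamma_m, Z_ij):
    """P(X_ij = t | W_j = m, Z_ij) for t = 1, ..., T, per Equation (3).

    Gamma_m : (T-1, K) array of coefficients gamma_tm for t = 2, ..., T
              (class 1 is the reference category, coefficients fixed at 0).
    Z_ij    : (K,) covariate vector whose first entry is 1 (intercept).
    """
    linpred = Gamma_m @ Z_ij                              # (T-1,) linear predictors
    expterms = np.exp(np.concatenate(([0.0], linpred)))   # reference class: exp(0) = 1
    return expterms / expterms.sum()                      # probabilities sum to one

# Illustrative values: T = 3 low-level classes, K = 2 (intercept + one covariate)
Gamma_m = np.array([[0.5, -1.0],
                    [0.2,  0.8]])
probs = low_level_class_probs(Gamma_m, np.array([1.0, 0.3]))
```

In a full implementation one would hold one such coefficient matrix per high-level class $m$, matching the $(T-1)M\times K$ matrix $\varvec{\Gamma }$.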

A model for $W_{j}$ can be specified similarly, now using only the high-level covariates $\textbf{Z}^{*}_{j}=(1,\textbf{Z}_{1j}')'$, as

(4) $$P(W_j = m \vert \textbf{Z}^{*}_{j})=\frac{\exp (\varvec{\alpha }_{m}^{\prime } \textbf{Z}^{*}_{j})}{1+\sum _{l=2}^{M}\exp (\varvec{\alpha }_{l}^{\prime } \textbf{Z}^{*}_{j})},$$

where $\varvec{\alpha }_{m}$, for $m=2,\dots,M$, are regression coefficients. Although this too is straightforward, for ease of exposition and simplicity of notation we will below not consider models with covariates for $W_{j}$, but present the two-step estimator only for the case where $\textbf{Z}^{*}_{j}=1$ and thus $\varvec{\alpha }_{m}=\log (\omega _m/\omega _{1})$.
The focus of interest is then on the model for the low-level (individual-level) latent class $X_{ij}$, and the high-level (group-level) latent class $W_{j}$ serves primarily as a random effect which accounts for intra-group associations among the $X_{ij}$. We further assume that the observed items $\textbf{Y}_{j}$ are conditionally independent of the covariates $\textbf{Z}_{ij}$ given the latent class variables $X_{ij}$.
This means that the measurement of $X_{ij}$ by $\textbf{Y}_{ij}$ is taken to be invariant with respect to the covariates. With these assumptions, and denoting $\textbf{Z}_j=(\textbf{Z}_{1j}',\dots,\textbf{Z}_{n_{j}j}')'$, the model that we will consider is finally of the form

(5) $$P(\textbf{Y}_j \vert \textbf{Z}_{j}) = \sum _{m=1}^M \omega _m \prod _{i=1}^{n_j} \sum _{t=1}^{T} P(X_{ij} = t \vert W_j = m, \textbf{Z}_{ij}) \prod _{h=1}^{H} P(Y_{ijh} \vert X_{ij} = t);$$

see also the graphical representation of the model in Fig. 1. This model is identified when the corresponding model without covariates is identified, as long as the design matrix of all the $\textbf{Z}_{ij}$s has full column rank (for an analogous condition for identifiability in the context of single-level latent class models with covariates, see Huang and Bandeen-Roche, 2004, and Ouyang and Xu, 2022).
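For intuition, the likelihood contribution (5) of one high-level unit can be computed by brute force: for each high-level class, weight each respondent's mixture over low-level classes and multiply across respondents. The sketch below is a minimal illustration under stated assumptions (binary items, the softmax parameterization of Equation 3, made-up array names and toy values); it is not the package's implementation, which works on the log scale for stability.

```python
import numpy as np

def group_likelihood(Y_j, Z_j, omega, Gamma, Phi):
    """P(Y_j | Z_j) per Equation (5), for binary items.

    Y_j   : (n_j, H) 0/1 item responses of group j.
    Z_j   : (n_j, K) covariate rows Z_ij (first column 1).
    omega : (M,) high-level class proportions.
    Gamma : (M, T-1, K) coefficients gamma_tm (class t = 1 is the reference).
    Phi   : (T, H) conditional response probabilities phi_{h|t}.
    """
    n_j, H = Y_j.shape
    M = omega.shape[0]
    # P(Y_ij | X_ij = t): product of Bernoulli terms over the H items
    py_given_x = np.prod(Phi[None, :, :] ** Y_j[:, None, :]
                         * (1 - Phi[None, :, :]) ** (1 - Y_j[:, None, :]),
                         axis=2)                              # (n_j, T)
    total = 0.0
    for m in range(M):
        lin = Z_j @ Gamma[m].T                                # (n_j, T-1)
        expterms = np.exp(np.hstack([np.zeros((n_j, 1)), lin]))
        p_x = expterms / expterms.sum(axis=1, keepdims=True)  # P(X=t | W=m, Z), (n_j, T)
        total += omega[m] * np.prod((p_x * py_given_x).sum(axis=1))
    return total

# Toy example: one group, two respondents, two binary items, M = T = 2
Y_j = np.array([[1., 0.], [1., 1.]])
Z_j = np.array([[1., 0.5], [1., -0.2]])
omega = np.array([0.6, 0.4])
Gamma = np.array([[[0.3, 0.1]],
                  [[-0.4, 0.2]]])
Phi = np.array([[0.8, 0.7],
                [0.2, 0.1]])
lik = group_likelihood(Y_j, Z_j, omega, Gamma, Phi)
```

Summing this quantity over all $2^{n_j H}$ possible response patterns returns one, which is a convenient check that the mixture is a proper probability distribution.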

2. Previous Methods of Estimation

We denote the parameters of the model in (5) as $\varvec{\theta }=(\varvec{\theta }_{1}',\varvec{\theta }_{2}')'$, where $\varvec{\theta }_{1}=\text {vec}(\varvec{\Phi })$ are the parameters of the measurement model for the items $\textbf{Y}_{j}$ and $\varvec{\theta }_{2}=(\text {vec}(\varvec{\Gamma })',\varvec{\omega }')'$ are the parameters of the structural model for the latent class variables $(X_{ij},W_{j})$ given the covariates $\textbf{Z}_{ij}$.
Maximum likelihood estimates of these parameters can be obtained by maximizing the log-likelihood $\ell (\varvec{\theta }) =\sum _{j=1}^{J} \log P(\textbf{Y}_{j}\vert \textbf{Z}_{j})$ with respect to all the parameters together. This is the simultaneous or one-step method of estimation for the model. It has serious disadvantages, however. The full model needs to be re-estimated whenever the covariates in the structural model are changed, which can be computationally demanding because of the complexity of such multilevel models. Further, because all the parameters are estimated together, misspecification in one part of the model may also destabilize the parameter estimates in other parts of the model (Vermunt, 2010; Asparouhov and Muthén, 2014).

Because of the complexity of the one-step approach, in practice the classical three-step method of estimation is more often used. In step 1, model (2) without covariates is estimated. In step 2, this model is used to assign respondents to the latent classes $X_{ij}$ and $W_{j}$, conditional on their observed responses $\textbf{Y}_{j}$; how this is done for the multilevel LC model is described in detail in Vermunt (2003). In step 3, the assigned latent classes are modelled given covariates, treating the classes now as observed variables. This is straightforward to do. However, it yields biased estimates of the parameters of the structural model, because the assigned classes are potentially misclassified versions of the true latent classes.
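Step 2 of the classical three-step method is often implemented by modal assignment, i.e. each unit is allocated to its most probable class given the posterior membership probabilities. A minimal sketch (the posterior values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical posterior membership probabilities P(X_ij = t | Y_ij)
# for 4 respondents and T = 3 low-level classes (rows sum to one).
posterior = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4],
                      [0.5, 0.4, 0.1]])

# Modal assignment: each respondent gets the class with the largest posterior.
assigned = posterior.argmax(axis=1) + 1   # classes labelled 1, ..., T
```

The third respondent illustrates the source of the bias discussed above: the modal class (3) has posterior probability only 0.4, so treating the assignment as an observed variable discards substantial classification uncertainty.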

Because of this bias in the classical three-step approach, bias-adjusted stepwise methods are needed. One such method for multilevel LC models with covariates is the two-stage estimator proposed by Di Mari et al. (2022; see also Bakk et al., 2022). It involves the following two stages:

(A) First stage: unconditional multilevel LC model building (measurement model construction).

    Step 1: A single-level latent class model is fitted for $\textbf{Y}_{ij}$ given the low-level latent class $X_{ij}$, ignoring the multilevel structure of the data. This gives an initial estimate of $\varvec{\Phi }$.

    Step 2.a: The multilevel model without covariates (Equation 2) is estimated, keeping $\varvec{\Phi }$ fixed at its estimated value from Step 1. This gives estimates of $\varvec{\omega }$ and $\varvec{\Pi }$.

    Step 2.b: The two-level model is estimated again, now keeping $\varvec{\omega }$ and $\varvec{\Pi }$ fixed at their estimates from Step 2.a. This gives the estimate of $\varvec{\Phi }$ which is taken forward to the second stage.

(B) Second stage: inclusion of covariates in the model (structural model construction).

    Step 3: The multilevel model (5) with covariates is estimated, keeping the measurement parameters $\varvec{\Phi }$ fixed at their estimates from the first stage. This gives the two-stage estimates of the structural parameters $\varvec{\theta }_2$.

While effective, the two-stage approach has some shortcomings. Although Steps 2.a and 2.b both estimate only part of the measurement model parameters, computationally they do not save much effort because the most challenging part of the estimation (the E-step of the EM algorithm; see below) is required by both steps. Fixing the response probabilities is also not enough to prevent label switching of the classes from one step to the next in the first stage, since this can simultaneously occur at both the low and high levels. Finally, estimating the correct form of the second-stage information matrix, which should take variability of the previous steps into account, is difficult due to the sequential re-updating of the measurement model. These complications make it desirable to look for more straightforward bias-adjusted stepwise approaches for the multilevel LC model. Such a method, the two-step estimator, is described next.

3. Two-step Estimator for the Model with Covariates

We propose to amend the two-stage estimator by concentrating all of the measurement modeling into a single step 1, where we estimate the multilevel LC model without covariates. The estimated parameters of the measurement model for the items $\textbf{Y}_{ij}$ from this step are then taken forward as fixed to step 2, where the structural model for the latent classes given covariates is estimated. Step 2 is thus the same as the second stage of two-stage estimation, but the three steps of its first stage are here collapsed into the single step 1.

The two-step estimation procedure for multilevel LC models that is described in this section has been implemented in the R package multilevLCA (Lyrvall et al., 2023), which can be downloaded from CRAN. The package's routines have been used for the simulations and data analysis in Sects. 4 and 5 of the paper.

3.1. Step 1 — Measurement Model

In the first step, a simple multilevel LC model without covariates is fitted to the data. Given the data defined above, the log likelihood for this step is

(6) $$\ell _{1}=\ell (\varvec{\Phi },\mathbf {\Pi },\varvec{\omega }) = \sum _{j=1}^J \log P(\textbf{Y}_{j}),$$

where $P(\textbf{Y}_{j})$ is given by (2). This is maximized to find the ML estimate of the parameters of this model. Direct (numerical) maximization is possible, either with suitable constraints or by adopting well-known logistic re-parametrizations, but it quickly becomes infeasible even for a moderate number of low- and/or high-level classes. A more practical alternative is to maximize (6) by means of the expectation-maximization (EM) algorithm (Dempster et al., 1977), which is what we propose here.

A standard implementation of EM would involve computing $M\times T^{n_j}$ joint posterior probabilities, which is infeasible already with a few low-level units per high-level unit. Instead, our implementation of the EM algorithm follows closely Vermunt (2003)'s upward-downward method of computing the joint posteriors of the low- and high-level classes (see also Vermunt, 2008), where the number of joint posterior probabilities to be computed is only a linear function of the number of low-level units per high-level unit. Here we describe in detail the E and M steps of the algorithm, with the step-by-step implementation that we use to obtain the estimates in Step 1.

Using standard EM terminology, let us introduce the following augmenting variables:

(7) $$u_{j,m} = {\left\{ \begin{array}{ll} 1, &{} \text{ if } W_j = m,\\ 0, &{} \text{ otherwise}, \end{array}\right. }\qquad v_{i,j,t,m} = {\left\{ \begin{array}{ll} 1, &{} \text{ if } X_{ij} = t,\ W_j = m, \\ 0, &{} \text{ otherwise}. \end{array}\right. }$$

Defining the complete-data sample as $\{\textbf{Y}_1,\dots,\textbf{Y}_J,u_{1,1},\dots,u_{j,m},\dots,u_{J,M},v_{1,1,1,1},\dots,v_{i,j,t,m},\dots,v_{n_J,J,T,M}\}$, the complete-data log-likelihood (CDLL) for the first step can be specified as

(8) $$\begin{aligned} \ell _{1}^{c} =&\sum _{j=1}^J \sum _{m=1}^M u_{j,m} \log (\omega _m) + \sum _{j=1}^J \sum _{i=1}^{n_j} \sum _{m=1}^M \sum _{t=1}^{T} v_{i,j,t,m} \log (\pi _{t \vert m}) \\ &+ \sum _{j=1}^J \sum _{i=1}^{n_j} \sum _{m=1}^M \sum _{t=1}^{T} v_{i,j,t,m} \sum _{h=1}^{H} \{ Y_{ijh}\log (\phi _{h \vert t}) + [1-Y_{ijh}]\log (1-\phi _{h \vert t}) \}, \end{aligned}$$

where we have dropped the argument $(\varvec{\Phi },\mathbf {\Pi },\varvec{\omega })$ from $\ell _{1}^{c}$ for simplicity of notation.

In the E step, the missing data are imputed by conditional expectations given the observed data and current values for the unknown model parameters. More specifically, this involves the computation of the following expected CDLL

(9) $$\begin{aligned} \mathbb {E}\left[ \ell _{1}^{c} \right] &= \sum _{j=1}^J \sum _{m=1}^M \hat{u}_{j,m} \log (\omega _m) + \sum _{j=1}^J \sum _{i=1}^{n_j} \sum _{m=1}^M \sum _{t=1}^{T} \hat{v}_{i,j,t,m} \log (\pi _{t \vert m}) \\ &\quad + \sum _{j=1}^J \sum _{i=1}^{n_j} \sum _{m=1}^M \sum _{t=1}^{T} \hat{v}_{i,j,t,m} \sum _{h=1}^{H} \{ Y_{ijh}\log (\phi _{h \vert t}) + [1-Y_{ijh}]\log (1-\phi _{h \vert t}) \} \equiv Q, \end{aligned}$$

where

(10) $$\hat{u}_{j,m} = \frac{\omega _m \prod _{i=1}^{n_j} \sum _{t=1}^{T} \pi _{t \vert m} \prod _{h=1}^{H} P(Y_{ijh} \vert X_{ij} = t)}{\sum _{l=1}^M \omega _l \prod _{i=1}^{n_j} \sum _{t=1}^{T} \pi _{t \vert l} \prod _{h=1}^{H} P(Y_{ijh} \vert X_{ij} = t)}.$$

To compute the conditional expectation of $v_{i,j,t,m}$, we use the fact that the joint probability $P(X_{ij} = t, W_j = m \vert \textbf{Y}_j)$ can be written as $P(W_j = m \vert \textbf{Y}_j) P(X_{ij} = t \vert W_j,\textbf{Y}_j)$, where $P(W_j = m \vert \textbf{Y}_j)$ is already available from (10). Note that, given the model assumptions,

(11) $$P(X_{ij} = t \vert W_j, \textbf{Y}_j) = P(X_{ij} = t \vert W_j, \textbf{Y}_{ij}),$$

which we use to compute the following desired quantity

(12) $$\begin{aligned} \hat{v}_{i,j,t,m} &= P(X_{ij} = t, W_j = m \vert \textbf{Y}_j) \\ &= P(W_j = m \vert \textbf{Y}_j)\, P(X_{ij} = t \vert W_j, \textbf{Y}_{ij}) \\ &= \hat{u}_{j,m}\, \frac{P(X_{ij} = t \vert W_j = m)\, P(\textbf{Y}_{ij} \vert X_{ij} = t)}{P(\textbf{Y}_{ij})} \\ &= \hat{u}_{j,m}\, \frac{\pi_{t \vert m} \prod_{h=1}^{H} P(Y_{ijh} \vert X_{ij} = t)}{\sum_{s=1}^{T} \pi_{s \vert m} \prod_{h=1}^{H} P(Y_{ijh} \vert X_{ij} = s)}, \end{aligned}$$

where in the third row we use the assumption that the joint probability function of the response variables depends on high-level class membership only through low-level class membership. For the unrestricted multi-group LC model, the expression (12) can be adapted straightforwardly.

In the M step of the algorithm, the expected CDLL (9) is maximized with respect to the model parameters $(\boldsymbol{\Phi}, \boldsymbol{\Pi}, \boldsymbol{\omega})$, subject to the usual sum-to-one constraints on probabilities. This yields the following closed-form updates

(13) $$\omega_m = \frac{\sum_{j=1}^{J} \hat{u}_{j,m}}{\sum_{j=1}^{J} \sum_{m=1}^{M} \hat{u}_{j,m}},$$
(14) $$\pi_{t \vert m} = \frac{\sum_{j=1}^{J} \sum_{i=1}^{n_j} \hat{v}_{i,j,t,m}}{\sum_{j=1}^{J} \sum_{i=1}^{n_j} \sum_{t=1}^{T} \hat{v}_{i,j,t,m}},$$
(15) $$\phi_{h \vert t} = \frac{\sum_{j=1}^{J} \sum_{i=1}^{n_j} \sum_{m=1}^{M} \hat{v}_{i,j,t,m}\, Y_{ijh}}{\sum_{j=1}^{J} \sum_{i=1}^{n_j} \sum_{m=1}^{M} \hat{v}_{i,j,t,m}}.$$

Starting from initial values for the model parameters, the algorithm iterates between the E- and the M-steps until some convergence criterion is met, e.g. until the difference between the log-likelihood values of two subsequent iterations falls below some threshold value.
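To make the algorithm concrete, the following sketch implements one E-step/M-step cycle in NumPy. It is our illustration rather than the authors' reference implementation: the names (`em_step`, `omega`, `Pi`, `Phi`) are our own, and the group-level products in (10) are accumulated in log space to avoid numerical underflow when the group sizes $n_j$ are large.

```python
import numpy as np

def em_step(Y, groups, omega, Pi, Phi):
    """One EM iteration for the multilevel LC model without covariates.

    Y      : (N, H) binary item responses
    groups : (N,) group index j in 0..J-1
    omega  : (M,)   high-level class proportions
    Pi     : (M, T) low-level class probabilities pi_{t|m}
    Phi    : (T, H) response probabilities phi_{h|t}
    """
    J = groups.max() + 1

    # log P(Y_ij | X_ij = t): (N, T)
    logp = Y @ np.log(Phi).T + (1 - Y) @ np.log(1 - Phi).T

    # log sum_t pi_{t|m} P(Y_ij | t), computed stably in log space: (N, M)
    a = np.log(Pi)[None, :, :] + logp[:, None, :]          # (N, M, T)
    amax = a.max(axis=2, keepdims=True)
    log_mix = amax[..., 0] + np.log(np.exp(a - amax).sum(axis=2))

    # group-level contributions log prod_i [...]: (J, M)
    group_ll = np.zeros((J, Pi.shape[0]))
    np.add.at(group_ll, groups, log_mix)

    # E-step, Eq. (10): posterior high-level memberships u_{j,m}
    b = np.log(omega) + group_ll
    b -= b.max(axis=1, keepdims=True)
    u = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)

    # E-step, Eq. (12): v_{i,j,t,m} = u_{j,m} * P(X=t | W=m, Y_ij)
    q = np.exp(a - amax)
    q /= q.sum(axis=2, keepdims=True)                      # (N, M, T)
    v = u[groups][:, :, None] * q

    # M-step, Eqs. (13)-(15)
    omega_new = u.sum(axis=0) / u.sum()
    Pi_new = v.sum(axis=0)
    Pi_new /= Pi_new.sum(axis=1, keepdims=True)
    w = v.sum(axis=1)                                      # (N, T), summed over m
    Phi_new = (w.T @ Y) / w.sum(axis=0)[:, None]

    # observed-data log-likelihood at the *input* parameters, for monitoring
    c = np.log(omega) + group_ll
    cmax = c.max(axis=1, keepdims=True)
    ll = float((cmax[:, 0] + np.log(np.exp(c - cmax).sum(axis=1))).sum())
    return omega_new, Pi_new, Phi_new, ll
```

Iterating this function and monitoring the returned log-likelihood implements the convergence criterion described above, since EM guarantees a non-decreasing log-likelihood sequence.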

As for all mixture models, the log-likelihood function can have several local optima and there is no guarantee that the solution found by the EM algorithm is the global optimum (Wu, 1983). To better explore the likelihood surface, multiple starting value strategies are typically implemented (among others, see Biernacki et al., 2003; Maruotti & Punzo, 2021). The easiest, and most common, approach is to initialize the EM algorithm randomly from several different starting points. However, even for relatively simple models, the multiple random starting value strategy is often outperformed by more refined techniques (Biernacki et al., 2003).

For any stepwise estimator, the initialization strategy of earlier steps is particularly relevant because subsequent steps are conditional on estimates from previous steps. In our step 1, we suggest implementing the following hierarchical initialization strategy (for a similar approach in a related context, see for instance Catania & Di Mari, 2021; Catania et al., 2022):

  (1) Perform a single-level K-modes clustering (Huang, 1997; MacQueen, 1967), with $K=M$. For each $j=1,\dots,J$:

    • let $\dot{W}_{ij}$ be the outcome class assignment for unit i in group j;

    • specify $\widetilde{W}_j$ as the most frequently assigned class among the $n_j$ observations belonging to group j, and let $\widetilde{W}_{ij} = \widetilde{W}_j$ for all $i=1,\dots,n_j$.

    The relative sizes of the resulting high-level classes are used to initialize $\boldsymbol{\omega}$. The entries of $\boldsymbol{\omega}$, before being carried over to the actual estimation step, can be sorted in increasing or decreasing order.

  (2) Fit a single-level T-class LC model on the pooled data, ignoring the multilevel structure. Note that the K-modes algorithm can be employed here as well to initialize the single-level LCA. The estimated output is organized as follows:

    • the response probabilities are passed on to the EM algorithm as starting values for $\boldsymbol{\Phi}$;

    • let $\widetilde{X}_{ij}$ be the maximum a posteriori class assignment for unit i in group j. Cross-tabulate $\widetilde{\textbf{X}}$ and $\widetilde{\textbf{W}}$, where $\widetilde{\textbf{X}} = (\widetilde{X}_{11},\dots,\widetilde{X}_{n_J J})^{\prime}$ and $\widetilde{\textbf{W}} = (\widetilde{W}_{11},\dots,\widetilde{W}_{n_J J})^{\prime}$. From the $T \times M$ table of joint counts, compute the conditional (relative) counts of $\widetilde{\textbf{X}} \vert \widetilde{\textbf{W}}$ to initialize $\boldsymbol{\Pi}$.

    The low-level classes can be re-ordered by letting low-level cluster 1 be the one with the highest average probability of scoring a “1” on all items, cluster 2 the one with the second highest, and so on.

Note that the suggested rule to re-order low-level classes is only an example of a rule that is often (but not always) useful. If there are many items, or some items measure rare characteristics, the joint probability of scoring “1” on all of them might be so small as to be overwhelmed by sampling error or even by machine imprecision, which would effectively reintroduce label switching. In cases like these, we suggest implementing alternative re-ordering principles.
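The bookkeeping in steps (1) and (2) can be sketched as follows. The snippet is illustrative (the function name and the small smoothing constant are our own choices): it assumes unit-level K-modes labels and MAP low-level assignments from a single-level LCA are already available, and produces starting values for $\boldsymbol{\omega}$ and $\boldsymbol{\Pi}$ via the within-group majority vote and the cross-tabulation described above.

```python
import numpy as np

def init_from_assignments(W_dot, X_tilde, groups, M, T):
    """Turn hard unit-level assignments into starting values (omega0, Pi0).

    W_dot   : (N,) K-modes cluster labels in 0..M-1, from step (1)
    X_tilde : (N,) MAP low-level classes from a single-level LCA, step (2)
    groups  : (N,) group index j in 0..J-1
    """
    J = groups.max() + 1

    # majority vote within each group -> group-level label W~_j
    W_group = np.array([np.bincount(W_dot[groups == j], minlength=M).argmax()
                        for j in range(J)])
    omega0 = np.bincount(W_group, minlength=M) / J

    # broadcast the group label back to units and cross-tabulate with X~
    W_unit = W_group[groups]
    counts = np.zeros((M, T))
    np.add.at(counts, (W_unit, X_tilde), 1.0)
    counts += 1e-3     # tiny smoothing so empty cells remain valid probabilities
    Pi0 = counts / counts.sum(axis=1, keepdims=True)
    return omega0, Pi0
```

The smoothing constant only guards against empty cells in the cross-tabulation; any small positive value serves the same purpose.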

Running the EM algorithm to convergence from the above starting values, the solution with the highest log-likelihood (6) provides us with estimates $\hat{\boldsymbol{\omega}}, \hat{\boldsymbol{\Pi}}, \hat{\boldsymbol{\Phi}}$. Of these, $\hat{\boldsymbol{\omega}}$ and $\hat{\boldsymbol{\Pi}}$ are discarded and $\text{vec}(\hat{\boldsymbol{\Phi}}) = \hat{\boldsymbol{\theta}}_1$ are retained as the estimates of the measurement parameters $\boldsymbol{\theta}_1$ from this step 1.

3.2. Step 2 — Model for Class Membership

In the second step of estimation, the parameters $\boldsymbol{\theta}_2$ of the model for the latent classes in Eq. (5) are estimated, keeping the measurement parameters $\boldsymbol{\theta}_1$ fixed at their step-1 estimates $\hat{\boldsymbol{\theta}}_1$ (see Fig. 2). These step-2 estimates are obtained by maximizing the pseudo log-likelihood function

(16) $$\ell_{2}(\boldsymbol{\theta}_2 \vert \boldsymbol{\theta}_1 = \hat{\boldsymbol{\theta}}_1) = \sum_{j=1}^{J} \log P(\textbf{Y}_{j} \vert \textbf{Z}_{j})$$

with respect to $\boldsymbol{\theta}_2$. Here $\log P(\textbf{Y}_{j} \vert \textbf{Z}_{j})$ is given by Eq. (5), except that $\hat{\boldsymbol{\theta}}_1$ are regarded as fixed and known values rather than unknown parameters. The EM algorithm that we propose for this step works similarly to the one used for the first step. In particular, under the definition of the augmenting variables given in Sect. 3.1, the CDLL is given by

(17) $$\begin{aligned} \ell_{2}^{c} =\;& \sum_{j=1}^{J} \sum_{m=1}^{M} u_{j,m} \log(\omega_m) + \sum_{j=1}^{J} \sum_{i=1}^{n_j} \sum_{m=1}^{M} \sum_{t=1}^{T} v_{i,j,t,m} \log \left( \frac{\exp(\boldsymbol{\gamma}_{tm}^{\prime} \textbf{Z}_{ij})}{1+\sum_{s=2}^{T}\exp(\boldsymbol{\gamma}_{sm}^{\prime} \textbf{Z}_{ij})}\right) \\ &+ \sum_{j=1}^{J} \sum_{i=1}^{n_j} \sum_{m=1}^{M} \sum_{t=1}^{T} v_{i,j,t,m} \sum_{h=1}^{H} \left\{ Y_{ijh}\log(\hat{\phi}_{h \vert t}) + [1-Y_{ijh}]\log(1-\hat{\phi}_{h \vert t}) \right\}, \end{aligned}$$

where we have dropped the argument $(\boldsymbol{\theta}_2 \vert \boldsymbol{\theta}_1 = \hat{\boldsymbol{\theta}}_1)$ from $\ell_{2}^{c}$ for ease of notation. Note that the E step is analogous to that described in Sect. 3.1, except that now the low-level class probabilities conditional on high-level membership depend on covariates. In the M step, the expected CDLL, obtained by substituting the missing values with expectations computed via formulas analogous to (10) and (12), is maximized with respect to $\boldsymbol{\theta}_2$ only.
Whereas the update for $\boldsymbol{\omega}$ is given by (13), to derive the update for the regression coefficients note that $v_{i,j,t,m} = P(X_{ij} = t, W_j = m \vert \textbf{Y}_j)$ can be written as the product of $u_{j,m} = P(W_j = m \vert \textbf{Y}_j)$ and $q_{i,j,t \vert m} = P(X_{ij} = t \vert W_j, \textbf{Y}_j)$. Thus, estimates of $\boldsymbol{\Gamma}$ can be found by solving the equations

(18) $$\sum_{j=1}^{J} \sum_{i=1}^{n_j} \sum_{m=1}^{M} \sum_{t=1}^{T} \hat{u}_{j,m}\, \hat{q}_{i,j,t \vert m}\, \frac{\partial \log P(X_{ij} = t \vert W_j = m, \textbf{Z}_{ij})}{\partial\, \text{vec}(\boldsymbol{\Gamma})} = \textbf{0},$$

which are weighted sums of M equations, each with weights $\hat{q}_{i,j,t \vert m}$.
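In practice, (18) is the score of a weighted multinomial logistic regression and has no closed-form solution, so it is solved iteratively within each M-step. Below is a minimal gradient-ascent sketch under our own naming conventions (a production implementation would more typically use Newton-Raphson); the first low-level class is taken as the reference category, as in (17).

```python
import numpy as np

def update_gamma(Z, groups, u, q, Gamma, lr=0.05, iters=300):
    """Inner M-step update for the logit coefficients, a sketch of Eq. (18).

    Z      : (N, P) covariates, first column a constant 1 for the intercept
    groups : (N,) group index j in 0..J-1
    u      : (J, M) posterior high-level class probabilities u_{j,m}
    q      : (N, M, T) posterior low-level probabilities q_{i,j,t|m}
    Gamma  : (M, T, P) coefficients, Gamma[:, 0, :] fixed at 0 (reference class)
    """
    uw = u[groups]                                   # (N, M) unit-level weights
    for _ in range(iters):
        # multinomial-logit probabilities P(X = t | W = m, Z_ij)
        eta = np.einsum('np,mtp->nmt', Z, Gamma)
        eta -= eta.max(axis=2, keepdims=True)        # numerical stability
        p = np.exp(eta)
        p /= p.sum(axis=2, keepdims=True)            # (N, M, T)

        # score of the expected CDLL: sum_ij u_{j,m} (q - p) z_ij
        resid = uw[:, :, None] * (q - p)             # (N, M, T)
        grad = np.einsum('nmt,np->mtp', resid, Z)
        grad[:, 0, :] = 0.0                          # keep reference class fixed
        Gamma = Gamma + lr * grad / len(Z)
    return Gamma
```

Because each term of the objective is a concave weighted multinomial log-likelihood, small fixed-step gradient ascent converges toward the solution of (18); Newton steps simply accelerate this.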

Figure 2. Step 2 of the two-step estimation: estimating the structural model for low-level latent classes $X_{ij}$ given covariates $\textbf{Z}_{ij}$ and high-level latent classes $W_j$, keeping the measurement model parameters for items $Y_{ijh}$ fixed at their estimates from Step 1.

Stepwise estimation is well known to enhance algorithm stability and speed of convergence (Bakk & Kuha, 2018; Bartolucci et al., 2015; Di Mari & Maruotti, 2022; Skrondal & Kuha, 2012). However, class labels in models with multiple hidden layers can still be switched, and keeping the response probabilities fixed cannot prevent this, since there remain $M!$ possible permutations of the high-level class labels. We handle this issue by initializing $\boldsymbol{\omega}$ at its estimate from the first step, and by taking $\log \left( \pi_{t \vert m} / \pi_{1 \vert m} \right)$ to initialize the intercepts $\gamma_{0tm}$, for all $m=1,\dots,M$ and $t=2,\dots,T$. The other elements of $\boldsymbol{\Gamma}$ are initialized at zero.
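Assuming the step-1 estimate of the low-level class probabilities is stored as an $M \times T$ array, this intercept initialization is a one-line computation (an illustrative sketch; the array name is our own):

```python
import numpy as np

def init_intercepts(Pi_hat):
    """gamma_{0tm} = log(pi_{t|m} / pi_{1|m}) for t = 2,...,T (class 1 is baseline).

    Pi_hat : (M, T) step-1 estimates of pi_{t|m}; returns an (M, T-1) array.
    """
    return np.log(Pi_hat[:, 1:] / Pi_hat[:, :1])
```

Equal class shares within a high-level class yield a zero intercept, so the covariate-free step-2 model starts exactly at the step-1 solution.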

3.3. Selecting the Number of Latent Classes

The description of the two-step estimation procedure above takes the numbers of latent classes at both the lower and higher levels as given. The selection of these numbers is a separate exercise. It is normally carried out without covariates, and the selected numbers of classes are then held fixed when covariates are added. This is also in line with general recommendations for LCA with covariates (Masyn, 2017).

The selection of the numbers of classes could be considered as a joint exercise of both the high and low levels together, but a generally used recommendation is to use instead a hierarchical procedure which selects them one at a time (Lukociene et al., 2010). First, simple LC models are fitted at the lower level and the number of classes for it (T) is selected. Second, this number is held fixed, and multilevel LC models are fitted and compared to select the number of classes at the higher level (M). Third, the selected M is fixed, and model selection for the multilevel model is done again at the lower level, to obtain the final value of T. A still simpler approach would skip the third step (Vermunt, 2003), but including it allows us to check if the selected number of lower-level classes changes once the within-group associations induced by the high-level classes are allowed for.

This hierarchical approach can be used with any method of estimating the models. However, when combined with our two-step estimator, simultaneously selecting the number of classes at both levels is also feasible. In practice, this is possible by combining the above initialization strategy with parallel (multi-core) estimation over all plausible values of T and M.

The best candidate values of M and T can be selected with standard information criteria, such as AIC or BIC. For the final choice, we suggest balancing the use of information criteria with the evaluation of low- and high-level class separation and, perhaps most importantly, the substantive inspection of the candidate model configurations. For a wider discussion of this issue see, among others, Di Mari et al. (2022) and Magidson and Vermunt (2004). In the social sciences, one of the most commonly used measures of class separation is the entropy-based $R^2$ of Magidson (1981), which can be defined at both the lower and higher levels to judge class separation (see Di Mari et al., 2022; Lukociene et al., 2010).
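The entropy-based $R^2$ can be computed directly from a matrix of posterior class membership probabilities. The sketch below uses one common formulation, scaling total posterior entropy by its maximum value $N \log K$; exact definitions vary slightly across software packages, so treat it as illustrative.

```python
import numpy as np

def entropy_r2(post):
    """Entropy-based R^2 for class separation (one common formulation).

    post : (N, K) posterior class membership probabilities, rows summing to 1.
    Returns 1 for perfect (degenerate) classification, 0 for uniform posteriors.
    """
    N, K = post.shape
    p = np.clip(post, 1e-12, 1.0)        # guard log(0)
    total_entropy = -np.sum(p * np.log(p))
    return 1.0 - total_entropy / (N * np.log(K))
```

At the higher level the same function would be applied to the $J \times M$ matrix of $\hat{u}_{j,m}$, and at the lower level to the unit-level posterior low-level class probabilities.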

3.4. Statistical Properties of the Two-Step Estimator

Our two-step estimator is an instance of pseudo maximum likelihood estimation (Gong & Samaniego, 1981). Such estimators are consistent and asymptotically normally distributed under very general regularity conditions. The conditions and a proof of consistency can be found in Gourieroux and Monfort (1995, Sec. 24.2.4). Let the true parameter vector be $\boldsymbol{\theta}^{\star} = (\boldsymbol{\theta}_1^{\star\prime}, \boldsymbol{\theta}_2^{\star\prime})^{\prime}$. If the one-step ML estimator of $\boldsymbol{\theta}$ is itself consistent for $\boldsymbol{\theta}^{\star}$, then in order to prove consistency of our two-step estimator $\hat{\boldsymbol{\theta}}$ it suffices to show that (1) $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$ can vary independently of each other, and (2) $\hat{\boldsymbol{\theta}}_1$ is consistent for $\boldsymbol{\theta}_1^{\star}$. These conditions are satisfied in our case: (1) holds by construction of the model, and (2) holds since $\hat{\boldsymbol{\theta}}_1$ from step 1 is an ML estimate of the measurement model parameters of the multilevel LC model without covariates, and these parameters are taken to be the same as in the model with covariates.

Let $\ell(\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2})$ denote the joint log-likelihood function for the model, let $\overline{\textbf{s}}_{\boldsymbol{\theta}_2}(\boldsymbol{\theta}_1^{\star},\boldsymbol{\theta}_2^{\star})$ denote the mean score $N^{-1} \partial \ell(\boldsymbol{\theta}_1,\boldsymbol{\theta}_2) / \partial \boldsymbol{\theta}_2$ evaluated at $(\boldsymbol{\theta}_1^{\star},\boldsymbol{\theta}_2^{\star})$, where N denotes the overall sample size, and let

$$\mathcal{I}(\boldsymbol{\theta}^{\star}) = \begin{bmatrix} \mathcal{I}_{11} & \\ \mathcal{I}_{21} & \mathcal{I}_{22} \end{bmatrix},$$

be the Fisher information matrix. In addition, let us suppose that

$$N^{1/2} \begin{bmatrix} \hat{\boldsymbol{\theta}}_1 - \boldsymbol{\theta}_1^{\star} \\ \overline{\mathbf{s}}_{\boldsymbol{\theta}_2}(\boldsymbol{\theta}_1^{\star}, \boldsymbol{\theta}_2^{\star}) \end{bmatrix} \xrightarrow{d} \mathrm{N}\left( \mathbf{0}, \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \\ \boldsymbol{\Sigma}_{21} & \mathcal{I}_{22} \end{bmatrix} \right).$$

Then, using the results of Theorem 2.2 of Gong and Samaniego (1981) (see also Parke, 1986),

(19) $$N^{1/2} (\hat{\boldsymbol{\theta}}_2 - \boldsymbol{\theta}_2^{\star}) \xrightarrow{d} \mathrm{N}(\mathbf{0}, \boldsymbol{V}),$$

where $\hat{\boldsymbol{\theta}}_2$ is the proposed two-step estimator and

(20) $$\boldsymbol{V} = \underbrace{\mathcal{I}_{22}^{-1}}_{\equiv\, \boldsymbol{V}_2} + \underbrace{\mathcal{I}_{22}^{-1}\, \mathcal{I}_{21}\, \boldsymbol{\Sigma}_{11}\, \mathcal{I}_{21}'\, \mathcal{I}_{22}^{-1}}_{\equiv\, \boldsymbol{V}_1}.$$

Intuitively, $\boldsymbol{V}_2$ describes the variability in $\hat{\boldsymbol{\theta}}_2$ given the step-one estimates $\hat{\boldsymbol{\theta}}_1$, and $\boldsymbol{V}_1$ the additional variability arising from the fact that $\boldsymbol{\theta}_1$ is not known but estimated by $\hat{\boldsymbol{\theta}}_1$, which has its own sampling variability.
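As a quick numerical illustration of Equation (20), the following sketch (with made-up matrices, not values from any fitted model) checks that the step-one correction $\boldsymbol{V}_1$ is positive semi-definite, so the two-step variance can never fall below $\boldsymbol{V}_2$:

```python
import numpy as np

# Hypothetical values, for illustration only: 2 structural parameters (theta_2)
# and 2 measurement parameters (theta_1).
I22 = np.array([[2.0, 0.3],
                [0.3, 1.5]])          # information block for theta_2
I21 = np.array([[0.4, 0.1],
                [0.2, 0.5]])          # cross-information block
Sigma11 = np.array([[0.50, 0.05],
                    [0.05, 0.40]])    # asymptotic variance of step-1 estimates

V2 = np.linalg.inv(I22)               # variability given theta_1-hat
V1 = V2 @ I21 @ Sigma11 @ I21.T @ V2  # extra variability from step 1
V = V2 + V1                           # Equation (20)

# V1 is a "sandwich" around a positive semi-definite matrix, hence itself PSD:
assert np.all(np.linalg.eigvalsh(V1) >= -1e-12)
# Consequently each two-step variance is at least as large as its V2 part:
assert np.all(np.diag(V) >= np.diag(V2))
```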

Let $\mathbf{s}_{ij,\boldsymbol{\theta}_2}(\hat{\boldsymbol{\theta}}_1, \hat{\boldsymbol{\theta}}_2)$ be the contribution to the score of low-level unit $i$ in high-level group $j$, evaluated at the parameter estimates of the first and second steps respectively. To compute this score we use the well-known identity $\partial \ell(\boldsymbol{\theta}) / \partial \boldsymbol{\theta} = \partial Q / \partial \boldsymbol{\theta}$ (Oakes, 1999), where $Q = \mathbb{E}\left[ \ell^{c}(\boldsymbol{\theta}) \right]$. All of these quantities are available from the above EM algorithm without any extra effort.
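The Oakes identity can be verified numerically in a toy example. The sketch below (all values illustrative, and a simple two-component Gaussian mixture rather than the multilevel LC model) checks that the derivative of $Q$, i.e., the posterior-weighted complete-data score, matches a finite-difference derivative of the observed-data log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=50)        # some data
pi, mu2 = 0.4, 2.0             # fixed mixing weight and second-component mean

def norm_pdf(y, m):
    return np.exp(-0.5 * (y - m) ** 2) / np.sqrt(2 * np.pi)

def loglik(theta):
    # Observed-data log-likelihood; theta is the free first-component mean.
    return np.sum(np.log(pi * norm_pdf(y, theta) + (1 - pi) * norm_pdf(y, mu2)))

theta = 0.3
# E-step weights: posterior probabilities of component 1 at theta
num = pi * norm_pdf(y, theta)
w = num / (num + (1 - pi) * norm_pdf(y, mu2))
# dQ/dtheta' at theta' = theta: weighted Gaussian score
score_Q = np.sum(w * (y - theta))
# Central finite-difference derivative of the observed log-likelihood
eps = 1e-6
score_l = (loglik(theta + eps) - loglik(theta - eps)) / (2 * eps)
assert abs(score_Q - score_l) < 1e-4   # the two scores agree
```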
Therefore, $\mathcal{I}_{22}$ and $\mathcal{I}_{21}$ can be estimated respectively as

(21) $$\hat{\mathcal{I}}_{22} = N^{-1} \sum_{j=1}^{J} \sum_{i=1}^{n_j} \mathbf{s}_{ij,\boldsymbol{\theta}_2}(\hat{\boldsymbol{\theta}}_2)\, \mathbf{s}_{ij,\boldsymbol{\theta}_2}(\hat{\boldsymbol{\theta}}_2)'$$

and

(22) $$\hat{\mathcal{I}}_{21} = N^{-1} \sum_{j=1}^{J} \sum_{i=1}^{n_j} \mathbf{s}_{ij,\boldsymbol{\theta}_2}(\hat{\boldsymbol{\theta}}_1, \hat{\boldsymbol{\theta}}_2)\, \mathbf{s}_{ij,\boldsymbol{\theta}_1}(\hat{\boldsymbol{\theta}}_1, \hat{\boldsymbol{\theta}}_2)'.$$

An estimate $\hat{\boldsymbol{\Sigma}}_{11}$ can be obtained analogously by fitting model (2). We give details on the derivation of the desired quantities in the appendix.
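Equations (21) and (22) are plain averages of outer products of per-unit score contributions. A minimal sketch, using simulated placeholder scores in place of the EM by-products:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200          # total number of low-level units, summed over groups
p1, p2 = 3, 2    # dimensions of theta_1 (measurement) and theta_2 (structural)

# Hypothetical per-unit score contributions, stacked over all (i, j) pairs;
# in practice these come from the E-step quantities of the EM algorithm.
s1 = rng.normal(size=(N, p1))    # s_{ij, theta_1}
s2 = rng.normal(size=(N, p2))    # s_{ij, theta_2}

I22_hat = s2.T @ s2 / N          # Equation (21): outer products of theta_2 scores
I21_hat = s2.T @ s1 / N          # Equation (22): cross products with theta_1 scores

assert I22_hat.shape == (p2, p2) and I21_hat.shape == (p2, p1)
assert np.allclose(I22_hat, I22_hat.T)   # (21) is symmetric by construction
```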

Note that Equation (20) shows that the two-step estimator suffers a loss of efficiency with respect to the simultaneous ML estimator. This important theoretical and practical aspect will be investigated in the simulation study, although we expect the loss to be rather small, as very little information about $\boldsymbol{\theta}_2$ should be contained in $\mathbf{Y}$.

4. Simulation Study

4.1. Settings

We conduct a simulation study to investigate the finite-sample properties of the proposed two-step estimator. It is compared with the simultaneous (one-step) estimator and the two-stage estimator of Bakk et al. (2022) and Di Mari et al. (2022). One-step estimation is the statistical benchmark, and the two-step estimator is evaluated in terms of its statistical and computational performance relative to this benchmark. The target measures we use for the comparison are the bias, standard deviations, confidence interval coverage rates, and computation time of the stepwise estimators compared with those of the simultaneous estimator. We compute both absolute standard deviations, to assess the efficiency of our estimator, and standard deviations relative to the one-step method, to investigate potential loss of efficiency with respect to the benchmark. Class separation and sample size are well-known determinants of the finite-sample behavior of stepwise estimators for LCA (Bakk & Kuha, 2018; Vermunt, 2010). We considered all combinations of larger and smaller sample sizes at the higher level (30, 50, or 100 higher-level units) and the lower level (100 or 500 units per group), for a total of 6 sample size conditions. Data were generated from a multilevel LC model with 2 high-level classes and 3 low-level classes, with 10 binary indicators and one continuous covariate generated from a standard normal distribution.
The random slopes $\gamma_{2 \vert 1}$ and $\gamma_{3 \vert 1}$ were both set to $-0.25$, whereas $\gamma_{2 \vert 2}$ and $\gamma_{3 \vert 2}$ were both set to $0.25$, corresponding to a moderate magnitude on the logistic scale.
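The data-generating design just described can be sketched as follows. The slope values are those stated above; the intercepts, class proportions, and response-probability profiles are illustrative choices, and all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(2)
J, n_j, H = 30, 100, 10          # groups, units per group, binary items
omega = np.array([0.5, 0.5])     # high-level class proportions (illustrative)
# Intercepts (illustrative) and the slopes from the text, for low-level
# classes t = 2, 3 relative to t = 1, within each high-level class:
gamma0 = np.array([[-0.85, -1.38],   # intercepts, high-level class 1
                   [ 0.85,  1.38]])  # intercepts, high-level class 2
gamma1 = np.array([[-0.25, -0.25],   # slopes, high-level class 1
                   [ 0.25,  0.25]])  # slopes, high-level class 2
# Class-conditional response probabilities (0.8: moderate separation):
phi = np.array([[0.8] * 10,                  # class 1: high on all items
                [0.2] * 5 + [0.8] * 5,       # class 2: high on last five
                [0.2] * 10])                 # class 3: low on all items

Y, X, W = [], [], []
for j in range(J):
    w = rng.choice(2, p=omega)               # high-level class of group j
    x = rng.normal(size=n_j)                 # standard normal covariate
    eta = np.column_stack([np.zeros(n_j),    # linear predictors; t = 1 is ref.
                           gamma0[w, 0] + gamma1[w, 0] * x,
                           gamma0[w, 1] + gamma1[w, 1] * x])
    p = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)
    t = np.array([rng.choice(3, p=pi) for pi in p])   # low-level classes
    Y.append(rng.binomial(1, phi[t]))        # binary item responses
    X.append(x); W.append(w)

Y = np.vstack(Y)
assert Y.shape == (J * n_j, H)
```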

Table 1. 24 simulation conditions.

LL stands for Low-Level, HL stands for High-Level.

In multilevel LC models, separation plays a role at both the low and the high level (Lukociene et al., 2010). We manipulate low-level class separation by setting the response probabilities for the most likely responses to 0.7, 0.8, or 0.9, corresponding respectively to low, moderate, and large class separation. We remark that the low class separation condition can be considered an extreme scenario, in which LCA is hardly ever carried out in practice; we nevertheless include it as a benchmarking condition. Class profiles are such that the first class has a high probability of scoring 1 on all items, the second class of scoring 1 on the last five items and 0 on the first five, and the third class of scoring 0 on all items. At the high level, in the model for W, we manipulate class separation by altering the random intercept magnitudes, which are both relatively close to zero in the moderate separation case ($-0.85$, $-1.38$ and $0.85$, $1.38$) and further away from zero in the large separation case ($-1.38$, $-2.07$ and $1.38$, $2.07$). These simulation conditions are in line with previous studies on multilevel LCA (Lukociene et al., 2010; Park & Yu, 2018).
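To see how these intercept magnitudes translate into low-level class proportions (evaluated at a covariate value of zero), a small sketch; the helper function name is ours:

```python
import numpy as np

def class_probs(gammas):
    """Multinomial-logit intercepts (classes 2, ..., T vs. class 1) -> probabilities."""
    eta = np.concatenate([[0.0], gammas])    # reference class has logit 0
    return np.exp(eta) / np.exp(eta).sum()

moderate = class_probs(np.array([-0.85, -1.38]))  # high-level class 1, moderate
large = class_probs(np.array([-1.38, -2.07]))     # high-level class 1, large

# More negative intercepts concentrate mass on the reference class, so the
# low-level class mixes differ more sharply between high-level classes:
assert large[0] > moderate[0]
assert np.isclose(moderate.sum(), 1.0) and np.isclose(large.sum(), 1.0)
```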

We generated 500 samples for each of the 36 crossed simulation conditions of low-level and high-level sample size and low-level and high-level class separation (see Table 1). Data generation and model estimation were carried out in R (Venables et al., 2013), with the integration of C++ code for computational efficiency (Eddelbuettel & François, 2011).

4.2. Results

All estimators show very similar values for bias (see Figs. 3a, b), and both the two-stage and two-step estimators yield nearly identical results to the simultaneous estimator. Relative efficiency with respect to the simultaneous estimator (Table 8 in the appendix) is approximately one for both stepwise estimators in all conditions, with the two-stage estimator doing very slightly worse in only one condition. Confidence interval coverages (Fig. 4) are mostly very similar across the three estimators. We observe some undercoverage for all methods in the low-separation and small high-level sample size conditions. This may be due to the use of expected rather than observed information matrices in estimating the asymptotic variance-covariance matrix, to the score contributions being computed on high-level units, and to the overlap between classes.

The different estimators thus perform essentially identically. Where they differ is in their computational demands. Considering computation time relative to the simultaneous estimator (Fig. 5), we find that both stepwise estimators are always faster than the simultaneous estimator (by up to a factor of four), and the two-step estimator achieves this with one fewer step than the existing two-stage competitor.

Figure 3. Line graphs of estimated bias for the one-step, two-step, and two-stage estimators, for the 36 simulation conditions, averaged over the 500 replicates. Error bars are based on mean bias ± Monte Carlo standard deviations.

Figure 4. Observed coverage rates of 95% confidence intervals, averaged over covariate effects, for the one-step, two-stage and two-step estimators for the 36 simulation conditions, averaged over the 500 replicates. Lower and upper values in the confidence bars are based on the minimum and maximum coverage of the confidence intervals for each covariate effect.

Figure 5. Relative computation time for the one-step, two-stage and two-step estimators for the 24 simulation conditions, averaged over the 500 replicates. The one-step estimator's estimation time is taken as the reference. Confidence bands are based on average values ± their Monte Carlo standard deviations.

5. Analysis of Cross-National Citizenship Norms with Multilevel LCA

In this empirical example, we analyze citizenship norms in a diverse set of countries. The data are taken from the International Civic and Citizenship Education Study (ICCS) conducted by the International Association for the Evaluation of Educational Achievement (IEA). Prior research has used LCA to analyze the first two waves of this survey, conducted in 1999 and 2009, to investigate distinctive types of citizenship norms (Hooghe & Oser, 2015; Hooghe et al., 2016; Oser & Hooghe, 2013). We focus on the most recent round of the survey, from 2016 (Köhler et al., 2018). The data are from a survey of students in their eighth year of schooling. We have data from between 1300 and 7000 respondents in each of 24 countries, as shown in Table 2.

The respondents answered 12 questions (items) on how important they think different behaviours are for “being a good adult citizen”. These behaviours were always obeying the law (labelled obey below), taking part in activities promoting human rights (rights), participating in activities to benefit people in the local community (local), working hard (work), taking part in activities to protect the environment (envir), voting in every national election (vote), learning about the country's history (history), showing respect for government representatives (respect), following political issues in the newspaper, on the radio, on TV or on the Internet (news), participating in peaceful protests against laws believed to be unjust (protest), engaging in political discussions (discuss), and joining a political party (party).

We treat these twelve items as indicators of the individuals' perceptions of the duties of a citizen (citizenship norms). The data have a multilevel structure, with individuals as the low-level units and countries as the high-level units. As predictors of low-level latent class membership, we include the respondent's gender, socio-economic status (operationalised by the proxy measure of the number of books in their home), educational expectations, parental education, and whether he/she is a non-native language speaker. For details on data cleaning and recoding, see Oser et al. (2023).

To compare with previous work on the same data, we fit a multilevel LC model with $T = 4$ low-level classes (of individuals within countries) and $M = 3$ high-level classes (of countries). The same data set was analyzed in Di Mari et al. (2022) with a multilevel LC model with random intercepts, estimated with a two-stage estimator. We extend Di Mari et al. (2022)'s model specification by allowing for both random intercepts and random slopes, and we fit the model with the proposed two-step estimator. As the two-step estimator has been shown to be computationally more efficient than the two-stage estimator with equal statistical performance, for the comparison we include only the benchmark simultaneous estimator.

Table 2. Number of respondents per country of the third wave (2016) of the IEA survey used for the analysis.

Table 3. Summary statistics for the measurement model.

The measurement model, at both levels, presents very well separated classes (Table 3). At the lower level, the four latent classes are characterised by their conditional response probability patterns, as shown in Fig. 6. Two classes present response configurations relating to two relevant and well-known notions of citizenship norms. First, a “Duty” group, which places high importance on the act of voting, discussing politics, and party activity, while manifesting relatively low interest in protecting human rights and activities to assist the local community. Second, an “Engaged” group, which places greater emphasis on engaged attitudes such as protecting the environment, and less importance on more traditional citizenship activity items such as membership of political parties. In addition, we observe two classes with consistently high and consistently low probabilities of assigning importance to all of the behaviours, here labelled the “Maximal” and the “Subject” classes respectively.

Figure 6. Measurement model at the lower (individual) level: line graph of the class-conditional response probabilities.

At the higher level, the estimated model includes three latent classes for the countries, labelled below as HL1, HL2 and HL3. Considering first the conditional probabilities for the four individual-level classes given these country-level classes (see Table 4), we can see that HL1 has clearly the highest conditional probability for the individual “Duty” class, HL2 for the “Maximal” class and HL3 for the “Engaged”. The country classes do not differ in probabilities of the passive “Subject” class of individuals, which are in any case consistently low. Table 5 shows the assignment of countries to the classes, when the assignment is done based on highest posterior probabilities given the survey responses in the countries. Here there are no very clear patterns. Only two countries (Denmark and Netherlands) are assigned to HL1, while the other two classes each include a fairly heterogeneous subset of the rest of the countries.

Table 4. Estimated proportions of low-level (individual-level) classes conditional on high-level (country-level) class membership.

Table 5. Assignment of countries to the high-level classes, based on the maximum a posteriori (MAP) classification rule. $M = 3$.

Table 6 presents estimates of the parameters of main interest in the analysis, the coefficients of the structural model for the lower-level classes given individual-level covariates, separately within each of the higher-level classes. We note first that the one-step and two-step estimates and their standard errors are very similar, as would be expected given the previous simulation results.

Considering the coefficients themselves, note that they compare each of the other classes to the “Maximal” class, for whom all of the behaviours are to a greater or lesser extent considered important to good citizenship. Compared to this class, the relative probability of the (overall quite small) “Subject” class, for whom none of the behaviours are important, is higher for individuals who are boys, speak the native language at home, have fewer books at home, and have low educational aspirations. The probabilities of the “Engaged” class, who are partly similar to “Maximal” but place less importance on many of the traditional political activities, are relatively higher for girls, those with a larger number of books at home, and native speakers. For the “Duty” class, which differs from the “Engaged” in placing much less importance on direct activism, the probabilities relative to “Maximal” are higher for boys and those with low educational aspirations. For comparisons of other pairs of classes, these estimates also imply, for example, that the probabilities of “Engaged” relative to “Duty” are generally higher for girls than for boys. These patterns of coefficients are broadly similar in each of the country classes, with some variation in detail.

Table 6. Estimated coefficients of structural models, i.e. multinomial logistic models for membership of the four individual-level latent classes conditional on covariates, separately within each of the three country-level latent classes (HL1, HL2 and HL3).

The “Maximal” class is taken as the reference level for the response class. The number of books available in the respondent’s home is treated as a proxy for the respondent’s socio–economic status. Both simultaneous (one-step) and the proposed two-step estimators of the same parameters are shown, with standard errors in parentheses.

***p-value<0.01, **p-value<0.05, *p-value<0.1.

Finally, we report the CPU time of estimation and the number of iterations until convergence for the two approaches (Table 7). In this real-data example, the two-step estimator takes only about 22 s and 26 EM iterations to reach convergence. The one-step estimator requires 261 iterations and a running time of around 4.5 min. Each iteration takes about 0.93 s for the one-step estimator and 0.85 s for the two-step estimator, which also needs far fewer EM iterations overall.

Table 7. CPU time for estimation in seconds, and number of iterations until convergence, for the one-step and two-step estimators.

6. Discussion

In this paper we proposed a two-step estimator for the multilevel latent class model with covariates. It concentrates the estimation of the measurement model in a single first step. In the second step, covariates are added to the model, keeping the measurement model parameters fixed. The approach represents a simplification over the recently proposed two-stage estimator (Bakk et al., 2022) by having only two steps overall instead of multiple sub-steps for estimating the measurement model.

We discussed model identification of the unconditional model, derived an Expectation-Maximization algorithm for efficient estimation in both steps, and presented second-step asymptotic standard errors that account for the variability of the first step. The simplified two-step procedure makes it possible to apply the standard theory of Gong and Samaniego (1981) for obtaining unbiased standard errors, a further improvement over the two-stage estimator. An effective initialization strategy, using (dissimilarity-based) cluster analysis, was also proposed.

In the simulation study, we observed that the performance of the proposed estimator in terms of bias is very similar to that of the benchmark simultaneous (full-information ML) estimator, and similar to that of the two-stage estimator, with nearly no efficiency loss. The two-step estimator was up to 4 times faster than the simultaneous estimator. It should be mentioned that, in conditions where the entropy of the LC model is low, all estimators show relatively higher variability and bias, a finding in line with previous research on stepwise estimators for single-level LC models (Vermunt, 2010).

In the real data example, we found interesting lower- and higher-level class configurations, consistent with the existing literature on citizenship norms (see, e.g., Oser et al., 2022). The structural model allows us to investigate the associations between covariates and the latent classes, including the possibility of group-level heterogeneous effects of covariates on lower-level class membership. In addition, we found a considerable difference in CPU running time between the one-step and two-step estimators, even larger than what we observed in the more controlled simulation environment: whereas the former required 4.5 min to reach convergence, the latter needed only 22 s. From an applied user's perspective, such a gain can be substantial on a larger scale. As an example, consider a data set with larger low- and high-level sample sizes: if simultaneous estimation took 2 h, our two-step estimator would produce final estimates in roughly 12 min. Based on the existing literature on two-step estimators (see, e.g., Di Mari & Maruotti, 2022), we expect this gap to grow with model complexity, i.e., the number of lower- and higher-level classes and/or available predictors. The time difference is also multiplied if models are estimated repeatedly, for example when different sets of covariates or different numbers of latent classes are explored.

There are some issues that deserve future research. First, while we describe two possible approaches for class selection in Sect. 3.3, this is not the main focus of the current work; further research should investigate class selection using the different estimators. Second, we have proposed estimates of the asymptotic variance-covariance matrix based on the outer product of the score. Deriving Hessian- and/or sandwich-based (White, 1982) standard errors, e.g., for small high-level sample sizes and complex sampling scenarios, is an interesting topic for future work. Third, we have discussed the multimodality of the likelihood surface, a long-standing and well-known feature of mixture models in general. The properties of the EM algorithm, such as monotonicity and global convergence, have been studied extensively over the years (see, e.g., Redner & Walker, 1984). The EM algorithm has several advantages, including low cost per iteration, economy of storage, and ease of programming. However, in practice, due to multimodality, convergence to a global or a local optimum depends on the choice of starting point (Wu, 1983). To date, there is no systematic study, either theoretical or simulation-based, of the behavior of the EM algorithm with two-step estimators. We speculate that, since the second step operates in a lower-dimensional space than simultaneous estimation, two-step estimators should somewhat mitigate the initialization problem. This point, not being the focus of the current work, certainly deserves specialized attention; for this and related matters, we defer to future research.

Acknowledgements

The Authors thank Johan Lyrvall for his valuable comments on the manuscript. Di Mari acknowledges financial support from a University of Catania grant (Starting Grant FIRE, PIACERI 2020/2022), and Oser by a European Union grant (ERC, PRD, project number 101077659).

Funding Information

Open access funding provided by Università degli Studi di Catania within the CRUI-CARE Agreement.

A. Appendix A: Computation of the Score Vector for the Multilevel Latent Class Model

A.1. The unconditional multilevel LC model (first step)

Let us reparametrize the unconditional multilevel LC model of Equation (2) according to the following log-linear equations:

(23)
$$\begin{cases}
\log\left[\dfrac{\phi_{h \vert t}}{1-\phi_{h \vert t}}\right] = \beta_{h \vert t},\\[6pt]
\log\left[\dfrac{\omega_m}{\omega_1}\right] = \alpha_m,\\[6pt]
\log\left[\dfrac{\pi_{t \vert m}}{\pi_{1 \vert m}}\right] = \gamma_{t \vert m}.
\end{cases}$$
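To make the parametrization in (23) concrete, the following Python sketch (with illustrative parameter values, not taken from the paper) maps the free log-linear parameters back to probabilities, treating the first category as the baseline with its parameter fixed at zero:

```python
import numpy as np

def softmax_baseline(params):
    """Map free log-linear parameters (baseline category fixed at 0) to probabilities."""
    logits = np.concatenate(([0.0], np.asarray(params, dtype=float)))
    expl = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return expl / expl.sum()

def inv_logit(beta):
    """Invert the logit in the first line of (23) to a conditional response probability."""
    return 1.0 / (1.0 + np.exp(-beta))

# Illustrative values: M = 3 high-level classes, so alpha_2 and alpha_3 are free
alpha = [0.5, -1.0]                # alpha_1 = 0 by convention
omega = softmax_baseline(alpha)    # class proportions omega_1, ..., omega_M
```

Inverting back, `np.log(omega[1:] / omega[0])` recovers the original `alpha`, which is exactly the second line of (23).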

In addition, let us conveniently rewrite (9) as follows

(24)
$$Q(\boldsymbol{\alpha}, \boldsymbol{\Gamma}, \boldsymbol{B}) = Q(\boldsymbol{\alpha}) + Q(\boldsymbol{\Gamma}) + Q(\boldsymbol{B}),$$

where $\boldsymbol{\alpha} = (\alpha_2, \dots, \alpha_M)'$, $\boldsymbol{\Gamma}$ is a $(T-1) \times M$ matrix with elements $\gamma_{t \vert m}$, for $m = 1, \dots, M$ and $t = 2, \dots, T$, $\boldsymbol{B}$ is an $H \times T$ matrix with elements $\beta_{h \vert t}$, for $t = 1, \dots, T$ and $h = 1, \dots, H$, and

(25)
$$Q(\boldsymbol{\alpha}) = \sum_{j=1}^{J} \sum_{m=1}^{M} \hat{u}_{j,m} \log(\omega_m)$$

(26)
$$Q(\boldsymbol{\Gamma}) = \sum_{j=1}^{J} \sum_{i=1}^{n_j} \sum_{m=1}^{M} \sum_{t=1}^{T} \hat{v}_{i,j,t,m} \log(\pi_{t \vert m})$$

(27)
$$Q(\boldsymbol{B}) = \sum_{j=1}^{J} \sum_{i=1}^{n_j} \sum_{m=1}^{M} \sum_{t=1}^{T} \hat{v}_{i,j,t,m} \left\{ Y_{ijh} \log(\phi_{h \vert t}) + [1 - Y_{ijh}] \log(1 - \phi_{h \vert t}) \right\}.$$
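Each block of (24) can be maximized separately in the M-step. As a small numerical illustration (with hypothetical posteriors, not the paper's data), the maximizer of $Q(\boldsymbol{\alpha})$ in (25) is simply the average of the group posteriors $\hat{u}_{j,m}$:

```python
import numpy as np

rng = np.random.default_rng(0)
J, M = 6, 3                                # illustrative sizes

# Hypothetical group-level posteriors u_hat[j, m]; each row sums to 1
u_hat = rng.dirichlet(np.ones(M), size=J)

def q_alpha(omega, u_hat):
    """Q(alpha) of (25), evaluated in terms of the probabilities omega_m."""
    return float(np.sum(u_hat * np.log(omega)))

# Closed-form M-step update: average the posteriors over the J groups
omega_hat = u_hat.mean(axis=0)
```

Any other probability vector gives a lower value of (25), which can be checked directly by evaluating `q_alpha` at alternative points.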

Recalling that $\partial \ell(\boldsymbol{\theta}) / \partial \boldsymbol{\theta}' = \partial Q / \partial \boldsymbol{\theta}'$, the $ij$-th contribution to the score has the following three blocks, with generic elements

(28)
$$\mathbf{s}_{ij,\boldsymbol{\theta}_1}(\hat{\alpha}_m) = \hat{u}_{j,m} - \omega_m$$

(29)
$$\mathbf{s}_{ij,\boldsymbol{\theta}_1}(\hat{\gamma}_{t \vert m}) = (\hat{q}_{i,j,t \vert m} - \pi_{t \vert m}) \, \hat{u}_{j,m}$$

(30)
$$\mathbf{s}_{ij,\boldsymbol{\theta}_1}(\hat{\beta}_{h \vert t}) = \sum_{m=1}^{M} \hat{v}_{i,j,t,m} (Y_{ijh} - \phi_{h \vert t}).$$

Thus, an estimate of $\boldsymbol{\Sigma}_{11}$ can be obtained as follows:

(31)
$$\hat{\boldsymbol{\Sigma}}_{11} = N^{-1} \sum_{j=1}^{J} \sum_{i=1}^{n_j} \mathbf{s}_{ij}(\hat{\boldsymbol{\alpha}}, \hat{\boldsymbol{\Gamma}}, \hat{\boldsymbol{B}}) \, \mathbf{s}_{ij}(\hat{\boldsymbol{\alpha}}, \hat{\boldsymbol{\Gamma}}, \hat{\boldsymbol{B}})'.$$
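As a sketch of how (28) and (31) combine in practice (hypothetical posteriors; only the $\boldsymbol{\alpha}$ block is shown), the score contributions are stacked over observations and their outer products averaged. With balanced groups and $\omega_m$ set to the group average of $\hat{u}_{j,m}$, the score contributions sum to zero, as expected at the optimum:

```python
import numpy as np

rng = np.random.default_rng(1)
J, n_j, M = 4, 5, 3                       # illustrative sizes (balanced groups)
N = J * n_j

# Hypothetical group posteriors u_hat[j, m]; rows sum to 1
u_hat = rng.dirichlet(np.ones(M), size=J)
omega = u_hat.mean(axis=0)                # M-step value of omega under balanced n_j

# Score contributions (28) for the free categories m = 2, ..., M:
# every unit i in group j carries the same group-level term u_hat[j, m] - omega_m
s_alpha = np.repeat(u_hat[:, 1:] - omega[1:], n_j, axis=0)   # shape (N, M - 1)

# Outer-product estimate (31), restricted here to the alpha block
Sigma_11_alpha = s_alpha.T @ s_alpha / N
```

The full estimator stacks the blocks (28)–(30) into one vector per observation before taking outer products; the restriction to one block here is purely for brevity.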

A.2. The multilevel LC model with covariates (second step)

Let us define $\pi^{ij}_{t \vert m} = \dfrac{\exp(\boldsymbol{\gamma}_{tm}' \mathbf{Z}_{ij})}{1 + \sum_{s=2}^{T} \exp(\boldsymbol{\gamma}_{sm}' \mathbf{Z}_{ij})}$. The $Q$ function of Equation (17) can be rewritten under the log-linear parametrizations introduced above, except for the second block, which is as follows:

(32)
$$Q(\boldsymbol{\Gamma}) = \sum_{j=1}^{J} \sum_{i=1}^{n_j} \sum_{m=1}^{M} \sum_{t=1}^{T} \hat{v}_{i,j,t,m} \log(\pi^{ij}_{t \vert m}).$$

The second block of the $ij$-th contribution to the score has $K+1$ generic contributions:

(33)
$$\mathbf{s}_{ij,\boldsymbol{\theta}_2}(\hat{\boldsymbol{\gamma}}_{tm}) = \hat{u}_{j,m} (\hat{q}_{i,j,t \vert m} - \pi^{ij}_{t \vert m}) \, \mathbf{z}_{ij}.$$
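For the second step, the following sketch (hypothetical parameter values and posteriors, not from the paper) evaluates the covariate-dependent class probabilities $\pi^{ij}_{t \vert m}$ and the score contribution (33) for one unit and one high-level class:

```python
import numpy as np

rng = np.random.default_rng(2)
T, K = 3, 2                               # illustrative: T lower-level classes, K covariates

# Covariate vector with a leading 1 for the intercept
z_ij = np.concatenate(([1.0], rng.normal(size=K)))

# Hypothetical gamma_{tm} for one high-level class m (rows t = 2, ..., T; class 1 is baseline)
Gamma_m = rng.normal(size=(T - 1, K + 1))

# pi^{ij}_{t|m}: multinomial logit with the first class as the reference category
logits = np.concatenate(([0.0], Gamma_m @ z_ij))
pi_ij = np.exp(logits - logits.max())
pi_ij /= pi_ij.sum()

# Score contribution (33), given hypothetical posteriors u_hat_{j,m} and q_hat_{i,j,t|m}
u_jm = 0.7
q_hat = rng.dirichlet(np.ones(T))
score = u_jm * (q_hat[1:] - pi_ij[1:])[:, None] * z_ij[None, :]   # (T - 1) x (K + 1)
```

The resulting matrix stacks, for each free class $t = 2, \dots, T$, the $K+1$ elements of (33) corresponding to the intercept and the covariates.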

B. Extra Tables and Figures

Table 8. Average relative efficiency of the two-step and two-stage estimators relative to the one-step estimator (SD over benchmark one-step SD), averaged over covariate effects.

Footnotes

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Agresti, A., Booth, J. G., Hobert, J. P., & Caffo, B. (2000). Random-effects modeling of categorical response data. Sociological Methodology, 30(1), 27–80.
Allman, E. S., Matias, C., & Rhodes, J. A. (2009). Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37(6A), 3099–3132.
Asparouhov, T., & Muthén, B. (2014). Auxiliary variables in mixture modeling: Three-step approaches using Mplus. Structural Equation Modeling, 21(3), 329–341.
Bakk, Z., Di Mari, R., Oser, J., & Kuha, J. (2022). Two-stage multilevel latent class analysis with covariates in the presence of direct effects. Structural Equation Modeling: A Multidisciplinary Journal, 29(2), 267–277.
Bakk, Z., & Kuha, J. (2018). Two-step estimation of models between latent classes and external variables. Psychometrika, 83, 871–892.
Bartolucci, F., Montanari, G. E., & Pandolfi, S. (2015). Three-step estimation of latent Markov models with covariates. Computational Statistics & Data Analysis, 83, 287–301.
Biernacki, C., Celeux, G., & Govaert, G. (2003). Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics & Data Analysis, 41(3–4), 561–575.
Catania, L., & Di Mari, R. (2021). Hierarchical Markov-switching models for multivariate integer-valued time-series. Journal of Econometrics, 221(1), 118–137.
Catania, L., Di Mari, R., & Santucci de Magistris, P. (2022). Dynamic discrete mixtures for high-frequency prices. Journal of Business & Economic Statistics, 40(2), 559–577.
Da Costa, L. P., & Dias, J. G. (2015). What do Europeans believe to be the causes of poverty? A multilevel analysis of heterogeneity within and between countries. Social Indicators Research, 122(1), 1–20.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–22.
Di Mari, R., Bakk, Z., Oser, J., & Kuha, J. (2022). Multilevel latent class analysis with covariates: Analysis of cross-national citizenship norms with a two-stage approach. Under review.
Di Mari, R., Bakk, Z., & Punzo, A. (2020). A random-covariate approach for distal outcome prediction with latent class analysis. Structural Equation Modeling: A Multidisciplinary Journal, 27(3), 351–368.
Di Mari, R., & Maruotti, A. (2022). A two-step estimator for generalized linear models for longitudinal data with time-varying measurement error. Advances in Data Analysis and Classification, 16, 273–300.
Di Zio, M., Guarnera, U., & Rocci, R. (2007). A mixture of mixture models for a classification problem: The unity measure error. Computational Statistics & Data Analysis, 51(5), 2573–2585.
Eddelbuettel, D., & François, R. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8), 1–18.
Fagginger Auer, M. F., Hickendorff, M., Van Putten, C. M., Bèguin, A. A., & Heiser, W. J. (2016). Multilevel latent class analysis for large-scale educational assessment data: Exploring the relation between the curriculum and students’ mathematical strategies. Applied Measurement in Education, 29, 144–159.
Gassiat, É., Cleynen, A., & Robin, S. (2016). Inference in finite state space nonparametric hidden Markov models and applications. Statistics and Computing, 26(1–2), 61–71.
Gnaldi, M., Bacci, S., & Bartolucci, F. (2016). A multilevel finite mixture item response model to cluster examinees and schools. Advances in Data Analysis and Classification, 10, 53–70.
Gong, G., & Samaniego, F. J. (1981). Pseudo maximum likelihood estimation: Theory and applications. The Annals of Statistics, 9, 861–869.
Goodman, L. A. (1974). The analysis of systems of qualitative variables when some of the variables are unobservable. Part I: A modified latent structure approach. American Journal of Sociology, 79, 1179–1259.
Gourieroux, C., & Monfort, A. (1995). Statistics and econometric models (Vol. 1). Cambridge University Press.
Grilli, L., Marino, M. F., Paccagnella, O., & Rampichini, C. (2022). Multiple imputation and selection of ordinal level 2 predictors in multilevel models: An analysis of the relationship between student ratings and teacher practices and attitudes. Statistical Modelling, 22(3), 221–238.
Grilli, L., Pennoni, F., Rampichini, C., & Romeo, I. (2016). Exploiting TIMSS and PIRLS combined data: Multivariate multilevel modelling of student achievement. The Annals of Applied Statistics, 10(4), 2405–2426.
Grilli, L., & Rampichini, C. (2011). The role of sample cluster means in multilevel models: A view on endogeneity and measurement error issues. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 7(4), 121.
Hagenaars, J. A. (1990). Categorical longitudinal data: Loglinear analysis of panel, trend and cohort data. Sage.
Hooghe, M., & Oser, J. (2015). The rise of engaged citizenship: The evolution of citizenship norms among adolescents in 21 countries between 1999 and 2009. International Journal of Comparative Sociology, 56(1), 29–52.
Hooghe, M., Oser, J., & Marien, S. (2016). A comparative analysis of “good citizenship”: A latent class analysis of adolescents’ citizenship norms in 38 countries. International Political Science Review, 37(1), 115–129.
Horn, M. L. V., Fagan, A. A., Jaki, T., Brown, E. C., Hawkins, J. D., Arthur, M. W., & Catalano, R. F. (2008). Using multilevel mixtures to evaluate intervention effects in group randomized trials. Multivariate Behavioral Research, 43(2), 289–326.
Huang, G. H., & Bandeen-Roche, K. (2004). Building an identifiable latent class model with covariate effects on underlying and measured variables. Psychometrika, 69(1), 5–32.
Huang, Z. (1997). A fast clustering algorithm to cluster very large categorical data sets in data mining. In Lu, H. M. H., & Luu, H. (Eds.), KDD: Techniques and applications (pp. 21–34). World Scientific.
Köhler, H., Weber, S., Brese, F., Schulz, W., & Carstens, R. (2018). ICCS 2016 user guide for the international database: IEA International Civic and Citizenship Education Study 2016. Amsterdam: The International Association for the Evaluation of Educational Achievement (IEA).
Lukociene, O., Varriale, R., & Vermunt, J. (2010). The simultaneous decision(s) about the number of lower- and higher-level classes in multilevel latent class analysis. Sociological Methodology, 40(1), 247–283.
Lyrvall, J., Di Mari, R., Bakk, Z., Oser, J., & Kuha, J. (2023). multilevLCA: An R package for single-level and multilevel latent class analysis with covariates. arXiv preprint arXiv:2305.07276.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. Le Cam & J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (pp. 281–297). University of California Press.
Magidson, J. (1981). Qualitative variance, entropy, and correlation ratios for nominal dependent variables. Social Science Research, 10, 177–194.
Magidson, J., & Vermunt, J. (2004). Latent class models. In Kaplan, D. (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 175–198). Sage.
Maruotti, A., & Punzo, A. (2021). Initialization of hidden Markov and semi-Markov models: A critical evaluation of several strategies. International Statistical Review, 89(3), 447–480.
Masyn, K. E. (2017). Measurement invariance and differential item functioning in latent class analysis with stepwise multiple indicator multiple cause modeling. Structural Equation Modeling: A Multidisciplinary Journal, 24(2), 180–197.
McCutcheon, A. L. (1987). Latent class analysis. Sage.
Morselli, D., & Glaeser, S. (2018). Economic conditions and social trust climates in Europe over ten years: An ecological analysis of change. Journal of Trust Research, 8(1), 68–86.
Mutz, R., & Daniel, H. (2013). University and student segmentation: Multilevel latent-class analysis of students’ attitudes towards research methods and statistics. British Journal of Educational Psychology, 83(2), 280–304.
Oakes, D. (1999). Direct calculation of the information matrix via the EM. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(2), 479–482.
Oberski, D. L., & Satorra, A. (2013). Measurement error models with uncertainty about the error variance. Structural Equation Modeling, 20, 409–428.
Oser, J., Di Mari, R., & Bakk, Z. (2023). Data preparation for citizenship norm analysis, International Association for the Evaluation of Educational Achievement (IEA) 1999–2009–2016. Open Science Framework. https://doi.org/10.17605/OSF.IO/AKS42
Oser, J., & Hooghe, M. (2013). The evolution of citizenship norms among Scandinavian adolescents, 1999–2009. Scandinavian Political Studies, 36(4), 320–346.
Oser, J., Hooghe, M., Bakk, Z., & Di Mari, R. (2022). Changing citizenship norms among adolescents, 1999–2009–2016: A two-step latent class approach with measurement equivalence testing. Quality & Quantity, 1–19.
Ouyang, J., & Xu, G. (2022). Identifiability of latent class models with covariates. Psychometrika, 87(4), 1343–1360.
Paccagnella, O., & Varriale, R. (2013). Asset ownership of the elderly across Europe: A multilevel latent class analysis to segment countries and households. In Torelli, N., Pesarin, F., & Bar-Hen, A. (Eds.), Advances in theoretical and applied statistics (pp. 383–393). Berlin, Heidelberg: Springer.
Park, J., & Yu, H. T. (2018). Recommendations on the sample sizes for multilevel latent class models. Educational and Psychological Measurement, 78(5), 737–761.
Parke, W. R. (1986). Pseudo maximum likelihood estimation: The asymptotic distribution. The Annals of Statistics, 14, 355–357.
Redner, R. A., & Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2), 195–239.
Rindskopf, D. (2006). Heavy alcohol use in the “fighting back” survey sample: Separating individual and community level influences using multilevel latent class analysis. Journal of Drug Issues, 36(2), 441–462.
Ruelens, A., & Nicaise, I. (2020). Investigating a typology of trust orientations towards national and European institutions: A person-centered approach. Social Science Research, 87, 102414.
Skrondal, A., & Kuha, J. (2012). Improved regression calibration. Psychometrika, 77(4), 649–669.
Tomczyk, S., Hanewinkel, R., & Isensee, B. (2015). Multiple substance use patterns in adolescents: A multilevel latent class analysis. Drug and Alcohol Dependence, 155, 208–214.
Venables, W. N., Smith, D. M., & the R Core Team. (2013). An introduction to R: Notes on R, a programming environment for data analysis and graphics, version 3.0.0. http://cran.r-project.org/doc/manuals/R-intro.pdf
Vermunt, J. K. (2003). Multilevel latent class models. Sociological Methodology, 33(1), 213–239.
Vermunt, J. K. (2008). Latent class and finite mixture models for multilevel data sets. Statistical Methods in Medical Research, 17(1), 33–51.
Vermunt, J. K. (2010). Latent class modeling with covariates: Two improved three-step approaches. Political Analysis, 18, 450–469.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50(1), 1–25.
Wu, C. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics, 11, 95–103.
Zhang, X., van der Lans, I., & Dagevos, H. (2012). Impacts of fast food and the food retail environment on overweight and obesity in China: A multilevel latent class cluster approach. Public Health Nutrition, 15(1), 88–96.
Agresti, A., Booth, J. G., Hobert, J. P., & Caffo, B. (2000). Random-effects modeling of categorical response data. Sociological Methodology, 30(1), 2780.CrossRefGoogle Scholar
Allman, E. S., Matias, C., Rhodes, J. A., et al. Identifiability of parameters in latent structure models with many observed variables The Annals of Statistics. (2009 37(6A), 30993132.CrossRefGoogle Scholar
Asparouhov, T., & Muthén, B. (2014). Auxiliary variables in mixture modeling: Three-step approaches using Mplus. Structural Equation Modeling, 21(3), 329341.CrossRefGoogle Scholar
Bakk, Z., Di Mari, R., Oser, J., & Kuha, J. (2022). Two-stage multilevel latent class analysis with covariates in the presence of direct effects. Structural Equation Modeling: A Multidisciplinary Journal, 29(2), 267277.CrossRefGoogle Scholar
Bakk, Z., & Kuha, J. (2018). Two-step estimation of models between latent classes and external variables. Psychometrika, 83, 871892.CrossRefGoogle ScholarPubMed
Bartolucci, F., Montanari, G. E., & Pandolfi, S. (2015). Three-step estimation of latent Markov models with covariates. Computational Statistics & Data Analysis, 83, 287301.CrossRefGoogle Scholar
Biernacki, C., Celeux, G., & Govaert, G. (2003). Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics & Data Analysis, 41(3–4), 561575.CrossRefGoogle Scholar
Catania, L., & Di Mari, R. (2021). Hierarchical Markov-switching models for multivariate integer-valued time-series. Journal of Econometrics, 221(1), 118137.CrossRefGoogle Scholar
Catania, L., Di Mari, R., & Santucci de Magistris, P. (2022). Dynamic discrete mixtures for high-frequency prices. Journal of Business & Economic Statistics, 40(2), 559577.CrossRefGoogle Scholar
Da Costa, L. P., & Dias, J. G. (2015). What do Europeans believe to be the causes of poverty? A multilevel analysis of heterogeneity within and between countries. Social Indicators Research, 122(1), 120.CrossRefGoogle Scholar
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 122.CrossRefGoogle Scholar
Di Mari, R., Bakk, Z., Oser, J., & Kuha, J. (2022). Multilevel latent class analysis with covariates: Analysis of cross-national citizenship norms with a two-stage approach. Under review.Google Scholar
Di Mari, R., Bakk, Z., & Punzo, A. (2020). A random-covariate approach for distal outcome prediction with latent class analysis. Structural Equation Modeling: A Multidisciplinary Journal, 27(3), 351368.CrossRefGoogle Scholar
Di Mari, R., & Maruotti, A. (2022). A two-step estimator for generalized linear models for longitudinal data with time-varying measurement error. Advances in Data Analysis and Classification, 16, 273300.CrossRefGoogle Scholar
Di Zio, M., Guarnera, U., & Rocci, R. (2007). A mixture of mixture models for a classification problem: The unity measure error. Computational Statistics & Data Analysis, 51(5), 25732585.CrossRefGoogle Scholar
Eddelbuettel, D., & François, R. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8), 118.CrossRefGoogle Scholar
Fagginger Auer, M. F., Hickendorff, M., Van Putten, C. M., Bèguin, A. A., & Heiser, W. J. (2016). Multilevel latent class analysis for large-scale educational assessment data: Exploring the relation between the curriculum and students’ mathematical strategies. Applied Measurement in Education, 29, 144159.CrossRefGoogle Scholar
Gassiat, É, Cleynen, A., & Robin, S. (2016). Inference in finite state space non parametric hidden markov models and applications. Statistics and Computing, 26(1–2), 6171.CrossRefGoogle Scholar
Gnaldi, M., Bacci, S., & Bartolucci, F. (2016). A multilevel finite mixture item response model to cluster examinees and schools. Advances in Data Analysis and Classification, 10, 5370.CrossRefGoogle Scholar
Gong, G., & Samaniego, F. J. (1981). Pseudo maximum likelihood estimation: Theory and applications. The Annals of Statistics, 9, 861869.CrossRefGoogle Scholar
Goodman, L. A. (1974). The analysis of systems of qualitative variables when some of the variables are unobservable. Part I: A modified latent structure approach. American Journal of Sociology, 79, 11791259.CrossRefGoogle Scholar
Gourieroux, C., & Monfort, A. (1995). Statistics and econometric models (Vol. 1). Cambridge University Press.Google Scholar
Grilli, L., Marino, M. F., Paccagnella, O., & Rampichini, C. (2022). Multiple imputation and selection of ordinal level 2 predictors in multilevel models: An analysis of the relationship between student ratings and teacher practices and attitudes. Statistical Modelling, 22(3), 221238.CrossRefGoogle Scholar
Grilli, L., Pennoni, F., Rampichini, C., & Romeo, I. (2016). Exploiting timss and pirls combined data: multivariate multilevel modelling of student achievement. The Annals of Applied Statistics, 10(4), 24052426.CrossRefGoogle Scholar
Grilli, L., & Rampichini, C. (2011). The role of sample cluster means in multilevel models: A view on endogeneity and measurement error issues. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 7(4), 121.CrossRefGoogle Scholar
Hagenaars, J. A. (1990). Categorical longitudinal data - Loglinear analysis of panel, trend and cohort data. Sage.Google Scholar
Hooghe, M., & Oser, J. (2015). The rise of engaged citizenship: The evolution of citizenship norms among adolescents in 21 countries between 1999 and 2009. International Journal of Comparative Sociology, 56(1), 2952.CrossRefGoogle Scholar
Hooghe, M., Oser, J., & Marien, S. (2016). A comparative analysis of “good citizenship’: A latent class analysis of adolescents’ citizenship norms in 38 countries. International Political Science Review, 37(1), 115129.CrossRefGoogle Scholar
Horn, M. L. V., Fagan, A. A., Jaki, T., Brown, E. C., Hawkins, J. D., Arthur, M. W., & Catalano, R. F. (2008). Using multilevel mixtures to evaluate intervention effects in group randomized trials. Multivariate Behavioral Research, 43(2), 289326.CrossRefGoogle ScholarPubMed
Huang, G. H., & Bandeen-Roche, K. (2004). Building an identifiable latent class model with covariate effects on underlying and measured variables. Psychometrika, 69(1), 532.CrossRefGoogle Scholar
Huang, Z. (1997). A fast clustering algorithm to cluster very large categorical data sets in data mining. In Lu, H. M. H., & Luu, H. (Eds.), KDD: Techniques and applications (pp. 2134). World Scientific.Google Scholar
Köhler, H., Weber, S., Brese, F., Schulz, W., & Carstens, R. (2018). ICCS 2016 user guide for the international database: IEA International Civic and Citizenship Education Study 2016. Amsterdam: The International Association for the Evaluation of Educational Achievement (IEA).Google Scholar
Lukociene, O., Varriale, R., & Vermunt, J. (2010). The simultaneous decision(s) about the number of lower- and higher-level classes in multilevel latent class analysis. Sociological Methodology, 40(1), 247283.CrossRefGoogle Scholar
Lyrvall, J., Di Mari, R., Bakk, Z., Oser, J., & Kuha, J. (2023). multilevlca: An r package for single-level and multilevel latent class analysis with covariates. arXiv preprint arXiv:2305.07276.Google Scholar
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. Le Cam & J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (pp. 281–297). University of California Press.
Magidson, J. (1981). Qualitative variance, entropy, and correlation ratios for nominal dependent variables. Social Science Research, 10, 177–194.
Magidson, J., & Vermunt, J. (2004). Latent class models. In Kaplan, D. (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 175–198). Sage.
Maruotti, A., & Punzo, A. (2021). Initialization of hidden Markov and semi-Markov models: A critical evaluation of several strategies. International Statistical Review, 89(3), 447–480.
Masyn, K. E. (2017). Measurement invariance and differential item functioning in latent class analysis with stepwise multiple indicator multiple cause modeling. Structural Equation Modeling: A Multidisciplinary Journal, 24(2), 180–197.
McCutcheon, A. L. (1987). Latent class analysis. Sage.
Morselli, D., & Glaeser, S. (2018). Economic conditions and social trust climates in Europe over ten years: An ecological analysis of change. Journal of Trust Research, 8(1), 68–86.
Mutz, R., & Daniel, H. (2013). University and student segmentation: Multilevel latent-class analysis of students’ attitudes towards research methods and statistics. British Journal of Educational Psychology, 83(2), 280–304.
Oakes, D. (1999). Direct calculation of the information matrix via the EM. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(2), 479–482.
Oberski, D. L., & Satorra, A. (2013). Measurement error models with uncertainty about the error variance. Structural Equation Modeling, 20, 409–428.
Oser, J., Di Mari, R., & Bakk, Z. (2023). Data preparation for citizenship norm analysis, International Association for the Evaluation of Educational Achievement (IEA) 1999–2009–2016. Open Science Framework. https://doi.org/10.17605/OSF.IO/AKS42
Oser, J., & Hooghe, M. (2013). The evolution of citizenship norms among Scandinavian adolescents, 1999–2009. Scandinavian Political Studies, 36(4), 320–346.
Oser, J., Hooghe, M., Bakk, Z., & Di Mari, R. (2022). Changing citizenship norms among adolescents, 1999–2009–2016: A two-step latent class approach with measurement equivalence testing. Quality & Quantity, 2022, 1–19.
Ouyang, J., & Xu, G. (2022). Identifiability of latent class models with covariates. Psychometrika, 87(4), 1343–1360.
Paccagnella, O., & Varriale, R. (2013). Asset ownership of the elderly across Europe: A multilevel latent class analysis to segment countries and households. In Torelli, N., Pesarin, F., & Bar-Hen, A. (Eds.), Advances in theoretical and applied statistics (pp. 383–393). Berlin, Heidelberg: Springer.
Park, J., & Yu, H. T. (2018). Recommendations on the sample sizes for multilevel latent class models. Educational and Psychological Measurement, 78(5), 737–761.
Parke, W. R. (1986). Pseudo maximum likelihood estimation: The asymptotic distribution. The Annals of Statistics, 14, 355–357.
Redner, R. A., & Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2), 195–239.
Rindskopf, D. (2006). Heavy alcohol use in the “Fighting Back” survey sample: Separating individual and community level influences using multilevel latent class analysis. Journal of Drug Issues, 36(2), 441–462.
Ruelens, A., & Nicaise, I. (2020). Investigating a typology of trust orientations towards national and European institutions: A person-centered approach. Social Science Research, 87, 102414.
Skrondal, A., & Kuha, J. (2012). Improved regression calibration. Psychometrika, 77(4), 649–669.
Tomczyk, S., Hanewinkel, R., & Isensee, B. (2015). Multiple substance use patterns in adolescents: A multilevel latent class analysis. Drug and Alcohol Dependence, 155, 208–214.
Venables, W. N., Smith, D. M., & the R Core Team. (2013). An introduction to R: Notes on R, a programming environment for data analysis and graphics, version 3.0.0. http://cran.r-project.org/doc/manuals/R-intro.pdf
Vermunt, J. K. (2003). Multilevel latent class models. Sociological Methodology, 33(1), 213–239.
Vermunt, J. K. (2008). Latent class and finite mixture models for multilevel data sets. Statistical Methods in Medical Research, 17(1), 33–51.
Vermunt, J. K. (2010). Latent class modeling with covariates: Two improved three-step approaches. Political Analysis, 18, 450–469.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica: Journal of the Econometric Society, 50(1), 1–25.
Wu, C. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics, 11, 95–103.
Zhang, X., van der Lans, I., & Dagevos, H. (2012). Impacts of fast food and the food retail environment on overweight and obesity in China: A multilevel latent class cluster approach. Public Health Nutrition, 15(1), 88–96.
Figure 1. Graphical representation of a multilevel latent class model which includes a low-level latent class variable $X_{ij}$ nested in a high-level latent class variable $W_j$, and covariates $Z_{ij}$ for $X_{ij}$. Here the response probabilities for items $Y_{ijh}$ depend directly only on $X_{ij}$.

Figure 2. Step 2 of the two-step estimation: estimating the structural model for low-level latent classes $X_{ij}$ given covariates $\textbf{Z}_{ij}$ and high-level latent classes $W_j$, keeping the measurement model parameters for items $Y_{ijh}$ fixed at their estimates from Step 1.

Table 1. The 24 simulation conditions.

Figure 3. Line graphs of estimated bias for the one-step, two-step, and two-stage estimators, for the 36 simulation conditions, averaged over the 500 replicates. Error bars are based on mean bias ± Monte Carlo standard deviations.

Figure 4. Observed coverage rates of 95% confidence intervals, averaged over covariate effects, for the one-step, two-stage and two-step estimators for the 36 simulation conditions, averaged over the 500 replicates. The lower and upper values in the confidence bars are based on the minimum and maximum coverage of the confidence intervals across the covariate effects.

Figure 5. Relative computation time for the one-step, two-stage and two-step estimators for the 24 simulation conditions, averaged over the 500 replicates. The one-step estimator’s estimation time is taken as the reference. Confidence bands are based on average values ± their Monte Carlo standard deviations.

Table 2. Number of respondents per country of the third wave (2016) of the IEA survey used for the analysis.

Table 3. Summary statistics for the measurement model.

Figure 6. Measurement model at the lower (individual) level: line graph of the class-conditional response probabilities.

Table 4. Estimated proportions of low-level (individual-level) classes conditional on high-level (country-level) class membership.

Table 5. Assignment of countries to the high-level classes, based on the maximum a posteriori (MAP) classification rule, with $M=3$.

Table 6. Estimated coefficients of structural models, i.e. multinomial logistic models for membership of the four individual-level latent classes conditional on covariates, separately within each of the three country-level latent classes (HL1, HL2 and HL3).

Table 7. CPU time to estimation in seconds, and number of iterations until convergence, for the two methods (one-step and two-step estimators).

Table 8. Average relative efficiency of the two-step and two-stage estimators relative to the one-step estimator (SD over benchmark one-step SD), averaged over covariate effects.