1 Introduction
Cognitive Diagnostic Models (CDMs), or Diagnostic Classification Models (Templin & Henson, 2010), have emerged as a crucial tool for modeling educational assessment data with multidimensional discrete (often binary) latent variables. Various diagnostic goals lead to CDMs with different measurement models; examples include the Deterministic Input Noisy "And" gate model (DINA; Junker & Sijtsma, 2001) with the conjunctive assumption, the Deterministic Input Noisy "Or" gate model (DINO; Templin & Henson, 2006) with the disjunctive assumption, the main-effect diagnostic models (de la Torre, 2011; DiBello et al., 2012; Maris, 1999) incorporating the additive effects of latent attributes, and the all-effect general diagnostic models (de la Torre, 2011; Henson et al., 2009; Von Davier, 2008) with a more saturated parameterization.
Currently, most existing CDMs are designed to model binary or polytomous response data. However, the diversification of examination modes and the increased availability of educational and psychological data have enabled the collection of various response data types. Continuous response data arise in many scenarios, such as language proficiency tests scored on a continuous scale and response times recorded in computer-based assessments (Minchen et al., 2017). The modeling of response times has long been a topic of interest; see De Boeck & Jeon (2019) for a comprehensive overview. Another common response type is count responses, found in assessments recording the number of correct responses, the frequency of specific behaviors in classroom activities, the usage frequency of particular strategies in problem-solving tasks, and computer-based tests recording visit counts per item (Liu et al., 2022; Man & Harring, 2019). Rasch (1993) first proposed a Poisson-based item response theory (IRT) model for count data, and since then, many other models have been developed (Magnus & Thissen, 2017; Man & Harring, 2019, 2023).
A crucial element in a CDM is the relationship between observed item responses and latent attributes, specified by the Q-matrix (Tatsuoka, 1983). Recently, Lee & Gu (2024) proposed a new cognitive diagnostic modeling framework for general response types with a prespecified Q-matrix. However, in many practical applications, the true Q-matrix may not be known a priori, necessitating an exploratory approach that infers the Q-matrix directly from the response data. In such challenging exploratory settings with flexible data types, reliable and efficient estimation of the Q-matrix is highly desirable but remains largely unexplored. Beyond exploratory CDMs and general response types, integrating a higher-order layer into CDMs (de la Torre & Douglas, 2004; Templin et al., 2008) offers significant advantages. Such models use one or more continuous latent traits to explain the binary attributes, providing a more nuanced understanding of the relationships between different skills. This yields a comprehensive and realistic representation of cognitive processes.
This article makes the following key contributions. First, we propose a unified framework for modeling higher-order general-response CDMs (HO-GRCDMs). We formulate the bottom layer (data layer) of HO-GRCDMs using flexible exponential family distributions. This allows the model to directly adapt to (a) different types of responses (binary, continuous, count, etc.) by altering the parametric family and (b) various types of measurement assumptions (main-effect, all-effect, DINA, etc.) by modifying the latent covariate vector. In the higher-order layer, we employ a probit model to describe the relationship between the higher-order continuous latent traits and the binary latent attributes. We consider both pre-specified and unknown higher-order structures, referring to the former as the partially exploratory setting and the latter as the fully exploratory setting. The higher-order modeling approach was originally proposed by de la Torre & Douglas (2004) for binary response data, referred to as the higher-order CDM. We generalize it to general response types and employ a probit link instead of the logit link used in de la Torre & Douglas (2004). As will be discussed later, using a probit link for the higher-order layer provides significant theoretical and computational advantages.
Second, we establish identifiability for the proposed HO-GRCDMs. Model identifiability is a crucial prerequisite for valid statistical estimation, but it is a challenging issue for complex latent variable models such as HO-GRCDMs. While the identifiability of single-layer exploratory CDMs for categorical data has been extensively studied (e.g., Chen et al., 2020; Culpepper, 2019; Liu & Culpepper, 2024; Xu & Shang, 2018), much less is known about the identifiability of CDMs with higher-order structures. Lee & Gu (2024) provide identifiability results for general-response CDMs with a known Q-matrix, but those results do not guarantee the identifiability of an HO-GRCDM. Some existing studies established identifiability for CDMs with higher-order discrete latent structures, including the Bayesian pyramid model in Gu & Dunson (2023) and the DeepCDM in Gu (2024). However, the identifiability of CDMs with higher-order continuous latent traits is still underexplored, despite these models' popularity (Culpepper & Balamuta, 2023; de la Torre & Douglas, 2004; Templin et al., 2008; Wang, Yang, et al., 2018). Culpepper & Balamuta (2023) address the identifiability of the CDM layer in a higher-order model, but do not establish identifiability of the entire model. To the best of our knowledge, our identifiability results are the first to fully identify CDMs with multidimensional higher-order continuous latent traits.
Third, we propose an efficient Monte Carlo Expectation–Maximization (MCEM) algorithm for estimating HO-GRCDMs. The computational challenges of HO-GRCDMs arise from three aspects: (a) the various types of response data, (b) the complex hierarchical structure that consists of both binary and continuous latent variables, and (c) the unknown Q-matrix. A typical approach to parameter estimation of such models is to regard the latent variables as missing data and employ Markov chain Monte Carlo (MCMC; Robert & Casella, 2004) or an EM-type algorithm, where the latter is usually faster. However, the complex hierarchical structure in our model makes the maximization of the complete-data log-likelihood intractable during the typical EM updates. In this article, we propose an MCEM algorithm that maximizes a regularized likelihood to simultaneously estimate the Q-matrix and the other parameters. Similar to Chen et al. (2015), we treat Q-matrix estimation as a latent variable selection problem, and maximize the log-likelihood with an
$L_1$
penalty. Our method is the first non-MCMC approach to parameter estimation for CDMs with a multidimensional higher-order structure.
Our estimation framework incorporates an efficient direct sampling scheme and requires significantly fewer simulated samples. The continuous latent traits
$\boldsymbol {\theta }$
in the higher-order layer are the only latent variables whose realizations need to be sampled. After imputing the missing data
$\boldsymbol {\theta }$
, the M-step optimization becomes more tractable, and we solve it via cyclical coordinate ascent (Friedman et al., 2010; Tay et al., 2023). The simulation of
$\boldsymbol {\theta }$
has been a crucial issue in both IRT and higher-order CDM estimation. The MCMC method, commonly used for this purpose, often suffers from slow convergence and requires careful tuning of algorithm parameters. In this article, we highlight that, benefiting from the probit link in the higher-order layer, we can directly simulate
$\boldsymbol {\theta }$
from a unified skew-normal distribution according to the recent theoretical results of Li et al. (2025). Last but not least, initialization is important for the efficiency of an algorithm. We employ an efficient Singular Value Decomposition (SVD)-based method for finding initial values, extending Chen et al. (2019) and Zhang et al. (2020) from the binary response case to the general response case. This non-iterative method is computationally fast and enjoys statistical consistency guarantees under certain conditions (Zhang et al., 2020).
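For illustration, a simplified sketch of such an SVD-based initialization is given below. This is our own toy version, not the exact procedure of Chen et al. (2019) or Zhang et al. (2020); it simply truncates the SVD of the centered response matrix to obtain rough starting values.

```python
import numpy as np

def svd_start_values(R, K):
    """Simplified SVD-based starting values: center the N x J response
    matrix, truncate its SVD at rank K, and return person-side scores
    and item-side loadings as rough initial values."""
    R = np.asarray(R, dtype=float)
    Rc = R - R.mean(axis=0, keepdims=True)        # column-center responses
    U, s, Vt = np.linalg.svd(Rc, full_matrices=False)
    scores = U[:, :K] * s[:K]                     # N x K person-side factors
    loadings = Vt[:K].T                           # J x K item-side loadings
    return scores, loadings

# toy usage with a random 20 x 6 response matrix and K = 2
R = np.random.default_rng(1).standard_normal((20, 6))
scores, loadings = svd_start_values(R, K=2)
# scores @ loadings.T is the best rank-2 approximation of the centered matrix
```

Being non-iterative, this kind of initialization is cheap relative to the subsequent MCEM iterations.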
The rest of this article is organized as follows. Section 2 introduces the framework of HO-GRCDM. Section 3 presents the identifiability results. Section 4 proposes an efficient MCEM algorithm for parameter estimation. Sections 3 and 4 are developed under a partially exploratory setting (with an unknown Q-matrix in the CDM layer), while Section 5 extends the methodology to a fully exploratory setting (with an unknown Q-matrix in the CDM layer and also an unknown higher-order
$Q^{(H)}$
-matrix in the deeper layer). Section 6 conducts extensive simulation studies for HO-GRCDM under various measurement models, higher-order structures, and response types. Section 7 applies our methodology to a response time data set extracted from the TIMSS 2019 math assessment. Section 8 concludes and discusses future research directions. The Supplementary Material contains additional theoretical results, all technical proofs, and additional numerical results.
2 Model setup
Assume there are N examinees responding to a test with J items. For each examinee, the observed response vector
$\mathbf {R}=(R^1,\ldots ,R^J)$
is a J-dimensional vector. Depending on the assessment design, the response
$R^j$
could be binary, polytomous, count-valued, continuous, and so on. We represent an examinee’s latent skill profile as a K-dimensional random vector,
$\mathbf {A} = (A_1, \ldots , A_K)^\top \in \{0, 1\}^K$
, and let
$\boldsymbol \alpha \in \{0,1\}^K$
be an arbitrary binary vector. Let
$\mathbb {P}^j_\nu $
denote the prespecified parametric family for the jth response, with parameter
$\nu =\nu _{j,\boldsymbol \alpha }$
Before introducing the specific notation, we first present the general form of the HO-GRCDM:
$$ \begin{align} R^j \mid \mathbf A = \boldsymbol{\alpha} &\sim \mathbb{P}^j_{\nu_{j,\boldsymbol{\alpha}}}, \quad j=1,\ldots,J, \end{align} $$
$$ \begin{align} f\left(P(A_k = 1 \mid \boldsymbol{\theta})\right) &= \lambda_{0}^{k} + {\boldsymbol{\lambda_1}^{k}}^{\top}\boldsymbol{\theta}, \quad k=1,\ldots,K, \end{align} $$
$$ \begin{align} \boldsymbol{\theta} &\sim N(\mathbf{0}_D, \boldsymbol{\Sigma_{\theta}}), \end{align} $$
where f is an invertible link function specified in Section 2.2.
Equations (1)–(3) describe a CDM with a three-layer hierarchical structure, with the observed general-response data
$\mathbf {R}$
at the bottom layer, the binary latent attributes
$\mathbf A$
at the middle layer, and the higher-order Normal latent variables
$\boldsymbol{\theta} \in \mathbb {R}^D$
at the top layer.
A CDM typically consists of two main components: the measurement part and the latent part. The measurement part, defined in (1), describes how the observed responses
$\mathbf {R}$
depend on the latent attributes
$\boldsymbol \alpha $
. We consider a very broad framework that allows flexible response types and various measurement models. The latent part, defined in (2) and (3), models the binary attributes. In the following sections, we separately define each part of the HO-GRCDM. For notational simplicity, for a positive integer M, let
$[M]$
be the set of all positive integers
$\{1,\ldots ,M\}$
.
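To fix ideas, the following toy sketch simulates data from the three layers described above, assuming a main-effect measurement model with normal responses; all sizes and parameter values are hypothetical illustrations rather than the article's settings.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def simulate_hogrcdm(N, lam0, Lam1, Sigma, beta0, beta1, Q, sigma=0.5):
    """Simulate N examinees from the three-layer model:
    top:    theta ~ N(0, Sigma)                              -- Eq. (3)
    middle: P(A_k = 1 | theta) = Phi(lam0_k + Lam1_k' theta) -- Eq. (2)
    bottom: normal responses with main-effect nu             -- Eq. (1)
    """
    D = Sigma.shape[0]
    theta = rng.multivariate_normal(np.zeros(D), Sigma, size=N)  # N x D
    probs = norm.cdf(lam0 + theta @ Lam1.T)                      # N x K
    A = (rng.random(probs.shape) < probs).astype(int)            # attributes
    nu = beta0 + A @ (beta1 * Q).T                               # N x J
    R = nu + sigma * rng.standard_normal(nu.shape)               # responses
    return R, A, theta

# hypothetical sizes: J = 4 items, K = 2 attributes, D = 1 trait
Q = np.array([[1, 0], [0, 1], [1, 1], [1, 0]])
beta0 = np.zeros(4); beta1 = np.ones((4, 2))
lam0 = np.zeros(2); Lam1 = np.ones((2, 1)); Sigma = np.eye(1)
R, A, theta = simulate_hogrcdm(500, lam0, Lam1, Sigma, beta0, beta1, Q)
```

Swapping the bottom-layer sampling step changes the response type without touching the latent layers, which is exactly the modularity the framework exploits.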
2.1 Bottom data layer: CDMs with general responses (GR-CDMs)
In the bottom layer (1), an examinee’s observed responses depend on his/her mastery statuses of the K binary latent attributes. Here
$A_k=1$
indicates the kth attribute is mastered; otherwise,
$A_k=0$
. The
$h(\boldsymbol {\alpha })$
is a latent covariate vector consisting of certain main effects and interaction effects of attributes, and
$\boldsymbol {\beta }^j$
is a parameter vector that we will further specify. The parameter
$\nu _{j,\boldsymbol {\alpha }}=[\boldsymbol {\beta }^j]^{\top }h(\boldsymbol {\alpha })$
is a linear combination of
$\boldsymbol {\beta }^j$
and
$h(\boldsymbol {\alpha })$
. For convenience of presentation, we assume that for all items
$ j = 1, \ldots , J $
, the
$ \mathbb {P}^j_{\nu _{j,\boldsymbol \alpha }}$
are the same parametric family, and omit the superscript.
We first elaborate on the choice of the parametric family
$\mathbb {P}_\nu $
that can be used to model each response type. Below, we denote
$g(\cdot ;~ \nu )$
as the probability mass/density function (pmf/pdf) of the parametric family under consideration. The Bernoulli distribution with mean
$\nu $
can be used to model binary responses (de la Torre, 2011; Maris, 1999). Alternatively, one can assume that
$\nu $
is the logit transform of the Bernoulli mean, as in the case of the Logistic Linear Model (LLM; Maris, 1999). Equivalently, we model the Bernoulli mean as
$\text {logistic}(\nu ) := {1}/({1+\exp (-\nu )})$
:
$$ \begin{align} P(R^j=r| \mathbf A=\boldsymbol \alpha, \nu_{j,\boldsymbol \alpha})=g(r; ~\text{logistic}(\nu_{j,\boldsymbol \alpha}))=&\frac{\exp{(-\nu_{j,\boldsymbol{\alpha}})}^{1-r}}{1+\exp{(-\nu_{j,\boldsymbol{\alpha}})}}, \quad r=0,1. \end{align} $$
In this work, we allow very diverse choices of non-categorical responses such as count and continuous data. For count-valued data, the Poisson distribution with mean
$\nu $
can be used:
$$ \begin{align} g(r;~ \nu_{j,\boldsymbol \alpha})=&\frac{\nu_{j,\boldsymbol{\alpha}}^{r}\exp(-\nu_{j,\boldsymbol{\alpha}})}{r!}, \quad r=0,1,2,\ldots. \end{align} $$
For unbounded continuous responses, one can use a normal distribution, where
$ \nu _{j,\boldsymbol {\alpha }} $
denotes the mean parameter. Together with an additional variance parameter
$ \sigma _j^2 $
, we have
$$ \begin{align} g(r;~ \nu_{j,\boldsymbol \alpha})=&\frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left\{-\frac{\left(r-\nu_{j,\boldsymbol{\alpha}}\right)^2}{2\sigma_j^2}\right\}. \end{align} $$
For continuous responses with a constrained range, one can use transformed-normal distributions. For example, the log-normal distribution can be used for modeling positive responses:
$$ \begin{align} g(r;~ \nu_{j,\boldsymbol \alpha})=&\frac{1}{\sqrt{2\pi\sigma_j^2} r} \exp\left\{-\frac{\left(\log(r)-\nu_{j,\boldsymbol{\alpha}}\right)^2}{2\sigma_j^2}\right\}, \end{align} $$
and the logistic-normal distribution for responses within the range of (0,1):
$$ \begin{align} g(r;~ \nu_{j,\boldsymbol \alpha})=&\frac{1}{\sqrt{2\pi\sigma_j^2} r(1-r)} \exp\left\{-\frac{\left(\log(r/(1-r))-\nu_{j,\boldsymbol{\alpha}}\right)^2}{2\sigma_j^2}\right\}. \end{align} $$
Alternatively, to model positive continuous responses, we can also use the Gamma distribution:
$$ \begin{align} g(r; ~ \nu_{j,\boldsymbol \alpha})=& \frac{\nu_{j,\boldsymbol{\alpha}}^{s_j}}{\Gamma(s_j)} r^{s_j-1} \exp\left\{-\nu_{j,\boldsymbol{\alpha}} \cdot r\right\}, \quad j=1,\ldots,J. \end{align} $$
Here,
$ \nu _{j,\boldsymbol {\alpha }}> 0 $
is the rate parameter, and
$ s_j> 0 $
is the shape parameter.
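For concreteness, the sketch below evaluates the pmf/pdf g(r; ν) for each of the families above using SciPy; the function and its arguments are our own illustrative wrappers around standard distributions, with `sigma` playing the role of σ_j and `shape` the role of s_j.

```python
import numpy as np
from scipy import stats

def g(r, nu, family, sigma=1.0, shape=1.0):
    """Evaluate the pmf/pdf g(r; nu) for the response families above."""
    if family == "bernoulli-logistic":
        # Bernoulli mean modeled as logistic(nu)
        p = 1.0 / (1.0 + np.exp(-nu))
        return p if r == 1 else 1.0 - p
    if family == "poisson":
        # count responses with mean nu
        return stats.poisson.pmf(r, mu=nu)
    if family == "normal":
        # unbounded continuous responses with mean nu
        return stats.norm.pdf(r, loc=nu, scale=sigma)
    if family == "lognormal":
        # positive responses: log(R) ~ N(nu, sigma^2)
        return stats.lognorm.pdf(r, s=sigma, scale=np.exp(nu))
    if family == "logistic-normal":
        # responses in (0, 1): logit(R) ~ N(nu, sigma^2)
        z = np.log(r / (1.0 - r))
        return stats.norm.pdf(z, loc=nu, scale=sigma) / (r * (1.0 - r))
    if family == "gamma":
        # positive responses with rate nu and shape s_j
        return stats.gamma.pdf(r, a=shape, scale=1.0 / nu)
    raise ValueError(f"unknown family: {family}")
```

Note the SciPy parameterizations: `lognorm` takes the log-scale σ as `s` and exp(ν) as `scale`, and `gamma` takes the rate through `scale = 1/ν`.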
Next, we specify the assumptions on the parameter
$\nu _{j,\boldsymbol \alpha }$
. In the literature of cognitive diagnostic modeling, a common assumption involves a pre-specified Q-matrix (Tatsuoka, 1983),
$Q=[q_{jk}]_{J\times K}$
, that describes which of the K attributes are measured by each of the J items. If
$q_{jk}=1$
, then the kth attribute is measured by the jth item; otherwise,
$q_{jk}=0$
. So, the value of
$\nu _{j,\boldsymbol \alpha }$
must depend only on the attributes specified by the jth row of the Q-matrix. While the Q-matrix is often assumed to be provided along with the data, we consider a more challenging exploratory estimation scenario where Q is unknown and needs to be estimated along with the other parameters.
Given the Q-matrix, one needs some structural assumptions on how
$\nu _{j,\boldsymbol {\alpha }}=[\boldsymbol {\beta }^j]^{\top }h(\boldsymbol {\alpha })$
depends on the Q-matrix entries. Here, we present three popular measurement model assumptions on the parameters
$\boldsymbol {\beta }^j$
and the function
$h(\cdot )$
. First, the main-effect GR-CDM assumes that
${\boldsymbol {\beta }^j}=\left ( \beta ^j_0, \beta ^j_1,\ldots , \beta ^j_K\right )^{\top }$
, and
${h(\boldsymbol {\alpha })}=\left (1,\alpha _{1}, \ldots , \alpha _{K}\right )^{\top }$
. Then, we can write Eq. (1) as
$$ \begin{align} P(R^j|\mathbf A=\boldsymbol{\alpha},\boldsymbol{\beta}^j) =& g\left( R^j;~ \beta^j_0 + \sum_{k=1}^{K} \beta^j_k q_{j,k}\alpha_k \right). \end{align} $$
For illustration, suppose that we are considering binary responses and
$\mathbb {P}_{\nu }$
is the Bernoulli distribution; then (10) becomes the Additive Cognitive Diagnosis Model (ACDM; de la Torre, 2011).
Next, the all-effect GR-CDM assumes
${\boldsymbol {\beta }^j}=\left ( \beta ^j_0, \beta ^j_1,\ldots , \beta ^j_K, \beta ^{j}_{1,2},\ldots , \beta ^{j}_{K-1,K},\ldots , \beta ^{j}_{1,2,\ldots ,K}\right )^{\top }$
, and
${h(\boldsymbol {\alpha })}=\left (1,\alpha _{1}, \ldots , \alpha _{K}, \alpha _{1}\alpha _{2}, \ldots , \alpha _{K-1}\alpha _{K}, \ldots , \prod _{k=1}^K\alpha _{k} \right )^{\top }$
. The pmf/pdf becomes
$$ \begin{align} P(R^j|\mathbf A=\boldsymbol{\alpha},\boldsymbol{\beta}^j) & = g\left(R^j;~ \beta^j_0 + \sum_{k=1}^{K} \beta^j_k q_{j,k}\alpha_k \right.\nonumber\\ & \quad + \sum_{1\leq k_1 < k_2 \leq K} \beta^{j}_{k_1,k_2} \left\{ q_{j,k_1}\alpha_{k_1} \right\}\left\{ q_{j,k_2}\alpha_{k_2} \right\} +\left. \cdots + \beta^{j}_{1,2,\ldots,K}\prod_{k=1}^{K}\left\{ q_{j,k}\alpha_{k} \right\} \right). \end{align} $$
Note that for the case of binary responses, this becomes the GDINA model (de la Torre, 2011).
Finally, we introduce the GR-DINA model. We borrow the notation from the all-effect model, and take
${\boldsymbol {\beta }^j}=\left ( \beta ^j_0,~\beta ^j_{\mathcal {K}_j} \right )^{\top }$
and
$h(\alpha ) = \left (1, ~\prod _{k \in \mathcal {K}_j} (q_{j,k}\alpha _k) \right )^{\top }$
. Here,
$\mathcal {K}_j = \left \{k \in [K]:~ q_{jk} = 1\right \}$
denotes the set of attributes that are measured by item j. Then the pmf/pdf is written as
$$ \begin{align} P(R^j|\mathbf A=\boldsymbol{\alpha},\boldsymbol{\beta}^j) &= g\left(R^j; ~\beta^j_0 + \beta^j_{\mathcal{K}_j} \prod_{k \in \mathcal{K}_j} (q_{j,k}\alpha_k) \right). \end{align} $$
Note that the above formulation can be regarded as a special case of the all-effect GR-CDM. Under binary responses, this model is the popular DINA model (Junker & Sijtsma, 2001) with the conjunctive assumption. Under positive continuous responses, (12) becomes the continuous DINA model (c-DINA; Minchen et al., 2017).
For main-effect and all-effect GR-CDMs, the Q-matrix should constrain certain
$\beta $
-coefficients to be zero. For example, under the main-effect model, we must have
$\beta ^j_{k} = 0$
for k for which
$q_{jk} = 0$
. Under the all-effect model, we must have
$\beta ^j_S = 0$
when
$S \not \subseteq \mathcal {K}_j$
. Since we consider an unknown Q-matrix, the indices of such zero coefficients are also unknown and need to be estimated from data.
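To illustrate how the three measurement models share the same linear form ν_{j,α} = [β^j]⊤h(α), the following sketch constructs h(α) for each model; the item, Q-matrix row, and coefficient values are hypothetical.

```python
import itertools
import numpy as np

def h_main(alpha):
    """Main-effect covariates: (1, alpha_1, ..., alpha_K)."""
    return np.concatenate(([1.0], np.asarray(alpha, dtype=float)))

def h_all(alpha):
    """All-effect covariates: intercept plus one product term per
    nonempty subset of attributes, ordered by subset size."""
    alpha = np.asarray(alpha, dtype=float)
    K = len(alpha)
    terms = [1.0]
    for size in range(1, K + 1):
        for S in itertools.combinations(range(K), size):
            terms.append(float(np.prod(alpha[list(S)])))
    return np.array(terms)

def h_dina(alpha, q_j):
    """GR-DINA covariates: (1, prod_{k in K_j} alpha_k), where
    K_j = {k : q_{jk} = 1}."""
    alpha, q_j = np.asarray(alpha), np.asarray(q_j)
    K_j = np.flatnonzero(q_j)
    return np.array([1.0, float(np.prod(alpha[K_j]))])

# hypothetical item measuring attributes 1 and 2, i.e., q_j = (1, 1, 0)
q_j = np.array([1, 1, 0])
alpha = np.array([1, 0, 1])
beta_main = np.array([0.5, 1.0, 1.0, 0.0])  # beta^j_3 = 0 since q_{j3} = 0
nu = beta_main @ h_main(alpha)              # 0.5 + 1.0 = 1.5
```

Here the Q-matrix constraint is enforced by zeroing the corresponding β-coefficients, mirroring the latent variable selection view used in our estimation algorithm.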
2.2 Latent layers: Higher-order latent trait model for binary attributes
As shown in Equations (2) and (3), we consider a higher-order latent layer to model the attributes
$\boldsymbol \alpha $
. We introduce a D-dimensional continuous latent trait,
$\boldsymbol {\theta }=(\theta _1,\dots ,\theta _D)^{\top }$
. There are two common choices for the invertible link function f: the logit link function
$f(x)=\log (x/(1-x))$
, and the probit link function
$f(x)=\Phi ^{-1}(x)$
, where
$\Phi $
is the cumulative distribution function (CDF) of a standard normal random variable. In this article, we employ the probit link function and let
$$ \begin{align} P(A_k = 1 \mid \boldsymbol{\theta}) = \Phi\left(\lambda_{0}^{k} + {\boldsymbol{\lambda_1}^{k}}^{\top}\boldsymbol{\theta}\right), \quad k=1,\ldots,K. \end{align} $$
Here,
$\boldsymbol{\lambda _1}^k={(\lambda _{1,1}^k,\ldots , \lambda _{1,D}^k)}^{\top }$
and
${\lambda _0^k}$
are the slopes and the intercept, respectively. Let
$\boldsymbol {\lambda _1}={(\boldsymbol {\lambda _1}^1,\ldots , \boldsymbol {\lambda _1}^K)}^{\top }$
be a matrix consisting of slope parameters of all K attributes, and
$\boldsymbol {\lambda _0}=(\lambda _0^1,\ldots ,\lambda _0^K)^{\top }$
be a vector consisting of intercept parameters of all K attributes. We assume a pre-specified binary matrix
$Q^{(H)}=[q_{kd}^{(H)}]_{K\times D}$
, which constrains the sparsity structure of
$\boldsymbol {\lambda }_1$
to enhance the interpretability. The entry
$q_{kd}^{(H)}=1$
implies that the dth continuous latent variable
$\theta _{d}$
contributes to the mastering of the kth binary latent attribute
$A_k$
. We refer to the HO-GRCDM with an unknown Q-matrix and a pre-specified
$ Q^{(H)} $
matrix as a partially exploratory setting, and refer to the case where both Q and
$ Q^{(H)} $
are unknown as a fully exploratory setting. In the following sections, we first develop the identifiability theory and estimation algorithm under the partially exploratory setting, and then extend the methodology to the fully exploratory case in Section 5. In Equation (3),
$\boldsymbol {\theta }$
is assumed to follow a D-variate normal distribution,
$\boldsymbol {\theta } \sim N(\mathbf {0}_D,\boldsymbol {\Sigma _{\theta }})$
, where the zero mean vector
$\mathbf {0}_D$
is to fix the measurement origin, and the covariance matrix
$\boldsymbol {\Sigma _{\theta }}=(\sigma _{dd'})$
has unit diagonal entries to fix the measurement units. That is,
$\sigma _{dd}=1$
, for
$d=1,\ldots ,D$
. We also assume that the first item loading on each factor is positive to resolve the sign indeterminacy issue of the latent factors.
The higher-order latent layer is introduced to resemble an item response model (Birnbaum, 1968; Lord, 1952). This modeling approach was initially used in de la Torre & Douglas (2004), where the authors considered binary responses and assumed both the bottom layer’s Q-matrix and the latent layer’s
$Q^{(H)}$
matrix were known. In addition to the higher-order CDM, another approach introduced by Templin et al. (2008) employed a multivariate probit model with a single continuous latent factor. In that work, each binary attribute is derived by dichotomizing a Normal random variable at a specific threshold, with the K Normal variables modeled using a factor analysis structure. Going beyond binary responses, Culpepper & Balamuta (2023) proposed a CDM for polytomous responses that similarly models the higher-order latent structure via a probit model with a single latent factor.
Intuitively, the higher-order latent structure should be more parsimonious than the bottom layer, as it generally represents more abstract factors/traits at a higher level. To this end, a subscale structure for
$Q^{(H)}$
may be appropriate, where each attribute is associated with only one latent trait among all the D traits. In other words, an attribute
$A_k$
exclusively depends on a latent trait
$\theta _d$
, and we have
$q_{kd}^{(H)} = 1$
. For all other groups indexed by
$d' \neq d$
, we assume
$q_{kd'}^{(H)} =0$
and
$\lambda _{1,d'}^k = 0$
to not include their effect on
$\alpha _k$
. However, in some cases, this structure may be too simple to capture the relationship between
$\mathbf A$
and
$\boldsymbol {\theta }$
. To address this, the bifactor structure with an additional general factor offers a more flexible, yet still parsimonious, alternative. In the bifactor structure, we assume that there are
$D-1$
groups (indexed by
$d=1,\ldots ,D-1$
), and that each binary attribute is assigned to exactly one group. Each attribute
$\alpha _k$
that belongs to group d is influenced by two latent factors: a general factor
$\theta _1$
, and a group-specific factor
$\theta _{d+1}$
. Consequently, we have
$q_{k 1}^{(H)} = q_{k (d+1)}^{(H)} = 1$
. The effects of the other groups
$d' \neq d$
are assumed to be zero, that is
$q_{k(d'+1)}^{(H)} =0$
and
$\lambda _{1,d'+1}^k = 0$
. In the remainder of the article, we will assume that the higher-order model follows either the subscale structure or the bifactor structure.
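For illustration, the subscale and bifactor structures can be encoded as binary matrices as follows; the helper functions and the group assignment are our own hypothetical examples.

```python
import numpy as np

def subscale_QH(group):
    """Subscale structure: attribute k loads only on the trait group[k].
    `group` has length K with entries in {0, ..., D-1}."""
    group = np.asarray(group)
    K, D = len(group), group.max() + 1
    QH = np.zeros((K, D), dtype=int)
    QH[np.arange(K), group] = 1          # one loading per attribute
    return QH

def bifactor_QH(group):
    """Bifactor structure: every attribute loads on the general factor
    (column 0) and on its group-specific factor (column group[k] + 1),
    so D = (number of groups) + 1."""
    group = np.asarray(group)
    K, D = len(group), group.max() + 2
    QH = np.zeros((K, D), dtype=int)
    QH[:, 0] = 1                         # general factor
    QH[np.arange(K), group + 1] = 1      # group-specific factor
    return QH

group = [0, 0, 1, 1]  # K = 4 attributes split into two groups
# subscale: each row of Q^(H) has a single 1; bifactor: exactly two 1s
```

Either encoding fixes the sparsity pattern of the slope matrix λ₁ in the pre-specified (partially exploratory) case.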
2.3 Benefits of using the probit link to model the higher-order layer
We elaborate on our rationale for choosing the probit link to model the higher-order layer, as opposed to the more common logit link. There are mainly two advantages in doing so. To see this, we first present the marginal distribution of the binary random vector
$\mathbf A$
obtained using the probit link. For each
$\boldsymbol \alpha \in \{0,1\}^K$
, the marginal probability of
$\mathbf A = \boldsymbol {\alpha }$
under our model is
$$ \begin{align} \pi_{\boldsymbol \alpha} := P\left(\mathbf A = \boldsymbol{\alpha}\right) &=P\left(\epsilon_{1}\leq(-1)^{(\alpha_1+1)}(\lambda_{0}^{1}+{\boldsymbol{\lambda_1}^1}^{\top}\boldsymbol{\theta}),\ldots,\epsilon_{K}\leq(-1)^{(\alpha_K+1)}(\lambda_{0}^{K}+{\boldsymbol{\lambda_1}^K}^{\top}\boldsymbol{\theta})\right)\nonumber\\ &=P\left((-1)^{(\alpha_1)}\lambda_{0}^{1}+\sqrt{{\boldsymbol{\lambda}_{1}^{1}}^{\top} \boldsymbol{\Sigma_{\theta}}{\boldsymbol{\lambda}_{1}^{1}}+1}\,\xi_{1}\leq0,\ldots,(-1)^{(\alpha_K)}\lambda_{0}^{K}+\sqrt{{\boldsymbol{\lambda}_{1}^{K}}^{\top} \boldsymbol{\Sigma_{\theta}}{\boldsymbol{\lambda}_{1}^{K}}+1}\,\xi_{K}\leq0\right)\nonumber\\ &=\Phi_K\left(\frac{(-1)^{(\alpha_1+1)}\lambda_{0}^{1}}{\sqrt{{\boldsymbol{\lambda}_{1}^{1}}^{\top} \boldsymbol{\Sigma_{\theta}}{\boldsymbol{\lambda}_{1}^{1}}+1}},\ldots,\frac{(-1)^{(\alpha_K+1)}\lambda_{0}^{K}}{\sqrt{{\boldsymbol{\lambda}_{1}^{K}}^{\top} \boldsymbol{\Sigma_{\theta}}{\boldsymbol{\lambda}_{1}^{K}}+1}},~ \mathbf C_{\rho} \right), \end{align} $$
with a tetrachoric correlation matrix
$\mathbf C_{\rho } = (C_{\rho }(k_1,k_2))_{K\times K}$
:
$$ \begin{align*} C_{\rho}(k_1,k_2)=\frac{(-1)^{(\alpha_{k_1}+\alpha_{k_2})}{\boldsymbol{\lambda}_{1}^{k_1}}^{\top} \boldsymbol{\Sigma_{\theta}}{\boldsymbol{\lambda}_{1}^{k_2}}}{\sqrt{{\boldsymbol{\lambda}_{1}^{k_1}}^{\top} \boldsymbol{\Sigma_{\theta}}{\boldsymbol{\lambda}_{1}^{k_1}}+1}\sqrt{{\boldsymbol{\lambda}_{1}^{k_2}}^{\top} \boldsymbol{\Sigma_{\theta}}{\boldsymbol{\lambda}_{1}^{k_2}}+1}} \end{align*} $$
for
$k_1 \neq k_2 \in [K]$
, and
$C_{\rho }(k,k)=1$
for
$k \in [K]$
. Here,
$\xi _{k}$
and
$\epsilon _{k}$
are independent and identically distributed standard normal random variables, and
$\Phi _K$
represents the CDF of a K-variate normal distribution; see more details in Fang et al. (2021).
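For a numerical illustration of the closed form (14), the sketch below computes π_α with SciPy's multivariate normal CDF; the function name and parameter values are our own hypothetical examples.

```python
import numpy as np
from scipy.stats import multivariate_normal

def pattern_prob(alpha, lam0, Lam1, Sigma):
    """Marginal probability P(A = alpha) via Eq. (14).
    lam0: (K,) intercepts; Lam1: (K, D) slopes; Sigma: (D, D) trait cov."""
    sign = np.where(np.asarray(alpha) == 1, 1.0, -1.0)  # (-1)^(alpha_k + 1)
    V = Lam1 @ Sigma @ Lam1.T                            # K x K
    denom = np.sqrt(np.diag(V) + 1.0)
    upper = sign * lam0 / denom                          # probit thresholds
    C = np.outer(sign, sign) * V / np.outer(denom, denom)
    np.fill_diagonal(C, 1.0)                             # tetrachoric corr.
    return multivariate_normal(mean=np.zeros(len(lam0)), cov=C).cdf(upper)

# hypothetical K = 2 attributes, D = 1 higher-order trait
lam0 = np.array([0.3, -0.2])
Lam1 = np.array([[0.8], [0.5]])
Sigma = np.array([[1.0]])
total = sum(pattern_prob(a, lam0, Lam1, Sigma)
            for a in [(0, 0), (0, 1), (1, 0), (1, 1)])
```

The probabilities over all 2^K patterns sum to one, which is a useful sanity check on the sign and correlation conventions in (14).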
The first virtue of using the probit link lies in establishing model identifiability. Using (14), Fang et al. (2021) proved that the probit model’s identifiability boils down to identifying the parameters
$\boldsymbol {\lambda }^{k}=(\lambda _{0}^{k},\boldsymbol {\lambda }_{1}^{k})$
based on the threshold values
${\lambda _{0}^{1}}/({{\boldsymbol {\lambda }_{1}^{1}}^{\top } \boldsymbol {\Sigma _{\theta }}{\boldsymbol {\lambda }_{1}^{1}}+1})^{1/2}$
, and the pairwise tetrachoric correlations. This means that identifiability is reduced to a problem similar to the identifiability of linear factor models. In contrast, this property does not exist when using a logit link, which involves a complex convolution of Gaussian and logistic random variables.
The second advantage of using a probit link is on computation. First, the explicit form of the marginal distribution of
$\mathbf A$
in (14) significantly reduces the computational complexity of parameter estimation. When using an EM-type algorithm, once the conditional expectation
$E_{\boldsymbol {\theta }}[l(\boldsymbol {\theta },\mathbf {A},\mathbf {R})|\mathbf {A},\mathbf {R}]$
, where
$l(\boldsymbol {\theta },\mathbf {A},\mathbf {R})$
is the log-likelihood, is computed, the whole conditional expectation
$E_{\mathbf {A}}[{E_{\boldsymbol {\theta }}[l(\boldsymbol {\theta },\mathbf {A},\mathbf {R})|\mathbf {A},\mathbf {R}]|\mathbf {R}}]$
in the E-step is available, as the outer-layer expectation can be easily computed according to Equation (14). Second, the probit link enables the development of an efficient sampling scheme, as it allows direct sampling of
$ \boldsymbol {\theta } $
from the target distribution
$ \boldsymbol {\theta } | \mathbf {A}, \mathbf {R}, \boldsymbol {\lambda _1}, \boldsymbol {\lambda _0} $
. As detailed in Section 4.2.1,
$ \boldsymbol {\theta } | \mathbf {A}, \mathbf {R}, \boldsymbol {\lambda _1}, \boldsymbol {\lambda _0} $
follows a unified skew-normal distribution, whose samples can be obtained by a direct combination of samples from truncated normal and multivariate normal distributions.
3 Identifiability
We next propose conditions that guarantee the identifiability of HO-GRCDMs. For a statistical model
$\{P_\nu \}$
indexed by a set of parameters
$\nu $
, we say that the model is identifiable at the true parameter
$\nu _0$
when the equality between the marginal distributions of the observed variables
$P_\nu = P_{\nu _0}$
implies equal parameter values
$\nu = \nu _0$
. Identifiability is a fundamental prerequisite for consistent parameter estimation and valid model interpretation.
We first present separate identifiability conditions for (a) the bottom layer exploratory GR-CDM and (b) the higher-order continuous latent layer as if the binary attributes were observed binary responses. Then, we combine these results using a layer-wise proof argument similar to that in Gu (2024) and derive identifiability conditions for HO-GRCDMs. More specifically, we first marginalize out the top continuous layer and identify the GR-CDM parameters, including the proportion parameters
$\boldsymbol \pi =(\pi _{\boldsymbol \alpha };~\boldsymbol \alpha \in \{0,1\}^K)$
describing the marginal distribution of the binary attributes. Next, we use the identified GR-CDM proportion parameters
$\boldsymbol \pi $
and (14) to identify the parameters in the higher-order probit model. We define saturated GR-CDMs as follows.
Definition 1 (Saturated GR-CDM).
A saturated GR-CDM with parameters
$(\boldsymbol {\pi }, \boldsymbol {\beta }, Q)$
is a CDM without a higher-order structure, defined by (1) and with proportion parameters
$\pi _{\boldsymbol \alpha }$
for each binary attribute pattern
$\boldsymbol \alpha \in \{0,1\}^K$
. Here,
$\boldsymbol {\pi } = (\pi _{\boldsymbol \alpha })$
satisfies
$\sum _{\boldsymbol \alpha \in \{0,1\}^K} \pi _{\boldsymbol \alpha } = 1,$
and
$\pi _{\boldsymbol \alpha }> 0.$
We impose a monotonicity condition on the item parameters to avoid the sign-flipping for each latent attribute; that is, to distinguish between
$A_k = 0$
and 1. Motivated by the popular monotonicity conditions for binary-response CDMs (Xu & Shang, 2018), we assume that
$$ \begin{align} \nu_{j,\boldsymbol{\alpha}}> \nu_{j,\boldsymbol{\alpha}'} \quad \text{for any } \boldsymbol{\alpha} \succeq \mathbf{q}_j \text{ and any } \boldsymbol{\alpha}' \not\succeq \mathbf{q}_j \end{align} $$
holds for all saturated GR-CDMs, and also for all HO-GRCDMs. Here,
$\mathbf {q}_j$
is the jth row of Q. The assumption (15) means that the students possessing all required skills for the jth item have a larger
$\nu $
-parameter in the distribution
$R^j \mid \mathbf A$
than those who lack some required skills. This condition can be further simplified under specific GR-CDMs. For example, (15) is equivalent to
$\beta ^j_{\mathcal {K}_j}> 0$
under the DINA model, and to
$\beta _k^j> 0$
for
$q_{j,k} =1$
under the ACDM.
Proposition 1. Under the saturated DINA/main-effect/all-effect GR-CDM that satisfies the monotonicity condition (15), the model components
$(\boldsymbol {\pi }, \boldsymbol {\beta }, Q)$
are identifiable up to a permutation of the K latent attributes when the following conditions hold.
- A. The true Q-matrix contains two identity submatrices
$\mathbf {I}_K$
after row swapping, i.e., Q can be written as
$Q = [\mathbf {I}_K, \mathbf {I}_K, Q^{*\top }]^\top .$
- B. Suppose that the Q-matrix is written as in A. For any
$\boldsymbol \alpha \neq \boldsymbol \alpha '$
, there exists
$j> 2K$
such that
$\nu _{j,\boldsymbol \alpha } \neq \nu _{j, \boldsymbol \alpha '}$
.

In particular, condition B holds when the Q-matrix contains a third identity submatrix
$\mathbf {I}_K$
.
Note that Proposition 1 is a general result that applies to any GR-CDM, not necessarily one with a higher-order structure. Conditions A and B resemble popular identifiability conditions for CDMs with categorical responses (Culpepper, Reference Culpepper2019; Xu & Shang, Reference Xu and Shang2018). Lee & Gu (Reference Lee and Gu2024) showed that these conditions also suffice for identifying GR-CDMs, but under the confirmatory setting with a known Q-matrix. Proposition 1 above further ensures that if the unknown Q-matrix satisfies these conditions, then the Q-matrix itself and the model parameters are both uniquely identifiable.
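Condition A is straightforward to check in practice: containing two identity submatrices after row swapping is equivalent to each single-attribute row appearing at least twice. A minimal sketch (the Q-matrix below is made up for illustration):

```python
import numpy as np

def satisfies_condition_A(Q):
    """Check condition A: after row swapping, Q contains two identity
    submatrices I_K, which holds iff each single-attribute row e_k
    appears at least twice among the rows of the 0/1 matrix Q."""
    K = Q.shape[1]
    counts = [np.sum(np.all(Q == np.eye(K, dtype=int)[k], axis=1))
              for k in range(K)]
    return all(c >= 2 for c in counts)

Q = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [1, 0, 0], [0, 1, 0], [0, 0, 1],
              [1, 1, 0], [0, 1, 1]])
print(satisfies_condition_A(Q))  # True: rows 1-3 and 4-6 form two copies of I_3
```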
Next, we present identifiability conditions for the higher-order probit layer in HO-GRCDMs. Here, we view the probit model as a parametric family with parameters
$(\boldsymbol {\lambda }_0, \boldsymbol {\lambda }_1, \boldsymbol {\Sigma _{\theta }})$
and probability mass function in (14). Recall that we consider the subscale model and the bifactor model with a known higher-order loading structure
$Q^{(H)}$
to model the K binary attributes.
We first present necessary and sufficient conditions for the subscale model, whose proof heavily exploits the properties of the probit link mentioned after (14). This identifiability result may be of independent interest in the IRT literature. We present the proofs of all theoretical results in the Supplementary Material.
Proposition 2. Consider the subscale model with K attributes and D-dimensional Gaussian latent factors. For
$d \in [D]$
, let
$K_d = \sum _{k = 1}^K q_{k,d}^{(H)}$
denote the number of attributes that belong to group d. Then, the model is identifiable if and only if, for each
$d \in [D]$
, one of the following conditions holds:

- C1.
$K_d \ge 3$
;
- C2.
$K_d = 2$
and
$\sigma _{d d'} \neq 0$
for some
$d' \neq d$
.
To summarize conditions C1 and C2: the subscale model with three or more attributes in each group is identifiable. Interestingly, identifiability requires at least two attributes to belong to each group, which boils down to assuming condition A on the
$Q^{(H)}$
-matrix.
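Conditions C1 and C2 can be verified mechanically from a candidate loading structure. The sketch below encodes the two conditions as stated; the $Q^{(H)}$ and covariance values are made up for illustration.

```python
import numpy as np

def subscale_identifiable(QH, Sigma):
    """Check conditions C1/C2 of Proposition 2 for a subscale model.
    QH is K x D with exactly one 1 per row (group membership); Sigma is
    the D x D factor covariance.  A sketch of the stated conditions only."""
    D = QH.shape[1]
    for d in range(D):
        K_d = int(QH[:, d].sum())
        c1 = K_d >= 3
        c2 = K_d == 2 and any(Sigma[d, dp] != 0 for dp in range(D) if dp != d)
        if not (c1 or c2):
            return False
    return True

QH = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]])  # groups of size 3 and 2
Sigma_corr = np.array([[1.0, 0.4], [0.4, 1.0]])          # correlated factors
Sigma_ind = np.eye(2)                                    # independent factors
print(subscale_identifiable(QH, Sigma_corr))  # True: C1 for d=1, C2 for d=2
print(subscale_identifiable(QH, Sigma_ind))   # False: d=2 has K_d=2, no correlation
```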
Next, we present identifiability conditions for the bifactor model, where the D-dimensional latent vector
$\boldsymbol {\theta }$
consists of one general factor
$\theta _1$
and
$D-1$
group-specific effects
$\theta _2, \ldots , \theta _D$
. Here, it is necessary to assume an additional orthogonal structure between the general and group-specific factors to resolve a trivial rotational ambiguity issue (see Fang et al., Reference Fang, Guo, Xu, Ying and Zhang2021, Section 4.1). For each
$d \in [D-1]$
, let
$L_d \subseteq [K]$
be the collection of attributes that belong to the dth group. We also let
$\lambda _{1,d'}^{L_d}$
be a sub-vector of
$\lambda _{1,d'}=(\lambda _{1,d'}^1,\ldots ,\lambda _{1,d'}^K)^\top $
that consists of the entries indexed by
$k \in L_d$
.
Proposition 3 (Fang et al. (Reference Fang, Guo, Xu, Ying and Zhang2021, Theorem 7)).
Consider the bifactor model with K binary attributes and D-dimensional Gaussian latent variables with covariance
$\Sigma _\theta = \begin {pmatrix} 1 & \mathbf {0}^\top \\ \mathbf {0} & \Sigma _{\theta }^\star \end {pmatrix},$
where
$\Sigma _{\theta }^\star $
is a
$(D-1) \times (D-1)$
symmetric positive definite matrix with
$\text {Diag}(\Sigma _{\theta }^\star ) = I_{D-1}$
. Let
$H := \{d \in [D-1]: \boldsymbol {\lambda }_{1,1}^{L_d} \mbox { and } \boldsymbol {\lambda }_{1,d+1}^{L_d} \mbox { are linearly independent}\}$
. Then, the model is identifiable if (16) and one of the following conditions hold:
- C3.
$|H| \ge 3$
;
- C4.
$|H| = 2$
, and there exists a group d such that
$L_d$
can be partitioned into
$L_{d,1}$
and
$L_{d,2}$
so that
$\boldsymbol {\lambda }_{1,1}^{L_{d,a}}$
and
$\boldsymbol {\lambda }_{1,d+1}^{L_{d,a}}$
are linearly independent for both parts
$a = 1,2$
.
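The index set H in Proposition 3 is a rank computation on 2-column submatrices of the loading matrix. The sketch below computes H for a made-up bifactor loading matrix (all numerical values are illustrative assumptions).

```python
import numpy as np

def compute_H(Lambda1, groups):
    """Compute the set H of Proposition 3: groups d for which the
    general-factor loadings and the d-th group-specific loadings,
    restricted to the attributes L_d in that group, are linearly
    independent.  Lambda1 is K x D with column 0 the general factor;
    `groups` maps d -> list of attribute indices L_d."""
    H = []
    for d, L_d in groups.items():
        sub = Lambda1[np.ix_(L_d, [0, d])]  # columns for theta_1 and theta_{d+1}
        if np.linalg.matrix_rank(sub) == 2:
            H.append(d)
    return H

K, D = 6, 4
rng = np.random.default_rng(1)
Lambda1 = np.zeros((K, D))
Lambda1[:, 0] = rng.uniform(0.5, 1.5, K)      # general factor loads on all attributes
groups = {1: [0, 1], 2: [2, 3], 3: [4, 5]}    # D - 1 = 3 group-specific factors
for d, L_d in groups.items():
    Lambda1[L_d, d] = rng.uniform(0.5, 1.5, len(L_d))

H = compute_H(Lambda1, groups)
print(len(H))  # 3 for generic loadings, so condition C3 holds
```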
Combining the above results, we can establish the desired identifiability result for HO-GRCDMs. The main argument is to sequentially identify the latent layers, similar to previous works for multilayer variants of CDMs (Gu, Reference Gu2024; Gu & Dunson, Reference Gu and Dunson2023).
To resolve the latent attribute permutation issue in Proposition 1, we additionally assume that there is a pre-specified anchor item for each latent attribute. This means for each latent attribute
$A_k$
, we know that some item
$j_k\in [J]$
solely measures
$A_k$
. Combined with condition A, we can assume without loss of generality that the first K items are the anchor items for the K binary attributes.
Theorem 1. The following two identifiability conclusions hold.
-
(a) Consider an HO-GRCDM with a subscale higher-order layer and known anchor items. The model is identifiable when the true bottom layer GRCDM parameters
$Q, \boldsymbol {\beta }$
satisfy conditions A and B, and the true latent layer parameters
$Q^{(H)}, \Sigma _\theta $
satisfy either condition C1 or C2. -
(b) Next, consider an HO-GRCDM with a bifactor higher-order layer and known anchor items. The model parameters are identifiable when the true parameters satisfy conditions A, B, (16), and either C3 or C4.
Theorem 1 provides our main result guaranteeing that HO-GRCDMs are identifiable. Note that even with the anchor-item assumption, we still identify all of the other
$J-K$
rows of the Q-matrix.
One may notice that the knowledge of anchor items is usually not required for identifying exploratory CDMs without a higher-order structure. However, this anchor assumption is required for our unique partially exploratory setting where we want to utilize the available higher-order
$Q^{(H)}$
-matrix. Without anchor items, the Q-matrix in the bottom CDM layer is only identifiable up to a label permutation of the K attributes, just like in typical one-layer CDMs with a fully unknown Q-matrix; in this case, the rows of the
$Q^{(H)}$
-matrix cannot be aligned with the columns of the Q-matrix. To relax this anchor-item requirement, we also consider a fully exploratory setting in Section 5, where both Q and
$Q^{(H)}$
are treated as unknown.
4 A new MCEM algorithm
In this section, we develop an MCEM algorithm to estimate HO-GRCDMs. We first define some notation. Suppose there are N subjects responding to a test with J items. Let
$i=1,\ldots , N$
and
$j=1,\ldots , J$
denote the indices of subjects and items, respectively. The test is designed to measure K binary latent attributes for each subject i,
$\mathbf {A}_i=(A_{i1},\ldots , A_{iK})^{\top }$
, which is further determined by D higher-order continuous latent variables,
$\boldsymbol {\theta }_i=(\theta _{i1},\ldots , \theta _{iD})^{\top }$
. We slightly abuse notation and use
$\mathbf {R} =[R_{ij}]_{N\times J}=(\mathbf {R}_1, \ldots , \mathbf {R}_N)^\top $
to denote the observed response matrix. Let
$\boldsymbol {\Theta } = (\boldsymbol {\theta }_1, \ldots , \boldsymbol {\theta }_N)^\top $
be the
$N \times D$
matrix that collects continuous latent variables for the N subjects, and define
$\mathbf {A} = (\mathbf A_1, \ldots , \mathbf A_N)^\top $
as the
$N \times K$
matrix consisting of the binary attribute profiles for all N subjects.
Let
$\boldsymbol {\beta }$
and
$\boldsymbol {\lambda }$
denote the set of all the coefficient parameters in the first (1) and second layer (2), respectively. Additionally, we aim to estimate the Q-matrix in the CDM layer and the covariance matrix
$\boldsymbol {\Sigma _{\theta }}$
for the continuous latent traits (3). Note that estimating Q by directly maximizing the marginal log-likelihood is computationally infeasible even for moderate J and K, because one would need to compute the profile likelihood for each of the
$2^{J\times K}$
possible Q-matrices and find the one that maximizes it. One way to avoid such expensive computation is to treat the estimation of Q as a latent variable selection problem and solve it through a regularized maximum likelihood estimator (Chen et al., Reference Chen, Liu, Xu and Ying2015). In particular, we maximize the regularized marginal log-likelihood
$l(\boldsymbol {\beta },\boldsymbol {\lambda },\boldsymbol {\Sigma _{\theta }}|\mathbf {R})$
with an
$L_1$
penalty
$p_{s}(\beta )$
:
where the log-likelihood function can be written as
$$ \begin{align} l(\boldsymbol{\beta},\boldsymbol{\lambda},\boldsymbol{\Sigma_{\theta}},Q|\mathbf{R})=&\sum_{i=1}^{N} \log \left\{ \sum_{\boldsymbol \alpha\in\left\{0,1\right\}^K} \prod_{j=1}^{J} P(R_{ij}|\boldsymbol \alpha;\boldsymbol{\beta}^j) \int P(\boldsymbol \alpha|\boldsymbol{\theta};\boldsymbol{\lambda})f(\boldsymbol{\theta}|\boldsymbol{\Sigma_{\theta}}) {\mathrm{d}}\boldsymbol{\theta} \right\}, \end{align} $$
and the penalty function is defined as
$$ \begin{align} p_{\mathbf{s}}(\boldsymbol{\beta})=\sum_{j=1}^{J} p_{s}(\boldsymbol{\beta}^j)=\sum_{j=1}^{J} \left(s\sum_{\beta^j_k\in\boldsymbol{\beta}^j}|\beta^j_k|\right), \end{align} $$
where s is the regularization parameter. Here, the Q-matrix does not need to appear explicitly in the log-likelihood expression because its information is implicitly captured by the sparsity of the coefficients. After solving (17),
$ Q $
can be estimated by identifying the non-zero pattern of
$ \widehat {\boldsymbol {\beta }} $
.
The mechanism for identifying the entries
$q_{jk}$
in the Q-matrix varies across different measurement models. For main-effect models,
$ Q $
can be recovered using the rule
$ q_{jk} = \mathbf {1}(\beta _{k}^{j} \neq 0) $
, where
$ \mathbf {1} $
is the indicator function. For all-effect models, theoretically, each row of
$ Q $
should be identified by the highest-order non-zero coefficient. Specifically, if
$ \exists S \subseteq [K] $
such that
$ \beta ^j_{S} \neq 0 $
and
$ \beta ^j_{S'} = 0 $
for all
$ S' \subseteq [K], S \subset S' $
, then
$ q_{jk} = 1 $
for
$ k \in S $
; otherwise,
$ q_{jk} = 0 $
However, this strict rule may not always be applicable because some estimated
$\beta $
-coefficients may be close to zero but not exactly zero. In practice, a more effective approach is either to choose the largest non-zero interaction coefficient or to truncate the coefficients before identifying
$ Q $
. For the latter approach, we recommend practitioners set the truncation thresholds based on the general magnitude of their estimated coefficients. For the DINA model, since there should be only one non-zero coefficient for each item
$ j $
, the largest non-zero interaction coefficient can be selected, and the corresponding
$ q_{jk} $
can be identified as equal to one.
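The main-effect recovery rule above can be sketched in a few lines. The threshold below is a user-chosen truncation level (an assumption for illustration, not a value prescribed by the text) that absorbs near-zero estimates left by the $L_1$ penalty; all coefficient values are made up.

```python
import numpy as np

def recover_Q_main_effect(B_hat, tau=0.1):
    """Recover the Q-matrix from estimated main-effect coefficients via
    q_{jk} = 1(|beta_k^j| > tau), where tau is a truncation threshold
    chosen from the general magnitude of the estimates."""
    return (np.abs(B_hat) > tau).astype(int)

# Hypothetical estimates for J = 3 items, K = 2 attributes (intercepts omitted)
B_hat = np.array([[0.9, 0.02],
                  [0.03, 1.1],
                  [0.7, 0.6]])
print(recover_Q_main_effect(B_hat))
# [[1 0]
#  [0 1]
#  [1 1]]
```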
The maximization problem presented in Equation (17) is quite complex due to the summation of integrals inside the log function and cannot be solved directly. The Expectation–Maximization (EM) algorithm is a popular method that iterates between the E-step and the M-step to seek the maximizer, and we first introduce the penalized variant of the EM algorithm in Section 4.1. This basic EM algorithm still suffers from the intractable integrals in the E-step, which motivates us to propose a novel, more scalable MCEM algorithm in Section 4.2.
4.1 Penalized EM algorithm
We first introduce the basic procedure of a penalized EM algorithm. A usual EM algorithm alternates between two steps: in the E-step, the expected complete data log-likelihood is computed, and in the M-step, the parameters are updated by maximizing this expected log-likelihood. The penalized EM algorithm follows the same procedure but includes a penalty term in the M-step to regularize the parameter estimates.
The complete data log-likelihood in an HO-GRCDM is
$$ \begin{align} l(\boldsymbol{\beta}, \boldsymbol{\lambda}, \boldsymbol{\Sigma_{\theta}} | \mathbf{R}, \mathbf{A}, \boldsymbol{\Theta}) &= \log \left( \prod_{i=1}^{N} \prod_{j=1}^{J} P(R_{ij} | \mathbf{A}_i; \boldsymbol{\beta}^j) P(\mathbf{A}_i | \boldsymbol{\theta}_i; \boldsymbol{\lambda}) f(\boldsymbol{\theta}_i | \boldsymbol{\Sigma_{\theta}}) \right)\nonumber\\ &\overset{\triangle}{=} l_1(\boldsymbol{\beta}, \mathbf{R}, \mathbf{A}) + l_2(\boldsymbol{\lambda}, \boldsymbol{\theta}_i, \mathbf{A}) + l_3(\boldsymbol{\theta}_i, \boldsymbol{\Sigma_{\theta}}), \end{align} $$
with
$$ \begin{align} l_1(\boldsymbol{\beta}, \mathbf{R}, \mathbf{A}) =& \sum_{i=1}^{N} \sum_{\boldsymbol \alpha \in \left\{0,1\right\}^K} \mathbf{1}(\mathbf{A}_i = \boldsymbol \alpha) \sum_{j=1}^J \log \left( P\left( R_{ij} | \mathbf{A}_i = \boldsymbol \alpha; \boldsymbol{\beta}^j \right) \right), \end{align} $$
$$ \begin{align} l_2(\boldsymbol{\lambda}, \boldsymbol{\theta}_i, \mathbf{A}) =& \sum_{i=1}^{N} \sum_{\boldsymbol \alpha \in \left\{0,1\right\}^K} \mathbf{1}(\mathbf{A}_i = \boldsymbol{\alpha}) \sum_{k=1}^K \left( \alpha_k \log \Phi\left(\boldsymbol{\theta}_i^{\top} {\boldsymbol{\lambda_1}^k} + \lambda_0^k \right) \right. \nonumber\\ & + (1 - \alpha_k) \log \left( 1 - \Phi\left(\boldsymbol{\theta}_i^{\top} {\boldsymbol{\lambda_1}^k} + \lambda_0^k \right) \right)\Big), \end{align} $$
$$ \begin{align} l_3(\boldsymbol{\theta}_i, \boldsymbol{\Sigma_{\theta}}) = & \sum_{i=1}^N \left( -\frac{D}{2} \log (2\pi) - \frac{1}{2} \log|\boldsymbol{\Sigma_{\theta}}| - \frac{1}{2} \boldsymbol{\theta}_i^{\top} \boldsymbol{\Sigma_{\theta}}^{-1} \boldsymbol{\theta}_i \right). \end{align} $$
Let
$\boldsymbol {\eta }=(\boldsymbol {\beta }, \boldsymbol {\lambda },\boldsymbol {\Sigma _{\theta }})$
generically denote the collection of all parameters. For any parameter, a superscript “
$(t)$
” denotes the values obtained in the tth iteration. Each iteration t of the penalized EM algorithm contains the following two steps:
E-Step: Compute the Q-function as the expectation of the complete data log-likelihood:
where the conditional expectation is with respect to
$P(\mathbf {A},\boldsymbol {\Theta }|\mathbf {R})$
. We break down
$Q^{(t)}(\boldsymbol {\beta }, \boldsymbol {\lambda }, \boldsymbol {\Sigma _{\theta }})$
into three parts,
$Q^{(t)}(\boldsymbol {\beta }, \boldsymbol {\lambda }, \boldsymbol {\Sigma _{\theta }}) = {Q_1}^{(t)}(\boldsymbol {\beta })+{Q_2}^{(t)}(\boldsymbol {\lambda })+{Q_3}^{(t)}(\boldsymbol {\Sigma _{\theta }}),$
with
$$ \begin{align} {Q_1}^{(t)}(\boldsymbol{\beta}) &= E_{\mathbf{A}}[{l_1}|\mathbf{R};\widehat{\boldsymbol{\eta}}^{(t-1)}] = \sum_{i=1}^{N} \sum_{\boldsymbol{\alpha}\in\left\{0,1\right\}^K} \sum_{j=1}^J \log \left(P\left(R_{ij}|\boldsymbol{\alpha};\boldsymbol{\beta}^j\right)\right) \psi_{i,\boldsymbol{\alpha}}^{(t-1)}, \end{align} $$
Here, the common term
$\psi _{i,\boldsymbol {\alpha }}^{(t-1)}$
has the following expression,
$$ \begin{align}\psi_{i,\boldsymbol{\alpha}}^{(t-1)} &=P(\mathbf{A}_i = \boldsymbol{\alpha}|\mathbf{R}_i;\widehat{\boldsymbol{\beta}}^{(t-1)}, \widehat{\boldsymbol{\lambda}}^{(t-1)},\widehat{\boldsymbol{\Sigma}}_{\theta}^{(t-1)})\\ \notag &=\frac{P(\mathbf{R}_i|\mathbf{A}_i = \boldsymbol{\alpha};\widehat{\boldsymbol{\beta}}^{(t-1)})P(\mathbf{A}_i = \boldsymbol{\alpha};\widehat{\boldsymbol{\lambda}}^{(t-1)},\widehat{\boldsymbol{\Sigma}}_{\theta}^{(t-1)})}{ \sum_{\boldsymbol{\alpha}'\in\left\{0,1\right\}^K} P(\mathbf{R}_i|\mathbf{A}_i = \boldsymbol{\alpha}';\widehat{\boldsymbol{\beta}}^{(t-1)})P(\mathbf{A}_i = \boldsymbol{\alpha}';\widehat{\boldsymbol{\lambda}}^{(t-1)},\widehat{\boldsymbol{\Sigma}}_{\theta}^{(t-1)}) }, \end{align} $$
and
${l_2}^{(i,\boldsymbol {\alpha })}$
and
${l_3}^{(i)}$
denote the corresponding terms in the summation indexed by i and
$\boldsymbol {\alpha }$
presented in Equations (22) and (23), respectively.
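The E-step weight $\psi_{i,\boldsymbol\alpha}$ is a Bayes-rule update. The following minimal sketch computes it for a toy Bernoulli bottom layer; the success probabilities and prior are made-up numbers for illustration only.

```python
import numpy as np
from itertools import product

# Toy posterior computation for psi_{i,alpha} with a Bernoulli bottom layer.
# p_correct[j, a]: success probability of item j given attribute pattern a
# (hypothetical values); prior[a] = P(A = alpha), the marginal distribution.
K = 2
patterns = list(product([0, 1], repeat=K))  # (0,0), (0,1), (1,0), (1,1)
prior = np.array([0.4, 0.2, 0.2, 0.2])
p_correct = np.array([[0.2, 0.2, 0.8, 0.8],   # item 1 measures attribute 1
                      [0.2, 0.8, 0.2, 0.8]])  # item 2 measures attribute 2

def posterior(responses):
    lik = np.ones(len(patterns))
    for j, r in enumerate(responses):
        lik *= p_correct[j] if r == 1 else 1 - p_correct[j]
    psi = lik * prior
    return psi / psi.sum()  # Bayes' rule, as in the E-step expression for psi

psi = posterior([1, 1])
print(patterns[int(np.argmax(psi))])  # (1, 1): both correct points to mastery
```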
M-Step: Update the parameters by maximizing the penalized Q-function:
This is typically a convex optimization problem for exponential-family responses.
4.2 MCEM algorithm with an efficient sampling scheme
4.2.1 Monte Carlo integration for E-Step
As mentioned earlier, the probit link offers a significant advantage by enabling direct computation of the conditional expectation
$E_{\mathbf {A}}[\cdot |\mathbf {R};\widehat {\boldsymbol {\eta }}^{(t-1)}]$
as shown in (25)–(27), due to the explicit form of
$P(\mathbf {A} = \boldsymbol {\alpha };\widehat {\boldsymbol {\lambda }}^{(t-1)},\widehat {\boldsymbol {\Sigma }}_{\theta }^{(t-1)})$
presented in (14). This simplifies the computational challenge of the E-Step to calculating the inner expectations,
$E_{\boldsymbol {\theta }_i}[\cdot |\mathbf {A}_i = \boldsymbol {\alpha },\widehat {\boldsymbol {\lambda }}^{(t-1)},\boldsymbol {\Sigma _{\theta }}^{(t-1)}]$
, as involved in (26) and (27). Furthermore, conditional on
$\mathbf {A}_i = \boldsymbol {\alpha }$
,
$\widehat {\boldsymbol {\lambda }}^{(t-1)}$
, and
$\widehat {\boldsymbol {\Sigma }}_{\theta }^{(t-1)}$
,
$\boldsymbol {\theta }_i$
’s are independent and identically distributed (i.i.d.). Using a general notation
$\boldsymbol {\theta }$
to represent
$\boldsymbol {\theta }_i$
in Equations (26) and (27), we have
$$ \begin{align} & E_{\boldsymbol{\theta}_i}\left[{l_2}^{(i,\boldsymbol{\alpha})}|\mathbf{A}_i = \boldsymbol{\alpha},\widehat{\boldsymbol{\lambda}}^{(t-1)},\boldsymbol{\widehat{\Sigma}_{\theta}}^{(t-1)}\right]\\ \notag &\quad=E_{\boldsymbol{\theta}}\left[\sum_{k=1}^K\Big( \alpha_k \log \Phi\left(\boldsymbol{\theta}^{\top} {\boldsymbol{\lambda_1}^k}+{{\lambda_0^k}}\right)+(1-\alpha_k) \log \left(1-\Phi\left(\boldsymbol{\theta}^{\top} {\boldsymbol{\lambda_1}^k}+{{\lambda_0^k}}\right)\right)\Big)\bigg|\mathbf{A}_i = \boldsymbol{\alpha},\widehat{\boldsymbol{\lambda}}^{(t-1)},\boldsymbol{\Sigma_{\theta}}^{(t-1)}\right], \end{align} $$
and
This implies when computing
${Q_2}^{(t)}$
or
${Q_3}^{(t)}$
, only
$2^K$
expectations,
$E_{\boldsymbol {\theta }}\left [\cdot |\mathbf {A}_i = \boldsymbol {\alpha },\widehat {\boldsymbol {\lambda }}^{(t-1)},\boldsymbol {\Sigma _{\theta }}^{(t-1)}\right ]$
, for
$\boldsymbol {\alpha } \in \left \{0,1\right \}^{K}$
, need to be evaluated, regardless of the sample size N.
Despite the significantly reduced computational complexity, the above expectations involve multidimensional integrals and cannot be evaluated in closed form. We propose to use Monte Carlo integration and draw
$M_t$
samples
$\boldsymbol {\theta }^{(m,\boldsymbol {\alpha })}$
,
$m=1,\ldots ,M_t$
from
$P(\boldsymbol {\theta }|\mathbf {A} = \boldsymbol {\alpha };\widehat {\boldsymbol {\lambda }}^{(t-1)},\boldsymbol {\Sigma _{\theta }}^{(t-1)})$
,
$\boldsymbol {\alpha }\in \left \{0,1\right \}^K$
, in the tth iteration. For any function
$w(\boldsymbol {\theta },\boldsymbol {\alpha })$
of
$\boldsymbol {\theta }$
and
$\boldsymbol {\alpha }$
, we can approximate its expectation as
$$ \begin{align} W^{(t)}(\boldsymbol{\alpha}):=E_{\boldsymbol{\theta}}[w(\boldsymbol{\theta},\boldsymbol{\alpha})|\mathbf{A} = \boldsymbol{\alpha},\widehat{\boldsymbol{\lambda}}^{(t-1)},\boldsymbol{\widehat{\Sigma}_{\theta}}^{(t-1)}]&\approx \frac{\sum_{m=1}^{M_t} w(\boldsymbol{\theta}^{(m,\boldsymbol{\alpha})},\boldsymbol{\alpha})}{M_t}. \end{align} $$
By replacing
$w(\boldsymbol {\theta },\boldsymbol {\alpha })$
with the corresponding terms within the square brackets in Equations (30) and (31), we obtain Monte Carlo approximations for these expectations.
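As a concrete check of the Monte Carlo approximation in (32), the sketch below uses $D = 1$, $\boldsymbol\theta \sim N(0,1)$, and $w(\theta) = \Phi(a\theta + b)$, a case where the probit link admits the closed form $E[\Phi(a\theta + b)] = \Phi(b/\sqrt{1+a^2})$; the values of a, b, and M are illustrative assumptions.

```python
import math
import numpy as np

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Monte Carlo approximation of W(alpha) = E_theta[w(theta)], illustrated with
# theta ~ N(0, 1) and w(theta) = Phi(a*theta + b), which has the closed form
# E[Phi(a*theta + b)] = Phi(b / sqrt(1 + a^2)) under the probit link.
a, b = 1.2, -0.3
rng = np.random.default_rng(0)
M = 100_000
theta = rng.standard_normal(M)
mc_estimate = np.mean([Phi(a * t + b) for t in theta])
exact = Phi(b / math.sqrt(1 + a * a))
print(abs(mc_estimate - exact) < 0.01)  # True: MC error shrinks as 1/sqrt(M)
```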
Sampling from
$f(\boldsymbol {\theta }|\mathbf {A} = \boldsymbol {\alpha };\widehat {\boldsymbol {\lambda }}^{(t-1)},\boldsymbol {\widehat {\Sigma }_{\theta }}^{(t-1)})$
has been a challenging issue in the literature on both multidimensional IRT models and higher-order CDMs. A commonly used approach is MCMC, including the Metropolis–Hastings (MH; Cai, Reference Cai2010) sampler and the Gibbs sampler (Béguin & Glas, Reference Béguin and Glas2001; Culpepper, Reference Culpepper2016). However, such methods suffer from slow convergence to the target distribution, thereby slowing down the algorithm. Fortunately, with a probit link used in the latent layers, direct sampling is feasible. By Li et al. (Reference Li, Gibbons and Ková2025, Theorem 4.2),
Here,
$SUN$
denotes a unified skew-normal distribution (Arellano-Valle & Azzalini, Reference Arellano-Valle and Azzalini2006), where
$\boldsymbol {\Delta }_c=\boldsymbol {\Sigma _{\theta }}{\mathbf {U}_1}^{\top }\mathbf {S}^{-1}$
,
$\boldsymbol {\gamma }_c=\mathbf {S}^{-1}\mathbf {U}_2$
,
$\boldsymbol {\Gamma }_c=\mathbf {S}^{-1}(\mathbf {U}_1\boldsymbol {\Sigma _{\theta }}{\mathbf {U}_1}^{\top }+{\mathbf {I}}_K){\mathbf {S}}^{-1}$
, and
$$ \begin{align} \mathbf{U}_1&= diag(2\alpha_1-1,\ldots,2\alpha_K-1)\boldsymbol{\lambda_1},\nonumber\\ \mathbf{U}_2&= diag(2\alpha_1-1,\ldots,2\alpha_K-1)\boldsymbol{\lambda_0},\nonumber\\ \mathbf{S}&= diag({{\mathbf{U}_1^1}^{\top}}\boldsymbol{\Sigma_{\theta}}{\mathbf{U}_1^1},\ldots,{{\mathbf{U}_1^K}^{\top}}\boldsymbol{\Sigma_{\theta}}{\mathbf{U}_1^K}), \end{align} $$
with
${\mathbf {U}_1^k}^\top $
denoting the kth row of
$\mathbf {U}_1$
. Furthermore, by Li et al. (Reference Li, Gibbons and Ková2025, Corollary 4.3), if (33) holds, then the conditional distribution of
$\boldsymbol {\theta }$
is equal to the following distribution (where “
$\overset {\mathrm {d}}{=}$
” means “equal in distribution”)
where
$\mathbf {V}_0$
is independent of
$\mathbf {V}_1$
and
Here,
$TN(\mathbf {0}_K,\mathbf {S}^{-1}({\mathbf {U}_1}\boldsymbol {\Sigma _{\theta }}{\mathbf {U}_1}^{\top }+{\mathbf {I}}_K)^{-1},-\boldsymbol {\infty },-\mathbf {S}^{-1}{\mathbf {U}_2})$
denotes a K-variate truncated normal distribution with zero mean and covariance matrix
$\mathbf {S}^{-1}({\mathbf {U}_1}\boldsymbol {\Sigma _{\theta }}{\mathbf {U}_1}^{\top }+{\mathbf {I}}_K)^{-1}$
and truncation below the threshold
$-\mathbf {S}^{-1}{\mathbf {U}_2}$
. This means that we can generate samples of
$ \boldsymbol {\theta } $
by first drawing samples of
$ \mathbf {V}_1 $
and
$ \mathbf {V}_0 $
, and then performing a linear combination of these. The sampling scheme is summarized in Algorithm 1. This approach is efficient, as sampling from both the multivariate normal distribution and the truncated normal distribution is straightforward.
4.2.2 The M-Step implementation
In the M-Step, the maximization of
${Q}^{(t)}$
can be divided into three optimization problems.

The optimization objective in (38) incorporates an
$L_1$
penalty and utilizes various functions
$g(\cdot )$
according to the response data type and measurement models. The optimization of (39) falls within the field of generalized linear models with a probit link. We use the coordinate ascent method, as described in Friedman et al. (Reference Friedman, Hastie and Tibshirani2010) and Tay et al. (Reference Tay, Narasimhan and Hastie2023), to solve both problems. This method is known for its flexibility and power in solving such optimization problems.
Regarding the optimization of Equation (40), to estimate a covariance matrix with the constraint that diagonal elements must be ones, we employ a method inspired by the approach used in Stan (Carpenter et al., Reference Carpenter, Gelman, Hoffman, Lee, Goodrich, Betancourt, Brubaker, Guo, Li and Riddell2017) and detailed by Lewandowski et al. (Reference Lewandowski, Kurowicka and Joe2009). Below, we elaborate on this method as a two-step estimation procedure. First, we estimate
$\boldsymbol {\Sigma _{\theta }}$
without considering the constraints, and denote the resulting estimator as
${\widehat {\boldsymbol {\Sigma }}_{\theta }^{*}}=(\widehat {\sigma }_{dd'}^*)$
. In each iteration t, according to Equations (27) and (31), this solution can be directly derived as,
$$ \begin{align} {{\widehat{\boldsymbol{\Sigma}}_{\theta}}^{*(t)}}=&\frac{1}{M_t N}\sum_{m=1}^{M_t} \sum_{\boldsymbol{\alpha}\in\left\{0,1\right\}^K} \left(\boldsymbol{\theta}^{(m,\boldsymbol{\alpha})} {\boldsymbol{\theta}^{(m,\boldsymbol{\alpha})}}^{\top}\sum_{i=1}^N \psi_{i,\boldsymbol{\alpha}}^{(t-1)}\right). \end{align} $$
For simplicity, we suppress the iteration index
$(t)$
and describe the computation for a generic
$\widehat {\boldsymbol {\Sigma }}_{\theta }^{*}$
.
Next, we compute the final estimator
$\widehat {\boldsymbol {\Sigma }}_{\theta }$
that satisfies the unit-diagonal constraint by reparameterizing the covariance via its Cholesky decomposition
${\boldsymbol {\Sigma }}_{\theta } = \mathbf {G}^{\top }\mathbf {G}$
. Let
$\mathbf {G}=[g_{dd'}]$
, and let
$\mathbf {g}_d$
denote the dth column of
$\mathbf {G}$
. To ensure the diagonal entries of
$\widehat {\boldsymbol {\Sigma }}_{\theta }$
are ones, the upper triangular Cholesky factor
$\mathbf {G}$
must satisfy
$\|\mathbf {g}_d\| = 1$
. Based on
$\widehat {\boldsymbol {\Sigma }}_{\theta }^{*}$
, we can use the following procedure to obtain
$\widehat {\mathbf {G}}$
:

Here,
$z_{dd'}=\tanh (\widehat {\sigma }_{dd'}^*)$
. After obtaining
$\widehat {\mathbf {G}}$
, we compute
$\widehat {\boldsymbol {\Sigma }}_{\theta }$
based on
$\widehat {\boldsymbol {\Sigma }}_{\theta } = \widehat {\mathbf {G}}^{\top }\widehat {\mathbf {G}}$
.
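The unit-diagonal constraint can also be illustrated with a simpler column-normalization variant of the Cholesky reparameterization: rescaling each column $\mathbf{g}_d$ of the upper-triangular factor to unit norm guarantees ones on the diagonal of $\mathbf{G}^\top\mathbf{G}$. This is a sketch in the spirit of the reparameterization above, not the exact tanh-based procedure of the text.

```python
import numpy as np

def project_to_unit_diagonal(Sigma_star):
    """Map an unconstrained covariance estimate Sigma* to one with unit
    diagonal: take Sigma* = G*^T G* (upper-triangular G*) and rescale each
    column of G* to unit Euclidean norm, so Sigma = G^T G satisfies
    diag(Sigma) = 1 by construction."""
    G_star = np.linalg.cholesky(Sigma_star).T   # upper-triangular factor
    G = G_star / np.linalg.norm(G_star, axis=0)  # unit-norm columns g_d
    return G.T @ G

Sigma_star = np.array([[2.0, 0.6],
                       [0.6, 1.5]])
Sigma_hat = project_to_unit_diagonal(Sigma_star)
print(np.allclose(np.diag(Sigma_hat), 1.0))  # True: diagonal constraint holds
```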
We summarize the above MCEM algorithm for the HO-GRCDM in Algorithm 2.

Remark 1. Many HO-GRCDMs may have additional dispersion parameters. In these cases, additional steps are needed in the M-Step to update these parameters, either explicitly or with the help of existing optimizers. For example, in the log-normal CDM case, there are additional standard deviation parameters
$\gamma _j$
that need to be estimated. After obtaining estimates for all
$\boldsymbol {\beta }^j$
, the
$\gamma _j$
can be solved explicitly by
$\gamma _j = {\sum _{i=1}^N\sum _{\boldsymbol {\alpha }\in \{0,1\}^K}(R_{ij} - \boldsymbol {\nu }_{j,\boldsymbol {\alpha }})}/({N \cdot 2^K})$
in each M-Step.
Remark 2. An important practical consideration when implementing the MCEM algorithm is the selection of the Monte Carlo sample size. To ensure valid statistical analysis, the numerical error from Monte Carlo approximation should be negligible relative to the statistical error, which typically decreases at a rate of
$1/\sqrt {N}$
. Since numerical error depends on the Monte Carlo sample size, a larger Monte Carlo sample size is required for larger sample sizes to maintain numerical accuracy. However, increasing the Monte Carlo sample size also raises computational cost. A practical approach to balancing accuracy and efficiency is to gradually increase the Monte Carlo sample size as the algorithm progresses. This ensures that numerical precision improves as estimates stabilize while keeping early iterations computationally efficient.
Remark 3. An alternative to the MCEM algorithm proposed in this article is the Stochastic Approximation EM (SAEM; Delyon et al., Reference Delyon, Lavielle and Moulines1999), which modifies the expectation approximation step. Instead of using Monte Carlo integration to approximate the expectation
${W}^{(t)}(\boldsymbol {\alpha })$
in Equation (32) at iteration t, SAEM employs a stochastic approximation procedure. Specifically, letting
$\tilde {\boldsymbol {\theta }}^{(t)}$
be a single sample drawn from
$f(\boldsymbol {\theta }|\mathbf {A} = \boldsymbol {\alpha };\widehat {\boldsymbol {\lambda }}^{(t-1)},\boldsymbol {\widehat {\Sigma }}_{\theta }^{(t-1)})$
, the expectation is updated iteratively as
$ \tilde {W}^{(t)}(\boldsymbol {\alpha })= (1-\gamma _t)\tilde {W}^{(t-1)}(\boldsymbol {\alpha })+\gamma _t w(\tilde {\boldsymbol {\theta }}^{(t)},\boldsymbol {\alpha }) $
, where
$\gamma _t$
is a pre-determined step size that decreases with t. SAEM is computationally efficient, requiring fewer Monte Carlo samples per iteration by leveraging previously generated samples, though it relies on careful tuning of step-size parameters. In contrast, MCEM avoids the step-size tuning challenge but may demand more computational resources due to repeated sampling at each iteration. Practitioners can flexibly select between SAEM and MCEM based on specific application requirements and computational resources. Additionally, other SA-based methods for latent variable models may also be considered, such as the stochastic optimization approach discussed in Zhang & Chen (Reference Zhang and Chen2022).
Remark 4. Initialization is an important issue for the EM algorithm. An efficient approach is to use the SVD-based algorithm to obtain the initial values (Chen et al., Reference Chen, Li and Zhang2019; Zhang et al., Reference Zhang, Chen and Li2020). This method leverages the low-rank nature of the design matrix to capture the principal components of the data, thus providing stable and informative starting points for the iterative fitting process. The detailed procedures for initialization are presented in the Supplementary Material.
5 Extension to the fully exploratory setting
In this section, we extend our identifiability theory and estimation method to the setting without a predefined higher-order latent structure. That is, no prior information about
$Q^{(H)}$
is assumed. In terms of identifiability, we present a set of identifiability conditions analogous to those in Section 3. In addition to the subscale/bifactor higher-order structure with an unknown
$Q^{(H)}$
-matrix, we also consider an unrestricted
$Q^{(H)}$
matrix (but with independent continuous factors
$\boldsymbol {\theta }$
). As the individual statements resemble those in Section 3, the following paragraphs mainly describe the subscale case and postpone additional results to Section B of the Supplementary Material.
The following proposition extends Proposition 2 by considering a subscale model with an unknown
$Q^{(H)}$
-matrix. In other words, we consider the case where the true
$Q^{(H)}$
matrix corresponds to a subscale model, but we do not assume any further knowledge of it. Note that D, the number of latent factors, is still assumed to be known. As in Proposition 2, let
$L_d$
denote the set of attributes that belong to group d. Additionally, assume that there are no all-zero rows in
$\lambda _1$
, or equivalently,
$\lambda _{1,d}^k \neq 0$
for all
$k \in L_d$
. This assumption is necessary for the exploratory setting, as the group membership of attribute k cannot be identified when
$\lambda _{1,d}^k = 0$
for all d.
Proposition 4. Consider an exploratory probit model with a subscale higher-order structure and no all-zero rows in
$\lambda _1$
. Then, under condition C1, the model is identifiable up to label switching of the latent factors
$(\theta _2, \ldots , \theta _D)$
.
Now, we establish identifiability of the fully exploratory HO-GRCDM, where both the Q-matrix and the
$Q^{(H)}$
-matrix are unknown. Here, the dimensions K and D of the two latent layers are assumed to be known, which is a common assumption in other identifiability results for exploratory settings (Chen et al., Reference Chen, Liu, Xu and Ying2015, Reference Chen, Culpepper and Liang2020; Lee & Gu, Reference Lee and Gu2025; Xu & Shang, Reference Xu and Shang2018). While analogous to Theorem 1, the following theorem does not require knowledge of anchor items.
Theorem 2. Consider an exploratory HO-GRCDM with a subscale higher-order structure with no all-zero rows in
$\lambda _1$
. This model is identifiable up to label switching in each layer, when the true parameters satisfy conditions A, B, and C1.
Regarding parameter estimation, the MCEM algorithm developed in Section 4 can be directly adapted to the fully exploratory HO-GRCDM scenario. The primary modification involves incorporating penalty terms into Equation (39), yielding the following modified optimization problem:
$$ \begin{align} \text{where~ }p_{s'}(\boldsymbol{\lambda}) = \sum_{k=1}^{K} p_{s'}(\boldsymbol{\lambda}^k) = \sum_{k=1}^{K} \left( s' \sum_{\lambda^k_{l}\in\boldsymbol{\lambda}^k}|\lambda^k_{l}|\right), \end{align} $$
and
$s'$
is a regularization parameter. The factor
$2^K$
reflects the number of possible latent attribute profiles, ensuring that the penalty term is appropriately scaled relative to the likelihood. This optimization formulation closely parallels Equation (38) and can thus be solved using the same coordinate ascent approach described there.
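As an illustrative sketch (in Python, rather than the R implementation used in this article), the elementwise L1 penalty above could be computed as follows; the toy loading matrix and the value of the tuning parameter are hypothetical:

```python
import numpy as np

def l1_penalty(lmbda, s):
    # p_{s'}(lambda) = s' * sum over all loading entries of |lambda|
    return s * np.abs(lmbda).sum()

# hypothetical loading matrix: K = 3 attributes by D = 2 higher-order factors
lmbda = np.array([[0.8, 0.0],
                  [0.0, -0.5],
                  [0.3, 0.0]])
pen = l1_penalty(lmbda, s=0.1)  # 0.1 * (0.8 + 0.5 + 0.3)
```

In the actual algorithm this term is added to the penalized objective, where it drives small loadings exactly to zero; that induced sparsity pattern is what allows the higher-order structure to be recovered.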
In the case where
$Q^{(H)}$
is known, the number of latent attributes K and that of higher-order latent factors D can be directly determined by
$Q^{(H)}$
’s dimensions. When
$Q^{(H)}$
is unknown, these latent dimensions must also be inferred from the data. A widely used approach is to compare models of varying dimensions using fit indices. This involves proposing a series of plausible combinations of
$(K,D)$
based on domain knowledge, fitting the model under each configuration, and selecting the optimal model using model-fit criteria such as BIC (Xu & Shang, Reference Xu and Shang2018) and DIC (Culpepper & Chen, Reference Culpepper and Chen2019). Although determining the optimal numbers of attributes and higher-order factors is an important problem from both theoretical and methodological perspectives, it is not our primary focus. Rather, our primary contribution lies in addressing the challenging task of estimating factor loadings without structural information on the loading matrices (i.e., Q-matrices), given the inherent complexity and hierarchical structure of the HO-GRCDM.
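The fit-index comparison described above can be sketched as follows; the log-likelihoods and parameter counts below are hypothetical placeholders standing in for actual model fits:

```python
import math

def bic(loglik, n_params, n_obs):
    # BIC = -2 * loglik + n_params * log(n_obs)
    return -2.0 * loglik + n_params * math.log(n_obs)

# hypothetical fit results for candidate (K, D) configurations:
# (maximized log-likelihood, number of free parameters) per fitted model
candidates = {
    (5, 2): (-36600.0, 180),
    (7, 3): (-36200.0, 240),
    (8, 2): (-36150.0, 300),
}
n = 1163  # sample size
best = min(candidates, key=lambda kd: bic(*candidates[kd], n))
```

In practice, each candidate (K, D) would be fitted with the MCEM algorithm before its BIC is computed.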
6 Simulation studies
We conduct extensive simulation studies to investigate the performance of the proposed MCEM algorithm for HO-GRCDMs. We primarily focus on the settings with a known
$ Q^{(H)} $
but also include a simulation study for the scenario with an unknown
$ Q^{(H)} $
in the Supplementary Material.
We consider three sample sizes:
$N = 500$
,
$1,000$
, and
$2,000$
with
$(J, K, D) = (30, 7, 3)$
. We examine three types of models for the bottom data layer: the main-effect model, the all-effect model, and the DINA model. Within each bottom-layer model category, three types of response models are considered: (a) Bernoulli CDM for binary data, (b) Poisson CDM for count data, and (c) Lognormal CDM for continuous data. In the Bernoulli CDM case, the logistic distribution is used to model the probability of a correct response. Since both the Lognormal and Gamma distributions can be used to model positive continuous responses, we consider the Lognormal distribution as a representative case and postpone the simulation under the Gamma distribution to the Supplementary Material. The distributions for each model are detailed in Equations (4), (5), and (7). The first K items serve as anchor items. The bottom layer Q-matrix,
$Q_{30\times 7}$
, is shown in Equation (45). For each bottom layer model, we explore two higher-order layer structures: the subscale and the bifactor structure. The true slope coefficients for the higher-order layer are also presented in Equation (45). The unconstrained parameters in
$\boldsymbol {\Sigma }_\theta $
are sampled from a uniform distribution
$U(0.1, 0.3)$
.
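To make the three bottom-layer response families concrete, the following hedged sketch draws a single response from a main-effect linear predictor under each distribution; the attribute profile, Q-matrix row, and coefficient values are toy choices that loosely mirror the constants of Section 6.1, not the actual values in Equation (45):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_item(alpha, q_row, beta0, beta, family, sigma=1.0):
    # main-effect linear predictor: eta = beta0 + sum_k beta_k * q_jk * alpha_k
    eta = beta0 + np.sum(beta * q_row * alpha)
    if family == "bernoulli":    # logistic link for the correct-response probability
        return rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    if family == "poisson":      # log link for count responses
        return rng.poisson(np.exp(eta))
    if family == "lognormal":    # normal on the log scale for positive responses
        return np.exp(rng.normal(eta, sigma))
    raise ValueError(family)

alpha = np.array([1, 0, 1])      # toy attribute profile (K = 3 here, not 7)
q_row = np.array([1, 1, 0])      # toy Q-matrix row for one item
y_bin = simulate_item(alpha, q_row, -2.0, np.array([4.0, 4.0, 4.0]), "bernoulli")
y_count = simulate_item(alpha, q_row, 0.5, np.array([1.0, 1.0, 1.0]), "poisson")
y_time = simulate_item(alpha, q_row, -1.0, np.array([3.0, 3.0, 3.0]), "lognormal")
```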
It is straightforward to verify that the above models are identifiable. Specifically,
$Q_{30\times 7}$
satisfies the conditions in Proposition 1, and
$\boldsymbol {\lambda _1}$
and
$\boldsymbol {\lambda _2}$
satisfy the conditions in Propositions 2 and 3, respectively. Therefore, the identifiability conditions in Theorem 1 are satisfied.

The Monte Carlo sample size
$M_t$
for sampling
$\theta $
is determined by the formula
$M_t = s_0 + s_1 \cdot (t-1)$
. In this study, we set
$s_0 = s_1 = 5$
, meaning five samples are simulated in the first iteration, with the number of samples increasing by five in each subsequent iteration. The convergence criterion is
$\max \| \widehat {\boldsymbol {\eta }}^{(t)} - \widehat {\boldsymbol {\eta }}^{(t-1)} \| < 0.04$
for three successive iterations. The threshold 0.04 is chosen because it yields satisfactory precision in our simulations, though a smaller value can be used for stricter convergence at the cost of increased computation time. Initialization uses the SVD-based algorithm mentioned in Remark 4 and detailed in the Supplementary Material. The coordinate ascent step in our algorithm is carried out using the R package glmnet (Hastie et al., Reference Hastie, Qian and Tay2021). For each model fit, we run our MCEM algorithm over a sequence of regularization parameters s, which vary across distributions and are listed in the Supplementary Material. The s that produces the smallest BIC value is then selected to finalize the model fit.
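The increasing Monte Carlo schedule and the stopping rule just described can be sketched as follows (a Python stand-in for the R implementation):

```python
import numpy as np

def mc_size(t, s0=5, s1=5):
    # M_t = s0 + s1 * (t - 1): five samples at iteration 1, growing by five
    return s0 + s1 * (t - 1)

def converged(history, tol=0.04, successive=3):
    # stop when max |eta^(t) - eta^(t-1)| < tol for `successive` iterations in a row
    if len(history) <= successive:
        return False
    diffs = [np.max(np.abs(history[i] - history[i - 1]))
             for i in range(len(history) - successive, len(history))]
    return all(d < tol for d in diffs)

sizes = [mc_size(t) for t in (1, 2, 3)]  # [5, 10, 15]
```

Increasing the Monte Carlo size over iterations reduces the variance of the E-step approximation precisely when the parameter estimates begin to stabilize.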
6.1 Simulations for main-effect HO-GRCDMs
For the main-effect models, we set the coefficients
$\beta ^j_k$
in the same way as Lee & Gu (Reference Lee and Gu2024):
where
$(c_0,c_1)$
are two constants, set to (
$-$
1,3) for the Lognormal-CDM, (0.5,1) for the Poisson-CDM, and (
$-$
2,4) for the Bernoulli-CDM. For the Lognormal-CDM, the dispersion parameter
$ \sigma _j $
is set to 1 for all J items. We conduct 100 independent replications in each setting.
We report the estimation accuracy for the continuous parameters in Table 1, by displaying the Root Mean Squared Errors (RMSE) and absolute biases (aBias). Note that the differences in the bottom layer parametric families may lead to results that are not directly comparable across these models. In particular, the Bernoulli model adopts a nonlinear parametrization for the correct response probability, so its probability parameters are on a different scale than those under other models. As shown in Table 1, the estimation accuracy for all the parameters improves as the sample size increases. Furthermore, to examine the recovery of the discrete Q-matrix, we report the proportion of correctly estimated rows and entries in Q in Table 2. It can be seen that the estimation accuracy of Q is reasonably high and improves as the sample size grows. The simulation results in Tables 1 and 2 provide empirical verification of our identifiability results.
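A hedged sketch of how the two accuracy metrics in Table 1 might be computed from replicated estimates (the toy numbers are illustrative, and aBias is taken here as one common definition, the absolute difference between the mean estimate and the truth):

```python
import numpy as np

def accuracy_metrics(estimates, truth):
    """RMSE and absolute bias across replications.

    estimates : (n_reps, n_params) array of per-replication estimates
    truth     : (n_params,) array of true parameter values
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    rmse = np.sqrt(np.mean((estimates - truth) ** 2, axis=0))
    abias = np.abs(np.mean(estimates, axis=0) - truth)
    return rmse, abias

# toy example: 3 replications, 2 parameters
est = np.array([[1.1, 0.4],
                [0.9, 0.6],
                [1.0, 0.5]])
truth = np.array([1.0, 0.5])
r, b = accuracy_metrics(est, truth)
```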
Table 1 RMSE and aBias for the main-effect HO-GRCDM

Table 2 Proportion of correctly recovered rows (
$P_R$
) and entries (
$P_E$
) in Q-matrix for the main-effect HO-GRCDM

6.2 Simulations for all-effect HO-GRCDMs
For the all-effect models, we set the true coefficients
$\beta ^j_k$
as
$\beta ^j_0 = c_0$
, and
where
$\mathcal {K}_j = \left \{k \in [K];\, q_{jk} = 1\right \}$
,
$S \subseteq [K]$, $S \neq \emptyset$
. Here,
$c_0$
and
$c_1$
are the same constants that we have defined in Section 6.1. Similar to the main-effect model scenario, we apply the proposed MCEM algorithm to fit each model and conduct 100 independent replications for each setting. RMSE and aBias are computed for evaluating estimation accuracy.
The simulation results in Table 3 show that the estimation accuracy for all parameters improves as the sample size increases. Compared to the main-effect case, the accuracy of the continuous parameters is lower, which is unsurprising given the significantly larger number of parameters in this case. To examine the recovery of
$ Q $
, we report the proportion of correctly estimated rows and entries in
$ Q $
in Table 4. As explained in the Q estimation section, following the strict rule tends to yield an overly dense matrix. Instead, we identify the
$ j $
th row of
$ Q $
based on the largest non-zero interaction coefficient for item
$ j $
. The estimation accuracy of
$ Q $
improves as the sample size grows. In fact, the recovery of
$ Q$
is better than in the main-effect case. This is likely due to the larger number of parameters in the all-effect case, which may help to recover the Q-matrix by providing more information about the dependence structure between the items and the attributes.
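The row-identification rule described above (taking the largest-magnitude non-zero interaction coefficient for item j) could look like the following sketch; the coefficient dictionary is a hypothetical set of estimates:

```python
import numpy as np

def identify_q_row(coef_by_subset, K, tol=1e-8):
    # pick the attribute subset S with the largest |beta^j_S| among
    # non-zero interaction coefficients, then set q_jk = 1 for k in S
    best_S, best_val = None, tol
    for S, b in coef_by_subset.items():
        if abs(b) > best_val:
            best_S, best_val = S, abs(b)
    q_row = np.zeros(K, dtype=int)
    if best_S is not None:
        q_row[list(best_S)] = 1
    return q_row

# hypothetical estimated all-effect coefficients for one item, K = 4
# (subsets are 0-indexed attribute tuples)
coefs = {(0,): 0.2, (2,): 0.15, (0, 2): 1.7, (1, 3): 0.05}
row = identify_q_row(coefs, K=4)
```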
Table 3 RMSE and aBias for the all-effect HO-GRCDM

Table 4 Proportion of correctly recovered rows (
$P_R$
) and entries (
$P_E$
) in Q-matrix for the all-effect HO-GRCDM

6.3 Simulations for DINA HO-GRCDMs
For the DINA models, we set the true coefficients
$\beta ^j_k$
as
$\beta ^j_0 = c_0$
, and
$$\begin{align*}\beta^j_S = \begin{cases} c_1 & \text{if } S = \mathcal{K}_j, \\ 0 & \text{otherwise}. \end{cases} \end{align*}$$
As mentioned in Section 2.1, the DINA model can be regarded as a special case of the all-effect model, so the same estimation procedure as for the all-effect model can be used to estimate the DINA model. Here, we set
$c_0$
and
$c_1$
to the same values as in Section 6.1 and implement the MCEM algorithm.
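The DINA coefficient pattern above, viewed as an all-effect model, can be sketched as follows; the Q-matrix row and (c0, c1) values are toy inputs:

```python
import numpy as np
from itertools import chain, combinations

def dina_all_effect_coefs(q_row, c0, c1):
    """All-effect coefficients realizing DINA: only the full required set
    K_j = {k : q_jk = 1} carries a non-zero interaction effect c1."""
    K = len(q_row)
    Kj = tuple(k for k in range(K) if q_row[k] == 1)
    subsets = chain.from_iterable(combinations(range(K), r) for r in range(1, K + 1))
    coefs = {S: (c1 if S == Kj else 0.0) for S in subsets}
    return c0, coefs

def linear_predictor(alpha, c0, coefs):
    # eta = beta0 + sum_S beta_S * prod_{k in S} alpha_k
    return c0 + sum(b * np.prod([alpha[k] for k in S]) for S, b in coefs.items())

q_row = [1, 1, 0]
c0, coefs = dina_all_effect_coefs(q_row, c0=-2.0, c1=4.0)
eta_master = linear_predictor([1, 1, 0], c0, coefs)   # masters both required skills
eta_partial = linear_predictor([1, 0, 1], c0, coefs)  # misses a required skill
```

Only examinees mastering every required attribute receive the boost c1, reproducing the conjunctive "and" gate of the DINA model.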
These simulation results are presented in Table 5. As demonstrated in Table 5, the estimation accuracy for all parameters improves as the sample size increases. Furthermore, to examine the recovery of
$ Q $
, we report the proportion of correctly estimated rows and entries in
$ Q $
in Table 6. Since there should be only one non-zero coefficient per item, we identify the
$ j $
th row of
$ Q $
based on the largest non-zero interaction coefficient for item
$ j $
. Again, the estimation accuracy of
$ Q $
improves as the sample size grows. The results are similar to those in the all-effect case. In the confirmatory case, the DINA model has fewer parameters than the all-effect model, making it easier to estimate. However, in the exploratory case, it is fitted as an all-effect model, leading to similar computational cost. This may explain the comparable estimation accuracy in the two cases.
Table 5 RMSE and aBias for the DINA HO-GRCDM

Table 6 Proportion of correctly recovered rows (
$P_R$
) and entries (
$P_E$
) in Q-matrix for the DINA HO-GRCDM

7 Real data analysis
In this section, we demonstrate the application of the HO-GRCDM to response time data from the 2019 TIMSS mathematics assessment (Fishbein et al., Reference Fishbein, Foy and Yin2021). We analyze the response time data collected from United Arab Emirates students responding to booklet 1 in the eighth-grade math assessment.Footnote 1 This dataset includes the time spent on each item screen (in seconds) by 1,599 students on 28 items within a total exam time of 45 minutes. Data points with log response times less than 0 or greater than 6 are regarded as outliers, potentially resulting from students’ random guessing, running out of time, or taking breaks during the exam. These outliers are treated as missing data (NA), and the corresponding observations are deleted, resulting in a total of 1,163 observations. The data are then transformed from seconds to minutes. A summary of descriptive information about the items is given in the Supplementary Material, including item type, item description, and the slope and location parameters obtained when fitting the two-parameter logistic (2PL) item response model.
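The outlier-screening step just described (log response times outside [0, 6] treated as missing, followed by a seconds-to-minutes conversion) can be sketched as follows; the toy time vector is hypothetical:

```python
import numpy as np

def preprocess_times(seconds):
    """Flag log response times outside [0, 6] as missing, then convert to minutes."""
    seconds = np.asarray(seconds, dtype=float)
    logt = np.log(seconds)  # screen times are positive in this data
    cleaned = np.where((logt < 0) | (logt > 6), np.nan, seconds)
    return cleaned / 60.0   # seconds -> minutes

times = np.array([0.5, 30.0, 120.0, 500.0])  # toy screen times in seconds
minutes = preprocess_times(times)            # [nan, 0.5, 2.0, nan]
```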
In psychometric modeling, the Q-matrix—or more generally, the item–attribute relationship—captures the conditional dependence of the observed outcomes associated with the items given the latent variables. While these outcome variables are typically categorical response data, they can also include process data such as response times, which reflect information about examinees’ latent traits. The use of a sparse loading matrix for response time modeling has been explored in the literature (Bolsinova & Tijmstra, Reference Bolsinova and Tijmstra2018; He et al., Reference He, Culpepper and Douglas2024). In particular, He et al. (Reference He, Culpepper and Douglas2024) adopt the same sparsity structure (analogous to the Q-matrix) for response time modeling as for response accuracy data, noting that this simplifies the interpretation of item–attribute relationships and enables simultaneous analysis of response behaviors across both dimensions. Moreover, in their empirical data analysis, they assess the benefit of including item–attribute coefficients (i.e., the sparsity structure) for response time by comparing model fit to an alternative model that removes this structure—retaining only item-level intercepts to capture general speed effects. Their DIC-based comparison suggests that incorporating item–attribute coefficients improves the overall model fit. These findings motivate us to assume the same provisional Q-matrices for modeling response time data as those used for response accuracy data. The TIMSS 2019 mathematics assessment is designed to measure three cognitive skills (Knowing, Applying, and Reasoning) and four content skills (Number, Algebra, Geometry, and Data and Probability), with each item specified to assess one cognitive skill and one content skill. The assessment provides a provisional Q-matrix and
$Q^{(H)}$
-matrix, as shown in Tables 7 and 8.
Table 7 Q-matrix of the TIMSS 2019 data set

Table 8
$Q^{(H)\top }$
matrix of the TIMSS 2019 data set

We apply the developed MCEM algorithm to fit a main-effect HO-GRCDM with a subscale higher-order structure (using the provided
$Q^{(H)}$
-matrix) to the dataset. We elaborate more regarding the rationale for choosing the structure of the HO-GRCDM in subsequent paragraphs. Here, note that we work under the exploratory framework with an unknown Q-matrix for the bottom layer, and do not use any information in the provisional Q-matrix. This is because our purpose is to derive findings on the test structure and item characteristics that might provide valuable insights for test developers by assessing the validity of the test design Q-matrix.
Regarding the choice of parametric families to model the response time, the log-normal distribution and Gamma distribution are two commonly used distributions (De Boeck & Jeon, Reference De Boeck and Jeon2019; Klein Entink et al., Reference Klein Entink, Fox and van der Linden2009; Maris, Reference Maris1993; Van der Linden, Reference Van der Linden2006). The log-normal model is often used when the logarithm of the response times follows a normal distribution, while the Gamma model is suitable for modeling positive continuous variables with skewed distributions. To choose between these two models, we first fit both models to the data and compute their BIC values. The obtained BIC values are 73,165.46 for the Gamma model and 86,664.80 for the log-normal model. Furthermore, we plot the probability histogram for the response time of each item and fit a density curve using the spline method. Using the estimated parameters obtained by fitting the log-normal and Gamma HO-GRCDM, their corresponding density curves are also plotted. We show plots for the last 10 items (items 19–28) in Figure 1 and present plots for all items in Section F of the Supplementary Material. By examining the histogram and comparing the density curves, we found that the response time variables for most items are right-skewed, and the Gamma model’s curve (red line) overlaps more with the empirical density (blue line) than the log-normal model (green line), indicating that the Gamma model has a better fit. Therefore, we use the Gamma distribution for the response time to fit a main-effect HO-GRCDM. We conduct an additional simulation study in Section D of the Supplementary Material to investigate the performance of the MCEM algorithm for estimating the main-effect HO-GRCDM with a Gamma distribution.
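The per-item family comparison can be illustrated as follows, using scipy maximum-likelihood fits on a synthetic right-skewed sample; the data and resulting BIC values here are illustrative, not those of the TIMSS items:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
t = rng.gamma(shape=2.0, scale=0.8, size=500)  # synthetic right-skewed times

def bic_fit(dist, data, k_params, **fit_kw):
    # fit the distribution by maximum likelihood and compute its BIC
    params = dist.fit(data, **fit_kw)
    loglik = np.sum(dist.logpdf(data, *params))
    return -2.0 * loglik + k_params * np.log(len(data))

bic_gamma = bic_fit(stats.gamma, t, k_params=2, floc=0)      # shape + scale
bic_lognorm = bic_fit(stats.lognorm, t, k_params=2, floc=0)  # sigma + scale
better = "gamma" if bic_gamma < bic_lognorm else "lognormal"
```

The same two-parameter count keeps the BIC comparison fair; the family with the smaller BIC is preferred.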

Figure 1 TIMSS data analysis.Note: Probability histogram and fitted density curves (empirical density, Gamma model, and log-normal model) for response time data (in minutes).
We also comment on the choice of the measurement model and the latent layer structure for our HO-GRCDM. It is common to use the main-effect models to understand response times (De Boeck & Jeon, Reference De Boeck and Jeon2019; Lee & Gu, Reference Lee and Gu2024; Maris, Reference Maris1993; Sternberg, Reference Sternberg1980). Lee & Gu (Reference Lee and Gu2024) discusses the rationale for adopting the additive model assumption in more detail. As for the high-order layer, rather than focusing on assessing general cognitive ability, our primary goal is to capture fine-grained distinctions between subskills, for which the subscale model is more appropriate.
Figure 2(a) presents a heatmap of the estimated bottom-layer parameters. The fitted model reveals a sparse structure in the bottom-layer coefficients: only three columns (the 1st, 4th, and 6th) have non-zero coefficients. In addition, there is an estimated positive correlation of 0.35 between the cognitive skill and the content skill. Using the estimated parameters, we compute
$ P(\mathbf {A}_i = \boldsymbol {\alpha }|\mathbf {R}_i;\ \widehat {\boldsymbol {\beta }}, \widehat {\boldsymbol {\lambda }}, \widehat {\boldsymbol {\Sigma }}_{\theta }) $
for each student, selecting the
$ \boldsymbol {\alpha }\in \{0,1\}^K $
with the highest value as the estimate of their attribute profile. To explore the sparse structure and explain the three attributes with non-zero coefficients, we compute correlations among the seven attributes based on the students’ estimated attribute profiles and show the results in Figure 2(b). The first observation from the correlation plot is that the three cognitive attributes, attributes 5–7, are extremely highly correlated, with correlations up to 0.98. There are also high correlations among the four content attributes (attributes 1–4), with a correlation of 0.98 between attributes 1 and 3, and a correlation of 0.76 between attributes 2 and 4. This indicates that the content attributes may be further divided into two groups: (1, 3) and (2, 4). Additionally, the correlations between a cognitive attribute and a content attribute are much smaller than those within the content attributes or within the cognitive attributes.
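The MAP profile computation described above enumerates all 2^K attribute patterns; a minimal sketch with a hypothetical posterior function:

```python
import numpy as np
from itertools import product

def map_profile(post_fn, K):
    # enumerate all 2^K binary profiles and return the one maximizing the posterior
    profiles = [np.array(a) for a in product((0, 1), repeat=K)]
    scores = [post_fn(a) for a in profiles]
    return profiles[int(np.argmax(scores))]

# hypothetical posterior that peaks at the profile mastering attributes 1 and 3
target = np.array([1, 0, 1])
post = lambda a: float(np.all(a == target)) + 0.1 * float(a.sum())

alpha_hat = map_profile(post, K=3)
```

With K = 7 this enumeration covers only 128 profiles per examinee, so the exhaustive search is inexpensive.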

Figure 2 TIMSS data analysis. (a) Heatmap for estimated bottom-layer parameters. (b) Correlation plot of the estimated latent attributes.
The high correlations in attribute groups (1,3), (2,4), and (5,6,7) indicate that attributes within the same group are hard to distinguish. This is also reflected in the estimated bottom-layer coefficients
$ \beta $
: only one attribute in each group has non-zero coefficients. The 6th column coefficients can be regarded as the coefficients for all of the cognitive skills: Knowing, Reasoning, and Applying. For attributes 1–4, we find that only items 1–15 have non-zero coefficients on the first attribute, with the majority measuring “Number” and “Algebra.” Items 24–28 have non-zero coefficients for attribute 4, and these items measure either “Geometry” or “Data and Probability.” These observations suggest that attribute groups (1,3), (2,4), and (5,6,7) correspond to (Number, Algebra), (Geometry, Data and Probability), and (Knowing, Reasoning, Applying), respectively.
The relationships among the attributes are both intuitive and interpretable; however, the estimated coefficients do not align with the test-design-based Q-matrix shown in Table 7. A similar dataset was analyzed in Lee & Gu (Reference Lee and Gu2024) using the designed Q-matrix, where their findings revealed interesting patterns of intrinsic dependence among attributes. These insights motivated our application of the HO-GRCDM to the TIMSS data. To further investigate this issue, we have conducted additional analyses to explore whether a finer-grained or a coarser-grained attribute specification could improve model fit. Specifically, we fitted HO-GRCDMs in a fully exploratory scenario with five and eight attributes (i.e.,
$(K, D) = (5,2)$
and
$(K, D) = (8,2)$
) and compared their fit to our current HO-GRCDM model. The details of these analyses are provided below.
On the one hand, the high correlations among attributes may suggest that a model with fewer attributes would be more appropriate. Our consideration of the
$ K=5 $
case is based on the observed significant correlations among the three cognitive attributes, indicating that they may effectively represent a single general cognitive attribute. When combined with the four content attributes, this yields a candidate model with
$ K = 5 $
. On the other hand, incorporating more fine-grained attributes with potentially greater homogeneity within examinee groups sharing the same mastery level might better leverage the advantages of CDMs. In addition to the Content and Cognitive domain classifications, TIMSS provides Topic Area information, specifying the skills measured by each item. These topic areas include “Fractions and Decimals,” “Integers,” “Ratio, Proportion, and Percent,” “Expressions, Operations, and Equations,” “Relationships and Functions,” “Geometric Shapes and Measurements,” “Probability,” and “Data.” These attributes define a finer-grained test design with
$ K=8 $
, potentially enhancing the distinction between mastery and non-mastery groups.
Regarding identifiability, neither test design necessarily guarantees an identifiable HO-GRCDM, in either the partially or the fully exploratory case. In the
$ K=5 $
case, the higher-order model may lack identifiability due to the presence of only a single cognitive attribute. In the
$ K=8 $
case, the topic area-derived test design does not ensure an identifiable bottom-layer model. Specifically, some attributes (e.g., “Ratio, Proportion, and Percent” and “Probability”) are measured by only a single item, making the fitted model unreliable. Despite these limitations, we consider these cases for model comparison to assess whether reducing the number of attributes (
$ K=5 $
) or adopting a finer-grained specification (
$ K=8 $
) improves model fit. We adopt the fully exploratory case for this analysis, as it may uncover a well-fitting and interpretable model even when the specified designs are non-identifiable under the partially exploratory setting. The BIC values obtained were 73,470.34 for
$ K=5 $
and 74,684.59 for
$ K=8 $
, both higher than 73,165.46 for HO-GRCDM with
$ K=7 $
, indicating that the current
$ K=7 $
structure provides a better model fit.
Beyond this empirical comparison, we briefly discuss the broader implications of attribute specification in test development, as it directly influences the quality of data and the diagnostic insights that can be drawn from model-based analyses. One important consideration is the granularity of the attributes. As a reviewer noted, overly coarse-grained attributes may lead to considerable heterogeneity among examinees within each mastery level, which can undermine the usefulness of binary classifications and weaken the conditional independence assumption of CDMs. To address this concern, two directions may be considered. First, when possible, attributes should be defined in a fine-grained manner that realistically supports the binary mastery/non-mastery distinction and induces conditional independence among item responses. Second, in cases where attributes are inherently broad and difficult to dichotomize meaningfully, modeling them as polytomous attributes may offer a more accurate and informative representation. These considerations highlight the important role of attribute specification in ensuring the validity and interpretability of CDMs and have direct implications for future test design and development.
8 Discussion
We have proposed a general modeling framework, HO-GRCDM, for CDMs with general responses and a higher-order structure. This framework offers high flexibility in (1) addressing various types of response data, (2) adapting to a variety of measurement models, and (3) accommodating an exploratory setting with an unknown Q-matrix. Furthermore, the model’s hierarchical structure provides rich representational power for uncovering higher-order cognitive information. We provide interpretable identifiability conditions in terms of the Q and
$Q^{(H)}$
matrices that ensure the validity and accuracy of model fitting. The probit link used for the higher-order layer facilitates our identifiability theory as well as the development of an efficient MCEM algorithm for parameter estimation. Compared to existing MCMC and EM methods, our MCEM algorithm has lower computational complexity due to an explicit conditional expectation formula for
$\boldsymbol {\alpha }$
and an efficient sampler for
$\boldsymbol {\theta }$
. Extensive simulation studies under various response types and measurement models are conducted to examine the efficiency of the proposed algorithm.
There are several promising directions for future work that build upon the HO-GRCDM framework. First, incorporating more than two layers would help explore deeper and more nuanced diagnostic information. Gu (Reference Gu2024) proposed a new family of DeepCDMs featuring multiple, potentially deep, entirely discrete latent layers for cognitive diagnosis. However, that work focuses on binary response variables and binary latent variables. It would be interesting to develop a framework applicable to general responses and incorporate multiple discrete latent layers. Second, while it is typical to consider binary attributes in CDMs, extending them to polytomous attributes can provide a more nuanced representation of latent skills (Chen & de la Torre, Reference Chen and de la Torre2013; Karelitz, Reference Karelitz2004; von Davier, Reference von Davier2005). Third, the model can be further extended by jointly modeling multiple types of responses (Man & Harring, Reference Man and Harring2023; Van der Linden, Reference Van der Linden2007; Wang, Zhang, et al., Reference Wang, Zhang, Douglas and Culpepper2018). Many assessments include a mix of binary, count, and continuous responses, each providing complementary diagnostic information. Extending HO-GRCDM to accommodate such heterogeneous responses while preserving its higher-order structure would enhance its flexibility and practical utility.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/psy.2025.15.
Acknowledgements
The authors thank the editor, associate editor, and three reviewers for helpful and constructive comments. This research was conducted when Jia Liu was a visiting PhD student in the Department of Statistics at Columbia University.
Funding statement
Yuqi Gu’s research is partially supported by the NSF grant DMS-2210796.
Competing interests
The authors declare none.