Hostname: page-component-745bb68f8f-s22k5 Total loading time: 0 Render date: 2025-01-07T11:14:07.436Z Has data issue: false hasContentIssue false

A Note on Improving Variational Estimation for Multidimensional Item Response Theory

Published online by Cambridge University Press:  01 January 2025

Chenchen Ma
Affiliation:
University of Michigan
Jing Ouyang
Affiliation:
University of Michigan
Chun Wang*
Affiliation:
University of Washington
Gongjun Xu*
Affiliation:
University of Michigan
*
Correspondence should be made to Chun Wang, College of Education, University of Washington, 312 E Miller Hall, 2012 Skagit Lane, Seattle, WA98105, USA. Email: wang4066@uw.edu
Correspondence should be made to Gongjun Xu, Department of Statistics, University of Michigan, 456 West Hall, 1085 South University, Ann Arbor, MI 48109, USA. Email: gongjun@umich.edu
Rights & Permissions [Opens in a new window]

Abstract

Survey instruments and assessments are frequently used in many domains of social science. When the constructs that these assessments try to measure become multifaceted, multidimensional item response theory (MIRT) provides a unified framework and convenient statistical tool for item analysis, calibration, and scoring. However, the computational challenge of estimating MIRT models prohibits its wide use because many of the extant methods can hardly provide results in a realistic time frame when the number of dimensions, sample size, and test length are large. Instead, variational estimation methods, such as Gaussian variational expectation–maximization (GVEM) algorithm, have been recently proposed to solve the estimation challenge by providing a fast and accurate solution. However, results have shown that variational estimation methods may produce some bias on discrimination parameters during confirmatory model estimation, and this note proposes an importance-weighted version of GVEM (i.e., IW-GVEM) to correct for such bias under MIRT models. We also use the adaptive moment estimation method to update the learning rate for gradient descent automatically. Our simulations show that IW-GVEM can effectively correct bias with modest increase of computation time, compared with GVEM. The proposed method may also shed light on improving the variational estimation for other psychometrics models.

Type
Theory & Methods
Copyright
Copyright © 2023 The Author(s), under exclusive licence to The Psychometric Society.

Developing, refining, and validating survey questionnaires that measure target latent traits such as personality or cognitive abilities has always been a core agenda in education and psychology, and this focus is also extended to health measurement and culminates in a multi-decade initiative on patient-reported outcome measures. Psychometric methods and tools are an integral part of achieving this focus. When the constructs that these assessments try to measure become increasingly complex, multidimensional item response theory (MIRT), also known as item factor analysis, provides a unified framework and convenient statistical tool for item analysis, calibration, and scoring. However, the increasing scale and complexity of survey designs, especially in large-scale assessments (LSA), require MIRT models with many latent factors. For instance, the English Language Proficiency Assessment for the 21st Century (ELPA21) across two gradebands consists of eight domain-level traits measured by more than 600 items (CRESST, 2017). The existing computational algorithms for fitting high-dimensional MIRT models are insufficient to navigate the massive amount of assessment data, reflected by excessively long computation time and unstable estimation results.

MIRT provides a powerful tool for enriching the information gained in educational assessment (Hartig & Höhler, Reference Hartig and Höhler2009). For instance, cognitive instructional psychology considers “science knowledge” and “mathematical ability” as highly differentiated theoretical constructs that consist of both basic facts and skills as well as deeper or higher-order understanding (Hamilton et al., Reference Hamilton, Nussbaum, Kupermintz, Kerkhoven and Snow1995; Kupermintz et al., Reference Kupermintz, Ennis, Hamilton, Talbert and Snow1995). As another example, the 2003 assessment framework of PISA (OECD, Reference OECD2003) contains a hierarchy of ability dimensions with general “knowledge and skills” at the highest level, followed by reading, math, science, and problem solving. Then at the lowest level are the sub-domains such as “space and shape,” “change and relationships,” and “quantity” nested within math. Hence, dimensions on different levels vary in their degree of generality and abstraction. Oftentimes, the highest level represents a broad competency level, whereas lower levels represent narrower and more specific abilities. If the intention is to model both the overall and lower-level abilities simultaneously, the model will be high dimensional (Briggs & Wilson, Reference Briggs and Wilson2003).

Even though the research and development in statistics and psychometrics have provided increasingly sophisticated measurement models to better assess constructs in social sciences, the practice still lags behind (Cai & Hansen, Reference Cai and Hansen2018). Unidimensional IRT models continue to dominate the current applications in many domains. One reason is that when the number of items, sample size, and the number of dimensions are all large, the current computational algorithms for MIRT estimation may not be powerful enough to produce results in a reasonable time frame (or ever) (CRESST, 2017). For instance, due to the large number of students and items within each gradeband, the operational analysis approach used for ELPA21 is a two-step approach: in the first step, a unidimensional IRT model is fitted to the item response data for each domain subtest to obtain item parameter estimates; then in the second step, a restricted hierarchical model [i.e., testlet model, Wainer et al. (Reference Wainer, Bradlow and Wang2007), Gibbons and Hedeker (Reference Gibbons and Hedeker1992), Cai et al. (Reference Cai, Yang and Hansen2011)] is fitted to estimate the correlations between the four domains (Thissen, Reference Thissen2013). Such a two-step process has two limitations: (1) the item parameter calibration errors are ignored in the second step, and (2) the restricted hierarchical model is only an approximation to the independent-cluster MIRT model. Various full-information methods have been proposed to deal with the computational challenge, which are listed below with pros and cons. The list is by no means exhaustive, but it includes some of the most popular methods that are available in commercial software packages or R packages.Footnote 1

  1. 1. Adaptive Gaussian quadrature. Compared to the regular Gauss-Hermite quadrature [e.g., Bock and Aitkin (Reference Bock and Aitkin1981)], even though the number of quadrature points per dimension is reduced, the total number of quadrature points still increases exponentially with the number of dimensions. Moreover, an extra step is needed to compute the posterior mode and variance of latent factors in each iteration, which adds additional computation costs (Pinheiro & Bates, Reference Pinheiro and Bates1995).

  2. 2. Monte Carlo techniques. This family of methods include, for instance, the Monte Carlo EM algorithm (McCulloch, Reference McCulloch1997; Wang & Xu, Reference Wang and Xu2015), stochastic EM algorithm (von Davier & Sinharay, Reference von Davier and Sinharay2010; Zhang et al., Reference Zhang, Chen and Liu2020), or Metropolis–Hastings Robbins–Monro algorithm (Cai Reference Cai2010a, Reference Caib). These methods circumvent intractable integrations by sampling from the posterior distributions; however, they may still computationally intensive for complicated high-dimensional models. Fully Bayesian estimation methods, such as Markov chain Monte Carlo (MCMC; Albert, Reference Albert1992; Patz & Junker, Reference Patz and Junker1999) can also be considered in this category. The Bayesian approach is also computationally costly as it needs a long chain to converge for complex models, though it is preferable with smaller sample sizes.

  3. 3. Analytic dimension reduction. For models assuming certain conditional independence among factors (such as the bi-factor models), the conditional independence relations can be used to partition the joint space of all latent variables into smaller subsets. As a result, brute force numerical integration over the joint latent space can be replaced by a sequence of integrations over smaller subsets of latent variables, which helps reduce the computation burden dramatically. This strategy to deal with high-dimensional integration challenges is known as analytic dimension reduction  (Cai et al., Reference Cai, Yang and Hansen2011; Gibbons & Hedeker, Reference Gibbons and Hedeker1992; Rijmen et al., Reference Rijmen, Vansteelandt and De Boeck2008). One limitation, though, is that the algebraic manipulations of the likelihood of a specific model might become very complicated, and they differ for different models (e.g., Cai et al., Reference Cai, Yang and Hansen2011; Gibbons & Hedeker, Reference Gibbons and Hedeker1992). Hence, there is no universal rule that applies to any model.

  4. 4. Laplace approximation. This method is based on second-order Taylor expansion of the log-integrand around its mode (Lindstrom & Bates, Reference Lindstrom and Bates1988) such that the high-dimensional integral becomes tractable. This method is a classical and popularly used method for generalized linear mixed-effects models (GLMM), and it is available in many software packages, such as the “lem4” R package (Bates et al., Reference Bates, chler, Bolker and Walker2014). However, this approximation may not be accurate when the dimension increases to 3 or higher, the sample size is small (Jeon et al., Reference Jeon, Rijmen and Rabe-Hesketh2017), or the likelihood function is skewed.

Besides the full-information methods above, a recent constraint joint maximum likelihood estimation (CJMLE) was proposed by Chen et al. (Reference Chen, Li and Zhang2019), which is more computationally efficient than many marginal maximum likelihood methods, and the estimator has the theoretical guarantee to be consistent under high-dimensional settings. Extending CJMLE, the singular value decomposition (SVD)-based estimator was proposed by Zhang et al. (Reference Zhang, Chen and Li2020a), which further improves the performance of CJMLE. These joint maximum likelihood methods enjoy the low computational cost but sacrifice the flexibility of latent factors by treating them as fixed effects. For instance, it would be hard conceptually to generalize the algorithm to a multiple-group condition in which unbiased estimation of group-specific population distributions is often needed than estimation of individual person’s latent trait as a fixed effect.

In light of the limitations of the above-mentioned methods, variational estimation methods that leverage advances in statistical and machine learning have recently gained increasing interests in psychometrics (Cho et al., Reference Cho, Wang, Zhang and Xu2021, Reference Cho, Xiao, Wang and Xu2022; Jeon et al., Reference Jeon, Rijmen and Rabe-Hesketh2017). Among numerous variational estimation methods, Rijmen & Jeon (Reference Rijmen and Jeon2013) was one of the first to use a variational estimation technique for MIRT models that approximates the likelihood function by a computationally tractable lower bound, but it only studied MIRT models with discrete latent factors. Later, a wide range of studies on variational methods were conducted for the estimation of more complex models  (Hui et al., Reference Hui, Warton, Ormerod, Haapaniemi and Taskinen2017; Natesan et al., Reference Natesan, Nandakumar, Minka and Rubright2016). Recently, Jeon et al. (Reference Jeon, Rijmen and Rabe-Hesketh2017) proposed variational maximization–maximization (VMM) algorithm for the generalized linear mixed models (GLMMs), which outperforms Laplace approximation with a small sample size. However, they rely on some iterative numerical algorithms to attain the solutions in each maximization step, resulting in a slow speed in running the algorithm. To further increase computational efficiency, many researchers brought up variational autoencoder (VAE), a deep learning-based variational method to tackle the estimation problems in MIRT models (Curi et al., Reference Curi, Converse, Hajewski and Oliveira2019; Wu et al., Reference Wu, Davis, Domingue, Piech and Goodman2020). Extending from VAE, the importance-weighted VAE (IW-VAE) is developed and exhibits competitive performances to other estimation methods  (Liu et al., Reference Liu, Wang and Xu2022; Urban & Bauer, Reference Urban and Bauer2021) at large sample sizes. However, the two IW-VAE methods lack theoretical support for the consistency of estimators. In addition, although they are powerful in handling large-scale data, their performances in small to medium-sample data may not be as well (see Appendix for more details). Cho et al. (Reference Cho, Wang, Zhang and Xu2021, Reference Cho, Xiao, Wang and Xu2022) proposed a Gaussian variational expectation–maximization (GVEM) algorithm, which has shown to be computationally fast and produces comparable and sometimes more accurate parameter estimates than the MH-RM algorithm and than the CJMLE method in high-dimensional exploratory item factor analysis models (i.e., M2PL and M3PL in Cho et al., Reference Cho, Wang, Zhang and Xu2021). Moreover, Cho et al. (Reference Cho, Wang, Zhang and Xu2021) proved that the estimated parameters from GVEM algorithm are consistent under the high-dimensional setting. However, we found that directly applying the GVEM algorithm in confirmatory MIRT models would generate relatively large bias on discrimination parameters, especially when the correlations among factors are high and the sample size is not large (please see Sect. 2 for the detailed simulation results). Such a bias issue happens commonly to variational estimation for various statistical models (Bishop, Reference Bishop2006).

To correct the bias in the variational algorithms for MIRT models, we propose an importance-weighted GVEM algorithm (denoted as IW-GVEM hereafter) , which is an extension of GVEM algorithm by performing additional steps after GVEM convergence. The primary idea is to use an importance-weighted variational inference technique to create a tighter variational lower bound to the target, otherwise intractable, marginal likelihood. Because the variational lower bound proposed in Cho et al. (Reference Cho, Wang, Zhang and Xu2021, Reference Cho, Xiao, Wang and Xu2022) is replaced by a weighted average based on importance sampling (Domke & Sheldon, Reference Domke and Sheldon2018), the desirable closed-form solution in the M-step is no longer applicable. Instead, we propose to use Adam (Kingma & Ba, Reference Kingma and Ba2014), a popular algorithm for first-order gradient-based optimization. This computationally efficient algorithm updates the objective function stochastically based on adaptive estimates of lower-order moments, and it is especially well suited for large data and complex models. Moreover, different from the IW-VAE methods rooted in deep neural network models where substantial theoretical works on the consistency of the estimators remain to be done, our proposed IW-GVEM is a more transparent method that comes with theoretical guarantees under the high-dimensional setting.

In what follows, this note briefly describes the M2PL model and the original GVEM algorithm and then introduces the IW-GVEM algorithm in Sect. 1, followed by a comprehensive simulation study in Sect. 2. We end the paper with discussions and future directions.

1. Methods

1.1. M2PL

Multidimensional 2PL model is one of the most widely used MIRT models in practice (Reckase, Reference Reckase2009). With M2PL, the item response function of the ith individual to the jth item is modeled by

(1) P ( Y ij = 1 θ i ) = exp ( a j θ i - b j ) 1 + exp ( a j θ i - b j ) , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} P(Y_{ij}=1 \mid \varvec{\theta }_i)= \frac{\exp ({\varvec{a}}_j^\top \varvec{\theta }_i-b_j)}{1+\exp ({\varvec{a}}_j^\top \varvec{\theta }_i-b_j)}, \end{aligned}$$\end{document}

where Y ij \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$Y_{ij}$$\end{document} for i = 1 , . . . , N \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i=1,...,N$$\end{document} and j = 1 , . . . , J \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$j=1,...,J$$\end{document} is a binary response, a j \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\varvec{a}}_j$$\end{document} denotes a K-dimensional vector of item discrimination parameters for item j, and b j \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$b_j$$\end{document} specifies the corresponding difficulty level with item difficulty parameter as b j / a j 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$b_j/\Vert {\varvec{a}}_j\Vert _2$$\end{document} . Following notations in Cho et al. (Reference Cho, Wang, Zhang and Xu2021), we use Y i \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\varvec{Y}}_i$$\end{document} to denote the response vector of the ith subject, and θ i \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\theta }_i$$\end{document} to denote the latent trait vector of the ith subject. We write A = ( α j , j = 1 , , J ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$A = (\varvec{\alpha _j}, j=1,\dots , J)$$\end{document} and B = ( b j , j = 1 , , J ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$B = (b_j, j = 1, \dots , J)$$\end{document} . For model identification, oftentimes the means and variances of θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\theta }$$\end{document} are fixed as zeros and ones, respectively, and the covariance (which is actually correlation) of θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\theta }$$\end{document} is freely estimated.

1.2. GVEM

Let Δ = ( A , B , ρ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Delta }=({{\varvec{A}},{\varvec{B}},\varvec{\rho }})$$\end{document} denote the set of unknown parameters for M2PL, where ρ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\rho }$$\end{document} denotes the correlations of θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\theta }$$\end{document} . As discussed, the population means of θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\theta }$$\end{document} are fixed at 0, and the population variances are fixed at 1. The correlations among θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\theta }$$\end{document} ’s can be freely estimated. Then, the log-marginal likelihood of responses Y \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\varvec{Y}}$$\end{document} is

(2) l ( Δ Y ) = i = 1 N log P ( Y i Δ ) = i = 1 N log j = 1 J P ( Y ij Δ , θ i ) ϕ ( θ i ) d θ i , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} l(\varvec{\Delta }\mid {\varvec{Y}})= \sum _{i=1}^N \log P({\varvec{Y}}_{i}\mid \varvec{\Delta })=\sum _{i=1}^N\log \int \prod _{j=1}^J P(Y_{ij}\mid \varvec{\Delta },\varvec{\theta }_i) \phi (\varvec{\theta }_i)d\varvec{\theta }_i, \end{aligned}$$\end{document}

where ϕ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\phi $$\end{document} denotes a K-dimensional Gaussian distribution of θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\theta }$$\end{document} with mean 0 and covariance Σ θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Sigma _{\varvec{\theta }}$$\end{document} . It is the potentially high-dimensional integration in Eq. (2) that makes direct maximization of the log-marginal likelihood computationally prohibitive. The log-likelihood of response Y \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{Y}$$\end{document} has an equivalent form

l ( Δ Y ) = i = 1 N θ i log P ( Y i Δ ) × q i ( θ i ) d θ i , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} l(\varvec{\Delta } \mid {\varvec{Y}})=\sum _{i=1}^N \int _{\varvec{\theta }_i} \log P({\varvec{Y}}_{i}\mid \varvec{\Delta }) \times q_i (\varvec{\theta }_i) d \varvec{\theta }_i, \end{aligned}$$\end{document}

where q i ( θ i ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q_i (\varvec{\theta }_i)$$\end{document} can be any probability density function satisfying θ i q i ( θ i ) d θ i = 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\int _{\varvec{\theta }_i} q_i (\varvec{\theta }_i) d \varvec{\theta }_i = 1$$\end{document} .

The main idea behind variational inference is to approximate the intractable integral in Eq. (2) with a computationally feasible form, known as the evidence lower bound (ELBO; Blei et al., Reference Blei, Kucukelbir and McAuliffe2017; Ormerod & Wand, Reference Ormerod and Wand2010). Because P ( Y i Δ ) = P ( Y i , θ i Δ ) / P ( θ i Y i , Δ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$P({\varvec{Y}}_{i}\mid \varvec{\Delta }) = P({\varvec{Y}}_{i}, \varvec{\theta }_i \mid \varvec{\Delta })/ P(\varvec{\theta }_i \mid {\varvec{Y}}_{i}, \varvec{\Delta })$$\end{document} , we write l ( Δ Y ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$l(\varvec{\Delta } \mid {\varvec{Y}})$$\end{document} as

l ( Δ Y ) = i = 1 N θ i log P ( Y i , θ i Δ ) P ( θ i Y i , Δ ) × q i ( θ i ) d θ i = i = 1 N θ i log P ( Y i , θ i Δ ) q i ( θ i ) P ( θ i Y i , Δ ) q i ( θ i ) × q i ( θ i ) d θ i = i = 1 N θ i log P ( Y i , θ i Δ ) q i ( θ i ) × q i ( θ i ) d θ i + K L { q i ( θ i ) P ( θ i Y i , Δ ) } , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} l(\varvec{\Delta } \mid {\varvec{Y}})= & {} \sum _{i=1}^N \int _{\varvec{\theta }_i} \log \frac{P({\varvec{Y}}_{i}, \varvec{\theta }_i \mid \varvec{\Delta })}{P(\varvec{\theta }_i \mid {\varvec{Y}}_{i}, \varvec{\Delta })} \times q_i (\varvec{\theta }_i) d \varvec{\theta }_i \\= & {} \sum _{i=1}^N \int _{\varvec{\theta }_i} \log \frac{P({\varvec{Y}}_{i}, \varvec{\theta }_i \mid \varvec{\Delta })q_i (\varvec{\theta }_i)}{P(\varvec{\theta }_i \mid {\varvec{Y}}_{i}, \varvec{\Delta })q_i (\varvec{\theta }_i)} \times q_i (\varvec{\theta }_i) d \varvec{\theta }_i \\= & {} \sum _{i=1}^N \int _{\varvec{\theta }_i} \log \frac{P({\varvec{Y}}_{i}, \varvec{\theta }_i \mid \varvec{\Delta })}{q_i (\varvec{\theta }_i)} \times q_i (\varvec{\theta }_i) d \varvec{\theta }_i + KL \{q_i (\varvec{\theta }_i) \mid P(\varvec{\theta }_i \mid {\varvec{Y}}_{i}, \varvec{\Delta }) \}, \end{aligned}$$\end{document}

where K L { q i ( θ i ) P ( θ i Y i , Δ ) } = θ i log q i ( θ i ) P ( θ i Y i , Δ ) × q i ( θ i ) d θ i \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$KL \{q_i (\varvec{\theta }_i) \mid P(\varvec{\theta }_i \mid {\varvec{Y}}_{i}, \varvec{\Delta }) \} = \int _{\varvec{\theta }_i} \log \frac{q_i (\varvec{\theta }_i) }{P(\varvec{\theta }_i \mid {\varvec{Y}}_{i}, \varvec{\Delta })} \times q_i (\varvec{\theta }_i) d \varvec{\theta }_i$$\end{document} is nonnegative. This is because

- K L { q i ( θ i ) P ( θ i Y i , Δ ) } = θ i log P ( θ i Y i , Δ ) q i ( θ i ) × q i ( θ i ) d θ i θ i P ( θ i Y i , Δ ) q i ( θ i ) - 1 × q i ( θ i ) d θ i θ i P ( θ i Y i , Δ ) d θ i - θ i q i ( θ i ) d θ i = 1 - 1 = 0 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} - KL \{q_i (\varvec{\theta }_i) \mid P(\varvec{\theta }_i \mid {\varvec{Y}}_{i}, \varvec{\Delta }) \}= & {} \int _{\varvec{\theta }_i} \log \frac{P(\varvec{\theta }_i \mid {\varvec{Y}}_{i}, \varvec{\Delta })}{q_i (\varvec{\theta }_i) } \times q_i (\varvec{\theta }_i) d \varvec{\theta }_i \\\leqslant & {} \int _{\varvec{\theta }_i} \left(\frac{P(\varvec{\theta }_i \mid {\varvec{Y}}_{i}, \varvec{\Delta })}{q_i (\varvec{\theta }_i) }-1\right) \times q_i (\varvec{\theta }_i) d \varvec{\theta }_i \\\leqslant & {} \int _{\varvec{\theta }_i} P(\varvec{\theta }_i \mid {\varvec{Y}}_{i}, \varvec{\Delta }) d \varvec{\theta }_i - \int _{\varvec{\theta }_i} q_i (\varvec{\theta }_i) d \varvec{\theta }_i \\ {}= & {} 1- 1 = 0 \end{aligned}$$\end{document}

Therefore, we have a lower bound of log-likelihood that

(3) l ( Δ Y ) i = 1 N θ i log P ( Y i , θ i Δ ) q i ( θ i ) × q i ( θ i ) d θ i = i = 1 N E q i ( θ i ) [ log P ( Y i , θ i Δ ) q i ( θ i ) ] = : E L B O , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} l(\varvec{\Delta } \mid {\varvec{Y}})\geqslant & {} \sum _{i=1}^N \int _{\varvec{\theta }_i} \log \frac{P({\varvec{Y}}_{i}, \varvec{\theta }_i \mid \varvec{\Delta })}{q_i (\varvec{\theta }_i)} \times q_i (\varvec{\theta }_i) d \varvec{\theta }_i \nonumber \\= & {} \sum _{i=1}^N E_{q_i(\varvec{\theta }_i)}\bigg [\log \frac{P({\varvec{Y}}_{i},\varvec{\theta }_i \mid \varvec{\Delta } )}{q_i(\varvec{\theta }_i)}\bigg ] =: ELBO, \end{aligned}$$\end{document}

where the last term i = 1 N E q i ( θ i ) [ log P ( Y i , θ i Δ ) q i ( θ i ) ] \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sum _{i=1}^N E_{q_i(\varvec{\theta }_i)}\bigg [\log \frac{P({\varvec{Y}}_{i},\varvec{\theta }_i \mid \varvec{\Delta } )}{q_i(\varvec{\theta }_i)}\bigg ]$$\end{document} is the ELBO for l ( Δ | Y ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$l(\varvec{\Delta }|{\varvec{Y}})$$\end{document} in Equation (2). Maximizing the log-marginal likelihood is then approximated by maximizing ELBO, and q i ( θ i ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q_i(\varvec{\theta }_i)$$\end{document} , the variational distribution, needs to be carefully chosen to minimize the gap between the log-marginal likelihood and its ELBO.

The key is to find q i ( θ i ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q_i(\varvec{\theta }_i)$$\end{document} so that ELBO approximates the marginal likelihood l ( Δ | Y ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$l(\varvec{\Delta }|{\varvec{Y}})$$\end{document} as close as possible. Note that when q i ( θ i ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q_i(\varvec{\theta }_i)$$\end{document} is the posterior density of θ i \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\theta }_i$$\end{document} , i.e., q i ( θ i ) = P ( θ i Y i , Δ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q_i(\varvec{\theta }_i)=P(\varvec{\theta }_i\mid {\varvec{Y}}_{i}, \varvec{\Delta } )$$\end{document} , maximizing ELBO is equivalent to Bock & Aitkin (Reference Bock and Aitkin1981)’s marginal maximum likelihood/expectation–maximization (MML/EM) algorithm. Instead, as the choice of q i ( θ i ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q_i(\varvec{\theta }_i)$$\end{document} determines the computational cost and success of the algorithm, Cho et al. (Reference Cho, Wang, Zhang and Xu2021, Reference Cho, Xiao, Wang and Xu2022) proposed a choice of q i ( θ i ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q_i(\varvec{\theta }_i)$$\end{document} that satisfied two criteria: (1) it is easy to maximize, and (2) it approximates the true log-marginal likelihood well. Due to the independence of the students’ responses in general IRT models, q i ( θ i ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q_i(\varvec{\theta }_i)$$\end{document} is selected for each individual separately. Specifically, under M2PL, the joint distribution of θ i \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\theta }_i$$\end{document} and Y i \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\varvec{Y}}_i$$\end{document} is,

(4) log P ( Y i , θ i α , b , ρ ) = j = 1 J { Y ij ( α j θ i - b j ) + log 1 1 + exp ( α j θ i - b j ) } + log ϕ θ ( θ i ) j = 1 J log e ξ ij 1 + e ξ ij + j = 1 J Y ij ( α j θ i - b j ) + j = 1 J b j - α j θ i - ξ ij 2 - j = 1 J η ( ξ ij ) { ( b j - α j θ i ) 2 - ξ ij 2 } + log ϕ θ ( θ i ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned}&\log P(Y_{i}, \varvec{\theta }_i \mid \varvec{\alpha }, \varvec{b}, { \varvec{\rho }}) \nonumber \\&\quad = \sum _{j=1}^J \Big \{ Y_{ij} (\varvec{\alpha }_j^\top \varvec{\theta }_i - b_j) + \log \frac{1}{1+\exp (\varvec{\alpha }_j^\top \varvec{\theta }_i - b_j)} \Big \} + \log \phi _{\varvec{\theta }}(\varvec{\theta }_i) \nonumber \\&\quad \ge \sum _{j=1}^J \log \frac{e^{\xi _{ij}}}{1+e^{\xi _{ij}}} + \sum _{j=1}^J Y_{ij}(\varvec{\alpha }_j^\top \varvec{\theta }_i - b_j) + \sum _{j=1}^J \frac{b_j -\varvec{\alpha }_j^\top \varvec{\theta }_i - \xi _{ij}}{2} \nonumber \\&\qquad - \sum _{j=1}^J \eta (\xi _{ij})\{(b_j -\varvec{\alpha }_j^\top \varvec{\theta }_i)^2 - \xi _{ij}^2\} + \log \phi _{\varvec{\theta }}(\varvec{\theta }_i) \end{aligned}$$\end{document}
(5) : = l ( Y i , θ i α , b , ρ , ξ ij ) , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned}&\quad := l(Y_i, \varvec{\theta }_i \mid \varvec{\alpha }, \varvec{b}, { \varvec{\rho }}, \xi _{ij}), \end{aligned}$$\end{document}

where ξ ij \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\xi _{ij}$$\end{document} is the variational parameter for the ith subject, which will be updated iteratively in the M-step of GVEM, and η ( ξ ij ) = ( 2 ξ i , j ) - 1 [ e ξ i , j / ( 1 + e ξ i , j ) - 1 / 2 ] \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\eta (\xi _{ij}) = (2 \xi _{i, j})^{-1}[e^{\xi _{i, j}} /(1+e^{\xi _{i, j}})-1 / 2]$$\end{document} . The derivation is as follows. Because the difficulty of handling the marginal distribution of P ( Y i ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$P({\varvec{Y}}_i)$$\end{document} mostly comes from the logistic sigmoid function, which makes the integration over θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta $$\end{document} not a closed form in the E-step. As a result, Cho et al. (Reference Cho, Wang, Zhang and Xu2021) used a local variational approximation method (Jordan, Reference Jordan2004). Denote x ij = b j - α j θ i \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$x_{ij} = b_j - \varvec{\alpha }_j^\top \varvec{\theta }_i$$\end{document} , the local variational method gives the following variational lower bound for the sigmoid function:

1 1 + exp ( α j θ i - b j ) = exp ( x ij ) 1 + exp ( x ij ) = max ξ ij exp ( ξ ij ) 1 + exp ( ξ ij ) exp x ij - ξ ij 2 - η ( ξ ij ) ( x ij 2 - ξ ij 2 ) exp ( ξ ij ) 1 + exp ( ξ ij ) exp x ij - ξ ij 2 - η ( ξ ij ) ( x ij 2 - ξ ij 2 ) , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \frac{1}{1+\exp (\varvec{\alpha }_j^\top \varvec{\theta }_i - b_j)}= & {} \frac{\exp (x_{ij})}{1+\exp (x_{ij}) } \\= & {} \max _{\xi _{ij}} \frac{\exp (\xi _{ij})}{ 1+\exp (\xi _{ij}) } \exp \left\{ \frac{x_{ij} - \xi _{ij} }{2} - \eta (\xi _{ij} ) (x_{ij}^2 - \xi _{ij}^2) \right\} \\\geqslant & {} \frac{\exp (\xi _{ij})}{ 1+\exp (\xi _{ij}) } \exp \left\{ \frac{x_{ij} - \xi _{ij} }{2} - \eta (\xi _{ij} ) (x_{ij}^2 - \xi _{ij}^2) \right\} , \end{aligned}$$\end{document}

and by applying the above lower bound to Eq. (4), we get Eq. (5), which provides a variational lower bound for log P ( Y i , θ i α , b , ρ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \log P(Y_{i}, \varvec{\theta }_i \mid \varvec{\alpha }, \varvec{b}, { \varvec{\rho }})$$\end{document} .

By variational inference theory, we can show that the variational distributions q i ( θ i ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q_i(\varvec{\theta }_i)$$\end{document} (for i = 1 , . . . , N \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i=1,..., N$$\end{document} ) that minimize the distances between the lower bound and the joint distribution follow a Gaussian distribution with closed-form mean and variance, i.e., q i ( θ i ) N ( θ i μ i , Σ i ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ q_i(\varvec{\theta }_i) \sim N(\varvec{\theta }_i \mid \mu _i,\Sigma _i )$$\end{document} where the mean parameter of the normal distribution is

(6) μ i = Σ i × j = 1 J { 2 η ( ξ i , j ) b j + Y ij - 1 2 } α j \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \mu _i ={\Sigma _i} \times \sum _{j=1}^J \Big \{ 2 \eta (\xi _{i,j} ) b_j +Y_{ij} - \frac{1}{2} \Big \} \varvec{\alpha }_j \end{aligned}$$\end{document}

and the covariance matrix is

(7) ( Σ i ) - 1 = ( Σ θ ) - 1 + 2 j = 1 J η ( ξ i , j ) α j α j . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \big (\Sigma _i \big )^{-1} = \big ({\Sigma _{\varvec{\theta }}} \big )^{-1} + 2\sum _{j=1}^J\eta (\xi _{i,j} ) \varvec{\alpha }_j \varvec{\alpha }_j^\top . \end{aligned}$$\end{document}

In the confirmatory model estimation, we update population covariance matrix Σ θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }_{\varvec{\theta }}$$\end{document} by

(8) Σ θ = 1 N i = 1 N ( Σ i + μ i μ i ) . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \varvec{\Sigma }_{\varvec{\theta }}=\frac{1}{N}\sum _{i=1}^N (\varvec{\Sigma }_i+\varvec{\mu }_i\varvec{\mu }_i^\top ). \end{aligned}$$\end{document}

But because we need to fix the diagonal elements of Σ θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }_{\varvec{\theta }}$$\end{document} during estimation to fix the scale, we propose to rescale Σ θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }_{\varvec{\theta }}$$\end{document} after the M-step converges, i.e.,

Σ θ = diag ( Σ θ ) - 1 Σ θ diag ( Σ θ ) - 1 , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \varvec{\Sigma }_{\varvec{\theta }}^{*}=\left(\left(\sqrt{\textrm{diag}(\varvec{\Sigma }_{\varvec{\theta }})}\right) ^{-1}\right) ^\top \varvec{\Sigma }_{\varvec{\theta }}\left(\sqrt{\textrm{diag}(\varvec{\Sigma }_{\varvec{\theta }})}\right) ^{-1}, \end{aligned}$$\end{document}

and the discrimination parameter needs to be rescaled accordingly, i.e., α j = α j diag ( Σ θ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\alpha }_j^*=\varvec{\alpha }_j\sqrt{\textrm{diag}(\varvec{\Sigma }_{\varvec{\theta }})}$$\end{document} . For the exploratory analysis, Σ θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }_{\varvec{\theta }}$$\end{document} is fixed at an identity matrix during estimation, and a post hoc rotation will then produce proper nonzero correlations. In the following, we assume that the GVEM algorithm has converged and we fix the variational parameter ξ ij \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\xi _{ij}$$\end{document} as the final estimates. In other words, we do not update ξ ij \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\xi _{ij}$$\end{document} in the later iterative steps and ξ ij \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\xi _{ij}$$\end{document} is fixed at the initialization GVEM step in Algorithm 1.

1.3. Importance Sampling

Referring back to the basic idea underlying variational inference, i.e., the ELBO for log-likelihood of response l ( Δ Y ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ l(\varvec{\Delta } \mid {\varvec{Y}})$$\end{document} in the inequality (3), it can be seen that a tighter lower bound is attained when R P ( Y i , θ i Δ ) / q i ( θ i ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R\equiv {P({\varvec{Y}}_{i},\varvec{\theta }_i \mid \varvec{\Delta } )}/{q_i(\varvec{\theta }_i)}$$\end{document} around its mean P ( Y i Δ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$P({\varvec{Y}}_{i}\mid \varvec{\Delta })$$\end{document} . Therefore, we can consider different random variables with the same mean that are more concentrated. For example, we can draw M i.i.d. samples from q ( z ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q({\varvec{z}})$$\end{document} , and average the estimates as in importance sampling (IS):

(9) R M = 1 M m = 1 M R m = 1 M m = 1 M p ( x , z m ) q ( z m ) , z m q ( · ) . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} R_M = \frac{1}{M}\sum _{m=1}^M R_m = \frac{1}{M} \sum _{m=1}^M \frac{p({\varvec{x}}, {\varvec{z}}_m)}{ q({\varvec{z}}_m)}, \ {\varvec{z}}_m \sim q(\cdot ). \end{aligned}$$\end{document}

This leads to a tighter “importance-weighted ELBO" (IW-ELBO) on log P ( x ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\log P ({\varvec{x}})$$\end{document} ,

(10) IW-ELBO M = E q ( Z ) log 1 M m = 1 M p ( z m , x ) q ( z m ) : = L M ( x ) . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \text {IW-ELBO}_M = E_{q({\varvec{Z}})}\left[ \log \frac{1}{M} \sum _{m=1}^M \frac{p({\varvec{z}}_m, {\varvec{x}})}{q({\varvec{z}}_m)} \right] := {\mathcal {L}}_M({\varvec{x}}). \end{aligned}$$\end{document}

It is shown that L M ( x ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathcal {L}}_M({\varvec{x}})$$\end{document} converge to log p ( x ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\log p({\varvec{x}})$$\end{document} as M goes to infinity (Burda et al., Reference Burda, Grosse and Salakhutdinov2015), which is summarized in the following result.

Proposition 1

For all M, the lower bounds satisfy

log p ( x ) L M + 1 L M . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \log p({\varvec{x}}) \ge {\mathcal {L}}_{M+1} \ge {\mathcal {L}}_M. \end{aligned}$$\end{document}

Moreover, if p ( x , z ) / q ( z | x ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p({\varvec{x}}, {\varvec{z}}) / q({\varvec{z}} | {\varvec{x}})$$\end{document} is bounded, then L M \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathcal {L}}_M$$\end{document} approaches log p ( x ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\log p({\varvec{x}})$$\end{document} as M goes to infinity.

Motivated by this result, we use the importance sampling method and calculate the derivatives of L M \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathcal {L}}_M$$\end{document} to further perform gradient-based optimization. Specifically, denote w m = p ( x , z m ) / q ( z m ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$w_m = p({\varvec{x}}, {\varvec{z}}_m) / q({\varvec{z}}_m) $$\end{document} , then the derivatives of L M \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathcal {L}}_M$$\end{document} with respect to θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\theta }$$\end{document} are

θ L M ( x ) = θ E q ( Z ) log 1 M m = 1 M w m = E q ( Z ) θ log 1 M m = 1 M w m = E q ( Z ) m = 1 M w ~ m θ log w m , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \nabla _{\varvec{\theta }}{\mathcal {L}}_M ({\varvec{x}})&= \nabla _{\varvec{\theta }} {E}_{q({\varvec{Z}} )}\left[ \log \frac{1}{M} \sum _{m=1}^M w_m \right] \\&= {E}_{q({\varvec{Z}})} \left[ \nabla _{\varvec{\theta }} \log \frac{1}{M} \sum _{m=1}^M w_m \right] \\&= {E}_{q({\varvec{Z}})} \left[ \sum _{m=1}^M {\tilde{w}}_m \nabla _{\varvec{\theta }} \log w_m \right] , \end{aligned}$$\end{document}

where w ~ m = w m / m = 1 M w m \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\tilde{w}}_m = w_m / \sum _{m'=1}^M w_{m'}$$\end{document} and

(11) θ log w m = θ log p ( x , z m ) - θ log q ( z m ) . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \nabla _{\varvec{\theta }} \log w_m = \nabla _{\varvec{\theta }} \log p({\varvec{x}}, {\varvec{z}}_m) - \nabla _{\varvec{\theta }} \log q({\varvec{z}}_m). \end{aligned}$$\end{document}

1.4. IW-GVEM

The primary idea of IW-GVEM is to replace Eq. (3) with importance-weighted ELBO as in Eq. (10). That is, for each i = 1 , , N \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i = 1,\dots , N$$\end{document} , we draw M samples from q i ( θ i ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q_i(\varvec{\theta }_i)$$\end{document} for S times:

θ i ( s , m ) q i ( θ i ) , for s = 1 , , S , m = 1 , , M . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \varvec{\theta }_i^{(s, m)} \sim q_i(\varvec{\theta }_i), \text { for } s = 1, \dots , S, m = 1, \dots , M. \end{aligned}$$\end{document}

Define w i ( s , m ) = p ( Y i , θ i ( s , m ) ) / q i ( θ i ( s , m ) ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$w_i^{(s, m)} = p(Y_i, \varvec{\theta }_i^{(s,m)}) / q_i(\varvec{\theta }_i^{(s,m)})$$\end{document} , where p ( Y i , θ i ( s , m ) ) = P ( Y i , θ i ( s , m ) α , b , ρ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p(Y_i, \varvec{\theta }_i^{(s,m)}) = P(Y_i, \varvec{\theta }_i^{(s,m)} \mid \varvec{\alpha }, {\varvec{b}}, { \varvec{\rho }})$$\end{document} as in Eq. (4), and q i ( θ i ( s , m ) ) N ( θ i ( s , m ) μ i , Σ i ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q_i(\varvec{\theta }_i^{(s,m)}) \sim N(\varvec{\theta }_i^{(s,m)} \mid \mu _i, \Sigma _i)$$\end{document} , then L M ( Y ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathcal {L}}_M ({\varvec{Y}})$$\end{document} can be approximated by

L M ( Y ) i = 1 N 1 S s = 1 S log 1 M m = 1 M w i ( s , m ) . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} {\mathcal {L}}_M ({\varvec{Y}}) \approx \sum _{i=1}^N \left(\frac{1}{S}\sum _{s=1}^S\left[ \log \frac{1}{M}\sum _{m=1}^M w_i^{(s, m)} \right] \right) . \end{aligned}$$\end{document}

Note w i ( s , m ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$w_i^{(s,m)}$$\end{document} is a function of parameters ( ξ i , α , b , ρ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(\xi _i, \varvec{\alpha }, {\varvec{b}}, { \varvec{\rho }})$$\end{document} .

To learn parameters, we use a stochastic gradient ascent method, which needs to calculate the gradients of L M ( Y ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathcal {L}}_M({\varvec{Y}})$$\end{document} . Based on Eq. (11), the gradients can be approximated by

α L M ( Y ) i = 1 N 1 S s = 1 S m = 1 M w ~ i ( s , m ) α log P ( Y i , θ i ( s , m ) α , b , ρ ) - α log q i ( θ i ( s , m ) Y i ) , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \nabla _{\varvec{\alpha }}{\mathcal {L}}_M ({\varvec{Y}}) \approx \sum _{i=1}^N \left(\frac{1}{S}\sum _{s=1}^S \sum _{m=1}^M \tilde{w}_i^{(s,m)} \nabla _{\varvec{\alpha }} \left[ \log P(Y_i, \varvec{\theta }_i^{(s,m)} \mid \varvec{\alpha }, {\varvec{b}}, { \varvec{\rho }}) - \nabla _{\varvec{\alpha }} \log q_i(\varvec{\theta }_i^{(s,m)} \mid Y_i) \right] \right) , \end{aligned}$$\end{document}

where w ~ i ( s , m ) = w i ( s , m ) / m = 1 M w i ( s , m ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{w}_i^{(s,m)} = w_i^{(s,m)} / \sum _{m'=1}^M w_i^{(s,m')}$$\end{document} . Note that q i ( θ i ( s , m ) Y i ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q_i(\varvec{\theta }_i^{(s,m)} \mid Y_i)$$\end{document} does not depend on the parameters in the current iteration. Therefore, we only need to calculate w ~ i ( s , m ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{w}_i^{(s,m)}$$\end{document} and α P ( Y i , θ i ( s , m ) α , b , ρ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\nabla _{\varvec{\alpha }} P(Y_i, \varvec{\theta }_i^{(s,m)} \mid \varvec{\alpha }, {\varvec{b}}, { \varvec{\rho }})$$\end{document} . Similarly we can calculate b L M ( Y ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\nabla _{ {\varvec{b}}}{\mathcal {L}}_M ({\varvec{Y}})$$\end{document} and Σ θ L M ( Y ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\nabla _{\Sigma _{\varvec{\theta }}}{\mathcal {L}}_M ({\varvec{Y}})$$\end{document} . Specifically, we have

(12) α j log P Y i , θ i ( s , m ) α , b , ρ = w ~ i ( s , m ) α j log P ( Y i , θ i ( s , m ) α , b , ρ ) = w ~ i ( s , m ) Y ij - 1 + 1 1 + exp ( α j θ i ( s , m ) - b j ) θ i ( s , m ) , b j log P Y i , θ i ( s , m ) α , b , ρ = w ~ i ( s , m ) b j log P ( Y i , θ i ( s , m ) α , b , ρ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \nabla _{\varvec{\alpha }_j} \log P\left(Y_i, \varvec{\theta }_i^{(s,m)} \mid \varvec{\alpha }, {\varvec{b}},\varvec{\rho }\right)= & {} \tilde{w}_i^{(s,m)} \nabla _{\varvec{\alpha }_j} \left[ \log P(Y_i, \varvec{\theta }_i^{(s,m)} \mid \varvec{\alpha }, {\varvec{b}}, { \varvec{\rho }}) \right] \nonumber \\= & {} \tilde{w}_i^{(s,m)} \left[ \left(Y_{ij} - 1 + \frac{1}{1 + \exp (\varvec{\alpha }_j^\top \varvec{\theta }_i^{(s,m)} - b_j)}\right) \varvec{\theta }_i^{(s,m)} \right] , \nonumber \\ \nabla _{b_j} \log P\left(Y_i, \varvec{\theta }_i^{(s,m)} \mid \varvec{\alpha }, {\varvec{b}},\varvec{\rho }\right)= & {} \tilde{w}_i^{(s,m)} \nabla _{b_j} \left[ \log P(Y_i, \varvec{\theta }_i^{(s,m)} \mid \varvec{\alpha }, {\varvec{b}}, { \varvec{\rho }}) \right] \end{aligned}$$\end{document}
(13) = w ~ i ( s , m ) 1 - Y ij - 1 1 + exp ( α j θ i ( s , m ) - b j ) , Σ θ log P Y i , θ i ( s , m ) α , b , ρ = w ~ i ( s , m ) Σ θ log P ( Y i , θ i ( s , m ) α , b , ρ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned}= & {} \tilde{w}_i^{(s,m)} \left[ 1 - Y_{ij} - \frac{1}{1 + \exp (\varvec{\alpha }_j^\top \varvec{\theta }_i^{(s,m)} - b_j)} \right] ,\nonumber \\ \nabla _{\Sigma _{\varvec{\theta }}} \log P\left(Y_i, \varvec{\theta }_i^{(s,m)} \mid \varvec{\alpha }, {\varvec{b}},\varvec{\rho }\right)= & {} \tilde{w}_i^{(s,m)} \nabla _{\Sigma _{\varvec{\theta }}} \left[ \log P(Y_i, \varvec{\theta }_i^{(s,m)} \mid \varvec{\alpha }, {\varvec{b}}, { \varvec{\rho }}) \right] \end{aligned}$$\end{document}
(14) = w ~ i ( s , m ) 1 2 Σ θ - 1 2 θ i ( s , m ) ( θ i ( s , m ) ) . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned}= & {} \tilde{w}_i^{(s,m)} \left[ \frac{1}{2}\Sigma _{\varvec{\theta }} - \frac{1}{2} \varvec{\theta }_i^{(s,m)} (\varvec{\theta }_i^{(s,m)})^\top \right] . \end{aligned}$$\end{document}

To summarize, in the ( t + 1 ) t h \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(t+1)th$$\end{document} iteration, we perform the following:

  1. 1. For i = 1 , , N \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i=1, \dots , N$$\end{document} , draw M samples from q i ( θ i ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q_i (\varvec{\theta }_i)$$\end{document} for S times.

  2. 2. Calculate w i ( s , m ) = P ( Y i , θ i ( s , m ) α , b , ρ ) / q i ( θ i ( s , m ) ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$w_i^{(s, m)} = P(Y_i, \varvec{\theta }_i^{(s,m)}\mid \varvec{\alpha }, {\varvec{b}}, { \varvec{\rho }}) / q_i(\varvec{\theta }_i^{(s,m)})$$\end{document} and w ~ i ( s , m ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{w}_i^{(s,m)} $$\end{document} = w i ( s , m ) / m = 1 M w i ( s , m ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$= w_i^{(s,m)} / \sum _{m'=1}^M w_i^{(s,m')}$$\end{document} .

  3. 3. Calculate the gradients according to Eqs.  (12), (13), and (14).

Algorithm 1 IW-GVEM for M2PL

Proper learning rate scheduling is important in gradient-based algorithms. In this work, we apply the Adaptive moment estimation (Adam) method (Kingma & Ba, Reference Kingma and Ba2014), which has been extensively used in deep learning research and applications, to adjust the learning rate in our training process. In Adam, we compute individual adaptive learning rates for each parameter from estimates of the first and second moments of the gradients. Specifically in the tth iteration, we calculate exponential moving averages of the gradient (denoted as v t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\varvec{v}}_t$$\end{document} ) and the squared gradient (denoted as s t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\varvec{s}}_t$$\end{document} ) with exponential decay rates β 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta _1$$\end{document} and β 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta _2$$\end{document} , respectively. The moving averages can be seen as estimates of the first and second moments of the gradients. Then, we correct these biased exponential moving averages by 1 - β 1 t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1 - \beta _1^t$$\end{document} and 1 - β 2 t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1 - \beta _2^t$$\end{document} , respectively, and update parameters using standardized gradients. The concrete steps of generic Adam are provided below, where g t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\varvec{g}}_t$$\end{document} is the gradient (corresponding to that in Eqs. (12), (13) and (14), respectively) in the tth iteration:

  1. 1. v t = β 1 v t - 1 + ( 1 - β 1 ) g t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\varvec{v}}_t = \beta _1 {\varvec{v}}_{t-1} + (1-\beta _1){\varvec{g}}_t$$\end{document} (update biased first moment estimate)

  2. 2. r t = β 2 r t - 1 + ( 1 - β 2 ) g t 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${ {\varvec{r}}_t} = \beta _2 { {\varvec{r}}_{t-1}} + (1-\beta _2){\varvec{g}}_t^2$$\end{document} (update biased second moment estimate)

  3. 3. v ^ t = v t / ( 1 - β 1 t ) , r ^ t = r t / ( 1 - β 2 t ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hat{{\varvec{v}}}_t = {\varvec{v}}_t / (1 - \beta _1^t),\ { \hat{{\varvec{r}}}_t} = { {\varvec{r}}_t} / (1 - \beta _2^t)$$\end{document} (compute bias-corrected moment estimates)

  4. 4. g ^ t = η v t ^ / ( r ^ t + ϵ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hat{{\varvec{g}}}_t = \eta \hat{{\varvec{v}}_t} /(\sqrt{{ \hat{{\varvec{r}}}_t}} + \epsilon )$$\end{document} , where η \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\eta $$\end{document} is learning rate (update the final gradient)

With this, the proposed importance-weighted Gaussian variational EM (IW-GVEM) algorithm is summarized in Algorithm 1. For the choice of hyperparameters, we follow the suggestions in Kingma & Ba (Reference Kingma and Ba2014) and adopt the default setting that β 1 = 0.9 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta _1 = 0.9$$\end{document} and β 2 = 0.999 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta _2 = 0.999$$\end{document} . Empirically in our simulation studies, for better convergence performance, we let the learning rate of Σ θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }_{\theta }$$\end{document} to be 0.1 η \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.1\eta $$\end{document} while the learning rate for a \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{a}$$\end{document} and b \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{b}$$\end{document} to be η \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\eta $$\end{document} , and we search for an optimal learning rate η \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\eta $$\end{document} with the maximum ELBO over a list { 0.01 , 0.05 , 0.1 , 0.5 } \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\{0.01, 0.05, 0.1, 0.5\}$$\end{document} . Lastly, we set ϵ = 0.001 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\epsilon = 0.001$$\end{document} .

In terms of convergence criteria, we evaluate the Euclidean norm of the difference between the estimated parameters of the current step and those of the previous step. When the difference is less than a certain tolerance value, the algorithm is stopped. For our simulation studies, in obtaining the initial model parameter using the GVEM algorithm, we reach convergence at ( l + 1 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(l+1)$$\end{document} th iteration if α GV l + 1 - α GV l 2 + b GV t + 1 - b GV t 2 + Σ θ , G V l + 1 - Σ θ , G V l 2 0.0001 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Vert \varvec{\alpha }_{GV}^{l+1} - \varvec{\alpha }_{GV}^{l}\Vert _2 + \Vert \varvec{b}_{GV}^{t+1} - \varvec{b}_{GV}^{t}\Vert _2 + \Vert \varvec{\Sigma }_{\theta , GV}^{l+1} - \varvec{\Sigma }_{\theta , GV}^{l}\Vert _2 \leqslant 0.0001$$\end{document} . In IW-GVEM, we reach convergence at ( t + 1 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(t+1)$$\end{document} th iteration when max { α t + 1 - α t 2 , b t + 1 - b t 2 , Σ θ t + 1 - Σ θ t 2 } 0.0001 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\max \{ \Vert \varvec{\alpha }^{t+1} - \varvec{\alpha }^{t}\Vert _2, \Vert \varvec{b}^{t+1} - \varvec{b}^{t}\Vert _2, \Vert \varvec{\Sigma }_{\theta }^{t+1} - \varvec{\Sigma }_{\theta }^{t}\Vert _2 \} \leqslant 0.0001$$\end{document} or the iteration stops when it reaches certain maximum iteration number.

2. Simulation Studies

2.1. Design

We conducted comprehensive simulation studies to evaluate the performance of the proposed method under various manipulated conditions. We follow similar designs as in Cho et al. (Reference Cho, Wang, Zhang and Xu2021) and consider different settings: (1) sample size: N = 200 or 500; (2) number of domains: K = 2 or 5; (3) test length: J = 30 if K = 2 or J = 55 if K = 5; (4) both within and between multidimensional structures; (5) factor correlations: low correlation r unif ( 0.1 , 0.3 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r\sim \text {unif}(0.1, 0.3)$$\end{document} or high correlation r unif ( 0.5 , 0.7 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r \sim \text {unif}(0.5, 0.7)$$\end{document} ; and (6) confirmatory or exploratory analysis.

Figure. 1 Bias for K = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=2$$\end{document} under confirmatory analysis.

Figure. 2 RMSE for K = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=2$$\end{document} under confirmatory analysis.

Similar to Cho et al. (Reference Cho, Wang, Zhang and Xu2021), for the between-item multidimensional structure, we had equal numbers of items loaded on each factor. For the within-item multidimensional structure, when K = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=2$$\end{document} , about one-third of the items were loaded onto the first, or the second, or both factors, respectively. In the cases where K = 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=5$$\end{document} , there were about one-third of the items loaded onto one, two, or three factors, respectively. For the model parameters, we simulated the item discrimination parameters α j , k \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha _{j,k}$$\end{document} from uniform distribution on [1, 2], and difficulty parameter b j \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$b_j$$\end{document} from the standard normal distribution. We generated the latent traits θ j \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\theta }_j$$\end{document} from multivariate normal distribution N ( 0 , Σ θ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N({\varvec{0}}, \varvec{\Sigma }_{\varvec{\theta }})$$\end{document} , where the diagonal elements of Σ θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }_{\varvec{\theta }}$$\end{document} were all 1 and off-diagonal elements were generated from uniform distributions. Specifically, in high-correlation settings, the uniform distribution was set to be unif ( 0.5 , 0.7 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text {unif}(0.5, 0.7)$$\end{document} , whereas in the low-correlation settings we set it to be unif ( 0.1 , 0.3 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text {unif}(0.1, 0.3)$$\end{document} .

For evaluation, we compared the bias and root-mean-squared errors (RMSEs) of model parameters, as well as computation time between GVEM and IW-GVEM. For exploratory analysis, we did a promax rotation after model convergence, and compared the rotated parameters to the true values (Cho et al., Reference Cho, Xiao, Wang and Xu2022). For IW-GVEM, we first ran GVEM algorithm to get initial estimates of model parameters, and then ran several gradient descent steps using importance sampling to correct the bias. To select a proper initial learning rate for the gradient algorithm, we first sampled a set of data aside based on the GVEM estimates. After we got model parameter estimates using importance sampling, we calculated the lower bound as in our objective function based on the previously sampled data set, and chose the learning rate corresponding to the largest lower bound. In the simulation studies, we set S and M to be 10. Our empirical experiments have shown that increasing S and M did not result in significant improvements and 10 was large enough for the simulation settings. The results were averaged over 100 repetitions.

2.2. Results

Figures 1 and 2 present the bias and RMSE of confirmatory M2PL model when K = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=2$$\end{document} . Note that in confirmatory analysis, there are discrimination parameters specified to be zeros. These zero-constrained terms are excluded in the bias and RMSE computation. The two separately colored boxes represent the distribution of respective criteria across 100 replications from IW-GVEM (denoted as “IS” in the figure) and the original GVEM algorithm. As shown, GVEM already performs well by producing close to 0 bias for b \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\varvec{b}}$$\end{document} and Σ θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }_{\theta }$$\end{document} . It is the discrimination parameter, α \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\alpha }$$\end{document} , that has a non-ignorable bias. The IW-GVEM algorithm effectively corrects such bias on α \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\alpha }$$\end{document} across all conditions without deteriorating the estimation accuracy of other parameters. And because the bias is corrected, the RMSE of α \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\alpha }$$\end{document} is also smaller consistently compared to that from GVEM, whereas again, there is no appreciable difference between IW-GVEM and GVEM in terms of RMSE on b \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\varvec{b}}$$\end{document} and Σ θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }_{\theta }$$\end{document} . Sorting through the manipulated conditions, it is “within-item” multidimensional structure in combination with high factor correlation tends to yield larger RMSE for both methods and all parameters.

Figure. 3 Bias for K = 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=5$$\end{document} under confirmatory analysis.

Figure. 4 RMSE for K = 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=5$$\end{document} under confirmatory analysis.

Figure. 5 Bias for K = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=2$$\end{document} under exploratory analysis.

Figure. 6 RMSE for K = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=2$$\end{document} under exploratory analysis.

Figure. 7 Bias for K = 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=5$$\end{document} under exploratory analysis.

Figure. 8 RMSE for K = 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=5$$\end{document} under exploratory analysis.

Figures 3 and 4 present the bias and RMSE of confirmatory M2PL model when K = 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=5$$\end{document} . The trend observed from the K = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=2$$\end{document} condition continues to hold here. That is, IW-GVEM can correct bias on α \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\alpha }$$\end{document} effectively and hence also brings down its RMSE, whereas bias on the other parameters is already close to 0 from both methods and their RMSEs are also comparable. Increasing the number of dimensions certainly makes the model estimation harder to converge, and the estimates are also more variable, especially for b \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\varvec{b}}$$\end{document} and Σ θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }_{\theta }$$\end{document} , as reflected by wider boxes for those parameters in Fig. 3.

Figures 5, 6, 7, and 8 presents the results from exploratory estimation condition, in the same order as before. For the exploratory M2PL model estimation, GVEM generally performs well and the bias on α \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\alpha }$$\end{document} is already small to begin with. This is consistent with the results reported in literature (Cho et al., Reference Cho, Wang, Zhang and Xu2021, Reference Cho, Xiao, Wang and Xu2022). Even so, under all settings, the RMSEs of IW-GVEM are still smaller than or equal to that of GVEM. IW-GVEM can still further bring down the bias of α \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\alpha }$$\end{document} to near 0 for most cases. The exceptional case when the bias of α \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\alpha }$$\end{document} from IW-GVEM is larger than the bias from GVEM is for the “within item, correlation is high" condition. This case is the most difficult case where the items were loaded on factors via a more complicated setting and the correlations among factors are relatively high. Nonetheless, this special case has overall good estimation performance as the estimation bias from IW-GVEM is still close to the bias from GVEM, and the RMSE from IW-GVEM is lower than the RMSE from GVEM. In addition, when K = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=2$$\end{document} the bias of Σ θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }_{\theta }$$\end{document} appears to depart from 0 and IW-GVEM does not correct for such bias, although the RMSE of Σ θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }_{\theta }$$\end{document} is kept small across the board. The bias of Σ θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }_{\theta }$$\end{document} gets closer to 0 when K increases and when the factor correlation is low. Because in the exploratory estimation mode, specific types of rotations will affect resulting factor correlations, the bias in Σ θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }_{\theta }$$\end{document} estimation is less of a concern. Although the increase in the number of dimensions K could lead to a more complicated model and bring challenges to parameter estimation, the increase in test length, on the other hand, improves the estimation accuracy of parameters. Specifically, at K = 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K = 5$$\end{document} , we use test length J = 55 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$J = 55$$\end{document} which is greater than J = 30 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$J = 30$$\end{document} at K = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K = 2$$\end{document} . This increase in test length explains the results that the biases at K = 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K =5$$\end{document} are closer to 0 than that at K = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K =2$$\end{document} for some cases. Overall, the results from GVEM and IW-GVEM are very close.

Table 1 presents the computation time for confirmatory M2PL estimation under both GVEM and IW-GVEM algorithms. Understandably, IW-GVEM takes longer time under all conditions because both the important sampling step and the gradient descent optimization are time-consuming compared to closed-form updates in GVEM. Unsurprisingly, both methods need longer time for larger sample sizes. It is more interesting to note that, other things being equal, when the multidimensional structure is “within-item,” GVEM almost doubles (when K = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=2$$\end{document} ) or sometimes even triples (when K = 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=5$$\end{document} ) the computation time compared to the “between-item” condition. But for IW-GVEM, the computation time is rather stable across these two multidimensional structures. Similarly, high correlation among factors is known to be more challenging, hence computation time increases by about 50% or more for GVEM from low to high-correlation conditions, but the computation time of IW-GVEM seems to be unaffected. These all suggest that IW-GVEM is better suited for more complex models. The same patterns remain for the exploratory M2PL estimation, as shown in Table 2, although exploratory analysis in general takes longer time than confirmatory analysis, simply because more parameters are needed to be updated simultaneously.

Table 1 Computation time (seconds) for the confirmatory M2PL estimation.

Table 2 Computation time (seconds) for the exploratory M2PL estimation.

3. Discussion

In this note, we proposed an importance-weighted version of GVEM to correct its bias on the α \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\alpha }$$\end{document} estimates in the confirmatory M2PL models. Because the evidence lower bound (ELBO), a key component of variational inference, is derived based on Jensen’s inequality, the ELBO will approximate the log-marginal distribution (i.e., log P ( X ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\log P({\varvec{X}})$$\end{document} ) more closely when R P ( X , Z ) / q ( Z ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R\equiv {P({\varvec{X}},{\varvec{Z}})}/{q({\varvec{Z}})}$$\end{document} is more concentrated around its mean P ( X ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$P({\varvec{X}})$$\end{document} . Hence, the primary idea of IW-GVEM is to replace R with its sample mean by drawing i.i.d. samples from variational distribution q ( z ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q({\varvec{z}})$$\end{document} . In so doing, we achieve a tighter bound of Jensen’s inequality, but at the slight cost of computational efficiency. The added computation time is mainly due to sampling in the E-step and gradient descent in the M-step. From our simulation results, the bias correction is effective for confirmatory models and the extra computation time is acceptable because even with additional computational cost, the total time is still short. In fact, the time increase from GVEM to IW-GVEM is at a slow rate in that the time ratio between the two methods is smaller for more complex models (i.e., K = 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=5$$\end{document} , within-item multidimensional structure, and high correlations). Note that for exploratory M2PL models, the original GVEM is still recommended because it already produces almost unbiased results and hence importance sampling seems unnecessary, although it does not introduce any undesirable bias either. Theoretically, Cho et al. (Reference Cho, Wang, Zhang and Xu2021) proved that the estimated factor loading matrix and estimated latent factor from the GVEM algorithm is consistent as N \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N \rightarrow \infty $$\end{document} and J \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$J \rightarrow \infty $$\end{document} . The proposed IW-GVEM algorithm is based on the GVEM estimation, hence with consistent initial GVEM estimators, the final estimators from the IW-GVEM algorithm also have the theoretical guarantee to be consistent in the high-dimensional setting. Moreover, compared to ELBO in GVEM, the importance-weighted ELBOs are greatly improved after importance sampling. In finite-sample simulations, importance-weighted ELBOs at M = 5 , 10 , 50 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$M = 5, 10, 50$$\end{document} , and 100 are all larger than ELBO from GVEM and converge as M increases (See Appendix B).

In IW-GVEM, we propose to use the adaptive moment estimation method to automatically update the learning rate on the fly. Our preliminary results showed that the Adam algorithm performs better than fixed learning rate. Further, we also evaluated the effect of Monte Carlo sample size (i.e., S = 10 , 50 , 100 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$S=10, 50, 100$$\end{document} ) and sample size for the importance sampling step (i.e., M = 10 , 50 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$M=10, 50$$\end{document} ) and noted essentially the same results. Hence, we set S = 10 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$S=10$$\end{document} and M = 10 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$M=10$$\end{document} in our simulation study, which explains the only modest increase in computation time.

Aside from GVEM, another recently proposed fast algorithm for high-dimensional IRT estimation is the joint maximum likelihood estimation (Chen et al., Reference Chen, Li and Zhang2019). This method treats the latent abilities as fixed effect parameters instead of random variables. Although this approach is innovative and their algorithm appears to produce accurate parameter estimates efficiently, the interpretation of person parameters is different such that caution needs to be exercised when one intends to generalize findings to a certain population. Plus, treating each individual as a separate fixed effect is, at the conceptual level, hard to justify when generalizing M2PL to a multiple-group MIRT model. This is because the goal of a multiple-group extension is to allow for unbiased marginal estimation of group-specific population distributions.

Instead, the GVEM method can be generalized to multiple-group MIRT in a more straightforward fashion. Our other study exploring multiple-group GVEM for differential item functioning detection (DIF) reveals that it can very well detect uniform DIF, but the power of detecting DIF on discrimination parameter is low. This is likely due to the estimation bias on α \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\alpha }$$\end{document} from GVEM in the confirmatory model estimation, and hence the IW-GVEM will likely improve detection of the non-uniform DIF, in particular the DIF on discrimination parameters. Our study can also be extended in other directions. For instance, like in Cho et al. (Reference Cho, Wang, Zhang and Xu2021), the IW-GVEM can be extended to M3PL models. Moreover, the current IW-GVEM algorithm does not automatically output standard error of item parameter estimates, and hence future studies may consider combining it with the supplemented EM algorithm (Cai, Reference Cai2008; Chen & Wang, Reference Chen and Wang2021) to produce accurate SE estimates. In addition to MIRT, the proposed method may also shed light on improving the performance of the variational estimation for other psychometric models, such as generalized linear mixed models (Jeon et al., Reference Jeon, Rijmen and Rabe-Hesketh2017) and cognitive diagnosis models (Yamaguchi & Okada, Reference Yamaguchi and Okada2020, Reference Yamaguchi and Okada2020a).

Acknowledgements

We are grateful to the editor, an associate editor, and three anonymous referees for their helpful comments and suggestions. This work is partially supported by IES Grant R305D200015 and NSF grants SES-1846747 and SES-2150601.

Declarations

Data Availability

The simulation code and datasets generated during the current study are available at  https://github.com/jingoystat/A-Note-on-Improving-Variational-Estimation-for-Multidimensional-Item-Response-Theory.

Conflict of interest

The authors declare that they have no conflict of interest.

Appendix A: Additional Comparative Studies

A.1: Comparing IW-GVEM with Importance-Weighted Variational Bayesian Method

In recent literature, researchers also proposed importance-weighted variational Bayesian (IW-VB) methods for the estimation of MIRT models. In particular, Urban and Bauer (Reference Urban and Bauer2021) and  Liu et al. (Reference Liu, Wang and Xu2022) proposed to use importance-weighted variational autoencoder (IW-VAE) for exploratory factor analysis. This method is a deep learning-based variational method and is computationally fast in large data sets. Although IW-VB methods handle large-scale data with high computational efficiency, their performances at relatively small-sized and medium-sized data are not competitive. While MCMC could be an alternative method for small samples, in situations with small to medium-sample sizes, our variational method is faster and more competitive than MCMC.

Figure. 9 Bias for K = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=2$$\end{document} between item under exploratory analysis.

Figure. 10 Bias for K = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=2$$\end{document} within item under exploratory analysis.

Figure. 11 RMSE for K = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=2$$\end{document} between item under exploratory analysis.

Figure. 12 RMSE for K = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=2$$\end{document} within item under exploratory analysis.

Figure. 13 Bias for K = 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=5$$\end{document} between item under exploratory analysis.

Figure. 14 Bias for K = 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=5$$\end{document} within item under exploratory analysis.

Figure. 15 RMSE for K = 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=5$$\end{document} between item under exploratory analysis.

Figure. 16 RMSE for K = 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=5$$\end{document} within item under exploratory analysis.

In this section, we provide additional finite-sample simulation results to show that our method outperforms the IW-VB methods in small to medium samples. To illustrate it, we compare our proposed IW-GVEM method and IW-VB method by Liu et al. (Reference Liu, Wang and Xu2022) at N = 200 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N = 200$$\end{document} , N = 500 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N = 500$$\end{document} and N = 1000 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N = 1000$$\end{document} . Because their method focuses only on exploratory MIRT, we will compare the performance of our method (denoted as “IS" in the figure) to IW-VB for exploratory analysis. The simulation settings follow the same settings as in Sect. 2.1. The results are presented in Figures  9, 10, 11, 12, 13, 14, 15, and 16. From the results, we see the biases of IW-GVEM are closer to 0 than the IW-VB method under all simulation settings. The RMSEs of our proposed method are substantially smaller than the IW-VB in Liu et al. (Reference Liu, Wang and Xu2022).

A.2: Comparing IW-GVEM with Joint Maximum Likelihood Method

The joint maximum likelihood (JML) estimator is a computationally efficient estimator with theoretical consistency established. It is proved in Chen et al. (Reference Chen, Li and Zhang2019) that JML estimator is consistent under high-dimensional settings and it outperforms the marginal maximum likelihood approaches in terms of computational costs. However, different from our IW-GVEM method, the latent abilities are treated as fixed effect parameters instead of random variables in JML method, which may constrain its performances in settings where latent factors are correlated. The JML estimation is also inconsistent in the setting when the number of items is fixed and the sample size grows to infinity. Because the number of parameters in the joint likelihood function grows to infinity, the standard theory for the maximum likelihood method cannot directly apply and the point estimation consistency for each item cannot be attained, which is known as Neyman–Scott phenomenon  (Neyman & Scott, Reference Neyman and Scott1948).

Extensive simulation studies were conducted in Cho et al. (Reference Cho, Wang, Zhang and Xu2021) to compare GVEM to JMLE method under the same simulation settings (sample sizes, within or between multidimensional structures, factor correlations, etc.) and using the same evaluation criteria (bias and RMSE) as in Sect. 2.1. Specifically, Figures 3 and 4 of Cho et al. (Reference Cho, Wang, Zhang and Xu2021) compared the bias and RMSE of GVEM and JML and showed that GVEM has much lower bias and RMSE than JML across all settings. At certain challenging cases such as “within item, correlation is high", JML estimator has even worse performances. This could be explained by that latent factors are fixed effects in JMLE, whereas GVEM treats them as random effects with multivariate Gaussian distributions accounting for the correlations among factors.

As an improvement of GVEM method, our IW-GVEM method outperforms GVEM in confirmatory factor analysis and has overall comparable performances as GVEM in exploratory factor analysis, across all simulation settings. For a detailed comparison of the simulation results of IW-GVEM and GVEM, please refer to Sect. 2.2. As our IW-GVEM is comparable to, if not better than, GVEM, the performance of our IW-GVEM is also better than JML under our simulation settings.

Appendix B: Additional Simulation Study

In this section, we present finite-sample simulation studies to show that our proposed IW-GVEM greatly improves the ELBO from GVEM. For the purpose of illustration, we consider the four settings under N = 200 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N = 200$$\end{document} and J = 30 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$J= 30$$\end{document} : (1) within-item and low factor correlation; (2) between-item and low factor correlation; (3) within-item and high factor correlation; (4) between-item and high factor correlation. For each setting, we generate the ELBOs from the GVEM algorithm and importance-weighted ELBOs for different sample sizes M = 5 , 10 , 50 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$M = 5, 10, 50$$\end{document} , and 100 at the importance sampling step over 100 replications. The calculated ELBOs are presented in Fig. 17. From Fig. 17, we see that the importance sampling step leads to a tighter importance-weighted ELBO ( M = 5 , 10 , 50 , 100 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$M = 5, 10, 50, 100$$\end{document} ) than that of GVEM. As the sample M in the importance sampling step increases, the ELBOs converge, which is consistent with theoretical results in Proposition 1.

Figure. 17 Importance-weighted ELBO at N = 200 , J = 30 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N = 200, J = 30$$\end{document}

Footnotes

1 The limited-information method such as weighted least squares is not reviewed here as it handles high-dimensional models very differently, and it cannot handle missing data very well.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

References

Albert, J.H.. (1992). Bayesian estimation of normal ogive item response curves using GIBBS sampling. Journal of educational statistics, 17 3251269.CrossRefGoogle Scholar
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2014). Fitting linear mixed-effects models using lme4. arXiv preprint arXiv:1406.5823.Google Scholar
Bishop, C.M.Pattern recognition and machine learning 2006 Springer.Google Scholar
Blei, D.M., Kucukelbir, A, McAuliffe, J.D.. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112 518859877.CrossRefGoogle Scholar
Bock, R.D., Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46 4443459.CrossRefGoogle Scholar
Briggs, D. C., & Wilson, M. (2003). An introduction to multidimensional measurement using rasch models.Google Scholar
Burda, Y., Grosse, R., & Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.Google Scholar
Cai, L. (2008). Sem of another flavor: Two new applications of the supplemented EM algorithm. British Journal of Mathematical and Statistical Psychology, 61, 309329.CrossRefGoogle ScholarPubMed
Cai, L. (2010). Metropolis–Hastings Robbins–Monro algorithm for confirmatory item factor analysis. Journal of Educational and Behavioral Statistics, 35 3307335.CrossRefGoogle Scholar
Cai, L. (2010). High-dimensional exploratory item factor analysis by a Metropolis–Hastings Robbins–Monro algorithm. Psychometrika, 75 13357.CrossRefGoogle Scholar
Cai, L, Hansen, M. (2018). Improving educational assessment: Multivariate statistical methods. Policy Insights from the Behavioral and Brain Sciences, 5 11924.CrossRefGoogle Scholar
Cai, L, Yang, J.S., Hansen, M. (2011). Generalized full-information item bifactor analysis. Psychological methods, 16 3221.CrossRefGoogle ScholarPubMed
Chen, Y, Li, X, Zhang, S. (2019). Joint maximum likelihood estimation for high-dimensional exploratory item factor analysis. Psychometrika, 84 1124146.CrossRefGoogle ScholarPubMed
Chen, P, Wang, C. (2021). Using EM algorithm for finite mixtures and reformed supplemented EM for MIRT calibration. Psychometrika, 86, 299326.CrossRefGoogle ScholarPubMed
Cho, A. E., Xiao, J., Wang, C., & Xu, G. (2022). Regularized variational estimation for exploratory item response theory. Psychometrika, pp. 129.Google Scholar
Cho, A.E., Wang, C, Zhang, X, Xu, G. (2021). Gaussian variational estimation for multidimensional item response theory. British Journal of Mathematical and Statistical Psychology, 74, 5285.CrossRefGoogle ScholarPubMed
CRESST (2017). English language proficiency assessment for the 21st century: Item analysis and calibration.Google Scholar
Curi, M., Converse, G. A., Hajewski, J., & Oliveira, S. (2019). Interpretable variational autoencoders for cognitive models. In 2019 international joint conference on neural networks (IJCNN), pp. 18. IEEE.CrossRefGoogle Scholar
Domke, J., & Sheldon, D. R. (2018). Importance weighting and variational inference. Advances in Neural Information Processing Systems, 31.Google Scholar
Gibbons, R.D., Hedeker, D.R.. (1992). Full-information item bi-factor analysis. Psychometrika, 57 3423436.CrossRefGoogle Scholar
Hamilton, L.S., Nussbaum, E.M., Kupermintz, H, Kerkhoven, J.I., Snow, R.E.. (1995). Enhancing the validity and usefulness of large-scale educational assessments: Ii. nels: 88 science achievement. American Educational Research Journal, 32 3555581.CrossRefGoogle Scholar
Hartig, J, Höhler, J. (2009). Multidimensional IRT models for the assessment of competencies. Studies in Educational Evaluation, 35 2–35763.CrossRefGoogle Scholar
Hui, F.K., Warton, D.I., Ormerod, J.T., Haapaniemi, V, Taskinen, S. (2017). Variational approximations for generalized linear latent variable models. Journal of Computational and Graphical Statistics, 26 13543.CrossRefGoogle Scholar
Jeon, M, Rijmen, F, Rabe-Hesketh, S. (2017). A variational maximization-maximization algorithm for generalized linear mixed models with crossed random effects. Psychometrika, 82 3693716.CrossRefGoogle Scholar
Jordan, M.I.. (2004). Graphical models. Statistical science, 19 1140155.CrossRefGoogle Scholar
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.Google Scholar
Kupermintz, H, Ennis, M.M., Hamilton, L.S., Talbert, J.E., Snow, R.E.. (1995). In dedication: Leigh burstein: Enhancing the validity and usefulness of large-scale educational assessments: I. nels: 88 mathematics achievement. American Educational Research Journal, 32 3525554.Google Scholar
Lindstrom, M.J., Bates, D.M.. (1988). Newton–Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association, 83 40410141022.Google Scholar
Liu, T., Wang, C., & Xu, G. (2022). Estimating three- and four-parameter MIRT models with importance-weighted sampling enhanced variational auto-encoder. Frontiers in Psychology, 13.CrossRefGoogle Scholar
McCulloch, C.E.. (1997). Maximum likelihood algorithms for generalized linear mixed models. Journal of the American statistical Association, 92 437162170.CrossRefGoogle Scholar
Natesan, P, Nandakumar, R, Minka, T, Rubright, J.D.. (2016). Bayesian prior choice in IRT estimation using MCMC and variational bayes. Frontiers in Psychology, 7, 1422.CrossRefGoogle ScholarPubMed
Neyman, J., & Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica: Journal of the Econometric Society, 132.CrossRefGoogle Scholar
OECD, N. (2003). The pisa 2003 assessment framework: Mathematics, reading, science and problem solving knowledge and skills.Google Scholar
Ormerod, J.T., Wand, M.P.. (2010). Explaining variational approximations. The American Statistician, 64 2140153.CrossRefGoogle Scholar
Patz, R.J., Junker, B.W.. (1999). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of educational and behavioral statistics, 24 4342366.CrossRefGoogle Scholar
Pinheiro, J.C., Bates, D.M.. (1995). Approximations to the log-likelihood function in the nonlinear mixed-effects model. Journal of computational and Graphical Statistics, 4 11235.CrossRefGoogle Scholar
Reckase, M. D. (2009). Multidimensional item response theory models. In Multidimensional item response theory, pp. 79112. Springer.CrossRefGoogle Scholar
Rijmen, F, Jeon, M. (2013). Fitting an item response theory model with random item effects across groups by a variational approximation method. Annals of Operations Research, 206 1647662.CrossRefGoogle Scholar
Rijmen, F, Vansteelandt, K, De Boeck, P. (2008). Latent class models for diary method data: Parameter estimation by local computations. Psychometrika, 73 2167182.CrossRefGoogle ScholarPubMed
Thissen, D. (2013). Using the testlet response model as a shortcut to multidimensional item response theory subscore computation. In New developments in quantitative psychology, pp. 2940. Springer.CrossRefGoogle Scholar
Urban, C.J., Bauer, D.J.. (2021). A deep learning algorithm for high-dimensional exploratory item factor analysis. Psychometrika, 86 1129.CrossRefGoogle ScholarPubMed
von Davier, M, Sinharay, S. (2010). Stochastic approximation methods for latent regression item response models. Journal of Educational and Behavioral Statistics, 35 2174193.CrossRefGoogle Scholar
Wainer, H, Bradlow, E.T., Wang, XTestlet response theory and its applications 2007 Cambridge University Press.CrossRefGoogle Scholar
Wang, C, Xu, G. (2015). A mixture hierarchical model for response times and response accuracy. British Journal of Mathematical and Statistical Psychology, 68 3456477.CrossRefGoogle ScholarPubMed
Wu, M., Davis, R. L., Domingue, B. W., Piech, C., & Goodman, N. (2020). Variational item response theory: Fast, accurate, and expressive. arXiv preprint arXiv:2002.00276.Google Scholar
Yamaguchi, K, Okada, K. (2020). Variational Bayes inference algorithm for the saturated diagnostic classification model. Psychometrika, 85 4973995.CrossRefGoogle ScholarPubMed
Yamaguchi, K, Okada, K. (2020). Variational Bayes inference for the DINA model. Journal of Educational and Behavioral Statistics, 45 5569597.CrossRefGoogle Scholar
Zhang, H, Chen, Y, Li, X. (2020). A note on exploratory item factor analysis by singular value decomposition. Psychometrika, 85, 358372.CrossRefGoogle ScholarPubMed
Zhang, S, Chen, Y, Liu, Y. (2020). An improved stochastic EM algorithm for large-scale full-information item factor analysis. British Journal of Mathematical and Statistical Psychology, 73 14471.CrossRefGoogle ScholarPubMed
Figure 0

Algorithm 1 IW-GVEM for M2PL

Figure 1

Figure. 1 Bias for K=2\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$K=2$$\end{document} under confirmatory analysis.

Figure 2

Figure. 2 RMSE for K=2\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$K=2$$\end{document} under confirmatory analysis.

Figure 3

Figure. 3 Bias for K=5\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$K=5$$\end{document} under confirmatory analysis.

Figure 4

Figure. 4 RMSE for K=5\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$K=5$$\end{document} under confirmatory analysis.

Figure 5

Figure. 5 Bias for K=2\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$K=2$$\end{document} under exploratory analysis.

Figure 6

Figure. 6 RMSE for K=2\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$K=2$$\end{document} under exploratory analysis.

Figure 7

Figure. 7 Bias for K=5\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$K=5$$\end{document} under exploratory analysis.

Figure 8

Figure. 8 RMSE for K=5\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$K=5$$\end{document} under exploratory analysis.

Figure 9

Table 1 Computation time (seconds) for the confirmatory M2PL estimation.

Figure 10

Table 2 Computation time (seconds) for the exploratory M2PL estimation.

Figure 11

Figure. 9 Bias for K=2\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$K=2$$\end{document} between item under exploratory analysis.

Figure 12

Figure. 10 Bias for K=2\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$K=2$$\end{document} within item under exploratory analysis.

Figure 13

Figure. 11 RMSE for K=2\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$K=2$$\end{document} between item under exploratory analysis.

Figure 14

Figure. 12 RMSE for K=2\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$K=2$$\end{document} within item under exploratory analysis.

Figure 15

Figure. 13 Bias for K=5\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$K=5$$\end{document} between item under exploratory analysis.

Figure 16

Figure. 14 Bias for K=5\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$K=5$$\end{document} within item under exploratory analysis.

Figure 17

Figure. 15 RMSE for K=5\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$K=5$$\end{document} between item under exploratory analysis.

Figure 18

Figure. 16 RMSE for K=5\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$K=5$$\end{document} within item under exploratory analysis.

Figure 19

Figure. 17 Importance-weighted ELBO at N=200,J=30\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$N = 200, J = 30$$\end{document}