
A Bayesian Approach Towards Missing Covariate Data in Multilevel Latent Regression Models

Published online by Cambridge University Press:  01 January 2025

Christian Aßmann
Affiliation:
Leibniz Institute for Educational Trajectories Bamberg Otto-Friedrich-Universität Bamberg
Jean-Christoph Gaasch
Affiliation:
Otto-Friedrich-Universität Bamberg
Doris Stingl*
Affiliation:
Otto-Friedrich-Universität Bamberg
*
Correspondence should be made to Doris Stingl, Otto-Friedrich-Universität Bamberg, Bamberg, Germany. Email: doris.stingl@uni-bamberg.de

Abstract

The measurement of latent traits and investigation of relations between these and a potentially large set of explaining variables is typical in psychology, economics, and the social sciences. Corresponding analysis often relies on surveyed data from large-scale studies involving hierarchical structures and missing values in the set of considered covariates. This paper proposes a Bayesian estimation approach based on the device of data augmentation that addresses the handling of missing values in multilevel latent regression models. Population heterogeneity is modeled via multiple groups enriched with random intercepts. Bayesian estimation is implemented in terms of a Markov chain Monte Carlo sampling approach. To handle missing values, the sampling scheme is augmented to incorporate sampling from the full conditional distributions of missing values. We suggest modeling the full conditional distributions of missing values in terms of non-parametric classification and regression trees. This offers the possibility to consider information from latent quantities functioning as sufficient statistics. A simulation study reveals that this Bayesian approach provides valid inference and outperforms complete case analysis and multiple imputation in terms of statistical efficiency and the computation time involved. An empirical illustration using data on mathematical competencies demonstrates the usefulness of the suggested approach.

Type
Theory and Methods
Creative Commons
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Copyright
Copyright © 2022 The Author(s)

1. Introduction

Models for measurement and structural analysis of latent traits have been developed, among others, by Muthén (1979), Zwinderman (1991) and Adams et al. (1997). These latent regression models (LRMs) typically use a regression equation to assess the relationship between the latent trait and additional covariates, and link measurements to the latent trait via a measurement model, possibly arising from the context of item response theory (IRT; e.g., Embretson & Reise, 2000). As demonstrated by Rijmen et al. (2003), and described extensively in Wilson and De Boeck (2004), these models can be conceptualized within the wider context of nonlinear mixed models. Since the derived likelihood functions involve multiple integrals arising from the latent variables, a Bayesian framework using Markov chain Monte Carlo (MCMC) techniques is eminently suited to provide inference, see e.g. Edwards (2010). The seminal article of Albert (1992) adopts data augmentation (DA), see Tanner and Wong (1987), within a Bayesian estimation approach for measurement models with dichotomous items.Footnote 1 Further work adapted Albert's DA procedure for extended model structures incorporating multilevel and clustered data structures (Aßmann & Boysen-Hogrefe, 2011; Fox, 2005; Fox & Glas, 2001; Johnson & Jenkins, 2005).
Prominent applications of these models arise in the context of large-scale assessment studies like the Programme for International Student Assessment (PISA; e.g., OECD, 2014), the Trends in International Mathematics and Science Study (TIMSS; e.g., Mullis & Martin, 2013), the Programme for the International Assessment of Adult Competencies (PIAAC; e.g., OECD, 2013b) and the National Assessment of Educational Progress (NAEP; e.g., Allen et al., 1999).

However, surveyed information is often seriously afflicted by item nonresponse. Si and Reiter (2013), for example, report less than five percent complete cases on a set of 80 background variables in a TIMSS data file. Especially in multilevel contexts, such a large fraction of missing values poses a challenge to efficient parameter estimation, and an appropriate strategy for handling missing values and the corresponding model specification is required when analyzing the data. While several studies deal with the impact of missing or omitted competence items (Köhler et al., 2015; Pohl et al., 2014), there has been less work on missing values in background variables. By default, the educational assessment studies cited above treat missing values in context questionnaires via dummy-variable adjustments, see e.g. OECD (2014). Aside from the obvious information loss, dummy-variable adjustments for missing values can cause biased estimation, see Jones (1996). The involved categorization of information may have negative side effects on the assumed functional relationship, see also Grund et al. (2020) for a more detailed discussion. These results are in line with a recent study by Rutkowski (2011), who found non-negligible bias and misleading interpretations at the population level when partially missing covariates are dummy coded.Footnote 2

With the latent factor being of substantive interest, the Bayesian approach implemented in terms of an MCMC algorithm using the DA device has the advantage of providing direct access to the latent factors in terms of the posterior distribution.Footnote 3 Furthermore, in the presence of missing values in background variables, DA in the Bayesian context offers a conceptually straightforward way to deal with missing values. The vector of unknown quantities can be augmented with the missing values in covariates, and the MCMC sampling scheme correspondingly incorporates the set of full conditional distributions of the missing values. This approach has the advantage that the modeling of the full conditional distributions can incorporate information available in the form of a latent variable serving, in the considered model context, as a sufficient statistic.Footnote 4 These advantages result in increased statistical efficiency and reduced computational costs, as illustrated in this paper. Such a handling of information is in principle also possible in the context of maximum likelihood estimation in terms of a chained equation approach, via iteratively sampling from an assumed or approximated set of full conditional distributions, see Grund et al. (2020) for a discussion in the absence of hierarchical structures. In addition, in data contexts with a large number of covariates relative to the number of observations, the Bayesian approach incorporates shrinkage in terms of the involved prior distributions and facilitates updating of information with regard to the modeled relationships. Moreover, Bayesian estimators of parameters or functions thereof, like context effects and uncertainty measures, are directly accessible without the use of combining rules.
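The augmentation idea can be illustrated with a deliberately simple toy model (not the paper's sampler): observed values $x_i \sim N(\mu, 1)$ with a flat prior on $\mu$ and two missing entries, where each Gibbs sweep alternates a parameter draw with draws from the full conditional of the missing values. All numerical values are illustrative.

```python
# Toy sketch of DA for missing values: the sampler's state is augmented
# with the missing entries, and each sweep draws them from their full
# conditional. Model: x_i ~ N(mu, 1), flat prior on mu.
import random

random.seed(0)
x_obs = [0.8, 1.2, 1.1, 0.9]   # observed covariate values (illustrative)
n_mis = 2                      # number of missing entries
x_mis = [0.0] * n_mis          # initialize missing values arbitrarily

draws = []
for sweep in range(2000):
    # 1. Draw mu from its full conditional given the completed data:
    #    mu | x ~ N(mean(x), 1/n) under a flat prior.
    x_full = x_obs + x_mis
    n = len(x_full)
    mu = random.gauss(sum(x_full) / n, (1.0 / n) ** 0.5)
    # 2. Draw each missing value from its full conditional x_mis ~ N(mu, 1).
    x_mis = [random.gauss(mu, 1.0) for _ in range(n_mis)]
    draws.append(mu)

# The marginal stationary distribution of mu is the posterior given the
# observed data only, so the imputed values do not distort inference.
posterior_mean = sum(draws[500:]) / len(draws[500:])
```

Integrating the missing values out of the joint target leaves exactly the posterior of $\mu$ given the observed data, which is why the augmented scheme yields valid inference.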

The DA principle has been successfully applied in different contexts ranging from multivariate panel models to social network analysis and educational large-scale assessments by Liu et al. (2000), Koskinen et al. (2010), Blackwell et al. (2017) and Kaplan and Su (2018). Full conditional distributions of missing values are typically operationalized in terms of a parametric modeling approach, as discussed by Grund et al. (2020) and Erler et al. (2016). Goldstein et al. (2014), Erler et al. (2016) and Grund et al. (2018) provide a discussion in the context of linear regression models for metrically scaled hierarchical data.

In this article, we extend the DA approach towards missing values in covariate data within extended hierarchical structures in LRMs for dependent variables with binary and ordinal scales.Footnote 5 We also illustrate that DA allows for direct access to a valid model specification for the missing values, incorporating information available in the form of sufficient statistics as suggested by the Hammersley–Clifford theorem, see Robert and Casella (2004). Further, specifying the full conditional distributions of missing values in terms of sufficient statistics arising in the hierarchical latent regression context has the potential to reduce the computational burden. The role of sufficient statistics has also been stressed by Neal and Kypraios (2015), who discuss situations where the augmented variables and sufficient statistics are discrete and the models of interest belong to well-known probability distributions. Our approach extends this, as we consider hierarchical structures and identifying restrictions arising from the factor-like model structures, resulting in complex posterior distributions.Footnote 6 Considering full conditional distributions for the handling of missing values, enriched with information from latent model structures, also extends the sequential imputations approach discussed by Kong et al. (1994). Whereas the sequential imputations approach builds on predictive distributions for missing values, thereby separating the model for the missing covariate values from the considered latent model structures, our approach is based on smoothed, i.e. full conditional, distributions incorporating information from the latent model structures via the DA principle.Footnote 7

In combination with modeling the full conditional distributions of missing values via non-parametric sequential regression trees, as suggested by Burgette and Reiter (2010) and Doove et al. (2014), the DA approach suggested in this paper offers high flexibility in empirical applications to cope with nonlinear relationships, e.g. interaction terms, within a potentially large set of covariates of different scales. The proposed modeling approach hence allows for tackling research questions typically addressed in sociology, psychology, and economics in the field of educational inequality and the role of institutions, see among others Carlsson et al. (2015), Passaretta and Skopek (2021) and Cornelissen and Dustmann (2019). It simultaneously addresses the uncertainty associated with the estimation of a latent trait variable and the imputation of missing values in manifest covariate variables. The reciprocal dependence of outcomes and predictors is reflected to the full extent by the Bayesian DA estimation algorithm. The benefits of the suggested fully Bayesian approach arise in terms of methodological stringency and gains in statistical efficiency. The suggested approach is illustrated by means of a simulation study and an empirical application using the first wave of the starting cohort of ninth graders surveyed in the German National Educational Panel Study—Educational Trajectories in Germany (NEPS), see Blossfeld and Roßbach (2019).
To highlight the benefits of considering sufficient statistics within the suggested DA approach towards missing values in covariates, we provide a comparison with a classical imputation setup, where the full conditional distributions of missing values are defined on the basis of directly observable quantities only, see e.g. von Hippel (2007). As shown in the simulations, the consideration of sufficient statistics accelerates computation by up to a third and ensures the feasibility of specifying full conditional distributions in multilevel contexts.
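The leaf-donor idea behind the tree-based full conditional draws can be sketched as follows. This is a deliberately reduced, stdlib-only illustration using a single best split (a "stump") rather than a fully grown tree as in Burgette and Reiter (2010); the completed predictor `z` stands in for the conditioning set that, in the suggested approach, includes the latent quantities.

```python
# Hypothetical sketch: one CART-type draw for a partially observed
# covariate x, conditioning on a completed predictor z. A real
# implementation would grow a full tree; a single split illustrates
# how missing values are sampled from donors in the matching leaf.
import random

def best_split(z_obs, x_obs):
    """Split point on z minimizing within-leaf squared error of x.
    Assumes z takes at least two distinct observed values."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)
    candidates = sorted(set(z_obs))
    best = (float("inf"), candidates[0])
    for c in candidates[:-1]:
        left = [x for z, x in zip(z_obs, x_obs) if z <= c]
        right = [x for z, x in zip(z_obs, x_obs) if z > c]
        total = sse(left) + sse(right)
        if total < best[0]:
            best = (total, c)
    return best[1]

def draw_missing(z_obs, x_obs, z_mis, rng=random):
    """Impute each missing x by sampling a donor from its leaf."""
    cut = best_split(z_obs, x_obs)
    left = [x for z, x in zip(z_obs, x_obs) if z <= cut]
    right = [x for z, x in zip(z_obs, x_obs) if z > cut]
    return [rng.choice(left if z <= cut else right) for z in z_mis]
```

Because donors are drawn from observed values within a leaf, the imputations automatically respect the scale of the covariate and any nonlinearities the splits capture.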

The paper proceeds as follows. Section 2 outlines the specification of the considered model setup and provides the corresponding Bayesian sampling algorithm that deals with structures reflecting heterogeneity and missing values in covariates via DA. Performance of the estimation routine is demonstrated through a simulation study in Sect. 3, whilst Sect. 4 provides the empirical illustration using data from the NEPS. Section 5 concludes.

2. Model Setup and Bayesian Inference

2.1. Model Setup

Consider J measurement items observed on N individuals, summarized in an $N\times J$ data matrix $Y=(y_1,\ldots,y_N)'$ with row vectors $y_i=(y_{i1},\ldots,y_{ij},\ldots,y_{iJ})$ for each $i=1,\ldots,N$ and $j=1,\ldots,J$.
In the case of binary measurements, $y_{ij}$ denotes a random variable taking the value $y_{ij}=1$ if, in an educational assessment context, respondent i is able to solve item j, and the value $y_{ij}=0$ otherwise. To analyze this kind of test item, Lord (1952, 1953) proposes an IRT model generally known as the two-parameter normal ogive (2PNO), stating the probability ($\text{Pr}$) of a correctly solved item as $\text{Pr}(y_{ij}=1|\theta_i,\alpha_j,\beta_j)=\Phi(\alpha_j\theta_i-\beta_j)$, where $\theta_i$ denotes a scalar person parameter, $\alpha_j$ is an item discrimination parameter, and $\beta_j$ denotes the item difficulty or item fixed effect. We adopt the standard normal cumulative distribution function $\Phi(\cdot)$ as the link function, as it offers computational advantages for MCMC-based Bayesian estimation. It also allows for an alternative representation in terms of a threshold mechanism, which was first formalized in the context of individual-level data by McKelvey and Zavoina (1975) and can be found for multivariate binary variables in Maddala (1983, p. 138).
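As a concrete illustration, the 2PNO response probability can be sketched in a few lines of Python; the parameter values below are illustrative only, not estimates from the paper.

```python
# Minimal sketch of the 2PNO item response probability
# Pr(y_ij = 1 | theta_i, alpha_j, beta_j) = Phi(alpha_j * theta_i - beta_j).
from math import erf, sqrt

def std_normal_cdf(x: float) -> float:
    """Standard normal CDF Phi(x) via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_correct(theta: float, alpha: float, beta: float) -> float:
    """2PNO probability of solving an item."""
    return std_normal_cdf(alpha * theta - beta)

# A more able respondent has a higher success probability on the same item.
assert p_correct(1.0, alpha=1.2, beta=0.5) > p_correct(-1.0, alpha=1.2, beta=0.5)
```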
Extending towards the analysis of ordered polytomous item responses, see Samejima (1969), the observed item responses can be seen as an ordered polytomous version of an underlying continuous variable $y_{ij}^*=\alpha_j\theta_i-\beta_j+\varepsilon_{ij}$, where the independent and identically distributed error term $\varepsilon_{ij}$ follows a standard normal distribution. One can then link the observed categorical variable and the underlying continuous variable using a threshold mechanism, namely

(1) $$y_{ij}=\sum_{q=1}^{Q_j}(q-1)\,\mathcal{I}\left(\kappa_{jq-1}<y_{ij}^*\le\kappa_{jq}\right),$$

where $\kappa_j=(\kappa_{j0},\kappa_{j1},\ldots,\kappa_{jQ_j})'$ is the $(Q_j+1)$-dimensional vector of item category cutoff parameters and $\mathcal{I}(\cdot)$ denotes the indicator function. The resulting probability that respondent i achieves grade q on item j, given his latent trait and the item parameters, is hence implied by

$$\text{Pr}\big(y_{ij}=q|\theta_i,\alpha_j,\beta_j,\kappa_j\big)=\Phi\big(\kappa_{jq+1}-(\alpha_j\theta_i-\beta_j)\big)-\Phi\big(\kappa_{jq}-(\alpha_j\theta_i-\beta_j)\big),$$

thus nesting the binary case as well. This probability can be represented in terms of the latent variables as $\int f(y_{ij},y_{ij}^*|\theta_i,\alpha_j,\beta_j,\kappa_j)\,\textrm{d}y_{ij}^*$, where

(2) $$f\big(y_{ij},y_{ij}^*|\theta_i,\alpha_j,\beta_j,\kappa_j\big)=\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}\big(y_{ij}^*-(\alpha_j\theta_i-\beta_j)\big)^2\right\}\mathcal{I}\big(\kappa_{jy_{ij}}<y_{ij}^*<\kappa_{jy_{ij}+1}\big).$$

The necessary identifying restrictions for all parameters will be discussed jointly below.
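A minimal sketch of the implied category probabilities, using illustrative cutoff values (and assuming the boundary cutoffs $\kappa_{j0}=-\infty$ and $\kappa_{jQ_j}=\infty$), shows that they form a proper distribution and nest the binary 2PNO case:

```python
# Sketch: category probabilities implied by the threshold mechanism,
# Pr(y_ij = q) = Phi(kappa_{j,q+1} - eta) - Phi(kappa_{j,q} - eta),
# with eta = alpha_j * theta_i - beta_j. Cutoffs below are illustrative.
from math import erf, sqrt, inf

def std_normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def category_probs(theta, alpha, beta, kappa):
    """Probabilities for categories q = 0, ..., Q_j - 1."""
    eta = alpha * theta - beta
    return [std_normal_cdf(kappa[q + 1] - eta) - std_normal_cdf(kappa[q] - eta)
            for q in range(len(kappa) - 1)]

probs = category_probs(0.3, 1.0, 0.2, [-inf, -0.5, 0.5, inf])
assert abs(sum(probs) - 1.0) < 1e-12     # a proper distribution
# With cutoffs (-inf, 0, inf), the binary 2PNO case is recovered.
binary = category_probs(0.3, 1.0, 0.2, [-inf, 0.0, inf])
assert abs(binary[1] - std_normal_cdf(1.0 * 0.3 - 0.2)) < 1e-12
```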

IRT models are designed to directly link items and persons to a common scale. To enlarge their scope, the focus of analysis was broadened towards structural analysis by Muthén (1979), addressing the issue that persons may not only differ in terms of their competence, but also in terms of covariates which are correlated with their competence. The standard framework assuming $\theta_i$, $i=1,\ldots,N$, to be identically and independently normally distributed can be extended to incorporate a conditional mean operationalized as $\text{E}[\theta_i|X_i]=X_i\gamma$, $i=1,\ldots,N$.
Thereby, $X=(X_1',\ldots,X_N')'=(X^{(1)},\ldots,X^{(P)})$, with row vectors $X_i$, $i=1,\ldots,N$, and column vectors $X^{(p)}$, $p=1,\ldots,P$, denotes an $N\times P$ matrix of individual-specific covariates, and $\gamma$ the corresponding vector of regression coefficients. When hierarchical clustering in observations is present, this needs to be incorporated in the model as well, as consideration of hierarchical data structures is an important prerequisite for valid inference on the relationship between explaining and latent variables. The multiple forms of population heterogeneity in educational research are reviewed in Muthén (1989) and Burstein (1980), whereas Greene (2004b) provides a discussion for economic applications of the panel probit model incorporating latent heterogeneity structures. Population heterogeneity may be considered in terms of a nested multilevel structure, thereby assuming a composite population consisting of a finite number of G mutually exclusive groups indexed by $g=1,\ldots,G$, where $L=(L_1,\ldots,L_N)$ with $L_i\in\{1,\ldots,G\}$, $i=1,\ldots,N$, denotes the individual group membership. Within these groups, separate LRMs may hold. Sample stratification may be based on an explicitly observed cluster variable, e.g., gender or school type. This type of modeling dates back to the early works of Muthén and Christoffersson (1981) and Mislevy (1985), but without consideration of covariates except the cluster variable. Often, the specification is theory-driven, with the aim to discover substantial differences in covariate effects and variances for predefined groups. These differences are captured through the estimation of group-specific latent trait distributions. Additionally, hierarchical structures may be related to random effects. As in multilevel models, there is a composite population consisting of clusters $c=1,\ldots,C$, where the individual cluster membership is also known a priori and is captured by $S=(S_1,\ldots,S_N)$ with $S_i\in\{1,\ldots,C\}$ for all $i=1,\ldots,N$. While fixed group-specific regression parameters are suitable for a relatively small number of groups, consideration of hierarchical structures with regard to schools or classes often implies a prohibitively large number of parameters. Difficulties regarding the computation and the statistical properties of the maximum likelihood estimator in this context were studied by Greene (2004a).Footnote 8 Thus, the introduction of identically and independently normally distributed cluster-specific random effects $\omega=(\omega_1,\ldots,\omega_C)$ offers an appropriate alternative or addition to the fixed effects approach. The most basic multilevel specification is the random intercept latent regression item response model. Depending on the specific hierarchical structure under consideration, combinations of both approaches are possible and allow for multiple hierarchical levels.

To illustrate, consider a model with a nested hierarchical structure, where $S_i=S_{i'}$ implies $L_i=L_{i'}$, i.e., individuals within the same cluster also belong to the same group, but not vice versa, given as

(3) $$\theta_{i}=\omega_{S_i}+X_{i}\gamma_{L_i}+\epsilon_{i}.$$

Thereby $\epsilon_{i}$, $i=1,\ldots,N$, is independently normally distributed with mean zero and heteroscedastic variance $\sigma^2_{L_i}$. Likewise, $\omega_{S_i}$ is independently normally distributed with mean zero and heteroscedastic variance $\upsilon^2_{L_i}$. The assumed heteroscedasticity is hence a further way to implement features of (nested) hierarchical structures.Footnote 9 We summarize all model parameters as $\psi=(\{\alpha_j,\beta_j,\kappa_j\}_{j=1}^J,\{\gamma_g,\sigma^2_g,\upsilon^2_g\}_{g=1}^G)$. The implied conditional covariance structure with regard to two elements of $\theta=(\theta_1,\ldots,\theta_N)$, denoted $i$ and $i'$, can be described as

$$\text{Cov}(\theta_i,\theta_{i'}|\psi,X,S,L)=\begin{cases} 0, & \text{for } i\ne i' \text{ and } S_i\ne S_{i'},\\ \upsilon_{L_i}^2=\upsilon_{L_{i'}}^2, & \text{for } i\ne i' \text{ with } S_i=S_{i'} \text{ and } L_i=L_{i'},\\ \sigma_{L_i}^2+\upsilon_{L_i}^2=\sigma_{L_{i'}}^2+\upsilon_{L_{i'}}^2, & \text{for } i=i' \text{ with } S_i=S_{i'} \text{ and } L_i=L_{i'}. \end{cases}$$

This covariance structure allows for group-specific conditional variances and accommodates similar or different correlations within clusters. The corresponding likelihood function in the case of completely observed data is given as

(4) $$f(Y|\psi,X,S,L)=\int f(Y,Y^*,\theta,\omega|\psi,X,S,L)\,\mathrm{d}Y^*\,\mathrm{d}\theta\,\mathrm{d}\omega.$$

Thereby

(5) $$f(Y,Y^*,\theta,\omega|\psi,X,S,L)=\left[\prod_{i=1}^{N} f(y_i,y_{i}^*|\theta_i,\psi)\,f(\theta_{i}|X_i,\psi,\omega,S_i,L_i)\right] f(\omega|\psi,S,L),$$

where $f(y_{i},y_{i}^*|\theta_i,\psi)=\prod_{j=1}^{J} f(y_{ij},y_{ij}^*|\psi,\theta_i)$ with $f(y_{ij},y_{ij}^*|\psi,\theta_i)$ as in Eq. (2),

$$f(\theta_{i}|X_i,\psi,\omega,S_i,L_i)=(2\pi)^{-\frac{1}{2}}\big(\sigma_{L_i}^{2}\big)^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\sigma_{L_i}^2}\big(\theta_{i}-(\omega_{S_i}+X_{i}\gamma_{L_i})\big)^2\right\},$$

and $f(\omega|\psi,S,L)$ following a multivariate normal distribution with mean zero and covariance matrix $\text{diag}(\upsilon_{L_1}^2,\ldots,\upsilon_{L_N}^2)$.
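The implied covariance structure can be checked by simulation. The following sketch (illustrative parameter values, not taken from the paper) repeatedly draws $\omega$ and $\epsilon$ for three individuals, two of whom share a cluster, generates $\theta$ according to Eq. (3), and compares the Monte Carlo covariances with the three displayed cases:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative values: G = 2 groups, C = 2 clusters (cluster c belongs to group c here).
sigma2 = np.array([1.0, 0.5])       # sigma^2_g, residual variance per group
upsilon2 = np.array([0.5, 0.25])    # upsilon^2_g, random-intercept variance per group
gamma = np.array([[0.3], [-0.2]])   # gamma_g, one covariate per group

# Individuals 0 and 1 share cluster 0 (group 0); individual 2 is in cluster 1 (group 1).
S = np.array([0, 0, 1])             # cluster membership S_i
L = np.array([0, 0, 1])             # group membership L_i (nested: S_i = S_i' implies L_i = L_i')
X = rng.normal(size=(3, 1))         # fixed covariates

R = 200_000                         # Monte Carlo replications of (omega, epsilon)
omega = rng.normal(0.0, np.sqrt(upsilon2), size=(R, 2))   # omega_c for c = 0, 1
eps = rng.normal(0.0, np.sqrt(sigma2[L]), size=(R, 3))    # epsilon_i with variance sigma^2_{L_i}
theta = omega[:, S] + (X * gamma[L]).sum(axis=1) + eps    # Eq. (3)

V = np.cov(theta, rowvar=False)
# V[0, 1] ~ upsilon^2 within the shared cluster, V[0, 0] ~ sigma^2 + upsilon^2,
# and V[0, 2] ~ 0 across clusters, matching the displayed covariance structure.
```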

In the case of completely observed data Y and X, the Bayesian model setup is then completed by an appropriate prior distribution $\pi(\psi)$. However, the estimation of IRT models is in general plagued by an identification problem, where the classical identification strategies impose restrictions on the parameter space. For the given model, the identification problem can be described as follows. First, the overall means of $y_{ij}^*$ are implied by the mean values of $\theta_i$, $\beta_j$, and $\kappa_j$, as well as the signs of $\alpha_j$. The mean values of $\theta_i$ in turn arise from the regression coefficients $\gamma_g$ in combination with the observed covariates $X_i$. Second, the scaling of $y_{ij}^*$ is implied by the scaling of $\theta_i$ and $\alpha_j$, where the scaling of $\theta_i$ arises from the variance parameters $\upsilon_g^2$ and $\sigma_g^2$. These interdependencies imply that the parameters are not jointly identifiable. However, for given signs of $\alpha_j$ and given mean values for two of the three quantities $\theta_i$, $\beta_j$, and $\kappa_j$, the mean values of the remaining quantity become identifiable. The same holds for the scaling issue, where for given signs of $\alpha_j$ and a given scaling for one of the two quantities $\theta_i$ and $\alpha_j$, the remaining scaling becomes identifiable. The decision which mean and scaling parameters to fix is in principle arbitrary. However, for the considered hierarchical structures it is more convenient, also in terms of the implied sampling scheme, to restrict the item parameters $\alpha_j$, $\beta_j$, and $\kappa_j$.
The typical choice, as discussed in the literature by Fox (2010) and Albert and Chib (1997), imposes the following ordering and value constraints on the parameter space. With regard to the threshold parameters, the restrictions can be formulated in terms of the condition $\prod_{j=1}^J {\mathcal{I}}(\kappa_{j0}=-\infty,\kappa_{j1}=0<\kappa_{j2}<\cdots<\kappa_{jQ_j-1}<\kappa_{jQ_j}=+\infty)$, while for the item difficulties and discrimination parameters we have ${\mathcal{I}}(\sum_{j=1}^J \beta_j=0)$ and ${\mathcal{I}}(\prod_{j=1}^J \alpha_j {\mathcal{I}}(\alpha_j>0)=1)$. Given these identifying restrictions, appropriate (conjugate) prior distributions can be formulated as given in Table 1. In the light of the Clifford–Hammersley theorem, see Robert and Casella (2004) for theorem and proof, the implied joint posterior distribution

(6) $$f(\theta,Y^*,\omega,\psi|Y,X,L,S)\propto f(Y,Y^*,\theta,\omega|\psi,X,S,L)\,\pi(\psi)$$

is accessible in terms of the corresponding set of full conditional distributions. With $Z=\{Y,X,S,L\}$, we have

(7) $$f(\theta,Y^*,\omega,\psi|Z)\propto \frac{f(\theta|{\tilde{Y}}^*,{\tilde{\omega}},{\tilde{\psi}},Z)}{f({\tilde{\theta}}|{\tilde{Y}}^*,{\tilde{\omega}},{\tilde{\psi}},Z)}\,\frac{f(Y^*|\theta,{\tilde{\omega}},{\tilde{\psi}},Z)}{f({\tilde{Y}}^*|\theta,{\tilde{\omega}},{\tilde{\psi}},Z)}\,\frac{f(\omega|\theta,Y^*,{\tilde{\psi}},Z)}{f({\tilde{\omega}}|\theta,Y^*,{\tilde{\psi}},Z)}\,\frac{f(\psi|\theta,Y^*,\omega,Z)}{f({\tilde{\psi}}|\theta,Y^*,\omega,Z)},$$

where the chosen sequence ordering $\theta,Y^*,\omega,\psi$ is arbitrary and ${\tilde{\cdot}}$ denotes any admissible point of the indicated variable. The set of full conditional distributions resulting from Eq. (7), employed within an MCMC algorithm taking the form of an iterative sequential Metropolis–Hastings (MH) within Gibbs sampling scheme to provide inference based on a sample from the posterior distribution, is given in detail in Sect. 2.2.
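The location and scale restrictions discussed above can be made concrete as a projection of an arbitrary, unidentified parameter draw onto the constraint set. The following sketch uses purely illustrative values and assumes a linear predictor in which $\alpha_j$ multiplies $\theta_i$ (cf. Eq. (2)), so that rescaling $\alpha$ and $\theta$ jointly leaves the predictor unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
J, N = 10, 50

# An arbitrary (unidentified) draw; values are illustrative only.
alpha = rng.lognormal(size=J)   # discrimination parameters, alpha_j > 0
beta = rng.normal(size=J)       # item difficulties
theta = rng.normal(size=N)      # latent traits

# Scale restriction: divide alpha by its geometric mean s and multiply theta by s,
# so that prod_j alpha_j = 1 while every product alpha_j * theta_i is unchanged.
s = np.exp(np.log(alpha).mean())
alpha_c, theta_c = alpha / s, theta * s

# Location restriction: center beta so that sum_j beta_j = 0; the removed common
# shift c would be absorbed by the remaining location parameters (e.g., thresholds).
c = beta.mean()
beta_c = beta - c
```

The point of the sketch is that the constrained draw represents exactly the same item response probabilities as the unconstrained one, which is why fixing these quantities resolves the identification problem without restricting the model.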

Table 1. Prior specifications and MCMC starting values.

The hyperparameters are chosen as $\{\nu_{\gamma_g}=0,\ \Omega_{\gamma_g}=100\,\textbf{I}_{P+1},\ a_{\sigma_g^2}=3,\ b_{\sigma_g^2}=1,\ a_{\upsilon_g^2}=3,\ b_{\upsilon_g^2}=1\}_{g=1}^G$ and $\{\nu_{\alpha_j}=0,\ \Omega_{\alpha_j}=100,\ \nu_{\beta_j}=0,\ \Omega_{\beta_j}=100,\ \{\nu_{\kappa_{jq}}=0,\ \Omega^2_{\kappa_{jq}}=100\}_{q=1}^{Q_j}\}_{j=1}^J$. The hyperparameters of the inverse gamma distributions are chosen to provide finite variance and the smallest possible prior sample size.
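Given these conjugate priors, several full conditional distributions have closed form. As a minimal runnable sketch (toy data and illustrative values, not the paper's full algorithm), the following draws the random intercepts $\omega_c$ and their variance $\upsilon^2$ from their full conditionals, treating $\theta$ and $\sigma^2$ as given and using the inverse gamma hyperparameters $a=3$, $b=1$ from above:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated toy data: C clusters of size n, known sigma2, true upsilon2 = 0.5.
C, n, sigma2, ups2_true = 50, 20, 1.0, 0.5
omega_true = rng.normal(0.0, np.sqrt(ups2_true), size=C)
theta = omega_true[:, None] + rng.normal(0.0, np.sqrt(sigma2), size=(C, n))

a, b = 3.0, 1.0     # inverse gamma hyperparameters as in Table 1
ups2 = 1.0          # starting value
draws = []
for it in range(2000):
    # omega_c | theta, ups2: normal with precision n/sigma2 + 1/ups2 and
    # mean given by the precision-weighted cluster sum of theta.
    prec = n / sigma2 + 1.0 / ups2
    mean = (theta.sum(axis=1) / sigma2) / prec
    omega = rng.normal(mean, np.sqrt(1.0 / prec))
    # ups2 | omega: inverse gamma IG(a + C/2, b + sum_c omega_c^2 / 2),
    # sampled as the reciprocal of a gamma draw.
    ups2 = 1.0 / rng.gamma(a + C / 2.0, 1.0 / (b + 0.5 * (omega ** 2).sum()))
    if it >= 500:
        draws.append(ups2)

ups2_hat = np.mean(draws)   # posterior mean estimate of upsilon^2
```

In the actual algorithm these two updates are embedded in the full blocking scheme of Sect. 2.2, where $\theta$, $Y^*$, and $\psi$ are drawn as well.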

Next, we discuss the handling of missing values. Given the factorization of the likelihood described in Eqs. (2) and (5), handling of missing values in the item responses $Y=(Y_{\text{obs}},Y_{\text{mis}})$ is directly possible by dropping the corresponding elements $Y_{\text{mis}}$ from the likelihood. That is, for each item $j$, only the observed $y_{ij}$ are used to estimate the parameters. An alternative approach to handling missing values in Y would be to treat them as wrong answers. Our approach is also fully compatible with von Hippel (2007), who suggests considering draws of $Y_{\text{mis}}$ from the posterior predictive distribution for the specification of the full conditional distributions of the missing values in the covariates X, but not using them for analysis.Footnote 10

However, when facing partially observed X, one has to choose an appropriate missing data technique to facilitate estimation. In the following, we denote $X=(X_{\text{obs}},X_{\text{mis}})$. In the context of the considered model structure, the latent variables and hierarchical structures take the role of sufficient statistics and may play a crucial role for implementing appropriate models defining the uncertainty associated with the missing values $X_{\text{mis}}$. We suggest handling $X_{\text{mis}}$ by means of DA, as this allows for advantageous use of the latent and hierarchical model structures within the modeling of missing values by means of Rao-Blackwellization and, due to a lower-dimensional representation of the relevant information, also reduces the computational burden.Footnote 11 The advantages relate to gains in statistical efficiency in the estimation of $\psi$, captured by bias, root mean square error, and coverage.
Hence, the augmented posterior distribution

$$f(\theta,Y^*,\omega,\psi,X_{\text{mis}}|Y,X_{\text{obs}},S,L)\propto f(Y,Y^*,\theta,\omega|\psi,X,S,L)\,\pi(X_{\text{mis}}|X_{\text{obs}},\psi)\,\pi(\psi),$$

incorporating an appropriate prior distribution $\pi(X_{\text{mis}}|X_{\text{obs}},\psi)$, is of interest and subject to inference. The characterization in terms of the full conditional distributions given in Eq. (7) is then extended as follows. With ${\tilde{Z}}=\{Y,X_{\text{obs}},{\tilde{X}}_{\text{mis}},S,L\}$, we have

(8) $$f(\theta,Y^*,\omega,\psi,X_{\text{mis}}|Y,X_{\text{obs}},S,L)\propto f(\theta,Y^*,\omega,\psi|{\tilde{Z}})\,\frac{f(X_{\text{mis}}|\theta,Y^*,\omega,\psi,Y,X_{\text{obs}},S,L)}{f({\tilde{X}}_{\text{mis}}|\theta,Y^*,\omega,\psi,Y,X_{\text{obs}},S,L)},$$

thereby augmenting the MCMC sampling scheme.Footnote 12

The suggested sequential sampling is also well suited to deal with regression specifications involving cross products of variables in X. Given an initialization of $X_{\text{mis}}$, and thus of the involved cross products, missing values for one variable can be drawn. If this variable is involved in cross products, these cross products are updated. This procedure is then repeated for each variable in X. In order to establish highly flexible modeling of the distributions of $X_{\text{mis}}$ and to allow for handling a possibly large number of background variables, we adopt sequential recursive classification and regression trees in combination with sampling via a Bayesian bootstrap (CART-BB) for the construction of full conditional distributions; see Burgette and Reiter (2010) and Rubin (1981). Modeling the full conditional distributions of missing values in this way is compatible with assuming prior distributions for the missing values proportional to the empirical densities observed for each variable; see also Table 1.Footnote 13 This choice is motivated by the flexibility of CART-BB to handle variables of any scale and its potential to cope with nonlinear relationships among the variables; see also Doove et al. (2014).
The application of CART-BB to model the full conditional distributions of missing values is particularly useful because the analyst does not need to specify these full conditional distributions (imputation models) explicitly. The complete set of full conditional distributions and further details referring to the augmented parameter vector are provided in the following. We label the suggested Bayesian estimation approach, using data augmentation and sequential recursive partitioning classification and regression trees combined with a Bayesian bootstrap for handling missing values in covariates, the DART approach.
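To fix ideas, the following is a deliberately minimal sketch of one CART-BB update: a depth-1 regression tree partitions the complete cases using the latent trait $\theta$ as the only predictor (illustrating how a latent quantity can serve as a sufficient statistic for imputation), and each missing value is then drawn from the donors in its leaf via the Bayesian bootstrap. All data and values are illustrative; the actual approach grows full trees sequentially across all covariates of X.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_split(z, x):
    """Depth-1 regression tree: threshold on predictor z minimizing the SSE of x."""
    order = np.argsort(z)
    z_s, x_s = z[order], x[order]
    best_thr, best_sse = None, np.inf
    for k in range(1, len(z_s)):
        left, right = x_s[:k], x_s[k:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_thr = sse, (z_s[k - 1] + z_s[k]) / 2.0
    return best_thr

def bayesian_bootstrap_draw(donors, rng):
    """Rubin (1981): draw one donor using Dirichlet(1, ..., 1) weights."""
    w = rng.dirichlet(np.ones(len(donors)))
    return donors[rng.choice(len(donors), p=w)]

# Toy data: covariate x depends nonlinearly on the latent trait theta.
N = 500
theta = rng.normal(size=N)
x = np.where(theta > 0.0, 2.0, -2.0) + rng.normal(scale=0.3, size=N)
miss = rng.random(N) < 0.3                      # missingness indicator for x
x_obs, theta_obs = x[~miss], theta[~miss]

thr = best_split(theta_obs, x_obs)              # grow the (stump) tree on complete cases
x_imp = x.copy()
for i in np.where(miss)[0]:
    leaf = x_obs[theta_obs <= thr] if theta[i] <= thr else x_obs[theta_obs > thr]
    x_imp[i] = bayesian_bootstrap_draw(leaf, rng)
```

Because the leaves adapt to the observed conditional distribution, the draws reproduce the nonlinear relation between $\theta$ and $x$ without an explicitly specified imputation model, which is exactly the property exploited by the DART approach.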

2.2. Bayesian Inference

Bayesian inference is based on a posterior sample generated via the following MCMC algorithm, which iteratively samples from the set of full conditional distributions (Footnote 14). The algorithm is based on the blocking scheme $y_{11}^*,\ldots,y_{NJ}^*$; $\alpha_1,\beta_1,\ldots,\alpha_J,\beta_J$; $\kappa_1,\ldots,\kappa_J$; $X_{\text{mis}}^{(1)},\ldots,X_{\text{mis}}^{(P)}$; $\theta_1,\ldots,\theta_N$; $\gamma_1,\ldots,\gamma_G$; $\sigma_1^2,\ldots,\sigma_G^2$; $\omega_1,\ldots,\omega_C$; $\upsilon_1^2,\ldots,\upsilon_G^2$, where the initialization of all quantities except $y_{11}^*,\ldots,y_{NJ}^*$ is described in Table 1, and initial values for $\theta$ and $\omega$ are drawn from standard normal distributions. An implementation of this MCMC sampling algorithm in R is available within the supplementary material. The set of full conditional distributions can be described as follows.

$f(y_{ij}^*|\cdot)$

The full conditional distributions of the random variables $y_{ij}^*$, $i=1,\ldots,N$, $j=1,\ldots,J$, are independent. Each $y_{ij}^*$ is sampled from a truncated normal distribution with moments $\mu_{y_{ij}^*}=\alpha_j\theta_i-\beta_j$ and $\sigma^2_{y_{ij}^*}=1$, where the truncation interval is $(\kappa_{jy_{ij}},\kappa_{jy_{ij}+1})$.
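The paper's implementation is in R; as a minimal Python sketch of this step (function and variable names are hypothetical), drawing $y_{ij}^*$ from a unit-variance normal truncated to the interval between the two adjacent cutoffs:

```python
import numpy as np
from scipy.stats import truncnorm

def draw_y_star(alpha_j, beta_j, theta_i, lower, upper, rng):
    """Draw y*_ij from N(alpha_j * theta_i - beta_j, 1) truncated to (lower, upper)."""
    mu = alpha_j * theta_i - beta_j
    # truncnorm expects the bounds standardized by loc and scale (scale = 1 here)
    return truncnorm.rvs(lower - mu, upper - mu, loc=mu, scale=1.0, random_state=rng)

rng = np.random.default_rng(0)
# e.g. a response whose adjacent cutoffs are kappa_jq = 0.0 and kappa_j(q+1) = 1.5
draw = draw_y_star(alpha_j=1.2, beta_j=0.3, theta_i=0.5, lower=0.0, upper=1.5, rng=rng)
```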

$f(\alpha_1,\beta_1,\ldots,\alpha_J,\beta_J|\cdot)$

Note that for the assumed model structure, in the absence of the identifying restrictions, the full conditional distributions of the item parameters $\xi_j=(\alpha_j,\beta_j)'$, $j=1,\ldots,J$, are mutually independent. In the presence of the identifying restrictions, however, an arbitrarily chosen single element, say $\xi_{j'}$, is completely determined by the other $J-1$ item parameters, i.e., $\xi_{j'}=((\prod_{j\ne j'}\alpha_j)^{-1},-\sum_{j\ne j'}\beta_j)$. In this sense, the joint distribution of all item parameters is defective, as the distribution of the element implied by the other elements is not specified.
Further, sampling from the full conditional distribution of the item parameters $\xi_j$ in the absence of identifying restrictions can be characterized in terms of the linear regression equation $y_j^*=H\xi_j+\epsilon_j$, where $H$ is an $N\times 2$ auxiliary matrix with columns $\theta$ and $-\iota_N$, and $\iota_N$ denotes an $N\times 1$ vector of ones. Since $\epsilon_j$ is normally distributed, the full conditional of $\xi_j$ is proportional to a bivariate truncated normal distribution with covariance matrix and mean vector

$$\Sigma_{\xi_j} = \big(H'H+\Omega_{\xi_j}^{-1}\big)^{-1} \quad\text{and}\quad \mu_{\xi_j} = \Sigma_{\xi_j}\big(H'y_j^* + \Omega_{\xi_j}^{-1}\nu_{\xi_j}\big).$$

The positivity constraints on the item discrimination parameters causing the truncation are handled via accept-reject sampling: in each iteration, sampling is repeated until a draw is accepted. The values of the hyperparameters $\nu_{\xi_j}$ and $\Omega_{\xi_j}$ are chosen as given in Table 1. Note that for any possible subset containing $J-1$ item parameters, the remaining item parameter, say $\xi_{j'}$, is implied by the assumed identifying restrictions. Although this element is determined by all other elements, the data-driven information contained in the above regression is not incorporated in its characterization. Further, $J$ equivalent possibilities exist to characterize the redundant element. Hence, combining these $J$ alternative possibilities to draw from the full conditional distribution into a single draw via averaging is preferable, as it uses all available data-based information and thus improves mixing and convergence.
Given draws for $\alpha=(\alpha_1,\ldots,\alpha_J)$ and $\beta=(\beta_1,\ldots,\beta_J)$, averaging the $J$ characterizations is possible in terms of the geometric mean and the arithmetic mean, resulting in $\alpha=(\alpha_1(\prod_{j=1}^J\alpha_j)^{-\frac{1}{J}},\ldots,\alpha_J(\prod_{j=1}^J\alpha_j)^{-\frac{1}{J}})$ and $\beta=(\beta_1-\frac{1}{J}\sum_{j=1}^J\beta_j,\ldots,\beta_J-\frac{1}{J}\sum_{j=1}^J\beta_j)$. We refer to this approach to handling the identifying restrictions as a kind of marginal data augmentation; see, among others, Imai and van Dyk (2005).
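This averaging step is simple to implement. A minimal sketch (in Python rather than the paper's R code; names are hypothetical) rescales the discriminations by their geometric mean and centers the difficulties by their arithmetic mean:

```python
import numpy as np

def identify(alpha, beta):
    """Enforce prod(alpha) = 1 and sum(beta) = 0 by rescaling the discriminations
    with their geometric mean and centering the difficulties."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    gmean = np.prod(alpha) ** (1.0 / alpha.size)
    return alpha / gmean, beta - beta.mean()

a_new, b_new = identify([0.8, 1.0, 1.5], [-0.2, 0.1, 0.4])
```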

$f(\kappa_j|\cdot)$

Draws from the mutually independent full conditional distributions of the item category cutoff parameters $\kappa_j$ are obtained via a Metropolis–Hastings (MH) step following Albert and Chib (1997). To perform this sampling step it is convenient to consider a reparameterization of the elements $\kappa_{j2},\ldots,\kappa_{jQ_j-1}$, where $\kappa_{jq}=\sum_{w=2}^q\exp\{\tau_{jw}\}$ for all $j=1,\ldots,J$ and $q=2,\ldots,Q_j-1$.
The threshold parameters can then be stated as $\kappa_j=(-\infty,0,\kappa_{j2},\ldots,\kappa_{jQ_j-1},\infty)=h(\tau_j)=(h_{j0},h_{j1},h_{j2},\ldots,h_{jQ_j-1},h_{jQ_j})=(-\infty,0,\exp\{\tau_{j2}\},\exp\{\tau_{j2}\}+\exp\{\tau_{j3}\},\ldots,\sum_{q=2}^{Q_j-1}\exp\{\tau_{jq}\},\infty)$. Given the prior for $\kappa_j$, this transformation induces a multivariate normal prior for $\tau_j=(\tau_{j2},\ldots,\tau_{jQ_j-1})$ given as

$$\pi(\tau_j)\propto\prod_{q=2}^{Q_j-1}\exp\left\{-\frac{1}{2\Omega_{\kappa_{jq}}^2}\big(\tau_{jq}-\nu_{\kappa_{jq}}\big)^2\right\}.$$

Hence, the posterior and thus the full conditional distribution can be reformulated in terms of $\tau_j$. To generate a draw from the full conditional of $\tau_j$, we choose as a proposal a multivariate $t$-distribution with mean vector $m_j$, covariance matrix $V_j$, and $Q_j-2$ degrees of freedom, where

$$m_j=\arg\max_{\tau_j}\,\ln\{f(y_j|\xi_j,h(\tau_j),\psi,\theta)\pi(\tau_j)\}$$

and $V_j$ is the inverse of the Hessian of $\ln\{f(y_j|\xi_j,h(\tau_j),\psi,\theta)\pi(\tau_j)\}$ evaluated at $m_j$. Note that $f(y_j|\xi_j,h(\tau_j),\theta,\psi)=\prod_{i=1}^N[\Phi(h_{jy_{ij}+1}-(\alpha_j\theta_i-\beta_j))-\Phi(h_{jy_{ij}}-(\alpha_j\theta_i-\beta_j))]$. The probability of accepting candidate values $\tau_j^{\text{cand}}$ is given as

$$a_{\tau_j}=\min\left\{1,\frac{f\big(y_j|\xi_j,h\big(\tau_j^{\text{cand}}\big),\psi,\theta\big)\pi\big(\tau_j^{\text{cand}}\big)}{f\big(y_j|\xi_j,h(\tau_j),\psi,\theta\big)\pi(\tau_j)}\frac{f_t\big(\tau_j|m_j,V_j,Q_j-2\big)}{f_t\big(\tau_j^{\text{cand}}|m_j,V_j,Q_j-2\big)}\right\}.$$

The acceptance rates within the simulation study and the empirical application were found to be at least 0.95. A draw for $\kappa_j$ is then implied by $h(\tau_j)$. The chosen hyperparameter values for $\Omega_{\kappa_{jq}}^2$ and $\nu_{\kappa_{jq}}$ are given in Table 1.
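The cutoff transformation $h(\tau_j)$ and the MH decision can be sketched as follows; this is an illustrative Python version with hypothetical names (the paper's implementation is in R). By construction the cumulative sum of exponentials yields strictly increasing cutoffs:

```python
import numpy as np

def h(tau_j):
    """Map unconstrained tau_j (length Q_j - 2) to the ordered cutoff vector
    (-inf, 0, cumsum(exp(tau_j)), inf)."""
    return np.concatenate(([-np.inf, 0.0], np.cumsum(np.exp(tau_j)), [np.inf]))

def mh_accept(log_target_cand, log_target_curr, log_prop_cand, log_prop_curr, u):
    """Log-scale MH decision: True means the candidate replaces the current value."""
    log_a = (log_target_cand - log_target_curr) + (log_prop_curr - log_prop_cand)
    return np.log(u) < min(0.0, log_a)

kappa_j = h(np.array([-0.5, 0.2]))  # Q_j = 4 response categories -> 5 cutoffs
```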

$f(X_{\text{mis}}^{(p)}|\cdot)$

Values of $X_{\text{mis}}$ are sampled sequentially for each column vector $X^{(p)}$, $p=1,\ldots,P$, in two steps. Let $X_{\text{com}}^{(\setminus p)}=(X_{\text{obs}}^{(\setminus p)},X_{\text{mis}}^{(\setminus p)})$ denote the completed matrix of covariates in $X$ except column $p$, with the operator $\setminus p$ meaning without $p$ (Footnote 15). First, a decision tree is built for $X^{(p)}$ conditional on the corresponding values of all remaining variables $X_{\text{com}}^{(\setminus p)}$ as well as on $\theta$, $\omega$, $S$, $L$, and $Y$. A further possibility is to consider only subsets of the conditioning variables $\theta$, $\omega$, $S$, $L$, and $Y$. To incorporate a priori uncertainty on the hyperparameters of the sequential partitioning regression trees, we build trees with a randomly varying minimum number of elements within nodes. Every missing observation can then be assigned to a node, and thus to a grouping of observations implied by the binary partition in terms of the conditioning variables.
The values within each node provide access to an empirical distribution function serving as an approximation to the full conditional distribution of a missing value, and thus as the key element for running the data generating mechanism for missing values. With prior distributions of missing values proportional to observed data densities, draws from the empirical distribution function within a node correspond to draws from the full conditional distributions of missing values. To account for the estimation uncertainty of the full conditional distribution, the Bayesian bootstrap is applied to the assigned group of observations, see Rubin (1981). Thereby, the uncertainty regarding the estimated empirical distribution implied by the proposed set of observed values is fully considered (Footnote 16). The considered approach further offers the flexibility to include any function of observed or augmented data within the set of conditioning variables. Next to the matrices $Y^*$ and $Y$, statistics thereof might also be considered. This may include draws of missing values in $Y$ or $Y^*$ from the posterior predictive distributions, as suggested by von Hippel (2007). In case of restricting the analysis to observed values of $Y$ only, as in the empirical illustration, missing categories might additionally be considered.
Note that this is the default of the R function rpart used within the implementation of the CART-BB algorithm; see Therneau and Atkinson (2018). Further, group-specific or individual-specific specifications of the full conditional distributions could also be adopted by considering group-specific variables within the set of conditioning variables only, i.e., by creating a binary partition only for those values fulfilling the conditions $L_i=g$ or $S_i=c$. The sampled $X_{\text{mis}}$ values allow all other steps of the MCMC algorithm to refer to an updated completed matrix of covariates.
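The CART-BB step for one covariate can be sketched as follows. This is an illustrative Python version using scikit-learn's decision trees in place of rpart (the paper's actual implementation is in R, and all names here are hypothetical): fit a tree for the target covariate on the observed rows, route every row to its terminal node, and impute each missing entry by drawing from the observed values in its node with Bayesian-bootstrap (Dirichlet) weights.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cart_bb_impute(x_miss, Z, obs_mask, rng, min_leaf=5):
    """One CART-BB update for a single covariate X^(p) (illustrative sketch).

    x_miss   : covariate values with NaN at missing entries
    Z        : conditioning variables (completed covariates, theta, omega, ...)
    obs_mask : boolean, True where x_miss is observed
    """
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf, random_state=0)
    tree.fit(Z[obs_mask], x_miss[obs_mask])  # tree built on observed rows only
    leaves = tree.apply(Z)                   # terminal node of every row
    x_new = x_miss.copy()
    for i in np.flatnonzero(~obs_mask):
        # donors: observed values routed to the same terminal node
        donors = x_miss[obs_mask & (leaves == leaves[i])]
        # Bayesian bootstrap: Dirichlet(1,...,1) weights over the donors
        w = rng.dirichlet(np.ones(donors.size))
        x_new[i] = rng.choice(donors, p=w)
    return x_new

rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 3))
x = Z[:, 0] + rng.normal(scale=0.5, size=200)
mask = rng.random(200) > 0.2          # True = observed
x_miss = np.where(mask, x, np.nan)
x_imp = cart_bb_impute(x_miss, Z, mask, rng)
```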

$f(\theta_i|\cdot)$

The full conditional distributions of $\theta_i$, $i=1,\ldots,N$, are elementwise conditionally independent. Let $B_i=y_i^*+\beta$. This allows for stating the full conditional distribution of the individual abilities as normal with moments

(9) $$\sigma^2_{\theta_i} = \big(\alpha'\alpha + \sigma_{S_i}^{-2}\big)^{-1} \quad\text{and}\quad \mu_{\theta_i} = \sigma^2_{\theta_i}\left(\alpha' B_i + \sigma_{S_i}^{-2}(\omega_{S_i} + X_i\gamma_{L_i})\right).$$
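As a minimal sketch (in Python/NumPy, with variable names of our own choosing rather than taken from the authors' implementation), this Gibbs step reduces to a single univariate normal draw per respondent using the moments of Eq. (9):

```python
import numpy as np

def draw_theta_i(alpha, B_i, sigma2_g, omega_c, x_i, gamma_g, rng):
    """One Gibbs draw of theta_i ~ N(mu_theta, sigma2_theta) per Eq. (9).

    alpha    : (J,) item discrimination parameters
    B_i      : (J,) y_i^* + beta (latent responses plus difficulties)
    sigma2_g : residual variance of the respondent's group g (= L_i)
    omega_c  : random intercept of the respondent's cluster c (= S_i)
    x_i      : (P,) covariate row of respondent i
    gamma_g  : (P,) group-specific regression coefficients
    """
    prec = alpha @ alpha + 1.0 / sigma2_g              # alpha'alpha + sigma_{S_i}^{-2}
    sigma2_theta = 1.0 / prec
    mu_theta = sigma2_theta * (alpha @ B_i + (omega_c + x_i @ gamma_g) / sigma2_g)
    return rng.normal(mu_theta, np.sqrt(sigma2_theta))
```

In a full sampler this draw is repeated for every respondent within each MCMC iteration, conditional on the current values of all other quantities.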
$f(\gamma_g|\cdot)$

To sample from the full conditional distributions of the regression coefficients, let $D^C$ denote an $N\times C$ design matrix of zeros and ones. Each row of $D^C$ has a single entry of 1 indicating the respondent's cluster membership $S_i$. The operator $[g]$ selects the elements of $\theta$, and the rows of $X$ and $D^C$, for which the condition $L_i=g$ holds. Further, let $\Sigma_\epsilon$ be an $N_g\times N_g$ diagonal matrix with elements $\sigma^2_{\epsilon,g}$. Draws from the conditional distribution of $\gamma_g$ are obtained from a multivariate normal with covariance matrix and mean vector

$$\Sigma_{\gamma_g} = \big(X_{[g]}'\Sigma_\epsilon^{-1}X_{[g]} + \Omega_{\gamma_g}^{-1}\big)^{-1} \quad\text{and}\quad \mu_{\gamma_g} = \Sigma_{\gamma_g}\big(X_{[g]}'\Sigma_\epsilon^{-1}(\theta_{[g]} - D^C_{[g]}\omega) + \Omega_{\gamma_g}^{-1}\nu_{\gamma_g}\big).$$

Note that the values of the hyperparameters $\nu_{\gamma_g}$ and $\Omega_{\gamma_g}$ are chosen as given in Table 1.
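A hedged illustration of this sampling step, assuming a generic NumPy setup and variable names of our own (not the authors' code):

```python
import numpy as np

def draw_gamma_g(X_g, theta_g, Dc_omega_g, sigma2_eps_g, Omega0_inv, nu0, rng):
    """One Gibbs draw of gamma_g from its multivariate normal full conditional.

    X_g          : (N_g, P) covariate rows of group g
    theta_g      : (N_g,) ability draws of group g
    Dc_omega_g   : (N_g,) cluster intercepts D^C_{[g]} omega
    sigma2_eps_g : residual variance (diagonal of Sigma_epsilon)
    Omega0_inv   : (P, P) prior precision Omega_{gamma_g}^{-1}
    nu0          : (P,) prior mean nu_{gamma_g}
    """
    Sigma_inv = X_g.T @ X_g / sigma2_eps_g + Omega0_inv
    Sigma = np.linalg.inv(Sigma_inv)
    mu = Sigma @ (X_g.T @ (theta_g - Dc_omega_g) / sigma2_eps_g + Omega0_inv @ nu0)
    return rng.multivariate_normal(mu, Sigma)
```

With a diffuse prior and informative data, a draw concentrates around the least-squares solution of the regression of $\theta_{[g]} - D^C_{[g]}\omega$ on $X_{[g]}$.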

$f(\sigma_g^2|\cdot)$

Each group $g$ contains $C_g$ clusters and $N_g$ respondents, where $\sum_{g=1}^{G}C_g=C$ and $\sum_{g=1}^{G}N_g=N$. Choosing a conjugate prior, the full conditional distribution of $\sigma_g^2$ is inverse gamma with shape and scale parameters

$$a_{\sigma_g^2} = a^0_{\sigma_g^2} + N_g/2, \quad b_{\sigma_g^2} = \Big(b^0_{\sigma_g^2} + \tfrac{1}{2}\big(\theta_{[g]} - D^C_{[g]}\omega - X_{[g]}\gamma_g\big)'\big(\theta_{[g]} - D^C_{[g]}\omega - X_{[g]}\gamma_g\big)\Big)^{-1},$$

where the values of the hyperparameters $a^0_{\sigma_g^2}$ and $b^0_{\sigma_g^2}$ are chosen as given in Table 1.
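This draw can be sketched as the reciprocal of a gamma variate; the routine below is an illustration under our own naming assumptions. The analogous step for $\upsilon_g^2$ uses shape $a^0 + C_g/2$ and the residual vector $\omega_{[g]}$ instead:

```python
import numpy as np

def draw_sigma2_g(resid_g, a0, b0, rng):
    """One Gibbs draw of sigma_g^2 from its inverse-gamma full conditional.

    resid_g : (N_g,) residuals theta_{[g]} - D^C_{[g]} omega - X_{[g]} gamma_g
    a0, b0  : prior shape and scale hyperparameters

    Shape is a0 + N_g/2 and the inverse-gamma scale is b0 + 0.5 * resid'resid,
    matching the paper's convention, where the gamma sampler receives the
    reciprocal of that quantity as its scale argument.
    """
    a = a0 + resid_g.size / 2.0
    b = 1.0 / (b0 + 0.5 * resid_g @ resid_g)
    return 1.0 / rng.gamma(shape=a, scale=b)  # inverse-gamma draw
```

For a large group, a draw concentrates near the empirical residual variance, as the prior contribution becomes negligible.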

$f(\omega_c|\cdot)$

Let the operator $[c]$ select the elements of $\theta$, respectively the rows of $X$, belonging to cluster $c$, and let $N_c$ be the total number of persons in cluster $c$. The cluster-specific random intercepts $\omega_c$ are conditionally independent and follow a full conditional distribution given as a normal distribution with moments

$$\sigma^2_{\omega_c} = \left(\upsilon^{-2}_{S_c} + N_c/\sigma^2_{S_c}\right)^{-1} \quad\text{and}\quad \mu_{\omega_c} = \sigma^2_{\omega_c}\,\sigma^{-2}_{S_c}(\theta_{[c]} - X_{[c]}\gamma_{S_c})'\iota_{N_c}.$$

The chosen values for hyperparameters are given in Table 1.
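A minimal sketch of this step (our own variable names; one normal draw per cluster), assuming the group quantities of the cluster are passed in directly:

```python
import numpy as np

def draw_omega_c(theta_c, X_c, gamma_g, sigma2_g, upsilon2_g, rng):
    """One Gibbs draw of the cluster intercept omega_c (normal full conditional).

    theta_c    : (N_c,) abilities of respondents in cluster c
    X_c        : (N_c, P) their covariate rows
    gamma_g    : (P,) coefficients of the cluster's group g = S_c
    sigma2_g   : residual variance of group g
    upsilon2_g : random-intercept variance of group g
    """
    n_c = theta_c.size
    var = 1.0 / (1.0 / upsilon2_g + n_c / sigma2_g)            # sigma^2_{omega_c}
    mean = var * np.sum(theta_c - X_c @ gamma_g) / sigma2_g    # mu_{omega_c}
    return rng.normal(mean, np.sqrt(var))
```

Note that the posterior mean shrinks the cluster's average residual toward zero, the prior mean of the random intercepts, with the amount of shrinkage governed by $\upsilon^2_{S_c}$ and $N_c$.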

$f(\upsilon_g^2|\cdot)$

Given a conjugate prior and making use of the operator $[g]$, $\upsilon_g^2$ is distributed inverse gamma with shape and scale parameters

$$a_{\upsilon_g^2} = a^0_{\upsilon_g^2} + C_g/2 \quad\text{and}\quad b_{\upsilon_g^2} = \left(b^0_{\upsilon_g^2} + 0.5\,\omega_{[g]}'\omega_{[g]}\right)^{-1}.$$

Note that the values of the hyperparameters $a^0_{\upsilon_g^2}$ and $b^0_{\upsilon_g^2}$ are chosen as given in Table 1.

Given this MCMC algorithm, parameter estimates and functions of interest thereof can be readily obtained from the MCMC output, denoted as $\{\psi^{(r)},\theta^{(r)},\omega^{(r)}\}_{r=1}^R$, with $R$ denoting the number of iterations after burn-in. Adopting an absolute loss function, the estimates are given by the medians of the posterior sample. Their calculation does not involve the application of any combining rules, in contrast to other approaches for handling missing values. If relevant, the MCMC output with regard to the augmented quantities $\{Y^{*,(r)}, X^{(r)}=(X_{\text{obs}},X_{\text{mis}}^{(r)})\}_{r=1}^R$ may be considered as well. To illustrate, given the hierarchical model structure, the within-group correlation may also be of interest, i.e.

$$\text{Cor}(\theta_i,\theta_{i'}|\psi,X,S_i=S_{i'}=g,i\ne i')=\frac{\upsilon_g^2}{\upsilon_g^2+\sigma_g^2}$$

with the corresponding estimator given as

$$\widetilde{\text{Cor}}(\theta_i,\theta_{i'}|\psi,X,S_i=S_{i'}=g,i\ne i')= \text{median}\left\{\frac{\upsilon_g^{2(r)}}{\upsilon_g^{2(r)}+\sigma_g^{2(r)}}\right\}_{r=1}^R.$$
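This estimator is simply the posterior median of a ratio of stored variance draws; a minimal sketch (array names are ours):

```python
import numpy as np

def within_group_cor(upsilon2_draws, sigma2_draws):
    """Posterior-median estimator of the within-group correlation
    upsilon_g^2 / (upsilon_g^2 + sigma_g^2) from R MCMC draws."""
    ratios = upsilon2_draws / (upsilon2_draws + sigma2_draws)
    return np.median(ratios)
```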

Next, the effects of changes in $X$ on the individual competence level conditional on school type $g$ (CE) might be of interest. Additionally, the effects relative to another school type $g'$ (RE) or the conditional effects in standardized form (CSE), see e.g. Nieminen et al. (2013), can be considered, i.e.

(10) $$\text{CE}_{X,g}=\gamma_g,\quad \text{RE}_{X,g,g'}=\gamma_g-\gamma_{g'},\quad\text{and}\quad \text{CSE}_{X,g}=\frac{\text{sd}[X_{[g]}]}{\text{sd}[\theta_{[g]}]}\gamma_g,$$

where $\text{sd}$ denotes the vector of standard deviations of the column vectors in $X_{[g]}$. Context effects in the form of ceteris paribus effects can also be considered, e.g. $\text{CP}=\text{E}[\theta_i|X_i,\psi,L_i=g]-\text{E}[\theta_i|X_i,\psi,L_i=g']=X_i(\gamma_g-\gamma_{g'})$ or $\text{CPA}=\text{E}[\theta_i|X_i,\psi,L_i=g,S_i=c,y_i^*,y_i]-\text{E}[\theta_i|X_i,\psi,L_i=g',y_i^*,y_i]=\frac{1}{C}\sum_{c=1}^C\big(\mu_{\theta_i}(X_i,L_i=g,S_i=c,\psi,\omega_c,y_i^*)-\mu_{\theta_i}(X_i,L_i=g',S_i=c,\psi,\omega_c,y_i^*)\big)$, where $\mu_{\theta_i}(\cdot)$ is given in Eq. (9).

Estimates of conditional, relative and conditional standardized effects are readily available as

$$\widetilde{\text{CE}}_{X,g}=\text{median}\left\{\gamma_g^{(r)}\right\}_{r=1}^R,\quad \widetilde{\text{RE}}_{X,g,g'}=\text{median}\left\{\gamma_g^{(r)}-\gamma_{g'}^{(r)}\right\}_{r=1}^R,\quad\text{and}\quad \widetilde{\text{CSE}}_{X,g}=\text{median}\left\{\frac{\text{sd}[X_{[g]}^{(r)}]}{\text{sd}[\theta_{[g]}^{(r)}]}\gamma_g^{(r)}\right\}_{r=1}^R,$$

whereas for the context effects we have $\widetilde{\text{CP}}=\text{median}\left\{X_i^{(r)}(\gamma_g^{(r)}-\gamma_{g'}^{(r)})\right\}_{r=1}^R$ and

$$\widetilde{\text{CPA}}=\text{median}\left\{\frac{1}{C}\sum_{c=1}^C\left(\mu_{\theta_i}\big(X_i^{(r)},L_i=g,S_i=c,\psi^{(r)},\omega_c^{(r)},y_i^{*,(r)}\big) - \mu_{\theta_i}\big(X_i^{(r)},L_i=g',S_i=c,\psi^{(r)},\omega_c^{(r)},y_i^{*,(r)}\big)\right)\right\}_{r=1}^R.$$

Note that measures of uncertainty, e.g. posterior standard deviations or highest posterior density intervals, are likewise directly accessible without the use of combining rules.
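The effect estimators can be computed directly from the stored draws. The following sketch (the array layout and names are our assumptions, not the authors' code) illustrates $\widetilde{\text{CE}}$, $\widetilde{\text{RE}}$, and $\widetilde{\text{CSE}}$:

```python
import numpy as np

def effect_estimates(gamma_g_draws, gamma_gp_draws, X_g_draws, theta_g_draws):
    """Posterior-median CE, RE, and CSE estimates from R MCMC draws.

    gamma_g_draws  : (R, P) draws of gamma_g
    gamma_gp_draws : (R, P) draws of gamma_{g'}
    X_g_draws      : (R, N_g, P) completed covariate matrices of group g
    theta_g_draws  : (R, N_g) ability draws of group g
    """
    ce = np.median(gamma_g_draws, axis=0)
    re = np.median(gamma_g_draws - gamma_gp_draws, axis=0)
    sd_x = X_g_draws.std(axis=1)                          # (R, P) columnwise sd
    sd_theta = theta_g_draws.std(axis=1, keepdims=True)   # (R, 1)
    cse = np.median(sd_x / sd_theta * gamma_g_draws, axis=0)
    return ce, re, cse
```

Posterior standard deviations or credible intervals follow analogously by replacing the median with the corresponding functional of the same per-iteration quantities.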

Finally, note that computation of the marginal data likelihood, i.e.

$$f(Y|X_{\text{obs}},S,L)=\int f(Y|\psi,X,S,L)\,f(X_{\text{mis}}|X_{\text{obs}},\psi)\,f(\psi)\,dX_{\text{mis}}\,d\psi,$$

involved in Bayes factors to allow for non-nested model comparison is possible along the lines suggested by Chib (1995), Chib and Jeliazkov (2001), and Aßmann and Preising (2020) in the context of linear dynamic panel models.

3. Simulation Study

We assess the proposed strategy via a simulation study. To illustrate the possible gains from handling missing values by means of DA, we consider three benchmarks: estimation without missing values, i.e., before any values have been discarded from the data sets (BD); estimation based on complete cases only (CC); and a third benchmark mimicking the handling of missing values without latent structures, i.e., handling missing values in an imputation sense before estimating the model of interest (IBM). For the IBM benchmark, the full conditional distributions of missing values are also constructed via CART-BB, using information from observable variables only. For this, we consider

$$f_{\text{IBM}}\left(X_{\text{mis}}^{(p)}\,\big|\,X_{\text{com}}^{(\setminus p)},Y,S,L\right),\quad p=1,\ldots,P.$$

The IBM strategy conditions on all observables $(Y,X_{\text{obs}},S,L)$ but not on latent model structures like $\theta$ or $\omega$. These three benchmarks are contrasted with the suggested Bayesian estimation approach DART. Within the DART approach, we add the latent variables $\theta$ and $\omega$ to the set of observable conditioning variables when assessing the full conditional distribution of $X_{\text{mis}}$, i.e.

$$f_{\text{DART}}\left(X_{\text{mis}}^{(p)}\,\big|\,X_{\text{com}}^{(\setminus p)},Y,\theta,\omega,S,L\right),\quad p=1,\ldots,P.$$

We also consider a modified version of the DART approach, labeled DART-m, in which $Y$ and $S$ are discarded from the set of conditioning variables entering the CART-BB algorithm. This illustrates that the latent variables $\theta$ and $\omega$ serve as a kind of sufficient statistic for $Y$ and $S$. When specifying the full conditional distributions of the missing values $X_{\text{mis}}$, these sufficient statistics incorporate the relevant information while providing a more parsimonious representation, leading to a noticeable reduction in computation time.

The simulation study is based on the following data generating process (DGP), with comparisons based on averaged estimation results over $\mathcal{S}=1000$ replications of the same DGP. The response matrix $Y$ is simulated according to the model outlined in Eqs. (1), (2) and (3), with a sample of $N=4000$ students allocated equally to $C=20$ schools belonging to one of $G=2$ school types. Thus, there are 200 students per school and 10 schools per school type, corresponding to a nested hierarchical structure.
The respondents face a test of altogether $J=20$ items, of which the first 18 are binary and the last two are ordinal with $Q_{19}=Q_{20}=4$ categories. The $J$ discrimination and difficulty parameters are fixed across replications and were obtained once via draws from uniform distributions on the interval $(0.7, 1.3)$ for the discrimination and $(-0.7, 0.7)$ for the difficulty parameters, respectively. To fulfill the identifying restrictions, the item discrimination and difficulty parameters are transformed in terms of the geometric and arithmetic mean, respectively; see Sect. 2.2 for details.
Finally, the item category cutoff parameters for the two ordinal items are set to $\kappa_{19}=(0,0.5,1)'$ and $\kappa_{20}=(0,0.7,1.4)'$.

We consider three covariates, with two covariates $X^{(p)}$, $p=2,3$, capturing individual differences in the latent trait $\theta_i$. Adding a constant, the regressor matrix can be stated as $X=(\iota_N, X^{(2)}, X^{(3)})$. Since participants in large-scale studies are often heterogeneous, we reflect this circumstance in our simulation study. The chosen DGP leans towards the data situation in empirical surveys such as the NEPS, as we consider heterogeneity between groups of individuals.
Therefore, $X^{(2)}$ is sampled from a Bernoulli distribution with $\Pr(X_{i,g=1}^{(2)}=1)=0.3$ for group 1 ($g=1$) and $\Pr(X_{i,g=2}^{(2)}=1)=0.6$ for group 2 ($g=2$). $X^{(3)}$ is sampled from a normal distribution with school-specific means and variance set to one. The overall means in group 1 are chosen to be smaller than in group 2.
The corresponding parameters of the population model are set to $\gamma_1 = (-0.5, 0.4, 0.2)'$, $\gamma_2 = (1, 0.2, -0.2)'$, $\sigma^2_1 = 0.64$, $\sigma^2_2 = 0.36$, $\upsilon^2_1 = 0.81$, and $\upsilon^2_2 = 0.49$. The simulation study consists of four missing-data scenarios. 
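As an illustration, the covariate part of this DGP can be sketched as follows. The school assignment, the mapping of schools to groups, and the school-specific mean values are hypothetical choices made for the sketch; only the group-specific Bernoulli probabilities, the unit variance of $X^{(3)}$, and the sample dimensions are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

N, C = 4000, 20                          # individuals and schools (as in the tables)
school = rng.integers(0, C, size=N)      # hypothetical school assignment
group = np.where(school < C // 2, 1, 2)  # hypothetical: first half of schools form group 1

# X^(2): Bernoulli with group-specific success probabilities 0.3 / 0.6
x2 = rng.binomial(1, np.where(group == 1, 0.3, 0.6))

# X^(3): normal with school-specific means and unit variance; the school means are
# drawn so that group-1 schools are centered lower than group-2 schools (values hypothetical)
school_mean = np.concatenate([rng.normal(-0.5, 0.2, C // 2),
                              rng.normal(0.5, 0.2, C - C // 2)])
x3 = rng.normal(school_mean[school], 1.0)

# regressor matrix X = (iota_N, X^(2), X^(3))
X = np.column_stack([np.ones(N), x2, x3])
print(X.shape)
```

With this construction the realized group means of $X^{(3)}$ differ by about one, mirroring the stated group heterogeneity.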
For scenarios 1 and 2 the missing rates for $X^{(2)}$ and $X^{(3)}$ depend exclusively on the latent trait variable $\theta$. As suggested by a reviewer, this dependence of the missing probability on $\theta$ characterizes the mechanism as approximately missing at random, since the latent variable $\theta$ becomes estimable in the considered model framework via observable quantities. 
For scenario 3, missing probabilities are determined by weighted sum scores depending on the observed variables $X^{(2)}$ and $X^{(3)}$ and the latent variable $\theta$. Scenario 4 is similar to scenario 3, but the missingness in $X^{(3)}$ depends on $X^{(3)}$ itself, thus characterizing the mechanism as missing not at random. For further details on the four missing scenarios, see Table 2.

Table 2. Simulated missing data mechanisms.

As the missing mechanisms depend on latent variables, scenarios 1–3 can be characterized as approximately missing at random, and scenario 4, where the missing probability depends on the variable itself, as missing not at random. All simulation runs have been performed with 25,000 Gibbs iterations, with the first 5000 iterations discarded as burn-in.
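The construction of such missingness mechanisms can be sketched as follows. The logistic link and all coefficients below are illustrative stand-ins, not the specifications of Table 2; they merely show how a missing probability can be made to depend on $\theta$ alone (scenarios 1 and 2) or on a weighted sum score of covariates and $\theta$ (scenarios 3 and 4).

```python
import numpy as np

rng = np.random.default_rng(2)
N = 4000
theta = rng.normal(size=N)  # latent trait, here simply standard normal

def missing_indicator(score, rng):
    """Draw a missingness indicator from a logistic model in the given score (illustrative)."""
    p_mis = 1.0 / (1.0 + np.exp(-(-1.2 + 1.0 * score)))
    return rng.random(score.shape) < p_mis

# scenarios 1 and 2: missingness driven only by theta
mis_x2 = missing_indicator(theta, rng)

# scenarios 3 and 4: a weighted sum score of observed covariates and theta
x2 = rng.binomial(1, 0.45, size=N)
x3 = rng.normal(size=N)
mis_x3 = missing_indicator(0.5 * x2 + 0.3 * x3 + 0.6 * theta, rng)

print(round(mis_x2.mean(), 2), round(mis_x3.mean(), 2))  # realized missing rates
```

In the MNAR variant of scenario 4, the score would additionally include the (possibly unobserved) value of $X^{(3)}$ itself.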

Table 3. Simulation study (scenario 1, missing rates: $X_1=19\%$, $X_2=26\%$, overall = 33%)—True parameter values, mean posterior medians and standard deviations, RMSEs and coverage ratios of structural parameters (regression coefficients, variance parameters) over 1000 replications obtained from BD, CC, IBM, DART, and DART-m.

$G=2$; $C=20$; $N=4000$; $J=20$; $n_{\text{iter}}=20{,}000+5000$. RMSE = root mean square error; BD = before deletion; CC = complete cases; IBM = multiple imputation before modeling based on observed data; DART = data augmentation using sequential recursive partitioning based on all data and latent parameters; DART-m = data augmentation using sequential recursive partitioning based on the sufficient statistics $\theta$ and $\omega$. Runtimes = mean runtimes per data set in minutes (Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities).

Table 4. Simulation study (scenario 2, missing rates: $X_1=40\%$, $X_2=50\%$, overall = 59%)—True parameter values, mean posterior medians and standard deviations, RMSEs and coverage ratios of structural parameters (regression coefficients, variance parameters) over 1000 replications obtained from BD, CC, IBM, DART, and DART-m.

Notes as for Table 3.

Table 5. Simulation study (scenario 3, missing rates: $X_1=20\%$, $X_2=36\%$, overall = 46%)—True parameter values, mean posterior medians and standard deviations, RMSEs and coverage ratios of structural parameters (regression coefficients, variance parameters) over 1000 replications obtained from BD, CC, IBM, DART, and DART-m.

Notes as for Table 3.

Table 6. Simulation study (scenario 4, missing rates: $X_1=17\%$, $X_2=28\%$, overall = 40%)—True parameter values, mean posterior medians and standard deviations, RMSEs and coverage ratios of structural parameters (regression coefficients, variance parameters) over 1000 replications obtained from BD, CC, IBM, DART, and DART-m.

Notes as for Table 3.

Each of the repeated estimations is finally based on MCMC chains of length 25,000. After discarding the first 5000 iterations as burn-in, inference is based on the remaining 20,000 simulated draws from the joint posterior distribution. Convergence is monitored via the Geweke statistic, the Gelman–Rubin statistic, and the effective sample size; see Geweke (1992), Gelman et al. (2013), and Vehtari et al. (2021), as well as the supplementary material for further information. The convergence diagnostics indicate overall convergence.
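For instance, the Gelman–Rubin potential scale reduction factor can be computed from parallel chains as in the following minimal textbook sketch (not the code used in the paper):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for an (m, n) array of m chains."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)  # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled posterior variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(3)
well_mixed = rng.normal(size=(4, 5000))         # four chains sampling a common target
separated = well_mixed + np.arange(4)[:, None]  # chains stuck at different locations

print(gelman_rubin(well_mixed))  # close to 1: no evidence against convergence
print(gelman_rubin(separated))   # well above 1: chains disagree
```

Values close to 1 are consistent with convergence; practical guidelines flag values noticeably above 1 (e.g., above 1.01 or 1.1, depending on the source).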

Results for the four missing scenarios are presented in Tables 3, 4, 5 and 6. They provide the true parameter values used in the DGP, mean posterior medians and averaged standard deviations over the 1000 replications obtained for the BD, CC, IBM, DART, and DART-m sample estimates with regard to the regression coefficients and conditional variance parameters. Besides the averaged estimates, simulation results are also evaluated in terms of the root mean square error (RMSE) and the coverage, i.e., the proportion of $95\%$ highest posterior density regions (HDRs) that contain the true DGP parameter values. For completeness, results on item characteristics (item discrimination, item difficulty, and item category cutoff parameters) are available in the supplementary material. For the BD estimates we find overall unbiased results for all parameters. These results indicate a correct implementation of the algorithm and further serve as a benchmark to assess the relative performance of the different methods in the case of missing values. As expected, the CC results show a substantial bias, which becomes larger as the proportion of missing values increases. 
The results also show that the biases tend to be larger when the probability of missing values in $X^{(2)}$ and $X^{(3)}$ depends only on $\theta$ (Tables 3 and 4) than when it additionally depends on the covariates themselves (Tables 5 and 6). Not unexpectedly, coverage rates for CC are the lowest; see, e.g., the parameters $\gamma_{0,1}$ and $\sigma_1^2$ in Table 3.
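The evaluation criteria reported in Tables 3–6 can be computed per parameter as sketched below. For brevity, the sketch uses a quantile-based central credible interval in place of the HDR, and the toy posterior draws are synthetic rather than taken from the actual sampler output.

```python
import numpy as np

def evaluate(draws_per_rep, true_value):
    """Mean posterior median, RMSE, and interval coverage over simulation replications.

    draws_per_rep: (R, D) array of D posterior draws for each of R replications.
    """
    draws_per_rep = np.asarray(draws_per_rep, dtype=float)
    medians = np.median(draws_per_rep, axis=1)
    lo, hi = np.quantile(draws_per_rep, [0.025, 0.975], axis=1)  # central 95% interval
    coverage = np.mean((lo <= true_value) & (true_value <= hi))
    rmse = np.sqrt(np.mean((medians - true_value) ** 2))
    return medians.mean(), rmse, coverage

rng = np.random.default_rng(4)
true_gamma = 0.4
# toy posteriors: each replication's draws are centered near the true value,
# shifted by replication-specific sampling error
draws = (true_gamma
         + rng.normal(0.0, 0.05, size=(1000, 1))       # per-replication shift
         + rng.normal(0.0, 0.05, size=(1000, 2000)))   # posterior spread
mean_median, rmse, coverage = evaluate(draws, true_gamma)
print(mean_median, rmse, coverage)
```

In this well-calibrated toy setting the coverage lands near the nominal 95% level, which is the benchmark against which the CC, IBM, DART, and DART-m coverage rates are judged.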

When comparing IBM to DART and DART-m, the differences are less pronounced. Nevertheless, it appears consistently across all four simulation scenarios that DART and DART-m achieve smaller biases. Further inspection of the RMSE for IBM, DART, and DART-m suggests no severe loss of statistical efficiency compared to BD, with a small advantage for DART and DART-m. These results are supported by the coverage rates meeting the nominal $95\%$ level for most of the parameters under DART and especially DART-m. This becomes particularly clear in scenario 2 (Table 4), which shows the highest proportion of missing values: there, IBM achieves a coverage rate of only around $50\%$ for the parameters $\gamma_{1,1}$ and $\gamma_{2,1}$, whereas DART yields higher coverage rates and DART-m higher still.

Taking a look at the averaged standard deviations, these tend to be smaller for IBM: without the latent variables $\theta$ and $\omega$ drawn from the full conditional distributions in each iteration, an important source of variability affecting the uncertainty of the missing values is not considered. Further, without consideration of $\theta$ and $\omega$, the bias increases, as shown by our simulation results.

The advantages of the DART-m approach are particularly evident in the runtimes (mean runtimes per data set in minutes) given in Tables 3, 4, 5 and 6. DART-m efficiently uses the information from the latent variables $\theta$ and $\omega$, which serve as sufficient statistics and can therefore replace the item responses Y and the school affiliation S. The resulting runtimes show that the suggested DART-m approach saves up to one third of the computation time compared to the IBM approach.

Similar effects can be seen when inspecting the properties of the sampled trajectories $\{X_{\text{mis}}^{(r)}\}_{r=1}^R$. The properties arising from the different approaches can be assessed by calculating, for each missing value, the absolute and squared distance between the true (before deletion) and estimated value. With the former providing the bias and the latter the variance, we summarize the findings per variable, aggregating over the missing values per variable and over the simulated data sets. The same procedure is used to obtain root mean square errors. Note that after averaging over missing values per variable and over data sets, root mean square errors are not exactly identical to variance plus squared bias. We calculate the bias as
$$\frac{1}{\mathcal{S}} \sum_{\mathcal{s}=1}^{\mathcal{S}} \frac{1}{\# X_{\text{mis}}^{(j)}}\sum_{k=1}^{\# X_{\text{mis}}^{(j)}}\left|X_{\text{mis},k,\mathcal{s}}^{(j)}-\tilde{X}_{\text{mis},k,\mathcal{s}}^{(j)}\right|,$$
the variance as
$$\frac{1}{\mathcal{S}} \sum_{\mathcal{s}=1}^{\mathcal{S}} \frac{1}{\# X_{\text{mis}}^{(j)}}\sum_{k=1}^{\# X_{\text{mis}}^{(j)}}\left(X_{\text{mis},k,\mathcal{s}}^{(j)}-\hat{X}_{\text{mis},k,\mathcal{s}}^{(j)}\right)^2,$$
and the root mean square error as
$$\frac{1}{\mathcal{S}} \sum_{\mathcal{s}=1}^{\mathcal{S}} \frac{1}{\# X_{\text{mis}}^{(j)}}\sum_{k=1}^{\# X_{\text{mis}}^{(j)}}\sqrt{\left(X_{\text{mis},k,\mathcal{s}}^{(j)}-\tilde{X}_{\text{mis},k,\mathcal{s}}^{(j)}\right)^2}.$$
Thereby, $\# X_{\text{mis}}^{(j)}$ denotes the number of missing values per variable, $X_{\text{mis},k,\mathcal{s}}^{(j)}$ the $k$th missing value in variable $j=2,3$, and $\tilde{X}_{\text{mis},k,\mathcal{s}}^{(j)}$ and $\hat{X}_{\text{mis},k,\mathcal{s}}^{(j)}$ denote the true (before deletion) and estimated values within repeated estimation $\mathcal{s}$, respectively. The results are described in Table 7. As expected and in line with the other simulation results presented, the suggested augmentation approaches DART and DART-m show reduced bias although slightly increased variance compared to the IBM approach. This in turn causes the improved inference regarding the regression coefficients, both in terms of bias and coverage.
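These aggregation formulas translate directly into code; a sketch over synthetic replications follows. Note that the paper distinguishes two estimated values $\tilde{X}$ and $\hat{X}$; the sketch uses a single point estimate for both, under which the bias and RMSE formulas coincide exactly (the square root of a single squared deviation is its absolute value).

```python
import numpy as np

def imputation_accuracy(true_vals, est_vals):
    """Bias, variance, and RMSE of imputed values, aggregated as in the text:
    first over the missing entries of one data set, then over data sets.

    true_vals, est_vals: lists with one array of missing-value entries per
    replicated data set (arrays may differ in length across replications).
    """
    bias = np.mean([np.mean(np.abs(t - e)) for t, e in zip(true_vals, est_vals)])
    var = np.mean([np.mean((t - e) ** 2) for t, e in zip(true_vals, est_vals)])
    rmse = np.mean([np.mean(np.sqrt((t - e) ** 2)) for t, e in zip(true_vals, est_vals)])
    return bias, var, rmse

rng = np.random.default_rng(5)
# synthetic example: imputations equal the truth plus N(0, 0.3) noise
true_vals = [rng.normal(0.0, 1.0, size=rng.integers(50, 80)) for _ in range(100)]
est_vals = [t + rng.normal(0.0, 0.3, size=t.size) for t in true_vals]
bias, var, rmse = imputation_accuracy(true_vals, est_vals)
print(bias, var, rmse)
```

Averaging per data set first, as here, keeps data sets with few missing values from being swamped by those with many.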

Table 7. Comparison of prediction accuracy of conditioning variables X.

Bias, variance, and root mean square error are calculated as given in the text, for $j=2,3$, with $\# X_{\text{mis}}^{(j)}$ denoting the number of missing values per variable, $X_{\text{mis},k,\mathcal{s}}^{(j)}$ the $k$th missing value in variable $j$, and $\tilde{X}_{\text{mis},k,\mathcal{s}}^{(j)}$ and $\hat{X}_{\text{mis},k,\mathcal{s}}^{(j)}$ denoting the true (before deletion) and estimated values in repeated estimation $\mathcal{s}$, respectively.

To summarize, the simulation illustrates that the combination of data augmentation and sequential recursive partitioning offers a suitable solution for the treatment of missing covariates in the context of LRMs, both with regard to estimation efficiency and computational burden.

4. Empirical Illustration

In order to illustrate the usefulness of the suggested Bayesian data augmentation approach in empirical analysis, we provide exemplary applications using the scientific data use file of the German National Educational Panel Study: Starting Cohort Grade 9, doi: 10.5157/NEPS:SC4:10.0.0, see NEPS Network (2019), on mathematical competencies of ninth graders. Students of this cohort were surveyed in an institutional context. Data collection took place in schools in Germany between fall 2010 and winter 2010/2011, based on a stratified sampling of schools according to school type, see Aßmann et al. (Reference Aßmann, Steinhauer, Kiesl, Koch, Schönberger, Müller-Kuller, Rohwer, Rässler, Blossfeld, Blossfeld, Roßbach and von Maurice2011). Both factors, the institutional setting of schools in Germany as well as the stratified sampling approach, give reason to consider a differentiated hierarchical data structure.

We chose the mathematical competency domain as an example for latent variable modeling with person covariates. The relationship between mathematical competency and individual characteristics is thereby structured by the type of secondary schooling. Mathematical competency was assessed in the first survey wave. The corresponding test comprises four content areas: quantity, change and relationships, space and shape, and data and chance (Neumann et al., Reference Neumann, Duchhardt, Grüßing, Heinze, Knopp and Ehmke2013); a total of 15,629 ninth graders took the considered test. For an overview and further results on the mathematics test data see Duchhardt and Gerdes (Reference Duchhardt and Gerdes2013). As most of the items have low missing rates, the estimation within the empirical illustration is based on the likelihood involving observed values of Y only, and only students with a valid response to at least three mathematics test items are considered.Footnote 18 Of the $J=22$ tasks to be solved in the test, 20 items have a binary format and two are treated as ordinal items with four categories. In addition to the test data, we consider two clustering variables (schooltype and school) and student covariates. Merging the mathematics test data and all student information results in a final data set with 14,320 observations.
The available school type variable (Bayer et al., Reference Bayer, Goßmann and Bela2014) was transformed to cover four tracks of the German secondary education system: Hauptschule (HS; lower track), Realschule (RS; intermediate track), Gymnasium (GYM; academic or upper track), and a residual category (OTHER) for observations where a clear assignment to these tracks was not possible. With 37% of students, GYM is the modal track. The school identifier school assigns a unique number to each school and serves as a further clustering variable with a total of 532 schools. Table 8 provides descriptive statistics on the sample and the considered variables. The illustration is provided in the form of the following two model specifications.

Table 8. NEPS grade 9—descriptive statistics (complete case summary).

Absolute and relative counts (in parentheses) are reported.

educ. education, comp. compulsory, sec. secondary, inter. intermediate, voc. vocational, qual. qualification, inst. institution.

Table 9. NEPS grade 9, mathematical competencies—parameter estimates of model I.

$C=532$; $N=14320$; $N_{CC}=6748$; $J=22$. Median and standard deviation (in parentheses) of the posterior distribution are reported.

* 90% HDI; ** 95% HDI; *** 99% HDI. Runtime: 35.6 h.

Table 10. NEPS grade 9, mathematical competencies—relative effects for structural parameter estimates of model I.

$C=532$; $N=14320$; $N_{CC}=6748$; $J=22$. Median and standard deviation (in parentheses) of the posterior distribution are reported.

* 90% HDI; ** 95% HDI; *** 99% HDI.

Table 11. NEPS grade 9, mathematical competencies—relative effects for structural parameter estimates of model II.

$C=532$; $N=14320$; $N_{CC}=7708$; $J=22$. Median and standard deviation (in parentheses) of the posterior distribution are reported.

* 90% HDI; ** 95% HDI; *** 99% HDI.

Table 12. NEPS grade 9, mathematical competencies—structural parameter estimates of model II.

$C=532$; $N=14320$; $N_{CC}=7708$; $J=22$. Median and standard deviation (in parentheses) of the posterior distribution are reported.

* 90% HDI; ** 95% HDI; *** 99% HDI. Runtime: 62.8 h.

The first model specification considers a small set of background variables with different scales including cross terms, whereas the second model specification has an enlarged set of categorical background variables, illustrating that the suggested DART-m approach is feasible and efficient in terms of computational cost and statistical efficiency. For the first model specification (model I) we adapt a specification discussed by Passaretta and Skopek (Reference Passaretta and Skopek2021) to assess the role of schools in socioeconomic inequality of learning. Following a differential exposure approach, the relationship of mathematical competency is analyzed with regard to the student variables gender, parents’ socio-economic status (HISEI), school exposure (schoolexp), and age at time of assessment (agetest).Footnote 19 In line with the literature, we expect more school exposure and higher assessment age to be positively correlated with mathematical competence, whereas the (un)balancing effect of schools on competence is captured in terms of the cross terms of socio-economic status with age at testing and with school exposure. A positive effect for the considered cross terms would indicate that school experience accelerates competence more for students with higher socio-economic status. The total amount of missing data for the variables within this model specification ranges from moderate to substantial. Whereas the number of missing values in gender is negligible, about one fifth of the values are missing for HISEI. For agetest almost no missing values are present, whereas for school exposure the defining date of school entry was surveyed in the parental interview, resulting in a missing rate of 42.9%, see Table 8.
The ratio of students having complete background information is 47.1%, which corresponds to 6,748 observations. The second model specification (model II) considers an enlarged set of background variables and contains gender (binary), generation status (4 categories), grade final report card mathematics (6 categories), school year repeated (binary), computer in your home (3 categories), homepos room (binary), and highest parental education qualification (HCASMIN, 9 categories). We see substantial heterogeneity in the covariate HCASMIN between the school types. For example, 29.5% of the students in HS have parents in category 2 (basic vocational training above and beyond compulsory schooling) but only 3.2% of the students in GYM; conversely, for category 8 (completed traditional, academically orientated university education) the shares are only 3.1% of students in HS but 32.2% in GYM. Most of the variables have a negligible amount of missing values. However, over 40% of values are missing for the covariate HCASMIN, as this information was surveyed within the parental interview. Therefore the ratio of students with complete background information drops to 57.3%, i.e., only 7,708 complete case observations.

Figure 1. NEPS grade 9, Gaussian kernel density estimates for the set of conditional variances on person level $\sigma _g^2$ and school level $\upsilon _g^2$ and expected a posteriori estimates of scalar person parameter $\theta _i$ referring to mathematical competence in model I.

Figure 2. NEPS grade 9, Gaussian kernel density estimates for the set of conditional variances on person level $\sigma _g^2$ and school level $\upsilon _g^2$ and expected a posteriori estimates of scalar person parameter $\theta _i$ referring to mathematical competence in model II.

For each of the two models, estimates are based on 25,000 MCMC iterations, with a burn-in phase of 5,000 iterations found sufficient to mitigate the effects of initialization within the empirical analysis; see the supplementary material for corresponding results and further information on the convergence diagnostics and the assessment of the Monte Carlo error of the obtained point estimates.

Corresponding results for the two model specifications with regard to regression and conditional variance parameters are given in Table 9 for model I and Table 12 for model II, respectively. Tables 10 and 11 provide corresponding estimates of relative effects between school types.Footnote 20 These tables report the resulting estimates in terms of medians, standard deviations, and highest density intervals (HDI). The results regarding the structural relationships show clear school type specific differences in the distribution of competencies, see the upper panels of Figs. 1 and 2. The highest mean scores are consistently found for GYM, followed by the other school types RS, OTHER, and HS. Likewise, the conditional variances on the person and school level, $\sigma ^2_{g}$ and $\upsilon ^2_{g}$, differ across the types of secondary schooling. However, students’ idiosyncratic error terms, i.e., inter-individual differences not captured by the covariates, consistently contribute more to the variability in mathematical competency than school belonging across the different educational tracks, see the lower panels of Figs. 1 and 2.
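The posterior summaries reported in the tables (median, standard deviation, and the shortest interval containing a given posterior mass) can be sketched as follows. This is a generic illustration of such summaries from MCMC draws, not the paper's implementation; the simulated chain stands in for a parameter's draws:

```python
import numpy as np

def hdi(draws, mass=0.95):
    """Shortest interval containing `mass` of the sorted MCMC draws."""
    d = np.sort(np.asarray(draws, float))
    n = len(d)
    k = int(np.ceil(mass * n))           # number of draws inside the interval
    widths = d[k - 1:] - d[:n - k + 1]   # widths of all candidate intervals
    i = int(np.argmin(widths))           # index of the shortest one
    return d[i], d[i + k - 1]

def summarize(draws, burn_in=5000):
    """Median, SD, and 95% HDI after discarding burn-in iterations."""
    kept = np.asarray(draws, float)[burn_in:]
    lo, hi = hdi(kept, 0.95)
    return float(np.median(kept)), float(np.std(kept, ddof=1)), (lo, hi)

rng = np.random.default_rng(0)
chain = rng.normal(0.3, 0.1, size=25000)   # stand-in for one parameter's chain
med, sd, (lo, hi) = summarize(chain)
# an HDI excluding zero corresponds to a starred entry in the tables
print(not (lo <= 0.0 <= hi))               # → True
```

A parameter whose 95% HDI excludes zero would carry the ** flag in the table notes.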

Regarding covariate effects, the models indicate interactions with the grouping variable. Consider first the effects of the personal covariates used in model I. The negative effect of being female on mathematical competency (gender : 1) is stable across all school types, albeit at varying degrees. The effects of school exposure and age at testing are completely subsumed by school type, i.e., in ninth grade these variables have no effect beyond school type, in contrast to gender. This complements the findings from the literature discussing effects in primary schools, see Passaretta and Skopek (Reference Passaretta and Skopek2021).

Next, we consider the structural parameter estimates of model II. Again, we see the negative, though slightly reduced, effect of being female in all school types. Compared to students without a migration background, a first-generation migration background has a substantial negative (99% HDI not including zero) impact on mathematical competency across all school types. The negative effects also prevail for a second-generation migration background, while for third-generation migrants the negative effects are reduced (GYM and OTHER) or are no longer substantially different from zero (HS and RS). For the covariate grade mathematics in the previous year, where grade 1 (very good) is the reference category, a good result from the previous year has a negative effect on mathematical competence compared to a very good one, and the effect accelerates with worsening grades. This pattern can be observed across all school types, with the overall effect strongest in school type GYM. With regard to the covariate school year repeated, we also find differences across the school types: this variable has no impact for school types RS and OTHER, but a positive effect for school types HS and GYM. Not having one's own computer but sharing one with other family members is found to have no impact on the individual competence level across all school types, although we note that this relationship may have changed substantially since 2010. Likewise, having one's own room has no substantial effect given the considered set of covariates, except for school type RS. With regard to the variable HCASMIN, we find positive effects for higher HCASMIN levels for school type HS, while no effects substantially different from zero are found for the other school types. However, this variable further illustrates that the inspection of relative effects as defined in Eq. (10), with corresponding results for model specification II given in Table 11, is important to gauge differences across school types correctly. The relative effects between the different school types for the variable HCASMIN show no substantial differences. In this regard, the findings relate to the school type specific distribution of HCASMIN, compare Table 8. For this model, we also calculated within group correlations, see the bottom of Table 12. Although the groups show different conditional variances, the estimates show no evidence of differing within group correlations.
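In a random intercept structure, the within group correlation can be read as an intra-class correlation, i.e., the share of total conditional variance attributable to the school level. A minimal sketch, with hypothetical variance values that are not the paper's estimates:

```python
# Within-group (intra-class) correlation implied by the group-specific
# conditional variances: school-level variance as a share of the total.
def within_group_corr(upsilon2, sigma2):
    """upsilon2: school-level variance; sigma2: person-level variance."""
    return upsilon2 / (upsilon2 + sigma2)

# hypothetical school-type-specific variance components
print(round(within_group_corr(upsilon2=0.2, sigma2=0.8), 2))  # → 0.2
```

Groups with different $\sigma _g^2$ and $\upsilon _g^2$ can still share the same within group correlation whenever the ratio of the two components is stable, which is the pattern reported at the bottom of Table 12.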

While these effects are in line with results from the literature, the suggested Bayesian estimation approach allows for effectively incorporating all available information, i.e., all information and model features with regard to the measurement model in terms of discrimination and difficulty parameters, intra-class correlation, and school type heterogeneity are reflected within the corresponding full conditional distributions. Given this, the results document a clear shift in means and covariate effects as well as unequal variances of the school type specific density curves. The results of these two empirical applications extend the findings of our simulation studies from Sect. 3.

5. Conclusion

To handle missing values, this paper discusses a Bayesian estimation approach making use of the device of data augmentation. The missing values in conditioning variables are hence considered along with the underlying continuous outcomes, the model parameters, and the latent traits or hierarchical structures in the MCMC sampling scheme operationalizing Bayesian estimation. The DA device provides the estimation of all these quantities in a statistically efficient one-step procedure. The uncertainty stemming from partially missing covariate data is directly incorporated into parameter estimation. At every iteration of the algorithm, an imputed version of the covariate data is used to sample from the set of full conditional posterior distributions. Vice versa, the iteratively updated parameter values resulting from posterior sampling can in turn be considered within the full conditional distributions of missing values. Thus, compared to existing methods, the novel approach carries out parameter estimation while handling missing values in background variables simultaneously. Taken together, such an approach offers several advantages. First, it is statistically efficient in the sense that values for the latent trait, item characteristics, and missing values of background variables are all provided at once; second, all possible sources of uncertainty are taken into account; and third, the approach is especially well suited to deal with latent variables corresponding to competencies or arising from hierarchical structures, where the mutual dependence can be directly handled in terms of the full conditional distributions inserted into the sampler.

These advantages show in terms of statistical efficiency, and the computational burden is eased when latent quantities in the sense of sufficient statistics can be used to specify the full conditional distributions of missing values. An empirical example using the NEPS further demonstrates the broad applicability of the approach to a wide range of social science topics. Besides permitting the estimation of competency scores and their correlations with context variables purified from measurement error, any number of completed data sets arising from the MCMC output may also serve as multiple imputations of the missing background information. Future research may investigate in detail the possibilities of performing nested and non-nested model comparison via Bayes factors based on the marginal data likelihood. Alternative models for the full conditional distributions of missing values, or automated variable selection based on the spike-and-slab prior specification, see Ročková and George (Reference Ročková and George2018), to determine which variables have group-specific influence and which have homogeneous influence across the different groups, could also be considered.
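If completed data sets from the MCMC output are used as multiple imputations in downstream analyses, their estimates could be pooled with the usual combining rules. A minimal sketch of Rubin's rules, a possible downstream use rather than part of the estimation approach itself; the numbers are illustrative:

```python
import numpy as np

def pool(estimates, variances):
    """Rubin's rules: pooled point estimate and total variance
    from M completed-data analyses of one parameter."""
    q = np.asarray(estimates, float)   # per-imputation point estimates
    u = np.asarray(variances, float)   # per-imputation sampling variances
    m = len(q)
    q_bar = q.mean()                   # pooled point estimate
    u_bar = u.mean()                   # within-imputation variance
    b = q.var(ddof=1)                  # between-imputation variance
    t = u_bar + (1 + 1 / m) * b        # total variance
    return q_bar, t

# illustrative analyses of M = 3 completed data sets
q_bar, t = pool([0.50, 0.54, 0.46], [0.01, 0.01, 0.01])
print(round(q_bar, 2))  # → 0.5
```

The between-imputation component $b$ is what carries the uncertainty due to the missing background information into the pooled standard error.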

Acknowledgements

The authors thank David Kaplan, Roman Liesenfeld, and participants of the statistics seminars of the University of Cologne and the Leibniz Institute for Educational Trajectories, as well as participants of the workshop on quality aspects of machine learning organized by the Statistics Network Bavaria, for helpful comments and suggestions that helped to improve the manuscript, in addition to the comments received from three reviewers and the associate editor. Financial support by the Deutsche Forschungsgemeinschaft (DFG) within priority programme SPP 1646 under grants AS 368/3-1 and AS 368/3-2 is gratefully acknowledged. Further, we would like to thank the Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities for the provision of the system resources with which the simulation studies were performed. This paper uses data from the German National Educational Panel Study (NEPS), see Blossfeld and Roßbach (Reference Blossfeld and Roßbach2019). The NEPS is carried out by the Leibniz Institute for Educational Trajectories (LIfBi, Germany) in cooperation with a nationwide network.

Funding Information

Open Access funding enabled and organized by Projekt DEAL.

Footnotes

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s11336-022-09888-0.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Thereby DA facilitates efficient sampling from the posterior distribution via augmenting the posterior distribution with quantities not necessarily being of primary interest, but possibly functioning as sufficient statistics and thus enabling and operationalizing Rao-Blackwellization. As a byproduct DA enables smoother sampling from the posterior distribution of the quantities of primary interest as either closed form sampling becomes available or the construction of an importance or enveloping density is considerably simplified, see Carlin and Louis (Reference Carlin and Louis1998).

2 Note that complete cases analysis, which excludes from estimation all observations having a missing value on any covariate, may, beside the inefficient use of sample information in situations with high rates of missing values, result in biased estimation, especially when observations are missing at random (Little & Rubin, Reference Little and Rubin2002, pp. 41–44). Only in missing completely at random situations, possibly related to multiple matrix designs, may estimation stay unbiased.

3 When performing maximum likelihood based estimation, typically implemented in terms of an expectation maximization algorithm, only point estimates are directly available, and extra calculations are required to obtain corresponding uncertainty measures.

4 This may include information in terms of prevailing missing patterns, where Muthén et al. (Reference Muthén, Kaplan and Hollis1987) consider conditioning on missing data patterns for improved estimation.

5 The considered model framework incorporates an enriched multilevel structure compared to Fox and Glas (Reference Fox and Glas2001), whereas Aßmann et al. (Reference Aßmann, Gaasch, Pohl and Carstensen2015) consider the case of Bayesian estimation for the homogeneous two-parameter normal ogive LRM for binary outcomes only.

6 In addition, the computational cost of the approach discussed by Neal and Kypraios (Reference Neal and Kypraios2015) grows exponentially with the total number of observations, whereas our MCMC based approach is linearly related to the number of observations.

7 The sequential imputation approach of Kong et al. (Reference Kong, Liu and Wong1994) resembles, via the use of predictive distributions, an importance sampling approach, while our approach based on full conditional distributions resembles the efficient importance sampling approach discussed by Richard and Zhang (Reference Richard and Zhang2007).

8 The problem has been discussed extensively under the term incidental parameter problem in the statistics literature, see Lancaster (Reference Lancaster2000) for a survey.

9 Note that extensions in the form of random coefficients within groups or homogeneous coefficients across groups, rendering the latent regression function as $\theta _i=\omega _{S_i}+X_i\gamma _{iL_i} +W_i\lambda +\epsilon _i$, with $\gamma _{iL_i}$ following a multivariate normal distribution with expectation $\mu _{\gamma _{L_i}}$ and covariance $\Omega _{L_i}$, and $W_i$ denoting a set of covariates with homogeneous influence, are also possible, as discussed in Aßmann et al. (Reference Aßmann, Steinhauer, Kiesl, Koch, Schönberger, Müller-Kuller, Rohwer, Rässler, Blossfeld, Blossfeld, Roßbach and von Maurice2011).

10 This extends also towards missing-by-design values in item responses. Sampling of missing-by-design values $Y_{\text{mis}}$ is implied by Eq. (1) in the paper, where $y_{ij}^*$ then follows a normal distribution not subject to truncation, as the truncation is implied by the observed values in $Y$ only. The completed $Y^*$, and hence the completed $Y$, might possibly be helpful for sampling values in $X_{\text{mis}}$, as pointed out by von Hippel (2007). However, as illustrated and implied by the considered model framework in the paper, $\theta$ takes the role of a sufficient statistic also for $Y^*$, and thus consideration of sampled $Y_{\text{mis}}$ values within the imputation of $X_{\text{mis}}$ does not necessarily result in further gains in terms of statistical efficiency. Given this, we point out that the suggested approach should be applied to data situations where at least some elements of $y_i$ are observed for each individual $i=1,\ldots,N$. Situations where several competence domains are investigated can be addressed via multivariate extensions of the suggested modeling framework.

11 Consideration of sufficient statistics may also serve as a guiding principle for model specification.

12 Note that sampling from $f(X_{\text{mis}}\,|\,\theta, Y^*, \omega, \psi, Y, X_{\text{obs}}, S, L)$ will be based on sequential iterative sampling from the set of univariate full conditional distributions for each variable $X_{\text{mis}}^{(p)}$, $p=1,\ldots,P$; see also Sect. 2.2 for details.
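As a rough illustration of this sequential scheme, the following sketch cycles over variables and replaces each missing entry by a draw from a univariate full conditional given the current completed data. The function `draw_conditional` is a hypothetical stand-in: here it simply resamples from the column's non-missing values, whereas the paper's approach would draw from the CART-BB full conditional.

```python
import numpy as np

def sample_missing_sequentially(X, miss_mask, draw_conditional, rng, n_sweeps=2):
    """Gibbs-type sweeps: for each variable p = 1, ..., P, redraw its
    missing entries from a univariate full conditional, given the
    current completed data matrix."""
    X = X.copy()
    for _ in range(n_sweeps):
        for p in range(X.shape[1]):
            for i in np.flatnonzero(miss_mask[:, p]):
                X[i, p] = draw_conditional(p, X, i, rng)
    return X

def draw_conditional(p, X, i, rng):
    # Hypothetical stand-in conditional: draw from the currently
    # non-missing values of column p (the observed donor pool).
    donors = X[~np.isnan(X[:, p]), p]
    return rng.choice(donors)

rng = np.random.default_rng(1)
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan]])
Xc = sample_missing_sequentially(X, np.isnan(X), draw_conditional, rng)
```

Because `miss_mask` is fixed before the sweeps, the same entries are redrawn in every sweep, while previously imputed values enter the donor pools of subsequent updates.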

13 In combination with sampling from the empirical cumulative distribution function, i.e., sampling from the range of observed values only, this ensures that the CART-BB approach towards full conditional distributions involves only proper prior distributions, thus ensuring the existence of the integrating constant of the joint posterior distribution. Furthermore, the existence of the joint posterior distribution and the corresponding integrating constant as implied by Eq. (8) is directly ensured in case the missing values relate to variables with finite sample spaces. In case the missing values relate to variables with theoretically possible countably or uncountably infinite sample spaces, the CART-BB algorithm constructs the empirical cumulative distribution function implied by the partition obtained from measures of homogeneity, e.g., the variance, and incorporates the restriction to the range of observed values as a modeling assumption. Thus, the suggested approach may be most useful in situations with many categorical variables, as in our empirical applications. For applications where the restriction to the range of observed values raises concerns, the suggested CART-BB approach could be applied to the set of categorical variables only, and alternative modeling approaches may be considered for the missing values of variables with continuous infinite support.

14 The proposed Bayesian analysis and its MCMC implementation are further suited to incorporate information arising from weighting factors. In the case of non-stochastic weights, e.g., design weights, the variables entering the modeling can be transformed accordingly, whereas in the case of stochastic weights, e.g., non-response adjusted weights typically handled via replication weights, the variables can be transformed within each MCMC iteration.

15 In case interaction terms are also considered, $X_{\text{com}}^{(\setminus p)}$ also subsumes all columns referring to cross terms not involving variable $p$. Cross terms involving variable $p$ are hence not subject to modeling but are updated each time an underlying variable has been updated.

16 Sampling from the empirical distribution function via the Bayesian bootstrap corresponds to running the data generating process of a parametric imputation model, with the involved parameters being sampled from their estimated distributions in order to fully account for the uncertainty of the data generating process, i.e., the uncertainty about how the empirical cumulative distribution function would look if the missing values were observed.
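A minimal sketch of a Bayesian bootstrap draw in the sense of Rubin (1981), assuming the observed values of a single variable serve as the donor pool: posterior weights over the donors are drawn from a flat Dirichlet distribution, and imputations are then sampled from the reweighted empirical distribution, i.e., from the range of observed values only.

```python
import numpy as np

def bayesian_bootstrap_draw(donors, size, rng):
    """One draw of the empirical CDF: Dirichlet(1, ..., 1) weights over
    the observed donor values, then imputations sampled from the
    reweighted observed support."""
    weights = rng.dirichlet(np.ones(len(donors)))
    return rng.choice(donors, size=size, p=weights)

rng = np.random.default_rng(0)
donors = np.array([1, 2, 2, 3, 5])   # observed values of one variable
imputed = bayesian_bootstrap_draw(donors, size=4, rng=rng)
```

Drawing fresh Dirichlet weights in every MCMC iteration propagates the uncertainty about the empirical cumulative distribution function into the imputations, mirroring the parametric analogue described above.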

17 The IBM benchmark is hence in line with a typical multiple imputation strategy, although no combining rules are required as sampling is performed within the MCMC sampler. This further ensures that the comparison of the different approaches is conditional on the same level of numerical precision as implied by the number of MCMC iterations after burn-in.

18 Ten items have missing rates of less than 2%, another eight items less than 5%, three items fall in the range from 5% to 10%, and only one item has a missing rate of 20%.

19 Regarding socio-economic status, there are many operationalizations implemented in the NEPS. In line with recent analyses of the PISA data (OECD, 2013a, p. 132), we took the highest occupational level of the parents as measured by the index ISEI-08 (Ganzeboom, 2010) and calculated a variable HISEI as the higher ISEI-08 score of either the student's mother or the student's father, or the only available score. To calibrate the scale of the regression coefficient associated with HISEI, the original values are divided by 100. HISEI ranges from 1.16 to 8.90, with higher values indicating a higher level of occupational status. This variable in particular shows strong differences between the school types, as can be seen in Table 8. Age at assessment and school exposure are defined as the difference between the date of assessment and the date of birth or the date of school entry, respectively.

20 Results on item characteristics (discrimination, difficulty, and cut-off parameters) are available within the supplementary material.

References

Adams, R. J., Wilson, M., & Wu, M. (1997). Multilevel item response models: An approach to errors in variables regression. Journal of Educational and Behavioral Statistics, 22(1), 47–76.
Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics, 17(3), 251–269.
Albert, J. H., & Chib, S. (1997). Bayesian methods for cumulative, sequential and two-step ordinal data regression models. Bowling Green: Department of Mathematics and Statistics, Bowling Green State University.
Allen, N. L., Carlson, J. E., & Zelenak, C. A. (1999). The NAEP 1996 technical report (NCES-1999-452).
Aßmann, C., & Boysen-Hogrefe, J. (2011). A Bayesian approach to model-based clustering for binary panel probit models. Computational Statistics & Data Analysis, 55, 261–279.
Aßmann, C., Gaasch, C., Pohl, S., & Carstensen, C. H. (2015). Bayesian estimation in IRT models with missing values in background variables. Psychological Test and Assessment Modeling, 57(4), 595–618.
Aßmann, C., & Preising, M. (2020). Bayesian estimation and model comparison for linear dynamic panel models with missing values. Australian & New Zealand Journal of Statistics, 62(4), 536–557. https://doi.org/10.1111/anzs.12316
Aßmann, C., Steinhauer, H. W., Kiesl, H., Koch, S., Schönberger, B., Müller-Kuller, A., Rohwer, G., Rässler, S., & Blossfeld, H.-P. (2011). Sampling designs of the National Educational Panel Study: Challenges and solutions. In Blossfeld, H.-P., Roßbach, H.-G., & von Maurice, J. (Eds.), Education as a lifelong process. The German National Educational Panel Study (NEPS) (Zeitschrift für Erziehungswissenschaft, Special Issue 14, pp. 51–65). Wiesbaden: VS Verlag für Sozialwissenschaften.
Bayer, M., Goßmann, F., & Bela, D. (2014). NEPS technical report: Generated school type variable t723080_g1 in starting cohorts 3 and 4 (NEPS Working Paper No. 46). University of Bamberg, Leibniz Institute for Educational Trajectories, National Educational Panel Study.
Blackwell, M., Honaker, J., & King, G. (2017). A unified approach to measurement error and missing data: Details and extensions. Sociological Methods & Research, 46(3), 342–369. https://doi.org/10.1177/0049124115589052
Blossfeld, H.-P., & Roßbach, H.-G. (Eds.). (2019). Education as a lifelong process. The German National Educational Panel Study (NEPS): Edition ZfE. New York: Springer VS.
Burgette, L. F., & Reiter, J. P. (2010). Multiple imputation for missing data via sequential regression trees. American Journal of Epidemiology, 172(9), 1070–1076.
Burstein, L. (1980). The analysis of multilevel data in educational research and evaluation. Review of Research in Education, 8, 158–233.
Carlin, B. P., & Louis, T. A. (1998). Bayes and empirical Bayes methods for data analysis (Monographs on Statistics and Applied Probability, Vol. 69). Boca Raton: Chapman & Hall/CRC.
Carlsson, M., Dahl, G. B., Öckert, B., & Rooth, D.-O. (2015). The effect of schooling on cognitive skills. The Review of Economics and Statistics, 97(3), 533–547. https://doi.org/10.1162/REST_a_00501
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432), 1313–1321.
Chib, S., & Jeliazkov, I. (2001). Marginal likelihood from the Metropolis–Hastings output. Journal of the American Statistical Association, 96(453), 270–281.
Cornelissen, T., & Dustmann, C. (2019). Early school exposure, test scores, and noncognitive outcomes. American Economic Journal: Economic Policy, 11(2), 35–63. https://doi.org/10.1257/pol.20170641
Doove, L. L., van Buuren, S., & Dusseldorp, E. (2014). Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, 92–104.
Duchhardt, C., & Gerdes, A. (2013). NEPS technical report for mathematics—Scaling results of starting cohort 4 in ninth grade (NEPS Working Paper No. 22). University of Bamberg, Leibniz Institute for Educational Trajectories, National Educational Panel Study.
Edwards, M. C. (2010). A Markov chain Monte Carlo approach to confirmatory item factor analysis. Psychometrika, 75(3), 474–497. https://doi.org/10.1007/s11336-010-9161-9
Embretson, S. E., & Reise, S. (2000). Item response theory for psychologists. Mahwah: Lawrence Erlbaum Associates.
Erler, N. S., Rizopoulos, D., van Rosmalen, J., Jaddoe, V. W. V., Franco, O. H., & Lesaffre, E. M. E. H. (2016). Dealing with missing covariates in epidemiologic studies: A comparison between multiple imputation and a full Bayesian approach. Statistics in Medicine, 35(17), 2955–2974.
Fox, J.-P. (2005). Multilevel IRT using dichotomous and polytomous response data. British Journal of Mathematical and Statistical Psychology, 58(1), 145–172.
Fox, J.-P. (2010). Bayesian item response modeling: Theory and applications. New York: Springer.
Fox, J.-P., & Glas, C. A. W. (2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66(2), 271–288.
Ganzeboom, H. B. G. (2010). A new international socio-economic index [ISEI] of occupational status for the international standard classification of occupation 2008 [ISCO-08] constructed with data from the ISSP 2002–2007; with an analysis of quality of occupational measurement in ISSP. In Annual Conference of the International Social Survey Programme, Lisbon, May 1, 2010.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis (3rd ed.). Boca Raton: CRC Press.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In Bayesian Statistics 4 (pp. 169–193). Oxford: Oxford University Press.
Goldstein, H., Carpenter, J. R., & Browne, W. J. (2014). Fitting multilevel multivariate models with missing data in responses and covariates that may include interactions and non-linear terms. Journal of the Royal Statistical Society: Series A (Statistics in Society), 177(2), 553–564.
Greene, W. (2004a). The behaviour of the maximum likelihood estimator of limited dependent variable models in the presence of fixed effects. The Econometrics Journal, 7(1), 98–119.
Greene, W. (2004b). Convenient estimators for the panel probit model: Further results. Empirical Economics, 29(1), 21–47. https://doi.org/10.1007/s00181-003-0187-z
Grund, S., Lüdtke, O., & Robitzsch, A. (2018). Multiple imputation of missing data at level 2: A comparison of fully conditional and joint modeling in multilevel designs. Journal of Educational and Behavioral Statistics, 43(3), 316–353. https://doi.org/10.3102/1076998617738087
Grund, S., Lüdtke, O., & Robitzsch, A. (2020). On the treatment of missing data in background questionnaires in educational large-scale assessments: An evaluation of different procedures. Journal of Educational and Behavioral Statistics. https://doi.org/10.3102/1076998620959058
Imai, K., & van Dyk, D. A. (2005). A Bayesian analysis of the multinomial probit model using marginal data augmentation. Journal of Econometrics, 124(2), 311–334.
Johnson, M. S., & Jenkins, F. (2005). A Bayesian hierarchical model for large-scale educational surveys: An application to the national assessment of educational progress (ETS RR-04-38). Princeton: Educational Testing Service.
Jones, M. P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association, 91(433), 222–230.
Kaplan, D., & Su, D. (2018). On imputation for planned missing data in context questionnaires using plausible values: A comparison of three designs. Large-scale Assessments in Education, 6(1), 6. https://doi.org/10.1186/s40536-018-0059-9
Köhler, C., Pohl, S., & Carstensen, C. H. (2015). Taking the missing propensity into account when estimating competence scores: Evaluation of item response theory models for nonignorable omissions. Educational and Psychological Measurement, 75(5), 850–874.
Kong, A., Liu, J. S., & Wong, W. H. (1994). Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425), 278–288.
Koskinen, J. H., Robins, G. L., & Pattison, P. E. (2010). Analysing exponential random graph (p-star) models with missing data using Bayesian data augmentation. Statistical Methodology, 7(3), 366–384.
Lancaster, T. (2000). The incidental parameter problem since 1948. Journal of Econometrics, 95(2), 391–413.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). Hoboken: Wiley.
Liu, M., Taylor, J. M. G., & Belin, T. R. (2000). Multiple imputation and posterior simulation for multivariate missing data in longitudinal studies. Biometrics, 56(4), 1153–1157.
Lord, F. M. (1952). A theory of test scores (Psychometric Monograph No. 7). Richmond: Psychometric Corporation.
Lord, F. M. (1953). An application of confidence intervals and of maximum likelihood to the estimation of an examinee's ability. Psychometrika, 18(1), 57–75.
Maddala, G. S. (1983). Limited-dependent and qualitative variables in econometrics. Cambridge: Cambridge University Press.
McKelvey, R., & Zavoina, W. (1975). A statistical model for the analysis of ordered level dependent variables. Journal of Mathematical Sociology, 4(1), 103–120.
Mislevy, R. J. (1985). Estimation of latent group effects. Journal of the American Statistical Association, 80(392), 993–997.
Mullis, I. V. S., & Martin, M. O. (2013). TIMSS 2015 assessment frameworks. Chestnut Hill: TIMSS & PIRLS International Study Center.
Muthén, B., Kaplan, D., & Hollis, M. (1987). On structural equation modeling with data that are not missing completely at random. Psychometrika, 52(3), 431–462.
Muthén, B. O. (1979). A structural probit model with latent variables. Journal of the American Statistical Association, 74(368), 807–811.
Muthén, B. O. (1989). Latent variable modeling in heterogeneous populations. Psychometrika, 54(4), 557–585.
Muthén, B. O., & Christoffersson, A. (1981). Simultaneous factor analysis of dichotomous variables in several groups. Psychometrika, 46(4), 407–419.
Neal, P., & Kypraios, T. (2015). Exact Bayesian inference via data augmentation. Statistics and Computing, 25(2), 333–347.
NEPS Network. (2019). German National Educational Panel Study, scientific use file of starting cohort grade 9. Leibniz Institute for Educational Trajectories (LIfBi), Bamberg. https://doi.org/10.5157/NEPS:SC4:10.0.0
Neumann, I., Duchhardt, C., Grüßing, M., Heinze, A., Knopp, E., & Ehmke, T. (2013). Modeling and assessing mathematical competence over the lifespan. Journal for Educational Research Online, 5(2), 80–109.
Nieminen, P., Lehtiniemi, H., Vähäkangas, K., Huusko, A., & Rautio, A. (2013). Standardised regression coefficient as an effect size index in summarising findings in epidemiological studies. Epidemiology Biostatistics and Public Health, 10(4), 1–16.
OECD. (2013a). PISA 2012 results: Excellence through equity: Giving every student the chance to succeed (Volume II). Paris: OECD Publishing.
OECD. (2013b). Technical report of the survey of adult skills (PIAAC). Paris: OECD Publishing.
OECD. (2014). PISA 2012 technical report. Paris: OECD Publishing.
Passaretta, G., & Skopek, J. (2021). Does schooling decrease socioeconomic inequality in early achievement? A differential exposure approach. American Sociological Review, 86(6), 1017–1042. https://doi.org/10.1177/00031224211049188
Pohl, S., Gräfe, L., & Rose, N. (2014). Dealing with omitted and not-reached items in competence tests: Evaluating approaches accounting for missing responses in item response theory models. Educational and Psychological Measurement, 74(3), 423–452.
Richard, J. F., & Zhang, W. (2007). Efficient high-dimensional importance sampling. Journal of Econometrics, 141(2), 1385–1411.
Rijmen, F., Tuerlinckx, F., De Boeck, P., & Kuppens, P. (2003). A nonlinear mixed model framework for item response theory. Psychological Methods, 8(2), 185–205.
Robert, C. P., & Casella, G. (2004). Monte Carlo statistical methods (2nd ed.). New York: Springer.
Ročková, V., & George, E. I. (2018). The spike-and-slab lasso. Journal of the American Statistical Association, 113(521), 431–444.
Rubin, D. B. (1981). The Bayesian bootstrap. The Annals of Statistics, 9(1), 130–134.
Rutkowski, L. (2011). The impact of missing background data on subpopulation estimation. Journal of Educational Measurement, 48(3), 293–312.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometric Monograph No. 17). Richmond, VA: Psychometric Corporation.
Si, Y., & Reiter, J. P. (2013). Nonparametric Bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys. Journal of Educational and Behavioral Statistics, 38(5), 499–521.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398), 528–549.
Therneau, T., & Atkinson, B. (2018). rpart: Recursive partitioning and regression trees [Computer software manual]. https://CRAN.R-project.org/package=rpart (R package version 4.1-13).
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P.-C. (2021). Rank-normalization, folding, and localization: An improved $\widehat{R}$ for assessing convergence of MCMC (with discussion). Bayesian Analysis, 16(2), 667–718.
von Hippel, P. (2007). Regression with missing Ys: An improved strategy for analyzing multiply imputed data. Sociological Methodology, 37, 83–117.
Wilson, M., & De Boeck, P. (2004). Explanatory item response models. New York, NY: Springer.
Zwinderman, A. H. (1991). A generalized Rasch model for manifest predictors. Psychometrika, 56(4), 589–600.
Adams, R. J., Wilson, M., & Wu, M. (1997). Multilevel item response models: An approach to errors in variables regression. Journal of Educational and Behavioral Statistics, 22(1), 4776.CrossRefGoogle Scholar
Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics, 17(3), 251269.CrossRefGoogle Scholar
Albert, J. H., & Chib, S. (1997). Bayesian methods for cumulative, sequential and two-step ordinal data regression models. Bowling Greene: Department of Mathematics and Statistics, Bowling Greene State University.Google Scholar
Allen, N. L., Carlson, J. E., & Zelenak, C. A. (1999). The NAEP 1996 technical report (NCES-1999-452).Google Scholar
Aßmann, C., & Boysen-Hogrefe, J. (2011). A Bayesian approach to model-based clustering for binary panel probit models. Computational Statistics & Data Analysis, 55, 261279.CrossRefGoogle Scholar
Aßmann, C., Gaasch, C., Pohl, S., & Carstensen, C. H. (2015). Bayesian estimation in IRT models with missing values in background variables. Psychological Test and Assessment Modeling, 57(4), 595618.Google Scholar
Aßmann, C., & Preising, M. (2020). Bayesian estimation and model comparison for linear dynamic panel models with missing values. Australian & New Zealand Journal of Statistics, 62(4), 536557. https://doi.org/10.1111/anzs.12316CrossRefGoogle Scholar
Aßmann, C., Steinhauer, H. W., Kiesl, H., Koch, S., Schönberger, B., Müller-Kuller, A., Rohwer, G., Rässler, S., & Blossfeld, H.-P. (2011). Sampling designs of the national educational panel study: Challenges and solutions Zeitschrift für Erziehungswissenschaften Special Issue 14. In Blossfeld, H.-P., Roßbach, H.-G., & von Maurice, J. (Eds.), Education as a lifelong process. The German national educational panel study (NEPS) (pp. 5165). Wiesbaden: VS Verlag für Sozialwissenschaften.Google Scholar
Bayer, M., Goßmann, F., & Bela, D. (2014). NEPS technical report: Generated school type variable t723080_g1 in starting cohorts 3 and 4 (NEPS working paper no. 46). University of Bamberg, Leibniz Institute for Educational Trajectories, National Educational Panel Study.Google Scholar
Blackwell, M., Honaker, J., & King, G. (2017). A unified approach to measurement error and missing data: Details and extensions. Sociological Methods & Research, 46(3), 342369. https://doi.org/10.1177/0049124115589052CrossRefGoogle Scholar
Blossfeld, H.-P., & Roßbach, H.-G. (Eds.). (2019). Education as a lifelong process. The German National Educational Panel Study (NEPS): Edition ZfENew York: Springer VS.CrossRefGoogle Scholar
Burgette, L. F., & Reiter, J. P. (2010). Multiple imputation for missing data via sequential regression trees. American Journal of Epidemiology, 172(9), 10701076.CrossRefGoogle ScholarPubMed
Burstein, L. (1980). The analysis of multilevel data in educational research and evaluation. Review of Research in Education, 8, 158233.Google Scholar
Carlin, B. P., & Louis, T. A. (1998). Bayes and empirical Bayes methods for data analysis Monographs on statistics and applied probability (Vol. 69). Boca Raton: Chapman & Hall/CRC.Google Scholar
Carlsson, M., Dahl, G. B., Öckert, B., & Rooth, D.-O. (2015). The effect of schooling on cognitive skills. The Review of Economics and Statistics, 97(3), 533547. https://doi.org/10.1162/REST_a_00501CrossRefGoogle Scholar
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432), 13131321.CrossRefGoogle Scholar
Chib, S., & Jeliazkov, I. (2001). Marginal likelihood from the metropolis–hastings output. Journal of the American Statistical Association, 96(453), 270281.CrossRefGoogle Scholar
Cornelissen, T., & Dustmann, C. (2019). Early school exposure, test scores, and noncognitive outcomes. American Economic Journal: Economic Policy, 11(2), 3563. https://doi.org/10.1257/pol.20170641Google Scholar
Doove, L. L., van Buuren, S., & Dusseldorp, E. (2014). Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, 92104.CrossRefGoogle Scholar
Duchhardt, C. & Gerdes, A. (2013). NEPS technical report for mathematics—Scaling results of starting cohort 4 in ninth grade (NEPS working paper no. 22). University of Bamberg, Leibniz Institute for Educational Trajectories, National Educational Panel Study.Google Scholar
Edwards, M. C. (2010). A Markov chain Monte Carlo approach to confirmatory item factor analysis. Psychometrika, 75(3), 474497. https://doi.org/10.1007/s11336-010-9161-9CrossRefGoogle Scholar
Embretson, S. E., & Reise, S. (2000). Item response theory for psychologists. Mahwah: Lawrence Erlbaum Associates.Google Scholar
Erler, N. S., Rizopoulos, D., van Rosmalen, J., Jaddoe, V. W. V., Franco, O. H., & Lesaffre, E. M. E. H. (2016). Dealing with missing covariates in epidemiologic studies: A comparison between multiple imputation and a full Bayesian approach. Statistics in Medicine, 35(17), 29552974.CrossRefGoogle Scholar
Fox, J.-P. (2005). Multilevel IRT using dichotomous and polytomous response data. British Journal of Mathematical and Statistical Psychology, 58(1), 145172.CrossRefGoogle ScholarPubMed
Fox, J.-P. (2010). Bayesian item response modeling: Theory and applications. New York: Springer.CrossRefGoogle Scholar
Fox, J.-P., & Glas, C. A. W. (2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66(2), 271288.CrossRefGoogle Scholar
Ganzeboom, H. B. G. (2010). A new international socio-economic index [ISEI] of occupational staus for the international standard classification of occupation 2008 [ISCO-08] constructed with data from the ISSP 2002–2007; with an analysis of quality of occupational measurement in ISSP. In Annual conference of international social survey programme, Lisbon, May 1 2010.Google Scholar
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis. (3rd ed.). Boca Raton: CRC Press.CrossRefGoogle Scholar
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments, Bayesian statistics 4 (pp. 169193). Oxford: Oxford University Press.Google Scholar
Goldstein, H., Carpenter, J. R., & Browne, W. J. (2014). Fitting multilevel multivariate models with missing data in responses and covariates that may include interactions and non-linear terms. Journal of the Royal Statistical Society: Series A (Statistics in Society), 177(2), 553564.CrossRefGoogle Scholar
Greene, W. (2004a). The behaviour of the maximum likelihood estimator of limited dependent variable models in the presence of fixed effects. The Econometrics Journal, 7(1), 98119.CrossRefGoogle Scholar
Greene, W. (2004b). Convenient estimators for the panel probit model: Further results. Empirical Economics, 29(1), 2147. https://doi.org/10.1007/s00181-003-0187-z CrossRefGoogle Scholar
Grund, S., Lüdtke, O., & Robitzsch, A. (2018). Multiple imputation of missing data at level 2: A comparison of fully conditional and joint modeling in multilevel designs. Journal of Educational and Behavioral Statistics, 43(3), 316353. https://doi.org/10.3102/1076998617738087CrossRefGoogle Scholar
Grund, S., Lüdtke, O., & Robitzsch, A. (2020). On the treatment of missing data in background questionnaires in educational large-scale assessments: An evaluation of different procedures. Journal of Educational and Behavioral Statistics. https://doi.org/10.3102/1076998620959058Google Scholar
Imai, K., & van Dyk, D. A. (2005). A Bayesian analysis of the multinomial probit model using marginal data augmentation. Journal of Econometrics, 124(2), 311–334.
Johnson, M. S., & Jenkins, F. (2005). A Bayesian hierarchical model for large-scale educational surveys: An application to the national assessment of educational progress (ETS RR-04-38). Princeton: Educational Testing Service.
Jones, M. P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association, 91(433), 222–230.
Kaplan, D., & Su, D. (2018). On imputation for planned missing data in context questionnaires using plausible values: A comparison of three designs. Large-scale Assessments in Education, 6(1), 6. https://doi.org/10.1186/s40536-018-0059-9
Köhler, C., Pohl, S., & Carstensen, C. H. (2015). Taking the missing propensity into account when estimating competence scores: Evaluation of item response theory models for nonignorable omissions. Educational and Psychological Measurement, 75(5), 850–874.
Kong, A., Liu, J. S., & Wong, W. H. (1994). Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425), 278–288.
Koskinen, J. H., Robins, G. L., & Pattison, P. E. (2010). Analysing exponential random graph (p-star) models with missing data using Bayesian data augmentation. Statistical Methodology, 7(3), 366–384.
Lancaster, T. (2000). The incidental parameter problem since 1948. Journal of Econometrics, 95(2), 391–413.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). Hoboken: Wiley.
Liu, M., Taylor, J. M. G., & Belin, T. R. (2000). Multiple imputation and posterior simulation for multivariate missing data in longitudinal studies. Biometrics, 56(4), 1153–1157.
Lord, F. M. (1952). A theory of test scores. Psychometric Monograph No. 7. Richmond, VA: Psychometric Corporation.
Lord, F. M. (1953). An application of confidence intervals and of maximum likelihood to the estimation of an examinee’s ability. Psychometrika, 18(1), 57–75.
Maddala, G. S. (1983). Limited-dependent and qualitative variables in econometrics. Cambridge: Cambridge University Press.
McKelvey, R., & Zavoina, W. (1975). A statistical model for the analysis of ordinal level dependent variables. Journal of Mathematical Sociology, 4(1), 103–120.
Mislevy, R. J. (1985). Estimation of latent group effects. Journal of the American Statistical Association, 80(392), 993–997.
Mullis, I. V. S., & Martin, M. O. (2013). TIMSS 2015 assessment frameworks. Chestnut Hill: TIMSS & PIRLS International Study Center.
Muthén, B., Kaplan, D., & Hollis, M. (1987). On structural equation modeling with data that are not missing completely at random. Psychometrika, 52(3), 431–462.
Muthén, B. O. (1979). A structural probit model with latent variables. Journal of the American Statistical Association, 74(368), 807–811.
Muthén, B. O. (1989). Latent variable modeling in heterogeneous populations. Psychometrika, 54(4), 557–585.
Muthén, B. O., & Christoffersson, A. (1981). Simultaneous factor analysis of dichotomous variables in several groups. Psychometrika, 46(4), 407–419.
Neal, P., & Kypraios, T. (2015). Exact Bayesian inference via data augmentation. Statistics and Computing, 25(2), 333–347.
NEPS Network. (2019). German national educational panel study, scientific use file of starting cohort grade 9. Leibniz Institute for Educational Trajectories (LIfBi), Bamberg. Retrieved from https://doi.org/10.5157/NEPS:SC4:10.0.0
Neumann, I., Duchhardt, C., Grüßing, M., Heinze, A., Knopp, E., & Ehmke, T. (2013). Modeling and assessing mathematical competence over the lifespan. Journal for Educational Research Online, 5(2), 80–109.
Nieminen, P., Lehtiniemi, H., Vähäkangas, K., Huusko, A., & Rautio, A. (2013). Standardised regression coefficient as an effect size index in summarising findings in epidemiological studies. Epidemiology Biostatistics and Public Health, 10(4), 1–16.
OECD. (2013a). PISA 2012 results: Excellence through equity: Giving every student the chance to succeed (volume II). Paris: OECD Publishing.
OECD. (2013b). Technical report of the survey of adult skills (PIAAC). Paris: OECD Publishing.
OECD. (2014). PISA 2012 technical report. Paris: OECD Publishing.
Passaretta, G., & Skopek, J. (2021). Does schooling decrease socioeconomic inequality in early achievement? A differential exposure approach. American Sociological Review, 86(6), 1017–1042. https://doi.org/10.1177/00031224211049188
Pohl, S., Gräfe, L., & Rose, N. (2014). Dealing with omitted and not-reached items in competence tests: Evaluating approaches accounting for missing responses in item response theory models. Educational and Psychological Measurement, 74(3), 423–452.
Richard, J. F., & Zhang, W. (2007). Efficient high-dimensional importance sampling. Journal of Econometrics, 141(2), 1385–1411.
Rijmen, F., Tuerlinckx, F., De Boeck, P., & Kuppens, P. (2003). A nonlinear mixed model framework for item response theory. Psychological Methods, 8(2), 185–205.
Robert, C. P., & Casella, G. (2004). Monte Carlo statistical methods (2nd ed.). New York: Springer.
Ročková, V., & George, E. I. (2018). The spike-and-slab lasso. Journal of the American Statistical Association, 113(521), 431–444.
Rubin, D. B. (1981). The Bayesian bootstrap. The Annals of Statistics, 9(1), 130–134.
Rutkowski, L. (2011). The impact of missing background data on subpopulation estimation. Journal of Educational Measurement, 48(3), 293–312.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometric Monograph No. 17. Richmond, VA: Psychometric Corporation.
Si, Y., & Reiter, J. P. (2013). Nonparametric Bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys. Journal of Educational and Behavioral Statistics, 38(5), 499–521.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398), 528–549.
Therneau, T., & Atkinson, B. (2018). rpart: Recursive partitioning and regression trees [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=rpart (R package version 4.1-13).
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P.-C. (2021). Rank-normalization, folding, and localization: An improved $\widehat{R}$ for assessing convergence of MCMC (with discussion). Bayesian Analysis, 16(2), 667–718.
von Hippel, P. (2007). Regression with missing Ys: An improved strategy for analyzing multiply imputed data. Sociological Methodology, 37, 83–117.
Wilson, M., & De Boeck, P. (2004). Explanatory item response models. New York, NY: Springer.
Zwinderman, A. H. (1991). A generalized Rasch model for manifest predictors. Psychometrika, 56(4), 589–600.
Table 1. Prior specifications and MCMC starting values.

Table 2. Simulated missing data mechanisms.

Table 3. Simulation study (scenario 1, missing rates: $X_1=19\%$, $X_2=26\%$, overall $=33\%$)—True parameter values, mean posterior medians and standard deviations, RMSEs and coverage ratios of structural parameters (regression coefficients, variance parameters) over 1000 replications obtained from BD, CC, IBM and DART.

Table 4. Simulation study (scenario 2, missing rates: $X_1=40\%$, $X_2=50\%$, overall $=59\%$)—True parameter values, mean posterior medians and standard deviations, RMSEs and coverage ratios of structural parameters (regression coefficients, variance parameters) over 1000 replications obtained from BD, CC, IBM and DART.

Table 5. Simulation study (scenario 3, missing rates: $X_1=20\%$, $X_2=36\%$, overall $=46\%$)—True parameter values, mean posterior medians and standard deviations, RMSEs and coverage ratios of structural parameters (regression coefficients, variance parameters) over 1000 replications obtained from BD, CC, IBM and DART.

Table 6. Simulation study (scenario 4, missing rates: $X_1=17\%$, $X_2=28\%$, overall $=40\%$)—True parameter values, mean posterior medians and standard deviations, RMSEs and coverage ratios of structural parameters (regression coefficients, variance parameters) over 1000 replications obtained from BD, CC, IBM and DART.

Table 7. Comparison of prediction accuracy of conditioning variables X.

Table 8. NEPS grade 9—descriptive statistics (complete case summary).

Table 9. NEPS grade 9, mathematical competencies—parameter estimates of model I.

Table 10. NEPS grade 9, mathematical competencies—relative effects for structural parameter estimates of model I.

Table 11. NEPS grade 9, mathematical competencies—relative effects for structural parameter estimates of model II.

Table 12. NEPS grade 9, mathematical competencies—structural parameter estimates of model II.

Figure 1. NEPS grade 9, Gaussian kernel density estimates for the set of conditional variances on person level $\sigma_g^2$ and school level $\upsilon_g^2$ and expected a posteriori estimates of scalar person parameter $\theta_i$ referring to mathematical competence in model I.

Figure 2. NEPS grade 9, Gaussian kernel density estimates for the set of conditional variances on person level $\sigma_g^2$ and school level $\upsilon_g^2$ and expected a posteriori estimates of scalar person parameter $\theta_i$ referring to mathematical competence in model II.

Supplementary material: Aßmann et al. supplementary material, Tables 1–9 (File, 3.1 MB).