Adjusted Residuals for Evaluating Conditional Independence in IRT Models for Multistage Adaptive Testing

Peter W. van Rijn; Usama S. Ali; Hyo Jeong Shin; Sean-Hwane Joo

doi:10.1007/s11336-023-09935-4

Published online by Cambridge University Press: 01 January 2025

Peter W. van Rijn*: ETS Global
Usama S. Ali: Educational Testing Service; South Valley University
Hyo Jeong Shin: Sogang University, Seoul
Sean-Hwane Joo: University of Kansas
*Correspondence should be made to Peter W. van Rijn, ETS Global, Amsterdam, The Netherlands. Email: pvanrijn@etsglobal.org

Abstract

The key assumption of conditional independence of item responses given latent ability in item response theory (IRT) models is addressed for multistage adaptive testing (MST) designs. Routing decisions in MST designs can cause patterns in the data that are not accounted for by the IRT model. This phenomenon relates to quasi-independence in log-linear models for incomplete contingency tables and impacts certain types of statistical inference based on assumptions about observed and missing data. We demonstrate that generalized residuals for item pair frequencies under IRT models as discussed by Haberman and Sinharay (J Am Stat Assoc 108:1435–1444, 2013. https://doi.org/10.1080/01621459.2013.835660) are inappropriate for MST data without adjustments. The adjustments depend on the MST design and can quickly become nontrivial as the complexity of the routing increases. However, the adjusted residuals are found to have satisfactory Type I error rates in a simulation and are illustrated by an application to real MST data from the Programme for International Student Assessment (PISA). Implications and suggestions for statistical inference with MST designs are discussed.

Keywords

residual analysis; conditional independence; item response theory; multistage adaptive testing

Type: Theory and Methods
Information: Psychometrika, Volume 89, Issue 1, March 2024, pp. 317-346

DOI: https://doi.org/10.1007/s11336-023-09935-4
Copyright: Copyright © 2023 The Author(s) under exclusive licence to The Psychometric Society.

Multistage adaptive testing (MST) is becoming increasingly popular, with major educational assessment programs currently operating in this fashion (Robin et al., 2014; Yamamoto et al., 2019). MST consists of a modular approach to adapt the difficulty level of the test to the ability level of test takers in which the possible test paths are limited in number and can be reviewed in advance. Item response theory (IRT) models are often behind MST applications. One of the key pillars of IRT is the assumption of independence of item responses conditional on the latent ability variable, often referred to as local independence (Lord & Novick, 1968, Section 16.3). Evaluating this conditional independence (CI) assumption for MST data is more complicated than with data from a linear test due to dependencies in the incomplete data that arise from MST designs. However, it remains important to evaluate the CI assumption in the context of MST, because violations can lead to inflated measurement precision (Wainer & Thissen, 1996), confounds in subscores (Yen, 1993), and issues with the adaptive algorithm (Zenisky et al., 2001). In this paper, we demonstrate that existing methods for evaluating CI can fail for MST data and present appropriate adjustments for one particular method based on generalized residuals (Haberman & Sinharay, 2013).

Incomplete designs in which test takers only see a subset of all available items play an important role in large-scale educational assessments for various reasons such as test security in the case of continuous testing programs (Kolen & Brennan, 2004) and domain coverage in the case of educational surveys with limited testing time (Johnson, 1992). Assuming dichotomous responses to J items, a $2^J$ contingency table is obtained (Tjur, 1982). Incomplete designs lead to what is called an incomplete contingency table in log-linear modeling (Bishop, Fienberg, & Holland, 2007, Chapter 5). That is, the table has structural zeros. However, for MST designs, the routing decisions can cause specific deviations for certain cells of the contingency table. In addition to structural zeros, not all item-response patterns are equally likely due to the specifics of the MST design (e.g., a set of difficult items is presented only when the sum score on a set of earlier items exceeds a specific value). The extent to which these deviations arise depends on the interplay between the routing decisions, the difficulty level of the MST modules, and the ability level of test takers. In terms of log-linear modeling, this issue is referred to as quasi-independence (Goodman, 1968), which means that an independence model can only be specified for a subset of the outcome space. In the case of MST data, as we shall see later, this subset is not always straightforward to identify, which complicates the evaluation of CI for IRT models.

When it comes to statistical inference using observed and missing data from MST designs with respect to model parameters and model fit, it is important to distinguish between likelihood and sampling-distribution inference (Rubin, 1976) [Footnote 1]. Likelihood inference is based on (ratios of) the likelihood function for different parameter values, whereas sampling-distribution inference is based on comparing observed values of a statistic with its sampling distribution under the null hypothesis. With missing data caused by incomplete designs, inference can proceed by making use of the ignorability principle (Rubin, 1976). For sampling-distribution inference, missing data are ignorable if they are missing at random (MAR) and the observed data are observed at random (OAR), which together are referred to as missing completely at random (MCAR; Rubin, 1976). For likelihood inference, missing data are ignorable if MAR holds and the missing-data parameter is distinct from the observed-data parameter.

As an example of likelihood inference with MST data, Glas (1988) showed that solvable equations can be obtained for maximum marginal likelihood estimation of item parameters for the Rasch and two-parameter logistic (2PL) models in MST designs. Furthermore, Mislevy and Wu (1996, Theorem 5.2) showed that the missing data caused by adaptive testing designs are MAR, but not MCAR. However, in contrast to Glas (1988), they assume that item parameters are known and consider inference on ability parameters only. Either way, MST and other adaptive testing designs can lead to issues with evaluating the assumption of CI (Mislevy & Chang, 2000) and with dimensionality assessment (Zhang, 2013) when sampling-distribution inference is used. Therefore, it is critical to evaluate the impact of MST data on statistical methods and properly adjust such methods to account for the adaptive design to provide accurate results (Ali et al., in press).

Notwithstanding these issues, common methods for evaluating CI in IRT models such as residual correlations (Yen, 1993) and Pearson's $X^2$ statistic for item pairs (Chen & Thissen, 1997) have been used in the context of adaptive testing (Pommerich & Segall, 2008). It is not the purpose of this paper to compare several different conditional dependence diagnostics. Liu & Maydeu-Olivares (2013) provide an overview of methods for evaluating CI, including limited-information methods based on item pairs. However, it is generally unknown what the impact of MST is on such methods and what needs to be done to take the MST design into account. The purpose of the present paper is to address these issues focusing on residual analysis (Haberman & Sinharay, 2013; Reiser, 1996) as this has worked well in terms of Type I error (van Rijn et al., 2016). In the context of evaluating CI, this analysis can be performed by evaluating residuals for frequencies of item pairs or triplets. We demonstrate that not accounting for the MST design generally leads to incorrect residuals and provide appropriate adjustments that take into account the MST design.

In general, item parameters are needed to create MST designs and report test results. In this context, a distinction can be made between pre-calibrated and post-calibrated MSTs (Jewsbury & van Rijn, 2020). In a pre-calibrated MST, item parameters are estimated from an earlier data collection and treated as fixed in the given MST data. This earlier data collection can be a pilot study specifically designed to obtain item parameters (e.g., a linear pre-test) or an earlier administration of the studied MST (e.g., with seeded new items). In a post-calibrated MST, item parameters are (re-)estimated from the given data set. We mostly focus on the latter situation in which it becomes important to discuss MST design aspects that drive the tradeoff between information in the data for estimating item parameters and that for estimating ability parameters (Zwitser & Maris, 2015). That is, MST designs that work well for ability estimation may not work well for item parameter estimation.

The paper is outlined as follows. In the Method section, we discuss CI, IRT models, estimation, and adjusted residuals. We illustrate the proposed adjustments by means of a simulation study in which parameter recovery is also briefly addressed. Finally, the methods are applied to real MST data from the Programme for International Student Assessment (PISA). The paper ends with a discussion.

1. Method

We denote the observed item response variable by $x_{ij}$ for test takers $i = 1, \ldots, N$ and items $j = 1, \ldots, J$. The test length is denoted by L with $L \le J$. Let S be the set of possible item response vectors $\mathbf{x}_i$, i.e., the sample space.

In this paper, we make comparisons between two types of linear designs and two types of MST designs. The first linear design is a complete design in which test takers see all items (i.e., $L = J$). The second linear design is a random design in which test takers see a random subset of all items (i.e., $L < J$).

The first MST design is a basic two-stage design in which test takers start with a routing module, which is followed by an adaptive module of either low or high difficulty, depending on a test taker's performance on the routing module. Since test takers are routed to only one of the two adaptive modules, we have $L < J$. In the basic MST design, the routing and adaptive modules are fixed. The following example of a basic MST design consisting of three modules A, B, and C will be used throughout to illustrate our methods. Figure 1 illustrates this design. We assume that module A is of medium difficulty, module B is of low difficulty, and module C is of high difficulty. Test takers start with module A. If their sum score on module A ($x_A^+$) is below the cutoff $c_A$, they are routed to module B. If their sum score on module A is at least $c_A$, they are routed to module C.
The basic MST design can be generalized to (1) more than two stages, (2) more than two difficulty levels beyond the first stage, and (3) more complicated routing rules.

Figure 1. Basic MST design with two stages and two levels of difficulty (module B is of lower difficulty, module C is of higher difficulty).
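The routing rule of the basic design can be sketched in a few lines of Python. This is only an illustrative simulation under assumptions that are not part of the paper: responses follow a 2PL model, abilities are standard normal, and the item parameters, the cutoff value, and the names `simulate_basic_mst`, `MODULE_A`, `MODULE_B`, `MODULE_C`, and `C_A` are hypothetical.

```python
import random
from math import exp

# Hypothetical (slope, intercept) pairs; module B is easy, module C is hard.
MODULE_A = [(1.0, 0.0), (1.2, 0.3), (0.8, -0.3)]
MODULE_B = [(1.0, 1.0), (1.1, 1.2)]
MODULE_C = [(1.0, -1.0), (0.9, -1.2)]
C_A = 2  # hypothetical cutoff on the module A sum score


def simulate_basic_mst(n, module_a, module_b, module_c, c_a, seed=1):
    """Simulate n test takers in the basic two-stage design (assumed 2PL)."""
    rng = random.Random(seed)

    def respond(theta, items):
        # One Bernoulli draw per item with success probability from the 2PL.
        return [int(rng.random() < 1.0 / (1.0 + exp(-(a * theta + b))))
                for a, b in items]

    records = []
    for _ in range(n):
        theta = rng.gauss(0.0, 1.0)
        x_a = respond(theta, module_a)
        if sum(x_a) < c_a:
            # Sum score below the cutoff: route to B, C is missing by design.
            records.append({"A": x_a, "B": respond(theta, module_b), "C": None})
        else:
            # Sum score at least the cutoff: route to C, B is missing by design.
            records.append({"A": x_a, "B": None, "C": respond(theta, module_c)})
    return records


records = simulate_basic_mst(500, MODULE_A, MODULE_B, MODULE_C, C_A)
n_routed_to_b = sum(r["B"] is not None for r in records)
print(n_routed_to_b, len(records) - n_routed_to_b)
```

Every simulated record has exactly one of modules B and C observed, and which one is observed is a deterministic function of the module A sum score.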

The second MST design is a balanced two-stage design. In this design, again a routing module is followed by an adaptive module, but we let each item appear in both stages by creating multiple routing modules and multiple adaptive modules. This design is intended to mimic balanced incomplete block designs used in linear testing (Messick et al., 1983). If all modules are to have the same length, then three routing modules, three low-difficulty modules, and three high-difficulty modules can be assembled to create a balanced MST design from the above basic MST design. Essentially, the basic MST design is replicated three times using different subsets from the item pool to create a balanced MST design. The design is illustrated in Fig. 2. Here, all items are used once across modules A, D, and G, but also once across A, B, C, and D, E, F, and G, H, I. Note that there are different ways to create a balanced two-stage design. One option is to split the item pool randomly into parts and create modules from each part.

Figure 2. Balanced MST design with two stages, two levels of difficulty, and all items used in both stages.

Different designs can lead to different sample spaces, which is important to realize because this can lead to different sampling distributions as well. Consider the special case where each module consists of a single dichotomous item and the cutoff $c_A = 1$. For the basic MST design, the sample space then only has four elements: $S_\text{basic} = (00\text{M}, 01\text{M}, 1\text{M}0, 1\text{M}1)$, where 0 indicates an incorrect response, 1 indicates a correct response, and M indicates missing by design. So, the pattern 00M indicates that items A and B were answered incorrectly and, by design, the response to item C is missing. It follows that if item B is observed, then item A must be incorrect, and if item C is observed, then item A must be correct. Alternatively, if we employ the balanced MST design by allowing each item to be the first for 1/3 of the test takers and, depending on the response, one of the remaining items to be second, the sample space has nine elements: $S_\text{balanced} = (00\text{M}, 01\text{M}, 1\text{M}0, 1\text{M}1, 10\text{M}, \text{M}10, \text{M}11, \text{M}00, 0\text{M}1)$. Finally, if two out of the three items are administered at random, the sample space $S_\text{random}$ consists of all twelve possible response patterns. For these three examples, the sample spaces are related as follows: $S_\text{basic} \subset S_\text{balanced} \subset S_\text{random}$. If each module consists of multiple items instead of a single item, the dependencies between items across modules become more complex. For example, in the basic MST with multiple items per module, if module B is observed, then it follows that $x_A^+ < c_A$, and if module C is observed, $x_A^+ \ge c_A$.
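The three single-item sample spaces can be enumerated directly, which makes the nesting easy to check. The Python sketch below is illustrative only. The helper names `pattern` and `two_stage` are hypothetical, and the routing targets used when item B or item C comes first are an assumption chosen so that the enumeration reproduces the nine patterns listed above.

```python
from itertools import product

ITEMS = ("A", "B", "C")


def pattern(responses):
    """Render e.g. {'A': 0, 'B': 1} as '01M', where M means missing by design."""
    return "".join(str(responses[i]) if i in responses else "M" for i in ITEMS)


def two_stage(first, if_wrong, if_right):
    """All patterns for one routing item followed by one adaptive item (cutoff 1)."""
    out = set()
    for x1, x2 in product((0, 1), repeat=2):
        second = if_right if x1 == 1 else if_wrong
        out.add(pattern({first: x1, second: x2}))
    return out


# Basic design: A routes to B after an incorrect and to C after a correct response.
s_basic = two_stage("A", "B", "C")

# Balanced design: each item is the routing item for a third of the test takers.
# The targets for B-first and C-first are assumed, not taken from the paper.
s_balanced = s_basic | two_stage("B", "A", "C") | two_stage("C", "B", "A")

# Random design: any two of the three items with any pair of responses.
s_random = set()
for skipped in ITEMS:
    seen = [i for i in ITEMS if i != skipped]
    for x in product((0, 1), repeat=2):
        s_random.add(pattern(dict(zip(seen, x))))

print(len(s_basic), len(s_balanced), len(s_random))
print(s_basic < s_balanced < s_random)  # strict subset chain
```

The three patterns in the random design that the balanced design cannot produce under this assumed routing are 0M0, 11M, and M01.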

We next discuss how MST designs affect the assumption of CI. This is followed by a brief exposition of IRT models and their estimation. We then discuss residuals for evaluation of CI and present the proper adjustments for MST designs.

1.1. Conditional Independence

In general, for test taker i, a vector of J item responses $\mathbf{x}_i$, and a latent ability variable $\boldsymbol{\theta}_i$, the assumption of CI amounts to being able to write the joint probability as

$$p(\mathbf{x}_i \mid \boldsymbol{\theta}_i) = \prod_{j=1}^{J} p(x_{ij} \mid \boldsymbol{\theta}_i), \tag{1}$$

with $\prod_{j=1}^{J} (m_j + 1)$ possible item response vectors in S if item j is scored $0, 1, \ldots, m_j$, where $m_j$ is the maximum score for item j. We assume here that there are no missing values and no design restrictions on the sample space. The above notion is typically referred to as strong CI, whereas weak CI only states that item pairs are conditionally independent (McDonald, 1999).

In the context of MST, the conditional probability in Eq. 1 needs to be rewritten to take into account routing rules (Haberman & von Davier, 2014, Section 15.2.2). To this end, we let $F_{iH}$ be the set of items administered to test taker i up to stage H and $G_H$ be the set of all available items in these stages with $F_{iH} \subset G_H$. According to Haberman & von Davier (2014), CI then boils down to

$$p(\mathbf{x}_{iG_H} \mid \boldsymbol{\theta}_i) = \prod_{j \in F_{iH}} p(x_{ij} \mid \boldsymbol{\theta}_i), \tag{2}$$

which essentially means that CI can only be assumed to hold for items that were observed.

As an illustration, consider the basic MST design example described above, where $G_2$ consists of the items for modules A, B, and C and $F_2$ consists of either modules A and B or A and C. Assume further that the latent variable is unidimensional. In this design, CI conditional on $\theta$ across modules B and C cannot be determined. In this case, CI can only be assessed for the sample space, which relates to quasi-independence in log-linear models with incomplete data (Goodman, 1968). However, taking into account the missing data does not fully capture the basic MST design because the responses on module A and the cutoff $c_A$ determine whether module B or module C is observed. This needs to be taken into account as well when one wants to evaluate CI. To this end, we need to find the probabilities for the item responses of module A conditional on the routing decisions (i.e., the sum score for module A), which are

$$p(\mathbf{x}_A \mid \theta, x_A^+ < c_A) = \frac{p(\mathbf{x}_A \mid \theta)}{p(x_A^+ < c_A \mid \theta)}, \tag{3}$$

$$p(\mathbf{x}_A \mid \theta, x_A^+ \ge c_A) = \frac{p(\mathbf{x}_A \mid \theta)}{p(x_A^+ \ge c_A \mid \theta)}, \tag{4}$$

where $\mathbf{x}_A$ denotes the vector of item responses for module A and $x_A^+$ its sum score. These probabilities are needed in evaluating CI for item pairs that do not appear within the same module (i.e., they are separated by a routing decision). In MST designs, CI in the sense of a simple product without conditioning on the routing decision only holds for item pairs within the same module. Furthermore, it can be shown that for the basic MST design for items j and k

$$\begin{aligned} p(x_{A_j}, x_{A_k} \mid \theta) &= p(x_{A_j} \mid \theta)\, p(x_{A_k} \mid \theta), \\ p(x_{B_j}, x_{B_k} \mid \theta) &= p(x_{B_j} \mid \theta)\, p(x_{B_k} \mid \theta), \\ p(x_{C_j}, x_{C_k} \mid \theta) &= p(x_{C_j} \mid \theta)\, p(x_{C_k} \mid \theta), \\ p(x_{A_j}, x_{B_k} \mid \theta) &\ne p(x_{A_j} \mid \theta)\, p(x_{B_k} \mid \theta), \\ p(x_{A_j}, x_{C_k} \mid \theta) &\ne p(x_{A_j} \mid \theta)\, p(x_{C_k} \mid \theta), \\ p(x_{B_j}, x_{C_k} \mid \theta) &= 0. \end{aligned}$$
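The fourth relation can be checked numerically for a small routing module. The Python sketch below is a hedged illustration, not the authors' computation: it assumes a unidimensional 2PL model, a fixed ability value, three hypothetical items in module A, one hypothetical item in module B, and a cutoff of 2. It interprets the joint probability for an A-B pair as the probability of the response to the A item together with the event that module B is administered and the B item is answered as stated.

```python
from itertools import product
from math import exp


def p_correct(theta, alpha, beta):
    """2PL probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + exp(-(alpha * theta + beta)))


def pattern_prob(x, theta, items):
    """p(x | theta) for a module under CI within the module."""
    prob = 1.0
    for xj, (alpha, beta) in zip(x, items):
        pj = p_correct(theta, alpha, beta)
        prob *= pj if xj == 1 else 1.0 - pj
    return prob


theta = 0.0
module_a = [(1.0, 0.0), (1.2, 0.3), (0.8, -0.3)]  # hypothetical (slope, intercept)
item_b = (1.0, 1.0)                               # hypothetical item in module B
c_a = 2                                           # hypothetical cutoff

patterns = list(product((0, 1), repeat=len(module_a)))
p_a = {x: pattern_prob(x, theta, module_a) for x in patterns}

# Denominator of the first conditional probability: p(x_A^+ < c_A | theta).
p_low = sum(p for x, p in p_a.items() if sum(x) < c_a)

# Module A response probabilities conditional on being routed to module B.
p_a_given_low = {x: p / p_low for x, p in p_a.items() if sum(x) < c_a}

# Joint probability that the first A item is correct, the test taker is routed
# to B, and the B item is correct, versus the naive product of marginals.
p_b1 = p_correct(theta, *item_b)
joint = sum(p for x, p in p_a.items() if x[0] == 1 and sum(x) < c_a) * p_b1
naive = p_correct(theta, *module_a[0]) * p_b1

print(round(sum(p_a_given_low.values()), 12))
print(joint, naive)
```

Under these assumed values, the joint probability is far smaller than the naive product, because a correct response on an A item makes routing to the easier module B less likely.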

1.2. Model

We now turn to discussing specific IRT models. A general logistic IRT model can be specified by

$$p(X_{ij} = x \mid \boldsymbol{\theta}_i) = \frac{\exp(\boldsymbol{\alpha}_{jx}' \mathbf{D} \boldsymbol{\theta}_i + \beta_{jx})}{\sum_{h=0}^{m_j} \exp(\boldsymbol{\alpha}_{jh}' \mathbf{D} \boldsymbol{\theta}_i + \beta_{jh})}, \qquad x = 0, \ldots, m_j, \tag{5}$$

where $\mathbf{D}$ is a design matrix, $\boldsymbol{\theta}_i$ is a latent ability variable, $\boldsymbol{\alpha}_{jx}$ contains item slopes, and $\beta_{jx}$ is an item intercept. The general form in Eq. 5 can accommodate polytomous IRT models such as the nominal response model (Bock, 1972), but also quite general multidimensional IRT models (Reckase, 2009). Further generalizations, such as the three-parameter logistic model, are possible when an additional layer of latent item response variables is introduced (Haberman, 2013). For a unidimensional two-parameter logistic (2PL) model, the assumption of CI leads to the following convenient form

$$\log p(\mathbf{x}_i \mid \theta_i) = \mathcal{P}_0(\theta_i, \boldsymbol{\xi}) + \sum_{j=1}^{J} x_{ij} (\alpha_j \theta_i + \beta_j), \tag{6}$$

where $\mathcal{P}_0(\theta_i, \boldsymbol{\xi}) = \log \left\{ \prod_{j=1}^{J} [1 + \exp(\alpha_j \theta_i + \beta_j)]^{-1} \right\}$ is a normalizing constant (Glas, 1989, Eq. 2.2.1) and $\boldsymbol{\xi}$ is the vector containing all item parameters.
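As a quick check of the 2PL form in Eq. 6, the following Python sketch compares the log-probability obtained by summing item-wise Bernoulli terms (the product under CI) with the normalizing-constant form. The function names and the parameter values are hypothetical and serve only as an illustration.

```python
from math import exp, log


def log_prob_product(x, theta, alphas, betas):
    """log p(x | theta) as a sum of item-wise log-probabilities (CI product)."""
    total = 0.0
    for xj, a, b in zip(x, alphas, betas):
        p = 1.0 / (1.0 + exp(-(a * theta + b)))
        total += log(p) if xj == 1 else log(1.0 - p)
    return total


def log_prob_eq6(x, theta, alphas, betas):
    """log p(x | theta) as normalizing constant plus the linear term of Eq. 6."""
    p0 = -sum(log(1.0 + exp(a * theta + b)) for a, b in zip(alphas, betas))
    linear = sum(xj * (a * theta + b) for xj, a, b in zip(x, alphas, betas))
    return p0 + linear


# Hypothetical slopes, intercepts, ability value, and response vector.
alphas = [1.0, 1.2, 0.8, 1.5]
betas = [0.0, 0.3, -0.3, 1.0]
x = [1, 0, 1, 1]
theta = 0.7

print(log_prob_product(x, theta, alphas, betas))
print(log_prob_eq6(x, theta, alphas, betas))
```

Both functions return the same value up to floating-point error, since each incorrect response contributes only to the normalizing constant and each correct response adds its logit.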

Although our focus is on evaluating CI and not on modeling conditional dependence, we briefly discuss some approaches for the latter case. Random effects for conditional dependence can be expressed through, for example, testlet models (Wainer et al., Reference Wainer, Bradlow and Wang2007) or bifactor models (Gibbons & Hedeker, Reference Gibbons and Hedeker1992). In general, a MIRT model for dichotomous items can be specified as

(7)

\begin{matrix} \log p (x_{i} | θ_{i}) & = P_{0} (θ_{i}, ξ) + \sum_{j = 1}^{J} x_{j} (α_{j}^{'} D θ_{i} + β_{j}) . \end{matrix}

Fixed effects for conditional dependence can be expressed through a log-linear approach. Kelderman & Rijkes (1994) and Verhelst & Verstralen (2008) discuss several models related to the Rasch family, including polytomous and multidimensional extensions. Another model that explicitly represents dependencies among items is the interaction model (Haberman, 2007), in which the probability of an item response vector is given by

(8)

\begin{matrix} \log p (x_{i} | θ_{i}) & = P_{0} (θ_{i}, ξ) + \sum_{j = 1}^{J} x_{j} (θ_{i} + β_{j}) + \sum_{j = 2}^{J} \sum_{k = 1}^{j - 1} x_{j} x_{k} (γ_{j} + γ_{k}) . \end{matrix}

A salient feature of the interaction model is that CI does not hold, but the sum score is still a sufficient statistic for $θ$. Some other alternatives are discussed by Ip (2002) and Nikoloulopoulos & Joe (2015).

1.3. Estimation

As noted, item parameters are needed to create MST designs and generally such designs lead to incomplete data. Missing data in MST designs is MAR, but not OAR and, hence, not MCAR (Mislevy & Wu, 1996). This can be demonstrated in the basic MST design. Let $u$ be the missing-data indicator. If module B is missing, then the sum score for module A must be at least $c_{A}$, so that the joint probability of missing data on module B, $p (u_{B})$, is independent of $x_{B}$ and $x_{C}$, but not of $x_{A}$:

(9)

\begin{matrix} p (u_{B} | x_{A}, x_{B}, x_{C}) & = p (u_{B} | x_{A}) \neq p (u_{B}) . \end{matrix}

The first step in the above equation indicates that the data in module B is missing at random, whereas the second step indicates the data in module B is not observed at random.

Similar reasoning applies for module C in the basic MST design and for other MST designs. As noted, this result has implications for sampling-distribution inference, but not for parameter estimation if the marginal likelihood is used. That is, in this case, the ignorability principle applies for estimation of item and population parameters of unidimensional IRT models (Rubin, 1976; Eggen & Verhelst, 2011). The situation for multidimensional IRT models and MST data is somewhat more complex (see Jewsbury & van Rijn, 2020).

For marginal maximum likelihood (MML) estimation of item parameters with MST data, Glas (1988) showed that with sequentially administered items a, b, and c, their joint marginal probability can be written as:

(10)

\begin{matrix} p (x_{a}, x_{b}, x_{c} | ξ) & = p (x_{a} | ξ_{a}) p (x_{b}, x_{c} | x_{a}, ξ) \end{matrix}

(11)

\begin{matrix} = p (x_{a} | ξ_{a}) \int p (x_{b}, x_{c} | θ, ξ) f (θ | x_{a}, ξ_{a}) d θ \end{matrix}

(12)

\begin{matrix} = \int p (x_{a}, x_{b}, x_{c} | θ, ξ) f (θ) d θ, \end{matrix}

where $f (θ)$ is typically a normal distribution (Bock & Aitkin, 1981), although there are many other possibilities (Woods, 2015). This result shows that MST data poses no problems for MML estimation. For conditional maximum likelihood (CML) estimation of item parameters in the Rasch model with MST data, the situation is different because the sum score is used to obtain the conditional likelihood. Although we do not discuss CML further here, it is interesting to note that in order to apply CML to MST data, one essentially has to take into account the routing decisions (Zwitser & Maris, 2015).

We let q be a positive integer and $Ω$ be the parameter space consisting of a nonempty open set of q-dimensional vectors. Then, MML estimation of the q-dimensional parameter vector $ξ$ proceeds by defining the marginal log-likelihood function

(13)

\begin{matrix} ℓ (ξ) = \sum_{i = 1}^{N} ℓ_{i} (ξ), \end{matrix}

where the individual contributions are given by

(14)

\begin{matrix} ℓ_{i} (ξ) = \log \int p (x_{i} | θ) f (θ) d θ . \end{matrix}

Estimates can be found by solving the likelihood equations $\partial ℓ (ξ) / \partial ξ = 0$. The integrals can be approximated by numerical integration methods (e.g., adaptive quadrature; Naylor & Smith, 1982).
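The marginal log-likelihood in Eqs. 13 and 14 can be sketched numerically as follows. This is an illustration under stated assumptions, not the estimation routine used in the study: it assumes a unidimensional 2PL model, a standard normal $f (θ)$, and a fixed equally spaced grid in place of adaptive quadrature. Items that were not administered are coded as `None` and are simply skipped, which is how the ignorability principle enters the computation. All function names are hypothetical.

```python
import math

def quadrature_grid(n_points=61, lo=-6.0, hi=6.0):
    """Equally spaced nodes with normalized standard normal weights (an assumption)."""
    step = (hi - lo) / (n_points - 1)
    nodes = [lo + k * step for k in range(n_points)]
    dens = [math.exp(-0.5 * t * t) for t in nodes]
    total = sum(dens)
    return nodes, [d / total for d in dens]

def pattern_prob(x, theta, alpha, beta):
    """p(x | theta) under the 2PL; items with x_j = None (not administered) are skipped."""
    prob = 1.0
    for xj, a, b in zip(x, alpha, beta):
        if xj is None:
            continue
        p = 1.0 / (1.0 + math.exp(-(a * theta + b)))
        prob *= p if xj == 1 else 1.0 - p
    return prob

def marginal_loglik(data, alpha, beta, nodes, weights):
    """Sum over test takers of log of the integral of p(x_i | theta) f(theta) (Eqs. 13-14)."""
    ll = 0.0
    for x in data:
        marginal = sum(w * pattern_prob(x, t, alpha, beta) for t, w in zip(nodes, weights))
        ll += math.log(marginal)
    return ll
```

In practice this function would be handed to an optimizer over the item parameters; the sketch only shows how the individual contributions are formed.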

1.4. Adjusted Residuals

We now discuss our method for evaluating CI and present adjustments to the original methods to account for the MST design. The approach is based on generalized residuals for the frequencies of a pair of scores on items j and k as developed by Haberman & Sinharay (2013):

(15)

\begin{matrix} Z_{x_{j}, x_{k}} = \frac{O_{x_{j}, x_{k}} - E_{x_{j}, x_{k}}}{S_{x_{j}, x_{k}}}, \end{matrix}

where $O_{x_{j}, x_{k}}$ is the observed frequency of $x_{j}$ and $x_{k}$, $E_{x_{j}, x_{k}}$ is the expected frequency, and $S_{x_{j}, x_{k}}$ is the associated standard deviation. For example, $O_{1_{j}, 1_{k}}$ denotes the observed frequency of a score of 1 on both items j and k. The expected frequency is typically given by

(16)

\begin{matrix} E_{x_{j}, x_{k}} = \sum_{i = 1}^{N} \int p (x_{j}, x_{k} | θ) f (θ | x_{i}) d θ . \end{matrix}

However, for MST data, an adjustment is needed in order to account for patterns that either are unobserved or are more or less likely due to the MST design.

Consider again the basic MST design with modules A (medium difficulty), B (low difficulty), and C (high difficulty). If $A_{j}$ denotes the j-th item in module A and $c_{A}$ is the sum score cutoff on module A, then the following item probabilities given that the sum score takes on certain values are relevant

(17)

\begin{matrix} p (x_{A_{j}} | θ, x_{A}^{+} < c_{A}) & = \frac{p (x_{A_{j}}, x_{A}^{+} < c_{A} | θ)}{p (x_{A}^{+} < c_{A} | θ)}, \end{matrix}

(18)

\begin{matrix} p (x_{A_{j}} | θ, x_{A}^{+} \geq c_{A}) & = \frac{p (x_{A_{j}}, x_{A}^{+} \geq c_{A} | θ)}{p (x_{A}^{+} \geq c_{A} | θ)}, \end{matrix}

where $p (x_{A}^{+} < c_{A} | θ) = \sum_{k = 0}^{c_{A} - 1} p (x_{A}^{+} = k | θ)$ and $p (x_{A}^{+} \geq c_{A} | θ) = \sum_{k = c_{A}}^{J_{A}} p (x_{A}^{+} = k | θ)$. The probabilities for individual sum scores are readily found by making use of the Lord-Wingersky recursions for computing the distribution of the sum score conditional on $θ$ (Lord & Wingersky, 1984). The equations can be illustrated with a basic example. Consider the case of three dichotomous items and a cut score of two. The probability that the first item is incorrect given that the total score is smaller than two is $(Q Q Q + Q P Q + Q Q P) / (Q Q Q + P Q Q + Q P Q + Q Q P)$, with P and Q indicating probabilities for correct and incorrect item responses and the position of each letter indicating the item. The probability that the first item is correct given $x^{+} < 2$ is then $P Q Q / (Q Q Q + P Q Q + Q P Q + Q Q P)$. In addition, the probability that the first item is incorrect given $x^{+} \geq 2$ is $(Q P P) / (P P Q + P Q P + Q P P + P P P)$ and the probability that the first item is correct given $x^{+} \geq 2$ is $(P P Q + P Q P + P P P) / (P P Q + P Q P + Q P P + P P P)$. Although the equations are straightforward, some care in their application is required since certain combinations of item and sum scores cannot occur (e.g., correct item response and sum score of zero).
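The conditional item probabilities in Eqs. 17 and 18 can be sketched with the Lord-Wingersky recursion as follows. The function names are hypothetical, the items are assumed to be dichotomous, and `probs` holds the probabilities of a correct response at a fixed value of $θ$. The joint probability in the numerator is obtained by running the recursion on the remaining items and keeping only those rest scores that place the total on the required side of the cut, which automatically excludes impossible combinations.

```python
def lord_wingersky(probs):
    """Sum score distribution given per-item probabilities of a correct response."""
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            new[s] += mass * (1.0 - p)   # item answered incorrectly
            new[s + 1] += mass * p       # item answered correctly
        dist = new
    return dist

def item_prob_given_routing(probs, j, x, cut, below):
    """p(x_j = x | theta, x+ < cut) if below is True, else p(x_j = x | theta, x+ >= cut)."""
    rest = lord_wingersky(probs[:j] + probs[j + 1:])  # sum score on the other items
    full = lord_wingersky(probs)
    p_item = probs[j] if x == 1 else 1.0 - probs[j]
    if below:
        joint = p_item * sum(m for s, m in enumerate(rest) if s + x < cut)
        denom = sum(full[:cut])
    else:
        joint = p_item * sum(m for s, m in enumerate(rest) if s + x >= cut)
        denom = sum(full[cut:])
    return joint / denom
```

For three items and a cut score of two, `item_prob_given_routing(probs, 0, 1, 2, True)` corresponds to the ratio $P Q Q / (Q Q Q + P Q Q + Q P Q + Q Q P)$ from the example.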

For a pair with item j in module A and item k in module B, it then follows that the expected frequency is

(19)

\begin{matrix} E_{x_{A_{j}}, x_{B_{k}}} = \sum_{i \in \mathcal{N}_{jk}} E_{x_{i A_{j}}, x_{i B_{k}}} = \sum_{i \in \mathcal{N}_{jk}} \int p (x_{A_{j}} | θ, x_{A}^{+} < c_{A}) p (x_{B_{k}} | θ) f (θ | x_{i}) d θ . \end{matrix}

For a pair with item j in module A and item k in module C, the expected frequency is

(20)

\begin{matrix} E_{x_{A_{j}}, x_{C_{k}}} = \sum_{i \in \mathcal{N}_{jk}} E_{x_{i A_{j}}, x_{i C_{k}}} = \sum_{i \in \mathcal{N}_{jk}} \int p (x_{A_{j}} | θ, x_{A}^{+} \geq c_{A}) p (x_{C_{k}} | θ) f (θ | x_{i}) d θ . \end{matrix}

If routing is based on a different criterion than the sum score (e.g., ability estimate, maximum information, minimum expected posterior variance), similar conditioning principles would apply for taking into account the MST design. However, computations become more involved compared to conditioning on the sum score, which enables the use of the straightforward Lord-Wingersky recursions. For example, using ability estimates may require summarizing over all response patterns that lead to an ability estimate below or above the cut.
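For sum-score routing, the adjusted expected frequency in Eq. 19 can be sketched as below. This is a hedged illustration rather than the code used in the study: it assumes a 2PL model, a discrete grid of ability values with prior weights standing in for $f (θ)$, and response records stored as dictionaries from item index to observed score. All names (`irf`, `expected_freq_AB`, and so on) are hypothetical.

```python
import math

def irf(theta, a, b):
    """2PL probability of a correct response with slope a and intercept b."""
    return 1.0 / (1.0 + math.exp(-(a * theta + b)))

def sum_score_dist(probs):
    """Lord-Wingersky recursion for the sum score distribution."""
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            new[s] += mass * (1.0 - p)
            new[s + 1] += mass * p
        dist = new
    return dist

def cond_item_prob_below(probs, j, x, cut):
    """p(x_j = x | theta, x+ < cut), as in Eq. 17."""
    rest = sum_score_dist(probs[:j] + probs[j + 1:])
    p_item = probs[j] if x == 1 else 1.0 - probs[j]
    joint = p_item * sum(m for s, m in enumerate(rest) if s + x < cut)
    return joint / sum(sum_score_dist(probs)[:cut])

def posterior(x, items, nodes, prior):
    """Normalized posterior weights f(theta | x_i) on the grid."""
    post = []
    for t, w in zip(nodes, prior):
        like = 1.0
        for idx, xj in x.items():
            p = irf(t, *items[idx])
            like *= p if xj == 1 else 1.0 - p
        post.append(w * like)
    total = sum(post)
    return [v / total for v in post]

def expected_freq_AB(xa, xb, j, k, responses, items, module_a, cut, nodes, prior):
    """Eq. 19: expected count of (x_Aj = xa, x_Bk = xb) over test takers routed to module B."""
    expected = 0.0
    for x in responses:  # each x maps item index -> observed score
        post = posterior(x, items, nodes, prior)
        for t, w in zip(nodes, post):
            probs_a = [irf(t, *items[m]) for m in module_a]
            pa = cond_item_prob_below(probs_a, module_a.index(j), xa, cut)
            pb = irf(t, *items[k])
            expected += w * pa * (pb if xb == 1 else 1.0 - pb)
    return expected
```

A useful sanity check is that the expected frequencies of the four score combinations for an item pair add up to the number of test takers who received both items. The version for module C would condition on the sum score being at least the cutoff instead.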

To ease the following presentation, we simplify notation by suppressing the particular item scores and modules, so that the residual for individual test taker i on item pair (j, k) can be written as

(21)

\begin{matrix} r_{ijk} & = O_{x_{ij}, x_{ik}} - E_{x_{ij}, x_{ik}}, \end{matrix}

where $O_{x_{ij}, x_{ik}}$ is the indicator for observing $x_{ij}$ and $x_{ik}$ (e.g., if both items are answered correctly, its value is 1). Then, the variance of the residual $r_{jk} = \sum_{i \in \mathcal{N}_{jk}} r_{ijk}$ is found to be

(22)

\begin{matrix} V (r_{jk}) & = \sum_{i \in \mathcal{N}_{jk}} {(r_{ijk} - d_{jk}^{'} \nabla ℓ_{i} (\hat{ξ}))}^{2}, \end{matrix}

where $\nabla ℓ_{i} (\hat{ξ})$ is the gradient contribution of test taker i and

(23)

\begin{matrix} d_{jk} & = J {(\hat{ξ})}^{- 1} \sum_{i \in \mathcal{N}_{jk}} r_{ijk} \nabla ℓ_{i} (\hat{ξ}), \end{matrix}

where $J (\hat{ξ})$ is the Fisher information matrix for model parameters (Louis, 1982). A general formulation of Eq. 22 is found in Equation 9 of Haberman & Sinharay (2013), but our version is more similar to Equation 46 of Haberman et al. (2013), although in a slightly different context. Except for centering the gradients, no further adjustments are needed for using the gradient and the Fisher information matrix, because these were obtained by correct application of the ignorability principle (Eggen & Verhelst, 2011). The generalized residual in Eq. 15 is then found by computing, using the simplified notation, $S_{jk} = \sqrt{V (r_{jk}) / N_{jk}}$, where $N_{jk}$ indicates the sample size for the item pair. Haberman & Sinharay (2013) showed that, under very general conditions, the generalized residual has an asymptotic standard normal distribution. In the context of large-scale assessments with balanced incomplete block designs, van Rijn et al. (2016) found that Type I error rates of these residuals for item pairs were close to the nominal level. Note that these residuals can be extended without too much difficulty to more than two items (e.g., item triples) and other IRT models.
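The variance computation in Eqs. 22 and 23 can be sketched as follows. The sketch assumes that the individual residuals, the (already centered) gradient contributions, and an estimate of the information matrix are available as plain Python lists; how these are obtained from the fitted model is outside its scope. The small linear solver and all function names are hypothetical helpers introduced only to keep the example self-contained.

```python
def solve(mat, vec):
    """Solve mat * sol = vec by Gaussian elimination with partial pivoting."""
    n = len(vec)
    aug = [list(row) + [v] for row, v in zip(mat, vec)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol

def residual_variance(resid, grads, info):
    """V(r_jk) from Eq. 22, with d_jk = info^{-1} * sum_i r_ijk * grad_i as in Eq. 23."""
    q = len(info)
    # Right-hand side of Eq. 23: sum over test takers of r_ijk times the gradient.
    rhs = [sum(r * g[p] for r, g in zip(resid, grads)) for p in range(q)]
    d = solve(info, rhs)
    # Eq. 22: squared residuals after removing the part explained by the gradients.
    return sum((r - sum(dp * gp for dp, gp in zip(d, g))) ** 2
               for r, g in zip(resid, grads))
```

The subtraction of $d_{jk}^{'} \nabla ℓ_{i} (\hat{ξ})$ accounts for the fact that the item parameters were estimated from the same data. If the residuals are unrelated to the gradients, the correction vanishes and the variance reduces to the sum of squared residuals.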

With polytomous items, there can be sparse data for item pairs, even for relatively large samples (Joe & Maydeu-Olivares, 2010). In this case, a reduced residual for item pair j and k can be obtained as a product moment following Cai & Hansen (2013) by defining:

(24)

\begin{matrix} {\tilde{r}}_{ijk} = \sum_{h_{j} = 1}^{m_{j}} \sum_{h_{k} = 1}^{m_{k}} h_{j} h_{k} (O_{x_{i h_{j}}, x_{i h_{k}}} - E_{x_{i h_{j}}, x_{i h_{k}}}) . \end{matrix}

Note that the reduced residual is equal to $r_{ijk}$ if $m_{j} = m_{k} = 1$. The reduced generalized residual is then found by entering the above reduced residual into Eqs. 22 and 23. The variance of the reduced residual ${\tilde{r}}_{jk} = \sum_{i \in \mathcal{N}_{jk}} {\tilde{r}}_{ijk}$ is then

(25)

\begin{matrix} V ({\tilde{r}}_{jk}) & = \sum_{i \in \mathcal{N}_{jk}} {({\tilde{r}}_{ijk} - {\tilde{d}}_{jk}^{'} \nabla ℓ_{i} (\hat{ξ}))}^{2}, \end{matrix}

where

(26)

\begin{matrix} {\tilde{d}}_{jk} & = J {(\hat{ξ})}^{- 1} \sum_{i \in \mathcal{N}_{jk}} {\tilde{r}}_{ijk} \nabla ℓ_{i} (\hat{ξ}) . \end{matrix}
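The reduced residual in Eq. 24 for a single test taker can be sketched as the difference between the observed and the expected product of the two item scores. In this illustration, `expected` is assumed to be a table of model-implied probabilities for every score pair, including category 0; the function name and data layout are hypothetical.

```python
def reduced_residual(x_j, x_k, expected):
    """Eq. 24 for one test taker: observed minus expected product moment.

    expected[h_j][h_k] holds the expected value for score pair (h_j, h_k);
    cells with h_j = 0 or h_k = 0 receive weight zero and drop out of the sum.
    """
    # The observed indicator equals 1 only in the cell (x_j, x_k), so the
    # weighted sum of indicators is simply the product of the two scores.
    observed_moment = x_j * x_k
    expected_moment = sum(hj * hk * e
                          for hj, row in enumerate(expected)
                          for hk, e in enumerate(row))
    return observed_moment - expected_moment
```

For two dichotomous items the sum has a single term, so the function returns the ordinary residual for the cell in which both items are answered correctly.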

Before illustrating the adjusted residuals, we briefly mention two other popular methods for evaluating conditional dependence that are affected when dealing with MST data. The first method is the check on the correlation between residual item scores for item pairs (j, k), commonly referred to as the $Q_{3}$ statistic (Yen, 1984). Although this method is known to have problems due to lack of a known null distribution (Chen & Thissen, 1997), it has been used successfully in combination with a parametric bootstrap (Christensen et al., 2017). To compute the statistic, for each item, we need the following residual item score

(27)

\begin{matrix} r_{ij} = x_{ij} - P (x_{ij}), \end{matrix}

with

(28)

\begin{matrix} P (x_{ij}) = \int P (x_{j} | θ) f (θ | x_{i}) d θ, \end{matrix}

as the expected item score. Then, the $Q_{3}$ statistic is simply the correlation between the vectors $r_{j}$ and $r_{k}$. This version is slightly different from the original statistic because the posterior $f (θ | x_{i})$ is used instead of a point estimate of $θ$ (e.g., maximum-likelihood estimate (MLE) or expected a posteriori (EAP)). For MST data, subvectors of the residual vector $r_{j}$ or $r_{k}$ may need to be computed in different ways depending on the associated routing rules and the particular item pair j and k.
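The unadjusted version of the statistic in Eqs. 27 and 28 can be sketched as follows. The sketch assumes a 2PL model and posterior weights on a discrete ability grid, and it restricts the correlation to test takers who received both items (entries for items not administered are coded as `None`). It does not implement the routing-specific computations mentioned above, and all names are hypothetical.

```python
import math

def expected_item_score(post, nodes, a, b):
    """Eq. 28: posterior-weighted probability of a correct response under the 2PL."""
    return sum(w / (1.0 + math.exp(-(a * t + b))) for t, w in zip(nodes, post))

def q3(res_j, res_k):
    """Pearson correlation of two residual vectors, using test takers with both residuals."""
    pairs = [(rj, rk) for rj, rk in zip(res_j, res_k)
             if rj is not None and rk is not None]
    n = len(pairs)
    mean_j = sum(p[0] for p in pairs) / n
    mean_k = sum(p[1] for p in pairs) / n
    cov = sum((p[0] - mean_j) * (p[1] - mean_k) for p in pairs)
    var_j = sum((p[0] - mean_j) ** 2 for p in pairs)
    var_k = sum((p[1] - mean_k) ** 2 for p in pairs)
    return cov / math.sqrt(var_j * var_k)
```

Each residual in `res_j` would be formed as the observed item score minus the value returned by `expected_item_score` for that test taker, following Eq. 27.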

The second method we mention is the limited information statistic $M_{2}$ developed by Maydeu-Olivares & Joe (2005), which is, for example, produced by commonly used IRT software such as flexMIRT (Houts & Cai, 2016) and the R package mirt (Chalmers, 2012). The statistic assesses overall model fit and is given by

(29)

\begin{matrix} M_{2} = N {(p_{2} - π_{2} (\hat{ξ}))}^{'} {\hat{C}}_{2} (p_{2} - π_{2} (\hat{ξ})), \end{matrix}

where $p_{2}$ is the vector of observed proportions of first- and second-order marginals, $π_{2} (\hat{ξ})$ is the vector of expected proportions of first- and second-order marginals, and ${\hat{C}}_{2}$ is the associated covariance matrix (Maydeu-Olivares & Joe, 2005, Eqs. 12 and 13). Under the model, $M_{2}$ converges in distribution to a $χ^{2}$-distribution with $s - q$ degrees of freedom, where s is the length of $p_{2}$ and q is the number of model parameters. For MST data, the second-order marginals would need to be adjusted in a similar fashion as for the generalized residuals. One complication that may arise is that, even though $M_{2}$ focuses on first- and second-order marginals only, observed and expected proportions of full response patterns are required.

2. Illustrations

We now illustrate the adjusted residuals for item pairs to evaluate CI through a small simulation study and an application to PISA data.

2.1. Simulation

In the simulation study, we varied the sample size, the number of items, and the design. Sample sizes were 600 and 6000. The number of items was 9 (3 in each stage) or 45 (15 in each stage). Four different designs were used: (1) a complete design in which all test takers receive all items, (2) a random design in which test takers take 2/3 of all items (i.e., either 6 out of 9 items or 30 out of 45 items), (3) a basic two-stage design with two difficulty levels in the second stage (see Fig. 1), and (4) a balanced two-stage design with two levels in the second stage (see Fig. 2). The test length in the latter two MST designs is the same as in the random design. Only dichotomously scored items that followed a unidimensional 2PL model were used with $\beta_j \sim \text{N}(0.13, 1.29)$ and $\log \alpha_j \sim \text{N}(0.05, 0.20)$. These values are based on the estimated item parameters from the PISA 2018 reading MST data discussed in the next section. Ability parameters $\theta$ were drawn from a standard normal distribution. We used 200 replications in each of the four conditions (number of items × sample size). For each replication, a single complete data set was simulated to which the three designs (random, basic MST, and balanced MST) were applied to create incomplete data.
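The data-generating step can be sketched as follows. This is a minimal Python illustration, not the code used in the study. It assumes that the second argument of each normal distribution is a variance and that the 2PL is written in slope-intercept form, $\text{logit}\, P = \alpha_j \theta + \beta_j$; neither is stated explicitly in this section. The function names are hypothetical.

```python
import math
import random

def draw_items(n_items, rng):
    """Draw 2PL parameters (alpha, beta) for n_items items."""
    items = []
    for _ in range(n_items):
        beta = rng.gauss(0.13, math.sqrt(1.29))             # assumes 1.29 is a variance
        alpha = math.exp(rng.gauss(0.05, math.sqrt(0.20)))  # assumes 0.20 is a variance
        items.append((alpha, beta))
    return items

def p_correct(theta, alpha, beta):
    """2PL response probability in slope-intercept form (an assumption)."""
    return 1.0 / (1.0 + math.exp(-(alpha * theta + beta)))

def simulate_complete(n_persons, items, rng):
    """Simulate a complete persons-by-items matrix of 0/1 responses."""
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)  # standard normal ability
        data.append([int(rng.random() < p_correct(theta, a, b)) for a, b in items])
    return data

def apply_random_design(data, n_keep, rng):
    """Keep n_keep randomly chosen items per person; mark the rest as None."""
    n_items = len(data[0])
    out = []
    for row in data:
        keep = set(rng.sample(range(n_items), n_keep))
        out.append([x if j in keep else None for j, x in enumerate(row)])
    return out

rng = random.Random(1)
items = draw_items(9, rng)
complete = simulate_complete(600, items, rng)
incomplete = apply_random_design(complete, 6, rng)  # random design: 6 of 9 items
```

Applying each incomplete design to the same complete data set, as done here for the random design, keeps the responses identical across designs so that only the pattern of missing data differs.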

The MST modules were assembled as follows. After the item parameters were randomly drawn, the items were sorted by difficulty and classified into three difficulty groups (low, medium, and high) with 1/3 of the items in each group (3 or 15). For the routing module in the first stage, an equal number of items was used from each difficulty group (so 1 or 5 easy, 1 or 5 medium, and 1 or 5 hard, for 3 or 15 items, respectively). For the low- and high-difficulty modules in the second stage, the majority of the items came from the associated difficulty group (2 easy and 1 medium for the low-difficulty module, and 2 hard and 1 medium for the high-difficulty module, for 3 items; 10 easy and 5 medium for the low-difficulty module, and 10 hard and 5 medium for the high-difficulty module, for 15 items). In the basic MST design, items were fixed in the three modules, whereas in the balanced design, items were rotated across the three modules such that each item occurred in each stage. That is, in the balanced MST, three groups of test takers were created (I, II, and III), each with either 200 or 2000 test takers depending on the total sample size used. For the condition with 15 items in each module, in group I, module A consists of 5 easy, 5 medium, and 5 hard items. In group II, module A consists of 5 different easy items, 5 different medium items, and 5 different hard items. In group III, the remaining items from each difficulty group are used. Similar reasoning is applied to create the low-difficulty module B and the high-difficulty module C for groups II and III. The routing decision was based on the sum score on the first module with a cutoff of either 2 out of 3 correct or 8 out of 15 correct.
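The assembly of the basic MST design described above can be sketched in Python as follows. This is an illustration under stated assumptions: difficulty is taken as $-\beta_j/\alpha_j$ (which presumes a slope-intercept 2PL), the first items of each sorted difficulty group are placed in the routing module (the text does not say which items were chosen), and a score equal to the cutoff routes to the high-difficulty module. The function names and the example parameters are hypothetical.

```python
def assemble_basic_mst(items):
    """Split items (alpha, beta) into routing (A), low (B), and high (C) modules.

    Returns a dict mapping module label to a list of item indices.
    """
    n = len(items)
    if n % 9 != 0:
        raise ValueError("number of items must be a multiple of 9")
    # Sort item indices from easiest to hardest (assumed difficulty: -beta/alpha).
    order = sorted(range(n), key=lambda j: -items[j][1] / items[j][0])
    third = n // 3
    easy, medium, hard = order[:third], order[third:2 * third], order[2 * third:]
    k = third // 3  # items taken from each difficulty group for the routing module
    return {
        "A": easy[:k] + medium[:k] + hard[:k],  # routing: equal mix
        "B": easy[k:] + medium[k:2 * k],        # low: mostly easy items
        "C": hard[k:] + medium[2 * k:],         # high: mostly hard items
    }

def route(responses_a, cutoff):
    """Route on the sum score of the routing module (assumed: >= cutoff goes high)."""
    return "C" if sum(responses_a) >= cutoff else "B"

# Hypothetical (alpha, beta) values for a 9-item pool.
example_items = [(1.0, 1.5), (1.2, 1.0), (0.8, 0.6), (1.1, 0.2), (0.9, 0.0),
                 (1.3, -0.3), (1.0, -0.8), (0.7, -1.0), (1.4, -2.0)]
modules = assemble_basic_mst(example_items)
```

For 9 items this gives three modules of 3 items each, and for 45 items three modules of 15 items each, matching the counts in the text.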

We evaluate the simulated data with the different parameter settings on two types of outcomes. The first type consists of outcomes related to the estimation of item and ability parameters. This provides a check on the ignorability principle for the MST designs. We make use of the averaged bias and averaged root mean squared error (RMSE) for both types of parameters, the IRT reliability of $\theta$, and the log determinant of the information matrix for item parameters (see Footnote 2). This latter outcome relates to the D-optimality criterion often used in optimal design research (Berger, 1992). The second type of outcomes concerns CI. We compare the performance of the unadjusted and adjusted generalized residuals for item pair frequencies under the model (Type I error). Since the data and the model are the same across all conditions (only the missing data are different), it is to be expected that the distributions of these residuals are asymptotically similar. We made use of the open-source MIRT software developed by Haberman (2013) to estimate the model parameters (see Footnote 3). The software uses the marginal maximum likelihood method for model parameter estimation. We developed additional R code (R Core Team, 2019), which can be found in Appendix A, to compute the required adjustments for the residuals using Equations 19 to 23.

To illustrate that the computations are nontrivial, consider the cells in the simulation design with 15 items in each stage. With a cutoff score of 8 for the first module, there are 16,384 possible response patterns that lead to a sum score lower than 8 and 16,384 patterns that lead to a sum score 8 or higher. As noted, the Lord-Wingersky algorithm can be put to use in case routing is based on the sum score, which is relatively straightforward. However, if routing is based on a different criterion (e.g., an ability estimate), it may be necessary to repeatedly go over all possible response patterns in order to compute the adjusted residuals.
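The pattern counts and the sum-score recursion can be illustrated with a short Python sketch. The recursion below is the standard Lord-Wingersky algorithm for dichotomous items at a fixed value of $\theta$; it is not the authors' R code, and the function name is hypothetical.

```python
from math import comb

def lord_wingersky(probs):
    """Sum-score distribution for independent dichotomous items.

    probs[j] is the probability of a correct response to item j at a fixed theta.
    Returns a list dist with dist[s] = P(sum score = s).
    """
    dist = [1.0]  # before any item: score 0 with probability 1
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            new[s] += mass * (1.0 - p)  # incorrect: score stays at s
            new[s + 1] += mass * p      # correct: score moves to s + 1
        dist = new
    return dist

# Number of response patterns on 15 items below and at or above the cutoff of 8.
n_low = sum(comb(15, s) for s in range(8))
n_high = sum(comb(15, s) for s in range(8, 16))
```

The recursion needs on the order of $J^2$ operations for $J$ items, whereas enumerating all $2^{15} = 32{,}768$ patterns grows exponentially in $J$, which is why a routing rule that is not a function of the sum score is so much more costly.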

The results related to item parameter recovery are presented in Appendix B. Table 1 shows the means and standard deviations of Type-I-error rates for nominal $\alpha$ levels of .01, .05, and .10 over 200 replications. It can be seen that the results for the complete and random designs are very close to the nominal levels for $N = 6000$ and significance levels of .05 and .10. The results for the basic MST and balanced MST without adjustments are off by a large amount. However, when the adjustments are applied to take into account the MST design, the results are very similar to the complete and random design. It should be noted that the different designs have different sample spaces when it comes to item pairs.
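As a sketch of how such rates can be tabulated, the Python fragment below computes the share of generalized residuals that exceed the two-sided standard normal critical value in each replication and then summarizes across replications. It assumes that the rate within a replication is taken over all item pairs, which this section does not state explicitly; the function names and the toy residuals are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def rejection_rate(residuals, alpha):
    """Share of residuals exceeding the two-sided normal critical value."""
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return sum(abs(z) > crit for z in residuals) / len(residuals)

def summarize(replications, alpha):
    """Mean and SD across replications of the per-replication rejection rate."""
    rates = [rejection_rate(r, alpha) for r in replications]
    return mean(rates), stdev(rates)

# Two toy replications with four residuals each (illustrative values only).
example = [[0.0, 2.0, -3.0, 1.0], [0.5, -0.5, 2.5, 0.1]]
mean_rate, sd_rate = summarize(example, 0.05)
```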

Table 1 Mean (SD) type I error of generalized residuals for item pair frequencies.

Figure 3 shows plots of the smoothed densities of the residuals for each of the four conditions. The densities for the complete and random designs are close to each other and close to normal. The same holds for the densities under the basic and balanced MST designs with adjustments. QQ plots can be found in Appendix B and show a similar result. Appendix B also contains some results on the $Q_3$ and $M_2$ statistics with MST data.

Figure 3 Density plots of residuals (top left: $N = 600$, $J = 9$; top right: $N = 6000$, $J = 9$; bottom left: $N = 600$, $J = 45$; bottom right: $N = 6000$, $J = 45$)

2.2. Real Data Application

In this section, we illustrate the complexity of evaluating CI assumptions with MST data by an application to a large-scale educational assessment, namely, the PISA 2018 Reading MST. In the context of PISA, issues with the CI assumption have been raised before. In PISA’s terminology, items are grouped into units that share stimuli or content. Monseur et al. (2011) found conditional dependencies within units, but also between units. More importantly, this impacted the estimated country-level variance of student proficiencies.

To illustrate the adjusted residuals with real data, we make use of a subset of data from the PISA 2018 Reading MST. For more details on the design, see Yamamoto et al. (2019). The PISA 2018 Reading MST consisted of three stages and was administered to $N = 562{,}051$ students. The routing is based on (1) random assignment to one of the modules in the first stage, (2) performance in the first and second stage measured by the sum score on automatically scorable items, and (3) a probability layer matrix.

In PISA, the units are combined into testlets that form the modules of the MST. Students take two units in the first stage, three units in the second stage, and two units in the third stage. After the first stage, three performance groups are created based on the total score for items that can be automatically scored: low, medium, and high. The low-performing group is routed with .9 probability to a low-difficulty module in the second stage and with .1 probability to a high-difficulty module (see Footnote 4). The high-performing group is routed in the opposite way as the low-performing group. The medium-performing group is routed with .5 probability to a low-difficulty module in the second stage and with .5 probability to a high-difficulty module. After the second stage, routing to the third stage is performed in a similar fashion. To reduce the impact of item position effects, stages two and three were reversed for 25% of students, which is referred to as design B (while the regular design is referred to as A). The total number of paths in the MST was 192. Note that an MST path can be taken as a test form, with the difference being that an MST path is dependent on the item responses during the test administration whereas a test form can be assigned before the test is administered.
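The probabilistic routing after the first stage can be sketched in Python as follows. The routing probabilities are those given in the text; the cutoffs of 3 and 5 are the ones reported further on for one particular first-stage module and serve here only as an example. The function and variable names are hypothetical, and the sketch ignores the random assignment in the first stage and the reversal of stages in design B.

```python
import random

# Probability of being routed to a low-difficulty module, by performance group.
ROUTE_TO_LOW = {"low": 0.9, "medium": 0.5, "high": 0.1}

def performance_group(score, lower, upper):
    """Classify a sum score: below lower is low, above upper is high."""
    if score < lower:
        return "low"
    if score <= upper:
        return "medium"
    return "high"

def route(score, lower, upper, rng):
    """Draw the difficulty level of the next module for a given sum score."""
    group = performance_group(score, lower, upper)
    return "low" if rng.random() < ROUTE_TO_LOW[group] else "high"

rng = random.Random(2018)
# Share of low-performing students (score 2, cutoffs 3 and 5) routed to a low module.
draws = [route(2, 3, 5, rng) for _ in range(10000)]
share_low = draws.count("low") / len(draws)
```

Because the routing depends on both the sum score and a random draw, every combination of performance group and second-stage difficulty level can occur, which is what gives rise to the six groups distinguished in the analysis of a single module.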

To illustrate conditional dependencies resulting from the unit structure, we computed the generalized residuals for item pairs within each of the five units that were used to create the eight testlets in the first stage. Since students were randomly assigned to take one of these eight testlets, adjustments as described in the previous sections are not necessary. Figure 4 shows the heatmap of the estimated generalized residuals for all 22 items in the first stage. Two pairs of units did not have observations by design (R220–R424 and R559–R560). Within-unit conditional dependencies are seen about twice as often as between-unit conditional dependencies: 88% of within-unit residuals are significant at the .01 level vs. 43% of between-unit residuals.

Figure 4 Heatmap of generalized residuals for item pairs in first stage of PISA 2018 reading MST.

Since substantial conditional dependencies due to the unit structure were found for the items in the first stage and to simplify our analysis, we treated the 45 units (245 items) as 45 polytomously scored items. A multi-group generalized partial credit model (GPCM) was fitted to these 45 items ($N = 562{,}051$) using the MIRT software with 102 country-by-language groups. Although it would be of interest to investigate within-unit conditional dependence, the size of such an analysis is beyond the scope of our illustration. To simplify things further, we refrain from using sampling weights and the complex sampling design.

We further analyzed one path that can follow from one of the eight modules in the first stage into the second stage, ignoring the third stage. This module consists of two units (R560 and R424) with in total seven automatically scorable items. Students taking this module can be routed to four modules in the second stage, two low-difficulty and two high-difficulty modules. However, we focus our analysis on one of these four paths and compute the residuals for one pair of units with one unit in the first stage (R560) and one unit in one of the low-difficulty modules in the second stage (R542). How the routing works is illustrated in Fig. 5.

Figure 5 Highlight of MST design for PISA 2018 reading assessment.

After the first stage, six groups of students can be distinguished. The first group is the low-performance group (i.e., $x_A^+ < 3$) that is routed to the low-difficulty module, the second group is the low-performance group that is routed to the high-difficulty module, the third group is the medium-performance group (i.e., $3 \le x_A^+ \le 5$) that is routed to the low-difficulty module, up to the sixth group which is the high-performance group (i.e., $x_A^+ > 5$) that is routed to the high-difficulty module. The adjustments are different for each of the six groups, so that even for this simple subset, the analysis requires careful bookkeeping. The sum score cutoffs for the first module are 3 and 5 with a maximum score of 7. We only focus on the low-difficulty modules in stage two, so the expected frequencies for the three associated groups are given by

$$E(x_{A_j}, x_{B_k}) = \sum_{i \in N_{jkLL}} \int p(x_{A_j} \mid \theta, x_{A+} < 3)\, p(x_{B_k} \mid \theta)\, f(\theta \mid x_i)\, d\theta, \tag{30}$$

$$E(x_{A_j}, x_{B_k}) = \sum_{i \in N_{jkML}} \int p(x_{A_j} \mid \theta, 3 \le x_{A+} \le 5)\, p(x_{B_k} \mid \theta)\, f(\theta \mid x_i)\, d\theta, \tag{31}$$

$$E(x_{A_j}, x_{B_k}) = \sum_{i \in N_{jkHL}} \int p(x_{A_j} \mid \theta, x_{A+} > 5)\, p(x_{B_k} \mid \theta)\, f(\theta \mid x_i)\, d\theta. \tag{32}$$

The other expected frequencies can be obtained in a similar fashion.
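The key ingredient of these expressions is the probability of a first-stage score conditional on both $\theta$ and the routing event. As a simplified sketch for dichotomous items (as in the simulation, not the polytomous unit scores of the PISA example), the Python fragment below computes $P(x_j = 1 \mid \theta, x^+ < c)$ at a fixed $\theta$ as $p_j \, P(x^+_{-j} < c - 1 \mid \theta) / P(x^+ < c \mid \theta)$, where $x^+_{-j}$ is the sum score on the remaining routing items. The integration over the posterior $f(\theta \mid x_i)$ and the sum over test takers are left out, and the function names are hypothetical.

```python
def sum_score_dist(probs):
    """Lord-Wingersky recursion: distribution of the sum score at a fixed theta."""
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            new[s] += mass * (1.0 - p)
            new[s + 1] += mass * p
        dist = new
    return dist

def p_item_given_low_route(probs, j, cutoff):
    """P(x_j = 1 | theta, sum score < cutoff) for independent dichotomous items.

    probs holds the response probabilities of all routing items at a fixed theta.
    """
    rest = probs[:j] + probs[j + 1:]
    dist_all = sum_score_dist(probs)
    dist_rest = sum_score_dist(rest)
    denom = sum(dist_all[:cutoff])                  # P(sum score < cutoff)
    numer = probs[j] * sum(dist_rest[:cutoff - 1])  # x_j = 1 and rest score < cutoff - 1
    return numer / denom

# Three routing items at some fixed theta, routing cutoff of 2.
probs = [0.3, 0.6, 0.8]
conditional = p_item_given_low_route(probs, 0, 2)
```

In this example the conditional probability is considerably smaller than the unconditional value of 0.3, which shows why expected frequencies that ignore the routing event are off for test takers routed to the low-difficulty module.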

Table 2 displays the observed, expected, and residual frequencies, standard deviations, and generalized residuals with and without adjustments for the MST design for all patterns on this pair of units for the given subset of the PISA 2018 reading MST data. It is clearly seen that the adjusted expected frequencies are closer to the observed frequencies than the unadjusted frequencies. For example, for the unadjusted residuals, 22 out of 30 residual frequencies are significant at the .05 level (i.e., Z values are larger than 1.96 in absolute value), whereas for the adjusted residuals, 12 out of 30 are significant. Furthermore, Pearson’s $X^2$ (Chen & Thissen, 1997) with 20 degrees of freedom is 1043.9 without adjustments and 117.9 with adjustments. To visualize the differences and patterns, we created heatmaps for the unadjusted and adjusted residuals for pairs of scores with an observed frequency of at least 10, which are shown in Fig. 6. It is clear that the adjusted residuals are less extreme than the unadjusted residuals. However, even after adjusting, there seems to be a pattern of underestimating frequencies of low scores on the first unit R560 and high scores on the second unit R542. Although there seem to be some clear deviations from CI, it is not as bad as one would think based on the unadjusted residuals. Furthermore, Cramér’s V coefficient, which can be taken as an effect size measure, is .05, indicating a small effect. After reviewing the units, we could not find substantive reasons for dependence (see Footnote 5).
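The reported effect size can be checked with a short Python sketch using the usual definition $V = \sqrt{X^2 / (N \min(r - 1, c - 1))}$. The table dimensions of 5 by 6 are inferred here from the 30 score patterns and 20 degrees of freedom rather than stated in the text, and it is assumed that the reported value refers to the adjusted $X^2$ and to the sample size given for Table 2. The function name is hypothetical.

```python
import math

def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramer's V for an n_rows-by-n_cols contingency table with n observations."""
    return math.sqrt(chi_square / (n * min(n_rows - 1, n_cols - 1)))

# X^2 values and sample size as reported; 5 x 6 table is an inference.
v_adjusted = cramers_v(117.9, 14226, 5, 6)
v_unadjusted = cramers_v(1043.9, 14226, 5, 6)
```

Under these assumptions the adjusted $X^2$ gives a value that rounds to .05, in line with the reported coefficient, while the unadjusted $X^2$ would give a value of roughly .14.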

Table 2 Observed, expected, and residual frequencies, standard deviations and generalized residuals with and without MST adjustments for all patterns on one pair of units for subset of PISA 2018 reading MST data ($N = 14{,}226$).

Even with the relatively large number of observations for this pair of units ($N = 14{,}226$), the two-way contingency table has some sparse elements (e.g., the observed frequencies of $x_j = 3$ or 4 and $x_k = 0$ are both 1). We can use the reduced residual of Eq. 24 and the associated generalized residual. For the PISA unit pair, the reduced generalized residual is $-1.40$ without adjustments for the MST design and 1.96 with adjustments for the MST design, which is not entirely in line with the full results. Although this result warrants further study, we leave that to a future investigation.

In closing this section, we want to emphasize that evaluating CI for a large-scale assessment such as PISA is quite the undertaking. With 192 paths in the MST and each student taking 7 units, there are 4032 pairs of units to be inspected and each pair would need a different set of adjustments to obtain the proper residual. If item pairs were analyzed instead of unit pairs, this number would be substantially larger. Furthermore, there is the potential of item-by-country interactions that can affect the scaling of the PISA data (von Davier et al., 2019) as well as the evaluation of CI.
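The number of unit pairs follows from counting the pairs among the 7 units within each path, assuming that pairs are counted separately for each of the 192 paths:

$$192 \times \binom{7}{2} = 192 \times 21 = 4032.$$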

3. Discussion

In this paper, we discussed the key assumption of CI in IRT models in the context of MST data and argued that it is related to quasi-independence in log-linear models for incomplete contingency tables. We demonstrated that generalized residuals for item pair frequencies are inappropriate for MST data without adjustments. The performance of the MST-adjusted residuals was demonstrated through simulations. With respect to parameter recovery in case of MST data, it was found that the balanced MST design seems to have the best of both worlds in the sense that it produced low RMSEs for both item and ability parameters. In the conditions of our simulation study, the adjusted residuals for item pair frequencies seemed to have satisfactory Type-I-error rates for both the basic and balanced MST design when the sample size is sufficiently large.

Figure 6 Heatmap of unadjusted (left) and adjusted (right) generalized residuals for PISA example.

The adjusted residuals were also applied to a subset of real MST data from PISA, which demonstrated the complexities of evaluating the assumption of CI in a large-scale MST application and indicated a small violation. In the PISA 2018 Reading MST, there were many different paths and our illustration demonstrates what needs to be done to compute the adjusted residuals. In this illustration, CI was assessed for a single path, but the sample size per path may be small and this affects the power to detect violations. If possible, it may help to aggregate results if the same item pair appears in multiple paths. However, there is a difference in the type of CI violation that may warrant assessing CI separately for each path. For example, with a balanced MST and each item appearing in each stage, it may be that CI for the item pair A,B is violated if item A is in the routing stage and item B is in the adaptive stage, but not if item B is in the routing stage and item A is in the adaptive stage. This could indicate a sequential dependence.

As noted, violations of CI can lead to inflated measurement precision, confounds in scores and subscores, and issues with the adaptive algorithm. If many large residuals are found, a number of strategies can be applied to minimize negative impact on the measurement (Yen, 1993). Before applying any strategy, it is important to track down what is causing the conditional dependence. If an explanation can be found for a particular item pair, one can, for example, resort to simply combining scores of item pairs. Alternatively, if the pattern of conditional dependencies is more elaborate, a more complex measurement model may be required and some examples were discussed in the Model section. More importantly, when a more complex measurement model is required, such as a testlet or bifactor MIRT model, but a more basic measurement model, such as a unidimensional IRT model, was used to develop the MST, the adaptivity of the MST may not be as effective and, in extreme cases, may lead to biased outcomes (Jewsbury & van Rijn, 2020).

More research can be conducted to address the impact of MST data on the adjusted residuals for item pair frequencies under more conditions. Different models may be used as well as model violations (dimensionality, conditional dependence). In addition, although we present some results in Appendix B, more work is needed to evaluate the impact of MST data on statistical methods and what adjustments are needed to provide correct results (Ali et al., in press). Such statistical methods include classical item statistics (e.g., percent correct), other statistics for CI (e.g., $Q_3$), and model fit statistics (e.g., $M_2$). However, the current paper provides some pointers in the right direction.

We end the paper by noting that, in practice, MST designs and routing decisions can be quite complex, as shown in the PISA application. This can make the computation of the adjusted residuals for the complete test quite involved and specific to the situation at hand. That is, it is not straightforward to create software or code that is sufficiently generic to deal with the large variety of MST designs seen in practice.

Appendix A: R Code

In this appendix, R code to compute the adjusted residuals is presented. First, we need the item-response functions for the 2PL:

Next, the Lord-Wingersky algorithm is needed (code works for dichotomous items only):

Then, we can create a function for the adjusted residuals for the basic MST design, which is shown below. To compute the adjusted residuals, several pieces of output from the MIRT program (Haberman, 2013) are needed. These are the estimated item parameters, the individual gradients, the individual posterior distributions of $\theta$, and the estimated asymptotic covariance matrix of item parameters. Since the MIRT program can write each piece of output to a separate csv-file, it is straightforward to obtain them and read them into R (see the manual on https://github.com/EducationalTestingService/MIRT).

Table 3 Mean (SD) simulation results with respect to parameter recovery (200 replications).

N = sample size, J = number of items, L = test length.

$\log D$ = log determinant of item parameter information matrix.

Figure 7 Distributions of $Q_3$ statistic across all item pairs under different designs ($N = 6000$, $J = 45$).

Figure 8 Heatmaps of average $Q_3$ statistic under complete and basic MST designs ($N = 6000$, $J = 45$).

Appendix B: Additional Simulation Results

Table 3 shows the mean results with respect to recovery of item parameters over 200 replications. The table displays the mean and standard deviation (SD) of average bias and RMSE of intercept and slope parameters for each simulation condition. The means were computed by, for each replication, averaging the bias or RMSE across the parameters, and then averaging across replications. The SDs were computed by taking the standard deviation across replications of these means for each replication. In addition, the mean and SD of bias, RMSE, and reliability are shown for EAP ability estimates. In the last column, the logarithm of the determinant of the item parameter information matrix is shown.

Apart from the results for the complete design, which serve as a reference, the results with respect to item parameter recovery are best for the random design. For $N = 600$ and $J = 9$, there are substantial differences in bias in slope between the designs with the basic MST producing the largest bias. However, the differences in bias between the designs for the other conditions are relatively small. This could be due to the dependencies being relatively larger for smaller J in the basic MST design. For the RMSE, the random design produces the smallest values, although the results for the balanced MST design are generally close. The observation that the random design and the balanced MST design produce similar item parameter recovery is supported by the similar values for the log determinant of the item parameter information matrix. For the basic MST, the results for the intercept seem somewhat surprising, but the intercept parameter should not be confused with the difficulty parameter. That is, the adaptivity targets the difficulty, not the intercept.

With respect to ability estimation, differences in bias between the designs are negligible. For the RMSE and reliability, apart from the complete design, the basic and balanced MST designs produce the best values. In these simulations, the balanced MST design seems to have the best of both worlds in terms of item and ability parameter recovery.

Figure 7 shows the distribution of the $Q_3$ statistic across all item pairs under the different designs with $N = 6000$ and $J = 45$, using either the posterior or the WLE (Warm, 1989). For the posterior-based $Q_3$, there appears to be a negative bias, which is claimed to be approximately $-1/(J-1)$ (Yen, 1993) but appears to be slightly closer to zero here. For the WLE-based $Q_3$, this bias does not seem to appear for complete and random designs. The basic MST design does seem to result in a small negative bias.
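As a rough illustration of the quantity being summarized, the sketch below computes $Q_3$ for one item pair as the correlation between item residuals, using only persons who were administered both items (as is needed under an incomplete design). It is not the code used for the simulations: the 2PL slope-intercept parameterization, the function names `irf`, `pearson`, `q3`, and `yen_bias`, and the toy data are all assumptions made for this example, and the ability values passed in could be either posterior means or WLEs.

```python
import math


def irf(theta, slope, intercept):
    """2PL response probability P(X = 1 | theta); assumed parameterization."""
    return 1.0 / (1.0 + math.exp(-(slope * theta + intercept)))


def pearson(x, y):
    """Pearson correlation; NaN if fewer than 2 pairs or a constant variable."""
    n = len(x)
    if n < 2:
        return float("nan")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def q3(responses, thetas, items, j, k):
    """Q3 for items j and k: correlation of residuals x - P(theta_hat).

    responses[i][j] is 0, 1, or None when item j was not administered
    to person i; such persons are skipped for any pair involving item j.
    """
    dj, dk = [], []
    for x, theta in zip(responses, thetas):
        if x[j] is None or x[k] is None:
            continue
        dj.append(x[j] - irf(theta, *items[j]))
        dk.append(x[k] - irf(theta, *items[k]))
    return pearson(dj, dk)


def yen_bias(n_items):
    """Approximate expected Q3 under local independence, -1 / (J - 1)."""
    return -1.0 / (n_items - 1)


# Toy data (hypothetical): 5 persons, 3 items, all abilities fixed at 0.
responses = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, None, 0],  # item 1 not administered to this person
]
thetas = [0.0] * 5
items = [(1.0, 0.0)] * 3  # (slope, intercept) per item

q3_01 = q3(responses, thetas, items, 0, 1)  # uses the first 4 persons only
q3_02 = q3(responses, thetas, items, 0, 2)  # uses all 5 persons
```

For $J = 45$, `yen_bias(45)` gives $-1/44 \approx -0.023$, the reference value against which the posterior-based distributions in Fig. 7 are compared.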

Although the above $Q_3$ distributions are not that strange, there are interactions between how the $Q_3$ is computed and which design is used. These can be revealed using heatmaps for the average $Q_3$ across replications for item pairs, which are shown in Fig. 8. The top row shows the heatmap when sorted by item difficulty and the bottom row shows the heatmap when sorted by module in the basic MST. Clearly, the $Q_3$ computation and the design interact, although the range of these average $Q_3$ statistics is actually not that large (see the legend).

Table 4 shows summary statistics of the distribution of the $M_2$ statistic in the conditions with $N = 6000$ and $J = 45$ as computed by the R package mirt (Chalmers, 2012). Since the mirt package can only compute the statistic for complete data, we only show results based on items in the routing and easy modules for the complete design and the basic MST design (so that $J = 30$).

Table 4 Summary statistics of $M_2$ statistic for items in routing and easy modules in complete and basic MST design ($N = 6000$ and $J = 30$).

Figures 9 and 10 show QQ plots of the generalized residuals for item pair frequencies under the different designs for the two conditions with $N = 600$ and $N = 6000$ with $J = 9$. The QQ plots for the other two conditions with $J = 45$ are not shown as the images are very large.
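The following sketch shows one way such QQ points could be computed for a set of residuals that are expected to be approximately standard normal when the model holds, together with an empirical two-sided rejection rate. The function names `qq_points` and `rejection_rate`, the plotting positions $(i - 0.5)/n$, and the default level of 0.05 are assumptions chosen for this example and are not taken from the simulation code.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Return (theoretical N(0, 1) quantile, ordered residual) pairs.

    Plotting positions (i - 0.5) / n are an assumed convention.
    """
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    return [(std_normal.inv_cdf(p), r) for p, r in zip(probs, ordered)]


def rejection_rate(residuals, alpha=0.05):
    """Share of residuals beyond the two-sided normal critical value."""
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    flagged = sum(1 for r in residuals if abs(r) > critical)
    return flagged / len(residuals)


# Hypothetical residuals, for illustration only.
points = qq_points([0.3, -1.2, 0.0])
rate = rejection_rate([0.0, 2.5, -3.0, 1.0])
```

If the residuals are close to standard normal, the points fall near the identity line and the rejection rate is close to the nominal level; systematic departures from the line would indicate the kind of design-induced pattern discussed in the main text.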

Figure. 9 QQ plots of residuals ($N = 600$, $J = 9$).

Figure. 10 QQ plots of residuals ($N = 6000$, $J = 9$).

Footnotes

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

1 Bayesian inference is not considered in this paper.

2 For example, averaged bias for the item intercepts is computed as $R^{-1}\sum_{r=1}^{R} J^{-1}\sum_{j=1}^{J} (\hat{\beta}_{jr} - \beta_{jr})$.

3 https://github.com/EducationalTestingService/MIRT

4 Although we distinguish between low- and high-difficulty modules, there generally is considerable overlap due to the variation of item difficulties within units.

5 Note that we are not at liberty to share the content of the units.

References

Ali, U. S., Shin, H. J., & van Rijn, P. W. (in press). Applicability of traditional statistical methods to multistage test data. In D. Yan & A. von Davier (Eds.), Research for practical issues and solutions in computerized multistage testing. Taylor and Francis.

Berger, M. P. (1992). Sequential sampling designs for the two-parameter item response theory model. Psychometrika, 57, 521–538.

Bishop, Y. M., Fienberg, S. E., & Holland, P. W. (2007). Discrete multivariate analysis: Theory and practice. Springer.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.

Cai, L., & Hansen, M. (2013). Limited-information goodness-of-fit testing of hierarchical item factor models. British Journal of Mathematical and Statistical Psychology, 66, 245–276.

Chalmers, R. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29.

Chen, W.-H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265–289.

Christensen, K. B., Makransky, G., & Horton, M. (2017). Critical values for Yen’s Q3: Identification of local dependence in the Rasch model using residual correlations. Applied Psychological Measurement, 41(3), 178–194.

Eggen, T. J. H. M., & Verhelst, N. D. (2011). Item calibration in incomplete testing designs. Psicológica, 32(1), 107–132.

Gibbons, R. D., & Hedeker, D. (1992). Full-information item bi-factor analysis. Psychometrika, 57, 423–436.

Glas, C. A. W. (1988). The Rasch model and multistage testing. Journal of Educational Statistics, 13, 45–52.

Glas, C. A. W. (1989). Contributions to estimating and testing Rasch models (Unpublished doctoral dissertation). University of Twente.

Goodman, L. A. (1968). The analysis of cross-classified data: Independence, quasi-independence, and interactions in contingency tables with or without missing entries. Journal of the American Statistical Association, 63, 1091–1131.

Haberman, S. J. (2007). The interaction model. In M. von Davier & C. H. Carstensen (Eds.), Multivariate and mixture distribution Rasch models (pp. 201–216). Springer.

Haberman, S. J. (2013). A general program for item-response analysis that employs the stabilized Newton–Raphson algorithm (ETS Research Report RR-13-32). https://doi.org/10.1002/j.2333-8504.2013.tb02339.x

Haberman, S. J., & Sinharay, S. (2013). Generalized residuals for general models for contingency tables with application to item response theory. Journal of the American Statistical Association, 108, 1435–1444.

Haberman, S. J., Sinharay, S., & Chon, K. H. (2013). Assessing item fit for unidimensional item response theory models using residuals from estimated item response functions. Psychometrika, 78, 417–440.

Haberman, S. J., & von Davier, A. A. (2014). Considerations on parameter estimation, scoring, and linking in multistage testing. In D. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 229–248). CRC Press.

Houts, C. R., & Cai, L. (2016). flexMIRT: user manual version 3.5: Flexible multilevel multidimensional item analysis and test scoring. Vector Psychometric Group.

Ip, E. H. (2002). Locally dependent latent trait model and the Dutch identity revisited. Psychometrika, 67, 367–386.

Jewsbury, P. A., & van Rijn, P. W. (2020). IRT and MIRT models for item parameter estimation with multidimensional multistage tests. Journal of Educational and Behavioral Statistics, 45, 383–402.

Joe, H., & Maydeu-Olivares, A. (2010). A general family of limited information goodness-of-fit statistics for multinomial data. Psychometrika, 75, 393–419.

Johnson, E. G. (1992). The design of the national assessment of educational progress. Journal of Educational Measurement, 29(2), 95–110.

Kelderman, H., & Rijkes, C. P. M. (1994). Loglinear multidimensional IRT models for polytomously scored items. Psychometrika, 59(2), 149–176.

Kolen, M., & Brennan, R. (2004). Test equating, scaling, and linking: Methods and practices. Springer.

Liu, Y., & Maydeu-Olivares, A. (2013). Local dependence diagnostics in IRT modeling of binary data. Educational and Psychological Measurement, 73, 254–274.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.

Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score “equatings”. Applied Psychological Measurement, 8, 453–461.

Louis, T. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society Series B (Methodological), 44, 226–233.

Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-of-fit testing in $2^n$ contingency tables: A unified framework. Journal of the American Statistical Association, 100, 1009–1020.

McDonald, R. P. (1999). Test theory: A unified treatment. Lawrence Erlbaum.

Messick, S., Beaton, A., & Lord, F. (1983). National assessment of educational progress reconsidered: A new design for a new era (Tech. Rep.).

Mislevy, R. J., & Chang, H. H. (2000). Does adaptive testing violate local independence? Psychometrika, 65(2), 149–156.

Mislevy, R. J., & Wu, P.-K. (1996). Missing responses and IRT ability estimation: Omits, choice, time limits, and adaptive testing (ETS Research Report RR-96-30). Educational Testing Service.

Monseur, C., Baye, A., Lafontaine, D., & Quittre, V. (2011). PISA test format assessment and the local independence assumption. IERI Monographs Series: Issues and Methodologies in Large-Scale Assessments, 4.

Naylor, J. C., & Smith, A. F. M. (1982). Applications of a method for efficient computation of posterior distributions. Applied Statistics, 31, 214–225.

Nikoloulopoulos, A. K., & Joe, H. (2015). Factor copula models for item response data. Psychometrika, 80(1), 126–150.

Pommerich, M., & Segall, D. O. (2008). Local dependence in an operational CAT: Diagnosis and implications. Journal of Educational Measurement, 45(3), 201–223.

R Core Team. (2019). R: A language and environment for statistical computing [Computer software manual]. Retrieved from https://www.R-project.org/

Reckase, M. D. (2009). Multidimensional item response theory. Springer.

Reiser, M. (1996). Analysis of residuals for the multinomial item response model. Psychometrika, 61, 509–528.

Robin, F., Steffen, M., & Liang, L. (2014). The multistage test implementation of the GRE revised general test. In D. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 325–341). CRC Press.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.

Tjur, T. (1982). A connection between Rasch’s item analysis model and a multiplicative Poisson model. Scandinavian Journal of Statistics, 9, 23–30.

van Rijn, P. W., Sinharay, S., Haberman, S. J., & Johnson, M. S. (2016). Assessment of fit of item response theory models used in large-scale educational survey assessments. Large-scale Assessments in Education.

Verhelst, N. D., & Verstralen, H. H. F. M. (2008). Some considerations on the partial credit model. Psicológica, 29, 229–254.

von Davier, M., Yamamoto, K., Shin, H. J., Chen, H., Khorramdel, L., Weeks, J., & Kandathil, M. (2019). Evaluating item response theory linking and model fit for data from PISA 2000–2012. Assessment in Education: Principles, Policy & Practice, 26(4), 466–488.

Wainer, H., Bradlow, E., & Wang, X. (2007). Testlet response theory and its applications. Cambridge University Press.

Wainer, H., & Thissen, D. (1996). How is reliability related to the quality of test scores? What is the effect of local dependence on reliability? Educational Measurement: Issues and Practice, 15(1), 22–29.

Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427–450.

Woods, C. M. (2015). Estimating the latent density in unidimensional IRT to permit nonnormality. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling: Applications to typical performance assessment (pp. 60–84). Routledge.

Yamamoto, K., Shin, H. J., & Khorramdel, L. (2019). Introduction of multistage adaptive testing design in PISA 2018 (OECD Education Working Papers No. 209). https://doi.org/10.1787/b9435d4b-en

Yen, W. M. (1984). Effect of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8(2), 125–145.

Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–213.

Zenisky, A. L., Hambleton, R. K., & Sireci, S. G. (2001). Effects of local item dependence on the validity of IRT item, test, and ability statistics (MCAT-5). https://doi.org/10.1002/j.2333-8504.2006.tb02009.x

Zhang, J. (2013). A procedure for dimensionality analyses of response data from various test designs. Psychometrika, 78(1), 37–58.

Zwitser, R. J., & Maris, G. (2015). Conditional statistical inference with multistage testing designs. Psychometrika, 80(1), 65–84.

Figure. 1 Basic MST design with two stages and two levels of difficulty (module B is of lower difficulty, module C is of higher difficulty).

Figure. 2 Balanced MST design with two stages, two levels of difficulty, and all items used in both stages.

Table 1 Mean (SD) type I error of generalized residuals for item pair frequencies.

Figure. 3 Density plots of residuals (top left: $N = 600$, $J = 9$; top right: $N = 6000$, $J = 9$; bottom left: $N = 600$, $J = 45$; bottom right: $N = 6000$, $J = 45$)

Figure. 4 Heatmap of generalized residuals for item pairs in first stage of PISA 2018 reading MST.

Figure. 5 Highlight of MST design for PISA 2018 reading assessment.

Table 2 Observed, expected, and residual frequencies, standard deviations and generalized residuals with and without MST adjustments for all patterns on one pair of units for subset of PISA 2018 reading MST data ($N = 14{,}226$).

Figure. 6 Heatmap of unadjusted (left) and adjusted (right) generalized residuals for PISA example.

Table 3 Mean (SD) simulation results with respect to parameter recovery (200 replications).

Figure. 7 Distributions of $Q_3$ statistic across all item pairs under different designs ($N = 6000$, $J = 45$).

Figure. 8 Heatmaps of average $Q_3$ statistic under complete and basic MST designs ($N = 6000$, $J = 45$).
