
What can we learn from Plausible Values?

Published online by Cambridge University Press:  01 January 2025

Maarten Marsman*
Affiliation: University of Amsterdam; Cito
Gunter Maris
Affiliation: University of Amsterdam; Cito
Timo Bechger
Affiliation: Cito
Cees Glas
Affiliation: University of Twente
* Correspondence should be made to Maarten Marsman, Department of Psychology, University of Amsterdam, Nieuwe Prinsengracht 129-B, P.O. Box 15906, 1001 NK Amsterdam, The Netherlands. Email: m.marsman@uva.nl

Abstract

In this paper, we show that the marginal distribution of plausible values is a consistent estimator of the true latent variable distribution, and, furthermore, that convergence is monotone in an embedding in which the number of items tends to infinity. We use this result to clarify some of the misconceptions that exist about plausible values, and also show how they can be used in the analyses of educational surveys.

Type
Article
Creative Commons
Creative Commons License - CC BY
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Copyright
Copyright © 2015 The Author(s)

1. Introduction

In educational surveys, an item response theory (IRT) model is used to model the conditional distribution of a vector of item responses $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$ as a function of a latent random variable (ability) $\Theta$, where the item response functions are monotonically increasing in ability. The IRT model characterizes the latent variable $\Theta$, and the goal of educational surveys is to estimate the distribution of $\Theta$, which we denote by f. Together, the IRT model and the ability distribution induce the following statistical model:

$$
P(\mathbf{X}_f=\mathbf{x}) = \int_{\mathbb{R}} P(\mathbf{X}=\mathbf{x} \mid \theta)\, f(\theta)\,\mathrm{d}\theta,
$$

where $P(\mathbf{X}_f)$ is the true data distribution of which we obtain a sample. Throughout this paper, we assume that the IRT model is given and focus on the unknown f. We consider the usual case where the item responses $X_i$ are discrete with a finite number of possible realizations, but note that the results remain the same when the $X_i$ are continuous and sums are replaced by integrals.
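The marginal probability above is easy to approximate numerically. The following sketch is not from the paper; the Rasch form of the IRT model, the item difficulties, and the standard normal choice of f are illustrative assumptions. It evaluates $P(\mathbf{X}_f=\mathbf{x})$ for a single response pattern by integrating the likelihood against f on a $\theta$ grid.

```python
# Minimal sketch: P(X_f = x) = ∫ P(X = x | θ) f(θ) dθ for a Rasch model,
# approximated by the trapezoidal rule on a θ grid. The difficulties b and
# the standard normal choice of f are illustrative assumptions.
import numpy as np

def rasch_likelihood(x, b, theta):
    """P(X = x | θ), evaluated on a grid of θ values."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))   # success probabilities
    return np.prod(np.where(x[None, :] == 1, p, 1.0 - p), axis=1)

def marginal_probability(x, b, f_density, theta):
    """P(X_f = x): integrate the likelihood against the ability density f."""
    return np.trapz(rasch_likelihood(x, b, theta) * f_density(theta), theta)

theta_grid = np.linspace(-6.0, 6.0, 2001)
b = np.array([-0.5, 0.0, 0.5])                                  # assumed item difficulties
f = lambda t: np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)      # assumed f: standard normal
print(marginal_probability(np.array([1, 0, 1]), b, f, theta_grid))
```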

There are four possible approaches to estimate f from the observed data. The first entails the use of a function T such that $T(\mathbf{X}) \sim \Theta$. If $\mathbf{X}$ is discrete, realizations of $T(\mathbf{X})$ are discrete as well. The second approach requires a function T such that $T(\mathbf{X}) \overset{\mathcal{L}}{\longrightarrow} \Theta$, i.e., a random variable that, asymptotically, has the same distribution as $\Theta$. This can be any T that is a consistent estimator of $\Theta$, such as the Maximum Likelihood (ML) or Weighted ML (WML) estimator (Warm, 1989). The third approach is to use the data to generate a random variable $\Theta^*$ such that $\Theta^* \sim \Theta$. By definition, $\Theta$ and $\Theta^*$ are exchangeable and their joint density can be written as follows:

$$
f(\theta^*,\theta) = \sum_{\mathbf{x}} f(\theta^* \mid \mathbf{X}=\mathbf{x})\, f(\theta \mid \mathbf{X}=\mathbf{x})\, P(\mathbf{X}_f=\mathbf{x}), \qquad (1)
$$

where summation is over all possible realizations of $\mathbf{X}$. The conditional distributions $f(\theta \mid \mathbf{X})$ are posterior distributions, and it easily follows that the marginal distribution of draws from these posteriors equals the population distribution. Thus, if we sample from the correct posteriors, the population distribution can be recovered in a straightforward way. The problem, however, is that we do not know the correct posterior because we do not know f. In practice, we would therefore use a prior distribution g to generate random variables $\tilde{\Theta} \mid \mathbf{X}$ (i.e., sample from the posteriors $g(\theta \mid \mathbf{X})$). The random variables $\tilde{\Theta} \mid \mathbf{X}$ are called plausible values (PVs) in the psychometric literature (Mislevy, 1991; Mislevy, Beaton, Kaplan, & Sheehan, 1993). Using PVs to estimate f constitutes the fourth and final approach and the one this paper is about.
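Operationally, a PV for a person with responses $\mathbf{x}$ is a single draw from the posterior $g(\theta \mid \mathbf{X}=\mathbf{x})$. The sketch below shows one simple way to produce such a draw, assuming a Rasch model, a standard normal prior g, and inverse-cdf sampling on a $\theta$ grid; it is an illustration, not the operational machinery used in surveys such as PISA.

```python
# Minimal sketch: draw one plausible value from g(θ | X = x) ∝ P(X = x | θ) g(θ)
# for a Rasch model, using inverse-cdf sampling on a θ grid. The prior g, the
# difficulties, and the grid sampler are illustrative assumptions.
import numpy as np

def draw_plausible_value(x, b, g_density, theta, rng):
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    likelihood = np.prod(np.where(x[None, :] == 1, p, 1.0 - p), axis=1)
    posterior = likelihood * g_density(theta)
    posterior /= np.trapz(posterior, theta)           # normalized posterior density
    cdf = np.cumsum(posterior) * (theta[1] - theta[0])
    cdf /= cdf[-1]                                    # guard against discretization error
    return np.interp(rng.uniform(), cdf, theta)       # inverse-cdf draw

rng = np.random.default_rng(1)
theta_grid = np.linspace(-6.0, 6.0, 2001)
b = np.array([-0.5, 0.0, 0.5])                        # assumed item difficulties
g = lambda t: np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)   # standard normal prior
print(draw_plausible_value(np.array([1, 0, 1]), b, g, theta_grid, rng))
```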

In this paper, we prove that under mild regularity conditions, PVs are random variables of the form $\tilde{\Theta} \mid \mathbf{X}$ such that $\tilde{\Theta} \overset{\mathcal{L}}{\longrightarrow} \Theta$. That is, we will show that the marginal distribution of the PVs is a consistent estimator of f. More specifically, let

$$
\tilde{g}(\theta) = \sum_{\mathbf{x}} g(\theta \mid \mathbf{X}=\mathbf{x})\, P(\mathbf{X}_f=\mathbf{x})
$$

denote the marginal distribution of the PVs. This distribution is intractable but easily sampled from; that is, nature provides realizations from $P(\mathbf{X}_f)$, which we then use to sample PVs.

It is well known that the empirical cumulative distribution function (ecdf) of the PVs is a consistent estimator of $\tilde{g}$ as the number of persons goes to infinity. Our main goal is to demonstrate that $\tilde{g}$ in turn converges in law to f (i.e., $\tilde{\Theta} = \Theta_{\tilde{g}} \overset{\mathcal{L}}{\longrightarrow} \Theta_f$) as the number of items goes to infinity. The following example gives a foretaste of what this paper is about.

Example 1

We generate responses of N = 10,000 persons on a test consisting of n Rasch items with difficulty parameters sampled uniformly between $-1$ and 1. The ability distribution f is a mixture with two normal components whose ecdf is shown in the left panel of Fig. 1. One component may, for instance, be the distribution for the boys and the other one that for the girls.

The analyst is unaware of the difference between the boys and the girls and chooses g to be a standard normal distribution. We now generate a single PV for each of the N persons; once for a test with $n = 10$ items and once for a test with $n = 40$ items. The PV distributions are shown in the right panel of Fig. 1. Figure 1 shows that the distribution of the PVs is not the standard normal. In fact, with 40 items, it begins to resemble the true ability distribution even though the population model is clearly wrong.

Figure 1. Ecdfs of N = 10,000 draws from $f(\theta)$ and N = 10,000 draws from the standard normal prior distribution $g(\theta)$ are shown in both panels (in gray in the right panel). Ecdfs of the marginal distributions of PVs are shown in the right panel.
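A simulation in the spirit of Example 1 can be set up as follows. This is a sketch under stated assumptions rather than the authors' code: the two normal components of f (their means, standard deviations, and equal weights) are assumptions, since the example only states that f is a two-component normal mixture; the prior g is standard normal as in the example.

```python
# Minimal sketch of Example 1: Rasch data are generated from an assumed
# two-component normal mixture f, and one PV per person is drawn under a
# standard normal prior g by inverse-cdf sampling on a θ grid.
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
theta_grid = np.linspace(-6.0, 6.0, 601)
g = np.exp(-0.5 * theta_grid ** 2) / np.sqrt(2.0 * np.pi)        # prior g on the grid

component = rng.integers(0, 2, N)                                # assumed mixture for f
theta_true = np.where(component == 0,
                      rng.normal(-1.0, 0.5, N), rng.normal(1.0, 0.5, N))

def plausible_values(n_items):
    b = rng.uniform(-1.0, 1.0, n_items)                          # Rasch difficulties
    p_true = 1.0 / (1.0 + np.exp(-(theta_true[:, None] - b[None, :])))
    x = (rng.uniform(size=p_true.shape) < p_true).astype(int)    # simulated responses
    p_grid = 1.0 / (1.0 + np.exp(-(theta_grid[:, None] - b[None, :])))
    log_like = x @ np.log(p_grid).T + (1 - x) @ np.log(1.0 - p_grid).T   # (N, grid)
    post = np.exp(log_like - log_like.max(axis=1, keepdims=True)) * g
    cdf = np.cumsum(post, axis=1)
    cdf /= cdf[:, -1:]
    u = rng.uniform(size=N)
    return np.array([np.interp(u[i], cdf[i], theta_grid) for i in range(N)])

pv_10, pv_40 = plausible_values(10), plausible_values(40)
# Compare the ecdfs of pv_10 and pv_40 with that of theta_true (cf. Fig. 1).
```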

Instead of proving that $\tilde{g}$ converges in law to f, we will prove a stronger result: namely, that $\tilde{g}$ converges to f in Expected Kullback-Leibler (EKL) divergence (Kullback & Leibler, 1951) as the number of items n tends to infinity.

Definition

The Expected (posterior) Kullback-Leibler (EKL) divergence between $\Theta_f \mid \mathbf{X}$ and $\Theta_g \mid \mathbf{X}$, w.r.t. $f(\Theta \mid \mathbf{X})$ and $P(\mathbf{X}_f)$, is

$$
\begin{aligned}
\mathbb{E}(\Delta(\Theta_f \,;\, \Theta_g \mid \mathbf{X}_f)) &= \sum_{\mathbf{x}} \Delta(\Theta_f \,;\, \Theta_g \mid \mathbf{X}=\mathbf{x})\, P(\mathbf{X}_f=\mathbf{x})\\
&= \sum_{\mathbf{x}} \left[ \int_{\mathbb{R}} \ln\left( \frac{f(\theta \mid \mathbf{X}=\mathbf{x})}{g(\theta \mid \mathbf{X}=\mathbf{x})} \right) f(\theta \mid \mathbf{X}=\mathbf{x})\,\mathrm{d}\theta \right] P(\mathbf{X}_f=\mathbf{x}),
\end{aligned}
$$

where $\Delta(\Theta_f \,;\, \Theta_g \mid \mathbf{X})$ denotes the Kullback-Leibler (KL) divergence of $f(\Theta \mid \mathbf{X})$ and $g(\Theta \mid \mathbf{X})$ with respect to $f(\Theta \mid \mathbf{X})$, with $0\ln(0) \equiv 0$.

Throughout this paper, we assume that all divergences are finite, which is true if the support of g contains that of f (i.e., f is absolutely continuous w.r.t. g) almost everywhere (a.e.). Note that the KL and EKL divergences that we use in this paper are non-symmetric in their arguments, yet their values are always non-negative and zero if and only if the compared probability distributions are the same a.e. (see Theorem 9.6.1 in Cover & Thomas, 1991, p. 232).

We demonstrate in the next section that convergence in EKL divergence is indeed stronger than convergence in law. Then, we prove that EKL divergence is monotonically non-increasing in n and tends to zero as the number of items n tends to infinity. Informally, this means that $\tilde{g}$ will always get closer to f as n grows, as we saw in the example. Having thus established our main result, we discuss a number of implications for educational surveys and show that quite a lot can be learned from PVs. Throughout, PISA data will be used for illustration. The paper ends with a discussion.

2. Convergence in EKL divergence implies convergence in law

To demonstrate that $\tilde{g}$ converges in law to f, it is sufficient to prove that $\tilde{g}$ converges to f in KL divergence, as this implies convergence in law (DasGupta, 2008, p. 21). The following theorem implies that convergence in EKL divergence is stronger than convergence in KL divergence.

Theorem 1

Given an IRT model $P(\mathbf{X} \mid \theta)$ and assuming that the support of g contains the support of f, the KL divergence of $\Theta_{\tilde{g}}$ w.r.t. $\Theta_f$, i.e.,

$$
\Delta(\Theta_f \,;\, \Theta_{\tilde{g}}) = \int_{\mathbb{R}} \ln\frac{f(\theta)}{\tilde{g}(\theta)}\, f(\theta)\,\mathrm{d}\theta,
$$

is always smaller than or equal to EKL divergence. That is,

$$
\Delta(\Theta_f \,;\, \Theta_{\tilde{g}}) \le \mathbb{E}(\Delta(\Theta_f \,;\, \Theta_g \mid \mathbf{X}_f)).
$$

Proof

We start by rewriting the logarithm of the ratio of $\tilde{g}$ over f:

$$
\begin{aligned}
\ln\frac{\tilde{g}(\theta)}{f(\theta)} &= \ln\left\{ \frac{\sum_{\mathbf{x}} g(\theta \mid \mathbf{X}=\mathbf{x})\, P(\mathbf{X}_f=\mathbf{x})}{\sum_{\mathbf{x}} f(\theta \mid \mathbf{X}=\mathbf{x})\, P(\mathbf{X}_f=\mathbf{x})} \right\}\\
&= \ln\left\{ \sum_{\mathbf{x}} \frac{g(\theta \mid \mathbf{X}=\mathbf{x})\, P(\mathbf{X}_f=\mathbf{x})}{f(\theta \mid \mathbf{X}=\mathbf{x})\, P(\mathbf{X}_f=\mathbf{x})}\, \frac{f(\theta \mid \mathbf{X}=\mathbf{x})\, P(\mathbf{X}_f=\mathbf{x})}{\sum_{\mathbf{x}} f(\theta \mid \mathbf{X}=\mathbf{x})\, P(\mathbf{X}_f=\mathbf{x})} \right\}\\
&= \ln\left\{ \sum_{\mathbf{x}} \frac{g(\theta \mid \mathbf{X}=\mathbf{x})}{f(\theta \mid \mathbf{X}=\mathbf{x})}\, P(\mathbf{X}=\mathbf{x} \mid \theta) \right\}\\
&\ge \sum_{\mathbf{x}} \ln\frac{g(\theta \mid \mathbf{X}=\mathbf{x})}{f(\theta \mid \mathbf{X}=\mathbf{x})}\, P(\mathbf{X}=\mathbf{x} \mid \theta),
\end{aligned}
$$

using Jensen’s inequality. Thus, we obtain

$$
\ln\frac{f(\theta)}{\tilde{g}(\theta)} \le \sum_{\mathbf{x}} \ln\frac{f(\theta \mid \mathbf{X}=\mathbf{x})}{g(\theta \mid \mathbf{X}=\mathbf{x})}\, P(\mathbf{X}=\mathbf{x} \mid \theta).
$$

Integrating both sides of this expression w.r.t. f gives the desired result:

$$
\begin{aligned}
\int_{\mathbb{R}} \ln\frac{f(\theta)}{\tilde{g}(\theta)}\, f(\theta)\,\mathrm{d}\theta &\le \int_{\mathbb{R}} \sum_{\mathbf{x}} \ln\frac{f(\theta \mid \mathbf{X}=\mathbf{x})}{g(\theta \mid \mathbf{X}=\mathbf{x})}\, P(\mathbf{X}=\mathbf{x} \mid \theta)\, f(\theta)\,\mathrm{d}\theta\\
&= \sum_{\mathbf{x}} \int_{\mathbb{R}} \ln\frac{f(\theta \mid \mathbf{X}=\mathbf{x})}{g(\theta \mid \mathbf{X}=\mathbf{x})}\, f(\theta \mid \mathbf{X}=\mathbf{x})\,\mathrm{d}\theta \; P(\mathbf{X}_f=\mathbf{x}).
\end{aligned}
$$

It follows that $\tilde{g}$ converges in law to f if $\tilde{g}$ converges to f in EKL. Proving convergence in EKL will be the burden of the ensuing sections. $\square$
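Theorem 1 can be checked numerically for a short test by enumerating all response patterns. In the sketch below, the Rasch model with three items, the normal f with nonzero mean, and the standard normal g are illustrative assumptions; both $\Delta(\Theta_f \,;\, \Theta_{\tilde{g}})$ and the EKL divergence are computed by grid integration, and the former is indeed the smaller of the two.

```python
# Minimal numeric check of Theorem 1: Δ(Θ_f ; Θ_g~) ≤ EKL divergence.
# Illustrative assumptions: Rasch model (3 items), normal f and g, θ grid.
import itertools
import numpy as np

theta = np.linspace(-8.0, 8.0, 4001)
b = np.array([-0.5, 0.0, 0.5])
f = np.exp(-0.5 * ((theta - 0.5) / 1.2) ** 2)
f /= np.trapz(f, theta)                                       # assumed true density f
g = np.exp(-0.5 * theta ** 2) / np.sqrt(2.0 * np.pi)          # assumed prior density g
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

g_tilde = np.zeros_like(theta)     # marginal PV density  Σ_x g(θ | x) P(X_f = x)
ekl = 0.0
for pattern in itertools.product([0, 1], repeat=len(b)):
    x = np.array(pattern)
    like = np.prod(np.where(x == 1, p, 1.0 - p), axis=1)
    pf = np.trapz(like * f, theta)                            # P(X_f = x)
    pg = np.trapz(like * g, theta)                            # P(X_g = x)
    post_f, post_g = like * f / pf, like * g / pg             # posteriors under f and g
    g_tilde += post_g * pf
    ekl += np.trapz(post_f * np.log(post_f / post_g), theta) * pf

kl_f_gtilde = np.trapz(f * np.log(f / g_tilde), theta)
print(kl_f_gtilde, "<=", ekl)                                 # Theorem 1, numerically
```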

3. Monotone Convergence of Plausible Values

Before we can state our first result in Theorem 2, we need two lemmas.

Lemma 1

Given an IRT model $P(\mathbf{X} \mid \theta)$ and assuming that the support of g contains the support of f, the EKL divergence of $\Theta_f \mid \mathbf{X}$ and $\Theta_g \mid \mathbf{X}$, w.r.t. $f(\Theta \mid \mathbf{X})$ and $P(\mathbf{X}_f)$, equals prior divergence minus marginal divergence, that is,

$$
\mathbb{E}(\Delta(\Theta_f \,;\, \Theta_g \mid \mathbf{X}_f)) = \Delta(\Theta_f \,;\, \Theta_g) - \Delta(\mathbf{X}_f \,;\, \mathbf{X}_g).
$$

Proof

Using the definition of the posterior, and given the IRT model $P(\mathbf{X} \mid \theta)$, we rewrite the EKL divergence as follows:

$$
\begin{aligned}
\mathbb{E}(\Delta(\Theta_f \,;\, \Theta_g \mid \mathbf{X}_f)) &= \sum_{\mathbf{x}} \int_{\mathbb{R}} \ln\left( \frac{P(\mathbf{X}=\mathbf{x} \mid \theta)\, f(\theta)/P(\mathbf{X}_f=\mathbf{x})}{P(\mathbf{X}=\mathbf{x} \mid \theta)\, g(\theta)/P(\mathbf{X}_g=\mathbf{x})} \right) f(\theta \mid \mathbf{X}=\mathbf{x})\,\mathrm{d}\theta \; P(\mathbf{X}_f=\mathbf{x})\\
&= \sum_{\mathbf{x}} \int_{\mathbb{R}} \ln\left( \frac{f(\theta)}{g(\theta)}\, \frac{P(\mathbf{X}_g=\mathbf{x})}{P(\mathbf{X}_f=\mathbf{x})} \right) f(\theta \mid \mathbf{X}=\mathbf{x})\,\mathrm{d}\theta \; P(\mathbf{X}_f=\mathbf{x}),
\end{aligned}
$$

where $P(\mathbf{X}_g)$ is the distribution of the data under the prior g. Using properties of the logarithm, we obtain

$$
\begin{aligned}
\mathbb{E}(\Delta(\Theta_f \,;\, \Theta_g \mid \mathbf{X}_f)) &= \sum_{\mathbf{x}} \int_{\mathbb{R}} \ln\left( \frac{f(\theta)}{g(\theta)} \right) f(\theta \mid \mathbf{X}=\mathbf{x})\,\mathrm{d}\theta \; P(\mathbf{X}_f=\mathbf{x})\\
&\quad + \sum_{\mathbf{x}} \int_{\mathbb{R}} \ln\left( \frac{P(\mathbf{X}_g=\mathbf{x})}{P(\mathbf{X}_f=\mathbf{x})} \right) f(\theta \mid \mathbf{X}=\mathbf{x})\,\mathrm{d}\theta \; P(\mathbf{X}_f=\mathbf{x}).
\end{aligned}
$$

If we sum over the possible values of $\mathbf{X}$ in the first term and integrate over $\Theta$ in the second term, respectively, we obtain

$$
\begin{aligned}
\mathbb{E}(\Delta(\Theta_f \,;\, \Theta_g \mid \mathbf{X}_f)) &= \int_{\mathbb{R}} \ln\left( \frac{f(\theta)}{g(\theta)} \right) f(\theta)\,\mathrm{d}\theta + \sum_{\mathbf{x}} \ln\left( \frac{P(\mathbf{X}_g=\mathbf{x})}{P(\mathbf{X}_f=\mathbf{x})} \right) P(\mathbf{X}_f=\mathbf{x})\\
&= \int_{\mathbb{R}} \ln\left( \frac{f(\theta)}{g(\theta)} \right) f(\theta)\,\mathrm{d}\theta - \sum_{\mathbf{x}} \ln\left( \frac{P(\mathbf{X}_f=\mathbf{x})}{P(\mathbf{X}_g=\mathbf{x})} \right) P(\mathbf{X}_f=\mathbf{x})\\
&= \Delta(\Theta_f \,;\, \Theta_g) - \Delta(\mathbf{X}_f \,;\, \mathbf{X}_g).
\end{aligned}
$$

It follows that the EKL divergence of the posterior distribution is equal to the difference between prior divergence $\Delta(\Theta_f \,;\, \Theta_g)$ and marginal divergence $\Delta(\mathbf{X}_f \,;\, \mathbf{X}_g)$ (i.e., the divergence of $P(\mathbf{X}_g)$ w.r.t. $P(\mathbf{X}_f)$). $\square$
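The identity in Lemma 1 can be verified numerically in the same way. The sketch below, again with an assumed Rasch model, a normal f, and a standard normal g, enumerates all $2^n$ response patterns and compares the EKL divergence with the difference between prior and marginal divergence.

```python
# Minimal numeric check of Lemma 1: EKL = Δ(Θ_f ; Θ_g) - Δ(X_f ; X_g).
# Illustrative assumptions: Rasch model (4 items), normal f and g, θ grid.
import itertools
import numpy as np

theta = np.linspace(-8.0, 8.0, 4001)
b = np.array([-0.5, 0.0, 0.5, 1.0])                      # assumed item difficulties
f = np.exp(-0.5 * ((theta - 0.5) / 1.2) ** 2)
f /= np.trapz(f, theta)                                  # assumed true density f
g = np.exp(-0.5 * theta ** 2) / np.sqrt(2.0 * np.pi)     # assumed prior density g
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

ekl, marginal_div = 0.0, 0.0
for pattern in itertools.product([0, 1], repeat=len(b)):
    x = np.array(pattern)
    like = np.prod(np.where(x == 1, p, 1.0 - p), axis=1)
    pf, pg = np.trapz(like * f, theta), np.trapz(like * g, theta)
    post_f, post_g = like * f / pf, like * g / pg
    ekl += np.trapz(post_f * np.log(post_f / post_g), theta) * pf   # E(Δ(Θ_f ; Θ_g | X_f))
    marginal_div += np.log(pf / pg) * pf                            # Δ(X_f ; X_g)

prior_div = np.trapz(f * np.log(f / g), theta)                      # Δ(Θ_f ; Θ_g)
print(ekl, prior_div - marginal_div)     # the two numbers agree up to grid error
```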

Lemma 1 implies that $\mathbb{E}(\Delta(\Theta_f \,;\, \Theta_g \mid \mathbf{X}_f))$ equals zero if and only if prior divergence is equal to marginal divergence. Since the divergences are finite and non-negative, we find that

$$
\Delta(\Theta_f \,;\, \Theta_g) \ge \Delta(\mathbf{X}_f \,;\, \mathbf{X}_g).
$$

We will now prove that $\Delta(\mathbf{X}_f \,;\, \mathbf{X}_g)$ is a monotone non-decreasing sequence in the number of items n with $\Delta(\Theta_f \,;\, \Theta_g)$ as an upper bound. To this aim, we consider what happens to marginal divergence when an item is added (i.e., n is increased to $n+1$). To fix the notation, let $X_1, X_2, \ldots$ denote an infinite sequence of item responses, with $X_n$ the n-th element and $\mathbf{X}_n$ a vector consisting of the first n elements of this sequence.

Lemma 2

Given an IRT model $P(\mathbf{X} \mid \theta)$ and assuming that the support of g contains the support of f, the marginal divergence for $n+1$ observations is larger than or equal to the marginal divergence for n observations:

$$
\Delta(\mathbf{X}_{f,n+1} \,;\, \mathbf{X}_{g,n+1}) \ge \Delta(\mathbf{X}_{f,n} \,;\, \mathbf{X}_{g,n}).
$$

Proof

The marginal divergence for $n+1$ items is

$$
\Delta(\mathbf{X}_{f,n+1} \,;\, \mathbf{X}_{g,n+1}) = \sum_{\mathbf{x}_{n+1}} \ln\left( \frac{P(\mathbf{X}_{f,n+1}=\mathbf{x}_{n+1})}{P(\mathbf{X}_{g,n+1}=\mathbf{x}_{n+1})} \right) P(\mathbf{X}_{f,n+1}=\mathbf{x}_{n+1}).
$$

Conditioning on the first n observations and factoring the distribution, we obtain

$$
\begin{aligned}
\Delta(\mathbf{X}_{f,n+1} \,;\, \mathbf{X}_{g,n+1}) &= \sum_{\mathbf{x}_n} \sum_{x_{n+1}} \ln\left( \frac{P(X_{f,n+1}=x_{n+1} \mid \mathbf{X}_n=\mathbf{x}_n)}{P(X_{g,n+1}=x_{n+1} \mid \mathbf{X}_n=\mathbf{x}_n)}\, \frac{P(\mathbf{X}_{f,n}=\mathbf{x}_n)}{P(\mathbf{X}_{g,n}=\mathbf{x}_n)} \right)\\
&\quad \times P(X_{f,n+1}=x_{n+1} \mid \mathbf{X}_n=\mathbf{x}_n)\, P(\mathbf{X}_{f,n}=\mathbf{x}_n).
\end{aligned}
$$

This is equal to

$$
\begin{aligned}
\Delta(\mathbf{X}_{f,n+1} \,;\, \mathbf{X}_{g,n+1}) &= \sum_{\mathbf{x}_n} \sum_{x_{n+1}} \ln\left( \frac{P(X_{f,n+1}=x_{n+1} \mid \mathbf{X}_n=\mathbf{x}_n)}{P(X_{g,n+1}=x_{n+1} \mid \mathbf{X}_n=\mathbf{x}_n)} \right)\\
&\qquad \times P(X_{f,n+1}=x_{n+1} \mid \mathbf{X}_n=\mathbf{x}_n)\, P(\mathbf{X}_{f,n}=\mathbf{x}_n)\\
&\quad + \sum_{\mathbf{x}_n} \ln\left( \frac{P(\mathbf{X}_{f,n}=\mathbf{x}_n)}{P(\mathbf{X}_{g,n}=\mathbf{x}_n)} \right) P(\mathbf{X}_{f,n}=\mathbf{x}_n)\\
&= \mathbb{E}(\Delta(X_{f,n+1} \,;\, X_{g,n+1} \mid \mathbf{X}_{f,n})) + \Delta(\mathbf{X}_{f,n} \,;\, \mathbf{X}_{g,n}),
\end{aligned}
$$

a result closely related to the chain rule of KL divergence (Cover & Thomas, 1991, p. 23). Since $\mathbb{E}(\Delta(X_{f,n+1} \,;\, X_{g,n+1} \mid \mathbf{X}_{f,n})) \ge 0$, we see that

$$
\Delta(\mathbf{X}_{f,n+1} \,;\, \mathbf{X}_{g,n+1}) \ge \Delta(\mathbf{X}_{f,n} \,;\, \mathbf{X}_{g,n}).
$$

$\square$

Using Lemmas 1 and 2, we can now state Theorem 2.

Theorem 2

(Monotonicity Theorem) Given an IRT model $P(\mathbf{X} \mid \theta)$ and assuming that the support of g contains the support of f, $\mathbb{E}(\Delta(\Theta_f \,;\, \Theta_g \mid \mathbf{X}_{f,n}))$ is monotone non-increasing in the number of items n.

Proof

From Lemmas 1 and 2, we obtain

$$
\begin{aligned}
\mathbb{E}(\Delta(\Theta_f \,;\, \Theta_g \mid \mathbf{X}_{f,n+1})) &= \Delta(\Theta_f \,;\, \Theta_g) - \Delta(\mathbf{X}_{f,n+1} \,;\, \mathbf{X}_{g,n+1})\\
&= \Delta(\Theta_f \,;\, \Theta_g) - \mathbb{E}(\Delta(X_{f,n+1} \,;\, X_{g,n+1} \mid \mathbf{X}_{f,n})) - \Delta(\mathbf{X}_{f,n} \,;\, \mathbf{X}_{g,n}),
\end{aligned}
$$

and Lemma 1 shows that the difference of the first and the last terms is equal to the EKL divergence for n items. Thus, we have

$$
\mathbb{E}(\Delta(\Theta_f \,;\, \Theta_g \mid \mathbf{X}_{f,n+1})) = \mathbb{E}(\Delta(\Theta_f \,;\, \Theta_g \mid \mathbf{X}_{f,n})) - \mathbb{E}(\Delta(X_{f,n+1} \,;\, X_{g,n+1} \mid \mathbf{X}_{f,n})).
$$

This implies that the sequence of EKL divergences satisfies the inequalities

$$
0 \le \mathbb{E}(\Delta(\Theta_f \,;\, \Theta_g \mid \mathbf{X}_{f,n+1})) \le \mathbb{E}(\Delta(\Theta_f \,;\, \Theta_g \mid \mathbf{X}_{f,n})) \le \Delta(\Theta_f \,;\, \Theta_g),
$$

i.e., a monotone non-increasing sequence in n with lower bound 0. Since prior divergence is finite by assumption, it is an upper bound for this sequence. $\square$
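The Monotonicity Theorem can be illustrated numerically by computing the EKL divergence for nested tests of increasing length. In the sketch below, the pool of eight Rasch difficulties, the normal f, and the standard normal g are illustrative assumptions; the printed sequence is non-increasing in n.

```python
# Minimal numeric illustration of the Monotonicity Theorem: the EKL divergence
# is non-increasing as items are added. Illustrative assumptions: eight Rasch
# difficulties, normal f, standard normal prior g, θ grid integration.
import itertools
import numpy as np

theta = np.linspace(-8.0, 8.0, 2001)
difficulties = np.linspace(-1.0, 1.0, 8)                  # assumed item pool
f = np.exp(-0.5 * ((theta - 0.5) / 1.2) ** 2)
f /= np.trapz(f, theta)
g = np.exp(-0.5 * theta ** 2) / np.sqrt(2.0 * np.pi)

def ekl_divergence(n):
    b = difficulties[:n]
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    total = 0.0
    for pattern in itertools.product([0, 1], repeat=n):
        x = np.array(pattern)
        like = np.prod(np.where(x == 1, p, 1.0 - p), axis=1)
        pf, pg = np.trapz(like * f, theta), np.trapz(like * g, theta)
        post_f, post_g = like * f / pf, like * g / pg
        total += np.trapz(post_f * np.log(post_f / post_g), theta) * pf
    return total

print([round(ekl_divergence(n), 4) for n in range(1, 9)])  # non-increasing in n
```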

4. Large Sample Properties of Plausible Values

The Monotonicity Theorem shows that the sequence of EKL divergences converges in an embedding in which $n \rightarrow \infty$. This does not imply that the marginal distribution of PVs converges to f, since the sequence of EKL divergences may converge to a number that is strictly larger than zero. We have yet to show that the sequence of EKL divergences converges to zero. Since by Lemma 1 the EKL divergence is equal to the difference between prior and marginal divergence, we may equivalently show that the inequality

$$
\Delta(\Theta_f \,;\, \Theta_g) \ge \Delta(\mathbf{X}_{f,n} \,;\, \mathbf{X}_{g,n}) \qquad (2)
$$

becomes an equality as $n \rightarrow \infty$.

Theorem 3

(Convergence Theorem) Given an IRT model $P(\mathbf{X} \mid \theta)$ and assuming that the support of g contains the support of f,

$$
\lim_{n\rightarrow\infty} \mathbb{E}(\Delta(\Theta_f \,;\, \Theta_g \mid \mathbf{X}_{f,n})) = 0
$$

if the sequence of posteriors converges to a degenerate distribution.

Proof

We start with a direct proof of (2) (suppressing the dependence on n). Note first that,

$$
\begin{aligned}
\forall \mathbf{x}: \quad \ln\frac{P(\mathbf{X}_f=\mathbf{x})}{P(\mathbf{X}_g=\mathbf{x})} &= -\ln\frac{P(\mathbf{X}_g=\mathbf{x})}{P(\mathbf{X}_f=\mathbf{x})}\\
&= -\ln\frac{\int_{\mathbb{R}} P(\mathbf{X}=\mathbf{x} \mid \theta)\, g(\theta)\,\mathrm{d}\theta}{\int_{\mathbb{R}} P(\mathbf{X}=\mathbf{x} \mid \theta)\, f(\theta)\,\mathrm{d}\theta}\\
&= -\ln \int_{\mathbb{R}} \frac{P(\mathbf{X}=\mathbf{x} \mid \theta)\, g(\theta)}{P(\mathbf{X}=\mathbf{x} \mid \theta)\, f(\theta)}\, \frac{P(\mathbf{X}=\mathbf{x} \mid \theta)\, f(\theta)}{\int_{\mathbb{R}} P(\mathbf{X}=\mathbf{x} \mid \theta)\, f(\theta)\,\mathrm{d}\theta}\,\mathrm{d}\theta\\
&= -\ln \int_{\mathbb{R}} \frac{g(\theta)}{f(\theta)}\, f(\theta \mid \mathbf{X}=\mathbf{x})\,\mathrm{d}\theta\\
&\le -\int_{\mathbb{R}} \ln\frac{g(\theta)}{f(\theta)}\, f(\theta \mid \mathbf{X}=\mathbf{x})\,\mathrm{d}\theta = \int_{\mathbb{R}} \ln\frac{f(\theta)}{g(\theta)}\, f(\theta \mid \mathbf{X}=\mathbf{x})\,\mathrm{d}\theta, \qquad (3)
\end{aligned}
$$

using Jensen's inequality in the last line. Taking expectations w.r.t. $P(\mathbf{X}_f)$ gives the inequality in (2). Similarly, we obtain

$$
\forall \mathbf{x}: \quad \ln\frac{P(\mathbf{X}_f=\mathbf{x})}{P(\mathbf{X}_g=\mathbf{x})} = \ln\frac{\int_{\mathbb{R}} P(\mathbf{X}=\mathbf{x} \mid \theta)\, f(\theta)\,\mathrm{d}\theta}{\int_{\mathbb{R}} P(\mathbf{X}=\mathbf{x} \mid \theta)\, g(\theta)\,\mathrm{d}\theta} = \ln \int_{\mathbb{R}} \frac{f(\theta)}{g(\theta)}\, g(\theta \mid \mathbf{X}=\mathbf{x})\,\mathrm{d}\theta,
$$

such that

$$
\begin{aligned}
-\ln \int_{\mathbb{R}} \frac{g(\theta)}{f(\theta)}\, f(\theta \mid \mathbf{X}=\mathbf{x})\,\mathrm{d}\theta &= \ln \int_{\mathbb{R}} \frac{f(\theta)}{g(\theta)}\, g(\theta \mid \mathbf{X}=\mathbf{x})\,\mathrm{d}\theta\\
&\le \int_{\mathbb{R}} \ln\frac{f(\theta)}{g(\theta)}\, f(\theta \mid \mathbf{X}=\mathbf{x})\,\mathrm{d}\theta. \qquad (4)
\end{aligned}
$$

Since f is absolutely continuous w.r.t. g, we obtain that both $\frac{f(\theta)}{g(\theta)}$ and $\ln\frac{f(\theta)}{g(\theta)}$ are uniformly integrable. Convergence in probability of both posteriors (w.r.t. f and g as prior) is then sufficient to guarantee the equality in (3) (e.g., Venkatesh, 2013, pp. 480–481), since under these conditions we may change the order of limits and integration. $\square$

The Convergence Theorem relies on posterior consistency. The regularity conditions that imply posterior consistency can be found in many places. For unidimensional monotone IRT models, the regularity conditions for strong consistency (i.e., almost sure convergence) can be found in Chang and Stout (1993, pp. 42–43). As a courtesy to the reader, we list their conditions in Appendix 1. Chang and Stout (1993, pp. 43–45) argued that in practice these conditions are "very general and appropriate hypotheses" (p. 51). Similar conditions can be found in Chang (1996) for polytomous IRT models.

Combining Theorem 1, the Monotonicity Theorem, and the Convergence Theorem, we arrive at our final result.

Theorem 4

(Monotone Convergence Theorem) Given an IRT model $P(\mathbf{X} \mid \theta)$ and assuming that the support of g contains the support of f and that the sequence of posteriors converges to a degenerate distribution, then $\Delta(\Theta_f \,;\, \Theta_{\tilde{g}}) \rightarrow 0$, monotonically, and furthermore, $\Theta_{\tilde{g}} \overset{\mathcal{L}}{\longrightarrow} \Theta_f$.

Proof

Under the stated assumptions, the Convergence Theorem implies that the EKL divergence converges to zero as n tends to infinity. Convergence is monotone by Theorem 2. From Theorem 1, we consequently obtain

Δ(Θf;Θg~)0.\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \Delta (\Theta _f\text { ; }\Theta _{\tilde{g}}) \rightarrow 0 . \end{aligned}$$\end{document}

Since convergence in KL divergence implies convergence in law (DasGupta, 2008, p. 21), we have

Θg~LΘf.\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \Theta _{\tilde{g}} \overset{\mathcal {L}}{\longrightarrow }\Theta _f. \end{aligned}$$\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\square $$\end{document}

In summary, the Monotone Convergence Theorem states that (under mild regularity conditions) the marginal distribution of PVs g~\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{g}$$\end{document} is a consistent estimator of the true ability distribution f.

5. Implications

In plain words, the Monotone Convergence Theorem implies that we can use PVs to learn about the true distribution of ability. In this section, we discuss some of the practical implications of this result using PISA data for illustration. We remind the reader that g is a prior distribution, f the true distribution, and g~\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{g}$$\end{document} the marginal distribution of the PVs.

5.1. What can we learn from Plausible Values?

What can we learn about the “correct” population model f(θ)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$f(\theta )$$\end{document} when we are using PVs from the “wrong” posterior g(θX=x)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$g(\theta \mid \mathbf {X}=\mathbf {x})$$\end{document}? A common misconception is that the marginal distribution of PVs equals the population model (i.e., g~=g\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{g} = g$$\end{document}) and nothing can be learned from PVs over that which is already known from the population model (prior distribution) (e.g., Kreiner & Christensen, Reference Kreiner and Christensen2014). This is true, if and only if, the population model is the true ability distribution (i.e., g=f\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$g = f$$\end{document}). This is not likely and in practice we expect to see that g~g\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{g} \ne g$$\end{document}.

Example 2

(PISA) To illustrate that the PV distribution may diverge from the prior in applications, we analyze data from the 2006 PISA cycle. More specifically, we used the n=26\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n = 26$$\end{document} items intended to assess reading ability in booklet 6, administered to N=\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N =$$\end{document} 1738 Canadian students (see Appendix 2 for details of this analysis). A single PV was generated for each student using the One Parameter Logistic Model (OPLM; Verhelst & Glas, 1995) as the IRT model, and a standard normal distribution as the prior. The ecdf of N draws from the prior distribution g (solid gray line) and the ecdf of the generated PVs using n=26\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=26$$\end{document} items are shown in Fig. 2 (solid black line). The marginal distribution of the PVs is clearly different from the specified prior distribution.

Figure 2. Ecdf of PVs (g~\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{g}$$\end{document}) and N draws from a standard normal prior distribution (i.e., g(θ)=ϕ(θ)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$g(\theta )=\phi (\theta )$$\end{document}) in the PISA example.
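To make the mechanics of Example 2 concrete, the sketch below shows one way to generate a single PV per respondent by drawing from the posterior under a two-parameter logistic IRT model with a standard normal prior, using a simple grid approximation. The item parameters, sample sizes, and true ability distribution are hypothetical stand-ins, not the PISA booklet or the OPLM estimates used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def p_correct(theta, a, b):
    """Two-parameter logistic item response function P(X_i = 1 | theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def draw_pv(x, a, b, grid=np.linspace(-4.0, 4.0, 401)):
    """Draw one plausible value from g(theta | x) under a standard normal prior,
    using a discrete grid approximation of the posterior."""
    p = p_correct(grid[:, None], a, b)                   # shape: (grid points, items)
    loglik = (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1)
    logpost = loglik - 0.5 * grid ** 2                   # add the log N(0, 1) prior (up to a constant)
    w = np.exp(logpost - logpost.max())
    return rng.choice(grid, p=w / w.sum())

# Hypothetical setting: n = 26 items, N = 1738 respondents, true f different from the N(0, 1) prior
n, N = 26, 1738
a = rng.uniform(0.8, 2.0, size=n)                        # item discriminations
b = rng.normal(0.0, 1.0, size=n)                         # item difficulties
theta_true = rng.normal(0.3, 0.8, size=N)                # draws from the "true" ability distribution f
X = (rng.uniform(size=(N, n)) < p_correct(theta_true[:, None], a, b)).astype(int)

pvs = np.array([draw_pv(X[i], a, b) for i in range(N)])  # one PV per respondent
prior_draws = rng.normal(0.0, 1.0, size=N)               # N draws from the prior g
```

Plotting the ecdf of pvs against that of prior_draws reproduces the kind of comparison shown in Figure 2.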

If the population model is misspecified (i.e., gf\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$g \ne f$$\end{document}), we can still learn a lot from looking at the PV distribution. The PV distribution provides a consistent estimate of the true ability distribution, which is at least as plausible as the population model which figures as a prior. Specifically, it follows from the Monotonicity Theorem that, if gf\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$g \ne f$$\end{document}, and hence g~g\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{g} \ne g$$\end{document}, the marginal distribution of PVs g~\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{g}$$\end{document} is closer to f than g is; as we saw in Example 1. Moreover, we can use PVs to evaluate the fit of the population model by testing the hypothesis H0:g~=g\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$H_0 : \tilde{g} = g$$\end{document} against H1:g~g\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$H_1 : \tilde{g} \ne g$$\end{document}. If H0\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$H_0$$\end{document} is rejected, there is no reason to be interested in g: g~\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{g}$$\end{document} is our best guess of what the true distribution of ability would look like.

Example 3

(PISA continued) We use the PISA example to illustrate that we can test the hypothesis H0:g~=g\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$H_0: \tilde{g} = g$$\end{document} against H1:g~g\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$H_1: \tilde{g} \ne g$$\end{document} using real data with a relatively small number of observations, and that the power of this test increases with n. To this end, we randomly assigned each student two items out of the 26 items that were available. Figure 2 shows the ecdf of the PVs using n=2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n = 2$$\end{document} items (dashed line). It is clear that even with two items, the marginal distribution of PVs differs from the specified prior distribution and H0\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$H_0$$\end{document} does not hold (this test is performed in the next example, see Table 1). Figure 2 also shows that the PV distributions diverge from the prior distribution as n increases, thereby increasing the probability of rejecting H0\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$H_0$$\end{document} when it is false.
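Continuing with the hypothetical arrays pvs and prior_draws from the previous sketch (not the actual PISA data), the two-sample Kolmogorov-Smirnov test of H0: g~ = g can be carried out with SciPy:

```python
from scipy.stats import ks_2samp

# Compare the marginal PV distribution with draws from the prior g
res = ks_2samp(pvs, prior_draws)
print(f"KS statistic = {res.statistic:.3f}, p = {res.pvalue:.4f}")
# A significant statistic indicates that the marginal PV distribution differs from the prior,
# i.e., that the population model does not fit the data.
```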

Table 1. Average values of KS test statistic using PISA data to compare g~\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{g}$$\end{document} with the prior distributions used to generate g~\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{g}$$\end{document}.

Values over 0.046 are significant at an α\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\alpha }$$\end{document} level of 0.05.

5.2. Choose a flexible population model

The population model is formally a prior and, under the conditions of the Monotone Convergence Theorem, becomes irrelevant as the number of items becomes large. Essentially, this is an instance of the common finding that the data overrule the prior when the number of observations increases. In practice, however, there is a natural limit to the number of items that can be administered, which raises the question of how we can improve convergence without increasing the number of items.

The answer comes from Lemma 1, which suggests that convergence of the PV distribution to the true distribution of ability is faster if prior divergence is reduced. Thus, for a given n, we would like prior divergence to be as small as possible (i.e., we would like g to resemble f). When little or nothing is known about f, we may achieve this using a flexible prior; that is, one that easily adapts to different shapes. Otherwise, we may look at the PV distribution found in previous editions of the study to improve the prior. Convergence is also improved if we adopt an empirical Bayesian approach and estimate the parameters of the prior so that it adapts itself to the data as much as possible (see, for instance, White, 1982). Using, for instance, a normal prior in Example 1 would help to discover the bimodality of the true ability distribution with fewer items.

Example 4

(PISA Continued) We use the previously established OPLM model with three prior distributions ordered in terms of flexibility:

  1. A standard normal distribution N(0,1)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {N}(0\text {, }1)$$\end{document}.

  2. A normal distribution N(μ,σ2)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {N}(\mu \text {, }\sigma ^2)$$\end{document} with mean μ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu $$\end{document} and variance σ2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sigma ^2$$\end{document}.

  3. A PCA regression prior N(Λ^β,σ2)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {N}(\widehat{\varvec{\Lambda }}\varvec{\beta }\text {, }\sigma ^2)$$\end{document}, where Λ^\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\widehat{\varvec{\Lambda }}$$\end{document} contains the principal component scores estimated from student covariates assessed in the PISA student questionnaire. We use the first 50 principal components, explaining roughly 60%\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$60~\%$$\end{document} of the variance in the student questionnaire.

The parameters of the prior distribution are estimated using the Gibbs sampler (Geman & Geman, 1984) with non-informative hyper-priors (Gelman, Carlin, Stern, & Rubin, 2004).
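As an illustration of what such a scheme might look like for the second prior, the sketch below alternates between drawing one PV per student and drawing the hyper-parameters (μ, σ²) from their conditional posteriors under the standard non-informative hyper-prior p(μ, σ²) ∝ 1/σ². This is a minimal stand-in, not the exact sampler used for the PISA analyses; the PCA regression prior would replace μ by Λ̂β and add a Bayesian regression step for β.

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_pv_normal_prior(x, a, b, mu, sigma, grid=np.linspace(-6.0, 6.0, 601)):
    """One PV from the posterior under a N(mu, sigma^2) prior (grid approximation, 2PL items)."""
    p = 1.0 / (1.0 + np.exp(-a * (grid[:, None] - b)))
    loglik = (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1)
    logpost = loglik - 0.5 * ((grid - mu) / sigma) ** 2
    w = np.exp(logpost - logpost.max())
    return rng.choice(grid, p=w / w.sum())

def gibbs(X, a, b, n_iter=1000):
    """Alternate between the PVs and the hyper-parameters (mu, sigma^2),
    using the conditional posteriors under p(mu, sigma^2) proportional to 1/sigma^2."""
    N = X.shape[0]
    mu, sigma = 0.0, 1.0
    draws = []
    for _ in range(n_iter):
        theta = np.array([draw_pv_normal_prior(X[i], a, b, mu, sigma) for i in range(N)])
        s2 = theta.var(ddof=1)
        sigma2 = (N - 1) * s2 / rng.chisquare(N - 1)        # sigma^2 | theta
        mu = rng.normal(theta.mean(), np.sqrt(sigma2 / N))  # mu | sigma^2, theta
        sigma = np.sqrt(sigma2)
        draws.append((mu, sigma, theta))
    return draws

# Usage (with X, a, b as in the earlier sketch): samples = gibbs(X, a, b, n_iter=200)
```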

For each prior distribution, we tested the hypothesis H0:g~=g\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$H_0 : \tilde{g} = g$$\end{document} against H1:g~g\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$H_1 : \tilde{g} \ne g$$\end{document} using the two-sample Kolmogorov-Smirnov (KS) test. For the second and third priors, we ran an additional 1000 iterations of the Gibbs sampler. In each iteration, we generated one PV for each person, drew a sample of size N from the prior, and computed the KS test statistic. Thus, we obtained 1000 replications of the test statistic, which were then averaged. The results are shown in Table 1 and confirm that prior divergence decreases as more flexible prior distributions are used.

Our main concern is whether or not the PV distribution converges to the true ability distribution. Since we do not know the true ability distribution, we compare our results with the best guess that we have, i.e., the distribution of PVs obtained by using n=26\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n = 26$$\end{document} items and the PCA regression prior. We repeated the procedure to obtain Table 1, but instead of comparing the generated PVs with draws from the prior, we compared the generated PVs with the PVs generated using n=26\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n = 26$$\end{document} items and the PCA regression prior. The results in Table 2 show that the PV distributions converge to a single (true) distribution as n increases and/or the prior becomes more flexible.

Table 2. Average values of KS test statistic using PISA data to compare g~\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{g}$$\end{document} using different prior distributions with the best guess.

It is important to note that there is a limit to the number of parameters that we can estimate, and thus to the amount of flexibility that we can achieve in practice. This can be seen in Example 4. For n=2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=2$$\end{document}, Table 2 seems to suggest that the normal prior works better than the more flexible PCA regression prior. This counterintuitive result only holds for n=2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=2$$\end{document} and is due to the poor estimation of hyper-parameters that results when both N and n are small. The normal prior has just two parameters, μ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu $$\end{document} and σ2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sigma ^2$$\end{document}, whereas the PCA regression prior has 52 parameters, β={β0,β1,...,β50}\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\beta } = \{\beta _0\text {, }\beta _1\text {, }...\text {, }\beta _{50}\}$$\end{document} and σ2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sigma ^2$$\end{document}. Since the standard errors accumulate for the generated PV distributions, we expect to observe larger variations in the generated PV distributions using the PCA regression prior. These larger variations are reflected in the value of the KS test statistic.

5.3. What if we miss a covariate?

A remarkable feature of Example 1 is that the PV distribution reveals the difference between boys and girls even though sex was not included as a covariate in the population model. This is consistent with our results. Given the conditions of the Monotone Convergence Theorem, the distribution of plausible values

g~(θz1,z2)=xg(θx,z2)Pf(xz1,z2)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \tilde{g}(\theta \mid z_1\text {, } z_2)=\sum _{\mathbf {x}} g(\theta \mid \mathbf {x}\text {, }z_2)P_f(\mathbf {x}\mid z_1\text {, }z_2) \end{aligned}$$\end{document}

is a consistent estimator of the population distribution f(θz1,z2)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$f(\theta \mid z_1\text {, }z_2)$$\end{document}, for sets of covariates z1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$z_1$$\end{document} and z2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$z_2$$\end{document}, even when z2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$z_2$$\end{document} is the empty set (i.e., if we miss all covariates). It also means that a secondary analyst who happens to observe the student’s sex will, when n is sufficiently large, recover the difference between boys and girls even if the PVs have been generated with a population model that contains no covariates at all.
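A small simulation along these lines illustrates the point (hypothetical item and group parameters, reusing the draw_pv helper from the sketch after Example 2): PVs generated under a standard normal prior that ignores group membership still reveal the group difference, although attenuated for finite n.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setting: two groups whose true ability distributions differ by 0.5
n, N = 40, 2000
a = rng.uniform(0.8, 2.0, size=n)                  # item discriminations
b = rng.normal(0.0, 1.0, size=n)                   # item difficulties
group = rng.integers(0, 2, size=N)                 # covariate that the prior will ignore
theta_true = rng.normal(-0.25 + 0.5 * group, 0.6)  # true f has group-specific means

p = 1.0 / (1.0 + np.exp(-a * (theta_true[:, None] - b)))
X = (rng.uniform(size=(N, n)) < p).astype(int)

# PVs drawn under a standard normal prior that contains no covariates at all
# (draw_pv as defined in the sketch after Example 2)
pvs = np.array([draw_pv(X[i], a, b) for i in range(N)])

print("mean PV, group 0:", round(pvs[group == 0].mean(), 3))
print("mean PV, group 1:", round(pvs[group == 1].mean(), 3))
# The group difference reappears in the PV means, attenuated for finite n.
```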

Figure 3. Plausible value distributions of boys and girls with and without gender as a covariate in the PISA example.

Example 5

(PISA Continued) We look at the ability distributions of boys and girls in Canada who took booklet 6, using PISA’s final student weights. To generate the PVs, we consider two prior distributions: the flexible N(μ,σ2)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {N}(\mu \text {, }\sigma ^2)$$\end{document} prior distribution without covariates, and the N(Λ^β,σ2)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {N}(\widehat{\varvec{\Lambda }}\varvec{\beta }\text {, }\sigma ^2)$$\end{document} prior distribution which included gender as a predictor (i.e., it was a covariate in the PCA).

Figure 3 shows the PV distributions of boys and girls weighted by the PISA student weights. It is clear that the weighted distributions of PVs under the two prior distributions are indistinguishable, apart from sampling error. We also see that the girls perform better than the boys. The weighted average ability for the boys was estimated at 0.180 and that of the girls at 0.242. The weighted standard deviation of ability for the boys was estimated at 0.304 and that of the girls at 0.282. Note that the differences in variances between boys and girls would not have been found in a latent regression model unless they had been explicitly modeled.
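The group summaries reported above are ordinary design-weighted statistics of the PVs. The sketch below shows how such weighted means and standard deviations can be computed; the PVs, weights, and gender indicator are placeholders, not the actual PISA values.

```python
import numpy as np

def weighted_mean_sd(pv, w):
    """Design-weighted mean and standard deviation of a set of plausible values."""
    mean = np.average(pv, weights=w)
    sd = np.sqrt(np.average((pv - mean) ** 2, weights=w))
    return mean, sd

# Placeholder data standing in for the PVs, the final student weights, and a gender indicator
rng = np.random.default_rng(4)
pv = rng.normal(0.2, 0.3, size=1738)
w = rng.uniform(0.5, 2.0, size=1738)
girl = rng.integers(0, 2, size=1738).astype(bool)

for label, idx in [("boys", ~girl), ("girls", girl)]:
    m, s = weighted_mean_sd(pv[idx], w[idx])
    print(f"{label}: weighted mean = {m:.3f}, weighted sd = {s:.3f}")
```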

What it means for n to be “sufficiently large” depends on the effect of the covariate on the distribution of Θ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Theta $$\end{document}; that is, for large effects relatively many items are needed, and for small effects relatively few items are needed. It also depends on the population model. Institutions that release PVs typically include a large set of covariates in the population model, on the argument that any covariate that a secondary analyst might be interested in must be included, directly or by proxy, to avoid bias in secondary analysis of the PVs. Schofield, Junker, Taylor, and Black (2015) make this claim precise and, in accordance with our results, argue that bias should vanish when n\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n \rightarrow \infty $$\end{document}. We agree with the current practice of including as many covariates as possible because it reduces prior divergence, but note that a flexible prior with or without covariates can be used to the same effect. A simple extension of Example 1 would illustrate, for instance, that, if a binary predictor is excluded from the population model, the correct coefficient will be recovered even for small n when the prior distribution is a mixture of two normal distributions.

If the population model is a regression model in which a covariate is missing, this may not only lead to bias in the PV distributions but may also lead to bias in parameter estimates for effects that are part of the model (see Footnote 3), or one might not observe that the missing covariate makes the unknown f skewed. This means that we run the risk of drawing incorrect inferences about the unknown f if we rely on the population model alone. It follows from our results that the marginal distribution of the PVs will always be a better estimate of f than the population model is in this situation, even if we do not recover the correct regression coefficient of the missing covariate.

6. Discussion

In this paper, we have proved that, under mild regularity conditions, the empirical distribution of the PVs is a consistent estimator of the distribution of ability in the population, and that convergence is monotone in an embedding in which the number of items tends to infinity. In plain words, this implies that we can use PVs to learn about the true distribution of ability in the population. We have used this result to clear up some of the misconceptions about PVs, and also to show how they can be used in the analyses of educational surveys. Thus far, PVs have been used in educational surveys mostly to simplify secondary analyses. Our result suggests that the distribution of PVs could play the leading role, using the population model merely as a vehicle to produce PVs.

The population model is properly seen as a prior, and the consistency of the PV distribution as an estimator of the true distribution is essentially the common result that the data overrule the prior when the number of observations increases. We have demonstrated that convergence of the PV distribution to the true distribution of ability can be improved if we estimate the parameters λ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\lambda }$$\end{document} of the prior distribution, but this does not imply that it makes sense to interpret the estimates λ^\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\widehat{\varvec{\lambda }}$$\end{document} when the prior distribution is misspecified. Technically, as the number of persons in the sample, N, tends to infinity, λ^\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\widehat{\varvec{\lambda }}$$\end{document} converges to the parameter values that minimize the prior divergence w.r.t. the true ability distribution (White, 1982). However, when the prior distribution is misspecified and prior divergence is not zero, the result of White (1982) does not tell us how wrong our conclusions are when inference is based on λ^\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\widehat{\varvec{\lambda }}$$\end{document}.

In closing, we mention a limitation of our results. Our results imply that if the sequence of posteriors converges to a degenerate distribution as n tends to infinity, then the marginal distribution of PVs converges to the unknown f. For models where this “if” part has been established, our results (i.e., Theorems 3 and 4) apply.

Appendix 1: The regularity conditions in Chang and Stout (1993)

In order to prove their Theorem 2, Chang and Stout (Reference Chang and Stout1993) require five regularity conditions. Before we give these conditions, we need to fix some notation. Let Xi\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$X_i$$\end{document} denote the response of a person to an item i, where Xi=1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$X_i = 1$$\end{document} denotes a correct response and Xi=0\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$X_i=0$$\end{document} an incorrect response, where

(5)Xi=1with probabilityPi(θ)=P(Xi=1θ),0with probability1-Pi(θ)=P(Xi=0θ),\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} X_i = {\left\{ \begin{array}{ll} 1 &{} \text { with probability } P_i(\theta ) = P(X_i = 1 \mid \theta ),\\ 0 &{} \text { with probability } 1-P_i(\theta ) = P(X_i = 0\mid \theta ), \end{array}\right. } \end{aligned}$$\end{document}

where Pi(θ)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$P_i(\theta )$$\end{document} denotes the probability of a correct response for a person with ability θ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta $$\end{document}, and θ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta $$\end{document} is unknown and has the domain (-,)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(-\infty \text {, }\infty )$$\end{document} or some subinterval thereof. Chang and Stout (Reference Chang and Stout1993, p. 38) made two assumptions about the unidimensional IRT model:

  1. Local independence:

    P(X1=x1,,Xn=xnθ)=i=1nPi(θ)xi(1-Pi(θ))1-xi.\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} P(X_1 = x_1\text {, }\ldots \text {, }X_n = x_n\mid \theta ) = \prod _{i=1}^n P_i(\theta )^{x_i}(1-P_i(\theta ))^{1-x_i}. \end{aligned}$$\end{document}
  2. Monotonicity: each Pi(θ)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$P_i(\theta )$$\end{document} is strictly increasing in θ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta $$\end{document}.

Note that these conditions are standard assumptions in parametric unidimensional IRT models, and are satisfied for the commonly used One-, Two- and Three-parameter logistic and normal ogive models.

Fix any θ0Θ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _0 \in \Theta $$\end{document} (the latent space), then the five regularity conditions are as follows (Chang & Stout, Reference Chang and Stout1993, pp. 42–43):

  1. (A1) Let θΘ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta \in \Theta $$\end{document}, where Θ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Theta $$\end{document} is (-,)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(-\infty \text {, }\infty )$$\end{document} or a bounded or unbounded interval of (-,)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(-\infty \text {, }\infty )$$\end{document}. Let the prior density f(θ)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$f(\theta )$$\end{document} be continuous and positive at θ0\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _0$$\end{document}, where θ0\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _0$$\end{document} is assumed to be the true value of θ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta $$\end{document}.

  2. (A2) Pi(θ)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$P_i(\theta )$$\end{document} is twice continuously differentiable and Pi′(θ)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$P_i^{\prime }(\theta )$$\end{document} and Pi′′(θ)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$P_i^{\prime \prime }(\theta )$$\end{document} are bounded in absolute value uniformly with respect to both θ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta $$\end{document} and i in some closed interval N0\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N_0$$\end{document} of θ0∈Θ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _0 \in \Theta $$\end{document}.

  3. (A3) For every fixed θθ0\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta \ne \theta _0$$\end{document}, assume for some given c(θ)>0\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c(\theta ) > 0$$\end{document},

    limn¯1ni=1nEθ0(lnPi(θ)Xi(1-Pi(θ))1-XiPi(θ0)Xi(1-Pi(θ0))1-Xi)≤-c(θ),\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \overline{\lim _{n\rightarrow \infty }} \text { }\frac{1}{n} \sum _{i=1}^n \mathbb {E}_{\theta _0}\left( \ln \frac{P_i(\theta )^{X_i}(1-P_i(\theta ))^{1-X_i}}{P_i(\theta _0)^{X_i}(1-P_i(\theta _0))^{1-X_i}}\right) \le -c(\theta ), \end{aligned}$$\end{document}
    and
    supiλi(θ)=supilogPi(θ)1-Pi(θ)<.\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \sup _i\left| \lambda _i(\theta )\right| = \sup _i\left| \log \left( \frac{P_i(\theta )}{1-P_i(\theta )}\right) \right| < \infty . \end{aligned}$$\end{document}
    (For a sequence of real numbers {an}\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\{a_n\}$$\end{document}, if limnan\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\lim _{n\rightarrow \infty } a_n$$\end{document} does not exist, then {an}\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\{a_n\}$$\end{document} must have more than one limit point. In this case, limn¯an\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\overline{\lim _{n\rightarrow \infty }} \text { }a_n$$\end{document} denotes the largest such or upper limit point. Also, Eθ0(W)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathbb {E}_{\theta _0}(W)$$\end{document} denotes the expectation of W with θ=θ0\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta = \theta _0$$\end{document} assumed.)
  4. (A4) {Ii(θ)}\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\{I^{\prime }_i(\theta )\}$$\end{document} and {λi(θ)}\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\{\lambda _i^{\prime }(\theta )\}$$\end{document} and {λi(θ)}\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\{\lambda _i^{\prime \prime }(\theta )\}$$\end{document} are bounded in absolute value uniformly in i and θN0\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta \in N_0$$\end{document}, where N0\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N_0$$\end{document} is specified in (A2) above.

  5. (A5)

    lim infn1nI(n)(θ0)>c(θ0)>0.\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \liminf _{n \rightarrow \infty } \frac{1}{n} I^{(n)}(\theta _0) > c(\theta _0) > 0. \end{aligned}$$\end{document}
    That is, asymptotically, the average information at θ0\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _0$$\end{document} is bounded away from 0.

Note that we have used f(θ)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$f(\theta )$$\end{document} to denote the prior density and i to index the items, whereas Chang and Stout (Reference Chang and Stout1993) used Π(θ)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Pi (\theta )$$\end{document} and j, respectively, in their manuscript.

Appendix 2: Details about the PISA analyses

We used item response data from Booklet 6 in the 2006 PISA cycle. Specifically, we used the responses from N=1768\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N = 1768$$\end{document} Canadian students to n=28\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n = 28$$\end{document} items intended to assess reading ability. The data of 30 students were omitted due to missing responses, and we fitted a One Parameter Logistic Model (OPLM; Verhelst & Glas, Reference Verhelst, Glas, Fischer and Molenaar1995) on data from the remaining N=1738\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N = 1738$$\end{document} students.

The item difficulties were estimated using conditional maximum likelihood, and the item discriminations were estimated using marginal maximum likelihood, with the OPLM package (Verhelst, Glas, & Verstralen, 1995). We used cross-validation for estimation of the (discrete) item discriminations. First, the discriminations were estimated based on data from a random selection of 1200 students. At this stage, we deleted two items that did not fit the scale (items 6 and 8). The remaining n=26\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n = 26$$\end{document} items scaled reasonably well in this sample, R1C=133.067\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R_{1C}= 133.067$$\end{document}, df=90\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$df =90$$\end{document}, p=0.0022\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p = 0.0022$$\end{document} (for a description of the R1C\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R_{1C}$$\end{document} statistic see Verhelst et al., 1995). Second, the parameters were validated on data from the remaining 538 students, and scaled well, R1C=118.686,df=90,p=0.0231\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R_{1C}= 118.686, df =90, p=0.0231$$\end{document}.

The estimated item parameters are shown in Table 3, where a indicates item discrimination and b item difficulty (category thresholds for polytomous items). For polytomous items, score categories are indicated within parentheses after the item number.

Table 3. Parameters of the estimated IRT model for the PISA example.

Footnotes

1 The prior g usually conditions on a large set of covariates Z\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathbf {Z}$$\end{document} and in the parlance of educational surveys is known as the population or conditioning model for the survey. To avoid excessive notation, we will present the main results without explicitly mentioning the conditioning on covariates.

2 We assume in this paper that we obtain a simple random sample from P(Xf)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$P(\mathbf {X}_f)$$\end{document} (i.e., f(θ)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$f(\theta )$$\end{document}). In educational surveys, one typically obtains non-simple random samples. We note that our results generalize to the latter situation.

3 The simplest example would be a prior where the mean is assumed to be equal to zero and one estimates the variance. If the true mean is not equal to zero, the variance estimate will be biased.

References

Chang, H. (1996). The asymptotic posterior normality of the latent trait for polytomous IRT models. Psychometrika, 61(3), 445–463.
Chang, H., & Stout, W. (1993). The asymptotic posterior normality of the latent trait in an IRT model. Psychometrika, 58, 37–52.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley-Interscience.
DasGupta, A. (2008). Asymptotic theory of statistics and probability. New York: Springer.
Gelman, A., Carlin, B., Stern, H., & Rubin, D. (2004). Bayesian data analysis. Boca Raton: Chapman & Hall/CRC.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Kreiner, S., & Christensen, K. (2014). Analyses of model fit and robustness. A new look at the PISA scaling model underlying ranking of countries according to reading literacy. Psychometrika, 79(2), 210–231.
Kullback, S., & Leibler, R. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.
Mislevy, R. (1991). Randomization-based inference about latent variables from complex samples. Psychometrika, 56, 177–196.
Mislevy, R., Beaton, A., Kaplan, B., & Sheehan, K. (1993). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29, 133–161.
Schofield, L., Junker, B., Taylor, L., & Black, D. (2015). Predictive inference using latent variables with covariates. Psychometrika, 80(3), 727–747.
Venkatesh, S. (2013). The theory of probability: Explorations and applications. Cambridge: Cambridge University Press.
Verhelst, N., & Glas, C. (1995). The one parameter logistic model: OPLM. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments and applications (pp. 215–238). New York: Springer.
Verhelst, N., Glas, C., & Verstralen, H. (1995). OPLM: Computer program and manual (Computer software manual). Arnhem: Cito.
Warm, T. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427–450.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25.