
A Simple Method for Comparing Complex Models: Bayesian Model Comparison for Hierarchical Multinomial Processing Tree Models Using Warp-III Bridge Sampling

Published online by Cambridge University Press:  01 January 2025

Quentin F. Gronau*
Affiliation:
University of Amsterdam
Eric-Jan Wagenmakers
Affiliation:
University of Amsterdam
Daniel W. Heck
Affiliation:
University of Mannheim
Dora Matzke
Affiliation:
University of Amsterdam
Correspondence should be made to Quentin F. Gronau, University of Amsterdam, Nieuwe Achtergracht 129 B, 1018 WT Amsterdam, The Netherlands. Email: Quentin.F.Gronau@gmail.com

Abstract

Multinomial processing trees (MPTs) are a popular class of cognitive models for categorical data. Typically, researchers compare several MPTs, each equipped with many parameters, especially when the models are implemented in a hierarchical framework. A Bayesian solution is to compute posterior model probabilities and Bayes factors. Both quantities, however, rely on the marginal likelihood, a high-dimensional integral that cannot be evaluated analytically. In this case study, we show how Warp-III bridge sampling can be used to compute the marginal likelihood for hierarchical MPTs. We illustrate the procedure with two published data sets and demonstrate how Warp-III facilitates Bayesian model averaging.

Type
Original Paper
Creative Commons
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Copyright
Copyright © 2018 The Psychometric Society

Multinomial processing trees (MPTs; e.g., Riefer & Batchelder, 1988) are substantively motivated stochastic models for the analysis of categorical data. MPTs allow researchers to test theories about cognitive architecture by formalizing qualitatively different cognitive processes that underlie performance in an experimental paradigm. MPTs are popular in various areas of psychology and have been applied, for instance, in research on memory, perception, logical reasoning, and attitudes (for reviews, see Batchelder & Riefer, 1999; Erdfelder et al., 2009; Hütter & Klauer, 2016). MPTs are related to tree-based item response theory models as presented, for instance, in Böckenholt (2012a, b), Culpepper (2014), and De Boeck and Partchev (2012).

Traditionally, parameter estimation in MPTs has relied on maximum likelihood methods for aggregated data (Hu & Batchelder, 1994; Singmann & Kellen, 2013). Recently, however, MPT modelers have become increasingly interested in using Bayesian hierarchical methods to examine individual differences in model parameters (Klauer, 2010; Matzke, Dolan, Batchelder, & Wagenmakers, 2015; Smith & Batchelder, 2010). Bayesian hierarchical modeling allows researchers to simultaneously account for the differences and similarities between participants and typically provides more accurate statistical inference than the analysis of aggregated data, especially in situations with moderate between-subject variability and scarce participant-level data (e.g., Gelman & Hill, 2007).

In typical applications, MPT modelers are interested in comparing a limited set of models. The models can be nested, which is the case when testing parameter constraints (e.g., Batchelder & Riefer, 1990; Singmann, Kellen, & Klauer, 2013), or non-nested, which is the case when comparing structurally different models (e.g., Fazio, Brashier, Payne, & Marsh, 2015; Kellen, Singmann, & Klauer, 2014). A wide range of model comparison and assessment methods exist in both the frequentist and the Bayesian framework, each with its own goals and operating characteristics, such as Pearson's $\chi^2$ test, the likelihood ratio test, information criteria such as AIC (Akaike, 1973), BIC (Schwarz, 1978), DIC (Spiegelhalter, Best, Carlin, & van der Linde, 2002), and WAIC (Watanabe, 2010), leave-one-out cross-validation (Vehtari, Gelman, & Gabry, 2017), and posterior predictive checks (Gelman, 2013; Meng, 1994; Robins, van der Vaart, & Ventura, 2000). Furthermore, a range of powerful methods exist for analyzing multinomial data in particular (e.g., Bishop, Fienberg, & Holland, 1975; Maydeu-Olivares & Joe, 2005). The goal of this case study is to enrich the model comparison toolkit of MPT modelers by illustrating, with examples from the literature, a computationally feasible approach to model comparison in hierarchical MPTs based on Bayes factors and posterior model probabilities. The proposed approach also enables Bayesian model averaging, which we advocate as a principled way of testing parameter constraints while fully taking into account model uncertainty.

Suppose one is interested in comparing a discrete set of $M$ models denoted $\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_M$ with corresponding prior model probabilities $p(\mathcal{M}_1), p(\mathcal{M}_2), \ldots, p(\mathcal{M}_M)$, which satisfy the constraints $p(\mathcal{M}_i) \ge 0 \;\; \forall i \in \{1, 2, \ldots, M\}$ and $\sum_{i=1}^{M} p(\mathcal{M}_i) = 1$. The posterior model probability of $\mathcal{M}_i$ is then obtained using Bayes' rule:

(1) $$\underbrace{p(\mathcal{M}_i \mid \text{data})}_{\text{posterior model probability}} = \underbrace{\frac{p(\text{data} \mid \mathcal{M}_i)}{\sum_{j=1}^{M} p(\text{data} \mid \mathcal{M}_j) \, p(\mathcal{M}_j)}}_{\text{updating factor}} \times \underbrace{p(\mathcal{M}_i)}_{\text{prior model probability}},$$

where $p(\text{data} \mid \mathcal{M}_i)$ is the marginal likelihood of model $\mathcal{M}_i$.
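To make Eq. (1) concrete, the following minimal Python sketch converts log marginal likelihoods and prior model probabilities into posterior model probabilities. The numerical values are purely hypothetical, and the computation is carried out in log space because marginal likelihoods of hierarchical models are typically vanishingly small.

```python
import numpy as np

def posterior_model_probs(log_marglik, prior_probs):
    """Posterior model probabilities via Eq. (1), computed in log space
    with the log-sum-exp trick to avoid numerical underflow."""
    log_w = np.asarray(log_marglik) + np.log(prior_probs)
    log_w -= log_w.max()
    w = np.exp(log_w)
    return w / w.sum()

# Hypothetical log marginal likelihoods for three models, equal prior probabilities:
log_ml = np.array([-1050.2, -1047.9, -1053.6])
print(posterior_model_probs(log_ml, np.array([1, 1, 1]) / 3))
# Ratio of marginal likelihoods of models 2 and 1 (cf. the Bayes factor in Eq. (2) below):
print(np.exp(log_ml[1] - log_ml[0]))
```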

If model comparison involves assessing the tenability of parameter constraints in a set of nested models, posterior model probabilities can be used to quantify the model-averaged evidence that a parameter is free to vary or should be constrained across different groups or experimental conditions (e.g., Hoeting, Madigan, Raftery, & Volinsky, 1999; Rouder, Morey, Verhagen, Swagman, & Wagenmakers, 2017). If the model comparison involves only two models, $\mathcal{M}_1$ and $\mathcal{M}_2$, it is convenient to consider the odds of one model over the other. Bayes' rule yields:

(2) $$\underbrace{\frac{p(\mathcal{M}_1 \mid \text{data})}{p(\mathcal{M}_2 \mid \text{data})}}_{\text{posterior odds}} = \underbrace{\frac{p(\text{data} \mid \mathcal{M}_1)}{p(\text{data} \mid \mathcal{M}_2)}}_{\text{Bayes factor } \text{BF}_{12}} \times \underbrace{\frac{p(\mathcal{M}_1)}{p(\mathcal{M}_2)}}_{\text{prior odds}}.$$

Equation (2) shows that the change in odds brought about by the data is given by the ratio of the marginal likelihoods of the models, a quantity known as the Bayes factor (Etz & Wagenmakers, 2017; Jeffreys, 1961; Kass & Raftery, 1995; Ly, Verhagen, & Wagenmakers, 2016).

Equations (1) and (2) illustrate that posterior model probabilities and Bayes factors both require the marginal likelihood of each model. The marginal likelihood is obtained by integrating out the model parameters with respect to the parameters' prior distribution:

(3) $$p(\text{data} \mid \mathcal{M}_i) = \int_{\boldsymbol{\Theta}} p(\text{data} \mid \boldsymbol{\theta}, \mathcal{M}_i) \, p(\boldsymbol{\theta} \mid \mathcal{M}_i) \, \text{d}\boldsymbol{\theta}.$$

The marginal likelihood includes a natural penalty for undue model complexity and implements a form of the principle of parsimony also known as Occam's razor (e.g., Jefferys & Berger, 1992; Myung & Pitt, 1997; Vandekerckhove, Matzke, & Wagenmakers, 2015). Although conceptually straightforward, computing Bayes factors and posterior model probabilities for hierarchical MPTs is challenging in practice because the marginal likelihood features a high-dimensional integral that cannot be evaluated analytically.
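Equation (3) can in principle be approximated by simple Monte Carlo: draw parameters from the prior and average the likelihood over the draws. The toy sketch below, a hypothetical conjugate example and not the method used in this article, illustrates this reading of the marginal likelihood as a prior-weighted average of the likelihood; the approach breaks down in the high-dimensional hierarchical setting, which motivates bridge sampling.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Toy conjugate example: binomial data, uniform Beta(1, 1) prior on theta.
k, n = 7, 10
theta = rng.beta(1.0, 1.0, size=100_000)        # draws from the prior
estimate = stats.binom.pmf(k, n, theta).mean()  # average likelihood over the prior
exact = stats.betabinom.pmf(k, n, 1.0, 1.0)     # analytic result: 1 / (n + 1)
print(estimate, exact)
```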

In this case study, we show how Warp-III bridge sampling (Meng & Schilling, 2002; Meng & Wong, 1996; henceforth referred to as Warp-III) can be used to estimate the marginal likelihood for hierarchical MPTs. Warp-III may be used for nested and, crucially, also non-nested model comparisons, for which simpler methods, such as the Savage–Dickey density ratio (Dickey & Lientz, 1970), cannot be applied. Importantly, Warp-III is not specific to hierarchical MPTs; it may be used to compute the marginal likelihood for a wide range of complex cognitive models. In fact, Warp-III improves upon simpler bridge sampling techniques (e.g., DiCiccio, Kass, Raftery, & Wasserman, 1997; Gronau et al., 2017) by respecting potential skewness in the posterior distribution, a typical consequence of estimating parameters of cognitive models from scarce data (e.g., Ly et al., 2018; Matzke et al., 2015). Due to its accuracy and relatively straightforward implementation, we believe that Warp-III is a promising and timely addition to the Bayesian toolkit of cognitive modelers in general and MPT modelers in particular.

The article is organized as follows. We first introduce the latent-trait approach to hierarchical MPTs. We then demonstrate how Warp-III can be used to estimate the marginal likelihood for latent-trait MPTs. Lastly, we apply the method to two model comparison problems from published studies. The first example focuses on Bayesian model averaging for nested models; the second example focuses on the computation of the Bayes factor for non-nested models.

1. Multinomial Processing Trees

Data for MPTs consist of categorical responses from several participants to a set of items. MPTs are based on the assumption that these responses follow a multinomial distribution. MPTs reparametrize the category probabilities of the multinomial distribution in terms of model parameters that represent the probabilities of latent cognitive processes (Riefer & Batchelder, 1988).

Consider the pair-clustering MPT depicted in Fig. 1. The model was developed for the measurement of the storage and retrieval processes that determine the recall of semantically related word pairs (Batchelder & Riefer, 1980). A typical pair-clustering study involves a free recall memory experiment, where participants are presented with a list of study words in a word-by-word fashion. The study list consists of two types of items: semantically related word pairs, such as knife–fork, and words without a category partner (i.e., singletons), such as dog. After the study phase, participants are required to recall as many of the study words as they can. Typically, semantically related word pairs are recalled consecutively as a "pair-cluster."

The model represents the interplay between the hypothesized latent cognitive processes in a rooted tree structure. The pair-clustering MPT features $K = 2$ independent category systems. Each category system corresponds to a separate multinomial distribution: one for word pairs ($k = 1$) and one for singletons ($k = 2$). The category probabilities in each system are modeled using a separate subtree with a finite number of branches.

Figure 1. Pair-clustering MPT. Available at https://tinyurl.com/yb7bma4e under CC license https://creativecommons.org/licenses/by/2.0/.

Each branch of a subtree corresponds to a specific sequence of processing stages and terminates in one of $L_k$ possible response categories denoted $C_{kl}$, where $l = 1, \ldots, L_k$ indexes the $l$th of $L_k$ possible responses in subtree $k$. In the pair-clustering MPT, the recall of word pairs is scored into $L_1 = 4$ categories: (1) both words of the pair are recalled consecutively ($C_{11}$); (2) both words are recalled, but not consecutively ($C_{12}$); (3) only one word is recalled ($C_{13}$); (4) no word is recalled ($C_{14}$). The recall of singletons is scored into $L_2 = 2$ response categories: (1) the word is recalled ($C_{21}$); (2) the word is not recalled ($C_{22}$).

The response category probabilities are expressed as a function of the MPT parameters, $\theta_p \in (0, 1) \;\; \forall p \in \{1, 2, \ldots, P\}$, which can be collected in a vector $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_P)$. The pair-clustering MPT features four parameters: $\boldsymbol{\theta} = (c, r, u, a)$. The cluster-storage parameter $c$ corresponds to the probability that a word pair is stored as a cluster in memory. The cluster-retrieval parameter $r$ corresponds to the conditional probability that a clustered word pair is retrieved from memory during the test phase. The model assumes that stored and retrieved word clusters are always recalled consecutively. The storage-retrieval parameter $u$ corresponds to the conditional probability that a member of a word pair is stored and retrieved, given that the word pair was not clustered. The model makes the simplifying assumption that words from unclustered pairs are never recalled consecutively. The singleton storage-retrieval parameter $a$ corresponds to the probability that a singleton is stored and retrieved. In many applications, researchers impose the constraint that $a = u$.

The response category probabilities are obtained as follows. First, we obtain the probability of each branch that terminates in a given response category. Let $B_{klm}$ denote the $m$th of $M_{kl}$ branches that terminate in response category $C_{kl}$. The probability of branch $B_{klm}$ is obtained by traversing the tree from root to leaf and multiplying the encountered parameters:

(4) $$\text{Pr}(B_{klm} \mid \boldsymbol{\theta}) = \prod_{p=1}^{P} \theta_p^{v_{klmp}} \, (1 - \theta_p)^{w_{klmp}},$$

where $v_{klmp} \ge 0$ and $w_{klmp} \ge 0$ are the numbers of nodes on branch $B_{klm}$ that are related to parameter $\theta_p$, $p = 1, \ldots, P$, and to $1 - \theta_p$, respectively. Second, we sum the probabilities of the $M_{kl}$ branches that terminate in $C_{kl}$:

(5) $$\text{Pr}(C_{kl} \mid \boldsymbol{\theta}) = \sum_{m=1}^{M_{kl}} \text{Pr}(B_{klm} \mid \boldsymbol{\theta}).$$

For instance, the probability of response category $C_{14}$ is given by $\text{Pr}(C_{14} \mid \boldsymbol{\theta}) = c \, (1 - r) + (1 - c) \, (1 - u)^2$.
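As a concrete illustration, the following sketch spells out all six category probabilities that follow from the tree in Fig. 1 and the assumptions stated above (clustered pairs are recalled consecutively or not at all; unclustered pairs are never recalled consecutively); the parameter values in the usage line are arbitrary.

```python
def pair_clustering_probs(c, r, u, a):
    """Category probabilities of the pair-clustering MPT (cf. Fig. 1)."""
    word_pairs = [
        c * r,                                 # C11: clustered and retrieved
        (1 - c) * u ** 2,                      # C12: unclustered, both recalled
        (1 - c) * 2 * u * (1 - u),             # C13: unclustered, one recalled
        c * (1 - r) + (1 - c) * (1 - u) ** 2,  # C14: nothing recalled
    ]
    singletons = [a, 1 - a]                    # C21: recalled; C22: not recalled
    return word_pairs, singletons

# Arbitrary parameter values; each subtree's probabilities sum to one.
pairs, singles = pair_clustering_probs(c=0.5, r=0.6, u=0.3, a=0.3)
assert abs(sum(pairs) - 1) < 1e-12 and abs(sum(singles) - 1) < 1e-12
```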

The probability of the observed response frequencies across category systems, denoted $\boldsymbol{n} = (n_{11}, \ldots, n_{1L_1}, \ldots, n_{K1}, \ldots, n_{KL_K})$, where $n_{kl}$ is the observed response frequency for category $l = 1, \ldots, L_k$ in category system (subtree) $k = 1, \ldots, K$, is given by a product-multinomial distribution:

(6) $$\text{Pr}(\boldsymbol{N} = \boldsymbol{n} \mid \boldsymbol{\theta}) = \prod_{k=1}^{K} \left\{ \frac{J_k!}{n_{k1}! \times n_{k2}! \times \cdots \times n_{kL_k}!} \prod_{l=1}^{L_k} \left[ \text{Pr}(C_{kl} \mid \boldsymbol{\theta}) \right]^{n_{kl}} \right\},$$

where $J_k$ denotes the number of items in category system $k$ (see also Klauer, 2010; Matzke et al., 2015).
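The product-multinomial likelihood in Eq. (6) is simply a product of independent multinomial likelihoods, one per subtree, as in this minimal sketch using scipy.stats.multinomial; the category counts and probabilities are hypothetical.

```python
import numpy as np
from scipy.stats import multinomial

def product_multinomial_loglik(counts, probs):
    """Log of Eq. (6): independent multinomial likelihoods, one per subtree."""
    return sum(
        multinomial.logpmf(n_k, n=int(np.sum(n_k)), p=p_k)
        for n_k, p_k in zip(counts, probs)
    )

# Hypothetical counts for J_1 = 20 word pairs and J_2 = 10 singletons, with
# category probabilities implied by (c, r, u, a) = (0.5, 0.6, 0.3, 0.3):
counts = [np.array([8, 2, 4, 6]), np.array([3, 7])]
probs = [[0.30, 0.045, 0.21, 0.445], [0.3, 0.7]]
print(product_multinomial_loglik(counts, probs))
```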

1.1. Bayesian Hierarchical MPTs: The Latent-Trait Approach

Bayesian hierarchical approaches explicitly model heterogeneity in participants by introducing a group-level distribution from which the participant-level parameters are drawn (e.g., Gelman & Hill, 2007; Gill, 2002; Lee, 2011; Lee & Wagenmakers, 2013; Rouder & Lu, 2005). Here we focus on Klauer's (2010) latent-trait approach, which relies on a multivariate normal group-level distribution to describe the between-subject variability and the correlations between the participant-level parameters.

To model participant heterogeneity, observed responses are aggregated over items, but not over participants, resulting in a vector of category frequencies for each participant $i$: $\boldsymbol{n}_i$, $i = 1, 2, \ldots, I$, where $I$ is the total number of participants. Each participant obtains a participant-specific parameter vector $\boldsymbol{\theta}_i$ of length $P$.

The latent-trait approach assumes that the probit-transformed participant-level parameter vectors $\boldsymbol{\theta}_i' = \Phi^{-1}(\boldsymbol{\theta}_i)$ follow a $P$-dimensional multivariate normal distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$: $\boldsymbol{\theta}_i' \sim \mathcal{N}_P(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. The probit transformation $\Phi^{-1}(\boldsymbol{\theta}_i)$ is defined component-wise, where $\Phi^{-1}(\cdot)$ corresponds to the inverse of the cumulative distribution function of the standard normal distribution. Priors are assigned to $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. We follow earlier implementations of the latent-trait approach and assign independent standard normal distributions to the $P$ components of $\boldsymbol{\mu}$ (Heck, Arnold, & Arnold, 2018a; Matzke et al., 2015). This choice corresponds to uniform priors on the probability scale for the grand means.
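The correspondence between standard normal priors on the probit scale and uniform priors on the probability scale is easy to verify numerically; a minimal sketch:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu = rng.standard_normal(100_000)  # standard normal prior on the probit scale
p = norm.cdf(mu)                   # implied prior on the probability scale
# p is Uniform(0, 1): each decile contains roughly 10,000 draws.
print(np.histogram(p, bins=10, range=(0, 1))[0])
```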
For the covariance matrix $\boldsymbol{\Sigma}$, a convenient choice would be an inverse Wishart prior with $\nu = P + 1$ degrees of freedom and an identity scale matrix. This setting leads to uniform priors on the correlation parameters; however, it is constraining on the standard deviation parameters. Although changing the degrees of freedom $\nu$ affords more flexibility for modeling the standard deviations, it comes at the cost of constraining the prior on the correlation parameters (Gelman & Hill, 2007).

This dilemma can be circumvented by using a scaled inverse Wishart prior, as introduced by Gelman and Hill (2007) and proposed in the context of hierarchical MPT modeling by Klauer (2010). Compared to a regular inverse Wishart prior, the scaled version has the advantage that it allows one to model the standard deviations more flexibly while retaining the desirable uniform prior on the correlation parameters. The scaled inverse Wishart prior is based on the following decomposition of the covariance matrix $\boldsymbol{\Sigma}$:

(7) $$\boldsymbol{\Sigma} = \text{Diag}(\boldsymbol{\xi}) \, \boldsymbol{Q} \, \text{Diag}(\boldsymbol{\xi}),$$

where $\boldsymbol{\xi}$ is a vector of $P$ scaling parameters and $\boldsymbol{Q}$ corresponds to the $P \times P$ unscaled covariance matrix. The scaled inverse Wishart prior is obtained by placing a regular inverse Wishart prior on the unscaled covariance matrix $\boldsymbol{Q}$ and a suitable prior on the vector of scaling parameters $\boldsymbol{\xi}$.

We follow Klauer (2010) and assign $\boldsymbol{Q}$ an inverse Wishart prior with $\nu = P + 1$ degrees of freedom and scale matrix $\boldsymbol{I}_P$ (i.e., the $P \times P$ identity matrix). For the $P$ components of $\boldsymbol{\xi}$, we follow Heck et al. (2018a) and use independent uniform priors that range from zero to ten. These choices correspond to relatively diffuse priors for the standard deviations of the random effects on the probit scale and uniform priors for the correlations between the random effects.
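A minimal sketch of a single draw from this prior, assuming SciPy's invwishart parameterization; the final assertion illustrates that rescaling by $\text{Diag}(\boldsymbol{\xi})$ changes the standard deviations but leaves the implied correlation matrix, and hence its uniform prior, untouched.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(2)
P = 4                                   # e.g., the four pair-clustering parameters
nu = P + 1                              # degrees of freedom
Q = invwishart.rvs(df=nu, scale=np.eye(P), random_state=rng)  # unscaled covariance
xi = rng.uniform(0, 10, size=P)         # scaling parameters, Uniform(0, 10)
Sigma = np.diag(xi) @ Q @ np.diag(xi)   # Eq. (7)

def corr(S):
    sd = np.sqrt(np.diag(S))
    return S / np.outer(sd, sd)

# Scaling affects the standard deviations but not the correlations:
assert np.allclose(corr(Sigma), corr(Q))
```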

Note that these prior distributions have been proposed in the context of parameter estimation, where the exact choice of the prior is irrelevant as long as sufficiently informative data are available. In contrast, in the context of model comparison, the priors have an important and lasting effect: as shown in Eq. (3), the marginal likelihood is obtained by taking a weighted average of the probability of the data across all possible parameter settings, where the weights correspond to the parameters' prior density. We argue that the standard normal and uniform priors for the grand means and the correlations, respectively, provide a reasonable default setting also from the perspective of model comparison. The choice of the prior for $\boldsymbol{\xi}$ is less straightforward. We report the results corresponding to the default setting of the recently developed MPT software package TreeBUGS (Heck et al., 2018a), but we probed the robustness of our conclusions with a sensitivity analysis using $\xi_p \sim \text{Uniform}(0, \xi_{\text{max}}) \; \forall p \in \{1, 2, \ldots, P\}$ with $\xi_{\text{max}} = 2$ instead of $\xi_{\text{max}} = 10$, a prior that was chosen based on the implied group-level distributions on the probability scale. As the conclusions were unaffected by the choice of the upper bound, the results of the sensitivity analysis are mentioned only briefly and are presented in more detail in the Supplemental Materials available at https://osf.io/rycg6/.

Under these prior settings, the probit-transformed participant-level MPT parameter vectors can be written as:

(8) $$\boldsymbol{\theta}_i' = \boldsymbol{\mu} + \boldsymbol{\xi} \odot \boldsymbol{\omega}_i,$$

where $\boldsymbol{\omega}_i$ is the $P$-dimensional vector of unscaled random effects for participant $i$ and $\odot$ denotes the Hadamard product (i.e., entry-wise multiplication; e.g., Liu & Trenkler, 2008). The unscaled random effects are drawn from a $P$-dimensional zero-centered multivariate normal distribution with covariance matrix $\boldsymbol{Q}$: $\boldsymbol{\omega}_i \sim \mathcal{N}_P(\boldsymbol{0}, \boldsymbol{Q})$.
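Putting the pieces together, this sketch generates hypothetical participant-level MPT parameters under the latent-trait model of Eq. (8); the dimensions, seed, and prior draws are purely illustrative.

```python
import numpy as np
from scipy.stats import norm, invwishart

rng = np.random.default_rng(3)
P, I = 4, 30                               # parameters and participants (illustrative)
mu = rng.standard_normal(P)                # group-level means on the probit scale
Q = invwishart.rvs(df=P + 1, scale=np.eye(P), random_state=rng)
xi = rng.uniform(0, 10, size=P)            # scaling parameters
omega = rng.multivariate_normal(np.zeros(P), Q, size=I)  # unscaled random effects
theta_probit = mu + xi * omega             # Eq. (8): Hadamard product via broadcasting
theta = norm.cdf(theta_probit)             # I x P participant-level parameters in (0, 1)
```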

Note that the model is overparameterized: $\boldsymbol{\xi}$ and $\boldsymbol{Q}$ cannot be interpreted separately. Similarly, the unscaled random effects $\boldsymbol{\omega}_i$ cannot be interpreted on their own but need to be combined with the scaling parameter vector $\boldsymbol{\xi}$ to form the random effects of interest. The scaling parameters $\boldsymbol{\xi}$, the unscaled covariance matrix $\boldsymbol{Q}$, and the unscaled random effects $\boldsymbol{\omega}_i$ are not of interest in themselves and are simply an artifact of using a flexible scaled inverse Wishart prior on $\boldsymbol{\Sigma}$: the parameters of interest are $\boldsymbol{\theta}_i'$, $\boldsymbol{\mu}$, and $\boldsymbol{\Sigma}$. The scaled inverse Wishart prior can therefore be regarded as a form of parameter expansion (e.g., Gelman & Hill, 2007), which has been reported to speed up convergence when fitting the model using Markov chain Monte Carlo sampling (MCMC; e.g., Gamerman & Lopes, 2006).

The reader is referred to Klauer (2010) and Matzke et al. (2015) for a more detailed description of the latent-trait approach. Parameter estimation may proceed using MCMC sampling implemented in standard Bayesian statistical software such as JAGS (Plummer, 2003) or Stan (Stan Development Team, 2016).

1.2. Computing the Marginal Likelihood

The marginal likelihood for latent-trait MPTs is given by:

(9) $$\begin{aligned}
\text{Pr}(\boldsymbol{N} = \boldsymbol{n}) =\;& \int \!\!\cdots\!\! \int \prod_{i=1}^{I} \left[ \overbrace{\text{Pr}(\boldsymbol{N}_i = \boldsymbol{n}_i \mid \boldsymbol{\mu}, \boldsymbol{\xi}, \boldsymbol{\omega}_i)}^{\text{individual-level}} \; \overbrace{p(\boldsymbol{\omega}_i \mid \boldsymbol{Q})}^{\text{group-level}} \right] \overbrace{p(\boldsymbol{Q}) \, p(\boldsymbol{\mu}) \, p(\boldsymbol{\xi})}^{\text{priors}} \, \text{d}\boldsymbol{Q} \, \text{d}\boldsymbol{\mu} \, \text{d}\boldsymbol{\xi} \, \text{d}\boldsymbol{\omega}_1 \cdots \text{d}\boldsymbol{\omega}_I \\
=\;& \int \!\!\cdots\!\! \int \prod_{i=1}^{I} \Bigg[ \underbrace{\prod_{k=1}^{K} \Bigg\{ \frac{J_k!}{n_{ik1}! \times n_{ik2}! \times \cdots \times n_{ikL_k}!} \prod_{l=1}^{L_k} \left[ \text{Pr}(C_{kl} \mid \boldsymbol{\mu}, \boldsymbol{\xi}, \boldsymbol{\omega}_i) \right]^{n_{ikl}} \Bigg\}}_{\text{Pr}(\boldsymbol{N}_i = \boldsymbol{n}_i \mid \boldsymbol{\mu}, \boldsymbol{\xi}, \boldsymbol{\omega}_i)} \\
&\quad \times \underbrace{\left(2\pi\right)^{-\frac{P}{2}} \left|\boldsymbol{Q}\right|^{-\frac{1}{2}} \exp\bigg\{ -\frac{1}{2} \boldsymbol{\omega}_i^\top \boldsymbol{Q}^{-1} \boldsymbol{\omega}_i \bigg\}}_{p(\boldsymbol{\omega}_i \mid \boldsymbol{Q})} \Bigg] \\
&\quad \times \underbrace{\frac{1}{2^{\frac{\nu P}{2}} \Gamma_P(\frac{\nu}{2})} \left|\boldsymbol{Q}\right|^{-\frac{\nu + P + 1}{2}} \exp\left\{ -\frac{1}{2} \text{tr}\left(\boldsymbol{Q}^{-1}\right) \right\}}_{p(\boldsymbol{Q})} \\
&\quad \times \underbrace{\left(2\pi\right)^{-\frac{P}{2}} \exp\bigg\{ -\frac{1}{2} \boldsymbol{\mu}^\top \boldsymbol{\mu} \bigg\}}_{p(\boldsymbol{\mu})} \; \underbrace{\left(\xi_{\text{max}}\right)^{-P}}_{p(\boldsymbol{\xi})} \, \text{d}\boldsymbol{Q} \, \text{d}\boldsymbol{\mu} \, \text{d}\boldsymbol{\xi} \, \text{d}\boldsymbol{\omega}_1 \cdots \text{d}\boldsymbol{\omega}_I,
\end{aligned}$$

where $\Gamma_P(a) = \pi^{P(P-1)/4} \prod_{j=1}^{P} \Gamma\left(a + \frac{1-j}{2}\right)$ and $\Gamma(z) = \int_0^\infty x^{z-1} e^{-x} \, \text{d}x$ are the multivariate and regular gamma function, respectively. In this parametrization, we do not need to explicitly integrate out the participant-level parameter vectors $\boldsymbol{\theta}_i$ since they are functions of $\boldsymbol{\mu}$, $\boldsymbol{\xi}$, and $\boldsymbol{\omega}_i$ (see Eq. (8)).

We exploit the fact that the covariance matrix $\boldsymbol{Q}$ in Eq. (9) can be integrated out in closed form (see also Overstall & Forster, 2010); a detailed derivation is provided in the Supplemental Materials. The marginal likelihood is then given by:

(10) $$\begin{aligned}
\text{Pr}(\boldsymbol{N} = \boldsymbol{n}) =\;& \int \!\!\cdots\!\! \int \prod_{i=1}^{I} \left[ \prod_{k=1}^{K} \Bigg\{ \frac{J_k!}{n_{ik1}! \times n_{ik2}! \times \cdots \times n_{ikL_k}!} \prod_{l=1}^{L_k} \left[ \text{Pr}(C_{kl} \mid \boldsymbol{\mu}, \boldsymbol{\xi}, \boldsymbol{\omega}_i) \right]^{n_{ikl}} \Bigg\} \right] \\
&\quad \times \frac{\Gamma_P(\frac{\nu + I}{2})}{\Gamma_P(\frac{\nu}{2})} \, \frac{\pi^{-\frac{IP}{2}}}{\left| \boldsymbol{\Omega}^\top \boldsymbol{\Omega} + \boldsymbol{I}_P \right|^{\frac{\nu + I}{2}}} \times \left(2\pi\right)^{-\frac{P}{2}} \exp\bigg\{ -\frac{1}{2} \boldsymbol{\mu}^\top \boldsymbol{\mu} \bigg\} \\
&\quad \times \left(\xi_{\text{max}}\right)^{-P} \, \text{d}\boldsymbol{\mu} \, \text{d}\boldsymbol{\xi} \, \text{d}\boldsymbol{\omega}_1 \cdots \text{d}\boldsymbol{\omega}_I,
\end{aligned}$$

where $\boldsymbol{\Omega}$ is an $I \times P$ matrix of the $P$-dimensional random-effects vectors $\boldsymbol{\omega}_i$ of the $I$ participants. Even after integrating out $\boldsymbol{Q}$, the expression for the marginal likelihood is still a high-dimensional integral of $P(I + 2)$ dimensions; for instance, with $P = 3$ MPT parameters and $I = 21$ participants, the integral already has 69 dimensions. The challenge is to find a method that yields accurate estimates of this integral.

2. Warp-III Bridge Sampling for MPTs

We propose to use Warp-III bridge sampling (Meng & Schilling, 2002; Meng & Wong, 1996; Overstall, 2010), an advanced version of bridge sampling, to evaluate the high-dimensional integral in Eq. (10). Bridge sampling is a general method for estimating normalizing constants, a problem encountered not only in Bayesian inference but also in likelihood-based approaches (Gelman & Meng, 1998). We first outline the basic principles of bridge sampling and then present the details of the advanced Warp-III method. The reader is referred to the tutorial by Gronau et al. (2017) for a detailed explanation of the general bridge sampling approach.

Let $\boldsymbol{\zeta} = (\boldsymbol{\mu}, \boldsymbol{\xi}, \boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_I)$ be the vector of quantities that must be integrated out to obtain the marginal likelihood, so that

(11) $$\text{Pr}(\boldsymbol{N} = \boldsymbol{n}) = \int \text{Pr}(\boldsymbol{N} = \boldsymbol{n} \mid \boldsymbol{\zeta}) \, p(\boldsymbol{\zeta}) \, \text{d}\boldsymbol{\zeta}.$$

General bridge sampling is based on the following identity:

(12) $$1 = \frac{\int \overbrace{h(\boldsymbol{\zeta})}^{\text{bridge function}} \, p(\boldsymbol{\zeta} \mid \boldsymbol{N} = \boldsymbol{n}) \, \overbrace{g(\boldsymbol{\zeta})}^{\text{proposal distribution}} \, \text{d}\boldsymbol{\zeta}}{\int h(\boldsymbol{\zeta}) \, \underbrace{p(\boldsymbol{\zeta} \mid \boldsymbol{N} = \boldsymbol{n})}_{\text{posterior distribution}} \, g(\boldsymbol{\zeta}) \, \text{d}\boldsymbol{\zeta}},$$

where $p(\boldsymbol{\zeta} \mid \boldsymbol{N} = \boldsymbol{n})$ is the posterior distribution of $\boldsymbol{\zeta}$, $g(\boldsymbol{\zeta})$ is the probability density function of a proposal distribution, and $h(\boldsymbol{\zeta})$ is a function such that $0 < \left| \int h(\boldsymbol{\zeta}) \, p(\boldsymbol{\zeta} \mid \boldsymbol{N} = \boldsymbol{n}) \, g(\boldsymbol{\zeta}) \, \text{d}\boldsymbol{\zeta} \right| < \infty$. It follows from Eq. (12) that

(13) $$\text{Pr}(\boldsymbol{N} = \boldsymbol{n}) = \frac{\int h(\boldsymbol{\zeta}) \, \text{Pr}(\boldsymbol{N} = \boldsymbol{n} \mid \boldsymbol{\zeta}) \, p(\boldsymbol{\zeta}) \, g(\boldsymbol{\zeta}) \, \text{d}\boldsymbol{\zeta}}{\int h(\boldsymbol{\zeta}) \, g(\boldsymbol{\zeta}) \, p(\boldsymbol{\zeta} \mid \boldsymbol{N} = \boldsymbol{n}) \, \text{d}\boldsymbol{\zeta}} = \frac{\mathbb{E}_{g(\boldsymbol{\zeta})} \left[ h(\boldsymbol{\zeta}) \, \text{Pr}(\boldsymbol{N} = \boldsymbol{n} \mid \boldsymbol{\zeta}) \, p(\boldsymbol{\zeta}) \right]}{\mathbb{E}_{p(\boldsymbol{\zeta} \mid \boldsymbol{N} = \boldsymbol{n})} \left[ h(\boldsymbol{\zeta}) \, g(\boldsymbol{\zeta}) \right]}.$$

The bridge sampling estimate of the marginal likelihood is then obtained by sampling from $g(\boldsymbol{\zeta})$ and $p(\boldsymbol{\zeta} \mid \boldsymbol{N} = \boldsymbol{n})$ and using Monte Carlo approximations of the two expected values.

The optimal choice of $h(\boldsymbol{\zeta})$, one that minimizes the relative mean-squared error of the estimator, is given by:

(14) $$h_o(\boldsymbol{\zeta}) \propto \left[ s_1 \, \text{Pr}(\boldsymbol{N} = \boldsymbol{n} \mid \boldsymbol{\zeta}) \, p(\boldsymbol{\zeta}) + s_2 \, \text{Pr}(\boldsymbol{N} = \boldsymbol{n}) \, g(\boldsymbol{\zeta}) \right]^{-1},$$

where $s_i = \frac{D_i}{D_1 + D_2}$, $i \in \{1, 2\}$, and $D_1$ and $D_2$ denote the number of draws from $p(\boldsymbol{\zeta} \mid \boldsymbol{N} = \boldsymbol{n})$ and $g(\boldsymbol{\zeta})$, respectively, that are used to approximate the expected values (Meng & Wong, 1996). We set $D_1 = D_2$. Note that $h_o$ is optimal only if the draws from the posterior distribution are independent, which is not the case for MCMC procedures. To account for this fact, when defining the weights $s_1$ and $s_2$ we replace $D_1$ by the effective sample size obtained with the coda R package (Plummer, Best, Cowles, & Vines, 2006). As $h_o(\boldsymbol{\zeta})$ depends on $\text{Pr}(\boldsymbol{N} = \boldsymbol{n})$, the very quantity we want to estimate, we follow Meng and Wong (1996) and use an iterative scheme that updates an initial guess of the marginal likelihood until convergence:

(15) $$\hat{\text{Pr}}(\boldsymbol{N} = \boldsymbol{n})^{(t+1)} = \frac{\frac{1}{D_2} \sum_{r=1}^{D_2} \frac{l_{2,r}}{s_1 l_{2,r} + s_2 \hat{\text{Pr}}(\boldsymbol{N} = \boldsymbol{n})^{(t)}}}{\frac{1}{D_1} \sum_{j=1}^{D_1} \frac{1}{s_1 l_{1,j} + s_2 \hat{\text{Pr}}(\boldsymbol{N} = \boldsymbol{n})^{(t)}}},$$

where $l_{1,j} = \frac{\text{Pr}(\boldsymbol{N} = \boldsymbol{n} \mid \boldsymbol{\zeta}^*_j) \, p(\boldsymbol{\zeta}^*_j)}{g(\boldsymbol{\zeta}^*_j)}$, $l_{2,r} = \frac{\text{Pr}(\boldsymbol{N} = \boldsymbol{n} \mid \tilde{\boldsymbol{\zeta}}_r) \, p(\tilde{\boldsymbol{\zeta}}_r)}{g(\tilde{\boldsymbol{\zeta}}_r)}$, $\{\boldsymbol{\zeta}^*_1, \ldots, \boldsymbol{\zeta}^*_{D_1}\}$ are $D_1$ draws from $p(\boldsymbol{\zeta} \mid \boldsymbol{N} = \boldsymbol{n})$, and $\{\tilde{\boldsymbol{\zeta}}_1, \ldots, \tilde{\boldsymbol{\zeta}}_{D_2}\}$ are $D_2$ draws from $g(\boldsymbol{\zeta})$.
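To make the estimator concrete, the following R sketch implements the iterative update of Eq. (15). The function name, the starting value, and the use of an effective sample size argument are our choices, not the authors' implementation; a production version would also work with the $l$ terms in log space to avoid numerical underflow.

# Minimal sketch of the iterative bridge sampling update in Eq. (15).
# l1: values l_{1,j} at the D1 posterior draws; l2: values l_{2,r} at the
# D2 proposal draws; ess: effective posterior sample size used in place of
# D1 when forming the weights s1 and s2 (as described in the text).
bridge_iterate <- function(l1, l2, ess = length(l1), tol = 1e-10, max_iter = 1000) {
  D2 <- length(l2)
  s1 <- ess / (ess + D2)
  s2 <- D2 / (ess + D2)
  p_hat <- median(l1)                 # rough initial guess for Pr(N = n)
  for (t in seq_len(max_iter)) {
    num <- mean(l2 / (s1 * l2 + s2 * p_hat))
    den <- mean(1 / (s1 * l1 + s2 * p_hat))
    p_new <- num / den
    if (abs(p_new - p_hat) < tol * p_hat) break   # relative convergence check
    p_hat <- p_new
  }
  p_hat
}
# The effective sample size can be computed with, e.g.,
# coda::effectiveSize() applied to the posterior draws.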

A remaining question is how to choose $g(\boldsymbol{\zeta})$. The precision of the bridge sampling estimator is governed by the number of samples from $g(\boldsymbol{\zeta})$ and by the overlap between $g(\boldsymbol{\zeta})$ and $p(\boldsymbol{\zeta} \mid \boldsymbol{N} = \boldsymbol{n})$ (Meng & Wong, 1996). Therefore, $g(\boldsymbol{\zeta})$ should closely resemble the posterior distribution. For instance, we may choose a multivariate normal distribution for $g$ with mean vector and covariance matrix matched to the corresponding quantities of the posterior samples. Although the multivariate normal approach works well in many applications (e.g., Gronau et al., 2017; Overstall & Forster, 2010), it can be inefficient when the posterior distribution is skewed.
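As a sketch of this simpler moment-matched approach (not the authors' code), a multivariate normal proposal can be fitted to a matrix of posterior draws with the mvtnorm package; the data here are placeholders:

library(mvtnorm)

# 'post' is a D x P matrix of posterior draws on the real line (toy data).
set.seed(1)
post <- matrix(rnorm(5000 * 3), ncol = 3)

prop_mean <- colMeans(post)          # match the first moment
prop_cov  <- cov(post)               # match the second moment

g_draws <- rmvnorm(5000, prop_mean, prop_cov)              # draws from g
log_g   <- dmvnorm(post, prop_mean, prop_cov, log = TRUE)  # g at the posterior draws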

Warp-III improves upon the multivariate normal bridge sampling approach by matching not only the first two moments but also the third moment (i.e., skewness) of $g$ and the posterior distribution. When there is no skewness, Warp-III therefore yields estimates with the same precision as the simpler multivariate normal approach. Crucially, however, in the presence of skewness, Warp-III matches $g$ and the posterior distribution more closely, which results in more precise estimates of the marginal likelihood. The size of the improvement over the simpler multivariate normal approach depends on the particular application.

Figure 2. Matching the proposal and posterior distribution with warping. Histograms show the posterior distribution; density lines show the standard normal proposal distribution.

In Warp-III, $g$ is fixed to a multivariate standard normal distribution. The posterior distribution is then manipulated ("warped") so that its mean vector, covariance matrix, and skew match $g$. Crucially, the warped posterior distribution retains the normalizing constant of the posterior distribution. Figure 2 illustrates the rationale of the Warp-III transformation for the univariate case. The histogram in the upper-left panel shows hypothetical "unbounded" posterior samples that can range across the entire real line; the solid line shows the standard normal proposal distribution $g$. The overlap between the two distributions is clearly suboptimal. Bridge sampling applied to these two distributions can be thought of as "Warp-0" because the posterior distribution is not modified. The upper-right panel illustrates "Warp-I": subtracting the mean of the posterior samples from all posterior samples matches the first moment of the distributions. The lower-right panel illustrates "Warp-II": dividing the zero-centered posterior samples by their standard deviation matches the first two moments of the distributions. This approach is practically equivalent to the multivariate normal bridge sampling approach described above. Lastly, the lower-left panel illustrates Warp-III: randomly assigning a minus sign to the standardized posterior samples also matches the third moment of the distributions.
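The following R sketch reproduces this univariate intuition with simulated skewed samples standing in for a posterior (our toy example, not data from the paper):

set.seed(123)
post <- rgamma(1e5, shape = 2, rate = 1)        # skewed stand-in "posterior"
w1 <- post - mean(post)                         # Warp-I: zero mean
w2 <- w1 / sd(post)                             # Warp-II: unit variance
b  <- sample(c(-1, 1), length(post), replace = TRUE)
w3 <- b * w2                                    # Warp-III: random sign flip
c(skew_warp2 = mean(w2^3), skew_warp3 = mean(w3^3))  # third moment shrinks to ~0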

Warp-III assumes that all components of the parameter vector can range across the entire real line. In the context of latent-trait MPTs, this assumption is not fulfilled since $\xi_p \in (0, \xi_{\text{max}})$ for all $p \in \{1, \ldots, P\}$. We therefore transform $\boldsymbol{\xi}$ to $\boldsymbol{\xi}_{\text{trans}} = \Phi^{-1}\left(\frac{\boldsymbol{\xi}}{\xi_{\text{max}}}\right)$ with Jacobian $\left(\xi_{\text{max}}\right)^{P} \, \mathcal{N}_P(\boldsymbol{\xi}_{\text{trans}}; \boldsymbol{0}, \boldsymbol{I}_P)$, where $\mathcal{N}_P(\boldsymbol{x}; \boldsymbol{y}, \boldsymbol{Z})$ denotes the probability density function of a $P$-dimensional normal distribution with mean vector $\boldsymbol{y}$ and covariance matrix $\boldsymbol{Z}$, evaluated at the vector $\boldsymbol{x}$. Let $\boldsymbol{\psi} = (\boldsymbol{\mu}, \boldsymbol{\xi}_{\text{trans}}, \boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_I)$ denote the resulting parameter vector in which all components are on the real line.
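A sketch of this transformation and its log-Jacobian in R (all names are ours):

xi_max <- 10                                    # upper bound of the uniform prior

to_real   <- function(xi) qnorm(xi / xi_max)    # maps xi in (0, xi_max) to the real line
from_real <- function(xi_trans) xi_max * pnorm(xi_trans)

# log of the Jacobian term (xi_max)^P * N_P(xi_trans; 0, I_P):
log_jacobian <- function(xi_trans) {
  length(xi_trans) * log(xi_max) + sum(dnorm(xi_trans, log = TRUE))
}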

Warp-III is then based on applying the following stochastic transformation to $\boldsymbol{\psi}$:

(16) $$\boldsymbol{\eta} = \underbrace{b}_{\text{symmetry}} \times \underbrace{\boldsymbol{R}^{-1}}_{\text{covariance } \boldsymbol{I}} \times \underbrace{(\boldsymbol{\psi} - \boldsymbol{v})}_{\text{mean } \boldsymbol{0}},$$

where $b \sim \text{Bernoulli}(0.5)$ on $\{-1, 1\}$ and $\boldsymbol{v}$ corresponds to the expected value of $\boldsymbol{\psi}$ (i.e., the mean vector). The matrix $\boldsymbol{R}$ is obtained via the Cholesky decomposition of the covariance matrix of $\boldsymbol{\psi}$, denoted $\boldsymbol{S}$; thus, $\boldsymbol{S} = \boldsymbol{R}\boldsymbol{R}^\top$. In practice, $\boldsymbol{v}$ and $\boldsymbol{S}$ are unknown and must be approximated using the posterior samples. Note that Eq. (16) simply generalizes the univariate intuition illustrated in Fig. 2 to the general case with multiple parameters.
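In R, the transformation in Eq. (16) can be sketched for a $D \times P$ matrix of real-line posterior draws as follows, estimating $\boldsymbol{v}$ and $\boldsymbol{S}$ from the samples (the function name is ours):

warp3 <- function(psi) {
  v <- colMeans(psi)                  # estimate of the mean vector v
  R <- t(chol(cov(psi)))              # lower-triangular Cholesky factor: S = R R'
  b <- sample(c(-1, 1), nrow(psi), replace = TRUE)
  # b * R^{-1} (psi - v), applied row by row:
  eta <- b * t(solve(R, t(psi) - v))
  list(eta = eta, v = v, R = R)
}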

Due to the Bernoulli random variable b, the warped posterior density has the form of a mixture density (see also Overstall, 2010, p. 70):

(17) $$p_{\boldsymbol{\eta}}(\boldsymbol{\eta} \mid \boldsymbol{N} = \boldsymbol{n}) = \frac{\left| \boldsymbol{R} \right|}{2} \left[ \frac{\tilde{p}_{\boldsymbol{\psi}}(\boldsymbol{v} - \boldsymbol{R}\boldsymbol{\eta} \mid \boldsymbol{N} = \boldsymbol{n})}{\text{Pr}(\boldsymbol{N} = \boldsymbol{n})} + \frac{\tilde{p}_{\boldsymbol{\psi}}(\boldsymbol{v} + \boldsymbol{R}\boldsymbol{\eta} \mid \boldsymbol{N} = \boldsymbol{n})}{\text{Pr}(\boldsymbol{N} = \boldsymbol{n})} \right] = \frac{\tilde{p}_{\boldsymbol{\eta}}(\boldsymbol{\eta} \mid \boldsymbol{N} = \boldsymbol{n})}{\text{Pr}(\boldsymbol{N} = \boldsymbol{n})},$$

where $\tilde{p}_{\boldsymbol{\eta}}(\boldsymbol{\eta} \mid \boldsymbol{N} = \boldsymbol{n}) = \frac{\left| \boldsymbol{R} \right|}{2} \left[ \tilde{p}_{\boldsymbol{\psi}}(\boldsymbol{v} - \boldsymbol{R}\boldsymbol{\eta} \mid \boldsymbol{N} = \boldsymbol{n}) + \tilde{p}_{\boldsymbol{\psi}}(\boldsymbol{v} + \boldsymbol{R}\boldsymbol{\eta} \mid \boldsymbol{N} = \boldsymbol{n}) \right]$ denotes the un-normalized warped posterior distribution and $\tilde{p}_{\boldsymbol{\psi}}(\cdot \mid \boldsymbol{N} = \boldsymbol{n})$ denotes the un-normalized posterior distribution that has been transformed to the real line (but not warped). This shows that the warped posterior distribution retains the normalizing constant of the original posterior distribution.

The Warp-III estimator of the marginal likelihood is then derived by using the warped posterior distribution $p_{\boldsymbol{\eta}}(\boldsymbol{\eta} \mid \boldsymbol{N} = \boldsymbol{n})$ instead of $p(\boldsymbol{\zeta} \mid \boldsymbol{N} = \boldsymbol{n})$ in Eq. (12). Equation (13) shows that this yields a ratio of two expected values: the numerator is an expected value with respect to the multivariate standard normal proposal distribution $g(\boldsymbol{\eta})$, and the denominator is an expected value with respect to the warped posterior distribution $p_{\boldsymbol{\eta}}(\boldsymbol{\eta} \mid \boldsymbol{N} = \boldsymbol{n})$. Hence, we could obtain an estimate of the marginal likelihood by warping the posterior samples using Eq. (16), sampling from the proposal distribution, and applying the iterative updating scheme in Eq. (15).

However, in line with the literature (e.g., Sinharay & Stern, 2005), we rewrite the expected value in the denominator of Eq. (13) in terms of the unbounded posterior samples that are transformed to the real line but not warped; a derivation is provided in the Supplemental Materials. The estimate of the marginal likelihood is then obtained by applying the iterative scheme in Eq. (15) using:

(18) $$l_{1,j} = \frac{\frac{\left| \boldsymbol{R} \right|}{2} \left[ \tilde{p}_{\boldsymbol{\psi}}(2\boldsymbol{v} - \boldsymbol{\psi}^*_j \mid \boldsymbol{N} = \boldsymbol{n}) + \tilde{p}_{\boldsymbol{\psi}}(\boldsymbol{\psi}^*_j \mid \boldsymbol{N} = \boldsymbol{n}) \right]}{g\left( \boldsymbol{R}^{-1}\left( \boldsymbol{\psi}^*_j - \boldsymbol{v} \right) \right)},$$

and

(19) $$l_{2,r} = \frac{\frac{\left| \boldsymbol{R} \right|}{2} \left[ \tilde{p}_{\boldsymbol{\psi}}(\boldsymbol{v} - \boldsymbol{R}\tilde{\boldsymbol{\eta}}_r \mid \boldsymbol{N} = \boldsymbol{n}) + \tilde{p}_{\boldsymbol{\psi}}(\boldsymbol{v} + \boldsymbol{R}\tilde{\boldsymbol{\eta}}_r \mid \boldsymbol{N} = \boldsymbol{n}) \right]}{g(\tilde{\boldsymbol{\eta}}_r)},$$

where $\{\boldsymbol{\psi}^*_1, \ldots, \boldsymbol{\psi}^*_{D_1}\}$ are $D_1$ draws from $p_{\boldsymbol{\psi}}(\boldsymbol{\psi} \mid \boldsymbol{N} = \boldsymbol{n})$ and $\{\tilde{\boldsymbol{\eta}}_1, \ldots, \tilde{\boldsymbol{\eta}}_{D_2}\}$ are $D_2$ draws from the proposal distribution $g(\boldsymbol{\eta})$. Furthermore, $\tilde{p}_{\boldsymbol{\psi}}(\boldsymbol{\psi} \mid \boldsymbol{N} = \boldsymbol{n})$ denotes the un-normalized posterior density of the unbounded posterior samples; it is therefore written in terms of $\boldsymbol{\xi}_{\text{trans}}$ and adjusted by the Jacobian term:

(20) $$\tilde{p}_{\boldsymbol{\psi}}(\boldsymbol{\psi} \mid \boldsymbol{N} = \boldsymbol{n}) = \prod_{i=1}^{I} \left[ \prod_{k=1}^{K} \left\{ \frac{J_k!}{n_{ik1}! \times n_{ik2}! \times \cdots \times n_{ikL_k}!} \prod_{l=1}^{L_k} \left[ \text{Pr}(C_{kl} \mid \boldsymbol{\mu}, \boldsymbol{\xi}_{\text{trans}}, \boldsymbol{\omega}_i) \right]^{n_{ikl}} \right\} \right] \times \frac{\Gamma_P\!\left(\frac{\nu + I}{2}\right)}{\Gamma_P\!\left(\frac{\nu}{2}\right)} \, \frac{\pi^{-IP/2}}{\left| \boldsymbol{\Omega}^\top \boldsymbol{\Omega} + \boldsymbol{I}_P \right|^{(\nu + I)/2}} \times (2\pi)^{-P/2} \exp\!\left\{ -\tfrac{1}{2} \boldsymbol{\mu}^\top \boldsymbol{\mu} \right\} \times (2\pi)^{-P/2} \exp\!\left\{ -\tfrac{1}{2} \boldsymbol{\xi}_{\text{trans}}^\top \boldsymbol{\xi}_{\text{trans}} \right\}.$$

Note that rewriting the expected value in terms of $\tilde{p}_{\boldsymbol{\psi}}(\boldsymbol{\psi} \mid \boldsymbol{N} = \boldsymbol{n})$ is only a technical nicety: this approach is identical to applying the Warp-III transformation to the posterior samples and then running the iterative scheme with the warped posterior density and a multivariate standard normal proposal distribution.
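The following R sketch evaluates $l_{1,j}$ and $l_{2,r}$ of Eqs. (18) and (19) on the log scale. Here 'log_post' is an assumed function returning the un-normalized log posterior density of Eq. (20), and the log-sum-exp helper guards against underflow; all names are ours.

logsumexp <- function(x) { m <- max(x); m + log(sum(exp(x - m))) }

# log l_{1,j} for one posterior draw psi_j (Eq. 18):
log_l1 <- function(psi_j, v, R, log_post) {
  num <- log(0.5) + sum(log(diag(R))) +            # log(|R|/2) for triangular R
    logsumexp(c(log_post(2 * v - psi_j), log_post(psi_j)))
  eta <- as.numeric(solve(R, psi_j - v))
  num - sum(dnorm(eta, log = TRUE))                # g is multivariate standard normal
}

# log l_{2,r} for one proposal draw eta_r (Eq. 19):
log_l2 <- function(eta_r, v, R, log_post) {
  shift <- as.numeric(R %*% eta_r)
  num <- log(0.5) + sum(log(diag(R))) +
    logsumexp(c(log_post(v - shift), log_post(v + shift)))
  num - sum(dnorm(eta_r, log = TRUE))
}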

3. Empirical Examples

3.1 Example 1: Nested Model Comparison

We re-analyzed the pair-clustering data set reported in Riefer, Knapp, Batchelder, Bamber, and Manifold (2002) using the hierarchical latent-trait approach. Experiment 4 examined the memory of patients with brain damage due to prolonged alcoholism in comparison with a control group of alcoholic patients without indications of brain damage. The participants attempted to memorize the same list of 20 categorically related word pairs in a series of six study-test trials. For demonstration purposes, we focused on the free recall performance of the 21 control participants. Specifically, we used posterior model probabilities and Bayesian model averaging to investigate whether the model parameters change from the first to the second trial, which would indicate a change in the storage and retrieval processes as a function of practice.

3.1.1. Model Specification

To model differences in parameters, we augmented Eq. (8) with a parameter vector that captures the difference in parameters between the two trials: $\boldsymbol{\delta} = (\delta_c, \delta_r, \delta_u)$. The probit-transformed parameter vectors of participant $i$ for the first trial ($\boldsymbol{\theta}'_{1,i}$) and the second trial ($\boldsymbol{\theta}'_{2,i}$) are then obtained as follows:

(21) $$\boldsymbol{\theta}'_{1,i} = \underbrace{\boldsymbol{\mu} - \frac{\boldsymbol{\delta}}{2}}_{\text{group mean, first trial}} + \; \boldsymbol{\xi} \odot \boldsymbol{\omega}_i, \qquad \boldsymbol{\theta}'_{2,i} = \underbrace{\boldsymbol{\mu} + \frac{\boldsymbol{\delta}}{2}}_{\text{group mean, second trial}} + \; \boldsymbol{\xi} \odot \boldsymbol{\omega}_i.$$
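A sketch of Eq. (21) in R, with toy dimensions and placeholder parameter values (all names and values are ours):

set.seed(42)
P <- 3; I <- 21                               # parameters (c, r, u) and participants
mu    <- rep(0, P)                            # probit group-level means
delta <- c(0, 0.4, 0.3)                       # trial differences (delta_c, delta_r, delta_u)
xi    <- rep(1, P)                            # group-level scale parameters
omega <- matrix(rnorm(I * P), I, P)           # participant random effects

scaled <- sweep(omega, 2, xi, `*`)            # xi (elementwise) omega_i, per row
theta1 <- sweep(scaled, 2, mu - delta / 2, `+`)   # probit parameters, trial 1
theta2 <- sweep(scaled, 2, mu + delta / 2, `+`)   # probit parameters, trial 2
theta1_prob <- pnorm(theta1)                  # back to the probability scale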

For an alternative approach to modeling within-subject differences in model parameters, the reader is referred to Rouder, Lu, Morey, Sun, and Speckman (2008).

Table 1 shows the $2^3 = 8$ nested models that implement the eight sets of possible parameter constraints. $\mathcal{M}_1$ allows all three parameters to vary between trials, so that $\boldsymbol{\delta} = (\delta_c, \delta_r, \delta_u)$. In contrast, $\mathcal{M}_8$ posits that none of the parameters vary between trials, so that $\boldsymbol{\delta} = (0, 0, 0)$. Models $\mathcal{M}_2$ to $\mathcal{M}_7$ lie between these extremes and allow either one or two parameters to vary between trials.

Table 1. Overview of the eight nested models for the analysis of the first two trials of the pair-clustering data set reported in Riefer et al. (2002).

Note. $\mathcal{M}_1$ allows all three parameters to vary between trials, and $\mathcal{M}_8$ posits that none of the parameters vary between trials. Models $\mathcal{M}_2$ to $\mathcal{M}_7$ lie between these extremes.
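The eight constraint patterns of Table 1 can be enumerated mechanically; a sketch in R (the ordering is ours and need not match the paper's labels $\mathcal{M}_1, \ldots, \mathcal{M}_8$):

# Each row indicates which components of delta are free (TRUE) vs. fixed at 0:
models <- expand.grid(delta_c = c(TRUE, FALSE),
                      delta_r = c(TRUE, FALSE),
                      delta_u = c(TRUE, FALSE))
nrow(models)  # 2^3 = 8 nested models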

We used independent zero-centered normal priors for the components of $\boldsymbol{\delta}$. To assess the sensitivity of the results to the width of the test-relevant prior distribution, we explored a narrow ($\sigma^{\text{narrow}}_{\delta} \approx 0.52$), a medium ($\sigma^{\text{medium}}_{\delta} \approx 0.84$), and a wide ($\sigma^{\text{wide}}_{\delta} \approx 1.28$) zero-centered normal prior. As shown in the Supplemental Materials, the standard deviations $\sigma_{\delta}$ were chosen to correspond to small, medium, and large effects on the probability scale centered around 0.5. Priors for the remaining parameters followed the specification described earlier.

We estimated the posterior distribution of the model parameters using JAGS by adapting the script provided by Matzke et al. (2015); the JAGS code is available in the Supplemental Materials. We ran three MCMC chains with overdispersed start values, discarded the first 4,000 posterior samples of each chain as burn-in, and retained only every 20th sample to reduce autocorrelation. The results reported below are based on a total of 90,000 posterior samples. Convergence of the MCMC chains was assessed by visual inspection and the $\hat{R}$ statistic ($\hat{R} < 1.05$ for all parameters; Gelman & Rubin, 1992).
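A hedged sketch of this sampling setup with the rjags package; the model file name, data list, and monitored nodes are placeholders of ours, not the authors' script (which is in the Supplemental Materials):

library(rjags)

jm <- jags.model("latent_trait_mpt.txt",      # placeholder model file
                 data = data_list,            # placeholder data list
                 n.chains = 3, n.adapt = 1000)
update(jm, 4000)                              # discard burn-in samples
post <- coda.samples(jm,
                     variable.names = c("mu", "xi", "omega", "delta"),
                     n.iter = 600000, thin = 20)   # 3 x 30,000 retained draws
gelman.diag(post)                             # require R-hat < 1.05 throughout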

Figure 3. Posterior distributions of the probit group-level means (plotted on the probability scale) from the full model $\mathcal{M}_1$ for the analysis of the first two trials of the pair-clustering data reported in Riefer et al. (2002). Solid lines show the posteriors for the first trial; dotted lines show the posteriors for the second trial.

Figure 3 shows the resulting posterior distributions of the probit group-level means from the full model $\mathcal{M}_1$, transformed back to the probability scale. The posteriors were computed using the medium prior setting ($\sigma^{\text{medium}}_{\delta}$); results obtained with the narrow and wide priors were highly similar and are not displayed. The corresponding plot based on the alternative prior choice for the elements of $\boldsymbol{\xi}$ (i.e., uniform priors with upper bound $\xi_{\text{max}} = 2$ instead of $\xi_{\text{max}} = 10$) was visually almost indistinguishable from the one presented here and is provided in the Supplemental Materials. The cluster-storage parameter $c$ did not change substantially, whereas the storage-retrieval parameter $u$, and especially the cluster-retrieval parameter $r$, appeared to increase from the first trial to the second.

3.1.2. Computing Marginal Likelihoods with Warp-III

Equation (20) was adjusted to include the relevant prior distributions for the elements of $\boldsymbol{\delta}$. For each model, we split the 90,000 posterior samples into two equal parts (the first and second half of the iterations per chain) and used the first part for estimating $\boldsymbol{R}$ and $\boldsymbol{v}$ and the second part for the iterative updating scheme in Eq. (15) (Overstall & Forster, 2010); hence, $D_1 = D_2 = 45{,}000$. To assess the accuracy of the resulting estimates, we repeated this procedure 50 times. We implemented the procedure in R (R Core Team, 2016). For efficiency, we parallelized the computations and coded the computationally intensive elements in C++, called from within R using Rcpp (Eddelbuettel et al., 2011). Using a standard personal computer and four CPU cores, computing the marginal likelihood for each repetition took less than one minute per model. The code is available in the Supplemental Materials.
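A sketch of this splitting scheme in R, assuming the draws are stored chain by chain in a $D \times P$ matrix (function and argument names are ours):

split_for_bridge <- function(post, n_chains = 3) {
  per_chain <- nrow(post) / n_chains
  first_half <- rep(rep(c(TRUE, FALSE), each = per_chain / 2), n_chains)
  list(fit      = post[first_half, , drop = FALSE],   # estimate v and R from this part
       estimate = post[!first_half, , drop = FALSE])  # feed Eq. (15) with this part
}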

Figure 4. Posterior model probabilities (left panel) and posterior inclusion probabilities (right panel) for the analysis of the first two trials of the pair-clustering data reported in Riefer et al. (2002), obtained with Warp-III bridge sampling. In the left panel, the x-axis indicates which parameters were allowed to vary from the first to the second trial (e.g., $c\text{-}u$ corresponds to $\mathcal{M}_3$, in which $r$ was fixed between trials). Gray symbols show the results of the 50 repetitions; black symbols display the posterior model probabilities and posterior inclusion probabilities based on the median of the 50 estimated log marginal likelihoods. Circles show results obtained with the narrow prior, diamonds with the medium prior, and triangles with the wide prior. The dotted lines show the prior model probabilities and prior inclusion probabilities.

3.1.3. Posterior Model Probabilities

To formally quantify the evidence for differences in the parameters, we computed the posterior model probabilities of the eight models from the marginal likelihoods obtained with Warp-III, assuming that all models were equally likely a priori. The left panel of Fig. 4 shows the posterior model probabilities for the narrow, medium, and wide prior settings. The corresponding plot based on the alternative prior choice for the elements of $\boldsymbol{\xi}$ (i.e., uniform priors with upper bound $\xi_{\text{max}} = 2$ instead of $\xi_{\text{max}} = 10$) was visually almost indistinguishable from the one presented here and is provided in the Supplemental Materials. Formal model comparison confirmed the visual impression of the posterior distributions in Fig. 3: $\mathcal{M}_2$, the model that allows for a difference in $r$ and $u$, received the most support from the data. As expected, the width of the test-relevant prior on $\boldsymbol{\delta}$ influenced the value of the marginal likelihood, but it did not change the conclusions qualitatively. Warp-III provided accurate estimates of the posterior model probabilities, as indicated by the small variability across the 50 repetitions (i.e., the gray symbols). For this nested example, the posterior model probabilities can also be obtained using the Savage–Dickey density ratio representation of the Bayes factor (Dickey & Lientz, 1970; Wagenmakers, Lodewyckx, Kuriyal, & Grasman, 2010). As shown in the Supplemental Materials, the Savage–Dickey procedure resulted in posterior model probabilities that were highly similar to those obtained with Warp-III.
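The step from log marginal likelihoods to posterior model probabilities is a normalized exponentiation; a numerically stable sketch in R (the function name is ours):

post_model_probs <- function(log_ml, prior = rep(1 / length(log_ml), length(log_ml))) {
  x <- log_ml + log(prior)   # un-normalized log posterior model probabilities
  w <- exp(x - max(x))       # subtract the maximum for numerical stability
  w / sum(w)
}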

3.1.4. Bayesian Model Averaging

Bayesian model averaging does not require researchers to commit to a single “best” model; it allows researchers to acknowledge uncertainty about the choice of the correct model (e.g., Hoeting et al., 1999; Rouder et al., 2017). This is achieved by considering the posterior inclusion probabilities of the parameters. Posterior inclusion probabilities quantify the model-averaged evidence for a change in a given parameter; they can be obtained by summing the posterior model probabilities of the models that allow the parameter to differ between the trials. For instance, the posterior inclusion probability of the $c$ parameter is obtained by summing the posterior model probabilities of $\mathcal{M}_1$, $\mathcal{M}_3$, $\mathcal{M}_4$, and $\mathcal{M}_6$. Posterior inclusion probabilities are then compared to the prior inclusion probabilities, in this case 0.5, which are obtained in an analogous manner but based on the prior model probabilities.Footnote 15 The right panel of Fig. 4 shows the posterior inclusion probabilities for the three prior settings. The plot of the posterior inclusion probabilities based on the alternative prior choice for the elements of $\boldsymbol{\xi}$ (i.e., uniform priors with upper bound $\xi_{\text{max}} = 2$ instead of $\xi_{\text{max}} = 10$) was visually almost indistinguishable from the one presented here and has hence been relegated to Supplemental Materials. The posterior inclusion probabilities of the $r$ and $u$ parameters are higher than the prior inclusion probabilities, indicating evidence for a difference in these parameters between trials. In contrast, the posterior inclusion probability of $c$ is lower than the corresponding prior inclusion probability, indicating evidence for invariance between the trials. As before, the width of the $\boldsymbol{\delta}$ prior does not change the conclusions qualitatively.
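Continuing the sketch above, the summation is a one-liner; the model labels below are hypothetical and follow the text's example for $c$:

```r
# A minimal sketch: a posterior inclusion probability is the summed posterior
# probability of all models in which the parameter is free to differ between
# trials. 'post_prob' is a named vector as returned by post_model_prob() above.
incl_prob <- function(post_prob, models_with_difference) {
  sum(post_prob[models_with_difference])
}
# e.g., for c, sum over the models that free c (M1, M3, M4, and M6 in the text):
# incl_prob(post_prob, c("M1", "M3", "M4", "M6"))
```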

3.1.5. Substantive Contribution

The data from Riefer et al. (2002) have been analyzed in a number of articles. The original article analyzed the aggregated data (an approach known to suffer from limitations when there is heterogeneity across participants; e.g., Klauer, 2006) and considered the p values of $G^2$ statistics to investigate whether parameters differ across trials. Smith and Batchelder (2010) re-analyzed a subset of the data using the hierarchical beta-MPT model (which specifies group-level beta distributions and thus differs from the latent-trait approach that we used).Footnote 16 To investigate whether parameters differ across trials, Smith and Batchelder (a) considered the posterior distribution of the difference between trials for the group-level mean parameters and (b) ran a classical paired-sample $t$ test on the individual-level parameter estimates. These approaches, however, do not allow one to quantify evidence for an invariance (i.e., a simpler model where some parameters do not differ across trials) on a continuous scale in a systematic way and, crucially, they do not allow one to distinguish “absence of evidence” (i.e., the data are uninformative) from “evidence of absence” (i.e., the data support a simpler model).Footnote 17 These shortcomings can be addressed by computing Bayes factors and posterior model and posterior inclusion probabilities. “Absence of evidence” can be inferred from Bayes factors close to one and posterior model and posterior inclusion probabilities close to the corresponding prior probabilities. In contrast, “evidence of absence” can be inferred from large Bayes factors in favor of the simpler model, and from situations in which the posterior model probability of the simpler model is the highest or the posterior inclusion probability is smaller than the prior inclusion probability.

Our Bayesian re-analysis suggests that there is strong evidence that the probability of retrieving word pairs that have been stored as a cluster (i.e., $r$) changed from the first to the second trial. Furthermore, there is evidence that the probability of storing and retrieving words that have not been stored as a cluster (i.e., $u$) differed between the two trials. Crucially, our approach also allowed us to conclude that there is some evidence that the probability of storing a word pair as a cluster (i.e., $c$) did not change from the first to the second trial (although this evidence is not pronounced, since the posterior inclusion probability for a difference in $c$ is, depending on the prior choice, relatively close to the prior inclusion probability of .5). Another key improvement of our analysis over the above-mentioned analyses is the use of Bayesian model averaging. In this example, $\mathcal{M}_2$ received the highest posterior probability; however, $\mathcal{M}_1$ also received substantial posterior probability. Therefore, selecting a single best model (i.e., $\mathcal{M}_2$) and basing final inference solely on this model might be suboptimal at best and misleading at worst. In contrast, when using the model-averaged posterior inclusion probabilities to draw conclusions about which parameters differ between trials, one takes into account all models under consideration according to their plausibility in light of the observed data.

Finally, note that one might argue that this data set is relatively small and thus uninformative. One strength of the Bayesian approach, however, is that it allows one to quantify whether the data are informative. For this example, the Bayesian results suggest that the data are in fact informative, as indicated by posterior model and inclusion probabilities that are quite different from the corresponding prior probabilities.

3.2. Example 2: Non-Nested Model Comparison

We re-analyzed data from Experiment 2 reported by Fazio et al. (2015) who investigated the influence of knowledge on the illusory truth effect. The illusory truth effect refers to the phenomenon that, in the absence of knowledge about the truth status of a statement, repeated statements are easier to process and are judged more truthful than new statements. Fazio et al., however, provided evidence that participants tend to rely on the ease of processing (i.e., fluency) even when they have knowledge about the statement.

We re-analyzed data from 39 participants who indicated the truthfulness (i.e., “true”/“false”) of 176 statements, half of which were true and half of which were false. Half of the statements were likely to be known according to general knowledge norms (“known” statements) and half of them were likely to be unknown (“unknown” statements). An example of a true known statement is “The Pacific Ocean is the largest ocean on Earth.” An example of a false unknown statement is “Billy the Kid’s last name is Garrett.” To manipulate fluency, half of the statements were presented twice, once in the exposure phase and once in the truth-rating phase, whereas the other half was presented only in the truth-rating phase. Hence, the experiment had a 2 (truth status: true vs. false) $\times$ 2 (assumed knowledge: known vs. unknown) $\times$ 2 (repetition: repeated vs. not repeated) balanced within-subject design, and each cell of the design featured 22 statements.

Figure 5. Knowledge-conditional (top panel) and fluency-conditional (bottom panel) MPTs.

3.2.1. Model Specification

Fazio et al. (2015) constructed two MPTs to study the illusory truth effect. The knowledge-conditional model depicted in the top panel of Fig. 5 assumes that participants rely on knowledge when assessing truthfulness and only rely on fluency when they are unable to retrieve knowledge about the statement. Parameter $k$ represents the probability of retrieving knowledge about the statement from memory. If knowledge is retrieved, participants are assumed to give the correct response (i.e., “true” for true statements and “false” for false statements). If no knowledge is retrieved (with probability $1-k$), participants rely on fluency with probability $f$ and respond “true.” If participants do not rely on fluency (with probability $1-f$), they guess “true” with probability $g$ and “false” with probability $1-g$. Responses to true statements are scored into the categories $C_{11}$ (correct “true” response) and $C_{12}$ (incorrect “false” response). Responses to false statements are scored into the categories $C_{21}$ (incorrect “true” response) and $C_{22}$ (correct “false” response). In contrast, the fluency-conditional model depicted in the bottom panel reflects the notion that participants mainly rely on fluency and only use knowledge in the absence of fluency. The models feature the same set of parameters, but they assume a different conditional probability structure.
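To make the difference in conditional structure concrete, consider the probability of an (incorrect) “true” response to a false statement. Under our reading of the two trees in Fig. 5 (a reconstruction from the description above, not equations reported by Fazio et al.), the knowledge-conditional (KC) and fluency-conditional (FC) models imply

$$\Pr(C_{21} \mid \text{KC}) = (1-k)\,f + (1-k)(1-f)\,g, \qquad \Pr(C_{21} \mid \text{FC}) = f + (1-f)(1-k)\,g.$$

The two expressions differ by $kf$: the fluency-conditional model predicts additional “true” responses to false statements precisely when both knowledge and fluency can operate (e.g., repeated known false statements), whereas for true statements the two trees imply identical category probabilities.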

For each model, we replicated the two subtrees four times (i.e., a total of eight subtrees per model) to accommodate the design of the experiment: The first replicate corresponded to known true and false statements that were not repeated, the second to known true and false statements that were repeated, the third to unknown true and false statements that were not repeated, and the fourth to unknown true and false statements that were repeated. Following Fazio et al. (2015), we used separate knowledge parameters for known ($k_k$) and unknown ($k_u$) statements, and separate fluency parameters for repeated statements ($f_r$) and statements shown only once ($f_n$). The guessing parameter $g$ was constrained to be equal across the four replicates. We implemented the models within the hierarchical latent-trait approach, using the prior specifications described earlier.

We estimated the posterior distribution of the model parameters using JAGS, ran three MCMC chains with overdispersed start values, discarded the first 4000 posterior samples as burn-in, and retained only every 50th sample. Results reported below are based on a total of 180,000 posterior samples. The posterior distributions of the group-level mean parameters are displayed in Supplemental Materials.

3.2.2. Computing Bayes Factors with Warp-III

For each model, we split the 180,000 posterior samples in two equal parts (first and second half of the iterations per chain) and used the first part for estimating $\boldsymbol{R}$ and $\boldsymbol{v}$ and the second part for the iterative updating scheme in Eq. (15) ($D_1 = D_2 = 90{,}000$). Using a standard personal computer and four CPU cores, computing the marginal likelihood took approximately three minutes per model.
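The following is a minimal sketch of this per-chain split, under an assumed object layout (a list of per-chain draw matrices); it is not the implementation used for the reported analyses:

```r
# A minimal sketch: split the posterior draws into two halves per chain.
# 'chains' is assumed to be a list of three matrices (one per MCMC chain),
# each with 60,000 iterations in the rows and one column per model parameter.
split_for_bridge <- function(chains) {
  half <- nrow(chains[[1]]) / 2
  first  <- do.call(rbind, lapply(chains, function(m) m[seq_len(half), , drop = FALSE]))
  second <- do.call(rbind, lapply(chains, function(m) m[-seq_len(half), , drop = FALSE]))
  list(fit    = first,   # used to estimate the warp quantities (e.g., R and v)
       bridge = second)  # used in the iterative updating scheme of Eq. (15)
}
```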

The resulting marginal likelihoods were used to compute the Bayes factor in favor of the fluency-conditional model over the knowledge-conditional model. To assess the accuracy of the resulting Bayes factor, we repeated this procedure 50 times. Estimates of the Bayes factor ranged from $1.3 \times 10^{42}$ to $3.6 \times 10^{43}$ in favor of the fluency-conditional model. Estimates of the Bayes factor based on the alternative prior choice for the elements of $\boldsymbol{\xi}$ (i.e., uniform priors with upper bound $\xi_{\text{max}} = 2$ instead of $\xi_{\text{max}} = 10$) ranged from $1.7 \times 10^{41}$ to $1.7 \times 10^{43}$ in favor of the fluency-conditional model. In line with the conclusion drawn by Fazio et al. (2015) based on the $G^2$ statistic, this result provides overwhelming evidence in favor of the fluency-conditional model.Footnote 18
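Given the two log marginal likelihood estimates, the Bayes factor computation itself is trivial; a hedged sketch in R (logml_fc and logml_kc are assumed scalar outputs of two separate Warp-III runs):

```r
# A minimal sketch: the Bayes factor follows from the difference of the two
# log marginal likelihood estimates; working on the log scale keeps the
# computation stable even for extreme values such as those reported above.
bf_fc_over_kc <- function(logml_fc, logml_kc) {
  log_bf <- logml_fc - logml_kc
  c(log_bf = log_bf, bf = exp(log_bf))
}
```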

Figure 6 displays the Warp-III Bayes factor estimates (on the log scale) in white as a function of the number of posterior samples used in the bridge sampling procedure.Footnote 19 As a comparison, the estimates based on the simpler multivariate normal bridge sampling approach are displayed in gray. As the number of posterior samples increases, the Bayes factor estimates become more precise. For this particular example, it is apparent that the Warp-III estimates are less variable than the estimates based on the simpler multivariate normal approach.

Figure 6. Log Bayes factor estimates in favor of the fluency-conditional (FC) model over the knowledge-conditional (KC) model as a function of the number of posterior samples. The Warp-III estimates are displayed in white, and the estimates based on the simpler multivariate normal approach are displayed in gray.

3.2.3. Substantive Contribution

The authors of the original article analyzed the aggregated data (again, an approach known to be suboptimal when there is heterogeneity across participants) and considered the $G^2$ statistics with corresponding p values. Based on the fact that the knowledge-conditional model had a larger, significant $G^2$ statistic whereas the fluency-conditional model had a smaller, nonsignificant $G^2$ statistic, the authors concluded that the knowledge-conditional model fit the data poorly and the fluency-conditional model fit the data well. The authors thus favored the fluency-conditional model based on two binary accept–reject decisions. This makes it difficult to gauge the degree of support that the data provide in favor of the fluency-conditional model. The Bayes factor may be 10, or 100, or 1000; these are very different levels of evidence. In fact, our analysis shows that the Bayes factor ranges from about $1.3 \times 10^{42}$ to $3.6 \times 10^{43}$ in favor of the fluency-conditional model, which represents an overwhelming amount of evidence.

It could be argued that, since the compared models have the same number of parameters, comparing $G^2$ statistics may result in choosing the same model as would AIC or BIC. AIC is asymptotically equivalent to cross-validation (Stone, 1977), which is known to be inconsistent in the sense that, as the number of observations goes to infinity, the data-generating model will not be chosen with certainty (Shao, 1993). In contrast, when using Bayes factors, model selection consistency is generally fulfilled (Bayarri, Berger, Forte, & García-Donato, 2012). Although the BIC is a rough approximation of the Bayes factor, we believe that it is better to compute proper Bayes factors, which are transparent with respect to the prior assumptions.

Finally, one might again argue that this data set is relatively small and thus uninformative. However, the resulting Bayes factor is very different from 1, indicating that the data are in fact highly informative with respect to adjudicating between the fluency-conditional and the knowledge-conditional models.

4. Discussion

Bayesian hierarchical techniques for MPT modeling are increasingly popular. Current hierarchical MPT approaches, however, do not incorporate Bayesian model comparison methods based on Bayes factors and posterior model probabilities, possibly because of the computational challenges associated with the evaluation of the marginal likelihood. In this article, we addressed this challenge and showed how Warp-III bridge sampling can be used to obtain accurate and stable estimates of the marginal likelihood of hierarchical MPTs. We applied the method to model comparison problems from two published studies and illustrated how the marginal likelihood can be used for Bayesian model averaging and for the computation of the Bayes factor.

Our examples highlighted that Bayesian model comparison based on posterior model/inclusion probabilities and Bayes factors allows researchers to distinguish between “absence of evidence” and “evidence of absence.” Note that it is crucial in all stages of cognitive model development, validation, and application that one is able to quantify evidence in favor of invariances (i.e., “evidence of absence”) in a coherent and systematic way. For model development and validation, it is important to show that certain experimental manipulations selectively influence only a subset of the model parameters, whereas the remaining parameters are unaffected (i.e., selective influence studies). Once a cognitive model has been established as a valid measurement tool, it can be used, for instance, to investigate which subprocesses are targeted by new experimental manipulations or which subprocesses differ or do not differ in clinical subpopulations (cognitive psychometrics; e.g., Riefer et al., 2002). In these applications, it is important to be able to quantify evidence for a difference but, crucially, also for an invariance, since one might wish to make statements of the form “there is evidence that retrieval processes are not affected.”

There are often a number of different candidate models for the analysis of observed data. In Example 1, we demonstrated how Bayesian model averaging can be used to draw conclusions that fully take model uncertainty into account. In our opinion, Bayesian model averaging is an extremely powerful approach and, to the best of our knowledge, it is currently not used in the context of hierarchical MPTs and cognitive modeling more generally. We believe that drawing researchers’ attention to this approach and providing the computational tools to facilitate its application (i.e., Warp-III) is one of the key contributions of this work.

Our examples illustrated that Warp-III is relatively straightforward to implement once posterior samples from the models have been obtained with MCMC sampling. Another advantage of Warp-III bridge sampling is its relative speed. In our experience, the Warp-III procedure requires much less computational time than the MCMC sampling from the posterior. One of the crucial determinants of the computational time of Warp-III is how long it takes to evaluate the un-normalized posterior density. To maximize speed for our applications, we implemented the un-normalized posterior density functions in C++ code called from within R via Rcpp (Eddelbuettel et al., 2011). Compared to a simpler bridge sampling version that only matches the first two moments of the proposal and the posterior (e.g., Overstall & Forster, 2010), Warp-III is expected to take about twice as long for a fixed number of samples, because the mixture representation of the warping procedure requires evaluating the un-normalized posterior twice as often. However, Warp-III is also expected to be more accurate when the posterior is skewed, so there may be a speed–accuracy trade-off.
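As an illustration of this strategy, the following toy sketch compiles a C++ log-density from R with Rcpp::cppFunction; the density shown is a stand-in isotropic normal, not the MPT posterior used for the applications in this article:

```r
library(Rcpp)

# A toy log density standing in for the (much more involved) un-normalized
# MPT posterior; illustrative only.
cppFunction('
double logUnnormPosterior(NumericVector theta) {
  // un-normalized isotropic normal log density as a placeholder
  double out = 0.0;
  for (int j = 0; j < theta.size(); j++) {
    out += -0.5 * theta[j] * theta[j];
  }
  return out;
}')

# The compiled function is called from R like any other function,
# e.g., inside the bridge sampling iterations:
logUnnormPosterior(rnorm(5))
```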

Despite its computational simplicity, Warp-III should not be applied blindly. Specifically, as we demonstrated for our empirical examples, it is important to assess the variability of the resulting model comparison measure—such as posterior model probabilities or Bayes factors—by repeating the Warp-III procedure multiple times. When the measure of interest clearly favors a given model, as in our second example, some fluctuation is not necessarily concerning. However, in situations where the fluctuation influences which model is favored, researchers should either increase the number of posterior and proposal samples to decrease the variability of the estimate, or, if this solution is practically infeasible, they should acknowledge that the estimate does not support firm conclusions about the relative predictive adequacy of the models.

The accuracy of the estimate is governed not only by the number of samples but also by the overlap between the proposal and the posterior distribution. Warp-III attempts to maximize this overlap by matching the mean vector, covariance matrix, and skew of the two distributions. However, if the posterior distribution exhibits multiple modes, the overlap may be insufficient. Researchers should carefully check whether multimodalities occur in their application. If this is the case, repeated runs of the Warp-III procedure could be used to obtain an impression of the stability of the estimate. Nevertheless, it should be kept in mind that Warp-III is not designed for multimodal posterior distributions, and results should be interpreted with caution. The development of bridge sampling procedures for multimodal posterior distributions is currently ongoing (e.g., Frühwirth-Schnatter, 2004; Wang & Meng, 2016). Note, however, that this is not a very severe limitation of the Warp-III method, since posterior distributions are unimodal in many models used in psychology; under specific conditions they even converge to normal distributions (Dawid, 1970).

Relatedly, note that we use the unscaled effects $\boldsymbol{\omega}_i$ and the scaling parameters $\boldsymbol{\xi}$ directly in the bridge sampling procedure, but technically, these are only identified jointly. Therefore, MCMC chains for these parameters may look irregular and exhibit, for instance, multiple modes, decreasing the efficiency of the Warp-III procedure as mentioned above. Although this was not the case for our applications, we advise researchers to carefully monitor the MCMC chains of the unidentified unscaled effects and scaling parameters.

On a more theoretical note, as Eq. (3) illustrates, Bayesian model comparison is sensitive to the choice of the prior distribution. We relied on relatively standard priors for the group-level parameters, but also established the robustness of our conclusions with a series of sensitivity analyses (see also Supplemental Materials). Nevertheless, we do not suggest that our prior choices should be considered the gold standard for model comparison in hierarchical MPTs. Several approaches are available for specifying theoretically justified prior distributions for cognitive models (Lee & Vanpaemel, 2018; see also Heck & Wagenmakers, 2016, for specifying order constraints in MPTs). We believe that the increasing popularity of hierarchical MPTs will enable researchers to specify informative paradigm-specific and model-specific prior distributions based on experience with the models (e.g., typical parameter ranges and effect sizes). The dependency on the prior is sometimes considered a weakness of Bayes factor model comparisons (e.g., Aitkin, 2001). Some researchers and statisticians even conclude that, for this reason, the use of Bayes factors is not recommended (e.g., Gelman et al., 2014, chapter 7.4).Footnote 20 In contrast, we believe that the ability to incorporate prior knowledge is an advantage of Bayesian inference; we consider the prior an integral part of the model that should be chosen just as carefully as the likelihood (e.g., Vanpaemel, 2010). Ideally, researchers should preregister their priors before data collection (Chambers, 2013, 2015) to ensure that these are used to express genuine prior knowledge and not to increase researchers’ degrees of freedom in obtaining the desired results. Note that we are not the first to advocate a Bayesian approach to hierarchical MPTs. However, to the best of our knowledge, we are the first to advocate Bayesian model comparison using posterior model/inclusion probabilities and Bayes factors and to provide the tools to compute these quantities for hierarchical MPTs. Equipped with a feasible approach for computing the relevant quantities for Bayesian model comparison, one could, in principle, specify an informed prior for the models themselves in addition to the parameter priors. This way, one could incorporate prior knowledge about how likely each model is or one could, if desired, incorporate a penalty for multiple comparisons as described in Scott and Berger (2010).

Although we focused exclusively on latent-trait MPTs, Warp-III is not limited to the latent-trait approach or other hierarchical MPTs, such as the beta-MPT (Smith & Batchelder, 2010) or the crossed random effects approach (Matzke et al., 2015). Warp-III may be used to compute the marginal likelihood for a large variety of cognitive models. For instance, the simple multivariate normal bridge sampling approach has recently been applied to hierarchical reinforcement learning models (Gronau et al., 2017). We believe that Warp-III may be especially useful for so-called sloppy models with highly correlated parameters (Brown & Sethna, 2003), including but not limited to race models of response times, which often yield skewed posterior distributions (e.g., Brown & Heathcote, 2008; Matzke, Love, & Heathcote, 2017). The Warp-III methodology also lends itself to model comparison in extensions of hierarchical cognitive models that impose on the model parameters a statistical structure such as a linear regression, factor analysis, or analysis of variance (e.g., Boehm, Steingroever, & Wagenmakers, 2017; Heck et al., 2018a; Turner, Wang, & Merkle, 2017; Vandekerckhove, 2014). The application of Warp-III to complex experimental designs is ongoing work in our laboratory.

Although Warp-III is a general procedure for computing the marginal likelihood, other approaches may be better suited for the model comparison problem at hand, depending on the situation. If researchers focus on non-hierarchical implementations of cognitive models, importance sampling may be an easier solution, particularly in the context of MPTs (Vandekerckhove et al., 2015). If the focus is on nested models, the Savage–Dickey density ratio is an easier and faster alternative. Lastly, if the number of models under consideration is very large, Reversible Jump MCMC (Green, 1995) might be the appropriate choice. Nevertheless, we believe that in most applications of hierarchical cognitive models, the research question concerns the comparison of a limited set of possibly non-nested models. In these situations, Warp-III provides a straightforward and accurate method for computing the marginal likelihood for a wide range of complex models.

5. Data Availability Statement

The datasets analyzed during the current study are available on the Open Science Framework: https://osf.io/rycg6/.

Footnotes

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s11336-018-9648-3) contains supplementary material, which is available to authorized users.

1 The interested reader is referred to Plieninger and Heck (2018) for a comparison of these model classes.

2 Note that posterior model probabilities can also be obtained using information criteria (e.g., Burnham & Anderson, 2002; Wagenmakers & Farrell, 2004).

3 For details on the predictive interpretation of the marginal likelihood, see Supplemental Materials available at https://osf.io/rycg6/.

4 Hu (2001), Heck and Erdfelder (2016), and Heck, Erdfelder, and Kieslich (2018b) proposed extensions that also incorporate response times.

5 Bayesian hierarchical models can be also used to account for heterogeneity in items instead of participants.

6 We omit conditioning on the model for enhanced legibility.

7 Bridge sampling in its original form has been proposed to estimate a ratio of normalizing constants. This approach, however, becomes challenging and inefficient in case the two models have different parameter spaces (e.g., non-nested comparisons) and potentially very little overlap between the posterior distributions. For these cases, it may be easier and more efficient to compute each normalizing constant separately (e.g., DiCiccio et al., 1997; Overstall & Forster, 2010). This ensures that the two relevant distributions (i.e., proposal and posterior) for each of the separate bridge sampling applications are close to each other, yielding an efficient estimator. Therefore, we recommend computing each normalizing constant separately to enable application of the method to a wide range of model comparison scenarios.

8 Specifically, we used the median effective sample size across all posterior components.

9 In our experience, the exact value of the initial guess typically does not have a lasting influence on the resulting estimate. Nevertheless, good initial values may lead to faster convergence. For implementation details, see Gronau et al. (2017), especially Appendix B.

10 As before, the probit transformation is defined component-wise.

11 Note that $\xi_{\text{max}}$ drops out of the expression because it cancels with the first term of the Jacobian. Implicitly, however, it still influences the marginal likelihood because it appears in the transformation equation $\boldsymbol{\xi}_{\text{trans}} = \Phi^{-1}\left(\boldsymbol{\xi}/\xi_{\text{max}}\right)$. It is also needed for evaluating $\Pr(C_{kl} \mid \boldsymbol{\mu}, \boldsymbol{\xi}_{\text{trans}}, \boldsymbol{\omega}_i)$ since, in order to obtain the MPT parameters on the probit scale (i.e., Eq. (8)), we need to transform $\boldsymbol{\xi}_{\text{trans}}$ back to $\boldsymbol{\xi}$ via the inverse transformation $\boldsymbol{\xi} = \xi_{\text{max}}\,\Phi(\boldsymbol{\xi}_{\text{trans}})$.

12 Data were obtained from https://bayesmodels.com/; see also Lee and Wagenmakers (2013).

13 Riefer et al. (2002) did not administer singletons.

14 We assessed the accuracy of the estimates conditional on the posterior samples, that is, for each repetition, we used the same posterior samples but generated new samples from the proposal distribution. Whenever feasible, it may be advantageous to also generate new posterior samples in each repetition.

15 The change from prior inclusion odds to posterior inclusion odds can also be quantified by means of an inclusion Bayes factor (not reported).
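For reference, one common way to write this quantity (our notation; it was not reported in the original analysis) is as the ratio of posterior to prior inclusion odds, $\text{BF}_{\text{incl}} = \dfrac{p(\text{inclusion} \mid \text{data})\,/\,p(\text{exclusion} \mid \text{data})}{p(\text{inclusion})\,/\,p(\text{exclusion})}$.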

16 Note that this data set has also been analyzed in Lee and Wagenmakers (2013, chapter 14). In that case, the hierarchical latent-trait approach was used; however, no explicit model comparison or hypothesis testing was conducted.

17 Note also that it is well known that the two-step procedure (b) used by Smith and Batchelder can yield biased conclusions (Boehm, Marsman, Matzke, & Wagenmakers, 2018).

18 Although the Bayes factor indicates overwhelming evidence in favor of the fluency-conditional model, it should be kept in mind that the Bayes factor quantifies the evidence for two models relative to each other. In practice, researchers should also check that the model that is favored by the Bayes factor provides an adequate fit to the observed data (e.g., Steingroever, Wetzels, & Wagenmakers, 2014).

19 Posterior sample sizes smaller than 180,000 were obtained by considering only a subset of the 180,000 posterior samples for each model (i.e., no new posterior samples were obtained). Note that the same posterior sample sizes were used for the Warp-III and the simpler multivariate normal approach, but the results of the two methods are displayed with an offset to avoid overlapping symbols. Plots for each model’s marginal likelihood estimates are presented in Supplemental Materials.

20 Another objection is that Bayes factors are often used to compare nested models where certain values of continuous parameters are treated as “special” (since the parameters are fixed to these values). These researchers often favor continuous model expansion instead (e.g., Gelman et al., 2014, chapter 7.4; Gelman & Rubin, 1995).

References

Aitkin, M. (2001). Likelihood and Bayesian analysis of mixtures. Statistical Modelling, 1, 287–304.
Akaike, H. (1973). Information theory as an extension of the maximum likelihood principle. In B. N. Petrov & F. Csaki (Eds.), Second international symposium on information theory (pp. 267–281). Budapest: Akademiai Kiado.
Batchelder, W. H., & Riefer, D. M. (1980). Separation of storage and retrieval factors in free recall of clusterable pairs. Psychological Review, 87, 375–397.
Batchelder, W. H., & Riefer, D. M. (1990). Multinomial processing models of source monitoring. Psychological Review, 97, 548–564.
Batchelder, W. H., & Riefer, D. M. (1999). Theoretical and empirical review of multinomial process tree modeling. Psychonomic Bulletin & Review, 6, 57–86.
Bayarri, M. J., Berger, J. O., Forte, A., & García-Donato, G. (2012). Criteria for Bayesian model choice with application to variable selection. The Annals of Statistics, 40, 1550–1577.
Bishop, Y. M., Fienberg, S., & Holland, P. (1975). Discrete multivariate analysis: Theory and practice. Cambridge, MA: MIT Press.
Böckenholt, U. (2012a). The cognitive-miser response model: Testing for intuitive and deliberate reasoning. Psychometrika, 77(2), 388–399.
Böckenholt, U. (2012b). Modeling multiple response processes in judgment and choice. Psychological Methods, 17(4), 665–678.
Boehm, U., Marsman, M., Matzke, D., & Wagenmakers, E.-J. (2018). On the importance of avoiding shortcuts in applying cognitive models to hierarchical data. Behavior Research Methods, 50, 1614–1631.
Boehm, U., Steingroever, H., & Wagenmakers, E.-J. (2017). Using Bayesian regression to incorporate covariates into hierarchical cognitive models. Manuscript submitted for publication.
Brown, K. S., & Sethna, J. P. (2003). Statistical mechanical approaches to models with many poorly known parameters. Physical Review E, 68, 021904.
Brown, S. D., & Heathcote, A. (2008). The simplest complete model of choice reaction time: Linear ballistic accumulation. Cognitive Psychology, 57, 153–178.
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York: Springer.
Chambers, C. D. (2013). Registered reports: A new publishing initiative at Cortex. Cortex, 49, 609–610.
Chambers, C. D. (2015). Ten reasons why journals must review manuscripts before results are known. Addiction, 110, 10–11.
Culpepper, S. A. (2014). If at first you don’t succeed, try, try again: Applications of sequential IRT models to cognitive assessments. Applied Psychological Measurement, 38(8), 632–644.
Dawid, A. (1970). On the limiting normality of posterior distributions. Mathematical Proceedings of the Cambridge Philosophical Society, 67, 625–633.
De Boeck, P., & Partchev, I. (2012). IRTrees: Tree-based item response models of the GLMM family. Journal of Statistical Software, 48, 1–28.
DiCiccio, T. J., Kass, R. E., Raftery, A. E., & Wasserman, L. (1997). Computing Bayes factors by combining simulation and asymptotic approximations. Journal of the American Statistical Association, 92, 903–915.
Dickey, J. M., & Lientz, B. P. (1970). The weighted likelihood ratio, sharp hypotheses about chances, the order of a Markov chain. The Annals of Mathematical Statistics, 41, 214–226.
Eddelbuettel, D., François, R., Allaire, J., Chambers, J., Bates, D., & Ushey, K. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40, 1–18.
Erdfelder, E., Auer, T.-S., Hilbig, B. E., Aßfalg, A., Moshagen, M., & Nadarevic, L. (2009). Multinomial processing tree models: A review of the literature. Zeitschrift für Psychologie, 217, 108–124.
Etz, A., & Wagenmakers, E.-J. (2017). J.B.S. Haldane’s contribution to the Bayes factor hypothesis test. Statistical Science, 32, 313–329.
Fazio, L. K., Brashier, N. M., Payne, B. K., & Marsh, E. J. (2015). Knowledge does not protect against illusory truth. Journal of Experimental Psychology: General, 144, 993–1002.
Frühwirth-Schnatter, S. (2004). Estimating marginal likelihoods for mixture and Markov switching models using bridge sampling techniques. Econometrics Journal, 7, 143–167.
Gamerman, D., & Lopes, H. F. (2006). Markov chain Monte Carlo: Stochastic simulation for Bayesian inference. Boca Raton, FL: Chapman & Hall/CRC.
Gelman, A. (2013). Two simple examples for understanding posterior p values whose distributions are far from uniform. Electronic Journal of Statistics, 7, 2595–2602.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian data analysis (3rd ed.). Boca Raton, FL: Chapman & Hall/CRC.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press.
Gelman, A., & Meng, X.-L. (1998). Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, 13, 163–185.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457–472.
Gelman, A., & Rubin, D. B. (1995). Avoiding model selection in Bayesian social research. Sociological Methodology, 25, 165–173.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach (1st ed.). Boca Raton, FL: CRC Press.
Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711–732.
Gronau, Q. F., Sarafoglou, A., Matzke, D., Ly, A., Boehm, U., Marsman, M., et al. (2017). A tutorial on bridge sampling. Journal of Mathematical Psychology, 81, 80–97.
Heck, D. W., Arnold, N. R., & Arnold, D. (2018a). TreeBUGS: An R package for hierarchical multinomial-processing-tree modeling. Behavior Research Methods, 50(1), 264–284.
Heck, D. W., & Erdfelder, E. (2016). Extending multinomial processing tree models to measure the relative speed of cognitive processes. Psychonomic Bulletin & Review, 23, 1440–1465.
Heck, D. W., Erdfelder, E., & Kieslich, P. J. (2018b). Generalized processing tree models: Jointly modeling discrete and continuous variables. Psychometrika, 83, 893–918.
Heck, D. W., & Wagenmakers, E.-J. (2016). Adjusted priors for Bayes factors involving reparameterized order constraints. Journal of Mathematical Psychology, 73, 110–116.
Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14, 382–417.
Hu, X. (2001). Extending general processing tree models to analyze reaction time experiments. Journal of Mathematical Psychology, 45, 603–634.
Hu, X., & Batchelder, W. H. (1994). The statistical analysis of general processing tree models with the EM algorithm. Psychometrika, 59, 21–47.
Hütter, M., & Klauer, K. C. (2016). Applying processing trees in social psychology. European Review of Social Psychology, 27, 116–159.
Jefferys, W. H., & Berger, J. O. (1992). Ockham’s razor and Bayesian analysis. American Scientist, 80, 64–72.
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford, UK: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
Kellen, D., Singmann, H., & Klauer, K. C. (2014). Modeling source-memory overdistribution. Journal of Memory and Language, 76, 216–236.
Klauer, K. C. (2006). Hierarchical multinomial processing tree models: A latent-class approach. Psychometrika, 71, 7–31.
Klauer, K. C. (2010). Hierarchical multinomial processing tree models: A latent-trait approach. Psychometrika, 75, 70–98.
Lee, M. D. (2011). How cognitive modeling can benefit from hierarchical Bayesian models. Journal of Mathematical Psychology, 55, 1–7.
Lee, M. D., & Vanpaemel, W. (2018). Determining informative priors for cognitive models. Psychonomic Bulletin & Review, 25, 114–127.
Lee, M. D., & Wagenmakers, E.-J. (2013). Bayesian cognitive modeling: A practical course. Cambridge: Cambridge University Press.
Liu, S., & Trenkler, G. (2008). Hadamard, Khatri-Rao, Kronecker and other matrix products. International Journal of Information and Systems Sciences, 4, 160–177.
Ly, A., Boehm, U., Heathcote, A., Turner, B. M., Forstmann, B., Marsman, M., & Matzke, D. (2018). A flexible and efficient hierarchical Bayesian approach to the exploration of individual differences in cognitive-model-based neuroscience. In A. A. Moustafa (Ed.), Computational models of brain and behavior (pp. 467–480). Wiley Blackwell.
Ly, A., Verhagen, A. J., & Wagenmakers, E.-J. (2016). An evaluation of alternative methods for testing hypotheses, from the perspective of Harold Jeffreys. Journal of Mathematical Psychology, 72, 43–55.
Matzke, D., Dolan, C. V., Batchelder, W. H., & Wagenmakers, E.-J. (2015). Bayesian estimation of multinomial processing tree models with heterogeneity in participants and items. Psychometrika, 80, 205–235.
Matzke, D., Love, J., & Heathcote, A. (2017). A Bayesian approach for estimating the probability of trigger failures in the stop-signal paradigm. Behavior Research Methods, 49, 267–281.
Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-of-fit testing in 2^n contingency tables. Journal of the American Statistical Association, 100(471), 1009–1020.
Meng, X. -L. (1994). Posterior predictive p values. The Annals of Statistics, 22, 11421160. CrossRefGoogle Scholar
Meng, X. -.L, & Schilling, S. (2002). Warp bridge sampling. Journal of Computational and Graphical Statistics, 11, 552586. CrossRefGoogle Scholar
Meng, X. -L., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831860. Google Scholar
Myung, I. J., & Pitt, M. A. (1997). Applying Occam’s razor in modeling cognition: A Bayesian approach. Psychonomic Bulletin & Review, 4, 7995. CrossRefGoogle Scholar
Overstall, A. M. (2010). https://eprints.soton.ac.uk/170229/ Default Bayesian model determination for generalised liner mixed models (Doctoral dissertation. University of Southampton). Retrieved August 21, 2018 from.Google Scholar
Overstall, A. M., & Forster, J. J. (2010). Default Bayesian model determination methods for generalised linear mixed models. Computational Statistics & Data Analysis, 54, 32693288. CrossRefGoogle Scholar
Plieninger, H., Heck, D. W. (2018). https://doi.org/10.1080/00273171.2018.1469966 A new model for acquiescence at the interface of psychometrics and cognitive psychology. Multivariate Behavioral Research.Google Scholar
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In K. Hornik, F. Leisch, & A. Zeileis (Eds.), Proceedings of the 3rd International Workshop on Distributed Statistical Computing. Vienna, Austria.
Plummer, M., Best, N., Cowles, K., & Vines, K. (2006). CODA: Convergence diagnosis and output analysis for MCMC. R News, 6, 7–11.
R Core Team. (2016). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved August 21, 2018 from https://www.R-project.org/
Riefer, D. M., & Batchelder, W. H. (1988). Multinomial modeling and the measurement of cognitive processes. Psychological Review, 95, 318–339.
Riefer, D. M., Knapp, B. R., Batchelder, W. H., Bamber, D., & Manifold, V. (2002). Cognitive psychometrics: Assessing storage and retrieval deficits in special populations with multinomial processing tree models. Psychological Assessment, 14, 184–201.
Robins, J. M., van der Vaart, A., & Ventura, V. (2000). Asymptotic distribution of p values in composite null models. Journal of the American Statistical Association, 95(452), 1143–1156.
Rouder, J. N., & Lu, J. (2005). An introduction to Bayesian hierarchical models with an application in the theory of signal detection. Psychonomic Bulletin & Review, 12, 573–604.
Rouder, J. N., Lu, J., Morey, R. D., Sun, D., & Speckman, P. L. (2008). A hierarchical process dissociation model. Journal of Experimental Psychology: General, 137, 370–389.
Rouder, J. N., Morey, R. D., Verhagen, A. J., Swagman, A. R., & Wagenmakers, E.-J. (2017). Bayesian analysis of factorial designs. Psychological Methods, 22, 304–321.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Scott, J. G., & Berger, J. O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics, 38, 2587–2619.
Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88(422), 286–292.
Singmann, H., & Kellen, D. (2013). MPTinR: Analysis of multinomial processing tree models in R. Behavior Research Methods, 45, 560–575.
Singmann, H., Kellen, D., & Klauer, K. C. (2013). Investigating the other-race effect of Germans towards Turks and Arabs using multinomial processing tree models. In M. Knauff, M. Pauen, N. Sebanz, & I. Wachsmuth (Eds.), Proceedings of the 35th annual conference of the Cognitive Science Society (pp. 1330–1335). Austin, TX: Cognitive Science Society.
Sinharay, S., & Stern, H. S. (2005). An empirical comparison of methods for computing Bayes factors in generalized linear mixed models. Journal of Computational and Graphical Statistics, 14, 415–435.
Smith, J. B., & Batchelder, W. H. (2010). Beta-MPT: Multinomial processing tree models for addressing individual differences. Journal of Mathematical Psychology, 54, 167–183.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society B, 64, 583–639.
Stan Development Team. (2016). RStan: The R interface to Stan (R package version 2.14.1). Retrieved August 21, 2018 from http://mc-stan.org/
Steingroever, H., Wetzels, R., & Wagenmakers, E.-J. (2014). Absolute performance of reinforcement-learning models for the Iowa Gambling Task. Decision, 1, 161–183.
Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society Series B, 39, 44–47.
Turner, B. M., Wang, T., & Merkle, E. C. (2017). Factor analysis linking functions for simultaneously modeling neural and behavioral data. NeuroImage, 153, 28–48.
Vandekerckhove, J. (2014). A cognitive latent variable model for the simultaneous analysis of behavioral and personality data. Journal of Mathematical Psychology, 60, 58–71.
Vandekerckhove, J., Matzke, D., & Wagenmakers, E.-J. (2015). Model comparison and the principle of parsimony. In J. Busemeyer, J. Townsend, Z. J. Wang, & A. Eidels (Eds.), Oxford handbook of computational and mathematical psychology (pp. 300–319). Oxford: Oxford University Press.
Vanpaemel, W. (2010). Prior sensitivity in theory testing: An apologia for the Bayes factor. Journal of Mathematical Psychology, 54, 491–498.
Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413–1432.
Wagenmakers, E.-J., & Farrell, S. (2004). AIC model selection using Akaike weights. Psychonomic Bulletin & Review, 11, 192–196.
Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage–Dickey method. Cognitive Psychology, 60, 158–189.
Wang, L., & Meng, X.-L. (2016). Warp bridge sampling: The next generation. arXiv preprint arXiv:1609.07690.
Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11, 3571–3594.

Figure 1. Pair-clustering MPT. Available at https://tinyurl.com/yb7bma4e under CC license https://creativecommons.org/licenses/by/2.0/.

Figure 2. Matching the proposal and posterior distribution with warping. Histograms show the posterior distribution; density lines show the standard normal proposal distribution. Available at https://tinyurl.com/y7owvsz3 under CC license https://creativecommons.org/licenses/by/2.0/.
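
The warping in Figure 2 can be sketched in a few lines of R. The sketch below is illustrative rather than the implementation used in the article; the function name warp_three and the assumption that samples is an S × P matrix of posterior draws on an unconstrained scale are ours. Warp-I shifts the draws to mean zero, warp-II rescales them to unit covariance via a Cholesky factor, and warp-III attaches a random sign so that the warped draws are symmetric around zero and thus approximately match the standard normal proposal (Meng & Schilling, 2002).

    # Illustrative sketch of Warp-III warping (not the authors' code).
    # samples: S x P matrix of posterior draws.
    warp_three <- function(samples) {
      mu <- colMeans(samples)                        # warp-I: posterior mean
      R  <- chol(cov(samples))                       # warp-II: Sigma = t(R) %*% R
      z  <- sweep(samples, 2, mu) %*% solve(R)       # centered, whitened draws (unit covariance)
      b  <- sample(c(-1, 1), nrow(z), replace = TRUE)
      z * b                                          # warp-III: reflect each row with probability 1/2
    }

For instance, applying warp_three to strongly skewed draws yields a sample with mean near zero, covariance near the identity matrix, and a symmetric histogram; this increased overlap between posterior and proposal is what stabilizes the bridge sampling estimate.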

Table 1. Overview of the eight nested models for the analysis of the first two trials of the pair-clustering data set reported in Riefer et al. (2002).

Figure 3. Posterior distributions of the probit group-level means (plotted on the probability scale) from the full model $\mathcal{M}_1$ for the analysis of the first two trials of the pair-clustering data reported in Riefer et al. (2002). The solid lines correspond to the posteriors for the first trial and the dotted lines to the posteriors for the second trial. Available at https://tinyurl.com/y9a33l4t under CC license https://creativecommons.org/licenses/by/2.0/.

Figure 4. Posterior model probabilities (left panel) and posterior inclusion probabilities (right panel) for the analysis of the first two trials of the pair-clustering data reported in Riefer et al. (2002) obtained with Warp-III bridge sampling. In the left panel, the x-axis indicates which parameters were allowed to vary from the first to the second trial (e.g., $c$-$u$ corresponds to $\mathcal{M}_3$, where $r$ was fixed between trials). Gray symbols show the results of the 50 repetitions, and black symbols display the posterior model probabilities and posterior inclusion probabilities that are based on the median of the 50 estimated log marginal likelihoods. Circles show results obtained with the narrow prior, diamonds with the medium prior, and triangles with the wide prior. The dotted lines show the prior model probabilities and prior inclusion probabilities. Available at https://tinyurl.com/yaxbj9o6 under CC license https://creativecommons.org/licenses/by/2.0/.

Figure 5. Knowledge-conditional (top panel) and fluency-conditional (bottom panel) MPTs. Available at https://tinyurl.com/ya8sovfr under CC license https://creativecommons.org/licenses/by/2.0/.

Figure 6. Log Bayes factor estimates in favor of the fluency-conditional (FC) model over the knowledge-conditional (KC) model as a function of the number of posterior samples. The Warp-III estimates are displayed in white, and the estimates based on the simpler multivariate normal approach are displayed in gray. Available at https://tinyurl.com/ydbfev7w under CC license https://creativecommons.org/licenses/by/2.0/.

Supplementary material

Gronau et al. supplementary material: A Simple Method for Comparing Complex Models: Bayesian Model Comparison for Hierarchical Multinomial Processing Tree Models Using Warp-III Bridge Sampling (Online Appendix). File, 918.7 KB.