
Accounting for Persistence in Tests with Linear Ballistic Accumulator Models

Published online by Cambridge University Press:  16 June 2025

Jochen Ranger*
Affiliation:
Department of Psychology, Martin-Luther-Universität Halle-Wittenberg, Halle, Germany
Sören Much
Affiliation:
Department of Psychology, Martin-Luther-Universität Halle-Wittenberg, Halle, Germany; Wilhelm Wundt Institute for Psychology, Leipzig University, Leipzig, Germany
Niklas Neek
Affiliation:
Department of Psychology, Martin-Luther-Universität Halle-Wittenberg, Halle, Germany
Augustin Mutak
Affiliation:
Faculty of Education and Psychology, Freie Universität Berlin, Berlin, Germany; Faculty of Education and Psychology, University of Zagreb, Zagreb, Croatia
Steffi Pohl
Affiliation:
Faculty of Education and Psychology, Freie Universität Berlin, Berlin, Germany
*
Corresponding author: Jochen Ranger; jochen.ranger@psych.uni-halle.de

Abstract

In this article, we propose a series of latent trait models for the responses and the response times on low-stakes tests where some test takers respond prematurely without making a full effort to solve the items. The models consider individual differences in capability and persistence. At the core of the models is a race between the solution process and a process of disengagement that interrupts the solution process. The different processes are modeled with the linear ballistic accumulator model. Within this general framework, we develop different model variants that differ in the number of accumulators and the way the response is generated when the solution process is interrupted. We distinguish no guessing, random guessing, and informed guessing, where the guessing probability depends on the status of the solution process. We conduct simulation studies on parameter recovery and on trait estimation. The simulation studies suggest that parameter values and traits can be recovered well under certain conditions. Finally, we apply the model variants to empirical data.

Information

Type
Theory and Methods
Creative Commons
Creative Commons License: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Psychometric Society

The motivation to take a test is an important determinant of the test result (Silm et al., Reference Silm, Pedaste and Täht2020). Test-taking motivation is an internal state that drives test takers to engage in the test and pursue good test results (Bandhu et al., Reference Bandhu, Mohan, Nittala, Jadhav, Bhadauria and Saxena2024). It determines the intensity with which test takers work and the time they are willing to spend on the test (Baumert & Demmerich, Reference Baumert and Demmerich2001; Eklöf, Reference Eklöf2010; Knetka, Reference Knetka2017). The intensity depends on the extent to which mental resources are mobilized and assigned to a task (Ranger & Kuhn, Reference Ranger and Kuhn2014; Wenger & Gibson, Reference Wenger and Gibson2004). The willingness to spend time on the test determines the maximal time a test taker is disposed to invest in the test and in the single items. In this article, we focus on the willingness to spend time on the test and ignore the other facet of test-taking motivation, the intensity of test taking.

It is well known that in achievement tests, the test results depend not only on the capability of the test takers, but also on their motivation to take the test. Recent findings suggest that the time dedicated to the items has a strong relation to the test score (Cheyette & Piantadosi, Reference Cheyette and Piantadosi2024). In fact, the importance of the time spent on the solution process has been acknowledged since the very beginning of psychometrics (Thurstone, Reference Thurstone1937). Findings also suggest that test takers differ systematically in the amount of time they invest in a test. This characteristic of a test taker may be represented by a latent trait that we denote as persistence. Test takers with a low value of persistence are not willing to spend much time on the items and potentially quit working prematurely. The persistence of test takers has been conceptualized differently in the literature. On the one hand, it has been conceptualized as a stable trait that determines the time investment in the single items throughout the test (Goldhammer et al., Reference Goldhammer, Martens and Lüdtke2017). In this conceptualization, the perspective is on the general level of the time investment. On the other hand, it has been conceptualized as the maintenance of engagement throughout the test (Nagy & Robitzsch, Reference Nagy and Robitzsch2021). In this conceptualization, the perspective is on the change, or rather its absence, over the course of the test. In this article, we understand persistence in the first sense, as the general tendency of a test taker to invest time in the single items. A formal definition of persistence in mathematical terms is given in the next section.

The influence of test-taking motivation on the test results implies that the test scores of the test takers reflect not only their level of ability, but also their level of persistence. This is problematic when ability is the target trait of the assessment. The joint influence of ability and persistence on the test results has to be disentangled by a measurement model that considers both quantities.

1 Latent trait models with persistence components

The responses of test takers in tests are often modeled with item response theory models (Embretson & Reise, Reference Embretson and Reise2000). Standard item response theory models assume that the probability to solve an item depends on the difficulty of the item and the effective ability of a test taker. The effective ability, however, confounds capability and persistence, as highlighted in the previous section. While extensions like bifactor models (Gibbons & Hedeker, Reference Gibbons and Hedeker1992) or non-compensatory item response models (DeMars, Reference DeMars2016; Suh & Bolt, Reference Suh and Bolt2010) might to some extent capture the effects of persistence on test taking in further latent traits, they cannot represent persistence as a variable directly related to the invested time. This requires models for the responses and the response times. Models that account for individual differences in persistence can be classified into two classes: mixture models and race models.

Mixture models for responses and response times were proposed for low-stakes tests or tests with a strict time limit in order to model rapid guessing (Alós-Ferrer, Reference Alós-Ferrer2018; Lu et al., Reference Lu, Wang, Zhang and Tao2020; Meyer, Reference Meyer2010; Molenaar et al., Reference Molenaar, Bolsinova and Vermunt2018; Nagy & Ulitzsch, Reference Nagy and Ulitzsch2022; Ulitzsch et al., Reference Ulitzsch, von Davier and Pohl2022; Wang & Xu, Reference Wang and Xu2015). Here, we focus on their application to low-stakes tests. In their simplest form, the mixture models assume two different modes of responding: a fast one, where test takers guess rapidly, and a slow one, where test takers respond regularly. Fast guesses are responses with a very low investment of time. They are construed as the result of random responding and do not reflect a test taker’s capability. Regular responses, on the other hand, are responses with a high investment of time. They are supposed to reflect the capability of the test taker. In mixture models, individual differences in persistence are captured in the propensity to respond in one of the two response modes. Test takers with low persistence have a high frequency of rapid guessing.

A different conceptualization of persistence is made by race models. Race models assume a race between the response process and a timer that interrupts the response process. When the response process is interrupted prematurely, the test taker responds either incorrectly, omits the item, or selects a response by guessing. First versions of race models were proposed by Roskam (Reference Roskam, Roskam and Suck1987) and Glickman et al. (Reference Glickman, Gray and Morales2005), although these authors did not refer to persistence. Lee and Ying (Reference Lee and Ying2015) and Lu and Wang (Reference Lu and Wang2020) proposed mixture cure-rate models for responses and response times where the response process is censored by a censoring time. In their models, the censoring time is not related to characteristics of the test takers. Ranger and Kuhn (Reference Ranger and Kuhn2014), on the other hand, assumed a race between information processing and disengagement and related both processes to latent traits. Hawkins and Heathcote (Reference Hawkins and Heathcote2021) proposed a model where different response options race against a censoring time. In the race models, the persistence of a test taker is represented by a latent trait that is related to the censoring time. This allows for individual differences in the time spent on a task. Test takers with a very low level of persistence tend to give responses fast and prematurely. But note that even test takers with a high level of persistence may interrupt processing in case an item is very time demanding.

The overview illustrates that the mixture models and the race models make different assumptions about the response process. Specifically, they differ in their assumptions about persistence, guessing, and the generation of incorrect responses. We discuss these aspects in the following.

Mixture models distinguish two modes of responding: a fast one that leads to a rapid guess and a slow one that leads to a regular response. Individual differences in persistence manifest in the frequency with which a test taker is in a particular response mode. In the fast response mode, however, there are no individual differences with respect to the solution probability and the response time distribution. This assumption contradicts the observation that even test takers with reduced engagement differ systematically in the time they invest in the items (Cheyette & Piantadosi, Reference Cheyette and Piantadosi2024). Persistence is thus more than just the propensity to respond by rapid guessing. The race models, on the other hand, relate the maximal time a test taker is willing to spend on an item to a latent trait of the test taker. This makes them more adequate for low-stakes tests where test takers differ gradually in their test-taking motivation. Race models, however, are less adequate for speeded tests where the time spent on an item changes rapidly when test takers run out of time.

In the mixture models, the responses are either regular responses or random guesses. A random guess is a guess with a fixed guessing probability that may differ over items, but not over test takers. This implies that guessing is related neither to the capability of a test taker nor to the actual progress a test taker has made when working on an item. Guesses, however, are rarely entirely random (Noventa et al., Reference Noventa, Heller and Kelava2024). Evidence suggests that test takers use partial knowledge when guessing. Partial knowledge is incomplete knowledge that improves the probability of a successful guess (Burton, Reference Burton2002). Partial knowledge may be used for a comparative evaluation of all response options (Suh & Bolt, Reference Suh and Bolt2010) or for the elimination of distractors in single-choice tests (Lau et al., Reference Lau, Lau, Hong and Usop2011). This suggests the existence of two forms of guessing behavior: rapid guessing and informed guessing (Guthrie et al., Reference Guthrie, Zhang and Chen2020). In item response theory, informed guessing is modeled by relating the guessing probability to the capability of a test taker (San Martin et al., Reference San Martin, del Pino and De Boeck2006) or by a sequential process where distractors are eliminated (Bechger et al., Reference Bechger, Maris, Verstralen, Verhelst, van der Ark, Croon and Sijtsma2005). Informed guessing cannot be accounted for by the mixture models reviewed above. The race models are also deficient in this respect. They either assume no guessing or random guessing.

In the race models, the incorrect responses are generated by an interruption of the response process. This implies that the solution probability increases with the time spent on an item. This assumption may not hold in practice as findings suggest that incorrect responses can be generated actively (Duncan, Reference Duncan1974; Stupple et al., Reference Stupple, Pitchford, Ball, Hunt and Steel2017). This happens in case a test taker has a misconception of the problem and believes that an incorrect solution is correct (Frary, Reference Frary1980). A misconception is different from partial knowledge or lack of understanding (Lau et al., Reference Lau, Lau, Hong and Usop2011) as it generates incorrect responses irrespective of the given response options in a single-choice test and lowers the solution probability beyond what is expected by chance (Sadler, Reference Sadler1998; Suh & Bolt, Reference Suh and Bolt2010). A detailed analysis of the misconceptions held by the test takers requires single-choice tests with tailored distractors and an analysis with nominal response models (Sadler, Reference Sadler1998), Bayesian networks (Lee & Corter, Reference Lee and Corter2011), or cognitive diagnosis models (de la Torre, Reference de la Torre2009). Race models with two accumulators representing information processing and disengagement cannot account for the effect of misconceptions. The mixture models reviewed above are better suited for misconceptions. Although they are not capable of diagnosing the specific kind of misconception, they allow for incorrect responses that are not caused by guessing.

In this article, we address some of the issues raised above. We propose six models for the responses and response times on tests that account for individual differences in the time the test takers are willing to spend on the items. The models are based on the linear ballistic accumulator model (Brown & Heathcote, Reference Brown and Heathcote2008; Rouder et al., Reference Rouder, Province, Morey, Gomez and Heathcote2015). The proposed models differ in the number of accumulators (two/three) and in their assumptions about guessing (none/random/informed). By considering informed guessing, we cover the twilight zone between rapid guessing on a purely random basis and regular responding. The article is structured as follows. First, we present the different models. Then, we report the results of a simulation study on the recovery of the model parameters and the latent traits. Thereafter, we apply the proposed models to empirical data. The article ends with a discussion.

2 Log-normal race models for persistence in tests

In psychometrics, there has been increased interest in psychometric process models (Batchelder, Reference Batchelder2007). Psychometric process models are measurement models that are derived from a mathematical description of the response process and promise a more profound characterization of the test takers than item response models that simply describe test takers in terms of their effective ability to solve items (van der Maas et al., Reference van der Maas, Molenaar, Maris, Kievit and Borsboom2011). In the following, we propose a process model on the basis of the linear ballistic accumulator model (Brown & Heathcote, Reference Brown and Heathcote2008; Rouder et al., Reference Rouder, Province, Morey, Gomez and Heathcote2015). We chose the linear ballistic accumulator model for its very good trade-off between simplicity and flexibility. Besides, it has an extendible modular structure and is interpretable in terms of a log-normal race model. Throughout the article, we assume that a test consists of a series of items and that the responses and the response times have been recorded. The response format of the items may be free or single-choice. Responses are scored as either correct or incorrect.

In the linear ballistic accumulator model, the responses are generated by a race of different accumulators. In a first series of models (Models A1–A3), we assume two accumulators. The first accumulator represents the progress of a test taker toward solving an item and the second accumulator the test taker’s tendency to disengage. The first accumulator triggers the correct response. The second accumulator triggers either an incorrect response (Model A1), a random guess (Model A2), or an informed guess (Model A3). In a second series of models (Models B1–B3), we assume three accumulators. As in Models A1–A3, the first and second accumulators in Models B1–B3 represent the progress and the disengagement, respectively. We additionally assume a third accumulator that represents the tendency to respond incorrectly. The first accumulator triggers a correct response, the third accumulator an incorrect response, and the second accumulator an incorrect response (Model B1), a random guess (Model B2), or an informed guess (Model B3). An overview of the different models considered in the article is given in Table 1. A more thorough description follows in the next two sections.

Table 1 Overview of the different models for persistence

2.1 Models of Class A

In Models A1–A3, we assume that the response process can be described by two accumulation processes. The first accumulator represents the progress toward the solution that was made when working on an item. The progress toward the solution increases linearly over time. The drift rate $\alpha _g ( \theta _1 )$ of the progress of a test taker in item g is a log-linear function of the test taker’s capability $\theta _1$ and two item parameters $\alpha _{0g}$ and $\alpha _{1g}$ :

(1) $$ \begin{align} \log \big( \alpha_{g} (\theta_1 ) \big) = \alpha_{0g} + \alpha_{1g} \theta_1 + e_g \text{.} \end{align} $$

The value of $\theta _1$ ( $\theta _1 \in \mathbb {R}$ ) quantifies the level of capability a test taker has. Item parameter $\alpha _{0g}$ ( $\alpha _{0g} \in \mathbb {R}$ ) is an intercept parameter that determines the drift rate of a reference test taker with trait level $\theta _1=0$ . Item parameter $\alpha _{1g}$ ( $\alpha _{1g} \in \mathbb {R}^{+}$ ) is a discrimination coefficient that determines the influence the capability has on the drift rate. The residual $e_g$ ( $e_g \in \mathbb {R}$ ) is a random variable that represents all additional influences on the accumulation process that are unrelated to the capability of the test taker. The role of $e_g$ is similar to the role of the specific influences in factor analysis.

As a consequence of the linear ballistic accumulation, the progress $\text {P}_{g}( t; \alpha _{g}(\theta _1 ) )$ of a test taker with capability $\theta _1$ and drift rate $\alpha _{g}(\theta _1 )$ in item g at point of time t is:

(2) $$ \begin{align} \text{P}_{g}\big( t; \alpha_{g}( \theta_1 ) \big) = \alpha_{g}\big( \theta_1 \big) \cdot t = \left( \exp \big( \alpha_{0g} + \alpha_{1g} \theta_1 \big) \cdot t \right) \cdot \exp \big( e_g \big) \text{.} \end{align} $$

Equation (2) implies that not all test takers with the same level of capability $\theta _1$ make the same progress in item g. The individual progress of the test takers fluctuates around the typical progress $\exp ( \alpha _{0g} + \alpha _{1g} \theta _1 ) \cdot t$ due to the random influence $\exp ( e_g )$ . This random influence accounts for the noise of the information accumulation. In the following, we assume that $e_g$ is a normally distributed random variable with expectation of zero and standard deviation of $\sigma _{1g}$ . This implies that the progress at each point of time has a log-normal distribution with scale parameter $\exp ( \alpha _{0g} + \alpha _{1g} \theta _1 ) \cdot t$ and shape parameter $\sigma _{1g}$ in case the latent trait is considered as fixed.

When processing an item, test takers make progress toward the solution, but also lose their engagement to work on the task. We assume a second linear ballistic accumulator that represents the tendency to disengage from the task. The disengagement tendency increases linearly over time with a second drift rate. We assume that the second drift rate is related to the persistence $\theta _2$ of a test taker and two item parameters $\beta _{0g}$ and $\beta _{1g}$ via the log-linear model:

(3) $$ \begin{align} \log \big( \beta_{g} ( \theta_2) \big) = \beta_{0g} - \beta_{1g} \theta_2 + r_g \text{.} \end{align} $$

The value of $\theta _2$ ( $\theta _2 \in \mathbb {R}$ ) quantifies the persistence a test taker has. The item parameters $\beta _{0g}$ ( $\beta _{0g} \in \mathbb {R}$ ) and $\beta _{1g}$ ( $\beta _{1g} \in \mathbb {R}^{+}$ ) can be interpreted in parallel to $\alpha _{0g}$ and $\alpha _{1g}$ . In contrast to Equation (1), the contribution of $\theta _2$ to $\beta _{g} ( \theta _2 )$ is negative. This justifies the interpretation of $\theta _2$ in terms of persistence as high values of $\theta _2$ correspond to low levels of disengagement. The random variable $r_g$ is a residual that represents all additional influences on the disengagement process.

As a consequence of the linear ballistic accumulation, the disengagement tendency $\text {D}_{g}( t; \beta _{g}( \theta _2 ) )$ of a test taker with persistence $\theta _2$ and drift rate $\beta _{g}(\theta _2 )$ in item g at point of time t is:

(4) $$ \begin{align} \text{D}_{g}\big( t; \beta_{g}( \theta_2 ) \big) = \beta_{g}\big( \theta_2 \big) \cdot t = \left( \exp \big( \beta_{0g} - \beta_{1g} \theta_2 \big) \cdot t \right) \cdot \exp \big( r_g \big) \text{.} \end{align} $$

The residual $r_g$ represents the noise of the accumulation process. It is assumed to be distributed according to a normal distribution with expectation of zero and standard deviation of $\sigma _{2g}$ . This implies that the disengagement tendency at each point of time has a log-normal distribution with scale parameter $\exp ( \beta _{0g} - \beta _{1g} \theta _2 ) \cdot t$ and shape parameter $\sigma _{2g}$ in case the latent trait is considered as fixed.

For both accumulators, we assume two critical thresholds. The first threshold $C_{1g}$ represents the progress that has to be made in order to solve an item. As soon as the momentary progress of a test taker $\text {P}_{g}( t; \alpha _{g}( \theta _1 ) )$ exceeds $C_{1g}$ , a correct response can be given. This happens at the solution time $ t_{Sg} = C_{1g} / \alpha _{g}( \theta _1 ) $ . As the log-transformed solution time $\log ( t_{Sg} ) = \log ( C_{1g} ) - ( \alpha _{0g} + \alpha _{1g} \theta _1 ) - e_g $ is a linear function of capability and the residual $e_g$ , the solution time has a log-normal distribution. The second threshold $C_{2g}$ represents the level of disengagement at which test takers stop working. This happens at the disengagement time $ t_{Dg} = C_{2g} / \beta _{g}( \theta _2 ) $ . As the log-transformed disengagement time $\log ( t_{Dg} ) = \log ( C_{2g} ) - ( \beta _{0g} - \beta _{1g} \theta _2 ) - r_g $ is a linear function of persistence and the residual $r_g$ , the disengagement time is likewise log-normally distributed.

The response and the response time in an item are the result of a race between the two accumulators (Rouder et al., Reference Rouder, Province, Morey, Gomez and Heathcote2015). The observed response time $t_g$ is the hitting time of the faster accumulator. It is the minimum of the solution time and the disengagement time $t_g = \min [ t_{Sg},t_{Dg} ] = \min [ C_{1g} / \alpha _{g}( \theta _1 ) , C_{2g} / \beta _{g}( \theta _2 ) ]$ . The response $x_g$ depends on which accumulator reaches its threshold first. In case the accumulator representing progress wins, the response is always the correct solution. In case the accumulator representing disengagement wins, the response depends on the version of the model. In a first version of the model (Model A1), the test takers always respond incorrectly. In a second version of the model (Model A2), the test takers guess randomly. In this case, the guessing probability is an item-dependent quantity $\pi _g$ that does not depend on the latent traits of a test taker. In a third version of the model (Model A3), the guessing probability depends on the progress made up to $t_g$ . We assume that the interval $[0,C_{1g}]$ represents the continuum from random guessing to a correct solution. When $\text {P}_{g}( t_g; \alpha _{g}( \theta _1 ) )=0$ , there was no progress and the solution probability is on the level $\pi _g$ of a random guess. When $\text {P}_{g}( t_g; \alpha _{g}( \theta _1 ) )=C_{1g}$ , the test taker has made enough progress to solve the item. In this case, the solution probability is $1.0$ . For all other levels of progress, the guessing probability is determined by the linear interpolation:

(5) $$ \begin{align} \pi \big( \text{P}_{g}\big( t_g; \alpha_{g}( \theta_1 ) \big) \big) = \pi_g + (1-\pi_g) \cdot \frac{ \text{P}_{g}\big( t_g; \alpha_{g}( \theta_1 ) \big) }{ C_{1g} } \text{.} \end{align} $$

An example of the assumed response process is given in Figure 1, left plot. The line from $0$ to $\text {T}_{\text {S}}$ visualizes the progress of a test taker with drift rate of $\alpha _{g} (\theta _1) = 3.33$ . This drift rate implies that the test taker requires a solution time of $t_{Sg}=3$ in order to make enough progress ( $C_{1g}=10$ ) to solve the task. The line from $0$ to $\text {T}_{\text {D}}$ visualizes the disengagement tendency of a test taker with drift rate $\beta _{g}( \theta _2) = 5$ . The disengagement accumulator hits the threshold $C_{2g}=10$ at the disengagement time $t_{Dg}=2$ and interrupts the solution process before the solution has been found. In Model A1, the test taker would respond incorrectly. In Model A2, the test taker would guess randomly with a fixed guessing probability of, for example, $\pi _g=0.125$ . In Model A3, the test taker would guess on the basis of the progress that was made up to the time $t=2$ . As $\text {P}_{g}( 2; \alpha _{g}( \theta _1 ) )=6.66$ , the guessing probability is $\pi ( 6.66 ) = \pi _g + (1-\pi _g) \cdot 6.66/10$ , which would be $\pi (6.66) = 0.125+0.875 \cdot 6.66/10 = 0.70$ when the probability of a random guess is $\pi _g=0.125$ .

Figure 1 Illustration of the assumed response process for variant A and variant B of the linear ballistic accumulator model.
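To make the assumed response process concrete, the following minimal simulation sketch generates a single response and response time under Models A1–A3. It is an illustration only: the function and parameter names are ours, the parameter values are arbitrary, and the thresholds are fixed at $C=10$ as in the example above.

```python
# Minimal simulation sketch of Models A1-A3 for a single item.
# All names and values are illustrative, not part of the formal model.
import numpy as np

rng = np.random.default_rng(1)

def simulate_model_a(theta1, theta2, a0, a1, b0, b1, s1, s2,
                     C=10.0, pi_g=0.125, variant="A3"):
    """Return one response x (1 = correct) and response time t."""
    # Drift rates with log-normal noise (Equations (1) and (3)).
    alpha = np.exp(a0 + a1 * theta1 + rng.normal(0.0, s1))
    beta = np.exp(b0 - b1 * theta2 + rng.normal(0.0, s2))
    t_sol = C / alpha  # hitting time of the progress accumulator
    t_dis = C / beta   # hitting time of the disengagement accumulator
    if t_sol <= t_dis:              # progress wins: correct response
        return 1, t_sol
    if variant == "A1":             # disengagement wins: incorrect
        return 0, t_dis
    if variant == "A2":             # disengagement wins: random guess
        return int(rng.random() < pi_g), t_dis
    # Variant A3: informed guess from the progress made so far (Equation (5)).
    p_guess = pi_g + (1.0 - pi_g) * (alpha * t_dis) / C
    return int(rng.random() < p_guess), t_dis

# Example: average capability, low persistence.
x, t = simulate_model_a(theta1=0.0, theta2=-1.0, a0=0.6, a1=0.3,
                        b0=0.9, b1=0.3, s1=0.5, s2=0.5)
```

Repeating such draws over items and test takers produces data of the kind used in the simulation studies reported below.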

From the distributional assumptions about the residuals, the joint distribution of the response and the response time in an item can be derived. In order to save space, we give just a short sketch of the derivation. The solution time and the disengagement time are determined by the thresholds and the drift rates. The solution time, for example, is $t_{Sg} = C_{1g}/ \alpha _{g} (\theta _1)$ . For fixed latent traits, the drift rates are log-normally distributed. The inverse value of a log-normal random variable is likewise log-normally distributed. The solution and the disengagement time are thus log-normally distributed. The response time in Model A is the minimum of the solution and the disengagement time. Both times are log-normally distributed and independent as the residual terms $e_g$ and $r_g$ are independent. The subdensity function of the first hitting time of the winning accumulator is thus the product of a log-normal density function with the survival function (one minus the distribution function) of a log-normal distribution. The observed responses are determined by the winning accumulator. In Model A1, the response is always incorrect when the disengagement accumulator wins. In Model A2, the response is set by a guessing process that acts independently of the response time with a fixed guessing probability. In Model A3, the guessing probability depends on the relation of the progress to the threshold $C_{1g}$ at the disengagement time. The progress at the disengagement time, given that the disengagement accumulator wins, is distributed according to a truncated log-normal distribution. The guessing probability is a linear function of the ratio of the accumulated progress and the threshold $C_{1g}$ (Equation (5)). It likewise has a truncated log-normal distribution. The subdensity function of the disengagement time is divided between the correct and the incorrect response according to the potential guessing probability. This requires integration over the truncated log-normal distribution. The implied subdensities of Models A1–A3 can be found in Section S1 of the Supplementary Material.
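For Model A1, this sketch can be written down directly: the subdensity of a correct response at time $t$ is the log-normal density of the solution time multiplied by the survival function of the disengagement time, and vice versa for an incorrect response. The following snippet is again only an illustration with names of our choosing:

```python
# Subdensities of Model A1 implied by the race of two independent
# log-normal hitting times (illustrative sketch).
import numpy as np
from scipy.stats import lognorm

def subdensity_a1(t, x, theta1, theta2, a0, a1, b0, b1, s1, s2, C=10.0):
    """Evaluate f(x, t | theta) for Model A1."""
    # log t_S = log C - (a0 + a1*theta1) - e_g, so t_S is log-normal.
    sol = lognorm(s=s1, scale=C * np.exp(-(a0 + a1 * theta1)))
    # log t_D = log C - (b0 - b1*theta2) - r_g, so t_D is log-normal.
    dis = lognorm(s=s2, scale=C * np.exp(-(b0 - b1 * theta2)))
    if x == 1:  # solution observed at t, disengagement still pending
        return sol.pdf(t) * dis.sf(t)
    # Incorrect response: disengagement observed at t, solution pending.
    return dis.pdf(t) * sol.sf(t)
```

For Models A2 and A3, the subdensity of the disengagement time would additionally be split between the correct and the incorrect response according to the guessing probability, as described above.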

2.2 Models of Class B

In Models A1–A3, an item can always be solved when the persistence is sufficiently high. This assumption is plausible for simple tasks (e.g., simple calculations), but implausible for more complex tasks (e.g., items on crystallized intelligence). To overcome this limitation, we extend the models by a third accumulator that represents the progress toward an incorrect solution, a quantity we denote as misinformation in the following. The incorrect solution subsumes all systematically wrong responses that are due to a misconception of the problem. The misinformation in item g increases linearly over time with a drift rate that depends on a third latent trait $\theta _3$ , namely the error-proneness of a test taker, and two item parameters $\gamma _{0g}$ and $\gamma _{1g}$ via the log-linear model:

(6) $$ \begin{align} \log \big( \gamma_{g} (\theta_3 ) \big) = \gamma_{0g} + \gamma_{1g} \theta_3 + s_g \text{.} \end{align} $$

The value of $\theta _3$ ( $\theta _3 \in \mathbb {R}$ ) quantifies the proneness of a test taker toward an incorrect solution. The item parameters $\gamma _{0g}$ ( $\gamma _{0g} \in \mathbb {R}$ ) and $\gamma _{1g}$ ( $\gamma _{1g} \in \mathbb {R}^{+}$ ) can be interpreted in parallel to the corresponding item parameters in Models A1–A3. The quantity $s_g$ is a random variable that generates noise in the accumulation process. It represents all further influences on the accumulation process. It is assumed to have a normal distribution with expectation of zero and standard deviation $\sigma _{3g}$ . As a consequence, the misinformation $\text {M}_{g}( t; \gamma _{g}( \theta _3 ) )$ of a test taker with drift rate $\gamma _{g}( \theta _3 )$ in item g is a linear function of the time spent on the item:

(7) $$ \begin{align} \text{M}_{g}\big( t; \gamma_{g}( \theta_3 ) \big) = \gamma_{g}\big( \theta_3 \big) \cdot t = \left( \exp \big( \gamma_{0g} + \gamma_{1g} \theta_3 \big) \cdot t \right) \cdot \exp \big( s_g \big) \text{.} \end{align} $$

The observed response time is the result of a race of the three accumulators $\text {P}_{g} ( t; \alpha _{g}( \theta _1 ) )$ , $\text {D}_{g} ( t; \beta _{g}( \theta _2 ) )$ and $\text {M}_{g} ( t; \gamma _{g}( \theta _3 ) )$ toward accumulator-specific thresholds $C_{1g}$ , $C_{2g}$ , and $C_{3g}$ . The accumulator that first hits its threshold determines the observed response and response time. If the accumulator $\text {P}_{g} ( t; \alpha _{g}( \theta _1 ) )$ representing progress wins, the correct response is given. If the accumulator $\text {M}_{g} ( t; \gamma _{g}( \theta _3 ) )$ representing misinformation wins, an incorrect response is given. If the accumulator $\text {D}_{g} ( t; \beta _{g}( \theta _2 ) )$ representing disengagement wins, the response depends on the version of the model. In Model B1, the response is always incorrect. In Model B2, the test takers guess randomly with a fixed guessing probability of $\pi _g$ that does not depend on the test takers’ capabilities or any accumulator. In Model B3, the test takers make an informed guess. The response is correct when $\text {P}_{g} ( t; \alpha _{g}( \theta _1 ) )> \text {M}_{g} ( t; \gamma _{g}( \theta _3 ) )$ and incorrect when $\text {P}_{g} ( t; \alpha _{g}( \theta _1 ) ) < \text {M}_{g} ( t; \gamma _{g}( \theta _3 ) )$ . Note that the guessing process is determined by the relation of the momentary progress to the momentary misinformation. A test taker selects the response that has the strongest support or that is held most actively in mind. This guessing process is different from the informed guessing process in Model A3, which is independent of alternative response options. The observed response time is always the minimum of the three hitting times.

An example of the assumed response process is given in Figure 1, right plot. In the example, the test taker has the drift rates $\alpha _{g}( \theta _1 )=3.3$ , $\beta _{g}( \theta _2 )=5.0$ , and $\gamma _{g}( \theta _3 )=2.5$ . The line from $0$ to $\text {T}_{\text {S}}$ depicts the progress, the line from $0$ to $\text {T}_{\text {D}}$ the disengagement tendency and the line from $0$ to $\text {T}_{\text {I}}$ the misinformation. The disengagement accumulator hits its threshold $C_{2g}=10$ at the disengagement time $t_{Dg}=2$ and interrupts the response process prematurely. The progress and misinformation of the test taker at this point of time are $\text {P}_{g} ( 2; \alpha _{g}( \theta _1 ) ) = 6.6 $ and $\text {M}_{g} ( 2; \gamma _{g}( \theta _3 ) ) = 5.0 $ , respectively. In the example, the response time is $t_g=2$ . For Model B1, the observed response would be incorrect. For Model B2, the observed response would be the result of a random guess with a fixed guessing probability of, for example, $\pi _g=0.125$ . In Model B3, the response would be correct as the progress is larger than the misinformation when responding.
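The three-accumulator race can be sketched in the same style as the two-accumulator race above. The following illustration implements Model B3, in which an interrupted test taker selects the response option with the stronger momentary support; names and values are again ours:

```python
# Minimal simulation sketch of Model B3 (three accumulators, informed guess).
import numpy as np

rng = np.random.default_rng(2)

def simulate_model_b3(theta1, theta2, theta3, a0, a1, b0, b1, g0, g1,
                      s1, s2, s3, C=10.0):
    """Return one response x (1 = correct) and response time t."""
    alpha = np.exp(a0 + a1 * theta1 + rng.normal(0.0, s1))  # progress
    beta = np.exp(b0 - b1 * theta2 + rng.normal(0.0, s2))   # disengagement
    gamma = np.exp(g0 + g1 * theta3 + rng.normal(0.0, s3))  # misinformation
    t_hit = C / np.array([alpha, beta, gamma])  # accumulator hitting times
    winner = int(np.argmin(t_hit))
    t = float(t_hit[winner])
    if winner == 0:   # progress wins: correct response
        return 1, t
    if winner == 2:   # misinformation wins: incorrect response
        return 0, t
    # Disengagement wins: compare momentary progress and misinformation
    # at time t and select the option with the stronger support.
    return int(alpha * t > gamma * t), t
```

Models B1 and B2 differ only in the branch for the disengagement accumulator, which returns an incorrect response or a random guess, respectively.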

The distributional assumptions about the residuals determine the joint distribution of the response and response time in an item. In order to save space, we just sketch how the density functions can be derived. The response time is determined by the hitting times of each accumulator as the minimum of three independent log-normally distributed random variables. Its distribution follows from standard results on order statistics. The probability of a response is determined as in Models A1–A3 in case the accumulator representing progress or misinformation wins. The case in which the disengagement accumulator wins has to be treated differently. In Model B1, the response is directly determined by the winning accumulator. In Model B2, the response is determined by a guessing process with a fixed guessing probability that does not depend on the response time or the remaining accumulators. In Model B3, the response is determined by the accumulator with the higher level. The progress and misinformation at time t, given that both quantities are below their thresholds, are distributed according to two independent truncated log-normal distributions. The guessing probability is thus the probability that a truncated log-normally distributed random variable exceeds another one. The density function of the response time $t_g$ is then divided between the correct and the incorrect response according to the potential values of the guessing probability. The implied subdensities of Models B1–B3 can be found in Section S1 of the Supplementary Material.

2.3 The joint distribution of the responses and response times in a test

Models A1–A3 and B1–B3 specify the distribution of the response and response time in a single item conditional on the trait levels of a test taker. We denote the subdensity function of this distribution as $\text {f}_{X_g,T_g|\boldsymbol {\theta }} ( x,t|\boldsymbol {\theta } )$ ; see Section S1 of the Supplementary Material for more details. The vector $\boldsymbol {\theta }$ represents the latent traits of a test taker, which are $\boldsymbol {\theta }=(\theta _1,\theta _2)$ in Model A and $\boldsymbol {\theta }=(\theta _1,\theta _2,\theta _3)$ in Model B. Item parameters are not mentioned explicitly. In order to derive the joint distribution of the responses $\boldsymbol {x}=(x_1, \ldots ,x_G)$ and response times $\boldsymbol {t}=(t_1, \ldots ,t_G)$ in a test of G items, we assume conditional independence. This is the standard assumption in item response theory (Embretson & Reise, Reference Embretson and Reise2000). Conditional independence means that responses and response times in different items are independent when conditioning on the latent traits. The joint distribution of the responses and response times $\text {f}_{ \boldsymbol {X},\boldsymbol {T} |\boldsymbol {\theta }}( \boldsymbol {x},\boldsymbol {t} |\boldsymbol {\theta })$ then factors into the product of the item-specific subdensities:

(8) $$ \begin{align} \text{f}_{ \boldsymbol{X},\boldsymbol{T} |\boldsymbol{\theta}}( \boldsymbol{x},\boldsymbol{t} |\boldsymbol{\theta}) = \prod_{g=1}^{G} \text{f}_{X_g,T_g|\boldsymbol{\theta}}(x_g,t_g|\boldsymbol{\theta}) \text{.} \end{align} $$

Equation (8) is the distribution of the responses and response times conditional on specific values of the latent traits. The marginal distribution is the distribution of the responses and response times when test takers are sampled randomly. The marginal distribution $\text {f}_{ \boldsymbol {X},\boldsymbol {T}}( \boldsymbol {x},\boldsymbol {t} )$ follows from the conditional distribution $\text {f}_{ \boldsymbol {X},\boldsymbol {T}|\boldsymbol {\theta }}( \boldsymbol {x},\boldsymbol {t} |\boldsymbol {\theta })$ when integrating over the distribution of the latent traits $\text {f}_{\boldsymbol {\theta }}(\boldsymbol {\theta })$ in the population of potential test takers the actual test takers were sampled from. In item response theory, it is standard practice to assume that the latent traits are standardized multivariate normally distributed random variables with unrestricted correlation matrix $\boldsymbol {R}$ . The normal distribution is unimodal and symmetric and is for this reason considered an adequate representation of the distribution of human characteristics (Sartori, Reference Sartori2006). In low-stakes tests, however, there is sometimes an imbalance between test takers with low and high persistence. This requires skewed distributions. One option in this case is to use a mixture of two multivariate normal distributions (McLachlan & Peel, Reference McLachlan and Peel2000). Irrespective of the distribution one deems most adequate, the scale of the latent traits has to be fixed. We here consider standardization of the latent variables, but other identification restrictions are possible as well.
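Computationally, Equation (8) simply accumulates item-level log-subdensities. As a sketch for Model A1, reusing the illustrative subdensity function from above:

```python
# Conditional log-likelihood of one response/response-time pattern
# under Equation (8) (illustrative sketch for Model A1).
import numpy as np

def loglik_conditional(x, t, theta, items):
    """log f(x, t | theta) for G items; items is a list of parameter dicts."""
    theta1, theta2 = theta
    ll = 0.0
    for xg, tg, pars in zip(x, t, items):
        ll += np.log(subdensity_a1(tg, xg, theta1, theta2, **pars))
    return ll
```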

3 Strengths, limitations, and relations to alternative models

The proposed models make precise assumptions about the data generating process. In the following, we discuss these assumptions. We also compare the models to models that have been proposed before. Specifically, we address the following aspects: interpretation of accumulators, ballistic accumulation, linear accumulation, conditional independence of accumulators, response-specific accumulators, nature of latent traits, and informed guessing.

3.1 Interpretation of accumulators

Process models in general and the proposed variants of the LBA model in specific are models of the response process at a highly abstract level (Wagenmakers, Reference Wagenmakers2009). They are general models which represent the progress or the accumulation of misinformation on a continuum. The accumulators are not directly related to possible substages of the response process, knowledge states, or the neural substrate underlying problem solving. This is different from cognitive diagnostic modeling or knowledge structure analysis where detailed assumptions about the tasks and their demands are made (Heller, Reference Heller2023; Noventa et al., Reference Noventa, Heller and Kelava2024). Abstract models like the ones proposed in this article have the advantage of being generally applicable as no theory about the tasks is necessary.

3.2 Ballistic accumulation

In the proposed models, accumulation is ballistic as random variation impacts the whole path of the accumulator in the same way. The progress at a specific point of time, for example, is modeled as $\text {P}_{g} ( t; \alpha _{g}( \theta _1 ) ) = \left ( \exp ( \alpha _{0g} + \alpha _{1g} \theta _1 ) \cdot t \right ) \cdot \exp ( e_g ) $ . The overall effect of all random noise in information accumulation is summarized in the residual $ \exp ( e_g )$ . This is different from the diffusion model (Ratcliff & McKoon, Reference Ratcliff and McKoon2008) where the product of drift rate and time is perturbed at any point in time by momentary fluctuations. The diffusion model, however, was originally proposed for simple perceptual decisions where the random fluctuations were supposed to account for spontaneous neural activity. Spontaneous fluctuations are less plausible in more complex tasks. We consider ballistic accumulation a reasonable simplification of the real progression of information accumulation or disengagement. Modeling momentary fluctuations would increase the model’s complexity disproportionately.

3.3 Linear accumulation

In the proposed models, we assume a linear increase of progress, disengagement, and misinformation over time. Accumulators are always increasing as the drift rates are restricted to be positive. This implies that each accumulator will eventually hit its threshold. While this is realistic for disengagement, it might be problematic for the accumulator representing progress as some test takers might be incapable of solving an item even in infinite time. Replacing the log-link in Equation (1) with an alternative link (e.g., an identity link) would resolve this issue. The increase of the accumulators over time is modeled by a linear function. This is similar to the diffusion model (Ratcliff & McKoon, Reference Ratcliff and McKoon2008; van der Maas et al., Reference van der Maas, Molenaar, Maris, Kievit and Borsboom2011) where the expected preference formation is also a linear function of time. Linear accumulation is a strong assumption that excludes acceleration or deceleration. The assumption of linearity is avoided in the race model of Ranger & Kuhn (Reference Ranger and Kuhn2014), where progress and disengagement accumulation are modeled by splines. While this might be more realistic, it complicates the extension to partial guessing. Recently, Much et al. (Reference Much, Ranger, Mutak, Krause and Pohl2022) proposed an accumulation model where progress increases monotonically to an upper asymptote. Test takers stop accumulating in case the growth rate drops below a critical threshold. This model avoids the linearity assumption, but cannot be interpreted directly in terms of persistence.

3.4 Conditional independence of accumulators

In the present models, the processes of information, misinformation, and disengagement accumulation are related only unconditionally, via a possible interdependence of the latent traits underlying the accumulators. The residual terms $e_g$ , $r_g$ , and $s_g$ are assumed to be independent. This implies that the typical paths $\alpha _{g} ( \theta _1 ) \cdot t$ , $\beta _{g} ( \theta _2 ) \cdot t$ , and $\gamma _{g} ( \theta _3 ) \cdot t$ might be related via the correlations of the latent traits, but not the specific deviations. Untypically slow progress in an item that is not explained by a person’s capability therefore has no impact on the development of disengagement. The models thus separate the solution process from the process of motivation, unlike models where both processes are interrelated (e.g., Cisek et al., Reference Cisek, Puskas and El-Murr2009; Much et al., Reference Much, Ranger, Mutak, Krause and Pohl2022). Although the assumption made in our models is strong, it is weaker than the assumptions that were made in the cure mixture model of Lee and Ying (Reference Lee and Ying2015). In their model, the solution process is interrupted by a completely independent censoring time. The assumption of conditional independence in the proposed models could be relaxed as in the model of Ranger et al. (Reference Ranger, Kuhn and Gaviria2015) where two competing accumulators are related by a bivariate normal copula. This model, however, was not intended to model persistence. Our model is in line with the motivation of the hierarchical modeling framework for responses and response times (van der Linden, Reference van der Linden2007). In the hierarchical modeling framework, single components of the response process are only related on a higher level through the latent traits. This is justified by the assumption that test takers form their test-taking motivation before taking a test and do not change it while working.

3.5 Response specific accumulators

In the proposed models, we score responses as either correct or incorrect. Different incorrect responses, e.g., by different distractor choices in a single-choice test, are not distinguished. The fusion of different types of wrong responses into an overall category is not unusual in psychometric process models. It is, for example, also made in the psychometric extensions of the diffusion model by van der Maas et al. (Reference van der Maas, Molenaar, Maris, Kievit and Borsboom2011). The assumption can be justified by further assumptions. Given that there are several accumulators for incorrect responses, only the incorrect accumulator with the minimal hitting time is relevant for the response and response time. Representing several accumulators by one requires that the minimum of the hitting times of all incorrect accumulators is log-normally distributed and depends on just one latent trait. This implies that the original distributions of all incorrect accumulators were such that the minimum of possibly correlated random draws is log-normally distributed. If this assumption appears too strong, it is possible to introduce accumulators for each specific wrong response as was done by Bunji and Okada (Reference Bunji and Okada2022) in their model for personality tests; see also Hawkins and Heathcote (Reference Hawkins and Heathcote2021). Assuming response-specific accumulators, however, is problematic in achievement tests as even in single-choice items, it is not known which response options are actually considered by a test taker (Vigneau et al., Reference Vigneau, Caissie and Bors2006).

3.6 Nature of latent traits

In Models B1–B3, the performance in a test is related to three latent traits, one representing the persistence of a test taker and the other two his/her capability and error-proneness. Capability and error-proneness are conceived as two distinct, but possibly related quantities. This is similar to the race model of Ranger et al. (Reference Ranger, Kuhn and Gaviria2015), but differs, for example, from the diffusion model where the two traits are interwoven. Large correlations between the traits could be interpreted in terms of a higher order factor model (parallel measures assumed). Alternatively, as in Ranger et al. (Reference Ranger, Kuhn and Gaviria2015), one could reparametrize capability and error-proneness as $\theta _1 = \theta _1^* + \theta _3^*$ and $\theta _3 = \theta _1^* - \theta _3^*$ . Here, the first trait $\theta _1^*$ denotes the overall speed of a test taker and $\theta _3^*$ the preference of one response option over the other. The interpretation of $\theta _3^*$ would then be similar to the interpretation of the drift rate in the diffusion model.

The trait that represents the persistence of a test taker is assumed to be constant over the test. We assume that the effort of a test taker is determined by the value of the task, which does not change during test taking. This is similar to the mixture model of Meyer (Reference Meyer2010) where test takers are characterized by a fixed engagement level. In this aspect, our model differs from models that assume systematic changes in performance during test taking (e.g., Fox & Marianti, Reference Fox and Marianti2016; Mutak et al., Reference Mutak, Krause, Ulitzsch, Much, Ranger and Pohl2024; Ranger et al., Reference Ranger, Wolgast, Much, Mutak, Krause and Pohl2023), switches between engagement levels during the test (e.g., Molenaar et al., Reference Molenaar, Bolsinova and Vermunt2018; Ulitzsch et al., Reference Ulitzsch, von Davier and Pohl2022), or changes in persistence when test takers run out of time (e.g., Wang & Xu, Reference Wang and Xu2015). Note, however, that some change is possible in our model. General declines in persistence or item position effects can be absorbed in the intercept parameter $\beta _{0g}$ , which is item dependent. Changes in the time investment from item to item are generated by the random variation of the disengagement accumulator. The relation of the discrimination parameter $\beta _{1g}$ to the residual variance $\sigma ^2_{2g}$ determines the variation of the invested time within the test and its correlation between items. This allows for a purely random process in each item when the discrimination coefficient is zero and an almost deterministic process when the discrimination coefficient is very high. In this aspect, the proposed models are more flexible than the mixture models for test engagement.

3.7 Informed guessing

In Models A3 and B3, the probability to solve an item by guessing depends on the progress that was made by a test taker until disengagement. This implies that test takers with higher levels of capability or longer processing times have a higher probability to guess correctly. This is in contrast to guessing in the mixture models for rapid guessing or to the conception of guessing in the three-parameter logit model (von Davier, Reference von Davier2009). In these models, test takers guess on a random basis with the same guessing probability, irrespective of their ability level or the time they spent on an item. Random guesses are considered in our models as informed guesses on a very low information level, which is a result of a very short processing time or a low level of capability. Models A3 and B3 also differ from item response models with ability-based guessing. In models for ability-based guessing, the guessing probability either directly depends on a test taker’s ability (San Martin et al., Reference San Martin, del Pino and De Boeck2006) or on a second latent trait that could, for example, reflect test-wiseness (DeMars, Reference DeMars2016). In the proposed models, the dependency of the guessing probability on the capability of a test taker is mediated through the actual progress that was made. This implies that the solution probability is also not a direct function of time as, for example, in Bolsinova and Molenaar (Reference Bolsinova and Molenaar2018) or Wang and Hanson (Reference Wang and Hanson2005). In item response theory, partial knowledge has also been modeled with the Nedelsky model (Bechger et al., Reference Bechger, Maris, Verstralen, Verhelst, van der Ark, Croon and Sijtsma2005) where partial knowledge is used for excluding distractors in single-choice response sets. The response is then selected among the remaining distractors randomly. Such a form of guessing could be integrated in the proposed model if specific levels of progress were associated with specific subsets of distractors. The linear dependency of the guessing probability (or its logit) on the actual progress in Model A3 is a strong assumption. While it is plausible that the attraction of the correct response should increase with the knowledge level, the linearity of the relation cannot be justified from the model alone. A justification would require a precise model for the knowledge structure assessed by the items. This is typically not available. In Model B3, the guessing probability depends on a comparison of the evidence accumulated for the correct and the incorrect response. This guessing process has some resemblance to guessing in the diffusion model (Ratcliff, Reference Ratcliff1988).

4 Parameter estimation

The item parameters of the model are not uniquely identified as in a latent accumulation process, the scale of the accumulators is arbitrary; note that the first hitting times are not affected when the drift rate and the threshold of an accumulator are multiplied by the same constant. This is not specific to the proposed variants of the LBA model, but also occurs in the graded response model or the diffusion model. In order to identify the model, one parameter must be fixed (Brown & Heathcote, Reference Brown and Heathcote2008). Here, we set all thresholds to an arbitrary constant. As this is an identification restriction, it sets the scale for the accumulators and does not have any consequences on model fit.
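Formally, the indeterminacy arises because a first hitting time depends only on the ratio of threshold and drift rate, which is invariant under a common rescaling:

$$ \begin{align*} t_{Sg} = \frac{C_{1g}}{\alpha_{g}( \theta_1 )} = \frac{k \, C_{1g}}{k \, \alpha_{g}( \theta_1 )} \quad \text{for any } k>0 \text{.} \end{align*} $$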

Item parameters can be estimated by marginal maximum likelihood estimation. This requires assumptions about the distribution of the latent traits. In the following, we assume that the latent traits are multivariate normally distributed with unrestricted correlation matrix $\boldsymbol {R}$ , although other distributions would be possible. The item parameters of the model are referred to as $\boldsymbol {\delta }=( \boldsymbol {\delta }_1, \ldots , \boldsymbol {\delta }_G )$ . Here, $\boldsymbol {\delta }_g$ represents the parameters in the g-th item, that is $\boldsymbol {\delta }_g = (\alpha _{0g},\alpha _{1g},\sigma _{1g},\beta _{0g},\beta _{1g},\sigma _{2g},\pi _g)$ in Model A and $\boldsymbol {\delta }_g = ( \alpha _{0g},\alpha _{1g},\sigma _{1g},\beta _{0g},\beta _{1g},\sigma _{2g},\gamma _{0g},\gamma _{1g},\sigma _{3g},\pi _g)$ in Model B. The marginal log-likelihood function is a function of the model parameters. It is defined as:

(9) $$ \begin{align} \text{LL}(\boldsymbol{\delta},\boldsymbol{R}; \boldsymbol{X},\boldsymbol{T}) = \sum_{n=1}^{N} \log \left[ \int \text{f}_{ \boldsymbol{X},\boldsymbol{T}|\boldsymbol{\theta}}( \boldsymbol{x}_n,\boldsymbol{t}_n |\boldsymbol{\theta};\boldsymbol{\delta}) \text{f}_{\boldsymbol{\theta}} ( \boldsymbol{\theta};\boldsymbol{R} ) \mathrm{d} \boldsymbol{\theta} \right] \text{.} \end{align} $$

In Equation (9), the matrices $\boldsymbol {X}$ and $\boldsymbol {T}$ contain the response patterns $\boldsymbol {x}_{n}$ and response time patterns $\boldsymbol {t}_{n}$ of $n=1,\ldots ,N$ test takers. Function $\text {f}_{ \boldsymbol {X},\boldsymbol {T}|\boldsymbol {\theta }} ( \boldsymbol {x},\boldsymbol {t} | \boldsymbol {\theta };\boldsymbol {\delta } )$ is the subdensity of the conditional joint distribution of the responses and response times (Equation (8)) with item parameters mentioned explicitly. Function $\text {f}_{\boldsymbol{\theta}} ( \boldsymbol {\theta };\boldsymbol {R} )$ is the density function of the standardized multivariate normal distribution with correlation matrix $\boldsymbol {R}$ . In marginal maximum likelihood estimation, one determines those parameter values that maximize the marginal likelihood function. This includes the item parameters and the elements of the correlation matrix. As the marginal maximum likelihood estimator is a standard maximum likelihood estimator on the basis of the marginal distribution, it has the same properties with respect to consistency, bias, and asymptotic distribution (Berger et al., Reference Berger, Liseo and Wolpert1999).
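A schematic implementation of Equation (9) may help to fix ideas. The sketch below approximates the integral by Monte Carlo integration over the trait distribution; it treats $\boldsymbol{R}$ as given, whereas a full implementation would estimate its elements jointly with the item parameters, and unpack_items is a hypothetical helper that maps the parameter vector $\boldsymbol{\delta}$ to item-wise parameter dictionaries.

```python
# Monte Carlo sketch of the marginal log-likelihood of Equation (9),
# here for Model A1; unpack_items is a hypothetical helper.
import numpy as np
from scipy.optimize import minimize

def marginal_loglik(delta, X, T, R, n_draws=1000, seed=0):
    rng = np.random.default_rng(seed)
    items = unpack_items(delta)  # hypothetical: delta -> per-item dicts
    draws = rng.multivariate_normal(np.zeros(2), R, size=n_draws)
    ll = 0.0
    for x_n, t_n in zip(X, T):
        # Average the conditional likelihood over sampled trait vectors.
        vals = [np.exp(loglik_conditional(x_n, t_n, th, items))
                for th in draws]
        ll += np.log(np.mean(vals))
    return ll

# Estimation: minimize the negative marginal log-likelihood, e.g.,
# minimize(lambda d: -marginal_loglik(d, X, T, R), delta_start,
#          method="Nelder-Mead")
```

In practice, Gauss–Hermite quadrature or an EM algorithm would typically replace the plain Monte Carlo average, but the structure of the computation is the same.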

5 Simulation study

We conducted two simulation studies in order to investigate whether and under what conditions the model parameters and the latent traits can be estimated well. The first simulation study addressed the recovery of the item parameters, the second the recovery of the latent traits. Due to space restrictions, we only give a limited overview of the results; see Section S2 of the Supplementary Material for further results on parameter recovery and Section S3 of the Supplementary Material for further results on trait estimation.

5.1 Parameter recovery

In the simulation study on parameter recovery, we generated data according to the proposed models for a test of $G=24$ items. For Models A1–A3, the item parameter values were set to $\alpha _{0g} \in \{ 0.6,0.7 \}$ , $\alpha _{1g} \in \{ 0.2,0.3 \}$ , $\sigma _{1g} \in \{ 0.5 \}$ , $\beta _{0g} \in \{ 0.7,0.9,1.1 \}$ , $\beta _{1g} \in \{ 0.2,0.3 \}$ , and $\sigma _{2g} \in \{ 0.5 \}$ . For Models B1–B3, the item parameter values were set to $\alpha _{0g} \in \{ 0.6,0.7 \}$ , $\alpha _{1g} \in \{ 0.2,0.3 \}$ , $\sigma _{1g} \in \{ 0.5 \}$ , $\beta _{0g} \in \{ 0.4 \}$ , $\beta _{1g} \in \{ 0.4 \}$ , $\sigma _{2g} \in \{ 0.3 \}$ , $\gamma _{0g} \in \{ 0.7,0.9,1.1 \}$ , $\gamma _{1g} \in \{ 0.2,0.3 \}$ , and $\sigma _{3g} \in \{ 0.5 \}$ . The guessing probability was set to $\pi _g=0.125$ in Models A2, A3, and B2. By fully crossing the item parameters, we determined the values for the $G=24$ items. The values of the item parameters were within the range of values we obtained for the items of an intelligence test. Rather than using real items, we created synthetic ones in order to systematically cross high and low parameter values. The thresholds $C_{1g}$ , $C_{2g}$ , and $C_{3g}$ were set to $10$ and considered as known. Fixing the thresholds to a specific value is necessary in order to identify the model. The latent traits were assumed to be multivariate normally distributed with expected values of zero and variances of one. The correlation coefficients were set to $\rho _{12}=-0.4$ in Models A1–A3 and to $\rho _{12}=-0.3$ , $\rho _{13}=0.4$ , and $\rho _{23}=-0.2$ in Models B1–B3. We assumed correlated latent traits as there is evidence that capability and persistence are correlated (Zhang et al., Reference Zhang, Wetzel, Yoon and Roberts2024).
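For illustration, the fully crossed design for Models A1–A3 can be generated in a few lines ($2 \times 2 \times 3 \times 2 = 24$ combinations); this is a sketch with illustrative names, with the constant $\sigma$ values attached:

```python
# Fully crossed item parameter grid for Models A1-A3 (illustrative sketch).
from itertools import product

grid = [dict(a0=a0, a1=a1, s1=0.5, b0=b0, b1=b1, s2=0.5)
        for a0, a1, b0, b1 in product([0.6, 0.7], [0.2, 0.3],
                                      [0.7, 0.9, 1.1], [0.2, 0.3])]
assert len(grid) == 24  # 2 x 2 x 3 x 2 item parameter combinations
```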

We generated simulation samples for all models: 250 data sets for a sample size of 250 test takers and 250 data sets for a sample size of 1,000 test takers. A description of how the data sets were generated can be found in Section S2 of the Supplementary Material. In the simulation samples, the average response times were between 3 and 4 time units, and the solution frequencies were between $0.21$ and $0.84$ across items. Across the items and the different models, the correlations of the response times ranged from 0.04 to 0.32, the correlations of the responses from $-0.06$ to 0.18, and the correlations of the responses with the response times from $-0.13$ to $0.26$. Such values should be representative of a typical achievement test.
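For illustration, the sketch below simulates a single race for a two-accumulator variant (Model A1). It is only a sketch under explicit assumptions: we assume log-normally distributed drift rates with trait-linear locations (which keeps drifts positive) and the sign convention that higher $\theta_2$ speeds up disengagement; the actual drift distribution and parameterization are defined in the model section of the article.

```python
import numpy as np

def simulate_response(rng, theta1, theta2, item, C=10.0):
    """One race between the solution and the disengagement accumulator.

    Assumptions (ours): drift rates are log-normal with trait-linear
    locations, and a linear ballistic accumulator with threshold C and
    drift v finishes at time C / v. Higher theta2 here speeds up
    disengagement, i.e., theta2 is coded as a disengagement tendency.
    """
    v_solve = np.exp(item["alpha0"] + item["alpha1"] * theta1
                     + item["sigma1"] * rng.standard_normal())
    v_quit = np.exp(item["beta0"] + item["beta1"] * theta2
                    + item["sigma2"] * rng.standard_normal())
    t_solve, t_quit = C / v_solve, C / v_quit
    if t_solve < t_quit:
        return 1, t_solve  # solution accumulator wins: correct response
    return 0, t_quit       # disengagement wins: incorrect response (Model A1)

rng = np.random.default_rng(1)
item = dict(alpha0=0.6, alpha1=0.2, sigma1=0.5, beta0=0.7, beta1=0.2, sigma2=0.5)
x, t = simulate_response(rng, theta1=0.5, theta2=-0.5, item=item)
```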

Having generated the data sets, we estimated the item parameters with marginal maximum likelihood estimation. Details on the implementation can be found in Section S2 of the Supplementary Material. Results on parameter recovery are summarized in Table 2 for Models A2 and A3 and in Table 3 for Models B2 and B3. Results for Models A1 and B1 were very similar and are therefore reported in Section S2 of the Supplementary Material (Tables S2.1–S2.4) to save space. In Tables 2 and 3, we report the true value (TV), the average estimate (M), and the standard error of estimation (SE). We also report the coverage frequency (CI) of confidence intervals for a confidence level of $0.95$ ($\alpha =0.05$). In both tables, the results have been averaged over parameters with the same value.

Table 2 True value (TV), average estimate (M), standard error of estimation (SE), and coverage frequency (CI) of confidence intervals with confidence level C=0.95 $(\alpha =0.05)$ of the item parameters of the linear ballistic accumulator Model A for different sample sizes and variants of the model

Note: Results for parameters have been averaged over the items with the same parameter values; for an overview of the different models, see Table 1.

Table 3 True value (TV), average estimate (M), standard error of estimation (SE), and coverage frequency (CI) of confidence intervals with confidence level C=0.95 $(\alpha =0.05)$ of the item parameters of the linear ballistic accumulator Model B for different sample sizes and variants of the model

Note: Results for parameters have been averaged over the items with the same parameter values; for an overview of the different models, see Table 1.

In the variants of Model A, the marginal maximum likelihood estimator performs well. The parameter estimates are virtually unbiased, irrespective of the sample size. In the condition with 1,000 subjects, the coverage frequencies of the confidence intervals are close to the intended level of $0.95$; the only exceptions are the coverage frequencies for $\alpha_1$ and $\pi$ in Model A3, which are slightly too low. In the condition with 250 subjects, the coverage frequencies are slightly too low for all parameters. In the variants of Model B, the marginal maximum likelihood estimator performs less well. The parameters of the second accumulator are biased, and the coverage frequencies of the confidence intervals fall below the intended level of $0.95$. This was due partly to the bias of the estimator, partly to inaccurate estimates of its standard errors, and partly to deviations from the normal distribution, all of which were caused by a small number of data sets in which the estimates were far from the true values.

In general, the simulation study suggests that the item parameters can be estimated well with samples of at least 1,000 test takers. In some cases, however, there seem to be convergence issues. In practice, it is advisable to check for problems indicated by extreme parameter estimates or large standard errors of estimation. Model simplifications, such as restricting several model parameters to a common value, might help in such cases.
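Such a screening can be automated with a few lines; the thresholds below are illustrative placeholders, not values from the study:

```python
import numpy as np

def flag_suspect_items(estimates, std_errors, est_bound=10.0, se_bound=5.0):
    """Return indices of parameters with implausibly extreme estimates or
    very large standard errors; the bounds are illustrative and should be
    adapted to the scale of the parameters."""
    estimates = np.asarray(estimates)
    std_errors = np.asarray(std_errors)
    return np.flatnonzero((np.abs(estimates) > est_bound) | (std_errors > se_bound))
```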

5.2 Recovery of latent traits

In a second simulation study, we investigated with what accuracy the latent traits of test takers can be inferred from their response and response time patterns. We defined fictitious test takers by fully crossing fixed levels of the traits ($\theta_1,\theta_2,\theta_3 \in \{-2.0,-1.5,\ldots,1.5,2.0 \}$). We used fixed trait levels in order to study whether the maximum likelihood estimator is conditionally unbiased. For each test taker, we generated 250 (Model A) or 50 (Model B) response and response time patterns for a test of 24 items. Responses and response time patterns were generated as in the simulation study on parameter recovery. We then estimated the latent traits from the response and response time patterns by maximum likelihood estimation and determined Wald-type confidence intervals for a confidence level of 0.95 ($\alpha =0.05$); for more details on the implementation, see Section S3 of the Supplementary Material. We repeated the simulation study for a test of $48$ items by simply doubling the test. A detailed description of the results (e.g., average estimate, median estimate, standard error of estimation, coverage frequencies of the confidence intervals) can be found in Section S3 of the Supplementary Material. Here, we only summarize the general findings in order to save space.
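A hedged sketch of this person-parameter step: maximize the conditional log-likelihood of a single response and response time pattern and form Wald-type intervals from the approximate inverse Hessian returned by the optimizer. The function `neg_loglik` is a hypothetical placeholder for the model-specific negative conditional log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize

def estimate_traits(x_n, t_n, delta, neg_loglik, theta_dim=2, z=1.96):
    """Maximum likelihood trait estimation for one test taker with Wald-type
    confidence intervals (z = 1.96 for a confidence level of 0.95)."""
    res = minimize(neg_loglik, x0=np.zeros(theta_dim), args=(x_n, t_n, delta))
    se = np.sqrt(np.diag(res.hess_inv))  # BFGS approximation of the inverse Hessian
    return res.x, res.x - z * se, res.x + z * se
```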

In all variants of Model A, the coverage frequencies of the confidence intervals were near the intended level of 0.95. The trait estimates were virtually unbiased in Models A1 and A3. In Model A2, there was a small bias at low levels of capability. The standard error of estimation was generally higher in Model A2 than in Models A1 and A3, and it was larger at low trait levels than at high trait levels. In the variants of Model B, the recovery of the traits was generally worse than in the variants of Model A. In all variants of Model B, the coverage frequencies of the confidence intervals were near the intended level of 0.95 for capability and error-proneness. For persistence, the coverage frequencies were between $0.71$ and $0.94$ and thus below the intended level of $0.95$. Estimates of capability had a negligible or small bias in all models. The estimates of error-proneness were biased at low levels of the trait, although the median of the estimates was close to the true value. Average as well as median estimates of persistence deviated from the true values in all models when persistence was high. The standard error of estimation depended on the model and on the level of the trait to be estimated. Standard errors were higher in Model B2 than in Models B1 and B3. Low values of capability and error-proneness and high values of persistence were generally estimated with little precision.

In summary, whether trait levels can be estimated well depends on the level of the trait and on the model. Estimation is generally better in versions of Model A than in versions of Model B. Low values of capability and error-proneness as well as high values of persistence are estimated with a high standard error of estimation. This is due to the race process: whenever one accumulator dominates the others, there is little direct information about the other accumulators; all that is known is that they would have finished later. This has implications for psychological assessment. In test takers with high persistence, one cannot estimate persistence well, although one can estimate capability; conversely, in test takers with low persistence, one can estimate persistence well, but not capability. This indicates the need for an adaptive test in which the time demand of an item is adjusted to the typical time investment of an individual. This, however, has limits, as one cannot infer the capability of a test taker when items are not processed at all. Adaptive tests, however, are only needed in diagnostic assessment. In large-scale assessment, one is typically not interested in single test takers but in aggregates, which does not require precise estimates on the individual level.

6 Empirical example

We applied the model to data from a matrix reasoning test collected by Myszkowski et al. (2022). The data consisted of the responses and response times of $555$ test takers on a progressive matrix reasoning test. The test comprised 17 items that were generated with the IMak R package (Blum & Holling, 2018). Each item consisted of a matrix of figures, and test takers had to deduce the rule by which the figures were created. In each item, eight response options were given, one of which was the correct response, and test takers had to indicate the correct option. Responses were scored as either correct or incorrect. Data were collected online in a low-stakes setting. The data set as well as further test information are available in the repository of the original study at https://osf.io/uge2w. We chose this data set for our empirical analysis as it was collected in a low-stakes setting, such that individual differences in test taking effort are to be expected.

Before analyzing the data, we removed outliers. Some test takers had very long response times that may indicate interruptions caused by pausing or distraction, which cannot be avoided in online assessments. We defined as an outlier any response time more than 2.5 interquartile ranges above the upper quartile. We did not remove unusually short response times, as we consider these regular data. Removing outliers reduced the sample size from 555 to 543. Descriptive statistics for the distribution of the responses and the response times in the items of the test are given in Table 4.
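The outlier rule can be stated compactly in code; whether the rule is applied per item or to the pooled response times is our assumption here, not a detail reported above:

```python
import numpy as np

def rt_outlier_cutoff(times, k=2.5):
    """Upper cutoff: k interquartile ranges above the upper quartile."""
    q1, q3 = np.percentile(times, [25, 75])
    return q3 + k * (q3 - q1)

# Example for an N x G matrix T of response times, applied per item:
# cutoffs = np.apply_along_axis(rt_outlier_cutoff, 0, T)  # one cutoff per item
# keep = ~np.any(T > cutoffs, axis=1)                     # drop flagged test takers
```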

Table 4 Average solution probability ( $M_X$ ) and range over items, average response time ( $M_T$ ) and range over items as well as standard deviation of response time ( $S_T$ ) and range over items in the IMak test

We first evaluated the pacing of the test takers. We correlated the response times of each test taker with the average time demand of the items, which resulted in 543 correlation coefficients. A high correlation implies that a response time pattern is regular in the sense that the test taker spends more time on the more time-demanding items. The upper plot of Figure 2 shows the distributions of the correlation coefficients for the different score groups; note that the test score ranges from $0$ to $17$. The plot suggests that test takers with a low test score behave unusually, as there is no relation between their pacing and the time demand of the items. This was corroborated by an inspection of the average time the test takers spent on the items. The lower plot of Figure 2 visualizes the average response times of the test takers in a score group for the items of the test. Each line connects the average response times of one score group over the items. Items are sorted according to their difficulty in decreasing order. Score groups with scores of $0$ to $4$ are highlighted in red. In most score groups, the average response times increase with the difficulty of the item. In the low score groups, however, the response times do not depend on the item difficulty. The figure also suggests that the item difficulty has the strongest influence on the response time, whereas item position effects appear negligible.
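The pacing index is a plain Pearson correlation computed per test taker. A vectorized sketch for an $N \times G$ matrix of response times (using the item means as the time-demand measure, as described above):

```python
import numpy as np

def pacing_correlations(T):
    """Correlate each row (test taker) of an N x G response time matrix
    with the column means (the average time demand of the items)."""
    demand = T.mean(axis=0)
    demand = demand - demand.mean()           # centered item time demand
    rows = T - T.mean(axis=1, keepdims=True)  # center each test taker's times
    num = rows @ demand
    den = np.sqrt((rows ** 2).sum(axis=1) * (demand ** 2).sum())
    return num / den                          # one Pearson correlation per person
```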

Figure 2 Box-plot of the correlations of the test takers’ response times with the time demand of an item given for the different score groups (upper plot) and average time on task on the items for different score groups (lower plot). Note: The solution probability of an item is indicated by p. Groups with a score of 0–4 are highlighted.

We fit all models considered in the manuscript to the IMak data by marginal maximum likelihood estimation. All models were implemented in two versions. In the first version, we assumed that all traits are multivariate normally distributed. However, given that the descriptive results suggested a high proportion of test takers with low persistence, we also implemented a version in which $\theta_2$ was distributed according to a mixture of two normal distributions. This more flexible distribution allows for an imbalance toward low levels of persistence that cannot be represented by the symmetric normal distribution. We did not use mixtures for the other traits, as cognitive traits are supposed to be normally distributed and there was no evidence in the data against this assumption. We denote the versions with multivariate normally distributed latent traits as Models A1N to B3N and the versions with the mixture distribution as Models A1M to B3M. For the sake of comparison, we also fit the hierarchical model of van der Linden (2007) to the data. This model was also implemented in two versions, using the multivariate normal distribution or a mixture of two multivariate normal distributions for the latent traits. For the hierarchical model, we used mixtures for both traits, as the model does not allow for a separation between capability and persistence.

Information on relative model fit is given in Table 5, where we report the values of the marginal log-likelihood function at the parameter estimates (LL), the difference between the marginal log-likelihood of the respective model and that of the model with the highest marginal log-likelihood ($\Delta$LL), the values of the Akaike information criterion (AIC) and the Bayesian information criterion (BIC; Burnham & Anderson, 2013), as well as the differences between the information criteria of the respective model and those of the best fitting model ($\Delta$AIC and $\Delta$BIC).
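For reference, the indices in Table 5 follow the usual definitions; a minimal sketch (with the number of test takers as the sample size entering the BIC):

```python
import numpy as np

def information_criteria(loglik, n_params, n_obs):
    """AIC and BIC from the maximized marginal log-likelihood."""
    aic = -2.0 * loglik + 2.0 * n_params
    bic = -2.0 * loglik + n_params * np.log(n_obs)
    return aic, bic

# Delta indices relative to the best-fitting model, as in Table 5:
# aic = np.array([...]); delta_aic = aic - aic.min()
```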

Table 5 Value of the marginal log-likelihood function (LL), number of parameters (NP), difference of the marginal log-likelihood function ($\Delta $LL), AIC-index (AIC), difference of the AIC-index ($\Delta $AIC), BIC-index (BIC), and difference of the BIC-index ($\Delta $BIC) for the six accumulator models (A1–B3) and the hierarchical model (VLM), each in a version with the normal distribution and a version with the mixture distribution

Note: Differences are with respect to the best-fitting model, i.e., the model with the lowest AIC. Model variants are indicated as in Table 1. N denotes the normal distribution of the latent traits; M denotes the mixture distribution.

The versions of the hierarchical model of van der Linden (2007) had the worst fit. This was to be expected, as the response times were not log-normally distributed. The versions of Model A with just two accumulators were all inferior to the versions of Model B with three accumulators. This suggests that test takers produce incorrect responses actively, which could be due to the scoring of the test: in the IMak test, the more difficult items are generated by several rules, and when some, but not all, rules are deduced, an incorrect response option is selected after a regular response process. Among the model versions with three accumulators, the versions of Models B2 and B3 that allow for guessing fit better than the versions of Model B1 without guessing. This is unexpected, as there is little to gain from guessing in low-stakes tests. Test takers, however, could not skip items, so they always had to choose a response option. The fit of the versions of Model B2 with random guessing was better than the fit of the versions of Model B3 with informed guessing. This could indicate that test takers who disengage prematurely are not motivated to increase their test score by strategic guessing. This is in line with the estimated guessing probability of around 0.18 in Models B2N and B2M, which is only slightly higher than the chance level of $1/8=0.125$. However, it could also be the case that informed guessing does not occur in the way it was implemented in Model B3, so that the versions of Model B3 fit worse than the potentially equally misspecified versions of Model B2. Of the versions of Model B2, Model B2M with the mixture distribution fit best. Normal mixtures may be interpreted either in terms of distinct classes, where each mixture component represents a qualitatively different group of test takers, or as a flexible tool to generate skewed or bimodal densities without any further claim. In the present case, the mixture distribution was not bimodal but simply had more mass in the left tail. This suggests that there are not two clearly separated classes of test takers with different levels of persistence, but simply a slight imbalance toward low levels of persistence. The estimated correlations between the latent traits were $\widehat {\rho }_{12}=0.03$, $\widehat {\rho }_{13}=0.49$, and $\widehat {\rho }_{23}=-0.76$. Hence, there was virtually no correlation between capability and persistence. There was, however, a negative correlation between persistence and the proneness toward an incorrect solution, which suggests that the tendency to give up is related to the tendency to respond incorrectly. The values of the item parameters of Model B2M with the mixture distribution as well as information on absolute model fit are given in Section S4 of the Supplementary Material.

We analyzed the implications of Model B2M for the test taking process. Figure 3 visualizes the median response times of the winning accumulator on the items, as implied by Model B2M and the estimated item parameters. The median response times are short when the response process is interrupted (small triangles) and do not change much over the test. This corresponds to the observation that a subgroup of test takers does not increase the processing time on harder items; see Figure 2. The median response times generated when the progress accumulator (small dots) or the misinformation accumulator (small squares) wins are longer and increase as a function of the item difficulty; note that the median response times generated by the misinformation accumulator are slightly shorter than those generated by the progress accumulator. The profiles of the median response times, however, are very similar, which suggests that the response processes underlying correct and incorrect responses are similar.

Figure 3 Median response times on the items that were generated by the win of the accumulator representing progress (S), disengagement (D) or misinformation (I) as implied by Model B2M and the estimated item parameters.

Even though the model assumes a constant level of persistence $\theta_2$, this does not mean that the response mode of the test takers is constant throughout the test. This is illustrated in Figure 4, where the implied winning probabilities of the three accumulators are visualized for the items. The probability that the response process is interrupted prematurely increases systematically over the test. This results directly from the constant level of persistence and the increasing time demand of the more difficult items at the end of the test.

Figure 4 Winning probability of the accumulators representing progress (S), disengagement (D) or misinformation (I) as a function of the item position as implied by Model B2M and the estimated item parameters.

Figure 5 Correlations between the wins of the disengagement accumulator in the different items of the test as implied by Model B2M and the estimated item parameters.

We finally analyzed the association between the tendencies to interrupt the response process in the single items; note that the accumulation of disengagement in the different items is associated due to the common influence of the persistence trait on the disengagement accumulators. The strength of the association, however, is determined by the discrimination coefficient $\beta_{1g}$ and the standard deviation of the residual $\sigma_2$. In order to determine the strength of the association, we coded an implied win of the disengagement accumulator as one and an implied loss as zero in each item and then computed the correlations between these binary random variates. The correlations are visualized in Figure 5. Although the correlations also depend on the marginal probabilities, the findings suggest that the association is stronger in the last items. This is not unrealistic and might reflect the fact that test takers disengage systematically when items become more demanding (and more time intensive).
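A Monte Carlo version of this coding step can be sketched as follows; `simulate_race` is a hypothetical placeholder that returns the label of the winning accumulator for one item given a trait draw, and the Pearson correlations of the binary indicators are phi coefficients:

```python
import numpy as np

def disengagement_win_correlations(simulate_race, traits, n_items):
    """Simulate races for many trait draws, code a win of the disengagement
    accumulator as 1 (a loss as 0) per item, and correlate the binary
    indicators across items (items with zero variance yield NaN entries)."""
    wins = np.array([[simulate_race(theta, g) == "D" for g in range(n_items)]
                     for theta in traits], dtype=float)
    return np.corrcoef(wins, rowvar=False)  # G x G matrix of phi coefficients
```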

7 Discussion

The performance in tests depends on the capability of a test taker, but also on their motivation to apply it. One important aspect of test taking motivation is the time a test taker is willing to spend on the items. Test takers who interrupt the response process prematurely do not reach their maximal performance level (Thurstone, 1937). In this article, we have proposed a series of models that relate the test performance to traits representing the information processing capacity of a test taker and their persistence. Core of the models is the assumption of a race between the solution process and a process of disengagement. The models differ with respect to the number of accumulators and the response generated in case of disengagement. In Models A1–A3, we assume two accumulators. The first accumulator acts as a kind of progress bar that reflects how close a test taker is to the solution. The second accumulator reflects the accumulating tendency to interrupt the response process. The two accumulators have different roles in the response process: the first triggers a correct response, while the second triggers an incorrect response or a guessing process that is either random or informed. In Models A1–A3, incorrect responses can only occur due to a lack of persistence. As such, these models are specifically suited for tests with a free response format or for tests where the incorrectness of a generated response is recognizable, as is the case with poorly constructed distractors. Models B1–B3 consist of three accumulators. Two accumulators reflect progress and misinformation, while one accumulator represents the tendency to disengage. The progress accumulator triggers a correct response, the misinformation accumulator an incorrect response, and the disengagement accumulator either an incorrect response or a guessing process. Hence, we assume that incorrect responses can be generated actively, by following a wrong solution path, which is different from simply not knowing the solution. As such, these models are specifically suited for tests in which false conclusions can be drawn or problems can be approached wrongly. This requires that the response options in single-choice tests tap typical errors.

The models have several applications. First, they provide a measure of the test taker's capability that is purified of disengagement. In doing so, the proposed models improve upon mixture models for disengagement (e.g., Liu et al., 2019) and effort-moderated item response models (e.g., Wise & DeMars, 2006), which simply classify responses as engaged or disengaged and explicitly or implicitly discard the disengaged responses when estimating the traits. This makes the model an attractive candidate for the evaluation of group performance, as in the PISA study. It is also useful for psychological assessment, although the model cannot provide precise estimates of capability when persistence is very low. Apart from serving as a measurement model, the models can be used to investigate the response process. This was demonstrated in the empirical example, where we analyzed data from a matrix reasoning test. There, we could demonstrate that our model explains the occurrence of guesses and incorrect responses and their interrelation over time as an interplay of the time that is dedicated to a task and the time that is required to respond. We failed, however, to provide evidence for informed guessing. This might be due to the fact that test takers with very low levels of persistence are also not motivated to engage in an informed guess, or to the fact that informed guessing is difficult when a large number of distractors is presented and does not occur in the form in which it was implemented in the models. Whether the superiority of the models with random guessing over the models with informed guessing is a peculiarity of our data set or holds in general is a question that should be investigated in the future.

Our model is an effort to account for motivational aspects of test taking. It naturally has limitations that should be addressed in future research. First of all, the interpretation of the model in terms of persistence is not fully warranted. From a mathematical perspective, there are just accumulators that generate correct or incorrect responses or trigger a guess. This is most problematic in Models A1 and B1. The interpretation of accumulators in terms of disengagement thus requires further evidence, e.g., predictions that imply disengagement or a relation to an external measure of test-taking motivation. Low persistence is also not necessarily a sign of low motivation, but could also be the result of meta-cognition (Cheyette & Piantadosi, 2024), fatigue, or test taking strategies. A second limitation is the fact that our model is static, as we consider fixed latent traits that do not change throughout the test. This is in line with the standard approach in item response modeling. The assumption of constant trait levels is not as severe as it appears, as general trends can be absorbed into the item parameters. Nevertheless, there is evidence for individual changes throughout the test (e.g., Schweizer et al., 2020; Ulitzsch et al., 2022; Weirich et al., 2017). Modeling individual changes requires a combination of the race model with growth curve models, which is a topic of future research. A third limitation is the assumption that test takers will always give a response. Test takers who quit working may also skip the item entirely, which results in item omissions. As item omissions indicate a lack of persistence, they can legitimately be scored as wrong responses in Models A1 and B1. This does not hold for Models A2/A3 and B2/B3, where there is no one-to-one correspondence between quitting and a wrong response. Incorporating missing responses into the models, however, is not trivial. As the response given under quitting depends on the accumulated knowledge in Models A3 and B3, the accumulated knowledge should also influence the decision between omission and guessing. Such an extension of the proposed models to omitted responses is a topic for future research; at the moment, it might be best to prevent omissions by not allowing the test takers to skip items. Finally, the assumption of a race between a disengagement process and a solution process is not the only possible approach. First, the decision to disengage might be a direct function of the progress a test taker is actually making, which requires assumptions about the meta-cognitions by which a test taker monitors the process of test taking. Second, there are alternative process models that could be extended, such as the diffusion model with collapsing boundaries.

Supplementary material

The supplementary material for this article can be found at https://doi.org/10.1017/psy.2025.10026.

Data availability statement

The IMak data are available via OSF at https://osf.io/uge2w.

Funding statement

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project Number 288472689.

Competing interests

The authors declare none.

Footnotes

1 There are further approaches to account for persistence, such as the identification of non-effortful responses (e.g., Liu et al., 2019; Wise & DeMars, 2006), models that use response time as a predictor of response accuracy (e.g., Bolsinova et al., 2017; van Breukelen & Roskam, 1991; Wang & Hanson, 2005), and mixture models based solely on the responses (e.g., Bolt et al., 2002; Yamamoto & Everson, 1995). We do not review them in detail as they are not directly relevant for the manuscript.

2 Fixing the threshold over test takers forces the accumulators to the same scale. This does not affect the interpretation of the item parameters in case the scale is item specific. Such an assumption is reasonable for the progress and misinformation accumulators, which represent critical knowledge levels that are not subject to deliberate control. Individual variations in the scale, however, might occur in the disengagement accumulator. Random variations in the disengagement threshold, should they exist, are transferred directly into random fluctuations of the disengagement accumulator when the threshold is fixed. It is thus not possible to distinguish individual differences in the threshold from individual differences in the drift rate.

References

Alós-Ferrer, C. (2018). A dual-process diffusion model. Journal of Behavioral Decision Making, 31, 203–218. https://doi.org/10.1002/bdm.1960
Bandhu, D., Mohan, M., Nittala, N., Jadhav, P., Bhadauria, A., & Saxena, K. (2024). Theories of motivation: A comprehensive analysis of human behavior drivers. Acta Psychologica, 224, 104177. https://doi.org/10.1016/j.actpsy.2024.104177
Batchelder, W. (2007). Cognitive psychometrics: Combining two psychological traditions. CSCA Lecture.
Baumert, J., & Demmerich, A. (2001). Test motivation in the assessment of student skills: The effects of incentives on motivation and performance. European Journal of Psychology of Education, 16, 441–462. https://doi.org/10.1007/BF03173192
Bechger, T., Maris, G., Verstralen, H., & Verhelst, N. D. (2005). The Nedelsky model for multiple-choice items. In van der Ark, L., Croon, M., & Sijtsma, K. (Eds.), New developments in categorical data analysis for the social and behavioral sciences (pp. 187–206). Lawrence Erlbaum. https://doi.org/10.4324/9781410612021
Berger, J., Liseo, B., & Wolpert, R. (1999). Integrated likelihood methods for eliminating nuisance parameters. Statistical Science, 14, 1–28. https://doi.org/10.1214/ss/1009211804
Blum, D., & Holling, H. (2018). Automatic generation of figural analogies with the IMak package. Frontiers in Psychology, 9, 1286. https://doi.org/10.3389/fpsyg.2018.01286
Bolsinova, M., & Molenaar, D. (2018). Modeling nonlinear conditional dependence between response time and accuracy. Frontiers in Psychology, 9, 1525. https://doi.org/10.3389/fpsyg.2018.01525
Bolsinova, M., Tijmstra, J., Molenaar, D., & De Boeck, P. (2017). Conditional dependence between response time and accuracy: An overview of its possible sources and directions for distinguishing between them. Frontiers in Psychology, 8, 202. https://doi.org/10.3389/fpsyg.2017.00202
Bolt, D., Cohen, A., & Wollack, J. (2002). Item parameter estimation under conditions of test speededness: Application of a mixture Rasch model with ordinal constraints. Journal of Educational Measurement, 39, 331–348. https://doi.org/10.1111/j.1745-3984.2002.tb01146.x
Brown, S., & Heathcote, A. (2008). The simplest complete model of choice response time: Linear ballistic accumulation. Cognitive Psychology, 57, 153–178. https://doi.org/10.1016/j.cogpsych.2007.12.002
Bunji, K., & Okada, K. (2022). Linear ballistic accumulator item response theory model for multidimensional multiple-alternative forced-choice measurement of personality. Multivariate Behavioral Research, 57, 658–678. https://doi.org/10.1080/00273171.2021.1896351
Burnham, K., & Anderson, D. (2013). Model selection and inference: A practical information-theoretic approach. Springer. https://doi.org/10.1007/978-1-4757-2917-7
Burton, R. (2002). Misinformation, partial knowledge and guessing in true/false tests. Medical Education, 36, 805–811. https://doi.org/10.1046/j.1365-2923.2002.01299.x
Cheyette, S., & Piantadosi, S. (2024). Response to difficulty drives variation in IQ test performance. Open Mind, 8, 265–277. https://doi.org/10.1162/opmi_a_00127
Cisek, P., Puskas, G., & El-Murr, S. (2009). Decisions in changing conditions: The urgency-gating model. The Journal of Neuroscience, 29, 11560–11571. https://doi.org/10.1523/JNEUROSCI.1844-09.2009
de la Torre, J. (2009). A cognitive diagnosis model for cognitively based multiple-choice options. Applied Psychological Measurement, 33, 163–183. https://doi.org/10.1177/0146621608320523
DeMars, C. (2016). Partially compensatory multidimensional item response theory models: Two alternate model forms. Educational and Psychological Measurement, 76, 231–257. https://doi.org/10.1177/0013164415589595
Duncan, G. (1974). An empirical Bayes approach to scoring multiple-choice tests in the misinformation model. Journal of the American Statistical Association, 69, 50–57. https://doi.org/10.1080/01621459.1974.10480127
Eklöf, H. (2010). Skill and will: Test-taking motivation and assessment quality. Assessment in Education: Principles, Policy & Practice, 17, 345–356. https://doi.org/10.1080/0969594X.2010.516569
Embretson, S., & Reise, S. (2000). Item response theory for psychologists. Lawrence Erlbaum. https://doi.org/10.4324/9781410605269
Fox, J., & Marianti, S. (2016). Joint modeling of ability and differential speed using responses and response times. Multivariate Behavioral Research, 51, 540–553. https://doi.org/10.1080/00273171.2016.1171128
Frary, R. (1980). The effect of misinformation, partial information, and guessing on expected multiple-choice test item scores. Applied Psychological Measurement, 4, 71–90. https://doi.org/10.1177/014662168000400109
Gibbons, R., & Hedeker, D. (1992). Full-information item bi-factor analysis. Psychometrika, 57, 423–436. https://doi.org/10.1007/BF02295430
Glickman, M., Gray, J., & Morales, C. (2005). Combining speed and accuracy to assess error-free cognitive processes. Psychometrika, 70, 405–425. https://doi.org/10.1007/s11336-002-0999-3
Goldhammer, F., Martens, T., & Lüdtke, O. (2017). Conditioning factors of test-taking engagement in PIAAC: An exploratory IRT modelling approach considering person and item characteristics. Large Scale Assessment in Education, 5, 18. https://doi.org/10.1186/s40536-017-0051-9
Guthrie, M., Zhang, T., & Chen, Z. (2020). A tale of two guessing strategies: Interpreting the time students spend solving problems through online log data. In Physics education research conference 2020, Virtual Conference. https://doi.org/10.5555/1642194.1642224
Hawkins, G., & Heathcote, A. (2021). Racing against the clock: Evidence-based versus time-based decisions. Psychological Review, 128, 222–263. https://doi.org/10.1037/rev0000259
Heller, J. (2023). Special issue on knowledge structures: Theoretical developments and applications. Journal of Mathematical Psychology, 114, 102773. https://doi.org/10.1016/j.jmp.2023.102773
Knetka, E. (2017). Motivational aspects of test-taking. Umeå University. Retrieved from http://umu.diva-portal.org/
Lau, P., Lau, S., Hong, K., & Usop, H. (2011). Guessing, partial knowledge, and misconceptions in multiple-choice tests. Journal of Educational Technology & Society, 14, 99–110. http://www.jstor.org/stable/jeductechsoci.14.4.99
Lee, J., & Corter, J. E. (2011). Diagnosis of subtraction bugs using Bayesian networks. Applied Psychological Measurement, 35, 27–47. https://doi.org/10.1177/0146621610377079
Lee, Y., & Ying, Z. (2015). A mixture cure-rate model for responses and response times in time-limit tests. Psychometrika, 80, 748–775. https://doi.org/10.1007/s11336-014-9419-8
Liu, Y., Li, Z., Liu, H., & Luo, F. (2019). Modeling test-taking non-effort in MIRT models. Frontiers in Psychology, 10, 145. https://doi.org/10.3389/fpsyg.2019.00145
Lu, J., & Wang, C. (2020). A response time process model for not-reached and omitted items. Journal of Educational Measurement, 57, 584–620. https://doi.org/10.1111/jedm.12270
Lu, J., Wang, C., Zhang, J., & Tao, J. (2020). A mixture model for responses and response times with a higher-order ability structure to detect rapid guessing behaviour. British Journal of Mathematical and Statistical Psychology, 73, 261–288. https://doi.org/10.1177/00131644211045
McLachlan, G., & Peel, D. (2000). Finite mixture models. Wiley. https://doi.org/10.1002/0471721182
Meyer, J. (2010). A mixture Rasch model with item response time components. Applied Psychological Measurement, 34, 521–538. https://doi.org/10.1177/0146621609355451
Molenaar, D., Bolsinova, M., & Vermunt, J. (2018). A semi-parametric within-subject mixture approach to the analyses of responses and response times. British Journal of Mathematical and Statistical Psychology, 71, 205–228. https://doi.org/10.1111/bmsp.12117
Much, S., Ranger, J., Mutak, A., Krause, R., & Pohl, S. (2022). Modeling the process underlying solution and non-solution behavior with a non-linear ballistic accumulator model. In IMPS 2022: International meeting of the psychometric society.
Mutak, A., Krause, R., Ulitzsch, E., Much, S., Ranger, J., & Pohl, S. (2024). Modeling the intraindividual relation of ability and speed within a test. Journal of Educational Measurement, 61, 378–407. https://doi.org/10.1111/jedm.12391
Myszkowski, N., Storme, M., Kubiak, E., & Baron, S. (2022). Exploring the associations between personality and response speed trajectories in low-stakes intelligence tests. Personality and Individual Differences, 191, 111580. https://doi.org/10.1016/j.paid.2022.111580
Nagy, G., & Robitzsch, A. (2021). A continuous HYBRID IRT model for modeling changes in guessing behavior in proficiency tests. Psychological Test and Assessment Modeling, 63, 361–395.
Nagy, G., & Ulitzsch, E. (2022). A multilevel mixture IRT framework for modeling response times as predictors or indicators of response engagement in IRT models. Educational and Psychological Measurement, 82, 845–879. https://doi.org/10.1177/00131644211045351
Noventa, S., Heller, J., & Kelava, A. (2024). Toward a unified perspective on assessment models, part I: Foundations of a framework. Journal of Mathematical Psychology, 122, 102872. https://doi.org/10.1016/j.jmp.2024.102872
Ranger, J., & Kuhn, J. (2014). An accumulator model for responses and response times in tests based on the proportional hazards model. British Journal of Mathematical and Statistical Psychology, 67, 388–407. https://doi.org/10.1111/bmsp.12025
Ranger, J., Kuhn, J., & Gaviria, J.-L. (2015). A race model for responses and response times in tests. Psychometrika, 80, 791–810. https://doi.org/10.1007/s11336-014-9427-8
Ranger, J., Wolgast, A., Much, S., Mutak, A., Krause, R., & Pohl, S. (2023). Disentangling different aspects of change in tests with the D-diffusion model. Multivariate Behavioral Research, 58, 1039–1055. https://doi.org/10.1080/00273171.2023.2171356
Ratcliff, R. (1988). Continuous versus discrete information processing modeling accumulation of partial information. Psychological Review, 95, 238–255. https://doi.org/10.1037/0033-295x.95.2.238
Ratcliff, R., & McKoon, G. (2008). The diffusion decision model: Theory and data for two-choice decision tasks. Neural Computation, 20, 873–922. https://doi.org/10.1162/neco.2008.12-06-420
Roskam, E. (1987). Towards a psychometric theory of intelligence. In Roskam, E., & Suck, R. (Eds.), Progress in mathematical psychology (pp. 151–174). North-Holland.
Rouder, J., Province, J., Morey, R., Gomez, P., & Heathcote, A. (2015). The lognormal race: A cognitive-process model of choice and latency with desirable psychometric properties. Psychometrika, 80, 491–513. https://doi.org/10.1007/s11336-013-9396-3
Sadler, P. (1998). Psychometric models of student conceptions in science: Reconciling qualitative studies and distractor-driven assessment instruments. Journal of Research in Science Teaching, 35, 265–296. https://doi.org/10.1002/(SICI)1098-2736(199803)35:3<265::AID-TEA3>3.0.CO;2-P
San Martin, E., del Pino, G., & De Boeck, P. (2006). IRT models for ability-based guessing. Applied Psychological Measurement, 30, 183–203. https://doi.org/10.1177/0146621605282773
Sartori, R. (2006). The bell curve in psychological research and practice: Myth or reality? Quality & Quantity, 40, 407–418. https://doi.org/10.1007/s11135-005-6104-0
Schweizer, K., Zeller, F., & Reiss, S. (2020). Higher-order processing and change-to-automaticity as explanations of the item-position effect in reasoning tests. Acta Psychologica, 203, 102991. https://doi.org/10.1016/j.actpsy.2019.102991
Silm, G., Pedaste, M., & Täht, K. (2020). The relationship between performance and test-taking effort when measured with self-report or time-based instruments: A meta-analytic review. Educational Research Review, 31, 100335. https://doi.org/10.1016/j.edurev.2020.100335
Stupple, E., Pitchford, M., Ball, L., Hunt, T., & Steel, R. (2017). Slower is not always better: Response-time evidence clarifies the limited role of miserly information processing in the Cognitive Reflection Test. PLoS ONE, 12, e0186404. https://doi.org/10.1371/journal.pone.0186404
Suh, Y., & Bolt, D. (2010). Nested logit models for multiple-choice item response data. Psychometrika, 75, 454–473. https://doi.org/10.1007/s11336-010-9163-7
Thurstone, L. (1937). Ability, motivation, and speed. Psychometrika, 2, 249–254. https://doi.org/10.1007/BF02287896
Ulitzsch, E., Pohl, S., Khorramdel, L., Kroehne, U., & von Davier, M. (2022). A response-time-based latent response mixture model for identifying and modeling careless and insufficient effort responding in survey data. Psychometrika, 87, 593–619. https://doi.org/10.1007/s11336-021-09817-7
Ulitzsch, E., von Davier, M., & Pohl, S. (2022). A hierarchical latent response model for inferences about examinee engagement in terms of guessing and item-level nonresponse. British Journal of Mathematical and Statistical Psychology, 73, 83–112. https://doi.org/10.1111/bmsp.12188
van Breukelen, G., & Roskam, E. (1991). A Rasch model for the speed-accuracy tradeoff in time limited tests. In Doignon, J., & Falmagne, J. (Eds.), Mathematical psychology: Recent research in psychology (pp. 251–271). Springer. https://doi.org/10.1007/978-1-4613-9728-1_15
van der Linden, W. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72, 287–308. https://doi.org/10.1007/s11336-006-1478-z
van der Maas, H., Molenaar, D., Maris, G., Kievit, R., & Borsboom, D. (2011). Cognitive psychology meets psychometric theory: On the relation between process models for decision making and latent variable models for individual differences. Psychological Review, 118, 339–356. https://doi.org/10.1037/a0022749
Vigneau, F., Caissie, A., & Bors, D. (2006). Eye-movement analysis demonstrates strategic influences on intelligence. Intelligence, 34, 261–272. https://doi.org/10.1016/j.intell.2005.11.003
von Davier, M. (2009). Is there need for the 3PL model? Guess what? Measurement: Interdisciplinary Research and Perspectives, 7, 110–114. https://doi.org/10.1080/15366360903117079
Wagenmakers, E. (2009). Methodological and empirical developments for the Ratcliff diffusion model of response times and accuracy. European Journal of Cognitive Psychology, 21, 641–671. https://doi.org/10.1080/09541440802205067
Wang, C., & Xu, G. (2015). A mixture hierarchical model for response times and response accuracy. British Journal of Mathematical and Statistical Psychology, 68, 456–477. https://doi.org/10.1111/bmsp.12054
Wang, T., & Hanson, B. (2005). Development and calibration of an item response model that incorporates response time. Applied Psychological Measurement, 29, 323–339. https://doi.org/10.1177/0146621605275984
Weirich, S., Hecht, M., Penk, C., & Roppelt, A. (2017). Item position effects are moderated by changes in test-taking effort. Applied Psychological Measurement, 41, 115–129. https://doi.org/10.1177/0146621616676791
Wenger, M., & Gibson, B. (2004). Using hazard functions to assess changes in processing capacity in an attentional cuing paradigm. Journal of Experimental Psychology, 30, 708–719. https://doi.org/10.1037/0096-1523.30.4.708
Wise, S., & DeMars, C. (2006). An application of item response time: The effort-moderated IRT model. Journal of Educational Measurement, 43, 19–38. https://doi.org/10.1111/j.1745-3984.2006.00002.x
Yamamoto, K., & Everson, H. (1995). Modeling the mixture of IRT and pattern responses by a modified hybrid model. ETS Research Report Series, 1, 1–26. https://doi.org/10.1002/j.2333-8504.1995.tb01651.x
Zhang, L., Wetzel, E., Yoon, H., & Roberts, B. (2024). Perseverance, a measure of conscientiousness, is a valid predictor of achievement and truancy across the globe. Journal of Personality and Social Psychology, 126, 853–872. https://doi.org/10.1037/pspp0000505