
Reliability Theory for Measurements with Variable Test Length, Illustrated with ERN and Pe Collected in the Flanker Task

Published online by Cambridge University Press:  01 January 2025

Jules L. Ellis* (Open University of the Netherlands)
Klaas Sijtsma (Tilburg University)
Kristel de Groot (Erasmus University Rotterdam)
Patrick J. F. Groenen (Erasmus University Rotterdam)

*Correspondence should be made to Jules L. Ellis, Faculty of Psychology, Open University of the Netherlands, Heerlen, The Netherlands. Email: jules.ellis@ou.nl

Abstract

In psychophysiology, an interesting question is how to estimate the reliability of event-related potentials collected by means of the Eriksen Flanker Task or similar tests. A special problem presents itself if the data represent neurological reactions that are associated with some responses (in the case of the Flanker Task, responding incorrectly on a trial) but not with others (such as providing a correct response), inherently resulting in unequal numbers of observations per subject. The general trend in reliability research here is to use generalizability theory and Bayesian estimation. We show that a new approach based on classical test theory and frequentist estimation can do the job as well, and in a simpler way, and even provides additional insight into matters that were left unresolved in the generalizability approach. One of our contributions is the definition of a single, overall reliability coefficient for an entire group of subjects with unequal numbers of observations. The two methods have slightly different objectives. We argue in favor of the classical approach, but without rejecting the generalizability approach.

Type
Theory & Methods
Creative Commons
CC BY 4.0
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Copyright
© 2024 The Author(s)

1. Introduction

This article is based on a consultation request from biological psychologists seeking psychometric advice on reliability issues. They were struggling with appropriate reliability estimation for psychophysiological data collected in a design where the number of observations per person is a random variable rather than a fixed number, which poses statistical challenges. Until recently, they relied on methods from classical test theory (CTT), mainly coefficient alpha and the split-half method (e.g., Fabiani et al., 1987). A problem with most classical reliability coefficients is that they cannot be applied to these data without discarding large portions of the data (Clayson, 2020). Baldwin et al. (2015) argued that these simple methods were inappropriate and suggested generalizability theory (GT), using Bayesian statistics, as a viable alternative. In this article, we develop two new CTT methods that circumvent this problem using a frequentist approach. Our methods can be applied easily: the first method requires only traditional reliability estimates such as coefficient alpha or $\lambda_4$, computed repeatedly, and the second method requires only two observed variances and an observed mean, using 100% of the data. Moreover, our theoretical analysis justifies the computation of an overall reliability coefficient over groups of participants with different numbers of observations.
This also leads to a conceptual distinction between ‘reliability’ and ‘test–retest correlation’ even if items are parallel, thus clarifying theoretical issues that were previously unaddressed.

The data relevant to this study are event-related potentials (ERPs) collected during an Eriksen Flanker Task (Eriksen & Eriksen, 1974), but other, similar data types are also relevant here. A well-known example is the Stroop test (Stroop, 1935). Fabiani et al. (1987) and Hedge et al. (2018) discussed additional stimulus types in the context of CTT reliability estimation. Because we focus on reliability, we do not further discuss other task types that generate similar data sets but concentrate on the Flanker Task data.

In a Flanker Task, participants are repeatedly shown a string of letters (‘SSSSS’, ‘SSHSS’, ‘HHSHH’, ‘HHHHH’) and are instructed to press a button with one hand if the central letter is an ‘H’ and with the other hand if it is an ‘S’. Participants must respond as quickly and as accurately as possible, and although correct responses are observed on the majority of trials, incorrect responses occur too. On trials where participants respond incorrectly, specific event-related potentials (ERPs) arise. ERPs are voltage fluctuations in neurons that can be measured from the scalp using electroencephalography (EEG). Two ERPs that are consistently observed when participants err (in a Flanker Task or similar experimental designs) are the error-related negativity (ERN; Falkenstein et al., 1991; Gehring et al., 1993) and the error positivity (Pe; Falkenstein et al., 1991). The former peaks between 25 and 100 ms after the commission of an error and is strongest at fronto-central scalp sites. The latter peaks between 200 and 400 ms after the incorrect response and is best observed at centro-parietal locations. Although their precise functional significance is still debated (Olvet & Hajcak, 2008; Overbeek et al., 2005), the ERN is thought to represent early error signaling that does not depend on the person being aware of having committed the error, whereas the Pe may represent later, more conscious processing of the error (Nieuwenhuis et al., 2001; O’Connell et al., 2007). The stronger the ERP (i.e., the more negative the ERN and the more positive the Pe), the stronger the neuronal response to committing the error. As the ERN and Pe are observed only when participants err, the number of observations per participant varies, complicating reliability estimation for these data.

Although we developed our reliability theory with ERN and Pe data in mind, it may be applicable to other data where the number of observations is variable. One reviewer noted that “it might be helpful for stimulus-related ERPs, which also tend to have unbalanced trial counts due to artifact rejection,” and we agree with this. Another reviewer pointed out that the situation is similar in cases of agreement coefficients or intraclass coefficients based on multiple raters, if the number of raters or the number of objects is variable, and this is given more attention in Supplementary Material C.

So far, the theoretical psychometric literature has been unaware of the reliability issue that arose in this research area. This article extends CTT with new methods for estimating reliability with variable numbers of observations per participant, as is routinely encountered in psychophysiological data such as those of the ERN and Pe. We first describe the Flanker Task, the resulting data matrices, and briefly review reliability methods that have been applied. After stating our assumptions, we present our first method in the form of a theorem and a corollary, which deal with potentially non-parallel items. We then present our second method, which deals with parallel items. After a computational example and real-data examples, we present a theoretical analysis of test–retest correlations, showing that they generally cannot be used to estimate reliability. We compare our CTT approach in detail with the GT approaches suggested by Baldwin et al. (2015) and Clayson et al. (2021).

2. Flanker Tasks and Resulting Data

2.1. Flanker Tasks

The Flanker Task used in the present study is a representative version of the Eriksen Flanker Task, data from which have been presented elsewhere (Bernoster et al., 2019; Rietdijk et al., 2014). In this version, participants complete 400 trials in which they are shown a letter array whose central target letter is equal (‘SSSSS’, ‘HHHHH’) or unequal (‘SSHSS’, ‘HHSHH’) to the flanking distractor letters. Participants are instructed to press a predefined button with their right index finger if the central letter (the target) is an ‘H’ and another button with their left index finger if the target is an ‘S’. Trials start with a 250-ms cue (‘⌃’) pointing at the location of the target. Then, the letter array appears for 52 ms, followed by a black screen for 648 ms. During this 700-ms period, participants can respond by pressing one of the buttons. Then, a feedback symbol appears indicating whether their response was correct (‘ooo’), incorrect (‘xxx’), or too late (‘!’). After a 500-ms break (the intertrial interval, or ITI), the next trial starts. The trial sequence is illustrated in Fig. 1. Participants completed 80 trials in a row and could take a break between each series of 80 trials. Within a series, each of the four letter arrays was presented 20 times in random order to prevent training or fatigue effects from systematically affecting certain letter array conditions.

The ERN and Pe data were extracted in line with standard practices in electrophysiological research—the ERN was defined as the mean amplitude at electrode FCz in the 25–100-ms time window, and the Pe was defined as the mean amplitude at electrode Pz in the 200–400-ms time window. Precise information on the recording and (pre-)processing of the data is described in Supplementary Material A.

Figure 1 Schematic Representation of a Flanker Task Trial

If given enough time to respond, participants would answer every trial correctly and no error-related ERPs (i.e., no ERNs or Pes) would occur. Therefore, participants respond to trials under time pressure. With adequately chosen presentation, response, and feedback time intervals, time pressure forces participants to make errors. Presenting Flanker Tasks under time pressure is a means of eliciting the data of interest, but because an ERN/Pe occurs only when the response is incorrect, we will further ignore the correct/incorrect data structure and focus on the trials on which an incorrect answer was given and, thus, an ERN and Pe were elicited.

2.2. Data Matrix

With the Flanker Task, interest resides in psychophysiological activity in response to errors. Incorrect responses trigger an ERN and Pe, whereas correct responses do not. There are two ways to represent these data in a data matrix, which we will refer to as spaced and condensed. In the spaced data matrix, the n-th column corresponds to the n-th trial on which a stimulus was presented. This produces a data matrix containing ERPs where a response was incorrect, interspersed with blanks for the other trials. In the condensed data matrix, the n-th column corresponds to the n-th trial on which the participant made an error. This produces a data matrix with ERPs in consecutive columns at the left side, followed by blanks. A small fictitious example of both data matrices is given in Table 1.

Table 1 Small Example of Spaced and Condensed Data Matrix of ERNs
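The conversion from the spaced to the condensed representation can be sketched as follows (the function name and the toy values are our own, not from the article): each row's error-trial ERPs are shifted to the left, leaving trailing blanks.

```python
import numpy as np

# Illustrative sketch: convert a spaced data matrix (one column per trial,
# NaN on correct-response trials) into a condensed data matrix
# (error-trial ERPs shifted left, trailing NaNs).
def condense(spaced):
    spaced = np.asarray(spaced, dtype=float)
    condensed = np.full_like(spaced, np.nan)
    for r, row in enumerate(spaced):
        vals = row[~np.isnan(row)]          # keep ERPs from error trials only
        condensed[r, :len(vals)] = vals     # shift them to the left
    return condensed

# Two fictitious participants, four trials; NaN marks a correct response.
spaced = [[np.nan, -4.2, np.nan, -6.1],
          [-3.0,  np.nan, np.nan, np.nan]]
print(condense(spaced))
```

Note that the column index of the condensed matrix no longer refers to the trial number but to the rank of the error trial.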

Because correct responses do not elicit these ERPs, it is debatable whether unavailable ERPs should be considered missing. The present situation differs from blanks in a data matrix where, for example, a participant’s age was expected. Because each participant has an age, such a blank represents a truly missing value that the researcher may wish to track down or treat statistically. This reasoning does not apply to alleged ERPs that, in fact, do not exist when responses are correct and therefore are not missing. The blanks in Table 1 indicate where correct responses were given, but do not represent missing ERPs. Our reliability method therefore must deal with unequal numbers of scores across participants but not necessarily with missing data. Consequently, we will use the condensed data matrix.

Note that data removed during data cleaning (for ERPs, specifically in the artifact rejection step; see Supplementary Material A) are missing even under our definition. These are usually a much smaller part of the data. In our examples with real data, we do not differentiate these missing values from the empty cells due to correct responses, but we do not claim that future researchers should necessarily do the same.

We will not differentiate between stimulus types (‘SSSSS’, ‘SSHSS’, ‘HHSHH’, ‘HHHHH’) in the following sections, thus treating them as equivalent, because we want to focus on the methodological innovation. A separate section will discuss how the results of different stimulus types can be integrated.

2.3. Review of Previous Methods to Estimate Reliability

Importantly, as the ERN and Pe represent the neuronal response to committing an error, they do not manifest on trials where a participant responded correctly. It is customary to present the participant with a fixed number of stimuli (e.g., 400) and to compute the participant’s mean ERN and Pe over the error trials only, but the number of error trials is a random variable that can attain different values for different participants. If the mean ERN or Pe of each participant is used as a psychological test score, then the number of error trials corresponds to the concept of test length in reliability theory, but textbooks on CTT do not address the possibility that test length is a random variable that differs across participants. As a result, it is not directly clear how the reliability of the test scores can be estimated from these data. Klawohn et al. (2020) used the split-half method, and many others used coefficient alpha (Marco-Pallares et al., 2011; Meyer et al., 2013; Olvet & Hajcak, 2009; Pontifex et al., 2010; Rietdijk et al., 2014). A problem with the computation of coefficient alpha here is that it requires the same number of observations for each participant. In analyses of ERN and Pe data, this problem is often solved by computing alpha for only a small number of trials, say the first eight. A disadvantage of this approach is that it discards all data of participants with fewer than eight trials, as well as data from the ninth trial onward. For example, applied to our ERN data, this discards 69% of the scores. Other authors (e.g., Clayson et al., 2021) advocated the use of generalizability theory (GT) with multilevel analysis, which does not discard data.
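The "first k trials" workaround can be sketched as follows (our own code, not the authors'): compute coefficient alpha on the first k condensed columns, keeping only participants with at least k error trials, and report the fraction of recorded scores this approach throws away.

```python
import numpy as np

def coefficient_alpha(data):
    """Coefficient alpha for a complete participants x items score matrix."""
    data = np.asarray(data, dtype=float)
    k = data.shape[1]
    sum_item_vars = data.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = data.sum(axis=1).var(ddof=1)         # variance of sum scores
    return k / (k - 1) * (1 - sum_item_vars / total_var)

def alpha_first_k(condensed, k):
    """Alpha on the first k columns for complete cases, plus the fraction
    of all recorded scores that this approach discards."""
    condensed = np.asarray(condensed, dtype=float)
    complete = condensed[~np.isnan(condensed[:, :k]).any(axis=1), :k]
    used = complete.size
    recorded = np.count_nonzero(~np.isnan(condensed))
    return coefficient_alpha(complete), 1 - used / recorded
```

The discarded fraction makes the cost of the workaround explicit; on the ERN data described above it would be 0.69.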

A reason why Clayson et al. (2021) and Clayson and Miller (2017a, 2017b) turned to GT is that it allows a coherent treatment of multiple error sources such as ‘items’ and ‘time.’ Although we agree that this can be a reason to use GT, many ERP studies involve only a single error source, such as different trials within the same session, together with perhaps fixed factors such as diagnosis group and stimulus type. In these cases, our new CTT methods are simpler and provide additional insights. However, we disagree with the cited authors on one point. Clayson and Miller (2017a, p. 72) state that CTT requires the assumption of parallel items. This has been claimed by authoritative authors on GT too, but we consider the claim misguided (see Sijtsma, 2009a, 2009b; Sijtsma & Pfadt, 2021a, 2021b). In our view, the only fundamental assumption of CTT is that error scores are uncorrelated (Ellis, 2021). Some CTT theorems additionally assume parallel items, but not all do, and in the absence of parallel items we can still use the part of CTT that does not require them. To make this clear, we will state precisely in which formulas we assume parallel items and in which we do not.

Both CTT and GT assume uncorrelated error score variables, and for this reason we studied the autocorrelations in the ERN and Pe data of the Flanker Task. This is not the focus of our article, and therefore, this analysis is reported in Supplementary Material B. Our conclusion is that CTT and GT may be applied to these data.
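A simple per-participant diagnostic in this spirit (a sketch of our own, not the analysis in Supplementary Material B) is the lag-1 autocorrelation of consecutive condensed scores:

```python
import numpy as np

# Lag-1 autocorrelation of one participant's consecutive condensed scores,
# as a quick check on the uncorrelated-errors assumption.
def lag1_autocorr(scores):
    vals = np.asarray(scores, dtype=float)
    vals = vals[~np.isnan(vals)]            # drop trailing blanks
    if len(vals) < 3:
        return np.nan                       # too few error trials
    return np.corrcoef(vals[:-1], vals[1:])[0, 1]
```

Autocorrelations scattered around zero across participants are consistent with uncorrelated errors; systematically nonzero values would suggest serial dependence between consecutive error trials.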

3. Reliability for Psychophysiological Data

In this section, we develop a CTT approach to reliability that respects the characteristics of the ERP data collected using the Flanker Task. First, we introduce a CTT definition of reliability for the case that participants do not have the same number of items (here, error trials), typical of Flanker Task ERP data. The reliability defined in this way for the whole group, with varying number of items across participants, is shown to be a weighted average of the reliability estimated within each subgroup with the same number of items. The weights are the subgroup proportions of participants adding up to 1 across all subgroups and are easily derivable from the data, as are the estimates of the other parameters needed. Second, we study the method for parallel items as a special case and derive a result for estimating reliability that is even simpler, because it requires only two observed variances and the harmonic mean of the number of observations per participant. Third, we provide computational examples for estimating reliability for ERP data. Fourth and finally, we study the correlation between test administrations that have item-by-item parallelism between administrations but not within the same administration. We show that in this case the test scores would not be parallel, and therefore, there is no reason to expect that the correlation of the two test scores is equal to their reliability. We also show that if the items within the test administrations are parallel, then the situation simplifies considerably and the reliability can be estimated from the correlation between two administrations if the harmonic means of the test lengths are equal.
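As a preview of the parallel-items case, the following sketch (our own simplification, not the exact estimator derived in this article) shows the kind of computation involved: under parallel items with true scores independent of N, $\mathrm{Var}(X_+) = \mathrm{Var}(T) + \sigma^2\,\mathbb{E}(1/N)$, so the reliability $\mathrm{Var}(T)/\mathrm{Var}(X_+)$ can be approximated from the variance of the person means, a pooled within-person variance estimating $\sigma^2$, and the harmonic mean of the test lengths.

```python
import numpy as np
from statistics import harmonic_mean

# Preview sketch under our own simplifying assumptions (parallel items,
# true scores independent of N): reliability of person mean scores from
# two observed variances and the harmonic mean of the test lengths.
def reliability_sketch(condensed):
    condensed = np.asarray(condensed, dtype=float)
    n = np.count_nonzero(~np.isnan(condensed), axis=1)   # trials per person
    x_plus = np.nanmean(condensed, axis=1)               # person mean scores
    # pooled within-person variance; estimates the error variance sigma^2
    pooled_error_var = (np.nansum((condensed - x_plus[:, None]) ** 2)
                        / (n - 1).sum())
    h = harmonic_mean(n.tolist())                        # harmonic mean of N
    return 1 - pooled_error_var / (h * x_plus.var(ddof=1))
```

Note that $\mathbb{E}(1/N)$ is exactly the reciprocal of the harmonic mean of N, which is why the harmonic mean, rather than the arithmetic mean, of the test lengths appears here.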

3.1. Reliability if the Number of Items is a Random Variable

3.1.1. Assumptions

Let $X_1, X_2, \ldots$ be an infinite sequence of observable score variables, where $X_i$ is the observable score variable on trial i. The variables are called “observable” because we assume that not all of them are observed for all participants. In this study, this means that a variable is observed if an ERN score and a Pe score are observed and recorded for a participant. Following psychometric jargon, we will also say that each variable $X_i$ is an item score variable or, even shorter, an item; the i-th column of the condensed data matrix is a sample of $X_i$. Let N be the number of observed trials; N is a random variable. We assume that the variables that are observed are $X_1, X_2, \ldots, X_N$, where N can have different values for different participants. We assume $N \ge 1$ for all participants. In practical situations, N would also be bounded from above by some fixed number m (in our study, $m = 400$), but there is no mathematical need to assume that here.

We assume CTT for the observable variables: for each $i \in \mathbb{N}$, there are variables $T_i$ and $E_i$ such that for all $i, j, k \in \mathbb{N}$ with $k \ne i$,

(A1) $X_i = T_i + E_i$
(A2) $\mathrm{Cov}(E_i, T_j) = 0$
(A3) $\mathrm{Cov}(E_i, E_k) = 0$

Assumption A3 refers to uncorrelated errors. We need it in some derivations but not in all. We further assume that the expected measurement error does not depend on the number of observations; that is, for all $i, n \in \mathbb{N}$,

(A4) $\mathbb{E}(E_i \mid N = n) = 0$

We further assume that the true scores and error scores remain uncorrelated if one considers only a subpopulation with a fixed number of observations: for all $i, j, n \in \mathbb{N}$,

(A5) $\mathrm{Cov}(E_i, T_j \mid N = n) = 0$

Finally, we assume that the variables $N$, $X_i$, $T_i$, and $E_i$ have finite second moments, both unconditionally and conditionally on N.

3.1.2. Variance Decomposition of Total Scores

Since participants differ in their number of observations, it is convenient to define each participant’s overall test score not as the raw sum score but as the mean of the participant’s available item scores. We therefore define the test (or total) observed score, true score, and error score as

$$X_+ := \sum_{i=1}^{N} X_i / N, \qquad T_+ := \sum_{i=1}^{N} T_i / N, \qquad E_+ := \sum_{i=1}^{N} E_i / N.$$

Then $X_+ = T_+ + E_+$ but, now that the number of summands is variable, it is not obvious whether at the group level we have $\mathrm{Var}(X_+) = \mathrm{Var}(T_+) + \mathrm{Var}(E_+)$. This is what we prove next.

Lemma 1

Assume A1, A2, A4 and A5. Then

Cov E + , T + = 0 . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \textrm{Cov}\!\left( E_{+},T_{+} \right) =0. \end{aligned}$$\end{document}
Proof

By the law of total covariance, we have

Cov E + , T + = E ( Cov E + , T + | N ) + Cov ( E ( E + | N ) , E ( T + | N ) ) . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \textrm{Cov}\!\left( E_{+},T_{+} \right) =\mathbb {E}(\textrm{Cov}\left( E_{+},T_{+}\vert N \right) )+\textrm{Cov}(\mathbb {E}(E_{+}\vert N),\mathbb {E}(T_{+}\vert N)). \end{aligned}$$\end{document}

In the first term on the right, using the property that Cov a Y , b Z = a b Cov ( Y , Z ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textrm{Cov}\!\left( aY,bZ \right) =ab\,\textrm{Cov}(Y,Z)$$\end{document} if Y and Z are random variables and a and b are scalars (here, we use a = b = n - 1 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a=b=n^{-1})$$\end{document} , and assumption A5, we obtain

Cov E + , T + | N = n = i = 1 n j = 1 n n - 2 Cov E i , T j | N = n = 0 . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \textrm{Cov}\!\left( E_{+},T_{+}\vert N=n \right) = \sum \limits _{i=1}^n \sum \limits _{j=1}^n {n^{-2}\textrm{Cov}\left( E_{i},T_{j}\vert N=n \right) } =0. \end{aligned}$$\end{document}

Therefore, we conclude that $\mathbb{E}\left( \textrm{Cov}\left( E_{+},T_{+}\mid N \right) \right) = 0$. In the second term on the right, we have

$$\mathbb{E}\left( E_{+} \mid N=n \right) = \mathbb{E}\left( \sum_{i=1}^{N} E_{i}/n \,\Big|\, N=n \right) = \sum_{i=1}^{n} n^{-1}\,\mathbb{E}\left( E_{i} \mid N=n \right) = 0.$$

Therefore, $\textrm{Cov}\!\left( \mathbb{E}\left( E_{+}\mid N \right), \mathbb{E}\left( T_{+}\mid N \right) \right) = 0$. $\square$

From Lemma 1, it follows immediately that $\textrm{Var}\left( X_{+} \right) = \textrm{Var}\left( T_{+} \right) + \textrm{Var}\left( E_{+} \right)$.
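This additivity is easy to check numerically. The following sketch (all parameters invented for illustration) simulates subjects with variable test length, mean-zero errors within each length group (A4), and errors independent of true scores (A5), and confirms that the observed-score variance decomposes into true-score and error variance up to Monte Carlo error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented simulation: per-subject test length N, true scores T_+, and
# error means E_+ satisfying E(E_i | N) = 0 and independence of T.
n_subj = 200_000
N = rng.choice([5, 10, 20], size=n_subj)             # variable test lengths
T = rng.normal(0.0, 1.0, size=n_subj)                # true scores T_+
E = rng.normal(0.0, 1.0, size=n_subj) / np.sqrt(N)   # mean of N unit-variance errors
X = T + E                                            # observed scores X_+

lhs = X.var()
rhs = T.var() + E.var()
# Var(X_+) = Var(T_+) + Var(E_+), up to sampling noise
assert abs(lhs - rhs) < 0.02
```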

3.1.3. Conditional and Unconditional Reliability

We define reliability generically as the ratio of true score variance to observed score variance. This is consistent with the definitions of many previous authors in CTT (e.g., Cho, 2021; Guttman, 1953; Novick, 1966; Raykov & Marcoulides, 2017). Define the unconditional reliability of the test observed score as

$$\textrm{Rel}\left( X_{+} \right) := \frac{\textrm{Var}(T_{+})}{\textrm{Var}(X_{+})}.$$

We set $\textrm{Rel}\left( X_{+} \right) := 0$ if $\textrm{Var}\left( X_{+} \right) = 0$. We will now stratify the participant population based on N and then consider some parameters defined on the stratification. First, we assume that we can estimate the reliability of the test observed score in the subpopulation where the number of observations equals n. This is the conditional reliability of the test observed score, defined as

$$\rho_{n} := \frac{\textrm{Var}(T_{+}\mid N=n)}{\textrm{Var}(X_{+}\mid N=n)}.$$

We set $\rho_{n} := 0$ if $\textrm{Var}\left( X_{+}\mid N=n \right) = 0$, so that $\textrm{Var}\left( T_{+}\mid N=n \right) = \rho_{n}\,\textrm{Var}(X_{+}\mid N=n)$ in all cases. Furthermore, we write the conditional observed variance as

$$\sigma_{n}^{2} := \textrm{Var}(X_{+}\mid N=n)$$

and the fraction of the subjects with n observations as

$$\pi_{n} := P(N=n).$$

If the number of observed trials is bounded by some $m\in \mathbb{N}$, then we can simply write $\pi_{n}=0$ for $n>m$. We express the unconditional reliability, $\textrm{Rel}\left( X_{+} \right)$, in terms of the conditional reliabilities, $\rho_{n}$. Note that the following result does not require uncorrelated errors.

Theorem 1

Assume A1, A2, A4 and A5. The unconditional reliability of the total observed score $X_{+}$ is then given by

$$\textrm{Rel}\left( X_{+} \right) = 1 - \frac{\sum_{n=1}^{\infty} (1-\rho_{n})\sigma_{n}^{2}\pi_{n}}{\textrm{Var}(X_{+})}.$$
Proof

By the law of total variance, we have

$$\textrm{Var}\left( X_{+} \right) = \mathbb{E}\left( \textrm{Var}\left( X_{+}\mid N \right) \right) + \textrm{Var}\left( \mathbb{E}\left( X_{+}\mid N \right) \right)$$

and

$$\textrm{Var}\left( T_{+} \right) = \mathbb{E}\left( \textrm{Var}\left( T_{+}\mid N \right) \right) + \textrm{Var}\left( \mathbb{E}\left( T_{+}\mid N \right) \right).$$

Assumption A4 implies $\mathbb{E}\left( E_{i}\mid N \right) = 0$. Combining this result with the CTT definition $X_{+} = T_{+} + E_{+}$ and its expectation for subgroups, $\mathbb{E}(X_{+}\mid N) = \mathbb{E}(T_{+}\mid N) + \mathbb{E}(E_{+}\mid N)$, we have $\mathbb{E}\left( X_{+}\mid N \right) = \mathbb{E}\left( T_{+}\mid N \right)$; hence, the variance terms on the right of the two former equations vanish if we subtract $\textrm{Var}\left( T_{+} \right)$ from $\textrm{Var}\left( X_{+} \right)$ to obtain $\textrm{Var}\left( E_{+} \right)$. Therefore,

$$\begin{aligned} \textrm{Var}\left( E_{+} \right) &= \textrm{Var}\left( X_{+} \right) - \textrm{Var}\left( T_{+} \right) \\ &= \mathbb{E}\left( \textrm{Var}\left( X_{+}\mid N \right) \right) - \mathbb{E}\left( \textrm{Var}\left( T_{+}\mid N \right) \right) \\ &= \sum_{n=1}^{\infty} \textrm{Var}(X_{+}\mid N=n)\,\pi_{n} - \sum_{n=1}^{\infty} \textrm{Var}(T_{+}\mid N=n)\,\pi_{n}. \end{aligned}$$

Since $\textrm{Var}\left( T_{+}\mid N=n \right) = \rho_{n}\sigma_{n}^{2}$, we have

$$\begin{aligned} \textrm{Var}\left( E_{+} \right) &= \sum_{n=1}^{\infty} \sigma_{n}^{2}\pi_{n} - \sum_{n=1}^{\infty} \rho_{n}\sigma_{n}^{2}\pi_{n} \\ &= \sum_{n=1}^{\infty} (1-\rho_{n})\sigma_{n}^{2}\pi_{n}. \end{aligned}$$

$\square$

The key principle of Theorem 1 is that, although we should not average the reliability coefficients from different groups, we may average the error variances if the mean error is 0 in each group. This provides a simple estimation method for reliability when different participants have responded to different numbers of items, based on general assumptions.
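As a purely numerical illustration of Theorem 1 (the stratification below is invented), one can compute the unconditional reliability both from its definition, via the law of total variance, and from the theorem's error-variance formula, and verify that the two agree:

```python
import numpy as np

# Hypothetical stratification by test length n (all numbers invented):
pi = np.array([0.2, 0.5, 0.3])       # pi_n = P(N = n)
rho = np.array([0.60, 0.75, 0.85])   # conditional reliabilities rho_n
sigma2 = np.array([1.4, 1.0, 0.8])   # conditional observed variances sigma_n^2
mu = np.array([0.1, 0.0, -0.1])      # E(X_+ | N = n) = E(T_+ | N = n), by A4

# Law of total variance: within-group plus between-group components.
between = pi @ mu**2 - (pi @ mu) ** 2
var_X = pi @ sigma2 + between
var_T = pi @ (rho * sigma2) + between   # conditional true variance is rho_n * sigma_n^2

rel_direct = var_T / var_X                                  # definition of Rel(X_+)
rel_theorem = 1 - (pi @ ((1 - rho) * sigma2)) / var_X       # Theorem 1 formula
assert abs(rel_direct - rel_theorem) < 1e-12
```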

As an aside, one may note the resemblance of the formula in the theorem to the formula underlying stratified alpha. Suppose a test consists of G subtests, each measuring a different aspect of an overarching attribute of greater complexity than each of the aspects it represents, such as intelligence. Let $\sigma_{g}^{2}$ denote the variance of the score on subtest g; $\rho_{g}$ the reliability of subtest g; and $\sigma_{X}^{2}$ the variance of the total sum score across the G subtests; then, the reliability of the total score equals (Lord & Novick, 1968, exercise 4.5; Nunnally, 1978, p. 248)

$$\textrm{Rel}\left( X_{+} \right) = 1 - \frac{\sum_{g=1}^{G} (1-\rho_{g})\sigma_{g}^{2}}{\textrm{Var}(X_{+})}.$$

This is called a stratified reliability coefficient, and it is called stratified alpha if $\rho_{g}$ is replaced by the corresponding coefficient $\alpha_{g}$ of subtest g; other reliability coefficients can be stratified similarly (Ogasawara, 2009). The stratification in our treatment of the Flanker data concerns the participant population rather than the item set; therefore, the two equations are applicable in different situations.

Theorem 1 uses the ‘true’ population values of the conditional reliabilities $\rho_{n}$, but these are usually not known exactly. Any estimation method that produces correct reliability estimates can be used here. We will now describe how, moreover, coefficient alpha can be used to obtain a lower bound to the unconditional reliability. Let $\alpha_{n}$ be the value of coefficient alpha (Cronbach, 1951; Novick & Lewis, 1967; Ten Berge & Sočan, 2004; Sijtsma & Van der Ark, 2020) in the subpopulation with $N=n$. Assuming uncorrelated errors (assumption A3), a standard result is that $\alpha_{n} \le \rho_{n}$, irrespective of the population or any selection thereof. Substitution of $\alpha_{n}$ for $\rho_{n}$ in Theorem 1 yields the following result.

Corollary 1

Assume A1, A2, A3, A4 and A5. Then

$$\textrm{Rel}\left( X_{+} \right) \ge 1 - \frac{\sum_{n=1}^{\infty} \left( 1-\alpha_{n} \right)\sigma_{n}^{2}\pi_{n}}{\textrm{Var}\left( X_{+} \right)}.$$

The proof follows immediately from $\alpha_{n} \le \rho_{n}$. The quantity on the right-hand side may be named length-stratified alpha. Based on Corollary 1, we suggest estimating coefficient alpha in each subgroup with a fixed number of observations, using these values as lower bounds of the conditional reliabilities, and aggregating them into the length-stratified alpha, which may then serve as a lower bound of the unconditional reliability of the total score. The old method of using alpha in this situation was to pick a minimum number of available trials, say $m=12$, and then compute coefficient alpha with m items, thus discarding the available item scores $X_{i}$ with $i>m$ and discarding the participants with $N<m$. For the ensuing coefficient alpha, it would, however, be unclear whether it is greater than, less than, or equal to the unconditional reliability. Our length-stratified alpha has the advantage that all data and all participants are used in the estimation and that the direction of the bias is clear: it yields a lower bound to the unconditional reliability.
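A minimal sketch of this procedure, using simulated parallel items and invented group sizes, might look as follows: compute coefficient alpha, the conditional observed variance, and the group fraction within each test-length group, and aggregate them into length-stratified alpha.

```python
import numpy as np

rng = np.random.default_rng(1)

def cronbach_alpha(scores):
    """Coefficient alpha for a (subjects x items) score matrix."""
    n_items = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1 - item_vars / total_var)

# Simulated ragged design (invented): subjects have N in {4, 6, 8} trials.
groups = {}
for n in (4, 6, 8):
    true = rng.normal(0.0, 1.0, size=(300, 1))        # subject true scores
    groups[n] = true + rng.normal(0.0, 1.2, size=(300, n))  # parallel items

total_subj = sum(len(x) for x in groups.values())
x_plus = np.concatenate([x.mean(axis=1) for x in groups.values()])
var_X = x_plus.var(ddof=1)                            # Var(X_+) over everyone

# Corollary 1: aggregate per-group error-variance terms (1 - alpha_n) sigma_n^2 pi_n.
err = 0.0
for n, x in groups.items():
    alpha_n = cronbach_alpha(x)
    sigma2_n = x.mean(axis=1).var(ddof=1)             # conditional variance of X_+
    pi_n = len(x) / total_subj
    err += (1 - alpha_n) * sigma2_n * pi_n

ls_alpha = 1 - err / var_X                            # length-stratified alpha
```

All data and all participants enter the estimate; no trials are discarded.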

Although alpha has been heavily criticized, we hold the considered opinion that it is appropriate in the present situation. Only if the data are highly multidimensional will coefficient alpha show a large theoretical discrepancy with respect to the true reliability; otherwise, it closely approximates reliability from below (Sijtsma & Pfadt, 2021a). However, if one wants to avoid coefficient alpha, it may be replaced in Corollary 1 by any other lower bound or lower-bound estimate of reliability, such as Guttman's $\lambda_{2}$ (Guttman, 1945).
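For completeness, a sketch of Guttman's $\lambda_{2}$ computed from a covariance matrix (the data below are simulated under invented parameters); $\lambda_{2}$ is never smaller than coefficient alpha, so it can serve as a tighter lower bound in Corollary 1:

```python
import numpy as np

def guttman_lambda2(scores):
    """Guttman's lambda-2 for a (subjects x items) score matrix."""
    k = scores.shape[1]
    c = np.cov(scores, rowvar=False)
    off = c - np.diag(np.diag(c))      # off-diagonal covariances only
    total_var = c.sum()                # variance of the sum score
    return (off.sum() + np.sqrt(k / (k - 1) * (off**2).sum())) / total_var

def cronbach_alpha(scores):
    k = scores.shape[1]
    c = np.cov(scores, rowvar=False)
    return k / (k - 1) * (1 - np.trace(c) / c.sum())

rng = np.random.default_rng(2)
true = rng.normal(size=(500, 1))
data = true + rng.normal(scale=1.0, size=(500, 6))   # invented parallel items

a, l2 = cronbach_alpha(data), guttman_lambda2(data)
assert a <= l2 <= 1.0   # lambda-2 dominates alpha as a lower bound
```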

3.1.4. Simple Formula for Parallel Items

Let us now assume furthermore that the items are parallel. This means they satisfy assumptions A1, A2 and A3, and for all $i,j\in \mathbb{N}$,

(A6) $$T_{i} = T_{j},$$
(A7) $$\textrm{Var}(E_{i}) = \textrm{Var}(E_{j}).$$

These assumptions imply that the items have equal variances and equal correlations. For simplicity, denote $\varepsilon^{2} := \textrm{Var}\left( E_{i} \right)$, $\tau^{2} := \textrm{Var}\left( T_{i} \right)$, $\sigma^{2} := \textrm{Var}\left( X_{i} \right)$, and $\rho = \textrm{Cor}\left( X_{i},X_{j} \right)$. We assume the latter correlation is defined, hence $\sigma^{2}>0$, and then standard CTT results are that $\tau^{2} = \rho\sigma^{2}$ and $\varepsilon^{2} = (1-\rho)\sigma^{2}$. We furthermore assume that the error variances and covariances are independent of the number of items; that is, for all $i,j\in \mathbb{N}$ with $i\ne j$,

(A7a) $$\textrm{Var}(E_{i}\mid N) = \textrm{Var}(E_{i}) \quad \text{and} \quad \textrm{Cov}(E_{i},E_{j}\mid N) = 0.$$

This means that responses of subjects with longer tests are not more or less reliable than responses of other subjects, and errors remain uncorrelated within groups of the same test length.

Theorem 2

Assume A1, A2, A3, A4, A5, A6, A7 and A7a, and $\textrm{Var}\left( X_{+} \right) > 0$. Then

$$\begin{aligned} \textrm{Rel}\left( X_{+} \right) &= 1 - \frac{\mathbb{E}\left( N^{-1} \right)(1-\rho)\sigma^{2}}{\textrm{Var}(X_{+})} \\ &= \frac{\rho}{\rho + \mathbb{E}\left( N^{-1} \right)(1-\rho)} \\ &= \frac{1}{1-\mathbb{E}\left( N^{-1} \right)} - \frac{\mathbb{E}\left( N^{-1} \right)}{1-\mathbb{E}\left( N^{-1} \right)}\,\frac{\sigma^{2}}{\textrm{Var}(X_{+})}. \end{aligned}$$
Proof

By the law of total variance, noting that $\mathbb{E}\left( E_{+}\mid N \right) = 0$ (first step below) and that, by A7 and A7a, irrespective of the number of trials N, $\textrm{Var}\left( E_{+}\mid N \right) = \textrm{Var}\left( N^{-1}\sum E_{i} \mid N \right) = N^{-2}\cdot N\cdot \textrm{Var}\left( E_{i}\mid N \right) = N^{-1}\textrm{Var}(E_{i}) = N^{-1}\varepsilon^{2}$ (second step below), whereas $\textrm{Var}\left( X_{+} \right) > 0$ implies $\sigma^{2} > 0$ so that $\varepsilon^{2} = (1-\rho)\sigma^{2}$ (fourth step below), we can readily derive

$$\begin{aligned} \textrm{Var}\left( E_{+} \right) &= \mathbb{E}\left( \textrm{Var}\left( E_{+}\mid N \right) \right) + \textrm{Var}\left( \mathbb{E}\left( E_{+}\mid N \right) \right) \\ &= \mathbb{E}\left( \textrm{Var}\left( E_{+}\mid N \right) \right) \\ &= \mathbb{E}\left( N^{-1}\varepsilon^{2} \right) \\ &= \mathbb{E}\left( N^{-1} \right)\varepsilon^{2} \\ &= \mathbb{E}\left( N^{-1} \right)(1-\rho)\sigma^{2}. \end{aligned}$$

This yields the first equation in Theorem 2. To obtain the second equation, we take the next steps. Because all item true scores are parallel, $T_{+}$, which is the mean of the item true scores, equals $T_{+} = T_{i}$; hence, $\textrm{Var}\left( T_{+} \right) = \tau^{2} = \rho\sigma^{2}$, and $\textrm{Var}\left( X_{+} \right) = \textrm{Var}\left( T_{+} \right) + \textrm{Var}\left( E_{+} \right) = \rho\sigma^{2} + \mathbb{E}\left( N^{-1} \right)(1-\rho)\sigma^{2}$. This yields the second equation in Theorem 2. To obtain the third and final equation, we notice that $\textrm{Var}\left( X_{+} \right)/\sigma^{2} = \rho + \mathbb{E}\left( N^{-1} \right)(1-\rho)$, and solving for $\rho$ yields

$$\rho =\frac{1}{1-\mathbb{E}(N^{-1})}\times \frac{\textrm{Var}(X_{+})}{\sigma ^{2}}-\frac{\mathbb{E}(N^{-1})}{1-\mathbb{E}(N^{-1})}.$$

Multiplying both sides by $\frac{\sigma ^{2}}{\textrm{Var}(X_{+})}$ yields on the left-hand side

$$\frac{\rho \sigma ^{2}}{\textrm{Var}(X_{+})}=\frac{\textrm{Var}(T_{+})}{\textrm{Var}(X_{+})}=\textrm{Rel}(X_{+}),$$

so that we obtain

$$\textrm{Rel}\left( X_{+} \right) =\frac{1}{1-\mathbb{E}(N^{-1})}-\frac{\mathbb{E}\left( N^{-1} \right) }{1-\mathbb{E}\left( N^{-1} \right) }\frac{\sigma ^{2}}{\textrm{Var}(X_{+})}.$$

This is the third equation in Theorem 2. $\square$
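As a sketch of how the third equation of Theorem 2 might be applied in practice, the following Python function (the function and argument names are our own, hypothetical choices; the parallel-items model of Theorem 2 is assumed) computes $\textrm{Rel}(X_{+})$ from sample estimates of the three quantities involved:

```python
def unconditional_reliability(mean_inv_n, sigma2, var_x_plus):
    """Rel(X+) from the third equation of Theorem 2, assuming parallel items.

    mean_inv_n : estimate of E(N^{-1}), the mean of 1/N over subjects
    sigma2     : estimate of the variance of a single item score
    var_x_plus : estimate of the variance of the mean score X+
    """
    return (1.0 / (1.0 - mean_inv_n)
            - (mean_inv_n / (1.0 - mean_inv_n)) * sigma2 / var_x_plus)
```

With the estimates from the computational example given later in this section (0.2357, 127.0444, and 52.7442), the function returns approximately 0.566; as $\mathbb{E}(N^{-1})\rightarrow 0$ (very long tests), the expression tends to 1.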

Corollary 2

Under the circumstances of Theorem 2,

  a. A sample estimate of the unconditional reliability can be computed from estimates of $\mathbb{E}\left( N^{-1} \right)$, $\sigma ^{2}$, and $\textrm{Var}(X_{+})$.

  b. If we write the harmonic mean of $N$ as $H=1/\mathbb{E}\left( N^{-1} \right)$, then

    $$\textrm{Rel}\left( X_{+} \right) =\frac{H\rho }{1+(H-1)\rho }.$$

The second part of Corollary 2 says that we can generalize the Spearman–Brown formula to a situation with variable test lengths by substituting the harmonic mean of the test lengths.
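This generalized Spearman–Brown step-up can be sketched in a few lines (the function name is hypothetical); it computes the reliability of the mean score from the single-trial reliability $\rho$ and the observed per-subject test lengths:

```python
def spearman_brown_harmonic(rho, lengths):
    """Step up the single-trial reliability rho to the reliability of the
    mean score, using the harmonic mean H of the per-subject test lengths
    (second part of Corollary 2)."""
    h = len(lengths) / sum(1.0 / n for n in lengths)  # harmonic mean of N
    return h * rho / (1.0 + (h - 1.0) * rho)
```

When all lengths equal a constant $k$, this reduces to the classical Spearman–Brown formula; because the harmonic mean never exceeds the arithmetic mean, substituting the arithmetic mean instead would overstate the reliability.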

As a generalization of the results obtained thus far, we mention the possibility to include subgroupings of the population replacing subgroupings based on the number of items that elicited ERPs or combining the two subgrouping variables. Reliability results are largely like the results obtained thus far. Definitions and proofs are provided in Supplementary Material C.

3.1.5. Examples

Example of Theorem 1. This example uses the ERN data obtained from 158 participants. We consider a total of 400 letter series presentations, making no distinction between the four different letter series (100 presentations each). Together, these participants realized 50 different values of the number of trials, $N=n$, running from 0 trials (8 participants, the highest frequency for any of the 50 values of $n$) through 122 trials (1 participant, the lowest frequency, also realized with 15 different values of $n$). Participants with 0 realized trials (8 participants) or 1 realized trial (7 participants) are not useful because the conditional alpha is undefined in these groups. Groups defined by $N=n$ with one participant cannot be used either, because the conditional sample variance is undefined in such groups. Note that with the method used in the proof of the lemma, the results still hold if stratification is done with groups that combine groups of the form $[N=n]$.
Therefore, we used deciles of the subjects with $N\ge 2$; see Table 2. In each group, there is a range in the number of available trials. For example, in the first decile $N$ ranged from 2 to 4, and in the second decile $N$ ranged from 5 to 7. The conditional alphas were computed with only the trial numbers on which all participants in the group had a score, that is, the minimum number of trials in that group. For example, in the first decile group, $\alpha _{n}$ was computed with two items, even though some subjects had four items, and in the second decile group, $\alpha _{n}$ was computed with five items, even though some participants had seven items.
In total, 2509 out of 3011 observations were used (83%), and this percentage can grow to 100% if more participants are added to the sample, such that each group of the form $[N=n]$ is large enough to estimate $\alpha _{n}$ without combining groups. In contrast, only 31% of the observations would be used if a single coefficient alpha were computed for the first eight items, and 42% would be used if alpha were computed with the number of items that utilizes the largest percentage of the data, which is 18 items.
The table further lists the estimates of $\pi _{n}$, $\alpha _{n}$, $\sigma _{n}^{2}$, and $\left( 1-\alpha _{n} \right) \sigma _{n}^{2}\pi _{n}$ per decile group. The sum of the estimates of $\left( 1-\alpha _{n} \right) \sigma _{n}^{2}\pi _{n}$ is 15.986, and the sample variance of $X_{+}$ in the entire group is 47.390. Therefore, the estimated value of length-stratified alpha is

$$1-\frac{\sum _{n=1}^{\infty }\left( 1-\alpha _{n} \right) \sigma _{n}^{2}\pi _{n}}{\textrm{Var}\left( X_{+} \right) }\approx 1-\frac{15.986}{47.390}=0.663.$$

It should be noted that each group has fewer than 20 participants and that each $\alpha _{n}$ may have a large standard error. Nevertheless, the total estimate of length-stratified alpha might have an acceptable standard error, because it is based on a weighted average of the $\alpha _{n}$s. For example, simulation of 1000 samples with parallel items with reliability 0.1887 each, using the test lengths in the column “Number of Items Used” of Table 2, with 13 participants per group, showed that a mean length-stratified alpha of 0.656 had a bias of only $-0.007$ (compared to the outcome 0.663 in a single simulation with subgroups of $10^{6}$ participants) and a standard error of 0.074. In comparison, with the same parameters, a single sample of 130 subjects with 9 items would have an alpha with standard error 0.042. Thus, stratification increased the standard error—as usual—but the effect may be modest.
An extensive study of the standard error of length-stratified alpha would be interesting but is beyond the scope of this article.
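The estimate of length-stratified alpha from per-group statistics such as those in Table 2 can be sketched as follows (the function name is hypothetical; the `groups` argument carries the columns of such a table):

```python
def length_stratified_alpha(groups, var_x_plus):
    """Length-stratified alpha: 1 - sum_n (1 - alpha_n) * sigma2_n * pi_n / Var(X+).

    groups     : iterable of (pi_n, alpha_n, sigma2_n) tuples, one per
                 test-length group, where pi_n is the proportion of subjects,
                 alpha_n the conditional coefficient alpha, and sigma2_n the
                 conditional variance of the score in that group
    var_x_plus : sample variance of X+ in the entire group
    """
    error_part = sum(pi * (1.0 - alpha) * sigma2 for pi, alpha, sigma2 in groups)
    return 1.0 - error_part / var_x_plus
```

With groups whose weighted error terms sum to 15.986 and $\textrm{Var}(X_{+}) = 47.390$, this reproduces the value 0.663 reported above.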

Table 2 Statistics For The Computation of Length-Stratified Alpha

Note. $N=$ number of items; $\pi _{n}=$ number of participants in decile / total number of participants (143); $\alpha _{n}=$ coefficient alpha in decile; $\sigma _{n}^{2}=$ variance of sum score in decile group.

Computational Example of Theorem 2. We start with a computational example, using only a small subsample of persons to clarify the steps needed. Table 3 shows the ERN scores of seven participants and the variables $X_{+}$, $N$, and $1/N$ derived from the data. The sample variance of all the raw ERP scores in Table 3 ($-7.32$ to $-11.73$, spanning six columns) is 127.0444, which we use as an estimate of $\sigma ^{2}$; the sample variance of $X_{+}$ is 52.7442, and the sample mean of $1/N$ is 0.2357. If we substitute these in the last equation of Theorem 2, thus assuming parallel trials, we obtain

$$\textrm{Rel}\left( X_{+} \right) \approx \frac{1}{1-0.2357}-\frac{0.2357}{1-0.2357}\,\frac{127.0444}{52.7442}=0.566.$$

Using the sample variance of subjects with different $N$ as an estimate of $\sigma ^{2}$ is justified because each column is assumed to have the same expectation and variance, as we assume parallel items and scores independent of $N$.
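The full chain of computations for a small ragged data set of this kind can be sketched as follows. The scores below are made-up illustrations, not the actual Table 3 values; only the steps (pooled item variance, per-subject mean $X_{+}$, and the mean of $1/N$) mirror the example:

```python
# Hypothetical ragged ERP scores, one list per participant (trial counts vary).
scores = [
    [-7.3, -2.1, -9.8],
    [-4.0, -6.5],
    [-12.1, -3.3, -8.0, -5.5],
    [-9.0, -1.2],
    [-6.8, -7.7, -2.9],
]

# Estimate of sigma^2: sample variance of all raw scores, pooled over subjects.
all_obs = [x for row in scores for x in row]
mean_all = sum(all_obs) / len(all_obs)
sigma2 = sum((x - mean_all) ** 2 for x in all_obs) / (len(all_obs) - 1)

# Estimate of Var(X+): sample variance of the per-subject mean scores.
x_plus = [sum(row) / len(row) for row in scores]
m = sum(x_plus) / len(x_plus)
var_x = sum((x - m) ** 2 for x in x_plus) / (len(x_plus) - 1)

# Estimate of E(N^{-1}): sample mean of 1/N over subjects.
mean_inv_n = sum(1.0 / len(row) for row in scores) / len(scores)

# Last equation of Theorem 2 (parallel trials assumed).
rel = (1.0 / (1.0 - mean_inv_n)
       - (mean_inv_n / (1.0 - mean_inv_n)) * sigma2 / var_x)
```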

Table 3 Computational Example for Seven Participants (out of 143)

Real Data Example of Theorem 2. Now consider the entire sample of the data from 150 participants with one or more scores; the computations are similar. Parameter $\sigma ^{2}$ was estimated as the sample variance of the whole data set, regardless of the subject and the trial. $\textrm{Var}(X_{+})$ was estimated as the sample variance of $X_{+}$, and $\mathbb{E}\left( N^{-1} \right)$ as the sample mean of $1/N$. The data of all subjects with $N>0$ were used in all these estimates. The estimates are reported in Table 4. In conclusion, the reliability of the total score is estimated at 0.559.

Table 4 Estimates Needed for Computing the Unconditional Reliability

As Theorem 2 is based on the assumption of parallel items, one would need to check this assumption. Supplementary Material B illustrates some visual inspections that may be relevant to this. Note, however, that if the items are parallel, the estimates based on Theorems 1 and 2 estimate the same parameter, provided that they are computed on the same data. When we applied Theorem 1 in Table 2, we used a subset of the data, and this yielded the estimate 0.663. If we use the same subset of data to estimate the reliability with Theorem 2, we obtain 0.614; the difference between the two estimates is 0.049. The size of the difference may be viewed as an indication of the extent to which the assumption of parallel items is violated. Simulations of parallel items with normally distributed true scores and error scores and reliability 0.1887 (needed to reproduce the length-stratified alpha of 0.663) suggest that this difference falls between the 98th and 99th percentiles of the sampling distribution. Although the difference between the two estimates seems significant, indicating a violation of the assumption of parallel items, the effect of the violation on the reliability estimate is modest.

A reason why the reliability is relatively low is that a small value of $N$ has a large effect on $\mathbb{E}\left( N^{-1} \right)$. Therefore, if the reliability is small, we recommend revising the data collection such that each subject has a certain minimum number of valid scores. For example, in a Flanker task one could consider decreasing the allotted time for answering, which would increase the number of errors.
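A small numerical sketch shows how strongly a single subject with few valid trials inflates $\mathbb{E}(N^{-1})$, and hence deflates the harmonic mean $H$ and the stepped-up reliability. The trial counts below are hypothetical:

```python
def mean_inverse(trial_counts):
    """Sample estimate of E(N^{-1}) from the per-subject trial counts."""
    return sum(1.0 / n for n in trial_counts) / len(trial_counts)

nine_subjects = [30] * 9        # nine subjects with 30 valid trials each
with_outlier = [30] * 9 + [2]   # the same nine plus one subject with only 2 trials

# Adding the single low-N subject raises E(N^{-1}) from 1/30 to 0.08,
# dropping the harmonic mean H = 1/E(N^{-1}) from 30 to 12.5.
```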

Comparison of Various Methods With Real Data. One may be interested in a comparison of our outcomes with the outcomes of preexisting methods when applied to the ERN data. We consider (1) various versions of coefficient alpha, (2) split-half reliabilities, and (3) variance components. For a fair comparison, we use only the data with $N\ge 2$. Recall that our first method, length-stratified alpha, yielded 0.663 as a lower bound and utilized 83% of the data, and our second method, assuming parallel items, yielded the estimate 0.559 based on 100% of the data.

  1. We computed coefficient alpha for the first eight items with all participants who had eight or more items. The outcome was 0.487, based on 117 participants, so that the computations use 117 × 8 / 3011 = 31% of the data. The arithmetic mean of the number of observations (confined to $N\ge 2$) was 20.99, and when we computed coefficient alpha for the first 21 items with all participants who had 21 or more items, the outcome was 0.695, based on 56 participants, which utilized 56 × 21 / 3011 = 39% of the data. Because Corollary 2 uses the harmonic mean, we repeat this computation with the harmonic mean.
The harmonic mean of the number of observations (confined to $N\ge 2$) was 10.3172, and when we computed coefficient alpha for the first 10 items with all participants who had 10 or more items, the outcome was 0.596, based on 106 participants, which utilized 106 × 10 / 3011 = 35% of the data. Our lower bound 0.663, based on Corollary 1, has the advantage that it also uses data from subjects with fewer than 8, 10, or 21 observations.

  2. The correlation between the mean of the first half and the mean of the second half of the scores was 0.496, yielding a split-half reliability of 0.663. When the halves were randomly selected, the mean split-half reliability over 1000 independent draws was 0.665 with a standard deviation of 0.039. This computation utilizes 100% of the data and is therefore not entirely comparable with length-stratified alpha, which used 83% of the data. If the same 83% of the data is used to compute the split-half reliabilities, after 1000 draws the split-half reliabilities had a mean of 0.671 with a standard deviation of 0.024.

  3. In a variance components model, the restricted maximum likelihood estimates for the variance components of participants, items, and interaction + error were 30.575, 0.138, and 151.965, respectively. Clayson et al. (2021, p. 183) recommended computing the stepped-up coefficient with the arithmetic mean, but our analysis shows that the harmonic mean should be used (a further explanation is found after Eq. 4). Using the arithmetic mean of the number of observations (21), the estimated reliability is $30.575/(30.575+151.965/21)=0.809$. Using the harmonic mean (10), the estimated reliability is $30.575/(30.575+151.965/10)=0.668$.
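Two of the comparison computations above can be reproduced in a few lines (both function names are our own, hypothetical choices):

```python
def split_half_reliability(r_halves):
    """Spearman-Brown step-up of the correlation between the two half-test means."""
    return 2.0 * r_halves / (1.0 + r_halves)

def stepped_up(var_person, var_residual, n_trials):
    """Stepped-up coefficient from variance components: person variance over
    person variance plus residual variance divided by the number of trials."""
    return var_person / (var_person + var_residual / n_trials)
```

For the ERN data, `split_half_reliability(0.496)` reproduces the split-half value 0.663, while `stepped_up(30.575, 151.965, 21)` gives 0.809 with the arithmetic mean and `stepped_up(30.575, 151.965, 10)` gives 0.668 with the harmonic mean.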

3.2. Correlation with a Second Test Administration

For a fixed number of items across participants, the CTT reliability of the test score $X_{+}$ equals the correlation of the test with a parallel test. The idea is that if one could replicate the test administration under similar circumstances, the reliability tells us what the correlation between the first and second test would be (in the context of this article, the terms 'test' and 'test administration' are used interchangeably). Although in practice parallel tests are (nearly) impossible to obtain, we consider the theoretical question of what happens to this result if the number of items is allowed to vary across participants, as is typical of ERPs obtained using the Flanker task. We will show that if the items within the first test are not parallel, then even if the items of the second test are one-by-one parallel with the items of the first test whenever both items are administered, a change in the number of items in the second test relative to the first implies that subjects can have a different true score $T_{+}$ on the second test. Thus, even if the items of the two tests are one-by-one parallel, the test scores would not be parallel, and therefore there is no reason to expect that the correlation of the two test scores equals their reliability. We study this next in more detail. In doing so, we assume in the mathematical development that the series of items in both tests are infinitely long, irrespective of whether they have really been observed.

Let the items of the second test be denoted by $X'_{i}$, $T'_{i}$, and $E'_{i}$, and the number of items of the second test by $N'$. We assume for all $i,j\in \mathbb{N}$,

(A8) $$X'_{i} = T'_{i} + E'_{i}$$
(A9) $$\textrm{Cov}(E'_{i},\,T'_{j}) = 0$$

We use the following assumptions. First, the items of the two tests are one-by-one parallel, that is, for all items $i \in \mathbb{N}$,

(A10) $T_{i} = T'_{i}$;
(A11) $\mathrm{Var}(E_{i}) = \mathrm{Var}(E'_{i})$;
(A12) $\mathrm{Cov}(E_{i},\, E'_{i}) = 0$.

Note that the definition requires A8–A10 for all $i \in \mathbb{N}$, even though only $N$ items are observed in the first test and only $N'$ in the second test, where $N \ne N'$ in general. The assumptions state that the equalities hold if the variables involved are observed, but they do not imply that all of these variables are actually observed. This setup is comparable to mathematical statistics, where an infinite sequence of random variables is used to obtain the central limit theorem, even though any real sample includes only finitely many of these random variables.

We do not need to assume that the errors within a test administration are uncorrelated, but we do assume that the error correlations are the same in both test administrations: for all $i, j \in \mathbb{N}$,

(A13) $\mathrm{Cov}(E_{i},\, E_{j}) = \mathrm{Cov}(E'_{i},\, E'_{j})$.

Finally, we assume that

(A14) $(N, N')$ is independent of all $T_{i}$, $E_{i}$, $T'_{i}$, and $E'_{i}$ jointly;
(A15) $N$ and $N'$ have the same probability distribution.

The test scores on the second test are defined as

$$X'_{+} := \sum_{i=1}^{N'} X'_{i}/N', \qquad T'_{+} := \sum_{i=1}^{N'} T'_{i}/N', \qquad E'_{+} := \sum_{i=1}^{N'} E'_{i}/N'.$$

To arrive at reliability based on one administration, we focus on two items from the same test and denote the correlation between $X_{i}$ and $X_{j}$ by $\rho_{ij}$. Further, we denote the correlation between $X_{i}$ and $X'_{j}$ by $\rho'_{ij}$. If the items are one-by-one parallel, then we have

$$\rho_{ij} = \rho'_{ij} \quad (i \ne j), \qquad \mathrm{Var}(X_{i}) = \mathrm{Var}(X'_{i}),$$

and $\rho'_{ii}$ is the reliability of $X_{i}$. The average covariance between the first $n$ items of the first test and the first $m$ items of the second test is

$$\bar{C}_{nm} := \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} \rho'_{ij}\sqrt{\mathrm{Var}(X_{i})\,\mathrm{Var}(X'_{j})}.$$

To express this in terms of parameters of the first test administration only, write $\rho_{ij}^{*} := \rho_{ij}$ if $i \ne j$ and $\rho_{ii}^{*} := \rho'_{ii}$, and let

$$\bar{C}_{nm}^{*} := \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} \rho_{ij}^{*}\sqrt{\mathrm{Var}(X_{i})\,\mathrm{Var}(X_{j})}.$$

If the items are one-by-one parallel, then $\bar{C}_{nm}^{*} = \bar{C}_{nm}$, but the point of $\bar{C}_{nm}^{*}$ is that it is defined entirely in terms of parameters of the first test. Let

$$\pi_{nm} = P(N = n, N' = m).$$
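To make the construction concrete, the following sketch computes $\bar{C}_{nm}^{*}$ from first-test parameters only: off-diagonal entries of the item covariance matrix are used as is (since $\rho_{ij}^{*}\sqrt{\mathrm{Var}(X_{i})\mathrm{Var}(X_{j})} = \mathrm{Cov}(X_{i}, X_{j})$ for $i \ne j$), while diagonal entries are replaced by $\rho'_{ii}\,\mathrm{Var}(X_{i})$. The covariance matrix and item reliabilities below are hypothetical numbers chosen for illustration.

```python
import numpy as np

def c_bar_star(cov_X, item_rel, n, m):
    """C*_nm: average of rho*_ij * sqrt(Var(X_i) Var(X_j)) over the first
    n rows and m columns. Off-diagonal entries equal the observed item
    covariances of the first test; diagonal entries are replaced by
    rho'_ii * Var(X_i), i.e., item reliability times item variance."""
    adj = np.asarray(cov_X, dtype=float).copy()
    np.fill_diagonal(adj, np.asarray(item_rel, dtype=float) * np.diag(adj))
    return adj[:n, :m].mean()

# Toy example (hypothetical): three items with unit variances, common
# inter-item covariance 0.4, and item reliabilities 0.6.
cov_X = np.full((3, 3), 0.4)
np.fill_diagonal(cov_X, 1.0)
print(c_bar_star(cov_X, [0.6, 0.6, 0.6], 2, 3))
```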

We can now formulate Theorem 3, which expresses the correlation between the two test scores as a function of the parameters of the first test and the joint distribution of the test lengths $(\pi_{nm})$. The parameters involved include the item reliabilities; we do not offer a method to estimate them, but this is irrelevant to the conclusion we will draw from the theorem.

Theorem 3

Assume A1, A2, and A8–A15. The correlation between $X_{+}$ and $X'_{+}$ is

$$\mathrm{Cor}(X_{+}, X'_{+}) = \frac{1}{\mathrm{Var}(X_{+})}\sum_{n=1}^{\infty}\sum_{m=1}^{\infty} \bar{C}_{nm}^{*}\,\pi_{nm}.$$

Proof

By the law of total covariance,

$$\mathrm{Cov}(X_{+}, X'_{+}) = \mathbb{E}\left(\mathrm{Cov}(X_{+}, X'_{+} \mid N, N')\right) + \mathrm{Cov}\left(\mathbb{E}(X_{+} \mid N, N'),\, \mathbb{E}(X'_{+} \mid N, N')\right).$$

Because $(N, N')$ is independent of $X_{+}$ (a consequence of A14), $\mathbb{E}(X_{+} \mid N, N') = \mathbb{E}(X_{+})$, and therefore $\mathrm{Cov}\left(\mathbb{E}(X_{+} \mid N, N'),\, \mathbb{E}(X'_{+} \mid N, N')\right) = 0$. Therefore,

$$\mathrm{Cov}(X_{+}, X'_{+}) = \mathbb{E}\left(\mathrm{Cov}(X_{+}, X'_{+} \mid N, N')\right).$$

Now

$$\mathrm{Cov}(X_{+}, X'_{+} \mid N, N') = \frac{1}{NN'}\sum_{i=1}^{N}\sum_{j=1}^{N'} \mathrm{Cov}(X_{i}, X'_{j} \mid N, N').$$

Because $(N, N')$ is independent of $(X_{i}, X'_{j})$, $\mathrm{Cov}(X_{i}, X'_{j} \mid N, N') = \mathrm{Cov}(X_{i}, X'_{j})$. If $i \ne j$, then $\mathrm{Cov}(X_{i}, X'_{j}) = \mathrm{Cov}(X_{i}, X_{j})$ because parallel tests have equal error correlations (A13). If $i = j$, then $\mathrm{Cov}(X_{i}, X'_{i}) = \rho'_{ii}\,\mathrm{Var}(X_{i})$, because $X_{i}$ and $X'_{i}$ are parallel. Note that, by parallelism, $\mathrm{Var}(X_{i}) = \mathrm{Var}(X'_{i})$. In sum, $\mathrm{Cov}(X_{i}, X'_{j} \mid N, N') = \rho_{ij}^{*}\sqrt{\mathrm{Var}(X_{i})\,\mathrm{Var}(X_{j})}$ for all $i, j \in \mathbb{N}$. Therefore, $\mathrm{Cov}(X_{+}, X'_{+} \mid N, N') = \bar{C}_{NN'}^{*}$, yielding

$$\mathrm{Cov}(X_{+}, X'_{+}) = \mathbb{E}\left(\bar{C}_{NN'}^{*}\right) = \sum_{n=1}^{\infty}\sum_{m=1}^{\infty} \bar{C}_{nm}^{*}\,\pi_{nm}.$$

Finally, because the two tests have one-by-one parallel items and $N$ and $N'$ have the same probability distribution, $\mathrm{Var}(X_{+}) = \mathrm{Var}(X'_{+})$. $\square$

Note that the correlation between $N$ and $N'$ will affect the $\pi_{nm}$, and if the items within a test are not parallel, this will generally affect the outcome. The correlation between $N$ and $N'$ does not affect $\mathrm{Rel}(X_{+})$ as defined earlier, however. We therefore conclude from this theorem that the correlation between the total scores of two tests with variable lengths will generally not equal the reliability of the total score, even if the items of the two tests are one-by-one parallel and the two tests have identical distributions of test lengths. One way to understand this result is that the variable test length acts as a source of variation that is not included in the definition of reliability. We will analyze this situation in the next section using GT.
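This conclusion can be checked numerically. The sketch below (the loadings, error variances, and length distribution are hypothetical choices, not quantities from the paper) simulates two administrations whose items are one-by-one parallel across tests but not parallel within a test, with independent, identically distributed test lengths. The simulated retest correlation matches the right-hand side of Theorem 3 and falls below the reliability, computed here as $\mathrm{Var}(T_{+})/\mathrm{Var}(X_{+})$.

```python
import numpy as np

rng = np.random.default_rng(1)
S = 200_000                               # simulated subjects
lam = np.array([0.5, 1.0, 1.5, 2.0])      # item loadings: items NOT parallel within a test
err_var = 1.0                             # common error variance (hypothetical)
lengths = np.array([2, 3, 4])
probs = np.array([0.3, 0.4, 0.3])         # distribution of N (and of N', independently)
max_items = lam.size

theta = rng.standard_normal(S)
T = np.outer(theta, lam)                  # T_i = lam_i * theta, shared by both tests (A10)
X1 = T + rng.normal(0.0, np.sqrt(err_var), (S, max_items))
X2 = T + rng.normal(0.0, np.sqrt(err_var), (S, max_items))
N1 = rng.choice(lengths, size=S, p=probs)
N2 = rng.choice(lengths, size=S, p=probs)

def mean_score(X, N):
    # X_+ = sum_{i<=N} X_i / N, with N varying per subject
    mask = np.arange(X.shape[1]) < N[:, None]
    return (X * mask).sum(axis=1) / N

emp = np.corrcoef(mean_score(X1, N1), mean_score(X2, N2))[0, 1]

# Population values: C*_nm reduces to the true-score covariance matrix here,
# because errors are independent within and between administrations.
covT = np.outer(lam, lam)                 # Cov(T_i, T_j), Var(theta) = 1
covX = covT + err_var * np.eye(max_items)
theo_num = sum(pn * pm * covT[:n, :m].mean()
               for n, pn in zip(lengths, probs)
               for m, pm in zip(lengths, probs))
var_xplus = sum(pn * covX[:n, :n].mean() for n, pn in zip(lengths, probs))
theo = theo_num / var_xplus               # right-hand side of Theorem 3
rel = sum(pn * covT[:n, :n].mean() for n, pn in zip(lengths, probs)) / var_xplus
print(emp, theo, rel)                     # emp ~ theo, and both are below rel
```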

If the items within a test are parallel, the situation simplifies considerably. In that case $\bar{C}_{nm} = \rho\sigma^{2}$, so that $\sum_{n=1}^{\infty}\sum_{m=1}^{\infty} \bar{C}_{nm}\pi_{nm} = \rho\sigma^{2}$, regardless of the distribution $\pi_{nm}$ and hence regardless of the correlation between $N$ and $N'$. Furthermore, we saw earlier that $\mathrm{Var}(X_{+}) = \rho\sigma^{2} + \mathbb{E}(N^{-1})(1-\rho)\sigma^{2}$, which implies that we do not even need $N$ and $N'$ to have the same distribution; it is sufficient that $\mathbb{E}(1/N) = \mathbb{E}(1/N')$.

Corollary 3

Assume A1, A2 (basic CTT), A6, A7 (the items of the first test are parallel), and A8–A14 (the items of the second test are one-by-one parallel with the items of the first test, with equal error correlations, and the true- and error-score variables are independent of $(N, N')$). If $\mathbb{E}(1/N) = \mathbb{E}(1/N')$, then $\mathrm{Cor}(X_{+}, X'_{+}) = \mathrm{Rel}(X_{+})$.
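Corollary 3 can likewise be illustrated by simulation. In the sketch below (the values of $\rho$, $\sigma^{2}$, and the length distribution are hypothetical), all items within and between tests are parallel, and the empirical retest correlation approaches $\mathrm{Rel}(X_{+}) = \rho\sigma^{2}/\left(\rho\sigma^{2} + \mathbb{E}(1/N)(1-\rho)\sigma^{2}\right)$.

```python
import numpy as np

rng = np.random.default_rng(7)
S = 200_000                      # simulated subjects
rho, sigma2 = 0.3, 1.0           # item reliability and item variance (hypothetical)
lengths = np.array([2, 3, 4])
probs = np.array([0.3, 0.4, 0.3])
max_items = int(lengths.max())

T = rng.normal(0.0, np.sqrt(rho * sigma2), S)   # true score, shared by all parallel items

def administer(N):
    # One administration: X_i = T + E_i, averaged over the first N items.
    E = rng.normal(0.0, np.sqrt((1.0 - rho) * sigma2), (S, max_items))
    X = T[:, None] + E
    mask = np.arange(max_items) < N[:, None]
    return (X * mask).sum(axis=1) / N

N1 = rng.choice(lengths, size=S, p=probs)
N2 = rng.choice(lengths, size=S, p=probs)
emp = np.corrcoef(administer(N1), administer(N2))[0, 1]

E_inv_N = (probs / lengths).sum()               # E(1/N)
rel = rho * sigma2 / (rho * sigma2 + E_inv_N * (1.0 - rho) * sigma2)
print(emp, rel)                                 # emp ~ rel, as Corollary 3 states
```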

3.2.1. Integration of Reliabilities of Different Stimulus Types

This section briefly discusses how the above methods can be applied when the ERPs are obtained from different stimulus types, such as 'SSSSS', 'SSHSS', 'HHSHH', and 'HHHHH' in the Flanker Task. In such cases, one may consider it implausible that ERPs from different stimulus types are parallel. Nevertheless, the method of Corollary 1 can still be used because it does not require parallel items. If the items within each stimulus type are parallel while items from different stimulus types are not, a better estimate can be obtained as follows: (1) estimate the reliabilities within each stimulus type using the methods of Theorem 2, and (2) combine the resulting reliabilities with the formula for stratified reliability of composite tests, $1 - \sum_{g=1}^{G}\sigma_{g}^{2}\left(1-\rho_{g}\right)/\mathrm{Var}(X_{+})$, discussed after Theorem 1.
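As a small sketch of step (2), the stratified-reliability formula can be coded directly. The stratum variances, stratum reliabilities, and composite variance below are hypothetical inputs; note that $\mathrm{Var}(X_{+})$ must be supplied separately, because it also contains the covariances between strata.

```python
def stratified_reliability(strat_var, strat_rel, total_var):
    """Stratified reliability of a composite:
    1 - sum_g sigma_g^2 * (1 - rho_g) / Var(X_+),
    where sigma_g^2 is the variance of stratum g's subscore, rho_g its
    reliability, and total_var the variance of the composite score."""
    unreliable = sum(v * (1.0 - r) for v, r in zip(strat_var, strat_rel))
    return 1.0 - unreliable / total_var

# Hypothetical example: two stimulus-type strata with subscore variances
# 1.0 and 1.5, reliabilities 0.7 and 0.8, and composite variance 3.0.
print(stratified_reliability([1.0, 1.5], [0.7, 0.8], 3.0))
```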

4. Comparison with Generalizability Theory Approaches

Several authors have adopted GT for ERP scores. Baldwin et al. (2015) and Clayson and Miller (2017a, 2017b) described a model with persons and trials as random factors. They included diagnostic category as a fixed factor, with persons nested within diagnostic categories such as anxiety disorder and major depressive disorder. The authors estimated the generalizability coefficients in each diagnostic category separately, so for the present discussion it suffices to consider only one diagnostic category and omit diagnostic category as a factor. In addition, Clayson et al. (2021) described a model that includes persons, trials, and occasions as factors. Within a single diagnostic group and with data from only a single occasion, the model these authors proposed also includes only persons and trials as random factors.

Because, in contrast with CTT, the trial (or item) is now considered a random factor, we slightly change the notation and write the score of participant $p$ on trial $i$ as $X(p,i)$. The model with participant effects $(\tau_{p})$, trial effects $(\beta_{i})$, interaction effects $(\gamma_{pi})$, and a residual $(\varepsilon_{pi})$ can be written as

$$X(p,i) = \mu + \tau_{p} + \beta_{i} + \gamma_{pi} + \varepsilon_{pi}.$$
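As a frequentist sketch of this decomposition (not the Bayesian estimation the cited authors recommend; all simulated component values are hypothetical), the variance components of the fully crossed persons $\times$ trials design can be estimated from the classical mean squares. With one observation per person-trial cell, $\gamma_{pi}$ and $\varepsilon_{pi}$ are confounded, so only $\sigma^{2}(\gamma + \varepsilon)$ is identified.

```python
import numpy as np

def variance_components(X):
    """Method-of-moments (expected mean squares) estimates for a fully
    crossed persons x trials design with one observation per cell.
    Returns estimates of sigma^2(tau), sigma^2(beta), sigma^2(gamma+eps)."""
    P, I = X.shape
    gm = X.mean()
    pm = X.mean(axis=1)                       # person means
    im = X.mean(axis=0)                       # trial means
    ms_p = I * ((pm - gm) ** 2).sum() / (P - 1)
    ms_i = P * ((im - gm) ** 2).sum() / (I - 1)
    resid = X - pm[:, None] - im[None, :] + gm
    ms_res = (resid ** 2).sum() / ((P - 1) * (I - 1))
    return (ms_p - ms_res) / I, (ms_i - ms_res) / P, ms_res

# Recovery check on simulated data with known (hypothetical) components:
# sigma^2(tau) = 1.0, sigma^2(beta) = 0.5, sigma^2(gamma+eps) = 2.0.
rng = np.random.default_rng(3)
P, I = 2000, 200
X = (5.0
     + rng.normal(0.0, 1.0, (P, 1))            # tau_p
     + rng.normal(0.0, np.sqrt(0.5), (1, I))   # beta_i
     + rng.normal(0.0, np.sqrt(2.0), (P, I)))  # gamma_pi + eps_pi
vt, vb, vge = variance_components(X)
print(vt, vb, vge)
```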

Various methods exist for estimating the variance components corresponding to $\tau_{p}$, $\beta_{i}$, and $\gamma_{pi} + \varepsilon_{pi}$ (with a single observation per person-trial combination, $\gamma_{pi}$ and $\varepsilon_{pi}$ cannot be separated). Clayson et al. (2021) recommended Bayesian hierarchical models. Denote the variance components by $\sigma^{2}(\tau)$, and so on. The authors defined the dependability coefficient for subjects with $n$ trials as

(1) $$\text{Dep}\left( X_{+}, n \right) = \frac{\sigma^2(\tau)}{\sigma^2(\tau) + \frac{1}{n}\left[\sigma^2(\beta) + \sigma^2(\gamma + \varepsilon)\right]}.$$

Baldwin et al. (2015, p. 792) furthermore assumed that $\sigma^2(\beta) = \sigma^2(\gamma) = 0$, leading to the special case

(2) $$\text{Dep}\left( X_{+}, n \right) = \frac{\sigma^2(\tau)}{\sigma^2(\tau) + \frac{1}{n}\sigma^2(\varepsilon)}.$$

Writing $\rho = \sigma^2(\tau) / \left(\sigma^2(\tau) + \sigma^2(\varepsilon)\right)$, we can rewrite the coefficient of Baldwin et al. as

(3) $$\text{Dep}\left( X_{+}, n \right) = \frac{n\rho}{1 + (n-1)\rho}.$$

To compare this result with our own results, we note that if we define the true scores as $T_i(p) = \mu + \tau_p + \beta_i + \gamma_{pi}$, the assumption $\sigma^2(\beta) = \sigma^2(\gamma) = 0$ implies that the items are tau-equivalent, i.e., $T_i = T_j$ for all $i, j \in \mathbb{N}$. Baldwin et al. (2015) used the same value of $\sigma^2(\varepsilon)$ regardless of the included items or participants, so they treated the items as if they were parallel.
In Corollary 2, we concluded for the situation of parallel items that, with $H = 1/\mathbb{E}\left(N^{-1}\right)$ (the harmonic mean of N),

(4) $$\text{Rel}\left( X_{+} \right) = \frac{H\rho}{1 + (H-1)\rho}.$$

We now discuss the differences between the approach of Baldwin et al. (2015) and our own analysis. The most obvious difference is that Baldwin et al. use Equation (3), which assumes a fixed number of trials n, whereas we use Equation (4), which uses the harmonic mean H of a variable number of trials. Baldwin et al. thus compute conditional dependability coefficients, given a value of n, but they do not discuss how these conditional coefficients can be integrated into a single unconditional coefficient that summarizes the reliability or dependability in a population of persons having different values of n. Clayson et al. (2021, p. 183) recommend integrating by means of a formula that is equivalent to Equation (3) with n replaced by the arithmetic mean or median of N, but this seems to be an ad hoc formula without proof of correctness. Our analysis shows that this integration can be done with essentially the same formula, the Spearman–Brown formula, replacing the fixed test length with the harmonic mean of the test lengths. Our formula has the advantage that it is mathematically proven to produce the unconditional reliability when this is defined in the conventional manner as the true-score variance divided by the observed-score variance.
The harmonic mean ($H = 1/\mathbb{E}\left(N^{-1}\right)$), rather than the arithmetic mean (denoted here by $A = \mathbb{E}(N)$), is used because the overall error variance is the expected value of the individual error variances $\frac{1}{n}\sigma^2(\varepsilon)$, which is $\sigma^2(\varepsilon)/H$ and not $\sigma^2(\varepsilon)/A$.
In general, $H < A$ if $N > 0$ and $\textrm{Var}(N) > 0$, so using the arithmetic mean produces estimates that are too optimistic. Mathematically, Equation (4) is more general than Equation (3), because the latter can be viewed as the special case of the former in which the test length is fixed. The two formulas can be complementary in their applications. Equation (3) can be useful in clinical settings if, after the test administration, one wants to decide whether enough trials have been observed for a given patient with known n, even if the estimate of $\rho$ is based on data from many patients with variable N. Equation (4) can be used in research where a single reliability value is needed for a group of persons with variable N.
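The gap between the two projections is easy to demonstrate numerically; the trial counts and single-trial reliability below are hypothetical values chosen for illustration.

```python
import numpy as np

# Spearman-Brown projection with the harmonic mean H of variable test
# lengths (Eq. 4), compared with the too-optimistic arithmetic-mean version.
rho = 0.15                                # hypothetical single-trial reliability
N = np.array([6, 8, 10, 20, 40, 80])      # hypothetical trial counts per person

H = 1.0 / np.mean(1.0 / N)                # harmonic mean of N
A = N.mean()                              # arithmetic mean of N

def spearman_brown(rho, k):
    """Spearman-Brown formula for (possibly non-integer) length k."""
    return k * rho / (1 + (k - 1) * rho)

print(H < A)                              # True whenever Var(N) > 0
print(spearman_brown(rho, H))             # unconditional reliability, Eq. (4)
print(spearman_brown(rho, A))             # larger, i.e., too optimistic
```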

The second difference is the method for estimating $\rho$. Baldwin et al. (2015) advocate the use of a Bayesian hierarchical model to estimate the variance components and their ratio. In Corollary 2a, we concluded that it suffices to estimate $\sigma^2$ (the variance of all scores), $\textrm{Var}\left(X_+\right)$, and $\mathbb{E}\left(N^{-1}\right)$. These are simply variances and means of observed variables, and in the two examples of Theorem 2 we demonstrated that these quantities can easily be estimated with the corresponding sample moments. Our analysis was mainly concerned with the relations between parameters, and the examples were given merely to clarify the results, not to claim that this is the best estimation method.
Any estimate of $\rho$ may be inserted in Corollary 2b (the Spearman–Brown formula with the harmonic mean of N) to obtain an estimate of the overall reliability. We discuss the merits of the method of Baldwin et al. and compare them with our estimation method based on Corollary 2a.
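As a sketch of this moment-based route (not the paper's exact estimator; only the quantities required by Corollary 2a are taken from the text), suppose the items are parallel and $X_+$ is the per-person mean of the N trial scores. Then $\textrm{Var}(X_+) = \sigma^2\rho + \sigma^2(1-\rho)\,\mathbb{E}(N^{-1})$, so $\rho$ can be recovered from the three sample moments alone.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate parallel items with variable test length (assumed setup).
P = 5000
rho_true = 0.3                             # single-trial reliability (assumed)
s2 = 10.0                                  # single-trial score variance (assumed)
N = rng.integers(5, 50, size=P)            # variable trial counts

tau = rng.normal(0, np.sqrt(rho_true * s2), size=P)
scores = [tau[p] + rng.normal(0, np.sqrt((1 - rho_true) * s2), size=N[p])
          for p in range(P)]
X_plus = np.array([s.mean() for s in scores])   # person-mean scores

# Sample moments replacing the three parameters of Corollary 2a:
sigma2_hat = np.var(np.concatenate(scores))     # variance of all scores
var_Xplus = np.var(X_plus)                      # Var(X+)
inv_mean = np.mean(1.0 / N)                     # E(1/N); H = 1/inv_mean

# Solve Var(X+) = sigma2*rho + sigma2*(1-rho)*E(1/N) for rho:
rho_hat = (var_Xplus - sigma2_hat * inv_mean) / (sigma2_hat * (1 - inv_mean))
rel_hat = sigma2_hat * rho_hat / var_Xplus      # equals H*rho/(1+(H-1)*rho)
print(rho_hat, rel_hat)
```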

According to Baldwin et al. (2015), the advantages of their method are that it does not produce negative variance component estimates and that "computing interval estimates and hypothesis tests for variance components and dependability coefficients is straightforward" (ibid., p. 794). Our method is based on the sample estimates of $\sigma^2$ and $\textrm{Var}\left(X_+\right)$, which cannot be negative either. Note that the method of Baldwin et al. assumes normal distributions for the components, whereas our method does not require any distributional assumptions whatsoever. The method of Baldwin et al. produces interval estimates, but in doing so it relies heavily on the assumption of normality. Ogasawara (2006) and Maydeu-Olivares et al. (2007) compared asymptotic distribution-free (ADF) estimators and normal-theory estimators for coefficient alpha, and Maydeu-Olivares et al. concluded that "for sample sizes over 100 observations, ADF intervals are preferable, regardless of item skewness and kurtosis" (ibid., p. 157). Braschel et al. (2015) and Coffman et al. (2008) also noted lack of robustness of estimates of intraclass correlations based on normal theory, and Coffman et al. provided the ADF distribution of sample intraclass correlations.
Using Bayesian methods does not render estimators invulnerable to violations of normality. Ionan et al. (2014) compared various frequentist and Bayesian methods for interval estimation of the intraclass correlation in a two-way crossed random-effects model and concluded that "none of the methods work well if the number of levels of a factor are limited and data are markedly non-normal" (ibid., p. 1). This does not mean that our method is necessarily preferable, however; hypothesis testing and interval estimation of $\sigma^2/\textrm{Var}\left(X_+\right)$, a ratio of two dependent variances, have similar problems if data are non-normal (Wilcox, 1990, 2015). Further research is needed to determine the optimal estimation method for small non-normal data with random numbers of observations.
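One distribution-free option for interval estimation, mentioned here only as a sketch and not as a method endorsed in the text, is a percentile bootstrap over persons; the point estimator inside again assumes parallel items and person-mean scores, as in the moment-based sketch above.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: per person, N trial scores with a common true score (assumed setup).
P = 400
N = rng.integers(5, 30, size=P)
tau = rng.normal(0, 1.0, size=P)
scores = [tau[p] + rng.normal(0, 2.0, size=N[p]) for p in range(P)]

def rel_estimate(idx):
    """Moment-based reliability estimate on a (re)sample of person indices."""
    x_plus = np.array([scores[p].mean() for p in idx])
    pooled = np.concatenate([scores[p] for p in idx])
    s2, v, im = pooled.var(), x_plus.var(), np.mean(1.0 / N[idx])
    rho = (v - s2 * im) / (s2 * (1 - im))
    return s2 * rho / v

# Percentile bootstrap: resample persons (not trials) with replacement.
boot = [rel_estimate(rng.integers(0, P, size=P)) for _ in range(500)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(lo, hi)
```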

A third difference is that we provide an analysis of what happens if the test administration is repeated with possibly a different number of trials. Baldwin et al. (2015) did not discuss this matter.

Clayson et al. (2021) generalized the model of Baldwin et al. (2015) to a setting with multiple occasions. Applied to a setting with a single occasion, the main difference from Baldwin et al. is that Clayson et al. do not assume $\sigma^2(\beta) = \sigma^2(\gamma) = 0$, leading to Equation (1) instead of Equation (2). A comparison of our analysis with Clayson et al. follows roughly the same lines as our comparison with Baldwin et al.: Clayson et al. describe conditional dependability coefficients, given a fixed number of trials, whereas our method describes how coefficients for different numbers of trials can be integrated into an unconditional coefficient.
More specifically, if we assume that the components $\tau, \beta, \gamma, \varepsilon$ are independent of N, then, with $\pi_n = P(N = n)$,
$$\textrm{Var}\left(X_+\right) = \mathbb{E}\left(\textrm{Var}\left(X_+ \mid N\right)\right) = \sum_{n=1}^{\infty} \textrm{Var}\left(X_+ \mid N = n\right)\pi_n = \sum_{n=1}^{\infty} \left\{\sigma^2(\tau) + \tfrac{1}{n}\left[\sigma^2(\beta) + \sigma^2(\gamma+\varepsilon)\right]\right\}\pi_n = \sigma^2(\tau) + \mathbb{E}\left(\tfrac{1}{N}\right)\left[\sigma^2(\beta) + \sigma^2(\gamma+\varepsilon)\right],$$
where the between-length term $\textrm{Var}\left(\mathbb{E}\left(X_+ \mid N\right)\right)$ vanishes because $\mathbb{E}\left(X_+ \mid N\right) = \mu$ does not depend on N. The unconditional dependability is therefore

(5) $$\textrm{Dep}(X_+) = \frac{\sigma^2(\tau)}{\sigma^2(\tau) + \mathbb{E}\left(\frac{1}{N}\right)\left[\sigma^2(\beta) + \sigma^2(\gamma+\varepsilon)\right]}.$$

Using $\rho' = \sigma^2(\tau) / \left[\sigma^2(\tau) + \sigma^2(\beta) + \sigma^2(\gamma+\varepsilon)\right]$, we can rewrite Equations (1) and (5) as

(6) $$\textrm{Dep}(X_+, n) = \frac{n\rho'}{1 + (n-1)\rho'},$$
(7) $$\textrm{Dep}(X_+) = \frac{H\rho'}{1 + (H-1)\rho'}.$$
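That Equations (5) and (7) are two forms of the same quantity can be checked numerically; the variance components and the distribution of N below are arbitrary illustrative values.

```python
import numpy as np

# Illustrative variance components: sigma^2(tau), sigma^2(beta), sigma^2(gamma+eps)
s2_tau, s2_beta, s2_ge = 4.0, 0.5, 9.0
N = np.array([10, 15, 30, 60])            # hypothetical trial counts
pi = np.array([0.4, 0.3, 0.2, 0.1])       # their probabilities pi_n (sum to 1)

E_inv = np.sum(pi / N)                    # E(1/N)
H = 1.0 / E_inv                           # harmonic mean of N

dep5 = s2_tau / (s2_tau + E_inv * (s2_beta + s2_ge))      # Eq. (5)
rho_prime = s2_tau / (s2_tau + s2_beta + s2_ge)
dep7 = H * rho_prime / (1 + (H - 1) * rho_prime)          # Eq. (7)

print(np.isclose(dep5, dep7))             # the two forms coincide
```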

Clayson et al. (2021) estimated the variance components using a Bayesian hierarchical model but, given the previous discussion, we are not convinced that this is the best estimation method. Our method of integrating the various coefficients works regardless of the estimation method used for the reliability or generalizability. We have suggested coefficient alpha in Corollary 1 because it can be interpreted both in CTT and in GT (see also Sijtsma & Pfadt, 2021a, 2021b).

5. Discussion: Contributions of Our Study to the Theory and Practice of Reliability

We have extended CTT with new formulas to compute the reliability in situations where the number of items per subject is a random variable. These formulas can be applied to data of performance-monitoring ERPs such as the ERN and Pe, where the number of relevant trials depends on the performance of the participant. We studied this for the Eriksen Flanker Task, but our theory can also be applied to other tasks in which ERN and Pe measurements can be obtained, such as Go/NoGo tasks and Stroop tasks (see Baldwin et al., 2015). Furthermore, we illustrated our theory with time-window mean amplitude scores, but our formulas are equally valid for other EEG scores such as peak amplitude or peak latency.

The first method we created is based on a reliability formula for a stratified sample. This method can be used in combination with existing reliability estimates such as alpha or omega, applied to each subgroup with equal test length. The limitation of this method is that it requires each subgroup of participants with the same test length to be large enough to estimate the reliability accurately. This requirement might be difficult to meet, although fortunately, in the field of psychophysiology, a trend toward the use of larger samples is observed (Kissel & Friedman, 2023). If the requirement is not met, then subgroups with different test lengths have to be combined, which leads to loss of data. The reason for this data loss is that alpha has to be computed on a rectangular data matrix; if groups with $N = k$ and $N = k+1$ items are combined, then either alpha is computed with $k+1$ items and the participants with $N = k$ are discarded, or alpha is computed with k items and the data of the $(k+1)$th item are discarded. In our example of length-stratified alpha, 83% of the data could be used. However, it may not be necessary to estimate the reliability in each subgroup with the same accuracy as one would desire in the total group. The stratification formula combines the subgroup reliabilities in a weighted sum, and the standard error of the total reliability can be smaller than each of the contributing standard errors. Our first simulation of standard errors of length-stratified alpha, reported in "Example of Theorem 1," gave promising results. Further research is needed to construct interval estimates of this version of stratified reliability and to provide sample size recommendations. The second method we proposed requires only two variance estimates and one mean to compute, which makes it very easy to apply. Moreover, our second method uses 100% of the data. Its limitation is that it requires the items to be parallel.
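A minimal sketch of length-stratified reliability along these lines: alpha is computed on each rectangular subgroup matrix, and the subgroup results are combined by dividing the weighted mean of the subgroup error variances by the pooled observed variance. This combination rule is consistent with the stratification principle described here, but it does not reproduce the exact formula of Theorem 1, which may differ in detail; the simulated data in the usage example are hypothetical.

```python
import numpy as np

def cronbach_alpha(X):
    """Coefficient alpha for a rectangular persons-by-items score matrix."""
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1).sum()
    total_var = X.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def stratified_reliability(groups):
    """Combine subgroup alphas; groups is a list of rectangular matrices,
    one per test length. Error variance of the person-mean score within a
    stratum is Var_g(X+) * (1 - alpha_g); the overall error variance is
    taken as the weighted mean of these, with weights the group sizes."""
    x_plus = np.concatenate([g.mean(axis=1) for g in groups])
    weights = np.array([g.shape[0] for g in groups], float)
    weights /= weights.sum()
    err = [g.mean(axis=1).var(ddof=1) * (1 - cronbach_alpha(g)) for g in groups]
    return 1 - np.dot(weights, err) / x_plus.var(ddof=1)

# Usage on simulated parallel items (true single-trial reliability 0.5):
rng = np.random.default_rng(4)
def sim(P, k, rho=0.5):
    t = rng.normal(0, np.sqrt(rho), size=(P, 1))
    return t + rng.normal(0, np.sqrt(1 - rho), size=(P, k))

print(stratified_reliability([sim(1500, 5), sim(1500, 10)]))
```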

Our analysis shows that reliability estimation of ERN and Pe data with CTT is very well possible. The advantage of CTT is that the greater simplicity of having only a single facet allows us to focus on an aspect that did not receive attention in the GT treatments, namely that the number of items is also a random variable. In contrast with earlier treatments in GT, we were able to define a single reliability coefficient that combines all subgroups with different numbers of items. Our analysis shows that the harmonic mean of the number of items, rather than the arithmetic mean, relates the variance components to the overall reliability, and this result is relevant in both CTT and GT approaches. Our analysis also clarified that even if the items on a second test administration are parallel with the items of the first test administration, the total scores may not be parallel if the number of items changes between the test administrations. We generalized our approach to data that are stratified on other variables in Supplementary Material C. We pointed out that Corollary 2 and Equation (4) (i.e., the Spearman–Brown formula with the harmonic mean of test lengths) can also be applied in designs where randomly selected raters from one population are nested within objects, with different sample sizes per object. This formula may be useful in studies of performance evaluations of health care organizations where each organization is rated by a sample of its patients and sample sizes usually differ (e.g., Ellis, 2013; Ogasawara, 2021), although the situation is complicated by the need for a case-mix correction.

We contend that CTT still has its merits when a detailed analysis of reliability is needed. This study shows that CTT does not always require parallel items, as some authors have suggested and put forward as a limiting condition for using CTT (Clayson & Miller, 2017a, p. 72). The simplicity of CTT is attractive in the present context, where it enables the researcher to estimate reliability in a simple way, addressing the problem of obtaining a single reliability coefficient with variable test lengths that more complex methods seem to obscure. In doing so, the present work provides a crucial and necessary contribution to advancing ERP studies of individual differences.

Declarations

Data Availability Statement

The data of Tables 2 and 4 and Figures S1–S5, and the code that generated them, are available in the Open Science Framework repository at https://doi.org/10.17605/OSF.IO/KZY3D.

Conflict of interest

We have no conflict of interest to disclose.

Footnotes

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s11336-024-09982-5.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Baldwin, S. A., Larson, M. J., & Clayson, P. E. (2015). The dependability of electrophysiological measurements of performance monitoring in a clinical sample: A generalizability and decision analysis of the ERN and Pe. Psychophysiology, 52, 790–800.
Bernoster, I., De Groot, K., Wieser, M. J., Thurik, R., & Franken, I. H. (2019). Birds of a feather flock together: Evidence of prominent correlations within but not between self-report, behavioral, and electrophysiological measures of impulsivity. Biological Psychology, 145, 112–123.
Braschel, M. C., Svec, I., Darlington, G. A., & Donner, A. (2015). A comparison of confidence interval methods for the intraclass correlation coefficient in community-based cluster randomization trials with a binary outcome. Clinical Trials, 13(2), 180–187.
Cho, E. (2021). Neither Cronbach's alpha nor McDonald's omega: A commentary on Sijtsma and Pfadt. Psychometrika, 86(4), 877–886.
Clayson, P. E. (2020). Moderators of the internal consistency of error-related negativity scores: A meta-analysis of internal consistency estimates. Psychophysiology.
Clayson, P. E., Carbine, K. A., Baldwin, S. A., Olsen, J. A., & Larson, M. J. (2021). Using generalizability theory and the ERP Reliability Analysis (ERA) Toolbox for assessing test-retest reliability of ERP scores part 1: Algorithms, framework, and implementation. International Journal of Psychophysiology, 166, 174–187.
Clayson, P. E., & Miller, G. A. (2017). ERP Reliability Analysis (ERA) Toolbox: An open-source toolbox for analyzing the reliability of event-related brain potentials. International Journal of Psychophysiology, 111, 68–79.
Clayson, P. E., & Miller, G. A. (2017). Psychometric considerations in the measurement of event-related brain potentials: Guidelines for measurement and reporting. International Journal of Psychophysiology, 111, 57–67.
Coffman, D. L., Maydeu-Olivares, A., & Arnau, J. (2008). Asymptotic distribution free interval estimation. Methodology, 4(1), 4–9.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
Ellis, J. L. (2013). Probability interpretations of intraclass reliabilities. Statistics in Medicine, 32(26), 4596–4608.
Ellis, J. L. (2021). A test can have multiple reliabilities. Psychometrika, 86(4), 869–876.
Eriksen, B. A., & Eriksen, C. W. (1974). Effects of noise letters upon the identification of a target letter in a nonsearch task. Perception & Psychophysics, 16, 143–149.
Fabiani, M., Gratton, G., Karis, D., & Donchin, E. (1987). The definition, identification, and reliability of measurement of the P300 component of the event-related brain potential. In P. K. Ackles, J. R. Jennings, & M. G. H. Coles (Eds.), Advances in Psychophysiology (pp. 1–78). Greenwich, CT: JAI Press.
Falkenstein, M., Hohnsbein, J., Hoormann, J., & Blanke, L. (1991). Effects of crossmodal divided attention on late ERP components. II. Error processing in choice reaction tasks. Electroencephalography and Clinical Neurophysiology, 78, 447–455.
Gehring, W. J., Goss, B., Coles, M. G., Meyer, D. E., & Donchin, E. (1993). A neural system for error detection and compensation. Psychological Science, 4, 385–390.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10, 255–282.
Guttman, L. (1953). Reliability formulas that do not assume experimental independence. Psychometrika, 18(3), 225–239.
Hedge, C., Powell, G., & Sumner, P. (2018). The reliability paradox: Why robust cognitive tasks do not produce reliable individual differences. Behavior Research Methods, 50, 1166–1186.
Ionan, A. C., Polley, M.-Y. C., McShane, L. M., & Dobbin, K. K. (2014). Comparison of confidence interval methods for an intra-class correlation coefficient (ICC). BMC Medical Research Methodology.
Kissel, H. A., & Friedman, B. H. (2023). Participant diversity in Psychophysiology. Psychophysiology.
Klawohn, J., Meyer, A., Weinberg, A., & Hajcak, G. (2020). Methodological choices in event-related potential (ERP) research and their impact on internal consistency reliability and individual differences: An examination of the error-related negativity (ERN) and anxiety. Journal of Abnormal Psychology, 129, 29–37.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Marco-Pallares, J., Cucurell, D., Münte, T. F., Strien, N., & Rodriguez-Fornells, A. (2011). On the number of trials needed for a stable feedback-related negativity. Psychophysiology, 48(6), 852–860.
Maydeu-Olivares, A., Coffman, D. L., & Hartmann, W. M. (2007). Asymptotically distribution-free (ADF) interval estimation of coefficient alpha. Psychological Methods, 12(2), 157–176.
Meyer, A., Riesel, A., & Proudfit, G. H. (2013). Reliability of the ERN across multiple tasks as a function of increasing errors. Psychophysiology, 50(12), 1220–1225.
Nieuwenhuis, S., Ridderinkhof, K. R., Blom, J., Band, G. P., & Kok, A. (2001). Error-related brain potentials are differentially related to awareness of response errors: Evidence from an antisaccade task. Psychophysiology, 38, 752–760.
Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3(1), 1–18.
Novick, M. R., & Lewis, C. (1967). Coefficient alpha and the reliability of composite measurements. Psychometrika, 32, 1–13.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
O'Connell, R. G., Dockree, P. M., Bellgrove, M. A., Kelly, S. P., Hester, R., Garavan, H., Robertson, I. H., & Foxe, J. J. (2007). The role of cingulate cortex in the detection of errors with and without awareness: A high-density electrical mapping study. European Journal of Neuroscience, 25, 2571–2579.
Ogasawara, H. (2006). Approximations to the distribution of the sample coefficient alpha under nonnormality. Behaviormetrika, 33(1), 3–26.
Ogasawara, H. (2009). Stratified coefficients of reliability and their sampling behavior under nonnormality. Behaviormetrika, 36(1), 49–73.
Ogasawara, H. (2021). A unified treatment of agreement coefficients and their asymptotic results: The formula of the weighted mean of weighted ratios. Journal of Classification, 38, 390–422.
Olvet, D. M., Hajcak, G. (2008). The error-related negativity (ERN) and psychopathology: Toward an endophenotype. Clinical Psychology Review, 28(8), 13431354.CrossRefGoogle ScholarPubMed
Olvet, D. M., Hajcak, G. (2009). The stability of error-related brain activity with increasing trials. Psychophysiology, 46(5), 957961.CrossRefGoogle ScholarPubMed
Overbeek, T. J., Nieuwenhuis, S., Ridderinkhof, K. R. (2005). Dissociable components of error processing: On the functional significance of the Pe vis-à-vis the ERN/Ne. Journal of Psychophysiology, 19(4), 319329.CrossRefGoogle Scholar
Pontifex, M. B., Scudder, M. R., Brown, M. L., O’Leary, K. C., Wu, C-T, Themanson, J. R., Hillman, C. H. (2010). On the number of trials necessary for stabilization of error-related brain activity across the life span. Psychophysiology, 47(4), 767773.Google ScholarPubMed
Raykov, T., Marcoulides, G. A. (2017). Thanks coefficient alpha, we still need you!. Educational and Psychological Measurement, 79(1), 200210.CrossRefGoogle ScholarPubMed
Rietdijk, W. J., Franken, I. H., Thurik, A. R. (2014). Internal consistency of event-related potentials associated with cognitive control: N2/P3 and ERN/Pe. PLoS ONE, 9.CrossRefGoogle ScholarPubMed
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74, 107120.CrossRefGoogle ScholarPubMed
Sijtsma, K. (2009). Reliability beyond theory and into practice. Psychometrika, 74, 169173.CrossRefGoogle ScholarPubMed
Sijtsma, K., Pfadt, J. M. (2021). Invited review Part II: On the use, the misuse, and the very limited usefulness of Cronbach’s alpha: Discussing lower bounds and correlated errors. Psychometrika.CrossRefGoogle Scholar
Sijtsma, K., Pfadt, J. M. (2021). Rejoinder: The future of reliability. Psychometrika.CrossRefGoogle ScholarPubMed
Sijtsma, K., van der Ark, L. A. (2020). Measurement models for psychological attributes, London: Chapman & Hall.CrossRefGoogle Scholar
Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18(6), 643662.CrossRefGoogle Scholar
Ten Berge, J. M. F., Sočan, G. (2004). The greatest lower bound to the reliability of a test and the hypothesis of unidimensionality. Psychometrika, 69, 613625.CrossRefGoogle Scholar
Wilcox, R. R. (1990). Comparing the variances of two dependent groups. Journal of Educational Statistics, 15(3), 237.CrossRefGoogle Scholar
Wilcox, R. (2015). Comparing the variances of two dependent variables. Journal of Statistical Distributions and Applications.CrossRefGoogle Scholar
Figure 1. Schematic representation of a Flanker task trial.

Table 1. Small example of spaced and condensed data matrix of ERNs.

Table 2. Statistics for the computation of length-stratified alpha.

Table 3. Computational example for seven participants (out of 143).

Table 4. Estimates needed for computing the unconditional reliability.

Supplementary material: Ellis et al. supplementary materials (File, 186.6 KB).