
Retest reliability and reliable change of community-dwelling Black/African American older adults with and without mild cognitive impairment using NIH Toolbox-Cognition Battery and Cogstate Brief Battery for laptop

Published online by Cambridge University Press:  20 December 2024

Taylor Rigby
Affiliation:
Michigan Alzheimer’s Disease Research Center, MI, USA Department of Psychiatry, University of Michigan, MI, USA Department of Veterans Affairs Medical Center, Geriatric Research Education and Clinical Center, Ann Arbor, MI, USA
Voyko Kavcic
Affiliation:
Wayne State University, MI, USA
Sarah R. Shair
Affiliation:
Department of Veterans Affairs Medical Center, Ann Arbor, MI, USA
Tanisha G. Hill-Jarrett
Affiliation:
Department of Neurology, Memory and Aging Center, University of California San Francisco, CA, USA Global Brain Health Institute, University of California San Francisco, CA, USA
Sarah Garcia
Affiliation:
Department of Psychology, Stetson University, FL, USA
Jon Reader
Affiliation:
Michigan Alzheimer’s Disease Research Center, MI, USA Department of Neurology, University of Michigan, MI, USA
Carol Persad
Affiliation:
Michigan Alzheimer’s Disease Research Center, MI, USA Department of Psychiatry, University of Michigan, MI, USA
Arijit K. Bhaumik
Affiliation:
Michigan Alzheimer’s Disease Research Center, MI, USA Department of Psychiatry, University of Michigan, MI, USA Department of Neurology, University of Michigan, MI, USA
Subhamoy Pal
Affiliation:
Michigan Alzheimer’s Disease Research Center, MI, USA Department of Neurology, University of Michigan, MI, USA
Benjamin M. Hampstead
Affiliation:
Michigan Alzheimer’s Disease Research Center, MI, USA Department of Psychiatry, University of Michigan, MI, USA Department of Veterans Affairs Medical Center, Ann Arbor, MI, USA
Bruno Giordani*
Affiliation:
Michigan Alzheimer’s Disease Research Center, MI, USA Department of Psychiatry, University of Michigan, MI, USA
*
Corresponding author: Bruno Giordani; Email: giordani@umich.edu

Abstract

Objective:

With the increased use of computer-based tests in clinical and research settings, assessing the retest reliability and reliable change of the NIH Toolbox-Cognition Battery (NIHTB-CB) and Cogstate Brief Battery (Cogstate) is essential. Previous studies have used mostly White samples; Black/African Americans (B/AAs) must be included in this research to ensure these measures are reliable for them.

Method:

Participants were B/AA consensus-confirmed healthy controls (HCs; n = 49) or adults with mild cognitive impairment (MCI; n = 34), aged 60–85 years, who completed the NIHTB-CB and Cogstate for laptop at two timepoints within 4 months. Intraclass correlations, the Bland-Altman method, t-tests, and Pearson correlation coefficients were used. Cut scores indicating reliable change are provided.

Results:

NIHTB-CB composite reliability ranged from .81 to .93 (95% CIs [.37–.96]). The Fluid Composite demonstrated a significant difference between timepoints and was less consistent than the Crystallized Composite. Subtests were less consistent for MCIs (ICCs = .01–.89, CIs [−1.00–.95]) than for HCs (ICCs = .69–.93, CIs [.46–.92]). For MCIs, a moderate correlation was found between the length of the retest interval and the change in performance on the Total Composite (r = −.40, p = .03), Fluid Composite (r = −.38, p = .03), and Pattern Comparison Processing Speed (r = −.47, p = .006).

On Cogstate, HCs had lower reliability (ICCs = .47–.76, CIs [.05–.86]) than MCIs (ICCs = .65–.89, CIs [.29–.95]). Identification reaction time significantly improved between testing timepoints across samples.

Conclusions:

The NIHTB-CB and Cogstate for laptop show promise for use in research with B/AAs and were reasonably stable up to 4 months. Still, differences were found between those with MCI and HCs. It is recommended that race and cognitive status be considered when using these measures.

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of International Neuropsychological Society

Introduction

As the population with dementia has grown, disparities have emerged in the prevalence of all-cause dementia among different races. Older Black/African Americans (B/AAs) are disproportionately more likely than older Whites to have Alzheimer’s disease (AD) and other dementias (Dilworth-Anderson et al., Reference Dilworth-Anderson, Hendrie, Manly, Khachaturian and Fazio2008; Power et al., Reference Power, Bennett, Turner, Dowling, Ciarleglio, Glymour and Gianattasio2021; Steenland et al., Reference Steenland, Goldstein, Levey and Wharton2016; Yaffe et al., Reference Yaffe, Falvey, Harris, Newman, Satterfield, Koster and Simonsick2013). Further, despite their increased risk of developing dementia, B/AA older adults are largely underrepresented in research seeking to understand these diseases. There is also evidence that a missed or delayed diagnosis of AD and other dementia types is more common among B/AA older adults than among White older adults (Clark et al., Reference Clark, Kutner, Goldstein, Peterson-Hazen, Garner, Zhang and Bowles2005; Gianattasio et al., Reference Gianattasio, Prather, Glymour, Ciarleglio and Power2019; Lin et al., Reference Lin, Daly, Olchanski, Cohen, Neumann, Faul and Freund2021), which contributes to a delay of care that may affect disease trajectory and outcomes. Thus, it is increasingly important to identify people at risk for AD and related dementias as early as possible, in part through accurately identifying individuals with mild cognitive impairment (MCI). A diagnosis of MCI refers to cognitive decline that is not normal for a person’s age but generally does not affect that person’s ability to carry out most activities of daily living (Gauthier et al., Reference Gauthier, Reisberg, Zaudig, Petersen, Ritchie, Broich and Winblad2006). MCI is classified as one of two types based on a person’s symptoms: amnestic (memory issues predominate) or non-amnestic (other cognitive issues predominate; Petersen et al., Reference Petersen, Lopez, Armstrong, Getchius, Ganguli, Gloss and Rae-Grant2018; Alzheimer’s Association, 2022). It is estimated that 10–15% of individuals with MCI go on to develop a form of dementia each year, and about one-third of people with MCI develop dementia due to AD within five years (Alzheimer’s Association, 2022). Others with MCI may revert to their prior, preclinical level of cognition or remain clinically stable (Pandya et al., Reference Pandya, Clem, Silva and Woon2016).

Traditionally, neuropsychological measures have been used in clinical settings and in research studies to identify and track those with cognitive decline. However, more recently introduced computerized measures have a relative ease of administration when compared to traditional neuropsychological methods (Diaz-Orueta et al., Reference Diaz-Orueta, Blanco-Campal, Lamar, Libon and Burke2020; Weintraub et al., Reference Weintraub, Dikmen, Heaton, Tulsky, Zelazo, Bauer and Gershon2013). Consequently, computerized assessments will likely be in increasing demand. While traditional neuropsychological methods have been well studied, less is known about practice effects and the retest reliability of computerized testing methods, particularly with different racial/ethnic groups (Diaz-Orueta et al., Reference Diaz-Orueta, Blanco-Campal, Lamar, Libon and Burke2020; Scott et al., Reference Scott, Sorrell and Benitez2019). Practice effects refer to the expected and common improvement in test performance due to repeated exposures to test materials (Calamia et al., Reference Calamia, Markon and Tranel2012; Portney & Watkins, Reference Portney and Watkins2009). Retest reliability can be defined as the extent to which a measurement is consistent and free of random measurement error (the fluctuation in scores of repeated assessments due to unpredictable factors; Portney & Watkins, Reference Portney and Watkins2009). Retest reliability is essential to help clinicians and researchers understand how much of a measured change in score is attributable to measurement error and how much represents a true condition (Calamia et al., Reference Calamia, Markon and Tranel2012). Further, practice effects can mask actual cognitive decline in longitudinal studies of older adults and thereby give the illusion of stability or only minor change (Calamia et al., Reference Calamia, Markon and Tranel2012). Reliable change can be used to assess whether a change at retest on a given variable is “reliable” (meaning it is statistically improbable that the change is due to measurement error) and therefore represents a meaningful change (Chelune et al, Reference Chelune, Naugle, Luders, Sedlak and Awad1993; Iverson, Reference Iverson2001).

As more studies begin to incorporate computerized cognitive measures into clinical trials and longitudinal research, and scientists and clinicians explore the clinical applications of these tools, it becomes increasingly important to better understand reliability and retest issues for these methods. To ensure that measures and treatments are valid and reliable for B/AAs, B/AAs must be included in the research exploring these subjects. The NIH Toolbox-Cognition Battery (NIHTB-CB) and the Cogstate Brief Battery (Cogstate) are computerized cognitive assessment batteries frequently used in clinical research. In a previous study, the NIHTB-CB was shown to have retest concordance correlation coefficients in healthy older adults ages 60–80 years of .73 for the Fluid Composite and .92 for the Crystallized Composite, with individual subtests ranging between .46 and .88 (Scott et al., Reference Scott, Sorrell and Benitez2019). The NIHTB-CB has also been shown to have intraclass correlations in healthy adults ages 20–85 years of .79 for the Fluid Composite and .92 for the Crystallized Composite (Heaton et al., Reference Heaton, Akshoomoff, Tulsky, Mungas, Weintraub, Dikmen and Gershon2014), with individual subtests ranging between .72 and .94 (Weintraub et al., Reference Weintraub, Dikmen, Heaton, Tulsky, Zelazo, Bauer and Gershon2013). Retest reliability for individual Cogstate subtests, using intraclass correlations, has been shown to range from .22 to .94 in healthy adults aged 18–96 years (Cole et al., Reference Cole, Arrieux, Schwab, Ivins, Qashu and Lewis2013; Falleti et al., Reference Falleti, Maruff, Collie and Darby2006; Fredrickson et al., Reference Fredrickson, Maruff, Woodward, Moore, Fredrickson, Sach and Darby2010; Lim et al., Reference Lim, Jaeger, Harrington, Ashwood, Ellis, Stöffler and Maruff2013), from .79 to .95 for those with amnestic MCI aged 60–96 years, and from .68 to .93 for those with Alzheimer’s disease aged 60–96 years (Lim et al., Reference Lim, Jaeger, Harrington, Ashwood, Ellis, Stöffler and Maruff2013).

Despite the findings that older B/AAs are disproportionately more likely than older Whites to have dementia of any type, previous studies examining the retest reliability of the NIHTB-CB and Cogstate for laptop were conducted using mostly White samples (Cole et al., Reference Cole, Arrieux, Schwab, Ivins, Qashu and Lewis2013; Falleti et al., Reference Falleti, Maruff, Collie and Darby2006; Fredrickson et al., Reference Fredrickson, Maruff, Woodward, Moore, Fredrickson, Sach and Darby2010; Hammers et al., Reference Hammers, Spurgeon, Ryan, Persad, Heidebrink, Barbas and Giordani2011; Heaton et al., Reference Heaton, Akshoomoff, Tulsky, Mungas, Weintraub, Dikmen and Gershon2014; Lim et al., Reference Lim, Jaeger, Harrington, Ashwood, Ellis, Stöffler and Maruff2013; Scott et al., Reference Scott, Sorrell and Benitez2019; Weintraub et al., Reference Weintraub, Dikmen, Heaton, Tulsky, Zelazo, Bauer and Gershon2013). Thus, the current study aimed to assess the retest reliability of the NIHTB-CB and the Cogstate Brief Battery for laptop over intervals of up to 4 months in healthy controls and those with MCI in a B/AA sample. The differences in scores between testing timepoints were calculated to examine practice effects and to provide cut scores for determining reliable change. The relationship between testing interval and performance was also examined.

It was hypothesized that the NIHTB-CB retest reliabilities for the healthy controls in an all B/AA sample would be similar to previous findings using non-impaired majority White samples (Heaton et al., Reference Heaton, Akshoomoff, Tulsky, Mungas, Weintraub, Dikmen and Gershon2014; Weintraub et al., Reference Weintraub, Dikmen, Heaton, Tulsky, Zelazo, Bauer and Gershon2013), as the scores used were a priori adjusted for age, sex, race/ethnicity, and education (available through NIHTB-CB for laptop). We hypothesized that the Crystallized Composite would be more reliable than the Fluid Composite, but that all three composites, and the subtests that comprise them, would demonstrate moderate to excellent reliability and small to medium practice effects up to 4 months in healthy controls. Less is known about the retest reliability and practice effects in those with MCI or AD when using the NIHTB-CB; however, those with MCI have demonstrated significantly attenuated learning performance on accuracy and reaction time tasks with repeated computerized testing when compared to healthy controls (Darby et al., Reference Darby, Maruff, Collie and McStephen2002). Thus, we hypothesized that those with MCI would demonstrate moderate to excellent reliability but be less susceptible to practice effects than healthy controls, particularly on Fluid tasks requiring a memory component. Less is known about the impact of shorter versus longer test intervals in the NIHTB-CB for either healthy controls or those with MCI; thus, findings should be viewed as exploratory.

On the Cogstate, no demographically adjusted norms have been provided by the manufacturer. Still, it was hypothesized that all subtests would demonstrate moderate to excellent reliability and small to medium practice effects in healthy controls up to 4 months in an all B/AA sample, based on the performance of majority White samples in prior studies (Cole et al., Reference Cole, Arrieux, Schwab, Ivins, Qashu and Lewis2013; Falleti et al., Reference Falleti, Maruff, Collie and Darby2006; Fredrickson et al., Reference Fredrickson, Maruff, Woodward, Moore, Fredrickson, Sach and Darby2010; Lim et al., Reference Lim, Jaeger, Harrington, Ashwood, Ellis, Stöffler and Maruff2013). Based on a previous study with a majority White sample, it was hypothesized that those with MCI would demonstrate retest reliability and susceptibility to practice effects similar to healthy controls on Cogstate subtests (Lim et al., Reference Lim, Jaeger, Harrington, Ashwood, Ellis, Stöffler and Maruff2013). Results have been mixed in studies exploring the length of retest intervals with Cogstate (Falleti et al., Reference Falleti, Maruff, Collie and Darby2006; Fredrickson et al., Reference Fredrickson, Maruff, Woodward, Moore, Fredrickson, Sach and Darby2010; Hammers et al., Reference Hammers, Spurgeon, Ryan, Persad, Heidebrink, Barbas and Giordani2011), so no a priori prediction was made.

Method

Participants

Participants were recruited through the Healthy Black Elders Center, the community engagement core for the Michigan Center for Urban African American Aging Research, a joint program of the Wayne State University Institute of Gerontology and the University of Michigan Institute for Social Research, and through the Michigan Alzheimer’s Disease Research Center (MADRC). This research was completed in accordance with the Helsinki Declaration. The study was reviewed and approved by the human subjects Institutional Review Boards at Wayne State University in Detroit, MI, USA, and at the University of Michigan Medical School in Ann Arbor, MI, USA. Participants were evaluated for decision-making capacity at the time of the informed consent process, and all signed consent forms approved by both Institutional Review Boards prior to participation in the study. All participants completed the National Alzheimer’s Coordinating Center (NACC) Uniform Data Set (UDS) version 2 evaluation, which included a multidomain medical, neurological, social, and neuropsychological evaluation; participants were then diagnosed at the MADRC using NACC consensus conference criteria (Weintraub et al., Reference Weintraub, Salmon, Mercaldo, Ferris, Graff-Radford, Chui and Morris2009). NIHTB-CB and Cogstate results were not available to the consensus panel. The initial NIHTB-CB and Cogstate assessments were conducted from 8 days before to 117 days after the UDS visit, with 71.1% of assessments taking place on the same day. The NIHTB-CB and Cogstate retest was conducted between 6 and 139 days (or within 4 months) after the initial administration, with a mean of 46.9 days and a median of 33 days. Participants also completed a Computer Anxiety Survey (Wild et al., Reference Wild, Mattek, Maxwell, Dodge, Jimison and Kaye2012) to assess their level of comfort with computers.

Participants were B/AA community-dwelling older adults between 60 and 85 years of age who reported either male or female biological sex. Participants included in the analyses completed the NIHTB-CB and Cogstate at two testing timepoints within four months of the initial administration and were classified by consensus diagnosis (Weintraub et al., Reference Weintraub, Salmon, Mercaldo, Ferris, Graff-Radford, Chui and Morris2009) as either having no clinically significant cognitive impairment (healthy control; n = 49) or as having MCI (n = 34). Those with MCI were further classified at consensus as MCI with amnestic features (aMCI; n = 24) or MCI with non-amnestic features (naMCI; n = 10). Due to the low incidence of naMCI observed in this sample and the statistical equivalence of the two subsamples on demographic variables (see Results section, Demographics), the aMCI and naMCI subsamples were combined and are described hereafter as the MCI group (n = 34).

Assessment measures

National Institutes of Health Toolbox-Cognition Battery (NIHTB-CB): The NIHTB-CB was designed to be a brief (30-min), computerized, widely accessible, and easily administered cognitive screener for ages 3–85 that is available in both English and Spanish (Gershon et al., Reference Gershon, Wagster, Hendrie, Fox, Cook and Nowinski2013). It was originally designed for the purpose of creating a “common currency” among different research studies (Weintraub et al., Reference Weintraub, Dikmen, Heaton, Tulsky, Zelazo, Bauer and Gershon2013). The battery consists of seven tests measuring five cognitive domains, which are separated broadly into “fluid,” or dynamic, thinking skills (executive functions, episodic memory, processing speed, working memory) and “crystallized” skills that remain relatively stable in adulthood (language: vocabulary knowledge and oral reading proficiency; Heaton et al., Reference Heaton, Akshoomoff, Tulsky, Mungas, Weintraub, Dikmen and Gershon2014; Weintraub et al., Reference Weintraub, Dikmen, Heaton, Tulsky, Zelazo, Bauer and Gershon2013). Individual subtest performances as well as composite summary scores of crystallized cognitive abilities, fluid cognition, and total cognition are provided. The Crystallized Cognition Composite includes the Oral Reading Recognition and Picture Vocabulary subtests. Measures of fluid abilities include the Dimensional Change Card Sort task, Flanker Inhibitory Control and Attention, List Sorting Working Memory, Pattern Comparison Processing Speed, and Picture Sequence Memory subtests. Specific test details, procedures, and extensive psychometric evaluation are available elsewhere (Weintraub et al., Reference Weintraub, Dikmen, Heaton, Tulsky, Zelazo, Bauer and Gershon2013).

Cogstate Brief Battery (Cogstate): Cogstate is a computerized cognitive assessment that provides measures of four different cognitive domains using playing card paradigms: visual learning, working memory, processing speed, and attention. Briefly, the core tests include a Detection Task (a simple reaction time task), Identification Task (a choice reaction time test of visual attention), One Card Learning Task (a continuous visual recognition learning task), and One Back Task (a test of working memory). These separate tests and their psychometric properties have been described previously (Falleti et al., Reference Falleti, Maruff, Collie and Darby2006; Lim et al., Reference Lim, Jaeger, Harrington, Ashwood, Ellis, Stöffler and Maruff2013; Maruff et al., Reference Maruff, Lim, Darby, Ellis, Pietrzak, Snyder and Masters2013).

Computer Anxiety Survey: Computer anxiety was measured using the Wild et al. (Reference Wild, Mattek, Maxwell, Dodge, Jimison and Kaye2012) Computer Anxiety Survey, a 16-item measure on which participants rate their level of anxiety when using computers (e.g., “I feel relaxed when I am working on a computer”). Responses are rated on a five-point, Likert-type scale and range from “Strongly Disagree” to “Strongly Agree.” Total scores range from 16 to 80, with higher scores indicating greater levels of computer anxiety. Computer anxiety summary scores are derived by totaling the rating for each item. Specific survey details and psychometric properties have been described previously (Wild et al., Reference Wild, Mattek, Maxwell, Dodge, Jimison and Kaye2012).

Mini-Mental State Examination (MMSE): The MMSE is a brief objective measure of cognitive functioning that quantitatively estimates the severity of cognitive impairment (Folstein et al., 1975). However, it is important to note that the MMSE is a brief cognitive screening tool; it is not meant to be used as a means of diagnosis, but rather as a path to referral for more comprehensive testing if needed (Arevalo-Rodriguez et al., Reference Arevalo-Rodriguez, Smailagic, Roqué-Figuls, Ciapponi, Sanchez-Perez, Giannakou and Cullum2021; Ranson et al., Reference Ranson, Kuźma, Hamilton, Muniz-Terrera, Langa and Llewellyn2019; Tombaugh & McIntyre, Reference Tombaugh and McIntyre1992). The MMSE can typically be administered in 5–10 minutes. It consists of a variety of questions with total scores ranging from 0 to 30, with lower scores representing poorer cognitive function. Cut scores of <24 and <25 are the most commonly used to suggest possible cognitive impairment (Tsoi et al., Reference Tsoi, Chan, Hirai, Wong and Kwok2015); however, a cut score of 27 (≤26) has been shown to maximize the diagnostic accuracy of the MMSE in B/AA individuals with more education (Spering et al., Reference Spering, Hobson, Lucas, Menon, Hall and O’Bryant2012). Specific test details, procedures, and psychometric evaluations have been compiled in the form of review articles (Tombaugh & McIntyre, Reference Tombaugh and McIntyre1992; Tsoi et al., Reference Tsoi, Chan, Hirai, Wong and Kwok2015).

Statistical analyses

All statistical analyses were conducted using SPSS V.28. Scores used in the analyses for NIHTB-CB were the a priori adjusted (age, sex, race/ethnicity, and education) t-scores (M = 50, SD = 10) available through NIHTB-CB for laptop. Per manufacturer recommendations, scores used in the analyses for Cogstate were log transformed and derived from raw scores available through Cogstate; specifically, accuracy scores (correct vs. incorrect responses) for One Card Learning and One Back, and reaction time (in milliseconds) for Detection and Identification. Prior to analysis, all measures were screened for univariate and multivariate outliers, and seven individual scores were identified as extreme outliers (i.e., z-score > 3.29) across measures. These seven outlying scores were winsorized and subsequently included in the analyses. Accuracy scores for One Back were negatively skewed; specifically, healthy controls were quite accurate when performing the One Back test. Scores for those with MCI were also negatively skewed, but to a lesser extent.
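For readers who wish to reproduce this preprocessing outside of SPSS, a minimal Python sketch of the outlier-handling step is given below. The z-score cutoff follows the text; the column names, example values, and the use of a base-10 log for the Cogstate reaction times are illustrative assumptions rather than details taken from the study.

```python
import numpy as np
import pandas as pd

def winsorize_extreme(scores: pd.Series, z_cut: float = 3.29) -> pd.Series:
    """Cap values whose z-score exceeds z_cut (the extreme-outlier rule in the text)."""
    mean, sd = scores.mean(), scores.std(ddof=1)
    return scores.clip(lower=mean - z_cut * sd, upper=mean + z_cut * sd)

# Hypothetical usage; the column name and the base-10 log transform for
# Cogstate reaction times are assumptions for illustration only.
df = pd.DataFrame({"detection_rt_ms": [310.0, 295.0, 402.0, 288.0, 1450.0]})
df["detection_rt_log"] = np.log10(df["detection_rt_ms"])
df["detection_rt_log"] = winsorize_extreme(df["detection_rt_log"])
```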

Demographics

Demographic data were examined for group differences using independent-samples t-tests on continuous variables and chi-square statistics on categorical variables (see Table 1 for sample characteristics).

Table 1. Sample characteristics

Note: Chi-square tests were used for categorical variables and t-tests for continuous variables. HC = healthy controls; MCI = mild cognitive impairment; MMSE = Mini-Mental State Examination; p-value = level of significance; M/SD = mean/standard deviation; n = number of participants.

Intraclass correlation coefficients

To assess the degree of correlation and the agreement between measurement timepoints (or retest reliability), two-way mixed intraclass correlation coefficients (ICCs) with 95% confidence intervals (CIs) were run using an absolute definition of agreement for the total sample, healthy controls, and those with MCI (McGraw & Wong, Reference McGraw and Wong1996). ICCs were interpreted using the guidelines set forth by Koo and Li (Reference Koo and Li2016), who recommended that CIs be interpreted rather than the single correlation coefficient, and defined values less than .40 as low reliability, values between .40 and .74 as moderate, values between .75 and .89 as good, and values greater than .89 as excellent (Tables 2 and 3).
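As an illustration of this analysis, the sketch below computes a single-measure, absolute-agreement ICC with its 95% CI using the pingouin Python package, on assumed long-format data with hypothetical values. Pingouin reports the absolute-agreement, single-measure coefficient (ICC(A,1) in McGraw & Wong's notation, which takes the same computational form under the two-way mixed model) in its "ICC2" row.

```python
import pandas as pd
import pingouin as pg

# Long format: one row per participant x timepoint (values hypothetical).
long = pd.DataFrame({
    "participant": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "timepoint":   [1, 2] * 5,
    "score":       [48, 51, 55, 54, 42, 47, 60, 58, 50, 53],
})

icc = pg.intraclass_corr(data=long, targets="participant",
                         raters="timepoint", ratings="score")
# The absolute-agreement, single-measure coefficient and its 95% CI:
print(icc.loc[icc["Type"] == "ICC2", ["ICC", "CI95%"]])
```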

Table 2. Intraclass correlation coefficients examining retest reliability up to 4 months for NIH Toolbox-Cognition Battery and Cogstate Brief Battery for laptop

Note: Scores used were the a priori norm-adjusted (age, sex, race/ethnicity, and education) t-scores (M = 50, SD = 10) available through NIH Toolbox-Cognition Battery for laptop and log-transformed scores derived from raw scores for Cogstate Brief Battery for laptop. ICC = intraclass correlation coefficient; CI = confidence interval; p-value = level of significance; RT = reaction time.

Table 3. Paired-sample t-tests examining practice effects for NIH Toolbox-Cognition Battery and Cogstate Brief Battery for laptop

Note: Scores used were the a priori norm-adjusted (age, sex, race/ethnicity, and education) t-scores (M = 50, SD = 10) available through NIH Toolbox-Cognition Battery for laptop and log-transformed scores derived from raw scores for Cogstate Brief Battery for laptop. t = t-test statistic; M(SD)1 = mean (standard deviation) at testing timepoint 1; M(SD)2 = mean (standard deviation) at testing timepoint 2; r12 = Pearson correlation between timepoint 1 and timepoint 2; p-value = level of significance; d = Cohen’s measure of sample effect size; RT = reaction time.

Bland-Altman method

The Bland-Altman method was used to test for changes in the mean between the two test occasions and to inspect for systematic bias and limits of agreement. Specifically, mean differences and 95% CIs for the limits of agreement were calculated (Bland & Altman, Reference Bland and Altman1995); bias was defined as present when all observations lay to one side of the line of equality, that is, when the 95% CI of the mean difference did not include zero (Table 4).
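A minimal sketch of the Bland-Altman computation is shown below, using hypothetical paired scores; the exact CI conventions used by the authors may differ in detail.

```python
import numpy as np

def bland_altman(t1: np.ndarray, t2: np.ndarray):
    """Mean difference (bias) between timepoints, 95% limits of agreement,
    and a 95% CI around the bias."""
    diff = t2 - t1
    bias = diff.mean()
    sd = diff.std(ddof=1)
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)      # 95% limits of agreement
    se = sd / np.sqrt(diff.size)
    bias_ci = (bias - 1.96 * se, bias + 1.96 * se)  # if this excludes 0, a
    return bias, loa, bias_ci                       # systematic shift is suggested

# Hypothetical scores at the two timepoints:
t1 = np.array([48.0, 55.0, 42.0, 60.0, 50.0])
t2 = np.array([51.0, 54.0, 47.0, 58.0, 53.0])
print(bland_altman(t1, t2))
```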

Table 4. Values used to calculate reliable change between timepoints 1 and 2 for NIH Toolbox-Cognition Battery and Cogstate Brief Battery for laptop

Note: Scores used were the a priori norm-adjusted (age, sex, race/ethnicity, and education) t-scores (M = 50, SD = 10) available through NIH Toolbox-Cognition Battery for laptop and log-transformed scores derived from raw scores for Cogstate Brief Battery for laptop. SEM1 = standard error of measurement at testing timepoint 1; SEM2 = standard error of measurement at testing timepoint 2; M(SD)diff = mean (standard deviation) of the difference; SE = standard error of the difference; RT = reaction time.

Paired sample t-tests

Paired sample t-tests with 95% CIs were used to evaluate for practice effects upon retest in the total sample, healthy controls, and those with MCI. Cohen’s d was used to measure effect size and results were interpreted using benchmarks suggested by Cohen (Reference Cohen1988) with ±0.2 as small, ±0.5 as medium, and ±0.8 as large (Table 5).
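The sketch below illustrates this analysis for one hypothetical variable. Note that Cohen's d for paired data can be computed in several ways; the difference-score convention used here is an assumption, as the article does not specify which variant was used.

```python
import numpy as np
from scipy import stats

def practice_effect(t1: np.ndarray, t2: np.ndarray):
    """Paired t-test of timepoint 2 vs. timepoint 1, plus a paired-samples
    Cohen's d computed from the difference scores (one common convention)."""
    t_stat, p = stats.ttest_rel(t2, t1)
    diff = t2 - t1
    d = diff.mean() / diff.std(ddof=1)
    return t_stat, p, d

# Hypothetical t-scores at the two timepoints:
t1 = np.array([48.0, 55.0, 42.0, 60.0, 50.0])
t2 = np.array([51.0, 54.0, 47.0, 58.0, 53.0])
print(practice_effect(t1, t2))
```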

Table 5. Reliable change confidence intervals for NIH Toolbox-Cognition Battery and Cogstate Brief Battery for laptop

Note: Scores used were the a priori norm-adjusted (age, sex, race/ethnicity, and education) t-scores (M = 50, SD = 10) available through NIH Toolbox-Cognition Battery for laptop and log-transformed scores derived from raw scores for Cogstate Brief Battery for laptop. Confidence intervals (70%, 80%, 90%) were calculated by multiplying the standard error of the difference between performance at testing timepoints 1 and 2 by the corresponding z-score. If a retest score for a given variable changes by the provided amount or more (either positive or negative), that score is indicative of worsening or improvement. For example, a person whose score worsened with retesting by the amount shown (or more) for a given variable would exceed the worsening in scores experienced by 85, 90, or 95% of the sample, respectively. CI = confidence interval; RT = reaction time.

Reliable change

Reliable change methodology was used to calculate reliable change CIs that can be used to assess whether a change in score after retest is reliable and meaningful (Chelune et al., Reference Chelune, Naugle, Luders, Sedlak and Awad1993; Iverson, Reference Iverson2001). Specifically, the standard error of the difference score (SEdiff) was calculated from the standard error of measurement at initial testing (SEM1) and at retest (SEM2), per guidelines set by Iverson (Reference Iverson2001). That is, SEdiff = √(SEM1² + SEM2²), where SEM1 = SD1·√(1 − r12) and SEM2 = SD2·√(1 − r12), with SD1 and SD2 referring to the standard deviations at test and retest, respectively, and r12 referring to the test-retest correlation between time 1 and time 2. The SEdiff was then multiplied by a z-score to arrive at CIs (70, 80, 90%).
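The following sketch transcribes these formulas directly into Python; the standard deviations and retest correlation passed in are hypothetical, and the z-multipliers are the standard two-sided values for 70, 80, and 90% intervals.

```python
import numpy as np

def reliable_change_cutoffs(sd1: float, sd2: float, r12: float) -> dict:
    """SEdiff per Iverson (2001), scaled by two-sided z-multipliers
    to yield 70/80/90% reliable-change cutoffs."""
    sem1 = sd1 * np.sqrt(1 - r12)
    sem2 = sd2 * np.sqrt(1 - r12)
    se_diff = np.sqrt(sem1 ** 2 + sem2 ** 2)
    z = {"70%": 1.04, "80%": 1.28, "90%": 1.64}
    return {level: round(zv * se_diff, 2) for level, zv in z.items()}

# Hypothetical inputs on the t-score metric (M = 50, SD = 10):
print(reliable_change_cutoffs(sd1=10.0, sd2=9.5, r12=0.85))
```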

Reliable change CIs can be directly referenced to determine whether a change in score after retest on a given variable is reliable and meaningful. For use, examiners calculate a difference score (i.e., t-score at retest minus t-score at initial testing) using their patient or participant score(s). Norm-adjusted t-scores provided by NIHTB-CB should be used to calculate these difference scores, and log-transformed raw scores should be used for Cogstate. Difference scores for a given variable that change by the amount presented in Table 5 or more (either positive or negative) are indicative of worsening or improvement. For example, a person who changes by the amount provided or greater would exceed the change in scores experienced by 85, 90, or 95% of the sample, respectively.
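For illustration, a hypothetical helper that applies such a cutoff to a difference score might look as follows; the cutoff value is invented rather than taken from Table 5, and for reaction-time variables, where lower scores are better, the labels would be reversed.

```python
def classify_change(score_t1: float, score_t2: float, cutoff: float) -> str:
    """Compare a retest difference score (t2 - t1) to a reliable-change cutoff.
    Cutoff is a hypothetical value standing in for a Table 5 entry."""
    diff = score_t2 - score_t1
    if diff >= cutoff:
        return "reliable improvement"
    if diff <= -cutoff:
        return "reliable worsening"
    return "no reliable change"

# e.g., an NIHTB-CB t-score falling from 52 to 43 against an assumed cutoff of 8.2:
print(classify_change(52.0, 43.0, cutoff=8.2))  # "reliable worsening"
```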

Table 6. Correlation between difference scores and the difference in days between testing timepoints on the NIH Toolbox-Cognition Battery and Cogstate Brief Battery for laptop

Note: Scores used were the a priori norm-adjusted (age, sex, race/ethnicity, and education) t-scores (M = 50, SD = 10) available through NIH Toolbox-Cognition Battery for laptop and log-transformed scores derived from raw scores for Cogstate Brief Battery for laptop. Difference scores were calculated by subtracting the score at testing timepoint two from the score at timepoint one for a given variable for each participant. The Pearson correlation coefficient was used to evaluate the correlation between difference scores and the difference in days between testing timepoints. r = Pearson correlation coefficient; p-value = level of significance; RT = reaction time.

Pearson correlation coefficient

The relationship between testing interval and performance was examined using the Pearson correlation coefficient. Specifically, the testing performance difference scores (derived by subtracting testing timepoint two from timepoint one for a given variable) and the length in days between testing timepoints were used. Results were interpreted using benchmarks suggested by Cohen (Reference Cohen1988) with ±0.1 as small, ±0.3 as medium, and ±0.5 as large.
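A minimal sketch of this correlation, with hypothetical difference scores and retest intervals:

```python
import numpy as np
from scipy import stats

# Hypothetical per-participant values: difference scores (timepoint 1 minus
# timepoint 2, per the text) and the retest interval in days.
diff_scores = np.array([-3.0, 1.5, -2.0, 0.5, -4.0, -1.0])
interval_days = np.array([14, 33, 47, 63, 90, 120])

r, p = stats.pearsonr(interval_days, diff_scores)
print(f"r = {r:.2f}, p = {p:.3f}")
```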

Results

Demographics

Sample characteristics for the total sample, healthy controls, and those with MCI (combined variable aMCI and naMCI) can be found in Table 1. All participants identified as B/AA. No significant differences were found between healthy controls and those with MCI for age, sex, or level of computer anxiety. A significant difference was found between healthy controls and those with MCI for education, with healthy controls having approximately one more year of schooling on average than those with MCI. A significant difference in the total Mini-Mental State Exam score between healthy controls and those with MCI was found; specifically, healthy controls scored approximately one point higher on average than those with MCI. While there was no significant difference between healthy controls and those with MCI in terms of sex, the majority of the sample was female. Those with a diagnosis of aMCI were compared to those with a diagnosis of naMCI on demographic variables and no significant differences were noted in age, sex, or education between the two groups. Additionally, there was no statistical difference between those with aMCI versus those with naMCI on the Mini-Mental State Examination total score or level of computer anxiety in our sample.

NIHTB-CB

Intraclass correlation coefficients

When examining retest reliabilities in the total sample, healthy controls, and those with MCI on the NIHTB-CB, the Crystallized Composite (ICCs = .87–.93, 95% CIs [.74–.96]) demonstrated the highest reliability, followed by the Total Composite (ICCs = .81–.91, CIs [.42–.96]) and the Fluid Composite (ICCs = .81–.85, CIs [.37–.93]; see Table 2). The reliability of individual NIHTB-CB measures for the total sample varied from moderate to excellent. For healthy controls, the individual tests comprising the Crystallized Composite both demonstrated good to excellent reliability and were more consistent (both ICCs = .86, CIs [.75–.92]) than the subtests of the Fluid Composite, which ranged from moderate to good (ICCs = .69–.85, CIs [.46–.92]). Individual subtests were less consistent within and across the composites for those with MCI, with reliabilities ranging from low to excellent (ICCs = .01–.89, CIs [−1.00–.95]). Retest reliabilities were also calculated for aMCI and naMCI separately. Though caution is warranted given the lack of power, we found that reliabilities for those with aMCI and naMCI were similar on the Total Composite and Crystallized Composite, as well as on many of the subtests comprising the composites. However, those with aMCI were somewhat less reliable than those with naMCI on Picture Vocabulary (aMCI ICC = .66; naMCI ICC = .81) and much less reliable on List Sorting Working Memory (aMCI ICC = .15; naMCI ICC = .63) and Picture Sequence Memory (aMCI ICC = .10; naMCI ICC = .64). Thus, it appears that those with aMCI were driving the poor reliabilities on these measures reported in Table 2.

Bland-Altman method

Applying the Bland-Altman method to the three NIHTB-CB composites revealed that, for the total sample, healthy controls, and those with MCI, only the Crystallized Composite was not significant and included zero within the 95% CI; these findings indicate that there was no proportional or systematic bias in the Crystallized Composite across samples. Of the seven individual NIHTB-CB subtests, Picture Vocabulary, Oral Reading Recognition, and List Sorting Working Memory were the only subtests that were not significant and included zero within the 95% CI when the Bland-Altman method was applied to the total sample and healthy controls. For those with MCI, the pattern was similar, except that Picture Sequence Memory was additionally not significant and included zero within the 95% CI.

Paired sample t-tests

On the paired sample t-tests comparing NIHTB-CB testing timepoint two to testing timepoint one, the Total Composite and the Fluid Composite were significant for the total sample, healthy controls, and those with MCI (see Table 3); this demonstrates significant practice effects (an increase in scores) with medium effect sizes. Conversely, the NIHTB-CB Crystallized Composite was not significant across samples and had small effect sizes, thereby demonstrating less susceptibility to practice effects. Of the individual NIHTB-CB subtests, the two subtests comprising the Crystallized Composite (Picture Vocabulary and Oral Reading Recognition) were not significant across samples and had small effect sizes. In the total sample and the healthy controls, all Fluid Composite subtests except List Sorting Working Memory (which was not significant and had small effect sizes) differed significantly; specifically, improved performances (or practice effects) were seen on Flanker Inhibitory Control and Attention, Dimensional Change Card Sort, Pattern Comparison Processing Speed, and Picture Sequence Memory for healthy controls. For those with MCI, the fluid measures Flanker Inhibitory Control and Attention, Dimensional Change Card Sort, and Pattern Comparison Processing Speed significantly improved, while Picture Sequence Memory and List Sorting Working Memory did not differ significantly between testing timepoints. Effect sizes for fluid subtest measures across samples ranged from small to medium.

Reliable change

Values used to calculate reliable change between timepoints 1 and 2 are listed in Tables 3 and 4. Reliable change CIs that can be used as cut scores to interpret reliable change are provided in Table 5. See the Statistical Analyses section of this paper for more information regarding the interpretation of reliable change.

Pearson correlation coefficient

Table 6 shows correlations between the difference in days between test administrations and the difference in testing performance between testing timepoints for the total sample, healthy controls, and those with MCI on the NIHTB-CB. Though the Fluid Composite and Pattern Comparison Processing Speed were significant at the p < .05 level in the total sample, a low degree of correlation was noted, with correlations ranging from −.28 to .06. For healthy controls, no correlations were significant, and correlations ranged from −.27 to .21. All correlations for those with MCI were negative, and many were low and not significant (ranging from −.18 to −.09), but a moderate degree of correlation was found between the difference in days between administrations and the difference in testing performance for the NIHTB-CB Total Composite, Fluid Composite, and Pattern Comparison Processing Speed.

Cogstate

Intraclass correlation coefficients

When examining retest reliabilities for the total sample, healthy controls, and those with MCI, the individual Cogstate subtests had CIs that ranged from low to excellent reliability (see Table 2). Healthy controls demonstrated consistently lower retest reliability (low to good; ICCs = .47–.76, CIs [.05–.86]) than those with MCI (low to excellent; ICCs = .65–.89, CIs [.29–.95]) on all Cogstate subtests. One Back Accuracy was the least reliable of the Cogstate measures for both those with MCI and healthy controls. An ad hoc analysis of One Back reaction time was conducted due to the skewed distribution of the One Back Accuracy subtest. One Back reaction time had a more normal distribution and significant reliabilities (all p < .001), ranging from moderate to excellent across samples (ICCs = .82–.84, CIs [.63–.91]; see Table 2). Reliabilities were also calculated for naMCI and aMCI separately. Though caution is warranted given the lack of power, we found that those with aMCI and naMCI demonstrated similar reliabilities on Cogstate subtests, with the exception of One Back Accuracy (aMCI ICC = .58; naMCI ICC = .81) and One Back reaction time (aMCI ICC = .73; naMCI ICC = .91).

Bland-Altman method

Of the four subtests that comprise Cogstate, only Identification reaction time was significant and did not contain zero within the 95% CI when applying the Bland-Altman method across samples (total sample, healthy controls, MCI). This indicates proportional or systematic bias in this subtest but not in the other three subtests across samples.

Paired sample t-tests

When comparing Cogstate testing timepoint two to testing timepoint one using paired sample t-tests, Identification reaction time differed significantly, with medium effect sizes, for the total sample, healthy controls, and those with MCI (see Table 3). Specifically, participants demonstrated practice effects, completing the task faster on average at the second administration across samples. Healthy controls significantly improved their performance on One Back reaction time, whereas those with MCI did not. Detection reaction time, One Card Learning Accuracy, and One Back Accuracy did not differ significantly between testing timepoints in any sample.

Reliable change

Values used to calculate reliable change between timepoints 1 and 2 are listed in Tables 3 and 4. Reliable change CIs that can be used as cut scores to interpret reliable change are provided in Table 5. See the Statistical Analysis section of this paper for more information regarding the interpretation of reliable change.

Pearson correlation coefficient

A low degree of correlation was noted in the total sample, for healthy controls, and for those with MCI when correlating the difference in days between test administrations with the difference in testing performance between testing timepoints (see Table 6). Correlations ranged from r = −.23 to .28, and none were significant.

Discussion

NIHTB-CB

Across samples (total sample, healthy controls, those with MCI), the NIHTB-CB composite scores demonstrated good to excellent reliability up to 4 months in a B/AA sample. These findings are similar to prior research using healthy adults ages 20 to 85 in a majority White sample that retested participants between 7 and 21 days (Heaton et al., Reference Heaton, Akshoomoff, Tulsky, Mungas, Weintraub, Dikmen and Gershon2014). As hypothesized, and consistent with previous research, the NIHTB-CB Crystallized Composite was found to be more stable between testing timepoints and across samples than the Fluid Composite (Heaton et al., Reference Heaton, Akshoomoff, Tulsky, Mungas, Weintraub, Dikmen and Gershon2014; Scott et al., Reference Scott, Sorrell and Benitez2019). The Crystallized Composite was also shown to be the only composite free of systematic bias across samples when applying the Bland-Altman method (Bland & Altman, Reference Bland and Altman1995). These findings were not unexpected, as measures of vocabulary and reading are often found to be less susceptible to cognitive change and aging in adulthood (Heaton et al., Reference Heaton2004).

Consistent with a previous finding using healthy adults ages 20 to 85 in a majority White sample retested between 7 and 21 days, the NIHTB-CB individual subtests demonstrated moderate to excellent reliability for the total sample and healthy controls (Weintraub et al., Reference Weintraub, Dikmen, Heaton, Tulsky, Zelazo, Bauer and Gershon2013) but were less consistent for those with MCI. Healthy controls were the most reliable on crystallized measures and the least reliable on fluid skills, adding further support to previous findings describing “fluid” skills on the NIHTB-CB as more susceptible to practice effects in healthy adults than “crystallized” skills (Heaton et al., Reference Heaton, Akshoomoff, Tulsky, Mungas, Weintraub, Dikmen and Gershon2014; Scott et al., Reference Scott, Sorrell and Benitez2019). Those with MCI demonstrated a similar pattern on the composites. However, those with MCI showed greater variability than healthy controls in the reliabilities of the individual tests that make up the composites. For example, those with MCI demonstrated poor reliabilities and no significant benefit with retesting on a task of working memory and a test of episodic memory. While this finding is not consistent with our hypothesis that those with MCI would demonstrate moderate to excellent reliabilities, it is consistent with our hypothesis that those with MCI would be less susceptible to practice effects than healthy controls. The poorer reliabilities seen in those with MCI compared to healthy controls may be due to the heterogeneity of the sample. For example, our preliminary findings showed that those with aMCI were less reliable than those with naMCI on the working memory and episodic memory tasks in particular, suggesting that those with aMCI were driving the poor reliabilities found on these measures. Working memory deficits have been seen in both aMCI and naMCI when compared to healthy controls (Saunders & Summers, Reference Saunders and Summers2010; Klekociuk & Summers, Reference Klekociuk and Summers2014). However, differences between those with aMCI and naMCI on working memory tasks may depend on the type of working memory task and the level of impairment of the individual (Klekociuk & Summers, Reference Klekociuk and Summers2014). In a recent study exploring how well the NIHTB-CB and Cogstate differentiate those with aMCI from those with naMCI, working memory was not a significant predictor of disease type (Garcia et al., Reference Garcia, Askew, Kavcic, Shair, Bhaumik, Rose and Giordani2023). Thus, our finding that a task of working memory was notably less reliable for those with aMCI than for those with naMCI should be viewed with caution, as it may be due to individual differences in our sample or the relatively low sample size of those with naMCI. The finding that those with aMCI were less reliable than both healthy controls and those with naMCI on an episodic memory task is unsurprising, as memory is one of the earliest cognitive domains negatively impacted by cognitive decline (Bastin & Salmon, Reference Bastin and Salmon2014) and predominant memory dysfunction is the criterion that differentiates those with aMCI from those with naMCI (Petersen et al., Reference Petersen, Lopez, Armstrong, Getchius, Ganguli, Gloss and Rae-Grant2018).

We found that the length of the testing interval was not significantly associated with change between testing timepoints for healthy controls on the NIHTB-CB up to 4 months. Though not significant, we did find that as the length of the test interval increased, the change in test performance decreased on all fluid measures across samples. For those with MCI, the change on crystallized measures also decreased with longer intervals, though not significantly. The only significant finding was that, for those with MCI, the gain in performance on a visual processing speed test decreased significantly as the time between testing sessions increased. These findings are generally consistent with previous work showing that as the length of the retest interval increases, the difference between scores decreases (Calamia et al., Reference Calamia, Markon and Tranel2012; Hausknecht et al., Reference Hausknecht, Halpert, Di Paolo and Moriarty Gerrard2007; Salthouse, Reference Salthouse2004; Scharfen et al., Reference Scharfen, Blum and Holling2018).

Cogstate

We found that all subtests demonstrated moderate to good reliability in the total sample up to 4 months in a B/AA sample, which is generally consistent with prior research using predominantly White samples (Cole et al., Reference Cole, Arrieux, Schwab, Ivins, Qashu and Lewis2013; Falleti et al., Reference Falleti, Maruff, Collie and Darby2006; Fredrickson et al., Reference Fredrickson, Maruff, Woodward, Moore, Fredrickson, Sach and Darby2010; Lim et al., Reference Lim, Jaeger, Harrington, Ashwood, Ellis, Stöffler and Maruff2013). Similar to prior research with a majority White sample aged 60–96 years that compared healthy older adults to those with MCI and Alzheimer’s dementia with retesting at 1, 2, and 3 months (Lim et al., Reference Lim, Jaeger, Harrington, Ashwood, Ellis, Stöffler and Maruff2013), we found that healthy controls demonstrated lower retest reliability than those with MCI on all Cogstate subtests. Further, the finding that One Back Accuracy was the least reliable of the Cogstate measures for both healthy controls (Falleti et al., Reference Falleti, Maruff, Collie and Darby2006; Lim et al., Reference Lim, Jaeger, Harrington, Ashwood, Ellis, Stöffler and Maruff2013) and those with MCI was replicated (Lim et al., Reference Lim, Jaeger, Harrington, Ashwood, Ellis, Stöffler and Maruff2013). As hypothesized, healthy controls and those with MCI demonstrated similar practice effect profiles upon retest. Though participants did not significantly improve on a simple reaction time task, a measure of choice reaction time did show significant improvement across samples. This finding is not surprising considering that reaction time is generally known to improve with practice, and larger improvements are observed in more complex mental speed tasks than in simple ones (Scharfen et al., Reference Scharfen, Blum and Holling2018). Though caution should be used in interpretation due to a lack of power, preliminary findings demonstrated similar reliabilities on Cogstate subtests for both subtypes of MCI, apart from accuracy on a memory task. Retest reliability on this memory task was poorer for those with aMCI than for healthy controls and those with naMCI. This finding is consistent with the greater decline in memory expected in those with aMCI relative to both unimpaired individuals and those with naMCI (Bastin & Salmon, Reference Bastin and Salmon2014; Petersen et al., Reference Petersen, Lopez, Armstrong, Getchius, Ganguli, Gloss and Rae-Grant2018).

During the preliminary data analysis, One Back Accuracy was noted to have a negatively skewed distribution across samples with most performances in the highly accurate range – thereby demonstrating a ceiling effect and less opportunity for change. An ad hoc analysis was conducted evaluating One Back reaction time instead of Accuracy, and reaction time was found to have a more normal distribution and produced better reliabilities. Similarly, a previous study with cognitively healthy adults reported consistently better reliabilities for One Back reaction time versus One Back Accuracy (Falleti et al., Reference Falleti, Maruff, Collie and Darby2006), and another study of cognitively healthy older adults found excellent reliability when using One Back reaction time instead of One Back Accuracy (Fredrickson et al., Reference Fredrickson, Maruff, Woodward, Moore, Fredrickson, Sach and Darby2010). This suggests that One Back reaction time may be a more reliable measure when using Cogstate over multiple testing timepoints despite the manufacturer’s recommendation to use One Back Accuracy.

The amount of time between testing timepoints did not appear to be significantly related to performance on Cogstate across samples up to 4 months. This finding is consistent with a study that did not find a significant change in performance for healthy controls, those with MCI, or those with AD across 1-, 2-, and 3-month retest intervals (Lim et al., Reference Lim, Jaeger, Harrington, Ashwood, Ellis, Stöffler and Maruff2013), and with a study of healthy controls that did not find a significant difference in reaction time on the Detection subtest at 10-minute or 7-day retest intervals (Falleti et al., Reference Falleti, Maruff, Collie and Darby2006). Conversely, it was not consistent with a study of healthy controls that found a small improvement in group accuracy on the One Card Learning task from baseline to the 3-month retest that persisted at the 6-, 9-, and 12-month retests (Fredrickson et al., Reference Fredrickson, Maruff, Woodward, Moore, Fredrickson, Sach and Darby2010), or with a study of healthy controls that found significant improvement in One Back Accuracy at 10-minute and 7-day retest intervals (Falleti et al., Reference Falleti, Maruff, Collie and Darby2006).

Considerations for both measures

Though NIHTB-CB composite scores demonstrated strong retest reliability, Cogstate subtests appeared less susceptible on the whole to retest effects than NIHTB-CB subtests (particularly the fluid measures). Interestingly, those with MCI demonstrated better retest reliability than healthy controls on Cogstate, while healthy controls demonstrated higher retest reliability than those with MCI on the NIHTB-CB composites. This difference rests primarily on the lower retest reliability for those with MCI on a measure of working memory and a measure of episodic memory within the NIHTB-CB. With the exception of an episodic memory task, on which healthy controls improved significantly with retest while those with MCI did not, practice effect profiles were similar on NIHTB-CB subtests. Cogstate reliabilities for healthy controls were lower on each subtest than for those with MCI, but the practice effect profiles were similar between groups.

Reliable change methodology was used to assess whether a change in score after retest on a given variable is reliable and meaningful (Chelune et al., Reference Chelune, Naugle, Luders, Sedlak and Awad1993; Iverson, Reference Iverson2001). Reliable change CIs (70, 80, 90%) were provided for both NIHTB-CB and Cogstate. The resulting values serve as cutoffs indicative of reliable change that are easily translatable into research and clinical practice (indicating improvement, decline, or stability; Chelune et al., Reference Chelune, Naugle, Luders, Sedlak and Awad1993; Iverson, Reference Iverson2001). That is, a person whose score changes by the amount provided in Table 5 (or more) for a given variable would exceed the change in scores experienced by 85, 90, or 95% of the sample, respectively. For example, if a clinician or researcher used the 70% CI, a change score exceeding the cutoff would indicate a greater change than that seen in 85% of the present sample. The greater the CI value, the more conservative the cutoff (i.e., the higher its specificity). Practice effects can also be taken into account if desired (see Chelune et al., Reference Chelune, Naugle, Luders, Sedlak and Awad1993).

Limitations and future directions

One limitation of this study is that we were not able to include older adults over the age of 85 due to a lack of norms in the field for this age group. Other studies, such as the Advancing Reliable Measurement in Alzheimer’s Disease and Cognitive Aging Study, are addressing this problem by extending the NIHTB norms to those over the age of 85 (Weintraub et al., Reference Weintraub, Karpouzian‐Rogers, Peipert, Nowinski, Slotkin, Wortman, Ho, Rogalski, Carlsson, Giordani, Goldstein, Lucas, Manly, Rentz, Salmon, Snitz, Dodge, Riley, Eldes and Gershon2022). Due to the low incidence of naMCI observed in this sample, those with aMCI and naMCI were combined into one group (MCI). Although we did not have enough participants with a diagnosis of naMCI to make firm recommendations regarding reliability for this subgroup, we did find differences between those with aMCI and naMCI on the NIHTB-CB and Cogstate. We also found that the NIHTB-CB was better able to differentiate the two subtypes of MCI than Cogstate, which is consistent with a recent publication (Garcia et al., Reference Garcia, Askew, Kavcic, Shair, Bhaumik, Rose and Giordani2023). Future studies should attempt to recruit larger numbers of both classifications of MCI to reliably examine potential differences. Healthy controls in our study had approximately one more year of education on average than those with MCI. This finding is consistent with prior literature that has found higher education to be associated with less cognitive impairment (Heaton et al., Reference Heaton, Akshoomoff, Tulsky, Mungas, Weintraub, Dikmen and Gershon2014; Meng & D’Arcy, Reference Meng and D’arcy2012; Mungas et al., Reference Mungas, Shaw, Hayes-Larson, DeCarli, Farias, Olichney and Mayeda2021). Additionally, our sample as a whole was highly educated (averaging more than 14 years) and thus may not be generalizable to less educated individuals; however, it should be noted that scores used in the NIHTB-CB analyses were a priori norm-adjusted (age, sex, race/ethnicity, and education) scores provided by NIHTB-CB. When such norms are used appropriately, they offer greater diagnostic accuracy (Manly, Reference Manly2005) and have recently been shown to reduce the association between education and cognitive performance in a racially diverse sample (Mungas et al., Reference Mungas, Shaw, Hayes-Larson, DeCarli, Farias, Olichney and Mayeda2021). Though not unusual for aging research, significantly more females than males participated in this study; future studies might attempt to recruit equal numbers of males and females to study possible sex effects more directly in the context of the reliability of these measures.

Conclusions

Despite findings that older B/AAs are disproportionately more likely than older Whites to have dementia of any type (Dilworth-Anderson et al., 2008; Power et al., 2021; Steenland et al., 2016; Yaffe et al., 2013), previous studies examining the retest reliability of the NIHTB-CB and Cogstate for laptop were conducted using mostly White samples (Cole et al., 2013; Falleti et al., 2006; Fredrickson et al., 2010; Hammers et al., 2011; Heaton et al., 2014; Lim et al., 2013; Scott et al., 2019; Weintraub et al., 2013). This study therefore provides retest reliabilities and reliable change cutoffs for an all-B/AA sample of healthy controls and individuals with MCI. Although the retest reliabilities found in this study are similar to previous findings from mostly White samples, differences were noted, as were differences between healthy controls and those with MCI. It is therefore recommended that race and cognitive status be considered when using these measures. Overall, the NIHTB-CB and Cogstate for laptop both show promise for use in research with B/AAs and were reasonably stable for up to 4 months.

Although differences in reliability were noted between the two measures, the choice between them should reflect multiple considerations and the needs of a particular study. For example, while Cogstate takes considerably less time to administer, it relies heavily on reaction time and does not offer the breadth of cognitive measurement that the NIHTB-CB does. The NIHTB-CB also provides norm-adjusted scores for age, sex, race/ethnicity, and education, whereas Cogstate does not. Preliminary findings also demonstrated greater differentiation between MCI subtypes (aMCI and naMCI) with the NIHTB-CB than with Cogstate.

Acknowledgments

Specific thanks are given to the reviewers of this article; their input and recommendations helped to strengthen this paper.

Funding statement

This work was supported by the US National Institute on Aging (P30 AG053760, P30 AG072931) and the US National Institutes of Health (V.K., R01 AG054484, R21 AG046637).

Competing interests

None.

References

Alzheimer’s Association. (2022). More than normal aging: Understanding mild cognitive impairment [Special report]. 2022 Alzheimer’s disease facts and figures. Retrieved December 12, 2024, from https://www.alz.org/media/Documents/alzheimers-facts-and-figures-special-report-2022.pdf (see also https://www.alz.org/alzheimers-dementia/what-is-dementia/related_conditions/mild-cognitive-impairment)
Arevalo-Rodriguez, I., Smailagic, N., Roqué-Figuls, M., Ciapponi, A., Sanchez-Perez, E., Giannakou, A., & Cullum, S. (2021). Mini-Mental State Examination (MMSE) for the early detection of dementia in people with mild cognitive impairment (MCI). Cochrane Database of Systematic Reviews, 7, CD010783. https://doi.org/10.1002/14651858.CD010783.pub3
Bastin, C., & Salmon, E. (2014). Early neuropsychological detection of Alzheimer’s disease. European Journal of Clinical Nutrition, 68, 1192–1199.
Bland, J. M., & Altman, D. G. (1995). Comparing methods of measurement: Why plotting difference against standard method is misleading. The Lancet, 346, 1085–1087.
Calamia, M., Markon, K., & Tranel, D. (2012). Scoring higher the second time around: Meta-analyses of practice effects in neuropsychological assessment. The Clinical Neuropsychologist, 26, 543–570.
Chelune, G. J., Naugle, R. I., Luders, H., Sedlak, J., & Awad, I. A. (1993). Individual change after epilepsy surgery: Practice effects and base-rate information. Neuropsychology, 7, 41–52.
Clark, P. C., Kutner, N. G., Goldstein, F. C., Peterson-Hazen, S., Garner, V., Zhang, R., & Bowles, T. (2005). Impediments to timely diagnosis of Alzheimer’s disease in African Americans. Journal of the American Geriatrics Society, 53, 2012–2017.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Routledge Academic.
Cole, W. R., Arrieux, J. P., Schwab, K., Ivins, B. J., Qashu, F. M., & Lewis, S. C. (2013). Test-retest reliability of four computerized neurocognitive assessment tools in an active duty military population. Archives of Clinical Neuropsychology, 28, 732–742.
Darby, D., Maruff, P., Collie, A., & McStephen, M. (2002). Mild cognitive impairment can be detected by multiple assessments in a single day. Neurology, 59, 1042–1046.
Diaz-Orueta, U., Blanco-Campal, A., Lamar, M., Libon, D. J., & Burke, T. (2020). Marrying past and present neuropsychology: Is the future of the process-based approach technology-based? Frontiers in Psychology, 11, 361.
Dilworth-Anderson, P., Hendrie, H. C., Manly, J. J., Khachaturian, A. S., & Fazio, S. (2008). Diagnosis and assessment of Alzheimer’s disease in diverse populations. Alzheimer’s & Dementia, 4, 305–309.
Falleti, M. G., Maruff, P., Collie, A., & Darby, D. G. (2006). Practice effects associated with the repeated assessment of cognitive function using the CogState battery at 10-minute, one week and one month test-retest intervals. Journal of Clinical and Experimental Neuropsychology, 28, 1095–1112.
Folstein, M. F., Folstein, S. E., & McHugh, P. R. (1975). “Mini-mental state”: A practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research, 12, 189–198.
Fredrickson, J., Maruff, P., Woodward, M., Moore, L., Fredrickson, A., Sach, J., & Darby, D. (2010). Evaluation of the usability of a brief computerized cognitive screening test in older people for epidemiological studies. Neuroepidemiology, 34, 65–75.
Garcia, S., Askew, R. L., Kavcic, V., Shair, S., Bhaumik, A. K., Rose, E., & Giordani, B. (2023). Mild cognitive impairment subtype performance in comparison to healthy older controls on the NIH Toolbox and Cogstate. Alzheimer Disease & Associated Disorders, 37, 328–334.
Gauthier, S., Reisberg, B., Zaudig, M., Petersen, R. C., Ritchie, K., Broich, K., & Winblad, B. (2006). Mild cognitive impairment. The Lancet, 367, 1262–1270.
Gershon, R. C., Wagster, M. V., Hendrie, H. C., Fox, N. A., Cook, K. F., & Nowinski, C. J. (2013). NIH Toolbox for assessment of neurological and behavioral function. Neurology, 80(11, Suppl. 3), S2–S6. https://doi.org/10.1212/WNL.0b013e3182872e5f
Gianattasio, K. Z., Prather, C., Glymour, M. M., Ciarleglio, A., & Power, M. C. (2019). Racial disparities and temporal trends in dementia misdiagnosis risk in the United States. Alzheimer’s & Dementia: Translational Research & Clinical Interventions, 5, 891–898.
Hammers, D., Spurgeon, E., Ryan, K., Persad, C., Heidebrink, J., Barbas, N., & Giordani, B. (2011). Reliability of repeated cognitive assessment of dementia using a brief computerized battery. American Journal of Alzheimer’s Disease & Other Dementias, 26, 326–333.
Hausknecht, J. P., Halpert, J. A., Di Paolo, N. T., & Moriarty Gerrard, M. O. (2007). Retesting in selection: A meta-analysis of coaching and practice effects for tests of cognitive ability. Journal of Applied Psychology, 92(2), 373. https://doi.org/10.1037/0021-9010.92.2.373
Heaton, R. K. (2004). Revised comprehensive norms for an expanded Halstead-Reitan Battery: Demographically adjusted neuropsychological norms for African American and Caucasian adults, professional manual. Psychological Assessment Resources.
Heaton, R. K., Akshoomoff, N., Tulsky, D., Mungas, D., Weintraub, S., Dikmen, S., & Gershon, R. (2014). Reliability and validity of composite scores from the NIH Toolbox Cognition Battery in adults. Journal of the International Neuropsychological Society, 20, 588–598.
Iverson, G. L. (2001). Interpreting change on the WAIS-III/WMS-III in clinical samples. Archives of Clinical Neuropsychology, 16, 183–191.
Klekociuk, S. Z., & Summers, M. J. (2014). Lowered performance in working memory and attentional sub-processes are most prominent in multi-domain amnestic mild cognitive impairment subtypes. Psychogeriatrics, 14, 63–71.
Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15, 155–163.
Lim, Y. Y., Jaeger, J., Harrington, K., Ashwood, T., Ellis, K. A., Stöffler, A., & Maruff, P. (2013). Three-month stability of the CogState Brief Battery in healthy older adults, mild cognitive impairment, and Alzheimer’s disease: Results from the Australian Imaging, Biomarkers, and Lifestyle-Rate of Change Substudy (AIBL-ROCS). Archives of Clinical Neuropsychology, 28, 320–330.
Lin, P. J., Daly, A. T., Olchanski, N., Cohen, J. T., Neumann, P. J., Faul, J. D., & Freund, K. M. (2021). Dementia diagnosis disparities by race and ethnicity. Medical Care, 59, 679–686.
Manly, J. J. (2005). Advantages and disadvantages of separate norms for African Americans. The Clinical Neuropsychologist, 19(2), 270–275. https://doi.org/10.1080/13854040590945346
Maruff, P., Lim, Y. Y., Darby, D., Ellis, K. A., Pietrzak, R. H., Snyder, P. J., & Masters, C. L. (2013). Clinical utility of the Cogstate Brief Battery in identifying cognitive impairment in mild cognitive impairment and Alzheimer’s disease. BMC Psychology, 1, 1–11.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30–46.
Meng, X., & D’Arcy, C. (2012). Education and dementia in the context of the cognitive reserve hypothesis: A systematic review with meta-analyses and qualitative analyses. PLoS ONE, 7(6), e38268. https://doi.org/10.1371/journal.pone.0038268
Mungas, D., Shaw, C., Hayes-Larson, E., DeCarli, C., Farias, S. T., Olichney, J., & Mayeda, E. R. (2021). Cognitive impairment in racially/ethnically diverse older adults: Accounting for sources of diagnostic bias. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring, 13, e12265.
Pandya, S. Y., Clem, M. A., Silva, L. M., & Woon, F. L. (2016). Does mild cognitive impairment always lead to dementia? A review. Journal of the Neurological Sciences, 369, 57–62.
Petersen, R. C., Lopez, O., Armstrong, M. J., Getchius, T. S., Ganguli, M., Gloss, D., & Rae-Grant, A. (2018). Practice guideline update summary: Mild cognitive impairment: Report of the Guideline Development, Dissemination, and Implementation Subcommittee of the American Academy of Neurology. Neurology, 90, 126–135.
Portney, L. G., & Watkins, M. P. (2009). Foundations of clinical research: Applications to practice. Pearson/Prentice Hall.
Power, M. C., Bennett, E. E., Turner, R. W., Dowling, N. M., Ciarleglio, A., Glymour, M. M., & Gianattasio, K. Z. (2021). Trends in relative incidence and prevalence of dementia across non-Hispanic Black and White individuals in the United States, 2000-2016. JAMA Neurology, 78, 275–284.
Ranson, J. M., Kuźma, E., Hamilton, W., Muniz-Terrera, G., Langa, K. M., & Llewellyn, D. J. (2019). Predictors of dementia misclassification when using brief cognitive assessments. Neurology: Clinical Practice, 9, 109–117.
Salthouse, T. A., Schroeder, D. H., & Ferrer, E. (2004). Estimating retest effects in longitudinal assessments of cognitive functioning in adults between 18 and 60 years of age. Developmental Psychology, 40, 813–822. https://doi.org/10.1037/0012-1649.40.5.813
Saunders, N. L., & Summers, M. J. (2010). Attention and working memory deficits in mild cognitive impairment. Journal of Clinical and Experimental Neuropsychology, 32, 350–357.
Scharfen, J., Blum, D., & Holling, H. (2018). Response time reduction due to retesting in mental speed tests: A meta-analysis. Journal of Intelligence, 6, 6.
Scott, E. P., Sorrell, A., & Benitez, A. (2019). Psychometric properties of the NIH Toolbox Cognition Battery in healthy older adults: Reliability, validity, and agreement with standard neuropsychological tests. Journal of the International Neuropsychological Society, 25, 857–867.
Spering, C. C., Hobson, V., Lucas, J. A., Menon, C. V., Hall, J. R., & O’Bryant, S. E. (2012). Diagnostic accuracy of the MMSE in detecting probable and possible Alzheimer’s disease in ethnically diverse highly educated individuals: An analysis of the NACC database. Journals of Gerontology Series A: Biomedical Sciences and Medical Sciences, 67, 890–896.
Steenland, K., Goldstein, F. C., Levey, A., & Wharton, W. (2016). A meta-analysis of Alzheimer’s disease incidence and prevalence comparing African-Americans and Caucasians. Journal of Alzheimer’s Disease, 50, 71–76.
Tombaugh, T. N., & McIntyre, N. J. (1992). The Mini-Mental State Examination: A comprehensive review. Journal of the American Geriatrics Society, 40, 922–935.
Tsoi, K. K., Chan, J. Y., Hirai, H. W., Wong, S. Y., & Kwok, T. C. (2015). Cognitive tests to detect dementia: A systematic review and meta-analysis. JAMA Internal Medicine, 175, 1450–1458.
Ward, A., Tardiff, S., Dye, C., & Arrighi, H. M. (2013). Rate of conversion from prodromal Alzheimer’s disease to Alzheimer’s dementia: A systematic review of the literature. Dementia and Geriatric Cognitive Disorders Extra, 3, 320–332.
Weintraub, S., Dikmen, S. S., Heaton, R. K., Tulsky, D. S., Zelazo, P. D., Bauer, P. J., & Gershon, R. C. (2013). Cognition assessment using the NIH Toolbox. Neurology, 80, S54–S64.
Weintraub, S., Karpouzian‐Rogers, T., Peipert, J. D., Nowinski, C., Slotkin, J., Wortman, K., Ho, E., Rogalski, E., Carlsson, C., Giordani, B., Goldstein, F., Lucas, J., Manly, J. J., Rentz, D., Salmon, D., Snitz, B., Dodge, H. H., Riley, M., Eldes, F., ... Gershon, R. (2022). ARMADA: Assessing Reliable Measurement in Alzheimer’s Disease and cognitive aging project methods. Alzheimer’s & Dementia, 18(8), 1449–1460. https://doi.org/10.1002/alz.12497
Weintraub, S., Salmon, D., Mercaldo, N., Ferris, S., Graff-Radford, N. R., Chui, H., & Morris, J. C. (2009). The Alzheimer’s Disease Centers’ Uniform Data Set (UDS): The neuropsychological test battery. Alzheimer Disease and Associated Disorders, 23, 91–101.
Wild, K. V., Mattek, N. C., Maxwell, S. A., Dodge, H. H., Jimison, H. B., & Kaye, J. A. (2012). Computer-related self-efficacy and anxiety in older adults with and without mild cognitive impairment. Alzheimer’s & Dementia, 8, 544–552.
Yaffe, K., Falvey, C., Harris, T. B., Newman, A., Satterfield, S., Koster, A., & Simonsick, E. (2013). Effect of socioeconomic disparities on incidence of dementia among biracial older adults: Prospective study. BMJ, 347, f7051.
Table 1. Sample characteristics

Table 2. Intraclass correlation coefficients examining retest reliability up to 4 months for NIH Toolbox-Cognition Battery and Cogstate Brief Battery for laptop

Table 3. Paired-sample t-tests examining practice effects for NIH Toolbox-Cognition Battery and Cogstate Brief Battery for laptop

Table 4. Values used to calculate reliable change between timepoints 1 and 2 for NIH Toolbox-Cognition Battery and Cogstate Brief Battery for laptop

Table 5. Reliable change confidence intervals for NIH Toolbox-Cognition Battery and Cogstate Brief Battery for laptop

Table 6. Correlation between difference scores and the difference in days between testing timepoints on the NIH Toolbox-Cognition Battery and Cogstate Brief Battery for laptop