Introduction
Alzheimer’s disease and Alzheimer’s disease-related dementias (AD/ADRDs) currently affect over 55 million people globally, with numbers projected to increase to 139 million by 2050. Widespread detection and diagnosis during the early stages of the disease are increasingly seen as critical to maximizing the effectiveness of intervention and treatment. Despite this, most individuals are currently diagnosed after the onset of dementia, and only half ever receive a diagnosis from a clinician (Amjad et al., 2018; Hampel et al., 2022). Challenges in assessment and diagnosis may stem from several sources, such as the limited availability of qualified AD/ADRD specialists; time-, cost-, and distance-related barriers for patients; and the limited ability of traditional neuropsychological tests to detect subtle cognitive decline. These factors present significant obstacles to early detection and disease monitoring of AD/ADRD in community settings and in clinical trials investigating novel therapeutic agents.
New methods of cognitive assessment are urgently needed to keep pace with the rapidly evolving biomarkers now being used for early detection of AD. Digital technologies for assessing cognition may help overcome many of the limitations of traditional paper-and-pencil testing and can also support the transition toward partially remote or “decentralized” AD/ADRD clinical trials, which have the potential to increase accessibility and reduce participant burden (Leroy et al., 2023). Remote cognitive assessment using smartphones has several potential advantages over traditional assessment methods, including increased accessibility, improved ecological validity, and the ability to support high-frequency assessment (HFA) (Öhman et al., 2021). Smartphone-based cognitive assessment platforms can also track environmental, health, and behavioral factors that may impact the validity of cognitive performance (Emert et al., 2023; Scott et al., 2015; Wilks et al., 2021). HFA can improve the reliability and validity of cognitive assessment by averaging performance across multiple trials (Sliwinski et al., 2018). HFA also allows researchers to track person-specific changes and patterns that may signal decline (e.g., increased cognitive variability, poor learning curves), an approach consistent with personalized medicine and one that may also reduce the time needed to detect therapeutic efficacy and/or reach clinical trial endpoints (Dodge et al., 2015; Weizenbaum et al., 2023).
Recent research suggests that mobile cognitive assessments are acceptable and feasible for use in older adult populations (Koo & Vizer, 2019; Nicosia et al., 2023; Papp et al., 2021; Thompson et al., 2022). However, research on the reliability and validity of unsupervised mobile cognitive assessments is still evolving. Preliminary work from Thompson et al. and others suggests that Mobile Monitoring of Cognitive Change (M2C2), a set of smartphone assessments developed for the National Institute on Aging’s Mobile Toolbox initiative, demonstrates reliability (.89 or higher) comparable to or better than traditional neuropsychological assessments (Sliwinski et al., 2018; Thompson et al., 2022). For example, reliabilities for the Montreal Cognitive Assessment (MoCA) range from .50 to .71, while reliability of the Mini-Mental State Examination can be as low as .35 (Bernstein et al., 2011; Spencer et al., 2013). Regarding validity, prior work by Thompson et al. demonstrated that smartphone-based cognitive tests were superior to the MoCA in detecting cerebral amyloid status (Thompson et al., 2023). Additionally, Nicosia et al. (2023) demonstrated associations of smartphone-based cognitive tests with 1) traditional neuropsychological testing, 2) amyloid and tau positron emission tomography, 3) cerebrospinal fluid markers of amyloid beta (Aβ) 40, Aβ42, total tau, and tau phosphorylated at threonine 181, and 4) cortical thickness in AD-related regions of interest.
Despite these promising initial findings, several challenges must be addressed before smartphone-based cognitive assessments can be widely implemented in clinical and research settings. Successful implementation of unsupervised, smartphone-based cognitive assessment requires adequate protocol adherence and engagement. Attrition is a common problem in remote assessment studies, with important implications if smartphone-based testing protocols are to be used for longitudinal disease monitoring (Pratap et al., 2020). Additionally, it is critical to demonstrate convergence with existing reference-standard neuropsychological tests, particularly measures sensitive to ADRD neuropathology, such as the Free and Cued Selective Reminding Test and the Preclinical Alzheimer’s Cognitive Composite, which are commonly used for early-stage AD detection and monitoring (Papp et al., 2017; Schindler et al., 2017). It is therefore important to investigate validity considerations that may influence the implementation of smartphone testing protocols for both clinical care and research.
The present analysis examined two novel hypotheses addressing the validity of smartphone-based cognitive assessments in a sample of 120 cognitively unimpaired older adults. First, we hypothesized that the smartphone tasks would demonstrate convergence with same-domain standard neuropsychological tests, as well as divergence from different-domain tests. Second, we evaluated two factors hypothesized to affect adherence and performance on remote HFA: time of day and anticipation of feedback.
Specifically, we hypothesized that better adherence and performance would be seen for morning vs. evening sessions. This was based on prior evidence of cognitive “sundowning” captured on remote assessments, with better performance seen in the morning (Wilks et al., 2021), as well as the broader aging literature, which indicates an increased likelihood of time-specific cognitive variability with both age and neurodegenerative pathology (Anderson et al., 2014; Musiek et al., 2018).
Additionally, we hypothesized that anticipation of feedback on cognitive test performance may increase engagement and performance on remote tests. Research has shown that feedback affects task performance and motivation, and feedback is therefore typically not provided during controlled testing in neuropsychological practice (Clark et al., 2024; Di Rosa et al., 2015). However, providing patient feedback after testing is routine, as is providing feedback in neurorehabilitation programs and commercial brain-game apps to maintain engagement and motivation (Burgers et al., 2015; van Dokkum et al., 2015). One way to potentially impart these benefits of feedback without disrupting controlled, real-time testing conditions is to tell participants what they can expect to gain from the assessment (i.e., to let them know they will receive feedback after testing is complete). To our knowledge, ours is the first study to examine how the anticipation of receiving feedback about one’s performance might relate to subsequent engagement and performance on self-administered cognitive tasks, which we did by randomly assigning participants to either anticipated or surprise feedback conditions.
Finally, we provide a full-sample update of the within-subject reliabilities for M2C2 HFA, originally reported in the study’s preliminary (n = 52) reliability findings published in 2022 (Thompson et al., 2022).
Methods
Participants and recruitment
Participants were cognitively unimpaired older adults between the ages of 60 and 80, recruited from the Butler Alzheimer’s Prevention Registry, a local database of older adults interested in AD research at the Butler Hospital Memory and Aging Program. A total of 256 individuals were invited to the study by email or phone call; 146 consented and completed an online screening. Twenty-three participants were excluded during the screening process, and three participants withdrew after enrollment, for a final sample size of 120. Please refer to Thompson et al. (2023) for detailed enrollment data and inclusion and exclusion criteria. Familiarity with smartphones (defined as a minimum 1-year history of use) was required for enrollment. Screening was conducted using an online survey and the modified Telephone Interview for Cognitive Status (TICSm) (Brandt et al., 1988; Cook et al., 2009). Unimpaired cognition was defined as a TICSm score of ≥34 (Cook et al., 2009). Participants completed an online exit survey to provide feedback at the end of the study, and a $20 gift card was provided as compensation. All participants were made aware of this compensation during the consent process. The project received approval from the Butler Hospital Institutional Review Board, and all participants provided consent. The research was completed in accordance with the Helsinki Declaration.
Remote cognitive assessment
Remote cognitive tasks were completed using the Mobile Monitoring of Cognitive Change (M2C2) app, a cognitive testing platform developed as part of the National Institute on Aging’s Mobile Toolbox initiative and described previously (Thompson et al., 2022, 2023). Android smartphones preloaded with the cognitive assessment app were mailed to participants along with a detailed use guide. The phones were locked down to prevent the use of other features such as web browsing and the camera. Participants completed brief (i.e., 3–4 minutes) M2C2 sessions each day for eight consecutive days, within morning, afternoon, and evening time blocks. Participants received a push notification on their phone when it was time to complete a session and every 30 minutes thereafter until the session was completed or its window closed. Session start times were randomized but always fell within fixed 1-hour windows, chosen at the start of the study with input from the participant to fit their schedule (Thompson et al., 2022, 2023). If participants received a notification when they were busy, they could complete the session later, any time before the close of the 2-hour session window. Additional sessions could be completed on day 9 as optional or make-up sessions. Staff provided support by phone or email as needed, as described previously (Thompson et al., 2023). During each M2C2 session, participants completed three previously characterized cognitive tasks assessing episodic memory (Prices), visual working memory (Color Shapes), and processing speed (Symbol Match) (Sliwinski et al., 2018; Thompson et al., 2023). Each task took approximately 60 s to complete. The Prices task is a delayed forced-choice recognition paradigm (Gallo et al., 2006; Naveh-Benjamin, 2000). Ten grocery item-price pairs are encoded, followed by forced-choice recognition trials. Performance was measured as the proportion of correct responses across the ten test trials. The Color Shapes task is a visual change detection task measuring intra-item feature binding (Parra et al., 2010; Parra et al., 2011). Performance was measured by discriminability (d-prime), calculated from the proportion of correctly identified stimuli and the proportion of misidentified stimuli (Stanislaw & Todorov, 1999).
The Symbol Match task is a speeded continuous performance task in which participants are asked to identify matching symbol pairs (Hassenstab et al., 2020; Sliwinski et al., 2018). Performance was measured as the median reaction time across all trials, in milliseconds (Sliwinski et al., 2018). No performance validity measures were included in the M2C2 protocol.
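For illustration only, the following minimal sketch shows the conventional d-prime computation referenced above (Stanislaw & Todorov, 1999); the function and the clamping rule for extreme proportions are our own assumptions rather than the study’s scoring code.

```python
# Hypothetical illustration of conventional d-prime scoring (not the
# authors' implementation). Hit and false-alarm rates of exactly 0 or 1
# are clamped with a 1/(2N) correction so the inverse-normal transform
# stays finite (Stanislaw & Todorov, 1999).
from scipy.stats import norm

def d_prime(hits: int, false_alarms: int, n_signal: int, n_noise: int) -> float:
    hit_rate = hits / n_signal
    fa_rate = false_alarms / n_noise
    hit_rate = min(max(hit_rate, 1 / (2 * n_signal)), 1 - 1 / (2 * n_signal))
    fa_rate = min(max(fa_rate, 1 / (2 * n_noise)), 1 - 1 / (2 * n_noise))
    # d' is the difference between the z-transformed hit and false-alarm rates.
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Example: 9 hits on 10 change trials, 2 false alarms on 10 no-change trials.
print(round(d_prime(9, 2, 10, 10), 2))  # 2.12
```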
Feedback manipulation
We investigated whether a simple experimental manipulation – telling participants that they would or would not learn about their test results – was associated with adherence or performance on the remote cognitive assessment protocol. Participants were randomly assigned to one of two conditions, feedback or no feedback, and were told that they either would or would not receive performance feedback upon study completion, respectively. At the end of the study, all participants were given the option to receive feedback, regardless of condition. Feedback was given in the form of a PDF summary of raw scores on all three digital assessments plotted over the course of the 8-day test period. No benchmarks (e.g., age-specific normative data) were provided. Feedback was presented as being for the participants’ curiosity only and was labeled as not intended for clinical use. Participants who obtained low scores on the MoCA during the in-person study visit and had cognitive concerns were advised to follow up with their primary care provider and were given referral information for neuropsychological evaluation.
Protocol phases
Our study was launched during the COVID-19 pandemic and all study procedures were completed remotely (i.e., screening, consent, and M2C2 orientation). After COVID-19 restrictions were lifted, we were able to add an in-person study visit, but the main study protocol remained fully remote and unchanged. Prior participants and newly enrolled participants were all invited to complete the in-person study visit. Forty of the 52 prior participants came in for the optional in-person visit approximately 17 months after finishing the fully remote study. The remainder of the sample enrolled after pandemic restrictions were lifted and completed the in-person study visit within approximately one month of the remote protocol. To account for this variation, a variable for protocol type (short versus long delay) was included in the statistical models. A dichotomous variable was used given the bimodal distribution of the time delay (clustered around 1 month and 17 months) as well as for ease of interpretation.
In-person cognitive assessment
Participants were scheduled for a single in-person study visit to complete standard paper-and-pencil neuropsychological assessments. The battery consisted of widely used and thoroughly validated measures previously included in the Preclinical Alzheimer’s Cognitive Composite and shown to be sensitive to prodromal AD (Donohue et al., 2014). Individual tests included the Wechsler Adult Intelligence Scale-Revised (WAIS-R) Digit Symbol Substitution Test (DSST) and Digit Span subtests (Wechsler, 1981), the Wechsler Memory Scale-Revised (WMS-R) Logical Memory Immediate and Delayed Recall subtests (Wechsler, 1987), Trail Making Test A & B (Reitan & Wolfson, 1985), Category Fluency Test (animals) (Martin & Fedio, 1983), and the Free and Cued Selective Reminding Test – Immediate Recall (FCSRT-IR) (Grober & Buschke, 1987; Grober et al., 2018). The Wechsler Test of Adult Reading (WTAR) was used to estimate premorbid verbal intellectual functioning (Wechsler, 2001). Testing took approximately 60–90 minutes to complete.
Analysis
M2C2 adherence was operationalized using a compliance cutoff of 80%, which was examined overall and by time of day (morning, afternoon, evening). Compliance was defined by the number of sessions a participant initiated; incomplete sessions counted toward compliance as long as participants had completed at least one of the three tasks. A Cochran’s Q-test was used to test for differences in completion rates by time of day (Cochran, 1950), with McNemar’s test used for post-hoc pairwise comparisons. Pearson’s correlations were used to examine associations between M2C2 task performance and demographic variables. Linear regression analyses adjusted for age, sex, and education were used to evaluate differences between the feedback and control conditions in protocol adherence and performance on M2C2 tasks. Linear regression models were constructed to evaluate convergent and divergent validity between M2C2 tasks and the standard neuropsychological assessments; these models were adjusted for feedback condition and protocol type. We used the Benjamini-Yekutieli method to adjust for multiple comparisons (Benjamini & Yekutieli, 2005). To evaluate test stability, we examined between- and within-person variance (standard deviations) in scores. Intraclass correlations (ICCs) were computed by fitting unconditional multilevel mixed models, estimated with restricted maximum likelihood, to each of the M2C2 tasks, as previously reported (Sliwinski et al., 2018). With this approach, the ICC indicates the expected correlation between two randomly sampled measurements from the same person. The ICCs we report are the reliabilities of aggregate scores based on the number of sessions completed. We present reliabilities for a range of session counts, including the median number of sessions, because not all participants completed all 24 sessions.
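As a concrete illustration of the adherence analyses described above, the sketch below uses simulated toy data (no values or variable names come from the study) to show how a Cochran’s Q-test, post-hoc McNemar tests, and the Benjamini-Yekutieli correction can be run with statsmodels.

```python
# Toy illustration of the adherence tests; all data are simulated and all
# variable names are assumptions, not the study's analysis code.
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# 120 participants x 3 time-of-day blocks (morning, afternoon, evening);
# 1 = session completed, 0 = session missed.
compliant = rng.binomial(1, [0.90, 0.78, 0.84], size=(120, 3))

# Cochran's Q: do completion rates differ across the three blocks?
q = cochrans_q(compliant)
print(f"Q(2) = {q.statistic:.2f}, p = {q.pvalue:.3f}")

# Post-hoc McNemar test for one pair (morning vs. afternoon), built from
# the 2x2 table of paired binary outcomes.
morning, afternoon = compliant[:, 0], compliant[:, 1]
table = [[np.sum((morning == 1) & (afternoon == 1)),
          np.sum((morning == 1) & (afternoon == 0))],
         [np.sum((morning == 0) & (afternoon == 1)),
          np.sum((morning == 0) & (afternoon == 0))]]
print(f"McNemar p = {mcnemar(table, exact=True).pvalue:.3f}")

# Benjamini-Yekutieli adjustment across a family of p-values.
reject, p_adj, _, _ = multipletests([0.003, 0.11, 0.11], method="fdr_by")
```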
Results
Sample characteristics
The sample consisted of 120 participants with a mean age of 68.9 years (SD = 4.9) and a mean education of 16.5 years (SD = 2.4). The sample was 68.3% female and 87% White (Table 1). There were no significant demographic differences by protocol type, with the exception of age: the short delay group was somewhat younger (mean age 67.5 years) than the long delay group (mean age 70.5 years). There were no significant demographic differences by feedback condition. Performance on all three M2C2 tasks was associated with age (Table 2), and performance on the Prices and Color Shapes tasks was associated with sex. There was no association between M2C2 performance and education.
Table 1 note: TICSm = Modified Telephone Interview for Cognitive Status; MoCA = Montreal Cognitive Assessment.
Table 2 note: CI = confidence interval; M2C2 = Mobile Monitoring of Cognitive Change. If the confidence interval does not contain 0, the result is significant at p < .05.
Effects of day and time of day on adherence and performance
We examined the overall completion rate across all 24 assigned M2C2 test sessions and by time of day, with compliance defined as a completion rate of 80% or higher. Overall, 89.3% of participants met the compliance criterion across the 24 sessions assigned over 8 days. Average compliance was 90.2% for morning sessions, 77.9% for afternoon sessions, and 84.4% for evening sessions. A Cochran’s Q-test revealed differences in completion rates by time of day, Q(2) = 10.56, p = .01, with a significant difference between afternoon and morning sessions (z = −3.27, p = .003), but not between afternoon and evening (p = .11) or evening and morning (p = .11) sessions. Only one participant completed fewer than two sessions on more than one day (days 1 and 8). Compliance declined the most on day 8 (70.5%), with only 97 participants completing assigned sessions. See Supplemental Figure 1 for a summary of overall session adherence rates and average performance on M2C2 tasks by study day. Thirty-five (28.7%) participants completed at least one optional or make-up session on day 9. We detected no differences in adherence based on age, education, or sex.
Effects of feedback condition on adherence and performance
There was no evidence of a difference in overall adherence by feedback condition (β = 0.05, p = .56). A difference in adherence by feedback condition on day one did not reach the threshold for significance (β = −0.17, p = .06) and was absent on subsequent days. There was an effect of feedback condition on Symbol Match (processing speed) performance from day one (β = 0.17, p = .05) through day eight (β = 0.26, p = .01). There was no effect of feedback condition on Prices (episodic memory) or Color Shapes (working memory) performance. All results reported for day 1 or day 8 were aggregated within the day for analysis.
Convergent and discriminant validity
To assess convergent validity with the standard tests, we examined associations between performance on each M2C2 cognitive task, averaged across sessions, and each same-domain standard neuropsychological test, adjusting for feedback condition and protocol type (Table 3). Proportion of correct responses on the Prices task was positively associated with Logical Memory immediate (β = 0.22) and delayed recall (β = 0.31). The Prices task was also associated with FCSRT Immediate Recall (β = 0.24). Median response time on the Symbol Match task was positively associated with Trails A completion time (β = 0.35) and negatively associated with DSST total score (β = −0.44) and Verbal Fluency (animals) total score (β = −0.32). Color Shapes task d-prime (discriminability) was positively associated with DSST total score (β = 0.33) and negatively associated with Trails B completion time (β = −0.24), but not significantly associated with longest Digit Span backward (β = 0.16) (Table 3). There were no main effects of protocol type in any of the models.
Table 3 note: All models adjusted for protocol type and feedback group. Unstd. Est. = unstandardized estimate; CI = confidence interval; Std. Est. = standardized estimate; M2C2 = Mobile Monitoring of Cognitive Change. The standardized estimate can be interpreted as an effect size, where 0.10 = small, 0.30 = medium, and 0.50 = large. If the CI does not contain 0, the result is significant at p < .05. * = result remained significant at p < .05 after Benjamini-Yekutieli adjustment for multiple comparisons.
To evaluate divergent validity, we examined associations with different-domain standard neuropsychological measures, again adjusting for feedback condition and protocol type (Table 4). Proportion of correct responses on the Prices task was not associated with measures of processing speed, working memory, or estimated premorbid verbal intelligence (IQ). The Symbol Match task was associated with verbal fluency (β = −0.32), but not with untimed language-dependent tasks, such as verbal memory or the premorbid verbal IQ estimate. The Color Shapes task was not associated with verbal fluency, psychomotor processing speed, or the premorbid verbal IQ estimate (Table 4).
Table 4 note: All models adjusted for protocol type and feedback group. Unstd. Est. = unstandardized estimate; CI = confidence interval; Std. Est. = standardized estimate; M2C2 = Mobile Monitoring of Cognitive Change. The standardized estimate can be interpreted as an effect size, where 0.10 = small, 0.30 = medium, and 0.50 = large. * = result remained significant at p < .05 after Benjamini-Yekutieli adjustment for multiple comparisons.
Reliability
To evaluate test stability, we examined between- and within-person variance in performance. These results are stratified by feedback condition and provide an update to our preliminary (n = 52) reliability data published in 2022, prior to study completion. On average, in the full sample (N = 120), performance improved over time on the Color Shapes and Symbol Match tasks, while performance remained stable on the Prices task (Supplemental Figure 1). The within-person reliabilities of average scores aggregated across 9, 15, 18, 21, and 24 sessions were excellent (Table 5). Additionally, we provide the ICCs computed from two randomly selected measurement observations per participant for each task.
Table 5 note: M2C2 = Mobile Monitoring of Cognitive Change; ICC = intraclass correlation. The ICC can be interpreted as the correlation between two randomly selected measurement observations from the same participant. The values in the remaining columns estimate the reliability of the test for a participant who completed 9 sessions versus 15 sessions, 18 sessions, etc. A range of values is provided because not all participants completed all 24 sessions. The median number of completed sessions was 24 and the average was 23.
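To make the reliability computation concrete, the following hypothetical sketch shows one way to derive the ICC and session-aggregated reliabilities of the kind reported in Table 5, by fitting an unconditional multilevel model and applying the Spearman-Brown prophecy formula; the data layout and column names are assumptions, not the study’s analysis code.

```python
# Hypothetical sketch (column names assumed; not the study's code) of the
# ICC and aggregate-score reliability computation described in the
# Analysis section.
import pandas as pd
import statsmodels.formula.api as smf

def icc_and_aggregate_reliability(df: pd.DataFrame, score: str,
                                  person: str, k: int):
    # Unconditional (intercept-only) multilevel model fit by REML:
    # between-person variance comes from the random intercept,
    # within-person variance from the residual.
    fit = smf.mixedlm(f"{score} ~ 1", df, groups=df[person]).fit(reml=True)
    between = float(fit.cov_re.iloc[0, 0])
    within = float(fit.scale)
    # ICC: expected correlation between two randomly sampled sessions
    # from the same person.
    icc = between / (between + within)
    # Spearman-Brown: reliability of a score averaged over k sessions.
    reliability_k = k * icc / (1 + (k - 1) * icc)
    return icc, reliability_k

# Example usage with long-format data (one row per participant-session):
# long = pd.read_csv("m2c2_long.csv")  # columns: pid, session, prices
# icc, rel15 = icc_and_aggregate_reliability(long, "prices", "pid", k=15)
```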
Discussion
The present study adds to a growing body of support for self-administered HFA via smartphone as a feasible, acceptable, and reliable approach to measuring cognition in older adults, including those at risk for AD/ADRD. In addition to replicating strong test reliability, validity, and adherence rates, our study provides new insights into potential factors that may impact adherence and performance on testing in this context (Cerino et al., 2021; Harrington et al., 2021; Nicosia et al., 2023).
In our examination of the potential effects of anticipated feedback on cognitive performance, we found that participants who expected to eventually receive feedback on their performance had faster reaction times on the M2C2 processing speed task. This result may suggest that simply knowing that feedback will be received could improve effortful engagement on digital tasks, as processing speed (i.e., how quickly one performs a task) may be more influenced by subjective control than other cognitive abilities. These findings on feedback anticipation have important implications for understanding how to incentivize adherence and performance on remote testing without compromising test validity. Our results suggest that feedback may not need to be given in real time to potentially increase task engagement. Monetary incentives have also recently been associated with strong adherence in a mixed sample of older adults with normal cognition and mild cognitive impairment (Nicosia et al., 2023), but they are not appropriate or scalable for clinical settings. In contrast, providing feedback at the end of an assessment period is clinically appropriate and may also be beneficial for patient awareness and engagement in understanding brain health.
Contrary to our expectations, we did not observe an effect of feedback condition on overall protocol adherence. It should be noted that ceiling effects due to high overall study adherence in this sample may have limited our ability to detect significant differences. Additionally, our experimental manipulation was subtle, focusing on adjusting only the anticipation of feedback so as to avoid introducing any changes to the M2C2 protocol. Future research could also examine the effects of more direct in-app feedback and/or feedback following each M2C2 task session on adherence; however, doing so would likely introduce unwanted variability in cognitive performance and compromise the reliability of the assessments.
We also examined the timing of assessments and found good adherence at all three time points (morning, afternoon, evening). Adherence was strongest for morning sessions and significantly weaker for afternoon sessions. These findings add to the recent literature on cognitive sundowning captured using remote HFA (Wilks et al., 2021). Despite the observed overall compliance rate of almost 90%, adherence declined notably on day 8, suggesting that a shorter assessment period might be better tolerated in future studies. Overall, these findings are consistent with prior reports supporting the feasibility of remote, self-administered, smartphone-based approaches to cognitive assessment in older adults with and without cognitive impairment (Harrington et al., 2021; Nicosia et al., 2023; Papp et al., 2021).
We further examined the convergent and divergent validity of the M2C2 tasks and found consistent associations with same-domain standard neuropsychological assessments sensitive to cognitive deficits in preclinical and prodromal AD. M2C2 associations with standard assessments held as small or medium effects even after controlling for variability in the time delay between our remote and in-person assessments. Most notable for the early detection of AD, we found small but consistent associations between performance on the Prices episodic memory task and measures sensitive to early memory decline, including contextual verbal memory (Logical Memory) and cued verbal learning and recall (Free and Cued Selective Reminding Test). We also examined divergent validity and demonstrated an absence of associations between M2C2 tasks and most standard measures assessing different cognitive domains. One exception was a significant association between Symbol Match and verbal fluency, which may be attributable to their shared sensitivity to processing speed. This work, together with our earlier report demonstrating task sensitivity to AD biomarkers (Thompson et al., 2023) and that of Nicosia et al. (2023), builds a strong foundation of multimodal validation evidence in support of using HFA approaches in the older adult population.
Finally, we also examined the reliability of the M2C2 smartphone-based assessments. Consistent with prior research, we found high within-subject reliabilities among M2C2 measures over time (Nicosia et al., 2023; Sliwinski et al., 2018). One advantage of HFA is the ability to obtain high degrees of within-person reliability, which is difficult to achieve with one or two assessment visits. The high M2C2 task reliabilities observed in this study are consistent with the good test-retest reliabilities recently reported for other multi-day digital assessment protocols (Stricker et al., 2023; Weizenbaum et al., 2023).
The results of this study provide insights into ways that HFA protocols could be refined and optimized for future research in older adult samples. First, our results suggest that testing sessions scheduled for the morning and evening may yield better adherence than sessions scheduled in the middle of the day. This may in part reflect variability in people’s mid-day schedules due to activities such as work or errands, which may be distracting or more difficult to interrupt. Assessments may fit better into morning and evening routines and may be easier for people to complete when they are at home. Additionally, our findings indicate that very good within-subject reliabilities can be achieved on measures of processing speed, working memory, and episodic memory within 15 sessions (i.e., three daily sessions for five days). Taken together with the decline in adherence that we observed after seven days of testing, a five- or six-day assessment period seems optimal. These refinements and other insights, such as the advantages of offering performance feedback to examinees, can inform future clinical implementation research and applications of HFA in decentralized AD/ADRD clinical trials (Leroy et al., 2023). In the future, remote digital assessments could also be particularly useful in combination with blood-based biomarker detection of AD proteinopathy in clinical settings, helping to guide decision making when obtaining a full cognitive evaluation might not otherwise be feasible (Ashton et al., 2024).
The results of this study should be interpreted with the nature of the cohort in mind. Participants were cognitively normal individuals recruited exclusively from a registry of people who had expressed interest in AD/ADRD research. In this context, it is plausible to expect relatively high performance and motivation to engage in a self-administered remote assessment protocol, compared to other contexts, such as enrolling symptomatic patients in community clinics. While the M2C2 app has primarily been tested in asymptomatic at-risk populations, others have recently reported positive feasibility and validity data from its use in larger longitudinal cohorts of older adults, including those with mild cognitive impairment (Cerino et al., 2021; Harrington et al., 2021). Another important contextual consideration in our sample is the variable time delay between the remote and in-person cognitive testing, ranging from one to 18 months due to COVID-19-related research restrictions. Given the potential for the remote and in-person task associations to be moderated by this delay, we controlled for protocol type (short or long delay) in our models and additionally found no significant interactions with protocol type.
An important limitation of this study is the narrow demographic profile of the sample (i.e., largely White, female, and highly educated, with a maximum age of 80), which limits generalizability to the broader older adult population. Replication of our protocol’s feasibility and of the validity of the M2C2 measures in a more diverse older adult sample is necessary prior to implementation in larger research trials and clinical settings. Finally, we did not have data on participants’ employment status or daily schedules, which may have yielded further insights into time-of-day variations in M2C2 performance and adherence.
As with most mobile cognitive assessments, the M2C2 tasks were not developed for diagnostic purposes and are probably best suited for initial screening and monitoring of cognitive function. Our M2C2 protocol does not include any performance validity checks, but it does have the advantage of being primarily visually based, which may make it more difficult for participants to take notes or otherwise ‘cheat’ on the tasks. Nonetheless, we recognize that there are many other potential threats to performance validity, including distractions/interruptions or assistance from other individuals, and these are important limitations of self-administered remote assessments. To manage potential variability in device processing capabilities and avoid interruptions from calls, texts, or pop-ups, we sent all participants a study-managed Android phone to use for the M2C2 protocol; however, this approach is not feasible for scalable clinical implementation. M2C2 and other mobile assessments are now more readily deployed on participants’ own devices (including iOS and Android), and this is the approach currently in use in our own ongoing research and other studies (Harris et al., 2024).
Traditional neuropsychological testing continues to play an important role in capturing the nature and extent of cognitive deficits and aiding clinicians in the diagnostic process. Such measures are administered under carefully controlled conditions, utilize normative data, and allow examiners to observe potentially diagnostically relevant behavioral factors such as process-oriented test performance features, emotional affect and social reciprocity, and other non-test behaviors, including neurobehavioral signs. In the future, however, proxies for some of these non-test observational factors may be obtained via passive monitoring and artificial intelligence-driven capture of behavior and symptoms, further optimizing remote assessment. With the current abundance of research in this space, digital cognitive assessment tools are poised to undergo further refinement and integration with other technologies and will play an important role in AD/ADRD clinical research.
Conclusions
In summary, this study adds substantial support for the feasibility, acceptability, and reliability of self-administered, high-frequency cognitive assessment via smartphone in older adults, particularly those at risk for AD. The findings not only replicate strong test reliability, validity, and adherence, but also provide valuable insights into factors affecting adherence and performance. The anticipation of learning one’s test results emerged as a potential motivator, with some impact on both reaction times and initial adherence. The study suggests that delayed feedback may still enhance task engagement, offering a practical alternative to delivering real-time results during remote testing. Moreover, our examination of assessment timing revealed an advantage for morning sessions, which should be considered when optimizing future protocols. The reliability and validity of the M2C2 smartphone-based assessments, including their consistent associations with same-domain standard neuropsychological measures, suggest that with further refinement, such remote cognitive assessment tools could be scalable for screening and monitoring in clinical settings.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S1355617724000328.
Acknowledgments
None.
Financial support
This work was supported by Alzheimer’s Association grant AACSF-20-685786 (Thompson, PI) and by NIA grant T32 AG049676 to Penn State University.
Competing interests
Dr Thompson has been a paid consultant for the Davos Alzheimer’s Collaborative. Dr Stephen Salloway has been a paid consultant for Lilly, Biogen, Roche, Genentech, Eisai, Bolden, Amylyx, NovoNordisk, Prothena, Ono, and Alnylam. All other authors have nothing to disclose.