INTRODUCTION
Neuropsychological assessments are the accepted standard of care for measuring cognition. Many of the most common neuropsychological tests have existed for decades, but research on their strengths and limitations has led to improvements in how they are used and interpreted in clinical, research, and forensic settings. For example, research showing the base rates with which cognitively intact individuals achieve low scores across a test battery has helped reduce false positive conclusions that a patient has declined (Binder, Iverson, & Brooks, 2009; Brooks & Iverson, 2010; Brooks, Iverson, & White, 2009). There is also evolving awareness of the complex relationships between diverse neuropathologic changes and heterogeneous cognitive phenotypes (Boyle et al., 2018; James et al., 2016; Wennberg et al., 2019).
Pioneering work from Bondi and Jak in longitudinal aging cohorts has consistently demonstrated that actuarial approaches, which classify cognitive impairment using patterns and frequencies of low scores, lead to modest rates of clinical reversion (from mild cognitive impairment, or "MCI," to "cognitively normal" at follow-up), improved characterization of risk of progression to dementia, and stronger associations with biologic disease markers than "single-test" methods (Bondi et al., 2014; Bondi & Smith, 2014; Jak et al., 2009; Jak et al., 2016; Petersen et al., 1999). Oltra-Cucarella et al. (2018) recently showed that the number of low scores in a test battery predicted progression from MCI to dementia in the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort with improved specificity. Put simply, implementing these approaches increases a clinician's confidence about whether a patient's test scores reflect true cognitive change versus normal performance variability unrelated to suspected underlying disease.
Deemphasizing individual test scores known to fluctuate in cognitively normal individuals promotes clinical translation by more closely mimicking the holistic interpretations used by neuropsychologists. One possible limitation of these methods, however, is that they dichotomize impairment status rather than allowing for a continuum of evidence for cognitive decline. Dichotomous approaches that reduce complex cognitive profiles derived from psychometrically imperfect measures might not classify patients into discrete diagnostic groups accurately or reliably. Moving away from dichotomous categorizations (“not impaired” vs. “impaired”) and toward a continuous spectrum construct might advance clinical assessment methods further and promote integration with similarly complex disease biomarker measures.
Data from large longitudinal research cohorts with biomarker collection challenge traditional conceptualizations of disease–phenotype relationships and underscore the imperfect alignment of disease states and clinical syndromes (Jack et al., 2018). For example, biomarker [e.g., positron emission tomography (PET)] evidence of the amyloid plaque and tau tangle pathology of Alzheimer's disease (AD) does not guarantee measurable cognitive impairment (De Meyer et al., 2010; Mortamais et al., 2017) and, when present, cognitive impairment is not universally the prototypic AD presentation of "rapid forgetting" (Ossenkoppele et al., 2015; Perry et al., 2017; Phillips et al., 2018). Yet, in the absence of advanced biomarkers of disease pathology, clinical presentation alone may not adequately differentiate disease states (e.g., amnestic profiles are associated with both AD and limbic-predominant age-related TDP-43 encephalopathy) (Nelson et al., 2019). The National Institute on Aging and Alzheimer's Association (NIA-AA) established the "A/T/N" framework for biomarker evidence of AD, with expected adaptation to include additional neuropathologic biomarkers (Jack et al., 2018; Nelson et al., 2019). Anticipating this paradigm shift, a cognitive correlate derived from neuropsychological evaluations would complement biomarker frameworks by systematically quantifying evidence of domain-specific cognitive decline.
The purpose of this study is to develop and validate the Discrepancy-based Evidence for Loss of Thinking Abilities (DELTA) score. In the absence of prior test scores for comparison, DELTA scores characterize evidence for cognitive decline on a continuous spectrum based on the extent of discrepancies between obtained test scores and predicted premorbid scores derived from multiple-variable regression models. The approach reflects the progress of prior research demonstrating the benefits of accounting for low-score base rates among cognitively normal individuals and psychometric principles for improving detection of cognitive changes (Iverson & Brooks, 2011). This initial validation used the ADNI cohort and evaluated how the DELTA score predicted functional changes over time, as well as its association with AD biomarkers. We provide a Microsoft Excel-based scoring program that directly incorporates the study findings and we discuss next steps for broader validation outside of AD samples.
METHODS
Data Source and Participants
We obtained data from the ADNI database (adni.loni.usc.edu). ADNI is a longitudinal multicenter study of advanced biomarkers and clinical assessments of individuals suspected of or at risk for AD (see www.adni-info.org for details). ADNI was approved by the institutional review boards of all participating institutions. Informed written consent was obtained from all participants at each site. Data included in our study were participant demographics, neuropsychological test results, functional measures including the Clinical Dementia Rating scale (CDR) and Functional Activities Questionnaire (FAQ), and biomarkers of beta-amyloid [Aβ; via PET and cerebrospinal fluid (CSF)] and phosphorylated tau (p-tau; via CSF).
The CDR (Morris, 1993) has both a global score (range 0–3, where 0 = "normal" and 3 = "severe dementia") and a Sum of Boxes (SOB) score (range 0–18) that quantify aspects of daily functioning including memory, orientation, judgment/problem-solving, community affairs, home and hobbies, and personal care. The FAQ (Pfeffer, Kurosaki, Harrah, Chance, & Filos, 1982) assesses the degree of assistance individuals need when completing 10 instrumental activities of daily living (range 0–30, with higher scores representing greater assistance needed). Lastly, Aβ and p-tau are the hallmark neuropathologic features of AD under the biologic A/T/N classification system and commonly underlie cognitive and behavioral changes in older adults (Jack et al., 2018).
The ADNI cohort was used as the normative reference for DELTA development. Regression coefficients were derived from a sample of robust cognitively normal (RCN) individuals. RCNs had to have CDR = 0 and Mini-Mental State Examination (MMSE) score ≥29 at both baseline and 1-year follow-up. We regressed each neuropsychological test score on age, gender, years of education, and word-reading ability. Equations derived from these regression models were then applied to the entire ADNI cohort to compute individual participants' predicted premorbid test scores. The discrepancy between a participant's predicted and obtained scores is the key component of the DELTA score and theoretically represents the likelihood of true cognitive decline from a predicted baseline state.
Discrepancy-Based Evidence for Loss of Thinking Abilities (DELTA) Score Development
The DELTA score is broadly based on the following: (1) the degree of discrepancy between an individual’s predicted and observed scores on individual tests within a battery, and (2) the frequency of discrepancy scores that exceed common cutoffs for infrequently occurring or “impaired” scores. This initial development and validation is specific to ADNI test scores and represents a proof of concept.
Step 1: Identifying ADNI test scores to incorporate into DELTA
The six tests used for the ADNI-based DELTA spanned three cognitive domains: Memory [Rey Auditory Verbal Learning Test (AVLT), Wechsler Memory Scale – Revised Logical Memory (WMS-R LM)], Language [Animal Fluency, 30-item Boston Naming Test (BNT-30)], and Executive Function (Clock Drawing, Trails B); see Supplemental Table A for test details. We isolated the "executive" component of Trails B by dividing Trails B time by Trails A time to reduce confounding effects of psychomotor speed on measuring the set-shifting component of Trails B (Arbuthnott & Frank, 2000).
Table 1. Regression equations for predicting test scores based on age, gender, years of education, and word-reading ability (ANART total errors). Scores within the table used for DELTA score calculation include AVLT Delayed Recall, LM Delayed Recall, Trails B/Trails A, Animal Fluency Total Correct, and BNT-30 Total Correct. Regression equations not calculated for Clock Drawing due to limited score range in control group (4 or 5). The “ANART R2” column indicates the added variance attributed specifically to word-reading performance above and beyond age, gender, and years of education
ANART, American National Adult Reading Test; AVLT, Rey Auditory Verbal Learning Test; BNT, Boston Naming Test; LM, Logical Memory (WMS-R); SEE, standard error of the estimate; YrsEduc, Years of Education.
a Age p < .01.
b Gender p < .01.
c Years of Education p < .01.
d ANART Error Score p < .01.
Step 2: Calculating test score prediction equations from robust cognitively normal (RCN) participants
Predicted scores for each relevant test component came from regression equations using coefficients (B-weights) corresponding to the effects of age, gender (male or female), years of education, and word-reading ability on test scores in the RCN subgroup (Eppig et al., 2017). Word-reading scores came from the American National Adult Reading Test (ANART; number of errors), which estimates general intelligence (i.e., IQ) and informs expected premorbid cognitive abilities (McGurn et al., 2004). Word-reading ability was chosen as a performance-based predictor of cognitive abilities to improve upon typical demographic-only adjustment methods (Crawford, Moore, & Cameron, 1992; Duff, Chelune, & Dennett, 2011; Duff, Dalley, Suhrie, & Hammers, 2018). All predictors were retained in the equation regardless of statistical significance; this approach captures any variance explained by these commonly collected variables and allows direct comparison of their relative prediction strengths across test scores. The test score-specific prediction equations therefore took the form:
$$\text{Predicted score} = B_0 + B_{\text{age}}(\text{Age}) + B_{\text{gender}}(\text{Gender}) + B_{\text{educ}}(\text{YrsEduc}) + B_{\text{ANART}}(\text{ANART errors})$$
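For illustration, a minimal Python sketch of this step is shown below. The file and variable names (rcn_baseline.csv, avlt_delay, anart_errors, etc.) are hypothetical placeholders rather than actual ADNI field names, and the study's analyses were conducted in SPSS; this is only a sketch of the regression logic.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical RCN data; gender coded numerically (e.g., 0 = male, 1 = female).
rcn = pd.read_csv("rcn_baseline.csv")

# Fit one model per test score; all four predictors are retained regardless of
# statistical significance, mirroring Step 2.
model = smf.ols("avlt_delay ~ age + gender + yrs_educ + anart_errors", data=rcn).fit()
b = model.params                  # B-weights (the Table 1 analog)
see = np.sqrt(model.mse_resid)    # standard error of the estimate (SEE)

def predict_score(age, gender, yrs_educ, anart_errors):
    """Predicted premorbid AVLT delayed recall for a new participant."""
    return (b["Intercept"] + b["age"] * age + b["gender"] * gender
            + b["yrs_educ"] * yrs_educ + b["anart_errors"] * anart_errors)
```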
Step 3: Calculating standardized discrepancy scores
We identified 270 ADNI participants as RCNs. We first calculated predicted raw scores for RCNs and then standardized the discrepancy (z-Discrep) between predicted and actual raw scores by dividing the difference by the test-specific regression model's standard error of the estimate [SEE; defined as the standard deviation (SD) of the error term]. This was done for all test scores (Table 1) except Clock Drawing, because all RCNs obtained scores of either 4 or 5 (out of 5). We subtracted actual from predicted scores for the Trail Making Test so that negative z-scores reflected poor performance.
$$z\text{-Discrep} = \frac{\text{Actual score} - \text{Predicted score}}{\text{SEE}}$$
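A direct translation of this discrepancy calculation into code might look like the following sketch (the function and argument names are our own, not from the scoring program):

```python
def z_discrep(actual, predicted, see, higher_is_worse=False):
    """Standardized discrepancy; negative values reflect performance worse
    than predicted."""
    diff = (predicted - actual) if higher_is_worse else (actual - predicted)
    return diff / see

# Trails B/A is time-based (higher values are worse), so the subtraction is
# reversed per Step 3; illustrative numbers only.
z = z_discrep(actual=3.1, predicted=2.4, see=0.8, higher_is_worse=True)  # -0.875
```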
Step 4: Defining cutoffs for evidence of cognitive decline
A key component of neuropsychological test score interpretation is the frequency with which a given score occurs in a reference population (e.g., scores corresponding to a z-score of −2.0 are atypically low and usually interpreted as strong evidence of cognitive decline). We therefore established percentile cutoffs for infrequently occurring z-Discrep scores: 16th, 7th, and 2nd percentile. We used percentiles due to non-normality of the z-Discrep score distributions. These percentiles correspond to commonly used cutoffs in normally distributed data (16th%ile for z = −1.0, 7th%ile for z = −1.5, 2nd%ile for z = −2.0) (Iverson & Brooks, 2011).
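In code, deriving these empirical cutoffs is a one-line percentile computation, sketched below with randomly generated values standing in for an RCN z-Discrep distribution:

```python
import numpy as np

# Illustrative stand-in for one test's z-Discrep scores across the 270 RCNs.
rng = np.random.default_rng(0)
rcn_z = rng.normal(size=270)

# Empirical cutoffs replace parametric z thresholds because the observed
# z-Discrep distributions were non-normal.
cut16, cut7, cut2 = np.percentile(rcn_z, [16, 7, 2])
```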
Step 5: Defining DELTA score criteria
We based DELTA score criteria on the principle that obtaining low scores on Test A and Test B within a cognitive domain occurs less frequently than obtaining low scores on Test A or Test B (Table 2). The DELTA score also accounts for the degree of discrepancy between obtained and predicted scores. For example, if both the BNT and Animal Fluency scores have a z-Discrep below the second percentile, the individual receives a Language score of 5. However, if only one of those two z-Discrep scores is below the second percentile and the other is normal, the individual receives a Language score of 3. The same score of 3 could also be obtained if both z-Discrep scores fall between the second and seventh percentiles.
Table 2. Criteria for calculating the DELTA score (automatic scoring program provided)
A unique component of the Memory DELTA score for both AVLT and LM is the requirement of <50% retention of initially learned information. This reduces confounding effects of poor immediate recall on delayed recall scores due to non-memory factors like inattentiveness or executive deficits (Casaletto et al., 2017). The <50% retention cutoff was chosen arbitrarily and applied uniformly.
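Because Table 2 is not reproduced in the text, the sketch below implements only the scoring combinations explicitly described above; it is a hypothetical, partial rendering of the Table 2 logic, and the provided automated scoring program remains the authoritative implementation.

```python
def language_delta(pct_a, pct_b):
    """Partial sketch of the Table 2 logic; pct_a and pct_b are the
    RCN-referenced percentile ranks of each test's z-Discrep score."""
    lo, hi = sorted([pct_a, pct_b])
    if hi < 2:
        return 5   # both scores below the 2nd percentile
    if lo < 2 and hi >= 16:
        return 3   # one below the 2nd percentile, the other normal
    if lo >= 2 and hi < 7:
        return 3   # both between the 2nd and 7th percentiles
    raise NotImplementedError("Remaining combinations are specified in Table 2.")

# The Memory domain additionally requires <50% retention (delayed relative to
# initially learned material) before maximum scores are assigned.
```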
Step 6: Calculating the DELTA score
Domain-specific scores range from 0 to 5. A maximum domain score of 5 corresponds to both z-Discrep scores within that domain falling below the second percentile (note that the exact cutoff value depends on whether the individual had a "Low," "Average," or "High" predicted score; see Results). For the Memory domain, this also requires <50% retention on both AVLT and LM delayed recall. Domain-specific scores are then summed for a total DELTA score (0–15).
Methods and Analyses for Validating DELTA Scores
We calculated DELTA scores for all ADNI participants with complete test data at Time 0 (baseline assessment; BL) and for each follow-up year up to the fifth year (Y1, Y2, Y3, Y4, Y5). Functional outcomes included the CDR-SOB and FAQ scores corresponding to time points with calculable DELTA scores. Functional outcomes were used as the primary validation because the DELTA score is a purely clinical and psychometrically based score, independent of biomarker indicators of specific disease processes. Biomarkers were used in the secondary validation.
Primary Validation with Functional Outcomes
First, we examined DELTA score changes in the RCN group and the entire baseline sample at follow-up.
Second, we used linear mixed model analyses with maximum likelihood estimation to evaluate associations between DELTA scores and longitudinal functional changes. Model fit was evaluated in a hierarchical (i.e., nested) approach relative to the unconditional means (null) model: (1) fixed and random effects of time, (2) fixed effects of age, gender, and years of education, (3) fixed effect of BL DELTA score, (4) fixed and random effects of the BL DELTA × Time interaction. This first approach most closely mimics a clinical scenario in which a patient obtains a DELTA score and the clinician wants to know how that score predicts future everyday functioning.
Third, we leveraged the longitudinal cognitive data in ADNI by building the following model using DELTA as a time-varying covariate: (1) fixed and random effect of time, (2) fixed effects of age, gender, and years of education, (3) fixed effects of mean-DELTA and mean-centered-DELTA (decoupled to control for each case’s mean cognitive functioning across the study), (4) random effect of mean-centered DELTA. This second approach examines how well changes in DELTA scores coincide with changes in functional outcomes over time.
Separate analyses were run with CDR-SOB and FAQ scores as the dependent variable. We tracked overall model fit using −2 Log Likelihood and Akaike's Information Criterion changes at each step, as well as reductions in unexplained variance (ΔR²) for the covariance parameters (random effects).
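For readers who wish to reproduce this modeling approach outside SPSS, a minimal Python sketch using statsmodels is shown below. The data file and variable names (adni_long.csv, cdr_sob, delta_bl, subject_id) are hypothetical placeholders, and the formula corresponds to the final step of the baseline-DELTA hierarchy.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per participant per visit.
long_df = pd.read_csv("adni_long.csv")

# Fixed effects of covariates plus the BL DELTA x Time interaction,
# with random intercepts and slopes for time by participant.
m = smf.mixedlm(
    "cdr_sob ~ time * delta_bl + age + gender + yrs_educ",
    data=long_df,
    groups=long_df["subject_id"],
    re_formula="~time",
).fit(reml=False)  # ML (not REML) so -2LL and AIC are comparable across nested models
print(m.summary())
```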
Secondary Validation with Alzheimer’s Disease Biomarkers
We examined associations between DELTA scores, PET-Aβ burden, and CSF evidence of AD, stratifying participants by apolipoprotein E (APOE) e4 carrier status. PET-Aβ was quantified using PET scanning with the 18F-florbetapir (AV45) tracer. Standardized uptake value ratios (SUVr) were calculated by ADNI by dividing mean cortical florbetapir uptake (frontal, anterior/posterior cingulate, lateral parietal, lateral temporal) by whole cerebellar uptake. PET-Aβ positivity reflected a cross-sectional SUVr > 1.11 (Landau et al., 2014). CSF evidence of AD was determined by cutoff scores optimized for ADNI (Hansson et al., 2018) using the ratio of CSF-hyperphosphorylated tau (CSF-pTau) to CSF-Aβ (1–42) (CSF-pTau/CSF-Aβ > .0251). We analyzed continuous associations between DELTA score and biomarker burden using Spearman's rho for non-normal data and examined the positive predictive value (PPV) of a given DELTA group for dichotomized biomarker outcomes (PET-Aβ positivity and CSF-AD positivity). See www.loni.usc.edu for acquisition and processing details.
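A brief sketch of the secondary validation analyses follows, assuming a hypothetical merged data file and column names:

```python
import pandas as pd
from scipy.stats import spearmanr

bio = pd.read_csv("adni_biomarkers.csv")  # hypothetical merged file

# Continuous association between DELTA and amyloid burden (non-normal data).
rho, p = spearmanr(bio["delta_bl"], bio["pet_suvr"])

# Positive predictive value of a DELTA "level of evidence" group for
# PET-Abeta positivity (SUVr > 1.11).
in_group = bio["delta_group"] == "Strong Evidence"
ppv = (bio.loc[in_group, "pet_suvr"] > 1.11).mean()
```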
All statistical analyses were performed using SPSS v.22 or v.25. A priori alpha levels were set at p < .005 (unless otherwise noted) to partially account for spurious findings associated with a large sample size and to reflect recent proposals to lower thresholds for enhancing replication of new discoveries (Benjamin et al., 2018; Ioannidis, 2018).
Restricted Sample
The ADNI sample considered for this study was around 92% white/Caucasian (4% black/African American, 2% Asian, <1% each of multiple races, American Indian/Alaskan, and Hawaiian). We elected to restrict DELTA score development and validation to white/Caucasian participants and openly acknowledge this limited scope. We assumed that indiscriminately applying the data underlying DELTA score development across diverse racial and ethnic groups was inappropriate. The well-documented and complex relationships between sociodemographic factors and cognitive test scores (Dotson, Kitner-Triolo, Evans, & Zonderman, 2009; Manly & Echemendia, 2007; Rivera Mindt, Byrd, Saez, & Manly, 2010) require careful consideration when extrapolating data from unrepresentative samples (Brooks, Sherman, Iverson, Slick, & Strauss, 2011). We hope these methods will be replicated using racially and ethnically diverse cohorts.
RESULTS
The RCN group included 270 participants (mean ± SD age = 74.7 ± 5.5 years; 51.1% female; 100% white/Caucasian; 74.1% APOEe4 noncarriers; mean ± SD MMSE = 29.6 ± .5; mean ± SD education = 16.7 ± 2.6 years). Regression-based equations for predicting test score performance (including some not used for DELTA score calculation), as well as the added variance explained by the word-reading ability component, are provided in Table 1.
The z-Discrep score distributions varied in the RCN group as a function of predicted score. Those with higher predicted test scores had a different distribution of z-Discrep scores than those with lower predicted test scores. For example, z-Discrep = −1.86 corresponds to the seventh percentile for participants with high predicted AVLT delayed recall, whereas z-Discrep = −1.42 corresponds to the seventh percentile for those with low predicted AVLT delayed recall. Therefore, we stratified the predicted scores for each test into “Low” (lower quartile), “Average” (middle quartiles), and “High” (upper quartile) groups and then identified the z-Discrep values corresponding to the 16th, 7th, and 2nd percentile that were specific to the predicted score group (Figure 1).
Fig. 1. Conceptual figure demonstrating derivation of the z-Discrep score that corresponds to a given percentile cutoff, stratified by each participant’s predicted raw test score (“Low,” “Average,” or “High”). All bell curves represent the theoretical distribution of standardized (z) discrepancy scores. The top curve shows the z-Discrep distribution for the entire RCN sample along with the theoretical “Low,” “Average,” and “High” predicted score subgroups that make up the overall RCN sample. The bottom curves show that these subgroups were isolated so that the z-Discrep scores that correspond to the 2nd, 7th, and 16th percentile cutoffs would be specific to the predicted score group’s z-Discrep distribution.
We created five "Level of Evidence" groups for cognitive decline based on the DELTA score distribution in the RCN sample: DELTA = 0 ("No Evidence" of cognitive decline; 73.8% of RCNs), DELTA = 1–3 ("Low Evidence"; 24.6% of RCNs), DELTA = 4–6 ("Moderate Evidence"; 1.6% of RCNs), DELTA = 7–9 ("Strong Evidence"; .0% of RCNs), and DELTA = 10+ ("Very Strong Evidence"; .0% of RCNs).
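These groupings imply a simple mapping from total DELTA score to level of evidence, sketched below:

```python
def evidence_level(delta_total):
    """Map a total DELTA score (0-15) to the level-of-evidence groups above."""
    if delta_total == 0:
        return "No Evidence"
    if delta_total <= 3:
        return "Low Evidence"
    if delta_total <= 6:
        return "Moderate Evidence"
    if delta_total <= 9:
        return "Strong Evidence"
    return "Very Strong Evidence"
```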
Evidence for Incremental Value of DELTA Score Use
Addition of word-reading ability as a performance-based predictor of premorbid performance significantly improved model fit for 3 of the 5 regression-predicted test scores used for DELTA score calculation (Clock Drawing excluded): LM Delayed Recall, the Trails B/A proportion score, and the BNT-30 (Table 1). Word-reading independently accounted for 15–60% of the total variance explained by the overall models for these tests.
We then examined the potential benefit of using multiple test scores for characterizing cognitive abilities within a domain by comparing rates of “low scores” on single tests to frequencies of DELTA scores. Individual test scores were converted to percentiles based on the RCN score distribution. Using memory tests and a seventh percentile cutoff (z < −1.5 in a normal distribution) as an exemplar, we observed that 10.5% of RCNs had a LM Delay score <7th%ile and 9.7% had an AVLT Delay score <7th%ile, while 19.1% had either one or the other. In other words, one in five cognitively intact individuals may get flagged as having “impaired” memory if relying on individual test scores. However, 97% of the RCN sample had a DELTA Memory score of 0 and 98% had an overall DELTA score in the “No Evidence” (DELTA = 0) or “Low Evidence” range (DELTA = 1–3). This suggests a possible reduction in “false positive” determinations of cognitive decline using the DELTA score.
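The 19.1% figure follows almost exactly from the individual base rates under an approximate-independence assumption, as the short calculation below shows:

```python
# Two memory scores that each flag ~10% of cognitively normal people flag
# nearly 20% when either one may count (assuming approximate independence):
p_lm, p_avlt = 0.105, 0.097
p_either = 1 - (1 - p_lm) * (1 - p_avlt)  # = 0.192, near the observed 19.1%
```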
Longitudinal DELTA Scores
Inspection of the overall sample's longitudinal mean functional and DELTA score data showed expected group-level worsening until around Y3, followed by apparent improvement (decreasing scores) between Y3 and Y4 that held through Y5, suggesting survival bias in the sample's attrition over time. There was also a consistent decline in the representation of APOEe4 carriers. We therefore focused results on longitudinal data spanning BL, Y1, Y2, and Y3. Table 3 shows descriptive data stratified by assessment time point.
Table 3. Sample descriptive statistics stratified by assessment point. Robust cognitively normal (RCN) data represent the RCN sample's baseline visit
APOEe4, apolipoprotein epsilon 4; AVLT, Rey Auditory Verbal Learning Test; BNT-30, 30-item Boston Naming Test; CDR-SOB, Clinical Dementia Rating-Sum of Boxes score; FAQ, Functional Activities Questionnaire; IQR, interquartile range; LM, Logical Memory (WMS-R); Min.–Max., minimum value–maximum value; RCN, robust cognitively normal; SD, standard deviation; y, years.
Of the 270 RCNs used for calculating regression-predicted test score performance, 256 (94.8%) completed all tests and therefore had calculable DELTA scores. Rates of DELTA group changes for the RCN sample and overall BL sample are provided as supplemental material (Supplemental Table B) and in Table 4. Over 83% of RCNs with "No Evidence" at BL who had follow-up cognitive data (i.e., remained in the study) stayed in this group at Y1, Y2, and Y3. Of RCNs with "Low Evidence" at BL who had follow-up cognitive data, over 90% remained "Low Evidence" or reverted to "No Evidence."
Table 4. Change in DELTA group status based on BL DELTA group for the entire BL sample. Values represent the percentage of participants with follow-up DELTA scores within each DELTA group. Interpret reversion and progression percentages at the highest BL DELTA groups (“Strong” and “Very Strong”) with caution due to lower frequency of these DELTA groups at BL and greater loss to follow-up (i.e., survivor bias), particularly by Year 3
DELTA, Discrepancy-based Evidence for Loss of Thinking Abilities.
For the overall BL sample (Table 4), progression to higher levels of evidence for cognitive decline (e.g., “Moderate” or higher) was relatively rare for those with “No Evidence” at BL, but rates increased as a function of BL DELTA group. Reversion and progression percentages become skewed at the highest BL DELTA groups (“Strong” and “Very Strong”) due to lower rates of these levels of evidence at BL and greater loss to follow-up.
DELTA Validation Against Longitudinal Functional Changes
Linear mixed model analyses focused on CDR-SOB and FAQ changes from BL through Y3 as a function of BL DELTA scores. Table 5 shows the model fit characteristics at each stage of the analyses. Higher BL DELTA score (i.e., worse cognition) predicted higher BL CDR-SOB beyond the effects of age, gender, and education (between-case intercept, ΔR² = .318, large effect). The BL DELTA score × Time interaction term significantly improved model fit (ΔR² = .742, large effect) and suggested that higher BL DELTA score was associated with faster increases in CDR-SOB score (i.e., worsening) over time. No covariates explained additional within-case residual variance (i.e., deviation from regression-predicted CDR-SOB) above the effects of time (ΔR² < .02).
Table 5. Linear mixed model analysis of baseline and longitudinal DELTA scores predicting CDR-SOB and FAQ score changes over 3-year follow-up. Terms: “Between-Intercept” – Variance associated with between-participant baseline differences (i.e., initial CDR-SOB/FAQ score); “Within-Residual” – Variance associated with discrepancies between regression-predicted and actual CDR-SOB/FAQ score for each participant; “Time-Intercept” – Variance associated with rates of change in CDR-SOB/FAQ score over time (i.e., between-participant differences in slope of change)
−2LL, −2 Log Likelihood; AIC, Akaike’s Information Criterion; BL, baseline; CDR-SOB, Clinical Dementia Rating Sum of Boxes score; DELTA, Discrepancy-based Evidence for Loss of Thinking Abilities score; c-DELTA, mean-centered DELTA score; FAQ, Functional Activities Questionnaire; m-DELTA, mean DELTA score across time points.
a All stepwise model additions statistically improved model fit (p < .0001).
We then ran the models using participants' longitudinal, visit-specific DELTA scores instead of only the BL DELTA score (i.e., DELTA as a time-varying covariate). Higher mean and mean-centered DELTA scores predicted faster rates of increase in CDR-SOB score beyond the effects of time (occasion-intercept, ΔR² = .658, large effect), and accounting for person-specific longitudinal changes in DELTA score further improved the model (ΔR² = .791). These results suggest that longitudinal changes in CDR-SOB track closely and in the same direction as changes in DELTA score. Of note, remaining significant unexplained variance indicated that the degree to which DELTA score tracks with CDR-SOB is not uniform across participants (i.e., the strength of the association between DELTA score and CDR-SOB differs from person to person).
Results were similar when evaluating longitudinal FAQ changes in place of CDR-SOB. For all analyses, age, gender, and years of education did not significantly predict longitudinal functional changes in models that included DELTA scores.
DELTA Validation Against PET and CSF Biomarkers
Biomarker validation was performed on the subset of the BL sample with available PET and/or CSF biomarkers. Higher BL DELTA score predicted higher PET-Aβ SUVr (n = 739, ρ = .324, medium effect), lower CSF-Aβ (n = 1000, ρ = −.412, medium–large effect), higher CSF-pTau (n = 998, ρ = .340, medium effect), and higher CSF-pTau/CSF-Aβ ratio (n = 997, ρ = .460, medium–large effect); all p's < .001. We also examined the Memory subscore (0–5) of the DELTA score independently and found relationships with biomarkers similar to those of the total DELTA score.
Figure 2A shows relationships among DELTA groups and PET-Aβ status stratified by APOEe4 noncarriers and carriers. Among APOEe4 noncarriers with PET-Aβ scans (n = 416), 135 (32.5%) were PET-Aβ(+). Positive predictive value (PPV) of the DELTA scores increased as a function of DELTA "level of evidence" group, from 25.7% in the DELTA = 0 group ("No Evidence") to 63.6% for participants with DELTA > 6 ("Strong Evidence" plus "Very Strong Evidence" groups). Results were similar when looking at Memory score only (Figure 2B). Among APOEe4 carriers with PET-Aβ scans (n = 320), 240 (75.0%) were PET-Aβ(+). PPV of the DELTA groups increased from 62.7% in the DELTA = 0 group ("No Evidence") to 92.7% for participants with DELTA > 3 ("Moderate Evidence" or higher groups). We found slightly stronger relationships based on the Memory subscore, such that Memory scores of 4 (17/17 participants) and 5 (17/17 participants) had 100% PPV.
Fig. 2. (A–B): Positive predictive value for PET-Aβ (SUVr > 1.11) across each DELTA group (A) and stratified by DELTA Memory score (B). Separate lines represent APOE e4 status (carriers vs. noncarriers) and the total sample, with base rates of PET-Aβ positivity provided for each group (dotted lines). NOTE: No participants obtained a DELTA Memory score of “1” (see Table 2 for criteria).
Figure 3A shows relationships among DELTA groups and CSF-AD biomarker status stratified by APOEe4 noncarriers and carriers. Among APOEe4 noncarriers with CSF-AD biomarkers (pTau/Aβ ratios, n = 553), 166 (30.0%) were CSF-AD(+). PPV of the DELTA scores again increased as a function of BL DELTA "level of evidence" group, from 18.6% in the "No Evidence" group to 88.9% in the "Very Strong Evidence" group (8/9 participants). Results were similar when looking at Memory score only (Figure 3B). Among APOEe4 carriers with CSF-AD biomarkers (n = 444), 340 (76.6%) were CSF-AD(+). PPV of the DELTA scores increased from 52.7% in the "No Evidence" group to 95.1% in the "Strong Evidence" group (58/61 participants) and 96.6% in the "Very Strong Evidence" group (28/29 participants). We again observed stronger relationships based on the Memory subscore in the APOEe4 carrier group, such that Memory scores of 4 (32/32 participants) and 5 (36/36 participants) had 100% PPV.
Fig. 3. (A–B): Positive predictive value for CSF-AD biomarkers (pTau/Aβ ratio > .0251) across each DELTA group (A) and stratified by DELTA Memory score (B). Separate lines represent APOE e4 status (carriers vs. noncarriers) and the total sample, with base rates of CSF-AD positivity provided for each group. NOTE: No participants obtained a DELTA Memory score of "1" (see Table 2 for criteria).
Automated Scoring Program
An automated scoring program for calculating DELTA scores is provided for free use as supplemental material.
DISCUSSION
We set out to develop and validate a novel approach for characterizing and quantifying evidence for cognitive decline based on normative reference methods, which we termed the DELTA score. The DELTA score does not replace existing methods for assessing within-person longitudinal change (e.g., reliable change indices, standardized regression-based change). Novel aspects of the DELTA score and specific considerations for appropriate use are outlined in supplemental material. The approach was rooted in principles of low-score base rates aggregated across a relatively comprehensive neuropsychological test battery that evaluated components of memory, executive function, and language. This is similar conceptually to prior work (Bondi et al., 2014; Jak et al., 2009; Oltra-Cucarella et al., 2018) but differs in that the DELTA score is continuous compared to typical dichotomization of "impaired" versus "unimpaired" status. We note that while the DELTA scores themselves have somewhat limited variability (0–15 overall, 0–5 per domain), the automated scoring program also produces the continuous standardized discrepancy scores for each test, which can be flexibly applied in research and clinical settings to fit individual needs.
Prior work using the ADNI cohort separately developed continuous composite scores for memory and executive function, which demonstrated improved prediction of cognitive decline and stronger associations with neuroimaging/CSF biomarker outcomes compared to single test scores (Crane et al., 2012; Gibbons et al., 2012). A key finding of these composite approaches was the ability to detect meaningful clinical changes with smaller sample sizes than would be required using individual tests, which has significant implications for designing clinical trials with cognitive outcomes. The DELTA approach may offer similar advantages given that it is based on multiple test scores and spans multiple cognitive domains, though further research and validation are required.
Jak et al. (2009) showed that "impairment" classification methods with the greatest stability and fewest instances of reversion also likely have the lowest sensitivity to true cognitive impairment (i.e., single-score approaches). This highlights important concepts: (1) progression/reversion/stability rates only matter if the clinician is confident in the diagnostic classification in the first place, and (2) diagnostic sensitivity and specificity are a direct function of the strictness of the criteria for determining "impairment" (Iverson & Brooks, 2011). Relatedly, reliance on interpretation of single test scores obtained from a larger test battery may increase risk for both false positives and false negatives. This has been demonstrated regularly in the "multivariate base rate" literature (Binder et al., 2009; Brooks & Iverson, 2010; Houck et al., 2019), and the concept held true in our study as well. For example, a clinician who defines "memory impairment" as a score <7th percentile of a normative reference group (or z < −1.5 in a normal distribution) is accepting a 7% false positive rate for any single test. However, the false positive rate rises quickly as more test scores are interpreted: almost 20% of our study's RCN group would qualify as memory impaired (i.e., one in five scored <7th%ile on either their AVLT or LM delayed recall). In contrast, 97% of our robust normal controls had a DELTA Memory score of 0, illustrating that interpretive approaches like the DELTA score alleviate such problems by taking these concepts into account.
As discussed, there are several strengths of using composite scores derived from comprehensive evaluations for characterizing cognitive status, but there are also practical considerations. Incorporating more tests increases the length of assessments. Other composite scores derived from ADNI data (Crane et al., Reference Crane, Carle, Gibbons, Insel, Mackin, Gross, Jones, Mukherjee, Curtis, Harvey and Weiner2012; Gibbons et al., Reference Gibbons, Carle, Mackin, Harvey, Mukherjee, Insel, Curtis, Mungas and Crane2012) included four tests underlying a memory composite [Rey AVLT, WMS-R LM, the Alzheimer’s Disease Assessment Schedule, and MMSE] and five tests for the executive function composite [category fluency (both animals and vegetables), Trail Making Test, Digit Span, WAIS-R Digit-Symbol, Clock Drawing]. The DELTA score in this study comprises a battery of six total tests covering three domains, which we estimate would take 35–40 min. This offers practical advantages potentially more readily integrated into modern medical settings that emphasize multidisciplinary and time-efficient patient visits. The DELTA score’s high PPV for both PET-Aβ and CSF-AD biomarker status (+ or −) highlights a potential future application for efficiently identifying (or ruling out) presumably related (or unrelated) disease states for clinical trial enrollment.
Individual neuropsychological tests often have suboptimal test–retest reliability, and therefore scores fluctuate (both higher and lower) due to factors unrelated to the disease process (Brooks et al., 2011). Using normative reference groups demographically and/or intellectually dissimilar to an individual patient also heightens risk for misclassifying cognitive decline (Iverson & Brooks, 2011). Clinicians must be wary of "red herrings" in the form of cognitive test score variability unrelated to disease state. Reducing this phenomenon requires development of more reliable and culturally appropriate measures, and/or classifying cognitive function using multiple test scores in conjunction with low-score base rate concepts.
Neuropsychologists uniquely appreciate these concepts and, unsurprisingly, have spearheaded modern approaches for classifying cognitive impairment. However, even the more methodologically rigorous classification criteria often reduce samples to either "impaired" or "unimpaired" status and then characterize impairment by type (combinations of single vs. multiple domain and amnestic vs. non-amnestic labels). Dichotomizing cognitive status may contribute to mixed findings regarding clinical progression, reversion, or stability (Pandya, Clem, Silva, & Woon, 2016). Variability in progression, reversion, and stability rates across studies also likely reflects inconsistent definitions of impairment and the number of parameters used for classifying participants (Edmonds et al., 2015; Jak et al., 2009; Thomas et al., 2019).
LIMITATIONS
Multiple limitations of this initial validation coincide with the future research needs outlined below. The current DELTA score was derived from an exclusively white/Caucasian, highly educated sample. Not every participant contributed data for all follow-up assessment points, likely resulting in survivor bias. Some participants contributed data inconsistently (e.g., BL, Y1, and Y3 but not Y2, Y4, and Y5), which could bias longitudinal frequency rate statistics. Advanced biomarkers were available only on a subset of the total study sample that appeared enriched for APOEe4 carriers (about 40% of those with biomarker data); therefore, PPVs using the total study sample may overestimate general population risk. Age, gender, years of education, and word-reading ability collectively explained only 5–13% of the variance in predicted test scores, suggesting several unmeasured and potentially important factors that could improve the models. Exploring nonlinear and/or non-mean regression (e.g., quantile regression) for the roles of age, education, word-reading, etc., and accounting for variability in residuals across the spectrum of these variables, may further improve premorbid score predictions (Sherwood, Zhou, Weintraub, & Wang, 2016). Sample sizes were relatively small for certain DELTA groups and associated data should be interpreted cautiously; further research may refine the cutoff scores associated with a given "level of evidence" group. Lastly, no participants in the study obtained a DELTA Memory score of "1." Replication in other large samples will help refine scoring criteria, if necessary.
FUTURE DEVELOPMENT AND EXPANSION OF DELTA METHODS
We demonstrated a proof of concept for a novel approach to characterizing evidence for cognitive decline. As with any pilot endeavor, there are many opportunities for expansion and improvement. We propose several ideas that we hope will guide researchers and clinicians in independent replication and validation efforts, and help promote clinical translation.
Replicate this work in multicultural samples.
Expand predictors in the regression equations to better explain cognitive test scores. The automated scoring program contains empty fields for “VARIABLE #5” and “VARIABLE #6”, so other researchers can easily adapt the scoring program using new data and novel predictor variables.
Incorporate tests from additional cognitive domains.
Validate the DELTA approach using neuropsychological tests other than those in the present study, which were dictated by the available ADNI battery.
We envision opportunities for identifying clinically relevant "profiles" based on patterns of domain-specific DELTA scores. Analogous to "A/T/N" classifications for biomarker evidence of amyloid, tau, and neurodegeneration, we propose something like "M/E/L" for neuropsychological evidence of memory, executive, and language decline using DELTA methodology. We anticipate diverse opinions regarding which cognitive domains to add and which test scores qualify for a given domain.
Evaluate use of DELTA scores in clinical trials using cognitive outcomes.
Apply similar methodology for developing a “mood” score and a “behavior” score that could be used in conjunction with the cognitive DELTA score and biomarker panels. This could more precisely characterize clinical syndromes with prominent noncognitive features (e.g., FTD syndromes).
CONCLUSIONS
We present data supporting the initial development and validation of a discrepancy-based test score metric, called the DELTA score, for characterizing the level of evidence for cognitive decline. Higher initial DELTA scores predicted faster rates of functional decline and longitudinal changes in DELTA scores coincided with changes in functional questionnaire scores. Greater evidence for cognitive decline predicted AD biomarker status, particularly for APOEe4 carriers. Future work should expand the DELTA score to different populations, include additional cognitive domains, and evaluate how domain-specific score patterns align with neurodegenerative disease biomarkers.
ACKNOWLEDGMENTS
This work was supported by an Alzheimer’s Association grant (KRT; AARF-17-528918). The contents of this paper do not represent the views of the Department of Veterans Affairs or the United States Government.
CONFLICT OF INTEREST
The authors have no conflicts of interest to disclose.
SUPPLEMENTARY MATERIAL
To view supplementary material for this article, please visit https://doi.org/10.1017/S1355617719001346