
Reliable Change on Neuropsychological Tests in the Uniform Data Set

Published online by Cambridge University Press:  03 August 2015

Brandon E. Gavett*
Affiliation:
University of Colorado, Colorado Springs, Department of Psychology, Colorado Springs, Colorado
Lee Ashendorf
Affiliation:
Boston University School of Medicine, Department of Psychiatry, Boston, Massachusetts
Ashita S. Gurnani
Affiliation:
University of Colorado, Colorado Springs, Department of Psychology, Colorado Springs, Colorado
Correspondence and reprint requests to: Brandon E. Gavett, UCCS Department of Psychology, 1420 Austin Bluffs Parkway, Colorado Springs, CO 80918. E-mail: bgavett@uccs.edu

Abstract

Longitudinal normative data obtained from a robust elderly sample (i.e., one believed to be free from neurodegenerative disease) are sparse. The purpose of the present study was to develop reliable change indices (RCIs) that can assist with interpretation of test score changes relative to a healthy sample of older adults (ages 50+). Participants were 4217 individuals who completed at least three annual evaluations at one of 34 past and present Alzheimer’s Disease Centers throughout the United States. All participants were diagnosed as cognitively normal at every study visit; the number of approximately annual evaluations per participant ranged from three to nine. One-year RCIs were calculated for 11 neuropsychological variables in the Uniform Data Set by regressing follow-up test scores onto baseline test scores, age, education, visit number, post-baseline assessment interval, race, and sex in a linear mixed effects regression framework. In addition, the cumulative frequency distributions of raw score changes were examined to describe the base rates of test score changes. Baseline test score, age, education, and race were robust predictors of follow-up test scores across most tests. The effects of maturation (aging) were more pronounced on tests related to attention and executive functioning, whereas practice effects were more pronounced on tests of episodic and semantic memory. Interpretation of longitudinal changes on 11 cognitive test variables can be facilitated through the use of reliable change intervals and base rates of score changes in this robust sample of older adults. A Web-based calculator is provided to assist neuropsychologists with interpretation of longitudinal change. (JINS, 2015, 21, 558–567)

Type
Research Articles
Copyright
Copyright © The International Neuropsychological Society 2015 

Introduction

Neuropsychologists are often tasked with re-evaluating individuals to help determine whether cognitive functioning has changed over a given time interval. Most neuropsychological test instruments are interpreted using normative data collected from a putatively healthy sample to understand the expected mean and variance in test scores produced by nondiseased persons (Mitrushina, Boone, Razani, & D’Elia, 2005). These normative data are typically used to interpret an individual person’s test scores in the context of his or her peers, with corrections for demographic factors such as age, education, sex, and race (Heaton, Miller, Taylor, & Grant, 2004). When applying norms for the purpose of understanding change in older adults, there are two critical issues that could undermine interpretation.

First, it is difficult to determine whether the normative data are robust to latent causes of cognitive difficulties, especially in older age groups. An older person who is part of a normative sample may be in the very early stages of a neurodegenerative disease, such as Alzheimer’s disease, but may not be manifesting clinically obvious cognitive difficulties at the time the normative data were collected. Recent efforts have been made to include participants believed to be disease-free after several years of follow-up (“robust norms”; Holtzer et al., 2008; Pedraza et al., 2010), as a means of ensuring that the normative sample is representative of cognitively healthy individuals.

Second, norms are generally cross-sectional in nature, not longitudinal, yet are interpreted to reflect magnitude of change when used for repeated assessments of patients or research participants. This ignores properties of the test such as reliability and practice effects, and it also discounts statistical effects such as regression to the mean (McCaffrey, Duff, & Westervelt, 2000). Various statistical methods have been proposed to account for these potential confounds, ranging from simple standard deviation difference methods (see Frerichs & Tuokko, 2005), to reliable change models of varying complexity (see Hinton-Bayre, 2010), to standardized regression-based (SRB) methods (e.g., Attix et al., 2009). See Duff (2012) and Heilbronner et al. (2010) for a more detailed discussion of these and other issues related to serial assessment in neuropsychology. Robust longitudinal norms contextualize the change in an individual’s test scores relative to a sample that is believed to have been free from neurodegenerative disease during the test–retest interval. Change in test scores that is more extreme than that observed in robust normative samples may reflect a change in cognition that is beyond the limits of normal aging (Bläsi et al., 2009). Robust norms have been shown to improve diagnostic accuracy in the longitudinal assessment of older adults (De Santi et al., 2008; Holtzer et al., 2008; Pedraza et al., 2010).

In this study, we propose to address the two weaknesses discussed above by quantifying expected changes in cognitive abilities over time through the use of linear mixed effects regression models to calculate reliable change intervals (RCIs). Linear mixed effects models extend SRB models to longitudinal data by allowing for individual variability in baseline test scores (intercepts) and rates of change over time (slopes; Pinheiro & Bates, 2000). These models can be used to predict an examinee’s follow-up test score based on variables such as the examinee’s baseline test score and several demographic variables. The observed follow-up test score is compared to the predicted follow-up test score, and if the difference is large enough, the change may be interpreted as reliable. The magnitude of reliable change is scaled relative to the standard error observed in the linear mixed effects model and the degree of confidence desired in the prediction interval (often 90%). For instance, if the standard error is 2.0 and the desired degree of confidence for the interval is 90%, then the confidence interval would have a range of 2.0 times the standard normal distribution quantile associated with a two-tailed alpha level of .10 (i.e., 1.645). (For small sample sizes, this standard normal quantile can be replaced with the appropriate t distribution quantile for the given degrees of freedom.) In this example, 2×1.645=3.29, indicating that the 90% confidence interval spans 3.29 units in both the positive and negative directions. Differences between observed and predicted follow-up scores that are more extreme than ±3.29 are thus suggestive of reliable change. By applying RCIs to neuropsychological measurements, one can identify whether a change in a given score is clinically interpretable. We seek to produce robust longitudinal change indices that can be used in vivo to determine whether individuals are changing at a rate that is consistent with normal aging, whether an individual’s rate of change is more rapid than expected, or whether a treatment has a beneficial effect on cognition.
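For readers who prefer the arithmetic spelled out, here is a minimal sketch in R (the language used for all analyses reported below) that reproduces the worked example above; nothing in it is specific to any one test:

```r
# Margin of error for a 90% reliable change interval, using the
# worked example from the text (standard error = 2.0).
se  <- 2.0
z90 <- qnorm(0.95)   # 1.645: quantile for a two-tailed alpha of .10
moe <- se * z90      # 2 x 1.645 = 3.29
# Differences between observed and predicted follow-up scores more
# extreme than +/- moe are suggestive of reliable change.
```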

We will identify individuals from the National Alzheimer’s Coordinating Center (NACC) Uniform Data Set (UDS; Beekly et al., 2007; Morris et al., 2006) who have been confirmed through at least three (and up to nine) longitudinal clinical assessments to be cognitively healthy. We will then retrospectively examine the first (baseline) and second (follow-up) visits to quantify the degree of change observed across time in this putatively healthy sample. While the UDS neuropsychological battery is well established (Weintraub et al., 2009), its psychometric properties are still under evaluation and no RCIs have yet been presented, limiting the effectiveness and potentially the accuracy of longitudinal evaluations using this selection of tests. As the UDS neuropsychological battery is possibly the most widely used research battery for the cognitive assessment of dementia in the United States, it is important to identify the longitudinal psychometric characteristics of this battery, for both research and clinical purposes. The objective of this study is to present RCIs based on linear mixed effects models for each of the available UDS neuropsychological variables. As a result, readers will have access to robust longitudinal data that can be used to interpret cognitive changes in older adults.

Method

Participants

This study was determined to be exempt from human subjects review by the University of Colorado, Colorado Springs Institutional Review Board. Data used in the present study were obtained from the NACC’s publicly available database. Created by the National Institute on Aging, the NACC compiles a wide variety of data, including neuropsychological test scores from 34 past and present Alzheimer’s Disease Centers (ADCs) using the UDS battery. We included participants who had completed at least three visits, including one baseline visit, between September 2005 and March 2014. A total of 4598 individuals in the database were diagnosed as cognitively normal at all visits. We also excluded 92 participants who were less than 50 years old at their baseline visit and 302 participants who did not speak English as their primary language or who were not assessed in English. In total, we excluded 381 participants (13 participants met more than one of the exclusion criteria), leaving a sample of 4217 for inclusion in the study. These participants underwent at least three—and up to nine—approximately annual evaluations at an ADC and were diagnosed as cognitively normal at all evaluations. Because very few participants completed more than seven visits, we did not analyze data from the eighth or ninth visits. See Table 1 for details regarding participant demographic variables.
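Purely as an illustration, the selection logic can be expressed in R roughly as follows; the data frame `uds` and the column names (`id`, `dx`, `age`, `primary_lang`) are hypothetical stand-ins, not the actual NACC variable names:

```r
library(dplyr)

# Hypothetical sketch of the inclusion/exclusion rules; assumes one row
# per participant visit, sorted by visit number within each id.
eligible <- uds %>%
  group_by(id) %>%
  filter(n() >= 3,                             # at least three visits
         all(dx == "Normal"),                  # cognitively normal at every visit
         first(age) >= 50,                     # age 50+ at baseline
         first(primary_lang) == "English") %>% # assessed in English
  ungroup()
```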

Table 1 Participant demographics

Note. N=sample size; M=mean; SD=standard deviation.

Measures

The neuropsychological measures available for these analyses included the Mini-Mental State Examination (MMSE; Folstein, Folstein, & McHugh, 1975), the Wechsler Adult Intelligence Scale-Revised (WAIS-R) Digit Span Forward and Backward conditions (Wechsler, 1981), WAIS-R Digit Symbol (Wechsler, 1981), Trail Making Test (TMT) parts A and B (Reitan & Wolfson, 1993), Story A from the Wechsler Memory Scale-Revised (WMS-R) Logical Memory subtest (Wechsler, 1987), two semantic fluency tasks (animals and vegetables; Weintraub et al., 2009), and the 30 odd-item short form of the Boston Naming Test (BNT; Jefferson et al., 2007). These tests are familiar to most dementia clinicians and researchers and will not be described here; see Weintraub and colleagues (2009) for more information.

Data Analysis

The test data, which were available for as few as three to as many as seven approximately annual visits, were used in a linear mixed effects model for each test, with visit number nested within participants. For each test, we modeled linear, quadratic, and logarithmic trends and found that a linear trend provided the best balance between model fit and parsimony (data not shown). Reliable change intervals were derived for the second visit only. All analyses were performed in R version 3.1.2 (R Core Team, 2015). The lme4 package (version 1.1-8) was used for longitudinal modeling (Bates, Maechler, Bolker, & Walker, 2015).
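A sketch of how such a trend comparison might look with lme4 appears below; the data frame `followups` and its columns are assumptions for illustration, and the models are fit by maximum likelihood so that their fits can be compared:

```r
library(lme4)

# Hypothetical comparison of trend shapes for one test. REML = FALSE
# (maximum likelihood) so that models with different fixed-effects
# structures can be compared on AIC/BIC.
m_lin <- lmer(score ~ visit + (1 + visit | id),
              data = followups, REML = FALSE)
m_qua <- lmer(score ~ poly(visit, 2) + (1 + visit | id),
              data = followups, REML = FALSE)
m_log <- lmer(score ~ log(visit) + (1 + visit | id),
              data = followups, REML = FALSE)
anova(m_lin, m_qua, m_log)  # AIC/BIC weigh fit against parsimony
```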

Eleven linear mixed effects regression models, one for each test, were specified to include both fixed and random (intercept and slope) effects. The follow-up test scores from visits two to seven were regressed onto the following fixed effects: baseline test score, age at baseline (years), education (years), visit number, assessment interval (years post-baseline), race (Caucasian or non-Caucasian), and sex (male or female). All predictor variables were entered simultaneously; although stepwise regression procedures have been used previously for RCI studies in the neuropsychology literature, these methods were not used here. Being data driven rather than theory driven, models identified using stepwise methods have the potential to capitalize on chance and may not generalize beyond the sample data; many other limitations have also been identified (e.g., Whittingham, Stephens, Bradbury, & Freckleton, 2006). Dummy coding was used for race and sex, with Caucasians and males as the reference categories for their respective groups. For each model, fixed effects parameter estimates and their 95% confidence intervals were obtained using restricted maximum likelihood estimation. The standard deviations of the random intercepts and slopes were also obtained. Predicted follow-up scores were based on the fixed effects parameter estimates only. To account for the variability introduced by the uncertainty in both the fixed and random effects, 90% reliable change intervals were based on the residual standard error as well as the variability in the predictions. The variability in the predictions was estimated using parametric bootstrapping (B=1000) of the predicted test scores across all visits. This bootstrapping procedure simulated values for the random effects to account for these sources of variability, yielding unique prediction intervals for each participant.
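A condensed sketch of one of these models, using the lme4 interface named above, is given here; the data frame and column names are hypothetical, and the covariate coding follows the description in the text:

```r
library(lme4)

# One of the eleven models, with hypothetical variable names. Follow-up
# scores (visits 2-7) are regressed on the fixed effects listed in the
# text, with a random intercept and slope (visit) for each participant.
m <- lmer(score ~ baseline + age + educ + visit + interval + race + sex +
            (1 + visit | id),
          data = followups, REML = TRUE)

# Predicted follow-up scores from the fixed effects only:
pred <- predict(m, re.form = NA)

# Parametric bootstrap (B = 1000) of the predictions, simulating the
# random effects (use.u = FALSE) rather than conditioning on them:
boot <- bootMer(m, FUN = function(fit) predict(fit, re.form = NA),
                nsim = 1000, use.u = FALSE, type = "parametric")
```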

In addition to calculating RCIs, we also examined the frequency with which raw scores changed from baseline to follow-up. To establish base rates for longitudinal change in this sample, we derived cumulative percentages for raw score changes of each observed magnitude. It is not uncommon for “statistically significant” score differences to occur frequently in healthy samples (Matarazzo & Herman, 1984). These base rate data can therefore augment the RCI values, allowing users not only to determine the statistical significance of an observed change from baseline to follow-up, but also to gauge the relative frequency of a change of that magnitude.
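A base rate table of this kind takes only a few lines of R; the score vectors below are hypothetical placeholders for one test’s baseline and follow-up raw scores:

```r
# Cumulative frequency distribution of raw score changes for one test.
change  <- followup_score - baseline_score          # hypothetical vectors
cum_pct <- cumsum(prop.table(table(change))) * 100
round(cum_pct, 1)  # % of the sample at or below each observed change
```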

Results

Participant demographics are presented in Table 1. Descriptive statistics for the 11 neuropsychological tests at baseline and 1-year follow-up are presented in Table 2. The fixed effects parameter estimates and their 95% confidence intervals are presented in Table 3, along with the standard deviations of the random effects. For each test, the random slope accounted for very little variability, with SDs ranging from 0.09 (MMSE) to 3.99 (TMT-B); in contrast, the SDs of the random intercept terms were more sizeable, ranging from 0.66 (MMSE) to 20.99 (TMT-B). These results suggest that, although individuals varied in their baseline test scores, there is little heterogeneity in individual trajectories of change over time on any of the tests. These patterns of change are depicted graphically in Figure 1. As seen in this figure, the margin of error in the average reliable change intervals increases, sometimes asymmetrically, across visits for most tests. A closer examination of the fixed effects parameter estimates and their 95% confidence intervals in Table 3 reveals that, for most tests, baseline test score, age, education, and race were the most reliable predictors of follow-up test score. Higher baseline test scores, younger age, more years of education, and Caucasian race were associated with better performance on all follow-up test scores. Female sex was associated with higher follow-up scores on the MMSE, Digit Symbol Coding, vegetable fluency, and Logical Memory I and II, whereas male sex was predictive of higher follow-up scores on the BNT. A longer post-baseline interval was predictive of worse follow-up scores on all tests except the MMSE and Forward Digit Span. More frequent exposure to tests (i.e., a larger number of previous visits) yielded better scores on Backward Digit Span, Digit Symbol Coding, the BNT, and the two Logical Memory subtests. Neither visit number nor post-baseline interval was predictive of follow-up scores on the MMSE and Forward Digit Span.

Fig. 1 Predicted test scores (black circles) and 90% reliable change intervals (dotted red lines) for each test across visits 2 to 7 based on linear mixed effects regression.

Table 2 Descriptive statistics for each UDS test at baseline and 1-year follow-up

Note. M=mean; SD=standard deviation; S=skewness; K=kurtosis; MMSE=Mini-Mental State Examination; DS-F=Digit Span Forward; DS-B=Digit Span Backward; TMT=Trail Making Test; BNT=Boston Naming Test; LM-I=Logical Memory Immediate Recall; LM-D=Logical Memory Delayed Recall.

Table 3 Linear mixed effects regression parameter estimates for predicting follow-up test scores across seven annual visits

Note. CI=Confidence interval; SD=Standard Deviation; MMSE=Mini-Mental State Examination; DS-F=Digit Span Forward; DS-B=Digit Span Backward; DSC=Digit Symbol Coding; TMT=Trail Making Test; BNT=Boston Naming Test; LM-I=Logical Memory Immediate Recall; LM-D=Logical Memory Delayed Recall.

Because heteroscedasticity among the regression residuals is a potential concern, a plot of the residuals versus fitted values is provided in Figure 2. The models for the MMSE, TMT-A, TMT-B, and BNT should be interpreted with caution due to non-normal score distributions caused by floor and ceiling effects. Floor effects (for the TMT) and ceiling effects (for the MMSE and BNT) may bias the interpretation of change scores in examinees who are close to floor or ceiling on these tests at baseline.

Fig. 2 Residuals versus fitted plots at visit 2 for each test based on linear mixed effects regression.

Table 4 contains data relevant to the reliable change indices from baseline to the first annual follow-up visit. The methods used in this study produce a unique RCI for each participant. To summarize the margin of error needed for reliable change, the data shown in Table 4 were derived from the average participant in our sample [i.e., with mean values of all continuous predictor variables and modal values for sex (i.e., female) and race (i.e., Caucasian)]. The column labeled “SEE” reflects the residual standard error, as reported in Table 3. The column labeled “90% PI MOE” represents the bootstrapped margin of error for predicting follow-up test scores, conditioned on all random effects. The column labeled “90% RCI MOE” represents the margin of error for the 90% reliable change intervals. If the difference between observed and predicted follow-up scores falls outside of this interval, the change may be interpreted as reliable with 90% confidence. The test scores associated with several relevant base rates of score changes on these 11 tests are presented in Table 5.

Table 4 Reliable change intervals from baseline to the first annual follow-up visit for the average participant in the study sample

Note. SEE=standard error of the estimate; PI MOE=Prediction Interval Margin of Error; RCI MOE=Reliable Change Interval Margin of Error; MMSE=Mini-Mental State Examination; DS-F=Digit Span Forward; DS-B=Digit Span Backward; TMT=Trail Making Test; BNT=Boston Naming Test; LM-I=Logical Memory Immediate Recall; LM-D=Logical Memory Delayed Recall.

Table 5 UDS test score changes from baseline to the first annual follow-up visit corresponding to various base rates

Note. MMSE=Mini-Mental State Examination; DS-F=Digit Span Forward; DS-B=Digit Span Backward; DSC=Digit Symbol Coding; TMT=Trail Making Test; LM-I=Logical Memory Immediate Recall; LM-D=Logical Memory Delayed Recall; BNT=Boston Naming Test.

Readers wishing to obtain reliable change intervals for other combinations of predictor variables are referred to the Web-based calculator created to supplement this manuscript. It should be noted, however, that predictions for out-of-sample data cannot be conditioned on the random effects, which may underestimate the magnitude of the reliable change intervals. This calculator can be accessed at https://begavett.shinyapps.io/UDS_RCI.

Discussion

As the aging population continues to grow worldwide, the number of individuals who suffer from neurodegenerative diseases also continues to grow (Sosa-Ortiz, Acosta-Castillo, & Prince, 2012). Clinical diagnosis of neurodegenerative disease requires a change from a baseline level of functioning (McKhann et al., 2011), which supports the need for serial assessment. Despite the clear importance of serial assessment in the tracking of longitudinal cognitive decline, relatively little attention has been paid to issues of interpreting change scores. Without an understanding of factors such as normal aging, practice effects, regression to the mean, and measurement error, it may be easy to misinterpret score differences between baseline and follow-up. Because very limited normative data are available for serial assessments and change scores, interpretation of change is often subjective.

The current study adds to the reliable change literature in two important ways. First, we have used linear mixed effects regression to model change in cognitive test scores over at least three and as many as seven approximately annual visits. The results of these analyses reveal that there is little heterogeneity in the individual trajectories of change over time in a large sample believed to be free from cognitive impairment. Second, these results also help to tease apart the relative contributions of maturation (i.e., normal aging) and practice effects that can affect follow-up test scores. Of the 11 test scores examined here, practice effects were most evident for Backward Digit Span, Digit Symbol Coding, the BNT, and the two Logical Memory subtests. Based on the parameter estimates for these tests, a one-point test score increase appears after approximately 2 visits for Logical Memory Immediate and Delayed, 3 visits for Digit Symbol Coding, 9 visits for the BNT, and 17 visits for Backward Digit Span, when holding all other predictor variables constant. For many tests, these practice effects are outweighed by the length of the post-baseline assessment interval, which was inversely associated with performance on Backward Digit Span, Digit Symbol Coding, TMT-A and B, both semantic fluency tasks, the BNT, and both Logical Memory subtests. For Backward Digit Span, Digit Symbol Coding, TMT-A and B, and semantic fluency, the post-baseline assessment interval had a more pronounced effect than the influence of practice. On the other hand, practice effects outweighed maturation effects on the BNT and both Logical Memory subtests. Therefore, literature on practice effects may be augmented by consideration of test–retest intervals (e.g., Duff, Callister, Dennett, & Tometich, 2012).
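These visit counts are simply the reciprocal of each test’s visit-number fixed effect. An illustrative computation follows; the coefficient shown is a hypothetical value of roughly the magnitude implied above, not an estimate taken from Table 3:

```r
# Approximate visits needed for a one-point practice gain, holding all
# other predictors constant; beta_visit is illustrative only.
beta_visit       <- 0.5            # hypothetical per-visit fixed effect
visits_per_point <- 1 / beta_visit # ~2 visits, as for Logical Memory
```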

These linear mixed effects models were used to calculate a standard error for predicted test scores at an examinee’s second visit. These standard errors were used, along with the variability in predicted test scores, to generate 90% reliable change intervals, which provide a range of difference scores that fall within the test’s margin of error while accounting for several important covariates and sources of variability. The results provide empirical data on change scores from baseline to approximately 1-year follow-up in a robust sample of participants who underwent at least three approximately annual evaluations and were never diagnosed with any form of cognitive impairment at any visit. Using regression methods that account for maturation effects (i.e., aging), practice effects, regression to the mean, baseline test scores, and demographic variables, we present data for 11 UDS neuropsychological test variables that can be used to calculate a predicted follow-up test score and 90% reliable change intervals for the difference between observed and predicted follow-up scores. Follow-up test score changes that fall outside of these intervals can be interpreted as reflecting “true” change with a magnitude that is larger than would be expected based on the measurement error of the test. To augment these reliable change intervals, we also present data on the frequency with which score changes were observed in this robust sample. Because statistically significant changes in test scores may occur frequently even in healthy samples, interpreting RCIs alongside base rate data can assist with the interpretation of score changes in the context of how commonly or rarely a change of that magnitude is expected to occur in a normative sample.

By way of an example, consider a 73-year-old, college-educated Caucasian man evaluated using these UDS measures, with scores and percentiles (calculated using Shirk et al., 2011) presented in the first two columns of Table 6. If we were to determine “impairment” by using a global cutoff of Z=−1.5 (7th percentile), we would find no scores below that cutoff, and therefore no impaired cognitive domains at this initial visit. Thirteen months later, he is seen for his first follow-up, reports no functional problems, and his neuropsychological test scores are provided in the second two columns of that table. Using the same standard for “impairment,” we would say that he is now impaired on Digit Symbol Coding and TMT-B and exhibits difficulty with complex processing speed. However, using the RCIs developed above and as obtained from the Web-based calculator, we can see that he exhibited decline in excess of the 90% reliable change interval (change Z-score < −1.645) on the MMSE, animal fluency, vegetable fluency, and the BNT. Even though he is not “impaired” in the language domain using the fixed Z-score criterion, he displayed decline on three language-domain tasks relative to the baseline exam, suggesting that this might be a domain of clinical interest. In contrast, while Digit Symbol Coding and TMT-B both technically declined into the impaired range, neither test showed reliable change across visits; despite the newly developed “impairment” on these tests, this cannot be interpreted as a decline relative to the visit 1 baseline.
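In code, the decision applied to each test in this example reduces to a change Z-score check. The numbers below are hypothetical, standing in for the calculator’s predicted follow-up score and standard error:

```r
# Hedged sketch of the reliable-decline check for one test score.
observed  <- 14.0   # hypothetical observed follow-up raw score
predicted <- 17.2   # hypothetical model-predicted follow-up score
se_rci    <- 1.9    # hypothetical standard error for the RCI
z_change  <- (observed - predicted) / se_rci
reliable_decline <- z_change < qnorm(0.05)  # TRUE if below -1.645
```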

Table 6 Neuropsychological and RCI data for sample case

Note. RCI=Reliable Change Interval; BR=Base Rate; MMSE=Mini-Mental State Examination; DS-F=Digit Span Forward; DS-B=Digit Span Backward; DSC=Digit Symbol Coding; TMT=Trail Making Test; LM-I=Logical Memory Immediate Recall; LM-D=Logical Memory Delayed Recall; BNT=Boston Naming Test.

All demographic variables were found to contribute to the prediction of follow-up scores, with some (e.g., age, education) more robust than others (e.g., sex). It should be noted that these results were obtained from a sample of older adults who were diagnosed as cognitively healthy at their baseline visit. Therefore, the results presented in this study, especially the data used to predict follow-up test scores (Table 3), cannot be generalized beyond this population. It would be a misuse of the data to attempt to predict follow-up test scores for individuals with cognitive impairment at baseline. Similarly, the results will not generalize to individuals whose baseline test scores fall outside the test score intervals presented in Table 2, or to people whose demographic variables or test–retest intervals were not observed in the current study.

This study is limited in several ways. First, the data in the current sample were obtained from the NACC, which compiles data from 34 past and present ADCs across the United States. Each ADC may differ somewhat in its recruitment methods, especially for cognitively healthy individuals. The sample used in this study was not recruited for the purposes of producing normative data (e.g., random sampling was not used), and valid concerns may be raised about the external validity of these findings. The sample was also highly educated (M=15.80 years of education; SD=2.79) and under-representative of racial and ethnic minorities. On the other hand, the sample is very large and geographically diverse, and continued follow-up beyond the two visits used in this study gives confidence that the participants were not in the early stages of a neurodegenerative disease at the time the data were collected. Several of the neuropsychological test variables in the UDS have non-normal distributions. As discussed above, truncated distributions may be associated with heteroscedasticity (Figure 2), which could contribute to an underestimate of the residual variance for tests with floor or ceiling effects (i.e., MMSE, TMT, BNT).

Another limitation is that most of the test variables included in the current study possessed test–retest reliabilities below .70 (Table 2). These findings are roughly consistent with 1-year test–retest reliability estimates derived from meta-analysis (Calamia, Markon, & Tranel, 2013). The change in mean scores from baseline to follow-up is likely to reflect the magnitude of history and maturational influences acting across the two time points. The strength of the correlation between test scores at two successive time points may be indicative of individual differences in variability of change (Salthouse & Tucker-Drob, 2008). The low test–retest correlations could be attributed to random error, real change in the construct measured by the test between the two time points, or measurement error. Although maturational influences may affect within-person change in test scores, we also show that practice effects contribute to change in performance on most tests (Salthouse & Tucker-Drob, 2008). As might be expected, tests involving attention, processing speed, mental efficiency, and working memory were more susceptible to maturational influences (i.e., longer test–retest intervals), whereas tests involving episodic and semantic memory were more susceptible to practice effects.

Although only a minority of the tests in the UDS battery remain current and in common clinical use (i.e., TMT, animal fluency, BNT), these results may still be valuable to both clinicians and researchers who perform cognitive evaluations of older adults. While newer editions of these tests have been published in recent years (e.g., the WAIS and WMS have each been updated twice), it is unclear whether these updates have led to substantial improvements in the longitudinal measurement properties of these tests for the assessment of elderly individuals. The results of the current study can be valuable because there is a paucity of published longitudinal data from robust samples, especially for modern versions of these tests. The lack of available robust longitudinal data for some modern tests (e.g., WAIS-IV) could affect validity when interpreting changes in test scores without access to appropriate data. Although the tests used in this study may be older versions, they should not necessarily be considered obsolete, because they are being used in large, modern, federally funded research projects on cognitive aging and neurodegenerative disease (e.g., the NACC UDS). In fact, one could argue that the availability of robust longitudinal data makes these tests more appropriate than updated versions for serial assessment of older adults, especially if one takes the perspective that research evidence, and not test publishers, should dictate the selection of tests and test norms used by neuropsychologists (Adams, 2000; Bush, 2010; Silverstein & Nelson, 2000; Strauss, Spreen, & Hunter, 2000).

Many of the UDS neuropsychological tests have marginal test–retest reliability for measuring change in cognition across approximately annual evaluations (Strauss, Sherman, & Spreen, 2006). Although the lengthy interval between baseline and follow-up testing (M=14.62 months; SD=5.20) would be expected to cause a decrease in test–retest reliability relative to shorter intervals, these reliability data are thought to possess better external validity than reliability coefficients obtained at shorter intervals because approximately 1 year is believed to be a typical (or even shorter than typical) retest interval for older adults who are cognitively healthy at baseline. Because of these undesirable test–retest reliability values, the margin of error required to detect reliable change can be quite large for some tests (Table 4). Although this margin of error may not be sufficiently precise to detect subtle changes, these results may nevertheless be valuable for detecting more obvious cognitive decline across an approximately 1-year period. The results presented here suggest that there may be great value in focusing on test–retest reliability in the development of new cognitive tests, but interpretation of score changes must also account for demographic variables, past exposure to tests, and test–retest intervals.

Acknowledgments

The authors thank Stephen Hawes, Ph.D., and the rest of the NACC Publication Review Committee for their helpful comments on a previous draft of this manuscript. The NACC database is funded by NIA Grant U01 AG016976. The authors have no conflicts of interest to disclose.

References

Adams, K.M. (2000). Practical and ethical issues pertaining to test revisions. Psychological Assessment, 12, 281–286.
Attix, D.K., Story, T.J., Chelune, G.J., Ball, J.D., Stutts, M.L., Hart, R.P., & Barth, J.T. (2009). The prediction of change: Normative neuropsychological trajectories. The Clinical Neuropsychologist, 23, 21–38.
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). lme4: Linear mixed-effects models using Eigen and S4. R package version 1.1-8 [Software]. Retrieved from http://CRAN.R-project.org/package=lme4
Beekly, D.L., Ramos, E.M., Lee, W.W., Deitrich, W.D., Jacka, M.E., Wu, J., & Kukull, W.A. (2007). The National Alzheimer’s Coordinating Center (NACC) database: The Uniform Data Set. Alzheimer Disease and Associated Disorders, 21, 249–258.
Bläsi, S., Zehnder, A.E., Berres, M., Taylor, K.I., Spiegel, R., & Monsch, A.U. (2009). Norms for change in episodic memory as a prerequisite for the diagnosis of mild cognitive impairment (MCI). Neuropsychology, 23, 189–200.
Bush, S.S. (2010). Determining whether or when to adopt new versions of psychological and neuropsychological tests: Ethical and professional considerations. The Clinical Neuropsychologist, 24, 7–16.
Calamia, M., Markon, K., & Tranel, D. (2013). The robust reliability of neuropsychological measures: Meta-analyses of test-retest correlations. The Clinical Neuropsychologist, 27, 1077–1105.
De Santi, S., Pirraglia, E., Barr, W., Babb, J., Williams, S., Rogers, K., & de Leon, M.J. (2008). Robust and conventional neuropsychological norms: Diagnosis and prediction of age-related cognitive decline. Neuropsychology, 22, 469–484.
Duff, K. (2012). Evidence-based indicators of neuropsychological change in the individual patient: Relevant concepts and methods. Archives of Clinical Neuropsychology, 27, 248–261.
Duff, K., Callister, C., Dennett, K., & Tometich, D. (2012). Practice effects: A unique cognitive variable. The Clinical Neuropsychologist, 26, 1117–1127.
Folstein, M.F., Folstein, S.E., & McHugh, P.R. (1975). “Mini-mental state.” A practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research, 12, 189–198.
Frerichs, R.J., & Tuokko, H.A. (2005). A comparison of methods for measuring cognitive change in older adults. Archives of Clinical Neuropsychology, 20, 321–333.
Heaton, R.K., Miller, S.W., Taylor, S.J., & Grant, I. (2004). Revised comprehensive norms for an expanded Halstead-Reitan Battery: Demographically adjusted neuropsychological norms for African American and Caucasian adults. Lutz, FL: Psychological Assessment Resources, Inc.
Heilbronner, R.L., Sweet, J.J., Attix, D.K., Krull, K.R., Henry, G.K., & Hart, R.P. (2010). Official position of the American Academy of Clinical Neuropsychology on serial neuropsychological assessments: The utility and challenges of repeat test administrations in clinical and forensic contexts. The Clinical Neuropsychologist, 24, 1267–1278.
Hinton-Bayre, A.D. (2010). Deriving reliable change statistics from test-retest normative data: Comparison of models and mathematical expressions. Archives of Clinical Neuropsychology, 25, 244–256.
Holtzer, R., Goldin, Y., Zimmerman, M., Katz, M., Buschke, H., & Lipton, R.B. (2008). Robust norms for selected neuropsychological tests in older adults. Archives of Clinical Neuropsychology, 23, 531–541.
Jefferson, A.L., Wong, S., Gracer, T.S., Ozonoff, A., Green, R.C., & Stern, R.A. (2007). Geriatric performance on an abbreviated version of the Boston Naming Test. Applied Neuropsychology, 14, 215–223.
Matarazzo, J.D., & Herman, D.O. (1984). Base rate data for the WAIS-R: Test-retest stability and VIQ-PIQ differences. Journal of Clinical Neuropsychology, 6, 351–366.
McCaffrey, R.J., Duff, K., & Westervelt, H.J. (2000). Practitioner’s guide to evaluating change with neuropsychological assessment instruments. New York, NY: Springer.
McKhann, G.M., Knopman, D.S., Chertkow, H., Hyman, B.T., Jack, C.R., Kawas, C.H., & Phelps, C.H. (2011). The diagnosis of dementia due to Alzheimer’s disease: Recommendations from the National Institute on Aging-Alzheimer’s Association workgroups on diagnostic guidelines for Alzheimer’s disease. Alzheimer’s & Dementia, 7, 263–269.
Mitrushina, M., Boone, K.B., Razani, J., & D’Elia, L.F. (2005). Handbook of normative data for neuropsychological assessment (2nd ed.). New York, NY: Oxford University Press.
Morris, J.C., Weintraub, S., Chui, H.C., Cummings, J., Decarli, C., Ferris, S., & Kukull, W.A. (2006). The Uniform Data Set (UDS): Clinical and cognitive variables and descriptive data from Alzheimer Disease Centers. Alzheimer Disease and Associated Disorders, 20, 210–216.
Pedraza, O., Lucas, J., Smith, G.E., Petersen, R.C., Graff-Radford, N.R., & Ivnik, R.J. (2010). Robust and expanded norms for the Dementia Rating Scale. Archives of Clinical Neuropsychology, 25, 347–358.
Pinheiro, J.C., & Bates, D.M. (2000). Mixed-effects models in S and S-PLUS. New York, NY: Springer.
R Core Team (2015). R: A language and environment for statistical computing (Version 3.1.2) [Software]. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/
Reitan, R., & Wolfson, D. (1993). The Halstead-Reitan neuropsychological test battery: Theory and clinical applications. Tucson, AZ: Neuropsychology Press.
Salthouse, T.A., & Tucker-Drob, E.M. (2008). Implications of short-term retest effects for the interpretation of longitudinal change. Neuropsychology, 22, 800–811.
Shirk, S.D., Mitchell, M.B., Shaughnessy, L.W., Sherman, J.C., Locascio, J.J., Weintraub, S., & Atri, A. (2011). A web-based normative calculator for the uniform data set (UDS) neuropsychological test battery. Alzheimer’s Research & Therapy, 3, 32.
Silverstein, M.L., & Nelson, L.D. (2000). Clinical and research implications of revising psychological tests. Psychological Assessment, 12, 298–303.
Sosa-Ortiz, A.L., Acosta-Castillo, I., & Prince, M.J. (2012). Epidemiology of dementias and Alzheimer’s disease. Archives of Medical Research, 43, 600–608.
Strauss, E., Sherman, E.M.S., & Spreen, O. (2006). A compendium of neuropsychological tests: Administration, norms, and commentary (3rd ed.). New York, NY: Oxford University Press.
Strauss, E., Spreen, O., & Hunter, M. (2000). Implications of test revisions for research. Psychological Assessment, 12, 237–244.
Wechsler, D. (1981). Wechsler Adult Intelligence Scale-Revised. New York, NY: Psychological Corporation.
Wechsler, D. (1987). WMS-R: Wechsler Memory Scale-Revised. New York, NY: Psychological Corporation.
Weintraub, S., Salmon, D., Mercaldo, N., Ferris, S., Graff-Radford, N.R., Chui, H., & Morris, J.C. (2009). The Alzheimer’s Disease Centers’ Uniform Data Set (UDS): The neuropsychologic test battery. Alzheimer Disease and Associated Disorders, 23, 91–101.
Whittingham, M.J., Stephens, P.A., Bradbury, R.B., & Freckleton, R.P. (2006). Why do we still use stepwise modelling in ecology and behaviour? Journal of Animal Ecology, 75, 1182–1189.