Introduction
Estimation of the pace of cognitive decline throughout the life course is central to research on cognitive aging and dementia (Salthouse, 2010a). Cognitive decline is a more compelling marker of Alzheimer’s disease (AD) dementia than impairment at a single testing session because it is less affected by historical factors, such as years of education, that precede the onset of AD (Glymour et al., 2005). However, the design and analysis of longitudinal studies, in which cognitive testing is repeatedly conducted on the same person over time, can be complicated because, in addition to normal aging or maturation, factors such as selective attrition, period and cohort effects, statistical artifacts (e.g., regression to the mean), and retest or practice effects all contribute to changes in cognitive test performance (Dodge et al., 2011; Salthouse, 2010a, 2010b).
Retest or practice effects refer to the extent to which repeated cognitive testing results in improved performance due to familiarity with the testing materials and setting (Horton, 1992; Zehnder, Blasi, Berres, Spiegel, & Monsch, 2007). These effects are well documented in longitudinal studies of cognitive aging (Abner et al., 2012; Basso, Bornstein, & Lang, 1999; Calamia, Markon, & Tranel, 2012; Collie, Maruff, Darby, & McStephen, 2003; Cooper, Lacritz, Weiner, Rosenberg, & Cullum, 2004; Duff et al., 2011; Ferrer, Salthouse, Stewart, & Schwartz, 2004; Ferrer, Salthouse, McArdle, Stewart, & Schwartz, 2005; Frank, Wiederholt, Kritz-Silverstein, Salmon, & Barrett-Connor, 1996; Horton, 1992; Howieson et al., 2008; Ivnik et al., 1999; Jacqmin-Gadda, Fabrigoule, Commenges, & Dartigues, 1997; Machulda et al., 2013; Mitrushina & Satz, 1991; Rabbitt, Diggle, Smith, Holland, & McInnes, 2001; Rabbitt, Diggle, Holland, & McInnes, 2004; Salthouse, 2009; Wilson, Leurgans, Boyle, & Bennett, 2011; Wilson, Li, Bienias, & Bennett, 2006; Zehnder et al., 2007). Although strongest with shorter retest intervals, retest effects have been documented for up to 5 years (Burke, 1997) and 12 years (Salthouse, Schroeder, & Ferrer, 2004; Hausknecht, Halpert, Di Paolo, & Gerrard, 2007). A consensus conference for clinical neuropsychology has called for research on the ramifications of repeated cognitive testing (Heilbronner et al., 2010). Van der Elst, Van Boxtel, Van Breukelen, and Jolles (2008) found a robust increase of between 0.2 and 0.6 standard deviations (SD) in verbal list-learning performance 3 years after the first testing occasion in a large sample of cognitively normal older adults, while Bartels, Wegrzyn, Wiedl, Ackermann, and Ehrenreich (2010) found medium to large retest effects of between 0.36 and 1.19 SD after approximately 3 months.
Although both of these studies conceptualize retest effects as a one-time boost between the first and subsequent occasions, retest effects may also accrue at each visit, with diminishing returns (Collie et al., 2003; Sliwinski, Hoffman, & Hofer, 2010).
In epidemiologic research, failure to account for retest effects obscures the estimated rate of cognitive decline. If retest effects are correlated with risk factors of interest, ignoring them may bias estimates of those risk factors’ effects on the rate of cognitive change. Retest effects may also differ by the type of cognitive task: tests that measure different cognitive abilities (e.g., memory, language) (Cooper et al., 2004) or that use different administration or response modalities (e.g., oral vs. written) might show different patterns of retest effects. In this study, we examined retest effects at the level of cognitive constructs rather than individual tests, so that modality-specific differences would not be mistaken for construct-level retest effects.
In addition to the type of test, retest effects may be attributable to participant characteristics related to proficiency in test-taking, such as test-taking strategies and lower test anxiety, in which case persons with less testing experience might show larger retest effects (Thorndike, 1922). Retest effects may also be attributable to episodic memory: successful learning and retention of test content, such that subsequent performance is improved by recollection of that content. This is a motivation behind the use of alternate forms for tests of episodic memory (e.g., Benedict & Zgaljardic, 1998; Delis, Kramer, Kaplan, & Ober, 2000). Thus, testing for differential retest effects by factors related to test experience and episodic memory provides a way to better understand retest effects.
Clinically, group-level differences in retest effects have implications for test–retest reliability and interpretation of norms. The rank ordering of patients at one assessment compared to another may be stable despite a large retest effect, indicating good test–retest reliability but complicating interpretation based on a single set of norms for both assessments (Calamia et al., 2012). This would interfere with tracking of disease progression and detection of decline. Alternatively, if test–retest reliability is moderate or low in an overall sample but higher within subgroups, that pattern could reflect systematic group differences in the magnitude of individual differences in retest gains (Salthouse & Tucker-Drob, 2008).
Sociodemographic Factors Related to Test Experience
Because educational attainment is a strong predictor of cognitive performance in later life, retest effects may differ by number of years of education (Cagney & Lauderdale, 2002; Stern et al., 1994; Stuss, Stethem, & Poirier, 1987). Individuals with less education, or lower-quality education, have less prior experience with test-taking and with strategies for maximizing test performance, and therefore have the most to gain from practice with the test. Similarly, given differences in early educational experiences for older adults by race and ethnicity due to persistent educational inequalities (Glymour & Manly, 2008), we hypothesized that Hispanic older adults, most of whom in the present sample are immigrants to the United States, may be less familiar on average with testing and therefore experience greater retest effects (Gould, 1996).
Age, sex, and language spoken at home may also moderate retest effects. Previous research suggests that, with the exception of measures of word list recall, retest effects are inversely related to age (Mitrushina & Satz, 1991; Rabbitt, Lunn, Wong, & Cobain, 2008). Sex differences in cognitive performance have been documented for a range of cognitive abilities, suggesting differential retest effects may also occur: women tend to do better on memory tests (leaving men with more room to improve upon retest), while men tend to do better on visuospatial tasks (Mann, Sasanuma, Sakuma, & Masaki, 1990; Salthouse, 2010a; Voyer, Voyer, & Bryden, 1995). Primary language may also be relevant to retest effects; one study found that Spanish speakers demonstrated greater retest effects than English speakers (Mungas, Reed, Marshall, & González, 2000).
Dementia Risk Factors
The ability to learn and retain new information may also facilitate retest effects. Previous studies suggest that the absence of retest effects may reflect amnestic mild cognitive impairment (MCI) or AD (Darby, Maruff, Collie, & McStephen, 2002; Duff et al., 2011; Frank et al., 1996; Schrijnemaekers, de Jager, Hogervorst, & Budge, 2006). However, at least one recent study reported retest effects for memory in participants with MCI and dementia (Machulda et al., 2013). The apolipoprotein E (APOE) ε4 allele predicts earlier onset of AD among older Whites (Baxter, Caselli, Johnson, Reiman, & Osborne, 2003; Blair et al., 2005; Haan, Shemanski, Jagust, Manolio, & Kuller, 1999), but the association appears attenuated among Blacks (Borenstein, Copenhaver, & Mortimer, 2006; Tang et al., 1998). A previous study found that APOE ε4 carriers did not exhibit a retest effect (Machulda et al., 2013). Furthermore, cardiovascular burden is an established risk factor for poorer cognition and neurodegenerative disease, especially among minority older adults (Flicker, 2010; Luchsinger et al., 2005). Thus, greater cardiovascular risk burden may affect the magnitude of retest effects.
The Present Study
We examined whether retest effects vary by demographic factors, including race/ethnicity, age, language spoken at home, literacy, sex, and years of education, and by dementia risk factors, including APOE ε4 status, baseline cognitive status, and cardiovascular burden. We estimated multilevel random effects models of change in general cognitive performance, memory, executive function, and language, allowing the mean retest effect to differ by the characteristic of interest. We hypothesized that Hispanic ethnicity and fewer years of education would predict larger retest effects, while dementia risk factors, including possession of the APOE ε4 allele, lower cognitive performance at baseline, and greater cardiovascular risk burden, would predict smaller retest effects.
Methods
Participants and Procedures
We used data on N=4073 participants from the Washington Heights-Inwood Columbia Aging Project (WHICAP), an ongoing epidemiologic cohort of community-living, Medicare-eligible older adults recruited from northern Manhattan (Tang et al., 2001). Participants were residents of three contiguous US census tracts in northern Manhattan, New York. Individuals were invited to participate in an in-person survey in 1992, with follow-up visits every 2 to 3 years, and recruitment re-opened in 1999 to replenish the cohort. At each interview, participants answered extensive questionnaires about their early-life education and health and completed cognitive testing; the analytic sample comprised the 4073 participants who completed neuropsychological assessments. Details of the sampling strategies and recruitment outcomes have been published previously (Luchsinger, Tang, Shea, & Mayeux, 2001; Manly, Schupf, Tang, & Stern, 2005). The study was approved by the Institutional Review Boards at Columbia Presbyterian Medical Center, Columbia University Health Sciences, and the New York State Psychiatric Institute.
Measures
Racial/ethnic group
Participants self-reported their race by selecting from the categories American Indian/Alaska Native, Asian, Native Hawaiian or other Pacific Islander, Black or African American, and White, and were then asked whether they were Hispanic. We grouped participants into categories of non-Hispanic White, non-Hispanic Black, and Hispanic.
Cardiovascular burden
We used a summary measure of cardiovascular burden based on the presence of diabetes, hypertension, heart disease, stroke, central obesity, and current smoking (Schneider et al., 2014).
Educational experience
We used self-reported years of education completed to represent previous exposure to learning. However, because of interstate, racial, and international differences in educational quality, there is considerable heterogeneity in the amount of learning obtained at a given grade level (Glymour & Manly, 2008; Manly et al., 1999, 2002, 2003, 2004); years of education is therefore an imperfect proxy for the learning acquired through schooling. Because of this, we also tested for differences in retest effects by level of literacy, a proxy for quality of education (Manly et al., 2004). We stratified this analysis by language of administration due to nonequivalence of the English WRAT (Wilkinson & Robertson, 2006) and the Spanish WAT (Del Ser, González-Montalvo, Martínez-Espinosa, Delgado-Villapalos, & Bermejo, 1997), and took a median split of performance on these tests.
Cognitive performance
WHICAP administered a neuropsychological test battery at each study visit (Tang et al., 2001). Tests were designed for administration in Spanish or English (Dugbartey, Townes, & Mahurin, 2000; Jacobs et al., 1995) and are described in the Appendix. We constructed factor scores for general cognitive performance, memory, executive function, and language using confirmatory factor analysis models for each domain. We used immediate recall, delayed recall, and delayed recognition from the Buschke Selective Reminding Test to construct the memory factor. The executive functioning factor was derived using the Color Trail-Making Test (A and B), WAIS Similarities, Identities/Oddities, shape time, time to detect a consonant trigram, phonemic fluency, and semantic fluency for animals. Language was derived using phonemic and semantic fluency, 15-item Boston Naming, repetition, and comprehension. All of the above variables contributed to the general cognitive factor. The assignment of tests to factors is largely consistent with a previously published factor analysis of the neuropsychological test battery in the WHICAP cohort (Siedlecki et al., 2010), except that we dropped the speed factor derived by Siedlecki et al. and added an executive functioning factor. The executive functioning factor has more indicators that represent broader fluid ability, and is more reliable than two separate factors for speed and reasoning that are each based on fewer measures. Previous factor analysis of the WHICAP battery revealed that semantic and phonemic fluency both load best with language (Siedlecki et al., 2010), which is also consistent with the derived executive functioning factor in the Alzheimer’s Disease Neuroimaging Initiative (Gibbons et al., 2012).
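As a point of reference, each domain-specific factor model can be written in standard confirmatory factor analysis form (a generic sketch; the specific indicators and estimated loadings are those of the battery described above):

$x_{kij} = \nu_k + \lambda_k \eta_{ij} + u_{kij}$,

where $x_{kij}$ is participant $i$’s score on indicator $k$ at visit $j$, $\eta_{ij}$ is the latent domain factor (e.g., memory), $\nu_k$ and $\lambda_k$ are the indicator’s intercept and loading, and $u_{kij}$ is indicator-specific error. Longitudinal measurement invariance, discussed below, corresponds to holding $\nu_k$ and $\lambda_k$ constant across visits.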
Each factor was scaled to have a mean of 50 and standard deviation (SD) of 10 in the US population of adults aged 70 years and older, to facilitate comparison of magnitudes of effects across domains and with future studies. Details are provided elsewhere (Gross, Jones, Fong, Tommet, & Inouye, 2014; Gross, Sherva, et al., 2014). Briefly, we calibrated the factors using a nationally representative sample of adults aged 70 and older from the Aging, Demographics, and Memory Study (ADAMS), a sub-study of the Health and Retirement Study (Juster & Suzman, 1995; Langa et al., 2005). The ADAMS battery included Trails A and B, Digits Forward and Backward, semantic and phonemic fluency, the Boston Naming Test, Symbol Digit Modalities, and a 10-noun word recall task. Items common to ADAMS and WHICAP served as links to calibrate the cognitive factors: the factor analysis was performed in a longitudinal dataset with multiple records per participant, and item discrimination and difficulty parameters for common items were fixed to the values estimated in an ADAMS-only factor analysis. This scaling approach does not make the WHICAP sample nationally representative, but it allows future analysts, using other datasets with items overlapping with ADAMS, to derive directly comparable scores. The approach assumes measurement invariance of the factors with respect to time, an assumption previously verified in other samples of older adults (Hayden et al., 2011; Johnson et al., 2012). We additionally tested longitudinal measurement invariance of the factors among participants assessed at baseline and whose second study visit was between 1.5 and 2.5 years later (median: 2.1 years) using multiple group confirmatory factor analysis models. Details are provided in the Appendix.
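The final rescaling step is a simple linear transformation. The sketch below illustrates it under the assumption that factor scores have already been estimated with item parameters anchored to ADAMS; the function and variable names are ours for illustration, not from the study code.

```python
import numpy as np

def rescale_to_reference(theta, ref_mean, ref_sd, target_mean=50.0, target_sd=10.0):
    """Linearly rescale factor scores so the reference population
    (here, ADAMS adults aged 70+) has the target mean and SD."""
    return target_mean + target_sd * (np.asarray(theta) - ref_mean) / ref_sd

# Illustrative usage: theta is on the ADAMS-calibrated metric, where the
# reference sample has mean 0 and SD 1 by construction.
print(rescale_to_reference([-0.3, 0.0, 0.8], ref_mean=0.0, ref_sd=1.0))
# -> [47. 50. 58.]
```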
Analyses
To test hypotheses, we used multilevel models with random effects for people and time alongside fixed effects for retest in general cognition, memory, executive functioning, and language (Johnson et al., 2012; Laird & Ware, 1982; Muthén & Curran, 1997; Raudenbush & Bryk, 2002). Time since enrollment into the study was the timescale of interest. The system of equations below describes the basic model:

Level 1: $Y_{ij} = \beta_{0i} + \beta_{1i}\,\text{time}_{ij} + \beta_{2}\,\text{retest}_{ij} + \sum_{p} \beta_{p} X_{pij} + \varepsilon_{ij}$ (1)

Level 2: $\beta_{0i} = \gamma_{00} + U_{0i}$ (2)

$\beta_{1i} = \gamma_{10} + U_{1i}$ (3)

$Y_{ij}$ is a cognitive outcome (general cognitive performance, memory, executive functioning, or language) for participant $i$ at time $j$. The level 1 model describes within-person change over time based on random ($U_{0i}$) and fixed ($\gamma_{00}$) effects for participants, random ($U_{1i}$) and fixed ($\gamma_{10}$) effects for time, a fixed effect ($\beta_{2}$) for the retest effect, adjustment variables with coefficients $\beta_{p}$, and a residual error $\varepsilon_{ij}$ for each participant at each time. The level 2 equations describe the random and fixed effects for participants and time. The distributions of $\varepsilon_{ij}$, $U_{0i}$, and $U_{1i}$ are assumed to be normal with mean 0 and estimated variances.
We coded retest in two ways to acknowledge different conceptualizations of how retest effects arise. First, as our primary analysis, the retest variable was coded 0 at each participant’s first study visit at which they were administered the neuropsychological battery, and 1 otherwise. Retest effects here are interpretable as the difference, or jump, in performance between the first assessment and the performance predicted from the level and slope of change at the second and later assessments. This characterization is consistent with previous studies examining retest effects (Ivnik et al., 1999; Rabbitt et al., 2004; Salthouse et al., 2004; Salthouse & Tucker-Drob, 2008), and previous studies have suggested that subsequent gains after the second testing occasion are negligible (but see the Discussion section) (Kausler, 1994; Rabbitt, 1993; Rabbitt et al., 2004). Second, to allow for the possibility that participants learn more at each test occasion, with diminishing returns over time (Abner et al., 2012; Collie et al., 2003; Hoffman et al., 2011), we also coded retest as the square root of the number of prior test occasions. We adjusted all models for sex, baseline age, and recruitment cohort (1992 or 1999).
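To make the two parameterizations concrete, the sketch below fits the one-time-boost model with a standard mixed-effects routine. This is a minimal sketch in Python/statsmodels rather than the Mplus code used in the study, and the data file and column names (id, visit, time, cog, age0, female, cohort99) are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per person-visit.
df = pd.read_csv("whicap_long.csv")  # columns: id, visit, time, cog, age0, female, cohort99

# Coding 1 (primary): one-time boost -- 0 at the first visit, 1 thereafter.
df["retest_jump"] = (df["visit"] > 0).astype(int)

# Coding 2: accumulating gains with diminishing returns --
# the square root of the number of prior test occasions.
df["retest_sqrt"] = np.sqrt(df["visit"])

# Random intercept and slope for time; fixed effects for retest and covariates.
fit = smf.mixedlm(
    "cog ~ time + retest_jump + age0 + female + cohort99",
    data=df, groups="id", re_formula="~time",
).fit(reml=True)
print(fit.summary())  # the retest_jump coefficient estimates beta_2 in Eq. (1)
```

Swapping retest_sqrt for retest_jump in the formula yields the diminishing-returns parameterization.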
To determine whether retest effects vary by individual characteristics (effect modification), we extended the model described above to a series of multiple group models in a structural equation modeling framework, in which groups were defined by the characteristic of interest: race/ethnicity (non-Hispanic Black, non-Hispanic White, Hispanic), age (<75 years, 75–80 years, 80+ years), sex, years of education (less than 8 years, 8 or more years), literacy (median split, with models conducted separately by language of administration), APOE ε4 status (carrier, noncarrier), quartile of baseline general cognitive performance, and number of cardiovascular risk factors (0, 1, 2, or ≥3). We conducted analyses by baseline quartiles of cognitive performance rather than adjudicated dementia diagnosis because, in WHICAP, neuropsychological test performance was considered during the adjudication procedure; nonetheless, we also conducted a sensitivity analysis to identify whether excluding participants with an adjudicated diagnosis of dementia affected results. Differences in mean retest effects between these groupings are estimated in a manner analogous to using an interaction between the characteristic and the retest indicator, as follows:

$Y_{ij} = \beta_{0i} + \beta_{1i}\,\text{time}_{ij} + \beta_{2}\,\text{retest}_{ij} + \beta_{3}\,\text{group}_{i} + \beta_{4}\,(\text{group}_{i} \times \text{retest}_{ij}) + \sum_{p} \beta_{p} X_{pij} + \varepsilon_{ij}$ (4)

The interaction of the moderator and the retest effect, $\beta_{4}$, is the parameter of interest. In planned sensitivity analyses, we examined retest effects for all component tests in the WHICAP neuropsychological battery. Analyses were conducted with Mplus statistical software (version 7.11; Muthén & Muthén, 1998–2012) using robust maximum likelihood estimation, which assumed outcome observations were missing at random conditional on covariates (Little & Rubin, 1987). Fit of modeled trajectories to the data was assessed with a pseudo-R2 statistic, which represents the proportion of variability in observed data explained by the model (Singer & Willett, 2003) and is calculated by squaring the correlation between observed and model-estimated (including random effects terms) outcome scores. We adjusted models for potential selective survival using inverse probability weights (Hernán & Robins, 2006) calculated from a logistic regression of death on age, sex, baseline general cognitive performance, APOE ε4 status, education in years, recruitment cohort, and cardiovascular risk measured at baseline.
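For readers who want to replicate this logic outside Mplus, the sketch below shows the interaction parameterization, the pseudo-R2 computation, and the construction of inverse probability weights. It is a minimal sketch under the same hypothetical data layout as above; the binary moderator (carrier, for APOE ε4) and baseline columns (died, educ, cvrisk) are named for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("whicap_long.csv")  # hypothetical layout; see previous sketch
df["retest"] = (df["visit"] > 0).astype(int)

# Effect modification: beta_4 is the coefficient on carrier:retest in Eq. (4).
fit = smf.mixedlm(
    "cog ~ time + retest + carrier + carrier:retest + age0 + female + cohort99",
    data=df, groups="id", re_formula="~time",
).fit()
print(fit.params["carrier:retest"])  # group difference in the retest effect

# Pseudo-R2: squared correlation between observed and model-estimated scores.
# In statsmodels, MixedLM fitted values include the predicted random effects.
pseudo_r2 = np.corrcoef(fit.model.endog, fit.fittedvalues)[0, 1] ** 2

# Inverse probability weights for death, from baseline (first-visit) records.
base = df.sort_values("visit").groupby("id").first().reset_index()
ps = smf.logit(
    "died ~ age0 + female + cog + carrier + educ + cohort99 + cvrisk", data=base
).fit()
p_survive = 1.0 - ps.predict(base)
base["ipw"] = 1.0 / p_survive  # up-weights survivors who resembled decedents
```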
Results
The study sample was mostly female (68.5%), just over half had 8 or more years of education (53.6%), and the average age at the first visit was 77 years (range, 63–103 years) (Table 1). The sample was ethnically diverse: 33.7% non-Hispanic Black, 24.9% non-Hispanic White, and 41.4% Hispanic. The percentage of participants with at least one APOE ε4 allele was 22.6%. Two-year test–retest reliabilities for the factors representing general cognitive performance, memory, executive functioning, and language were r=0.88, r=0.77, r=0.80, and r=0.83, respectively.
Overall Retest Effect
The median number of study visits was three (interquartile interval: 2, 4) and the median follow-up time was 3.9 years (interquartile interval: 1, 7.8). The second study visit took place on average 1.9 years (interquartile interval: 1.3, 2.2 years) after the first. Because each cognitive outcome was scaled with an SD of 10, a 1-point difference corresponds to a 0.1 SD difference. As expected, the overall retest effect was considerable for all domains. For general cognitive performance, the retest effect was 0.60 SD, while the annual rate of general cognitive decline was only −0.047 SD (Table 2); the retest effect was thus equivalent in magnitude to 12.8 years of cognitive decline (Table 2). Retest effects were also large for memory (retest=0.57 SD; 95% confidence interval [CI] [0.42, 0.72] SD), executive functioning (retest=0.45 SD; 95% CI [0.32, 0.58] SD), and language (retest=0.64 SD; 95% CI [0.47, 0.81] SD) (Table 2).
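The equivalence in years follows directly from the two estimates above (a worked check using the rounded values reported in Table 2):

$\dfrac{\text{retest effect}}{|\text{annual decline}|} = \dfrac{0.60\ \text{SD}}{0.047\ \text{SD/year}} \approx 12.8\ \text{years}.$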
Note. Parallel process latent growth models of change in general cognitive performance, memory, executive functioning, and language, using time in study as the timescale. Each cognitive score was scaled to have a mean of 50 and standard deviation of 10 at the baseline study visit. The annual rate of decline is the mean of the random slope in the model. The ratio of retest effect to slope reflects the relative magnitude of the retest effect compared to subsequent annual cognitive decline. The retest parameters correspond to the β2 parameter in Eq. (1). The model-estimated proportion of total variance attributable to between-persons differences was 86%, 74%, 78%, and 81% for general cognitive performance, memory, executive functioning, and language, respectively.
Effect Modification of Retest Effects by Participant Characteristics
The models fit the data well, with pseudo-R2 values above 0.79 for each cognitive outcome (Table 3). Visual inspection of model residuals confirmed adequate fit. The magnitude of the retest effect, parameterized as the jump from the first to subsequent test occasions, was statistically significant and positive for general cognitive performance in nearly every subgroup (Table 3). Inferences were similar for memory and language. For executive functioning, average retest effects tended to be smaller but were mostly statistically significant (Table 3). This pattern of results was identical when we parameterized retest effects as the square root of the number of prior test occasions (Appendix Table 2).
Note. Multilevel models of changes for general cognitive performance, memory, executive functioning, and language using time in study as the timescale. The retest parameters correspond to β2 parameters in equation 4, and group differences correspond to parameter β4. Retest effects are parameterized here as the jump in performance between the first and subsequent testing occasions.
*p<0.05.
The magnitude of retest effects did not differ significantly by race/ethnicity, age, language, sex, education, literacy, APOE status, or cardiovascular burden (Table 3). Participants in the lowest quartile of baseline general cognitive performance demonstrated greater retest effects compared to participants in the middle two quartiles of general cognitive performance, for whom retest effects were not significant (Table 3). Figure 1 shows the model-estimated cognitive trajectory for participants at these quartiles of cognitive function. Although we did not exclude participants who had an adjudicated diagnosis of dementia in WHICAP, we observed that 645 of 679 (94.9%) of participants with dementia were in the lowest quartile of baseline cognitive performance (sensitivity), and 3006 of 3369 (89.2%) of non-demented participants had a score above the lowest quartile (specificity).
Sensitivity Analyses
We examined the magnitude of retest effects for each component test in the WHICAP battery. Results of this sensitivity analysis were consistent with findings using the factor scores. Retest effects were generally greater in magnitude for memory tests than for executive functioning tests. We also reran analyses excluding participants with dementia; the only change in inferences was that the difference in retest by baseline cognitive quartile was no longer statistically significant (Appendix 3). Although overall retest among participants with a study diagnosis of dementia did not statistically significantly differ from others in the lowest quartile of baseline general cognitive performance for any cognitive outcome (p>.05), participants with dementia did on average have higher retest effects for general cognitive performance (retest no dementia: −0.03 points, 95% CI [−1.40, 1.34]; retest dementia: 2.96 points, 95% CI [1.65, 4.27]), memory (retest no dementia: −0.20 points, 95% CI [−4.57, 4.17]; retest dementia: 4.33 points, 95% CI [2.31, 6.35]), executive functioning (retest no dementia: −0.31 points, 95% CI [−2.11, 1.49]; retest dementia: 1.36 points, 95% CI [0.13, 2.59]), and language (retest no dementia: −0.36 points, 95% CI [−2.44, 1.72]; retest dementia: 2.08 points, 95% CI [0.47, 3.69]).
Discussion
In this large, diverse community-based sample of older adults, we examined differences in retest effects by racial/ethnic group, age, language spoken at home, sex, years of education, literacy, APOE ε4 status, baseline cognitive function, and cardiovascular burden. Despite the relatively long 2-year interval between testing occasions, the overall magnitude of retest was on average more than 10 times the annual rate of subsequent cognitive decline, and was greatest for language. The magnitude of retest is in line with previous findings (Bartels et al., 2010; Van der Elst et al., 2008). The magnitude of retest effects did not differ by any characteristic examined other than baseline cognitive status: on average, participants performing in the lowest quartile at baseline experienced the greatest boost from repeated testing, a finding probably attributable to regression to the mean. Overall, the results suggest retest effects do not differ greatly across observable demographic and dementia-related factors.
Previous research indicates that the magnitude of retest effects varies widely across different tests (Calamia et al., 2012; Frank et al., 1996), with effects typically, but not always, largest for visual memory and smallest for visuospatial ability (Calamia et al., 2012; but see also Dodrill & Troupin, 1975; Ferrer et al., 2004; Frank et al., 1996; McCaffrey, Onega, Orsillo, Nelles, & Haase, 1992). We built on prior research by considering cognitive domains instead of individual tests, in an attempt to draw conclusions at the level of constructs and to mitigate the potential for spurious findings from multiple tests. A further advantage of our study was the choice of scaling to an external standard, ADAMS. This scaling made no difference in the results compared to factor scores that were scaled internally, and scale choice is in many cases arbitrary. However, we believe that future scientific progress in the area of cognitive aging will be accelerated if findings are presented on a common scale across studies. Resources are available that describe how other studies can be linked to an external metric such as the nationally representative sample used here (e.g., Gross, Jones, et al., 2014; Gross, Sherva, et al., 2014; Jones et al., 2010).
Our data suggest retest effects were greater for participants in the lowest quartile of baseline cognitive performance, and sensitivity analyses revealed the largest practice effects in the subgroup of participants diagnosed with dementia at baseline. Although significant improvement in neuropsychological test performance at follow-up is less likely to be observed in dementia patients in clinical settings, this phenomenon is not unusual in research settings. Participants who meet research criteria for dementia or MCI at one visit do not always meet criteria at their next visit; this has been documented in WHICAP (Schofield et al., 1995; Manly et al., 2008) as well as in other population-based cohorts (Boyle, Wilson, Aggarwal, Tang, & Bennett, 2006). In this study, dementia diagnoses were based in part on a published algorithm using education-adjusted neuropsychological test scores (Stern et al., 1992). People whose neuropsychological test scores were consistent with dementia according to this algorithm, and whose daily function and level of independence had deteriorated from previous levels according to self or informant report, were eligible for a diagnosis of dementia. In WHICAP and other similar epidemiologic cohorts, consensus diagnoses of dementia have been found to be meaningfully associated with declining cognitive trajectories and with biomarkers for AD and cerebrovascular disease. The consensus group is blind to previous diagnosis in WHICAP, so although the criteria for dementia remain stable at each visit, there is no way to ensure continuity of diagnosis if people who had low scores at their initial visit rise slightly above the cut score on one or a few tests at follow-up.
This study indicates that retest effects do not differ by race/ethnicity or years of education, which were intended to be proxies for testing experience. However, years of education only captures testing experience from early life, and does not reflect experiences accumulated throughout life. Admittedly, race and ethnicity are imperfect markers of test experience, and thus our results cannot conclusively disprove the hypothesis that test experience plays a role in retest effects. Furthermore, most Hispanic participants in WHICAP were immigrants, whose years of education are systematically lower and do not easily translate to years of education in the United States (Hoffmeyer-Zlotnik et al., 2005).
The finding of differential retest effects by baseline cognitive status is likely attributable to regression to the mean (Barnett, van der Pols, & Dobson, 2005). Most participants performing in the lowest quartile of cognitive performance had a study diagnosis of dementia. Persons with dementia have impaired learning and memory, so one might expect them to exhibit smaller retest effects if retest is attributable largely to episodic memory; indeed, previous studies have reported no retest effects in persons with MCI and dementia (Darby et al., 2002; Schrijnemaekers et al., 2006) or only minimal ones (Duff et al., 2011). However, incipient dementia may not attenuate retest effects if procedural memory accounts for improvement on repeated test administration (Mitrushina, Boone, Razani, & D’Elia, 2005). Procedural memory, the long-term, unconscious recollection of previous experiences important for retaining skills (e.g., typing on a keyboard or riding a bicycle), is relatively well preserved in people with dementia (Meyer & Schvaneveldt, 1971; Perani et al., 1993; Sabe, Jason, Juejati, Leiguarda, & Starkstein, 1995; Schaie, 2005; Tulving & Markowitsch, 1998). This rationale may be limited to measures on which procedural memory has greater influence; tests of confrontation naming and verbal comprehension in the language factor are less susceptible to this reasoning. Indeed, we found retest differences by baseline cognitive performance only for the general cognitive performance and memory factors, not the executive functioning or language factors (Table 3).
Limitations of our study must be noted. First, we defined retest effects in two ways based on the discontinuity between first and second assessments, and on the square root of the number of prior test occasions. The former approach imposes the assumption that the retest benefit is constant across the second and subsequent assessments. The latter approach assumes accumulating retest effects at each successive assessment, with diminishing additional benefit at each successive assessment. Although modest violations of either of these assumptions are plausible, such violations are unlikely to substantively alter our findings. There are other plausible specifications of retest effects. For example, if each successive test occasion were to hypothetically confer a slightly larger retest benefit, our effect estimates would be a weighted average of these effects. This phenomenon would obscure subgroup differences in the magnitude of retest effects only if such differences occurred for some, but not all, waves of assessment. We think such a complex pattern of retest effects is unlikely.
A second limitation is that, regardless of how we parameterize them, retest effects are difficult to disentangle from aging in studies with roughly equally spaced assessment intervals, because the number of prior assessments is nearly collinear with time since baseline (Hoffman, Hofer, & Sliwinski, 2011). This challenge is common to most longitudinal studies of cognition. In the absence of random assignment of the timing of the first assessment, simplifying assumptions are necessary to identify retest effects in studies with test–retest intervals longer than approximately a week (Hoffman et al., 2011; Sliwinski et al., 2010). Studies with very short test–retest intervals are optimal for distinguishing retest effects from normal cognitive aging because one can infer that real change has not occurred between assessments (Salthouse & Tucker-Drob, 2008). We did not attempt to estimate retest differences as a function of the amount of time elapsed between successive tests because such variability is relatively small in WHICAP, by design, and any variance that is observed may be due to other variables such as respondents’ health status or enthusiasm for participating in cognitive assessments. Because of this structural limitation in the test–retest intervals in WHICAP, our estimated retest effects are likely conservative, because some declines due to aging are expected.
A third limitation is that we cannot know for certain whether we are capturing retest effects between the first and second visits or true change in cognitive performance. Genuine improvement is unlikely, however, given that many participants who showed larger retest effects had dementia, and cognition is not expected to improve over time in people with dementia. The retest effects in our regressions are based either on a contrast between cognitive performance at the first assessment and cognitive performance at subsequent assessments, or on an accumulating benefit with diminishing returns; to the extent that age-related change is incorrectly estimated, the estimated retest effect will also be incorrect (Hoffman et al., 2011). However, in a typical cohort study design, we believe this approach is the best available strategy for estimating retest effects. A fourth limitation is that the present analysis was restricted to the cognitive domains tested in WHICAP. Measures of spatial ability, processing speed, and higher-level task-switching, for example, were not available, and the mechanisms by which retest effects operate, and thus the predictors of differential retest effects, may differ across domains. A final limitation is that our parameterization of retest effects implicitly assumes variance in the retest effect, but we did not formally incorporate random effects for retest. Ideally, with additional data, it would be possible to describe the variance of the retest effect in a multilevel model incorporating a random effect for the retest term.
Retest effects cannot be ignored in longitudinal research on cognitive aging because they may mask age-related cognitive decline and distort tracking of disease progression and detection of decline (Ronnlund & Nilsson, 2006; Ronnlund, Nyberg, Backman, & Nilsson, 2005). The present study empirically evaluated differential patterns of retest effects for several cognitive domains in a diverse sample of community-living older adults. Because we found no differential retest effects across observable demographic groups, our findings suggest that, although retest effects must be taken into account, differential retest effects may not limit the generalizability of inferences across groups in longitudinal research; in this respect, a commonly recognized bias may not be especially worrisome empirically. Nonetheless, replication in other cohorts with different participant characteristics and retest intervals is warranted.
Acknowledgments
The WHICAP study is supported by a National Institutes of Health grant (R01 AG037212 to Mayeux). This work was supported by UL1TR000040. We gratefully acknowledge a conference grant from the National Institute on Aging (R13AG030995 to D. Mungas) that facilitated data analysis for this project. Dr. Gross was supported by National Institutes of Health grant R03 AG045494-01. Dr. Benitez was supported by the Litwin Foundation and National Institutes of Health grants (UL1 TR000062 and KL2 TR000060). Dr. Shih was supported by an NIA R01 grant (R01AG043960). Dr. Bangen was supported by a National Institutes of Health grant (T32 MH1993417). Dr. Skinner was supported by an Alzheimer Disease Training Grant (5T32AG000258). Dr. Manly was supported by a National Institutes of Health grant (R01 AG028786 to Manly). No authors claim any conflicts of interest.
Supplementary Material
To view supplementary material for this article, please visit http://dx.doi.org/10.1017/S1355617715000508.