Introduction
Racially disparate outcomes on neuropsychological episodic memory tests have persistently been observed among older adults. Generally, older African Americans demonstrate lower scores on episodic memory tests than Whites (Fillenbaum, Peterson, Welsh-Bohmer, Kukull, & Heyman, 1998; Manly et al., 1998; Masel & Peek, 2009; McDougall, Vaughan, Acee, & Becker, 2007; Schwartz et al., 2004; Whitfield et al., 2000; Zsembik & Peek, 2001). Worse performance may represent poorer episodic memory functioning, measurement problems such as test bias, or a combination. Poor performance among African Americans due to measurement problems could lead to misdiagnosis of memory disorders (Gurland et al., 1999; Weiner, 2008; Whitfield, 2002; Whitfield et al., 2000). Inaccurate assessment and inappropriate diagnoses can have profound negative implications for quality of life, end of life decision making, and caregiver support (Dilworth-Anderson, Hendrie, Manly, Khachaturian, & Fazio, 2008; Parker & Philp, 2004).
Previous investigators have identified demographic characteristics including age and sex (Manly et al., 1998; McDougall et al., 2007; Mungas, Reed, Farias, & DeCarli, 2009; Zsembik & Peek, 2001), health conditions including hypertension and cardiovascular disease (Schwartz et al., 2004; Whitfield et al., 2000), and sociocultural variables including education, language, acculturation, and socioeconomic status (Boone, Victor, Wen, Razani, & Ponton, 2007; Manly, Byrd, Touradji, & Stern, 2004) as factors associated with observed score differences across groups.
Stern et al. suggested that educational experiences influence brain development and can be considered a proxy for cognitive reserve (Stern et al., 1994; Stern, 2009). Parental education (Kaplan et al., 2001; Rogers et al., 2009; Singh-Manoux, Richards, & Marmot, 2005), home experiences that stimulate childhood learning (Everson-Rose, Mendes de Leon, Bienias, Wilson, & Evans, 2003), and lifetime engagement in cognitive activities (Scarmeas & Stern, 2003; Wilson, Barnes, & Bennett, 2003; Wilson et al., 2005) are examples of factors found to influence late-life cognitive functioning. These experiences, conceptualized as cognitive reserve in the current manuscript, may preserve cognitive functioning in the face of brain pathology in later life (Jones et al., 2010; Scarmeas & Stern, 2003). The primary goal of this study is to examine factors associated with cognitive reserve concurrently for measurement bias and their ability to explain differences in episodic memory performance across African Americans and Whites.
The association between education and reserve may be partially mediated by socioeconomic status and education quality (Brunner, 2005; Dotson, Kitner-Triolo, Evans, & Zonderman, 2009; Kaplan et al., 2001; Stern, Albert, Tang, & Tsai, 1999). Higher socioeconomic status may afford opportunities to engage in cognitively stimulating experiences, which may buffer against late life cognitive decline (Stern et al., 1994, 1999; Stern, 2006). Manly, Touradji, Tang, and Stern (2003) and Manly, Schupf, Tang, and Stern (2005) studied education quality as measured by performance on reading tests (Cosentino, Manly, & Mungas, 2007). Low reading levels (i.e., a proxy for poor education quality) were associated with more rapid rates of cognitive decline (Manly et al., 2003, 2005).
Demographic, health, and sociocultural factors that contribute to differential episodic memory ability may represent test bias (Brickman, Cabo, & Manly, 2006; Gasquoine, 2009; Pedraza & Mungas, 2008; Robertson, Liner, & Heaton, 2009; Rosselli & Ardila, 2003). Educational experiences that lead to the acquisition of test-taking strategies can increase “test wiseness” and may inflate test scores (Gasquoine, 2009; Manly, Jacobs, Touradji, Small, & Stern, 2002; Robertson et al., 2009; Rosselli & Ardila, 2003; Scruggs & Lifson, 1985). If test wiseness varies across groups, individuals in different groups with the same underlying level of the ability measured by the test would have unequal expected scores, which is a definition of differential item functioning (DIF) (Camilli & Shepard, 1994; Thissen, Steinberg, & Wainer, 1993).
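The definition of DIF given above can be stated compactly. As an illustrative formalization (the notation is introduced here for this sketch and is not taken from the cited sources), let X denote the score on an item, θ the underlying ability, and G group membership. Under these assumptions, an item is free of DIF with respect to G when:

```latex
E[X \mid \theta, G = g] = E[X \mid \theta]
\quad \text{for every group } g \text{ and every ability level } \theta
```

DIF is present when this equality fails for at least one group at some ability level.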
Other factors representing test bias include reaction to test content (e.g., familiarity, interest) (Brickman et al., 2006; Flaugher, 1978; Stricker & Emmerich, 1999; Teng & Manly, 2005) and cultural factors including stereotype threat, language, or unrepresentative norms (Brickman et al., 2006; Gasquoine, 2009; Kit, Tuokko, & Mateer, 2008; Loewenstein, Arguelles, Arguelles, & Linn-Fuentes, 1994; Manly et al., 2002; Manly, 2008; Teng & Manly, 2005; Whitfield, 2002).
Meaningful comparisons of performance across groups necessitate attention to measurement equivalence (Teresi, Kleinman, & Ocepek-Welikson, 2000; Teresi, Stewart, Morales, & Stahl, 2006; Tuokko et al., 2009). Several researchers have applied DIF methodology to assess relationships between characteristics associated with test bias and performance on neuropsychological tests among racially diverse older adults (Crane, van Belle, & Larson, 2004; Crane et al., 2008; Jones, 2003; Pedraza et al., 2009; Ramirez, Teresi, Holmes, Gurland, & Lantigua, 2006; Teresi, Holmes, Ramirez, Gurland, & Lantigua, 2001; Teresi et al., 1995). Much of this previous work has found substantial DIF in global measures of cognition, such as the Mini-Mental State Examination (MMSE) (Crane, Gibbons, Jolley, & van Belle, 2006; Dorans & Kulick, 2006; Jones, 2006; Morales, Flowers, Gutierrez, Kleinman, & Teresi, 2006; Ramirez et al., 2006) or the Cognitive Abilities Screening Instrument (CASI) (Crane et al., 2004; Gibbons et al., 2009).
DIF has also been observed in specific cognitive domains, such as visual naming ability (Pedraza et al., 2009), fluency, and working memory (Crane et al., 2008). To our knowledge, this is the first study to examine DIF in African Americans and Whites on a measure of episodic memory.
DIF analyses determine whether individual characteristics exaggerate or attenuate the probability of successful responses to episodic memory items, given a particular level of episodic memory functioning. DIF analyses often focus on item-level findings. Crane, Gibbons, Narasimhalu, Lai, and Cella (2007) and Crane, Gibbons, Ocepek-Welikson, et al. (2007) suggest there may be different audiences for DIF analyses. Scale developers may be most interested in item-level findings. Clinicians may be primarily interested in individual-level DIF impact. Social scientists may be primarily interested in group-level DIF impact, which addresses the question, “Is it likely that DIF might impact mean scores for groups or relationships between covariates of interest?” (Crane, Gibbons, Ocepek-Welikson, et al., 2007; Crane, Gibbons, Narasimhalu, et al., 2007). In this study, we are primarily interested in group-level DIF impact. One research question being posed is: Does DIF impact the relationships between factors associated with reserve and episodic memory functioning across African American and White older adults?
Figure 1 depicts theorized relationships evaluated in this study. Observed variables (performance on episodic memory tests, demographics, indicators associated with reserve) are depicted in rectangles, while the unobserved factor (actual episodic memory functioning) is in an oval. The prior work of Manly et al. (2002, 2003, 2005) suggested that educational experiences were particularly important. Because these investigators did not test for DIF, its possible importance as an explanatory factor is unknown. In the current study we directly tested for DIF and depict DIF in a dashed box in Figure 1. The dashed box indicates that DIF is usually ignored but is included in the present study. Thus, the goal of this study is to better understand relationships between memory performance and demographic and cognitive reserve covariates while accounting for DIF.
Method
Participants
Study participants were identified from the Memory and Aging Project (MAP) and the Minority Aging Research Study (MARS) conducted by the Rush Alzheimer's Disease Center. MAP and MARS are ongoing longitudinal cohort studies among community-dwelling older adults in Chicago. MAP began enrollment in 1997 (Bennett et al., 2005). Consenting participants agreed to detailed annual evaluations, cognitive testing, and postmortem organ donation. MARS has a nearly identical design and began enrollment of African Americans in 2004. By April 2010, MAP included 1304 participants, and MARS 349. Recruitment strategies were so similar that a few African Americans are enrolled in both studies.
We evaluated baseline data from self-identified African Americans or Whites who were free of dementia, and had complete episodic memory and cognitive reserve data. The data from these studies were obtained in compliance with Rush's Institutional Review Board regulations.
Clinical Evaluations
Participants completed clinical evaluations including medical history, neurological examination, and neuropsychological assessment (Arvanitakis, Bennett, Wilson, & Barnes, 2010; Bennett et al., 2005). A clinician used clinical data and standard criteria to classify dementia and Alzheimer's disease (McKhann et al., 1984).
Neuropsychological Evaluations
Participants completed a 19-test battery assessing five cognitive domains. We evaluated episodic memory tests common across MAP and MARS. (a) Story recall (4 scores). Logical Memory Story A (Wechsler, 1987) is a fact-dense textual passage read aloud once; the participant is asked to recall elements immediately and after a delay. The East Boston Memory Test (Albert et al., 1991) is similar, and includes scores for immediate and delayed recall. (b) Word list (3 scores). The 10-word CERAD list (Morris et al., 1989) was administered in three learning trials that are summed (range, 0–30). After a distracter task, the participant is asked to recall the words (range, 0–10). Participants are then presented with ten trials of four words, and asked to identify the one on the CERAD list (range, 0–10).
Cognitive reserve
Cognitive reserve indicators included: years of personal, maternal, and paternal education, childhood cognitive activity frequency, income at age 40, and education quality, as measured by reading level (see below). We initially categorized self-reported personal years of education as (1) some primary (<grade 8); (2) primary (completed grade 8); (3) some high school (9–11); (4) high school (completed grade 12); or (5) post-secondary (13 or greater). For DIF analyses, we categorized education as <12 and ≥12 years to ensure adequate analytic sample sizes.
We calculated childhood cognitive activity from self-reported activities at ages 6 and 12. Participants were asked how often someone read to them, told them stories, or played games with them (age 6) and how often they read books and magazines or went to the library (age 12). Response options ranged from less than once a year (1 point) to almost every day (5 points), and composite scores were obtained by averaging across the five items (Wilson et al., 2003). The scale has demonstrated adequate psychometric properties (Cronbach's α = 0.88; test–retest reliability of r = 0.79) in studies with older adults (Barnes, Wilson, de Leon, & Bennett, 2006; Wilson et al., 2005). We dichotomized average scores at ≤3 and >3 to ensure adequate analytic sample sizes.
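The scoring rule described above can be illustrated with a minimal Python sketch. The function names and the example responses are hypothetical and were chosen for illustration; this is not the code used in the parent studies.

```python
from statistics import mean

def childhood_activity_score(item_responses):
    """Average five item responses, each scored from 1
    (less than once a year) to 5 (almost every day)."""
    if len(item_responses) != 5:
        raise ValueError("expected five item responses")
    if any(not 1 <= r <= 5 for r in item_responses):
        raise ValueError("each response must lie between 1 and 5")
    return mean(item_responses)

def is_high_activity(score, cutoff=3):
    """Dichotomize the average: True for scores >3, False for scores <=3."""
    return score > cutoff

# Hypothetical responses for one participant (three age-6 items, two age-12 items).
responses = [4, 3, 5, 2, 4]
score = childhood_activity_score(responses)
high = is_high_activity(score)
```

A participant whose average is exactly 3 falls in the lower category under this rule.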
Income at age 40 was reported in one of six categories defined by a range of dollar amounts. We compared participant responses to the median U.S. family income for the appropriate year (United States Census Bureau, 2010). We categorized income as below or above median income at age 40.
Reading level was measured by reading tests. MAP participants were administered the National Adult Reading Test (NART) (Nelson, 1982), while MARS participants were administered the third edition of the Wide Range Achievement Test Reading subtest (WRAT-3) (Wilkinson, 1993). For each test, participants read aloud words of increasing complexity; correct pronunciation is required to obtain a point.
We analyzed NART and WRAT-3 data from the 10 individuals enrolled in both studies to co-calibrate this variable. We identified 23 data points where those individuals were evaluated by the two tests at least two times within a 6-month window. For those 23 occasions, we examined a scatterplot (Appendix 1) that confirmed Z scores on the two tests appeared to be roughly linearly related to each other. We identified the median Z score on the WRAT-3 and the median Z score on the NART for these individuals, and used those Z scores to categorize reading levels from the parent studies.
Data Analysis
Overview
We derived three different composite scores from the seven episodic memory test data points: a composite Z-score, an IRT score that ignored DIF (a “naive” score), and an IRT score that accounted for DIF with respect to all of the covariates. We performed linear regression analyses using standardized composite scores as dependent variables and race as the primary predictor. We included demographic factors, and factors associated with cognitive reserve, paying particular attention to reading level. We performed a series of sensitivity analyses to assess the robustness of our findings.
Composite Z score
We created the composite measure of episodic memory by converting raw scores on each test to Z scores using the baseline MAP mean and standard deviation. We averaged these Z scores (Wilson et al., 2003, 2005).
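A minimal sketch of this computation follows. The reference means, standard deviations, test labels, and raw scores are hypothetical placeholders, and only three of the seven test scores are shown for brevity.

```python
def composite_z(raw, ref_mean, ref_sd):
    """Average of per-test Z scores, each standardized against
    the reference (baseline) mean and standard deviation."""
    z_scores = [(raw[test] - ref_mean[test]) / ref_sd[test] for test in raw]
    return sum(z_scores) / len(z_scores)

# Hypothetical baseline statistics (not values from MAP).
ref_mean = {"list_learning": 18.0, "list_recall": 6.0, "list_recognition": 9.0}
ref_sd = {"list_learning": 4.0, "list_recall": 2.0, "list_recognition": 1.0}

# Hypothetical raw scores for one participant.
participant = {"list_learning": 22, "list_recall": 5, "list_recognition": 9}

# Per-test Z scores are 1.0, -0.5, and 0.0; the composite is their mean.
score = composite_z(participant, ref_mean, ref_sd)
```

Because each test is standardized before averaging, tests with wide raw ranges (such as the 0–30 learning total) do not dominate the composite.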
Dimensionality
Both the naive IRT score and the IRT score accounting for multiple sources of DIF rely on an assumption of unidimensionality, that is, that the items can be conceptualized as measuring a single underlying construct. There is no single standard approach for determining whether a scale is sufficiently unidimensional. We used exploratory and confirmatory factor analyses.
Naive IRT scores
We fit Samejima's graded response model (Samejima, 1969) in Parscale (Muraki & Bock, 2003) and used expected a posteriori (EAP) scoring. The graded response model is a polytomous extension of the two-parameter logistic model (2PL) (Lord & Novick, 1968).
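To illustrate how the graded response model assigns probabilities to ordered score categories, the following Python sketch computes category probabilities for a single item. It is not Parscale's implementation: the parameter values are hypothetical, and the sketch assumes the plain logistic metric without a scaling constant.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities under the graded response model.

    theta      -- latent ability
    a          -- item discrimination
    thresholds -- ascending category thresholds b_1 < ... < b_{K-1}
    """
    # Cumulative probabilities P(X >= k), bracketed by 1 (lowest) and 0 (beyond highest).
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Each category probability is the difference of adjacent cumulative probabilities.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]

# Hypothetical three-category item evaluated at average ability.
probs = grm_category_probs(theta=0.0, a=1.0, thresholds=[-1.0, 1.0])

# With a single threshold the model reduces to the 2PL.
two_pl = grm_category_probs(theta=0.0, a=1.5, thresholds=[0.0])
```

The single-threshold case makes the relationship to the 2PL concrete: an item with two categories has one cumulative curve, which is the familiar two-parameter logistic function.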
IRT scores that accounted for all forms of DIF
We used a hybrid ordinal logistic regression/IRT approach to identify and account for DIF, using difwithpar software (Crane et al., 2006). We analyzed several covariates for DIF: self-reported race, sex, education, age, father's education, mother's education, childhood cognitive activities, income at age 40, and reading level. We were primarily interested in accounting for all sources of DIF. Detailed methods have been published previously (Crane et al., 2006, 2008).
Regression analyses
All regression models included an indicator term for race. We transformed each episodic memory composite score to have a mean of 0 and standard deviation of 1. We performed a series of regression analyses with the composite episodic memory scores as dependent variables: (1) Base: race; (2) Demographics: race plus demographics (sex and age); (3) Demographics and cognitive reserve except reading level: model 2 plus cognitive reserve factors other than reading level (years of education, father's education, mother's education, childhood cognitive activities, and income at age 40); (4) Demographics and cognitive reserve including reading level: model 3 plus reading level.
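The nested structure of the four models, and the standardization of the dependent variable, can be sketched as follows. The variable names are hypothetical labels, the example scores are invented, and the use of the population standard deviation is an assumption made for this illustration.

```python
from statistics import mean, pstdev

def standardize(scores):
    """Rescale scores to mean 0 and standard deviation 1
    (population SD is assumed here)."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

DEMOGRAPHICS = ["sex", "age"]
RESERVE_EXCEPT_READING = [
    "education", "father_education", "mother_education",
    "childhood_cognitive_activity", "income_at_40",
]

# Each model adds a block of covariates to the previous one.
MODELS = {
    1: ["race"],
    2: ["race"] + DEMOGRAPHICS,
    3: ["race"] + DEMOGRAPHICS + RESERVE_EXCEPT_READING,
    4: ["race"] + DEMOGRAPHICS + RESERVE_EXCEPT_READING + ["reading_level"],
}

# Hypothetical composite scores for five participants.
z = standardize([0.2, -1.1, 0.7, 1.4, -0.3])
```

Because every dependent variable is placed on the same standardized scale, race coefficients can be compared across the three composite scores in standard deviation units.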
Sensitivity analyses
We performed several sensitivity analyses to determine whether assumptions made in our modeling affected our conclusions. We repeated DIF analyses related to race using Multiple Indicator Multiple Cause (MIMIC) modeling. These analyses were performed in two ways, using (1) a single factor model (analogous to the IRT approach used in the primary analysis); and, (2) a bi-factor model that does not rely on the assumption of unidimensionality.
We assessed multicollinearity between the covariates. We matched African Americans to Whites of similar age and education and the same sex, and repeated the regression analyses to control for cohort effects. We performed regression analyses with age, education, and childhood cognitive activity as continuous variables. The scores we used to co-calibrate the reading tests may lead to misclassifying high or low reading levels (Appendix 1). To ensure that misclassification of reading level was not driving the results, we performed a secondary analysis in which we omitted the people whose reading levels were most likely to be misclassified, that is, those whose reading scores were within 0.25 SD of the cutoff values.
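The exclusion rule for the reading-level sensitivity analysis can be sketched in Python. The cutoffs are the Z-score thresholds reported in the Table 1 footnote; the participant records are hypothetical, and treating scores strictly above the cutoff as "high" and the 0.25 SD band as inclusive are assumptions of this sketch.

```python
# Z-score thresholds reported in the Table 1 footnote
# (WRAT-3 in MARS, NART in MAP).
CUTOFF_Z = {"MARS": 0.48, "MAP": -0.94}
MARGIN = 0.25  # width of the exclusion band, in SD units

def reading_level(z, study):
    """Classify reading level relative to the study-specific cutoff.
    Assigning scores above the cutoff to 'high' is an assumption."""
    return "high" if z > CUTOFF_Z[study] else "low"

def near_cutoff(z, study, margin=MARGIN):
    """True when a score lies within `margin` SD of the cutoff."""
    return abs(z - CUTOFF_Z[study]) <= margin

# Hypothetical participants: (id, study, reading Z score).
sample = [
    ("p1", "MARS", 0.60),
    ("p2", "MARS", 1.30),
    ("p3", "MAP", -0.50),
    ("p4", "MAP", -1.10),
]

# Keep only participants whose classification is unlikely to flip.
retained = [pid for pid, study, z in sample if not near_cutoff(z, study)]
```

In this example, p1 and p4 sit inside the band around their study's cutoff and are dropped, while p2 and p3 are retained.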
We performed additional analyses to determine whether the reading level effect was unique, or whether using another cognitive test would have the same effect. We compared correlations between reading scores and Digit Span Forward, Digit Span Backward (Wechsler Memory Scale-Revised) (Wechsler, 1987), and Digit Ordering (Cooper & Sagar, 1993; Wilson et al., 2002). We used Digit Ordering, the test that had the lowest correlations with reading scores, to avoid confounding the domains. We dichotomized Digit Ordering so that the proportions classified as high or low were similar to those for reading level. We then repeated the final regression model replacing reading level with Digit Ordering.
Results
Demographics and Episodic Memory Scores
Data were available from 1644 participants. We performed our primary analyses on the 993 participants with complete data, including 273 African Americans and 720 Whites. Some participants who were included in the data set also self-identified as Hispanic: 5 (2%) of the African Americans and 77 (11%) of the Whites. Figure 2 provides an outline of the sample derivation. There were 83 participants excluded due to a diagnosis of Alzheimer's disease or other dementia and 12 participants excluded because they self-identified in a racial group other than African American or White. An additional 556 participants were excluded because they had missing data. Missing data were especially prevalent for three reserve indicators: mother's and father's education and income at 40. The demographic and episodic memory characteristics remained the same when we included participants with missing data. We also compared results from the 993 people with complete data on all covariates to results from the 1421 people with data on all covariates other than mother's education, father's education, and income at age 40, and all regression coefficients were within a few hundredths of each other.
The 993 participants in our primary analyses had a mean age of 77.8 years (SD = 7.6) and a mean of 14.8 years of education (SD = 3.3); 71% were women and 73% were White. Further demographic details are provided in Table 1. On average African Americans were younger and had more years of formal schooling than Whites, had approximately the same levels of parental education and income at age 40, had higher childhood cognitive activity scores, and had lower reading levels. Mean scores for African Americans and Whites for the individual episodic memory tests are shown in Table 1. The tests used to make episodic memory scores demonstrated adequate reliability (α = 0.81) and bivariate correlations ranging from 0.23 to 0.85.
a See the Methods section for details on calculation of the childhood cognitive activity score.
b Income at age 40 was reported in categories of dollars. We dichotomized this variable by looking at the median family income in the U.S. for the year in which the participant was 40. See the Methods section for details.
c Reading level was obtained from the WRAT-3 and NART in the MARS and MAP studies, respectively. As detailed in the Methods section, we analyzed data from participants in both studies to identify threshold values for the two tests that could be considered to be equivalent. The values shown in this table represent the numbers of individuals above and below those thresholds, which were a Z score of 0.48 for the WRAT-3 in MARS and a Z score of −0.94 for the NART in MAP.
IRT and DIF Analyses
We calculated three composite episodic memory scores, which were highly correlated. The two IRT scores were more closely correlated with each other (r = 0.998) than with the composite Z score (r = 0.913 for the naive IRT score and r = 0.900 for the IRT score accounting for DIF). Results from exploratory and confirmatory factor analyses indicated that the episodic memory indicators were sufficiently unidimensional for use of IRT. Only a single eigenvalue was above 1, and the second factor had a negligible eigenvalue. A single factor model did not fit well, however, so we fit a bi-factor model in which the three word list items formed a secondary factor and in which we allowed for residual correlation between the two Logical Memory items and, similarly, between the two East Boston items. This model fit well. Factor loadings in the single factor model and the bi-factor model were very similar, and all of the loadings on the general factor in the bi-factor model were >0.30, which McDonald suggests is evidence of sufficient unidimensionality (McDonald, 1999).
The DIF analyses considered nine covariates: race, age, sex, education, income at age 40, early life cognitive activities, mother's education, father's education, and reading level. The difference between the IRT score accounting for all nine sources of DIF and the naive IRT score represents individual-level DIF impact. When DIF has a negligible impact, the difference will be close to zero. If DIF makes a big impact, this difference will be large. We compared differences to the median standard error of measurement for IRT scores in this data set, which was 0.3. Accounting for all sources of DIF led to changes larger than 0.3 for only six participants (<1%), which suggested the overall individual level DIF impact was negligible.
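The individual-level comparison described above amounts to a simple difference check against the standard error of measurement. The sketch below uses hypothetical scores for three participants; only the 0.3 threshold and the count of six flagged participants out of 993 come from the text.

```python
def individual_dif_impact(naive, adjusted, sem=0.3):
    """Return per-person differences (DIF-adjusted minus naive score)
    and the indices of people whose absolute difference exceeds the SEM."""
    differences = [adj - nv for nv, adj in zip(naive, adjusted)]
    flagged = [i for i, d in enumerate(differences) if abs(d) > sem]
    return differences, flagged

# Hypothetical naive and DIF-adjusted IRT scores.
naive = [0.10, -0.50, 1.20]
adjusted = [0.15, -0.90, 1.18]

differences, flagged = individual_dif_impact(naive, adjusted)

# Share of the analytic sample reported as exceeding the threshold.
share_reported = 6 / 993
```

Using the median standard error of measurement as the yardstick ties the definition of a meaningful shift to the precision of the scores themselves.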
We compared scores for African Americans and Whites when accounting for and ignoring DIF. The mean (SD) naive score for African Americans was −0.005 (0.88), and for Whites it was +0.002 (1.04), a difference of 0.007. The mean (SD) score accounting for DIF for African Americans was −0.036 (0.89), and for Whites it was +0.014 (1.04), a difference of 0.050. Ignoring DIF thus very modestly attenuated differences in mean episodic memory scores between Whites and African Americans.
Factors Associated With Episodic Memory Scores
Regression results are shown in Table 2. The cells show values for regression coefficients for each model. The four sections show results obtained from models with: (1) race only; (2) race and demographics; (3) race, demographics and measures of cognitive reserve except reading level; and (4) race, demographics, and all measures of cognitive reserve including reading level. The three columns show results for the three different dependent variables (naive IRT score, IRT score accounting for all sources of DIF, and composite Z-score) used for the regression models.
Our primary focus in these analyses was on the coefficients associated with race, shown in the top row of each section of Table 2. The intercept term provides an estimate of the adjusted mean for the reference group, while the coefficient for race provides an estimate of the adjusted mean difference between African Americans and Whites. In unadjusted models, mean episodic memory scores were not different across race groups in our sample (Model 1). When we accounted for demographic differences across race groups by including age and sex, African Americans on average did worse than Whites (Model 2). These findings were consistent across the three composite episodic memory scores. We entered age and sex separately in the models and confirmed our suspicion that this effect was attributable to age. The third section in Table 2 summarizes regression findings from models that included race, demographics, and measures of cognitive reserve other than reading level (Model 3). The coefficient for race was not affected by including these factors in the model, suggesting that differences across racial groups in age- and sex-adjusted episodic memory performance were not due to these factors. Again, findings were very similar for the three dependent variables.
The fourth section in Table 2 summarizes findings from the full model including reading level. Adding reading level to Model 3 rendered the coefficient associated with race nonsignificant, suggesting that reading level explained the differences across race groups in age- and sex-adjusted episodic memory scores. These results were consistent across the different composite episodic memory scores.
Sensitivity Analyses
There is a range of methods to detect and account for DIF (Millsap & Everson, 1993; Teresi, 2006) that might yield different results. Results for race from single factor and bi-factor Multiple Indicator Multiple Cause (MIMIC) models were similar to those we report for the IRT approach. The consistency of findings across the two approaches (MIMIC vs. IRT) is reassuring, as is the consistency of findings when we relaxed the single factor assumption (single factor vs. bi-factor MIMIC).
We did not detect any multicollinearity. We assessed the variance inflation factors (VIFs) for the original models (age dichotomized) and for the models in which age was centered and treated as continuous; all of the VIFs were less than 4. We matched participants on age, years of education, and sex to derive a sample of 546. We repeated our regression models in this matched data set and confirmed our main findings observed in Model 4 of Table 2 (see Table 3).
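For readers less familiar with the diagnostic, the variance inflation factor for a covariate is 1 / (1 − R²), where R² comes from regressing that covariate on the remaining covariates. The following sketch shows the arithmetic only; the function name is hypothetical, and the R² values are illustrative rather than estimates from our data.

```python
def vif(r_squared):
    """Variance inflation factor for one covariate, given the R^2 obtained
    by regressing that covariate on all other covariates in the model."""
    if not 0 <= r_squared < 1:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)

# Illustrative values: a VIF below 4 corresponds to R^2 below 0.75.
boundary = vif(0.75)
moderate = vif(0.50)
uncorrelated = vif(0.0)
```

Under this formula, a VIF below 4 means that less than 75% of the variance in any one covariate was explained by the others.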
Note. Overall regression model results are based on participants matched on sex, age, and years of education (n = 546). Findings are essentially identical to regression results from the whole sample shown in Table 2.
We performed additional regression analyses on the entire sample in which we treated age, years of education and childhood cognitive activity as continuous variables. Findings were essentially the same as our primary analyses.
We repeated analyses after excluding people whose reading test scores were close to the cutoff values. Using this approach, 67 participants were excluded from MARS and 48 from MAP. Results were very similar to those from the whole sample (Appendix 2).
We repeated the analyses of Model 4, substituting Digit Ordering for reading level. The coefficient for race in the model of the IRT score accounting for all forms of DIF was −0.16 (p = .04), in the model of the naive IRT scores it was −0.16 (p = .03), and in the model with composite Z-scores it was −0.10 (p = .18). These results suggest the ability of reading level to account for the effect of race on episodic memory is specific to reading level, because using a cognitive domain minimally correlated with reading level did not remove the effect of race.
Discussion
The goal of this study was to investigate several possible explanations for lower episodic memory test scores among older African Americans compared to older Whites. Measurement bias, as identified by DIF analyses, did not explain differences across race in age- and sex-adjusted episodic memory scores. Several variables used as proxies for reserve did not explain these differences. However, we confirmed the findings of Manly and colleagues (Reference Manly, Jacobs, Touradji, Small and Stern2002, Reference Manly, Touradji, Tang and Stern2003, Reference Manly, Schupf, Tang and Stern2005) that education quality, as measured by reading level, explained differences in age- and sex-adjusted scores between African Americans and Whites. This finding appears to be unique to reading level, as a measure of attention (Digit Ordering) did not have the same effect.
An important strength of this study is the evaluation of DIF. DIF analyses are common in educational testing, but still rare in neuropsychology. Without specific analyses, it is impossible to determine whether observed score differences across groups may be due to measurement bias or true group differences. We found that DIF was not responsible for differences in episodic memory test scores between African Americans and Whites. This finding is in contrast to DIF studies in other cognitive domains (Crane et al., Reference Crane, Narasimhalu, Gibbons, Pedraza, Mehta, Tang and Mungas2008; Pedraza et al., Reference Pedraza, Graff-Radford, Smith, Ivnik, Willis, Petersen and Lucas2009).
We used a hybrid IRT/OLR approach to DIF detection. There are a range of methods to detect and account for DIF (Millsap & Everson, Reference Millsap and Everson1993; Teresi, Reference Teresi2006) that might yield different results. We found similar results for race using a different DIF detection technique. The IRT approach used here relies on the assumption of unidimensionality. Methods for DIF assessment when this assumption is violated are not readily available, especially when the goal is to account for DIF with respect to a large number of covariates. We found the same item identified with DIF for race when we used single factor or bi-factor MIMIC models for episodic memory, suggesting that ignoring the bi-factor structure did not materially affect our DIF findings.
African Americans tend to score lower on episodic memory tests than Whites of similar age, but the differences are often due to differences in education, occupation, or income (Dotson et al., Reference Dotson, Kitner-Triolo, Evans and Zonderman2009; Manly et al., Reference Manly, Jacobs, Sano, Bell, Merchant, Small and Stern1998; McDougall et al., Reference McDougall, Vaughan, Acee and Becker2007; Mungas et al., Reference Mungas, Reed, Farias and DeCarli2009; Zsembik & Peek, Reference Zsembik and Peek2001). In the current study, mean scores for some memory tests were actually higher for African Americans than Whites (Table 1), but African Americans were younger on average (Table 1). Indeed, in unadjusted analyses (Model 1 in Table 2), composite episodic memory scores did not differ across race. In adjusted analyses, African Americans had poorer age- and sex-adjusted episodic memory scores (Model 2 in Table 2). In our study, reserve factors other than reading level did not explain differences across race groups in age-adjusted episodic memory scores (Model 3 in Table 3). Reading level itself did explain differences across race groups in age-adjusted episodic memory scores (Model 4 in Table 3). This effect was specific to reading level, as the race effect was still present in models that excluded reading level but included Digit Ordering.
Prior research has identified reading level as a proxy of educational quality associated with cognitive decline (Manly et al., Reference Manly, Jacobs, Touradji, Small and Stern2002). This factor has been identified as particularly important to comparisons of neuropsychological testing results across groups of elders characterized by diverse languages and ethnic backgrounds (Cosentino et al., Reference Cosentino, Manly and Mungas2007; Manly et al., Reference Manly, Jacobs, Touradji, Small and Stern2002). Two tests of reading level were used in the analyses: WRAT-3 and NART. MARS selected the WRAT-3 due to concerns about floor effects for the NART among minority elders. A cross-validation study found the WRAT-3 and NART to be comparable measures of premorbid intelligence (Johnstone, Callahan, Kapila, & Bouman, Reference Johnstone, Callahan, Kapila and Bouman1996). We are unaware of any formulas or other means of translating between the two measures. While other statistical methods (e.g., Bland-Altman plots) might prove useful to compare these tests, our sample size of 10 individuals with data from both tests was insufficient for these methods. Our categorization into high and low reading levels might be considered somewhat crude, with the distinct possibility of misclassification. The fact that this crude variable explained differences in age- and sex-adjusted episodic memory scores, while a series of other factors associated with reserve did not explain these differences, is remarkable. Our results remained unchanged when we omitted people with reading scores close to the cutoff used to distinguish between high and low scores, increasing our confidence in our findings. Results of additional sensitivity analyses in which we substituted Digit Ordering for reading level further support this finding. There was no misclassification for Digit Ordering, because the same test was used in both studies, but it did not explain age- and sex-adjusted episodic memory score differences between African Americans and Whites.
As noted above, the WRAT-3 and NART have been conceptualized as measures of reading level indicating educational quality (as we have done here) and also as measures of premorbid intelligence (Johnstone, Callahan, Kapila, & Bouman, Reference Johnstone, Callahan, Kapila and Bouman1996). We have used these tests in models that have already adjusted for years of education, parental education, income at midlife, and childhood cognitive activities, all factors likely also associated with intelligence but none of which explained racial differences in episodic memory scores. Furthermore, the ability of reading level to explain racial differences was unique, as Digit Ordering did not explain racial differences in episodic memory scores. Digit Ordering is also correlated with intelligence (Luo, Chen, Zen, & Murray, Reference Luo, Chen, Zen and Murray2010). While we cannot rule out the possibility that intellectual ability rather than educational quality explains differences across race in episodic memory scores, our analyses suggest that reading test scores alone, and not the other factors considered here, are able to explain these differences, suggesting that there is something unique about reading test scores not shared by these other factors.
As in any observational study, residual and/or unmeasured confounding variables may explain our findings. Unmeasured confounders (i.e., those not included in the current study) might include environmental factors (e.g., pollutants) and genetic differences. The complexities of race and culture are also unmeasured factors that may influence the performance of ethnically diverse older adults on neuropsychological tests. Aspects of culture such as acculturation contribute to older adults’ performances on episodic memory tests (Manly et al., Reference Manly, Byrd, Touradji and Stern2004).
We used somewhat crude dichotomous indicators of each factor associated with cognitive reserve in our DIF assessments and in our regression models, which raises the possibility of residual confounding. For example, based on responses to the question regarding income at age 40, we dichotomized participants into those with incomes below the median family income in the year they were 40 versus those at or above the median income. It is possible that levels of wealth well over the poverty line confer no more brain protection than more modest levels of wealth, while levels of wealth close to or below the poverty line may be more linearly related to brain insults. By dichotomizing these variables, we necessarily group together individuals whose risk may nevertheless vary. When we treated the variables as continuous, our results were unchanged.
The generalizability of the results may be limited by the geographic location of the study population, the specific inclusion criteria used for the two studies, and the focus on African Americans and Whites. Furthermore, generalizability to other older African Americans may be limited by the relatively high education level in the current sample. Recall bias could possibly impact the measurement of some of our covariates such as income at age 40, childhood cognitive activity, and educational experience, though we do not expect this bias to be different across race groups. Analyses in which we matched on sex, age and years of education did not substantially change our findings. That result suggests that multiple linear regression is an adequate approach to determine the effect of race on episodic memory performance.
These results may also be limited to the specific cognitive domain, episodic memory, examined and the neuropsychological tests used to measure this domain. Indeed, Crane et al. (Reference Crane, Narasimhalu, Gibbons, Pedraza, Mehta, Tang and Mungas2008) found DIF was more important in explaining differences across race/ethnic groups for a fluency and working memory composite. The cross-sectional analyses we performed did not allow us to comment on rates of decline of episodic memory functioning over time. Thus, we cannot comment on whether rates of decline may differ by race, or whether any such difference may be due to DIF, demographic factors, or factors associated with reserve.
In conclusion, we found that, on average, older African Americans had lower age- and sex-adjusted mean episodic memory scores than Whites. Those differences are not due to ignoring DIF. We tested several factors related to reserve identified from previous research, and none of these explained differences across race groups in age- and sex-adjusted episodic memory scores. However, reading level, posited to be an indicator of the quality of educational experiences, did explain differences across race groups in age- and sex-adjusted mean episodic memory scores. This explanatory effect did not generalize to another cognitive test, Digit Ordering. These findings reinforce prior work (Manly et al., Reference Manly, Jacobs, Touradji, Small and Stern2002, Reference Manly, Touradji, Tang and Stern2003, Reference Manly, Schupf, Tang and Stern2005) that stressed the importance of measuring and accounting for the quality of education (as measured by reading level) in studies of older individuals from racially diverse samples.
Acknowledgments
We thank the participants in the Rush Memory and Aging Project and the Minority Aging Research Study, and the staff of the Rush Alzheimer's Disease Center. Data collection was supported by the following National Institute on Aging grants: R01AG17917 (D Bennett, PI) and R01AG022018 (L Barnes, PI). Data analyses were supported by R01AG029672 (P Crane, PI). Parts of this manuscript were presented at the National Multicultural Conference & Summit 2011 in Seattle, Washington. No conflict of interest exists for the authors.
Appendix 1: WRAT-3 and NART analyses
Appendix 2: Regression results excluding individuals with reading test Z scores within 0.25 of the cutpoint
Note. Regression results from a sensitivity analysis in which we omitted individuals with reading test scores close to the threshold values used to differentiate between low and high scores. The sample size was reduced from 993 to 878 (67 participants were excluded from MARS and 48 participants from MAP). Regression findings are largely similar to those reported in the primary analyses.