Published online by Cambridge University Press: 23 January 2006
The agreement between neuropsychologists identifying cognitive impairment (CI) in older adults was examined, as were factors influencing the classification process. Twenty-four neuropsychologists in 18 study centers classified cases with or without CI after reviewing neuropsychological findings and other relevant information. All cases were participants in the third wave of the Canadian Study of Health and Aging, a study of CI in later life. For 117 randomly selected cases, a second neuropsychologist reviewed the same material and reclassified the cases. Cases given the same (concordant) or different (discordant) classifications were compared with respect to patient and rater characteristics. The inter-rater agreement was moderate (77.7% agreement, kappa = .49). On all measures of cognitive functioning, the concordant group without impairment obtained a higher mean score than the discordant group, and the discordant group obtained a higher mean score than the concordant group with impairment. For 5 out of 8 cognitive measures, the concordant group with impairment differed from the concordant group without impairment and the discordant group, but the latter two groups did not differ significantly. The findings are comparable to others in the field and highlight the need for neuropsychologists to further clarify procedures for identifying subtle, or mild, forms of cognitive impairment. (JINS, 2006, 12, 72–79.)
The identification of cognitive impairment has long been recognized as critically important to the health and well being of older adults. Cognitive impairment (CI), regardless of its severity, may signal the need for a medical evaluation to determine the etiology of the impairment and inform the clinician regarding possible treatment options. In addition, it is becoming recognized that mild cognitive impairment seen in older adults may evolve to dementia in a significant proportion of persons (e.g., Tuokko & Frerichs, 2000) and that many cognitively impaired older persons, with or without dementia, require community-based support services if they are to remain in their own homes (Shapiro & Tate, 1997).
Cognitive impairment, as seen in older adults, is a heterogeneous classification and its identification relies heavily on a broad understanding of brain-behavior relations across the lifespan. It is well known that many different underlying disorders may result in impaired cognitive functioning and that the prevalence of disorders affecting cognition increases with age (e.g., Canadian Study of Health and Aging Working Group, 1994). Given the complexity and importance of this task, it is vital to ascertain whether clinicians draw similar conclusions when provided with the same information. That is, the inter-rater agreement when identifying cognitive impairment is of the utmost importance for neuropsychologists.
When identifying cognitive impairment in older adults, a number of factors may need to be taken into consideration. For example, O'Connor et al. (1996) cited factors contributing to discrepant clinical judgments, such as participants' poor vision or deafness, and rater confidence. In other fields, clinician experience and training have been found to influence inter-rater reliability (e.g., Steinhausen & Erdin, 1991; Ballantyne et al., 1995; Brooks & Thomas, 1995). Thus both the characteristics of the participants (e.g., sensory impairment, health status) and the raters (e.g., experience, training, confidence levels) may influence clinical decision-making.
A number of studies have examined inter-rater agreement in the context of specific disorders affecting cognition (e.g., Alzheimer's disease, Kukull et al., 1990; Lopez et al., 1990; Hogervorst et al., 2000; the clinical identification of dementia using DSM-III-R criteria, Baldereschi et al., 1994; Solari et al., 1994; Graham et al., 1996; O'Connor et al., 1996), but few have examined the agreement between raters in the detection of CI more generally. Only one study that included older adults has specifically examined the inter-rater agreement among neuropsychologists (5 clinical neuropsychologists from 4 medical centers) when identifying CI (White et al., 2002). Two hundred and fifty-one cases were selected from the participating centers after excluding those with recent stroke, neurosurgery, head injury, legal blindness, alcohol or drug abuse, or severe deafness. Each neuropsychologist reviewed the medical, neuropsychological, and interview data for cases from his/her own site, while the members of an external review panel, 4 senior neuropsychologists, each reviewed 25% of the cases. The agreement between the individual neuropsychologists and the review panel for the classification of CI versus no CI was moderate (kappa coefficient = .48).
We examined the reliability with which CI was identified by neuropsychologists in a large, national, epidemiological study of cognitive functioning in late life. Our goals were to assess the inter-rater agreement of clinician-based neuropsychological classification of CI, and to identify factors that influenced the decision-making process. We identified cases where the neuropsychologists' classifications were concordant or discordant and compared the resulting groups on patient and rater characteristics. It was anticipated that there would be two types of concordant cases: those who were clearly not impaired and those who were clearly impaired (O'Connor et al., 1996). Further, it was anticipated that discordant cases would fall between the concordant not impaired and concordant impaired cases on measures of cognitive functioning (i.e., cognitive screening and neuropsychological measures) and not be clearly distinguishable from them (Graham et al., 1997). In relation to the concordant not impaired, the concordant impaired were expected to be older, have less education, be female, have lower scores on measures of cognition, depression, and premorbid IQ, as all of these characteristics have been shown to differ between those with and without dementia (e.g., Jorm et al., 1998; Stern et al., 1994). In addition, given that the raters received the same information and the population under study was fairly homogeneous (e.g., age), it was anticipated that rater characteristics may also influence the decision-making process.
Cases for inclusion in this study were drawn from participants in the third wave of the Canadian Study of Health and Aging (CSHA-3). The CSHA is a national, longitudinal investigation of CI and dementia in which data were gathered at 18 study sites across Canada at five-year intervals (in 1990, 1996, 2001). Each participant underwent the same assessment in either English or French. Of the original CSHA cohort established in 1990, 3,424 survivors took part in the CSHA-3 study, and 1,484 of these took part in the neuropsychological examination.
The participants in the CSHA-3 neuropsychological assessment were: (1) those who received a CSHA-1 or CSHA-2 consensus diagnosis of no cognitive impairment (NCI) or cognitive impairment with no dementia (CIND); and (2) participants who screened negative for cognitive impairment at CSHA-2 and, during the CSHA-3 screening examination, obtained a score falling above 49 but below 90 on the Modified Mini-Mental State Examination (3MS; Teng & Chui, 1987).
The neuropsychological test battery used in CSHA-3 was a subset of 8 of the measures used in CSHA-1 and 2 (Tuokko et al., 1995), and included the following: A Canadian version of the Wechsler Memory Scale Information subtest (Wechsler, 1975) and a modified version of Buschke's Cued Recall paradigm (BCR; Tuokko et al., 1991) were used to assess memory; the Digit Symbol subtest of the Wechsler Adult Intelligence Scale–Revised (WAIS–R, Wechsler, 1981) and short forms (Satz & Mogel, 1962) of the WAIS–R Similarities and Block Design subtests were used to assess abstract reasoning and construction. Language skills were assessed using measures of the generation of words in response to letter (Controlled Oral Word Association Test; Spreen & Benton, 1977) and semantic cuing (animal names; Rosen, 1980), and a version of the Boston Naming Test using the 30 even items (Fisher et al., 1999). The Reading subtest from Wide Range Achievement Test–3 (Wilkinson, 1993) was administered to assess premorbid intelligence (note: an equivalent measure was not available for the sample assessed in French). This battery was administered and scored by trained psychometricians who also recorded whether or not the participants exhibited problems that impeded testing in each of the following areas: hearing, vision, fatigability, inattention, perseveration, impulsivity, social impropriety, tangentiality, physical impairments, and facility with testing. The psychometrician also rated participant cooperativeness from 1 (Excellent) to 5 (Poor) and participant facility with language from 1 (Completely fluent) to 5 (Major difficulty).
T scores for the neuropsychological tests were determined using the age, education, and gender corrected norms developed by Tuokko & Woodward (1996) in an English-speaking sample for the following measures derived from this battery: BCR Trial 1 Free Recall, BCR Total Retrieval (Free Recall Trial 1+Trial 2 +Trial 3), WAIS–R Digit Symbol, WAIS–R Block Design, WAIS–R Similarities, Animal Naming, and Verbal Fluency. Fisher et al.'s (1999) norms for the Boston Naming Test were used to interpret performance on this measure. T scores could not be calculated for the French-speaking sample, but, where possible, appropriate norms were applied.
The neuropsychological assessments were evaluated by a neuropsychologist who had access to all information obtained from the participant (i.e., self-reported) during the screening interview: the participant's marital status, vision, hearing, living situation, social support, activities of daily living, health status, health conditions, leisure and physical activities, income, and health care and end-of-life preferences. In addition, the neuropsychologist had access to the participant's scores on the following measures administered as part of the screening interview: the 3MS, the Short Happiness and Affect Research Protocol (Stones et al., 1996), and the Center for Epidemiologic Studies–Depression scale (Andresen et al., 1994).
The neuropsychologists were asked to make a clinical judgment as to whether or not CI was evident; they were specifically requested to include even very mild forms of CI (e.g., 1–1.5 SD below mean on neuropsychological tests). The following guidelines, provided to the neuropsychologists, were chosen to reflect the way in which cognitive impairment of insufficient magnitude to warrant a diagnosis of dementia has been described in the literature (e.g., Petersen et al., 1999). Guideline 1: Cognitively Impaired if score on any test is 1.5 SD or more below the mean, or Guideline 2: Cognitively Impaired if score on more than 1 test is 1.0 SD or more below the mean. Their use was intended to encourage clinicians to include people with mild CI in addition to those more severely impaired. The neuropsychologists were also asked to rate their confidence in their classification on a four-point scale (very, moderately, somewhat, not at all confident). The cases identified with CI by the neuropsychologists then proceeded to a full medical assessment after which a consensus diagnosis concerning the underlying pathology of the CI was determined.
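The two guidelines can be written out as a simple psychometric rule. The following is a minimal sketch, assuming that every test is expressed as a T score (mean 50, SD 10); the function names, test labels, and example values are hypothetical, and the sketch captures only the numerical criteria, not the clinical judgment the neuropsychologists were asked to exercise.

```python
from typing import Mapping

T_MEAN = 50.0  # T scores are scaled to a mean of 50
T_SD = 10.0    # and a standard deviation of 10


def sd_below_mean(t_score: float) -> float:
    """Number of SDs a T score lies below the normative mean (positive = below)."""
    return (T_MEAN - t_score) / T_SD


def meets_guidelines(t_scores: Mapping[str, float]) -> dict:
    """Apply Guideline 1 and Guideline 2 to a mapping of test name -> T score."""
    deficits = [sd_below_mean(t) for t in t_scores.values()]
    # Guideline 1: any single test 1.5 SD or more below the mean.
    guideline_1 = any(d >= 1.5 for d in deficits)
    # Guideline 2: more than one test 1.0 SD or more below the mean.
    guideline_2 = sum(d >= 1.0 for d in deficits) > 1
    return {
        "guideline_1": guideline_1,
        "guideline_2": guideline_2,
        "flagged": guideline_1 or guideline_2,
    }


# Hypothetical participant; these values are invented for illustration only.
example = {"digit_symbol": 41.0, "block_design": 39.0, "animal_naming": 38.0}
result = meets_guidelines(example)
print(result)
```

In this invented example no single score reaches 1.5 SD below the mean, but two scores are at least 1.0 SD below it, so the case is flagged under Guideline 2 only.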
The goal of this approach to identifying cases for full medical appraisal in the context of a large, epidemiological study was to reduce costs, as reliance on a strictly psychometric approach has been associated with high false positive rates when applied with cognitive screening measures like the Mini-Mental State Examination (O'Connor et al., 1991). As this was not a strictly controlled experiment, but a naturalistic clinical research study, it was anticipated that the neuropsychologists might not adhere rigidly to the guidelines provided or be unduly tied to test performance criteria, but that they would exercise reasoned clinical judgment in making their classifications. For example, a participant might have met Guideline 1 on one measure but the neuropsychologists might conclude that no cognitive impairment was present, as (following the definition of a standard deviation) a substantial minority of the normal population would be expected to fall in this range on a single measure. Conversely, a participant who did not meet Guideline 1 or 2 might have been identified as cognitively impaired if his/her performance, although above the criteria, fell below his/her estimated premorbid level of functioning.
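The point about scores falling in the impaired range by chance can be illustrated with a short calculation. This is a sketch only, assuming normally distributed scores and, for the multi-test figure, statistically independent tests; neither assumption is stated in the study, and because neuropsychological tests are usually positively correlated, the independence figure is best read as an upper bound.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, SD 1

# Proportion of a normal population scoring 1.5 SD or more below the mean
# on a single measure (the Guideline 1 cutoff).
p_single = std_normal.cdf(-1.5)


def p_at_least_one(n_tests: int, p: float) -> float:
    """Chance that at least one of n_tests independent scores falls below the cutoff."""
    return 1.0 - (1.0 - p) ** n_tests


# Eight measures, as in the CSHA-3 battery; independence is an assumption.
p_any_of_8 = p_at_least_one(8, p_single)

print(f"single measure: {p_single:.3f}")
print(f"any of 8 independent measures: {p_any_of_8:.3f}")
```

Under these assumptions about 7% of unimpaired people fall below the cutoff on any one measure, and the chance of at least one such score across eight measures is a little over 40%, which is one reason a purely psychometric rule would produce many false positives.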
After the CSHA-3 data collection was finished, 117 cases were randomly selected for rereview by a second neuropsychologist from a different study center. Of the 117 CSHA-3 participants who comprised the reliability sample, ages ranged from 75 to 100 years, with a mean of 84.5 years, SD = 5.7. Seventeen of these persons were assessed in French and 100 were assessed in English.
The Original classifications of CI/no CI were provided for the 117 cases by 24 neuropsychologists recruited by the study centers involved in CSHA-3. Nine of these 24 neuropsychologists then volunteered to provide CI/no CI reclassifications for cases from other centers, forming the Reliability classifications; these neuropsychologists did not reclassify any cases for which they provided the Original classification. Four additional neuropsychologists, involved in the CSHA-3 but who did not provide Original classifications, volunteered to provide Reliability classifications. Hence, a total of 28 neuropsychologists (15 Original only, 9 Original and Reliability, 4 Reliability only) were involved in classifying the 117 cases. All neuropsychologists were surveyed with respect to their degree status, specialty designations, years of experience, and familiarity with each neuropsychological test, rated as “not at all familiar” (0) to “very familiar” (10).
Of the neuropsychologists, 24 held doctoral degrees in psychology and 4 held Master of Arts or Science degrees. All were licensed to practice as clinical psychologists except for one who was acting under the supervision of a registered psychologist (note that registration with a Master's degree is accepted practice in some provinces in Canada). Eight of the neuropsychologists who took part were involved only with CSHA-3, whereas 13 took part in 2 waves of the CSHA, and 7 took part in all 3 waves of the study. All neuropsychologists had experience in assessing elderly people and only 3 indicated that assessment of older persons comprised less than 50% of their clinical caseloads. Most neuropsychologists were very familiar with the neuropsychological measures; only 2 had familiarity ratings below 6 on more than 1 measure. The neuropsychologists were least familiar with Buschke's cued recall paradigm for memory assessment (12 ranked their familiarity < 6). Those neuropsychologists who provided only Original classifications (n = 15) did not differ significantly from those who only provided Reliability classifications (n = 4), or those who provided both Original and Reliability classifications (n = 9), with respect to their years of experience working as psychologists [F(2,25) = .87, p < .43], years of experience with older adult clients [F(2,25) = 1.49, p < .24], percentage of caseload made up of older adults [F(2,25) = .69, p < .51], or familiarity with the individual measures used in this study [F(2,25) = .06–1.9, p < .94–.18].
All information available to the Original neuropsychologist was sent to the second neuropsychologist for reclassification of each case; the second neuropsychologist was blind to the Original classification, but was asked to follow the same classification procedures as the Original neuropsychologist.
The inter-rater agreement for CI classifications between the Original and Reliability neuropsychologists was assessed by calculating a kappa coefficient and percentage agreement.
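For readers unfamiliar with the two statistics, the sketch below shows how percentage agreement and Cohen's kappa are obtained from a 2 × 2 cross-classification of two raters. The function and argument names are hypothetical, and the cell counts in the example are invented for illustration; they are not the counts from this study.

```python
def agreement_stats(both_nci: int, orig_nci_rel_ci: int,
                    orig_ci_rel_nci: int, both_ci: int) -> dict:
    """Percentage agreement and Cohen's kappa for two raters and two categories.

    Arguments are the four cells of the Original x Reliability table.
    """
    n = both_nci + orig_nci_rel_ci + orig_ci_rel_nci + both_ci
    # Observed agreement: proportion of cases on the diagonal.
    observed = (both_nci + both_ci) / n
    # Marginal totals for the "no CI" category for each rater.
    orig_nci = both_nci + orig_nci_rel_ci
    rel_nci = both_nci + orig_ci_rel_nci
    # Agreement expected by chance, from the products of the marginals.
    expected = (orig_nci * rel_nci + (n - orig_nci) * (n - rel_nci)) / n ** 2
    kappa = (observed - expected) / (1.0 - expected)
    return {"percent_agreement": 100.0 * observed, "kappa": kappa}


# Invented counts, NOT the study data: 80 of 100 cases on the diagonal.
stats = agreement_stats(both_nci=40, orig_nci_rel_ci=10,
                        orig_ci_rel_nci=10, both_ci=40)
print(stats)
```

Kappa is lower than raw agreement because it discounts the agreement that two raters would reach by chance given how often each assigns each category.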
We identified concordant cases as those given the same CI classification by the Original and Reliability neuropsychologists. The remainder, those where CI classifications differed, were identified as discordant cases. The concordant and discordant cases were compared on measures of participant characteristics including: language of assessment (i.e., English or French) and psychometricians' ratings of participant hearing, vision, cooperativeness, fatigability, inattention, perseveration, impulsivity, social impropriety, tangentiality, physical impairments, and facility with testing using χ2 tests.
Concordant cases were then recategorized as concordant, not impaired (C-NI) and concordant, impaired (C-I) for comparison with discordant cases. These groups were compared on measures of the participants' cognitive functioning (i.e., 3MS and neuropsychological measures) using one-way analyses of variance (ANOVAs). Even though the neuropsychological test scores were corrected for age, education, and gender, differences might still be expected between the groups for these variables because their influence may not be the same for each group (Reitan & Wolfson, 1996).
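The three-group comparison can be sketched as a one-way ANOVA computed from first principles. This is an illustrative implementation only; the function name is hypothetical, the scores are invented, and the statistical software actually used in the study is not specified.

```python
from typing import Sequence


def one_way_anova(groups: Sequence[Sequence[float]]) -> dict:
    """F statistic and degrees of freedom for a one-way ANOVA."""
    all_scores = [x for g in groups for x in g]
    n_total = len(all_scores)
    grand_mean = sum(all_scores) / n_total

    # Between-groups sum of squares: group means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within-groups sum of squares: scores around their own group mean.
    ss_within = sum(
        sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups
    )

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {"F": f_stat, "df_between": df_between, "df_within": df_within}


# Invented scores for three groups (e.g., C-NI, D, C-I); not study data.
anova = one_way_anova([[5.0, 6.0, 7.0], [2.0, 3.0, 4.0], [1.0, 2.0, 3.0]])
print(anova)
```

With three groups and 117 cases, the same formulas give 2 and 114 degrees of freedom, matching the F(2,114) values reported for measures with complete data.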
Finally, the discordant and concordant groups were compared with respect to rater characteristics using χ2 tests. First, to determine whether the confidence rating for the discordant and concordant groups differed, they were compared on the lowest confidence rating provided between the two neuropsychologists rating each case. Subsequently, the groups were compared with respect to the lowest amount of rater experience reported between the two neuropsychologists' ratings for each case, the number of years experience with older adults, and the reported percent of caseload made up of older adults.
Table 1 shows that there was 77% agreement between the Original and Reliability classifications, resulting in a kappa value of .49. When the concordant and discordant cases were compared using χ2 tests, there were no differences with respect to language in which the assessment was conducted, or any of the ratings made by the psychometricians at the time of testing (see Table 2). When the cases were recategorized as C-NI, C-I, and discordant (D) and compared using ANOVAs, the groups did not differ on age [F(2,114) = .27, p < .77]; education [F(2,114) = 3.57, p < .03]; CES-D [F(2,90) = .49, p < .62]; WRAT-3 [F(2,96) = .98, p < .38]; or gender, χ2 = 1.07, p < .59 (see Table 3). However, the anticipated pattern of performance (i.e., C-NI > discordant > C-I) was seen for the 3MS scores and all 8 of the neuropsychological measures (see Table 4). For the 3MS and 4 of the neuropsychological measures, the C-I group differed significantly (p < .001) from both the C-NI and D groups (see Table 4). In all cases, the C-NI and D groups were not distinguishable (see Table 4).
Table 1. Agreement between the original and reliability neuropsychological classifications of no cognitive impairment (NCI) or cognitive impairment (CI)
Table 2. Comparison of concordant and discordant cases based on patient characteristics
Table 3. Demographic characteristics of groups for whom the neuropsychological classifications were concordant–not impaired (C-NI), discordant (D), or concordant–impaired (C-I)
Table 4. Means and standard deviations (SD) for measures of cognitive performance between groups for whom the neuropsychological classifications were concordant–not impaired (C-NI), discordant (D), or concordant–impaired (C-I)
Finally, when the discordant and concordant groups were compared with respect to rater characteristics, there were no differences with respect to the following: lowest confidence rating provided between the two neuropsychologists rating each case [χ2(3, N = 117) = 0.99, p < .80]; the lowest amount of rater experience reported between the two neuropsychologists rating each case (i.e., 5 years or less vs. 6 years or more) [χ2(1, N = 117) = .00, p < .96]; years of experience with older adults (i.e., 5 years or less vs. 6 years or more) [χ2(1, N = 117) = 0.01, p < .91]; or percent of caseload made up of older adults [χ2(3, N = 117) = 0.30, p < .59].
Despite the complexity and the growing importance of identifying the full spectrum of cognitive impairments in older adults, few studies have examined the reliability with which this can be accomplished. Our study is unique in its size (28 neuropsychologists, 117 cases) and scope, in that factors influencing the decision-making of neuropsychologists were examined. In our study, the neuropsychologists showed a moderate level of agreement when asked to judge whether a participant exhibited CI or not. The level of agreement observed (kappa = 0.49) is virtually identical to that reported in the existing literature for psychologists making similar judgments concerning the presence of cognitive impairment (i.e., 0.48, White et al., 2002) and for dementia diagnoses made by neurologists using clinical records (i.e., 0.49, Solari et al., 1994). When agreement on diagnosis of specific disorders has been examined, agreement tends to be lower (e.g., White et al., 2002). Studies that have reported higher inter-rater agreements have used either well-defined sets of criteria (e.g., DSM-III-R) and standardized written vignettes drawn from carefully selected clinical groups (O'Connor et al., 1996), or a multidisciplinary consensus process (Graham et al., 1996).
Only one clear difference between our concordant and discordant groups was apparent: performance on the cognitive measures. The discordant group fell between the C-NI and C-I groups with respect to cognitive performance on all measures. When significant differences were observed, the discordant group did not differ from the C-NI, whereas the C-I cases were clearly distinguishable from both the C-NI and discordant groups. Of interest, approximately half (12/26) of the discordant cases were persons who received a consensus diagnosis of CIND, a sizable proportion of whom may be in the very early or prodromal stages of a progressive dementia (e.g., Tuokko & Frerichs, 2000; Tuokko et al., 2003). Five of the discordant cases were persons who received a consensus diagnosis of NCI and 9 were lost to follow-up and did not receive a consensus diagnosis. It will be important to determine if the discordant cases are at increased risk for future cognitive decline.
It is possible that the observed agreement rates were reflective of various aspects of the CSHA-3 study design including participant selection criteria and neuropsychological assessment procedures. In the CSHA-3, neuropsychological examinations were conducted for people with CSHA-2 clinical diagnoses of NCI or CIND and for all participants with CSHA-3 3MS scores falling above 49 but below 90. These restrictions, placed by the 3MS selection criteria, truncated the range of the participants' cognitive functioning at the upper and lower ends where concordance rates may be expected to be highest, thereby reducing the overall concordance rates.
Another aspect of the study design that may have influenced agreement rates was the limited information about the participant available to the neuropsychologists. For example, under more ideal circumstances, neuropsychologists might have access to medical information about the person at the time that classifications are being made. In our situation, only self-report of health conditions and health status was available. It is possible that access to medical information would have increased agreement. However, White et al. (2002) had access to medical information and obtained a virtually identical rate of agreement to ours.
Despite these limitations, our finding that disagreements between the neuropsychologists occur primarily for cases with subtle, or mild, CI reinforces the need for additional research in this area. Much controversy currently exists regarding the nosology of such cognitive impairment. Some researchers promote a restrictive definition of mild CI, characterized by memory impairment in the context of normal general cognitive function (e.g., Maruff et al., 2004; Petersen et al., 1999; 2001). Other researchers, principally those working in Europe (e.g., Ritchie & Touchon, 2000; Ritchie et al., 2001; Touchon & Ritchie, 1999), favor a more inclusive nosological entity such as Age-Associated Cognitive Decline (AACD; Levy, 1994) that refers to a decline in any cognitive functions (attention, learning and memory, thinking, language, and visuospatial function), and is identified in relation to norms for elderly persons. More recently, Petersen (2003, 2004) has proposed various subtypes of mild CI including amnestic, single domain nonmemory, and multiple domain. Not until neuropsychologists come to terms with the classification of mild forms of CI and articulate the best ways to define them, will inter-rater agreement improve beyond our present moderate rates.
A research personnel award from the Canadian Institutes of Health Research, Institute of Aging supported HT in the preparation of this manuscript. Phases 1 and 2 of the Canadian Study of Health and Aging core study were funded by the Seniors' Independence Research Program, through Health Canada's National Health Research and Development Program [NHRDP project #6606-3954-MC(S)]; supplementary funding for analysis of the caregiver component was provided by the Medical Research Council. Additional funding was provided by Pfizer Canada Incorporated, through the Medical Research Council/Pharmaceutical Manufacturers Association of Canada Health Activity Program, NHRDP [project #6603-1417-302(R)], by Bayer Incorporated, and by the British Columbia Health Research Foundation [projects #38 (93-2) and #34 (96-1)]. Core funding for Phase 3 was obtained from the Canadian Institutes for Health Research (CIHR grant # MOP-42530); supplementary funding for the caregiver component was obtained from CIHR grant # MOP-43945. Additional funding was provided by Merck-Frosst and by Janssen-Ortho Inc. The study was coordinated through the University of Ottawa and Health Canada. We would specifically like to acknowledge the assistance of Liz Sykes, Nansy Jean-Baptiste, Maggie Stewart, and Beth Sander, who worked very hard to prepare the materials for the reliability review, selected the sample, prepared the clinical examination summaries, worked with the volunteer neuropsychologists, and prepared the data for analysis.