
The Value of Bayes’ Theorem for Interpreting Abnormal Test Scores in Cognitively Healthy and Clinical Samples

Published online by Cambridge University Press:  18 March 2015

Brandon E. Gavett*
Affiliation:
Department of Psychology, University of Colorado, Colorado Springs, Colorado
*Correspondence and reprint requests to: Brandon E. Gavett, UCCS Department of Psychology, 1420 Austin Bluffs Parkway, Colorado Springs, CO 80918. E-mail: bgavett@uccs.edu

Abstract

The base rates of abnormal test scores in cognitively normal samples have been a focus of recent research. The goal of the current study is to illustrate how Bayes’ theorem uses these base rates, along with the same base rates in cognitively impaired samples and prevalence rates of cognitive impairment, to yield probability values that are more useful for making judgments about the absence or presence of cognitive impairment. Correlation matrices, means, and standard deviations were obtained from the Wechsler Memory Scale–4th Edition (WMS-IV) Technical and Interpretive Manual and used in Monte Carlo simulations to estimate the base rates of abnormal test scores in the standardization and special groups (mixed clinical) samples. Bayes’ theorem was applied to these estimates to identify probabilities of normal cognition based on the number of abnormal test scores observed. Abnormal scores were common in the standardization sample (65.4% scoring below a scaled score of 7 on at least one subtest) and more common in the mixed clinical sample (85.6% scoring below a scaled score of 7 on at least one subtest). Probabilities varied according to the number of abnormal test scores, base rates of normal cognition, and cutoff scores. The results suggest that interpretation of base rates obtained from cognitively healthy samples must also account for data from cognitively impaired samples. Bayes’ theorem can help neuropsychologists answer questions about the probability that an individual examinee is cognitively healthy based on the number of abnormal test scores observed. (JINS, 2015, 21, 1–9)

Type
Research Articles
Copyright
Copyright © The International Neuropsychological Society 2015 

Introduction

Most neuropsychologists are familiar with sensitivity and specificity statistics that are used to describe the classification accuracy of tests. Using the detection of Creutzfeldt-Jakob disease (CJD) as an example, sensitivity is defined as the probability that a person with CJD tests positive for CJD. This is a conditional probability that can be abbreviated as p(PT|CJD), where PT stands for “positive test.” Similarly, a test’s specificity is the probability that a person without CJD tests negative for CJD; this can be abbreviated as p(NT|¬CJD), where NT stands for “negative test” and ¬CJD stands for “no CJD.”

Although they are useful statistics, sensitivity and specificity are invariant to the base rate of the condition in the population. This invariance poses problems with the application of sensitivity and specificity data to the individual case (Nugent, 2004). To illustrate why this is a problem, consider a hypothetical scenario where a neuropsychological test is developed that can identify CJD with a sensitivity of .999 (.001 false negative rate) and a specificity of .998 (.002 false positive rate). These sensitivity and specificity values represent nearly perfect test performance and are idealized, if unattainable, characteristics of a cognitive test. If a person randomly selected from the population tests positive for CJD on this test, what is the probability that she truly has CJD? In this case, the probability is not .999 (the test’s sensitivity). Rather, the probability is only .0005, or .05%. The reason for this extremely low probability, even considering the nearly ideal classification accuracy of the test, is that the base rate of CJD is extremely low (roughly .000001). This example illustrates the false positive paradox, a type of base rate fallacy, in which interpretation of a positive test result does not account for the prevalence of the condition in the population (Gouvier, 2001). The false positive paradox is common in psychology and medicine, and has been used to inform decision making in areas such as cancer screenings (Detterbeck et al., 2013; U.S. Preventive Services Task Force, 2009) and tests for sexually transmitted infections (Katz et al., 2004). As illustrated by O’Bryant and Lucas (2006), proper interpretation of a positive or negative test result must account for the base rate of the condition under investigation. As such, positive and negative predictive power (PPP and NPP, respectively) are classification accuracy statistics that are more useful than sensitivity and specificity when applied to the individual case.
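The arithmetic behind the CJD example can be checked with a short Python sketch. The function name `positive_predictive_power` is illustrative and not taken from the article, and the base rate of one in a million is an assumption: it is the value that reproduces the .0005 figure quoted above for a sensitivity of .999 and a specificity of .998.

```python
def positive_predictive_power(sensitivity, specificity, base_rate):
    """Return p(condition | positive test) using Bayes' theorem."""
    # Probability of a true positive: has the condition AND tests positive.
    true_positive = sensitivity * base_rate
    # Probability of a false positive: lacks the condition AND tests positive.
    false_positive = (1.0 - specificity) * (1.0 - base_rate)
    return true_positive / (true_positive + false_positive)


# Hypothetical CJD test from the text; base rate of .000001 is assumed.
ppp_cjd = positive_predictive_power(0.999, 0.998, 0.000001)
print(round(ppp_cjd, 4))
```

Because false positives from the very large unaffected population swamp the few true positives, the result is driven almost entirely by the base rate rather than by the test's accuracy.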

Whereas sensitivity reflects the conditional probability that, for example, a person with verified CJD tests positive, PPP describes the conditional probability that a person has CJD given a positive test result. Using the same example, specificity reflects the conditional probability that a person without CJD tests negative, whereas NPP describes the conditional probability that a person does not have CJD given a negative test result. Often, neuropsychological evaluations are sought when the etiology of cognitive impairment is unknown, so probabilities that are conditioned upon knowledge of the true etiology, such as sensitivity, p(PT|CJD), can be misleading when applied to the individual case. Bayes’ theorem is used to calculate probabilities such as PPP and NPP, and thus, can answer questions about the probability of an underlying etiology when a positive or negative test result is observed. PPP and NPP are not invariant to the base rate of the condition in the population; rather, calculation of these values is dependent upon such base rates (Nugent, 2004).

The above discussion is likely familiar to most neuropsychologists. However, there are other situations in neuropsychological assessment that require the use of Bayes’ theorem to arrive at a probability estimate appropriate to answer common referral questions about the absence or presence of cognitive impairment. In recent years, an abundance of research has been published documenting the base rates of abnormal test scores in cognitively healthy populations (Binder, Iverson, & Brooks, 2009; Brooks, 2010, 2011; Brooks & Iverson, 2010; Brooks, Holdnack, & Iverson, 2011; Brooks, Iverson, & Holdnack, 2013; Brooks, Iverson, & White, 2009; Brooks, Iverson, Holdnack, & Feldman, 2008; Brooks et al., 2013; Brooks, Iverson, Lanting, Horton, & Reynolds, 2012; Brooks, Iverson, Sherman, & Holdnack, 2009; Brooks, Sherman, & Iverson, 2010; Brooks, Strauss, Sherman, Iverson, & Slick, 2009; Crawford, Garthwaite, & Gault, 2007; Decker, Schneider, & Hale, 2012; Gunner, Miele, Lynch, & McCaffrey, 2012; Palmer, Boone, Lesser, & Wohl, 1998; Schretlen, Munro, Anthony, & Pearlson, 2003; Schretlen, Testa, Winicki, Pearlson, & Gordon, 2008). This research has been highly influential in helping neuropsychologists better understand that abnormal test scores are not pathognomonic of cognitive impairment, and that variables such as the chosen threshold for identifying abnormal test scores, the number of test scores derived from a battery, the correlations of these test scores with one another, and individual characteristics such as premorbid intellect can affect the frequency with which cognitively healthy individuals obtain abnormal scores in a neuropsychological test battery (Brooks, Iverson, & White, 2009, 2011; Crawford et al., 2007; Decker et al., 2012; Schretlen et al., 2008). Although the results from this body of research have been useful, they also have the potential to be misinterpreted in the same way that sensitivity and specificity can be misinterpreted (e.g., the false positive paradox also applies here). The goal of the current study is to discuss this potential for misuse and to describe how Bayes’ theorem can provide neuropsychologists with conditional probability values that provide an appropriate answer to typical referral questions about cognitive impairment.

If NC=normal cognition and kATS=number (k) of abnormal test scores, then p(kATS|NC) indicates the probability that an individual examinee will obtain k abnormal test scores given that the person’s cognitive functioning is normal. This is the type of conditional probability value that has been presented in previous studies as evidence that some abnormal test scores should be expected in cognitively healthy individuals. However, for clinical purposes, that probability value is not useful because it is unknown whether a patient’s cognitive functioning is normal. Often, the reason for a neuropsychological assessment is to make a determination as to whether a person’s cognitive functioning is normal or abnormal; this is a judgment that must be made based on the results of the assessment. It is rare for a clinical neuropsychological evaluation to be requested when it is known with certainty that an examinee possesses normal cognition. Unless the examinee is verified to be free from cognitive impairment, the conditional probability p(kATS|NC) neither provides the appropriate information to answer questions about the state of an examinee’s health, nor does it provide information that is generalizable to individuals whose cognitive status is unknown. Therefore, in most clinical situations, understanding the probability that a person’s cognitive functioning is normal given that the person obtains k abnormal test scores - p(NC|kATS) - is more useful than its converse, p(kATS|NC).

In the language of Bayesian statistics, the pre-test probability (also known as the “prior” probability) is an estimate of the probability that a person is cognitively normal before any tests have been administered, and can be abbreviated as p(NC). A good estimate of the pre-test probability is the base rate (prevalence) of normal cognition in a given situation. For instance, if 20% of patients referred to a clinic are cognitively normal, then a reasonable estimate of the probability that a newly referred patient is cognitively normal is 20%. The post-test probability (also known as the “posterior” probability) is the updated probability value that results after test scores (or information from interviews, etc.) have been obtained. A good test should create large differences between the pre-test and post-test probabilities. For example, if the pre-test probability of normal cognition is 20% and the post-test probability is also 20%, then the test has no incremental validity for identifying cognitive impairment. In contrast, if the post-test probability is 90%, this reflects a substantial shift in probability and demonstrates good incremental validity of a test for identifying cognitive impairment.

In the context discussed in this manuscript, the post-test probability is the probability that someone is cognitively normal given that they produce k abnormal test scores, and can be written as p(NC|kATS). Bayes’ theorem states that the post-test probability can be obtained by multiplying the pre-test probability by the probability of k abnormal test scores in a healthy sample and dividing that product by the probability of k abnormal test scores. The mathematical form of Bayes’ theorem is shown in Appendix A. Arriving at this post-test probability estimate requires estimation of the pre-test probability p(NC) as well as base rates for p(kATS|NC) and p(kATS|¬NC). The first conditional probability, p(kATS|NC), can be estimated from data obtained from cognitively healthy samples, which will vary depending upon the number of test scores, the criteria for impairment (e.g., more than 1.5 standard deviations below the mean of a cognitively healthy sample), and the degree to which the test scores in the battery are correlated. These data are available in various published articles (e.g., Brooks, 2010; Brooks et al., 2008, 2011; Brooks, Iverson, & White, 2009; Crawford et al., 2007; Decker et al., 2012) and through a software program published by John Crawford (http://homepages.abdn.ac.uk/j.crawford/pages/dept/psychom.htm).

Estimation of the base rate of normal cognition, p(NC), is also needed to calculate the post-test probability p(NC|kATS). The pre-test probability that an examinee is cognitively normal is the complement of the probability that the examinee is cognitively impaired and is dependent on the assessment setting. For example, in a skilled nursing facility’s memory care unit, the probability that an examinee is cognitively normal is likely close to 0. Although most other clinical settings will not have rates of cognitive impairment that approach 100%, the pre-test probability that the examinee is cognitively normal is likely less than 50%.

The final probability estimate needed to find the post-test probability p(NC|kATS) is the probability that a cognitively impaired individual obtains k abnormal test scores: p(kATS|¬NC). This probability will also depend upon the total number of test scores obtained, the threshold used to classify a score as abnormal, and the degree to which the test scores in the battery are correlated. As discussed above, the conditional probability p(kATS|NC) has been the focus of prior studies in this area (e.g., Brooks, 2010; Brooks, Iverson, & White, 2009; Brooks et al., 2011; Crawford et al., 2007; Decker et al., 2012). However, less attention has been paid to the estimation of the base rates of k abnormal test scores in clinical samples. This may be partially due to the fact that very few test manuals present sufficient statistical information (i.e., means, standard deviations, and correlation matrices) about the psychometric properties of the test battery in samples with documented cognitive impairment.
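The three quantities described above combine as follows. This is a restatement of the verbal definition given earlier, written in LaTeX; it is not a reproduction of the article's Appendix A, and it assumes that cognitive status is binary, so that the pre-test probability of impairment is 1 − p(NC).

```latex
p(\mathrm{NC} \mid k\mathrm{ATS}) =
  \frac{p(k\mathrm{ATS} \mid \mathrm{NC})\, p(\mathrm{NC})}
       {p(k\mathrm{ATS} \mid \mathrm{NC})\, p(\mathrm{NC})
        + p(k\mathrm{ATS} \mid \lnot\mathrm{NC})\, \bigl[1 - p(\mathrm{NC})\bigr]}
```

The denominator is the overall probability of observing k abnormal test scores, obtained by weighting the healthy and impaired base rates by their respective pre-test probabilities.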

The goal of the current study is to estimate the probability of observing k abnormal test scores in both cognitively healthy and cognitively impaired samples, and to illustrate how differences in the pre-test probability contribute to providing an estimate of the post-test probability. This goal will be achieved by using data from the Technical and Interpretive Manual (Wechsler, 2009) of the Wechsler Memory Scale, Fourth Edition (WMS-IV). Copyright © 2009 NCS Pearson, Inc. Reproduced with permission. All rights reserved. Summary statistics from the WMS-IV standardization and special groups (clinical) samples will be used in the Monte Carlo framework described by Crawford et al. (2007) to estimate base rates of abnormal scores. This simulation approach is necessary to estimate the probability values used by Bayes’ theorem. Several studies have documented the accuracy of this Monte Carlo approach, as well as its superiority over other approaches (e.g., binomial simulation), as a model for the raw data. One advantage of Monte Carlo simulation is its ability to account for the intercorrelations between test scores (e.g., Brooks & Iverson, 2010; Crawford et al., 2007; Decker et al., 2012; Schretlen et al., 2008). In a battery such as the WMS-IV, which uses the same methods (e.g., story) to generate two or more test scores (e.g., immediate and delayed recall trials), accounting for these intercorrelations when simulating data is crucial due to the obvious lack of independence between many test scores.

Method

Participants and Materials

The data used in this study were obtained from the WMS-IV Technical and Interpretive Manual (Wechsler, 2009). In particular, the test correlation matrices, means, and standard deviations from the adult battery, which was administered to the standardization sample of cognitively healthy participants and a “special groups” (mixed clinical) sample, were extracted for analysis. The data from the older adult battery were not used for two reasons. First, because fewer test scores (8 scores) are available for this battery than the adult battery (15 scores), direct comparisons cannot be made between the base rates estimated for the two batteries. Second, because there are no available data pertaining to the performance of a cognitively impaired sample on the older adult battery, post-test probabilities cannot be computed. The relevant data used for the standardization and special groups samples in this study can be found in Tables 4.1 and 4.3 of the WMS-IV Technical and Interpretive Manual, respectively.

The special groups sample is made up of a total of 555 individuals (M Age=42.1, SD Age=8.2; 46.7% women) who were diagnosed with a condition capable of causing cognitive impairment. Diagnostic groups included attention-deficit/hyperactivity disorder (n=33), reading disorder (n=15), mathematics disorder (n=22), mild (n=32) and moderate (n=35) intellectual disability, mild cognitive impairment (MCI) (n=50), probable mild Alzheimer’s disease (n=48), epilepsy with right (n=15) and left (n=8) temporal lobectomy, autistic disorder (n=21), Asperger’s disorder (n=35), moderate to severe traumatic brain injury (n=32), major depressive disorder (n=84 in younger adults and n=10 in older adults), schizophrenia (n=55), and anxiety disorders (n=60). See Table 4.28 in the WMS-IV Technical and Interpretive Manual (Wechsler, 2009) for more detailed demographic information about this sample. Also see the WMS-IV manual for details about the standardization sample.

For the standardization and special groups samples, data from the following subtest variables were available in the form of age-corrected scaled scores (SS): Logical Memory I and II; Verbal Paired Associates I, II, and II-Word Recall; Designs I, I-Content, I-Spatial, II, II-Content, and II-Spatial; Visual Reproduction I and II; Spatial Addition; and Symbol Span. The index scores used in the WMS-IV are the Auditory Memory Index, Visual Memory Index, Visual Working Memory Index, Immediate Memory Index, and Delayed Memory Index. Subtest scores were analyzed separately from index scores.

This study was determined to be not human subjects research by the University of Colorado, Colorado Springs Institutional Review Board and was conducted in accordance with the Helsinki Declaration.

Simulation Studies

Simulations of expected performance on the WMS-IV were conducted separately for the standardization and special groups samples, based on the correlation matrices, means, and standard deviations of test scores in each group. The correlation matrices for the index scores in both the standardization and special groups samples had negative eigenvalues (i.e., were not positive definite), which prevented the use of Monte Carlo simulation for these test scores. Therefore, only subtest data were used in this study. Monte Carlo simulation was conducted in R version 3.1.1 (R Core Team, 2014) and followed the methods described by Crawford et al. (2007), which are briefly summarized here. First, 15 independent and identically distributed standard normal scores were randomly generated. These scores were then multiplied by the Cholesky decomposition of the correlation matrices for both the adult and special groups samples and then scaled according to the means and standard deviations reported in Tables 4.1 and 4.3 of the WMS-IV Technical and Interpretive Manual (Wechsler, 2009). This yielded 15 Z-scores for a single simulated individual on the WMS-IV battery, which can be converted to scaled scores (M=10; SD=3) by multiplying by 3 and then adding 10. This process was conducted 1,000,000 times so that WMS-IV data from 1,000,000 simulated individuals were available for analysis. The R code used to generate the simulated data is provided in Appendix B.
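The procedure summarized above can be sketched in a few lines. The sketch below is written in Python rather than R and is not the author's Appendix B code. The 3 × 3 correlation matrix, the means, and the standard deviations are hypothetical placeholders (the WMS-IV values are not reproduced here), the means and standard deviations are assumed to be on the scaled-score metric, and the sample size is reduced from 1,000,000 to 20,000 to keep the example fast.

```python
import math
import random


def cholesky(matrix):
    """Return lower-triangular L such that L times its transpose equals matrix.

    math.sqrt raises ValueError when the matrix is not positive definite,
    which is the kind of failure that rules out this simulation approach.
    """
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def simulate_examinee(lower, means, sds, rng):
    """Draw one simulated examinee's correlated test scores."""
    n = len(means)
    # Step 1: independent standard normal deviates.
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Step 2: impose the correlation structure via the Cholesky factor.
    correlated = [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
    # Step 3: rescale to each test's mean and standard deviation.
    return [means[i] + sds[i] * correlated[i] for i in range(n)]


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical inputs for three tests (NOT the WMS-IV manual values).
corr = [[1.0, 0.8, 0.3],
        [0.8, 1.0, 0.4],
        [0.3, 0.4, 1.0]]
means = [10.0, 10.0, 10.0]
sds = [3.0, 3.0, 3.0]

rng = random.Random(2015)  # fixed seed so the run is repeatable
lower = cholesky(corr)
sample = [simulate_examinee(lower, means, sds, rng) for _ in range(20000)]

observed_r = pearson([p[0] for p in sample], [p[1] for p in sample])
print(round(observed_r, 2))
```

The printed correlation between the first two simulated tests should land close to the .8 specified in the hypothetical matrix, which is a quick check that the Cholesky step reproduces the intended dependence between scores.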

The thresholds for identifying abnormal scores in this study were set at 1, 1.5, and 2 standard deviations below the mean, as these represent common cutoff scores used in clinical neuropsychology. Therefore, abnormality was defined as scaled scores of <7, <6, and <4. The simulated data were used to identify the number of abnormal test scores out of the 15 total scores. Based on the recorded number of abnormal test scores, percentages and cumulative percentages were derived; the percentages represent the base rates of exactly k abnormal test scores, whereas the cumulative percentages represent the base rates of k or more abnormal test scores.
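The tallying step can be illustrated with a small Python sketch. The function name `base_rates` and the four-examinee toy data set are hypothetical; the sketch only shows how exact and cumulative percentages relate once each simulated examinee's abnormal scores have been counted against a cutoff.

```python
from collections import Counter


def base_rates(simulated_scores, cutoff):
    """Return (exact, cumulative) percentages indexed by k abnormal scores.

    exact[k] is the percentage of examinees with exactly k scores below the
    cutoff; cumulative[k] is the percentage with k or more such scores.
    """
    n_people = len(simulated_scores)
    n_tests = len(simulated_scores[0])
    counts = Counter(
        sum(1 for score in person if score < cutoff)
        for person in simulated_scores
    )
    exact = [100.0 * counts.get(k, 0) / n_people for k in range(n_tests + 1)]
    cumulative = [sum(exact[k:]) for k in range(n_tests + 1)]
    return exact, cumulative


# Hypothetical toy data: four simulated examinees, three scaled scores each.
toy = [[10, 6, 12], [5, 6, 9], [11, 8, 7], [3, 2, 5]]
exact, cumulative = base_rates(toy, cutoff=7)
print(exact)       # [25.0, 25.0, 25.0, 25.0]
print(cumulative)  # [100.0, 75.0, 50.0, 25.0]
```

Note that a scaled score of exactly 7 is not counted as abnormal under the SS<7 rule, which is why the third toy examinee has zero abnormal scores.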

These results can be applied to a variety of clinical settings, where differences may exist in the probability that examinees are cognitively normal. Bayes’ theorem was used to examine the effects of altering cutoff scores and base rates of normal cognition on the post-test probability estimates that an examinee is cognitively normal given k abnormal test scores. All analyses used the probability values for exactly k abnormal scores, not the cumulative probability values for k or more abnormal scores.
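A minimal Python sketch of this sensitivity analysis follows. The function name `post_test_normal` is illustrative. The two conditional probabilities (.173 and .094) are the values the Discussion quotes from Table 1 for exactly one abnormal score at SS<7; the grid of pre-test probabilities mirrors the 5% to 45% range described in the Results. Other rows of Table 1 are not available here, so only this single case is swept.

```python
def post_test_normal(p_nc, p_k_given_nc, p_k_given_impaired):
    """Return p(NC | k abnormal scores) for a given pre-test probability."""
    numerator = p_k_given_nc * p_nc
    denominator = numerator + p_k_given_impaired * (1.0 - p_nc)
    return numerator / denominator


# Pre-test probabilities of normal cognition from 5% to 45% in 5% steps.
base_rate_grid = list(range(5, 50, 5))

# Exactly one abnormal score at SS<7: .173 (healthy) and .094 (impaired).
posteriors = {
    pct: post_test_normal(pct / 100.0, 0.173, 0.094)
    for pct in base_rate_grid
}

for pct, posterior in posteriors.items():
    print(f"pre-test {pct:2d}% -> post-test {100.0 * posterior:.1f}%")
```

Because exactly one abnormal score is more common in the healthy sample than in the impaired sample under this cutoff, each post-test probability of normal cognition is somewhat higher than its pre-test value, yet it stays low whenever the pre-test probability is low.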

Results

Correlation matrices for the WMS-IV tests obtained from the Technical and Interpretive Manual (Wechsler, 2009) were used to create the correlogram depicted in Figure 1 (Friendly, 2002). The upper panel (above the diagonal) of the correlogram depicts the test intercorrelations for the standardization sample, whereas the lower panel (below the diagonal) is based on the correlations in the special groups sample. The results of the Monte Carlo simulation are shown in Table 1. These results show estimates of the probability that a randomly selected examinee will obtain exactly k (column labeled %) or ≥ k (column labeled Cum %) abnormal test scores on the 15 WMS-IV Adult Battery subtests, based on the chosen thresholds for defining abnormal scores. As expected, lower cutoff scores produced a lower frequency of abnormal scores. Another expected finding is that abnormal test scores were more frequent in the mixed clinical sample than in the standardization sample. These results, when combined with the pre-test base rates of normal cognition, are used to derive post-test probabilities of normal cognition for an individual examinee. These post-test probabilities are represented graphically in Figure 2 for base rates of normal cognition ranging from 5% to 45% (in 5% increments) and for abnormal scores of <7, <6, and <4.

Fig. 1 Correlogram depicting test intercorrelations (Pearson’s r) for the standardization sample (above the diagonal) and special groups sample (below the diagonal). Blue shading reflects positive correlations, whereas red shading reflects negative correlations. Darker colors reflect stronger correlations, whereas lighter colors reflect weaker correlations. LMI=Logical Memory I; LMII=Logical Memory II; VPAI=Verbal Paired Associates I; VPAII=Verbal Paired Associates II; VPAW=Verbal Paired Associates Word Recall; DI=Designs I; DI_C=Designs I Content; DI_S=Designs I Spatial; DII=Designs II; DII_C=Designs II Content; DII_S=Designs II Spatial; VRI=Visual Reproduction I; VRII=Visual Reproduction II; SA=Spatial Addition; SS=Symbol Span.

Fig. 2 Post-test probabilities for normal cognition based on number of abnormal test scores and chosen cutoff scores for pre-test probabilities ranging from 5% to 45% (in 5% increments).

Table 1 Base rates for k abnormal test scores on the WMS-IV adult battery based on group and cutoff score

Note. k=number of abnormal test scores; %=percent of sample with exactly k abnormal test scores; Cum %=percent of sample with k or more abnormal test scores.

Discussion

Past research in this area has been used to argue that ignorance of the base rates of abnormal test scores in cognitively normal individuals can increase the risk of incorrectly diagnosing an examinee with a cognitive disorder. For instance, Brooks et al. (2008) found that 39% of cognitively healthy older adults produced at least one demographically corrected score at or below the 5th percentile on a battery consisting of 8 WMS-III test scores. Based on these data, the authors argued, “the fact that it is common for healthy older adults to have one WMS-III delayed memory subtest score 1.5 [sic] SDs below the mean calls into question the validity of the current psychometric criteria for MCI” (Brooks et al., 2008, p. 472). However, this conclusion is equivalent to interpreting a test result based on the test’s specificity, rather than its negative predictive power, because it fails to take into account the larger context in which older adults tend to be evaluated for MCI. In clinical settings, older adults are not randomly sampled from the population to undergo neuropsychological assessment. Rather, clinical evaluations for MCI are likely to occur after the patient or someone close to the patient (e.g., spouse, physician) expresses concern about the patient’s cognitive functioning. Cognitive complaints are associated with increased odds [self: odds ratio (OR)=2.1, informant: OR=2.2, both: OR=4.2] of a subsequent diagnosis of MCI or dementia relative to those without a complaint (Gifford et al., 2014). For instance, out of 175 consecutive tertiary referrals to a memory disorders clinic, only 15 (8.6%) were judged to be cognitively normal (Lonie et al., 2008). With such low base rates of normal cognition in clinical settings such as the one reported by Lonie et al. (2008), even one abnormal score still makes it considerably more likely than not that the examinee is cognitively impaired.

Consider a hypothetical scenario in which an examinee has been administered the WMS-IV Adult Battery and that the threshold for impaired performance is set at one standard deviation below the mean (SS<7). The hypothetical examinee obtains scores on all 15 test variables and one of these scores falls in the abnormal range. Based on the data presented in Table 1, the probability that a cognitively healthy person obtains one or more abnormal scores under these conditions is .654 (65.4%). Without consideration of any other factors, this prevalence rate of 65.4% appears to provide convincing evidence to suggest that a diagnosis of MCI in such an individual would be “accidental” (Brooks et al., 2008, p. 475).

Studies presenting base rates of abnormal test scores in cognitively normal samples tend to report cumulative percentages rather than the exact percentages associated with k abnormal test scores. So although 65.4% of the WMS-IV standardization sample was estimated to produce one or more abnormal test scores, only 17.3% were estimated to produce exactly one abnormal test score (Table 1). When applying these results to the individual patient, the exact probability, rather than the cumulative probability, is needed when using Bayes’ theorem to estimate post-test probabilities of normal cognition when exactly k abnormal scores are observed.

If the hypothetical examinee discussed above was referred to a memory disorders clinic like that described by Lonie et al. (2008), the base rate (pre-test probability) of normal cognition in such a situation is estimated to be .086 (8.6%). The data in Table 1 indicate that the probability of observing exactly one abnormal test score (SS<7) in a person with cognitive impairment is .094 (9.4%). Using these data, Bayes’ theorem produces a post-test probability for cognitive impairment, as shown in Appendix C.

Bayes’ theorem reveals that the probability that this hypothetical examinee is cognitively normal is 14.8%; alternatively, the probability that the examinee is cognitively impaired is 85.2%. The post-test probability of 14.8% is based on the pre-test probability estimate of 8.6%, which was adjusted to account for the conditional probability estimates of 17.3% and 9.4% for observing exactly one abnormal score in cognitively healthy and cognitively impaired samples, respectively, as shown in Table 1. If one asks the question, “what is the probability of at least one abnormal score in a cognitively healthy sample?,” the answer is 65.4%. In contrast, if the question is “what is the probability that the examinee is cognitively normal given exactly one abnormal test score?,” the answer is 14.8%. The difference between these two questions and their answers illustrates how Bayes’ theorem can provide useful information for the assessment of individual patients.
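The 14.8% figure can be reproduced by substituting the three quoted values into Bayes’ theorem. The LaTeX below is a worked restatement using only the numbers given in the text (pre-test probability .086, and conditional probabilities .173 and .094 for exactly one abnormal score); it is not a copy of the article's Appendix C.

```latex
p(\mathrm{NC} \mid 1\mathrm{ATS})
  = \frac{(.173)(.086)}{(.173)(.086) + (.094)(1 - .086)}
  = \frac{.014878}{.014878 + .085916}
  = \frac{.014878}{.100794}
  \approx .148
```

The complement, 1 − .148 = .852, is the post-test probability of cognitive impairment.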

At the present time, there are insufficient published data to allow for the widespread use of these methods in neuropsychology research and practice. The limitations are many. Most test manuals do not present test correlation matrices in their standardization samples, which are needed to estimate base rates of abnormal test scores in cognitively healthy individuals. Even more rare is the test manual that publishes these correlation matrices in clinical samples; without these, it is not possible to estimate the base rates of abnormal test scores in cognitively impaired individuals. This problem is further complicated by clinicians’ preference for a flexible battery approach (Rabin, Barr, & Burton, 2005), which usually involves the use of tests whose intercorrelations are unknown, both in cognitively healthy and cognitively impaired populations. Current efforts to provide standardized co-normed test batteries that can be used in a flexible testing framework (e.g., the Calibrated Neuropsychological Normative System; Schretlen, Testa, & Pearlson, 2010; the Neuropsychological Assessment Battery; Stern & White, 2003) represent an advancement in the flexible battery approach, but further improvements could be made if publishers made the test correlation matrices available for these tests in the standardization samples as well as any cognitively impaired samples used for clinical validation. Because the factor structure and, therefore, the correlation matrix, of tests may not be invariant to different etiologies of cognitive impairment, correlation matrices should be presented separately for documented clinical groups (e.g., traumatic brain injury, Alzheimer’s disease) whenever possible. Because this information can improve the practice of neuropsychology, neuropsychologists should demand that publishers provide these data for research and clinical purposes. Similarly, researchers and practitioners who use a flexible battery are encouraged to publish data on the means, standard deviations, skewness, kurtosis, and correlation matrices for every test in their battery, broken down by clinical grouping (Decker et al., 2012). Finally, data pertaining to the base rates of cognitively normal versus cognitively impaired examinees evaluated across various clinical contexts should be published as well, as this information will help better inform pre-test and post-test probability estimates.

The potential lack of invariance of the test correlation matrix, means, and standard deviations to different causes of cognitive impairment represents a limitation of the current study. Pearson Assessment, the publisher of the WMS-IV, has fortunately published the correlation matrix of the WMS-IV adult battery tests in a mixed clinical sample of 555 individuals with a variety of disparate conditions that can impair cognition. However, the test correlation matrix in this clinical sample may not represent the correlation matrix of any one clinical group. Therefore, the results presented here may not be generalizable to the individual case.

The model considered here is a reasonable way to simulate test data; however, it can be improved upon. For example, the model did not account for some variables that have previously been identified as important covariates. Variables such as predicted intelligence (current and premorbid; Brooks & Iverson, 2010; Brooks, Iverson, Feldman, & Holdnack, 2009; Brooks, Iverson, & White, 2007; Schretlen et al., 2008), years of education (Brooks et al., 2008; Schretlen et al., 2008), age (Schretlen et al., 2008), sex (Schretlen et al., 2008), and race (Schretlen et al., 2008) have been shown to affect test scores. Because the WMS-IV manual does not present correlation matrices, means, and standard deviations stratified by these variables, the data presented in this manuscript only account for age through the use of age-adjusted scaled scores. Finally, the simulation approach in this study assumed that the WMS-IV test scores are normally distributed; however, when the population variance is unknown, sampling from a t distribution may be preferable.
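As a rough illustration of the last point, correlated scores can be drawn from a multivariate t distribution by dividing each row of correlated normal deviates by the square root of an independent chi-square draw scaled by its degrees of freedom. The sketch below is not part of the study's code: the 3 × 3 correlation matrix and the degrees of freedom (`df`) are hypothetical values chosen only for demonstration.

```r
set.seed(1)
n  <- 10000   # number of simulated examinees (illustrative)
df <- 30      # hypothetical degrees of freedom, not taken from the article

# Toy correlation matrix (hypothetical; not WMS-IV data)
cm <- matrix(c(1.0, 0.5, 0.3,
               0.5, 1.0, 0.4,
               0.3, 0.4, 1.0), nrow = 3, byrow = TRUE)

# Correlated standard normal draws, one row per examinee (n x 3)
z <- matrix(rnorm(n * 3), nrow = n) %*% chol(cm)

# One chi-square scaling draw per examinee; the length-n vector divides
# row i of z by sqrt(w[i]), which yields multivariate t draws
w <- rchisq(n, df = df) / df
t_scores <- z / sqrt(w)

round(cor(t_scores), 2)  # correlation structure is preserved approximately
```

Smaller values of `df` produce heavier tails and therefore more extreme simulated scores; as `df` grows, the draws approach the normal case used in the study.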

The false positive (or false negative) paradox and the base rate fallacy are problems that could arise when interpreting base rates of abnormal test scores in cognitively healthy samples without accounting for the prevalence of cognitive impairment in the relevant population. This problem may be especially pronounced in clinical settings where the majority of referrals are cognitively impaired. Failure to consider the base rates of cognitive impairment in a larger context may lead to under-identification of true cognitive impairment, as illustrated in the example above. Patients who undergo neuropsychological assessment generally represent a highly selected sample of individuals at greater risk of cognitive impairment than individuals randomly selected from the general population. This base rate is essential for identifying the probability that an examinee is cognitively intact given k abnormal test scores.
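The dependence on the pre-test probability can be made concrete with a short calculation. The sketch below holds the two conditional probabilities for exactly one abnormal score fixed at the values reported in Table 1 (0.173 and 0.094, cutoff of scaled score <7) and varies only the pre-test probability of normal cognition; the variable names are illustrative rather than taken from the study's code.

```r
# Conditional probabilities of exactly one abnormal score (scaled score < 7)
p_k_nc  <- 0.173  # cognitively healthy (standardization sample)
p_k_imp <- 0.094  # cognitively impaired (special groups sample)

# Pre-test probabilities of normal cognition: the 8.6% memory clinic
# estimate, followed by 5% to 45% in 5% increments
p_nc <- c(0.086, seq(0.05, 0.45, by = 0.05))

# Bayes' theorem, vectorized over the pre-test probability
post <- (p_k_nc * p_nc) / (p_k_nc * p_nc + p_k_imp * (1 - p_nc))

result <- data.frame(pre_test = p_nc, post_test = round(post, 3))
print(result)
```

The same test result therefore supports very different conclusions depending on the setting in which the examinee is seen.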

The results presented in this study are highly consistent with previous literature examining the prevalence of abnormal test scores in cognitively healthy samples. These simulated data suggest that cognitively healthy individuals frequently obtain one or more abnormal test scores on the WMS-IV subtests (Table 1). However, the current study adds to this previous body of research in several important ways. This is the first known study to also present base rates of abnormal test scores in a clinical sample of individuals with cognitive impairment. By using base rates from both groups, not just the base rates obtained from cognitively healthy samples, Bayes’ theorem yields post-test probability values that can assist clinicians with judgments about the absence or presence of cognitive impairment. Importantly, the post-test probabilities are also dependent upon the pre-test probability of cognitive impairment, which will vary across different clinical and research settings. Failure to consider the base rates of cognitive impairment in a given setting may increase the risk of errors in clinical judgment related to the false positive paradox and the base rate fallacy. Because neuropsychologists are rarely asked to assess cognitive functioning in individuals known to possess normal cognition, interpreting results based on the prevalence of k or more abnormal test scores in cognitively healthy samples has the potential to be uninformative or misleading. Bayes’ theorem is valuable for making probabilistic judgments about cognitive status after test results have been obtained.

Acknowledgments

The author has no conflicts of interest to disclose. A Web-based application for using Bayes’ theorem to calculate post-test probabilities for the WMS-IV based on these data can be accessed at https://begavett.shinyapps.io/WMS-IV.

Wechsler Memory Scale, Fourth Edition (WMS-IV). Copyright © 2009 NCS Pearson, Inc. Reproduced with permission. All rights reserved. “Wechsler Memory Scale” and “WMS” are trademarks, in the US and/or other countries, of Pearson Education, Inc. or its affiliate(s).

APPENDIX A

Bayes’ Theorem

To apply Bayes’ theorem as discussed herein, three probability values are needed. The first, p(kATS|NC), reflects the conditional probability that exactly k abnormal test scores are observed given an examinee with normal cognition is evaluated. The second, p(NC), reflects the pre-test probability that an examinee possesses normal cognition. This may be thought of as the prevalence rate of normal cognition in a particular clinic. Finally, p(kATS) reflects the probability that exactly k abnormal test scores are produced by an examinee. The post-test probability value p(NC|kATS) reflects the probability that an examinee is cognitively normal when exactly k abnormal test scores are observed. Bayes’ theorem expresses this post-test probability as a function of the other three probability values, as shown below.

$$p(NC\!\mid\!kATS)={{p(kATS \mid NC){\times} p(NC)} \over {p(kATS)}}$$

Alternatively, the denominator p(kATS) can be formulated as:

$$p(kATS)=p(kATS\!\mid\!NC){\times}p(NC)+p(kATS\!\mid\!\neg NC){\times}p(\neg NC)$$
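The two expressions above translate directly into a small R function. This is a minimal sketch rather than the code behind the published results; the function and argument names are hypothetical, and the example call uses the k=1 values reported in the article.

```r
# Post-test probability of normal cognition given exactly k abnormal scores.
# Function and argument names are illustrative, not taken from the article.
post_test_nc <- function(p_k_given_nc, p_k_given_imp, p_nc) {
  stopifnot(p_nc >= 0, p_nc <= 1)
  numerator <- p_k_given_nc * p_nc
  # Law of total probability gives the denominator p(kATS)
  denominator <- numerator + p_k_given_imp * (1 - p_nc)
  numerator / denominator
}

# Values reported for k = 1 at a cutoff of scaled score < 7
p <- post_test_nc(p_k_given_nc = 0.173, p_k_given_imp = 0.094, p_nc = 0.086)
round(p, 3)  # 0.148
```

Because the arithmetic is vectorized, passing a vector for `p_nc` returns one post-test probability per pre-test value.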

APPENDIX B

R Code Used to Generate Simulated Data

# Descriptive statistics for the Adult Battery - Standardization Sample
# (subtest scores); each is assumed to be a numeric vector of length 15
a.sub.m <- # Means of subtests in Standardization Sample (see WMS-IV Manual for data)
a.sub.sd <- # Standard deviations of subtests in Standardization Sample (see WMS-IV Manual for data)

# Descriptive statistics for the Adult Battery - Special Groups Sample
# (subtest scores)
sg.sub.m <- # Means of subtests in Special Groups Sample (see WMS-IV Manual for data)
sg.sub.sd <- # Standard deviations of subtests in Special Groups Sample (see WMS-IV Manual for data)

# Correlation matrices for the Adult Battery - both samples (subtest scores)
cm.a.sub <- # Correlation matrix of subtests in Standardization Sample (see WMS-IV Manual for data)
cm.sg.sub <- # Correlation matrix of subtests in Special Groups Sample (see WMS-IV Manual for data)

# Cholesky decomposition of the two correlation matrices
chol.a.sub <- chol(cm.a.sub)
chol.sg.sub <- chol(cm.sg.sub)

nsims <- 1000000 # Number of simulated data points desired
set.seed(80918) # Seed for the random number generator (for reproducibility)

# 15 x nsims matrix of independent standard normal deviates
sub.random <- matrix(rnorm(nsims*15, mean=0, sd=1), ncol=nsims, nrow=15)

# Impose each correlation structure, then transpose to nsims x 15
# (one row per simulated examinee, one column per subtest)
CxR.matrix.a <- t(t(chol.a.sub) %*% sub.random)
CxR.matrix.sg <- t(t(chol.sg.sub) %*% sub.random)

# Rescale each subtest (column) by its SD and add its mean. sweep() applies
# the vector across columns; matrix * vector would recycle it down the rows.
sd.a <- sweep(CxR.matrix.a, 2, a.sub.sd, "*")
sd.sg <- sweep(CxR.matrix.sg, 2, sg.sub.sd, "*")
m.a <- sweep(sd.a, 2, a.sub.m, "+")
m.sg <- sweep(sd.sg, 2, sg.sub.m, "+")

# Convert to z scores; both samples are standardized against the
# Standardization Sample means and SDs
msm.a <- sweep(m.a, 2, a.sub.m, "-")
msm.sg <- sweep(m.sg, 2, a.sub.m, "-")
z.a <- sweep(msm.a, 2, a.sub.sd, "/")
z.sg <- sweep(msm.sg, 2, a.sub.sd, "/")
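The z scores produced above can be turned into base rates by counting, for each simulated examinee, how many subtests fall below a cutoff. The sketch below illustrates that step under stated assumptions: the WMS-IV values are not reproduced here, so it uses a hypothetical 3 × 3 correlation matrix, and it treats a scaled score below 7 as z < -1 (scaled scores have a mean of 10 and an SD of 3). How the study handled rounding to integer scaled scores is not stated, so this mapping is an assumption.

```r
set.seed(1)
n <- 100000  # number of simulated examinees (illustrative)

# Toy correlation matrix (hypothetical; not WMS-IV data)
cm <- matrix(c(1.0, 0.5, 0.3,
               0.5, 1.0, 0.4,
               0.3, 0.4, 1.0), nrow = 3, byrow = TRUE)

# Simulated z scores, one row per examinee and one column per subtest
z <- matrix(rnorm(n * 3), nrow = n) %*% chol(cm)

cutoff <- -1                # assumed z equivalent of a scaled score of 7
k <- rowSums(z < cutoff)    # number of abnormal scores per examinee

# Base rate of exactly k abnormal scores, for k = 0 to 3
base_rates <- prop.table(table(factor(k, levels = 0:3)))
print(round(base_rates, 3))

# Base rate of at least one abnormal score
mean(k >= 1)
```

Running the same tabulation on the simulated standardization and special groups matrices would give the two conditional probabilities that Bayes’ theorem requires for each value of k.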

APPENDIX C

Application of Bayes’ Theorem to Hypothetical Patient Data

In this example, the pre-test probability of normal cognition, p(NC), is set at 0.086, which was the reported prevalence of normal cognition in a memory disorders clinic (Lonie et al., 2008). Its complement, the pre-test probability of cognitive impairment, p(¬NC), is 0.914. When impairment is defined as a scaled score of <7, the base rate of exactly 1 abnormal test score (k=1) in a cognitively healthy group, p(kATS|NC), estimated from the WMS-IV standardization sample, is 0.173 (Table 1). The base rate of exactly 1 abnormal test score in a cognitively impaired group, p(kATS|¬NC), estimated from the WMS-IV special groups sample, is 0.094 (Table 1). In this hypothetical scenario, the post-test probability that an examinee is cognitively normal given exactly 1 abnormal test score, p(NC|kATS), is estimated to be 0.148, or 14.8%.

$$\begin{aligned}
p(NC\!\mid\!kATS) &= {{p(kATS\!\mid\!NC){\times}p(NC)} \over {p(kATS)}} \\
p(NC\!\mid\!kATS) &= {{p(kATS\!\mid\!NC){\times}p(NC)} \over {p(kATS\!\mid\!NC){\times}p(NC)+p(kATS\!\mid\!\neg NC){\times}p(\neg NC)}} \\
p(NC\!\mid\!kATS) &= {{.173{\times}.086} \over {(.173{\times}.086)+(.094{\times}.914)}} \\
p(NC\!\mid\!kATS) &= {{.015} \over {.015+.086}} \\
p(NC\!\mid\!kATS) &= {{.015} \over {.101}} \\
p(NC\!\mid\!kATS) &= .148
\end{aligned}$$

References

Binder, L.M., Iverson, G.L., & Brooks, B.L. (2009). To err is human: “Abnormal” neuropsychological scores and variability are common in healthy adults. Archives of Clinical Neuropsychology, 24(1), 31–46.
Brooks, B.L. (2010). Seeing the forest for the trees: Prevalence of low scores on the Wechsler Intelligence Scale for Children, Fourth Edition (WISC-IV). Psychological Assessment, 22(3), 650–656.
Brooks, B.L. (2011). A study of low scores in Canadian children and adolescents on the Wechsler Intelligence Scale For Children, Fourth Edition (WISC-IV). Child Neuropsychology, 17(3), 281–289.
Brooks, B.L., & Iverson, G.L. (2010). Comparing actual to estimated base rates of “abnormal” scores on neuropsychological test batteries: Implications for interpretation. Archives of Clinical Neuropsychology, 25(1), 14–21.
Brooks, B.L., Holdnack, J.A., & Iverson, G.L. (2011). Advanced clinical interpretation of the WAIS-IV and WMS-IV: Prevalence of low scores varies by level of intelligence and years of education. Assessment, 18(2), 156–167.
Brooks, B.L., Iverson, G.L., & Holdnack, J.A. (2013). Understanding and using multivariate base rates with the WAIS-IV/WMS-IV. In J.A. Holdnack, L. Drozdick, L.G. Weiss & G.L. Iverson (Eds.), WAIS-IV, WMS-IV, and ACS: Advanced clinical interpretation (pp. 75–102). London: Elsevier.
Brooks, B.L., Iverson, G.L., & White, T. (2009). Advanced interpretation of the Neuropsychological Assessment Battery with older adults: Base rate analyses, discrepancy scores, and interpreting change. Archives of Clinical Neuropsychology, 24(7), 647–657.
Brooks, B.L., Iverson, G.L., Feldman, H.H., & Holdnack, J.A. (2009). Minimizing misdiagnosis: Psychometric criteria for possible or probable memory impairment. Dementia and Geriatric Cognitive Disorders, 27(5), 439–450.
Brooks, B.L., Iverson, G.L., Holdnack, J.A., & Feldman, H.H. (2008). Potential for misclassification of mild cognitive impairment: A study of memory scores on the Wechsler Memory Scale-III in healthy older adults. Journal of the International Neuropsychological Society, 14(3), 463–478.
Brooks, B.L., Iverson, G.L., & White, T. (2007). Substantial risk of “accidental MCI” in healthy older adults: Base rates of low memory scores in neuropsychological assessment. Journal of the International Neuropsychological Society, 13, 490–500.
Brooks, B.L., Iverson, G.L., Koushik, N.S., Mazur-Mosiewicz, A., Horton, A.M., & Reynolds, C.R. (2013). Prevalence of low scores in children and adolescents on the test of verbal conceptualization and fluency. Applied Neuropsychology: Child, 2(1), 70–77.
Brooks, B.L., Iverson, G.L., Lanting, S.C., Horton, A.M., & Reynolds, C.R. (2012). Improving test interpretation for detecting executive dysfunction in adults and older adults: Prevalence of low scores on the test of verbal conceptualization and fluency. Applied Neuropsychology: Adult, 19(1), 61–70. doi:10.1080/09084282.2012.651951
Brooks, B.L., Iverson, G.L., Sherman, E.M.S., & Holdnack, J.A. (2009). Healthy children and adolescents obtain some low scores across a battery of memory tests. Journal of the International Neuropsychological Society, 15(4), 613–617.
Brooks, B.L., Sherman, E.M.S., & Iverson, G.L. (2010). Healthy children get low scores too: Prevalence of low scores on the NEPSY-II in preschoolers, children, and adolescents. Archives of Clinical Neuropsychology, 25(3), 182–190.
Brooks, B.L., Strauss, E., Sherman, E.M.S., Iverson, G.L., & Slick, D.J. (2009). Developments in neuropsychological assessment: Refining psychometric and clinical interpretive methods. Canadian Psychology, 50(3), 196–209.
Crawford, J.R., Garthwaite, P.H., & Gault, C.B. (2007). Estimating the percentage of the population with abnormally low scores (or abnormally large score differences) on standardized neuropsychological test batteries: A generic method with applications. Neuropsychology, 21(4), 419–430.
Decker, S.L., Schneider, W.J., & Hale, J.B. (2012). Estimating base rates of impairment in neuropsychological test batteries: A comparison of quantitative models. Archives of Clinical Neuropsychology, 27(1), 69–84.
Detterbeck, F.C., Mazzone, P.J., Naidich, D.P., & Bach, P.B. (2013). Screening for lung cancer: Diagnosis and management of lung cancer, 3rd ed: American college of chest physicians evidence-based clinical practice guidelines. Chest, 143, e78S–e92S.
Friendly, M. (2002). Corrgrams: Exploratory displays for correlation matrices. The American Statistician, 56(4), 316–324.
Gifford, K.A., Liu, D., Lu, Z., Tripodis, Y., Cantwell, N.G., Palmisano, J., & Jefferson, A.L. (2014). The source of cognitive complaints predicts diagnostic conversion differentially among nondemented older adults. Alzheimer’s & Dementia, 10(3), 319–327.
Gouvier, W.D. (2001). Are you sure you’re really telling the truth? Neurorehabilitation, 16, 215–219.
Gunner, J.H., Miele, A.S., Lynch, J.K., & McCaffrey, R.J. (2012). Performance of non-neurological older adults on the Wisconsin Card Sorting Test and the Stroop Color-Word Test: Normal variability or cognitive impairment? Archives of Clinical Neuropsychology, 27(4), 398–405.
Katz, A.R., Effler, P.V., Ohye, R.G., Brouillet, B., Lee, M.V.C., & Whiticar, P.M. (2004). False-positive gonorrhea test results with a nucleic acid amplification test: The impact of low prevalence on positive predictive value. Clinical Infectious Diseases, 38, 814–819.
Lonie, J.A., Herrmann, L.L., Donaghey, C.L., & Ebmeier, K.P. (2008). Clinical referral patterns and cognitive profile in mild cognitive impairment. The British Journal of Psychiatry, 192, 59–64.
Nugent, W.R. (2004). The role of prevalence rates, sensitivity, and specificity in assessment accuracy: Rolling the dice in social work process. Journal of Social Service Research, 31(2), 51–75.
O’Bryant, S.E., & Lucas, J.A. (2006). Estimating the predictive value of the Test of Memory Malingering: An illustrative example for clinicians. The Clinical Neuropsychologist, 20(3), 533–540.
Palmer, B.W., Boone, K.B., Lesser, I.M., & Wohl, M.A. (1998). Base rates of “impaired” neuropsychological test performance among healthy older adults. Archives of Clinical Neuropsychology, 13(6), 503–511.
R Core Team. (2014). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.r-project.org/
Rabin, L.A., Barr, W.B., & Burton, L.A. (2005). Assessment practices of clinical neuropsychologists in the United States and Canada: A survey of INS, NAN, and APA Division 40 members. Archives of Clinical Neuropsychology, 20(1), 33–65.
Schretlen, D.J., Munro, C.A., Anthony, J.C., & Pearlson, G.D. (2003). Examining the range of normal intraindividual variability in neuropsychological test performance. Journal of the International Neuropsychological Society, 9(6), 864–870.
Schretlen, D.J., Testa, S.M., & Pearlson, G.D. (2010). The Calibrated Neuropsychological Normative System. Lutz, FL: Psychological Assessment Resources.
Schretlen, D.J., Testa, S.M., Winicki, J.M., Pearlson, G.D., & Gordon, B. (2008). Frequency and bases of abnormal performance by healthy adults on neuropsychological testing. Journal of the International Neuropsychological Society, 14(3), 436–445.
Stern, R.A., & White, T. (2003). Neuropsychological assessment battery. Lutz, FL: Psychological Assessment Resources.
U.S. Preventive Services Task Force. (2009). Screening for breast cancer: U.S. Preventive Services Task Force recommendation statement. Annals of Internal Medicine, 151, 716–726.
Wechsler, D. (2009). WMS-IV Technical and Interpretive Manual. San Antonio, TX: Pearson Assessment.

Fig. 1 Correlogram depicting test intercorrelations (Pearson’s r) for the standardization sample (above the diagonal) and special groups sample (below the diagonal). Blue shading reflects positive correlations, whereas red shading reflects negative correlations. Darker colors reflect stronger correlations, whereas lighter colors reflect weaker correlations. LMI=Logical Memory I; LMII=Logical Memory II; VPAI=Verbal Paired Associates I; VPAII=Verbal Paired Associates II; VPAW=Verbal Paired Associates Word Recall; DI=Designs I; DI_C=Designs I Content; DI_S=Designs I Spatial; DII=Designs II; DII_C=Designs II Content; DII_S=Designs II Spatial; VRI=Visual Reproduction I; VRII=Visual Reproduction II; SA=Spatial Addition; SS=Symbol Span.

Fig. 2 Post-test probabilities for normal cognition based on number of abnormal test scores and chosen cutoff scores for pre-test probabilities ranging from 5% to 45% (in 5% increments).

Table 1 Base rates for k abnormal test scores on the WMS-IV adult battery based on group and cutoff score