Bigler (this issue) and I are apparently in agreement about the importance of symptom validity testing and about my recommendation to adopt a new terminology: “performance validity” for the validity of performance on measures of ability, and “symptom validity” for the validity of symptom report on measures such as the MMPI-2. We appear to differ on issues related to false positives and on the rigor of performance and symptom validity research designs.
The study by Locke, Smigielski, Powell, and Stevens (2008) is cited by Bigler as demonstrating potential false positive errors due to TOMM scores falling in a “near miss” zone just below cutoff, an interpretation that suggests a continuum of performance. Review of Bigler's Figure 1 and Locke et al.'s Table 2 shows, however, that the frequency distribution of TOMM scores does not reflect a continuum; rather, it comprises two discrete distributions: (1) a sample of 68 ranging from 45 to 50 (mean = 49.31, SD = 1.16) and (2) a sample of 19 ranging from 22 to 44 (mean = 35.11, SD = 6.55) [note that Bigler interprets two distributions below 45, but the sample size is too small to establish their presence]. Clearly, Locke et al. did not view TOMM failures as false positives in their sample. Although Locke et al. found that performance on neurocognitive testing was significantly lower in the group failing the TOMM, TOMM failure was not related to severity of brain injury, depression, or anxiety; only disability status predicted TOMM failure. They concluded: “This study suggests that reduced effort occurs outside forensic settings, is related to neuropsychometric performance, and urges further research into effort across various settings” (p. 273).
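As an illustrative calculation based on the figures reported above (not a reanalysis of Locke et al.'s data), the two group means are separated by 49.31 − 35.11 = 14.2 points, which is more than 12 standard deviations of the passing group (14.2/1.16 ≈ 12.2), a degree of separation more consistent with two discrete distributions than with a single continuum of performance.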
As previously noted in my primary review, several factors minimize the significance of false positive errors. First, scores reflecting invalid performance are atypical in pattern or degree for bona fide neurological disorder. Second, cutoff scores are typically set to keep false positive errors at or below 10%. Third, investigators are encouraged to specify the characteristics of bona fide clinical patients who fail PVTs and thereby represent “false positives,” to enhance the clinical use of the PVT in the individual case. Fourth, appropriate use of PVTs in the individual case requires the presence of multiple abnormal scores on independent PVTs, occurring in the context of external incentive, with no compelling neurologic, psychiatric, or developmental explanation for PVT failure, before one can conclude the presence of malingering (cf. Slick, Sherman, & Iverson, 1999).
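A simple arithmetic illustration ties the second and fourth points together. Assuming, purely for illustration, that the PVTs are statistically independent and that each cutoff yields a false positive rate of 10%, the probability that a validly performing patient fails two such measures is .10 × .10 = .01, or 1%, and the probability of failing three is .10 × .10 × .10 = .001, or 0.1%. Requiring multiple independent PVT failures therefore keeps the aggregate false positive risk well below that associated with any single cutoff.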
Bigler also criticizes the research in this area as being, at best, Class III level research (American Academy of Neurology [AAN]; Edlund, Gronseth, So, & Franklin, 2004), noting that the research is typically retrospective, using samples of convenience, with study authors not blind to group assignment. Review of the AAN guidelines, however, shows that retrospective investigations using case control designs can meet Class II standards (p. 20). Moreover, there is no requirement for masked or independent assessment if the reference standards for presence of disorder and the diagnostic tests are objective (italics added). The majority of studies cited in recent reviews (Boone, 2007; Larrabee, 2007; Morgan & Sweet, 2009) follow case control designs contrasting either non-injured simulators or criterion/known groups of definite or probable malingerers, classified using the objective test criteria of Slick et al. (1999), with groups of clinical patients having significant neurologic disorder (usually moderate/severe TBI) and/or psychiatric disorder (i.e., major depressive disorder). As such, these investigations would meet AAN Class II criteria.
In my earlier review in this dialog, I described a high degree of reproducibility of results in performance and symptom validity research. Additionally, the effect sizes generated by this research are uniformly large, for example, d = −1.34 for Reliable Digit Span (Jasinski, Berry, Shandera, & Clark, 2011); d = .96 for the MMPI-2 FBS (Nelson, Sweet, & Demakis, 2006), replicated at d = .95 incorporating 43 new studies (Nelson, Hoelzle, Sweet, Arbisi, & Demakis, 2010); and d = 2.02 for the two-alternative forced-choice Digit Memory Test (Vickery, Berry, Inman, Harris, & Orey, 2001). These effect sizes exceed those reported for several psychological and medical tests (Meyer et al., 2001). Effect sizes of this magnitude are striking, considering that the discrimination is between feigned performance and legitimate neuropsychological abnormalities, rather than between feigned performance and normal performance. Reproducible results and large effect sizes cannot occur without rigorous experimental design.
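For readers unfamiliar with the metric, Cohen's d expresses the difference between two group means in pooled standard deviation units, d = (M1 − M2)/SDpooled. A d of 2.02 thus indicates that the mean of the feigning group lies roughly two pooled standard deviations from the mean of the clinical comparison group, indicating substantial separation of the two score distributions.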