
A crisis of generalizability or a crisis of constructs?

Published online by Cambridge University Press: 10 February 2022

Kevin M. King
Affiliation:
Department of Psychology, University of Washington, Seattle, WA 98195, USA. kingkm@uw.edu; http://faculty.washington.edu/kingkm
Aidan G.C. Wright
Affiliation:
Department of Psychology, University of Pittsburgh, Pittsburgh, PA 15260, USA. aidan@pitt.edu; http://www.personalityprocesses.com/

Abstract

Psychologists wish to identify and study the mechanisms and implications of nomothetic constructs that reveal truths about human nature and span across operationalizations. To achieve this goal, psychologists should spend more time carefully describing and measuring constructs across a wide range of methods and measures, and less time rushing to explain and predict.

Type
Open Peer Commentary
Copyright
Copyright © The Author(s), 2022. Published by Cambridge University Press

“I live in a jingle jangle jungle. If you ain't got it, you can't be it”

-Bobby Darin

Yarkoni raises concerns about business as usual in psychological science, noting that our methods are rarely designed to extrapolate much beyond the specific sample, measures, or procedures at hand. He aptly frames this as a crisis of generalizability, because if you change samples, measures, or procedures and the results don't hold, then what can be extrapolated? We see at least one alternative way to construe these same issues: namely, a crisis of construct validity. We see this as a valuable alternative articulation, because although many scientists may be willing to write off external validity, knowing that these issues also cut to the core of internal validity may well give them pause for thought.

We argue that most psychologists want to identify and study the mechanisms and implications of nomothetic constructs that reveal fundamental truths about human nature and extend beyond any specific operationalization. Clinicians want to understand depression, not the Beck Depression Inventory. Personality psychologists want to understand narcissism, not the Narcissistic Personality Inventory. Cognitive psychologists want to understand attention, not the dot probe task. However, to the extent that our findings are too tightly tethered to single methods or measures, we have not elucidated the conceptual, but rather echoed the operational. In structural equation modeling terms, where squares denote observed indicators and circles denote latent constructs, we risk becoming a science of squares, not circles. The key point is that this is not just a matter of external validity; it cuts to the core of internal validity, and of what it is we think we are studying.
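
To make the metaphor concrete, consider the standard reflective measurement model (a textbook formulation, not one specific to any study discussed here), in which each observed score is a noisy indicator of a latent construct:

$$y_{ij} = \lambda_j \eta_i + \varepsilon_{ij}$$

where $y_{ij}$ is person $i$'s score on measure $j$ (a square in the path diagram), $\eta_i$ is the latent construct (a circle), $\lambda_j$ is the loading, and $\varepsilon_{ij}$ is measure-specific error. A literature built on a single measure $j$ cannot distinguish variance due to $\eta$ from variance due to that measure's $\varepsilon$.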

Because the field gives such short shrift to the development of measures that can flexibly, reliably, and broadly capture constructs of interest, the field is polluted with methods and measures that have wide acceptance but perform poorly on some dimension of internal validity. “Gold-standard” measures are only thus because of a field-wide consensus weighed in citations rather than empirical quality. Entire bodies of literature are developed with measures whose psychometric properties have barely been questioned, much less deeply interrogated.

For instance, in some fields, construct validity is largely limited to a reliance on face validity to support construct representation (Whitely, 1983). The ego depletion literature, for example, manipulated and measured behaviors ranging from naming the ink color of a color word rather than reading the word (i.e., a Stroop task), to eating healthy foods, regulating emotions, giving counter-attitudinal speeches, behaving counter to a learned habit, regulating attention, making decisions, and persisting in an unpleasant task (Hagger et al., 2016; Hagger, Wood, Stiff, & Chatzisarantis, 2010). However, no research in this domain focused on the fundamental measurement properties of these varying operationalizations. From a construct validity perspective (e.g., Borsboom, Mellenbergh, & Van Heerden, 2004), did they represent some common construct, and could they reliably capture variance attributable to that construct? Although the variety of operationalizations of self-control was admirable, no effort was made to stop and ask whether they reflected the same construct. Researchers would benefit from spending more time simply describing constructs, and from sampling items and stimuli that cover as broad a range of the construct as possible, in order to define the limits of what is and is not a reasonable measure of it. As Yarkoni argues, sampling from a broad range of stimuli for both independent and dependent variables is a critical method for establishing construct validity.
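
As a minimal sketch of the kind of check we have in mind, written in Python with simulated (not real) data and illustrative loadings of our own choosing, one can fit a one-factor model to scores from several putative self-control tasks and ask whether each task actually loads on the common factor:

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(seed=1)
    n = 500  # simulated participants

    # A latent "self-control" trait, plus four tasks that reflect it to
    # differing (illustrative) degrees; the third task barely does.
    trait = rng.normal(size=n)
    loadings = np.array([0.7, 0.6, 0.1, 0.5])
    scores = trait[:, None] * loadings + rng.normal(size=(n, 4))

    fa = FactorAnalysis(n_components=1).fit(scores)
    print(np.round(fa.components_, 2))
    # A near-zero estimated loading flags a task that may not measure
    # the common construct, however plausible its face validity.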

Another example of how easy it is to put the cart before the horse when seeking to establish broadly defined constructs is the NIMH Research Domain Criteria (RDoC) project (Insel, 2014), which has spent over one billion dollars (per NIH RePORTER) pursuing evidence for neural circuits that span "units" of analysis. The point is not that this is illogical (indeed, we think the goal is laudable), but rather that it presumes that constructs can be defined consistently and coherently across levels of analysis spanning genes, molecules, self-report, and lab tasks, without recognizing that the idiosyncrasies of methods at each level give ample reason for pessimism. In other words, this can be understood as another manifestation, or downstream consequence, of the jingle fallacy: one cannot simply presume the same construct across methods merely because they have been similarly labelled. Serious research efforts must be undertaken to bridge constructs across methods before those constructs are used for prediction and explanation.

Measures first developed in small samples with relatively impoverished psychometric models persist in fields as "gold-standard" measures due to researchers' familiarity rather than any evidence of quality. For example, a re-analysis of six large datasets on measures of executive function showed that the original factor structure reported by Miyake et al. (2000), derived from 137 college students and cited over 13,000 times, did not outperform more standard and well-accepted models of cognitive function such as the Cattell–Horn–Carroll model (Jewsbury, Bowden, & Strauss, 2016). The NIH Toolbox measure of executive function included a single measure of discriminant validity (IQ), which was correlated at r = 0.44–0.79 across ages (Zelazo et al., 2013). "Grit" serves as another example: the original measure was so highly correlated with conscientiousness in the original paper (r = 0.77) that, when corrected for unreliability, the correlation would approach 1.0 (Duckworth et al., 2007), not to mention serious critiques of the misapplication of factor analysis in that original manuscript (Credé, Tynan, & Harms, 2017). Without greater attention to systematic description and careful, extensive measurement efforts, the field will continue to see the introduction, reification, and persistence of problematic measures.
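
The correction at issue is Spearman's classic correction for attenuation. With illustrative scale reliabilities of about .80 apiece (our assumption, purely for the arithmetic; the original papers report their own estimates), the observed correlation disattenuates as

$$\hat{\rho} = \frac{r_{xy}}{\sqrt{r_{xx}\,r_{yy}}} = \frac{.77}{\sqrt{.80 \times .80}} \approx .96,$$

leaving little measurable daylight between grit and conscientiousness.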

In this way, we view the generalizability crisis described by Yarkoni as a crisis of constructs. Behavioral scientists have a track record of subordinating external validity to internal validity, which is why we feel it important to highlight that business as usual is doing violence to both. The good news is that the prescription is simple: the field should insist on, if not prize, a careful focus on the development of methods and measures, and on deep construct validation. It is the bedrock of our science.

Financial support

Kevin King was funded by grants from NIDA (DA047247) and NIAAA (AA028832). Aidan Wright was funded by a grant from NIAAA (AA026879).

Conflict of interest

None.

References

Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061.
Credé, M., Tynan, M. C., & Harms, P. D. (2017). Much ado about grit: A meta-analytic synthesis of the grit literature. Journal of Personality and Social Psychology, 113(3), 492.
Duckworth, A. L., Peterson, C., Matthews, M. D., & Kelly, D. R. (2007). Grit: Perseverance and passion for long-term goals. Journal of Personality and Social Psychology, 92(6), 1087.
Hagger, M. S., Chatzisarantis, N. L., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., … Zwienenberg, M. (2016). A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science, 11(4), 546–573.
Hagger, M. S., Wood, C., Stiff, C., & Chatzisarantis, N. L. D. (2010). Ego depletion and the strength model of self-control: A meta-analysis. Psychological Bulletin, 136, 495–525. doi:10.1037/a0019486
Insel, T. R. (2014). The NIMH research domain criteria (RDoC) project: Precision medicine for psychiatry. American Journal of Psychiatry, 171(4), 395–397.
Jewsbury, P. A., Bowden, S. C., & Strauss, M. E. (2016). Integrating the switching, inhibition, and updating model of executive function with the Cattell–Horn–Carroll model. Journal of Experimental Psychology: General, 145(2), 220.
Miyake, A., Friedman, N. P., Emerson, M. J., Witzki, A. H., Howerter, A., & Wager, T. D. (2000). The unity and diversity of executive functions and their contributions to complex "frontal lobe" tasks: A latent variable analysis. Cognitive Psychology, 41(1), 49–100.
Whitely, S. E. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93(1), 179.
Zelazo, P. D., Anderson, J. E., Richler, J., Wallner-Allen, K., Beaumont, J. L., & Weintraub, S. (2013). II. NIH toolbox cognition battery (CB): Measuring executive function and attention. Monographs of the Society for Research in Child Development, 78(4), 16–33.