What Yarkoni describes is grim: any statistical model estimated from any study omits so many sources of variance that its estimates are likely meaningless. The gross misspecification of our models and the specificity of our operationalizations produce claims so narrow in their generality that no one would be interested in them. From this, one could reasonably conclude that a single laboratory of researchers would struggle to design an experiment worth the time it takes to carry out.
For those who agree – and we count ourselves among them – what are the possible steps forward? There are seemingly infinite sources of invalidity in our work; where do we start? Of the solutions Yarkoni describes, we expand on the idea of large-scale descriptive research, which we see as both feasible and worthwhile.
The foundational assumption of construct validity and generalizability
Construct validity is a linchpin in the research process. When researchers create numbers from measurements, it is assumed those numbers take on the intended meaning. A foundational challenge for psychological scientists is ensuring this assumption holds, so that those numbers are valid and their meaning generalizes across the range of interpretations made about them. When psychologists study constructs like motivation, personality, and individualism, they don't intend to describe only the people in their sample; they intend to describe something meaningful and global about the human condition.
Psychometricians refer to the evaluation of the assumption of construct validity and construct generalizability as ongoing construct validation (Cronbach & Meehl, 1955; Kane, 2013). Ongoing construct validation is possible with classic and modern psychometric methods for many approaches to measurement that are common in psychology. For example, the Trends in International Mathematics and Science Study (TIMSS; https://www.iea.nl/studies/iea/timss) measures the mathematics and science achievement of children from 60 countries and is the culmination of years of quantitative and qualitative research to determine how to measure such constructs and generate valid scores that are comparable across diverse peoples. A concrete step forward is for psychologists in other areas of study to take what they are measuring, and the generalizability of what they are measuring, as seriously as the scientists who created TIMSS take achievement.
However, this step comes well before conducting studies that use these measures to test relationships and causal effects. It requires work that psychology has historically undervalued: systematic review and synthesis of theory, with an emphasis on organizing old ideas rather than generating new ones; mixed methods; representative sampling; and descriptive research on constructs and on the variability in scores that measuring them in different contexts can introduce.
What does large-scale construct validation research look like?
Psychology knows what large-scale collaborative studies look like because large-scale replications have become a norm following the Reproducibility Project: Psychology (Open Science Collaboration, 2015). The Many Labs collaboration is on its fifth iteration (ML; Ebersole et al., 2020), and published registered replication reports typically involve data collection spanning dozens of laboratories (e.g., Wagenmakers et al., 2016). However, these replication studies skip over construct validation. We reviewed the measures used in ML2 and found that the surveys used to measure key variables of interest in the replication study had limited validity evidence and poor psychometric properties. For example, a measure of subjective well-being was used in a replication of an effect reported by Anderson, Kraus, Galinsky, and Keltner (2012). We tested the assumed factor structure of this scale using ML2 data, and by any conventional standard the model fit was poor (CFI = 0.616, RMSEA = 0.262, SRMR = 0.267; Shaw, Cloos, Luong, Elbaz, & Flake, 2020), casting doubt on whether the scores represented well-being. How can we interpret the replicability of an effect if the numbers used in the analysis don't have the meaning the original researchers intended? The results of this review are consistent with what Yarkoni is saying: psychologists have bent over backwards trying to replicate effects that didn't convey anything meaningful in the first place.
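To make this kind of check more concrete, the sketch below fits a one-factor confirmatory model to item-level data and inspects its global fit indices. It is an illustration only: the data file, item names, and one-factor structure are hypothetical placeholders rather than the actual ML2 variables, and it assumes the open-source semopy package for Python rather than the software behind the results reported above.

```python
# Minimal sketch (not the analysis behind the numbers above): fit a one-factor
# CFA to item-level data and inspect global fit indices. Assumes the semopy
# package (pip install semopy); the file name and item names are hypothetical
# placeholders rather than the actual ML2 well-being items.
import pandas as pd
import semopy

data = pd.read_csv("wellbeing_items.csv")  # one row per participant, columns swb1..swb5

# lavaan-style syntax: a single latent well-being factor with five indicators
model_desc = "wellbeing =~ swb1 + swb2 + swb3 + swb4 + swb5"

model = semopy.Model(model_desc)
model.fit(data)

# calc_stats returns a one-row table of fit indices (chi-square, CFI, TLI, RMSEA, ...)
fit = semopy.calc_stats(model)
print(fit.T)  # compare, e.g., CFI and RMSEA to conventional cutoffs (CFI > .95, RMSEA < .06)
```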
Luckily, we can squeeze more juice from completed large-scale replication studies with post hoc construct validation. For example, Cloos and Flake (submitted) assessed the psychometric properties of an instrument used in ML2 and whether those properties generalized across two translated versions. The short story is that a critical unmodeled source of variance in replication results is the measurement heterogeneity introduced by translation. Researchers could take the same approach with single studies that are published with materials and data. If the instruments from single studies were systematically reviewed, reanalyzed, and synthesized, psychologists could generate compelling evidence for the generalizability of constructs. This is work researchers can take up retroactively, but it is not an efficient process. Ideally, studies would step back from replicating effects and focus on the theoretical merit and measurement approaches of key constructs in a fashion that exposes them to substantial heterogeneity (e.g., across data collection settings, time, and cultures). The constructs and associated measures that demonstrate validity and generalizability are then good candidates for replication studies.
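As a rough sketch of the translation question, assuming the same hypothetical items and the semopy package as above, one could fit the identical model separately within each language group and compare the resulting fit. This only approximates a configural check; formal measurement-invariance testing would add equality constraints on loadings and intercepts across groups.

```python
# Rough sketch (hypothetical data, columns, and file name): does the same
# one-factor structure hold within each translated version of a scale?
# This only eyeballs configural fit; formal invariance testing would add
# equality constraints on loadings and intercepts across groups.
import pandas as pd
import semopy

ITEMS = ["swb1", "swb2", "swb3", "swb4", "swb5"]
MODEL_DESC = "wellbeing =~ " + " + ".join(ITEMS)

data = pd.read_csv("wellbeing_items.csv")  # assumes a `language` column per respondent

for lang, group in data.groupby("language"):
    model = semopy.Model(MODEL_DESC)
    model.fit(group[ITEMS])          # fit the identical model within one language
    fit = semopy.calc_stats(model)   # one-row table of fit indices
    # Large differences in fit across languages hint at measurement heterogeneity.
    print(f"{lang}: CFI = {float(fit['CFI'].iloc[0]):.3f}, "
          f"RMSEA = {float(fit['RMSEA'].iloc[0]):.3f}")
```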
We are currently attempting a version of this with the Psychological Science Accelerator (Moshontz et al., 2018). We are taking two measures originally developed in English and evaluating validity and generalizability in over 20 languages. We aren't testing any key effects; we are just going to describe the measures and their properties across a diverse set of languages using a mixed-methods approach. Regardless of the results, we will generate useful knowledge about these constructs and the feasibility of using existing measures to study them on a global scale.
Large-scale construct validation is methodologically and logistically challenging, and there are very few incentives for doing it. An optimistic interpretation is that there is plenty to do. If we focus less on generating new ideas and more on organizing, synthesizing, measuring, and assessing constructs from existing ideas, we could keep busy for decades. Probably longer, in fact! Scientists spent literally hundreds of years determining what an electric charge was, what units it should be measured in, and how to measure it. We are researchers in a far younger discipline studying abstract and unobservable constructs; those hundreds of years are likely still ahead of us.
Acknowledgments
None.
Financial support
Jessica Kay Flake and Raymond Luong's work was funded by the Ware's Prospector's Innovation Fund awarded to Jessica Kay Flake (249040, 2019). Mairead Shaw's work was funded by the Fonds de recherche du Quebec – Nature et technologies awarded to Mairead Shaw (288759, 2020).
Conflict of interest
On multiple occasions we have gotten ice cream with the author of the target article.