
Exposing and overcoming the fixed-effect fallacy through crowd science

Published online by Cambridge University Press:  10 February 2022

Wilson Cyrus-Lai
Affiliation:
Organisational Behaviour Area, INSEAD, Singapore. wilson-cyrus.lai@insead.edu; eric.luis.uhlmann@gmail.com
Warren Tierney
Affiliation:
Organisational Behaviour Area/Marketing Area, INSEAD, Singapore. warren.tierney@insead.edu
Martin Schweinsberg
Affiliation:
Organisational Behaviour Area, ESMT Berlin, 10178 Berlin, Germany. martin.schweinsberg@esmt.org
Eric Luis Uhlmann
Affiliation:
Organisational Behaviour Area, INSEAD, Singapore. wilson-cyrus.lai@insead.edu; eric.luis.uhlmann@gmail.com

Abstract

By organizing crowds of scientists to independently tackle the same research questions, we can collectively overcome the generalizability crisis. Strategies to draw inferences from a heterogeneous set of research approaches include aggregation, for instance, meta-analyzing the effect sizes obtained by different investigators, and parsing, attempting to identify theoretically meaningful moderators that explain the variability in results.

Type
Open Peer Commentary
Copyright
Copyright © The Author(s), 2022. Published by Cambridge University Press

Yarkoni highlights the fixed-effect fallacy, arguing that many if not most research findings are unlikely to prove robust to stimulus sampling and task operationalizations. Experimental studies in psychology and related fields are exposed to the possibility that the effect is specific to the stimulus set in question, such that alternative approaches could have attenuated or even reversed the reported finding. Recent initiatives to crowdsource the analyses of complex datasets (Bastiaansen et al., 2020; Botvinik-Nezer et al., 2020; Schweinsberg et al., 2021; Silberzahn et al., 2018) and the design of experiments (Baribault et al., 2018; Landy et al., 2020) provide strong quantitative evidence for these assertions. When different scientists independently analyze the same dataset to answer the same research question, or separately create their own experimental designs to test the same hypothesis, a wide range of results is obtained.

These large-scale crowd science projects illustrate two key approaches to drawing robust conclusions and building strong theory through diversity in approaches and results. One strategy to overcoming the generalizability challenge is aggregation, for example, simply meta-analyzing across the estimates obtained by independent analysts or from different experimental designs. Another is parsing, or attempting to find meaningful moderators that explain why some approaches yield large estimates and others small to null estimates or even estimates reversed in sign.
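To make the aggregation strategy concrete, the sketch below (in Python, using only NumPy) pools effect size estimates from independent analysts with a DerSimonian-Laird random-effects meta-analysis, one common way of aggregating across heterogeneous approaches. The effect sizes and standard errors are invented for illustration and do not come from any of the projects discussed here.

```python
import numpy as np

def random_effects_meta(estimates, std_errors):
    """DerSimonian-Laird random-effects pooling of analyst-level effect size estimates."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(std_errors, dtype=float) ** 2
    w_fixed = 1.0 / variances                              # inverse-variance (fixed-effect) weights
    fixed_mean = np.sum(w_fixed * estimates) / np.sum(w_fixed)
    q = np.sum(w_fixed * (estimates - fixed_mean) ** 2)    # Cochran's Q heterogeneity statistic
    df = len(estimates) - 1
    c = np.sum(w_fixed) - np.sum(w_fixed ** 2) / np.sum(w_fixed)
    tau2 = max(0.0, (q - df) / c)                          # estimated between-analyst variance
    w_random = 1.0 / (variances + tau2)                    # random-effects weights
    pooled = np.sum(w_random * estimates) / np.sum(w_random)
    pooled_se = np.sqrt(1.0 / np.sum(w_random))
    return pooled, pooled_se, tau2

# Hypothetical estimates reported by five independent analysts testing the same hypothesis
effects = [0.42, 0.10, -0.05, 0.31, 0.18]
ses = [0.12, 0.09, 0.15, 0.10, 0.11]
pooled, se, tau2 = random_effects_meta(effects, ses)
print(f"Pooled estimate = {pooled:.3f} (SE = {se:.3f}), between-analyst variance tau^2 = {tau2:.3f}")
```

A nonzero between-analyst variance in such an aggregation is exactly the signal that the parsing strategy then tries to explain with theoretically meaningful moderators.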

The parsing strategy is in harmony with the perspectivist approach to theoretical progress, which assumes that most phenomena in the social sciences are massively moderated (McGuire, 1973, 1983). From this perspective, "the opposite of a great truth is also true" (Banaji, 2003), and thus it is unsurprising that different empirical approaches to testing the same idea can return effect size estimates that are opposed in sign. The fundamental task of researchers, from a perspectivist standpoint, is to untangle this web by identifying moderators that will allow us to predict when effects emerge, disappear, and reverse. However, we suggest that aggregation and parsing can be complementary rather than competing: meta-scientists can both meta-analyze across crowdsourced approaches and seek to meaningfully explain variability in effect sizes.

In an illustration of the aggregation strategy, Landy et al. (2020) recruited up to 13 research teams to independently create experimental stimulus sets testing the same set of five original hypotheses, all supported in unpublished research by the original authors (e.g., "working for no reason is morally praised," "deontologists are happier than consequentialists"). Over 15,000 research participants were randomly assigned to the different study designs. All five original effects directly replicated using the same stimulus set the original authors had used. However, for four of the five hypotheses, different materials creators produced designs that returned statistically significant effects in opposite directions from one another. At the same time, two of the five original hypotheses proved conceptually robust when meta-analyzing the results across the experimental designs from the different teams of researchers. This maps closely onto predictions by Yarkoni and others that, even when directly replicable, only a minority of findings in social psychology and related fields will prove generalizable across contexts and approaches.

Employing the aggregation and parsing strategies together, Schweinsberg et al. (2021) asked up to 15 independent researchers to test two hypotheses using the same dataset capturing gender and status dynamics in intellectual debates. Not only statistical choices (e.g., covariates), but also the operationalization of variables (e.g., status) were left unconstrained and up to the individual researchers' discretion. For example, an analyst could choose to identify high- versus low-status academics using job rank, citation counts, PhD institution rank, or a combination of indicators. No two researchers employed the same specification. For both hypotheses, independent analysts reported statistically significant estimates in opposite directions despite relying on the same dataset. Hypothesis 1 (women speak more in the presence of other women) was supported when aggregating across different measurement and testing approaches, whereas Hypothesis 2 (high-status academics speak more) was not, with estimates distributed around zero in the latter case. Leveraging a Boba multiverse analysis (Liu, Kale, Althoff, & Heer, 2020; see also Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016) to identify key analyst choice points, Schweinsberg et al. (2021) further demonstrate that differing variable operationalizations directly contribute to this radical dispersion in estimates across analysts. For example, researchers who operationalized status as job rank consistently returned negative estimates for Hypothesis 2, whereas those operationalizing status as the rank of the doctoral institution returned consistently positive estimates. This illustrates how the parsing strategy treats variability across different approaches as clues to meaningful moderation, rather than error to be averaged away.
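As a schematic illustration of the parsing strategy (not the actual Boba pipeline used by Schweinsberg et al.), the sketch below groups hypothetical analyst-level estimates by a single choice point, the operationalization of status, and summarizes the estimates within each choice. All values and category labels are invented for illustration.

```python
import numpy as np

# Hypothetical analyst-level estimates for Hypothesis 2, each tagged with that
# analyst's operationalization of "status" (values invented for illustration only).
analyses = [
    {"status_measure": "job_rank", "estimate": -0.21},
    {"status_measure": "job_rank", "estimate": -0.15},
    {"status_measure": "job_rank", "estimate": -0.08},
    {"status_measure": "phd_institution_rank", "estimate": 0.12},
    {"status_measure": "phd_institution_rank", "estimate": 0.19},
    {"status_measure": "citation_count", "estimate": 0.02},
]

# Parsing strategy: treat the operationalization choice as a candidate moderator
# and summarize the distribution of estimates within each choice.
for measure in sorted({a["status_measure"] for a in analyses}):
    subset = np.array([a["estimate"] for a in analyses if a["status_measure"] == measure])
    print(f"{measure:>22}: n = {len(subset)}, mean = {subset.mean():+.2f}, "
          f"range = [{subset.min():+.2f}, {subset.max():+.2f}]")
```

If the within-choice estimates cluster tightly while the between-choice means diverge in sign, the choice point behaves like a meaningful moderator rather than noise to be averaged away.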

In order to draw generalizable conclusions, Tierney et al. (in preparation) assigned teams of doctoral students and professors to separately create conceptual replication designs testing for backlash against angry women. The original study found that although male managers who express anger (relative to sadness or neutral emotions) experience a boost in status, female managers who express anger are accorded less social status and respect (Brescoll & Uhlmann, 2008). Participants in this ongoing data collection across more than 50 laboratories are randomly assigned to one of 27 study designs (the original design and 26 conceptual replication designs) testing the hypothesized interaction between target gender and emotion expression. The methods employed include scenarios, ostensible newspaper stories, audio recordings, video recordings, and storyboards with illustrated characters, as well as myriad ways of expressing anger. In addition to a preregistered meta-analysis of the results across designs, we will systematically test potential moderators of the results across designs, among them anger extremity, dominance displays, and the salience of target gender.

In summary, we can collectively overcome the generalizability crisis by organizing crowds of scientists to tackle the same research questions independently. Doing so will further expose the fixed-effect fallacy that a single analysis and research paradigm are sufficient for drawing strong theoretical inferences. Scientists can rely on the wisdom of the crowd by aggregating results across independent investigators and, in the perspectivist spirit, seek to identify meaningful moderators of the results across different approaches.

Financial support

This research was supported by an R&D grant from INSEAD to Eric Uhlmann.

Conflict of interest

None.

References

Banaji, M. R. (2003). The opposite of a great truth is also true: Homage to Koan #7. In Jost, J., Prentice, D., & Banaji, M. R. (Eds.), The yin and yang of progress in social psychology: Perspectivism at work (pp. 127–140). Washington, DC: American Psychological Association.
Baribault, B., Donkin, C., Little, D. R., Trueblood, J. S., Oravecz, Z., van Ravenzwaaij, D., … Vandekerckhove, J. (2018). Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences, 115(11), 2607–2612.
Bastiaansen, J. A., Kunkels, Y. K., Blaauw, F. J., Boker, S. M., Ceulemans, E., Chen, M., … Bringmann, L. F. (2020). Time to get personal? The impact of researchers' choices on the selection of treatment targets using the experience sampling methodology. Journal of Psychosomatic Research, 137, 110211.
Botvinik-Nezer, R., Holzmeister, F., Camerer, C. F., Dreber, A., Huber, J., Johannesson, M., … Schonberg, T. (2020). Variability in the analysis of a single neuroimaging dataset by many teams. Nature, 582, 84–88.
Brescoll, V., & Uhlmann, E. L. (2008). Can angry women get ahead? Status conferral, gender, and workplace emotion expression. Psychological Science, 19, 268–275.
Landy, J. F., Jia, M., Ding, I. L., Viganola, D., Tierney, W., Dreber, A., … Uhlmann, E. L. (2020). Crowdsourcing hypothesis tests: Making transparent how design choices shape research results. Psychological Bulletin, 146(5), 451–479.
Liu, Y., Kale, A., Althoff, T., & Heer, J. (2020). Boba: Authoring and visualizing multiverse analyses. IEEE Transactions on Visualization and Computer Graphics, 27(2), 1753–1763.
McGuire, W. J. (1973). The yin and yang of progress in social psychology: Seven koan. Journal of Personality and Social Psychology, 26(3), 446–456.
McGuire, W. J. (1983). A contextualist theory of knowledge: Its implications for innovations and reform in psychological research. In Berkowitz, L. (Ed.), Advances in experimental social psychology (Vol. 16, pp. 1–47). New York, NY: Academic Press.
Schweinsberg, M., Feldman, M., Staub, N., van den Akker, O., van Aert, R., van Assen, M., Liu, Y., … Uhlmann, E. (2021). Radical dispersion of effect size estimates when independent scientists operationalize and test the same hypothesis with the same data. Organizational Behavior and Human Decision Processes, 165, 228–249.
Silberzahn, R., Uhlmann, E. L., Martin, D., Anselmi, P., Aust, F., Awtrey, E., … Nosek, B. A. (2018). Many analysts, one dataset: Making transparent how variations in analytical choices affect results. Advances in Methods and Practices in Psychological Science, 1, 337–356.
Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5), 702–712.
Tierney, W., Cyrus-Lai, W., … Uhlmann, E. L. (in preparation). Who respects an angry woman? A pre-registered re-examination of the relationships between gender, emotion expression, and status conferral. Crowdsourced research project in progress.