Yarkoni argues that common statistical practices in psychology fail to quantitatively support the generalizations psychologists care about. This is because most analyses ignore important sources of variation and, as a result, unjustifiably generalize from narrowly sampled particulars.
Is this problem tractable? We are optimists, so we leave aside Yarkoni's suggestions to “do something else” or “embrace qualitative research,” and focus instead on his key prescription: the adoption of mixed-effects models that estimate effects at the level of a factor (e.g., stimulus), with each sampled level interpreted as one of a population of potential measurements, thereby licensing generalization over that factor.
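In standard mixed-model notation (ours, not Yarkoni's), the prescription amounts to fitting crossed random intercepts for participants and stimuli, for instance:

```latex
y_{ps} = \beta_0 + \beta_1 x_{ps} + u_p + v_s + \varepsilon_{ps},
\qquad u_p \sim \mathcal{N}(0, \sigma_u^2), \quad
v_s \sim \mathcal{N}(0, \sigma_s^2), \quad
\varepsilon_{ps} \sim \mathcal{N}(0, \sigma_\varepsilon^2)
```

Here, y_ps is participant p's response to stimulus s. Because the stimulus term v_s is treated as a draw from a population of potential stimuli, inference about the fixed effect accounts for stimulus-to-stimulus variability rather than being conditioned on the particular stimuli that happened to be tested.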
Yarkoni is correct that far too few studies do this. In our field, the psychology of music, many studies generalize inaccurately: for example, from a single musical example to all music; from a set of songs from a particular context (e.g., pop songs) to all songs; or from the music perception abilities of a particular subset of humans to all humans.
Consider the “Mozart effect”: a notorious positive effect of listening to a Mozart sonata on spatial reasoning that was over-generalized to “all Mozart” and eventually “all music.” While replicable under narrow conditions, the original result was, in fact, specific to neither spatial reasoning, Mozart, nor music generally: the effect resulted from generic changes in arousal and mood (Thompson, Schellenberg, & Husain, 2001).
Modeling random effects for stimuli and other relevant factors, however, brings with it a substantial challenge: researchers will need far more stimuli and participants, sampled more broadly and deeply, and with far more measures, than is typically practical. Psychologists already struggle to obtain sufficient statistical power for narrowly sampled, fixed-effect designs (Smaldino & McElreath, 2016).
How, then, can we alleviate the generalizability crisis? We think citizen science can help.
Citizen science refers to a collection of research tools and practices united by the alignment of participants' interests with the aims of the project, such that participation is intrinsically motivated (e.g., by curiosity about the topic) rather than extrinsically motivated (e.g., by money or course credit). The result is that studies can cheaply recruit thousands or even millions of diverse participants via the internet. Such studies take many forms, ranging from “gamified” experiments that go viral online, such as our “Tone-deafness test” (current N > 1.2 million; https://themusiclab.org); to collective/collaborative field reporting, such as New Zealand's nationwide pigeon census (the Great Kererū Count; https://www.greatkererucount.nz/).
The potential of citizen science is staggering. For example, the Moral Machine Experiment (Awad et al., 2018) collected 40 million decisions from millions of people (representing 10 languages and over 200 countries) on moral intuitions about self-driving cars. This massive scale enabled the quantification of cross-country variability in moral intuitions, and of how that variability is mediated by cultural and economic factors particular to each country, with profound real-world implications.
Further, when citizen science is coupled with corpus methods, generalizability across stimuli can be effectively maximized. We previously investigated the high-level representations formed during music listening by asking whether naïve listeners can infer the behavioral context of songs produced in unfamiliar foreign societies (Mehr et al., 2018, 2019). Each iteration of a viral “World Music Quiz” played a random draw of songs from the Natural History of Song corpus, a larger stimulus set that representatively samples music from 86 world cultures.
As such, the findings of the experiment – that listeners made accurate inferences about the songs' behavioral contexts – can be appropriately generalized (a) to the populations of songs from which the stimulus subsets were drawn (e.g., lullabies); (b) more weakly, to music writ large (insofar as the sampled song categories are representative of the broader population of songs); and (c) to the population of listeners from whom our participants were drawn (i.e., members of internet-connected societies). All of these factors can be explicitly modeled with random effects.
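As a concrete illustration (a minimal sketch with hypothetical column names, not our actual analysis code), crossed random effects of this kind can be fit in Python with statsmodels by treating listeners and songs as variance components:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per (listener, song) judgement, with a continuous
# rating and the song's behavioral category (e.g., lullaby, dance, healing, love).
df = pd.read_csv("judgements.csv")

# statsmodels fits crossed random effects as variance components defined
# within a single group that spans the whole dataset.
df["all"] = 1
vc = {
    "listener": "0 + C(listener)",  # listeners as draws from a population of listeners
    "song": "0 + C(song)",          # songs as draws from a population of songs
}
model = smf.mixedlm(
    "rating ~ song_category",  # fixed effect of interest
    data=df,
    groups="all",
    vc_formula=vc,
    re_formula="0",            # no additional random intercept for the dummy group
)
print(model.fit().summary())
```

The estimated variance components quantify how much listeners and songs each contribute to the variability that any generalization must survive.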
The same reasoning applies to studying subpopulations of participants (measured in terms of any characteristic) and even subsets of corpora. For example, in a study of acoustic regularities in infant-directed vocalizations across cultures, we model random effects of listener characteristics, of the characteristics of the speakers and singers who produced the stimuli, and of the stimulus categories of interest (e.g., infant-directed vs. adult-directed speech). This is only possible with large datasets (in our case, nearly 1 million listener judgements; Hilton, Moser, et al., 2021). Other under-used analyses also become more practical with big citizen-science data, including radical randomization (Baribault et al., 2018), prediction with cross-validation (Yarkoni & Westfall, 2017), and matching methods for causal inference (Stuart, 2010).
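For instance, a cross-validated prediction analysis in the spirit of Yarkoni and Westfall (2017) might look like the following minimal sketch (our illustration, with hypothetical feature and column names, not a published pipeline). Grouping the folds by listener means the score estimates generalization to new listeners rather than to new trials from the same listeners:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

df = pd.read_csv("judgements.csv")          # hypothetical data file
X = df[["tempo", "pitch_range", "accent"]]  # hypothetical acoustic features
y = df["judged_infant_directed"]            # hypothetical binary judgement

# Hold out whole listeners in each fold so the score reflects out-of-sample
# performance for listeners the model has never seen.
scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X, y,
    groups=df["listener"],
    cv=GroupKFold(n_splits=5),
    scoring="roc_auc",
)
print(scores.mean(), scores.std())
```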
Citizen-science methods are limited, however, by the need to factor in participants' interests and incentives; the need to avoid factors that might dissuade participation (e.g., clunky user interfaces or boring, time-consuming tasks), which can require graphic design and web development talent for “gamification” (e.g., Cooper et al., 2010); the risk of recruiting a biased population subset (i.e., those with internet access; Lourenco & Tasimi, 2020); and the trade-offs between densely sampling stimuli across- versus within-participants, given the typically short duration of citizen-science experiments.
Indeed, while our efforts to recruit children at scale online via citizen science show promising results (Hilton, Crowley de-Thierry, Yan, Martin, & Mehr, 2021), rare or hard-to-study populations may be difficult to recruit en masse (cf. Lookit, a platform for online research with infants; Scott & Schulz, 2017). As Yarkoni notes, alternative approaches such as multisite collaborations (e.g., ManyBabies Consortium, 2020) could be calibrated to maximize generalizability across stimuli rather than to directly replicate results with the same stimuli.
All that being said, addressing these limitations has never been easier, thanks to a growing ecosystem of open-source tools (e.g., de Leeuw, 2015; Hartshorne, de Leeuw, Goodman, Jennings, & O'Donnell, 2019; Peirce et al., 2019); the availability of large-scale, naturalistic corpora from industry partners (e.g., Spotify Research; Way, Garcia-Gathright, & Cramer, 2020); and calls for collaborative, field-wide investment in citizen-science infrastructure (Sheskin et al., 2020).
As such, we think that citizen science can play a useful role as psychologists begin to address the generalizability crisis.
Acknowledgment
We would like to thank Max Krasnow, Mila Bertolo, Stats Atwood, Alex Holcombe, and William Ngiam for feedback on drafts of this commentary.
Financial support
C.B.H. and S.A.M. are supported by NIH DP5OD024566. S.A.M. is supported by the Harvard Data Science Initiative.
Conflict of interest
None.