
Citizen science can help to alleviate the generalizability crisis

Published online by Cambridge University Press:  10 February 2022

Courtney B. Hilton
Affiliation: Department of Psychology, Harvard University, Cambridge, MA 02138, USA. courtneyhilton@g.harvard.edu
Samuel A. Mehr
Affiliation: Department of Psychology, Harvard University, Cambridge, MA 02138, USA; Data Science Initiative, Harvard University, Cambridge, MA 02138, USA; School of Psychology, Victoria University of Wellington, Kelburn Parade, Wellington 6012, New Zealand. sam@wjh.harvard.edu; https://themusiclab.org

Abstract

Improving generalization in psychology will require more expansive data collection to fuel more expansive statistical models, beyond the scale of traditional lab research. We argue that citizen science is uniquely positioned to scale up data collection and that, despite certain limitations, it can help to alleviate the generalizability crisis.

Type
Open Peer Commentary
Copyright
Copyright © The Author(s), 2022. Published by Cambridge University Press

Yarkoni argues that common statistical practices in psychology fail to quantitatively support the generalizations psychologists care about. This is because most analyses ignore important sources of variation and, as a result, unjustifiably generalize from narrowly sampled particulars.

Is this problem tractable? We are optimists, so we leave aside Yarkoni's suggestions to “do something else” or “embrace qualitative research,” and focus instead on his key prescription: the adoption of mixed-effects modeling, in which the levels of a factor (e.g., stimuli) are treated as samples from a population of potential measurements, licensing generalization over that factor.
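To make the prescription concrete, here is a minimal sketch of such a model in Python using statsmodels. The dataset and column names (rating, condition, participant, stimulus) are hypothetical illustrations, not drawn from any particular study:

```python
# Minimal sketch (not the authors' code): a linear mixed-effects model with
# crossed random intercepts for participants and stimuli, using statsmodels.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical trial-level data: one row per response, with columns
# "rating", "condition", "participant", and "stimulus".
df = pd.read_csv("responses.csv")

# statsmodels fits crossed random effects by treating the whole dataset as a
# single group and declaring each random factor as a variance component.
df["group"] = 1
vc = {
    "participant": "0 + C(participant)",  # random intercept per participant
    "stimulus": "0 + C(stimulus)",        # random intercept per stimulus
}
model = smf.mixedlm("rating ~ condition", data=df,
                    groups="group", vc_formula=vc, re_formula="0")
result = model.fit()
print(result.summary())
```

Because participants and stimuli are both modeled as draws from larger populations, the fixed effect of condition is estimated with uncertainty that reflects variation across both factors, which is what licenses generalization beyond the particular people and stimuli sampled.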

Yarkoni is correct that far too few studies do this. In our field of the psychology of music, many inaccurately generalize, for example, from a single musical example to all music; or from a set of songs from a particular context (e.g., pop songs) to all songs; or from the music perception abilities of a particular subset of humans to all humans.

Consider the “Mozart effect”: a notorious positive effect of listening to a Mozart sonata on spatial reasoning that was over-generalized to “all Mozart” and eventually “all music.” While replicable under narrow conditions, the original result was not, in fact, specific to spatial reasoning, to Mozart, or to music generally – the effect resulted from generic changes in arousal and mood (Thompson, Schellenberg, & Husain, 2001).

Modeling random effects for stimuli and other relevant factors, however, brings with it a substantial challenge: researchers will need far more stimuli and participants, sampled more broadly and deeply, and with far more measures, than is typically practical. Psychologists already struggle to obtain sufficient statistical power for narrowly sampled, fixed-effect designs (Smaldino & McElreath, 2016).

How, then, can we alleviate the generalizability crisis? We think citizen science can help.

Citizen science refers to a collection of research tools and practices united by an alignment between participants' interests and the aims of the project, such that participation is intrinsically motivated (e.g., by curiosity about the topic) rather than by extrinsic factors (e.g., money or course credit). The result is studies that cheaply recruit thousands or even millions of diverse participants via the internet. Studies take many forms, ranging from “gamified” experiments that go viral online, such as our “Tone-deafness test” (current N > 1.2 million; https://themusiclab.org), to collective/collaborative field reporting, such as New Zealand's nationwide pigeon census (the Great Kererū Count, https://www.greatkererucount.nz/).

The potential of citizen science is staggering. For example, the Moral Machine experiment (Awad et al., 2018) collected 40 million decisions about moral intuitions concerning self-driving cars from millions of people, representing 10 languages and over 200 countries. This massive scale enabled quantification of cross-country variability in moral intuitions, and of how it is mediated by cultural and economic factors particular to each country, with profound real-world implications.

Further, when citizen science is coupled with corpus methods, generalizability across stimuli can be effectively maximized. We previously investigated high-level representations formed during music listening by asking whether naïve listeners can infer the behavioral context of songs produced in unfamiliar foreign societies (Mehr et al., 2018, 2019). Each iteration of a viral “World Music Quiz” played a random draw of songs from the Natural History of Song corpus, a larger stimulus set that representatively samples music from 86 world cultures.
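As an illustration (not the actual quiz code), here is a sketch of this per-session random draw, assuming a hypothetical corpus table with song identifiers and behavioral-context labels:

```python
# Sketch of per-session stimulus sampling from a corpus (hypothetical file and
# column names; the actual quiz drew from the Natural History of Song corpus).
import random
import pandas as pd

# Hypothetical corpus table: columns "song_id", "culture", "context"
# (e.g., lullaby, dance, healing, love).
corpus = pd.read_csv("song_corpus.csv")

def draw_trial_set(corpus, n_per_context=2, seed=None):
    """Draw a small, context-balanced random subset of songs for one session."""
    rng = random.Random(seed)
    trials = []
    for _, context_group in corpus.groupby("context"):
        ids = context_group["song_id"].tolist()
        trials.extend(rng.sample(ids, k=min(n_per_context, len(ids))))
    rng.shuffle(trials)
    return trials

# Across many sessions, these small random draws densely sample the full corpus.
print(draw_trial_set(corpus, n_per_context=2))
```

Each participant hears only a few songs, but because every session draws a fresh random subset, the full corpus is densely sampled across thousands of sessions, and stimulus can later be treated as a random effect.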

As such, the findings of the experiment – that listeners made accurate inferences about the songs' behavioral contexts – can be accurately generalized (a) to the populations of songs from which the stimulus subsets were drawn (e.g., lullabies); (b) more weakly, to music writ large (insofar as the sampled song categories are representative of other categories); and (c) to the population of listeners from whom our participants were drawn (i.e., members of internet-connected societies). All of these factors can be explicitly modeled with random effects.

The same reasoning applies to studying subpopulations of participants (measured in terms of any characteristic) and even subsets of corpora. For example, in a study of acoustic regularities in infant-directed vocalizations across cultures, we model random effects of listener characteristics, speaker/singer (i.e., the producers of the stimuli) characteristics, and stimulus categories of interest (e.g., infant-directed vs. adult-directed speech). This is only possible with large datasets (in our case, nearly 1 million listener judgements; Hilton, Moser, et al., 2021). Other under-used analyses also become more practical with big citizen-science data, including radical randomization (Baribault et al., 2018), prediction with cross-validation (Yarkoni & Westfall, 2017), and matching methods for causal inference (Stuart, 2010).
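For instance, a minimal sketch of cross-validated prediction on such a dataset (hypothetical feature and label arrays; not the analysis reported in Hilton, Moser, et al., 2021) might group folds by speaker, so the score reflects generalization to new speakers:

```python
# Sketch of cross-validated prediction on a large citizen-science dataset
# (hypothetical arrays and filenames, for illustration only).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_val_score

X = np.load("acoustic_features.npy")        # one row of acoustic features per vocalization
y = np.load("mean_listener_ratings.npy")    # e.g., mean listener judgement per vocalization
groups = np.load("speaker_ids.npy")         # keep all clips from one speaker in the same fold

# Grouping folds by speaker prevents leakage, so the cross-validated score
# estimates generalization to new speakers rather than to new clips from
# already-seen speakers.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(Ridge(alpha=1.0), X, y, groups=groups, cv=cv, scoring="r2")
print(f"R^2: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The grouping variable can be swapped for any factor over which one wants to claim generalization (speakers, cultures, stimulus categories), which is precisely the kind of claim large citizen-science samples make testable.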

Citizen-science methods are limited, however, by the need to factor in participants' interests and incentives; the need to avoid factors that might dissuade participation (e.g., clunky user interfaces; boring, time-consuming tasks), which can require graphic design and web development talent for “gamification” (e.g., Cooper et al., 2010); the risks of recruiting a biased population subset (i.e., those with internet access; Lourenco & Tasimi, 2020); and the trade-offs between densely sampling stimuli across- versus within-participants, given the typically short duration of citizen-science experiments.

Indeed, while our efforts to recruit children at scale online via citizen science show promising results (Hilton, Crowley de-Thierry, Yan, Martin, & Mehr, 2021), rare or hard-to-study populations may be difficult to recruit en masse (cf. Lookit, a platform for online research with infants; Scott & Schulz, 2017). As Yarkoni notes, alternative approaches like multisite collaborations (e.g., ManyBabies Consortium, 2020) could be calibrated to maximize generalizability across stimuli rather than directly replicating results with the same stimuli.

All that being said, addressing these limitations has never been easier, thanks to a growing ecosystem of open-source tools (e.g., de Leeuw, 2015; Hartshorne, de Leeuw, Goodman, Jennings, & O'Donnell, 2019; Peirce et al., 2019); the availability of large-scale, naturalistic corpora from industry partners (e.g., Spotify Research; Way, Garcia-Gathright, & Cramer, 2020); and calls for collaborative, field-wide investment in citizen-science infrastructure (Sheskin et al., 2020).

As such, we think that citizen science can play a useful role as psychologists begin to address the generalizability crisis.

Acknowledgment

We would like to thank Max Krasnow, Mila Bertolo, Stats Atwood, Alex Holcombe, and William Ngiam for feedback on drafts of this commentary.

Financial support

C.B.H. and S.A.M. are supported by NIH DP5OD024566. S.A.M. is supported by the Harvard Data Science Initiative.

Conflict of interest

None.

References

Awad, E., Dsouza, S., Kim, R., Schulz, J., Henrich, J., Shariff, A., … Rahwan, I. (2018). The Moral Machine experiment. Nature, 563(7729), 59–64.
Baribault, B., Donkin, C., Little, D. R., Trueblood, J. S., Oravecz, Z., van Ravenzwaaij, D., … Vandekerckhove, J. (2018). Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences, 115(11), 2607–2612.
Cooper, S., Khatib, F., Treuille, A., Barbero, J., Lee, J., Beenen, M., … Players, F. (2010). Predicting protein structures with a multiplayer online game. Nature, 466(7307), 756–760.
de Leeuw, J. R. (2015). jsPsych: A JavaScript library for creating behavioral experiments in a Web browser. Behavior Research Methods, 47(1), 1–12.
Hartshorne, J. K., de Leeuw, J., Goodman, N., Jennings, M., & O'Donnell, T. J. (2019). A thousand studies for the price of one: Accelerating psychological science with Pushkin. Behavior Research Methods, 51(4), 1–22.
Hilton, C., Crowley de-Thierry, L., Yan, R., Martin, A., & Mehr, S. (2021). Children infer the behavioral contexts of unfamiliar songs. PsyArXiv. doi: 10.31234/osf.io/rz6qn
Hilton, C. B., Moser, C. J., Bertolo, M., Lee-Rubin, H., Amir, D., Bainbridge, C. M., … Mehr, S. A. (2021). Acoustic regularities in infant-directed vocalizations across cultures. bioRxiv. doi: 10.1101/2020.04.09.032995
Lourenco, S. F., & Tasimi, A. (2020). No participant left behind: Conducting science during COVID-19. Trends in Cognitive Sciences, 24(8), 583–584.
ManyBabies Consortium. (2020). Quantifying sources of variability in infancy research using the infant-directed-speech preference. Advances in Methods and Practices in Psychological Science, 3, 24–52.
Mehr, S. A., Singh, M., Knox, D., Ketter, D., Pickens-Jones, D., Atwood, S., … Glowacki, L. (2019). Universality and diversity in human song. Science, 366(6468), eaax0868.
Mehr, S. A., Singh, M., York, H., Glowacki, L., & Krasnow, M. M. (2018). Form and function in human song. Current Biology, 28(3), 356–368.e5.
Peirce, J., Gray, J. R., Simpson, S., MacAskill, M., Höchenberger, R., Sogo, H., … Lindeløv, J. K. (2019). PsychoPy2: Experiments in behavior made easy. Behavior Research Methods, 51(1), 195–203.
Scott, K., & Schulz, L. (2017). Lookit (Part 1): A new online platform for developmental research. Open Mind, 1(1), 4–14.
Sheskin, M., Scott, K., Mills, C. M., Bergelson, E., Bonawitz, E., Spelke, E. S., … Schulz, L. (2020). Online developmental science to foster innovation, access, and impact. Trends in Cognitive Sciences, 24(9), 675–678.
Smaldino, P. E., & McElreath, R. (2016). The natural selection of bad science. Royal Society Open Science, 3(9), 160384.
Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1), 1–21.
Thompson, W. F., Schellenberg, E. G., & Husain, G. (2001). Arousal, mood, and the Mozart effect. Psychological Science, 12(3), 248–251.
Way, S. F., Garcia-Gathright, J., & Cramer, H. (2020). Local trends in global music streaming. Proceedings of the Fourteenth International AAAI Conference on Web and Social Media, 10.
Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science, 12(6), 1100–1122.