Yarkoni argues that common statistical practices in psychology fail to quantitatively support the generalizations psychologists care about. This is because most analyses ignore important sources of variation and, as a result, unjustifiably generalize from narrowly sampled particulars.
Is this problem tractable? We are optimists, so we leave aside Yarkoni's suggestions to “do something else” or “embrace qualitative research,” and focus instead on his key prescription: the adoption of mixed-effects models that estimate effects at the level of a factor (e.g., stimulus), with each sampled level interpreted as one of a population of potential measurements, thereby licensing generalization over that factor.
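In standard mixed-model notation (ours, not Yarkoni's), the prescription amounts to fitting crossed random intercepts for participants and stimuli, for instance:

```latex
y_{ps} = \beta_0 + \beta_1 x_{ps} + u_p + v_s + \varepsilon_{ps},
\qquad u_p \sim \mathcal{N}(0, \sigma_u^2), \quad
v_s \sim \mathcal{N}(0, \sigma_s^2), \quad
\varepsilon_{ps} \sim \mathcal{N}(0, \sigma_\varepsilon^2)
```

Here, y_ps is participant p's response to stimulus s. Because the stimulus term v_s is treated as a draw from a population of potential stimuli, inference about the fixed effect accounts for stimulus-to-stimulus variability rather than being conditioned on the particular stimuli that happened to be tested.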
Yarkoni is correct that far too few studies do this. In our field, the psychology of music, many studies generalize inaccurately: for example, from a single musical example to all music; from a set of songs from a particular context (e.g., pop songs) to all songs; or from the music perception abilities of a particular subset of humans to all humans.
Consider the “Mozart effect”: a notorious positive effect of listening to a Mozart sonata on spatial reasoning that was over-generalized to “all Mozart” and eventually “all music.” While replicable under narrow conditions, the original result was, in fact, specific to neither spatial reasoning, Mozart, nor music generally: the effect resulted from generic changes in arousal and mood (Thompson, Schellenberg, & Husain, 2001).
Modeling random effects for stimuli and other relevant factors, however, brings with it a substantial challenge: researchers will need far more stimuli and participants, sampled more broadly and deeply, and with far more measures, than is typically practical. Psychologists already struggle to obtain sufficient statistical power for narrowly sampled, fixed-effect designs (Smaldino & McElreath, 2016).
How, then, can we alleviate the generalizability crisis? We think citizen science can help.
Citizen science refers to a collection of research tools and practices united by the alignment of participants' interests with the aims of the project, such that participation is intrinsically motivated (e.g., by curiosity about the topic) rather than extrinsically motivated (e.g., by money or course credit). The result is that studies can cheaply recruit thousands or even millions of diverse participants via the internet. Such studies take many forms, ranging from “gamified” experiments that go viral online, such as our “Tone-deafness test” (current N > 1.2 million; https://themusiclab.org); to collective/collaborative field reporting, such as New Zealand's nationwide pigeon census (the Great Kererū Count; https://www.greatkererucount.nz/).
The potential of citizen science is staggering. For example, the Moral Machine Experiment (Awad et al., 2018) collected 40 million decisions from millions of people (representing 10 languages and over 200 countries) on moral intuitions about self-driving cars. This massive scale enabled the quantification of cross-country variability in moral intuitions, and of how that variability is mediated by cultural and economic factors particular to each country, with profound real-world implications.
Further, when citizen science is coupled with corpus methods, generalizability across stimuli can be effectively maximized. We previously investigated the high-level representations formed during music listening by asking whether naïve listeners can infer the behavioral context of songs produced in unfamiliar foreign societies (Mehr et al., 2018, 2019). Each iteration of a viral “World Music Quiz” played a random draw of songs from the Natural History of Song corpus, a larger stimulus set that representatively samples music from 86 world cultures.
As such, the findings of the experiment – that listeners made accurate inferences about the songs' behavioral contexts – can be appropriately generalized (a) to the populations of songs from which the stimulus subsets were drawn (e.g., lullabies); (b) more weakly, to music writ large (insofar as the sampled song categories are representative of the broader population of songs); and (c) to the population of listeners from whom our participants were drawn (i.e., members of internet-connected societies). All of these factors can be explicitly modeled with random effects.
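As a concrete illustration (a minimal sketch with hypothetical column names, not our actual analysis code), crossed random effects of this kind can be fit in Python with statsmodels by treating listeners and songs as variance components:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per (listener, song) judgement, with a continuous
# rating and the song's behavioral category (e.g., lullaby, dance, healing, love).
df = pd.read_csv("judgements.csv")

# statsmodels fits crossed random effects as variance components defined
# within a single group that spans the whole dataset.
df["all"] = 1
vc = {
    "listener": "0 + C(listener)",  # listeners as draws from a population of listeners
    "song": "0 + C(song)",          # songs as draws from a population of songs
}
model = smf.mixedlm(
    "rating ~ song_category",  # fixed effect of interest
    data=df,
    groups="all",
    vc_formula=vc,
    re_formula="0",            # no additional random intercept for the dummy group
)
print(model.fit().summary())
```

The estimated variance components quantify how much listeners and songs each contribute to the variability that any generalization must survive.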
The same reasoning applies to studying subpopulations of participants (measured in terms of any characteristic) and even subsets of corpora. For example, in a study of acoustic regularities in infant-directed vocalizations across cultures, we model random effects of listener characteristics, of the characteristics of the speakers and singers who produced the stimuli, and of the stimulus categories of interest (e.g., infant-directed vs. adult-directed speech). This is only possible with large datasets (in our case, nearly 1 million listener judgements; Hilton, Moser, et al., 2021). Other under-used analyses also become more practical with big citizen-science data, including radical randomization (Baribault et al., 2018), prediction with cross-validation (Yarkoni & Westfall, 2017), and matching methods for causal inference (Stuart, 2010).
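For instance, a cross-validated prediction analysis in the spirit of Yarkoni and Westfall (2017) might look like the following minimal sketch (our illustration, with hypothetical feature and column names, not a published pipeline). Grouping the folds by listener means the score estimates generalization to new listeners rather than to new trials from the same listeners:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

df = pd.read_csv("judgements.csv")          # hypothetical data file
X = df[["tempo", "pitch_range", "accent"]]  # hypothetical acoustic features
y = df["judged_infant_directed"]            # hypothetical binary judgement

# Hold out whole listeners in each fold so the score reflects out-of-sample
# performance for listeners the model has never seen.
scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X, y,
    groups=df["listener"],
    cv=GroupKFold(n_splits=5),
    scoring="roc_auc",
)
print(scores.mean(), scores.std())
```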
Citizen-science methods are limited, however, by the need to factor in participants' interests and incentives; the need to avoid factors that might dissuade participation (e.g., clunky user interfaces or boring, time-consuming tasks), which can require graphic design and web development talent for “gamification” (e.g., Cooper et al., 2010); the risk of recruiting a biased population subset (i.e., those with internet access; Lourenco & Tasimi, 2020); and the trade-offs between densely sampling stimuli across- versus within-participants, given the typically short duration of citizen-science experiments.
Indeed, while our efforts to recruit children at scale online via citizen science show promising results (Hilton, Crowley de-Thierry, Yan, Martin, & Mehr, 2021), rare or hard-to-study populations may be difficult to recruit en masse (cf. Lookit, a platform for online research with infants; Scott & Schulz, 2017). As Yarkoni notes, alternative approaches such as multisite collaborations (e.g., ManyBabies Consortium, 2020) could be calibrated to maximize generalizability across stimuli rather than to directly replicate results with the same stimuli.
All that being said, addressing these limitations has never been easier, thanks to a growing ecosystem of open-source tools (e.g., de Leeuw, 2015; Hartshorne, de Leeuw, Goodman, Jennings, & O'Donnell, 2019; Peirce et al., 2019); the availability of large-scale, naturalistic corpora from industry partners (e.g., Spotify Research; Way, Garcia-Gathright, & Cramer, 2020); and calls for collaborative, field-wide investment in citizen-science infrastructure (Sheskin et al., 2020).
As such, we think that citizen science can play a useful role as psychologists begin to address the generalizability crisis.
Acknowledgment
We would like to thank Max Krasnow, Mila Bertolo, Stats Atwood, Alex Holcombe, and William Ngiam for feedback on drafts of this commentary.
Financial support
C.B.H. and S.A.M. are supported by NIH DP5OD024566. S.A.M. is supported by the Harvard Data Science Initiative.
Conflict of interest
None.