
Random effects won't solve the problem of generalizability

Published online by Cambridge University Press: 10 February 2022

Adam Bear
Affiliation:
Department of Psychology, Harvard University, Cambridge, MA 02138, USA. adambear@fas.harvard.edu; https://adambear.me
Jonathan Phillips
Affiliation:
Program in Cognitive Science, Dartmouth College, Hanover, NH 03755, USA. jonathan.s.phillips@dartmouth.edu; https://www.dartmouth.edu/~phillab/phillips.html

Abstract

Yarkoni argues that researchers making broad inferences often use impoverished statistical models that fail to include important sources of variation as random effects. We argue, however, that for many common study designs, random effects are inappropriate and insufficient to draw general inferences, as the source of variation is not random, but systematic.

Type
Open Peer Commentary
Copyright
Copyright © The Author(s), 2022. Published by Cambridge University Press

Yarkoni compellingly argues that researchers often neglect important sources of variation in their statistical models. One of the most important sources of variation that often goes unmodeled is the experimental stimuli that researchers select (sect. 3.1). Yarkoni encourages researchers to statistically model stimuli as a random factor in a mixed-effects model. While this suggestion will no doubt improve generalizability for certain types of psychological studies, it is inadequate in many other cases.

Modeling stimuli as a random factor introduces a key assumption about the process by which the stimuli were generated – an assumption that, in many experiments, is almost certainly false. For stimuli to count as “random,” the source of variation must actually be random. That is, the stimuli are assumed to be random draws from a (usually) normal distribution that mimics the true distribution of stimuli to which the researchers want to generalize. Because this sampling distribution is assumed to be centered on the true average effect size, only the variance around this effect size is estimated. As Yarkoni shows, when the model estimates more variance, the true average effect is less certain, and it is more difficult to generalize beyond the particular set of stimuli used in the experiment.
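
To make this assumption concrete, consider the following minimal simulation sketch (in Python; the numerical values and variable names are our own illustrative assumptions, not drawn from any particular study). It treats each stimulus's effect as a random draw from a normal distribution centered on the true average effect, which is exactly what a random-effects specification presupposes:

import numpy as np

rng = np.random.default_rng(1)

true_mean_effect = 0.3   # hypothetical average effect in the full stimulus population
tau = 0.5                # hypothetical SD of the effect across stimuli
n_stimuli = 40           # stimuli actually included in the experiment

# The random-effects assumption: the included stimuli are unbiased random
# draws from the population of stimuli the researchers want to generalize to.
stimulus_effects = rng.normal(true_mean_effect, tau, size=n_stimuli)

# A mixed-effects model estimates the center of these draws (the average
# effect) and the spread around it (the between-stimulus variance); the
# larger that spread, the wider the uncertainty about the average effect.
print("estimated average effect:", stimulus_effects.mean())
print("estimated between-stimulus SD:", stimulus_effects.std(ddof=1))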

In a wide range of cases, this assumption of random sampling is suspect. Consider a study that investigates whether disgusting immoral actions elicit increased moral reprobation. Suppose researchers generate a set of scenarios involving immoral actions, some of which are disgusting and others of which are not; collect moral judgments from a large sample of online participants; and – having read Yarkoni's article – model both subjects and stimuli as random factors in their analysis. If the mixed-effects model yields a highly significant p-value for the disgustingness of the action, is the general conclusion that disgusting moral violations are judged (by WEIRD [western, educated, industrialized, rich, and democratic] people) to be morally worse than non-disgusting ones warranted?

Probably not. The researchers created their stimuli with a hypothesis in mind and were introspectively aware of which stimuli would elicit stronger or weaker moral judgments. As a result, even if the researchers intend to create a fair test, they will almost certainly be disinclined – consciously or unconsciously – to select stimuli that are unlikely to provide support for their hypothesis. In other words, the stimuli that the researchers chose to include in the study were not random draws from a representative population of moral violations, but were biased to favor a particular conclusion.

Indeed, a study by Strickland and Suben (2012) provides a real-world demonstration of how this can happen, albeit in a somewhat exaggerated setting. Groups of undergraduates were assigned the task of creating stimuli to test specific hypotheses from experimental philosophy, but different groups were given contradictory hypotheses. The different groups generated systematically different stimuli, which in turn influenced whether, and to what extent, they observed a statistically significant effect.

There is a further problem with modeling stimuli as a random factor when the stimuli are generated nonrandomly. If researchers are systematically selecting stimuli that tend to favor their hypothesis, the model will tend to underestimate the true amount of variation in the effect size across stimuli. Concretely, imagine that, in the true population of stimuli that the experimenter wants to generalize to, the effect size is normally distributed around 0. Yet the researchers' directional hypothesis motivates them to systematically sample stimuli from the right tail of this distribution. The variance in the effect size of this truncated distribution will be substantially smaller than the variance of the true distribution – barely more than a third of the size. Indeed, even if the experimenters have only a weak bias to avoid sampling stimuli whose effect sizes are more than a standard deviation in the opposite direction of their hypothesis, the variance of the resulting distribution will be less than two-thirds of the true population variance. Thus, when the stimuli are sampled with bias, a random-effects model will almost certainly underestimate how much the effect size varies across stimuli and, in turn, provide overly narrow confidence intervals around an already biased estimate.
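
As a quick numerical check of these figures, the sketch below (assuming, purely for illustration, a standard normal population of stimulus-level effect sizes) compares the variance of the biased samples with the variance of the full population:

import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(0.0, 1.0, size=1_000_000)  # true effect sizes across all stimuli

# Strong bias: only stimuli whose effects fall in the right tail are sampled.
right_tail_only = population[population > 0.0]

# Weak bias: stimuli more than 1 SD in the "wrong" direction are merely avoided.
mild_bias = population[population > -1.0]

print(right_tail_only.var() / population.var())  # roughly 0.36: barely more than a third
print(mild_bias.var() / population.var())        # roughly 0.63: less than two-thirds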

The problems that we lay out here are, in principle, quite difficult to solve, as they cannot be corrected by a simple tweak to a statistical model. Even more troubling is the fact that in many cases, there is no obvious way of determining what even is the “true” population of stimuli that one should generalize to. For example, is the effect of disgust on moral judgment meant to generalize to all possible actions in all possible scenarios? All actual morally relevant actions? Only some particular subset of salient moral violations? There seems to be no easy resolution to this question and, in turn, no easy way to know whether the stimuli represent a “biased” sample from the underlying “true” distribution.

Although this problem may seem intractable – and has even led us to question some of our own work – certain steps can be taken to mitigate it. For example, as Yarkoni suggests, experimenters can try to sample stimuli directly from real-world corpora (e.g., a court database of crimes). However, this is often laborious and impractical and, in fact, may suffer from its own biases (e.g., crimes may not be the category of immoral actions to which the researchers want to generalize). Alternatively, as Strickland and Suben (2012) suggest, researchers may recruit naive assistants or Mechanical Turk workers to generate stimuli without knowledge of the hypothesis. Finally, researchers could fight fire with fire by starting adversarial collaborations in which teams with opposing hypotheses generate their own stimuli. If the adversarial teams find effects of approximately equal magnitude in opposite directions, the original effect was likely due to experimenter bias. If not, the researchers should be more confident that their effect generalizes to a broader stimulus set, even if they cannot precisely define what that set is.

Financial support

This research received no specific grant from any funding agency, commercial or not-for-profit sectors.

Conflict of interest

None.

References

Strickland, B., & Suben, A. (2012). Experimenter philosophy: The problem of experimenter bias in experimental philosophy. Review of Philosophy and Psychology, 3, 457–467.