Zwaan et al. provided a useful summary of key issues regarding the role of replication in psychological research, and their article will serve as a valuable resource for scholars. However, their coverage of some issues failed to address important qualifications to their conclusions. In the interest of brevity, we highlight one such example.
The authors argue that direct replications should be more prominent in the literature, in part, because they have substantial theoretical value (see in particular concern II, sect. 5.2). We agree that direct replications can sometimes make valuable theoretical contributions. However, such contributions become likely only to the degree that replications are held to the same standards of evidence as studies demonstrating novel effects. Unfortunately, the present discussion (along with many others) implicitly adopts a different standard of evidence than is customary for studies of novel effects.
It is useful to consider the nature of psychological hypotheses and the evidence typically required of studies exploring those hypotheses. The hypotheses being tested generally link two or more psychological or behavioral constructs. For example, “frustration leads to aggression” links the psychological experience of frustration to the outcome of aggression. When an original study claims support for such a hypothesis, it is because a measure or manipulation of frustration is empirically associated with a measure of aggression. For any given study, though, reviewers or editors might question the extent to which the chosen manipulation or measure adequately reflects the construct of interest or question the proposed mechanism linking the constructs. Selective journals routinely require the researcher to empirically evaluate the viability of competing explanations. That is, the demonstration of a novel “effect” is considered to be of limited theoretical value if it is open to multiple interpretations, particularly if one or more of those interpretations is uninteresting or falls outside the focal theory (such as demand artifacts, placebo effects, confounds in a manipulation or measure, or an alternative psychological mechanism). As a result, the testing of a novel theory routinely requires a programmatic approach involving multiple studies.
Unfortunately, results of direct replications are frequently open to multiple interpretations, particularly when they fail to produce the original effect, and many potential explanations are uninteresting or differ from the replicator's preferred account. For example, statistical problems (e.g., inadequate power or severe violations of underlying statistical assumptions) or violations of psychometric invariance (e.g., differences between studies in the construct validity of a manipulation or measure) would often be of little substantive interest (e.g., see Fabrigar & Wegener 2016; Stroebe & Strack 2014). Replicators have paid attention to statistical power but have often ignored other alternative accounts of their findings (such as failures of psychometric invariance; Fabrigar & Wegener 2016). Imagine that a researcher attempted to replicate a study originally conducted in the early 1980s that used a clip from Three's Company to induce positive mood. If the replication fails to show the effect because the mood induction is no longer humorous to contemporary participants, this would not constitute a notable theoretical advance (presuming the goal of the original research was to understand mood effects rather than the psychology of Three's Company or 1980s American sitcoms).
Other explanations for failing to replicate might be theoretically interesting, such as differences in participant characteristics or features of the experimental context that change the nature of the relations between the psychological constructs of interest. However, such insights are possible only if the relevant participant differences or contextual influences are identified. Likewise, concluding that the original study was a false positive could be a valuable contribution. However, that statistical explanation is convincing only after alternative explanations have been evaluated and rejected (just as support for a novel theory becomes convincing only after plausible alternative explanations have been evaluated and rejected).
Zwaan et al. did acknowledge that changes in contexts or participants might require changes in study materials even when the goal of the research is "direct replication." That acknowledgment is a step toward the approach we are advocating, one not taken in some direct replication efforts. However, neither the present article nor many others place a strong emphasis on evaluating competing explanations for a replication study's findings. In failing to do so, such articles suggest that replication studies advance theory even when the implications of their findings are highly ambiguous. To the contrary, we suggest that a replication study open to many alternative explanations is no more theoretically valuable than an original study open to many alternative explanations. Replication advocates often seem to view alternatives to false-positive conclusions as "excuses" or "dodges" offered by the original researchers. Excuses or dodges might sometimes be offered, but psychometric invariance of manipulations and measures, contextual moderators, and individual difference moderators are not "dodges." They are standard methodological and theoretical considerations. They can often be specified in advance and evaluated both before and after a replication study has been undertaken, paralleling the considerations routinely applied when evaluating alternative explanations for original research results. Putting aside such considerations in the case of replications only weakens their empirical and theoretical utility.
In practice, researchers undertaking direct replications have rarely attempted a systematic exploration of competing explanations for their findings. For example, the Many Labs initiative conducts tests of previously demonstrated effects but does not follow up these tests with multistudy assessments of plausible explanations (e.g., Ebersole et al. 2016a). Instead, it has been left to the original researchers to explain discrepant findings and provide initial empirical evaluations of the alternatives (e.g., Luttrell et al. 2017; Petty & Cacioppo 2016). In some cases, replication failures have stemmed from violations of psychometric invariance when comparing the replication with the original research (Ebersole et al. 2017; Luttrell et al. 2017).
In summary, we have no objection to replication being an important part of mainstream psychological science, but the benefits of that effort will be most likely to accrue to the extent that replications are evaluated in ways that parallel the evaluation of original research (i.e., holding each to the same standards of evidence).