Although we largely agree with Zwaan et al.'s analysis, we want to add to it, drawing on our experiences with replications as authors and editors. In recent years, the reforms that have succeeded in psychology have been based on concrete suggestions with visible incentives. We suggest three such moves that Zwaan et al. might not have considered.
Anticipate replication in design
In answering concerns about context variability, Zwaan et al. suggest that original authors' reports should be more detailed and acknowledge limitations. But these suggestions miss what actually lets us compare two studies meaningfully across contexts: calibration of methods, independent of the hypothesis test.
Often, suspicions arise that a replication is not measuring or manipulating the same thing as the original. For example, the Reproducibility Project (Open Science Collaboration 2015) was criticized for substituting an Israeli vignette's mention of military service with an activity more common to the replication's U.S. participants (Gilbert et al. 2016). All the methods reporting in the world cannot resolve this kind of debate. Instead, we need to know whether both scenarios successfully affected the independent variable. Whether researchers have the skill to carry out a complex or socially subtle procedure is likewise underspecified in most original and replication research, surfacing only as a doubt when replications fail.
Unfortunately, much original research does not include procedures to check that manipulations affected the independent variable or to validate original measures. Such steps can be costly, especially if concerns about participant awareness require a separate study for the check. Nevertheless, the highest standard of research methodology should include validation that lets us interpret both positive and negative results (Giner-Sorolla 2016; LeBel & Peters 2011). Although the rules of replication should allow replicators to add checks on methods, such checks should also be part of original research. Specifically, under the Registered Report publication format (Chambers et al. 2015), methods are evaluated before data collection, so planning for the interpretation of negative results becomes essential. More generally, publication decisions should openly favor studies that take the effort to validate their methods.
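To make this kind of calibration concrete, the following minimal sketch (in Python, with simulated data and variable names of our own) analyzes a manipulation check separately from the focal hypothesis test; the point is simply that the check yields its own effect estimate, which an original study and a replication can compare across contexts.

```python
# Minimal sketch: did the manipulation move the intended construct?
# Data, scale, and sample sizes here are simulated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical manipulation-check ratings (1-7 scale) in two conditions.
check_control = rng.normal(loc=3.5, scale=1.2, size=80)
check_treatment = rng.normal(loc=4.6, scale=1.2, size=80)

# Welch's t-test on the check item, independent of the main dependent variable.
t, p = stats.ttest_ind(check_treatment, check_control, equal_var=False)

# Standardized effect of the manipulation on the check (Cohen's d):
# a calibration figure that original and replication studies can compare.
pooled_sd = np.sqrt((check_treatment.var(ddof=1) + check_control.var(ddof=1)) / 2)
d = (check_treatment.mean() - check_control.mean()) / pooled_sd

print(f"Manipulation check: t = {t:.2f}, p = {p:.4f}, d = {d:.2f}")
# If the check shows no effect, a null result on the dependent variable is
# better read as inconclusive than as evidence against the hypothesis.
```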
Discuss and balance reasons to replicate
Providing a rationale for studying a particular relationship is pivotal to any scientific enterprise, but there are no clear guidelines for choosing a study to replicate. One criterion might be importance: theoretical weight, societal implications, influence through citations or textbooks, mass appeal. Alternatively, replications may be driven by doubts about the robustness of the effect. Currently, most large-scale replication efforts (e.g., Ebersole et al. 2016a; Klein et al. 2014b; Open Science Collaboration 2015) have chosen their studies either arbitrarily (e.g., by journal dates) or through an unsystematic and opaque process.
Without well-justified reasons and methods for selection, it is easy to imagine doubt motivating any replication. Speculatively, many individual replications seem to be drawn to a profile of surprising results coupled with weak theory and methods. But if replications hunt the weak by choice, conclusions about the robustness of a science will skew negative. This problem is compounded by the psychological reality that findings refuting the status quo (such as failed replications) attract more attention than findings reinforcing it (such as successful replications).
Replicators (like original researchers) should provide strong justification for their choice of topic. When replication is driven by perceptions of faulty theory or implausibly large effects, this should be stated openly. Most importantly, replications should also draw on a priori selection criteria based on positive traits, such as theoretical importance or diffusion in the academic and popular literature. Indeed, we are aware of one attempt to codify some of these traits, but it has not yet been finalized or published (Lakens 2016).
Although non-replication of shaky effects can be valuable, encouragement is also needed to replicate studies that are meaningful to psychological theory and literature. Importance could be one criterion of evaluation for single replication articles. Special issues and large-scale replication projects could be planned around principled selection of important effects to replicate. The Collaborative Replications and Education Project (2018), for example, chooses studies for replication based on a priori citation criteria.
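Purely as an illustration, the sketch below shows what a transparent, a priori selection rule might look like when applied before any data collection; the studies, citation counts, and threshold are hypothetical.

```python
# Minimal sketch: a priori, transparent criteria for choosing studies to replicate.
# The candidate studies, citation counts, and threshold are hypothetical.
candidates = [
    {"study": "Study A (2009)", "citations": 412, "in_textbooks": True},
    {"study": "Study B (2013)", "citations": 35,  "in_textbooks": False},
    {"study": "Study C (2011)", "citations": 178, "in_textbooks": True},
]

MIN_CITATIONS = 100  # fixed in advance, before looking at any results

selected = sorted(
    (c for c in candidates if c["citations"] >= MIN_CITATIONS and c["in_textbooks"]),
    key=lambda c: c["citations"],
    reverse=True,
)

for c in selected:
    print(c["study"], "-", c["citations"], "citations")
```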
Evaluate replication outcomes more accurately
The replication movement also suffers from an underdeveloped process for evaluating the validity of its findings. Currently, replication results are reported and publicized as either a success or a failure. But "failure" really covers two categories: valid non-replications and invalid (i.e., inconclusive) research. In original research, a null result could reflect a true lack of effect or problems with validity (a manipulation or measure not being operationalized precisely and effectively). Validity is best established through pilot testing, manipulation checks, and consideration of context, sample, and experimental design, and best evaluated through peer review. If validity is inadequate, then the results are inconclusive, not negative.
Indeed, most replication attempts try hard to avoid inconclusive statistical outcomes, often allotting themselves greater statistical power than the original study. But there has been far less attention to identifying inconclusive methodological outcomes, such as when a replication's manipulation check fails or a method is changed in a way that casts doubt on the findings. One hindrance is the attitude, sometimes seen, that direct replications need not meet the same standards of external peer review as original research. For example, the methods of the individual replications in Open Science Collaboration (2015) were reviewed, before data collection, only by one or two project members and an original study author.
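As a sketch of how statistical and methodological conclusiveness might be combined in reporting, the Python code below first sizes a replication for high power and then sorts its outcome into replication, non-replication, or inconclusive rather than success or failure. The function names and thresholds are our own, and the power calculation assumes a simple two-group design using statsmodels.

```python
# Minimal sketch: report replication outcomes in three categories, folding in
# methodological validity (manipulation check) and adequacy of power.
# Function names and thresholds are our own, for illustration only.
from statsmodels.stats.power import TTestIndPower

def required_n_per_group(original_d: float, power: float = 0.95,
                         alpha: float = 0.05) -> int:
    """Per-group sample size needed to detect the original effect size."""
    n = TTestIndPower().solve_power(effect_size=original_d, alpha=alpha,
                                    power=power, alternative="two-sided")
    return int(round(n))

def classify_outcome(effect_p: float, check_p: float,
                     adequately_powered: bool, alpha: float = 0.05) -> str:
    """Sort an outcome into replication / non-replication / inconclusive."""
    if check_p >= alpha or not adequately_powered:
        # The manipulation did not demonstrably work, or the test lacked power:
        # the study cannot speak to the original claim either way.
        return "inconclusive"
    return "replication" if effect_p < alpha else "non-replication"

print(required_n_per_group(original_d=0.4))   # roughly 164 per group
print(classify_outcome(effect_p=0.21, check_p=0.001, adequately_powered=True))
# -> "non-replication": the manipulation worked and power was adequate,
#    yet the focal effect did not emerge.
```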
Conclusion and recommendations
Reasons for replicating a particular effect should be made transparent, and positive, systematic selection methods encouraged. Replication reports and original research alike should include evidence of the validity of measures and manipulations, with standards set before data collection. Methods should be externally peer reviewed for validity by experts, with clear consequences (revision, rejection) if they are judged inadequate. And when outcomes of replication are simplified into "box scores," they should be sorted into three categories: replication, non-replication, and inconclusive. By improving the validity of replication reports, we will strengthen our science while offering a more accurate portrayal of its state.