
Improving social and behavioral science by making replication mainstream: A response to commentaries

Published online by Cambridge University Press:  27 July 2018

Rolf A. Zwaan
Affiliation:
Department of Psychology, Education, and Child Sciences, Erasmus University, 3000 DR Rotterdam, The Netherlands. zwaan@essb.eur.nl https://www.eur.nl/essb/people/rolf-zwaan
Alexander Etz
Affiliation:
Department of Cognitive Sciences, University of California, Irvine, CA 92697-5100. etz.alexander@gmail.com https://alexanderetz.com/
Richard E. Lucas
Affiliation:
Department of Psychology, Michigan State University, East Lansing, MI 48824. lucasri@msu.edu https://www.msu.edu/user/lucasri/
M. Brent Donnellan
Affiliation:
Department of Psychology, Michigan State University, East Lansing, MI 48824. donnel59@msu.edu https://psychology.msu.edu/people/faculty/donnel59

Abstract

The commentaries on our target article are insightful and constructive. There were some critical notes, but many commentaries agreed with, or even amplified, our message. The first section of our response addresses comments pertaining to specific parts of the target article. The second section responds to the commentaries' suggestions for making replication mainstream. The final section contains concluding remarks.

Type
Authors' Response
Copyright
Copyright © Cambridge University Press 2018 

Replication facilitates scientific progress but has never occupied a central role in the social and behavioral sciences. The goal of our target article was to change this situation. Science is not a collection of static empirical findings that have passed some threshold for statistical significance. Rather, it should rest on a set of procedures that reliably produce specific results to help advance theories. We presented direct replication as just one of many ways that can improve research in the social and behavioral sciences, along with, for instance, preregistration and greater transparency.

We are encouraged by the many thoughtful and constructive commentaries on our target article. Taken as a whole, we believe the commentaries affirm our position. Several commentaries amplify and enhance what we said in the target article. Other commentaries bring up new topics that researchers should consider as they move forward with their own replication attempts. Still others sound a more critical note about the value of direct replication, at least as currently practiced. In the first section of this response we discuss comments that pertain to specific sections of the target article or that raise novel points that we had not previously addressed. In the second section, we highlight and respond to additional issues raised by the commentators. In the final section we provide an integrative overview of the target article, the commentaries, and our response to them.

R1. Response to specific comments

In the target article we presented an overview of what direct replication studies are, how they relate to other forms of inquiry, and what terminology has been used to describe these various investigations. We also presented a series of six concerns that have been raised about the value of replication studies and their implementation, along with our response to each of these concerns. In this section we follow the structure of the target article to discuss how the commentaries further shape our thinking about these specific concerns.

R1.1. Concern I: Context is too variable

One response to replication attempts (especially attempts that fail to achieve the same result as the original) is to posit that some unspecified contextual factor, a hidden moderator, affected the results of the replication study such that the original result was not replicated. This claim is then used to argue that differences in results are difficult to interpret. Taken to the extreme, this line of reasoning can be used by critics to question the entire enterprise of direct replication. Indeed, the hidden moderator argument is sometimes used to suggest that, for entire areas of research, contextual factors are so influential and so difficult to predict that replication studies should not be expected to arrive at the same results as the original study. In the target article, we explained that the extreme form of this argument is antithetical to mainstream beliefs about how scientific knowledge is supposed to accumulate and how it is to be applied. Essentially, it means that entire experimental lines of research could be rendered immune from independent verification. However, as the commentaries note, some nuance is required when weighing contextual factors in relation to specific replication attempts.

One of the most consistent, and most important, themes emerging from the commentaries regarding this issue is that although concerns about context sensitivity are often presented as an issue for replicators to consider, they can also be addressed by researchers conducting original studies. For instance, de Ruiter rightly points out that a replication is a test of a scientific claim. If the scope of this claim has not been specified, then the scientific community should take that claim to mean that the original authors implicitly generalized across the unmentioned details. Debates about context sensitivity as an explanation of failed replication studies highlight that original researchers have an important responsibility for specifying the conditions that are essential for the predictions from their theory to hold. Howe & Perfors likewise propose that authors specify the extent to which they expect their findings to replicate. Simons, Shoda, & Lindsay (Simons et al.) take this idea even further and propose that specifying constraints on generality should be an essential component of original articles. Such statements can eliminate ambiguity in advance, thereby leading to a more cumulative science. We wholeheartedly endorse the proposal of specifying constraints on generality and return in more detail to this issue in our discussion of concern IV.

One virtue of statements specifying constraints on generality is that they provide authors with greater incentive (and explicit guidance) to think carefully about contextual influences in the initial stages of an original study rather than when interpreting failed replication attempts. Likewise, when replications are routine, there are more incentives to thoroughly document study procedures and identify the kinds of expertise needed to conduct studies. Authors may also wish to increase the rigor of their studies by adopting preregistration and within-lab direct replications. These practices will increase their own confidence in the evidentiary value of their original work. Collectively, these practices will help move research forward. They are among the reasons motivating our target article. This spirit of optimism is evident in the commentary by Spellman & Kahneman, who note that replication efforts have been useful in improving research practices and evidentiary standards.

Some of the commentators appear to be unconvinced that it will ever be possible to specify conditions that allow for direct replications to occur. Petty, for instance, notes that the use of the same operations does not guarantee that a study counts as a replication, precisely because the same operations may mean different things in different contexts. This sentiment is echoed by Wegener and Fabrigar. We agree with these authors that (1) the appropriateness of specific procedures in new samples or new settings must certainly be evaluated when conducting replications and (2) evaluating these issues is necessary when interpreting results from replication studies. Fortunately, as noted by those commentators, the conceptual and statistical tools needed to conduct such analyses are available. The need to make sure that procedures of a direct replication study have validity does not invalidate the importance of direct replication.

Furthermore, it is telling that the two examples of problematic reliance on original operations that Petty and Wegener & Fabrigar highlight are (1) not based on actual failed replication attempts (as far as we can tell) and (2) represent extreme and arguably implausible examples. For instance, Petty asks what would happen if stimuli from the 1950s were used today, and Wegener and Fabrigar ask readers to imagine a film clip from a 1980s sitcom being used to elicit humor almost 40 years later. Arguments about the importance of these contextual factors are more compelling when accompanied by evidence that a failed replication resulted from inattention to contextual factors that changed the meaning of the operations used in the study. These kinds of general concerns are occasionally raised when a replication study fails. However, these arguments are more convincing when researchers propose tests of those ideas in future work.

The extent to which context matters may indeed depend on many factors, including the domain in which the research is being conducted. Gantman, Gomila, Martinez, Matias, Levy Paluck, Starck, Wu, & Yaffe suggest that the problem of context sensitivity is especially thorny in field research. If true, then specifying constraints on generality is especially important in this type of research. Importantly, this concern with context highlights an additional benefit of a greater emphasis on replication.

Gelman proposes to abandon the notion of direct replication and move to a meta-analytic approach, given that direct replications, in his view, are impossible in psychology. This might be too extreme a perspective. We already stated in the target article why it is important to retain the notion of direct replication, along the lines proposed by Nosek and Errington (2017). Moreover, specifying constraints on generality will make it easier to conduct direct replications. We do, however, agree with Gelman's observation that meta-analytic approaches are important for advancing research and theory and see our proposal as fully congruent with this idea.

R1.2. Concern II: Direct replications have limited theoretical value

Several commentators (Alexander & Moors; Little & Smith; Carsel, Demos, & Motyl [Carsel et al.]; Witte & Zenker) argue that the focus on replication ignores a more serious underlying problem, namely, the poor status of theorizing in the field. Carsel et al., for instance, suggest that stronger theories will make replications more feasible and more informative because such theories generate more testable hypotheses. They note that statistical hypotheses are never really true in the strictest sense and that for such hypotheses to be of any use we must ensure that they map as closely as possible to our substantive (qualitative) hypotheses; we are reminded of Box's (1979) famous quote: “All models are wrong, but some are useful.” We agree with these commentators' general sentiment and reiterate our belief that a theory can be tested only when it is clearly and unambiguously specified (see also Etz et al. in press). The greater the specificity of a prediction, the more informative the research. This idea applies to both original and replication studies.

Direct replication has an important role to play with respect to the development of stronger theories. It is key, as the epigraph to the target article indicates, to have a procedure that reliably produces an effect and not just a single experiment with p < .05. Direct replication is the way to determine whether the procedure is reliable. The next step is then to examine, in a theory-driven systematic way, the limits of that effect in increasingly more stringent tests. This, as Gernsbacher notes, has already been the common practice in some areas of psychology for years (see also Bonett 2012). The proposal to have researchers state constraints on generality of their findings (Simons et al.) is a step in this direction. Having to state constraints on generality will force researchers to make their theories more explicit by distinguishing between factors that are thought to be essential for the effect to occur and those that are incidental.

An interesting and novel refinement of replication research that builds directly on the notion of constraints on generality is the meta-study (Baribault et al. 2018). Researchers distinguish between factors that are deemed essential for the effect (for instance, whether the meaning of a color word matches the color in which it is presented in a Stroop experiment) and factors that might be moderators of the effect (for instance, the use of words that are not color words but are strongly associated with a color, such as blood and grass; the font in which the words are presented; the number of letters that are colored; or the geographical location of the lab in which the experiment is carried out). These latter factors are randomized in a series of micro-experiments, each of them being a potential moderator of the effect. In other words, a meta-study is an attempt to sample from the set of possible experiments on a given topic. A series of meta-studies allows for a stronger test of a theory than a single experiment (original or replication) in that it assesses the robustness of an effect across various subtly different incarnations of the experiment (see also the commentary by Witte & Zenker). Accordingly, a meta-study can provide empirical support for a statement of constraints on generality and allows for further theoretical specification. Moreover, when a moderator appears, being able to account for it enhances the explanatory power of the theory, thus resulting in a progressive research program (Lakatos 1970). We agree with Alexander & Moors and Little & Smith that new avenues for more ambitious testing of theories should be explored; meta-studies are one such approach.
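
To make the sampling logic of a meta-study concrete, here is a minimal sketch of how incidental factors might be randomized across micro-experiments. The factor names and levels are our own illustrative choices, not the design used by Baribault et al. (2018).

```python
import random

# Hypothetical incidental factors for a Stroop-style meta-study; the factor
# names and levels below are illustrative assumptions, not an actual design.
incidental_factors = {
    "word_set": ["color_words", "color_associates"],   # e.g., "red" vs. "blood"
    "font": ["Arial", "Courier"],
    "colored_letters": ["all", "first_only"],
    "lab_location": ["site_A", "site_B", "site_C"],
}

def sample_micro_experiment(rng: random.Random) -> dict:
    """Draw one randomly assembled micro-experiment from the design space."""
    return {factor: rng.choice(levels) for factor, levels in incidental_factors.items()}

rng = random.Random(2018)
micro_experiments = [sample_micro_experiment(rng) for _ in range(200)]
# Each micro-experiment would then be run with a small sample; the resulting
# effect sizes can be examined for robustness across the sampled design space.
print(micro_experiments[0])
```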

R1.3. Concern III: Direct replications are not feasible in some domains

We addressed this concern in the target article and several commentators expanded on that theme. Kuehberger & Schulte-Mecklenbeck point out that the fact that direct replications are more feasible in some domains than in others creates a selection bias: Studies that are easy to reproduce are more likely to become the target of replication efforts than studies that are difficult or expensive to reproduce. Similarly, Giner-Sorolla, Amodio, & van Kleef (Giner-Sorolla et al.) point out that if the reasons for selecting specific studies are not made explicit, then studies that are most frequently targeted for replication attempts may be those for which there is the most doubt. Whether or not this is a problem, however, depends on the goals of the replication attempt.

It is important to distinguish between two distinct perspectives related to this concern. The first is the perspective of the meta-scientist, who may want to estimate the replicability of a paradigm, domain, research area, or even an entire field. We agree that to accomplish such a goal, having a sound sampling plan is critical. If only the weakest studies or those that are easiest to reproduce are selected for replication, then surely estimates regarding the strength of an entire field or domain of study would not be accurate. We also agree that this sampling issue has not been given sufficient attention in the literature; the suggestions in these commentaries provide an important step in this direction.

The second perspective is that of the researcher in the field who is interested in the robustness of a specific research finding. For this researcher, there are myriad potential reasons why a specific study is selected, and this is perfectly acceptable. Indeed, even explicit doubt about the veracity of the original finding can be a valid reason for conducting a replication study. Blanket suggestions about which studies can or should be chosen for replication limit the freedom of researchers to follow theoretical and empirical leads that they believe will be most interesting and fruitful. These suggestions place constraints on replicators that are not placed on original researchers. A single investigator may be interested in the replicability and robustness of a single minor finding; and just as the original investigator was free to produce that minor finding, someone wishing to replicate that result should be free to do so without others raising concerns about how the study was selected.

In short, we are unsympathetic to suggestions that replications need to be regulated more tightly than other kinds of research. Movements in this direction are antithetical to making replication mainstream. Nevertheless, a number of tools are available to help replicators approach their task in a more rigorous fashion. Many of these tools are useful to scientists who simply want to evaluate the existing literature without conducting a direct replication. For instance, we endorse the suggestion by Nuijten, Bakker, Maassen, & Wicherts (Nuijten et al.) that those who wish to replicate a specific finding first check whether the results of the original studies can actually be reproduced from the original data. Our hope is that these issues will become less relevant when replication is more common and original studies are pushed by the field to have more evidentiary value.

We acknowledge (as we did in the target article) that there are some domains and some types of studies for which widespread replication will be difficult. In those domains, however, it will be especially important to incorporate additional safeguards that accomplish some of the same goals that direct replication is designed to accomplish. Many of the commentaries provided novel suggestions that may help in this respect. MacCoun, for instance, echoed the idea that direct replications are not always affordable or feasible and, for some phenomena, may even be impossible. In such situations, he argues, methods of blinded data analysis can help minimize p-hacking and confirmation bias, increasing our confidence in a study's results. We agree and note that, in fact, Spellman & Kahneman express the view that such strengthening of original studies is already happening.

R1.4. Concern IV: Replications are a distraction

Researchers who have raised reservations about direct replications often question their theoretical value and practical feasibility. A specific incarnation of this view is that direct replications are largely wasted efforts given the limited resources available to researchers in terms of time and energy. In response to this perception, several commentators provided suggestions for ways that replicators could increase the value of replication efforts. However, some of the more critical commentaries on our paper place what we see as puzzling demands on replicators.

Notably, Strack & Stroebe suggest that the onus of explaining why an effect was not replicated should be shouldered by the researchers performing the replication. In the target article, we called this an attempt to “irrationally privilege the chronological order of studies over the objective characteristics of those studies when evaluating claims about quality and scientific rigor” (sect. 5.1.1, para. 3). In their commentaries, Gelman and Ioannidis refer to this privileging of the original result as a “fallacy” and an “anomaly,” respectively. The requirement for replicators to explain why they did not duplicate the effect poses several practical problems. Foremost, original findings can be flukes. In such cases, it is difficult to provide any sort of explanation for the failed attempt beyond noting that it is possible that a random process generated a p<.05 result in a single original study. Indeed, neither replicator nor original author can be certain when random processes are responsible for findings. Less extreme but no less thorny situations occur when replicators (and likely original authors themselves) are unaware of the myriad contextual factors that might have coalesced to produce an original effect. Thus, we are not in favor of placing so much onus on replicators relative to original authors. We believe that replicators (just like original authors) should simply provide their interpretations of results and findings in their own papers in the way they believe is faithful to the data and the literature. The research community can then decide whether particular interpretations are reasonable and empirically supported. Furthermore, pushing replicators to come up with strong statements explaining why they failed to replicate a result may increase concerns about the reputational consequences of replications. Sometimes the best response when reporting a failed replication is simply to get the finding into the literature, to provide a constraints on generality statement, and to issue necessary caveats about the need for additional research.

Indeed, we prefer to adopt a multipronged approach to evaluating replications, which would ideally culminate with multiple replications of specific findings that are combined in a meta-analysis. This seems to contrast with the scenario outlined by Strack & Stroebe, who describe what appears to be a case where there is one successful study and one failed direct replication. Without knowing more about the relative merits of the two studies in question, it is impossible to provide sound advice about how replicators should interpret the results and, thus, what they should or should not do in a discussion section. For instance, when the original study employed a between-subjects design with a sample size of 40 participants and the replication was a seemingly faithful recreation of the design but with a sample size of 400 participants, the weight of the evidence might lean in favor of the results of the replication. If both studies had modest sample sizes, the interpretation might need to be quite constrained. If, in contrast, the original had a sample size of 400 and the replication had a sample size of 40, there might be a strong need for the replication authors to compare effect size estimates and contemplate the power of their replication study before drawing strong conclusions! We think it is unwise to tie the hands of replicators by placing blanket requirements about how they interpret their results.
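
As a rough illustration of why these sample sizes matter, the following sketch compares the precision of a standardized effect size estimate under the two hypothetical designs just described, using a standard large-sample approximation; the assumed effect of d = 0.5 is ours, chosen purely for illustration.

```python
import math

def se_cohens_d(n1: int, n2: int, d: float) -> float:
    """Approximate standard error of Cohen's d for a two-group design
    (standard large-sample approximation)."""
    return math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))

# Hypothetical numbers: an original effect of d = 0.5 estimated with 20
# participants per group versus a replication with 200 per group.
for label, n_per_group in [("original (N = 40)", 20), ("replication (N = 400)", 200)]:
    se = se_cohens_d(n_per_group, n_per_group, 0.5)
    print(f"{label}: SE(d) ~ {se:.2f}, approx. 95% CI half-width ~ {1.96 * se:.2f}")
```

The roughly threefold difference in precision is why, other things being equal, the larger study carries more evidential weight in such a comparison.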

What we would consider fluke findings are sometimes part of the existing literature. Consider the results of several Registered Replication Reports, which show that a large effect size from an initial small-sample, between-subjects design can be reduced to near zero in large-scale, multi-lab replication attempts (e.g., Eerland et al. 2016; Hagger et al. 2016; Wagenmakers et al. 2016a). It is also possible to test whether there is heterogeneity in effect size estimates to try to find evidence in support of the existence of moderators. In cases where the effect size estimate is indistinguishable from zero and there are few indications of heterogeneity, the simplest explanation is that the original finding was a false positive or a grave overestimate (Gelman's type M error). We would furthermore like to underscore the relevance of Gelman's time-reversal heuristic here. Suppose the registered replication report had been conducted first and then the original study came second. What weight would we then assign to the original study? Very little, we surmise. Indeed, this is a subtext of the commentary by Ioannidis.
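
For readers who want to see what such a heterogeneity check involves, here is a minimal sketch that computes a fixed-effect pooled estimate, Cochran's Q, and I² from made-up lab-level effect sizes; it is not the analysis reported in any of the Registered Replication Reports cited above.

```python
import numpy as np
from scipy import stats

def heterogeneity(effects: np.ndarray, ses: np.ndarray):
    """Cochran's Q and I^2 for a set of study-level effect sizes."""
    w = 1.0 / ses ** 2                        # inverse-variance weights
    pooled = np.sum(w * effects) / np.sum(w)  # fixed-effect pooled estimate
    q = np.sum(w * (effects - pooled) ** 2)
    df = len(effects) - 1
    p = stats.chi2.sf(q, df)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, q, p, i2

# Hypothetical lab-level effect sizes from a multi-lab replication; the
# numbers are made up purely for illustration.
effects = np.array([0.05, -0.02, 0.10, 0.00, -0.06, 0.03])
ses = np.array([0.08, 0.07, 0.09, 0.08, 0.07, 0.08])
pooled, q, p, i2 = heterogeneity(effects, ses)
print(f"pooled d = {pooled:.3f}, Q = {q:.2f}, p = {p:.2f}, I^2 = {i2:.0f}%")
```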

Strack & Stroebe are right to note that theories are formulated on a level that transcends the concrete evidence and that their validity does not rest on the outcome of one specific experimental paradigm. This mirrors our view (as we already outlined above); we hope nothing in our target article suggests otherwise. Direct replications provide a specific kind of information about the ability of a set of procedures to reliably produce the same results upon repetition. The process of evaluating the evidence for or against theoretical propositions involves a complex judgment involving multiple strands of evidence. Further, we also believe that null results are useful for the evaluation of a theory. That is, when explanations can be formulated for the absence of an effect and empirical support for them can be obtained, the theory is actually strengthened. This is akin to the process for evaluating the discriminant validity of measures in psychometric work: the theory specifies that there should be no relation between two constructs, and evidence is then gathered to test that prediction.

Other commentators pushed the field to consider additional important elements besides direct replication. In many ways, we have no issues with these lines of thought. It was not our intent to say that direct replication is the one and only thing that will improve psychological science. Heit & Rotello seem to agree with us that direct replication is valuable, but argue that it should not be elevated above other worthwhile research practices, including conceptual replication and checking statistical assumptions, thereby shifting attention away from them. They also point out that replicating studies without checking the statistical assumptions can lead to increased confidence in incorrect conclusions, and that successful replications should not be elevated above failed replications, given that both are informative. Indeed, our goal was simply to emphasize the benefits of, and call for increased attention to, what is an underused practice: direct replications. It is perhaps not surprising that we also agree that direct replications should be performed in a sensible manner. Indeed, replicating studies without checking statistical assumptions is unwise. We also obviously agree that successful replications should not be elevated over failed replications (or vice versa).

Witte & Zenker raise the interesting issue of when replication is a necessary and worthwhile endeavor. They argue that replication efforts should be limited to specified theoretical effects. A successful replication would then show that an original study corroborated a theory. This is a reasonable point, but we return to our earlier concern: we do not want to limit the freedom of replicators to study the effects of their own choosing. We also agree with their observation that several conceptually replicated experiments together may (ever more firmly) jointly corroborate, or falsify, a theoretical prediction.

R1.5. Concern V: Replications affect reputations

The fifth concern addressed in the target article focused not on the accumulation of scientific knowledge per se, but on extra-scientific concerns about the people involved. Specifically, many have worried publicly about the reputational impact of replication studies, both for those whose work is targeted and for those who conduct the replications themselves. Pennycook agrees that scientists should separate their identities from the data they produce. More importantly, he points out that although people often fear that their work will fail to replicate, reputational consequences are often based more on whether the original authors approach such results with an open, scientific mindset than on whether the replication attempt affirmed or contradicted the original work. Pennycook rightly remarks that the same is true for the reputation of the replicators, and we agree that dispassionate, descriptive approaches to reporting replication results will make this research less fraught. We very much appreciate Pennycook's suggestions regarding ways that the social and behavioral sciences can move toward this goal.

Along a similar line of thinking, Tullett & Vazire provide an idealistic new metaphor for scientific progress that challenges the idea that replications contribute to the literature only when they “tear down” bricks in an existing wall of scientific knowledge. They prefer an alternative metaphor in which the different participants in any scientific endeavor are thought of as jointly solving a jigsaw puzzle. This metaphor captures the idea that the goal of scientific research is to uncover some underlying phenomenon and that both novel and replication studies provide critical information for achieving that goal. A welcome feature of the puzzle metaphor is that it puts replicators and original researchers on equal footing as two kinds of agents trying to solve a common problem. The brick analogy sets up a somewhat adversarial or regulatory dynamic in which original researchers are builders and replicators are those who further test the bricks for soundness. Reputational concerns are likely to be less relevant in the former conceptualization. It is also important to underscore that original researcher and replicator are roles in the scientific enterprise that will normally be played by the same individual at different times and in different contexts; this is precisely what we mean by making replication mainstream.

Ultimately, the conceptual separation of data and researchers will be an important part of making replication mainstream. In fact, we can expect this process to be reciprocal. To the extent that replication becomes more mainstream, reputation will become more separated from data, which will make replication more mainstream, and so on. This cycle would seem to pay dividends for scientists and science as a whole.

One commentary brought up a reputational concern that we did not consider in the target article, but that is nonetheless interesting and important: replications might have reputational consequences for science itself. Scientific results must, of course, ultimately be conveyed to the public. This creates special challenges, as popular coverage of science sometimes invites and perhaps even demands more certainty and clarity than is warranted by the existing evidence. Białek is concerned that false negatives may inspire lower confidence in science, which would undercut the effort to make replication mainstream. He does not advocate that scientists should stop replicating studies simply because they will look bad in the eyes of the public. Rather, he argues for a concerted effort to communicate the acceptability of uncertainty associated with scientific findings to the public (and to our peers too). We agree with this view but hasten to add that we suspect that a great deal of the responsibility lies with the original researchers. A quote from de Ruiter's commentary nicely expresses this sentiment: “Finding general effects in psychology is very difficult, and it would be a good first step to address our replication crisis if we stopped pretending it is not” (last para.). There are many examples in which researchers broadcast their findings (often based on a single experiment with p barely below .05), trumpeting their relevance to a variety of domains, only to resort to complaints about context being too variable after a failed replication. As we noted earlier, researchers should be realistic about the generalizability of their findings.

One additional virtue of making replication mainstream is that science will ideally produce more findings relevant to the kinds of claims covered in the popular media. Journalists may not have to wait too long to see if seemingly newsworthy findings are credible by virtue of having a track record of replicability. We see this increased knowledge base as an important practical consequence of the ideas we advocated in our target article.

A flipside to our argument is that research findings conveyed to the public but not backed by a solid evidentiary base risk generating grave reputational consequences. For instance, members of the public may become disillusioned when they implement the findings described in the popular media to change their behaviors and find that their efforts prove unsuccessful. This could prove catastrophic as these are the people who support science by funding governmental investments in research and the existence of many universities. Thus, we believe that researchers have strong incentives to make sure the public is provided with scientific claims that have a strong evidentiary base. As we have argued, direct replication is a component in making sure that the evidence base for claims is strong.

R1.6. Concern VI: There is no standard method to evaluate replication results

A concern about replications, noted in the target article, is that researchers are faced with myriad ways to statistically evaluate original and replication studies. Several commentators pick up on this theme and on related concerns about effect sizes and the kinds of errors that characterize research. For instance, Gelman, as well as Tackett & McShane, suggests that no real “null” effects are being studied in psychology, and thus, the concepts of false positives and false negatives are not very useful and should be abandoned. They suggest that we begin with the assumption that all effects we study are nonzero to some extent and recommend transitioning toward multilevel models that allow us to characterize the variability of effects between studies in a rich fashion. Their preference to essentially abandon null hypothesis testing reflects a long-standing debate in the field, as hypothesis testing has always been contested among statisticians and methodologists. The fact remains, however, that many well-informed users of statistics still consider a hypothesis test (and the ideas of false positives and false negatives) relevant to answering their scientific question in many cases. Thus, again we are reluctant to be too directive about the kinds of statistical tools researchers use. There might be cases where hypothesis testing is useful.

We agree with those commentators who push the value of thinking beyond type I and II errors to have researchers consider estimation errors of the sort Gelman has proposed. There are many ways to get things wrong in scientific research! Type M errors occur when researchers overestimate or underestimate the magnitude of an effect, and type S errors occur when researchers get the sign of the effect wrong. This kind of error could be particularly problematic when dealing with interventions as the sign may indicate iatrogenic effects. As we hope was clear in both the target article and this response, we see replication studies as providing additional information that can help reduce errors of all sorts. Thus, to the extent that we used terms like false positives in the original target article, critics who prefer the type M and S framework can think of a false positive as occurring when researchers are dealing with effects that are tiny in comparison to the original estimate or when there is type S error of any magnitude. If this mindset is adopted, we believe the vast majority of our arguments and perspectives hold.
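
A small simulation can make type M and type S errors concrete. The sketch below is our own illustration in the spirit of Gelman's framework; the assumed true effect and standard error are purely hypothetical.

```python
import numpy as np

def retrodesign_sim(true_effect: float, se: float, n_sims: int = 100_000):
    """Simulate type S and type M errors for a study design.
    Given an assumed true effect and the standard error implied by the design,
    estimate (a) statistical power, (b) how often a 'significant' estimate has
    the wrong sign (type S), and (c) by how much significant estimates
    exaggerate the true effect on average (type M)."""
    rng = np.random.default_rng(0)
    estimates = rng.normal(true_effect, se, n_sims)   # sampling distribution of the estimate
    z_crit = 1.96                                     # two-sided test at alpha = .05
    significant = np.abs(estimates) > z_crit * se
    sig_estimates = estimates[significant]
    type_s = np.mean(np.sign(sig_estimates) != np.sign(true_effect))
    type_m = np.mean(np.abs(sig_estimates)) / abs(true_effect)
    return significant.mean(), type_s, type_m

# Hypothetical scenario: a small true effect (0.1) studied with a noisy design (SE = 0.2).
power, type_s, type_m = retrodesign_sim(true_effect=0.1, se=0.2)
print(f"power ~ {power:.2f}, type S rate ~ {type_s:.2f}, exaggeration (type M) ~ {type_m:.1f}x")
```

Under these assumptions, the rare significant results greatly exaggerate the true effect and occasionally carry the wrong sign, which is exactly the scenario in which a well-powered direct replication is most informative.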

Tackett & McShane chastise us for suggesting “three ways of statistically evaluating a replication, all of which are based on the null hypothesis significance testing (NHST) paradigm and the dichotomous p-value thresholds” that are “a form of statistical alchemy that falsely promise to transmute randomness into certainty” (para. 7). We worry this is a misreading of our target article. We assume these comments are in reference to our summary of three of the methods used to evaluate the results of the Reproducibility Project: Psychology (Open Science Collaboration 2015). However, we were merely describing the various methods that the authors of that study had used to evaluate replication success, and elsewhere in our article, we (briefly) discuss additional approaches that could be used. Tackett & McShane are right in noting limitations in some of the existing methods for evaluating replications and we are glad that they brought attention to those issues. We reduced our critical coverage of those issues in the target article in the interest of space. We are inclined toward the two approaches we discuss in detail (the small telescopes approach and the replication Bayes factor approach), but we agree that other approaches such as a meta-analytical (multilevel) approach can be useful, as well. We did not want to have our case for making replication mainstream bogged down by statistical arcana or the “framework wars” that sometimes derail debates over frequentist versus Bayesian methods.

Holcombe & Gershman reiterate that the result of an experiment depends on the status of not only the primary research hypotheses being investigated, but also the other auxiliary hypotheses (i.e., moderators). As such, their proposal is also relevant to concern I: Context is too variable. A failure to replicate a past finding can result either from the falsity of the primary research hypothesis or from a failure to satisfy the conditions specified by auxiliary hypotheses. It would appear that a replication failure can provide evidence against only the conjunction of the research hypothesis and relevant auxiliary hypotheses, an instance of the Duhem-Quine problem. Without a way to distinguish between the primary and auxiliary hypotheses when interpreting a replication result, researchers are left wondering about the status of the theory. Holcombe & Gershman suggest that a reformulation of Bayes' theorem derived by Strevens (2001, p. 525) can solve this problem. Call H the truth status of the primary hypothesis, A the truth status of an auxiliary hypothesis, and HA the conjunction of H and A. Strevens showed that the posterior belief in H given the falsification of HA can be determined entirely by (1) our prior belief in H and (2) our prior belief in A given H. Based on this result, Holcombe & Gershman recommend implementing “pilot programs to induce scientists to set out their beliefs before the data of a replication study are collected,” allowing Strevens's result to be used and the belief in the theory updated.

It is fruitful to consider how our beliefs in our theories can be disentangled from our beliefs in auxiliary hypotheses in a quantitative way. However, we must admit to being surprised by how apparently simple the result summarized by Holcombe & Gershman is. We only need to specify prior probabilities, and need not consider such things as our model for the data-generating process? On consulting Strevens (2001), we find that the precise result referred to by Holcombe & Gershman (clarified to us via personal communication) is

$$P(H \mid \neg(HA)) = P(H) \times \frac{1 - P(A \mid H)}{1 - P(A \mid H)\,P(H)}$$
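
For concreteness, the expression can be evaluated directly once the two prior probabilities are chosen. The sketch below uses illustrative priors of our own choosing, not values proposed by Holcombe & Gershman or Strevens.

```python
def posterior_H_given_not_HA(p_H: float, p_A_given_H: float) -> float:
    """Strevens's (2001) expression for the posterior belief in the primary
    hypothesis H after the conjunction HA has been falsified."""
    return p_H * (1 - p_A_given_H) / (1 - p_A_given_H * p_H)

# Illustrative priors (purely hypothetical): strong belief in the theory,
# moderate belief that the auxiliary assumptions held in the replication.
print(posterior_H_given_not_HA(p_H=0.8, p_A_given_H=0.7))   # ~ 0.55
```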

Our (brief) impression based on the derivation by Strevens is that the above result can be used to evaluate replication success only when the following two conditions are met. First, it must be possible for the data from the replication experiment to provide strict falsification of HA. If this condition is not met, the expression above becomes dependent on more than the two prior probabilities. Second, there must exist either a single auxiliary hypothesis or a small number of independent auxiliary hypotheses that capture the differences between the original and replication study. If this condition is not met, we again see the above result become dependent on more than the relevant prior probabilities. We suspect that neither of these conditions is usually met in the context of psychological research. Hypotheses in psychology tend to make only probabilistic predictions, meaning strict falsification of the conjunction HA is usually not possible. Moreover, we doubt that differences between studies can be fully captured by a small number of independent auxiliary hypotheses.

R2. Additional suggestions for making replication mainstream

Although we anticipated that some of the commentaries might present dissenting views about the value of direct replication or the specific suggestions we made for resolving controversies about these issues, we were especially gratified to see that many commentators went beyond our suggestions to provide novel ways to make replication more mainstream. Srivastava, for example, points out that replication research promotes the dissemination of information needed for other aspects of verification. Making replication more normative creates the expectation that others will need to know the details of original research, including previously opaque details about specific measures, procedures, or underlying data. Thus, making replication mainstream promotes meta-scientific knowledge about which results to treat as credible even if a specific study never happens to be replicated. More broadly, endorsing replication as a method for ensuring the credibility of research reinforces the idea that scientists ought to be checking each other's work. Lilienfeld argues that direct replication is not only feasible but, in fact, also necessary for two domains of clinical psychological science: the evaluation of psychotherapy outcome and the construct validity of psychological measures. We agree, and we hope that replication attempts become more commonplace in these areas.

Howe & Perfors propose to make it standard practice for journals to pre-commit to publishing adequately powered, technically competent direct replications (at least in online form) of any article they publish and to link to them from the original article. This is a practice that journals could readily implement. It is known colloquially as the Pottery Barn Rule following Srivastava (2012) and is close to the editorial policy adopted at the Journal of Research in Personality when Richard Lucas was the Editor-in-Chief (Lucas & Donnellan 2013). Our one caveat is that we are not convinced replications should be relegated only to online archives. IJzerman, Grahe, & Brandt (IJzerman et al.) and also Gernsbacher point to an initiative to make replication habitual by integrating replication with undergraduate education. Given that several of us are already using this practice, we support this initiative. Similar to the proposal by IJzerman et al., but targeted at a somewhat later stage in the educational process, is Kochari & Ostarek's proposal to introduce a replication-first rule for Ph.D. projects. We think this is an interesting proposal that specific graduate programs should consider for themselves. We suspect that including replication studies could actually improve the quality of the dissertations themselves, while also becoming a valuable element of graduate training. Gorgolewski, Nichols, Kennedy, Poline, & Poldrack suggest making replication mainstream by making it prestigious, for instance, by giving awards, and they note that such an approach has already been implemented by the Organization for Human Brain Mapping. Although we are not sure which of these proposals will be seen as most feasible and effective, we were impressed with the diverse suggestions that the commenters provided in this regard and look forward to seeing them implemented in some form.

Other commentaries focus on how the use of direct replication was synergistic with other proposed reforms in the ongoing discussions about methodological improvements in the field. Little & Smith favor the use of small-N designs. We think there is much to like about such an approach, but we also worry that such designs are not feasible in many areas of psychology. Thus, to the extent that such designs are appropriate for a given research question, they should be encouraged. Paolacci & Chandler advocate the use of open samples (e.g., those recruited from platforms such as Mechanical Turk), and they discuss how to use them in a methodologically sound way. These authors rightly note that although open samples provide greater opportunity for close replication (as more researchers have access to the same population), they also pose unique challenges for replication research. Original researchers and replicators alike would benefit from considering the issues that Paolacci & Chandler raise in this regard. Tierney, Schweinsberg, & Uhlmann (Tierney et al.) and also Gernsbacher suggest an approach to which we are particularly sympathetic (and, indeed, a description of which we had to cut out of an earlier version of the manuscript; Zwaan 2017) called concurrent replication. This practice involves the widespread replication of research findings in independent laboratories prior to publication or in a reciprocal fashion. As Tierney et al. argue, this addresses three key concerns discussed in the target article: it explicitly takes context into account, reduces reputational costs for original authors and replicators, and increases the theoretical value of failed replications. Spellman & Kahneman question whether replications will need to continue in the way they have recently been conducted. Specifically, they argue that large-scale, multi-lab replications that assess the robustness of one or two critical findings may die out as replication becomes more integrated into the research process. This may indeed be a consequence of making replication mainstream. Spellman & Kahneman present several proposals about how various research labs could collaborate to simultaneously investigate new phenomena and conduct direct replications of the results that are found. These are well worth considering, and journal editors may wish to commission special issues to incentivize tests of their proposals.

As many commentators note, improving research practices more broadly would also facilitate embedding replication in the mainstream of research. Improvements can be targeted at different stages of the research process. For instance, Schimmack notes that the current practice of basing publication decisions on whether the main results are significant or not is problematic, as this provides a built-in guarantee that most replication results will yield smaller effect sizes than original studies. In turn, this means that replication studies will inevitably be perceived as a challenge to the original effect. Publication bias also renders meta-analyses problematic, which is an issue that the field needs to address. Tools such as registered reports can help reduce publication bias. Many commentators furthermore note that reporting standards for original research should be improved because they are currently not sufficiently stringent (Giner-Sorolla et al. and Simons et al.). With reporting standards as they are, it is important to verify the original results before launching a replication effort (Nuijten et al.). With respect to the evaluation of replications, Giner-Sorolla et al. state that methodologically inconclusive replications ought not to be counted as non-replications. We concur. With respect to the dissemination of replications, Egloff suggests that those engaged in the debates surrounding the larger reform movement in psychology should, in contrast to what often occurs right now, choose mainstream outlets for their work (rather than, e.g., blogs and social media). We agree that this would help make replication more mainstream, and researchers should definitely be encouraged to send their contributions to this important debate to conventional outlets in addition to their blogs. If the debate surrounding the reform efforts features more prominently in mainstream outlets, then replications themselves may eventually be seen as being worthy of publication there, as well. Unfortunately, some high-profile journals routinely reject replication studies for lack of novelty; and thus, an important component of this approach will be to lobby editors, publication boards, and societies to allow for such contributions to be published. One might worry that allowing replication studies into top journals that had previously been known for novel research findings will dilute the pages of those journals with a flood of studies that simply seek to verify those results. Future meta-scientific research can track the number of replication studies that are actually submitted and published at journals that allow them, while also tracking the effects on the credibility and replicability of the results that are published at those journals.

R3. Conclusion: There should be no special rules for replication studies

The title of our target article, “Making replication mainstream,” was chosen to reflect our beliefs about the role that direct replications should play in science. Although we advocate direct replications, we acknowledge that they are (1) only one tool among many for improving science, (2) not necessarily the most important reform for scientists concerned about methodology to adopt, and (3) not even a practice that all scientists must necessarily prioritize in their own research efforts. Instead, we argued that direct replication should become a more normal, more mainstream part of the scientific process. An important part of our goal of making replication mainstream is to ensure that replication studies are not held to different standards than other forms of research.

A number of the commentaries could be interpreted as proposing special rules for conducting and evaluating the results of replication studies. In our target paper we discussed a broad range of statistical tests that could be used to evaluate replication results, but some commentators suggest that replication research needs to go even further in the number of tests conducted and the sophistication of those tests. Although we are of course in favor of reconsidering the appropriateness of any default analytic approaches, we believe that this goal in no way applies specifically to replication studies. In fact, one of the things replication efforts have shown is that analytic approaches and reporting standards may need to be improved across the board. Several commentaries seem to be aligned with this view (e.g., Giner-Sorolla et al.; Nuijten et al.; Schimmack; Spellman & Kahneman; and Simons et al.).

Commentators such as Wegener & Fabrigar suggested that replication research should adhere to the standards that hold for original research, standards that include rigorous pretesting and the inclusion of supplemental tests to ensure that the manipulations and measures worked as intended. They suggest that replications rarely meet these standards. However, they provide no evidence that replication studies routinely fail to meet these standards or, more importantly, that original research itself is actually held to the high standard that they describe for replications. Indeed, our experience is that the practice of conducting replication attempts frequently reveals methodological problems in the original studies that would have gone unnoticed without the attention to detail that conducting direct replications requires (e.g., Donnellan et al. 2015). It would be informative to conduct a meta-scientific study in which original and replication studies were scored for dimensions of rigor of the sort identified by Wegener & Fabrigar.

As consumers of research, we, of course, start with our subjective impressions regarding the typical quality of original and replication research; and our personal impression (which is also not informed by systematic data) is that replication studies are already more likely to include these features than the typical original study. After all, these studies have the advantage of relying on existing protocols, and thus, replication researchers have far less flexibility when it comes to data analysis, as the original study provides numerous constraints. Many replications have far larger sample sizes and generate effect size estimates that are seemingly more plausible than those of original studies. When original studies find significant results, the fact that the study did not include critical methodological features that would have been useful to explain a nonsignificant result is noticed by few. Concerns about the magnitude of the effect size estimate are more often raised when the estimate is tiny (as is often the case with null results from replication studies) than when it is large (as is often the case with primary studies based on modest sample sizes). Considerable attention is devoted to explaining away small effect size estimates by appeal to hidden moderators, whereas less attention is paid to explaining why a large effect is plausible given the outcome variable in question or the strength of the experimental manipulation. Regardless of who is right about the prevalence of the features of high-quality research, we agree that they are desirable and that as a field we should push for their inclusion in all research, replication or otherwise. We worry, however, that their absence is often used, in an ad hoc fashion, as a way to dismiss failed replications of original studies that used the exact same methodology. In extreme cases, critics may even ignore the strengths of the replication studies when attempting to privilege the original result.

Ioannidis, pointing to some of the same features we have identified, pushes this argument further than even we do and suggests that replications often have more utility than original studies because biases are more common in original research. Many commentators provided special rules for how to identify studies that should be replicated. For example, Witte & Zenker stress that it is important to evaluate the potential for generating theoretical knowledge before launching a replication project. Several commentaries address this concern by providing concrete solutions. Hardwicke, Tessler, Peloquin, & Frank and Coles, Tiokhin, Scheel, Isager, & Lakens both propose translating the question of which replication to run into a formal decision-making process, whereby replications would be deemed worthy to run or not based on the utility we expect to gain from them. Their suggestions essentially amount to considering the costs and benefits of running a particular replication and evaluating the subjective probability we assign to the underlying theory, hearkening back to the quintessentially Bayesian ideas previously put forward by the likes of Wald, Lindley, Savage, and others. Their suggestions can aid individual researchers and groups when they go about deciding how to allocate their own time and effort.

These are all interesting and potentially useful perspectives to take moving forward. At the same time, we do not think that special rules for selecting replication studies are needed, or even desirable. Certainly original research studies vary in the contribution they make to science, yet few propose formal mechanisms for deciding which new original studies should be conducted. Much original research builds on, or is even critical of, prior theory and research (as should be the case in a cumulative science). Idiosyncratic interests and methodological expertise guide the original research questions that people pursue. This should be true for replication research, as well. People conduct replications for many reasons: because they want to master the methods in an original study, because they want to build on the original finding, and yes, even because they doubt the validity of the original work. But this is true regardless of whether the follow-up study that a person conducts is a replication or an entirely new study building on prior work.

As we noted in the target article, although the ability to replicate a research finding is a foundational principle of the scientific method, the role of direct replication in the social and behavioral sciences is surprisingly controversial. The goal of our article was to identify and address the major reasons why this controversy exists and to suggest that science would benefit from making replication more mainstream. The commentaries on this article strengthen our belief that an increased focus on replication will benefit science; at the same time these commentaries pushed us to think more about the reasons why controversy about replication exists. We hope that the resulting debate will encourage all scientists to think carefully about the role that direct replication should play in building a cumulative body of knowledge. Once again, we thank these commentators for their insightful comments and we look forward to seeing these ideas evolve as social and behavioral sciences engage with a broad range of meta-scientific issues in the years to come.

ACKNOWLEDGMENT

Alexander Etz was supported by Grant 1534472 from the National Science Foundation's Methods, Measurements, and Statistics panel and by the National Science Foundation Graduate Research Fellowship Program (No. DGE1321846).

Footnotes

1.

Current address: Department of Psychology, Psychology Building, 316 Physics Road, Room 249A, Michigan State University, East Lansing, MI 48824.

References

Baribault, B., Donkin, C., Little, D. R., Trueblood, J. S., Oravecz, Z., van Ravenzwaaij, D., White, C. N., De Boeck, P. & Vandekerckhove, J. (2018) Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences of the United States of America 115(11):2607–12. Available at: https://doi.org/10.1073/pnas.1708285114.
Bonett, D. G. (2012) Replication-extension studies. Current Directions in Psychological Science 21:409–12.
Box, G. E. P. (1979) Robustness in the strategy of scientific model building. In: Robustness in statistics, ed. Launer, R. L. & Wilkinson, G. N., pp. 201–36. Academic Press.
Donnellan, M. B., Lucas, R. E. & Cesario, J. (2015) On the association between loneliness and bathing habits: Nine replications of Bargh and Shalev (2012): Study 1. Emotion 15(1):109–19.
Eerland, A., Sherrill, A. M., Magliano, J. P., Zwaan, R. A., Arnal, J. D., Aucoin, P., Berger, S. A., Birt, A. R., Capezza, N., Carlucci, M., Crocker, C., Ferretti, T. R., Kibbe, M. R., Knepp, M. M., Kurby, C. A., Melcher, J. M., Michael, S. W., Poirier, C. & Prenoveau, J. M. (2016) Registered replication report: Hart & Albarracín (2011). Perspectives on Psychological Science 11:158–71.
Etz, A., Haaf, J. M., Rouder, J. N. & Vandekerckhove, J. (in press) Bayesian inference and testing any hypothesis you can specify. Preprint. Available at: https://psyarxiv.com/wmf3r/.
Hagger, M. S., Chatzisarantis, N. L. D., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., Brand, R., Brandt, M. J., Brewer, G., Bruyneel, S., Calvillo, D. P., Campbell, W. K., Cannon, P. R., Carlucci, M., Carruth, N. P., Cheung, T., Crowell, A., De Ridder, D. T. D., Dewitte, S., Elson, M., Evans, J. R., Fay, B. A., Fennis, B. M., Finley, A., Francis, Z., Heise, E., Hoemann, H., Inzlicht, M., Koole, S. L., Koppel, L., Kroese, F., Lange, F., Lau, K., Lynch, B. P., Martijn, C., Merckelbach, H., Mills, N. V., Michirev, A., Miyake, A., Mosser, A. E., Muise, M., Muller, D., Muzi, M., Nalis, D., Nurwanti, R., Otgaar, H., Philipp, M. C., Primoceri, P., Rentzsch, K., Ringos, L., Schlinkert, C., Schmeichel, B. J., Schoch, S. F., Schrama, M., Schütz, A., Stamos, A., Tinghög, G., Ullrich, J., vanDellen, M., Wimbarti, S., Wolff, W., Yusainy, C., Zerhouni, O. & Zwienenberg, M. (2016) A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science 11(4):546–73.
Lakatos, I. (1970) Falsification and the methodology of scientific research programmes. In: Criticism and the growth of knowledge, ed. Lakatos, I. & Musgrave, A., pp. 91–196. Cambridge University Press.
Lucas, R. E. & Donnellan, M. B. (2013) Improving the replicability and reproducibility of research published in the Journal of Research in Personality. Journal of Research in Personality 47(4):453–54.
Nosek, B. A. & Errington, T. M. (2017) Making sense of replications. eLife 6:e23383.
Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science 349(6251):aac4716. Available at: http://doi.org/10.1126/science.aac4716.
Srivastava, S. (2012, September 27) A Pottery Barn rule for scientific journals. Blog post. Available at: https://hardsci.wordpress.com/2012/09/27/a-pottery-barn-rule-for-scientific-journals/.
Strevens, M. (2001) The Bayesian treatment of auxiliary hypotheses. The British Journal for the Philosophy of Science 52(3):515–37.
Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B. Jr., Albohn, D. N., Allard, E. S., Benning, S. D., Blouin-Hudon, E.-M., Bulnes, L. C., Caldwell, T. L., Calin-Jageman, R. J., Capaldi, C. A., Carfagno, N. S., Chasten, K. T., Cleeremans, A., Connell, L., DeCicco, J. M., Dijkstra, K., Fischer, A. H., Foroni, F., Hess, U., Holmes, K. J., Jones, J. L. H., Klein, O., Koch, C., Korb, S., Lewinski, P., Liao, J. D., Lund, S., Lupianez, J., Lynott, D., Nance, C. N., Oosterwijk, S., Ozdoğru, A. A., Pacheco-Unguetti, A. P., Pearson, B., Powis, C., Riding, S., Roberts, T.-A., Rumiati, R. I., Senden, M., Shea-Shumsky, N. B., Sobocko, K., Soto, J. A., Steiner, T. G., Talarico, J. M., van Allen, Z. M., Vandekerckhove, M., Wainwright, B., Wayand, J. F., Zeelenberg, R., Zetzer, E. E. & Zwaan, R. A. (2016a) Registered replication report: Strack, Martin & Stepper (1988). Perspectives on Psychological Science 11:917–28.
Zwaan, R. A. (2017, May 8) Concurrent replication. Blog post. Available at: https://rolfzwaan.blogspot.nl/2017/05/concurrent-replication.html.