Direct replications have an important place in our scientific toolkit. Given limited resources, however, scientists must decide when to replicate versus when to conduct novel research. A Bayesian viewpoint can help clarify this issue. On the Bayesian view, scientific knowledge is represented by a probability distribution over theoretical hypotheses, given the available evidence (Strevens 2006). This distribution, called the posterior, is proportional to the product of the prior probability of each hypothesis and the likelihood of the data given that hypothesis. Evidence from a new study can then be integrated into the posterior, making hypotheses more or less probable. The amount of change to the posterior can be quantified as a study's information gain. Using this formalism, one can design “optimal experiments” that maximize information gain relative to available resources (e.g., MacKay 1992). One attraction of this quantitative framework is that it captures the dictum that researchers should design studies that can adjudicate between competing hypotheses (Platt 1964).
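To make this concrete, here is a minimal sketch of Bayesian updating over a discrete set of hypotheses, with information gain quantified as the Kullback–Leibler divergence of the posterior from the prior (one common choice, not the only one). The hypotheses, priors, and likelihoods are illustrative placeholders, not values from any actual study.

```python
# A minimal sketch: Bayesian updating over two hypotheses, with a study's
# information gain measured as KL(posterior || prior). All values are illustrative.
import math

def update(prior, likelihood):
    """Return the posterior over hypotheses given a prior and the likelihood of the observed data."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnormalized.values())
    return {h: p / z for h, p in unnormalized.items()}

def information_gain(prior, posterior):
    """KL divergence of the posterior from the prior, in bits."""
    return sum(posterior[h] * math.log2(posterior[h] / prior[h])
               for h in posterior if posterior[h] > 0)

prior = {"H1": 0.5, "H2": 0.5}        # two competing hypotheses, equally plausible a priori
likelihood = {"H1": 0.8, "H2": 0.2}   # P(data | hypothesis), hypothetical values

posterior = update(prior, likelihood)
print(posterior)                           # {'H1': 0.8, 'H2': 0.2}
print(information_gain(prior, posterior))  # ~0.28 bits
```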
Some good intuitions fall out of the Bayesian formulation. A study designed to select between a priori likely hypotheses (e.g., those well supported by existing data) can lead to high information gain. By contrast, a study whose data provide strong support for a hypothesis that already has a high prior probability, or weak support for a hypothesis that has a low prior probability, provides much less information gain. Larger samples and more precise measurements will result in greater information gain, but only in the context of a design that can distinguish high-prior-probability hypotheses.
Even well-designed studies can be undermined by errors in execution, reporting, or analysis. If a study is known to be erroneous, then it clearly leads to no information gain, but the more common situation is some uncertainty about the possibility of error. A Bayesian framework can capture this uncertainty by weighting information gain by a study's credibility. Concerns about questionable research practices, analytic error, or fraud thus all decrease the overall information gain from a study.
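As a rough illustration of this weighting, one might simply discount a study's information gain by its estimated credibility, so that a study known to be erroneous (credibility of zero) contributes nothing. This multiplicative discount is an assumption made for exposition, and the numbers continue the illustrative example above.

```python
# A minimal sketch, assuming credibility acts as a simple multiplicative
# discount on information gain; numbers are illustrative.
def credibility_weighted_gain(gain, credibility):
    """Discount a study's information gain by the probability that the study is sound."""
    return credibility * gain

print(credibility_weighted_gain(0.28, 1.0))  # fully credible study: 0.28 bits
print(credibility_weighted_gain(0.28, 0.5))  # serious concerns about error or QRPs: 0.14 bits
```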
Direct replications are special in this framework only in that they follow a pre-existing set of design decisions. Thus, the main reason to replicate is simply to gather more data in a promising paradigm. In cases where the original study had low credibility or a small sample size, a replication can lead to substantial information gain (Klein et al. 2014c). Replicating the same design again and again will offer diminishing returns, however, as estimates of relevant quantities become more precise (Mullen et al. 2001). If a study design is not potentially informative, for example, because it cannot in principle differentiate between hypotheses, then replicating that design will not lead to information gain. Finally, when a particular finding has substantial applied value, replicators might want to consider an expected value analysis wherein a replication's information gain is weighted by the expected utility of a particular outcome.
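The diminishing-returns point can be illustrated with a simple conjugate Normal model (an assumption chosen for exposition, not the only way to model repeated replication): if identical studies keep reporting the same estimate, each additional study shifts and sharpens the posterior over the effect size less than the last, so its information gain shrinks.

```python
# A minimal sketch of diminishing returns from repeated direct replications,
# assuming a Normal model with known observation noise. Each successive
# identical study changes the posterior less, so its information gain
# (KL divergence of new posterior from old) declines. Numbers are illustrative.
import math

def kl_normal(m1, s1, m0, s0):
    """KL divergence of N(m1, s1^2) from N(m0, s0^2), in nats."""
    return math.log(s0 / s1) + (s1**2 + (m1 - m0)**2) / (2 * s0**2) - 0.5

prior_mean, prior_sd = 0.0, 1.0   # diffuse prior on the effect size
obs_sd = 0.5                      # standard error of each study's estimate
observed_effect = 0.4             # each replication happens to report the same estimate

m, s = prior_mean, prior_sd
for study in range(1, 6):
    # conjugate Normal update with known observation variance
    post_prec = 1 / s**2 + 1 / obs_sd**2
    post_mean = (m / s**2 + observed_effect / obs_sd**2) / post_prec
    post_sd = post_prec ** -0.5
    print(f"study {study}: gain = {kl_normal(post_mean, post_sd, m, s):.3f} nats")
    m, s = post_mean, post_sd
# gains fall from ~0.46 nats for the first study to ~0.03 nats by the third
```

An expected value analysis of the kind mentioned above would simply weight each study's anticipated gain by the utility of the outcomes it could produce.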
Replications have one unique feature, though: They can change our interpretation of an original study by affecting our estimates of the original study's credibility. Imagine a very large effect is observed in a small study and an identical but larger replication study then observes a much smaller effect. If both studies are assumed to be completely credible, the best estimate of the quantity of interest is the variance-weighted average of the two (Borenstein et al. 2009). But if the replication has high credibility – for example, because of preregistration, open data, and so on – then the mismatch between the two may result from the earlier study lacking credibility as a result of error, analytic flexibility, or another cause. Such explanations would be taken into account by downweighting the information gain of the original study by that study's potential lack of credibility. Of course, substantial scientific judgment is required when sample, stimulus, or procedural details differ between replication and original (cf. Anderson et al. 2016; Gilbert et al. 2016). Often, multiple studies that investigate reasons for the failure of a replication are needed to understand disparities in results (see, e.g., Baribault et al. 2018; Lewis & Frank 2016; Phillips et al. 2015).
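Here is a minimal numerical sketch of that scenario: a fixed-effect (inverse-variance-weighted) combination of a small original study and a larger replication, with an optional credibility discount applied to the original study's weight. The discount factor is one possible way to operationalize the downweighting described above, not a standard meta-analytic procedure, and the effect sizes and standard errors are invented for illustration.

```python
# A minimal sketch of combining an original study with a larger replication by
# inverse-variance weighting, with an optional credibility discount on each
# study's weight. Effect sizes, standard errors, and credibility values are
# illustrative assumptions, not estimates from real studies.
def combined_estimate(estimates, standard_errors, credibilities=None):
    """Fixed-effect style estimate: weight each study by credibility / SE^2."""
    if credibilities is None:
        credibilities = [1.0] * len(estimates)
    weights = [c / se**2 for c, se in zip(credibilities, standard_errors)]
    return sum(w * e for w, e in zip(weights, estimates)) / sum(weights)

estimates = [0.80, 0.10]        # original: large effect; replication: much smaller effect
standard_errors = [0.30, 0.10]  # original: small sample; replication: large sample

print(combined_estimate(estimates, standard_errors))              # ~0.17, both fully credible
print(combined_estimate(estimates, standard_errors, [0.3, 1.0]))  # ~0.12, original downweighted
```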
The prior probabilities of hypotheses will not be universally agreed upon, and this disagreement can lead to disputes about whether a particular result should be replicated. One researcher may believe that a study with low information gain – perhaps on account of a small sample size – deserves to be “rescued” by replication because it addresses a plausible hypothesis. By contrast, a more skeptical researcher who assigned the original hypothesis a lower prior probability might see no reason to replicate. Or that researcher might replicate simply to convince others that the original study lacks credibility, especially if it is influential within academia or with the general public. Overall, as long as studies are appropriately conducted and reported, and all studies are considered, the Bayesian framework will accumulate evidence and converge on an estimate of the true posterior.
Replication is an expensive option for assessing credibility, however. Assessing analytic reproducibility and robustness may be a more efficient means of ensuring that errors or specific analytic choices are not responsible for a particular result (Steegen et al. 2016; Stodden et al. 2016). Forensic tools like p-curve or the test for excess significance (Ioannidis & Trikalinos 2007; Simonsohn et al. 2014) can also help in assessing credibility.
How should an individual researcher make use of this Bayesian framework? When thinking about replication, researchers should ask the same questions they do when planning a new study: Does my planned study differentiate between plausible theoretical hypotheses, and do I have sufficient resources to carry it out? For a replication, this judgment can then be qualified by whether a re-evaluation of the original study's credibility would be a net positive, because downweighting the credibility of an incorrect or spurious study also leads to overall information gain. Adopting such a framework to guide a rough assessment of information value (even in the absence of precise numerical assignments) can help researchers decide when to replicate.