If they could, researchers would design critical experiments that isolate and test an individual hypothesis. But as Pierre Duhem (Reference Duhem1954) pointed out, the results of any one experiment depend not only on the truth or falsity of the central hypothesis of interest, but also on other “auxiliary” hypotheses. In the discussion of replications in psychology, this basic fact is understood (although typically referred to using different terminology).
For an experiment in physics, the auxiliary hypotheses might include that the measurement device is functioning correctly. For an experiment in psychology, in addition to correct functioning of the measurement devices (such as computers), the auxiliaries might include that the participants understood the instructions and, for a replication study, that the effect exists in a new population of participants.
Bayes' theorem dictates how one should update one's belief in a hypothesis in light of new evidence. However, because the results of actual scientific experiments inevitably depend on auxiliary hypotheses as well as on the hypothesis of interest, the only valid use of Bayes' rule is to update one's belief in an undifferentiated compound of the central hypothesis and all of the auxiliaries. But researchers are interested primarily in the credibility of the central hypothesis, not its combination with various auxiliaries. How, then, can data be used to update one's strength of belief in a particular hypothesis or phenomenon?
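To make the point concrete, here is a minimal sketch in standard notation (our own shorthand, not drawn from the sources cited here): let H be the central hypothesis, A the conjunction of all auxiliary hypotheses, and e the experimental evidence. Bayes' rule licenses updating only the compound,
\[
P(H \wedge A \mid e) = \frac{P(e \mid H \wedge A)\, P(H \wedge A)}{P(e)},
\]
and by itself says nothing about how the resulting change in credence should be apportioned between H and A separately.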
The philosopher Michael Strevens used probability theory to answer this question. The equation he derived prescribes how, given the results of an experiment, one should update the strength of one's belief in a central hypothesis and in the auxiliary hypotheses (Strevens Reference Strevens2001). For a replication experiment, a person's relative strengths of belief in the central hypothesis and in the auxiliary hypotheses determine how blame for a replication failure should be distributed, with (typically) different amounts going to the auxiliary hypotheses and the central hypothesis. If one has a strong belief in the central hypothesis but a relatively weak belief in the auxiliaries of the replication experiment, then belief in the central hypothesis can and should emerge relatively unscathed.
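As a toy illustration of this blame distribution (a deliberate simplification of ours, not Strevens' full treatment): suppose the predicted result e occurs only if both H and A are true, that H and A are independent a priori, and that the replication fails (evidence \(\neg e\)). Then
\[
P(H \mid \neg e) = \frac{P(H)\,\bigl(1 - P(A)\bigr)}{1 - P(H)\,P(A)}, \qquad
P(A \mid \neg e) = \frac{P(A)\,\bigl(1 - P(H)\bigr)}{1 - P(H)\,P(A)}.
\]
With \(P(H) = 0.9\) and \(P(A) = 0.5\), the failed replication leaves \(P(H \mid \neg e) \approx 0.82\) while \(P(A \mid \neg e) \approx 0.09\): most of the blame falls on the auxiliaries, and belief in the central hypothesis is only modestly dented.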
Strevens' equation for distributing credit and blame emerged from a broader philosophy of science called Bayesian confirmation theory (Hawthorne Reference Hawthorne, French and Saatsi2014; Strevens Reference Strevens2017). One might disagree with Bayesian confirmation theory broadly, but still agree that Bayesian belief updating is the ideal in many circumstances.
Strevens' equation offers a way to quantify the evidence that a replication experiment provides for an effect. Many articles on replication, including Zwaan et al., discuss at length the importance and nature of differences between a replication experiment and an original study, but without a principled, quantitative framework for taking these differences into account. The dream of Bayesian confirmation theory is that scientific inference might, in certain circumstances, proceed like clockwork. There will be difficulties in identifying and precisely quantifying the credence given to auxiliary hypotheses, but even rough approximations should lead to insights.
It is currently very unclear whether scientists' actual belief updating bears much resemblance to the updating prescribed by Strevens' equation. Many believe that after a failed replication, researchers often engage in motivated reasoning and irrationally cling to their beliefs, but Strevens' equation indicates that maintaining a strong belief and blaming the auxiliaries is rational if one had not put much credence in the auxiliaries of the replication study.
One possible way forward is to implement pilot programs to induce scientists to set out their beliefs before the data of a replication study are collected. In particular, researchers should be asked to specify the strengths (as probabilities) of their beliefs in the central and auxiliary hypotheses of the study. After the data come in, Strevens' equation would dictate how these researchers should update their beliefs. If it turns out that researchers do not update their beliefs in this way, we will have learned something. These findings, and the comments of the researchers on why they differed from Strevens' prescription (if they do), should illuminate how science progresses and how researchers reason.
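The following sketch shows how such a pilot program might compare researchers' self-reported post-replication beliefs with a Bayesian prescription. The function name and the simplified evidence model are our own hypothetical choices for illustration; they implement the toy model above, not Strevens' actual equation.

```python
# Minimal sketch: compute the posteriors a simple Bayesian model prescribes,
# given a researcher's elicited prior beliefs and the replication outcome.
# Simplifying assumptions (ours): the predicted effect is observed only if
# both the central hypothesis H and the conjunction of auxiliaries A hold,
# and H and A are independent a priori.

def prescribed_posteriors(p_h: float, p_a: float, replication_succeeded: bool):
    """Return (P(H | evidence), P(A | evidence)) under the toy model."""
    p_e = p_h * p_a  # prior probability that the predicted result is observed
    if replication_succeeded:
        # In this deterministic model, success entails both H and A.
        return 1.0, 1.0
    # Failure: condition on not-(H and A).
    p_not_e = 1.0 - p_e
    return p_h * (1.0 - p_a) / p_not_e, p_a * (1.0 - p_h) / p_not_e


# Example: a researcher who strongly believes the effect is real (0.9) but
# doubts the replication's auxiliaries (0.5), facing a failed replication.
post_h, post_a = prescribed_posteriors(p_h=0.9, p_a=0.5, replication_succeeded=False)
print(f"Prescribed belief in central hypothesis: {post_h:.2f}")  # ~0.82
print(f"Prescribed belief in auxiliaries:        {post_a:.2f}")  # ~0.09

# The gap between these prescriptions and the researcher's self-reported
# updated beliefs is what the pilot program would examine.
```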
Such a program may also help to pinpoint the disagreements that can occur between original researchers and replicating researchers. Presently, after a failed replication, a common practice is for authors of the original study to write a commentary. Frequently, the commentary highlights differences between the replication and the original study, sometimes without giving much indication of how much the authors' beliefs have changed as a result of the failed replication. This makes it difficult to determine the degree of disagreement on the issues.
Our proposal is closely related to several proposed reforms in the literature (and already in the Registered Replication Reports now published by Advances in Methods and Practices in Psychological Science, replicating labs are routinely asked what they expect the effect size to be). The key point is the addition of a suitable quantitative framework. Zwaan et al. mention the "Constraints on Generality" proposal of Simons et al. (Reference Simons, Shoda and Lindsay2017) that authors should "spend some time articulating theoretically grounded boundary conditions for particular findings" as this would mean disagreements with replicating authors "are likely to be minimized" (sect. 4, para. 11). But it may be difficult for an author to testify that a result should replicate in different conditions, as the author is likely to be uncertain about various aspects. Rather than making a black-and-white statement, then, it may be better if the author communicates this uncertainty by attaching subjective probabilities to some of the auxiliary hypotheses involved. A further benefit of this system would be that authors, and the theories they espouse, would then develop a track record of making correct predictions (Rieth et al. Reference Rieth, Piantadosi, Smith and Vul2013).
We recognize that in many circumstances, it may not be realistic to expect researchers to be able to quantify their confidence in the hypotheses that are part and parcel of an original experiment and potential replication experiments. Areas that are less mature, in the sense that many auxiliary hypotheses are uncertain, may be especially poor candidates. But other areas may be suitable. There are good reasons for researchers to try.