We applaud Zwaan et al. for compiling many of the concerns researchers currently have regarding replication and for their thoughtful rejoinders to those concerns. Yet the authors gloss over an underlying cause of the lack of replicability in psychological science and instead focus exclusively on a symptom: that the field does not make replications a centerpiece of hypothesis testing. An underlying cause is that psychologists do not actually propose "strong," testable theories. To paraphrase Meehl (Reference Meehl1990a), null hypothesis testing of "weak" theories produces a literature that is "uninterpretable." In particular, the qualitative hypotheses generated from weak theories are not formulated specifically enough, predicting only that "X and Y will interact." Thus, any degree and form of interaction could be used to support the frequentist statistical hypothesis. Further, it is important to remember that the null hypothesis is never actually true (Cohen Reference Cohen1994), and rejecting it can speak only to the degree of the interaction, not its form. In other words, both a disordinal interaction in an original study and an ordinal interaction in a replication would yield statistical support for the interaction hypothesis. Had the theory been stronger, the hypothesis would have predicted a specific degree and form of the interaction, and the second study would rightly have counted as a non-replication of the first. This may explain in part why, in our own examination of the research practices and replication metrics of published research (Motyl et al. Reference Motyl, Demos, Carsel, Hanson, Melton, Mueller, Prims, Sun, Washburn, Wong, Yantis and Skitka2017; Washburn et al. Reference Washburn, Hanson, Motyl, Skitka, Yantis, Wong, Sun, Prims, Mueller, Melton and Carsel2018), we concluded that the metrics of replicability seemed to support Meehl's prediction that a poorly theorized scientific literature would produce "uninterpretable" results. Thus, the authors' concern VI (sect. 5.6), regarding point estimates (e.g., effect sizes, p values) and their confidence intervals, implicitly assumes that the original study and its replication yield interpretable results regarding the verisimilitude of the theory. To summarize this argument, consider the example of throwing darts at a dartboard. Zwaan et al. were concerned with whether the second dart lands near the first; however, given the way psychology often works, the bullseye may be the size of the whole wall. Thus, replication can contribute to the falsification only of a theory that is well defined.
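The interaction argument above can be made concrete with a minimal sketch (the cell means are invented for illustration, not taken from any study): the standard 2 × 2 interaction contrast is nonzero for both a disordinal (crossover) pattern and an ordinal pattern, so the vague hypothesis "X and Y will interact" is "supported" by either form.

```python
# Hypothetical cell means for a 2x2 design (factors X and Y), illustrating
# Meehl's point: interactions of entirely different forms both satisfy a
# vague "X and Y interact" hypothesis. All numbers are made up.

def interaction_contrast(means):
    """Standard 2x2 interaction contrast: (x1y1 - x1y2) - (x2y1 - x2y2)."""
    (x1y1, x1y2), (x2y1, x2y2) = means
    return (x1y1 - x1y2) - (x2y1 - x2y2)

# "Original study": disordinal (crossover) interaction -- the simple effect
# of Y reverses direction across levels of X.
disordinal = ((10.0, 4.0), (4.0, 10.0))

# "Replication": ordinal interaction -- the effect of Y merely shrinks
# across levels of X but never reverses.
ordinal = ((10.0, 4.0), (8.0, 6.0))

for label, means in (("disordinal", disordinal), ("ordinal", ordinal)):
    c = interaction_contrast(means)
    print(f"{label}: interaction contrast = {c:+.1f} (nonzero, so it 'supports' the hypothesis)")
# disordinal: interaction contrast = +12.0
# ordinal:    interaction contrast = +4.0
```

A stronger theory would predict the contrast's sign, approximate magnitude, and the ordinal/disordinal form in advance, so that only one of these two patterns would count as corroboration.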
The current predicament of weak theorizing may stem in part from the belief that "humans are too complicated for strong theories." Zwaan et al. address a symptom of this problem by stating that "context" needs to be better described in our methods sections. Psychological theories often require substantial auxiliary theories and hypotheses to "derive" the qualitative hypotheses that motivate our studies (Meehl Reference Meehl1990b). In short, this creates the problem of "theoretical degrees of freedom": an ambiguous theory can be re-instantiated in such a way that any result we may find will be taken as support for what are, in fact, unfalsifiable theories. Zwaan et al. assert, "If a finding that was initially presented as support for a theory cannot be reliably reproduced using the comprehensive set of instructions for duplicating the original procedure, then the specific prediction that motivated the original research question has been falsified (Popper 1959/2002), at least in the narrow sense" (sect. 2, para. 3). The kind of falsification advocated by Zwaan et al., however, becomes increasingly difficult the further removed a statistical hypothesis is from the qualitative hypothesis (Meehl Reference Meehl1990a), and a finding is rendered uninterpretable when our statistical and qualitative hypotheses are couched in an increasing number of implicit auxiliary hypotheses. Indeed, if our theories are so weak that any contextual change negates them, then they are not theories; they are hypotheses masquerading as theories. Gray (Reference Gray2017) proposed a preliminary method for instantiating theories visually, which forces scientists to think through their theory's concepts and relationships. This is a stronger recommendation than those made by Zwaan et al., who suggest simply being more careful about statements of generalizability. Concerns II (sect. 5.2) and IV (sect. 5.4) would be resolved by stronger theorizing, more careful derivation and discussion of statistical and qualitative hypotheses, and both direct and conceptual replications to test the boundary conditions of those theories.
In summary, we contend that the authors of the target article are right that we need to make replication more mainstream, but we argue that the field must go further and encourage stronger theorizing to make replications more feasible and meaningful.