What is the theoretical value of direct replication? In a recent paper, we (Rotello et al. 2015) described several cases where oft-replicated studies repeated the methodological flaws of the original work. In particular, we presented examples from research on reasoning, memory, social cognition, and child welfare in which the standard method of analysis was not justified and indeed could – and in at least two cases, did – lead to erroneous inferences. Repeating the study, along with the flawed analyses, could lead to yet greater confidence in these incorrect conclusions. Most of our examples concerned conceptual rather than direct replications, in the sense that there were various purposeful design and material changes across studies. Our point was about methodology, namely that inferential errors as a result of unjustified analyses can be magnified upon replication. Contrary to the implication of the target article, we would not argue that the theoretical value of direct, or for that matter conceptual, replications is limited.
Indeed, the target article makes a compelling case for the value of replication, as well as its mainstream role in psychology. Yet we would not elevate replication over other worthwhile research practices. Using an example from Rotello et al. (2015), we reported that, beginning with Evans et al. (1983), replication studies on the belief bias effect in reasoning have for three decades employed analyses such as analyses of variance on differences in response rates without checking the assumptions of those analyses. (In this example, researchers could easily do so by collecting data that would allow them to plot receiver operating characteristic curves to see whether there is a linear or curvilinear relationship between correct and incorrect positive response rates; a sketch of such a check appears below.) Checking statistical assumptions is another worthwhile research practice, the results of which will sometimes contraindicate the strategy of simply running the same analyses again. Researchers should place a high priority on checking the assumptions of their statistical analyses and their dependent measures. Just as the Reproducibility Project: Psychology (Open Science Collaboration 2015) has launched a highly successful effort to crowdsource direct replication, we note that other worthwhile research practices, such as checking statistical assumptions, could also be crowdsourced. In light of the potential problems with difference scores and analyses of variance that place so many reasoning and recognition memory studies at risk (see also Dubé et al. 2010; Heit & Rotello 2014; Rotello et al. 2008), we would like to see a large-scale effort to check statistical assumptions across a wide range of research domains. We point to statcheck (Nuijten et al. 2016) as a promising example along these lines, although its focus to date has been on checking p values. For some research domains, checking statistical assumptions may be a higher priority than direct replications.
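By way of illustration, the sketch below shows the kind of assumption check we have in mind, written in Python with simulated confidence ratings standing in for real belief-bias data; the roc_points helper, the 6-point rating scale, and all distributional parameters are hypothetical choices made for this example. It traces empirical ROC points and compares a linear fit in probability space, which analyses of difference scores implicitly assume, against a curvilinear binormal fit (linear in z space):

```python
import numpy as np
from scipy.stats import norm

def roc_points(ratings, n_levels=6):
    """Cumulative 'accept' rates at each confidence criterion, with a
    log-linear correction so that no rate is exactly 0 or 1."""
    counts = np.array([(ratings == c).sum() for c in range(1, n_levels + 1)])
    cum = np.cumsum(counts[::-1])[:-1]          # accepts, strictest criterion first
    return (cum + 0.5) / (counts.sum() + 1.0)

# Hypothetical confidence ratings (1 = sure invalid ... 6 = sure valid).
rng = np.random.default_rng(1)
valid = np.clip(np.round(rng.normal(4.3, 1.2, 200)), 1, 6).astype(int)
invalid = np.clip(np.round(rng.normal(3.2, 1.6, 200)), 1, 6).astype(int)

hits, fas = roc_points(valid), roc_points(invalid)  # hit and false-alarm rates

# Fit the ROC two ways: linear in probability space versus binormal
# (linear in z space, hence curvilinear in probability space).
lin = np.polyfit(fas, hits, 1)
zlin = np.polyfit(norm.ppf(fas), norm.ppf(hits), 1)

sse_linear = np.sum((hits - np.polyval(lin, fas)) ** 2)
sse_binormal = np.sum((hits - norm.cdf(np.polyval(zlin, norm.ppf(fas)))) ** 2)
print(f"linear SSE = {sse_linear:.5f}, binormal SSE = {sse_binormal:.5f}")
```

If the binormal model fits the points markedly better, then analyses of variance on raw response-rate differences rest on a false assumption, and replications that rerun those analyses would simply reproduce the error.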
Likewise, we would not elevate direct replication over conceptual replication. Philosophers of science have argued that researchers should be particularly confident in a conclusion that can be repeated across diverse contexts and methods (for a review, see Heit et al. 2005). For example, Salmon (1984) described how early twentieth-century scientists developed a diverse set of experimental methods for deriving Avogadro's number (6.02 × 10²³). These methods included Brownian movement, alpha particle decay, X-ray diffraction, black body radiation, and electrochemistry. Together, these diverse methods – these conceptual replications – provided particularly strong support for the existence of atoms and molecules, going well beyond what direct replications could have accomplished. Turning back to psychology, we pose the question of whether the field learns more from N direct replications of a study or from N conceptual replications of the same study. Perhaps when N is very low there is greater value from direct replications, but as N increases the value of conceptual replications becomes more pronounced.
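To make the force of such convergence concrete, consider the electrochemistry route in back-of-the-envelope form: a mole of electrons carries the Faraday constant of charge, which is measurable by electrolysis, so dividing it by the elementary charge yields Avogadro's number. The sketch below uses modern CODATA values purely for illustration:

```python
# Electrochemistry route to Avogadro's number: N_A = F / e.
# Modern CODATA values, used here purely for illustration; the early
# measurements Salmon describes were far less precise but converged
# on the same leading digits and order of magnitude.
F = 96485.332          # Faraday constant, coulombs per mole of electrons
e = 1.602176634e-19    # elementary charge, coulombs
print(F / e)           # ~6.022e23
```

Each of the other methods Salmon lists reaches essentially the same value by an independent physical route, and it is that independence which gives the conceptual replications their collective force.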
Finally, we would not elevate replication “successes” over replication “failures,” namely, successes or failures in obtaining the same results as a prior study. Scientists learn something important from either outcome. This point is perhaps clearer in medical research – finding evidence that a once-promising medical treatment does not work should be just as important as a positive finding. To the degree that psychological research has an influence on health and medical practices, educational practices, and public policy, finding out which results do not replicate will be crucial. Although replication failures can be associated with fluctuating contexts and post hoc explanations, we note that in much research, context is varied purposefully from study to study. In a sense, context itself is an object of study, and failures are informative. Given that a drug is effective for men, does it work for women? Given that an educational intervention is successful for native English speakers, is it successful for English language learners? Here, addressing replication failures is central to the research enterprise rather than a problem for it.
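As a sketch of what treating context as an object of study can look like in analysis, the following example simulates a hypothetical trial in which a drug helps one subgroup but not another, then tests the treatment effect within each subgroup; every number in it is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical recovery probabilities: the drug helps men (0.70 vs. 0.50)
# but not women (0.50 vs. 0.50). All values are invented for illustration.
p = {("men", "drug"): 0.70, ("men", "placebo"): 0.50,
     ("women", "drug"): 0.50, ("women", "placebo"): 0.50}
n = 250  # patients per cell

for group in ("men", "women"):
    drug = rng.binomial(n, p[(group, "drug")]) / n
    placebo = rng.binomial(n, p[(group, "placebo")]) / n
    # Two-proportion z test (pooled) for the treatment effect in this group.
    pooled = (drug + placebo) / 2
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    print(f"{group}: effect = {drug - placebo:+.3f}, z = {(drug - placebo) / se:.2f}")
```

A reliable effect in one subgroup alongside a null result in the other is not a failure to be explained away; it is a finding about the boundary conditions of the effect.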
To conclude, the pursuit of direct replication is potentially of high theoretical value, and indeed is becoming increasingly mainstream, for example, as psychology journals devote sections to direct replication reports. However, we would place direct replication alongside other worthwhile research practices, such as conceptual replication and careful evaluation of statistical assumptions. Likewise, we would place successful replications alongside failed replications in terms of their potential to inform the field.