We thank Zwaan et al. for their review paper on replication and strongly endorse their call to make replication mainstream. Nonetheless, we find their conceptualization of and recommendations for replication problematic.
Intrinsic to Zwaan et al.'s conceptualization is a false dichotomy of direct versus conceptual replication, with the former defined as “a study that attempts to recreate the critical elements (e.g., samples, procedures, and measures) of an original study” (sect. 4, para. 3) and the latter as a “study where there are changes to the original procedures that might make a difference with regard to the observed effect size” (sect. 4.6). We see problems with both of Zwaan et al.'s definitions and the sharp dichotomization intrinsic to their conceptualization.
In terms of definitions, first, Zwaan et al. punt in defining direct replications by leaving unspecified the crucial matter of what constitutes the “critical elements (e.g., samples, procedures, and measures) of an original study” (sect. 4, para. 3). Specifying these is nontrivial if not impossible in general and likely controversial in specific cases. Second, they are overly broad in defining conceptual replications: Under their definition, practically all behavioral research replication studies would be considered conceptual. To understand why, consider large-scale replication projects such as the Many Labs project (Klein et al. 2014a) and Registered Replication Reports (RRRs; Simons et al. 2014), in which careful measures were taken to ensure that protocols were followed identically across labs in order to achieve near-exact or direct replication. In these projects, not only did observed effect sizes differ across labs (as they always do), but, despite such strict conditions, so too did true effect sizes; that is, effect sizes were heterogeneous or contextually variable – and to roughly the same degree as sampling variation (McShane et al. 2016; Stanley et al. 2017; Tackett et al. 2017b). This renders Zwaan et al.'s suggestion of conducting direct replication infeasible: Even if defining the “critical elements” were possible, recreating them in a manner that maintains the effect size homogeneity they insist on for direct replication seems impossible in light of these Many Labs and RRR results.
In addition, and again in light of these results, the sharp dichotomization of direct versus conceptual replication intrinsic to Zwaan et al.'s conceptualization is unrealistic in practice. Further, even were it not, replication designs with hybrid elements (e.g., where the theoretical level is “directly” replicated but the operationalization is systematically varied) are an important future direction – particularly for large-scale replication projects (Tackett et al. 2017b) – not covered by Zwaan et al.'s conceptualization.
Instead, and in line with Zwaan et al.'s mention of “extensions,” we would like to see a broader approach to conceptualizing replication and, in particular, one that better generalizes to other domains of psychological research. Specifically, large-scale replications are typically possible only when data collection is fast and not particularly costly; thus they are, practically speaking, constrained to certain domains of psychology (e.g., cognitive and social). Consequently, we know much less about the replicability of findings in other domains (e.g., clinical and developmental), let alone how to operationalize replicability in them (Tackett et al. 2017a; in press). In these other domains, where data collection is slow and costly but individual data sets are typically much richer, we recommend complementing the prospective approach to replication employed by large-scale replication projects thus far with a retrospective approach that leverages the large amount of shareable archival data across sites; such an approach can be valuable and sometimes even preferable (Tackett et al. 2017b; 2018).
This will require not only changes in infrastructure and incentive structures, but also a better understanding of appropriate statistical approaches for analyzing pooled data (i.e., hierarchical models) and more complex effects (e.g., curve or function estimates as opposed to point estimates); of the lab-specific moderators most relevant to include in such analyses; of additional method factors that drive heterogeneity (e.g., dropout mechanisms in longitudinal studies); and of how to harmonize measurements across labs (e.g., if they use different measures of depression).
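To make the first of these concrete, the following is a minimal sketch, entirely our own and purely illustrative rather than a prescribed analysis, of a hierarchical (mixed-effects) model fit to simulated participant-level data pooled across labs with a hypothetical lab-level moderator; the variable names and the statsmodels-based formulation are assumptions for illustration only.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate pooled participant-level data with lab-varying (heterogeneous)
# treatment effects and a hypothetical lab-level moderator.
rng = np.random.default_rng(0)
n_labs, n_per_lab = 20, 100
labs = np.repeat(np.arange(n_labs), n_per_lab)
condition = rng.integers(0, 2, size=labs.size)          # 0 = control, 1 = treatment
lab_moderator = rng.integers(0, 2, size=n_labs)[labs]   # lab-level characteristic
lab_effect = rng.normal(0.3, 0.15, size=n_labs)[labs]   # heterogeneous true effects
outcome = 0.2 * lab_moderator + lab_effect * condition + rng.normal(size=labs.size)
data = pd.DataFrame({"outcome": outcome, "condition": condition,
                     "lab": labs, "lab_moderator": lab_moderator})

# Random intercepts and random treatment slopes by lab allow the treatment
# effect to vary across sites; the lab-level moderator enters as a fixed effect.
model = smf.mixedlm("outcome ~ condition * lab_moderator", data=data,
                    groups=data["lab"], re_formula="~condition")
fit = model.fit()
print(fit.summary())  # fixed effects plus lab-level variance components

Richer versions of such models could, in principle, incorporate the additional method factors and harmonized measures described above.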
It may also require a change in procedures for statistically evaluating replication. Zwaan et al. suggest three ways of doing so, all of which are based on the null hypothesis significance testing paradigm and the dichotomous p-value thresholds intrinsic to it. Such thresholds, whether applied to p-values or to other statistical measures such as confidence intervals and Bayes factors, (i) lead to erroneous reasoning (McShane & Gal 2016; 2017); (ii) are a form of statistical alchemy that falsely promises to transmute randomness into certainty (Gelman 2016a), thereby permitting dichotomous declarations of truth or falsity, binary statements about there being “an effect” or “no effect,” a “successful replication” or a “failed replication”; and (iii) should be abandoned (Leek et al. 2017; McShane et al. 2018).
Instead, we would like to see replication efforts statistically evaluated via hierarchical/meta-analytic statistical models. Such models can directly estimate and account for contextual variability (i.e., heterogeneity) in replication efforts. This is critically important because, as the Many Labs and RRR results show, such variability is roughly comparable to sampling variability even when explicit efforts are taken to minimize it, and it is typically many times larger in more standard sets of studies where no such efforts are made (Stanley et al. 2017; van Erp et al. 2017). Importantly, these models can also account for differences in method factors such as dependent variables, moderators, and study designs (McShane & Bockenholt 2017; 2018) and for varying treatment effects (Gelman 2015), thereby allowing for a much richer characterization of a research domain and application to the hybrid replication designs discussed above. We would also like to see the estimates from these models considered alongside additional factors such as prior and related evidence, plausibility of mechanism, study design, and data quality to provide a more holistic evaluation of replication efforts.
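As a complementary illustration on the meta-analytic side, the following minimal sketch, again ours and intended only as a simple stand-in for the fuller models cited above, estimates between-lab heterogeneity alongside the pooled effect via a standard random-effects meta-analysis using the DerSimonian-Laird moment estimator; the lab-level estimates and standard errors are hypothetical.

import numpy as np

def random_effects_meta(estimates, std_errors):
    """Return the pooled effect, its standard error, and tau^2 (heterogeneity)."""
    y = np.asarray(estimates, dtype=float)
    v = np.asarray(std_errors, dtype=float) ** 2   # within-lab sampling variances
    w = 1.0 / v                                    # fixed-effect weights

    # Cochran's Q and the DerSimonian-Laird moment estimate of tau^2.
    y_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fixed) ** 2)
    denom = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / denom)

    # Random-effects weights incorporate sampling variance plus tau^2.
    w_star = 1.0 / (v + tau2)
    mu = np.sum(w_star * y) / np.sum(w_star)
    se_mu = np.sqrt(1.0 / np.sum(w_star))
    return mu, se_mu, tau2

# Hypothetical lab-level effect estimates and standard errors.
lab_estimates = [0.42, 0.10, 0.31, -0.05, 0.55, 0.18]
lab_std_errors = [0.12, 0.15, 0.10, 0.14, 0.16, 0.11]
mu, se_mu, tau2 = random_effects_meta(lab_estimates, lab_std_errors)
print(f"pooled effect = {mu:.2f} (SE {se_mu:.2f}), tau^2 = {tau2:.3f}")

Comparing the estimated between-lab standard deviation (the square root of tau^2) to the typical lab-level standard error gives a direct sense of whether contextual variability is comparable to sampling variability, as in the Many Labs and RRR results, rather than forcing a binary declaration of replication success or failure.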
Our suggestions for replication conceptualization and evaluation forsake the false promise of certainty offered by the dichotomous approaches favored by the field and by Zwaan et al. Consequently, they will seldom if ever deem a replication effort a “success” or a “failure,” and indeed, reasonable people following them may disagree about the degree of replication success. However, by accepting uncertainty and embracing variation (Carlin 2016; Gelman 2016a), we believe these suggestions will help us learn much more about the world.