
Conceptualizing and evaluating replication across domains of behavioral research

Published online by Cambridge University Press: 27 July 2018

Jennifer L. Tackett
Affiliation:
Psychology Department, Northwestern University, Evanston, IL 60208. jennifer.tackett@northwestern.edu; https://www.jltackett.com/
Blakeley B. McShane
Affiliation:
Kellogg School of Management, Northwestern University, Evanston, IL 60208. b-mcshane@kellogg.northwestern.edu; http://www.blakemcshane.com/

Abstract

We discuss the authors' conceptualization of replication, in particular the false dichotomy of direct versus conceptual replication intrinsic to it, and suggest a broader one that better generalizes to other domains of psychological research. We also discuss their approach to the evaluation of replication results and suggest moving beyond their dichotomous statistical paradigms and employing hierarchical/meta-analytic statistical models.

Type
Open Peer Commentary
Copyright
Copyright © Cambridge University Press 2018 

We thank Zwaan et al. for their review paper on replication and strongly endorse their call to make replication mainstream. Nonetheless, we find their conceptualization of and recommendations for replication problematic.

Intrinsic to Zwaan et al.'s conceptualization is a false dichotomy of direct versus conceptual replication, with the former defined as “a study that attempts to recreate the critical elements (e.g., samples, procedures, and measures) of an original study” (sect. 4, para. 3) and the latter as a “study where there are changes to the original procedures that might make a difference with regard to the observed effect size” (sect. 4.6). We see problems with both of Zwaan et al.'s definitions and the sharp dichotomization intrinsic to their conceptualization.

In terms of definitions, first, Zwaan et al. punt in defining direct replications by leaving unspecified the crucial matter of what constitutes the “critical elements (e.g., samples, procedures, and measures) of an original study” (sect. 4, para. 3). Specifying these is nontrivial if not impossible in general and likely controversial in specific. Second, they are overly broad in defining conceptual replications: Under their definition, practically all behavioral research replication studies would be considered conceptual. To understand why, consider large-scale replication projects such as the Many Labs project (Klein et al. 2014a) and Registered Replication Reports (RRRs; Simons et al. 2014), where careful measures were taken such that protocols were followed identically across labs in order to achieve near exact or direct replication. In these projects, not only did observed effect sizes differ across labs (as they always do), but, despite such strict conditions, so too did true effect sizes; that is, effect sizes were heterogeneous or contextually variable – and to roughly the same degree as sampling variation (McShane et al. 2016; Stanley et al. 2017; Tackett et al. 2017b). This renders Zwaan et al.'s suggestion of conducting direct replication infeasible: Even if defining the “critical elements” were possible, recreating them in a manner that maintains the effect size homogeneity they insist on for direct replication seems impossible in light of these Many Labs and RRR results.
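
To make this comparison concrete, the following minimal sketch (ours, with simulated data and assumed values for the number of labs, the per-lab sample size, and the between-lab standard deviation; it is not an analysis of the Many Labs or RRR data) uses the DerSimonian-Laird estimator to estimate between-lab heterogeneity and compares it with the within-lab standard error:

```python
# Minimal sketch (not the authors' analysis): a DerSimonian-Laird random-effects
# estimate of between-lab heterogeneity (tau) for simulated multi-lab data,
# illustrating how tau can be of the same order as the within-lab standard errors.
import numpy as np

rng = np.random.default_rng(0)

K = 20                                # hypothetical number of labs
n_per_lab = 100                       # assumed participants per lab
tau_true = 0.10                       # assumed true between-lab SD of the effect
theta = 0.20 + tau_true * rng.standard_normal(K)   # lab-specific true effects

se = np.full(K, 1.0 / np.sqrt(n_per_lab))          # within-lab standard errors
y = theta + se * rng.standard_normal(K)            # observed lab effect sizes

# DerSimonian-Laird estimator of tau^2
w = 1.0 / se**2
theta_fixed = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - theta_fixed)**2)
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2_hat = max(0.0, (Q - (K - 1)) / c)

print(f"estimated between-lab SD (tau) = {np.sqrt(tau2_hat):.3f}")
print(f"typical within-lab SE          = {se.mean():.3f}")
```

By construction, the simulated heterogeneity here is of the same order as the sampling error; the point of the sketch is only to show how the two quantities are estimated and compared, not to reproduce the Many Labs or RRR findings.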

In addition, and again in light of these results, the sharp dichotomization of direct versus conceptual replication intrinsic to Zwaan et al.'s conceptualization is unrealistic in practice. Further, even were it not, replication designs with hybrid elements (e.g., where the theoretical level is “directly” replicated but the operationalization is systematically varied) are an important future direction – particularly for large-scale replication projects (Tackett et al. 2017b) – not covered by Zwaan et al.'s conceptualization.

Instead, and in line with Zwaan et al.'s mention of “extensions,” we would like to see a broader approach to conceptualizing replication and, in particular, one that better generalizes to other domains of psychological research. Specifically, large-scale replications are typically only possible when data collection is fast and not particularly costly; thus they are, practically speaking, constrained to certain domains of psychology (e.g., cognitive and social). Consequently, we know much less about the replicability of findings in other domains (e.g., clinical and developmental), let alone how to operationalize replicability in them (Tackett et al. 2017a; in press). In these other domains, where data collection is slow and costly but individual data sets are typically much richer, we recommend that, in addition to the prospective approach to replication employed by large-scale replication projects thus far, a retrospective approach that leverages the large amount of shareable archival data across sites can be valuable and sometimes even preferable (Tackett et al. 2017b; 2018).

This will require not only a change in both infrastructure and incentive structures, but also a better understanding of appropriate statistical approaches for analyzing pooled data (i.e., hierarchical models) and more complex effects (e.g., curve or function estimates as opposed to point estimates); lab-specific moderators most relevant to include in such analyses; additional method factors that drive heterogeneity (e.g., dropout mechanisms in longitudinal studies); and how to harmonize measurements across labs (e.g., if they use different measures of depression).
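
As a minimal illustration of the first of these points, the following sketch (ours; the variable names y, x, lab, and clinical_sample are hypothetical placeholders and the data are simulated, not drawn from any archival data set) fits a hierarchical model with a random intercept and slope by lab plus a lab-level moderator:

```python
# Minimal sketch, assuming a retrospective pooled data set: a hierarchical
# (mixed-effects) model with a random intercept and random slope by lab,
# plus a hypothetical lab-level moderator entering as a fixed effect.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
labs, n = 12, 80                                   # assumed number of labs / participants per lab
rows = []
for lab in range(labs):
    slope = 0.3 + 0.1 * rng.standard_normal()      # lab-specific true slope (heterogeneity)
    clinical = int(lab < 6)                        # hypothetical lab-level moderator
    x = rng.standard_normal(n)
    y = 0.2 * clinical + slope * x + rng.standard_normal(n)
    rows.append(pd.DataFrame({"y": y, "x": x, "lab": lab, "clinical_sample": clinical}))
df = pd.concat(rows, ignore_index=True)

# Random intercept and slope for x by lab; the moderator and its interaction
# with x are fixed effects.
model = smf.mixedlm("y ~ x * clinical_sample", df, groups=df["lab"], re_formula="~x")
print(model.fit().summary())
```

The same structure extends, in principle, to the harder problems listed above (richer random-effect structures, method factors such as dropout mechanisms, and harmonized measures), though those extensions require modeling choices well beyond this sketch.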

It may also require a change in procedures for statistically evaluating replication. Zwaan et al. suggest three ways of doing so, all of which are based on the null hypothesis significance testing paradigm and the dichotomous p-value thresholds intrinsic to it. Such thresholds, whether in the form of p-values or other statistical measures such as confidence intervals and Bayes factors, (i) lead to erroneous reasoning (McShane & Gal 2016; 2017); (ii) are a form of statistical alchemy that falsely promises to transmute randomness into certainty (Gelman 2016a), thereby permitting dichotomous declarations of truth or falsity, binary statements about there being “an effect” or “no effect,” a “successful replication” or a “failed replication”; and (iii) should be abandoned (Leek et al. 2017; McShane et al. 2018).
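
A toy illustration of the reasoning such thresholds invite (ours, with hypothetical numbers, not an example from Zwaan et al.): two studies with nearly identical estimates and standard errors can fall on opposite sides of the p < .05 threshold and thus receive opposite dichotomous verdicts.

```python
# Illustrative toy example: nearly identical evidence, opposite threshold verdicts.
from scipy import stats

est1, se1 = 0.21, 0.10    # hypothetical study 1: estimate and standard error
est2, se2 = 0.19, 0.10    # hypothetical study 2: nearly the same estimate

for label, est, se in [("Study 1", est1, se1), ("Study 2", est2, se2)]:
    z = est / se
    p = 2 * (1 - stats.norm.cdf(abs(z)))                 # two-sided p-value
    verdict = "'an effect'" if p < 0.05 else "'no effect'"
    print(f"{label}: estimate={est:.2f}, p={p:.3f} -> threshold verdict: {verdict}")
```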

Instead, we would like to see replication efforts statistically evaluated via hierarchical/meta-analytic statistical models. Such models can directly estimate and account for contextual variability (i.e., heterogeneity) in replication efforts, which is critically important given that, as per the Many Labs and RRR results, such variability is roughly comparable to sampling variability even when explicit efforts are taken to minimize it, and that it is typically many times larger in more standard sets of studies when no such efforts are taken (Stanley et al. 2017; van Erp et al. 2017). Importantly, these models can also account for differences in method factors such as dependent variables, moderators, and study designs (McShane & Bockenholt 2017; 2018) and for varying treatment effects (Gelman 2015), thereby allowing for a much richer characterization of a research domain and application to the hybrid replication designs discussed above. We would also like to see the estimates from these models considered alongside additional factors such as prior and related evidence, plausibility of mechanism, study design, and data quality to provide a more holistic evaluation of replication efforts.
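
For concreteness, one common specification of such a model is the random-effects meta-regression sketched below (our notation, not a formula from the target article), in which the between-study standard deviation tau quantifies contextual variability rather than being assumed to be zero, and study-level method factors enter as moderators:

```latex
% y_k:      observed effect estimate in replication k
% sigma_k:  (estimated) sampling standard error in replication k
% x_k:      study-level method factors, e.g., dependent variable or design
% tau:      between-study (contextual) variability, estimated rather than assumed zero
\begin{aligned}
  y_k      &= \theta_k + \varepsilon_k, & \varepsilon_k &\sim \mathcal{N}(0, \sigma_k^2),\\
  \theta_k &= \mu + x_k^{\top}\beta + u_k, & u_k &\sim \mathcal{N}(0, \tau^2).
\end{aligned}
```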

Our suggestions for replication conceptualization and evaluation forsake the false promise of certainty offered by the dichotomous approaches favored by the field and by Zwaan et al. Consequently, they will seldom if ever deem a replication effort a “success” or a “failure,” and indeed, reasonable people following them may disagree about the degree of replication success. However, by accepting uncertainty and embracing variation (Carlin 2016; Gelman 2016a), we believe these suggestions will help us learn much more about the world.

References

Carlin, J. B. (2016) Is reform possible without a paradigm shift? The American Statistician, Supplemental material to the ASA statement on p-values and statistical significance 10.
Gelman, A. (2015) The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. Journal of Management 41(2):632–43.
Gelman, A. (2016a) The problems with p-values are not just with p-values. The American Statistician, Supplemental material to the ASA statement on p-values and statistical significance 10.
Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B. Jr., Bahník, S., Bernstein, M. J., Bocian, K., Brandt, M. J., Brooks, B., Brumbaugh, C. C., Cemalcilar, Z., Chandler, J., Cheong, W., Davis, W. E., Devos, T., Eisner, M., Frankowska, N., Furrow, D., Galliani, E. M., Hasselman, F., Hicks, J. A., Hovermale, J. F., Hunt, S. J., Huntsinger, J. R., IJzerman, H., John, M.-S., Joy-Gaba, J. A., Kappes, H. B., Krueger, L. E., Kurtz, J., Levitan, C. A., Mallett, R. K., Morris, W. L., Nelson, A. J., Nier, J. A., Packard, G., Pilati, R., Rutchick, A. M., Schmidt, K., Skorinko, J. L., Smith, R., Steiner, T. G., Storbeck, J., Van Swol, L. M., Thompson, D., van't Veer, A. E., Vaughn, L. A., Vranka, M., Wichman, A. L., Woodzicka, J. A. & Nosek, B. A. (2014a) Investigating variation in replicability: A “Many Labs” replication project. Social Psychology 45(3):142–52. Available at: http://doi.org/10.1027/1864-9335/a000178.
Leek, J., McShane, B. B., Gelman, A., Colquhoun, D., Nuijten, M. B. & Goodman, S. N. (2017) Five ways to fix statistics. Nature 551(7682):557–59.
McShane, B. B. & Bockenholt, U. (2017) Single paper meta-analysis: Benefits for study summary, theory-testing, and replicability. Journal of Consumer Research 43(6):1048–63.
McShane, B. B. & Bockenholt, U. (2018) Multilevel multivariate meta-analysis with application to choice overload. Psychometrika 83(1):255–71.
McShane, B. B., Bockenholt, U. & Hansen, K. T. (2016) Adjusting for publication bias in meta-analysis: An evaluation of selection methods and some cautionary notes. Perspectives on Psychological Science 11(5):730–49.
McShane, B. B. & Gal, D. (2016) Blinding us to the obvious? The effect of statistical training on the evaluation of evidence. Management Science 62(6):1707–18.
McShane, B. B. & Gal, D. (2017) Statistical significance and the dichotomization of evidence. Journal of the American Statistical Association 112(519):885–95.
McShane, B. B., Gal, D., Gelman, A., Robert, C. & Tackett, J. L. (2017) Abandon statistical significance. Technical report, Northwestern University. Available at: https://arxiv.org/abs/1709.07588.
Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science 349(6251):aac4716. Available at: http://doi.org/10.1126/science.aac4716.
Simons, D. J., Holcombe, A. O. & Spellman, B. A. (2014) An introduction to Registered Replication Reports at Perspectives on Psychological Science. Perspectives on Psychological Science 9(5):552–55.
Stanley, T. D., Carter, E. C. & Doucouliagos, H. (2017) What meta-analyses reveal about the replicability of psychological research. Working paper (November 2017), Deakin Laboratory for the Meta-Analysis of Research.
Tackett, J. L., Brandes, C. M. & Reardon, K. W. (in press) Leveraging the Open Science Framework in clinical psychological assessment research. Psychological Assessment.
Tackett, J. L., Lilienfeld, S. O., Patrick, C. J., Johnson, S. L., Krueger, R. F., Miller, J. D., Oltmanns, T. F. & Shrout, P. E. (2017a) It's time to broaden the replicability conversation: Thoughts for and from clinical psychological science. Perspectives on Psychological Science 12(5):742–56.
Tackett, J. L., McShane, B. B., Bockenholt, U. & Gelman, A. (2017b) Large scale replication projects in contemporary psychological research. Technical report, Northwestern University. Available at: https://arxiv.org/abs/1710.06031.
van Erp, S., Verhagen, A. J., Grasman, R. P. P. P. & Wagenmakers, E.-J. (2017) Estimates of between-study heterogeneity for 705 meta-analyses reported in Psychological Bulletin from 1990–2013. Journal of Open Psychology Data 5(1):4. Available at: http://doi.org/10.5334/jopd.33.