
Verify original results through reanalysis before replicating

Published online by Cambridge University Press:  27 July 2018

Michèle B. Nuijten
Affiliation:
Department of Methodology and Statistics, Tilburg School of Social and Behavioral Sciences, Tilburg University, 5037 AB, Tilburg, The Netherlands. m.b.nuijten@tilburguniversity.edu; https://mbnuijten.com
Marjan Bakker
Affiliation:
Department of Methodology and Statistics, Tilburg School of Social and Behavioral Sciences, Tilburg University, 5037 AB, Tilburg, The Netherlands. m.bakker_1@tilburguniversity.edu; http://marjanbakker.eu
Esther Maassen
Affiliation:
Department of Methodology and Statistics, Tilburg School of Social and Behavioral Sciences, Tilburg University, 5037 AB, Tilburg, The Netherlands. e.maassen@tilburguniversity.edu; https://www.tilburguniversity.edu/webwijs/show/e.maassen/
Jelte M. Wicherts
Affiliation:
Department of Methodology and Statistics, Tilburg School of Social and Behavioral Sciences, Tilburg University, 5037 AB, Tilburg, The Netherlands. j.m.wicherts@tilburguniversity.edu; http://jeltewicherts.net

Abstract

In determining the need to directly replicate, it is crucial to first verify the original results through independent reanalysis of the data. Original results that appear erroneous and that cannot be reproduced by reanalysis offer little evidence to begin with, thereby diminishing the need to replicate. Sharing data and scripts is essential to ensure reproducibility.

Type
Open Peer Commentary
Copyright
Copyright © Cambridge University Press 2018 

Zwaan et al. (2017) provide an important and timely overview of the discussion as to whether direct replications in psychology have value. Along with others (see, e.g., Royal Netherlands Academy of Arts and Sciences 2018), we agree wholeheartedly that replication should become mainstream in psychology. However, we feel that the authors missed a crucial aspect in determining whether a direct replication is valuable. Here, we argue that it is essential to first verify the results of the original study by conducting an independent reanalysis of its data or a check of reported results, before choosing to replicate an earlier finding in a novel sample.

A result is successfully reproduced if independent reanalysis of the original data, using either the same or a (substantively or methodologically) similar analytic approach, corroborates the result as reported in the original paper. If a result cannot be successfully reproduced, the original result is not reliable and it is hard, if not impossible, to substantively interpret it. Such an irreproducible result will have no clear bearing on theory or practice. Specifically, if a reanalysis yields no evidence for an effect in the original study, it is safe to assume that there is no effect to begin with, raising the question of why one would invest additional resources in any replication.

Problems with reproducibility in psychology

Lack of reproducibility might seem like a non-issue; after all, running the same analysis on the same data should be guaranteed to give the same result. However, there is increasing evidence that the reproducibility of published results in psychology is relatively low.

Checking the reproducibility of reported results in psychology is greatly impeded by a common failure to share data (Vanpaemel et al. 2015; Wicherts et al. 2006). Even when data are available, they are often of poor quality or unusable (Kidwell et al. 2016). Yet some reproducibility issues can be assessed by scrutinizing the papers themselves. Studies have repeatedly shown that roughly half of all published psychology articles contain at least one inconsistently reported statistical result, in which the reported p value does not match the degrees of freedom and test statistic; in roughly one in eight results this may have affected the statistical conclusion (e.g., Bakker & Wicherts 2011; Nuijten et al. 2016; Veldkamp et al. 2014; Wicherts et al. 2011). Furthermore, there is evidence that the means reported in roughly half of psychology articles are inconsistent with the stated sample size and number of items (Brown & Heathers 2017), that coefficients in mediation models often do not add up (Petrocelli et al. 2012), and that in 41% of psychology articles the reported degrees of freedom do not match the sample size description (Bakker & Wicherts 2014).
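
To illustrate the kind of internal-consistency check that tools such as statcheck automate, the sketch below (in Python rather than the R used by statcheck, and with hypothetical reported numbers) recomputes a p value from a reported t statistic and degrees of freedom and compares it, allowing for rounding, with the reported p value.

```python
# Minimal sketch of a reported-vs-recomputed p-value check, in the spirit of
# statcheck (not its actual implementation). Reported values are hypothetical.
from scipy import stats

def check_t_result(t_value, df, reported_p, decimals=2, two_tailed=True):
    """Return (recomputed_p, consistent) for a reported t-test result."""
    p = stats.t.sf(abs(t_value), df)
    if two_tailed:
        p *= 2
    # Consistent if the recomputed p, rounded to the reported precision,
    # matches the reported p value.
    consistent = abs(round(p, decimals) - reported_p) < 1e-9
    return p, consistent

# "t(28) = 2.20, p = .04": recomputed p ~ .036, rounds to .04 -> consistent.
print(check_t_result(2.20, 28, 0.04))
# "t(28) = 1.70, p = .04": recomputed p ~ .10 -> flagged as inconsistent.
print(check_t_result(1.70, 28, 0.04))
```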

Problems that can be detected without access to the raw data are arguably just the tip of the iceberg of reproducibility issues. Studies that set out to reanalyze data from published articles have also often run into problems (e.g., Ebrahim et al. 2014; Ioannidis et al. 2009). Besides the poor availability of raw data, papers usually do not contain details about the exact analytical strategy. Researchers often seem to make analytical choices that are driven by the need to obtain a significant result (Agnoli et al. 2017; John et al. 2012). These choices can be seemingly arbitrary (e.g., the choice of control variables or rules for outlier removal; see also Bakker et al. 2012 and Simmons et al. 2011), which makes it hard to retrace the original analytical steps to verify the result.

Suggested solution

Performing a replication study in a novel sample to establish the reliability of a certain result is time consuming and expensive. It is essential that we avoid wasting resources on trying to replicate a finding that may not even be reproducible from the original data. Therefore, we argue that it should be standard practice to verify the original results before any direct replication is conducted.

A first step in verifying original results can be to check whether the results reported in a paper are internally consistent. Some initial screening can be done quickly with automated tools such as “statcheck” (Epskamp & Nuijten 2016; http://statcheck.io), “p-checker” (Schönbrodt 2018), and the granularity-related inconsistency of means (“GRIM”) test (Brown & Heathers 2017). Especially if such preliminary checks already flag several potential problems, it is crucial that data and analysis scripts are made available for more detailed reanalysis. One could even argue that if data are not shared in such cases, the article should be retracted.
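
As an illustration of the GRIM logic, the following sketch checks, for hypothetical reported values, whether a mean of integer responses is attainable given the reported sample size and number of items; it is a simplified rendering of the idea described by Brown and Heathers (2017), not their published implementation.

```python
# Minimal sketch of the GRIM idea: a mean of integer responses from n
# participants on a given number of items can only take values that are
# multiples of 1 / (n * items). Reported values below are hypothetical.
def grim_consistent(reported_mean, n, items=1, decimals=2):
    """Check whether a reported mean is attainable given n and the item count."""
    grains = n * items
    # Nearest attainable mean to the reported value:
    nearest = round(reported_mean * grains) / grains
    # Consistent if that attainable mean rounds back to the reported mean.
    return abs(round(nearest, decimals) - reported_mean) < 1e-9

print(grim_consistent(5.19, n=28))   # False: no integer k gives k/28 = 5.19
print(grim_consistent(5.18, n=28))   # True: 145/28 = 5.1786, rounds to 5.18
```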

If a result can successfully be reproduced with the original data and analyses, it is interesting to investigate its sensitivity to alternative analytical choices. One way to do so is to run a so-called multiverse analysis (Steegen et al. 2016), in which different analytical choices are compared to test the robustness of the result. When a multiverse analysis shows that the study result is present in only a limited set of reasonable scenarios, one may not want to invest additional resources in replicating such a study. Note that a multiverse analysis still does not require any new data and is therefore a relatively cost-effective way to investigate reliability.
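
The sketch below, based on simulated data and a few hypothetical analytical choices, illustrates the multiverse logic: the same two-group comparison is run under every combination of choices, and the spread of resulting p values indicates how robust the conclusion is across reasonable specifications.

```python
# Minimal sketch of a multiverse analysis in the spirit of Steegen et al.
# (2016): run the same substantive test under every combination of a few
# defensible analytical choices. Data and choices are hypothetical.
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, 50)
group_b = rng.normal(0.4, 1.0, 50)

def exclude(x, rule):
    """Apply one of several defensible outlier-exclusion rules."""
    if rule == "none":
        return x
    z = (x - x.mean()) / x.std(ddof=1)
    cutoff = 2.5 if rule == "z<2.5" else 3.0
    return x[np.abs(z) < cutoff]

results = []
for rule, test in itertools.product(["none", "z<2.5", "z<3.0"],
                                    ["t-test", "mann-whitney"]):
    a, b = exclude(group_a, rule), exclude(group_b, rule)
    if test == "t-test":
        p = stats.ttest_ind(a, b).pvalue
    else:
        p = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
    results.append((rule, test, p))

for rule, test, p in results:
    print(f"{rule:6s} {test:12s} p = {p:.3f}")
print("share of specifications with p < .05:",
      sum(p < .05 for *_, p in results) / len(results))
```

If most specifications point to the same conclusion, the result is relatively robust; if only a few do, its evidential value is questionable even before any new data are collected.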

Reanalysis of existing data is a crucial tool in investigating the reliability of psychological results, so it should become standard practice to share raw data and analysis scripts. Journal policies can be successful in promoting this (Giofrè et al. 2017; Kidwell et al. 2016; Nuijten et al. 2017), and we hope that more journals will start requiring raw data and scripts.

In our proposal, the assessment of replicability is a multistep approach: first, check whether the originally reported results are internally consistent; next, verify the original results through independent reanalysis of the data using the original analytical strategy; then, run a sensitivity analysis to check whether the original result is robust to alternative analytical choices; and only then collect new data.

References

Agnoli, F., Wicherts, J. M., Veldkamp, C. L. S., Albiero, P. & Cubelli, R. (2017) Questionable research practices among Italian research psychologists. PLoS One 12(3):e0172792. Available at: http://doi.org/10.1371/journal.pone.0172792.
Bakker, M., van Dijk, A. & Wicherts, J. M. (2012) The rules of the game called psychological science. Perspectives on Psychological Science 7(6):543–54. Available at: http://doi.org/10.1177/1745691612459060.
Bakker, M. & Wicherts, J. M. (2011) The (mis)reporting of statistical results in psychology journals. Behavior Research Methods 43(3):666–78. Available at: http://doi.org/10.3758/s13428-011-0089-5.
Bakker, M. & Wicherts, J. M. (2014) Outlier removal and the relation with reporting errors and quality of research. PLoS One 9(7):e103360. Available at: http://doi.org/10.1371/journal.pone.0103360.
Brown, N. J. L. & Heathers, J. A. J. (2017) The GRIM test: A simple technique detects numerous anomalies in the reporting of results in psychology. Social Psychological and Personality Science 8(4):363–69. Available at: http://doi.org/10.1177/1948550616673876.
Ebrahim, S., Sohani, Z. N., Montoya, L., Agarwal, A., Thorlund, K., Mills, E. J. & Ioannidis, J. P. (2014) Reanalysis of randomized clinical trial data. Journal of the American Medical Association 312(10):1024–32. Available at: http://doi.org/10.1001/jama.2014.9646.
Epskamp, S. & Nuijten, M. B. (2016) statcheck: Extract statistics from articles and recompute p-values. Available at: https://cran.r-project.org/web/packages/statcheck/ (R package version 1.2.2).
Giofrè, D., Cumming, G., Fresc, L., Boedker, I. & Tressoldi, P. (2017) The influence of journal submission guidelines on authors' reporting of statistics and use of open research practices. PLoS One 12(4):e0175583. Available at: http://doi.org/10.1371/journal.pone.0175583.
Ioannidis, J. P., Allison, D. B., Ball, C. A., Coulibaly, I., Cui, X., Culhane, A. C., Falchi, M., Furlanello, C., Game, L., Jurman, G., Mangion, J., Mehta, T., Nitzburg, M., Page, G. P., Petretto, E. & van Noort, V. (2009) Repeatability of published microarray gene expression analyses. Nature Genetics 41(2):149–55.
John, L. K., Loewenstein, G. & Prelec, D. (2012) Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science 23(5):524–32.
Kidwell, M. C., Lazarevic, L. B., Baranski, E., Hardwicke, T. E., Piechowski, S., Falkenberg, L.-S., Kennett, C., Slowik, A., Sonnleitner, C., Hess-Holden, C., Errington, T. M., Fiedler, S. & Nosek, B. A. (2016) Badges to acknowledge open practices: A simple, low-cost, effective method for increasing transparency. PLoS Biology 14(5):e1002456. Available at: http://doi.org/10.1371/journal.pbio.1002456.
Nuijten, M. B., Borghuis, J., Veldkamp, C. L. S., Dominguez-Alvarez, L., Van Assen, M. A. L. M. & Wicherts, J. M. (2017) Journal data sharing policies and statistical reporting inconsistencies in psychology. Collabra: Psychology 3(1):122. Available at: http://doi.org/10.1525/collabra.102.
Nuijten, M. B., Hartgerink, C. H. J., Van Assen, M. A. L. M., Epskamp, S. & Wicherts, J. M. (2016) The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods 48(4):1205–26. Available at: http://doi.org/10.3758/s13428-015-0664-2.
Petrocelli, J., Clarkson, J., Whitmire, M. & Moon, P. (2012) When ab ≠ c′: Published errors in the reports of single-mediator models. Behavior Research Methods 45(2):595–601. Available at: http://doi.org/10.3758/s13428-012-0262-5.
Royal Netherlands Academy of Arts and Sciences (2018) Replication studies: Improving reproducibility in the empirical sciences. KNAW.
Schönbrodt, F. D. (2018) p-checker: One-for-all p-value analyzer. Available at: http://shinyapps.org/apps/p-checker/.
Simmons, J. P., Nelson, L. D. & Simonsohn, U. (2011) False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22:1359–66. Available at: http://doi.org/10.1177/0956797611417632.
Steegen, S., Tuerlinckx, F., Gelman, A. & Vanpaemel, W. (2016) Increasing transparency through a multiverse analysis. Perspectives on Psychological Science 11(5):702–12.
Vanpaemel, W., Vermorgen, M., Deriemaecker, L. & Storms, G. (2015) Are we wasting a good crisis? The availability of psychological research data after the storm. Collabra 1(1):15. Available at: http://doi.org/10.1525/collabra.13.
Veldkamp, C. L. S., Nuijten, M. B., Dominguez-Alvarez, L., van Assen, M. A. L. M. & Wicherts, J. M. (2014) Statistical reporting errors and collaboration on statistical analyses in psychological science. PLoS One 9(12):e114876. Available at: http://doi.org/10.1371/journal.pone.0114876.
Wicherts, J. M., Bakker, M. & Molenaar, D. (2011) Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLoS One 6(11):e26828. Available at: http://doi.org/10.1371/journal.pone.0026828.
Wicherts, J. M., Borsboom, D., Kats, J. & Molenaar, D. (2006) The poor availability of psychological research data for reanalysis. American Psychologist 61:726–28. Available at: http://doi.org/10.1037/0003-066X.61.7.726.