
If we accept that poor replication rates are mainstream

Published online by Cambridge University Press:  27 July 2018

David M. Alexander
Affiliation:
Brain and Cognition, KU Leuven, Tiensestraat 102, BE-3000 Leuven, Belgium. david.alexander@kuleuven.be; http://www.perceptualdynamics.be
Pieter Moors
Affiliation:
Brain and Cognition, KU Leuven, Tiensestraat 102, BE-3000 Leuven, Belgium. pieter.moors@kuleuven.be; www.gestaltrevision.be

Abstract

We agree with the authors' arguments to make replication mainstream but contend that the poor replication record is symptomatic of a pre-paradigmatic science. Reliable replication in psychology requires abandoning group-level p-value testing in favor of real-time predictions of behaviors, mental events, and brain events. We argue for an approach based on the analysis of boundary conditions, in which measurement is closely motivated by theory.

Type
Open Peer Commentary
Copyright
Copyright © Cambridge University Press 2018 

We relish the authors' arguments to make replication mainstream. They have done a tremendous job in summarizing and countering objections to their cause. Yet acceptance that routine direct replication is crucial (and missing) has not yet reached a critical level among psychological scientists. We view the poor replication record as symptomatic of a pre-paradigmatic science. If we accept that poor replication rates are mainstream, then the really brutal methodological cold turkey has yet to be braved.

Despite its stated goals, psychology does not in practice aim to establish entirely reproducible effects. Zwaan et al.'s exhortation "to make replication mainstream" arises because of this contrast between goals and practice. Consider any run-of-the-mill journal in physics, chemistry, or some other well-established science in light of the concerns raised in Zwaan et al.'s last paragraph. Applying those concerns to any of the sciences listed would come across as odd, as the tail wagging the dog. Although such an assessment may be unpopular, a science without a core canon of directly reproducible results is not yet a science. Nevertheless, the present pre-paradigmatic phase of psychology arises from the field's stage of maturity, and not through any particular incompetence or dishonesty on the part of us scientists. Inevitably, psychology will still face a reproducibility problem 20 years from now, even when recommendations such as preregistration and open materials, data, and code are standard (cf. Meehl 1990a). Even results that are now technically reproducible are rarely reproducible in the predictive sense that they enable theoretically related problems to be solved in a straightforward fashion.

Many of the solutions suggested for the concerns highlighted by Zwaan et al. are decades old (Meehl 1967). The reproducibility crisis presents a sober occasion to revisit them, given our accumulating research experience. Our view is that psychology and cognitive neuroscience have succeeded, in part, by picking the low-hanging fruit. By this we mean gathering those results that can be distinguished by assuming a linear, low-dimensional measurement space, treating unparceled variance as "noise," and treating operational definitions of experimental manipulations as sufficient.

We argue that reliable replication requires reformulating the nature of the ceteris paribus clause ("holding everything else constant"). This clause is usually interpreted as requiring tight control of subjects' behavior, so that everything except the phenomenon of interest is excluded from influencing the experimental outcome. This restriction becomes problematic when applied to a complex nonlinear system such as a person embedded in an experimental environment. Instead, we propose that the object to be controlled is the entire experimental (and pseudo-naturalistic) space in which the phenomenon of interest is evoked (Manicas & Secord 1983). The goal is to explore exactly this space: to find out how the phenomenon changes over relevant parameters and where it is only trivially different. Rather than colliding opposing theoretical positions (debates on X vs. Y), the goal is to demarcate when one type of phenomenon (e.g., conscious, directed attention) becomes another (e.g., automatic attention), by defining its boundary conditions. A major focus of theory is then to commit to the experimental space being a certain (potentially nonlinear) shape and dimensionality (Hultsch & Hickey 1978; Wallot & Kelty-Stephen 2018). The "constant" of the ceteris paribus clause becomes the requirement to accurately (and repeatedly) position the subject in a desired portion of the theoretically defined experimental space. Importantly, this introduces direct theoretical criteria for deciding whether an experiment was run correctly, and it blurs the distinction between direct and conceptual replication.
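To make the idea concrete, here is a minimal sketch of what mapping an experimental space might look like, assuming a hypothetical two-parameter detection task with an invented response function standing in for real subject data; the parameter names and the 50% crossing criterion are illustrative assumptions, not part of the commentary.

```python
# A minimal sketch of charting an experimental space, assuming a hypothetical
# two-parameter detection task (stimulus contrast x presentation duration).
# The response function is an invented stand-in for real subject data.
import numpy as np

def response(contrast, duration_ms):
    """Hypothetical detection probability with a sigmoidal nonlinearity."""
    drive = contrast * np.log1p(duration_ms)
    return 1.0 / (1.0 + np.exp(-3.0 * (drive - 2.0)))

contrasts = np.linspace(0.05, 1.0, 40)
durations = np.linspace(10, 500, 40)
grid = np.array([[response(c, d) for d in durations] for c in contrasts])

# Demarcate the boundary: for each contrast, the duration at which the
# phenomenon crosses from near-chance to reliable detection (50% point).
for i in range(0, len(contrasts), 8):
    j = np.argmin(np.abs(grid[i] - 0.5))
    print(f"contrast={contrasts[i]:.2f}: boundary near {durations[j]:.0f} ms")
```

The output of interest is not a single difference test but the shape of the boundary itself, which a theory of the phenomenon would have to predict.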

To prevent new evidential walls made of loose bricks, we believe such a reformulation inherently requires abandoning null hypothesis significance testing (NHST) as the primary piece of evidence (Szucs & Ioannidis 2017b). Mere differences are insufficient to characterize the experimental space and to position a subject within it. Theory should provide us with point predictions (Lakens 2013; Widaman 2015). This approach allows us to explore the nonlinear nature of the experimental space and to explicitly motivate research practices such as data transformation and aggregation. For example, if data appear log-normally distributed, transformation is allowed only if theory states that the value range has geometric symmetries. What appears as a practical data-cleaning operation in mainstream NHST could be a gross distortion of the underlying phenomena when experiment, theory, and data analysis are required to be tightly intertwined.
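A minimal sketch of this contrast, on simulated data: the log-normal reaction times, the predicted value, and the tolerance below are all invented for illustration. The point is that the transform and the test are licensed by a theoretical claim (multiplicative, geometric structure) together with a point prediction, not by whether they make a p-value come out right.

```python
# A minimal sketch contrasting an NHST difference test with a theory-driven
# point prediction; all numerical values here are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
rts = rng.lognormal(mean=6.0, sigma=0.3, size=50)  # simulated reaction times (ms)

# NHST route: the log transform is treated as a "data-cleaning" step, and
# the test reports a mere difference from a baseline.
t, p = stats.ttest_1samp(np.log(rts), popmean=5.9)
print(f"NHST: t={t:.2f}, p={p:.3f}")

# Point-prediction route: the log transform is licensed only by a theoretical
# claim that the measure has geometric (multiplicative) structure, and the
# theory must then predict a specific value within a stated tolerance.
predicted = 6.0    # hypothetical theoretical point prediction (log ms)
tolerance = 0.05   # hypothetical theory-given precision
observed = np.log(rts).mean()
print(f"observed={observed:.3f}, predicted={predicted:.3f}, "
      f"supported: {abs(observed - predicted) < tolerance}")
```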

Furthermore, point predictions should be formulated at the individual rather than the group level. Much of our statistics was originally developed for agronomy, where individual kernel weights can be aggregated to (trivially) calculate yields for the crop field. This is generally not the case for the relationship between individual behaviors or neural measurements and the concomitant aggregate outcomes across subjects (Alexander et al. 2013; 2015; Estes 1956). Yet, in our experience, it is rare for a paper to be rejected because the authors have not shown that their measures behave linearly enough to bear the assumptions of aggregation methods such as cross-trial and cross-subject averaging.
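The classic illustration is Estes's (1956) warning that a group curve need not resemble any individual's curve. The following sketch, with invented all-or-none learners, shows cross-subject averaging producing a smooth, gradual learning curve even though every simulated individual improves in a single abrupt step.

```python
# A minimal sketch of the aggregation artifact discussed by Estes (1956):
# each simulated subject learns in one abrupt step, yet the cross-subject
# average looks like a smooth, gradual curve that describes no individual.
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_trials = 100, 40
curves = np.zeros((n_subjects, n_trials))
for s in range(n_subjects):
    switch = rng.integers(5, 35)   # trial on which this subject "gets it"
    curves[s, switch:] = 1.0       # step function: 0% correct, then 100%

group_mean = curves.mean(axis=0)
print(group_mean.round(2))  # gradual ramp, unlike any individual's step
```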

A consequence of individual-level predictions is that replication will then involve running additional subjects over a range of experimental conditions, each of which is a test of the theory. Thus, the proposed redefinition of the ceteris paribus clause may limit the otherwise onerous resource requirements of reproducing experimental results. Likewise, our redefinition mandates the conditions under which the vast knowledge base of (mostly linear) statistical assessment methods can be justifiably used: provided that experiment- and theory-dictated numerical transformations leave the data in a linear space, linear methods are available.

A side effect of adopting something like our present proposal is that it levels the playing field. Results from the history of psychological research cannot be regarded as certain until they have achieved successful replications of the kind that Zwaan et al. argue for. We further suggest that this will not occur until a framework is adopted that requires empirical feedback on the validity and success of each experimental manipulation and that theoretically mandates every post-experiment transformation of the data. This, in turn, will not occur until the bitter pill that non-replication is mainstream has been swallowed.

References

Alexander, D. M., Jurica, P., Trengove, C., Nikolaev, A. R., Gepshtein, S., Zvyagintsev, M., Mathiak, K., Schulze-Bonhage, A., Reuscher, J., Ball, T. & van Leeuwen, C. (2013) Traveling waves and trial averaging: The nature of single-trial and averaged brain responses in large-scale cortical signals. NeuroImage 73:95–112. Available at: https://doi.org/10.1016/j.neuroimage.2013.01.016.
Alexander, D. M., Trengove, C. & van Leeuwen, C. (2015) Donders is dead: Cortical traveling waves and the limits of mental chronometry in cognitive neuroscience. Cognitive Processing 16(4):365–75. Available at: https://doi.org/10.1007/s10339-015-0662-4.
Estes, W. K. (1956) The problem of inference from curves based on group data. Psychological Bulletin 53(2):134–40.
Hultsch, D. F. & Hickey, T. (1978) External validity in the study of human development: Theoretical and methodological issues. Human Development 21(2):76–91. Available at: https://doi.org/10.1159/000271576.
Lakens, D. (2013) Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology 4:863. Available at: https://doi.org/10.3389/fpsyg.2013.00863.
Manicas, P. T. & Secord, P. F. (1983) Implications for psychology of the new philosophy of science. American Psychologist 38(4):399–413. Available at: https://doi.org/10.1037/0003-066X.38.4.399.
Meehl, P. E. (1967) Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science 34(2):103–15.
Meehl, P. E. (1990a) Why summaries of research on psychological theories are often uninterpretable. Psychological Reports 66(1):195–244. Available at: https://doi.org/10.2466/pr0.1990.66.1.195.
Szucs, D. & Ioannidis, J. P. A. (2017b) When null hypothesis significance testing is unsuitable for research: A reassessment. Frontiers in Human Neuroscience 11:390. Available at: https://doi.org/10.3389/fnhum.2017.00390.
Wallot, S. & Kelty-Stephen, D. G. (2018) Interaction-dominant causation in mind and brain, and its implication for questions of generalization and replication. Minds and Machines 28(2):353–74. Available at: https://doi.org/10.1007/s11023-017-9455-0.
Widaman, K. (2015) Confirmatory theory testing: Moving beyond NHST. The Score Newsletter. Available at: http://www.apadivisions.org/division-5/publications/score/2015/01/issue.pdf.