
Data replication matters to an underpowered study, but replicated hypothesis corroboration counts

Published online by Cambridge University Press:  27 July 2018

Erich H. Witte
Affiliation: Institute for Psychology, University of Hamburg, 20146 Hamburg, Germany. witte_e_h@uni-hamburg.de; https://www.psy.uni-hamburg.de/personen/prof-im-ruhestand/witte-erich.html

Frank Zenker
Affiliation: Departments of Philosophy and Cognitive Science, Lund University, SE-221 00 Lund, Sweden. frank.zenker@fil.lu.se; http://www.fil.lu.se/en/person/FrankZenker/

Abstract

Before replication becomes mainstream, the potential for generating theoretical knowledge had better be clear. Replicating statistically significant nonrandom data shows that an original study made a discovery; replicating a specified theoretical effect shows that an original study corroborated a theory. Yet only in the latter case is replication a necessary, sound, and worthwhile strategy.

Type: Open Peer Commentary
Copyright © Cambridge University Press 2018

Bakker et al. (2012) put the average replication probability of empirical studies in psychology at 1 − β = 0.36. Originating in Neyman-Pearson test theory (NPTT), the quantity 1 − β (one minus the β-error) is also known as test power. Prior to data collection, it estimates the probability that a replication attempt duplicates an original study's data signature. Because 1 − β = 0.36 "predicts" the estimated actual replication rate of 36% that the Open Science Collaboration (2015) reports, we may cautiously interpret this rate as a consequence of realizing NPTT empirically.
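Such a power figure can be reproduced with standard tools. The following is a minimal sketch in Python; the effect size and sample size are invented placeholders meant only to show the calculation, not the values Bakker et al. (2012) analyzed, and statsmodels is simply one convenient (assumed) implementation choice.

```python
# Sketch: test power (1 - beta) for a two-sample t-test, given a
# point-specified effect size, sample size, and alpha-error.
# The effect and sample sizes below are illustrative, not Bakker et al.'s data.
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().power(effect_size=0.5,  # assumed Cohen's d under H1
                              nobs1=20,         # assumed n per group
                              alpha=0.05)       # conventional alpha-error
print(f"1 - beta = {power:.2f}")                # roughly 0.34 for these values
```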

In seeming "fear" that a random process (H₀) may have generated our data (D), we (rightly) demand that D feature a low α-error, p(D, H₀) ≤ α = 0.05. We nevertheless regularly allow such "low α" data to feature a high average β-error of 0.64 (Open Science Collaboration 2015). Such a high β-error appears unproblematic, of course, if we use a composite H₁ hypothesis, for a composite H₁ simply fails to point-specify the effect size that calculating the β-error presupposes. So we cannot but ignore the replication probability of data.

By contrast, point-specifying three parameters – the α-error, the actual sample size (N), and the effect size – lets NPTT infer the β-error. By the same logic, point-specifying the effect size as well as the α- and β-errors (e.g., α = β ≤ 0.05) lets NPTT infer the minimum sample size that suffices to register this effect as a statistically significant, nonrandom data signature. Hence, NPTT generally serves to plan well-powered studies.
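A minimal sketch of this sample-size direction, assuming a point effect of d = 0.5 (an invented placeholder) and the symmetric criterion α = β = 0.05 mentioned above:

```python
# Sketch: minimum n per group for a two-sample t-test to register an assumed
# point effect (d = 0.5) with alpha = beta = 0.05, i.e., power = 0.95.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.5,  # assumed point effect
                                          alpha=0.05,       # alpha-error
                                          power=0.95)       # 1 - beta
print(f"minimum n per group: {n_per_group:.0f}")            # about 105
```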

To an underpowered original study – one featuring 1 − β < 0.95, that is – successful data replication thus matters, for it raises our confidence that the original study made a discovery. A well-powered original study, however, already features α = β ≤ 0.05. Hence, if the replication attempt's error probabilities are at least as large (as is typical), then replicating a well-powered study's nonrandom data signature restates, but cannot raise, our confidence that successful data replication is highly probable. Except where we can decrease the error probabilities, therefore, fallible knowledge that replicating the data signature of a well-powered study is highly probable does not require actual data replication.

What the target article's authors call "direct replication" thus amounts to a data replication. For rather than use a theory to point-predict an effect, we use the actual N, the actual α-error, and a stipulated β-error to induce the effect size from data. A direct replication we must assess by estimating its test power, itself calculable only if the H₀ and H₁ hypotheses are both point-specified. Here, the H₀ invariably states a random data distribution. In case the point effect the H₁ postulates is uncertain, we may alternatively predict an interval H₁ hypothesis. (Its endpoints qualify as theoretical predictions, and its midpoint as a theoretical assumption.) We consequently obtain test power either as a point value or as an interval.
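For the interval case, one way to obtain test power as an interval is to evaluate it at the interval's endpoints; the endpoint effect sizes and the sample size below are invented for the sketch.

```python
# Sketch: test power as an interval, evaluated at the endpoints of an
# interval H1 (assumed here: 0.3 <= d <= 0.7) with n = 50 per group.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
low, high = (analysis.power(effect_size=d, nobs1=50, alpha=0.05)
             for d in (0.3, 0.7))
print(f"test power lies in [{low:.2f}, {high:.2f}]")
```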

In both cases, calculating test power lets our methodological focus shift from the discovery context to the justification context (Witte & Zenker 2017b). In the former context, we evaluate data given hypotheses by studying the error rates of the data under the H₀ and H₁ distributions, and so compare p(D, H₀) with p(D, H₁). In the latter context, by contrast, we evaluate hypotheses given data by studying the likelihood ratio (LR) L(H₁|D)/L(H₀|D). Because a fair test assigns equal priors, p(H₀) = p(H₁), the LR is numerically identical to the Bayes factor. Moreover, setting the hypothesis corroboration threshold to (1 − β)/α makes this a Wald test (Wald 1947). Desirably, as N increases, test results thus asymptotically approach the nominal percentages of false-positive and false-negative errors.
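A minimal sketch of this likelihood-ratio comparison for two point hypotheses about a normal mean; the data, the point effect, and the (assumed known) standard deviation are all invented for illustration.

```python
# Sketch: likelihood ratio L(H1|D)/L(H0|D) for two point hypotheses about a
# normal mean (H0: mu = 0 vs. H1: mu = 0.5, sigma assumed known and = 1),
# compared against the corroboration threshold (1 - beta)/alpha.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=0.5, scale=1.0, size=40)   # invented sample

alpha, beta = 0.05, 0.05
mu0, mu1, sigma = 0.0, 0.5, 1.0                  # point-specified hypotheses

log_lr = np.sum(norm.logpdf(data, mu1, sigma) - norm.logpdf(data, mu0, sigma))
threshold = (1 - beta) / alpha                   # = 19 for alpha = beta = 0.05

print(f"LR = {np.exp(log_lr):.1f}, threshold = {threshold:.0f}")
if np.exp(log_lr) > threshold:
    print("H1 corroborated at this threshold")
```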

Data replication then matters, but what counts is the replicated corroboration of a theoretical hypothesis, as per L(H₁|D)/L(H₀|D) > (1 − β)/α. This is what the target article's authors call "conceptual replication." Compared with an H₁ that merely postulates significantly nonrandom data, the theory-based, point-specified effect that a conceptual replication presupposes is more informative, of course. We can hence do more than run mere twin experiments. Crucially, as one accumulates the likelihoods obtained from individual experiments, several conceptually replicated experiments together may (ever more firmly) jointly corroborate, or falsify, a theoretical prediction (Wald 1947; Witte & Zenker 2016a; 2016b; 2017a; 2017b). (Psychology could only gain from accumulating such methodologically well-hardened facts; see Lakatos [1978].) Provided we test a point effect fairly, then, conceptual replication is a genuine strategy to probabilistically support, or undermine, a theoretical construct.
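How such an accumulation might proceed can be sketched in Wald's sequential fashion; the form of the stopping bounds follows Wald (1947), while the per-experiment data and the point hypotheses are invented.

```python
# Sketch: accumulating log-likelihood ratios across experiments (Wald 1947).
# Corroborate H1 once the cumulative LR exceeds A = (1 - beta)/alpha;
# corroborate H0 (falsify the prediction) once it falls below B = beta/(1 - alpha).
import numpy as np
from scipy.stats import norm

alpha, beta = 0.05, 0.05
A, B = (1 - beta) / alpha, beta / (1 - alpha)    # upper and lower stopping bounds
mu0, mu1, sigma = 0.0, 0.5, 1.0                  # assumed point hypotheses

rng = np.random.default_rng(7)
cum_log_lr = 0.0
for experiment in range(1, 11):                  # up to ten invented experiments
    data = rng.normal(loc=0.5, scale=1.0, size=30)
    cum_log_lr += np.sum(norm.logpdf(data, mu1, sigma) - norm.logpdf(data, mu0, sigma))
    if cum_log_lr >= np.log(A):
        print(f"H1 corroborated after experiment {experiment}")
        break
    if cum_log_lr <= np.log(B):
        print(f"H1 falsified after experiment {experiment}")
        break
else:
    print("no decision yet; continue accumulating experiments")
```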

As to how psychological tests correspond to theoretical variables, several different measures currently serve to validate tests (Lord & Novick 1968). In fact, accepting one such test as a measurement procedure for a dispositional variable (e.g., personality, intelligence) lets this test dictate how we estimate the focal variable in practice. A comparable strategy to validate experiments, by contrast, seems to be missing. Perhaps it is for this reason that disagreements regarding an experiment's quality often appear purely subjective.

From a theoretical viewpoint, however, test validation strategies are equivalent to experiment validation strategies, for "[t]he validity coefficient is the correlation of an observed variable with some theoretical construct (latent variable) of interest" (Lord & Novick 1968, p. 261, italics added). Indeed, this identity is what warrants our interpreting an experimental setting as the empirical realization of a theoretical construct. We may consequently treat the difference, or correlation, between the individual measurements in the experimental and control groups as an experiment's validity coefficient.
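A minimal sketch of this reading of the validity coefficient, computed as the point-biserial correlation between dummy-coded group membership and individual measurements; the scores are invented for illustration.

```python
# Sketch: an experiment's "validity coefficient" as the (point-biserial)
# correlation between dummy-coded group membership and individual scores.
import numpy as np

experimental = np.array([5.1, 6.3, 5.8, 6.9, 5.5, 6.1, 6.6, 5.9])  # invented
control      = np.array([4.2, 5.0, 4.7, 5.3, 4.5, 4.9, 5.1, 4.4])  # invented

scores = np.concatenate([experimental, control])
group  = np.concatenate([np.ones_like(experimental), np.zeros_like(control)])

validity_coefficient = np.corrcoef(group, scores)[0, 1]
mean_difference = experimental.mean() - control.mean()

print(f"validity coefficient r = {validity_coefficient:.2f}")
print(f"mean difference        = {mean_difference:.2f}")
```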

This difference/correlation is valid only if we can exclude alternative explanations that cite various internal or external influences. Compared with the significant workload that preregistration approaches require (Hagger et al. 2016), for instance, validating an experiment is more effortful still. For we must establish that (1) participants can, and do, interpret our experimental setting as intended; (2) they are motivated to display the corresponding behavior; and (3) an independent observer can adequately evaluate their reactions (Witte & Melville 1982). Indeed, an overly simplistic manipulation check is "an obstacle toward cumulative science" (Fayant et al. 2017, p. 125). Therefore, successfully replicating a point-specified effect is sound only if each individual experiment is valid.

In sum, an experiment validation strategy that renders the effort of constructing a valid experiment worthwhile should rest not on data alone, but also on how well we theoretically predict a focal phenomenon. Only if several labs then achieve a replicated hypothesis corroboration (by testing the LR fairly) could replication provide the gold standard that a theoretically progressive version of empirical psychology requires.

References

Bakker, M., van Dijk, A. & Wicherts, J. M. (2012) The rules of the game called psychological science. Perspectives on Psychological Science 7(6):543–54. Available at: http://doi.org/10.1177/1745691612459060.
Fayant, M. P., Sigall, H., Lemonnier, A., Retsin, E. & Alexopoulos, T. (2017) On the limitations of manipulation checks: An obstacle toward cumulative science. International Review of Social Psychology 30(1):125–30. Available at: https://doi.org/10.5334/irsp.102.
Hagger, M. S., Chatzisarantis, N. L. D., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., Brand, R., Brandt, M. J., Brewer, G., Bruyneel, S., Calvillo, D. P., Campbell, W. K., Cannon, P. R., Carlucci, M., Carruth, N. P., Cheung, T., Crowell, A., De Ridder, D. T. D., Dewitte, S., Elson, M., Evans, J. R., Fay, B. A., Fennis, B. M., Finley, A., Francis, Z., Heise, E., Hoemann, H., Inzlicht, M., Koole, S. L., Koppel, L., Kroese, F., Lange, F., Lau, K., Lynch, B. P., Martijn, C., Merckelbach, H., Mills, N. V., Michirev, A., Miyake, A., Mosser, A. E., Muise, M., Muller, D., Muzi, M., Nalis, D., Nurwanti, R., Otgaar, H., Philipp, M. C., Primoceri, P., Rentzsch, K., Ringos, L., Schlinkert, C., Schmeichel, B. J., Schoch, S. F., Schrama, M., Schütz, A., Stamos, A., Tinghög, G., Ullrich, J., vanDellen, M., Wimbarti, S., Wolff, W., Yusainy, C., Zerhouni, O. & Zwienenberg, M. (2016) A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science 11(4):546–73.
Lakatos, I. (1978) The methodology of scientific research programs, vol. I. Cambridge University Press.
Lord, F. M. & Novick, M. R. (1968) Statistical theories of mental test scores. Addison-Wesley.
Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science 349(6251):aac4716. Available at: http://doi.org/10.1126/science.aac4716.
Wald, A. (1947) Sequential analysis. Wiley.
Witte, E. H. & Melville, P. (1982) Experimentelle Kleingruppenforschung: Methodologische Anmerkungen und eine empirische Studie. [Experimental small group research: Methodological remarks and an empirical study.] Zeitschrift für Sozialpsychologie 13:109–24.
Witte, E. H. & Zenker, F. (2016a) Reconstructing recent work on macro-social stress as a research program. Basic and Applied Social Psychology 38(6):301–307.
Witte, E. H. & Zenker, F. (2016b) Beyond schools – reply to Marsman, Ly & Wagenmakers. Basic and Applied Social Psychology 38(6):313–17.
Witte, E. H. & Zenker, F. (2017a) Extending a multilab preregistered replication of the ego-depletion effect to a research program. Basic and Applied Social Psychology 39(1):74–80.
Witte, E. H. & Zenker, F. (2017b) From discovery to justification: Outline of an ideal research program in empirical psychology. Frontiers in Psychology 8:1847. Available at: https://www.frontiersin.org/articles/10.3389/fpsyg.2017.01847/full.