Bakker et al. (2012) state the average replication probability of empirical studies in psychology as 1 − β-error = 0.36. Originating in Neyman–Pearson test theory (NPTT), the 1 − β-error is also known as test power. Prior to data collection, it estimates the probability that a replication attempt duplicates an original study's data signature. Because 1 − β-error = 0.36 “predicts” the estimated actual replication rate of 36% that the Open Science Collaboration (2015) reports, we may cautiously interpret this rate as a consequence of realizing NPTT empirically.
In seeming “fear” that a random process (H₀) may have generated our data (D), we (rightly) demand that D feature a low α-error, p(D, H₀) ≤ α = 0.05. We nevertheless regularly allow such “low α” data to feature a high average β-error = 0.64 (Open Science Collaboration 2015). A similarly high β-error value is unproblematic, of course, if we use a composite H₁ hypothesis, for a composite H₁ simply fails to point-specify the effect size that calculating the β-error presupposes. We thus cannot but ignore the data's replication probability.
By contrast, point-specifying three parameters – the α-error, the actual sample size (N), and the effect size – lets NPTT infer the β-error. By the same logic, point-specifying the effect size as well as the α- and β-error (e.g., α = β ≤ 0.05) lets NPTT infer the minimum sample size that suffices to register this effect as a statistically significant, nonrandom data signature. Hence, NPTT generally serves to plan well-powered studies.
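To illustrate, here is a minimal Python sketch of these two NPTT calculations for a two-sample design, using the power routines in statsmodels; the effect size (d = 0.5), the per-group N, and the error levels are merely illustrative assumptions, not values taken from the studies cited above.

```python
# A minimal sketch of the two NPTT calculations described above, using the
# two-sample t-test power routines in statsmodels. All numbers are
# illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# (1) Point-specify alpha, the actual per-group N, and the effect size
#     to infer the beta-error (1 - power).
power = analysis.power(effect_size=0.5, nobs1=50, alpha=0.05)
print(f"power = {power:.2f}, beta-error = {1 - power:.2f}")

# (2) Point-specify the effect size together with alpha = beta = 0.05
#     to infer the minimum per-group N for a well-powered study.
n_min = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.95)
print(f"minimum N per group ≈ {n_min:.0f}")
```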
To an underpowered original study – one featuring 1 − β-error < 0.95, that is – successful data replication thus matters, for it raises our confidence that the original study made a discovery. A well-powered original study, however, already features α = β ≤ 0.05. Hence, if the replication attempt's error probabilities are at least as large (as is typical), then replicating a well-powered study's nonrandom data signature restates, but cannot raise, our confidence that successful data replication is highly probable. Except where we can decrease the error probabilities, therefore, fallible knowledge that replicating the data signature of a well-powered study is highly probable does not require actual data replication.
What the target article's authors call “direct replication” thus amounts to a data replication. For rather than use a theory to point-predict an effect, we use the actual N, the actual α-error, and a stipulated β-error to induce the effect size from the data. We must assess a direct replication by estimating its test power, which is calculable only if the H₀ and H₁ hypotheses are both point-specified. Here, the H₀ invariably states a random data distribution. If the point effect that the H₁ postulates is uncertain, we may instead predict an interval H₁ hypothesis. (Its endpoints qualify as theoretical predictions, and the midpoint as a theoretical assumption.) We consequently obtain test power either as a point value or as an interval.
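A short sketch of both quantities follows, again assuming statsmodels and hypothetical numbers (N = 40 per group, a stipulated β = 0.05, and an interval H₁ of d between 0.3 and 0.7): the effect size is induced from the design rather than predicted by a theory, and an interval H₁ yields test power as an interval rather than a point value.

```python
# A sketch of the "induced" quantities described above. The numbers
# (N = 40 per group, beta = 0.05, interval H1 with d in [0.3, 0.7])
# are hypothetical.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Effect size induced from the actual N, the actual alpha, and a
# stipulated beta, rather than point-predicted by a theory.
d_induced = analysis.solve_power(nobs1=40, alpha=0.05, power=0.95)
print(f"detectable effect size d ≈ {d_induced:.2f}")

# An interval H1 yields test power as an interval.
power_low = analysis.power(effect_size=0.3, nobs1=40, alpha=0.05)
power_high = analysis.power(effect_size=0.7, nobs1=40, alpha=0.05)
print(f"power interval ≈ [{power_low:.2f}, {power_high:.2f}]")
```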
In both cases, calculating test power lets our methodological focus shift from the discovery context to the justification context (Witte & Zenker 2017b). In the former context, we evaluate data given hypotheses by studying the error rates of data under the H₀ and H₁ distributions, and so compare p(D, H₀) with p(D, H₁). In the latter context, by contrast, we evaluate hypotheses given data by studying the likelihood ratio (LR) L(H₁|D)/L(H₀|D). Because a fair test assigns equal priors, p(H₀) = p(H₁), the LR is numerically identical to the Bayes factor. Moreover, setting the hypothesis corroboration threshold to (1 − β-error)/α-error makes this a Wald test (Wald 1947). Desirably, as N increases, the test's results thus asymptotically approach the nominal percentages of false-positive and false-negative errors.
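A minimal worked sketch of such a justification-context test follows, assuming a normally distributed sample mean with known σ and purely hypothetical numbers (a point H₁ of δ = 0.5, σ = 1, N = 60, and an observed mean of 0.42): the LR of the point H₁ against H₀ is compared with the threshold (1 − β)/α.

```python
# A minimal sketch of the justification-context test described above:
# the likelihood ratio of a point H1 (mean delta) against H0 (mean 0)
# for a normally distributed sample mean, compared with the threshold
# (1 - beta)/alpha. All numbers are hypothetical.
from scipy.stats import norm

alpha, beta = 0.05, 0.05
delta, sigma, n = 0.5, 1.0, 60
observed_mean = 0.42            # hypothetical sample mean

se = sigma / n ** 0.5           # standard error of the mean
lr = norm.pdf(observed_mean, loc=delta, scale=se) / norm.pdf(observed_mean, loc=0.0, scale=se)

threshold = (1 - beta) / alpha  # = 19 for alpha = beta = 0.05
print(f"LR = {lr:.1f}, corroboration threshold = {threshold:.0f}")
print("H1 corroborated" if lr > threshold else "no decision yet")
```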
Data replication then matters, but what counts is the replicated corroboration of a theoretical hypothesis, as per LR(H₁/H₀) > (1 − β-error)/α-error. This is what the target article's authors call “conceptual replication.” Compared with an H₁ that merely postulates significantly nonrandom data, the theory-based, point-specified effect that a conceptual replication presupposes is more informative, of course. We can hence do more than run mere twin experiments. Crucially, as one accumulates the likelihoods obtained from individual experiments, several conceptually replicated experiments together may (ever more firmly) jointly corroborate, or falsify, a theoretical prediction (Wald 1947; Witte & Zenker 2016a; 2016b; 2017a; 2017b). (Psychology could only gain from accumulating such methodologically well-hardened facts; see Lakatos [1978].) Provided we test a point effect fairly, then, conceptual replication is a genuine strategy to probabilistically support, or undermine, a theoretical construct.
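The accumulation across experiments can be sketched in the same spirit, here with three hypothetical per-experiment LR values and Wald-style decision boundaries: the product of the LRs is compared with the upper boundary (1 − β)/α (corroboration) and the lower boundary β/(1 − α) (falsification).

```python
# A sketch of accumulating the likelihood ratios of several conceptual
# replications, in the spirit of Wald's sequential test. The
# per-experiment LR values are hypothetical.
import math

alpha, beta = 0.05, 0.05
upper = (1 - beta) / alpha        # corroborate the theoretical prediction
lower = beta / (1 - alpha)        # falsify the theoretical prediction

experiment_lrs = [3.2, 1.8, 4.1]  # hypothetical LRs from three experiments
log_lr_total = sum(math.log(lr) for lr in experiment_lrs)
cumulative_lr = math.exp(log_lr_total)

if cumulative_lr > upper:
    print(f"jointly corroborated: LR = {cumulative_lr:.1f} > {upper:.0f}")
elif cumulative_lr < lower:
    print(f"jointly falsified: LR = {cumulative_lr:.1f} < {lower:.3f}")
else:
    print(f"accumulate further experiments: LR = {cumulative_lr:.1f}")
```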
As to how psychological tests correspond to theoretical variables, several different measures currently serve to validate tests (Lord & Novick 1968). In fact, accepting one such test as a measurement procedure for a dispositional variable (e.g., personality, intelligence) lets this test dictate how we estimate the focal variable practically. A comparable strategy to validate experiments, by contrast, seems to be missing. Perhaps it is for this reason that disagreements regarding an experiment's quality often appear purely subjective.
From a theoretical viewpoint, however, test validation strategies are equivalent to experiment validation strategies, for “[t]he validity coefficient is the correlation of an observed variable with some theoretical construct (latent variable) of interest” (Lord & Novick 1968, p. 261, italics added). Indeed, this identity is what warrants our interpreting an experimental setting as the empirical realization of a theoretical construct. We may consequently treat the difference, or correlation, between the individual measurements in the experimental and control groups as an experiment's validity coefficient.
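As one hedged illustration of this reading, the point-biserial correlation between group membership and the individual measurements can play the role of such an experiment's validity coefficient; the data below are purely hypothetical.

```python
# A minimal sketch of treating the experimental-vs-control contrast as a
# validity coefficient: the point-biserial correlation between group
# membership (0 = control, 1 = experimental) and the individual
# measurements. The data are hypothetical.
import numpy as np
from scipy.stats import pointbiserialr

control = np.array([4.1, 3.8, 5.0, 4.4, 4.7, 3.9])
experimental = np.array([5.2, 5.8, 4.9, 6.1, 5.5, 5.7])

group = np.concatenate([np.zeros(len(control)), np.ones(len(experimental))])
scores = np.concatenate([control, experimental])

r, p = pointbiserialr(group, scores)
print(f"experiment's validity coefficient r = {r:.2f} (p = {p:.3f})")
```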
This difference/correlation is valid only if we can exclude alternative explanations that cite various internal or external influences. Compared with the significant workload that preregistration approaches require (Hagger et al. 2016), for instance, validating an experiment is yet more effortful. For we must establish that (1) participants can, and do, interpret our experimental setting as intended; (2) they are motivated to display the corresponding behavior; and (3) an independent observer can adequately evaluate their reactions (Witte & Melville 1982). Indeed, an overly simplistic manipulation check is “an obstacle toward cumulative science” (Fayant et al. 2017, p. 125). Therefore, successfully replicating a point-specified effect is sound only if each individual experiment is valid.
In sum, an experiment validation strategy that renders the effort of constructing a valid experiment worthwhile should rest not on data alone, but also on how well we theoretically predict a focal phenomenon. Only if several labs then achieve a replicated hypothesis corroboration (by testing the LR fairly) could replication provide the gold standard that a theoretically progressive version of empirical psychology requires.