
Science with or without statistics: Discover-generalize-replicate? Discover-replicate-generalize?

Published online by Cambridge University Press:  10 February 2022

John P.A. Ioannidis*
Affiliation:
Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, CA 94305, USA. jioannid@stanford.edu

Abstract

Overstated generalizability (external validity) is common in research. It may coexist with inflation of the magnitude and statistical support for effects and with dismissal of internal validity problems. Generalizability may be secured before attempting replication of proposed discoveries, or replication may precede efforts to generalize. These opposite approaches may, respectively, decrease or increase the use of inferential statistics, each with advantages and disadvantages.

Type
Open Peer Commentary
Copyright
Copyright © The Author(s), 2022. Published by Cambridge University Press

Inflated claims are prevalent in research, and the reward system facilitates them (Smaldino & McElreath, 2016). Not only the magnitude and statistical support of effects, but also the narratives researchers craft around these effects, can be inflated. To evaluate effect inflation, one can scrutinize numbers presented with specific metrics and/or subjected to specific statistical inferential tools. Errors and deficiencies in internal validity can also be modeled or probed within the same quantitative machinery. Conversely, inflation in the accompanying narrative that tries to instill meaning, relevance, and breadth into a scientific investigation evades quantification. Some of that inflation pertains to silencing or underestimating internal validity problems and limitations. The most egregious narrative boosting, however, pertains to external validity, aka generalizability. Ignoring, silencing, or downplaying sources of variability; putting a spin on results so they read as more important than they are (Boutron & Ravaud, 2018); extrapolating to a broader paradigm than the narrowly focused data would allow – all are common problems. Moreover, for applied research that carries decision-making implications, inferring broadly actionable results is the end-product of that expansive narrative.

Inflation of effects, downplaying of internal validity concerns, and overstated generalizability often coexist. It is easier to overstate generalizability when effect sizes, statistical significance, or any other type of statistical support seem stronger, and thus more immune to error. Supposedly, stronger effects may withstand a greater assault from bias and allow a greater leap of faith toward generalizability. However, this is a misconception. In reality, the opposite may be true. Large effects and strong statistical support may simply herald the presence of more bias and the least generalizability, that is, deficits in internal validity, external validity, or both (Ioannidis, 2016). The most erroneous data and studies and the most extreme, outlying, non-representative situations and conditions may yield the most astonishingly large effects. Whenever scientists discover a large effect in their research, they should be particularly worried. The first step should be to go back and find out whether some major error has occurred. When no error is found, the second step is to consider why this stupendous effect may represent a very unusual situation, with little or no relevance in most other settings.
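To make this concern concrete, the following is a minimal toy simulation sketch (not from the commentary; all numbers are hypothetical): when many noisy studies probe the same modest true effect, the studies that happen to report the largest effects are the ones most contaminated by sampling error, so an astonishing effect size is, by itself, a reason for suspicion rather than confidence.

```python
# Illustrative toy simulation (hypothetical numbers): many noisy studies
# estimate the same modest true effect; the studies reporting the largest
# effects are the ones most inflated by sampling error.
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.1      # modest true standardized effect, identical in every study
se = 0.2               # sampling standard error of each study's estimate
n_studies = 10_000

estimates = rng.normal(true_effect, se, size=n_studies)

# Across all studies, the estimates are roughly unbiased around 0.1 ...
print(f"mean of all estimates:      {estimates.mean():.3f}")

# ... but conditioning on an "astonishing" observed effect selects for error,
# not for a genuinely large underlying effect.
large = estimates[estimates > 0.5]
print(f"mean of estimates > 0.5:    {large.mean():.3f}")
print(f"share of studies above 0.5: {large.size / n_studies:.1%}")
```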

In trying to remedy this situation, different solutions have been proposed, and some of them pull in opposite directions. To neutralize excessive, unwarranted claims of discovered effects, one solution is to submit them to exact replication with the hope that, if properly done, false-positive effects will be refuted (Nosek & Errington, 2020). The sequence goes: discover-replicate-generalize, or, in other words, try to replicate first and, if the finding replicates, then try to see how far it can generalize to other, different, more expansive settings. A second solution, espoused by Yarkoni, is to give priority to generalizability. The sequence goes: discover-generalize-replicate, that is, do not waste time with replication unless a promising research finding has been probed in a sufficiently large variety of settings to give some sense that it is generalizable (and even remotely worthy of replication). In its extreme form, this approach would give the search for generalizability not just priority but also dominance. Research would become mostly an exploration of variability and of the boundaries of generalizability.

These solutions may have different implications for the extent to which inferential statistics should be used. The "discover-replicate-generalize" sequence would require inferential statistics to be deployed and, if anything, strengthened compared with current practices. Other safeguards such as prespecification and registration are also essential. Not only the main effects, but also issues of their internal validity, should be modeled as rigorously as possible with the best statistical methods and inference tools. In fact, if internal validity cannot be secured or properly accounted for with appropriate quantitative methods, rushing into replication would be a nuisance: the same errors would be carried forward, unopposed and unaccounted for.

Conversely, with the "discover-generalize-replicate" sequence, it is tempting to postpone and thus diminish the use of inferential statistics in the research process. Research becomes mostly a process of description, a collection of notes and observations, like collecting stamps or butterflies and marveling at how different they are. One may even suspect an undertone of cynicism in this approach: because most observations are likely to be misleading and/or non-generalizable, we should not make too much of them. We should not take them, or ourselves as researchers, too seriously. This guidance aims to avoid having too many false positives, not by eliminating them, but by not allowing them to be called "positives" in the first place.

The choice between the two strategies is not straightforward – and any choice may not be generalizable! Different disciplines and types of scientific investigation may need a different mix. However, any effort to fix the misuse of statistics simply by removing statistics or statistical rules (no matter how imperfect those rules) may not necessarily make things better and may lead to an even worse "free lunch" situation (Ioannidis, 2019). Weird, exaggerated claims will still be made. In the absence of any statistical obstacle, they may be made even more easily and with even less restraint. At the extreme, the "generalizability-first" strategy may end up making science not much different from a competition among fiction writers coming up with qualitative narratives, without any clear rules on which narrative should be preferred over others. For applied science where decisions are pressing, decision-making may become even more subjective and biased – and it is already too subjective and biased in many circumstances.

At the same time, the major problem of over-generalizing with the blessing of statistical rituals, replication, and all the rest cannot be overstated. Poorly used statistics only exacerbate the problem, because they give these misleading claims a false aura of quantitative legitimacy. Perhaps, instead of less statistics and less quantification, one needs more and better. More appropriate models may incorporate more of the known and unknown variability and generate wider (or at least fairer) estimates of uncertainty. Then, perhaps, there will be fewer candidate findings considered worth the effort of replication – let alone of generalization.
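As one illustration of that last point, here is a minimal sketch (with hypothetical study estimates, not data from any cited work) of how a pooling model that admits between-study variability – a crude random-effects analysis using a DerSimonian-Laird heterogeneity estimate – yields a wider uncertainty interval than a fixed-effect pooling of the same numbers.

```python
# Illustrative sketch (hypothetical numbers): pooling the same five study
# estimates under a fixed-effect model versus a random-effects model that
# adds a between-study variance term (tau^2). Acknowledging heterogeneity
# widens the uncertainty interval around the pooled effect.
import numpy as np

effects = np.array([0.05, 0.30, 0.10, 0.45, 0.20])   # study estimates
se = np.array([0.08, 0.10, 0.07, 0.12, 0.09])        # their standard errors

def pooled(effects, se, tau2=0.0):
    """Inverse-variance pooled estimate and its 95% interval."""
    w = 1.0 / (se**2 + tau2)
    est = np.sum(w * effects) / np.sum(w)
    pooled_se = np.sqrt(1.0 / np.sum(w))
    return est, est - 1.96 * pooled_se, est + 1.96 * pooled_se

# Fixed effect: assumes every study estimates one identical true effect.
print("fixed effect:   %.3f (%.3f, %.3f)" % pooled(effects, se))

# Random effects: a crude DerSimonian-Laird estimate of tau^2 lets the
# model absorb between-study variability, so the interval widens.
w = 1.0 / se**2
q = np.sum(w * (effects - np.sum(w * effects) / np.sum(w))**2)
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (q - (len(effects) - 1)) / c)
print("random effects: %.3f (%.3f, %.3f)" % pooled(effects, se, tau2))
```

In this toy example the random-effects interval is wider precisely because the model is allowed to absorb variability across settings, which is the kind of honest widening of uncertainty the paragraph above argues for.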

Financial support

METRICS has been supported by grants from the Laura and John Arnold Foundation. The work of John Ioannidis is supported by an unrestricted gift from Sue and Bob O'Donnell.

Conflicts of interest

None.

References

Boutron, I., & Ravaud, P. (2018). Misrepresentation and distortion of research in biomedical literature. Proceedings of the National Academy of Sciences of the USA, 115(11), 2613–2619.
Ioannidis, J. P. (2016). Exposure-wide epidemiology: Revisiting Bradford Hill. Statistics in Medicine, 35(11), 1749–1762.
Ioannidis, J. P. (2019). The importance of predefined rules and prespecified statistical analyses: Do not abandon significance. JAMA, 321(21), 2067–2068.
Nosek, B. A., & Errington, T. M. (2020). What is replication? PLoS Biology, 18(3), e3000691.
Smaldino, P. E., & McElreath, R. (2016). The natural selection of bad science. Royal Society Open Science, 3(9), 160384.