As a remedy to the generalizability crisis, Yarkoni urges researchers to consider “cross-validation techniques that can minimize overfitting and provide alternative ways of assessing generalizability outside of the traditional inferential statistical framework” (Sec. 3.6.7). I believe this advice is valuable and worthy of elaboration.
Traditional model evaluation techniques are beset by (at least) two inconvenient truths. First, goodness-of-fit (GOF) and generalizability are inextricably tied to model complexity (defined by Myung, Pitt, and Kim [2004] as “a model's inherent flexibility that enables it to fit a wide range of data patterns” [p. 12]). As models become more complex, GOF to the observed data increases, but generalizability to unseen data decreases. Additionally, GOF indices conflate fit to the useful signal in the data with fit to the useless noise, and so must be adjusted to account for complexity. The widely used Akaike Information Criterion (Akaike, 1973), for example, mitigates the effects of complexity by penalizing for the number of parameters.
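To make this trade-off concrete, the brief R sketch below (an illustration of my own, not drawn from any of the cited studies) fits polynomials of increasing degree to simulated sine-plus-noise data: the log-likelihood improves with every added parameter, whereas the AIC eventually worsens once the complexity penalty outweighs the gain in fit.

```r
# Illustrative only: raw fit keeps improving with model complexity,
# but AIC's penalty of 2 per parameter eventually reverses the trend.
set.seed(1)
x <- seq(0, 1, length.out = 40)
y <- sin(2 * pi * x) + rnorm(40, sd = 0.3)   # true signal + noise

for (k in 1:8) {
  fit <- lm(y ~ poly(x, k))                  # polynomial of degree k
  cat(sprintf("degree %d: logLik = %7.2f, AIC = %7.2f\n",
              k, as.numeric(logLik(fit)), AIC(fit)))
}
```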
However, this leads to the second issue: Complexity cannot be fully assessed by simply counting parameters (and in fact, overfitting can occur with just one parameter; Piantadosi, 2018). Complexity is also affected by the configuration of variables in the model (Cutting, Bruno, Brady, & Moore, 1992): Models that organize the same number of parameters in different configurations may differ in terms of GOF. It follows from these two issues that researchers who rely exclusively on GOF and quantify complexity only by counting parameters are exacerbating the generalizability crisis.
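As a toy demonstration of the configuration point (my own construction, assuming nothing beyond base R), consider two models that each have exactly two parameters: the linear model y = a + b·x and the oscillatory model y = a·sin(b·x). When both are fit to pure noise, the oscillatory configuration “explains” far more variance on average, even though the parameter counts are identical.

```r
# Same number of parameters, different configurations, very different flexibility.
set.seed(2)
n  <- 10
x  <- seq(0, 1, length.out = n)
r2 <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)

sim <- replicate(500, {
  y   <- rnorm(n)                              # pure noise: there is no signal to recover
  lin <- r2(y, fitted(lm(y ~ x)))              # Model A: y = a + b*x
  osc <- max(sapply(seq(1, 200, length.out = 400), function(b) {
    s <- sin(b * x)                            # Model B: y = a*sin(b*x); 'a' is profiled
    r2(y, sum(s * y) / sum(s^2) * s)           # out by least squares, 'b' by grid search
  }))
  c(linear = lin, oscillatory = osc)
})
rowMeans(sim)  # the oscillatory model soaks up far more of the noise on average
```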
A solution to these problems can be found by bypassing probability theory altogether and adopting a technique from information theory. The principle of minimum description length (MDL; Rissanen, 1978, 1989) aims to separate regularity (i.e., meaningful information) from noise in the observed data and “squeeze out as much regularity as possible” (Grünwald, 2005, p. 15) via data compression. Suppose we have a sequence of nine binary digits that contains a regularity: twice as many 1s as 0s. The complete data space S includes 2⁹ = 512 patterns, but the regularity only applies to 84 (or 16.4%) of those patterns. Thus, our sequence belongs to a relatively small subset of S. A description (e.g., programming code) that compresses the complete data in this manner would be quite useful: We would know, for example, that future use of that code would return only those sequences that contain the same regularity.
According to the MDL principle, the best description (or model) is that which maximizes compression of S. Our nine-digit sequence could be further compressed: The regularity of “twice as many 1s as 0s + the first three digits are 1s” describes just 20 patterns, compressing the data to less than 4% of S. That is, over 96% of sequences would not follow this more precise regularity, so we should be “impressed” (in the sense of Meehl's [1990] rainfall analogy or Lakatos's [1978] example of Halley's comet) when we find a sequence that does.
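These counts are easy to verify by brute force. The snippet below (a quick arithmetic check, not part of any cited work) enumerates all 512 length-nine binary sequences and counts how many satisfy each regularity.

```r
# Enumerate the complete data space S (2^9 = 512 sequences) and count the
# sequences matching each regularity described above.
seqs <- expand.grid(rep(list(0:1), 9))        # all 512 sequences, one per row

reg1 <- rowSums(seqs) == 6                    # twice as many 1s as 0s
reg2 <- reg1 & seqs[, 1] == 1 & seqs[, 2] == 1 & seqs[, 3] == 1   # ...and first three digits are 1s

sum(reg1)   # 84  ->  84/512 = 16.4% of S
sum(reg2)   # 20  ->  20/512 =  3.9% of S
```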
What does this have to do with the generalizability crisis? In his introduction to MDL, Grünwald (2005) described two relevant features. First, “MDL procedures automatically and inherently protect against overfitting” (p. 5). GOF statistics may overfit the data by capturing both signal and noise, whereas MDL methods filter out that noise through data compression, allowing researchers to focus only on the signal. Second, “MDL methods can be interpreted as searching for a model with good predictive performance on unseen data” (p. 6). Mathematical proof of this statement can be found in Vitányi and Li (2000), who concluded that “compression of descriptions almost always gives optimal prediction” (p. 448).
Although MDL may seem obscure, consider it in light of this statement from Roberts and Pashler (2000) in Psychological Review, following their declaration that good fit cannot clarify what a theory predicts: “Without knowing how much a theory constrains possible outcomes, you cannot know how impressed to be when observation and theory are consistent” (p. 359). The phrase “a theory [that] constrains possible outcomes” can be rewritten in MDL terms as “a description that compresses the complete data space.” Through that translation, it becomes clear that the MDL principle encapsulates Meehl's (1997) argument that “the narrower the tolerated range of observable values, the riskier the test, and if the test is passed, the stronger the corroboration of the substantive theory” (p. 407).
Various methods have been developed to quantify the MDL principle (see Myung, Navarro, & Pitt, 2006; Navarro, 2004; Pitt, Myung, & Zhang, 2002), but their formulations involve statistical obstacles such as integration across the complete data space. To sidestep this intractability, quantitative psychologists have relied on simulation methods to gain MDL-type insights regarding latent variable models. Preacher (2006) generated 10,000 random correlation matrices to simulate the complete continuous data space and fit competing structural equation models with the same number of parameters but different configurations to each matrix (interested readers can conduct similar MDL-type studies using the ockhamSEM package in R; Falk & Muthukrishna, 2021). Even though the number of parameters was held constant, certain models had an inherent tendency to fit better than others (termed “fitting propensity”).
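A stripped-down version of such a study can be sketched in a few lines of R. The code below is a simplification of my own rather than the ockhamSEM implementation: it relies on the lavaan package, generates Wishart-based random correlation matrices (whereas Preacher sampled uniformly from the space of correlation matrices), uses SRMR as the fit index, and compares a one-factor model with a two-orthogonal-factor model, each with 12 free parameters but a different configuration.

```r
# A minimal fitting-propensity sketch in the spirit of Preacher (2006):
# which configuration fits purely random correlation matrices better?
library(lavaan)
set.seed(1)

p      <- 6
vnames <- paste0("x", 1:p)

random_corr <- function() {
  A <- matrix(rnorm(p * p), p, p)
  R <- cov2cor(A %*% t(A))                 # random positive-definite correlation matrix
  dimnames(R) <- list(vnames, vnames)
  R
}

m_one <- 'F  =~ x1 + x2 + x3 + x4 + x5 + x6'   # one general factor
m_two <- 'F1 =~ x1 + x2 + x3
          F2 =~ x4 + x5 + x6'                  # two factors (set orthogonal below)

srmr_of <- function(model, R, ...) {
  fit <- try(suppressWarnings(cfa(model, sample.cov = R, sample.nobs = 500, ...)),
             silent = TRUE)
  if (inherits(fit, "try-error")) NA else unname(fitMeasures(fit, "srmr"))
}

res <- replicate(200, {
  R <- random_corr()
  c(one_factor = srmr_of(m_one, R),
    two_factor = srmr_of(m_two, R, orthogonal = TRUE))
})

rowMeans(res, na.rm = TRUE)  # lower mean SRMR on random data = higher fitting propensity
```

A design note: both models leave 9 degrees of freedom, so any systematic difference in fit to random matrices reflects the configuration of the parameters, not their number.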
Bonifay and Cai (2017) expanded upon this work by considering the fitting propensity of several categorical data models. Among other findings, their analysis revealed that the confirmatory bifactor model achieved good fit to an excessively wide range of random datasets. The model was so deficient at compressing the data space (i.e., filtering out noise) that it accommodated nearly any data pattern, including many that were nonsensical. This MDL-inspired work demonstrated that good fit is essentially built into the bifactor model; if the goal is to ensure generalizability, then GOF testing of this model cannot be considered a risky or severe test (Watts, Poore, & Waldman, 2019).
In summary, the information-theoretic principle of MDL offers insights into overfitting and generalizability that traditional methods cannot provide. Although this principle may not address many of the generalizability issues described in the target article, it should be considered by researchers who wish to avoid overfitting and thereby enhance predictive accuracy.
Financial support
The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D210032.
Conflict of interest
None.
Note
The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.