In his assessment of the replicability crisis, Tal Yarkoni points out that the field of psycholinguistics compares relatively favorably with other subdisciplines of psychology. The articles he refers to as commendable advances toward mega-studies (Balota, Yap, Hutchison, & Cortese, Reference Balota, Yap, Hutchison, Cortese and Adelman2012; Keuleers & Balota, Reference Keuleers and Balota2015) have a classic laboratory design, allowing for multifactorial control over participants and stimuli. Psycholinguistics is, however, not the only field concerned with cognitively plausible accounts of language, nor is it exclusive in its use of quantitatively advanced methods. Usage-based linguistic theories have increasingly turned to large text corpora to answer questions about the cognitive processing of language (Gennari & Macdonald, Reference Gennari and Macdonald2009; Gries, Reference Gries2005; Grondelaers, Speelman, Drieghe, Brysbaert, & Geeraerts, Reference Grondelaers, Speelman, Drieghe, Brysbaert and Geeraerts2009; Jaeger, Reference Jaeger2006; Roland, Elman, & Ferreira, Reference Roland, Elman and Ferreira2006; Piantadosi, Tily, & Gibson, Reference Piantadosi, Tily and Gibson2011; Pijpops, Speelman, Grondelaers, & Van de Velde, Reference Pijpops, Speelman, Grondelaers and Van de Velde2018; Szmrecsanyi, Reference Szmrecsanyi2005; Wiechmann, Reference Wiechmann2008). As in psychology, these studies have steadily turned to generalized linear mixed-effects models to analyze linguistic phenomena (Baayen, Reference Baayen2008; Gries, Reference Gries2015; Speelman, Heylen, & Geeraerts, Reference Speelman, Heylen, Geeraerts, Speelman, Heylen and Geeraerts2018).
The advantage of corpus-based studies is that they have higher ecological validity than laboratory experiments, as they work with naturally occurring data. Additional advantages are (i) the scale of the data, which are usually extracted from corpora that cover millions to even billions of words, reducing the risk of underpowered results; (ii) the high replicability, as the corpora are usually publicly available; and (iii) the possibility of gathering data from the past, alleviating the present-day bias to some extent (Bergs & Hoffmann, Reference Bergs and Hoffmann2017; De Smet & Van de Velde, Reference De Smet and Van de Velde2020; Hundt, Mollin, & Pfenninger, Reference Hundt, Mollin and Pfenninger2017; Petré & Van de Velde, Reference Petré and Van de Velde2018; Wolk, Bresnan, Rosenbach, & Szmrecsanyi, Reference Wolk, Bresnan, Rosenbach and Szmrecsanyi2013), though the difficulties and obstacles in historical corpus linguistics should not be underestimated (Van de Velde & Peter, Reference Van de Velde, Peter, Adolphs and Knight2020). These advantages assuage Yarkoni's concerns about generalizability.
This does not mean that corpus linguistics is a happy-go-lucky picnic. Studies in this field face some daunting difficulties. One is that in corpus data, occurrence frequencies of language users (roughly equivalent to participants) and words (roughly equivalent to stimuli) commonly take a “Zipfian” distribution: word occurrences follow a power law where a few “types” (lemmas) account for most of the “tokens,” and most types are in a long tail of infrequent attestations (Zipf, Reference Zipf1935). Similarly for speakers: while observations of a given grammatical construction in a text corpus may come from a wide range of language users (speakers or writers), the distribution is typically skewed such that a few language users contribute a disproportionate share of the observations. If one wants to use mixed models to investigate the psycholinguistic pressures on the “dative alternation,” that is, the difference between he gave flowers to his mother versus he gave his mother flowers, a heavily investigated phenomenon (see Bresnan, Cueni, Nikitina, & Baayen, Reference Bresnan, Cueni, Nikitina, Baayen, Bouma, Krämer and Zwarts2007; Röthlisberger, Grafmiller, & Szmrecsanyi, Reference Röthlisberger, Grafmiller and Szmrecsanyi2017, among others), state-of-the-art linguistic corpus studies customarily add a random factor for the verb (give, donate, present, offer, transfer, regale, etc.), but evidently, the corpus will yield many more observations for frequent verbs than for infrequent verbs. If these two factors (words and speakers) are included as random factors in a mixed model, the maximum likelihood estimation may have a hard time converging on an adequate model: the random intercepts, let alone slopes, may not be reliably estimable for underpopulated levels of the random factors.
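To make the skew concrete, here is a minimal sketch of a Zipfian type/token distribution using synthetic frequencies (the constant and the exponent of 1 are illustrative assumptions, not estimates from any real corpus):

```python
# Synthetic Zipfian frequencies: the r-th most frequent type occurs ~ C / r times.
n_types = 20_000
C = 10_000
freqs = {f"type_{r}": max(1, round(C / r)) for r in range(1, n_types + 1)}

total_tokens = sum(freqs.values())
top10 = sum(sorted(freqs.values(), reverse=True)[:10])       # the frequent head
singletons = sum(1 for f in freqs.values() if f == 1)        # the long tail

print(f"top 10 of {n_types:,} types cover {top10 / total_tokens:.0%} of tokens")
print(f"{singletons:,} types occur only once")
```

Under these assumptions a handful of types covers over a quarter of all tokens, while the majority of types are singletons, which is exactly the situation that leaves most levels of a random factor underpopulated.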
A frequently used “solution” is to bin all speakers/writers or word types with fewer than five observations, but this has the drawback that the underpopulated levels (often the majority) are treated as identical. This misrepresents the non-independence of the observations, flouting the very motivation for random effects.
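A minimal sketch of this binning workaround, with hypothetical speaker IDs and the threshold of five observations mentioned above, shows how it collapses many distinct sources into a single level:

```python
# Hypothetical observations: two prolific speakers plus 100 one-off speakers.
from collections import Counter

observations = (["spk_A"] * 40 + ["spk_B"] * 12 +
                [f"spk_{i}" for i in range(100)])

counts = Counter(observations)
# Collapse every speaker with fewer than 5 observations into one "OTHER" level.
binned = [s if counts[s] >= 5 else "OTHER" for s in observations]

print(len(counts), "original levels ->", len(set(binned)), "levels after binning")
```

The 100 distinct rare speakers now share one random-intercept level, so the model treats their observations as if they came from a single source.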
Another problem is that many corpus-based studies suffer from overfitting. This issue is not peculiar to corpus-based studies, but also crops up in other psychological or psycholinguistic studies (Yarkoni, this issue; Yarkoni & Westfall, Reference Yarkoni and Westfall2017). The main reason is that corpus linguists tend to use all the data available to fit their mixed model. A solution might come from integrating methods from machine learning (Hastie, Tibshirani, & Friedman, Reference Hastie, Tibshirani and Friedman2013). Repeatedly partitioning the data into training and test sets to carry out cross-validation, bootstrapping, or regularization by shrinkage methods (Ridge, Lasso, and Elastic Net) can reduce overfitting, but at present, applying these techniques in the presence of multiple sources of random variation is not straightforward (see Roberts et al., Reference Roberts, Bahn, Ciuti, Boyce, Elith, Guillera-Arroita and Dormann2017).
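The train/test logic can be illustrated with a deliberately overfit toy “model” on synthetic data (nothing here comes from a real corpus): a memorizer scores perfectly in-sample and at chance out-of-sample, and the gap between the two quantifies the overfit.

```python
import random

random.seed(1)
# A random feature paired with a label that carries no signal at all.
data = [(random.random(), random.randint(0, 1)) for _ in range(400)]
train, test = data[:300], data[300:]

memory = {x: y for x, y in train}      # "model" = memorize the training set

def predict(x):
    return memory.get(x, 0)            # unseen feature -> default guess of 0

train_acc = sum(predict(x) == y for x, y in train) / len(train)
test_acc = sum(predict(x) == y for x, y in test) / len(test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Fitting and evaluating on the same data (here, the training set) would report perfect performance for a model that has learned nothing generalizable; only the held-out set exposes this.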
The use of shrinkage methods has an additional application in corpus linguistics, namely when the number of regressors approaches or exceeds the number of observations. This can be the case when the lexical effects are focal variables. Instead of treating the different verbs (give, donate, present, offer, transfer, regale, etc.) as the levels of a random factor “verb” when investigating the dative alternation, thereby considering them merely a source of random variation, we may be interested in their effect on the choice between the two grammatical constructions (… flowers to his mother vs. … his mother flowers). In corpus linguistics, this is typically done either by sticking to the verb-as-random-factor approach and focusing on the predicted random effects, or by running a separate analysis. The former strategy, modeling focal variables with random factors, arguably “stretches” the purpose of random effects, which are meant to model the association structure in the data, while the fixed effects model the systematic trends. The latter strategy often takes the form of “collexeme analysis” (Stefanowitsch & Gries, Reference Stefanowitsch and Gries2003), but the downside is that it does not allow for multifactorial control (Bloem, Reference Bloem2021, p. 115). A promising solution may again come from the aforementioned shrinkage methods (Lasso, Ridge, and Elastic Net) with k-fold cross-validation. K-fold cross-validation is the procedure of partitioning the data into k folds (typically 10) and iterating k times, each time using (k − 1)/k of the data as the training set and the remaining 1/k as the test set, in effect treating each small portion of the data in turn as if it were “unseen,” to validate the model. Shrinkage with cross-validation not only allows a large number of potentially correlated regressors to be included in the model, but also enables variable selection and effectively avoids overfitting (Van de Velde & Pijpops, Reference Van de Velde and Pijpops2019).
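The fold bookkeeping of k-fold cross-validation can be sketched as follows (a generic illustration, not code from any of the cited studies):

```python
def k_folds(n, k=10):
    """Yield (train_indices, test_indices) pairs over range(n):
    each observation serves once as test data and k-1 times as training data."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

n = 50
splits = list(k_folds(n))
for train, test in splits:
    # Each split uses (k-1)/k of the data for training and 1/k for testing,
    # and together train and test always cover the whole data set.
    assert sorted(train + test) == list(range(n))
print(f"{len(splits)} folds, each with {len(splits[0][1])} test observations")
```

In practice one would feed each training fold to a penalized (Lasso, Ridge, or Elastic Net) model and score it on the corresponding test fold, choosing the penalty strength that performs best across folds.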
Other methodological innovations currently being explored in linguistics may also contribute to generalizability. Agent-based modeling is an underused technique for checking the contours of a statistical model by systematically investigating the effect of its parameters. In linguistics, adoption has been slow, but the last decade has seen an upsurge in such studies (Beuls & Steels, Reference Beuls and Steels2013; Bloem, Reference Bloem2021; Landsbergen, Lachlan, Ten Cate, & Verhagen, Reference Landsbergen, Lachlan, Ten Cate and Verhagen2010; Lestrade, Reference Lestrade, Köhnlein and Audring2015; Pijpops, Beuls, & Van de Velde, Reference Pijpops, Beuls and Van de Velde2015; Steels, Reference Steels2016).
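As a purely illustrative toy, far simpler than the cited agent-based studies and with all parameters (population size, adoption probability, initial shares) invented, an agent-based simulation of two competing constructions might look like this:

```python
import random

random.seed(42)
N, ROUNDS = 100, 200
# Each agent holds a preferred variant of the construction (labels are made up).
agents = ["V-to"] * 60 + ["V-NP"] * 40

for _ in range(ROUNDS):
    speaker, hearer = random.sample(range(N), 2)   # a random interaction
    if random.random() < 0.1:                      # small adoption probability
        agents[hearer] = agents[speaker]           # hearer adopts speaker's variant

share = agents.count("V-to") / N
print(f"share of 'V-to' after {ROUNDS} interactions: {share:.2f}")
```

Rerunning such a simulation while varying the adoption probability or the initial shares shows which regions of the parameter space reproduce the patterns a statistical model attributes to particular pressures.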
Financial support
This research received no specific grant from any funding agency, commercial or not-for-profit sectors.
Conflict of interest
None.