
Generalizability in mixed models: Lessons from corpus linguistics

Published online by Cambridge University Press:  10 February 2022

Freek Van de Velde
Affiliation:
Department of Linguistics, KU Leuven, Blijde Inkomststraat 21/3308, BE-3000 Leuven, Belgium. freek.vandevelde@kuleuven.be; https://www.arts.kuleuven.be/ling/qlvl/people/pages/00039016
Stefano De Pascale
Affiliation:
Department of Linguistics, KU Leuven, Blijde Inkomststraat 21/3308, BE-3000 Leuven, Belgium. stefano.depascale@kuleuven.be; https://www.arts.kuleuven.be/ling/qlvl/people/pages/00102617
Dirk Speelman
Affiliation:
Department of Linguistics, KU Leuven, Blijde Inkomststraat 21/3308, BE-3000 Leuven, Belgium. dirk.speelman@kuleuven.be; https://www.arts.kuleuven.be/ling/qlvl/people/pages/00013279

Abstract

Some of the generalizability issues that haunt controlled lab experiments in psychology, and in psycholinguistics in particular, can be alleviated by adopting corpus-linguistic methods, which work with natural data. This advantage comes at a cost: in corpus studies, lexemes and language users can show different kinds of skew. We discuss a number of solutions to bolster control.

Type
Open Peer Commentary
Copyright
Copyright © The Author(s), 2022. Published by Cambridge University Press

In his assessment of the replicability crisis, Tal Yarkoni points out that the field of psycholinguistics compares relatively favorably with other subdisciplines of psychology. The articles he refers to as commendable advances toward mega-studies (Balota, Yap, Hutchison, & Cortese, 2012; Keuleers & Balota, 2015) have a classic laboratory design, allowing for multifactorial control over participants and stimuli. Psycholinguistics is, however, not the only field concerned with cognitively plausible accounts of language, nor is it exclusive in its use of quantitatively advanced methods. Usage-based linguistic theories have increasingly turned to large text corpora to answer questions about the cognitive processing of language (Gennari & Macdonald, 2009; Gries, 2005; Grondelaers, Speelman, Drieghe, Brysbaert, & Geeraerts, 2009; Jaeger, 2006; Piantadosi, Tily, & Gibson, 2011; Pijpops, Speelman, Grondelaers, & Van de Velde, 2018; Roland, Elman, & Ferreira, 2006; Szmrecsanyi, 2005; Wiechmann, 2008). As in psychology, these studies have steadily turned to generalized linear mixed-effects models to analyze linguistic phenomena (Baayen, 2008; Gries, 2015; Speelman, Heylen, & Geeraerts, 2018).

The advantage of corpus-based studies is their higher ecological validity, as they work with naturally occurring data. Additional advantages are (i) the scale of the data, which are usually extracted from corpora covering millions or even billions of words, reducing the risk of underpowered results; (ii) high replicability, as the corpora are usually publicly available; and (iii) the possibility of gathering data from the past, alleviating the present-day bias to some extent (Bergs & Hoffmann, 2017; De Smet & Van de Velde, 2020; Hundt, Mollin, & Pfenninger, 2017; Petré & Van de Velde, 2018; Wolk, Bresnan, Rosenbach, & Szmrecsanyi, 2013), though the difficulties and obstacles in historical corpus linguistics should not be underestimated (Van de Velde & Peter, 2020). These advantages assuage Yarkoni's concerns about generalizability.

This does not mean that corpus linguistics is a happy-go-lucky picnic. Studies in this field face some daunting difficulties. One is that in corpus data, occurrence frequencies of language users (roughly equivalent to participants) and words (roughly equivalent to stimuli) commonly follow a “Zipfian” distribution: word occurrences obey a power law in which a few “types” (lemmas) account for most of the “tokens,” while most types sit in a long tail of infrequent attestations (Zipf, 1935). The same holds for speakers: while observations of a given grammatical construction in a text corpus may come from a wide range of language users (speakers or writers), the distribution is typically skewed such that a few language users contribute a disproportionate share of the observations. If one wants to use mixed models to investigate the psycholinguistic pressures on the “dative alternation,” that is, the choice between he gave flowers to his mother and he gave his mother flowers, a heavily investigated phenomenon (see Bresnan, Cueni, Nikitina, & Baayen, 2007; Röthlisberger, Grafmiller, & Szmrecsanyi, 2017, among others), state-of-the-art linguistic corpus studies customarily add a random factor for the verb (give, donate, present, offer, transfer, regale, etc.), but evidently the corpus will yield many more observations for frequent verbs than for infrequent ones. If these two factors (words and speakers) are included as random factors in a mixed model, the maximum likelihood estimation may have a hard time converging on an adequate model: the random intercepts, let alone slopes, may not be reliably estimable for underpopulated levels of the random factors.
An often-used “solution” is to bin all speakers/writers or word types with fewer than five observations, but this has the drawback that the underpopulated levels (often the majority) are treated as identical. This misrepresents the non-independence of the observations, flouting the very motivation for random effects.
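The practical consequence of a Zipfian distribution for random factors can be made concrete with a small simulation. The sketch below (a toy illustration, not drawn from any actual corpus; the vocabulary size, token count, and five-token threshold are all assumptions) draws tokens from a vocabulary whose type probabilities decay as a power law of rank, and then counts how many types end up below the common five-observation threshold:

```python
import random
from collections import Counter

random.seed(1)

# Hypothetical vocabulary of 1,000 verb types whose probabilities follow
# a Zipfian power law: p(rank r) is proportional to 1/r.
ranks = list(range(1, 1001))
weights = [1 / r for r in ranks]

# Draw 20,000 corpus tokens from this distribution.
tokens = random.choices(ranks, weights=weights, k=20_000)
counts = Counter(tokens)

attested = len(counts)                                  # types seen at all
sparse = sum(1 for c in counts.values() if c < 5)       # underpopulated levels

print(f"types attested: {attested}")
print(f"types with fewer than 5 tokens: {sparse}")
```

Even with 20,000 tokens, a large share of the attested types contribute fewer than five observations each, which is exactly the situation in which per-level random intercepts become hard to estimate.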

Another problem is that many corpus-based studies suffer from overfitting. This issue is not peculiar to corpus-based studies, but also crops up in other psychological and psycholinguistic work (Yarkoni, this issue; Yarkoni & Westfall, 2017). The main reason is that corpus linguists tend to use all the available data to fit their mixed model. A solution might come from integrating methods from machine learning (Hastie, Tibshirani, & Friedman, 2013). Repeatedly partitioning the data into training and test sets to carry out cross-validation, bootstrapping, or regularization by shrinkage methods (Ridge, Lasso, and Elastic Net) can reduce overfitting, but at present, applying these techniques in the presence of multiple sources of random variation is not straightforward (see Roberts et al., 2017).
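One reason grouped structure complicates cross-validation is that naive random splits let the same speaker appear in both training and test sets, leaking speaker-specific idiosyncrasies across the split. A minimal sketch of one common remedy, grouped k-fold cross-validation, is shown below (the data here are hypothetical speaker/construction pairs; the function name and fold count are illustrative assumptions, not an established API):

```python
import random
from collections import defaultdict

def grouped_kfold(observations, group_of, k=5, seed=0):
    """Yield (train, test) splits in which all observations from one
    group (e.g., one speaker) land in the same fold, so the model is
    always evaluated on unseen speakers."""
    by_group = defaultdict(list)
    for obs in observations:
        by_group[group_of(obs)].append(obs)
    groups = list(by_group)
    random.Random(seed).shuffle(groups)
    folds = [groups[i::k] for i in range(k)]          # round-robin assignment
    for i in range(k):
        test = [obs for g in folds[i] for obs in by_group[g]]
        train = [obs for j in range(k) if j != i
                 for g in folds[j] for obs in by_group[g]]
        yield train, test

# Hypothetical observations: (speaker, construction-choice) pairs, with a
# skewed number of observations per speaker, as in real corpora.
data = [(f"spk{s}", s % 2) for s in range(12) for _ in range(s + 1)]
for train, test in grouped_kfold(data, group_of=lambda o: o[0], k=3):
    train_spk = {o[0] for o in train}
    test_spk = {o[0] for o in test}
    assert not train_spk & test_spk   # no speaker appears on both sides
```

Note that grouping by only one factor at a time is easy; handling crossed random factors (speakers and verbs simultaneously), as the cited work discusses, remains the hard part.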

The use of shrinkage methods has an additional application in corpus linguistics, namely when the number of regressors exceeds the number of observations. This can be the case when the lexical effects are the focal variables. Instead of treating the different verbs (give, donate, present, offer, transfer, regale, etc.) as levels of a random factor “verb” when investigating the dative alternation, that is, as merely a source of random variation, we may be interested in their effect on the choice between the two grammatical constructions (… flowers to his mother vs. … his mother flowers). In corpus linguistics, this is typically handled either by sticking to a verb-as-random-factor approach and focusing on the predictions for the random effects, or by running a separate analysis. The former strategy, modeling focal variables with random factors, arguably “stretches” the purpose of random effects, which are meant to model the association structure in the data, with the fixed effects modeling the systematic trends. The latter strategy often takes the form of “collexeme analysis” (Stefanowitsch & Gries, 2003), but its downside is that it does not allow for multifactorial control (Bloem, 2021, p. 115). A promising solution may again come from the aforementioned shrinkage methods (Lasso, Ridge, and Elastic Net) with k-fold cross-validation. K-fold cross-validation repartitions the data k times (usually 10), each time using 1 − 1/k of the data as the training set and the remaining 1/k as the test set, in effect iteratively treating a small portion of the data as if it were “unseen” in order to validate the model. Shrinkage with cross-validation not only allows a large number of potentially correlated regressors to be included in the model, but also allows for variable selection and effectively avoids overfitting (Van de Velde & Pijpops, 2019).
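The combination of shrinkage and k-fold cross-validation can be sketched in miniature. The example below (a toy illustration with simulated data, not the cited authors' actual procedure; it uses univariate Ridge regression, whose closed form keeps the sketch dependency-free, rather than the Lasso) fits the penalized estimate on each training partition and selects the penalty that minimizes held-out error:

```python
import random

random.seed(3)

# Simulated data: one (hypothetical) lexical predictor with a true
# slope of 2, plus Gaussian noise.
n = 50
x = [random.gauss(0, 1) for _ in range(n)]
y = [2 * xi + random.gauss(0, 1) for xi in x]

def ridge_slope(xs, ys, lam):
    # Closed-form univariate ridge estimate: sum(x*y) / (sum(x^2) + lam).
    # Larger lam shrinks the slope toward zero.
    return sum(a * b for a, b in zip(xs, ys)) / (sum(a * a for a in xs) + lam)

def cv_mse(xs, ys, lam, k=10):
    # k-fold cross-validated mean squared error for one penalty value:
    # each fold is held out once and predicted from the remaining folds.
    errs = []
    for i in range(k):
        test = [j for j in range(len(xs)) if j % k == i]
        train = [j for j in range(len(xs)) if j % k != i]
        w = ridge_slope([xs[j] for j in train], [ys[j] for j in train], lam)
        errs.extend((ys[j] - w * xs[j]) ** 2 for j in test)
    return sum(errs) / len(errs)

penalties = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(penalties, key=lambda lam: cv_mse(x, y, lam))
print("penalty selected by cross-validation:", best)
```

In a realistic application the predictor matrix would contain one indicator column per verb (plus the control variables), and the Lasso's ability to shrink coefficients exactly to zero is what performs the variable selection the text describes.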

Other methodological innovations currently being explored in linguistics may also contribute to generalizability. An underused technique for checking the contours of a statistical model by investigating the effect of its parameters is agent-based modeling. In linguistics, adoption has been slow, but the last decade has seen an upsurge in such studies (Beuls & Steels, 2013; Bloem, 2021; Landsbergen, Lachlan, Ten Cate, & Verhagen, 2010; Lestrade, 2015; Pijpops, Beuls, & Van de Velde, 2015; Steels, 2016).
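To give a flavor of the approach, here is a deliberately minimal agent-based sketch (the population size, update rule, and learning rate are all assumptions for illustration; the cited models are far richer): each agent holds a probability of producing one of two competing variants of a construction, and hearers nudge their preference toward whatever they hear.

```python
import random

random.seed(42)

# Minimal agent-based sketch: N agents each hold a probability of using
# variant A of a construction; on each interaction, a random hearer
# shifts its preference a small step toward the variant it just heard.
N, ROUNDS, RATE = 100, 2000, 0.05
prefs = [random.random() for _ in range(N)]   # initial preferences

for _ in range(ROUNDS):
    speaker, hearer = random.sample(range(N), 2)
    utterance_is_a = random.random() < prefs[speaker]
    target = 1.0 if utterance_is_a else 0.0
    # Convex update keeps every preference inside [0, 1].
    prefs[hearer] += RATE * (target - prefs[hearer])

mean_pref = sum(prefs) / N
print(f"mean preference for variant A after {ROUNDS} rounds: {mean_pref:.2f}")
```

By varying the parameters (population size, learning rate, interaction structure) and observing which ones drive the population toward or away from convergence, one can probe whether the mechanisms posited by a statistical model are sufficient to generate the observed usage patterns.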

Financial support

This research received no specific grant from any funding agency, commercial or not-for-profit sectors.

Conflict of interest

None.

References

Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge: Cambridge University Press.
Balota, D. A., Yap, M. J., Hutchison, K. A., & Cortese, M. K. (2012). Megastudies: What do millions (or so) of trials tell us about lexical processing? In Adelman, J. S. (Ed.), Visual word recognition volume 1: Models and methods, orthography and phonology (pp. 90–115). Psychology Press.
Bergs, A., & Hoffmann, T. (Eds.) (2017). Cognitive approaches to the history of English. Special issue of English Language and Linguistics, 21(2), 191–438.
Beuls, K., & Steels, L. (2013). Agent-based models of strategies for the emergence and evolution of grammatical agreement. PLoS ONE, 8(3), e58960.
Bloem, J. (2021). Processing verb clusters. LOT Dissertation Series.
Bresnan, J., Cueni, A., Nikitina, T., & Baayen, H. (2007). Predicting the dative alternation. In Bouma, G., Krämer, I., & Zwarts, J. (Eds.), Cognitive foundations of interpretation (pp. 77–96). Amsterdam: KNAW/Edita.
De Smet, I., & Van de Velde, F. (2020). A corpus-based quantitative analysis of twelve centuries of preterite and past participle morphology in Dutch. Language Variation and Change, 32(3), 241–265.
Gennari, S., & Macdonald, M. (2009). Linking production and comprehension processes: The case of relative clauses. Cognition, 111(1), 1–23.
Gries, S. T. (2005). Syntactic priming: A corpus-based approach. Journal of Psycholinguistic Research, 34(4), 365–399.
Gries, S. T. (2015). The most underused statistical method in corpus linguistics: Multi-level (and mixed-effects) models. Corpora, 10(1), 95–125.
Grondelaers, S., Speelman, D., Drieghe, D., Brysbaert, M., & Geeraerts, D. (2009). Introducing a new entity into discourse: Comprehension and production evidence for the status of Dutch er ‘there’ as a higher-level expectancy monitor. Acta Psychologica, 130(2), 153–160.
Hastie, T., Tibshirani, R., & Friedman, J. (2013). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). Springer.
Hundt, M., Mollin, S., & Pfenninger, S. E. (Eds.). (2017). The changing English language. Cambridge: Cambridge University Press.
Jaeger, F. T. (2006). Redundancy and syntactic reduction in spontaneous speech. PhD dissertation, Stanford University.
Keuleers, E., & Balota, D. A. (2015). Megastudies, crowdsourcing, and large datasets in psycholinguistics: An overview of recent developments. Quarterly Journal of Experimental Psychology, 68(8), 1457–1468.
Landsbergen, F., Lachlan, R., Ten Cate, C., & Verhagen, A. (2010). A cultural evolutionary model of patterns in semantic change. Linguistics, 48(2), 363–390.
Lestrade, S. (2015). A case of cultural evolution: The emergence of morphological case. In Köhnlein, B., & Audring, J. (Eds.), Linguistics in the Netherlands (pp. 105–115). John Benjamins.
Petré, P., & Van de Velde, F. (2018). The real-time dynamics of the individual and the community in grammaticalization. Language, 94(4), 867–901.
Piantadosi, S. T., Tily, H., & Gibson, E. (2011). Word lengths are optimized for efficient communication. Proceedings of the National Academy of Sciences, 108(9), 3526–3529.
Pijpops, D., Beuls, K., & Van de Velde, F. (2015). The rise of the verbal weak inflection in Germanic: An agent-based model. Computational Linguistics in the Netherlands Journal, 5, 81–102.
Pijpops, D., Speelman, D., Grondelaers, S., & Van de Velde, F. (2018). Comparing explanations for the complexity principle: Evidence from argument realization. Language and Cognition, 10(3), 514–543.
Roberts, D. R., Bahn, V., Ciuti, S., Boyce, M. S., Elith, J., Guillera-Arroita, G., … Dormann, C. F. (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, 40, 913–929.
Roland, D., Elman, J., & Ferreira, V. (2006). Why is ‘that’? Structural prediction and ambiguity resolution in a very large corpus of English sentences. Cognition, 98(3), 245–272.
Röthlisberger, M., Grafmiller, J., & Szmrecsanyi, B. (2017). Cognitive indigenization effects in the English dative alternation. Cognitive Linguistics, 28(4), 673–710.
Speelman, D., Heylen, K., & Geeraerts, D. (2018). Introduction. In Speelman, D., Heylen, K., & Geeraerts, D. (Eds.), Mixed-effects regression models in linguistics (pp. 1–10). Springer.
Steels, L. (2016). Agent-based models for the emergence and evolution of grammar. Philosophical Transactions of the Royal Society B, 371, 20150447.
Stefanowitsch, A., & Gries, S. T. (2003). Collostructions: Investigating the interaction of words and constructions. International Journal of Corpus Linguistics, 8(2), 209–244.
Szmrecsanyi, B. (2005). Language users as creatures of habit: A corpus-based analysis of persistence in spoken English. Corpus Linguistics and Linguistic Theory, 1(1), 113–150.
Van de Velde, F., & Pijpops, D. (2019). Investigating lexical effects in syntax with regularized regression (Lasso). Journal of Research Design and Statistics in Linguistics and Communication Science, 6(2), 166–199.
Van de Velde, F., & Peter, P. (2020). Historical linguistics. In Adolphs, S., & Knight, D. (Eds.), The Routledge handbook of English language and digital humanities (pp. 328–359). Routledge.
Wiechmann, D. (2008). On the computation of collostruction strength: Testing measures of association as expressions of lexical bias. Corpus Linguistics and Linguistic Theory, 4(2), 253–290.
Wolk, C., Bresnan, J., Rosenbach, A., & Szmrecsanyi, B. (2013). Dative and genitive variability in Late Modern English: Exploring cross-constructional variation and change. Diachronica, 30(3), 382–419.
Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science, 12(6), 1100–1122.
Zipf, G. K. (1935). The psycho-biology of language: An introduction to dynamic philology. Houghton Mifflin.