Neural-network models of language are optimized to solve practical problems such as machine translation. Currently, when these large language models (LLMs) are interpreted as models of human linguistic processing, they have shortcomings similar to those that deep neural networks have as models of human vision. Two examples illustrate this. First, LLMs do not faithfully replicate human behaviour on language tasks (Kuncoro, Dyer, Hale, & Blunsom, Reference Kuncoro, Dyer, Hale and Blunsom2018; Linzen & Leonard, Reference Linzen, Leonard, Kalish, Rau, Zhu and Rogers2018; Marvin & Linzen, Reference Marvin, Linzen, Riloff, Chiang, Hockenmaier and Tsujii2018; Mitchell, Kazanina, Houghton, & Bowers, Reference Mitchell, Kazanina, Houghton, Bowers, Nienborg, Poldrack and Naselaris2019). For example, an LLM trained on a word-prediction task shows overall error rates similar to humans' on long-range subject–verb number agreement but errs in different circumstances: Unlike humans, it makes more mistakes when sentences contain relative clauses (Linzen & Leonard, Reference Linzen, Leonard, Kalish, Rau, Zhu and Rogers2018), indicating differences in how grammatical structure is represented. Second, the LLMs that perform better on language tasks do not necessarily have more in common with human linguistic processing or more obvious similarities to the brain. For example, transformers learn efficiently on vast corpora and avoid human-like memory constraints, yet they are currently more successful as language models (Brown et al., Reference Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal and Amodei2020; Devlin, Chang, Lee, & Toutanova, Reference Devlin, Chang, Lee, Toutanova, Burstein, Doran and Solorio2018) than recurrent neural networks such as long short-term memory (LSTM) networks, which employ sequential processing, as humans do, and can be more easily compared to the brain.
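The kind of test behind such comparisons is easy to make concrete. The following is a minimal sketch, assuming the publicly available GPT-2 model and the Hugging Face transformers library rather than the models or stimuli of the cited studies: it compares the probability a word-prediction model assigns to a plural versus a singular verb after a prefix whose subject and an intervening attractor noun differ in number.

```python
# Minimal sketch (illustrative only): probing long-range subject-verb agreement
# by comparing next-token probabilities for plural vs. singular verb forms.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Hypothetical prefix: plural subject "keys", singular attractor "man" inside a relative clause.
prefix = "The keys that the man holds"
ids = tokenizer(prefix, return_tensors="pt").input_ids
with torch.no_grad():
    next_token_logits = model(ids).logits[0, -1]  # logits for the next token
probs = torch.softmax(next_token_logits, dim=-1)

for verb in [" are", " is"]:  # both are single tokens in GPT-2's vocabulary; the leading space matters
    verb_id = tokenizer(verb).input_ids[0]
    print(f"P({verb.strip()!r} | prefix) = {probs[verb_id].item():.4f}")
# A model that tracks the head noun should prefer " are"; an agreement error
# shows up as a preference for " is".
```

Aggregating such comparisons over many constructions yields error rates of the kind reported in the studies cited above.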
Furthermore, the target article suggests that, more broadly, the brain and neural networks are unlikely to resemble each other because evolution differs in trajectory and outcome from the optimization used to train a neural network. More generally, it remains unclear which aspects of learning in LLMs should be compared to the evolution of our linguistic ability and which to language learning in infants, but in either case the comparison seems weak. LLMs are typically trained on a next-word prediction task; it is unlikely that our linguistic ability evolved to optimize this, and next-word prediction can only partly describe language learning: For example, infants generalize word meanings based on shape (Landau, Smith, & Jones, Reference Landau, Smith and Jones1988), whereas LLMs lack any broad conceptual encounter with the world that language describes.
In fact, it would be peculiar to suggest that LLMs are models of the neural dynamics that support linguistic processing in humans; we simply know too little about those dynamics. The challenge presented by language is different from that presented by vision: Language lacks animal models, and debate in psycholinguistics is occupied with broad issues of mechanisms and principles, whereas visual neuroscience often has more detailed concerns. We believe that LLMs have a valuable role in psycholinguistics, and this role does not depend on any precise mapping from machine to human. Here we describe three uses of LLMs: (1) the practical, as a tool in experimentation; (2) the comparative, as an alternative example of linguistic processing; and (3) the philosophical, recasting the relationship between language and thought.
(1) An LLM models language, and this is often of practical quantitative utility in experiments. One straightforward example is the evaluation of surprisal: How unexpected a word is given what has preceded it, quantified as the negative log-probability of the word given its context. It has been established that reaction times (Fischler & Bloom, Reference Fischler and Bloom1979; Kleiman, Reference Kleiman1980), gaze duration (Rayner & Well, Reference Rayner and Well1996), and EEG responses (Dambacher, Kliegl, Hofmann, & Jacobs, Reference Dambacher, Kliegl, Hofmann and Jacobs2006; Frank, Otten, Galli, & Vigliocco, Reference Frank, Otten, Galli and Vigliocco2015) are modulated by surprisal, giving an insight into prediction in neural processing. In the past, surprisal was estimated using n-grams, but n-gram counts become too sparse to estimate reliably as n grows, so they cannot capture long-range dependencies. LLMs are typically trained on a task akin to estimating surprisal and are superior to n-grams at estimating word probabilities. Differences between LLM-derived estimates of surprisal and neural responses to it may reveal which linguistic structures, perhaps poorly represented in the statistical evidence, the brain privileges during processing.
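To make the estimate concrete, here is a minimal sketch, again assuming GPT-2 and the Hugging Face transformers library rather than any model used in the cited work, that computes per-token surprisal in bits for a sentence:

```python
# Minimal sketch (illustrative only): per-token surprisal, -log2 P(token | preceding tokens).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

sentence = "The old man the boats."  # a garden-path sentence, used here for illustration
ids = tokenizer(sentence, return_tensors="pt").input_ids
with torch.no_grad():
    log_probs = torch.log_softmax(model(ids).logits, dim=-1)

# The logits at position t predict the token at position t + 1,
# so the first token receives no surprisal estimate here.
for pos in range(1, ids.shape[1]):
    token_id = int(ids[0, pos])
    surprisal_bits = -log_probs[0, pos - 1, token_id].item() / math.log(2)
    print(f"{tokenizer.decode([token_id])!r}: {surprisal_bits:.2f} bits")
```

In practice, surprisals of subword tokens belonging to the same word are typically summed to obtain word-level surprisal before comparison with reading times or EEG responses.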
(2) LLMs are also useful as a point of comparison. LLMs combine different computational strategies, mixing representations of word properties with a computational engine based on memory or attention. Despite the clear differences between LLMs and the brain, it is instructive to compare the performance of different LLMs on language tasks to our own language ability. For example, although LLMs are capable of long-range number and gender agreement (Bernardy & Lappin, Reference Bernardy and Lappin2017; Gulordava, Bojanowski, Grave, Linzen, & Baroni, Reference Gulordava, Bojanowski, Grave, Linzen, Baroni, Walker, Ji and Stent2018; Linzen, Dupoux, & Goldberg, Reference Linzen, Dupoux and Goldberg2016; Sukumaran, Houghton, & Kazanina, Reference Sukumaran, Houghton, Kazanina, Bastings, Belinkov, Elazar, Hupkes, Saphr and Wiegreffe2022), they do not succeed in implementing another long-range rule: Principle C (Mitchell et al., Reference Mitchell, Kazanina, Houghton, Bowers, Nienborg, Poldrack and Naselaris2019), a near-universal property of languages whose most straightforward description depends on hierarchical parsing. Thus, LLMs allow us to recognize those aspects of language which require special consideration while revealing others to be within easy reach of statistical learning.
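The probing paradigm behind such comparisons is simple to sketch. The following illustration again assumes GPT-2 via the Hugging Face transformers library and an invented minimal pair rather than stimuli from the cited studies: it scores a grammatical and an ungrammatical sentence by their summed token log-probabilities. Constraints such as Principle C, which concern available interpretations rather than string grammaticality, require subtler designs than this.

```python
# Minimal sketch (illustrative only): scoring a minimal pair by summed log-probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Summed log-probability of a sentence's tokens given their preceding context."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    return sum(log_probs[0, pos - 1, ids[0, pos]].item() for pos in range(1, ids.shape[1]))

# Invented minimal pair in the style of agreement-probing studies.
grammatical = "The authors that the critic praised were pleased."
ungrammatical = "The authors that the critic praised was pleased."
# Expected to print True if the model tracks the number of the head noun.
print(sentence_log_prob(grammatical) > sentence_log_prob(ungrammatical))
```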
(3) In the past, philosophical significance was granted to language as evidence of thought or personhood. Turing (Reference Turing1950), for example, proposes conversation as a proxy for thought, and Chomsky (Reference Chomsky1966) describes Descartes as attributing the possession of mind to other humans because the human capacity for innovation and for the creative use of language is “beyond the limitations of any imaginable mechanism.” It is significant that machines are now capable of imitating the use of language. While machine-generated text still has an awkwardness and repetitiveness that make it recognizable on careful reading, it would seem foolhardy to predict that these final quirks are unresolvable or that they mark the division between human and machine. Nonetheless, most of us appear to feel intuitively that LLMs enact an imitation rather than a recreation of our linguistic ability: LLMs seem empty things whose pantomime of language is not underpinned by thought, understanding, or creativity. Indeed, even if an LLM were capable of imitating us perfectly, we would still distinguish between a loved one and their simulation.
This is a challenge to our understanding of the relationship between language and thought: Either we must claim that, despite recent progress, machine-generated language will remain unlike human language in vital respects, or we must defy our intuition and consider machines to be as capable of thought as we are, or we must codify our intuition to specify why a machine able to produce language should, nonetheless, be considered lacking in thought.
Acknowledgments
We are grateful to the many colleagues who read and commented on this text.
Financial support
P. S. received support from the Wellcome Trust [108899/B/15/Z], C. H. from the Leverhulme Trust [RF-2021-533], and N. K. from the International Laboratory for Social Neurobiology of the Institute for Cognitive Neuroscience HSE, RF Government grant [075-15-2022-1037].
Competing interest
None.