Neural-network models of language are optimized to solve practical problems such as machine translation. Currently, when these large language models (LLMs) are interpreted as models of human linguistic processing, they have shortcomings similar to those that deep neural networks have as models of human vision. Two examples illustrate this. First, LLMs do not faithfully replicate human behaviour on language tasks (Kuncoro, Dyer, Hale, & Blunsom, Reference Kuncoro, Dyer, Hale and Blunsom2018; Linzen & Leonard, Reference Linzen, Leonard, Kalish, Rau, Zhu and Rogers2018; Marvin & Linzen, Reference Marvin, Linzen, Riloff, Chiang, Hockenmaier and Tsujii2018; Mitchell, Kazanina, Houghton, & Bowers, Reference Mitchell, Kazanina, Houghton, Bowers, Nienborg, Poldrack and Naselaris2019). For example, an LLM trained on a word-prediction task shows overall error rates similar to humans' on long-range subject–verb number agreement but errs in different circumstances: Unlike humans, it makes more mistakes when sentences contain relative clauses (Linzen & Leonard, Reference Linzen, Leonard, Kalish, Rau, Zhu and Rogers2018), indicating differences in how grammatical structure is represented. Second, the LLMs that perform better on language tasks do not necessarily have more in common with human linguistic processing or more obvious similarities to the brain. For example, transformers learn efficiently on vast corpora and avoid human-like memory constraints, yet they are currently more successful as language models (Brown et al., Reference Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal and Amodei2020; Devlin, Chang, Lee, & Toutanova, Reference Devlin, Chang, Lee, Toutanova, Burstein, Doran and Solorio2018) than recurrent neural networks such as long short-term memory (LSTM) networks, which employ sequential processing, as humans do, and can be more easily compared to the brain.
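The kind of test behind such comparisons is easy to make concrete. The following is a minimal sketch, assuming the publicly available GPT-2 model and the Hugging Face transformers library rather than the models or stimuli of the cited studies: it compares the probability a word-prediction model assigns to a plural versus a singular verb after a prefix whose subject and an intervening attractor noun differ in number.

```python
# Minimal sketch (illustrative only): probing long-range subject-verb agreement
# by comparing next-token probabilities for plural vs. singular verb forms.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Hypothetical prefix: plural subject "keys", singular attractor "man" inside a relative clause.
prefix = "The keys that the man holds"
ids = tokenizer(prefix, return_tensors="pt").input_ids
with torch.no_grad():
    next_token_logits = model(ids).logits[0, -1]  # logits for the next token
probs = torch.softmax(next_token_logits, dim=-1)

for verb in [" are", " is"]:  # both are single tokens in GPT-2's vocabulary; the leading space matters
    verb_id = tokenizer(verb).input_ids[0]
    print(f"P({verb.strip()!r} | prefix) = {probs[verb_id].item():.4f}")
# A model that tracks the head noun should prefer " are"; an agreement error
# shows up as a preference for " is".
```

Aggregating such comparisons over many constructions yields error rates of the kind reported in the studies cited above.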
Furthermore, the target article suggests that, more broadly, the brain and neural networks are unlikely to resemble each other because evolution differs in trajectory and outcome from the optimization used to train a neural network. More generally, it remains unclear which aspects of learning in LLMs should be compared to the evolution of our linguistic ability and which to language learning in infants, but in either case the comparison seems weak. LLMs are typically trained on a next-word prediction task; it is unlikely that our linguistic ability evolved to optimize this, and next-word prediction can only partly describe language learning: For example, infants generalize word meanings based on shape (Landau, Smith, & Jones, Reference Landau, Smith and Jones1988), whereas LLMs lack any broad conceptual encounter with the world that language describes.
In fact, it would be peculiar to suggest that LLMs are models of the neural dynamics that support linguistic processing in humans; we simply know too little about those dynamics. The challenge presented by language is different from that presented by vision: Language lacks animal models, and debate in psycholinguistics is occupied with broad issues of mechanisms and principles, whereas visual neuroscience often has more detailed concerns. We believe that LLMs have a valuable role in psycholinguistics, and this role does not depend on any precise mapping from machine to human. Here we describe three uses of LLMs: (1) the practical, as a tool in experimentation; (2) the comparative, as an alternative example of linguistic processing; and (3) the philosophical, recasting the relationship between language and thought.
(1) An LLM models language, and this is often of practical quantitative utility in experiments. One straightforward example is the evaluation of surprisal: How unexpected a word is given what has preceded it, quantified as the negative log-probability of the word given its context. It has been established that reaction times (Fischler & Bloom, Reference Fischler and Bloom1979; Kleiman, Reference Kleiman1980), gaze duration (Rayner & Well, Reference Rayner and Well1996), and EEG responses (Dambacher, Kliegl, Hofmann, & Jacobs, Reference Dambacher, Kliegl, Hofmann and Jacobs2006; Frank, Otten, Galli, & Vigliocco, Reference Frank, Otten, Galli and Vigliocco2015) are modulated by surprisal, giving an insight into prediction in neural processing. In the past, surprisal was estimated using n-grams, but n-gram counts become too sparse to estimate reliably as n grows, so they cannot capture long-range dependencies. LLMs are typically trained on a task akin to estimating surprisal and are superior to n-grams at estimating word probabilities. Differences between LLM-derived estimates of surprisal and neural responses to it may reveal which linguistic structures, perhaps poorly represented in the statistical evidence, the brain privileges during processing.
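To make the estimate concrete, here is a minimal sketch, again assuming GPT-2 and the Hugging Face transformers library rather than any model used in the cited work, that computes per-token surprisal in bits for a sentence:

```python
# Minimal sketch (illustrative only): per-token surprisal, -log2 P(token | preceding tokens).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

sentence = "The old man the boats."  # a garden-path sentence, used here for illustration
ids = tokenizer(sentence, return_tensors="pt").input_ids
with torch.no_grad():
    log_probs = torch.log_softmax(model(ids).logits, dim=-1)

# The logits at position t predict the token at position t + 1,
# so the first token receives no surprisal estimate here.
for pos in range(1, ids.shape[1]):
    token_id = int(ids[0, pos])
    surprisal_bits = -log_probs[0, pos - 1, token_id].item() / math.log(2)
    print(f"{tokenizer.decode([token_id])!r}: {surprisal_bits:.2f} bits")
```

In practice, surprisals of subword tokens belonging to the same word are typically summed to obtain word-level surprisal before comparison with reading times or EEG responses.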
(2) LLMs are also useful as a point of comparison. LLMs combine different computational strategies, mixing representations of word properties with a computational engine based on memory or attention. Despite the clear differences between LLMs and the brain, it is instructive to compare the performance of different LLMs on language tasks to our own language ability. For example, although LLMs are capable of long-range number and gender agreement (Bernardy & Lappin, Reference Bernardy and Lappin2017; Gulordava, Bojanowski, Grave, Linzen, & Baroni, Reference Gulordava, Bojanowski, Grave, Linzen, Baroni, Walker, Ji and Stent2018; Linzen, Dupoux, & Goldberg, Reference Linzen, Dupoux and Goldberg2016; Sukumaran, Houghton, & Kazanina, Reference Sukumaran, Houghton, Kazanina, Bastings, Belinkov, Elazar, Hupkes, Saphr and Wiegreffe2022), they do not succeed in implementing another long-range rule: Principle C (Mitchell et al., Reference Mitchell, Kazanina, Houghton, Bowers, Nienborg, Poldrack and Naselaris2019), a near-universal property of languages whose most straightforward description depends on hierarchical parsing. Thus, LLMs allow us to recognize those aspects of language which require special consideration while revealing others to be within easy reach of statistical learning.
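The probing paradigm behind such comparisons is simple to sketch. The following illustration again assumes GPT-2 via the Hugging Face transformers library and an invented minimal pair rather than stimuli from the cited studies: it scores a grammatical and an ungrammatical sentence by their summed token log-probabilities. Constraints such as Principle C, which concern available interpretations rather than string grammaticality, require subtler designs than this.

```python
# Minimal sketch (illustrative only): scoring a minimal pair by summed log-probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Summed log-probability of a sentence's tokens given their preceding context."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    return sum(log_probs[0, pos - 1, ids[0, pos]].item() for pos in range(1, ids.shape[1]))

# Invented minimal pair in the style of agreement-probing studies.
grammatical = "The authors that the critic praised were pleased."
ungrammatical = "The authors that the critic praised was pleased."
# Expected to print True if the model tracks the number of the head noun.
print(sentence_log_prob(grammatical) > sentence_log_prob(ungrammatical))
```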
(3) In the past, philosophical significance was granted to language as evidence of thought or personhood. Turing (Reference Turing1950), for example, proposes conversation as a proxy for thought, and Chomsky (Reference Chomsky1966) describes Descartes as attributing the possession of mind to other humans because the human capacity for innovation and for the creative use of language is “beyond the limitations of any imaginable mechanism.” It is significant that machines are now capable of imitating the use of language. While machine-generated text still has an awkwardness and repetitiveness that make it recognizable on careful reading, it would seem foolhardy to predict that these final quirks are unresolvable or that they mark the division between human and machine. Nonetheless, most of us appear to feel intuitively that LLMs enact an imitation rather than a recreation of our linguistic ability: LLMs seem empty things whose pantomime of language is not underpinned by thought, understanding, or creativity. Indeed, even if an LLM were capable of imitating us perfectly, we would still distinguish between a loved one and their simulation.
This is a challenge to our understanding of the relationship between language and thought: Either we must claim that, despite recent progress, machine-generated language will remain unlike human language in vital respects, or we must defy our intuition and consider machines to be as capable of thought as we are, or we must codify our intuition to specify why a machine able to produce language should, nonetheless, be considered lacking in thought.
Acknowledgments
We are grateful to the many colleagues who read and commented on this text.
Financial support
P. S. received support from the Wellcome Trust [108899/B/15/Z], C. H. from the Leverhulme Trust [RF-2021-533], and N. K. from the International Laboratory for Social Neurobiology of the Institute for Cognitive Neuroscience HSE, RF Government grant [075-15-2022-1037].
Competing interest
None.