Our target article proposed a novel architecture for language processing. Rather than isolating production and comprehension from each other, we argued that they are closely linked. We first claimed that people predict their own and other people's actions. In a similar way, we argued that speakers predict their own utterances and comprehenders predict other people's utterances at a range of different linguistic levels.
The commentators made a wide range of perceptive points about our account, and we thank them for their input. We have divided their arguments into seven sections. We first respond to comments about the relation between production, comprehension, and other cognitive processes; in the second section, we turn to questions about the neural basis for our account. We said very little about learning and development in the target article, and our third section responds to those commentators who considered its implications for these issues. In the fourth section, we address the more technical issue of the nature of the representations created by forward modelling and how they are compared with implemented representations during monitoring. We respond, in the fifth section, to the commentators who remarked on the nature of prediction-by-simulation and its relationship to prediction-by-association. Finally, in sections six and seven we look at broader questions relating to the scope of the account: the nature of communicative intentions, and the implications of the account for dialogue.
R1. Production, comprehension, and the “cognitive sandwich”
Our account has the overall goal of integrating production and comprehension. Our specific models in Figures 5–7 (target article) are incompatible with traditional separation of production and comprehension as shown in Figure R1 (repeated here from the target article). Some commentators addressed the question of whether our proposal leads to a radical rethinking of the relationship between the two.
In fact, there are two issues concerning Figure R1. One is the extent to which production and comprehension are separate. In terms of the cognitive sandwich, are there two pieces of bread (rather than a wrap)? Our proposal is that instances of production involve comprehension processes (Fig. 5) and that instances of comprehension involve production processes (Fig. 6). But production and comprehension processes are nevertheless distinct – production involves mapping from intention to sound, and comprehension involves mapping from sound to intention. The second issue is whether the bread is separate from the filling. In other words, what is the relationship between production/comprehension and nonlinguistic mechanisms (thinking, general knowledge)? We propose that at least some aspects of general knowledge can be accessed during production and during comprehension, and moreover that interpreting the intention involves general knowledge. These aspects of general knowledge of course draw on a variety of cognitive functions such as memory and conflict resolution (see Slevc & Novick).
By interweaving production and comprehension, we have proposed that our account is incompatible with the traditional “cognitive sandwich” (Hurley 2008a) – an architecture in which production and comprehension are isolated from each other. Because production is a form of action, the use of production processes during comprehension means that comprehension involves a form of embodiment (i.e., uses action to aid perception). We also suggested that our account may be compatible with embodied accounts of meaning (e.g., Barsalou 1999; Glenberg & Gallese 2012). In contrast, Dove argues that we assume different intermediate and disembodied levels of representation (e.g., syntax, phonology) that are not grounded in modality-specific input/output systems. We agree that our account is not compatible with this form of embodiment, and we accept his conclusion that we retain some amodal representations but abandon a form of modularity.
Our proposal for the relationship between production and comprehension runs counter to traditional interactive accounts which assume that cascading and feedback occur during production. For example, Dell (1986) assumed that a speaker activates semantic features and that activation cascades to words associated with those features and sounds associated with those words, and that activation then feeds back from the phonemes to the activated words and to other words involving those phonemes. Because this process involves several cycles before activation settles on a particular word and set of phonemes, Dell regards early stages of this process as involving prediction. In his very interesting proposal, cascading and feedback are internal to the implementer and are causal in bringing about the (implemented) linguistic representations. This also seems to be the position adopted by Mylopoulos & Pereplyotchik, who replace our forward model with an utterance plan internal to the implementer, and by Mani & Huettig, who regard it as a third route to prediction. In contrast, our predicted representations are the result of the efference copy of the production command, and they are therefore separate from the implementer. Our account of monitoring thus involves comparing two separate sets of representations and so is very different from both Dell and Mylopoulos & Pereplyotchik.
Bowers' discussion of Grossberg (1980) also appears similar to Dell's proposal and usefully demonstrates implementer-internal prediction. He also queries why forward modelling should speed up processing when its output is subsequently compared with the output of a slower implementer. The reason is that the comparison can be made as soon as the implementer's output is available (at any level). Otherwise, it is necessary to analyse the implementer's output (as in comprehension-based monitoring; Levelt 1989).
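To make the timing argument concrete, the following minimal sketch (our own illustration, with invented latencies) contrasts the two monitoring schemes: comparison against an already-available forward-model prediction adds only a small matching cost, whereas comprehension-based monitoring must first re-analyse the implementer's output.

```python
# Toy illustration of why forward-model monitoring can flag an error earlier than
# comprehension-based monitoring. All latencies (ms) are invented for illustration.

PREDICTION_READY = 150      # forward-model prediction available early
IMPLEMENTER_OUTPUT = 400    # implemented representation available at this level
COMPARISON_COST = 20        # matching two representations is assumed to be cheap
ANALYSIS_COST = 200         # re-analysing the output via comprehension is slower

def forward_model_monitoring():
    # The error can be flagged as soon as both representations exist.
    return max(PREDICTION_READY, IMPLEMENTER_OUTPUT) + COMPARISON_COST

def comprehension_based_monitoring():
    # The output must first be analysed (a Levelt-style perceptual loop).
    return IMPLEMENTER_OUTPUT + ANALYSIS_COST + COMPARISON_COST

print(forward_model_monitoring())        # 420
print(comprehension_based_monitoring())  # 620
```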
R2. The neuroscience of production–comprehension relations
A number of commentators consider our account in relation to neuroscientific evidence. Some of this evidence concerns monitoring deficits associated with particular aphasias. Hartsuiker discusses a patient who cannot comprehend familiar sounds, words, or sentences, but who is nevertheless able to correct some of her own phonemic speech errors. He argues that such a patient would be incapable of monitoring her own speech by comparing the output of the implementer (utterance percept) with a forward model prediction (predicted utterance percept). However, it is difficult to draw clear conclusions from such cases without knowing the precise nature of the deficits. For example, this patient might monitor proprioceptively to correct errors. In relation to this, Tremblay et al. (2003) found that people can adapt their articulation to external perturbations of jaw movements during silently mouthed speech. In other words, they monitored and corrected their planned utterances in the absence of auditory feedback. Hartsuiker also discusses a patient who detects phonemic errors in others but who does not repair his own (frequent) phonemic errors. It is possible that proprioceptive monitoring is disturbed (and that such monitoring is particularly important for repairing his phonemic errors), but outer-loop monitoring is preserved.
Other commentators point out that the classical Lichtheim–Broca–Wernicke neurolinguistic model is inconsistent with much recent neurophysiological data. In fact, we believe that their proposals are likely to be quite close to ours. Hickok assumes a dorsal stream which subserves sensorimotor integration for motor control (i.e., production) and a ventral stream which links sensory inputs to conceptual memory (i.e., comprehension). Both systems make use of prediction, with motor prediction facilitating production and sensory prediction facilitating comprehension. Hickok's position may not be so distinct from ours if we equate motor prediction with prediction-by-simulation and sensory prediction with prediction-by-association. However, unlike Hickok, we suggest that both streams may be called upon during comprehension. Dick & Andric also argue against the classical neurolinguistic model in favour of a dual-stream account. They suggest that the motor involvement in speech perception may only be apparent in perception under adverse conditions (see sect. R4 for a fuller discussion on this point). Finally, Alario & Hamamé point out additional evidence for forward modelling in production (e.g., Flinker et al. 2010). We accept that current evidence does not allow us to determine the neural basis of the efference copy, and we hope that this will be a target for future research.
R3. Learning and development
Our target article did not explicitly discuss the role of forward modelling in language learning and development. However, we certainly recognise that our account is relevant to the acquisition of language production and comprehension, particularly in relation to their fluency. In fact, both forward and inverse models were first introduced and tested as neurocomputational models of early skill acquisition (e.g., Jordan & Rumelhart 1992; Kawato et al. 1987; 1990), and we therefore argue that they can be applied to language. Accordingly, we are grateful to a number of commentators for fleshing out the importance of such models in the development of language. We also strongly agree that the use of forward and inverse models in learning does not disappear following childhood. Instead, it leads to adaptation and learning in adults, as well as to prediction of their own and others' utterances.
Hence, Johnson, Turk-Browne, & Goldberg (Johnson et al.) point out the role of prediction in learning to segment utterances, and in learning words and grammatical constructions. Krishnan notes various interrelations between production and comprehension abilities during development. We endorse her goal of explaining the mechanisms underlying such developmental changes. Aitken emphasizes the importance of the developmental perspective in accounting for communication processes in general, and we agree.
Mani & Huettig point out that two-year-olds' production vocabulary (rather than comprehension vocabulary) correlates with their ability to make predictions in a visual world situation (Mani & Huettig 2012). This provides important new evidence that prediction during comprehension makes use of production processes. In fact, it suggests that two-year-olds are already using prediction-by-simulation. Note that we speculated that adults may emphasize prediction-by-association when comprehending children; we did not suggest that children emphasize prediction-by-association during comprehension.
From a computational perspective, Chang, Kidd, & Rowland (Chang et al.) argue that linguistic prediction is a by-product of language learning. We accept that it originates in learning but note that it is critical for fluent performance in its own right. Their account of comprehension has some similarities to ours – in particular, that it uses a form of production-based prediction. However, it assumes separate meaning and sequencing pathways, whereas we adopt a more traditional multi-level account (semantics, syntax, phonology); future research could directly compare these accounts. We do not see why it is problematic to use syntax and semantics in supervised learning (any more than phonology, which is of course also abstract).
McCauley & Christiansen discuss a model of language acquisition in which prediction-by-simulation facilitates the model's shallow processing of the input during learning. They show how this model can account for a range of recent psycholinguistic findings about language acquisition. The integration of our account with the evidence for the representation of multi-word chunks is potentially informative. We also agree that shallow processing during comprehension may help explain apparent asymmetries between production and comprehension.
R4. Impoverished representations and production monitoring
Many commentators discuss our claim that predictions are impoverished. For example, de Ruiter & Cummins point out that Heinks-Maldonado et al.'s (2006) findings are compatible only with a forward model that incorporates information about pitch. More generally, Strijkers, Runnqvist, Costa, & Holcomb (Strijkers et al.) question how “poor” (predicted) representations can be used to successfully monitor “rich” (implemented) representations, and Hartsuiker similarly claims that impoverished representations are not a good standard for judging correctness (i.e., they will be particularly error-prone).
A key property of forward models is their flexibility. Their primary purpose (in adults) is to promote fluency, and therefore speakers are able to “tune into” whatever aspect of a stimulus is most relevant to this goal. So long as speakers know that pitch is relevant to a particular task (or is obviously being manipulated in an experiment), they predict the pitch that they will produce, and are disrupted if their predicted percept does not match the actual percept. People are able to determine what aspects of a percept to predict on the basis of their situation, such as the current experimental task (see Howes, Healey, Eshghi, & Hough [Howes et al.]). Such flexibility clearly makes the forward models more useful for aiding fluency, but it also means that we cannot determine which aspects of an utterance will necessarily be represented in a forward model. In Alario & Hamamé's terms, we assume that the “opt-out” is circumstantial rather than systematic. Hence, predictions may contain “fine-grained phonetic detail,” contra Trude.
In fact, the question of what information is represented in forward models of motor control (and learning) has received some attention. For example, Kawato et al. (1990) suggested that movement trajectories can be projected using critical via-points through which the trajectory has to pass at a certain time, rather than in terms of the moment-by-moment dynamics of the implemented movement trajectory. Optimal trajectories can then be learnt by applying local optimising principles for getting from one via-point to the next. In this way, impoverished predictions can be used to monitor rich implementations. In the same way, language users might predict particularly crucial aspects of an utterance, but the aspects that they predict will depend on the circumstances.
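As an illustration of how an impoverished prediction can monitor a rich implementation, the sketch below (our own toy example, not Kawato et al.'s model) checks a densely sampled implemented trajectory only at a few predicted via-points, rather than at every time step.

```python
import numpy as np

# Rich implementation: a densely sampled movement trajectory (time -> position).
times = np.linspace(0.0, 1.0, 101)
implemented = np.sin(np.pi * times)          # stand-in for the implemented trajectory

# Impoverished prediction: a handful of via-points the trajectory must pass through.
via_points = {0.25: 0.70, 0.50: 1.00, 0.75: 0.70}   # time -> predicted position
TOLERANCE = 0.05

def monitor(times, trajectory, via_points, tol):
    """Flag mismatches only at the predicted via-points, not at every sample."""
    errors = {}
    for t, predicted in via_points.items():
        idx = np.argmin(np.abs(times - t))           # nearest sampled time point
        if abs(trajectory[idx] - predicted) > tol:
            errors[t] = trajectory[idx] - predicted
    return errors

print(monitor(times, implemented, via_points, TOLERANCE))   # {} -> no correction needed
```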
More generally, Meyer & Hagoort question the value of predicting one's own utterance. They argue that prediction is useful when it is likely to differ from the actual event. This is the case when predicting another person's behaviour or when the result of one's action is uncertain (e.g., moving in a strong wind). But they argue that people are confident about their own speech. Meyer & Hagoort admit that they will tend to be less confident in dialogue, and we agree. But more important, we argue that the behaviour of the production implementer is not fully determined by the production command, because the complex processes involved in production are subject to internal (“neural”) noise or priming (i.e., influences that may not be a result of the production command). Assuming that these sources of noise do not necessarily affect forward modelling as well, predicted speech may differ from actual speech. In addition, prediction is useful even if the behaviour is fully predictable, because it allows the actor to plan future behaviour on the basis of the prediction. In fact, we made such a proposal in relation to the order of heavy and light phrases (see sect. 3.1, target article).
Hartsuiker claims that our account incorrectly predicts an early competitor effect in Huettig and Hartsuiker (2010). His claim is based on the assumption that comparing the predicted utterance percept with the actual utterance percept should invoke phonological competitors in the same way that comprehending another's speech invokes such competitors. But the predicted utterance percept of heart does not also represent phonological competitors, and the utterance percept is directly related to the predicted utterance percept (i.e., it is not analysed). Hartsuiker favours a conflict-monitoring account (e.g., Nozari et al. 2011), but such an account merely detects some difficulty during production, and it is unclear how it can determine the source of difficulty or the means of correction. We accept that a forward modelling account involves some duplication of information; a goal of our account and motor-control accounts is to provide reasons why complex biological systems are not necessarily parsimonious in this respect.
Oppenheim makes the interesting suggestion that inner speech might be the product of forward production models. This is an alternative to the possibility that inner speech involves an incomplete use of the implementer, in which the speaker inhibits production after computing a phonological or phonetic representation. Clearly, findings that inner speech is impoverished would provide some support for the forward-model account. But as Oppenheim himself notes, it is hard to see how inner slips could be identified without using the production implementer to generate an utterance percept at the appropriate level of representation (which is then compared to the predicted utterance percept).
Jaeger & Ferreira argue that the output of the forward production model (the predicted utterance) serves merely as input to the forward comprehension model, and suggest that the efference copy could directly generate the predicted utterance percept. In fact, the motivation for constructing the forward production model is to aid learning an inverse model that maps backward from the predicted utterance percept via the predicted utterance to the production command, just as in motor control theory (Wolpert et al. 2001). It is possible that sufficiently fluent speakers might be able to directly map from the production command to the predicted utterance percept (though there may be a separate mapping to the predicted utterance). But this would prevent speakers from remaining sufficiently flexible to learn new words or utterances, just as in the early stages of acquisition. In this context, Adank et al. (2010) found that adult comprehension of unfamiliar accents is facilitated by previous imitation of those accents to a similar extent whether speakers can or cannot hear their own speech. This suggests that adaptation is mediated by the forward production model (though there could also be an effect of proprioception). Note that the inverse model is not merely used for long-term learning, but is also used to modify an action as it takes place (e.g., to speak more clearly if background noise increases).
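The division of labour we have in mind can be sketched along the lines of standard motor-control treatments (e.g., Jordan & Rumelhart 1992; Wolpert et al. 2001): a forward model learns to mimic the implementer's command-to-percept mapping, and an inverse model is then trained through that forward model. The toy code below assumes a one-dimensional linear mapping purely for illustration; it is a sketch of the general scheme, not a claim about the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_GAIN = 2.0                       # the implementer's (unknown) command -> percept mapping

def implementer(command):
    return TRUE_GAIN * command + rng.normal(0, 0.01)   # slow, noisy "real" system

fwd_gain, inv_gain, lr = 0.0, 0.0, 0.05   # forward and inverse models start untrained

for _ in range(2000):
    command = rng.uniform(-1, 1)
    percept = implementer(command)                  # implemented percept
    predicted = fwd_gain * command                  # forward-model prediction (efference copy)

    # 1. Train the forward model on the prediction error (predicted vs. actual percept).
    fwd_gain += lr * (percept - predicted) * command

    # 2. Train the inverse model *through* the forward model (distal supervised learning):
    #    adjust it so that forward(inverse(target)) reproduces the target percept.
    target = rng.uniform(-2, 2)
    inv_command = inv_gain * target
    inv_gain += lr * (target - fwd_gain * inv_command) * fwd_gain * target

print(round(fwd_gain, 2), round(inv_gain, 2))   # ~2.0 and ~0.5 (the inverse of the gain)
```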
Finally, Slevc & Novick suggest that nonlinguistic memory tasks and linguistic conflict resolution involve common brain structures (the left inferior frontal gyrus). Patients with lesions to this area show difficulty with both memory tasks and with language production and comprehension. We propose that both self- and other-monitoring rely on memory-based predictions. One possibility is that monitoring can involve automatic correction when the difference between the prediction and the implementation is low, but monitoring requires extensive access to general knowledge when there is greater discrepancy.
R5. Prediction-by-simulation versus prediction-by-association
We propose that comprehenders make use of both prediction-by-simulation and prediction-by-association. In most situations, both routes provide some predictive value, and so we assume that comprehenders integrate the predictions that they make. Both routes also use domain-general cognitive mechanisms such as memory (see Slevc & Novick). We made some suggestions about when comprehenders are likely to weight one route more strongly than the other. For example, simulation will be weighted more strongly when the comprehender appears to be more similar to the speaker than otherwise. Laurent, Moulin-Frier, Bessière, Schwartz, & Diard (Laurent et al.) describe how they have modelled the contributions of association and simulation (in their terms, auditory and motor knowledge) to speech perception under both ideal and adverse conditions (both in the context of external noise and when the comprehender and speaker are very different). Under ideal conditions, association and simulation perform identically, but under adverse conditions, their performance falls off in different ways. However, they note that integration of the two routes (i.e., sensory-motor fusion) yields better performance, and we agree. We strongly support their programme of modelling these contributions (see also de Ruiter & Cummins) and agree that experiments conducted under adverse conditions may help discriminate the contributions of the two forms of prediction. We agree that their modelling results are consistent with findings from transcranial magnetic stimulation (TMS) studies which point to the contribution of motor systems to speech perception, but only in noisy conditions (e.g., D'Ausilio et al. 2011).
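A schematic way to see why fusing the two routes helps (in the spirit of Laurent et al.'s sensorimotor fusion modelling, though not their actual model) is to treat each route as producing a probability distribution over candidate words and to combine them multiplicatively; when noise flattens one route's distribution, the fused posterior still benefits from the other. The numbers below are invented for illustration.

```python
import numpy as np

candidates = ["bag", "back", "bat"]

def fuse(p_association, p_simulation):
    """Multiply the two routes' distributions and renormalise (naive Bayes-style fusion)."""
    fused = np.array(p_association) * np.array(p_simulation)
    return fused / fused.sum()

# Ideal conditions: both routes favour the correct word ("bag").
ideal = fuse([0.6, 0.2, 0.2], [0.6, 0.2, 0.2])

# Adverse conditions: external noise flattens the associative (auditory) route,
# but the simulation (motor) route still carries information.
noisy = fuse([0.34, 0.33, 0.33], [0.6, 0.2, 0.2])

print(dict(zip(candidates, ideal.round(2))))   # {'bag': 0.82, 'back': 0.09, 'bat': 0.09}
print(dict(zip(candidates, noisy.round(2))))   # "bag" still favoured despite noise
```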
Yoon & Brown-Schmidt question the need for having both prediction-by-simulation and prediction-by-association in comprehension (i.e., dual-route prediction). These commentators claim that comprehenders using simulation would predict what the speaker would say (i.e., allocentrically). In fact, we propose that comprehenders use context to aid allocentric prediction, but that they are also subject to egocentric biases (i.e., comprehenders only partly take into account information about their partner's context). Yoon & Brown-Schmidt claim prediction-by-association would also be allocentric, and therefore question why we need two prediction mechanisms. We first note that prediction-by-association need not be allocentric, as it might be biased by prior perception of oneself.
But more important, the two routes to prediction are distinct for reasons unrelated to the role of context. Perhaps most important, prediction-by-simulation takes into account the inferred intention of the speaker in a way that prediction-by-association cannot (as it makes no reference to mental states). Hence, prediction-by-simulation should offer a richer and more situation-specific kind of prediction than prediction-by-association, and a combination of these predictions is likely to be more accurate than one form of prediction by itself.
To what extent can prediction-by-simulation account for speech adaptation effects? Trude and Brown-Schmidt (2012) showed that listeners could use their knowledge of a speaker's regional pronunciation to rapidly rule out competitors in a visual world task (see also Dahan et al. 2008). Trude asks whether such rapid incorporation of context to aid prediction can be explained by prediction-by-simulation and to what extent it suggests that listeners make detailed phonetic predictions of what they will hear. As we proposed here (see sect. R4), forward models must be flexible, and in experimental situations that highlight detailed phonetic differences we would expect people to predict such details. Of course there is still the question of how rapidly a listener could incorporate the context (i.e., relating to speaker identity) into their forward model. However, it is interesting to note that Trude and Brown-Schmidt found that competitors were ruled out earlier following increased exposure to previous words (e.g., hearing point to the bag as opposed to just hearing bag). Trude also suggests that listeners' use of impoverished representations could explain difficulties in imitating those features. But, in fact, we claim that listeners typically compute fully specified representations using the implementer, so the reasons for difficulties in imitation presumably lie elsewhere. More generally, we claim that comprehenders use some production processes but not necessarily all (e.g., they may be unable to produce a particular accent).
Festman addresses issues relating to bilingualism. One reason for the difficulty of conversations between a native and a nonnative speaker is that their processing systems are likely to be very different (in terms of both speed and content) and so prediction-by-simulation is likely to be adversely affected. (In addition, prediction-by-simulation will be hindered by limited experience on which to learn forward models.) Prediction-by-association does not suffer from this problem. For example, most L1 (first-language) speakers have experience with L2 (second-language) speakers, and hence can predict L2 speakers even when they would behave differently from them. L2 speakers should also be able to predict L1 speakers (who they tend to encounter regularly), but they may of course not be able to make good predictions (e.g., if they do not know words that the L1 speakers would use).
Rabagliati & Bemis argue that much of language is not predictable and that its power is its ability to communicate the unpredictable. We do not claim that prediction underlies all of language comprehension. Rather, people use prediction whenever they can to assist comprehension (at different linguistic levels). But when an utterance is unpredictable, they simply rely on the implementer. In fact, failure to predict successfully serves to highlight the unexpected, and therefore allows the comprehender to concentrate resources.
R6. Communicative intentions and the production command
Many of the commentators raise the issue of how our account is affected by communicative intentions. In our terms, this is the question of how the production command is determined and used. We agree with Kashima, Bekkering, & Kashima (Kashima et al.) that communicative intentions do not simply underlie the construction of semantics, syntax, and phonology, but incorporate information such as illocutionary force. Most important, we do not assume that covert imitation simply involves copying linguistic representations (or that overt imitation involves “blind” repetition). Instead, our proposal (see Fig. 6, target article) is that comprehenders use the inverse model and context to derive the production command that comprehenders would use if they were to produce the speaker's utterance, and use this to drive the forward model (or to make overt responses). In other words, the forward model and overt imitation (or completion) are affected by the production command. Hence, our account is compatible with findings such as those of Ondobaka et al. (2011) because it proposes that imitation can be affected by aspects of intentions such as the compatibility between interlocutors' goals. It can also explain how accent convergence can depend on communicative intentions (e.g., relating to identity).
We agree with Echterhoff that our account aims to integrate a “language-as-action” approach to intention (represented as the production command) with the evidence for mechanistic time-locked processing. We limited our discussion of intentions for reasons of space, but agree that their relationship to other aspects of mental life is central to a more fully developed theory. He specifically highlights the importance of postdictive processing in determining nonliteral and other complex intentions, and we agree. We see his proposal as closely related to the use of offline prediction-by-simulation (see Pezzulo 2011a).
The HMOSAIC architecture is used to determine the relationship between actions and higher-level intentions. Commentators de Ruiter & Cummins query whether this is possible for language because of the particularly complex relationship between intention and utterance; Pazzaglia also claims that the mappings between intention and speech sounds are too complex for prediction-by-simulation. We are not convinced that this relationship is more complex than other aspects of human action (and interaction) and believe that this is simply an issue for future research.
Kreysa points out that comprehenders may use gaze to help predict utterances without deliberate consideration of intention. As she notes, such cues may constitute a form of prediction-by-association, though it is also possible that comprehenders perform prediction-by-simulation but with gaze constituting part of the context that is used to compute the intention (see Fig. 6, target article). If this is the case, gaze would help reduce the complexity of the intention-utterance relationship. She also questions whether anticipatory fixation in the visual-world paradigm involves prediction-by-association or simulation. In this context, we note that Lesage et al. (2012) showed that cerebellar repetitive TMS (rTMS) prevented such anticipatory fixations in predictive contexts (e.g., The man will sail the boat), but not in control sentences (or in vertex rTMS). As the cerebellum appears to be used for prediction in motor control (Miall & Wolpert 1996), this suggests that such fixations involve prediction-by-simulation.
Jaeger & Ferreira ask about the precise nature of the prediction errors that the system is trying to minimize. In particular, do these relate to evaluations of how well formed the output is or do they relate to evaluations of its communicative effectiveness in the context in which it is uttered? One possibility is that when speakers utter a predictable word, it means that they have forward modelled that word to a considerable extent before uttering it. Hence, they weight the importance of the forward model more than the output of the implementer and therefore attenuate the form of the word. But alternatively, the speakers realize that the error tends to be less for predictable than unpredictable words, and know that addressees are less likely to comprehend words when the error is great. They thus use a strategy of clearer articulation when they realize that the error is likely to be great (perhaps based on their view of the addressee's ability to comprehend).
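The first possibility can be caricatured as a weighting scheme: the more fully a word has been forward-modelled before articulation, the more the (reduced) predicted form dominates the output. The sketch below is only our own illustration of that idea, with invented durations and a hypothetical weighting function.

```python
def articulated_duration(full_duration, reduced_duration, predictability):
    """Weight the implementer's full form against the forward-modelled (reduced) form.

    predictability in [0, 1]: how far the word had been forward-modelled before
    articulation. Higher predictability -> more weight on the reduced form.
    """
    w = predictability
    return w * reduced_duration + (1 - w) * full_duration

# Invented durations (ms) for a content word in clear vs. reduced form.
print(articulated_duration(300, 220, predictability=0.9))  # predictable word: 228.0 ms
print(articulated_duration(300, 220, predictability=0.1))  # unpredictable word: 292.0 ms
```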
Pezzulo & Dindo argue that producers use intentional (signalling) strategies to aid their comprehenders' predictions. In other words, they make themselves more predictable to the comprehender. To do this, they maintain an internal model of the comprehenders' uncertainty. Hence, the authors propose a type of allocentric account. We suspect that speakers make their utterances predictable for both allocentric reasons (e.g., lengthening vowels for children) and egocentric reasons. For example, an effect of interactive alignment (Pickering & Garrod 2004) is to make interlocutors more similar (see sect. 3.3, target article) so that comprehenders are more likely to predict speakers accurately. Such alignment might itself follow from an intentional strategy to align but need not. In their penultimate paragraph, Pezzulo & Dindo make several very interesting additional suggestions about ways in which producers and comprehenders may make use of prediction beyond those we addressed in the target article.
R7. Interleaving production and comprehension in dialogue
Howes et al. criticise traditional models of production and comprehension based on large units (such as whole sentences), and we agree. Such models are clearly unable to deal with the fragmentary nature of many contributions to dialogue. Howes et al. propose an incremental model that combines production and comprehension. This model may be able to deal with incrementality and joint utterances in dialogue but does not provide an account of prediction, either of upcoming words in monologue (e.g., DeLong et al. 2005) or ends of turns in dialogue (de Ruiter et al. 2006). Moreover, Howes et al. argue that prediction can only help speakers repeat their interlocutors, but this is not the case. Overt responses based on the derived production command for the current utterance (i_B(t)) lead to repetition, but overt responses based on the derived production command for the upcoming utterance (i_B(t+1)) lead to continuations (see Fig. 6, target article). In other words, we believe that our account provides mechanisms that underlie dialogue (Pickering & Garrod 2004).
Fowler argues that we wrongly emphasize internal predictive models when the same benefits can be accrued by directly interacting with the environment, most importantly our interlocutors. We accept that the information in the environment helps determine people's actions, but argue that predictions driven by internal models that have been shaped by past experience allow people to perform better (just as is the case for complex engineering). In the interaction process, overt imitation, completions, and complementary responses all appear to occur regularly, and all are compatible with our account (see sect. 3.3, target article).
Echterhoff first asks whether our action-based account can generalize to noninteractive situations. Our models of production and comprehension are explained noninteractively and then applied to dialogue. In particular, we propose that dialogue allows interlocutors to make use of overt responses; in monologue, such overt responses are not relevant, and therefore language users focus more completely on internal processes. Echterhoff also argues that shared reality (see sect. 2.3, target article) involves more than seamless coordinated activity, but also has an evaluative component. We propose that successful mutual prediction (both A and B correctly predict both A and B) underlies shared reality and is in turn likely to support the alignment of evaluations of that activity. Mutual predictions occur as a result of aligned action commands, and action commands reflect intentions which of course involve the motivations that, according to Echterhoff, underlie shared reality.
ACKNOWLEDGMENTS
We thank Martin Corley and Chiara Gambi for their comments and acknowledge support of ESRC Grants no. RES-062-23-0736 and no. RES-060-25-0010.