We heartily congratulate Pickering & Garrod (P&G) on their outline of a new cognitive architecture that integrates language production and comprehension. What is particularly impressive in their sketch is that it is a comprehensive approach that addresses entire classes of interrelated psycholinguistic phenomena (as opposed to a selected subset of empirical findings) and that it provides natural explanations for especially the time-critical phenomena which have been difficult to explain plausibly and elegantly with our “standard models” (e.g., Dell 1986; Levelt 1989).
It is therefore, in our view, all the more urgent that this sketch, or at least its central parts, be further developed into a real computational implementation, one that actually generates the complex behaviour that we can now only simulate in our minds on the basis of verbal accounts. Only with such an implementation will we be able to assess the adequacy and accuracy of the provided account, and to generate nontrivial predictions that can subsequently be tested using the large arsenal of behavioural and neurocognitive methods now available. We are aware that implementing models is not an easy task, and that implementing this particular architecture will prove to be a challenging exercise. But it is possible to use a piecemeal approach: An obvious simplification is to first develop the model on a miniature language, in a restricted context, using simulated time. Also, it is possible and probably advantageous to employ division of labour by delegating parts of the implementation to different research groups that have complementary expertise.
There are two central aspects of the proposed theory that we would like to comment on, and suggest improvements to.
The first concerns the role of intentions. As P&G note, when predicting the utterance of an interlocutor (i.e., in comprehension), it is essential to have an (early) estimate of the underlying intention. In the HMOSAIC model, this is done by running parallel inverse models. But in modelling verbal interaction, one of the most intractable problems is the complex, seemingly arbitrary, many-to-many mapping between utterances and intentions (see, e.g., Levinson 1983; 1995). We suspect, therefore, that in a model of language production and comprehension (i.e., dialogue processing) this problem is much harder than in the recognition of intentions underlying functional motor behaviour. There are computational models that use Bayesian machine learning procedures to capture the utterance-intention mapping from multimodal interaction corpora (see, e.g., DeVault et al. 2011), but this approach involves computationally expensive and time-consuming offline learning procedures, and the resulting models are limited to the domain they have been trained on (for an alternative Bayesian approach to attacking this problem that does not involve offline training procedures, see De Ruiter & Cummins 2012). We would urge P&G to prioritize this aspect, as we believe that the success of the proposed approach will depend to a large degree on its ability to model intention recognition in dialogue.
Our second comment concerns the use of “impoverished” representations for the efferent copies, and especially the nature of this impoverishment. In P&G's exposition of the theory, the stated reason that the system does not simply use the efferent copies as motor programs is their impoverished nature (target article, sect. 3.1, para. 6). However, as the efferent copy represents the perceptual consequences of the motor program (and not the motor program itself), not using it directly as a motor program, in our view, does not need to be motivated at all. It is simply a different type of representation, not suited as a motor program.
A potentially more serious problem with the proposed impoverished nature of the efferent copies is that they do not adequately explain the phenomena they are supposed to. This holds for both comprehension and production. In production, for instance, the cited findings by Heinks-Maldonado et al. (2006), and especially those by Tourville et al. (2008), can only be explained if the efferent copy is fully specified, not merely phonologically but also phonetically. If a speaker knows only which phonemes he is going to produce in what order but not how (in terms of phonetic detail), then the proposed theory would predict that changing the first formant in the auditory feedback (as Tourville et al. did) would have no effect at all.
In language comprehension, the proposed theory assumes that listeners predict what their interlocutor is going to say. Indeed, this appears to be essential for explaining the phenomenon of close shadowing (Marslen-Wilson 1973), with delays as short as 250 ms. Also, predictions of utterance content probably underlie the listener's highly accurate anticipation of the end of the speaker's turn as found, for instance, by De Ruiter et al. (2006) and Stivers et al. (2009). But here too, an early but impoverished prediction could not account for the levels of accuracy observed in end-of-turn anticipation in experiments and natural data. Magyari and De Ruiter (2012) found evidence that people are able to predict when a turn ends by predicting how it ends, that is, with which specific words the turn will end. This suggests that the forward model cannot be lexically impoverished, as suggested by P&G in section 3.1 (para. 9).
This is why we would strongly urge P&G to adopt the assumption that the representations of the predictions, both in production and in comprehension, are fully specified (perceptual) representations, as Pickering and Garrod (2007) suggested for comprehension.
Finally, we again want to express our support for the exciting approach that P&G have taken with their highly original and thought-provoking outline, and look forward to discussing these issues further.