Pickering & Garrod's (P&G's) integrated account of language production and comprehension brings forward novel cognitive mechanisms for a range of language processing functions. Here we focus on the theoretical development of the speech monitor in P&G's theory and the evidence cited in support of it. The authors propose that we construct forward models of predicted percepts during language production and that these predictions form the basis for internally monitoring and, if necessary, correcting our speech. This view of a speech monitor grounded in domain-general action control is refreshing in many ways. Nevertheless, in our opinion it raises a general theoretical concern, at least in the form in which it is implemented in P&G's model. Furthermore, we believe it is important to emphasize that the evidence cited in support of the use of forward modeling in speech monitoring is suggestive, but far from directly supportive of the theory.
In general terms, we question the rationale behind the proposal that incomplete representations constitute the basis of speech monitoring. A crucial aspect of P&G's model concerns timing. Because forward representations are computed faster than the actual representations that will be used to produce speech, the former serve to detect and correct potential deviations in the latter. To ensure that the forward representations are available earlier than the implemented representations, P&G propose that the percepts constructed by the forward model are impoverished, containing only part of the information necessary to produce speech. But how can speech monitoring be efficient if it relies on “poor” representations to monitor the “rich” representations? For instance, a predicted syntactic percept could include grammatical category, but lack number and gender information.
In this example, it is evident that if the slower production implementer is erroneously preparing a verb instead of a noun, the predicted representation coming from the forward model might indeed serve to detect and correct the error prior to overt speech. However, if the representation prepared by the production implementer contains a number or gender error, given that this information is not specified in the predicted percept (in this example), how do we avoid these errors when speaking (see the sketch below)? If the predicted language percepts are assumed to always be incomplete in order to be available early in the process, it is truly remarkable that an average speaker produces only about 1 error every 1,000 words (e.g., Levelt et al. 1999). Therefore, although prediction likely plays an important role in facilitating the retrieval of relevant language representations (e.g., Federmeier 2007; Strijkers et al. 2011), and hence could also serve to aid the speech monitor, a proposal that identifies predictive processes with the inner speech monitor seems problematic, or at least underspecified, for now.
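To make the concern concrete, here is a minimal toy sketch (in Python, with entirely hypothetical representations; P&G do not specify any implementation) of a monitor that compares an impoverished predicted percept against the richer implemented representation. Any feature absent from the prediction, such as number or gender in the example above, is invisible to the comparison, so errors along those dimensions pass through undetected.

```python
# Toy illustration (hypothetical; P&G do not specify an implementation):
# a forward-model monitor can only check the features its impoverished
# predicted percept actually contains.

# The speaker intends a singular, feminine noun. The fast forward model
# predicts only grammatical category (the impoverished percept from the
# example in the text); number and gender are left unspecified.
predicted_percept = {"category": "noun"}

# The slower production implementer builds the full representation, but
# with a number error (plural instead of the intended singular).
implemented = {"category": "noun", "number": "plural", "gender": "feminine"}

def monitor(predicted, implemented):
    """Flag mismatches, but only on features the prediction specifies."""
    return {feature: (predicted[feature], implemented[feature])
            for feature in predicted
            if predicted[feature] != implemented[feature]}

print(monitor(predicted_percept, implemented))
# -> {}  The number error is invisible: "number" is absent from the percept.

# A grammatical-category error, by contrast, would be caught:
print(monitor(predicted_percept, dict(implemented, category="verb")))
# -> {'category': ('noun', 'verb')}
```

On this sketch, detection accuracy is bounded by the feature coverage of the predicted percept, which is precisely the tension the near-errorless empirical error rate exposes.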
Besides the above theoretical concern regarding the use of incomplete representations as the basis of speech monitoring, the strength of the evidence cited to support the use of forward modeling in speech production also seems insufficient at present. The three studies discussed by P&G to illustrate the use of efference copies during speech production (i.e., Heinks-Maldonado et al. 2006; Tian & Poeppel 2010; Tourville et al. 2008) demonstrate that shifting acoustic properties of linguistic elements in the auditory feedback given to a speaker produces early auditory response enhancements. Although these data are suggestive and merit further investigation, showing that reafference cancellation generalizes to self-induced sounds does not prove that forward modeling is used for language production per se. It merely highlights that a mismatch between predicted and actual self-induced sounds (linguistic or not) produces an enhanced sensory response, just as in other domains of self-induced action (e.g., tickling). To date, no study has explored whether auditory suppression related to self-induced sounds is also sensitive to purely linguistic phenomena (e.g., lexical frequency) or to variables known to affect speech monitoring (e.g., lexicality). This leaves open the possibility that the evidence cited relates only to general sensorimotor properties of speech (acoustics and articulation) rather than to the monitoring of language proper.
In addition, the temporal arguments put forward by P&G to conclude that these data cannot be explained by comprehension-based accounts, and instead support the notion of speech monitoring through prediction, are premature. For instance, P&G take the speed with which auditory suppression for self-induced sounds occurs (around 100 ms after speech onset) as an indication that speakers could not be comprehending what they heard, and argue that this favors a role of forward modeling in speech production. But the speed with which lexical representations are activated in speech perception is still a debated issue, and some studies provide evidence for access to words within 100 ms (e.g., MacGregor et al. 2012; Pulvermüller & Shtyrov 2006). In a similar vein, P&G rely on Indefrey and Levelt's (2004) temporal estimates of word production to argue in favor of speech monitoring through prediction. However, this temporal map is still hypothetical, especially in terms of the latencies between the different representational formats (see Strijkers & Costa 2011). More generally, one may question why P&G choose to link the proposed model, intended to be highly dynamic (rejecting the “cognitive sandwich”), with temporal data embedded in fully serial models. Indeed, if one abandons the strictly sequential time course of such models and instead allows for fast, cascading activation of the different linguistic representations, not only do the arguments of P&G become problematic, but the notion of a slow production/comprehension implementer being monitored by a fast and incomplete forward model also loses a critical aspect of its theoretical motivation.
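The force of this last point can be illustrated with a small toy computation (all latencies are hypothetical, chosen for illustration; they are neither P&G's figures nor Indefrey and Levelt's estimates): under a strictly serial schedule an impoverished forward model enjoys a comfortable temporal lead over the implementer, but once the implementer's stages are allowed to cascade, that lead largely evaporates.

```python
# Toy timing sketch (all latencies hypothetical; these are not P&G's or
# Indefrey and Levelt's estimates). The forward model's monitoring role
# depends on finishing before the implementer; cascading erodes that lead.

STAGES = ["semantic", "lexical", "phonological"]

implementer_ms = {"semantic": 150, "lexical": 100, "phonological": 100}
forward_ms = {"semantic": 50, "lexical": 40, "phonological": 40}

def serial_finish(latencies):
    """Strictly serial: each stage starts when the previous one ends."""
    t, finish = 0, {}
    for stage in STAGES:
        t += latencies[stage]
        finish[stage] = t
    return finish

def cascaded_finish(latencies, onset_lag=30):
    """Cascading: each stage starts a fixed lag after the previous one
    starts, so stages overlap instead of queueing."""
    return {stage: i * onset_lag + latencies[stage]
            for i, stage in enumerate(STAGES)}

forward_done = serial_finish(forward_ms)["phonological"]  # 130 ms
for label, finish in [("serial", serial_finish(implementer_ms)),
                      ("cascaded", cascaded_finish(implementer_ms))]:
    done = finish["phonological"]
    print(f"{label:8s} implementer done at {done} ms; "
          f"forward-model lead = {done - forward_done} ms")
# serial   implementer done at 350 ms; forward-model lead = 220 ms
# cascaded implementer done at 160 ms; forward-model lead = 30 ms
```

On these illustrative numbers, allowing the implementer's stages to overlap shrinks the forward model's lead from 220 ms to 30 ms, which is the sense in which cascading activation undercuts the timing motivation for a fast but impoverished monitor.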
To sum up, we believe that the theoretical development of the speech monitor in P&G's integrated account of language production and comprehension faces a major challenge, since it needs to explain how representations that lack certain dimensions of information can serve to detect and correct errors to such a high, almost errorless, degree. Furthermore, it is important to acknowledge that, as it stands, the evidence used in support of this proposal could just as easily be reinterpreted in other terms, highlighting the need for direct empirical exploration of P&G's proposal.