Pickering & Garrod (P&G) take production commands to be conceptual representations that encode high-level features – “information about communicative force (e.g., interrogative), pragmatic context, and a nonlinguistic situation model” (target article, sect. 3.1, para. 3). On their model, a production command is input directly to the production implementer, which outputs an utterance. But somewhere in between this input and output, there must be, in addition, an intermediate representation that specifies the low-level features of the utterance, for example, its phonological and phonetic features. In what follows, we will call this low-level production command the utterance plan. In the analogous motor control case, upon which P&G base their model, an utterance plan corresponds to a motor command, which specifies the low-level features of the bodily movement, and is output by the inverse model (Wolpert 1997).
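The architecture just described can be glossed with a toy sketch (purely for illustration; every class and function name here is our own invention, not P&G's): a high-level production command is mapped by an inverse model to a low-level utterance plan, which the production implementer then executes.

```python
from dataclasses import dataclass

# Toy gloss of the architecture described above (all names are ours).

@dataclass
class ProductionCommand:
    """High-level conceptual representation: communicative force, etc."""
    force: str     # e.g., "interrogative"
    message: str   # stand-in for the situation-model content

@dataclass
class UtterancePlan:
    """Intermediate low-level representation: phonological/phonetic spec."""
    phonemes: list

def inverse_model(command: ProductionCommand) -> UtterancePlan:
    # Maps the high-level command to a low-level plan, analogous to the
    # inverse model mapping a goal state to a motor command (Wolpert 1997).
    return UtterancePlan(phonemes=list(command.message))

def production_implementer(plan: UtterancePlan) -> str:
    # Executes the low-level plan, outputting the overt utterance.
    return "".join(plan.phonemes)

utterance = production_implementer(
    inverse_model(ProductionCommand("interrogative", "ba"))
)
```

The point of the sketch is simply that the utterance plan is a distinct representational stage between command and utterance, whatever its neural realization.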
We argue here that the evidence that P&G cite in favor of positing forward models in speech production is not compelling. More specifically, the data to which they appeal either cannot be explained by forward models, or can be explained by a more parsimonious model, on which the utterance plan and the sensory feedback are directly compared. On this alternative picture, there is no need to posit forward models.
P&G appeal to Heinks-Maldonado et al. (2006) to support their claim that forward modeling is used in speech production. They argue that the suppressed M100 signal in the condition where participants spoke and heard their own unaltered speech – compared with conditions in which their speech was distorted or they heard an artificial voice – is the result of the forward model prediction “canceling out” the matching auditory feedback from the utterance. They urge that “the rapidity of the effect suggests that speakers could not be comprehending what they heard and comparing this to their memory of their planned utterance” (sect. 3.1, para. 16). While this is indeed implausible, there is an alternative hypothesis that is not ruled out by the data: The attenuation effect results from a match between the utterance plan and auditory feedback. Such a comparison would take no more time than the purported comparison between the forward model prediction and auditory feedback. The same point applies to P&G's discussion of the datum reported in Levelt (1983) concerning mid-word self-correction.
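The parity point can be made concrete with a toy comparator sketch (entirely ours; the phoneme-string coding and all function names are assumptions for illustration): whichever state is matched against the feedback – a forward-model prediction or the utterance plan itself – the comparison is a single match operation, so neither hypothesis predicts a slower attenuation effect.

```python
def forward_model_comparator(prediction: str, feedback: str) -> bool:
    # P&G's hypothesis: an efference-copy-driven prediction is matched
    # against auditory feedback; a match yields M100 attenuation.
    return prediction == feedback

def plan_comparator(utterance_plan: str, feedback: str) -> bool:
    # The more parsimonious alternative: the low-level utterance plan
    # itself is matched against auditory feedback, with no forward model.
    return utterance_plan == feedback

# Both architectures perform one match per comparison, so the rapidity
# of the attenuation effect cannot decide between them.
attenuated_fm = forward_model_comparator("ba", "ba")    # unaltered speech
attenuated_plan = plan_comparator("ba", "ba")           # unaltered speech
mismatch_plan = plan_comparator("ba", "pa")             # distorted feedback
```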
Some theorists (e.g., Prinz 2012, p. 238) have insisted that whatever states enter into the comparison with sensory feedback must have the same representational format as the feedback. Because an utterance plan must encode the low-level features of the utterance that it specifies, it arguably meets this criterion.
P&G also appeal to the results reported in Tourville et al. (2008), highlighting two features of that study. First, the compensation that participants make in response to distorted auditory feedback is rapid – “a hallmark of feedforward (predictive) monitoring (as correction following feedback would be too slow)” (sect. 3.1, para. 17). But rapid compensation can only be attributed to forward modeling when the forward model prediction is used in place of the auditory feedback during online control of behavior. The idea is that, by using the putative forward model prediction of the sensory feedback, the system need not wait for the auditory feedback. However, this cannot be the case in the experiment conducted by Tourville et al. (2008), because the distorted auditory feedback is externally induced at random, and therefore unpredictable. Participants must base their compensations on the distorted auditory feedback itself, since no prediction would be available in this type of case. Hence, however rapid their compensation, it cannot reflect the operation of forward modeling.
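A toy simulation (ours, with arbitrary numbers) illustrates why the randomness of the perturbation matters: a forward model driven only by the efference copy can predict nothing but the unperturbed feedback, so any compensation must be computed from the perturbed signal the participant actually hears.

```python
import random

PLAN_FORMANT = 100.0  # arbitrary planned formant value (illustrative units)

def forward_model_prediction(plan_formant: float) -> float:
    # The forward model is driven by the efference copy alone, so it can
    # only ever predict the unperturbed feedback.
    return plan_formant

def auditory_feedback(plan_formant: float, shift: float) -> float:
    # What the speaker actually hears: the planned value plus an
    # externally induced shift, applied unpredictably.
    return plan_formant + shift

shift = random.choice([0.0, 30.0])  # random, hence unpredictable by the speaker
predicted = forward_model_prediction(PLAN_FORMANT)
heard = auditory_feedback(PLAN_FORMANT, shift)

# Any compensation must be based on the heard signal: the prediction is
# identical whether or not the perturbation occurred.
compensation = -(heard - PLAN_FORMANT)
```

Whatever value the random shift takes, `predicted` never registers it; that is why rapid compensation in this paradigm cannot be credited to the forward model.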
The second feature of the Tourville et al. (2008) study to which P&G appeal is that “the fMRI [functional magnetic resonance imaging] results identified a network of neurons coding mismatches between expected and actual auditory signals” (sect. 3.1, para. 17). But while the fMRI results did identify a network of neurons that has been shown to be activated when auditory feedback from an utterance is distorted (Fu et al. 2006; Hashimoto & Sakai 2003; Hirano et al. 1997; McGuire et al. 1996), the further claim that these neurons code mismatches between forward model predictions (“expected” auditory signals) and actual auditory signals is unwarranted by the available neuroimaging data. All such data are equally consistent with the more parsimonious hypothesis that these neurons code mismatches between the utterance plan and the auditory feedback.
Finally, we are skeptical of P&G's interpretation of the data in Tian and Poeppel (2010). The Tian and Poeppel study found activation in the auditory cortex in two conditions: after participants actually produced a syllable and after they merely imagined producing that same syllable. Following Tian and Poeppel, P&G interpret such activation as evidence of forward modeling. However, this activation may simply encode a general expectation that a sound will be heard, rather than specifically encoding the anticipated auditory feedback. Moreover, even if this activation were shown to be a representation of specific auditory feedback, it does not follow that the activation should be construed as a forward model prediction. It could instead subserve a mere simulation of the auditory feedback. In order to determine that this activation subserves a forward model prediction, as against a mere simulation, we would need evidence that it plays the relevant functional role – that is, that it is both based on an efference copy and subsequently compared to auditory feedback. Tian and Poeppel provide no evidence for the latter condition. Indeed, the relevant kind of forward model should not be operative in the passive listening condition of Tian and Poeppel's study. So the fact that auditory cortex activations were found to be similar for listening passively to a sound and imagining producing that sound does not support the claim that the latter reflects forward modeling in particular. Finally, Tian and Poeppel's framing of the issue is itself suspect. They cast forward model predictions as conscious personal-level states – mental images of a certain kind. But forward model predictions are standardly taken to be subpersonal states, and one might reasonably doubt that such states are ever present in the phenomenology that accompanies speech production.