According to Pickering & Garrod (P&G), a key feature of forward models is that they are fast: they allow a system to correct itself much more quickly than is possible on the basis of reafferent feedback. I understand how a forward model allows fast feedback when the feedback is based on the output of the forward model itself (as opposed to conditions in which the output of the forward model is compared to actual feedback). But to better understand how this approach works, it needs to be made clearer what additional information is included in the forward model that is not in the production or comprehension systems themselves; otherwise, positing a forward model adds nothing. Furthermore, if forward models do contain extra information, how is this reconciled with the claim that forward models are “impoverished” compared to the production and comprehension implementers?
I also find it unclear how a forward model speeds things up when the output of a forward model and actual feedback (e.g., proprioceptive, visual, auditory, etc.) are compared. The speed with which the forward model can compute seems useless in these conditions, as the model has to wait for the actual feedback before the comparison process can begin. Furthermore, whenever feedback is involved in correcting motor control for future actions (as opposed to the current action), it is not immediately clear why a slow feedback loop would not suffice.
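To make the timing point concrete, here is a minimal sketch (with latencies that are purely illustrative assumptions, not measured values) of why an error signal computed against actual feedback is gated by the slower, reafferent signal, however fast the forward model itself runs:

```python
# Illustrative latencies (assumptions for the sketch, not empirical values).
PREDICTION_LATENCY_MS = 20   # forward-model output, available early
REAFFERENT_LATENCY_MS = 150  # actual auditory/proprioceptive feedback delay

def error_available_at(prediction_ms: float, feedback_ms: float) -> float:
    """Earliest time a prediction-vs-feedback error signal can exist:
    the comparison cannot begin before its slower input arrives."""
    return max(prediction_ms, feedback_ms)

# Comparing the prediction to actual feedback: gated by the slow signal.
print(error_available_at(PREDICTION_LATENCY_MS, REAFFERENT_LATENCY_MS))  # 150

# Monitoring the prediction alone (no actual feedback involved): fast.
print(error_available_at(PREDICTION_LATENCY_MS, PREDICTION_LATENCY_MS))  # 20
```

On this sketch, the forward model's speed only pays off on the second path; on the first, the comparison inherits the latency of the reafferent signal.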
At the same time, it is possible to imagine feedback between levels of a production (or comprehension) system that would be fast enough to correct errors before overt errors are produced (or before comprehension is compromised). Consider the “predictive coding” model introduced by Grossberg (1980). In his Adaptive Resonance Theory (ART) model, a single unit in layer 2 of the network learns to code for a pattern of activation in layer 1. Critically, learning between the two layers takes place both in bottom-up connections (such that a pattern of activation in layer 1 learns to activate a given unit in layer 2 – what Grossberg calls “instar learning”) and in top-down connections (such that an activated layer 2 unit learns to activate a pattern in layer 1 – what Grossberg calls “outstar learning”). These top-down connections in fact support “prediction”: the layer 2 unit learns to activate those layer 1 units that activated it in the past.
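A minimal illustration of these two learning rules (assuming a single winning layer 2 unit and simplified Hebbian-style updates; the learning rate and input pattern are invented for illustration and are not Grossberg's actual equations):

```python
import numpy as np

eta = 0.5                    # illustrative learning rate
n1 = 4                       # number of layer 1 units
W_up = np.zeros(n1)          # bottom-up weights into one layer 2 unit (instar)
W_down = np.zeros(n1)        # top-down weights from that unit (outstar)

x = np.array([1.0, 0.0, 1.0, 0.0])  # a layer 1 activation pattern

# Instar learning: the bottom-up weights move toward the pattern that
# activated the layer 2 unit, so that pattern comes to select this unit.
W_up += eta * (x - W_up)

# Outstar learning: the top-down weights also move toward x, so the active
# layer 2 unit learns to reproduce the layer 1 pattern. This is the
# "prediction": reactivating the unit later reads out the pattern it saw.
W_down += eta * (x - W_down)

prediction = W_down          # the top-down signal sent back to layer 1
```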
The process of identifying an input involves comparing the bottom-up (input) signal with the top-down (predicted) signal: If the two patterns match (in layer 1), the model goes into a state of resonance (with bottom-up and top-down signals reinforcing one another), and this is taken as evidence that the correct node in layer 2 was indeed activated. If not, there is a mistake that needs to be corrected. The identification of a mistake happens quickly, before any learning takes place (in order to solve the stability-plasticity dilemma, the failure of which manifests as “catastrophic interference”). The important point for present purposes is that ART includes fast prediction within a single system, with no need to posit a separate, parallel forward model. (In fact, the ART system does have a separate parallel system that is engaged in cases of bottom-up/top-down mismatch, but this system does not carry any information about a prediction – it just turns off the layer 2 unit and tells the network to “try again.”)
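A minimal sketch of this match/reset cycle, assuming binary patterns and an invented vigilance threshold (real ART dynamics are continuous and considerably richer):

```python
import numpy as np

VIGILANCE = 0.8  # assumed match criterion, chosen for illustration

def resonates(bottom_up: np.ndarray, top_down: np.ndarray) -> bool:
    """Compare the input with the top-down prediction at layer 1
    (the standard ART match ratio |x AND w| / |x|)."""
    match = np.sum(np.minimum(bottom_up, top_down)) / np.sum(bottom_up)
    return match >= VIGILANCE

def identify(x: np.ndarray, categories: list) -> int | None:
    """Try layer 2 units in turn; a mismatch resets the unit ("try again")."""
    for unit, prediction in enumerate(categories):
        if resonates(x, prediction):
            return unit      # resonance: learning may now safely proceed
        # mismatch detected *before* any learning: reset, try the next unit
    return None              # no match anywhere: recruit a new layer 2 unit
```

The point the sketch makes explicit is that the mismatch check precedes any weight change, which is how stored categories are protected from being overwritten.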
Could something similar work in the case of speech production? Perhaps a semantic input could activate a lemma unit, and the lemma could feed back to the semantic system, and the model could be confident that the correct lemma was selected if its top-down and bottom-up signals matched. Similar top-down/bottom-up interactions across levels in the speech production system could lead to quick corrections at each stage. And, ultimately, feedback from the actual output (a spoken word, in the case of speech production) could play a role in correcting errors (after the fact). Given that predictions are made between levels of the speech production system (and possibly between production and comprehension systems), corrections could presumably be made quickly, and at different stages of the process.
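A hypothetical sketch of this lemma-verification idea (the feature vectors, lemma inventory, and match threshold are all invented for illustration; this is not a claim about any implemented production model):

```python
import numpy as np

THRESHOLD = 0.9  # assumed confidence criterion for the top-down match

lemmas = {       # top-down semantic pattern each lemma predicts (invented)
    "cat": np.array([1.0, 1.0, 0.0, 0.0]),
    "dog": np.array([1.0, 0.0, 1.0, 0.0]),
}

def select_lemma(semantic_input: np.ndarray) -> str | None:
    """Pick the most active lemma, then verify it via top-down feedback
    to the semantic level before committing to it."""
    best = max(lemmas, key=lambda lem: lemmas[lem] @ semantic_input)
    prediction = lemmas[best]            # lemma feeds back to semantics
    match = (prediction @ semantic_input) / (
        np.linalg.norm(prediction) * np.linalg.norm(semantic_input))
    if match >= THRESHOLD:
        return best                      # bottom-up/top-down resonance
    return None                          # mismatch: correct before output

print(select_lemma(np.array([1.0, 1.0, 0.0, 0.0])))  # -> "cat"
```

On this sketch, the same resonance check could in principle be repeated at each level (lemma to phonology, phonology to articulation), yielding fast, stage-by-stage correction without a separate forward model.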
Of course, this sketch of an idea does not even attempt to address the complexities that P&G tackle, and any proposal without a forward model may well be inadequate. But I'm struggling to see why separate fast forward models are needed (as opposed to feedback between levels within and between production and comprehension systems), and to understand how forward models are thought to be fast whenever they rely on actual feedback.