Christiansen & Chater (C&C) argue that the “Now-or-Never” bottleneck arises because input that is not immediately processed is forever lost when it is overwritten by new input entering the same neural substrate. However, the brain, like any recurrent network, is a state-dependent processor whose current state is a function of both the previous state and the latest input (Buonomano & Maass Reference Buonomano and Maass2009). The incoming signal therefore does not wipe out previous input. Rather, the two are integrated into a new state that, in turn, will be integrated with the next input. In this way, an input stream “lives on” in processing memory. Because prior input is implicitly present in the system's current state, it can be faithfully recovered from the state, even after some time. Hence, there is no need to immediately “chunk” the latest input to protect it from interference. This does not mean that no part of the input is ever lost. As the integrated input stream grows in length, it becomes increasingly difficult to reliably make use of the earliest input. Therefore, the sooner the input can be used for further processing, the more successful this will be: There is a “Sooner-is-Better” rather than a “Now-or-Never” bottleneck.
So-called reservoir computing models (Lukoševičius & Jaeger Reference Lukoševičius and Jaeger2009; Maass et al. Reference Maass, Natschläger and Markram2002) exemplify this perspective on language processing. Reservoir computing uses untrained recurrent networks to project a temporal input stream onto a random point in a very high-dimensional state space. A “read-out” network is then calibrated, either online through gradient descent or offline by linear regression, to transform this random mapping into a desired output, such as a prediction of the incoming input, a reconstruction of (part of) the previous input stream, or a semantic representation of the processed language. Crucially, the recurrent network itself is not trained, so the ability to retrieve earlier input from the random projection cannot be the result of learned chunking or other processes that have been acquired from language exposure. Indeed, Christiansen and Chater (Reference Christiansen and Chater1999) found that even before training, the random, initial representations in a simple recurrent network's hidden layer allow for better-than-chance classification of earlier inputs. Reservoir computing has been applied to simulations of human language learning and comprehension, and such models have accounted for experimental findings from both behavioural (Fitz Reference Fitz, Carlson, Hölscher and Shipley2011; Frank & Bod Reference Frank and Bod2011) and neurophysiological studies (Dominey et al. Reference Dominey, Hoen, Blanc and Lelekov-Boissard2003; Hinaut & Dominey Reference Hinaut and Dominey2013). Moreover, it has been argued that reservoir computing shares important processing characteristics with cortical networks (Rabinovich et al. Reference Rabinovich, Huerta and Laurent2008; Rigotti et al. Reference Rigotti, Barak, Warden, Wang, Daw, Miller and Fusi2013; Singer Reference Singer2013), making this framework particularly suitable to the computational study of cognitive functions.
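For readers unfamiliar with the framework, the sketch below illustrates the basic reservoir-computing recipe in Python/NumPy: a fixed, random recurrent network is driven by an input stream, and only a linear read-out is fitted (here offline, by ridge regression) to recover input presented several steps earlier. All network sizes, weight scalings, and the delayed-recall target are illustrative assumptions, not the specification of any published model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_res = 20, 500                              # input / reservoir size (illustrative)
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))       # fixed, untrained input weights
W = rng.uniform(-0.5, 0.5, (n_res, n_res))         # fixed, untrained recurrent weights
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))    # keep the spectral radius below 1

def run_reservoir(inputs):
    """Drive the untrained reservoir; each new state integrates the previous
    state with the latest input, so earlier input 'lives on' in the state."""
    h = np.zeros(n_res)
    states = []
    for x in inputs:
        h = np.tanh(W @ h + W_in @ x)
        states.append(h.copy())
    return np.array(states)

def fit_readout(states, targets, ridge=1e-6):
    """Calibrate a linear read-out offline by ridge regression."""
    S = states
    return np.linalg.solve(S.T @ S + ridge * np.eye(S.shape[1]), S.T @ targets)

# Example read-out task: reconstruct the input presented 5 steps earlier.
T, delay = 300, 5
X = rng.standard_normal((T, n_in))                 # an arbitrary input stream
H = run_reservoir(X)
W_out = fit_readout(H[delay:], X[:-delay])
recalled = H[delay:] @ W_out                       # linear read-out of past input
```

The same reservoir states can be reused for any other read-out target (prediction of the next input, a semantic representation, and so on); only the linear mapping changes, the recurrent network itself never does.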
To demonstrate the ability of reservoir models to memorize linguistic input over time, we exposed an echo-state network (Jaeger & Haas Reference Jaeger and Haas2004) to a word sequence consisting of the first 1,000 words (roughly the length of this commentary) of the Scholarpedia entry on echo-state networks. Ten networks were randomly generated with 1,000 units and static, recurrent, sparse connectivity (20% inhibition). The read-outs were adapted such that the network had to recall the input sequence 10 and 100 words back. The 358 different words in the corpus were represented orthogonally, and the word corresponding to the most active output unit was taken as the recalled word. For a 10-word delay, the correct word was recalled with an average accuracy of 96% (SD=0.6%). After 100 words, accuracy remained at 96%, suggesting that the network had memorized the entire input sequence. This indicates that there was sufficient information in the system's state-space trajectory to reliably recover previous perceptual input even after very long delays. Sparseness and inhibition, two pervasive features of the neocortex and hippocampus, were critical: Without inhibition, average recall after a 10-word delay dropped to 51%, whereas fully connected networks correctly recalled only 9%, which equals the frequency of the most common word in the model's input. In short, the more brain-like the network, the better its capacity to memorize past input.
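The simulation described above can be approximated with the same machinery. The sketch below (again NumPy, with all unreported details such as connection density, weight scaling, and read-out regularisation filled in by assumption) generates a sparse reservoir with a proportion of inhibitory connections, presents a 1,000-word one-hot-coded sequence, and fits linear read-outs for 10- and 100-word delayed recall, scoring recall by the most active output unit. As in a memory-capacity demonstration, the read-out is fitted and evaluated on the same sequence.

```python
import numpy as np

rng = np.random.default_rng(1)

n_words, n_res, n_steps = 358, 1000, 1000
words = rng.integers(0, n_words, size=n_steps)     # stand-in for the 1,000-word text
X = np.eye(n_words)[words]                         # orthogonal (one-hot) word coding

# Sparse recurrent weights; roughly 20% of the connections are made inhibitory
# (negative). The 5% connection density is an assumption, not a reported value.
conn = rng.random((n_res, n_res)) < 0.05
sign = np.where(rng.random((n_res, n_res)) < 0.2, -1.0, 1.0)
W = conn * sign * rng.random((n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))    # keep the dynamics stable
W_in = rng.uniform(-0.5, 0.5, (n_res, n_words))

h, states = np.zeros(n_res), []
for x in X:                                        # run the fixed reservoir once
    h = np.tanh(W @ h + W_in @ x)
    states.append(h.copy())
H = np.array(states)

def recall_accuracy(delay, ridge=1e-6):
    """Fit a read-out that reproduces the word presented `delay` steps back and
    take the most active output unit as the recalled word."""
    S, Y = H[delay:], X[:-delay]
    W_out = np.linalg.solve(S.T @ S + ridge * np.eye(n_res), S.T @ Y)
    return ((S @ W_out).argmax(axis=1) == words[:-delay]).mean()

print(recall_accuracy(10), recall_accuracy(100))
```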
The modelling results should not be mistaken for a claim that people are able to perfectly remember words after 100 items of intervening input. To steer the language system towards an interpretation, earlier input need not be available for explicit recall and verbalization. Thus, it is also irrelevant to our echo-state network simulation whether or not such specialized read-outs exist in the human language system. The simulation merely serves to illustrate the concept of state-dependent processing where past perceptual input is implicitly represented in the current state of the network. A more realistic demonstration would take phonetic, or perhaps even auditory, features as input, rather than presegmented words. Because the dynamics of cortical networks are vastly more diverse than in our model, there is no principled reason such networks should not be able to cope with richer information sources. Downstream networks can then access this information when interpreting incoming utterances, without explicitly recalling previous words. Prior input encoded in the current state can be used for any context-sensitive operation the language system might be carrying out – for example, to predict the next phoneme or word in the unfolding utterance, to assign a thematic role to the current word, or to semantically integrate the current word with a partial interpretation that has already been constructed.
Because language is structured at different levels of granularity (ranging from phonetic features to discourse relations), the language system requires neuronal and synaptic mechanisms that operate at different timescales (from milliseconds to minutes) in order to retain relevant information in the system's state. Precisely how these memory mechanisms are implemented in biological networks of spiking neurons is currently not well understood; proposals include a role for diverse, fast-changing neuronal dynamics (Gerstner et al. Reference Gerstner, Kistler, Naud and Paninski2014) coupled with short-term synaptic plasticity (Mongillo et al. Reference Mongillo, Barak and Tsodyks2008) and more long-term adaptation through spike-timing-dependent plasticity (Bi & Poo Reference Bi and Poo2001). The nature of processing memory will be crucial in any neurobiologically viable theory of language processing (Petersson & Hagoort Reference Petersson and Hagoort2012), and we should therefore not lock ourselves into architectural commitments based on stipulated bottlenecks.
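As one concrete, deliberately simplified illustration of a seconds-scale memory mechanism of the kind proposed by Mongillo et al., the sketch below simulates short-term facilitation and depression at a single synapse: each presynaptic spike increases a facilitation variable u and consumes synaptic resources x, and because u decays back only slowly (here with an assumed time constant of about 1.5 s), a trace of recent activity persists in the synapse without any ongoing spiking. All parameter values are illustrative assumptions, not fits to data.

```python
# Toy short-term plasticity dynamics (facilitation u, resources x) in the spirit
# of Mongillo et al.; all parameter values below are illustrative assumptions.
U, tau_f, tau_d, dt = 0.2, 1.5, 0.2, 0.001         # baseline release prob., time constants (s), step (s)
spike_times = [0.10, 0.15, 0.20, 0.25, 0.30]       # a brief presynaptic burst

u, x = U, 1.0
trace = []
for step in range(int(2.0 / dt)):                  # simulate 2 seconds
    t = step * dt
    u += dt * (U - u) / tau_f                      # facilitation decays back to baseline
    x += dt * (1.0 - x) / tau_d                    # resources recover towards 1
    if any(abs(t - ts) < dt / 2 for ts in spike_times):
        u += U * (1.0 - u)                         # spike: calcium-driven facilitation
        x -= u * x                                 # spike: resources are consumed
    trace.append((t, u, x, u * x))                 # u*x tracks momentary synaptic efficacy

# One second after the burst, u is still elevated above its baseline U,
# carrying a silent trace of the recent input.
print(trace[int(1.3 / dt)][1], ">", U)
```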
ACKNOWLEDGMENTS
We would like to thank Karl-Magnus Petersson for helpful discussions on these issues. SLF is funded by the European Union Seventh Framework Programme under grant no. 334028.