Christiansen & Chater (C&C) describe how the brain's limited capacity to retain language input (the Now-or-Never bottleneck) constrains and shapes human language processing and acquisition.
Interestingly, there is a very close correspondence between the characteristics of processing and learning under the Now-or-Never bottleneck and recent computational models used in the field of natural language processing (NLP), especially in syntactic parsing. C&C offer some comparison with classic cognitively inspired models of parsing, noting that those models conflict with the constraints of the Now-or-Never bottleneck. However, a close look at the recent NLP and computational linguistics literature (rather than the cognitive science literature) shows a clear trend toward systems and models that fit remarkably well with C&C's framework.
It is worth noting that most NLP research is driven by purely pragmatic, engineering-oriented requirements: The primary goal is not to find models that provide plausible explanations of the properties of language and its processing by humans, but rather to design systems that can parse text and utterances as accurately and efficiently as possible for practical applications like opinion mining, machine translation, or information extraction, among others.
In recent years, the need to develop faster parsers that can work on web-scale data has led to much research interest in incremental, data-driven parsers, mainly under the so-called transition-based (or shift-reduce) framework (Nivre 2008). This family of parsers has been implemented in systems such as MaltParser (Nivre et al. 2007), ZPar (Zhang & Clark 2011), ClearParser (Choi & McCallum 2013), and Stanford CoreNLP (Chen & Manning 2014), and it is increasingly popular because these parsers are easy to train from annotated data and provide a very good trade-off between speed and accuracy.
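To make the transition-based framework concrete, the following minimal Python sketch steps through an arc-eager style derivation for a toy sentence. It is an illustration of the general idea only, not the code of any of the systems cited above: the hand-written oracle that reproduces a known gold tree is a stand-in for the trained classifier that real parsers use to choose each action, and the toy sentence and feature-free setup are assumptions made for brevity.

```python
# Minimal arc-eager transition-based parsing sketch (illustrative only; this is
# not the implementation of MaltParser, ZPar, ClearParser, or Stanford CoreNLP).
# Words are read strictly left to right, and dependency arcs are built as soon
# as possible, so a word stays on the stack only while its head or dependents
# are still pending.

def arc_eager_parse(words, oracle):
    """Parse using transitions chosen by oracle(stack, buffer, arcs)."""
    stack = [0]                               # position 0 is an artificial ROOT
    buffer = list(range(1, len(words) + 1))   # word positions, left to right
    arcs = set()                              # (head, dependent) pairs

    while buffer:
        action = oracle(stack, buffer, arcs)
        if action == "SHIFT":                 # move next input word onto stack
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":            # stack top depends on next word
            arcs.add((buffer[0], stack.pop()))
        elif action == "RIGHT-ARC":           # next word depends on stack top
            arcs.add((stack[-1], buffer[0]))
            stack.append(buffer.pop(0))
        elif action == "REDUCE":              # stack top is finished; forget it
            stack.pop()
    return arcs

def static_oracle(gold_arcs):
    """Rule-based oracle that reproduces a known gold tree; in real systems
    this decision is made by a classifier trained on treebank data."""
    def oracle(stack, buffer, arcs):
        s, b = stack[-1], buffer[0]
        if (b, s) in gold_arcs:
            return "LEFT-ARC"
        if (s, b) in gold_arcs:
            return "RIGHT-ARC"
        attached = any(d == s for (_, d) in arcs)
        pending = any((s, w) in gold_arcs or (w, s) in gold_arcs for w in buffer)
        if attached and not pending:
            return "REDUCE"
        return "SHIFT"
    return oracle

# Toy example: "She reads short stories" with (head, dependent) gold arcs.
words = ["She", "reads", "short", "stories"]
gold = {(0, 2), (2, 1), (2, 4), (4, 3)}
print(arc_eager_parse(words, static_oracle(gold)))
# -> {(0, 2), (2, 1), (2, 4), (4, 3)}
```

The same loop runs unchanged when a statistical classifier replaces the oracle, which is essentially what the transition-based parsers cited above do.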
Strikingly, these parsing models present practically all of the characteristics of processing and acquisition that C&C describe as originating from the Now-or-Never bottleneck in human processing:
- Incremental processing (sect. 3.1): A defining feature of transition-based parsers is that they build syntactic analyses incrementally as they receive the input, from left to right. These systems can build analyses even under severe working memory constraints: Although the issue of “stacking up” with right-branching languages mentioned by C&C exists for so-called arc-standard parsers (Nivre 2004), parsers based on the arc-eager model (e.g., Gómez-Rodríguez & Nivre 2013; Nivre 2003), illustrated in the sketch above, do not accumulate right-branching structures in their stack, as they build dependency links as soon as possible. In these parsers, we only need to keep a word on the stack while we wait for its head or its direct dependents, so the time that linguistic units need to be retained in memory is kept to the bare minimum.
- Multiple levels of linguistic structure (sect. 3.2): As C&C mention, the organization of linguistic representation in multiple levels is “typically assumed in the language sciences”; this includes computational linguistics and transition-based parsing models. Traditionally, each of these levels was processed sequentially in a pipeline, contrasting with the parallelism of the Chunk-and-Pass framework. However, the appearance of general incremental processing frameworks spanning various levels, from segmentation to parsing (Zhang & Clark 2011), has led to recent research on joint processing, where the processing of several levels takes place simultaneously and in parallel, passing information between levels (Bohnet & Nivre 2012; Hatori et al. 2012). These models, which improve accuracy over pipeline models, are very close to the Chunk-and-Pass framework.
- Predictive language processing (sect. 3.3): The joint processing models just mentioned are hypothesized to provide accuracy improvements precisely because they allow for a degree of predictive processing. In contrast to pipeline approaches, where information flows only bottom-up, these systems allow top-down information from higher levels “to constrain the processing of the input at lower levels,” just as C&C describe.
- Acquisition as learning to process (sect. 4): Transition-based parsers learn a sequence of processing actions (transitions), rather than a grammar (Gómez-Rodríguez et al. 2014; Nivre 2008), making the learning process simple and flexible.
- Local learning (sect. 4.2): This is also a general characteristic of all transition-based parsers. Because they do not learn grammar rules but processing actions to take in specific situations, adding a new example to the training data will produce only local changes to the learned model. At the implementation level, this typically corresponds to small weight changes in the underlying machine learning model, be it a support vector machine (SVM) classifier (Nivre et al. 2007), a perceptron (Zhang & Clark 2011), or a neural network (Chen & Manning 2014), among other possibilities (see the schematic sketch after this list).
- Online learning and learning to predict (sects. 4.1 and 4.3): Evaluation of NLP systems usually takes place on standard, fixed corpora, so the recent NLP literature has not placed much emphasis on online learning. However, some systems and frameworks do use online, error-driven learning models, like the perceptron (Zhang & Clark 2011); the sketch below illustrates an update of this kind. The recent surge of interest in parsing with neural networks (e.g., Chen & Manning 2014; Dyer et al. 2015) also seems to point future research in this direction.
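To complement the parsing sketch above, the following schematic Python sketch shows online, error-driven learning of parsing actions with a simple multiclass perceptron, in the spirit of (though not copied from) the perceptron-based systems cited above; the feature names used are hypothetical. A wrong prediction triggers a small update that touches only the features active in the current configuration, which is what makes the learning local.

```python
# Schematic online, error-driven perceptron learning of parsing actions
# (an illustration of the idea, not the training code of any cited parser).

from collections import defaultdict

ACTIONS = ["SHIFT", "LEFT-ARC", "RIGHT-ARC", "REDUCE"]
weights = {a: defaultdict(float) for a in ACTIONS}   # one weight vector per action

def score(action, features):
    return sum(weights[action][f] for f in features)

def predict(features):
    return max(ACTIONS, key=lambda a: score(a, features))

def update(features, gold_action):
    """Error-driven update: fires only when the prediction is wrong, and only
    touches the features observed in this single parser configuration."""
    predicted = predict(features)
    if predicted != gold_action:
        for f in features:
            weights[gold_action][f] += 1.0   # reinforce the correct action
            weights[predicted][f] -= 1.0     # penalise the mistaken one

# Hypothetical features extracted from one configuration during training:
update(["stack_top=reads", "buffer_front=stories", "stack_top_pos=VERB"],
       gold_action="RIGHT-ARC")
```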
Putting it all together, we can see that researchers whose motivating goal was not psycholinguistic modeling, but only raw computational efficiency, have nevertheless arrived at models that conform to the description in the target article. This fact provides further support for the views C&C express.
A natural question arises about the extent to which this coincidence is attributable to similarities between the efficiency requirements of human and automated processing – or rather to the fact that because evolution shapes natural languages to be easy to process by humans (constrained by the Now-or-Never bottleneck), computational models that mirror human processing will naturally work well on them. Relevant differences between the brain and computers, such as in short-term memory capacity, seem to suggest the latter. Either way, there is clearly much to be gained from cross-fertilization between cognitive science and computational linguistics: For example, computational linguists can find inspiration in cognitive models for designing NLP tools that work efficiently with limited resources, and cognitive scientists can use computational tools as models to test their hypotheses. Bridging the gap between these areas of research is essential to further our understanding of language.
ACKNOWLEDGMENTS
This work was funded by the Spanish Ministry of Economy and Competitiveness/ERDF (grant FFI2014-51978-C2-2-R) and Xunta de Galicia (grant R2014/034). I thank Ramon Ferrer i Cancho for helpful comments on an early version of this commentary.