1. Introduction
Language is fleeting. As we hear a sentence unfold, we rapidly lose our memory for preceding material. Speakers, too, soon lose track of the details of what they have just said. Language processing is therefore “Now-or-Never”: If linguistic information is not processed rapidly, that information is lost for good. Importantly, though, while fundamentally shaping language, the Now-or-Never bottleneck (see Footnote 1) is not specific to language but instead arises from general principles of perceptuo-motor processing and memory.
The existence of a Now-or-Never bottleneck is relatively uncontroversial, although its precise character may be debated. However, in this article we argue that the consequences of this constraint for language are remarkably far-reaching, touching on the following issues:
1. The multilevel organization of language into sound-based units, lexical and phrasal units, and beyond;
2. The prevalence of local linguistic relations (e.g., in phonology and syntax);
3. The incrementality of language processing;
4. The use of prediction in language interpretation and production;
5. The nature of what is learned during language acquisition;
6. The degree to which language acquisition involves item-based generalization;
7. The degree to which language change proceeds item-by-item;
8. The connection between grammar and lexical knowledge;
9. The relationships between syntax, semantics, and pragmatics.
Thus, we argue that the Now-or-Never bottleneck has fundamental implications for key questions in the language sciences. The consequences of this constraint are, moreover, incompatible with many theoretical positions in linguistic, psycholinguistic, and language acquisition research.
Note, however, that arguing that a phenomenon arises from the Now-or-Never bottleneck does not necessarily undermine alternative explanations of that phenomenon (although it may). Many phenomena in language may simply be overdetermined. For example, we argue that incrementality (point 3, above) follows from the Now-or-Never bottleneck. But it is also possible that, irrespective of memory constraints, language understanding would still be incremental on functional grounds, to extract the linguistic message as rapidly as possible. Such counterfactuals are, of course, difficult to evaluate. By contrast, the properties of the Now-or-Never bottleneck arise from basic information processing limitations that are directly testable by experiment. Moreover, the Now-or-Never bottleneck should, we suggest, have methodological priority to the extent that it provides an integrated framework for explaining many aspects of language structure, acquisition, processing, and evolution that have previously been treated separately.
In Figure 1, we illustrate the overall structure of the argument in this article. We begin, in the next section, by briefly making the case for the Now-or-Never bottleneck as a general constraint on perception and action. We then discuss the implications of this constraint for language processing, arguing that both comprehension and production involve what we call “Chunk-and-Pass” processing: incrementally building chunks at all levels of linguistic structure as rapidly as possible, using all available information predictively to process current input before new information arrives (sect. 3). From this perspective, language acquisition involves learning to process: that is, learning rapidly to create and use chunks appropriately for the language being learned (sect. 4). Consequently, short-term language change and longer-term processes of language evolution arise through variation in the system of chunks and their composition, suggesting an item-based theory of language change (sect. 5). This approach points to a processing-based interpretation of construction grammar, in which constructions correspond to chunks, and where grammatical structure is fundamentally the history of language processing operations within the individual speaker/hearer (sect. 6). We conclude by briefly summarizing the main points of our argument.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20170928100645-46263-mediumThumb-S0140525X1500031X_fig1g.jpg?pub-status=live)
Figure 1. The structure of our argument, in which implicational relations between claims are denoted by arrows. The Now-or-Never bottleneck provides a fundamental constraint on perception and action that is independent of its application to the language system (and hence sits outside the diamond in the figure). Specific implications for language (indicated inside the diamond) stem from the way the Now-or-Never bottleneck necessitates Chunk-and-Pass language processing, with key consequences for language acquisition. The impact of the Now-or-Never bottleneck on both processing and acquisition together further shapes language change. All three of these interlinked claims concerning Chunk-and-Pass processing, acquisition as processing, and item-based language change (grouped together in the shaded upper triangle) combine to shape the structure of language itself.
2. The Now-or-Never bottleneck
Language input is highly transient. Speech sounds, like other auditory signals, are short-lived. Classic speech perception studies have shown that very little of the auditory trace remains after 100 ms (Elliott 1962), with more recent studies indicating that much acoustic information is already lost after just 50 ms (Remez et al. 2010). Similarly, and of relevance for the perception of sign language, studies of visual change detection suggest that the ability to maintain visual information beyond 60–70 ms is very limited (Pashler 1988). Thus, sensory memory for language input is quickly overwritten, or interfered with, by new incoming information, unless the perceiver in some way processes what is heard or seen.
The problem of the rapid loss of the speech or sign signal is further exacerbated by the sheer speed of the incoming linguistic input. At a normal speech rate, speakers produce about 10–15 phonemes per second, corresponding to roughly 5–6 syllables every second, or 150 words per minute (Studdert-Kennedy 1986). However, the resolution of the human auditory system for discrete auditory events is only about 10 sounds per second, beyond which the sounds fuse into a continuous buzz (Miller & Taylor 1948). Consequently, even at normal rates of speech, the language system needs to work beyond the limits of auditory temporal resolution for nonspeech stimuli. Remarkably, listeners can learn to process speech in their native language at up to twice the normal rate without much decrement in comprehension (Orr et al. 1965). Although the production of signs appears to be slower than the production of speech (at least when comparing the production of ASL signs and spoken English; Bellugi & Fischer 1972), signed words are still very brief visual events, with the duration of an ASL syllable being about a quarter of a second (Wilbur & Nolen 1986; see Footnote 2).
Making matters even worse, our memory for sequences of auditory input is also very limited. For example, it has been known for more than four decades that naïve listeners are unable to correctly recall the temporal order of just four distinct sounds – for example, hisses, buzzes, and tones – even when they are perfectly able to recognize and label each individual sound in isolation (Warren et al. 1969). Our ability to recall well-known auditory stimuli is not substantially better, with capacity estimates ranging from 7 ± 2 items (Miller 1956) to 4 ± 1 (Cowan 2000). A similar limitation applies to visual memory for sign language (Wilson & Emmorey 2006). This poor memory for auditory and visual information, combined with the fast and fleeting nature of linguistic input, imposes a fundamental constraint on the language system: the Now-or-Never bottleneck. If the input is not processed immediately, new information will quickly overwrite it.
Importantly, the Now-or-Never bottleneck is not unique to language but applies to other aspects of perception and action as well. Sensory memory is rich in detail but decays rapidly unless it is further processed (e.g., Cherry 1953; Coltheart 1980; Sperling 1960). Likewise, short-term memory for auditory, visual, and haptic information is also limited and subject to interference from new input (e.g., Gallace et al. 2006; Haber 1983; Pavani & Turatto 2008). Moreover, our cognitive ability to respond to sensory input is further constrained in a serial (Sigman & Dehaene 2005) or near-serial (Navon & Miller 2002) manner, severely restricting our capacity for processing multiple inputs arriving in quick succession. Similar limitations apply to the production of behavior: The cognitive system cannot plan detailed sequences of movements – a long sequence of commands planned far in advance would lead to severe interference and be forgotten before it could be carried out (Cooper & Shallice 2006; Miller et al. 1960). However, the cognitive system adopts several processing strategies to ameliorate the effects of the Now-or-Never bottleneck on perception and action.
First, the cognitive system engages in eager processing: It must recode the rich perceptual input as it arrives to capture the key elements of the sensory information as economically, and as distinctively, as possible (e.g., Brown et al. 2007; Crowder & Neath 1991); and it must do so rapidly, before new input overwrites or interferes with the sensory information. This notion is a traditional one, dating back to early work on attention and sensory memory (e.g., Broadbent 1958; Coltheart 1980; Haber 1983; Sperling 1960; Treisman 1964). The resulting compressed representations are lossy: They provide only an abstract summary of the input, from which the rich sensory input cannot be recovered (e.g., Pani 2000). Evidence from the phenomena of change and inattentional blindness suggests that these compressed representations can be very selective (see Jensen et al. 2011 for a review), as exemplified by a study in which half of the participants failed to notice that someone to whom they were giving directions, face-to-face, was surreptitiously exchanged for a completely different person (Simons & Levin 1998). Information not encoded in the short time during which the sensory information is available will be lost.
Second, because memory limitations also apply to recoded representations, the cognitive system further chunks the compressed encodings into multiple levels of representation of increasing abstraction in perception, and decreasing levels of abstraction in action. Consider, for example, memory for serially ordered symbolic information, such as sequences of digits. Typically, people are quickly overloaded and can recall accurately only the last three or four items in a sequence (e.g., Murdock 1968). But it is possible to learn to rapidly encode, and recall, long random sequences of digits by successively chunking such sequences into larger units, chunking those chunks into still larger units, and so on. Indeed, an extended study of a single individual, SF (Ericsson et al. 1980), showed that repeated chunking in this manner makes it possible to recall with high accuracy sequences containing as many as 79 digits. But, crucially, this strategy requires learning to encode the input into multiple, successive, and distinct levels of representation – each sequence of chunks at one level must be shifted as a single chunk to a higher level before more chunks interfere with or overwrite the initial chunks. Indeed, SF chunked sequences of three or four digits, the natural chunk size in human memory (Cowan 2000), into a single unit (corresponding to running times, dates, or human ages), and then grouped sequences of three to four of those chunks into larger chunks. Interestingly, SF also verbally produced items in overtly discernible chunks, interleaved with pauses, indicating how action also follows the reverse process (e.g., Lashley 1951; Miller 1956). The case of SF further demonstrates that low-level information is far better recalled when organized into higher-level structures than merely coded as an unorganized stream.
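The recursive grouping that SF employed can be sketched as a toy computation. This is purely illustrative, not a cognitive model; the specific digit sequence and the chunk size of four are our own assumptions for the example:

```python
def chunk(items, size=4):
    """Group a flat sequence into chunks of at most `size` items,
    mirroring the roughly 3-4 item span of short-term memory."""
    return [tuple(items[i:i + size]) for i in range(0, len(items), size)]

digits = [int(d) for d in "79418254907123581321"]  # 20 arbitrary digits

# Level 1: digits -> small chunks (for SF: running times, dates, ages)
level1 = chunk(digits)    # 5 chunks of 4 digits each
# Level 2: chunks of chunks -- only these few units must be held "now"
level2 = chunk(level1)    # 2 higher-level chunks

# Recall unpacks the hierarchy top-down, recovering the full sequence
recalled = [d for super_chunk in level2 for ch in super_chunk for d in ch]
assert recalled == digits
```

The point of the sketch is that no level ever holds more than a handful of units at once, yet the full 20-digit sequence survives, because each level recodes the one below before interference sets in.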
Note, though, that lower-level information is typically forgotten; it seems unlikely that even SF could recall the specific visual details of the digits with which he was presented. More generally, the notion that perception and action involve representational recoding at a succession of distinct representational levels also fits with a long tradition of theoretical and computational models in cognitive science and computer vision (e.g., Bregman 1990; Marr 1982; Miller et al. 1960; Zhu et al. 2010; see Gobet et al. 2001 for a review). Our perspective on repeated multilevel compression is also consistent with data from functional magnetic resonance imaging (fMRI) and intracranial recordings, suggesting cortical hierarchies across vision and audition – from low-level sensory to high-level perceptual and cognitive areas – that integrate information at progressively longer temporal windows (Hasson et al. 2008; Honey et al. 2012; Lerner et al. 2011).
Third, to facilitate speedy chunking and hierarchical compression, the cognitive system employs anticipation, using prior information to constrain the recoding of current perceptual input (for reviews, see Bar 2007; Clark 2013). For example, people see the exact same collection of pixels either as a hair dryer (when viewed as part of a bathroom scene) or as a drill (when embedded in a picture of a workbench) (Bar 2004). Using prior information to predict future input is therefore likely to be essential to successfully encoding that input (as well as helping us react faster to it). Anticipation allows faster, and hence more effective, recoding when oncoming information creates considerable time urgency. Such predictive processing will be most effective to the extent that the greatest possible amount of available information (across different types and levels of abstraction) is integrated as fast as possible. Anticipation is equally important for action. For example, manipulating an object requires anticipating the grip force needed to deal with the loads generated by the accelerations of the object; grip force is adjusted too rapidly during manipulation to rely on sensory feedback (Flanagan & Wing 1997). Indeed, the rapid prediction of the sensory consequences of actions (e.g., Poulet & Hedwig 2006) suggests the existence of so-called forward models, which allow the brain to predict the consequences of its actions in real time. Many have argued (e.g., Wolpert et al. 2011; see also Clark 2013; Pickering & Garrod 2013a) that forward models are a ubiquitous feature of the computational machinery of motor control and, more broadly, of cognition.
The three processing strategies we mention here – eager processing, computing multiple representational levels, and anticipation – provide the cognitive system with important means to cope with the Now-or-Never bottleneck. Next, we argue that the language system implements similar strategies for dealing with the here-and-now nature of linguistic input and output, with wide-reaching and fundamental implications for language processing, acquisition and change as well as for the structure of language itself. Specifically, we propose that our ability to deal with sequences of linguistic information is the result of what we call “Chunk-and-Pass” processing, by which the language system can ameliorate the effects of the Now-or-Never bottleneck. More generally, our perspective offers a framework within which to approach language comprehension and production. Table 1 summarizes the impact of the Now-or-Never bottleneck on perception/action and language.
Table 1. Summary of the Now-or-Never bottleneck's implications for perception/action and language
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20170928100645-85008-mediumThumb-S0140525X1500031X_tab1.jpg?pub-status=live)
The style of explanation outlined here, focusing on processing limitations, contrasts with a widespread interest in rational, rather than processing-based, explanations in cognitive science (e.g., Anderson 1990; Chater et al. 2006; Griffiths & Tenenbaum 2009; Oaksford & Chater 1998; 2007; Tenenbaum et al. 2011), including language processing (Gibson et al. 2013; Hale 2001; 2006; Piantadosi et al. 2011). Given the fundamental nature of the Now-or-Never bottleneck, we suggest that such explanations will be relevant for explaining language use only insofar as they incorporate processing constraints. For example, in the spirit of rational analysis (Anderson 1990) and bounded rationality (Simon 1982), it is natural to view aspects of language processing and structure, as described below, as “optimal” responses to specific processing limitations, such as the Now-or-Never bottleneck (for this style of approach, see, e.g., Chater et al. 1998; Levy 2008). Here, though, our focus is primarily on mechanism rather than rationality.
3. Chunk-and-Pass language processing
The fleeting nature of linguistic input, in combination with the impressive speed with which words and signs are produced, imposes a severe constraint on the language system: the Now-or-Never bottleneck. Each new incoming word or sign will quickly interfere with previously heard and seen input, providing a naturalistic version of the masking used in psychophysical experiments. How, then, is language comprehension possible? Why doesn't interference between successive sounds (or signs) obliterate linguistic input before it can be understood? The answer, we suggest, is that our language system rapidly recodes this input into chunks, which are immediately passed to a higher level of linguistic representation. The chunks at this higher level are then themselves subject to the same Chunk-and-Pass procedure, resulting in progressively larger chunks of increasing linguistic abstraction. Crucially, given that the chunks recode increasingly larger stretches of input from lower levels of representation, the chunking process enables input to be maintained over ever-larger temporal windows. It is this repeated chunking of lower-level information that makes it possible for the language system to deal with the continuous deluge of input that, if not recoded, is rapidly lost. This chunking process is also what allows us to perceive speech at a much faster rate than nonspeech sounds (Warren et al. 1969): We have learned to chunk the speech stream. Indeed, we can easily understand (and sometimes even repeat back) sentences consisting of many tens of phonemes, despite our severe memory limitations for sequences of nonspeech sounds.
What we are proposing is that during comprehension, the language system – similar to SF – must keep on chunking the incoming information into increasingly abstract levels of representation to avoid being overwhelmed by the input. That is, the language system engages in eager processing when creating chunks. Chunks must be built right away, or memory for the input will be obliterated by interference from subsequent material. If a phoneme or syllable is recognized, then it is recoded as a chunk and passed to a higher level of linguistic abstraction. And once recoded, the information is no longer subject to interference from further auditory input. A general principle of perception and memory is that interference arises primarily between overlapping representations (Crowder & Neath 1991; Treisman & Schmidt 1982); crucially, recoding avoids such overlap. For example, phonemes interfere with each other, but phonemes interfere very little with words. At each level of chunking, information from the previous level(s) is compressed and passed up as chunks to the next level of linguistic representation, from sound-based chunks up to complex discourse elements (see Footnote 3). As a consequence, the rich detail of the original input can no longer be recovered from the chunks, although some key information remains (e.g., certain speaker characteristics; Nygaard et al. 1994; Remez et al. 1997).
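The core mechanism can be caricatured as a toy pipeline that eagerly recodes "sounds" into words and words into phrases, clearing each lower-level buffer the moment a chunk is passed up. The miniature lexicon, the phrase inventory, and the bracket labels below are all invented for the illustration, not drawn from any real grammar:

```python
# Hypothetical toy inventories, purely for illustration.
LEXICON = {("th", "e"): "the", ("d", "o", "g"): "dog", ("r", "a", "n"): "ran"}
PHRASES = {("the", "dog"): "NP[the dog]", ("ran",): "VP[ran]"}

def chunk_and_pass(stream):
    sounds, words, phrases = [], [], []
    for sound in stream:
        sounds.append(sound)                    # transient sensory buffer
        if tuple(sounds) in LEXICON:            # chunk recognized: recode it,
            words.append(LEXICON[tuple(sounds)])
            sounds.clear()                      # ...and discard the low-level detail
        if tuple(words) in PHRASES:             # then pass upward again
            phrases.append(PHRASES[tuple(words)])
            words.clear()
    return phrases

print(chunk_and_pass(["th", "e", "d", "o", "g", "r", "a", "n"]))
# -> ['NP[the dog]', 'VP[ran]']
```

Note that at no point does any buffer hold more than a few units, and once a sound sequence is recoded as a word its segmental detail is gone – the "lossy compression" described above.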
In production, the process is reversed: Discourse-level chunks are recursively broken down into subchunks of decreasing linguistic abstraction until the system arrives at chunks with sufficient information to drive the articulators (either the vocal apparatus or the hands). As in comprehension, memory is limited within a given level of representation, resulting in potential interference between the items to be produced (e.g., Dell et al. 1997). Thus, higher-level chunks tend to be passed down immediately to the level below as soon as they are “ready,” leading to a bias toward producing easy-to-retrieve utterance components before harder-to-retrieve ones (e.g., Bock 1982; MacDonald 2013). For example, if there is a competition between two possible words to describe an object, the word that is retrieved more fluently will immediately be passed on to lower-level articulatory processes. To further facilitate production, speakers often reuse chunks from the ongoing conversation, which will be particularly rapidly available from memory. This phenomenon is reflected in the evidence for lexical (e.g., Meyer & Schvaneveldt 1971) and structural priming (e.g., Bock 1986; Bock & Loebell 1990; Pickering & Branigan 1998; Potter & Lombardi 1998) within individuals, as well as alignment across conversational partners (Branigan et al. 2000; Pickering & Garrod 2004); priming is also extensively observed in text corpora (Hoey 2005). As noted by MacDonald (2013), these memory-related factors provide key constraints on the production of language and contribute to cross-linguistic patterns of language use (see Footnote 4).
A useful analogy for language production is the notion of “just-in-time” stock control (see Footnote 5), in which stock inventories are kept to a bare minimum during the manufacturing process (Ohno & Mito 1988). Similarly, the Now-or-Never bottleneck requires that, for example, low-level phonetic or articulatory decisions not be made and stored far in advance and then reeled off during speech production, because any buffer in which such decisions could safely be stored would quickly be subject to interference from subsequent material. So the Now-or-Never bottleneck requires that once detailed production information has been assembled, it be executed straightaway, before it can be obliterated by the oncoming stream of later low-level decisions, similar to what has been suggested for motor planning (Norman & Shallice 1986; see also MacDonald 2013). We call this proposal Just-in-Time language production.
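Just-in-Time production can be sketched as lazy top-down expansion: a high-level chunk is broken into subchunks, and each articulatory-level unit is emitted the moment it is ready, so nothing is buffered far in advance. The expansion table and chunk labels below are hypothetical, chosen only to make the sketch concrete:

```python
def produce(message, expansions):
    """Lazily expand a high-level chunk top-down, emitting each
    lowest-level unit as soon as it is ready (buffers stay minimal)."""
    if message not in expansions:
        yield message                      # articulatory-level unit: say it now
        return
    for part in expansions[message]:
        yield from produce(part, expansions)

# Hypothetical expansion table for the illustration.
EXPANSIONS = {
    "MSG[dog ran]": ["NP[the dog]", "VP[ran]"],
    "NP[the dog]": ["the", "dog"],
    "VP[ran]": ["ran"],
}

assert list(produce("MSG[dog ran]", EXPANSIONS)) == ["the", "dog", "ran"]
```

Because `produce` is a generator, the later parts of the message are never expanded until the earlier parts have been emitted – a minimal analogue of keeping the low-level "inventory" near zero.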
3.1. Implications of Strategy 1: Incremental processing
Chunk-and-Pass processing has important implications for comprehension and production: It requires that both take place incrementally. In incremental processing, representations are built up as rapidly as possible as the input is encountered. By contrast, one might imagine a parser that waits until the end of a sentence before beginning syntactic analysis, or a system in which meaning is computed only once syntax has been established. Such processing, however, would require storing a stream of information at a single level of representation and processing it later – which the Now-or-Never bottleneck rules out, because of the severe interference that would arise between such representations. Incremental interpretation and production therefore follow directly from the Now-or-Never constraint on language.
To get a sense of the implications of Chunk-and-Pass processing, it is useful to relate this perspective to specific computational principles and models. How, for example, do classic models of parsing fit within this framework? A wide range of psychologically inspired models involve some degree of incrementality of syntactic analysis, which can potentially support incremental interpretation (e.g., Phillips 1996; 2003; Winograd 1972). For example, the “sausage machine” parsing model (Frazier & Fodor 1978) proposes that a preliminary syntactic analysis is carried out phrase-by-phrase, but in complete isolation from semantic or pragmatic factors. But for a right-branching language such as English, chunks cannot simply be built left-to-right, because the leftmost chunks are incomplete until later material has been encountered. Frameworks from Kimball (1973) onward imply “stacking up” incomplete constituents that may then all be resolved at the end of the clause – an approach that runs counter to the memory constraints imposed by the Now-or-Never bottleneck. Reconciling right-branching structure with incremental chunking and processing is one motivation for the flexible constituency of combinatory categorial grammar (e.g., Steedman 1987; 2000; see also Johnson-Laird 1983).
With respect to comprehension, considerable evidence going back more than four decades supports incremental interpretation (e.g., Bever 1970; Marslen-Wilson 1975). The language system uses all available information to integrate incoming input as quickly as possible, updating the current interpretation of what has been said so far. This information includes not only sentence-internal cues about lexical and structural biases (e.g., Farmer et al. 2006; MacDonald 1994; Trueswell et al. 1993), but also extra-sentential cues from the referential and pragmatic context (e.g., Altmann & Steedman 1988; Thornton et al. 1999), as well as the visual environment and world knowledge (e.g., Altmann & Kamide 1999; Tanenhaus et al. 1995). As the incoming acoustic information is chunked, it is rapidly integrated with contextual information to recognize words, consistent with a variety of data on spoken word recognition (e.g., Marslen-Wilson 1975; van den Brink et al. 2001). These words are then, in turn, chunked into larger multiword units, as evidenced by recent studies showing sensitivity to multiword sequences in online processing (e.g., Arnon & Snider 2010; Reali & Christiansen 2007b; Siyanova-Chanturia et al. 2011; Tremblay & Baayen 2010; Tremblay et al. 2011), and subsequently further integrated with pragmatic context into discourse-level structures.
Turning to production, we start by noting the powerful intuition that we speak “into the void” – that is, that we plan only a short distance ahead. Indeed, experimental studies suggest that, when producing an utterance involving several noun phrases, people plan just one (Smith & Wheeldon 1999), or perhaps two, noun phrases ahead (Konopka 2012), and that they can modify a message during production in light of new perceptual input (Brown-Schmidt & Konopka 2015). Moreover, speech-error data (e.g., Cutler 1982) reveal that, across representational levels, errors tend to be highly local: Phonological, morphemic, and syntactic errors apply to neighboring chunks within each level (where material may be moved, swapped, or deleted). Consequently, speech planning appears to involve just a small number of chunks – a number that may be similar across linguistic levels – though these chunks cover different amounts of time depending on the level in question. For example, planning involving chunks at the level of intonational bursts stretches over considerably longer periods of time than planning at the syllabic level. Similarly, processes of reduction that facilitate production (e.g., modifying the speech signal to make it easier to produce, such as reducing a vowel to a schwa, or shortening or eliminating phonemes) can be observed across different levels of linguistic representation, from individual words (e.g., Gahl & Garnsey 2004; Jurafsky et al. 2001) to frequent multiword sequences (e.g., Arnon & Cohen Priva 2013; Bybee & Scheibman 1999).
Some may object that the Chunk-and-Pass perspective's strict notion of incremental interpretation and production leaves the language system vulnerable to the substantial ambiguity that exists across many levels of linguistic representation (e.g., lexical, syntactic, pragmatic). So-called garden path sentences, such as the famous “The horse raced past the barn fell” (Bever 1970), show that people are indeed vulnerable to at least some local ambiguities: They invite comprehenders to take the wrong interpretive path by treating raced as the main verb, which leads them to a dead end. Only when the final word, fell, is encountered does it become clear that something is wrong: raced should be interpreted as a past participle that begins a reduced relative clause (i.e., the horse [that was] raced past the barn fell). The difficulty of recovery in such garden path sentences indicates how strongly the language system is geared toward incremental interpretation.
Viewed as a processing problem, garden paths occur when the language system resolves an ambiguity incorrectly. But in many cases, it is possible for an underspecified representation to be constructed online, and for the ambiguity to be resolved later when further linguistic input arrives. This type of case is consistent with Marr's (1976) proposal of the "principle of least commitment": that the perceptual system resolves ambiguous perceptual input only when it has sufficient data to make it unlikely that such decisions will subsequently have to be reversed. Given the ubiquity of local ambiguity in language, such underspecification may be used very widely in language processing. Note, however, that because of the severe constraints the Now-or-Never bottleneck imposes, the language system cannot adopt broad parallelism to further minimize the effect of ambiguity (as in many current probabilistic theories of parsing, e.g., Hale 2006; Jurafsky 1996; Levy 2008). Rather, within the Chunk-and-Pass account, the sole role for parallelism in the processing system is in deciding how the input should be chunked; only when conflicts concerning chunking are resolved can the input be passed on to a higher-level representation. In particular, we suggest that competing higher-level codes cannot be activated in parallel. This picture is analogous to Marr's principle of least commitment in vision: Although there might be temporary parallelism to resolve conflicts about, say, correspondence between dots in a random-dot stereogram, it is not possible to create two conflicting three-dimensional surfaces in parallel; and whereas there may be parallelism over the interpretation of lines and dots in an image, it is not possible to see something as both a duck and a rabbit simultaneously.
More broadly, higher-level representations are constructed only when sufficient evidence has accrued that they are unlikely later to need to be replaced (for stimuli outside the psychological laboratory, at least).
Maintaining, and later resolving, an underspecified representation will create local memory and processing demands that may slow down processing, as indicated, for example, by increased reading times (e.g., Trueswell et al. 1994) and distinctive patterns of brain activity (as measured by ERPs; Swaab et al. 2003). Accordingly, when the input is ambiguous, the language system may require later input to recognize previous elements of the speech stream successfully. The Now-or-Never bottleneck requires that such online "right-context effects" be highly local because raw perceptual input will be lost if it is not rapidly identified (e.g., Dahan 2010). Right-context effects may arise where the language system can delay resolution of ambiguity or use underspecified representations that do not require resolving the ambiguity right away. Similarly, cataphora, in which, for example, a referential pronoun occurs before its referent (e.g., "He is a nice guy, that John"), requires the creation of an underspecified entity (male, animate) when he is encountered, which is resolved to be coreferential with John only later in the sentence (e.g., van Gompel & Liversedge 2003). Overall, the Now-or-Never bottleneck implies that the processing system will build the most abstract and complete representation that is justified, given the linguistic input.Footnote 6
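The idea of an underspecified representation resolved by later right-context can be sketched schematically. The `Referent` class, its feature names, and the two-step usage below are purely illustrative assumptions on our part, not a processing model from the literature:

```python
# Illustrative sketch (hypothetical, not an implemented model): on
# encountering a cataphoric pronoun ("He is a nice guy, that John"),
# the system commits only to the features it can justify (male, animate)
# and binds the referent when later right-context arrives.

class Referent:
    def __init__(self, **features):
        self.features = features   # partial commitments, e.g., gender, animacy
        self.referent = None       # left unresolved for now

    def resolve(self, name):
        """Later input supplies the antecedent, resolving the entity."""
        self.referent = name

he = Referent(gender="male", animate=True)  # built at "He ..."
he.resolve("John")                          # bound at "... that John"
```

The point of the sketch is that no commitment made at the pronoun ever has to be undone; the later input only fills in what was deliberately left open.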
Of course, outside of experimental studies, background knowledge, visual context, and prior discourse will provide powerful cues to help resolve ambiguities in the signal, allowing the system rapidly to resolve many apparent ambiguities without incurring a substantial danger of "garden-pathing." Indeed, although syntactic and lexical ambiguities have been much studied in psycholinguistics, increasing evidence indicates that garden paths are not a major source of processing difficulty in practice (e.g., Ferreira 2008; Jaeger 2010; Wasow & Arnold 2003).Footnote 7 For example, Roland et al. (2006) reported corpus analyses showing that, in naturally occurring language, there is generally sufficient information in the sentential context before the occurrence of an ambiguous verb to specify the correct interpretation of that verb. Moreover, eye-tracking studies have demonstrated that dialogue partners exploit both conversational context and task demands to constrain interpretations to the appropriate referents, thereby side-stepping effects of phonological and referential competitors (Brown-Schmidt & Konopka 2011) that have otherwise been shown to impede language processing (e.g., Allopenna et al. 1998). These dialogue-based constraints also mitigate syntactic ambiguities that might otherwise disrupt processing (Brown-Schmidt & Tanenhaus 2008). This information may be further combined with other probabilistic sources of information such as prosody (e.g., Kraljic & Brennan 2005; Snedeker & Trueswell 2003) to resolve potential ambiguities within a minimal temporal window.
Finally, it is not clear that undetected garden path errors are costly in normal language use, because if communication appears to break down, the listener can repair the communication by requesting clarification from the dialogue partner.
3.2. Implications of Strategy 2: Multiple levels of linguistic structure
The Now-or-Never bottleneck forces the language system to compress input into increasingly abstract chunks that cover progressively longer temporal intervals. As an example, consider the chunking of the input illustrated in Figure 2. The acoustic signal is first chunked into higher-level sound units at the phonological level. To avoid interference between local sound-based units, such as phonemes or syllables, these units are further recoded as rapidly as possible into higher-level units such as morphemes or words. The same phenomenon occurs at the next level up: Local groups of words must be chunked into larger units, possibly phrases or other forms of multiword sequences. Subsequent chunking then recodes these representations into higher-level discourse structures (which may themselves be chunked further into even more abstract representational structures beyond that). Similarly, production requires running the process in reverse, starting with the intended message and gradually decomposing it into increasingly specific chunks, eventually resulting in the motor programs necessary for producing the relevant speech or sign output. As we discuss in section 3.3, the production process may further serve as the basis for prediction during comprehension (allowing higher-level information to influence the processing of current input). More generally, our account is agnostic with respect to the specific characterization of the various levels of linguistic representationFootnote 8 (e.g., whether sound-based chunks take the form of phonemes, syllables, etc.). What is central to the Chunk-and-Pass account is that there is some form of sound-based chunking (or visual-based chunking, in the case of sign language), together with a sequence of increasingly abstract levels of chunked representations into which the input is continually recoded.
Figure 2. Chunk-and-Pass processing across a variety of linguistic levels in spoken language. As input is chunked and passed up to increasingly abstract levels of linguistic representations in comprehension, from acoustics to discourse, the temporal window over which information can be maintained increases, as indicated by the shaded portion of the bars associated with each linguistic level. This process is reversed in production planning, in which chunks are broken down into sequences of increasingly short and concrete units, from a discourse-level message to the motor commands for producing a specific articulatory output. More-abstract representations correspond to longer chunks of linguistic material, with greater look-ahead in production at higher levels of abstraction. Production processes may further serve as the basis for predictions to facilitate comprehension and thus provide top-down information in comprehension. (Note that the names and number of levels are for illustrative purposes only.)
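The successive recoding described above can be illustrated with a minimal sketch. Everything here – the toy inventories, the greedy longest-match strategy, and the function name – is our own illustrative assumption, intended only to show how each level recodes its input into larger chunks and then discards the lower-level material:

```python
# A minimal, purely illustrative Chunk-and-Pass sketch: each level greedily
# recodes its input into larger, more abstract chunks and passes only the
# chunks upward. The inventories below are hypothetical toy examples.

SYLLABLES = {("g", "uh", "d"): "good", ("m", "or", "n"): "mor", ("ih", "ng"): "ning"}
WORDS = {("good",): "good", ("mor", "ning"): "morning"}
PHRASES = {("good", "morning"): "GREETING"}

def chunk(stream, inventory, max_len=3):
    """Greedily recode a sequence of lower-level units into higher-level
    chunks, preferring the longest match. The recoding is lossy: once a
    chunk is formed, the lower-level sequence it came from is discarded."""
    out, i = [], 0
    while i < len(stream):
        for n in range(min(max_len, len(stream) - i), 0, -1):
            key = tuple(stream[i:i + n])
            if key in inventory:
                out.append(inventory[key])
                i += n
                break
        else:
            i += 1  # unchunkable material is simply lost, as under the bottleneck
    return out

phonemes = ["g", "uh", "d", "m", "or", "n", "ih", "ng"]
sounds = chunk(phonemes, SYLLABLES)   # ['good', 'mor', 'ning']
words = chunk(sounds, WORDS)          # ['good', 'morning']
message = chunk(words, PHRASES)       # ['GREETING']
```

Note that running the same function at each level yields progressively fewer, more abstract chunks spanning progressively more of the original signal, mirroring the hierarchy in Figure 2.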
A key theoretical implication of Chunk-and-Pass processing is that the multiple levels of linguistic representation typically assumed in the language sciences are a necessary by-product of the Now-or-Never bottleneck. Only by compressing the input into chunks and passing them to increasingly abstract levels of linguistic representation can the language system deal with the rapid onslaught of incoming information. Crucially, though, our perspective also suggests that the different levels of linguistic representation do not have a true part–whole relationship with one another. Unlike in the case of SF, who learned strategies to perfectly unpack chunks from within chunks to reproduce the original string of digits, language comprehension typically employs lossy compression to chunk the input. That is, higher-level chunks will not in general contain complete copies of lower-level chunks. Indeed, as speech input is encoded into ever more abstract chunks, increasing amounts of low-level information will typically be lost. Instead, as in perception (e.g., Haber 1983), there is greater representational underspecification at higher levels of representation because of the repeated process of lossy compression.Footnote 9 Thus, we would expect a growing involvement of extralinguistic information, such as perceptual input and world knowledge, in processing higher levels of linguistic representation (see, e.g., Altmann & Kamide 2009).
Whereas our account proposes a lossy hierarchy across levels of linguistic representation, only a very small number of chunks are represented within a level: otherwise, information is rapidly lost due to interference. This has the crucial implication that chunks within a given level can interact only locally. For example, acoustic information must rapidly be coded in a non-acoustic form, say, in terms of phonemes; but this is only possible if phonemes correspond to local chunks of acoustic input. The processing bottleneck therefore enforces a strong pressure toward local dependencies within a given linguistic level. Importantly, though, this does not imply that linguistic relations are restricted only to adjacent elements but, instead, that they may be formed between any of the small number of elements maintained at a given level of representation. Such representational locality is exemplified across different linguistic levels by the local nature of phonological processes such as reduction, assimilation, and fronting, including more elaborate phenomena such as vowel harmony (e.g., Nevins 2010); by speech errors (e.g., Cutler 1982); by the immediate proximity of inflectional morphemes and the verbs to which they apply; and by the vast literature on the processing difficulties associated with non-local dependencies in sentence comprehension (e.g., Gibson 1998; Hawkins 2004). As noted earlier, the higher the level of linguistic representation, the longer the limited time window within which information can be chunked. Whereas dealing with just two center-embeddings at the sentential level is prohibitively difficult (e.g., de Vries et al. 2011; Karlsson 2007), we are able to deal with up to four to six embeddings at the multi-utterance discourse level (Levinson 2013).
This is because chunking takes place over a much longer timescale at the discourse level than at the sentence level, providing more time to resolve the relevant dependency relations before they are subject to interference.
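The claim that chunks can interact only within the small set maintained at each level can be made concrete with a toy sketch. The `Level` class, its method names, and the capacity of three are illustrative assumptions on our part, not empirical estimates:

```python
# Hypothetical sketch of representational locality: each linguistic level
# holds only a small buffer of chunks, so dependencies can be formed only
# among the few items currently maintained; anything older has already
# been passed upward or lost to interference.
from collections import deque

class Level:
    def __init__(self, capacity=3):
        self.buffer = deque(maxlen=capacity)  # interference limits capacity

    def add(self, chunk):
        self.buffer.append(chunk)  # the oldest chunk is silently evicted

    def can_link(self, a, b):
        """A dependency between two chunks is possible only while both
        are still held at this level."""
        return a in self.buffer and b in self.buffer

syntax = Level(capacity=3)
for word in ["the", "dog", "that", "the", "cat"]:
    syntax.add(word)

# "cat" can still relate to "that" (local), but "dog" is already gone:
syntax.can_link("cat", "that")   # True
syntax.can_link("cat", "dog")    # False
```

On this sketch, relations need not hold between strictly adjacent elements, but they must be formed among whatever few chunks the level currently maintains – exactly the pressure toward locality described above.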
Finally, as indicated by Figure 2, processing within each level of linguistic representation takes place in parallel – but with a clear temporal component – as chunks are passed between levels. Note that, in the Chunk-and-Pass framework, it is entirely possible that linguistic input can simultaneously, and perhaps redundantly, be chunked in more than one way. For example, syntactic chunks and intonational contours may be somewhat independent (Jackendoff 2007). Moreover, we should expect further chunking across different "channels" of communication, including visual input such as gesture and facial expressions.
The Chunk-and-Pass perspective is compatible with a number of recent theoretical models of sentence comprehension, including constraint-based approaches (e.g., MacDonald et al. 1994; Trueswell & Tanenhaus 1994) and certain generative accounts (e.g., Jackendoff's [2007] parallel architecture). Intriguingly, fMRI data from adults (Dehaene-Lambertz et al. 2006a) and infants (Dehaene-Lambertz et al. 2006b) indicate that activation responses to a single sentence systematically slow down when moving away from the primary auditory cortex, either back toward Wernicke's area or forward toward Broca's area, consistent with increasing temporal windows for chunking when moving from phonemes to words to phrases. Indeed, the cortical circuits processing auditory input, from lower (sensory) to higher (cognitive) areas, follow different temporal windows, sensitive to more and more abstract levels of linguistic information, from phonemes and words to sentences and discourse (Lerner et al. 2011; Stephens et al. 2013). Similarly, the reverse process, going from a discourse-level representation of the intended message to the production of speech (or sign) across parallel linguistic levels, is compatible with several current models of language production (e.g., Chang et al. 2006; Dell et al. 1997; Levelt 2001).
Data from intracranial recordings during language production are consistent with different temporal windows for chunk decoding at the word, morphemic, and phonological levels, separated by just over a tenth of a second (Sahin et al. 2009). These results are compatible with our proposal that incremental processing in comprehension and production takes place in parallel across multiple levels of linguistic representation, each with a characteristic temporal window.
3.3. Implications of Strategy 3: Predictive language processing
We have already noted that, to be able to chunk incoming information as fast and as accurately as possible, the language system exploits multiple constraints in parallel across the different levels of linguistic representation. Such cues may be used not only to help disambiguate previous input, but also to generate expectations for what may come next, potentially further speeding up Chunk-and-Pass processing. Computational considerations indicate that simple statistical information gleaned from sentences provides powerful predictive constraints on language comprehension and can explain many human processing results (e.g., Christiansen & Chater 1999; Christiansen & MacDonald 2009; Elman 1990; Hale 2006; Jurafsky 1996; Levy 2008; Padó et al. 2009). Similarly, eye-tracking data suggest that comprehenders routinely use a variety of sources of probabilistic information – from phonological cues to syntactic context and real-world knowledge – to anticipate the processing of upcoming words (e.g., Altmann & Kamide 1999; Farmer et al. 2011; Staub & Clifton 2006). Results from event-related potential experiments indicate that rather specific predictions are made for upcoming input, including its lexical category (Hinojosa et al. 2005), grammatical gender (Van Berkum et al. 2005; Wicha et al. 2004), and even its onset phoneme (DeLong et al. 2005) and visual form (Dikker et al. 2010).
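As a purely illustrative sketch of how simple statistical information gleaned from sentences can yield graded expectations for upcoming words, consider an incrementally updated bigram model. The toy corpus and function names below are our own assumptions, not a model from the literature:

```python
# Toy illustration (our sketch): bigram statistics, accumulated one input
# at a time and without storing the sentences themselves, yield graded
# expectations for the next word of the kind that could speed
# Chunk-and-Pass processing.
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))  # counts[prev][next]

def observe(sentence):
    """Update bigram statistics online from a single pass over the input."""
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

def predict(prev):
    """Return candidate next words with probabilities, most expected first."""
    options = counts[prev]
    total = sum(options.values())
    if total == 0:
        return []
    return sorted(((w, c / total) for w, c in options.items()),
                  key=lambda pair: -pair[1])

for s in ["the dog barked", "the dog slept", "the cat slept"]:
    observe(s)

predict("the")   # 'dog' (p = 2/3) is ranked above 'cat' (p = 1/3)
```

Richer models in the cited literature combine many such cues across levels; the sketch shows only the minimal case of within-level statistics supporting anticipation.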
Accordingly, there is a growing body of evidence for a substantial role of prediction in language processing (for reviews, see, e.g., Federmeier 2007; Hagoort 2009; Kamide 2008; Kutas et al. 2014; Pickering & Garrod 2007) and evidence that such language prediction occurs in children as young as 2 years of age (Mani & Huettig 2012). Importantly, as well as exploiting statistical relations within a representational level, predictive processing allows top-down information from higher levels of linguistic representation to rapidly constrain the processing of the input at lower levels.Footnote 10
From the viewpoint of the Now-or-Never bottleneck, prediction provides an opportunity to begin Chunk-and-Pass processing as early as possible: to constrain representations of new linguistic material as it is encountered, and even incrementally to begin recoding predictable linguistic input before it arrives. This viewpoint is consistent with recent suggestions that the production system may be pressed into service to anticipate upcoming input (e.g., Pickering & Garrod 2007; 2013a). Chunk-and-Pass processing implies that there is practically no possibility of going back once a chunk is created because such backtracking tends to derail processing (e.g., as in the classic garden path phenomena mentioned above). This imposes a Right-First-Time pressure on the language system in the face of linguistic input that is highly locally ambiguous.Footnote 11 The contribution of predictive modeling to comprehension is that it facilitates local ambiguity resolution while the stimulus is still available. Only by recruiting multiple cues and integrating these with predictive modeling is it possible to resolve local ambiguities quickly and correctly.
Right-First-Time parsing fits with proposals such as that of Marcus (1980), in which local ambiguity resolution is delayed until later disambiguating information arrives, and with models in which aspects of syntactic structure may be underspecified, therefore not requiring the ambiguity to be resolved (e.g., Gorrell 1995; Sturt & Crocker 1996). It also parallels Marr's (1976) principle of least commitment, as we mentioned earlier, according to which the perceptual system should, as far as possible, only resolve perceptual ambiguities when sufficiently confident that they will not need to be undone. Moreover, it is compatible with the fine-grained weakly parallel interactive model (Altmann & Steedman 1988) in which possible chunks are proposed, word-by-word, by an autonomous parser and one is rapidly chosen using top-down information.
To facilitate chunking across multiple levels of representation, prediction takes place in parallel across the different levels but at varying timescales. Predictions for higher-level chunks may run ahead of those for lower-level chunks. For example, most people simply answer "two" in response to the question "How many animals of each kind did Moses take on the Ark?" – failing to notice the semantic anomaly (i.e., it was Noah's Ark, not Moses' Ark) even in the absence of time pressure and when made aware that the sentence may be anomalous (Erickson & Mattson 1981). That is, anticipatory pragmatic and communicative considerations relating to the required response appear to trump lexical semantics. More generally, the time course of normal conversation may lead to an emphasis on more temporally extended higher-level predictions over lower-level ones. This may facilitate the rapid turn-taking that has been observed cross-culturally (Stivers et al. 2009) and which seems to require that listeners make quite specific predictions about when the speaker's current turn will finish (Magyari & de Ruiter 2012), as well as being able to quickly adapt their expectations to specific linguistic environments (Fine et al. 2013).
We view the anticipation of turn-taking as one instance of the broader alignment that takes place between dialogue partners across all levels of linguistic representation (for a review, see Pickering & Garrod 2004). This dovetails with fMRI analyses indicating that although there are some comprehension- and production-specific brain areas, spatiotemporal patterns of brain activity are in general closely coupled between speakers and listeners (e.g., Silbert et al. 2014). In particular, Stephens et al. (2010) observed close synchrony between neural activations in speakers and listeners in early auditory areas. Speaker activations preceded those of listeners in posterior brain regions (including parts of Wernicke's area), whereas listener activations preceded those of speakers in the striatum and anterior frontal areas. In the Chunk-and-Pass framework, the listener lag primarily derives from delays caused by the chunking process across the various levels of linguistic representation, whereas the speaker lag predominantly reflects the listener's anticipation of upcoming input, especially at the higher levels of representation (e.g., pragmatics and discourse). Strikingly, the extent of the listener's anticipatory brain responses was strongly correlated with successful comprehension, further underscoring the importance of prediction-based alignment for language processing. Indeed, analyses of real-time interactions show that alignment increases when the communicative task becomes more difficult (Louwerse et al. 2012). By decreasing the impact of potential ambiguities, alignment thus makes processing as well as production easier in the face of the Now-or-Never bottleneck.
We have suggested that only an incremental, predictive language system, continually building and passing on new chunks of linguistic material, encoded at increasingly abstract levels of representation, can deal with the onslaught of linguistic input in the face of the severe memory constraints of the Now-or-Never bottleneck. We suggest that a productive line of future work is to consider the extent to which existing models of language are compatible with these constraints, and to use these properties to guide the creation of new theories of language processing.
4. Acquisition is learning to process
If speaking and understanding language involves Chunk-and-Pass processing, then acquiring a language requires learning how to create and integrate the right chunks rapidly, before current information is overwritten by new input. Indeed, the ability to quickly process linguistic input – which has been proposed as an indicator of chunking ability (Jones 2012) – is a strong predictor of language acquisition outcomes from infancy to middle childhood (Marchman & Fernald 2008). The importance of this process is also introspectively evident to anyone acquiring a second language: Initially, even segmenting the speech stream into recognizable sounds can be challenging, let alone parsing it into words or processing morphology and grammatical relations rapidly enough to build a semantic interpretation. The ability to acquire and rapidly deploy a hierarchy of chunks at different linguistic scales is parallel to the ability to chunk sequences of motor movements, numbers, or chess positions: It is a skill, built up by continual practice.
Viewing language acquisition as continuous with other types of skill learning is very different from the standard formulation of the problem of language acquisition in linguistics. There, the child is viewed as a linguistic theorist who has the goal of inferring an abstract grammar from a corpus of example sentences (e.g., Chomsky 1957; 1965) and only secondarily learning the skill of generating and understanding language. But perhaps the child is not a mini-linguist. Instead, we suggest that language acquisition is nothing more than learning to process: to turn meanings into streams of sound or sign (when generating language), and to turn streams of sound or sign back into meanings (when understanding language).
If linguistic input is available only fleetingly, then any learning must occur while that information is present; that is, learning must occur in real time, as the Chunk-and-Pass process takes place. Accordingly, any modifications to the learner's cognitive system in light of processing must, according to the Now-or-Never bottleneck, occur at the time of processing. The learner must learn to chunk the input appropriately – to learn to recode the input at successively more abstract linguistic levels; and doing this requires, of course, learning the structure of the language being spoken. But how is this structure learned?
We suggest that, in language acquisition, as in other areas of perceptual-motor learning, people learn by processing, and that past processing leaves traces that can facilitate future processing. What, then, is retained, so that language processing gradually improves? We can consider various possibilities: For example, the weights of a connectionist network can be updated online in the light of current processing (Rumelhart et al. 1986a); in an exemplar-based model, traces of past examples can be reused in the future (e.g., Hintzman 1988; Logan 1988; Nosofsky 1986). Whatever the appropriate computational framework, the Now-or-Never bottleneck requires that language acquisition be viewed as a type of skill learning, such as learning to drive, juggle, play the violin, or play chess. Such skills appear to be learned through practicing the skill, using online feedback during the practice itself, although the consolidation of learning occurs subsequently (Schmidt & Wrisberg 2004). The challenge of language acquisition is to learn a dazzling sequence of rapid processing operations, rather than conjecturing a correct "linguistic theory."
4.1. Implications of Strategy 1: Online learning
The Now-or-Never bottleneck implies that learning can depend only on material currently being processed. As we have seen, this implication requires a processing strategy according to which modification to current representations (in this context, learning) occurs right away; in machine-learning terminology, learning is online. If learning does not occur at the time of processing, the representation of linguistic material will be obliterated, and the opportunity for learning will be gone forever. To facilitate such online learning, the child must learn to use all available information to help constrain processing. The integration of multiple constraints – or cues – is a fundamental component of many current theories of language acquisition (see, e.g., contributions in Golinkoff et al. 2000; Morgan & Demuth 1996; Weissenborn & Höhle 2001; for a review, see Monaghan & Christiansen 2008). For example, second-graders' initial guesses about whether a novel word refers to an object or an action are affected by that word's phonological properties (Fitneva et al. 2009); 7-year-olds use visual context to constrain online sentence interpretation (Trueswell et al. 1999); and preschoolers' language production and comprehension are constrained by pragmatic factors (Nadig & Sedivy 2002). Thus, children learn rapidly to apply the multiple constraints used in incremental adult processing (Borovsky et al. 2012).
Nonetheless, online learning contrasts with traditional approaches in which the structure of the language is learned offline by the cognitive system accumulating a corpus of past linguistic inputs and choosing the grammar or other model of the language that best fits with those inputs. For example, in both mathematical and theoretical analysis (e.g., Gold 1967; Hsu et al. 2011; Pinker 1984) and in grammar-induction algorithms in machine learning and cognitive science, it is typically assumed that a corpus of language can be held in memory, and that the candidate grammar is successively adjusted to fit the corpus as well as possible (e.g., Manning & Schütze 1999; Pereira & Schabes 1992; Redington et al. 1998). However, this approach involves learning linguistic regularities (at, say, the morphological level) by storing and later surveying relevant linguistic input at a lower level of analysis (e.g., involving strings of phonemes), and then attempting to determine which higher-level regularities best fit the database of lower-level examples. There are a number of difficulties with this type of proposal – for example, that only a very rich lower-level representation (perhaps combined with annotations concerning relevant syntactic and semantic context) is likely to be a useful basis for later analysis. But more fundamentally, the Now-or-Never bottleneck requires that information be retained only if it is recoded at processing time: Phonological information that is not chunked at the morphological level and beyond will be obliterated by oncoming phonological material.Footnote 12
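The contrast between online learning and corpus-based batch learning can be sketched as follows. Both toy functions are our own illustrations (not models from the literature); they arrive at the same word statistics, but differ in what must be held in memory along the way, and it is the batch learner's verbatim storage of the input that the Now-or-Never bottleneck rules out:

```python
# Schematic contrast (our sketch): the online learner updates its model
# from each input as it is processed and then discards the input; the
# batch learner must first store the whole corpus for later analysis.

def learn_online(utterances):
    model = {}
    for utt in utterances:                        # each utterance is available only now
        for word in utt.split():
            model[word] = model.get(word, 0) + 1  # update during processing
        # utt is not retained; only its trace in `model` survives
    return model

def learn_batch(utterances):
    corpus = list(utterances)                     # verbatim storage of the input --
    model = {}                                    # ruled out by the Now-or-Never bottleneck
    for utt in corpus:                            # later survey over the stored corpus
        for word in utt.split():
            model[word] = model.get(word, 0) + 1
    return model

learn_online(["the dog barked", "the cat slept"])
# {'the': 2, 'dog': 1, 'barked': 1, 'cat': 1, 'slept': 1}
```

For simple frequency counts the two coincide; the argument in the text is that for richer regularities (e.g., agreement), whatever the online learner's current model fails to encode is lost and cannot be recovered by a later survey.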
So, if learning is shaped by the Now-or-Never bottleneck, then linguistic input must, when it is encountered, be recoded successively at increasingly abstract linguistic levels if it is to be retained at all – a constraint imposed, we argue, by basic principles of memory. Crucially, such information is not, therefore, in a suitably “neutral” format to allow for the discovery of previously unsuspected linguistic regularities. In a nutshell, the lossy compression of the linguistic input is achieved by applying the learner's current model of the language. But information that would point toward a better model of the language (if examined in retrospect) will typically be lost (or, at best, badly obscured) by this compression, precisely because those regularities are not captured by the current model of the language. Suppose, for example, that we create a lossy encoding of language using a simple, context-free phrase structure grammar that cannot handle, say, noun-verb agreement. The lossy encoding of the linguistic input produced using this grammar will provide a poor basis for learning a more sophisticated grammar that includes agreement – precisely because agreement information will have been thrown away. So the Now-or-Never bottleneck rules out the possibility that the learner can survey a neutral database of linguistic material, to optimize its model of the language.
The emphasis on online learning does not, of course, rule out the possibility that any linguistic material that is remembered may subsequently be used to inform learning. But according to the present viewpoint, any further learning requires reprocessing that material. So if a child comes to learn a poem, song, or story verbatim, the child might extract more structure from that material by mental rehearsal (or, indeed, by saying it aloud). The online learning constraint is that material is learned only when it is being processed – ruling out any putative learning processes that involve carrying out linguistic analyses or compiling statistics over a stored corpus of linguistic material.
If this general picture of acquisition as learning-to-process is correct, then we should expect the exploitation of memory to require “replaying” learned material so that it can be reprocessed. Thus, the application of memory itself requires passing through the Now-or-Never bottleneck – there is no way of directly interrogating an internal database of past experience; indeed, this viewpoint fits with our subjective sense that we need to “bring to mind” past experiences or rehearse verbal material to process it further. Interestingly, there is now also substantial neuroscientific evidence that replay does occur (e.g., in rat spatial learning; Carr et al. 2011). Moreover, it has long been suggested that dreaming may have a related function (here using “reverse” learning over “fictional” input to eliminate spurious relationships identified by the brain; Crick & Mitchison 1983; see Hinton & Sejnowski 1986, for a closely related computational model). Deficits in the ability to replay material would, in this view, lead to consequent deficits in memory and inference; consistent with this viewpoint, Martin and colleagues have argued that rehearsal deficits for phonological pattern and semantic information may lead to difficulties in the long-term acquisition and retention of word forms and word meanings, respectively, and their use in language processing (e.g., Martin & He 2004; Martin et al. 1994). In summary, then, language acquisition involves learning to process, and generalizations can only be made over past processing episodes.
4.2. Implications of Strategy 2: Local learning
Online learning faces a particularly acute version of a general learning problem: the stability-plasticity dilemma (e.g., Mermillod et al. 2013). How can new information be acquired without interfering with prior information? The problem is especially challenging because reviewing prior information is typically difficult (because recalling earlier information interferes with new input) or impossible (where prior input has been forgotten). Thus, to a good approximation, the learner can only update its model of the language in a way that responds to current linguistic input, without being able to review whether any updates are inconsistent with prior input. Specifically, if the learner has a global model of the entire language (e.g., a traditional grammar), the learner runs the risk of overfitting that model to capture regularities in the momentary linguistic input at the expense of damaging the match with past linguistic input.
Avoiding this problem, we suggest, requires that learning be highly local, consisting of learning about specific relationships between particular linguistic representations. New items can be acquired, with implications for later processing of similar items; but learning current items does not thereby create changes to the entire model of the language, thus potentially interfering with what was learned from past input. One way to learn in a local fashion is to store individual examples (this requires, in our framework, that those examples have been abstractly recoded by successive Chunk-and-Pass operations, of course), and then to generalize, piecemeal, from these examples. This standpoint is consistent with the idea that the “priority of the specific,” as observed in other areas of cognition (e.g., Jacoby et al. 1989), also applies to language acquisition. For example, children seem to be highly sensitive to multiword chunks (Arnon & Clark 2011; Bannard & Matthews 2008; see Arnon & Christiansen, submitted, for a reviewFootnote 13). More generally, learning based on past traces of processing will typically be sensitive to details of that processing, as is observed across phonetics, phonology, lexical access, syntax, and semantics (e.g., Bybee 2006; Goldinger 1998; Pierrehumbert 2002; Tomasello 1992).
That learning is local provides a powerful constraint, incompatible with typical computational models of how the child might infer the grammar of the language – because these models typically do not operate incrementally but range across the input corpus, evaluating alternative grammatical hypotheses (so-called batch learning). But, given the Now-or-Never bottleneck, the “unprocessed” corpus, so readily available to the linguistic theorist or to a computer model, is lost to the human learner almost as soon as it is encountered. Where such information has been memorized (as in the case of SF's encoding of streams of digits), recall and processing are slow and effortful. Moreover, because information is encoded in terms of the current encoding, it becomes difficult to neutrally review that input to create a better encoding, and to cross-check past data to test wide-ranging grammatical hypotheses.Footnote 14 So, as we have already noted, the Now-or-Never bottleneck seems incompatible with the view of a child as a mini-linguist.
By contrast, the principle of local learning is respected by other approaches. For example, item-based (Tomasello 2003), connectionist (e.g., Chang et al. 1999; Elman 1990; MacDonald & Christiansen 2002),Footnote 15 exemplar-based (e.g., Bod 2009), and other usage-based (e.g., Arnon & Snider 2010; Bybee 2006) accounts of language acquisition tie learning and processing together – and assume that language is acquired piecemeal, in the absence of an underlying Bauplan. Such accounts, based on local learning, provide a possible explanation of the frequency effects that are found at all levels of language processing and acquisition (e.g., Bybee 2007; Bybee & Hopper 2001; Ellis 2002; Tomasello 2003), analogous to exemplar-based theories of how performance speeds up with practice (Logan 1988).
The local nature of learning need not, though, imply that language has no integrated structure. Just as in perception and action, local chunks can be defined at many different levels of abstraction, including highly abstract patterns, for example, governing subject, verb, and object; and generalizations from past processing to present processing will operate across all of these levels. Therefore, in generating or understanding a new sentence, the language user will be influenced by the interaction of multiple constraints from innumerable traces of past processing, across different linguistic levels. This view of language processing as involving the parallel interaction of multiple local constraints is embodied in a variety of influential approaches to language (e.g., Jackendoff 2007; Seidenberg 1997).
4.3. Implications of Strategy 3: Learning to predict
If language processing involves prediction – to make the encoding of new linguistic material sufficiently rapid – then a critical aspect of language acquisition is learning to make such predictions successfully (Altmann & Mirkovic 2009). Perhaps the most natural approach to predictive learning is to compare predictions with subsequent reality, thus creating an “error signal,” and then to modify the predictive model to systematically reduce this error. Throughout many areas of cognition, such error-driven learning has been widely explored in a range of computational frameworks (e.g., from connectionist networks, to reinforcement learning, to support vector machines) and has considerable behavioral (e.g., Kamin 1969) and neurobiological support (e.g., Schultz et al. 1997).
Predictive learning can, in principle, take a number of forms: For example, predictive errors can be accumulated over many samples, and then modifications made to the predictive model to minimize the overall error over those samples (i.e., batch learning). But this is ruled out by the Now-or-Never bottleneck: Linguistic input, and the predictions concerning it, is present only fleetingly. But error-driven learning can also be “online” – each prediction error leads to an immediate, though typically small, modification of the predictive model; and the accumulation of these small modifications gradually reduces prediction errors on future input.
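The online alternative can be sketched in a few lines (a minimal illustration of error-driven updating; the alternating input stream and the learning rate are our own invented parameters, not a model from the literature):

```python
# Minimal sketch of online error-driven prediction: after each symbol,
# nudge the predictive model by a small step toward what actually
# occurred, so that no corpus of past input need be stored.

from collections import defaultdict

LEARNING_RATE = 0.1
# P(next symbol | previous symbol), lazily initialized to 0.5
probs = defaultdict(lambda: defaultdict(lambda: 0.5))

def observe(prev, nxt, vocab):
    """One online update: compare prediction with reality, reduce error."""
    error = 1.0 - probs[prev][nxt]          # error for the symbol that occurred
    for sym in vocab:
        target = 1.0 if sym == nxt else 0.0
        probs[prev][sym] += LEARNING_RATE * (target - probs[prev][sym])
    return error

stream = ["ba", "di"] * 50                   # simple alternating input
vocab = {"ba", "di"}
errors = [observe(a, b, vocab) for a, b in zip(stream, stream[1:])]

# Early prediction errors are large, later ones small: the accumulated
# small modifications have gradually tuned the predictive model.
assert errors[0] > errors[-1]
assert errors[-1] < 0.05
```

Each input is processed once, fleetingly, exactly as the Now-or-Never bottleneck demands; only the small parameter changes persist.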
A number of computational models adhere to these principles: Learning involves creating a predictive model of the language, using online error-driven learning. Such models, limited though they are, may provide a starting point for creating an increasingly realistic account of language acquisition and processing. For example, a connectionist model that embodies these principles is the simple recurrent network (Altmann 2002; Christiansen & Chater 1999; Elman 1990), which learns to map from the current input onto the next element in a continuous sequence of linguistic (or other) input, and which learns, online, by adjusting its parameters (the “weights” of the network) to reduce the observed prediction error, using the back-propagation learning algorithm. Using a very different framework, in the spirit of construction grammar (e.g., Croft 2001; Goldberg 2006), McCauley and Christiansen (2011) recently developed a psychologically based, online chunking model of incremental language acquisition and processing, incorporating prediction to generalize to new chunk combinations. Exemplar-based analogical models of language acquisition and processing may also be constructed, which build and predict language structure online, incrementally creating a database of possible structures and dynamically using online computation of similarity to recruit these structures to process and predict new linguistic input.
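As a rough sketch of the Elman-style architecture just described, the following is our own minimal NumPy implementation (hidden-layer size, learning rate, and the toy "ba di gu" input are all invented for illustration; this is not any of the cited models):

```python
# Bare-bones simple recurrent network: predict the next symbol in a
# stream, learning online from each prediction error (one-step backprop).

import numpy as np

rng = np.random.default_rng(0)
symbols = ["ba", "di", "gu"]
V, H = len(symbols), 8                   # vocabulary and hidden sizes
Wxh = rng.normal(0, 0.5, (H, V))         # input -> hidden
Whh = rng.normal(0, 0.5, (H, H))         # context (previous hidden) -> hidden
Why = rng.normal(0, 0.5, (V, H))         # hidden -> output
lr = 0.1

def one_hot(i):
    v = np.zeros(V)
    v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(H)                          # "context units" (copied-back state)
losses = []
stream = [0, 1, 2] * 200                 # "ba di gu ba di gu ..."
for cur, nxt in zip(stream, stream[1:]):
    x, t = one_hot(cur), one_hot(nxt)
    h_new = np.tanh(Wxh @ x + Whh @ h)   # hidden state uses prior context
    y = softmax(Why @ h_new)
    losses.append(-np.log(y[nxt]))       # prediction error (cross-entropy)
    # online error-driven weight updates, one small step per input
    dy = y - t
    dh = (Why.T @ dy) * (1 - h_new**2)
    Why -= lr * np.outer(dy, h_new)
    Wxh -= lr * np.outer(dh, x)
    Whh -= lr * np.outer(dh, h)
    h = h_new

# The deterministic sequence becomes predictable: loss falls over time.
assert np.mean(losses[:20]) > np.mean(losses[-20:])
```

Nothing is stored except the weights: each input is predicted, the error is used immediately, and the input itself is then discarded, in line with the Now-or-Never bottleneck.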
Importantly, prediction allows for top-down information to influence current processing across different levels of linguistic representation, from phonology to discourse, and at different temporal windows (as indicated by Fig. 2). We see the ability to use such top-down information as emerging gradually across development, building on bottom-up information. That is, children gradually learn to apply top-down knowledge to facilitate processing via prediction, as higher-level information becomes more entrenched and allows for anticipatory generalizations to be made.
In this section, we have argued that the child should not be viewed as a mini-linguist, attempting to infer the abstract structure of grammar, but as learning to process: that is, learning to alleviate the severe constraints imposed by the Now-or-Never bottleneck. Next, we discuss how chunk-based language acquisition and processing have shaped linguistic change and, ultimately, the evolution of language.
5. Language change is item-based
Like language, human culture constantly changes. We continually tinker with all aspects of culture, from social conventions and rituals to technology and everyday artifacts (see contributions in Richerson & Christiansen 2013). Perhaps language, too, is a result of cultural evolution – a product of piecemeal tinkering – with the long-term evolution of language resulting from the compounding of myriad local short-term processes of language change. This hypothesis figures prominently in many recent theories of language evolution (e.g., Arbib 2005; Beckner et al. 2009; Christiansen & Chater 2008; Hurford 1999; Smith & Kirby 2008; Tomasello 2003; for a review of these theories, see Dediu et al. 2013). Language is construed as a complex evolving system in its own right; linguistic forms that are easier to use and learn, or are more communicatively efficient, will tend to proliferate, whereas those that are not will be prone to die out. Over time, processes of cultural evolution involving repeated cycles of learning and use are hypothesized to have shaped the languages we observe today.
If aspects of language survive only when they are easy to produce and understand, then moment-by-moment processing will shape not only the structure of language (see also Hawkins 2004; O'Grady 2005), but also the learning problem that the child faces. Thus, from the perspective of language as an evolving system, language processing at the timescale of seconds has implications for the longer timescales of language acquisition and evolution. Figure 3 illustrates how the effects of the Now-or-Never bottleneck flow from the timescale of processing to those of acquisition and evolution.
Figure 3. Illustration of how Chunk-and-Pass processing at the utterance level (with the Cᵢ referring to chunks) constrains the acquisition of language by the individual, which, in turn, influences how language evolves through learning and use by groups of individuals on a historical timescale.
Chunk-and-Pass processing carves the input (or output) into chunks at different levels of linguistic representation at the timescale of the utterance (seconds). These chunks constitute the comprehension and production events from which children and adults learn and update their ability to process their native language over the timescale of the individual (tens of years). Each learner, in turn, is part of a population of language users that shape the cultural evolution of language across a historical timescale (hundreds or thousands of years): Language will be shaped by the linguistic patterns learners find easiest to acquire and process. And the learners will, of course, be strongly constrained by the basic cognitive limitation that is the Now-or-Never bottleneck – and, hence, through cultural evolution, linguistic patterns that can be processed through that bottleneck will be strongly selected. Moreover, if acquiring a language is learning to process, and processing involves incremental Chunk-and-Pass operations, then language change will operate through changes driven by Chunk-and-Pass processing, both within and between individuals. But this, in turn, implies that processes of language change should be item-based, driven by processing/acquisition mechanisms defined over Chunk-and-Pass representations (rather than, for example, being defined over abstract linguistic parameters, with diverse structural consequences across the entire language).
We noted earlier that a consequence of Chunk-and-Pass processing for production is a tendency toward reduction, especially of more frequently used forms, and this constitutes one of several pressures on language change (see also MacDonald 2013). Because reduction minimizes articulatory processing effort for the speaker but may increase processing effort for the hearer and learner, this pressure can in extreme cases lead to a communicative collapse. This is exemplified by a lab-based analogue of the game of “telephone,” in which participants were exposed to a miniature artificial language consisting of simple form-meaning mappings (Kirby et al. 2008). The initial language contained random mappings between syllable strings and pictures of moving geometric figures in different colors. After exposure, participants were asked to produce linguistic forms corresponding to specific pictures. Importantly, the participants saw only a subset of the language but nonetheless had to generalize to the full language. The productions of the initial learner were then used as the input language for the next learner, and so on for a total of 10 “generations.” In the absence of other communicative pressures (such as the avoidance of ambiguity; Grice 1967), the language collapsed into just a few different forms that allowed for systematic, albeit semantically underspecified, generalization to unseen items. In natural language, however, the pressure toward reduction is normally kept in balance by the need to maintain effective communication.
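The collapse under transmission can be caricatured in a few lines (our own toy parameters, loosely inspired by the iterated-learning design described above; this is not Kirby et al.'s actual procedure): each learner sees only half of the meaning-form pairs and fills the gaps with its most entrenched form.

```python
# Toy iterated-learning ("telephone") simulation: with no pressure
# against ambiguity, distinct forms collapse under repeated transmission.

import random
from collections import Counter

random.seed(1)
MEANINGS = list(range(27))   # e.g., 3 colors x 3 shapes x 3 motions

def transmit(language, generations=10):
    for _ in range(generations):
        # each learner observes only half of the meaning-form pairs
        seen = dict(random.sample(list(language.items()),
                                  k=len(MEANINGS) // 2))
        favourite = Counter(seen.values()).most_common(1)[0][0]
        # unseen meanings get the learner's most entrenched form
        language = {m: seen.get(m, favourite) for m in MEANINGS}
    return language

initial = {m: f"w{m}" for m in MEANINGS}   # fully distinct random forms
final = transmit(initial)

assert len(set(initial.values())) == 27
assert len(set(final.values())) <= 13      # systematic but underspecified
```

The final language still covers all meanings, but with far fewer forms: generalization has become systematic at the cost of semantic underspecification, mirroring the collapse described above.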
Expanding on the notion of reduction and erosion, we suggest that constraints from Chunk-and-Pass processing can provide a cognitive foundation for grammaticalization (Hopper & Traugott 1993). Specifically, chunks at different levels of linguistic structure – discourse, syntax, morphology, and phonology – are potentially subject to reduction. Consequently, we can distinguish between different types of grammaticalization, from discourse syntacticization and semantic bleaching to morphological reduction and phonetic erosion. Repeated chunking of loose discourse structures may result in their reduction into more rigid syntactic constructions, reflecting Givón's (1979) hypothesis that today's syntax is yesterday's discourse.Footnote 16 For example, the resultative construction He pulled the window open might derive from syntacticization of a loose discourse sequence such as He pulled the window and it opened (Tomasello 2003). As a further by-product of chunking, some words that occur frequently in certain kinds of construction may gradually become “bleached” of meaning and ultimately signal only general syntactic properties. Consider, as an example, the construction be going to, which was originally used exclusively to indicate movement in space (e.g., I'm going to Ithaca) but which is now also used as an intention or future marker when followed by a verb (as in I'm going to eat at seven; Bybee et al. 1994). Additionally, a chunked linguistic expression may further be subject to morphological reduction, resulting in further loss of morphological (or syntactic) elements. For instance, the demonstrative that in English (e.g., that window) lost the grammatical category of number (that, singular, vs. those, plural) when it came to be used as a complementizer, as in the window/windows that is/are dirty (Hopper & Traugott 1993).
Finally, as noted earlier, frequently chunked elements are likely to become phonologically reduced, leading to the emergence of new shortened grammaticalized forms, such as the phonetic erosion of going to into gonna (Bybee et al. 1994). Thus, the Now-or-Never bottleneck provides a constant pressure toward reduction and erosion across the different levels of linguistic representation, providing a possible explanation for why grammaticalization tends to be a largely unidirectional process (e.g., Bybee et al. 1994; Haspelmath 1999; Heine & Kuteva 2002; Hopper & Traugott 1993).
Beyond grammaticalization, we suggest that language change, more broadly, will be local at the level of individual chunks. At the level of sound change, our perspective is consistent with lexical diffusion theory (e.g., Wang 1969; 1977; Wang & Cheng 1977), suggesting that sound change originates with a small set of words and then gradually spreads to other words with a similar phonological make-up. The extent and speed of such sound change is affected by a number of factors, including frequency, word class, and phonological environment (e.g., Bybee 2002; Phillips 2006). Similarly, morpho-syntactic change is also predicted to be local in nature: what we might call “constructional diffusion.” Accordingly, we interpret the cross-linguistic evidence indicating the effects of processing constraints on grammatical structure (e.g., Hawkins 2004; Kempson et al. 2001; O'Grady 2005; see Jaeger & Tily 2011, for a review) as a process of gradual change over individual constructions, instead of wholesale changes to grammatical rules. Note, though, that because chunks are not independent of one another but form a system within a given level of linguistic representation, a change to a highly productive chunk may have cascading effects on other chunks at that level (and similarly for representations at other levels of abstraction). For example, if a frequently used construction changes, then constructional diffusion could in principle lead to rapid, and far-reaching, change throughout the language.
On this account, another ubiquitous process of language change, regularization, whereby representations at a particular linguistic level become more patterned, should also be a piecemeal process. This is exemplified by another of Kirby et al.'s (2008) game-of-telephone experiments, showing that when ambiguity is avoided, a highly structured linguistic system emerges across generations of learners, with morpheme-like substrings indicating different semantic properties (color, shape, and movement). Another similar, lab-based cultural evolution experiment showed that this process of regularization does not result in the elimination of variability but, rather, in increased predictability through lexicalized patterns (Smith & Wonnacott 2010). Whereas the initial language contained unpredictable pairings of nouns with plural markers, each noun became chunked with a specific marker in the final languages.
These examples illustrate how Chunk-and-Pass processing over time may lead to so-called obligatorification, whereby a pattern that was initially flexible or optional becomes obligatory (e.g., Heine & Kuteva 2007). This process is one of the ways in which new chunks may be created. So, although chunks at each linguistic level can lose information through grammaticalization and cannot regain it, a countervailing process exists by which complex chunks are constructed by “gluing together” existing chunks.Footnote 17 That is, in Bybee's (2002) phrase, “items that are used together fuse together.” For example, auxiliary verbs (e.g., to have, to go) can become fused with main verbs to create new morphological patterns, as in many Romance languages, in which the future tense is signaled by an auxiliary tacked on as a suffix to the infinitive. In Spanish, the future tense endings -é, -ás, -á, -emos, -éis, -án derive from the present tense of the auxiliary haber, namely, he, has, ha, hemos, habéis, han; and in French, the corresponding endings -ai, -as, -a, -ons, -ez, -ont derive from the present tense of the auxiliary avoir, namely, ai, as, a, avons, avez, ont (Fleischman 1982). Such complex new chunks are then subject to erosion (e.g., as is implicit in the example above, the Spanish for you (informal, plural) will eat is comeréis, rather than *comerhabéis; the first syllable of the auxiliary has been stripped away).
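The fusion-plus-erosion pattern in the Spanish example can be stated schematically (the paradigm is as cited above from Fleischman 1982; the code itself is merely our illustration of chunk fusion):

```python
# Spanish future tense as chunk fusion: infinitive + eroded endings
# derived from present-tense haber (he, has, ha, hemos, habéis, han).

FUSED_ENDINGS = ["é", "ás", "á", "emos", "éis", "án"]

def spanish_future(infinitive):
    """Fuse an infinitive chunk with the eroded auxiliary endings."""
    return [infinitive + e for e in FUSED_ENDINGS]

forms = spanish_future("comer")
assert forms == ["comeré", "comerás", "comerá",
                 "comeremos", "comeréis", "comerán"]
# 'you (informal, plural) will eat' is comeréis, not *comerhabéis:
assert "comerhabéis" not in forms
```

The regularity of the mapping is exactly what one expects if the historical source was a productive two-chunk sequence (infinitive + auxiliary) that fused and then eroded.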
Importantly, the present viewpoint is neutral regarding the extent to which children are the primary source of innovation (e.g., Bickerton 1984) or regularization (e.g., Hudson et al. 2005) of linguistic material, although constraints from child language acquisition likely play some role (e.g., in the emergence of regular subject-object-verb word order in the Al-Sayyid Bedouin Sign Language; Sandler et al. 2005). In general, we would expect that multiple forces influence language change in parallel (for reviews, see Dediu et al. 2013; Hruschka et al. 2009), including sociolinguistic factors (e.g., Trudgill 2011), language contact (e.g., Mufwene 2008), and the use of language as an ethnic marker (e.g., Boyd & Richerson 1987).
Because language change, like processing and acquisition, is driven by multiple competing factors, which are amplified by cultural evolution, linguistic diversity will be the norm. Accordingly, we would expect few, if any, “true” language universals to exist in the sense of constraints that can be explained only in purely linguistic terms (Christiansen & Chater 2008). Nonetheless, domain-general processing constraints are likely to significantly constrain the set of possible languages (see, e.g., Cann & Kempson 2008). This picture is consistent with linguistic arguments suggesting that there may be no strict language universals (Bybee 2009; Evans & Levinson 2009). For example, computational phylogenetic analyses indicate that word order correlations are lineage-specific (Dunn et al. 2011), shaped by particular histories of cultural evolution rather than following universal patterns as would be expected if they were the result of innate linguistic constraints (e.g., Baker 2001) or language-specific performance limitations (e.g., Hawkins 2009). Thus, the process of piecemeal tinkering that drives item-based language change is subject to constraints deriving not only from Chunk-and-Pass processing and multiple-cue integration but also from the specific trajectory of cultural evolution that a language follows. More generally, in this perspective, there is no sharp distinction between language evolution and language change: Language evolution is just the result of language change over a long timescale (see also Heine & Kuteva 2007), obviating the need for separate theories of language evolution and change (e.g., Berwick et al. 2013; Hauser et al. 2002; Pinker 1994).Footnote 18
6. Structure as processing
The Now-or-Never bottleneck implies, we have argued, that language comprehension involves incrementally chunking linguistic material and immediately passing the result for further processing, and production involves a similar cascade of Just-in-Time processing operations in the opposite direction. And language will be shaped through cultural evolution to be easy to learn and process by generations of speakers/hearers, who are forced to chunk and pass the oncoming stream of linguistic material. What are the resulting implications for the structure of language and its mental representation? In this section, we first show that certain key properties of language follow naturally from this framework; we then reconceptualize certain important notions in the language sciences.
6.1. Explaining key properties of language
6.1.1. The bounded nature of linguistic units
In nonlinguistic sequential tasks, memory constraints are so severe that chunks of more than a few items are rare. People typically encode phone numbers, number plates, postal codes, and Social Security numbers as sequences of between two and four digits or letters; recall deteriorates rapidly for unchunked item-sequences longer than about four elements (Cowan 2000), and recall of longer material typically breaks into short chunk-like phrases. Similar chunking processes are thought to govern nonlinguistic sequences of actions (e.g., Graybiel 1998). As we have argued previously in this article, the same constraints apply throughout language processing, from sound to discourse.
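The digit-grouping strategy just described amounts to a simple recoding scheme (our own trivial illustration; the two-to-four-item bound comes from the text above):

```python
# Recoding an unstructured digit string into chunks of a few items,
# the range within which recall stays reliable.

def chunk(seq, size=3):
    """Split a sequence into successive chunks of at most `size` elements."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

# A ten-digit string becomes four short, recallable chunks:
assert chunk("8005551234") == ["800", "555", "123", "4"]
assert all(len(c) <= 4 for c in chunk("8005551234", size=4))
```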
Across different levels of linguistic representation, units also tend to have only a few component elements. Even though the nature of sound-based units in speech is theoretically contentious, all proposals capture the sharply bounded nature of such units. For example, a traditional perspective on English phonology would postulate phonemes, short sequences of which are grouped into syllables, with multisyllabic words being organized by intonational or perhaps morphological groupings. Indeed, the tendency toward few-element units is so strong that even a long nonsense word with many syllables, such as supercalifragilisticexpialidocious, is chunked successively, for example, as tentatively indicated:
[Image in the original: a tentative successive chunking of supercalifragilisticexpialidocious into nested few-element chunks.]
Similarly, agglutinating languages, such as Turkish, chunk complex multimorphemic words using local grouping mechanisms that include formulaic morpheme expressions (Durrant 2013). Likewise, at higher levels of linguistic representation, verbs normally have only two or three arguments at most. Across linguistic theories of different persuasions, syntactic phrases typically consist of only a few constituents. Thus, the Now-or-Never bottleneck provides a strong bias toward bounded linguistic units across various levels of linguistic representation.
6.1.2. The local nature of linguistic dependencies
Just as we have argued that Chunk-and-Pass processing leads to simple linguistic units with only a small number of components, so it produces a powerful tendency toward local dependencies. Dependencies between linguistic elements will primarily be adjacent or separated by only a few other elements. For example, at the phonological level, processes are highly local, as reflected by data on coarticulation, assimilation, and phonotactic constraints (e.g., Clark et al. 2007). Similarly, we expect word formation processes to be highly local in nature, which is in line with a variety of different linguistic perspectives on the prominence of adjacency in morphological composition (e.g., Carstairs-McCarthy 1992; Hay 2000; Siegel 1978). Strikingly, adjacency even appears to be a key characteristic of multimorphemic formulaic units in an agglutinating language such as Turkish (Durrant 2013).
At the syntactic level, there is also a strong bias toward local dependencies. For example, when processing the sentence “The key to the cabinets was …,” comprehenders experience local interference from the plural cabinets, although the verb was needs to agree with the singular key (Nicol et al. 1997; Pearlmutter et al. 1999). Indeed, individuals who are good at picking up adjacent dependencies among sequence elements in a serial-reaction time task also experience greater local interference effects in sentence processing (Misyak & Christiansen 2010). Moreover, similar local interference effects have been observed in production when people are asked to continue the above sentence after cabinets (Bock & Miller 1991).
More generally, analyses of Romanian and Czech (Ferrer-i-Cancho 2004) as well as Catalan, Basque, and Spanish (Ferrer-i-Cancho & Liu 2014) point to a pressure toward minimization of the distance between syntactically related words. This tendency toward local dependencies seems to be particularly strongly expressed in strict-word-order languages such as English, but somewhat less so in more flexible languages such as German (Gildea & Temperley 2010). However, the use of case marking in German may provide a cue to overcome this by indicating who does what to whom, as suggested by simulations of the learnability of different word orders with or without case marking (e.g., Lupyan & Christiansen 2002; Van Everbroeck 1999). This highlights the importance not only of distributional information (e.g., regarding word order) but also of other types of cues (e.g., involving phonological, semantic, or pragmatic information), as discussed previously.
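This pressure toward short dependencies can be made concrete with a toy metric. The sketch below sums the linear distance between each word and its syntactic head; the sentence continuation (“rusty”) and the head assignments are our own illustrative choices, not drawn from the corpus studies cited above.

```python
# Toy illustration of dependency-length minimization.
# heads[i] is the index of word i's syntactic head (the root points to itself).

def total_dependency_length(heads):
    """Sum of linear distances between each word and its head (root excluded)."""
    return sum(abs(i - h) for i, h in enumerate(heads) if h != i)

# "The key to the cabinets was rusty" with hypothetical head indices:
# The->key, key->was, to->key, the->cabinets, cabinets->to, was = root, rusty->was
sentence = ["The", "key", "to", "the", "cabinets", "was", "rusty"]
heads = [1, 5, 1, 4, 2, 5, 5]

print(total_dependency_length(heads))  # → 10
```

On a metric like this, word orders that keep heads and their dependents close together score lower; the corpus analyses cited above find that attested orders tend toward such low scores.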
We want to stress, however, that we are not denying the existence of long-distance syntactic dependencies; rather, we are suggesting that our ability to process such dependencies will be bounded by the number of chunks that can be kept in memory at a given level of linguistic representation. In many cases, chunking may help to minimize the distance over which a dependency has to remain in memory. For example, the use of personal pronouns can facilitate the processing of otherwise difficult object relative clauses because they are more easily chunked (e.g., People [you know] are more fun; Reali & Christiansen 2007a). Similarly, the processing of long-distance dependencies is eased when they are separated by highly frequent word combinations that can be readily chunked (e.g., Reali & Christiansen 2007b). More generally, the Chunk-and-Pass account is in line with other approaches that treat processing limitations and complexity as primary constraints on long-distance dependencies, thus potentially providing explanations for linguistic phenomena such as subjacency (e.g., Berwick & Weinberg 1984; Kluender & Kutas 1993), island constraints (e.g., Hofmeister & Sag 2010), referential binding (e.g., Culicover 2013), and scope effects (e.g., O'Grady 2013). Crucially, though, as we argued earlier, the impact of these processing constraints may be lessened to some degree by the integration of multiple sources of information (e.g., from pragmatics, discourse context, and world knowledge) to support the ongoing interpretation of the input (e.g., Altmann & Steedman 1988; Heider et al. 2014; Tanenhaus et al. 1995).
6.1.3. Multiple levels of linguistic representation
Speech allows us to transmit a digital, symbolic code over a serial, analog channel using time variation in sound pressure (or using analog movements, in sign language). How might we expect this digital-analog-digital conversion to be tuned, to optimize the amount of information transmitted?
The problem of encoding and decoding digital signals over an analog serial channel is well studied in communication theory (Shannon 1948) – and, interestingly, the solutions typically adopted look very different from those employed by natural language. Crucially, to maximize the rate of information transfer, it is generally best to spread the message to be conveyed across the analog signal in a very nonlocal way. That is, rather than matching up portions of the information to be conveyed (e.g., in an engineering context, these might be the contents of a database) to particular portions of the analog signal, the best strategy is to encrypt the entire digital message using the entire analog signal, so that the message is coded as a block (e.g., MacKay 2003). But why is the engineering solution to information transmission so very different from that used by natural language, in which distinct portions of the analog signal correspond to meaningful units in the digital code (e.g., phonemes, words)? The Now-or-Never bottleneck provides a natural explanation.
A block-based code requires decoding a stored memory trace for the entire analog signal (for language, typically acoustic) – that is, the whole block. This is straightforward for artificial computing systems, where memory interference is no obstacle. But this type of block coding is, of course, precisely what the Now-or-Never bottleneck rules out. The human perceptual system must turn the acoustic input into a (lossy) compressed form right away, or else the acoustic signal is lost forever. Similarly, the speech production system cannot decide to send a single, lengthy analog signal and then successfully reel off the lengthy corresponding sequence of articulatory instructions, because this would vastly exceed our memory capacity for sequences of actions. Instead, the acoustic signal must be generated and decoded incrementally, so that the symbolic information to be transmitted maps, fairly locally, onto portions of the acoustic signal. Thus, to an approximation, whereas individual phonemes acoustically exhibit enormous contextual variation, diphones (pairs of phonemes) provide a fairly stable acoustic signal, as evidenced by their use in tolerably good speech synthesis and recognition (e.g., Jurafsky et al. 2000). Overall, then, each successive segment of the analog acoustic input must correspond to a part of the symbolic code being transmitted. This is not because of considerations of informational efficiency but because of the brain's processing limitations in encoding and decoding: specifically, the Now-or-Never bottleneck.
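The contrast between block coding and the incremental, local coding forced by the Now-or-Never bottleneck can be sketched as follows. This is a toy illustration only: the “analog signal” is a string, and the segment-to-symbol mapping is invented, loosely analogous to diphones. The point is that a block decoder must buffer the entire signal before emitting anything, whereas a local decoder emits a symbol as soon as each short segment is complete and can discard the raw signal immediately.

```python
# Toy contrast: block decoding vs. incremental, local decoding.
SEGMENT_TO_SYMBOL = {"aa": "p", "ab": "t", "ba": "k", "bb": "s"}  # hypothetical code

def block_decode(signal):
    # Must hold the *whole* signal in memory before decoding can begin.
    buffer = "".join(signal)
    return [SEGMENT_TO_SYMBOL[buffer[i:i + 2]] for i in range(0, len(buffer), 2)]

def incremental_decode(signal):
    # Decodes each segment as soon as it is complete; raw input is then discarded.
    pending, out = "", []
    for sample in signal:
        pending += sample
        if len(pending) == 2:
            out.append(SEGMENT_TO_SYMBOL[pending])  # chunk-and-pass upward
            pending = ""                            # the analog trace is gone
    return out

signal = "aabbba"
assert block_decode(signal) == incremental_decode(signal) == ["p", "s", "k"]
```

Both decoders recover the same symbols, but only the incremental one does so with a memory footprint of a single segment, which is the regime the Now-or-Never bottleneck imposes.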
The need rapidly to encode and decode implies that spoken language will consist of a sequence of short sound-based units (the precise nature of these units may be controversial, and may even differ between languages, but candidates include diphones, phonemes, moras, and syllables). Similarly, in speech production, the Now-or-Never bottleneck rules out planning and executing a long articulatory sequence (as in a block code used in communication technology); rather, speech must be planned incrementally, in a Just-in-Time fashion, requiring that the speech signal correspond to sequences of discrete sound-based units.
6.1.4. Duality of patterning
Our perspective has yet further intriguing implications. Because the Now-or-Never bottleneck requires that symbolic information be rapidly read off the analog signal, the number of such symbols will be severely limited – and, in particular, may be much smaller than the vocabulary of a typical speaker (many thousands or tens of thousands of items). This implies that the short symbolic sequences into which the acoustic signal is initially recoded cannot, in general, be bearers of meaning; instead, the primary bearers of meaning (lexical items and morphemes) will be composed out of these smaller units.
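The combinatorial arithmetic behind this point is straightforward: even a modest inventory of meaningless units yields vastly more possible composed forms than any speaker's vocabulary requires. A toy calculation, where the inventory size and length bound are purely illustrative assumptions:

```python
# How many word forms can a small inventory of meaningless sound units generate?
inventory = 30    # roughly phoneme-sized inventory (illustrative figure)
max_length = 5    # word forms of one to five units (illustrative bound)

forms = sum(inventory ** n for n in range(1, max_length + 1))
print(forms)  # → 25137930, dwarfing a vocabulary of tens of thousands of words
```

The asymmetry is the crux: a handful of recodable units suffices, combinatorially, to index an entire lexicon, so meaning need not (and, given the bottleneck, cannot) be carried by the units themselves.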
Thus, the Now-or-Never bottleneck provides a potential explanation for a puzzling but ubiquitous feature of human languages, including signed languages. This is duality of patterning: the existence of one or more levels of symbolically encoded sound structure (whether described in terms of phonemes, moras, or syllables) from which the level of words and morphemes (over which meanings are defined) is composed. Such patterning arises, in the present analysis, as a consequence of rapid online multilevel chunking in both speech production and perception. In the absence of duality of patterning, the acoustic signal corresponding, say, to a single noun could not be recoded incrementally as it is received (Warren & Marslen-Wilson 1987) but would have to be processed as a whole, thus dramatically overloading sensory memory.
It is, perhaps, also of interest to note that the other domain in which people process enormously complex acoustic input – music – also typically consists of multiple layers of structure (notes, phrases, and so on; see, e.g., Lerdahl & Jackendoff 1983; Orwin et al. 2013). We may conjecture that Chunk-and-Pass processing operates for music as well as language, thus helping to explain why our ability to process musical input spectacularly exceeds our ability to process arbitrary sequential acoustic material (Clément et al. 1999).
6.1.5. The quasi-regularity of linguistic structure
We have argued that the Now-or-Never bottleneck implies that language processing involves applying highly local Chunk-and-Pass operations across a range of representational levels, and that language acquisition involves learning to perform such operations. But, as in the acquisition of other skills, learning from such specific instances does not operate by rote but leads to generalization and hence modification from one instance to another (Goldberg 2006). Indeed, such processes of local generalization are ubiquitous in language change, as we have noted above. From this standpoint, we should expect the rule-like patterns in language to emerge from generalizations across specific instances (see, e.g., Hahn & Nakisa 2000, for an example of this approach to inflectional morphology in German); once entrenched, such rule-like patterns can, of course, be applied quite broadly to newly encountered cases. Thus, patterns of regularity in language will emerge locally and bottom-up, from generalizations across individual instances, through processes of language use, acquisition, and change.
We should therefore expect language to be quasi-regular across phonology, morphology, syntax, and semantics – an amalgam of overlapping and partially incompatible patterns, reflecting the variety of linguistic forms from which successive language learners generalize. For example, English past tense morphology famously has the regular –ed ending alongside a range of subregularities (sing → sang, ring → rang, spring → sprang, but fling → flung, wring → wrung, and even bring → brought); some verbs have the same present and past tense forms (e.g., cost → cost, hit → hit, split → split), whereas others differ wildly (e.g., go → went; am → was; see, e.g., Bybee & Slobin 1982; Pinker & Prince 1988; Rumelhart & McClelland 1986). This quasi-regular structure (Seidenberg & McClelland 1989) does indeed seem to be widespread throughout many aspects of language (e.g., Culicover 1999; Goldberg 2006; Pierrehumbert 2002).
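The layering of exceptions, subregularities, and a productive default can be caricatured computationally. The sketch below is a deliberately simplistic expository device in the spirit of the past-tense debate cited above, not an implementation of any of the cited models (and it ignores orthographic details such as consonant doubling):

```python
# A caricature of quasi-regular past-tense formation:
# entrenched exceptions first, then a subpattern, then the regular rule.

IRREGULARS = {"go": "went", "am": "was", "bring": "brought",
              "hit": "hit", "cost": "cost", "split": "split"}
SUBREGULAR = {"sing": "sang", "ring": "rang", "spring": "sprang",
              "fling": "flung", "wring": "wrung"}

def past_tense(verb):
    if verb in IRREGULARS:      # item-specific, fully entrenched forms
        return IRREGULARS[verb]
    if verb in SUBREGULAR:      # overlapping, partially incompatible subpattern
        return SUBREGULAR[verb]
    return verb + "ed"          # productive default, applied to novel verbs

print(past_tense("go"), past_tense("fling"), past_tense("walk"))  # went flung walked
```

The ordering of the lookups mirrors the claim in the text: regularity is the residue left after item-based knowledge has been consulted, rather than a rule applied uniformly from the top down.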
From a traditional, generative perspective on language, such quasi-regularities are puzzling: Natural language is assimilated, somewhat by force, to the structure of a formal language with a precisely defined syntax and semantics – and the ubiquitous departures from such regularities are mysterious. From the present standpoint, by contrast, the quasi-regular structure of language arises in rather the same way that a partially regular pattern of tracks is laid down across a forest, through the overlaid traces of countless agents each finding the path of local least resistance; each language processing episode tends to facilitate future, similar, processing episodes, just as an animal's choice of path facilitates the use of that path by the animals that follow.
6.2. What is linguistic structure?
Chunk-and-Pass processing can be viewed as having an interesting connection with traditional linguistic notions. In both production and comprehension, the language system creates a sequence of chunking operations, which link different linguistic units together across multiple levels of structure. That is, the syntactic structure of a given utterance is reflected in its processing history. This conception is reminiscent of previous proposals in which syntax is viewed as a control structure for guiding semantic interpretation (e.g., Ford et al. 1982; Kempson et al. 2001; Morrill 2010). For example, in describing his incremental parser-interpreter, Pulman (1985) noted, “Syntactic information is used to build up the interpretation and to guide the parse, but does not result in the construction of an independent level of representation” (p. 132). Steedman (2000) adopted a closely related perspective when introducing his combinatory categorial grammar, which aims to map surface structure directly onto logic-based semantic interpretations, given rich lexical representations of words that include information about phonological structure, syntactic category, and meaning: “… syntactic structure is merely the characterization of the process of constructing a logical form, rather than a representational level of structure that actually needs to be built …” (p. xi). Thus, in these accounts, the syntactic structure of a sentence is not explicitly represented by the language system but plays the role of a processing “trace” of the operations used to create or interpret the sentence (see also O'Grady 2005).
To take an analogy from constructing objects rather than sentences, the process by which the components of an IKEA-style flat-pack cabinet are combined provides a “history” (combine a board, handle, and screws to construct the doors; combine frame and shelf to construct the body; combine doors, body, and legs to create the finished cabinet). The history by which the cabinet was constructed may thus reveal the intricate structure of the finished item, but this structure need not be explicitly represented during the construction process. Thus, we can “read off” the syntactic structure of a sentence from its processing history, revealing the syntactic relations between various constituents (likely with a “flat” structure; Frank et al. 2012). Syntactic representations are computed neither in comprehension nor in production; instead, there is just a history of processing operations. That is, we view linguistic structure as processing history. Importantly, this means that syntax is not privileged but is only one part of the system – and it is not independent of the other parts (see also Fig. 2).
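The idea that structure is a trace of operations rather than a stored object can be illustrated with a toy chunker. This is purely expository (the unit names and the example chunking order are our own invention): the “parse” is nothing more than the ordered log of chunking steps, from which a tree-like description could be read off but is never stored as such.

```python
# Toy illustration: linguistic structure as processing history.
# Each step chunks a few units into a named unit; the "syntax" of the result
# is just the ordered log of operations, not a separately stored tree.

def chunk(log, name, parts):
    log.append((name, tuple(parts)))   # record the operation, then pass the chunk up
    return name

log = []
np1 = chunk(log, "NP", ["the", "key"])
pp = chunk(log, "PP", ["to", chunk(log, "NP", ["the", "cabinets"])])
subj = chunk(log, "NP", [np1, pp])
chunk(log, "S", [subj, "was", "rusty"])

print(log)   # the processing history *is* the structural description
```

Nothing in `log` is consulted to build later chunks; each operation uses only the chunk labels passed up so far, mirroring the claim that structure is recoverable from, but not represented during, processing.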
In this view, a rather minimal notion of grammar specifies how the chunks from which a sentence is built can be composed. There may be several, or indeed many, orders in which such combinations can occur, just as operations for furniture assembly may be carried out somewhat flexibly (but not completely without constraints – it might turn out that the body must be screwed together before a shelf can be attached). In the context of producing and understanding language, the process of construction is likely to be much more constrained: Each new “component” is presented in turn, and it must be used immediately or it will be lost. Moreover, viewing Chunk-and-Pass processing as an aspect of skill acquisition, we might expect that the precise nature of chunks may change with expertise: Highly overlearned material might, for example, gradually come to be treated as a single chunk (see Arnon & Christiansen, submitted, for a review).
Crucially, as with other skills, the cognitive system will tend to be a cognitive miser (Fiske & Taylor 1984), generally following a principle of least effort (Zipf 1949). As processing proceeds, there is an intricate interplay of top-down and bottom-up processing to alight on the message as rapidly as possible. The language system need only construct enough chunk structure so that, when combined with prior discourse and background knowledge, the intended message can be inferred incrementally. This observation relates to some interesting contemporary linguistic proposals. For example, from a generative perspective, Culicover (2013) highlighted the importance of incremental processing, arguing that the interpretation of a pronoun depends on which discourse elements are available when it is encountered. This implies that the linear order of words in a sentence (rather than hierarchical structure) plays an important role in many apparently grammatical phenomena, including weak crossover effects in referential binding. From an emergentist perspective, O'Grady (2015) similarly emphasized the importance of real-time processing constraints for explaining differences in the interpretation of reflexive pronouns (himself, themselves) and plain pronouns (him, them). The former are resolved locally, and thus almost instantly, whereas the antecedents of the latter are searched for within a broader domain (causing problems in acquisition because of a bias toward local information).
More generally, our view of linguistic structure as processing history offers a way to integrate the formal linguistic contributions of construction grammar (e.g., Croft 2001; Goldberg 2006) with the psychological insights from usage-based approaches to language acquisition and processing (e.g., Bybee & McClelland 2005; Tomasello 2003). Specifically, we propose to view constructions as computational procedures (Footnote 19) – specifying how to process and produce a particular chunk – where we take a broad view of constructions as involving chunks at different levels of linguistic representation, from morphemes to multiword sequences. A procedure may integrate several different aspects of language processing or production, including chunking acoustic input into sound-based units (phonemes, syllables), mapping a chunk onto meaning (or vice versa), incorporating pragmatic or discourse information, and associating a chunk with specific arguments (see also O'Grady 2005; 2013). As with other skills (e.g., Heathcote et al. 2000; Newell & Rosenbloom 1981), there will be practice effects, whereby the repeated use of a given chunk results in faster processing and reduced demands on cognitive resources and, with sufficient use, leads to a high degree of automaticity (e.g., Logan 1988; see Bybee & McClelland 2005, for a linguistic perspective).
In terms of our previous forest track analogy, the more a particular chunk is comprehended or produced, the more entrenched it becomes, resulting in easier access and faster processing; tracks become more established with use. With sufficiently frequent use, adjacent tracks may blend together, creating somewhat wider paths. For example, the frequent processing of simple transitive sentences, processed individually as multiword chunks, such as “I want milk” and “I want candy,” might first lead to a wider track involving the item-based template “I want X.” Repeated use of this template along with others (e.g., “I like X,” “I see X”) might eventually give rise to a more abstract transitive generalization along the lines of N V N (a highway, in terms of our track analogy). Similar accounts of the emergence of basic word order patterns have been proposed within both emergentist (e.g., O'Grady 2005; 2013; Tomasello 2003) and generative perspectives (e.g., Townsend & Bever 2001). Importantly, however, just as with generalizations in perception and motor skills, these grammatical abstractions are not explicitly represented but result from the merging of item-based procedures for chunking. Consequently, there is no representation of grammatical structure separate from processing. Learning to process is learning the grammar.
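The first step of this progression, from stored multiword chunks to an item-based template, can be sketched as a simple frequency-driven generalization. This is a toy procedure of our own devising, not the authors' model: chunks that share all but their final position are merged into a slot-based template once the variable position has been filled by enough distinct items (the threshold of three is arbitrary).

```python
from collections import defaultdict

# Toy item-based generalization: merge stored multiword chunks that share
# their initial words into a slot-based template like "I want X".

chunks = ["I want milk", "I want candy", "I want juice",
          "I like milk", "I see you"]

frames = defaultdict(set)
for chunk in chunks:
    *frame, last = chunk.split()       # split off the final, variable position
    frames[tuple(frame)].add(last)

# A frame becomes a template once its slot has been filled by >= 3 distinct items.
templates = {" ".join(f) + " X" for f, fillers in frames.items() if len(fillers) >= 3}
print(templates)  # → {'I want X'}
```

Further merging across templates (“I want X,” “I like X,” …) toward an abstract N V N pattern would follow the same logic at a higher level, which is the sense in which the “highway” emerges from overlapping tracks rather than being laid down in advance.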
7. Conclusion
The perspective developed in this article sees language as composed of a myriad of specific processing episodes, where particular messages are conveyed and understood. Like other action sequences, linguistic acts have their structure in virtue of the cognitive mechanisms that produce and perceive them. We have argued that the structure of language is, in particular, strongly affected by a severe limitation on human memory: the Now-or-Never bottleneck. Sequential information, at many levels of analysis, must rapidly be recoded to avoid being interfered with or overwritten by the deluge of subsequent material. To cope with the Now-or-Never bottleneck, the language system chunks new material as rapidly as possible at a range of increasingly abstract levels of representation. As a consequence, Chunk-and-Pass processing induces a multilevel structure over linguistic input. The history of the process of chunk building can be viewed as analogous to a shallow surface structure in linguistics, and the repertoire of possible chunking mechanisms and the principles by which they can be combined can be viewed as defining a grammar. Indeed, we have suggested that chunking procedures may be one interpretation of the constructions that are at the core of linguistic theories of construction grammar. More broadly, the Now-or-Never bottleneck promises to provide a framework within which to reintegrate the language sciences, from the psychology of language comprehension and production, to language acquisition, language change, and evolution, to the study of language structure.
ACKNOWLEDGMENTS
We would like to thank Inbal Arnon, Amui Chong, Brandon Conley, Thomas Farmer, Adele Goldberg, Ruth Kempson, Stewart McCauley, Michael Schramm, and Julia Ying, as well as seven anonymous reviewers, for comments on previous versions of this article. This work was partially supported by BSF grant number 2011107, awarded to MHC (and Inbal Arnon), and by ERC grant 295917-RATIONALITY, the ESRC Network for Integrated Behavioural Science, the Leverhulme Trust, and Research Councils UK Grant EP/K039830/1, awarded to NC.