1 Introduction
For decades now, language researchers have been attempting to explain the observation that people who learn a second language (L2) later in life tend to have poorer ultimate attainment than those who learn the same language earlier in life; for an illustration of the pattern, see Figure 1a. Cross-linguistically, there is a clear downward trend in many, although not all, measures of language proficiency as age of acquisition increases (DeKeyser, Reference DeKeyser, Gass and Mackey2012). This phenomenon has been referred to by many names, usually based on the author's thoughts on the phenomenon's likely cause. Since human maturational processes are widely implicated in first language (L1) acquisition, many suspect similar developmental processes to be largely responsible for these observed age effects on L2 acquisition, often referring to a “critical” or “sensitive period” for language learning. Others, who view the issue as a problem inherent in the process of learning, speak of cross-linguistic interference or entrenchment effects. Still others couch the problem in terms of individual differences of the language learners and quality and form of the L2 input. While there is support for all of these accounts of this phenomenon, it is generally difficult to study any of these potential causes in isolation.
In this study we use a neural network model to investigate the individual and compound effects that two of these potential causes of sensitive periods have on ultimate attainment of a learner's first and second languages. The first factor we will consider, entrenchment, can best be understood as previous knowledge that is difficult to change and can perhaps only be altered slowly, thus interfering with the rapid acquisition of newly available information. In this scenario, the longer the learner is exposed to their native language before a second language is introduced, the more their L1 becomes entrenched, making the novel rules and patterns of an L2 more difficult to learn (Hernandez, Li & MacWhinney, Reference Hernandez, Li and MacWhinney2005). The second factor we consider is the development of aspects of memory, specifically working memory capacity and long-term memory capacity, as implemented by the periodic addition of new units and connections, respectively, to our neural network model. Working memory development is particularly interesting in light of evidence, such as that shown in Figure 1b, that a period of rapid growth of working memory capacity coincides with a period of rapid deterioration of L2 learning ability.
Using only experimentation on human subjects, it is difficult to get a complete picture of the relative contributions of entrenchment and development. While there are exceptions, specifically in the sign language domain, language learning almost invariably starts very early in life, causing L1 acquisition and early L2 acquisition to coincide with many aspects of development. Thus, the contributions of these two factors to the observed differences in ultimate attainment between early and late L2 learners cannot be readily separated from each other. With a computational model, on the other hand, we can examine the interaction of our two chosen factors from all sides, describing the effects of each in isolation as well as their combined impact.
Of course, at present, a computer model cannot learn an entire natural language as human learners can. As such, we chose to model the linguistic sub-tasks of gender assignment and agreement. The factors guiding this choice of tasks included the fact that native and non-native speakers of a language tend to differ significantly in their command of grammatical gender, as well as the fact that ultimate attainment tends to vary with age of acquisition. Our model learns to perform gender assignment and gender agreement tasks from naturalistic training data based on word co-occurrence, without having any built-in knowledge of the existence or form of grammatical gender and without being given explicit instruction in the genders of particular words or phrases. Our goal with this model is to provide a better understanding of how the two potential factors we have chosen to study, entrenchment and memory development, contribute individually and in tandem to differences in ultimate language attainment.
Our experiments investigate two related but independent hypotheses. The first of these concerns the effects of language entrenchment: We expect that as the level of L1 entrenchment goes up, L2 learning ability goes down, at least up until some point of maximal entrenchment where the effect levels off. The second line of inquiry concerns the less-is-more hypothesis (Newport, Reference Newport1988, Reference Newport1990) that states that a learner in the early stages of working memory development will find L2 learning easier than a learner with a fully developed working memory. Our simulations investigate these two hypotheses individually and in tandem, to a greater extent than is normally possible in empirical studies. Additionally, we are able to investigate more specific distinctions within the less-is-more hypothesis, discriminating the effects due to starting small from those due to addition of fresh memory resources.
The remainder of the paper is structured as follows. Section 2 reviews previous research on sensitive period phenomena and relationships to the acquisition of grammatical gender. We also review hypotheses relating sensitive periods to working memory and to L1 entrenchment. Section 3 gives an overview first of neural networks in general and then of the specific neural network model studied herein, including all relevant variations. Section 4 describes separate experiments and results for the gender assignment and gender agreement tasks. Finally, in Section 5, we discuss the implications of our findings.
2 Background
2.1 Sensitive periods
Since Lenneberg (Reference Lenneberg1967) first used the term critical period in the context of human language development, a considerable amount of evidence has accumulated that shows a marked decline in the ultimate outcome (not the speed) of language acquisition as age of onset varies from early childhood to late adolescence. This decline has been documented in numerous studies, for both L1 and L2 development, for both spoken and signed languages, and for phonology as well as morphology and syntax (for overviews, see DeKeyser, Reference DeKeyser, Gass and Mackey2012; Hyltenstam & Abrahamsson, Reference Hyltenstam, Abrahamsson, Doughty and Long2003).
Numerous questions remain, however, at least where the L2 is concerned. The most debated one is whether the age effects observed are truly maturational or due to confounds with other variables (e.g., DeKeyser & Larson-Hall, Reference DeKeyser, Larson-Hall, Kroll and de Groot2005; Long, Reference Long2005). Most commonly mentioned in the discussion of potential confounds are the extent of L1 entrenchment (e.g., MacWhinney, Reference MacWhinney, Han and Odlin2006), the quantity and quality of input and practice in L2 (e.g., Jia & Aaronson, Reference Jia and Aaronson2003; Jia, Aaronson & Wu, Reference Jia, Aaronson and Wu2002), the extent to which the learner is motivated to sound like a native speaker (e.g., Bley-Vroman, Reference Bley-Vroman, Rutherford and Sharwood Smith1988), and the extent to which formal education took place in the L2 (e.g., Hakuta, Bialystok & Wiley, Reference Hakuta, Bialystok and Wiley2003). An equally important question concerns the nature of inter-individual variation (e.g., whether high levels of some forms of aptitude mitigate the effect of age of onset; Abrahamsson & Hyltenstam, Reference Abrahamsson and Hyltenstam2008; DeKeyser, Alfi-Shabtay & Ravid, Reference DeKeyser, Alfi-Shabtay and Ravid2010). Finally, there is the question of intra-individual variation depending on the aspects of grammar or pronunciation concerned. In the area of grammar, syntax may be less sensitive to age effects than morphology (Johnson & Newport, Reference Johnson and Newport1989), regulars less than irregulars (see Hudson Kam & Newport, Reference Hudson Kam and Newport2005, Reference Hudson Kam and Newport2009), and salient structures less than non-salient ones (DeKeyser, Reference DeKeyser2000). Even for a given structure, age effects may be detected with ERP without showing up in the behavioral data (e.g., for subject–verb agreement in Chen, Shu, Liu, Zhao & Li, Reference Chen, Shu, Liu, Zhao and Li2007). In the area of pronunciation, phonetic detail such as precise voice onset time (Abrahamsson & Hyltenstam, Reference Abrahamsson and Hyltenstam2009) seems particularly problematic for older learners. Some phonetic cues to phonemic status may be easier to pick up than others (vowel duration being easier than closure duration; Baker, Reference Baker2010); some suprasegmentals such as stress timing may be less sensitive to age than others (Trofimovich & Baker, Reference Trofimovich and Baker2006); and age may even affect different kinds of stress placement differently, the effect being strongest for stress determined by syllable structure (Guion, Harada & Clark, Reference Guion, Harada and Clark2004). In sign language, handshape may be more resistant to age effects than location or movement (Morford & Carlson, Reference Morford and Carlson2011).
Those researchers who suspect that sensitive periods are maturational in nature have couched their causal explanations in both neurological and psychological terms. Neurological explanations have evolved over time from hemispheric specialization (e.g., Lenneberg, Reference Lenneberg1967) to myelination (e.g., Long, Reference Long1990) to varying rates of neurogenesis, synaptogenesis, or synaptic pruning (e.g., Uylings, Reference Uylings2006). Some of these explanations have focused on the brain as a whole, while others focus more on specific areas such as the prefrontal cortex (e.g., Petanjek, Judas, Kostović & Uylings, Reference Petanjek, Judaš, Kostović and Uylings2008); the amygdala (e.g., Pulvermüller & Schumann, Reference Pulvermüller and Schumann1994); or the hippocampus, medial temporal lobe, and the basal ganglia (e.g., Ullman, Reference Ullman2004). Psychological explanations, rather surprisingly, came onto the scene later and have included growth of working memory capacity (the less-is-more hypothesis, e.g., Newport, Reference Newport1990), increased susceptibility to proactive interference (e.g., Iverson, Kuhl, Akahane-Yamada, Diesch, Tohkura, Kettermann & Siebert, Reference Iverson, Kuhl, Akahane-Yamada, Diesch, Tohkura, Kettermann and Siebert2003), and gradual shifts from predominantly procedural/implicit to predominantly declarative/explicit processes (e.g., DeKeyser, Reference DeKeyser2000; Paradis, Reference Paradis2009; Ullman, Reference Ullman2004). Ultimately, of course, full explanatory adequacy will only be reached if psychological mechanisms can be tied to concurrent neurological developments that together explain the specific learning differences observed.
Empirical research on these issues is usually difficult for many reasons, in large part because the natural confounds of many of the variables involved cannot be experimentally disentangled in research on human learners. Perhaps the only notable exception is in the study of age of acquisition effects in sign language research (Mayberry, Lock & Kazmi, Reference Mayberry, Lock and Kazmi2002), which is discussed in detail in the supplementary material Section S.1.
2.2 Grammatical gender
The linguistic phenomenon that our models will learn about is grammatical gender, which refers to an arbitrary classification of nouns, often marked by phonological, morphological, and/or semantic properties. Several studies suggest that grammatical gender is subject to sensitive periods. Studies with adult L2 learners of languages like French (Guillelmon & Grosjean, Reference Guillelmon and Grosjean2001), Spanish (Lew-Williams & Fernald, Reference Lew-Williams and Fernald2010) and German (Scherag et al., Reference Scherag, Demuth, Rösler, Neville and Röder2004) have shown that non-native adults are slower than L1 speakers at processing nouns, and that their processing does not seem to benefit from patterns in gender agreement that are present in the language. Even childhood learners who begin acquiring French in an immersion program at age six do not achieve native-like gender agreement (Harley, Reference Harley1979; Lapkin & Swain, Reference Lapkin and Swain1977), indicating that acquisition of this grammatical component is subject to early age effects. For a more detailed background of the acquisition of grammatical gender, see Section S.2 in the supplementary material.
Gender systems vary in complexity; many Indo-European languages, such as French and German, divide nouns into only two or three gender classes, whereas Bantu languages employ extensive gender systems with up to twenty gender classes (Corbett, Reference Corbett1991). The degree to which grammatical gender is marked throughout a sentence also varies widely. In English, for example, gender is only marked on pronominals with animate reference, whereas gender in the Bantu language Swazi may be marked on adjectives, verbs, adverbs, numerals, and conjunctions.
The languages examined in the current study, French and Spanish, both assign masculine and feminine gender to all nouns; however, subtle differences between the gender classification systems exist. In French, a noun's final phoneme provides cues to gender, though the predictive value of the final phoneme is not always reliable. For example, according to Surridge (Reference Surridge1993, Reference Surridge1995), only one “feminine” ([z]), and eight “masculine” ([], [], [ã], [ø], [o], [ʒ], [m], [ɛ]) final phonemes indicate gender with more than 90% accuracy; eight “masculine” ([f], [u], [a], [ʁ], [g], [y], [k], [b]) and nine “feminine” ([i], [], [n], [v], [j], [ʃ], [d], [s], [ɲ]) final phonemes indicate gender with 60–89% accuracy; and four final phonemes ([l], [m], [p], [t]) are considered ambiguous and do not provide any indication of the noun's gender. In addition, not everyone agrees with the phonemes’ predictability values. For example, Lyster (Reference Lyster2006) carried out a final phoneme predictive value analysis based on a corpus different from that of Surridge, and while the results are largely similar, some differences exist. Furthermore, the effect of a noun's phonological ending may be overridden by the noun's morphological ending (Surridge, Reference Surridge1989). Under this hierarchy, a word ending in the typically masculine final phoneme [ʁ] will be feminine when encompassed by the typically feminine morphological suffix -ure, as in coiffure “hairstyle”. Overall, the French gender system is governed by patterns, but it is a complex system with many exceptions.
The Spanish gender system is less complex and more reliable than that of French. According to Teschner and Russell (Reference Teschner and Russell1984), the majority of Spanish nouns’ final phonemes are predictive of gender. Specifically, 90% of nouns ending in the phonemes [a] and [d] are feminine, and 89% of nouns ending in [e], [l], [o], and [ɾ] – which account for the majority of nouns – and also [i], [m], [t], [u], [x], [y], [b], [c], [tʃ] are masculine. Only three final phonemes, [n], [θ], and [s], are considered ambiguous in that they do not predict one gender over another. Morphological gender regularities in Spanish also exist, though they do not override final phonemes, as seen in French. Teschner and Russell identify seven morphological endings that are typically feminine (-ción, -gión, -nión, -sión, -tión, -xión, and -ez) and four morphological endings that are typically masculine (-ón, -az, -oz, and -uz). Note that these morphological endings encompass two of the ambiguous final phonemes, [n] and [θ], but not phonemes that are predictive of masculine or feminine. Finally, in both languages, animate nouns referring to humans assume semantic gender, so that the words for “man” and “woman” are masculine and feminine, respectively.
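To make these final-phoneme regularities concrete, the following minimal Python sketch encodes the Spanish cues summarized above as a simple lookup. It is purely illustrative (ASCII symbols stand in for the IPA phonemes, and the helper name is invented); it is not part of the network model developed in this paper.

# Illustrative lookup of Spanish final-phoneme gender cues, following the
# regularities of Teschner and Russell (1984) summarized above. ASCII symbols
# stand in for IPA; this toy predictor is not part of our network model.

FEMININE_FINALS = {"a", "d"}                                     # ~90% feminine
MASCULINE_FINALS = {"e", "l", "o", "r",                          # ~89% masculine
                    "i", "m", "t", "u", "x", "y", "b", "c", "ch"}
AMBIGUOUS_FINALS = {"n", "z", "s"}                               # not predictive

def gender_from_final_phoneme(phoneme):
    """Return the gender suggested by a noun's final phoneme, if any."""
    if phoneme in FEMININE_FINALS:
        return "feminine"
    if phoneme in MASCULINE_FINALS:
        return "masculine"
    return "ambiguous"

print(gender_from_final_phoneme("a"))   # feminine  (e.g., mesa)
print(gender_from_final_phoneme("o"))   # masculine (e.g., libro)
print(gender_from_final_phoneme("n"))   # ambiguous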
Both French and Spanish mark gender on determiners, pronouns, and adjectives. Examples of determiner and adjective markings are shown in sentences 1 and 2.
(1) “The little book is white.”
French: Le petit livre (masc) est blanc.
Spanish: El libro (masc) pequeño es blanco.
(2) “The little table is white.”
French: La petite table (fem) est blanche.
Spanish: La mesa (fem) pequeña es blanca.
French adjectives may end in almost any phoneme, with the feminine adjective typically marked by an additional and often unpredictable suffix. For example, the adjective blanc [blã] “white, masc” becomes blanche [blãʃ] in its feminine form, and petit [pəti] “small, masc” becomes petite [pətit]. A number of adjectives have the same phonological form for both masculine and feminine, even when the orthographic form differs. For example, the adjective “difficult” has only one orthographic (difficile) and phonological [difisil] form, and the adjective “expensive”, while represented by two orthographic forms (cher, masc; chère, fem), is pronounced [ʃɛʁ] in both.
Spanish adjective formation, on the other hand, is more predictable. The majority of adjectives are marked by an -o ending for masculine, and an -a ending for feminine, as in blanco/blanca “white”. As in French, not all adjectives have distinct orthographic and phonological masculine and feminine forms. Adjectives ending in -e, -ista, or a consonant, generally maintain the same form in both masculine and feminine, as in verde “green”, idealista “idealist”, and difícil “difficult”. However, exceptions exist and certain types of adjectives ending in a consonant, such as those referring to nationalities, have a feminine form marked by an -a ending, as in español/española “Spanish” and alemán/alemana “German”. Other exceptions include adjectives ending in -ín, -ón, -or, such as juguetón/juguetona “playful” and hablador/habladora “talkative”.
Despite the differences described above, the French and Spanish gender systems are similar in that both classify nouns into masculine and feminine based on phonological regularities, and gender is marked throughout a sentence on determiners, adjectives, and pronouns.
2.3 Memory development and language learning
Our neural network models undergo memory development, in the form of changes in both working memory capacity and long-term memory capacity, in order to examine the contribution of maturation to sensitive period effects. In the simplest terms, working memory (used interchangeably here with short-term memory) allows pieces of information to be held in the mind for brief periods of time in the absence of the input that caused them. In reality, working memory is most likely composed of a complex interaction of factors, such as attention (Conway, Cowan & Bunting, Reference Conway, Cowan and Bunting2001; Engle, Reference Engle2002; Kane & Engle, Reference Kane and Engle2003), inhibition or filtering mechanisms (Vogel, McCollough & Machizawa, Reference Vogel, McCollough and Machizawa2005), rehearsability (Baddeley, Reference Baddeley2003; Gathercole & Baddeley, Reference Gathercole and Baddeley1993; Wilson & Emmorey, Reference Wilson and Emmorey1997), and “chunking” strategies (Miller, Reference Miller1956). Thus, although working memory is probably not a unitary construct, the core ability to store and integrate multiple items is critical to many aspects of cognitive functioning, including language processing. Working memory capacity refers to the number of items that can be stored and manipulated for a task. In general, higher capacities are associated with better cognitive function (Baddeley, Reference Baddeley2003; Duncan, Seitz, Kolodny, Bor, Herzog & Ahmed, Reference Duncan, Seitz, Kolodny, Bor, Herzog and Ahmed2000) since lower capacities impose greater informational bottlenecks on processing. In development, working memory capacity grows rapidly from early childhood into adolescence, showing up to a three-fold increase (see Gathercole, Reference Gathercole1999). This presents a paradox for language acquisition since higher cognitive function associated with higher memory capacity seems to be inversely correlated with overall language learning ability.
However, this is only a paradox if only the end state of development is considered. In reality, the maturation of working memory as well as language learning occur through time. One possibility is that limited cognitive ability, in particular a small memory capacity, is crucial to early stages of language acquisition, and that memory growth supports full language acquisition. Newport's (Reference Newport1988, Reference Newport1990) less-is-more hypothesis draws upon data from cases where age of acquisition is not confounded with L1 entrenchment: the large proportion of deaf individuals who are not exposed to an accessible form of language early in life. During the language acquisition process and at final language attainment, these late learners have distinct profiles from early learners. As seen among hearing children during early stages of acquisition and word production, young signers (who have been exposed to American Sign Language since birth) morphologically simplify complex signs. This stage is considered to be important for morphological analysis of words and signs. Late learners do not make these types of errors or simplifications, rather processing the forms as “unanalyzed wholes” (Newport, Reference Newport1990). As adults, these late learners use these complex forms in both ungrammatical and grammatical contexts, suggesting that they have not successfully learned their internal morphology. Early learners, in contrast, progressively develop the complex forms and do not make these types of mistakes as adults.
If the development of working memory is indeed inextricably linked with language acquisition abilities, there are two possible explanations for this relationship. The first is addressed by the less-is-more hypothesis, where the crucial factor is starting with a smaller working memory capacity (Newport, Reference Newport1990). The rationale is that when a learning system is incapable of processing and holding in memory larger chunks of input, it is forced to analyze the input at a lower level of complexity, picking out the highest-level and most prominent patterns while possibly abstracting away much of the detail. Another potential explanation comes from computational modeling, where it has also been demonstrated that controlling the size of the input, perhaps by providing smaller inputs at the beginning of training, contributes to better learning (Elman, Reference Elman1993). Both accounts have been tested experimentally in adults, where smaller natural working memory or smaller inputs were associated with better detection of correlations between two binary variables (Kareev, Lieberman & Lev, Reference Kareev, Lieberman and Lev1997).
Most studies that directly investigate the relationship between working memory capacity and language learning in children suggest that the development of phonological short-term memory in particular is critical to word learning (Avons, Wragg, Cupplesa & Lovegrove, Reference Avons, Wragg, Cupplesa and Lovegrove1998; Baddeley, Gathercole & Papagno, Reference Baddeley, Gathercole and Papagno1998; Gathercole & Pickering, Reference Gathercole and Pickering2000). Higher spans in phonological short-term memory are linked with larger vocabulary sizes and better performance at learning new words. These working memory capacities are often measured by performance on non-word repetition tasks. However, these correlations leave the causal relationships inconclusive. The ability to temporarily store phonological traces of new utterances may be an important precursor to storing that item in long-term memory. On the other hand, it has been suggested that vocabulary growth leads to a better ability to analyze the representations into phonological segments, which in turn leads to more robust representations of new words (Metsala, Reference Metsala1999). What these two ideas agree on is the importance of the development of decomposed, sublexical representations – such as phonemes – for language learning. Newport (Reference Newport1990) has made a similar argument about morphology. Drawing upon accompanying behavioral evidence that longer words are learned later in development than shorter words even when frequencies of these words are matched, Brown and Hulme (Reference Brown, Hulme and Gathercole1996) demonstrate a computational model in which shorter words are maintained in short-term memory for longer given a limited short-term memory, facilitating encoding in long-term memory. A consequence of forming representations for smaller input first may be a better recognition of incremental patterns throughout learning.
2.4 Connectionist modeling
As discussed in Section 1 above, it is often difficult to experimentally separate the various possible causes of age effects when performing empirical research on human subjects. Computational modeling has a key advantage in its ability to independently manipulate a number of variables and to observe their main effects and interactions. Early attempts at computational modeling of linguistic sensitive periods (Goldowsky & Newport, Reference Goldowsky, Newport and Clark1993) show support for the less-is-more hypothesis in that a smaller working memory was shown to be better for the learning of some grammatical patterns, and this conclusion was supported by later studies, computational and otherwise (Cochran, McDonald & Parault, Reference Cochran, McDonald and Parault1999; Kareev et al., Reference Kareev, Lieberman and Lev1997; Kersten & Earles, Reference Kersten and Earles2001). Neurocomputational modeling studies (reviewed in Hernandez & Li, Reference Hernandez and Li2007) favor explanations of age-related performance deficits in terms of changes in neural plasticity due to the normal accumulation of experience. This idea, that the learning process itself could cause the observed sensitive period effects, is supported by many other modeling studies (reviewed in e.g., Thomas & Johnson, Reference Thomas and Johnson2008) and has been called the “paradox of success” since learning one task to proficiency can harm the learning of other tasks (Seidenberg & Zevin, Reference Seidenberg, Zevin, Munakata and Johnson2006). Sensitive period effects can be produced via the learning process itself in a number of ways, including entrenchment, where early experience leaves the learning system in a state not readily compatible with a new learning task; competition for resources between different tasks to be learned; and catastrophic interference, where a new learning task may impact performance on a previously learned task that is not actively maintained.
Previous neural network models that have dealt with aspects of memory development have used varying approaches to limiting working memory. Elman (Reference Elman1993) trained Simple Recurrent Networks (SRNs) on a complex subset of English. This type of network uses recurrent connections to allow the network to access its own previous states, creating an analog of working memory. Elman found that these networks had better eventual performance when this working memory was initially limited to a discrete window of a few steps and gradually increased, consistent with the less-is-more hypothesis. While others have failed to find a difference between developing and mature networks on similar tasks (e.g., Rohde & Plaut, Reference Rohde and Plaut1999), Elman's study shows one way in which working memory capacity can be modeled in a neural network. As we will explain, our model uses a different approach, directly limiting the capacity of, or physical access to, previous states instead of limiting the network's temporal window of access to these states. Our approach is, in a sense, similar to that of the DevLex models of word and meaning acquisition (Li, Farkas & MacWhinney, Reference Li, Farkas and MacWhinney2004; Li, Zhao & MacWhinney, Reference Li, Zhao and Mac Whinney2007), which utilize growing self-organizing maps to represent semantics and phonology. These maps grow by adding new units to accommodate storage of new lexical and semantic representations; as such, the growth involved more closely resembles long-term memory growth. Our model, in contrast, grows by adding new units that form the substrate for working memory.
There have also been a few notable neural network models that touch on the topic of grammatical gender. MacWhinney, Leinbach, Taraban and McDonald (Reference MacWhinney, Leinbach, Taraban and McDonald1989) presented two neural network models of the acquisition of gender, case, and number in German. Both of these models learned to predict the article associated with a given noun, one using hand-coded semantic, phonological, morphological, and case cues, and the other using only observable data in the form of a complete phonological representation of the input noun along with some semantic and case cues. Both models succeeded at learning the nouns they were trained on, and also generalized very well to new nouns. The second model, without the hand-coded cues, outperformed the first. Unfortunately, the static phonological representations in this model only allow it to be applied to words of two syllables or fewer; our model employs temporal phonological representations that allow any word to be encoded. Additionally, Sokolik and Smith (Reference Sokolik and Smith1992) trained a feed-forward neural network to identify a corpus of French nouns as either masculine or feminine. Their study, however, has been widely criticized (Carroll, Reference Carroll1995; Matthews, Reference Matthews1999) for, among other things, using orthographic input, giving explicit gender feedback, and building in language-specific knowledge about gender classes. We believe that our approach adequately addresses these and other concerns, resulting in a model that only utilizes the information available to language learners.
3 Methods
Our intent in the present work is to use neural network models to understand any sensitive periods that arise due to the effects of cross-linguistic interference and aspects of memory development. The first of these two factors is straightforward to implement: Simply teach a network to perform the same task in two languages. By varying the amount of time before the L2 is introduced, we can vary the expected amount of entrenchment of the L1. The second factor is developmental, and involves changes to a neural network's structure and connectivity over the course of the experiment, above and beyond the connection–weight changes that occur during normal training. So that readers who are perhaps only passingly familiar with neural networks can fully grasp the developmental aspects of the model, we include a primer on neural networks in Section S.3 of the supplementary material.
3.1 Our model
In the present work, we use a type of recurrent neural network architecture called the Long Short Term Memory (or LSTM; Gers & Cummins, Reference Gers and Cummins2000; Gers & Schmidhuber, Reference Gers and Schmidhuber2001; Hochreiter & Schmidhuber, Reference Hochreiter and Schmidhuber1997). The LSTM architecture is similar in many ways to the well-known simple recurrent network (SRN) architecture (Elman, Reference Elman1990), with two notable differences. First, the recurrence in LSTM comes not from a hidden layer and a copy-back context layer as in an SRN, but instead from hidden layer units, called memory cells, that maintain their individual states across time-steps. This difference reflects a computational specialization of LSTM towards use as a substrate for working memory, as the maintenance of information across time is less noisy than in SRNs (Munakata, Reference Munakata2004). Combined with the slow weight changes characteristic of most neural network models, this makes the LSTM architecture well suited to its combined use in this study as a long-term categorization memory for learning the gender assignment and agreement tasks and as a working memory for temporarily storing the information relevant to each individual classification. The second difference is that each memory cell in an LSTM hidden layer is supplemented by a set of up to three additional units which serve to multiplicatively gate the inputs into, outputs from, and state retention of each memory cell. The network can learn to use these multiplicative gates to actively select important information to maintain in working memory while simultaneously reducing the kinds of interference that disrupt important working memory representations. A network composed of memory cells can maintain coherent working memory representations of important inputs for longer periods of time than architectures like the SRN. A more detailed primer on LSTM can be found in supplementary material Section S.4.
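To make the gating mechanism concrete, the following minimal Python sketch implements a single gated memory cell of the general kind described above. It is a simplification for illustration only, assuming toy sizes and omitting peephole connections and weight training; it is not the LSTM-g implementation used in this study.

import numpy as np

# Minimal sketch of one LSTM-style memory cell (not the LSTM-g implementation
# used in this study; names and sizes are illustrative, and peephole
# connections and weight training are omitted). It shows how the input,
# forget, and output gates multiplicatively control what enters, persists in,
# and leaves the cell state across time-steps.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MemoryCell:
    def __init__(self, n_inputs, seed=0):
        rng = np.random.default_rng(seed)
        # One weight vector for the cell input and one per gate (toy sizes).
        self.w_cell, self.w_in, self.w_forget, self.w_out = (
            rng.normal(0.0, 0.1, n_inputs) for _ in range(4))
        self.state = 0.0  # the cell state persists across time-steps

    def step(self, x):
        i = sigmoid(self.w_in @ x)       # input gate: admit new information?
        f = sigmoid(self.w_forget @ x)   # forget gate: retain the previous state?
        o = sigmoid(self.w_out @ x)      # output gate: expose the state downstream?
        self.state = f * self.state + i * np.tanh(self.w_cell @ x)
        return o * np.tanh(self.state)

cell = MemoryCell(n_inputs=12)
for phoneme_features in np.eye(12)[:3]:  # three toy one-hot "phonemes"
    print(cell.step(phoneme_features))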
Our model learns by updating its connection weights based on the principle of gradient descent, utilizing back-propagation of error signals via an algorithm called LSTM-g (Monner & Reggia, Reference Monner and Reggia2012). While back-propagation has widely been regarded as neurobiologically implausible, Xie and Seung (Reference Xie and Seung2003) revealed gradient descent using back-propagation to be equivalent to a method of Hebbian learning utilized in neurobiologically plausible systems such as Leabra (O'Reilly & Frank, Reference OʼReilly and Frank2006). In light of this, it makes sense to view our use of back-propagation as a computationally expeditious equivalent of more neurobiologically plausible learning methods.
Since the aim of our model is to learn gender properties from speech stimuli, our neural network model is given an input layer able to represent one phoneme of speech at a time. The network is presented with a sequence of such phonemes, one after another, with the sequence as a whole representing a word or noun phrase. This is analogous to listening to spoken sentence fragments. The specific network architectures and desired outputs will differ by experiment and, as such, will be described in detail for each case in Section 4 below.
3.2 Development and network architecture
Since one aim of our model is to investigate the influence of development on learning of gender phenomena, we will next discuss the analogues of maturation in neural networks. Most neural network models have a fixed number of units and connections for the duration of training. Training such a network, starting from randomly assigned connection weights, is tantamount to waiting until a human learner is an adult, or at least fully neurologically developed in the relevant areas, before exposing him or her to any language stimuli. To address cases where language learning happens along with development, we also need to examine situations where the network structure develops during training. In the following paragraphs we examine a few ways of doing this.
In addition to the no growth condition, where all of the network's units and connections are present at the start of training, we examine a unit growth condition in which the network begins with a much smaller number of units and connections (see Figure 2, top row). During the training regimen, new units and their associated connections are gradually added to the network until it reaches maturity, i.e. its maximum number of units and connections, equivalent to the numbers present in the no growth condition. Here, a new unit being added to the network is not necessarily analogous to neurogenesis in humans; instead, we take the view that some of the new connections, created through a process analogous to dendritic outgrowth (Uylings, Reference Uylings2006), happen to project to existing units outside our current view of the network, thus recruiting them for use in processing.
The unit growth condition described above confounds two variables of interest on the cognitive level. Recall that the activations of units in a recurrent neural network like ours are the basis of working memory. The network recruits new units during the maturation process, increasing the amount of information it can process at any given instant. We might reasonably expect this to correlate with an increase in cognitive measures of working memory capacity during training. Since these networks start with a small working memory and increase its capacity during training, we can evaluate the less-is-more hypothesis (Newport, Reference Newport1990) for our model. From our perspective, this hypothesis admits two distinct and independently controllable factors that could lead to better final language performance: (i) starting with a small working memory, and (ii) allocation of new working memory resources during learning. Our unit growth condition possesses both factors, so to investigate them separately, we introduce a third network development condition, termed unit replacement, that has only the second factor. This condition is not intended to correspond to human maturation; rather, it is included merely as a control to help us separate the effects of starting small from the effects of introduction of untrained resources. In this condition, depicted in Figure 2 (middle row), the network starts in the same state as the no growth condition, with its full complement of units and connections, and thus its full working memory capacity. Periodically, units and their associated connections are removed from the network and replaced with new units and fresh, untrained connections. This happens at a rate commensurate with the rate at which units are added in the unit growth condition. Thus, in both conditions fresh resources are introduced over time, but where the unit growth condition uses these resources to grow the network from its initially small size, the unit replacement condition accepts these fresh resources and discards an equal amount of its existing, trained resources, thereby maintaining a constant size. Since the effective size of the working memory does not change in the unit replacement condition, it allows us to determine if periodic introduction of fresh working memory resources alone, without starting small, can produce any significant benefits.
Working memory is not the only cognitive variable that changes as part of the unit growth condition. The new units that each network recruits must be wired up using new connections. Connections, as the reader will recall, are the basis of long-term memory capacity in a neural network. Thus, a network from our unit growth condition adds both working memory and long-term memory capacity during training. To tease apart these variables, we examine a fourth condition, termed the connection growth condition, in which all units are present from the beginning but few of the possible connections exist (see Figure 2, bottom row). Since all units are incorporated from the beginning, the network's working memory capacity is fully developed from the start. During training, the network grows new connections at the same rate as in the unit growth condition, giving the network access to new long-term memory storage and allowing us to directly gauge the effects of long-term memory maturation. In addition, this allows us to indirectly assess the contributions of working memory maturation (and compound effects) by subtractive analysis with the unit growth and no growth conditions.
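The four development conditions can be summarized schematically, as in the sketch below, which expresses each condition as a periodic operation on a toy hidden layer. The sizes, rates, and function names are illustrative assumptions and do not reproduce the actual simulation code.

import numpy as np

# Schematic view of the four development conditions as periodic operations on a
# toy hidden layer. Sizes and rates are arbitrary; this illustrates the
# conditions described above rather than the simulation code itself.

rng = np.random.default_rng(0)
MAX_UNITS = 20

def no_growth(n_units):
    """No growth: the network is fully mature from the start; nothing changes."""
    return n_units

def unit_growth(n_units, step=2):
    """Unit growth: recruit new units (with fresh connections) until maturity."""
    return min(MAX_UNITS, n_units + step)

def unit_replacement(weights, n_replace=2):
    """Unit replacement: discard some trained units and re-initialize them with
    fresh, untrained weights; total capacity stays constant."""
    victims = rng.choice(weights.shape[0], size=n_replace, replace=False)
    weights[victims] = rng.normal(0.0, 0.1, size=(n_replace, weights.shape[1]))
    return weights

def connection_growth(mask, n_new=5):
    """Connection growth: all units exist from the start, but connections are
    enabled gradually over training."""
    absent = np.flatnonzero(mask == 0)
    chosen = rng.choice(absent, size=min(n_new, absent.size), replace=False)
    mask.flat[chosen] = 1
    return mask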
4 Experiments and results
4.1 Gender assignment
In our first set of experiments, we investigate how well neural networks can learn to perform a gender assignment task using realistic sources of information. These networks take single nouns as input and use that information to predict which determiners can appear with that noun. Since nouns commonly occur with determiners in our target languages, French and Spanish, both the input and the output data are readily available to any learner by simply listening to everyday speech. After training, we determine the network's assignment of gender to individual nouns by presenting those nouns as input and observing the network's predictions for determiner pairing. The gender of the most strongly predicted determiner is taken to be the network's gender assignment for the input noun.
Our approach is similar to that taken by the third model from MacWhinney et al. (Reference MacWhinney, Leinbach, Taraban and McDonald1989) in that our model uses the complete phonological form of a noun to predict the article to be used with that noun. We diverge from this earlier model in a few important ways. First, we eschew semantic features to investigate what can be learned from phonology alone. Even though phonology is predictive for the majority of words in our target languages, this choice deprives our model of information that learners are known to use (see Section 2.2 above). Second, we present the input noun as a temporal sequence of phonemes instead of a single phonological pattern, the latter of which will always have trouble representing long words or those that do not conform to the prespecified representational form. In addition, our approach corrects the most severe issues with the model of gender assignment by Sokolik and Smith (Reference Sokolik and Smith1992). Where their approach was criticized (Carroll, Reference Carroll1995; Matthews, Reference Matthews1999) for using orthographic input, we use phonemic input instead. Where their network came a priori equipped with knowledge of the genders of the training language – and indeed the knowledge that grammatical gender exists at all – our model has no such built-in knowledge. Finally, where their model required explicit feedback about the genders of individual words, our model relies instead upon the co-occurrence of gendered articles with nouns in order to deduce gender assignments. As a result of these differences, our model is more closely aligned with the real-world circumstances of human language learning in most contexts.
An input noun is presented to the network as a temporal sequence of phonemes. Each such phoneme is represented as a set of binary auditory features, with the activations of the network's input layer adjusted to reflect the feature set of each phoneme in turn. We use this representation because such features are universal in the sense that various configurations of these features can represent virtually any phoneme. As such, units representing these features could potentially be a built-in component of the brain of a language learner, or could be learned. That said, we only included enough features here to distinguish all phonemes in our target languages. The full set of phonemes and features are detailed in Table 1. After processing an entire sequence of phonemes representing the input noun, the network activates units in its output layer that correspond to determiners that it predicts to be compatible with the input noun. The network learns to perform this behavior by observing determiner–noun pairings and adjusting its connection weights accordingly.
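As a concrete illustration of this input scheme, the sketch below encodes a short word as a temporal sequence of binary feature vectors, one phoneme per time-step. The feature inventory and the phoneme entries shown are invented placeholders; the actual phonemes and features are those listed in Table 1.

# Illustrative encoding of a word as a temporal sequence of binary feature
# vectors, one phoneme per time-step. The feature inventory and entries below
# are toy placeholders; the actual phonemes and features are those of Table 1.

FEATURES = ["vowel", "voiced", "nasal", "labial", "front", "high"]

PHONEME_FEATURES = {                       # toy entries only
    "m": {"voiced", "nasal", "labial"},
    "e": {"vowel", "voiced", "front"},
    "s": set(),                            # all listed features off for this toy entry
    "a": {"vowel", "voiced"},
}

def encode_word(phonemes):
    """Return one binary feature vector per phoneme, in temporal order."""
    return [[1 if f in PHONEME_FEATURES[p] else 0 for f in FEATURES]
            for p in phonemes]

for step, vector in enumerate(encode_word(["m", "e", "s", "a"])):  # mesa "table"
    print(step, vector)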
The left half of Figure 3 shows the general architecture of the networks we train to perform this gender assignment task. The networks have an input layer of units corresponding to the on and off states of the features that make up the input phonemes. Units in the input layer project to units in a single hidden layer of memory cells. The intrinsic self-recurrence of the memory cells forms the substrate for working memory in the network. Finally, the hidden layer projects forward to the output layer which consists of nine units representing the definite and indefinite singular determiners of our target languages: le, la, l’, un, and une in French, and el, la, un, and una in Spanish. We do not posit that units representing these words could be built into the brains of language learners, nor that the words are represented in single units. However, since these determiners form a small closed class of words, we feel it is not too large a leap to presume that the learner represents these frequent determiners as distinct entities before much gender learning takes place. Our single-unit representation for each determiner is the simplest possible in this context, though other representations would likely work as well.
For this set of experiments, we used the 600 French words from the Sokolik and Smith (Reference Sokolik and Smith1992) paper as the input data for our model, and a set of 600 equivalent words from Spanish. For each trial during training, we first select a language and then select a noun at random from our corpus. We pair the noun with either a gender-matched definite or indefinite determiner from the appropriate language to form a simple noun phrase. The noun is given as input to the network, which then predicts applicable determiners and adjusts its weights in such a way that, in the future, it will be more likely to predict the determiner that actually co-occurred with the input noun. A network is considered to have assigned the correct gender for an input noun if an article of the appropriate gender is most active after presentation.
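A single training trial and the scoring rule just described can be sketched as follows. The network object, its train and predict methods, and the corpus entries are stand-ins introduced purely for illustration; they do not correspond to the actual implementation or data.

import random

# Sketch of one gender-assignment training trial and of the scoring rule
# described above. `network` and its train/predict methods are assumed
# stand-ins for the LSTM model; the determiner lists follow the text.

DETERMINERS = {
    "french":  {"masc": ["le", "l'", "un"], "fem": ["la", "l'", "une"]},
    "spanish": {"masc": ["el", "un"],       "fem": ["la", "una"]},
}

def training_trial(network, corpus, language):
    noun, gender = random.choice(corpus[language])           # e.g., ("mesa", "fem")
    determiner = random.choice(DETERMINERS[language][gender])
    # The noun is presented as its phoneme-feature sequence (see sketch above);
    # the co-occurring determiner serves as the only training signal.
    network.train(inputs=noun, target=determiner)

def assigned_gender(network, noun, language):
    """Gender of the most strongly predicted determiner for this noun.
    (The elided French l', being gender-ambiguous, would need special handling.)"""
    activations = network.predict(noun)                      # {determiner: activation}
    best = max(activations, key=activations.get)
    return "masc" if best in DETERMINERS[language]["masc"] else "fem"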
To determine a baseline level of performance on the gender assignment task, we trained networks on either French or Spanish only and recorded their performance. The results are shown in the left half of Figure 4. As one would expect, networks trained on French alone scored well in excess of 90% after training, while scoring at chance on Spanish; similarly, Spanish-trained networks performed well on their native language and at chance on French. It is worth noting that Spanish performance was consistently a few percentage points better than French performance, likely due to the phonemic cues to gender assignment in Spanish being simpler and more reliable than those in French. Performance was consistent across the four development conditions, suggesting that, alone, network development has little impact on outcomes for the gender assignment task, at least in the first language.
With a baseline level of performance established for networks that are “native” to either French or Spanish, we next investigated the performance of bilingual networks under a number of different learning conditions designed to assess the role of L1 entrenchment. Each condition varies the length of time t which the network spends learning the task on L1 alone before L2 is introduced (Zhao & Li, Reference Zhao and Li2010). We describe the conditions in terms of two periods, the first of which consists of training only in L1 for t trials, where t varies widely across conditions. This is immediately followed by the second period, in which L1 and L2 trials are mixed with equal probability. The duration of the second period is always two million trials in an effort to ensure that the networks have time to reach peak performance on both languages. While this second mixed training period will undoubtedly create competition and interference between the two languages, the amount of interference should be the same in each condition because the size and mix proportion of the second training period are the same across conditions. In contrast, the amount of L1-only training prior to the introduction of L2 is varied across conditions, meaning that networks that start with different values of t will end up in different states – reflecting differing levels of entrenchment – when L2 training begins.
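The two-period regimen can be summarized as in the sketch below, in which run_trial stands in for a single training trial on a network; the coin-flip mixing and the function names are illustrative assumptions, while the two-million-trial figure follows the description above.

import random

# Sketch of the two-period training regimen: t trials on L1 alone, followed by
# a fixed two million trials in which L1 and L2 are interleaved with equal
# probability. run_trial stands in for one training trial on a network.

MIXED_TRIALS = 2_000_000

def training_schedule(run_trial, l1, l2, t):
    for _ in range(t):                      # period 1: L1-only exposure (entrenchment)
        run_trial(l1)
    for _ in range(MIXED_TRIALS):           # period 2: balanced bilingual exposure
        run_trial(l1 if random.random() < 0.5 else l2)

# A "native bilingual" network has t = 0; a late L2 learner has a large t, e.g.:
# training_schedule(run_trial, l1="spanish", l2="french", t=800_000)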
A network whose training regimen has t = 0 is a native bilingual in the sense that L1 and L2 are presented at precisely the same time, and in the same proportions. Thus, such a network should exhibit no L1 entrenchment. Networks trained with higher values of t, having had a longer time with exposure only to L1, should exhibit more entrenchment. Given this, the prevailing ideas about L1 entrenchment offer a number of predictions about the final, peak L1 and L2 performance of the networks:
1. Networks’ final L1 performance should not decrease as t increases.
2. Networks trained with t = 0, as native bilinguals, should not exhibit impairment in either language with respect to the other.
3. Networks should show increasing degradation of final L2 performance as t increases, at least until the networks have mastered L1 to a point at which the effect of entrenchment saturates.
These predictions can be investigated by plotting the final L1 and L2 performance of fully trained networks on the gender assignment task versus the value of t with which they were trained. We trained 30 separate networks for each of 15 values of t as well as for each of the four maturation conditions and each of two languages; thus a total of 3,600 networks were trained to produce the following figures. For conditions in which the network matures during training, each of these networks begins training in its most immature state and develops over the course of the first 400,000 trials, at which point it reaches maturity – i.e. architectural parity with the networks in the no growth condition. Thus, some networks in the connection growth and unit growth conditions (i.e. those with t = 0) are first exposed to L2 in their most immature state, while others (i.e. t = 400,000 and above) are not exposed to L2 until after reaching maturity.
After both training periods were complete, we recorded the fraction of inputs to which each network assigned the correct gender, for both languages, and plotted them in the left half of Figure 5. These graphs depict the final performance of the networks on the y-axis versus the value of t – i.e. the duration of the L1-only training period and thus the delay before L2 onset relative to L1 – on the x-axis. Thus, the expected L1 entrenchment increases from negligible to maximal as we move from left to right in each figure; another way of saying this is that the networks towards the left of the x-axis are closer to true bilinguals whereas the networks closer to the right edge are late L2 learners. The y-axis values always depict final performance after the conclusion of training. These graphs show fitted curves for each of the different network maturation conditions, and for each such curve, the shaded area behind it represents the 95% confidence interval.
To examine the first prediction above, we first look at the performance of the various networks in their native language. Table 2 shows the results of a statistical analysis on the performance results, comparing the means for each condition at the first and last t-values using a two-proportion z-test. In addition to statistical significance, the table provides coded indications of the magnitude of each significant change, answering the question, for each condition, of whether performance is statistically flat, increasing, or decreasing as t increases. The prediction of non-decreasing performance with increasing t appears to be largely borne out. When native language and task match, performance is flat or occasionally weakly increasing as t values rise. Figure 5 shows us that, as with the monolingual networks, bilingual networks have a slightly harder time learning French than Spanish as an L1. Differences in Spanish performance between the different maturational variants of Spanish-native networks were minimal, while the French-native networks that grew their working memory capacity during training showed a slight disadvantage. However, the expected general pattern of flat or improving performance with increasing t held for all conditions.
Notes to Table 2. Significance: *** p < .001; ** p < .01; * p < .05. Magnitude: ***** m > 12%; **** m > 9%; *** m > 6%; ** m > 3%; * m > 1%.
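For reference, the two-proportion z-test used for these comparisons can be computed with the standard pooled formula, as in the sketch below; the counts shown are invented for illustration and do not reproduce the values behind Table 2.

from math import sqrt

# Standard pooled two-proportion z-test, as used to compare mean accuracy at
# the smallest and largest values of t. The exact counts used in the paper are
# not reproduced here; the numbers in the example call are invented.

def two_proportion_z(successes1, n1, successes2, n2):
    p1, p2 = successes1 / n1, successes2 / n2
    p_pool = (successes1 + successes2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# e.g., 93% vs. 90% correct over hypothetical item counts
print(two_proportion_z(930, 1000, 900, 1000))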
We can investigate the second prediction above by examining each network's performance on its second language. We do this by comparing second-language performance of networks with t = 0 on the x-axis to the native-language performance. We see that across all maturational conditions, the true bilingual networks (those with t = 0) perform well when compared to the native networks in both languages, lending support to the second prediction above.
Moving on to the third prediction, we can clearly see a t-related performance deficit in the no growth condition for L2 French; increasing Spanish exposure before French is introduced causes the final French performance of the network to decrease at a rate that is at first rapid but eventually slows for larger delays. The maturational properties in play for the connection growth and unit growth conditions, however, appear to have helped these networks compensate for the expected declines in French performance due to Spanish entrenchment. Networks in the unit replacement condition tended to perform at levels comparable to the no growth networks, suggesting that introduction of new working-memory resources without starting small may not be sufficient to gain a significant reprieve from the deleterious effects of increasing L1 entrenchment. In the case where French was the L1 and Spanish the L2, no appreciable t-related performance decreases were observed. We expect that this is due again to the relative ease of the task for Spanish as compared to French.
At least in the case of French as an L2, the data shown in Figure 5 and Table 2 seems to support both the predictions of performance declining due to increased entrenchment and of maturation during learning helping to overcome these difficulties. Next we trained networks on the more difficult task of gender agreement, the results of which are reported in the next section.
4.2 Gender agreement
Our second set of experiments explores how neural networks perform on a gender agreement task. During a trial, networks in these experiments receive a noun phrase (e.g., el mecanismo interno “the internal mechanism” in Spanish) presented as an unsegmented sequence of phonemes (e.g., [elmekanismointeɾno]) as input. The network's job at every point in this phoneme sequence is to predict the next few phonemes that it will hear. As such, the network uses a phonemic representation of everyday speech as both the input and the training signal. After training, the network's gender agreement performance is evaluated using noun phrases of the form determiner–noun–adjective – common constructions in our target languages of Spanish and French. To determine gender agreement, we give the network the determiner and noun as input, followed by the portion of the adjective that is gender-neutral, and ask the network to predict the correct ending for the adjective. If the network predicts the gender-appropriate ending more strongly than the gender-inappropriate ending, we consider the network's answer to be correct.
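The agreement scoring rule can be sketched as follows; the network object and its predict method are stand-ins, and phonemes are written as plain strings rather than as the feature vectors actually presented to the model.

# Sketch of the gender-agreement scoring rule described above. `network` and
# its predict method are stand-ins for the trained model; phonemes are written
# as plain strings here rather than as the feature vectors actually used.

def agreement_correct(network, stem, correct_ending, wrong_ending):
    """Present the gender-neutral portion of the phrase and compare the
    predicted activations of the two candidate adjective endings."""
    predictions = network.predict(stem)     # activation for each candidate ending
    return predictions[correct_ending] > predictions[wrong_ending]

# Example item (Spanish): "el libro blanco" heard as [ellibɾoblanko]; the model
# receives the neutral portion [ellibɾoblank-] and should favor [-o] over [-a].
example = dict(stem="ellibɾoblank", correct_ending="o", wrong_ending="a")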
The noun phrases we used as training data for the gender agreement task were extracted from the French and Spanish versions of Wikipedia (2011). We downloaded archives containing the complete text of each version of Wikipedia and applied part-of-speech tags to each word using TreeTagger (Schmid, Reference Schmid1994). We then extracted all noun phrases of the forms determiner–noun, determiner–noun–adjective, and the less frequent determiner–adjective–noun, where the determiner is one from Table 3. From this list of noun phrases we removed any phrases containing words that were not in our language dictionaries – Lexique 3 for French (New, Reference New2006) and CUMBRE for Spanish (CUMBRE, n.d.). Finally, we extracted the most frequent 100,000 noun phrases for each language. These phrases comprise the training data. On each training trial, we chose a phrase probabilistically, based on the phrases’ corpus frequencies; we used this phrase as the input – and training signal – for the network on that trial.
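The extraction pipeline can be summarized as in the sketch below. It assumes that the Wikipedia text has already been tokenized and part-of-speech tagged (e.g., by TreeTagger) into word and tag pairs; the tag names, dictionary lookup, and pattern check are simplifications of the actual procedure.

from collections import Counter

# Sketch of the noun-phrase extraction pipeline described above. Input is
# assumed to be already tagged into (word, tag) pairs; the tag names,
# dictionary lookup, and pattern check are simplifications.

VALID_PATTERNS = [("DET", "NOUN"), ("DET", "NOUN", "ADJ"), ("DET", "ADJ", "NOUN")]

def extract_phrases(tagged_sentences, determiners, dictionary, top_k=100_000):
    counts = Counter()
    for sentence in tagged_sentences:                 # each: list of (word, tag)
        for i in range(len(sentence)):
            for pattern in VALID_PATTERNS:
                window = sentence[i:i + len(pattern)]
                words = tuple(w for w, _ in window)
                tags = tuple(t for _, t in window)
                if (tags == pattern and words[0] in determiners
                        and all(w in dictionary for w in words)):
                    counts[words] += 1
    return counts.most_common(top_k)                  # (phrase, corpus frequency)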
The right side of Figure 3 presents the architecture of the networks trained on the gender agreement task. The input layer is the same as it was for the gender assignment experiments, with each input unit corresponding to a binary auditory feature of a phoneme. These networks, however, have two hidden layers of memory cells instead of one. This is because the gender agreement task involves two separate levels of segmentation of the input. To perform the task effectively, we expect that any learner needs to divide the phoneme sequence first into morphemes and words and, at a higher level, into noun phrases in which gender agreement must be maintained. Previous experiments with these types of networks on language tasks (Monner & Reggia, Reference Monner and Reggia2012) have shown a network with two hidden layers to be more effective in this case than networks with a single hidden layer.
The network's output layers are each identical to the input layer because the network is predicting upcoming phonemes. There are two such output layers because the network must predict not only the next phoneme that will occur in the input, but the phoneme after that as well. We require the network to make predictions of two future phonemes because some of the gendered adjective endings that we would like to predict consist of two phonemes. For example, the French adjective for “particular” is particulier [paʁtikylje] in the masculine and particulière [paʁtikyljɛʁ] in the feminine; we can see in the phonetic spellings that the gendered endings of these adjectives differ across two phonemes, with [-e] ending the masculine form and [-ɛʁ] ending the feminine form. Since we can only show the network the gender-neutral portion of the phoneme sequence (i.e. [paʁtikylj-]) without giving away the gendered form intended by the speaker, we must have the network predict two subsequent phonemes (either of which may be null if subsequent phonemes do not exist) in order to capture gendered endings with two phonemes such as [-ɛʁ].
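The overall shape of this architecture can be summarized in the sketch below, which uses a standard stacked LSTM as a stand-in for the memory-cell layers of our model (the actual networks use a different cell type and training algorithm); the feature and layer sizes are illustrative assumptions:

import torch
import torch.nn as nn

class AgreementNet(nn.Module):
    """Binary phonetic features in, two stacked layers of memory cells,
    and two output heads predicting the next phoneme and the one after."""

    def __init__(self, n_features=20, hidden=64):
        super().__init__()
        self.memory = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.next_phoneme = nn.Linear(hidden, n_features)   # predicts phoneme t+1
        self.phoneme_after = nn.Linear(hidden, n_features)  # predicts phoneme t+2

    def forward(self, x):  # x: (batch, time, n_features)
        h, _ = self.memory(x)
        return self.next_phoneme(h), self.phoneme_after(h)

# At each time step the input is the feature vector of the phoneme just heard;
# the two targets are the feature vectors of the next two phonemes (or a null
# vector when no phoneme follows).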
When evaluating performance on the gender agreement task after training, we use only phrases of the determiner–noun–adjective form because it is the only form that is adjective-final. Our testing paradigm requires an adjective-final form because the network must predict the gender-appropriate ending of the last word, and only adjectives generally have two distinct gendered endings. Gender-neutral adjectives, and adjectives where the two gendered forms are orthographically distinct but phonetically identical (e.g., in French, the masculine architectural and the feminine architecturale are both pronounced [aʁʃitɛktyʁal]), are present during gender agreement training but ignored during the performance evaluation.
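A small sketch of this filtering step, with hypothetical adjective records, shows which items the evaluation set keeps and which it skips:

# Hypothetical records: adjective -> (masculine phonemes, feminine phonemes).
adjectives = {
    "particulier":   ("paʁtikylje",   "paʁtikyljɛʁ"),   # distinct forms: testable
    "architectural": ("aʁʃitɛktyʁal", "aʁʃitɛktyʁal"),  # identical forms: skipped
}

def testable(masc, fem):
    """An adjective enters the evaluation set only if its two gendered
    forms differ in pronunciation."""
    return masc != fem

test_adjectives = [a for a, (m, f) in adjectives.items() if testable(m, f)]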
To determine a baseline level of performance on the gender agreement task, we trained sets of networks on either French or Spanish only and recorded their performance. The results are shown in the right half of Figure 4. As with the gender assignment task from the previous section, networks trained on French do well on French and perform at chance on Spanish, while networks trained on Spanish perform well on that language and significantly worse on French.
We use the same experimental setup as in the gender assignment task to investigate the effects of L1 entrenchment alone (i.e. the no growth condition) and together with network maturation (i.e. the unit growth, unit replacement, and connection growth conditions) in the gender agreement task. As before, training consists of two periods, the first consisting of t trials in which inputs come exclusively from the designated L1, and the second consisting of two million trials where inputs may be drawn from either language. We trained 30 networks in each maturation condition and for each value of t, the duration of the initial L1-only training period. The results are shown in the right half of Figure 5 above.
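The two-period schedule can be sketched as follows; train_step and the two phrase samplers are hypothetical stand-ins for the model's per-trial update and the frequency-weighted sampling described above, and the assumption that each mixed trial picks either language with equal probability is ours:

import random

def train_bilingual(train_step, sample_l1, sample_l2, t, mixed_trials=2_000_000):
    """First period: t trials drawn only from the designated L1.
    Second period: mixed trials whose inputs may come from either language."""
    for _ in range(t):
        train_step(sample_l1())                      # L1-only (entrenchment) period
    for _ in range(mixed_trials):
        sampler = random.choice([sample_l1, sample_l2])
        train_step(sampler())                        # bilingual period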
We can examine the networks’ performance on their first languages, broken out by language and maturation condition as before, by looking at the bottom half of Table 2. As expected, we do not see decreasing performance with increasing t in any of the conditions where the task and native language match.
Next we examine the final L2 performance scores for networks in each condition of the gender agreement task as a function of t, which appears on the x-axis in the right half of Figure 5. The results for both languages are similar to what we observed for L2 French in the gender assignment task. The mature networks in the no growth condition show a marked susceptibility to L1 entrenchment, with L2 performance decreasing by as much as 17% as t is increased, delaying the onset of L2 relative to L1. The networks in the unit growth condition, however, were largely able to mitigate this performance decrease by introducing new units and connections during learning. Performance of networks in the connection growth condition falls between these two: the addition of new connections appears to stave off entrenchment effects while the level of entrenchment is small, but for values of t > 200,000 the entrenchment effects again become apparent. This tells us that the addition of new units and the addition of new connections both help to counteract deficits due to entrenchment. Viewed from the cognitive perspective, growth in long-term memory capacity during training – the connection growth condition – helped to mitigate the effects of L1 entrenchment, as did growth in working memory capacity, as evidenced by the superior performance of the unit growth condition over the connection growth condition at higher values of t. However, as shown by the unit replacement networks again tending to track the performance of the no growth networks, the addition of fresh neural resources is not by itself sufficient to reap a performance benefit. Instead, it seems that starting small, in terms of working memory capacity, long-term memory capacity, or both, is an essential factor that, combined with growth of neural resources, leads to the performance increase.
5 Discussion
The data presented in Section 4 (with one exception, discussed in detail below) appear to support the predictions of established accounts of L1 entrenchment: increasing levels of entrenchment of the L1 caused increasing difficulty in acquiring an L2. The most dramatic of these effects can be seen in the no growth conditions, where learnability of the L2 task declines steeply at first as time spent on the L1 task increases. The simulation results also largely agree with the conclusions of empirical studies of gender learning in early and late bilinguals (discussed in Section S.2 of the supplementary material): early L2 learners perform much like native speakers, whereas later L2 introduction leads to poorer performance.
The simulations also bore out the predictions of the less-is-more hypothesis, with the networks that underwent working memory development outperforming those that started with full-sized working memory capacities. Our experimental effort to separate the effects of starting with a small working memory from those of simply adding fresh memory resources showed a distinct advantage for growth combined with starting small. This not only provides a small but important clarification of the mechanism behind the less-is-more hypothesis, but is also a result that would be difficult, if not impossible, to obtain empirically. We treat the results pertaining to each hypothesis in separate sections below.
5.1 Entrenchment
As mentioned earlier, our simulations provided one exception to our hypothesis about entrenchment: French-native learners of the Spanish gender assignment task attained near-native performance levels on their L2 task. This may be explained, in whole or in part, by the inherent similarity of French and Spanish; see our discussion of the empirical study by Sabourin, Stowe and de Haan (Reference Sabourin, Stowe and de Haan2006) in Section S.2 of the supplementary material. When two languages are very similar, one might expect L2 learning to be easier where the languages agree and harder where they disagree. For example, the fact that a noun ending in [o] is a very reliable predictor of masculine gender in both French and Spanish may underlie the unexpected ease with which our French-native networks learned the Spanish gender assignment task: since masculine nouns ending in [o] are so prevalent in Spanish, transferring this concordant rule from L1 French would immediately yield a large gain in accuracy. The reverse – transferring the rule from L1 Spanish to L2 French – would be less beneficial, since masculine nouns ending in [o] are far less prevalent in French than in Spanish and the rule therefore has less impact on the learner's overall accuracy. On the other hand, it may be more difficult for native French speakers to learn Spanish's association between [a] and feminine gender, given that [a] is associated with the masculine in French. In our simulations, this rule may have mattered less because [a] is a less reliable cue in French, or perhaps discordant rules from the L1 can simply be overcome with relative ease. The simulations reported here certainly do not fully explore the interactions between language similarity, rule transfer, and ease of L2 learning. To better grasp the significance of interactions between concordant and discordant rules like the examples above, we hope in the future to study an expanded model that includes more languages of varying degrees of similarity.
5.2 Memory development
While our modeling approach does not directly implement cognitive constructs such as working memory capacity, we argued in Section 3.3 that the connection growth condition could be reasonably conceived as representing growth from an initially small long-term memory capacity, and the unit growth condition as growth of both long-term and working memory capacities from small beginning states. Allowing the networks to mature in either of these conditions helped to mitigate the negative impacts of L1 entrenchment, especially for longer delays in L2 onset. The fact that the connection growth condition generally improves upon the no growth condition suggests to us that growth of long-term memory capacity may be a key maturational factor during language learning. For the longest delays, the unit growth condition appears to have had the greatest positive impact, which suggests to us that growth of working memory capacity also has a positive influence in combating entrenchment effects.
The unit replacement condition, on the other hand, demonstrated the effects of adding fresh long-term and working memory resources to the network without starting small, and without changing the network's overall size. Since the networks in this condition did not do substantially better than those in the no growth condition, we have to conclude that the only thing lacking in the unit replacement condition – beginning from resources of modest capacity, or starting small – is an essential factor underlying the performance gains made by the unit growth and connection growth networks. This lends support to the less-is-more hypothesis, and further constrains it in the sense that it is now clearer that initial size is crucial; the effect is not caused by resource acquisition alone.
The less-is-more hypothesis is usually presented at the cognitive level: a system with limited cognitive resources will latch on to the most accessible organizational regularities, learning representations efficient enough to fit within its small memory capacity. This pays off later, when newly added memory capacity is free to tackle more complex stimuli. The proposal also makes intuitive sense at the level of neural information processing. A network that has its full complement of resources when learning begins naturally learns to use all the resources at its disposal, widely distributing its learned interpretations of its L1 experiences. If an L2 is introduced later, this distributed L1 knowledge cannot be easily or quickly consolidated into a subset of the neural resources so as to free the remainder for the L2 alone. Instead, the L2 and L1 experiences intermix and interact, exacerbating L1 entrenchment effects and prolonging performance deficits in L2 due to resource competition from L1 (Thomas, Reference Thomas, Mayor, Ruh and Plunkett2009). A network that begins training with more modest resources, on the other hand, is forced to encode the L1 using only the limited resources available. Though these may initially be insufficient for a full command of the L1, the limitation forces the network to adopt more efficient and less widely distributed encodings, which may entail segmenting the input into smaller generative chunks such as phonemes and morphemes. This consolidation of L1 knowledge in the early-added resources leaves the later-added neural resources free to adapt to novel data such as that presented by an L2. If this account is correct, starting with fewer resources and building them up during language learning is key to developing more modular representations for each language, which helps to avoid the deleterious effects of L1 entrenchment and resource competition with the L2.
Our simulation results also showed that our networks’ final performance when learning only one language was generally the same or worse in the developing conditions compared to the pre-developed no growth condition. This stands apart from previous results showing that late L1 acquisition of sign languages is impaired proportionally to age of acquisition (e.g., Mayberry, Reference Mayberry1993; Mayberry & Lock, Reference Mayberry and Lock2003), the explanation for which is thought by many to be developmental in nature. We see two potential explanations for this discrepancy. The first and most obvious is that our model does not account for the mechanism, developmental or otherwise, that underlies these impairments in late L1 acquisition. A second possibility that affords our model some explanatory power rests on the idea that the observed performance deficits in a late-learned L1 are due to entrenchment and/or interference from home sign systems developed by the learners prior to exposure to a conventional sign language (Seidenberg & Zevin, Reference Seidenberg, Zevin, Munakata and Johnson2006). Under this view, a late-learned L1 functions more like an L2, creating a situation that is more directly comparable to our bilingual networks than the monolingual ones, which had no prior exposure to any type of communication system which could interfere or become entrenched. While this interpretation minimizes the discrepancy between our model and empirical findings, it remains a controversial hypothesis regarding the origin of late L1 learning deficits.
We do not mean to suggest that entrenchment and memory development explain all of the age effects observed in second language learning. As many researchers have pointed out, cognitive maturation is typically confounded with a variety of other changes that take place over the same time frame, such as social development, changing patterns of input and interaction, and schooling in the L2. While the factors studied here cannot account for all of the age effects observed in humans, we believe they form part of a larger picture involving many of the variables outlined above. Our simulations confirm that entrenchment – a natural consequence of learning different tasks in stages – can indeed cause large deficits in second language performance. Our comparison of developmental conditions bears out the predictions of the less-is-more hypothesis, showing that memory development – that is, starting from a small memory and growing it during learning – can help to prevent disruptions due to entrenchment. While much more work is needed to determine how cognitive maturation contributes to age effects, this study contributes to a better understanding of how memory development in particular could be an important part of that picture.