Introduction
School-age children may not know exactly what a verb is, but their ability to use verbs in novel contexts demonstrates that they have learned the rules by which verbs operate. An important question in language acquisition is how children learn the grammatical category of a word. In languages with consistent word order, the lexical context in which a word appears can be an important cue, especially in combination with other cues including semantics and phonology (Farmer, Christiansen, & Monaghan, 2006; Lany & Saffran, 2011; Monaghan, Christiansen, & Chater, 2007). Here, we aim to determine whether children can learn about grammatical category membership solely from distributional regularities in an artificial language.
Distributional information is distinct from surface-level adjacent and non-adjacent dependencies because it involves tracking these sequential statistics across time and exemplars. There are limits to how useful sequence-based types of information are for category learning. Despite positive findings from corpus analyses and models using adjacent dependencies to categorize content words (Mintz, Newport, & Bever, 2002; Redington, Chater, & Finch, 1998; Thorpe & Fernald, 2006), adjacent dependencies alone are likely not as helpful for learning functional grammatical categories such as prepositions and conjunctions, which might precede or follow any number of words or word types. Non-adjacent dependencies, captured as frequent frames like "I __ it", have been shown to facilitate word learning in 30-month-olds (Childers & Tomasello, 2001) and reliably contain words of the same grammatical category (Mintz, 2003). Infants as young as 18 months can track such non-adjacent dependencies, with higher variability of the intervening words resulting in better learning (Gómez, 2002). However, frequent frames like "I __ it", while highly accurate in predicting the category of the intervening word, account for a very small percentage of what children hear (St. Clair, Monaghan, & Christiansen, 2010). An experiment showing that distributional information alone can cue child learners to category membership is an essential step in determining its role in language development.
We wanted to study distributional learning in children because they are faced with the task of discovering grammatical categories. It cannot be assumed that children will perform like adults, given that children have different cognitive and meta-cognitive skills and less experience with language. If children cannot do the task, this limits the conclusions we can draw from adult studies. Furthermore, if one takes seriously the idea that language is formed from exposure to input, school-age children have much less cumulative language exposure than adults, and thus less experience and skill at extracting categories and subcategories from spoken language. Knowledge of argument structure is an area of language arguably still developing into adolescence (Ambridge, Pine, Rowland, & Chang, 2012) and potentially impaired in children with language learning impairments (Ebbels, 2005).
It was once believed that people could not use distributional cues alone to discover grammatical categories. In studies by Smith (1966, 1969), adults could use the fact that an item occurred first or last in a sequence to determine categories in an artificial language, but were unable to utilize distributional cues to restrict their generalizations to exclude ungrammatical co-occurrence violations (an example in real language of an ungrammatical co-occurrence is "he poured the jar with water" – a verb like "fill" is grammatical here, but "pour" is not). In later studies, adults were successful only when morphological markers denoting category membership were present (Frigo & McDonald, 1998), as was the case with infants in a Russian gender paradigm (Gerken, Wilson, & Lewis, 2005). Thus, it seemed that only very salient cues like absolute position, or a combination of cues, would be sufficient for grammatical category learning. Results from modeling and novel word learning studies supported this hypothesis. A model by Monaghan, Chater, and Christiansen (2005) showed that phonological and distributional cues used in combination resulted in the most accurate category assignment for 5,000 frequent nouns and verbs in English (between 65–80%, depending on frequency, with about equal accuracy for nouns and verbs), while either type of cue alone produced much poorer discrimination (between 50–85% for distributional cues alone, with better categorization of nouns than verbs; and between 60–65% for phonological cues alone, with better categorization of verbs than nouns). In a series of behavioral studies, children and adults used novel verbs in new contexts based on both the distributional and semantic properties of other verbs they heard in those contexts (Ambridge & Lieven, 2011; Ambridge, Pine, & Rowland, 2011, 2012).
In an important next step for determining the role of distributional information in language acquisition, Reeder, Newport, and Aslin (2013) showed that adults could use distributional information in the absence of other cues, including position effects, to construct grammatical-like categories in an artificial grammar learning task. In a series of experiments examining how differing degrees of distributional overlap influenced learning, they demonstrated that adults generalized categories without multiple cues to category membership, and that they restricted their generalizations based on the distributional patterns and amount of exposure. In these experiments, grammatical test items consisted of three-word AXB sequences heard during training, novel AXB sequences whose words had not appeared together in a trigram in training, and ungrammatical sequences containing a component bigram such as XA or BX that violated the linear order of items heard in training. In Experiment 3, the training contained strategic gaps, such that some test items contained grammatically possible bigrams not heard during training. Because of the shared contexts in the distribution of the words in the grammar, participants could form grammatical categories to determine that novel grammatical sentences were possible while ungrammatical ones were not. Figure 1 shows how combinations heard during training (left) could lead to the induction of a category, which would in turn lead to the acceptance of a novel combination (right). That participants rated novel grammatical combinations higher than ungrammatical combinations showed that adults can form rudimentary grammatical categories based on distributional information alone. Because adults rated novel items lower in Experiment 3 than in experiments without distributional gaps, the authors concluded that, while the adults formed categories, they also must have found the gaps meaningful. Thus the differences in distributional information in Experiment 3 led to restricted generalizations, which shows the importance of the shape of distributional information in the formation of categories. Further, because during training participants heard AXB combinations as part of longer strings that optionally contained other categories of words at the beginning and end, categories could not have been determined through position effects, as in Smith (1966, 1969). The study did not include ungrammatical items containing co-occurrence violations only, so the ability to use co-occurrence information to form and restrict grammatical categories has remained untested. However, that participants could use distributional information alone to determine category membership suggests that this is a strong mechanism for learning language.
Fig. 1. This schematization shows that combinations heard during training can lead to the induction of a category of items that have shared distributions, which allows the listener to generalize a new combination at test.
As Reeder et al. (2013, p. 52) state in their discussion, "Linguistic input to young language learners likely involves many words with partially overlapping contexts (as in Experiment 3)". Experiment 3 provides an ideal task for an initial foray into assessing distributional learning ability in children because of the realistic nature of the artificial language: in real language, children also must infer and restrict categories from gaps in the input, never being exposed to the entire corpus. Manipulating the input over several experiments is one way to learn how people form and limit generalizations. It is also possible to compare ratings for different types of items within the same experiment. In the present study, we employ an artificial grammar similar to that in Experiment 3 of Reeder et al., modified for use with children. Test items include both linear order and co-occurrence violations to compare graded degrees of generalization. It has not been shown whether children can use category co-occurrence alone (as distinct from adjacent dependency information, which is individual item co-occurrence) as a cue to category membership, but theories of language learning often assume this skill (e.g., Tomasello, 2000, 2003). Results from this task will provide a basis for future work examining exposure manipulations such as those in Reeder et al. (2013), so that we can understand how children with typical and impaired language development create and restrict generalizations.
Methods
Participants
We recruited 27 typically developing participants, aged 6;0–9;11 (M = 8·4, SD = 1·1), 16 of whom were female. An additional eight participants (four females) were later added to the sample. Since these participants were not part of our original planned sample, their data were only included for analyses in which we employ Bayesian analyses and compute Bayes Factors (since, in contrast to the interpretation of p-values in frequentist analyses, Bayes Factors remain a valid measure of evidence even with optional stopping; Dienes, 2016; Rouder, 2014).¹ Participants had normal hearing, scored above 85 (M = 115·7, SD = 12·6) on the Kaufman Brief Intelligence Test Matrices subtest, 2nd edition (Kaufman & Kaufman, 2004), and had no history of neurological disorders or of receiving speech/language therapy per parent report. In addition, all children completed the Peabody Picture Vocabulary Test, 4th edition (PPVT-4; Dunn & Dunn, 2007) to document vocabulary skills (raw: M = 158·3, SD = 17·4; standard: M = 120·3, SD = 12·5). Pilot work suggested that typically developing children under six could not reliably complete the task.
Stimuli
We take our artificial grammar from Reeder et al. (2013). The grammar consisted of five arbitrary category types: Q, A, X, B, and R, controlled for co-occurrence frequency. Each category contained two or three pseudo-words; e.g., category Q words were klidum and spad. Training sentences were combinations of (Q)AXB(R), in that order, so that each sentence minimally contained AXB, with Q and R words added optionally to avoid position effects.
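To make the sentence frames concrete, the following minimal sketch in R (the language of our analyses) assembles sentences from the (Q)AXB(R) structure. Only the Q words come from the text; the A, X, B, and R word lists shown here are hypothetical placeholders, not the study's actual lexicon (the full item lists appear in the Appendix).

```r
# Sketch of the (Q)AXB(R) grammar. Only the Q words are from the text;
# the remaining word lists are hypothetical placeholders.
categories <- list(
  Q = c("klidum", "spad"),
  A = c("mib", "dak", "rel"),   # placeholder A words
  X = c("tiv", "pel", "joom"),  # placeholder X words
  B = c("nib", "gorp", "sul"),  # placeholder B words
  R = c("fen", "roge")          # placeholder R words
)

# Assemble one sentence from a frame such as "AXB" or "QAXBR"
make_sentence <- function(frame, cats) {
  slots <- strsplit(frame, "")[[1]]
  paste(vapply(slots, function(s) sample(cats[[s]], 1), character(1)),
        collapse = " ")
}

set.seed(1)
make_sentence("QAXBR", categories)  # e.g., "spad dak pel nib roge"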
Training
Participants heard 36 sentences, constructed from 12 of the 27 possible AXB combinations of the language. Sentences were chosen such that only a subset of A's appeared with each X, and a subset of X's appeared with each B. Each of the 12 AXB combinations appeared in three of the four possible sentence types (QAXB, AXB, AXBR, or QAXBR), and this corpus of 36 sentences was heard three times for a total of 108 trials. The Appendix lists all training items by AXB combination. During training, while children listened to 'aliens' on a computer saying the sentences, they completed a one-back task, indicating by button press whether they had heard the current sentence immediately prior. Only data from individuals who scored better than 60% on the one-back task were retained; all participants met this criterion. The one-back task ensured children attended during training.
Test
The test consisted of 54 trials presenting three-word sentences of three types: (1) nine 'grammatical' sentences heard during training, (2) nine novel 'grammatical' sentences, i.e., sentences that were not heard during training but were consistent with the grammar of the artificial language, and (3) 18 'ungrammatical' sentences. Grammatical sentences were each heard twice at test; ungrammatical sentences were each heard once to avoid familiarization. For the ungrammatical items, we included three of each combination: AXA, BXB, BXA, XBA, AXR, and QXB. AXR and QXB represent co-occurrence violations, while the others represent linear order violations. In each test trial, children listened to a sentence, then chose one of two buttons to indicate whether the sentence was something the alien would say, and finally moved a slider scale to indicate confidence (see Ambridge, Pine, Rowland, & Young, 2008), which provided increased power to detect variability in learning. The Appendix lists all sentences heard during the test.
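As a check on the arithmetic of the test list, a short R sketch (with the item types, frames, and counts taken from the description above) tallies the trials:

```r
# Test-list composition: 9 familiar and 9 novel grammatical AXB sentences
# heard twice each, plus 3 items for each of six ungrammatical frames,
# heard once each.
test_design <- data.frame(
  type  = c("familiar", "novel",
            rep("linear order violation", 4),
            rep("co-occurrence violation", 2)),
  frame = c("AXB", "AXB", "AXA", "BXB", "BXA", "XBA", "AXR", "QXB"),
  n_items         = c(9, 9, rep(3, 6)),
  n_presentations = c(2, 2, rep(1, 6))
)
sum(test_design$n_items * test_design$n_presentations)  # 54 test trials
```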
Procedures
The one-back task and training paradigm were introduced in the context of aliens trying to repair their spaceship. Children were told to listen for whether the aliens repeated themselves, pressing the red button if the alien said the same thing she had just said and the green button if she said something different. Red and green buttons were stickers over the keyboard buttons D and K. Training trials were videos of an alien 'speaking' one of the training sentences. For every trial except the first, participants performed the one-back task. They were also told that videos would play along the way to alert them to the aliens' progress. These videos were four-second clips depicting the aliens' attempts to fix their ship, and they occurred at fixed intervals during training.
Immediately after training, participants performed the test. They were told that the aliens again needed their help, and that this time they had to listen to a sentence and decide if it sounded like something the alien would say. Examiners told participants to press the red button if the sentence did not sound like something the alien would say, and the green button if it did. Participants then indicated their certainty on a red- and green-colored slider scale by placing the marker on one extreme or the other if they were sure of their decision, and somewhere in the middle if they were less sure. Participants were trained to use the scale at the outset of the experiment through a separate apple/pear shape and color sorting task. E-Prime delivered all task components and collected the data.
The grammar used in this study was inspired by Experiment 3 of Reeder et al. (2013), but we made several departures from that design. For clarity's sake, we list the key differences here: stimuli were recorded in child-directed speech; the training contained 12 rather than nine AXB combinations, such that three trained AXB types never appeared during test; training involved watching videos with pauses between sentences for a one-back task rather than listening to a continuous audio stream; training was shortened to three exposures of each sentence because pilot testing showed that children could learn at this exposure with better attention to the task; ungrammatical items at test included additional item types beyond AXA and BXB; the test used a button press and continuous visual scale rather than a 5-point Likert scale; and the apple/pear task ensured that children could use the buttons and the scale.
Analysis
We used the lme4 package (Bates, Mächler, Bolker, & Walker, 2016) and the lmerTest package (Kuznetsova, Brockhoff, & Christensen, 2016) in R version 3.1.3 (R Core Team, 2015) to run linear mixed-effects models exploring several comparisons of interest. We used the maximal random effects structure as recommended in Barr, Levy, Scheepers, and Tily (2013), except in instances in which the Akaike information criterion (AIC) and log-likelihood ratios indicated that a reduced model improved fit. To determine learning within groups, we compared ratings at test, from both the binary and slider scale measures, for grammatical versus ungrammatical, familiar versus ungrammatical, and novel versus ungrammatical items. We also ran a post-hoc familiar versus novel comparison. Novel versus ungrammatical is the critical comparison for the formation of categories, as it provides evidence that participants learned the grammar of the artificial language by forming grammatical categories. Ungrammatical served as the reference category except where otherwise noted. We ran a separate model that included age in months and raw scores from the PPVT-4 (Dunn & Dunn, 2007) as covariates to test for factors that contributed to learning. Accuracy on the one-back task during training was also included as a covariate in this model to determine whether attention during exposure influenced the ability to learn categories. Because attention during training would affect all item types equally, we included it only as a main effect, while vocabulary and age could interact with item type.
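For concreteness, a sketch of the model syntax follows. The data frame and column names (test_data, rating, item_type, subject, item, ppvt_raw, age_months, oneback_acc) are illustrative assumptions, not the exact variable names used in our scripts.

```r
library(lme4)
library(lmerTest)

# Maximal model described above: a by-subject random slope for item type
# (which also includes the subject intercept), plus a random item intercept.
m_type <- lmer(rating ~ item_type + (item_type | subject) + (1 | item),
               data = test_data)
summary(m_type)

# Covariate model: vocabulary and age may interact with item type;
# one-back accuracy enters as a main effect only.
m_cov <- lmer(rating ~ item_type * (ppvt_raw + age_months) + oneback_acc +
                (item_type | subject) + (1 | item),
              data = test_data)
anova(m_type, m_cov)  # compare fits via AIC and log-likelihood ratio
```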
Results
No child scored below 60% on the one-back task during training (M = 0·86, SD = 0·10), so all participants' data were included. A linear mixed-effects model with a random subject slope for item type and random intercepts for subject and item was the maximal random effects structure supported by the data. Log-likelihood ratios and AIC confirmed this maximal effects structure as the preferred model. We first ran a model with item grammaticality as the single predictor. Participants rated grammatical sentences (which included familiar and novel items) as more acceptable than ungrammatical sentences on the visual analog scale [beta = –16·04, SE = 3·29, t(36·77) = –4·87, p < ·0001], providing evidence that they could perform the task. Results from the binary choice followed those of the visual analog scale, with slightly smaller p-values, for all findings. As such, we report only visual analog scale results from this point forward.
To test learning, we replaced grammaticality with item type (familiar, novel, ungrammatical) in the model. Familiar and novel items were both rated higher than ungrammatical items [familiar: beta = 17·58, SE = 4·02, t(34·55) = 4·37, p < ·0001; novel: beta = 14·75, SE = 3·83, t(31·15) = 3·85, p < ·001]. The estimated effect size for novel vs. ungrammatical, calculated as the correlation between the fitted and observed values of the model (Xu, 2003), was $\Omega_0^2$ = ·25. For the purpose of comparing familiar and novel item ratings, we changed the reference category to familiar, and found that ratings for familiar and novel items did not differ [beta = –2·84, SE = 4·18, t(28·08) = –0·68, p = ·50]. The mean rating for familiar items was 66·09 (SE = 1·47), for novel items 63·26 (SE = 1·55), and for ungrammatical items 48·51 (SE = 1·63). Figure 2 illustrates mean ratings by item type for each participant. All participants showed numerically higher mean ratings for novel grammatical items than for ungrammatical items.
Fig. 2. Mean scale ratings of test items by item type.
We converted scale ratings to z-scores as in Reeder et al. (2013) to control for variable use of the scale, and we use z-score ratings as the dependent variable from this point forward. Familiar and novel items were still rated higher than ungrammatical items [familiar: beta = 0·57, SE = 0·12, t(33·69) = 4·55, p < ·0001; novel: t(32·52) = 3·97, p < ·001; $\Omega_0^2$ = ·14]. Familiar and novel ratings did not differ [beta = –0·09, SE = 0·13, t(28·30) = –0·69, p = ·50]. Single sample t-tests confirmed that mean ratings for all item types were different from zero [familiar: t(485) = 5·31, p < ·0001; novel: t(485) = 3·03, p = ·003; ungrammatical: t(485) = –7·43, p < ·0001].
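A sketch of the standardization step, assuming a long-format data frame with one row per trial (column names again illustrative):

```r
# Convert each child's slider ratings to within-subject z-scores so that
# idiosyncratic use of the scale (e.g., staying near the middle) does not
# drive differences between item types.
library(dplyr)
test_data <- test_data %>%
  group_by(subject) %>%
  mutate(z_rating = (rating - mean(rating)) / sd(rating)) %>%
  ungroup()
```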
To determine whether participants showed graded effects of generalization for different ungrammatical item types, we re-ran the mixed-effects model first with ungrammatical items with co-occurrence violations excluded, and then with items that violated linear order excluded. Our key question was whether the data provided evidence of distributional learning from co-occurrence information alone. Since we added the additional eight participants for this analysis (see 'Participants' section), the key inferential statistic computed here is a Bayes Factor for the critical comparison of novel items to co-occurrence violations. We nevertheless also report the p-values associated with the coefficients of the mixed model, although their interpretation is limited by the fact that we increased the number of participants from the original sample. Bayes Factors compare the evidence supporting the null hypothesis of no difference versus the alternative hypothesis of a difference. A Bayes Factor smaller than 1/3 is interpreted as evidence for the null hypothesis, a Bayes Factor greater than 3 is interpreted as evidence for the alternative hypothesis, and Bayes Factors between these values indicate insufficient data for the distinction (see Dienes, 2008, 2014). We used the free online Bayes calculator (Dienes, 2008) for these analyses. Because previous experiments have shown that participants can detect linear order violations, we used the mean difference between ratings for novel items and linear order violations as the estimate of the predicted difference, taking the estimate from the mixed-effects model comparing ratings for novel versus ungrammatical items with co-occurrence violation items excluded. This estimate serves as the SD of a half-normal distribution, as per Dienes (2008). The mixed-effects model with co-occurrence violation items excluded showed that participants could distinguish between novel items and linear order violations [beta = 0·51, SE = 0·12, t(31·14) = 4·24, p < ·001]. For the sample estimate, we used the coefficient from the mixed-effects model comparing ratings for novel versus ungrammatical items with linear order violations excluded. Results from this model and the Bayesian analysis suggested that participants could distinguish between novel items and co-occurrence violations [beta = 0·34, SE = 0·17, t(24·29) = 2·02, p = ·0549, BF = 3·71].
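The calculation behind these Bayes Factors can be sketched in R as follows: under H1 the true effect is modeled as half-normal with SD set to the predicted difference, and the marginal likelihood of the observed coefficient (given its standard error) is compared with the likelihood under a point null of zero. This is a sketch of the Dienes (2008) approach rather than the calculator's own code; with the coefficients reported above it returns approximately the reported BF of 3·71.

```r
# Dienes (2008) Bayes factor with a half-normal model of H1.
#   obs_mean, obs_se: observed effect and its standard error
#   prior_sd: SD of the half-normal prior (the predicted difference)
dienes_bf <- function(obs_mean, obs_se, prior_sd) {
  lik <- function(theta) dnorm(obs_mean, mean = theta, sd = obs_se)
  # Marginal likelihood under H1: integrate the likelihood over the
  # half-normal prior (density 2 * dnorm on [0, Inf))
  p_h1 <- integrate(function(theta) lik(theta) * 2 * dnorm(theta, 0, prior_sd),
                    lower = 0, upper = Inf)$value
  p_h0 <- lik(0)  # likelihood under the point null of no effect
  p_h1 / p_h0
}

# Novel vs. co-occurrence violations, using the coefficients from the text:
dienes_bf(obs_mean = 0.34, obs_se = 0.17, prior_sd = 0.51)  # approx. 3.71
```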
Taking this one step further, we ran an additional model with familiar items removed to compare novel items to each type of ungrammatical item (AXA, BXB, BXA, XBA, QXB, AXR), to determine whether each type of co-occurrence violation (QXB, AXR) could be distinguished from novel items; novel items served as the reference category. For the Bayesian analysis, we used the mean difference between ratings for novel items and ungrammatical AXA + BXB items, ·44, as the estimate of the predicted difference, because Reeder et al. (2013) used these item types, allowing us to predict that participants would be able to distinguish between them. The sample estimate was the coefficient for each item type in the mixed-effects model. For each item type except AXR, the mixed-effects model and the Bayes Factors supported the alternative hypothesis that ratings for novel items exceeded ratings for that ungrammatical item type [BXA: beta = –0·47, SE = 0·19, t(22·73) = –2·55, p = ·02, BF = 10·33; XBA: beta = –0·71, SE = 0·22, t(23·20) = –3·26, p = ·003, BF = 57·55; QXB: beta = –0·46, SE = 0·19, t(22·73) = –2·48, p = ·02, BF = 9·25; AXR: beta = –0·21, SE = 0·19, t(22·73) = –1·13, p = ·27, BF = 1·12]. For AXR, the Bayes Factor indicated insubstantial evidence for either the null or the alternative hypothesis.
Using the original dataset of 27 participants from this point forward, we added centered raw PPVT-4 scores, one-back accuracy, and age in months as factors in the model. Log-likelihood ratio tests and AIC comparison indicated that only a random item intercept was needed, likely because subject-specific variance was addressed by standardizing the rating scale. The best-fitting model included only item type and PPVT-4. The main effect of PPVT-4 was not significant [beta = –0·004, SE = 0·002, t(1419) = –1·64, p = ·10], but there was an interaction with item type, such that children with higher vocabulary scores showed a larger distinction between familiar and ungrammatical items than children with lower vocabulary scores [beta = 0·008, SE = 0·003, t(1419) = 2·44, p = ·02, $\Omega_0^2$ = ·14] (see Figure 3). The slope difference between novel and familiar items was not significant [beta = –0·005, SE = 0·003, t(1419) = –1·40, p = ·16]. The regression model is reported in Table 1.
Fig. 3. Mean z-score scale ratings by item type (familiar, novel, ungrammatical) as indexed by the Peabody Picture Vocabulary Test, 4th edition (PPVT-4), raw score.
Table 1. Results of the regression model showing the influence of item type and centered Peabody Picture Vocabulary Test, 4th edition (PPVT-4) raw score on ratings
As additional evidence that distributional learning, rather than surface-level adjacent dependencies, drove performance in this task, we calculated associative chunk strength for each item, similar to the bigram analysis Reeder et al. (2013) performed for their Experiment 3 results. Associative chunk strength is the average of the training frequencies of the three component dependencies of each test item: the two bigrams (AX and XB) and the trigram as a whole (AXB) (Knowlton & Squire, 1994). For example, the ungrammatical test item bleggin zub glim has an associative chunk strength of 0 because bleggin zub, zub glim, and bleggin zub glim were each never heard during training. In contrast, the ungrammatical sentence bleggin lapal fluggit has an associative chunk strength of 6 because the bigram bleggin lapal occurs zero times, lapal fluggit occurs 18 times, and bleggin lapal fluggit occurs zero times. The Appendix lists associative chunk strength and mean scale rating for each test item. It is possible for ungrammatical and novel items to have the same chunk strength. If participants were using only this information to perform the task, we would expect similar acceptability ratings for items with identical chunk strength. However, as Figure 4 shows, participants' ratings differ by item type for items with overlapping chunk strength. Adding chunk strength to the model did not reveal any significant effect, and the effect of item type remained significant. Thus it appears that distributional information beyond adjacent dependencies aided participants' performance in this task. However, ungrammatical items had the widest range of scores (see Appendix), which suggests that item-level properties beyond chunk strength may have influenced ratings.
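The computation can be sketched in a few lines of R; train_counts, a named vector of training-corpus frequencies, is an assumed data structure for illustration.

```r
# Associative chunk strength: the mean training frequency of a test item's
# two bigrams (AX, XB) and its full trigram (AXB).
chunk_strength <- function(a, x, b, train_counts) {
  chunks <- c(paste(a, x), paste(x, b), paste(a, x, b))
  freqs <- train_counts[chunks]
  freqs[is.na(freqs)] <- 0  # sequences never heard in training count as 0
  mean(freqs)
}

# Worked example from the text: only "lapal fluggit" occurred in training
# (18 times), so "bleggin lapal fluggit" scores (0 + 18 + 0) / 3 = 6.
train_counts <- c("lapal fluggit" = 18)
chunk_strength("bleggin", "lapal", "fluggit", train_counts)  # 6
```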
Fig. 4. Mean z-score scale ratings by item type (familiar, novel, ungrammatical co-occurrence violation, ungrammatical linear order violation) and chunk strength.
Discussion
We used an artificial grammar similar to that in Experiment 3 of Reeder et al. (2013) to test distributional learning in typically developing school-age children. The task employed systematic gaps in the exposure such that learners had to formulate grammatical categories based on shared distributions to distinguish novel grammatical from ungrammatical items. In this way, the task simulates the problem of real language learning, in which children must distinguish possible combinations from impossible ones without explicit knowledge of category membership. Children could distinguish grammatical from ungrammatical items in a grammar to which they had recently been exposed. Critically, they distinguished novel grammatical items from ungrammatical items through the utilization of distributional information in the grammar. They also showed graded performance, with lower ratings for linear order violations than co-occurrence violations, though co-occurrence violations were still rated lower than novel grammatical items (at least for QXB items), evidence that children can use this information to inform category membership. The chunk strength analysis, which is based on how often an item's component word sequences appeared during training, confirmed that children were not using adjacent dependency frequency alone to perform the task, though the observation of lower ratings for items with lower chunk strength shows that children constrain generalizations to those that are more robust statistically. Results show that children as young as six are sensitive to distributional information and can utilize it, even in the absence of other strong cuing information such as semantics or phonology, to form categories.
While different scales make a direct comparison with Reeder et al. (2013) untenable, our finding of similar ratings for familiar and novel items is somewhat inconsistent with adult performance in the earlier study. This may be due to a shortened overall exposure (three repetitions of the corpus instead of four), as Reeder et al. found lower ratings for novel items in Experiment 4, which had extended exposure time. The lack of a difference in our study suggests participants were not merely using stored sentence strings from the training to perform the test, as familiar ratings would be higher if participants relied on this strategy. These results fit with a theory of category learning whereby individual exemplars (the sequence lapal fluggit or daffin lapal bleggin) may be stored temporarily until a threshold is reached for determining a class of items that can appear in certain contexts, at which point individual exemplars are no longer needed and generalization can occur. Differences between ratings for items with co-occurrence violations and novel grammatical items suggest that information about category membership goes beyond linear order. Recall that a co-occurrence violation means not just that two words never appeared next to each other (an adjacent dependency), but that the two categories did not. This is a novel finding for any age of participant. Regarding individual differences in statistical learning, raw scores on the PPVT-4 (Dunn & Dunn, 2007) predicted the distinction between ratings for familiar and ungrammatical items. There were no main effects of, or interactions with, age or accuracy on the one-back task during training, suggesting that distributional learning may be developmentally invariant and require only minimal attention, or that other factors may allow for advantages. Other work has suggested individual differences in statistical learning are related to language abilities (Kidd, 2012; Misyak & Christiansen, 2012; Misyak, Christiansen, & Tomblin, 2010; Siegelman & Frost, 2015). Lany and Saffran (2011) found a relationship between vocabulary ability and use of distributional cues in linguistic input in a word learning task. In their study, infants with large vocabularies learning an artificial grammar generalized semantic categories using distributional cues more than phonological cues, while infants with smaller vocabularies showed the opposite pattern. If vocabulary scores are understood as a proxy measure of general language aptitude, results from the present study, combined with those of Lany and Saffran, provide some evidence that statistical learning is related to language ability. Given the age of the children in our study, the direction of this relationship is not clear. The parameter estimates and effect size for the interaction are small, and it will be interesting to see whether a wider range of vocabulary abilities serves to increase this effect. Future studies, including a replication with children with specific language impairment, will attempt to explore this relationship further.
Adult-like metacognitive skills do not appear to be necessary for distributional learning. Children aged six to nine could both generalize and limit generalizations based on what they heard in the input. Findings from this study support the hypothesis that implicit statistical learning is involved in language acquisition in two ways. First, there is a link between language ability and distributional learning ability. Second, because there is no explicit instruction in the task, children do not need to be fully aware that grammatical rules exist before they begin using information to make generalizations. We also saw that children showed gradation in their formation of categories, with higher ratings for items with co-occurrence violations than for items with linear order violations. A future study that directly manipulates the number of items in the lexicon, as well as the length of exposure to the artificial grammar, would reveal how learners use and weight different cues for generalization. This would allow exploration of item-level properties that was not possible with the grammar of the current study.
We provide evidence that children as young as six can use distributional information in novel linguistic input to form grammatical categories, without other cuing information. Evidence that children use categories comes from higher ratings for novel grammatical test items than for ungrammatical items containing similar bigram frequencies. That such a powerful learning mechanism is available to young learners strengthens its plausibility as a useful mechanism in language acquisition. This work provides an important foundation for extending the findings through additional studies on subcategory learning, comparison with adults, and comparison with individuals with language impairment. Future work will also explore manipulations of exposure, as in Reeder et al. (2013), to determine in more detail how children limit generalizations.
Appendix
List of all training items, by AXB combination and sentence type.
List of all test items with chunk strength, organized by item type and mean rating.