INTRODUCTION
The centrepiece of Chomsky's (1957) landmark Syntactic Structures was the observation that speakers' knowledge of their native language does not consist solely of an inventory of rote-learned sentences; rather, speakers acquire abstract generalizations which allow them to comprehend and produce utterances that they have never heard before. The question of how children acquire this quintessentially human ability occupies a core place in language acquisition research.
Braine (1971) pointed out a paradox that lies at the heart of this ability. On the one hand, even two- to three-year-old children are adept at producing and understanding sentences in which a verb is used in a sentence-level verb-argument structure construction in which it has never appeared in the input. For example, when taught a novel verb in an intransitive inchoative construction (e.g., The sock is weefing), children are able to produce a transitive causative sentence with this verb (e.g., The mouse is weefing the sock; e.g., Akhtar & Tomasello, 1997; see Tomasello, 2003; Ambridge & Lieven, 2011, 2015, for reviews; and Gertner, Fisher & Eisengart, 2006; Noble, Rowland & Pine, 2011, for similar findings in comprehension).
On the other hand, children must somehow restrict this productivity to avoid producing ungrammatical utterances. While many verbs can appear in both the intransitive inchoative and transitive causative constructions (e.g., The ball rolled / The clown rolled the ball), and children must be able to generalize from one to another, they must also learn that certain verbs are restricted to the former (e.g., The man laughed / *The clown laughed the man [where * indicates an ungrammatical utterance]). Indeed, evidence from both diary and elicited production studies demonstrates that many children pass through a stage in which they produce these types of overgeneralizations, before ‘retreating’ from error (e.g., Bowerman, 1988; Brooks & Tomasello, 1999; Brooks, Tomasello, Dodson & Lewis, 1999; Brooks & Zizak, 2002; Pinker, 1989). Analogous errors observed for the dative and locative alternations are summarized in Table 1.
Table 1. Possible and attested verb argument structure overgeneralization errors
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20160930040326621-0957:S0305000915000586:S0305000915000586_tab1.gif?pub-status=live)
Notes: Attested errors (all from Bowerman, 1988) are shown in bold, with the age of the child (years;months) and a possible grammatical formulation using the alternative construction. Reproduced by permission of Wiley from Ambridge, Pine, Rowland, Chang & Bidgood (2013).
The problem of the retreat from overgeneralization has attracted a considerable amount of attention in the literature. Recent studies of all three types of overgeneralization error listed in Table 1 have provided support for three proposals.
Under the entrenchment hypothesis (e.g., Braine & Brooks, 1995), repeated presentation of a verb (regardless of sentence type) contributes to an ever-strengthening probabilistic inference that its use in non-attested constructions is not permitted. In support of this hypothesis, many studies have demonstrated a negative correlation between overall verb frequency (regardless of sentence type) and the relative acceptability and production probability of errors with that verb, in judgement and production tasks respectively (Ambridge, 2013; Ambridge & Brandt, 2013; Ambridge, Pine & Rowland, 2012a; Ambridge, Pine, Rowland & Chang, 2012b; Ambridge, Pine, Rowland, Freudenthal & Chang, 2014; Ambridge, Pine, Rowland, Jones & Clark, 2009; Ambridge, Pine, Rowland & Young, 2008; Bidgood, Ambridge, Pine & Rowland, 2014; Blything, Ambridge & Lieven, 2014; Brooks et al., 1999; Stefanowitsch, 2008; Theakston, 2004; Wonnacott, Newport & Tanenhaus, 2008).
The pre-emption hypothesis (e.g., Goldberg, 1995) is similar, but with one important difference. Under entrenchment, a particular error (e.g., *Bart dragged Lisa the box, where a PO-only verb is used in a DO-dative) is probabilistically blocked by any use of the relevant verb (e.g., The man dragged the box; The boy dragged his feet; That movie really dragged on, etc.). Under pre-emption, errors of the form *Bart dragged Lisa the box are probabilistically blocked only by uses that express the same intended message; i.e., PO-dative uses of that verb (e.g., Marge dragged the package to Homer). Thus this hypothesis predicts a negative correlation between the acceptability/production probability of a particular error (e.g., DO-dative uses of drag) and the frequency of that verb in the single most nearly synonymous construction (e.g., PO-dative uses of drag). Although the two measures tend to be highly correlated, recent studies suggest that – to the extent to which they can be differentiated statistically – pre-emption plays a role above and beyond entrenchment (e.g., Ambridge, 2013; Ambridge et al., 2012a; Ambridge et al., 2014; Boyd & Goldberg, 2011; Brooks & Tomasello, 1999; Brooks & Zizak, 2002; Goldberg, 2011).
The semantic verb class hypothesis (Pinker, 1989) argues that learners form classes of verbs that are restricted to particular constructions only. For example, focusing on the dative alternation, verbs of accompanied motion and manner of speaking may appear in the PO-dative (e.g., Marge pulled the box to Homer; Homer shouted the instructions to Lisa), but are less than fully acceptable in the DO-dative (e.g., *Marge pulled Homer the box; *Homer shouted Lisa the instructions). On the other hand, verbs of giving and illocutionary communication may appear in both constructions (Lisa gave the book to Bart, Lisa gave Bart the book; Lisa showed the answer to Homer, Lisa showed Homer the answer). Evidence for this hypothesis comes from production and judgement studies showing that if children are taught novel verbs, they use their notional semantic class membership to determine the constructions in which they can and cannot appear (Ambridge et al., 2008, 2009; Ambridge, Pine & Rowland, 2011; Ambridge et al., 2012b; Bidgood et al., 2014; Brooks & Tomasello, 1999; Gropen, Pinker, Hollander & Goldberg, 1991a, 1991b; Gropen, Pinker, Hollander, Goldberg & Wilson, 1989).
Importantly, these semantic classes are not arbitrary. Rather, a particular class of verbs can appear in a particular construction only if there is sufficient overlap between the semantics of the verbs and the semantics of the construction. For example, the DO-dative construction is associated with the meaning of ‘causing to have’ (Pinker, 1989). Thus the reason that verbs from the give and show classes may appear in this construction is that they are consistent with this meaning (in the latter case, the possession transfer is metaphorical; a transfer of information). On this account, the reason that verbs from the pull and shout classes may not appear in the DO-dative construction is that they are not sufficiently consistent with this ‘causing to have’ meaning. Instead, they are restricted to the PO-dative construction, because they are compatible with the meaning of this construction (‘causing to go’). Ambridge et al. (2014) found that independent ratings of the extent to which each verb was consistent with ‘causing to have’ versus ‘causing to go’ significantly predicted the rated acceptability of that verb in the DO- versus PO-dative construction (see Ambridge & Brandt, 2013; Ambridge et al., 2012a, for an analogous finding for the locative alternation).
In summary, previous studies have found support for the entrenchment, pre-emption, and semantic verb class hypotheses. This raises the question of whether it is possible to posit a single learning mechanism that yields all of these effects (indeed, it is debatable whether any of these proposals constitutes a mechanistic account of the retreat from error per se). One proposal for such a learning mechanism is the ‘FIT’ account (Ambridge, 2013; Ambridge & Lieven, 2011; Ambridge et al., 2012a, 2012b; Ambridge et al., 2014). The acronym captures the account's emphasis on the importance of the Fit between Items and (construction) Templates (in this case, with regard to semantics).
The central assumption of the account is that speakers maintain an inventory of argument structure constructions – each acquired by abstracting across concrete tokens of those constructions in the input – which, in production, compete to express the speaker's desired message (e.g., MacWhinney, 2004). The activation level of each competitor is determined by three factors, illustrated here for the example message “MARGE CAUSED HOMER TO HAVE THE BOX BY PULLING THE BOX TO HOMER”:
- Verb-in-construction frequency. The verb in the message (here pull) activates each construction in proportion to the frequency with which it has appeared in that construction in input sentences. This factor yields pre-emption effects because every input occurrence of pull in a PO-dative boosts the activation of this construction, at the expense of the DO-dative construction, in production. This factor yields entrenchment effects because every input occurrence of pull in any other construction (e.g., a simple transitive) boosts the activation of this construction at the expense of the DO-dative.
- Relevance. A ‘relevant’ construction is one that contains a slot for every item in the speaker's message. So, for the present example, both the PO-dative (yielding Marge pulled the box to Homer) and the DO-dative (*Marge pulled Homer the box) are more relevant than, for example, the transitive (Marge pulled the box). The notion of relevance captures the intuition of the pre-emption hypothesis that the PO- and DO-dative are better competitors for one another than are other constructions such as the transitive.
- Fit. The third factor is the compatibility (or fit) between the semantic properties of each item in the message (e.g., the verb) and the relevant slot in each candidate construction. The semantics of each slot are a frequency-weighted average of the semantics of each item that appeared in that position in the input utterances that gave rise to the construction. This factor is designed to capture the finding that ratings of the extent to which verbs exhibit semantic properties to do with ‘causing to have’ and ‘causing to go’ predict acceptability in the DO- and PO-dative respectively (Ambridge et al., 2014).
- A fourth factor, overall construction frequency, may also be important. That is, all else being equal, a speaker is more likely to select a higher-frequency construction (e.g., an active transitive) than a lower-frequency alternative (e.g., the passive). This factor may be necessary to explain, for example, why some alternations attract higher error rates than others. Although construction frequency is indirectly manipulated in the present study, the simulation involves only two constructions, so it is not possible to draw any conclusions regarding the importance of this factor.
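To make the competition concrete, the factors above can be combined into a single activation score per candidate construction. The following is a minimal Python sketch: the multiplicative combination rule and every number in it are hypothetical illustrations, not values or equations from the FIT account itself.

```python
# Illustrative sketch of construction competition under the FIT account.
# The combination rule (simple multiplication) and all numbers below are
# hypothetical; the verbal model does not specify exact equations.

def fit_activation(verb_freq, relevance, fit):
    """Combine the three factors into a normalized activation per construction."""
    raw = {c: verb_freq[c] * relevance[c] * fit[c] for c in verb_freq}
    total = sum(raw.values()) or 1.0
    return {c: v / total for c, v in raw.items()}

# Hypothetical scores for 'pull' given a three-argument transfer message.
verb_freq = {"PO": 30, "DO": 0, "Other": 170}     # verb-in-construction counts
relevance = {"PO": 1.0, "DO": 1.0, "Other": 0.2}  # a slot for every message item?
fit       = {"PO": 0.9, "DO": 0.3, "Other": 0.5}  # verb/slot semantic compatibility

act = fit_activation(verb_freq, relevance, fit)
best = max(act, key=act.get)  # the PO-dative wins the competition for 'pull'
```

Note how the zero verb-in-construction count for pull + DO-dative means the DO-dative can never win, whatever its relevance and fit: this is the entrenchment/pre-emption pressure in miniature.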
Of course, as a verbal model, the FIT account is little more than a re-description of the experimental findings. Thus the aim of the present study is to instantiate the account as a computational model in order to investigate the extent to which it can simulate (a) the overall pattern of generalization to novel verbs, overgeneralization errors, and subsequent retreat shown by children, and (b) the verb-by-verb pattern of adult acceptability ratings for overgeneralization errors in judgement studies. Before introducing the computational model itself, it is important to consider the respects in which it differs from previous models that also simulate aspects of the retreat from overgeneralization. Note that – although not all include a role for semantics – all nevertheless share a degree of theoretical and implementation overlap with the present model, and thus should be considered complementary models, rather than radically different rivals.
Like the present model, the hierarchical Bayesian model of Perfors, Tenenbaum, and Wonnacott (2010) learns item-based links between particular verbs and the PO- and DO-dative constructions. An important difference is that it generalizes to novel verbs by additionally forming overhypotheses regarding the tendency of (a) all verbs and (b) classes of distributionally similar verbs to occur in both constructions as opposed to only one, rather than on the basis of semantic similarity (the same is true for related models based on the notion of Minimum Description Length; e.g., Dowman, 2000, submitted; Hsu & Chater, 2010; Hsu, Chater & Vitányi, 2011, 2013; Onnis, Roberts & Chater, 2002). However, although there is some evidence for the importance of overhypotheses in artificial grammar learning studies with adults (e.g., Perek & Goldberg, in press; Wonnacott et al., 2008), it remains to be seen whether this procedure plays a crucial role in children's natural language learning. Indeed, the claim that children make use of overhypotheses would seem to contradict the large body of evidence suggesting that their early knowledge of language consists of holophrases and low-scope formulae, and only later becomes more abstract (e.g., Ambridge & Lieven, 2015; Tomasello, 2003). Although some versions of the Perfors et al. (2010) model include verb-level semantic features, each feature has only three possible values – corresponding to PO-only, DO-only, and alternating verbs – and so does not simulate the fine-grained by-verb semantic effects observed for human participants (Ambridge et al., 2014).
The dual-path model of Chang (2002; see also Chang, Dell & Bock, 2006) is able to simulate a wide range of language acquisition phenomena, including generalization of novel verbs into unattested constructions and the retreat from overgeneralization (including for the dative alternation; see Chang, 2002, pp. 638–640). The model works by learning to sequentially predict the next word in a sentence (using a Simple Recurrent Network), given a message (e.g., AGENT = Marge, ACTION = drag, GOAL = Homer, PATIENT = box) and construction-level event semantics (e.g., CAUSE + MOTION + TRANSFER for the PO-/DO-dative). Due to its sequential nature, the dual-path model constitutes a lower-level and hence more realistic approximation of the task facing real language learners than any of the other models outlined here (including the new model outlined in the present paper). An important difference from the present model is that the dual-path model does not represent verb-level semantics (only construction-level event semantics). Also, its use of artificially generated datasets means that the model does not make predictions regarding by-verb patterns of adult grammaticality judgements.
Perhaps the model closest to the present simulation is that of Alishahi and Stevenson (2008). This model receives input in the form of pairs of a scene and an utterance (e.g., DRAG_CAUSE MARGE_AGENT BOX_THEME TO HOMER_DESTINATION + Marge dragged the box to Homer), from which it extracts argument-structure frames (e.g., [argument1] [verb] [argument2] [argument3]). Frames that are sufficiently similar are collapsed into constructions, using an unsupervised Bayesian clustering process. This model is similar to the present simulation in its use of verb and construction semantics, which allows it to show both generalization to novel verbs and overgeneralization with subsequent retreat, and also in its use of corpus-derived verb + construction counts. An important difference is that Alishahi and Stevenson's simulations did not investigate the relative importance of entrenchment, pre-emption, and verb semantics. Neither did these authors attempt to simulate the by-verb pattern of adult grammaticality judgements. Indeed, in its present form Alishahi and Stevenson's model is unlikely to be able to do so, since – like all of the previous models discussed in this section – it does not represent verb semantics at a sufficiently fine-grained level (see p. 827 for discussion).
The present paper introduces a new model that instantiates the key assumptions of Ambridge and colleagues' verbal FIT account: competition between constructions based on (a) verb-in-construction frequency, (b) relevance of constructions for the speaker's intended message, and (c) fit between the fine-grained semantic properties of individual verbs and individual constructions (or, more accurately, their [VERB] slot). The aim is to investigate the ability of the model (a) to explain generalization to novel verbs, overgeneralization errors, and subsequent retreat, (b) to model the pattern of by-verb grammaticality judgements obtained in adult studies, and (c) to elucidate the relative importance of entrenchment, pre-emption, and verb semantics, and to explore one way in which these factors might be combined into a learning model.
Given that previous regression studies have already demonstrated that entrenchment, pre-emption, and verb semantics play a role in the retreat from error – including for the dative constructions (Ambridge et al., 2012a; Ambridge et al., 2014) – a question arises as to how the present model adds to our understanding of the phenomenon. The main advantage is that, unlike a regression model, the present computational model instantiates – albeit at a relatively high level – a mechanistic account of a possible procedure for learning verb argument structure restrictions. Nobody would argue that children use a single pass of an input corpus to calculate, for each verb, entrenchment, pre-emption, and verb-semantic measures (i.e., ‘meta’ or ‘macro’ variables), which they then combine in a way analogous to a statistical regression. Rather, children – like the present computational model – use the semantic and statistical regularities that fall out of the raw input data (not meta variables that describe them) to incrementally learn probabilistic links between verbs and constructions. Thus a successful computational model, unlike a regression model, will simulate a period of overgeneralization followed by retreat, and allows for the investigation of factors that alter the trajectory of this learning process. For example, one question we investigate is how learning is affected by the presence of arbitrary lexical exceptions.
METHOD
The problem is conceptualized as one of the speaker learning, via comprehension, verb + construction mappings that allow her, in production, to select the appropriate construction, given the verb that she intends to use (in this case the PO-dative, the DO-dative, or Other). This is, of course, a relatively high-level conceptualization, and one that abstracts across the many factors other than verb-level properties that determine construction choice (e.g., information structure, the relative length of the theme and recipient NPs, etc.; see Bresnan, Cueni, Nikitina & Baayen, 2007). Neither does it address the issue of how either verbs or constructions are acquired in the first place (see Twomey, Chang & Ambridge, 2014, for one simulation of the acquisition of verb semantics). Nevertheless, the rather abstract and theory-neutral nature of this conceptualization renders it potentially compatible with any theoretical approach which assumes that adult speakers possess some kind of abstract knowledge of verb argument structure constructions.
All simulations used the OXlearn MATLAB package (Ruh & Westermann, 2009). The learning task was instantiated in a three-layer feed-forward backpropagation network with seven input units (representing the verb), three hidden units, and three output units (representing PO-dative, DO-dative, and Other). The structure of the network is summarized in Figure 1. Both the output and hidden layers used a sigmoid activation function (learning rate 0·01), and received input from a bias unit. Seven input units were used in order to allow each verb to be represented as a vector across seven composite semantic features taken from Ambridge et al. (2014), roughly speaking: causing to go (two predictors), causing to have, speech, mailing, bequeathing, and motion. In this previous study, participants rated each verb for the extent to which it exhibited each of eighteen semantic features relevant to the alternation, with these features condensed to seven using Principal Components Analysis (PCA). Each verb was represented in terms of its mean rating on each of these seven composite semantic features.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20160930040326621-0957:S0305000915000586:S0305000915000586_fig1g.gif?pub-status=live)
Fig. 1. Architecture of the network.
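To make the set-up concrete, a network of this shape can be sketched in a few lines of Python/NumPy. This is an illustrative re-implementation, not the OXlearn code: the weight initialization, the squared-error loss, and the class and function names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DativeNet:
    """Seven semantic-feature inputs -> three hidden units -> three construction
    outputs (PO-dative, DO-dative, Other), with sigmoid activations and bias
    units on both layers, trained by backpropagation."""

    def __init__(self, n_in=7, n_hid=3, n_out=3, lr=0.01):
        self.W1 = rng.normal(0.0, 0.5, (n_hid, n_in))
        self.b1 = np.zeros(n_hid)
        self.W2 = rng.normal(0.0, 0.5, (n_out, n_hid))
        self.b2 = np.zeros(n_out)
        self.lr = lr

    def forward(self, x):
        self.h = sigmoid(self.W1 @ x + self.b1)
        self.y = sigmoid(self.W2 @ self.h + self.b2)
        return self.y

    def train(self, x, target):
        """One backprop step on a single verb + construction pair
        (squared-error loss; derivative of the sigmoid is y * (1 - y))."""
        y = self.forward(x)
        d_out = (y - target) * y * (1 - y)
        d_hid = (self.W2.T @ d_out) * self.h * (1 - self.h)
        self.W2 -= self.lr * np.outer(d_out, self.h)
        self.b2 -= self.lr * d_out
        self.W1 -= self.lr * np.outer(d_hid, x)
        self.b1 -= self.lr * d_hid
```

In this sketch, a single call to `train` corresponds to one training trial as described below: the seven-element input vector is a verb's composite semantic ratings, and the target is a one-hot vector over the three constructions.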
It is important to acknowledge that this implementation sidesteps the extremely difficult question of how learners acquire verb semantics, assuming – in effect – that learners have perfect knowledge of the semantics of every verb from the very first time that they hear it. In fact, real-life acquisition of verb semantics no doubt requires a considerable amount of experience, and presumably proceeds, at least in part, on the basis of syntactic and lexical distributional information (e.g., Gleitman, 1990; Pinker, 1994; Twomey et al., 2014). However, it is important to point out that this problem is shared by all current models of the acquisition of verbs' argument structure restrictions, including the verbal and computational models outlined in the previous section. Thus this shortcoming is no reason to disregard the present model in favour of its contemporaries.
Twenty-four verbs were used: the core set from Ambridge et al. (2012b), half PO-only, half alternating (note that this first simulation did not include DO-only verbs – e.g., bet and wager – which are of very low type and token frequency, particularly in speech to children, and so constitute a marginal phenomenon).
The PO-only verbs were drawn from two semantic classes: pull-verbs (pull, drag, carry, haul, lift, and hoist) and shout-verbs (shout, screech, whisper, hiss, scream, and shriek). The alternating verbs were drawn from two further classes: show-verbs (show, teach, ask, pose, tell, and quote) and give-verbs (give, hand, send, mail, throw, and toss).
For each training trial, a verb was presented to the network, along with its target construction (i.e., the target activation of the PO-dative, DO-dative, or Other output unit was set to 1, with the other two output units set to 0). Verb + construction pairs were presented to the model in proportion to the log frequency with which the verb occurred in that construction in the British National Corpus (counts taken from Ambridge et al., 2014), as shown in Table 2. ‘Other’ counts include all non-dative uses of the relevant verb, including – for example – simple transitives (He pulled the rope) and single-word utterances (e.g., Pull!).
Table 2. Training set
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20160930040326621-0957:S0305000915000586:S0305000915000586_tab2.gif?pub-status=live)
Thus a single training sweep consisted of 329 verb + construction pairs (86 PO-datives, 66 DO-datives, and 177 Other constructions). The overgeneralization pressure on the model arises from the fact that half the verbs it encounters activate both the PO- and DO-dative output units (though only one or the other on any given trial), while the remainder activate only the PO-dative unit (with varying frequency).
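The construction of a training sweep from corpus counts can be sketched as follows. The two verbs and their counts are hypothetical stand-ins for the real Table 2 values, and the exact log-compression formula used here (`round(ln(n + 1))`) is an assumption for illustration rather than the paper's specification.

```python
import math

# Hypothetical raw BNC counts for two verbs (the real counts are in Table 2).
counts = {
    ("pull", "PO"): 90,   ("pull", "DO"): 0,    ("pull", "Other"): 5000,
    ("give", "PO"): 3000, ("give", "DO"): 8000, ("give", "Other"): 20000,
}

def sweep_items(counts):
    """One training sweep: each verb + construction pair is repeated in
    proportion to its log-compressed corpus frequency."""
    items = []
    for (verb, cxn), n in counts.items():
        reps = round(math.log(n + 1))  # log compression; +1 handles zero counts
        items.extend([(verb, cxn)] * reps)
    return items

sweep = sweep_items(counts)
```

Note that pull + DO-dative, with a corpus count of zero, simply never appears in the sweep: the overgeneralization pressure described above comes entirely from the semantic overlap between pull and the alternating verbs, not from any direct DO-dative evidence for pull.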
For each test trial, the frozen model was presented with a verb – either a familiar verb from the training set or a novel verb (described below) – and the corresponding activation of the DO-dative output unit recorded. The activation of this output unit is taken as the model's ‘grammaticality judgement’ for a DO-dative sentence with the relevant verb. The model's judgements were compared against those obtained from adult participants (from Ambridge et al., 2012b). This method is preferable to simply investigating the model's ability to learn the training set to some error criterion (which is trivial, given the present set-up). It is important to emphasize that the model did not receive any information regarding participants' grammaticality judgements; as described above, target activations of output units were determined solely on the basis of corpus frequency.
Novel verbs were created in order to test the model's ability to generalize; that is, to determine the grammaticality or otherwise of previously unseen verbs in the DO-dative construction on the basis of their semantics. Ambridge et al. (2012b) found that adults displayed this ability, though children aged 5–6 and 9–10 did not. Novel verbs were created by averaging across the semantic ratings for all of the verbs in the relevant semantic class, excluding the target verbs. This resulted in the creation of four novel verbs, two PO-only (novel pulling and novel shouting) and two alternating (novel showing and novel giving). Ambridge et al. (2012b) found that adults rated DO-dative uses of the former, but not the latter, as ungrammatical.
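The creation of a novel verb as a class-average semantic vector might look like the sketch below; the three verbs and their seven feature values are invented for illustration (the real vectors are participants' mean ratings on the seven PCA components from Ambridge et al., 2014).

```python
import numpy as np

# Hypothetical 7-feature semantic ratings for three members of the pull class.
pull_class = {
    "pull": np.array([0.9, 0.8, 0.1, 0.0, 0.0, 0.0, 0.7]),
    "drag": np.array([0.8, 0.9, 0.2, 0.0, 0.0, 0.1, 0.6]),
    "haul": np.array([0.7, 0.9, 0.1, 0.0, 0.0, 0.0, 0.8]),
}

def novel_verb(cls):
    """A novel class member: the mean of the class's semantic vectors,
    yielding a verb the network has never seen but whose semantics sit
    at the centre of a familiar class."""
    return np.mean(list(cls.values()), axis=0)

novel_pulling = novel_verb(pull_class)
```

At test, this averaged vector is presented as input and the DO-dative output unit's activation is read off as the model's judgement for the novel verb.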
RESULTS
Semantics + Entrenchment model
The model as described above implements entrenchment, but not pre-emption, as all non-DO-dative uses of a particular verb, whether PO-dative or Other, have an equal impact in causing the model not to activate the DO-dative output unit for this verb (pre-emption is added to a subsequent model). The model implements a role for verb semantics, by virtue of the fact that each verb is represented as a vector of seven semantic feature scores. The model was trained for 100,000 sweeps (each consisting of 329 verb + construction pairs) and its output recorded every 10,000 sweeps. All results presented here and subsequently average across ten runs of the model with different random seeds. Figure 2 shows the familiar and novel-verb results for the Semantics + Entrenchment model, averaging across the six verbs in each class.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20241025160505-50215-mediumThumb-S0305000915000586_fig2g.jpg?pub-status=live)
Fig. 2. The Semantics + Entrenchment model.
The model rapidly learns that the DO-datives are acceptable (i.e., to activate this output unit) for the show and give verbs, but not the pull and shout verbs. (The reason that the activation of the DO-dative output unit drops to around 0·3 even for verbs that are grammatical in this construction is that the model is learning that PO-dative and Other uses [e.g., transitives; single-word uses] are also possible constructions for this verb.) The model also generalizes this pattern to the four novel verbs. This latter finding demonstrates one way in which it is possible for a model that includes no hard-wired discrete verb classes to show class-type generalization behaviour. A cluster plot of the hidden units (Figure 3) demonstrates that the model achieves this behaviour by forming representations in the hidden layer that map semantically similar verbs onto the same output unit.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20241025160505-86716-mediumThumb-S0305000915000586_fig3g.jpg?pub-status=live)
Fig. 3. Cluster tree of hidden units in the Semantics + Entrenchment model.
Interestingly, at the second coarsest grain size (the coarsest being PO-only and alternating verbs) the four clusters essentially correspond, with few exceptions, to Pinker's (1989) classes of “manner of speaking (shout, screech, whisper, hiss, scream and shriek)” (p. 112), “continuous imparting of force in some manner causing accompanied motion (pull, drag, carry, haul, lift and hoist)” (p. 110), “illocutionary verbs (show, teach, ask, pose, tell and quote)” (p. 112), and “verbs of giving (give, hand, send, mail, throw and toss)” (p. 110). Crucially, however, the model also groups verbs at a finer level. For example, Figure 3 shows that the model conceptualizes the two members in the pairs hiss + screech, lift + carry, hoist + haul, ask + tell, throw + toss, and send + give as more similar to one another than to other verbs in the same Pinker class. The ability to instantiate semantic similarity at a more fine-grained level is presumably key if the model is to simulate the graded pattern of judgements shown by human participants (explored in detail in subsequent simulations). Although it is beyond the scope of the present investigation to implement and test a Pinker-style class-based model directly, it seems unlikely that a model that can assign only one of four discrete values (corresponding to the classes) will be able to simulate this graded pattern.
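The similarity structure underlying such a cluster tree can be illustrated by ranking verb pairs by the distance between their hidden-layer representations. The vectors below are invented for illustration; the real cluster tree was presumably produced with a standard hierarchical clustering routine over the trained network's hidden activations.

```python
import numpy as np

# Hypothetical hidden-layer representations (3 hidden units) for four verbs.
hidden = {
    "hiss":    np.array([0.9, 0.1, 0.2]),
    "screech": np.array([0.8, 0.2, 0.2]),
    "give":    np.array([0.1, 0.9, 0.8]),
    "send":    np.array([0.2, 0.8, 0.9]),
}

def nearest_pairs(reps):
    """Rank all verb pairs by Euclidean distance between their hidden
    representations - the raw similarity structure a cluster tree is built
    from: semantically similar verbs end up with similar hidden vectors."""
    names = list(reps)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return sorted(pairs, key=lambda p: np.linalg.norm(reps[p[0]] - reps[p[1]]))

ranking = nearest_pairs(hidden)  # within-class pairs rank above cross-class pairs
```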
The finding that the model does not link the familiar pull and shout verbs to the DO-dative is, in one sense, trivial: because the activation of the output units sums to 1 and these verbs activate only the PO-dative and Other units during training, it is inevitable that the model will not activate the DO-dative unit for these verbs. In another sense, however, the triviality of this finding is exactly the point: a learner that probabilistically links verbs and the constructions in which they have appeared will inevitably show an ‘entrenchment’ effect, even while retaining the ability to generalize novel verbs into unattested constructions on the basis of their semantics; there is no need to posit entrenchment as a special dedicated mechanism.
That said, the Semantics + Entrenchment model fails in two important respects. First, unlike children, it does not display a period of overgeneralization. For the pull and shout verbs – familiar and novel alike – the activation of the DO-dative unit drops rapidly to below 0·1 within the first 10,000 sweeps. Second, at no point during learning do the model's judgements of overgeneralization errors (i.e. DO-uses of PO-only verbs) correlate with those obtained from adult participants (see Figure 4, plotted at 60,000 sweeps, for comparison with the subsequent model).
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20241025160505-53371-mediumThumb-S0305000915000586_fig4g.jpg?pub-status=live)
Fig. 4. No correlation between model and human ratings for the Semantics + Entrenchment model.
Semantics + Entrenchment + Pre-emption model
The pre-emption hypothesis holds that overgeneralization errors are blocked not by any uses of the relevant verb (as under entrenchment), but only – or, at least, especially – by uses that express the same intended meaning (or ‘message’). For example, the error *Marge pulled Homer the box (pull + DO-dative) would be pre-empted by pull + PO-dative sentences (e.g. Bart pulled the box to Lisa), as both sentences express a three-argument ‘transfer’ message. Such an error would not be pre-empted by simple transitives (e.g. He pulled the rope), one-word utterances (Pull!), and so on, as such sentences do not express a transfer message.
Pre-emption was instantiated in the model by adding an additional input unit to encode the message. This unit was set to 1 when the target output unit was either the PO- or DO-dative unit (= ‘transfer message’), and 0 when the target output was Other (= ‘non-transfer message’). In terms of real-word learning, the assumption is that learners understand the speaker's intended message in comprehension (= the model's learning phase) and have in mind their own intended message in production (= the model's test phase). In all other respects, including the training set, the model was identical to that outlined above.
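The message-unit coding described above can be sketched as follows. This is a hypothetical illustration: the feature bundle for *give* and the construction labels are invented, not the paper's actual coding.

```python
import numpy as np

# Hypothetical sketch of the pre-emption input coding: the input is the
# verb's semantic-feature bundle plus one extra 'transfer message' unit,
# set to 1 when the target construction is the PO- or DO-dative and 0
# when it is Other.
CONSTRUCTIONS = ["PO", "DO", "Other"]

def encode_input(semantic_features, target_construction):
    """Return the input vector: semantic features + transfer-message unit."""
    message = 1.0 if target_construction in ("PO", "DO") else 0.0
    return np.append(np.asarray(semantic_features, dtype=float), message)

give_features = [1.0, 0.0, 1.0, 0.0]             # invented bundle for 'give'

x_dative = encode_input(give_features, "DO")     # transfer message on
x_other = encode_input(give_features, "Other")   # no transfer message
print(x_dative, x_other)
```

At test, the message unit is always set to 1, since the model is being asked for a judgement of a dative (transfer-message) sentence.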
This model (see Figure 5) addressed both of the shortcomings of the Semantics + Entrenchment model. First, in contrast to this previous model, it showed a protracted ‘overgeneralization’ period (approximately 0–40,000 sweeps) in which verbs that had been presented solely in PO-dative sentences during training (the familiar pull and shout verbs) activated the DO-dative output unit, with an activation strength similar to that yielded by the alternating verbs (the familiar show and give verbs). Presumably, this overgeneralization period is a consequence of the fact that the ‘transfer message’ unit (which was always set to 1 during testing, since a DO-dative judgement was being elicited) mapped to both the PO- and DO-output units during training. The model retreats from overgeneralization as it learns which particular dative unit, PO or DO, is appropriate for each verb; information that this model (like the previous model) rapidly generalizes to novel verbs, presumably on the basis of semantic overlap with familiar verbs (a later simulation tests this presumption).
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20241025160505-39446-mediumThumb-S0305000915000586_fig5g.jpg?pub-status=live)
Fig. 5. The Semantics + Entrenchment + Pre-emption model.
Addressing the second shortcoming of the previous model, the current model simulated, at 50,000 and 60,000 sweeps, the by-verb pattern of adult grammaticality judgement data for overgeneralization errors of PO-only verbs into the DO-dative construction (r = 0·66, p = ·02; r = 0·68, p = ·02); see Figure 6. Beyond this point, no significant correlations were observed, presumably because the model had overlearned the solution, increasingly treating all verbs attested only in the PO-dative as extremely – and equally – ungrammatical in the DO-dative.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20241025160505-78716-mediumThumb-S0305000915000586_fig6g.jpg?pub-status=live)
Fig. 6. Correlation between model and human ratings for the Semantics + Entrenchment + Pre-emption model.
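The by-verb fit statistic reported above can be computed straightforwardly. The sketch below uses invented numbers purely to show the calculation; the paper's actual activations and ratings are not reproduced here.

```python
import numpy as np

# Illustrative sketch of the by-verb fit statistic: Pearson's r between
# the model's DO-dative activations for PO-only verbs and mean human
# grammaticality ratings of the corresponding DO-dative sentences.
# All values below are made up for illustration.
model_do_activation = [0.02, 0.10, 0.05, 0.30, 0.22, 0.15]  # one per verb
human_ratings = [1.2, 2.5, 1.8, 4.1, 3.6, 2.9]              # e.g. 1-5 scale

r = np.corrcoef(model_do_activation, human_ratings)[0, 1]
print(round(r, 2))
```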
Of course, the fact that a three-parameter model (Semantics + Entrenchment + Pre-emption) outperforms a two-parameter model (Semantics + Entrenchment) should surprise no-one. The point is that the Pre-emption mechanism is not a free parameter added for no reason other than to improve coverage of the data, but rather an implementation of a particular theoretical proposal that is well supported by previous empirical studies with children and adults. Presumably the reason that the Pre-emption mechanism (i.e., the ‘message’ node) plays such a key role is that, without it, there is very little overgeneralization pressure on the model, which can simply map conservatively from input to output. This pressure arises only when a communicative need (i.e., the desire to express a transfer message) compels the learner to use a verb in a construction in which it has never (or very infrequently) been attested. Indeed, examination of children's errors (e.g., *I said her no) suggests that these too are produced when the child's desire to express an intended message compels her to extend a verb into a construction which expresses that message, despite the fact that this combination is unattested in the input.
Lexeme-based Semantics + Entrenchment + Pre-emption model
The models presented so far have – purely as a simplifying assumption – represented verbs solely as bundles of semantic features, meaning that there is a very high degree of overlap between verbs with similar semantics. However, this assumption is unrealistic, as real learners encounter a (relatively) consistent phonological representation of each verb. This lexeme binds together the particular bundle of semantic properties associated with that verb and, crucially, differentiates this bundle from overlapping bundles associated with other verbs. In order to instantiate this property, we added to the input layer a further twenty-eight input units, each representing an individual verb (24 familiar, 4 novel). The training and test phases were the same as for the previous models, except that the input unit representing the relevant verb was set to 1, with the remaining twenty-seven units set to 0.
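The lexeme coding described above amounts to concatenating a one-hot verb identifier onto the existing input vector. The sketch below is hypothetical (verb ordering, feature values, and the three-feature bundle are invented); only the 28-unit one-hot scheme follows the text.

```python
import numpy as np

# Hypothetical sketch of the lexeme-based input coding: alongside the
# semantic-feature bundle and the transfer-message unit, each of the
# 28 verbs (24 familiar + 4 novel) gets its own one-hot input unit.
N_VERBS = 28

def encode_lexeme(verb_index):
    """One-hot vector: 1 for this verb's unit, 0 for the other 27."""
    one_hot = np.zeros(N_VERBS)
    one_hot[verb_index] = 1.0
    return one_hot

def encode_input(semantic_features, verb_index, transfer_message):
    return np.concatenate([
        np.asarray(semantic_features, dtype=float),
        encode_lexeme(verb_index),
        [float(transfer_message)],
    ])

x = encode_input([1.0, 0.0, 1.0], verb_index=5, transfer_message=1)
print(x.shape)  # 3 semantic + 28 lexeme + 1 message units
```

The lexeme unit lets the network keep semantically overlapping verbs apart, which is exactly what arbitrary, verb-specific restrictions require.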
Compared with the previous Semantics + Entrenchment + Pre-emption model, this lexeme-based Semantics + Entrenchment + Pre-emption model (see Figure 7) showed a slightly shorter period of ‘overgeneralization’ with the predictions for PO-only and alternating verbs beginning to diverge at around 30,000 sweeps (as opposed to 40,000 for the previous model).
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20241025160505-69604-mediumThumb-S0305000915000586_fig7g.jpg?pub-status=live)
Fig. 7. The lexeme-based Semantics + Entrenchment + Pre-emption model.
The lexeme-based model was also slightly better at predicting adults' judgements, with significant correlations observed both earlier (30,000 sweeps: r = 0·59, p = ·042; 40,000: r = 0·64, p = ·036; 50,000: r = 0·58, p = ·048; see Figure 8) and later (70,000: r = 0·61, p = ·035; 80,000: r = 0·62, p = ·03) in development (though at 60,000 sweeps the correlation was not significant: r = 0·56, p = ·06). Taken together with the fact that real learners encounter a binding lexeme for each verb presentation, the (albeit modest) improvement in coverage shown by this model suggests that it is important for all models of this phenomenon to include both a semantic and a lexical component (and we therefore retain this lexeme-based model for the remaining simulations).
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20241025160505-79216-mediumThumb-S0305000915000586_fig8g.jpg?pub-status=live)
Fig. 8. Correlation between model and human ratings for the lexeme-based Semantics + Entrenchment + Pre-emption model.
Semantics vs. Pre-emption: Can pre-emption be used to learn arbitrary exceptions to semantically based generalizations?
It has often been suggested (e.g. Boyd & Goldberg, Reference Boyd and Goldberg2011; Goldberg, Reference Goldberg1995, Reference Goldberg2006) that pre-emption might be useful for learning exceptions to semantically based generalizations. In the case of the dative, there has been some debate as to whether, given sufficiently fine-grained and probabilistic generalizations, such exceptions in fact exist (see Ambridge et al., Reference Ambridge, Pine, Rowland, Freudenthal and Chang2014, p. 237). Certainly PO-only verbs such as contribute and donate are exceptions to a purely semantically based generalization (*I donated/contributed the appeal some money vs. I donated/contributed some money to the appeal). However, there is empirical evidence (Ambridge et al., Reference Ambridge, Pine, Rowland and Chang2012b, Reference Ambridge, Pine, Rowland, Freudenthal and Chang2014) that speakers treat such verbs as conforming to a morphophonological generalization not instantiated in the present model. Thus such verbs do not necessarily constitute fully arbitrary exceptions that must be learned by pre-emption alone.
However, given that at least some generalizations presumably have fully arbitrary exceptions, it is important to investigate whether or not the model is able to learn them. Consider, for example, a hypothetical verb that is semantically consistent with both the PO- and DO-dative (e.g., a novel giving or showing verb) but that, for some reason, happens to appear only in the PO-dative construction. Will the model treat it as an exception (as pre-emption would predict), or will the exception be swamped by the semantic generalization?
In order to explore this question, the four novel verbs used previously in the test set only were added to the training set shown in Table 1. Each was presented ten times per sweep, always in PO-dative constructions (hence they are referred to subsequently as ‘PO-only novel pull/shout/show/give’). PO-only novel pull and PO-only novel shout are best thought of as ‘control’ verbs: both semantics and pre-emption push the model in the direction of rejecting the DO-dative, which it would therefore be expected to do rapidly. PO-only novel show and PO-only novel give instantiate the thought-experiment outlined above, and pit semantics and pre-emption against one another. Each of these novel verbs is semantically similar to six verbs that appear in both the PO- and DO-dative during training; thus their semantics push the model in the direction of predicting the DO-dative for that verb. On the other hand, both are attested with very high frequency in the PO-dative only (10 presentations per sweep; a rate chosen to be higher than any other verb + construction pair in the dataset). Thus pre-emption pushes the model in the direction of rejecting the DO-dative for that verb (i.e., predicting the PO-dative instead). This new training set was given to the lexeme-based Semantics + Entrenchment + Pre-emption model outlined above.
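The augmented training regime can be sketched as follows. The familiar-verb frequencies below are placeholders, not the paper's actual counts; only the ten-per-sweep PO-only presentation rate for the novel verbs follows the text.

```python
import random

# Illustrative sketch of the augmented training set: the four novel verbs
# are added to every sweep with ten PO-dative presentations apiece, a
# higher rate than any familiar verb + construction pair.
familiar = [("give", "PO", 4), ("give", "DO", 4), ("pull", "PO", 6)]  # placeholders
novel = ["novel_pull", "novel_shout", "novel_show", "novel_give"]

def build_sweep():
    items = [(v, c) for v, c, n in familiar for _ in range(n)]
    items += [(v, "PO") for v in novel for _ in range(10)]  # PO-only, 10x
    random.shuffle(items)
    return items

sweep = build_sweep()
print(len(sweep))
```

For *novel show* and *novel give*, semantics (overlap with alternating verbs) and pre-emption (exclusive, high-frequency PO attestation) are thereby placed in direct opposition.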
Figure 9 plots these results for the novel verbs only (results for the familiar verbs were the same as for the previous model). Despite the high levels of pre-emption, the semantic information still holds considerable sway: from 10,000 to 60,000 sweeps, activation of the DO-dative unit is higher for the semantically alternating novel show and give verbs than for the semantically PO-only novel pull and shout verbs. Nevertheless, slowly but surely, pre-emption wins out over semantics: by 70,000 sweeps, the semantically alternating novel show and give verbs are indistinguishable from the semantically PO-only novel pull and shout verbs, with activation of the DO-dative unit essentially zero by the end of the simulation at 100,000 sweeps. For comparative purposes, recall that the standard lexeme-based Semantics + Entrenchment + Pre-emption model showed the expected pattern for novel verbs from around 30,000 sweeps. Thus, although it takes some time, pre-emption can indeed be used to learn arbitrary exceptions to semantically based generalizations (e.g., Boyd & Goldberg, Reference Boyd and Goldberg2011; Goldberg, Reference Goldberg1995, Reference Goldberg2006).
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20241025160505-77390-mediumThumb-S0305000915000586_fig9g.jpg?pub-status=live)
Fig. 9. Semantics vs. Pre-emption.
Extending the lexeme-based Semantics + Entrenchment + Pre-emption model
The final set of simulations investigated whether the model would scale up to a larger dataset: the full set of 301 dative verbs rated by adults in the study of Ambridge et al. (Reference Ambridge, Pine, Rowland, Freudenthal and Chang2014), comprising 145 alternating verbs, 131 PO-only verbs, and 25 DO-only verbs (see ‘Appendix’ Table A1). Because this set was designed to be as comprehensive as possible, including every English dative verb identified in two major reference works on the topic (Levin, Reference Levin1993; Pinker, Reference Pinker1989), it constitutes an appropriate test of the model's ability to scale up to something like a life-sized dataset. This model was trained in exactly the same way as the previous version, but with this larger training set. The model again showed a good fit to the adult data (r = 0·54, p < ·001; see Figure 10), reaching asymptote at around 1,000 sweeps, considerably sooner than the previous model (which is to be expected given the much larger dataset).
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20241025160505-04708-mediumThumb-S0305000915000586_fig10g.jpg?pub-status=live)
Fig. 10. Correlation between model and human ratings for the Semantics + Entrenchment + Pre-emption model with extended dataset (301 verbs).
To summarize, the most successful model – the lexeme-based Semantics + Entrenchment + Pre-emption model – was successful in (a) simulating an overall overgeneralization-then-retreat pattern, (b) predicting the correct dative argument structure for novel verbs on the basis of their semantics, and (c) modelling the fine-grained pattern of by-verb grammaticality judgements observed in adult studies, including a large-scale study that included almost all English dative verbs.
DISCUSSION
A central question in the cognitive sciences is how children build linguistic representations that allow them to generalize verbs from one construction to another (e.g., The boy gave a present to the girl → The boy gave the girl a present), whilst appropriately constraining those generalizations to avoid non-adultlike errors (e.g., I said no to her → *I said her no). Indeed, for the many children who pass through a stage in which they produce such errors, the question is how they learn to retreat from them, given the absence of consistent evidence regarding which of their utterances are ungrammatical.
Recently, a consensus has begun to emerge that children solve this “no negative evidence problem” (Bowerman, Reference Bowerman and Hawkins1988) using a combination of statistical learning procedures such as entrenchment (e.g., Theakston, Reference Theakston2004) and pre-emption (e.g., Boyd & Goldberg, Reference Boyd and Goldberg2011), and learning procedures based on verb semantics (e.g., Ambridge et al., Reference Ambridge, Pine, Rowland and Young2008, Reference Ambridge, Pine, Rowland, Jones and Clark2009, Reference Ambridge, Pine and Rowland2011, Reference Ambridge, Pine and Rowland2012a, Reference Ambridge, Pine, Rowland and Chang2012b, Reference Ambridge, Pine, Rowland, Freudenthal and Chang2014). Despite this emerging consensus, there have been few attempts to propose a unitary account that combines all three approaches. One exception is the FIT account (Ambridge et al., Reference Ambridge, Pine and Rowland2011), which argues for competition between constructions based on (a) verb-in-construction frequency, (b) relevance of constructions for the speaker's intended message, and (c) fit between the fine-grained semantic properties of individual verbs and individual constructions.
The present study demonstrated that a simple connectionist model that instantiates this account can not only simulate the overall pattern of overgeneralization then retreat, but also use the semantics of novel verbs to predict their argument structure (as in the human studies of Ambridge et al., Reference Ambridge, Pine, Rowland and Young2008, Reference Ambridge, Pine, Rowland, Jones and Clark2009, Reference Ambridge, Pine, Rowland and Chang2012b; Bidgood et al., Reference Bidgood, Ambridge, Pine and Rowland2014) and to predict the by-verb pattern of grammaticality judgements observed in adult studies (Ambridge et al., Reference Ambridge, Pine, Rowland and Chang2012b, Reference Ambridge, Pine, Rowland, Freudenthal and Chang2014).
Although the model used is computationally extremely simple, there is an important sense in which this is its greatest strength. The success of the model suggests that statistical learning effects such as entrenchment and pre-emption need not make use of sophisticated Bayesian or rational learner algorithms to compute an inference from absence. Rather, these effects arise naturally from a learning mechanism that probabilistically links verbs to competing constructions. Similarly, semantic effects need not rely on an explicit procedure for semantic class formation (e.g., Pinker, Reference Pinker1989), but fall naturally out of a model that learns which bundles of semantic features (‘verbs’) are predictive of which constructions.
Another advantage of the present model's simple and high-level approach is that it can easily be extended to other constructions for which children are known to make overgeneralization errors, and for which suitable semantic and statistical measures have been collected. These include the locative (e.g., Ambridge et al., Reference Ambridge, Pine and Rowland2012a), passive (Ambridge, Bidgood, Pine, Rowland & Freudenthal, Reference Ambridge, Bidgood, Pine, Rowland and Freudenthalin press), and reversative un- prefixation (e.g., Ambridge, Reference Ambridge2013; Blything et al., Reference Blything, Ambridge and Lieven2014). Future simulations using the same architecture could also investigate the semantic restrictions on construction slots other than VERB. For example, a (probabilistic) requirement of the DO-dative construction is that the first argument be a potential possessor of the second argument (e.g., *John sent Chicago the package; cf. the PO-dative equivalent John sent the package to Chicago). It would also be possible to investigate other types of overgeneralization error (e.g., the [in]famous case of the English past tense) by using phonological rather than semantic representations at the input level (indeed, the present study essentially uses the same architecture as classic past-tense models such as Rumelhart, McClelland, & PDP Research Group, Reference Rumelhart and McClelland1988).
These advantages notwithstanding, it is important to acknowledge the ways in which the present simulations considerably simplify the task facing real learners. First, the precise semantic properties of individual verbs are known from the start. For child learners, acquiring verb meanings is a notoriously difficult task (Gillette, Gleitman, Gleitman & Lederer, Reference Gillette, Gleitman, Gleitman and Lederer1999), and one that presumably proceeds mostly in parallel with learning verb argument structure constructions (e.g., Twomey et al., Reference Twomey, Chang and Ambridge2014). Second, the simulated learner is assumed to have already abstracted the necessary verb argument structure constructions (e.g., PO- and DO-dative) from the input, and to be able to correctly recognize all further instances of these constructions in the input (though the semantic characteristics of these constructions are learned during the simulation). For real learners, acquiring verb argument structure constructions is an extremely difficult task; indeed, there are very few proposals for how this might be done (though see Alishahi & Stevenson, Reference Alishahi and Stevenson2008; Tomasello, Reference Tomasello2003). Finally, the model does not – unlike both real learners and more sophisticated computational models (e.g., Chang, Reference Chang2002; Chang et al., Reference Chang, Dell and Bock2006) – produce sentences as sequences of temporally ordered words. There is quite a leap to be made from (a) knowing the verb + argument structure combination that one intends to use to (b) producing a well-formed sentence.
Indeed, under many accounts, verb argument constructions are not, in fact, seen as entities that are abstracted from the input, then stored for subsequent retrieval. Exemplar-based accounts (e.g., Bybee, Reference Bybee, Hoffman and Trousdale2013; Langacker, Reference Langacker, Barlow and Kemmer2000) propose that learners store nothing more than individual exemplars (in this case, sentences), and that the notion of – for example – ‘using a DO-dative construction’ is simply a shorthand way of referring to a process of online generalization across stored DO-dative sentences that meet some criterion (e.g., similarity to the intended message). In order to instantiate such accounts computationally, we will need considerably more sophisticated models that are able to use both semantic and distributional commonalities to abstract, on the fly, across stored exemplars in a way that yields something like conventional verb argument-structure constructions.
In the meantime, the present findings suggest that the traditional conceptualization of entrenchment, pre-emption, and the formation of semantic generalizations as rival ‘mechanisms’ may be unhelpful. Rather, all three are best thought of as labels for effects that fall naturally out of a learning mechanism that probabilistically associates particular verbs (where each verb is a collection of semantic features) and particular argument structure constructions. The computational model outlined in the present paper constitutes one possible account of how this can be done.
APPENDIX
Table A1. Extended training set
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20160930040326621-0957:S0305000915000586:S0305000915000586_tabA1.gif?pub-status=live)