1. Sentence meaning in vector spaces
While for decades sentence meaning has been represented in terms of complex formal structures, the most recent trend in computational semantics is to model semantic representations with dense distributional vectors (aka embeddings). As a matter of fact, distributional semantics has become one of the most influential approaches to lexical meaning, because of the important theoretical and computational advantages of representing words with continuous vectors, such as automatically learning lexical representations from natural language corpora and multimodal data, assessing semantic similarity in terms of the distance between the vectors, and dealing with the inherently gradient and fuzzy nature of meaning (Erk 2012; Lenci 2018a).
Over the years, intense research has tried to address the question of how to project the strengths of vector models of meaning beyond the word level, to phrases and sentences. The mainstream approach in distributional semantics assumes the representation of sentence meaning to be a vector, exactly like lexical items. Early approaches simply used pointwise vector operations (such as addition or multiplication) to combine word vectors into phrase or sentence vectors (Mitchell and Lapata 2010), and in several tasks they still represent a non-trivial baseline to beat (Rimell et al. 2016). More recent contributions can be essentially divided into two separate trends. The first attempts to model ‘Fregean compositionality’ in vector space, and aims at finding progressively more sophisticated compositional operations to derive sentence representations from the vectors of their component words (Baroni et al. 2013; Paperno et al. 2014). In the second trend, dense vectors for sentences are learned as a whole, in a similar way to neural word embeddings (Mikolov et al. 2013; Levy and Goldberg 2014): for example, the encoder–decoder models of Kiros et al. (2015) and Hill et al. (2016) are trained to predict, given a sentence vector, the vectors of the surrounding sentences.
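The pointwise composition baselines are simple to state in code. The sketch below, with toy vectors of our own invention, illustrates the additive and multiplicative models of Mitchell and Lapata (2010):

```python
import numpy as np

def compose(word_vectors, op="add"):
    """Pointwise composition of word vectors into a phrase vector
    (the additive and multiplicative models of Mitchell and Lapata 2010)."""
    vectors = np.stack(word_vectors)
    if op == "add":
        return vectors.sum(axis=0)
    if op == "mult":
        return vectors.prod(axis=0)
    raise ValueError(f"unknown composition operation: {op}")

# Toy 4-dimensional embeddings (illustrative values only)
dog = np.array([0.2, 0.5, 0.1, 0.7])
barks = np.array([0.3, 0.4, 0.6, 0.2])

sentence_add = compose([dog, barks], op="add")    # elementwise sum
sentence_mult = compose([dog, barks], op="mult")  # elementwise product
```

Despite their simplicity, such models ignore word order entirely: *Lilly loves Imogen* and *Imogen loves Lilly* receive the same vector.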
Representing sentences with vectors appears to be unrivalled from the applicative point of view: it allows similarity between sentences to be measured with their embeddings, as is customary at the lexical level, which is exploited in tasks like automatic paraphrasing and captioning, question answering, etc. Recently, probing tasks have been proposed to test what kind of syntactic and semantic information is encoded in sentence embeddings (Ettinger et al. 2016; Adi et al. 2017; Conneau et al. 2018; Zhu et al. 2018). In particular, Zhu et al. (2018) show that current models are not able to discriminate between different syntactic realizations of semantic roles and fail to recognize that Lilly loves Imogen is more similar to its passive counterpart than to Imogen loves Lilly. Moreover, it is difficult to recover information about the component words from sentence embeddings (Adi et al. 2017; Conneau et al. 2018). The semantic representations built with tensor products in the question-answering system by Palangi et al. (2018) have been claimed to be grammatically interpretable as well. However, the complexity of the semantic information conveyed by sentences and the difficulty of interpreting the embeddings raise doubts about the general theoretical and empirical validity of the ‘sentence-meaning-as-vector’ approach.
In this paper, we propose a structured distributional model (SDM) of sentence meaning that combines word embeddings with formal semantics and is based on the assumption that sentences represent events and situations. These are regarded as inherently complex semantic objects, involving multiple entities that interact with different roles (e.g. agents, patients and locations). The semantic representation of a sentence is a formal structure inspired by discourse representation theory (DRT) (Kamp 2013) and containing distributional vectors. This structure is dynamically and incrementally built by integrating knowledge about events and their typical participants, as they are activated by lexical items. Event knowledge is modelled as a graph extracted from parsed corpora and encoding roles and relationships between participants that are represented as distributional vectors. The semantic representations of SDM retain the advantages of embeddings (e.g. learnability and gradability), but also contain directly interpretable formal structures, differently from classical vector-based approaches.
SDM is grounded in extensive psycholinguistic research showing that generalized knowledge about events stored in semantic memory plays a key role in sentence comprehension (McRae and Matsuki 2009). On the other hand, it is also close to recent attempts to look for a ‘division of labour’ between formal and vector semantics, representing sentences with logical forms enriched with distributional representations of lexical items (Beltagy et al. 2016; Boleda and Herbelot 2016; McNally 2017). Like SDM, McNally and Boleda (2017) propose to introduce embeddings within DRT semantic representations. At the same time, differently from these other approaches, SDM consists of formal structures that integrate word embeddings with a distributional representation of activated event knowledge, which is dynamically integrated during semantic composition.
The contribution of this paper is twofold. First, we introduce SDM as a cognitively inspired distributional model of sentence meaning, based on a structured formalization of semantic representations and contextual event knowledge (Section 2). Second, we show that the event knowledge used by SDM in the construction of sentence meaning representations leads to improvements over other state-of-the-art models in compositionality tasks. In Section 3, SDM is tested on two different benchmarks: the first is RELPRON (Rimell et al. 2016), a popular data set for the similarity estimation between compositional distributional representations; the second is DTFit (Vassallo et al. 2018), a data set created to model an important aspect of sentence meaning, that is, the typicality of the described event or situation, which has been shown to have important processing consequences for language comprehension.
2. Dynamic composition with embeddings and event knowledge
SDM rests on the assumption that natural language comprehension involves the dynamic construction of semantic representations, as mental characterization of the events or situations described in sentences. We use the term ‘dynamic’ in the sense of dynamic semantic frameworks like DRT, to refer to a bidirectional relationship between linguistic meaning and context (see also Heim 1983):
The meaning of an expression depends on the context in which it is used, and its content is itself defined as a context-change potential, which affects and determines the interpretation of the following expressions.
The content of an expression E used in a context C depends on C, but – once the content has been determined – it will contribute to update C to a new context C′, which will help fix the content of the next expression. Similarly to DRT, SDM integrates word embeddings in a dynamic process to construct the semantic representations of sentences. Contextual knowledge is represented in distributional terms and affects the interpretation of following expressions, which in turn cue new information that updates the current context.
Context is a highly multifaceted notion that includes several types of factors guiding and influencing language comprehension: information about the communicative settings, preceding discourse, general presuppositions and knowledge about the world, etc. In DRT, Kamp (2016) has introduced the notion of articulated context to model different sources of contextual information that intervene in the dynamic construction of semantic representations. In this paper, we focus on the contribution of a specific type of contextual information, which we refer to as Generalized Event Knowledge (gek). This is knowledge about events and situations that we have experienced under different modalities, including the linguistic input (McRae and Matsuki 2009), and is generalized because it contains information about prototypical event structures.
In linguistics, the Generative Lexicon theory (Pustejovsky 1995) argues that the lexical entries of nouns also contain information about events that are crucial to define their meaning (e.g. read for book). Psycholinguistic studies in the last two decades have brought extensive evidence that the array of event knowledge activated during sentence processing is extremely rich: verbs (e.g. arrest) activate expectations about typical arguments (e.g. cop and thief) and vice versa (McRae et al. 1998; Ferretti et al. 2001; McRae et al. 2005), and similarly nouns activate other nouns typically co-occurring as participants in the same events (key, door) (Hare et al. 2009). The influence of argument structure relations on how words are neurally processed is also an important field of study in cognitive neuroscience (Thompson and Meltzer-Asscher 2014; Meltzer-Asscher et al. 2015; Williams et al. 2017).
Stored event knowledge has relevant processing consequences. Neurocognitive research has shown that the brain is constantly engaged in making predictions to anticipate future events (Bar 2009; Clark 2013). Language comprehension, in turn, has been characterized as a largely predictive process (Kuperberg and Jaeger 2015). Predictions are memory-based, and experiences about events and their participants are used to generate expectations about the upcoming linguistic input, thereby minimizing the processing effort (Elman 2014; McRae and Matsuki 2009). For instance, argument combinations that are more ‘coherent’ with the event scenarios activated by the previous words are read faster in self-paced reading tasks and elicit smaller N400 amplitudes in Event-Related Potential (ERP) experiments (Bicknell et al. 2010; Matsuki et al. 2011; Paczynski and Kuperberg 2012; Metusalem et al. 2012).
Elman (2009, 2014) has proposed a general interpretation of these experimental results in the light of the Words-as-Cues framework. According to this theory, words are arranged in the mental lexicon as a sort of network of mutual expectations, and listeners rely on pre-stored representations of events and common situations to try to identify the one that a speaker is more likely to communicate. As new input words are processed, they are quickly integrated in a data structure containing a dynamic representation of the sentence content, until some events are recognized as the ‘best candidates’ for explaining the cues (i.e. the words) observed in the linguistic input. It is important to stress that, in such a view, the meaning of complex units such as phrases and sentences is not always built by composing lexical meanings, as the representation of typical events might be already stored and retrieved as a whole in semantic memory. Participants often occurring together become active when the representation of one of them is activated (see also Bar et al. 2007 on the relation between associative processing and predictions).
SDM aims at integrating the core aspects of dynamic formal semantics and the evidence on the role of event knowledge for language processing into a general model for compositional semantic representations that relies on two major assumptions:
Lexical items are represented as embeddings within a network of relations encoding knowledge about events and typical participants, which corresponds to what we have termed above gek.
The semantic representation (sr) of a sentence (or even larger stretches of linguistic input, such as discourse) is a formal structure that dynamically combines the information cued by lexical items.
As in Chersoni et al. (2017), the model is inspired by Memory, Unification and Control (MUC), proposed by Hagoort (2013, 2016) as a general model for the neurobiology of language. MUC incorporates three main functional components: (i) Memory corresponds to knowledge stored in long-term memory; (ii) Unification refers to the process of combining the units stored in Memory to create larger structures, with contributions from the context; and (iii) Control is responsible for relating language to joint action and social interaction. Similarly, our model distinguishes between a component storing event knowledge, in the form of a Distributional Event Graph (deg, Section 2.1), and a meaning composition function that integrates information activated from lexical items and incrementally builds the sr (Section 2.2).
2.1 The distributional event graph
The Distributional Event Graph represents the event knowledge stored in long-term memory with information extracted from parsed corpora. We assume a very broad notion of event, as an n-ary relation between entities. Accordingly, an event can be a complex situation involving multiple participants, such as The student reads a book in the library, but also the association between an entity and a property expressed by the noun phrase heavy book. This notion of event corresponds to what psychologists call situation knowledge or thematic associations (Binder 2016). As McRae and Matsuki (2009) argue, gek is acquired from both sensorimotor experience (e.g. watching or playing football matches) and linguistic experience (e.g. reading about football matches). deg can thus be regarded as a model of the gek derived from the linguistic input.
Events are extracted from parsed sentences, using syntactic relations as an approximation of deeper semantic roles (e.g. the subject relation for the agent and the direct object relation for the patient). In the present paper, we use dependency parses, as is customary in distributional semantics, but nothing in SDM hinges on the choice of the syntactic representation. Given a verb or a noun head, all its syntactic dependents are grouped together. More schematic events are also generated by abstracting away from one or more event participants for every recorded instance. Since we expect each participant to be able to trigger the event, and consequently any of the other participants, a relation can be created and added to the graph from every subset of each group extracted from a sentence (cf. Figure 1).
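The generation of schematic events from a dependency group can be sketched as follows. The representation of participants as (relation, lemma) pairs is our own simplification, not the authors' extraction pipeline:

```python
from itertools import combinations

def schematic_events(group):
    """Given the set of (relation, lemma) participants extracted for a
    verbal or nominal head, generate one event for every non-empty subset,
    so that any participant can cue the event and any of the others."""
    participants = sorted(group)
    events = []
    for size in range(1, len(participants) + 1):
        for subset in combinations(participants, size):
            events.append(frozenset(subset))
    return events

# Participants extracted for "The student reads a book in the library"
group = {("nsubj", "student"), ("dobj", "book"), ("obl:loc", "library")}
events = schematic_events(group)
# 2^3 - 1 = 7 events, from single participants up to the full relation
```

A corpus pass would then aggregate identical events across sentences to obtain the co-occurrence counts from which the salience weights are derived.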
The resulting deg structure is a weighted hypergraph, as it contains weighted relations holding among sets of nodes, and a labelled multigraph, since the edges are labelled in order to represent specific syntactic relations. The weights σ are derived from co-occurrence statistics and measure the association strengths between event nodes. They are intended as salience scores that identify the most prototypical events associated with an entity (e.g. the typical actions performed by a student). Crucially, the graph nodes are represented as word embeddings. Thus, given a lexical cue w, the information in deg can be activated along two dimensions during processing (cf. Table 1):
(1) by retrieving the nodes most similar to w (the paradigmatic neighbours), on the basis of the cosine similarity between their vectors and the vector of w;
(2) by retrieving the closest associates of w (the syntagmatic neighbours), using the edge weights.
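The two retrieval modes can be sketched with a minimal graph structure. All names, the data layout and the toy vectors below are our illustration, not the authors' implementation:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

class DEG:
    """Toy Distributional Event Graph: nodes carry embeddings; weighted,
    syntactically labelled edges encode associations between participants."""

    def __init__(self):
        self.embeddings = {}   # word -> vector
        self.edges = {}        # (word, relation) -> {associate: sigma}

    def add_node(self, word, vector):
        self.embeddings[word] = np.asarray(vector, dtype=float)

    def add_edge(self, word, relation, associate, sigma):
        self.edges.setdefault((word, relation), {})[associate] = sigma

    def paradigmatic_neighbours(self, word, k=5):
        """Nodes whose embeddings are most similar to the cue's vector."""
        target = self.embeddings[word]
        scored = [(w, cosine(target, v))
                  for w, v in self.embeddings.items() if w != word]
        return sorted(scored, key=lambda x: -x[1])[:k]

    def syntagmatic_neighbours(self, word, relation, k=5):
        """Closest associates of the cue under a relation, by weight sigma."""
        ranked = sorted(self.edges.get((word, relation), {}).items(),
                        key=lambda x: -x[1])
        return ranked[:k]

deg = DEG()
deg.add_node("student", [0.9, 0.1, 0.2])
deg.add_node("pupil", [0.8, 0.2, 0.1])
deg.add_node("book", [0.1, 0.9, 0.3])
deg.add_edge("student", "dobj", "book", 0.7)
deg.add_edge("student", "dobj", "thesis", 0.5)
```

With these toy values, the paradigmatic neighbour of *student* is *pupil* (similar vector), while its top syntagmatic `dobj` neighbour is *book* (highest edge weight).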
Figure 2 shows a toy example of deg. The little boxes with circles in them represent the embedding associated with each node. Edges are labelled with syntactic relations (as a surface approximation of event roles) and weighted with salience scores σ. Each event is a set of co-indexed edges. For example, e2 corresponds to the event of students reading books in libraries, while e1 represents a schematic event of students performing some generic action on books (e.g. reading, consulting and studying).
2.2 The meaning composition function
We assume that during sentence comprehension lexical items activate fragments of event knowledge stored in deg (like in Elman’s Words-as-Cues model), which are then dynamically integrated in a semantic representation sr. This is a formal structure directly inspired by DRT and consisting of three different yet interacting information tiers:
(1) universe (U) – this tier, which we do not discuss further in the present paper, includes the entities mentioned in the sentence (corresponding to the discourse referents in DRT). They are typically introduced by noun phrases and provide the targets of anaphoric links.
(2) linguistic conditions (lc) – a context-independent tier of meaning that accumulates the embeddings associated with the lexical items. This corresponds to the conditions that in DRT content words add to the discourse referents. The crucial difference is that now such conditions are embeddings.
(3) active context (ac) – similarly to the notion of articulated context in Kamp (2016), this component consists of several types of contextual information available during sentence processing or activated by lexical items (e.g. information from the current communication setting and general world knowledge). More specifically, we assume that ac contains the embeddings activated from deg by the single lexemes (or by other contextual elements) and integrated into a semantically coherent structure contributing to the sentence interpretation.
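The three tiers can be pictured as a simple record. The field names and data layout below are illustrative, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class SR:
    """Sketch of a semantic representation with its three tiers."""
    universe: set = field(default_factory=set)   # discourse referents
    lc: dict = field(default_factory=dict)       # lemma -> (embedding, referent)
    ac: dict = field(default_factory=dict)       # relation -> ranked (word, score) lists

sr = SR()
sr.universe.add("u")
sr.lc["student"] = ([0.9, 0.1, 0.2], "u")        # embedding linked to referent u
sr.ac["dobj"] = [[("book", 0.7), ("research", 0.5)]]
```

Each tier plays a distinct role: the universe supports anaphora, lc accumulates context-independent lexical content, and ac holds the context-dependent event knowledge.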
Figure 3 shows an example of sr built from the sentence The student drinks the coffee (ignoring the specific contribution of determiners and tense). The universe U contains the discourse referents introduced by the noun phrases, while lc includes the embeddings of the lexical items in the sentence, each linked to the relevant referent (e.g. $$\overrightarrow {student}$$ : u means that the embedding introduced by student is linked to the discourse referent u). ac consists of the embeddings activated from deg and ranked by their salience with respect to the current content in the sr. The elements in ac are grouped by their syntactic relation in deg, which again we regard here just as a surface approximation of their semantic role (e.g. the items listed under ‘obl:loc’ are a set of possible locations of the event expressed by the sentence). ac makes it possible to enrich the semantic content of the sentence with contextual information, predict other elements of the event and generate expectations about incoming input. For instance, given the ac in Figure 3, we can predict that the student is most likely to be drinking a coffee at the cafeteria and that he/she is drinking it for breakfast or in the morning. The ranking of each element in ac depends on two factors: (i) its degree of activation by the lexical items and (ii) its overall coherence with respect to the information already available in the ac.
A crucial feature of each sr is that lc and ac are also represented with vectors that are incrementally updated with the information activated by lexical items. Let sri−1 be the semantic representation built for the linguistic input w1, …, wi−1. When we process a new pair 〈wi, ri〉 consisting of a lexeme wi and its syntactic role ri:
1. lc in sri−1 is updated with the embedding $$\overrightarrow {w_i }$$;
2. ac in sri−1 is updated with the embeddings of the syntagmatic neighbours of wi extracted from deg.
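One incremental update step can be sketched as follows. The data layout (an LC vector plus per-role ranked lists in ac) and all names are our simplification of the mechanism:

```python
import numpy as np

def update_sr(lc_vec, ac, word_vec, cued_knowledge):
    """One incremental step: add the new word's embedding to the LC vector
    and append the event knowledge it cues (per-role ranked lists of
    (word, weight) pairs) to the active context."""
    lc_vec = lc_vec + np.asarray(word_vec, dtype=float)
    for role, ranked in cued_knowledge.items():
        ac.setdefault(role, []).append(ranked)   # one list per cueing word
    return lc_vec, ac

lc_vec = np.zeros(3)
ac = {}
# Processing "student": its embedding plus the knowledge it cues from the graph
lc_vec, ac = update_sr(
    lc_vec, ac, [0.9, 0.1, 0.2],
    {"dobj": [("book", 0.7), ("research", 0.5)],
     "obl:loc": [("library", 0.8)]})
```

Processing the next word would call `update_sr` again, adding its embedding to the running LC vector and a new ranked list to each role it cues.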
Figures 4 and 5 exemplify the update of the sr for the subject The student with the information activated by the verb drink. The update process is defined as follows:
(1) lc is represented with the vector $$\overrightarrow {LC}$$ obtained from the linear combination of the embeddings of the words contained in the sentence. Therefore, when $$\langle w_i ,r_i \rangle$$ is processed, the embedding $$\overrightarrow {w_i }$$ is simply added to $$\overrightarrow {LC}$$;
(2) for each syntactic role ri, ac contains a set of ranked lists (one for each processed pair) of embeddings corresponding to the most likely words expected to fill that role. For instance, the ac for the fragment The student in Figure 4 contains a list of the embeddings of the most expected direct objects associated with student, a list of the embeddings of the most expected locations, etc. Each list of expected role fillers is itself represented with the weighted centroid vector (e.g. $$\overrightarrow {dobj}$$) of their k most prominent items (with k a model hyperparameter). For instance, setting k = 2, the $$\overrightarrow {dobj}$$ centroid in the ac in Figure 4 is built just from $$\overrightarrow {book}$$ and $$\overrightarrow {research}$$; less salient elements (the gray areas in Figures 3–5) are kept in the list of likely direct objects, but at this stage do not contribute to the centroid representing the expected fillers for that role. ac is then updated with the deg fragment activated by the new lexeme wi (e.g. the verb drink):
The event knowledge activated by wi for a given role ri is ranked according to cosine similarity with the vector $$\overrightarrow {r_i }$$ available in ac: in our example, the direct objects activated by the verb drink (e.g. $$\overrightarrow {beer}$$ and $$\overrightarrow {coffee}$$) are ranked according to their cosine similarity to the $$\overrightarrow {dobj}$$ vector of the ac.
The ranking process works also in the opposite direction: the newly retrieved information is used to update the centroids in ac. For example, the direct objects activated by the verb drink are aggregated into centroids and the corresponding weighted lists in ac are re-ranked according to the cosine similarity with the new centroids, in order to maximize the semantic coherence of the representation. At this point, $$\overrightarrow {book}$$ and $$\overrightarrow {research}$$, which are not as salient as $$\overrightarrow {coffee}$$ and $$\overrightarrow {beer}$$ in the drinking context, are downgraded in the ranked list and are therefore less likely to become part of the $$\overrightarrow {dobj}$$ centroid at the next step.
The newly retrieved information is now added to the ac: as shown in Figure 5, once the pair 〈drink, root〉 has been fully processed, the ac contains two ranked lists for the dobj role and two ranked lists for the obl:loc role; the top k elements of each list will be part of the centroid for their relation at the next step. Finally, the whole ac is represented with the centroid vector $$\overrightarrow {AC}$$ built out of the role vectors $$\overrightarrow {r_1 } , \ldots ,\overrightarrow {r_n }$$ available in ac. The vector $$\overrightarrow {AC}$$ encodes the integrated event knowledge activated by the linguistic input.
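The centroid construction and the coherence-driven re-ranking can be sketched as follows, with made-up embeddings and weights (our own simplification of the mechanism):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def role_centroid(ranked, embeddings, k=2):
    """Weighted centroid of the k most prominent items in a ranked list."""
    top = ranked[:k]
    total = sum(w for _, w in top)
    return sum(w * np.asarray(embeddings[item], dtype=float)
               for item, w in top) / total

def rerank(ranked, embeddings, reference):
    """Re-rank (word, weight) pairs by the cosine similarity of each
    word's embedding to a reference centroid, maximizing coherence."""
    return sorted(ranked, key=lambda p: -cosine(embeddings[p[0]], reference))

embeddings = {"book": [0.1, 0.9], "research": [0.2, 0.8],
              "coffee": [0.9, 0.2], "beer": [0.8, 0.1]}

# Direct objects cued by "student", re-ranked against the centroid
# of the direct objects cued by "drink"
student_dobj = [("book", 0.7), ("research", 0.5), ("coffee", 0.2)]
drink_centroid = role_centroid([("coffee", 0.6), ("beer", 0.4)], embeddings)
reranked = rerank(student_dobj, embeddings, drink_centroid)
# "coffee" is promoted; "book" and "research" are downgraded
```

With these toy values, *coffee* moves to the top of the direct-object list and *book* drops to the bottom, mirroring the behaviour described in the text.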
As an example of gek re-ranking, assume that after processing the subject noun phrase The student, the ac of the corresponding sr predicts that the most expected verbs are read, study, drink, etc., the most expected associated direct objects are book, research, beer, etc., and the most expected locations are library, cafeteria, university, etc. (Figure 4). When the main verb drink is processed, the corresponding role list is removed from the ac, because that syntactic slot is now overtly filled by this lexeme, whose embedding is then added to the lc. The verb drink cues its own event knowledge, for instance, that the most typical objects of drinking are tea, coffee, beer, etc., and the most typical locations are cafeteria, pub, bar, etc. The information cued by drink is re-ranked to promote those items that are most compatible and coherent with the current content of ac (i.e. direct objects and locations that are likely to interact with students). Analogously, the information in the ac is re-ranked to make it more compatible with the gek cued by drink (e.g. the salience of book and research gets decreased, because they are not similar to the typical direct objects and locations of drink). The output of the sr update is shown in Figure 5, whose ac now contains the gek associated with an event of drinking by a student.
A crucial feature of sr is that it is a much richer representation than the bare linguistic input: the overtly realized arguments in fact activate a broader array of roles than the ones actually appearing in the sentence. As an example of how these unexpressed arguments contribute to the semantic representation of the event, consider a situation in which three different sentences are represented by means of ac, namely The student writes the thesis, The headmaster writes the review and The teacher writes the assignment. Although teacher could be judged as closer to headmaster than to student, and thesis as closer to assignment than to review, taking into account also the typical locations (e.g. a library for the first two sentences, a classroom for the last one) and writing supports (e.g. a laptop in the first two cases, a blackboard in the last one) would lead to the first two events being judged as the most similar ones.
In the case of unexpected continuations, the ac will be updated with the new information, though in this case the re-ranking process would probably not change the gek prominence. Consider the case of an input fragment like The student plows…: student activates event knowledge as it is shown in Figure 3, but the verb does not belong to the set of expected events given student. The verb triggers different direct objects from those already in the ac (e.g. typical objects of plow such as furrow and field). Since the similarity of their centroid with the elements of the direct object list in the ac will be very low, the relative ordering of the ranked list will roughly stay the same, and direct objects pertaining to the plowing situation will coexist with direct objects triggered by student. Depending on the continuation of the sentence, then, the elements triggered by plow might gain centrality in the representation or remain peripheral.
It is worth noting that the incremental process of sr update is consistent with the main principles of formal dynamic semantic frameworks like DRT. As we said above, dynamic semantics assumes the meaning of an expression to be a context-change potential that affects the interpretation of the following expressions. Similarly, in our distributional model of sentence representation, the ac in sri−1 affects the interpretation of the incoming input wi, via the gek re-ranking process.
3. Experiments
3.1 Data sets and tasks
Our goal is to test SDM in compositionality-related tasks, with a particular focus on the contribution of event knowledge. For the present study, we selected two different data sets: the development set of the RELPRON data set (Rimell et al. 2016) and the DTFit data set (Vassallo et al. 2018).
RELPRON consists of 518 target–property pairs, where the target is a noun labelled with a syntactic function (either subject or direct object) and the property is a subject or object relative clause providing the definition of the target (Figure 6). Given a model, we produce a compositional representation for each of the properties. In each definition, the verb, the head noun and the argument are composed to obtain a representation of the property. Following the original evaluation in Rimell et al. (2016), we tested six different combinations for each composition model: the verb only, the argument only, the head noun and the verb, the head noun and the argument, the verb and the argument and all three of them. For each target, the 518 composed vectors are ranked according to their cosine similarity to the target. Like Rimell et al. (2016), we use mean average precision (henceforth MAP) to evaluate our models on RELPRON. Formally, MAP is defined as
$$MAP = \frac{1}{N}\sum_{t=1}^{N} AP(t)$$

where N is the number of terms in RELPRON, and AP(t) is the average precision for term t, defined as

$$AP(t) = \frac{1}{P_t}\sum_{k=1}^{M} Prec(k) \times rel(k)$$
where $$P_t$$ is the number of correct properties for term t in the data set, M is the total number of properties in the data set, Prec(k) is the precision at rank k and rel(k) is a function equal to one if the property at rank k is a correct property for t, and zero otherwise. Intuitively, AP(t) will be 1 if, for the term t, all the correct properties associated with the term are ranked in the top positions, and the value becomes lower when the correct items are ranked farther from the head of the list.
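The two formulas translate directly into code; a short sketch:

```python
def average_precision(ranked, correct):
    """AP(t): precision at each rank where a correct property occurs,
    averaged over the number of correct properties for the term."""
    hits, precisions = 0, []
    for k, prop in enumerate(ranked, start=1):
        if prop in correct:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(correct)

def mean_average_precision(rankings):
    """MAP: rankings maps each term to (ranked properties, correct set)."""
    aps = [average_precision(ranked, correct)
           for ranked, correct in rankings.values()]
    return sum(aps) / len(aps)

# A term with correct properties at ranks 1 and 3: AP = (1/1 + 2/3) / 2
ap = average_precision(["p1", "p2", "p3"], {"p1", "p3"})
```

Note that AP drops quickly when correct properties fall far down the ranking, which is what makes MAP a sensitive measure for this retrieval-style evaluation.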
Our second evaluation data set, DTFit, has been introduced with the goal of building a new gold standard for the thematic fit estimation task (Vassallo et al. 2018). Thematic fit is a psycholinguistic notion similar to selectional preferences, the main difference being that the latter involve the satisfaction of constraints on discrete semantic features of the arguments, while thematic fit is a continuous value expressing the degree of compatibility between an argument and a semantic role (McRae et al. 1998). Distributional models for thematic fit estimation have been proposed by several authors (Erk 2007; Baroni and Lenci 2010; Erk et al. 2010; Lenci 2011; Sayeed et al. 2015; Greenberg et al. 2015; Santus et al. 2017; Tilk et al. 2016; Hong et al. 2018). While thematic fit data sets typically include human-elicited typicality scores for argument–filler pairs taken in isolation, DTFit includes tuples of arguments of different length, so that the typicality value of an argument depends on its interaction with the other arguments in the tuple. This makes it possible to model the dynamic aspect of argument typicality, since the expectations on an argument are dynamically updated as the other roles in the sentence are filled. The argument combinations in DTFit describe events associated with crowdsourced scores ranging from 1 (very atypical) to 7 (very typical). The data set items are grouped into typical and atypical pairs that differ only for one argument, and divided into three subsets:
795 triplets, each differing only in the Patient role:
- sergeant_N assign_V mission_N (typical)
- sergeant_N assign_V homework_N (atypical)
300 quadruples, each differing only in the Location role:
- policeman_N check_V bag_N airport_N (typical)
- policeman_N check_V bag_N kitchen_N (atypical)
200 quadruples, each differing only in the Instrument role:
- painter_N decorate_V wall_N brush_N (typical)
- painter_N decorate_V wall_N scalpel_N (atypical)
However, the Instrument subset of DTFit was excluded from the current evaluation: after applying the threshold of 5 for storing events in the deg (cf. Section 3.2.3), we found that the SDM coverage on this subset was too low.
For each tuple in the DTFit data set, the task for our models is to predict the upcoming argument on the basis of the previous ones. Given a model, we build a compositional vector representation for each data set item by excluding the last argument in the tuple, and then we measure the cosine similarity between the resulting vector and the vector of the excluded argument. Models are evaluated in terms of Spearman's correlation between the similarity scores and the human ratings.
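This evaluation loop can be sketched as follows; a minimal stdlib-only illustration with toy vectors and hypothetical ratings, using the tie-free Spearman formula for simplicity (in practice a standard statistics package would be used):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def spearman(xs, ys):
    """Spearman's rho as 1 - 6*sum(d^2)/(n*(n^2-1)); no tie correction."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# toy items: composed tuple vector vs. vector of the held-out argument
pairs = [([1.0, 0.2], [1.0, 0.1]), ([0.1, 1.0], [1.0, 0.0])]
model_scores = [cosine(t, arg) for t, arg in pairs]
human_ratings = [6.8, 1.5]          # hypothetical typicality ratings
rho = spearman(model_scores, human_ratings)
```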
As suggested by the experimental results of Bicknell et al. (Reference Bicknell, Elman, Hare, McRae and Kutas2010) and Matsuki et al. (Reference Matsuki, Chow, Hare, Elman, Scheepers and McRae2011), the typicality of the described events has important processing consequences: atypical events lead to longer reading times and stronger N400 components, while typical ones are easier to process thanks to the contribution of gek. Thus, the task of modelling typicality judgements can be seen as closely related to modelling semantic processing complexity.
3.2 Model settings
In this study, we compare the performance of SDM with three baselines: the simple additive model formulated in Mitchell and Lapata (Reference Mitchell and Lapata2010), a smoothed additive model, and a multilayer Long Short-Term Memory (LSTM) neural language model trained against one-hot targets (Zaremba et al. Reference Zaremba, Sutskever and Vinyals2015).
The additive models (Mitchell and Lapata Reference Mitchell and Lapata2010) have been evaluated on different types of word embeddings, and we compared their performance with SDM.Footnote g Despite their simplicity, previous evaluation studies on several benchmarks showed that such models can be difficult to beat, even for sophisticated compositionality frameworks (Rimell et al. Reference Rimell, Maillard, Polajnar and Clark2016; Arora et al. Reference Arora, Liang and Ma2017; Tian et al. Reference Tian, Okazaki and Inui2017).
The embeddings we used in our tests are the word2vec models by Mikolov et al. (Reference Mikolov, Sutskever, Chen, Corrado and Dean2013), that is, the Skip-Gram with negative sampling (SG) and the continuous bag-of-words (CBOW), and the C-Phrase model by Pham et al. (Reference Pham, Kruszewski, Lazaridou and Baroni2015). The latter model incorporates information about syntactic constituents, as the principles of the model training are (i) to group words together according to the syntactic structure of the sentences and (ii) to simultaneously optimize the context predictions at different levels of the syntactic hierarchy (e.g. given the training sentence A sad dog is howling in the park, the context prediction will be optimized for dog, a dog and a sad dog, that is, for all the word sequences that form a syntactic constituent). The performance of C-Phrase is particularly useful to assess the benefits of using vectors that directly encode structural/syntactic information.
We used the same corpora both for training the embeddings and for extracting the syntactic relations for deg. The training data come from the concatenation of three dependency-parsed corpora: the BNC (Leech and Smith Reference Leech2000), ukWaC (Baroni et al. Reference Baroni, Bernardini, Ferraresi and Zanchetta2009) and a 2018 dump of the English Wikipedia, for a combined size of approximately 4 billion tokens. The corpora were parsed with Stanford CoreNLP (Manning et al. Reference Manning, Surdeanu, Bauer, Finkel, Bethard and McClosky2014). The hyperparameters of the embeddings were the following for all models: 400 dimensions, a context window of size 10, 10 negative samples and 100 as the minimum word frequency.Footnote h
3.2.1 Simple additive models
Our additive models, corresponding to an sr consisting of the $$\overrightarrow {LC}$$ component only, represent the meaning of a sentence sent by summing the embeddings of its words:

$$\overrightarrow {LC}(sent) = \sum_{w \in sent} \overrightarrow{w}$$
The similarity with the targets is measured with the cosine between the target vector and the sentence vector.
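The additive baseline amounts to a dimension-wise sum followed by a cosine comparison, as in this minimal sketch (the 3-dimensional embedding values are hypothetical, not the actual trained vectors):

```python
import math

def add_compose(vectors):
    """Additive composition: sum the word embeddings dimension-wise."""
    return [sum(dims) for dims in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# toy 3-d embeddings (hypothetical values)
emb = {
    "sergeant": [0.9, 0.1, 0.0],
    "assign":   [0.2, 0.8, 0.1],
    "mission":  [0.7, 0.3, 0.2],
}
sent_vec = add_compose([emb["sergeant"], emb["assign"]])
score = cosine(sent_vec, emb["mission"])   # similarity with the target
```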
3.2.2 Smoothed additive models
These models are a smoothed version of the additive baseline, in which the final representation is simply the sum of the vectors of the words in the sentence, plus the vectors of the top k = 5 nearest neighbours of each word in the sentence.Footnote i Therefore, the meaning of a sentence sent is obtained by

$$\overrightarrow {LC}(sent) = \sum_{w \in sent}\Big(\overrightarrow{w} + \sum_{n \in N_k(w)} \overrightarrow{n}\Big)$$
where Nk(w) is the set of the k nearest neighbours of w. Compared to the gek models, the smoothed additive baseline modifies the sentence vector by adding the vectors of related words. Thus, it represents a useful comparison term for understanding the actual added value of the structural aspects of SDM.Footnote j
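The smoothed variant can be sketched as follows, assuming the top-k nearest neighbours of each word have been precomputed from the embedding space (the neighbour lists and vectors below are hypothetical):

```python
def smoothed_compose(sentence, embeddings, neighbours, k=5):
    """Sum each word vector plus the vectors of its top-k nearest neighbours."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for w in sentence:
        vecs = [embeddings[w]] + [embeddings[n] for n in neighbours.get(w, [])[:k]]
        for vec in vecs:
            total = [t + x for t, x in zip(total, vec)]
    return total

# hypothetical 2-d embeddings and neighbour lists
emb = {"dog": [1.0, 0.0], "bark": [0.0, 1.0], "puppy": [0.9, 0.1]}
nbs = {"dog": ["puppy"], "bark": []}
sent_vec = smoothed_compose(["dog", "bark"], emb, nbs)
```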
3.2.3 The structured distributional models
The SDM introduced in Section 2 consists of a full sr including the linguistic conditions vector $$\overrightarrow {LC}$$ and the event knowledge vector $$\overrightarrow {AC}$$. In this section, we detail the hyperparameter settings for the actual implementation of the model.
Distributional event graph
We included in the graph only events with a minimum frequency of 5 in the training corpora. The edges of the graph were weighted with Smoothed Local Mutual Information (LMI). Given a triple composed of the words w1 and w2 and a syntactic relation s linking them, we computed its weight by using a smoothed version of the Local Mutual Information (Evert Reference Evert2004):

$$LMI_{\alpha}(w_1, s, w_2) = f(w_1, s, w_2)\log\frac{P(w_1, s, w_2)}{P(w_1)P(s)P_{\alpha}(w_2)}$$

where the smoothed probabilities are defined as follows:

$$P_{\alpha}(w) = \frac{f(w)^{\alpha}}{\sum_{w'}f(w')^{\alpha}}$$
This type of smoothing, with α = 0.75, was chosen to mitigate the bias of Mutual Information (MI)-based statistical association measures towards rare events (Levy et al. Reference Levy, Goldberg and Dagan2015). While the formula only involves word pairs (as only pairs were employed in the experiments), it is easily extensible to more complex tuples of elements.
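The smoothed LMI weighting described above can be computed directly from triple counts, as in this stdlib-only sketch; following Levy et al. (2015), the α-smoothing is applied to the distribution of the second word, which is an assumption of this illustration:

```python
import math
from collections import Counter

def smoothed_lmi(triples, alpha=0.75):
    """Smoothed LMI weights for (w1, s, w2) triples mapped to frequencies.

    Context-distribution smoothing (alpha) is applied to the second word,
    following Levy et al. (2015) -- an assumption of this sketch."""
    n = sum(triples.values())
    f_w1, f_s, f_w2 = Counter(), Counter(), Counter()
    for (w1, s, w2), f in triples.items():
        f_w1[w1] += f
        f_s[s] += f
        f_w2[w2] += f
    z_alpha = sum(f ** alpha for f in f_w2.values())
    weights = {}
    for (w1, s, w2), f in triples.items():
        p_joint = f / n
        p_alpha = f_w2[w2] ** alpha / z_alpha
        mi = math.log(p_joint / ((f_w1[w1] / n) * (f_s[s] / n) * p_alpha))
        weights[(w1, s, w2)] = f * mi   # LMI = frequency * (smoothed) MI
    return weights

# toy counts (hypothetical)
triples = {("sergeant", "nsubj", "assign"): 10, ("painter", "nsubj", "decorate"): 10}
weights = smoothed_lmi(triples)
```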
Re-ranking settings
For each word in the data set items, the top 50 associated words were retrieved from deg. Both for the re-ranking phase and for the construction of the final representation, the event knowledge vectors (i.e. the role vectors $$\overrightarrow r$$ and the ac vector $$\overrightarrow {AC}$$) are built from the top 20 elements of each weighted list. As detailed in Section 2.2, the ranking process in SDM can be performed in the forward and backward directions at the same time (i.e. the ac can be used to re-rank newly retrieved information, and the newly retrieved information can in turn be used to re-rank the ac), but for simplicity we only implemented the forward ranking.
Scoring
As the similarity computations with the target words in SDM involve two separate vectors, we combined the similarity scores with addition. Thus, given a target word w in a sentence sent, the score for SDM is computed as

$$score(w, sent) = cos(\overrightarrow{w}, \overrightarrow {LC}(sent)) + cos(\overrightarrow{w}, \overrightarrow {AC}(sent))$$
In all settings, we assume the model to be aware of the syntactic parse of the test items. In DTFit, word order fully determines the syntactic constituents, as the sentences are always in the subject verb object [location-obl|instrument-obl] order. In RELPRON, on the other hand, the item contains information about the relation that is being tested: in the subject relative clauses, the properties always show the verb followed by the argument (e.g. telescope: device that detects planets), while in the object relative clauses the properties always present the opposite situation (e.g. telescope: device that observatory has). In the present experiments, we did not use the predictions on non-expressed arguments to compute $$\overrightarrow {AC}$$, and we restricted the evaluation to the representation of the target argument. For example, in the DTFit Patients set, $$\overrightarrow {AC} (sent)$$ only contains the $$\overrightarrow {dobj}$$ centroid.
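The scoring step described above reduces to adding two cosine similarities, which can be sketched as follows (the vectors are hypothetical placeholders for the composed LC and AC representations):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sdm_score(target, lc, ac):
    """Add the similarities of the target with the LC and AC vectors."""
    return cosine(target, lc) + cosine(target, ac)

# toy 2-d vectors standing in for the target word, LC(sent) and AC(sent)
score = sdm_score([1.0, 0.0], [0.9, 0.1], [0.8, 0.3])
```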
3.2.4 LSTM neural language model
We also compared the additive vector baselines and SDM with an LSTM neural network, taking as input word2vec embeddings. For every task, we trained the LSTM on syntactically labelled tuples (extracted from the same training corpora used for the other models), with the objective of predicting the relevant target. In DTFit, for example, for the Location task, in the tuple student learn history library, the network is trained to predict the argument library given the tuple student learn history. Similarly, in RELPRON, for the tuple engineer patent design, the LSTM is trained to predict engineer in the subject task and design in the object task, given patent design and engineer patent, respectively.
In both DTFit and RELPRON, for each input tuple, we took the top N network predictions (we tested N = 3, 5, 10, and always obtained the best results with N = 10), averaged their word embeddings, and used the cosine between the resulting vector and the embedding of the gold-standard target as the model's score.
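The top-N averaging step can be sketched as follows; a stdlib-only illustration in which the probability distribution is a toy stand-in for the LSTM's softmax output:

```python
def topn_prediction_vector(probs, embeddings, n=10):
    """Average the embeddings of the n most probable words in the distribution."""
    top = sorted(probs, key=probs.get, reverse=True)[:n]
    dim = len(next(iter(embeddings.values())))
    avg = [0.0] * dim
    for w in top:
        avg = [a + x / len(top) for a, x in zip(avg, embeddings[w])]
    return avg

# toy softmax output over a 3-word vocabulary (hypothetical values)
probs = {"library": 0.6, "school": 0.3, "kitchen": 0.1}
emb = {"library": [1.0, 0.0], "school": [0.0, 1.0], "kitchen": [0.5, 0.5]}
pred_vec = topn_prediction_vector(probs, emb, n=2)   # average of 'library' and 'school'
```

The resulting vector is then compared against the gold-standard target embedding with the cosine, as for the other models.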
The LSTM model is composed of (i) an input layer of the same size as the word2vec embeddings (400 dimensions, with dropout = 0.1); (ii) a monodirectional LSTM with l hidden layers (where l = 2 when predicting Patients and l = 3 when predicting Locations) of the same size as the embeddings; (iii) a linear layer (again with dropout = 0.1) of the same size as the embeddings, which takes as input the average of the hidden layers of the LSTM; and (iv) a final softmax layer that projects the filler probability distribution over the vocabulary.
4. Results and discussion
4.1 RELPRON
Given the targets and the composed vectors of all the definitions in RELPRON, we assessed the cosine similarity of each pair and computed the MAP scores shown in Table 2. First of all, the Skip-Gram-based models almost always turn out to be the best performing ones, closely followed by the C-Phrase ones. The scores of the additive models are slightly inferior, but very close to those reported by Rimell et al. (Reference Rimell, Maillard, Polajnar and Clark2016), while the LSTM model lags behind vector addition, improving only when the parameter N is increased. Results seem to confirm the original findings: even with very complex models (in their case, the Lexical Function Model by Paperno et al. Reference Paperno, Pham and Baroni2014), it is difficult to outperform simple vector addition in compositionality tasks.
Interestingly, SDM shows a consistent improvement over the simple vector addition equivalents (Table 2), with the only exception of the composition of the verb and the argument. All the results for the headNoun+verb+arg composition are, to the best of our knowledge, the best scores reported so far on the data set. Unfortunately, given the relatively small size of RELPRON, the improvement of the gek models fails to reach significance (p > 0.1 for all comparisons between a basic additive model and its respective augmentation with deg; p values computed with the Wilcoxon rank-sum test). Compared to SDM, the smoothed vector addition baseline is far less consistent (Table 2): for some combinations and for some vector types, adding the nearest neighbours is detrimental. We take these results as supporting the added value of the structured event knowledge and the sr update process in SDM, over the simple enrichment of vector addition with nearest neighbours. Finally, we can notice that the Skip-Gram vectors again have an edge over the competitors, even over the syntactically informed C-Phrase vectors.
4.2 DTFit
At first glance, the results on DTFit follow a similar pattern (Table 3): the three embedding types perform similarly, although in this case the CBOW vectors perform much worse than the others on the Patients data set. The LSTM also largely lags behind all the additive models, showing that thematic fit modelling is not a trivial task for language models, and that more complex neural architectures are required in order to obtain state-of-the-art results (Tilk et al. Reference Tilk, Demberg, Sayeed, Klakow and Thater2016).Footnote k
The results for SDM again show that including the deg information leads to improvements in the performance (Table 3). While on the Locations the difference is only marginal, also due to the smaller number of test items, two models out of three showed significantly higher correlations than their respective additive baselines. The increase is particularly noticeable for the CBOW vectors, which, in their augmented version, manage to close the gap with the other models and achieve a competitive performance. However, it should also be noticed that there is a striking difference between the two subsets of DTFit: while on Patients the advantage of the gek models over both baselines is clear, on Locations the results are almost indistinguishable from those of the smoothed additive baseline, which simply adds the nearest neighbours to the vectors of the words in the sentence. This is consistent with previous studies on thematic fit modelling with dependency-based distributional models (Sayeed et al. 2015; Santus et al. Reference Chersoni, Santus, Blache and Lenci2017). Because of the ambiguous nature of the prepositions used to identify potential locations, the role vectors used by SDM can be very noisy. Moreover, since most locative complements are optional adjuncts, it is likely that the event knowledge extracted from corpora contains a much smaller number of locations. Therefore, the structural information about locations in deg is probably less reliable and does not provide any clear advantage compared to additive models.
Concerning the comparison between the different types of embeddings, Skip-Gram still retains an advantage over C-Phrase in its basic version, while it is outperformed when the latter vectors are used in SDM. However, the differences are clearly minimal, suggesting that the structured knowledge encoded in the C-Phrase embeddings is not a plus for the thematic fit task. Concerning this point, it must be mentioned that most of the current models for thematic fit estimation rely on vectors based either on syntactic information (Baroni and Lenci Reference Baroni and Lenci2010; Greenberg et al. Reference Greenberg, Sayeed and Demberg2015; Santus et al. Reference Santus, Chersoni, Lenci and Blache2017; Chersoni et al. Reference Chersoni, Lenci and Blache2017) or on semantic roles (Sayeed et al. Reference Sayeed, Demberg and Shkadzko2015; Tilk et al. Reference Tilk, Demberg, Sayeed, Klakow and Thater2016). On the other hand, our results are in line with studies like Lapesa and Evert (Reference Lapesa and Evert2017), who reported comparable performance for bag-of-words and dependency-based models on several semantic modelling tasks, thus questioning whether the injection of linguistic structure in the word vectors is actually worth its processing cost. However, this is the first time that such a comparison is carried out on the basis of the DTFit data set, while previous studies proposed slightly different versions of the task and evaluated their systems on different benchmarks.Footnote l A more extensive and in-depth study is required in order to formulate more conclusive arguments on this issue.
Another constant finding of previous studies on thematic fit modelling was that high-dimensional, count-based vector representations generally perform better than dense word embeddings, to the point that Sayeed et al. (Reference Sayeed, Greenberg and Demberg2016) stressed the sensitivity of this task to linguistic detail and to the interpretability of the vector space. Therefore, we tested whether vector dimensionality had an impact on task performance (Table 4). Although the observed differences are generally small, we noticed that higher-dimensional vectors are generally better in the DTFit evaluation and, in one case, the difference reaches marginal significance (i.e. the difference between the 100-dimensional and the 400-dimensional basic Skip-Gram model is marginally significant at p < 0.1). This point will also deserve future investigation, but it seems plausible that for this task embeddings benefit from higher dimensionality for encoding more information, as suggested by Sayeed and colleagues. However, these advantages do not seem to be related to the injection of linguistic structure directly in the embeddings (i.e. not to the direct use of syntactic contexts for training the vectors), as bag-of-words models perform similarly to – if not better than – a syntax-based model like C-Phrase. We leave to future research a systematic comparison with sparse count-based models to assess whether interpretable dimensions are advantageous for modelling context-sensitive thematic fit.
4.3 Error analysis
One of our basic assumptions about gek is that semantic memory stores representations of typical events and their participants. Therefore, we expect that integrating gek into our models might lead to an improvement especially on the typical items of the DTFit data set. A quick test with the correlations revealed that this is actually the case (Table 5): all models showed increased Spearman's correlations on the tuples in the typical condition (and in the larger Patients subset of DTFit, the increase is significant at p < 0.05 for the CBOW model), while they remain unchanged or even decrease for the tuples in the atypical condition. Notice that this is true only for SDM, which is enriched with gek. On the other hand, the simple addition of the nearest neighbours never leads to improvements, as proved by the low correlation scores of the smoothed additive baseline. As new and larger data sets for compositionality tasks are currently under construction (Vassallo et al. Reference Vassallo, Chersoni, Santus, Lenci and Blache2018), it will be interesting to assess the consistency of these results.
Turning to the RELPRON data set, we noticed that the difference between subject and object relative clauses is particularly relevant for SDM, which generally shows better performances on the latter. Table 6 summarizes the scores on the two subsets. While relying on syntactic dependencies, SDM also processes properties in linear order: the verb+arg model, therefore, works differently when applied to subject clauses than to object clauses. In the subject case, in fact, the verb is found first, and then its expectations are used to re-rank those of the object. In the object case, on the other hand, things proceed the opposite way: the subject is found first, and then its expectations are used to re-rank those of the verb. Therefore, the event knowledge triggered by the verb seems not only less informative than that triggered by the argument, but often even detrimental to the composition process.
5. Conclusion
In this contribution, we introduced SDM, a structured distributional model that represents sentence meaning with formal structures derived from DRT, including embeddings enriched with event knowledge. This knowledge is modelled with a distributional event graph that represents events and their prototypical participants with distributional vectors linked in a network of syntagmatic relations extracted from parsed corpora. The compositional construction of sentence meaning in SDM is directly inspired by the principles of dynamic semantics. Word embeddings are integrated in a dynamic process to construct the semantic representations of sentences: contextual event knowledge affects the interpretation of following expressions, which cue new information that updates the current context.
Current methods for representing sentence meaning generally lack information about typical events and situations, while SDM rests on the assumption that such information can lead to better compositional representations and to an increased capacity of modelling typicality, which is a striking capacity of the human processing system. This corresponds to the hypothesis by Baggio and Hagoort (Reference Baggio and Hagoort2011) that semantic compositionality actually results from a balance between storage and computation: on the one hand, language speakers rely on a wide amount of stored events and scenarios for common, familiar situations; on the other hand, a compositional mechanism is needed to account for our understanding of new and unheard sentences. Processing complexity, as revealed by effects such as the amplitude of the N400 component in ERP experiments, is inversely proportional to the typicality of the described events and situations: the more typical they are, the more coherent they will be with already-stored representations.
We evaluated SDM on two tasks, namely a classical similarity estimation task on the target–definition pairs of the RELPRON data set (Rimell et al. Reference Rimell, Maillard, Polajnar and Clark2016) and a thematic fit modelling task on the event tuples of the DTFit data set (Vassallo et al. Reference Vassallo, Chersoni, Santus, Lenci and Blache2018). Our results confirmed that additive models are quite effective for compositionality tasks, and that integrating the event information activated by lexical items improves the performance on both evaluation data sets. Particularly interesting for our evaluation was the performance on the DTFit data set, since this data set has been especially created with the purpose of testing computational models on their capacity to account for human typicality judgements about event participants. The reported scores on the latter data set showed not only that SDM improves over simple and smoothed additive models, but also that the increase in correlation concerns the data set items rated as most typical by human subjects, fulfilling our initial predictions.
Differently from other distributional semantic models tested on the thematic fit task, ‘structure’ is here encoded externally in a graph whose nodes are embeddings, rather than directly in the dimensions of the embeddings themselves. The fact that the best performing word embeddings in our framework are the Skip-Gram ones is somewhat surprising, and goes against the findings of previous literature, in which bag-of-words models were always described as struggling on this task (Baroni et al. Reference Baroni, Dinu and Kruszewski2014; Sayeed et al. Reference Sayeed, Greenberg and Demberg2016). Given our results, we also suggested that the dimensionality of the embeddings could be an important factor, much more than the choice of training them on syntactic contexts.