1. Introduction
Multiword expressions (MWEs) have already been described as a pain in the neck (Sag et al. 2002) and hard going (Rayson et al. 2010) for natural language processing (NLP), but also considered to be much ado about nothing (de Marneffe, Padó and Manning 2009) and perhaps plain sailing (Rayson et al. 2010) through the years. Despite any controversies, with a growing community and various events dedicated to them, interest in MWEs shows no indication of slowing down, as they can be viewed as providing not only challenges but also opportunities for designing new solutions for more accurate language processing (Constant et al. 2017).
After almost two decades and thousands of citations since the publication of the Pain in the Neck paper by Sag et al. (2002), what is it that makes them still an object of interest? First of all, MWEs come in all shapes, sizes and forms, from a (long) idiom like keep your breath to cool your porridge (as keeping to your own affairs) to a (short) collocation like fish and chips, and models designed for one category of MWE may not be adequate for other categories. Secondly, they may also display various degrees of idiosyncrasy, including lexical, syntactic, semantic and statistical (Baldwin and Kim 2010), which may interact in complex ways. For instance, a dark horse, in addition to describing the colouring of an animal, may also be used to refer to an unknown candidate who unexpectedly succeeds, and this second meaning cannot be fully inferred from the component words. As a consequence, their accurate detection and understanding may require knowledge that goes beyond the individual words and how they can be combined together (Fillmore 1979). However, for NLP tasks and applications that involve some level of semantic interpretation, ignoring MWEs may result in information being lost or incorrectly processed (e.g., to kick the bucket meaning to die being translated literally).
In this paper, we review some of the methods that have been adopted for computationally modelling MWEs, concentrating on their discovery from corpora. The paper is structured as follows: we start with a brief description of MWEs in Section 2. Methods for MWE discovery are reviewed in Section 3, with a focus on discovering information from their collocational and contextual profiles (Sections 4 and 5), as well as from the degree of rigidity of form and translatability (Sections 6 and 7). We also discuss some of the MWE resources available (Section 8). We finish with some conclusions and a discussion of future possibilities.
2. What is in a word/multiword?
MWEs are all around. According to estimates, about four MWEs are produced per minute of discourse (Glucksberg 1989). They feature prominently in the mental lexicon of native speakers (Jackendoff 1997) in all languages and domains, in informal and in technical contexts (Biber et al. 1999). They can be found in songs (Joshua Tree by U2, Knocking on Heaven’s Door by Guns N’ Roses), in books (Much ado about nothing, All is well that ends well by Shakespeare), in newspaper headlines (Spilling the beans about coffee’s true cost [Footnote a]) and in scientific texts (dentate gyrus, long-term memory, word sense disambiguation). Moreover, these expressions have also been found to have faster processing times compared to non-MWEs (compositional novel sequences) (Cacciari and Tabossi 1988; Arnon and Snider 2010; Siyanova-Chanturia 2013). But what are they and how can we recognise them?
Different definitions have been proposed for them, describing them as recurrent or typical combinations of words that are formulaic (Wray 2002) or that need to be treated as a unit at some level of description (Calzolari et al. 2002; Sag et al. 2002). In fact, there may not even be a unified phenomenon but instead a set of features that interact in non-trivial ways and that fall in a continuum from idiomatic to compositional combinations (Moon 1998).
As some of these definitions refer to words and the crossing of word boundaries (Sag et al. 2002), it is also important to adopt a clear definition of what a word is, either in terms of meaning, syntax, or whitespaces (Church 2013; Ramisch 2015). For example, the PARSEME guidelines (Ramisch et al. 2018) define a word as a “linguistically (notably semantically) motivated unit” [Footnote b] and MWEs as containing at least two words even if they are represented as a single token (e.g., snowman). Here, for the sake of simplicity, we assume that words are separated by whitespaces in texts. [Footnote c] Adopting clear and precise definitions for these target concepts provides the basis for estimating their occurrence in human language and consequently for determining adequate vocabulary sizes, since the performance of many tasks seems to be linked to vocabulary size (Church 2013). They are also important for designing clear evaluation setups for comparing different MWE processing methods. Discussions of alternative definitions for these and related concepts (e.g., phraseological units, phrasal lexemes and collocations), along with the implications of the combinations they include, can be found in Moon (1998), Seretan (2011), Ramisch (2015) and Constant et al. (2017).
Some of the core properties that have been used to describe MWEs include (Calzolari et al. 2002):
High degree of lexicalisation, with some component words not being used in isolation (e.g., ad from ad hoc and sandboy from happy as a sandboy),
Breach of general syntactic rules with reduced syntactic flexibility and limited variation (e.g., by and large/*short/*largest). Although it may be possible to find a canonical form for an MWE, it is not always easy to determine which elements form its obligatory core parts and which elements can be varied (if any), as they may allow discontiguity and some degree of modification (e.g., throw NP to the hungry lions/wolves as sacrificing someone),
Idiomaticity or reduced semantic compositionality, possibly involving figuration like metaphors, with the meaning of some expressions not being entirely predictable from their component words. [Footnote d] MWEs fall into a continuum of idiomaticity, from compositional expressions like olive oil (meaning an oil made from olives) to idiomatic expressions like to trip the light fantastic (meaning to dance),
High degree of conventionality and statistical markedness reflecting a preference for some specific forms, or collocations, over plausible but low-frequency variations, or anti-collocations (Pearce 2001) (e.g., strong tea and fish and chips vs. the less common powerful tea and chips and fish).
Each of these characteristics may occur in varying degrees in a given expression. One classification of MWEs that takes into account how much variability they display was proposed by Sag et al. (2002). In this classification, fixed expressions do not display any morphological inflection or lexical variation (e.g., in addition/*additions and ad infinitum). Semi-fixed expressions have fixed word order but display some morphological inflection (coffee machine/machines). Syntactically flexible expressions exhibit a large range of morphological and syntactic variation (rock the political/proverbial/family/Olympic boat).
To sum up, MWEs can be characterised as possibly discontiguous word combinations that display lexical, syntactic, semantic, pragmatic and/or statistical idiosyncrasies (Baldwin and Kim 2010). These properties can be distributed in different ways in MWE categories such as:
Proper names: Manchester United,
Collocations: emotional baggage, heavy rain,
Compounds: pinch of salt, friendly fire,
Idioms: keep NP on NP’s toes, throw NP to the lions/wolves,
Support verbs: wind blows, make a decision, go crazy,
Prepositional verbs: look for, talk NP into,
Verb-particle constructions: take off, clear up,
Lexical bundles: I don’t know whether.
More detailed inventories of categories are discussed by Sag et al. (2002), Constant et al. (2017) and Ramisch et al. (2018). For instance, the PARSEME annotation guidelines (Ramisch et al. 2018) focus on verbal MWEs in 20 languages including Bulgarian, French, Portuguese and Turkish.
3. Can we detect them automatically?
There has been considerable work on describing MWEs and cataloguing their properties, and some popular resources are discussed in Section 8. As their manual construction is time-consuming and requires expert knowledge, much effort has been devoted to automatically extracting MWEs from corpora. This task, known as MWE discovery [Footnote e], aims to determine if a given sequence of words forms a genuine MWE or if it can be treated as a standard combination of words (e.g., small boy). For MWE discovery, the hope is that some form of salience is present such that MWEs stand out and can be automatically detected. In this context, methods based on statistical markedness have been particularly popular since they rely on association and entropic measures calculated from corpus counts (Manning and Schütze 1999; Kilgarriff et al. 2004a; Pecina 2010) and are inexpensive and independent of language and MWE category. These methods have been used to detect preferences of various types, including:
Collocational preferences. Given that the “collocations of a given word are statements of the habitual or customary places of that word” (Firth 1957), these methods search for word sequences that are particularly recurrent in corpora and can form MWEs.
Contextual preference. Assuming the distributional hypothesis that implies that you shall know a (multi)word by the company it keeps (Firth 1957), these methods have been used to detect discrepancies between the meaning of an MWE and those of its parts, as an indication of idiomaticity.
Canonical form preferences. As MWEs may display different types of inflexibility, evidence of marked preferences for very few of the expected morphological, lexical and syntactic variants can be used as indications of an MWE.
Multilingual preferences. These methods are often based on detecting unexpected asymmetries in translations.
In the next sections, we present a general overview of these methods.
4. Collocational preferences
Assuming that words that co-occur more frequently than expected by chance are indicative of MWEs (Manning and Schütze 1999; Pecina and Schlesinger 2006), this statistical markedness can be detected by measures of association strength. In a typical scenario, a list of candidate MWEs is generated, for example, from n-grams (Manning and Schütze 1999) or from relevant syntactic patterns for the target MWE categories (Justeson and Katz 1995). The list of candidates is then ranked according to the score of association strength, and those with stronger associations are expected to be genuine MWEs.
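The generate-then-rank scenario just described can be sketched as follows. This is a minimal illustration rather than any specific system from the literature: the function names (`ngram_candidates`, `rank`), the toy sentences and the use of raw frequency as the score are all assumptions made for the example.

```python
from collections import Counter

def ngram_candidates(sentences, n=2, min_freq=2):
    """Count contiguous n-grams and keep those seen at least min_freq times."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {ngram: c for ngram, c in counts.items() if c >= min_freq}

def rank(candidates, score):
    """Sort candidates from the highest to the lowest score."""
    return sorted(candidates, key=score, reverse=True)

# Hypothetical toy corpus, already tokenised.
toy = [
    "we had fish and chips".split(),
    "fish and chips again".split(),
    "strong tea and chips".split(),
]
cands = ngram_candidates(toy, n=2, min_freq=2)
# Raw frequency stands in for an association measure here.
ranked = rank(cands, score=lambda ngram: cands[ngram])
```

In practice the `score` argument would be replaced by an association measure, and the candidate generator by one restricted to the syntactic patterns of the target MWE category.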
Formally, we consider a candidate MWE as a generic n-gram with n word tokens $w_1$ through $w_n$. Its frequency in a corpus C of size N and lexicon L is denoted by $f(w_1\ldots w_n)$. From the corpus frequencies, it is possible to estimate probabilities using maximum likelihood estimation, for instance, the unigram probability $p(w_1)$ and the n-gram probability $p(w_1 \ldots w_n)$:
$$p(w_1) = \frac{f(w_1)}{N}, \qquad p(w_1\ldots w_n) = \frac{f(w_1\ldots w_n)}{N},$$
or the probability that the word $w_1$ occurs on the left of a bigram
$$p(w_1 \ast) = \frac{f(w_1 \ast)}{N},$$
or even the probability that two words appear separated by a certain number of words
$$p(w_1 \ast \cdots \ast\, w_2) = \frac{f(w_1 \ast \cdots \ast\, w_2)}{N}.$$
Here $\ast$ represents the sum over all possible words in L in that position.
A central question of the collocation problem is whether the observed frequency of a given combination of words is higher than what would be expected from pure chance. Of course language is far from a random distribution of words, yet a notable discrepancy suggests that the combination is special in some way. To assess that, we have to measure the association strength between words, and this demands the formalisation of a clear expression for the predicted frequency in the case of pure chance, a baseline sometimes referred to as the null hypothesis. The usual choice is to consider statistical independence, that is, that the frequency of a sequence corresponds to the product of the unigram probabilities [Footnote f] of its members scaled by the size of the corpus,
$$f_{\emptyset}(w_1\ldots w_n) = N\,p(w_1)\,p(w_2)\cdots p(w_n).$$
Therefore, the association measure has to be a function that gauges some kind of distance between the observed data and the prediction. This can be formulated either in terms of frequencies, comparing $f(w_1\ldots w_n)$ with $f_{\emptyset}(w_1\ldots w_n)$, or in terms of probabilities, comparing $p(w_1\ldots w_n)$ with $p(w_1)\,p(w_2)\cdots p(w_n)$.
However, we must bear in mind that the true probabilities are not known, only the maximum likelihood estimates that we can obtain from a finite sample, in this case a corpus. This raises an important issue of statistical significance of the association itself in the case of low frequencies. In order to circumvent this problem, many association measures are derived from known statistical tests. This results in more generalised versions of association measures that depend not only on unigram frequencies but also on other possible combinations, such as those involving n-grams of lower orders than the target. In the next sections, we discuss some of these measures.
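To make the comparison between observed and chance-level frequencies concrete, the sketch below computes both for a bigram in a toy token sequence. It assumes maximum likelihood estimates from raw counts and the independence baseline described above; the function name `expected_frequency` and the toy data are illustrative only.

```python
from collections import Counter

def expected_frequency(ngram, unigram_counts, n_tokens):
    """Expected count under independence: N * p(w_1) * ... * p(w_n)."""
    expected = float(n_tokens)
    for word in ngram:
        expected *= unigram_counts[word] / n_tokens  # MLE unigram probability
    return expected

# Hypothetical toy corpus of N = 10 tokens.
tokens = "a b a b a c b a b c".split()
N = len(tokens)
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

observed = bigrams[("a", "b")]
expected = expected_frequency(("a", "b"), unigrams, N)
```

With so few tokens the gap between `observed` and `expected` is not statistically reliable, which is exactly the low-frequency problem that motivates measures derived from significance tests.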
4.1 Pointwise mutual information
By far the most widely used association measure is the pointwise mutual information (PMI) (Church and Hanks 1990) and its variations. PMI is derived for bigrams directly from the mutual information between two random variables, using the log-ratio between the observed co-occurrences of the sequence and of the individual words:
$$\mathrm{PMI}(w_i, w_j) = \log\frac{p(w_i w_j)}{p(w_i \ast)\,p(\ast w_j)} = \log\frac{N\,f(w_i w_j)}{f(w_i \ast)\,f(\ast w_j)}.$$
PMI values can be positive, denoting affinity between the words, 0 denoting independence between them, or negative, denoting lack of affinity. Moreover, the closer the counts for the sequence are to the word counts, the stronger the association between the words and the more exclusively they like to co-occur.
One well-known issue with PMI is its bias towards infrequent events. Its upper bound, corresponding to the case of perfect association ($f(w_{i} \ast)=f(\ast w_{j})=f(w_{i}w_{j})$), is $-\log(\,f(w_{i}w_{j})/N)$. Therefore, a moderately associated low-frequency bigram could, in principle, have a better score than a highly associated high-frequency bigram (Bouma 2009). To correct this, alternative statistical measures based on suitable normalisation of PMI have been proposed (Bouma 2009). One popular variant is the lexicographer’s mutual information (LMI), or salience score (Kilgarriff et al. 2004b), which adjusts a PMI value by multiplying it by the frequency, reintroducing the importance of meaningful recurrence.
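The low-frequency bias and the two corrections mentioned above can be illustrated with a small sketch. It assumes natural logarithms (the choice of base only rescales the scores), takes the normalised variant to be PMI divided by the upper bound $-\log(f(w_iw_j)/N)$, and takes LMI to be the bigram frequency times PMI; the counts are invented for the example.

```python
import math

def pmi(f_xy, f_x, f_y, n_tokens):
    """PMI from counts: log(N * f(xy) / (f(x *) * f(* y)))."""
    return math.log(n_tokens * f_xy / (f_x * f_y))

def npmi(f_xy, f_x, f_y, n_tokens):
    """PMI divided by its upper bound -log(f(xy) / N); undefined if f(xy) == N."""
    return pmi(f_xy, f_x, f_y, n_tokens) / -math.log(f_xy / n_tokens)

def lmi(f_xy, f_x, f_y, n_tokens):
    """Frequency-weighted PMI."""
    return f_xy * pmi(f_xy, f_x, f_y, n_tokens)

N = 1_000_000
# Hypothetical counts (f(xy), f(x *), f(* y)): both bigrams are perfectly
# associated, but one was seen once and the other a thousand times.
rare = (1, 1, 1)
frequent = (1000, 1000, 1000)

for name, counts in (("rare", rare), ("frequent", frequent)):
    print(name, pmi(*counts, N), npmi(*counts, N), lmi(*counts, N))
```

Plain PMI ranks the bigram seen once above the one seen a thousand times, the normalised score treats both as equally (perfectly) associated, and LMI favours the recurrent one.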
So far we have discussed association between two words. One option for handling larger candidates is the generalisation of the mutual information to account for many variables. However, as this generalisation is not unique (Van de Cruys 2011), various proposals have been made for calculating the equivalent of PMI for n-grams larger than two words. One of these is the specific total correlation (STC), which is the direct extension of the formula above for $w_{1}\ldots w_{n}$ and is based on the so-called total correlation proposed by Watanabe (1960). Another alternative generalisation is the specific interaction information (SII) (Van de Cruys 2011), which is based on the interaction information measure proposed by McGill (1954). One important difference between STC and SII is that the former is zero only if all words are independent while the latter is zero if at least one is not associated with the others. Table 1 displays these two measures for trigrams.
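The contrast between the two generalisations can be checked numerically. The sketch below assumes the trigram forms commonly given for these measures, $\log\frac{p(xyz)}{p(x)p(y)p(z)}$ for STC and $\log\frac{p(xy)\,p(yz)\,p(xz)}{p(x)\,p(y)\,p(z)\,p(xyz)}$ for SII; sign conventions for SII vary, and these forms should be checked against Table 1 before being relied on. The joint distribution is a toy one in which the first two positions are perfectly associated and the third is independent of both.

```python
import math

def marginal(joint, keep):
    """Marginalise a joint distribution over triples onto the positions in keep."""
    out = {}
    for triple, p in joint.items():
        key = tuple(triple[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def stc(joint, x, y, z):
    """Specific total correlation for one trigram (assumed form, see lead-in)."""
    px, py, pz = (marginal(joint, (i,)) for i in range(3))
    return math.log(joint[(x, y, z)] / (px[(x,)] * py[(y,)] * pz[(z,)]))

def sii(joint, x, y, z):
    """Specific interaction information for one trigram (assumed form)."""
    px, py, pz = (marginal(joint, (i,)) for i in range(3))
    pxy = marginal(joint, (0, 1))
    pyz = marginal(joint, (1, 2))
    pxz = marginal(joint, (0, 2))
    numerator = pxy[(x, y)] * pyz[(y, z)] * pxz[(x, z)]
    denominator = px[(x,)] * py[(y,)] * pz[(z,)] * joint[(x, y, z)]
    return math.log(numerator / denominator)

# Hypothetical joint distribution: positions 1 and 2 always agree,
# position 3 is drawn independently ("a" with 0.3, "b" with 0.7).
joint = {
    (0, 0, "a"): 0.15, (0, 0, "b"): 0.35,
    (1, 1, "a"): 0.15, (1, 1, "b"): 0.35,
}
stc_value = stc(joint, 0, 0, "a")
sii_value = sii(joint, 0, 0, "a")
```

Here STC is positive because two of the three words are associated, whereas SII vanishes because the third word is unrelated to the other two.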
Another alternative for n-grams is to maintain the original PMI formulation with two variables ($w_1$ and $w_2$) but to allow each variable to contain nested expressions as one word (e.g., $w_1$ = first_class and $w_2$ = lounge, or $w_1$ = recurrent and $w_2$ = neural_network) (Seretan 2011).
4.2 Other measures
In addition to PMI, n-gram frequency has also been used for MWE discovery. However, as it does not distinguish meaningful occurrences from chance occurrence of frequent words, it has been used in conjunction with other measures like PMI, generating LMI.
Two other popular measures are the Student’s t-test based measure and the Dice coefficient (Table 1), which, in common with PMI, take into consideration the expected counts to detect meaningful co-occurrences. For instance, Student’s t-test is based on hypothesis testing, assuming that if the words are independent, their observed and expected counts are identical. The Dice coefficient, also known as normalised expectation (Pecina and Schlesinger 2006), differs from both of these measures by having an upper bound of 1 for perfect correlation.
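As a rough illustration of these two measures for bigrams, the sketch below uses the forms most often found in the collocation literature: a t-score of (observed − expected) / √observed and a Dice coefficient of $2f(w_1w_2)/(f(w_1\ast)+f(\ast w_2))$. These are assumptions made for the example, since Table 1 is not reproduced here, and the counts are invented.

```python
import math

def t_score(f_xy, f_x, f_y, n_tokens):
    """(observed - expected) / sqrt(observed), with expected = f(x) f(y) / N."""
    expected = f_x * f_y / n_tokens
    return (f_xy - expected) / math.sqrt(f_xy)

def dice(f_xy, f_x, f_y):
    """Twice the joint count over the sum of the two marginal counts."""
    return 2 * f_xy / (f_x + f_y)

# Hypothetical counts for three bigrams in a corpus of 100 tokens.
independent = t_score(4, 20, 20, 100)   # observed equals expected
associated = t_score(15, 20, 20, 100)   # observed well above expected
perfect = dice(10, 10, 10)              # the two words only occur together
```

The t-score is zero when the observed and expected counts coincide and grows with the excess of observed co-occurrences, while Dice reaches its maximum of 1 only when neither word occurs without the other.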
There are also measures based on contingency tables that record not only the marginal frequencies of the words ($w_i$) in an n-gram, but also the probability of their “non-occurrence” ($\bar{w_i}$, that is, all words but $w_i$). These measures, which include Pearson’s χ2 (Table 1) and the more robust log-likelihood ratio (Dunning 1993), compare the co-occurrence of two words with all other combinations in which they occur.
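A contingency-table measure for a bigram can be sketched as below. The 2×2 table layout, the helper names and the counts are assumptions for illustration; the statistics follow the standard definitions of Pearson's χ2 and of the log-likelihood ratio computed from observed and expected cell counts.

```python
import math

def contingency(f_xy, f_x, f_y, n_tokens):
    """2x2 table of observed counts: rows x / not-x, columns y / not-y."""
    return [
        [f_xy, f_x - f_xy],
        [f_y - f_xy, n_tokens - f_x - f_y + f_xy],
    ]

def expected(table):
    """Expected cell counts under independence, from the table margins."""
    total = sum(map(sum, table))
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    return [[rows[i] * cols[j] / total for j in range(2)] for i in range(2)]

def chi_square(table):
    exp = expected(table)
    return sum((table[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))

def log_likelihood(table):
    exp = expected(table)
    # Cells with a zero observed count contribute nothing to the sum.
    return 2 * sum(table[i][j] * math.log(table[i][j] / exp[i][j])
                   for i in range(2) for j in range(2) if table[i][j] > 0)

# Hypothetical counts in a corpus of 100 bigram positions.
chance = contingency(4, 20, 20, 100)
strong = contingency(15, 20, 20, 100)
```

Both statistics are zero for the `chance` table, where every cell matches its expected value, and clearly positive for the `strong` table.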
Over the years, many other association measures have been defined for MWE discovery, and Pecina and Schlesinger (2006) compiled as many as 82 measures for bigram collocation discovery found in the literature. They show that these measures capture different aspects of MWEs and, as a consequence, when combined they can generate better results in terms of MWE discovery than if used in isolation. In fact, in comparative evaluations, no single measure has been found to be the best for extracting MWEs of any category or in any language, confirming that an empirical exploration of these measures is needed for a particular category and language combination (Pearce 2002; Evert and Krenn 2005; Villavicencio et al. 2007). Likewise, as these measures can be used to produce ranked lists of MWE candidates, as discussed before, defining a threshold that separates genuine MWEs from non-MWEs also seems to depend on the particular target MWEs, and on whether the task benefits more from recovering more MWEs at the expense of allowing more noise, or not. Evaluation of how closely a given measure captures the MWEs of a particular domain and language is usually done by means of gold standard resources or manual validation by expert judges.
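One simple way to compare ranked candidate lists against a gold standard is precision at a cut-off k, sketched below. The choice of this particular metric, the function name and the toy lists are assumptions for illustration and are not taken from the evaluations cited above.

```python
def precision_at_k(ranked, gold, k):
    """Fraction of the top-k ranked candidates that appear in the gold standard."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for candidate in top if candidate in gold) / len(top)

# Hypothetical ranked output of an association measure and a gold lexicon.
ranked = ["fish and chips", "of the", "strong tea", "in a"]
gold = {"fish and chips", "strong tea"}

scores = {k: precision_at_k(ranked, gold, k) for k in (1, 2, 4)}
```

Moving the cut-off down the list trades precision for recall, which is the threshold decision discussed above.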
5. Contextual preferences
When deriving the meaning of a combination of words, one widely adopted strategy is to build it from the meanings of the parts, following the principle of compositionality. [Footnote g] This principle allows a meaning to be assigned to larger units and sentences, even if they contain unseen combinations of words. However, it is not adequate for handling idiomatic MWEs since it may lead to an unrelated meaning being derived (e.g., for trip the light fantastic). Considerable effort has been devoted to methods for detecting idiomaticity, both at the level of MWE types, discovering the degree of idiomaticity that an MWE usually displays, and at the level of MWE tokens, deciding for a specific occurrence if it is idiomatic or not. For example, the first task would be used to identify that the meaning of access road can, in general, be inferred from its parts (a road for giving access to a place), while the second task would be to decide if in a sentence like the exam was a piece of cake the occurrence of piece of cake should be interpreted literally as a slice of a baked good, or idiomatically as something easy. For both tasks, information about the contexts in which an MWE occurs has been found to be a good indicator of idiomaticity, and we now discuss some of the measures that have been proposed for these tasks.
5.1 Type idiomaticity
If a word can be characterised by “the company it keeps” (Firth 1957) and given that words that occur in similar contexts have similar meanings (Turney and Pantel 2010), we can approximate the meaning of an MWE by aggregating its affinities with its contexts. We can also find words and MWEs with similar meanings by measuring how similar their affinities are. These affinities can be determined from distributional semantic models (or vector space models), which have been used to represent word meaning (and possibly subword and phrase meaning) as numerical multidimensional vectors in a putative semantic space (Lin 1998; Mikolov et al. 2013; Pennington, Socher and Manning 2014). These models are capable of reaching high levels of agreement with human judgements about word similarity (Baroni, Dinu and Kruszewski 2014; Camacho-Collados, Pilehvar and Navigli 2015; Lapesa and Evert 2017). They vary according to factors like the following [Footnote h]:
Type of model: count-based and predictive models (Baroni et al. 2014). Count-based models generate vectors derived from co-occurrence counts between words and their contexts (Lin 1998; Pennington et al. 2014). Predictive models represent words as real-valued vectors projected onto low-dimensional space whose distances are adjusted as part of learning to predict words from contexts (or vice-versa) (Mikolov et al. 2013; Baroni et al. 2014).
Type of pre-processing applied to the input corpus: such as lemmatisation and part-of-speech tagging. While state-of-the-art models for English have been constructed without any pre-processing, for morphologically richer languages like French and Portuguese pre-processing the corpus can lead to better models (Cordeiro et al. 2019).
Type of context: in bag-of-words models (Mikolov et al. 2013), the contexts of a target word are represented as an unordered set of words that does not differentiate between their positions or relations to the target. In models based on syntactic dependencies (Lin 1998; Levy and Goldberg 2014), contexts are further distinguished in terms of their syntactic relations to the target (e.g., dog as subject vs. as object of the target).
Window size: It defines the number of words around the target that are included as contexts (Lapesa and Evert 2014). These windows can be symmetric or asymmetric in relation to the target, and may incorporate a decay factor for prioritising words that are closer to the target.
Number of vector dimensions used for representing words. These range from sparse vectors with as many dimensions as the number of words in the vocabulary to denser and more compact representations. Reductions in the number of dimensions can be obtained using explicit context filtering, such as using only the n most frequent or salient contexts (Padró et al. 2014; Salehi, Cook and Baldwin 2014), or adopting dimensionality reduction techniques like singular value decomposition.
Measures of association strength between a target word and its contexts. These measures help to detect more salient co-occurrences that are not just due to chance, and some of them were discussed in the previous section such as χ2, t-score, PMI and Positive PMI (PPMI) (Curran and Moens 2002; Padró et al. 2014).
Measures of similarity, distance or divergence between word vectors. These measures have been used to find word vectors that display similar affinities with their contexts, like cosine (explained below), Manhattan distance, Kullback–Leibler divergence, Jensen–Shannon, Dice and Jaccard.
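Several of the factors listed above can be seen together in a small count-based model. The sketch below assumes a bag-of-words context with a symmetric window, PPMI as the association measure and cosine as the similarity measure, with no pre-processing or dimensionality reduction; the function names and the toy corpus are illustrative only.

```python
import math
from collections import Counter, defaultdict

def cooccurrences(sentences, window=2):
    """Count context words within a symmetric window around each target."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

def ppmi_vectors(counts):
    """Sparse vectors keeping only positive PMI values between target and context."""
    total = sum(sum(ctx.values()) for ctx in counts.values())
    row = {t: sum(ctx.values()) for t, ctx in counts.items()}
    col = Counter()
    for ctx in counts.values():
        col.update(ctx)
    vectors = {}
    for target, ctx in counts.items():
        vectors[target] = {}
        for context, f in ctx.items():
            value = math.log(f * total / (row[target] * col[context]))
            if value > 0:
                vectors[target][context] = value
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy corpus, already tokenised.
toy = [s.split() for s in (
    "the cat drinks milk", "the dog drinks water",
    "the cat eats fish", "the dog eats meat",
)]
counts = cooccurrences(toy, window=2)
vectors = ppmi_vectors(counts)
```

In this toy space `cosine(vectors["cat"], vectors["dog"])` is higher than `cosine(vectors["cat"], vectors["milk"])`, since cat and dog share more of their contexts.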
A major advantage of vector space models is the possibility of using algebra to model complex interactions between words. Similarity or relatedness can be modelled as a comparison between word vectors, for instance, as the normalised inner product (the cosine similarity):
$$\cos\big(\mathbf{v}(w_1), \mathbf{v}(w_2)\big) = \hat{\mathbf{v}}(w_1)\cdot\hat{\mathbf{v}}(w_2),$$
where $\hat{\mathbf{v}}(w)$ is the normalised [Footnote i] word vector of the word w. Compositional meaning can also be modelled as a mathematical function that composes the vectors of the words in an MWE, but this time not to compare but to add information. The simplest of all is the additive model (Mitchell and Lapata 2008), but there are alternative possibilities including other operations (Mitchell and Lapata 2010; Reddy, McCarthy and Manandhar 2011; Mikolov et al. 2013; Salehi, Cook and Baldwin 2015). For the additive model, the vector for a two-word compound ($\mathbf{v}_{\beta}(w_{1},w_{2})$) can be defined as
$$\mathbf{v}_{\beta}(w_1,w_2) = \beta\,\hat{\mathbf{v}}(w_{\mathrm{head}}) + (1-\beta)\,\hat{\mathbf{v}}(w_{\mathrm{mod}}),$$
where $w_{\mathrm{head}}$ (or $w_{\mathrm{mod}}$) indicates the semantic head (or modifier) of the compound and $\beta \in [0,1]$ is an adjustable parameter (usually set to 1/2) that controls the relative importance of the head to the compound semantics (Reddy et al. 2011). For example, in flea market, it is the head (market) that has a larger contribution to the overall meaning, and β may be used to reflect this.
The degree of compositionality can be calculated between the corpus-derived vector of the MWE, $\mathbf{v}(w_1w_2)$ (e.g., for rocket_science), [Footnote j] and the compositionally constructed vector containing the combination of the component words, $\mathbf{v}_{\beta}(w_1, w_2)$ (e.g., rocket and science):
$$\mathrm{comp}(w_1w_2) = \cos\big(\mathbf{v}(w_1w_2),\, \mathbf{v}_{\beta}(w_1,w_2)\big).$$
MWEs with low values of comp are candidates for being idiomatic (Cordeiro et al. 2019).
This score can be used both to validate a given candidate MWE and to assign a degree of idiomaticity to it, since MWEs fall on a continuum of idiomaticity (McCarthy, Keller and Carroll 2003; Reddy et al. 2011; Salehi, Cook and Baldwin 2018). The success of this score depends on how linguistically accurate the compositional models and similarity measures used are. The good news is that recent work has demonstrated that additive compositional models associated with cosine similarity are suitable for detecting idiomaticity of noun compounds (Cordeiro et al. 2019) and have outperformed other variants in similar tasks (Reddy et al. 2011; Salehi et al. 2015), including in predicting intra-compound semantics (Hartung et al. 2017).
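The compositionality score can be sketched in a few lines. The code assumes the additive model with length-normalised component vectors and cosine similarity, as described in this section; the three-dimensional vectors are invented toy values chosen to give one fully compositional and one fully idiomatic case, not corpus-derived embeddings.

```python
import math

def normalise(v):
    """Scale a dense vector to unit length (the vector must be non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(normalise(u), normalise(v)))

def additive(head, mod, beta=0.5):
    """beta * normalised head vector + (1 - beta) * normalised modifier vector."""
    h, m = normalise(head), normalise(mod)
    return [beta * a + (1 - beta) * b for a, b in zip(h, m)]

def comp(mwe_vector, head, mod, beta=0.5):
    """Cosine between the corpus vector of the MWE and the composed vector."""
    return cosine(mwe_vector, additive(head, mod, beta))

# Hypothetical toy vectors.
access, road, access_road = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]
rocket, science, rocket_science = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

compositional = comp(access_road, head=road, mod=access)
idiomatic = comp(rocket_science, head=science, mod=rocket)
```

With real embeddings the scores fall between these two extremes, and the value of `beta` can be tuned when the head and modifier contribute unequally.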
Alternative measures for approximating idiomaticity have included comparing the distributional neighbourhood of an MWE with those of the component words, that is, the words that are closest to each of them in vector space. Assuming that compositional MWEs share more distributional neighbours with their component words, the overlap between their neighbours has been used as an indication of the degree of compositionality (McCarthy et al. 2003). Additionally, the rank position of these neighbours can also be considered.
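A neighbour-overlap score along these lines might look as follows. This is one possible formulation, assumed for illustration: neighbours are the k nearest items by cosine, the MWE and its parts are excluded from all neighbour lists, and the score is the share of the MWE's neighbours that are also neighbours of at least one part. The two-dimensional vectors are invented toy values.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(word, vectors, k, exclude=()):
    """The k items closest to word by cosine, skipping word itself and exclude."""
    others = [w for w in vectors if w != word and w not in exclude]
    others.sort(key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)
    return set(others[:k])

def neighbour_overlap(mwe, parts, vectors, k=1):
    """Share of the MWE's k neighbours that also neighbour one of its parts."""
    skip = {mwe, *parts}
    mwe_neighbours = nearest(mwe, vectors, k, skip)
    part_neighbours = set()
    for part in parts:
        part_neighbours |= nearest(part, vectors, k, skip)
    return len(mwe_neighbours & part_neighbours) / k

# Hypothetical toy space: first dimension "physical", second "competition".
vectors = {
    "olive_oil": [1.0, 0.0], "olive": [0.9, 0.1], "oil": [0.8, 0.1],
    "vinegar": [1.0, 0.05],
    "dark_horse": [0.0, 1.0], "dark": [0.7, 0.3], "horse": [0.6, 0.2],
    "underdog": [0.1, 1.0],
}
compositional = neighbour_overlap("olive_oil", ["olive", "oil"], vectors)
idiomatic = neighbour_overlap("dark_horse", ["dark", "horse"], vectors)
```

A rank-sensitive variant could weight each shared neighbour by its position in the lists instead of counting it as 0 or 1.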
Semantic information about MWEs and their possible senses can also be obtained from resources like dictionaries and thesauri, including synonyms, antonyms, definitions and examples. Some resources, like WordNet (Fellbaum 1998), also include similarity measures like Wu–Palmer (Wu and Palmer 1994) and Leacock–Chodorow (Leacock and Chodorow 1998). However, their coverage for MWEs may be limited, and they may not be available for a given domain or language, restricting their applicability for idiomaticity detection.
5.2 Token idiomaticity
So far we discussed methods for discovering MWEs and deciding how idiomatic they can be, and these could be useful for building resources. However, when faced with a particular sequence of words, a speaker (as well as an automatic system) must decide whether in that sentence they can be treated as simple isolated words or if they are components of a unit, an MWE. Sometimes, the syntactic context may help to disambiguate them, as in the sentence Does the bus stop here? where bus stop could be flagged as a possible MWE occurrence except that stop is a verb and the MWE bus stop is formed by two nouns. However, there are cases where both idiomatic and literal readings are possible with exactly the same syntactic configuration. For instance, for kick the bucket more information is needed to disambiguate if a kicking event took place with a literal interpretation of the words, or a dying event with idiomatic interpretation. Although for some MWEs one of the meanings will be predominant, ambiguity is not the exception: an analysis of idiomatic verb-noun combinations (VNCs) revealed that many of them were also used with their literal senses in corpus (Fazly, Cook and Stevenson Reference Fazly, Cook and Stevenson2009). Therefore, for a given MWE occurrence in a sentence, we need to determine if it is used in a literal or an idiomatic meaning.
Token idiomaticity detection can be seen as a word sense disambiguation task, where information from the surrounding words in the sentential context can be used to help disambiguate the MWE sense. Returning to the case of kick the bucket, although both the literal and the idiomatic senses are possible, sentences in which the idiomatic sense occurs will include words that may not be compatible with the literal sense (e.g., illnesses, hospitals and funerals). In previous work on token idiomaticity detection, this sentential context has been modelled in terms of lexical chains, assuming that a literal sense displays strong cohesive ties with the context, which are absent for the idiomatic sense (Sporleder and Li 2009).
To solve this ambiguity, something akin to compositionality prediction, described in the previous section, has to take place. But this time, instead of comparing the compositional vector of the MWE formed by the combination of the parts with the corpus-generated vector for the MWE, we must compare the vectors for the literal (e.g., hitting the bucket) and idiomatic (e.g., dying) senses with a vector representing the sentential context in which the MWE occurs. In this case, the sentential context can be represented using sentence-level distributional models such as Skip-Thought Vectors (Kiros et al. 2015), as done by Salton, Ross and Kelleher (2016), or it can be compositionally constructed from the vector representations of the words in the sentence using an operation like vector addition, as done by King and Cook (2018). In fact, for token idiomaticity detection in VNCs, King and Cook compared the use of different distributional models for representing the target sentences in which the VNCs occur, from word-level (Mikolov et al. 2013) to sentence-level models (Kiros et al. 2015). They found that representing a sentential context using the additive model obtained the best results. Alternatives to the additive model include concatenating word vectors of specific parts of the sentential context (Taslimipoor et al. 2017).
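The comparison described above can be illustrated with a small Python sketch. It builds an additive context vector from the words of a sentence and assigns the sense (literal or idiomatic) whose vector is closest by cosine similarity. The two-dimensional embeddings, the sense vectors and the function names are hypothetical; this nearest-sense heuristic is a simplification and does not reproduce the classifiers used in the cited work.

```python
import math

# Hypothetical 2-dimensional embeddings: [physical-action, death-related].
TOY_EMBEDDINGS = {
    "hospital": [0.1, 0.9],
    "funeral": [0.0, 1.0],
    "illness": [0.1, 0.8],
    "floor": [0.9, 0.1],
    "water": [1.0, 0.0],
    "spilled": [0.9, 0.0],
}
LITERAL_SENSE = [1.0, 0.1]    # assumed vector for the literal reading
IDIOMATIC_SENSE = [0.1, 1.0]  # assumed vector for the idiomatic reading


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def additive_context(words, embeddings):
    """Sum the vectors of the known context words; None if no word is known."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        return None
    return [sum(dims) for dims in zip(*known)]


def classify_token(context_words, embeddings, literal_vec, idiomatic_vec):
    """Label an MWE occurrence by the sense vector closest to its context."""
    context = additive_context(context_words, embeddings)
    if context is None:
        return "unknown"
    literal_score = cosine(context, literal_vec)
    idiomatic_score = cosine(context, idiomatic_vec)
    return "idiomatic" if idiomatic_score > literal_score else "literal"


print(classify_token(["hospital", "illness"], TOY_EMBEDDINGS,
                     LITERAL_SENSE, IDIOMATIC_SENSE))
print(classify_token(["water", "spilled", "floor"], TOY_EMBEDDINGS,
                     LITERAL_SENSE, IDIOMATIC_SENSE))
```

In practice the context vector would more plausibly be fed to a supervised classifier together with other features, rather than compared directly with two hand-picked sense vectors.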
6. Canonical form preferences
Methods for MWE discovery have also used information about the fixedness displayed by some MWEs in comparison with ordinary word combinations (Sag et al. 2002).Footnote k Characteristics like limited lexical and syntactic flexibility (Sag et al. 2002) have been used as indicators in tasks such as MWE discovery and idiomaticity detection. For instance, the expression to make ends meet cannot undergo changes in determiners (*to make some/these/many ends meet), pronominalisation (*make them meet), modification (*to make month ends meet), and so on.
One common strategy to detect fixedness is to generate all variants that would be expected for a given combination of words and verify which of them occur in a very large corpus. The assumption is that absence (or very limited presence) of expected variants is an indication of idiomaticity (Ramisch et al. 2008a; Fazly et al. 2009). These variants can be of two types: lexical and syntactic variants.
Lexical variants can be generated by lexical substitution of the component words using synonyms from resources like WordNet (Pearce 2001; Ramisch et al. 2008a) and inventories of semantic classes (Villavicencio 2005) or using similar words from distributional semantic models. For instance, for nut case variants would include hazelnut case, cashew case, nut briefcase and nut luggage. A possible measure of lexical fixedness (LF) proposed by Fazly et al. (2009) compares how the PMI of a target MWE deviates from the average PMI of possible variants of this target:
$$\mbox{LF}(w_{1}...w_{n}) = \frac{\mbox{PMI}(w_{1}...w_{n}) - \overline{\mbox{PMI}}}{\sigma_{\mbox{PMI}}}$$
where $\overline{\mbox{PMI}}$ is the average over the variants and $\sigma_{\mbox{PMI}}$ is the standard deviation. LF was defined in the context of detecting idiomaticity in VNCs and the variants were obtained from a certain number of close synonyms of the verb and the noun, but it can be adapted to larger n-grams using generalisations of PMI as discussed in Section 4. The reasoning behind using PMI is to avoid the possible confound caused by high-frequency lexical substitutes.
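A minimal Python sketch of this z-score style computation is given below. The function names and the counts are hypothetical, and two details are assumptions of the sketch rather than statements about the original measure: the mean and standard deviation are taken over the variants only, and the sample standard deviation is used.

```python
import math
from statistics import mean, stdev


def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information (base 2) of a bigram from raw counts."""
    if min(pair_count, w1_count, w2_count, total) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2((pair_count * total) / (w1_count * w2_count))


def lexical_fixedness(target_pmi, variant_pmis):
    """Deviation of the target PMI from the variants' mean, in standard deviations."""
    if len(variant_pmis) < 2:
        raise ValueError("at least two variants are needed for a standard deviation")
    sigma = stdev(variant_pmis)
    if sigma == 0:
        return 0.0  # all variants have the same PMI; no scale to compare against
    return (target_pmi - mean(variant_pmis)) / sigma


# Hypothetical PMI values for a target expression and three lexical variants.
target = 8.0
variants = [2.0, 4.0, 3.0]
print(lexical_fixedness(target, variants))
```

A large positive value indicates that the target combination is much more strongly associated than its lexical variants, which is the pattern expected for a lexically fixed expression.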
Syntactic variants can be generated according to regular syntactic rules that apply to a given MWE category, such as passivisation, pluralisation, change of determiners or adverbial modification for verbal MWEs (e.g., ?the bucket was kicked/?kick a bucket/?kick the buckets). Because syntactic variants may contain different numbers of words, it is no longer suitable to compare PMIs. Instead, Fazly et al. (2009) defined a syntactic fixedness (SF) measure based on the probability of occurrence in the corpus of a given syntactic pattern (pt), among a set of m syntactic patterns used to generate the syntactic variants. The proposed fixedness measure is the Kullback–Leibler divergence between the probability distribution for the typical syntactic behaviour $p(pt)$ and the distribution of occurrences of syntactic patterns given that the target n-gram is involved $p(pt | w_{1}...w_{n})$:
$$\mbox{SF}(w_{1}...w_{n}) = \sum_{pt} p(pt | w_{1}...w_{n}) \log \frac{p(pt | w_{1}...w_{n})}{p(pt)}$$
Large values of SF indicate that the target n-gram presents syntactic pattern frequencies that are very different from the typical frequency distribution expected for that kind of n-gram and this is interpreted as a higher degree of syntactic fixedness (Fazly et al. 2009). If the syntactic patterns are approximately uniformly distributed, SF is related to the Entropy of Permutation and Insertion (EPI) proposed by Ramisch et al. (2008b),
$$\mbox{EPI}(w_{1}...w_{n}) = - \sum_{pt} p(pt | w_{1}...w_{n}) \log p(pt | w_{1}...w_{n})$$
since for a uniform $p(pt) = 1/m$ the divergence reduces to $\log m - \mbox{EPI}$.
Nonetheless, EPI can be used in more general contexts. Low values of EPI indicate some degree of fixedness.
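The relation between the two measures can be checked numerically with the Python sketch below. It takes SF to be the Kullback–Leibler divergence of the target-specific pattern distribution from the typical one, and computes the entropy of the target-specific distribution as a stand-in for EPI. The pattern names and counts are invented for illustration, and the sketch assumes that every pattern observed with the target has a non-zero typical probability (in practice some smoothing would be needed).

```python
import math


def normalise(counts):
    """Turn a dictionary of raw counts into a probability distribution."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {pattern: count / total for pattern, count in counts.items()}


def syntactic_fixedness(target_counts, typical_counts):
    """KL divergence of p(pt | target) from the typical distribution p(pt)."""
    p_target = normalise(target_counts)
    p_typical = normalise(typical_counts)
    return sum(
        p * math.log(p / p_typical[pattern])
        for pattern, p in p_target.items()
        if p > 0  # terms with zero probability contribute nothing
    )


def pattern_entropy(target_counts):
    """Entropy of the pattern distribution observed for the target."""
    p_target = normalise(target_counts)
    return sum(p * math.log(1 / p) for p in p_target.values() if p > 0)


# Hypothetical pattern counts: the target almost always occurs in one pattern.
target = {"active": 98, "passive": 1, "plural": 1}
uniform = {"active": 1, "passive": 1, "plural": 1}

sf = syntactic_fixedness(target, uniform)
entropy = pattern_entropy(target)
print(sf, math.log(len(uniform)) - entropy)
```

The two printed values coincide up to rounding, which reflects the fact that against a uniform reference distribution a high divergence and a low entropy carry the same information.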
Similarly, for some types of MWEs, fixedness can be captured by entropic measures of word order such as the Permutation Entropy (PE) (Zhang et al. 2006), defined as
$$\mbox{PE}(w_{1}...w_{n}) = - \sum_{k} p_{k}(w_{1}...w_{n}) \log p_{k}(w_{1}...w_{n})$$
where $p_{k}(w_{1}...w_{n})$ is the probability of occurrence in the corpus of the kth permutation of the n-gram $w_{1} w_{2}... w_{n}$. PE is also indirectly related to the association strength of the components of a candidate, since if there is no special association between words, the probability of them appearing in multiple orders should be similar, leading to high PE values (Villavicencio et al. 2007). One of the advantages of using PE as an association measure is that it can be applied to MWEs of arbitrarily large sizes, without the need to be redefined.
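As an illustration, the Python sketch below estimates PE from the corpus counts of every distinct permutation of an n-gram. The `count_fn` callback, which is expected to return the corpus frequency of a given word order, is a hypothetical interface introduced for this example, and the counts used are invented.

```python
import math
from itertools import permutations


def permutation_entropy(ngram, count_fn):
    """Entropy of the distribution of corpus counts over word orders of `ngram`."""
    orders = set(permutations(ngram))  # set() removes duplicates from repeated words
    counts = [count_fn(order) for order in orders]
    total = sum(counts)
    if total == 0:
        raise ValueError("no permutation of the n-gram was observed")
    probabilities = [c / total for c in counts if c > 0]
    return sum(p * math.log(1 / p) for p in probabilities)


# Hypothetical counts: only the canonical order of the expression is attested.
FIXED_COUNTS = {("fish", "and", "chips"): 100}


def fixed_counter(order):
    return FIXED_COUNTS.get(order, 0)


def free_counter(order):
    return 5  # every order is equally frequent


print(permutation_entropy(("fish", "and", "chips"), fixed_counter))
print(permutation_entropy(("fish", "and", "chips"), free_counter))
```

A completely rigid expression yields an entropy of zero, while a combination whose six orders are equally frequent reaches the maximum of log 6. Since the number of permutations grows factorially with n, counting every order is only practical for short n-grams.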
If an MWE candidate passes a criterion for fixedness (a rigid adherence to a canonical form) based on the measures described in this section, it is very likely an idiomatic MWE. Therefore, fixedness is an informative score for MWE discovery.
Fixedness has also been incorporated in methods for detecting token idiomaticity, such as those discussed in Section 5.2. The assumption is that when the idiomatic sense is used it tends to occur in the canonical form of the MWE, while the literal sense is less rigid and may occur in more patterns (Fazly et al. 2009). Fazly et al. (2009) propose a method based on canonical forms learned automatically from corpora, where distributional vectors for canonical and non-canonical forms are learned and then an MWE token is classified as idiomatic if it is closer to the canonical form vectors. Methods that incorporate both information about the canonical form of an MWE and distributional information about its sentential contexts (Section 5.2) have found them to be complementary and outperform models that use only one of them (Fazly et al. 2009; King and Cook 2018).
7. Multilingual preferences
Idiomatic MWEs resist word-for-word translation, often generating unnatural, nonsensical or incorrect translations (e.g., o fim da picada in Portuguese, lit. the end of the bridle path meaning something unacceptable). When parallel resources are available, this lack of direct translatability can be measured using information such as asymmetries in word alignments between source and target languages (Melamed 1997; Caseli et al. 2010; Attia et al. 2010; Tsvetkov and Wintner 2012).
The degree of idiomaticity of an MWE has also been calculated from the overlap between the translation of an MWE and the translations of its component words. Moreover, the translations for the MWE and for each of its component words can also be compared using string distance metrics that can help to account for any inflectional differences between them and determine whether the translations share a substring (Salehi et al. 2014). For instance, the translation for public into Persian is contained in the translation for public service. These string similarity measures have been found to lead to better results for MWE idiomaticity detection when combined with information from distributional similarity models of the source and target language (Salehi et al. 2018).
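One way to make the substring comparison concrete is sketched below in Python, using the longest common substring between the translation of each component word and the translation of the whole expression. The scoring function, the normalisation by the shorter string and the averaging over components are assumptions of this sketch, and the German translations were chosen here purely for illustration; they are not taken from the cited work.

```python
from difflib import SequenceMatcher


def lcs_similarity(a, b):
    """Length of the longest common substring, relative to the shorter string."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / min(len(a), len(b))


def translation_overlap(mwe_translation, component_translations):
    """Average similarity between each component's translation and the MWE's."""
    if not component_translations:
        raise ValueError("at least one component translation is required")
    scores = [lcs_similarity(t, mwe_translation) for t in component_translations]
    return sum(scores) / len(scores)


# Illustrative English-to-German translations (assumed for this example).
print(translation_overlap("öffentlicher Dienst", ["öffentlich", "Dienst"]))
print(translation_overlap("Ablenkungsmanöver", ["rot", "Hering"]))
```

The first call scores the compositional public service, whose German translation contains the translations of both parts despite the inflectional ending. The second scores the idiomatic red herring, whose translation shares only short accidental substrings with those of red and herring, so its score is much lower.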
8. MWE resources
Evaluation of MWE discovery methods can be performed intrinsically or extrinsically. In intrinsic evaluation, the results produced by a model are compared to a gold standard, usually a dictionary, electronic resource or dataset where MWEs have been manually curated using expert annotations from linguists or lexicographers, or collected via crowdsourcing. While expert annotation provides high-quality and robust judgements, it is usually costly and time-consuming to obtain. Crowdsourcing provides a faster way of gathering judgements, usually from large groups of non-experts in order to reduce the impact of subjectivity on the scores. In extrinsic evaluation, the results produced are incorporated in an NLP application such as machine translation or text simplification, with the expectation that the quality of the MWE resource will be reflected in the performance of the task. However, the results may be influenced by the particular integration of the information into the application. In this section, we list some of the resources that have been used for intrinsic evaluation of MWE tasks; further discussion about extrinsic evaluations can be found in Constant et al. (2017). In particular, we focus on some of the main corpora that have been annotated with MWEs, as well as datasets containing human judgements about MWE properties.
Annotated corpora
The largest initiative in terms of language diversity is the PARSEME project (Savary et al. 2015), which resulted in the creation of corpora for around 20 languages (Ramisch et al. 2018) containing annotations of verbal MWEs.Footnote l
The Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions (STREUSLE) (Schneider and Smith 2015) provides comprehensive manual annotations of MWEs and of noun and verb semantic supersenses in a corpus of online reviews in English.Footnote m
Detecting Minimal Semantic Units and their Meanings shared task data (DIMSUM) extended the STREUSLE corpus with additional domains and resulted in a comprehensive annotation of MWEs in running text for English (Schneider et al. 2016). The corpus contains over 90,000 words and 5,000 MWEs.Footnote n
The VNC-Tokens dataset (Cook et al. 2008) contains 2,984 sentences from the British National Corpus that contain VNCs, marked according to whether their sense is idiomatic, literal or unclear, with up to 100 sentences for each of 53 different combinations.Footnote o
For detecting compositionality in context, Korkontzelos et al. (2013) produced annotations for the occurrences in context of target phrases, like old school, with a figurative or literal meaning in 4,350 sentences from the WaCky corpus.Footnote p
Datasets
The English Compound Noun Compositionality Dataset (ECNC) (Reddy et al. 2011) contains crowdsourced judgements about the degree of compositionality for a set of 90 English noun–noun (e.g., zebra crossing) and adjective–noun (e.g., sacred cow) compounds. For each compound an average of 30 judgements were collected for 3 numerical scores: the degree to which the first word contributes to the meaning of the compound (e.g., zebra to zebra crossing), the same for the second word (e.g., crossing to zebra crossing) and the degree to which the compound can be compositionally constructed from its parts. A Likert scale from 0 (most idiomatic) to 5 (most compositional) was used.Footnote q
The Noun Compositionality Dataset (Ramisch et al. 2016; Cordeiro et al. 2019) uses the same protocol as Reddy et al. (2011) and extends the ECNC with judgements collected from native speakers for 190 new compounds for English, and 180 compounds for two additional languages, French and Portuguese. Additionally, for Portuguese, the annotations were extended to include lexical substitution candidates for each of the compounds, resulting in the Lexical Substitution of Nominal Compounds Dataset (LexSubNC) (Wilkens et al. 2017).Footnote r
The Dataset of English Noun Compounds Annotated with Judgments on Non-Compositionality and Conventionalization (Farahmand, Smith and Nivre 2015; Yazdani, Farahmand and Henderson 2015) provides judgements for 1,042 English noun–noun compounds. Each compound has two binary judgements by four expert annotators, both native and non-native speakers: one for its compositionality and one for its conventionalisation.Footnote s
The Norwegian Blue Parrot Dataset (Kruszewski and Baroni 2014) has judgements for modifier-head phrases in English. These include annotations about the phrase being an instance of the concept denoted by the head (e.g., dead parrot and parrot) or a member of the more general concept that includes the head (e.g., dead parrot and pet), along with typicality ratings.Footnote t
The German Noun-Noun Compound Dataset (Roller, Schulte im Walde and Scheible 2013) contains judgements for a set of 244 German compounds using a compositionality scale from 1 to 7. Each compound has an average of around 30 judgements obtained through crowdsourcing. This resource has also been enriched with feature norms (Roller and Schulte im Walde 2014).Footnote u
A Representative Gold Standard of German Noun-Noun Compounds (Ghost-NN) (Schulte im Walde et al. 2016) includes human judgements for 868 German noun–noun compounds about their compositionality, corpus frequency, productivity and ambiguity. The annotations were performed by the authors, linguists and through crowdsourcing. A subset of 180 compounds has been selected for balancing these variables and for these the annotations were done only by experts.Footnote v
Other collections containing MWEs include the SemEval datasets for keyphrase extraction (Kim et al. 2010) and for noun compound interpretation (Nakov 2008; Hendrickx et al. 2013; Butnariu et al. 2009), MWE-aware treebanks (Rosén et al. 2015), MWE listsFootnote w as well as lexical resources (Losnegaard et al. 2016).
9. Conclusions
MWEs are complicated, unruly, unpredictable and difficult. They are the telltale sign of non-native speakers and are one big stumbling block for many applications to achieve a more natural and precise handling of human language. Whole decades of research have been devoted to them, and their behaviour still defies attempts to fully capture them. However, they are also a frequent, informal and very efficient communicative device to transmit whole complex concepts in a conventional manner, and in the words of Fillmore, Kay and O’Connor (1988), "the realm of idiomaticity in a language includes a great deal that is productive, highly structured and worthy of serious grammatical investigation". In this paper, we provided an overview of research on computational modelling of MWEs, revisiting some representative methods for MWE discovery. We concentrated, in particular, on methods for the detection of word combinations that qualify as MWEs, and that identify some of their characteristics, like their degree of fixedness and idiomaticity.
However, this paper only scratches the surface of MWE research, and additional discussions can be found in Constant et al. (2017), Ramisch and Villavicencio (2018) and Pastor and Colson (2019). Moreover, progress in related areas is paving the way for a better understanding of how people learn, store and process MWEs, and for the development of computational approaches for dealing with them. For instance, advances in word representations have brought new possibilities for MWE research. In particular, crosslingual word embeddings (Søgaard et al. 2019) provide fertile grounds for the exploration of multilingual asymmetries linked to idiomaticity, while richer contextually aware word representation models like ELMo (Peters et al. 2018) can be incorporated in methods for token idiomaticity detection.
One possible source of clues about how to improve MWE processing comes from studies of how the brain performs the task. Experimental studies dedicated to investigating how humans process language are growing in number and involve a series of increasingly sophisticated techniques for measuring brain activity. The focus is to understand with increasing accuracy which brain regions are used in language processing and how their interactions vary temporally and spatially with linguistic complexity. These studies can provide clues about how MWEs are stored and processed by the human brain. The use of eye-tracking information has already brought benefits for tasks like part-of-speech tagging (Barrett et al. 2016; Barrett et al. 2018). MWEs have been found to have faster processing times compared to non-MWEs (compositional novel sequences) and these effects have been found in research using both eye-tracking and EEG (Siyanova-Chanturia 2013). Investigations of the use of gaze features from the GECO corpus (Cop et al. 2017) produced promising results in tasks like discovery (Rohanian et al. 2017), and further advances are expected with increasing availability of larger collections of eye-tracking data. There is still a large gap that has to be overcome to connect the algorithms we develop for NLP and the algorithm actually used by the brain. The hope is that the gap will close soon. MWEs are here to stay and for the foreseeable future will still be in the limelight of research.
Acknowledgements
This work has been partly supported by CNPq (projects 423843/2016-8 and 312114/2015-0) and ESRC HRBDT.