1. Introduction
This squib presents an initial exploration of a relatively new measure for detecting the directionality of collocations in a corpus, using it to investigate whether different languages show a preference for different directions of collocation.
1.1 Directional Collocation
Within the Firthian tradition in linguistics, and especially since the publication of Sinclair (1991), collocation is viewed as “an integral aspect of linguistic theory” (Barnbrook et al. 2013: 35), yet collocation is largely overlooked in many schools of linguistics. This squib implements a relatively recent measure of directional collocation in corpora of eight different languages to see if there are issues worthy of deeper investigation.
There are two main approaches to investigating collocations. The first is the phraseological (Brown 2014) or intensional (Evert 2005) approach, which treats collocations as falling in the middle of a continuum from idioms to free combinations. Within this approach, collocations may be required to have a non-literal meaning, word spans to identify collocations can be up to four words left and right of the node word, and the identification of collocations is often restricted to combinations of nouns, verbs, adjectives, and adverbs. The second is the frequency-based (Brown 2014) or distributional (Evert 2005) approach, which views collocations as relatively frequent co-occurrences of two words. Under this approach, no constraints on meaning or word types are imposed in identifying collocations, and the word span is usually one word left or right of the node word. In this squib, I take the frequency-based approach associated with corpus linguistics; the key to identifying a collocation is “the extent to which the items appear together more often than we would expect given their individual frequencies” (Brown 2014: 125).
Nearly all previous work on collocation has involved identifying pairs of co-occurring words without considering “whether word1 is more predictive of word2 or the other way round” (Gries 2013: 141). In his original work on collocation in English, Sinclair (1991) distinguished between upward and downward collocations. In upward collocation, the collocate is a more frequent word than the node, and in downward collocation, the collocate is less frequent. This distinction is important, since upward collocation usually highlights grammatical frames, whereas downward collocation highlights semantic issues. An alternative terminology was suggested by Kjellmer (1991), who introduced right-predictive collocations such as Pyrrhic victory, where the first word predicts the second but not the other way round, and left-predictive collocations such as deadly nightshade, where the second word predicts the first (see Michelbacher et al. 2011).
1.2 Measures of Directional Collocation
The standard measures of collocation, such as Mutual Information and z-scores, make no distinction between word1 and word2, treating collocations as symmetrical. Thus the asymmetric nature of many collocations has largely been ignored (the only major exception being the work of Michelbacher et al. 2007, 2011 on directional associations). A few directional measures of collocation were suggested but these were all problematic in some way, and it is only with Gries’ (2013) introduction of the ΔP measure that a usable and valid measure has become available. Gries defines ΔP as in (1).
(1) ΔP = p(outcome | cue) − p(outcome | ¬cue)
In other words, ΔP is the probability of a word being present given the presence of another word minus the probability of the same word being present without the other word. This allows us to distinguish between right-predictive and left-predictive collocations. A right-predictive collocation will be indicated by a high value for:
ΔP2|1 = p(word2 | word1) − p(word2 | ¬word1)
A left-predictive collocation will be indicated by a high value for:
ΔP1|2 = p(word1 | word2) − p(word1 | ¬word2)
An example will show how this works. The words of and course collocate in English in phrases like of course and in the course of with an expectation that the left-predictive collocation (of course) will dominate. The frequencies of of and course in the English corpus used in this study are given in Table 1.
For ΔP1|2, where course (word2) is the cue and of (word1) is the outcome, the first probability is the frequency of both words being present divided by the total frequency of course; the second probability is the frequency of of without course divided by the total frequency where course is absent (i.e., (273/1140) − (55197/1872038) = 0.210). For ΔP2|1, where of is the cue and course is the outcome, the first probability is the frequency of both words being present divided by the total frequency of of; the second probability is the frequency of course without of divided by the total frequency where of is absent (i.e., (273/55470) − (867/1817708) = 0.004). From this we can see that the left-predictive association, in which course predicts of, is much stronger than the right-predictive association, in which of predicts course.
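The arithmetic in this example can be checked with a short script. The following is a minimal sketch only: the function name `delta_p` and the variable names are illustrative, and the count for the cell in which neither word occurs is derived from the totals quoted in the text rather than read from Table 1.

```python
def delta_p(both, cue_only, outcome_only, neither):
    """Return p(outcome | cue) - p(outcome | no cue) from a 2x2 table of counts."""
    p_outcome_given_cue = both / (both + cue_only)
    p_outcome_given_no_cue = outcome_only / (outcome_only + neither)
    return p_outcome_given_cue - p_outcome_given_no_cue

# Cell counts implied by the figures quoted in the text.
both = 273                        # "of" and "course" both present
course_only = 867                 # "course" without "of"
of_only = 55197                   # "of" without "course"
neither = 1817708 - course_only   # derived: total without "of", minus "course" only

# Cue = "course", outcome = "of": (273/1140) - (55197/1872038)
dp_of_given_course = delta_p(both, course_only, of_only, neither)

# Cue = "of", outcome = "course": (273/55470) - (867/1817708)
dp_course_given_of = delta_p(both, of_only, course_only, neither)

print(round(dp_of_given_course, 3))   # 0.21
print(round(dp_course_given_of, 3))   # 0.004
```

Naming the two results by cue and outcome, rather than by subscript, keeps it explicit which word is doing the predicting in each calculation.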
ΔP values range from −1 (where the presence of the cue reduces the likelihood of the outcome, e.g., they has) to +1 (where the presence of the cue makes the outcome more likely). In most analyses, collocation measures are applied to those collocations that exist in the corpus being analyzed. Non-occurring pairs are not normally considered. For this reason, negative ΔP values will be much rarer than positive ΔP values. A clear direction of collocation can be found by calculating ΔP2|1 − ΔP1|2 (if positive, this shows a right-predictive collocation; if negative, a left-predictive collocation). Desagulier (2015) provides a clear, detailed explanation of how to interpret ΔP values.
2. Focus of Research
As a relatively recent measure, ΔP is yet to be widely used and nearly all applications are to English. It is not clear whether the directions of collocations of English are typical of most languages, or whether different languages have different directional collocation patterns. For instance, in one language most strong collocations might be right-predictive, whereas in another language they might be left-predictive. The purpose of this squib is to conduct a preliminary analysis of directional collocations in several languages to see if this produces any findings that warrant more detailed investigation.
3. The Analysis
To identify patterns of directional collocations, corpora of several languages are needed. Corpora built on the same principles for numerous languages can be found at the Leipzig Corpora Collection (<http://corpora2.informatik.uni-leipzig.de/download.html>; see Goldhahn et al. 2012). The following criteria guided the selection of corpora: the language must be a left-to-right alphabetic language with words separated by spaces, a range of languages falling into different language families should be chosen, and corpus size should be at least 1 million words. Using these criteria, corpora consisting of 100,000 sentences taken from the Internet for eight languages were used. The languages are English, German, Italian, and Russian (all Indo-European), Finnish (Uralic), Maltese (Afroasiatic), Indonesian (Austronesian), and Basque (a language isolate). Some potentially relevant typological features of these languages are given in Table 2, based on the World Atlas of Language Structures (Dryer and Haspelmath 2013). Although these corpora are not ideal, given that they are constructed solely from Internet data, they should nonetheless allow us to conduct a preliminary analysis.
An online program for calculating ΔP values from a corpus was created using word forms as the input (<http://jira.org/dp/>), and the ΔP values for all immediate collocations with a minimum frequency of 10 were calculated (see footnote 1). Various analyses (detailed below) were then conducted to see whether any languages exhibited a preference for either right- or left-predictive collocations. The results were statistically analyzed using chi-square and Mann-Whitney U as appropriate to see if the differences between right-predictive and left-predictive collocations were significant in a given language. Given the number of comparisons made, a level of significance of p < 0.001 was used to avoid Type I errors.
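To make the procedure concrete, the sketch below computes both ΔP values for every adjacent word pair that meets a minimum frequency. It is not the online program described above; it is an illustration under stated assumptions. In particular, the function name `bigram_delta_p`, the toy corpus, and the decision to build each 2×2 table from counts of adjacent pairs (rather than from overall word frequencies) are all assumptions made for this example.

```python
from collections import Counter

def bigram_delta_p(sentences, min_freq=10):
    """Return {(w1, w2): (dp_2_given_1, dp_1_given_2)} for adjacent word pairs.

    `sentences` is an iterable of token lists (word forms). Assumption: the
    2x2 table for a pair is built from counts of adjacent pairs in the corpus.
    """
    pair_counts = Counter()
    first_counts = Counter()    # times a word fills the first slot of a pair
    second_counts = Counter()   # times a word fills the second slot of a pair
    for tokens in sentences:
        for w1, w2 in zip(tokens, tokens[1:]):
            pair_counts[(w1, w2)] += 1
            first_counts[w1] += 1
            second_counts[w2] += 1
    total = sum(pair_counts.values())

    results = {}
    for (w1, w2), a in pair_counts.items():
        if a < min_freq:
            continue
        b = first_counts[w1] - a     # w1 followed by a word other than w2
        c = second_counts[w2] - a    # w2 preceded by a word other than w1
        d = total - a - b - c        # pairs with neither w1 first nor w2 second
        # Guard against empty margins in very small corpora.
        dp_2_given_1 = a / (a + b) - (c / (c + d) if c + d else 0.0)
        dp_1_given_2 = a / (a + c) - (b / (b + d) if b + d else 0.0)
        results[(w1, w2)] = (dp_2_given_1, dp_1_given_2)
    return results

# Toy corpus (hypothetical data, for illustration only).
sentences = [["of", "course"]] * 12 + [["of", "the"]] * 48 + [["in", "the"]] * 40
results = bigram_delta_p(sentences, min_freq=10)
dp_2_given_1, dp_1_given_2 = results[("of", "course")]
print(round(dp_2_given_1, 3), round(dp_1_given_2, 3))   # 0.2 0.455
```

In this toy corpus course is always preceded by of, whereas of is usually followed by something else, so ΔP1|2 exceeds ΔP2|1 for the pair.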
4. The Results
The first result concerns the numbers of immediate collocations with a minimum frequency of 10 in each language; this is shown in Table 3.
The eight corpora are of similar size, but there is some variation in the number of collocations identified. This appears to reflect the extent to which a language is synthetic, since more synthetic languages have a greater variety of word forms, giving rise to fewer common collocations (see Stengers et al. 2011).
Focusing on the 1000 collocations with the highest ΔP values, we investigated whether they tend to be more right- or more left-predictive; the counts for these are shown in Table 4.
Interestingly, all languages have more left-predictive strong collocations (although for English, for example, the difference is negligible), with five of the languages showing a clear preference for left-predictive collocations.
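One way to test whether such a split differs from an even division is a chi-square goodness-of-fit test with one degree of freedom. The sketch below is an assumption about the form of the test, since the text specifies only that chi-square was used "as appropriate"; the function name and the counts are hypothetical and are not taken from Table 4.

```python
import math

def chi_square_equal_split(left_count, right_count):
    """Chi-square goodness-of-fit test against a 50/50 split (1 degree of freedom)."""
    expected = (left_count + right_count) / 2
    stat = ((left_count - expected) ** 2 + (right_count - expected) ** 2) / expected
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts out of 1000 strong collocations.
stat_clear, p_clear = chi_square_equal_split(600, 400)
stat_slight, p_slight = chi_square_equal_split(510, 490)

print(stat_clear, p_clear < 0.001)     # 40.0 True
print(p_slight < 0.001)                # False
```

With the p < 0.001 threshold, a 600/400 split would count as a clear preference, whereas a 510/490 split would not.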
Separating the left-predictive (i.e., ΔP1|2) and the right-predictive (i.e., ΔP2|1) collocations, we calculated the average ΔP values for the 100 and 500 strongest collocations, as shown in Table 5.
Treating the probabilities as rates of occurrence, to find the average probability we used the harmonic mean (the number of items divided by the sum of the reciprocals). Again, most languages show a clear preference for left-predictive collocations, while Indonesian is neutral, and English shows a preference for right-predictive collocations.
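The harmonic mean described here can be sketched as follows. The ΔP values in the example are hypothetical, and the helper name `harmonic_mean_manual` is illustrative; Python's standard library provides the same calculation as `statistics.harmonic_mean`.

```python
from statistics import harmonic_mean

def harmonic_mean_manual(values):
    """Number of items divided by the sum of their reciprocals."""
    return len(values) / sum(1 / v for v in values)

# Hypothetical ΔP values for three strong collocations.
values = [0.9, 0.6, 0.3]

manual = harmonic_mean_manual(values)
library = harmonic_mean(values)
print(round(manual, 4))   # 0.4909
```

The harmonic mean is defined only for positive values, a condition that the strongest collocations satisfy; it is also pulled toward the smaller values more than the arithmetic mean (0.6 for these three figures).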
Finally, we examined those collocations that are unidirectional. For left-predictive collocations, this involved calculating ΔP1|2 - ΔP2|1 (and vice versa for right-predictive). We then counted the number of collocations with ΔP value differences above certain thresholds; the findings are shown in Table 6. As with the previous analyses, German, Italian, and Maltese show a preference for left-predictive collocations. English, on the other hand, has a preference for right-predictive collocations.
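The counting step can be sketched as below. The threshold values and the ΔP pairs are hypothetical (the text does not list the thresholds here), and the function name `count_unidirectional` is illustrative.

```python
def count_unidirectional(pairs, thresholds=(0.5, 0.7, 0.9)):
    """Count collocations whose directional difference exceeds each threshold.

    `pairs` is an iterable of (dp_2_given_1, dp_1_given_2) tuples.
    The default thresholds are hypothetical values chosen for illustration.
    """
    counts = {t: {"left": 0, "right": 0} for t in thresholds}
    for dp_2_given_1, dp_1_given_2 in pairs:
        diff = dp_2_given_1 - dp_1_given_2   # > 0 right-predictive, < 0 left-predictive
        for t in thresholds:
            if diff > t:
                counts[t]["right"] += 1
            elif -diff > t:
                counts[t]["left"] += 1
    return counts

# Hypothetical ΔP values for four collocations.
pairs = [(0.95, 0.01), (0.02, 0.80), (0.05, 0.99), (0.40, 0.35)]
counts = count_unidirectional(pairs)
print(counts[0.5])   # {'left': 2, 'right': 1}
print(counts[0.9])   # {'left': 1, 'right': 1}
```

A pair such as (0.40, 0.35) is associated in both directions to a similar degree, so it is not counted as unidirectional at any of these thresholds.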
To illustrate what these numbers involve, the top 20 unidirectional collocations in English are listed in Table 7. It is noticeable that these include only two proper nouns (proper nouns are highly likely to be involved in unidirectional collocations) and that 15 of the collocations include a preposition (a similar pattern is also found for German).
5. Discussion
This is a preliminary, speculative study aiming to see whether applying a largely unused measure can lead to insights worthy of detailed investigation. From examining eight languages, it does appear that different languages manifest directional collocation in different ways. Focusing on those points where statistical tests were used, we can summarize the dominant directions of collocations in the eight languages as in Figure 1, which shows that most languages have a clear preference for left-predictive collocations.
Comparing the directional preferences with the typological features of the eight languages listed in Table 2, the only feature that seems related to the direction of collocation is whether the language is analytic (neutral or right-predictive) or synthetic (left-predictive). It is unclear why this correlation might exist.
For the other typological features, no close relationship with preferred direction of collocation is apparent. This is perhaps highlighted most clearly by adposition types in English and German. The majority of the top 100 directional collocations in both languages include adpositions, and both languages use prepositions with noun phrases, yet in the top 100 directional collocations which include prepositions, 53 of 63 are left-predictive in German and 65 of 76 are right-predictive in English, reflecting the overall directional preference of each language. In the other languages, adpositions are far less common in the top 100 directional collocations. For example, in Italian only 28 of the top 100 collocations include prepositions. Whether other paired sequences of parts of speech are prevalent in the strongly directional collocations in other languages is unclear. Directional collocation analysis using tagged corpora may help to answer this question.
One potential problem emerging from the findings presented here is that English has a preference different from that of the majority of the languages. As mentioned earlier, nearly all previous work on directional collocations has focused on English. This emphasis on English is symptomatic of research in several areas of linguistics; a quick Google Scholar search finds that English is the most researched language in reading research, natural language processing, pragmatics, and lexis. If English is an outlier among languages (as might be the case for direction of collocations), then the emphasis on English as the focus of research is worrisome.
The findings show that different languages do have different preferences for direction of collocations, and that these preferences are realized differently in the various languages. These results raise many questions. Why are most languages left-predictive? Why are prepositions so common in strongly directional collocations in German and English but not in other languages? Why does Indonesian have no clear preference? Is English an outlier language? This squib makes no attempt to answer such questions; rather, it shows that using a ΔP analysis can help to highlight issues that may be worthy of further consideration.