Exploring the direction of collocations in eight languages

Richard Watson Todd

doi:10.1017/cnj.2018.28

Published online by Cambridge University Press: 24 September 2018

Affiliation: King Mongkut's University of Technology Thonburi
Email: irictodd@kmutt.ac.th

Type: Squib/Notule
Information: Canadian Journal of Linguistics/Revue canadienne de linguistique, Volume 64, Issue 1, March 2019, pp. 146–154

DOI: https://doi.org/10.1017/cnj.2018.28
Copyright: © Canadian Linguistic Association/Association canadienne de linguistique 2018

1. Introduction

This squib presents an initial exploration of a relatively new measure for detecting the directionality of collocations in a corpus, used here to investigate whether different languages show a preference for different directions of collocation.

1.1 Directional Collocation

Within the Firthian tradition in linguistics, and especially since the publication of Sinclair (1991), collocation is viewed as “an integral aspect of linguistic theory” (Barnbrook et al. 2013: 35), yet collocation is largely overlooked in many schools of linguistics. This squib applies a relatively recent measure of directional collocation to corpora of eight different languages to see if there are issues worthy of deeper investigation.

There are two main approaches to investigating collocations. The first is the phraseological (Brown 2014) or intensional (Evert 2005) approach, which treats collocations as falling in the middle of a continuum from idioms to free combinations. Within this approach, collocations may be required to have a non-literal meaning, word spans to identify collocations can be up to four words left and right of the node word, and the identification of collocations is often restricted to combinations of nouns, verbs, adjectives, and adverbs. The second is the frequency-based (Brown 2014) or distributional (Evert 2005) approach, which views collocations as relatively frequent co-occurrences of two words. No constraints on meaning or word types are made in identifying collocations, and the word span is usually one word left or right of the node word. In this squib, I take the frequency-based approach associated with corpus linguistics; the key to identifying a collocation is “the extent to which the items appear together more often than we would expect given their individual frequencies” (Brown 2014: 125).

Nearly all previous work on collocation has involved identifying pairs of co-occurring words without considering “whether word1 is more predictive of word2 or the other way round” (Gries 2013: 141). In his original work on collocation in English, Sinclair (1991) distinguished between upward and downward collocations. In upward collocation, the collocate is a more frequent word than the node, and in downward collocation, the collocate is less frequent. This distinction is important, since upward collocation usually highlights grammatical frames, whereas downward collocation highlights semantic issues. An alternative terminology was suggested by Kjellmer (1991), who introduced right-predictive collocations such as Pyrrhic victory, where the first word predicts the second but not the other way round, and left-predictive collocations such as deadly nightshade, where the second word predicts the first (see Michelbacher et al. 2011).

1.2 Measures of Directional Collocation

The standard measures of collocation, such as Mutual Information and z-scores, make no distinction between word1 and word2, treating collocations as symmetrical. Thus the asymmetric nature of many collocations has largely been ignored (the only major exception being the work of Michelbacher et al. 2007, 2011 on directional associations). A few directional measures of collocation had been suggested, but these were all problematic in some way, and it is only with Gries’ (2013) introduction of the ΔP measure that a usable and valid measure has become available. Gries defines ΔP as in (1).

$$(1) \quad \Delta P = p(\text{outcome} \mid \text{cue} = \text{present}) - p(\text{outcome} \mid \text{cue} = \text{absent})$$

In other words, ΔP is the probability of a word being present given the presence of another word minus the probability of the same word being present without the other word. This allows us to distinguish between right-predictive and left-predictive collocations. A right-predictive collocation will be indicated by a high value for:

$$(2) \quad \Delta P_{2 \mid 1} = p(\text{word}_2 \mid \text{word}_1 = \text{present}) - p(\text{word}_2 \mid \text{word}_1 = \text{absent})$$

A left-predictive collocation will be indicated by a high value for:

$$(3) \quad \Delta P_{1 \mid 2} = p(\text{word}_1 \mid \text{word}_2 = \text{present}) - p(\text{word}_1 \mid \text{word}_2 = \text{absent})$$

An example will show how this works. The words of and course collocate in English in phrases like of course and in the course of with an expectation that the left-predictive collocation (of course) will dominate. The frequencies of of and course in the English corpus used in this study are given in Table 1.

Table 1: Frequencies of of and course for calculating ΔP

For ΔP_2|1 (of as the cue, course as the outcome), the first probability is the frequency of both words being present divided by the total frequency of of; the second probability is the frequency of course without of divided by the total frequency where of is absent (i.e., (273/55470) − (867/1817708) = 0.004). For ΔP_1|2 (course as the cue, of as the outcome), the first probability is the frequency of both words being present divided by the total frequency of course; the second probability is the frequency of of without course divided by the total frequency where course is absent (i.e., (273/1140) − (55197/1872038) = 0.210). From this we can see that the left-predictive reading of of course (course predicting a preceding of) is much stronger than the right-predictive reading (of predicting a following course).
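The arithmetic of this example can be checked with a short Python sketch. The function name `delta_p` and its parameter names are illustrative assumptions, not part of any published tool, and the corpus size of 1,873,178 is not stated in the text but derived from the quoted figures (1,817,708 positions without of plus 55,470 with it).

```python
def delta_p(both, cue_total, outcome_total, corpus_size):
    """Return p(outcome | cue present) - p(outcome | cue absent)."""
    p_with_cue = both / cue_total
    p_without_cue = (outcome_total - both) / (corpus_size - cue_total)
    return p_with_cue - p_without_cue

# Frequencies quoted in the worked example.
BOTH = 273             # "of" and "course" together
OF_TOTAL = 55470       # all occurrences of "of"
COURSE_TOTAL = 1140    # all occurrences of "course"
CORPUS_SIZE = 1873178  # derived: 1817708 + 55470 (equivalently 1872038 + 1140)

# Cue "of", outcome "course": does "of" predict a following "course"?
dp_course_given_of = delta_p(BOTH, OF_TOTAL, COURSE_TOTAL, CORPUS_SIZE)
# Cue "course", outcome "of": does "course" predict a preceding "of"?
dp_of_given_course = delta_p(BOTH, COURSE_TOTAL, OF_TOTAL, CORPUS_SIZE)

print(f"{dp_course_given_of:.3f} {dp_of_given_course:.3f}")
```

With these figures, taking course as the cue gives a value of about 0.210, while taking of as the cue gives only about 0.004, so course predicts of far better than of predicts course.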

ΔP values range from −1 (where the presence of the cue reduces the likelihood of the outcome, e.g., they has) to +1 (where the presence of the cue makes the outcome more likely). In most analyses, collocation measures are applied to those collocations that exist in the corpus being analyzed. Non-occurring pairs are not normally considered. For this reason, negative ΔP values will be much rarer than positive ΔP values. A clear direction of collocation can be found by calculating ΔP_2|1 − ΔP_1|2 (if positive, this shows a right-predictive collocation; if negative, a left-predictive collocation). Desagulier (2015) provides a clear, detailed explanation of how to interpret ΔP values.

2. Focus of Research

As a relatively recent measure, ΔP is yet to be widely used and nearly all applications are to English. It is not clear whether the directions of collocations of English are typical of most languages, or whether different languages have different directional collocation patterns. For instance, in one language most strong collocations might be right-predictive, whereas in another language they might be left-predictive. The purpose of this squib is to conduct a preliminary analysis of directional collocations in several languages to see if this produces any findings that warrant more detailed investigation.

3. The Analysis

To identify patterns of directional collocations, corpora of several languages are needed. Corpora built on the same principles for numerous languages can be found at the Leipzig Corpora Collection (<http://corpora2.informatik.uni-leipzig.de/download.html>; see Goldhahn et al. 2012). The following criteria guided the selection of corpora: the language must be a left-to-right alphabetic language with words separated by spaces, a range of languages falling into different language families should be chosen, and corpus size should be at least 1 million words. Using these criteria, corpora consisting of 100,000 sentences taken from the Internet for eight languages were used. The languages are English, German, Italian, and Russian (all Indo-European), Finnish (Uralic), Maltese (Afroasiatic), Indonesian (Austronesian), and Basque (a language isolate). Some potentially relevant typological features of these languages are given in Table 2, based on the World Atlas of Language Structures (Dryer and Haspelmath 2013). Although these corpora are not ideal, given that they are constructed solely from Internet data, they should nonetheless allow us to conduct a preliminary analysis.

Table 2: Typological features of the eight languages

An online program for calculating ΔP values from a corpus was created using word forms as the input <http://jira.org/dp/>, and the ΔP values for all immediate collocations with a minimum frequency of 10 were calculated.¹ Various analyses (detailed below) were then conducted to see whether any languages exhibited a preference for either right- or left-predictive collocations. The results were statistically analyzed using chi-square and Mann-Whitney U as appropriate to see if the differences between right-predictive and left-predictive collocations were significant in a given language. Given the number of comparisons made, a level of significance of p < 0.001 was used to reduce the risk of Type I errors.
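The online program itself is not described in detail, so the following Python sketch only illustrates one way such a calculation could be set up. It is not the program used in the study. The function name `bigram_delta_p`, the treatment of the corpus as a single token stream (ignoring sentence boundaries), and the use of adjacent-pair positions as the corpus size are all assumptions made for illustration.

```python
from collections import Counter

def bigram_delta_p(tokens, min_freq=10):
    """Map each frequent adjacent pair (w1, w2) to (dP_2|1, dP_1|2)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    first_counts = Counter(tokens[:-1])   # how often a word fills the left slot
    second_counts = Counter(tokens[1:])   # how often a word fills the right slot
    n = len(tokens) - 1                   # number of adjacent-pair positions

    scores = {}
    for (w1, w2), both in pair_counts.items():
        if both < min_freq:
            continue  # keep only pairs meeting the minimum frequency
        f1, f2 = first_counts[w1], second_counts[w2]
        if n == f1 or n == f2:
            continue  # the cue is never absent, so the measure is undefined
        dp_2_given_1 = both / f1 - (f2 - both) / (n - f1)
        dp_1_given_2 = both / f2 - (f1 - both) / (n - f2)
        scores[(w1, w2)] = (dp_2_given_1, dp_1_given_2)
    return scores

# Invented toy token stream: "course" is always preceded by "of",
# but "of" is followed by "course" only half the time.
toy_tokens = ["of", "course"] * 10 + ["of", "x"] * 10
toy_scores = bigram_delta_p(toy_tokens)
right, left = toy_scores[("of", "course")]
print(f"dP_2|1 = {right:.3f}, dP_1|2 = {left:.3f}")
```

In the toy stream the pair of course comes out as more strongly left-predictive than right-predictive, and the pair x of is dropped because it occurs only nine times.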

4. The Results

The first result concerns the numbers of immediate collocations with a minimum frequency of 10 in each language; this is shown in Table 3.

Table 3: Numbers of frequent collocations in eight languages

The eight corpora are of similar size, but there is some variation in the number of collocations identified. This appears to reflect the extent to which a language is synthetic, since more synthetic languages have a greater variety of word forms, giving rise to fewer common collocations (see Stengers et al. 2011).

Focusing on the 1000 collocations with the highest ΔP values, we investigated whether they tend to be more right- or more left-predictive; the counts for these are shown in Table 4.

Table 4: Numbers of right- and left-predictive collocations in the top 1000

Interestingly, all languages have more left-predictive strong collocations (although for English, for example, the difference is negligible), with five of the languages showing a clear preference for left-predictive collocations.
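As a rough illustration of how a difference between two such counts might be tested, the sketch below runs a chi-square goodness-of-fit test against an even split, using only the Python standard library. It assumes that this is the form of chi-square test applied to counts of this kind, which the text does not specify. The function name is hypothetical and the counts are invented, not taken from Table 4.

```python
import math

def chi_square_equal_split(right_count, left_count):
    """Chi-square goodness-of-fit test (1 df) against a 50/50 split."""
    expected = (right_count + left_count) / 2
    statistic = ((right_count - expected) ** 2
                 + (left_count - expected) ** 2) / expected
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

ALPHA = 0.001  # significance level stated in the analysis section

# Invented counts for two hypothetical languages (not from the study).
clear_stat, clear_p = chi_square_equal_split(400, 600)
slight_stat, slight_p = chi_square_equal_split(490, 510)

print(f"400 vs 600: chi2 = {clear_stat:.1f}, significant = {clear_p < ALPHA}")
print(f"490 vs 510: chi2 = {slight_stat:.1f}, significant = {slight_p < ALPHA}")
```

A 400/600 split in a top-1000 list would be clearly significant at p < 0.001, whereas a 490/510 split would not, which is the kind of contrast drawn here between languages with a clear preference and those with a negligible difference.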

Separating the left-predictive (i.e., ΔP_1|2) and the right-predictive (i.e., ΔP_2|1) collocations, we calculated the average ΔP values for the 100 and 500 strongest collocations, as shown in Table 5.

Table 5: Harmonic means of strongest right- and left-predictive collocations

Treating the probabilities as rates of occurrence, to find the average probability we used the harmonic mean (the number of items divided by the sum of the reciprocals). Again, most languages show a clear preference for left-predictive collocations, while Indonesian is neutral, and English shows a preference for right-predictive collocations.
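The harmonic mean as defined here can be sketched in a few lines of Python. The ΔP values below are invented for illustration, and the function name is hypothetical. Note that the harmonic mean is defined only for positive values, which holds for the strongest collocations.

```python
import statistics

def harmonic_mean(values):
    """Number of items divided by the sum of their reciprocals."""
    return len(values) / sum(1 / v for v in values)

# Invented ΔP values for three strong collocations (not from the study).
toy_delta_p = [0.9, 0.6, 0.45]

manual = harmonic_mean(toy_delta_p)
library = statistics.harmonic_mean(toy_delta_p)  # standard-library equivalent
print(f"{manual:.3f} {library:.3f} {statistics.mean(toy_delta_p):.3f}")
```

For these three values the harmonic mean is 0.6, slightly below the arithmetic mean of 0.65, since the harmonic mean gives more weight to the smaller values.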

Finally, we examined those collocations that are unidirectional. For left-predictive collocations, this involved calculating ΔP_1|2 - ΔP_2|1 (and vice versa for right-predictive). We then counted the number of collocations with ΔP value differences above certain thresholds; the findings are shown in Table 6. As with the previous analyses, German, Italian, and Maltese show a preference for left-predictive collocations. English, on the other hand, has a preference for right-predictive collocations.

Table 6: Numbers of unidirectional collocations
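The counting procedure for unidirectional collocations can be sketched as follows. The thresholds and most of the scores are invented for illustration; only the of course values come from the worked example in section 1.2. The function name `count_unidirectional` is hypothetical.

```python
def count_unidirectional(scores, thresholds):
    """Count pairs whose ΔP difference exceeds each threshold, by direction."""
    counts = {t: {"right": 0, "left": 0} for t in thresholds}
    for dp_2_given_1, dp_1_given_2 in scores.values():
        diff = dp_2_given_1 - dp_1_given_2  # positive means right-predictive
        for t in thresholds:
            if diff > t:
                counts[t]["right"] += 1
            elif -diff > t:
                counts[t]["left"] += 1
    return counts

# (dP_2|1, dP_1|2) per pair; all values invented except "of course".
toy_scores = {
    ("pyrrhic", "victory"): (0.95, 0.02),
    ("deadly", "nightshade"): (0.01, 0.90),
    ("of", "course"): (0.004, 0.210),
    ("in", "the"): (0.50, 0.45),
}

toy_counts = count_unidirectional(toy_scores, thresholds=(0.1, 0.5))
for threshold, by_direction in toy_counts.items():
    print(threshold, by_direction)
```

Raising the threshold removes weakly directional pairs such as of course from the count, leaving only pairs in which one word predicts the other almost exclusively.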

To illustrate what these numbers involve, the top 20 unidirectional collocations in English are listed in Table 7. It is noticeable that these include only two proper nouns (proper nouns are highly likely to be involved in unidirectional collocations) and that 15 of the collocations include a preposition (a similar pattern is also found for German).

Table 7: Top 20 unidirectional collocations in English

5. Discussion

This is a preliminary, speculative study aiming to see whether applying a largely unused measure can lead to insights worthy of detailed investigation. From examining eight languages, it does appear that different languages manifest directional collocation in different ways. Focusing on those points where statistical tests were used, we can summarize the dominant directions of collocations in the eight languages as in Figure 1, which shows that most languages have a clear preference for left-predictive collocates.

Figure 1: Summary of collocational direction preference in the eight languages

Comparing the directional preferences with the typological features of the eight languages listed in Table 2, the only feature that seems related to the direction of collocation is whether the language is analytic (neutral or right-predictive) or synthetic (left-predictive). It is unclear why this correlation might exist.

For the other typological features, no close relationship with preferred direction of collocation is apparent. This is perhaps highlighted most clearly by adposition types in English and German. The majority of the top 100 directional collocations in both languages include adpositions, and both languages use prepositions with noun phrases, yet in the top 100 directional collocations which include prepositions, 53 of 63 are left-predictive in German and 65 of 76 are right-predictive in English, reflecting the overall directional preference of each language. In the other languages, adpositions are far less common in the top 100 directional collocations. For example, in Italian only 28 of the top 100 collocations include prepositions. Whether other paired sequences of parts of speech are prevalent in the strongly directional collocations in other languages is unclear. Directional collocation analysis using tagged corpora may help to answer this question.

One potential problem emerging from the findings presented here is that English has a preference different from that of the majority of the languages. As mentioned earlier, nearly all previous work on directional collocations has focused on English. This emphasis on English is symptomatic of research in several areas of linguistics; a quick Google Scholar search finds that English is the most researched language in reading research, natural language processing, pragmatics and lexis. If English is an outlier among languages (as might be the case for direction of collocations), then the emphasis on English as the focus of research is worrisome.

The findings show that different languages do have different preferences for direction of collocations, and that these preferences are realized differently in the various languages. These results raise many questions. Why are most languages left-predictive? Why are prepositions so common in strongly directional collocations in German and English but not in other languages? Why does Indonesian have no clear preference? Is English an outlier language? This squib makes no attempt to answer such questions; rather, it shows that using a ΔP analysis can help to highlight issues that may be worthy of further consideration.

Footnotes

¹ Thanks are due to Unimax Co., Ltd. for designing the program for calculating ΔP.

References

Barnbrook, Geoff, Mason, Oliver, and Krishnamurthy, Ramesh. 2013. Collocation: Applications and implications. Basingstoke: Palgrave Macmillan.

Brown, Dale. 2014. Knowledge of collocations. In Dimensions of vocabulary knowledge, ed. Milton, James and Fitzpatrick, Tess, 123–139. Basingstoke: Palgrave Macmillan.

Desagulier, Guillaume. 2015. A lesson from associative learning: Asymmetry and productivity in multiple-slot constructions. Available at https://halshs.archives-ouvertes.fr/halshs-01184230.

Dryer, Matthew S., and Haspelmath, Martin, eds. 2013. The world atlas of language structures online (WALS). Leipzig: Max Planck Institute for Evolutionary Anthropology. Available at http://wals.info.

Evert, Stefan. 2005. The statistics of word cooccurrences: Word pairs and collocations. Doctoral dissertation, Universität Stuttgart.

Goldhahn, Dirk, Eckart, Thomas, and Quasthoff, Uwe. 2012. Building large monolingual dictionaries at the Leipzig corpora collection: From 100 to 200 languages. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'12), Istanbul. Available at <http://www.lrec-conf.org/proceedings/lrec2012/index.html>.

Gries, Stefan Th. 2013. 50-something years of work on collocations. International Journal of Corpus Linguistics 18(1): 137–165.

Kjellmer, Göran. 1991. A mint of phrases. In English corpus linguistics: Studies in honor of Jan Svartvik, ed. Aijmer, Karin and Altenberg, Bengt, 111–127. London: Longman.

Michelbacher, Lukas, Evert, Stefan, and Schütze, Hinrich. 2007. Asymmetric association measures. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2007), Borovets, Bulgaria. Available at http://www.lml.bas.bg/ranlp2007/.

Michelbacher, Lukas, Evert, Stefan, and Schütze, Hinrich. 2011. Asymmetry in corpus-derived and human word associations. Corpus Linguistics and Linguistic Theory 7(2): 245–276.

Sinclair, John. 1991. Corpus concordance collocation. Oxford: Oxford University Press.

Stengers, Helene, Boers, Frank, Housen, Alex, and Eyckmans, June. 2011. Formulaic sequences and L2 oral proficiency: Does the type of target language influence the association? International Review of Applied Linguistics 49(4): 321–343.