Perceptive evaluation of Levenshtein dialect distance measurements using Norwegian dialect data

Charlotte Gooskens; Wilbert Heeringa

doi:10.1017/S0954394504163023

Published online by Cambridge University Press: 08 November 2004

Charlotte Gooskens: Affiliation:
University of Groningen, The Netherlands
Wilbert Heeringa: Affiliation:
University of Groningen, The Netherlands

Abstract

The Levenshtein dialect distance method has proven to be a successful method for measuring phonetic distances between Dutch dialects. The aim of the present investigation is to validate the Levenshtein dialect distance with perceptual data from a language area other than the Dutch, namely Norway. We calculate the correlation between the Levenshtein distances and the distances between 15 Norwegian dialects as judged by Norwegian listeners. We carry out this analysis to see the degree to which the average Levenshtein distances correspond to the psychoacoustic perception of the speakers of the dialects.

The present article reports on part of a study supported by NWO, the Netherlands Organization for Scientific Research. We are grateful for the permission from Kristian Skarbø and Jørn Almberg to use their material and for the help of Jørn Almberg during the whole investigation. We thank Saakje van Dellen for her obliging help with the data entry and Peter Kleiweg for letting us use the programs that he developed for the visualization of the maps and dendrograms in this article. Finally, we would like to thank John Nerbonne for valuable comments and for correcting our English.

Type: Research Article
Information: Language Variation and Change, Volume 16, Issue 3, October 2004, pp. 189–207

DOI: https://doi.org/10.1017/S0954394504163023
Copyright: © 2004 Cambridge University Press

In 1995, Kessler introduced the use of the Levenshtein distance as a tool for measuring linguistic distances between language varieties. He applied the algorithm to the comparison of Irish dialects. The Levenshtein distance is a string edit distance measure. On the basis of linguistic distances between dialectal varieties, dialect areas can be found. More innovative is the possibility of drawing dialect maps that reflect the fact that dialect areas should be considered as continua and not as areas separated by sharp borders. The application of the method to the Dutch language area has produced convincing results (see Heeringa, 2004; Nerbonne & Heeringa, 1998). The results are partly similar to the map of Daan and Blok (1969), which may be considered the most authoritative Dutch dialect map to date. Still, it is desirable to validate the method further.

In this article we validate the Levenshtein distance. We will investigate to what extent dialect distances found with Levenshtein distance correlate with distances as perceived by the dialect speakers themselves. We will try to find an answer to the following question: May Levenshtein distance-based dialect distances be considered as a good approximation of the perceptual distances? To answer this question, we will use a set of 15 Norwegian varieties. Results for Dutch may be impressive, but the Dutch dialect area is a flat, regularly populated landscape. In contrast with this, the Norwegian dialect area is less regular, because of the mountains. This may make the test harder, but more revealing.

In the next section, “Material,” we describe the data on which both the perception experiment and the Levenshtein distance measurements were based. In the section “Perceptual Distance Measurements,” we present the perception experiment that was carried out to calculate the psychoacoustic distances between 15 Norwegian dialects. In the following section, the Levenshtein distance is presented and applied to data from the same 15 Norwegian dialects. Then, the results of the two kinds of distance measures are compared and explanations for the results are suggested. Finally, some general conclusions are drawn.

MATERIAL

To carry out our investigation we needed to obtain suitable material. This means that we needed to have access to recordings of the same text in a fair number of dialects from one language area to carry out the perception experiments. At the same time, we needed digitized transcriptions in a form that could be used in already existing computer programs for calculating the Levenshtein distances.

Dialects

We chose to focus on the Scandinavian language area because the Scandinavian countries have a strong tradition of research in the area of dialectology. This has resulted in maps similar to the traditional Dutch dialect maps (for an overview, see Skjekkeland, 1997). These maps are useful for the interpretation of the results. Norway seems to be particularly interesting because of the strong position of the dialects in this country. In contrast to many European countries, the dialect is used by people of all ages and social backgrounds, not only in the private domain, but also in official contexts (Omdal, 1995). This makes it easy to use recent recordings of young people from all over the country without the risk that some of the speakers might use a more standardized variant of their dialect or a variety that is no longer being used in everyday life. Also, it does not feel unnatural for Norwegian people to read aloud a text in their own dialect. This allowed us to use read texts, which was necessary as we needed the same text in different dialects. In Figure 1, the 15 dialects that were used in the investigation are shown. The dialects are spread over a large part of the Norwegian language area, and they cover most major dialect areas, as found on the traditional map of Skjekkeland (1997:276). On this map, the Norwegian language area is divided into nine dialect areas. In our set of 15 varieties, six areas are represented.

Figure 1. Map of Norway showing the 15 dialects used in the present investigation. The abbreviation after the name of each location indicates the dialect area to which the variety belongs, according to Skjekkeland (1997). The same abbreviations are used in other figures in this article. Skjekkeland (1997) gave a more global division in which Norwegian dialects are divided into Vestnorsk (covering No, Sv, and Nv) and Austnorsk (covering Mi, Au, and Tr).

Text

It is time-consuming to make recordings of dialects and to transcribe the texts phonetically. Fortunately, we were able to use already existing recordings of Norwegian dialect speakers. The speakers all read aloud the same text, namely, the fable “The North Wind and the Sun.”¹

¹ The recordings and the transcriptions (in IPA as well as in SAMPA) were made by Jørn Almberg in cooperation with Kristian Skarbø at the Department of Linguistics, NTNU, Trondheim and are available at http://www.ling.hf.ntnu.no/nos/.

This text has often been used for phonetic investigations; see, for example, the International Phonetic Association (1949, 1999), where the same text has been transcribed in a large number of different languages.

Speakers

There were 4 male and 11 female speakers. Thirteen of the speakers had filled in a questionnaire about their background. From this we know that the average age of these speakers was 30.5 years, ranging between 22 and 35, except for one speaker who was 66. All 13 speakers attended university or already had a university degree. No formal testing of the extent to which the speakers used their own dialect was carried out. However, they had all lived at the place where the dialect is spoken until the mean age of 20 (with a minimum age of 18), and they all regarded themselves as representative speakers of the dialects in question. All speakers, except one, had at least one parent speaking the dialect.

Recordings

The recordings were made in a soundproof studio in the autumn of 1999 and the spring of 2000. The speakers were all given the text in Norwegian beforehand and were allowed time to prepare the recordings in order to be able to read aloud the text in their own dialect. Many speakers had to change some words of the original text for the dialect to sound authentic. The word order was changed in three cases. When reading the text aloud, the speakers were asked to imagine that they were reading the text to someone with the same dialectal background as themselves. This was done to ensure a reading style that was as natural as possible and to achieve dialectal correctness.

The microphone used for the recordings was a MILAB LSR-1000, and the recordings were made in DAT format using a FOSTEX D-10 Digital Master Recorder. The recordings were edited by means of Cool Edit 96 and are available on the World Wide Web.

These recordings were used in the perception experiment described in the following section.

Transcriptions

On the basis of the recordings, phonetic transcriptions were made of all 15 dialects. The transcriptions were made in IPA as well as in X-SAMPA (Speech Assessment Methods Phonetic Alphabet). This is a machine-readable phonetic alphabet that is still human-readable. Basically, it maps IPA symbols to the 7-bit printable ASCII/ANSI characters. All transcriptions were made by the same person, which ensures consistency. Most Norwegian dialects distinguish between two tonal patterns on the word level, often referred to as tonemes. Some dialects even have a third toneme, the circumflex (e.g., Kristoffersen, 2000). In our material, four dialects (Bjugn, Fræna, Verdal, and Stjørdal) have circumflex tonemes on one word (mann meaning ‘man’). In the transcriptions, toneme transcriptions were included, and it was indicated where the different tonemes occurred in the text. We know from the literature that the realization of the tonemes can vary considerably across the Norwegian dialects. However, no information was given about the precise realization of the tonemes in the transcriptions.

The Levenshtein distance measurements are based on the transcriptions and are presented later in the article.

PERCEPTUAL DISTANCE MEASUREMENTS

Perceptual data have often been used in dialectology (e.g., Daan & Blok, 1969; Gooskens, 1997; Preston, 1999) and have shown that listeners without linguistic training are quite able to make judgments, for example, about distances between dialects. Perceived linguistic distances are likely to be at least partly based on objective linguistic distances. However, a number of factors other than linguistic distances might influence the way in which listeners perceive distances between dialects. We will return to this point later. To be able to investigate how well the Levenshtein distances correspond to the perceived linguistic distances, we carried out a perception experiment on the basis of 15 Norwegian dialects. We first describe the listening experiment and then present the results.

Experiment

Manipulations

To investigate the dialect distances between 15 Norwegian dialects, as perceived by Norwegian listeners, the recording of the fable “The North Wind and the Sun” in each of the 15 varieties was presented to Norwegian listeners in a listening experiment. The running text provides the listeners with more kinds of information than the information used for the calculation of the Levenshtein distances. One important difference is that the listeners based their judgments on spoken material that contains prosody, whereas prosody plays no role in the Levenshtein distances. For this reason, we decided to include a monotonized version of all fragments. Because the pitch contour is absent from these fragments, just as it is absent from the input to the Levenshtein distances, we expect the correlation of these two distance measures to be higher than when Levenshtein distances are correlated with judgments of the original fragments.

In the listening experiment described next, each of the 15 dialect recordings was presented in the following two versions:

Monotonized version. By means of electronic monotonization the intonation (including word tones) is removed from the signal.
Original version. This version has the original prosodic and verbal information, but is processed in the same way as the monotonized version.

The manipulations were carried out with the program PRAAT.²

² The program PRAAT is a free public-domain program developed by Paul Boersma and David Weenink at the Institute of Phonetic Sciences of the University of Amsterdam and is available at http://www.fon.hum.uva.nl/praat.

To monotonize the fragments, the pitch contours were changed into flat lines. The recordings of the female speakers were monotonized at 224 Hz, which is the mean pitch of the 11 female speakers. The recordings of the male speakers were monotonized at 134 Hz, which is the mean pitch of the four male speakers.

Listeners

The listeners were 15 groups of high school pupils, one group from each of the places where the 15 dialects are spoken (see Figure 1). Each group consisted of 16 to 27 listeners (with a mean of 19). Their mean age was 17.8 years; 52% were female and 48% were male. Only the responses of listeners who had lived the major part of their lives in the place where the dialect is spoken were used for the analysis. On average, these listeners had lived in the place in question for 16.7 years. Nine of the 290 listeners (3%) said that they never speak the dialect; the rest said that they speak the dialect always (60%), often (21%), or seldom (16%). A large majority of the listeners (83%) had one or two parents who also spoke the dialect.

Procedure

The two versions (monotonized and original) of the 15 dialects were presented in two blocks, with the dialects randomized within each block. First the block with the monotonized version was presented, and after a short break the block with the original version was presented. Each block was preceded by a practice recording (a speaker from Stjørdal, but not one of the 15 recordings used in the real experiment). Consecutive recordings were separated by a pause of 3 seconds.

While listening to the dialects, the listeners were asked to judge each dialect on a scale from 1 (similar to own dialect) to 10 (not similar to own dialect). The whole experiment lasted approximately 20 minutes, followed by a questionnaire. In this questionnaire the listeners were asked questions about their individual characteristics, such as language background, age, and sex. The listeners were paid for their participation.

Results

Distances

The mean perceptual distances between the 15 Norwegian dialects are presented in Table 1, obtained on the basis of the experiment in which the original, nonmanipulated recordings were presented. Each group of listeners judged the linguistic distances between their own dialect and the 15 dialects, including their own dialect. In this way, we get a matrix with 15 × 15 distances. The fact that the listeners also had to judge their own dialect resulted in varying diagonals (between 1.0 and 3.4). Some groups of listeners judged the recorded sample of their own dialect to be more than minimally distant. This might be explained by the fact that the recorded speakers were not equally representative of the dialect in question. It might, however, also be the case that some dialects show more variation than others. Finally, the differences can also be caused by the fact that the groups of listeners differ in some respect. For example, some groups might be more familiar with their own dialects than others or more tolerant as to what they are willing to accept as a good representation of their dialect.

Table 1. Mean perceptual distances between all pairs of 15 Norwegian dialects as perceived by 15 groups of listeners when listening to the nonmanipulated recordings (judged on a scale from 1 = similar to own dialect to 10 = not similar to own dialect)

There are two mean distances between each pair of dialects. For example, the distance that the listeners from Bergen perceived between their own dialect and the dialect of Trondheim (mean judgment is 7.8) is different from the distance as perceived by the listeners from Trondheim (mean judgment is 8.6). Different explanations can be given for the fact that different groups perceive the same linguistic distances differently. For example, it is likely that the attitude toward a dialect influences the perception of the linguistic distance. We will return to this point later.

Classification

On the basis of the distance matrix, the dialects can be classified with cluster analysis. The goal of a cluster analysis is to identify the main groups. The groups are called clusters. Clusters may consist of subclusters, and subclusters may in turn consist of subsubclusters, and so on. The result is a hierarchically structured tree in which the dialects are the leaves (Jain & Dubes, 1988). Several alternatives exist. We used the Unweighted Pair Group Method using Arithmetic Averages (UPGMA), because we found that dendrograms generated by this method reflected distances that correlated most strongly with the original distances in the distance matrix (see Sokal & Rohlf, 1962).
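The clustering step can be illustrated with a minimal Python sketch of UPGMA. This is an illustration only: the four variety labels and their distances are invented, and `upgma` is a hypothetical helper, not the cluster program used in this study. At each step the two clusters with the smallest average pairwise distance are merged, and the merge height is stored in the tree.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster `labels` with UPGMA; `dist[(a, b)]` holds symmetric distances.

    Returns a nested tuple (left, right, merge_height); leaves are label strings.
    """
    def d(a, b):
        # Look the pair up in either order.
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    def avg(c1, c2):
        # UPGMA: unweighted mean over all leaf pairs across the two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(d(a, b) for a, b in pairs) / len(pairs)

    # Map each current subtree to the list of leaves it contains.
    clusters = {name: [name] for name in labels}
    while len(clusters) > 1:
        t1, t2 = min(combinations(clusters, 2),
                     key=lambda p: avg(clusters[p[0]], clusters[p[1]]))
        height = avg(clusters[t1], clusters[t2])
        members = clusters.pop(t1) + clusters.pop(t2)
        clusters[(t1, t2, height)] = members
    return next(iter(clusters))


# Invented distances between four hypothetical varieties A-D.
example_dist = {("A", "B"): 2.0, ("A", "C"): 6.0, ("A", "D"): 7.0,
                ("B", "C"): 6.5, ("B", "D"): 7.5, ("C", "D"): 3.0}
tree = upgma(["A", "B", "C", "D"], example_dist)
print(tree)  # (('A', 'B', 2.0), ('C', 'D', 3.0), 6.75)
```

The nested tuple corresponds to a dendrogram: A and B merge first, then C and D, and the two groups join at the mean of the four cross-group distances.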

Because the cluster program expects only one value for each pair of different elements, distances of dialects with respect to themselves are not used, and the average of the two mean distances is used when classifying the varieties. For example, the average of the distance between Bergen–Trondheim and Trondheim–Bergen is used.
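The averaging step described above can be sketched as follows. The function name `symmetrize` and the diagonal values are assumptions made for illustration; only the two Bergen and Trondheim judgments (7.8 and 8.6) are taken from the text.

```python
def symmetrize(names, judged):
    """Average the two directed judgments for every pair of different dialects.

    judged[listener][speaker] is the mean rating given by the listener group;
    diagonal entries (own dialect) are ignored.
    """
    sym = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sym[(a, b)] = (judged[a][b] + judged[b][a]) / 2
    return sym


names = ["Bergen", "Trondheim"]
judged = {
    # Diagonal values of 1.0 are placeholders; they are discarded anyway.
    "Bergen": {"Bergen": 1.0, "Trondheim": 7.8},
    "Trondheim": {"Bergen": 8.6, "Trondheim": 1.0},
}
sym = symmetrize(names, judged)
print(round(sym[("Bergen", "Trondheim")], 1))  # 8.2
```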

The dendrogram (Figure 2) is obtained on the basis of Table 1 and accords rather well with the map of Skjekkeland (Figure 1). Sørvestlandsk, Austlandsk, and Trøndsk groups can clearly be identified. However, the Midlandsk dialects, Bø and Lesja, do not form a close cluster. Geographically they are rather distant, so they may be rather different, although they should be in the same group according to the traditional division. The Nordvestlandsk dialects (Fræna and Herøy) seem to be very different from each other, although they are geographically rather close. Possibly the fact that these dialects belong to the same group on the map of Skjekkeland may be explained by the fact that Skjekkeland based the characterization on a limited number of phenomena, which are (partly) different from those found in the text “The North Wind and the Sun.” In our sample, the Nordlandsk area is represented by only one variety (Bodø). This variety is grouped with the varieties of the Trøndsk area, which is not unexpected geographically.

Figure 2. Dendrogram derived from the 15 × 15 matrix of perceptual distances showing the clustering of (groups of) Norwegian dialects. On the horizontal scale, distances are given in the scale as used by the listeners.

LEVENSHTEIN DISTANCE MEASUREMENTS

Method

Traditional dialectology has aimed to divide language areas into dialect areas mostly by drawing sharp borders between the areas on a map. The choice of the borders has often been based on the knowledge and intuition of the investigators of the areas in question. The application of isoglosses has been another widely used means of dividing language areas into dialect areas. Coinciding isoglosses are interpreted as borders. However, the use of isoglosses gives rise to a number of problems. First, isoglosses do not always coincide. They can run parallel, forming vague bundles, or even cross each other, describing contradictory binary divisions. In practice, well-known isoglosses that form bundles are selected, but this makes this aspect of the method subjective. Second, the use of isoglosses gives a very categorical view of dialect differences. Either a dialect is different from another dialect or it is not; no degrees of difference can be expressed. Finally, dialects might be dispersed by migration or war so that closely related dialects are no longer adjacent to each other. This causes problems when drawing the isoglosses and borders on the dialect map (see Chambers & Trudgill, 1998:89–103; Kessler, 1995).

To solve some of the problems we have outlined, several (computational) methods for measuring the linguistic distances between language varieties have been developed since the beginning of the 1970s (Heeringa, 2004:14–24). In this investigation, we wish to evaluate one of the methods, the Levenshtein distance method, which has been applied successfully to Irish Gaelic (Kessler, 1995) and Dutch dialects (Heeringa, 2004:213–278; Nerbonne & Heeringa, 1998). The basic algorithm has been described in detail in Kruskal (1999). Compared to traditional methods (for instance the isogloss method), this approach has the advantage that varieties are compared and classified in an objective way and on the basis of the aggregate of many phenomena rather than on the basis of just single phenomena. In contrast to other computational methods, Levenshtein distance yields gradual word pronunciation differences, and the method uses the data exhaustively, which makes it most sensitive.

Algorithm

Using the Levenshtein distance, two dialects are compared by comparing the pronunciation of words in the first dialect with the pronunciation of the same words in the second. It is determined how one pronunciation is changed into the other by inserting, deleting, or substituting sounds. Weights are assigned to these three operations. In the simplest form of the algorithm, all operations have the same cost. For example, assume afternoon is pronounced as [IPA transcription missing] in the dialect of Savannah, Georgia, and as [IPA transcription missing] in the dialect of Lancaster, Pennsylvania.³

³ The data is taken from the Linguistic Atlas of the Middle and South Atlantic States (LAMSAS) and is available via: http://hyde.park.uga.edu/lamsas/.

Changing one pronunciation into the other can be done as follows (ignoring suprasegmentals and diacritics for the moment).⁴

⁴ The example should not be interpreted as a historical reconstruction of the way in which one pronunciation changed into another. From that point of view, it may be more obvious to show how [IPA transcription missing] changed into [IPA transcription missing]. We just show that the distance between two arbitrary pronunciations is found on the basis of the least costly set of operations mapping one pronunciation into another.

In fact, many sequences of operations map one pronunciation into the other. The power of the Levenshtein algorithm is that it always finds the cost of the cheapest mapping. Comparing pronunciations in this way, the distance between longer pronunciations will generally be greater than the distance between shorter pronunciations. The longer the pronunciation, the greater the chance for differences with respect to the corresponding pronunciation in another variety. Because this does not accord with the idea that words are linguistic units, the sum of the operations is divided by the length of the longest alignment that gives the minimum cost. The longest alignment has the greatest number of matches. In our example we have the following alignment:

The total cost of 3 (1 + 1 + 1) is now divided by the length of 9. This gives a word distance of 0.33 or 33%.
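The computation can be sketched in Python under the simplest assumption that every operation costs 1. The function `levenshtein_normalized` is a hypothetical helper, and the two ASCII strings are invented stand-ins (one character per segment), not the transcriptions of the example; they were chosen so that the cheapest mapping needs one deletion, one insertion, and one substitution. The table stores, for each cell, the minimum cost together with the length of the longest alignment that reaches this cost.

```python
def levenshtein_normalized(a, b):
    """Return (cost, alignment_length, cost / alignment_length).

    Insertion, deletion, and substitution all cost 1. Among all alignments
    with minimum cost, the longest one is used for the normalization.
    """
    n, m = len(a), len(b)
    # Each cell holds (cost, -length): tuple comparison prefers the lower
    # cost first and, for equal cost, the longer alignment.
    table = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = (i, -i)
    for j in range(1, m + 1):
        table[0][j] = (j, -j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(
                (table[i - 1][j][0] + 1, table[i - 1][j][1] - 1),        # deletion
                (table[i][j - 1][0] + 1, table[i][j - 1][1] - 1),        # insertion
                (table[i - 1][j - 1][0] + sub, table[i - 1][j - 1][1] - 1),  # match or substitution
            )
    cost, neg_len = table[n][m]
    length = -neg_len
    return cost, length, (cost / length if length else 0.0)


# Invented stand-in strings: cost 3, alignment length 9, ratio about 0.33.
cost, length, ratio = levenshtein_normalized("a@ft@nUn", "aft@rnun")
print(cost, length, round(ratio, 2))  # 3 9 0.33
```

Dividing by the alignment length rather than by the length of the longer word matters here: both stand-in strings have eight segments, but the cheapest alignment has nine slots because it contains both an insertion and a deletion.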

Gradual weights

The simplest versions of this method are based on a notion of phonetic distance in which phonetic overlap is binary: nonidentical phones contribute to phonetic distance, identical ones do not. Thus the pair [a,p] counts as different to the same degree as [b,p]. In more sensitive versions, phones are compared on the basis of their feature values, so the pair [a,p] counts as more different than [b,p]. However, it is not always clear what weight should be attributed to the different features. The version that we use in this article is based on the comparison of spectrograms of the sounds. A spectrogram is the visual representation of the acoustical signal, and the visual differences between the spectrograms are reflections of the acoustical differences. When using spectrograms it is not necessary to make decisions about the weight of the different features. The spectrograms were made on the basis of recordings of the sounds of the International Phonetic Alphabet (IPA) as pronounced by John Wells and Jill House on the cassette The Sounds of the International Phonetic Alphabet, from 1995.⁵

⁵ See http://www.phon.ucl.ac.uk/home/wells/cassette.htm.

The different sounds were isolated from the recordings and monotonized at the mean pitch of each of the two speakers with the program PRAAT (see note 2). Next, with PRAAT, a spectrogram was made for each sound using the so-called Barkfilter, which is a more perceptually oriented model. On the basis of the Barkfilter representation, segment distances were calculated. The way in which this was done is described extensively in Heeringa (2004:79–119) and more briefly in Gooskens and Heeringa (2004).
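The difference between binary and gradual weights can be illustrated with a small Python sketch. Everything numeric here is an assumption: the graded values in `GRADED` are invented and are not spectrogram-based distances, insertions and deletions are simply given a cost of 1, and `weighted_levenshtein` is a hypothetical helper. The sketch only shows how a segment distance function replaces the all-or-nothing comparison.

```python
def weighted_levenshtein(a, b, sub_cost, indel_cost=1.0):
    """Levenshtein distance with a pluggable substitution cost function."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + indel_cost
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel_cost,                           # deletion
                d[i][j - 1] + indel_cost,                           # insertion
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),     # substitution
            )
    return d[n][m]


def binary_cost(x, y):
    # Binary overlap: identical phones cost nothing, all others cost 1.
    return 0.0 if x == y else 1.0


# Invented graded segment distances on a 0-1 scale (not measured values).
GRADED = {frozenset("bp"): 0.2, frozenset("ap"): 1.0}


def graded_cost(x, y):
    if x == y:
        return 0.0
    return GRADED.get(frozenset((x, y)), 1.0)


print(weighted_levenshtein("ba", "pa", binary_cost))  # 1.0
print(weighted_levenshtein("ba", "pa", graded_cost))  # 0.2
```

With binary costs, replacing [b] by [p] is as expensive as replacing [a] by [p]; with graded costs the first substitution is much cheaper.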

Logarithmic weights

In perception, small differences in pronunciation may play a relatively strong role in comparison to larger differences. Therefore, we used logarithmic segment distances. The effect of using logarithmic distances is that small distances are weighed relatively more heavily than large distances. Because the logarithm of 0 is not defined, and the logarithm of 1 is 0, distances are increased by 1 before the logarithm is calculated. To obtain percentages, we calculate:
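The formula itself is not preserved in this copy of the text. One form that is consistent with the description, assumed here and not taken from the source, adds 1 to a segment distance, takes the logarithm, and divides by the same quantity for the maximum segment distance. The names `log_percentage` and `max_distance` are hypothetical.

```python
import math


def log_percentage(distance, max_distance):
    """Assumed form: 100 * ln(distance + 1) / ln(max_distance + 1)."""
    return 100.0 * math.log(distance + 1) / math.log(max_distance + 1)


# On a hypothetical scale from 0 to 10, a raw distance of 1 is 10% of the
# maximum, but its logarithmic counterpart is considerably larger.
print(log_percentage(0, 10))    # 0.0
print(log_percentage(10, 10))   # 100.0
print(log_percentage(1, 10) > 10.0)  # True
```

Under this assumed form, a distance of 0 still maps to 0% and the maximum distance to 100%, while small distances are stretched relative to large ones.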

Allowed matches

To take syllabification in words into account, the Levenshtein algorithm is adapted so that only a vowel may match with a vowel, a consonant with a consonant, the [j] or [w] with a vowel (or vice versa), the [i] or [u] with a consonant (or vice versa), and a central vowel (in our research only the schwa) with a sonorant (or vice versa). In this way unlikely matches (e.g., a [p] with an [a]) are prevented.
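The matching constraints can be expressed as a simple predicate. The sketch below is an illustration under assumptions: the segment inventories are small invented sets in a SAMPA-like notation ("@" for schwa), the sonorants are taken to be nasals, liquids, and glides, and `may_match` is a hypothetical helper. In a full implementation, a substitution would only be considered when this predicate is true.

```python
# Hypothetical, deliberately small inventories; "@" stands for schwa.
VOWELS = set("aeiouy@")
CONSONANTS = set("pbtdkgfvszmnlrjw")
SONORANTS = set("mnlrjw")   # assumed: nasals, liquids, glides
GLIDES = {"j", "w"}
HIGH_VOWELS = {"i", "u"}
SCHWA = "@"


def may_match(x, y):
    """True if segments x and y are allowed to be aligned with each other."""
    for a, b in ((x, y), (y, x)):          # every rule holds in both directions
        if a in VOWELS and b in VOWELS:
            return True
        if a in CONSONANTS and b in CONSONANTS:
            return True
        if a in GLIDES and b in VOWELS:
            return True
        if a in HIGH_VOWELS and b in CONSONANTS:
            return True
        if a == SCHWA and b in SONORANTS:
            return True
    return False


print(may_match("p", "a"))  # False
print(may_match("j", "a"))  # True
print(may_match("n", "@"))  # True
```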