1 Introduction
The accentuation patterns of Japanese dialects have been among the most vigorously studied areas in Japanese linguistics. This can be attributed to the long history of auditory transcriptions of the pitch patterns of words in various Japanese dialects (e.g. Ikeda 1951; Hirayama 1951, 1957; Kibe 2000). Much acoustic work has also been done on Standard Japanese (e.g. Fujisaki & Hirose 1984, Poser 1984, Kubozono 1993, Pierrehumbert & Beckman 1988), and some on varieties of Japanese (Ishihara 2004a on Kagoshima Japanese, Kori 1987 on Osaka Japanese, Matsuura 2008 on Nagasaki Japanese, and Nagano-Madsen 2004 on Kochi Japanese). However, knowledge of the linguistic-phonetic properties of Japanese dialects is still lacking. Linguistic-phonetic properties have both language-internal and cross-linguistic aspects. With regard to the former, linguistic-phonetic properties characterise ‘any and all distinctions which may ever be manipulated systematically in a particular language’ while for the latter, they comprise ‘[a]ny phonetic property . . . in terms of which utterances in Language X are systematically different from utterances in Language Y’ (Anderson 1978: 133–134).
In many Japanese descriptive works, H and L have been traditionally used to represent pitch (Hirayama 1951, 1957, 1960; Kibe 2000; the National Language Research Institute 1959, 1967–1975; the Society for the Study of Kyushu Dialects 1969). The aims of these works vary: to compare the accentual differences across regions; to compile an accent dictionary; to conduct a phonological analysis; to reconstruct the accentuation of proto-Japanese, etc. While phonetic representation using this H–L dichotomy may be suitable for the above purposes, it is unlikely that it is detailed enough to characterise differences in the acoustic realisation of a tonal feature across dialects. The use of H and L to represent accentual pitch is not suitable for linguistic-tonetic analysis, not only because it is not detailed enough, but also because it fails to express linguistically important between-speaker differences and it is not quantifiable. For linguistic-phonetic comparisons, therefore, representations are needed that can provide enough phonetic detail; that are able to capture the invariant properties of the various acoustic realisations of given linguistic information; and that can express linguistically important between-speaker differences (Rose 1993: 190). That is, linguistic-phonetic representations need to be able to show in detail to what quantifiable extent the productions of different speakers are linguistically the same.
Osaka Japanese (OJ) and Kagoshima Japanese (KJ), which typologically belong to different accent systems (Hirayama 1960, 1967; Uwano 1999), nevertheless share some pitch patterns. Previous phonetic studies on OJ and KJ tone patterns (Ikeda 1951, Ishihara 2004a, Kori 1987) suggest that a number of these pitch patterns differ in their f0 realisation between the two dialects. However, we do not know exactly how they differ acoustically.
Therefore, focusing on the shared LH, LHL, LLH and LLLH pitch patterns, the first aim of this paper is to specify the linguistic-phonetic properties of OJ and KJ tonalities in such a way as to permit inter-dialectal comparison. The second aim of this paper is to explore implications for surface tonal representations of KJ on the basis of observed linguistic-tonetic differences between OJ and KJ. This will be done in the framework of Autosegmental-Metrical (AM) theory (Pierrehumbert & Beckman 1988, Pierrehumbert & Hirschberg 1990).
In this study, linguistic-tonetic representations of OJ and KJ tonalities are derived from normalised acoustic representations for the LH, LHL, LLH and LLLH pitch patterns (see Section 3 for detailed explanation of normalisation). Linguistic-phonetic description should then allow us to identify the phonetic features which specify the sounds within a given language or variety, and those features which underlie sound contrasts between varieties (Ladefoged & Maddieson 1986; Rose 1987, 1991).
Different levels of representation are relevant to the current research. This is because our conventional model of language presupposes that an utterance, which can be acoustically represented (e.g. by its f0 contour), is a realisation of its phonological form (whether as phonemes, systematic phonemes, morpho-phonemes or autosegmental H and L tones). The two research aims of this study explained above can be rephrased in terms of the different levels of representation involved. That is, for the first aim, we will show how one can derive linguistic-tonetic representations from speakers’ utterances (acoustic representation). For the second aim, we will investigate how two levels of representation – namely linguistic-tonetic and phonological representations – are mapped onto each other.
The varieties referred to as OJ and KJ in this study are spoken in Osaka City and Kagoshima City, respectively (Figure 1), and all speech samples for this study were recorded in these cities.
1.1 Accentuation of Osaka and Kagoshima Japanese
Tables 1 and 2 contain some words exemplifying the differences in accentual contrast between OJ and KJ. The pitch pattern of words in OJ is determined: (i) by the presence or absence of a lexical accent; (ii) if a lexical accent is present, by its position, which dictates where the accentual pitch fall starts; and (iii) by whether the word is ‘low-pitch beginning’ or ‘high-pitch beginning’ (Shibatani 1990, Sugito 1995). Those words which have a lexical accent – its phonetic realisation is a pitch fall – are called ‘accented’ words, and those which do not are called ‘unaccented’ words. So, for example, in Table 1, both kabuto [LHL] ‘helmet’ and otoko [HHL] ‘man’ are ‘accented’ words with a lexical accent on the second syllable. The former is a ‘low-pitch beginning’ word and the latter is a ‘high-pitch beginning’ word. On the other hand, suzume [LLH] ‘sparrow’ and sakura [HHH] ‘cherry blossom’ are both ‘unaccented’ words, the former belonging to the ‘low-pitch beginning’ group and the latter to the ‘high-pitch beginning’ group. A pitch rise may be observed between the penultimate and final syllable in unaccented ‘low-pitch beginning’ words, as can be seen in suzume [LLH] ‘sparrow’ (Sugito 1997).
* = location of an accent.
As can be seen in Table 2, KJ is much simpler, exhibiting only a two-way accentual contrast: LnHL vs. LnH where n ≥ 0 (Hirayama 1960). In the former pattern, only the penultimate syllable of a word has a high pitch, and every other syllable has a low pitch (e.g. hana [HL] ‘nose’, sakura [LHL] ‘cherry blossom’ and kagaribi [LLHL] ‘watch fire’). In the latter pattern, only the last syllable of a word has a high pitch, and every syllable before it has a low pitch (e.g. hana [LH] ‘flower’, usagi [LLH] ‘rabbit’ and kakimono [LLLH] ‘document’). The LnHL type is referred to as Type A and the LnH type as Type B in this paper (after Hirayama 1960). The main contrast between KJ Type A and Type B is thus whether or not there is a fall in pitch at the end; unlike Standard Japanese, KJ has no lexical contrast in terms of the location of an accent.
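The two-way KJ contrast described above can be sketched in a few lines of Python. This is only an illustration of the LnHL / LnH schema as stated in the text: the function name `kj_pitch_pattern` and the decision to reject words too short for the schema are assumptions made here, not part of the original description.

```python
def kj_pitch_pattern(n_syllables: int, accent_type: str) -> str:
    """Return the H/L string for a KJ word under the LnHL / LnH schema.

    Hypothetical helper: 'A' gives L^n H L (high penultimate syllable),
    'B' gives L^n H (high final syllable), with n >= 0.
    """
    if accent_type == "A":
        # The schema L^n H L needs at least two syllables (n = 0 gives HL).
        if n_syllables < 2:
            raise ValueError("Type A schema needs at least two syllables")
        return "L" * (n_syllables - 2) + "HL"
    if accent_type == "B":
        # The schema L^n H needs at least one syllable (n = 0 gives H).
        if n_syllables < 1:
            raise ValueError("Type B schema needs at least one syllable")
        return "L" * (n_syllables - 1) + "H"
    raise ValueError("accent_type must be 'A' or 'B'")


# Examples taken from the words cited in the text.
print(kj_pitch_pattern(3, "A"))  # sakura 'cherry blossom'
print(kj_pitch_pattern(4, "B"))  # kakimono 'document'
```

Because the pattern follows entirely from the word length and the accent type, no accent position needs to be stored for a KJ word.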
The present study used four target pitch patterns common to both dialects: LH, LLH, LLLH and LHL. The first three pitch patterns represent unaccented low-pitch beginning words in OJ, and Type B words in KJ. The LHL pitch pattern is an accented low-pitch beginning word with an accent on the second syllable in OJ, and a Type A word in KJ.
1.2 Accent typology of Osaka and Kagoshima Japanese
It can be seen from the description given in the previous section that OJ and KJ belong to different accent systems. These are called the Kyoto–Osaka system and the two-pattern system, respectively (Hirayama 1957, Shibatani 1990). As is also shown in that section, words can be classified into two groups in the Kyoto–Osaka system, unaccented and accented. If a word is accented, the location of an accent is lexically contrastive as in Standard Japanese (SJ). However, unlike SJ, words also contrast by virtue of starting with a high pitch or a low pitch. In the two-pattern system, all words exhibit only two-way accentual oppositions, and unlike SJ and OJ, the information as to where a pitch falls is not relevant to determining the meaning of a word.
In the spectrum of accentual complexity, the two-pattern system (i.e. KJ) has one of the simplest accentual systems and the Kyoto–Osaka system (i.e. OJ) has one of the most complex systems among the Japanese dialects. This difference in accentual complexity between OJ and KJ was made apparent in Section 1.1. Note also that some scholars (e.g. Gussenhoven 2004, Yip 2002) regard Japanese as a tone language with impoverished paradigmatic and syntagmatic tonal contrast. As such, no absolute division exists between accent languages and tone languages (Yip 2002: 257).
1.3 Previous studies
There have been a large number of studies on the accentuation of OJ and KJ. The accentuation of both OJ and KJ words and phrases has been phonologically analysed by many scholars within various phonological frameworks (McCawley 1968, 1970, 1977; Haraguchi 1977; Shibatani 1979). Recently, experimental studies have been undertaken to investigate the realisation of word accents in phonologically large units such as phrases and sentences for OJ (Kori 1987) and KJ (Kubozono & Matsui 1996, Ishihara 2004a). Seven volumes of ‘Study of Japanese prosody and speech sounds’ compiled by Sugito (1994–1999) contain a series of detailed studies on the acoustics, phonetics, physiology and other features of OJ accentuation. Kibe's (2000) descriptive study based on her own fieldwork and historical records is a comprehensive synchronic and diachronic study of two-pattern Japanese dialects of Kyushu, the island on which Kagoshima City is located.
With regard to the target pitch patterns of the present study (LH, LHL, LLH and LLLH), some scholars have remarked on the different realisations between OJ and KJ. It has been reported that the low-pitch beginning words of OJ have a continuous pitch rise (Ikeda 1951) and are ‘acoustically characterised by an over-all slow [f0] rise’ (Kori 1987: 44), while Ishihara (2004a) demonstrated that in KJ, f0 gradually falls before the rise in the penultimate syllable of a Type A word or in the final syllable of a Type B word. Judging from these previous studies, there appear to be some acoustic realisation differences in the LH, LHL, LLH and LLLH pitch patterns between OJ and KJ. However, we do not know exactly how they are acoustically different due to the lack of linguistic-phonetic description of Japanese dialects. It is one of the aims of this paper to rectify this lack of knowledge.
1.4 Autosegmental-Metrical theory
Pierrehumbert & Beckman (1988) present evidence that the linking and spreading of the tones of the accent to other syllables – as is assumed in much autosegmental work on Japanese (Haraguchi 1977) – is not the best account for surface tone patterns in SJ. Instead, they argue that the surface f0 contour is produced by phonetic interpolation between tone targets, including not only the lexically stipulated accent but also tones associated with higher levels of prosodic structure. With this idea, they introduced a new model for describing Japanese pitch accent by using a few tones per phrase, with interpolation between them. The approach which was adopted by Pierrehumbert & Beckman (1988) in order to analyse Japanese pitch accent (based on Liberman 1975, Bruce 1977, and Pierrehumbert 1980) is now called the Autosegmental-Metrical (AM) theory (Ladd 1996).
In AM theory, all relevant pitch is represented by a single H or L autosegment or by a combination of the two. However, it is not clear what H and L tones actually mean (e.g. a pitch target, a surface tone or an f0 target). Nor is it made clear, given a specific f0 contour, on what basis H and L tones are actually identified. Bruce (1977: 131–143) initially identified intonational tones on the basis of the turning points in the f0 contour. Local maxima, then, correspond to high tones and local minima to low tones. In AM theory, however, tones do not always correspond to turning points, and turning points do not always reflect the surface phonetic realisation of tones (Pierrehumbert 1980). In terms of their surface phonetic realisation, phonological ‘High’ and ‘Low’ tones are not clearly defined in any work relating to the development of the AM theory (Ladd 1996: 104). As an answer to this problem, one might say that the high–low representation can be phonologically characterised in the ‘relative’ Jakobsonian sense (Anderson 1985: 116–139). That is, the first syllable, for example, is represented as L because it is relatively lower in f0 than the maximum f0 that is associated with the second syllable. The only condition is whether it is relatively high or relatively low, and the precise f0 value does not matter.
Some implications that follow from observed differences between OJ and KJ will be explored with respect to the tonal representation of KJ using AM theory (Pierrehumbert & Beckman 1988, Pierrehumbert & Hirschberg 1990). Specifically, some indeterminacies with respect to tone assignment which result from the lack of acoustic specification of a target tone in AM theory will be pointed out while discussing the tonal representation of KJ.
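The turning-point criterion attributed to Bruce above can be illustrated with a minimal Python sketch. The function name `turning_points` and the contour values are hypothetical and chosen purely for illustration; the sketch labels interior local maxima as H and interior local minima as L, and it does not model the AM position that tones need not coincide with turning points.

```python
def turning_points(f0):
    """Label interior local maxima as 'H' and local minima as 'L'.

    f0 is a sequence of f0 values at successive sampling points.
    Returns a list of (index, label) pairs. Plateaus and the two
    end points are ignored in this simplified sketch.
    """
    labels = []
    for i in range(1, len(f0) - 1):
        prev, cur, nxt = f0[i - 1], f0[i], f0[i + 1]
        if cur > prev and cur > nxt:
            labels.append((i, "H"))  # local maximum
        elif cur < prev and cur < nxt:
            labels.append((i, "L"))  # local minimum
    return labels


# Hypothetical contour (arbitrary units), not data from this study.
contour = [0.0, -0.5, 0.3, 1.5, 0.2, -2.0]
print(turning_points(contour))
```

On a gradually rising or falling stretch no turning point is found at all, which is one way of seeing why this criterion alone leaves the tone assignment for such stretches undetermined.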
2 Data and method
The data for analysis are taken from Sugito & Kakehi (2005), a DVD database of recordings of Japanese dialects collected from 1989 to 1993 and including a large number of citation forms of words and utterances, together with syllabic sounds and numbers. From the 86 OJ and 79 KJ native speakers of different generations recorded, the speech samples of 12 OJ speakers (six males, six females), and 14 KJ speakers (seven males, seven females) are used for this study. These speakers belong to the three generations of Old (aged 60 years and over), Middle (aged 40–59 years), and Young (aged 20–39 years). This particular grouping was selected because an initial auditory-based analysis identified a discrepancy between these three generations and the generations younger than 20 years old (the Sugito & Kakehi database also contains recordings from primary and secondary students). This discrepancy is presumably due to the susceptibility of primary and secondary students to the accent of SJ. As far as the three generations aged 20 or older are concerned, pitch patterns are fairly consistent across the speakers, maintaining the pitch contrast reported in previous studies and accent dictionaries. In particular, the speakers selected for this study behave in the same way with respect to the pitch patterns of the target words. The average age of the 12 OJ speakers is 57 years (sd = 18.6) for males and 46 years (sd = 14.8) for females. That of the 14 KJ speakers is 41 years (sd = 24.8) for males and 47 years (sd = 16.7) for females.
Table 3 contains the target words selected for describing the f0 realisation of the four target pitch patterns. A standard romanisation system for SJ is used to represent OJ and KJ target words, as there are no important segmental differences. The syllable structure of all target words is (C)V. The words were selected to control as much as possible for intrinsic vocalic and consonantal effects on f0 (Peterson & Barney 1952, Lehiste & Peterson 1961); to this end, various voiced and voiceless consonants and vowels are included in the target words. In Section 5.2 below, OJ and KJ will be compared in terms of the f0 realisation of LH, LHL, LLH and LLLH pitch patterns, particularly focusing on the initial syllable. As far as the vowel quality of the initial syllable is concerned, as can be seen from Table 3, OJ and KJ share the same distribution of the vowels (two /i/, four /u/, one /e/, two /o/ and four /a/ vowels), and thus the intrinsic effect arising from the vowel types is well controlled in this study.
All words were read once by each speaker, and recorded on professional digital equipment (16 kHz sampling rate, 16 bit quantisation).
After being segmentally annotated, all target tokens were analysed using the get_f0 function of the ‘Snack Sound Toolkit’ (Sjölander 2006). The get_f0 function is an implementation of ‘the robust algorithm for pitch tracking (RAPT)’ proposed by Talkin (1995), which is widely used for estimating f0. On the basis of the segmental annotation, f0 was sampled at the onset and every 20% point for each vowel. RAPT is very accurate in estimating f0, yet will never, of course, be totally free from errors (e.g. halving and doubling effects). The f0 values of each token were therefore visually inspected, and those tokens which contained obvious f0 tracking mistakes were re-analysed by setting different tracking thresholds. Identification of vowel onset or offset adjacent to an obstruent or a nasal is fairly straightforward from the audio speech waveforms and spectrograms; however, liquids and glides make the identification of the onset and offset of a vowel more difficult. In these cases, segmentation was done by visually identifying discontinuities in the acoustic parameters, such as waveform shape, F-pattern, and amplitude. Phonation offset in many falling-pitch tokens (e.g. LHL) was either abrupt or creaky. This showed as glottal pulse striations with jitter at the end of the vocalic F-pattern, causing the algorithm to extract f0 inaccurately. Thus, this portion was excluded from analysis.
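The sampling step described above (f0 at the vowel onset and at every 20% point) might be sketched as follows. This is not the Snack get_f0 function itself: the helper `sample_vowel_f0`, the (time, f0) track format, the use of linear interpolation between frames and the inclusion of the 100% (offset) point are all assumptions made for illustration, and the track values are invented.

```python
def sample_vowel_f0(track, onset, offset, n_steps=5):
    """Sample f0 at the onset and at each 1/n_steps of a vowel.

    track  : list of (time, f0) frames sorted by time (hypothetical format)
    onset  : vowel onset time, in the same units as the track times
    offset : vowel offset time
    With n_steps=5 the sampling points are 0%, 20%, ..., 100% of the vowel.
    """
    def f0_at(t):
        # Linear interpolation between the two frames surrounding t.
        for (t0, v0), (t1, v1) in zip(track, track[1:]):
            if t0 <= t <= t1:
                return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        raise ValueError("time lies outside the f0 track")

    return [f0_at(onset + (offset - onset) * k / n_steps)
            for k in range(n_steps + 1)]


# Invented track: one frame every 10 ms, f0 rising by 0.5 Hz per ms.
track = [(t, 100.0 + 0.5 * (t - 100)) for t in range(90, 211, 10)]
samples = sample_vowel_f0(track, onset=100, offset=200)
print(samples)
```

Sampling at proportional rather than absolute time points gives every vowel the same number of f0 values regardless of its duration, which is what makes later averaging across tokens and speakers possible.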
3 Normalisation
The aim of this study is to give quantified descriptions for both dialects of the linguistic-acoustic properties of the four pitch patterns in question. However, extracting the specific phonetic properties of a given language or variety is not an easy task, as different speakers have different acoustic outputs for what is perceived to be the same linguistic information. This is because ‘[t]he acoustic properties of the radiated speech wave are a unique function of a speaker's vocal tract anatomy, and since speakers’ vocal tracts differ, so will their acoustic output – even for phonetically the same sound’ (Rose 1987: 343). Therefore, the individual content (in other words, between-speaker differences) arising from individual anatomical differences needs to be factored out as much as possible by a process of normalisation to extract the linguistic-acoustic content of the target pitch patterns.
However, not all between-speaker differences are associated with different vocal tract anatomy: speakers can differ linguistically as well (Rose 2002: 175–194). If such linguistic between-speaker differences occurred in the topology of a given language, it would be very difficult to spot them without normalisation, since they would be mixed up with the non-linguistic anatomical differences. Thus, normalisation enables us to specify an invariant acoustic property from different acoustic outputs and present it as a linguistic-phonetic representation while quantifying the linguistically important between-speaker differences. This representation, then, can be compared across languages and dialects. Particularly for cross-language/dialect comparisons, ‘[n]ormalisation is the only way, other than (the unlikely) recourse to bilingual speakers, of testing transcriptionally based hypotheses on the nature of linguistic-phonetic tonal variation’ (Rose 1987: 344).
It has been mooted in Section 1 above that the use of H and L as a phonetic representation is a poor way of specifying the tonetic properties of a given language/dialect, as this kind of broad representation fails to identify the linguistic-tonetic differences between OJ and KJ. One might consider that the mean f0 curve obtained by simply averaging the raw data can be used as a linguistic-tonetic representation. Although such a representation might be able to provide a general idea of how f0 values change in a particular word, it is a linguistically poor representation, not only because it does not express the invariant aspect of given linguistic information (its units would still be meaningless Hz), but also because it fails to express the linguistically important between-speaker differences.
The z-score normalisation technique, the performance of which has been empirically attested for f0 (Rose 1991, Zhu 1994, Ishihara 2004a), is used in this study. The z-score normalisation procedure is: f0_norm = (f0_i − m) / s, where f0_i is the f0 value at the ith sampling point, and m and s are the arithmetic mean and standard deviation of f0_i (i = 1, 2, ..., n), respectively. The z-score normalisation parameters (m and s) were obtained for each speaker from a set of one- to four-syllable words. This set, which includes the target words listed in Table 3, is given in the appendix. The procedure which was used to sample f0 values from these words is the same as that explained in Section 2 above. In this particular normalisation procedure, each f0 observation is expressed as so many standard deviations above or below a speaker's overall mean f0 (= m). Thus, standard deviation (sd) is used as the unit for normalised f0 in this study.
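The normalisation procedure can be sketched in Python as follows. The function name `z_normalise` and the example values are hypothetical, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def z_normalise(f0_values, reference_values):
    """Express f0 values as z-scores relative to one speaker's reference set.

    f0_values        : f0 samples (Hz) of a target token
    reference_values : all f0 samples (Hz) from that speaker's reference
                       word set, used to estimate m and s
    Assumption: s is the sample standard deviation (statistics.stdev).
    """
    m = mean(reference_values)
    s = stdev(reference_values)
    return [(f0 - m) / s for f0 in f0_values]


# Invented values for one hypothetical speaker: m = 120 Hz, s = 20 Hz.
reference = [100.0, 120.0, 140.0]
print(z_normalise([100.0, 120.0, 140.0, 160.0], reference))
```

Because m and s are estimated per speaker, a value of sd = 1 means the same thing for a low-pitched and a high-pitched speaker: one of that speaker's own standard deviations above that speaker's own mean.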
The z-score normalisation parameters are given for each speaker in Table 4. As can be seen in this table, there are considerable differences in these parameter values across the speakers. For example, the range of the mean values is about 75 Hz (86.8 Hz–162.1 Hz) for the OJ male speakers while it is about 39 Hz (111.7 Hz–150.2 Hz) for the KJ male speakers (i.e. about half that of the OJ male speakers). As for the female speakers, the range of the mean values for the OJ females is approximately 73 Hz (148.7 Hz–222.1 Hz) whereas that of the KJ females is about 35 Hz (182.8 Hz–217.9 Hz), again about half that of the OJ females. As explained above, these differences are treated as being due to individual anatomical differences, and as factors to be removed in order to represent the invariant aspects of KJ and OJ tonalities by means of normalisation. This assumes of course that there are no differences between the dialects in f0 range.
4 Sample size and margin of error
The sample size of the current study (26 = 12 OJ speakers + 14 KJ speakers) satisfies the minimum of three people for each sex for quantified phonetic work suggested by Ladefoged (1997: 140). However, the number of speakers needed depends on the variance of the data, the margin of error and the desired confidence, and thus needs to be decided statistically. Given the sample size of this study (n = 26) and the mean of the speakers’ standard deviations (s = 24.8 Hz, calculated from Table 4), the margin of error (E) was estimated at 9.5 Hz with 95% confidence (z = 1.96). An error of 9.5 Hz is equivalent to sd = 0.38. Hence, when we compare the f0 realisation differences between OJ and KJ, any differences which fall within this margin of error (sd = 0.38) will not be considered.
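The reported figures can be reproduced with the standard margin-of-error formula E = z · s / √n, as in the sketch below. That this is the formula used, and that the conversion to sd units divides E by the mean standard deviation s, are assumptions made here; they happen to yield the values given in the text.

```python
import math


def margin_of_error(s, n, z=1.96):
    """Margin of error for a mean: E = z * s / sqrt(n) (assumed formula)."""
    return z * s / math.sqrt(n)


s = 24.8   # mean of the speakers' standard deviations (Hz), from the text
n = 26     # number of speakers, from the text

e_hz = margin_of_error(s, n)   # margin of error in Hz
e_sd = e_hz / s                # the same margin in sd units (assumed conversion)

print(round(e_hz, 1), round(e_sd, 2))
```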
5 Normalisation results
Figure 2 contains the mean of all speakers’ normalised f0 curves of the four pitch patterns, separately plotted for OJ (Figures 2a and 2c) and KJ (Figures 2b and 2d). Normalised f0 is plotted against mean absolute duration in order to preserve the f0 derivative. The range of f0 values is shown by vertical bars indicating one standard deviation above and below the mean. The temporal axis of Figure 2 indicates the average duration of words with each pitch pattern. This was obtained by summing the mean segment durations in the relevant target words across all speakers. For example, for the LH pitch pattern of OJ (see Figure 2a), the mean duration of the pitch pattern of 0.33 s is the sum of the mean duration of the first vowel = 0.09 s, that of the intervocalic consonant = 0.12 s, and that of the second vowel = 0.12 s.
The normalised f0 contours presented in Figure 2 can be considered as a type of linguistic-phonetic representation of KJ and OJ accentual patterns. These linguistic-phonetic representations can now be used for both intra-dialectal comparisons – for example, to investigate the acoustic realisation differences between different accentual types within a dialect – and for inter-dialectal comparisons – for example, to investigate the realisation differences in the LH pitch pattern across different dialects. Both types of comparison will now be exemplified.
5.1 Within-dialect comparison
Figure 2 shows that the mean f0 curves for the target pitch patterns lie between sd = c. −2.5 and sd = c. 1.5. Compared to other sampling points, a relatively large amount of between-speaker variation (sd = c. 0.7–0.8) is found at the onset and offset points of each vowel for some syllables. This can be seen in the vertical standard deviation bars in Figure 2. This is due to the perturbatory effect of the syllable-initial consonant, and the final creaky phonation which was referred to at the end of Section 2 above. Otherwise, between-speaker variation is within the range of sd = 0.4–0.6, which is comparable with previous studies (see Ishihara 2004a for KJ, Rose 1993 and Zhu 1994 for Shanghai).
In both OJ and KJ in Figure 2, it can be clearly seen that the high-pitch syllable of the LHL pattern, indicated by the arrows (panels a and b), is realised significantly higher in f0 than that of the LH pattern (unpaired two-tailed t-tests, p < .0001 at the highest point). This is a well-reported phenomenon associated with accented words in OJ (Kori 1987), and Type A words in KJ (Kubozono & Matsui 1996, Ishihara 2004a). The same phenomenon has been found in Standard Japanese, where it is called ‘accentually induced f0 boost’ (Kubozono 1993: 85). The implication of this is that pitch-accent is not merely tonal, but involves an additional parameter of prominence, as implied by the term ‘accent’.
5.2 Between-dialect comparisons
In Figure 3, the mean f0 curves are plotted separately according to the pitch patterns across the dialects (sd bars are omitted for clarity). The x-axis of Figure 3 is equalised duration (%).
A consistent difference between OJ and KJ is observed in the normalised f0 values associated with the L on the first syllable, indicated by arrows in Figure 3. The f0 values of the first syllable lie between sd = −1 and sd = 0 for OJ, while they are around sd = 0 for KJ. This observation for KJ is consistent with Ishihara's (2004a) finding that f0 declines around sd = 0 in KJ initial L syllables. Unpaired two-tailed t-tests indicate that the first syllable has significantly higher f0 values (p < .0001 at all sampling points) in KJ than OJ for all target pitch patterns. Note too that the difference between KJ and OJ at the first syllable is larger than the margin of error (sd = 0.38) estimated in Section 4 above at all sampling points. In the case of the LLLH pitch pattern (Figure 3d), for example, the difference between OJ and KJ in f0 value at the onset point is sd = 1.1 (OJ: sd = −0.52; KJ: sd = 0.58).
Due to this difference between OJ and KJ, the LLH and LLLH pitch patterns exhibit significantly different f0 contours between the two dialects in that f0 gradually rises from the first to the final syllable in OJ, while in KJ it gradually declines from the first to the penultimate syllable before rising in the final syllable. Due to the gradually falling contour observed in the L syllables of LLH and LLLH in KJ, the average maximum normalised value of the KJ first L syllable (sd = 0.45 for LLH and sd = 0.53 for LLLH) is significantly higher (unpaired two-tailed t-tests, p < .0001) than the average minimum normalised value of the penultimate L syllable (sd = −0.72 for LLH and sd = −0.80 for LLLH). For OJ, by contrast, as a result of the gradually ascending contour, the average minimum normalised value of the first syllable (sd = −0.85 for LLH and sd = −0.90 for LLLH) is significantly lower (p < .0001) than the average maximum normalised value of the penultimate L syllable (sd = −0.10 for LLH and sd = 0.02 for LLLH).
The normalised f0 curves of OJ shown in Figure 3 agree with Ikeda's (1951) auditory-based transcription of a rising pitch contour for low-pitch beginning words. However, the f0 curves given in Figure 3 provide more detailed information about the pitch contour which may not be captured by transcription.
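The comparison against the margin of error can be made explicit with a short Python check, using the onset values reported above for the LLLH pattern. The function name `exceeds_margin` is hypothetical; the check only tests whether a between-dialect difference is larger than the margin and is not a substitute for the t-tests reported in the text.

```python
MARGIN_SD = 0.38  # margin of error in sd units, from Section 4


def exceeds_margin(oj_value, kj_value, margin=MARGIN_SD):
    """Return the absolute OJ-KJ difference and whether it exceeds the margin."""
    diff = abs(kj_value - oj_value)
    return diff, diff > margin


# Normalised f0 at the onset of the first syllable, LLLH pattern (from the text).
diff, considered = exceeds_margin(oj_value=-0.52, kj_value=0.58)
print(round(diff, 2), considered)
```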
6 Discussion
Having identified differences between OJ and KJ in the acoustic realisation of their LH, LHL, LLH and LLLH pitch patterns, this section investigates how linguistic-tonetic and phonological representations are mapped onto each other. In particular, we explore some implications of the identified differences between these two dialects in their normalised acoustics with respect to the surface tonal representation of KJ using AM theory.
It was pointed out in Section 1.4 above that it is not clear in AM theory what H and L tones actually mean. For example, does an L tone mean that ‘f0 or pitch stays low until countermanded at some predetermined point’ or ‘moves to a low value’, or something else? Relating to this question, Pierrehumbert & Beckman (1988: 175) explain that ‘some of the tones have a specified duration, whereas all others are treated as points’, but they are not explicit about which tones have a specified duration or which are treated as a point. Nevertheless, once target tones are determined, they are linearly interpolated as a subsequent process (Pierrehumbert & Beckman 1988: 175).
Tonal representation in OJ is addressed first.
6.1 Osaka Japanese tonal representation
As introduced in Section 1.1 above, whether an OJ word starts with a high or a low pitch is determined lexically. The AM tonal representations of OJ low-pitch beginning accented (i.e. LHL) and unaccented (i.e. LLLH) words are shown in Figure 4. Both lexical and surface representations (from Pierrehumbert & Beckman 1988: 229) are shown together with the linguistic-tonetic representation – the corresponding mean z-score normalised f0 contours derived in this study. The lexical representations are given in the first row, and surface representations are given in the second. Figure 4 thus shows how the acoustic-phonetic reality relates to the phonological constructs. Note that the representations of an accented word (LHL) are given in the panels on the left of Figure 4 and those of an unaccented word (LLLH) are in the panels on the right.
6.1.1 LHL accented word
The different levels of the representations for the LHL accented word are given in the left-hand panels of Figure 4. In the lexical representation of the accented word (Figure 4a), the pitch accent is modelled by a sequence of HL tones linked to the accented syllable. The L tone, which models the low pitch at the beginning of the word, is a property of the word and is therefore linked to it, rather than to the word's syllables (Pierrehumbert & Beckman 1988: 229). At the surface, this L is linked to the word-initial syllable.
Although the words in Figure 4 were uttered as single words in a citation manner, they are not free from the tones associated with the prosodic levels higher than words. Thus, it is necessary to consider that these words are also associated with some tones which are linked to higher prosodic levels (e.g. the utterance). Since all target words were uttered in a citation manner as a statement, it is reasonable to posit a final boundary L tone linked to the final syllable of the words (see Figures 4c and 4d). This final boundary L tone is possibly a property of the utterance, causing a final f0 lowering (Pierrehumbert & Beckman 1988: 72–75).
It is now possible to specify how the phonological tones might map onto the normalised acoustics. As far as the L tones are concerned, both word-initial and pitch accent tones are mapped onto a normalised value of sd = c. 0, but the word-initial L is realised over the whole of the first syllable rhyme, whereas the pitch accent L is a point value at the beginning of the final syllable rhyme. The utterance-final L can also be considered as a point value, but associated with a value of sd = c. –2 (or 2 standard deviations below the previous L tone) at the end of the final syllable rhyme. The H of the pitch accent tone is realised at sd = c. 1.5 and is prolonged over the second syllable rhyme. The perturbations at the beginning and end of the rhymes can be modelled with a smoothing function.
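The mapping just described can be sketched computationally. The following minimal Python sketch treats a prolonged tone as two breakpoints and a point tone as one, and interpolates linearly between successive breakpoints. The normalised values (sd = 0, 1.5, 0 and −2) are those given above; the rhyme boundaries on the time axis (0.2–1.0, 1.2–2.0 and 2.2–3.0) and the function names are assumptions made purely for illustration, and no smoothing function is applied.

```python
from bisect import bisect_right

def breakpoints(tones):
    """Expand tone targets into (time, value) breakpoints.

    Each tone is (value, start, end); a point tone has start == end.
    """
    pts = []
    for value, start, end in tones:
        pts.append((start, value))
        if end > start:  # prolonged tone: hold the value until `end`
            pts.append((end, value))
    return pts

def interpolate(pts, t):
    """Linearly interpolate between breakpoints; clamp outside their range."""
    times = [p[0] for p in pts]
    if t <= times[0]:
        return pts[0][1]
    if t >= times[-1]:
        return pts[-1][1]
    i = bisect_right(times, t)
    (t0, v0), (t1, v1) = pts[i - 1], pts[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

# OJ LHL word; time unit = one syllable, rhyme spans are hypothetical.
OJ_LHL = [
    (0.0, 0.2, 1.0),   # word-initial L, prolonged over the first rhyme
    (1.5, 1.2, 2.0),   # pitch accent H, prolonged over the second rhyme
    (0.0, 2.2, 2.2),   # pitch accent L, point at the start of the final rhyme
    (-2.0, 3.0, 3.0),  # utterance-final L, point at the end of the final rhyme
]

pts = breakpoints(OJ_LHL)
contour = [interpolate(pts, t / 10) for t in range(2, 31)]
```

Under these assumptions the contour is level at sd = 0 over the first rhyme, level at sd = 1.5 over the second, and falls linearly from sd = 0 to sd = −2 over the final rhyme.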
6.1.2 LLLH unaccented word
The different levels of the representations for the LLLH unaccented word are given in the panels on the right of Figure 4. Since the LLLH word does not have an accent, its lexical representation (Figure 4b) consists of two word-level tones: an initial L, which accounts for the low-pitch onset as in the accented word, and a word-final H. In the surface representation, these tones are linked to the first and last syllables of the word, the last syllable being also linked to an utterance-final L, as in the accented word. As far as the mapping of these tones onto the normalised acoustics is concerned, the initial L is realised with a normalised f0 value of sd = c. −0.5, prolonged over the first syllable rhyme. The lower realisation of the word-initial L in the unaccented word suggests that the higher value for word-initial L (sd = c. 0) in the accented word is a result of assimilation to the following H, and this should therefore be accounted for by rule.
The word-final H is realised with a normalised f0 value of sd = c. 0. The big difference in the realisations of the H in the accented and unaccented words can be seen as resulting from either the effect of accentual boost on H when it occurs as part of the pitch accent, or the lowering effect of the utterance-final L on the unaccented word-final H, or both.
The gradually rising f0 after the initial syllable can be seen as an interpolation between the values on the initial and final syllable. The rising f0 on the final syllable possibly indicates that its word-final H is realised as a point value; the falling perturbation at the end of the rhyme is a typical phonatory offset effect. Although the f0 on the second and third syllables appears fairly level (and is therefore suggestive of prolonged values), this is probably due to the effect of the initial consonantal perturbations.
6.2 Kagoshima Japanese tonal representation
Having explained above how the phonological tones are mapped onto the normalised acoustics in OJ, we now start looking into KJ. More precisely, what we attempt here is to investigate what sort of phonological tones (or autosegmental tones) are necessary to represent KJ at the surface level based on the linguistic-phonetic representations of KJ. This will be discussed using the Type B LLLH pitch pattern as an example.
As described in Section 5.2 above, one of the significant aspects of KJ is a gradual fall on the initial L syllables. This gradual fall followed by a rise on the final syllable is a typical f0 contour for Type B words. For OJ LLLH unaccented words, the gradually rising f0 after the initial syllable was modelled by interpolating between the L tone associated with the first syllable and the H tone associated with the final syllable. Similarly, this gradual fall of KJ from the initial syllable to the penultimate syllable can also be seen as an interpolation between the tone associated with the initial syllable and that associated with the penultimate syllable. Likewise, the rise in f0 between the penultimate and the final syllables can be modelled by interpolating the tones associated with the penultimate and the final syllables, respectively. Since we do not know the values (e.g. L or H) of these three tones relevant to the Type B LLLH pitch pattern, we tentatively call these tones T1, T2 and T3, as shown in Figure 5, in which a tentative surface representation (first row) of KJ's LLLH pitch pattern is given together with its normalised acoustic representation (second row).³
Now that we can see how these phonological tones (T1, T2 and T3) are mapped onto the normalised acoustics, we can discuss what tone values (e.g. H or L) are appropriate for these tones. In this paper we focus only on T1 and T2, our main question being: What are the tone values for T1 and T2? As shown in Figure 5, the tone values of T1 and T2 will be discussed by referring to LLLH Type B as an example. The tonal value of T2 will be discussed first.
6.2.1 Kagoshima Japanese tonal representation: T2
As can be seen from Figure 5, T2 is linked to the penultimate syllable, and its normalised value (sd = c. –0.5) is the lowest of all the syllables. Although the f0 on the penultimate syllable continues to fall (indicating that T2 is realised as a point value), this fall may be due to a perturbatory effect of the consonant.
Previously, Ishihara (2000) investigated the f0 behaviour of the (L)HLnH(L) sequence (6 ≥ n ≥ 1) of KJ adjective + noun phrases (i.e. LHL.LLLLH, where a period stands for a word boundary) and reported that the f0 minima between the high-pitch syllables – which correspond to T2 – vary as a function of n in an exponential manner. This exponential decay is a typical characteristic of a target tone in a falling contour (Pierrehumbert & Beckman 1988: 35–46). Thus, this exponential f0 decay indicates that there is a target tone (T2) linked to the penultimate syllable for Type B words. As the syllable that T2 is linked with has the lowest normalised f0 value out of all syllables, an L tone is the appropriate tone value for T2, as represented in Figure 6.
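One way to picture an exponential approach to a target is sketched below. The functional form and all three parameter values (start, target and rate) are hypothetical and are not taken from Ishihara (2000). The sketch merely shows that, under exponential decay, the f0 minimum falls by ever smaller steps as n increases and never undershoots the target.

```python
import math

def f0_minimum(n, start=0.0, target=-0.5, rate=0.8):
    """f0 minimum (sd units) after n low syllables, decaying toward `target`.

    All default parameter values are illustrative assumptions.
    """
    return target + (start - target) * math.exp(-rate * n)

# Minima for n = 1..6, matching the range of n in the (L)HLnH(L) sequence.
minima = [f0_minimum(n) for n in range(1, 7)]

# Size of the drop from one value of n to the next.
steps = [a - b for a, b in zip(minima, minima[1:])]
```

The shrinking values in `steps` are what distinguishes an approach to a fixed target from a constant-slope decline, in which every step would be the same size.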
6.2.2 Kagoshima Japanese tonal representation: T1
As we can see in Figure 5, the word-initial T1 is realised with a normalised f0 value of sd = c. 0. Its realisation value is constant regardless of the length of the Type B words (see Figures 3c and 3d). As with T2, the f0 on the initial syllable continues to fall (indicating that T1 is realised as a point value); this could also be due to a perturbatory effect of the consonant.
As for the tone value of T1, there appear to be three possibilities (L, Ø and H), as presented in Figure 7. Following the traditional auditory-based pitch transcriptions of KJ words (i.e. LLLLHL and LLLLH), one might intuitively posit an L tone for T1. However, a few points must be addressed regarding this issue.
As has been reported, the normalised f0 values of the initial L syllable are around the mean (sd = 0) for Type B words. This means that the f0 values of the initial L syllable are around the average f0 of the speakers’ f0 distribution. It is possible to interpret the f0 associated with the initial L syllable as the most neutral f0 value that is near the mean of a speaker's f0 distribution. Considering this, it may not be necessary to assign a particular tone for T1 (Ø) as the f0 realisation of the initial L syllable is predictable from the information regarding a speaker's f0 distribution.⁴ A similar interpretation can be seen in Dainora (2001) regarding the intonation of English. In reference to an analysis of the intonation of English by Goldsmith (1978), in which the intonation was analysed in terms of H, M and L, Dainora (2001: 36) mentions that ‘the M does not represent a tone at all, but, rather, represents the neutral frequency used from the beginning of an utterance to the first pitch accent’. In Goldsmith's analysis, the M tone is always the first tone in the sequence and is not associated with an accented syllable. In the AM theory, as noted earlier, it is not clear on what basis low and high tones are actually identified in the f0 contour.
As an alternative value for T1, basing judgement solely on the falling f0 contour that Type B words exhibit on the initial L syllables, it is certainly possible to posit an H tone (particularly with respect to the concept of f0 maxima and minima in the ‘relative’ Jakobsonian sense, Anderson 1985: 116–139). However, positing an H tone for T1 is highly counter-intuitive from a perceptual point of view (personal communication, Prof. Haruo Kubozono, National Institute for Japanese Language and Linguistics, Japan) as the initial syllable in question is perceived and transcribed as a low-pitch syllable by many descriptive linguists (Hirayama 1957, 1960; Kibe 2000). At this stage, as far as the f0 realisation of citation words is concerned, it is unclear what value we should assign to T1 because all candidates given in Figure 7 appear to be plausible in one way or another. This point requires further investigation using units phonologically longer than words.
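The notion of sd = 0 as a speaker's neutral value can be made concrete with a small sketch of z-score normalisation. The f0 samples below are invented, and the choice to normalise raw Hz values with the sample standard deviation is an assumption of the sketch rather than a description of the procedure used in this study.

```python
from statistics import mean, stdev

def z_normalise(f0_hz):
    """Express each f0 sample in sd units relative to the speaker's own distribution."""
    m, s = mean(f0_hz), stdev(f0_hz)
    return [(x - m) / s for x in f0_hz]

# Hypothetical f0 samples (Hz) for two speakers with different ranges.
speaker_a = [100.0, 120.0, 140.0]
speaker_b = [200.0, 240.0, 280.0]

z_a = z_normalise(speaker_a)
z_b = z_normalise(speaker_b)
```

Although the two invented speakers differ by an octave in Hz, their normalised values coincide, and a sample at either speaker's mean (120 Hz or 240 Hz) maps to sd = 0. This is the sense in which a value near sd = 0 is predictable from a speaker's f0 distribution alone.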
Nevertheless, regardless of the tone value of T1, the falling f0 contour of Type B words can be modelled well by an interpolation between T1, which is linked to the initial syllable and whose normalised value is sd = c. 0, and the penultimate syllable L tone, which has a normalised value of sd = c. −0.5.
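The independence of the interpolated contour from the tone value of T1 can be illustrated as follows. This sketch assumes that syllable indices can stand in for time and that T1 and T2 are point targets at sd = 0 and sd = −0.5. The label attached to T1 plays no part in the computation. The class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Target:
    label: str    # phonological tone value, e.g. "L", "H" or "Ø"
    time: float   # syllable index used as a stand-in for time (assumption)
    value: float  # realised normalised f0 in sd units

def falling_contour(t1_label, n_syllables=4):
    """Interpolated f0 per syllable, from the initial to the penultimate one.

    Requires n_syllables >= 3 so that T1 and T2 sit on different syllables.
    """
    t1 = Target(t1_label, 0.0, 0.0)                 # T1: sd = c. 0
    t2 = Target("L", float(n_syllables - 2), -0.5)  # T2: sd = c. -0.5
    span = t2.time - t1.time
    return [
        t1.value + (t2.value - t1.value) * (i - t1.time) / span
        for i in range(n_syllables - 1)
    ]

# The three candidate tone values for T1 yield one and the same contour.
contours = {label: falling_contour(label) for label in ("L", "Ø", "H")}
```

Because only the realised values enter the interpolation, the choice among L, Ø and H has to be motivated on other grounds, such as the behaviour of T1 in units longer than the word.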
Thus, again, whichever of the three possibilities presented in Figure 7 represents the tone value of T1, the f0 realisation difference between OJ and KJ observed in the initial syllable of the pitch patterns concerned can be phonologically or phonetically accounted for. If the tonal value of T1 is either Ø or H for KJ, it is straightforward to phonologically understand the f0 realisation difference between OJ and KJ regarding the initial low-pitch syllables because OJ and KJ assign different tonal values to the initial tone (OJ: L vs. KJ: Ø or H). Although it is necessary to look into the f0 behaviour of KJ in larger phonological units such as sentences and utterances to determine the tone value of T1 (see Ishihara 2004b), it is sensible to consider that if we posit an initial boundary tone for KJ at T1, it should be a non-lexical tone because KJ does not have the same kind of lexical contrast as OJ for the initial boundary tone. Therefore, even if an L tone is proposed for KJ at T1, there is a significant difference between OJ and KJ in that the initial L tone of OJ is lexical while that of KJ is not. It is well known that a tone can be realised differently in f0 depending on whether or not it is a lexical tone, even if the same value is associated with it (Pierrehumbert & Beckman 1988, Kubozono 1993, Ishihara 2004a). That is, the f0 realisation difference observed at the initial syllable between OJ and KJ can be either explained by phonologically positing different tones for the first syllable (OJ: L vs. KJ: Ø or H) or by phonetically implementing different realisation rules for the initial L tone.
7 Conclusions
At the beginning of this paper, we argued that the previous auditory descriptions of OJ and KJ tonalities using H and L are not adequate for linguistic-tonetic purposes because they are not phonetically detailed enough, they fail to express linguistically important between-speaker differences, and they are not quantifiable. As a result, the previous auditory descriptions of OJ and KJ tonalities fail to capture the linguistic-tonetic differences between the two dialects that have been identified in this study for the LH, LHL, LLH and LLLH pitch patterns.
Besides this auditory-based representation, the current study was concerned with three levels of representation: acoustic, linguistic-tonetic and AM representations. Acoustic representation (i.e. the f0 contour of an utterance) is the most surface-level representation, whereas AM representation is a phonological representation. Linguistic-tonetic representation is situated somewhere in between. This study showed, first of all, how linguistic-tonetic representations can be obtained from speakers’ utterances (which can be acoustically represented). More precisely, linguistic-tonetic representations of OJ and KJ tonalities were derived from normalised acoustic representations for the LH, LHL, LLH and LLLH pitch patterns. These linguistic-tonetic acoustic representations can provide sufficient phonetic detail for characterising differences in the acoustic realisation of a tonal feature; they are able to capture the invariant property of the various acoustic realisations of given linguistic information; and they can express linguistically important between-speaker differences. Thus, the linguistic-phonetic representations that were established in this way for OJ and KJ can be used to make acoustic comparisons, not only between these two dialects, but also with other varieties of Japanese or other languages.
By comparing the linguistic-phonetic representations of OJ and KJ for the LH, LHL, LLH and LLLH pitch patterns, this study has first of all demonstrated that these pitch patterns of OJ have significantly lower f0 realisations than those of KJ at the initial L syllables. More precisely, the LHL pitch pattern starts at sd = c. 0 and the LH, LLH and LLLH patterns start at sd = c. −0.5 in OJ, while they start higher in KJ by sd = 0.5–1. An overall significant difference in contour shape can also be observed between OJ and KJ in that the former exhibits a gradual rise in the LLH and LLLH pitch patterns while the latter shows a gradual fall before a rise between the last two syllables.
Having established the linguistic-tonetic representations of OJ and KJ for the LH, LHL, LLH and LLLH pitch patterns, we have first demonstrated how OJ phonological representations are mapped onto its linguistic-phonetic representations. Then, based on the linguistic-tonetic representations of KJ, we discussed the phonological representation of KJ at the surface level using LLLH Type B words as an example. Through the discussion it was demonstrated that since it has not been clearly defined what H and L tones actually mean in the AM theory, the mapping between a linguistic-tonetic representation and its phonological representation (or AM representation) is not straightforward. In particular, the possibility of Ø tone as the initial tone was discussed, focusing on the fact that the normalised f0 values stay at sd = c. 0 in the first syllable in KJ.
Acknowledgments
This paper is a revised and extended version of a paper presented at Interspeech 2008 in Brisbane, Australia. This research was financially supported by the College of Asia and the Pacific, the Australian National University. The author is very grateful to Dr Phil Rose for his valuable comments and extensive discussions of earlier versions of this paper. Thanks are also due to the editors and three anonymous reviewers for their detailed comments. This paper is dedicated to Dr Phil Rose on the occasion of his retirement from the Australian National University.