
Segmenting words from natural speech: subsegmental variation in segmental cues*

Published online by Cambridge University Press:  22 March 2010

C. ANTON RYTTING*
Affiliation:
University of Maryland Center for Advanced Study of Language (CASL) and Department of Linguistics, the Ohio State University
CHRIS BREW
Affiliation:
Department of Computer Science and Engineering and Department of Linguistics, the Ohio State University
ERIC FOSLER-LUSSIER
Affiliation:
Department of Computer Science and Engineering and Department of Linguistics, the Ohio State University
*Address for correspondence: C. A. Rytting, 7005 52nd Avenue, College Park, MD 20742, USA. tel: +1 (301) 226-8883. e-mail: crytting@casl.umd.edu

Abstract

Most computational models of word segmentation are trained and tested on transcripts of speech, rather than the speech itself, and assume that speech is converted into a sequence of symbols prior to word segmentation. We present a way of representing speech corpora that avoids this assumption, and preserves acoustic variation present in speech. We use this new representation to re-evaluate a key computational model of word segmentation. One finding is that high levels of phonetic variability degrade the model's performance. While robustness to phonetic variability may be intrinsically valuable, this finding needs to be complemented by parallel studies of the actual abilities of children to segment phonetically variable speech.

Copyright © Cambridge University Press 2010

INTRODUCTION

One of the fundamental questions in child language acquisition is how children learn to segment running speech into words. Children's performance at segmenting words is a good predictor of later performance at other stages of language learning, including not only vocabulary building but also expressive language and general language ability (cf. Newman, Bernstein Ratner, Jusczyk, Jusczyk & Dow, 2006).

Most computational models of the word segmentation task treat it as a series of subtasks along these lines:

  1. Break the stream of speech into a sequence of acoustic segments.

  2. Convert these acoustic segments into abstract symbols or feature bundles.

  3. Find coherent subsequences within this sequence of symbols. Treat these subsequences as ‘candidate words’.

  4. Add these candidate words into a mental lexicon, combining identical subsequences into the same lexical entry, as instances of the same word.

Most models take steps 1–2 as a given (or as a necessary idealization), and focus on step 3, using transcriptions produced by adults (or dictionaries) as the source of the input. An alternative way of conceptualizing the word learning task involves a slightly different series of subtasks:

  1. Break the stream of speech into a sequence of acoustic segments.

  2. Represent these segments in a way that preserves the acoustic variation in each segment, such as probability vectors over possible symbols.

  3. Find coherent subsequences within this sequence of probability vectors (PVs). Treat these subsequences as ‘candidate words’.

  4. Add these candidate words into a mental lexicon, clustering very similar subsequences of PVs into the same lexical entry as likely instances of the same word.

We have developed and implemented a framework for providing alternative input to any computational model of word segmentation that can accommodate probabilistic input. This alternative input follows steps 1–2 as we propose them, using techniques from automatic speech recognition as a proxy for human auditory processing. We argue that this input is more realistic than the types of transcripts usually used to train and test computational models of word segmentation. We also demonstrate the use of this new input on one model of word segmentation (Christiansen, Allen & Seidenberg, 1998).

The paper is organized as follows. First, we review previous work modeling subsegmental variation and propose a method of preserving the phonetic variation found in audio input that is compatible with well-established evaluation metrics, and hence readily comparable across models. We then describe how the input data derived from this method is applied to the Christiansen et al. (1998) model of word segmentation. Two simulations follow: the first simulation tests the generality of the claims in Christiansen et al. (1998), by comparing the model's performance using symbolic, citation-form input against that using probabilistic, speech-derived input. To simplify the comparison, a version of the model without suprasegmental cues was used. The results of this simulation show that the chosen word segmentation model performs well given input with little phonetic variation, but not with the high levels of acoustic variation found in many recorded utterances in the Brent corpus (Brent & Siskind, 2001).

The second simulation tests the generality of the claims about how best to combine multiple cues. It compares two variants of the model that combine segmental information with a novel measure of speech clarity called segmental salience. This cue is used as a rough approximation to suprasegmental cues such as lexical stress, which are known to be of importance in word segmentation for English (Johnson & Jusczyk, 2001). The second simulation confirms that multiple cues can, under certain conditions, be combined to improve word segmentation performance. It also indicates that phonetic variation cannot be ignored when designing studies of the statistical learning of language. In addition, it suggests that more refined evaluation measures will be needed if we are to understand the advantages and disadvantages of particular statistical learning techniques.

Finally, we discuss the implications for other word segmentation models that rely heavily on segmental information, as well as opportunities for investigating word segmentation performance in non-ideal circumstances. Whether children are able to extract word boundaries from the more highly variable utterances in the Brent corpus is not clear; this warrants further investigation. As computational models of word segmentation investigate performance on concrete examples of running speech, more coordination with experimental measures of infant performance will be necessary to interpret the results.

RELATED WORK: MODELING SUBSEGMENTAL VARIATION

Computational models of infant language acquisition only rarely represent the subsegmental variability present in speech. Those models that have examined the effect of phonetic variation have used two approaches: first, the use of automatic speech recognition (ASR) technology, and second, stochastic resetting of phonological features.

Using automatic speech recognition

Carl de Marcken (1996) was perhaps the first to use ASR technology in modeling the word segmentation task. In order to avoid including information unavailable to the infant, de Marcken's model used the output of an automatic phone recognition (APR) system, explicitly excluding all higher-level linguistic information that would bias the system towards phone sequences more frequently found in canonical pronunciations of words. However, de Marcken's experiments used the Viterbi algorithm to reduce speech to a single most likely sequence of phones. Thus, de Marcken's approach does no more to model uncertainty or ambiguity at the segmental level than phone-level human-transcribed corpora, such as the Buckeye corpus (Pitt, Johnson, Hume, Kiesling & Raymond, 2005) used by Fleck (2008). Both over-commit to the phonemic identity of the segment before segmentation begins. The difference is that de Marcken replaces human phone recognition error with APR error.

Roy & Pentland (2002) do handle uncertainty: their CELL model is similar in many respects to the approach proposed and tested in this paper. However, since the CELL model focuses primarily on the task of word–meaning mapping within a multimodal domain, its performance was never compared to any other word segmentation model, nor was it tested on comparable corpora. Neither de Marcken (1996) nor Roy & Pentland (2002) provide evaluation metrics that can be directly compared with other models.

Simulating phonetic variation

Two connectionist approaches, Cairns, Shillcock, Chater & Levy (1997) and Christiansen & Allen (1997), simulate subsegmental variation in the input by stochastically inserting non-canonical combinations of phonetic features. The latter study is described in more detail below.

The Christiansen & Allen (1997) study (henceforth CA97) used the Carterette & Jones (1974) corpus as input, encoding each phoneme as a vector of binary phonological features. During testing, certain of these input features were randomly flipped from 0 to 1 or from 1 to 0, in order to test the effect of subsegmental variation on the connectionist model's performance. Those features which distinguished a particular phoneme from another American English phoneme were considered core features and left unchanged. Of the rest, certain features (on average about two per phoneme) were dubbed peripheral features and were randomly and independently toggled at various probabilities (four conditions: 0, 0·01, 0·05 and 0·1) in order to simulate subsegmental variation in the speech signal. It was found that this type of variation did not significantly alter the performance of the neural network, whether starting from citation forms or from the (human) phone-level transcriptions of the corpus.

CA97's implementation of subsegmental variation, while ingenious, is somewhat ad hoc. The distinction between core and peripheral features amounts to an implicit theory about segment confusability. As a theory, this leaves something to be desired, underestimating the probability of confusions that seem quite plausible for infants without top-down information. For example, since English obstruents are arranged in voiced–voiceless minimal pairs (e.g. /f/~/v/, /t/~/d/, /s/~/z/), the feature [voice] would be a core feature for all obstruents under CA97's assumptions. This would mean that it is impossible to misclassify any obstruent in terms of voicing (e.g. a /t/ for a /d/). In reality, voice-onset time is continuous, and the slope of human categorical perception, while steep, is known not to be absolutely vertical (cf. McMurray & Aslin, 2005), so two slightly differently tuned listeners certainly could perceive the value of the [voice] feature differently for a particular phone, particularly one near the VOT (voice-onset time) boundary for voicing. By not allowing for this possibility, CA97 effectively models the [voice] feature as a perfect step function for obstruents. While this is a retreat from one kind of idealization, it also introduces unlooked-for and cognitively unmotivated assumptions that should perhaps be eschewed.

Second, the model controls variation in each peripheral feature with a single number that is constant across time and surrounding context. This has the effect of spreading the subsegmental variation (or phonetic ambiguity) in the model evenly throughout each word and each utterance. Again, this is cognitively implausible: surrounding context, including position in the syllable (e.g. Redford & Diehl, 1999) and surrounding phones (e.g. Krull, 1990), plays a large role not only in allophonic variation but also in perception of phonemic distinctions and specific patterns of confusability.

Finally, infants experience subsegmental variation and phonemic ambiguity (and hence the possibility of phonemic confusion) throughout the course of acquisition. CA97 treats variation only at test time. This corresponds to a setting where the learner has access to very clear input (perhaps from a very cooperative caregiver) during training but must handle variability (maybe from other speakers) at later points. Unfortunately, speech science does not support the claim that even the clearest speech is free of variation. For prediction tasks like the one used in CA97, the use of subsegmental variation during training should apply to the target layer as well as to the input layer, since the training cannot presuppose any more certainty than the original input.

Now that high-quality audio data are available, a more direct and straightforward way of modeling phonetic variation is possible. We do not have to build a theory of phonetic variation, but can simply use the subsegmental variation already present in speech. We do this while retaining a model and metrics that are comparable to previous word segmentation models.

MODELING SUBSEGMENTAL VARIATION WITH PROBABILISTIC INPUT

Phone probability vectors

Most models of the word segmentation task make the following assumptions, or closely related variants:

  1. The input provided to the model is represented as a sequence of symbols – that is, a string drawn from a finite alphabet or phonemic inventory, augmented by a symbol for utterance boundaries. In the case of distributed (connectionist) representations, each combination of features maps to a symbol.

  2. Only one label (or combination of features) is associated with each position in the input string.

  3. The number of positions (or segments) in the input string is fixed: each utterance has a certain, known number of segments.

  4. The word segmentation task is framed either as (a) the task of identifying which pairs of adjacent segments should have word boundaries placed between them, or (b) the task of identifying which subsequences of contiguous segments (substrings) should be grouped together as words.Footnote 1

  5. Evaluation consists of measuring the accuracy of (a) boundary placement, (b) substring groupings (or word tokens), and/or (c) the building of a lexicon of distinct substrings (or word types).Footnote 2

We propose a different set of assumptions, which we argue are more realistic representations of the input infants receive. Specifically, we replace assumptions 1 and 2 with the following:

  1′. The input provided to the model is represented as a sequence of probability vectors over a finite set of symbols (or phoneset), augmented by a symbol for utterance boundaries.

  2′. Many labels can be associated probabilistically with each segment of the input string, as long as the total probability over all the labels sums to 1. The labels and probabilities associated with a segment of the input string constitute that segment's phone probability vector.

For the sake of maintaining the evaluation metrics commonly used in other models, and facilitating comparison with them, we keep assumptions 3 through 5 constant. Along with assumption 3, we further assume that each segment in an utterance is associated with a specific temporal region in that utterance's audio recording, and that there are no gaps or overlaps between adjacent segments. Finally, we frame the word segmentation task as a boundary-detection task (as in 4a, above) and evaluate the resulting segmentation by the three metrics listed in assumption 5, discussed in more detail below.
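To make assumptions 1′, 2′ and 3 concrete, the sketch below shows one way such an input representation could be encoded. It is a minimal illustration in Python, not the implementation used in this study; the class and field names are our own.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Segment:
    """One temporal slice of an utterance (assumption 3), carrying a
    phone probability vector (assumptions 1' and 2')."""
    start: float                   # onset time in seconds
    end: float                     # offset time in seconds
    phone_probs: Dict[str, float]  # phone label -> posterior probability

    def validate(self, tol: float = 1e-4) -> None:
        # Assumption 2': the probabilities over all labels must sum to 1.
        total = sum(self.phone_probs.values())
        if abs(total - 1.0) > tol:
            raise ValueError(f"probabilities sum to {total}, not 1")

# An utterance is a gap-free sequence of such segments; a special '#'
# symbol marks the utterance boundary.
Utterance = List[Segment]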

Automatic phone classification

Our input representation is sensitive to what Scharenborg, Norris, ten Bosch & McQueen (2005) refer to as ‘probabilistic acoustic detail’, but is more constrained than the phone lattice that they consider. In traditional ASR lattices, as in Scharenborg et al.'s description, competing phones may be of different lengths, and the task of the Viterbi algorithm is to choose the best sequence of phones to cover the acoustic input. The optimization process is a search that considers all reasonable possibilities for the number of phones, their identity and the positions of the phone boundaries. In our application, which needs to achieve comparability with CA97, this degree of generality cannot easily be accommodated. Instead we treat phone boundaries (hence also number of phones) as fixed. For each position in the segmented input, we use the relevant acoustic material to assign a posterior probability for each of the possible phone labels in the recognizer's repertoire. Where CA97 works with a sequence of phone labels that are assumed to be reliably known, we work with a sequence of probabilistic distributions over phone labels. If the speech were particularly clear and the phone classifier especially effective, these distributions might turn out to be sharply focused on the ‘correct’ single phones, but in the more prevalent cases where the signal is less helpful or the recognizer less effective the result will be a distribution that allocates significant posterior probability to several different phone labels.

One variant of automatic speech recognition (ASR) that is compatible with these assumptions is automatic phone classification (cf. Halberstadt & Glass, 1997). Automatic phone classification (APC), like the automatic phone recognition variant that de Marcken used, differs from standard ASR in the basic unit of recognition and the types of linguistic knowledge that it utilizes. In traditional ASR, the basic unit is the word, and the task is to identify the sequence of words most likely to have given rise to a particular audio signal. In addition to a phoneset and an acoustic model describing the ranges of audio input associated with each phone in the phoneset, an ASR system has a pronunciation dictionary (or vocabulary of words and their usual pronunciations) and a grammar to describe how those words are likely to be sequenced together.

In APR and APC, the basic unit is not the word but the phone. Hence, no pronunciation dictionaries or word-level grammars are used, because we cannot assume that babies have a vocabulary yet; the goal of word segmentation is to find the words in order to acquire a vocabulary. APC differs from APR in that the former assumes that the number of phones in an utterance and their boundary points are known. Hence, for each phone position and its associated temporal slice of audio signal, discovering the phone's identity may be construed as a classification task. APR does not make this assumption, which poses problems for evaluation. While the ‘hard-decision’ APR used by de Marcken returns unambiguous phone boundaries that can then be mapped to word boundaries, a ‘soft-decision’ APR system has probabilistic rather than predetermined phone boundaries; the choice-points for word boundaries are therefore probabilistic rather than a discrete set, which breaks assumption 3 and makes evaluation and comparison with transcription-based word segmentation models much less straightforward. For this reason, APC is adopted for producing the input representations for this study.Footnote 3

Obtaining the phone probability vectors

The conversion of the raw audio input into the sequence of phone probability vectors needed for the connectionist model's input is conducted in two stages: find the phone boundaries, and then generate the phone probability vectors between each pair of boundaries. Both stages are implemented using a previously developed ASR system based on the hidden Markov model toolkit HTK (Young et al., 2002). The first stage divides each utterance (using the utterance boundaries marked in the corpus) into discrete, one-segment time intervals. In the second stage, each time interval between two phone boundaries is treated as a separate ‘mini-utterance’ for purposes of phone classification. The HTK system is constrained to treat each mini-utterance as a single segment. More details concerning the implementation of both steps are given in Rytting (2007).

This method for calculating the phone probability vectors for each segment does not utilize all of the contextual knowledge available to a typical ASR system (such as lexical pronunciations and/or phonotactic grammars), since each segment is considered in isolation; the only linguistic knowledge provided is the segment boundaries and the set of acoustic phonetic models. Because of this lack of contextual knowledge, the acoustic models will be less accurate than those of a state-of-the-art system, but they will also preserve subsegmental variation in the signal, which is crucial for modeling the sorts of uncertainty that an infant listener might experience. By utilizing ASR models which possibly overestimate, rather than underestimate, the amount of variation seen by an infant, we do provide a significant challenge to the model; however, if a model is successful with this type of input then clearly it can handle smaller amounts of variation.

OVERVIEW OF SIMULATIONS

An overview of the Christiansen, Allen and Seidenberg model

In order to investigate the effects of realistic, speech-derived subsegmental variation on word segmentation, we have compared the performance of one influential model of word segmentation using both symbolic and probabilistic input. We focus here specifically on the multiple-cue connectionist model described in Christiansen, Allen & Seidenberg (1998; henceforth CAS98), which we will refer to generically as the ‘Christiansen model’. While a number of other models could in principle have been adapted to allow for probabilistic input (see, e.g., Batchelder (2002) for a review, and Fleck (2008), Frank, Goldwater, Mansinghka, Griffiths & Tenenbaum (2007) and Goldwater, Griffiths & Johnson (2009) for more recent models and empirical evaluations), the simple recurrent network at the basis of the Christiansen model is relatively easy to implement with widely available neural network toolkits and straightforwardly accepts probabilistic input.

CAS98 examines the interaction of multiple cues in finding ‘hidden structure’ in a sequence of observations. It hypothesizes that infants, while performing their primary task of learning the meaning of the language input they hear (or see), also engage in the immediate task of learning to predict observable linguistic events, such as the identity of the next phone, the level of emphasis given to the next phone, and whether or not the current phone precedes an utterance boundary. The finding of hidden structure such as word boundaries is a derived task that emerges as infants attend to immediate tasks with directly observable feedback. This view of the infant's task is similar to that assumed by Aslin, Woodward, LaMendola & Bever (1996), where it is hypothesized that infants could find word endings by trying to predict utterance endings, and extrapolating from those phones (or features) that predict upcoming ends of utterances.

The innovation of the Christiansen model lies in how multiple cues are combined. Building on Aslin et al.'s immediate-task paradigm, CAS98 adds further immediate tasks as catalyst tasks, and trains an Elman network on these tasks simultaneously. As multiple prediction tasks are learned simultaneously by the same network, the combined training constrains the network to find a better joint solution for the derived task of detecting hidden structure such as word boundaries. Specifically, CAS98 demonstrates that combining segmental and suprasegmental cues allows for greater performance at the word segmentation task than either cue alone.

Figure 1 shows a schematic view of the Christiansen network. The catalyst output units for phonemic identity and stress constrain the model during training, but have no direct effect on the model's placement of word boundaries. Only the utterance boundary marker (shown by the # symbol, upper right) determines whether or not a word boundary is posited.

Fig. 1. A schematic representation of the original CAS98 phon-ubm-stress network. Dark circles represent activated input units.

Goals of the simulations

In this paper we present two simulations illustrating the effect of probabilistic input on the Christiansen model. Simulation 1 examines the effects of probabilistic segmental input (i.e. phone probability vectors) on a version of the Christiansen network without lexical stress. Simulation 2 examines the impact of an additional cue (segmental salience). This additional cue is loosely analogous to suprasegmental cues such as lexical stress. Unlike Christiansen's lexical stress cues, it is directly derivable from the phone probability vectors with no additional processing of the speech signal.

Implementation and evaluation of the Christiansen model

Model design, training and testing

In both simulations the Christiansen model was re-implemented using the Conx module of the Pyro toolkit (Blank, Kumar, Meeden & Yanco, 2003). Elman networks (a type of simple recurrent network or SRN) were used, following CAS98. Each network was trained on one pass through the training corpus, using the same settings as CAS98 (learning rate of 0·1, momentum of 0·95, and initial weight randomization ranging from −0·25 to 0·25), then tested with the network weights ‘frozen’. In order to account for the natural variability in the networks, nine separate runs of training and testing were performed for each variant of each network, each run differing only in the randomized starting weights.
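As an illustration of this setup, the following sketch re-creates the network architecture and the CAS98 training settings in present-day PyTorch rather than the original Conx implementation; layer sizes follow the 37-70-37 phon-ubm network used in Simulation 1, and everything else is our own scaffolding.

import torch
import torch.nn as nn

class ElmanSRN(nn.Module):
    """Simple recurrent (Elman) network: the hidden layer feeds back
    into itself via a context layer, as in CAS98."""
    def __init__(self, n_in=37, n_hidden=70, n_out=37):
        super().__init__()
        self.rnn = nn.RNN(n_in, n_hidden, batch_first=True)
        self.out = nn.Linear(n_hidden, n_out)
        # CAS98 randomized initial weights in the range [-0.25, 0.25].
        for p in self.parameters():
            nn.init.uniform_(p, -0.25, 0.25)

    def forward(self, x):
        hidden_states, _ = self.rnn(x)
        return torch.sigmoid(self.out(hidden_states))

model = ElmanSRN()
# CAS98 settings: learning rate 0.1, momentum 0.95, one pass through the
# training corpus; the nine runs differ only in the random seed.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.95)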

Input representations

One point of difference between the original CAS98 study and the simulations described here is in the feature representation of the input to the connectionist model. Like Aslin et al. (1996) and Cairns et al. (1997), CAS98 represents each segment in the input corpus as a bundle of phonological features rather than as a discrete symbol. However, it is also possible to use symbolic input representations within a connectionist framework, by using a localist (or ‘one-hot’) representation, such that each symbol in the relevant phone set has one input unit uniquely associated with it. The original CAS98 study uses localist representations for their models' segmental output and target layers ‘to facilitate performance assessments and analyses of segmentation errors’ (p. 236).

We have conducted studies (not reported here) that examine the effect of input representation on the Christiansen model. For strict replication, we examined the original feature set in CAS98. This feature set contains some flaws (as pointed out in e.g. Fleck, 2008), so we also examined an arguably more realistic feature set, found in Christiansen, Conway & Curtin (2005). Finally, we examined a localist input representation matching CAS98's output and target layers. In general, the localist input representation performed as well as or better than either of the two distributed representations, so we report only the results of the localist representation here. The patterns observed from the distributed representations are not sufficiently dissimilar to affect the overall findings.

Evaluation procedures

Since the network is supposed to generalize from utterance boundaries to all word boundaries, the activation of the output unit corresponding to the utterance boundary marker (UBM) is used to determine the model's level of belief in a word boundary after a given segment. To calculate precision and recall (defined below), CAS98 posits a word boundary whenever the UBM output unit registers a greater-than-threshold activation. Following Aslin et al. (1996), the threshold used for determining a posited word boundary is the average activation for the UBM output unit over all positions. While this method of evaluation is not the only one provided by CAS98, it is the method most closely comparable to evaluations of other models in the literature, so we adopt it here in reporting the results of the Christiansen model and its variants on new input.Footnote 4
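In code, this decision rule reduces to a comparison against the corpus-wide mean activation. A minimal sketch (the function and array names are ours):

import numpy as np

def posit_boundaries(ubm_activations: np.ndarray) -> np.ndarray:
    """Posit a word boundary after every segment whose utterance-boundary
    (UBM) output activation exceeds the mean activation over all
    positions, the threshold adopted from Aslin et al. (1996)."""
    return ubm_activations > ubm_activations.mean()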

Results are reported in terms of precision and recall (referred to as accuracy and completeness in CAS98) for boundaries, word tokens and word types, where precision equals the proportion of true positives to the sum of true and false positives, and recall is the proportion of true positives to the sum of true positives and false negatives. Unlike CAS98, which only reports one run of the neural network, all simulations reported here take the mean values of true positives, false positives and false negatives ($\overline{N}_{tp}$, $\overline{N}_{fp}$, $\overline{N}_{fn}$) over nine separate runs with different (randomized) initial weights, as shown in Equations 1 and 2.

(1)
$\text{Mean Precision} = \dfrac{\overline{N}_{tp}}{\overline{N}_{tp} + \overline{N}_{fp}}$

(2)
$\text{Mean Recall} = \dfrac{\overline{N}_{tp}}{\overline{N}_{tp} + \overline{N}_{fn}}$

Following CAS98, significance in precision and recall between two conditions is measured by comparing the mean numbers of true and false positives $\langle \overline{N}_{tp}, \overline{N}_{fp} \rangle$ for precision, and of true positives and false negatives $\langle \overline{N}_{tp}, \overline{N}_{fn} \rangle$ for recall, for each of the two conditions in a 2×2 χ2 test.
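A sketch of these computations, with SciPy's chi-square test standing in for the 2×2 χ2 test; the function names are ours, and the counts would be the tallies from the nine runs:

import numpy as np
from scipy.stats import chi2_contingency

def mean_precision_recall(tp_runs, fp_runs, fn_runs):
    """Equations 1 and 2: precision and recall computed from counts
    averaged over the nine runs."""
    tp, fp, fn = np.mean(tp_runs), np.mean(fp_runs), np.mean(fn_runs)
    return tp / (tp + fp), tp / (tp + fn)

def compare_conditions(n_tp_a, n_other_a, n_tp_b, n_other_b):
    """2x2 chi-square test on mean counts from two conditions: pass
    false-positive counts for precision, false-negative counts for
    recall."""
    chi2, p, _, _ = chi2_contingency([[n_tp_a, n_other_a],
                                      [n_tp_b, n_other_b]])
    return chi2, p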

The number of boundaries correctly found is of less interest than the number of words (tokens and types) correctly segmented. In order for a word token to count as correctly segmented, three conditions must hold:

  1. The word's beginning must be correctly identified.

  2. The word's end must be correctly identified.

  3. There must be no false-positive boundaries posited in between the beginning and end of the word.

The precision and recall over word types is calculated in the same manner as word tokens, except that each word type (or distinct string of canonical phones) is only counted once over the entire corpus. Type recall refers to the proportion of distinct words in the corpus found by the model (averaged over nine runs). Type precision refers to the proportion of distinct strings segmented and proposed by the model that correspond to actual words in the corpus, and corresponds to the ‘lexicon precision’ measure in Brent (1999). Since the Christiansen model does not compile a lexicon explicitly as part of its execution, the term ‘type’ precision is adopted here.

As with most computational models of word segmentation, no distinction is made in these metrics between different classes of words, such as function words vs. content words, nouns vs. verbs, or words that children show evidence of knowing vs. other words. Perfect recall means correctly segmenting all the words.Footnote 5
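The three token-level conditions can be checked directly against the gold segmentation. A minimal sketch, assuming boundaries are represented as the indices of the segments they follow within one utterance, and that utterance edges count as boundaries:

def count_correct_tokens(gold_bounds: set, posited_bounds: set) -> int:
    """Count gold word tokens segmented correctly: both edges of the word
    are posited (conditions 1 and 2) and no boundary is posited inside
    it (condition 3)."""
    edges = sorted(gold_bounds)
    correct = 0
    # Consecutive gold boundaries delimit one gold word token.
    for start, end in zip(edges, edges[1:]):
        inside = any(start < b < end for b in posited_bounds)
        if start in posited_bounds and end in posited_bounds and not inside:
            correct += 1
    return correct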

SIMULATION 1

Although CA97 gives us some indication of the Christiansen model's performance in the face of certain types of variation in its input, the flaws discussed above in the section ‘Simulating phonetic variation’ limit the conclusions that can be drawn from it. Simulation 1 seeks to overcome these limitations by comparing the performance of the Christiansen model on citation-form input with its performance using phone probability vectors (derived from audio-recordings as described in the sections ‘Modeling subsegmental variation with probabilistic input’ above and ‘Input data’ below) as input for both training and testing. CAS98 used the Korman (1984) corpus, freely available as part of the CHILDES collection of child-directed language corpora (MacWhinney, 2000). Since the sound recordings for the Korman corpus are too faint to be utilized by ASR, the audio-recordings for a subsection of the Brent corpus (Brent & Siskind, 2001), also available through CHILDES/TalkBank (MacWhinney, 2000), were used for Simulation 1.

METHOD

Input data

The input corpus for Simulation 1 was based on recordings of four mothers from the Brent corpus, identified by the codes c1, f1, f2 and q1, directed at infants aged 0;9 to 0;10·26. Since a large proportion of the experimental literature examining word segmentation focuses on infants between 0;7 and 0;11, it has here been assumed that the Christiansen model is best applied to input directed to infants within this age range. Recordings directed at infants older than 0;11 were excluded from this study as being beyond the age most appropriate for the model. Recordings earlier than 0;9 are rare in the Brent corpus and usually come from the mother's first recording session; they were excluded to avoid self-conscious speech and other effects of first-time recordings.

Using the transcriptions' CHAT codes, we removed utterances containing any type of input that might cause trouble for the forced-alignment step, including: whispered or sung speech; unintelligible, untranscribed or partial words; word play or pet names; and mentions of the family's last name (left untranscribed to preserve anonymity). This left 13,443 utterances for the four mothers. Using HMM-based acoustic models trained on the TIMIT corpus (Garofolo, Lamel, Fisher, Fiscus, Pallett & Dahlgren, 1993) of read speech, we phonetically aligned this subset of the Brent corpus, performing a forced alignment on the canonical pronunciations found in the CMU dictionary (1993). We utilized the resulting phonetic boundaries to segment each utterance into individual phones. We then calculated the average frame likelihood of the twenty best monophones for each segment and converted these likelihoods into posteriors by normalization.
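The final step of that pipeline might look as follows; this is a sketch assuming the per-segment scores are average frame log-likelihoods (as HMM scoring typically produces) and a flat prior over the twenty candidate monophones:

import numpy as np

def likelihoods_to_posteriors(avg_frame_loglikes: dict) -> dict:
    """Normalize one segment's average frame log-likelihoods (one per
    candidate monophone) into a posterior probability vector."""
    phones = list(avg_frame_loglikes)
    scores = np.array([avg_frame_loglikes[p] for p in phones])
    scores = np.exp(scores - scores.max())  # subtract max for numerical stability
    return dict(zip(phones, scores / scores.sum()))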

While in typical ASR tasks it is unusual to first train the models on the same material that will be evaluated, it should be noted that what we are trying to derive is an approximation of the phonetic confusability in the acoustics. Thus, if the models are trained on one phone but during testing they prefer another, this is a clear indication of acoustic confusability, and we can have more confidence that misrecognitions are not due to training/test mismatch.

In order to further increase the confidence in these phonetic materials, utterances with poor phone classification performance across the entire utterance were discarded. The performance of the phone classification across an utterance was calculated using a measure called approximate accuracy, defined as the proportion of phones whose correct (canonical) label appears within the classifier's top two guesses. Using this definition rather than exact accuracy allows for more of the desired variation while ensuring that the correct phone was a good candidate, suggesting that the automatic phonetic alignment process was valid.
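In code, approximate accuracy might be computed as follows (a sketch; each segment is assumed to pair its forced-aligned canonical phone with its phone probability vector):

def approximate_accuracy(segments) -> float:
    """Proportion of segments whose canonical phone appears among the
    classifier's top two guesses; `segments` is a list of
    (canonical_phone, phone_probs) pairs."""
    hits = 0
    for canonical, probs in segments:
        top_two = sorted(probs, key=probs.get, reverse=True)[:2]
        hits += canonical in top_two
    return hits / len(segments)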

Two subsets of the Brent corpus were created: one containing utterances with an approximate accuracy of at least 33·3% and more than one phone classified correctly (hereafter called ‘Large Brent’), and a second, higher-confidence subset containing utterances with an approximate accuracy of at least 60% (hereafter called ‘Small Brent’). The 60% cut-off point was chosen to represent a subset of utterances with levels of acoustic variation roughly comparable to those found in low-noise speech corpora such as TIMIT: the rate of canonical pronunciations is on the order of 60–80% in TIMIT, depending on stress and syllabic position (Fosler-Lussier, Greenberg & Morgan, 1999). The 33·3% cut-off, on the other hand, allows a more representative sample of the utterances in the Brent corpus, including a number of longer, more complex utterances. Each of these two subsets was further divided 90%−10% into training and test corpora, as shown in Table 1.

TABLE 1. Size of the training and test corpora for the two Brent corpus subsets

Model design

The training procedure was the same as for CAS98, with the exception that the input and target vectors used for the probability vector condition are not binary, but continuous (rounded to four decimal places) in the range [0, 1]. While training on probabilistic targets may be unusual, and is a departure from the way uncertainty is handled in CA97, this method of training is consistent with the assumption made here that infants at this stage of development do not have access to the phonemic identity of the target segment, except through the probabilistic cues encoded in the input.

In order to keep the phoneset consistent with CAS98, the phone probability vectors produced by the APC system were converted from the 61-phone TIMIT phoneset to the 36-phone MRC phoneset that CAS98 used. The twenty best monophones' posterior probabilities were normalized to sum to 1, and used as input activations for the corresponding input units.Footnote 6 A schematic representation of the network used is given in Figure 2.

Fig. 2. The phone-probability-vector network used in the probability-vector conditions of Simulation 1. Shades of grey in input units indicate graded activation.
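The collapsing-and-renormalizing step can be sketched as follows; the TIMIT-to-MRC mapping table itself is assumed given and is not reproduced here:

def to_mrc_vector(timit_posteriors: dict, timit_to_mrc: dict) -> dict:
    """Collapse a TIMIT-phoneset posterior vector onto the 36-phone MRC
    phoneset by summing the mass of all TIMIT phones that map to the same
    MRC phone, then renormalize so the activations again sum to 1."""
    mrc = {}
    for phone, p in timit_posteriors.items():
        target = timit_to_mrc[phone]
        mrc[target] = mrc.get(target, 0.0) + p
    total = sum(mrc.values())
    return {phone: p / total for phone, p in mrc.items()}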

For comparison purposes, we also provide a canonical (or citation-form) condition for both Brent subsets, in which the network is trained with fully symbolic input taken from the canonical pronunciations of each word as listed in the CMU dictionary, converted to the MRC phoneset (without stress information). This corresponds to the phon-ubm condition in CAS98. Since there is no exact correspondence to stress in the probability vector condition, the phon-ubm-stress condition is not reported for the Brent corpus. In addition, two baselines were used, following CAS98. The first (the utterance-as-word baseline) posits word boundaries only at utterance boundaries, treating each utterance as a single word. The other (the length-based baseline) learns from the training corpus the distribution of word lengths (in segments) but gathers no information relating to the identity of the segments; it then chooses word lengths randomly from the learned distribution, as sketched below.
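The utterance-as-word baseline needs no code: it simply posits boundaries at utterance edges. The length-based baseline might be sketched as follows (function and variable names are ours):

import random
from collections import Counter

def length_baseline(train_word_lengths, utterance_len, rng=random.Random(0)):
    """Sample word lengths (in segments) from their empirical training
    distribution until the utterance is covered, ignoring segment identity
    entirely; returns the set of posited boundary positions."""
    dist = Counter(train_word_lengths)
    lengths, weights = zip(*dist.items())
    boundaries, pos = set(), 0
    while pos < utterance_len:
        pos += rng.choices(lengths, weights=weights)[0]
        boundaries.add(min(pos, utterance_len))
    return boundaries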

RESULTS AND DISCUSSION

Simulation 1a: the Small Brent subset

Results are reported first for the smaller, more restrictive subset of the Brent corpus, shown in Table 2. This corpus subset (the Small Brent corpus) has a much shorter mean utterance length than the Korman corpus used in CAS98 or the Large Brent subset used in Simulations 1b and 2 (2·3 words per utterance, as opposed to 3·2 for Large Brent and 3·1 for Korman), and a much higher incidence of single-word utterances (46% for Small Brent, versus 28% for Large Brent and 26% for Korman). It follows that the utterance-as-word baseline will have substantially better recall on the Small Brent subset than on other corpora.

TABLE 2. Results from Simulation 1a: precision and recall for the 37-70-37 phon-ubm SRN trained and tested with canonical, citation-form input and with automatically phone-classified probabilistic input from the Small Brent corpus subset, compared with two baselines (Prec.=Precision; Rec.=Recall)

Simulation 1a.i: the canonical condition

The SRN using the canonical transcription performs above the length-based baselines for all measures except type recall (χ2(1)=10·0, p=0·0016 for boundary precision; p<0·001 for all other comparisons). As expected, the SRN's performance on boundary precision is trivially worse than the utterance-as-word baseline's perfect precision, and the boundary recall is (trivially) better. However, due to the large number of one-word utterances in the Small Brent subset, the utterance-as-word baseline outperforms the SRN on word precision as well (χ2(1)=30·6, p<0·001), and matches it on type precision (χ2(1)=0·0075, p>0·9). Therefore, the performance of the SRN is not as clearly superior to the baselines as it is for CAS98's simulations using the Korman corpus.Footnote 7

Simulation 1a.ii: the probability vector condition

The SRN trained and tested on the probability vector input performs fairly similarly to those using the canonical transcription. Just like the canonical-input SRN, the SRN using the probability vector input also outperforms the length-based baseline for all measures except type recall (χ2(1)=6·8, p=0·009 for boundary precision; χ2(1)=7·0, p=0·008 for word token precision; p<0·001 for all other comparisons). The two SRNs do not differ significantly on any of the six measures. Although the performance on word precision and recall appears to be lower for the probability vector condition, the differences are not statistically significant (χ2(1)=2·7, p=0·10 for word token precision; χ2(1)=1·9, p=0·16 for word token recall).Footnote 8

Simulation 1b: the Large Brent subset

Because the relatively small size and short average utterance length of the Small Brent subset made it difficult to distinguish the performance of the SRNs in the canonical and probability vector conditions from the baselines, it is necessary to examine a larger subset of the Brent corpus to obtain reliable figures. It is also useful to see how the Christiansen model (with the types of input examined here) fares with a greater degree of subsegmental variation than that present in the Small Brent subset. The Large Brent subset, more than double the size of the Small Brent subset, makes this closer look possible. Results for this corpus subset are shown in Table 3.

TABLE 3. Results from Simulation 1b: precision and recall for the 37-70-37 phon-ubm SRN trained and tested with canonical, citation-form input and with automatically phone-classified probabilistic input from the Large Brent corpus subset, compared with two baselines (Prec.=Precision; Rec.=Recall)

Simulation 1b.i: the canonical condition

The canonical SRN outperforms the length-based baseline on all measures (p<0·001 on all comparisons) except type recall (χ2(1)=0·26, p>0·6). As seen in Simulation 1a.i with the Small Brent corpus, the SRN performs worse than the utterance-as-word baseline on word precision (χ2(1)=11·0, p<0·001), but better than that baseline on boundary, word and type recall (p<0·001 on all comparisons). Unlike Simulation 1a.i, the SRN outperformed the utterance-as-word baseline on type precision (χ2(1)=22·1, p<0·001).

Simulation 1b.ii: the probability vector condition

Unlike Simulation 1a, for the Large Brent corpus the subsegmental variation affects performance significantly. The SRN trained and tested on the probability vector input performs significantly worse in all measures compared to the SRN trained and tested on the canonical input (p<0·001 for all comparisons), except type recall (χ2(1)=0·25, p>0·6). This drop in performance is sufficient to bring boundary and word precision down to the level of the length-based baseline (χ2(1)=1·99, p>0·1 for boundary precision; χ2(1)=0·03, p>0·8 for word precision). Still, boundary recall (χ2(1)=184·9, p<0·001) and word recall (χ2(1)=18·6, p<0·001) are significantly better than the length-based baseline, as is type precision (χ2(1)=8·71, p=0·0031).

DISCUSSION

Simulation 1a suggests that the Christiansen model, even without the stress cue, is robust to data with subsegmental variation when this variation is carefully controlled. This finding is consistent with previous tests of the Christiansen model in CA97. Because the near-continuous vector output of a phone classifier is a more accurate representation of human perception than the random feature-flipping done in CA97, this study provides added support for the model's basic robustness. The corpus used in this simulation is small and the language used very simple, so the claim of success has to be qualified.

Simulation 1b, performed on a larger subset of the Brent corpus, including utterances that are considerably more difficult both for the phone classifier and for the Christiansen word segmenter, shows that there is a point where the variation does cause significant degradation to the model's performance. The best interpretation for this degradation is not immediately clear. One possible explanation is that large-scale variation compromises the reliability of the segmental cues, such that it is no longer possible to find word boundaries using these cues alone.Footnote 9 Christiansen's network was explicitly designed to combine multiple cues in a plausible way, without direct supervision of the word segmentation task itself. Therefore, when moving to spoken language, it is worth looking for a set of more robust cues which can, in combination with the segmental cues, improve the model's word segmentation performance. This would be a direct validation of Christiansen's idea, and all the more compelling because of the use of more naturalistic input.

SIMULATION 2

To further study the degradation of performance observed in Simulation 1b, we introduce other cues. We do this while continuing to require that the cues used are plausibly available to an infant aged 0;8.

The lexical stress cue used by CAS98 was derived from a dictionary, so it cannot be assumed to be available directly. However, there is ample evidence that infants of the appropriate age use many of the acoustic cues associated with lexical stress, such as pitch, duration and spectral tilt (Thiessen & Saffran, 2004). It is also likely that they are able to distinguish between degrees of care in the articulation of a syllable or segment.

We measure correlates of articulatory care, estimating them from the phone probability vectors already available in the APC system. Our use of these cues is motivated in part by an assumption that babies benefit most from stretches of speech that they can readily interpret. This is an extrapolation of findings that infants prefer speech over non-speech, and child-directed speech over adult-directed speech (e.g. Fernald, 1985). This cue may also be interpreted as a proxy measure of ‘local hyperarticulation’ (Cho & Keating, 2007), associated in English both with lexical stress (e.g. de Jong, 1995) and with the beginnings of words (e.g. Fougeron & Keating, 1997).

Finding the start of these salient, more easily interpretable stretches may facilitate word learning. Simulation 2 therefore incorporates an additional cue to signal the onset of an acoustically distinct stretch of speech, or region of local hyperarticulation.

We assume here that the confusion matrix of the automatic phone classifier approximates the perceptual confusions of an infant learner of the language. If the APC assigns a high posterior probability to just one phone, and much lower probabilities to all the others, we assume that this reflects a careful and clear pronunciation of that phone. Conversely, if the phone classifier's activation is spread nearly equally across a large number of possible phones, we treat this as evidence that the phone is unclear and/or sloppily articulated. Our ideal measure would be based on the entropy of the posterior distribution; we refer to this measure as segmental confidence, and approximate it here by the maximum activation value output by the phone classifier, regardless of which phone that value is associated with.

METHOD

Input data

The input data used in Simulation 2 is the same as that used in the probability vector condition of Simulation 1 (i.e. Simulations 1a.ii and 1b.ii), using the same two corpus subsets (Small Brent and Large Brent) as above. As the patterns observed in the two corpus subsets are similar, and the Large Brent corpus allows differences between the two conditions to be seen more clearly, only the results for the Large Brent corpus are reported here.

Model design

In Simulation 1, the activation of each input unit is determined by the posterior probability of each phone given the acoustic signal for a particular time interval of the speech signal (corresponding to the duration of a single phone in the forced alignment). In Simulation 2, we use these same input units, but augment them by adding input units corresponding to the degree of perceived clarity of the input. We operationalize the model's confidence in the input by taking the maximum probability value in the segment's phone probability vector, including it as an additional cue. The segmental confidence for segment position t ($SC_t$) may be written as in Equation 3, where A is the phoneset, $X_t$ is the acoustic information at position t, and $\Pr[Q_t = q \mid X_t]$ is the probability of the phone q occurring at timestep t.

(3)
$SC_t = \max_{q \in A} \Pr[Q_t = q \mid X_t]$

Just as stretches of phonemically distinct speech are found by measuring areas of high segmental confidence, the beginnings of such stretches can be found by noting segments where the segmental confidence is larger than that of the preceding segment. The delta segmental confidence for segment position t ($\Delta SC_t$) is approximated here with the following function:

(4)
$\Delta SC_t = \begin{cases} SC_t - SC_{t-1} & \text{if } SC_t > SC_{t-1} \\ 0 & \text{if } SC_t \le SC_{t-1} \end{cases}$

For this simulation, we assume that both the absolute segmental confidence for a given segment and the amount of increase in segmental confidence from the previous segment are equally important for detecting the starting points of potential islands of reliability. Accordingly, we add two additional input and output units corresponding to $SC_t$ and $\Delta SC_t$, and remove sufficient hidden and context units to keep the number of weighted connections as close to constant as possible.

In this simulation, as in Simulation 1, two distributed, feature-based representations of the input were tested, corresponding to the representations used in CAS98 and Christiansen et al. (2005), in addition to a localist representation. As in Simulation 1, the two feature-based input representations show the same trends as the localist representation, and do not perform significantly better; hence, nothing of interest is gained in reporting them. Therefore, only the performance of the localist representation is reported here.

Evaluation procedure

Simulation 2 seeks to answer two questions: first, whether an automatically derived non-segmental cue, such as segmental confidence as defined above, can serve as a useful ‘catalyst task’ analogous to the way the original Christiansen model used lexical stress; and second, whether the information contained in the segmental confidence cue is helpful for finding word beginnings when combined directly with the Christiansen model's utterance boundary prediction task. Accordingly, the evaluation for Simulation 2 is conducted twice, as two separate sub-experiments.

The first evaluation procedure (Simulation 2a) treats segmental confidence and ΔSC simply as additional catalyst features in the input and output levels of the Christiansen network, just as lexical stress was treated in CAS98. The method for positing word boundaries is the same as in the original Christiansen model and Simulation 1, depending only on the UBM output unit's activation at the previous segment – that is, a boundary is posited between segments (t−1) and t only if $actv(UBM)_{t-1}$ is greater than the mean activation for the UBM output unit. This evaluation procedure intuitively measures the effect of the segmental confidence cues on the network's ability to generalize from ends of utterances to ends of words.

The other evaluation (Simulation 2b) uses a new method of combining the network's prediction of an utterance boundary at segment position t (i.e. $actv(UBM)_{t-1}$) with the network's judgment of whether or not the segment at position t begins a region of clear speech. We name this second criterion segmental salience (SS) and define it as the sum of absolute and delta segmental confidence: $SS_t = SC_t + \Delta SC_t$. The full decision rule compares the product of the UBM activation at the previous segment and the segmental salience at the current segment to a threshold. Analogous to Simulation 1, the threshold is set equal to the mean value of this product over the entire corpus, as shown in Equation 5. This evaluation method examines the effect of the segmental confidence cues (combined as a single segmental salience cue) on predicting word boundaries directly, without presupposing their effect on generalizations from utterance boundaries. A schematic illustration of the modified Christiansen network with this decision rule is given in Figure 3.

(5)
$\mathit{Boundary}(t-1, t) = \begin{cases} 1 & \text{if } actv(UBM)_{t-1} \cdot SS_t > \overline{actv(UBM)_{t-1} \cdot SS_t} \\ 0 & \text{otherwise} \end{cases}$

Fig. 3. The 39-68-39 phon-ubm-SC network for two subsequent time-steps, t−1 (bottom) and t (top), for the ‘segmental salience as criterion’ condition tested in Simulation 2b.
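Pulling Equations 3 through 5 together, a minimal sketch of the Simulation 2b decision rule; for brevity the threshold mean is taken over the single array passed in, whereas the simulation computes it over the entire corpus:

import numpy as np

def salience_boundaries(prob_vectors, ubm_activations):
    """Posit a boundary between positions t-1 and t whenever
    actv(UBM)_{t-1} * SS_t exceeds its mean (Equation 5)."""
    sc = np.array([max(pv.values()) for pv in prob_vectors])  # Eq. 3: max posterior
    dsc = np.maximum(np.diff(sc, prepend=sc[0]), 0.0)         # Eq. 4: positive increments only
    ss = sc + dsc                                             # segmental salience
    score = ubm_activations[:-1] * ss[1:]                     # one score per boundary site
    return score > score.mean()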

RESULTS AND DISCUSSION

Simulation 2a: segmental confidence as an additional catalyst task

Simulation 2a examines the performance of the SRN trained and tested on the probability vector input of the Large Brent corpus subset with the two segmental confidence cues (absolute and delta) used only as additional immediate tasks for the network to solve. This scenario is directly analogous to the phon-ubm-stress condition in CAS98. However, unlike in CAS98, the extra cue yields no improvement in performance. The SRN using the extra cues performs no better than the SRN trained without segmental confidence, shown in Simulation 1b.ii. Although it shows a slight improvement in terms of boundary recall, it performs worse in terms of boundary precision (χ2(1)=4·41, p=0·0358) and word token precision (χ2(1)=4·52, p=0·0335), and no better in word token recall, type precision or type recall.

Simulation 2b: segmental salience as an additional criterion

When the segmental salience cue is used as an additional criterion for placing word boundaries, it does improve performance: the ‘segmental salience as criterion’ condition outperforms the ‘segmental confidence as catalyst’ condition tested in Simulation 2a. While performance in boundary recall is worse than in Simulation 2a, the segmental salience cue improves performance in boundary precision (χ2(1)=46·43, p<0·001), word token precision (χ2(1)=78·09, p<0·001), word token recall (χ2(1)=10·68, p=0·0011) and type recall (χ2(1)=10·64, p=0·0011).

Analogous differences are observed when comparing the results of Simulation 2b to those for the recognized input without any additional cues (Simulation 1b.ii). While performance in boundary recall for the ‘segmental salience as criterion’ condition is worse than in the probability vector condition of Simulation 1b.ii, performance in boundary and word token precision is better (p<0·001 for all comparisons). Word token recall also improves (χ2(1)=6·55, p=0·0105), as does type recall (χ2(1)=5·53, p=0·0187).

Performance is in some measures comparable to the canonical phon-ubm condition in Simulation 1b.i. While performance in boundary and word recall is worse than in Simulation 1b.i, performance is better in type recall (χ2(1)=8·49, p=0·0036). Indeed, the ‘segmental salience as criterion’ condition shows a stronger performance in type recall than any other variant of the Christiansen model tested on the Brent corpus. Results for Simulation 2 (both 2a and 2b) are given in Table 4.

TABLE 4. Results from Simulation 2: precision and recall for the 39-68-39 phon-ubm-SC SRN trained and tested with automatically phone-classified probabilistic input from the Large Brent corpus subset, aided by segmental confidence information added and evaluated (i) as an extra catalyst task and (ii) as an extra word boundary placement criterion, compared with two baselines (Prec.=Precision; Rec.=Recall)

DISCUSSION

Simulation 2 seeks to test two claims of CAS98. First, and more generally, it tests the claim that an algorithm that combines multiple cues to word segmentation performs better than any one cue alone. Second, and more specifically, it tests whether a single Elman network can effectively combine multiple cues via simultaneous training on multiple prediction tasks (or ‘catalyst’ tasks), without direct supervision on the target task (word segmentation).

A replication of CAS98 (not presented here) found evidence for both of these claims when the training and testing corpora used symbolic transcriptions without subsegmental variation, and dictionary-derived stress cues rather than hyperarticulation cues. Simulation 2, which incorporates larger amounts of variation than Christiansen reports testing, also finds evidence for the first, more general claim. When Christiansen's word segmentation model is faced with highly variable, potentially ambiguous input, multiple probabilistic cues still outperform each cue separately – provided the evaluation procedures consider all the cues. When segmental salience is combined with the activation of the utterance boundary marker as part of the final decision rule for positing word boundaries (as in Simulation 2b), performance improves over the use of the recognized segmental information alone.

However, when segmental confidence and delta segmental confidence are used as catalyst features for predicting utterance boundaries only, and utterance boundary prediction alone is used to posit word boundaries, performance is no better than when using recognized segmental information alone. This result was surprising, and it is not yet fully understood. Possible implications of this finding are discussed further in the next section.

One interesting finding in Simulation 2b is the exceptionally high type recall – the fraction of distinct word types in the corpus that were correctly segmented – in the ‘segmental salience as criterion’ condition, compared with other conditions tested on the Large Brent corpus subset. This is analogous to the high type recall of the phon-ubm-stress condition in the original CAS98 study, suggesting that the segmental salience cue may contribute in this condition much as the stress cue did in CAS98. While space does not allow for a detailed study of this finding, it appears that the segmental salience cue acts similarly to lexical stress, in that it correctly segments stressed, word-initial syllables from the preceding word, as in ‘what#cha#doing’ and ‘are#you#fish(ing)’, where other variants of the model miss the marked word boundary. The SS cue likewise prevents the placement of word boundaries before non-word-initial syllables without primary stress, as in ‘fish(ing)’ and ‘peeka(boo)’, where other model variants split these up. It also helps within syllables, keeping together complex codas such as ‘than(k)’, ‘plan(t)’, ‘blin(d)’ and ‘tir(ed)’, where other variants split off the final stop. Since these stops are known to have different allophones word-initially vs. word-finally, it seems plausible that the ‘segmental salience as criterion’ variant, aided by the probabilistic input, is learning and using this allophonic information to place the word boundary correctly in these examples.

GENERAL DISCUSSION AND CONCLUSIONS

Despite the growing number of high-quality audio corpora available, such as the Brent corpus, most models of the word segmentation task represent an utterance as a sequence of symbols. For older models, this idealization was no doubt necessary given the available resources at the time. Nonetheless, it is problematic. It is doubtful that babies perceive phonemes with complete reliability and accuracy – or even that adults are completely reliable and accurate at perceiving phonemes independently of ‘top-down’ information from higher levels. Certainly, infants as young as 0 ; 10 can, in certain conditions, distinguish between phonemes in their soon-to-be-native language while successfully disregarding distinctions that are non-phonemic in that language (Werker & Tees, 1984). Still, it does not follow that babies are able to apply this ability to all tasks or situations.

Even if infants are able to apply their phonemic knowledge to the word segmentation task, it does not follow that they do so with 100% accuracy, as mentioned above. Polka & Rvachew (2005) report a discrimination accuracy of 80% for healthy infants between 0 ; 6 and 0 ; 9 on a simple, two-alternative choice between /bu/ and /gu/. In running speech, infants' accuracy is probably lower still, at least until they learn enough contextual information to begin applying top-down processing. Indeed, some phonemic pairs are indistinguishable until relatively late: babies natively learning Tagalog do not perform above chance in distinguishing /n/ from /ŋ/ in syllable onsets until 0 ; 10, even though the two are phonemically distinct in that position in Tagalog (Narayan, Werker & Beddor, in press).[10]

Even for phonetically trained adults, 100% accuracy of phone identification without the benefit of higher-level cues seems unrealistic. For the Buckeye corpus, which contains spontaneous adult-directed speech, inter-transcriber agreement was 80·3% overall (kappa=0·797), with unanimous agreement on only 62% of the segments (Pitt et al., 2005). In contrast, models that use phonemic transcriptions of words implicitly assume that infants are receiving, with 100% accuracy and confidence, exactly what the majority of transcribers heard – or (in the case of word-level transcriptions) what the pronunciation dictionary dictates.

If the assumption of 100% accuracy on phonemic discrimination or identification is unrealistic, then it follows that one cannot tell with confidence which computational approaches to word segmentation are most promising based solely on their performance on idealized, symbolic input. The simulations reported here, particularly Simulation 1b, demonstrate that the Christiansen model can perform quite differently on canonical transcription-derived input than on probabilistic, audio-derived input. Since most recent word segmentation models have not been extended to model the latter type of input, the effects of the acoustic variation and ambiguity found in natural speech on the performance of these models are still unknown.

One way to move beyond a reliance on transcriptions alone is to use automatic speech recognition. While this is not a novel idea, previous work has used corpora that either were not widely available (e.g. Roy & Pentland, 2002) or were not child-directed speech (e.g. de Marcken, 1996). With the Brent speech corpus, it is now possible for the community to compare models on a common corpus of speech input. It is hoped that many models will be re-examined with these or similar data, as the Christiansen model has been here.[11]

Testing the Christiansen model with ASR-based input

Two basic claims of CAS98 are (1) that hidden structure can be learned implicitly by training a model on an immediate task (such as segmental prediction) over observable cues, and (2) that the combination of several such cues allows for better learning of hidden structure (such as word boundaries) than any single cue. Simulation 1 re-examines the first of these claims, taking into account the fact that even so-called ‘observable’ cues are probabilistic and potentially ambiguous. It re-evaluates the Christiansen model as described in CAS98 with input data derived from the Brent audio corpus via ‘soft-decision’ automatic phone classification, as sketched below. Simulation 1a suggests that, when the degree of variability in the audio signal is kept within certain bounds, the CAS98 model is robust to this subsegmental variation. However, in Simulation 1b, using a larger and more variable corpus, performance was significantly worse in the recognized condition than in the canonical (or citation-form) condition.
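As one way to make this input representation concrete, here is a minimal sketch of turning frame-level phone posteriors into one probability vector per segment. The frame posteriors, segment spans and function name are hypothetical stand-ins for the actual ASR front end used in the study.

```python
import numpy as np

def phone_probability_vectors(frame_posteriors, segment_spans):
    """Collapse frame-level phone posteriors into one probability
    vector per segment ('soft-decision' phone classification).

    frame_posteriors : (n_frames, n_phones) array of per-frame posteriors
    segment_spans    : list of (start_frame, end_frame) pairs, one per segment
    """
    vectors = []
    for start, end in segment_spans:
        # Average the posteriors over the segment's frames ...
        v = frame_posteriors[start:end].mean(axis=0)
        # ... and renormalize so each vector sums to 1 (cf. footnote 6).
        vectors.append(v / v.sum())
    return np.vstack(vectors)
```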

These results cast some doubt on the robustness of the Christiansen model's ability to segment ambiguous input on the basis of segmental cues alone. Several alternative interpretations of this result are possible. Perhaps the most obvious is that segmental information is inherently less reliable than previous word segmentation models (based largely on segmental information) have suggested, and hence that the need for other (e.g. suprasegmental and/or subsegmental) cues is greater than supposed. This is consistent with research suggesting that non-segmental cues trump segmental cues for infants aged 0 ; 8 when the two conflict (Johnson & Jusczyk, 2001). It is also consistent with recent findings on a closely related problem, sentence boundary detection in speech. Liu (2004) finds that lexical methods of sentence boundary detection are less robust in highly variable speech than prosodic methods, because the high word error rate of the ASR obscures the necessary lexical cues. By analogy, if errorful phone recognition (whether by infant or machine) obscures the cues most needed for word segmentation, then cues unaffected by the phones (such as prosodic cues) will naturally be more robust. If this is the case, then other word segmentation models that rely exclusively or primarily on segmental cues should perform significantly worse on the Large Brent phone probability vectors than on the canonical data with which they have been tested heretofore. We leave it as an open challenge to the computational modeling community to test this hypothesis with their favorite segment-based word segmentation models.

It follows from this hypothesis that combining other, non-segmental cues should improve performance. This is CAS98's second major claim, and like the first claim, it was tested only with idealized, symbolic input – word stress as abstracted from dictionary pronunciation guides, not as observed in the actual speech signal. Simulation 2 re-examines this claim, testing the contribution of a non-segmental cue loosely analogous to word stress, but derivable from the audio signal – segmental salience. In a condition analogous to CAS98's phon-ubm-stress condition, this simulation combines probabilistic segmental information with a measure of confidence in that information, corresponding to the local hyperarticulation found in word-initial positions in English. It also shows that multiple cues outperform single cues, even when those cues are probabilistic – and derived from the same source.

However, it must be noted that Simulation 2 only shows that multiple cues improve performance when they are combined directly as separate criteria in the decision rule, not when merely combined as multiple catalyst tasks in an SRN. This could mean that there are limits to CAS98's method of combining cues, or that the evaluation measures that CAS98 happened to use are less appropriate in this case. More specifically, using ‘catalyst’ tasks or features may be limited by the effectiveness of those features in improving performance on the task measured during the evaluation – in the case of CAS98, the extrapolation from utterance boundaries to word boundaries. It appears from Simulation 2a that the segmental confidence catalyst features, while possibly helpful in identifying word boundaries in some other way, were not learned effectively by the network and/or did not contribute to a better generalization from utterance boundaries to word boundaries. This does not disprove the usefulness of these cues to the word segmentation task, nor does it invalidate the Christiansen model per se, but it does suggest the limitations of evaluating its performance on the basis of a single variable such as the UBM unit output.

In this case, a more direct combination of heuristics seems more appropriate and effective than treating all cues as prediction tasks for a single heuristic. The results of Simulation 2b suggest that it makes more sense to treat the segmental salience feature as a direct cue to word beginnings, rather than as an additional feature for predicting which utterance endings extrapolate well to word endings.[12]

Implications and limitations of the findings

The ‘segmental salience as criterion’ condition in Simulation 2b outperforms the corresponding ‘segmental confidence as catalyst’ condition in Simulation 2a. This suggests a segmentation strategy that infants might be using. Perhaps they are able to detect regions of clear speech, and treat the beginnings of such regions as likely word boundaries. This possibility will need to be tested experimentally in humans.

We also need to know more about the corpus used. The types of variation found in the Small Brent subset clearly correspond to those expected from normal variation in speech, such as reduced or casual pronunciations of words, allophony and dialect differences. By contrast, the Large Brent subset includes a much higher proportion of ASR errors, and many of these have no easily recognizable linguistic explanation. In ASR, such patterns of error are often due to extraneous factors such as background noise. It would be worth checking whether these utterances are sufficiently messy and noisy to give human listeners trouble, though it is not obvious how best to do this. This is an issue which is always likely to arise with simulations of development: the models inevitably have properties that go beyond what is known about the target behavior, and, absent further study of the actual process of human development, it is unclear whether these properties are desirable. We cannot safely commit to the Large Brent corpus until we know to what extent it represents the challenges that a learner faces, and we can only know that if we know enough about those challenges in the first place. It is known that adult humans are excellent at tolerating noise levels that would leave current ASR systems floundering, but it is unknown whether infants share this capacity or, if they do not initially have it, how and over what timecourse they acquire it.

Future directions

The simulations described here confirm the general claims of CAS98. They also point out areas for further investigation and refinement of the model. Cue integration is an especially rich opportunity. Because each cue will be error-prone to a different degree and in a different way, Christiansen and Allen's account needs to be augmented with details of how this uncertainty arises and how the cognitive system manages it. This problem is a qualitatively different extension of the original. We have argued above that other approaches to word segmentation should be evaluated primarily by testing their performance on input that preserves the variation naturally present in speech. Based on our results, there is no strong reason to suppose that strategies and learning biases that work for idealized symbolic input will have the same properties in noisier and messier environments.

There is great need for deeper understanding of language acquisition in adverse contexts, including noisy environments (cf. Newman, Reference Newman2005), temporary hearing loss (cf., e.g., Polka & Rvachew, Reference Polka and Rvachew2005), and profound hearing loss and/or cochlear implants. Using ASR models with different types of added noise, or even with input derived from a cochlear implant's output, could provide models for how children might segment and acquire language in adverse conditions. A beneficial side effect of this enterprise is to motivate a more varied set of evaluation techniques for automatic methods of processing speech. Not only can psychology benefit from ASR technology, it can also give back suggestions for how to do evaluations that are better focused on the task than traditional measures of word error rate.

Footnotes

[*] Portions of this research were conducted with the monetary support of a National Science Foundation Graduate Research Fellowship awarded to the primary author while he was at the Ohio State University, as well as from NSF-ITR grant #0427413, granted to Chin-Hui Lee, Mark Clements, Keith Johnson, Lawrence Rabiner and Eric Fosler-Lussier for the multi-university Automatic Speech Attribute Transcription (ASAT) project. Preliminary versions of parts of this work, in particular Simulation 1, appear in the primary author's (unpublished) dissertation.

[1] If all boundary or grouping decisions are deterministic, then the results of these two approaches are interchangeable.

[2] CAS98 and other models propose and use additional evaluations as well; that being said, the evaluations listed here seem to be more general across the modeling community. Only a few models (e.g. Roy & Pentland, 2002) address the issue of which acoustic variants should be clustered into the same lexical entry – a problem which can arguably be grouped with a later word–meaning mapping task rather than word segmentation per se.

[3] This assumption excludes from consideration one type of phonetic variation: namely, insertion and deletion of segments relative to the canonical standard. In this regard, phone-based transcriptions, whether human-based like the Buckeye corpus or automatic like de Marcken's APR-generated corpus, model a type of variation missed by this input representation. While such variation could be included by using a phone lattice representation of the input, as in Scharenborg et al. (2005), this would significantly complicate evaluation of the input for many word segmentation models.

[4] Note that the boundary placement as calculated here considers both utterance-internal word boundaries and the final boundary at the end of the utterance, but not utterance-initial boundaries. A more conservative boundary measure, using only utterance-internal word boundaries, may be obtained by subtracting the number of utterances in the test corpus from the number of true positives (Ntp).
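As a worked illustration of the adjustment in footnote 4, the following sketch computes boundary precision and recall from raw counts; the function and argument names are ours, and the conservative variant simply applies the subtraction described above.

```python
def boundary_precision_recall(n_tp, n_fp, n_fn, n_utts=0):
    """Boundary precision and recall from raw counts (a sketch).

    Setting n_utts > 0 gives the conservative, utterance-internal
    variant: the trivially correct utterance-final boundaries (one
    per utterance) are subtracted from the true positives.
    """
    n_tp -= n_utts
    precision = n_tp / (n_tp + n_fp)
    recall = n_tp / (n_tp + n_fn)
    return precision, recall

# Invented counts, purely for illustration:
print(boundary_precision_recall(900, 300, 250, n_utts=100))
```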

[5] A more nuanced metric would undoubtedly be of interest for matching the performance of particular models with children's experimental performance at particular stages of word segmentation; however, no such metric has yet emerged as a community standard.

[6] An anonymous reviewer notes that the normalization of the output and target layers such that the phone probability vectors summed to 1 may have made training more difficult for the SRN, and adversely affected performance for reasons unrelated to the model's ability to handle subsegmental variation. This possibility will need to be evaluated in future work.

[7] It should be noted that the performance of the canonical model as replicated here is considerably below that reported by Christiansen et al. (1998) on the Korman corpus. The CAS98 study reported lexical boundary precision and recall at 0·659 and 0·713 and word (token) segmentation precision and recall at 0·373 and 0·404. The reason for this discrepancy is not known.

[8] The same result is obtained using the distributed input based on the phonological features described in CAS98. When the features from Christiansen et al. (2005) are used, the probabilistic-input SRN is significantly worse than the canonical-input SRN in word token precision and recall (χ2(1)=10·6, p=0·0011 for word token precision; χ2(1)=9·8, p=0·0017 for word token recall). Other measures are not significantly different.

[9] Another possibility is that the normalization of the output and target layers caused difficulties in training (as noted in footnote 6). However, if that is the case, it is not clear why it affected the network only in the Large Brent corpus and not the Small Brent corpus subset.

[10] The Christiansen model, like most of the computational models of word segmentation reviewed, does not commit explicitly to a precise age range of infants being modeled. Christiansen et al. (1998: 253) suggest that they are focusing on initial stages of word segmentation, and note that their speech data were directed at infants at 6–16 weeks of age – much younger than the age noted here. Even if a considerably later timeframe of 0 ; 7–0 ; 11 were adopted as relevant for models of word segmentation, corresponding to the stages of word segmentation examined in Christiansen et al. (1998), abilities shown at age 0 ; 10 can be assumed to be in place and available only for the final stages of an infant's development of word segmentation abilities.

[11] The authors are happy to make the input used in these studies available to other researchers upon request.

[12] The same is probably true of lexical stress as well – had CAS98 evaluated lexical stress as a separate contributor to their method for positing word boundaries during their evaluation, they would likely have obtained even higher precision and recall scores.

REFERENCES

Aslin, R. N., Woodward, J. Z., LaMendola, N. P. & Bever, T. G. (1996). Models of word segmentation in fluent maternal speech to infants. In Demuth, K. & Morgan, J. L. (eds), Signal to syntax: Bootstrapping from speech to grammar in early acquisition, 117–34. Mahwah, NJ: Lawrence Erlbaum Associates.
Batchelder, E. O. (2002). Bootstrapping the lexicon: A computational model of infant speech segmentation. Cognition 83, 167–206.
Blank, D., Kumar, D., Meeden, L. & Yanco, H. (2003). Pyro: A Python-based versatile programming environment for teaching robotics. Journal of Educational Resources in Computing 3, 1–15.
Brent, M. R. (1999). An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning 34, 71–105.
Brent, M. R. & Siskind, J. M. (2001). The role of exposure to isolated words in early vocabulary development. Cognition 81, 31–44.
Cairns, P., Shillcock, R., Chater, N. & Levy, J. (1997). Bootstrapping word boundaries: A bottom-up corpus based approach to speech segmentation. Cognitive Psychology 33, 111–53.
Carterette, E. C. & Jones, M. H. (1974). Informal speech: Alphabetic and phonemic texts with statistical analyses and tables. Berkeley, CA: University of California Press.
Cho, T. & Keating, P. A. (2007). Effects of initial position versus prominence in English. UCLA Working Papers in Phonetics 106, 1–33.
Christiansen, M. H. & Allen, J. (1997). Coping with variation in speech segmentation. In Sorace, A., Heycock, C. & Shillcock, R. (eds), Proceedings of the GALA '97 conference on language acquisition: Knowledge representation and processing, 327–32. Edinburgh: Edinburgh University Press.
Christiansen, M. H., Allen, J. & Seidenberg, M. (1998). Learning to segment speech using multiple cues: A connectionist model. Language and Cognitive Processes 13(2/3), 221–68.
Christiansen, M. H., Conway, C. M. & Curtin, S. (2005). Multiple-cue integration in language acquisition: A connectionist model of speech segmentation and rule-like behavior. In Minett, J. W. & Wang, W. S. (eds), Language acquisition, change and emergence: Essays in evolutionary linguistics, 205–49. Hong Kong: City University of Hong Kong Press.
CMU (1993). The Carnegie Mellon pronouncing dictionary, version 0.6. Pittsburgh, PA: Carnegie Mellon University. Retrieved from www.speech.cs.cmu.edu/cgi-bin/cmudict.
Fernald, A. (1985). Four-month-old infants prefer to listen to motherese. Infant Behavior and Development 8, 181–95.
Fleck, M. M. (2008). Lexicalized phonotactic word segmentation. In Proceedings of the 46th annual meeting of the Association for Computational Linguistics: Human language technologies, 130–38. Columbus, OH: ACL.
Fosler-Lussier, E., Greenberg, S. & Morgan, N. (1999). Incorporating contextual phonetics into automatic speech recognition. In Ohala, J. J., Hasegawa, Y., Ohala, M., Granville, D. & Bailey, A. C. (eds), Proceedings of the International Congress of Phonetic Sciences, 611–14. Berkeley, CA: University of California, Berkeley.
Fougeron, C. & Keating, P. A. (1997). Articulatory strengthening at edges of prosodic domains. The Journal of the Acoustical Society of America 101, 3728–40.
Frank, M. C., Goldwater, S., Mansinghka, V., Griffiths, T. L. & Tenenbaum, J. (2007). Modeling human performance in statistical word segmentation. In McNamara, D. S. & Trafton, J. G. (eds), Proceedings of the 29th Annual Meeting of the Cognitive Science Society, 281–86. Austin, TX: Cognitive Science Society.
Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., Pallett, D. S. & Dahlgren, N. L. (1993). DARPA TIMIT acoustic phonetic continuous speech corpus CD-ROM. Available from www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1.
Goldwater, S., Griffiths, T. L. & Johnson, M. (2009). A Bayesian framework for word segmentation: Exploring the effects of context. Cognition 112(1), 21–54.
Halberstadt, A. K. & Glass, J. R. (1997). Heterogeneous acoustic measurements for phonetic classification. In Proceedings of Eurospeech '97, 401–404. Rhodes: European Speech Communication Association.
Johnson, E. K. & Jusczyk, P. W. (2001). Word segmentation by 8-month-olds: When speech cues count more than statistics. Journal of Memory and Language 44(4), 548–67.
de Jong, K. (1995). The supraglottal articulation of prominence in English: Linguistic stress as localized hyperarticulation. Journal of the Acoustical Society of America 97, 491–504.
Korman, M. (1984). Adaptive aspects of maternal vocalizations in differing contexts at ten weeks. First Language 5, 44–45.
Krull, D. (1990). Relating acoustic properties to perceptual responses: A study of Swedish voiced stops. The Journal of the Acoustical Society of America 88, 2557–70.
Liu, Y. (2004). Structural event detection for rich transcription of speech. Unpublished doctoral dissertation, Purdue University.
MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk. Mahwah, NJ: Erlbaum.
de Marcken, C. G. (1996). Unsupervised language acquisition. Unpublished doctoral dissertation, Massachusetts Institute of Technology.
McMurray, B. & Aslin, R. N. (2005). Infants are sensitive to within-category variation in speech perception. Cognition 95(2), B15–B26.
Narayan, C. R., Werker, J. F. & Beddor, P. S. (in press). The interaction between acoustic salience and language experience in developmental speech perception: Evidence from nasal place discrimination. Developmental Science.
Newman, R. S. (2005). The cocktail party effect in infants revisited: Listening to one's name in noise. Developmental Psychology 41, 352–62.
Newman, R. S., Bernstein Ratner, N., Jusczyk, A. M., Jusczyk, P. W. & Dow, K. A. (2006). Infants' early ability to segment the conversational speech signal predicts later language development: A retrospective analysis. Developmental Psychology 42, 643–55.
Pitt, M. A., Johnson, K., Hume, E., Kiesling, S. & Raymond, W. (2005). The Buckeye corpus of conversational speech: Labeling conventions and a test of transcriber reliability. Speech Communication 45, 89–95.
Polka, L. & Rvachew, S. (2005). The impact of otitis media with effusion on infant phonetic perception. Infancy 8, 101–17.
Redford, M. A. & Diehl, R. L. (1999). The relative perceptual distinctiveness of initial and final consonants in CVC syllables. The Journal of the Acoustical Society of America 106, 1555.
Roy, D. & Pentland, A. (2002). Learning words from sights and sounds: A computational model. Cognitive Science 26(1), 113–46.
Rytting, C. A. (2007). Preserving subsegmental variation in modeling word segmentation, or the raising of Baby Mondegreen. Unpublished doctoral dissertation, The Ohio State University.
Scharenborg, O., Norris, D., ten Bosch, L. & McQueen, J. M. (2005). How should a speech recognizer work? Cognitive Science 29(6), 867–918.
Thiessen, E. D. & Saffran, J. R. (2004). Spectral tilt as a cue to word segmentation in infancy and adulthood. Perception and Psychophysics 66, 779–91.
Werker, J. & Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development 7, 49–63.
Young, S., Evermann, G., Kershaw, D., Moore, G., Odell, J., Ollason, D., et al. (2002). The HTK Book. Cambridge: Cambridge University Engineering Department.