
When to simulate and when to associate? Accounting for inter-talker variability in the speech signal

Published online by Cambridge University Press:  24 June 2013

Alison M. Trude*
Affiliation:
Department of Psychology, University of Illinois at Urbana–Champaign, Champaign, IL 61820. trude1@illinois.edu

Abstract

Pickering & Garrod's (P&G's) theory could be modified to describe how listeners rapidly incorporate context to generate predictions about speech despite inter-talker variability. However, in order to do so, the content of “impoverished” predicted percepts must be expanded to include phonetic information. Further, the way listeners identify and represent inter-talker differences and subsequently determine which prediction method to use would require further specification.

Type
Open Peer Commentary
Copyright
Copyright © Cambridge University Press 2013 

A hallmark of speech perception is that despite inter-talker variability on dimensions including rate, pitch, and phonetic variation due to accents, comprehension usually proceeds quickly and with little conscious effort. P&G's theory provides a potential means of accommodating this variability by including context in the inverse model during comprehension; however, although the theory is potentially compatible with findings on talker adaptation, in order to generate testable hypotheses in this domain, it must more precisely specify what information is included in listeners' predictions and how listeners assess talkers' speech to determine which prediction route is more appropriate for the current input.

According to P&G's model, listener-generated predictions are impoverished, which suggests that they do not include fine-grained phonetic detail. However, a large body of research shows not only that listeners use fine-grained acoustic-phonetic details online while processing speech (McMurray et al. 2009; Salverda et al. 2003), but also that this phonetic detail can affect listeners' subsequent productions (Nielsen 2011). If P&G wish to capture how listeners become phonetically aligned, allowing improved perception of an individual's speech over time, the definition of “context” must be expanded to include phonetic details, as well as listeners' previous experiences with a particular talker or group of talkers.

Another question raised by the model is how listeners' use of the prediction-by-simulation and prediction-by-association routes would vary as a consequence of talker identity. Consider, for example, an eye-tracking experiment from our laboratory showing rapid comprehension of a regional accent (Trude & Brown-Schmidt 2012). In this study, participants heard two talkers: a male with an American English dialect in which /æ/ raises to /eɪ/ only before /g/ (e.g., tag [teɪg], but tack [tæk]), and a female without this shift. On critical trials, participants viewed four images: a /k/-final target (e.g., tack), a /g/-final cohort competitor (e.g., tag), and two unrelated pictures. Participants clicked on the image named by one of the talkers. The results indicated that when listening to the male talker, participants fixated tag less upon hearing tack. Participants ruled the competitor out more quickly because the way that the male talker would have pronounced its vowel was not consistent with the input. Hence, we observed that listeners mentally represented an unfamiliar regional accent and used their knowledge rapidly enough to influence processing of a single word (see also Dahan et al. 2008).

P&G argue that listeners rely more on simulation when the talker is similar to them; however, the theory does not specify what degree of similarity is necessary for listeners to be able to use prediction-by-simulation during comprehension and when prediction-by-association is necessary. Additionally, for this model to generate testable hypotheses, it must specify how listeners assess the input in order to determine whether to use simulation or association, and the basis and frequency on which they update their assessments. According to P&G, context is determined using “information about differences between A's speech system and B's speech system” (target article, sect. 3.2, para. 3; Fig. 6 caption), suggesting that listeners engage in a comparison process. However, the details of that process, and how it aligns with current theories of speech perception, are unclear.

At a phonological level, it is possible that our participants would have been able to use the simulation route to predict the male's vowel shift since, as native English speakers, they have the vowel /eɪ/ in their own phonological system. It could also be the case, however, that our participants used the association route since their own phonological representation of tag includes an /æ/, rather than an /eɪ/. At the same time, because our talkers and participants were all American English speakers, their speech was quite phonologically similar. Would this similarity have allowed our participants to use the simulation route most of the time, perhaps switching to association only for the critical vowel shift? Considering that the two talkers alternated randomly in our study, and that certain features of their speech may have been more or less like those of a given participant at different points in a single word, it seems that the participant would have needed to constantly re-evaluate which prediction route was more suitable from moment to moment. This process would likely be too slow to implement and still produce the rapid online adaptation effects that we, and others, have observed.

A further question is whether listeners' derived production commands are actually the same representations governing overt imitation in cases in which the talker's and listener's speech vary. In the model, the listener's derived production command is generated after the percept has passed through the inverse model (which includes context), and therefore should include information about the talker's voice; however, it appears that this command is also used to generate overt imitation. If so, it seems that listeners should be able to imitate whatever features of the talker's speech they are able to predict (except when physiological differences prevent them from doing so). However, there are many cases in which a listener may understand a talker's speech without readily producing certain features of it (Mitterer & Ernestus 2008). The use of impoverished representations during comprehension could explain listeners' failure to overtly imitate these features; however, the fact that listeners can use fine-grained acoustic detail during comprehension seems at odds with this explanation. Furthermore, it has been shown that listeners can accommodate sub-phonemic features during imitation (Nielsen 2011), though they may do so to varying degrees (Babel 2012). Therefore, it would seem the theory needs a mechanism accounting for the dissociation between listeners' use of phonetic detail during comprehension and production.

In conclusion, P&G's theory can potentially explain talker-specific adaptation during comprehension because it allows a role for context while making predictions about a talker's speech. Furthermore, the rapid generation and implementation of representations is consistent with work using online methods that shows talker-specific adaptation over the course of a single word. However, many open questions remain about how listeners represent and predict the acoustic features of individuals' speech, and these must be addressed to make this a useful model of talker adaptation.

ACKNOWLEDGMENT

The writing of this commentary was supported by a National Science Foundation Graduate Research Fellowship, no. 2010084258.

References

Babel, M. (2012) Evidence for phonetic and social selectivity in spontaneous phonetic imitation. Journal of Phonetics 40:177–89.
Dahan, D., Drucker, S. J. & Scarborough, R. A. (2008) Talker adaptation in speech perception: Adjusting the signal or the representations? Cognition 108:710–18.
McMurray, B., Tanenhaus, M. K. & Aslin, R. N. (2009) Within-category VOT affects recovery from “lexical” garden paths: Evidence against phoneme-level inhibition. Journal of Memory and Language 60:65–91.
Mitterer, H. & Ernestus, M. (2008) The link between speech perception and production is phonological and abstract: Evidence from the shadowing task. Cognition 109:168–73.
Nielsen, K. (2011) Specificity and abstractness of VOT imitation. Journal of Phonetics 39:132–42.
Salverda, A. P., Dahan, D. & McQueen, J. (2003) The role of prosodic boundaries in the resolution of lexical embedding in speech comprehension. Cognition 90:51–89.
Trude, A. M. & Brown-Schmidt, S. (2012) Talker-specific perceptual adaptation during on-line speech perception. Language and Cognitive Processes 27:979–1001.