INTRODUCTION
In everyday conversations adults perceive speech by ear and eye, yet the development of this critical audiovisual property of speech perception is still not well understood. In fact, the extant child research reveals that – compared to adults – children exhibit reduced sensitivity to the articulatory gestures of talkers (i.e. visual speech). The McGurk task (McGurk & MacDonald, 1976) well illustrates this maturational difference in sensitivity to visual speech. In this task, individuals are presented with audiovisual stimuli with conflicting auditory and visual onsets (e.g. hear /ba/ and see /ga/). Whereas adults typically perceive a blend of the auditory and visual inputs (e.g. /da/ or /ða/) and rarely report perceiving the auditory /ba/, children, by contrast, report perceiving the /ba/ (auditory capture) 40% to 60% of the time (McGurk & MacDonald, 1976). Because visual speech plays a role in learning the phonological structure of spoken language (e.g. Locke, 1993; Mills, 1987), it is critical to understand how children utilize visual speech cues.
The influence of visual speech on children's audiovisual speech perception clearly increases with age, but the precise timecourse for achieving adultlike benefit from visual speech remains unclear. Numerous studies report that (i) children from roughly five through eleven years of age benefit less than adults from visual speech whereas (ii) adolescents (preteens–teenagers) show an adultlike visual speech advantage (e.g. Desjardins, Rogers & Werker, 1997; Dodd, 1977; Erdener & Burnham, 2013; Jerger, Damian, Spence, Tye-Murray & Abdi, 2009; McGurk & MacDonald, 1976; Ross, Molholm, Blanco, Gomez-Ramirez, Saint-Amour & Foxe, 2011; Tremblay, Champoux, Voss, Bacon, Lepore & Theoret, 2007; Wightman, Kistler & Brungart, 2006). Developmental improvements in sensitivity to visual speech have been attributed to changes in (i) the perceptual weights given to visual speech (Green, 1998), (ii) articulatory proficiency and/or speechreading skills (e.g. Desjardins et al., 1997; Erdener & Burnham, 2013), and (iii) linguistic skills and language-specific tuning (Erdener & Burnham, 2013; Sekiyama & Burnham, 2004). Notable complications to this story are suggested, however, by several studies reporting significant sensitivity to visual speech in three- to five-year-olds (Holt, Kirk & Hay-McCutcheon, 2011; Lalonde & Holt, 2015), six- to seven-year-olds (Fort, Spinelli, Savariaux & Kandel, 2012), and eight-year-olds (Sekiyama & Burnham, 2004, 2008). Some of these studies stressed that performance in young children can be influenced by visual speech when the children are tested with developmentally appropriate measures and task demands. This viewpoint encourages us to consider the possible bases underlying children's developmental insensitivity to visual speech. Toward this end, Jerger et al. (2009) adopted a dynamic systems theoretical viewpoint (Smith & Thelen, 2003).
Dynamic systems theory
Dynamic systems theory proposes two relevant points for understanding the influence of visual speech in children: (i) multiple interactive factors form the basis of developmental change, and (ii) children's early skills are ‘softly assembled’ systems that reorganize into more mature, stable forms in response to environmental and internal forces (Smith & Thelen, 2003). Evoked potential studies support such a developmental reorganization and restructuring of the phonological system (Bonte & Blomert, 2004). During these developmental transitions, processing systems are less robust and children cannot easily use their cognitive resources; thus performance is less stable and more affected by methodological approaches and task demands (Evans, 2002). From this perspective, children's reduced sensitivity to visual speech may be incidental to developmental transformations, their processing by-products, and experimental contexts. Clearly, previous research has shown a greater influence of visual speech on children's performance when task demands were modified to be more child-appropriate (Desjardins et al., 1997; Lalonde & Holt, 2015). Further, sensitivity to visual speech has been shown to vary in the same children as a function of stimulus/task demands (Jerger, Damian, Tye-Murray & Abdi, 2014).
We propose that some experimental variables that might have contributed to children's reduced sensitivity to visual speech are the use of (i) complex tasks/audiovisual stimuli (e.g. targets embedded in noise or competing speech; McGurk stimuli with conflicting auditory and visual onsets) – because they make listening more challenging or less natural and familiar – and (ii) high-fidelity auditory speech – because it makes visual speech less relevant. The purpose of the present research was to evaluate whether sensitivity to visual speech in children might be increased by the use of stimuli with (i) congruent onsets that invoke more prototypical and representative audiovisual speech processes, and (ii) non-intact auditory onsets that increase the need for visual speech without involving noise. Below we briefly introduce our new stimuli and discuss the current task and its possible benefits for studying the influence of visual speech on performance by children.
Stimuli for the New Visual Speech Fill-In Effect
The new stimuli are words and nonwords with an intact consonant/rhyme in the visual track coupled to a non-intact onset/rhyme in the auditory track (our methodological criterion excised about 50 ms for words and 65 ms for nonwords; see ‘Method’). Stimuli are presented in audiovisual vs. auditory modes. Example stimuli for the word bag are: (i) audiovisual: intact visual (b/ag) coupled to non-intact auditory (– b/ag) and (ii) auditory: static face coupled to the same non-intact auditory (– b/ag). Our idea was to insert visual speech into the ‘nothingness’ created by the excised auditory onset to study the possibility of a Visual Speech Fill-In Effect (Jerger et al., 2014), which occurs when performance for the same auditory stimulus differs depending upon the presence/absence of visual speech. Responses illustrating a Visual Speech Fill-In Effect for a repetition task (Jerger et al., 2014) are perceiving /bag/ in the audiovisual mode but /ag/ in the auditory mode. Below we overview our new approach – the multimodal picture–word task with low-fidelity speech (non-intact auditory onsets).
Multimodal picture–word task
In the widely used picture–word interference task (Schriefers, Meyer & Levelt, 1990), participants name pictures while attempting to ignore nominally irrelevant speech distractors. Previous research (e.g. Jerger, Martin & Damian, 2002; Jerger et al., 2009) has established that congruent onsets, such as [picture]–[distractor] pairs of [bug]–[bus], speed up picture naming times relative to neutral (or baseline) vowel onsets, such as [bug]–[onion]. A congruent onset is thought to prime picture naming because it creates crosstalk between the phonological representations that support speech production and perception (Levelt, Schriefers, Vorberg, Meyer, Pechmann & Havinga, 1991). Congruent distractors are assumed to spread activation from input to output phonological representations, a process fostering faster selection of speech segments during naming (Roelofs, 1997). Our ‘multimodal’ version of this task (Jerger et al., 2009) administers audiovisual stimuli (QuickTime movie files). The to-be-named pictures appear on the T-shirt of a talker whose face moves (audiovisual speech utterance) or stays artificially still (auditory speech utterance coupled with still video). Hence, the speech distractors are presented audiovisually or auditorily only, a manipulation that enables us to study the influence of visual speech on phonological priming.
In a previous study with the multimodal picture–word task and high-fidelity distractors (Jerger et al., 2009), we observed a U-shaped developmental function with a significant influence of visual speech on phonological priming in four-year-olds and twelve-year-olds, but not in five- to nine-year-olds. Consistent with our dynamic systems theoretical viewpoint (Smith & Thelen, 2003), we proposed that phonological knowledge was reorganizing – particularly from five to nine years – into a more elaborated, systematized, and robust resource for supporting a wider range of activities, such as reading. The phonological knowledge supporting visual speech processing was not as readily accessed and/or retrieved during this pronounced period of restructuring for the reasons elaborated above (see also Jerger et al., 2009). As noted above, our current research attempts to moderate these possible internal/external influences by using congruent audiovisual stimuli with non-intact auditory onsets. Our focus on speech onsets may be key because – relative to the other parts of an utterance – onsets are easier to speechread, more reliable with less articulatory variability, and more stressed (Gow, Melvold & Manuel, 1996). In two studies, we addressed research questions about the relation between phonological priming in the auditory vs. audiovisual modes as a function of the characteristics of the stimuli (Analysis 1) and the children's ages and verbal abilities (Analysis 2).
ANALYSIS 1: STIMULUS CHARACTERISTICS
The general aim of this analysis was to assess the influence of visual speech on phonological priming by high- vs. low-fidelity auditory speech in children from four to fourteen years. Whereas the auditory fidelity was manipulated from high to low (intact vs. non-intact onsets), the visual fidelity always remained high (intact). Primary research questions were whether – in all age groups – (i) the presence of visual speech would fill in the non-intact auditory onsets and prime picture naming more effectively than auditory speech alone and (ii) phonological priming would display a greater influence of visual speech for non-intact than intact auditory onsets. Finally, a secondary research question concerned lexical status, namely whether phonological priming in all age groups would display a greater influence of visual speech for nonwords than words (e.g. baz vs. bag). Some important qualities that may influence the effects of visual speech are: (i) congruent dimensions, (ii) integral processing of speech cues, and (iii) low-fidelity auditory speech.
STIMULUS CHARACTERISTICS AND PREDICTIONS
Congruent dimensions
Evidence suggests that audiovisual utterances with congruent rather than conflicting McGurk-like dimensions produce different perceptual experiences. For example, Vatakis and Spence (2007) manipulated the temporal onsets of congruent vs. conflicting auditory and visual inputs and found that listeners were significantly less sensitive to temporal differences when onsets were congruent. Brain activation patterns also differ for congruent vs. conflicting audiovisual speech, with supra-additivity (greater than the sum of unimodal inputs) for the former but sub-additivity for the latter (Calvert, Campbell & Brammer, 2000). Congruent dimensions also possess lawful relatedness that produces strong cues that the auditory and visual inputs originated from the same speaker and should be integrated (Stevenson, Wallace & Altieri, 2014). Thus, in terms of a multisensory perceptual experience, congruent onsets offer some advantages compared to conflicting onsets. The evidence reviewed below also clearly indicates that the speech cues of consonant–vowel stimuli are processed integrally.
Integrality of speech cues
To study the integrality of speech cues, the Garner task (Garner, 1974) requires participants to (i) attend selectively to a target cue such as a consonant (e.g. /b/ vs. /g/) and (ii) try to ignore a non-target cue such as a vowel that is held constant (/ba/ vs. /ga/) or varies irrelevantly (/ba/, /bi/ vs. /ga/, /gi/). Results have shown that irrelevant variation in the vowels interferes with classifying the consonants and vice versa (e.g. Tomiak, Mullennix & Sawusch, 1987). Green and Kuhl (1989) established that this tight coupling between auditory speech cues extends to audiovisual speech cues. All these results indicate that listeners cannot ignore one speech cue and selectively attend to another. Instead, listeners perceive the cues integrally. Results on the Garner task imply that our auditory and visual speech onsets should be processed integrally.
Low-fidelity (non-intact) auditory speech
The literature shows a shift in the relative weights of the auditory and visual modes as the quality of the inputs shifts. To illustrate: when listening to McGurk stimuli with degraded auditory speech, children with normal hearing respond more on the basis of the intact visual input (Huyse, Berthommier & Leybaert, 2013). When the visual input is also degraded, however, the children respond more on the basis of the degraded auditory input. Children with normal hearing or mild–moderate hearing loss and good auditory word recognition – when listening to conflicting inputs such as auditory /meat/ coupled with visual /street/ – respond on the basis of the auditory input (Seewald, Ross, Giolas & Yonovitz, 1985). In contrast, children with more severe hearing loss – and more degraded perception of auditory input – respond more on the basis of the visual input. Finally, when Japanese individuals listen to high-fidelity auditory input, they do not show a McGurk effect; but when they listen to degraded auditory input, they do show the effect (Sekiyama & Burnham, 2008; Sekiyama & Tohkura, 1991). These results indicate that the relative weighting of auditory and visual speech is modulated by the relative quality of each input. Recent neuroscience studies also support this differential weighting, as they reveal that the functional connectivity between the auditory and visual cortices and the superior temporal sulcus (STS, an area of audiovisual integration) changes with input fidelity, with increased connectivity between the STS and the sensory cortex with the higher-fidelity input (Nath & Beauchamp, 2011).
In short, our auditory and visual speech cues are congruent and should be processed in an integral manner. The auditory and visual speech inputs should be weighted differentially depending on the quality of the auditory input. Thus we predict that (i) visual speech will fill in the non-intact auditory onsets and prime picture naming more effectively than auditory speech alone, and (ii) children will be more sensitive to visual speech for non-intact than intact auditory input. In addition to our primary research questions, a secondary research question evaluated whether lexical status affects children's sensitivity to visual speech.
LEXICAL STATUS AND PREDICTIONS
The literature contrasting the McGurk effect for words vs. nonwords indicates that the McGurk effect occurs for both types of stimuli. Within this evidence, some results have revealed that lexical status impacts the McGurk effect. For example, visual speech influences listeners more often when (i) stimuli are words rather than nonwords (Barutchu, Crewther, Kiely, Murphy & Crewther, 2008) or (ii) the visual input forms a word and the auditory input forms a nonword (Brancazio, 2004). By contrast, however, other results have shown a strong McGurk effect for both nonwords and words, with performance not appearing to be influenced by meaningfulness (Sams, Manninen, Surakka, Helin & Katto, 1998). With regard to studies assessing the McGurk effect with only word stimuli in isolation, one study (Dekle, Fowler & Funnell, 1992) observed a strong McGurk effect whereas the other study (Easton & Basala, 1982) reported no visual influence on performance. In short, these studies do not provide consistent results or predictions.
In contrast to the mixed results summarized above, the hierarchical model of speech segmentation (Mattys, White & Melhorn, 2005) provides unambiguous predictions for words vs. nonwords. The model proposes that listeners assign the greatest weight to lexical–semantic content when listening to words. If the lexical–semantic content is compromised, however, listeners assign the greatest weight to phonetic–phonological content. If both the lexical–semantic and phonetic–phonological content are compromised, listeners assign the greatest weight to acoustic–temporal content. It is also assumed that monosyllabic words such as our stimuli (bag) may activate their lexical representations without requiring phonological decomposition whereas nonwords (baz) require phonological decomposition (Mattys, 2014).
If these ideas generalize to our task, word stimuli should be heavily weighted in terms of lexical–semantic content but nonword stimuli should be heavily weighted in terms of phonetic–phonological content for both the audiovisual and auditory modes. We predict that children's sensitivity to visual speech will vary depending on the relative weighting and decomposition of the phonetic–phonological content. To the extent that a greater weight on phonetics–phonology increases children's awareness of the phonetic–phonological content and visual speech phonetic cues, we predict that children will show a significantly greater influence of visual speech relative to auditory speech for nonwords than for words. In agreement with Campbell (1988), we view visual speech as an extra phonetic resource that adds another type of phonetic feature.
Although we critically evaluate the influence of child factors in Analysis 2, we plot results as a function of age in Analysis 1. To briefly address age, the literature reviewed above predicts that – although benefit from visual speech improves with age – children relative to adults show significantly reduced benefit up to the adolescent years. We have argued above, however, that performance for our non-intact stimuli will reveal more sensitivity to visual speech. We thus predict that phonological priming effects will show influences of visual speech from four to fourteen years.
METHOD
Participants
Participants were 132 native English-speaking children ranging in age from 4;2 to 14;5 (55% boys). The racial distribution was 70% White, 13% Asian, 11% Black, and 6% Multiracial, with 9% reporting Hispanic ethnicity. Participants had normal (age-based when appropriate) hearing sensitivity, visual acuity (including corrected to normal), auditory word recognition (Ross & Lerman, 1971), articulatory proficiency (Goldman & Fristoe, 2000), and visual perception (Beery & Beery, 2004). Participants were divided into four age groups (30 to 38 children each) based on chronological age (four- to five-year-olds: M = 4;11, SD = 0·53; six- to seven-year-olds: M = 7;00, SD = 0·59; eight- to ten-year-olds: M = 9;02, SD = 0·87; and eleven- to fourteen-year-olds: M = 12;04, SD = 1·24). These groups will be referred to as five-year-olds, seven-year-olds, nine-year-olds, and twelve-year-olds. Details for the groups are presented in Analysis 2. Participants accurately pronounced the onsets of the pictures' names; the offsets were also accurately pronounced except for three five-year-olds (who substituted /θ/ for /s/ in gas and geese or omitted /t/ in ghost). Two five-year-olds had to be taught the names of some pictures (geese, beads, and/or gun). To ensure that the experimental results were reflecting performance for words vs. nonwords, participants' knowledge of the word distractors was tested by parental report and a picture-pointing task. Thirty-one children had to be taught the meaning of a distractor; the number of unknown distractors averaged 0·917 in the five-year-olds, 0·414 in the seven-year-olds, and 0·016 in the nine- to twelve-year-olds. Mean naming times for the taught vs. previously known words did not differ; no trials were eliminated.
Materials and instrumentation: picture–word task
Pictures and distractors
The entire set of materials consisted of experimental items (8 pictures and 12 distractors) and filler items (16 pictures and 16 distractors). The experimental pictures and phonologically related distractors were words/nonwords beginning with the consonants /b/ or /g/ coupled with the vowels /i/, /æ/, /ʌ/, or /o/. The baseline distractors were words/nonwords beginning with the vowels /i/, /æ/, /ʌ/, or /o/. Illustrative items for the picture [bug] are [picture]–[word/nonword] pairs of [bug]–[bus/buv] for the phonologically related condition and [bug]–[onion/onyit] for the baseline condition (see ‘Appendix A’ for items, available at http://dx.doi.org/10.1017/S030500091500077X). The word and nonword distractors were constructed to have as comparable phonotactic probabilities as possible. In brief, the positional segment frequencies for the words vs. nonwords averaged respectively ·1593 vs. ·1570 (adult values) and ·1911 vs. ·1805 (child values); the biphone frequencies averaged ·0050 vs. ·0056 (adult values) and ·0071 vs. ·0074 (child values) (Storkel & Hoover, 2010; Vitevitch & Luce, 2004; see Jerger et al., 2014, for details). The filler items were pictures and word/nonword distractors not beginning with /b/ or /g/. Illustrative filler items are the [picture]–[word/nonword] pairs of [dog]–[cheese/cheeg], [shirt]–[pickle/pimmel], and [cookies]–[horse/hork]. To emphasize the distinctiveness between the words and nonwords, if a filler item (e.g. [dog]–[cheese]) was used for the words, its counterpart (e.g. [dog]–[cheeg]) was not used for the nonwords and vice versa. This strategy yielded 8 different picture–distractor filler items each for the words and the nonwords.
Stimulus preparation
The distractors were recorded at the Audiovisual Recording Lab, Washington University School of Medicine. The talker was an eleven-year-old boy actor with clearly intelligible speech. His full facial image and upper chest were recorded. He started and ended each utterance with a neutral face / closed mouth. The color video signal was digitized at 30 frames/s with 24-bit resolution at a 720 × 480 pixel size. The auditory signal was digitized at 48 kHz sampling rate with 16-bit amplitude resolution. The utterances were adjusted to equivalent A-weighted root mean square sound levels. The video track was routed to a high-resolution monitor, and the auditory track was routed through a speech audiometer to a loudspeaker. The intensity level of the distractors was approximately 70 dB SPL. The to-be-named colored pictures were scanned into a computer as 8-bit PICT files and edited to achieve objects of a similar size on a white background.
Editing the auditory onsets
We edited the auditory track of the phonologically related distractors by locating the /b/ or /g/ onsets visually and auditorily with Adobe Premiere Pro and Soundbooth (Adobe Systems Inc., San Jose, CA) and loudspeakers. We applied a perceptual criterion to operationally define a non-intact onset. We excised the waveform in 1 ms steps from the identified auditory onset (first deviations from baseline) to the point in the later waveform for which at least four of five trained listeners heard the vowel as the onset (auditory mode). This process removed the excised portion of the acoustic signal and left the alignment between the auditory and visual tracks as originally produced by the speaker. Splice points were always at zero axis crossings. Using our perceptual criterion, we excised on average 52 ms (/b/) and 50 ms (/g/) from the word onsets and 63 ms (/b/) and 72 ms (/g/) from the nonword onsets. Figure 1 displays the intact vs. non-intact waveforms for the word bag.
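For readers who wish to picture this editing step, the minimal sketch below illustrates the end state under stated assumptions: a mono WAV export of the auditory track, a hypothetical file name, and a hypothetical sample position for the identified onset. The published editing was carried out perceptually in 1 ms steps with the Adobe tools; here, on one reading of the description above, the excised span is replaced by silence so that the audio–video alignment stays as originally produced, and cut points are snapped to low-amplitude samples as a stand-in for zero crossings.
import numpy as np
import soundfile as sf   # assumption: the auditory track is available as a mono WAV file

def snap_to_quiet_sample(signal, index, window=100):
    """Approximate a zero crossing by the lowest-amplitude sample near `index`."""
    stop = min(index + window, len(signal))
    return index + int(np.argmin(np.abs(signal[index:stop])))

def excise_onset(signal, rate, onset_s, excise_ms):
    """Silence the first `excise_ms` of the consonant onset, leaving overall timing intact."""
    start = snap_to_quiet_sample(signal, int(onset_s * rate))                  # first deviation from baseline
    stop = snap_to_quiet_sample(signal, start + int(excise_ms * rate / 1000))  # listener-defined end point
    edited = signal.copy()
    edited[start:stop] = 0.0    # excised portion removed (silenced); audio-video alignment unchanged
    return edited

signal, rate = sf.read("bag_intact.wav")                                # hypothetical file name
non_intact = excise_onset(signal, rate, onset_s=0.10, excise_ms=52)     # onset position hypothetical; ~52 ms is the /b/ word mean
sf.write("bag_non_intact.wav", non_intact, rate)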
We next formed audiovisual (dynamic face) and auditory (static face) modes of presentation for the stimuli. In our experimental design, the auditory mode controls for the influence on performance of any remaining coarticulatory cues in the input. More specifically, we compare results for the non-intact stimuli in the auditory vs. audiovisual modes. Any coarticulatory cues in the auditory input are held constant in the two modes. Thus any influence on picture naming due to coarticulatory cues should be controlled, and this should allow us to evaluate whether the addition of visual speech influences performance.
Audiovisual and auditory modes
Stimuli were QuickTime movie files. For the audiovisual mode, the children saw (i) 924 ms (experimental trials) or 627 or 1,221 ms (filler-item trials) of the talker's still face and upper chest, followed by (ii) an audiovisual utterance of one distractor and the presentation of one picture on the talker's T-shirt five frames before the auditory onset of the utterance (auditory distractor lags picture), followed by (iii) 924 ms of still face and picture. For the auditory mode, the child heard the same event but the video track was edited to contain only the talker's still face. The onset of the picture occurred in the same frame for the intact and non-intact distractors. The relationship between the onsets of the picture and the distractor, termed stimulus onset asynchrony (SOA), must also be considered for the picture–word task.
SOA
Phonologically related distractors typically produce a maximal effect on naming when the onset of the auditory distractor lags the onset of the picture with an SOA of about 150 ms (Damian & Martin, 1999; Schriefers et al., 1990). Our SOA was five frames or about 165 ms (frame size of 33 ms) as used previously (Jerger et al., 2009). Because the picture remained in the same frame for the intact and non-intact stimuli, however, the auditory non-intact onset altered the target SOA of 165 ms and the natural temporal synchrony between the visual and auditory speech onsets. Below we consider these issues.
With regard to altering the SOA, the child literature does not provide evidence about whether the slight temporal shift in the SOA produced by the non-intact onset affects picture naming results. Our experimental design, however, should provide data that can control for this issue. To do so, we will compare results for the non-intact stimuli in the auditory vs. audiovisual modes. The shift in the auditory onset is held constant in the two modes; thus any influence on picture naming due to the shift in the auditory onset should be controlled. This should allow us to evaluate whether the addition of visual speech influences performance.
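For orientation, a rough worked example of the size of this shift, computed from the frame size and the mean excised durations reported above (our arithmetic, not a value reported in the original analyses): on our reading of the design, the first audible speech in a non-intact distractor begins after the excised span, so the effective picture-to-audible-speech lag grows by roughly that amount.
frame_ms = 33                          # one video frame at ~30 frames/s
nominal_soa = 5 * frame_ms             # picture leads the auditory onset by about 165 ms

# the effective lag for non-intact onsets grows by roughly the excised duration
for excised_ms in (50, 52, 63, 72):    # mean excisions reported in 'Editing the auditory onsets'
    print(f"{nominal_soa} ms -> about {nominal_soa + excised_ms} ms")
The same 50–72 ms figures also bound the extra visual-lead asynchrony considered next.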
With regard to altering the temporal synchrony between modes, visual speech normally leads auditory speech (Bell-Berti & Harris, 1981), but the degree to which visual speech leads varies appreciably (ten Oever, Sack, Wheat, Bien & van Atteveldt, 2013). Thus listeners are accustomed to natural variability in this asynchrony. Adults synthesize visual and auditory speech into a single multisensory event – without any detection of the asynchrony or any effect on intelligibility – when the visual speech leads the auditory speech by as much as 200 ms (Grant, van Wassenhove & Poeppel, 2004). Detecting asynchrony between audiovisual speech inputs (simultaneity judgments) is similar in adults and ten- to eleven-year-olds when visual speech leads (Hillock, Powers & Wallace, 2011). This evidence suggests that the alteration in the SOA produced by the non-intact onsets will not affect the children's assimilation of an audiovisual distractor into a single multisensory event. Below we summarize our final set of materials.
Final set of items
We administered two presentations of each experimental item (i.e. baseline, intact, and non-intact distractors) in the audiovisual and auditory modes. The items were randomly intermixed with the filler items in each mode and formed into four lists (which were presented forward or backward for eight variations). Each list contained 24 experimental (57%) and 18 filler-item (43%) trials. The items comprising a list varied randomly under the constraints that (i) no onset could repeat, (ii) the intact and non-intact pairs (e.g. bag and /– b/ag) could not occur without at least two intervening items, (iii) a non-intact onset must be followed by an intact onset, (iv) the mode must alternate after three repetitions, and (v) all types of onsets (vowel, intact /b/ and /g/, non-intact /b/ and /g/, and not /b/ or /g/) must be dispersed uniformly throughout the lists. The presentation of items was counterbalanced so that 50% of items occurred first in the auditory mode and 50% occurred first in the audiovisual mode. The number of intervening items between the intact vs. non-intact pairs (and vice versa) averaged ten items.
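The ordering constraints lend themselves to an automatic check. The sketch below is a minimal illustration with a hypothetical trial encoding, not the script actually used to build the lists; the uniform-dispersion constraint (v) is omitted for brevity.
def valid_order(trials):
    """trials: list of dicts with keys 'item', 'onset', 'fidelity' ('intact', 'non-intact', or None), 'mode'."""
    for i, t in enumerate(trials):
        # (i) no onset may repeat on consecutive trials
        if i > 0 and trials[i - 1]["onset"] == t["onset"]:
            return False
        # (ii) intact and non-intact versions of the same item need at least two intervening trials
        if any(u["item"] == t["item"] and u["fidelity"] != t["fidelity"] for u in trials[max(0, i - 2):i]):
            return False
        # (iii) a non-intact onset must be followed by an intact onset
        if t["fidelity"] == "non-intact" and i + 1 < len(trials) and trials[i + 1]["fidelity"] != "intact":
            return False
        # (iv) the presentation mode may repeat at most three times in a row
        if i >= 3 and all(trials[j]["mode"] == t["mode"] for j in range(i - 3, i + 1)):
            return False
    return True
Randomly shuffling candidate orders and keeping only those that pass such a check is one common way to satisfy constraints of this kind.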
Naming responses
Participants named pictures by speaking into a unidirectional microphone mounted on an adjustable stand. The utterances were digitally recorded. To quantify naming times, the computer triggered a counter/timer (resolution less than one ms) at the initiation of a movie file. The timer was stopped by the onset of the participant's vocal response into the microphone, which was fed through a stereo mixing console amplifier and 1 dB step attenuator to a voice-operated relay (VOR). A pulse from the VOR stopped the timing board via a data module board. If necessary, the participant's speaking level, the position of the microphone or child, and/or the setting on the 1 dB step attenuator were adjusted to ensure that the VOR triggered reliably. The counter timer values were corrected for the amount of silence in each movie file before the onset of the picture.
Procedure
The children completed the multimodal picture–word task along with other procedures in three sessions, scheduled approximately ten days apart. The order of presentation of the word vs. nonword conditions was counterbalanced across participants in each age group. Results were collapsed across the counterbalancing conditions. In the first session, the children completed three of the word (or nonword) lists; in the second session, the children completed the fourth word (or nonword) list and the first nonword (or word) list; and in the third session, the children completed the remaining three nonword (or word) lists. Individual lists were administered in separate listening conditions. A variable number of practice trials preceded the presentation of each list.
At the start of the first session, a tester showed each picture on a 5” x 5” card and asked the participant to name the picture; the tester taught the target names of any pictures named incorrectly. Next the tester flashed some picture cards quickly and modeled speeded naming. The child copied the tester. Speeded naming practice trials went back and forth between tester and child until the child was naming the pictures fluently. Mini-practice trials started each of the other sessions.
For formal testing, a tester sat at a computer workstation and initiated each trial by pressing a touch pad (out of child's sight). The children, with a co-tester alongside, sat at a distance of 71 cm directly in front of an adjustable height table containing the computer monitor and loudspeaker. Trials that the co-tester judged flawed (e.g. child squirmed out of position, child triggered microphone with non-speech) were deleted online and re-administered after intervening items. The children were told they would see and hear a boy whose mouth would sometimes be moving and sometimes not. For the words, participants were told that they might hear words or nonwords; for the nonwords, participants were told that they would always hear nonwords. We emphasized that the talking was not important. Participants were told to focus only on (i) watching for a picture that would pop up on the boy's T-shirt and (ii) naming it as quickly and as accurately as possible. The participant's view of the picture subtended a visual angle of 5·65° vertically and 10·25° horizontally; the view of the talker's face subtended a visual angle of 7·17° vertically (eyebrow – chin) and 10·71° horizontally (eye level). Finally, participants also completed an explicit repetition task (always presented after the completion of the picture–word task) to assess the perception of the distractor onsets.
RESULTS
Preliminary analyses
‘Appendix B’ (available at http://dx.doi.org/10.1017/S030500091500077X) details (i) the accuracy of perceiving the onsets and (ii) the quality of the picture–word data (e.g. number of missing trials). In addition to these results, we analyzed the picture–word data preliminarily to determine whether results could be collapsed across the different distractor onsets (/b/ vs. /g/). Appendix C (available at http://dx.doi.org/10.1017/S030500091500077X) details these results. Briefly, separate factorial mixed-design analyses of variance were performed for the baseline and phonologically related distractors. Findings indicated that the different onsets influenced results for the phonologically related distractors but not for the baseline distractors. Specifically, overall picture naming speed was facilitated slightly more for the /b/ than /g/ onset (–147 vs. –117 ms). The effect of the onsets was also slightly more pronounced for the audiovisual than auditory mode (38 vs. 20 ms).
Despite these statistically significant outcomes, the differences in performance due to onset were small and did not interact with lexical status (words vs. nonwords) or fidelity (intact vs. non-intact). Thus, we developed a dual-pronged approach. For the primary analyses below, naming times were collapsed across the onsets to make the principal story clearer. For one key analysis with the collapsed onsets, however (determining whether/how visual speech influenced performance by assessing the difference between each pair of audiovisual–auditory naming times), the analysis was repeated separately for the individual /b/ and /g/ onsets. This analysis provides strong evidence for readers interested in whether/how the speechreadability of the onsets influenced phonological priming (e.g. the bilabial /b/ is easier to speechread than the velar /g/; Tye-Murray, 2014).
Baseline picture–word naming times
Figure 2 shows average picture naming times for the age groups in the presence of the vowel-onset baseline distractors presented in the auditory or audiovisual modes for the words (left) and nonwords (right). Results were analyzed with a factorial mixed-design analysis of variance with one between-participants factor (four age groups) and two within-participant factors (lexical status [words vs. nonwords] and mode [auditory vs. audiovisual]). Results indicated that picture naming times decreased significantly as age increased (F(3,128) = 86·33, MSE = 197462·74, p < ·001, partial η² = ·669). No other significant effect was observed. Picture naming times declined from about 1855 ms in the five-year-olds to 1065 ms in the twelve-year-olds for both words and nonwords in both modes. This finding agrees with previous findings (e.g. Brooks & MacWhinney, 2000; Jerger et al., 2002).
Phonologically related picture–word naming times
We quantified the priming produced by the phonologically related distractors on picture naming with adjusted naming times, derived by subtracting each participant's baseline naming times from his or her phonologically related naming times as in previous studies (e.g. Jerger et al., 2009). Figure 3 depicts the adjusted naming times in the age groups for words and nonwords (top vs. bottom panels) in the auditory and audiovisual modes. Performance is shown for both the intact and non-intact stimuli (left vs. right panels).
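In code, the adjustment amounts to a per-participant baseline subtraction. The sketch below assumes a long-format table with hypothetical column names (id, lexical, fidelity, mode, condition, rt); it illustrates the computation rather than reproducing the original analysis script.
import pandas as pd

df = pd.read_csv("naming_times.csv")   # hypothetical file: one mean naming time (rt) per cell per participant

# baseline (vowel-onset) distractors do not vary in fidelity, so average them per participant, lexical status, and mode
base = (df[df["condition"] == "baseline"]
        .groupby(["id", "lexical", "mode"])["rt"].mean().rename("baseline").reset_index())
rel = (df[df["condition"] == "related"]
       .groupby(["id", "lexical", "fidelity", "mode"])["rt"].mean().rename("related").reset_index())

adj = rel.merge(base, on=["id", "lexical", "mode"])
adj["adjusted"] = adj["related"] - adj["baseline"]    # negative values = priming (faster than baseline)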
Results were analyzed with a factorial mixed-design analysis of variance with one between-participants factor (four age groups) and three within-participant factors (lexical status [words vs. nonwords], fidelity [intact vs. non-intact], and mode [auditory vs. audiovisual]). Table 1A summarizes the results (significant results are bolded). All four main factors significantly influenced how the phonologically related distractors primed overall picture naming times, with an effect of (i) age group, showing greater priming in the younger than the older children [five-year-olds: –208 ms; seven-year-olds: –143 ms; nine-year-olds and twelve-year-olds: –80 ms], (ii) lexical status, showing greater priming from the nonword than the word distractors [respectively –153 ms vs. –112 ms], (iii) fidelity, showing greater priming from the intact than the non-intact distractors [respectively –162 ms vs. –102 ms], and (iv) mode, showing greater priming from the audiovisual than the auditory distractors [respectively –160 ms vs. –104 ms]. The significantly greater priming for the audiovisual mode is particularly relevant because this pattern highlights a significant influence of visual speech on performance.
A few interactions were also significant, but only one involved age group, namely an age group x fidelity interaction (see Table 1A). As shown in Figure 3 and noted above, the intact (high-fidelity) distractors primed overall picture naming more effectively than the non-intact (low-fidelity) distractors (compare right vs. left panels collapsed across mode and lexical status). This interaction arose because the relative effectiveness of the intact vs. non-intact distractors differed more in the five-year-olds (–104 ms) than in the older groups (seven-year-olds: –44 ms; nine-year-olds: –43 ms; twelve-year-olds: –39 ms). The other significant interactions (two-way and three-way) shown in Table 1A involved mode. To clarify these interactions – and determine whether visual speech significantly influenced performance – we quantified the difference between each pair (audiovisual–auditory) of adjusted naming times. For the sake of simplicity, we labeled all of the difference scores, for both the intact (high-fidelity) and non-intact (low-fidelity) stimuli, a Visual Speech Effect (VSPE) for these analyses. We should emphasize, however, that this VSPE is reflecting an actual filling in of some missing auditory cues for non-intact speech and, by contrast, an augmenting of auditory cues for intact speech. The difference scores are plotted in Figure 4 and represent the difference between the lines in Figure 3. The error bars show the 95% confidence intervals for the difference scores. Note that the confidence intervals do not provide relevant information about the intact and non-intact conditions because only difference scores are interpretable for factors that are not independent.
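To make the difference-score computation concrete, the sketch below continues the hypothetical adj table introduced earlier and computes a per-participant VSPE with a conventional t-based 95% confidence interval; the mode labels are assumptions, and the published intervals may have been derived differently (e.g. from the ANOVA error term).
from scipy import stats

wide = adj.pivot_table(index=["id", "lexical", "fidelity"], columns="mode", values="adjusted").reset_index()
wide["vspe"] = wide["audiovisual"] - wide["auditory"]   # negative = more priming with visual speech

def ci95(x):
    """Mean and t-based 95% confidence interval for a set of difference scores."""
    m, h = x.mean(), stats.sem(x) * stats.t.ppf(0.975, len(x) - 1)
    return m - h, m + h

# add the age group to the grouping to obtain per-group intervals as in Figure 4
print(wide.groupby(["lexical", "fidelity"])["vspe"].apply(ci95))   # interval excluding zero = significant VSPE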
The higher-order (mode x fidelity x lexical status) interaction occurred because the VSPE for the non-intact onsets (Figure 4 collapsed across age groups) was greater for the nonwords than the words (i.e. respectively 91 ms vs. 62 ms; left vs. right panels) whereas the VSPE for the intact onsets did not differ for the nonwords vs. words (i.e. respectively 36 ms vs. 33 ms). Although this higher-order interaction may limit the interpretation of the lower-order interactions, we should nonetheless acknowledge the interactions of mode with fidelity and with lexical status. The mode x fidelity interaction occurred because results showed a greater VSPE for the non-intact than intact onsets (respectively –77 ms vs. –34 ms; Figure 4 collapsed across age groups and lexical status). The mode x lexical status interaction emerged because results showed a larger VSPE for nonwords than words (respectively –63 ms vs. –47 ms; Figure 4 collapsed across age groups and fidelity).
With regard to whether visual speech significantly influenced performance, the confidence intervals (Figure 4) address whether a given group showed a significant VSPE (i.e. did each result differ significantly from zero?). If the 95% confidence interval, or the range of plausible difference scores, does not contain zero, then the results are significant. The confidence intervals revealed a significant VSPE for all the non-intact and intact onsets excepting one, namely intact nonwords in the five-year-olds.
Finally, confidence intervals for the results in Figure 3 are also of interest in terms of whether the phonologically related distractors significantly primed naming in each group. Our specific question was whether each adjusted naming time (difference score between phonologically related naming time and baseline naming time) in each group for each mode differed significantly from zero. Table 2 shows the 95% confidence intervals. Results indicated significant priming – the confidence interval did not contain zero – for all datapoints in Figure 3 excepting one; namely non-intact words, auditory mode in the nine-year-olds. Although values outside of 95% confidence intervals are relatively implausible, the lower limits neared zero for two significant results – non-intact nonwords, auditory mode in the nine-year-olds and twelve-year-olds – a pattern suggesting that we should have a lesser degree of confidence in the repeatability of these two outcomes.
note: * = significant priming; ns = no priming; each age group represents a range of chronological ages (see text).
With regard to the above effects of age, a complication is that the differences in the baseline naming times muddle an unequivocal interpretation of the results. In other words, the greater priming effects in the five-year-olds (Figure 3) could be a result of age or of these children's slower baseline naming times. A straightforward approach to controlling the baseline differences (see Damian & Dumay, 2007) is to develop priming proportions. Thus we divided each participant's adjusted naming times by her or his corresponding baseline naming times (i.e. [mean time in the phonologically related condition minus mean time in the baseline condition] divided by [mean time in the baseline condition]). A factorial mixed-design analysis of variance on these transformed data, with the same between- and within-participant factors, yielded the same pattern of results as above (see Table 1B). We continued to observe the significant effect of age group, showing greater priming in the younger than older children [five-year-olds: –0·110; seven-year-olds: –0·090; nine-year-olds and twelve-year-olds: –0·070], and the one age group interaction, age group x fidelity, which was elaborated above.
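Continuing the same hypothetical adj table, the proportion transform is a single division of the adjusted time by the participant's own baseline:
adj["proportion"] = adj["adjusted"] / adj["baseline"]
# e.g. roughly -208 ms of priming against an ~1855 ms baseline gives a proportion of about -0.11,
# in line with the value reported above for the five-year-olds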
note: ns = p > ·05. Results of a mixed-design analysis of variance with one between-participants factor (Four Age Groups) and three within-participants factors (Lexical Status: word vs. nonword; Fidelity: intact vs. non-intact; Mode: auditory vs. audiovisual). The degrees of freedom are 1,128 for all factors except those involving Age Group wherein the degrees of freedom are 3,128.
With regard to the interactions that the VSPE clarified in Figure 4, the transformed data also continued to reveal the significant higher-order interaction (mode x fidelity x lexical status) and the two lower-order interactions (mode x fidelity and mode x lexical status). A third lower-order interaction (lexical status x fidelity) also achieved significance (p = ·038). This interaction occurred because the difference between priming for the intact vs. non-intact stimuli was slightly greater for nonwords than words, with difference scores respectively of ·043 and ·036 for the proportion transformed data (and 66 vs. 53 ms for the untransformed data).
Finally, it is of interest to ask whether there was a complete or partial Visual Speech Fill-In Effect. The previous mode x fidelity interaction indicates that phonological priming by the intact vs. non-intact distractors differed more for the auditory (–145 ms vs. –64 ms) than audiovisual (–179 ms vs. –141 ms) mode (see Figure 3). Clearly this interaction reflects a robust Visual Speech Fill-In Effect or, as indicated previously, a greater VSPE for the non-intact than intact onsets. However, the current question is whether the Visual Speech Fill-In Effect was complete or partial (in other words, were the non-intact audiovisual distractors as phonologically effective as their intact counterparts).
To evaluate whether phonological priming differed for the non-intact vs. intact audiovisual distractors, we carried out orthogonal contrasts (Abdi & Williams, 2010) on the mean audiovisual adjusted naming times collapsed across the words and nonwords. We found significantly greater priming from the intact than non-intact audiovisual distractors in all age groups: (five-year-olds, F contrast (1,128) = 64·08, MSE = 1421·23, p < ·001, partial η² = ·334; seven-year-olds, F contrast (1,128) = 5·75, MSE = 1421·23, p = ·02, partial η² = ·043; nine-year-olds, F contrast (1,128) = 6·80, MSE = 1421·23, p = ·01, partial η² = ·050; twelve-year-olds, F contrast (1,128) = 7·77, MSE = 1421·23, p = ·006, partial η² = ·057). Thus even though the Visual Speech Fill-In Effect was robustly effective, the non-intact audiovisual distractors were not as phonologically compelling as their intact counterparts.
VSPE for the individual /b/ and /g/ onsets
To probe the influence of visual speech as a function of the speechreadability of the onsets, we analyzed the VSPE scores – without collapsing across the onsets – with a factorial mixed-design analysis of variance with one between-participants factor (four age groups) and three within-participant factors (lexical status [words vs. nonwords], fidelity [intact vs. non-intact], and onset [b vs. g]). There was no significant effect of lexical status nor were there any interactions between lexical status and fidelity or onset; thus to graph the results, the VSPE for the onsets was collapsed across words and nonwords. Figure 5 portrays the collapsed VSPE for the /b/ and /g/ onsets in the high- (intact) and low- (non-intact) fidelity conditions in the age groups, along with the 95% confidence intervals.
The statistical analysis revealed only one significant result involving onset: a greater VSPE for the /b/ than the /g/ onset (respectively –64 ms vs. –47 ms when collapsed across fidelity) (F(1,128) = 18·17, MSE = 4340·41, p < ·0001, partial η² = ·124). The 95% confidence intervals shown in Figure 5 indicated a significant VSPE – the confidence interval did not contain zero – for all datapoints excepting one; namely the intact stimuli with a /g/ onset in the five-year-olds.
In short, Analysis 1 indicates that phonological priming overall was significantly greater for the audiovisual than auditory mode. Visual speech produced significantly greater phonological priming in children from four to fourteen years, with all age groups showing a significant effect of visual speech for most conditions. The influence of visual speech was slightly greater for the /b/ than the /g/ onsets, but phonological priming did not show the pronounced differences that characterize identifying phonemes on direct measures of speechreading (see also Jordan & Bevan, 1997). Next, we investigated the effect of child factors on performance as a function of the mode and stimulus fidelity.
ANALYSIS 2
To identify the child factors underpinning the VSPE, we analyzed results for the intact vs. non-intact words and nonwords as a function of the children's ages and verbal abilities. Our goal was to determine which of the child factors – among age, vocabulary, phonological awareness, and speechreading (visual-only speech recognition) – uniquely contributed to performance. We defined ‘uniquely’ statistically as the independent contribution of each variable after controlling for the other variables (Abdi, Edelman, Valentin & Dowling, 2009). Use of this regression analytic approach, which yields part (aka, semi-partial) correlations, is essential for identifying the critical individual factors underpinning speech perception by children.
We investigated two basic research questions: Is the VSPE supported by the same unique child factors for (i) intact vs. non-intact stimuli and (ii) words vs. nonwords? There is little to no evidence to assist in predicting these results. However, we can predict the effects of child factors from models of the picture–word task. As noted in the ‘Introduction’, the model of Levelt et al. (1991) based on auditory distractors proposes that the phonologically related distractor (e.g. [picture]–[distractor] pair of [bug]–[bus]) primes picture naming by creating crosstalk between the input and output phonological representations supporting speech perception and production. The congruent distractor activates input phonological representations whose activation spreads to activate the corresponding output phonological representations, and this crosstalk speeds selection of the output speech segments for naming (Roelofs, 1997). These models – to the extent they generalize – predict that the quality of children's phonological representations or knowledge will influence performance on our task. Again, we view visual speech as an extra phonetic resource as proposed by Campbell (1988). Finally, based on the hierarchical model of speech segmentation (Mattys et al., 2005), we previously proposed that children's sensitivity to visual speech will vary depending on their weighting of the phonetic–phonological content. If this is so, the children's phonological knowledge may be uniquely important to the VSPE, particularly for nonwords. In short, the findings below should provide fundamental new knowledge about the contribution of age-related improvements vs. the absolute excellence of selected verbal skills to speech perception by children.
METHODS
Participants
Participants were the four groups of Analysis 1.
Materials and procedure
Receptive vocabulary was estimated with the Peabody Picture Vocabulary Test (Fourth Edition; Dunn & Dunn, 2007), measuring children's ability to identify a picture illustrating a spoken word's meaning. Phonological awareness was estimated with three subtests of the Pre-Reading Inventory of Phonological Awareness (Dodd, Crosbie, McIntosh, Teitzel & Ozanne, 2003), measuring children's ability to isolate onset phonemes, recognize alliterative onset phonemes, and segment the phonemes within a word. Speechreading was estimated with the Children's Audio-Visual Enhancement Test (Tye-Murray & Geers, 2001), measuring children's ability to repeat words presented in the visual (and auditory) modes. Results for the auditory mode were not reported because all age groups performed at ceiling. Results for the visual mode were scored by words and by word onsets with visemes (visually indistinguishable phonemes) counted as correct. The latter results were used to quantify speechreading for the regression analyses.
RESULTS
Descriptive statistics for child factors
Table 3 summarizes the average ages along with selected verbal skills in the groups. Vocabulary knowledge in the groups averaged a standard score of about 120, a result indicating that these children had higher than average verbal skills. Although high verbal performance is, in general, typical of children in research studies, such performance could potentially affect the generalizability of the results to children with more ‘average’ verbal abilities. Phonological awareness averaged 58% correct in the youngest group and about 81% correct in the other groups; performance ranged from the ceiling in all groups to a floor of about 5% in the five-year-olds, 45% in the seven-year-olds and nine-year-olds, and 60% in the twelve-year-olds. Speechreading ranged, on average, from 6% to 25% across groups when scored by words and 39% to 74% when scored by word onsets.
note: standard deviations are in parentheses; * onsets were scored with visemes counted as correct (e.g. pat for bat). Each age group represents a range of chronological ages (see text).
Association between VSPE and child factors
The goal of this project was explanatory – thus we focused on understanding which of the child factors, if any, contributed significantly to the VSPE when the effects of the other factors were controlled. To assess the relative importance of each factor in determining the VSPE, we conducted four regression analyses ((i) words–intact, (ii) words–non-intact, (iii) nonwords–intact, and (iv) nonwords–non-intact) to obtain the part (aka semi-partial) correlation coefficients and partial F statistics (Abdi et al., 2009). The dependent variable was always the VSPE, and the independent variables were always the standardized scores for age, vocabulary, phonological awareness, and speechreading. Table 4 summarizes these regression results, along with the slope coefficients, for the intact vs. non-intact conditions (left vs. right panels) of the words vs. nonwords (top vs. bottom panels).
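As a minimal sketch of how a part (semi-partial) correlation can be obtained, the code below residualizes each predictor on the other three and correlates that residual with the VSPE; the input file and column names are hypothetical, and the published analyses followed Abdi et al. (2009) rather than this script.
import numpy as np
import pandas as pd

def part_correlation(data, y, x, covariates):
    """Correlate y with the part of x that is independent of the covariates."""
    X = np.column_stack([np.ones(len(data))] + [data[c].to_numpy() for c in covariates])
    beta, *_ = np.linalg.lstsq(X, data[x].to_numpy(), rcond=None)
    residual = data[x].to_numpy() - X @ beta          # x with the other predictors partialled out
    return np.corrcoef(data[y].to_numpy(), residual)[0, 1]

kids = pd.read_csv("child_factors.csv")   # hypothetical columns: vspe, age, vocabulary, phonology, speechreading
predictors = ["age", "vocabulary", "phonology", "speechreading"]
for p in predictors:
    others = [q for q in predictors if q != p]
    print(p, round(part_correlation(kids, "vspe", p, others), 3))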
notes: ns = not significant (p > ·05). The part correlation coefficients and the partial F statistics evaluate the variation in VSPE uniquely accounted for (after removing the influence of the other variables) by age, vocabulary, phonology, or speechreading of onsets. The slope coefficients quantify the slope of the relationship between the VSPE and each individual child factor when all of the other child factors are held constant. The multiple correlation coefficients for all of the variables considered simultaneously were as follows: words: ·223 (intact) and ·358 (non-intact); nonwords: ·261 (intact) and ·247 (non-intact). dfs = 1,127 for partial F and 4,127 for Multiple R.
Results for the part correlations reflected one overall pattern for the intact stimuli and the non-intact words: the VSPE was uniquely influenced by the children's phonological skills. In contrast to this pattern of results, the VSPE for the low-fidelity (non-intact) nonwords was uniquely influenced only by speechreading skills. In short, these results indicate that the VSPE is underpinned by phonological skills unless the input is an unfamiliar low-fidelity stimulus without a lexical representation, in which case speechreading skills become uniquely contributory.
DISCUSSION
This research assessed the influence of visual speech on phonological priming by high- vs. low-fidelity auditory speech in children between four and fourteen years. The low-fidelity stimuli were words and nonwords with a visual consonant + rhyme coupled to an auditory non-intact onset + rhyme. Our research paradigm presented the stimuli in the auditory and audiovisual modes to determine whether (i) the presence of visual speech would fill in the non-intact auditory onsets and prime picture naming more effectively than auditory speech alone and (ii) phonological priming would display a greater influence of visual speech for non-intact than intact auditory onsets. The results showed a significant VSPE not only for the non-intact, but also for the intact, onsets – a pattern indicating that visual speech not only filled in the non-intact auditory cues but also supplemented the intact auditory cues. We observed a consistently significant influence of visual speech on phonological priming for children of all ages from four to fourteen years for most conditions. The significant boost by visual speech was substantial, particularly for the non-intact stimuli: about 34 ms (intact) and 77 ms (non-intact).
Results assessing lexical status indicated that the nonwords reflected significantly greater priming overall than the words (respectively –153 ms vs. –112 ms). However, the lexical status of stimuli interacted with the mode and fidelity. Results showed that the VSPE for non-intact onsets was significantly greater for nonwords than words (respectively 91 ms vs. 62 ms), whereas the VSPE for intact onsets did not differ significantly for the nonwords vs. words (respectively 36 ms vs. 33 ms; Figure 3 collapsed across age groups). A greater VSPE for the non-intact nonwords than words is consistent with our predictions. When auditory speech has low fidelity, visual speech assumes a relatively greater weight and thus affects performance more. When this relatively greater weighting of visual speech is coupled with the relatively greater weighting of the phonetic–phonological content for nonwords, a significantly greater influence of visual speech is observed for nonwords than words.
With regard to the higher-order interaction – the VSPE differed for non-intact, but not for intact, words vs. nonwords – we should note that our set of onsets was constrained (word or nonword stimuli consisting of /b/ and /g/ onsets along with filler and baseline items). Thus, it is possible that all of the intact word/nonword onsets in this limited set had sufficient sensory input for correct perception, and this would yield no difference in performance for the intact words vs. nonwords.
Results for the multiple comparisons – in all age groups – indicated significantly greater priming for the audiovisual than the auditory mode not only for all non-intact but also for all intact conditions excepting intact nonwords in the five-year-olds. A worthy question is: Why did these results – in contrast to the literature – show a significant VSPE for intact stimuli in all age groups? One possibility is that the variability introduced by intermixing the fidelity (intact vs. non-intact) and mode (audiovisual vs. auditory) of the stimuli may have increased children's awareness of the sensory qualities of the input – thus making visual speech more potent. Results on the Garner task clearly indicate that participants – when they classify consonants – find it harder to ignore irrelevant inputs that vary (/ba/, /bi/ vs. /ga/, /gi/) vs. those that are constant (/ba/ vs. /ga/). This pattern suggests that the children may have found it harder to ignore speech distractors that varied in both fidelity and mode. Results on the Garner task would appear to generalize to our task because individuals process speech automatically (even when instructed to attend to picture naming) and implicitly encode and integrally process all speech cues, not just the target cues. To illustrate, three- to five-year-olds on a talker recognition task identify cartoon characters from their vocal signatures (e.g. pitch, speaking rate, dialect) at well above chance levels, indicating that these non-target speech cues were incidentally learned (Spence, Rollins & Jerger, 2002). With regard to age, Jerger and colleagues (1993) have assessed performance on the Garner task with other types of speech cues and observed integral processing at all ages between three and seventy-nine years. Thus, we propose that the variability in both stimulus fidelity and mode may have made visual speech more effective at influencing performance. This reasoning is consistent with the proposals of dynamic systems theory (see ‘Introduction’; Smith & Thelen, 2003).
Another relevant question concerned whether the non-intact audiovisual distractors were as phonologically effective as their intact counterparts (in other words, was the Visual Speech Fill-In Effect complete or partial?). Results in all age groups indicated that the intact audiovisual distractors produced greater phonological priming than their non-intact counterparts. Thus, even though the Visual Speech Fill-In Effect for non-intact distractors was impressively robust, the non-intact audiovisual distractors were not as phonologically potent as their intact counterparts. This outcome agrees with previous results indicating that the visually influenced percept of the McGurk effect is not equivalent to the percept produced by a comparable audiovisual syllable (Rosenblum & Saldana, 1992).
Finally, results assessing the child factors underpinning performance indicated that the VSPE was uniquely influenced by phonological skills for the intact words and nonwords and the non-intact words. In contrast to this unified pattern of results, the VSPE for non-intact nonwords was uniquely influenced by speechreading skills. We can speculate that the influence of visual speech is more data-driven – i.e., more dependent on speechreading the ‘data’ – when the input is unfamiliar non-intact nonwords, and more knowledge-driven – i.e., more dependent on phonological skills – when the input is intact words/nonwords or familiar non-intact words with stored lexical phonological patterns. Clearly the factors associated with the influence of visual speech on performance are multi-faceted.
In conclusion, the new Visual Speech Fill-In Effect extends the range of measures for assessing benefit from visual speech by children. Results on the new measure document that children from four to fourteen years benefit from visual speech during multimodal speech perception. These findings emphasize that children – like adults – experience a speaker's multimodal utterance. Such information seems critical for incorporating visual speech into our developmental theories of speech perception.
SUPPLEMENTARY MATERIALS
For supplementary material for this paper, please visit http://dx.doi.org/10.1017/S030500091500077X.