Introduction
An issue receiving intense attention in speech production research is how the brain plans linguistic processing prior to overt articulation (Levelt et al., 1999). Although articulation is often treated as a lower-level motor output (Indefrey & Levelt, 2004), speaking is a highly complex sensorimotor task that requires the combined efforts of feedforward and feedback control systems (Guenther et al., 2006). To date, how these two subsystems work together to ensure successful communication remains poorly understood.
Terminology and general principles of speech motor control
Several speech motor control models have been formulated; we integrated the Directions Into Velocities of Articulators model (DIVA; Guenther, 2006) and State Feedback Control (Houde & Nagarajan, 2011) to describe feedforward and feedback control (see Figure 1 for details). Speech production begins with a unit in the “speech sound map,” which can be a phoneme, syllable, or phrase. As schematized in Figure 1, feedforward control reads out previously learned motor commands for speech sounds and issues them to the articulators. This mechanism operates independently of the sensory feedback associated with articulation (Guenther, 2016). Feedforward control therefore enables rapid speech but lacks the ability to monitor errors in speech output (Parrell et al., 2019). Because we live in time-varying and unpredictable surroundings, feedforward control alone cannot ensure effective speech.
Unlike feedforward control, feedback control relies on sensory feedback to maintain speech (Guenther, 2016; Kearney & Guenther, 2019). The auditory feedback control system compares actual auditory feedback with intended auditory feedback; in case of any mismatch, auditory errors are transformed into corrective commands that decrease the perceived errors. A somatosensory feedback control mechanism operates similarly (Guenther, 2006, 2016; Hickok et al., 2011). Within this framework, two coexisting routes generate the intended auditory feedback (Tian & Poeppel, 2012). First, the activation of the speech sound map activates the auditory target, which defines the desired auditory feedback that should arise when a speaker correctly produces the sound (Guenther, 2016; Tourville & Guenther, 2011). Second, an internal forward model uses an efference copy of the feedforward commands to internally estimate the current state of vocal tract dynamics and generate an auditory prediction (Hickok, 2012; Tian & Poeppel, 2010, 2012, 2013). The feedback control system is indispensable in speech motor control, allowing speakers to regulate movements and interact effectively with the environment in the presence of external perturbations (Bays & Wolpert, 2006).
Parrell et al. (2019) proposed a special case of feedforward control in which auditory feedback is not involved. An auditory prediction (realized by an internal forward model) is the anticipated outcome of an articulatory movement before auditory feedback is received. This prediction is based on previously established causal associations between motor commands and auditory output, which is also why speakers feel that they can “hear” speech internally when they imagine speaking without moving any articulators (Tian & Poeppel, 2012, 2015). Critically, motor-based auditory predictions can be directly compared with auditory targets to verify the correctness of planned feedforward commands (Hickok, 2012). If auditory predictions fail to match auditory targets, the feedforward control system transforms the error signals into corrective motor commands (Parrell et al., 2019).
Speech motor control in bilinguals
Current models detail the organization of feedforward and feedback control exclusively in the first language (L1; see Parrell et al., 2019 for a review), whereas research on the second language (L2) has not yet fully considered this issue. More recently, researchers have noticed that speech motor control in bilinguals may vary by language type (L1 vs. L2; Liu & Tian, 2018; Mitsuya et al., 2011; Simmonds et al., 2011a, 2011b). Here we adopt Grosjean’s (2010) succinct definition of bilinguals as people who use two languages in their daily life. Of note, theoretical and empirical information on L2 speech motor control is still insufficient, which highlights the need for further research in this field.
For L1 speech production, a basic idea is that the feedforward and feedback control subsystems cooperate with each other (Parrell et al., 2019); thus, it is important to understand the relative weighting of these systems in speech motor control (Guenther, 2016; Guenther et al., 2006). Researchers have emphasized a transition from feedback-dominant to feedforward-dominant control, driven by production experience (Guenther, 2006; Guenther & Vladusich, 2012; Liu et al., 2010c; Scheerer et al., 2013). Speakers’ initial attempts to produce speech result in errors, and production relies heavily on feedback control. With sufficient practice, feedforward commands can produce the intended sensory consequences without errors, and production principally relies on feedforward control (Guenther, 2006; Guenther & Vladusich, 2012). However, L1 and L2 production experiences are inherently different (Mitsuya et al., 2011). L1 speech motor learning begins in infancy (Tourville & Guenther, 2011), but within the broad population of bilinguals, the age of L2 acquisition varies widely. Some bilinguals acquire the L2 from birth, some around puberty, and others during adulthood (Woumans et al., 2015). In most cases, bilinguals are exposed to an L2 after their L1 has already been established. It is therefore possible that feedforward and feedback control are weighted differently for bilinguals’ two language systems.
Motor movements used to produce native sounds are highly overlearned and automatic, requiring much less online sensory monitoring (Simmonds et al., 2011a, 2011b). However, evidence shows that L2 sounds are produced with larger variability (Chen et al., 2001; Ng et al., 2008; Wang & van Heuven, 2006), implying that L2 feedforward commands are less familiar and more variable (Mitsuya et al., 2011). We therefore hypothesized that, compared with L1 production, L2 production relies more on feedback control and less on feedforward control. This hypothesis is supported by two early studies reporting that under delayed auditory feedback, bilinguals speak more slowly and produce more hesitations and sound repetitions in L2 than in L1 (Mackay, 1970; Van Borsel et al., 2005). The underlying logic is that an increased weighting of feedback control increases the disturbing influence of incoming perturbed auditory feedback (Guenther, 2006).
The past several decades have seen an unprecedented upsurge in the number of bilinguals; however, for most bilinguals, speaking a second language is a challenging task (Bergmann et al., 2015). Typical disfluency markers include pauses, syllable repetitions, and self-corrections (Götz, 2013; Kormos, 2006). Growing evidence shows that speakers are considerably less fluent in L2 than in L1. For example, Wiese (1984) reported that L2 speech contains two to three times as many hesitations as L1 speech, and Hincks (2008) found slower speech rates in L2 than in L1. Feedforward control is crucial for fluent speech, whereas excessive reliance on feedback control causes a time-lag problem because of the delay inherent in processing auditory feedback and launching corrective commands (Civier, 2010; Civier et al., 2010; Perkell, 2012). It is thus reasonable to hypothesize that poorer L2 fluency is correlated with heavier weighting of feedback control and, accordingly, that better L1 fluency is associated with heavier weighting of feedforward control.
This fluency-related hypothesis is supported only by indirect evidence from patients with speech motor disorders. Guenther (2016) found that patients with speech motor disorders usually have impaired feedforward control. For example, apraxia of speech, a disorder of speech motor planning and programming, is most often associated with damage to the left inferior frontal gyrus, anterior insula, and/or ventral precentral gyrus. According to the DIVA model, damage to these areas affects the speech sound map and the feedforward commands for articulating speech sounds. Stuttering also disrupts speech fluency, but its mechanism remains controversial. Several researchers believe stuttering results from abnormal auditory-motor transformation in the feedback control system (Cai et al., 2012; Loucks et al., 2012). Others suggest that it results from a general auditory prediction deficit (Daliri & Max, 2015a, 2015b) and a heavy weighting of feedback control (Civier et al., 2010; Tourville et al., 2008).
Feedforward and feedback control of voice intensity
The current study aimed to address whether the relative weighting of feedforward and feedback control varies between L1 and L2. In previous bilingualism research, investigators either compared L1 and L2 speakers of the same language or compared L1 and L2 within the same bilinguals (Bergmann et al., 2015). The difficulty for intraspeaker comparisons lies in determining whether observed differences are caused by language status or by differences between the languages themselves. To avoid this confusion, we selected voice intensity to isolate the role of language status, because this attribute has few well-known language-specific phonological features.
A considerable amount of research describes how speakers control pitch (Chang et al., 2013; Chen et al., 2012), formants (Cai et al., 2012; Mitsuya et al., 2011), and intensity (Bauer et al., 2006; Heinks-Maldonado & Houde, 2005; Liu et al., 2007). Typically, auditory perturbations induce compensatory behaviors that change speech parameters in the opposite direction (Behroozmand et al., 2015; Chang et al., 2013). Previous studies have provided evidence that pitch and formant control may differ across languages. In tonal languages, such as Chinese, pitch plays a key role in differentiating meanings, whereas in nontonal languages, such as English, pitch conveys only stress and intonation information (Chen et al., 2012; Liu et al., 2010b; Ning et al., 2015; Ning et al., 2014). Languages also differ in the number, location, and relative proximity of vowels, so the requirements for formant control likewise vary across languages (Mitsuya et al., 2011). Uniquely, voice intensity is a basic, low-level sound attribute (Tian et al., 2018) that is not highly effective for encoding linguistic contrasts (Liu et al., 2007). To date, there is no direct evidence suggesting that voice intensity is sensitive to the different languages native speakers use.
Studies have shown that online intensity control is similar to pitch and formant control (Bauer et al., 2006; Heinks-Maldonado & Houde, 2005; Liu et al., 2007). For example, Bauer and colleagues found that during vowel production, individuals demonstrated a compensatory response to unexpected intensity perturbations (200 ms; ±1, ±3, or ±6 dB SPL; see also Heinks-Maldonado & Houde, 2005). Furthermore, Liu et al. (2007) observed that Mandarin speakers also compensated for intensity perturbations (200 ms, ±3 dB SPL) during Mandarin production. These studies imply that intensity control works to monitor and stabilize voice intensity around a desired level. In this line of research, it is assumed that speakers who rely more on feedforward control will produce speech based more on stored feedforward commands, and hence produce more stable vocal output, whereas speakers who rely more on feedback control will depend more on auditory feedback to correct for errors, and hence be more affected by perturbations and produce larger compensatory responses (Guenther, 2006). These studies addressed intensity control through real-time manipulation of speakers’ original auditory feedback.
Noise experiments offer another line of intensity control research. Lombard (1911) was the first to find that speakers unconsciously increase their voice intensity to compensate for reduced audibility in a noisy environment. This phenomenon, known as the Lombard effect, has been documented in many studies (Lin et al., 2015; Patel & Schell, 2008). Noise experiments typically instruct participants to produce speech while a constant noise is added to their feedback (Bauer et al., 2006). Because speaking is a goal-oriented behavior that developed to facilitate communication, speakers usually increase voice intensity automatically to improve the signal-to-noise ratio (Liu et al., 2007). In other words, whereas intensity control in online perturbation paradigms functions to monitor and stabilize intensity around a desired level, intensity control in noisy environments functions to overcome the noise and keep the speaker audible (Chang-Yit et al., 1975).
Studies have addressed feedforward control by observing how speakers adapt motor commands when auditory feedback is perturbed over a long period (Ballard et al., 2018; Lametti et al., 2014). However, speech adaptation paradigms reveal only the updating of feedforward control, rather than feedforward control in and of itself. In the present study, we employed a noise-masking paradigm to investigate feedforward control of voice intensity. This paradigm is based on the premise that a loud masking noise effectively eliminates the auditory feedback used for controlling speech movements (Christoffels et al., 2007; Houde et al., 2002; Lin et al., 2015; Maas et al., 2015; Terband et al., 2015). As schematized in Figure 2A, a masking noise disrupts comparisons involving actual auditory feedback. Although it is impossible to create an experimental condition without any feedback (Kent et al., 2000; Maas et al., 2015), it is reasonable to expect a much heavier reliance on feedforward control in the absence of auditory feedback (Guenther, 2006, 2016; Guenther & Vladusich, 2012).
Feedforward control incorporates a mechanism that allows speakers to make vocal adjustments independent of auditory feedback (Hickok, 2012; Parrell et al., 2019). In the face of a loud masking noise, speakers evaluate the disturbance from the noise before they speak. Considering the adverse environment, speakers retrieve the predetermined commands but do not issue them directly to the articulators, which would produce obvious errors. Instead, speakers generate an auditory prediction of voice intensity based on the feedforward commands and the background noise. They then internally compare the auditory prediction with the auditory target, which activates an auditory error representing the noise-induced decrease in audibility. At this point, speakers launch a corrective command, based on established auditory-motor transformations, to overcome the masking noise. We thus predicted that speakers who rely more on feedforward control would adjust their motor plans based more on the predicted loss in audibility, and hence produce a larger Lombard effect than those who rely less on feedforward control.
The premise of feedback control involves speakers’ perception of their auditory feedback. We therefore applied noise signals such that participants could hear their voice over the noise, but their voice was less audible than what they expected to hear. The purpose of the added noise was to partially mask air-conducted auditory feedback and thus reduce the signal-to-noise ratio. As shown in Figure 2B, a noise that is not intense enough to mask the original auditory feedback activates feedback control comparisons. Although feedforward control also plays a role in vocal adjustments to noise, it was reasonable to expect an increased weighting of feedback control to correct motor commands based on perceived auditory errors (Guenther, 2006; Guenther & Vladusich, 2012). We thus predicted that speakers who rely more on feedback control would adjust their motor plans based more on the perceived loss in audibility, and hence produce a larger Lombard effect than those who rely less on feedback control.
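To make this weighting logic concrete, the following toy sketch (our illustration, not a model from the paper or the literature; all weights and loss values are hypothetical) expresses the two predictions: under masking noise the perceived error term is unavailable, so the Lombard effect indexes feedforward weighting, whereas under audible noise the perceived error drives compensation, so the effect indexes feedback weighting.

```r
# Toy illustration only: the Lombard effect as a weighted sum of a predicted
# audibility loss (feedforward route) and a perceived audibility loss
# (feedback route). All numbers are hypothetical.
lombard_db <- function(w_ff, w_fb, predicted_loss_db, perceived_loss_db) {
  w_ff * predicted_loss_db + w_fb * perceived_loss_db
}

# Masking noise: auditory feedback is eliminated, so perceived loss is 0 and
# a higher feedforward weight yields a larger intensity increase.
lombard_db(w_ff = 0.8, w_fb = 0.4, predicted_loss_db = 10, perceived_loss_db = 0)  # 8 dB
lombard_db(w_ff = 0.5, w_fb = 0.4, predicted_loss_db = 10, perceived_loss_db = 0)  # 5 dB

# Audible noise: feedback is available, so a higher feedback weight yields a
# larger intensity increase for the same perceived loss.
lombard_db(w_ff = 0.2, w_fb = 0.7, predicted_loss_db = 2, perceived_loss_db = 6)   # 4.6 dB
lombard_db(w_ff = 0.2, w_fb = 0.4, predicted_loss_db = 2, perceived_loss_db = 6)   # 2.8 dB
```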
The current study
We designed two noise experiments to test whether the relative reliance on feedforward and feedback control is affected by language type and L2 fluency in Chinese–English bilinguals. In Experiment 1, we addressed the weighting of feedforward control by observing how bilinguals react to a masking noise (90 dB SPL multitalker noise) during L1 and L2 spoken word production. We predicted that L1 relies more on feedforward control than L2, so the Lombard effect would be larger in L1. We also predicted a correlation between L2 fluency and reliance on feedforward control, such that more fluent bilinguals would exhibit larger Lombard effects.
In Experiment 2, we addressed the weighting of feedback control by observing how bilinguals react to a weak noise (30 dB SPL multitalker noise) and a strong noise (60 dB SPL multitalker noise) during L1 and L2 spoken word production. Although both the strong noise and the masking noise are loud, they differ in that speakers’ auditory feedback remains available for feedback control under strong noise but not under masking noise. We predicted that L2 relies more on feedback control than L1, so the Lombard effect would be larger in L2. We also expected a correlation between L2 fluency and reliance on feedback control, such that less fluent bilinguals would show larger Lombard effects.
Experiment 1
Methods
Participants
Experiment 1 was completed by 24 Chinese–English bilinguals from Renmin University of China. All participants were right-handed, free of any neurological disease, and self-reported normal hearing. On enrolling in the study, participants were instructed to name pictures in both L1 and L2 at their habitual volume while wearing headphones. Multitalker noise at randomly varying levels (30 dB, 60 dB, and 90 dB SPL) was presented through the headphones, and participants judged whether they could perceive their auditory feedback at each noise level. All participants reported that they could hear their voice under the 30 dB and 60 dB noise but not under the 90 dB noise. This screening test was performed to ensure the validity of the noise manipulation in the current study.
The bilinguals’ (11 males) mean age was 20.3 years (SD = 2.2, range 18–28). Bilinguals can be classified by age of L2 acquisition: early bilinguals learn their L2 before eight years of age, and late bilinguals at eight years or older (Birdsong & Molis, 2001). In the current study, we included only bilinguals who reported receiving their schooling in Chinese and being exposed to English after age 8 (see also Epstein et al., 1996). The mean age of L2 acquisition for the 24 Chinese–English bilinguals was 9.7 years (SD = 1.1, range 9–13).
Stimuli
Twenty black-and-white simple line drawings (15 targets and 5 practice items) were selected from a database created by Zhang and Yang (2003). The practice items were used to familiarize participants with the experimental procedure and were not employed in the formal experiment. All pictures referred to common objects and had good indexes of visual complexity, familiarity, and image agreement. All pictures had monosyllabic names in both Chinese and English (e.g., Chinese: “猫” /mao/; English: cat). We employed a 90 dB SPL multitalker noise to mask participants’ auditory feedback (Patel & Schell, 2008).
Design
The experiment adopted a 2 (language: L1 and L2) × 2 (noise condition: quiet and masking noise) within-subjects and within-items design. Within a block, participants named the 15 target pictures consecutively, yielding 15 trials per block. The order of blocks (L1-quiet, L1-masking noise, L2-quiet, L2-masking noise) was randomized. Participants completed three blocks for each experimental condition, for a total of 180 trials. The order of items was randomized within L1 blocks but pseudo-randomized within L2 blocks so that a target never followed a target with the same initial phoneme (e.g., ball, book), avoiding a phonological facilitation effect; a sketch of this constraint appears below. A new order was generated for each participant and each block.
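As a minimal sketch of this pseudo-randomization constraint (ours, not the authors’ script; the data frame and its ‘onset’ column are hypothetical):

```r
# Reshuffle the 15 L2 targets until no item immediately follows another item
# sharing the same initial phoneme. 'items' is a data frame with one row per
# target and a hypothetical 'onset' column holding each name's initial phoneme.
shuffle_no_onset_repeats <- function(items) {
  repeat {
    ord <- items[sample(nrow(items)), ]                     # random permutation
    if (all(head(ord$onset, -1) != tail(ord$onset, -1))) {  # adjacent pairs differ
      return(ord)
    }
  }
}
```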
Apparatus
The experiment was conducted in a soundproof room and controlled by E-Prime Professional software (version 2.0; Psychology Software Tools). Naming latencies were recorded from target presentation using a voice key connected to the computer through a PST Serial Response Box. The multitalker noise was calibrated with an audiometer (SMART SENSOR AS804) and presented to participants through supra-aural headphones (Bose QuietComfort 35 II). Participants’ speech was recorded with an external condenser microphone (SHURE SM58S) connected to a YAMAHA Steinberg CI1 sound card. The microphone was fixed on a short holder standing on the desk, 10 cm from the participant’s mouth. The target words were extracted and saved as separate WAV files. The recorded speech signals were analyzed with the Praat speech analysis software (version 6.0.43; Boersma & Weenink, 2013). The syllabic boundaries of all words were labeled by hand, and the vocal cycles were hand-checked for errors such as missed or doubled marks. A custom-written Praat script was used to extract the mean intensity of each syllable.
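The authors extracted intensity with a custom Praat script; as a rough R equivalent (ours, with hypothetical file names and syllable boundaries, and assuming a level-calibrated recording), the mean intensity of a labeled syllable can be computed from the RMS of its samples:

```r
library(tuneR)

# Mean intensity (dB) of one hand-labeled syllable; t_start/t_end in seconds.
# With an uncalibrated recording the value is a relative, not absolute, level.
mean_intensity_db <- function(wav_path, t_start, t_end, ref = 2e-5) {
  wav <- readWave(wav_path)
  x   <- wav@left / (2^(wav@bit - 1))         # normalize samples to [-1, 1]
  seg <- x[max(1, round(t_start * wav@samp.rate)):round(t_end * wav@samp.rate)]
  20 * log10(sqrt(mean(seg^2)) / ref)         # RMS level re 20 µPa
}

mean_intensity_db("subj01_cat.wav", t_start = 0.12, t_end = 0.58)  # hypothetical
```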
Procedure
Participants were tested individually. First, they familiarized themselves with the target pictures by viewing each target for 2000 ms with the picture name printed below it. After this learning phase, participants received a picture-naming test without concurrently presented names. When the experimenter determined that a participant could name all pictures correctly in both L1 and L2, the practice blocks were administered. In the practice phase, participants completed one block of the five practice pictures for each experimental condition. The practice block procedure was identical to the experimental block procedure, except for the number of pictures. When the experimenter determined that participants understood the naming task instructions, the experimental blocks were administered.
Figure 3 is a schematic representation of the sequence of a block. At the beginning, a flag was presented for 2 seconds to cue the block’s target language. Meanwhile, the noise began to play continuously in the masking noise condition; in the quiet condition there was silence. A fixation point (+) then appeared in the middle of the screen for 500 ms, followed by a blank screen. Next, the 15 pictures were presented on the screen, 2 seconds apart. Participants were asked to name each picture as quickly and accurately as possible. The noise stopped after participants finished naming the 15 pictures. Finally, a 10-second break concluded each block.
L2 fluency test
We included two English speaking tests to measure participants’ L2 fluency. Previous research typically assessed L2 fluency via temporal features, such as speech rate, the duration and rate of hesitations, and filled and silent pauses (Hilton, 2014; Kormos, 2006; Segalowitz, 2010). The current study measured L2 fluency as speech rate, indexed by the time required to complete the speaking tasks: shorter completion times indicate more fluent L2 speech, and longer times less fluent L2 speech.
The order of the speaking tasks was as follows. First, participants completed a rapid automated naming task using four 7 × 7 item grids presented on the computer. Each grid consisted of one of the following randomly ordered stimulus types: letters (g, k, m, r), objects (pictures depicting a dog, chair, bed, or key), colors (boxes colored red, blue, yellow, or green), or digits (2, 4, 6, 9). Participants were instructed to name aloud each item in the grid, from the top left to the bottom right, as quickly as possible without errors. This was repeated for each stimulus grid (letters, objects, colors, and digits), and the grid order was counterbalanced across participants. The experimenter manually clicked the mouse to start and end the timing for each grid. The final rapid naming duration was the average of the durations for the letter, object, color, and digit grids.
Next, participants completed a passage reading task in which four English passages were taken from New Concept English Two. Participants read each passage aloud at their habitual speed. The experimenter manually clicked the mouse to start and end the timing for each passage. The final passage reading duration was the average of the four passage durations. The English fluency tests were administered using E-Prime Professional software.
Results
Two participants were excluded: one could not tolerate the loud masking noise and quit the experiment, and the other’s voice intensity in the quiet and masking noise conditions differed by more than two standard deviations from the group mean (see Lametti et al., 2014 for a similar data removal procedure). Data from the remaining 22 participants were included in the subsequent analyses. Table 1 presents the mean picture-naming reaction times, error percentages, and mean intensity by language and noise condition.
We used the lmer program of the lme4 package (Baayen et al., 2008; Bates, 2005; Bates et al., 2014) in R (R Core Team, 2015) to estimate fixed and random effects. The data (response time, percentage of error responses, and mean intensity) were analyzed with linear mixed-effects models with language and noise condition as fixed factors and participants and items as random factors. Models used restricted maximum likelihood estimation to find the optimal parameter estimates of the best-fitting model, defined as the most complex model that significantly improved the variance estimation over the previous models. Model fitting included three steps: specifying a null model that included only the random factors (participants and items); enriching the null model by adding the fixed factors (language and noise condition) one by one, and then their two-way interaction; and comparing each newly established model to the previous model using a chi-square test. If adding a fixed factor or an interaction did not significantly improve the variance estimation (p > 0.05), the current model was designated the best-fitting model.
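A minimal sketch of this incremental procedure (ours; the data frame and column names are hypothetical). Note that chi-square comparisons of nested models differing in fixed effects are standardly run on maximum-likelihood fits (REML = FALSE), with the chosen model refitted under REML for final parameter estimation:

```r
library(lme4)

# Null model: random intercepts for participants and items only.
m0 <- lmer(intensity ~ 1 + (1 | participant) + (1 | item),
           data = dat, REML = FALSE)

# Add fixed factors one at a time, then their two-way interaction.
m1 <- update(m0, . ~ . + language)
m2 <- update(m1, . ~ . + noise)
m3 <- update(m2, . ~ . + language:noise)

# Likelihood-ratio (chi-square) tests between successive models.
anova(m0, m1)   # does language improve the fit?
anova(m1, m2)   # does noise condition improve the fit?
anova(m2, m3)   # does the interaction improve the fit?

# Error responses are binary, so the parallel analysis uses a binomial family.
g0 <- glmer(error ~ 1 + (1 | participant) + (1 | item),
            data = dat, family = binomial)
```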
Behavioral results
Data from incorrect responses (0.81%), naming latencies longer than 1500 ms or shorter than 200 ms (2.25%), and latencies deviating more than two standard deviations from each participant’s mean (6.14%) were removed from the behavioral analyses; a sketch of these trimming criteria appears below. For response time, the best-fitting model included only the factors of language and noise condition (see Table 2). Adding the two-way interaction between language and noise condition did not significantly improve the fit, χ2(1, 3596) = 1.15, p = 0.28. A parallel analysis was conducted on the errors, using a binomial family because of the binary nature of the response. The best-fitting model included only the factor of noise condition; adding language, χ2(1, 3928) = 0.51, p = 0.48, or the interaction between language and noise condition, χ2(1, 3928) = 1.22, p = 0.54, did not significantly improve the fit. Table 2 displays the parameter estimates of the fixed effects for response time, percentage of errors, and mean intensity.
Note (Table 2): Language2 = L2; Noise2 = masking noise.
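The latency trimming described above reduces to a few filters; a minimal sketch (ours; the data frame and column names are hypothetical):

```r
library(dplyr)

trimmed <- dat %>%
  filter(correct == 1) %>%                       # drop incorrect responses
  filter(rt >= 200, rt <= 1500) %>%              # drop extreme latencies (ms)
  group_by(participant) %>%
  filter(abs(rt - mean(rt)) <= 2 * sd(rt)) %>%   # per-participant ±2 SD cut
  ungroup()
```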
Acoustic analysis
Only data from incorrect responses (0.81%) were removed from the acoustic analyses. Figure 4A illustrates the distribution of mean intensity scores from the 22 participants in each experimental condition. For mean intensity, the best-fitting model included noise condition and the interaction between language and noise condition (see Table 2). Adding the factor of language did not significantly improve the fit, χ2(1, 3928) = 2.53, p = 0.11. To unpack the two-way interaction, simple-effects analyses indicated that masking noise increased speakers’ voice intensity relative to the quiet condition in both L1 (β = 10.05, t = 83.10, p < 0.001) and L2 word production (β = 9.94, t = 73.25, p < 0.001), but the intensity increase was larger in L1 than in L2 (see Figure 4B). As shown in Figure 4C, simple-effects analyses in the other direction indicated that in the quiet condition, mean intensity did not differ between L1 and L2 word production (β = –0.13, t = –1.06, p = 0.29), whereas in the masking noise condition, mean intensity was significantly higher in L1 than in L2 word production (β = –0.52, t = –4.51, p < 0.001).
Correlation analysis between the Lombard effect and L2 fluency
To test whether more fluent L2 speakers rely more on feedforward control than less fluent L2 speakers, we examined the relationship between the Lombard effect in L2 spoken word production and fluency performance in L2 rapid naming and passage reading. Data from the 22 participants in Experiment 1 were entered into Pearson’s correlation analyses. Here, the Lombard effect was defined as the difference in mean intensity between the L2-masking noise and L2-quiet conditions. L2 fluency was measured by the two English production tasks and defined as the average durations for rapid naming and passage reading, respectively. The Lombard effect was negatively correlated with the duration of the L2 rapid naming task (r = –0.67, 95% CI [–0.83, –0.36], p = 0.002) and the duration of the L2 passage reading task (r = –0.62, 95% CI [–0.80, –0.42], p = 0.002). Thus, the more fluently bilinguals speak in their L2, the larger the Lombard effect they exhibit in L2 speech production (see Figures 4D and 4E).
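This analysis reduces to per-participant difference scores and Pearson tests; a minimal sketch (ours; the data frame and column names are hypothetical):

```r
# One row per participant: mean intensity (dB) per condition and the two
# fluency durations (seconds).
df$lombard_l2 <- df$int_l2_masking - df$int_l2_quiet   # L2 Lombard effect (dB)

cor.test(df$lombard_l2, df$rapid_naming_s)     # Pearson r, 95% CI, p-value
cor.test(df$lombard_l2, df$passage_reading_s)
```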
Discussion
To the best of our knowledge, this is the first cross-language study to compare feedforward control in a group of Chinese–English bilinguals using a masking noise. The 90 dB SPL multitalker noise virtually eliminated speakers’ auditory feedback; thus feedback-based motor correction had little role to play, and predictive feedforward control dominated speech motor control. Notably, the Lombard effect elicited by the masking noise was larger in L1 than in L2 word production. In addition, correlation analyses showed that the Lombard effect in L2 word production was larger in more fluent than in less fluent L2 speakers. The results therefore support our two hypotheses that bilinguals’ feedforward control is influenced by language and related to L2 fluency, with a heavier weighting assigned to feedforward control in the L1 production system and in more fluent L2 speakers than in the L2 production system and in less fluent L2 speakers.
In Experiment 2, we reduced the level of the multitalker noise from 90 dB to either 60 dB or 30 dB to allow auditory feedback to participate in speech motor control. By measuring the magnitude of the Lombard effect in response to noise that was not as loud as the masking noise, we examined bilinguals’ relative reliance on feedback control in L1 and L2 spoken word production.
Experiment 2
Method
Participants
Participants in Experiment 2 were the same as those in Experiment 1. The order of the two experiments was counterbalanced across participants: half completed Experiment 2 after Experiment 1, and the other half completed Experiment 1 after Experiment 2. The interval between the two experiments was about 15 minutes (a 5-minute break plus the 10-minute L2 fluency tests). This within-subjects design not only maximized sensitivity for comparisons between experiments but also ensured that differences between the experiments could not be attributed to individual differences.
Stimuli
The picture stimuli were the same as those in Experiment 1. Previous work suggests that the magnitude of the Lombard effect is influenced by noise level. For example, Patel and Schell (2008) manipulated noise condition using quiet, 60 dB, and 90 dB multitalker noise and observed larger voice intensity increases when the background noise was 90 dB than when it was 60 dB. Following their practice, we also included a 60 dB multitalker noise and added a 30 dB multitalker noise to further investigate how proportional changes in noise level affect vocal adjustments in voice intensity. To differentiate these from the masking noise in Experiment 1, we call the 60 dB multitalker noise the strong noise and the 30 dB multitalker noise the weak noise.
Design
Experiment 2 adopted a 2 (language: L1 and L2) × 3 (noise condition: quiet, weak noise, and strong noise) within-subjects and within-items design. Within a block, participants named the 15 target pictures consecutively, yielding 15 trials per block. The order of blocks (L1-quiet, L1-weak noise, L1-strong noise, L2-quiet, L2-weak noise, L2-strong noise) was randomized. Participants completed three blocks for each experimental condition, for a total of 270 trials.
Apparatus and procedure
Identical to Experiment 1.
Results
Data from the 22 participants were entered into the final analyses using the same LMM estimation procedure. Table 3 presents the mean picture-naming response times, percentages of errors, and mean intensity by language and noise condition.
Behavioral results
Data from incorrect responses (0.88%), naming latencies longer than 1500 ms or shorter than 200 ms (2.19%), and latencies deviating more than two standard deviations from each participant’s mean (5.49%) were removed from all analyses. For response time, the best-fitting model included only the factors of language and noise condition (see Table 4). Adding the two-way interaction between language and noise condition did not significantly improve the fit, χ2(2, 5432) = 2.93, p = 0.23. For the percentage of errors, the best-fitting model included only the factor of language; adding noise condition, χ2(2, 5888) = 5.21, p = 0.07, or the interaction between language and noise condition, χ2(2, 5888) = 5.93, p = 0.20, did not significantly improve the fit. Table 4 displays the parameter estimates of the fixed effects for response time, percentage of errors, and mean intensity.
Note (Table 4): Language2 = L2; Noise2 = weak noise; Noise3 = strong noise.
Acoustic analysis
Only data from incorrect responses (0.88%) were removed from the acoustic analyses. Figure 5A illustrates the distribution of mean intensity scores of the 22 speakers in each experimental condition. For mean intensity, the best-fitting model included language and noise condition as well as their interaction (see Table 4). To unpack the two-way interaction, simple-effects analyses indicated that both weak noise (L1: β = 5.95, t = 52.29, p < 0.001; L2: β = 6.40, t = 53.46, p < 0.001) and strong noise (L1: β = 6.40, t = 53.46, p < 0.001; L2: β = 6.40, t = 53.46, p < 0.001) increased speakers’ voice intensity relative to the quiet condition, but the intensity increase was larger in L2 than in L1 (see Figure 5B). As shown in Figure 5C, simple-effects analyses in the other direction indicated that mean intensity did not differ between L1 and L2 word production in the quiet condition (β = –0.15, t = –1.24, p = 0.22), but was significantly higher in L2 than in L1 word production in the weak noise condition (β = 0.27, t = 2.24, p = 0.03) and the strong noise condition (β = 1.18, t = 10.64, p < 0.001).
Correlation analysis between the Lombard effect and L2 fluency
To test whether less fluent L2 speakers rely more on feedback control than more fluent L2 speakers, we examined the relationship between the Lombard effect in L2 spoken word production and fluency performance in L2 rapid naming and passage reading. Data from the 22 participants in Experiment 2 were entered into Pearson’s correlation analyses. The Lombard effect was defined as the difference in mean intensity between the L2-strong noise and L2-quiet conditions. The L2 Lombard effect was positively correlated with the duration of the L2 rapid naming task (r = 0.58, 95% CI [0.30, 0.81], p = 0.005) and the duration of the L2 passage reading task (r = 0.55, 95% CI [0.31, 0.81], p = 0.009). Thus, the less fluently bilinguals speak in their L2, the larger the Lombard effect they exhibit in L2 speech production (see Figures 5D and 5E).
Discussion
In Experiment 2, we observed that the Lombard effect elicited by a weak or strong noise was larger in L2 than in L1 word production. In addition, correlation analyses suggested that the Lombard effect in L2 word production was larger in less fluent than in more fluent L2 speakers. The results thus support the hypotheses that bilinguals’ feedback control is also affected by language and related to L2 fluency, with a heavier weighting assigned to feedback control in the L2 production system and in less fluent L2 speakers compared to the L1 production system and more fluent L2 speakers.
Notably, mean intensity did not differ between L1 and L2 word production under the quiet condition in either Experiment 1 or Experiment 2. This baseline similarity is critical when contrasting two different languages: it suggests that the observed vocal changes indeed resulted from the experimental manipulations rather than from the languages themselves.
General discussion
The purpose of the two experiments was to systematically determine the relative weighting of feedforward and feedback control in bilinguals’ L1 and L2 speech production, and to evaluate whether individual differences in L2 fluency are related to the organization of feedforward and feedback control in the L2 speech motor system. We manipulated the level of the noise mixed with the auditory feedback that participants received while speaking. When the noise (90 dB multitalker noise) exceeded a masking threshold such that participants could not perceive their original auditory feedback, bilinguals showed a larger Lombard effect in L1 than in L2 word production, and the Lombard effect in L2 word production increased with L2 fluency. In contrast, when the noise (30 dB or 60 dB multitalker noise) was below the masking threshold but hampered speech intelligibility, the same bilinguals showed a larger Lombard effect in L2 than in L1 word production, and the Lombard effect in L2 word production increased as L2 fluency decreased. The overall results indicate that, compared to L1, L2 speech motor control relies less on feedforward control and more on feedback control. The correlation findings also provide initial evidence in second language learners that more fluent L2 speech is related to a higher weighting of feedforward control and a lower weighting of feedback control.
Feedforward control between L1 and L2
We investigated bilinguals’ feedforward control using a masking noise in Experiment 1 and, for the first time, observed that bilinguals exhibited a larger Lombard effect in L1 than in L2 word production, reflecting that L1 speech motor execution relies more on feedforward control than L2. A previous study provided neuroimaging evidence differentiating native and novel speech production in terms of feedforward control (Moser et al., 2009). According to the DIVA model, the left inferior frontal gyrus and anterior insula are important brain regions for feedforward control (Guenther, 2016; Kearney & Guenther, 2019), and damage to these areas typically causes a disorder of motor speech planning. In Moser et al.’s (2009) study, 30 normal adults completed a speech production task consisting of two types of three-syllable nonwords: English (native) syllables and non-English (novel) syllables. When novel syllable production was compared to native syllable production, greater activation was observed in an extensive neural network including the left inferior frontal gyrus and anterior insula. Of close relevance, the authors speculated that increased activity in motor speech networks may directly reflect unfamiliarity with the motor commands necessary for the target sounds; that is, a difference in feedforward control. Our study cannot speak to the neural mechanism of a feedforward deficit (Alm, 2004, 2005; Kearney & Guenther, 2019) or its exact nature (Civier, 2010). Although L2 words were not novel speech sounds, they were not as familiar to bilingual speakers as their L1 counterparts (as reflected by longer naming latencies). Thus, it is not surprising that a difference in feedforward control was found between L1 and L2 speech motor control in Experiment 1.
For bilinguals, L1 is an overlearned language. The feedforward commands, which store detailed instructions for how to move the articulators to achieve a linguistic goal, should be read out directly from the “mental syllabary” without effort, a mechanism similar to singing a familiar song from memory (Civier et al., 2010). Speakers of a highly automatized language have established accurate bidirectional auditory-motor mappings: they can predict auditory consequences from an efference copy of the motor commands and the environmental influence, and they can also issue motor commands based on the intended auditory consequences. By accurately gauging the level of the masking noise, native speakers adjust their voice intensity more to make themselves heard. For second language learners, in contrast, the articulation of L2 words is less rehearsed (Parker Jones et al., 2012) owing to factors such as age of acquisition, amount of exposure, and involvement in daily life (Abutalebi et al., 2001), so they are less likely to have formed long-term representations of L2 words in the “mental syllabary” that are as accurate as their L1 counterparts. Thus, when facing a loud masking noise, L2 speakers show smaller intensity adjustments to compensate for the inaudibility.
Theoretical frameworks contend that feedforward control can be executed quickly because it avoids additional processing of sensory feedback (Guenther, 2016; Perkell, 2012); it is therefore reasonable to associate speed of speech with the relative weighting of feedforward control. Many patient studies have also found that brain damage affecting feedforward control causes significant motor impairment (Kearney & Guenther, 2019). In Experiment 1, we found a negative correlation between L2 fluency test durations and the Lombard effect in L2 speech production, suggesting that more fluent L2 speakers have superior feedforward control. This finding provides additional evidence for a fluency-related hypothesis in typical L2 speakers. Speech motor control models of the native language assume that the weighting of feedforward control increases as language acquisition progresses (Tourville & Guenther, 2011). Recent findings highlight L2 fluency as a reliable predictor of L2 proficiency (De Jong et al., 2012); our study accordingly suggests that feedforward control strengthens as second language learning progresses. Although L2 production is inferior to L1 production in feedforward control, we should be optimistic about this difference: with increasing L2 proficiency, speech motor control may develop along a continuum, biasing away from feedback control and toward feedforward control, allowing for more native-like speech production.
Feedback control between L1 and L2
We investigated bilinguals’ feedback control under weak and strong noise conditions. In contrast to Experiment 1, bilinguals exhibited larger Lombard effects in L2 than in L1 word production, and the effect was magnified in the strong noise condition relative to the weak noise condition. These contrasting findings are interesting because both experiments introduced noise to interfere with the perception of auditory feedback, and the only striking difference was that neither the weak nor the strong noise was loud enough to eliminate the auditory feedback needed for feedback control. Thus, the difference between these noise levels (weak and strong noise vs. masking noise) was not only quantitative but also qualitative. Notably, in Experiment 2, the strong noise decreased the signal-to-noise ratio more than the weak noise, but both noise levels elicited the same pattern of results, differing only in magnitude. Thus, the difference between these two noise levels (weak vs. strong noise) was only quantitative, not qualitative. Having controlled for other factors, we suggest that L2 speech motor execution relies more on feedback control than L1.
The finding of language-specific feedback control echoes an early study by Mackay (1970), who employed a delayed auditory feedback technique to interfere with normal speech production and found that the induced disfluency was more serious in L2 speech production for both German–English and English–German bilinguals. These findings provide direct evidence that the feedback control difference is unrelated to the particular language but related to language status. Future studies should investigate the influence of masking, weak, and strong noise on speech motor control in a group of English–Chinese bilinguals.
In addition, Simmonds et al.’s (2011b) brain imaging study differentiated L1 and L2 speech production in terms of feedback control. According to the DIVA model, the auditory and somatosensory association cortices are important brain regions for feedback control (Guenther, 2016; Kearney & Guenther, 2019), and a perturbation of speakers’ auditory feedback typically results in increased neural activity in these areas (Tourville et al., 2008; Toyomura et al., 2007). In Simmonds et al.’s (2011b) study, bilinguals produced overt propositional speech (i.e., defined visually presented pictures) in both their L1 and L2. The results provided reliable evidence of increased activation for L2 relative to L1 within the temporoparietal cortex. Of close relevance, the authors attributed the increased temporoparietal activity to more taxing sensory monitoring of discrepancies between predicted and actual sensory outcomes in L2 production. Thus, it is not surprising that our study found a difference in feedback control between L1 and L2 production.
Previous research has shown that reliance on feedback control is dynamic, ranging from heavy to minimal over the course of vocal development (Civier et al., 2010; Scheerer et al., 2013; Schmidt & Lee, 2005), with the transition modulated by practice and experience (Guenther et al., 2006). For L1 speakers, the brain has already internalized the relationships between speech movements and the desired auditory feedback during language acquisition; the additional information provided by auditory feedback therefore becomes redundant. For L2 speakers, however, the mapping between motor commands and their sensory consequences is less reliable, as evidenced by larger vocal variability (Chen et al., 2001; Ng et al., 2008; Wang & van Heuven, 2006), so auditory feedback is still required to retune and strengthen the motor-sensory transformations. Growing evidence also suggests that L2 speech output needs more careful monitoring to avoid errors (Ganushchak & Schiller, 2009; Parker Jones et al., 2012). Overall, the feedback subsystem may play a more prominent role in L2 speech motor control.
Overreliance on feedback control may introduce disfluency because a feedback-based strategy is relatively slow to detect and correct errors (Parrell et al., 2019; Perkell, 2012). It is thus reasonable to associate disfluency with the relative weighting of feedback control. Civier and colleagues (2010) also found that people who stutter may adopt a motor strategy weighted too heavily toward auditory feedback control, raising the probability of triggering a repetition and hence producing more stuttering. In Experiment 2, we found a positive correlation between L2 fluency test durations and the Lombard effect in L2 speech production, indicating that less fluent L2 speakers depend more on feedback control. The findings of Experiments 1 and 2 complement each other: in the same group of bilinguals, increasing efficiency of L2 speech motor control is related to a bias away from feedback control and toward feedforward control. A large body of literature indicates that reliance on feedback control decreases as language acquisition progresses (Liu et al., 2010a; Scheerer et al., 2013; Tourville & Guenther, 2011). Given the close relationship between L2 fluency and L2 proficiency, our study also suggests that feedback control plays a less prominent role as second language learning progresses. Differences between native speakers and L2 learners are therefore not necessarily permanent; L2 learners may eventually reach native-like efficiency of speech motor control.
Conclusion
In summary, our findings suggest that voice intensity control in bilinguals’ speech production requires the joint effort of the feedforward and feedback subsystems, and that the relative weighting of feedforward and feedback control depends on whether bilinguals are producing words in L1 or L2. The correlation analyses suggest a close relationship between L2 fluency and the organization of feedforward and feedback control. Although more work is needed to establish these findings in different populations with improved methodologies, this study opens a potential new line of research into bilinguals’ speech motor control.
Acknowledgments
This work was supported by the Key Project of Beijing Social Science Foundation in China (16YYA006) to Qingfang Zhang, the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China (18XNLG28) to Qingfang Zhang.