Introduction
An issue receiving intense attention in speech production research is how the brain plans linguistic processing prior to overt articulation (Levelt et al., 1999). Although articulation is often treated as a lower-level motor output (Indefrey & Levelt, 2004), speaking is a highly complex sensorimotor task that requires the combined efforts of feedforward and feedback control systems (Guenther et al., 2006). To date, how these two subsystems work together to ensure successful communication remains poorly understood.
Terminology and general principles of speech motor control
Several speech motor control models have been formulated; we integrated the Directions Into Velocities of Articulators model (DIVA; Guenther, 2006) and State Feedback Control (Houde & Nagarajan, 2011) to describe feedforward and feedback control (see Figure 1 for details). Speech production begins with a unit in the “speech sound map,” which can be a phoneme, syllable, or phrase. As schematized in Figure 1, feedforward control reads out previously learned motor commands for speech sounds and issues them to the articulators. This mechanism operates independently of the sensory feedback associated with articulation (Guenther, 2016). Feedforward control therefore enables rapid speech but lacks the ability to monitor errors in speech output (Parrell et al., 2019). Because we live in time-varying and unpredictable surroundings, feedforward control alone cannot ensure effective speech.
Unlike feedforward control, feedback control relies on sensory feedback to maintain speech (Guenther, 2016; Kearney & Guenther, 2019). The auditory feedback control system compares actual auditory feedback with intended auditory feedback; in case of any mismatch, auditory errors are transformed into corrective commands that decrease the perceived errors. A somatosensory feedback control mechanism operates similarly (Guenther, 2006, 2016; Hickok et al., 2011). Within this framework, two coexisting routes generate the intended auditory feedback (Tian & Poeppel, 2012). First, the activation of the speech sound map activates the auditory target, which defines the desired auditory feedback that should arise when a speaker correctly produces the sound (Guenther, 2016; Tourville & Guenther, 2011). Second, an internal forward model uses an efference copy of the feedforward commands to internally estimate the current state of vocal tract dynamics and generate an auditory prediction (Hickok, 2012; Tian & Poeppel, 2010, 2012, 2013). The feedback control system is indispensable in speech motor control, allowing speakers to regulate movements and interact effectively with the environment in the presence of external perturbations (Bays & Wolpert, 2006).
Parrell et al. (2019) proposed a special case of feedforward control in which auditory feedback is not involved. An auditory prediction (realized by an internal forward model) is the anticipated outcome of an articulatory movement before auditory feedback is received. This prediction is based on previously established causal associations between motor commands and auditory output, which is also why speakers feel that they can “hear” speech internally when they imagine speaking without moving any articulators (Tian & Poeppel, 2012, 2015). Critically, motor-based auditory predictions can be directly compared with auditory targets to verify the correctness of planned feedforward commands (Hickok, 2012). If auditory predictions fail to match auditory targets, the feedforward control system transforms the error signals into corrective motor commands (Parrell et al., 2019).
Speech motor control in bilinguals
Current models detail the organization of feedforward and feedback control exclusively in the first language (L1; see Parrell et al., 2019 for a review), whereas research on the second language (L2) has not yet fully considered this issue. More recently, researchers have noticed that speech motor control in bilinguals may vary by language type (L1 vs. L2; Liu & Tian, 2018; Mitsuya et al., 2011; Simmonds et al., 2011a, 2011b). Here we adopt Grosjean’s (2010) succinct definition of bilinguals as people who use two languages in their daily life. Of note, theoretical and empirical information on L2 speech motor control is still insufficient, which highlights the need for further research in this field.
For L1 speech production, a basic idea is that the feedforward and feedback control subsystems cooperate with each other (Parrell et al., 2019); thus, it is important to understand the relative weighting of these systems in speech motor control (Guenther, 2016; Guenther et al., 2006). Researchers have emphasized a transition from feedback-dominant to feedforward-dominant control, driven by production experience (Guenther, 2006; Guenther & Vladusich, 2012; Liu et al., 2010c; Scheerer et al., 2013). Speakers’ initial attempts to produce speech result in errors, and production relies heavily on feedback control. With sufficient practice, feedforward commands can produce the intended sensory consequences without errors, and production principally relies on feedforward control (Guenther, 2006; Guenther & Vladusich, 2012). However, L1 and L2 production experiences are inherently different (Mitsuya et al., 2011). L1 speech motor learning begins in infancy (Tourville & Guenther, 2011), but within the broad population of bilinguals, the age of L2 acquisition varies widely. Some bilinguals acquire the L2 from birth, some around puberty, and others during adulthood (Woumans et al., 2015). In most cases, bilinguals are exposed to an L2 after their L1 has already been established. It is therefore possible that feedforward and feedback control are weighted differently for bilinguals’ two language systems.
Motor movements used to produce native sounds are highly overlearned and automatic, requiring much less online sensory monitoring (Simmonds et al., 2011a, 2011b). However, evidence shows that L2 sounds are produced with larger variability (Chen et al., 2001; Ng et al., 2008; Wang & van Heuven, 2006), implying that L2 feedforward commands are less familiar and more variable (Mitsuya et al., 2011). We therefore hypothesized that, compared with L1 production, L2 production relies more on feedback control and less on feedforward control. This hypothesis is supported by two early studies reporting that under delayed auditory feedback, bilinguals speak more slowly and produce more hesitations and sound repetitions in L2 than in L1 (Mackay, 1970; Van Borsel et al., 2005). The underlying logic is that an increased weighting of feedback control increases the disturbing influence of incoming perturbed auditory feedback (Guenther, 2006).
The past several decades have seen an unprecedented upsurge in the number of bilinguals; however, for most bilinguals, speaking a second language is a challenging task (Bergmann et al., 2015). Typical disfluency markers include pauses, syllable repetitions, and self-corrections (Götz, 2013; Kormos, 2006). Growing evidence shows that speakers are considerably less fluent in L2 than in L1. For example, Wiese (1984) reported that L2 speech contains two to three times as many hesitations as L1 speech, and Hincks (2008) found slower speech rates in L2 than in L1. Feedforward control is crucial for fluent speech, whereas excessive reliance on feedback control causes a time-lag problem because of the delay inherent in processing auditory feedback and launching corrective commands (Civier, 2010; Civier et al., 2010; Perkell, 2012). It is thus reasonable to hypothesize that poorer L2 fluency is correlated with heavier weighting of feedback control and, accordingly, that better L1 fluency is associated with heavier weighting of feedforward control.
This fluency-related hypothesis is supported only by indirect evidence from patients with speech motor disorders. Guenther (2016) found that patients with speech motor disorders usually have impaired feedforward control. For example, apraxia of speech, a disorder of speech motor planning and programming, is most often associated with damage to the left inferior frontal gyrus, anterior insula, and/or ventral precentral gyrus. According to the DIVA model, damage to these areas affects the speech sound map and the feedforward commands for articulating speech sounds. Stuttering also disrupts speech fluency, but its mechanism remains controversial. Several researchers believe stuttering results from abnormal auditory-motor transformation in the feedback control system (Cai et al., 2012; Loucks et al., 2012). Others suggest that it results from a general auditory prediction deficit (Daliri & Max, 2015a, 2015b) and a heavy weighting of feedback control (Civier et al., 2010; Tourville et al., 2008).
Feedforward and feedback control of voice intensity
The current study aimed to address whether the relative weighting of feedforward and feedback control varies between L1 and L2. In previous bilingualism research, investigators either compared L1 and L2 speakers of the same language or compared L1 and L2 within the same bilinguals (Bergmann et al., 2015). The difficulty for intraspeaker comparisons lies in determining whether observed differences are caused by language status or by differences between the languages themselves. To avoid this confusion, we selected voice intensity to isolate the role of language status, because this attribute has few well-known language-specific phonological features.
A considerable amount of research describes how speakers control pitch (Chang et al., 2013; Chen et al., 2012), formants (Cai et al., 2012; Mitsuya et al., 2011), and intensity (Bauer et al., 2006; Heinks-Maldonado & Houde, 2005; Liu et al., 2007). Typically, auditory perturbations induce compensatory behaviors that change speech parameters in the opposite direction (Behroozmand et al., 2015; Chang et al., 2013). Previous studies have provided evidence that pitch and formant control may differ across languages. In tonal languages, such as Chinese, pitch plays a key role in differentiating meanings, whereas in nontonal languages, such as English, pitch conveys only stress and intonation information (Chen et al., 2012; Liu et al., 2010b; Ning et al., 2015; Ning et al., 2014). Languages also differ in the number, location, and relative proximity of vowels, so the requirements for formant control likewise vary across languages (Mitsuya et al., 2011). Uniquely, voice intensity is a basic, low-level sound attribute (Tian et al., 2018) that is not highly effective for encoding linguistic contrasts (Liu et al., 2007). To date, there is no direct evidence suggesting that voice intensity is sensitive to the different languages native speakers use.
Studies have shown that online intensity control is similar to pitch and formant control (Bauer et al., 2006; Heinks-Maldonado & Houde, 2005; Liu et al., 2007). For example, Bauer and colleagues found that during vowel production, individuals demonstrated a compensatory response to unexpected intensity perturbations (200 ms; ±1, ±3, or ±6 dB SPL; see also Heinks-Maldonado & Houde, 2005). Furthermore, Liu et al. (2007) observed that Mandarin speakers also compensated for intensity perturbations (200 ms, ±3 dB SPL) during Mandarin production. These studies imply that intensity control works to monitor and stabilize voice intensity around a desired level. In this line of research, it is assumed that speakers who rely more on feedforward control will produce speech based more on stored feedforward commands, and hence produce more stable vocal output, whereas speakers who rely more on feedback control will depend more on auditory feedback to correct for errors, and hence be more affected by perturbations and produce larger compensatory responses (Guenther, 2006). These studies addressed intensity control through real-time manipulation of speakers’ original auditory feedback.
Noise experiments offer another line of intensity control research. Lombard (1911) was the first to find that speakers unconsciously increase their voice intensity to compensate for reduced audibility in a noisy environment. This phenomenon, known as the Lombard effect, has been documented in many studies (Lin et al., 2015; Patel & Schell, 2008). Noise experiments typically instruct participants to produce speech while a constant noise is added to their feedback (Bauer et al., 2006). Because speaking is a goal-oriented behavior that developed to facilitate communication, speakers usually increase voice intensity automatically to improve the signal-to-noise ratio (Liu et al., 2007). In other words, whereas intensity control in online perturbation paradigms functions to monitor and stabilize intensity around a desired level, intensity control in noisy environments functions to overcome the noise and keep the speaker audible (Chang-Yit et al., 1975).
Studies have addressed feedforward control by observing how speakers adapt motor commands when auditory feedback is perturbed over a long period (Ballard et al., 2018; Lametti et al., 2014). However, speech adaptation paradigms reveal only the updating of feedforward control, rather than feedforward control in and of itself. In the present study, we employed a noise-masking paradigm to investigate feedforward control of voice intensity. This paradigm is based on the premise that a loud masking noise effectively eliminates the auditory feedback used for controlling speech movements (Christoffels et al., 2007; Houde et al., 2002; Lin et al., 2015; Maas et al., 2015; Terband et al., 2015). As schematized in Figure 2A, a masking noise disrupts comparisons involving actual auditory feedback. Although it is impossible to create an experimental condition without any feedback (Kent et al., 2000; Maas et al., 2015), it is reasonable to expect a much heavier reliance on feedforward control in the absence of auditory feedback (Guenther, 2006, 2016; Guenther & Vladusich, 2012).
Feedforward control incorporates a mechanism that allows speakers to make vocal adjustments independent of auditory feedback (Hickok, 2012; Parrell et al., 2019). In the face of a loud masking noise, speakers evaluate the disturbance from the noise before they speak. Considering the adverse environment, speakers retrieve the predetermined commands but do not issue them directly to the articulators, which would produce obvious errors. Instead, speakers generate an auditory prediction of voice intensity based on the feedforward commands and the background noise. They then internally compare the auditory prediction with the auditory target, which activates an auditory error representing the noise-induced decrease in audibility. At this point, speakers launch a corrective command, based on established auditory-motor transformations, to overcome the masking noise. We thus predicted that speakers who rely more on feedforward control would adjust their motor plans based more on the predicted loss in audibility, and hence produce a larger Lombard effect than those who rely less on feedforward control.
The premise of feedback control involves speakers’ perception of their auditory feedback. We therefore applied noise signals such that participants could hear their voice over the noise, but their voice was less audible than what they expected to hear. The purpose of the added noise was to partially mask air-conducted auditory feedback and thus reduce the signal-to-noise ratio. As shown in Figure 2B, a noise that is not intense enough to mask the original auditory feedback activates feedback control comparisons. Although feedforward control also plays a role in vocal adjustments to noise, it was reasonable to expect an increased weighting of feedback control to correct motor commands based on perceived auditory errors (Guenther, 2006; Guenther & Vladusich, 2012). We thus predicted that speakers who rely more on feedback control would adjust their motor plans based more on the perceived loss in audibility, and hence produce a larger Lombard effect than those who rely less on feedback control.
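To make this weighting logic concrete, the following toy sketch (our illustration, not a model from the paper or the literature; all weights and loss values are hypothetical) expresses the two predictions: under masking noise the perceived error term is unavailable, so the Lombard effect indexes feedforward weighting, whereas under audible noise the perceived error drives compensation, so the effect indexes feedback weighting.

```r
# Toy illustration only: the Lombard effect as a weighted sum of a predicted
# audibility loss (feedforward route) and a perceived audibility loss
# (feedback route). All numbers are hypothetical.
lombard_db <- function(w_ff, w_fb, predicted_loss_db, perceived_loss_db) {
  w_ff * predicted_loss_db + w_fb * perceived_loss_db
}

# Masking noise: auditory feedback is eliminated, so perceived loss is 0 and
# a higher feedforward weight yields a larger intensity increase.
lombard_db(w_ff = 0.8, w_fb = 0.4, predicted_loss_db = 10, perceived_loss_db = 0)  # 8 dB
lombard_db(w_ff = 0.5, w_fb = 0.4, predicted_loss_db = 10, perceived_loss_db = 0)  # 5 dB

# Audible noise: feedback is available, so a higher feedback weight yields a
# larger intensity increase for the same perceived loss.
lombard_db(w_ff = 0.2, w_fb = 0.7, predicted_loss_db = 2, perceived_loss_db = 6)   # 4.6 dB
lombard_db(w_ff = 0.2, w_fb = 0.4, predicted_loss_db = 2, perceived_loss_db = 6)   # 2.8 dB
```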
The current study
We designed two noise experiments to test whether the relative reliance on feedforward and feedback control is affected by language type and L2 fluency in Chinese–English bilinguals. In Experiment 1, we addressed the weighting of feedforward control by observing how bilinguals react to a masking noise (90 dB SPL multitalker noise) during L1 and L2 spoken word production. We predicted that L1 relies more on feedforward control than L2, so the Lombard effect would be larger in L1. We also predicted a correlation between L2 fluency and reliance on feedforward control, such that more fluent bilinguals would exhibit larger Lombard effects.
In Experiment 2, we addressed the weighting of feedback control by observing how bilinguals react to a weak noise (30 dB SPL multitalker noise) and a strong noise (60 dB SPL multitalker noise) during L1 and L2 spoken word production. Although both the strong noise and the masking noise are loud, they differ in that speakers’ auditory feedback remains available for feedback control under strong noise but not under masking noise. We predicted that L2 relies more on feedback control than L1, so the Lombard effect would be larger in L2. We also expected a correlation between L2 fluency and reliance on feedback control, such that less fluent bilinguals would show larger Lombard effects.
Experiment 1
Methods
Participants
Experiment 1 was completed by 24 Chinese–English bilinguals from Renmin University of China. All participants were right-handed, free of any neurological disease, and self-reported normal hearing. On enrolling in the study, participants were instructed to name pictures in both L1 and L2 at their habitual volume while wearing headphones. Multitalker noise at randomly varying levels (30 dB, 60 dB, and 90 dB SPL) was presented through the headphones, and participants judged whether they could perceive their auditory feedback at each noise level. All participants reported that they could hear their voice under the 30 dB and 60 dB noise but not under the 90 dB noise. This screening test was performed to ensure the validity of the noise manipulation in the current study.
The bilinguals’ (11 males) mean age was 20.3 years (SD = 2.2, range 18–28). Bilinguals can be classified by age of L2 acquisition: early bilinguals learn their L2 before eight years of age, and late bilinguals at eight years or older (Birdsong & Molis, 2001). In the current study, we included only bilinguals who reported receiving their schooling in Chinese and being exposed to English after age 8 (see also Epstein et al., 1996). The mean age of L2 acquisition for the 24 Chinese–English bilinguals was 9.7 years (SD = 1.1, range 9–13).
Stimuli
Twenty black-and-white simple line drawings (15 targets and 5 practice items) were selected from a database created by Zhang and Yang (2003). The practice items were used to familiarize participants with the experimental procedure and were not employed in the formal experiment. All pictures referred to common objects and had good indexes of visual complexity, familiarity, and image agreement. All pictures had monosyllabic names in both Chinese and English (e.g., Chinese: “猫” /mao/; English: cat). We employed a 90 dB SPL multitalker noise to mask participants’ auditory feedback (Patel & Schell, 2008).
Design
The experiment adopted a 2 (language: L1 and L2) × 2 (noise condition: quiet and masking noise) within-subjects and within-items design. Within a block, participants named the 15 target pictures consecutively, yielding 15 trials per block. The order of blocks (L1-quiet, L1-masking noise, L2-quiet, L2-masking noise) was randomized. Participants completed three blocks for each experimental condition, for a total of 180 trials. The order of items was randomized within L1 blocks but pseudo-randomized within L2 blocks so that a target never followed a target with the same initial phoneme (e.g., ball, book), avoiding a phonological facilitation effect; a sketch of this constraint appears below. A new order was generated for each participant and each block.
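As a minimal sketch of this pseudo-randomization constraint (ours, not the authors’ script; the data frame and its ‘onset’ column are hypothetical):

```r
# Reshuffle the 15 L2 targets until no item immediately follows another item
# sharing the same initial phoneme. 'items' is a data frame with one row per
# target and a hypothetical 'onset' column holding each name's initial phoneme.
shuffle_no_onset_repeats <- function(items) {
  repeat {
    ord <- items[sample(nrow(items)), ]                     # random permutation
    if (all(head(ord$onset, -1) != tail(ord$onset, -1))) {  # adjacent pairs differ
      return(ord)
    }
  }
}
```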
Apparatus
The experiment was conducted in a soundproof room and controlled by E-Prime Professional software (version 2.0; Psychology Software Tools). Naming latencies were recorded from target presentation using a voice key connected to the computer through a PST Serial Response Box. The multitalker noise was calibrated with an audiometer (SMART SENSOR AS804) and presented to participants through supra-aural headphones (Bose QuietComfort 35 II). Participants’ speech was recorded with an external condenser microphone (SHURE SM58S) connected to a YAMAHA Steinberg CI1 sound card. The microphone was fixed on a short holder standing on the desk, 10 cm from the participant’s mouth. The target words were extracted and saved as separate WAV files. The recorded speech signals were analyzed with the Praat speech analysis software (version 6.0.43; Boersma & Weenink, 2013). The syllabic boundaries of all words were labeled by hand, and the vocal cycles were hand-checked for errors such as missed or doubled marks. A custom-written Praat script was used to extract the mean intensity of each syllable.
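The authors extracted intensity with a custom Praat script; as a rough R equivalent (ours, with hypothetical file names and syllable boundaries, and assuming a level-calibrated recording), the mean intensity of a labeled syllable can be computed from the RMS of its samples:

```r
library(tuneR)

# Mean intensity (dB) of one hand-labeled syllable; t_start/t_end in seconds.
# With an uncalibrated recording the value is a relative, not absolute, level.
mean_intensity_db <- function(wav_path, t_start, t_end, ref = 2e-5) {
  wav <- readWave(wav_path)
  x   <- wav@left / (2^(wav@bit - 1))         # normalize samples to [-1, 1]
  seg <- x[max(1, round(t_start * wav@samp.rate)):round(t_end * wav@samp.rate)]
  20 * log10(sqrt(mean(seg^2)) / ref)         # RMS level re 20 µPa
}

mean_intensity_db("subj01_cat.wav", t_start = 0.12, t_end = 0.58)  # hypothetical
```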
Procedure
Participants were tested individually. First, they familiarized themselves with the target pictures by viewing each target for 2000 ms with the picture name printed below it. After this learning phase, participants received a picture-naming test without concurrently presented names. When the experimenter determined that a participant could name all pictures correctly in both L1 and L2, the practice blocks were administered. In the practice phase, participants completed one block of the five practice pictures for each experimental condition. The practice block procedure was identical to the experimental block procedure, except for the number of pictures. When the experimenter determined that participants understood the naming task instructions, the experimental blocks were administered.
Figure 3 is a schematic representation of the sequence of a block. At the beginning, a flag was presented for 2 seconds to cue the block’s target language. Meanwhile, the noise began to play continuously in the masking noise condition; in the quiet condition there was silence. A fixation point (+) then appeared in the middle of the screen for 500 ms, followed by a blank screen. Next, the 15 pictures were presented on the screen, 2 seconds apart. Participants were asked to name each picture as quickly and accurately as possible. The noise stopped after participants finished naming the 15 pictures. Finally, a 10-second break concluded each block.
L2 fluency test
We included two English speaking tests to measure participants’ L2 fluency. Previous research typically assessed L2 fluency via temporal features, such as speech rate, the duration and rate of hesitations, and filled and silent pauses (Hilton, 2014; Kormos, 2006; Segalowitz, 2010). The current study measured L2 fluency as speech rate, indexed by the time required to complete the speaking tasks: shorter completion times indicate more fluent L2 speech, and longer times less fluent L2 speech.
The order of the speaking tasks was as follows. First, participants completed a rapid automated naming task using four 7 × 7 item grids presented on the computer. Each grid consisted of one of the following randomly ordered stimulus types: letters (g, k, m, r), objects (pictures depicting a dog, chair, bed, or key), colors (boxes colored red, blue, yellow, or green), or digits (2, 4, 6, 9). Participants were instructed to name aloud each item in the grid, from the top left to the bottom right, as quickly as possible without errors. This was repeated for each stimulus grid (letters, objects, colors, and digits), and the grid order was counterbalanced across participants. The experimenter manually clicked the mouse to start and end the timing for each grid. The final rapid naming duration was the average of the durations for the letter, object, color, and digit grids.
Next, participants completed a passage reading task in which four English passages were taken from New Concept English Two. Participants read each passage aloud at their habitual speed. The experimenter manually clicked the mouse to start and end the timing for each passage. The final passage reading duration was the average of the four passage durations. The English fluency tests were administered using E-Prime Professional software.
Results
Two participants were excluded: one could not tolerate the loud masking noise and quit the experiment, and the other’s voice intensity in the quiet and masking noise conditions differed by more than two standard deviations from the group mean (see Lametti et al., 2014 for a similar data removal procedure). Data from the remaining 22 participants were included in the subsequent analyses. Table 1 presents the mean picture-naming reaction times, error percentages, and mean intensity by language and noise condition.
We used the lmer program of the lme4 package (Baayen et al., 2008; Bates, 2005; Bates et al., 2014) in R (R Core Team, 2015) to estimate fixed and random effects. The data (response time, percentage of error responses, and mean intensity) were analyzed with linear mixed-effects models with language and noise condition as fixed factors and participants and items as random factors. Models used restricted maximum likelihood estimation to find the optimal parameter estimates of the best-fitting model, defined as the most complex model that significantly improved the variance estimation over the previous models. Model fitting included three steps: specifying a null model that included only the random factors (participants and items); enriching the null model by adding the fixed factors (language and noise condition) one by one, and then their two-way interaction; and comparing each newly established model to the previous model using a chi-square test. If adding a fixed factor or an interaction did not significantly improve the variance estimation (p > 0.05), the current model was designated the best-fitting model.
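A minimal sketch of this incremental procedure (ours; the data frame and column names are hypothetical). Note that chi-square comparisons of nested models differing in fixed effects are standardly run on maximum-likelihood fits (REML = FALSE), with the chosen model refitted under REML for final parameter estimation:

```r
library(lme4)

# Null model: random intercepts for participants and items only.
m0 <- lmer(intensity ~ 1 + (1 | participant) + (1 | item),
           data = dat, REML = FALSE)

# Add fixed factors one at a time, then their two-way interaction.
m1 <- update(m0, . ~ . + language)
m2 <- update(m1, . ~ . + noise)
m3 <- update(m2, . ~ . + language:noise)

# Likelihood-ratio (chi-square) tests between successive models.
anova(m0, m1)   # does language improve the fit?
anova(m1, m2)   # does noise condition improve the fit?
anova(m2, m3)   # does the interaction improve the fit?

# Error responses are binary, so the parallel analysis uses a binomial family.
g0 <- glmer(error ~ 1 + (1 | participant) + (1 | item),
            data = dat, family = binomial)
```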
Behavioral results
Data from incorrect responses (0.81%), naming latencies longer than 1500 ms or shorter than 200 ms (2.25%), and latencies deviating more than two standard deviations from each participant’s mean (6.14%) were removed from the behavioral analyses; a sketch of these trimming criteria appears below. For response time, the best-fitting model included only the factors of language and noise condition (see Table 2). Adding the two-way interaction between language and noise condition did not significantly improve the fit, χ2(1, 3596) = 1.15, p = 0.28. A parallel analysis was conducted on the errors, using a binomial family because of the binary nature of the response. The best-fitting model included only the factor of noise condition; adding language, χ2(1, 3928) = 0.51, p = 0.48, or the interaction between language and noise condition, χ2(1, 3928) = 1.22, p = 0.54, did not significantly improve the fit. Table 2 displays the parameter estimates of the fixed effects for response time, percentage of errors, and mean intensity.
Note (Table 2): Language2 = L2; Noise2 = masking noise.
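The latency trimming described above reduces to a few filters; a minimal sketch (ours; the data frame and column names are hypothetical):

```r
library(dplyr)

trimmed <- dat %>%
  filter(correct == 1) %>%                       # drop incorrect responses
  filter(rt >= 200, rt <= 1500) %>%              # drop extreme latencies (ms)
  group_by(participant) %>%
  filter(abs(rt - mean(rt)) <= 2 * sd(rt)) %>%   # per-participant ±2 SD cut
  ungroup()
```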
Acoustic analysis
Only data from incorrect responses (0.81%) were removed from the acoustic analyses. Figure 4A illustrates the distribution of mean intensity scores from the 22 participants in each experimental condition. For mean intensity, the best-fitting model included noise condition and the interaction between language and noise condition (see Table 2). Adding the factor of language did not significantly improve the fit, χ2(1, 3928) = 2.53, p = 0.11. To unpack the two-way interaction, simple-effects analyses indicated that masking noise increased speakers’ voice intensity relative to the quiet condition in both L1 (β = 10.05, t = 83.10, p < 0.001) and L2 word production (β = 9.94, t = 73.25, p < 0.001), but the intensity increase was larger in L1 than in L2 (see Figure 4B). As shown in Figure 4C, simple-effects analyses in the other direction indicated that in the quiet condition, mean intensity did not differ between L1 and L2 word production (β = –0.13, t = –1.06, p = 0.29), whereas in the masking noise condition, mean intensity was significantly higher in L1 than in L2 word production (β = –0.52, t = –4.51, p < 0.001).
Correlation analysis between the Lombard effect and L2 fluency
To test whether more fluent L2 speakers rely more on feedforward control than less fluent L2 speakers, we examined the relationship between the Lombard effect in L2 spoken word production and fluency performance in L2 rapid naming and passage reading. Data from the 22 participants in Experiment 1 were entered into Pearson’s correlation analyses. Here, the Lombard effect was defined as the difference in mean intensity between the L2-masking noise and L2-quiet conditions. L2 fluency was measured by the two English production tasks and defined as the average durations for rapid naming and passage reading, respectively. The Lombard effect was negatively correlated with the duration of the L2 rapid naming task (r = –0.67, 95% CI [–0.83, –0.36], p = 0.002) and the duration of the L2 passage reading task (r = –0.62, 95% CI [–0.80, –0.42], p = 0.002). Thus, the more fluently bilinguals speak in their L2, the larger the Lombard effect they exhibit in L2 speech production (see Figures 4D and 4E).
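This analysis reduces to per-participant difference scores and Pearson tests; a minimal sketch (ours; the data frame and column names are hypothetical):

```r
# One row per participant: mean intensity (dB) per condition and the two
# fluency durations (seconds).
df$lombard_l2 <- df$int_l2_masking - df$int_l2_quiet   # L2 Lombard effect (dB)

cor.test(df$lombard_l2, df$rapid_naming_s)     # Pearson r, 95% CI, p-value
cor.test(df$lombard_l2, df$passage_reading_s)
```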
Discussion
To the best of our knowledge, this is the first cross-language study to compare feedforward control in a group of Chinese–English bilinguals using a masking noise. The 90 dB SPL multitalker noise virtually eliminated speakers’ auditory feedback; thus feedback-based motor correction had little role to play, and predictive feedforward control dominated speech motor control. Notably, the Lombard effect elicited by the masking noise was larger in L1 than in L2 word production. In addition, correlation analyses showed that the Lombard effect in L2 word production was larger in more fluent than in less fluent L2 speakers. The results therefore support our two hypotheses that bilinguals’ feedforward control is influenced by language and related to L2 fluency, with a heavier weighting assigned to feedforward control in the L1 production system and in more fluent L2 speakers than in the L2 production system and in less fluent L2 speakers.
In Experiment 2, we reduced the level of the multitalker noise from 90 dB to either 60 dB or 30 dB to allow auditory feedback to participate in speech motor control. By measuring the magnitude of the Lombard effect in response to noise that was not as loud as the masking noise, we examined bilinguals’ relative reliance on feedback control in L1 and L2 spoken word production.
Experiment 2
Method
Participants
Participants in Experiment 2 were the same as those in Experiment 1. The order of the two experiments was counterbalanced across participants: half completed Experiment 2 after Experiment 1, and the other half completed Experiment 1 after Experiment 2. The interval between the two experiments was about 15 minutes (a 5-minute break plus the 10-minute L2 fluency tests). This within-subjects design not only maximized sensitivity for comparisons between experiments but also ensured that differences between the experiments could not be attributed to individual differences.
Stimuli
The picture stimuli were the same as those in Experiment 1. Previous work suggests that the magnitude of the Lombard effect is influenced by noise level. For example, Patel and Schell (2008) manipulated noise condition using quiet, 60 dB, and 90 dB multitalker noise and observed larger voice intensity increases when the background noise was 90 dB than when it was 60 dB. Following their practice, we also included a 60 dB multitalker noise and added a 30 dB multitalker noise to further investigate how proportional changes in noise level affect vocal adjustments in voice intensity. To differentiate these from the masking noise in Experiment 1, we call the 60 dB multitalker noise the strong noise and the 30 dB multitalker noise the weak noise.
Design
Experiment 2 adopted a 2 (language: L1 and L2) × 3 (noise condition: quiet, weak noise, and strong noise) within-subjects and within-items design. Within a block, participants named the 15 target pictures consecutively, yielding 15 trials per block. The order of blocks (L1-quiet, L1-weak noise, L1-strong noise, L2-quiet, L2-weak noise, L2-strong noise) was randomized. Participants completed three blocks for each experimental condition, for a total of 270 trials.
Apparatus and procedure
Identical to Experiment 1.
Results
Data from the 22 participants were entered into the final analyses using the same LMM estimation procedure. Table 3 presents the mean picture-naming response times, percentages of errors, and mean intensity by language and noise condition.
Behavioral results
Data from incorrect responses (0.88%), naming latencies longer than 1500 ms or shorter than 200 ms (2.19%), and latencies deviating more than two standard deviations from each participant’s mean (5.49%) were removed from all analyses. For response time, the best-fitting model included only the factors of language and noise condition (see Table 4). Adding the two-way interaction between language and noise condition did not significantly improve the fit, χ2(2, 5432) = 2.93, p = 0.23. For the percentage of errors, the best-fitting model included only the factor of language; adding noise condition, χ2(2, 5888) = 5.21, p = 0.07, or the interaction between language and noise condition, χ2(2, 5888) = 5.93, p = 0.20, did not significantly improve the fit. Table 4 displays the parameter estimates of the fixed effects for response time, percentage of errors, and mean intensity.
Note (Table 4): Language2 = L2; Noise2 = weak noise; Noise3 = strong noise.
Acoustic analysis
Only data from incorrect responses (0.88%) were removed from the acoustic analyses. Figure 5A illustrates the distribution of mean intensity scores of the 22 speakers in each experimental condition. For mean intensity, the best-fitting model included language and noise condition as well as their interaction (see Table 4). To unpack the two-way interaction, simple-effects analyses indicated that both weak noise (L1: β = 5.95, t = 52.29, p < 0.001; L2: β = 6.40, t = 53.46, p < 0.001) and strong noise (L1: β = 6.40, t = 53.46, p < 0.001; L2: β = 6.40, t = 53.46, p < 0.001) increased speakers’ voice intensity relative to the quiet condition, but the intensity increase was larger in L2 than in L1 (see Figure 5B). As shown in Figure 5C, simple-effects analyses in the other direction indicated that mean intensity did not differ between L1 and L2 word production in the quiet condition (β = –0.15, t = –1.24, p = 0.22), but was significantly higher in L2 than in L1 word production in the weak noise condition (β = 0.27, t = 2.24, p = 0.03) and the strong noise condition (β = 1.18, t = 10.64, p < 0.001).
Correlation analysis between the Lombard effect and L2 fluency
To test whether less fluent L2 speakers rely more on feedback control than more fluent L2 speakers, we examined the relationship between the Lombard effect in L2 spoken word production and fluency performance in L2 rapid naming and passage reading. Data from the 22 participants in Experiment 2 were entered into Pearson’s correlation analyses. The Lombard effect was defined as the difference in mean intensity between the L2-strong noise and L2-quiet conditions. The L2 Lombard effect was positively correlated with the duration of the L2 rapid naming task (r = 0.58, 95% CI [0.30, 0.81], p = 0.005) and the duration of the L2 passage reading task (r = 0.55, 95% CI [0.31, 0.81], p = 0.009). Thus, the less fluently bilinguals speak in their L2, the larger the Lombard effect they exhibit in L2 speech production (see Figures 5D and 5E).
Discussion
In Experiment 2, we observed that the Lombard effect elicited by a weak or strong noise was larger in L2 than in L1 word production. In addition, correlation analyses suggested that the Lombard effect in L2 word production was larger in less fluent than in more fluent L2 speakers. The results thus support the hypotheses that bilinguals’ feedback control is also affected by language and related to L2 fluency, with a heavier weighting assigned to feedback control in the L2 production system and in less fluent L2 speakers compared to the L1 production system and more fluent L2 speakers.
Notably, mean intensity did not differ between L1 and L2 word production under the quiet condition in either Experiment 1 or Experiment 2. This baseline similarity is critical when contrasting two different languages: it suggests that the observed vocal changes indeed resulted from the experimental manipulations rather than from the languages themselves.
General discussion
The purpose of the two experiments was to systematically determine the relative weighting of feedforward and feedback control in bilinguals’ L1 and L2 speech production, and to evaluate whether individual differences in L2 fluency are related to the organization of feedforward and feedback control in the L2 speech motor system. We manipulated the level of the noise mixed with the auditory feedback that participants received while speaking. When the noise (90 dB multitalker noise) exceeded a masking threshold such that participants could not perceive their original auditory feedback, bilinguals showed a larger Lombard effect in L1 than in L2 word production, and the Lombard effect in L2 word production increased with L2 fluency. In contrast, when the noise (30 dB or 60 dB multitalker noise) was below the masking threshold but hampered speech intelligibility, the same bilinguals showed a larger Lombard effect in L2 than in L1 word production, and the Lombard effect in L2 word production increased as L2 fluency decreased. The overall results indicate that, compared to L1, L2 speech motor control relies less on feedforward control and more on feedback control. The correlation findings also provide initial evidence in second language learners that more fluent L2 speech is related to a higher weighting of feedforward control and a lower weighting of feedback control.
Feedforward control between L1 and L2
We investigated bilinguals’ feedforward control using a masking noise in Experiment 1 and, for the first time, observed that bilinguals exhibited a larger Lombard effect in L1 than in L2 word production, reflecting that L1 speech motor execution relies more on feedforward control than L2. A previous study provided neuroimaging evidence differentiating native and novel speech production in terms of feedforward control (Moser et al., 2009). According to the DIVA model, the left inferior frontal gyrus and anterior insula are important brain regions for feedforward control (Guenther, 2016; Kearney & Guenther, 2019), and damage to these areas typically causes a disorder of motor speech planning. In Moser et al.’s (2009) study, 30 normal adults completed a speech production task consisting of two types of three-syllable nonwords: English (native) syllables and non-English (novel) syllables. When novel syllable production was compared to native syllable production, greater activation was observed in an extensive neural network including the left inferior frontal gyrus and anterior insula. Of close relevance, the authors speculated that increased activity in motor speech networks may directly reflect unfamiliarity with the motor commands necessary for the target sounds; that is, a difference in feedforward control. Our study cannot speak to the neural mechanism of a feedforward deficit (Alm, 2004, 2005; Kearney & Guenther, 2019) or its exact nature (Civier, 2010). Although L2 words were not novel speech sounds, they were not as familiar to bilingual speakers as their L1 counterparts (as reflected by longer naming latencies). Thus, it is not surprising that a difference in feedforward control was found between L1 and L2 speech motor control in Experiment 1.
For bilinguals, L1 is an overlearned language. The feedforward commands, which store detailed instructions for how to move the articulators to achieve a linguistic goal, should be read out directly from the “mental syllabary” without effort, a mechanism similar to singing a familiar song from memory (Civier et al., 2010). Speakers of a highly automatized language have established accurate bidirectional auditory-motor mappings: they can predict auditory consequences from an efference copy of the motor commands and the environmental influence, and they can also issue motor commands based on the intended auditory consequences. By accurately gauging the level of the masking noise, native speakers adjust their voice intensity more to make themselves heard. For second language learners, in contrast, the articulation of L2 words is less rehearsed (Parker Jones et al., 2012) owing to factors such as age of acquisition, amount of exposure, and involvement in daily life (Abutalebi et al., 2001), so they are less likely to have formed long-term representations of L2 words in the “mental syllabary” that are as accurate as their L1 counterparts. Thus, when facing a loud masking noise, L2 speakers show smaller intensity adjustments to compensate for the inaudibility.
Theoretical frameworks contend that feedforward control can be executed quickly because it avoids additional processing of sensory feedback (Guenther, 2016; Perkell, 2012); it is therefore reasonable to associate speed of speech with the relative weighting of feedforward control. Many patient studies have also found that brain damage affecting feedforward control causes significant motor impairment (Kearney & Guenther, 2019). In Experiment 1, we found a negative correlation between L2 fluency test durations and the Lombard effect in L2 speech production, suggesting that more fluent L2 speakers have superior feedforward control. This finding provides additional evidence for a fluency-related hypothesis in typical L2 speakers. Speech motor control models of the native language assume that the weighting of feedforward control increases as language acquisition progresses (Tourville & Guenther, 2011). Recent findings highlight L2 fluency as a reliable predictor of L2 proficiency (De Jong et al., 2012); our study accordingly suggests that feedforward control strengthens as second language learning progresses. Although L2 production is inferior to L1 production in feedforward control, we should be optimistic about this difference: with increasing L2 proficiency, speech motor control may develop along a continuum, biasing away from feedback control and toward feedforward control, allowing for more native-like speech production.
Feedback control between L1 and L2
We investigated bilinguals’ feedback control under weak and strong noise conditions. In contrast to Experiment 1, bilinguals exhibited larger Lombard effects in L2 than in L1 word production, and the effect was magnified in the strong noise condition relative to the weak noise condition. These contrasting findings are interesting because both experiments introduced noise to interfere with the perception of auditory feedback, and the only striking difference was that neither the weak nor the strong noise was loud enough to eliminate the auditory feedback needed for feedback control. Thus, the difference between these noise levels (weak and strong noise vs. masking noise) was not only quantitative but also qualitative. Notably, in Experiment 2, the strong noise decreased the signal-to-noise ratio more than the weak noise, but both noise levels elicited the same pattern of results, differing only in magnitude. Thus, the difference between these two noise levels (weak vs. strong noise) was only quantitative, not qualitative. Having controlled for other factors, we suggest that L2 speech motor execution relies more on feedback control than L1.
The finding of language-specific feedback control echoes an early study by Mackay (1970), who employed a delayed auditory feedback technique to interfere with normal speech production and found that the induced disfluency was more serious in L2 speech production for both German–English and English–German bilinguals. These findings provide direct evidence that the feedback control difference is unrelated to the particular language but related to language status. Future studies should investigate the influence of masking, weak, and strong noise on speech motor control in a group of English–Chinese bilinguals.
In addition, Simmonds et al.’s (2011b) brain imaging study differentiated L1 and L2 speech production in terms of feedback control. According to the DIVA model, the auditory and somatosensory association cortices are important brain regions for feedback control (Guenther, 2016; Kearney & Guenther, 2019), and a perturbation of speakers’ auditory feedback typically results in increased neural activity in these areas (Tourville et al., 2008; Toyomura et al., 2007). In Simmonds et al.’s (2011b) study, bilinguals produced overt propositional speech (i.e., defined visually presented pictures) in both their L1 and L2. The results provided reliable evidence of increased activation for L2 relative to L1 within the temporoparietal cortex. Of close relevance, the authors attributed the increased temporoparietal activity to more taxing sensory monitoring of discrepancies between predicted and actual sensory outcomes in L2 production. Thus, it is not surprising that our study found a difference in feedback control between L1 and L2 production.
Previous research has shown that reliance on feedback control is dynamic, ranging from heavy to minimal over the course of vocal development (Civier et al., 2010; Scheerer et al., 2013; Schmidt & Lee, 2005), with the transition modulated by practice and experience (Guenther et al., 2006). For L1 speakers, the brain has already internalized the relationships between speech movements and the desired auditory feedback during language acquisition; the additional information provided by auditory feedback therefore becomes redundant. For L2 speakers, however, the mapping between motor commands and their sensory consequences is less reliable, as evidenced by larger vocal variability (Chen et al., 2001; Ng et al., 2008; Wang & van Heuven, 2006), so auditory feedback is still required to retune and strengthen the motor-sensory transformations. Growing evidence also suggests that L2 speech output needs more careful monitoring to avoid errors (Ganushchak & Schiller, 2009; Parker Jones et al., 2012). Overall, the feedback subsystem may play a more prominent role in L2 speech motor control.
Overreliance on feedback control may introduce disfluency because a feedback-based strategy is relatively slow to detect and correct errors (Parrell et al., 2019; Perkell, 2012). It is thus reasonable to associate disfluency with the relative weighting of feedback control. Civier and colleagues (2010) also found that people who stutter may adopt a motor strategy weighted too heavily toward auditory feedback control, raising the probability of triggering a repetition and hence producing more stuttering. In Experiment 2, we found a positive correlation between L2 fluency test durations and the Lombard effect in L2 speech production, indicating that less fluent L2 speakers depend more on feedback control. The findings of Experiments 1 and 2 complement each other: in the same group of bilinguals, increasing efficiency of L2 speech motor control is related to a bias away from feedback control and toward feedforward control. A large body of literature indicates that reliance on feedback control decreases as language acquisition progresses (Liu et al., 2010a; Scheerer et al., 2013; Tourville & Guenther, 2011). Given the close relationship between L2 fluency and L2 proficiency, our study also suggests that feedback control plays a less prominent role as second language learning progresses. Differences between native speakers and L2 learners are therefore not necessarily permanent; L2 learners may eventually reach native-like efficiency of speech motor control.
Conclusion
In summary, our findings suggest that voice intensity control in bilinguals’ speech production requires the joint effort of the feedforward and feedback subsystems, and that the relative weighting of feedforward and feedback control depends on whether bilinguals are producing words in L1 or L2. The correlation analyses suggest a close relationship between L2 fluency and the organization of feedforward and feedback control. Although more work is needed to establish these findings in different populations with improved methodologies, this study opens a potential new line of research into bilinguals’ speech motor control.
Acknowledgments
This work was supported by the Key Project of Beijing Social Science Foundation in China (16YYA006) to Qingfang Zhang, the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China (18XNLG28) to Qingfang Zhang.