INTRODUCTION
People often struggle to learn the unfamiliar speech sounds of a new language. Unsurprisingly, difficulty distinguishing speech sounds often leads to difficulty distinguishing words that contain them (e.g., Broersma & Cutler, 2011). In some cases, even when second language (L2) learners have mastered novel speech sounds, they may still have difficulty using them to recognize words (Darcy et al., 2013; Díaz et al., 2012).
This latter pattern applies to L2 learning of lexical tones in Mandarin Chinese. In a previous study (Pelzl et al., 2019), we found that a group of advanced L2 Mandarin learners (native speakers of English with an average of 10 years learning/using Mandarin) identified tones on single syllables with near-native accuracy, but performed below chance on a lexical decision task that required using tones to reject disyllabic nonwords.
The present study narrows in on these L2 tone word recognition difficulties. We focus on two general classes of explanation for L2 phonological and lexical difficulties (or phonolexical difficulties, cf. Chrabaszcz & Gor, 2014). The first attributes difficulties primarily to weaknesses in the quality of L2 lexical representations (Cook et al., 2016; Cook & Gor, 2015; Darcy et al., 2013; Gor, 2018; Melnik & Peperkamp, 2019). In the case of tones, this would mean that frequent errors occur in L2 tone word recognition because the representations that are being activated either lack tones or have low-quality (uncertain) tone information. A second class of explanations attributes L2 difficulties to the influence of L1 processing biases (Chang, 2018; MacWhinney & Bates, 1989; Strange, 2011). In this case, the problem for L2 learners of tonal languages is that they focus perceptual attention (cf. Chang, 2018) on segmental cues to the exclusion of relevant tonal cues. This routine is successful in the L1, but in a tonal L2 leads to spurious activation of words with mismatching tones.
Using a lexical decision task with concurrent electroencephalogram (EEG), and an offline test of explicit lexical and tonal knowledge, we aim to see to what extent the representational and processing accounts can shed light on outcomes in advanced L2 learners of Mandarin.
SECOND LANGUAGE LEARNING OF MANDARIN TONES
In lexical tone languages pitch differentiates words from one another. For example, in Mandarin Chinese the syllable /ma/ spoken with a high pitch is “mom,” but with a low pitch is “horse.” For speakers of nontonal languages the very idea that words could work this way can be hard to fathom. Perhaps for this reason, people often take for granted that learning L2 tones will be difficult, though—as we review in the following text—research indicates that the difficulty of L2 tone learning is not absolute.
The primary acoustic cue for tone is fundamental frequency (F0) (Ho, 1976; Howie, 1974), the lowest frequency component of a sound wave, which humans perceive as its pitch. F0 is used in all languages in some form (intonation, stress), so it is not novel in and of itself. What sets tone languages apart is the functional use of F0 as a lexical cue.
Nontonal language speakers need to learn at least two qualitatively novel things related to F0 (Figure 1). First, they must learn to treat F0 patterns as discrete tone categories. For example, an L2 Mandarin speaker (or “learner”) must learn that there are four tones: a high tone (Tone 1, or T1), a rising tone (T2), a low tone (T3), and a falling tone (T4). This implies learners must be able to hear the differences between tones (auditory perception). However, knowing these tone categories is not enough.
Tone must also be integrated as a necessary (abstract) feature in a word’s phonological form. We will refer to this as learning tone words. Tone words are often illustrated with the syllable ma /mɑ/, which is a different word with each of the four tones (ma1 “mom”; ma2 “hemp”; ma3 “horse”; ma4 “scold”). While not every syllable in Mandarin occurs with all four of the tones, every syllable of a word requires a tonal feature to be complete (though in some cases, the required feature is a lack of a tone, as in the case of the morpheme me 么). So then, to successfully learn tone words, learners must be able not only to perceive tone categories but also to encode them (abstractly) in long-term memory for lexical representations, and to retrieve words during real-time lexical processing using tones.
TONE CATEGORY AND TONE WORD LEARNING IN NAÏVE OR NOVICE L2 LEARNERS
Previous research suggests that most people, given enough time and training, can learn to hear differences among tone categories and to identify them. As we might expect, people with no previous tone language experience make many errors identifying or discriminating tones (e.g., Alexander et al., 2005; Bent et al., 2006; Broselow et al., 1987; Gottfried, 2007; Lee et al., 1996; So & Best, 2010). But their errors are perhaps less surprising than their accuracy. Naïve participants generally perform well above chance, indicating they are not just guessing. For classroom learners with only slightly more experience, accuracy in tone identification is often at or above 80% (e.g., Lee et al., 2009; Wang et al., 1999; Zhang, 2011; though accuracy may decline if more difficult tasks are used, e.g., Wiener et al., 2019). These patterns contrast with some truly difficult L2 speech sounds. For instance, Japanese speakers who have attained advanced proficiency in English may still perform at or below chance distinguishing /r/ and /l/ (Brown, 1998). Similarly, English speakers who have achieved advanced proficiency in Russian typically display pronounced difficulties discriminating certain hard/soft consonant distinctions (Chrabaszcz & Gor, 2014). At least compared to such cases, basic auditory perception of tones appears less challenging (cf. Antoniou & Wong, 2016 for a comparison when training Mandarin tones and Hindi stops, suggesting tones are an easier learning target).
A number of L2 training studies have combined tone category and tone word learning in a single training routine, pairing pictures with small sets of words that differ only by tones. Such studies typically find clear improvements after training, but individual differences (musical expertise, pitch perception) can have a strong impact on outcomes (e.g., Bowles et al., 2016; Chandrasekaran et al., 2010; Dong et al., 2019; Li & DeKeyser, 2017; Perrachione et al., 2011; Sadakata & McQueen, 2014; Wong & Perrachione, 2007). For example, in Wong and Perrachione (2007) almost half of participants failed to reach even 60% accuracy in matching 18 tone words to pictures (six sets of three-way tone contrasts), even after 10 or more training sessions.
Two training studies have demonstrated the separability of tone category and word learning, by first training tone category identification and then tone words. Ingvalson et al. (2013) found that participants with less aptitude showed improved outcomes by first engaging in tone category learning, and only then proceeding to tone word learning. Cooper and Wang (2013) found a similar pattern for nonmusicians trained first with Cantonese tone categories and then tone words.
A potential limitation on the generalizability of the tone word training results reviewed in the preceding text is that the studies have relied on very small sets of tone word stimuli. To make tones as salient as possible, each word contrasts with two or three others. This may prove to be optimal for tone training (though the long-term benefits are still unclear), but necessarily fails to capture the complexity of a real tone language lexicon, especially when words longer than a syllable are considered (for an example of tone word training with a larger number of stimuli that more realistically reflect the statistical properties of Mandarin, though only with single syllables, see Wiener et al., 2018; for a set of studies that includes disyllabic tone words, see Bowles et al., 2016; Chang & Bowles, 2015).
In the case of Mandarin, there are major differences in the qualities of monosyllabic and disyllabic words. One crucial difference is the likelihood of a word having tone neighbors, that is, words that share all phonological features except for a tone (e.g., tang1 /tɑŋ/ “soup” and tang2 “sugar”) or tones (e.g., you1yu4 /jou1y4/ “melancholy” and you2yu2 “squid”). Tone neighbors are the norm for monosyllabic words, but much less common for disyllabic or multisyllabic words (Table 1). Importantly, there are many more disyllabic than monosyllabic words, which means most words learners encounter do not have tone neighbors. At the same time, even though monosyllabic words do have tone neighbors, many of these words are among the earliest to be learned and are encountered with extreme (token) frequency (Tao, 2015), so that interlocutors can typically intuit intentions, even when words are mispronounced. This sets up a scenario where, early on, learners encounter few consequential tone neighbors and little pressure to avoid confusion. As their vocabularies grow, they will become familiar with more and more words with tone neighbors and will need to discuss topics that require less frequent words where tones may become more critical cues for listeners. This delay in experiencing the communicative value of tones may lead to a large backlog of L2 words with incorrect or missing tone features, which could have major impacts on whether and how L2 learners use tones in real-time word recognition.
To understand how tone category and tone word learning may break down when thousands of words are known, we need to examine outcomes from “training” with the full complexity of a real language lexicon. This means we need studies with advanced L2 tone language learners.
ADVANCED L2 CHINESE RESEARCH
Research with advanced L2 learners of tone languages is still rather rare, but a handful of studies provide some indication of what typical long-term outcomes for tone category and tone word learning look like.
At the level of tone category learning, previous studies with advanced L2 Chinese learners (English L1) have found that they can achieve near-native performance on identification of tones on isolated monosyllables (Lee et al., 2009; Pelzl et al., 2019; Zhang, 2011; for related work with Dutch L1 speakers, see Zou et al., 2017). In Pelzl et al. (2019), we found that even when tone identification was challenging (syllables clipped from continuous speech), advanced L2 participants performed nearly identically to native Mandarin participants, with a clear difference appearing only for T2. Along the same lines, Shen and Froud (2016) found that behavioral identification and discrimination performance (using tone continua) for advanced L2 learners was near-native. Interestingly, when Shen and Froud (2019) tested the same participants using ERPs (event-related potentials), they found their MMN and P300 responses during passive listening were distinct from native patterns (for similar results in L2 learners with varied nontonal L1s, see Yu et al., 2019). These disjunctive results suggest that advanced L2 learners can develop tone categories, but that the categories are in some way distinct from those of native Mandarin speakers.
Given advanced learners’ ability to achieve high performance at tone identification, a key question that arises is whether this perceptual capacity will translate into high performance in online tone word recognition. Our previous study (Pelzl et al., 2019) also examined this question. The same advanced L2 participants that performed at near-native levels on tone identification completed a lexical decision task with disyllabic words, and nonwords that mismatched real words by a tonal or segmental contrast. Tonal nonwords differed from real words only with respect to the tone of the first syllable (e.g., nonword fang4zi /fɑŋ4tsɹ̩/ derived from real word fang2zi “house”). Segmental nonwords differed from real words with respect to the rhyme of the first syllable (e.g., nonword feng2zi /fəŋ2tsɹ̩/ derived from real word fang2zi). As in the tone identification, the stimuli were clipped from continuous speech in sentences. Compared to native speakers, L2 learners performed significantly less accurately on both types of nonword, but the difference in accuracy between the segmental and tonal conditions was particularly striking. For segmental nonwords, mean L2 accuracy was 84% (compared to 96% for L1), while for tonal nonwords it was 35% (L1: 91%). Performance did not appear to be due to lack of word knowledge, as most L2 participants knew upward of 95% of the critical vocabulary, and (with just one exception) participants failed to reach native-speaker levels for rejection of tonal nonwords even when they performed near ceiling on an offline test of tone knowledge for the critical vocabulary. Summarizing the results, our previous study’s tone identification data indicated L2 learners can achieve strong auditory perception of tone categories, but lexical decision data suggested they have persistent difficulty representing and/or processing tone words.
THE PRESENT STUDY
The current study was designed to narrow in on the causes that drive difficulty in learning tone words, focusing on the issues of lexical representation and processing raised previously (Pelzl et al., 2019), and highlighted in our introduction. We investigate a “best-case scenario” for advanced L2 tone word processing by testing performance in nearly ideal listening conditions, with words spoken clearly and in isolation. Under such conditions, do learners still have difficulty in lexical decision for tone words? If so, is it driven by the quality of lexical representations or by L2 processing routines?
One possible explanation for the low L2 accuracy in rejecting tone nonwords observed by Pelzl et al. (2019) was that the challenging stimuli, which required listeners to process multiple syllables at a naturalistic pace, induced a processing bottleneck, so that listeners did not have enough time to utilize the routines they used so successfully in tone identification (cf. phonetic and phonological modes in Strange, 2011). If this were the case, performance might recover if words were pronounced more slowly. To address this possibility, the present study will test whether differences between tonal and segmental nonwords persist with more slowly and clearly pronounced stimuli. This will answer the first research question: (1) Are L2 listeners equally accurate in rejection of isolated disyllabic nonwords that differ from real words only with respect to either a vowel or a tone?
A second possible explanation for low L2 accuracy in rejecting tone nonwords is that it might have been due to a lack of certainty about the phonological form of relevant real words on the part of learners. Cook and Gor (cf. Cook & Gor, 2015; Gor, 2018; Gor & Cook, 2018) have posited that L2 learners’ subjective familiarity with words can provide an explanation for why they might be more permissive in accepting phonologically similar words compared with L1 listeners. In this case, the hypothesis is that less familiar words have lower-quality phonological representations and are more likely to be incorrectly accepted, while more familiar words have higher-quality representations and are more likely to be correctly rejected. While we did measure offline knowledge of words and tones in Pelzl et al. (2019), we did not attempt to measure confidence for the meanings or tones of the associated words. By measuring subjective confidence, the current study will account more thoroughly for the role of L2 familiarity in lexical decision outcomes, answering the second research question: (2) Does lexical familiarity impact L2 behavioral responses to tone nonwords?
Finally, the current study will take advantage of ERPs as a measure of continuous online responses to gain fuller insight into the L2 tone word-recognition process. The behavioral outcome in a lexical decision task only reflects the final decision point for each trial, leaving the process leading up to that decision unexamined. In this sense, the difference we found in the lexical decision task in Pelzl et al. (2019) was only quantitative, not qualitative (see footnote 1). It is possible that, despite lower accuracy overall, L2 learners nevertheless display qualitatively equivalent responses to both vowel and tone word mismatches.
To address this possibility, the current study will use ERPs to assess the word-recognition process as it unfolds during each trial. ERPs are particularly valuable because they can capture qualitative aspects of word-recognition processes, namely whether responses occur within the same time window, and whether the magnitude of responses in different conditions is comparable.
The present study will focus on the N400, which is particularly useful in examination of lexical recognition processes. The N400 is a negative-going ERP response that peaks approximately 400 ms after stimulus onset and can be used as an index of the ease or difficulty a listener has in accessing lexical targets (Kutas & Federmeier, 2000; Kutas & Hillyard, 1980, 1984; Lau et al., 2008). Several previous studies have found the N400 in native Chinese speakers to be sensitive to lexical tone mismatches in contextually expected words (in sentences: Brown-Schmidt & Canseco-Gonzalez, 2004; Li et al., 2008; Pelzl et al., 2019; Schirmer et al., 2005; with picture cues: Malins & Joanisse, 2012; Zhao et al., 2011). However, no previous research has investigated advanced L2 neural sensitivity to tone mismatches in isolated disyllabic words. By examining L2 ERPs to nonwords, we will have a continuous measure of L2 tone processing, allowing us to answer a third research question: (3) Are L2 listeners equally sensitive to vowel and tone mismatches (as indexed by the N400)? Importantly, we will only be examining trials with correct rejections of nonwords. For correct rejections, the N400 amplitude should be more negative than that of real words (“the N400 effect”), indicating the difficulty the listener has accessing a word. If we see similar N400 effects for tone and vowel nonwords, this will indicate that the same process holds for both. If we find smaller N400 responses for tones, this will indicate that even when nonwords are correctly rejected, L2 sensitivity to tones is diminished. This might occur if, for example, L2 listeners rely on slow, explicit judgments to arrive at correct rejections, rather than on the faster and more automatic processes indexed by the N400.
PARTICIPANTS
We recruited 19 native English speakers who had achieved relatively advanced proficiency in spoken Mandarin Chinese. One participant was excluded due to early onset of learning (age 7) and possible tone language exposure in the family home. This left 18 advanced L2 participants. Table 2 summarizes their general learning characteristics, as well as scores on the screening measures, and results on a tone identification task (for details, see supplementary materials). This study used the same screening measures (vocabulary, can-do self-assessment) and criteria as in Pelzl et al. (2019) to maintain at least a lower bound of comparability with the population tested in that study (one L2 participant scored a bit lower [65.7] than criterion [70] on the vocabulary test, but was accepted nonetheless as advanced L2 participants were difficult to find). Twenty-four native Chinese speakers also completed the experiment (average age = 26.1). Four were excluded due to excessive EEG artifacts, leaving 20 for all analyses presented in the following text.
All participants gave informed consent and were compensated for their time.
STIMULI DESIGN AND PRODUCTION
We selected 96 disyllabic real words (e.g., fa1yin1 /fɑ1in1/ “pronunciation”) to be used in an auditory lexical decision task. All were high-frequency nouns. On the basis of the real words, two types of nonwords were created, differing from real words only with respect to a tone or vowel (Figure 2). For the tone mismatch condition, the tone of the first syllable was changed, producing a nonword (e.g., fa2yin1). We will refer to these items as tone nonwords. For the vowel mismatch condition, the vowel (and only the vowel) on the first syllable was changed, producing a nonword (e.g., fu1yin1 /fu1in1/), that is, vowel nonwords. (Additional details of procedures for selection and quality control of stimuli, the approach applied for T3 sandhi, as well as the complete list of all stimuli, can be found in supplementary materials online.)
These stimuli improve on those of Pelzl et al. (2019) in several ways. First, all tones are balanced across real words, and tone changes are balanced across tone nonwords: that is, T1 becomes T2, T3, and T4 an equal number of times, and similarly for other first syllable tones. Second, whereas Pelzl et al. (2019) swapped out entire syllable rhymes, including syllable final /n/ and /ŋ/ (e.g., xiang3fa3 /ɕiɑŋ3fɑ3/ “thought” became the nonword xu3fa3 /ɕy3fɑ3/), the current stimuli limited changes to vowels (see footnote 2). Third, to prevent listeners from rejecting nonwords before the onset of the second syllable, we avoided creating syllables that never occur in Mandarin (e.g., fai /fɑi/) or are very rare (e.g., cen /tsʰən/).
As noted in the preceding text, the current study aims to explore advanced L2 tone perception in a best-case scenario. To this end, we recorded a native Chinese speaker (female) from northern China who spoke with a standard Mandarin accent. She produced the stimuli in isolation, speaking with a clear voice, at a comfortable, but relaxed, speech rate. Using Praat, all stimuli were cut out of the original audio files to create individual .wav files. The average intensity of each file was scaled to 70 dB, and 200 ms of silence were appended at the end of each file. This resulted in 96 triplets consisting of a real word and its vowel and tone nonword counterparts. Stimuli were divided into three lists, each containing 32 real words, 32 vowel nonwords, and 32 tone nonwords. Additionally, 32 disyllabic real-word filler trials were included in each list to balance the proportion of correct “yes” answers across the experiment. Importantly, no item was repeated in both its real and nonword forms for the same participant, as such repetition might lead to undesirable strategizing.
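The list structure described here can be illustrated with a short sketch. It assumes a simple Latin-square rotation of the three conditions across lists; the source does not say how triplets were actually assigned, and the names `CONDITIONS` and `build_lists` are hypothetical.

```python
# Minimal sketch (assumed rotation scheme, not the authors' procedure):
# each of the 96 triplets contributes exactly one form to each list.
CONDITIONS = ("real", "vowel_nonword", "tone_nonword")

def build_lists(n_items=96, n_lists=3):
    """Rotate conditions across lists so no item repeats within a list."""
    lists = {k: [] for k in range(n_lists)}
    for item in range(n_items):
        for k in range(n_lists):
            # Shift the condition by the list index (Latin-square rotation).
            condition = CONDITIONS[(item + k) % len(CONDITIONS)]
            lists[k].append((item, condition))
    return lists

lists = build_lists()
N_FILLERS = 32  # real-word fillers added to every list
trials_per_list = len(lists[0]) + N_FILLERS
```

Under this scheme every list holds 32 items per condition, and each item appears in a different form in each list, so a participant assigned to one list never hears a word in both its real and nonword forms.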
VOCABULARY TEST
We also constructed an offline vocabulary test. The format is illustrated in Figure 3. For each L2 participant, the test included all real word counterparts for vowel and tone nonwords encountered during the lexical decision task (64 words). Each item provided Chinese characters and toneless Pinyin. Participants supplied tones (numbers 1–4 for each syllable), an English definition, and a confidence rating from 0 to 3 for both the tones and the definition of each item. Participants were informed that the 0–3 scale had the following meaning: 0 = I don’t recognize this word; 1 = I recognize this word, but am very uncertain of the tones/meaning; 2 = I recognize this word, but am a bit uncertain of the tones/meaning; 3 = I recognize this word, and am certain of the tones/meaning. This scale remained visible as a reference throughout the test. For any tones or definitions they did not know, participants were told to leave the answer blank and supply “0” for confidence.
PROCEDURES
We used an auditory lexical decision task. Participants heard a single disyllabic Mandarin word or nonword and decided whether it was a real word or not. EEG was recorded along with the behavioral response for each trial. After the experiment, L2 participants completed an offline vocabulary knowledge test of the real word counterparts of all nonwords they heard in the lexical decision task.
Thirty-six participants (24 L1 and 12 L2) were tested in the lab at Beijing Normal University (BNU). Seven additional L2 participants were tested under conditions as similar as possible in the lab at the University of Maryland (UMD). Each participant was seated in front of a computer monitor and fit with an EEG cap. Auditory stimuli were presented using a single high-quality audio monitor (JBL LSR305) placed centrally above the computer monitor.
For the lexical decision task, instructions presented onscreen included an illustrative example of each type of nonword: “zhong1guo2 is a real word, but zhang1guo2 and zhong4guo2 are not real words in Mandarin.” Instructions were presented in English for L2 participants, and in Chinese for L1 participants. Instructions were followed by 10 practice items with stimuli not included in the experiment. Participants then completed 128 lexical decision trials. Trials were divided into seven blocks (roughly 20 in each) with self-paced breaks between each block. Stimuli were counterbalanced across three lists, and each list was given four unique pseudorandom orders so that stimuli of a single condition type were never repeated more than three times in a row, and strings of expected yes/no answers never extended beyond three items in a row. Timing parameters are shown in Figure 4.
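One way to generate orders that satisfy both run-length constraints is sketched below. This is an illustrative greedy builder with restarts, not the authors' procedure; treating fillers as their own condition type is an assumption, and `make_order`, `run_ok`, and `longest_run` are hypothetical names.

```python
import random

MAX_RUN = 3
# Expected response per condition; "filler" as a separate type is an assumption.
EXPECTED = {"real": "yes", "filler": "yes", "vowel": "no", "tone": "no"}

def run_ok(seq, key):
    """True if appending `key` would not create a run longer than MAX_RUN."""
    tail = seq[-MAX_RUN:]
    return not (len(tail) == MAX_RUN and all(x == key for x in tail))

def make_order(counts, rng):
    """Build one constrained order greedily; restart on dead ends."""
    while True:
        remaining = dict(counts)
        conds, answers = [], []
        while sum(remaining.values()) > 0:
            options = [c for c, n in remaining.items()
                       if n > 0 and run_ok(conds, c)
                       and run_ok(answers, EXPECTED[c])]
            if not options:
                break  # dead end: discard this attempt and start over
            weights = [remaining[c] for c in options]
            choice = rng.choices(options, weights=weights)[0]
            conds.append(choice)
            answers.append(EXPECTED[choice])
            remaining[choice] -= 1
        else:
            return conds  # every trial was placed

def longest_run(seq):
    """Length of the longest run of identical neighbours in `seq`."""
    best, run, prev = 0, 0, object()
    for x in seq:
        run = run + 1 if x == prev else 1
        best = max(best, run)
        prev = x
    return best

order = make_order({"real": 32, "filler": 32, "vowel": 32, "tone": 32},
                   random.Random(1))
```

Weighting each choice by the number of remaining trials keeps the conditions roughly balanced through the sequence, which makes dead ends near the end of the list less likely.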
After the ERP experiment was finished, L2 participants completed the offline vocabulary test.
EEG RECORDING
Raw EEG was recorded continuously at a sampling rate of 1,000 Hz using a Neuroscan SynAmps data acquisition system and an electrode cap (BNU: Quik-CapEEG; UMD: Electrocap International) mounted with 29 AgCl electrodes at the following sites: midline: Fz, FCz, Cz, CPz, Pz, Oz; lateral: FP1, F3/4, F7/8, FC3/4, FT7/8, C3/4, T7/8, CP3/4, TP7/8, P3/4, P7/8, and O1/2 (the UMD cap had FP2, but no Oz). Recordings were referenced online to the right mastoid and rereferenced offline to averaged left and right mastoids. The electro-oculogram (EOG) was recorded at four electrode sites: vertical EOG was recorded from electrodes placed above and below the left eye; horizontal EOG was recorded from electrodes situated at the outer canthus of each eye. Electrode impedances were kept below 5 kΩ. The EEG and EOG recordings were amplified and digitized online at 1 kHz with a bandpass filter of 0.1–100 Hz.
EEG DATA PROCESSING
All trials were visually inspected and evaluated individually for artifacts using EEGLAB v10.2.5.8b (Delorme & Makeig, 2004) and ERPLAB v3.0.2.1 (Lopez-Calderon & Luck, 2014) running under MATLAB R2013b (MathWorks, 2013). Data from four L1 participants were excluded due to having more than 40% artifacts on experimental trials. After excluding these participants, artifact rejection affected 8.45% of experimental trials (L1 8.08%; L2 8.86%). Trial-level data for each subject, baselined to the mean of the 100 ms preceding the onset of the auditory stimulus, were exported for further processing in R (R Core Team, 2019). A single average amplitude was obtained for each trial for each electrode for each subject in a slightly delayed auditory N400 window (400–900 ms). This window was chosen on the basis of two criteria. First, the average duration of stimuli was approximately 600 ms. Listeners could only notice a nonword sometime after the onset of the second syllable, suggesting any time earlier than 300 ms would be inappropriate. Second, visual inspection of grand average waveforms across all scalp electrodes suggested 900 ms was a reasonable endpoint to capture N400 effects, and is sufficiently generous so that it does not underestimate potentially slower L2 responses.
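The per-trial measure described above (baseline correction followed by a window average) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline, which used EEGLAB/ERPLAB and R: the epoch is assumed to be a plain list of microvolt samples that starts 100 ms before stimulus onset, and `window_mean` is a hypothetical helper.

```python
def window_mean(epoch, srate_hz=1000, prestim_ms=100, window_ms=(400, 900)):
    """Baseline-correct one epoch and average a post-onset window.

    `epoch` holds microvolt samples for one electrode on one trial and
    begins `prestim_ms` before stimulus onset (an assumed layout).
    """
    per_ms = srate_hz / 1000
    n_base = int(prestim_ms * per_ms)
    # Mean of the prestimulus interval serves as the baseline.
    baseline = sum(epoch[:n_base]) / n_base
    # Convert the window bounds (ms after onset) to sample indices.
    start = n_base + int(window_ms[0] * per_ms)
    stop = n_base + int(window_ms[1] * per_ms)
    segment = epoch[start:stop]
    return sum(segment) / len(segment) - baseline

# Toy epoch: flat at 2 µV except for a -3 µV deflection from 400 to 900 ms.
toy = [2.0] * 100 + [2.0] * 400 + [-3.0] * 500 + [2.0] * 100
amplitude = window_mean(toy)  # -5.0 µV relative to baseline
```

Subtracting the baseline after averaging is equivalent to subtracting it from every sample first, since the baseline is a constant for the epoch.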
Data from 15 central electrodes (F3, Fz, F4, FC3, FCz, FC4, C3, Cz, C4, CP3, CPz, CP4, P3, Pz, P4) were chosen for final analysis as visual inspection of grand average waveforms suggested these electrodes had strong and consistent N400 peaks across conditions, and we had no theoretical motivation for positing that ERP responses would vary across regions. To reduce some mild nonnormality in the data, any trial with an absolute value greater than 50 μV was removed prior to final data analysis. Finally, only trials that elicited correct behavioral responses (correct acceptance or correct rejection) were retained for final analysis. After all these steps, the final EEG dataset contained 43,567 data points (79.6% out of a total possible 54,720 data points: L1 = 88.1%; L2 = 70.2%). We note that the loss of data disproportionately affects L2 data, which reduces power for finding effects (an alternative analysis retaining all trials is included in online supplementary materials; substantive results are the same as those reported here).
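The bookkeeping behind these counts can be sketched in a few lines. The figure of 96 experimental trials per participant is inferred from the reported total rather than stated in the text, and the record layout (`amplitude_uv`, `correct`) and function names are hypothetical.

```python
# Inferred design constants: 15 electrodes x 96 experimental trials
# x 38 participants (20 L1 + 18 L2) gives the reported 54,720 total.
N_ELECTRODES, N_TRIALS, N_SUBJECTS = 15, 96, 20 + 18
total_possible = N_ELECTRODES * N_TRIALS * N_SUBJECTS

def keep(point, max_abs_uv=50.0):
    """Retain a data point only if its amplitude is within range
    and the trial received a correct behavioral response."""
    return abs(point["amplitude_uv"]) <= max_abs_uv and point["correct"]

def retained_pct(n_kept, n_possible=total_possible):
    """Percentage of possible data points that survive filtering."""
    return 100 * n_kept / n_possible

# Toy records: the second fails the amplitude cut, the third the accuracy cut.
toy = [
    {"amplitude_uv": -4.2, "correct": True},
    {"amplitude_uv": 63.0, "correct": True},
    {"amplitude_uv": 1.5, "correct": False},
]
kept = [p for p in toy if keep(p)]
```

Trials already removed during artifact rejection would simply be absent from the exported records, so they count toward the loss without needing a separate filter here.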
BEHAVIORAL LEXICAL DECISION TASK RESULTS AND STATISTICAL ANALYSIS
Reliability for the lexical decision task data was high for all three lists (list A: α = .94; list B: α = .93.; list C: α = .93). Descriptive results are shown in Table 3. The L1 group displayed high accuracy across all conditions, while the L2 group had noticeably lower accuracy overall, with tone nonwords registering the lowest. D-prime (d′) was also calculated for each participant, contrasting vowel nonwords and real words, and tone nonwords and real words, using Laplace smoothing to correct for infinite values (Barrios et al., Reference Barrios, Namyst, Lau, Feldman and Idsardi2016; Jurafsky & Martin, Reference Jurafsky and Martin2009). As with accuracy, d′ results suggest overall higher sensitivity to nonwords for L1 listeners with little difference between nonword conditions (vowel d′ = 3.78, sd = .46; tone d′ = 3.81, sd = .46). In contrast, L2 has less sensitivity overall and a larger difference between conditions that suggests vowel nonwords are detected more readily than tone nonwords (vowel d′ = 2.31, sd = .55; tone d′ = 1.59, sd = .78). When considered individually (Figure 5), all but one L2 participant had a lower d′ for tone than vowel nonwords. All but three scored below the lowest L1 d′ for tone nonwords, while for vowel nonwords eight learners were in the range of L1 scores. Only one L2 participant performed near the level of the average L1 scores overall.
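As an illustration of the d′ computation, the sketch below applies add-one (Laplace) smoothing to the hit and false-alarm rates before converting them to z-scores. The exact smoothing formula used in the study is not given in the text, so this form is an assumption, as is the coding of a "hit" as a correctly rejected nonword and a "false alarm" as a rejected real word; `d_prime` is a hypothetical name.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate), with add-one smoothing.

    Adding 1 to each count (and 2 to each total) keeps both rates
    strictly between 0 and 1, so the z-transform is always finite.
    """
    hit_rate = (hits + 1) / (hits + misses + 2)
    fa_rate = (false_alarms + 1) / (false_alarms + correct_rejections + 2)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# A participant at ceiling (32/32 nonwords rejected, 32/32 real words
# accepted) still receives a finite score under smoothing.
ceiling = d_prime(hits=32, misses=0, false_alarms=0, correct_rejections=32)

# A participant responding at chance receives d' = 0.
chance = d_prime(hits=16, misses=16, false_alarms=16, correct_rejections=16)
```

Without smoothing, perfect performance would give a hit rate of 1 and a false-alarm rate of 0, both of which map to infinite z-scores.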
All statistical analyses reported in the following text were conducted in R (version 3.6.1, R Core Team, 2019). Mixed-effects models were fit using the lme4 package (version 1.1.21, Bates et al., Reference Bates, Mächler, Bolker and Walker2015). Effects coding was applied using the mixed function in afex (Singmann et al., Reference Singmann, Bolker, Westfall and Aust2017).
Accuracy results were submitted to a generalized linear mixed-effects model (using the bobyqa optimizer) with crossed random effects for subjects and items. The dependent variable was accuracy (1, 0). Fixed effects included the factors condition (real word, tone nonword, vowel nonword), and group (L1, L2), and their interaction. The maximal random effects model was fit first (Barr et al., Reference Barr, Levy, Scheepers and Tily2013; Bates et al., Reference Bates, Kliegl, Vasishth and Baayen2015). Model convergence difficulties were addressed by suppressing correlations in random effects (using “expand_re = TRUE” in the mixed function). The best-fitting model was determined by model comparison conducted through likelihood ratio tests, building from the maximal model (which was rejected due to convergence issues) to progressively less complex models. Inclusion of the nuisance factor list (with subjects nested under lists) did not improve model fit, and so was not retained in the final model. The final model included by-subject random intercepts and slopes for the effect of condition, and by-item random intercepts and slopes for condition and group, but not their interaction (glmer model formula: accuracy ~ condition * group + (condition || subject) + (condition + group || item)). Results are depicted graphically in Figure 6.
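Written out, the final model formula corresponds to the following specification. This is a sketch under stated assumptions: all symbols are introduced here for illustration and do not appear in the original analysis. The vector c stands for the effects-coded condition contrasts on a given trial, g for the effects-coded group of subject s, and every random effect is taken to be an independent, zero-mean normal deviate, as implied by the uncorrelated (||) random-effects syntax.

```latex
\begin{aligned}
\operatorname{logit}\,\Pr(y_{si} = 1) ={}& \beta_0
  + \boldsymbol{\beta}_C^{\top}\mathbf{c}_{si}
  + \beta_G\, g_s
  + \boldsymbol{\beta}_{CG}^{\top}\,(\mathbf{c}_{si}\, g_s) \\
&+ u_{0s} + \mathbf{u}_{Cs}^{\top}\mathbf{c}_{si}
  && \text{(by-subject intercept and condition slopes)} \\
&+ w_{0i} + \mathbf{w}_{Ci}^{\top}\mathbf{c}_{si} + w_{Gi}\, g_s
  && \text{(by-item intercept, condition and group slopes)}
\end{aligned}
```

Here y_si is the accuracy (1 or 0) of subject s on item i. There is no by-item term for the condition-by-group interaction, matching the description of the final model.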
Table 4 reports main effects and interactions. P-values were obtained using the likelihood ratio test (LRT) method. The effects of condition and group were both statistically significant. There was also a significant interaction between condition and group.
Signif. codes: *** <0.001; ** <0.01; * <0.05; . <0.1.
Critical planned comparisons are reported in Table 5. The Holm method was used to correct for multiple comparisons. Though we are primarily interested in testing accuracy in correct rejection of vowel and tone nonwords in L2, implicit in this comparison is that there is a difference between the differences in accuracy for vowel and tone nonwords for L1 and L2. This is borne out in our comparisons. There was no significant difference in L1 accuracy of correct rejections for vowel and tone nonwords, whereas for the L2 group accuracy for correct rejection of nonwords differed significantly for vowels and tones. L2 listeners were about two and a half times more likely to incorrectly accept tone nonwords than vowel nonwords (38.5/15.1 = 2.55). Finally, the difference between L2 vowel and tone was significantly larger than the difference between L1 vowel and tone.
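For readers unfamiliar with the Holm correction, the step-down adjustment can be sketched as below. This is a generic illustration of the method rather than the code used in the analysis, and the p-values in the usage line are invented for demonstration only.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    adjusted values are capped at 1 and forced to be non-decreasing.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Illustrative (made-up) p-values for three planned comparisons.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

Because each successive p-value is multiplied by a smaller factor, the Holm procedure is never less powerful than a plain Bonferroni correction while still controlling the family-wise error rate.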
Signif. codes: *** <0.001; ** <0.01; * <0.05; . <0.1.
ERP RESULTS AND STATISTICAL ANALYSIS OF CORRECT TRIALS ONLY
N400 average amplitudes for trials that received a correct response in the lexical decision task are shown in Table 6 and depicted visually as grand average waveforms in Figure 7. Across all midline and central electrodes, L1 displays strong N400 effects to both vowel and tone nonwords. In contrast, L2 shows attenuated N400 effects overall, and visually different magnitudes of N400 for vowel and tone nonwords, with tone nonword responses diverging less strongly from real word responses.
Averaged N400 amplitudes from the 400–900 ms window were submitted to a linear mixed-effects model with crossed random effects for subjects and items. Models included fixed effects for condition (real word, vowel nonword, tone nonword) and group (L1, L2) and their interactions. Convergence difficulties were addressed by specifying uncorrelated random effects. Effects coding was used, and p-values were obtained using Satterthwaite’s method. The maximal model that successfully converged was fit first and was then compared to less complex models to test random effects. The final model included random intercepts for subjects and items, and by-item random slopes for the effect of group (lmer model formula: amplitude ~ condition * group + (1 | subject/electrode) + (1 + group || item)).
Model results are reported in Table 7. The main effects of group and condition, and their interaction, were statistically significant.
Signif. codes: *** <0.001; ** <0.01; * <0.05; . <0.1.
Planned comparisons are reported in Table 8. Model estimates in the planned comparisons can be interpreted as amplitude differences (in μV). For L1 listeners, real words evoked significantly more positive amplitudes than either vowel nonwords or tone nonwords, while there was no statistically significant difference between vowel and tone nonword responses. For L2 listeners, real words evoked a significantly more positive response than vowel nonwords, while there was no significant difference between tone nonwords and either real words or vowel nonwords. Finally, the difference of differences between L1 and L2 tone and vowel nonwords was not statistically significant. Model-estimated results are depicted as boxplots in Figure 8.
To capture patterns at the individual level, we plotted each participant’s mean amplitude for the three conditions (Figure 9). Although L1 participants varied as to whether tone or vowel nonwords elicited stronger negativity, all L1 participants display greater negativity for nonwords than real words. In contrast, L2 participants display much less consistency. While some participants display clear nonword responses, many participants’ N400 effects are small or nonexistent. Ten L2 participants’ tone N400s are smaller than their vowel N400s, though five individuals show the opposite pattern, and three display no nonword N400 effects at all.
In summary, for trials with correct responses, the L1 group displayed significant and strong N400 effects for both vowel and tone nonwords, and this was consistent across all L1 participants. The L2 group displayed significant N400 effects only for vowel nonwords, with weaker N400 effects for tone nonwords, intermediate between vowel nonwords and real words. This was reflected at the level of individuals by inconsistent N400 effects, with tone nonwords overall less likely to elicit N400s than vowel nonwords.
OFFLINE VOCABULARY TEST DATA PROCESSING
The offline vocabulary data are used to consider how familiarity with words and tones impacts lexical decisions, and to evaluate the general quality of L2 tone word knowledge. The test produced four data points for each nonword that an L2 participant encountered: an accuracy score for the tones and definition they supplied, and a confidence rating for each. For example, if the word was fa1yin1, and the participant provided 11 as the answer for tones, this would be scored as 1, while any other set of two numbers would result in a score of 0 for the tone on that item. Note that this scoring method counted tones on both syllables, whereas the nonwords only ever mismatched real words with respect to tones on the first syllable. In that sense, this scoring approach is strict. Definitions were also scored 1 for correct, or 0 for incorrect. For both of these scores, there was also an accompanying confidence rating, ranging from 0 to 3. One participant’s vocabulary test data was lost due to a coding error.
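The strict tone-scoring rule described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' scoring script: it assumes that target words are written in Pinyin with a tone digit after each syllable (as in fa1yin1) and that responses are strings of tone digits. Both function names are hypothetical.

```python
def target_tones(pinyin):
    """Extract the tone digits from numbered Pinyin, e.g. 'fa1yin1' -> '11'."""
    return "".join(ch for ch in pinyin if ch.isdigit())


def score_tone_response(answer, pinyin):
    """Score 1 only if the tones on BOTH syllables match the target, else 0."""
    return 1 if answer.strip() == target_tones(pinyin) else 0


# The worked example from the text: '11' is correct for fa1yin1,
# while any other pair of tone numbers scores 0.
correct = score_tone_response("11", "fa1yin1")
wrong_first_syllable = score_tone_response("21", "fa1yin1")
```

Because the whole tone string must match, an error on the second syllable is penalized even though the nonwords in the lexical decision task only ever differed from real words on the first syllable.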
Overall, L2 learners supplied correct tones for about 74% of the items (807 out of 1,088 total responses), and correct definitions for about 91% of the items overall (990 out of 1,088 total responses).
Items given a confidence score of 0 for either tones or definitions were discarded before further analyses (a total of 40 trials), and four trials were missing data (i.e., unanswered). This left a total of 1,044 items (90.6% of all L2 nonword trials) that had data for all four cells (i.e., tone and definition accuracy, and tone and definition confidence ratings).
OFFLINE VOCABULARY TEST RESULTS
Table 9 presents vocabulary results for tone responses. This data can give us insight into the quality of L2 tone representations for known words, as well as its relation to performance in the lexical decision task. Results are listed according to confidence ratings. For example, for real word counterparts to vowel nonwords, participants assigned a rating of 3 “high” to the tones they supplied for 377 items. Table 9 also lists the accuracy of the supplied tones, and the accuracy of lexical decisions for those items. Even for high confidence items (“I recognize this word, and am certain of the tones”), tone answers were inaccurate more than 10% of the time, and lexical decision accuracy was lower for tone nonwords than vowel nonwords. For mid- and low confidence items, tone and lexical decision accuracy fell even further.
Table 10 provides parallel results for vocabulary definitions, allowing us to separately evaluate the quality of lexical-semantic knowledge. In contrast to sometimes questionable confidence in tone knowledge, L2 participants’ confidence about their knowledge of definitions seems quite accurate—high-confidence items were correctly defined 98% of the time. In other words, they know which words they know, and which they do not. There is not a clear relationship between this knowledge and performance on the lexical decision task, which follows insofar as the lexical decision task only tested word form recognition, not semantic knowledge.
In sum, results of the vocabulary knowledge test suggest L2 participants have substantial difficulty encoding tones in lexical representations. Even when explicit knowledge is fully available and words are confidently recognized, L2 tone knowledge was still inaccurate more than 10% of the time. (For complete by-item vocabulary test results, see the online supplementary materials.)
DOES LEXICAL FAMILIARITY IMPACT L2 BEHAVIORAL RESPONSES? (“THE BEST-CASE SCENARIO”)
Next, we used the offline vocabulary results to evaluate the extent to which lexical decision errors reflect deficits of offline vocabulary and tone knowledge. To this end, we reanalyzed lexical decision results for the subset of trials characterized by accurate and confident L2 knowledge for both tones and meanings (i.e., correct tones and correct definitions, each given the highest confidence rating of 3 on the vocabulary test). This comprised 301 tone nonword and 303 vowel nonword trials (604 total, 55% of total nonword trial data). By testing this data, we get a “best-case scenario” for L2 participants: When lexical knowledge is highly accurate and confident, do L2 learners reject vowel and tone nonwords with equal accuracy?
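The selection of best-case trials can be sketched as a simple filter. This is a hypothetical illustration: the dictionary keys are invented for the example, and it assumes that accuracy is coded 1 or 0 and confidence 0 to 3, as described for the vocabulary test.

```python
def best_case_trials(trials):
    """Keep trials with correct tones and definitions, both rated 3 (highest)."""
    return [
        t for t in trials
        if t["tone_correct"] == 1
        and t["definition_correct"] == 1
        and t["tone_confidence"] == 3
        and t["definition_confidence"] == 3
    ]


# Two made-up trials: only the first meets all four criteria.
example_trials = [
    {"tone_correct": 1, "definition_correct": 1,
     "tone_confidence": 3, "definition_confidence": 3},
    {"tone_correct": 1, "definition_correct": 1,
     "tone_confidence": 2, "definition_confidence": 3},
]
selected = best_case_trials(example_trials)
```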
Table 11 presents descriptive accuracy results for the two nonword conditions in the best-case scenario data for the lexical decision task. The accuracy results were submitted to a generalized linear mixed-effects model following the same procedures as outlined for previous analyses. The model included the fixed effect of nonword condition. The maximal model was fit, and included random intercepts for subjects and items, and random slopes for the by-subject and by-item effects of condition (glmer model formula: accuracy ~ condition + (condition || subject) + (condition || item)).
Results are displayed in Table 12. There was a statistically significant difference in accuracy for vowel and tone nonwords. So then, even in the best-case scenario—with near perfect word and tone knowledge—L2 participants still display a more limited ability to reject tone nonwords than vowel nonwords.
Signif. codes: *** <0.001; ** <0.01; * <0.05; . <0.1.
DOES LEXICAL FAMILIARITY IMPACT ERP RESPONSES?
Due to limited power, statistical modeling of the best-case scenario for ERP data was not possible. However, as the ERP analysis was conducted on only those trials that resulted in correct decisions, it is possible to consider the quality of offline knowledge associated with those decisions to examine whether insufficient explicit knowledge of tones contributed to ERP differences. That is, even though L2 participants ultimately made the correct decision on these trials, they still may have been guessing or using other strategies (e.g., they might know that a specific tone is not correct, even though they do not know explicitly what the correct tone is).
For these trials, L2 knowledge of definitions for the real word counterparts of nonwords was very accurate (vowel nonwords: mean = 97%; tone nonwords: mean = 96%). L2 knowledge of tones, however, was not nearly so high (tone nonwords: mean = 80%), and varied widely across participants, with the lowest participant mean being 31%, and the highest 100%. The extreme low score was somewhat atypical of the group overall. Only two participants scored below 50%. Nevertheless, these results suggest that, insofar as we can equate online and offline word knowledge, even for correctly rejected tone nonword trials, L2 participants did not have accurate explicit knowledge of the appropriate tones for target words 20% of the time. This might have further reduced the amplitude of tone nonword responses.
GENERAL DISCUSSION
We conducted a lexical decision study with ERP recordings in advanced L2 Mandarin learners whose L1 was English, to determine whether and why processing lexical tone is selectively difficult for learners.
Our first research question asked whether L2 listeners were equally accurate in rejection of isolated disyllabic vowel and tone nonwords. Here we found a clear answer. Across all analyses, the L2 group showed consistently weaker performance for tone nonwords than vowel nonwords. With only one exceptional individual, L2 participants showed weaker sensitivity (d′) for tones than vowels in lexical decision, and gaps between the lowest L1 score and L2 scores were more common and larger for tones than vowels. These data replicate the same tone-vowel discrepancy in lexical decision accuracy we found for words extracted from continuous speech in Pelzl et al. (Reference Pelzl, Lau, Guo and DeKeyser2019), and show that the selective deficit in tone word processing extends to more slowly and clearly produced stimuli.
Digging a bit deeper, our second question asked whether a learner’s familiarity with the critical words might moderate their performance on tone nonwords. Looking only at trials for which learners had accurate and confident knowledge of the critical words and their appropriate tones, we saw only a slight improvement in accuracy for rejection of tone nonwords. Overall, the disadvantage for tone versus vowel nonwords persisted.
While L2 learners made more errors on lexical decisions to tone nonwords than vowel nonwords, the deficit should not be exaggerated. The L2 group still made accurate decisions in the majority of tone nonword trials. So then, we can also ask about these trials, when their responses were correct. Our third research question was whether ERPs (for correct trials only) might reveal equal L2 sensitivity to vowel and tone mismatches, as these were the trials when listeners achieved successful responses (i.e., correct rejections of nonwords). Our results were consistent with weaker L2 sensitivity for tones compared to vowels in this case as well. While the L2 group displayed statistically significant N400s to vowel nonwords, the tone response was intermediate between real word and vowel nonword responses. Furthermore, measures of offline L2 knowledge for correctly rejected tone nonwords suggest that, for approximately 20% of those trials, the correct response was not necessarily indicative of correct tone knowledge.
We now consider these results in light of the two broad accounts of L2 phonological and lexical difficulty highlighted in our introduction.
MISSING AND INCORRECT L2 TONE WORD REPRESENTATIONS
Perhaps the most straightforward explanation for L2 difficulty in lexical decision tasks requiring tone knowledge is simply that this lexical tone knowledge was never accurately encoded in the learner’s long-term memory for many words. A number of scholars (Cook & Gor, Reference Cook and Gor2015; Gor, Reference Gor2018; Gor & Cook, Reference Gor and Cook2018; Diependaele et al., Reference Diependaele, Lemhöfer and Brysbaert2013; Veivo & Järvikivi, Reference Veivo and Järvikivi2013) have argued that L2 knowledge for less familiar words (usually lower-frequency items) is characterized by low-quality, or “fuzzy,” phonological representations. Learners cannot display sensitivity to lexical cues they do not remember or remember incorrectly. In the current study, explicit vocabulary test results point to ongoing weaknesses in advanced L2 explicit knowledge of tones: For the group, about 26% of supplied tones in the offline test were incorrect, and even when learners indicated the highest level of confidence in their tone knowledge, they were still in error more than 10% of the time. If this scales up to a vocabulary of thousands of words, many advanced L2 Mandarin speakers may (confidently) misremember tones for hundreds or thousands of words.
Despite the potentially large scale of (explicit) L2 tone knowledge deficits, missing or misremembered tone knowledge cannot provide a full account of L2 tone word performance in our lexical decision task, as inaccurate decisions were observed even for words for which learners showed confident and accurate tone knowledge in the offline task.
UNCERTAINTY IN L2 TONE WORD REPRESENTATIONS
Apart from the accuracy of encoded L2 tone knowledge itself, another qualitative aspect of that knowledge that could affect lexical decision performance is learners’ (un)certainty about their own knowledge of the relevant real words (cf. Cook & Gor, Reference Cook and Gor2015; Gor, Reference Gor2018; Gor & Cook, Reference Gor and Cook2018; Veivo & Järvikivi, Reference Veivo and Järvikivi2013). In the present case, even if L2 listeners accurately perceive a tone nonword (e.g., they know they heard fa2*yin1), perhaps they still accept it because they are not confident of the tones of the real word counterpart (e.g., fa1yin1). This uncertainty would make them more permissive in the decision process.
Again, this is not a fully sufficient explanation for present results. While uncertainty may play some role in L2 performance, the best-case scenario analysis suggests that L2 tone nonword inaccuracy is not due solely to such uncertainty. Even when participants had fully accurate and confident explicit tone knowledge for real words, they responded incorrectly in the lexical decision task about one-third of the time.
FORMAT OF L2 TONE WORD REPRESENTATIONS
A third possibility is that aspects of L2 tone word knowledge could be represented in a qualitatively different way from L1, such that L2 listeners retrieve tone knowledge more slowly and less accurately under time pressure. For example, L2 learners might encode tone word information as declarative (explicit) knowledge, rather than automatic (implicit) knowledge (DeKeyser, Reference DeKeyser, Doughty and Long2003, Reference DeKeyser, VanPatten and Williams2007). This knowledge might even be encoded in representational modalities other than phonological form, such as visual encoding of tone diacritics or full orthographic (Pinyin romanization) representations of words (cf. Bassetti et al., Reference Bassetti, Escudero and Hayes-Harb2015, and related articles).
Our analysis of ERPs from correct trials only was intended as an initial exploration of this possibility: If lexical tone is encoded in a qualitatively different way in L2 learners, then we might expect the retrieval of this representation to manifest differently in the neural response, even on trials where retrieval was successful. For example, if L2 learners retrieve explicit knowledge of tone at a slower timescale, then we might expect that they could fail to show nativelike responses to tone nonwords early in processing, and still successfully reject tone nonwords using the slower pathway by the end of the trial. Although not conclusive, our ERP results are consistent with this possibility: unlike native speakers, L2 learners showed a smaller N400 response to tone nonwords than vowel nonwords on trials in which their behavioral response was correct rejection. For example, on the implicit/explicit account this pattern could arise if L2 learners initially retrieve a toneless form of a real word representation on both real word and tone nonword trials, and only later in the trial retrieve the explicitly encoded tone information that distinguishes the words from the nonwords.
If correct, this interpretation of the ERP results would have strong implications for L2 learners’ tone processing capacities in real-world situations, as it would suggest their access of lexical tone information could often be too slow to impact processing of continuous speech. In other words, if they often succeed in comprehension of Mandarin speech, it will be despite ineffective tone processing. However, we believe further ERP work is needed before drawing these strong conclusions. Though we are inclined to believe present results accurately reflect L2 tone ERP responses, we must acknowledge that the weaker tone N400 effects in our present analysis could be due simply to lack of sufficient data. After removing incorrect responses, there were nearly 25% fewer trials available for L2 tone than vowel nonwords. Additionally, offline vocabulary results suggest that, for some participants, the available data contained a substantial number of guesses. Extending this reasoning, it is possible that given more data in the tone word condition and less noise in the offline knowledge estimates (i.e., limiting analysis to trials where the participant truly had perfect tone word knowledge), we would discover that L2 tone nonword N400 effects on correct responses were equivalent to those of vowel nonwords. Future replications of present results and refined methods for examining the nature of L2 tone representations will be necessary to fully remove doubts along these lines.
PROCESSING BIASES COULD DRIVE L2 TONE WORD ERRORS
A second class of explanations for L2 tone word errors in the lexical decision task is that they reflect differences in the processing biases used by L1 and L2 listeners, rather than or in addition to differences in their stored lexical representations. This class of explanation can straightforwardly account for the discrepancy between offline and online accuracy in retrieving tone word knowledge: Both tasks would in principle draw on the same, intact tone word knowledge, and it is the L2 processing routine for the lexical decision task that is driving the errors. Given their lifetime of experience attending primarily to segmental cues in word recognition, nontonal L1 speakers default to the same processing routine for tone words, with F0 cues playing little role in accessing lexical candidates.
The Automatic Selective Perception model (ASP) (Strange, Reference Strange2011; see also recent discussion of perceptual attention by Chang, Reference Chang2018) posits that task demands will play a key role in determining when L2 learners are able to successfully attend to novel L2 sounds. When task demands are low (as in many identification or discrimination tasks), L2 learners are able to direct attention to acoustic-phonetic cues that are not used or are given much lower weight in their L1. As task demands increase (recognizing words, interpreting semantic content), learners fall back on L1 perceptual routines, often leading to lower levels of performance. Zou et al. (Reference Zou, Chen and Caspers2017) suggest the ASP correctly predicted the outcomes of their study. They found that, as L2 Mandarin proficiency increased, native Dutch speakers showed greater reliance on tones in a challenging AXB task, indicating convergence on appropriate weighting of Mandarin F0 cues. Paired with results from Pelzl et al. (Reference Pelzl, Lau, Guo and DeKeyser2019), our present results also fit well with the ASP model. When task demands decreased (relative to Pelzl et al. Reference Pelzl, Lau, Guo and DeKeyser2019), L2 learners showed an increasing ability to rely on F0 cues to successfully reject many nonwords. However, of the two tasks, that of Pelzl et al. (Reference Pelzl, Lau, Guo and DeKeyser2019) is likely more similar to the speech learners most often encounter, and so more likely to reflect typical L2 use of tone cues.
Wiener and colleagues (Liu & Wiener, Reference Liu and Wiener2020; Wiener, Reference Wiener2020; Wiener et al., Reference Wiener, Ito and Speer2018, Reference Wiener, Lee and Tao2019; Wiener & Lee, Reference Wiener and Lee2020) present a slightly different view of L2 tone learning, though in many ways it seems complementary to the ASP model. They frame L2 tone-learning under the umbrella of dimension-based statistical learning (Idemaru & Holt, Reference Idemaru and Holt2011, Reference Idemaru and Holt2014), drawing a distinction between signal-based and knowledge-based (or probability-based) processing. In the first case, listeners rely on the low-level acoustic-phonetic input of the speech signal to recognize words; in the latter, they rely on knowledge of the statistical properties of specific syllables or words in their L2 experience. For example, listeners may know from experience that some syllable + tone combinations are either highly probable or very unlikely to occur. This knowledge may guide their initial processing of relevant syllables, especially under more difficult listening conditions (e.g., multiple talkers, noise). When listening conditions are easier (e.g., a familiar voice in clear speech), listeners can rely more heavily on the acoustic-phonetic signal. Wiener and colleagues have so far not addressed how such processes might play out beyond a single syllable. Given how rare tone neighbors are for disyllabic words, L2 listeners may typically recognize disyllabic words even if they ignore tone (F0) cues. This may lead them to rely on a knowledge-based processing strategy that attends to segmentally defined disyllabic sequences, while disregarding tones.
Though not focused on processing per se, recent work by Chan and Leung (Reference Chan and Leung2020) is also amenable to processing accounts in that it deals with statistical learning mechanisms. They examined Cantonese and English L1 participants’ abilities to pick up on co-occurrence patterns of syllable-initial segmental cues and specific tones (i.e., syllables beginning with aspirated stop consonants always had rising tones; syllables beginning with an approximant always had falling tones). Whereas Cantonese participants showed some ability to generalize the (implicit) pattern after training, English participants failed to show evidence of learning. Chan and Leung framed their study as an examination of the phonological level of L2 tone learning (cf. the phonetic-phonological-lexical continuum described by Wong & Perrachione, Reference Wong and Perrachione2007), and suggest that L2 tone learning might be particularly difficult when it comes to the formation of implicit and abstract phonological tone representations. Similar to Wiener and colleagues, Chan and Leung focus on the level of single syllables, and it is unclear what they would expect for multisyllable strings, except that L2 performance is unlikely to improve when learners are confronted with such stimuli.
Finding ways to investigate L2 tone learning in the context of syllables and words of different lengths, while also addressing fundamental differences in the statistical and acoustic properties of single and multisyllable strings, will be required to address these issues more fully.
PRACTICAL IMPLICATIONS
Despite the tone word difficulty reported in the preceding text, it is not necessarily the case that this set of L2 learners has many practical difficulties as a result of tone word misperception. All our L2 participants could be characterized as successful language learners. After years of classroom study, they were using Mandarin to communicate on a regular basis in their daily lives, often at a professional level. So then, does the tendency to incorrectly accept nonwords have any bearing on real-world L2 Mandarin learners?
We tentatively suggest that it does. Though nonwords by definition do not occur in native Mandarin speech, words with incorrect or missing tones clearly do exist in the vocabulary of many L2 learners. The inability to differentiate these mistaken words from their real word tone neighbors will prevent learners from recognizing their own incorrect tone knowledge. Similarly, whenever a learner encounters a spoken word they have not previously learned, the inability to recognize that word’s tones in real time will prevent them from acquiring a fully accurate representation of the word. Even if these difficulties do not cause consistent lexical confusion for the L2 learner as listener, they may still cause difficulties for those who listen to the learner. Gaps in tone knowledge will lead to production of tone errors that are potentially confusing or misleading for listeners who do process tones—as we see from the strong N400 responses to tone nonwords by L1 listeners in the current study (see also Pelzl et al., Reference Pelzl, Carlson, Guo, Jackson and Hell2020; Pelzl et al., Reference Pelzl, Lau, Guo, Jackson and Gorin press, for investigations of the impacts of L2 tone errors on native listeners).
CONCLUSION
The present study extends our understanding of L2 tone word difficulties by demonstrating that, even in fairly ideal circumstances, L2 learners have considerable difficulty recognizing words on the basis of tones. Our results suggest that both representational and processing issues are at play in these difficulties. L2 learners seem able to function at a high level despite these tone difficulties, but the difficulties nevertheless pose a considerable learning challenge that may have real impacts on the efficiency with which L2 learners can expand their Mandarin lexicon.
Supplementary Materials
To view supplementary material for this article, please visit http://dx.doi.org/10.1017/S027226312000039X.