Understanding the process and product of second language acquisition (SLA) is complex, as it can be explained not only by factors related to experience (i.e., the extent to which second language [L2] learners practice the target language), but also by those related to aptitude (i.e., the cognitive and perceptual factors which determine the extent to which L2 learners can make the most of relevant L2 experience). Whereas the previous literature has examined aptitude in reference to L2 lexicogrammar development (for reviews, see Li, 2016; Skehan, 2016), surprisingly little is known about the role of aptitude in L2 pronunciation learning. The present study aims to fill this gap by proposing a new framework of cognitive abilities relevant to the degree of success after years of explicit and implicit pronunciation learning under various L2 learning conditions. To achieve this main objective, we assessed the segmental and suprasegmental sensitivity of 48 Chinese learners of English in the UK by using a range of behavioural (language and music aptitude tests) and neurophysiological (electroencephalography) measures. Subsequently, we explored which pronunciation learning aptitude variables were linked to the segmental and suprasegmental aspects of the learners’ L2 pronunciation performance, controlling for their L2 learning backgrounds (i.e., their past and recent L2 use).
Background
Second language pronunciation development
Second language pronunciation proficiency is a composite skill which comprises the capacity to (a) pronounce new consonantal and vocalic sounds in an L2 without deleting them or substituting L1 counterparts for them (segmental accuracy); (b) use adequate prosody at the word (correct assignment of word stress) and sentence (appropriate use of intonation for declarative and interrogative intentions) levels; and (c) deliver speech at an optimal tempo (speed fluency) without making too many pauses (breakdown fluency), self-repetitions or corrections (repair fluency). According to general L2 speech theories (e.g., Kormos, 2014), comprehension processes primarily draw on the decoding of phonological information. When speech includes mispronunciations or unclear pronunciation, listeners may activate inappropriate lexical items, which in turn may hinder their prompt and successful understanding of speakers (Broersma, 2012). Relative to other domains of language (vocabulary, grammar), therefore, the accurate and fluent use of pronunciation is considered to be a particularly fundamental component of L2 oral proficiency (Derwing & Munro, 2009).
A common feature of theoretical models of SLA is that L2 learners continue to improve their pronunciation proficiency with increased input and output of the target language (e.g., Flege, 2016, for the Speech Learning Model). More specifically, usage-based accounts of language development explain in depth how experience uniquely facilitates SLA according to how often (frequency), where (contexts) and when (recency) L2 learners practice the target language (e.g., Ellis, 2006). Similar to first language acquisition, early L2 learners (e.g., age of acquisition < 6 years) are likely to achieve high-level pronunciation proficiency, given ample opportunities for language exposure (Abrahamsson & Hyltenstam, 2009). When it comes to adult L2 pronunciation learning, strong experience effects (i.e., more practice is better) are observed at the initial stage of L2 pronunciation learning (e.g., Munro & Derwing, 2008, for the first three months of immersion). Yet, a great deal of individual variability is present in the final outcome of late L2 pronunciation learning. Even if any two given L2 learners have similar kinds of L2 experience, the extent to which they notice, understand and learn to produce L2 features can vary greatly. One possible source of these individual differences in L2 learning outcome is proficiency in a variety of cognitive and perceptual skills, which together make up second language learning aptitude.
Second language learning aptitude
One of the most extensively researched topics in the field of SLA has been the explanatory power of individuals’ aptitude for the rate and ultimate attainment of L2 learning (for reviews, see Li, 2016; Skehan, 2016). As originally conceived, aptitude referred to the explicit and intentional learning abilities necessary for successful foreign language learning through formal instruction. In Carroll and Sapon's (1959) influential aptitude model, such abilities comprise phonemic coding, grammatical sensitivity, inductive learning ability and associative memory. According to previous validation studies (e.g., Carroll, 1962), L2 learners’ different levels of aptitude, measured by the Modern Language Aptitude Test battery, demonstrated significant associations with their achievements in various classroom settings, such as course grades and SAT scores.
More recently, a growing number of scholars (e.g., Linck, Hughes, Campbell, Silbert, Tare, Jackson & Doughty, 2013; Skehan, 2016) have proposed new theoretical frameworks for conceiving aptitude in terms of implicit and incidental learning (i.e., learning without awareness) – a type of learning which may be crucial for high-level L2 acquisition in naturalistic settings. Unlike explicit learning aptitude, which is measured through tasks comprising both practice and testing phases, implicit and incidental learning aptitude is measured while participants complete tasks without any practice or awareness of what is being learned. Developing a composite test battery of 11 domain-general cognitive measures (Hi-LAB), for example, Linck et al. (2013) examined the aptitude profiles of advanced L2 learners who obtained high reading and listening scores on Defense Language Proficiency Tests. These learners demonstrated not only greater associative (paired associations) and phonological short-term memory (letter span), but also higher implicit language aptitude (serial reaction time).
In order to analyze the influence of aptitude in various L2 learning contexts, the LLAMA aptitude test battery has been widely adopted in the field of SLA. Building on Carroll's aptitude model, LLAMA features not only explicit learning aptitude (associative memory, phonemic coding and grammatical inferencing), but also incidental learning aptitude (sound sequence recognition). According to previous investigations, explicit LLAMA test scores appeared to predict the extent to which L2 learners can benefit from explicit (rather than implicit) instruction within a short amount of time under laboratory (e.g., Yilmaz & Granena, 2016) and classroom (e.g., Yalçın & Spada, 2016) conditions. In contrast, L2 learners with high-level incidental aptitude (sound sequence recognition) tend to attain advanced proficiency in L2 morphosyntax, especially when they have had regular access to naturalistic language input since an early age (e.g., Granena, 2013).
Whereas an extensive body of literature has scrutinized the complex relationship between various kinds of aptitude (explicit and implicit), L2 proficiency (beginner, intermediate, advanced) and context (naturalistic vs. classroom settings), it is noteworthy that the relevant research evidence has nearly exclusively considered the effects of aptitude on the acquisition of listening/reading skills, measured via general proficiency tests (e.g., Linck et al., 2013, for the Defense Language Proficiency Tests), and the learning of L2 morphosyntax (Granena, 2013; Yalçın & Spada, 2016). Very few studies have examined the impact of these factors on adult L2 learners’ acquisition of phonological skills in spontaneous speech via a comprehensive set of aptitude and speech measures (cf. Saito, Suzukida & Sun, 2018). Furthermore, very few studies have used a combination of both behavioural and neurophysiological metrics.
Developing a new aptitude framework for L2 pronunciation learning
In this study, L2 pronunciation learning aptitude is defined as comprising the cognitive abilities related to the explicit and implicit processing of acoustic information, which is crucial for perceiving various phonetic dimensions of L2 speech. We propose that learners who have greater aptitude in tracking and retaining acoustic information are better able to attend to the primary acoustic correlates of segmentals (high-frequency spectral information), prosody (fundamental frequency height and contour), and fluency (relative ratio of speaking/silent time). We consider this kind of aptitude to be receptive rather than productive in nature, as we follow the predominant theoretical assumption that L2 speech learning is perception-driven (i.e., changes in perception lead to production development) (Flege, 2016). To develop the aptitude framework, we excluded cognitive tasks using non-speech materials, such as letter- and non-word span for phonological short-term memory (Linck et al., 2013), retrieval-induced inhibition for inhibitory control (Darcy, Mora & Daidone, 2016), and speeded naming for processing speed (Darcy, Park & Yang, 2015). This decision is motivated by recent research evidence that both child and adult L2 speech learning is closely tied to human sensitivity to complex acoustic signals such as language (Diaz, Mitterer, Broersma, Escera & Sebastian-Galles, 2016) and music (Milovanov, Pietilä, Tervaniemi & Esquef, 2010).
Based on a synthesis of extant studies on the cognitive predictors of L1 and L2 speech learning, we identified a total of four measures that differentially reflect aptitude in the explicit and implicit processing of L2 phonological information at the segmental and suprasegmental levels (as summarized in Table 1). Our framework is novel as all the tasks correspond to cognitive/perceptual abilities which are thought to be directly relevant to L2 learners’ potentially different processing of segmental, prosodic and temporal information in both explicit (phonemic coding, tonal/rhythmic imagery) and implicit (auditory encoding precision) modes.
Table 1. Constructs and Measures of Pronunciation-Specific Aptitude.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191031034907774-0341:S1366728918000895:S1366728918000895_tab1.gif?pub-status=live)
Note. FFR for frequency following response.
One explicit speech-specific component of aptitude is phonemic coding, defined as L2 learners’ ability to analyze, categorize and remember new segmental sounds in relation to corresponding symbols. Individual differences in children's explicit knowledge of the phonology of their L1 have been linked to differences in speech perception (Rvachew & Grawburg, 2006) and auditory processing (Anvari, Trainor, Woodside & Levy, 2002). In Li's (2016) meta-analysis, this form of aptitude seems to be only weakly associated with the development of global listening and speaking skills. More recently, however, four studies have explored and confirmed the moderate-to-strong predictive power of phonemic coding for adult L2 learners’ pronunciation attainment, especially at a segmental level after years of classroom (Saito, 2017, in press; Saito et al., 2018) and naturalistic (Granena & Long, 2013) L2 learning. Given that the cognitive underpinnings of experienced L2 learners’ segmental learning and attainment remain open to debate (e.g., Darcy et al., 2015, 2016), the acquisition-aptitude link needs to be further examined.
Recent work has revealed shared processes in language and music comprehension at several levels, including between harmony and syntax (Patel, Gibson, Ratner, Besson & Holcomb, 1998), rhythm and stress (Cason, Astésano & Schön, 2015), melody and intonation (Liu, Patel, Fourcin & Stewart, 2010) and semantics (Daltrozzo & Schön, 2009). These connections between music and language suggest that aptitude for acquiring suprasegmental aspects of new languages (prosody, fluency) and aptitude for learning to perceive and produce music may be overlapping constructs as well. In music aptitude tests (e.g., Gordon, 1995), learners are tested for their abilities to hear differences in pitch/intensity (i.e., tonal imagery) and speed/timing (i.e., rhythmic imagery) when listening to two musical notes. This test is considered one form of explicit aptitude test, since participants are explicitly guided to pay conscious attention to analyzing the tone/rhythm of the notes during the test-taking session.
Several empirical studies have pointed out that those with higher music aptitude (e.g., musicians) can better recognize and produce sounds not only in a familiar L2 (English) (e.g., Milovanov et al., 2010; Slevc & Miyake, 2006), but also in an unfamiliar tonal language that they have never learned (Mandarin) (e.g., Gottfried, 2007; Wong, Skoe, Russo, Dees & Kraus, 2007). In an intervention study with a pre- and post-test design, Li and DeKeyser (2017) recently provided longitudinal evidence that music aptitude could mediate the effects of explicit instruction on American learners’ acquisition of L2 Mandarin lexical tones. Specifically, the authors hypothesized that more musically endowed learners may be more sensitive to and capable of capturing acoustic information in speech related to F0 height and contour. Therefore, it is possible that L2 learners’ music aptitude for perceiving tonal and rhythmic imagery could contribute to L2 pronunciation proficiency, especially at prosodic and temporal levels – an assumption that the current study was designed to test.
Departing from previous aptitude studies predominantly concerned with explicit aptitude, the current study measures implicit pronunciation-specific aptitude in terms of L2 learners’ neural encoding of speech, which we measure using an electrophysiological response known as the frequency following response (FFR), a response with origins within the cortical and subcortical auditory system (Coffey, Herholz, Chepesiuk, Baillet & Zatorre, 2016). The FFR reproduces the temporal and spectral content of the evoking stimulus, and so can be used to assess the stability and precision of the auditory system's encoding of spectral, pitch, and durational information, acoustic features that convey segmental and prosodic information in speech. Attention is not necessary for the elicitation of the FFR; during recording, therefore, participants can engage in absorbing tasks that draw attention away from the sounds (e.g., reading books, watching silent movies). As such, the method is an ideal way to assess auditory processing without the contaminating influence of cognitive and affective state.
In the neurophysiology literature, the degree of auditory precision, estimated through FFR, continues to develop up until around 7–10 years of age (Skoe, Krizman, Anderson & Kraus, 2013) before reaching a relatively stable state (Hornickel, Knowles & Kraus, 2012). Individual differences in the FFR have been found to exhibit strong correlations with language skills such as reading (White-Schwoch, Carr, Thompson, Anderson, Nicol, Bradlow & Kraus, 2015) and speech-in-noise perception (Anderson, Skoe, Chandrasekaran & Kraus, 2010), suggesting that the auditory skills indexed by the FFR are vital for language processing and acquisition. Nevertheless, there has been only a single previous study of the relationship between neural encoding of speech as measured by the FFR and success in learning a second language in adulthood. Omote, Jasmin & Tierney (2017) examined perception of English phonology and FFR phase-locking in native Japanese adults who moved to the United Kingdom in adulthood. Robust neural encoding of the F0 of speech was linked to successful English speech perception. Interestingly, it was also shown that the participants’ FFR predicted their performance even more strongly than their experience backgrounds did (their length of residence in the UK). These results are in line with previous findings that bilingual experience enhances the neural representation of speech F0 (Krizman, Slater, Skoe, Marian & Kraus, 2015). Here we built upon these previous findings by asking for the first time whether neural encoding of speech in the frequency-following response is linked to proficiency in second-language production.
Neural encoding of pitch was assessed by measuring the robustness of neural phase-locking to the fundamental frequency (100 Hz) of a synthesized speech syllable (/da/). We hypothesized that neural encoding of pitch would be linked to participants’ ability to produce suprasegmental features of second language speech. Neural encoding of higher-frequency spectral information was assessed by measuring neural phase-locking at the frequencies of the first (720 Hz, measured at 700 Hz) and second (1240 Hz, measured by averaging responses at 1200 and 1300 Hz) formants. We hypothesized that neural encoding of speech formants would be linked to participants’ ability to produce segmental features of second language speech.
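The notion of phase-locking at a given frequency can be illustrated with a minimal sketch. It assumes that phase-locking is quantified as inter-trial phase coherence, which is one common choice; the exact computation used in the study is not specified in this section, and the function name `phase_locking` and its parameters are hypothetical. For each trial, the phase of the response at the target frequency is extracted from a single Fourier coefficient; the resulting unit vectors are then averaged, so that a value near 1 indicates consistent phase across trials and a value near 0 indicates random phase.

```python
import cmath
import math
from typing import Sequence


def phase_locking(trials: Sequence[Sequence[float]],
                  sample_rate: float,
                  freq_hz: float) -> float:
    """Inter-trial phase coherence at `freq_hz` (assumed metric, not the study's code)."""
    unit_vectors = []
    for trial in trials:
        # Single-frequency Fourier coefficient of this trial.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * freq_hz * i / sample_rate)
            for i, x in enumerate(trial)
        )
        magnitude = abs(coeff)
        if magnitude > 0.0:
            # Keep only the phase by normalising to unit length.
            unit_vectors.append(coeff / magnitude)
    if not unit_vectors:
        raise ValueError("no trial carries energy at the requested frequency")
    # Length of the mean unit vector: 1 = identical phase, 0 = no consistency.
    return abs(sum(unit_vectors) / len(unit_vectors))
```

Under this assumption, the F0 measure would correspond to `phase_locking(trials, 16384, 100)`, and the second-formant measure to the mean of the values obtained at 1200 and 1300 Hz, mirroring the averaging described above.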
Current study
Adopting the proposed L2 pronunciation learning aptitude framework (phonemic coding, music aptitude, auditory encoding precision), the main objective of the current study was to scrutinize the cognitive correlates of successful L2 pronunciation proficiency attainment among 48 Chinese learners of English in the UK. First, we carefully checked how the participants differed in terms of their sensitivities to segmental and suprasegmental pronunciation learning (i.e., aptitude factors) and their past and recent use of the L2 in classroom and naturalistic settings (i.e., experience factors). Second, to examine the relative effects of experience and aptitude factors on L2 pronunciation attainment, we examined how the aptitude and experience factors differentially contributed to the segmental and suprasegmental aspects of their L2 pronunciation proficiency at the time of the project (after 11–17 years of L2 learning in both classroom and naturalistic settings). The hypothesized relationships between independent variables (Aptitude, Experience) and dependent variables (Pronunciation) are summarized in Figure 1.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191031034907774-0341:S1366728918000895:S1366728918000895_fig1g.jpeg?pub-status=live)
Figure 1. Summary of the Hypothesized Relationships between Independent Variables (Aptitude, Experience) and Dependent Variables (Pronunciation).
Participants
As part of a larger project designed to survey English oral proficiency among international students in the UK, a total of 48 native speakers of Mandarin Chinese were recruited (6 males, 42 females) for the study. All the participants were students enrolled in various postgraduate programmes (except one, who was at undergraduate level) in London (M age = 23.8 years, Range = 21–27) with similar length of residence in the UK (i.e., eight to nine months). During the academic programme, they took different numbers of content-based courses in various subjects in sciences (e.g., engineering, mathematics, chemistry) and social sciences (e.g., economics, linguistics, law), while none of them attended any English-as-a-second-language classes. Some participants had many opportunities to speak the L2 in class through group discussions and presentations, whereas others had fewer chances due to a different course focus. Their L2 use (in terms of speaking, listening, reading and writing) outside class also varied to a great degree (as reported in the Results section). Prior to coming to the UK, they had studied L2 English only in China for 10–16 years without any study-abroad experience in an English-speaking environment, albeit with different ages of learning onset (M age of learning onset = 8.4 years, Range = 6–13 years). Their self-reported IELTS scores varied from 6 to 8 out of 9 (M = 7.1, SD = 0.4). According to CEFR bands, this suggests that their general proficiency ranged from B2 (Independent users) to C1 (Proficient users).
All participants had audiometric thresholds ≤ 25 dB HL for octaves from 500 Hz to 4000 Hz, confirming their normal hearing. The data collection was conducted in a soundproof booth. Each session lasted for approximately 90 minutes per participant, with the three main tasks being administered in the following order: aptitude test, pronunciation test and experience interview. To avoid any misunderstandings of the procedure, all instructions were delivered in Chinese by an L1 Mandarin-speaking researcher.
Measures of pronunciation proficiency
Speaking task
In the field of L2 speech research, controlled speech tasks (e.g., delayed sentence repetition) have been typically used to elicit participants’ production of certain segmental and suprasegmental features. Yet, some scholars have continuously emphasized the importance of adopting more spontaneous speech tasks, especially for adult L2 learners, who can carefully monitor their correct pronunciation forms when they are allowed to draw on their explicit phonetic knowledge without any attention to the meaningful use of language (Piske, Flege, MacKay & Meador, 2011). Indeed, it has been shown that adult L2 learners’ speech behaviours are different when elicited via controlled and free speech tasks, with their former performance being more targetlike and accurate than the latter performance (Major, 2008). In extemporaneous speech tasks, L2 learners are guided to produce language with a primary focus on conveying their intended message under time pressure (for a review, see Skehan, 2016). Similar to previous L2 speech studies (e.g., Lambert, Kormos & Minn, 2017), a timed picture narration task was adapted from the Pre-Grade 1 Level of the EIKEN English Test (EIKEN, 2016).
Procedure
Since types of topics could affect the participants’ L2 performance (Gass & Varonis, 1984), two different versions of the narration task were prepared (Versions A and B). A total of 25 participants were randomly assigned to Version A, and the remaining 23 participants to Version B. For each version (A, B), the participants had one minute to prepare how to describe a four-frame cartoon, and two minutes to narrate the story. To avoid false starts, the participants were given the first sentence that they had to use (for materials, see Appendix A). All the speech samples were recorded with a Roland-05 audio recorder, set at 44.1 kHz sampling rate and 16-bit quantization, and a unidirectional condenser microphone. In line with L2 speech research standards (e.g., Derwing & Munro, 1997), and to reduce any fatigue effects on listeners in the subsequent rating sessions (see below), the first 30 seconds of each speech sample were excised and normalized for peak amplitude for subsequent L2 pronunciation analyses.
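The excision and peak normalization step can be sketched as follows. This is an illustrative example only: the function name `excise_and_normalize`, the representation of audio as a list of floating-point samples, and the target peak of 1.0 are assumptions, not details reported for the study.

```python
from typing import List


def excise_and_normalize(samples: List[float],
                         sample_rate: int = 44100,
                         duration_s: float = 30.0,
                         target_peak: float = 1.0) -> List[float]:
    """Keep the first `duration_s` seconds and rescale to a common peak amplitude."""
    n_keep = int(round(duration_s * sample_rate))
    clip = samples[:n_keep]
    # Largest absolute sample value in the retained clip.
    peak = max((abs(x) for x in clip), default=0.0)
    if peak == 0.0:
        # Silent or empty clip: nothing to rescale.
        return list(clip)
    gain = target_peak / peak
    return [x * gain for x in clip]
```

Scaling every clip to the same peak removes overall loudness differences between recordings, so that raters are less likely to be influenced by recording level.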
Data analyses: subjective judgements
In the analysis of segmental and prosodic qualities of spontaneous speech, objective measures (e.g., acoustic analyses) are not commonly used in the L2 speech literature, due to variability in phonetic context (e.g., following and preceding vowels) and talker characteristics (e.g., anatomical difference in vocal tract). Rather, many scholars have relied on linguistically trained raters’ subjective scalar judgements (e.g., Piske et al., 2011, for segmentals; Derwing & Munro, 1997, for prosody). In our precursor research (Saito, Trofimovich & Isaacs, 2017), a training procedure was elaborated for experienced native-speaking raters to assess four different categories of L2 pronunciation proficiency – segmentals, word stress, intonation and speech rate (see Footnote 1).
To this end, five expert raters with ample linguistic and pedagogical backgrounds (3 females, 2 males) were recruited in London (M age = 35.4 years). Although three of the five raters were originally from North America, they had resided in the UK for more than 10 years and reported high-level familiarity with Received Pronunciation. All of them held MA degrees in applied linguistics and reported extensive experience in teaching (M years of teaching = 7.8 years) and in speech analyses of this kind through participating in rating sessions as research assistants and/or enrolling in rater training for high-stakes L2 speaking tests. None of them reported any hearing problems. Their familiarity with Chinese-accented speech was relatively high (M familiarity = 5.3, range = 5–6) on a 6-point scale (1 = not at all, 6 = very much).
Each rating session took place individually in a quiet room at a university in London with a researcher who had provided similar training in our previous studies. The raters listened to speech samples played in a randomized order via custom software (developed in MATLAB), and then used a moving slider to rate them on a 1000-point scale for segmental errors (0 = frequent, 1000 = infrequent or absent), word stress errors (0 = frequent, 1000 = infrequent or absent), intonation accuracy (0 = unnatural, 1000 = natural), and perceived tempo (0 = too slow or too fast, 1000 = optimal speed). The two ends of the continuum were signalled with a frowny face (for “0”) and a smiley face (for “1000”) (for onscreen labels, see Appendix C). To ensure the precision of ratings, the raters were allowed to listen to each speech sample as many times as they wanted.
To familiarize the raters with the procedure, the researcher first gave brief instructions on the definition of each pronunciation category (for training materials, see Appendix B). Second, they practiced the procedure by rating three speech samples that were not included in the main dataset. For each sample, they explained their decisions and received feedback from the researcher to check their understanding of the constructs. Finally, they moved on to analyzing a total of 48 speech samples with a 5-minute intermission halfway through. The entire session took approximately 90 minutes.
In terms of inter-rater reliability, the results of the Cronbach alpha analyses identified medium agreement for the five raters’ judgements of segmentals (α = .72) and perceived tempo (α = .75), both of which are in line with the standard in L2 research (i.e., α > .70) (Larson-Hall, 2010). Pronunciation scores were averaged across all raters to generate a single score per participant for segmentals and speech rate. The word stress and intonation ratings yielded relatively low Cronbach alpha values (α = .56, .67). As a remedy, the two raters who demonstrated the strongest agreement (α = .85, .87) were identified. Their averaged scores were used for the following word stress and intonation analyses.
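For readers unfamiliar with the statistic, Cronbach's alpha for a set of raters can be computed with the standard formula α = k/(k − 1) × (1 − Σ rater variances / variance of summed scores), where k is the number of raters. The following is a minimal sketch of that formula, not the analysis script used in the study; the function name `cronbach_alpha` and the data layout are assumptions.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(ratings: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha, where ratings[r][s] is rater r's score for sample s."""
    k = len(ratings)
    if k < 2:
        raise ValueError("at least two raters are required")
    n = len(ratings[0])
    if any(len(r) != n for r in ratings):
        raise ValueError("every rater must score the same samples")
    # Sum of each rater's variance across speech samples.
    rater_variance_sum = sum(variance(r) for r in ratings)
    # Variance of the per-sample totals (summed over raters).
    totals = [sum(r[s] for r in ratings) for s in range(n)]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("summed scores do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - rater_variance_sum / total_variance)
```

Computing the statistic separately for each rated category, and again for subsets of raters, would reproduce the kind of comparison described above.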
Data analyses: acoustic judgements
As operationalized in previous L2 suprasegmental studies (for a review, see Lambert et al., 2017), the temporal aspects of the participants’ spontaneous speech were acoustically examined according to three key constructs of fluency – breakdown, repair and speed. From a theoretical perspective (e.g., Kormos, 2014), these constructs are believed to correspond to L2 learners’ cognitive operations at three different stages of L2 speech production – breakdown for conceptualization and linguistic formulation (searching for what to say and how to say it), repair for monitoring (correcting already-produced utterances) and speed for automatization (optimizing the entire production process).
Breakdown fluency was calculated by dividing the number of filled (lexical fillers such as eh, um) and unfilled (silence) pauses by the total number of words. Whereas filled pauses were counted based on raw transcripts, unfilled silent pauses were automatically identified via a script programmed in Praat (Boersma & Weenink, 2017) with minimum silence duration set to 250 milliseconds. Repair fluency was calculated by dividing the total number of self-corrections and repetitions (based on raw transcripts) by the total number of words. Speed fluency was measured via the articulation rate, which was calculated by dividing the total number of syllables by phonation time (i.e., total length of each audio file minus all silent, unfilled pauses). For similar fluency analysis methodology, see Bosker, Pinget, Quené, Sanders, and De Jong (2013).
To investigate inter-coder reliability, two researchers (both of whom had extensive experience in L2 fluency analyses) separately analyzed the breakdown, repair and speed fluency of 10 samples from the entire dataset. The results of Cronbach alpha analyses found relatively high agreement between the coders for breakdown (α = .92), repair (α = .93) and speed (α = .98). Where disagreement was found, they discussed the case until they reached a consensus. One of the coders then analyzed the rest of the data (n = 38).
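The three fluency ratios defined above can be expressed compactly. The sketch below assumes that the counts and pause durations have already been extracted from the transcript and the Praat output; all names (`FluencyScores`, `fluency_measures`, and the parameters) are hypothetical. It also assumes that silences shorter than the 250 ms threshold are not counted as pauses and therefore remain part of phonation time.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FluencyScores:
    breakdown: float          # pauses per word
    repair: float             # self-corrections and repetitions per word
    articulation_rate: float  # syllables per second of phonation time


def fluency_measures(n_words: int,
                     n_syllables: int,
                     n_filled_pauses: int,
                     silence_durations_s: Sequence[float],
                     n_repairs: int,
                     total_duration_s: float,
                     min_silence_s: float = 0.25) -> FluencyScores:
    """Compute breakdown, repair and speed fluency for one speech sample."""
    # Only silences at or above the threshold count as unfilled pauses.
    silent_pauses = [d for d in silence_durations_s if d >= min_silence_s]
    breakdown = (n_filled_pauses + len(silent_pauses)) / n_words
    repair = n_repairs / n_words
    # Phonation time: total duration minus all counted silent pauses.
    phonation_time = total_duration_s - sum(silent_pauses)
    articulation_rate = n_syllables / phonation_time
    return FluencyScores(breakdown, repair, articulation_rate)
```

For example, a 30-second sample with 50 words, 70 syllables, three filled pauses, two silent pauses of 0.5 s each and two repairs yields a breakdown score of 0.10, a repair score of 0.04 and an articulation rate of 70/29 syllables per second.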
Behavioural measures of explicit aptitude
Phonemic coding
The participants’ phonemic coding ability – the ability to associate unfamiliar sounds with symbols – was assessed via one component of the LLAMA test (Meara, 2005). In this subtest (LLAMA-E), the participants were first asked to remember the relationship between 24 recorded syllables (consonant-vowel) and their corresponding phonetic symbols within two minutes. The sound stimuli were created based on an indigenous language in Canada. After the practice session, their recollection was tested, specifically whether they could correctly identify symbols corresponding to two-syllable words (a total of 20 items). The participants’ phonemic coding aptitude scores were calculated out of 100 based on the tailored scoring rubrics in LLAMA.
Music aptitude (melody, rhythm)
Two subsections of the Musical Aptitude Profiles for Japanese (MAP-J) (Ogawa, 2009) were used to assess the participants’ abilities to perceive tonal and temporal aspects of musical phrases. Building on Gordon's (1995) oft-used, validated music aptitude test (Music Aptitude Profile), the MAP-J was developed to evaluate, in particular, the aptitude of young and adolescent students in Japan and other East Asian countries who are exposed to both western (violin, piano) and oriental (Japanese/Chinese drums, harp) musical instruments. Both melody and rhythm subtests required participants to make same/different judgments of pairs of short musical phrases. Participants assessed whether they were identical or different in pitch contour (for the melody subtest) and in the number/patterns of beats (for the rhythm subtest).
At the beginning of each subtest, the participants completed a practice session (listening to an ‘identical’ and a ‘different’ pair), followed by a main session which comprised 20 sets of musical phrases. The melody and rhythm scores were calculated out of 20.
Neurophysiological measures of implicit aptitude
Stimulus
The speech token /da/ (170 ms) was synthesized via a Klatt-based synthesizer. The first 5 ms of the sound was the onset burst, and the rest of the sound was voiced with a steady 100 Hz fundamental frequency throughout. The first, second and third formants shifted during the transitional period from 5 to 50 ms (400 to 720 Hz, 1700 to 1240 Hz, and 2580 to 2500 Hz, respectively), whereas all formants stayed constant during the steady state from 50 to 170 ms (720 Hz, 1240 Hz, 2500 Hz).
Procedure
The /da/ sound was presented repeatedly (6000 times over the course of 20 minutes) in alternating polarities through insert earphones (ER-3; Etymotic Research) at 80 dB with an 81 ms interstimulus interval. Presenting stimuli in alternating polarities (i.e., with half of the stimuli inverted) affords the opportunity to separately examine the envelope and the temporal fine structure of speech by adding and subtracting responses of opposite polarities, respectively (Aiken & Picton, Reference Aiken and Picton2008). During the task, the participants were encouraged to read their favorite books in a relaxed environment rather than pay special attention to sound properties. The electrophysiological responses to the sound stimuli (/da/) were collected from the participants using a BioSemi EEG system with open filters and a sample rate of 16384 Hz. A single active electrode was located at the centre of the top of the head (i.e., at Cz), the reference electrodes were located at the earlobes, and ground electrodes were placed on the forehead.
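The logic of adding and subtracting opposite-polarity responses can be sketched in Python as follows. This is a toy illustration rather than the analysis code used in the study: the function name is hypothetical, the signals are synthetic cosines, and halving the sum and difference (i.e., averaging) is an assumption made here for convenience.

```python
import math


def envelope_and_fine_structure(resp_pos, resp_neg):
    """Combine averaged responses to the two stimulus polarities."""
    # Adding keeps components that do not invert with the stimulus (envelope);
    # subtracting keeps components that do invert (temporal fine structure).
    envelope = [(p + n) / 2 for p, n in zip(resp_pos, resp_neg)]
    fine_structure = [(p - n) / 2 for p, n in zip(resp_pos, resp_neg)]
    return envelope, fine_structure


fs = 16384  # sampling rate in Hz, as reported in the text
t = [i / fs for i in range(256)]
# Toy response: a 100 Hz component common to both polarities plus a
# 700 Hz component that flips sign with stimulus polarity.
common = [math.cos(2 * math.pi * 100 * x) for x in t]
inverting = [0.5 * math.cos(2 * math.pi * 700 * x) for x in t]
resp_pos = [c + v for c, v in zip(common, inverting)]
resp_neg = [c - v for c, v in zip(common, inverting)]

envelope, fine_structure = envelope_and_fine_structure(resp_pos, resp_neg)
```

In this toy case the sum recovers the 100 Hz component and the difference recovers the 700 Hz component.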
Data analyses
All neurophysiological analyses were conducted using custom-written software in MATLAB. First, the recording was epoched between -40 ms and 210 ms, relative to stimulus presentation. Trials containing amplitude spikes of >100 μV were rejected as artifacts, and the first 5000 artifact-free responses to each stimulus polarity were selected for the main analysis.
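A minimal Python sketch of these two preprocessing steps is given below. It is not the custom MATLAB software referred to above; the function names are hypothetical, and the sketch assumes that the rejection criterion is applied to the absolute amplitude of each epoch in microvolts.

```python
FS = 16384  # sampling rate in Hz, as reported in the text


def epoch(recording, onsets, fs=FS, t_min=-0.040, t_max=0.210):
    """Cut epochs from t_min to t_max (seconds) around each onset sample index."""
    start_off = round(t_min * fs)
    stop_off = round(t_max * fs)
    epochs = []
    for onset in onsets:
        start, stop = onset + start_off, onset + stop_off
        # Skip epochs that would run past either end of the recording.
        if start >= 0 and stop <= len(recording):
            epochs.append(recording[start:stop])
    return epochs


def reject_artifacts(epochs, threshold_uv=100.0, keep=None):
    """Drop epochs whose absolute amplitude exceeds the threshold.

    If keep is given, return only the first `keep` artifact-free epochs.
    """
    clean = [ep for ep in epochs if max(abs(v) for v in ep) <= threshold_uv]
    return clean if keep is None else clean[:keep]


# Toy recording (microvolts): silence with one 150 microvolt spike.
recording = [0.0] * 20000
recording[9000] = 150.0
epochs = epoch(recording, onsets=[1000, 6000, 11000])
clean = reject_artifacts(epochs)
print(len(epochs), len(clean))
```

In practice the two stimulus polarities would be handled separately, so that the same number of artifact-free trials is retained for each.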
Precision of neural sound encoding was measured using inter-trial phase-locking. Inter-trial phase-locking measures the degree of jitter at a particular frequency within a particular time window. This provides a frequency-specific metric of neural sound encoding that benefits from a relatively robust signal-to-noise ratio compared to analyses of the spectrum of cross-trial average waveforms (Zhu, Bharadwaj, Xia & Shinn-Cunningham, Reference Zhu, Bharadwaj, Xia and Shinn-Cunningham2013). A sliding-window technique was used to assess phase-locking at each time-frequency point. For each trial, a Hanning windowed fast Fourier transform was calculated on 40-ms segments centered at time points between 0 and 170 ms, with 1 ms intervals between time points. The resulting complex vector was then normalized to have a magnitude of 1. For calculation of the phase-locking of the envelope response, vectors were averaged across trials, while for calculation of the phase-locking of the temporal fine structure response, vectors for one of the two stimulus polarities were shifted 180 degrees before averaging. The length of the resulting average vector formed the inter-trial phase-locking value for that time-frequency point. Inter-trial phase locking can vary from zero (no phase consistency whatsoever) to one (perfect phase consistency across trials).
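The computation described above can be sketched for a single time-frequency point as follows. This is an illustrative Python approximation, not the authors' MATLAB code: a direct single-frequency Fourier sum stands in for the FFT, the function names and the `epoch_start_s` parameter are hypothetical, and the toy trials are synthetic cosines sampled at a deliberately low rate.

```python
import cmath
import math


def hann_window(n):
    """Symmetric Hann (Hanning) window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def unit_phasor(segment, window, freq_hz, fs):
    """Windowed Fourier coefficient at one frequency, scaled to magnitude 1."""
    coeff = sum(
        x * w * cmath.exp(-2j * math.pi * freq_hz * i / fs)
        for i, (x, w) in enumerate(zip(segment, window))
    )
    return coeff / abs(coeff) if abs(coeff) > 0 else 0j


def phase_locking(trials, polarities, center_s, freq_hz, fs,
                  epoch_start_s=-0.040, win_s=0.040, fine_structure=False):
    """Inter-trial phase-locking at one time-frequency point (0 to 1)."""
    n = round(win_s * fs)
    start = round((center_s - epoch_start_s) * fs) - n // 2
    window = hann_window(n)
    total = 0j
    for trial, polarity in zip(trials, polarities):
        vec = unit_phasor(trial[start:start + n], window, freq_hz, fs)
        if fine_structure and polarity < 0:
            vec = -vec  # 180-degree shift for one of the two polarities
        total += vec
    # Length of the average unit vector across trials.
    return abs(total) / len(trials)


fs = 4000  # hypothetical low rate to keep the toy example fast
polarities = [1, -1, 1, -1]


def toy_trial(polarity):
    """250 ms epoch starting at -40 ms: 100 Hz envelope-like component plus
    a 700 Hz component that inverts with stimulus polarity."""
    out = []
    for i in range(1000):
        t = -0.040 + i / fs
        out.append(math.cos(2 * math.pi * 100 * t)
                   + polarity * math.cos(2 * math.pi * 700 * t))
    return out


trials = [toy_trial(p) for p in polarities]
env_100 = phase_locking(trials, polarities, 0.100, 100, fs)
env_700 = phase_locking(trials, polarities, 0.100, 700, fs)
tfs_700 = phase_locking(trials, polarities, 0.100, 700, fs, fine_structure=True)
```

In the toy data, the 100 Hz component is phase-consistent across all trials, whereas the 700 Hz component only becomes phase-consistent once one polarity is shifted by 180 degrees.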
The envelope response contains a robust representation of the F0, and so was used for measurement of fundamental frequency encoding, while the temporal fine structure response contains a robust representation of the higher harmonics, and so was used for measurement of the speech formants. See Figure 2 for an illustration of the relationship between phase-locking in envelope and temporal fine structure responses and the spectro-temporal characteristics of the stimulus.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191031034907774-0341:S1366728918000895:S1366728918000895_fig2g.jpeg?pub-status=live)
Figure 2. (Left) Spectrogram of stimulus used to evoke frequency-following responses. (Right) Inter-trial phase locking across time and frequency for the temporal fine structure response (top) and the envelope response (bottom).
F0 and formant encoding were quantified in the following manner. Since the fundamental frequency was steady at 100 Hz across the response, F0 phase-locking was quantified as the average phase-locking between 80 and 120 Hz from 10 to 170 ms in the envelope response. Encoding of the first and second formants, by contrast, was assessed only during the portion of the response in which they were unchanging, i.e., from 60 to 170 ms. F1 was calculated as average phase-locking between 680 and 720 Hz (i.e., as the amplitude of the 7th harmonic), while F2 was calculated as average phase-locking between 1180 and 1220 Hz and between 1280 and 1320 Hz (i.e., as the mean amplitude of the 12th and 13th harmonics) in the temporal fine structure response.
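The band-averaging step for the formant measures can be sketched as below, assuming the phase-locking values are stored as a frequency-by-time grid. The function name, the grid resolution and the toy values are hypothetical; F0 would be handled in the same way on the envelope response with its own band and time window.

```python
def band_mean(plv, freqs_hz, times_ms, f_lo, f_hi, t_lo, t_hi):
    """Mean phase-locking over a frequency band and time window (inclusive)."""
    values = [
        plv[i][j]
        for i, f in enumerate(freqs_hz) if f_lo <= f <= f_hi
        for j, t in enumerate(times_ms) if t_lo <= t <= t_hi
    ]
    return sum(values) / len(values)


# Toy temporal-fine-structure map: low phase-locking everywhere except at
# the harmonics nearest the steady-state formants.
freqs_hz = list(range(0, 1501, 20))
times_ms = list(range(0, 171, 10))
peaks = {700: 0.5, 1200: 0.3, 1300: 0.3}
plv_tfs = [[peaks.get(f, 0.1) for _ in times_ms] for f in freqs_hz]

f1 = band_mean(plv_tfs, freqs_hz, times_ms, 680, 720, 60, 170)
f2 = (band_mean(plv_tfs, freqs_hz, times_ms, 1180, 1220, 60, 170)
      + band_mean(plv_tfs, freqs_hz, times_ms, 1280, 1320, 60, 170)) / 2
print(round(f1, 3), round(f2, 3))
```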
Measures of experience
Although the participants’ length of residence in the UK was comparable (i.e., eight to nine months), the quantity and quality of their L2 learning experience prior to and during their study-abroad differed to a great degree. The participants were individually interviewed to uncover their past and recent L2 learning backgrounds in a retrospective manner, using a similar interview scheme to that used in Muñoz (Reference Muñoz2014). Specifically, the participants self-reported the extent to which they had practiced L2 English inside and outside classrooms at elementary, secondary and university-level schools in China as well as at university-level schools in the UK. Finally, they also reported whether and for how long they had engaged in music training (e.g., experience of playing instruments), which has been linked to various aspects of L1 and L2 development (Slevc & Miyake, Reference Slevc and Miyake2006; Tierney, Krizman, Kraus & Tallal, Reference Tierney, Krizman, Kraus and Tallal2015).
Results
Constructs of L2 pronunciation proficiency
Table 2 summarizes the results for the segmental and suprasegmental dimensions of the participants’ L2 pronunciation proficiency, measured by both the expert raters’ judgements and acoustic analyses. As summarized in Table 3, the inter-relationships between the seven pronunciation measures were assessed via a set of Pearson correlation analyses. The alpha level was adjusted via Bonferroni correction. Strong associations were observed particularly among the expert raters’ segmental and prosodic (word stress, intonation) scores; the intonation and perceived tempo scores; and perceived and objective fluency scores (perceived tempo, articulation rate, pause ratio). Similar to the authors’ previous research (e.g., Saito et al., Reference Saito, Trofimovich and Isaacs2017), the pronunciation measures adopted in the study appeared to reveal four different pronunciation abilities, namely the abilities to (a) pronounce individual sounds/words accurately (segmentals, word stress); (b) access adequate prosody (intonation, perceived tempo); (c) produce optimal fluency (perceived tempo, articulation rate, pause ratio); and (d) avoid too much self-monitoring (repair ratio).
Table 2. Descriptive Statistics of Participants’ Segmental and Suprasegmental Proficiency.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191031034907774-0341:S1366728918000895:S1366728918000895_tab2.gif?pub-status=live)
Table 3. Interrelationships between Segmental and Suprasegmental Proficiency Scores.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191031034907774-0341:S1366728918000895:S1366728918000895_tab3.gif?pub-status=live)
Note. *indicates statistical significance at p < .008; † indicates marginal significance at p < .01 (Bonferroni corrected).
Given that L2 speech performance is likely influenced by task type, we further probed whether participants’ pronunciation proficiency differed between the two task prompts. A set of independent-samples t-tests was performed to compare the segmental and suprasegmental scores of the participants who used Versions A (n = 25) and B (n = 23). The results did not show a significant difference in any of the pronunciation measures (p > .05), suggesting that task effects could be considered minimal in this study.
Constructs of explicit and implicit aptitude measures
As summarized in Table 4, descriptive statistics demonstrated a great deal of variation in participants’ explicit and implicit aptitude scores. According to normality analyses (the Kolmogorov-Smirnov goodness-of-fit test), positive and negative skewness was observed for phonemic coding and FFR phase-locking at F1 (p < .05); therefore, these aptitude scores were transformed using the Log10 function for subsequent analyses.
Table 4. Descriptive Statistics of Learner Aptitude Profiles.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191031034907774-0341:S1366728918000895:S1366728918000895_tab4.gif?pub-status=live)
Table 5 summarizes the results of the Pearson correlations among the participants’ explicit and implicit aptitude scores. To account for multiple comparisons (across explicit vs. implicit aptitude constructs), the alpha level was set at .025 after Bonferroni correction.
Table 5. Interrelationships between Explicit and Implicit Aptitude Scores.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191031034907774-0341:S1366728918000895:S1366728918000895_tab5.gif?pub-status=live)
Note. *indicates statistical significance at p < .025; † indicates marginal significance at p < .05. aThe data transformed via the log10 function.
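The adjusted threshold reported above follows from dividing the family-wise alpha by the number of comparisons. A minimal Python sketch is given below; the function names are hypothetical, a family-wise alpha of .05 is assumed, and the p-values passed to `classify` are invented for illustration.

```python
def bonferroni_alpha(family_alpha, n_comparisons):
    """Per-comparison alpha under the Bonferroni correction."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return family_alpha / n_comparisons


def classify(p_value, corrected_alpha, marginal_alpha=0.05):
    """Label a p-value using the thresholds given in the table note."""
    if p_value < corrected_alpha:
        return "significant"
    if p_value < marginal_alpha:
        return "marginal"
    return "ns"


# Two conceptual comparisons (explicit vs. implicit aptitude constructs).
alpha = bonferroni_alpha(0.05, 2)
labels = [classify(p, alpha) for p in (0.010, 0.040, 0.300)]
print(alpha, labels)
```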
Across the 48 Chinese learners of English in the current study, the explicit and implicit aptitude scores did not demonstrate significant associations with one another (p > .025), indicating that the explicit and implicit aptitude constructs seemed to be independent of each other. As predicted earlier (in Table 1), the results presented here support our assumption that the six aptitude scores used in the study could tap into the following constructs of pronunciation learning aptitude: (a) explicit/segmental (phonemic coding), (b) explicit/prosody (melodic discrimination), (c) explicit/fluency (rhythmic discrimination), (d) implicit/segmental (FFR phase-locking at F1 and F2), and (e) implicit/prosody (FFR phase-locking at F0).
Experience profiles of participants
Table 6 reveals that the participants’ past L2 learning experience differed widely in terms of the number of hours they had practiced L2 English inside classrooms at elementary-, secondary-, and university-level schools in China (920–6840 hours). To further increase their L2 use outside of classrooms, many chose to attend cram and language conversation schools outside of their regular school curriculums (350–6080 hours). As for their more recent L2 experience during the eight to nine months of study-abroad in the UK, all of them were enrolled in a range of sciences and social sciences classes at university in London (e.g., engineering, economics, law) (360–3000 hours). At the same time, some of the participants actively sought opportunities to use L2 English in non-academic settings during their study-abroad in the UK (e.g., conversing with English-speaking friends) (0–1680 hours). Finally, the participants reported the presence and length of music training.
Table 6. Descriptive Statistics of 48 Chinese Learners’ Past/Recent L2 and Music Training Experience.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191031034907774-0341:S1366728918000895:S1366728918000895_tab6.gif?pub-status=live)
Aptitude, experience and pronunciation
To provide a general picture of how the participants’ pronunciation proficiency attainment was individually related to their aptitude and experience factors, a set of Pearson correlation analyses was performed (see Tables 7 and 8). To adjust for two conceptual comparisons (proficiency vs. experience; proficiency vs. aptitude), the alpha level was set at .025 via the Bonferroni correction. The results identified a moderate relationship between the participants’ segmental scores and their explicit (phonemic coding) and implicit (FFR at F1) segmental sensitivity. See Figure 3 for a depiction of the difference in neural F1 encoding between participants with high and low L2 segmental proficiency.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191031034907774-0341:S1366728918000895:S1366728918000895_fig3g.jpeg?pub-status=live)
Figure 3. (Top) Inter-trial phase locking across time and frequency for participants with good (left) and poor (right) L2 segmental production scores (median split, n = 24 participants in each group). (Bottom) Inter-trial phase locking collapsed across time within a window from 60 to 170 ms in participants with good (red) and poor (black) L2 segmental production. Dotted lines indicate ±1 standard error of the mean. Only the participants with good L2 segmental production showed a spectral peak around the first formant (700 Hz).
As for suprasegmental attainment, whereas the participants’ perceived tempo was significantly tied to rhythmic discrimination, their prosodic (intonation, perceived tempo) and fluency (articulation rate, pause ratio) performance was significantly correlated with their recent experience inside and outside classrooms rather than any aptitude factors. No statistically significant correlations were found between the participants’ pronunciation proficiency and their musical training experience.
Relative weights of aptitude and experience effects
As shown above, the participants’ aptitude and experience factors were uniquely correlated with the quality of their pronunciation proficiency attainment, indicating a complex relationship between aptitude, experience and L2 pronunciation learning. In order to examine in more depth the extent to which the aptitude factor alone could predict successful L2 pronunciation learning, the participants’ varied experience backgrounds need to be statistically controlled for.
A set of stepwise multiple regression analyses was performed with the participants’ pronunciation scores as dependent variables and with aptitude and experience scores as independent variables. To ensure a reliable interpretation of the regression model, a decision was made to select only aptitude and experience variables which showed significant or marginally significant correlations with any aspects of pronunciation proficiency (Variance Inflation Factor [VIF] < 1.02) (see Tables 7 and 8). Such variables include phonemic coding, tonal and rhythmic imagery, FFR phase-locking at F1, and recent L2 use inside and outside classrooms.
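To make the VIF criterion concrete, the sketch below computes the VIF for the simplest case of two predictors, where it reduces to 1 / (1 − r²) with r the Pearson correlation between them. This two-predictor simplification, the function names and the data are assumptions for illustration only; with more predictors, each VIF is obtained from the R² of regressing one predictor on all the others.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """VIF when the model contains exactly two predictors."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Hypothetical predictor scores (not the study's data).
aptitude = [1, 2, 3, 4, 5, 6]
experience_uncorrelated = [3, 1, 2, 2, 1, 3]
experience_correlated = [2, 1, 4, 3, 6, 5]

vif_low = vif_two_predictors(aptitude, experience_uncorrelated)
vif_high = vif_two_predictors(aptitude, experience_correlated)
print(round(vif_low, 2), round(vif_high, 2))
```

Under this simplification, a VIF below 1.02 corresponds to a correlation between the two predictors of roughly |r| < .14.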
Table 7. Interrelationships between Aptitude and Pronunciation Attainment Scores
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191031034907774-0341:S1366728918000895:S1366728918000895_tab7.gif?pub-status=live)
Note. *indicates statistical significance at p < .025; †indicates marginal significance at p < .05 (Bonferroni corrected); aThe data transformed via the log10 function.
Table 8. Interrelationships between Experience and Pronunciation Attainment Scores
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191031034907774-0341:S1366728918000895:S1366728918000895_tab8.gif?pub-status=live)
Note. *indicates statistical significance at p < .025; †indicates marginal significance at p < .05 (Bonferroni corrected).
According to the results summarized in Table 9, the regression models explained 8–30% of variance in the participants’ segmental (segmentals, word stress), prosodic (word stress, intonation, perceived tempo) and fluency (perceived tempo and articulation rate) proficiency. Significant aptitude-acquisition links were found between phonemic coding and segmental proficiency, FFR phase-locking at F1 and segmental/word stress proficiency, and rhythmic discrimination and perceived tempo proficiency. In contrast, the recent experience (rather than aptitude) factor appeared to play a key role in accounting for variance in the participants’ intonation, perceived tempo and articulation rate proficiency.
Table 9. Significant Results of Stepwise Multiple Regression Analyses Using Explicit and Implicit Aptitude and Experience as Predictors of L2 Pronunciation Attainment.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191031034907774-0341:S1366728918000895:S1366728918000895_tab9.gif?pub-status=live)
Note. The variables entered into the regression equations included phonemic coding, rhythmic imagery, FFR at F1 and F2, and recent L2 use inside and outside classrooms.
Discussion
In the context of 48 Chinese learners of English in the UK with varied L2 learning experiences, the current study examined whether, to what degree and how the proposed framework of cognitive factors – i.e., L2 pronunciation learning aptitude – could roughly explain four dimensions of pronunciation proficiency attainment: correct pronunciation of individual sounds/words (segmentals, word stress); adequate prosody (intonation, perceived tempo); optimal fluency (perceived tempo, articulation rate, pause ratio); and self-monitoring (repair ratio). Unlike earlier aptitude studies which were exclusively concerned with explicit language learning cognition (e.g., Saito, Reference Saito2017, in press), we measured L2 learners’ explicit and implicit sensitivity to the segmental (phonemic coding, FFR at F1/F2), prosodic/intonational (tonal imagery, FFR at F0) and temporal (rhythmic imagery, FFR at F0) aspects of speech by adopting a range of behavioural (language and music aptitude tests) and neurophysiological (electroencephalography) measures.
Overall, the results of the descriptive analyses showed that approximately 11 years of English learning experience in China and the UK (i.e., 7000+ hours of L2 use inside and outside classrooms) imparted a positive influence on all dimensions of their L2 pronunciation proficiency (Flege, Reference Flege2016). At the same time, the results of the correlation analyses supported our earlier prediction that the extent to which the learners ultimately improved their segmental and suprasegmental proficiency was uniquely driven by the interaction of different types of experience (past vs. recent) and aptitude (explicit vs. implicit) factors.
Segmental sensitivity and performance
With respect to L2 segmental proficiency, the multiple regression models revealed that the participants’ correct pronunciation was primarily linked to their explicit segmental sensitivity (17.0%: phonemic coding), and secondarily associated with their implicit segmental sensitivity (11.9%: FFR at F1). Comparatively, the final quality of the participants’ L2 segmental performance was not significantly related either to their past or recent experience factors. The findings here successfully replicate previous aptitude studies that identified the presence of significant aptitude (but not experience) effects on experienced L2 learners’ attained segmental accuracy (Granena & Long, Reference Granena and Long2013; Saito, Reference Saito2017, in press; Saito et al., Reference Saito, Suzukida and Sun2018).
One potential reason for the relatively greater weight of the aptitude factor over the experience factor for L2 segmental attainment is the difficulty of this specific instance of L2 speech learning. According to previous cross-sectional (e.g., Flege, Bohn & Jang, Reference Flege, Bohn and Jang1997; Saito & Brajot, Reference Saito and Brajot2013) and longitudinal (e.g., Munro & Derwing, Reference Munro and Derwing2008) investigations, L2 English learners’ segmental pronunciation quickly becomes intelligible within the first year of immersion in an English-speaking country, after which improvement levels off. Whereas most continue to show detectable L1-related accents despite years of practice, the mastery of high-level, more nativelike segmental accuracy is limited to very few individuals (Abrahamsson & Hyltenstam, Reference Abrahamsson and Hyltenstam2009; Saito, Reference Saito2013). As L2 aptitude researchers have recently emphasized, it is in the acquisition of these relatively challenging L2 features that aptitude plays the most prominent role (Li, Reference Li2013; Skehan, Reference Skehan, Granena, Jackson and Yilmaz2016; see also Saito, Reference Saito2017, in press).
Perhaps more importantly, our findings identified not only explicit (phonemic coding) but also implicit (FFR at F1) aptitude as a significant predictor of the participants’ L2 segmental attainment. In the neurophysiology literature, the implicit sensitivity of humans to spectral and temporal features of complex speech signals – as measured using the frequency following response – serves as an anchor for various developmental phenomena in L1 acquisition related to literacy (e.g., White-Schwoch et al., Reference White-Schwoch, Carr, Thompson, Anderson, Nicol, Bradlow and Kraus2015), normal hearing (e.g., Russo, Skoe, Trommer, Nicol, Zecker, Bradlow & Kraus, Reference Russo, Skoe, Trommer, Nicol, Zecker, Bradlow and Kraus2008) and musicality (e.g., Tierney et al., Reference Tierney, Krizman, Kraus and Tallal2015). Building on our precursor research (Omote et al., Reference Omote, Jasmin and Tierney2017), the electrophysiological results of the current investigation suggest that the robustness of auditory processing may be an important foundation for post-pubertal L2 speech learning as well. More specifically, our study demonstrated that neural speech encoding appeared to be independent of adult L2 learners’ explicit phonetic analysis/memory (phonemic coding), and tied to relatively difficult aspects of adult L2 speech learning (segmentals, word stress).
On the one hand, the findings do agree with the dominant view in the field that post-pubertal SLA is mainly driven by explicit language learning cognition (e.g., Suzuki & DeKeyser, Reference Suzuki and DeKeyser2017). Our participants had exclusively practiced L2 English through a number of form-focused classes in Chinese EFL classrooms before they arrived in the UK. Given that explicit aptitude (phonemic coding) could facilitate L2 pronunciation learning to a great degree in such foreign language contexts (Saito, Reference Saito2017, in press), it is not surprising to find relatively strong effects of explicit aptitude on these participants’ L2 segmental attainment.
On the other hand, our study identified a significant relationship between adult learners’ implicit sensitivity to speech signals (FFR) and their L2 pronunciation performance. Our findings echo the strong FFR-acquisition link clearly observed in the L1 literature (e.g., White-Schwoch et al., Reference White-Schwoch, Carr, Thompson, Anderson, Nicol, Bradlow and Kraus2015). In this regard, our study adds empirical support to the competing theoretical stance that the same cognitive factors underlying L1 acquisition – notably implicit language learning cognition – remain intact throughout the lifetime, and are therefore active in post-pubertal L2 speech learning as well (Birdsong & Molis, Reference Birdsong and Molis2001; Bundgaard-Nielsen, Best & Tyler, Reference Bundgaard-Nielsen, Best and Tyler2011; Flege, Reference Flege2016; Saito, Reference Saito2013, Reference Saito2015).
Despite the participants’ extensive form-oriented L2 experience prior to their study-abroad in the UK, all of them had been residing in the UK for eight to nine months at the time of the project. As shown in the results (see Table 6), the participants frequently accessed L2 English for meaning rather than form with various interlocutors in diverse conversational contexts. Thus, in this study, certain learners with higher implicit aptitude could have benefited more from this period of naturalistic L2 learning by processing incoming input not only explicitly (with awareness) but also implicitly (without awareness). Our argument here is consistent with recent theoretical discussion in the L2 aptitude literature on the importance of a combination of explicit and implicit learning. Such multifaceted cognition can help L2 learners make the most of any given input/output opportunities, which is believed to be a necessary condition for the attainment of high-level L2 proficiency (Doughty, Campbell, Mislevy, Bunting, Bowles & Koeth, Reference Doughty, Campbell, Mislevy, Bunting, Bowles and Koeth2010; Linck et al., Reference Linck, Hughes, Campbell, Silbert, Tare, Jackson and Doughty2013; Saito et al., Reference Saito, Suzukida and Sun2018; Skehan, Reference Skehan, Granena, Jackson and Yilmaz2016).
Suprasegmental sensitivity and performance
When it comes to L2 suprasegmental proficiency, the participants’ explicit sensitivity (rhythmic imagery) significantly accounted for 22.8% of the variance in their perceived tempo, confirming the relationship between music aptitude and L2 pronunciation learning (Li & DeKeyser, Reference Li and DeKeyser2017). Unlike their L2 segmental attainment (closely linked to explicit and implicit aptitude), however, most aspects of the participants’ L2 suprasegmental attainment were predicted by their recent L2 use inside and outside classrooms during their study-abroad in the UK (9.2–26.2% of the variance), regardless of their past English-as-a-Foreign-Language experience in China. The findings here concur with the theoretical claims (e.g., Ellis, Reference Ellis2006) and empirical evidence (e.g., Saito & Hanzawa, Reference Saito and Hanzawa2016) that SLA is adaptively sensitive to the quantity, quality and recency of input, as form and meaning connections become stronger in accordance with how often certain linguistic items are practiced in the most immediate contexts.
The findings also echo previous observations on the relatively salient effects of experience on L2 suprasegmental (rather than segmental) learning. Whereas L2 segmental learning is a slow, gradual process especially beyond the initial rate of learning stage (Flege, Reference Flege2016), L2 learners’ suprasegmental accuracy and fluency improve substantially and continuously for an extensive period of time, as long as they use the target language on a daily basis (Mora & Valls-Ferrer, Reference Mora and Valls‐Ferrer2012; Trofimovich & Baker, Reference Trofimovich and Baker2006; Saito, Reference Saito2015). This strong relationship between experience and L2 suprasegmental learning could be arguably linked to the fact that the suprasegmental quality of L2 speech more directly affects listeners’ successful comprehension than the segmental quality does (Isaacs & Trofimovich, Reference Isaacs and Trofimovich2012), and that L2 learners are assumed to intentionally or intuitively prioritize the acquisition of L2 suprasegmentals (rather than segmentals) as a function of increased experience (Derwing, Munro, Thomson & Rossiter, Reference Derwing, Munro, Thomson and Rossiter2009).
Interestingly, no significant associations were found between FFR fundamental frequency encoding and L2 suprasegmental attainment. These results suggest that the process and product of adult L2 suprasegmental learning may derive from explicit rather than implicit cognition. However, the findings need to be interpreted with much caution, since F0 phase-locking in the study may not have fully captured natural prosody in English. To this end, different FFR metrics, such as Gamma phase-locking (80 Hz), may be needed for the analysis of participants’ sensitivity to lower frequency (Omote et al., Reference Omote, Jasmin and Tierney2017). Additionally, the relationship between individual differences in characteristics of the FFR and performance in various auditory tasks remains imperfectly understood. Prior research has found that FFR phase-locking at the F0 is linked to the ability to consistently synchronize to a metronome (Tierney & Kraus, Reference Tierney and Kraus2013) and rapidly adapt to stimulus perturbations while synchronizing (Tierney & Kraus, Reference Tierney and Kraus2016). This suggests that FFR phase-locking at lower frequencies could be an index of the precision with which the auditory system represents the timing of sound, a skill which may be useful for extracting temporal cues to prosodic features such as phrase boundaries.
However, FFR phase-locking has not been found to relate to the ability to remember and reproduce temporal patterns, a skill which instead correlates with inter-trial consistency in slower cortical response to sound (Tierney, White-Schwoch, MacLean & Kraus, Reference Tierney, White-Schwoch, MacLean and Kraus2017), suggesting that integration of rhythmic information across time relies more upon cortical than subcortical processing. Cross-trial consistency in cortical responses such as the passive auditory ERPs and cortical tracking of slow changes in amplitude envelope and pitch contour, therefore, may be more promising measures of implicit suprasegmental proficiency. Future studies need to conceptualize, elaborate and validate more reliable aptitude measures by which to measure L2 learners’ capacities to process a wide range of low frequencies to produce word stress, intonation and fluency with adequate rhythmic timings.
Future directions
To conclude, we would like to call for more L2 speech research of this kind in order to further examine the cognitive and perceptual correlates of successful L2 pronunciation learning with a larger number of participants who have varied levels of proficiency and experience, and different pairings of L1 and L2 backgrounds. Given the exploratory nature of the project, several methodological limitations need to be acknowledged with an eye towards future replication studies. First, the current study was a cross-sectional investigation of the aptitude profiles of intermediate-to-advanced level participants with varied L2 learning experience backgrounds. To unravel the relative impacts of explicit and implicit aptitude on L2 pronunciation learning, future studies can adopt longitudinal, pre- and posttest designs (cf. Saito et al., Reference Saito, Suzukida and Sun2018). Such studies will shed light on whether and to what degree high explicit and implicit aptitude learners can differentially benefit from two essentially different L2 learning conditions – (a) naturalistic immersion with ample opportunities to use the L2 meaningfully with native and non-native speakers on a regular basis; vs. (b) form-focused lessons in foreign language settings without many conversational opportunities outside of the classroom.
Relatedly, such future studies should also longitudinally examine the intricate link between aptitude and experience. In the field of SLA and music education, there is empirical evidence that learners’ aptitude test scores (e.g., phonemic coding, music aptitude) are unlikely to change dramatically over time, suggesting that such explicit aptitude can be a relatively stable trait (e.g., Carroll, Reference Carroll and Glaser1962; Gordon, Reference Gordon1995). As shown in the current study, participants’ aptitude and experience profiles independently related to L2 speech performance (VIF < 1.02), indicating they are essentially different factors of SLA. When it comes to FFR measures, however, the neurophysiology literature has shown individual variability when researchers compare participants with substantially different backgrounds (Bidelman, Gandour & Krishnan, Reference Bidelman, Gandour and Krishnan2011 for tonal vs. non-tonal language users; Krizman et al., Reference Krizman, Slater, Skoe, Marian and Kraus2015 for simultaneous vs. sequential bilinguals). These studies suggest that FFR can be modulated by long-term experience to a certain degree. To our knowledge, however, no empirical studies have probed whether and to what degree FFR can change when learners engage in short-term but intensive exposure to a foreign language (e.g., study abroad); examining this topic is crucial, as it will allow us to further understand the extent to which FFR measures can serve as predictors rather than results of L2 phonological attainment.
Another interesting direction concerns the role of aptitude in the acquisition of advanced L2 phonology, especially among more experienced L2 learners at the later stages of SLA. It is important to remember that the length of study abroad among the participants in the current study was only eight months, and that their pronunciation performance was far below the nativelike norm (e.g., see Table 2 for their pronunciation ratings of around 500–600 out of 1000). This indicates that these participants had much room for improvement. In the previous nativelikeness literature in SLA, certain adult L2 learners have been identified as demonstrating high-level L2 pronunciation proficiency, which native listeners cannot perceptibly distinguish from other native samples. Whereas these participants typically have possessed an extensive amount of L2 experience (> 10 years) (DeKeyser, Reference DeKeyser2013; Saito, Reference Saito2013) together with strong professional and integrative motivation (Moyer, Reference Moyer2014) and some form of explicit language learning cognition (Granena & Long, Reference Granena and Long2013), the extent to which implicit language learning cognition – the driving force for successful L1 and early L2 acquisition – can still explain the incidence of exceptional L2 speech learning after puberty has remained unclear (cf. Linck et al., Reference Linck, Hughes, Campbell, Silbert, Tare, Jackson and Doughty2013).
Third, although the current study exclusively drew on production measures, it is notable that any change in a learner's representational system first impacts the perception phase prior to the production phase in both L1 and L2 acquisition (Flege, Reference Flege2016). It would thus be intriguing for future studies to elucidate the role of explicit and implicit aptitude in L2 learners’ perception performance, especially when they are exposed to natural and synthetic tokens varying in the F1 × F2 × F3 domain (Flege et al., Reference Flege, Bohn and Jang1997) and duration of F1 transition (Underbakke, Polka, Gottfried & Strange, Reference Underbakke, Polka, Gottfried and Strange1988), under various lexical conditions (i.e., target sounds in frequent words vs. infrequent words: Flege, Takagi & Mann, Reference Flege, Takagi and Mann1996), and with reaction time instruments (Ingvalson, McClelland & Holt, Reference Ingvalson, McClelland and Holt2011). In terms of analyses, such future studies should also highlight not only the global dimensions of L2 speech, but also specific segmental, prosodic and temporal features that are difficult for a particular group of L2 learners. For instance, one of the most well-researched topics in L2 speech learning is the acquisition of the English /ɹ/ and /l/ contrast by Japanese learners (for a review, see Bradlow, Reference Bradlow, Hansen and Zampini2008). Few Japanese learners have been reported to attain nativelike performance in perceiving and producing English /ɹ/ and /l/ due to their significant lack of sensitivity to highly complex speech signals in F2 and F3 and articulatory configurations (simultaneous constrictions in the labial, alveolar and pharyngeal areas of the vocal tract) (e.g., Flege et al., Reference Flege, Takagi and Mann1996; Ingvalson et al., Reference Ingvalson, McClelland and Holt2011; Saito, Reference Saito2013; Saito & Brajot, Reference Saito and Brajot2013).
To provide a full-fledged picture of this specific aptitude-acquisition link, it would be intriguing for future studies to explore the extent to which Japanese learners’ cognitive and perceptual individual differences could explain the attainment of high-level English /ɹ/-/l/ performance. As a result, such follow-up studies will allow us to evaluate the replicability and robustness of our aptitude framework at a fine-grained level.
Finally, our finding that neural encoding of spectral peaks correlates with L2 segmental production suggests that neural sound processing, as measured using EEG, provides an alternative method of measuring implicit language learning aptitude, complementing behavioural approaches. Future research into neural correlates of language learning aptitude could investigate other EEG metrics that may be similarly useful. Slow (< 8 Hz) rhythms in the EEG signal, for example, entrain to the amplitude envelope (Doelling, Arnal, Ghitza & Poeppel, Reference Doelling, Arnal, Ghitza and Poeppel2014) and pitch contour (Meyer, Henry, Gaston, Schmuck & Friederici, Reference Meyer, Henry, Gaston, Schmuck and Friederici2017) of speech, and the fidelity of this entrainment is related to L1 language abilities in children (Power, Colling, Mead, Barnes & Goswami, Reference Power, Colling, Mead, Barnes and Goswami2016). This measure, therefore, is a promising candidate for a neural foundation of implicit sensitivity to segmental and suprasegmental aspects of speech.
Supplementary material
To view supplementary material for this article, please visit https://doi.org/10.1017/S1366728918000895