Introduction
In the field of Applied Linguistics and second language acquisition (SLA), a growing number of scholars have emphasized the importance of the Open Science approach (e.g., Marsden, in press). One crucial component of this movement is to make all the research processes related to data collection and analysis fully transparent. As a result, readers can not only understand exactly what the researchers attempted to do, but also conduct objective and independent replications of the findings in the future. Such an approach is particularly important for theoretically and practically crucial topics that need to be replicated in many different contexts. In this paper, we aim to demonstrate how the Open Science approach allows us to consider a fundamental, yet controversial issue, that is, why the rate and ultimate attainment of L2 learners are so varied, especially when they start learning a target language after puberty.
Over the past 50 years, the role of individual differences in postpubertal L2 speech learning has attracted a great amount of scholarly attention. While many L2 learners demonstrate detectable L1-related accents even after years of practice, some can attain highly advanced L2 pronunciation proficiency (e.g., Flege et al., 1995). To examine the source of such variation, this line of L2 speech research has traditionally considered only one or two individual difference variables (e.g., age, motivation) at a time. More recently, scholars have begun to describe L2 learning as a complex, adaptive, emergent, self-organizing, and ever-changing system (e.g., Larsen-Freeman, 2012). To unravel what underlies a dynamic phenomenon of this kind, we argue that it is crucial to include as many learner-internal and learner-external factors as possible within the same research design. In addition, prior work has typically assessed L2 speech proficiency in terms of the degree of nativelikeness (or accentedness). However, the levels of attainment in postpubertal L2 pronunciation should be assessed based on ease of understanding (comprehensibility), because many adult L2 learners can be highly comprehensible despite their detectable L2 accents (Derwing & Munro, 2013; Saito et al., 2017).
Considering all the methodological concerns above (i.e., the lack of data transparency, depth, and diversity), the primary objective of the current study is to revisit the process and product of late L2 speech learning. Our study is novel, as we consider the notion of the dynamic system (i.e., simultaneous consideration of multiple dependent and independent variables) and the Open Science approach (i.e., developing, analyzing, and sharing the dataset with interested audiences). First, we report in detail how we constructed a relatively large-scale learner and speech dataset from 110 late L2 speakers in London. Subsequently, we present the results of regression modeling analyses to shed light on what types of learner variables, related to the learners’ L1, age, experience, motivation, awareness, and attitudes, jointly interact to determine different levels of L2 nativelikeness and comprehensibility. Last, we make the actual dataset publicly available while providing a range of suggestions regarding how to analyze multivariate, multifactorial data of this kind, and inviting the readers to rethink our analyses and interpretations from multiple perspectives (see DATASET).
Background
Many early bilinguals attain high levels of L2 proficiency through mere exposure to the target language in an implicit and incidental fashion (as in L1 acquisition). With respect to late L2 speakers, who start learning a target language after puberty, their speech is generally L2-accented as it builds on and interacts with their already developed L1 system (Flege et al., 1995). The degree of such foreign accentedness can vary greatly due to a range of learner-external (L1-L2 distance, age, experience) and learner-internal factors (motivation, awareness, attitudes). To date, previous studies have typically looked at one or two independent variables in isolation and linked them to the nativelikeness of participants’ L2 speech performance.
External Factors of L2 Speech Learning
L1-L2 Distance
A range of theoretical accounts have been proposed to explain the influence of L1 phonetic structures on L2 speech learning. A core premise of such accounts is that the linguistic distance between an L1 and L2 determines pronunciation learning difficulty (Best & Tyler, 2007, for the Perceptual Assimilation Model). Numerous empirical studies have documented learners’ difficulty in acquiring relatively new articulatory and acoustic features in an L2 on segmental (e.g., Japanese learners’ English /r/-/l/ acquisition) and suprasegmental (e.g., American learners’ Mandarin lexical tone acquisition) levels. Conversely, there is some evidence that even late L2 learners can attain highly advanced L2 pronunciation proficiency, especially when their L2 is linguistically close to their L1 (e.g., Bongaerts et al., 1997, for Dutch learners of English).
Age
To date, scholars have extensively examined the extent to which L1 influence could be mediated by a set of age-related factors, such as the age of arrival (i.e., the first exposure to the target language in a naturalistic setting), age of learning (i.e., the onset of foreign language education) and testing (i.e., participants’ age at the time of data collection). Although age of acquisition has been found to predict the ultimate attainment of L2 oral proficiency after years of immersion in an L2-speaking environment (e.g., Flege et al., 1995), the predictive power of age has remained ambiguous in the context of foreign language learning (several hours of form-focused instruction per week). The existing literature has pointed out that late starters may benefit more from foreign language instruction due to their cognitive maturity, fully developed L1 literacy, and accumulative classroom experience (e.g., Muñoz, 2014).
Experience
Another variable relevant for late L2 speech learning is concerned with quantity (how much learners have practiced) and quality (how, with whom, and what learners have practiced) of experience. Length of residence (LOR) in an L2 environment has been adopted in L2 speech research as a proxy for the quantity of L2 use; however, the reliability of LOR has been subject to criticism because the frequency of L1 and L2 use differs greatly among individuals, even if they stay in an L2-speaking environment for the same period of time (for more relevant discussion, see Derwing & Munro, 2013). In this regard, scholars have looked at the quality of experience from multiple angles, such as the ratio of language use (L1 vs. L2) (e.g., Flege et al., 2002), type of interlocutors (fluent vs. non-fluent speakers) (e.g., Muñoz & Llanes, 2014), and context of interaction (social vs. professional vs. family) (e.g., Jia & Aaronson, 2003).
Learner-Internal Factors of L2 Speech Learning
Metalinguistic awareness
From a theoretical standpoint, awareness (i.e., explicit knowledge about the target language) is believed to play a key role in L2 acquisition, because it helps L2 learners to better notice and understand specific features in received input and then internalize them into long-term memory (Schmidt, 2001). A series of experimental studies have convincingly shown that L2 learners exhibit some gains when they practice an L2 explicitly, consciously, and deliberately (e.g., Hama & Leow, 2010). In terms of L2 phonology, there is some evidence that L2 learners with greater phonological awareness (i.e., conscious knowledge about phonological and phonetic structures of a target language) tend to produce not only more segmentally accurate (Saito, 2019) but also more comprehensible speech (Venkatagiri & Levis, 2007).
Motivation
In other studies, highly advanced L2 speakers have been reported to demonstrate high levels of professional and integrative motivation to use language accurately under various circumstances (school, business, social, and home-related). For example, such speakers may be L2 language teachers by profession (Bongaerts et al., 1997) and/or have intensive immersion experience through international marriages (Ioup et al., 1994).
Attitude
Another well-researched topic is concerned with attitudes, defined as “an evaluative orientation to a social object” (Garrett, 2010, p. 3). Whereas scholars have extensively examined language attitudes toward L2 learning and teaching in general (see Gardner & Smythe, 1981, for the influential framework and Attitude/Motivation Test Battery), some studies have looked at this topic in the context of L2 pronunciation. For example, previous research has shown that some L2 learners express solidarity with their L1-accented speech, which translates into positive attitudes toward speakers from the same L1 background (McKenzie & Gilmore, 2017). In the context of French-speaking Quebec, Gatbonton and Trofimovich (2008) found that strong L1 ethnic group affiliation was associated with low L2 proficiency, whereas positive views toward both L1 and L2 communities were linked to high L2 pronunciation proficiency.
Comprehensibility versus Nativelikeness
Importantly, much of the late L2 speech literature has been exclusively concerned with the relationship between learners’ extrinsic and intrinsic individual differences and the degree of L2 phonological nativelikeness. In the field of SLA, however, there has been a consensus that the linguistic behaviors of bilinguals and monolinguals are essentially different and that L2 speakers’ linguistic performance should be evaluated relative to that of other L2 speakers rather than against an idealized monolingual native speaker model (e.g., Ortega, 2018). In line with this paradigm shift, a growing number of scholars have emphasized the importance of examining L2 speech from the perspective of comprehensibility rather than nativelikeness (Derwing & Munro, 2013; Saito et al., 2017).
To date, many empirical studies have indeed shown that perceived comprehensibility and nativelikeness tap into somewhat overlapping but essentially different constructs of L2 speech. For example, while assessing the comprehensibility of L2 speech, listeners are found to attune to a range of linguistic elements, especially those directly relevant to successful comprehension, in order to arrive at the overall meaning of L2-accented speech in the most efficient and effective fashion (e.g., Suzuki & Kormos, 2019, for prosody). L2 learners can continue to enhance the comprehensibility of their speech regardless of detectable L2 accents, as long as they regularly use their L2 for social interaction with various fluent speakers in diverse social settings (Derwing & Munro, 2013). In contrast, listeners tend to assess the degree of linguistic nativelikeness solely based on phonological accuracy (Saito, Trofimovich, & Isaacs, 2016); the perceived nativelike aspects of L2 speech are resistant to change, especially after the initial rapid development within the first few years of immersion (Saito & Munro, 2014).
Open Science Approach
With the aim of attaining scholarly rigor, the importance of Open Science has been extensively discussed in various academic disciplines (for an overview, see McKiernan et al., 2016). It has been increasingly adopted as a mandatory condition for authors publishing work in major academic journals (e.g., Gewin, 2016, for Nature; Gerrig & Rastle, 2019, for Journal of Memory and Language; Marsden et al., 2019, for Language Learning). The Open Science approach refers to a wide range of research practices, which include depositing academic literature in freely available platforms (open repository), creating an accessible summary for the general public (open access), and sharing all research materials and datasets (open data). Importantly, such open practices offer compelling benefits, including increased citations, media attention, potential collaborators, and funding opportunities (see McKiernan et al., 2016).
Despite its popularity in diverse areas of science, the Open Science approach to research has been significantly lacking in the field of SLA (Marsden, in press). While the number of meta-analyses has been increasing, many primary studies were reported to be excluded from them due to the unavailability of data, indicating that the findings of these meta-analyses may not necessarily reflect the state-of-the-art status of the field (Larson-Hall & Plonsky, 2015). Relatedly, recent methodological synthesis papers have revealed that only a small portion of individual studies made their materials available (e.g., Marsden, Thompson, et al., 2018, for 4% out of 71 self-paced reading studies; Plonsky et al., 2019, for 35% out of 214 grammatical judgement studies). These problems subsequently hinder third-party researchers from examining the replicability and generalizability of existing research findings (Marsden, Morgan-Short, et al., 2018).
Motivation for Current Study
Whereas a growing number of scholars have accepted the view that L2 speech is a multifaceted phenomenon, existing research has been mainly concerned with how one or two independent variables could affect the outcomes of L2 speech. Unfortunately, this line of work fails to see L2 learning as a complex dynamic system (Larsen-Freeman, 2012). We have yet to determine how a range of different learner-external and learner-internal factors jointly interact to influence the rate and ultimate attainment of late learners’ L2 pronunciation. Such research will shed light on the linguistic, experiential, and sociopsychological underpinnings of late L2 speech learning, and will inform future practice regarding how best to help different types of learners: those who aim to achieve comprehensible L2 pronunciation versus those who strive to achieve nativelike L2 pronunciation. Our research question, therefore, is as follows:
• How do learner-external and learner-internal factors differentially relate to L2 learners’ speech comprehensibility and nativelikeness?
In order to answer this research question, we took two unique approaches: including numerous independent and dependent variables to examine L2 speech as a dynamic system (i.e., the dynamic perspective), and constructing, analyzing, and sharing the entire dataset (i.e., the Open Science approach).
In the context of 110 late L2 learners in London, we first explicate what kinds of profiles characterize L2 learners who have achieved varying levels of L2 comprehensibility and nativelikeness. Following the Open Science approach, we provide all the details in terms of what research instruments we used to collect the dataset (speaking test, learner questionnaire, rater training scripts), what kinds of statistical analyses we adopted (data reduction, mixed-effects modeling), and how we interpreted the findings. In order to test the scientific rigor of the current study, we would like to invite the readers not only to replicate the method that we developed and reproduce the results that we reached, but also to critically look at the way we operationalized the current project and think of different types of statistical analyses with which to approach the dataset, i.e., the strong version of data transparency (Marsden, in press).
Method
Dataset
Given that the scope of the study highlights late L2 learners, we focused on learners whose age of arrival in an English-speaking environment was beyond the age of 16. These learners were assumed to speak L2 English with perceptible L1-related accents (for a similar definition, see Flege et al., 1995). To recruit a sufficient number of L2 speakers that could represent a wide range of L2 oral proficiency levels (beginner to advanced), flyers were circulated at various locations (universities, language schools) and on social media. All data collection took place individually in a quiet room at the participants’ residences, offices, schools, or community centers for their convenience. For each session, participants were first interviewed to gather a range of information related to their L1 backgrounds, age, experience, motivation, awareness, and attitudes (see Supporting Information-A for the full-length questionnaire). This was followed by a speech recording session, wherein the participants’ spontaneous speech was elicited via a timed picture description task.
The participants widely differed vis-à-vis a total of 30 learner variables spanning L1 backgrounds, age of acquisition, language quantity and quality of experience, professional and social motivation, and awareness and attitudes toward foreign-accented versus nativelike speech. For the raw data and descriptive statistics of the 30 variables, see DATASET and Supporting Information-B.
• First Language Backgrounds (1 variable): The participants in the current study were classified into nine major language families: (1) Romance (n = 19) (e.g., Italian, Spanish, French), (2) Germanic (n = 5) (e.g., German, Swedish, Dutch), (3) Indo-Iranian (n = 4) (e.g., Hindi-Urdu, Bengali, Punjabi), (4) Balto-Slavic (n = 18) (e.g., Russian, Polish, Czech), (5) Uralic (n = 2) (Estonian), (6) Sino-Tibetan (n = 15) (Chinese), (7) Altaic (n = 25) (Japanese, Korean, Turkish), (8) Austro-Asiatic (n = 12) (Vietnamese), and (9) Niger-Congo (n = 10) (Yoruba, Igbo, Swahili). For the analyses, a dummy code (“0” = Indo-European, n = 46; “1” = non-Indo-European, n = 64) was used to see how the L1-L2 distance could be associated with the comprehensibility of their L2 English speech.
• Age (3 variables): The participants’ age profiles were substantially different in terms of age of arrival in an English-speaking environment (i.e., age of acquisition) (Range = 16–55), age at the onset of foreign language education (i.e., age of learning) (Range = 2–58), and age at data collection (i.e., age of testing) (Range = 20–59).
• Previous Experience (5 variables): In the current study, participants’ previous experience was surveyed in terms of (i) how long they had practiced English in foreign language classrooms (Range = 0–23 years) and (ii) how long they had stayed in English-speaking countries (Range = 0.1–39 years). Approximately 30% of the participants reported previous experience in (iii) linguistics training (n = 33) and/or (iv) teaching English as an L2 (n = 31). We also created (v) a composite, broad category to capture the number of participants who had received any type of professional training related to linguistics and/or teaching (n = 36).
• Current Experience (9 variables): To scrutinize current experience in the UK, following the questionnaire format of the Language Contact Profile (Freed et al., 2004), participants were asked to self-report the percentage of time they spent using their L1 and L2 (English in this case) at the time of the project in three different settings: professional (work/school), social (with friends), and home (with family). To further examine the type of interlocutors, the participants were asked to estimate the percentage of time they spent interacting in L2 English with fluent versus nonfluent speakers.
• Motivation (3 variables): There is some evidence that very few L2 learners attain near-nativelike pronunciation. Such learners often demonstrate strong concern for the attainment and use of high-level L2 proficiency due to their profession (Bongaerts et al., 1997; Flege et al., 1995) and communication with family members (Ioup et al., 1994). The participants rated the degree to which they were expected to use L2 English at a nativelike proficiency level on a 9-point scale (1 = not at all, 9 = very much so) for three different settings: professional (work/school), social (with friends), and home (with family) (see Footnote 1).
• Awareness (5 variables): Following the methodological practices in L2 awareness research (e.g., Hama & Leow, 2010), the participants’ awareness of L2 comprehensibility was measured via self-reports. In the current study, we interviewed the participants to find out the extent to which they were aware of the importance of specific linguistic dimensions in L2 speech. Participants rated which aspects of language they thought were relatively crucial for successful communication on a 9-point scale (1 = not important, 9 = very important). The five statements included were: (a) speaking English without any accent like a native speaker; (b) speaking comprehensible English regardless of accentedness; (c) good pronunciation; (d) appropriate vocabulary and grammar; and (e) idiomatic and sophisticated expression.
• Familiarity and Attitudes (4 variables): In the current study, the participants’ familiarity and attitudes (i.e., perception) toward foreign-accented and nativelike pronunciation were measured via their self-ratings of four statements on a 9-point scale (1 = strongly disagree, 9 = strongly agree). For familiarity, two statements asked the extent to which participants were familiar with different types of L2-accented English and British English. For attitudes, the other two statements asked the extent to which participants liked it when people speak English with a foreign accent and with a British accent (for a similar method, see Gatbonton & Trofimovich, 2008).
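As a minimal sketch of the L1 dummy coding described in the first bullet above, the reported family counts can be collapsed into the binary Indo-European versus non-Indo-European predictor. Only the counts and the 0/1 coding come from the text; the dictionary, set, and function names are illustrative assumptions and do not correspond to column names in the shared dataset.

```python
from collections import Counter

# Participants per L1 family, as reported in the text.
FAMILY_COUNTS = {
    "Romance": 19,
    "Germanic": 5,
    "Indo-Iranian": 4,
    "Balto-Slavic": 18,
    "Uralic": 2,
    "Sino-Tibetan": 15,
    "Altaic": 25,
    "Austro-Asiatic": 12,
    "Niger-Congo": 10,
}

# Families treated as Indo-European (coded 0); all others are coded 1.
INDO_EUROPEAN = {"Romance", "Germanic", "Indo-Iranian", "Balto-Slavic"}


def dummy_code(family: str) -> int:
    """Return 0 for an Indo-European L1 family and 1 otherwise."""
    return 0 if family in INDO_EUROPEAN else 1


def tally_codes(family_counts: dict) -> Counter:
    """Sum the number of participants under each dummy code."""
    totals = Counter()
    for family, n in family_counts.items():
        totals[dummy_code(family)] += n
    return totals


totals = tally_codes(FAMILY_COUNTS)
print(totals[0], totals[1])  # 46 64
```

The tallies reproduce the group sizes given in the text (46 and 64), which together account for all 110 participants.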
Comprehensibility and Nativelikeness Judgments
Speaking Materials
In previous L2 speech studies, word-, sentence-, and paragraph-reading tasks have often been adopted as outcome measures. However, the construct validity of such controlled tasks has remained controversial because their format allows adult L2 learners to carefully monitor their pronunciation forms without much communicative pressure. In order to index adult L2 learners’ pronunciation proficiency, a growing number of scholars have emphasized the importance of adopting spontaneous speech tasks, wherein speakers’ primary focus lies in conveying the intended message while simultaneously paying attention to phonological, lexical, grammatical, and discoursal aspects of language (Piske et al., 2011; Saito & Plonsky, 2019).
Based on this rationale, a decision was made to use a timed picture description task to elicit spontaneous speech of sufficient length without too many disfluencies from L2 learners with varied proficiency levels (beginner to advanced). The participants described seven different pictures under time pressure (five seconds of planning per picture). To avoid false starts and to support true beginners, participants were instructed to use three given key words relevant to the content of each picture (for a similar spontaneous task modality, see Munro, 2013, for a picture-naming task). To control for task familiarity, the first four picture descriptions were used as practice, and the last three descriptions were submitted to the final analyses. Although planning time was limited, there was no time limit for the picture descriptions themselves. All speech samples were individually recorded via a portable MP3 recorder and normalized for peak amplitude. The first ten seconds of each of the three picture descriptions were cut and stored as one single MP3 file per participant, so that each participant contributed roughly 30 seconds of spontaneous speech. This length of speech per participant can be considered sufficient to provide raters with enough linguistic information, in line with the standard in L2 speech research (Hopp & Schmid, 2013, for 10–20 seconds; Derwing & Munro, 2013, for 30 seconds). The task instruction and materials were deposited in IRIS (Marsden et al., 2016).
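The clip preparation described above (peak normalization, then joining the first ten seconds of each of the three analyzed descriptions) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' actual audio pipeline: recordings are represented as plain lists of amplitude values, normalization is applied per recording before cutting, and the function names and the 100 Hz toy sample rate are hypothetical.

```python
def peak_normalize(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty recording: nothing to scale
    return [s * target_peak / peak for s in samples]


def build_rating_clip(descriptions, sample_rate, seconds=10):
    """Join the first `seconds` of each normalized description into one clip."""
    keep = int(sample_rate * seconds)
    clip = []
    for samples in descriptions:
        clip.extend(peak_normalize(samples)[:keep])
    return clip


# Toy example: three 12-second "recordings" at a hypothetical 100 Hz rate.
rate = 100
recordings = [[0.5] * (12 * rate), [0.25] * (12 * rate), [-0.1] * (12 * rate)]
clip = build_rating_clip(recordings, rate)
print(len(clip) / rate)  # 30.0
```

With three descriptions of at least ten seconds each, the resulting clip is 30 seconds long, matching the per-participant sample length reported in the text.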
Raters
A total of ten native-speaking raters (six males, four females) were recruited in London (M age = 19.5 years). All of them reported that at least one of their parents/carers was an L1 English speaker and that they used English as their primary language of communication in professional, social, and home settings (M % of English use per day = 99.0%). Since the raters were living in London (a highly multilingual city) at the time of the project, they reported relatively high levels of familiarity with foreign-accented speech (M = 5.2; 1 = not at all, 6 = very much). None of them reported having prior linguistics training or hearing problems.
Procedure
All the rating sessions took place individually in a quiet room at a university in London. The speech samples were played in a randomized order via PRAAT (Boersma & Weenink, 2017). Upon hearing each sample, raters were asked to assess it on a 9-point scale for comprehensibility (1 = very difficult to understand, 9 = very easy to understand) and nativelikeness (1 = not native-like, 9 = completely native-like). Since L2 comprehensibility and nativelikeness, by definition, involve “intuitive” judgments, raters were only able to listen to each sample once (no replay button was available).
Raters first received a brief explanation of comprehensibility and nativelikeness from a trained researcher and how to make their ratings (see Supporting Information-C for training scripts). After familiarizing themselves with the picture prompts used to elicit speech, they practiced the rating procedure by using three representative samples that were not included in the main dataset (beginner, intermediate, advanced). Then, the raters proceeded to the main dataset (N = 110 L2 speakers). Raters took a five-minute intermission halfway through. An entire session lasted for approximately two hours. For the raters’ comprehensibility and nativelikeness scores, see DATASET.
Statistical Analysis Procedure
There were two potential issues in the examination of the relationship between the characteristics of the learners identified in the above manner and their L2 English speech comprehensibility and nativelikeness. First, the number of learner variables (n = 30) was fairly large, considering the number of learners (N = 110). Since our goal was to explain between-learner variability, the former should be much smaller than the latter. Second, some of the 30 learner individual difference variables were highly correlated, which could, in turn, make it difficult to separate their effects. To address both issues, all the learner variables were submitted to a factor analysis to identify latent variables underlying the 30 elicited learner variables. The factor scores were then submitted to a regression model to investigate the relationship between the factors and L2 speech comprehensibility and nativelikeness scores.
Results
Underlying Learner Variables
The first objective of the statistical analyses was to identify the factors underlying a total of 30 learner variables related to the participants’ L1 background, age, experience, motivation, awareness, and attitudes. Following Loewen and Gonulal's (2015) field-specific guidelines for analyzing factorability and determining a threshold for factor loadings, participants’ questionnaire data were submitted to a factor analysis with Direct Oblimin rotation and the principal component extraction method. Loewen and Gonulal pointed out that the cumulative percentage of explained variance reported in L2 research is relatively low (60–65%). To increase the cumulative percentage of explained variance (> 80%), the Jolliffe criterion was adopted with the eigenvalue set to 0.8. Two tests were conducted to confirm the factorability of the entire dataset: Bartlett's test of sphericity and the Kaiser-Meyer-Olkin measure of sampling adequacy. To select the practically significant factor loadings, 0.5 was used as the cut-off value.
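To make the two thresholds above concrete, the following sketch shows how an eigenvalue cut-off of 0.8 determines the number of retained factors and the cumulative explained variance, and how a 0.5 cut-off selects salient loadings. The eigenvalues, loadings, and variable names are made up for illustration, and the choice of a strict inequality for the eigenvalue threshold and an inclusive one for loadings is an assumption; the actual extraction and rotation would be carried out in statistical software.

```python
def retain_factors(eigenvalues, threshold=0.8):
    """Keep factors whose eigenvalue exceeds the threshold.

    Returns the retained eigenvalues (largest first) and the cumulative
    percentage of total variance they explain.
    """
    total = sum(eigenvalues)
    kept = [ev for ev in sorted(eigenvalues, reverse=True) if ev > threshold]
    explained = 100.0 * sum(kept) / total
    return kept, explained


def salient_loadings(loadings, cutoff=0.5):
    """Return variable names whose absolute loading reaches the cutoff."""
    return [name for name, value in loadings.items() if abs(value) >= cutoff]


# Made-up eigenvalues for six standardized variables (they sum to 6).
eigenvalues = [2.4, 1.5, 0.9, 0.6, 0.4, 0.2]
kept, explained = retain_factors(eigenvalues)
print(len(kept), round(explained, 1))  # 3 80.0

# Made-up pattern-matrix column for one hypothetical factor.
column = {"age_of_arrival": 0.81, "length_of_residence": -0.62, "l1_use": 0.12}
print(salient_loadings(column))  # ['age_of_arrival', 'length_of_residence']
```

Lowering the eigenvalue threshold below the conventional value of 1.0 retains more factors, which is what raises the cumulative explained variance in this toy example.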
The first model identified 13 factors capturing 82.3% of the variance among the 30 learner variables. Although Bartlett's test was significant (χ2 = 2067.542, p < .001), the Kaiser-Meyer-Olkin (KMO) value was relatively low (i.e., .419), suggesting that the sampling adequacy of the dataset was questionable. According to our inspection of the pattern matrix, one obvious source of confusion was the nine current experience variables, which showed a set of strong correlations with each other (r = .3–.8). Some variables were not clearly clustered into any overall factors (e.g., L1 use at work). To enhance the factorability of the dataset, we reduced the nine experience variables to two scores per participant by averaging the following subcategories across all contexts (work, social, home): (a) how much they were using their L1; and (b) how much they were using their L2 with fluent users (including L1 speakers and advanced L2 speakers).
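The averaging step described above can be sketched as follows. The field names and percentages are hypothetical, and the exact mapping from the nine raw questionnaire items to these two composites is an assumption; the sketch only illustrates averaging each subcategory across the three contexts and the resulting change in the number of predictors.

```python
CONTEXTS = ("work", "social", "home")


def average(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


def composite_scores(record):
    """Average two kinds of current-use percentages across contexts."""
    return {
        "l1_use_mean": average(record[f"l1_use_{c}"] for c in CONTEXTS),
        "l2_fluent_use_mean": average(record[f"l2_fluent_{c}"] for c in CONTEXTS),
    }


# Made-up self-reported percentages for one hypothetical participant.
participant = {
    "l1_use_work": 10, "l1_use_social": 40, "l1_use_home": 70,
    "l2_fluent_work": 80, "l2_fluent_social": 50, "l2_fluent_home": 20,
}
scores = composite_scores(participant)
print(scores)  # {'l1_use_mean': 40.0, 'l2_fluent_use_mean': 50.0}

# Replacing nine variables with two composites changes the predictor count.
n_variables = 30 - 9 + 2
print(n_variables)  # 23
```

Replacing the nine correlated items with two composites leaves 23 variables for the next factor analysis.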
The second model identified 11 factors explaining 82.5% of the variance among the 23 learner variables. We considered the factorability to be adequate according to the results of the Bartlett's test (χ2 = 1226.456, p < .001) and KMO test (.547). In conjunction with the pattern matrix summarized in Supporting Information-D, each factor was labeled as follows:
• Factor 1 was labeled as “Experience Quantity” as the items with high loadings concerned the extent to which participants had been in L2 English-speaking environments prior to the project.
• Factor 2 was labeled as “Current L2 Use” as it covered two variables related to the extent to which L2 learners used L2 (instead of L1), especially with fluent speakers at the time of the project.
• Factor 3 was labeled as “Awareness of Nativeness” as the items clustered here indexed the extent to which participants perceived the importance of nativelike use of language, phonology, and idiomatic expressions.
• Factor 4 was labeled as “Age of Immersion” as it clustered all the timing variables such as the age of arrival in English-speaking countries.
• Factor 5 was labeled as “Motivation” as it featured all the items related to participants’ motivation and concern for nativelike English pronunciation in different settings.
• Factor 6 was labeled as “Attitude to Nativeness” as it reflected the extent to which they appreciated, preferred, and had been familiarized with British English.
• Factor 7 was labeled as “EFL Experience” as it featured how early they had started learning English in the classroom setting, and for how long they had received foreign language education prior to their arrival in English-speaking countries.
• Factor 8 was labeled as “Special Past Experience” as it identified participants who had previously received linguistics training and/or had L2 English teaching experience.
• Factor 9 was labeled as “Attitude to Foreign Accents” as it captured only one learner variable (i.e., the extent to which participants liked it when others spoke English with a foreign accent).
• Factor 10 was labeled as “Comprehensibility Orientation,” which covered not only how much participants were familiar with foreign-accented English, but also the extent to which they perceived the importance of comprehensibility in successful L2 communication.
• Factor 11 was labeled as “L1 Influence” as it corresponded to the extent to which participants’ L1 background was distant from or close to L2 English (i.e., whether or not their L1 was an Indo-European language).
Factor scores were then computed with Bartlett's method, and their relationships with comprehensibility/nativelikeness ratings were visualized and analyzed (see Supporting Information-E).
Regression Modeling
To formally investigate the relationship between the factor scores of the 11 factors identified and the ratings of L2 comprehensibility and nativelikeness, we employed a Bayesian multivariate mixed-effects ordinal regression model. We opted for a Bayesian approach because it (a) allows us to estimate the full posterior distribution, which is more informative than the frequentist point estimate (Kruschke, 2014); (b) generates more intuitive metrics of uncertainty (Lambert, 2018); and (c) employs tools that allow flexible and complex modeling (e.g., Carpenter et al., 2017). Readers are referred to Lambert (2018) and Kruschke (2014) for an accessible introduction to Bayesian data analysis, as well as to Norouzian, de Miranda, and Plonsky (2018) for field-specific recommendations on the use of the Bayesian approach.
Multivariate models permit the simultaneous modeling of multiple outcome variables, such as the two kinds of ratings in the present study (see Hui, 2019). Furthermore, comprehensibility and nativelikeness ratings consist of ordered categories, and analyzing an ordinal variable with techniques assuming continuous variables causes several problems (Liddell & Kruschke, 2018). Therefore, in the present study, an ordinal regression was employed. The statistical models were fit with brms (Bürkner, 2017), an R front end to Stan (Carpenter et al., 2017). The R code is available (see RCODE).
Among multiple classes of ordinal models, we employed a cumulative model, which assumes continuous variables underlying our observed rating variables (Bürkner & Vuorre, 2019). The error term was assumed to follow a logistic distribution. The model specifically included individual ratings of comprehensibility and nativelikeness as dependent variables, 11 sets of factor scores as fixed-effects variables, and by-learner and by-rater random intercepts. The correlation between the random intercepts of comprehensibility and those of nativelikeness ratings was modeled within each random-effects factor (i.e., learner and rater). No interaction term or random slope was included because the number of predictors was relatively large for the given number of learners. Nonlinear effects were not examined for the same reason. Variable selection was not performed due to the many issues associated with the procedure (Harrell, 2015).
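To make the cumulative model with a logistic error term more concrete, the following Python sketch computes the probability of each rating category from a set of thresholds and a linear predictor. It is an illustration only, not the brms code used in the study; the threshold values and the linear predictor values are made up for the example.

```python
import math


def logistic_cdf(x):
    """Cumulative distribution function of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probabilities(thresholds, eta):
    """Return P(Y = k) for k = 1..K under a cumulative logit model.

    thresholds: K - 1 increasing cut points on the latent scale.
    eta: linear predictor (e.g., sum of coefficients times factor scores).
    The model assumes P(Y <= k) = logistic_cdf(threshold_k - eta).
    """
    cumulative = [logistic_cdf(t - eta) for t in thresholds] + [1.0]
    probabilities = []
    previous = 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


# Hypothetical thresholds for a nine-point scale (eight cut points).
example_thresholds = [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
probs_average = category_probabilities(example_thresholds, eta=0.0)
probs_higher = category_probabilities(example_thresholds, eta=1.0)
```

A larger linear predictor shifts probability mass toward the higher rating categories, which is how a positive coefficient for a factor should be read.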
For all the parameters, weakly informative prior distributions were used. Specifically, (i) standard normal distributions were specified for slope coefficients representing the effects of each factor; (ii) Student-t distributions with a mean of zero, three degrees of freedom, and a scale of ten were used for the parameters representing the threshold values of the categorization of underlying latent variables; (iii) nonnegative half Student-t priors with the same parameter values as above were employed for the standard deviations of random effects; and (iv) the LKJ distribution was specified as a prior for the aforementioned correlations of random intercepts. The posterior distribution was derived based on Hamiltonian Monte Carlo with four Markov chains with 10,000 iterations each, including 2,000 warmup iterations.
R-hat indices were all below 1.01, which suggested model convergence. Full posterior distributions are shown in Supporting Information-F. To assess the goodness of fit of the model, the ratings with the highest posterior probabilities and the observed ratings were cross-tabulated. Out of the 2,200 ratings (i.e., 110 participants × 10 raters × 2 outcome measures), the model classified 763 ratings (34.7%) into exactly the observed category among the nine rating categories. This, however, could be due to random intercepts. To isolate the effects of factor scores from random effects, we rebuilt the model so that it included only the 11 factors and compared its classification accuracy with the baseline accuracy, where we classified all the ratings into the largest category in each outcome variable (i.e., 204 ratings in Rating = 7 in comprehensibility and 179 ratings in Rating = 4 in nativelikeness). The difference in classification accuracy between the two reflects the effects of the 11 factors. The classification accuracy based on the model with 11 factors was 456 ratings (20.7%), whereas the baseline accuracy was 383 ratings (17.4%). The difference between the two proportions was significant (χ²(1) = 5.16, p = .023). Although the extra accuracy brought by the 11 factors might not look large, it is arguably still acceptable considering that much of the variability in ratings stems from the learner-rater interaction. This is exemplified by the fact that the classification accuracy is merely 35% even when between-learner and between-rater variability is perfectly accounted for by random effects, and the main predictors of the model are factor scores that do not explain the interaction. Furthermore, if an error of one rating point is allowed (e.g., a speech sample that received the rating of five and was misclassified as a six is counted as an instance of accurate classification), then the accuracy rises to 1,156 ratings (52.5%), with a baseline accuracy of 1,060 ratings (48.2%).
Therefore, the model fits the data reasonably well, and the inferences based on the model are considered to be credible.
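The classification-accuracy check described above can be sketched in Python as follows. The function names and the toy ratings are hypothetical, and the sketch is not the code used for the reported analysis; it only shows how exact accuracy, accuracy within one rating point, and the modal-category baseline are defined.

```python
from collections import Counter


def predicted_rating(category_probs):
    """Return the 1-based rating category with the highest probability."""
    best_index = max(range(len(category_probs)), key=category_probs.__getitem__)
    return best_index + 1


def accuracy(observed, predicted, tolerance=0):
    """Proportion of ratings predicted within `tolerance` points of the observed rating."""
    hits = sum(abs(o - p) <= tolerance for o, p in zip(observed, predicted))
    return hits / len(observed)


def baseline_accuracy(observed, tolerance=0):
    """Accuracy obtained by assigning every rating to the most frequent observed category."""
    modal_category = Counter(observed).most_common(1)[0][0]
    return accuracy(observed, [modal_category] * len(observed), tolerance)


# Toy data: six observed ratings and six model-based predictions.
observed = [7, 7, 4, 5, 7, 3]
predicted = [7, 6, 4, 7, 7, 5]

exact = accuracy(observed, predicted)                   # exact-category matches
within_one = accuracy(observed, predicted, tolerance=1)  # off-by-one allowed
baseline = baseline_accuracy(observed)                  # everything assigned to the mode
```

Comparing `exact` with `baseline` mirrors the comparison between the factor-only model and the largest-category baseline, and `tolerance=1` corresponds to allowing an error of one rating point.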
Table 1 shows the posterior mean and the 95% credible intervals (central posterior intervals) of each parameter. The threshold parameters represent the threshold values of categorization of the continuous latent variable assumed to underlie the ordinal outcome variable. Our focal interest concerns Factors 1 through 11. Since both the latent variables underlying the outcome variables and the factor scores are on a unit scale, the parameter values indicate the change in the latent variable in standard deviations (SD) associated with a one-SD change in factor scores. The table shows that, for both comprehensibility and nativelikeness ratings, zero fell outside of the credible intervals (CIs) for Factors 2 (Current L2 Use), 4 (Age of Immersion), 6 (Attitude to Nativeness), and 8 (Special Past Experience). Additionally, the CI of Factor 11 (L1 Influence) did not include the null effect for nativelikeness ratings. Factors 2 and 6 are positively associated with ratings, while Factors 4 and 8 are negatively associated with ratings in both outcome measures. Factor 11 (L1 Influence) is negatively associated with nativelikeness ratings (but not comprehensibility ratings).
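For readers less familiar with these summaries, the sketch below shows how a posterior mean and a central 95% credible interval can be obtained from posterior draws, and how one can check whether zero falls outside the interval. The draws are simulated for illustration and do not correspond to any parameter in Table 1; the helper names are hypothetical.

```python
import random
from statistics import mean


def quantile(sorted_values, q):
    """Quantile of pre-sorted values using linear interpolation."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def summarize_posterior(draws, mass=0.95):
    """Return the posterior mean, central interval bounds, and whether zero is excluded."""
    ordered = sorted(draws)
    tail = (1 - mass) / 2
    lower = quantile(ordered, tail)
    upper = quantile(ordered, 1 - tail)
    return {
        "mean": mean(ordered),
        "lower": lower,
        "upper": upper,
        "excludes_zero": lower > 0 or upper < 0,
    }


# Simulated draws standing in for the posterior of one slope coefficient.
rng = random.Random(1)
simulated_draws = [rng.gauss(0.4, 0.1) for _ in range(4000)]
summary = summarize_posterior(simulated_draws)
```

A coefficient whose interval excludes zero, as in this simulated case, is the kind of result reported for Factors 2, 4, 6, and 8.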
TABLE 1. Summary of the Bayesian Multivariate Mixed-Effects Ordinal Regression Model

The results of Bayesian analyses are influenced by the choice of prior distributions. In order to investigate the potential effects of priors, we rebuilt the model with different priors for slope parameters, which are the focus of this study. Specifically, we gradually increased the standard deviation of the normal distribution from 0.8 to 3, and also tested a flat prior. The results largely remained the same. The details are reported in Supporting Information-G.
Varying Strengths Across Ratings
We conducted an additional analysis on the extent to which the strength of the five prominent factors (Current L2 Use, Age of Immersion, Attitude to Nativeness, Special Past Experience, and L1 Influence) would differ depending on different levels of L2 comprehensibility and nativelikeness. The results are reported in Supporting Information-H.
Discussion
Despite much scholarly discussion directed toward the sources of individual differences in L2 speech learning in adulthood, the transparency, size, and diversity of datasets in prior work have remained problematic. To move the research agenda forward, in the present study, we took the dynamic perspective on L2 learning (including multiple independent and dependent variables; Larsen-Freeman, 2012) and the Open Science approach (making the details of our own dataset publicly available; Marsden, in press). Specifically, we first presented the dataset of speech samples and the questionnaires from 110 late L2 learners in London. Subsequently, we demonstrated how we examined the complex relationship between a total of 30 variables of learner-external and learner-internal individual differences (L1 backgrounds, age, experience, motivation, awareness, and attitudes) and two different dimensions of L2 speech proficiency (comprehensibility and nativelikeness). As reviewed earlier, the existing literature has found all the learner variables selected for this study to affect L2 speech proficiency to some degree. The primary objective of the current investigation was to reveal the relative weights of these variables by way of mixed-effects modeling analyses.
According to the results of the analyses, these between-learner variables allowed 20.7% of the ratings to be classified accurately, which we consider robust and comparable to previous research using similar mixed-effects models. Among all the associated variables, five factors showed particularly clear associations, that is, current, past, and special experience, attitude, and L1-L2 distance. In essence, L2 learners who have received higher comprehensibility scores, and, by extension, have achieved higher L2 speech proficiency levels, use L2 English on a regular basis. These L2 users interact more often with fluent (rather than nonfluent) speakers in L2 English (rather than their L1) (i.e., current experience factors). Not only have these learners arrived in an L2-speaking environment in early adulthood, entailing a longer length of immersion (i.e., age factors), but they have also had extra, professional experience related to linguistic training and L2 English teaching (i.e., special experience factors). Finally, these learners tend to engage in every L2-use-related opportunity with a more positive attitude toward the language of the community, that is, British English (i.e., learner-internal, attitude factors). To achieve more nativelike L2 speech, however, the results indicated that L1-L2 distance may play a significant role. In the case of our study, those who spoke an Indo-European language as an L1 likely showed a less detectable L2 accent and thus attained more nativelike oral proficiency (i.e., L1 influence factors).
Assuming that L2 speech proficiency develops over time on the continuum from low to advanced, the results of our cross-sectional dataset provided empirical support for the view that the comprehensibility and nativelikeness aspects of L2 speech learning involve slightly different processes. L2 comprehensibility development continues to take place during adulthood, as long as learners frequently practice a target language in various social settings (Derwing & Munro, 2013; Saito et al., 2017) with a positive attitude and orientation toward the target language and its community (Dewaele et al., 2018). Although many L2 learners strive to approximate the nativelike aspects of L2 speech, foreign accent reduction seems to be tied to factors that most learners cannot control on their own. Attaining more nativelike L2 pronunciation may be limited to certain individuals whose L1-L2 distance is relatively small (i.e., other Indo-European languages) (Bongaerts et al., 1997).
Taken together, the findings support an increasingly popular idea that L2 learning is a dynamic, complex, adaptive system within which a range of learner-external and learner-internal factors affect each other (e.g., Larsen-Freeman, 2012). Following this line of thought, we argue that it is crucial for future L2 speech research to include multiple factors related to contexts and individuals instead of examining each single variable in isolation. To tackle the topic of individual differences in any aspect of L2 learning, much caution needs to be exercised in data collection and analysis. It is important to recruit a large number of participants to maintain strong statistical power, minimize the number of independent variables via data reduction (e.g., factor analyses), and inspect the dynamic, complex link between dependent and independent variables via Bayesian multivariate mixed-effects analyses.
Although we believe our statistical data analysis is reasonable, it is certainly not the only valid way to analyze our data (see DATASET). In psychology, different analyses of a single dataset have been demonstrated to yield different results even for the same research question (Silberzahn et al., 2018). Thus, we welcome any interested readers to reanalyze our data in the way they prefer and examine any potential differences that arise between their results and ours. Together with such future analyses, we hope to collectively realize a multiverse analysis (Steegen et al., 2016), in which a single raw dataset is analyzed in a variety of ways to gain insights into how much results may change due to the (arbitrary) decisions researchers make during their data analysis (i.e., so-called “researcher degrees of freedom”; Simmons et al., 2011).
Below, we offer a few alternative, arguably equally valid means by which to analyze our dataset.
1. While we employed a factor score regression (i.e., a factor analysis followed by a regression analysis using the factor scores), one could also build a single structural equation model (SEM) that encompasses both factor and regression models. The SEM can presumably better propagate uncertainty from a measurement model (corresponding to the factor analysis) to a structural model (corresponding to the regression analysis).
2. Another approach is to use penalized regressions without relying on a factor analysis to reduce the number of predictors. Common penalized regression methods such as lasso regression and ridge regression can be viewed as regression models with regularizing prior probabilities on parameter values in a Bayesian sense. Since variables are not reduced, interpretations might turn out to be less challenging with this approach.
3. Furthermore, one could also view the analytical task as one of classification and employ machine learning techniques to predict the ratings of speech samples based on the combination of variables available, after which they could examine which variables influenced the classification.
4. Finally, one can also perform a frequentist analysis equivalent to the Bayesian analyses we performed and examine whether the results converge.
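As a small illustration of the second alternative, the Python sketch below shows, for a single predictor without an intercept, that the ridge estimate coincides with the posterior mean under a zero-centered normal prior when the penalty equals the ratio of the residual variance to the prior variance. (Lasso corresponds analogously to a Laplace prior, which is not shown here.) The data values, the function names, and the variance settings are made up for the example and are unrelated to our dataset.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope for a no-intercept model y = b * x."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return sxy / sxx


def ridge_slope(x, y, penalty):
    """Ridge slope: minimizes the sum of squared errors plus penalty * b ** 2."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return sxy / (sxx + penalty)


def posterior_mean_slope(x, y, residual_sd, prior_sd):
    """Posterior mean of b given y ~ Normal(b * x, residual_sd) and b ~ Normal(0, prior_sd)."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    precision = sxx / residual_sd ** 2 + 1 / prior_sd ** 2
    return (sxy / residual_sd ** 2) / precision


# Made-up standardized predictor and outcome values.
x = [-1.0, 0.0, 1.0, 2.0]
y = [-0.8, 0.1, 1.1, 1.9]

residual_sd, prior_sd = 1.0, 1.0
penalty = residual_sd ** 2 / prior_sd ** 2  # penalty implied by the prior

b_ols = ols_slope(x, y)
b_ridge = ridge_slope(x, y, penalty)
b_bayes = posterior_mean_slope(x, y, residual_sd, prior_sd)
```

In this toy case the ridge and Bayesian estimates are identical and both are shrunk toward zero relative to the unpenalized estimate, which is the sense in which a penalty acts as a regularizing prior.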
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0267190520000045