1. Introduction
Multilingualism has become a powerful fact of life worldwide (Edwards, 2012). In the past decade, research into the effect of multilingualism on individuals’ personality has been emerging (cf. Dewaele & van Oudenhoven, 2009; Dewaele, 2012), which is an important complement to the rich ongoing research on the cognitive consequences of multilingualism (cf. Bialystok, Craik & Luk, 2012; Valian, 2015; Bialystok, 2016). Tolerance of ambiguity (TA), a personality variable defined as “the tendency to perceive ambiguous situations as desirable” (Budner, 1962), has been examined vis-à-vis multilingualism in recent years. Most notably, Dewaele and Li (2013) examine the relationship between TA and multilingualism through a large-scale online questionnaire survey: their sample (N = 2,158) comprised participants from 204 nationalities, with the largest group coming from the USA (n = 478, 22.2% of the total sample) and the 10th largest group from China (n = 41, 1.9%). Most recently, van Compernolle (2016) expands the pioneering research of Dewaele and Li (2013) by introducing a third focal variable, “attitudes toward linguistic variation”, and explores the relationships between these focal variables; his sample (N = 379) involved respondents from 47 nationalities, with the largest group again coming from the USA (n = 234, 61.7%) and the fifth largest group from the Netherlands (n = 11, 2.9%); although information about participants of Chinese nationality was not given, the percentage of this group was at most around 2% of his sample.
In other words, multilinguals with Chinese nationality (e.g., Chinese users of English)Footnote 1 have been under-investigated, as the Chinese population accounts for 20% of the world population whereas only 2% of the samples of the above two studies were Chinese.
Partly motivated by the above gap, this exploratory study aims to examine the relationship between multilingualism and TA, by partially replicating Dewaele and Li's (2013) pioneering work on a group of Chinese users of English in an English as a Foreign Language (EFL) context. This focus brings about three benefits. First, it adds to our understanding of the psychological profiles of multilinguals in China, an under-investigated context, where the number of English users of Chinese nationality already exceeded 390 million in 2000 (Wei & Su, 2015). Second, examining a sample of multilinguals in the Chinese context represents the first attempt to examine EFL contexts, which provides new data that can complement the extant studies solely from non-EFL contexts (see also Section 2.3). Third, focusing upon a group of multilinguals with one single nationality helps enhance the methodological rigour for this new line of research vis-à-vis the construct validity of TA, as “past studies drew data from a global context and this may not be representative of a single community” (Liu, Wan, Lee & Ng, 2017).
2. Literature review
In our review of empirical studies concerning TA and multilingualism, we argue that several suffer from inadequate use of effect size and/or lack of transparency in instrument reporting. Highlighting major issues concerning effect size use and instrumentation is useful before reviewing studies about TA and multilingualism.
2.1 Methodological rigour
Using effect size, which is arguably more important than the statistical significance level (i.e., the p value) (Ellis, 2010; Larson-Hall, 2010), involves interpreting it, which is less straightforward than merely reporting it.
Although Cohen's (1988) benchmarks (e.g., for Cohen's d, .20 as small, .50 medium, and .80 large) are widely used for effect size interpretation, these are but general guidelines. As Leech, Barrett, and Morgan (2005, p. 56) note, Cohen's benchmarks, based on the effect sizes usually found in the behavioral sciences, “do not have absolute meaning and are only relative to typical findings in these areas”. It is advisable to look for typical values of effect size on the topic of interest, rather than relying on Cohen's rule of thumb (see Wei, Feng & Ma, 2017 for an example of the development of topic-specific effect-size benchmarks).
Plonsky and Oswald (2014) propose a field-specific effect-size benchmark system, and we suggest that topic-specific effect-size benchmark systems provide more nuanced guidance in interpreting the effect size in question. On the one hand, a particular field, compared with a particular topic, is broader and more difficult to define; for example, Plonsky and Oswald (2014) do not specify the scope of “L2 research”, perhaps because it in itself is “a rather difficult concept to define” (Derrick, 2016, p. 138); it is not clear to what extent the “L2 research” field covers the fields of multilingualism, psycholinguistics, and other L2-related (sub-)fields. On the other hand, Plonsky and Oswald's (2014) benchmarks (e.g., for r, .25 as small, .40 medium, and .60 large) in “L2 research” seem to be too high. They are much higher than Cohen's (1988) (e.g., for r, .10 as small, .30 medium, and .50 large), probably because the former developed their benchmarks from predominantly experiment-based primary studies, on top of the potential publication bias (Dewaele, 2005) and the “file drawer” problem (Ellis, 2010, p. 69). Experiment-based studies tend to yield higher effect sizes than survey studies, as can be inferred from two sources relevant to TA, which is both a (socio-)psychological and an individual difference variable. First, Richard, Bond and Stokes-Zoota's (2003) synthesis, which compiles results from a century of social psychological research covering 25,000 studies of eight million people and featuring a better balance of experiment-based and survey-based studies, finds an average effect size (viz. r) of .21 from this sub-field of psychology, compared with its much higher counterpart (.46) from the field of “L2 research”.
It is particularly noteworthy that several hundred primary studies concerning the topic of personality yielded an effect size (r) average equal to or smaller than .10Footnote 2 (Richard et al., 2003), namely Cohen's “small-effect” benchmark. Second, according to our preliminary survey of more recent research (e.g., Cunningham, Douglas & Boag, 2018; Steiger & Reyna, 2017)Footnote 3, some predictors in regression models explaining about 1% (equivalent to r = .10) of the variance in the dependent variable are regarded as important or “significant predictors”. In a recent paper published in Personality and Social Psychology Bulletin, Sawyer and Gampa (2018), using 0.1% (roughly equivalent to r = .03) as a socially meaningful benchmark to interpret effect sizes, find several statistically significant predictors explaining less than 1% of the variance in the dependent variable. This suggests that 1%-variance-accounted-for variables cannot be simply dismissed as negligible, and their effects need to be interpreted in the context of the current understanding of the topic. Hence interpreting effect sizes should be topic-specific. The pioneering work on TA by Dewaele and Li (2013) has used effect size to some extent (see Section 2.2 for suggestions). Before we propose a topic-specific effect-size interpretation system (Section 4.3), we will tentatively draw upon Cohen's (1988) system.
Reliability and validity measures are classical tools in instrumentation (Chapelle & Duff, 2003; Mahboob, Paltridge, Phakiti, Wagner, Starfield, Burns, Jones & De Costa, 2016). Cronbach's alpha (a measure of internal consistency) is the most frequently used reliability index (Derrick, 2016).
A Cronbach's alpha analysis performed on a particular scale assumes that the scale is unidimensional. To check this assumption, exploratory factor analysis is useful. In addition, exploratory factor analysis, which provides information on the construct validity of the instrument (for other types of validity, see Messick, 1995; Brown et al., 2015), should be implemented when examining the factorial structure of an established scale in new cultures (cf. Kim et al., 2011). As Koh, Chang, Fung, and Kee (2007, p. 227) warn, the validity of the scales developed in the West, such as the TA scale by Herman, Stevens, Bird, Mendenhall and Oddou (2010), is “often questionable when they are transported outside of their native land” or context.
2.2 Multilingualism and TA
In multilingualism research, TA has been one of the most frequently examined psychological variables (for others such as extraversion, see Dewaele, 2005; 2012).
An individual with higher TA tends to demonstrate higher ability to (1) take in new information; (2) hold contradictory or incomplete information; and (3) adapt in response to the new information or experience (Ehrman, 1993). TA is highly relevant to second/additional language (L2) learning, which “is often seen as ambiguous” (van Compernolle, 2017, p. 319) because it involves the appropriation of new and/or modified patterns of language and meaning that are usually unfamiliar and complex to the learner. High TA has been considered “essential” to successful L2 learning ever since Rubin's (1975) “good language learner” study (Dörnyei & Ryan, 2015, p. 32), in which it is posited that a “good language learner is. . . comfortable with uncertainty. . . and willing to try out his guesses” (p. 45).
In the field of multilingualism, TA has been measured with the TA Scale developed by Herman et al. (2010), whereas in related areas (e.g., L2 learning) this personality trait has been assessed frequently with other instruments such as Ely's (1995) Second Language TA Scale (see, e.g., Dewaele & Ip, 2013) and original items by the researchers (e.g., Thompson & Lee, 2013). Herman et al.’s (2010) instrument, developed in what is essentially an English-as-a-native-language or ESL context, is described as “a conceptually clear, internally consistent assessment tool” (p. 60), which is a “refined measure” demonstrating “its improved utility” over Budner's (1962) classic TA inventory (p. 62). Unfortunately, the validity of Herman et al.’s (2010) TA scale has not been fully explored with different contexts and/or populations (see also Section 3.3).
Dewaele and Li's (2013) seminal research on TA and multilingualism draws upon a large group of multilinguals through an Internet-based English-medium questionnaire survey. To assess the respondents’ multilingualism and TA, they use a global measure of multilingualism (GMM), viz. “the sum of oral and written knowledge in various languages” (p. 232) and a slightly adapted version of Herman et al.’s (2010) TA instrument. Dewaele and Li (2013) categorise the link between GMM and TA as “weak/small”, although not explicitly using Cohen's (1988) benchmark. Specifically, these authors report that (p. 236):
A one-way ANCOVA with age as a covariate showed that global self-perceived proficiency had a small but significant effect on TA (F(2, 1978) = 6.0, p < .003, η² = .008). Age was a significant covariate (F(1, 1978) = 15.1, p < .0001, η² = .008). Post-hoc pairwise comparisons, with Bonferroni correction, showed that for global self-perceived proficiency the TA scores of the “Low” group were significantly lower (p < .002) than those of the “High” group. No significant difference emerged between the Low and Medium group, nor between the Medium and the High groups.
We propose three solutions to overcome shortcomings in reporting and interpreting the ANCOVA results. First, although post-hoc pairwise comparisons are useful, they need to be accompanied by effect size. For example, after mentioning that the difference between the TA scores of the “Low” GMM group and those of the “High” group is statistically significant, it is more important to supply an effect size (e.g., r as suggested by Field, 2009 Footnote 4). The absence of the term “statistically” can easily result in an erroneous impression that the results are important. Many scholars in psychology (e.g., Carver, 1993) have argued that the term “statistically” must always precede the word “significant”Footnote 5.
Second, it is not enough to simply report that no statistically “significant difference emerged between the Low and Medium group [sic.]” because even when the result is not statistically significant (or “statistical” in Larson-Hall's terms) the effect size can be large. Therefore, an effect size index should be reported, along with the exact p value, regardless of whether the result is statistically significant or not.
Third, the wording “small but significant” potentially diminishes the importance of the finding in Dewaele and Li's (2013) work. Interpreting the effect size (η² = .008) as “small” with generic labels (e.g., “medium” and “large”) without a reference is a common methodFootnote 6 to interpret effect size in quantitative studies. But a few interpretative statements, regarding the seemingly “small” η² value, would have been useful.
In Cohen's (1988) benchmark system, this effect size (.008) fell below the benchmark (.01) for the so-called “small” effect, suggesting that GMM accounted for .8% of the variance in TA. Although this value was rather “small” according to Cohen's rule of thumb, his generic labels “do not have absolute meaning” (Morgan, Leech, Cloeckner & Barrett, 2004, p. 90). This effect size, found in Dewaele and Li's (2013) study, may serve as a useful starting point to examine to what extent this value is typical for the effects of sociobiographical variables on TA, so that an effect size interpretation system for this particular topic can be developed.
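The correspondence between a proportion of variance explained and r that is used in Section 2.1 and in the paragraph above is simple arithmetic: a squared index (R² or η²) is a proportion of variance, and its square root is on the r metric. The Python sketch below is illustrative only. The function names are hypothetical helpers, and treating the square root of η² as comparable to r is an assumption that holds only loosely for designs with more than two groups.

```python
import math


def r_from_variance_explained(proportion):
    """Return the r-metric value whose square equals the given proportion of variance."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return math.sqrt(proportion)


def variance_explained_from_r(r):
    """Return the proportion of variance that corresponds to a correlation r."""
    return r ** 2


if __name__ == "__main__":
    # Proportions quoted in the text: 1%, 0.1%, and eta squared = .008.
    for label, proportion in [
        ("1% of variance", 0.01),
        ("0.1% of variance", 0.001),
        ("eta squared = .008", 0.008),
    ]:
        r = r_from_variance_explained(proportion)
        print(f"{label}: r is about {r:.3f}")
```

Read this way, an η² of .008 sits just below the r = .10 level that Cohen labels “small”, which is why the squared index alone can understate an effect.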
van Compernolle's (2016) survey, a quasi-replication of Dewaele and Li's (2013) study, also confirms a link between multilingualism (as measured by GMM) and TA, based on a Spearman rho of .19 (p < .0002). Although the correlation coefficient itself (e.g., Spearman rho) represents effect size, “many who use it may not be aware that it is an effect size index” (Ellis, 2010, p. 11). Consequently, the above effect size value (.19) was unfortunately not used to compare with its counterpart (i.e., η² = .008) from Dewaele and Li's (2013) article. Furthermore, no information about the reliabilityFootnote 7 or validity of the instrument based upon Herman et al. (2010) was provided.
Liu et al.’s (2017) survey of 132 undergraduate students in Singapore is a recent partial replication of Dewaele and Li's (2013) study. These authors claim that “No significant correlation between global proficiency on TA was found, p = .196”. This Singaporean survey is a useful replication study of participants from a single nationality in a non-EFL context. However, it has two shortcomings. First, the data analysis concentrated on the p value, without mention of effect sizes.
Secondly, the Singaporean study fails to report validity and reliability. It misses a valuable opportunity to explore whether the situation “internal consistency of the four dimensions was not sufficiently robust to allow separate use” (Herman et al., 2010, p. 61) occurs, which helps assess the applicability of Herman et al.’s (2010) four-facet TA construct.
The inadequate use of effect size, lack of attention to reliability and validity issues, as well as over-reliance upon participants from non-EFL contexts in previous studies all suggest the need for further (partial) replication studies based on Dewaele and Li's (2013) work.
3. The study
3.1 Research questions
The present study is motivated by the gap concerning TA and multilingualism in under-investigated EFL contexts and the need for stronger methodological rigour in multilingualism research and beyond. It pursues the following questions:
RQ1. What are the underlying factors of the TA scale in the Chinese EFL context?
RQ2. To what extent does the sociobiographical variable, multilingualism (operationalised as GMM), affect TA?
RQ3. To what extent do selected sociobiographical variables other than multilingualism (viz. gender, education, number of languages known, and length of stay abroad) affect TA?
3.2 Participants
A total of 260 Chinese (186 females, 74 males) participated in the present study, ranging from age 18 to 35 (mean = 22.7). Most respondents (n = 195) had or were working towards bachelor degrees, 63 master, and two PhD degrees. Most participants (n = 160) had no experience of living abroad; those with such experience spent an average of 19.63 months (min.: 0.5 month and max.: 14 years; median = 12, mode = 6 months) abroad.
An overwhelming majority (n = 209) of the participants reported to be bilingual, with Chinese as their L1; the others were 41 trilinguals, seven quadrilinguals, two pentalinguals and one sextalingual. The most frequent L2 was English (n = 259) and only one respondent reported Korean as L2. Japanese (n = 21) was the most frequent L3, followed by French (n = 13), Korean (n = 7), Russian (n = 3), Spanish (n = 2), German (n = 2) and Portuguese (n = 1). In terms of L4, Japanese (n = 3) and French (n = 3) came first, with Korean (n = 2), Polish (n = 1) and Italian (n = 1) following. The pattern for L5 was Japanese (n = 1), French (n = 1) and Korean (n = 1). The only L6 reported was German.
3.3 Instrument
The instrument started with a sociobiographical section comprising conventional questions (e.g., gender, age, education level and length of stay abroad) and a global measure of multilingualism (GMM). The GMM was slightly adapted from the version developed by Dewaele and colleagues (Dewaele & Li, 2013; Dewaele & Stavans, 2014), which has been used in recent studies (e.g., van Compernolle, 2016; 2017). Dewaele and colleagues’ original GMM referred to the sum of self-perceived proficiency scores for oral (maximum score 5) and written proficiency (maximum score 5) collected on five-point Likert scales in up to six languages. One major benefit of such a measure is that it is “potentially useful to distinguish sextalinguals with limited knowledge of three languages from trilinguals with advanced knowledge of three languages” (Dewaele & Li, 2014). Dewaele et al.’s GMM thus avoids the lack of clarity inherent to labels such as “bilingual, trilingual”, where every language is included, despite the fact that knowledge in some can be very limited. Our only modification of GMM was that the original five-point Likert scale was changed into a nine-point system to elicit more refined linguistic profiles, as many of our respondents were familiar with the nine-point system used in the IELTS test.
Participants’ TA was assessed with the TA scale adapted from Herman et al. (2010). The original version was a 12-item questionnaire with five-point Likert scales (1 = “strongly disagree” to 5 = “strongly agree”). It was piloted among 73 Chinese multilinguals. A subsequent reliability analysis of the TA scale revealed that one item dragged the overall Cronbach alpha value down to below .60 (viz. .564). With that item removed, the Cronbach alpha for the pilot test reached .657. Therefore, this item was removed from the final version of the questionnaire; this deleted item in the present study was different from the one deleted in Dewaele and Li's (2013) study (see Appendix 1). Based on feedback from the participants, some minor stylistic adaptations were also made in the final version of the questionnaire.
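The “alpha if item deleted” check described above can be sketched with the standard formula for Cronbach's alpha, α = k/(k − 1) × (1 − Σ item variances / variance of the total score). The Python snippet below is a minimal illustration rather than the SPSS procedure used in the study; the function names are hypothetical and the three-item toy responses are invented purely to show one item pulling alpha down.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def alpha_if_item_deleted(rows, item_index):
    """Recompute alpha after dropping the item at item_index from every respondent."""
    reduced = [row[:item_index] + row[item_index + 1:] for row in rows]
    return cronbach_alpha(reduced)


if __name__ == "__main__":
    # Invented toy data: items 0 and 1 agree perfectly, item 2 does not.
    toy = [[1, 1, 3], [2, 2, 1], [3, 3, 2]]
    print("alpha, all items:", cronbach_alpha(toy))
    for j in range(3):
        print(f"alpha without item {j}:", alpha_if_item_deleted(toy, j))
```

An item whose removal raises alpha noticeably, as the third toy item does here, is the kind of item that the pilot analysis flagged for deletion.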
3.4 Procedures
The anonymous questionnaire was an open-access survey on Wenjuanwang.com, a free China-based survey provider similar to SurveyMonkey.com. Our survey design and questionnaire received ethical clearance from our affiliation. The questionnaire was advertised through several social media platforms. After the pilot-testing, the revised questionnaire was online between January and April 2016 and attracted 260 valid respondents. Unlike Dewaele and Li's (2013) survey, which attracted 2,158 monolinguals and multilinguals, ours did not involve monolinguals because, for all valid respondents, the language of the questionnaire, viz. English, was their foreign language.
Because some respondents left occasional questions blank, the subsample sizes for several variables may vary in the dataset. The dataset was imported into the software package SPSS 22.0 to perform the major statistical procedures.
3.5 Data analysis
RQ1 “What are the underlying factors of the TA scale” was addressed using exploratory factor analysis and reliability analysis. Exploratory factor analysis, rather than its confirmatory counterpart, was chosen because no prior expectations were held regarding the number and nature of underlying factors of the TA scale in the Chinese EFL context.
RQ2, enquiring into the extent to which multilingualism affects TA, was addressed with both ANOVA and regression. The ANOVA corresponded to Dewaele and Li's (2013) approach to address a similar question by creating three groups of participants with low, medium and high levels of multilingualism. We also followed Plonsky and Oswald's (2017) suggestion that “regression can do everything ANOVA can do, and more”, together with their caution that “taking a continuous variable and artificially dividing it into two or more groups is a serious mistake”. When using ANOVA, we provided a more refined analysis by providing an effect size (r) for each pair-wise comparison (cf. Field, 2009). The absolute value of r ranges between 0 and 1 (the bigger the value, the larger the effect) whereas the squared effect sizes (e.g., an eta squared) give “an underestimated impression of the strength or importance of the effect” (Morgan et al., 2004, p. 90). We hope that, taken together, these two statistical procedures make the results more intelligible.
RQ3, enquiring to what extent selected sociobiographical variables other than multilingualism affect TA, was answered with hierarchical regression, as this statistical procedure helps ascertain the contribution of each predictor variable (Larson-Hall, 2016).
4. Findings and discussion
4.1 The factorial structure of the TA scale
To answer RQ1, the assumptions for factor analysis were first checked. The factorability of the data was checked through the KMO test (.698) and Bartlett's test of sphericity (χ²(55) = 437.366, p < .0005). These tests and the sample-size-to-variables ratio (23.6) showed that the dataset was appropriate for factor analysis. Principal components analysis was selected for the factor extraction method, and the direct oblimin rotation was used because it was assumed that the factors would be correlated, which is typical “for naturalistic data, and certainly for any data involving humans” (Field, 2009, p. 644). To extract the most appropriate number of factors, both the Kaiser criterion of using eigenvalues over 1 and the visual inspection of a scree plot were employed. A cut-off point of .40 was adopted for factor loadings (cf. Field, 2009).
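Two of the figures reported above follow directly from the number of items and respondents: the degrees of freedom of Bartlett's test equal the number of distinct item pairs, p(p − 1)/2, and the sample-size-to-variables ratio is the number of cases divided by the number of items. The short Python check below is a sketch with hypothetical function names; it assumes the ratio was computed from all 260 respondents and the 11 retained items.

```python
def bartlett_df(n_items):
    """Degrees of freedom of Bartlett's test of sphericity: number of distinct item pairs."""
    return n_items * (n_items - 1) // 2


def cases_per_variable(n_cases, n_items):
    """Sample-size-to-variables ratio used as a rough adequacy check."""
    return n_cases / n_items


if __name__ == "__main__":
    # Assumed inputs: 11 retained TA items and 260 valid respondents.
    print("Bartlett df:", bartlett_df(11))
    print("Cases per variable:", round(cases_per_variable(260, 11), 1))
```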
Three factors were extracted, accounting for 50.3% of the variance in TA scores (see Appendix 1). The most important finding is that only one factor extracted in this study corresponded to the factorial structure of TA in Herman et al. (2010). This factor, comprising Items 3, 7 and 8, was named “TA core” here, although it had been named “challenging perspectives” by Herman et al. (2010). This name highlights that it may be the very part of TA that could be found across different cultural contexts. No further efforts were made to name the other two extracted factors because of the exploratory nature of this study in EFL contexts. Future studies replicating the TA part of the present study are needed to ascertain to what extent the “TA core” factor is present with different samples of multilinguals in EFL contexts.
A reliability analysis revealed that the Cronbach alpha measure (.30) for the overall TA scale (based upon the 11 items in Appendix 1) was not sufficiently robust to allow the use of the total score to denote TA. However, the internal consistency for the TA core factor (Cronbach alpha = .64) was acceptable, whereas the internal consistencies for the other two factors (.38 and .41 respectively for Factors 1 and 2, see Appendix 1) were not robust enough for separate use. Therefore, in later analysis, TA was denoted by the TA core factor, viz. the average of the scores on Items 3, 7 and 8. The higher the TA score (possible range: 1–5), the higher level of tolerance towards ambiguity that the participant had.
4.2 Multilingualism and TA
Following Dewaele and Li (2013), to answer RQ2 with ANOVA, participants were first divided into three groups (low, medium, high) based on their GMM scores. The participants with scores that were more than 1 standard deviation below the GMM average (M = 29.28, SD = 6.113) were categorised into the “Low” GMM group (n = 31), those with scores that were more than 1 standard deviation above this average into the “High” group (n = 35), and the remaining participants into the “Medium” group (n = 188). A one-way ANOVA test (F(2, 251) = 2.490, p = .085) revealed that these between-group differences in TA scores were not statistically significant, but the effect size (partial eta squared = .019, R² = .019), after rounding, reached Cohen's (1988) small benchmark for R² (namely .02). To probe further where the differences lay, a series of follow-up t-tests showed that the largest difference (r = .33) lay between the Low and High GMM groups, exceeding Cohen's (1988) medium benchmark (r = .3), the second largest difference (r = .23) existed between the Medium and High GMM groups, and the difference (r = .09) between the Medium and Low GMM groups was relatively small, failing to reach Cohen's (1988) small benchmark (r = .1) (see Table 1).
Table 1. TA by GMM groups (ANOVA).
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191031034907774-0341:S1366728918000998:S1366728918000998_tab1.gif?pub-status=live)
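The three-group split described in this section can be written as a simple rule on the GMM score. The Python sketch below is illustrative: the function name is hypothetical, the default mean and standard deviation are the sample values reported above, and assigning scores exactly one standard deviation from the mean to the “Medium” group is an assumption, since the text only specifies “more than 1 standard deviation”.

```python
def gmm_group(score, mean=29.28, sd=6.113):
    """Classify a GMM score as Low, Medium or High using a one-standard-deviation rule."""
    if score < mean - sd:
        return "Low"      # more than 1 SD below the sample mean
    if score > mean + sd:
        return "High"     # more than 1 SD above the sample mean
    return "Medium"       # within 1 SD of the mean (boundary handling is an assumption)


if __name__ == "__main__":
    # Hypothetical scores chosen only to show one case per group.
    for score in (23, 30, 36):
        print(score, "->", gmm_group(score))
```

A split of this kind discards within-group variation in GMM, which is the concern behind also analysing GMM as a continuous predictor.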
Using the ANOVA procedure to answer RQ2 yielded measures readily comparable with those from Dewaele and Li (2013). The patterns we found amongst the three GMM groups, in terms of their mean TA scores, are consistent with those from Dewaele and Li (2013); for example, the largest difference lay between the Low and High GMM groups, which was also reported by Dewaele and Li (2013). But more importantly, we also addressed RQ2, following Plonsky and Oswald's (2017) above-cited suggestion (see Section 3.5), with a simultaneous regression analysis.
Prior to performing the simultaneous regression with the continuous variable GMM as the predictor for TA, we checked that the relevant assumptions (e.g., linearity) had been met. The results show that GMM did not statistically significantly predict TA (F(1, 252) = 3.602, p = .059), but its effect on TA (R² = .014, accounting for 1.4% of the TA variance), again, was close to Cohen's (1988) small benchmark.
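The effect sizes reported for the ANOVA and the regression can be recovered from the F statistics and their degrees of freedom through the standard identity η² (or R²) = df₁F / (df₁F + df₂). The Python sketch below applies it to the two F values given in this section; the function name is a hypothetical helper, and for a one-way design with a single factor partial η² and η² coincide.

```python
def variance_explained_from_f(f_value, df_effect, df_error):
    """Proportion of variance explained implied by an F statistic and its degrees of freedom."""
    explained = f_value * df_effect
    return explained / (explained + df_error)


if __name__ == "__main__":
    # One-way ANOVA on the three GMM groups: F(2, 251) = 2.490.
    anova_eta_sq = variance_explained_from_f(2.490, 2, 251)
    # Simple regression with GMM as the only predictor: F(1, 252) = 3.602.
    regression_r_sq = variance_explained_from_f(3.602, 1, 252)
    print(f"ANOVA eta squared: {anova_eta_sq:.3f}")
    print(f"Regression R squared: {regression_r_sq:.3f}")
```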
In short, the answer to RQ2 is that the effect sizes reflecting the influence of multilingualism on TA explained 1.4% to 1.9% of the TA variance.
Our finding is not in conflict with Dewaele and Li's (2013) finding that GMM “had a small but significant effect on TA”. Their p was “< .003” (based on nearly 2,000 participants) and ours “.059” (based on around 250 participants); with a large enough sample size, the p value would always drop below the (arbitrary) conventional level of statistical significance (.05). In the words of authorities on statistics, “surely, God loves the 0.06 nearly as much as the 0.05” (Rosnow & Rosenthal, 1989, p. 1277). Our finding of p = .059 is a case in point to underscore the importance of reporting effect sizes. Although this p value was slightly higher than the conventional statistical significance level adopted (.05), this does not diminish the importance of the result. Readers are advised not to over-emphasise the p value, which is “highly dependent on the sample size” (Mackey & Gass, 2015, p. 396); in comparison, however, effect sizes do not fluctuate much with the sample size and hence merit more attention. In connection with Dewaele and Li's (2013) labelling the effect of multilingualism on TA as “small”, with more findings concerning the effects of other sociobiographical variables (see RQ3), we will argue below that it would be more useful to develop a topic-specific effect size interpretation system and label the effect of GMM differently.
4.3 Other selected sociobiographical variables and TA
A preliminary analysis was conducted, after the regression assumptions (e.g., normality and homoscedasticity) had been checked, to explore whether the variables of interest in RQ3 could be used as predictors and, if so, in what sequence they should enter the later hierarchical regression. The preliminary analysis confirmed that all three non-continuous variables and one continuous variable in RQ3 could be used as predictors. Firstly, two independent-samples t-tests demonstrated statistically significant differences of small-to-medium magnitude between males and females (p = .056Footnote 8, r = .12), and between bilinguals and “multilinguals”Footnote 9 (p = .007, r = .26). Specifically speaking, females (M = 4.239, SD = .864, n = 181) scored higher than males (M = 4.005, SD = .927, n = 73), and “multilinguals” (M = 4.420, SD = .645, n = 50) higher than bilinguals (M = 4.111, SD = .928, n = 204). In other words, both of these non-continuous variables deserved theoretical priority in later regression analysis, where “gender” followed the entry of “number of languages known” because the latter had been shown to be a statistically significant predictor by Dewaele and Li (2013). Secondly, the mean difference between “bachelor degree holders and below” (M = 4.200, SD = .905, n = 190) and those with higher education qualifications (M = 4.089, SD = .832, n = 64) was not statistically significant (p = .386), but the effect size r was .055. The very large p value and the relatively small r led to the tentative hypothesis that “education” would not be a statistically significant predictor for TA; however, this r value, after rounding, still met Cohen's (1988) “small” effect size threshold (.1) and this present study is exploratory in nature because it represents the first attempt in an EFL context to explore the relationship between “education” and TA.
Based on these considerations, “education” was also retained for later regression analysis, so as to test the above tentative hypothesis and ascertain its unique contribution to TA. Thirdly, the continuous variable “length of stay abroad” correlated weakly and negatively with TA (r = −.057), and this association was not statistically significant (p = .370). Based on considerations similar to those for “education”, “length of stay abroad” was also included in later regression analysis to explore its unique contribution to TA.
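The effect size r for a two-group comparison can be recovered from the reported group means, standard deviations, and group sizes. The Python sketch below assumes the common conversion r = sqrt(t² / (t² + df)); the text does not state which formula or which t-test variant was used, so the helper names (`r_from_t`, `pooled_t`, `welch_t`) and the choice of a pooled-variance test for gender and a Welch (unequal-variance) test for number of languages are assumptions made for illustration. Under these assumptions the sketch approximately recovers the reported values of r = .12 and r = .26.

```python
import math


def r_from_t(t: float, df: float) -> float:
    """Convert a t statistic and its degrees of freedom to the effect size r."""
    return math.sqrt(t * t / (t * t + df))


def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Student's t-test from summary statistics (equal variances assumed)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df


def welch_t(m1, sd1, n1, m2, sd2, n2):
    """Welch's t-test from summary statistics (equal variances not assumed)."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2
    se = math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return (m1 - m2) / se, df


# Summary statistics as reported in the text above.
gender_r = r_from_t(*pooled_t(4.239, 0.864, 181, 4.005, 0.927, 73))
languages_r = r_from_t(*welch_t(4.420, 0.645, 50, 4.111, 0.928, 204))

print(f"gender: r = {gender_r:.2f}")
print(f"number of languages known: r = {languages_r:.2f}")
```

Because the inputs are rounded summary statistics, small discrepancies from values computed on the raw data are to be expected.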
Table 2 provides the model summary results for the hierarchical regression predicting TA (see Appendix 2 for detailed findings). Each block added statistically significantly to the prediction of the outcome variable (p being .027, .017, .028, and .041, respectively for Blocks 1, 2, 3 and 4). The ΔR² column in Table 2 summarises the most important findings: (1) “number of languages known” alone accounted for 1.9% of the variance in TA whereas “gender” accounted for 1.3%, which nearly met the so-called “small” benchmark in Cohen's (Reference Cohen1988) system (2%, 13%, and 26% being the small, medium, and large benchmarks); (2) in contrast, the net contributions to the variance in TA by “education” and “length of stay abroad” were .4% and .3%, which were negligible according to Cohen's (Reference Cohen1988) system. Here we have two further examples to illustrate that the p value should not overshadow effect size; although the p values (.028 and .041) fell below the conventional level of statistical significance, it was the effect size value that revealed how important these two variables were in predicting TA.
Table 2. Hierarchical Regression Predicting TA: Model Summary.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191031034907774-0341:S1366728918000998:S1366728918000998_tab2.gif?pub-status=live)
Note: For Models 2, 3 & 4, the variable underneath ‘Model’ indicates that it is the newly added variable in this particular model, whereas for Model 1, the variable mentioned is the only predictor in this regression model.
Our finding in Table 2 that “number of languages known” explained 1.9% of the variance in TA is consistent with Dewaele and Li's (Reference Dewaele and Li2013) finding that this sociobiographical variable accounted for .09% (p. 236). Another consistent finding is that “education” exerted negligible influence upon TA (i.e., explaining only .4% of the variance), which echoes Dewaele and Li's (Reference Dewaele and Li2013) result that this biographical variable has no effect on TA.
There are two inconsistencies in our findings when compared with previous research. Firstly, “Stay Abroad”, one of the two statistically significant predictors of TA in Dewaele and Li (Reference Dewaele and Li2013), explained 1.4% of the variance. However, its counterpart “length of stay abroad” in our results explained only .3% of the variance and was not statistically significant, suggesting that this sociobiographical variable did not affect TA. This difference could be attributed to the large disparity in stay abroad experience between the two samples: our sample comprised young multilinguals with much shorter stay abroad experience; for example, 17% of the participants in this study had stayed abroad for three months or less; in Dewaele and Li's (Reference Dewaele and Li2013) study, this subgroup (totalling 568 and accounting for 28.557% of the valid 1,989 respondents) was considered as people “who had not lived abroad” (p. 236), suggesting that the majority of their sample had much longer stay abroad experience. Secondly, while Dewaele and Li (Reference Dewaele and Li2013, p. 235) report that gender exerted “a complete absence of effect on TA”, in our study “gender” explained 1.3% of the TA variance, suggesting that gender may be an important predictor for future research. This gender difference merits future efforts to find out whether it exists in other samples of Chinese multilinguals or in samples of the same nationality in another cultural context.
Given the topic-specific nature of effect size interpretation as discussed in Section 2.1, we propose to develop a benchmark system specifically for interpreting the effects of sociobiographical variables (e.g., GMM) on TA, although we have adopted Cohen's (Reference Cohen1988) system thus far. As Dewaele and Li's (Reference Dewaele and Li2013) study utilised a very large sample and the two identified statistically significant predictors for TA could respectively explain slightly more than 1% of the TA variance, it could have been proposed that R² = .01 be a typical (or medium) effect size for this line of research. This proposal receives further support from the present study, where the important predictors for TA again respectively accounted for slightly more than 1% of the TA variance; furthermore, an added benefit of using .01 as a benchmark for R² is that it corresponds to one commonly used benchmark of its unsquared counterpart (r = .1) in Cohen's (Reference Cohen1988) traditional system. Based on our findings concerning the respective contribution (viz. below .5% of the TA variance) of two sociobiographical variables to TA, we further propose that R² = .005 be a small benchmark of effect size. We propose to use R² = .02, which in Cohen's (Reference Cohen1988) system denotes a small effect, as the benchmark for a “large” effect size for interpreting the influence of sociobiographical variables on TA. Theoretically, R² = .09, which corresponds to another commonly used benchmark of its unsquared counterpart (r = .3) in Cohen's (Reference Cohen1988) system, can be used as a benchmark for a “very large” effect size.
In conclusion, we propose that .005, .01, .02, and .09 be used respectively as the small, typical (medium), large, and very large benchmarks for the effect size R² when interpreting the influence of sociobiographical variables on TA. For example, in the present study, the respective contributions to TA of the two components of GMM, number of languages known and gender, each exceeded .01; specifically, they accounted for 1.3–1.9% of the TA variance. According to the proposed system, as these effects exceeded the typical benchmark (1% of the variance accounted for), these variables can be regarded as important predictors for TA. This benchmark system could potentially be applied to similar lines of survey research focusing upon psychological factors other than TA.
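The proposed benchmarks can be expressed as a simple lookup, sketched below in Python. The function name `label_r_squared`, the “below small” label, and the convention that a value at or above a threshold receives that threshold's label are assumptions of this sketch; the thresholds and the ΔR² values are those given in the text.

```python
# Proposed R-squared benchmarks for sociobiographical effects on TA,
# listed from the highest threshold to the lowest.
PROPOSED_BENCHMARKS = [
    (0.09, "very large"),
    (0.02, "large"),
    (0.01, "typical (medium)"),
    (0.005, "small"),
]


def label_r_squared(r_squared: float) -> str:
    """Return the label of the highest proposed benchmark that is met."""
    for threshold, label in PROPOSED_BENCHMARKS:
        if r_squared >= threshold:
            return label
    return "below small"


# Variance in TA explained by each predictor, as reported for Table 2.
delta_r_squared = {
    "number of languages known": 0.019,
    "gender": 0.013,
    "education": 0.004,
    "length of stay abroad": 0.003,
}

for predictor, value in delta_r_squared.items():
    print(f"{predictor}: {value:.3f} -> {label_r_squared(value)}")
```

Under this reading, “number of languages known” and “gender” meet the typical (medium) benchmark, whereas “education” and “length of stay abroad” fall below the small one.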
5. Conclusion
The present study has built upon earlier work on multilingualism and TA by focusing upon multilinguals from one particular nationality. The findings from the Chinese EFL context attest to the limitation of the TA scale originally developed by Herman et al. (Reference Herman, Stevens, Bird, Mendenhall and Oddou2010) and support the need for future research on the core of TA in different cultural contexts. Specifically, the TA core identified in this study only contained three items from the original scale, which was claimed to be a “conceptually clear, internally consistent assessment tool” (Herman et al., Reference Herman, Stevens, Bird, Mendenhall and Oddou2010, p. 60). It would be interesting to see to what extent this TA core can be found with multilingual samples in other replication studies, the value of which is increasingly being recognised (Marsden, Morgan-Short, Thompson & Abugaber, Reference Marsden, Morgan-Short, Thompson and Abugaber2018).
In connection with methodological improvements, three suggestions are proposed for future studies. The first is that, for the sake of greater transparency in instrumentation, researchers should always report the reliability and validity information of their instruments (cf. Derrick, Reference Derrick2016). The second suggestion advocates more adequate use of effect sizes, and a correspondingly lower reliance upon the significance level (viz. the p value), the use of which has recently been banned by the editors of Basic and Applied Social Psychology (Trafimow & Marks, Reference Trafimow and Marks2015). This journal-wide ban on the use of p values represents a natural progression of the long-standing critiques of null hypothesis significance testing (which generates p) and a strong call for the employment of more robust statistics (e.g., effect size) in our reporting practices. Concurring with Ellis (Reference Ellis2010, p. xiv), who predicts that “If history is anything to go by, statistical reforms adopted in psychology will eventually spread to other social science disciplines”, we firmly believe that multilingualism research (and the wider field of applied linguistics) will soon be one of the disciplines in Ellis’ prediction. The above-proposed effect size benchmarks can be fruitfully employed in studies exploring the effects of sociobiographical variables (e.g., GMM) on TA and possibly on other psychological variables. The third suggestion encourages the use of different measures of multilingualism to examine TA and other psychological variables. To facilitate comparison, this study employed a revised GMM from Dewaele and Li (Reference Dewaele and Li2013) as a measure of multilingualism; besides GMM, there are other equally useful measures (e.g., Thompson & Khawaja's (Reference Thompson and Khawaja2016) operationalisation of multilingualism).
Despite its substantive and methodological contributions, this study has two major limitations. First, it employed the original English-language version of the TA scale, so the results were derived from participants with relatively high proficiency in English. Further research needs to explore the TA of a wider multilingual population, possibly by indigenising the TA instrument through translation and back-translation. Second, this study collected data through an online questionnaire, which has its inherent limitations despite its many advantages (Wilson & Dewaele, Reference Wilson and Dewaele2010). It is not clear whether data collected in a more “closed” paper-and-pencil environment, from which the important index of “response rate” can be calculated, would yield different findings. This merits future research efforts.
Appendix 1. Factor Analysis of the TA Scale.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191031034907774-0341:S1366728918000998:S1366728918000998_tab3.gif?pub-status=live)
Appendix 2. Hierarchical Regression Predicting TA: Results.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191031034907774-0341:S1366728918000998:S1366728918000998_tab4.gif?pub-status=live)