The discipline of pedagogy-oriented applied linguistics has witnessed a proliferation of meta-analytic reviews in recent years (e.g., Lee et al., Reference Lee, Jang and Plonsky2015; Shintani, Reference Shintani2015; Uchihara et al., Reference Uchihara, Webb and Yanagisawa2019). These are reviews that collect as many empirical studies on the role of a given factor as possible, and then calculate the weighted average effect from that pool. Meta-analyses are useful because they help to estimate with greater confidence than any individual empirical study whether the chosen factor of interest is likely to play a role that is not confined to specific contexts, and how substantial its role is likely to be. Some researchers may therefore find meta-analytic reviews particularly useful when they make excursions into domains outside their own niche because it seems safer to rely on a comprehensive review than on a couple of individual empirical studies. Even practitioners and policy makers—or those advising practitioners and policy makers—may consider the bottom line of a meta-analytic review a shortcut into the available research evidence and may rely on it to inform their instructional approaches and recommendations for teaching. Sometimes a meta-analysis may be rather broad in its research question and—though certainly of theoretical value—this may limit its potential to inform practitioners’ decision making. For example, a meta-analysis that computes the likely effect of instruction in comparison with no instruction (e.g., Kang et al., Reference Kang, Sok and Han2019) cannot, as such, tell practitioners what instructional interventions work particularly well, unless types of instructional interventions are examined as moderator variables as part of the analysis. In this article, however, we examine three recently published meta-analyses that are sufficiently specific in their research focus and whose conclusions may thus be taken up by educators to guide their practices. A recurring theme is the importance of cautious sampling and transparent methodological decision making, but each of the critiques serves to illustrate additional considerations for interpreting the outcomes of meta-analyses.
The first meta-analysis we examine, about pragmatics instruction (Yousefi & Nassaji, Reference Yousefi and Nassaji2019), offers as one of its conclusions (regarding a moderator variable) that computer-mediated pragmatics instruction generates larger effects than face-to-face instruction. However, the collection of primary studies that this assertion is based on contains hardly any studies that directly compare the two modes of instruction. Instead, this conclusion is based on an indirect comparison of the aggregated effect sizes from a small set of studies that implemented computer-mediated instruction and the aggregated effect sizes from a larger set of studies that implemented face-to-face instruction. This is potentially problematic because effect sizes are influenced by the design features of empirical studies and by the contrasts on which they are based. For example, effect sizes tend to be larger in single-group pre-/posttest designs than in studies where effects are calculated by comparing the learning gains of one or more treatment groups with those of another group. A greater proportion of one type of study design in either set is thus likely to compromise a fair comparison. A reanalysis of Yousefi and Nassaji (Reference Yousefi and Nassaji2019), with greater scrutiny of the primary studies and with separate effect sizes calculated for different study designs, suggests that the assertion about the superiority of the computer-mediated mode of instruction was not (yet) justified.
The second meta-analysis we examine (Lee et al., Reference Lee, Warschauer and Lee2019) offers strong support for the use of corpora in vocabulary learning. Unlike in Yousefi and Nassaji (Reference Yousefi and Nassaji2019), the aggregated effect size here is based exclusively on studies with a between-groups design, which should make it easier to interpret. There is nonetheless a difficulty in interpreting the outcome because in the majority of the primary studies included in the analysis it is impossible to tell whether the between-group differences in learning gains should be ascribed to the use of a corpus per se, even though this is the factor of interest according to the title and the abstract of the article. In some of the studies, both treatment conditions involved corpus use. In many others, the treatment conditions that involved corpus use differed from their comparison conditions in diverse ways other than corpus use. Calculating an effect size from a small set of studies where corpus use was unequivocally the independent variable still yields an outcome in support of corpus use for vocabulary learning, but far less compellingly so than what emerged from the original meta-analysis.
The third meta-analysis regards the benefits of task-based language teaching (TBLT) programs (Bryfonski & McKay, Reference Bryfonski and McKay2019). Although authors of primary studies often label the instructional programs they put to the test “task based” (and in this meta-analysis the same labels were used), this may not always correspond to how the approach is conceived in other TBLT literature. It is therefore difficult to determine the merits of task-based teaching (as opposed to other versions of communicative language teaching, such as task-supported teaching) on the basis of the aggregated effect size from the literature currently available. Replicating TBLT meta-analyses with stricter sampling criteria proves difficult because of a dearth of studies that empirically assess task-based programs relative to non-task-based programs—and the few that are available report mixed findings (e.g., Phuong et al., Reference Phuong, Van den Branden, Van Steendam and Sercu2015).
All things considered, the three “case studies” presented here illustrate that conclusions drawn from meta-analytic research should be interpreted with an eye toward the methodological choices made during the meta-analytic process.
CASE STUDY 1: YOUSEFI AND NASSAJI (Reference Yousefi and Nassaji2019)
SYNOPSIS AND PRELIMINARY COMMENTS
Yousefi and Nassaji’s (Reference Yousefi and Nassaji2019) meta-analysis investigates the effects of instruction on second language pragmatics acquisition. According to the authors, the study is an update to prior work in this area (e.g., Jeon & Kaya, Reference Jeon, Kaya, Norris and Ortega2006, but see also another recent meta-analysis of L2 pragmatics instruction: Plonsky & Zhuang, Reference Plonsky, Zhuang and Taguchi2019). It not only includes more recent studies but also examines previously uninvestigated moderator variables, most notably the role of computer-mediated pragmatics instruction. Based on 36 studies, the authors report overall effectiveness of pragmatics instruction as d = 1.101. This meta-analytic evidence that pragmatics instruction clearly works is reassuring for teachers and course designers, although instructors may be especially interested in what kinds of instruction work particularly well in certain contexts. Yousefi and Nassaji’s analysis of moderator variables is informative in this regard, for example because a larger effect emerged for computer-mediated instruction (mean d = 1.172) than for face-to-face instruction (mean d = 0.965). This led the authors to assert that among the pedagogical implications of their findings “the most outstanding one is the potential of various technologies that can mediate the teaching and learning of pragmatics” (p. 302). Several additional pertinent moderator variables were explored (such as explicitness of instruction, type of outcome measures, length of treatment, and participants’ proficiency level),Footnote 1 but for reasons of space, our critique will focus on the comparison of computer-mediated and face-to-face instruction.
The studies included in Yousefi and Nassaji’s meta-analysis vary considerably in their designs. Many are single-group studies, where participants’ progress is tracked from a pretest measure to a posttest measure (i.e., the effect size calculation is based on within-group contrasts). Others compare a treatment group’s progress to that of a control group (which receives no instruction regarding the learning targets of interest in the experiment), and a few compare the progress of two treatment groups, where each group experiences a different intervention regarding the same learning targets. Yousefi and Nassaji calculated overall effects by “combining the effects of all instructional types” (p. 294). However, effects of an instructional intervention often appear larger if one contrasts participants’ pre- and posttest performance than if one assesses the effectiveness of an intervention for a group of participants relative to another group’s progress. This is because within-group comparisons of pre- and posttest scores concern the same participants in the two datasets and thus involve less variance than between-group comparisons, where the contrast in pre- to posttest gains concerns different participants (bringing in more variance). A reduction in variance and standard deviation (SD) will result in larger effect sizes because the SD makes up the denominator in the formula for Cohen’s d. Unless studies report pretest-posttest correlations that can be used as a correction for the difference, between-groups and within-groups study designs should be analyzed separately. In their meta-analysis of effects in L2 research, Plonsky and Oswald (Reference Plonsky and Oswald2014) found that observed effects resulting from within-group contrasts were indeed substantially larger than those resulting from between-groups contrasts. They therefore proposed a different set of benchmarks for small (d = .60), medium (d = 1.00), and large (d = 1.40) effects for within-group contrasts than for between-groups contrasts (small, d = .40; medium, d = .70; and large, d = 1.00). Owing to the mix of within-group and between-groups contrasts in Yousefi and Nassaji’s collection of studies, and the lack of reported pretest-posttest correlations, it is not clear how the overall estimated effect of d = 1.101 should be interpreted in relation to the previously mentioned benchmarks.
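To make the arithmetic behind this point concrete, the sketch below (in Python) contrasts the two ways of standardizing the same raw gain. It is a minimal illustration with invented numbers, and the formulas are one common parameterization of Cohen’s d for each design, not necessarily the one used by Yousefi and Nassaji or by Plonsky and Oswald.

```python
import math

def d_between(gain_t, gain_c, sd_t, sd_c, n_t, n_c):
    """d for a between-groups contrast: difference in mean gains divided by
    the pooled SD of those gains (one common parameterization)."""
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    return (gain_t - gain_c) / sd_pooled

def d_within(mean_pre, mean_post, sd_pre, sd_post, r):
    """d for a within-group (pre/post) contrast, standardized on the SD of the
    gain scores; r is the pretest-posttest correlation."""
    sd_gain = math.sqrt(sd_pre**2 + sd_post**2 - 2 * r * sd_pre * sd_post)
    return (mean_post - mean_pre) / sd_gain

# Invented numbers: the same 5-point gain on a scale with SD = 10.
print(d_between(gain_t=5, gain_c=0, sd_t=10, sd_c=10, n_t=30, n_c=30))    # 0.50
print(d_within(mean_pre=50, mean_post=55, sd_pre=10, sd_post=10, r=0.8))  # ~0.79
```

The higher the pretest-posttest correlation r, the smaller the SD of the gain scores and hence the larger the within-group d, which is why the two designs call for separate analyses (or a correction) when their effect sizes are aggregated.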
We therefore reanalyzed the data by calculating separate effect sizes for the within-groups contrasts (k = 103) and the between-groups contrasts (k = 52). While we might expect such a reanalysis to produce slightly different aggregated effect sizes, we would not expect it to have profound repercussions for the general conclusion that pragmatics instruction is effective. As mentioned, Yousefi and Nassaji’s article also investigated modality (computer-mediated vs. face-to-face pragmatics instruction) as a moderator variable. An issue that can arise when examining moderators of a main effect is the difficulty of separating out and attributing unique effects to each moderating variable. To account for this, primary studies should be closely examined in terms of their study designs and for potential interactions between moderating variables. For example, a recent meta-analysis about the effect of glosses on vocabulary acquisition (Yanagisawa et al., Reference Yanagisawa, Webb and Uchihara2020) included mode of gloss (textual, pictorial, or aural) as a moderating variable but deliberately selected only studies on single glosses for this comparison. Inclusion of studies on multimodal glosses would have made it difficult to separate the effect of mode (e.g., textual vs. pictorial) from the effect of providing more than one annotation (e.g., textual + pictorial) for the same word (Boers et al., Reference Boers, Warren, Grimshaw and Siyanova-Chanturia2017; Ramezanali et al., Reference Ramezanali, Uchihara and Faez2020).
In the case of Yousefi and Nassaji’s investigation of the moderating variable of computer-mediated instruction, there is a potential interaction with the type of study design because the set of studies implementing computer-mediated instruction consists mostly of within-group contrasts, and so the larger aggregated effect size that emerged for this set could be an artifact of this design feature rather than reflecting an effect of computer-mediation per se. Moreover, in virtually all the computer-mediated studies the pragmatics instruction was explicit. This is relevant because Yousefi and Nassaji found a larger overall effect for explicit (d = 1.213) than implicit (d = 0.848) instructional treatments. Explicitness of instruction could thus be an alternative explanation for the comparatively large effect size that emerged from the computer-mediated interventions.
WHAT ARE THE CONTRASTS?
As mentioned, there is a wide range of study designs in the collection of primary studies used by Yousefi and Nassaji (Reference Yousefi and Nassaji2019), yielding diverse contrasts for effect size calculations (pretest vs. posttest scores of a single group or differences in learning gains between two groups). It is important for the sake of transparency and replicability of a meta-analysis to specify what contrasts are used for these calculations (Maassen et al., Reference Maassen, van Assen, Nuijten, Olsson-Collentine and Wicherts2020). Because Yousefi and Nassaji (Reference Yousefi and Nassaji2019) did not include this information, we adopted the following explicitly stated procedures from the earlier meta-analysis by Jeon and Kaya (Reference Jeon, Kaya, Norris and Ortega2006) in our replication (a schematic rendering of these rules follows the numbered list):
1. For studies that examined one treatment group and one control group (that received no instructional intervention) by means of pre- and posttests, effect sizes were calculated by contrasting the two groups’ outcomes on pre- and immediate posttests (Alcón-Soler, Reference Alcón-Soler2015; Bardovi-Harlig et al., Reference Bardovi-Harlig, Mossman and Vellenga2015; Eslami & Eslami-Rasekh, Reference Eslami, Eslami-Rasekh, Alcón-Soler and Martinez-Flor2008; Felix-Brasdefer, Reference Felix-Brasdefer2008; Furniss, Reference Furniss2016; Narita, Reference Narita2012; Rafieyan et al., Reference Rafieyan, Sharafi-Nejad, Khavari, Eng and Mohamed2014; Tan & Farashaiyan, Reference Tan and Farashaiyan2012).
2. For studies that examined multiple treatment groups and one control group by means of pre- and posttests, effect sizes were calculated by contrasting each group’s immediate pre- and posttest outcomes separately with the control group’s immediate pre- and posttest outcomes (Eslami & Liu, Reference Eslami and Liu2013; Hernandez, Reference Hernandez2011; Li, Reference Li, Taguchi and Sykes2013; Nguyen et al., Reference Nguyen, Phamb and Phamb2012; Tajeddin et al., Reference Tajeddin, Keshavarz and Zand Moghaddam2012).
3. For studies that examined two or more treatment groups without any control group, pretest data was contrasted with immediate posttest data for each group (Chen, Reference Chen2011; Derakhshan & Eslami, Reference Derakhshan and Eslami2015; Felix-Brasdefer, Reference Felix-Brasdefer2008; Fordyce, Reference Fordyce2014; Fukuya & Martinez-Flor, Reference Fukuya and Martinez-Flor2008; Ghobadi & Fahim, Reference Ghobadi and Fahim2009; Gu, Reference Gu2011; Jernigan, Reference Jernigan2012; Li, Reference Li2012a, Reference Li2012b; Nguyen et al., Reference Nguyen, Do, Nguyen and Pham2015; Simin et al., Reference Simin, Eslami, Eslami-Rasekh and Ketabi2014; Tateyama, Reference Tateyama and Bradford-Watts2007, Reference Tateyama and Taguchi2009).
4. For studies that examined one group before and after an intervention, pretest data was contrasted with posttest data on immediate posttests (Alcón-Soler, Reference Alcón-Soler2012; Alcón-Soler & Guzman, Reference Alcón-Soler and Guzman-Pitarch2010; Tanaka & Oki, Reference Tanaka and Oki2015).
5. For studies that reported both treatment group and control group comparisons as well as within group contrasts, effect sizes were calculated for both between-group and within-group contrasts in the ways outlined above (Nguyen et al., Reference Nguyen, Phamb and Phamb2012).
6. For studies that compared two groups pre- and post-intervention and only provided the results of a multifactorial test (e.g., ANOVA), the effect size was calculated from the main effect of time for each group (Takimoto, Reference Takimoto2012a, Reference Takimoto2012b).
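To make these decision rules easier to replicate, the following sketch shows one way they could be encoded. It is our own illustration, not code used by Jeon and Kaya or by Yousefi and Nassaji; the class and function names are hypothetical, and rules 5 and 6 are omitted because they depend on study-specific reporting.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Group:
    name: str
    is_control: bool           # received no instruction on the targeted pragmatic features
    pre: Tuple[float, float]   # (mean, SD) on the pretest; feeds the d formulas sketched earlier
    post: Tuple[float, float]  # (mean, SD) on the immediate posttest

def contrasts(groups: List[Group]) -> List[Tuple[str, str]]:
    """Enumerate the contrasts implied by rules 1-4 above for one study."""
    treatments = [g for g in groups if not g.is_control]
    controls = [g for g in groups if g.is_control]
    if not treatments:
        return []
    if controls:
        # Rules 1-2: each treatment group's pre-to-post gain vs. the control group's gain.
        return [(t.name, c.name) for t in treatments for c in controls]
    if len(treatments) >= 2:
        # Rule 3: two or more treatment groups, no control -> one pre/post contrast per group.
        return [(t.name, "pre vs. post") for t in treatments]
    # Rule 4: single-group design -> one pre/post contrast.
    return [(treatments[0].name, "pre vs. post")]
```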
Some studies included in Yousefi and Nassaji’s meta-analysis provide insufficient information to calculate effect sizes according to the previously mentioned procedures, and it is unclear in some cases what method the original analysis utilized. For example, Dastjerdi and Farshid (Reference Dastjerdi and Farshid2011) only reported the results of a t-test comparing posttest results of two experimental groups. Martínez-Flor and Alcón-Soler (Reference Martínez-Flor and Alcón-Soler2007) lacked the SDs necessary to compute effect sizes (the other reported statistics were nonparametric). Cunningham (Reference Cunningham2016), one of the handful of studies in the collection that implemented a computer-mediated mode of instruction, had to be excluded because the report did not provide sufficient information for calculating effect sizes comparing the two experimental groups (which included only eight and nine participants each). In addition, one publication (Nguyen, Reference Nguyen2013) reported on the same data as another (Nguyen et al., Reference Nguyen, Phamb and Phamb2012), and so the duplicate report was excluded. These studies (Cunningham, Reference Cunningham2016; Dastjerdi & Farshid, Reference Dastjerdi and Farshid2011; Martínez-Flor & Alcón-Soler, Reference Martínez-Flor and Alcón-Soler2007; Nguyen, Reference Nguyen2013) were therefore excluded from our reanalysis, leaving a total of 32 individual studies (instead of the original 36) and 155 contrasts (see supplement hosted on IRIS for a full list with justifications for inclusion/exclusion).
Another modification to the original meta-analysis concerns the categorization of one of the studies (Nguyen et al., Reference Nguyen, Do, Nguyen and Pham2015) that was coded as computer-mediated instruction by Yousefi and Nassaji. This study utilized e-mail writing as an outcome measure and may thus at first glance appear to be about computer-mediated instruction, but the instruction was not in fact computer mediated. We therefore had to remove it from the set of computer-mediated instruction studies in our reanalysis, reducing this set to six studies.
BENEFITS OF COMPUTER-MEDIATED INSTRUCTION?
Our reanalysis confirms the general finding of Yousefi and Nassaji (Reference Yousefi and Nassaji2019) that pragmatics instruction has a positive effect. For the between-groups contrasts, the overall effect is d = 1.11, a large effect for between-groups comparisons in L2 research. For within-group studies, the result is d = 1.32, a medium to large effect for within-groups comparisons (see supplement hosted on IRIS for full results tables).
However, our reanalysis does not confirm the original meta-analysis when it comes to the comparison of computer-mediated and face-to-face interventions. According to the original analysis, the former yielded larger effects (d = 1.172; k = 30Footnote 3) than the latter (d = 0.965; k = 80), but, according to our reanalysis of the data, the face-to-face mode in fact generated the larger effects. For between-groups designs, we now find a large effect of d = 1.271 (k = 40) for face-to-face instruction and only a small effect of d = 0.65 (k = 12) for computer-mediated instruction. Taking only the within-group studies, we again find a large effect of d = 1.46 (k = 85) for face-to-face instruction and a small effect of d = .75 (k = 18) for computer-mediated instruction (see supplement hosted on IRIS for a full results table). It needs to be acknowledged that the sample sizes for the computer-mediated interventions are now even smaller than they were in the original meta-analysis (due to selection decisions explained in the preceding text and due to the separation of between- and within-group contrasts). This highlights the need for more empirical investigations of computer-mediated pragmatics instruction. Investigations that directly compare the effectiveness of computer-mediated and face-to-face instruction for pragmatics would be especially welcome. In the collection used by Yousefi and Nassaji, only one study (Eslami & Liu, Reference Eslami and Liu2013) did this, and it found no difference in effectiveness between the two modes. A more recent study on pragmatics instruction (Tang, Reference Tang2019), outside the scope of the meta-analysis, found no advantage for computer-mediated activities over face-to-face activities either. In sum, our replication with separate effect size calculations based on study design differences did not support the superiority of computer-mediated pragmatics instruction over face-to-face instruction.
CASE STUDY 2: LEE ET AL. (Reference Lee, Warschauer and Lee2019)
SYNOPSIS AND PRELIMINARY COMMENTS
Lee et al.’s (Reference Lee, Warschauer and Lee2019) meta-analysis concerns the effects of corpus use on second language vocabulary learning. It is a partial replication of an earlier, broader-scope meta-analysis of corpus use in language learning (Boulton & Cobb, Reference Boulton and Cobb2017) but focused specifically on vocabulary and only included studies with an instructed control group (a comparison group) in their design.Footnote 4 Based on 29 primary studies, the weighted average effect on short-term learning was found to be medium sized (Hedges’s g = 0.74). In eight of the studies, delayed posttests were included, and these also showed a positive effect (Hedges’s g = 0.64). While Lee et al. (Reference Lee, Warschauer and Lee2019) acknowledge the role of several moderator variables (such as L2 proficiency level), the preceding aggregated effect sizes clearly suggest that corpus use is beneficial for L2 vocabulary learning.
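For readers less familiar with how a “weighted average effect” is obtained, the sketch below shows a generic fixed-effect, inverse-variance average of Hedges’s g values. It is an illustration only, with an invented mini-dataset and a standard large-sample variance approximation; it is not the exact model or software used by Lee et al.

```python
def weighted_mean_effect(effects):
    """Fixed-effect, inverse-variance weighted average of Hedges's g values.
    `effects` is a list of (g, n_treatment, n_comparison) tuples."""
    numerator, denominator = 0.0, 0.0
    for g, n1, n2 in effects:
        variance = (n1 + n2) / (n1 * n2) + g**2 / (2 * (n1 + n2))  # large-sample approximation
        weight = 1 / variance
        numerator += weight * g
        denominator += weight
    return numerator / denominator

# Invented illustration: three studies of different sizes; the largest study
# pulls the average toward its own estimate.
print(weighted_mean_effect([(0.9, 20, 20), (0.5, 60, 60), (1.2, 15, 15)]))  # ~0.68
```

Many published meta-analyses use random-effects models, which add a between-study variance component to each weight, but the basic weighting logic is the same: larger, more precise studies contribute more to the average.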
In the following text, we highlight the issue of determining whether the main effects observed in primary studies are always a result of the variable of interest (in this case, corpus use). Before turning to that issue, we point out that it is not always clear what is meant by “effects” in this meta-analysis. Presumably, what is meant is learning outcomes. However, some of the studies (Frankenberg-Garcia, Reference Frankenberg-Garcia2012, Reference Frankenberg-Garcia2014; Stevens, Reference Stevens1991) investigated learners’ success rates as they did exercises under various input conditions, but did not include posttests to gauge the learning outcomes generated by these activities.Footnote 5 If the aim of the meta-analysis was to compare the effectiveness of different procedures in terms of learning outcomes, then these studies do not serve that purpose, and so we will exclude them from our reanalysis.
WHAT IS THE INDEPENDENT VARIABLE?
Corpora can be used for the purpose of vocabulary learning in various ways. The introduction to Lee et al. (Reference Lee, Warschauer and Lee2019) indicates that the focus of the article is corpus use for guided inductive learning (p. 722), also known as discovery learning and data-driven learning (Johns, Reference Johns1991). In this approach, learners typically examine concordance lines (i.e., examples of language use extracted from a corpus) with a view to discovering the meanings of words or their usage patterns (e.g., their word partnerships or collocations). Because Lee et al. (Reference Lee, Warschauer and Lee2019) refer first (in the title and the abstract) to corpus use in general and then (in the introduction) to the benefits of concordance lines specifically for the purpose of discovery learning, there is some ambiguity about what is meant by “the effects of corpus use.” If the independent variable of interest is corpus use more generally, then some of the primary studies appear less than ideally suited because both treatment conditions in these studies utilized examples extracted from a corpus. The difference between these groups lay in how the corpus-based instances were used. For example, Sun and Wang (Reference Sun and Wang2003) compared the use of corpus-based examples for guided inductive learning to their use for the purpose of illustrating a pattern that was first explained to the learners. In other words, it was using corpus-based instances to prompt inductive learning versus using corpus-based instances as part of deductive learning that was the variable of interest, and not the use of corpus-based instances per se.
If the effectiveness of corpus use for guided inductive learning is the main variable of interest, then the challenge is to separate the added value of corpus use from that of guided inductive learning. After all, guided inductive learning can also be steered by means of examples that are not extracted from a corpus, but that are invented or collected differently by teachers or textbook writers. With very few exceptions (e.g., Cobb, Reference Cobb1997; Tongpoon, Reference Tongpoon2009), the primary studies in this meta-analysis did not compare the effectiveness of corpus-based and non-corpus-based examples for the purpose of guided inductive learning. Instead, in several of the studies (Anani Sarab & Kardoust, Reference Anani Sarab and Kardoust2014; Poole, Reference Poole2012; Sripicharn, Reference Sripicharn2003; Vyatkina, Reference Vyatkina2016; Yunus & Awab, Reference Yunus and Awab2012) corpus-based discovery learning was compared to a condition in which students received vocabulary explanations upfront followed by a few examples. In that case, it is again impossible to ascribe the superior learning observed for the corpus-based condition to the use of a corpus because it may also be attributable to the purported benefits of guided inductive learning (as opposed to deductive learning), regardless of whether the examples used for the inductive process were extracted from a corpus or produced in another way.
There are undeniably strong arguments for the use of corpus-based examples, such as their authenticity and the ease with which many examples can be generated from an online corpus (e.g., Johns, Reference Johns1991; Stevens, Reference Stevens1991). However, whether using corpus-based examples necessarily produces better learning outcomes than using, say, a series of textbook examples is an empirical question that is addressed by very few of the studies. Additionally, the distinction between authentic concordance lines and made-up examples can easily get blurred when researchers/materials designers start editing concordance lines to make them more comprehensible to the learners and to ensure the discovery-learning progresses as intended (e.g., Kim, Reference Kim2015; Yang, Reference Yang2015). In Supatranont (Reference Supatranont2005, pp. 84–91, and appendices J and K), for example, the only difference between the concordance lines and the textbook-type examples on the student handouts was that the former looked like concordance lines while the latter were presented as regular sentences. The difference between the two treatment conditions in this study was not the presence versus absence of corpus-based examples. Nor was it the presence versus absence of discovery-learning activities because both groups were required to find patterns in the sets of examples given on their handouts. The difference, rather, was that, in addition to pen-and-paper practice, the experimental group conducted computer-assisted searches, while the comparison group only worked with the handouts.
A LEVEL PLAYING FIELD?
A frequent topic in this collection of primary studies is collocation (word partnerships, such as conduct research, sore throat, and depend on), with several studies reporting the benefits of presenting learners with sets of concordance lines showing the most common collocates of a word. The effect of exposing learners to collocations is typically shown in posttests requiring learners to recall the word partnerships they were exposed to in the treatment. However, this is often in comparison with another treatment condition that did not involve any work on collocations at all but instead included learning activities on something else, such as single words or grammar (Mirzaei et al., Reference Mirzaei, Domakani and Rahimi2016; Rahimi & Momeni, Reference Rahimi and Momeni2012; Rezaee et al., Reference Rezaee, Marefat and Saeedakhtar2015). In other words, the experimental groups were exposed to the target items they would be tested on in the posttests, while the comparison groups were not exposed to these target items during their instructional treatment. It is therefore not surprising that the experimental groups outperformed the comparison groups in the posttests. This is reflected in some very large treatment effects (Hedges’s g = 2.07 in Mirzaei et al., Reference Mirzaei, Domakani and Rahimi2016, and 1.98 in Rahimi & Momeni, Reference Rahimi and Momeni2012).Footnote 6 However, whether these effects should be ascribed to the nature of instruction (e.g., the use of concordance lines from a corpus) or simply to the focus of instruction (i.e., collocation) is unclear. It is quite conceivable that the comparison groups would not have performed so poorly in the posttests, had they also been exposed to the target collocations during treatment. Put differently, the instructed control groups in these studies were not true comparison groups, but more akin to no-treatment control groups (i.e., groups that receive no instruction on the items or patterns on which they will be tested). If the purpose of the meta-analysis is to estimate the effectiveness of corpus use relative to other instructional treatments that share the same learning objective, then it seems justified to exclude these studies.
Other studies included in the original meta-analysis demonstrated imbalanced learning opportunities between treatment groups, even though both groups did exercises with a focus on collocation. This can be illustrated with reference to a study by Daskalovska (Reference Daskalovska2014). The experimental group in this study was instructed to use online corpus tools to collect the 10 most common adverb collocates of verbs and to report their findings. The comparison group did short pen-and-paper exercises about the same verbs but was exposed to a smaller number of adverbs. Obtaining a high score on one of the posttest sections—the section with the heaviest weighting—hinged on the learners’ ability to supply a wide range of adverbs, and so this potentially gave an advantage to the experimental group. One of the other sections of the posttest did appear better aligned with the comparison group’s practice materials, given that it was a multiple-choice test and the study package created for the comparison group included a similar multiple-choice activity. However, the correct answers to the multiple-choice items in the posttest were not included in the multiple-choice exercise done in the learning stage. For example, in the exercise the students learned “I entirely agree” and “I clearly remember,” but in the posttest they needed to select “I strongly agree” and “I vividly remember” to score points. The poor posttest performance of the comparison group is therefore unsurprising.
Equally unsurprising is the finding that better learning outcomes after corpus use were observed in studies where the experimental groups engaged in corpus-based activities in addition to activities they shared with the comparison groups (e.g., Gordani, Reference Gordani2013), while comparison groups did not engage in any supplementary activities regarding the target vocabulary. In some cases, this meant the experimental groups spent extensive additional time on the target words (e.g., Karras, Reference Karras2016; Yunxia et al., Reference Yunxia, Min, Zhuo, Li and Zhou2009). Better learning outcomes for the experimental groups in these studies could thus be attributed to differences in time investment. Supplementary activities other than corpus-based ones could also be expected to enhance learning outcomes, and so, while these studies undeniably demonstrate that corpus use is effective, they do not demonstrate it is efficient in comparison to learning activities that do not require a corpus.
There are also several publications in the collection that lack sufficient detail and transparency, and for these studies it is impossible to tell if the experimental and comparison conditions differed in more ways than use or nonuse of corpus data. This lack of transparency is especially problematic given that some of these articles (a few hardly four pages long) report large effects (e.g., Hedges’s g = 1.15 in Al-Mahbashi et al., Reference Al-Mahbashi, Noor and Amir2015, and 1.38 in Yılmaz & Soruç, Reference Yılmaz2015), thus potentially inflating the aggregated effect.
If we recalculate the average effect on short-term learning based on the studies from the original pool where we do feel confident enough that differences in learning outcomes can be attributed to corpus use (see supplement hosted on IRIS for the original list of studies with justification for inclusion/exclusion), the result is markedly different from the original meta-analysis: Hedges’s g = 0.32. According to the norms proposed by Plonsky and Oswald (Reference Plonsky and Oswald2014) for between-groups contrasts, this is a small effect. However, this average is now based on only five studies, totaling only nine contrasts from the original meta-analysis. Clearly, more (and more focused) empirical investigations of the merits of corpus use are needed for a meta-analysis on this subject to produce a more reliable estimate.
CASE STUDY 3: BRYFONSKI AND MCKAY (Reference Bryfonski and McKay2019)
SYNOPSIS AND PRELIMINARY COMMENTS
Bryfonski and McKay’s (Reference Bryfonski and McKay2019) meta-analytic review was a first effort at estimating the effectiveness of TBLT programs.Footnote 7 Their search produced 27 studies with a between-groups design as well as a small collection of studies with within-groups designs (i.e., comparing a single treatment group’s pretest and posttest performance). The original report cautioned that the collection of within-groups studies was too small to draw conclusions from (p. 619). Here, we therefore focus on the set of between-groups comparisons. The average effect size Bryfonski and McKay calculated from this collection (d = 0.93) approximates the threshold (d = 1.00) proposed by Plonsky and Oswald (Reference Plonsky and Oswald2014) for a large between-groups effect. The report concludes that this finding “supports the notion that program-wide implementation of TBLT is effective for promoting L2 learning above and beyond the learning found in programs with other, traditional or non-task-based pedagogies” (p. 622).
One of the questions we discuss in the following text is the extent to which the studies included in the original meta-analysis examined implementations of task-based language teaching, that is, TBLT in its “strong” form (Long, Reference Long2015) as opposed to task-supported language teaching. Before turning to this question, we note that it seems worthwhile, on reflection, to exclude three of the primary studies in the original collection of between-group studies because they examined TBLT without directly comparing TBLT to non-TBLT treatments (Lai & Lin, Reference Lai and Lin2015; Li & Ni, Reference Li and Ni2013; Shabani & Ghasemi, Reference Shabani and Ghasemi2014). A further study (González-Lloret & Nielson, Reference González-Lloret and Nielson2015) did not establish group equivalence prior to the respective treatments (i.e., there was no pretest), and because an effect size based solely on posttest scores is not optimally reliable if we cannot be confident about pretreatment comparability, we exclude this study from our reanalysis as well.
WHAT’S IN A NAME?
Some TBLT proponents distinguish between task-based programs, which use tasks throughout (Long, Reference Long2016), and task-supported programs, where tasks are used alongside or in addition to other approaches, including those involving explicit instruction (Ellis, Reference Ellis2018). With one exception (González-Lloret & Nielson, Reference González-Lloret and Nielson2015—which, as already noted, was excluded from the reanalysis because of lack of pretest data), all the programs described in the primary studies included in this meta-analytic review can be considered task supported rather than task based. An example is Amin (Reference Amin2009), where “[t]he TBL approach adopted in this study takes the form of explicit grammatical instruction in conjunction with communicative activities” (p. 81). Readers should therefore interpret TBLT, the term used in the majority of the included articles, as referring to task-supported implementations rather than the “strong” version of TBLT outlined by Long (Reference Long2015).
Another difficulty lies with the notion of task, for which slightly different definitions have been used in prior literature (e.g., Ellis et al., Reference Ellis, Skehan, Li, Shintani and Lambert2019; Long, Reference Long2015). What is agreed on by proponents of TBLT in its various forms, however, is that tasks are meaning focused (i.e., focused on the content of messages rather than their linguistic packaging) and make learners use language as a vehicle toward a goal that itself is not linguistic. For example, in one of the original studies (Lochana & Deb, Reference Lochana and Deb2006) the following activity is presented as a task according to those researchers’ interpretation of TBLT: “Your teacher will read out a passage; listen to the passage carefully and complete the blanks.” In another study (Amin, Reference Amin2009), the author explains that “The pedagogical tasks … are what learners do in class, such as listening to a tape and repeating phrases or sentences” (p. 44). Although these activities are labeled tasks in these publications, they are language-focused exercises rather than tasks as understood in TBLT circles. Several authors (e.g., Birjandi & Malmir, Reference Birjandi and Malmir2009; Sarani & Sahebi, Reference Sarani and Sahebi2012; Yang, Reference Yang2008) consider pair work the defining characteristic of TBLT, regardless of whether the activities have a clear communicative purpose. These examples illustrate how widely “task” is interpreted in instructional contexts around the world.
In the following text we report an attempt at a new meta-analysis that adopts a narrower interpretation of tasks and that only includes studies that meet the criteria for tasks defined by Ellis and Shintani (Reference Ellis and Shintani2013, see following text). First, however, it may be worth speculating why TBLT is understood in such diverse ways, including ways not at all intended by TBLT advocates. Many of the authors of the studies in the meta-analysis cited Willis (Reference Willis1996) and Willis and Willis (Reference Willis and Willis2007), summed up on http://www.willis-elt.co.uk/ and https://www.teachingenglish.org.uk/article/a-task-based-approach to justify their task and program designs. In Willis and Willis’s (Reference Willis and Willis2007) version of TBLT, communicative tasks are preceded by a pretask phase, to help learners prepare for the task, and are followed by a posttask phase, where time is devoted to feedback, reflection on task performance, and reactive treatment of language problems. Several authors relied heavily on this three-phase lesson model but often with a focus on language as a study object rather than as a means toward a nonlinguistic end. It is understandable how “task” may be misconstrued from webpages such as https://www.teachingenglish.org.uk/article/criteria-identifying-tasks-tbl without carefully considering supplementary information. For example, one of the criteria listed there is that the activity should have “a goal or an outcome.” If a researcher misinterprets this goal or outcome as increased language knowledge on the part of students, then their “TBLT” lessons may treat language as a study object instead of a vehicle. Misinformation or misunderstandings may also result in assessments of learning gains that are focused on aspects of language, such as grammatical accuracy and vocabulary knowledge, rather than the learners’ successful completion of the communicative tasks (Plonsky & Kim, Reference Plonsky and Kim2016). Once a practitioner or researcher misses the crucial point about what is meant by a goal or an outcome of a task, they may also misinterpret agreement tasks as reaching an agreement on the right answer in a language exercise and information-gap tasks as completing gap-fill exercises.
Depending on the model of TBLT, guidelines for creating task-based (or task-supported) lessons may be rather vague as to how much language-oriented instruction can (or should) be included at various stages of instruction. Additionally, TBLT proponents have slightly different views of what features distinguish a task from a language exercise. In our reanalysis, we examined the classroom procedures of the primary studies to determine the extent to which the activities labeled as tasks in the main task phase of the described lessons can be characterized as tasks as defined by Ellis and Shintani (Reference Ellis and Shintani2013). The four criteria proposed by Ellis and Shintani (Reference Ellis and Shintani2013, p. 135), slightly reworded here, are as follows:
1. The focus is on meaning, that is, on the content of messages rather than on the language code per se.
2. There is some sort of communication gap between interlocutors, that is, learners exchange information or opinions rather than telling interlocutors—including their teacher—what these interlocutors clearly already know.
3. The task instructions do not stipulate what language elements or patterns the students should use when performing the activity (because that risks turning the activity into a language-focused exercise).
4. There should be a clear purpose (e.g., solving a problem, reaching an agreement about a dilemma) other than practicing language (because in the “real” world, language use is a means to an end, not the end itself).
For 10 of the studies, we concurred that the tasks met only one of the four criteria,Footnote 8 and so it seems justified to exclude them from this narrower reanalysis (see supplement hosted on IRIS for full inclusion/exclusion criteria). After exclusion of these and the ones mentioned in the previous section (i.e., studies that were not designed to compare task-based to non-task-based interventions), the collection includes 13 studies.
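As an illustration of how such criterion-based screening can be made explicit and replicable, the sketch below codes each study against the four criteria and retains only those meeting a chosen threshold. The study names, codings, and threshold are hypothetical placeholders; our actual coding decisions are in the IRIS supplement.

```python
# The four criteria from Ellis and Shintani (2013), abbreviated as labels.
CRITERIA = ("meaning_focus", "communication_gap", "no_prescribed_language", "nonlinguistic_purpose")

# Hypothetical coding sheet: True = the main task-phase activity met the criterion.
codings = {
    "Study A": {"meaning_focus": True, "communication_gap": True,
                "no_prescribed_language": True, "nonlinguistic_purpose": True},
    "Study B": {"meaning_focus": True, "communication_gap": False,
                "no_prescribed_language": False, "nonlinguistic_purpose": False},
}

def retained(codings, minimum=len(CRITERIA)):
    """Keep studies whose main task-phase activities meet at least `minimum` criteria."""
    return [study for study, codes in codings.items()
            if sum(codes[c] for c in CRITERIA) >= minimum]

print(retained(codings))  # ['Study A']
```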
AT FACE VALUE?
Applying the criteria outlined in the preceding text requires that authors carefully detail their instructional procedures and classroom activities. However, several of the remaining research reports provide insufficient detail to apply Ellis and Shintani’s (Reference Ellis and Shintani2013) criteria. What follows are examples of how little is said about the nature of what are labeled tasks in some of the articles:
The tasks in every lesson had a high corresponding with the course book materials, because of pre-determined syllabus. The teacher used his creativity for adaptation of the tasks with the text book. (Rezaeyan, Reference Rezaeyan2014, p. 483)

[T]he students were required to do the tasks either in pair or in small groups, with the teacher monitoring their performance and encouraging more communication among them. (Mesbah, Reference Mesbah2016, p. 433)

In task-cycle phase, the students were engaged in completing different kinds of tasks. (Tan, Reference Tan2016, p. 103)

[S]tudents engaged in different communicative situations, unrelated to the actual course but organized in such a way that the participants were compelled to use the previously acquired lexico-grammar. (De Ridder et al., Reference De Ridder, Vangehuchten and Gómez2007, p. 310)

The author selected eight topics from the textbook or from outside the book, and designed the speaking tasks, considering the student’s actual level and interest. (Ting, Reference Ting2012, p. 91)

As illustrated in the previous section, authors may cite publications about TBLT and call the classroom activities they designed tasks, but this offers no guarantee that these in fact fit the criteria for tasks established in the preceding text. Some of the effect sizes in this subset of nontransparent reports are very substantial (e.g., d > 1.7 in Mesbah & Faghani, Reference Mesbah and Faghani2015, and in Tan, Reference Tan2016), even though it is difficult to tell to what these effects should be attributed. For the sake of caution, we exclude these studies in our reanalysis as well. As a result of this, the collection now includes six studies. If these remaining studies shared a tight focus and used very similar instruments and methods, a meta-analysis of them might still be meaningful. However, they in fact display very diverse foci (e.g., speaking vs. writing skills) and outcome measures (see Saito & Plonsky, Reference Plonsky, Zhuang and Taguchi2019, for an illustration that effect sizes can differ markedly depending on the type of outcome measures), and so it is doubtful whether a meaningful generalization can be drawn from such a small remaining sample.
A LEVEL PLAYING FIELD?
Regardless of whether the primary studies included in the original meta-analysis really concerned TBLT programs or, instead, compared one language-focused program to another language-focused program, the fact remains that what was presented as the experimental treatment in these studies almost consistently generated the better outcomes. One might argue that, even though the experimental treatments did not meet all of our criteria for being labeled task based, they were nonetheless better aligned with TBLT principles than the comparison treatments. If so, then the outcome of the meta-analysis could still be interpreted as support for programs exhibiting at least some features of TBLT. For example, the so-called TBLT treatments typically involved a greater amount of peer–peer interaction in the target language than the comparison treatments, where students worked mostly individually. So, even though many of the activities described in these studies are exercises instead of tasks, the fact that these exercises were typically tackled collaboratively in the treatment conditions that brought about the better learning outcomes can be meaningful (Sato & Ballinger, Reference Sato and Ballinger2016). Put differently, more nuanced distinctions within the broad spectrum of task-supported programs could be fruitful to help determine the role of specific program characteristics.
As also highlighted in our discussion of Lee et al. (Reference Lee, Warschauer and Lee2019) in the preceding text, better outcomes for the experimental treatment can in some cases be attributed to factors other than the so-called TBLT nature of the treatment. For example, Torky (Reference Torky2006) investigated the benefits of an intensive speaking course in comparison to a course where students hardly did any speaking practice. Unsurprisingly, the students from the speaking course did better in end-of-course speaking activities, which resembled their course activities. In a similar vein, the end-of-course assessment in Yang (Reference Yang2008) concerned speaking skills, which the experimental group had been given ample opportunity to develop in class while the comparison group had not. Considering the potential effect of practice–test congruency (i.e., the probability that one gets better at what one practices regardless of whether the practice method resembles TBLT or something else), we also exclude these studies from the collection of between-group comparisons in our reanalysis when the purpose is to gauge the effect of TBLT as an independent variable. This reduces the collection to three studies. Were we to calculate an average effect from these, the result would be d = 0.258, indicating a small effect, but this is hardly meaningful given the minute sample size.
An extra challenge with assessing many of the primary studies is that the description of the control/comparison conditionFootnote 9 is often as minimal as, for example, “[the] control group experienced conventional teaching” (Rezaeyan, Reference Rezaeyan2014). Even some of the lengthy texts, such as PhD dissertations, offer minimal information. For example, Murad (Reference Murad2009) only mentions that “the control group was taught using the conventional methods of teaching used by teachers of EFL at these schools” (p. 77), without giving any further explanation as to what those conventional methods were. When descriptions are included, they are often ambiguous as to whether the two groups spent the same amount of time on the skills or knowledge they would need to perform well in the posttests. All this makes it difficult to tell whether the superior performance of the experimental group should be attributed to their being provided with better learning opportunities or simply more learning opportunities in preparation for a specific end-of-course assessment.
The latter possibility can be illustrated with two of the three studies remaining in our reanalysis. One is Lai et al. (Reference Lai, Zhao and Wang2011), which did include helpful details about both the experimental and the comparison treatments as well as the assessment instruments used. In this study, communicative activities were added to a language-focused course in the experimental condition. To evaluate whether this had a positive effect on learning, a speaking test was used, where the students were asked to describe a picture of a person’s bedroom (p. 96). However, picture description was a recurring course activity in the experimental condition, and one of the picture description activities in the course was about bedrooms as well (p. 102). If the students from the TBLT course performed better on the final speaking test, this may be partially attributable to practice–test congruency (because they had done the activity before while the comparison group had not). A similar example is a study by Park (Reference Park, Shehadeh and Coombe2012), who designed computer-assisted activities for the TBLT group, while the non-TBLT group only worked with their prescribed EFL textbook. One of the TBLT group’s computer-assisted lessons was about writing e-mails to e-pals (e.g., to introduce a new e-pal). The non-TBLT group, which was confined to working with the EFL textbook, appears not to have practiced this specific activity. However, the same activity was used as one of the assessment measures, thus potentially giving an advantage to the TBLT group. After also excluding these two studies from the reanalysis, a single study would remain (Phuong et al., Reference Phuong, Van den Branden, Van Steendam and Sercu2015). This study reports a positive effect of a TBLT-informed writing course on students’ vocabulary development, but less improvement compared to the non-TBLT treatment on measures of linguistic accuracy. The result is an averaged d-value of −0.06. In short, using different, stricter criteria for sampling candidate studies changes the conclusions regarding the effectiveness of task-based relative to non-task-based implementations. Again, the main conclusion must be that much more (and more solid and replicable) empirical work on the comparative effectiveness of TBLT needs to be done before a robust meta-analysis of the effects of task-based programs will become feasible. In the interim, it is critical to apply more nuance to domain definitions within the spectrum of task-supported programs so that the role of specific program characteristics can be better understood.
CONCLUSIONS AND RECOMMENDATIONS
The outcome of a meta-analysis is inevitably determined by how a factor of interest is defined and how candidate studies are subsequently selected. As illustrated in all three “case studies” presented here, changes in selection criteria, such as applying more narrow definitions of key variables, can lead to different outcomes. In each of our reanalyses, we considered it desirable to exclude a fair number of studies that were included in the original meta-analyses, because they (a) were not in fact designed to address the research question that the meta-analysis sought answers to, (b) did not report quantitative data (such as pretest scores) required for a reliable effect calculation, (c) exhibited confounds that make it difficult to attribute an observed effect to the factor of interest, or (d) were described with insufficient detail to allow a proper evaluation. Unfortunately, applying stricter selection criteria can drastically reduce sample sizes. If we were dealing with effect sizes from primary studies that were very precise replications of one another, then aggregated effect sizes could still be meaningful, but in the case of the three meta-analyses we have examined here we are dealing with primary studies that show considerable diversity in design, learning targets, outcome measures, and instructional settings. Given this diversity, it is not surprising that the addition or exclusion of a few primary studies can alter the outcome of a meta-analysis. The original meta-analyses seem to have been conducted in the spirit of an inclusive approach to primary-study selection (for the sake of sample sizes). It has not been our intention to argue that the “when in doubt, leave it out” stance taken in our replication attempts is necessarily better. The point is, rather, that readers of meta-analytic reviews (be they researchers, policy makers, or teaching professionals) need to be aware that any meta-analytic endeavor involves multiple choices on the part of the analyst, each of which impacts the outcomes (Oswald & Plonsky, Reference Oswald and Plonsky2010). To help readers appreciate this, authors of meta-analytic reviews are of course urged to be totally transparent about the choices they made (Maassen et al., Reference Maassen, van Assen, Nuijten, Olsson-Collentine and Wicherts2020; Norris & Ortega, Reference Norris, Ortega, Norris and Ortega2006). It is doubtful, however, whether many consumers of meta-analytic reviews closely inspect the method sections in such publications, where those choices are explained. Instead, readers may rely solely on the information provided in the abstract and possibly the general conclusion section. Owing to their status as comprehensive reviews, conclusions drawn from meta-analyses exert a certain authority. We hope to have demonstrated that assertions about the role of a given factor (be it the primary factor of interest or a moderating factor) need to be made with caution, especially in the case of recent strands of empirical inquiry.
Recommendations may also be distilled for the researcher wishing to embark on a meta-analysis. One recommendation is to carefully delineate the factor(s) of interest and to evaluate whether the available strand of research related to this factor lends itself to a robust and meaningful analysis. When the maturity of a given domain for meta-analysis is uncertain, it is advisable to first carry out a scoping review. A scoping review is another type of research synthesis, one that surveys a domain of literature to identify current trends, commonly used methods, and gaps in findings (e.g., Gurzynski-Weiss & Plonsky, Reference Gurzynski-Weiss, Plonsky and Gurzynski-Weiss2017; Hillman et al., Reference Hillman, Selvi and Yazan2020; Tullock & Ortega, Reference Tullock and Ortega2017). A scoping review can help determine if subsequent meta-analytic work is appropriate and worthwhile. After embarking on a meta-analysis, researchers are advised to scrutinize each candidate study to determine its eligibility and to make the criteria for study inclusion clear. As we have illustrated, a field may look ready at first glance, as one starts deploying the powerful online search engines at one’s disposal, but this may be deceptive if it turns out that many candidate studies fail to meet the standards for inclusion. Unfortunately, scrutinizing the method sections of a large collection of empirical research papers is a labor-intensive exercise, and meta-analytic replications are not immune to interpretation errors either; we fully recognize potential shortcomings in our own reassessment of the primary studies included in our three case studies. A faster alternative could be to use the prestige of the journals in which candidate studies were published as a proxy for quality (e.g., Faez et al., Reference Faez, Karas and Uchihara2019), under the assumption that some journals use more rigorous review processes than others. This, however, raises the difficult question of what bibliometric data are most suitable for distinguishing between journals according to the rigor of their review processes. An additional difficulty is that the literature resulting from this approach may be limited to publications from privileged, “WEIRD” (Western, Educated, Industrialized, Rich, and Democratic) contexts, potentially disadvantaging those who have less access to publishing in prestigious peer-refereed journals (Andringa & Godfroid, Reference Andringa and Godfroid2020; Cho, Reference Cho2009; Henrich et al., Reference Henrich, Heine and Norenzayan2010). Besides, even the most prestigious journals occasionally publish articles that are arguably nonoptimal (or, at least, nonoptimally suited for a given meta-analytic purpose). In fact, among the primary studies we felt justified in excluding from our reanalyses, there were several that had appeared in prestigious journalsFootnote 10 (see supplements on IRIS for details on each individual study). It is worth mentioning in this context that each of the three meta-analytic reviews examined here appeared in a prestigious journal, too. So, perhaps our call for caution should be extended to journal editors, editorial boards, and reviewers.
In any case, given the issues highlighted, (some of) the conclusions presented in the meta-analyses we have examined here should be taken as tentative for now. Fortunately, new studies are continually being added to the various strands of inquiry in our discipline, so we can be hopeful that sooner or later it will become possible to revisit these meta-analyses and to replicate them with a larger collection of eligible studies. This sustained effort at updating and replicating meta-analyses can be made lighter if meta-analytic reports are transparent not only about which studies were included but also about precisely how effect sizes were calculated (so the same procedures may be followed in the updates). For one of the three meta-analyses examined here (Yousefi & Nassaji, Reference Yousefi and Nassaji2019), we felt it necessary to recalculate effect sizes because it was not clear to us precisely on which contrasts the authors had based their calculations. A lack of clarity about how contrasts were defined and analyzed not only limits readers’ ability to evaluate meta-analytic findings but also hinders replication efforts in which effect sizes from new studies could systematically be added to an existing pool and thus gradually make the outcome more robust. The field of applied linguistics has seen a push toward open-science practices in recent years, including recognition of open data and materials through badges in major journals (e.g., Studies in Second Language Acquisition, Annual Review of Applied Linguistics, Language Learning, Modern Language Journal), repositories for instruments and materials (IRIS-database.org), and registered replications (Morgan-Short et al., Reference Morgan-Short, Marsden, Heil, Ii, Leow, Mikhaylova, Mikolajczak, Moreno, Slabakova and Szudarski2018) and reports (Marsden et al., Reference Marsden, Morgan‐Short, Trofimovich and Ellis2018). Open science practices are one way to promote equity through the sharing of knowledge, instruments, and findings in freely accessible and permanent repositories. While there is growing enthusiasm for open access in applied linguistics research, L2 researchers (and academics more broadly) often fail to practice what they preach in terms of publishing open access (e.g., Zhu, Reference Zhu2017) or making data freely available. The coding schemes and data of some prior meta-analyses have been uploaded to repositories such as IRIS (Bryfonski & McKay, Reference Bryfonski and McKay2019; Plonsky, Reference Plonsky2011, Reference Plonsky and Porte2012, Reference Plonsky, Chamot and Harris2019; Plonsky & Kim, Reference Plonsky and Kim2016; Plonsky & Ziegler, Reference Plonsky and Ziegler2016; Plonsky & Zhuang, Reference Plonsky, Zhuang and Taguchi2019), and this is also where the coding schemes and data of the present three replications can be found. Others have called for more attention to open science in meta-analytic work. McKay and Plonsky (in press), for example, recommend that “all meta-analysts make available not only their coding schemes but also their data and any code used to analyze that data” (p. 14). However, meta-analysis continues to be underrepresented in terms of shared materials and data. Making data openly available is yet another methodological choice, one that opens the door to closer scrutiny of studies and findings. Whatever channel is deemed most appropriate, the sharing of coding sheets in meta-analysis is critical for building upon prior work and supporting future meta-analysts.
It is worth mentioning that calls for greater transparency in reporting meta-analyses are being made outside the discipline of applied linguistics as well (e.g., Maassen et al., Reference Maassen, van Assen, Nuijten, Olsson-Collentine and Wicherts2020, in the field of psychology).
Returning specifically to the three case studies we have presented here, it is important to clarify that our intention was by no means to criticize the instructional interventions advocated in them (i.e., technology-mediated pragmatics instruction, corpus use for vocabulary learning, and TBLT). It was, in fact, our interest in these topics that led us to read and then further explore these three meta-analyses, and in the case of Bryfonski and McKay, reanalyze the pool of primary studies with different criteria. We hope that our three examples can serve as an incentive for others to reexamine the meta-analyses available in their own domains of interest.
Supplementary Materials
To view supplementary material for this article, please visit http://dx.doi.org/10.1017/S0272263120000327.