The discipline of pedagogy-oriented applied linguistics has witnessed a proliferation of meta-analytic reviews in recent years (e.g., Lee et al., Reference Lee, Jang and Plonsky2015; Shintani, Reference Shintani2015; Uchihara et al., Reference Uchihara, Webb and Yanagisawa2019). These are reviews that collect as many empirical studies on the role of a given factor as possible, and then calculate the weighted average effect from that pool. Meta-analyses are useful because they help to estimate with greater confidence than any individual empirical study whether the chosen factor of interest is likely to play a role that is not confined to specific contexts, and how substantial its role is likely to be. Some researchers may therefore find meta-analytic reviews particularly useful when they make excursions into domains outside their own niche because it seems safer to rely on a comprehensive review than on a couple of individual empirical studies. Even practitioners and policy makers—or those advising practitioners and policy makers—may consider the bottom line of a meta-analytic review a shortcut into the available research evidence and may rely on it to inform their instructional approaches and recommendations for teaching. Sometimes a meta-analysis may be rather broad in its research question and—though certainly of theoretical value—this may limit its potential to inform practitioners’ decision making. For example, a meta-analysis that computes the likely effect of instruction in comparison with no instruction (e.g., Kang et al., Reference Kang, Sok and Han2019) cannot, as such, tell practitioners what instructional interventions work particularly well, unless types of instructional interventions are examined as moderator variables as part of the analysis. In this article, however, we examine three recently published meta-analyses that are sufficiently specific in their research focus and whose conclusions may thus be taken up by educators to guide their practices. A recurring theme is the importance of cautious sampling and transparent methodological decision making, but each of the critiques serves to illustrate additional considerations for interpreting the outcomes of meta-analyses.
The first meta-analysis we examine, about pragmatics instruction (Yousefi & Nassaji, Reference Yousefi and Nassaji2019), offers as one of its conclusions (regarding a moderator variable) that computer-mediated pragmatics instruction generates larger effects than face-to-face instruction. However, the collection of primary studies that this assertion is based on contains hardly any studies that directly compare the two modes of instruction. Instead, this conclusion is based on an indirect comparison of the aggregated effect sizes from a small set of studies that implemented computer-mediated instruction and the aggregated effect sizes from a larger set of studies that implemented face-to-face instruction. This is potentially problematic because effect sizes are influenced by the design features of empirical studies and by the contrasts on which they are based. For example, effect sizes tend to be larger in single-group pre-/posttest designs than in studies where effects are calculated by comparing the learning gains of one or more treatment groups with those of another group. A greater proportion of one type of study design in either set is thus likely to compromise a fair comparison. A reanalysis of Yousefi and Nassaji (Reference Yousefi and Nassaji2019), with greater scrutiny of the primary studies and with separate effect sizes calculated for different study designs, suggests that the assertion about the superiority of the computer-mediated mode of instruction was not (yet) justified.
The second meta-analysis we examine (Lee et al., Reference Lee, Warschauer and Lee2019) offers strong support for the use of corpora in vocabulary learning. Unlike in Yousefi and Nassaji (Reference Yousefi and Nassaji2019), the aggregated effect size here is based exclusively on studies with a between-groups design, which should make it easier to interpret. There is nonetheless a difficulty in interpreting the outcome because in the majority of the primary studies included in the analysis it is impossible to tell whether the between-group differences in learning gains should be ascribed to the use of a corpus per se, even though this is the factor of interest according to the title and the abstract of the article. In some of the studies, both treatment conditions involved corpus use. In many others, the treatment conditions that involved corpus use differed from their comparison conditions in diverse ways other than corpus use. Calculating an effect size from a small set of studies where corpus use was unequivocally the independent variable still yields an outcome in support of corpus use for vocabulary learning, but far less compellingly so than what emerged from the original meta-analysis.
The third meta-analysis regards the benefits of task-based language teaching (TBLT) programs (Bryfonski & McKay, Reference Bryfonski and McKay2019). Although authors of primary studies often label the instructional programs they put to the test “task based” (and in this meta-analysis the same labels were used), this may not always correspond to how the approach is conceived in other TBLT literature. It is therefore difficult to determine the merits of task-based teaching (as opposed to other versions of communicative language teaching, such as task-supported teaching) on the basis of the aggregated effect size from the literature currently available. Replicating TBLT meta-analyses with stricter sampling criteria proves difficult because of a dearth of studies that empirically assess task-based programs relative to non-task-based programs—and the few that are available report mixed findings (e.g., Phuong et al., Reference Phuong, Van den Branden, Van Steendam and Sercu2015).
All things considered, the three “case studies” presented here illustrate that conclusions drawn from meta-analytic research should be interpreted with an eye toward the methodological choices made during the meta-analytic process.
CASE STUDY 1: YOUSEFI AND NASSAJI (Reference Yousefi and Nassaji2019)
SYNOPSIS AND PRELIMINARY COMMENTS
Yousefi and Nassaji’s (Reference Yousefi and Nassaji2019) meta-analysis investigates the effects of instruction on second language pragmatics acquisition. According to the authors, the study is an update to prior work in this area (e.g., Jeon & Kaya, Reference Jeon, Kaya, Norris and Ortega2006, but see also another recent meta-analysis of L2 pragmatics instruction: Plonsky & Zhuang, Reference Plonsky, Zhuang and Taguchi2019). It not only includes more recent studies but also examines previously uninvestigated moderator variables, most notably the role of computer-mediated pragmatics instruction. Based on 36 studies, the authors report overall effectiveness of pragmatics instruction as d = 1.101. This meta-analytic evidence that pragmatics instruction clearly works is reassuring for teachers and course designers, although instructors may be especially interested in what kinds of instruction work particularly well in certain contexts. Yousefi and Nassaji’s analysis of moderator variables is informative in this regard, for example because a larger effect emerged for computer-mediated instruction (mean d = 1.172) than for face-to-face instruction (mean d = 0.965). This led the authors to assert that among the pedagogical implications of their findings “the most outstanding one is the potential of various technologies that can mediate the teaching and learning of pragmatics” (p. 302). Several additional pertinent moderator variables were explored (such as explicitness of instruction, type of outcome measures, length of treatment, and participants’ proficiency level),Footnote 1 but for reasons of space, our critique will focus on the comparison of computer-mediated and face-to-face instruction.
The studies included in Yousefi and Nassaji’s meta-analysis vary considerably in their designs. Many are single-group studies, where participants’ progress is tracked from a pretest measure to a posttest measure (i.e., the effect size calculation is based on within-group contrasts). Others compare a treatment group’s progress to that of a control group (which receives no instruction regarding the learning targets of interest in the experiment), and a few compare the progress of two treatment groups, where each group experiences a different intervention regarding the same learning targets. Yousefi and Nassaji calculated overall effects by “combining the effects of all instructional types” (p. 294). However, effects of an instructional intervention often appear larger if one contrasts participants’ pre- and posttest performance than if one assesses the effectiveness of an intervention for a group of participants relative to another group’s progress. This is because within-group comparisons of pre- and posttest scores concern the same participants in the two datasets and thus involve less variance than between-group comparisons, where the contrast in pre- to posttest gains concerns different participants (bringing in more variance). A reduction in variance and standard deviation (SD) will result in larger effect sizes because the SD makes up the denominator in the formula for Cohen’s d. Unless studies report pretest-posttest correlations that can be used as a correction for the difference, between-groups and within-groups study designs should be analyzed separately. In their meta-analysis of effects in L2 research, Plonsky and Oswald (Reference Plonsky and Oswald2014) found that observed effects resulting from within-group contrasts were indeed substantially larger than those resulting from between-groups contrasts. They therefore proposed a different set of benchmarks for small (d = .60), medium (d = 1.00), and large (d = 1.40) effects for within-group contrasts than for between-groups contrasts (small, d = .40; medium, d = .70; and large, d = 1.00). Owing to the mix of within-group and between-groups contrasts in Yousefi and Nassaji’s collection of studies, and the lack of reported pretest-posttest correlations, it is not clear how the overall estimated effect of d = 1.101 should be interpreted in relation to the previously mentioned benchmarks.
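To make the arithmetic behind this point concrete, the sketch below (in Python) contrasts the two ways of standardizing the same raw gain. It is a minimal illustration with invented numbers, and the formulas are one common parameterization of Cohen’s d for each design, not necessarily the one used by Yousefi and Nassaji or by Plonsky and Oswald.

```python
import math

def d_between(gain_t, gain_c, sd_t, sd_c, n_t, n_c):
    """d for a between-groups contrast: difference in mean gains divided by
    the pooled SD of those gains (one common parameterization)."""
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    return (gain_t - gain_c) / sd_pooled

def d_within(mean_pre, mean_post, sd_pre, sd_post, r):
    """d for a within-group (pre/post) contrast, standardized on the SD of the
    gain scores; r is the pretest-posttest correlation."""
    sd_gain = math.sqrt(sd_pre**2 + sd_post**2 - 2 * r * sd_pre * sd_post)
    return (mean_post - mean_pre) / sd_gain

# Invented numbers: the same 5-point gain on a scale with SD = 10.
print(d_between(gain_t=5, gain_c=0, sd_t=10, sd_c=10, n_t=30, n_c=30))    # 0.50
print(d_within(mean_pre=50, mean_post=55, sd_pre=10, sd_post=10, r=0.8))  # ~0.79
```

The higher the pretest-posttest correlation r, the smaller the SD of the gain scores and hence the larger the within-group d, which is why the two designs call for separate analyses (or a correction) when their effect sizes are aggregated.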
We therefore reanalyzed the data by calculating separate effect sizes for the within-groups contrasts (k = 103) and the between-groups contrasts (k = 52). While we might expect such a reanalysis to produce slightly different aggregated effect sizes, we would not expect it to have profound repercussions for the general conclusion that pragmatics instruction is effective. As mentioned, Yousefi and Nassaji’s article also investigated modality (computer-mediated vs. face-to-face pragmatics instruction) as a moderator variable. An issue that can arise when examining moderators of a main effect is the difficulty of separating out and attributing unique effects to each moderating variable. To account for this, primary studies should be closely examined in terms of their study designs and for potential interactions between moderating variables. For example, a recent meta-analysis about the effect of glosses on vocabulary acquisition (Yanagisawa et al., Reference Yanagisawa, Webb and Uchihara2020) included mode of gloss (textual, pictorial, or aural) as a moderating variable but deliberately selected only studies on single glosses for this comparison. Inclusion of studies on multimodal glosses would have made it difficult to separate the effect of mode (e.g., textual vs. pictorial) from the effect of providing more than one annotation (e.g., textual + pictorial) for the same word (Boers et al., Reference Boers, Warren, Grimshaw and Siyanova-Chanturia2017; Ramezanali et al., Reference Ramezanali, Uchihara and Faez2020).
In the case of Yousefi and Nassaji’s investigation of the moderating variable of computer-mediated instruction, there is a potential interaction with the type of study design because the set of studies implementing computer-mediated instruction consists mostly of within-group contrasts, and so the larger aggregated effect size that emerged for this set could be an artifact of this design feature rather than reflecting an effect of computer-mediation per se. Moreover, in virtually all the computer-mediated studies the pragmatics instruction was explicit. This is relevant because Yousefi and Nassaji found a larger overall effect for explicit (d = 1.213) than implicit (d = 0.848) instructional treatments. Explicitness of instruction could thus be an alternative explanation for the comparatively large effect size that emerged from the computer-mediated interventions.
WHAT ARE THE CONTRASTS?
As mentioned, there is a wide range of study designs in the collection of primary studies used by Yousefi and Nassaji (Reference Yousefi and Nassaji2019), yielding diverse contrasts for effect size calculations (pretest vs. posttest scores of a single group or differences in learning gains between two groups). It is important for the sake of transparency and replicability of a meta-analysis to specify what contrasts are used for these calculations (Maassen et al., Reference Maassen, van Assen, Nuijten, Olsson-Collentine and Wicherts2020). Because Yousefi and Nassaji (Reference Yousefi and Nassaji2019) did not include this information, we adopted the following explicitly stated procedures from the earlier meta-analysis by Jeon and Kaya (Reference Jeon, Kaya, Norris and Ortega2006) in our replication (a schematic rendering of these rules follows the numbered list):
1. For studies that examined one treatment group and one control group (that received no instructional intervention) by means of pre- and posttests, effect sizes were calculated by contrasting the two groups’ outcomes on pre- and immediate posttests (Alcón-Soler, Reference Alcón-Soler2015; Bardovi-Harlig et al., Reference Bardovi-Harlig, Mossman and Vellenga2015; Eslami & Eslami-Rasekh, Reference Eslami, Eslami-Rasekh, Alcón-Soler and Martinez-Flor2008; Felix-Brasdefer, Reference Felix-Brasdefer2008; Furniss, Reference Furniss2016; Narita, Reference Narita2012; Rafieyan et al., Reference Rafieyan, Sharafi-Nejad, Khavari, Eng and Mohamed2014; Tan & Farashaiyan, Reference Tan and Farashaiyan2012).
2. For studies that examined multiple treatment groups and one control group by means of pre- and posttests, effect sizes were calculated by contrasting each group’s immediate pre- and posttest outcomes separately with the control group’s immediate pre- and posttest outcomes (Eslami & Liu, Reference Eslami and Liu2013; Hernandez, Reference Hernandez2011; Li, Reference Li, Taguchi and Sykes2013; Nguyen et al., Reference Nguyen, Phamb and Phamb2012; Tajeddin et al., Reference Tajeddin, Keshavarz and Zand Moghaddam2012).
3. For studies that examined two or more treatment groups without any control group, pretest data was contrasted with immediate posttest data for each group (Chen, Reference Chen2011; Derakhshan & Eslami, Reference Derakhshan and Eslami2015; Felix-Brasdefer, Reference Felix-Brasdefer2008; Fordyce, Reference Fordyce2014; Fukuya & Martinez-Flor, Reference Fukuya and Martinez-Flor2008; Ghobadi & Fahim, Reference Ghobadi and Fahim2009; Gu, Reference Gu2011; Jernigan, Reference Jernigan2012; Li, Reference Li2012a, Reference Li2012b; Nguyen et al., Reference Nguyen, Do, Nguyen and Pham2015; Simin et al., Reference Simin, Eslami, Eslami-Rasekh and Ketabi2014; Tateyama, Reference Tateyama and Bradford-Watts2007, Reference Tateyama and Taguchi2009).
4. For studies that examined one group before and after an intervention, pretest data was contrasted with posttest data on immediate posttests (Alcón-Soler, Reference Alcón-Soler2012; Alcón-Soler & Guzman, Reference Alcón-Soler and Guzman-Pitarch2010; Tanaka & Oki, Reference Tanaka and Oki2015).
5. For studies that reported both treatment group and control group comparisons as well as within group contrasts, effect sizes were calculated for both between-group and within-group contrasts in the ways outlined above (Nguyen et al., Reference Nguyen, Phamb and Phamb2012).
6. For studies that compared two groups pre- and post-intervention and only provided the results of a multifactorial test (e.g., ANOVA), the effect size was calculated from the main effect of time for each group (Takimoto, Reference Takimoto2012a, Reference Takimoto2012b).
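To make these decision rules easier to replicate, the following sketch shows one way they could be encoded. It is our own illustration, not code used by Jeon and Kaya or by Yousefi and Nassaji; the class and function names are hypothetical, and rules 5 and 6 are omitted because they depend on study-specific reporting.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Group:
    name: str
    is_control: bool           # received no instruction on the targeted pragmatic features
    pre: Tuple[float, float]   # (mean, SD) on the pretest; feeds the d formulas sketched earlier
    post: Tuple[float, float]  # (mean, SD) on the immediate posttest

def contrasts(groups: List[Group]) -> List[Tuple[str, str]]:
    """Enumerate the contrasts implied by rules 1-4 above for one study."""
    treatments = [g for g in groups if not g.is_control]
    controls = [g for g in groups if g.is_control]
    if not treatments:
        return []
    if controls:
        # Rules 1-2: each treatment group's pre-to-post gain vs. the control group's gain.
        return [(t.name, c.name) for t in treatments for c in controls]
    if len(treatments) >= 2:
        # Rule 3: two or more treatment groups, no control -> one pre/post contrast per group.
        return [(t.name, "pre vs. post") for t in treatments]
    # Rule 4: single-group design -> one pre/post contrast.
    return [(treatments[0].name, "pre vs. post")]
```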
Some studies included in Yousefi and Nassaji’s meta-analysis provide insufficient information to calculate effect sizes according to the previously mentioned procedures, and it is unclear in some cases what method the original analysis utilized. For example, Dastjerdi and Farshid (Reference Dastjerdi and Farshid2011) only reported the results of a t-test comparing posttest results of two experimental groups. Martínez-Flor and Alcón-Soler (Reference Martínez-Flor and Alcón-Soler2007) lacked the SDs necessary to compute effect sizes (the other reported statistics were nonparametric). Cunningham (Reference Cunningham2016), one of the handful of studies in the collection that implemented a computer-mediated mode of instruction, had to be excluded because the report did not provide sufficient information for calculating effect sizes comparing the two experimental groups (which included only eight and nine participants each). In addition, one publication (Nguyen, Reference Nguyen2013) reported on the same data as another (Nguyen et al., Reference Nguyen, Phamb and Phamb2012), and so the duplicate report was excluded. These studies (Cunningham, Reference Cunningham2016; Dastjerdi & Farshid, Reference Dastjerdi and Farshid2011; Martínez-Flor & Alcón-Soler, Reference Martínez-Flor and Alcón-Soler2007; Nguyen, Reference Nguyen2013) were therefore excluded from our reanalysis, leaving a total of 32 individual studies (instead of the original 36) and 155 contrasts (see supplement hosted on IRIS for a full list with justifications for inclusion/exclusion).
Another modification to the original meta-analysis concerns the categorization of one of the studies (Nguyen et al., Reference Nguyen, Do, Nguyen and Pham2015) that was coded as computer-mediated instruction by Yousefi and Nassaji. This study utilized e-mail writing as an outcome measure and may thus at first glance appear to be about computer-mediated instruction, but the instruction was not in fact computer mediated. We therefore had to remove it from the set of computer-mediated instruction studies in our reanalysis, reducing this set to six studies.
BENEFITS OF COMPUTER-MEDIATED INSTRUCTION?
Our reanalysis confirms the general finding of Yousefi and Nassaji (Reference Yousefi and Nassaji2019) that pragmatics instruction has a positive effect. For the between-groups contrasts, the overall effect is d = 1.11, a large effect for between-groups comparisons in L2 research. For within-group studies, the result is d = 1.32, a medium to large effect for within-groups comparisons (see supplement hosted on IRIS for full results tables).
However, our reanalysis does not confirm the original meta-analysis when it comes to the comparison of computer-mediated and face-to-face interventions. According to the original analysis, the former yielded larger effects (d = 1.172; k = 30Footnote 3) than the latter (d = 0.965; k = 80), but, according to our reanalysis of the data, the face-to-face mode in fact generated the larger effects. For between-groups designs, we now find a large effect of d = 1.271 (k = 40) for face-to-face instruction and only a small effect of d = 0.65 (k = 12) for computer-mediated instruction. Taking only the within-group studies, we again find a large effect of d = 1.46 (k = 85) for face-to-face instruction and a small effect of d = .75 (k = 18) for computer-mediated instruction (see supplement hosted on IRIS for a full results table). It needs to be acknowledged that the sample sizes for the computer-mediated interventions are now even smaller than they were in the original meta-analysis (due to selection decisions explained in the preceding text and due to the separation of between- and within-group contrasts). This highlights the need for more empirical investigations of computer-mediated pragmatics instruction. Investigations that directly compare the effectiveness of computer-mediated and face-to-face instruction for pragmatics would be especially welcome. In the collection used by Yousefi and Nassaji, only one study (Eslami & Liu, Reference Eslami and Liu2013) did this, and it found no difference in effectiveness between the two modes. A more recent study on pragmatics instruction (Tang, Reference Tang2019), outside the scope of the meta-analysis, found no advantage for computer-mediated activities over face-to-face activities either. In sum, our replication with separate effect size calculations based on study design differences did not support the superiority of computer-mediated pragmatics instruction over face-to-face instruction.
CASE STUDY 2: LEE ET AL. (Reference Lee, Warschauer and Lee2019)
SYNOPSIS AND PRELIMINARY COMMENTS
Lee et al.’s (Reference Lee, Warschauer and Lee2019) meta-analysis concerns the effects of corpus use on second language vocabulary learning. It is a partial replication of an earlier, broader-scope meta-analysis of corpus use in language learning (Boulton & Cobb, Reference Boulton and Cobb2017) but focused specifically on vocabulary and only included studies with an instructed control group (a comparison group) in their design.Footnote 4 Based on 29 primary studies, the weighted average effect on short-term learning was found to be medium sized (Hedges’s g = 0.74). In eight of the studies, delayed posttests were included, and these also showed a positive effect (Hedges’s g = 0.64). While Lee et al. (Reference Lee, Warschauer and Lee2019) acknowledge the role of several moderator variables (such as L2 proficiency level), the preceding aggregated effect sizes clearly suggest that corpus use is beneficial for L2 vocabulary learning.
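For readers less familiar with how a “weighted average effect” is obtained, the sketch below shows a generic fixed-effect, inverse-variance average of Hedges’s g values. It is an illustration only, with an invented mini-dataset and a standard large-sample variance approximation; it is not the exact model or software used by Lee et al.

```python
def weighted_mean_effect(effects):
    """Fixed-effect, inverse-variance weighted average of Hedges's g values.
    `effects` is a list of (g, n_treatment, n_comparison) tuples."""
    numerator, denominator = 0.0, 0.0
    for g, n1, n2 in effects:
        variance = (n1 + n2) / (n1 * n2) + g**2 / (2 * (n1 + n2))  # large-sample approximation
        weight = 1 / variance
        numerator += weight * g
        denominator += weight
    return numerator / denominator

# Invented illustration: three studies of different sizes; the largest study
# pulls the average toward its own estimate.
print(weighted_mean_effect([(0.9, 20, 20), (0.5, 60, 60), (1.2, 15, 15)]))  # ~0.68
```

Many published meta-analyses use random-effects models, which add a between-study variance component to each weight, but the basic weighting logic is the same: larger, more precise studies contribute more to the average.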
In the following text, we highlight the issue of determining whether the main effects observed in primary studies are always a result of the variable of interest (in this case, corpus use). Before turning to that issue, we point out that it is not always clear what is meant by “effects” in this meta-analysis. Presumably, what is meant is learning outcomes. However, some of the studies (Frankenberg-Garcia, Reference Frankenberg-Garcia2012, Reference Frankenberg-Garcia2014; Stevens, Reference Stevens1991) investigated learners’ success rates as they did exercises under various input conditions, but did not include posttests to gauge the learning outcomes generated by these activities.Footnote 5 If the aim of the meta-analysis was to compare the effectiveness of different procedures in terms of learning outcomes, then these studies do not serve that purpose, and so we will exclude them from our reanalysis.
WHAT IS THE INDEPENDENT VARIABLE?
Corpora can be used for the purpose of vocabulary learning in various ways. The introduction to Lee et al. (Reference Lee, Warschauer and Lee2019) indicates that the focus of the article is corpus use for guided inductive learning (p. 722), also known as discovery learning and data-driven learning (Johns, Reference Johns1991). In this approach, learners typically examine concordance lines (i.e., examples of language use extracted from a corpus) with a view to discovering the meanings of words or their usage patterns (e.g., their word partnerships or collocations). Because Lee et al. (Reference Lee, Warschauer and Lee2019) refer first (in the title and the abstract) to corpus use in general and then (in the introduction) to the benefits of concordance lines specifically for the purpose of discovery learning, there is some ambiguity about what is meant by “the effects of corpus use.” If the independent variable of interest is corpus use more generally, then some of the primary studies appear less than ideally suited because both treatment conditions in these studies utilized examples extracted from a corpus. The difference between these groups lay in how the corpus-based instances were used. For example, Sun and Wang (Reference Sun and Wang2003) compared the use of corpus-based examples for guided inductive learning to their use for the purpose of illustrating a pattern that was first explained to the learners. In other words, it was using corpus-based instances to prompt inductive learning versus using corpus-based instances as part of deductive learning that was the variable of interest, and not the use of corpus-based instances per se.
If the effectiveness of corpus use for guided inductive learning is the main variable of interest, then the challenge is to separate the added value of corpus use from that of guided inductive learning. After all, guided inductive learning can also be steered by means of examples that are not extracted from a corpus, but that are invented or collected differently by teachers or textbook writers. With very few exceptions (e.g., Cobb, Reference Cobb1997; Tongpoon, Reference Tongpoon2009), the primary studies in this meta-analysis did not compare the effectiveness of corpus-based and non-corpus-based examples for the purpose of guided inductive learning. Instead, in several of the studies (Anani Sarab & Kardoust, Reference Anani Sarab and Kardoust2014; Poole, Reference Poole2012; Sripicharn, Reference Sripicharn2003; Vyatkina, Reference Vyatkina2016; Yunus & Awab, Reference Yunus and Awab2012) corpus-based discovery learning was compared to a condition in which students received vocabulary explanations upfront followed by a few examples. In that case, it is again impossible to ascribe the superior learning observed for the corpus-based condition to the use of a corpus because it may also be attributable to the purported benefits of guided inductive learning (as opposed to deductive learning), regardless of whether the examples used for the inductive process were extracted from a corpus or produced in another way.
There are undeniably strong arguments for the use of corpus-based examples, such as their authenticity and the ease with which many examples can be generated from an online corpus (e.g., Johns, Reference Johns1991; Stevens, Reference Stevens1991). However, whether using corpus-based examples necessarily produces better learning outcomes than using, say, a series of textbook examples is an empirical question that is addressed by very few of the studies. Additionally, the distinction between authentic concordance lines and made-up examples can easily get blurred when researchers/materials designers start editing concordance lines to make them more comprehensible to the learners and to ensure the discovery-learning progresses as intended (e.g., Kim, Reference Kim2015; Yang, Reference Yang2015). In Supatranont (Reference Supatranont2005, pp. 84–91, and appendices J and K), for example, the only difference between the concordance lines and the textbook-type examples on the student handouts was that the former looked like concordance lines while the latter were presented as regular sentences. The difference between the two treatment conditions in this study was not the presence versus absence of corpus-based examples. Nor was it the presence versus absence of discovery-learning activities because both groups were required to find patterns in the sets of examples given on their handouts. The difference, rather, was that, in addition to pen-and-paper practice, the experimental group conducted computer-assisted searches, while the comparison group only worked with the handouts.
A LEVEL PLAYING FIELD?
A frequent topic in this collection of primary studies is collocation (word partnerships, such as conduct research, sore throat, and depend on), with several studies reporting the benefits of presenting learners with sets of concordance lines showing the most common collocates of a word. The effect of exposing learners to collocations is typically shown in posttests requiring learners to recall the word partnerships they were exposed to in the treatment. However, this is often in comparison with another treatment condition that did not involve any work on collocations at all but instead included learning activities on something else, such as single words or grammar (Mirzaei et al., Reference Mirzaei, Domakani and Rahimi2016; Rahimi & Momeni, Reference Rahimi and Momeni2012; Rezaee et al., Reference Rezaee, Marefat and Saeedakhtar2015). In other words, the experimental groups were exposed to the target items they would be tested on in the posttests, while the comparison groups were not exposed to these target items during their instructional treatment. It is therefore not surprising that the experimental groups outperformed the comparison groups in the posttests. This is reflected in some very large treatment effects (Hedges’s g = 2.07 in Mirzaei et al., Reference Mirzaei, Domakani and Rahimi2016, and 1.98 in Rahimi & Momeni, Reference Rahimi and Momeni2012).Footnote 6 However, whether these effects should be ascribed to the nature of instruction (e.g., the use of concordance lines from a corpus) or simply to the focus of instruction (i.e., collocation) is unclear. It is quite conceivable that the comparison groups would not have performed so poorly in the posttests, had they also been exposed to the target collocations during treatment. Put differently, the instructed control groups in these studies were not true comparison groups, but more akin to no-treatment control groups (i.e., groups that receive no instruction on the items or patterns on which they will be tested). If the purpose of the meta-analysis is to estimate the effectiveness of corpus use relative to other instructional treatments that share the same learning objective, then it seems justified to exclude these studies.
Other studies included in the original meta-analysis demonstrated imbalanced learning opportunities between treatment groups, even though both groups did exercises with a focus on collocation. This can be illustrated with reference to a study by Daskalovska (Reference Daskalovska2014). The experimental group in this study was instructed to use online corpus tools to collect the 10 most common adverb collocates of verbs and to report their findings. The comparison group did short pen-and-paper exercises about the same verbs but was exposed to a smaller number of adverbs. Obtaining a high score on one of the posttest sections—the section with the heaviest weighting—hinged on the learners’ ability to supply a wide range of adverbs, and so this potentially gave an advantage to the experimental group. One of the other sections of the posttest did appear better aligned with the comparison group’s practice materials, given that it was a multiple-choice test and the study package created for the comparison group included a similar multiple-choice activity. However, the correct answers to the multiple-choice items in the posttest were not included in the multiple-choice exercise done in the learning stage. For example, in the exercise the students learned “I entirely agree” and “I clearly remember,” but in the posttest they needed to select “I strongly agree” and “I vividly remember” to score points. The poor posttest performance of the comparison group is therefore unsurprising.
Equally unsurprising is the finding that better learning outcomes after corpus use were observed in studies where the experimental groups engaged in corpus-based activities in addition to activities they shared with the comparison groups (e.g., Gordani, Reference Gordani2013), while comparison groups did not engage in any supplementary activities regarding the target vocabulary. In some cases, this meant the experimental groups spent extensive additional time on the target words (e.g., Karras, Reference Karras2016; Yunxia et al., Reference Yunxia, Min, Zhuo, Li and Zhou2009). Better learning outcomes for the experimental groups in these studies could thus be attributed to differences in time investment. Supplementary activities other than corpus-based ones could also be expected to enhance learning outcomes, and so, while these studies undeniably demonstrate that corpus use is effective, they do not demonstrate it is efficient in comparison to learning activities that do not require a corpus.
There are also several publications in the collection that lack sufficient detail and transparency, and for these studies it is impossible to tell if the experimental and comparison conditions differed in more ways than use or nonuse of corpus data. This lack of transparency is especially problematic given that some of these articles (a few hardly four pages long) report large effects (e.g., Hedges’s g = 1.15 in Al-Mahbashi et al., Reference Al-Mahbashi, Noor and Amir2015, and 1.38 in Yılmaz & Soruç, Reference Yılmaz2015), thus potentially inflating the aggregated effect.
If we recalculate the average effect on short-term learning based on the studies from the original pool where we do feel confident enough that differences in learning outcomes can be attributed to corpus use (see supplement hosted on IRIS for the original list of studies with justification for inclusion/exclusion), the result is markedly different from the original meta-analysis: Hedges’s g = 0.32. According to the norms proposed by Plonsky and Oswald (Reference Plonsky and Oswald2014) for between-groups contrasts, this is a small effect. However, this average is now based on only five studies, totaling only nine contrasts from the original meta-analysis. Clearly, more (and more focused) empirical investigations of the merits of corpus use are needed for a meta-analysis on this subject to produce a more reliable estimate.
CASE STUDY 3: BRYFONSKI AND MCKAY (Reference Bryfonski and McKay2019)
SYNOPSIS AND PRELIMINARY COMMENTS
Bryfonski and McKay’s (Reference Bryfonski and McKay2019) meta-analytic review was a first effort at estimating the effectiveness of TBLT programs.Footnote 7 Their search produced 27 studies with a between-groups design as well as a small collection of studies with within-groups designs (i.e., comparing a single treatment group’s pretest and posttest performance). The original report cautioned that the collection of within-groups studies was too small to draw conclusions from (p. 619). Here, we therefore focus on the set of between-groups comparisons. The average effect size Bryfonski and McKay calculated from this collection (d = 0.93) approximates the threshold (d = 1.00) proposed by Plonsky and Oswald (Reference Plonsky and Oswald2014) for a large between-groups effect. The report concludes that this finding “supports the notion that program-wide implementation of TBLT is effective for promoting L2 learning above and beyond the learning found in programs with other, traditional or non-task-based pedagogies” (p. 622).
One of the questions we discuss in the following text is the extent to which the studies included in the original meta-analysis examined implementations of task-based language teaching, that is, TBLT in its “strong” form (Long, Reference Long2015) as opposed to task-supported language teaching. Before turning to this question, we note that it seems worthwhile, on reflection, to exclude three of the primary studies in the original collection of between-group studies because they examined TBLT without directly comparing TBLT to non-TBLT treatments (Lai & Lin, Reference Lai and Lin2015; Li & Ni, Reference Li and Ni2013; Shabani & Ghasemi, Reference Shabani and Ghasemi2014). A further study (González-Lloret & Nielson, Reference González-Lloret and Nielson2015) did not establish group equivalence prior to the respective treatments (i.e., there was no pretest), and because an effect size based solely on posttest scores is not optimally reliable if we cannot be confident about pretreatment comparability, we exclude this study from our reanalysis as well.
WHAT’S IN A NAME?
Some TBLT proponents distinguish between task-based programs, which use tasks throughout (Long, Reference Long2016), and task-supported programs, where tasks are used alongside or in addition to other approaches, including those involving explicit instruction (Ellis, Reference Ellis2018). With one exception (González-Lloret & Nielson, Reference González-Lloret and Nielson2015—which, as already noted, was excluded from the reanalysis because of lack of pretest data), all the programs described in the primary studies included in this meta-analytic review can be considered task supported rather than task based. An example is Amin (Reference Amin2009), where “[t]he TBL approach adopted in this study takes the form of explicit grammatical instruction in conjunction with communicative activities” (p. 81). Readers should therefore interpret TBLT, the term used in the majority of the included articles, as referring to task-supported implementations rather than the “strong” version of TBLT outlined by Long (Reference Long2015).
Another difficulty lies with the notion of task, for which slightly different definitions have been used in prior literature (e.g., Ellis et al., Reference Ellis, Skehan, Li, Shintani and Lambert2019; Long, Reference Long2015). What is agreed on by proponents of TBLT in its various forms, however, is that tasks are meaning focused (i.e., focused on the content of messages rather than their linguistic packaging) and make learners use language as a vehicle toward a goal that itself is not linguistic. For example, in one of the original studies (Lochana & Deb, Reference Lochana and Deb2006) the following activity is presented as a task according to those researchers’ interpretation of TBLT: “Your teacher will read out a passage; listen to the passage carefully and complete the blanks.” In another study (Amin, Reference Amin2009), the author explains that “The pedagogical tasks … are what learners do in class, such as listening to a tape and repeating phrases or sentences” (p. 44). Although these activities are labeled tasks in these publications, they are language-focused exercises rather than tasks as understood in TBLT circles. Several authors (e.g., Birjandi & Malmir, Reference Birjandi and Malmir2009; Sarani & Sahebi, Reference Sarani and Sahebi2012; Yang, Reference Yang2008) consider pair work the defining characteristic of TBLT, regardless of whether the activities have a clear communicative purpose. These examples illustrate how widely “task” is interpreted in instructional contexts around the world.
In the following text we report an attempt at a new meta-analysis that adopts a narrower interpretation of tasks and that only includes studies that meet the criteria for tasks defined by Ellis and Shintani (Reference Ellis and Shintani2013, see following text). First, however, it may be worth speculating why TBLT is understood in such diverse ways, including ways not at all intended by TBLT advocates. Many of the authors of the studies in the meta-analysis cited Willis (Reference Willis1996) and Willis and Willis (Reference Willis and Willis2007), summed up on http://www.willis-elt.co.uk/ and https://www.teachingenglish.org.uk/article/a-task-based-approach to justify their task and program designs. In Willis and Willis’s (Reference Willis and Willis2007) version of TBLT, communicative tasks are preceded by a pretask phase, to help learners prepare for the task, and are followed by a posttask phase, where time is devoted to feedback, reflection on task performance, and reactive treatment of language problems. Several authors relied heavily on this three-phase lesson model but often with a focus on language as a study object rather than as a means toward a nonlinguistic end. It is understandable how “task” may be misconstrued from webpages such as https://www.teachingenglish.org.uk/article/criteria-identifying-tasks-tbl without carefully considering supplementary information. For example, one of the criteria listed there is that the activity should have “a goal or an outcome.” If a researcher misinterprets this goal or outcome as increased language knowledge on the part of students, then their “TBLT” lessons may treat language as a study object instead of a vehicle. Misinformation or misunderstandings may also result in assessments of learning gains that are focused on aspects of language, such as grammatical accuracy and vocabulary knowledge, rather than the learners’ successful completion of the communicative tasks (Plonsky & Kim, Reference Plonsky and Kim2016). Once a practitioner or researcher misses the crucial point about what is meant by a goal or an outcome of a task, they may also misinterpret agreement tasks as reaching an agreement on the right answer in a language exercise and information-gap tasks as completing gap-fill exercises.
Depending on the model of TBLT, guidelines for creating task-based (or task-supported) lessons may be rather vague as to how much language-oriented instruction can (or should) be included at various stages of instruction. Additionally, TBLT proponents have slightly different views of what features distinguish a task from a language exercise. In our reanalysis, we examined the classroom procedures of the primary studies to determine the extent to which the activities labeled as tasks in the main task phase of the described lessons can be characterized as tasks as defined by Ellis and Shintani (Reference Ellis and Shintani2013). The four criteria proposed by Ellis and Shintani (Reference Ellis and Shintani2013, p. 135), slightly reworded here, are as follows:
1. The focus is on meaning, that is, on the content of messages rather than on the language code per se.
2. There is some sort of communication gap between interlocutors, that is, learners exchange information or opinions rather than telling interlocutors—including their teacher—what these interlocutors clearly already know.
3. The task instructions do not stipulate what language elements or patterns the students should use when performing the activity (because that risks turning the activity into a language-focused exercise).
4. There should be a clear purpose (e.g., solving a problem, reaching an agreement about a dilemma) other than practicing language (because in the “real” world, language use is a means to an end, not the end itself).
For 10 of the studies, we concurred that the tasks met only one of the four criteria,Footnote 8 and so it seems justified to exclude them from this narrower reanalysis (see supplement hosted on IRIS for full inclusion/exclusion criteria). After exclusion of these and the ones mentioned in the previous section (i.e., studies that were not designed to compare task-based to non-task-based interventions), the collection includes 13 studies.
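As an illustration of how such criterion-based screening can be made explicit and replicable, the sketch below codes each study against the four criteria and retains only those meeting a chosen threshold. The study names, codings, and threshold are hypothetical placeholders; our actual coding decisions are in the IRIS supplement.

```python
# The four criteria from Ellis and Shintani (2013), abbreviated as labels.
CRITERIA = ("meaning_focus", "communication_gap", "no_prescribed_language", "nonlinguistic_purpose")

# Hypothetical coding sheet: True = the main task-phase activity met the criterion.
codings = {
    "Study A": {"meaning_focus": True, "communication_gap": True,
                "no_prescribed_language": True, "nonlinguistic_purpose": True},
    "Study B": {"meaning_focus": True, "communication_gap": False,
                "no_prescribed_language": False, "nonlinguistic_purpose": False},
}

def retained(codings, minimum=len(CRITERIA)):
    """Keep studies whose main task-phase activities meet at least `minimum` criteria."""
    return [study for study, codes in codings.items()
            if sum(codes[c] for c in CRITERIA) >= minimum]

print(retained(codings))  # ['Study A']
```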
AT FACE VALUE?
Applying the criteria outlined in the preceding text requires that authors carefully detail their instructional procedures and classroom activities. However, several of the remaining research reports provide insufficient detail to apply Ellis and Shintani’s (Reference Ellis and Shintani2013) criteria. What follows are examples of how little is said about the nature of what are labeled tasks in some of the articles:
The tasks in every lesson had a high corresponding with the course book materials, because of pre-determined syllabus. The teacher used his creativity for adaptation of the tasks with the text book. (Rezaeyan, Reference Rezaeyan2014, p. 483)

[T]he students were required to do the tasks either in pair or in small groups, with the teacher monitoring their performance and encouraging more communication among them. (Mesbah, Reference Mesbah2016, p. 433)

In task-cycle phase, the students were engaged in completing different kinds of tasks. (Tan, Reference Tan2016, p. 103)

[S]tudents engaged in different communicative situations, unrelated to the actual course but organized in such a way that the participants were compelled to use the previously acquired lexico-grammar. (De Ridder et al., Reference De Ridder, Vangehuchten and Gómez2007, p. 310)

The author selected eight topics from the textbook or from outside the book, and designed the speaking tasks, considering the student’s actual level and interest. (Ting, Reference Ting2012, p. 91)

As illustrated in the previous section, authors may cite publications about TBLT and call the classroom activities they designed tasks, but this offers no guarantee that these in fact fit the criteria for tasks established in the preceding text. Some of the effect sizes in this subset of nontransparent reports are very substantial (e.g., d > 1.7 in Mesbah & Faghani, Reference Mesbah and Faghani2015, and in Tan, Reference Tan2016), even though it is difficult to tell to what these effects should be attributed. For the sake of caution, we exclude these studies in our reanalysis as well. As a result of this, the collection now includes six studies. If these remaining studies shared a tight focus and used very similar instruments and methods, a meta-analysis of them might still be meaningful. However, they in fact display very diverse foci (e.g., speaking vs. writing skills) and outcome measures (see Saito & Plonsky, Reference Plonsky, Zhuang and Taguchi2019, for an illustration that effect sizes can differ markedly depending on the type of outcome measures), and so it is doubtful whether a meaningful generalization can be drawn from such a small remaining sample.
A LEVEL PLAYING FIELD?
Regardless of whether the primary studies included in the original meta-analysis really concerned TBLT programs or, instead, compared one language-focused program to another language-focused program, the fact remains that what was presented as the experimental treatment in these studies almost consistently generated the better outcomes. One might argue that, even though the experimental treatments did not meet all of our criteria for being labeled task based, they were nonetheless better aligned with TBLT principles than the comparison treatments. If so, then the outcome of the meta-analysis could still be interpreted as support for programs exhibiting at least some features of TBLT. For example, the so-called TBLT treatments typically involved a greater amount of peer–peer interaction in the target language than the comparison treatments, where students worked mostly individually. So, even though many of the activities described in these studies are exercises instead of tasks, the fact that these exercises were typically tackled collaboratively in the treatment conditions that brought about the better learning outcomes can be meaningful (Sato & Ballinger, Reference Sato and Ballinger2016). Put differently, more nuanced distinctions within the broad spectrum of task-supported programs could be fruitful to help determine the role of specific program characteristics.
As also highlighted in our discussion of Lee et al. (Reference Lee, Warschauer and Lee2019) in the preceding text, better outcomes for the experimental treatment can in some cases be attributed to factors other than the so-called TBLT nature of the treatment. For example, Torky (Reference Torky2006) investigated the benefits of an intensive speaking course in comparison to a course where students hardly did any speaking practice. Unsurprisingly, the students from the speaking course did better in end-of-course speaking activities, which resembled their course activities. In a similar vein, the end-of-course assessment in Yang (Reference Yang2008) concerned speaking skills, which the experimental group had been given ample opportunity to develop in class while the comparison group had not. Considering the potential effect of practice–test congruency (i.e., the probability that one gets better at what one practices regardless of whether the practice method resembles TBLT or something else), we also exclude these studies from the collection of between-group comparisons in our reanalysis when the purpose is to gauge the effect of TBLT as an independent variable. This reduces the collection to three studies. Were we to calculate an average effect from these, the result would be d = 0.258, indicating a small effect, but this is hardly meaningful given the minute sample size.
An extra challenge with assessing many of the primary studies is that the description of the control/comparison conditionFootnote 9 is often as minimal as, for example, “[the] control group experienced conventional teaching” (Rezaeyan, Reference Rezaeyan2014). Even some of the lengthy texts, such as PhD dissertations, offer minimal information. For example, Murad (Reference Murad2009) only mentions that “the control group was taught using the conventional methods of teaching used by teachers of EFL at these schools” (p. 77), without giving any further explanation as to what those conventional methods were. When descriptions are included, they are often ambiguous as to whether the two groups spent the same amount of time on the skills or knowledge they would need to perform well in the posttests. All this makes it difficult to tell whether the superior performance of the experimental group should be attributed to their being provided with better learning opportunities or simply more learning opportunities in preparation for a specific end-of-course assessment.
The latter possibility can be illustrated with two of the three studies remaining in our reanalysis. One is Lai et al. (Reference Lai, Zhao and Wang2011), which did include helpful details about both the experimental and the comparison treatments as well as the assessment instruments used. In this study, communicative activities were added to a language-focused course in the experimental condition. To evaluate whether this had a positive effect on learning, a speaking test was used, where the students were asked to describe a picture of a person’s bedroom (p. 96). However, picture description was a recurring course activity in the experimental condition, and one of the picture description activities in the course was about bedrooms as well (p. 102). If the students from the TBLT course performed better on the final speaking test, this may be partially attributable to practice–test congruency (because they had done the activity before while the comparison group had not). A similar example is a study by Park (Reference Park, Shehadeh and Coombe2012), who designed computer-assisted activities for the TBLT group, while the non-TBLT group only worked with their prescribed EFL textbook. One of the TBLT group’s computer-assisted lessons was about writing e-mails to e-pals (e.g., to introduce a new e-pal). The non-TBLT group, which was confined to working with the EFL textbook, appears not to have practiced this specific activity. However, the same activity was used as one of the assessment measures, thus potentially giving an advantage to the TBLT group. After also excluding these two studies from the reanalysis, a single study would remain (Phuong et al., Reference Phuong, Van den Branden, Van Steendam and Sercu2015). This study reports a positive effect of a TBLT-informed writing course on students’ vocabulary development, but less improvement compared to the non-TBLT treatment on measures of linguistic accuracy. The result is an averaged d-value of −0.06. In short, using different, stricter criteria for sampling candidate studies changes the conclusions regarding the effectiveness of task-based relative to non-task-based implementations. Again, the main conclusion must be that much more (and more solid and replicable) empirical work on the comparative effectiveness of TBLT needs to be done before a robust meta-analysis of the effects of task-based programs will become feasible. In the interim, it is critical to apply more nuance to domain definitions within the spectrum of task-supported programs so that the role of specific program characteristics can be better understood.
CONCLUSIONS AND RECOMMENDATIONS
The outcome of a meta-analysis is inevitably determined by how a factor of interest is defined and how candidate studies are subsequently selected. As illustrated in all three “case studies” presented here, changes in selection criteria, such as applying more narrow definitions of key variables, can lead to different outcomes. In each of our reanalyses, we considered it desirable to exclude a fair number of studies that were included in the original meta-analyses, because they (a) were not in fact designed to address the research question that the meta-analysis sought answers to, (b) did not report quantitative data (such as pretest scores) required for a reliable effect calculation, (c) exhibited confounds that make it difficult to attribute an observed effect to the factor of interest, or (d) were described with insufficient detail to allow a proper evaluation. Unfortunately, applying stricter selection criteria can drastically reduce sample sizes. If we were dealing with effect sizes from primary studies that were very precise replications of one another, then aggregated effect sizes could still be meaningful, but in the case of the three meta-analyses we have examined here we are dealing with primary studies that show considerable diversity in design, learning targets, outcome measures, and instructional settings. Given this diversity, it is not surprising that the addition or exclusion of a few primary studies can alter the outcome of a meta-analysis. The original meta-analyses seem to have been conducted in the spirit of an inclusive approach to primary-study selection (for the sake of sample sizes). It has not been our intention to argue that the “when in doubt, leave it out” stance taken in our replication attempts is necessarily better. The point is, rather, that readers of meta-analytic reviews (be they researchers, policy makers, or teaching professionals) need to be aware that any meta-analytic endeavor involves multiple choices on the part of the analyst, each of which impacts the outcomes (Oswald & Plonsky, Reference Oswald and Plonsky2010). To help readers appreciate this, authors of meta-analytic reviews are of course urged to be totally transparent about the choices they made (Maassen et al., Reference Maassen, van Assen, Nuijten, Olsson-Collentine and Wicherts2020; Norris & Ortega, Reference Norris, Ortega, Norris and Ortega2006). It is doubtful, however, whether many consumers of meta-analytic reviews closely inspect the method sections in such publications, where those choices are explained. Instead, readers may rely solely on the information provided in the abstract and possibly the general conclusion section. Owing to their status as comprehensive reviews, conclusions drawn from meta-analyses exert a certain authority. We hope to have demonstrated that assertions about the role of a given factor (be it the primary factor of interest or a moderating factor) need to be made with caution, especially in the case of recent strands of empirical inquiry.
Recommendations may also be distilled for the researcher wishing to embark on a meta-analysis. One recommendation is to carefully delineate the factor(s) of interest and to evaluate whether the available strand of research related to this factor lends itself to a robust and meaningful analysis. When the maturity of a given domain for meta-analysis is uncertain, it is advisable to first carry out a scoping review. A scoping review is another type of research synthesis, one that surveys a domain of literature to identify current trends, commonly used methods, and gaps in findings (e.g., Gurzynski-Weiss & Plonsky, Reference Gurzynski-Weiss, Plonsky and Gurzynski-Weiss2017; Hillman et al., Reference Hillman, Selvi and Yazan2020; Tullock & Ortega, Reference Tullock and Ortega2017). A scoping review can help determine if subsequent meta-analytic work is appropriate and worthwhile. After embarking on a meta-analysis, researchers are advised to scrutinize each candidate study to determine its eligibility and to make the criteria for study inclusion clear. As we have illustrated, a field may look ready at first glance, as one starts deploying the powerful online search engines at one’s disposal, but this may be deceptive if it turns out that many candidate studies fail to meet the standards for inclusion. Unfortunately, scrutinizing the method sections of a large collection of empirical research papers is a labor-intensive exercise, and meta-analytic replications are not immune to interpretation errors either; we fully recognize potential shortcomings in our own reassessment of the primary studies included in our three case studies. A faster alternative could be to use the prestige of the journals in which candidate studies were published as a proxy for quality (e.g., Faez et al., Reference Faez, Karas and Uchihara2019), under the assumption that some journals use more rigorous review processes than others. This, however, raises the difficult question of what bibliometric data are most suitable for distinguishing between journals according to the rigor of their review processes. An additional difficulty is that the literature resulting from this approach may be limited to publications from privileged, “WEIRD” (Western, Educated, Industrialized, Rich, and Democratic) contexts, potentially disadvantaging those who have less access to publishing in prestigious peer-refereed journals (Andringa & Godfroid, Reference Andringa and Godfroid2020; Cho, Reference Cho2009; Henrich et al., Reference Henrich, Heine and Norenzayan2010). Besides, even the most prestigious journals occasionally publish articles that are arguably nonoptimal (or, at least, nonoptimally suited for a given meta-analytic purpose). In fact, among the primary studies we felt justified in excluding from our reanalyses, there were several that had appeared in prestigious journalsFootnote 10 (see supplements on IRIS for details on each individual study). It is worth mentioning in this context that each of the three meta-analytic reviews examined here appeared in a prestigious journal, too. So, perhaps our call for caution should be extended to journal editors, editorial boards, and reviewers.
In any case, given the issues highlighted, (some of) the conclusions presented in the meta-analyses we have examined here should be taken as tentative for now. Fortunately, new studies are continually being added to the various strands of inquiry in our discipline, so we can be hopeful that sooner or later it will become possible to revisit these meta-analyses and to replicate them with a larger collection of eligible studies. This sustained effort at updating and replicating meta-analyses can be made lighter if meta-analytic reports are transparent not only about which studies were included but also about precisely how effect sizes were calculated (so the same procedures may be followed in the updates). For one of the three meta-analyses examined here (Yousefi & Nassaji, Reference Yousefi and Nassaji2019), we felt it necessary to recalculate effect sizes because it was not clear to us precisely on which contrasts the authors had based their calculations. A lack of clarity about how contrasts were defined and analyzed not only limits readers’ ability to evaluate meta-analytic findings but also hinders replication efforts in which effect sizes from new studies could systematically be added to an existing pool and thus gradually make the outcome more robust. The field of applied linguistics has seen a push toward open-science practices in recent years, including recognition of open data and materials through badges in major journals (e.g., Studies in Second Language Acquisition, Annual Review of Applied Linguistics, Language Learning, Modern Language Journal), repositories for instruments and materials (IRIS-database.org), and registered replications (Morgan-Short et al., Reference Morgan-Short, Marsden, Heil, Ii, Leow, Mikhaylova, Mikolajczak, Moreno, Slabakova and Szudarski2018) and reports (Marsden et al., Reference Marsden, Morgan‐Short, Trofimovich and Ellis2018). Open science practices are one way to promote equity through the sharing of knowledge, instruments, and findings in freely accessible and permanent repositories. While there is growing enthusiasm for open access in applied linguistics research, L2 researchers (and academics more broadly) often fail to practice what they preach in terms of publishing open access (e.g., Zhu, Reference Zhu2017) or making data freely available. The coding schemes and data of some prior meta-analyses have been uploaded to repositories such as IRIS (Bryfonski & McKay, Reference Bryfonski and McKay2019; Plonsky, Reference Plonsky2011, Reference Plonsky and Porte2012, Reference Plonsky, Chamot and Harris2019; Plonsky & Kim, Reference Plonsky and Kim2016; Plonsky & Ziegler, Reference Plonsky and Ziegler2016; Plonsky & Zhuang, Reference Plonsky, Zhuang and Taguchi2019), and this is also where the coding schemes and data of the present three replications can be found. Others have called for more attention to open science in meta-analytic work. McKay and Plonsky (in press), for example, recommend that “all meta-analysts make available not only their coding schemes but also their data and any code used to analyze that data” (p. 14). However, meta-analysis continues to be underrepresented in terms of shared materials and data. Making data openly available is yet another methodological choice, one that opens the door to closer scrutiny of studies and findings. Whatever channel is deemed most appropriate, the sharing of coding sheets in meta-analysis is critical for building upon prior work and supporting future meta-analysts.
It is worth mentioning that calls for greater transparency in reporting meta-analyses are being made outside the discipline of applied linguistics as well (e.g., Maassen et al., Reference Maassen, van Assen, Nuijten, Olsson-Collentine and Wicherts2020, in the field of psychology).
Returning specifically to the three case studies we have presented here, it is important to clarify that our intention was by no means to criticize the instructional interventions advocated in them (i.e., technology-mediated pragmatics instruction, corpus use for vocabulary learning, and TBLT). It was, in fact, our interest in these topics that led us to read and then further explore these three meta-analyses, and in the case of Bryfonski and McKay, reanalyze the pool of primary studies with different criteria. We hope that our three examples can serve as an incentive for others to reexamine the meta-analyses available in their own domains of interest.
Supplementary Materials
To view supplementary material for this article, please visit http://dx.doi.org/10.1017/S0272263120000327.