1 Introduction
When research studies on a topic accumulate, researchers need to examine and compare their findings in order to confirm or reject a hypothesis or to advance a theory. The history of using systematic, quantitative methods to review a large body of studies dates back to the 1930s (Liao & Hao, 2008). Since then, researchers and statisticians have endeavored to develop systematic statistical tools for combining and analyzing results from empirical studies. These methods enable researchers to summarize primary studies in a replicable way, thus producing more supportable findings than narrative reviews and vote counting, both of which have long been used to synthesize accumulated studies (Norris & Ortega, 2000). Meta-analysis, a statistical analysis of primary studies to integrate their findings, was first proposed by Glass (1976). Since then, it has been used increasingly, and the findings of meta-analyses are widely cited. Today, however, Glass’s model of research synthesis is no longer considered adequate, as new methods for retrieving, integrating, and interpreting research findings have been developed (Cooper, 2007). Researchers have argued that the decision rules outlined by the originators of meta-analysis, such as Glass, McGaw and Smith (1981) back in the 1980s, should be modified and expanded as new meta-analytic methodologies are developed (Cooper, 2007). The need to revise meta-analytic practice has also arisen because meta-analyses on the same topic have often produced unanticipated and conflicting results. Little consistency in the application of meta-analytic methods, and the many variations a meta-analyst can adopt in procedures and decision points, mean that the results of meta-analyses are neither replicable nor comparable (Rothstein & McDaniel, 1989).

Norris and Ortega (2000), for example, meta-analyzed 49 unique sample studies on the effectiveness of instruction in L2 learning. They categorized the studies into four groups based on the explicitness of instruction (explicit vs. implicit) and attention to form (focus on form vs. focus on formS). They compared these four types of instruction to baseline/comparison conditions (i.e. no instruction or non-focused exposure to the structures received by the experimental groups) and found that, overall, explicit types of instruction were more effective than implicit types. As an extension of Norris and Ortega’s study, Goo, Granena, Yilmaz and Novella (2015) meta-analyzed 34 unique sample studies, of which 11 were from Norris and Ortega’s meta-analysis, to scrutinize the relative effects of implicit and explicit L2 instruction. The two meta-analyses reached somewhat similar conclusions in that, overall, explicit instruction was more effective than implicit instruction. However, Goo et al. (2015) found that both explicit and implicit instruction led to large effect sizes on the immediate post-test, whereas in Norris and Ortega (2000) the large effect size was associated only with explicit instruction. Goo et al. attributed the differences in findings to inherent differences in sampling the eligible studies.
Norris and Ortega included all experimental and quasi-experimental studies in which either explicit or implicit instruction was compared with a control/comparison group, whereas Goo et al. included only studies in which both explicit and implicit instruction were designed and compared. This example illustrates how different decisions in the meta-analytic process can affect the outcomes.
To enhance the comparability, interpretability, and replicability of meta-analyses across disciplines, the analytical procedures used have to be clear, consistent, and, most important of all, transparent to readers and consumers of the meta-analyses. It is recommended that meta-analysts explicitly describe their procedures, offer justification or a rationale for decisions when alternatives are possible, and explain how different approaches might have affected the conclusions. In other words, the causes of inconsistency can be found and resolved as long as the authors make transparent their decisions at every judgment call.
2 Literature review
Meta-analysis gained popularity as a systematic form of secondary review in the 1980s. Since then, researchers have discussed and formulated procedures and examples for conducting such secondary reviews (Liao & Hao, 2008). However, it was not until the mid-1990s that the number of meta-analyses burgeoned (Littell, Corcoran & Pillai, 2008). The increasing interest in employing meta-analysis as a research synthesis method in second language learning/teaching reflects researchers’ recognition of its validity, namely its ability to aggregate and analyze study findings scientifically. Meta-analysis can also identify gaps between available studies and suggest future research directions or even formulate research agendas. As a systematic review, meta-analysis employs statistical methods to integrate and summarize primary studies on a particular topic by comprehensively locating research studies using “organized, transparent, and replicable procedures at each step in the process” (Cooper, 2007: 1). Meta-analysis, like most primary studies and any form of systematic review, follows a similar sequence of steps: topic formulation, study design, sampling, data collection, data analysis, and reporting of results. In the topic formulation stage, research questions, hypotheses, and research purposes are proposed based on research interest and theoretical rationale. The overall study design involves tasks such as developing a protocol and specifying problems, conditions, sampling procedures, and outcomes of interest. Most important of all, study inclusion and exclusion criteria have to be proposed. A sampling plan then has to be developed in which the study is the sampling unit, and all potentially relevant studies have to be searched for and obtained. In the data collection step, data are extracted from the primary studies and integrated following a standardized format. Different approaches to analyzing the data extracted from the included primary studies are possible. However, basic and common steps include providing descriptive data on study features and intervention characteristics, examining heterogeneity in the obtained effect sizes, conducting moderator and sensitivity analyses, and detecting publication bias. In the final step, tables and graphs are employed to describe the results, interpretations and discussion of the findings are presented, and implications for policy, practice, and future research are proposed (Littell et al., 2008).
2.1 Current state of the art
As mentioned earlier, inconsistency or conflict in the conclusions of meta-analyses conducted on the same or similar topics can be attributed to several factors, such as the availability of more than one defensible alternative at major procedural steps. The call for complete and transparent reporting of decision-making at these critical steps has led to the development of standards and instruments to guide meta-analysts on what to report at each stage. Cooper (2007) developed a checklist of 20 questions to evaluate the validity of research synthesis conclusions. Based on Cooper’s checklist and other documents related to reporting standards, a working group on journal article reporting standards (JARS), commissioned by the American Psychological Association Publications and Communications Board, established the Meta-Analysis Reporting Standards (MARS) to recommend the information to be included when reporting meta-analyses. These standards are much more comprehensive, covering what to describe/report in each section or topic of a paper. Other, more concise and recent measurement tools, such as A Measurement Tool to Assess Systematic Reviews (AMSTAR), have been proposed since MARS (e.g. Aytug et al., 2012; Plonsky, 2012; Shea, Hamel, Wells, Bouter, Kristjansson, Grimshaw, Henry & Boers, 2009), all responding to the impetus for more detailed and complete reporting of how meta-analyses are conducted and what they find.
We used MARS as the basis for a framework against which four other assessment tools/instruments were compared. The number of items in the surveyed instruments ranged from 17 to 54. All items can be classified into Introduction, Literature search, Method, and Discussion/Conclusion categories with some degree of variation, with the exception of AMSTAR, which created items to assess information related to data sources, analysis of individual studies, meta-analysis, reporting, and interpretation, and which also asks for a summary judgment for each section. We examined the nature of the items and tallied those deemed important by at least three of the five instruments we surveyed. We found that in the Introduction section, meta-analysts need to specify the questions under investigation and the related theoretical, policy, or practical issues motivating the synthesis. In the Method section, details such as inclusion and exclusion criteria, operational definitions of both independent and dependent variables, and moderator/mediator analyses need to be provided. In terms of searching for eligible literature, information on the references, citation databases, and registries searched, as well as efforts to retrieve all available studies, needs to be supplied, and the process of determining study eligibility needs to be described. For coding procedures, inter-coder reliability or agreement, and the methods used to assess study quality and handle missing data, need to be explained. In reporting the statistical method, the effect size metric, the averaging and/or weighting method, and effect size confidence intervals or standard errors need to be provided, and meta-analysts also need to explain how studies with more than one effect size were dealt with and which analysis model and assessment of heterogeneity were employed, with appropriate justification. When reporting the results, a descriptive table (with effect size and sample size for each study) supplemented with tables or graphic summaries is recommended. When discussing the results, major findings, alternative explanations for observed results, study generalizability, limitations, implications, and interpretations for theory, policy, or practice need to be addressed, along with guidelines for future research.
2.2 Second-order meta-analysis of meta-analysis in CALL
Second-order meta-analysis, also called “overview of reviews”, “umbrella review”, “meta-meta-analysis”, and “meta-analysis of meta-analysis” (Schmidt & Oh, 2013: 204), is a research synthesis methodology that integrates evidence from multiple first-order meta-analyses. Its aim is to gauge the degree to which the variance in effect sizes calculated across the first-order meta-analyses is due to second-order sampling error, an estimate that can better inform the precision of the effect sizes derived from the individual meta-analyses (Schmidt & Oh, 2013). An alternative focus of second-order meta-analysis is the way in which each meta-analysis was conducted. The authors were able to locate two such studies in the field of applied linguistics; each is briefly introduced below.
Plonsky and Ziegler (2016) used a revised version of Plonsky (2012) to evaluate the rigor and transparency of 10 meta-analyses in applied linguistics. The inter-rater reliability of the instrument was .87. Several observations about the meta-analyses reviewed in this second-order meta-analysis were presented: (1) the standards proposed in the instrument regarding the literature review were mostly met by the sample, except that most authors failed to provide justifications for including certain moderator variables; (2) the Method section was the area most in need of improvement: although the authors in the sample provided clear inclusion and exclusion criteria to screen eligible studies and employed appropriate search techniques, few studies used a quality index to assess primary studies before or after they were integrated for further analysis; (3) there was a lack of discussion of how the findings derived from the individual meta-analyses inform theory and recommendations for future research.
Liou and Lin (2017) adopted an instrument developed by Aytug et al. (2012) to assess the transparency of reporting and the rigor of 13 meta-analyses on computer-assisted language learning (CALL). Their instrument consists of 18 items derived from a 54-item pool. These 18 items were endorsed by experts and are regarded as “ethically imperative” (p. 110); a meta-analytic report that provides no information on these items would be considered low quality and would be difficult to replicate. This second-order meta-analysis found that the more recent meta-analytic reports were no more transparent or rigorous in their reporting and conduct than earlier ones, contrary to the authors’ hypothesis that the development and growth of meta-analytic research knowledge and techniques should make recent studies more finely tuned. The authors also found that the meta-analysts did not provide the keywords they used to search for relevant literature, nor did they provide justifications for analyzing certain moderator variables. Study features were normally not listed, and information on inter-rater reliability was either missing or the reliability was not checked.
2.3 Purpose statement
Although meta-analysis has become a widely accepted research synthesis method in the social sciences, the inconsistent findings derived from multiple meta-analyses in the same research domain are regarded as a major weakness. Researchers have argued, though, that inconsistencies in meta-analysis results are more easily resolved than those from narrative reviews, as long as meta-analysts “fully articulate” their decision rules (Rothstein & McDaniel, 1989: 766). Given the proliferation of meta-analyses conducted across the social sciences, and the growing number of publications synthesizing research in CALL, there is a need to formulate agreed-upon mechanisms and procedures for conducting such research syntheses. Accordingly, this research aims to seek answers to the following research questions:
1. How transparent and complete is the reporting of CALL meta-analyses with regard to the critical stages and procedures?
2. Are there correlations between transparency in reporting and number of citations, publication year, and word counts of the included reviews?
3 Method
In’nami and Koizumi (2009) provided guidelines for selecting databases for meta-analysis in applied linguistics. They first reviewed previous meta-analyses in the field to determine which databases had been used. They initially located 24 journals that they believed targeted the applied linguistics audience and were most likely to publish meta-analyses in applied linguistics. A first reading of the 24 journals identified 15 meta-analytic studies, of which 12 specified the databases that were used. They also compiled a list of journal coverage rates and periods of coverage in these databases. The authors ultimately recommended Linguistics and Language Behavior Abstracts (LLBA), Educational Resources Information Center (ERIC), Modern Language Association (MLA), Linguistics Abstracts, and Scopus as ideal databases for retrieving meta-analyses in applied linguistics. As studies on CALL overlap considerably with applied linguistics in their possible publication outlets, In’nami and Koizumi’s study provided a starting point from which we searched for eligible CALL meta-analyses. In the following, we detail the procedures we followed to retrieve the target studies.
3.1 Search for meta-analyses
The keywords used in previous meta-analyses were first examined, which revealed that meta-analysis was overwhelmingly the most frequently used keyword to identify a study as a meta-analysis. Other keywords were also observed, with lower frequency, including research method, secondary research, research synthesis, quantitative research, research review, and effect size. To ensure comprehensive inclusion of meta-analyses conducted in the field of CALL, defined as “the search for and study of applications of the computer in language teaching and learning” (Levy, 1997: 1), the above keywords were used in combination with secondary-level identifiers such as technology, computer, computer-assisted instruction, computer-assisted language learning/teaching, language teaching/learning, L2, language acquisition, second/foreign languages, and language skills (reading, writing, speaking, listening, pronunciation, etc.), with the aim of identifying as comprehensive an eligible sample as possible.
We followed Aytug et al. (2012) and In’nami and Koizumi (2009) when selecting journals and databases to search for meta-analyses. We first reviewed previous meta-analyses to identify the journals that published them. These journals were then searched issue by issue to retrieve more studies. The searches were conducted from July 2014 to June 2015. The searches did not exclude non-English research, but as the keywords we used were in English, it is possible that research conducted in languages other than English was filtered out. The journals we searched manually include those recommended by In’nami and Koizumi (2010) and those used in our previous meta-analyses (Lin, 2015a, 2015b): the Annual Review of Applied Linguistics (ARAL), Applied Language Learning (ALL), Applied Linguistics (AL), Applied Psycholinguistics (AP), Assessing Writing (AW), Canadian Modern Language Review (CMLR), the ELT Journal (ELTJ), Foreign Language Annals (FLA), the International Journal of Applied Linguistics (IJAL), the International Review of Applied Linguistics in Language Teaching (IRAL), the JALT Journal (JALTJ), Language Assessment Quarterly (LAQ), Language Learning (LL), Language Learning & Technology (LLT), Language Teaching (LTea), Language Teaching Research (LTR), Language Testing (LTes), The Modern Language Journal (MLJ), Reading Research Quarterly (RRQ), the RELC Journal (RELCJ), Second Language Research (SLR), Studies in Second Language Acquisition (SSLA), System, TESOL Quarterly (TESOLQ), Computers & Education (C&E), Educational Technology, Research & Development (ETR&D), Educational Technology & Society (ETS), the British Journal of Educational Technology (BJET), and Computer Assisted Language Learning (CALL). We also conducted electronic searches on the databases recommended by In’nami and Koizumi (2010) to capture studies that might have been missed in the journal search: Academic Search Premier, Comprehensive Dissertation Abstracts, ERIC, LLBA, MLA International Bibliography, Online Computer Library Center (OCLC) ProceedingsFirst, ProQuest Dissertations and Theses, PsycARTICLES, PsycINFO, ScienceDirect, and the Social Sciences Citation Index (SSCI).
The keywords identified previously were also used in academic search engines such as Google Scholar to retrieve relevant studies. Furthermore, the bibliography on meta-analysis in applied linguistics compiled by Plonsky (2012) and provided on his personal website (http://oak.ucc.nau.edu/ldp3/bibliographies.html) was also checked manually.
As the aim of this study was to examine the level of transparency and completeness in reporting meta-analytic procedures deemed to involve important decision-making and judgment calls in the CALL domain, studies had to meet the following criteria to be eligible for inclusion:
1. The meta-analysis had to synthesize studies on topics related to CALL.
2. The meta-analysis had to quantitatively synthesize the results of the included primary studies.
3. The meta-analysis was not a duplicate report; for a meta-analysis reported in more than one source, only one source was included.
A meta-analysis was excluded if it met one of the following conditions:
1. The meta-analysis compared systematic reviews and meta-analyses (Littell et al., 2008).
2. The meta-analysis aimed to describe the history and current status of the meta-analytic enterprise (Rosenthal & DiMatteo, 2001).
3. The meta-analysis proposed or recommended new procedures or stages of research synthesis (Cooper, 2003).
3.2 Codebook and transparency scale/score
Bearing in mind that the major purposes of this study were to understand the procedures and practices commonly used and followed by meta-analysts in CALL, and the degree of transparency in reporting important decision-making points in the report, we developed a codebook and a transparency measure/scale.
Previous research has revealed somewhat different stages and procedures in conducting a meta-analysis. Cooper (2003: 6), for example, proposed that a research synthesis should include the five stages of (1) problem formulation; (2) data collection, or the literature search; (3) data evaluation; (4) analysis and interpretation; and (5) presentation of results. The function that each stage serves is very similar to that of a primary study (Cooper, 1998). Rosenthal and DiMatteo (2001: 69–70), however, suggested a different set of stages for conducting a meta-analysis:
Defin[ing] the independent and dependent variables of interest; collect[ing] the studies in a systematic way; examin[ing] the variability among the obtained effect sizes informally with graphs and charts; combin[ing] the effects using several measures of their central tendency; examin[ing] the significance level of the indices of central tendency; and using an examination of the binomial effect size display.
We reviewed previous studies that discussed meta-analytical procedures (Egger, Smith & Phillips, 1997; Wanous, Sullivan & Malinak, 1989), guidelines on how to conduct research syntheses (Plonsky, 2013), books and book chapters on meta-analysis (Lipsey & Wilson, 2001; Norris & Ortega, 2000, 2006), and meta-analytic practices and procedures from other fields (Aytug et al., 2012) in designing our codebook and instruments. Specifically, we coded each meta-analysis on features in seven stages: Profile information, Literature search, Method, Results, Discussion, Conclusion, and Appendix, with each stage including 3 to 14 features to code. Table 1 presents the features and codes assigned at each stage. For each feature, we first determined whether the information was provided in the meta-analysis; we also noted the page number, and each code was assigned a degree of certainty, with 1 being “not so certain” and 3 “very certain”.
The first author coded all of the included meta-analyses, and the second and third authors served as second coders, each coding half of the studies. We first discussed the coding scheme and codebook; after reaching agreement on the meanings of the codes, we proceeded with the coding independently. Inter-coder reliability was calculated as the number of codes agreed upon by both coders divided by the total number of codes. For features that received different codes, a third coder (either the second or third author) was called upon, and discrepancies were resolved through discussion.
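To make the agreement index concrete, the following is a minimal sketch in Python, not the coding script we actually used; the feature codes shown are hypothetical.

```python
# A minimal sketch of the agreement index described above: the number of codes
# agreed upon by both coders divided by the total number of codes.
def percent_agreement(coder_a, coder_b):
    """Proportion of features that received the same code from both coders."""
    assert len(coder_a) == len(coder_b), "coders must rate the same features"
    agreed = sum(1 for a, b in zip(coder_a, coder_b) if a == b)
    return agreed / len(coder_a)

# Hypothetical codes for five features of one meta-analysis.
coder_a = ["yes", "no", "partial", "yes", "yes"]
coder_b = ["yes", "no", "yes", "yes", "no"]
print(f"Inter-coder agreement: {percent_agreement(coder_a, coder_b):.2f}")  # 0.60
```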
We modified the instrument developed by Aytug and colleagues (2012) and constructed a transparency index consisting of 45 items, each measured on a 3-point scale (“no” = 0, “partial” = 0.5, “complete” = 1). We coded whether the meta-analysts provided information on these items, irrespective of how they reported it. For example, if the meta-analysts reported which statistical model was employed, we assigned 1 to that item, irrespective of whether it was a fixed-effect, random-effects, or mixed-effects model; 0 was awarded if this information was not available, and 0.5 was awarded to items for which only partial information was provided. We then summed the scores of the 45 items to obtain a transparency score for each meta-analysis.
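As an illustration of this scoring scheme, the sketch below sums hypothetical item codes into a transparency score; it assumes codes are recorded as the strings "no", "partial", and "complete" and is not our actual coding sheet.

```python
# Illustrative sketch of the transparency scoring described above: 45 items,
# each coded "no" (0), "partial" (0.5), or "complete" (1), summed per study.
ITEM_SCORES = {"no": 0.0, "partial": 0.5, "complete": 1.0}

def transparency_score(item_codes):
    """Sum the item codes (expected length: 45) into one transparency score."""
    return sum(ITEM_SCORES[code] for code in item_codes)

# Hypothetical study coded "complete" on 20 items, "partial" on 5, "no" on 20.
codes = ["complete"] * 20 + ["partial"] * 5 + ["no"] * 20
print(transparency_score(codes))  # 22.5 out of a maximum of 45
```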
4 Results
In total, 15 individual meta-analyses were considered eligible for further analysis. These 15 meta-analyses were published between 2003 and 2015 and are marked with an * in the References. Of the 15 studies, eight were contributed by three authors: Lin (2014, 2015a, 2015b), Taylor (2006, 2010, 2013), and Chiu (2013) and her colleagues (Chiu, Kao & Reynolds, 2012). The topics of interest include computer-mediated communication on different aspects of learning (Lin, 2014, 2015a, 2015b; Lin, Huang & Liou, 2013); electronic/computer-mediated glosses on reading and vocabulary learning (Abraham, 2008; Taylor, 2006, 2010, 2013; Yun, 2011); classroom applications of corpus analysis (Cobb & Boulton, 2015); effects of CALL on vocabulary learning (Chiu, 2013); digital game-based learning (Chiu et al., 2012); strategy-oriented web-based English instruction (Chang & Lin, 2013); and general computer/technology-assisted language instruction (Grgurović, Chapelle & Shelley, 2013; Zhao, 2003). The journals that published these meta-analyses are Language Learning & Technology (three studies), ReCALL (two studies), CALICO Journal (four studies), the British Journal of Educational Technology (two studies), Computer Assisted Language Learning (two studies), and the Australasian Journal of Educational Technology (one study). Cobb and Boulton’s (2015) study was published as a book chapter. The average number of primary studies per meta-analysis was 24.86, with an average sample size of 1,566 participants. Cohen’s d was the effect size metric in 53% of the meta-analyses, whereas Hedges’ g was the effect size of interest in 40% of the studies. Only one study used both Cohen’s d and Hedges’ g; 67% used Hedges and Olkin’s methods. A few meta-analyses (2%) used both methods. One third (33%) of the meta-analyses in our sample used a random-effects model, 6.7% used a fixed-effect model, 13% used both models, and roughly 46% did not state the model used. Codings for the 15 included meta-analyses are provided as supplementary materials at https://doi.org/10.1017/S0958344017000271
In the following we report the answers to our two research questions.
4.1 How transparent and complete is the reporting of CALL meta-analyses with regard to the critical stages and procedures?
As shown in Table 2, out of a maximum score of 45, the average score of our sample is 22.27 with a standard deviation of 6.34, indicating wide variability in the degree of transparent reporting. The lowest-scoring meta-analysis received a score of 13, the highest 35.5. We did not observe such wide variability, though, in the individual sections. As shown, most of the meta-analyses provided sufficient information in the Profile section (86%) but not in the remaining sections. More precisely, except for Profile information, our sample reported roughly half of the information required to meet the standards. The Results (31%) and Method (37%) sections were the weakest, providing less than half of what is required. The Literature search (55%) and Discussion/Conclusion (53%) sections were only slightly better, with most studies reporting slightly more than half of the required information.
Looking more closely, we found that in the Profile information section, all studies received at least 5 out of a possible 7 points, with one third of the studies receiving full points and another third missing only 1 point. This result is encouraging, as descriptive information gives readers the baseline they need for a bird’s-eye view of a meta-analysis. The scores in the Literature search section, however, warrant concern. Our sample revealed a lowest score of 2.5 and a highest of 6 out of a possible 8 points. About four studies received 4 points or fewer, and only a third of the sample received 6 (the highest score in our sample). The same pattern was evident in the Discussion/Conclusion section, in which scores ranged from a low of 2 to a high of 7 out of a possible 8 points; in this section, however, about two thirds of the reports received at least 4 points.
In the following, we discuss the findings for each section in more depth.
4.1.1 Profile information
This section consisted of seven items asking mostly for factual information about the meta-analyses, such as the number of primary studies included and a list of those studies, the total sample size, the effect size metric and averaging/weighting method, and the kind of synthesis method used. Generally, our sample scored high in this section, but two items stand out as problematic (see Table 3). Item 6 asked about the research synthesis method used, for which nearly half of the studies (n = 7) did not provide an answer. Model selection typically depends on the results of a homogeneity test, which examines the variability of the effect size distribution (e.g. whether the obtained effect size represents a common population effect, or whether differences in effect sizes are due to sampling error only) (Li, Shintani & Ellis, 2012: 10). In meta-analysis, there are two main models for analyzing the included studies, each with its own assumptions. A fixed-effect model is recommended if all included studies are identical and the goal of the analysis is to compute a common effect size for the specified population, with no intention of generalizing the result to other populations; this model assumes that there is one true effect size for all included studies and that sampling error is the only reason effect sizes differ between studies. In contrast, a random-effects model is recommended if we believe that within-study and between-study variability, in addition to sampling error, contribute to the variability in effect sizes, and the goal is therefore to estimate not a single true effect size but the mean of a distribution of effects (Berkeljon & Baldwin, 2009; Borenstein, Hedges, Higgins & Rothstein, 2009; Li et al., 2012: 10). The model needs to be specified in the report because it reveals the goal of the meta-analysis and entails quite different statistical procedures.
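For readers unfamiliar with the two models, the following sketch contrasts a fixed-effect estimate with a random-effects (DerSimonian-Laird) estimate on invented effect sizes; it is purely illustrative and does not reproduce the procedures of any meta-analysis in our sample.

```python
import numpy as np

# Hypothetical per-study effect sizes (d) and their sampling variances.
d = np.array([0.30, 0.55, 0.80, 0.20, 0.65])
v = np.array([0.04, 0.09, 0.06, 0.05, 0.08])

# Fixed-effect model: inverse-variance weights; sampling error is assumed to be
# the only source of variation around a single true effect.
w_fe = 1 / v
mean_fe = np.sum(w_fe * d) / np.sum(w_fe)

# Cochran's Q and the DerSimonian-Laird estimate of between-study variance tau^2.
Q = np.sum(w_fe * (d - mean_fe) ** 2)
C = np.sum(w_fe) - np.sum(w_fe ** 2) / np.sum(w_fe)
tau2 = max(0.0, (Q - (len(d) - 1)) / C)

# Random-effects model: weights include tau^2, so the pooled value estimates the
# mean of a distribution of true effects rather than one common effect.
w_re = 1 / (v + tau2)
mean_re = np.sum(w_re * d) / np.sum(w_re)

print(f"Fixed-effect mean d:   {mean_fe:.3f}")
print(f"Q = {Q:.2f}, tau^2 = {tau2:.3f}")
print(f"Random-effects mean d: {mean_re:.3f}")
```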
Another problematic item is item 7, which assesses whether clear research questions are provided in the report. Four studies in our sample failed to meet this requirement. A closer look suggests that researchers may believe it is sufficient to describe the overall goal of the study rather than pose narrow and specific research questions, as in Chiu (2013: E52): “This meta-analysis accounts for the overall effect of computer-mediated instruction in L2 vocabulary and specifically addresses the effects with regard to four factors: treatment duration, the educational level of participants, game-based learning and the role of teachers”.
4.1.2 Literature search
This section asked for detailed documentation of how potentially eligible studies were searched for and chosen for inclusion. It does not judge the search strategies used but assesses whether the listed procedures were reported. Unfortunately, as shown in Table 4, these 15 published meta-analyses did not provide satisfactory information on how they arrived at their final samples: only a little more than half of the items were reported (4.4/8). Closer examination shows that two items are missing even from the highest-scoring studies in this section: the date of search and the method of dealing with articles other than those in English. Both items, to some degree, influence the representativeness of the samples. The date of search, once reported, reveals whether the studies identified and the total number retrieved may have varied with the date of access; systematically recording when eligible studies were searched for can help reveal instability in the sample. The handling of non-English articles has always been an issue in meta-analysis, as excluding them may result in a biased sample that is not representative of the research conducted in a field. Although no researchers would explicitly state that non-English articles were excluded, searching for and locating such articles is challenging, and once they are identified, reading them presents a further obstacle. Consensus still needs to be reached on whether non-English articles should be searched for and, if they are not, on how this potentially influences the representativeness of the sample and the results.
Keywords and explicit lists of exclusion criteria were two further aspects on which at least four studies in our sample provided no information. Keywords serve as good signposts for retrieving studies that share certain characteristics, and they are also useful for study replication. Without the keywords used to retrieve eligible studies, readers may question which central constructs the meta-analysts had in mind when searching for eligible candidate studies. All of the studies in our sample provided inclusion but not exclusion criteria. Researchers who do not specify exclusion criteria might argue that studies not fitting the inclusion criteria are automatically filtered out, so there is no need to specify exclusion criteria; however, studies that meet the overall standard of inclusion may still need to be excluded on technical grounds. For example, in Grgurović et al. (2013: 170), a study was still excluded if it “did not report statistics or reported statistics that were insufficient to calculate the effect size”, even if it met all of the inclusion criteria.
4.1.3 Method
Twelve items were assessed in the Method section (see Table 5), revealing a large spread in scores, ranging from 1 to 11. This section is also the second most poorly reported aspect of our sample, in that the majority of the studies failed to report more than half of the items. The section assesses the technical/statistical intent and procedures employed by the meta-analysts; for example, it asked whether efforts were made to identify possible heterogeneity among studies, and if so, how. The same questions were asked about data dependency and how it was handled. The number of coders, the reporting of inter-coder reliability, and how coders resolved disagreements are also aspects that merit attention in this section. In meta-analysis, a heterogeneity test examines whether the effect sizes calculated from the individual primary studies are consistent; if the test result is significant, measures have to be taken to deal with the heterogeneity. In the same vein, data dependency, if not dealt with appropriately, reduces estimates of variance and inflates Type I error rates (Borenstein et al., 2009; Scammacca, Roberts & Stuebing, 2014). In SLA/CALL research, however, data dependency is quite common and inevitable given the prevailing research designs employed in this field.
Most empirical studies in SLA/CALL use more than one dependent variable and include more than one treatment group to be compared with the control group. When the same participants are measured repeatedly or the same control group is used in each comparison, the data become dependent (Scammacca et al., 2014). Dependent data can seriously affect the validity of meta-analysis results. Researchers have recommended several methods to deal with this issue (for a detailed comparison of the available resolutions, see Scammacca et al., 2014), and CALL meta-analysts should consider the overall purpose of their meta-analysis, their research questions, and the nature of the data when deciding which method to use to handle data dependence; one simple option is sketched below. Three items dealing with coding also received little attention from the meta-analysts: only a few studies reported inter-rater reliability and how disagreements between coders were resolved. Given the highly inferential and complex nature of the data coding procedures involved in meta-analysis, as well as the many arbitrary decisions to be made along the way, it is advisable that multiple coders be used and that inter-coder reliability be reported for different sections and different analytical stages.
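The sketch below illustrates that simple option, averaging dependent effect sizes within each study so that each study contributes a single independent value; the study IDs and values are hypothetical, and more sophisticated approaches (e.g. robust variance estimation or multilevel models) may be preferable depending on the research questions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical dependent effect sizes: some studies contribute several values
# (e.g. multiple outcome measures on the same participants).
effect_sizes = [
    ("Study A", 0.42), ("Study A", 0.58),
    ("Study B", 0.30),
    ("Study C", 0.75), ("Study C", 0.61), ("Study C", 0.70),
]

# Average within each study so that every study enters the pooling step once.
by_study = defaultdict(list)
for study, es in effect_sizes:
    by_study[study].append(es)

independent = {study: mean(values) for study, values in by_study.items()}
print({study: round(es, 3) for study, es in independent.items()})
# {'Study A': 0.5, 'Study B': 0.3, 'Study C': 0.687}
```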
4.1.4 Results
We assessed whether profile information such as sample size, the extracted effect size(s), and the number of effect sizes contributed by each study was presented; we also assessed whether sensitivity and publication bias analyses were conducted, and if so, how. Individual and overall effect size estimates, calculated from the individual studies and the entire sample, were expected to be shown in a table or graph. Furthermore, the amount of heterogeneity and the rationale for the selection of moderators need to be reported as well. A sensitivity analysis is necessary in meta-analysis because of the alternatives available to meta-analysts; most of these alternatives are arbitrary rather than objective, which can produce inconsistent findings among meta-analyses on similar topics. A sensitivity analysis detects whether the results would differ if the meta-analysis were repeated using alternative decisions or values instead of the original ones. Heterogeneity results from methodological diversity among the primary studies included in a meta-analysis and can be observed when the obtained individual effect sizes differ from each other more than would be expected by chance (random error) alone (Higgins & Green, 2011). Publication bias analysis aims to neutralize the effect represented by published studies, given the consensus that published studies are not representative of the entire population of studies conducted in an area (Rothstein, Sutton & Borenstein, 2006). Among the five sections, the Results section is the lowest scoring, with a highest score of 4.5 and a lowest score of 0.5 out of a maximum of 9 points (see Table 6). No study in our sample conducted a sensitivity analysis, and hence no information on the type of sensitivity analysis could be coded. One third of our sample reported performing a publication bias analysis, but only three studies specified the type of analysis they employed. The amount of heterogeneity and the rationale for selecting moderators for subgroup analysis were reported just as incompletely. The overall low scores in the Method and Results sections might suggest that meta-analysts in the CALL field were generally not equipped with the skills or knowledge required to conduct complex meta-analyses, or were not well informed of the norms of reporting, especially for meta-analysis.
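As an example of what a sensitivity analysis can look like, the sketch below performs a leave-one-out analysis on invented data, recomputing a fixed-effect pooled estimate with each study removed in turn; none of the reviewed meta-analyses conducted such an analysis, so this is offered only as an illustration.

```python
import numpy as np

# Hypothetical effect sizes and variances; pooled() is a fixed-effect,
# inverse-variance weighted mean used only for illustration.
d = np.array([0.30, 0.55, 0.80, 0.20, 0.65])
v = np.array([0.04, 0.09, 0.06, 0.05, 0.08])

def pooled(d, v):
    w = 1 / v
    return np.sum(w * d) / np.sum(w)

print(f"All studies:      {pooled(d, v):.3f}")
for i in range(len(d)):
    keep = np.arange(len(d)) != i  # drop study i and re-pool
    print(f"Without study {i + 1}: {pooled(d[keep], v[keep]):.3f}")
```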
4.1.5 Discussion/Conclusion
This section asked whether the major findings and limitations of the meta-analysis were reported while taking into consideration the degree of heterogeneity and the potential biases of the primary studies. We also examined whether the classical components of a Conclusion section in a primary study, such as generalizability, implications for theory, policy, or practice, and recommendations for future studies, were evident in the meta-analyses. The results are not very encouraging, as shown in Table 7, with only a little over half of the items (53%) being reported. Specifically, all studies in our sample stated their major findings, and almost all provided implications for practice, policymaking, or theory. Recommendations for future studies were also proposed by most of the studies. Our sample is particularly weak, however, in reporting the potential biases of the primary studies, advancing alternative explanations for observed results, and addressing the generalizability of the findings. Meta-analysis is a more scientific way to manage large quantities of data objectively and effectively than narrative review, yet it cannot rule out the possibility that bias in the primary studies seriously affects the results. Characteristics of the study, funding sources, selective outcome reporting, and publication processes can all introduce bias into a primary study (Turner, Boutron, Hróbjartsson, Altman & Moher, 2013). Although such bias is not easy to detect, and there is no agreed way to measure or reduce it, meta-analysts need to inform readers of the potential biases inherent in the primary studies and of how the results might have been influenced. Furthermore, a meta-analysis, combining studies with subtle differences in participants, study characteristics, research design, and other major aspects, tends to have greater generalizability than a single large-sample randomized primary study. The results synthesized from the combined studies across different populations and settings are generalizable to a broader range of participants, provided that no significant heterogeneity among studies is present (Heyland, n.d.). The generalizability of the results of a meta-analysis lies partly in how clearly the inclusion criteria are described and how consistently they are followed in study selection. Operational definitions of the major factors/constructs examined in the synthesis also delimit the generalizability of the results. Meta-analysts need to discuss this for the consumers of their findings, especially policymakers, who look for summarized evidence on a particular topic to guide their decisions (Garg, Hackam & Tonelli, 2008).
4.2 Are there correlations between transparency in reporting and number of citations, publication year, and word counts of the included reviews?
The number of citations for each of the 15 reviews was retrieved from Google Scholar (with a cut-off date of February 3, 2017), and word counts were derived by converting each review from a PDF file into a Word document and then running a word count. Table 8 shows the number of citations, word counts, and transparency scores for each of the included reviews. Pearson correlation analyses indicated that the number of citations was not significantly related to the transparency score in any section: Total, r(13)=.020, p=.944; Profile information, r(13)=–.108, p=.702; Literature search, r(13)=.223, p=.425; Method, r(13)=–.142, p=.614; Results, r(13)=.061, p=.828; Discussion/Conclusion, r(13)=.019, p=.945. However, word count was highly correlated with the level of transparency in reporting. In particular, word counts were correlated with the total transparency score, r(13)=.838, p=.002; Literature search, r(13)=.666, p=.007; Method, r(13)=.678, p=.005; and Discussion, r(13)=.645, p=.009.
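For transparency about the kind of computation behind these statistics, the following is a minimal sketch of a Pearson correlation between word counts and transparency scores; the values are invented placeholders, not the data in Table 8.

```python
from scipy.stats import pearsonr

# Invented placeholder values, not the actual word counts or scores.
word_counts = [6500, 8200, 5400, 9100, 7300, 10200, 4800]
transparency = [18.0, 24.5, 15.5, 30.0, 22.0, 33.5, 14.0]

r, p = pearsonr(word_counts, transparency)
print(f"r = {r:.3f}, p = {p:.3f}")
```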
Several correlation analyses between publication year and the overall quality of reporting (the total transparency score), and between publication year and the different sections (the section total scores), were conducted to explore whether recent publications revealed more transparent and complete reporting than older ones. The results show no significant correlation between publication year and overall transparency, r(13)=.130, p=.322, or any of the sections: Profile information, r(13)=–.146, p=.302; Literature search, r(13)=.080, p=.388; Method, r(13)=.371, p=.086; Results, r(13)=–.194, p=.25; Discussion and Conclusion, r(13)=.054, p=.425. However, we did find positive correlations in reporting between specific sections: Method and Literature search, r(13)=.794, p=.000; Method and Discussion/Conclusion, r(13)=.457, p=.043; Profile information and Results, r(13)=.645, p=.005; Results and Discussion/Conclusion, r(13)=.548, p=.017; Profile information and Discussion, r(13)=.589, p=.010.
5 Discussion and conclusion
Systematic reviews, in their various forms, have become increasingly important as stakeholders, practitioners, and researchers seek evidence to inform decisions about the kinds of investments needed to transform classrooms into ones in which technological tools play a significant role in enhancing the teaching and learning of second or foreign languages. With the increasing number of meta-analyses published in the literature, there has been a recognized need for a consistent reporting framework for such work (Willis & Quigley, 2011). Our study explored the reporting quality of 15 meta-analyses in the field of CALL, assessed with a transparency index adapted from previous studies.
The results generally corroborate those of previous second-order syntheses in different disciplines (e.g. Ahn, Ames & Myers, 2012; Aytug et al., 2012; Plonsky & Ziegler, 2016). Of the five sections, Method and Results are the two areas that warrant the most improvement. Specifically, in the Method section, operational definitions of variables, the quality of the primary studies, the handling of data dependency, and the identification and analysis of heterogeneity in the included studies need to be considered and described in appropriate depth. When reporting results, we hope to see information on the number of effect sizes contributed by each study, the meta-analyst’s rationale for the selection of moderators, and the results of publication bias and sensitivity analyses, if they are conducted. Procedures for dealing with heterogeneity among studies need to be addressed as well. Reporting these items requires a meta-analyst’s professional knowledge of both statistics and procedures; strengthening this knowledge base might be possible by consulting published books or guides on how to conduct meta-analyses, or by seeking assistance from statisticians.
As an extension of Plonsky and Ziegler (2016), the present study’s findings coincide with most of theirs; for instance, both syntheses found (1) wide variation in the quality of reporting (as measured by a transparency/rigor index); (2) that the Method and Results sections are the areas in which reporting practice needs the greatest improvement; (3) that the quality of the primary studies was generally not assessed; and (4) that justification for moderator selection was not provided. Employing a more complete 45-item transparency scale to assess the reporting quality of 15 meta-analytic studies, 10 of which were included in Plonsky and Ziegler, our paper makes three unique contributions. First, our instrument, with about 2.5 times the number of items used in Plonsky and Ziegler, allowed us to conduct a more sensitive and complete assessment of current reporting practice in CALL meta-analyses. For example, in addition to assessing whether sensitivity/publication bias analyses were conducted, we also asked which type of analysis was used. This follow-up question is important because it echoes the basic premise of the article that the many choices made by meta-analysts can greatly affect the outcomes, so we need them to report exactly which type of analysis they chose. Second, the Profile information and Literature search sections, created as two separate sections most of whose items were not included in Plonsky and Ziegler, asked mostly for factual information and detailed documentation of the meta-analytic procedures. These two sections, although not intended to judge the way each meta-analysis was conducted, again correspond to our premise that the reporting of a meta-analysis needs to be as transparent as possible for cross-study comparison. A third contribution of our paper lies in findings that conflict with Plonsky and Ziegler’s. To the authors’ best knowledge, Plonsky and Ziegler’s paper, published in the 20th anniversary special issue of Language Learning & Technology, was the first second-order synthesis of meta-analyses in the CALL discipline, and most of the studies in their sample were also included in the present study. In the present research, we found a high percentage of studies reporting implications and interpretations of their findings for theory, policy, and practice, as well as providing recommendations for future studies; this was not the case in Plonsky and Ziegler’s paper. Although the authors of both papers addressed the potential subjectivity of applying their respective instruments by employing multiple coders and establishing inter-coder reliability, the results conflict. These contradictory findings shed some light on the problem and prompt us to consider possible causes, such as the use of different instruments and therefore different operationalizations of the constructs behind them.
Improvement in reporting is possible for some items but not for others. For instance, although there is no agreed-upon position on whether the quality of primary studies should be an inclusion criterion when sampling for a meta-analysis, some second-order syntheses do call for quality assessment in order to avoid the long-held criticism of “garbage in, garbage out”. Even where this position is endorsed, assessments of study quality are mostly unavailable, and there is little or no consensus on the criteria for determining research quality. Two variables related to study quality are commonly reported in the literature: whether subjects are randomly assigned to treatments and whether the instruments are reliable (Durlak, Weissberg & Pachan, 2010; Valentine, Cooper, Patall, Tyson & Robinson, 2010). These two pieces of information, despite their importance, were consistently missing from the primary studies, preventing meta-analysts from excluding possibly low-quality studies; this in turn undermines the meta-analytic procedures, resulting in conclusions that may not be valid or trustworthy. Such information should be deemed mandatory before a primary study is accepted for journal publication. Furthermore, our exploratory analysis revealed that word count is significantly related to the level of transparency and completeness of reporting. Although this finding is unsurprising, journal editors might reconsider whether the word count restrictions commonly imposed on primary studies should be more flexible for systematic reviews, which require considerably more space if essential details are to be included. It was unexpected that the more recently published studies were no better reported than earlier ones, and that highly cited studies were no more complete in their reporting. We anticipated a relationship between both variables and transparency in reporting because of the increase in published guidelines over the last decade and recent developments in the statistical methodology of meta-analysis (Willis & Quigley, 2011). Such guidelines, however, were initially developed for disciplines other than CALL and have therefore not drawn sufficient attention from CALL researchers, which might explain why recent meta-analyses were no better than earlier ones. Furthermore, systematic reviews and meta-analyses are still in their infancy in CALL, and researchers might not be aware of the existence of such reporting practices.
5.1 Limitations and recommendations for future meta-analyses of CALL research
Using a 45-item checklist, our study set out to examine the reporting quality of meta-analyses in CALL, with reporting quality defined as the extent of transparency and completeness of reporting, measured by the degree of compliance with the checklist. As indicated earlier, guidelines have been published over the last decade; the one we chose was originally developed for appraising meta-analyses in organizational science, and a number of other well-developed guidelines or checklists were published before or after the one we adopted. Had we used a different checklist, the results of our study might have been different. With this limitation in mind, we suggest that there should be a standardized checklist, as complete and comprehensive as possible, for specific disciplines so that results can be compared.
Our second limitation lies in the small sample of meta-analyses under review. We included only 15, a far smaller number than in other fields, and this small sample may not represent all published meta-analyses in CALL, although we did our best to identify all eligible studies. Furthermore, our sample included multiple meta-analyses conducted by the same authors, whose reporting practices might therefore be over-represented relative to those of other meta-analysts. We suggest assessing the reporting quality of meta-analyses at regular intervals as more are conducted in this field.
Drawing on the reporting practices examined in the present study, several recommendations for future meta-analyses of CALL research are in order. First, we recommend that meta-analysts report the research synthesis model (i.e. fixed-effect, random-effects, or mixed-effects) adopted in aggregating studies; the date of search and the strategy used to deal with studies not in English; and whether and how they handled heterogeneity, data dependency among studies, and inter-coder disagreement. We also recommend that publication bias and sensitivity analyses be conducted, and that the selection of moderators for subgroup analysis be justified. Furthermore, if potential biases in the primary studies are detected, alternative explanations and the generalizability of the findings need to be reported. Second, we recommend meta-analytic studies of topics beyond those explored in the sample of the present paper. Additionally, second-order syntheses of qualitative meta-analyses or narrative reviews are another option for synthesizing research in the CALL field. Our third recommendation stems from the challenges and difficulties we encountered when applying our instrument to assess the reporting quality of the 15 studies. Recently, we have witnessed a trend toward establishing reporting standards to regulate the way manuscripts on meta-analysis should be prepared. Such efforts to generate agreed-upon reporting standards encourage researchers to carefully consider their design at every stage of conducting a meta-analysis. Many existing reporting standards, however, are designed for use in other disciplines and may therefore lack the sensitivity to capture the essence of the research designs followed by CALL researchers. We therefore recommend the development of agreed-upon and validated instruments, or a set of assessment tools, designed explicitly for CALL meta-analysts to improve the conduct and reporting of meta-analyses.
Acknowledgements
This study reports part of the data collected in a two-year research project supported by the National Science Council of Taiwan (grant number MOST103-2410-H-007-028-MY2). The remaining data will be presented in a book chapter: Liou and Lin (2017). We also express our gratitude to the anonymous reviewers, and especially to Editor Alex Boulton, for their comments on an earlier draft of this paper.
Supplementary material
For supplementary materials referred to in this article, including the coding of the 15 meta-analyses, please visit https://doi.org/10.1017/S0958344017000271
About the authors
Huifen Lin is a Professor in the Foreign Languages and Literature Department at the National Tsing Hua University, Taiwan. She has several publications on CALL meta-analysis. Her research interests include technology-assisted language learning/teaching and quantitative research methods.
Tsuiping Chen is Associate Professor in the Applied English Department of Kun Shan University, Tainan, Taiwan. Her research focuses on using qualitative and quantitative meta-analysis methods to investigate peer feedback research conducted in ESL/EFL writing classrooms.
Hsien-Chin Liou is a Professor in the Department of Foreign Languages and Literature at Feng Chia University, Taichung, Taiwan, and specializes in CALL and related topics such as corpus use, academic writing, and vocabulary learning. She has published numerous articles in various CALL or language learning journals, as well as book chapters.