
WHAT THE RESEARCH SHOWS ABOUT WRITTEN RECEPTIVE VOCABULARY TESTING

A REPLY TO WEBB

Published online by Cambridge University Press: 12 July 2021

Jeffrey Stewart, Tokyo University of Science
Tim Stoeckel, University of Niigata Prefecture
Stuart McLean, Momoyama Gakuin University
Paul Nation, Victoria University of Wellington
Geoffrey G. Pinchbeck, Carleton University

Type: Critical Commentary

© The Author(s), 2021. Published by Cambridge University Press

In response to our State-of-the-Scholarship critical commentary (Stoeckel et al., 2021), Stuart Webb (2021) asserts that there is no research supporting our suggestions for improving tests of written receptive vocabulary knowledge by (a) using meaning-recall items, (b) making fewer presumptions about learner knowledge of word families, and (c) using appropriate test lengths. As we will show, this is not the case. However, we think the questions and concerns he raises reflect those of many who have used these tests until now without controversy, and we appreciate the opportunity to explain these issues in greater detail.

To begin, we think Webb has more common ground with our position than he may realize. We agree with many of his statements, and do not state otherwise in Stoeckel et al. (2021). For example, we agree that few if any vocabulary test makers have claimed their tests should be used as substitutes for reading tests; we agree that despite this, vocabulary tests typically do show good correlations with reading; and we agree that despite that, such tests should not be used as evidence of reading comprehension. These matters are not in dispute. However, there are remaining points of disagreement to address.

Test Use

Noting that “the premise on which their article was written is that the intended purpose of the VLT and VST is to measure vocabulary knowledge for the purpose of reading,” Webb appears to dispute that this was ever the case or an intention of the test makers. Webb further asserts that it is “not the intended purpose” of the tests “to accurately reveal the degree to which learners may reach key lexical coverage figures” (p. 458), track growth, or suggest vocabulary learning goals. However, while it is certainly true that these tests have not been sufficiently validated for these purposes, the fact that these applications factored into test makers’ intentions during their creation is clear from their own statements, as can be seen in Table 1.

TABLE 1. Some stated purposes of size and levels tests of written receptive vocabulary knowledge

Furthermore, while space constraints prevent a comprehensive list, the majority of uses of these tests in the SLA literature are for purposes such as those previously mentioned rather than merely to check learner knowledge of words without reference to other considerations. The desire of researchers to use the tests in these ways is understandable. As Webb notes, vocabulary test scores do indeed tend to correlate with other constructs, and if the tests could be employed only to measure vocabulary knowledge, with no other such inferences permitted, their usefulness would be quite limited. Indeed, while we would welcome such restrictions, were Webb’s cautions about appropriate test use strictly followed, it would all but mark the end of the use of tests such as the Vocabulary Levels Test (VLT) and Vocabulary Size Test (VST) as variables in research published in journals such as SSLA.

We believe the long-standing confusion regarding the intended purposes of these tests stems from the fact that, at the times of their creation, these tests often did not have narrowly specified uses (Norbert Schmitt, personal communication, May 6, 2021) and, as we and a number of our colleagues would argue, many still do not today (Schmitt et al., 2020). Thus, it is understandable that teachers and researchers would use them for a wider variety of purposes than is appropriate. We hope Stoeckel et al. (2021) acts as a caution against this.

Our own view is that while we do not support the notion that vocabulary comprehension alone is sufficient for reading comprehension (McLean, 2021), vocabulary knowledge is uncontroversially a component of reading, and it can be useful as one of several variables in studies of reading proficiency. Furthermore, while vocabulary knowledge alone is not sufficient for reading comprehension, testing vocabulary mastery can at least ensure that lack of vocabulary knowledge is not an impediment for readers of given texts (Nation, 2009, p. 52). However, in all such applications, test items should ideally approximate as closely as possible how vocabulary is encountered in text (Schmitt et al., 2020). This leads us to the first suggestion in our original article, regarding choice of item format.

Item Format

In addition to demonstrating that fixed options inflate estimates of vocabulary size (Gyllstad et al., 2015; McLean et al., 2015; Stewart & White, 2011), research has also consistently shown that, all else being equal, recall formats are more reliable than meaning-recognition formats testing the same words when learners are asked to attempt every item (McLean et al., 2020; McLean et al., 2016; Stewart, 2012; Stoeckel et al., 2019). The relatively poor discrimination of meaning-recognition items makes experiments using them more prone to Type II error, in which genuine effects are dismissed because results appear statistically nonsignificant. Furthermore, there is a growing consensus in the literature (e.g., Grabe, 2009, p. 23; Kremmel & Schmitt, 2016, p. 378; Nation & Webb, 2011, pp. 219, 285–286) that meaning-recall represents an appropriate threshold of lexical knowledge for reading because, as in fluent reading, word meaning must be retrieved from memory rather than identified in a list of options.

Webb appears to contest this position by citing Laufer and Aviad-Levitzky (2017), who gave learners the meaning-recognition based VST, a parallel meaning-recall test, and a reading test. Departing from the VST’s specifications (Nation, 2012), they instructed learners to skip items testing words that they did not believe they knew. Perhaps as a consequence of this change, there was no statistically significant difference between the correlations of the two tests with the reading measure (.91 and .92). Despite this nonsignificant result, the authors argued that meaning-recognition was the better predictor of reading ability. (Webb expressed surprise that we did not include this study in our review. As noted in our original paper, we excluded studies that allowed learners to skip unknown words on the meaning-recognition measure because research demonstrates that examinees use the option to skip differentially, which impacts the relationship between recognition and recall scores [Stoeckel et al., 2016].)

To better identify the difference in how strongly these two modalities correlate with reading proficiency, subsequent research by McLean et al. (2020) used a bootstrapping approach to mitigate Type I and Type II errors. Both meaning-recall and meaning-recognition were tested bilingually to allow direct comparisons. By sampling with replacement for thousands of iterations, McLean et al. demonstrated that, with very little overlap between the resulting distributions, meaning-recall outperformed meaning-recognition as a predictor of reading proficiency. Webb notes that meaning-recognition was also correlated with reading proficiency. We do not dispute this; as we note in the preceding text, all vocabulary tests will correlate with reading to at least some extent. However, the goal of our paper was to suggest improvements to vocabulary tests. As McLean et al. show, for a test of 30 items, meaning-recall outperforms meaning-recognition in average correlation with reading, 0.74 versus 0.65 (d = –3.622; see Figure 1), a distinction that becomes even clearer for tests with more items (Figure 2). Such differences between variables can have substantial impacts on models, so researchers should take note.

FIGURE 1. Histograms of bootstrapped correlations of meaning-recall and meaning-recognition to reading proficiency, 30 items (adapted from McLean et al., 2020).

FIGURE 2. Histograms of bootstrapped correlations of meaning-recall and meaning-recognition to reading proficiency, 100 items (adapted from McLean et al., 2020).
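For readers unfamiliar with the bootstrapping logic behind Figures 1 and 2, the following minimal sketch in Python illustrates the general procedure: learners are resampled with replacement, and the correlation of each vocabulary score with reading is recomputed in every resample. The sample sizes, noise levels, and simulated scores are hypothetical; the sketch does not reproduce McLean et al.'s (2020) data, code, or results.

```python
# Minimal sketch of bootstrapped correlation comparison (hypothetical data;
# does not reproduce McLean et al., 2020).
import numpy as np

rng = np.random.default_rng(42)

def pearson(x, y):
    """Pearson correlation of two 1-D arrays."""
    return np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1]

def bootstrap_correlations(recall, recognition, reading, iterations=10_000):
    """Resample learners with replacement; correlate each vocabulary score
    with reading proficiency in every resample."""
    n = len(reading)
    recall_r, recog_r = np.empty(iterations), np.empty(iterations)
    for i in range(iterations):
        idx = rng.integers(0, n, size=n)          # sample learners with replacement
        recall_r[i] = pearson(recall[idx], reading[idx])
        recog_r[i] = pearson(recognition[idx], reading[idx])
    return recall_r, recog_r

# Hypothetical scores for 200 learners (for illustration only).
ability = rng.normal(size=200)
reading = ability + rng.normal(scale=0.8, size=200)
recall = ability + rng.normal(scale=0.7, size=200)        # less noisy measure
recognition = ability + rng.normal(scale=1.2, size=200)   # noisier measure

recall_r, recog_r = bootstrap_correlations(recall, recognition, reading)
print(f"mean r (recall)      = {recall_r.mean():.2f}")
print(f"mean r (recognition) = {recog_r.mean():.2f}")
print(f"share of resamples where recall wins = {(recall_r > recog_r).mean():.2%}")
```

The extent of overlap between the two bootstrapped distributions, rather than a single p value, is what the histograms in Figures 1 and 2 visualize.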

Nor are such findings restricted to the previously mentioned study. A meta-analysis by Jeon and Yamashita (2014) also found that meaning-recall was the better predictor, although the difference did not reach statistical significance due to the small number of meaning-recall studies examined (seven). However, a more recent meta-analysis by Zhang and Zhang (2020), which included 21 studies using meaning-recall and 14 using meaning-recognition, found that mean correlations between meaning-recall and reading proficiency (r = .66 [.58, .71]) are significantly stronger than those between meaning-recognition and reading proficiency (r = .53 [.49, .57]; see Footnote 1). Debate about uses of tests such as the VLT and the VST aside, the research seems clear: if one does wish to measure vocabulary as it relates to reading, meaning-recall appears to be the better option.

Although he acknowledges the risk of overestimation in meaning-recognition, Webb argues that meaning-recall could underestimate vocabulary size, suggesting that meaning-recognition provides more “sensitivity” in scoring. Research in which learners are orally interviewed about their answers on meaning-recognition items shows that despite yielding higher mean scores, the format is highly insensitive, with learners choosing the options they do for a variety of disparate reasons, including construct-irrelevant ones such as test-taking strategies and blind guessing (Gyllstad et al., 2015; McDonald, 2015).

It is true that meaning-recall tests such as that of Aviad-Levitzky et al. (2019), which demand answers with perfect L2 English spelling of target-word synonyms, can depress scores for reasons unrelated to learners’ understanding of meaning, particularly given English’s complex orthography. However, an advantage of recall tests is that, unlike fixed-option meaning-recognition tests, researchers retain learners’ free responses, which can then be examined and graded as leniently as desired. Although a common complaint about this procedure is the time required to mark answers, online resources such as www.vocableveltest.org greatly expedite the process. Novel responses are presented to the researcher and can then be whitelisted or blacklisted during initial scoring, allowing those same responses to be scored automatically thereafter.
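As an illustration of how whitelist/blacklist scoring can automate marking after an initial human pass, consider the following sketch. It is a simplified, hypothetical illustration of the general workflow, not the implementation used by www.vocableveltest.org; all function and variable names are our own.

```python
# Simplified, hypothetical sketch of whitelist/blacklist scoring of
# meaning-recall responses (not the implementation used by vocableveltest.org).
from typing import Dict, Optional, Set

def score_response(target: str, response: str,
                   accepted: Dict[str, Set[str]],
                   rejected: Dict[str, Set[str]]) -> Optional[bool]:
    """Return True/False for previously judged answers, or None for a novel
    response that must be shown to the researcher."""
    answer = response.strip().lower()
    if answer in accepted.get(target, set()):
        return True
    if answer in rejected.get(target, set()):
        return False
    return None

def record_decision(target: str, response: str, is_correct: bool,
                    accepted: Dict[str, Set[str]],
                    rejected: Dict[str, Set[str]]) -> None:
    """Store a human judgment so the same response is scored automatically later."""
    bucket = accepted if is_correct else rejected
    bucket.setdefault(target, set()).add(response.strip().lower())

# Example: a novel answer is judged once by hand, then auto-scored thereafter.
accepted, rejected = {}, {}
print(score_response("rapid", "fast", accepted, rejected))   # None -> needs a decision
record_decision("rapid", "fast", True, accepted, rejected)
print(score_response("rapid", "fast", accepted, rejected))   # True -> automated
```

Because the human decision is stored per target word, leniency standards can also be revised later and the stored responses rescored without retesting learners.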

Moving on to practical matters, Webb expresses concern that meaning-recall tests may take longer to administer. However, the time required need not be prohibitive. Recent research by McLean et al. (2020) indicates that meaning-recall increases test time by roughly 28%, meaning a 10-minute meaning-recognition test would still require less than 13 minutes to complete using meaning-recall. Webb further argues that monolingual tests may be more appropriate for many ESL settings. While this is a legitimate concern in many contexts, as mentioned already, online resources such as www.vocableveltest.org can address it by permitting answers in multiple L1s.

These arguments in favor of meaning-recall do not mean meaning-recognition tests have no value at all. In cases in which learners do not have access to computers or cell phones, multiple-choice tests may have advantages in classroom contexts where teachers need fast results. Scoring of multiple L1s is possible in meaning-recall tests, but it involves greater initial overhead and greater complexity in scoring standards. While more research is necessary, it is possible that meaning-recall tests could underestimate knowledge when meaning is difficult to express. On balance, however, research shows meaning-recall is the preferred option for tests measuring vocabulary knowledge for reading.

Lexical Unit

Just as the choice of item format for a test should consider the purpose of the test and the learners taking it, the choice of a lexical unit should also take account of such considerations. Bauer and Nation’s (1993) level-six word family (WF6) is too inclusive for some purposes and for many learners, and we need to develop tests that use more appropriate word-family levels for high-frequency and initial mid-frequency vocabulary. Use of WF6 in tests assumes that learners know most or all family members at the same level of knowledge at which the target word was tested. This assumption is unsupported by research with L2 learners of English across a range of proficiency levels (McLean, 2018; Stoeckel et al., 2020; Ward & Chuenjundaeng, 2009).

Webb argues that lemma-based instruments require testing more words. However, from a statistical standpoint this is incorrect: for size tests, precision is a function of sample size rather than population size (Smith, 2004), so keeping the number of items constant does not reduce accuracy even though the population of lemmas is larger than the population of word families. For levels tests, we agree that more levels may be desirable for lemma-based instruments, but this need not be a serious concern. As Webb has observed, learners need only complete test levels at their proficiency level (Webb et al., 2017).
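This statistical point can be made concrete with a small simulation, sketched below with arbitrary, purely illustrative numbers: the standard error of a score based on a fixed number of items is essentially unchanged whether the population sampled from contains 1,000 or 20,000 counting units.

```python
# Illustrative simulation: the precision of a 30-item score does not depend
# on the size of the population sampled from. All figures are arbitrary.
import random

random.seed(0)

def se_of_score(population_size: int, known_rate: float = 0.70,
                items: int = 30, trials: int = 10_000) -> float:
    """Standard deviation of the proportion-correct score across many trials."""
    known_count = int(population_size * known_rate)
    population = [1] * known_count + [0] * (population_size - known_count)
    estimates = []
    for _ in range(trials):
        sample = random.sample(population, items)  # draw items without replacement
        estimates.append(sum(sample) / items)
    mean = sum(estimates) / trials
    return (sum((e - mean) ** 2 for e in estimates) / trials) ** 0.5

for pop in (1_000, 5_000, 20_000):  # e.g., family- vs. lemma-sized populations
    print(f"population of {pop:>6} units: SE of score ≈ {se_of_score(pop):.3f}")
```

Running this yields nearly identical standard errors for all three population sizes, which is the sense in which a switch to lemmas does not, by itself, demand longer tests.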

Webb cautions against reaching conclusions on this matter until more research is available. However, the prudent choice at present is to assume less derivational knowledge on the part of learners, not more. The available evidence suggests that learners well beyond the beginner level have trouble recalling the meaning of some derivational forms of known base words (see Brown et al., 2020, in press; McLean, 2021 for recent reviews). Tests relying on smaller lexical units will still be effective for learners regardless of proficiency, but the same cannot currently be ensured for WF6-based tests.

Target Word Sample Size

Webb argues that ideal item counts for size and levels tests are “not straightforward” on the grounds that “the greater the number of good test items, the more accurately a test should help to assess knowledge” (p. 458). It is true that good items have higher discrimination, reducing a test’s standard error of measurement. However, although multiple-choice items can be screened and improved, all else being equal, tests with recall items targeting the same words demonstrate superior quality (McLean et al., 2020; Stewart, 2012). Furthermore, in regard to size estimation, regardless of item quality, an axiom of inferential statistics is that the larger the sample size, the more reliable the vocabulary size estimate (see Footnote 2), and even theoretically perfect item discrimination does not obviate the need for sufficient sample sizes in this regard (Gyllstad et al., 2015, 2020b). In his response, Webb calls for examining how test performance is affected by manipulating disputed variables. Just such a study was conducted by Gyllstad et al. (2020b). An example of the difference item counts make to accuracy in size estimation is illustrated in Figure 3.

FIGURE 3. Monte Carlo study of vocabulary size estimates using tests of 10, 30, and 100 items (adapted from Gyllstad et al., 2020b).

Note: The true number of words known by this learner is 750.
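To convey a rough sense of the pattern shown in Figure 3, the margin of error of a proportion-based size estimate can be approximated with the standard error of a sampled proportion. The worked figures below assume, purely for illustration, a 1,000-word band of which the learner knows 750 words (p = .75), consistent with the note to Figure 3; they are our own back-of-the-envelope calculation, not values taken from Gyllstad et al. (2020b).

```latex
% Illustrative only: standard error of a proportion-based size estimate,
% assuming a 1,000-word band and a true knowledge rate of p = .75.
\[
SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}
\qquad\Longrightarrow\qquad
\begin{cases}
n = 10:  & SE \approx .137 \;\Rightarrow\; \pm 268 \text{ words (95\% CI)}\\
n = 30:  & SE \approx .079 \;\Rightarrow\; \pm 155 \text{ words}\\
n = 100: & SE \approx .043 \;\Rightarrow\; \pm 85 \text{ words}
\end{cases}
\]
```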

As explained in Stoeckel et al. (2021), research indicates that size estimation based on item response theory (IRT) can help address concerns about test length. Research by Culligan (2008) and Gibson and Stewart (2014) illustrates how IRT-based computer-adaptive tests can be tailored to learners’ ability levels, mitigating the need for many items far above or below learner ability. Although it is still advisable to test sufficiently at appropriate difficulty levels, IRT can greatly shorten tests of words with wide ranges of frequencies and difficulties, such as the VST.
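To illustrate in general terms how an adaptive test shortens measurement by matching item difficulty to the examinee, here is a toy Rasch-style sketch. The item bank, ability values, and estimation routine are hypothetical simplifications; this is not the procedure used by Culligan (2008) or Gibson and Stewart (2014).

```python
# Toy sketch of Rasch-based adaptive item selection (illustrative only).
import math
import random

random.seed(1)

def p_correct(theta: float, b: float) -> float:
    """Rasch probability that a learner of ability theta knows an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(responses, steps=50, lr=0.3):
    """Crude gradient ascent on the Rasch log-likelihood.
    responses: list of (difficulty, score) pairs with score in {0, 1}."""
    theta = 0.0
    for _ in range(steps):
        grad = sum(score - p_correct(theta, b) for b, score in responses)
        theta += lr * grad / len(responses)
    return theta

# Hypothetical 200-item bank with difficulties spanning frequency levels.
bank = [random.uniform(-3.0, 3.0) for _ in range(200)]
remaining = set(range(len(bank)))
true_theta = 0.8            # the simulated learner's (unknown) ability
theta, responses = 0.0, []

for _ in range(30):         # a 30-item adaptive administration
    # Select the unused item whose difficulty is closest to the current estimate.
    item = min(remaining, key=lambda i: abs(bank[i] - theta))
    remaining.remove(item)
    # Simulate the learner's answer under the Rasch model.
    score = int(random.random() < p_correct(true_theta, bank[item]))
    responses.append((bank[item], score))
    theta = estimate_theta(responses)

print(f"true ability {true_theta:.2f}, estimated ability {theta:.2f} after 30 items")
```

Because each new item is chosen near the provisional ability estimate, few items are wasted on words far too easy or far too difficult for the examinee, which is the source of the time savings.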

Conclusion

Webb concluded by expressing his belief that no empirical evidence exists to support our positions regarding meaning-recall items, smaller lexical units, and appropriate test lengths. We hope the research cited in this commentary puts his concerns to rest and makes the evidence for our positions clear. However, we wholeheartedly agree with Webb’s call for further research regarding our suggestions, and hope this dialogue inspires more such studies.

To use a contemporary term, commentaries such as Stoeckel et al. (2021), McLean (2021), Stewart (2014), and Schmitt et al. (2020) “problematize” widely used vocabulary tests. Problems are rarely welcomed with open arms. Much in the same way that dated statistical standards can attain a semblance of authority through the precedent of their past use, standards for instruments used in research can take on an air of unimpeachability when they have been used unquestioned for so long. However, it is important not to confuse what is familiar with what is preferable. Leaving precedent unquestioned can prevent appropriate scrutiny of past research.

As a final thought, it should be noted that each of the aforementioned characteristics of conventional vocabulary tests (i.e., fixed-response, meaning-recognition tests with relatively few randomly sampled items per level, based on word families) was initially established with little empirical evidence and on early, underdeveloped perspectives of validation (Norbert Schmitt, personal communication, May 6, 2021). Validations of newer tests that have inherited these characteristics have rarely attempted to examine their underlying assumptions. While we appreciate Webb’s calls for further evidence, we hope that, going forward, the scrutiny now applied to the increasing calls for updated standards is applied equally to the older ones.

Footnotes

We are grateful to Norbert Schmitt, Henrik Gyllstad, Christopher Nicklin, Joseph Vitta, Nick Bovee, and Dale Brown for their comments on earlier versions of this article.

1 Form-recall also outperformed meaning-recognition in both this study and McLean et al. (2020).

2 We have found that treating size tests as polls of proportions of known words results in slightly better confidence interval estimation than test SEM, despite the latter accounting for item variance (Gyllstad et al., 2020a).

References

Aviad-Levitzky, T., Laufer, B., & Goldstein, Z. (2019). The new computer adaptive test of size and strength (CATSS): Development and validation. Language Assessment Quarterly, 16, 345–368. https://doi.org/10.1080/15434303.2019.1649409
Bauer, L., & Nation, P. (1993). Word families. International Journal of Lexicography, 6, 253–279. https://doi.org/10.1093/ijl/6.4.253
Beglar, D., & Hunt, A. (1999). Revising and validating the 2000 word level and university word level vocabulary tests. Language Testing, 16, 131–162. https://doi.org/10.1177/026553229901600202
Brown, D., Stewart, J., Stoeckel, T., & McLean, S. (in press). The coming paradigm shift in the use of lexical units. Studies in Second Language Acquisition.
Brown, D., Stoeckel, T., McLean, S., & Stewart, J. (2020). The most appropriate lexical unit for L2 vocabulary research and pedagogy: A brief review of the evidence. Applied Linguistics. Advance online publication. https://doi.org/10.1093/applin/amaa061
Culligan, B. (2008). Estimating word difficulty using yes/no tests in an IRT framework and its application for pedagogical objectives (Unpublished doctoral dissertation). Temple University, Japan.
Gibson, A., & Stewart, J. (2014). Estimating learners’ vocabulary size under item response theory. Vocabulary Learning and Instruction, 3, 78–84. http://www.vli-journal.org/issues/03.2/issue03.2.full.pdf#page=82
Grabe, W. (2009). Reading in a second language. Cambridge University Press.
Gyllstad, H., McLean, S., & Stewart, J. (2020a). [Unpublished raw data comparing confidence intervals for vocabulary size estimates produced by a poll of a proportion and the standard error of measurement.]
Gyllstad, H., McLean, S., & Stewart, J. (2020b). Using confidence intervals to determine adequate item sample sizes for vocabulary tests: An essential but overlooked practice. Language Testing. Advance online publication. https://doi.org/10.1177/0265532220979562
Gyllstad, H., Vilkaitė, L., & Schmitt, N. (2015). Assessing vocabulary size through multiple-choice formats: Issues with guessing and sampling rates. ITL - International Journal of Applied Linguistics, 166, 278–306. https://doi.org/10.1075/itl.166.2.04gyl
Jeon, E. H., & Yamashita, J. (2014). L2 reading comprehension and its correlates: A meta-analysis. Language Learning, 64, 160–212. https://doi.org/10.1111/lang.12034
Kremmel, B., & Schmitt, N. (2016). Interpreting vocabulary test scores: What do various item formats tell us about learners’ ability to employ words? Language Assessment Quarterly, 13, 377–392. https://doi.org/10.1080/15434303.2016.1237516
Laufer, B., & Aviad-Levitzky, T. (2017). What type of vocabulary knowledge predicts reading comprehension: Word meaning-recall or word meaning-recognition? The Modern Language Journal, 101, 729–741. https://doi.org/10.1111/modl.12431
McDonald, K. (2015). The potential impact of guessing on monolingual and bilingual versions of the vocabulary size test. Osaka JALT Journal, 2, 44–61. http://www.osakajalt.org/journal/
McLean, S. (2018). Evidence for the adoption of the flemma as an appropriate word counting unit. Applied Linguistics, 39, 823–845. https://doi.org/10.1093/applin/amw050
McLean, S. (2021). The coverage comprehension model, its importance to pedagogy and research, and threats to the validity with which it is operationalized. Reading in a Foreign Language, 33, 126–140. https://nflrc.hawaii.edu/rfl/item/528
McLean, S., Kramer, B., & Stewart, J. (2015). An empirical examination of the effect of guessing on vocabulary size test scores. Vocabulary Learning and Instruction, 4, 26–35. http://vli-journal.org/wp/wp-content/uploads/2015/10/vli.v04.1.2187-2759.pdf#page=31
McLean, S., Stewart, J., & Batty, A. O. (2020). Predicting L2 reading proficiency with modalities of vocabulary knowledge: A bootstrapping approach. Language Testing, 37, 389–411. https://doi.org/10.1177/0265532219898380
McLean, S., Stewart, J., & Kramer, B. (2016, September 12–14). A comparison of multiple-choice and yes/no test formats with a meaning-recall knowledge criterion [Paper presentation]. Vocab@Tokyo, Tokyo, Japan.
Nation, I. S. P. (2009). Teaching ESL/EFL reading and writing. Routledge.
Nation, I. S. P., & Webb, S. (2011). Researching and analyzing vocabulary. Heinle.
Nation, P. (2013). Learning vocabulary in another language (2nd ed.). Cambridge University Press.
Nation, P., & Beglar, D. (2007). A vocabulary size test. The Language Teacher, 31, 9–13. https://jaltpublications.org/tlt/issues/2007-07_31.7
Schmitt, N., Nation, P., & Kremmel, B. (2020). Moving the field of vocabulary assessment forward: The need for more rigorous test development and validation. Language Teaching, 53, 109–120. https://doi.org/10.1017/S0261444819000326
Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring the behaviour of two new versions of the Vocabulary Levels Test. Language Testing, 18, 55–88. https://doi.org/10.1177/026553220101800103
Smith, M. H. (2004). A sample/population size activity: Is it the sample size of the sample as a fraction of the population that matters? Journal of Statistics Education, 12. https://doi.org/10.1080/10691898.2004.11910735
Stewart, J. (2012). A multiple-choice test of active vocabulary knowledge. Vocabulary Learning and Instruction, 1, 53–59.
Stewart, J. (2014). Do multiple-choice options inflate estimates of vocabulary size on the VST? Language Assessment Quarterly, 11, 271–282. http://vli-journal.org/issues/01.1/issue01.1.09.pdf
Stewart, J., & White, D. A. (2011). Estimating guessing effects on the vocabulary levels test for differing degrees of word knowledge. TESOL Quarterly, 45, 370–380. https://doi.org/10.5054/tq.2011.254523
Stoeckel, T., Bennett, P., & McLean, S. (2016). Is “I Don’t Know” a viable answer choice on the Vocabulary Size Test? TESOL Quarterly, 50, 965–975. https://doi.org/10.1002/tesq.325
Stoeckel, T., Ishii, T., & Bennett, P. (2020). Is the lemma more appropriate than the flemma as a word counting unit? Applied Linguistics, 41, 601–606. https://doi.org/10.1093/applin/amy059
Stoeckel, T., McLean, S., & Nation, P. (2021). Limitations of size and levels tests of written receptive vocabulary knowledge. Studies in Second Language Acquisition, 43, 181–203. https://doi.org/10.1017/S027226312000025X
Stoeckel, T., Stewart, J., McLean, S., Ishii, T., Kramer, B., & Matsumoto, Y. (2019). The relationship of four variants of the Vocabulary Size Test to a criterion measure of meaning-recall vocabulary knowledge. System, 87, 102161. https://doi.org/10.1016/j.system.2019.102161
Ward, J., & Chuenjundaeng, J. (2009). Suffix knowledge: Acquisition and applications. System, 37, 461–469. https://doi.org/10.1016/j.system.2009.01.004
Webb, S. (2008). The effects of context on incidental vocabulary learning. Reading in a Foreign Language, 20, 232–245. https://nflrc.hawaii.edu/rfl/item/178
Webb, S. (2021). A different perspective on the limitations of size and levels tests of written receptive vocabulary knowledge. Studies in Second Language Acquisition, 43, 454–461. https://doi.org/10.1017/S0272263121000449
Webb, S., Sasao, Y., & Ballance, O. (2017). The updated Vocabulary Levels Test. ITL - International Journal of Applied Linguistics, 168, 33–69. https://doi.org/10.1075/itl.168.1.02web
Zhang, S., & Zhang, X. (2020). The relationship between vocabulary knowledge and L2 reading/listening comprehension: A meta-analysis. Language Teaching Research. Advance online publication. https://doi.org/10.1177/1362168820913998