In response to our State-of-the-Scholarship critical commentary (Stoeckel et al., 2021), Stuart Webb (2021) asserts that there is no research supporting our suggestions for improving tests of written receptive vocabulary knowledge by (a) using meaning-recall items, (b) making fewer presumptions about learner knowledge of word families, and (c) using appropriate test lengths. As we will show, this is not the case. However, we think the questions and concerns he raises reflect those of many who have used these tests until now without controversy, and we appreciate the opportunity to explain these issues in greater detail.
To begin, we think Webb has more common ground with our position than he may realize. We agree with many of his statements and do not state otherwise in Stoeckel et al. (2021). For example, we agree that few if any vocabulary test makers have claimed their tests should be used as substitutes for reading tests; we agree that, despite this, vocabulary tests typically do show good correlations with reading; and we agree that, despite that, such tests should not be used as evidence of reading comprehension. These matters are not in dispute. However, there are remaining points of disagreement to address.
Test Use
Noting that “the premise on which their article was written is that the intended purpose of the VLT and VST is to measure vocabulary knowledge for the purpose of reading,” Webb appears to dispute that this was ever the case or an intention of the test makers. Webb further asserts that it is “not the intended purpose” of the tests “to accurately reveal the degree to which learners may reach key lexical coverage figures” (p. 458), track growth, or suggest vocabulary learning goals. However, while it is certainly true that these tests have not been sufficiently validated for these purposes, the fact that these applications factored into test makers’ intentions during their creation is clear from their own statements, as can be seen in Table 1.
TABLE 1. Some stated purposes of size and levels tests of written receptive vocabulary knowledge
Furthermore, while space constraints prevent a comprehensive list, the majority of uses of these tests in the SLA literature are for purposes such as those previously mentioned rather than merely to check learner knowledge of words without reference to other considerations. Researchers’ desire to use them in these ways is understandable. As Webb notes, vocabulary test scores do in general correlate with other constructs, and if the tests could be employed only to measure vocabulary knowledge, with no other inferences permitted, their usefulness would be quite limited. Indeed, while we would welcome such restrictions, were Webb’s cautions about appropriate test use strictly followed, it would all but mark the end of the use of tests such as the Vocabulary Levels Test (VLT) and Vocabulary Size Test (VST) as variables in research published in journals such as SSLA.
We believe the long-standing confusion regarding the intended purposes of these tests stems from the fact that, at the time of their creation, these tests often did not have narrowly specified uses (Norbert Schmitt, personal communication, May 6, 2021) and, as we and a number of our colleagues would argue, many still do not have them today (Schmitt et al., 2020). Thus, it is understandable that teachers and researchers would use them for a wider variety of purposes than is appropriate. We hope Stoeckel et al. (2021) acts as a caution against this.
Our own view is that while we do not support the notion that vocabulary comprehension alone is sufficient for reading comprehension (McLean, 2021), vocabulary knowledge is uncontroversially a component of reading and can be useful as one of several variables in studies of reading proficiency. Furthermore, while vocabulary knowledge alone is not sufficient for reading comprehension, testing vocabulary mastery can at least ensure that a lack of vocabulary knowledge is not an impediment for readers of given texts (Nation, 2009, p. 52). However, in all such applications, test items should ideally approximate as closely as possible how vocabulary is encountered in text (Schmitt et al., 2020). This leads us to the first suggestion in our original article, regarding choice of item format.
Item Format
In addition to demonstrating that fixed options inflate estimates of vocabulary size (Gyllstad et al., 2015; McLean et al., 2015; Stewart & White, 2011), research has also consistently shown that, all else being equal, recall formats are more reliable than meaning-recognition formats testing the same words when learners are asked to attempt every item (McLean et al., 2020; McLean et al., 2016; Stewart, 2012; Stoeckel et al., 2019). The relatively poor discrimination of meaning-recognition items makes experiments using them more prone to Type II error, in which a genuine effect goes undetected because results appear statistically nonsignificant. Furthermore, there is a growing consensus in the literature (e.g., Grabe, 2009, p. 23; Kremmel & Schmitt, 2016, p. 378; Nation & Webb, 2011, pp. 219, 285–286) that meaning-recall represents an appropriate threshold of lexical knowledge for reading because, as in fluent reading, word meaning must be retrieved from memory rather than identified in a list of options.
Webb appears to contest this position by citing Laufer and Aviad-Levitzky (2017), who gave learners the meaning-recognition-based VST, a parallel meaning-recall test, and a reading test. Departing from the VST’s specifications (Nation, 2012), they instructed learners to skip items testing words that they did not believe they knew. Perhaps as a consequence of this change, there was no statistically significant difference between the two tests’ correlations with the reading measure (.91 and .92). Despite this nonsignificant result, the authors argued that meaning-recognition was the better predictor of reading ability. (Webb expressed surprise that we did not include this study in our review. As noted in our original paper, we excluded studies that allowed learners to skip unknown words on the meaning-recognition measure because research demonstrates that examinees use the option to skip differentially, which affects the relationship between recognition and recall scores [Stoeckel et al., 2016].)
To better identify the difference between the two formats’ correlations with reading proficiency, subsequent research by McLean et al. (2020), which compared meaning-recall and meaning-recognition items as predictors of reading proficiency, used a bootstrapping approach to mitigate Type I and Type II errors. Both meaning-recall and meaning-recognition were tested bilingually to allow direct comparisons. By sampling with replacement for thousands of iterations, McLean et al. demonstrated that, with very little overlap between the resulting distributions, meaning-recall outperformed meaning-recognition as a predictor of reading proficiency. Webb notes that meaning-recognition was also correlated with reading proficiency. We do not dispute this; as we note in the preceding text, all vocabulary tests will correlate with reading to at least some extent. However, the goal of our paper was to suggest improvements to vocabulary tests. As McLean et al. show, for a test of 30 items, meaning-recall outperforms meaning-recognition in average correlations with reading, .74 versus .65 (d = –3.622; see Figure 1), a distinction that becomes even clearer for tests with more items (Figure 2). Such differences in variables can have substantial impacts on models, so researchers should take note.
FIGURE 1. Histograms of bootstrapped correlations of meaning-recall and meaning-recognition to reading proficiency, 30 items (adapted from McLean et al., 2020).
FIGURE 2. Histograms of bootstrapped correlations of meaning-recall and meaning-recognition to reading proficiency, 100 items (adapted from McLean et al., 2020).
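For readers who wish to see the logic of this comparison concretely, the short Python sketch below illustrates one way to bootstrap and compare the two formats’ correlations with a reading measure. It is our own illustration with simulated toy data, not the analysis code of McLean et al. (2020); all variable names and values are hypothetical.

```python
# Illustrative sketch only (simulated data, not McLean et al.'s code):
# bootstrap the correlation of each vocabulary format with reading and
# compare the two distributions on the same resamples.
import numpy as np

rng = np.random.default_rng(42)

# Toy scores for 200 hypothetical learners.
n = 200
reading = rng.normal(50, 10, n)
recall = 0.75 * reading + rng.normal(0, 6, n)        # assumed stronger predictor
recognition = 0.65 * reading + rng.normal(0, 8, n)   # assumed weaker predictor

iterations = 10_000
r_recall = np.empty(iterations)
r_recog = np.empty(iterations)
for i in range(iterations):
    idx = rng.integers(0, n, n)                      # resample learners with replacement
    r_recall[i] = np.corrcoef(recall[idx], reading[idx])[0, 1]
    r_recog[i] = np.corrcoef(recognition[idx], reading[idx])[0, 1]

print(f"mean r, recall:      {r_recall.mean():.3f}")
print(f"mean r, recognition: {r_recog.mean():.3f}")
print(f"proportion of resamples where recall wins: {(r_recall > r_recog).mean():.3f}")
```

Plotting the two arrays as histograms yields figures analogous to Figures 1 and 2: the less the distributions overlap, the clearer the advantage of one format over the other.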
Nor are such findings restricted to that study. A meta-analysis by Jeon and Yamashita (2014) also found meaning-recall to be the better predictor, although the difference did not reach statistical significance because of the small number of meaning-recall studies examined (seven). However, a more recent meta-analysis by Zhang and Zhang (2020), including 21 studies using meaning-recall and 14 using meaning-recognition, found that mean correlations between meaning-recall and reading proficiency (r = .66 [.58, .71]) are significantly stronger than those between meaning-recognition and reading proficiency (r = .53 [.49, .57]; see Footnote 1). Debate about the uses of tests such as the VLT and the VST aside, the research seems clear: if one does desire to measure vocabulary as it relates to reading, meaning-recall appears to be the better option.
Although he acknowledges the risk of overestimation with meaning-recognition, Webb argues that meaning-recall could underestimate vocabulary size, suggesting that meaning-recognition provides more “sensitivity” in scoring. Research in which learners are orally interviewed about their answers to meaning-recognition items shows that, despite higher mean scores, the format is highly insensitive, with learners selecting the options they do for a variety of disparate reasons, including construct-irrelevant ones such as test-taking strategies and blind guessing (Gyllstad et al., 2015; McDonald, 2015).
It is true that meaning-recall tests such as that of Aviad-Levitzky et al. (2019), which demand answers with perfect L2 English spelling of target-word synonyms, can depress scores for reasons unrelated to learners’ understanding of meaning, particularly given English’s complex orthography. However, an advantage of recall tests is that, unlike fixed-option meaning-recognition tests, researchers retain learners’ free responses, which can then be examined and graded as leniently as desired. Although a common complaint about this procedure is the time required to mark answers, online resources such as www.vocableveltest.org greatly expedite the process. Novel responses are presented to the researcher and can be whitelisted or blacklisted during initial scoring, allowing automated scoring of those same responses thereafter.
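As a rough sketch of how such semi-automated marking can work (this is our own illustration, not the implementation behind www.vocableveltest.org; the word lists and function are hypothetical), consider:

```python
# Hypothetical illustration of whitelist/blacklist scoring of free responses.
whitelist = {"abandon": {"give up", "desert", "見捨てる"}}   # accepted answers per target word
blacklist = {"abandon": {"enjoy"}}                           # rejected answers per target word
pending = []                                                 # novel answers awaiting a human decision

def score(target: str, response: str) -> str:
    """Return 'correct', 'incorrect', or 'pending' for one free response."""
    answer = response.strip().lower()
    if answer in {a.lower() for a in whitelist.get(target, set())}:
        return "correct"
    if answer in {a.lower() for a in blacklist.get(target, set())}:
        return "incorrect"
    pending.append((target, answer))    # researcher adjudicates once;
    return "pending"                    # the decision is then reused automatically

print(score("abandon", "give up"))       # correct
print(score("abandon", "leave behind"))  # pending -> queued for review
```

Once a novel response has been adjudicated and added to the appropriate list, every later learner who gives the same answer is scored automatically, so marking effort falls quickly as the response pool grows.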
Moving on to practical matters, Webb expresses concern that meaning-recall tests may take longer to administer. However, the time required need not be prohibitive. Recent research by McLean et al. (2020) indicates that meaning-recall increases testing time by roughly 28%, meaning a 10-minute meaning-recognition test would still require less than 13 minutes to complete using meaning-recall. Webb further argues that monolingual tests may be more appropriate for many ESL settings. While this is a legitimate concern in many contexts, as mentioned already, online resources such as www.vocableveltest.org can address it by permitting multiple L1s as possible response languages.
These arguments in favor of meaning-recall do not mean that meaning-recognition tests have no value at all. When learners do not have access to computers or cell phones, or when teachers need fast results in classroom contexts, multiple-choice tests may have advantages. Scoring of multiple L1s is possible in meaning-recall tests but involves greater initial overhead and greater complexity in scoring standards. While more research is necessary, it is also possible that meaning-recall tests could underestimate knowledge when meaning is difficult to express. On balance, however, research shows meaning-recall to be the preferred option for tests measuring vocabulary knowledge for reading.
Lexical Unit
Just as the choice of item format should consider the purpose of a test and the learners taking it, the choice of lexical unit should take account of the same considerations. Bauer and Nation’s (1993) level-six word family (WF6) is too inclusive for some purposes and for many learners, and we need to develop tests that use more appropriate word-family levels for high-frequency and initial mid-frequency vocabulary. Use of WF6 in tests assumes that learners know most or all family members at the same level of knowledge at which the target word was tested. This assumption is unsupported by research with L2 learners of English across a range of proficiency levels (McLean, 2018; Stoeckel et al., 2020; Ward & Chuenjundaeng, 2009).
Webb argues that lemma-based instruments require testing more words. However, from a statistical standpoint this is incorrect: for size tests, precision is a function of sample size rather than population size (Smith, 2004), so switching to the lemma as the lexical unit does not require more items to achieve the same accuracy. For levels tests, we agree that more levels may be desirable for lemma-based instruments, but this need not be a serious concern. As Webb has observed, learners need only complete the test levels at their proficiency level (Webb et al., 2017).
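To illustrate this point with our own numbers (for exposition only): under simple random sampling of n target words from a frequency list of N lexical units, the standard error of the estimated proportion of words known is approximately

$$\operatorname{SE}(\hat{p}) \;\approx\; \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}, \qquad n \ll N,$$

which does not involve N. With p̂ = .70 and n = 30, for example, SE(p̂) ≈ √(.70 × .30 / 30) ≈ .08, whether the list contains 5,000 word families or 10,000 lemmas; only the number of sampled items affects precision, and a finite-population correction would, if anything, make the estimate slightly more precise for the smaller list.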
Webb cautions against reaching conclusions on this matter until more research is available. However, the prudent choice at present is to assume less derivational knowledge on the part of learners, not more. The available evidence suggests that learners well beyond the beginner level have trouble recalling the meaning of some derivational forms of known basewords (see Brown et al., 2020, in press; McLean, 2021, for recent reviews). Tests relying on smaller lexical units will still be effective for learners regardless of proficiency, but the same cannot currently be ensured for WF6-based tests.
Target Word Sample Size
Webb argues that ideal item counts for size and levels tests are “not straightforward” on the grounds that “the greater the number of good test items, the more accurately a test should help to assess knowledge” (p. 458). It is true that good items have higher discrimination, reducing a test’s standard error of measurement. However, although multiple-choice items can be screened and improved, all else being equal, tests with recall items targeting the same words demonstrate superior quality (McLean et al., 2020; Stewart, 2012). Furthermore, in regard to size estimation, regardless of item quality, an axiom of inferential statistics is that the larger the sample size, the more reliable the vocabulary size estimate (see Footnote 2), and even theoretically perfect item discrimination does not obviate the need for sufficient sample sizes in this regard (Gyllstad et al., 2015, 2020b). In his response, Webb calls for examining how test performance is affected by manipulating disputed variables. Just such a study was conducted by Gyllstad et al. (2020b). An example of the difference item counts make to accuracy in size estimation is illustrated in Figure 3.
FIGURE 3. Monte Carlo study of vocabulary size estimates using tests of 10, 30, and 100 items (adapted from Gyllstad et al., 2020b).
Note: The true number of words known by this learner is 750.
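To make the logic behind Figure 3 concrete, the following Python sketch (our own simplified simulation, under assumptions of our choosing, not Gyllstad et al.’s code) samples tests of 10, 30, and 100 items from a 1,000-word band in which a simulated learner knows exactly 750 words and reports how the spread of size estimates narrows as items are added.

```python
# Sketch only: a simulated learner knows exactly 750 of the 1,000 words in a band.
# Each simulated test samples k words at random and estimates size as
# 1,000 * (proportion answered correctly). More items -> tighter estimates.
import numpy as np

rng = np.random.default_rng(0)
band = np.zeros(1000, dtype=bool)
band[:750] = True                      # 750 known words (positions are arbitrary)

def simulate_estimates(k, runs=10_000):
    estimates = np.empty(runs)
    for i in range(runs):
        sampled = rng.choice(band, size=k, replace=False)   # one test of k items
        estimates[i] = 1000 * sampled.mean()                 # extrapolated size estimate
    return estimates

for k in (10, 30, 100):
    est = simulate_estimates(k)
    print(f"{k:>3} items: mean = {est.mean():6.1f}, SD = {est.std():5.1f}, "
          f"95% of estimates in [{np.percentile(est, 2.5):.0f}, "
          f"{np.percentile(est, 97.5):.0f}]")
```

With 10 items, individual estimates can miss the true figure of 750 by 250 words or more; with 100 items they cluster far more tightly, mirroring the pattern shown in Figure 3.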
As explained in Stoeckel et al. (2021), research indicates that size estimation based on item response theory (IRT) can help address concerns about test length. Research by Culligan (2008) and Gibson and Stewart (2014) illustrates how IRT-based computer-adaptive tests can be tailored to learners’ ability levels, mitigating the need for many items far above or below a learner’s ability. Although it is still advisable to test sufficiently at appropriate difficulty levels, IRT can greatly shorten tests of words with wide ranges of frequency and difficulty, such as the VST.
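To show the basic mechanism, here is a minimal Rasch-model adaptive-testing sketch of our own; it is purely illustrative and is not the algorithm of Culligan (2008), Gibson and Stewart (2014), or any published vocabulary test. The item bank, ability value, and stopping rule are all assumptions made for the example.

```python
# Our own minimal Rasch-model computer-adaptive testing sketch, purely illustrative.
import numpy as np

rng = np.random.default_rng(1)

def p_correct(theta, b):
    """Rasch probability that a learner of ability theta answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def estimate_theta(responses, difficulties, theta=0.0, steps=20):
    """Newton-Raphson maximum-likelihood ability estimate (clamped to stay finite)."""
    u = np.asarray(responses, dtype=float)
    b = np.asarray(difficulties, dtype=float)
    for _ in range(steps):
        p = p_correct(theta, b)
        info = np.sum(p * (1.0 - p))            # test information at the current theta
        theta = float(np.clip(theta + np.sum(u - p) / info, -4.0, 4.0))
    return theta, 1.0 / np.sqrt(info)           # estimate and its standard error

bank = rng.normal(0.0, 1.5, 200)                # assumed item difficulties across frequency bands
true_theta = 0.8                                # assumed true ability of the simulated learner

administered, responses, theta, se = [], [], 0.0, np.inf
while len(administered) < 40 and se > 0.40:     # stop once the estimate is precise enough
    remaining = [i for i in range(len(bank)) if i not in administered]
    nxt = min(remaining, key=lambda i: abs(bank[i] - theta))   # most informative remaining item
    administered.append(nxt)
    responses.append(rng.random() < p_correct(true_theta, bank[nxt]))  # simulated answer
    theta, se = estimate_theta(responses, bank[administered], theta)

print(f"items administered: {len(administered)}, estimated theta = {theta:.2f} (SE = {se:.2f})")
```

Because item selection tracks the provisional ability estimate, few items far above or below the learner’s level are administered, which is how IRT-based adaptive tests keep overall test length down.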
Conclusion
Webb concluded by expressing his belief that no empirical evidence exists to support our positions regarding meaning-recall items, smaller lexical units, and appropriate test lengths. We hope the research cited in this commentary puts his concerns to rest and makes the evidence for our positions clear. However, we wholeheartedly agree with Webb’s call for further research regarding our suggestions, and hope this dialogue inspires more such studies.
To use a contemporary term, commentaries such as Stoeckel et al. (2021), McLean (2021), Stewart (2014), and Schmitt et al. (2020) “problematize” widely used vocabulary tests. Problems are rarely welcomed with open arms. Much as dated statistical standards can attain a semblance of authority through the precedent of their past use, standards for instruments used in research can take on an air of unimpeachability when they have gone unquestioned for so long. However, it is important not to confuse what is familiar with what is preferable. Leaving precedent unquestioned can prevent appropriate scrutiny of past research.
As a final thought, it should be noted that each of the aforementioned characteristics of conventional vocabulary tests (i.e., fixed-response, meaning-recognition items; relatively few randomly sampled items per level; and the word family as the lexical unit) was initially established with little empirical evidence and on the basis of early, underdeveloped perspectives on validation (Norbert Schmitt, personal communication, May 6, 2021). Validations of newer tests that have inherited these characteristics have rarely attempted to examine their underlying assumptions. While we appreciate Webb’s calls for further evidence, we hope that, going forward, these older standards receive the same scrutiny that is now being applied to the increasing calls for updated ones.