Stoeckel, McLean, and Nation’s article, Limitations of Size and Levels Tests of Written Receptive Vocabulary Knowledge (2021), discusses whether the Vocabulary Size Test (VST; Coxhead et al., 2015; Nation & Beglar, 2007) and the Vocabulary Levels Test (VLT; Nation, 1983; Schmitt et al., 2001; Webb et al., 2017) are effective at measuring the vocabulary knowledge necessary for reading. Stoeckel et al. suggest that these tests are likely to overestimate receptive vocabulary knowledge and propose three ways in which the tests could be improved: first, moving from a recognition format to a recall format; second, moving from word families to lemmas as the lexical unit; and third, increasing the number of target items. Stoeckel et al. conclude that existing size and levels tests lack the accuracy necessary for many specified testing purposes.
Although it is useful to look at different ways to improve measures of lexical knowledge, there is little research evidence supporting the claims made by Stoeckel et al., and there are several aspects of their article that should be considered further. First, the premise on which their article was written is that the intended purpose of the VLT and VST is to measure vocabulary knowledge for the purpose of reading.1 However, the VLT was developed to reveal to teachers where they should focus vocabulary learning (Nation, 1983, 1990, 2008; Nation & Webb, 2011; Read, 2000; Webb & Nation, 2017; Webb et al., 2017). The VST was developed to measure L2 learners’ knowledge of the most frequent 14,000 word families as a whole (Nation & Beglar, 2007) and was later expanded to measure both nonnative and native speakers’ knowledge of the most frequent 20,000 word families as a whole (Coxhead et al., 2015). There is currently no research indicating that the tests are not working well for these purposes. Neither test was developed and validated for the purpose of predicting reading comprehension. In fact, Beglar (2010, p. 114) reports that “test-takers’ responses provide only a rough indication of how well they can read, so the VST should not be viewed as a substitute for a reading test.”
From Stoeckel et al.’s article, we might assume that the VLT and VST do not work well for the purpose of reading. However, this does not appear to be the case. Qian (1999, 2002) found significant correlations of .78 and .74 between scores on Nation’s (1983) version of the VLT and reading comprehension. Stæhr (2008) found a significant correlation of .83 between scores on Schmitt et al.’s (2001) version of the VLT and reading comprehension. Laufer and Ravenhorst-Kalovski (2010) reported a significant correlation of .80 between VLT (Schmitt et al., 2001) scores and reading comprehension. In one study examining the relationship between scores on the VST and different types of reading comprehension questions, Chen and Liu found smaller significant correlations ranging from .35 to .49 between these variables. It should be noted that Chen and Liu only included scores on the first 10 frequency levels of Nation and Beglar’s (2007) VST. Because the VST was developed and initially validated to measure knowledge of a greater number of frequency levels, it is possible that using only part of the test reduces the validity and reliability of these findings. It would be useful for future studies to examine the relationship between reading comprehension and the most recent version of the VLT (Webb et al., 2017) and complete versions of the VST (Coxhead et al., 2015; Nation & Beglar, 2007) to determine whether the results are consistent with earlier findings. It would also be useful to investigate the degree to which reading comprehension is associated with scores on different vocabulary tests. For example, research could examine the relationships between reading comprehension scores and scores on receptive tests of form-meaning connection such as the VLT and VST, tests of productive vocabulary knowledge such as Lex30 (Meara & Fitzpatrick, 2000), tests that include multiple formats (e.g., the Computer Adaptive Test of Size and Strength; Aviad-Levitzky et al., 2019), and tests that measure other aspects of vocabulary knowledge such as the Word Part Levels Test (Sasao & Webb, 2017) and the Guessing from Context Test (Sasao & Webb, 2018).
Perhaps Stoeckel et al. are pointing to the fact that in studies of L2 vocabulary, the scores from both tests have been provided to indicate whether participants may be able to understand the L2 input encountered in different learning conditions (e.g., Feng & Webb, 2020; Horst et al., 1998). Providing the scores of vocabulary tests that have gone through rigorous development and validation procedures is useful in research because it helps to provide a clearer picture of the vocabulary knowledge of participants. It may also reveal the degree to which prior vocabulary knowledge was a factor in learning (e.g., Peters, 2020; Webb & Chang, 2015). However, the scores of these tests should not be taken to indicate the degree to which materials are understood. There is no research indicating that tests of vocabulary knowledge can determine comprehension. Comprehension tests are needed for that purpose.
Much of the justification for the claims made in Stoeckel et al. is based on the extent to which the tests may distinguish the lexical coverage of text. Studies of lexical coverage have used carefully controlled research designs that tend to involve replacing low-frequency words with pseudowords to determine the relationship between lexical coverage and comprehension (e.g., Hu & Nation, 2000; van Zeeland & Schmitt, 2013). Lexical profiling studies have reported the vocabulary knowledge necessary to reach the lexical coverage of materials that may indicate that the text might be understood (e.g., Nation, 2006; Webb & Macalister, 2013). However, it is important to note that meeting lexical coverage figures associated with comprehension does not ensure that materials will be understood. In fact, knowing all the words in spoken and written input (100% lexical coverage) does not ensure that the input will be understood (Hu & Nation, 2000; Martinez & Murphy, 2011; Schmitt et al., 2011). There are many factors that affect comprehension, and while vocabulary knowledge of the words encountered in input may be the most important factor (Laufer & Sim, 1985), many other factors also play a role (Grabe, 2009). In fact, research that has investigated the degree to which the lexical profiles of materials are associated with reading comprehension indicates that there may only be a small correlation between the two variables (Webb & Paribakht, 2015). Moreover, individual differences in the vocabulary knowledge of L2 learners in a class or in a sample of participants are likely to lead to varying levels of lexical coverage and varying degrees of comprehension. Thus, while research on lexical coverage has been extremely useful in revealing the importance of vocabulary for comprehension (e.g., Hu & Nation, 2000; Laufer, 1989; Laufer & Ravenhorst-Kalovski, 2010; Schmitt et al., 2011) and vocabulary learning targets associated with understanding different materials (e.g., Dang & Webb, 2014; Nation, 2006; Webb & Rodgers, 2009), its value may relate primarily to theory rather than to practice. Claiming that vocabulary levels and size test scores are likely to determine reading comprehension is a misinterpretation of both the intended purposes of the tests and the findings of studies of lexical coverage.
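To make the lexical coverage construct concrete for readers, the following is a minimal sketch (in Python) of how the lexical coverage of a text might be computed against a learner’s assumed known-word list. The word list, text, and simple tokenization are hypothetical simplifications and are not drawn from any of the studies cited above.

```python
import re

def lexical_coverage(text, known_words):
    """Return the percentage of running words (tokens) in `text`
    that appear in the set of `known_words`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return 100 * known / len(tokens)

# Hypothetical example: a learner who knows every word except "pseudoword".
known = {"the", "learner", "read", "a", "short", "text", "with", "one", "unknown"}
text = "The learner read a short text with one unknown pseudoword"
print(f"{lexical_coverage(text, known):.1f}% coverage")  # 90.0% coverage
```

In the studies discussed above, coverage is typically manipulated by replacing low-frequency words with pseudowords or estimated from frequency-based word lists rather than from an individual’s actual knowledge; as the paragraph notes, the same coverage figure can correspond to quite different levels of comprehension.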
Test Format
The VST and VLT use meaning recognition formats. Stoeckel et al. argue that the tests would be improved through using a meaning recall format. Surprisingly, the only study (Laufer & Aviad-Levitzky, 2017) to explicitly investigate the relationships between meaning recall, meaning recognition, and reading comprehension using one of the tests was not discussed. Laufer and Aviad-Levitzky (2017) investigated whether meaning recognition items from the VST or meaning recall items for the same words were more closely related to reading comprehension. They found that both test formats were highly correlated with reading comprehension (r = .91 for meaning recall and r = .92 for meaning recognition). However, in contrast to Stoeckel et al., they argued that meaning recognition is a better predictor of reading comprehension than meaning recall because discriminating between distractors in meaning recognition formats may better reflect the processes that readers use to infer unfamiliar vocabulary when reading. Stoeckel et al. justified the value of meaning recall in part by reporting that it was found to have a significantly higher correlation with reading proficiency than meaning recognition in a recent study by McLean et al. (2020). However, it is important to note that McLean et al. also found that both test formats were relatively highly correlated with reading proficiency (Pearson correlations with reading proficiency in a 30-item test were .74 for meaning recall and .65 for meaning recognition) and that the test items used in the study were not from either the vocabulary size or levels tests (they did, however, follow a construction procedure similar to that of VST items). Moreover, the meaning recall format examined in McLean et al. was bilingual (test takers provide the L1 meaning when cued with the L2 form) rather than monolingual (test takers provide the L2 meaning when cued with the L2 form). Readers should question the validity of comparisons of monolingual and bilingual test formats because the former can be used in both EFL and ESL contexts while the latter can only be used in EFL contexts in which all students share the same L1 and the teacher is also proficient in the learners’ L1.
There have also been many other studies investigating the different test formats used to measure knowledge of form-meaning connection (e.g., Nakata, 2016; Smith & Karpicke, 2014). The study that has most rigorously investigated these formats for L2 learners was conducted by Laufer and Goldstein (2004). Laufer and Goldstein showed that four common test formats indicate different degrees of knowledge of form-meaning connection; form recall is the most demanding and represents the greatest strength of knowledge, while meaning recognition is the least demanding and represents the smallest strength of knowledge. Thus, if we were to compare the sizes of gains in vocabulary knowledge using the four tests, we should expect the highest scores to occur for meaning recognition, with scores gradually decreasing in size for form recognition, meaning recall, and form recall, in that order. Stoeckel et al. argue that meaning recognition test formats overestimate knowledge. However, the same argument could be used to claim that meaning recall formats underestimate knowledge. The value of the different test formats should be judged by the degree to which they indicate knowledge for the intended purpose. The justification for the meaning recognition format used in the VLT is that it is sensitive to learner knowledge and is easy to complete and grade (Nation, 1983). Using recall formats in tests designed to measure vocabulary levels and size might have a large effect on test administration and grading. Because recall formats are more difficult, it would likely take longer to complete a test, thereby reducing test practicality. In addition, grading would not only take longer but could also present challenges in how to evaluate incorrect spelling, grammatical forms, and unexpected responses that indicate partial knowledge, which can have a negative impact on reliability. Changing to a less user-friendly test format might thus have the consequence of reducing its perceived value to teachers (i.e., reducing face validity). Nation and Webb (2011) also suggested that using meaning recognition formats in diagnostic tests such as the VLT is useful for teachers because it reveals knowledge that could be further developed. Research investigating the advantages and disadvantages of the different formats for their intended users (teachers and learners) would be useful to further clarify the value of the different test formats.
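For readers less familiar with the four formats in Laufer and Goldstein’s hierarchy, the sketch below uses a single invented item to show how the same form-meaning connection can be tested at the four strengths of knowledge; the item and wording are illustrative only and are not taken from any published test.

```python
# A hypothetical English target word illustrating the four test formats
# described by Laufer and Goldstein (2004), ordered from the most to the
# least demanding. The item content is invented for illustration only.
target = {"form": "melt", "meaning": "to turn from solid to liquid"}

formats = {
    # 1. Form recall: the meaning is given; the learner supplies the word form.
    "form recall": f"Write the English word that means '{target['meaning']}': ____",
    # 2. Meaning recall: the form is given; the learner supplies the meaning
    #    (in the L2, or in the L1 in bilingual versions such as McLean et al., 2020).
    "meaning recall": f"What does '{target['form']}' mean? ____",
    # 3. Form recognition: the meaning is given; the learner chooses the form.
    "form recognition": f"Which word means '{target['meaning']}'?  a) melt  b) mend  c) mild  d) mold",
    # 4. Meaning recognition (the VST/VLT format): the form is given; the
    #    learner chooses the meaning from options.
    "meaning recognition": f"'{target['form']}' means:  a) to turn from solid to liquid  "
                           "b) to make a loud noise  c) to move slowly  d) to become larger",
}

for name, prompt in formats.items():
    print(f"{name.upper()}\n  {prompt}\n")
```

Scored dichotomously across a set of such items, these formats would be expected to yield progressively lower totals in roughly the order listed, which is why a recognition score can be read as an overestimate, or a recall score as an underestimate, depending on the benchmark chosen.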
Lexical Unit
Stoeckel et al. argue that the levels and size tests could be improved through changing the lexical unit from word families to lemmas. There has been a great deal of discussion recently about whether lemmas or word families are best suited for measuring receptive knowledge of vocabulary (e.g., Brown et al., 2020; Kremmel, 2016; Laufer et al., 2021; McLean, 2018; Nation & Webb, 2011). The value of using word families in tests is that by measuring knowledge of morphologically unrelated words (e.g., care, know, run rather than care, careful, careless), tests assess L2 learning of different words without evaluating knowledge of the morphological system. The value of using lemmas in tests is that by measuring knowledge of derivatives and headwords separately, tests may provide a more precise measurement of lexical knowledge (Kremmel, 2016). Evaluating knowledge using a lemma-based test might be most sensible for beginners who are unable to recognize the similarities among morphologically related words. However, one disadvantage of using lemma-based tests is that there are far more lemmas than word families to measure. The most frequent 1,000 and 3,000 word families are made up of 3,281 and 9,132 lemmas, respectively (Nation, 2016). Although these lemmas will vary in frequency, the much greater number of lemmas than word families would require measuring lexical knowledge with a much greater number of test items, thereby reducing test practicality. In addition, because there are many morphologically related lemmas, there is bound to be inclusion of morphologically related items when lemmas are used as the lexical unit. For example, care, careful, carefully; consider, considerable, considerably, consideration; differ, difference, different; employ, employee, employer, employment; and important, importance, importantly are a few of the many morphologically related lemmas within the most frequent 2,500 lemmas of Brezina and Gablasova’s (2015) New General Service List. Teachers and intermediate and advanced learners might question the value of tests that measure knowledge of morphologically related words, thereby reducing face validity. To provide a more transparent measurement of vocabulary knowledge, teachers could use the Word Part Levels Test (Sasao & Webb, 2017) to assess knowledge of the derivational system together with a test of form-meaning connection. There might also be greater benefit for research and pedagogy in developing a test designed to measure knowledge of derivations than in modifying existing tests that appear to be working correctly.
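The practicality concern can be illustrated with a rough calculation. The sketch below assumes a hypothetical test design that samples a fixed number of items per 1,000 counting units and uses the family-to-lemma figures from Nation (2016) cited above; the sampling ratio of 30 items per 1,000 units is an invented value chosen purely for illustration.

```python
# Rough illustration of how test length scales with the choice of lexical unit.
# The lemma counts per word-family band are those cited above from Nation (2016);
# the sampling ratio (30 items per 1,000 units) is a hypothetical design choice.

ITEMS_PER_1000_UNITS = 30  # hypothetical sampling ratio

bands = {
    # band label: (word families covered, lemmas those families contain)
    "most frequent 1,000 families": (1_000, 3_281),
    "most frequent 3,000 families": (3_000, 9_132),
}

for label, (families, lemmas) in bands.items():
    family_items = families / 1_000 * ITEMS_PER_1000_UNITS
    lemma_items = lemmas / 1_000 * ITEMS_PER_1000_UNITS
    print(f"{label}: ~{family_items:.0f} family-based items "
          f"vs. ~{lemma_items:.0f} lemma-based items "
          f"({lemmas / families:.1f}x more)")
```

Whatever sampling ratio is chosen, measuring the same frequency range with lemmas as the unit multiplies the number of items by roughly three, which is the loss of practicality described above.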
Number of Items
Stoeckel et al. argue that there is an insufficient number of items in the tests because they will be unable to accurately reveal the degree to which learners may reach key lexical coverage figures. However, this is not the intended purpose of the tests and is likely more relevant to research than pedagogy. The question of how many items should be included in a vocabulary size or levels test is a good one, although it is not as straightforward as presented. In general, the greater the number of good test items, the more accurately a test should assess knowledge (Haladyna & Rodriguez, 2013). Creating more precise tests should be a goal, so there is merit to this claim. However, aspects of practicality such as time for test administration, time for grading, and test-taker fatigue should also be considered. If there is insufficient time to administer and grade a test, then it will likely have little value to teachers. It would be useful to investigate how tests with different numbers of items meet pedagogical needs.
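A back-of-the-envelope calculation illustrates the trade-off between the number of items and the precision of a size estimate. The sketch below assumes a test that, like the VST, samples evenly across the 14,000 word families it covers and scales the number of correct answers up to an estimated size; the learner’s true proportion of known families (60%), the alternative test lengths, and the normal approximation to the binomial are all simplifying assumptions made for illustration.

```python
import math

# Back-of-the-envelope illustration of how the number of items affects the
# precision of a vocabulary size estimate. Assumes a size test that samples
# words evenly across the frequency range it covers and a hypothetical
# learner who truly knows 60% of the word families in that range.

def size_estimate_precision(n_items, families_covered, true_proportion_known):
    """Return the expected size estimate and an approximate 95% margin of
    error (normal approximation to the binomial), both in word families."""
    families_per_item = families_covered / n_items
    expected_correct = true_proportion_known * n_items
    se_items = math.sqrt(n_items * true_proportion_known * (1 - true_proportion_known))
    estimate = expected_correct * families_per_item
    margin = 1.96 * se_items * families_per_item
    return estimate, margin

for n_items in (50, 100, 140, 280):
    est, margin = size_estimate_precision(n_items, families_covered=14_000,
                                          true_proportion_known=0.60)
    print(f"{n_items:>3} items: estimate {est:,.0f} ± {margin:,.0f} word families")
```

Because the margin of error shrinks only with the square root of the number of items, doubling test length buys a relatively modest gain in precision, which has to be weighed against the administration time, grading time, and test-taker fatigue mentioned above.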
Conclusion
In this commentary I have argued that the way to move forward with the development of receptive tests of vocabulary levels and size is through research on the efficacy of those tests. The transparency and replicability of research articles are the foundation of the research process. They enable readers to evaluate research methods as well as the validity of interpretations that can be made on the basis of test results. Moreover, they allow researchers to conduct further studies to follow up earlier findings. When there are differences of opinion, further research provides the opportunity to clarify findings.
I agree with Stoeckel et al. that it is important to try to improve existing measures of vocabulary knowledge. However, a major problem with Stoeckel et al.’s article is that it did not provide any empirical evidence indicating that size and levels tests of written receptive vocabulary knowledge are not working correctly. Because no research has shown that any of the suggested changes to these tests would improve their validity and reliability, it is premature to dismiss or reject the existing versions of these tests. Neither was any empirical evidence provided to support the three conclusions that the VST or the VLT can be improved through (a) changing the test format from meaning recognition to meaning recall, (b) increasing the number of items, and (c) changing the lexical unit from word families to lemmas. This is extremely worrying because we should expect to find evidence-based conclusions. Support should be provided by the findings of multiple studies that have (a) investigated the use of the tests to reveal their shortcomings, and (b) examined how test performance was affected by manipulating the three variables (test format, number of items, lexical unit). Schmitt et al. (2020) encouraged more rigorous vocabulary test development and validation. This should involve conducting studies of existing tests to investigate the degree to which they are working correctly for learners in a variety of L2 learning contexts. Through further validation, researchers can determine whether a test is working correctly for its intended purpose and whether it sufficiently meets the needs of teachers, learners, and researchers. Research can also examine the degree to which different variables can be manipulated to improve test performance. There would also be great value in creating new tests that tap into other aspects of vocabulary knowledge.