1. INTRODUCTION
In this paper we aim to come to a better understanding of a particular aspect of lexical richness, namely lexical sophistication, in the learner language of British learners of French. As is well known, lexical richness is a multidimensional feature of written or spoken language. Read (2000: 200) distinguishes four dimensions of lexical richness, and one of these is lexical sophistication, which he defines as ‘the use of technical terms and jargon as well as the kind of uncommon words that allow writers to express their meanings in a precise and sophisticated manner’. The key question is of course how to operationalise what counts as a sophisticated word or expression. Many measures of lexical richness are based on the assumption that the key factor behind the difficulty of a lexical item is its frequency. Laufer and Nation's (1995) Lexical Frequency Profile, for example, is based on the assumption that frequent words are easier than infrequent words. This view is shared by Vermeer (2000) and Meara and Bell (2001). Similarly, Malvern, Richards, Chipere and Durán (2004: 3) define lexical sophistication as the appropriate use of low frequency vocabulary items.
The question is, however, whether frequency is the only dimension that counts. The psycholinguistic literature shows that the cognate status of items is also an important factor in processing. Cognate items, i.e. translation pairs in which the words are similar in sound and spelling, are processed faster than non-cognates (Van Hell and De Groot, 1998: 193), confirming that, as one would predict, cognates are easier than non-cognates. For British learners of French, therefore, many infrequent items are easy because the French and English translation equivalents are cognates, e.g. French détester ‘to detest’, which is infrequent but probably highly transparent to learners.
Evidence that cognates play an important role in L2 acquisition comes from Laufer and Paribakht (1998), who demonstrate that French-speaking ESL students obtain higher scores on a test of English controlled active vocabulary than learners with no French because of the large number of French-English cognates. In a similar vein, Horst and Collins (2006), in their study of the longitudinal development of French learners of English in Canada, show that learners initially prefer certain low frequency items such as respond over the high frequency alternative answer, because the former is a cognate of the French translation equivalent répondre.
Experienced teachers, in putting together textbooks or other material for learners, will therefore not rely solely on the frequency of lexical items, but will use additional criteria, such as the cognate status of items, in judging which words learners need. It is therefore worth finding out to what extent teacher judgements can help us to get a better understanding of the lexical richness of learner language.
We assume that measures of lexical sophistication which involve teacher judgement are better than those that are solely based on frequency. More specifically, the key hypothesis of the current article is that measures of lexical richness based on a basic vocabulary list derived from teacher judgements should be better able to discriminate between groups than measures based either on a basic vocabulary list consisting of the most frequent words or on a traditional basic vocabulary list such as Français Fondamental Premier Degré (FF1). The latter is based on a variety of criteria and contains several items that are now out of date (see Tidball and Treffers-Daller, 2007).
We test the key hypothesis of this article by investigating how different operationalisations of the concept of basic vocabulary affect the power of a measure of lexical sophistication which was recently proposed by Daller, Van Hout and Treffers-Daller (2003), the Advanced Guiraud (AG) (see below for a description). The current paper is a follow-up to Tidball and Treffers-Daller (2007), in which we found that the AG was less able to discriminate between groups than measures of lexical diversity that do not make use of external criteria such as a basic vocabulary list. The choice of a different basic vocabulary can possibly improve the performance of the AG.
We will compare different operationalisations of the concept of basic vocabulary by calculating AG in a variety of ways. These results will subsequently be compared with those obtained with the help of Vocabprofil, the French version of Nation's Range programme, which gives a Lexical Frequency Profile of texts (Laufer and Nation, 1995). We will also establish how these measures compare with measures of lexical diversity, which do not make reference to any external criteria or lists, such as the Index of Guiraud (Guiraud, 1954) and D (Malvern et al., 2004).
Before going into the details of the current study, we will first present different ways of measuring lexical sophistication (section 2), followed by a brief appraisal of recent work on word frequencies in French and how this relates to the operationalisation of the concept of a basic vocabulary (section 3). In section 4 we present the methodology of the current study, and section 5 gives an overview of the results. Section 6 offers a discussion of the results and a conclusion.
2. MEASURING LEXICAL SOPHISTICATION
We agree with Meara and Bell (2001) that it is important to assess the quality of vocabulary used by L2 learners by making reference to external criteria, such as basic vocabulary lists or frequency lists of lexical items, if one wants to gain a better understanding of lexical richness. As Meara and Bell's (2001: 6) now rather famous examples (1) to (3) show, measures of diversity which are based on the distribution of types and tokens in a text will produce the same result for each of these examples.
(1) The man saw the woman
(2) The bishop observed the actress
(3) The magistrate sentenced the burglar
These three sentences are, however, quite different in the quality of the vocabulary used, as the words in (1) are less difficult (and more frequent) than those in (2) and (3). As Malvern et al. (2004: 124) note, the dimensions of diversity and rarity (sophistication) are of course linked, because ‘over a longer stretch of language, diversity can only increase by the inclusion of additional different words, and the more these increase, the more any additional word types will tend to be rare’. The question is therefore whether the development of lexical resources in L1 or L2 learning is due to an increase in the number of low frequency words, or whether the children or students make better use of a wider range of higher frequency words.
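The point made with examples (1) to (3) can be checked directly. The following is a minimal sketch, assuming simple whitespace tokenisation and lower-casing (our own simplification, not the procedure of any particular tool); the function name type_token_ratio is hypothetical.

```python
def type_token_ratio(sentence):
    """Return (types, tokens, TTR) for a lower-cased, whitespace-tokenised sentence.

    Tokenisation by whitespace is an assumption made for illustration only.
    """
    tokens = sentence.lower().split()
    types = set(tokens)
    return len(types), len(tokens), len(types) / len(tokens)


# Examples (1) to (3) from Meara and Bell, as quoted above.
examples = [
    "The man saw the woman",
    "The bishop observed the actress",
    "The magistrate sentenced the burglar",
]

results = [type_token_ratio(s) for s in examples]
for sentence, (v, n, ttr) in zip(examples, results):
    print(f"{sentence!r}: V={v}, N={n}, TTR={ttr:.2f}")
```

Each sentence has four types and five tokens, so any measure computed from V and N alone returns the same value for all three, whatever the difficulty of the words.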
Hayes and Ahrens (1988; in Malvern et al., 2004) and Laufer (1998) found that the percentage of low frequency vocabulary did not increase in the spoken or (free active) written data of their informants. Recently, Horst and Collins (2006) have shown that 11- and 12-year-old francophone learners of English in Québec do not use a higher number of low frequency words after 400 hours of tuition, but a larger variety of high frequency words (up to the K1 layer), and they draw less upon cognates (see above). This illustrates most clearly that other factors, such as the cognate status of items, in addition to frequency, play an important role in lexical development, and that the frequency bands to which vocabulary items belong do not always provide a good indication of students' progress.
In the present study we use the Advanced Guiraud (AG), as proposed by Daller et al. (2003), to measure the differences in lexical sophistication in the speech of British learners of French and a French native speaker control group. The AG is derived from the Index of Guiraud (Guiraud, 1954), which is the ratio of types (V) over the square root of tokens (N), as expressed in the formula V/√N. The AG is the ratio of advanced types over the square root of the tokens (Vadv/√N). For its calculation, one needs to distinguish basic and advanced vocabulary, and in this paper we do this in three different ways: with the help of a) the traditional français fondamental premier degré (FF1); b) a list of basic words based on teacher judgements; c) a list of basic words derived from the Corpaix frequency list (Véronis, 2000).
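The two formulas can be illustrated with a minimal sketch. It assumes that the input is already lemmatised and that every type not on the basic vocabulary list counts as advanced; the token sequence and the basic list below are toy data invented for the illustration, and the function names are hypothetical.

```python
import math


def guiraud(tokens):
    """Index of Guiraud: number of types (V) over the square root of tokens (N)."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    return len(set(tokens)) / math.sqrt(len(tokens))


def advanced_guiraud(tokens, basic_vocabulary):
    """Advanced Guiraud: advanced types (Vadv) over the square root of tokens (N).

    A type is treated as advanced when it is absent from basic_vocabulary.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    advanced_types = set(tokens) - set(basic_vocabulary)
    return len(advanced_types) / math.sqrt(len(tokens))


# Toy lemmatised sample (9 tokens, 7 types); not taken from the corpus.
tokens = ["le", "homme", "voir", "le", "voleur", "et", "le", "chapeau", "tomber"]
# Toy basic vocabulary; a real list would be FF1, the judges' list or Corpaix.
basic = {"le", "homme", "voir", "et"}

print(guiraud(tokens))                  # 7 / sqrt(9)
print(advanced_guiraud(tokens, basic))  # 3 / sqrt(9)
```

Because only the basic list changes between operationalisations, the same token sequence can yield quite different AG scores depending on which list is supplied.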
As teacher judgements of the difficulty of words were a reliable tool in measuring lexical richness among Turkish-German bilinguals (Daller et al., 2003), the second operationalisation is based on teacher judgements. A list of the frequency of words in a corpus of spoken French, the Corpaix oral frequency list (Véronis, 2000), forms the basis of the third operationalisation. We wanted to find out whether frequency lists are able to successfully capture words which intuition tells us belong to basic vocabulary. We believe with Gougenheim, Rivenc, Michéa and Sauvageot (1964: 138) that ‘Ils [les mots concrets] semblent se dérober à la statistique’ (‘concrete words seem to escape statistics’). A comparison of basic vocabulary lists based on frequency with those based on teacher judgement may well be able to shed new light on the validity of this claim.
We compare different operationalisations of the AG with a well-known measure of lexical diversity, D (Malvern et al., 2004), which represents the single parameter of a mathematical function that models the falling TTR curve (see also Jarvis, 2002 and McCarthy and Jarvis, 2007 for an appraisal of this measure). The different measures can give us an indication of the extent to which the students from the three groups differ from each other in the quantity and/or in the quality of the vocabulary they use.
Finally, we compare these results with those obtained with the help of the frequency bands in Vocabprofil. The output of the programme is a Lexical Frequency Profile (Laufer and Nation, 1995), which gives the frequency of words according to the following four frequency layers: the list of the most frequent 1000 word families (K1), the second 1000 (K2), the Academic Word List (AWL) and words that do not appear on the other lists (NOL). Laufer (1995) shows that a condensed version of the LFP, which distinguishes between the basic 2000 words and the ‘beyond 2000’ words, can also be used to measure lexical richness across different levels of proficiency.
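How a profile of this kind is computed can be sketched as follows. This is a simplified illustration, not the implementation of Range or Vocabprofil: it assumes that each band is a plain set of lemmas rather than of word families, and the band contents and token sequence are toy data. The function names are hypothetical.

```python
from collections import Counter


def frequency_profile(tokens, bands):
    """Percentage of tokens falling in each band; unmatched tokens count as NOL.

    bands maps a band name (e.g. "K1") to a set of words. Bands are checked in
    the order given, so a word listed in two bands is counted in the first.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter()
    for token in tokens:
        for name, words in bands.items():
            if token in words:
                counts[name] += 1
                break
        else:
            counts["NOL"] += 1  # not on any list
    return {name: 100 * counts[name] / len(tokens) for name in [*bands, "NOL"]}


def beyond_2000(profile):
    """Condensed profile: share of tokens outside the K1 and K2 bands."""
    return sum(share for name, share in profile.items() if name not in ("K1", "K2"))


# Toy bands and toy tokens, for illustration only.
bands = {"K1": {"le", "homme", "voir"}, "K2": {"criminel"}}
tokens = ["le", "homme", "voir", "le", "criminel",
          "chapeau", "le", "homme", "voir", "voleur"]

profile = frequency_profile(tokens, bands)
print(profile)               # shares for K1, K2 and NOL
print(beyond_2000(profile))  # everything outside K1 and K2
```

Note that the NOL share is whatever is left over after all lists have been consulted, so its size depends as much on the coverage of the lists as on the rarity of the words.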
The frequency data on which Vocabprofil is based are derived from a written corpus (see below), but Ovtcharov, Cobb and Halter (2006) claim that Vocabprofil can be used to analyse vocabulary in oral data. It would be useful to know to what extent the profiles for oral and written data as produced by Vocabprofil differ, but the authors do not offer such a comparison. They do, however, show that the profiles of advanced Canadian learners of French are not significantly different from those of Beeching's corpus of oral data from French native speakers, which is also freely available on the internet. Vocabprofil differs from the LFP in that the third frequency layer (K3) contains words which occur at a frequency of 2001–3000 in the corpus, French having no equivalent to the Academic Word List (Cobb and Horst, 2001).
For the purposes of the current article it is important to note that Beeching's corpus contains a very high proportion of NOL words, namely 10.87%, and the same is true for the learners in Ovtcharov et al.'s study: their scores for the NOL category range from 4.02% for learners at the lowest level to 8.71% for learners in the top group. As the percentages are so high, it is likely that the NOL category does not only contain exceptionally rare words, but also many words that may be frequent in spoken language but not in written language, and which therefore do not occur in the written corpus on which the frequency profiles are based. Whether or not this is the case in our data will be investigated below.
3. BASIC VOCABULARIES AND WORD FREQUENCIES IN FRENCH
Until recently, the only existing basic vocabulary list was Le français fondamental premier degré (Gougenheim, 1959), which has been widely used as a reference in many studies on vocabulary. Le français fondamental premier degré (FF1) is largely based on an oral corpus. This list is not solely based on frequency: to the most frequent words were added ‘available words’, i.e. common words (e.g. fourchette ‘fork’, chocolat ‘chocolate’, autobus ‘bus’) which did not appear in the corpus because they are topic-specific and therefore have a lower frequency, but were frequently mentioned in additional surveys on specific themes, or were deemed essential for teaching French as a foreign language.
Given the importance of lexical frequency in language processing, a selection of the most frequent words is an obvious alternative to FF1. We chose to use the Corpaix frequency list for oral French (Véronis, 2000) as this frequency list of 4,592 tokens is based on oral data, and it is freely available on the internet (in unlemmatised form). The corpus of one million words from which the list was derived is based on 36 hours of recordings of interviews held in real-life situations, collected over 20 years at the Université de Provence (now part of the DELIC team).
Because the list is drawn from a relatively limited corpus, some contexts are clearly represented more than others. A few examples can illustrate that there are some unexpected results in this list. In the corpus orthographe ‘spelling’ occurs 235 times, and it has a higher frequency than finir ‘to finish’, regarder ‘to look at’, and bactérie ‘bacteria’, which occur 144 times. It is also surprising that fromage ‘cheese’ and pipette ‘pipette’ occur with the same frequency (16) and that fromage ‘cheese’, which appears in FF1, does not feature in the first 1000 words of Corpaix (see appendix for more examples).
In addition, we used the frequency profiles that can be obtained with Vocabprofil. The frequency information on which the programme is based stems from a corpus of 50 million words (Verlinde and Selva, 2001) from two newspapers, Le Monde (France) and Le Soir (French-speaking part of Belgium).
It is doubtful whether information about frequency of lexical items in written texts can be used for an analysis of oral data, because of the discrepancies between spoken and written French, but we felt it was interesting to see whether this tool is able to uncover the differences in lexical richness between our three groups, what percentage of the students' tokens belongs in the category NOL (not-on-lists) and whether the lexical frequency profiles are better able to discriminate between the groups in this study.
4. METHODS
The participants were two groups of British undergraduates studying French as part of a Languages Degree at the University of the West of England (UWE), Bristol (21 at level 1, i.e. first year, and 20 at level 3, i.e. final year), and a control group of 23 native French speakers, also students at UWE. All students undertook the same task under the same conditions: they were asked individually to record their description of two picture stories presented as cartoon strips of six pictures each (Plauen, [1952] 1996). The corpus contains 23,332 tokens (January 2008).
The general language proficiency of each participant was measured by means of a French C-test which provided a useful external criterion against which the different measures could be validated. This test was highly reliable (Cronbach's alpha = 0.96, 6 items).
The data were transcribed and coded in CHAT, lemmatised and analysed using CLAN (MacWhinney, 2000). More details on the informants, the C-test, the transcription and the lemmatisation are given in Tidball and Treffers-Daller (2007).
For this project we operationalised the concept basic vocabulary in three different ways. First of all we used a list based on frequency, availability and judgement (FF1); second, a list based on oral frequency (Corpaix); and third, an intuition-based list (judgements of teachers). We will present each briefly here.
We used FF1 as our first operationalisation, even though this list is rather old and contains several items which relate to rural life in France, such as charrue ‘plough’ and moisson ‘harvest’, which are probably no longer part of the basic vocabularies of speakers living in cities.
Our second list is based on the non-lemmatised oral frequency Corpaix list (Véronis, 2000). This list contains many elements that are typical of spoken data, such as the interjections euh (20,897), ben (2,936) or pff (298). Unfortunately, it also splits words that contain an apostrophe, such as aujourd'hui ‘today’, into two words, giving a frequency of 261 for each part.
Homographs constitute another issue: voler ‘to steal/to rob’ (in our data) also means ‘to fly’ (of a bird or an aeroplane), and vol means ‘theft/robbery’ or ‘flight’. The frequency lists on which Vocabprofil is based (see below) do not differentiate between the two meanings, and the frequency rank of the word is therefore not entirely meaningful. FF1, on the other hand, gives the two different meanings of voler under two different entries. Vocabprofil lists vol in K1 (first thousand words) and voler in K2 (1001–2000), whereas voleur ‘thief/robber’ is NOL (beyond the first 3000 words). The latter is listed in FF1.
As we wanted to compare the results based on the Corpaix list to those based on FF1 (which only contains lemmas), we needed to lemmatise this list. The lemmatisation gave us a frequency list of 2767 types. The methodology followed for the lemmatisation can be found in Tidball and Treffers-Daller (2007).
For our third basic vocabulary list we used the judgement of three experienced tutors of French, two of whom were French native speakers, and one of whom was a bilingual who had grown up with English and French. They were given a list of all 932 types produced by our learners and asked to rank them on a scale of 1 to 7, according to how basic or advanced they judged them to be, with 1 being the most basic and 7 the most advanced. A reliability analysis showed that the raters' judgements were highly consistent with each other (Cronbach's alpha = 0.943, N = 3). Two weeks later we asked the tutors to give us a second judgement of a random sample of 10% of these items, which enabled us to carry out a test-retest reliability analysis. The scores given to each item by individual judges in the first and the second rounds correlated strongly and significantly with each other for the first two judges (r = 0.88 and 0.84) and significantly but less strongly for the third judge (r = 0.56). We also calculated Cohen's kappa to establish to what extent each rater agreed with himself or herself across the two rounds on what constitutes a basic and a non-basic word. Agreement turned out to be substantial for raters one and two (k = 0.624 and 0.601; p < 0.001) but only fair for the third rater (k = 0.252; p < 0.001). The latter was therefore excluded from further calculations.
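For readers unfamiliar with Cohen's kappa, the following minimal sketch shows the standard calculation for one rater's classifications in two rounds. The labels are invented for the illustration, and the way the 1 to 7 ratings are dichotomised into basic and non-basic is not specified here, so the sketch starts from labels that are already dichotomised. The function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(first_round, second_round):
    """Cohen's kappa: agreement between two sets of labels, corrected for chance."""
    if len(first_round) != len(second_round) or not first_round:
        raise ValueError("both rounds must be non-empty and of equal length")
    n = len(first_round)
    # Observed proportion of items given the same label in both rounds.
    observed = sum(a == b for a, b in zip(first_round, second_round)) / n
    # Agreement expected by chance, from the marginal label frequencies.
    c1, c2 = Counter(first_round), Counter(second_round)
    expected = sum(c1[label] * c2[label] for label in set(c1) | set(c2)) / n ** 2
    if expected == 1:
        return 1.0  # only one label was ever used; agreement is trivially perfect
    return (observed - expected) / (1 - expected)


# Invented labels for ten items judged twice by the same (hypothetical) rater.
round_1 = ["basic"] * 5 + ["advanced"] * 5
round_2 = ["basic"] * 4 + ["advanced"] * 5 + ["basic"]

print(round(cohens_kappa(round_1, round_2), 3))
```

In this toy case the rater gives the same label to eight of the ten items, while chance agreement is 0.5, so kappa is (0.8 - 0.5) / (1 - 0.5).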
We defined our basic vocabulary as follows: First, we totalled the scores given by the two remaining raters. Then we selected all words which obtained total scores in the lower quartile (i.e. scores of 4 or less out of a possible 14) for our basic vocabulary list. This gave us a list of 246 basic words.
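The selection step just described can be sketched as follows, assuming that the ratings of each rater are stored as a mapping from word to score. The scores below are invented for the illustration and do not reproduce the raters' actual judgements; the function name is hypothetical.

```python
def select_basic_vocabulary(ratings_a, ratings_b, cutoff=4):
    """Return the words whose summed score from two raters is at or below cutoff.

    With a 1-7 scale per rater, totals range from 2 to 14; a cutoff of 4
    corresponds to the threshold reported in the text.
    """
    return sorted(
        word
        for word in ratings_a
        if word in ratings_b and ratings_a[word] + ratings_b[word] <= cutoff
    )


# Hypothetical ratings (1 = most basic, 7 = most advanced).
rater_1 = {"aller": 1, "chapeau": 2, "voleur": 3, "bousculer": 6}
rater_2 = {"aller": 1, "chapeau": 2, "voleur": 2, "bousculer": 7}

print(select_basic_vocabulary(rater_1, rater_2))
```

Words with a total above the cutoff, such as the two higher-scoring items in the toy data, fall outside the basic list and therefore count as advanced types in the AG.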
In order to make a comparison between the different operationalisations possible, we used the Corpaix frequency list to create three different basic vocabulary lists: the first one (Corpaix 246) contained the same number of words as the judges' file (246 words), the second one (Corpaix 1378) contained the same number of words as FF1 (1378), and the third (Corpaix 2000) corresponded to Laufer's Beyond 2000 measure (i.e. it contained the 2000 most frequent words in the list).
5. RESULTS
In this section we first present the results of the C-test, to show how the language proficiency of the three groups differs on a measure that is independent of the story telling task. We then discuss to what extent the different operationalisations of the concept basic vocabulary overlap (section 5.2) and in section 5.3 we will present the results of the analysis of lexical sophistication using different measures based on those basic vocabularies.
5.1 The C-test
The C-test results demonstrate that there are significant differences between the French proficiency of the two learner groups and the native speakers (ANOVA, F (df 2,61) = 105.371, p < 0.001). The Tukey post hoc test shows that all groups are significantly different from each other. This information is important, as one would expect that measures of vocabulary richness should be able to demonstrate the existence of such a clear difference between the learner groups and between learners and native speakers. The power of the C-test to discriminate between groups turned out to be very high, as can be seen in the Eta squared of .776 (see section 5.3 for more details on Eta squared).
5.2 The overlaps between the basic vocabularies
Before going into a discussion of the different measurements of lexical sophistication, it is interesting to see to what extent the different basic vocabularies overlap. No two lists, even those drawn from very large corpora of similar origin, will overlap completely. Comparisons of the first 1000 words of three existing frequency lists derived from different large French literary corpora, of which the Trésor de la langue Française (TLF) (INALF, 1971) is one, showed that they had 80% of words in common, whereas le Français fondamental had 65% in common with TLF (Picoche, 1993).
We have used CLAN to compare the content of FF1 with two other operationalisations: the Corpaix oral frequency list and the basic vocabulary list which is based on the judgements of the teachers. As FF1 contains 1378 words, we compared the first 1378 words of the Corpaix oral frequency list with FF1, and found that 725 words (52.6%) of FF1 are also found in the first 1378 words of the Corpaix frequency list. The judges' file shares 236 words (95.9% of the 246 words it contains) with FF1.
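The overlap figures reported above are simple set intersections expressed as a percentage of one of the two lists. The sketch below is not the CLAN procedure we used; it is a minimal illustration on toy lists, followed by a check of the arithmetic behind the two percentages. The function name is hypothetical.

```python
def overlap(list_a, list_b):
    """Return (number of shared words, shared words as a percentage of list_a)."""
    words_a, words_b = set(list_a), set(list_b)
    if not words_a:
        raise ValueError("list_a must not be empty")
    shared = words_a & words_b
    return len(shared), 100 * len(shared) / len(words_a)


# Toy lists, for illustration only.
toy_ff1 = ["aller", "venir", "chapeau", "fromage"]
toy_corpaix = ["aller", "venir", "euh", "ben"]
print(overlap(toy_ff1, toy_corpaix))

# Arithmetic check of the reported percentages.
ff1_corpaix_pct = round(100 * 725 / 1378, 1)  # FF1 words also in Corpaix 1378
judges_ff1_pct = round(100 * 236 / 246, 1)    # judges' words also in FF1
print(ff1_corpaix_pct, judges_ff1_pct)
```

The choice of denominator matters: the first percentage is taken over the 1378 words of FF1, the second over the 246 words of the judges' file.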
Subsequently, we entered all different operationalisations into Vocabprofil, to find out what percentage of the words in FF1, the Corpaix list and the judges' file belongs in the different frequency bands distinguished by Vocabprofil. The results of these analyses can be found in Table 1. It shows that FF1 and the first 1378 words of Corpaix have roughly similar profiles, with approximately 60% K1 words, whilst the first 2000 words of Corpaix contain almost 50% K1 words. The judges' file and the first 246 words in Corpaix contain a far larger proportion of K1 words: 89% and 94% respectively.
Table 1. Percentage of words in each Vocabprofil frequency band in each corpus
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20160713230239-11448-mediumThumb-S0959269508003463_tab1.jpg?pub-status=live)
The percentage of words that do not appear in any list is very high for FF1 (21.9%) and Corpaix (14.44%) and this is probably due to the fact that Corpaix contains many elements that are frequent in spoken language but which are not found in the written corpora.
Examples of words from our corpus which are NOL, apart from the many interjections mentioned in section 4, are nouns such as chapeau ‘hat’ and voleur ‘thief’, an adjective such as gentil ‘nice’ and verbs such as nager ‘to swim’ and repartir ‘to set off again’. The problem, from our perspective, is that Vocabprofil puts these very common words (all of which are in FF1) in the same category as bousculer ‘knock down’ and canne ‘walking stick’, which are highly specific and very infrequent, and which are not found in FF1 or Corpaix. This illustrates the difficulty of using a written frequency list for the analysis of oral data. It is very unlikely that these words would all be classified in the same category if Vocabprofil were based on an oral frequency list.
5.3 Measures of vocabulary richness
Table 2 gives an overview of the results obtained for our different measures of vocabulary richness, including list-free measures. All measures show that there are significant differences between the groups in the vocabulary used. The smallest basic vocabularies, defined by the judges or the Corpaix 246 list, were successful at differentiating between all groups. The AG (judges) and the AG (Corpaix 246) yielded significant differences between the two learner groups (level 1 and level 3), but the AG (FF1), the AG (Corpaix 1378) and the AG (Corpaix 2000) did not.
Table 2. Mean scores on different measures of vocabulary richness and results of a one-way ANOVA/Tukey post hoc
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20160713210311-48939-mediumThumb-S0959269508003463_tab2.jpg?pub-status=live)
* = p < 0.05; ** = p < 0.001.
In order to find out which measure is best able to discriminate between groups, we calculated Eta squared. Eta squared is the proportion of the variance in the dependent variable that can be accounted for by the independent variable (i.e. group membership in this case). As Table 2 shows, the AG (judges) obtains a higher Eta squared than the Index of Guiraud. The Eta squared values for the AG (judges) and D are virtually identical. The AG (judges) compares very favourably with the other operationalisations of the AG, including its closest ally, the AG (Corpaix 246). For these data, this indicates that using teacher judgement is a better way to obtain a basic vocabulary list than using frequency data.
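Eta squared for a one-way design can be computed either from the raw scores (between-groups sum of squares over total sum of squares) or, equivalently, from the F statistic and its degrees of freedom. The sketch below shows both; the three groups of scores are invented, whereas the F value and degrees of freedom in the second call are those reported for the C-test in section 5.1. The function names are hypothetical.

```python
def eta_squared(groups):
    """Eta squared from raw scores: SS_between / SS_total (one-way design)."""
    scores = [x for group in groups for x in group]
    grand_mean = sum(scores) / len(scores)
    ss_total = sum((x - grand_mean) ** 2 for x in scores)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    return ss_between / ss_total


def eta_squared_from_f(f_value, df_between, df_within):
    """Eta squared recovered from a one-way ANOVA F statistic."""
    return (f_value * df_between) / (f_value * df_between + df_within)


# Invented scores for three groups, for illustration only.
toy_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(eta_squared(toy_groups))  # 54 / 60

# C-test: F (df 2,61) = 105.371, as reported in section 5.1.
print(round(eta_squared_from_f(105.371, 2, 61), 3))
```

The second call recovers the value of .776 reported for the C-test, which offers a quick consistency check between a reported F statistic and its effect size.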
We also submitted the data to Vocabprofil, to find out to what extent the frequency layers as distinguished in Vocabprofil could help to distinguish our three groups (see Table 3).
Table 3. Percentage of the tokens belonging to different frequency layers (Vocabprofil) for all three groups (One-way ANOVA/Tukey post hoc)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20160713230239-12310-mediumThumb-S0959269508003463_tab3.jpg?pub-status=live)
The results given in Table 3 show that, as one might expect, the native speakers make less use of words belonging to the highest frequency layer than the level 3 learners, and the latter use fewer words from the highest frequency layer than level 1 learners. The percentages are comparable to those of Beeching's corpus, although a smaller proportion of the words used by Beeching's informants belonged to the K1 category (83.99%), and more of their words are NOL (10.87%). This is probably due to the fact that our informants did not produce free speech but all narrated the same two stories, which limits the choices for the informants. Of the words used by Beeching's informants, 3.94% belonged to the K2 layer and 1.2% to the K3 layer, and these percentages are very similar to those for our informants.
All three groups differ significantly from each other in their use of K1 words and also in their use of NOL words (see ANOVA/Tukey post hoc in Table 3). It is interesting that the groups do not differ from each other with respect to the K2 and K3 layers of Vocabprofil. While level 3 students use more words from the K2 layer than level 1 students, these differences are too small to reach significance. As for the K3 layer, the level 3 students seem to use even fewer words of this frequency layer than the level 1 students.
Table 3 shows the effect size of the measurement of the vocabulary used at each frequency layer distinguished by Vocabprofil. It demonstrates that the use of words which are not on the frequency lists (NOL) is the most powerful indicator of the differences between the groups, but the Eta squared values of the scores obtained on the basis of Vocabprofil are clearly lower than those obtained with the help of the basic vocabulary lists (see Table 2).
The Eta squared obtained for the C-test outshines all the results found for the lexical richness measures, as the C-test obtained an Eta squared of .776. A simple C-test may therefore well be a more effective way to distinguish language proficiency levels among learner groups.
The use of cognates by learners also conveys important information that needs to be taken into account in analyses of vocabulary richness. In our data, for example, speakers from all groups describe the thief in the second story most often as a voleur, but the learners' second most popular word for this character is the cognate criminel ‘criminal’, which is not used at all by the native speakers, even though it can be used as a noun in standard French. This word belongs to the K2 layer in Vocabprofil, and it is not listed in the Corpaix oral frequency list at all. Students who use this word would get higher scores on Vocabprofil or on the AG (Corpaix) as these measures are exclusively based on frequency (see Tables 4 and 5). The use of cognates is however not necessarily an indication that the speaker possesses a rich vocabulary. Rather, it shows that the speaker knows how to strategically exploit similarities between languages in telling a story. How we can account for the strategic use of cognates in the context of studies on vocabulary richness therefore deserves to be investigated further.
Table 4. Different keywords from the stories in Vocabprofil, FF1, Corpaix and the teachers' judgements
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20160713230239-63532-mediumThumb-S0959269508003463_tab4.jpg?pub-status=live)
Table 5. Examples of keywords from the stories and their allocation to different frequency layers in Vocabprofil
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20160713230239-67816-mediumThumb-S0959269508003463_tab5.jpg?pub-status=live)
6. DISCUSSION AND CONCLUSION
The results presented in section 5 clearly show that the power of a list-based measure such as AG depends on the way in which researchers operationalise the concept of basic vocabulary (see also Daller and Xue, 2007, who make a similar point). We found that an operationalisation based on teacher judgements was more powerful than different operationalisations based on frequency, in that the AG (judges) was better able to discriminate between the groups. It is possible, though, that the performance of the AG based on frequency data can be improved if alternative frequency lists are used. The Corpaix frequency list may not have been ideal for the current purposes, as it was drawn from a relatively small corpus. Alternative frequency lists based on the CRFP or Lexique could be used in future studies on this topic.
The results of the Vocabprofil analyses show that the students make better use of a range of relatively easy words. The learners used more NOL words at level 3, such as gentil ‘nice’ and chapeau ‘hat’, which are common in spoken language but happen not to occur in the written corpora on which Vocabprofil is based. Our results confirm those of Horst and Collins (2006), whose learners likewise did not use a higher number of low frequency words after 400 hours of tuition, but rather a larger variety of high frequency words belonging to the K1 layer. As many researchers have found that the percentage of K2 words (and beyond) is very low in learner language, it may be important to differentiate further between frequency layers within the K1 band. In this paper we have shown that using a small basic vocabulary (n = 246 words) in calculations of the AG works better than using a large basic vocabulary (n = 1378 or n = 2000). In Tidball and Treffers-Daller (in preparation) we illustrate this further by focusing on motion verbs. While the level 1 learners prefer to use basic deictic verbs (aller ‘go’ and venir ‘come’) over path verbs such as entrer ‘to enter’, all of which belong in the K1 layer of the vocabulary frequency lists, level 3 learners increasingly use entrer to express the same motion event, which shows they have progressed in comparison with the level 1 learners. For these learners, the percentage of low frequency words is therefore possibly a less suitable indicator of the differences in the lexical richness of their speech.
The analysis of our data with Vocabprofil provided interesting information about the frequency layers of the vocabulary used by our learners, but the large number of words in the NOL category is worrying in that this category contains both very rare words such as bousculer ‘to knock over’ and highly frequent items which are characteristic of spoken rather than written language. It would therefore be very useful for the research community if a new version of Vocabprofil could be created which is based on frequency data for oral language.
Finally, a preliminary analysis of the use of criminel by the learners indicates that cognates play an important role in L2 vocabulary acquisition, which confirms the results of Laufer and Paribakht (1998) and Horst and Collins (2006). The strategic use learners make of cognates is an area that deserves further attention in future studies of vocabulary richness.