1. Introduction
Language learners benefit from extensive reading, beyond their textbooks, ideally with text that is authentic in the linguistic, cultural, and functional sense (Buendgens-Kosten, Reference Buendgens-Kosten2013). Vocabulary difficulty is an important criterion in identifying suitable reading materials. A text with an appropriate proportion of new vocabulary for an intermediate learner might bring little benefit to an advanced learner who is already familiar with most words. Yet the same text might overwhelm a beginner because of the excessive percentage of unfamiliar words.
Learners would thus benefit from automatic recommendation of reading materials whose vocabulary complexity is suitable for them. Most conventional tools adopt the “leveling” or “graded” approach, which employs a common scale to match users and documents at the same or a similar level, such as school grades (Collins-Thompson, Bennett, White, de la Chica & Sontag Reference Collins-Thompson, Bennett, White, de la Chica, Sontag, Berendt, de Vries, Fan, Macdonald, Ounis and Ruthven2011). On the documents, the system performs automatic readability assessment to estimate their difficulty level. On the users, the system performs automatic proficiency assessment to determine their proficiency level. Although this approach can be effective, it lacks personalization and transparency.
A personalized text recommendation system should cater for individual differences in language learning patterns and reading preferences. Using levels to quantify language proficiency can be problematic for several reasons. First, the scale may not be familiar to users. Second, discrete levels do not allow fine-grained adjustment of the preferred difficulty in reading materials. Furthermore, a common scale assumes a standard order of language acquisition. With no control over the scale’s definition, users of different backgrounds cannot adapt the levels to suit their individual learning patterns.
A transparent text recommendation system should facilitate scrutiny of its inference process. To justify the recommendation of a document, it should provide supporting information other than its estimated difficulty level. Unfortunately, most machine learning algorithms for automatic readability and proficiency assessments are not designed for human interpretation, resulting in a “black-box” leveling approach that hampers transparency and user trust.
To enhance personalization and transparency, this article proposes an editable, open learner model (OLM) (Bull & Kay, Reference Bull, Kay, Nkambou, Bourdeau and Mizoguchi2010) and a text recommendation metric based on new-word density (NWD) – that is, the percentage of words in the document that are new for the user. In terms of personalization, users can set their target NWD to calibrate the desired difficulty of reading materials and update the OLM to improve NWD estimation accuracy. In terms of transparency, users can scrutinize recommendation decisions by examining which words are predicted by the OLM to be new vocabulary.
2. Scope and contributions
A text recommendation system searches for the most suitable reading materials for language learners. The suitability of a document is typically measured with two factors: its difficulty, with respect to the user’s language proficiency, and its thematic area, with respect to the user’s reading interests (Heilman, Collins-Thompson, Callan & Eskenazi, Reference Heilman, Collins-Thompson, Callan and Eskenazi2010).
This article focuses on the first factor. The “difficulty” of a document depends on many aspects, ranging from the vocabulary and grammatical knowledge of the user, to pragmatics, connotation, and semantic complexity of the document. Our scope will be limited to vocabulary knowledge. According to some studies in second language acquisition, vocabulary knowledge and grammatical knowledge have comparable effects on reading comprehension (Shiotsu & Weir, Reference Shiotsu and Weir2007; van Gelderen, Schoonen, Stoel, de Glopper & Hulstijn, Reference van Gelderen, Schoonen, Stoel, de Glopper and Hulstijn2007). However, others suggested vocabulary knowledge to be the better predictor (Brisbois, Reference Brisbois1995; Haynes & Carr, Reference Haynes, Carr, Carr and Levy1990; Mecartty, Reference Mecartty2000) and reported a significant correlation between vocabulary difficulty and text difficulty (François & Fairon, Reference François and Fairon2012; Heilman, Collins-Thompson, Callan & Eskenazi, Reference Heilman, Collins-Thompson, Callan, Eskenazi, Sidner, Schultz, Stone and Zhai2007; Reynolds, Reference Reynolds, Tetreault, Burstein, Leacock and Yannakoudakis2016). While a number of computer-assisted language learning systems have addressed grammatical knowledge (e.g. Bull, Pain & Brna, Reference Bull, Pain and Brna1995; Shahrour & Bull, Reference Shahrour, Bull, Nejdl, Kay, Pu and Herder2008; Vajjala & Meurers, Reference Vajjala and Meurers2012), automatic syntactic parsing, semantic complexity, and pragmatics analysis do not yet reliably achieve high accuracy. We will therefore follow most existing text recommendation systems in focusing on vocabulary (e.g. Brown & Eskenazi, Reference Brown and Eskenazi2004; Hsieh, Wang, Su & Lee, Reference Hsieh, Wang, Su and Lee2012; Hsu, Hwang & Chang, Reference Hsu, Hwang and Chang2013; Liang & Song, Reference Liang and Song2009; Miltsakaki, Reference Miltsakaki and Kreutel2009; Wu, Reference Wu2016).
Within this scope, we aim to make two contributions. First, we propose a text recommendation framework based on NWD and an editable OLM (section 3), and show that it can offer greater personalization and transparency than the conventional approach (section 4). Second, we demonstrate the feasibility of user interaction with the model and positive user attitude toward the framework. A user study evaluates the ability of new users to issue search queries and to manipulate search parameters to adjust text difficulty (section 5), and a user survey measures satisfaction on the design and implementation of the model (section 6).
3. Design considerations
According to the “i+1” concept (Krashen, Reference Krashen1981), the most suitable materials for language learning lie within the proximal zone (“+1”) of the learner’s proficiency level (“i”). To determine whether a document lies within this zone, a text recommendation system must identify what content in the document can be understood by the user (“i”) and what is new to the user (“+1”).
A common approach is to label the user and all candidate documents on a scale that measures both user proficiency and text difficulty. The system then recommends documents whose level or score on the scale is similar to that of the user. A variety of scales, both discrete and continuous, have been used. Discrete scales, also referred to as “levels” or “stages,” may involve school grades (e.g. Collins-Thompson & Callan, Reference Collins-Thompson and Callan2005); public examination grades (e.g. Hsu et al., Reference Hsu, Hwang and Chang2013), such as those in the General English Proficiency Test (Roever & Pan, Reference Roever and Pan2008); stages in graded readers (Hill, Reference Hill2008); and categories in language assessment schemes, such as the six categories in the Common European Framework of Reference for Languages. Continuous scales, such as Lexile measures (Lennon & Burdick, Reference Lennon and Burdick2014) and readability formulas (e.g. Anderson, Reference Anderson1983; Coleman & Liau, Reference Coleman and Liau1975), are more fine-grained but still not transparent.
Conventional systems make recommendations according to the estimated grade or score, but do not typically reveal the basis of the estimation, for instance, by pointing to specific words or sentences in the document or relevant characteristics in the learner profile. In the rest of the article, we will use the term “graded approach” to refer to this approach. We argue that it suffers from two main shortcomings – the lack of personalization and of transparency – and propose a novel approach that mitigates them. Table 1 summarizes the differences between the graded approach and our proposal.
3.1 Personalization
Personalization allows users to modify the system’s inference. One way to increase user control is through OLM, which can be broadly defined as “learner models that can be viewed or accessed in some way by the learner, or by other users” (Bull & Kay, Reference Bull, Kay, Nkambou, Bourdeau and Mizoguchi2010: 301). These models, which promote metacognitive activities, can facilitate learner reflection, planning, and self-monitoring. They also encourage learner independence by helping learners take control and responsibility over their learning (Bull & Kay, Reference Bull and Kay2007). An editable learner model is an OLM that allows the learner to change its content, thus increasing the accuracy of the learner model data and the quality of adaptation (Bull & Kay, Reference Bull, Kay, Nkambou, Bourdeau and Mizoguchi2010). For text recommendation, such a model should enable users to adjust its estimation of their proficiency and of document difficulty.
In principle, the graded approach can offer some degree of personalization by allowing users to identify their grade and to modify the documents’ grades. In practice, ambiguities in grade definitions often make proficiency self-assessment difficult for users. It is also unreasonable to ask users to review the grades for a large number of documents, which would in effect defeat the purpose of automatic text recommendation.
Further, the personalization is limited because grade definitions are fixed. All scales assume a standard order of vocabulary acquisition. Many are underpinned by vocabulary lists; for example, the list of 3,000 “familiar” words in the Dale–Chall readability formula or those from the widely recognized scheme Hanyu Shuiping Kaoshi (HSK), adopted by many tools for learning Chinese as a foreign language (e.g. Jin, Lu, Lin & Li, Reference Jin, Lu, Lin and Li2018; Liang & Song, Reference Liang and Song2009). However, learners may deviate from the standard because of their linguistic and professional backgrounds. For instance, a second language (L2) word is likely acquired at an earlier stage by a learner whose first language (L1) has a cognate word. A document full of legal jargon should not be recommended for general users, but may serve as excellent reading material for those with legal training. Yet users cannot change the grade definitions to reflect their learning patterns.
3.1.1 Learner model
We propose an editable learner model that directly estimates a user’s knowledge of each L2 word; more precisely, the model predicts each L2 word to be either “known” or “unknown” to the user. This design increases learner control in two ways. First, the model has the flexibility to capture individual patterns in vocabulary acquisition without making any assumption on acquisition order. Second, users can edit the model with confidence, as it is easier for them to judge whether a word is new vocabulary than to grade themselves on an unfamiliar proficiency scale.
3.1.2 Recommendation metric
We quantify text difficulty by NWD (Holley, Reference Holley1973). The NWD of a text is defined as the number of all unknown words in the text divided by the total number of words. This metric is user-centered in the sense that it quantifies text difficulty from the user’s point of view: the same text may be labeled as 10% NWD for a high-proficiency user, but 30% NWD for a lower-proficiency user. In contrast, the graded approach is not user-centered as it labels a text with a fixed grade on a scale. The onus is placed on users to familiarize themselves with the scale and to interpret the grade in relation to their linguistic competence. The NWD metric also allows fine-grained calibration of the difficulty of reading materials. One can, for example, request 5% NWD or below in order to understand the text without glosses (Hu & Nation, Reference Hu and Nation2000; Schmitt, Jiang & Grabe, Reference Schmitt, Jiang and Grabe2011), or above 10% NWD to simulate the difficulty in a typical textbook (Liang & Song, Reference Liang and Song2009).
3.2 Transparency
A transparent system explains its decisions to users. Transparency promotes user trust (Bull & Kay, Reference Bull and Kay2007) and leads to significant benefits in language learning and other subjects (e.g. Mitrovic & Martin, Reference Mitrovic and Martin2007; Shahrour & Bull, Reference Shahrour, Bull, Nejdl, Kay, Pu and Herder2008). In text recommendation, the system needs to communicate why it deems a document suitable for the user.
The graded approach justifies its recommendation decision with a grade, without identifying the aspects that make the document difficult for the user. Further, the meaning of a grade tends to be opaque. Automatic proficiency and readability assessments can potentially remove the need for users to interpret the grade (e.g. Hsieh et al., Reference Hsieh, Wang, Su and Lee2012). Most assessment models use machine learning algorithms, which typically score a document by combining weighted features based on statistical language models and readability formula scores, parse tree structures, type-token ratio, sentence and word length, etc. (e.g. Collins-Thompson & Callan, Reference Collins-Thompson and Callan2005; Francois & Fairon, Reference François and Fairon2012; Pitler & Nenkova, Reference Pitler and Nenkova2008; Sung, Lin, Dyson, Chang & Chen, Reference Sung, Lin, Dyson, Chang and Chen2015; Wu, Reference Wu2016). These models essentially operate in a “black-box” fashion, as the internal working of their algorithms cannot be easily summarized for human consumption.
NWD can clearly specify which words in the document are predicted to be known or unknown to the user, according to the learner model. The explicitness enables scrutiny of recommendation decisions: users can easily verify the estimated NWD of a document by viewing the learner model and, if necessary, improve the estimation by correcting the model. Furthermore, NWD facilitates rapid development of multilingual text recommendation systems, since its computation is similar across languages and does not require handcrafted vocabulary lists.
4. System architecture
A personalized text recommendation system can be viewed as an instance of an adaptive system, which identifies learner characteristics and adjusts its content accordingly to improve learning efficiency (Brusilovsky, Reference Brusilovsky, Durlach and Lesgold2012; Oxman & Wong, Reference Oxman and Wong2014). An adaptive system is typically analyzed in terms of three models: the domain model, which defines what is to be learned; the learner model, which captures user characteristics and preferences; and the adaptation model, which adapts the content with respect to the learner model (Brusilovsky, Reference Brusilovsky, Durlach and Lesgold2012; Knutov, De Bra & Pechenizkiy, Reference Knutov, De Bra and Pechenizkiy2009). We present our proposed method within this framework and explain how it enhances personalization and transparency over existing approaches.
4.1 Domain model
The domain model defines the knowledge space as a set of knowledge elements – that is, the concepts to be learned. The system must label each pedagogical item with one or more of these elements, in a process known as “knowledge indexing.”
For the graded approach, the knowledge element of each document can be viewed as its grade on the difficulty scale. Knowledge indexing thus requires grade estimation for each document through automatic readability assessment, which is prone to opacity (section 3.2). In contrast, our proposed method considers each L2 word to be a knowledge element. Thus, the knowledge elements of a document are simply the words it contains.
4.2 Learner model
Our learner model parameters are concerned with three user characteristics, namely the user’s vocabulary proficiency; the preferred learning pace, defined by the target NWD; and reading interests (Table 2). The user can set all parameters as part of a personalized search query. Given our research focus (section 2), the rest of this section will concentrate on vocabulary proficiency and learning pace.
The NWD of a document is based on the user’s “vocabulary set” – that is, the set of words known to the user. The learner model estimates this set by predicting each L2 word to be either “known” or “unknown” to the user. Although lexical knowledge can also be expressed on a five-point scale (Ehara, Sato, Oiwa & Nakagawa, Reference Ehara, Sato, Oiwa, Nakagawa, Kay and Boitet2012) or with a real-number score (Yimam et al., Reference Yimam, Biemann, Malmasi, Paetzold, Specia, Štajner, Tack, Zampieri, Tetreault, Burstein, Kochmar, Leacock and Yannakoudakis2018), the binary distinction enables a more transparent interpretation of NWD.
The learner model is thus an instance of the “overlay” model, which characterizes the user’s knowledge level for each knowledge element (Brusilovsky, Reference Brusilovsky, Durlach and Lesgold2012). For an initial estimation of the learner model, new users rate their knowledge of 50 words as “known” or “unknown” (Figure 2). Using this self-assessment as training data, the system performs Complex Word Identification (Yimam et al., Reference Yimam, Biemann, Malmasi, Paetzold, Specia, Štajner, Tack, Zampieri, Tetreault, Burstein, Kochmar, Leacock and Yannakoudakis2018) to construct the user’s vocabulary set. Our model, which automatically predicts each L2 word as either known or unknown to the user with an SVM classifier, performs at 78.0% accuracy (Lee & Yeung, Reference Lee and Yeung2018). New users may also opt out of self-assessment and manually choose a rough vocabulary set size. A size of N means the user knows the N most frequent words (up to N = 40,000), according to word frequency in Chinese Wikipedia.
As shown in Figure 1, the reading environment uses colors to visualize the learner model, displaying known words in black, new words in orange, and search keywords in red. If users see a known word wrongly predicted to be new vocabulary in their reading material, or vice versa, they can modify the word’s status in the editable learner model by tapping on the word.
Learning pace refers to users’ preferred difficulty for their reading materials (Table 2). In comparison to the graded approach, which requires users to dial up or down a grade, the target NWD parameter enables more fine-grained customization. Those who desire a faster pace can set a higher target percentage to retrieve more challenging materials, and those who prefer more leisurely or easy reading can set a lower target.
The maximum NWD is set by default at 20%, which means that documents in search results have at most 20% NWD. The relatively high default value is intended to encourage users to tackle challenging materials. Users can change the maximum NWD by adjusting the percentage with a slider (Figure 3).
4.3 Adaptation model
Given the learner model’s diagnosis, the adaptation model adapts the pedagogical material to optimize learning outcomes (Knutov et al., Reference Knutov, De Bra and Pechenizkiy2009). Content presentation, for example, can be adapted to highlight particular items. Our reading environment draws users’ attention to the search keywords (highlighted in red) and the new vocabulary as predicted by the learner model (in orange). They can tap on these words to read the English gloss, which was taken from the CC-CEDICT project (https://cc-cedict.org), or update their status in the learner model (Figure 1).
Navigation prioritizes the most relevant learning items for the user according to the learner model. Text recommendation systems rank documents according to their suitability for language learning. The ranking metric can be a readability formula or a weighted score based on factors such as text length, reading grade level, and topic. Our app ranks documents in descending order in terms of NWD, based on word segmentation by the Stanford Chinese parser (Manning et al., Reference Manning, Surdeanu, Bauer, Finkel, Bethard, McClosky, Bontcheva and Zhu2014). Search results include only documents whose NWD is below the user-specified maximum (Figure 4).
5. User evaluation
The use of an editable learner model leads to more complicated user interaction, which may require a steeper learning curve. We conducted a user study with the twin goal of evaluating both the usability and performance of our proposed text recommendation method. For the first goal, we measure new users’ ability to exploit personalized search by issuing complex queries involving multiple parameters. For the second goal, we measure whether these parameter variations lead to intended changes in the difficulty of the recommended documents.
5.1 Set-up
Text recommendation systems may be used by both language learners for personal reading and language teachers for class preparation. We therefore recruited both native and non-native speakers of Chinese for our evaluation, including 13 learners of Chinese as a foreign language (CFL) and five native speakers. The CFL learners represented a wide range of language backgrounds, with English, French, Indonesian, Korean, and Thai as L1, and also a variety of proficiency levels, with years of studies ranging from 1 to 15 years. The native speakers were undergraduate students at City University of Hong Kong.
After reading a manual about the app, the subjects were asked to independently perform a number of queries involving three parameters (Table 3): vocabulary size, NWD, and search keywords. Query #1, for example, involved a 40,000-word vocabulary set and 20% target NWD, whereas Query #4 required a search keyword.
Note. NWD = new-word density.
The app performed the document search in a text database containing 8,190 Chinese short essays downloaded from duanmeiwen.com. Two versions of the app, one in traditional characters and another in simplified characters, were made available to the subjects. After each query, the subjects reported the number of retrieved documents as shown on the search result page (Figure 4) and read the top five documents. Then, they rated the difficulty of each document on a 5-point Likert scale, from “very easy” (score 1) to “very difficult” (score 5).Footnote 1 For the fifth query, rather than using the app, the subjects performed the search on duanmeiwen.com.
5.2 Query execution accuracy
We first report on the subjects’ ability to correctly execute the queries. To determine whether the subjects issued the intended queries, we checked whether they reported the expected number of search results.
All 18 subjects correctly issued queries #2 and #3. For Query #1, one subject reported a wrong number of search results, likely due to an incorrect parameter value. For Query #4, 16 subjects reported the correct number, one subject reported an unexpected result, and one subject failed to report any. These results suggest that brief self-training is sufficient for most new users to use the search interface effectively.
5.3 Difficulty of recommended documents
We now analyze the subjects’ perception of text difficulty with respect to different parameter values for vocabulary size and target NWD, and in comparison with web search. Table 3 presents the average rating given by the subjects to the top five documents recommended in each query.
5.3.1 Effect of vocabulary size
The five documents recommended in Query #1 received an average difficulty rating of 3.55. The 40,000-word vocabulary set included most common words. The 20% target NWD therefore led to a substantial number of rare words, which likely contributed to the relatively high difficulty rating. Query #2 maintained the target NWD at 20% but reduced the vocabulary size to 4,000 words, which had the effect of raising the NWD of all documents. As a result, those with more rare words were more likely to exceed 20% NWD and be rejected. The average difficulty rating of the retrieved documents decreasedFootnote 2 to 2.44, confirming the intended effect of the vocabulary size parameter on text difficulty.
5.3.2 Effect of target NWD
Query #3 kept the vocabulary size the same as Query #2, at 4,000 words, but lowered the target NWD from 20% to 15%. This change led to a decrease in the average difficulty ratingFootnote 3 from 2.44 to 2.27. The subjects perceived the recommended documents to be easier even with the relatively small drop (5%) in NWD, suggesting the effectiveness of NWD in fine-grained calibration of text difficulty.
5.3.3 Comparison with web search
The last two queries contrasted our text recommendation method (Query #4) with a search on duanmeiwen.com (Query #5), the source of the app’s text database. Thus, both queries accessed similar pools of candidate documents and utilized the same search keyword. The only major difference was that text difficulty was not considered in Query #5, while Query #4 capped NWD at 20% for a vocabulary size of 4,000. The subjects gave a lower average scoreFootnote 4 to the materials returned by the app (2.28) than to those returned by web query (2.78). The lower rating suggests the ability of our text recommendation method to identify easier reading material, compared to general web search.
6. User survey
To investigate users’ opinion on the proposed text recommendation method, we conducted a survey that solicited both their attitude toward design issues in general and their experience with our app implementation in particular.
We administered this survey to the same 18 subjects as in the user evaluation (section 5.1). As shown in Table 4, the survey consisted of two parts, with the first concerned with general design issues and the second with the app implementation. Each part contained five statements on search methodology, the editable learner model, and the initial estimation of the model. The subjects rated each statement on a 5-point Likert scale, corresponding to “strongly disagree” (score 1), “disagree” (2), “neutral” (3), “agree” (4), and “strongly agree” (5).
6.1 General design issues
6.1.1 Search methodology
In the first part of the survey (left column in Table 4), the first two statements were concerned with recommendation methodology: the use of NWD as the ranking metric (1a) and the use of maximum NWD as the search criterion (2a). Consistent with the correlation between NWD and text difficulty observed in section 5.3, 16 out of 18 subjects “agreed” or “strongly agreed” with the use of NWD for ranking documents (score 4.28). Further, 17 out of 18 subjects “agreed” or “strongly agreed” with the use of maximum NWD as the search constraint (score 4.39). Hence, the subjects held favorable views on the two fundamental features of the proposed text recommendation method.
6.1.2 Editable learner model
Our design assumes users’ willingness to interact with the editable learner model. Testing this assumption, statement (3a) revealed that a majority of the subjects “strongly agreed” that users should be able to manually update the model on their vocabulary knowledge (score 4.67).
6.1.3 Initial estimation of learner model
The last two statements consulted the subjects on the automatic method (4a) and manual method (5a) for initializing the learner model for a new user. With eight out of 18 subjects expressing reservation – either “neutral” or “disagree” – about automatic prediction of vocabulary knowledge, statement (4a) received the lowest score among the statements in the survey (score 3.67). This ambivalence may be attributable to system mistakes in automatic complex word identification (section 4.2), which likely needs improved accuracy to gain more acceptance among subjects. The subjects gave slightly higher scores (score 4.28) to statement (5a), indicating an overall preference to manual estimation of their vocabulary size.
6.2 App implementation
6.2.1 Search interface
In the second part of the survey (right column in Table 4), the first two statements polled the subjects on their experience with the search interface. In statement (1b), 17 out of 18 subjects “agreed” or “strongly agreed” that it was easy to work with the NWD-based search result (Figure 4) (score 4.56). Similarly, in statement (2b), 17 out of 18 subjects “agreed” or “strongly agreed” that it was easy to set the parameter for maximum NWD (score 4.67) (Figure 3). These relatively high scores, consistent with the high success rate in query execution (section 5.2), suggest the usability of the search interface.
6.2.2 Learner model update
The next statement (3b) gauged user opinion on interaction with the editable learner model – that is, tapping on a word to change its status from “unknown” to “known,” or vice versa (Figure 1). This statement received the highest score (4.71) of all statements in the survey. As user updates are critical for the accurate NWD estimation, this result is promising for applying an editable learner model for text recommendation.
6.2.3 Initial estimation of learner model
The last two statements evaluated user experience with the automatic (4b) and manual (5b) methods for model initialization. Statement (4b) showed that most subjects were willing (score 4.61) to complete the 50-word vocabulary self-assessment (Figure 2). In comparison, in statement (5b), the subjects found it relatively difficult (score 4.28) to manually estimate their vocabulary size. All in all, while the subjects preferred the manual method in principle (cf. scores for statements (4a) and (5a)), they found the automatic method easier in practice.
7. Limitations
Results from the user evaluation (section 5) and survey (section 6) should be interpreted with a number of limitations in mind. First, the scope of the learner model is limited to vocabulary knowledge (section 2). An expanded model incorporating syntactic and semantic complexity would offer a more comprehensive evaluation of the effects of personalization and transparency.
Second, the learner model does not cover multiword expressions. Since language often comes in chunks or semi-preconstructed phrases (Sinclair, Reference Sinclair1991), a model that accounts for multiword expressions can yield more accurate NWD estimation. It would, however, complicate user interaction due to the larger number of parameters. The learner model also does not distinguish between different senses of a word. Modeling word senses can improve NWD estimation accuracy, but would require development of a sense inventory for each word as well as automatic word sense disambiguation.
Third, the evaluation was relatively small scale. A larger number of subjects, reflecting greater variety in L1 background and L2 proficiency, would measure user experience more comprehensively among diverse user populations. Further, the evaluation, which focused on text recommendation quality and user perception, did not consider language learning outcomes. Although pedagogical benefits have been demonstrated in intelligent tutoring systems for various subjects, they have yet to be established for text recommendation specifically. A controlled experiment contrasting the learning outcomes of the proposed and graded approaches is needed to draw a conclusion.
8. Conclusions and future work
Text recommendation systems support self-regulated language learning by matching learners with appropriate reading materials. The conventional approach performs the matching through automatic readability assessment, which assigns each document to a difficulty level, and through automatic proficiency assessment, which measures the user on the same scale. This approach cannot be customized to the user’s personal learning patterns, because the levels assume a standard order of language acquisition. It also lacks transparency as it reveals little of the basis of its recommendation decisions.
This article has presented a novel text recommendation method that addresses these limitations. This method increases personalization with an editable learner model, which enables users to update the status of their vocabulary knowledge in the model when it makes wrong predictions. Further, this method provides more transparency through document ranking by NWD, a metric that facilitates user scrutiny and fine-grained calibration of text difficulty. An evaluation showed that new users were able to issue queries with various personalization features to achieve the intended effects on text difficulty. According to a survey, users were receptive to the search methodology and interface and were willing to inspect and update the learner model.
We plan to pursue future work in two directions. First, we plan to expand the editable learner model to cover other dimensions of text difficulty (section 7). Since an expanded model may result in more complex user interaction, it would be worth considering less invasive methods for model updates (e.g. Hokamp, Mihalcea & Schuelke, Reference Hokamp, Mihalcea, Schuelke, Calzolari, Choukri, Declerck, Loftsson, Maegaard, Mariani, Moreno, Odijk and Piperidis2014). Second, we intend to conduct a longitudinal study, involving subjects with a greater variety of L1 backgrounds and L2 proficiency, on the learning outcomes of the proposed text recommendation approach. Such a study would help assess its potential to serve in a personal user model for lifelong learning (Kay & Kummerfeld, Reference Kay and Kummerfeld2019).
Acknowledgements
This work was supported by grants from the Standing Committee on Language Education and Research (EDB(LE)/P&R/EL/164/6) and from the Innovation and Technology Fund (ITS/389/17) in Hong Kong SAR, China. We thank Lam Xing, Zhenqun Yang, and Chak Yan Yeung for their assistance in app implementation.
Ethical statement
The experimental procedure in this research was approved by the Human Subjects Ethics Sub-Committee at City University of Hong Kong (Application No.: H000899).
About the author
John S. Y. Lee is an associate professor at the Department of Linguistics and Translation at City University of Hong Kong. He obtained his PhD in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology.
Author ORCIDs
John S. Y. Lee, https://orcid.org/0000-0003-2505-2678