1. Introduction
The earliest study of leveraging quantifiable features to analyze text readability can be traced back to Sherman (1893) (see Bailin and Grafstein 2016 for a review). Readability is still a highly active research topic. From the early investigations of literary texts, readability assessment techniques have been widely applied, expanding to other genres and domains including insurance policies, medical awareness pamphlets, jury instructions (DuBay 2004), biology textbooks (Belden and Lee 1961), health education messages (Freimuth 1979), business communication textbooks (Razek and Cone 1981), economics textbooks (Gallagher and Thompson 1981; McConnell 1982), newspapers (Johns and Wheat 1984), and adult learning materials (Taylor and Wahlstrom 1999).
Among the various aspects of readability assessment techniques, the development of new linguistic features to be integrated into readability models is an issue of both theoretical and practical importance (De Clercq and Hoste 2016; Graesser et al. 2004; McNamara, Louwerse, and Graesser 2002; McNamara et al. 2010; Sung et al. 2015a, 2015b, 2016a; Tanaka-Ishii, Tezuka, and Terada 2010). Notably, there remains a problem that common general linguistic features (i.e., a set of lexical, syntactic, semantic, and cohesion-related features as described in the Appendix) are not capable of reflecting the difficulty levels of the knowledge contained in domain-specific texts. As pointed out by Redish (2000), when a field-specific term appearing in a domain-specific text is also a commonly used word, the term’s difficulty level within that text cannot be accurately assessed by general linguistic features. For example, when appearing in a general text, shock is a common and easy word that means “strong, and usually unpleasant, emotion,” but in a medical text it refers to “a life-threatening condition that occurs when the body is not getting enough blood flow, which obstructs microcirculation and results in the lack of blood and oxygen in vital organs” (Cecconi et al. 2014: 1796). Using the common general linguistic features to measure the word shock in the two senses (i.e., in a general vs. 
a domain-specific text) would lead to the same predicted difficulty level because both “shocks” have the same part of speech and the same word length, and, according to most word lists, shock is rated with only a single difficulty score. In other words, apart from the fact that the two “shocks” reside in different domains of knowledge, they are superficially identical. To take another example, in an empirical study analyzing Medical Subject Headings (MeSH) in the US medical database, Yan, Song, and Li (2006) found that general linguistic features such as the number of syllables and the length of a medical term were not related to the term’s level of difficulty. From these examples, we can argue that because common general linguistic features are derived from the surface characteristics of text, they cannot characterize the meaning embodied in the knowledge of particular fields, let alone represent the relations between different bodies of knowledge or distinguish their difficulty levels. Researchers thus need to develop new readability features capable of rating the knowledge-oriented difficulty of words. Likewise, Collins-Thompson (2014) argues that the ability to capture the dependencies between concepts is a requisite of knowledge-based readability models for deeper content understanding.
Given that the general linguistic features employed by traditional readability formulas cannot measure the knowledge in domain-specific texts, how to do so reasonably and effectively becomes a topic worthy of further research. Taking Chinese texts in the natural and social sciences as examples, our study has two main purposes. The first is to design a method for representing the knowledge contained in domain-specific Chinese texts that can identify the knowledge features of different grade levels in the subjects of social science and natural science, and to use these features as important references in defining the readability of texts in these subjects. The second is to compare, in terms of the effectiveness of assessing the readability of domain-specific texts, readability models using knowledge features with models using general linguistic features (e.g., word count, phrase length, and sentence length). Specifically, our study aims to answer the following two questions: (1) How well does a general-linguistic-feature-based readability model perform in predicting the readability/difficulty levels of domain-specific texts? (2) Does a knowledge-feature-oriented readability model based on the hierarchical conceptual space extracted by latent semantic analysis (LSA) outperform a general-linguistic-feature-based readability model? To compare the different approaches for assessing domain-specific text readability, three sub-studies applying different methodologies were conducted to further validate our findings.
In Study 1, we tested whether general linguistic features are suitable for assessing the readability of domain-specific texts. Readability models that incorporated the general linguistic features were trained through machine learning. Study 1 also provided a baseline for the model validation in Studies 2 and 3. In Study 2, we proposed a hierarchical conceptual space to generate a hierarchy of difficulty-graded word lists that correspond to school-grade levels. We then used the word lists to calculate the difficulty distribution of conceptual terms in domain-specific texts. The grade level of a text was estimated from the difficulty level at which most of its terms are distributed. The rationale is that, in describing a body of knowledge, a text employs a great number of domain-relevant terms, whose difficulty levels hence reflect the readability of the text.
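The grade-estimation step in Study 2 can be sketched in a few lines of Python. The names below are illustrative stand-ins rather than the paper’s actual implementation: each graded word list maps a school grade to the conceptual terms first introduced at that grade, and a text’s level is taken as the level where most of its terms fall.

```python
from collections import Counter

def estimate_grade(terms, word_lists):
    """Estimate a text's grade level as the level at which most of its
    conceptual terms are distributed (majority vote).

    `word_lists` maps each grade level to the set of terms first introduced
    at that level; terms absent from every list are skipped.
    """
    levels = []
    for term in terms:
        for level, vocab in sorted(word_lists.items()):
            if term in vocab:
                levels.append(level)
                break
    if not levels:
        return None
    return Counter(levels).most_common(1)[0][0]

# Toy graded word lists (invented for illustration).
lists = {3: {"plant", "leaf"}, 5: {"photosynthesis", "chloroplast"}, 7: {"enzyme"}}
print(estimate_grade(["photosynthesis", "chloroplast", "leaf"], lists))  # → 5
```

A real system would derive the graded lists from the hierarchical conceptual space described in Section 3 rather than hand-built sets.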
Study 3 was an extension of Study 2 and used the difficulty-level distribution of conceptual terms in a text as a feature of the readability model, which used a support-vector-machine (SVM)-based classifier. In addition, in order to compare the performance of different readability models, Study 3 experimented with the following three readability models: the model combining the grade-level vectors and general linguistic features, the TF model (term frequency; Salton and Buckley 1988), and the Word2Vec model (Mikolov et al. 2013). By comparing the results of the experiments, we contrasted the advantages and disadvantages of using general linguistic features, bag-of-words-based features, and the hierarchical conceptual space to determine the readability of domain-specific texts.
2. Literature review
2.1 The development of readability studies
Text readability/difficulty assessment is an important application in the text mining field. Readability has been formally defined as the sum of all elements in textual material that affect a reader’s understanding, reading speed, and level of interest in the material (Dale and Chall 1949). The effective assessment of the difficulty of texts can contribute to teaching and learning: readers will understand and learn effectively when they select texts at appropriate readability levels suited to their reading ability (Klare 1963, 2000). Texts that are too advanced for one’s reading level will significantly increase the reader’s cognitive load, leading to feelings of frustration (DuBay 2004). Conversely, a text that is too simple for the reader can lead to a lack of motivation and sense of achievement. In other words, readability assessment is helpful for matching reading materials with reading abilities, through which more successful and joyful reading experiences may be created.
Traditional readability formulas are based on research findings that factors such as semantic, syntactic, and lexical complexities influence the difficulty level of a text (Graesser, McNamara, and Kulikowich 2011; Rubenstein and Aborn 1958). Readability formulas thus employ these linguistic features as key variables to predict text readability. For example, the Flesch Reading Ease Formula (Flesch 1948) used the number of syllables in a word as an indicator of lexical complexity and sentence length as an indicator of syntactic complexity, measuring text difficulty based on the average number of syllables per word and the average sentence length (Collins-Thompson 2014). The more syllables an average text word has and the longer an average sentence is, the more difficult a text becomes. Chall and Dale (1995) added the percentage of difficult words as another variable of text difficulty: the greater the number of difficult words a text contains, the more difficult it is.
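Flesch’s formula can be written out directly. A minimal Python sketch, using his published constants:

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch (1948) Reading Ease: higher scores mean easier text.
    Combines average sentence length (syntactic complexity) with
    average syllables per word (lexical complexity)."""
    avg_sentence_length = total_words / total_sentences
    avg_syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word

# A 100-word passage with 5 sentences and 130 syllables:
print(round(flesch_reading_ease(100, 5, 130), 1))  # → 76.6
```

Note that syllable counting itself is nontrivial, which is one reason such formulas transfer poorly to Chinese, where syllable counts carry little information (see Section 2.1).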
As Graesser, Singer, and Trabasso (1994) pointed out, traditional formulas using general linguistic features fail to reflect the actual reading process. This is because the common general linguistic features, which only involve semantic, syntactic, and lexical properties of text, neglect other essential text features, one of them being the coherence of a text. Collins-Thompson (2014) also suggested that traditional readability formulas focus only on the shallow information of a text, overlooking important deeper content features. This has led to skepticism over the results when such formulas are used to predict text comprehensibility. For example, using word frequency (how often a word occurs in a representative corpus) as the basis for judging whether a word is difficult is not objective enough, since word usage evolves with time and new words continually emerge. For this reason, the Thorndike List, which Thorndike and Lorge (1944) developed based on the word frequencies of their time, is not ideal for assessing the difficulty of modern words and renders readability formulas using this feature even less accurate in their predictions. In addition, word frequency is not necessarily an effective way of assessing word difficulty. The word toothbrush, for example, is simple enough and commonly understood, but it appears infrequently in written texts, whether newspapers, magazines, or textbooks, and would likely be categorized as a difficult word, which is clearly not the case. Using word length as a basis for calculating readability has also often been challenged, for the reason that longer words are not necessarily more difficult (Yan et al. 2006).
Meanwhile, the practice of basing syntactic complexity on sentence length and using it as a variable in measuring readability has been criticized as too intuitive and lacking in sophistication (Bailin and Grafstein 2001).
Scholars have conducted research using various linguistic features. Graesser et al. (2004), for example, developed Coh-Metrix, an online analyzer of text features with categories such as word information, syntactic structure, latent semantics, and cohesion. The effectiveness of the Coh-Metrix readability features for text classification is supported by empirical evidence in the field of psychology (Graesser et al. 2004; McNamara et al. 2002; McNamara et al. 2010). Kanebrant, Mühlenbock, Kokkinakis, Jönsson, Liberg, Geijerstam, Folkeryd, and Falkenjack (2015) presented T-MASTER, a tool for assessing students’ reading skills on a variety of dimensions. They used the SVIT model for analyzing textual complexity in T-MASTER. The model was constructed using four categories of variables: surface complexity, vocabulary difficulty, syntactic complexity, and idea density.
In contrast to Western languages, readability assessment of Chinese text requires readability models of its own that are distinct from those based on alphabetic writing systems. The complexity of Chinese characters cannot be measured by letter or syllable counts, because each Chinese character comprises one syllable and most Chinese words are formed of two characters (Hong et al. 2016). The uniqueness of the language has prompted inquiry into which readability features are suitable for Chinese text. For example, Sung et al. (2016a) developed the Chinese Readability Index Explorer (CRIE), a system for analyzing Chinese text readability that consists of four categories of indices: word, syntax, semantics, and cohesion. Using multilevel readability features, the CRIE model achieves improved effectiveness in the assessment of Chinese text readability (Sung et al. 2013; Sung et al. 2015a, 2015b). Chen, Chen, and Cheng (2013) used the lexical chain technique to examine the suitability of lexical cohesion as a readability feature for Chinese text. They divided texts in social and life science textbooks into three reading levels: textbooks written for the first and second graders, the third and fourth graders, and the fifth and sixth graders were regarded as the low, middle, and high levels, respectively. The model was highly effective in a coarse-grained manner: the accuracy in distinguishing the low from the non-low levels reached 96%, and that between the middle and non-middle levels reached 85%. Tseng et al. 
(2016) used Word2Vec to obtain a Chinese semantic space and characterized the overall semantic features through vector transformations, thus creating a useful readability feature. When used to assess textbooks in Chinese language arts, social studies, and natural science from the 1st through the 12th grades, the readability model reached an accuracy of 75.99%. However, a large number of dimensions (800) were required to build their readability features. As a result, training the model on large data was very time-consuming. Moreover, because the testing data consisted of both domain-specific and literary texts, it was not clear whether the accuracy rate was mostly attributable to the prediction of the non-domain-specific texts. Most importantly, the model was uninterpretable: aside from the leveling result, one has no clue as to what value these readability features may contribute to the identification of domain-knowledge difficulty.
In addition to the ongoing development of readability features, the rise of natural language processing techniques and machine learning algorithms now allows researchers to refine their model algorithms to measure text readability with a more flexible scope (Feng et al. 2010; François and Miltsakaki 2012; Petersen and Ostendorf 2009; Sung et al. 2015a; Vajjala and Meurers 2012). Besides the aforementioned studies on document readability, some scholars have started to investigate readability at the sentence level. For example, Vajjala and Meurers (2014) built upon their pioneering research (Vajjala and Meurers 2012) and used 151 features to train readability models on both the document and sentence levels. Experimental results indicate that the document-level readability model achieved a Pearson correlation of 0.92, but the sentence-level readability model only achieved an accuracy of 66%. Furthermore, Ambati, Reddy, and Steedman (2016) used an incremental Combinatory Categorial Grammar (CCG) parser to calculate sentence complexity and predict relative sentence readability. The incremental CCG model outperformed the non-incremental Phrase Structure Trees (PST) model in extracting syntactic features. Apart from the above, Howcroft and Demberg (2017) used the ModelBlocks parser and the program icy-parses to extract 22 features. Four measures (idea density, surprisal, integration cost, and embedding depth) were used to calculate sentence complexity scores and served as the basis for training different readability models, in order to determine which feature set constitutes a readability model that is both effective and efficient. 
Analysis results indicate that, compared with integration cost and idea density, surprisal and embedding depth make the readability model more efficient. The interested reader can refer to Collins-Thompson (2014) for a comprehensive overview of relevant studies that followed this line of research.
2.2 Studies of domain-specific text readability
The content of domain-specific texts is generated through the reproduction of relevant knowledge that humans have developed through documenting recurring and complex problems in life (Hirschfeld and Gelman 1994). For example, having experienced recurring weather conditions such as typhoons, floods, and snowstorms, humans documented the phenomena through sound, image, and text to form specific ideas about them, and attempted to find solutions for them, leading to the development of the domain of meteorology (Hirschfeld and Gelman 1994). Domain-specific texts focus on illuminating the “concepts” of the relevant knowledge. More importantly, it is noticeable that these domain-specific concepts are formed through the convergence of relevant terms. For instance, when explaining the domain-specific concept of “how plants produce nutrients,” one inevitably mentions words that underpin the subject such as photosynthesis, chloroplast, enzyme, glucose, and carbon dioxide. This method of elucidating domain-specific knowledge is different from the structure of descriptive or narrative writing used in general language texts.
The more domain-specific terms there are in a domain-specific text, the more domain-specific knowledge the reader will need in order to efficiently generate a meaningful understanding of the text (Bédard and Chi 1992). Chi, Glaser, and Farr (1988) noted that the difference between an expert and a novice lies in the former’s ability to master the domain-specific knowledge and to derive a significant amount of meaningful information from the text. Etringer, Hillerbrand, and Claiborn (1995) have also pointed out that experts possess a broad and deep knowledge base, making it easier for them than for novices to retrieve information from their long-term memory, link fragmentary information together, and integrate it into meaningful information. This means that experts possess domain-specific concepts that they can use effectively within the knowledge structure of a domain-specific text; conversely, when faced with a domain-specific text, a novice lacks the corresponding domain-specific concepts with which to retrieve the information contained in the text and is thus unable to generate a meaningful understanding. Therefore, the readability of a domain-specific text should be judged by the amount of meaningful information that can be retrieved by most people with similar reading abilities. In other words, the knowledge of a domain-specific text should be within the grasp of most readers at a certain developmental phase (e.g., a certain school grade).
Automatically retrieving, defining, and measuring the impact of various factors on the difficulty of domain-specific texts requires sophisticated methods. Currently in the literature, the four methods or tools most often utilized for calculating the readability of a text, which are reviewed below, are general linguistic features, ontology, word lists, and LSA.
2.2.1 General-linguistic-feature-based assessment of domain-specific texts
Early readability research often employed general linguistic features to build linear regression models to assess the difficulty of domain-specific texts. For example, Razek and Cone (1981) and Gallagher and Thompson (1981) used the Flesch Reading Ease Test to analyze business communication textbooks and economics textbooks, respectively. Miltsakaki and Troutt (2007) proposed the Read-X system to perform readability assessment on different types of website text, using three readability formulas. Kanungo and Orr (2009) focused on predicting the difficulty of web page summaries. However, since these studies employed general linguistic features, further verification is needed to determine whether their models are capable of representing the knowledge structure of domain-specific texts. To offset the shortcomings of general-purpose readability assessment tools, Dell’Orletta, Venturi, and Montemagni (2012) proposed a ranking method based on the notion of distance for automatically building genre-specific training corpora. Currently, the most popular readability assessment tool for Mandarin is the CRIE system by Sung et al. (2016a). However, the CRIE only employs general linguistic features and was not designed to handle domain-specific texts.
2.2.2 Ontology-based assessment of domain-specific texts
Ontology uses a hierarchical tree structure to represent the relationships between domain-specific concepts (Gruber 1993). Using this tree, it is possible to assess the difficulty of a domain-specific concept by calculating the distance from that concept to the root of the tree: the deeper a concept sits in the tree, the more difficult it is likely to be. Yan et al. (2006) used the hierarchy of medical subject headings in the American MeSH database to determine the complexity of a medical concept in MeSH by calculating the concept’s tree depth, that is, the distance of that concept from the root of the tree structure. The scope of a document is regarded as the coverage of the domain concepts in the document. The more terms of the document are identified as domain concepts, the less readable the document tends to be. The study compared how well the readers’ level of understanding was predicted by document scope and by traditional formulas, and the result indicated that document scope is a good predictor of the readability of medical texts. Zhao and Kan (2010) used ontology to construct the domain-specific concept hierarchy in the Math World Encyclopedia and then calculated the difficulty score of each concept. In a similar vein, Project 2061 (AAAS 2007), a long-term initiative of the American Association for the Advancement of Science (AAAS) that aims to help students become literate in science, mathematics, and technology, serves as a useful ontology tool. Many of the topic maps provided by the AAAS in the Atlas are built from K-12 learning goals. By studying these maps carefully, teachers and other educators can get a better sense of the content and nature of grade-level benchmarks as specific learning goals (AAAS 2007).
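The tree-depth calculation is straightforward once the hierarchy is given as a parent map. The toy hierarchy below is an invented stand-in for an actual MeSH fragment, used only to illustrate the depth-as-difficulty idea:

```python
def concept_depth(hierarchy, concept):
    """Depth of a concept = number of edges from the root of the tree.
    `hierarchy` maps each concept to its parent; the root maps to None.
    Deeper (more specific) concepts are assumed to be more difficult."""
    depth = 0
    node = concept
    while hierarchy[node] is not None:
        node = hierarchy[node]
        depth += 1
    return depth

# Invented three-level fragment, not real MeSH data.
mesh_like = {"disease": None,
             "cardiovascular disease": "disease",
             "cardiogenic shock": "cardiovascular disease"}
print(concept_depth(mesh_like, "cardiogenic shock"))  # → 2
```

Document scope would then aggregate such depths over all terms in a document that match concepts in the hierarchy.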
The ontology method takes into account the hierarchies of domain-specific concepts, but not every domain has a readily available domain-specific conceptual hierarchy. In such a case, experts are required to produce a conceptual hierarchy in order to apply the ontology approach. Otherwise, it is impossible to assess the conceptual difficulty. Thus, using the ontology method to assess text readability can involve significant costs.
2.2.3 Word-list-based assessment of domain-specific texts
Any collection of words can be viewed as a word list. It is common practice to define word difficulty by referring to a word list. The assumption underlying the use of easy or difficult word lists to assess text readability is that the more difficult words a text contains, the harder it is for the concepts embedded in the text to be understood. One widely known measure of this type is the Revised Dale–Chall formula (Chall and Dale 1995). The formula used the Dale word list, which collected the 3000 words that were familiar to at least 80% of American fourth graders at the time. A word is labeled “unfamiliar” if it does not occur in the list. Accordingly, the more words a text contains that are not present on the word list, the lower its readability is. A similar method was used by Fry (1990), who employed Dale and O’Rourke’s Living Word Vocabulary of 43,000 word types to assess the readability of short articles. An example of utilizing word lists to assess domain-specific text readability is Miltsakaki and Troutt (2008), who measured web text readability based on word difficulty rated by word frequency. In order to develop monolingual and bilingual word lists for language learning, Kilgarriff et al. (2014) participated in the KELLY project, which created bilingual learning word cards using the 9000 most frequent words in text corpora of 9 languages and 72 language pairs. The above shows that using corpus-derived word frequency to define word difficulty is a common practice. However, there exist many counter-examples to this intuitively correct assumption. For example, some commonly used household items appear infrequently in specific corpora (such as newspapers). 
Moreover, frequency-based word lists simply categorize words as difficult or simple, and the precise boundary between the two is highly subjective. In an attempt to remedy this flaw of frequency-based word lists, Borst et al. (2008: 73) computed both the “complexity of the category to which a word belongs” and the “word frequency.” They then multiplied the two scores to represent word difficulty, which was used as a basis for estimating the complexity scores of sentences; the readability score of a document corresponds to the average score of its sentences. In a similar vein, Ding (2007) proposed a model to automatically construct knowledge hierarchies that can help classify words according to their hierarchical relationships with other words in a knowledge domain. In the current study, we also aim to apply similar ideas to the readability assessment of domain-specific texts to make the assessment process more objective.
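The Revised Dale–Chall computation mentioned above can be sketched as follows. The constants are those commonly reported for the 1995 revision; treat them as indicative and consult the original for authoritative values:

```python
def dale_chall_score(words, num_sentences, familiar_words):
    """Revised Dale–Chall readability score (Chall and Dale 1995).
    Words absent from the familiar-word list count as difficult;
    higher scores indicate harder text."""
    num_difficult = sum(w.lower() not in familiar_words for w in words)
    pct_difficult = 100.0 * num_difficult / len(words)
    avg_sentence_length = len(words) / num_sentences
    score = 0.1579 * pct_difficult + 0.0496 * avg_sentence_length
    if pct_difficult > 5:
        score += 3.6365  # adjustment when difficult words exceed 5%
    return score
```

The binary familiar/unfamiliar split is exactly the subjectivity criticized above: the score changes discontinuously depending on which words happen to make the list.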
2.2.4 LSA-based assessment of domain-specific texts
LSA is a technology for network knowledge representation (Landauer and Dumais 1997; Landauer, Foltz, and Laham 1998). This technique applies singular value decomposition (SVD) to the term-document matrix to obtain a latent semantic space, after which a term, phrase, sentence, or even an entire document can be folded into the latent semantic space and represented as a semantic vector.
The more similar two vectors are, the more similar the two documents are in terms of semantics (Furnas et al. 1988). Currently, this computational model is widely used in the field of information retrieval to augment traditional technology (e.g., Boolean search), which calculates the similarity between sentences or documents based only on whether the terms they contain coincide. LSA has also been applied to text similarity comparison through the computation of latent semantic properties. Chang, Sung, and Lee (2013), for example, used LSA to obtain terms at different levels of difficulty from domain-specific textbooks. Kireyev and Landauer (2011) employed LSA to develop a domain-specific measure they called Word Maturity to estimate the grade level of a word, and in turn used these grades to gauge the literacy of students.
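The SVD, fold-in, and similarity steps can be sketched with NumPy. The matrix below uses toy counts invented for illustration; a real system would build the term-document matrix from a large textbook corpus:

```python
import numpy as np

# Toy term–document count matrix: rows are terms, columns are documents.
X = np.array([[2., 0., 1.],
              [1., 3., 0.],
              [0., 1., 2.]])

# SVD of X; truncating to the top-k singular values yields the latent space.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = Vt[:k].T                    # each row: a document in the latent space

# A new document with term-count vector q is folded in as q^T U_k S_k^{-1}.
q = np.array([1., 1., 0.])
q_vec = q @ U[:, :k] @ np.diag(1.0 / s[:k])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantic similarity of the new document to each training document.
sims = [cosine(q_vec, d) for d in doc_vecs]
```

Because similarity is computed in the truncated latent space, two documents can score as similar even when they share no surface terms, which is what distinguishes LSA from term-overlap measures.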
In addition to detecting semantic similarities, LSA has been further used to analyze text readability. When classifying documents written for non-native readers of French, François and Miltsakaki (2012) used LSA to compute lexical cohesion as a semantic feature for their readability model. Truran et al. (2010) employed the LSA technique to investigate the readability of clinical documents. Graesser et al. (2004) developed Coh-Metrix 3.0, providing eight LSA indices as measures of semantic overlap between sentences or between paragraphs. However, these studies treated the LSA technique as one of many readability features incorporated into their readability models. They did not explore why LSA is useful for analyzing domain-specific text readability, so it is unclear how such a model might differ from models using general linguistic features. In view of this, the current study investigates the independent predictive value of both the LSA technique and the general linguistic features, as well as their combined predictive power. Although the methods we devised for the current study are based on the hierarchical conceptual spaces put forward by Chang et al. (2013), our work is distinguished from theirs in the following ways. The study by Chang et al. (2013) only sketches the concepts of domain-specific knowledge in school textbooks that are supposed to be comprehensible for students in the corresponding grades. More specifically, the sketched domain-specific concepts are isolated from each other. The current study extends the work of Chang et al. (2013) by regarding the difficulty levels of the domain-specific concepts in a text as forming a relational pattern. 
The readability features thus developed are more interpretable and more capable of classifying the readability of Chinese domain-specific texts. Additionally, this paper combines the LSA-derived readability features with the common general linguistic features to investigate their impact on readability models.
3. Proposed method: Using hierarchical conceptual space to calculate grade-level vectors for the readability assessment of Chinese domain-specific texts
Instead of viewing a text as having a fixed (i.e., single and unchangeable) difficulty level, Chi et al. (1988) define readability as a relative notion measured by the degree to which a document is understood by the individual reader. For experts versus novices, a domain-specific document may provide either a massive amount of meaningful content or only trivial information. In other words, the difficulty level of a domain-specific text is determined jointly by the conceptual load of the text and the reading ability of the individual. For example, the content of fifth-grade textbooks is normally easy for ninth graders but difficult for third graders.
Given that each person has a different background, in order to develop a readability model that is widely applicable across age and social groups, we established a common ground of reading abilities according to school education. Nolen-Hoeksema et al. (Reference Nolen-Hoeksema, Fredrickson, Loftus and Wagenaar2009) pointed out two approaches to domain-specific concept acquisition: Obtaining the prototype of a concept through everyday experiences and learning the core of a concept through school instruction. At each grade level, students are taught the knowledge of specific domains through subject courses, forming domain-specific concepts and internalizing these concepts into their knowledge systems. Accordingly, we can use the concepts that students learn from textbooks as a basis to estimate the time at which concepts of domain-specific knowledge can (or should) be comprehended by most people.
This method is reasonable because most textbooks are written following curriculum standards, and most students are able to understand a certain domain-specific concept after going through a certain stage of learning. Therefore, the domain-specific concepts that they acquire are generally representative of the typical difficulty level of knowledge learned by the average student at a specific learning phase. The word frequency guide of Zeno et al. (Reference Zeno, Ivens, Millard and Duvvuri1995) and Manulex, a grade-level lexical database built from French elementary-school readers by Lété, Sprenger-Charolles, and Colé (Reference Lété, Sprenger-Charolles and Colé2004), also used school-grade levels to define the difficulty level of terms. Lété et al. (Reference Lété, Sprenger-Charolles and Colé2004) believe that the most important variable in understanding language development and the reading process is the vocabulary that children learn, and that using grade level to quantify children’s reading material is valuable to psycholinguists trying to evaluate children’s language acquisition. Therefore, in assessing domain-specific text readability, we may determine the difficulty level of a word based on the grade level at which it becomes a main theme or topic.
Figure 1 illustrates how instruction of domain-specific concepts develops in the educational process. A fifth-grade natural science text on the topic of oxygen and carbon dioxide mentions, toward the end of the article, that one of the ways oxygen and carbon dioxide are used is in the photosynthesis process of plants. However, it is not until a seventh-grade text on the topic of how plants produce nutrients that the whole process of photosynthesis is truly explained. To which grade-level difficulty, then, should the term photosynthesis be assigned? How to automatically and objectively assign difficulty levels to terms thus becomes a problem worth investigating.
Many topics involve various shades of domain-specific concepts, which is why the concepts are taught in phases, through textbooks for different grades. At higher grade levels, students are exposed to a greater number of (and often more sophisticated) terms related to a topic, enabling them to deepen their knowledge. Continuing with Figure 1 as an example, photosynthesis is introduced in a fifth-grade text on oxygen and carbon dioxide, providing prior knowledge in preparation for the seventh-grade lesson on how plants produce nutrients and thereby facilitating the learning process. This arrangement is typical of a spiral curriculum, in which terms, topics, or themes are revisited iteratively throughout the course. Learners’ knowledge may be broadened and deepened through this iteration, in which complex concepts are built up from simple ones (Harden Reference Harden1999). Following this idea, Chang et al. (Reference Chang, Sung and Lee2013) took 100 terms from elementary school natural science textbooks in Taiwan and used LSA to calculate the semantic relatedness of each term to natural science domain texts from the third to sixth grades. Figure 2 presents the five steps Chang et al. (Reference Chang, Sung and Lee2013) proposed to determine which grade level best characterizes a domain-specific concept.
Step 1. Represent the relationship between each “term” (i.e., word) and “document” (i.e., text) of the social science and natural science textbooks with a two-dimensional matrix, the “term-document matrix,” in which the value of each entry is the frequency with which a term occurs in a document. This step produces one term-document matrix each for the social science and natural science texts.
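Step 1 can be sketched in a few lines of Python; the toy documents below are hypothetical, and the texts are assumed to be already segmented into words:

```python
from collections import Counter

def term_document_matrix(documents):
    """Build a term-document count matrix from tokenized documents.

    Rows are terms, columns are documents; each entry is the
    frequency with which the term occurs in that document (Step 1).
    """
    vocab = sorted({term for doc in documents for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = [[0] * len(documents) for _ in vocab]
    for j, doc in enumerate(documents):
        for term, freq in Counter(doc).items():
            matrix[index[term]][j] = freq
    return vocab, matrix

# Toy example with two pre-segmented texts.
docs = [["oxygen", "plant", "oxygen"], ["plant", "photosynthesis"]]
vocab, W = term_document_matrix(docs)
# vocab == ['oxygen', 'photosynthesis', 'plant']
# W == [[2, 0], [0, 1], [1, 1]]
```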
Step 2. Process the two term-document matrices with the Term Frequency–Inverse Document Frequency (TF–IDF) method. TF–IDF gives greater weight to precisely those words that occur in only some of the documents while down-weighting those common to all of the documents (Sparck Jones Reference Sparck Jones1972; Salton and Buckley Reference Salton and Buckley1988).
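A minimal sketch of this weighting, using one common TF–IDF variant (raw term frequency and idf = log(N/df); the original study does not specify which variant it used):

```python
import math

def tfidf(matrix):
    """Reweight a term-document count matrix with TF-IDF.

    One common variant among many: tf is the raw count and
    idf = log(N / df), where N is the number of documents and
    df is the number of documents containing the term.
    """
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        df = sum(1 for count in row if count > 0)
        idf = math.log(n_docs / df)
        weighted.append([count * idf for count in row])
    return weighted

W = [[2, 0], [0, 1], [1, 1]]  # 3 terms x 2 documents
W_tfidf = tfidf(W)
# A term occurring in every document gets idf = log(2/2) = 0,
# so the last row is weighted down to [0.0, 0.0].
```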
Step 3. Use SVD (Golub and Reinsch Reference Golub and Reinsch1970), the core operation of LSA, to reduce the dimensions of the term-document matrix, retrieve the latent semantics contained in the terms, and extract semantic clusters to be presented anew. SVD is performed on the term-document matrix (W) in order to project all the term vectors and document vectors onto a single latent semantic space with significantly reduced dimensionality L. That is, $W \approx \widetilde{W} = \textbf{\textit{U}}\boldsymbol{\Sigma}\textit{\textbf{V}}^{\textit{\textbf{T}}}$. In the equation, $\widetilde{W}$ is the rank-L approximation to W; U is the left singular matrix; Σ is the L × L diagonal matrix composed of the L largest singular values; V is the right singular matrix; and T denotes matrix transposition. In the process of reducing dimensions, different meanings and senses of a word are all lumped together, abstracting information to capture the latent semantics of the terms within the document. In this way, words describing related concepts will be close to each other in the latent semantic space even if they never co-occur in the same document, and documents describing related concepts will be close to each other even if they do not contain the same set of words (Lee and Chen Reference Lee and Chen2005).
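The rank-L approximation can be sketched with NumPy’s SVD routine; the small matrix below is an illustrative stand-in for a real term-document matrix:

```python
import numpy as np

def lsa_reduce(W, L):
    """Rank-L approximation of a term-document matrix via SVD.

    Returns the reduced term vectors (rows of U * Sigma), the
    document vectors (rows of V * Sigma), and the approximation
    W_tilde = U Sigma V^T, all restricted to the top L dimensions.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_L, s_L, Vt_L = U[:, :L], s[:L], Vt[:L, :]
    term_vectors = U_L * s_L      # one row per term
    doc_vectors = Vt_L.T * s_L    # one row per document
    W_approx = U_L @ np.diag(s_L) @ Vt_L
    return term_vectors, doc_vectors, W_approx

W = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])
terms, docs, W_tilde = lsa_reduce(W, L=2)
```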
Step 4. To determine whether a certain term is a main concept in a specific grade-level textbook, Chang et al. (Reference Chang, Sung and Lee2013) projected each domain-specific term, for example, atom, and each grade-level text (all the texts of a grade level were combined into a single text, for a total of seven texts for grade levels 3–9) onto the latent semantic space, generating two vectors: One vector for the specific term and another for the grade-level text being compared. For example, we created the term semantic vector for atom and the document semantic vector for each grade-level text. The cosine similarity between the vector of a specific term and the vector of a grade-level text was then calculated. The cosine value falls between –1 and 1, with a higher value indicating more similarity between a term and a text, and a lower value indicating less. This process was repeated for each of the seven grade levels to find which grade-level text atom was most similar to. In other words, we calculated seven cosine similarities for each domain-specific term, one for each of grades 3–9. The highest cosine value identified the grade level to which a domain-specific term belonged.
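Step 4 amounts to an argmax over cosine similarities; the 2-D latent vectors below are hypothetical placeholders, not values from the study:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def assign_grade(term_vector, grade_vectors):
    """Return the grade whose document vector is most similar to the term."""
    return max(grade_vectors, key=lambda g: cosine(term_vector, grade_vectors[g]))

# Hypothetical 2-D latent vectors for the term "atom" and three grade-level texts.
atom = [0.9, 0.2]
grades = {5: [0.1, 0.9], 7: [0.8, 0.3], 9: [0.5, 0.5]}
assign_grade(atom, grades)  # grade 7 is most similar here
```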
Step 5. Based on the degree of similarity between the term vectors and document vectors, each term is assigned a grade-level difficulty.
After the above-described steps, each term was assigned a particular grade-level label. Figure 3 shows part of the result of assigning the domain-specific terms to the levels that most appropriately characterize their difficulty. We can now see that photosynthesis is more closely related to the knowledge taught in seventh grade than in fifth grade, where it also appears. In other words, photosynthesis, chloroplast, enzyme, and glucose are the terms used in describing the knowledge subject of how plants produce nutrients. Learners’ knowledge may be treated as a network of concepts (Hunt Reference Hunt2003); however, the content of concepts is delivered by the words/terms used in the texts. This is why we use “conceptual terms” to represent learners’ knowledge at each grade. Once the conceptual terms for all the topics in the textbooks of a certain grade have been gathered, they form the conceptual space unique to that grade, which can be given a corresponding difficulty level. When the conceptual spaces of all the grades are linked together, they form a hierarchical conceptual space reflecting the spiral curriculum design.
One of the fundamental problems of the word lists developed in past studies is that they do not correspond well with grade levels. As a result, while past methods may be helpful in creating classification models, they cannot provide users with any justification for why a given article is assigned to a grade level.
Addressing this issue, this article adopts the steps we have discussed above to generate a word list derived from a hierarchical conceptual space. In this way, each conceptual term will be assigned to a grade level of difficulty, and the overall conceptual terms are able to represent the typical knowledge assumed to be learned at each grade.
The relationships between the conceptual terms can be visualized via a Pathfinder network representation (Schvaneveldt, Durso, and Dearholt Reference Schvaneveldt, Durso and Dearholt1989, Reference Schvaneveldt, Durso and Dearholt2017). The idea of Pathfinder is to use the cosine matrix of domain-specific concepts (here derived from LSA) to determine the distances among concepts. The relationships between the conceptual terms and the conceptual spaces of the topics “oxygen and carbon dioxide” and “how plants produce nutrients” are represented in Figures 4 and 5, respectively. For Pathfinder, we chose the Threshold Network Type to limit the inter-term relevance of conceptual terms, ensuring that only those conceptual terms that are highly relevant to the main idea of the topic are highlighted. If a conceptual term is unrelated to the main idea, it does not appear as connected to other conceptual terms.
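The threshold-network idea (a simplification of the full Pathfinder algorithm) can be sketched as follows, with a hypothetical cosine matrix standing in for the LSA-derived one:

```python
def threshold_network(terms, cosine_matrix, threshold):
    """Keep only edges whose cosine similarity meets the threshold.

    A simplified threshold-network sketch (not the full Pathfinder
    algorithm): terms unrelated to the others end up with no edges.
    """
    edges = []
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            if cosine_matrix[i][j] >= threshold:
                edges.append((terms[i], terms[j]))
    return edges

# Hypothetical pairwise cosine similarities among three terms.
terms = ["oxygen", "carbon dioxide", "photosynthesis"]
sims = [[1.0, 0.8, 0.2],
        [0.8, 1.0, 0.1],
        [0.2, 0.1, 1.0]]
threshold_network(terms, sims, 0.5)
# [('oxygen', 'carbon dioxide')] -- photosynthesis is isolated,
# mirroring the situation in Figure 4.
```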
Figure 4 shows that the term photosynthesis is isolated from all other terms in the fifth-grade text “Oxygen and carbon dioxide,” because photosynthesis is only referred to as a process that utilizes oxygen and carbon dioxide and is not the focus of the text. By contrast, in Figure 5 because the main topic of the seventh-grade text “How plants produce nutrients” is photosynthesis, we can see that the term photosynthesis is related to other terms. These results once again highlight the problem described in Figure 1 and help confirm what is shown in Figure 3 that despite being present in both fifth- and seventh-grade texts, the proper difficulty of the term photosynthesis is Grade 7.
We then utilized the GGally package of the R software (Schloerke Reference Schloerke2011) to integrate the conceptual spaces of the two texts analyzed above. For the convenience of comparing the relatedness of the two texts, we set the threshold to include only highly relevant conceptual terms. The result is illustrated in Figure 6. Since the fifth and seventh grades share the concept of photosynthesis, the two conceptual spaces, following a spiral curriculum design, can be combined to form a hierarchical conceptual space, in which knowledge expands both in breadth (conceptual scope: the connections between the conceptual terms of a similar topic) and in depth (conceptual difficulty: the depth of the concepts in the conceptual hierarchy). Through the merging of conceptual spaces, the underlying connections and relations of the conceptual terms emerge. From Figure 6 we can see that the fifth-grade term humans is related to the seventh-grade word glucose through photosynthesis, and is even directly connected to the seventh-grade word starch. Such connections illustrate that the hierarchical conceptual space, in which all the levels of knowledge are aggregated, not only allows knowledge to be vertically integrated but can even reveal hidden relationships between domain-specific concepts and help students understand, at a glance, why plants are the foundation of the food chain. Additionally, educators can use this method to observe the relationships between conceptual terms across similar articles, allowing them to prepare teaching materials or add supplementary materials as appropriate.
The example above indicates that the difficulty level of domain-specific texts can be represented by the number and difficulty of conceptual terms which can be transferred to a conceptual space corresponding to grade level. This article therefore hypothesizes that if conceptual space can be effectively matched with the appropriate grade level, then these spaces can be used to estimate the difficulty level of any text in terms of grade level.
After building a hierarchical conceptual space, a text can be matched against it to determine whether the text contains conceptual terms from specific grades. In contrast to the study by Chang et al. (Reference Chang, Sung and Lee2013), which predicts domain-specific readability through a separate mapping of the main concepts of a text with the grade levels, the current study extends their work by treating the distribution of conceptual terms as a relational pattern (i.e., a grade-level vector, described below) so that the developed readability features are more intelligible and provide meaningful interpretation. The relational pattern is formed through the following steps. First, the conceptual terms of each text are tagged for difficulty. The tagged terms at each difficulty level (grade) are then counted, with each term counted only once per grade, resulting in a distribution of the text’s terms across the grade-level difficulties. The distribution values form a vector, which we call the text’s grade-level vector. As shown in Figure 7, grade-level vectors reveal the grade-level distribution of the conceptual terms in a text, making it much easier to explain why a text is placed at a certain grade level. In sum, our study developed grade-level vectors as a readability feature, and a prediction model incorporating this feature was trained by machine learning. The performance enhancement of the proposed model over the model by Chang et al. (Reference Chang, Sung and Lee2013) was investigated. In addition, this study combined grade-level vectors with common general linguistic features in order to analyze the impact of the former on readability models.
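The construction of a grade-level vector described above can be sketched as follows; the term-to-grade tags are hypothetical examples:

```python
def grade_level_vector(text_terms, term_grades, grades=range(3, 10)):
    """Compute a text's grade-level vector.

    Each conceptual term found in the text is counted once at its
    tagged grade level; the counts per grade form the vector.
    """
    found = {t for t in text_terms if t in term_grades}  # each term counted once
    return [sum(1 for t in found if term_grades[t] == g) for g in grades]

# Hypothetical difficulty tags from a hierarchical conceptual space.
tags = {"oxygen": 5, "humans": 5, "photosynthesis": 7, "glucose": 7}
text = ["oxygen", "photosynthesis", "oxygen", "glucose", "water"]
grade_level_vector(text, tags)
# -> [0, 0, 1, 0, 2, 0, 0] for grades 3-9: one fifth-grade term
#    (oxygen, counted once despite two occurrences) and two
#    seventh-grade terms (photosynthesis, glucose).
```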
4. Study 1: Constructing a readability model for domain-specific texts with general linguistic features as a baseline study
Study 1 examined the effectiveness of a general-linguistic-feature-based model in predicting the readability of domain-specific texts. The performance of this model also served as a comparison baseline for the model performances in Studies 2 and 3.
4.1 Methods
4.1.1 Materials
The experimental materials for this study were adopted from the third- to ninth-grade textbooks published in 2009 by three major publishers in Taiwan, Nan I (Reference Nan2009), Han Lin (Reference Han2009), and Kang Hsuan (Reference Kang2009), and included 1441 social science articles and 772 natural science articles. Each article was an independent lesson from one of these textbooks. Each article, in addition to the main content of the text, also contained headings, descriptive text, punctuation marks, and the descriptions of tables/figures. Homework exercises, guided learning questions, and extracurricular content were not considered. To ensure the accuracy of our articles, we did not use optical character recognition technology to digitize the textbook articles. Instead, we manually typed the content into unformatted plain text files and manually combed through the files for errors. The distribution of text lengths among grade levels can be seen in Table 1. These textbooks were written and edited according to the knowledge/skill levels formulated in the curriculum standards established by the Ministry of Education, Taiwan, and are therefore representative of the average knowledge background of general students from the third to ninth grades. The interested reader can refer to the general guidelines for more detailed information about the curriculum guidelines of 12-year basic education (Ministry of Education 2014).
4.1.2 Procedure
The experimental procedure of this study is shown in Figure 8 and is explained in the following subsections.
4.1.2.1 Preprocessing
Segmentation is one of the most basic and important procedures in the preprocessing of Chinese texts. Its main function is to ensure that terms are accurately extracted and assigned the correct part-of-speech according to sentence structure. This study employed the WECAn parser, which was trained on the Sinica Corpus 4.0 and has a word segmentation accuracy of 93% (Sung et al. Reference Sung, Chang, Lin, Hsieh and Chang2016a), to parse the Chinese texts for the subsequent experimental procedure.
4.1.2.2 General linguistic feature calculation
This study used CRIE, an automatic analysis system for text readability indices (Sung et al. Reference Sung, Chang, Lin, Hsieh and Chang2016a), to compute the general linguistic features of the social and natural science textbook articles. The 24 general linguistic features employed are the same as those used for the readability model developed by Sung et al. (Reference Sung, Chen, Lee, Cha, Tseng, Lin, Chang and Chang2013), which cover four levels: the lexical (e.g., number of characters and words), semantic (e.g., number of content words), syntactic (e.g., average sentence length), and cohesion (e.g., number of pronouns) levels (see Appendix for details). These 24 general linguistic features have been shown to be reasonably accurate (72.92%) when applied to leveling articles in Chinese literature textbooks. Thus, in the current study these 24 general linguistic features were used to train the readability models for social and natural science textbooks to see whether they are also suitable for analyzing the readability of domain-specific texts.
4.1.2.3 Training and validating the readability model
Feng et al. (Reference Feng, Jansche, Huenerfauth and Elhadad2010), Petersen and Ostendorf (Reference Petersen and Ostendorf2009), and Sung et al. (Reference Sung, Chen, Cha, Tseng, Chang and Chang2015a) have all demonstrated that the performance of machine learning based on support vector machines (SVM) is superior to that of traditional readability models (i.e., regression models). Therefore, this study used the LIBSVM software (Chang and Lin Reference Chang and Lin2011) to train the readability model and fivefold cross-validation to verify model effectiveness. For the fivefold cross-validation, the experimental materials were divided into five subsets: Four were used for training the models, and the fifth was used for testing (i.e., validating the model). The process was repeated five times, with each subset used once as the testing data. The accuracy of the model was calculated as the average of the five results.
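The fivefold procedure can be sketched generically as follows; the trivial majority-class trainer is only a stand-in for the actual LIBSVM classifier:

```python
def five_fold_accuracy(samples, labels, train_fn, k=5):
    """Average accuracy over k-fold cross-validation.

    train_fn(train_x, train_y) must return a predict(x) function;
    any classifier (e.g., an SVM) can be plugged in.
    """
    folds = [list(range(i, len(samples), k)) for i in range(k)]
    accuracies = []
    for test_idx in folds:
        train_idx = [i for i in range(len(samples)) if i not in test_idx]
        predict = train_fn([samples[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        correct = sum(predict(samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

# A trivial majority-class "classifier" just to exercise the loop.
def majority_trainer(xs, ys):
    majority = max(set(ys), key=ys.count)
    return lambda x: majority

data = list(range(10))
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
acc = five_fold_accuracy(data, labels, majority_trainer)
```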
4.2 Results
The results of the experiment and the classification error matrices are given in Tables 2 and 3.
Tables 2 and 3 show that the prediction accuracy for social science texts is 55.10%, while the accuracy for natural science texts is even lower, at a mere 49.35%. If we allow a plus/minus one level error in the calculation of accuracy (McNamara, Crossley, and Roscoe Reference McNamara, Crossley and Roscoe2013; Sung et al. Reference Sung, Lin, Dyson, Chang and Chen2015b), the resulting adjacent accuracies for social science and natural science texts are 79.32% and 81.99%, respectively. These results echo past research findings that readability models using general linguistic features do not perform well when used to predict the readability of domain-specific texts (Collins-Thompson Reference Collins-Thompson2014).
5. Study 2: Constructing a readability model for domain-specific texts through the hierarchical conceptual space as another baseline study
Study 2 used the hierarchical conceptual space to calculate the difficulty distribution of conceptual terms in a domain-specific text and then assigned the text the grade value corresponding to the difficulty level at which most terms were concentrated. This method was then checked for its accuracy in predicting text readability, and its performance served as a baseline for the model performance in Study 3.
5.1 Methods
5.1.1 Materials
The materials were the same as in Study 1.
5.1.2 Procedure
This method, extracting grade-level vectors through the hierarchical conceptual space put forward by Chang et al. (Reference Chang, Sung and Lee2013), reveals the difficulty distribution of conceptual terms in a domain-specific text. The model tested in Study 2 was dubbed “grade-level vector majority vote,” henceforth abbreviated as GLVMV. The experimental procedure for this study is shown in Figure 9 and is explained in the following subsections.
5.1.2.1 Preprocessing
The preprocessing was the same as in Study 1.
5.1.2.2 Generating hierarchical conceptual space for social and natural science texts
We used the method proposed by Chang et al. (Reference Chang, Sung and Lee2013), which was described in Section 3, to obtain the hierarchical conceptual space from training data set for the social and natural science texts.
5.1.2.3 Calculating and validating grade-level vectors that predict the grade level of domain-specific texts
After building a hierarchical conceptual space, we used it to calculate the grade-level vector of every text (as described in Section 3). The model then identified the grade level that possessed the largest number of conceptual terms in the grade-level vector of a text and assigned that grade level of difficulty to the text (hence the name “grade-level vector majority vote,” GLVMV). In Figure 9, for example, the largest value in the grade-level vector is the nine conceptual terms belonging to the eighth grade. The numbers of conceptual terms belonging to the other grades were all less than nine (e.g., only five conceptual terms in the fifth grade). As a result, we assigned the article a difficulty level of eighth grade. This method assumes that the grade level of a domain-specific text can be traced through the concentration of conceptual terms that characterize the knowledge typically acquired at that school age. In the following, we tested this assumption and assessed the effectiveness of our model using fivefold cross-validation. The validation procedure was the same as described in Study 1 (Sung et al. Reference Sung, Chen, Cha, Tseng, Chang and Chang2015a).
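The GLVMV decision rule reduces to an argmax over the grade-level vector; the vector below mirrors the Figure 9 example (nine eighth-grade terms, five fifth-grade terms), with the remaining counts chosen arbitrarily:

```python
def glvmv(grade_vector, grades=range(3, 10)):
    """Grade-level vector majority vote: return the grade holding
    the largest number of conceptual terms."""
    best = max(range(len(grade_vector)), key=lambda i: grade_vector[i])
    return list(grades)[best]

# Counts of conceptual terms for grades 3-9, as in the Figure 9 example.
vector = [1, 2, 5, 3, 4, 9, 2]
glvmv(vector)  # -> 8, since the eighth grade holds the most terms
```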
5.2 Results
The results of fivefold cross-validation of the conceptual space model for predicting the difficulty levels of 1441 social science articles and 772 natural science articles are presented in Tables 4 and 5.
Tables 4 and 5 show that the prediction accuracy was 46.22% for social science texts and 70.98% for natural science texts. We used McNemar’s test (McNemar Reference McNemar1947) to test the statistical significance of the accuracy differences between Study 1 and Study 2. For both social science and natural science texts, the prediction accuracy rates were significantly different: $\chi^2(1) = 23.443$, p < .001, and $\chi^2(1) = 82.751$, p < .001, respectively.
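For reference, McNemar’s statistic on paired classification outcomes can be computed from the discordant counts alone; the counts below are hypothetical, not the study’s data:

```python
def mcnemar_chi_square(b, c):
    """McNemar's chi-square statistic (1 df, with continuity correction).

    b = cases one model classifies correctly and the other does not;
    c = the reverse. Only these discordant pairs enter the statistic.
    """
    return (abs(b - c) - 1) ** 2 / (b + c)

# Hypothetical discordant counts for two readability models.
chi2 = mcnemar_chi_square(40, 12)
# chi2 > 3.84 would indicate p < .05 at one degree of freedom.
```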
These results provide empirical support for the following. First, using the LSA technique to construct a hierarchical conceptual space in which the difficulty levels of conceptual terms are indexed provides an effective framework for representing the difficulty of domain-specific texts. Second, compared to the general-linguistic-feature-based model, the accuracy of predicting natural science text levels based on this model increased substantially. However, there is still room for improvement when it comes to social science texts.
In our error analysis of a large quantity of misclassified texts, we discovered that the grade level estimated by the maximum value of a text’s grade-level vector is not always a good judge of its readability. Table 6 shows an example from a seventh-grade natural science textbook article in which the numbers of conceptual terms falling into the difficulty levels of grades 6 and 7 are close, which could lead the text to be categorized into the sixth grade instead of the seventh. From this we can see that when domain-specific texts describe pieces of knowledge, they use a large number of conceptual terms. If this feature can be utilized alongside a superior classification strategy, it will further bolster the overall readability model’s performance. We discuss this topic further in Study 3.
6. Study 3: Constructing and validating readability models using either hierarchical conceptual space or bag-of-words-based features for domain-specific texts
Although Studies 1 and 2 provided empirical evidence that hierarchical conceptual space predicts the readability of natural science texts better than general linguistic features do, there was a critical methodological difference between the two studies. In Study 1, a machine learning approach (i.e., the SVM classifier) was used for training the readability model, while Study 2 employed the maximum value of the grade-level vector. Echoing the machine learning method of Study 1, in Study 3 we augmented the conceptual-space-based model of Study 2 with the SVM. This approach allows the influence of different classification strategies (e.g., using the general linguistic features, the LSA-based hierarchical conceptual space, and the machine learning algorithms) on classification accuracy to be observed (Sung et al. Reference Sung, Chen, Cha, Tseng, Chang and Chang2015a).
To expand model capabilities, in Study 3 we experimented with optimizing the readability models using additional NLP techniques. First, we combined grade-level vectors with general linguistic features and then used SVM to train the readability model. It is of particular interest to determine whether the combination of different features enhances the efficacy of the readability model in tackling domain-specific texts. We also used two bag-of-words-based features to train separate SVM models. The first bag-of-words feature, term frequency (TF), is often seen in natural language processing. The other, Word2Vec, was released by Google and has quickly gained in popularity. We trained Word2Vec’s Continuous Bag-of-Words (CBOW) and Skip-gram models on a separate set of training data to derive word vectors. We then summed the vectors of the words in an article and averaged the summed vector to represent the article (Le and Mikolov Reference Le and Mikolov2014). The Word2Vec-based readability models were then also trained by SVM.
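The article-vector construction (summing and averaging word vectors) can be sketched as follows; the tiny embeddings are hypothetical stand-ins for trained Word2Vec vectors:

```python
import numpy as np

def article_vector(words, word_vectors):
    """Represent an article by the average of its word vectors.

    word_vectors maps each word to a fixed-size embedding, e.g., one
    trained with Word2Vec (CBOW or Skip-gram); out-of-vocabulary
    words are skipped.
    """
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    return np.mean(vecs, axis=0)

# Tiny hypothetical 3-dimensional embeddings.
emb = {"plant": np.array([1.0, 0.0, 2.0]),
       "oxygen": np.array([3.0, 2.0, 0.0])}
v = article_vector(["plant", "oxygen", "unknown"], emb)
# v == [2.0, 1.0, 1.0]
```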
6.1 Methods
6.1.1 Materials
The materials were the same as in Study 1.
6.1.2 Procedure
The experimental procedure for Study 3 is shown in Figure 10 and is explained in the following subsections.
6.1.2.1 Preprocessing
The preprocessing was the same as in Study 1.
6.1.2.2 Generating hierarchical conceptual space for social and natural science texts
The generation of the hierarchical conceptual space was the same as in Study 2.
6.1.2.3 Training and validating the readability model
The procedure of extracting the grade-level vectors of a text was the same as was done for Study 2. After obtaining the grade-level vectors, this study used the LIBSVM (Chang and Lin Reference Chang and Lin2011) to train the model. The procedure of fivefold cross-validation was the same as in Study 1 (Sung et al. Reference Sung, Chen, Cha, Tseng, Chang and Chang2015a).
6.2 Results
The results of fivefold cross-validation of the expanded readability models for predicting the difficulty level of 1441 social science articles and 772 natural science articles are presented in Tables 7 and 8.
Tables 7 and 8 show that when the model using grade-level vectors was augmented with SVM machine learning, its prediction accuracies for social science and natural science texts improved from 46.22% and 70.98% to 68.98% and 73.96%, respectively. The accuracy rates of the combined model (i.e., grade-level vector + general linguistic features + SVM) for social science and natural science texts achieved the even higher accuracy rates of 86.68% and 75.91%, respectively, which was the best performance among all the models. The McNemar test results for each readability model are given in Tables 9 and 10, showing that the combined model’s performance was significantly different from all other models with most of the p-values being under .001.
*p < .001
* p <.05, ** p < .01, *** p < .001
Furthermore, after applying the different classification strategies, we can see that the example article in Table 6, misjudged as belonging to the sixth grade in Study 2, was correctly classified as belonging to the seventh grade in Study 3. The enhancing effect of the SVM indicates the importance of taking into consideration the difficulty-level distribution of all (rather than individual) conceptual terms in an article, as the hallmark of machine learning is the ability to discern meaningful patterns from seemingly unrelated information. The results of Study 3 also suggest that general linguistic features contribute more to classifying social science texts than natural science texts, as reflected by the greater improvement of the model for social science texts when the general linguistic features were incorporated.
When the Word2Vec models and the grade-level vector model (Study 3) were trained with SVM using vectors of the same seven dimensions, the Word2Vec models (whether Skip-gram or CBOW) performed better than the grade-level vector model for social science texts, but the latter performed slightly better for natural science texts. Compared with the TF model, the grade-level vector model (Study 3) showed superior accuracy for both social and natural science texts. This is especially noteworthy given that the huge dimensionality of the TF model makes training readability models with it extremely time-consuming.
Overall, these accuracies demonstrate that grade-level vectors, whether applied to social or natural science texts, are a viable method of training readability models. To further explain, from Figure 7 we can see that the vocabulary difficulty distributions made by grade-level vectors are easy for people to understand and to apply to readability assessments. These distributions can also be provided to editors to help develop editing guidelines for new texts. In contrast, Word2Vec and TF cannot offer such assistance.
In recent years, intelligibility has become a vital concern in machine learning. In certain applications, such as health care (Caruana et al. Reference Caruana, Lou, Gehrke, Koch, Sturm and Elhadad2015), education (Chang and Sung Reference Chang, Sung, Lu and Chen2019; Hsu et al. Reference Hsu, Lee, Chang and Sung2018; Lin et al. Reference Lin, Chen, Chang, Lee and Sung2019; Lu and Chen Reference Lu, Chen, Lu and Chen2019; Lee, Chang, and Tseng Reference Lee, Chang and Tseng2016), and speech recognition (Chen and Hsu Reference Chen, Hsu, Lu and Chen2019), the intelligibility of a model may far outweigh its accuracy, since it can provide helpful feedback for users (Chang, Sung, and Hong Reference Chang, Sung and Hong2015; Sung et al. Reference Sung, Liao, Chang, Chen and Chang2016b). The reason is obvious: in order for the end-users of a system to trust and act upon its predictions, they need to understand what they are being told (Ribeiro, Singh, and Guestrin Reference Ribeiro, Singh and Guestrin2016; Samek, Wiegand, and Müller Reference Samek, Wiegand and Müller2017). As for text readability, a model lacking adequate explanatory ability would unfortunately be like a mysterious black box whose operation remains unknown and inexplicable. In an attempt to develop readability models that are not only effective but also interpretable, our study shows that using grade-level vectors to capture the distribution of conceptual-term difficulty across grades can help readers determine whether readability predictions are reasonable and also help researchers further use or improve readability features.
7. Discussion
As in all traditional readability formulas, the general linguistic features used in Study 1 only account for information gleaned from shallow article structures, which cannot effectively reflect the difficulty inherent in a specific domain of knowledge. For example, in Chinese, morning glory and electromagnetic waves are both three-character words. However, students encounter morning glory in third grade, whereas they do not learn about electromagnetic waves until seventh grade. The difficulty of the two terms is clearly different, yet they are assigned the same difficulty by several general linguistic features. This is why general linguistic features are less discriminating when applied to domain knowledge texts. General linguistic features thus lead to readability models with low prediction accuracy and can even produce serious overestimation or underestimation of text readability (Begeny and Greene Reference Begeny and Greene2014).
Therefore, this study uses LSA to produce a hierarchical conceptual space as a feature for the readability model. Compared with predictions based on a set of general linguistic features trained with an SVM classifier, estimating the difficulty levels of social science texts using only grade-level vectors with the maximum value (GLVMV) achieved an accuracy of 46.22%, only 8.88% lower. For natural science texts, accuracy improved by 21.63% to reach 70.98%. Given that the model of Study 1 incorporated machine learning techniques while the model in Study 2 (i.e., GLVMV) was not assisted by any classification algorithm, the predictive power of the grade-level vectors is impressive. These experimental results suggest that a hierarchical conceptual space is capable of representing the knowledge structure of domain-specific texts, and that the difficulty distribution of knowledge-relevant terms can be revealed by the grade-level vectors. Our study also indicates that domain-specific texts involve specialized knowledge or domain-specific concepts such that these texts must employ relevant, specific terms in large numbers when describing or explaining a concept. In sum, the results of Study 2 show that grade-level vectors are quite accurate reflections of the difficulty of domain-specific texts.
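The GLVMV strategy described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names and the toy two-dimensional LSA vectors are hypothetical, and we assume each conceptual term is assigned to the grade whose grade-level vector it is most cosine-similar to, with the text's predicted level being the grade holding the maximum term count.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two dense LSA vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def glvmv_predict(term_vectors, grade_vectors, grades):
    """Sketch of GLVMV: assign each conceptual term to its most similar
    grade-level vector, then predict the text's grade as the one with
    the maximum value (i.e., the most assigned terms)."""
    counts = {g: 0 for g in grades}
    for v in term_vectors:
        sims = [cosine(v, grade_vectors[g]) for g in grades]
        counts[grades[int(np.argmax(sims))]] += 1
    # Ties resolve to the earliest (lowest) grade in the list.
    return max(counts, key=counts.get)
```

In a real setting the vectors would come from the LSA projection of the graded corpus; here they simply show the argmax-and-count mechanics.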
Comparing the experimental results of Studies 2 and 3, we found that although both use the hierarchical conceptual space as the readability feature, applying a machine learning algorithm ultimately led to gains in classification accuracy: 22.76% for the social science texts (reaching 68.98%) and 2.98% for the natural science texts (reaching 73.96%). These results show that although grade-level vectors quite accurately reflect the conceptual difficulty of domain knowledge, matching the vector’s maximum value with a text’s grade level is not the best classification strategy, because this method overlooks the difficulty levels of the other conceptual terms in the text, and the readability of a text should be determined by all of the domain-specific concepts that it contains.
Comparing the experimental results of Studies 1 and 3, we found that, with the same machine learning algorithm, the accuracy of a readability model that uses grade-level vectors as its feature is higher than that of a model that uses general linguistic features. For social science texts, the former exceeds the latter by 13.88%, achieving an accuracy of 68.98%; for natural science texts, the difference is an even larger 24.61%, reaching a high accuracy of 73.96%. These results show that for assessing domain-specific texts, the method of constructing and applying grade-level vectors proposed by the current study is more suitable than using general linguistic features.
A comparison of the readability models in the three studies indicates that their performances vary greatly. For social science texts, the highest accuracy is 86.68%, which is 40.46% higher than the lowest accuracy of 46.22%. For natural science texts, the highest accuracy is 75.91%, which is 26.56% higher than the lowest accuracy of 49.35%. These results suggest that the selection of readability features and model training methods has a direct impact on the performance of readability models. In this study, the readability model that combines grade-level vectors and general linguistic features performed even better than those that use the word vectors created by Word2Vec as a feature.
In the research by Begeny and Greene (Reference Begeny and Greene2014), the Dale–Chall formula (Chall and Dale Reference Chall and Dale1995), which uses a word list of 3000 familiar words, performed the best of all the traditional readability models they tested, achieving an accuracy rate of 41.66%. The compared models were Flesch–Kincaid (Flesch Reference Flesch1948), FOG (Gunning Reference Gunning1952), Forcast (Sticht Reference Sticht1973), Fry (Fry Reference Fry1968), Lexile (Stenner Reference Stenner1996), PSK (Powers, Sumner, and Kearl Reference Powers, Sumner and Kearl1958), SMOG (McLaughlin Reference McLaughlin1969), and Spache (Spache Reference Spache1953). Begeny and Greene (Reference Begeny and Greene2014: 13) believe this is because the “percentage of high frequency words may be a good gauge of text difficulty.” Word frequency can thus be used to measure the difficulty of vocabulary, which highlights the impact that word difficulty has on the overall readability of an article. In contrast to the Dale–Chall formula, which uses word frequency to construct its word list, this article uses a hierarchical conceptual space to generate the difficulty of conceptual terms and found, through multiple studies, that the resultant readability model is superior to the Dale–Chall formula, outperforming it by 27.31% to achieve an accuracy of 68.98% (social science model) and by 32.29% to reach 73.96% (natural science model). This shows that the hierarchical conceptual space proposed by this study is superior to solely using a word list derived from word frequency.
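For readers unfamiliar with the familiar-word-list approach, the following sketch implements the original (1948) Dale–Chall regression; note that the 1995 revision cited above maps a similar raw score through a lookup table rather than this formula, and the tokenization here is deliberately simplified.

```python
def dale_chall_score(words, num_sentences, familiar):
    """Original (1948) Dale-Chall reading-grade score.
    words: list of tokens; familiar: set of ~3000 familiar words.
    A word absent from the familiar list counts as 'difficult'."""
    difficult = [w for w in words if w.lower() not in familiar]
    pdw = 100.0 * len(difficult) / len(words)   # percent difficult words
    asl = len(words) / num_sentences            # average sentence length
    score = 0.1579 * pdw + 0.0496 * asl
    if pdw > 5.0:                               # adjustment constant
        score += 3.6365
    return round(score, 2)
```

The contrast with the grade-level-vector approach is visible in the inputs: the formula sees only list membership and sentence length, with no representation of what a difficult term means in its domain.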
Regarding the comparison of hierarchical conceptual space with other bag-of-words methods, such as TF and Word2Vec, we found that the grade-level vector model (i.e., grade-level vector + SVM) performed slightly better than the other methods (i.e., TF + SVM; Word2Vec + SVM) for natural science texts but not for social science texts. One possible reason is that the natural science texts have more topics with a hierarchical conceptual structure, as shown in Figure 3. The conceptual hierarchies in social science, however, may not be as obvious as those in natural science; therefore, the hierarchical conceptual space and the bag-of-words methods perform similarly in predicting the readability levels of social science texts. Further evaluation based on more types of texts will be needed to compare the effectiveness of the two methods for predicting social science text readability.
8. Conclusions and implications
How well does a general-linguistic-feature-based readability model predict the difficulty level of domain-specific text? Our research provided empirical support for the idea that employing common general linguistic features is not suitable for use on domain-specific texts, because the accuracy rate of prediction was less than 60%. In contrast, a method that used a hierarchical conceptual space, which represented the domain-specific knowledge learned by students in different learning stages, effectively estimated the readability of domain-specific texts, outperforming the general-linguistic-feature-based model by 13.88% and 24.61% to achieve accuracies of 68.98% and 73.96% for social science and natural science texts, respectively. The model combining grade-level vectors with general linguistic features outperformed a general-linguistic-features-only model by 31.58% and 26.56% to achieve accuracy rates of 86.68% and 75.91% for social science and natural science texts, respectively. This indicates that the readability features presented in this paper are not only suitable for representing domain-specific texts, but can also complement linguistic features that are commonly used for general text analysis. In other words, readability models trained only on common general linguistic features are not suitable for assessing the readability of domain-specific texts. However, when combined with suitable readability features, such as grade-level vectors, the general linguistic features can indeed enhance the performance of readability models. The findings above have the following implications for further research and practices.
Firstly, hierarchical conceptual space, which was extracted from the knowledge base of students in different learning stages, is an appropriate and valid tool for assessing the readability/difficulty levels of domain-specific texts and can be combined with machine learning algorithms to predict the readability of both social and natural science texts with good performance. A hierarchical conceptual space can not only serve as a readability feature, but, through the use of data visualization software, also present a text’s conceptual space diagrammatically. This could serve as a tool for instructors or students when working through a text and could potentially supplement any outlines or summaries provided for the text.
Secondly, teachers, book editors, and publishers should consider combining the linguistic-features-based and conceptual-space-based approaches to readability modeling when leveling their teaching/learning materials, especially when the targeted materials are domain specific or the targeted learners are content-area readers.
Thirdly, since the readability models proposed by our research are not equally effective for texts in different domains and are constructed based only on reading materials for third to ninth graders, future research can address the issue of model generalizability by applying the model to texts in more domains, or by expanding the grade range to higher-level learners.
Finally, there is certainly room to improve this research in its effort to refine readability assessment models and to better explain the distribution of domain-specific concepts in text. As a case in point, our model’s accuracy in predicting sixth-grade social science text difficulty lagged behind that of the other grades (see Tables 4 and 7). Upon examination, we found that these sixth-grade texts covered a wide range of topics. For example, one of them introduces the world of ancient civilizations and discusses cultural, economic, military, and religious issues, covering river valley civilizations and Egyptian, Greek, Roman, Mayan, Indian, Islamic, and Chinese cultures, to name a few. The broad subject of this text meant that many of its conceptual terms were not classified into the sixth grade. For example, the term democratic politics, while mentioned within this text, is fully discussed in 23 eighth- and ninth-grade articles, so the concept of democratic politics was highly related to the eighth and ninth grades. The readability model ultimately predicted this sixth-grade text as a ninth-grade one, because it used many conceptual terms the model had labeled as ninth-grade concepts. How to more accurately predict the difficulty levels of texts with such broad topics is a subject worthy of future research. Another potential avenue for improvement is to change the grade-level vectors proposed herein to soft assignment (e.g., retain the complete cosine similarities between all conceptual terms and grade levels) to generate the difficulty distribution of conceptual terms. This may retain even more information which, in turn, could help a readability model classify texts that cover a variety of topics.
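The soft-assignment idea above can be sketched as follows. This is only an illustration of the proposed extension, not an implementation from the paper: the function name is hypothetical, and we assume non-negative cosine similarities so that the averaged profile can be normalized into a distribution.

```python
import numpy as np

def soft_grade_distribution(term_vectors, grade_vectors):
    """Soft assignment: keep each term's full cosine-similarity profile
    across all grade-level vectors (instead of only its argmax), then
    average the profiles into a text-level difficulty distribution."""
    mat = np.stack(grade_vectors)                        # (grades, dim)
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    profiles = []
    for v in term_vectors:
        v = v / np.linalg.norm(v)
        profiles.append(mat @ v)                         # cosine per grade
    dist = np.mean(profiles, axis=0)
    return dist / dist.sum()                             # normalize to sum 1
```

A text whose terms spread across many grades yields a flat distribution rather than a single hard label, which is exactly the extra information hard argmax assignment discards for broad-topic texts.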
In the future, we hope to develop even more powerful readability models. Currently, our proposed readability model can accurately rate the difficulty levels of texts within a single domain. In future research, we aim to design models with which articles from a wide range of domains can be pooled and rated simultaneously. Development of such a general-purpose readability model could benefit from the incorporation of more domain-specific features (e.g., grade-level vectors for different domains) and generic readability features (e.g., Word2Vec and GloVe (Pennington, Socher, and Manning Reference Pennington, Socher and Manning2014)), advances in NLP techniques, and a better understanding of the reading process (Crossley et al. Reference Crossley, Skalicky, Dascalu, McNamara and Kyle2017).
Acknowledgements
The collection of empirical data was supported by the Ministry of Science and Technology (MOST-107-2511-H-003-022-MY3; MOST-108-2622-8-003-002-TS1) and the Higher Education Sprout Project of the Ministry of Education, Taiwan.