1. Introduction
Unlike English and other Western languages, many Asian languages such as Chinese and Japanese do not delimit words by spaces. Word segmentation and new word detection are therefore key steps for processing these languages. Chinese word segmentation can be cast as a sequence tagging problem similar to part-of-speech (POS) tagging: we can segment a corpus by assigning a label to each character that indicates the position of the character in a word (e.g., “B” for the beginning of a word, “E” for the end of a word, etc.). Chinese word segmentation is a well-studied problem. Machine learning models such as the conditional random field (CRF) (Lafferty, McCallum, and Pereira Reference Lafferty, McCallum and Pereira2001) and the bi-directional long short-term memory (LSTM) network have shown outstanding performance on this task. However, segmentation accuracy drops significantly when the same approaches are applied to out-domain cases, in which a high-quality in-domain training set is not available (Zhang et al. Reference Zhang, Sun and Zhou2012a). An example of such out-domain applications is new word detection in Chinese microblogs, for which high-quality corpora are limited. In this paper, we focus on out-domain Chinese new word detection. We first design a new method, Edge Likelihood (EL), for Chinese word boundary detection. Then we propose a domain-independent CRF-based Chinese word segmenter named DICND; each Chinese character is represented as a low-dimensional vector in the proposed framework, and segmentation-related features of the character are used as the values in the vector.
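To make the character-labeling view concrete, the sketch below converts a segmented sentence into per-character position tags. It is an illustrative example only: it uses the common four-tag B/M/E/S scheme with placeholder Latin letters standing in for Chinese characters, whereas the tag set actually used by DICND is the five-tag set listed later in Table 3.

```python
# Minimal sketch: segmentation as character tagging with a B/M/E/S scheme.
# The tag set used by DICND is the five-tag set in Table 3; this illustrative
# version uses the common four tags and placeholder "characters" only.
def words_to_tags(words):
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")                                # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])  # begin/middle/end
    return tags

# Hypothetical segmented sentence of three words: "AB", "C", "DEF".
print(words_to_tags(["AB", "C", "DEF"]))   # ['B', 'E', 'S', 'B', 'M', 'E']
```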
2. Related work
The existing Chinese new word detection approaches can be divided into two categories, namely the supervised approach and the unsupervised approach.
2.1 Supervised Chinese new word detection
The supervised approach considers Chinese new word detection as a sub-task of Chinese word segmentation and solves the Chinese word segmentation problem by sequence labeling. In the past decade, CRF and neural networks have been two of the most popular methods for supervised Chinese word segmentation. In CRF-based Chinese word segmenters (e.g., Peng, Feng, and McCallum (Reference Peng, Feng and McCallum2004)), each character is represented as a one-hot vector. The CRF then takes the one-hot vector as the input and labels Chinese sentences according to the transition probabilities of the labels in the training set. However, in a one-hot character vector, the attributes of the vector are the vocabulary of the training set plus n-gram lexicon features (i.e., if 3-grams are used, any three-character sequence in the training corpus becomes an attribute and takes a dimension of the vector). A boolean value is assigned to each attribute to indicate whether the character is part of a word related to that attribute. Thus, this kind of n-gram representation is sparse, high dimensional, and biased toward known words. As a consequence, the one-hot character vector constrains the capability of CRF in Chinese new word detection, especially when the training data and test data are in different domains. Intuitively, the out-domain problem can be addressed by using a domain-specific dictionary. Wang et al. (Reference Wang, Wong, Chao and Xing2012) applied a CRF-based Chinese word segmenter to Chinese microblog data and handled the out-domain problem by leveraging an external word list of popular Internet slang. But this method does not work for Chinese new word detection. Another approach is to utilize domain adaptation techniques or to add domain-adaptive features to the character representation. For instance, Liu et al. (Reference Liu, Zhang, Che, Liu and Wu2014) applied domain adaptation techniques to Chinese word segmentation; Zhang et al. (Reference Zhang, Deng, Che and Liu2012b) designed a set of features to indicate the length of the domain-specific words which contain the current character. Xia et al. (Reference Xia, Li, Chao and Zhang2016) used features similar to those of Zhang et al. (Reference Zhang, Deng, Che and Liu2012b) but employed a large-scale external lexicon when generating extra lexicon features. Leng et al. (Reference Leng, Liu, Wang and Wang2016) further designed more features for character representation. The features include a reduplication feature, which indicates whether the character is in a reduplicated form of a word (e.g., “” (great) and “” (sound of laughter)), and a conditional entropy feature, defined by the entropy of all the characters that follow or precede the current character in the given corpus (Gao and Vogel Reference Gao and Vogel2010), etc. Nevertheless, the feature-augmented methods use the traditional one-hot vector as the base of the character representation; the huge dimensionality of the one-hot vector still dominates the segmentation results and makes the effect of the domain-specific features minimal. On the other hand, the neural-network-based methods boost the accuracy of Chinese word segmentation by using a large number of parameters in neural networks to fit the training data. For example, Zheng, Chen, and Xu (Reference Zheng, Chen and Xu2013) used a multilayer perceptron as the labeling engine. Chen et al. (Reference Chen, Qiu, Zhu, Liu and Huang2015) improved the segmentation accuracy by leveraging an LSTM neural network to capture the historical information in the Chinese sentences.
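As an illustration of the sparse n-gram representation discussed above, the sketch below builds a handful of typical character and character-bigram features for one position in a sentence. The feature templates are hypothetical stand-ins, not the exact templates of any cited segmenter; the point is that every distinct n-gram seen in training becomes its own boolean attribute, which makes the feature space sparse and high dimensional.

```python
# Sketch of the sparse n-gram feature representation used by CRF segmenters.
# Feature templates here are hypothetical; real systems use similar ones
# (current character, neighboring characters, character bigrams, ...).
def ngram_features(sentence, i):
    c = lambda k: sentence[i + k] if 0 <= i + k < len(sentence) else "<PAD>"
    return {
        "c0=" + c(0): 1,             # current character
        "c-1=" + c(-1): 1,           # previous character
        "c+1=" + c(1): 1,            # next character
        "c-1c0=" + c(-1) + c(0): 1,  # bigram crossing the left boundary
        "c0c+1=" + c(0) + c(1): 1,   # bigram crossing the right boundary
    }
# Each distinct n-gram seen in training becomes one boolean attribute,
# so the full vector space is huge and biased toward known words.
```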
Zhang, Zhang, and Fu (Reference Zhang, Zhang and Fu2016) integrated recurrent neural networks with the transition model of Zhang and Clark (Reference Zhang and Clark2007). Concretely, they used a neural network model to replace the discrete linear model of Zhang and Clark (Reference Zhang and Clark2007) for scoring transition action sequences. Qian, Qiu, and Huang (Reference Qian, Qiu and Huang2016) introduced a new evaluation metric for Chinese word segmentation, in which different words carry different weights. Cai and Zhao (Reference Cai and Zhao2016) proposed a gated combination neural network (GCNN) that decides how to mix the character vectors via two gates; GCNN then works with an LSTM to calculate a score for each sentence segmentation, and a beam search scheme is used to find the segmentation with the highest score. Cai et al. (Reference Cai, Zhao, Zhang, Xin, Wu and Huang2017) designed a greedy neural word segmenter (greedyCWS), which improves the GCNN model by keeping a short list of frequent words and deciding how to mix character vectors according to this list.
However, several issues prevent the neural network methods from detecting out-domain new words precisely. First of all, the performance of neural-network-based methods relies heavily on the quality of the training set, and a high-quality domain-specific training set is not always available, while out-domain applications keep increasing. For example, due to the increasing usage of and timely information on Twitter, identifying new words from Chinese tweets is now a critical application in many organizations, including companies and governments. High-quality labeled training sets for Chinese tweets do not exist, and new words emerge every day. Moreover, the high lexical variance in Chinese microblogs makes it difficult to design a domain-specific training set that covers most of the topics. Second, neural-network-based methods take continuous bag-of-words (CBOW) embeddings as the input. The rationale behind CBOW character embedding is to learn a 30–50-dimensional vector for each character in a large corpus through a neural network. Theoretically, the vectors can capture the grammatical and semantic meaning of the characters. Nevertheless, the characters in some new words are not grammatically or semantically related to each other. For example, person names and organization names are usually not formed according to the semantic meanings of their characters. In addition, the learned embedding of each Chinese character is identical in all contexts, so it is not context-aware. Moreover, the statistical information in the target data, which is important for out-domain new word detection, is not utilized by the CBOW character vector. Furthermore, the representation is not interpretable; it is hard to identify the problem when an error happens. Enhancements of the CBOW Chinese character embedding (e.g., using Chinese radicals) cannot solve these issues efficiently (Sun et al. Reference Sun, Lin, Yang, Ji and Wang2014).
2.2 Unsupervised Chinese new word detection
The unsupervised approach is purely data-driven; tagged training sets are not required. In unsupervised Chinese new word detection, the probability of a character sequence being a valid word is evaluated by the frequency distributions relevant to the character sequence. There are several assumptions behind unsupervised valid word detection; one of them is that if a given character sequence is a valid word, it should appear in different contexts. Accessor variety (AV) (Feng et al. 2005) is one of the methods that define word boundary probability according to this assumption. Assume there is a character sequence “” (doorknob), which is a valid word; we can find it in different contexts such as “” (the doorknob is broken), “” (need a new doorknob), “” (or repair this doorknob), and “” (this doorknob is pretty). In this case, there are three different preceding characters of “” (i.e., sentence start, “”, and “”), as well as four different succeeding characters (i.e., “”, sentence end, “”, and “”). AV() takes the minimum of these two values, that is, 3. An obvious drawback of AV is that it is affected by the number of occurrences of the character sequence: frequent character sequences often have high AV values since they have more chances to appear in different contexts. Branching entropy (BE) addresses this problem of AV by using the entropy of the conditional character distribution. Another assumption used in unsupervised Chinese word detection is that if a given character sequence is a valid word, its substrings will mainly co-occur with the character sequence. For example, assume the given character sequence is “” (amino acid), which is a valid word; its substrings (i.e., “”) should mainly co-occur with “”. Unsupervised Chinese word segmentation approaches developed based on this assumption include symmetric conditional probability (SCP) (Luo and Sun Reference Luo and Sun2003) and mutual information (MI) (Xue Reference Xue2003). Other unsupervised Chinese word segmentation methods include description length gain (DLG) (Kit and Wilks Reference Kityz and Wilks1999), which measures the word probability of a character sequence using techniques from information theory; Huang et al. (Reference Huang, Ye, Wang, Chen, Cheng and Zhu2014) tried to integrate different unsupervised approaches into a unified framework; and so on.
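The two context-variance measures discussed above can be made concrete with a small sketch. Assuming a corpus given as a list of sentences (strings), the functions below count the distinct left and right neighbors of a candidate string to obtain AV, and compute the right branching entropy from the conditional distribution of the succeeding character; for the doorknob example above, AV would be min(3, 4) = 3.

```python
import math
from collections import Counter

# Sketch of accessor variety (AV) and right branching entropy (BE) for a
# candidate string s, following the definitions discussed above.
def accessor_variety(corpus, s):
    left, right = set(), set()
    for sent in corpus:
        start = 0
        while (i := sent.find(s, start)) != -1:
            left.add(sent[i - 1] if i > 0 else "<BOS>")          # preceding character
            j = i + len(s)
            right.add(sent[j] if j < len(sent) else "<EOS>")     # succeeding character
            start = i + 1
    return min(len(left), len(right))

def right_branching_entropy(corpus, s):
    succ = Counter()
    for sent in corpus:
        start = 0
        while (i := sent.find(s, start)) != -1:
            j = i + len(s)
            succ[sent[j] if j < len(sent) else "<EOS>"] += 1
            start = i + 1
    total = sum(succ.values())
    if total == 0:
        return 0.0
    # entropy of the conditional distribution of the next character given s
    return -sum(n / total * math.log(n / total) for n in succ.values())
```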
However, the performance of unsupervised Chinese new word detection is limited by the following issues. First, unsupervised methods often involve a large number of parameters that must be set manually. Second, Chinese new words that occur with low frequency are difficult to identify correctly by unsupervised methods. The unsupervised methods detect Chinese words mainly based on the statistical analysis of the words and their neighbors, but the statistics of infrequent words are not reliable. For instance, the BE value of a character sequence that occurs only once is always 0, since the sequence appears in only one context. In this case, valid words that appear only once in the corpus cannot be detected correctly by BE.
3. Contribution
In this paper, we focus on utilizing CRF for out-domain new word detection. To tackle the issues mentioned above, we first introduce a new Chinese word boundary detection method named EL. Compared with BE, EL improves Chinese word boundary detection by taking not only context variance but also context cohesion into consideration. Second, we propose a domain-independent Chinese new word detector, DICND. In DICND, each Chinese character is mapped into a low-dimensional discrete vector using a statistical representation layer. The idea is to represent the characters abstractly without relying on the known words, thereby enhancing the flexibility of the algorithm in new word detection.
To the best of our knowledge, this is the first work that uses only segmentation-relevant features to represent a character. DICND has the following advantages. First, it is domain independent: all of the characters in the documents are represented by their segmentation-related features. Second, the statistics-based character embedding is context-aware. Third, unlike neural network approaches, the elements in the proposed character embedding are interpretable, so it is easy to trace the cause when an error happens.
The proposed methods are evaluated in the following two aspects. We first evaluate the proposed Chinese word boundary detection method EL on the SIGHAN Bakeoff data; the experimental results show that EL can identify word boundaries more accurately than the widely used measure BE. Then we compare the out-domain new word detection performance of DICND with that of CRF with one-hot vectors (Zhang, Yasuda, and Sumita Reference Zhang, Yasuda and Sumita2008), CRF with CBOW character embedding, LSTM neural networks with CBOW character embedding (Liu et al. Reference Liu, Zhang, Che, Liu and Wu2014), unsupervised methods, GCNN (Cai and Zhao Reference Cai and Zhao2016), and greedyCWS (Cai et al. Reference Cai, Zhao, Zhang, Xin, Wu and Huang2017). We train the classifiers on segmented Chinese news and then apply the trained classifiers to microblog data for Chinese new word detection. The training set used in our experiment is the PKU training set in the SIGHAN Bakeoff, and the test set is a microblog data set provided by NLPCC (Qiu, Qian, and Shi Reference Qiu, Qian and Shi2016). Although the size of the test set is small, which is not a perfect setting for statistics-based character embedding, DICND still achieves the highest F score among these methods. We also identify a few examples to illustrate why DICND performs better than existing tools, which should provide more insights to researchers in this field.
4. Word boundary detection by EL
4.1 Overview of EL
BE assumes that if a character sequence frequently appears in different contexts, the character sequence is a valid word. However, this assumption does not hold for all character sequences in Chinese. For example, “” is a character sequence that has many different neighbors. Its preceding character can be “” (America), “” (China), or “” (England), and its succeeding character can be “” (claim) or “” (report). The reason is that “” is a component of many different words. BE only considers context variance without checking whether these contexts and the character sequence are tightly coupled. In this paper, we propose a new word boundary detection method named Edge Likelihood (EL). EL defines the word boundaries of a character sequence based not only on the context variance but also on the context cohesion. SCP is used to measure the cohesion of the character sequence and its neighbors. The process of generating EL is shown in Figure 1.
The process of EL calculation can be divided into three steps: calculating SCP, normalizing SCP, and calculating the final EL value.
4.2 Symmetric conditional probability
SCP measures the cohesiveness of a character sequence s according to the co-occurrence of the character sequences $c_1, \ldots, c_i$ and $c_{i+1}, \ldots, c_{|s|}$ ($1 \le i < |s|$). In general, SCP assumes that if s is a valid word, the substrings of s will mainly appear along with s. For example, given the sentence “” (amino acids constitute the basic unit of protein), the character sequence “” (amino acid) is a valid word, and its substrings (i.e., “”) should mainly co-occur with “”. The probability of the occurrence of s, denoted as P(s), is estimated by the relative frequency of the character sequence in the given corpus. The SCP value of s can be calculated by Equation (1):
SCP(s) is high when all the binary segmentations of s mainly appear along with s, and the value of SCP(s) is in (−∞, 1].
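Since Equation (1) is not shown above, the sketch below implements the commonly used SCP definition, in which the squared probability of s is divided by the average product of the probabilities of its binary splits. It should be read as an illustration of the idea rather than the paper's exact formula, which may differ (e.g., by a logarithm, which would match the stated range (−∞, 1]).

```python
# Sketch of SCP for a candidate word s = c_1 ... c_n (n >= 2), following the
# commonly used definition; the paper's Equation (1) may differ in detail.
def scp(s, prob):
    """prob(x) returns the relative frequency of character sequence x (assumed > 0)."""
    n = len(s)
    # average product of probabilities over all binary segmentations of s
    denom = sum(prob(s[:i]) * prob(s[i:]) for i in range(1, n)) / (n - 1)
    return prob(s) ** 2 / denom
```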
4.3 Postprocessing and normalization
The postprocessing of the SCP values maps the raw SCP values to {0, …, N}, where N is a user-defined parameter. The postprocessing not only normalizes the values into {0, …, N} but also discretizes the numerical SCP values into N + 1 classes. The processed values can then be input into a CRF, which deals with discrete attributes.
The postprocessing of SCP values has three steps:
(1) Randomly select a set of items from the data set as the samples; sort the samples according to their descending SCP values.
(2) Select cut points from the samples. The cut points of the bucket n (n ∈ {0, …, N}) are the $\frac{n}{N}$th and the $\frac{n+1}{N}$th items in the sorted samples.
(3) Bin all the raw SCP(s) in the N + 1 buckets.
After that, high SCP(s) values are mapped to large n, and vice versa. It is worth noting that the statistical information of infrequent character sequences is not as reliable as that of frequent character sequences. In this case, we are more interested in the frequent character sequences. Specifically, infrequent character sequences (e.g., character sequences that appear only once in the corpus) should be filtered out of the samples.
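A minimal sketch of this three-step postprocessing is given below. The sample size is an assumption, and the samples are sorted in ascending order here, which is equivalent to the descending sort above and still maps high raw SCP values to large bucket indices.

```python
import random
from bisect import bisect_left

# Sketch of the three-step SCP postprocessing: sample, sort, pick cut points,
# then bin raw values into {0, ..., N}. Sample size is an assumed default.
def build_cut_points(raw_scores, N, sample_size=10000):
    samples = sorted(random.sample(raw_scores, min(sample_size, len(raw_scores))))
    # one cut point between consecutive buckets, taken at evenly spaced ranks
    return [samples[int(k * len(samples) / (N + 1))] for k in range(1, N + 1)]

def normalize(raw_score, cut_points):
    # bucket index in {0, ..., N}; higher raw scores map to larger indices
    return bisect_left(cut_points, raw_score)
```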
4.4 Edge Likelihood
EL is composed of Left EL and Right EL. Left EL of s is calculated based on the frequency distribution of s and the cohesion of s with its preceding characters. Let p denote a preceding character of s; p + s is a string concatenating p and s. The cohesion of p + s is evaluated by SCP′(p + s) (Equation (2)):
The value of φ(p, s) is in [0, 1]. A high φ(p, s) indicates that the connection between p and s is tight, so p should contribute less to $EL_{left}(s)$. Let β(p, s) denote a parameter that indicates the importance of p to $EL_{left}(s)$:
The parameter μ is a lower bound of β(p, s), which ensures that every p has a certain importance to $EL_{left}(s)$. The value of μ is in [0, 1], and μ = 0 means we ignore p in calculating EL(s) if SCP′(p + s) = N (the cohesion of p and s is very strong). $BE_{left}(s)$ can be treated as the special case of μ = 1 in the $EL_{left}(s)$ calculation, which means considering each p equally regardless of SCP′(p + s). The effectiveness of different μ values is shown in Figure 2; the highest accuracy is achieved when μ = 0.2.
$EL_{left}(s)$ is the probability that the left boundary of s is a valid word boundary. The value of $EL_{left}(s)$ is defined by the number of preceding characters of s as well as the cohesion of s and its preceding characters (Equation (4)):
where $\mathcal{P}_s$ denotes the set of preceding characters of s.
$EL_{right}(s)$ can be calculated similarly. The value of EL(s) is defined by Equation (5):
Given a character sequence s, EL(s) evaluates the probability that the boundaries of s are valid word boundaries based not only on whether s occurs in different contexts but also on how s connects to its neighbors. Take the wrong segmentation “” (country media) mentioned above as an example: this character sequence does appear in different contexts, but it often connects tightly to its preceding characters, for example, “” (America) and “” (China). In other words, the β values of the preceding characters of “” are low, so the overall EL() should not be too high.
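Because Equations (2)–(5) are not shown above, the sketch below should be read only as one plausible encoding of the prose, with all specifics being assumptions: φ is taken as the normalized SCP′ value scaled into [0, 1], β is lower-bounded by μ and shrinks as the cohesion of p + s grows (so μ = 1 recovers plain BE and μ = 0 discards maximally cohesive neighbors), and $EL_{left}$ is computed as a β-weighted branching entropy over the preceding characters.

```python
import math

# Illustrative sketch only: the exact forms of phi, beta, and EL_left are
# assumptions that encode the prose above, not the paper's equations.
def beta(scp_norm, N, mu):
    # scp_norm = SCP'(p + s) in {0, ..., N}; tighter cohesion -> smaller weight,
    # never below the lower bound mu (mu = 1 treats every p equally, as BE does)
    return max(mu, 1.0 - scp_norm / N)

def el_left(preceding_counts, scp_norm_of, N, mu):
    """preceding_counts: {p: count of p immediately before s};
    scp_norm_of(p): normalized SCP'(p + s) in {0, ..., N}."""
    total = sum(preceding_counts.values())
    score = 0.0
    for p, count in preceding_counts.items():
        prob = count / total
        score -= beta(scp_norm_of(p), N, mu) * prob * math.log(prob)
    return score   # EL_right is analogous over the succeeding characters
```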
5. Domain-independent Chinese new word detector
5.1 Overview of domain-independent Chinese new word detector
In this section, we introduce a domain-independent Chinese new word detector, named DICND, which tries to address the issues of current CRF-based new word detection by using a statistics mapping layer. After an unsupervised pre-training process, each character in the documents is represented as a low-dimensional discrete vector which reflects the segmentation-related information of the character. Figure 3 shows an overview of DICND.
5.2 Statistics-based character embedding
The statistics mapping layer is used to embed the Chinese characters into low-dimensional vectors. The segmentation-related statistical features, as well as the POS attributes of the characters and their neighbors, are leveraged as the features of the characters. We follow Peng et al. (Reference Peng, Feng and McCallum2004) and categorize the features into closed features and open features.
The closed features are obtained from the training data alone. From our study, we notice that EL defines word boundaries by context variance while SCP uses the inner cohesiveness of the character sequence. They can work together to represent the characteristics of a character sequence from two different aspects. Other than EL and SCP, we further use MI to evaluate the association degree of the target character sequence and its surroundings. MI is defined as in Equation (6):
where $p(s_{surd}|s)$ is the probability of co-occurrence of $s_{surd}$ and s. In the proposed character embedding, the $s_{surd}$ of a character sequence $s = c_0, \ldots, c_{|s|}$ is $c_{-2}c_{-1}$, $c_{-1}$, $c_{|s|+1}$, and $c_{|s|+1}c_{|s|+2}$.
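Since Equation (6) is not shown above, a standard pointwise-mutual-information form consistent with the surrounding description would be the following (an assumption, not a verbatim reproduction of the paper's equation):

```latex
% Assumed pointwise-mutual-information form of Equation (6)
MI(s_{surd}, s) = \log \frac{p(s_{surd} \mid s)}{p(s_{surd})}
                = \log \frac{p(s_{surd}, s)}{p(s_{surd})\, p(s)}
```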
It is worth mentioning that SCP is used to calculate the inner cohesiveness of s in the closed features, while in the calculation of EL, SCP is used to evaluate the connection of s and its neighbors (outer cohesiveness).
The closed features of a character $c_0$ are listed in Table 1. The “′” symbol means the values are normalized (discretized).
On the other hand, the open features (Table 2) are generated from knowledge bases other than the training set, namely a Chinese word list, a stop word list, and several POS character lexicons from various sources. The details are in Table 5. The POS or lexical attributes of a character sequence are obtained by checking the existence of the character sequence in the specific list (e.g., the boolean value isPreposition(s) is obtained by checking whether s exists in the preposition list). Most of the valid Chinese words in the knowledge base consist of one or two characters. Thus, in most of the features, we only consider up to two characters.
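A minimal sketch of such lexicon lookups is shown below; the list names and the particular feature subset are hypothetical placeholders for the word lists in Table 5 and the features in Table 2.

```python
# Sketch of the open (knowledge-base) features: boolean lexicon lookups for a
# character and its neighbors. List names are hypothetical placeholders for
# the word lists described in Table 5.
def open_features(c_prev, c0, c_next, dictionary, stop_words, prepositions):
    return {
        "inDict(c0)": c0 in dictionary,
        "inDict(c0,c+1)": c0 + c_next in dictionary,   # at most two characters,
        "inDict(c-1,c0)": c_prev + c0 in dictionary,   # matching the note above
        "isStopWord(c0)": c0 in stop_words,
        "isPreposition(c0)": c0 in prepositions,
    }
```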
5.3 Character tagging
The statistics-based character embedding will be input into the CRF classifier, and five tags are used as the segmentation labels (Table 3).
Linear chain CRF is used as the tagging algorithm (Figure 4).
Given a sentence x with T characters, $x = c_1, c_2, \ldots, c_T$, each character in the sentence is represented by its statistical embedding vector. The character vector of the tth character serves as the observed variable at the current time step t. The tag of the tth character, denoted as $y_t$, is related to the observed variable at t and the tag of its preceding character, that is, $y_{t-1}$. Let y be the label sequence; CRF defines y by Equation (7):
The model parameters are a set of real weights Λ = {λk}, one weight for each feature, and $\{\,f_k(y,y^\prime,t)\}_{k=1}^{K}$ is a set of binary-valued indicator functions reflecting the transitions between $y_{t-1}$ and $y_t$, where K is the total number of feature functions. The feature functions can measure a state transition $y_{t-1} \rightarrow y_t$ together with the entire observation sequence x, centered at t. For example, one possible feature function could measure how much we suspect that the current character should be labeled as “B” given that the previous character is an adjective. The value of $x_t$ is the statistical embedding vector of $c_t$. Large positive values of λk indicate a preference for such an event; large negative values make the event unlikely.
Z(x) is a normalization factor over all state sequences for the sequence x:
The most probable labeling sequence for an input x is
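Since the display equations are not shown above, the standard linear-chain CRF forms consistent with this description (Equation (7), the normalization factor, and the decoding rule) can be sketched as follows; treat them as the standard formulation rather than a verbatim reproduction of the paper's equations:

```latex
% Standard linear-chain CRF, normalization factor, and decoding rule
p_\Lambda(y \mid x) = \frac{1}{Z(x)}
  \exp\!\Big( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big)

Z(x) = \sum_{y'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k=1}^{K}
  \lambda_k f_k(y'_{t-1}, y'_t, x, t) \Big)

y^{*} = \arg\max_{y} \; p_\Lambda(y \mid x)
```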
6. Experiment
In this section, we evaluate the performance of EL and DICND. We first compare the effectiveness of EL and BE using the data sets in the SIGHAN Bakeoff. Second, we train DICND on the PKU or CTB data set, which contains mainly news, and then apply the classifier to a microblog data set to obtain the out-domain new word detection result. We compare the new word detection result with that of CRF with one-hot vectors, CRF with CBOW character embedding, LSTM with CBOW character embedding, GCNN with CBOW character embedding, greedyCWS, and an unsupervised method.
6.1 Data set
There are several data sets and lexicon data used in our experiment:
SIGHAN Bakeoff (Sproat and Emerson Reference Sproat and Emerson2003) is the most widely used data set in Chinese word segmentation. There are four sub data sets in SIGHAN Bakeoff, namely PKU, CityU, MSR, and CTB6. The data are collected from newspapers (e.g., China Daily, South China Morning Post) such that the sentences in the data sets are formal. Some of the statistics of SIGHAN Bakeoff are shown in Table 4.
NLPCC microblog data set (2016) contains 6000 segmented Chinese tweets. The topics of the tweets include politics, weather, music, sports, etc.
Sogou Web corpus contains 6,000,000 unlabeled Chinese sentences from web data.
Dictionary is the dictionary used in the Stanford Word Segmenter, which contains 423,000 Chinese words.
Word Lists are obtained from http://xh.5156edu.com; the detailed information of the word lists is listed in Table 5.
6.2 EL experiments
Theoretically, EL(s) is linear in the probability that the boundaries of s are valid word boundaries. In other words, the proportion of valid character sequence boundaries in {s|EL′(s) = n} should be $I(n) = \frac{n}{N}$ in the ideal case. For instance, assume n = 2 and N = 10; then 2/N = 20% of the s in {s|EL′(s) = 2} should be character sequences with valid word boundaries. In this experiment, we define the error rate as the deviation between the real valid boundary ratio and the ideal valid boundary ratio. Denote the set of s with valid word boundaries as $S_{valid}$ and the valid ratio of s in {s|EL′(s) = n} as R(n), $R(n) = \frac{|\{s \mid EL^\prime(s) = n \wedge s \in S_{valid}\}|}{|\{s \mid EL^\prime(s) = n\}|}$. The error rate is the absolute value of I(n) − R(n).
We evaluate the EL and BE values of the character sequences which contain 2–5 characters in the PKU data set. The result is shown in Table 6. The EL and BE values are normalized into {0, …, 8} (i.e., N is set to 8). Note that the character sequence “” (coach of Shanghai team) contains valid boundaries not only for “” (Shanghai team) and “” (coach) but also for “” (coach of Shanghai team). The results with the lower error rate are in bold.
According to Table 6, the error rate of EL is significantly lower than that of BE, which means the EL value is closer to the ideal case than the BE value. We can see that if we set the threshold to 8, 88.26% of the s in {s|EL(s) = 8} are character sequences with valid word boundaries, while the corresponding figure for BE is 76.88% (the value should be 100% in the ideal case).
Similarly, we evaluated the error rates of EL and BE on the MSRA, CityU, and CTB data sets (Table 7). The results with the lower error rate are in bold. From the table, we can see that the error rate of the EL value is about 10% lower than that of the BE value. In other words, EL is closer to the ideal case and can detect word boundaries more accurately than BE.
6.3 DICND experiments
In this section, we compare the Chinese out-domain new word detection performance of DICND with several baselines. The baselines include one-hot character vector + CRF (Zhang et al. Reference Zhang, Yasuda and Sumita2008), CBOW character vector + CRF, CBOW character vector + LSTM (Liu et al. Reference Liu, Zhang, Che, Liu and Wu2014), CBOW character vector + GCNN (Cai and Zhao Reference Cai and Zhao2016), greedyCWS (Cai et al. Reference Cai, Zhao, Zhang, Xin, Wu and Huang2017), and an unsupervised model MI + BE.
Concretely, we train the classifiers on the PKU or CTB data set, which contains mainly news. The classifier is then applied to the microblog data to obtain the out-domain new word detection result. The new words in this experiment are defined as the words in the microblog data set that are in neither the PKU data set nor our knowledge base. Infrequent new words (words that appear only once) are excluded since their statistical information is unreliable. Non-Chinese characters and character sequences containing stop words are also excluded since they are often not the interest of Chinese new word detection. There are 847 valid new words in the microblog data set according to this definition.
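A small sketch of this new word definition, with hypothetical helper names, is given below; it simply takes the words segmented from the microblog data, removes anything already in the PKU vocabulary or the knowledge base, and applies the frequency, Chinese-character, and stop-word filters.

```python
# Sketch of the new word definition used in this experiment; names and the
# CJK range check are illustrative assumptions.
def collect_new_words(microblog_words, pku_vocab, knowledge_base, stop_words):
    counts = {}
    for w in microblog_words:
        counts[w] = counts.get(w, 0) + 1
    return {
        w for w, n in counts.items()
        if n > 1                                              # drop words seen only once
        and w not in pku_vocab
        and w not in knowledge_base
        and all('\u4e00' <= ch <= '\u9fff' for ch in w)       # Chinese characters only
        and not any(ch in stop_words for ch in w)             # no stop-word characters
    }
```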
6.3.1 Performance comparison
The new word detection results of DICND and the baselines are shown in Table 9. The experiment setting is as follows. The unsupervised model is a combination of term frequency, MI, and BE. For any s with l = |s|, if Freq(s) > 1, $\sum_{i=0}^{l-1} MI(c_0,\ldots,c_i\,{:}\,c_{i+1},\ldots,c_l) > 100$, and BE(s) > 0.5, then s is considered a valid word. The thresholds are optimized by trying different combinations of the values. The CBOW character embedding is pre-trained on the Sogou web data; the dimensionality of the CBOW character embedding is 40, while the dimensionality of the proposed statistical character embedding is 41 (25 closed features in Table 1, and the features in Table 2 grouped into 16 attributes). We use the code at https://github.com/FudanNLP/CWS_LSTM as the implementation of CBOW character embedding with LSTM. Note that the F score is computed as $F = \frac{2PR}{P+R}$, where P and R are the precision and recall, respectively.
For the baseline “CRF + sparse vector,” we use the pre-trained model “pku.gz” in the Stanford Word Segmenter. The training times of the other supervised methods are in Table 8. The neural-network-based methods involve a large number of parameters and thus require much more training time than the CRF-based methods.
Table 9 shows the F score of new word detection with DICND and the scores of the baseline methods. The best results are in bold. From the table, we can see that DICND achieves the highest F score among the methods. CRF with the sparse representation has the highest precision because of the high-dimensional representation of each character, but its recall is low because the representation also restrains the flexibility of the algorithm.
In general, long new words are more difficult to detect correctly since more labels must be assigned correctly. Thus, new word detection on short words often has higher recall and precision than on long new words. In this part, we compare the out-domain new word detection results for different word lengths. Table 10 shows the F score of DICND and the baseline methods for different lengths of character sequences. In Table 10, “Gold” is the gold standard provided by the data set, “Detect” is the number of new words detected by the method, and “Valid” is the number of valid new words detected by the method. The best results in their categories are in bold.
In addition, we analyzed the out-domain Chinese new word detection results with different training sets. Specifically, in addition to the PKU data set, we also train the proposed classifier on the CTB data set. The result is in Table 11. The best results in their categories are in bold. The overall F score of the classifier trained on the PKU data set and that of the classifier trained on CTB are similar, but the classifier trained on CTB has higher recall and lower precision than the one trained on PKU.
6.3.2 Case studies
In this section, we present a few cases from our experiments to verify our observations about the tools.
CRF with the sparse representation does not achieve good performance in compound word detection. For instance, the word “” (a game called “Greedy Snake”) in the microblog data set is a combination of two known words, “” (greedy) and “” (snake). With the sparse representation, the algorithm segments “” (Greedy Snake) into two words rather than treating it as one word.
The CBOW character embedding uses an identical vector to represent a specific character without considering its surrounding context. For instance, whether the character “” appears in the word “” (gift card) or in the word “” (Kafka, a writer), the character vector of “” is the same. In our experiment, only the new word “” (gift card) is detected by CRF with CBOW character embedding, while DICND can identify both “” (gift card) and “” (Kafka). This drawback of CBOW character embedding can be partially mitigated by using a neural network, such as LSTM, as the label classifier, since the neural network will modify the feature representation automatically. However, the neural network defines the label of a character based only on the character and several of its neighboring characters. In other words, the neural-network-based approach lacks an overview of the whole sentence, which might make it less competitive than CRF, which predicts sequences of labels for sequences of input samples. Compared with DICND, the GCNN model (Cai and Zhao Reference Cai and Zhao2016) has a stronger performance on formal word detection, for example, “” (capital chain) and “” (art exam). DICND, meanwhile, can detect more person names and Internet slang words, for example, “” (open-air fitness dancing) and “” (acting cute). This indicates that DICND is more adaptive to the target domain (i.e., the tweet corpus in this experiment). The reason is that the statistics-based embedding of a character in the test set is calculated according to the frequency distribution of the character sequences in the test data, so the proposed embedding can utilize the statistics of the target domain. On the other hand, the CBOW character vector is pre-trained, so the statistics of the target data cannot be used to improve the word segmentation result. The domain adaptation capability of GCNN is improved in greedyCWS. However, DICND still performs better in detecting the names of people or organizations in the target data.
One of the reasons that the unsupervised method does not perform well is that the test data used in our experiment contains just 6000 sentences, and the sentences are on different topics. In fact, 95% of the 2–5-gram character sequences appear only once or twice in the test set, and the statistical information of infrequent character sequences is unreliable. For instance, “” (short for Asian Olympic) is a new word that the unsupervised method fails to detect. The character sequence appears twice in the document, and its succeeding character sequence is “” (Council) for both occurrences, which makes the BE value equal to 0 since the sequence always appears in the same environment in the test data. On the other hand, with its supervised machine learning mechanism, DICND can detect “” successfully. But the insufficiency of test data also hinders the performance of DICND, especially for long word detection.
7. Conclusion
In this paper, we first proposed EL, a novel method for Chinese word boundary detection. EL defines the probability that the boundaries of a character sequence are valid word boundaries based on both the context variance and the context cohesion of the character sequence. Our experiments show that EL can detect Chinese character sequence boundaries better than the widely used BE. Second, we designed DICND, a domain-independent CRF-based Chinese new word detector. Each Chinese character is represented as a low-dimensional vector in DICND; the values in the character vector are defined by segmentation-related features of the corresponding character. The experiment on out-domain new word detection shows that DICND can significantly outperform existing methods. However, the performance of DICND is affected by the size of the test data. This is because the proposed character embedding is generated based on the distributions of the character sequences in the given corpus, but the statistics of infrequent character sequences are unreliable.
Although DICND shows improvement over existing methods, the performance of out-domain new word detection still has large room for improvement. We hope our work can provide insights into the problem.