1. Introduction and scenario
1.1 The problem
Historical documents are usually written in older varieties of a language that exhibit a number of differences from its modern counterpart, all of which have a significant impact on Natural Language Processing (NLP) (Piotrowski 2012). Historical and dialectal texts present similar problems from an NLP point of view, in that NLP tools developed for the contemporary standard language often fail to handle the linguistic varieties encountered in such texts.
The majority of NLP tools are designed to process newspaper texts written in contemporary language. The processing of standardized modern languages relies on some characteristics that are not found in historical and dialectal corpora. For example, (i) a standard variant is used for written communication which is well documented by dictionaries and grammars; (ii) these languages have standard orthographies and the majority of published texts adhere to these orthographic norms; and (iii) large amounts of text are electronically available and can be used for developing NLP tools and resources. By contrast, most of these characteristics are not shared by historical and dialectal text resources, and therefore standard NLP tools often cannot be directly applied to such corpora.
Traditionally, some degree of lexical normalization is performed when working with historical and dialectal texts in order to link each variant to its corresponding standard form. Once the texts are normalized, standard NLP and Information Retrieval (IR) tools can be applied to the corpora with reasonably high performance. The term canonicalization is also used in this area for the same task (Jurish 2010).
Accurate normalization can be very useful: by carrying out such normalization before indexing historical texts, for example, it is possible to perform queries against texts using standard words or lemmata and find their historical counterparts. Normalization has the potential to make ancient documents more accessible for non-expert users. Some examples are shown in Section 3.
NLP tools for standard languages also work better after normalization, which in turn allows for subsequent deeper processing to be carried out, for example, information extraction for the purpose of identifying historical events and other applications.
In this paper, we propose and evaluate an approach based on a model that is often used for similar tasks, such as the induction of phonology and learning grapheme-to-phoneme conversion models. We compare our results to previously published approaches using different historical corpora and languages for evaluation.
Our working hypothesis is that, as in the case of dialectal variants, the differences between ancient and current standard Basque seem to be mainly phonological; therefore, we have reapplied the best method from our previous work with dialects (Etxeberria et al. 2014). This method uses Phonetisaurus, a Weighted Finite-State Transducer (WFST) driven phonology tool (Novak, Minematsu, and Hirose 2012, 2016) which learns to map phonological changes using a noisy channel model. It is a solution that uses a limited amount of supervision in order to achieve adequate performance without the need for an unrealistic amount of manual effort. In our previous work (Etxeberria et al. 2016b), we first reported results using this method for three languages: Basque, Spanish, and Slovene.
This technique has also been applied to the normalization of non-standard texts in social media (Alegria, Etxeberria, and Labaka 2013).
We have conducted our experiments using several historical corpora in different languages which have been used in previous research. Consequently, our results can be compared to state-of-the-art results.
The results we have obtained vary considerably depending on the size of the training corpus and on the language. As pointed out by Piotrowski (2012: 82), a relevant factor is the “distance” between the historical and modern language variants:
The effectiveness of all canonicalization approaches described in this chapter depends on the “distance” between the historical and modern language variants, i.e., the extent of the differences in spelling and morphology. It thus depends on the language and the particular collections;…
In Section 6, we therefore study the impact of the distance (in the above sense) between language variants and the size of the character set.
1.2 Scenarios
In testing these corpora, we point out a variety of scenarios where NLP can be applied to historical text:
1. In some cases large annotated corpora are available, and the aim is to find an effective method for building a tool that generalizes well beyond the training data and is able to normalize additional old texts.
2. In other cases no corpora are available, and the main challenge is to annotate a small corpus (as small as possible) in order to develop a tool good enough to normalize a book or a set of texts.
Regarding the experiments, Basque falls under the second scenario, while the other languages fall under the first one (see Section 3). In any case, for simplicity and for comparability of results, we have used previously constructed resources, mainly lists of word pairs (dictionaries).
This, in turn, constrains the applicable methods: if only a word list is available, it is not possible to obtain the context of a word. Hence, in order to achieve a better comparison of results, our method does not take the context of the words into account (even when it is available).
Therefore, the results have to be interpreted in relation to these different scenarios. We thus carried out experiments in order to investigate how the size of the training corpus affects the results.
2. Related work
The predominant techniques currently used for the normalization or canonicalization of historical texts can be roughly divided into three groups:
Rule-based methods (hand-written phonological grammars) are the most common solution; however, these techniques do not fit our scenario due to the amount of manual work required.
Machine-learning-based techniques: systems that learn from examples of standard–variant pairs. These are our primary concern in this paper.
Unsupervised techniques: systems that work without supervision. Applying edit-distance (Levenshtein distance) and phonetic distance (e.g., the Soundex algorithm) are popular solutions. Such approaches are often used as a baseline when testing new systems (Jurish 2010).
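To make the edit-distance baseline concrete, the following Python sketch maps each historical form to its closest entry in a modern lexicon. It is a minimal illustration under our own assumptions: the lexicon and the example words are toy placeholders, not data from any of the corpora discussed in this paper.

    # Minimal sketch of an unsupervised edit-distance baseline: each
    # historical form is mapped to the closest word in a modern lexicon.

    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def normalize(historical: str, lexicon: list) -> str:
        """Return the modern lexicon entry closest to the historical form."""
        return min(lexicon, key=lambda w: levenshtein(historical, w))

    modern_lexicon = ["hermoso", "señor", "decir"]   # toy modern word list
    print(normalize("fermoso", modern_lexicon))      # -> "hermoso"

In a realistic setting, ties and distance thresholds would need additional handling, which is one reason such baselines are usually outperformed by supervised methods.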
2.1 Rule-based methods
Most of the systems described in the literature use hand-written phonological rules which are compiled into finite-state transducers.
Jurish (2010) compares a linguistically motivated context-sensitive rule-based system with unsupervised solutions in an IR task concerning a corpus of historical German verse, reducing errors by over 60%.
Porta, Sancho, and Gómez (2013) present a system for the analysis of Old Spanish word forms, using rules compiled into WFSTs. The system makes use of previously existing resources such as a modern lexicon, a phonological transcription system, and a set of rules that model the evolution of the Spanish language from the Middle Ages onward. The results obtained on all data sets show significant improvements, both in accuracy and in the trade-off between precision and recall, with respect to the baseline and the Levenshtein edit-distance.
2.2 Learning phonological changes
Kestemont, Daelemans, and Pauw (2010) carry out lemmatization in a Middle Dutch literary corpus, presenting a language-independent system that can “learn” intra-lemma spelling variation. This work employs a novel string distance metric to better detect spelling variants. The semi-supervised system attempts to re-rank candidates suggested by the classic Levenshtein distance, leading to substantial gains in lemmatization accuracy.
Mann and Yarowsky (2001) document a method for inducing translation lexicons based on transduction models of cognate pairs via bridge languages. Bilingual lexicons within language families are induced using probabilistic string edit-distance models.
Inspired by that paper, Scherrer (2007) makes use of a generate-and-filter approach somewhat similar to the method we initially used for phonological induction on dialectal corpora (Hulden et al. 2011; Etxeberria et al. 2014). In this previous work, we tested two approaches:
Based on the work by Almeida, Santos, and Simoes (2010), differences between substrings in distinct word pairs are obtained and phonological rules are learned in the format of so-called phonological replacement rules (Beesley and Karttunen 2003; Hulden 2009). These rules are then applied to novel words in the evaluation corpus. To prevent overgeneration, the output of the learning process is later subject to a morphological filter where only actual standard-form outputs are retained.
An Inductive Logic Programming (ILP) style learning algorithm (Muggleton and De Raedt 1994), where phonological transformation rules are learned from word pairs. The goal is to find a minimal set of transformation rules that is both necessary and sufficient to be compatible with the learning data, that is, the word pairs seen in the training data.
Norma is a tool for automatic or semi-automatic normalization of historical texts that, in a similar way, includes learning rewrite rules from training data (Bollmann et al. 2012). The algorithm keeps track of the exact edit operations required to transform the historical word form into its modern equivalent.
More recently, Scherrer and Erjavec (2016) have developed a language-independent word normalization method and tested it on the task of modernizing historical Slovene words. Their method relies on supervised data and employs a character-based statistical machine translation (CSMT) model, using only shallow knowledge.
Using the same approach (CSMT), Pettersson (2016) has carried out experiments on historical data sets for five languages: English, German, Hungarian, Icelandic, and Swedish.
We can consider CSMT-based methods as the state of the art in this field, so we compare our results with those reported by Scherrer and Erjavec (2016) and Pettersson (2016). For Spanish, we use Porta, Sancho, and Gómez (2013) for comparison.
Recently, Ljubešić et al. (2016) have investigated the differences between token-level and segment-level normalization as well as (not) using additional background resources. The evaluations show that segment-level normalization can be useful given a high enough level of token ambiguity. Additionally, some authors have begun applying neural networks to this problem (Bollmann and Søgaard 2016; Korchagina 2017).
The CLIN 2017 shared task addressed the normalization of historical Dutch (Tjong Kim Sang et al. 2017). Eight teams took part, and the best results were obtained by those that applied CSMT-based methods and neural networks. The training corpus of this task was quite big (more than 1 million words), and the test was carried out using independent text fragments from the 17th century, where only Out-of-Vocabulary (OOV) words are replaced during normalization. Evaluation was measured in two ways: BLEU score and error reduction in POS-tagging accuracy.
3. Description of the corpora
In this work, we have applied our normalization method to data sets in seven languages: Basque, Spanish, Slovene, German, Hungarian, Icelandic, and Swedish. We describe them briefly in the next subsections, although it must be said that the data sets are fairly difficult to compare, as their characteristics differ.
All the data sets used for the experiments are available at the following link: https://datahub.io/dataset/historical-data.
3.1 Basque historical corpus
In the case of Basque, we are in the second scenario mentioned above (previously annotated corpora are not available), so we first chose the literary classic Gero, written by Pedro Agerre “Axular” and published in 1643. Several reasons led us to choose this particular work, the most important of which are that (i) it is a Basque literary classic, (ii) it is old enough (from the 17th century), (iii) it is not too short (around 100,000 words), and (iv) a digitized version is readily available (at the www.armiarma.com website).
After a cleaning step, the corpus was divided into three parts containing 85%, 10%, and 5% of the text, respectively. The cleaning process first eliminates sections outside the main text, such as a letter from the author to his archbishop and some notes addressed to the readers. Additionally, references in brackets within the text, such as (S. Thom. 1. p. qu 102 art. 3), are deleted.
The unit used for the division was the paragraph, and paragraphs were randomly selected to obtain the splits described. Following this, a small parallel corpus of historical and standard Basque was built semi-manually for training and tuning (from the 10% part) and another one for testing (from the 5% part).
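A minimal sketch of this paragraph-level split in Python, assuming the cleaned text is available with one blank line between paragraphs (the file name and separator are our own placeholder assumptions):

    import random

    # Sketch of the 85/10/5 paragraph-level random split described above.
    random.seed(0)  # fixed seed, so the split is reproducible
    text = open("gero_clean.txt", encoding="utf-8").read()
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    random.shuffle(paragraphs)

    n = len(paragraphs)
    unannotated = paragraphs[:int(0.85 * n)]               # 85% of the text
    train_tune  = paragraphs[int(0.85 * n):int(0.95 * n)]  # 10%: training/tuning
    test        = paragraphs[int(0.95 * n):]               # 5%: testing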
The randomly selected parts of the corpus were analyzed by a morphological analyzer of standard Basque (Alegria et al. 2009). In this way, the words to be set aside for manual checking, that is, items that are OOV with regard to the Basque morphological analyzer, were detected; after annotating these, a small parallel corpus was built.
In the two files, each paragraph was divided into sentences, and a text categorization tool named TextCat was used to determine the language of each sentence in order to discard citations written in Latin.
The BRAT annotation tool (Stenetorp et al. 2012) was used for the manual revision and annotation of the OOV words. Each OOV item was annotated as either “Variation”, “Correct”, or “Other” (mainly non-Basque words or typos in the digital version). For words in the first class, the corresponding standard word form was provided.
Finally, after discarding “Other” labels, two lists of pairs (variant–standard) were obtained, one for training/tuning and the other for testing (each distinct word pair appearing only once in each list). In the training list, we also included the standard words in the original text, but we did not include any standard word in the testing list. The figures can be consulted in Table 2. Thus, for the training/tuning process, we have a list of 2949 different word pairs, of which 956 contain OOV words and the rest contain standard words only. All the word pairs in the test data set, however, contain OOV words.
As there are very different dialects of Basque, in a second step we chose another Basque text to test our method. This second text is Peru Abarka, written by J.A. Mogel in 1881. The main reasons for choosing this particular work are that it uses a very different Basque dialect (Biscaian instead of Labourdian) and that a digitized version is available (at http://www.vc.ehu.es/gordailua/).
The proportion of OOV words in Peru Abarka is far higher than in Gero, so the number of paragraphs randomly selected for the annotation process was calculated ad hoc. Finally, we selected 70 of the 479 paragraphs of the text and used 50 of them for training/tuning and 20 for testing. The figures can be consulted in Table 2. In this case, for the training/tuning process, we have a list of 1179 different word pairs, of which 904 contain OOV words and 275 contain standard words only. As with the previous work, the training/tuning list includes standard words from the original text, but here there are only 275, compared with 1993 in the previous corpus. As before, the test list contains only non-standard words.
To avoid having to account for possible case mismatches, only lowercase letters were used for all word pairs in Basque.
Examples of normalization from both corpora, illustrating different phenomena, are shown in Table 1.
3.2 German, Hungarian, Icelandic, and Swedish historical corpora
In Pettersson, Megyesi, and Nivre (2014), the authors present a multilingual evaluation of several normalization methods, applying them to five languages. They kindly provided us with the data for four of these languages: German, Hungarian, Icelandic, and Swedish. The English data cannot be distributed further due to copyright issues.
This is a very interesting set of historical corpora due to the different typology of the languages. For example, Hungarian is agglutinative, and in German, compounding is very productive.
For the German experiments, they used a manually normalized subset of the GerManC corpus of German texts from the period 1650–1800 (Scheible et al. 2011).
The Hungarian corpus is a collection of manually normalized codices available from the Hungarian Generative Diachronic Syntax project, HGDS (Simon 2014), in total 11 codices from the time period 1440–1541.
For Icelandic, they manually normalized a subset of the Icelandic Parsed Historical Corpus (IcePaHC), a manually tagged and parsed diachronic corpus of texts from the time period 1150–2008 (Rognvaldsson et al. 2012). The subset contains four texts from the 15th century: three sagas and one narrative religious text.
Finally, for the Swedish experiments, they compiled balanced subsets of the Gender and Work corpus (GaW) of court records and church documents from the time period 1527–1812 (Fiebranz et al. 2011).
The details of the four corpora can be consulted in Pettersson, Megyesi, and Nivre (2014). The authors created three data sets (training, tuning, and evaluation) from each selected subset: the tuning corpus is obtained by extracting every 9th sentence from the subset, the evaluation corpus by extracting every 10th sentence, and the training corpus from the remaining sentences. After processing the data sets, they obtained lists of word pairs (old word–standard word), which they provided to us.
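One possible reading of this split, sketched in Python (the handling of sentences that are both a 9th and a 10th multiple is our own assumption; Pettersson, Megyesi, and Nivre (2014) give the exact procedure):

    # Sketch of the sentence-level split: every 10th sentence to evaluation,
    # every 9th to tuning, the rest to training.
    def split_sentences(sentences):
        train, tune, evaluation = [], [], []
        for i, sentence in enumerate(sentences, 1):
            if i % 10 == 0:
                evaluation.append(sentence)
            elif i % 9 == 0:
                tune.append(sentence)
            else:
                train.append(sentence)
        return train, tune, evaluation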
As we did not carry out any new tuning when applying our normalization method, we merged the training and tuning data sets for use in the learning process. Furthermore, prior to the experiments, we kept each distinct word pair only once in each data set.
The final figures for the data sets of the four languages are given in Table 2.
We would like to mention that Pettersson and Megyesi have recently made publicly available a wide range of historical corpora and other useful resources and tools for researchers working with historical text at http://stp.lingfil.uu.se/histcorp/.
3.3 Spanish historical corpus
In Porta, Sancho, and Gómez (2013), the authors work on the analysis of Old Spanish and report results obtained with different data sets. One of them is the FL-EM data set, which, as the authors of the paper note, “FL-EM basically corresponds to the lexicon found within the FreeLing distribution for analysing Old Spanish”.
The data set, kindly provided by the cited authors, contains a total of 36,709 entries, each consisting of a historical word together with its corresponding modern lemma and morphosyntactic analysis. We processed this information to obtain the modern word forms and thus the word pairs (historical word–modern word) needed for the experiments. After eliminating repeated pairs, a list of 31,046 different word pairs was obtained. As the figures in Table 2 show, half of the pairs were set aside for testing and the other half for training.
Besides the FL-EM data set, and in order to conduct experiments with data sets of different characteristics, the previously cited authors provided us with another data set, the IMPACT-ES data set. They obtained it by processing one of the Spanish lexicons available on the website of the IMPACT (Improving Access to Text) European project. After extracting the word forms of the Compact Spanish lexicon, they obtained a list containing 24,009 different word pairs (old word–standard word), which they provided to us. As with the FL-EM data set, we randomly split the IMPACT-ES data set into two parts for training and evaluation.
3.4 Slovene historical corpus
The data set from Scherrer and Erjavec (2016) consists of a training lexicon (the goo corpus) and a testing lexicon (the foo corpus) of historical Slovene, as well as a frequency-annotated reference word list of modern Slovene. The data come from the IMP resources of historical Slovene (Erjavec 2015).
Both corpora (training and test) are split into three parts (Scherrer and Erjavec 2016):
18B Texts from the second half of the 18th century, written in the Bohorič alphabet;
19A Texts from the first half of the 19th century, written in the Bohorič alphabet;
19B Texts from the second half of the 19th century, written in the Gaj alphabet.
The sizes of the corpora can be consulted in Table 2.
3.5 Comparing corpora with regard to evaluation
With regard to the evaluation, an important factor is the presence (or absence) of standard words in the test corpora. Due to our scenario for Basque (only OOV words are annotated and our aim is to test the quality of OOV normalization), we decided to include only OOV words in the test set. Like the Basque corpora, the Spanish FL-EM, Hungarian, and Old Slovene corpora have few standard words in the test corpus (less than 10%). Including only OOV words in the test set can make the task harder. As we will show in the results, Baseline1 gives a measure of this factor.
Another important feature of the training/test division is whether or not these sets are disjoint. The experimental data sets of five languages (Basque, German, Hungarian, Icelandic, and Swedish) were obtained from manually normalized texts (or parts of them), and consequently the training and test sets are not completely disjoint. In contrast, the data sets for Spanish and Slovene were obtained from lexicons in which each word pair (historical–modern) appears only once; consequently, the training and test sets are disjoint. Baseline2 takes both factors into account, as will be explained in Section 5.1.
When information about tokens is available, it is possible to train the system using all the tokens or only the types. Based on previous results (Etxeberria et al. 2016b), there is not a big difference between the two, so, in order to improve the comparability of the results, the second option is used in our experiments. Thus, the relevant figures for our experiments are those in the “Types” columns of Table 2.
4. The WFST-based method
The WFST-based method we propose is the strongest method found in our previous work on dialect normalization (Etxeberria et al. 2014). This approach uses Phonetisaurus, a WFST-driven phonology tool (Novak, Minematsu, and Hirose 2012, 2016) based on OpenFST (Allauzen et al. 2007), which learns mappings of phonological changes using a noisy channel model.
After data preparation, where we collect pairs into a dictionary, the application of the tool includes three major steps:
1. Sequence alignment: The alignment algorithm implemented in the Phonetisaurus tool is based on the algorithm proposed by Jiampojamarn, Kondrak, and Sherif (2007), with some minor modifications. This algorithm is capable of learning complex grapheme/phoneme relationships, and it has been reformulated to take advantage of the WFST framework (Novak, Minematsu, and Hirose 2016). As we use the tool to align historical words with their equivalent modern words (not words with pronunciations), the result of the alignment process is a set of joint grapheme/grapheme chunks that we use in the next step.
2. Model training: A joint n-gram language model is trained using the aligned data and then converted into a WFST (after some tuning, a joint seven-gram model was generated). For producing the language model, we used the NGramLibrary language model training toolkit, although several similar tools exist that all cooperate with Phonetisaurus: mitlm, SRILM, the SRILM MaxEnt extension, and the CMU-Cambridge SLM toolkit.
3. Decoding: The default decoder used in the WFST-based approach finds the best hypothesis for the input words, given the WFST obtained in the previous step. It is also possible to extract the k-best output hypotheses for each word.
In practice, the application of the tool is straightforward. We have used the Phonetisaurus tool to learn the changes that occur within the selected word pairs, which by itself yields a grapheme-to-grapheme system.
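The following sketch drives the three steps from Python via the command-line tools. We used NGramLibrary in our experiments; for brevity, the sketch uses mitlm's estimate-ngram, one of the compatible alternatives listed above. Tool and flag names follow the Phonetisaurus and mitlm documentation as we understand them and may differ across versions; all file names are placeholders.

    import subprocess

    # 1. Sequence alignment: learn joint grapheme/grapheme chunks from
    #    tab-separated (historical word, standard word) pairs.
    subprocess.run(["phonetisaurus-align",
                    "--input=train_pairs.tsv",
                    "--ofile=aligned.corpus"], check=True)

    # 2. Model training: estimate a joint 7-gram language model over the
    #    aligned chunks (here with mitlm) and convert it into a WFST.
    subprocess.run(["estimate-ngram", "-o", "7",
                    "-t", "aligned.corpus",
                    "-wl", "model.arpa"], check=True)
    subprocess.run(["phonetisaurus-arpa2wfst",
                    "--lm=model.arpa",
                    "--ofile=model.fst"], check=True)

    # 3. Decoding: retrieve the 5 best standard-form hypotheses per word.
    subprocess.run(["phonetisaurus-g2pfst",
                    "--model=model.fst",
                    "--wordlist=test_words.txt",
                    "--nbest=5"], check=True)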
Once this model is trained and converted to WFST format, it can be used to generate correspondences between previously unseen words and modern standard forms. The WFST model makes it possible to retrieve multiple candidate transductions for each input word. However, when multiple candidate standard forms exist for a historical variant, some filtering becomes necessary, and the quality of this filtering is very important for improving the results.
In order to choose the best value for the number of candidates to generate, we carried out a tuning process in our previous experiments with Basque (Etxeberria et al. 2016b). We used cross-validation on the development corpus, asking Phonetisaurus to increase the number of retrieved answers among 5, 10, 20, and 30. The result was that asking for five candidates produces the best results. It is important to underline that this optimization was calculated only for the first Basque corpus (Gero), so no additional tuning has been carried out for the other languages.
Regarding the filtering of the generated candidates, the first filter is obvious: the transductions that do not correspond to any accepted standard word form are eliminated. From the remaining candidates, if there are any, the most probable transduction according to Phonetisaurus's weight model is selected. If no candidate survives the filter for a given input, the most probable transduction overall is selected.
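A minimal sketch of this selection logic, where is_standard stands in for whichever modern-language filter is available (the concrete filters used for each language are listed below):

    # Keep the highest-scoring transduction accepted by the filter;
    # if none is accepted, fall back to the overall best transduction.
    def pick_candidate(candidates, is_standard):
        """candidates: (weight, word) pairs, best (lowest weight) first."""
        for _, word in candidates:
            if is_standard(word):
                return word
        return candidates[0][1]   # no standard candidate: take the best one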
The filtering step is important for improving the final results, but, of course, this depends on the quality of the modern-language filter. We have used different options:
For selecting standard words in Basque, a morphological analyzer of Basque was used (Alegria et al. 2009).
For Spanish, the FreeLing suite (Carreras et al. 2004) was used for filtering the proposals.
In the case of Slovene, the data sets used in Scherrer and Erjavec (2016) are available from http://nl.ijs.si/imp/experiments/jnle-dataset/, and they include a lexicon of modern Slovene called Sloleks. This lexicon has been used to filter the proposals.
For German, Hungarian, Icelandic, and Swedish, the hunspell tool has been used as a filter for the modern language.
In order to evaluate the impact of the filtering step, and following the idea of Scherrer and Erjavec (2016), we have also applied the WFST-based method without any filter, that is, we have asked the weighted transducer for the best candidate and taken its answer as the corresponding modern form.
5. Evaluation and results
All the methods we have applied to obtain the normalization of historical words always produce a single answer for each input word. Thus, the results are given in terms of normalization accuracy, that is, the percentage of correctly normalized inputs (those whose spelling is identical to the manually modernized gold standard).
5.1 Baselines
Two baselines were set in order to test the difficulty of the task for each corpus.
The first baseline is the simplest one we can apply and consists of leaving the input historical words unchanged, that is, the normalized proposals in the output are identical to the inputs. As we will see, the results obtained by this method depend on the characteristics of the data set, especially of the test set. If standard forms do not appear in the test set (e.g., in the Basque corpora), this baseline's result will be very low.
The second baseline is again a simple method, based on a dictionary of equivalent words learned from the training data. It entails simply memorizing all the distinct word pairs detected among the historical and modernized forms and subsequently applying this knowledge during the evaluation task. If a historical word in the test set is not in the dictionary, it is left unchanged. Hence, if there is any intersection between the test and training sets, this baseline can improve on the previous one.
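Both baselines, together with the accuracy measure defined above, can be sketched in a few lines of Python; training_pairs and test_pairs are assumed to be lists of (historical, modern) string pairs:

    # Baseline1: leave the historical form unchanged.
    def baseline1(word):
        return word

    # Baseline2: memorize the training pairs; fall back to Baseline1.
    def make_baseline2(training_pairs):
        memory = dict(training_pairs)           # historical -> modern
        return lambda word: memory.get(word, word)

    # Normalization accuracy: percentage of outputs identical to the gold form.
    def accuracy(system, test_pairs):
        hits = sum(system(hist) == gold for hist, gold in test_pairs)
        return 100.0 * hits / len(test_pairs)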
We have calculated the results for the two baselines in order to obtain a robust comparison.
5.2 State-of-the-art results
Where possible, we compare our results with those reported under similar conditions on the same corpora in previous studies.
As described previously, CSMT has been the method giving the best results. Scherrer and Erjavec (2016) and Pettersson (2016) report results under similar conditions for some of the corpora we test.
Pettersson (2016) reports all results after applying a modern-language filter. The filters she used differ from ours: we use the same kind of filter (hunspell) for all four languages, whereas she uses word lists compiled for each language, so the results are not directly comparable.
The comparison with Scherrer and Erjavec (2016) is straightforward because the word list used for the filter is the same, and the results are reported both with and without the filter.
For Spanish, we can only compare the results obtained with the FL-EM data set against those published in Porta, Sancho, and Gómez (2013), but in the cited work the task is not exactly the same. The problem they address is assigning modern lemmas and word classes to historical word forms, not normalizing them. We feel, therefore, that the results are not directly comparable.
5.3 Results and discussion
For each corpus, the value in bold marks the best accuracy obtained. The italic values correspond to previously published results obtained by other researchers using CSMT methods.
We note that the coverage of the hunspell-based filter (used for German, Hungarian, Icelandic, and Swedish) is more limited than that of the analyzers or lists used in the Basque, Spanish, and Slovene experiments, and this may partially explain the comparative weakness of those results.
All the results are shown in Table 3. We can say that, in general, both CSMT and WFST give good results for all the languages, at around 80% for most of the corpora. There are three exceptions: Peru Abarka in Basque, Icelandic, and Old Slovene (18B). In the first and third cases, the baselines are low and the training corpus is small. In the case of Icelandic, further analysis is necessary. We can conclude that these results depend largely on the corpus available.
Comparing CSMT and WFST, the figures are not absolutely conclusive. However, we consider WFST, an approach based on a basic noisy channel model, to be a simpler and easier-to-tune method, and we recommend it for this task. It should be noted that the figures could be improved slightly by tuning the method for each corpus; our aim, however, was to test the robustness of the method. Compared to previous work, the results for morphologically rich languages (Hungarian, Icelandic, and Basque) are very good.
Regarding the use of the filter, we can conclude that it is highly recommendable; in most cases, the filtering process improves the results, and sometimes the improvement is substantial (Basque, Slovene). Only in one case does the accuracy decrease (the German corpus), but there the baseline is very high and the filter is perhaps not very precise.
6. Testing results based on the size of the training corpus
Results evidently also depend on the size of the training corpus. We are particularly interested in the second scenario described in Section 1.2, where it is necessary to annotate a corpus, as small as possible, in order to develop a tool good enough to normalize a book or a set of texts.
In our previous work on Basque, we tested the results after training the system on around 1000 examples annotated from OOV words in the historical corpus (adding recognized words, assumed also to be modern words). The results are shown in Table 3, with accuracies higher than 80% for the easier corpus (Gero) and around 70% for the harder task (Peru Abarka).
To research this further, we ran experiments to determine how the results vary as a function of the number of examples in the training corpus. Thus, we conducted experiments using three large corpora: Spanish (FL-EM), Hungarian, and Swedish.
Due to the relatively large size of these corpora, we were able to test the results using increasingly larger slices of the corpus for training (200, 500, 1000,…), as documented in the results shown in Table 4. The figures in the first three lines are averages obtained using five disjoint subsets of the relevant size. The results shown in the last line (All) were obtained using all the examples of each training corpus: 15,523 for Spanish (FL-EM), 51,172 for Hungarian, and 9042 for Swedish. Those results are also shown in Table 3.
The results are high for Spanish, even with a small training set, medium for Swedish, and poor for Hungarian. While for Spanish 500 examples are enough to achieve more than 85% accuracy, for Hungarian 5000 examples are necessary to achieve 60%.
The results are very different for the three languages, and we intend to investigate which factors have to be taken into account.
The poor results for Hungarian could be related to its complex morphology, but compared to the previous results for Basque (also a morphologically complex language), they are too low for morphology to be the only explanation. We therefore decided to examine other features of the corpora related to “distance”.
With respect to the distance between language variants (recall the quotation from Piotrowski (2012) in the Introduction), we have identified two basic features that can strongly influence the size of the training corpus required for good results: the size of the character set (alphabet) and the edit-distance between historical and modern word forms.
As can be observed in Table 5, the alphabet of the Hungarian and Swedish corpora is much bigger than that of the Spanish one, where, in addition to the simplicity of the alphabet, capital letters were eliminated from the corpus. In contrast, no lowercasing was done for Swedish and Hungarian, and their alphabets contain many different graphical accents as well as some punctuation marks and numerals. In addition, the edit-distance for Hungarian is much higher than in the other cases (with a high percentage of pairs with edit-distance greater than 3).
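Both statistics are easy to compute from the training pair lists. The sketch below reuses the levenshtein function from the baseline sketch in Section 2 and reports the alphabet size, the mean edit-distance, and the share of pairs with edit-distance greater than 3:

    # Corpus-level "distance" statistics for a list of (historical, modern) pairs.
    def corpus_stats(pairs):
        alphabet = {ch for hist, modern in pairs for ch in hist + modern}
        dists = [levenshtein(hist, modern) for hist, modern in pairs]
        return {
            "alphabet_size": len(alphabet),
            "mean_edit_distance": sum(dists) / len(dists),
            "share_distance_gt3": sum(d > 3 for d in dists) / len(dists),
        }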
Taking the two previous tables into account, we can conclude that the size of the alphabet and the edit-distance of the pairs to be learned (rather than the morphological complexity of the language) are the relevant factors when deciding on the size of the training corpus needed and when predicting the quality of the resulting system.
7. Conclusions and future work
We have presented several corpora for the task of normalizing historical texts, along with a robust method, and compared our results with previously published ones.
The corpora used cover a wide typology of languages and scenarios. These corpora are available for replication at https://datahub.io/dataset/historical-data.
To assess the performance and language-independence of the WFST method, training and evaluation were carried out on historical corpora in seven languages. The results are similar to or better than those reported in the literature, even though no new tuning process was performed.
Our phonological induction-inspired method achieves accuracies above 80% in the majority of the scenarios. It is a solution that works quite well with a limited amount of supervision, achieving adequate performance without the need for significant manual annotation effort. We deem the results of our simple system strong and robust enough to recommend it for experiments and working tools, including for other languages and corpora.
We want to remark that, when compiling the corpora and performing the experiments, we became aware that applying NLP to historical texts requires taking into account the scenario and the features of the available corpora. Further research on the correlation between the analyzed factors (size, alphabet, edit-distance) and the performance of the method is necessary.
Our main scenario has been a search task for historical texts (using normalized forms to search and retrieving historical word forms in context), and the evaluation has been carried out on lists of word pairs. In the near future, we want to extend this evaluation to the POS-tagging scenario, testing the results on corpora and taking advantage of context.
Another aim is to take advantage of additional morphological information (morphemes and partial paradigms) that can be inferred from the corpora (see Etxeberria et al. 2016a), but at the moment the results are inconclusive. We also want to try to improve the performance of the system using neural methods.
Acknowledgments
We are indebted to Josef Novak for his helpful assistance with Phonetisaurus, and to Jordi Porta, Eva Pettersson, Yves Scherrer, and Tomaž Erjavec for providing their corpora and for addressing our inquiries about their experiments. We are grateful to the three anonymous referees for helping us improve the paper.
The research leading to these results was carried out as part of the TADEEP projects (Spanish Ministry of Economy and Competitiveness, TIN2015-70214-P, with FEDER funding) and the BERBAOLA project (Basque Government funding, Elkartek KK-2017/00043).