1. Introduction
Part-of-speech (PoS) tagging (Greene and Rubin 1971), that is, automatically attributing the correct, context-dependent PoS tag to each token in the text, is a basic but essential step in natural language processing (NLP). After tokenisation and sentence segmentation, this is typically the next step in the text processing pipeline, as it gives invaluable information about the grammatical properties of words in context and thus enables better information extraction from texts and text mining (Kurdi 2016), high-quality lemmatisation, syntactic parsing, the use of factored models in machine translation, etc. From this it also follows that high-accuracy PoS tagging is very important, as downstream modules will be adversely impacted by any errors made in this preliminary step.
PoS taggers learn a model of the language from manually annotated corpora and background lexicons; these resources are expensive to produce, and while they now exist for many languages, they typically cover only the contemporary standard language. However, in the case of non-standard language and language varieties, such as historical texts and user-generated content (UGC), the accuracy of PoS taggers trained on standard language drops considerably (Gimpel et al. 2011), as many words become unknown, and the accuracy of tagging unknown words is much lower than for those already seen in the training data or lexicon. To counter this, one of two methods is typically used. The words in the input text can be normalised, thus lowering the out-of-vocabulary rate for the tagger, or new in-domain data can be manually PoS-annotated and the tagger trained on both standard and in-domain non-standard data.
The aim of this paper is to investigate how well each method performs and to find the optimal path, in terms of the resources required, towards PoS tagging of non-standard language in the case of Slovene. Our primary goal is thus to pave the way for other researchers and practitioners adapting language technologies for other highly inflected (especially Slavic) languages to non-standard language, producing the best possible systems given the data annotation resources at hand. By investigating the optimal approaches to adapting technologies via supervision, we also attempt to shed more light on the problem of automatic processing of non-standard language. We perform our experiments on two carefully constructed data sets, the first being Slovene historical language and the second Slovene UGC; furthermore, we split these data into easy and hard cases.
The rest of the paper is structured as follows: Section 2 overviews the related work on PoS tagging, normalisation of historical and UGC texts, and domain adaptation approaches in PoS tagging; Section 3 details the data sets used in the experiments; Section 4 presents the experiments performed; and Section 5 gives the conclusions and directions for further work.
2. Background and related work
2.1 PoS tagging
As PoS tagging is one of the oldest and most studied NLP tasks, many taggers have been developed, and there is hardly any machine learning method that has not been utilised for PoS tagging as well. Traditional – and still quite good – taggers have used hidden Markov models (HMMs), with the best-known example being TnT (Brants 2000) and its open source implementation HunPos (Halácsy, Kornai, and Oravecz 2007). The exact use of HMMs for the tagging task varies from approach to approach (e.g. Collins 2002), and a still very popular implementation using a Markov model supplemented with decision trees is TreeTagger (Schmid 1994), whose greatest strength lies not so much in high-quality annotations as in the models for many languages that accompany it. More recently, support vector machines have been utilised for PoS tagging (Ratnaparkhi 1996; Grčar, Krek, and Dobrovoljc 2012), as well as methods based on Conditional Random Fields (CRFs); CRF-based approaches in particular report better accuracies than previous ones (Silfverberg et al. 2014; Ljubešić and Erjavec 2016).
In our experiments, we also use an open-source CRF-based tagger called the ReLDI-tagger, which improved the state of the art in PoS tagging for a series of South Slavic languages (Ljubešić et al. 2016; Ljubešić and Erjavec 2016). While there has recently been an explosion of work on deep learning methods, ReLDI-tagger gives results comparable, on both standard and non-standard data, to the state-of-the-art bidirectional long short-term memory (BiLSTM) PoS tagger Bilty (Plank, Søgaard, and Goldberg 2016). In particular, on standard training and testing data, ReLDI-tagger obtained 94.27% accuracy on the fine-grained tagset with full morphosyntactic descriptions (Ljubešić and Erjavec 2016), while Bilty (on the same data, with a large non-annotated data set used for embedding initialisation) obtained an accuracy of 94.29%. Similarly, when applied to non-standard data, an adapted ReLDI-tagger obtained an accuracy of 87.70% (Ljubešić, Erjavec, and Fišer 2017), while Bilty achieves 88.15%.
Another, more recent opportunity for comparing the ReLDI-tagger and its neural alternatives was the shared task on Morphosyntactic Tagging of Tweets within the VarDial Evaluation Campaign in 2018 (Zampieri et al. 2018). There, the two best-performing systems, both versions of bidirectional recurrent neural networks exploiting both word and sub-word information, improved over a CRF alternative (the ReLDI-tagger expanded with Brown contextual clusters) in only one out of three languages, with that improvement being just above one accuracy point (87.1 vs. 88.4).
2.2 Normalisation of historical language
Historical language differs from the contemporary one in a number of aspects, one of them being the spelling of words, which differs not only from the contemporary standard but also among authors of the same era, due to the lack of a unified norm.
What exactly constitutes a ‘normalised’ word is a complex question, and various definitions have been proposed (Eisenstein 2013a). Dipper (2010) allows normalised words to also be artificial, that is, they might not exist in contemporary language; they are, however, written using modern orthography and their variants standardised to one form. Extinct or dialectal words are thus not substituted with their modern (near) equivalents, but only their spelling might be modified. This is a similar approach to Bollmann, Krasselt, and Petran (2012b), who distinguish normalisation from modernisation, with the latter also changing the word to its closest contemporary standard equivalent as regards its morphosyntax and semantics.
Automatic normalisation of historical language has by now a long history, as modernising words can enable not only better further automatic processing but also better intelligibility of old texts and better full-text search. Many normalisation methods are based on string similarity measures (Metzler, Dumais, and Meek 2007; Gomaa and Fahmy 2013), from the simplest lexicon-based, through rule-, distance- and phonetic-based methods, to workflows that combine several of these methods into a joint pre-processing pipeline (Rayson et al. 2007; Baron and Rayson 2008; Scheible et al. 2012; Erjavec 2011; Bollmann et al. 2012a).
More recently, character-based statistical machine translation (CSMT) has been successfully introduced, initially as a method of transliteration between different scripts (e.g. Matthews 2007) and of translating between closely related languages (Vilar, Peter, and Ney 2007), and later as a normalisation method particularly suitable for non-standard language variants. CSMT is an adaptation of phrase-based SMT (Koehn, Och, and Marcu 2003) and considers words as strings of characters that need to be (automatically) translated. By treating characters as words, and words as sentences, CSMT can be performed with standard tools for machine translation such as Moses (Koehn et al. 2007).
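To make this recasting concrete, the following minimal sketch shows how word-level normalisation pairs can be rewritten as character-level parallel data for a phrase-based toolkit such as Moses. It is not the pipeline used in our experiments; the word pairs and file names are invented for illustration.

# Minimal sketch: recasting word-level normalisation pairs as character-level
# parallel data, so that a phrase-based SMT toolkit such as Moses can
# "translate" non-standard spellings into modern ones (illustrative data).

pairs = [
    ("hishe", "hiše"),    # example: Bohorič-style spelling -> modern form
    ("lahk", "lahko"),    # example: UGC clipping -> standard form
]

def to_char_sentence(word):
    """Treat a word as a 'sentence' whose 'tokens' are its characters."""
    return " ".join(word)

with open("train.orig", "w", encoding="utf-8") as src, \
     open("train.norm", "w", encoding="utf-8") as tgt:
    for orig, norm in pairs:
        src.write(to_char_sentence(orig) + "\n")
        tgt.write(to_char_sentence(norm) + "\n")

# The resulting files can be fed to standard SMT training (alignment, phrase
# extraction, tuning); decoding a new word is then character-level translation,
# and the output characters are joined back into a word.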
As with other machine learning tasks, neural networks have also been applied to normalisation (Kalchbrenner and Blunsom 2013; Bahdanau, Cho, and Bengio 2014), including neural SMT approaches to non-standard language varieties such as historical language. Their main advantage over conventional phrase-based translation systems lies in better sharing of statistical evidence between similar words and in the inclusion of rich context (Koehn 2017). Neural networks have also been employed for normalising relatively low-resourced languages and language variants and those with rich morphology (Ling et al. 2015; Kim et al. 2016), but in this setting the approach still faces substantial challenges. Finally, recent work has also shown that good results are achieved by training weighted finite state transducers on the task of word normalisation (Etxeberria et al. 2019).
In the current paper, we use cSMTiser for word normalisation. The tool is based on character-level SMT and has proved very effective in normalising dialectal Swiss German (Scherrer and Ljubešić 2016) as well as non-standard and historical Slovene (Ljubešić et al. 2016). In the shared task on translating 17th-century Dutch literature to modern language (Tjong Kim Sang et al. 2017), cSMTiser was ranked first among 18 highly competitive systems, also outperforming the two systems using neural networks, including a convolutional neural network with character-level alignments.
A more recent comparison of neural and CSMT approaches (Lusetti et al. 2018) showed that minor improvements over CSMT (word-level accuracy of 87.38% vs. 86.35%) can be obtained by combining an ensemble of five neural seq2seq models on one Swiss German data set; this is, however, a preliminary result, and the approach does not scale to the complexity of our experiments, in which hundreds of normalisers are trained.
2.3 PoS tagging with normalisation
Normalisation has been found to significantly increase the accuracy of PoS tagging in a number of experiments and languages. For English, when normalisation with the VARD2 system is performed before PoS tagging by the CLAWS tagger, Rayson et al. (2007) report an absolute improvement in accuracy of up to 3%. Similarly, an adaptation of VARD2 for historical Portuguese (Hendrickx and Marquilhas 2011) showed that normalisation helps improve the accuracy of the MBT PoS tagger (Bosch et al. 2007) by more than 6% in total. Likewise, Dipper (2010) determined that tagging normalised word forms works better than tagging (orthographically slightly simplified) diplomatic transcriptions in all settings, and Scheible et al. (2011) as well as Scheible et al. (2012) found that adding a normalisation layer improves the performance of a standard tagger on Early Modern German considerably. Pettersson, Megyesi, and Nivre (2013a) and Pettersson, Megyesi, and Tiedemann (2013b) have shown in several experiments on Swedish and Icelandic that normalisation has a positive effect on subsequent tagging and parsing. In a multilingual study by Pettersson, Megyesi, and Nivre (2014), CSMT was determined to be the best normalisation method for four out of five languages. Finally, Scherrer and Erjavec (2016a) performed experiments on historical Slovene and found that the results of tagging and lemmatisation are strongly correlated with the quality of normalisation.
2.4 PoS tagging by domain adaptation
There has also been considerable interest in attempting to increase PoS tagging accuracy of historical texts by adding manually PoS-annotated historical data to the training set of the PoS tagger.
Bollmann (2013) showed that even a very small amount of training data (250 manually normalised tokens) significantly raises the accuracy of PoS tagging (approximately 46% on a 15th-century German manuscript), indicating that the approach is especially useful for less-resourced language variants and that the process may be quite cost-effective.
Experiments on historical Portuguese (Yang and Eisenstein 2014, 2015) and Early Modern English (Yang and Eisenstein 2016) have shown that domain adaptation using feature embeddings works better than normalisation, especially in the case of out-of-vocabulary words. The main disadvantage of normalisation is that it fails to address the full range of linguistic changes, such as changes in morphology and semantics, that is, meaning shifts.
2.5 Automatic (pre)processing of UGC
Boosting tagging performance for the ‘new non-standard’, that is, computer-mediated communication such as tweets, comments on news articles, chats, SMS and other UGC, is in general achieved by methods similar to those for historical language, but with some differences. Due to UGC-specific elements, PoS tagsets needed to be expanded, and new categories were added for hashtags, @-mentions, discourse markers, URLs and e-mail addresses, and emoticons (Gimpel et al. 2011).
Approaches here typically favour domain adaptation, although early experiments did resort to normalisation lists of expressive shortenings and lengthenings, as well as mapping URLs and hashtags to unique symbols (Gimpel et al. 2011; Foster et al. 2011; Eisenstein 2013b).
SMT was used on Dutch UGC by De Clercq et al. (2013), who built a cascaded SMT system with a token-based module followed by transliteration at the character level and achieved up to a 63% drop in word error rate. Ritter, Clark, and Etzioni (2011) proposed word clustering with Brown clusters when building a PoS tagging system for an NLP pipeline aimed at named entity recognition of tweets. The system, which uses Conditional Random Fields, outperforms the Stanford tagger, obtaining a 26% error reduction. Owoputi et al. (2013) expanded on the method of Brown clusters and systematically evaluated the use of large-scale word clustering on unlabelled English tweets, finding that word clusters are a very strong source of lexical knowledge, especially helpful with words that do not appear in traditional dictionaries. In one of the experiments, on out-of-vocabulary tokens, adding clusters yielded a 5.7-point improvement over the previous 79.2% accuracy.
Derczynski et al. (2013) experimented with vote-constrained bootstrapping on unlabelled data to tackle data sparsity, reducing token error by 26.8% and sentence error by 12.2%. Earlier papers introduce the ARK tagger (Gimpel et al. 2011) and T-PoS (Ritter et al. 2011). Both of these approaches adopt word clustering to handle linguistic noise and train on a mixture of hand-annotated tweets and existing PoS-labelled data.
3. Resources and data preparation
This section details the data sets used in the experiments. First, we introduce the language and resources used, comprising contemporary standard, historical and UGC Slovene. Next, we define the principles of word normalisation and the PoS tagset used. Finally, we present the data sets used in the experiments. All the resources used are openly available and can serve other researchers in similar and other experiments, while also being comprehensive enough to enable the construction of realistic and useful systems for the normalisation and tagging of the three types of written Slovene.
3.1 Contemporary Slovene and resources
Slovene is a South Slavic language and is, similarly to other Slavic languages, highly inflected. For instance, it still retains the dual number, and the paradigms of some words, such as adjectives, have over 60 morphologically distinct forms. It uses 25 letters, with č, š and ž being the only ones with diacritics, unlike some other Slavic languages such as Czech, where the quality of the stressed vowels is also marked. The writing system is largely phonetic, but takes into account the morphological structure of the word. So, for example, the word hodil (walked, masculine) is pronounced hodiu, but is nevertheless written with the final -l, since other forms, such as hodila (walked, feminine), do phonetically express the l.
For our experiments, we use two resources of contemporary standard Slovene. The corpus ssj500k (Krek et al. 2015) contains half a million words, manually morphosyntactically tagged and lemmatised, and is often used for training taggers for standard Slovene (Grčar et al. 2012; Ljubešić and Erjavec 2016). We also use the inflectional lexicon Sloleks (Dobrovoljc et al. 2015), which contains 100,000 lemmas with their complete inflectional paradigms, comprising in total almost 2.8 million word forms, although few of these are proper names.
3.2 Historical Slovene and resources
The orthography of the Slovene language was standardised to its present form rather late, with even 19th-century texts still being written quite differently than contemporary texts. The modern-day Slovene alphabet (called the Gaj alphabet, modelled after the Croatian alphabet by Ljudevit Gaj) was introduced into Slovene print in the 1840s; before that, the Bohorič alphabet, modelled on the German one, was used. All the letters of the old alphabet, except the long s, are still used today but they correspond to different phonemes, which makes reading texts in the Bohorič alphabet difficult and, to some extent, also complicates identifying the alphabet used. It should be noted that in the Bohorič alphabet, but especially in the Gaj alphabet of the 19th century, diacritics on vowels were often used, unlike in contemporary Slovene.
The introduction of the Gaj alphabet was also closely preceded by a new grammar and subsequent standardisation of the language orthography; therefore, the change in the alphabet makes a convenient demarcation between very and slightly non-standard historical Slovene language.
For our experiments, we used the goo300k corpus (Erjavec 2015b), comprising transcriptions of 1100 pages (about 300,000 tokens) sampled from 88 books and one newspaper, which were published between 1584 and 1899. Each word token in the corpus is manually annotated for its normalised (modernised) word form(s), its part of speech, lemma and – for archaic words – its gloss, that is, contemporary synonyms. The corpus has already been used in several word modernisation experiments (Scherrer and Erjavec 2016b; Etxeberria et al. 2016). The goo300k corpus was split into two data sets:
• Bohorič: the texts written in the Bohorič alphabet and published after 1750, as we have only a handful of pages from the oldest texts;
• Gaj: texts written in the Gaj alphabet, up to 1899; we removed three outlier texts that use very idiosyncratic spelling.
3.3 The social media data set
The language used on Slovene social media is, similarly to that of other languages, often written in a non-standard form, manifesting dialect words and phonetic spelling, and is sometimes also written without the diacritics on č, š and ž, as this is a quicker way of entering text on mobile platforms.
However, many social media texts are also completely or mostly standard. We developed a method (Ljubešić et al. 2015) to automatically classify texts into three levels of technical and linguistic standardness. Technical standardness (from T1, quite standard, to T3, very non-standard) relates to the use of spaces, punctuation, capitalisation and similar, while linguistic standardness (L1–L3) takes into account the level of adherence to the written norm and more or less conscious decisions to use non-standard language, involving spelling, lexis, morphology and word order.
The Janes-Tag corpus (Erjavec et al. 2017) contains about 75,000 tokens from roughly 3000 Slovene tweets, comments on online news portals, forum posts, etc. The corpus is manually annotated on the levels of tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entities. For our experiments, we wanted to concentrate on the non-standard portion of the corpus, so we used only the L3 part, which we call JanesL3.
3.4 Normalisation and tagging
This section defines normalisation and tagging in the context of the resources used. Words are normalised only orthographically, which involves transliteration, de-diacritisation of vowels and, of course, changing archaic or phonetic spellings to their contemporary, standard equivalents. When no such equivalent exists, the word is only normalised to a standard spelling. By convention, the normalisations are always in lower case, and punctuation symbols are not normalised. In general, we map spans of original tokens into spans of normalised tokens, with further linguistic annotation assigned to the normalised ones. In the majority of cases, there is a 1-1 mapping between the original and the normalised form; however, in historical language and in the language of social media, word boundaries are at times different from the contemporary standard, so 1-n or n-1 (where n > 1) mappings are sometimes necessary as well. Other approaches have typically taken a more restricted approach to normalisation, either always normalising only 1-1 (Han and Baldwin 2011), or normalising 1-n but not n-1 cases (Bennett et al. 2010).
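Schematically, such span alignments can be represented as pairs of original and normalised token spans, as in the toy sketch below (the examples are invented for illustration; the actual corpus encoding is shown in Figure 1).

# Toy illustration (invented examples) of the span-based normalisation model:
# spans of original tokens map to spans of normalised tokens, and further
# linguistic annotation attaches to the normalised side.

alignments = [
    (["ne"], ["ne"]),             # 1-1: the usual case
    (["nebi"], ["ne", "bi"]),     # 1-n: one original token, two normalised tokens
    (["naj", "si"], ["najsi"]),   # n-1: two original tokens, one normalised token
]
for orig, norm in alignments:
    print(" ".join(orig), "->", " ".join(norm))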
To illustrate, we give in Figure 1 two cases, one from the goo300k corpus and the other from the Janes-Tag corpus, both as encoded in the TEI P5 format (2017 version) used for encoding our corpora. Note that here both are also lemmatised, but this information is not used in the current experiments.
Most annotated Slovene resources use the MULTEXT-East morphosyntactic tagset for Slovene (Erjavec 2012), which is a fine-grained tagset comprising over 1900 distinct morphosyntactic descriptions (MSDs). This holds for all the resources mentioned in this section except for the goo300k corpus, where a simpler tagset was used. The reason for this is that the focus of the manual annotation was on normalisation rather than MSD tagging, which is a very labour-intensive process when applying the fine-grained tags. The goo300k corpus uses the IMP tagset (Erjavec 2015a), which is a simplification of the MULTEXT-East one, retaining lexical features but discarding all the inflectional ones, and comprises only 32 distinct MSDs or, rather, parts of speech. For the purposes of the experiments presented in this paper, the additional background resources, in particular the ssj500k corpus, the Sloleks lexicon and JanesL3, were also converted to the IMP tagset so that we have a common PoS tagset for all the resources. The only difference between the tagsets is that the Janes-Tag corpus – following Bartz, Beißwenger, and Storrer (2014) – has four extra tags to model phenomena (mostly) specific to social media texts, in particular hashtags; mentions; emoticons and emojis; and e-mail addresses and URLs.
3.5 The experimental data sets
Table 1 quantifies the corpus data sets that are used in the experiments. The first and second columns give the number of tokens and words (i.e. all tokens except punctuation) in each data set; it should be noted that we count cases where n original tokens or words map to one normalised one, or vice versa, as one token or word. By far the largest is the ssj500k contemporary standard Slovene corpus, as will usually be the case for other languages as well. The Gaj data set, with just over 200,000 word tokens, is the second largest, while the Bohorič and JanesL3 data sets are of comparable size, roughly 50,000 tokens each.
The ‘Single’ column gives the percentage of original word tokens that are normalised to single words rather than to multi-word units; the latter are problematic for normalisation and tagging when using an approach that normalises individual tokens. Multi-word mappings of course never occur in the ssj500k corpus and are quite rare in the other corpora, being most frequent in the Bohorič corpus, where they affect 0.7% of the word tokens. While such cases might be linguistically interesting, they have only a small impact on the quality of normalisation or PoS tagging.
We next give the proportion of word tokens, excluding differences in capitalisation, that do not need normalisation (Word = Normalised) or require only – more or less trivial – transliteration (Word ~ Normalised). In the case of Bohorič, this transliteration means converting to Gaj; for Bohorič and Gaj, removing vowel diacritics; and for Janes, rediacritisation of č, š and ž. The last is a much more complex process than deterministic transliteration, and we have developed a language-independent system that requires only a large, correctly diacriticised corpus for training (Ljubešić et al. 2016) and achieves quite good results, with less than 1% token error even on the non-standard language of Slovene tweets.
As the table shows, almost 60% of the words need to be normalised in the Bohorič data set, with transliteration accounting for only about 15%, all the rest still needing further changes in orthography to bring them in line with the contemporary standard. The situation is quite different with the Gaj data set, where about 15% of the words need to be normalised, but the simple expedient of removing vowel diacritics drops this number to under 3%. Finally, the JanesL3 data set lies between the two historical ones, with about 20% of the words needing to be normalised; rediacritisation lowers this number to 18%.
The last column gives the percentage of normalised word tokens that are included in the Sloleks lexicon – this information is important for approaches that use a lexicon to filter normalisation candidates. The ssj500k percentage shows that Sloleks provides fairly good coverage, with only 2.2% of the word tokens not in Sloleks; these are, in about two thirds of the cases, proper nouns. Interestingly, the percentage is also very good for the Bohorič data set, with only 4.1% not covered, in fact better than for the Gaj one, where it is 5.3%. This is due to the fact that the Bohorič data set contains mostly religious books and novels, which have a rather restricted vocabulary, while the Gaj data set is much more varied, containing also a number of textbooks with obsolete terminology and a run of a newspaper, with many proper nouns. Finally, the JanesL3 data set has the fewest words covered by Sloleks, with over 17% missing. This reflects the nature of communication on the medium, with many slang words, abbreviations and especially many English words, either written in the original or more or less phoneticised.
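The categories behind these statistics can be illustrated with straightforward bookkeeping, as in the sketch below; the transliteration map is a toy de-diacritisation subset and the token pairs are invented (the Bohorič case would additionally use a Bohorič-to-Gaj mapping).

from collections import Counter

# Illustrative sketch of the categories in Table 1: a word either equals its
# normalisation (ignoring case), differs only by a deterministic transliteration
# step (here, a toy de-diacritisation map), or needs genuine normalisation.

DEDIACRITISE = str.maketrans({"č": "c", "š": "s", "ž": "z",
                              "á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u"})

def bucket(word, norm):
    w, n = word.lower(), norm.lower()
    if w == n:
        return "Word = Normalised"
    if w.translate(DEDIACRITISE) == n.translate(DEDIACRITISE):
        return "Word ~ Normalised"
    return "needs normalisation"

tokens = [("dan", "dan"), ("lepó", "lepo"), ("cas", "čas"), ("nardit", "narediti")]
counts = Counter(bucket(w, n) for w, n in tokens)
for label, c in counts.items():
    print(f"{label}: {100 * c / len(tokens):.0f}%")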
3.6 Training and testing splits
Given the rather large size of the standard ssj500k PoS tagging data set, we split it into a training portion and a test portion, using 1600 documents (524,278 tokens) for training and leaving out 55 random documents (61,970 tokens) for testing.
When performing training on our smaller in-domain data sets, we perform fivefold cross-validation to obtain a good accuracy estimate. On the Gaj and JanesL3 data sets, we split the data randomly into five folds on document boundaries, while for the Bohorič data set we opt for a sentence-level shuffle. Our different decision for the Bohorič data set is based on two facts: (1) this data set consists of only 15 documents that are very different from one another, and splitting it on document boundaries generates several times higher variance in the results than on the other data sets, and (2) this is a historical data set with a limited number of texts, and if we opted for improving the tagging via in-domain training data, we would want to generate a sample consisting of sentences from all available documents.
Another gain from performing fivefold cross-validation is the ability to calculate the 95% confidence interval, which we report with each result based on cross-validation. Reporting such an interval gives the reader a general understanding of the variability of the cross-validation results. It is also very practical in comparison to standard statistical tests, as it can be reported for a single result, whereas significance tests have to be applied to pairs of results. However, once the results of specific setups come close to each other, we additionally perform the McNemar test (McNemar 1947), a non-parametric test of statistical significance suitable for paired nominal data and therefore also for the outputs of different taggers.
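This evaluation machinery can be sketched as follows; the fold accuracies and tag sequences are invented, and the sketch assumes the numpy, scipy and statsmodels Python libraries rather than any specific tooling used in our experiments.

import numpy as np
from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar

# (1) 95% confidence interval over fivefold cross-validation accuracies
fold_acc = np.array([0.871, 0.879, 0.868, 0.883, 0.875])   # invented fold accuracies
half_width = (stats.t.ppf(0.975, df=len(fold_acc) - 1)
              * fold_acc.std(ddof=1) / np.sqrt(len(fold_acc)))
print(f"accuracy: {fold_acc.mean():.4f} +/- {half_width:.4f} (95% CI)")

# (2) McNemar test on the paired per-token outputs of two taggers against gold tags
gold  = ["Nc", "Vm", "Af", "Rg"]    # toy tag sequences
sys_a = ["Nc", "Vm", "Af", "Af"]
sys_b = ["Nc", "Vm", "Rg", "Rg"]
a_ok = [p == g for p, g in zip(sys_a, gold)]
b_ok = [p == g for p, g in zip(sys_b, gold)]
table = [[sum(x and y for x, y in zip(a_ok, b_ok)),            # both correct
          sum(x and not y for x, y in zip(a_ok, b_ok))],       # only system A correct
         [sum(y and not x for x, y in zip(a_ok, b_ok)),        # only system B correct
          sum(not x and not y for x, y in zip(a_ok, b_ok))]]   # both wrong
print(mcnemar(table, exact=True).pvalue)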
4. Experiments
This section details the technologies used in the experiments, the adaptation methods compared – adaptation via normalisation or via in-domain data – and the evaluation methodology, which includes qualitative error analysis.
4.1 Technologies used
As shown in Sections 2.1 and 2.2, recent neural approaches do not yet significantly outperform CRF-based methods with carefully designed features for PoS tagging, nor SMT-based methods for normalisation. These ‘traditional’ technologies are also better understood, more stable and less dependent on hyperparameter tuning, and are therefore more useful for the type of comparison we apply here. Not least, given the large number of experiments we ran, these traditional approaches compare very favourably in terms of speed with the very long time necessary for neural training. We therefore use the ReLDI-tagger for tagging and cSMTiser for normalisation. Below we introduce the two tools in more detail.
4.1.1 Normalisation
cSMTiser is a wrapper around the Moses toolkit which enables the user to define training data and additional target language data for extra language models. It uses either a dedicated development set or 10% of the training data to perform MERT tuning.
The tool allows normalising whole sentences as well as performing word-by-word normalisation. Although most approaches use word-by-word translation, our previous experiments (Ljubešić et al. 2016; Scherrer and Ljubešić 2016) have shown that sentence-level normalisation, which takes into account the context in which each word occurs, outperforms token-level normalisation when the token translation ambiguity is high. While we could not obtain any improvements on Slovene texts from the 19th century and on more standard UGC, we obtained an error reduction of 12% on 18th-century Slovene texts, 7% on more non-standard UGC and 22% on dialectal Swiss texts.
Although the improvements on less-standard data were significant, the memory consumption of the sentence-level normaliser was 25 times higher than that of the token-level one, while the time necessary for both training and decoding was 3 times higher. Given the drastically higher complexity of sentence-level normalisation and its limited or absent impact on normalisation accuracy, we decided to run our extensive experiments (several thousand normalisation models trained and evaluated) with a token-level normaliser.
4.1.2 PoS tagging
ReLDI-tagger exploits the following features extracted either from the training data set or an inflectional lexicon:
• lower-cased tokens at positions {−3, −2, …, 2, 3};
• suffixes of the focus token (the token at position 0) of length {1, 2, 3, 4};
• tag hypotheses obtained from an inflectional lexicon for tokens at positions {−2, −1, …, 1, 2};
• a packed representation of the focus token giving information about the casing of the word and whether it occurs at the beginning of the sentence; for example, ull-START denotes a token starting with an upper-case character followed by at least two lower-case characters, occurring at the start of the sentence.
When Brown clustering information is provided to the tagger, the tagger additionally encodes as features all even-length prefixes of the word's binary path from the root of the cluster hierarchy (Ljubešić et al. 2017).
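A simplified sketch of these feature types is given below. It is not the ReLDI-tagger implementation; the toy lexicon stands in for an inflectional lexicon such as Sloleks, and the cluster path is invented.

# Simplified sketch (not the ReLDI-tagger implementation) of the feature types
# described above, for the focus token at position i of a sentence.

def packed_shape(token, sentence_initial):
    """E.g. 'Hodil' at the start of a sentence -> 'ull-START'."""
    shape = "".join("u" if c.isupper() else "l" if c.islower() else "o"
                    for c in token[:3])
    return shape + ("-START" if sentence_initial else "")

def features(tokens, i, lexicon, clusters=None):
    feats = {}
    for off in range(-3, 4):                       # lower-cased context tokens
        if 0 <= i + off < len(tokens):
            feats[f"tok[{off}]"] = tokens[i + off].lower()
    for k in range(1, 5):                          # focus-token suffixes
        feats[f"suf{k}"] = tokens[i].lower()[-k:]
    for off in range(-2, 3):                       # lexicon tag hypotheses
        if 0 <= i + off < len(tokens):
            feats[f"lex[{off}]"] = "|".join(sorted(lexicon.get(tokens[i + off].lower(), ["?"])))
    feats["shape"] = packed_shape(tokens[i], i == 0)
    if clusters and tokens[i].lower() in clusters:  # even-length Brown path prefixes
        bits = clusters[tokens[i].lower()]
        for k in range(2, len(bits) + 1, 2):
            feats[f"brown{k}"] = bits[:k]
    return feats

toy_lexicon = {"hodil": ["Vm"], "je": ["Vc", "Q"]}   # toy stand-in for Sloleks
toy_clusters = {"hodil": "110100"}                   # invented Brown path
print(features(["Hodil", "je", "domov", "."], 0, toy_lexicon, toy_clusters))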
4.2 Unsupervised domain adaptation
While the main focus of these experiments is to investigate the impact of additional supervision on the task of tagging non-standard data, either through training a pre-processing normaliser or through training the tagger directly, we include the variable of unsupervised adaptation in this research as well, since unsupervised methods are cheap and it is realistic to expect researchers and practitioners to include them in their systems. We opt for the best-performing and most widely used method for unsupervised adaptation of traditional taggers (Owoputi et al. 2013; Horsmann and Zesch 2015), namely Brown clusters (Brown et al. 1992).
We calculate Brown clusters for each of the data sets from large collections of unlabelled text. The non-annotated Bohorič data set consists of 1.1 million tokens and the Gaj data set of 9.3 million tokens; both were taken from the large and freely available IMP corpus of historical Slovene (Erjavec 2014). The non-annotated data set for JanesL3 consists of 14.3 million tokens and was taken from the texts marked L3 in the large Janes corpus of Slovene UGC (Fišer, Erjavec, and Ljubešić 2016). To ensure that both non-standard and standard forms can be found in the resulting clusters (otherwise adding cluster information would not be informative for the tagging process), we extend each of the data sets with data from the slWaC web corpus of Slovene (Ljubešić and Erjavec 2011) until each contains at least 50 million tokens.
In all cases, we consider for clustering only words that appear at least 5 times in the corpus. During our baseline experiments, we identify the optimal number of clusters into which the words are to be split.
4.3 Baseline experiments
We start our experiments by measuring the performance on our non-standard data sets of the PoS tagging model trained on standard data. We do not perform any adaptation for these baselines, except on the JanesL3 data set, where we add a simple heuristic that labels emoticons, hashtags, mentions and URLs with their corresponding tags. We consider this heuristic to be simple enough to write that every practitioner would decide to implement it.
In our baseline setting, we perform experiments on the optimal number of Brown clusters to be used for enriching the data representations. Our starting hypothesis is that the less data we have for calculating the clusters, the smaller the number of clusters that will give the best results. We experiment with 50, 100, 200, 500, 1000 and 2000 clusters.
4.3.1 Results
In Table 2, we present the results of the baseline experiments. We can observe that our tagging ceiling, that is, the tagger accuracy on standard data (train ssj500k, test ssj500k), is very high, at 97% accuracy. On none of the other data sets does the tagging performance, with or without unsupervised adaptation, come close to this. By far the highest error increase (column Err. incr.) is measured on the Bohorič data set (15 times), while on the remaining data sets it is around five times. Interestingly, the error increase is higher on the non-standard JanesL3 UGC data than on the Gaj historical data, in spite of the heuristic taking care of correctly tagging URLs, mentions, hashtags and emoticons.
Inspecting the improvement obtained by including Brown clustering information in our experiments, we observe the highest relative improvement (column Brown rel.) on the JanesL3 data set, followed by the Gaj and finally by the Bohorič data set. The relative improvements follow roughly the size of the in-domain non-labelled data sets the clusters are based on, which are 14.3, 9.3 and 1.1 million tokens for the three data sets.
Inspecting the number of clusters giving the best results, a similar trend emerges: the optimal number of clusters for Bohorič is 100, while for the remaining data sets it is 1000. This result follows our hypothesis that the more data one has for calculating the clusters, the more clusters should be differentiated. The observation makes sense, as with a larger data set (1) the number of words above the frequency threshold that we take into account increases and (2) we gather more evidence on each word, and are therefore able to construct more sensible clusters. A similar trend was already observed by Derczynski, Chester, and Bøgh (2015).
One final observation, not given in Table 2, is that on the Gaj and JanesL3 data sets the impact of a different number of Brown clusters on the results is quite small (an absolute drop in accuracy of 0.05% on Gaj and 0.22% on JanesL3 when 500 clusters are used instead of 1000), while on the Bohorič data set using too many clusters results in a drastic drop in accuracy (4.91% when 200 clusters are used instead of 100), falling even below the initial result obtained without any unsupervised domain adaptation. This observation stresses the importance of tuning the number of Brown clusters, as already stated by Derczynski, Chester, and Bøgh (2015).
In the remaining experiments, we apply those Brown clusters that gave the best results in this set of experiments.
4.3.2 Error analysis
We perform each of our error analyses primarily by analysing confusion matrices over our PoS tagset and, secondly, by inspecting test data annotations.
The analysis of most frequent errors shows that, as expected, different types of errors occur in the standard setting (training and testing on ssj500k) and the non-standard settings (training on ssj500k, testing on Bohorič, Gaj and JanesL3, with and without unsupervised adaptation).
The standard data set, where less than 3% of tags are incorrect, contains relatively minor errors, mostly limited to (1) attributing the correct main PoS category but (partially) wrong lexical features as encoded in the PoS tag or (2) attributing a main PoS category that is incorrect but similar in function to the correct one.
By far the most frequent errors belong to type (1): while tokens are correctly tagged as proper nouns, their grammatical gender is not – feminine nouns are mislabelled as masculine and, albeit less frequently, vice versa. The next most common errors are of type (2) and concern misclassification between general adjectives and adverbs in the positive degree, followed by type (1) misclassification between common and proper nouns. All top misclassification phenomena occur in both directions.
In the Bohorič data set, however, the types of errors are very different: all of the most frequent errors involve labelling various PoS categories with the tag used for foreign words, that is, for words from other languages. This is due to the fact that this data set uses the old orthography, which relied on the German orthography of that time. With the addition of Brown clusters, this error type persists, yet the ratio between the parts of speech being confused for foreign words differs – pronouns are still the most frequently misclassified PoS, but their number drops with the use of clusters, and the same trend applies to content words. However, the benefits of clustering do not extend to conjunctions and prepositions, as clustering introduces more errors, causing them to surge to the second and third most commonly mislabelled tags, respectively. The two main reasons for this are the following: (1) conjunctions and prepositions are closed-class words, about which the tagger is very confident based on their full word-form features when learning on standard data sets, therefore not expecting variation, and (2) the number of different standard forms of these classes is very low, hence all the forms are seen in the training data and, although the non-canonical forms are clustered together with the canonical forms, on standard training data the tagger is incapable of measuring the (future) usefulness of these features.
Unlike the results obtained on the older Bohorič historical data set, PoS tagging of the Gaj data set does not result in many ‘foreign word’ labels, as the orthography in this data set is very close to the contemporary one. Instead, several PoS categories, mostly archaic forms, are mislabelled as common or proper masculine nouns, and the attributed gender of common nouns is incorrect (feminine and neuter nouns mistagged as masculine). Most frequently, particles are tagged as conjunctions, which is due to a difference in the linguistic decisions made during manual tagging of the standard and the historical data set: a series of word forms such as ‘pa’ and ‘naj’ lie somewhere between these two word classes. Unsupervised adaptation via Brown clusters improves, as expected, the mistaggings of feminine nouns and pronouns as masculine nouns, showing the effectiveness of Brown clusters on open word classes. On the other hand, as expected, clustering cannot deal with the misclassification of particles as conjunctions.
The JanesL3 data set has similar problems to the two historical data sets, with foreign words, adverbs and verbs being mistagged as common masculine nouns. Many foreign words were not seen in the training data, nor were general adverbs and verbs in their non-standard forms (‘lahk’ for ‘lahko’, ‘nardit’ for ‘narediti’). All these unseen forms are typically classified as masculine nouns, as this is the most common as well as the most variable PoS. Emoticons are also often incorrectly recognised as punctuation, since our heuristic does not cover all possible emoticons. Unsupervised adaptation via Brown clusters improves the results, in particular by decreasing the number of general adverbs and verbs mislabelled as common masculine nouns by approximately a half. Again, Brown clusters cover open classes very well, (1) clustering various standard and non-standard forms together and (2) being generally informative for the tagging process, as not all standard forms were seen in the training data; the Brown clustering features are therefore correctly assessed as relevant. In contrast, the number of proper nouns mistagged as common nouns increases, which shows that clustering helps with distinguishing between PoS categories, especially those further apart in function, but has difficulties in determining the exact PoS tag. As expected, clustering does not improve emoticon recognition, as the category is not present in the training data.
4.4 Adaptation via normalisation
We next investigate the impact of normalising the text before performing tagging. We evaluate the normalisation performance as well as the tagging performance of the model trained on standard data with and without normalisation of the non-standard test data. We consider the tagging result obtained with the standard tagger without prior normalisation, reported in the previous subsection, to be our tagging baseline. As the normalisation baseline, we consider a dummy process that does not transform the data in any way, except lower-casing it.
We take as the normalisation ceiling on the tagging task the tagging accuracy obtained on manually normalised data, that is, the tagging accuracy if normalisation performed perfectly. This ceiling shows the maximum level of performance we can expect from the approach of improving tagging via prior normalisation.
4.4.1 Results
The results of the normalisation process are presented in Table 3. The normalisation baselines (column Norm. base) show clearly that the Bohorič data set is the least standard, requiring half of the tokens to be modified. The remaining two data sets are close to each other, with the Gaj data set being the least non-standard. On all three data sets, normalisation (column Norm.) drastically improves the accuracy, yielding a relative error reduction (column Err. red. rel.) of 61% to 90%. The absolute improvements range from 9% to 43%. As expected, the biggest improvements are obtained on the Bohorič data set, while the improvements on the Gaj and JanesL3 data sets are quite similar.
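For reference, the derived quantities reported in Tables 2 to 4 follow the usual definitions; the short sketch below uses invented accuracies, chosen only to mirror the magnitudes discussed in the text.

# Sketch of the derived measures (invented accuracies): relative error reduction
# is the share of baseline errors removed by a system, and error increase is the
# factor by which errors grow relative to a reference system.

def relative_error_reduction(acc_base, acc_new):
    return (acc_new - acc_base) / (1.0 - acc_base)

def error_increase(acc_ref, acc_other):
    return (1.0 - acc_other) / (1.0 - acc_ref)

print(relative_error_reduction(0.50, 0.95))   # about 0.9, i.e. 90% of baseline errors removed
print(error_increase(0.97, 0.55))             # about 15, i.e. 15 times more errors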
We continue by reporting the tagging results with and without normalisation in Table 4. The results obtained without any normalisation are presented in the column Tag. These results are, naturally, almost identical to those reported in the previous section, with the difference that evaluation is now performed via fivefold cross-validation, which is necessary so that the normaliser can be trained on data on which tagging is not evaluated. We report the results when using both realistic (column Norm.) and manual normalisation (column Norm. gold). We again discriminate between models that do and do not use unsupervised domain adaptation in the form of Brown clusters (encoded in column Brown). We report the relative error reduction (column Err. red. rel.) between the system not using normalisation and the system using normalisation.
We can observe that normalisation always drastically improves tagging, most of all on the Bohorič data set, where it achieves a staggering 84% relative error reduction. This radical increase in accuracy can be traced to the different orthography used in this data set, something easily learnt by the normaliser. Normalisation brings the smallest gains on the JanesL3 data set, partly due to the least accurate normalisation, as reported in Table 3, but probably also due to a drastically different word order in comparison with the standard training data.
Between using automatic and gold normalisation, there is always a statistically significant difference on the downstream tagging task, but the gap is not very wide, between 1.4 and 3.3 accuracy points. Given that we cannot assume cross-validation results to be normally distributed, we do not rely on the confidence interval alone, but also run the McNemar statistical test for paired nominal data (McNemar 1947), which shows all the differences between using automatic and gold normalisation to be statistically significant at the level of p < 0.001.
Regarding the interplay of normalisation and Brown clusters, as expected, the relative error reduction is always greater if Brown clusters are not used, showing that the effects of the two techniques do overlap. However, normalisation is still a welcome addition to unsupervised adaptation, with just a minor decrease in the relative error reduction of 1%–5%.
Inspecting the opposite direction, namely whether Brown clustering is a useful addition if normalisation has been performed, the systems that use both always achieve higher scores, but the difference is within a single accuracy point and is statistically significant, given the McNemar test, at the level of p < 0.001 for all the data sets except the Bohorič one, for which it is significant at the level of p < 0.01. The smaller impact of Brown clusters on this data set can be traced back to the very limited amount of data available for learning word clusters.
In these experiments, we can also observe that unsupervised adaptation is beneficial when no normalisation is applied, with an absolute accuracy increase of between 1.22 and 2.68 points, which is in all three cases statistically significant at the level of p < 0.001.
We can conclude here that, as expected, normalisation has a much stronger impact on the tagging task than unsupervised adaptation. While normalisation still brings substantial benefits when applied to systems with unsupervised adaptation, the Brown cluster improvements are largely consumed by the normalisation process. Nevertheless, a minor but statistically significant accuracy increase can be achieved by using both methods and, given the low cost of the unsupervised adaptation technique, both should be used in practice.
4.4.2 Error analysis
The Bohorič data set benefits greatly from normalisation. As detailed in the previous error analysis section, various PoS categories, from pronouns, nouns, verbs, conjunctions and prepositions to adverbs, were mistaken for foreign words. This type of error almost disappears when normalisation is used. The most frequent error type in this setting is the confusion of particles for conjunctions, which is – similarly to the Gaj data set – due to a different linguistic decision on how to treat forms that lie somewhere between the two classes. Other error types are typical tagging mistakes: adjectives mislabelled as adverbs and vice versa, common nouns as proper nouns, etc. Adding Brown clustering helps slightly by reducing the number of errors of each type by a few instances, most notably adjectives mistagged as adverbs, common nouns as proper nouns, and feminine nouns as masculine nouns.
The remaining two data sets, Gaj and JanesL3, on the other hand, do not show as drastic a reduction of the error rate from normalisation as the older historical data set, as they do not exhibit the issue of a different orthography. The categories that benefit most from normalisation in the Gaj data set are adverbs, which are less frequently mislabelled as adjectives, as well as nouns, where gender is much better distinguished, with the error rate more than halved. Moreover, other PoS categories are less commonly mistagged as common masculine nouns, a phenomenon already discussed, which is also true for the JanesL3 data set. In the JanesL3 data set in particular, general adverbs and verbs are much less commonly confused with common masculine nouns; while unsupervised adaptation resolves only some words written in a non-standard fashion, normalisation is here very effective, as it quickly learns the rules for word-final vowel deletion in adverbs and similar patterns in non-verb forms.
In both data sets, unsupervised adaptation brings little improvement, partially resolving the confusions of foreign words as nouns and adverbs as adjectives. Even in this highly competitive setting, with normalisation applied, unsupervised adaptation again shows where it is most useful – in improving the coverage of open word classes.
4.5 Adaptation via in-domain data
The next set of experiments investigates the impact of direct, supervised adaptation of the tagging process via an in-domain manually annotated data set. Given that we have access both to a standard training data set and to a non-standard one, the question arises of which method should be used to combine the two data sets. Horsmann and Zesch (2015) compared a series of data combination strategies and showed that a simple oversampling strategy, that is, producing a specific ratio of general-domain and in-domain data by simply repeating the typically smaller in-domain data set, achieves the best performance.
We ran a series of initial experiments on all our data sets to identify the optimal ratio between standard and non-standard data, by successively adding amounts of non-standard data equal to 10% of the size of the standard data set. Regardless of the size of the non-standard data set, we observed a steep increase in performance up to a ratio of 2:1 of standard to non-standard data, after which the impact of further oversampling the non-standard data flattened out. In all our further experiments, we used an identical amount of standard and non-standard data, that is, a 1:1 ratio, as on all three data sets the performance of the tagger around that ratio was already in the stable part of the performance curve.
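A minimal sketch of this oversampling strategy is given below; the sentences and the helper function are illustrative rather than the actual training code.

import math

# Illustrative sketch of oversampling: the smaller in-domain data set is repeated
# until its token count roughly matches that of the standard data set, giving
# approximately a 1:1 ratio of standard to non-standard training tokens.

def oversample(standard, in_domain):
    std_tokens = sum(len(s) for s in standard)
    dom_tokens = sum(len(s) for s in in_domain)
    repeats = max(1, math.ceil(std_tokens / max(dom_tokens, 1)))
    return standard + in_domain * repeats

# toy example: three "standard" sentences and one in-domain sentence
mixed = oversample([["a", "b"], ["c", "d"], ["e", "f"]], [["x", "y"]])
print(len(mixed))   # 6 sentences: 3 standard + 3 repetitions of the in-domain one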
4.5.1 Results
The first results, reported in Table 5, compare our standard baseline (column Tag.) with the systems relying on prior normalisation (column Norm.), reported in the previous subsection, with systems trained only on non-standard data (column Domain) and with systems trained on a combination of standard and non-standard data at the 1:1 ratio (column Mixed).
Our first observation is that using in-domain training data is not always a winning strategy: on the Bohorič data set, the system performing normalisation before tagging, without relying on non-standard training data for tagging, outperforms the system trained on non-standard tagging training data. On the two remaining data sets, the investment in tagging training data seems to be more useful; it outperforms the system relying on normalisation by a few, yet statistically significant, points – the statistical significance level given the McNemar test being p < 0.001. However, these results should not be used for any final conclusions, as the time investment in producing normalisation data and tagging data is surely not the same. We control for the time investment in producing a data set for additional supervision in the next subsection.
When comparing the results obtained using only in-domain data with those obtained by mixing general- and in-domain data, small improvements are achieved by combining the data on all data sets except the Gaj data set, where a minor yet statistically significant (p < 0.001) drop is observed. This effect is probably due to the size of the Gaj training data set: the in-domain training set is half the size of the general-domain training set, so expanding it further with the standard data set – which is certainly more different from the in-domain texts than these are from one another – does not benefit the process any more.
Investigating the impact of Brown clusters, we observe that there are consistent improvements when using them. Comparing their impact on the system relying on normalisation and the one relying on in-domain data, we observe a very similar trend with small yet statistically significant (p < 0.001, except for Bohorič p < 0.01) improvements on all data sets.
4.5.2 Error analysis
Overall, the types of errors eliminated by adding in-domain tagging training data are very similar to those eliminated by normalisation.
An interesting phenomenon observed on all three data sets is that the confusion between the perfective and progressive (imperfective) main verbs is significantly higher in the system with in-domain tagging data than in the systems relying on normalisation. This is obviously due to this distinction primarily being a lexical one, and while normalisation correctly transforms those non-standard verbs into standard forms, in-domain tagging data is not nearly rich enough to cover all variants of main verbs. On historical data sets, in-domain tagging data also manages to lower the impact of different linguistic decisions regarding specific conjunctions and particles, which was impossible with normalisation.
Interestingly, in the UGC domain, in-domain tagging data does not resolve the confusion of emoticons with punctuation: all the frequent types of punctuation – which could have been learnt from in-domain training data – are already covered by the heuristic applied in this domain, while the less frequent emoticons still pose the same problem as with normalisation.
4.6 Comparing adaptation via normalisation and adaptation via in-domain data while controlling for time investments of the different supervision types
In the previous set of experiments, we primarily compared the impact of normalisation with that of giving the tagger in-domain training data. However, as already stressed, this comparison is not fair: the same amount of data was used in both cases, but the investment necessary to produce a normalisation data set and a tagging data set of the same size is quite likely not the same.
In this set of experiments, we control for the time necessary to produce a portion of a training data set and compare the impact of producing normalisation data sets and tagging data sets on the task of tagging non-standard data, thus attempting to answer the primary question of this research: whether, before PoS tagging non-standard texts, we should invest in training data for normalisation or for in-domain tagging.
In our first experiment, we measure the productivity of a representative annotator when annotating standard word forms, thereby producing normalisation training data, and when annotating PoS tags, thereby producing PoS tagging training data.
In this experiment, the annotator, who was already familiar with the annotation guidelines, was given three samples, one per corpus, each consisting of 500 word forms from randomly selected sentences. The Bohorič sample was additionally pre-processed with an automatic script that transforms the Bohorič script into the modern Gaj script, as we assumed that a typical normalisation annotation campaign would perform this pre-processing since it significantly reduces the annotator's workload. The annotator then performed manual normalisation and manual tagging of the data samples in separate runs.
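The actual pre-processing script is not reproduced here; a rough, hedged sketch of what such a rule-based transliteration could look like is given below, where both the character mapping and the example word are illustrative assumptions rather than the rules used in the paper.

```python
import re

# Illustrative lower-case Bohorič -> Gaj rules; digraphs must be matched before
# the single letters they contain, and a real script would also have to handle
# capitalisation and further context-dependent cases.
RULES = [("ſh", "š"), ("sh", "ž"), ("zh", "č"), ("ſ", "s"), ("s", "z"), ("z", "c")]
PATTERN = re.compile("|".join(re.escape(src) for src, _ in RULES))
MAPPING = dict(RULES)

def bohoric_to_gaj(text):
    """Single-pass substitution so that the output of one rule is never
    re-processed by a later one (e.g. an original 's' -> 'z' must not become 'c')."""
    return PATTERN.sub(lambda m: MAPPING[m.group(0)], text)

print(bohoric_to_gaj("shena"))  # hypothetical example, expected output: žena
```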
Having estimated the time necessary to produce a given amount of gold data of each sort, we continue this set of experiments by measuring tagging accuracy as a function of the annotation time investment for both of our approaches: (1) normalisation as a pre-processing step and (2) supervised tagger adaptation.
4.6.1 Results
The results of the experiment measuring the time necessary to produce gold data of each sort on the three data sets are given in Table 6. In all three cases, the time ratio between normalisation and tagging was 1:2; that is, tagging took twice as long as normalisation. Note that these time estimates do not include annotator training or the production of annotation guidelines, which we assume to be similar for both tasks.
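Conceptually, the learning curves below are obtained by converting each annotation-time budget into the amount of gold data of each sort that an annotator could produce in that time. A small sketch follows, with purely illustrative productivity figures; only the 1:2 ratio reflects the measurement reported above.

```python
def gold_tokens_for_budget(minutes, tokens_per_minute):
    """Number of gold-annotated tokens that could be produced in `minutes`,
    given the measured productivity for that annotation type."""
    return int(minutes * tokens_per_minute)

# Purely illustrative productivity figures respecting the observed 1:2 ratio:
# normalisation is roughly twice as fast as PoS tagging per token.
norm_tokens_per_min, tag_tokens_per_min = 10.0, 5.0

for hours in (5, 10, 20, 40):
    budget = hours * 60
    print(hours, "h ->",
          gold_tokens_for_budget(budget, norm_tokens_per_min), "normalised tokens or",
          gold_tokens_for_budget(budget, tag_tokens_per_min), "PoS-tagged tokens")
```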
The results of controlling for the time investment in our two types of tagger supervision are presented in the following three figures, one per data set. Each figure contains four curves: domain for the system to which in-domain tagging data is added, domain.brown for the system which additionally exploits Brown clusters, norm for the system with a normalisation pre-processing step, and norm.brown for the system which additionally applies unsupervised adaptation.
In Figure 2, the results on the Bohorič data set are presented. Regardless of the time investment, the normalisation approach on this data set outperforms the tagger adaptation via in-domain gold data. The curves representing the normalisation results are naturally half the length of the in-domain tagging curves as the time investment necessary to produce the whole normalisation data set is half of the time investment necessary for the tagging data set.
As already observed in the previous experiments, where the complete available data sets were used, the system using all available normalisation gold data outperforms the system using all available tagging gold data, even though producing the whole normalisation data set takes only half the time required for the tagging data set. Furthermore, the curves of the systems relying on normalisation are much steeper and flatten out earlier, showing that even minor time investments in producing gold normalisation data can improve tagging performance drastically. The curves of the systems relying on in-domain tagging data, however, keep rising as the amount of data increases, showing that more substantial time investments are necessary to improve tagging via in-domain data. On the other hand, with significantly greater time investments it may be possible for in-domain adaptation via gold tagging data to beat the approach of prior normalisation.
Regarding the effect of unsupervised adaptation, there is almost no difference when normalising, while when adapting via gold tagging data the difference is significant at the beginning and drops off as the size of the data set, that is, the amount of time invested in producing the in-domain tagging data, increases.
The results obtained on the Gaj data set are presented in Figure 3. The situation here is quite different from that on the Bohorič data set: normalisation is the more reasonable approach when only a limited amount of time can be invested in developing training data, while training the tagger overtakes normalisation at about 30–40 h of annotation time (excluding annotation guidelines compilation and annotator training). The curves of the normalisation system are again steeper and flatten out earlier, confirming that normalisation is a good choice when limited annotation time is available; with larger time investments, however, supervised adaptation via tagging data significantly outperforms normalisation as a pre-processing step.
On this data set, the difference between applying unsupervised adaptation or not seems rather stable, probably due to the high informativeness of the word clusters, as a large amount of training data is available for computing them.
Finally, Figure 4 depicts the same results for the JanesL3 data set. As on the Gaj data set, when smaller amounts of annotation time are invested in the adaptation, normalisation tends to be the better option, but on this data set the approach of adapting the tagger directly via in-domain tagging data takes the lead already at 10 h of annotation time. As with the previous data sets, the normalisation approach flattens out early, while the tagging approach keeps rising even when the entire available data set is used.
Unsupervised adaptation tends to make a significantly bigger difference when the tagger is adapted directly than when normalisation is applied as a pre-processing step. This is primarily because the tagger is given in-domain data on which it can more correctly assess the importance of the Brown clustering features, even for closed word classes, a phenomenon already discussed.
5. Conclusion
In this paper, we have investigated the impact of two types of supervision on the task of PoS tagging of non-standard texts: (1) training a normaliser that is applied before tagging and (2) training the tagger directly on in-domain data. Additionally, we have investigated the impact of the most popular unsupervised domain adaptation technique for PoS tagging, Brown clusters. Finally, we have controlled for the amount of time necessary to produce a specific amount of supervision, thereby enabling a fair comparison between the two supervised adaptation approaches, which should be highly useful for researchers and practitioners facing a similar task, especially those dealing with highly inflected Slavic languages.
The drop in the performance of taggers when no adaptation is performed is proportional to the distance of the target domain from the standard domain, with the error increasing by a factor of 5 to 15. Relative improvements obtained through unsupervised adaptation via Brown clustering range from 6% to 13%, corresponding to only 1 to 3 accuracy points. The more data are available for unsupervised adaptation via Brown clusters, the more clusters should be distinguished; choosing too many clusters, however, harms performance more than choosing too few. As expected, the improvement in the performance of the tagger follows the amount of data available for unsupervised adaptation. Unsupervised adaptation is very effective on open word classes, as those features can be properly weighted even on standard data sets, since not all standard word forms have been seen in the training data. The situation is much worse with closed word classes, where the word clustering features are not weighted as informative due to the full coverage of those classes already in the training data. This cheap type of adaptation continues to improve categories such as foreign words, nouns, adjectives and adverbs even after either normalisation or tagger adaptation is applied.
Including normalisation drastically improves tagging accuracy, especially on the least standard data sets, with a relative error reduction of 35%–85% that follows the error reduction of the normalisation process itself. While performing unsupervised adaptation without normalisation brings significant improvements on all three data sets, performing unsupervised adaptation on normalised data yields only small yet still statistically significant improvements; the smallest improvement and significance level occur on the Bohorič data set, due to the small amount of data for learning the Brown clusters. While unsupervised adaptation mostly improves the tagging of open word classes, normalisation, as expected, brings improvements to both open and closed classes. Given the inability of unsupervised adaptation to improve on the highly frequent closed classes, an interesting research direction might be the unsupervised identification of alternative forms of closed-class words and their injection into the standard training data.
Adapting the tagger with in-domain tagging data yields improvements similar to those obtained with normalisation. Coupling supervised and unsupervised adaptation via Brown clusters gives results very similar to those obtained when normalising the input, with minor yet statistically significant improvements on all data sets, the least significant again being on the Bohorič data set, primarily due to the small amount of data for learning word clusters.
The types of errors resolved through adaptation via tagging data are very similar to those resolved through normalisation, except for lexical issues, such as distinguishing between perfective and imperfective verbs, where normalisation achieves better results.
When measuring the time investment necessary for producing normalisation supervision and tagging supervision, producing tagging gold data is observed to take twice as long as producing normalisation gold data on all three data sets. Taking this difference into account, and iteratively adding the amount of gold data of each sort that can be produced in a time unit, we observe that normalisation works better when smaller amounts of annotation time are available, while adapting taggers with in-domain gold data tends to outperform normalisation when significant amounts of annotation time are invested. The curves depicting improvements as more normalisation data are added are steeper and flatten out early, indicating that normalisation is the cheap way of improving tagging of non-standard data, which, however, plateaus quite early. The curves depicting improvements obtained by adding more in-domain PoS tagging data start lower and rise more slowly, showing this approach to be inefficient for short annotation campaigns but, as one would expect, to yield better improvements in the long run.
Unsupervised adaptation via Brown clustering is shown to be similarly useful whether small or large annotation resources are available, with improvements tending to be bigger when adaptation is performed via in-domain tagging data. The only situation where unsupervised adaptation does not bring much improvement is the Bohorič data set when normalisation is performed.
While improving the PoS tagger with in-domain data is a very straightforward approach, the normalisation approach should not be underestimated, as it tends to obtain better results when smaller adaptation resources are available. Normalisation, as mentioned, also has side benefits, making the text more readable and helping with full-text search.
There are multiple potential directions for future research. The most obvious one is searching for the best possible results on the given data sets, which would probably be obtained by exploiting both the normalisation and the domain adaptation approach. Another quite promising approach could be reversing the normaliser and applying it to the canonical training data, thereby augmenting the data in the target domain. It would also be interesting to test both approaches with our fine-grained tagset, to see what impact this has on their relative performance; for this, however, we would first need to additionally annotate our historical data set with the fine-grained tags.