1. Introduction
In recent years, the amount of data available for language processing has increased significantly. It is now possible to find vast amounts of data for use as training data in Machine Learning. The field of Statistical Machine Translation (SMT) is no exception to this phenomenon. However, it has been shown in Ozdowska and Way (2009) and Gascó et al. (2012) that having more data does not always lead to better results. In fact, performance can increase by limiting the training data to a smaller but more relevant set. In addition, inducing a general Machine Translation (MT) model from a large set of training instances can be time consuming. This is why the use of data selection techniques has become a common step in the creation of an MT pipeline.
However, reducing the training data to a subset of relevant sentences can be an ambiguous task. The criteria followed for selecting these sentences can range from the broad (e.g. data belonging to a domain) to the more particular (e.g. those similar to the document we want to translate). In this paper, we address the latter problem. We use data selection algorithms that consider the document to be translated, which is assumed to be available at training time, to build MT models that are more adapted to such a test set. We refer to these methods that use the test set as transductive data selection.
The method for building transductive models that we explore in our paper is Feature Decay Algorithms (FDA). The original FDA and its variants (Poncelas, Way, and Toral 2016; Poncelas, Maillette de Buy Wenniger, and Way 2017) are data selection techniques that use information from the test set to select sentences from a parallel corpus used for training an MT model. Another characteristic of these methods is that they are context-dependent data selection techniques. This means that selecting new sentences is an iterative process, whereby at each step not only the test set is considered but the information given by previously selected sentences is exploited as well.
While the default configuration of FDA proposed in Biçici and Yuret (2011) has demonstrated excellent performance in a number of tasks, there is room for improvement. This paper covers two main points: (i) we propose a faster FDA alternative that benefits from parallelization and (ii) we improve the performance of FDA by adding information inferred from the parallel corpora. We provide a number of contributions in this paper, including:
a comparison of transductive data selection techniques;
a variation of FDA that boosts execution speed;
a bilingual model of FDA that also considers n-grams in the target language to avoid selecting noisy sentence pairs; and
an evaluation of the techniques presented in this work using Neural Machine Translation (NMT).
Consequently, this paper is divided into different sections addressing each of the issues mentioned above.
First, in Section 2, we briefly describe data selection algorithms and also introduce FDA.
The proposal for boosting the speed of FDA is explained in Section 3. We propose MRFDA, a variant of FDA that allows parallel computation when multiple computation units are available. This variant executes faster and obtains better results in some experiments.
A strategy to improve the quality of the selected data is presented in Section 4. While the features of FDA are n-grams extracted from the source language of the test set, our new proposal is to use phrase pairs from the training set as features and build a bilingual FDA system. Crucially, these phrase pairs are extracted from the training data and selected using only the source side of the test set. This allows us to indirectly leverage translation-equivalence information from the parts of the training data that are relevant to the source sentences of the test set.
In Section 5, we apply the techniques presented here to NMT. Finally, Section 6 concludes and suggests some further research that can be carried out in this area.
2. Related work
The goals of data selection are diverse: remove noise, restrict the amount of training data, select in-domain data, etc. Among the different categories, we highlight the transductive data selection methods (Vapnik 1998), which aim to identify the most relevant data points for a model to predict a new unseen test set. In the MT field, transductive algorithms assume that the document to be translated $S_{test}$ is available at training time, so the best subset of sentences for training an MT model (to translate $S_{test}$) is retrieved (Poncelas 2019).
In this section, we present an overview of different data selection methods (Eetemadi et al. 2015). We classify them as nontransductive (if they do not require the information from the document to be translated $S_{test}$, Section 2.1) or transductive (Section 2.2).
A limited subset of sentences is selected from a set of sentences S, given a sentence-level scoring function that specifies an implicit sentence ranking. Selected sentences are added to a selected pool or labeled data L, which eventually will become the training data for a particular end-task (MT in our case).
2.1 Nontransductive data selection methods
Nontransductive data selection methods are the most common. In this section, we provide a list of methods that extract the most relevant subset of sentences from a set of parallel sentences without using the information of the test set.
Length-based Functions: Many approaches remove noisy data by comparing the length difference (Taghipour et al. 2010) or the length proportion (Khadivi and Ney 2005) between the source-side and target-side sentences.
Alignment-based Functions: Taghipour et al. (2010) use sentence-alignment entropy to remove noisy data from the training set. Parallel sentences with relatively high entropy in comparison to the rest of the corpus are considered unreliable and are removed.
Language Model-Based Functions: Moore and Lewis (2010) propose to use an in-domain language model $LM_I$ and an out-of-domain language model $LM_O$ to obtain sentences that are closer to in-domain data. They therefore define the cross-entropy difference (CED) as $H_{I}(s)-H_{O}(s)$, where $H_{d}(s)$ is the cross-entropy of the sentence s according to a language model $LM_d$ in a domain d. Axelrod, He, and Gao (2011) extend the equation of CED by using language models in both the source-side and target-side languages, defining the bilingual CED as $(H_{I-src}(s)-H_{O-src}(s)) + (H_{I-trg}(s)-H_{O-trg}(s))$. Additionally, Hoang and Simaan (2014) propose an iterative algorithm based on Expectation-Maximization to estimate the probability of a sentence pair, from a mixed-domain corpus, being in- or out-of-domain.
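The following is a minimal sketch of CED-based selection; it is our own illustration rather than the code of the cited works, and the use of the kenlm Python module and the ARPA file names are assumptions made for the example.

```python
# Hedged sketch of cross-entropy difference (CED) selection (Moore and Lewis 2010).
# Assumes two language models trained beforehand on in-domain and out-of-domain text;
# the file names and the kenlm module are illustrative assumptions.
import kenlm

lm_in = kenlm.Model("in_domain.arpa")     # hypothetical in-domain LM
lm_out = kenlm.Model("out_domain.arpa")   # hypothetical out-of-domain LM

def cross_entropy(lm, sentence):
    """Per-word cross-entropy of a whitespace-tokenized sentence under lm.
    kenlm returns the log10 probability of the full sentence."""
    n_words = max(len(sentence.split()), 1)
    return -lm.score(sentence) / n_words

def ced(sentence):
    """H_I(s) - H_O(s): lower values indicate sentences closer to the in-domain LM."""
    return cross_entropy(lm_in, sentence) - cross_entropy(lm_out, sentence)

def select_by_ced(candidates, n):
    """Keep the n candidate sentences with the lowest CED."""
    return sorted(candidates, key=ced)[:n]
```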
N-gram Coverage: In Eck, Vogel, and Waibel (2005a), sentences that contain n-grams which are not yet in L are rewarded.
TF-IDF Coverage: The proposal of Eck, Vogel, and Waibel (2005b) is to retrieve sentences that are the most different from the selected pool L based on TF-IDF (Salton and Yang 1973) distance. The sentences that differ the most from L are the best candidates to be added to it, as they are the most informative.
DWDS: Density Weighted Diversity Sampling (DWDS) (Ambati, Vogel, and Carbonell 2011) scores a sentence based on: (i) how the words are distributed in both the selected and candidate pools and (ii) how many words are not already in the selected pool.
Log-probability Ratios: Haffari, Roy, and Sarkar (2009) propose to select sentences that are common (high probability) in the candidate pool and rare in the selected pool.
Perplexity Ratios: A similar approach to the Log-probability Ratios method is proposed by Mandal et al. (2008), in which sentences are selected based on the perplexity ratio between the candidate and selected pools.
2.2 Transductive data selection
In this section, we explain data selection algorithms that use the document to be translated to retrieve sentences (as FDA is a method central to this paper, we explain it in Section 2.3). We describe two approaches: sentence similarity methods (such as edit distance or TF-IDF distance), which consider the sentences in the test set individually (sentence-wise), and Infrequent n-gram Recovery, which, like FDA, considers the test set as a whole (document-wise).
Sentence similarity. These methods retrieve sentences based on how similar the sentences in S are to a sentence $s_{test}$ from the test set (Wang et al. 2014). For every sentence in the test set, the most similar sentences from the training data are retrieved.
Hildebrand et al. (2005) propose TF-IDF distance; more precisely, they use the cosine value between TF-IDF (Salton and Yang 1973) vectors as the distance metric. For each sentence $s_{test}\in S_{test}$ in the test set, the top-$\frac{N}{|S_{test}|}$ sentences from S are selected. Although they are aware that the resulting set may contain duplicates, in their experiments they show that models in which duplicate sentences have been removed achieve slightly worse results.
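A minimal sketch of this sentence-wise TF-IDF retrieval is shown below; the use of scikit-learn is our own choice for illustration and is not the implementation of Hildebrand et al. (2005).

```python
# Hedged sketch of TF-IDF transductive selection: for each test sentence, keep the
# top N/|S_test| most similar training sentences by cosine similarity of TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_select(train_src, test_src, budget):
    per_test = max(1, budget // len(test_src))
    vectorizer = TfidfVectorizer()
    train_vecs = vectorizer.fit_transform(train_src)   # |S| x |V| sparse matrix
    test_vecs = vectorizer.transform(test_src)         # |S_test| x |V|
    selected = []                                      # duplicates kept, as in the paper
    for similarities in cosine_similarity(test_vecs, train_vecs):
        top = similarities.argsort()[::-1][:per_test]  # indices of most similar sentences
        selected.extend(top.tolist())
    return selected
```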
Infrequent n-gram Recovery. Parcheta, Sanchis-Trilles, and Casacuberta (2018) propose extracting those sentences containing n-grams from the test set that are considered infrequent (Gascó et al. 2012), using simple counting over the candidate data S together with a maximum-frequency cutoff. Consequently, frequent words such as stop words are ignored. A sentence s is scored according to the number of infrequent n-grams it shares with the set of sentences $S_{test}$ of the test set. It is computed as in Equation (1):
$score(s) = \sum_{ngr \in S_{test}} \min\!\left(C_{\{s\}}(ngr),\ \max\!\left(0,\ t-C_{S_I+L}(ngr)\right)\right) \qquad (1)$
where $C_{\{s\}}(ngr)$ is the count of ngr in the sentence $s \in S$ to be scored, $C_{S_I+L}(ngr)$ is the count of ngr in the selected data L and an in-domain set $S_I$ used for initialization, and t is the infrequent n-gram cutoff frequency, that is, the maximum number of occurrences up to which an n-gram is considered infrequent. If the number of occurrences is above the threshold t, then ngr is considered a frequent n-gram and is ignored (the component $\max\!(0,t-C_{S_I+L}(ngr))$ is 0), so it does not contribute to the score of the sentence.
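A small sketch of this scoring function, under the simplifying assumption that features are word n-grams of a single fixed order, could look as follows; since the method is context-dependent, the counts $C_{S_I+L}$ would be updated after every selected sentence.

```python
# Hedged sketch of the Infrequent n-gram Recovery score of Equation (1).
from collections import Counter

def ngrams(sentence, order):
    words = sentence.split()
    return [tuple(words[i:i + order]) for i in range(len(words) - order + 1)]

def infrequent_score(s, test_ngrams, counts_si_plus_l, t, order=3):
    """Sum over test n-grams of min(C_{s}(ngr), max(0, t - C_{S_I+L}(ngr)))."""
    score = 0
    for ngr, c_s in Counter(ngrams(s, order)).items():
        if ngr in test_ngrams:
            score += min(c_s, max(0, t - counts_si_plus_l.get(ngr, 0)))
    return score
```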
2.3 Feature Decay Algorithms
FDA (Biçici and Yuret 2011, 2015; Biçici, Liu, and Way 2015) is a method that tries to maximize the coverage of the sentences to be translated. It uses coverage of the known source-side n-grams of the test set as an estimator of coverage of the required target side, which is unknown. Furthermore, it assigns each source-side test n-gram a value that decreases the more often the n-gram has already appeared in earlier selected sentences. This ensures that sufficient translation examples are extracted for all relevant test-set source n-grams.
Note that in this section we use the generic notation f for a feature, even though in this paper (and in the original work of Biçici and Yuret 2011) the only features used are n-grams from the test set.
These features are then searched for in the sentences of the training set. The score of a sentence s is the normalized sum of the values of its features. At each step, the sentence with the highest score is selected, and the values of the features of the selected sentence are decreased as in Equation (2):
$decay(f) = init(f)\,\frac{d^{\,C_L(f)}}{(1+C_L(f))^{c}} \qquad (2)$
where L is the selected pool and $C_L(f)$ is the count of the feature f in L. As the initialization function init(f), Biçici and Yuret (2011) propose to use either 1 or the inverted frequency $\log(|S|/C_S(f))$. The variables d and c are input parameters: the decay factor d is in the range (0, 1) with a default value of $0.5$, and the decay exponent c is in the range $[0, \infty)$ with a default value of $0.0$.
As previously mentioned, the score of a sentence is the normalized sum of the values of the features, and it is computed as in Equation (3):
$score(s) = \frac{\sum_{f \in F_s} decay(f)}{length(s)^{len\_exp}} \qquad (3)$
where $len\_exp$ is a parameter that indicates the amount of influence of the sentence length on the score. Higher values of
$len\_exp$ cause FDA to retrieve shorter sentences.
While the decay function in Equation (2) was introduced in Biçici and Yuret (2015), the default values of these parameters are set so that it is equivalent to Equation (4), the decay function originally proposed in Biçici and Yuret (2011).
$decay(f) = init(f)\, 0.5^{\,C_L(f)} \qquad (4)$
The score of a sentence s at a particular iteration is the sum of the values $0.5^{\,C_L(f)}$ of the features present in s, normalized by the length of s. Using the default configuration in Equation (4), the score of a sentence is computed as in (5):
$score(s) = \frac{\sum_{f \in F_s} 0.5^{\,C_L(f)}}{length(s)} \qquad (5)$
where $F_s$ is the set of features present in s.
FDA is a context-dependent function (Eetemadi et al. 2015), as it uses the information of the selected pool L when considering a new sentence to be added, and it has demonstrated good results in retrieving in-domain data (Silva et al. 2018). It is related to the work of Kirchhoff and Bilmes (2014), where the problem is examined from a submodular optimization perspective.
In each iteration, the sentence with the highest score is transferred to L, which causes the score of the rest of the candidate sentences (sharing the same n-grams with the test set) to decrease, as they depend on L.
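To make the selection loop concrete, the following is a compact (and deliberately unoptimized) sketch of FDA with the default configuration of Equations (4) and (5), i.e. init(f) = 1, d = 0.5, c = 0; it is an illustration under these assumptions, not the original release.

```python
# Hedged sketch of FDA selection with default parameters (Equations (4) and (5)).
from collections import Counter

def ngram_features(sentence, max_order=3):
    words = sentence.split()
    return {tuple(words[i:i + n])
            for n in range(1, max_order + 1)
            for i in range(len(words) - n + 1)}

def fda_select(train_src, test_src, budget, max_order=3):
    # Features are the source-side n-grams of the test set.
    features = set()
    for s_test in test_src:
        features |= ngram_features(s_test, max_order)
    c_l = Counter()                           # C_L(f): feature counts in the selected pool
    candidates = set(range(len(train_src)))
    selected = []

    def score(i):                             # Equation (5)
        f_s = ngram_features(train_src[i], max_order) & features
        return sum(0.5 ** c_l[f] for f in f_s) / max(len(train_src[i].split()), 1)

    while candidates and len(selected) < budget:
        best = max(candidates, key=score)     # sentence with the highest current score
        selected.append(best)
        candidates.remove(best)
        for f in ngram_features(train_src[best], max_order) & features:
            c_l[f] += 1                       # feature values decay as 0.5^{C_L(f)}
    return selected
```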
2.4 Comparison of FDA to other transductive algorithms
In Table 2, we compare FDA against the other transductive algorithms. In the table, we show the performance of the models built with the complete training data (AllData column), with sentences obtained using the TF-IDF transductive method (TF-IDF column), with Infrequent n-gram Recovery, and with FDA.
The data sets used in the experiments are based on those used in Biçici et al. (2015) (cf. Table 1):
Languages: German-to-English.
Training set: The training data provided in the WMT 2015 (Bojar et al. 2015) translation task, setting a maximum sentence length of 126 words (4.5M sentence pairs, 225M words).
Development set: We use 5K randomly sampled sentences from development sets from previous years of WMT for tuning.
Language Model: 8-gram Language Model (LM) built on the target-language side of the selected data via the KenLM toolkit (Heafield 2011) using Kneser–Ney smoothing (Kneser and Ney 1995).
Test set: Documents provided in the WMT 2015 Translation Task.
Table 1. Statistics of the data sets. $|S|$ is the number of sentences, $|W|$ the number of words, and $|V|$ the size of the vocabulary.
We train SMT systems on the selected data using the Moses toolkit (Koehn et al. 2007) with the standard features and using GIZA++ (Och and Ney 2003) for word alignment.
We include several evaluation metrics: BLEU (Papineni et al. 2002), NIST (Doddington 2002), TER (Snover et al. 2006), METEOR (Banerjee and Lavie 2005), and CHRF (Popovic 2015). These scores give an estimation of the quality of the output of the experiments compared to a human-translated reference. In general, the higher the score is, the better the translation is estimated to be (except for TER, which is a translation error measure and so lower is better).
We show the scores as the mean of 4 MERT (Och 2003) executions. Results in bold indicate an improvement over the baseline (in most cases, default FDA). We have also computed statistical significance, indicated with an asterisk, at level $p=0.01$ for BLEU, TER, and METEOR when compared to the baseline, using multeval (Clark et al. 2011) with Bootstrap Resampling (Koehn 2004).
For comparability, it is important to use the same sized training sets across these experiments. For this reason, in the rest of the paper, we have chosen to use 100K, 200K, and 500K training sentences. Note that the different methods executed in this work retrieve a similar amount of words for datasets of the same size: 4–5M words for 100K sentences; 9–11M words for 200K sentences; and 25–28M words for 500K sentences.
For the Infrequent n-gram Recovery configuration, we also use 3-grams. In the work of Parcheta et al. (2018), they do not set a default t in Equation (1), but execute several runs with values of t ranging from 10 to 20. We also execute the method for the values 10, 20, 40, and 80 (the value is multiplied by 2 in each run). Executing the method with $t=160$ causes the execution time to exceed 48 hours, so we use this as a stopping criterion. The higher the value of t (the infrequent n-gram cutoff frequency), the more sentences are retrieved (as the decision for considering an n-gram infrequent is less strict). However, with the largest value of t executed ($t=80$), the number of sentences retrieved is 229,913, as the remaining sentences have a score of 0.0. Therefore, a comparison of a model built with 500K sentences retrieved by Infrequent n-gram Recovery with $t=80$ is not possible.
In Table 2, we indicate in bold those scores that outperform the baseline (the model built with all data) and indicate with an asterisk whether these improvements are statistically significant at $p=0.01$. The models built with data retrieved by transductive algorithms perform better than the model using the complete training set. If we compare FDA to the rest of the algorithms, we see that it performs better than TF-IDF. Although we do not indicate it in the table, the performance of FDA is statistically significantly better (at $p=0.01$) than the TF-IDF models for all experiments according to METEOR and TER scores (and BLEU for the 100K-line and 200K-line models). In contrast, the performance of the Infrequent n-gram Recovery method is comparable to FDA. We computed the statistical significance and see that, at $p=0.01$, none of its scores are statistically significantly better than FDA.
Table 2. Comparison of transductive algorithms
3. Improving speed of Feature Decay Algorithms: MapReduce FDA
The data selection techniques we are exploring in this work are context-dependent functions. These techniques work iteratively, extracting sentences from the candidate pool and adding them to a set we call the selected pool. At each iteration, the information in the selected pool is used to decide which sentences will be selected next. Once the selected pool contains the desired number of sentences, we can use that set as training data.
Because the criteria for selecting a sentence depend on the selected pool, that is, on all the previous iterations, the process is difficult to parallelize. For this reason, in this work, we analyze the dependencies between the sentences so that a context-dependent technique such as FDA can be executed in parallel when multiple computation units are available. This means the computation resources are not underused and so the algorithm runs faster.
3.1 ParFDA
ParFDA (Biçici et al. 2015) tries to parallelize FDA by executing several independent FDA processes on partitions of the training data. The resulting selected data are then merged. However, it is only an approximation since some dependencies between the sentences are lost. The strength of FDA is to use the information of the previously selected sentences to choose the next sentence. If the parallel corpus is split into different parts, the dependencies between sentences in those different parts are lost, and so each FDA process does not have complete information about the selected pool to decide which sentence to select next. This can hurt performance, especially if features are not uniformly distributed over the sub-parts of the corpus.
3.2 MapReduce Feature Decay Algorithms
The proposal in this work is to use the MapReduce approach to perform a faster “FDA-like” data selection. MapReduce (Dean and Ghemawat 2008) is a programming model for processing large datasets. It is an abstraction that allows the data to be processed in parallel. It organizes the data as a set of key-value (k, v) pairs, which can be distributed among the different computation units. For processing the data, we can execute different instantiations of the map and reduce functions:
map: $(k,v) \rightarrow (k_I,v_I)$. Converts a pair (k, v) into a pair with an intermediate key and an intermediate value $(k_I,v_I)$. Every intermediate key-value pair is independent of the rest, so different pairs can be assigned to different computation units and be processed in parallel.
reduce: $(k_I,v_I) \rightarrow (k_I,[\ldots,v_I,\ldots])$. Shuffles the $(k_I,v_I)$ pairs and groups them by the key $k_I$.
The map and reduce functions can be concatenated in a pipeline to perform more complex tasks.
The results are not strictly equivalent to FDA as we make some assumptions that simplify the model. The first approximation is to pretend that a sentence can have only one occurrence of each feature. Then, the value of $C_L(f)$ at the iteration when a sentence s is selected can be redefined as pos(s, f), which indicates the position that sentence s has in the ranking of the selected pool, restricted to sentences containing the feature f. The FDA scoring method with default parameters in (5) of a sentence s can then be formulated as in Equation (6):
$score(s) = \frac{\sum_{f \in F_s} 0.5^{\,pos(s,f)}}{length(s)} \qquad (6)$
The computation of score(s) in (6) can be divided into two tasks: (i) compute the values of pos(s, f) for every feature in the sentence and (ii) add all the terms of the summation. In our work, the map stage will address the first task and is explained in Section 3.2.1. In the reduce stage (Section 3.2.2), the second task is performed and the top-N sentences are retrieved.
3.2.1 Map stage
The map stage aims to build a set of tuples (s, pos(s, f)) for every sentence s and feature f. Initially, an inverted index data structure is built in which each feature $f_i$ maps to the list of sentences where $f_i$ is present. On the left side of Figure 1, we can see an example: feature $f_1$ is present in sentences $s_3$ and $s_4$; feature $f_2$ is present in $s_1$, $s_2$, and $s_3$; and feature $f_3$ is present in $s_2$.
Figure 1. Map stage.
Structuring the sentences as presented on the left side of Figure 1 (i.e. having a list of sentences for each feature) allows us to extract the tuples (s, pos(s, f)) in parallel. After ordering the sentences within a list (this will be explained later), the information can be organized as tuples (s, f, pos(s, f)). The name of the feature itself is not relevant to compute the score of Equation (6), so only pairs (s, pos(s, f)) are constructed.
At the end of the map stage (right side of Figure 1), the union of all tuples is retrieved. On the right side of Figure 1, we see the pair $(s_1,1)$ as $s_1$ holds position 1 in the list of $f_2$ (the list on the left side of Figure 1). The sentence $s_2$ is in the list of $f_2$ (in position 2) and in the list of $f_3$ (in position 1), so we can find the pairs $(s_2,2)$ and $(s_2,1)$.
For ordering the sentences, two properties of a sentence must be considered: the value $C_L(f)$ of its features and its length. Longer sentences are more likely to contain more features, but their ratio of features per word may be smaller. It is necessary to find a balance between the value of the features and the length of the sentence.
Let us assume two sentences $s_1$ and $s_2$ where the set of features present in $s_2$ is a subset of those present in $s_1$ ($F_{s_2} \subset F_{s_1}$). If both sentences were the same length, sentence $s_1$ would be more valuable as it contains more features. It is better to select $s_2$ only when its length is much smaller. When should both sentences be equally likely to be selected? In other words, when does the equality of scores in Equation (7) hold:
$\frac{\sum_{f \in F_{s_1}} 0.5^{\,pos(s_1,f)}}{length(s_1)} = \frac{\sum_{f \in F_{s_2}} 0.5^{\,pos(s_2,f)}}{length(s_2)} \qquad (7)$
Considering the extreme case (and thus making another approximation of FDA) where all the positions are 1, we have the following approximation in Equation (8):
$\frac{|F_{s_1}|\, 0.5}{length(s_1)} = \frac{|F_{s_2}|\, 0.5}{length(s_2)} \qquad (8)$
where $|F_{s_i}|$ is the number of features present in the sentence $s_i$. Therefore, the proportion of lengths at which the two sentences are equally valuable is that in Equation (9):
$\frac{length(s_1)}{length(s_2)} = \frac{|F_{s_1}|}{|F_{s_2}|} \qquad (9)$
Accordingly, we use $length(s)/|F_{s}|$ as the criterion to order the sentences increasingly.
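A minimal sketch of the map stage is shown below; the data structures and names are our own illustration. Each feature's list could be processed by a different computation unit, but we keep the loop sequential for clarity.

```python
# Hedged sketch of the map stage: from an inverted index {feature -> sentence ids},
# emit (sentence, pos(s, f)) pairs, ordering each list by length(s)/|F_s| (Equation (9)).
def map_stage(inverted_index, lengths, n_features):
    pairs = []
    for feature, sentences in inverted_index.items():
        ranked = sorted(sentences,
                        key=lambda s: lengths[s] / max(n_features[s], 1))
        for pos, s in enumerate(ranked, start=1):
            pairs.append((s, pos))   # the feature name itself is not needed downstream
    return pairs
```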
3.2.2 Reduce stage
In the reduce stage, the pairs (s, pos(s, f)) produced in the map stage are collected and grouped by sentence. This means that each sentence maps to a list of pos(s, f) values, allowing us to compute the score of the sentence as in Equation (6). We show an example in the middle part of Figure 2. In the second row, a sentence such as $s_2$ collects the positions from the tuples $(s_2,1)$ and $(s_2,2)$ to compute $\frac{0.5^1}{length(s_2)}+\frac{0.5^2}{length(s_2)}$.
Figure 2. Reduce stage.
The list of tuples (s, score(s)) can be sorted by score and the top-N sentences are retrieved as the selected data.
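The corresponding reduce stage can be sketched as follows, again as an illustration rather than the exact implementation.

```python
# Hedged sketch of the reduce stage: group (s, pos(s, f)) pairs by sentence,
# apply Equation (6), and keep the top-N sentences.
from collections import defaultdict

def reduce_stage(pairs, lengths, top_n):
    positions = defaultdict(list)
    for s, pos in pairs:                       # shuffle and group by sentence id
        positions[s].append(pos)
    scores = {s: sum(0.5 ** p for p in pos_list) / max(lengths[s], 1)
              for s, pos_list in positions.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```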
3.3 Experiments
In order to evaluate the performance of our proposal when there are multiple computation units available, we execute three methods: FDA, ParFDA, and our proposal, FDA using MapReduce (MRFDA). The experiments were executed on a machine with 32 CPUs, so ParFDA was run by splitting the training set into 32 parts and executing FDA independently on each part.
In Figure 3, we present the execution times of the different algorithms when selecting sets of sentences of different sizes. It shows that the execution time of FDA depends on the order of the n-grams used as features and on the size of the selected set. In contrast, both ParFDA and MRFDA have execution times that are more constant, regardless of the order of the features or the size of the selected set of sentence pairs. We observe that the fastest algorithm is ParFDA. However, as stated in Section 3.1, ParFDA is only an approximation.
Figure 3. Time of execution of FDA, ParFDA, and MRFDA for different training sizes using 3-grams (left) and 7-grams (right) as features.
In order to understand how similar the output of MRFDA is to FDA, we present in Table 3 the percentage of lines that ParFDA and MRFDA share with FDA. We observe that for smaller sets of retrieved data the intersection is small (between 40% and 45%), but it rapidly increases to between 69% and 82% when more sentences are selected.
Table 3. Percentage of lines shared between FDA and ParFDA, and between FDA and MRFDA
We also show the evaluation metrics for FDA, ParFDA, and our work (MRFDA) using features of size 3 (Table 4) and size 7 (Table 5). Numbers in bold indicate that MRFDA has outperformed FDA for that particular metric. In addition, we indicate for BLEU, TER, and METEOR the statistically significant improvements at level $p=0.01$ of MRFDA over ParFDA (one asterisk) and over both ParFDA and FDA (two asterisks).
Table 4. Results of the executions of FDA, ParFDA, and MRFDA experiments using 3-grams as features
Table 5. Results of the executions of FDA, ParFDA, and MRFDA experiments using 7-grams as features
In Tables 4 and 5, we see that despite being an approximation of FDA, the results obtained are comparable to those obtained for FDA. In most of the experiments, we can see evaluation scores where MRFDA improves over FDA (numbers in bold in the last columns). In addition, especially when selecting larger numbers of lines, the improvements are statistically significant: the results of both experiments (using 3-grams and 7-grams) agree that when 500K lines were selected, the improvements are statistically significant at level $p=0.01$. For those experiments where FDA has better scores than MRFDA, we observe that none of the differences are statistically significant.
ParFDA obtains the lowest results. This can be understood from the fact that ParFDA splits the training set randomly into N parts, for some N, and performs selection on each part independently of the rest. Such independent selection over subsets of sentences is suboptimal for an algorithm that strives above all to select an adequate number of sentences for each n-gram across all sentences. None of the scores in the ParFDA column are better than those of FDA or MRFDA. When comparing MRFDA to ParFDA, we observe that in all the experiments, for at least two evaluation metrics, the improvements of MRFDA over ParFDA are statistically significant at level $p=0.01$.
Using higher-order n-grams as features boosts translation quality. This effect is present when comparing Tables 4 and 5 for FDA. ParFDA, in contrast, has a random component, so this effect does not always apply. In the ParFDA experiment where 100K lines were selected, we see a decrease in the evaluation scores in Table 5. In MRFDA, while increasing the order of the n-gram features is not beneficial for smaller data set sizes (100K lines), it rapidly improves when increasing the size (200K lines). However, at some point, using 3-grams and 7-grams leads to similar behaviour. When selecting 500K lines, the scores obtained by BLEU and CHRF are better using 3-grams, whereas TER scores are better with 7-grams (and METEOR scores remain the same).
4. Retrieving good quality sentences: Bilingual FDA
In this section, we propose an extension of MRFDA in order to achieve higher performance than FDA. Sentences containing translations that are not appropriate for the context of the test set are not suitable for use as training data as they may hurt the performance (for example, depending on the context, the word “light” can mean “not heavy” or “not dark”). For this reason, we are interested in finding phrase pairs that are appropriate for the test set. These phrase pairs may be used as features of FDA instead of just n-grams in the source language. If the phrase pairs are appropriate for the test set, this will provide several improvements: (i) avoid selecting bad quality sentence pairs; (ii) avoid selecting terms that are translated differently in another domain; and (iii) retrieve more occurrences of n-grams that have multiple translations.
In the following sections, we first propose a method that retrieves phrase pairs that are estimated to be useful for translating a test set (Section 4.1) and evaluate it on a phrase table pruning task (Section 4.2). Then, in Section 4.3, we use this set of phrase pairs as features of FDA.
4.1 Phrase pairs extraction using the test set
In order to extract good phrase pairs that are appropriate for a test set, we extend the translation by pattern matching technique proposed by Lopez (2008). Lopez proposes a more efficient grammar extraction procedure for the scenario where the test set $S_{test}$ is known at extraction time. In this scenario, Lopez describes how to extract only the specific phrase pairs (given a word-aligned parallel corpus) that are useful for that particular $S_{test}$ instead of building a phrase table with all the possible phrase pairs that can be obtained from the parallel sentences. The algorithm proposed by Lopez loops over all n-grams in the test set and finds appropriate phrase pairs for them (see Algorithm 1).
Algorithm 1 Pattern Matching Phrase Extraction algorithm (Lopez 2008).
Algorithm 2 Ranked Pattern Matching Phrase Extraction algorithm.
The method for extracting phrase pairs has been further developed (Callison-Burch, Bannard, and Schroeder 2005; Germann 2014, 2015) so that the phrase pairs are indexed using suffix arrays (Manber and Myers 1993) for fast retrieval. Once the phrase pairs for all the n-grams have been extracted, the resulting set of phrase pairs are the ones appropriate for $S_{test}$. Building a phrase table with this set of phrase pairs means that all the entries in the phrase table are candidates for translating $S_{test}$ (because the rest have been ignored).
The proposal of this section is to improve the translation by pattern matching technique by extracting the phrase pairs not from all the sentences in the sample $D_{ngr}$ (extracted from the complete training set S) but only from the best sentences. In the original work, the phrase pairs are extracted without considering the quality of the sentences or the context of ngr. Since a phrase in the source language may have multiple translations in the target language, not all the sentences are appropriate sources from which to extract the phrase pairs. In addition, there might be sentences in $D_{ngr}$ with low translation quality. For this reason, we propose to perform an analysis of $s_{test}$ and $D_{ngr}$ in order to decide which are the most appropriate parallel sentences.
Our proposal consists of separating the sampling stage into different steps. In Figure 4, we show the complete pipeline.
Figure 4. Using the information in the test set to filter phrase pairs.
The proposed metric is based on how much the words in $D_{ngr}$ are related to ngr. We want to compare the observed word distribution within
$D_{ngr}$ – which is conditioned on the n-gram ngr being present – against the general unconditioned word distribution for the full parallel text. In doing so, we are assuming that if the relative frequency of words in the conditioned distribution of
$D_{ngr}$ differs from their relative frequency in the unconditioned distribution, this is caused by the requirement of ngr to be present in
$D_{ngr}$. For every word in
$D_{ngr}$, the difference in relative frequencies for that word between the two distributions is then used to define a notion of dependency between the word and ngr. In order to make this comparison we use a statistic based on Pearson’s chi-squared independence test. The dependency a word w in
$D_{ngr}$ has on ngr is calculated as in Equation (10):
$dep(w,D_{ngr}) = \frac{\left(O(w,D_{ngr}) - E(w,D_{ngr})\right)^2}{E(w,D_{ngr})} \qquad (10)$
where $O(w,D_{ngr})$ is $C_{D_{ngr}}(w)$, the observed count of w in the sample $D_{ngr}$. Note that the count is computed only on one side of the parallel set; that is, if w is a word in the target language, then the count is computed on the target side only. $E(w,D_{ngr})= \frac{C_S(w)*|D_{ngr}|}{|S|}$ is the expected count of w in the corresponding side of $D_{ngr}$ if the distribution were the same as in the complete training data S. A higher value of $dep(w,D_{ngr})$ indicates a stronger dependence of the word w on the n-gram ngr.
Once the words (both on the source and target sides) of $D_{ngr}$ have been weighted, we can use them to build quality and similarity metrics.
Ranking Based on Quality. We consider that words on the target side of $D_{ngr}$ that are very dependent on the n-gram in the source side ngr are better candidates to be (part of) the translation of ngr. Sentences containing words with a low dependency on ngr indicate that they may not be useful for translating ngr.
For weighting the sentences, we can perform a normalized sum of the dependencies of the words on the target side of the sentence, as in Equation (11):
$quality(\langle s,t\rangle) = \frac{\sum_{w \in t} dep(w,D_{ngr})}{length(t)} \qquad (11)$
where $\langle s,t\rangle$ is a pair of source-side and target-side sentences.
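The two equations can be illustrated with the following sketch; the variable names, and the choice to measure $|D_{ngr}|$ and $|S|$ in sentences, are our own assumptions for the example.

```python
# Hedged sketch of Equations (10) and (11): weight the target-side words of the
# sample D_ngr by their dependency on ngr, then rank the sentence pairs by quality.
from collections import Counter

def dependencies(sample_tgt, corpus_tgt_counts, corpus_size):
    """sample_tgt: target sides of D_ngr; corpus_tgt_counts: Counter over the target
    side of the full training data S; corpus_size: |S| (number of sentences)."""
    observed = Counter(w for t in sample_tgt for w in t.split())
    dep = {}
    for w, o in observed.items():
        e = corpus_tgt_counts[w] * len(sample_tgt) / corpus_size   # E(w, D_ngr)
        dep[w] = (o - e) ** 2 / e if e > 0 else 0.0                # Equation (10)
    return dep

def quality(pair, dep):
    """Equation (11): normalized sum of target-side word dependencies."""
    _src, tgt = pair
    words = tgt.split()
    return sum(dep.get(w, 0.0) for w in words) / max(len(words), 1)

# Example: keep the 10 best pairs of D_ngr for phrase extraction.
# best = sorted(d_ngr_pairs, key=lambda p: quality(p, dep), reverse=True)[:10]
```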
4.2 Phrase extraction analysis
In order to evaluate if the method proposed in Section 4.1 retrieves phrase pairs that are appropriate for a test set, we use them on a phrase table pruning task. We want to investigate if a reduced phrase table, containing only the entries with the phrase pairs retrieved by our method, performs as well as using the complete phrase table.
There are multiple techniques for pruning entries from the phrase table. Previous proposals contemplate removing phrase pairs if they are redundant (Zens, Stanton, and Xu 2012) or using Fisher's Exact Test to evaluate whether they occur by chance (Johnson et al. 2007). This is done by analyzing the contingency table of the occurrences of the source and target phrases of the phrase table and computing the hypergeometric probability $\frac{\binom{C(\bar{s})}{C(\bar{s},\bar{t})}\binom{|PT|-C(\bar{s})}{C(\bar{t})-C(\bar{s},\bar{t})}}{\binom{|PT|}{C(\bar{t})}}$, where $|PT|$ is the size of the phrase table, and $C(\bar{s})$, $C(\bar{t})$, and $C(\bar{s},\bar{t})$ are the occurrence counts of the source phrase $\bar{s}$, the target phrase $\bar{t}$, and the pair $\langle \bar{s},\bar{t}\rangle$, respectively, in the phrase table PT. These techniques do not make use of the information in the test set.
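For illustration, the probability above can be computed with SciPy's hypergeometric distribution; this snippet is our own sketch, not the implementation of Johnson et al. (2007), who use the tail p-value of Fisher's Exact Test rather than the single probability.

```python
# Hedged sketch: probability of observing the pair (s, t) exactly C(s, t) times by chance.
from scipy.stats import hypergeom

def pair_probability(c_s, c_t, c_st, pt_size):
    # hypergeom.pmf(k, M, n, N): k successes in N draws from a population of size M
    # containing n successes; here M = |PT|, n = C(s), N = C(t), k = C(s, t).
    return hypergeom.pmf(c_st, pt_size, c_s, c_t)

# The p-value used for pruning would sum the tail:
# hypergeom.sf(c_st - 1, pt_size, c_s, c_t)
```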
The goal of the experiments in this section is to explore whether ranking sentences is useful for obtaining fewer but more appropriate phrase pairs than the default translation by pattern matching technique. The proposed experiments are as follows:
NoRanking300: A random sample of 300 phrase pairs (in his original work, Lopez (2008) concludes that with a random sample of 300 sentences the accuracy plateau is reached).
Qual-10: The top-10 phrase pairs, from a sample of 300 pairs, ranked using the quality score measure of Equation (11).
The phrase pairs obtained by each experiment are used to filter entries from a phrase table built using Moses.
In Table 6, we compare Qual-10 with the baseline NoRanking300, the original system proposed in Lopez (2008). Comparing the baseline system, which keeps all 300 phrase pairs (column NoRanking300 in Table 6), with our proposal, we observe that ranking the sentences has a positive effect on the results.
Table 6. Comparison of the estimated quality of translation by an SMT system with a phrase table built using all the parallel sentences (AllData) and performing source-driven phrase extraction from random samples of 300 sentences (NoRanking300). Results in bold indicate the best score. An asterisk means the result is statistically significant at level $p=0.01$ when compared with the AllData experiment.
Table 7. Summary of the performance of FDA, MRFDA, and BilingualFDA methods
4.3 BilingualFDA
The phrase pairs obtained in Section 4.2 are useful to build a filtered phrase table for translating a particular test set, as shown in Table 6. In this section, we demonstrate an additional use of these phrase pairs: extending FDA to build a BilingualFDA.
Default FDA uses n-grams on the source side extracted from the test set. For example, one of the features that FDA extracts from the test sentence “die Premierminister Indiens und Japans trafen sich in Tokio.” is the n-gram “in Tokio”. The algorithm will then select sentences containing this n-gram on the source side. In BilingualFDA, though, the features used are phrase pairs such as (“in Tokio”, “in Tokyo”). This means that it will select sentences containing “in Tokio” on the source side and “in Tokyo” on the target side. The algorithm will thereby avoid selecting incorrectly translated sentences from the training data.
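As a sketch of the difference, a bilingual feature can be represented as a pair of n-grams that a candidate sentence pair must contain on both sides before it contributes to the FDA score; the substring matching below is a simplification for illustration, not the exact implementation.

```python
# Hedged sketch of BilingualFDA feature matching: a feature is a (source n-gram,
# target n-gram) pair, and a training pair only "contains" it if both sides match.
def contains_feature(src_sentence, tgt_sentence, feature):
    src_ngram, tgt_ngram = feature            # e.g. ("in Tokio", "in Tokyo")
    return src_ngram in src_sentence and tgt_ngram in tgt_sentence

def sentence_features(src_sentence, tgt_sentence, bilingual_features):
    """The feature set F_s plugged into the (MR)FDA scoring of the previous sections."""
    return {f for f in bilingual_features
            if contains_feature(src_sentence, tgt_sentence, f)}
```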
4.3.1 Experiments on BilingualFDA
We again run data selection experiments extracting different sizes of training data (100K, 200K, and 500K lines) using BilingualFDA. In order to build BilingualFDA, we extend MRFDA to use phrase pairs as features. The set of phrase pairs used in the experiments is the one obtained in the Qual-10 experiment in Section 4.2.
In Table 7, we present a summary comparing the algorithms proposed in this work. The MRFDA experiment shows the results obtained in Section 3 using n-grams of size 3 (the same used for default FDA experiments).
The last column shows the results obtained using BilingualFDA. This is the only algorithm that considers the quality of the sentences, and as a result, it performs the best and achieves statistically significant improvements at level $p=0.01$ for all the scores we have computed (BLEU, TER, and METEOR).
5. Application to NMT
In the following, we explore whether the improvements presented in this paper are also observed with NMT models. The work of van der Wees, Bisazza, and Monz (2017) demonstrated that models can be improved by using in-domain data selected using CED. We replicate the experiments presented in Table 7 (we use the same data) but with neural models trained using OpenNMT-py (Klein et al. 2017) with the default settings: a 2-layer LSTM with 500 hidden units and a vocabulary size of 50,000 words for each language, trained for 13 epochs. The results are presented in Table 8.
Table 8. Summary of the performance of FDA, MRFDA, and BilingualFDA methods in NMT
The work of Poncelas, Maillette de Buy Wenniger, and Way (2018) explores the performance of NMT using different sizes of data selected by FDA. In their work, they show that the performance increases the more data are selected, up to 2M sentences. However, the increases in performance when selecting more than 500K sentences are small (below 1% improvement in terms of BLEU). For this reason, in the NMT experiments, we include the comparison between sizes of 100K, 200K, and 500K sentence pairs, which are the same amounts explored for SMT in this paper.
In the table, we observe that, in contrast to SMT, the scores increase the more data are used for training (the scores in the 500K-line subtable are better than those in the 200K-line subtable, which in turn are better than those for 100K lines).
When comparing the performance of the models column-wise, we see that MRFDA does not outperform FDA. The method that achieves the best results is BilingualFDA, as it obtains statistically significant improvements over FDA at $p=0.01$ for the models built with the different sizes of data. We propose as future work a more fine-grained experiment exploring the performance using different configurations of each method. For completeness, we also include the results of the NMT baseline model trained on all data (first column of Table 8).
6. Conclusions and future work
In this work, we have explored FDA, a transductive data selection technique, to build SMT models for German to English. Models trained with data retrieved by this method are shown to be superior to those built with the full training set, as well as to models trained with data from other transductive data selection techniques.
First, we have analyzed the factors that have an influence on the execution time of FDA. Using high orders of n-grams as features (which forces the method to deal with more features) and retrieving larger data sets are the main factors that cause the execution time of FDA to increase. In order to solve that problem and obtain more constant execution times, we presented MRFDA, a parallelized version of FDA that benefits from using multiple CPUs. This new method executes faster than FDA yet obtains comparable performance. Furthermore, the quality achieved by MRFDA is better than with ParFDA, which is the state-of-the-art parallel version of FDA.
The last method improves FDA by extending the features. Default FDA uses only source-side n-grams as features. We proposed to use pairs of n-grams (in the source and target languages) to create BilingualFDA. By preprocessing the training data with respect to the test set, it is possible to find those n-gram pairs that are good translations of each other. Using these as features, BilingualFDA can avoid selecting those sentences that potentially hurt performance.
As far as parametrization and application of FDA are concerned, there are some remaining opportunities that we leave for future work: (i) optimal number of sentences (i.e. find a method to retrieve how many sentences should be selected to achieve the best results); (ii) optimal FDA parameters (i.e. explore new ways of estimating more effective default values); and (iii) FDA for tuning (i.e. explore the performance of the models when FDA is used to select sentences for tuning the model).
Alternative executions of FDA that can be further investigated include the use of an approximated translation of the test set (Poncelas, Way, and Sarasola 2018; Poncelas, de Buy Wenniger, and Way 2018) so that the target language is also considered.
Finally, we are interested in further exploring the algorithms explained in this work using NMT, with different configurations (Poncelas et al. 2018) or artificial datasets (Poncelas, de Buy Wenniger, and Way 2019a; Poncelas and Way 2019; Soto et al. 2020). Even if NMT systems work better with large amounts of data, data selection algorithms are useful for performing so-called “fine-tuning” (Luong and Manning 2015; Freitag and Al-Onaizan 2016), where pre-built systems are improved with a small portion of in-domain data. We plan to investigate the performance of this technique using FDA (Poncelas et al. 2018; Poncelas, de Buy Wenniger, and Way 2019b) and the methods presented here to improve NMT models trained with large sets of parallel sentences.
Financial support
This research is supported by the ADAPT Centre for Digital Content Technology, funded under the SFI Research Centres Programme (Grant 13/RC/2106). This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 713567.