1. Introduction
Neural machine translation (NMT) (Forcada and Ñeco 1997; Kalchbrenner and Blunsom 2013; Cho et al. 2014; Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2015) has recently drawn significant attention from researchers due to its encouraging performance on publicly available benchmark datasets (Bojar et al. 2016) and its rapid adoption in production systems (Wu et al. 2016; Crego et al. 2016; Junczys-Dowmunt, Dwojak, and Hoang 2016). The key strengths of NMT are that it generates fluent outputs and that it can be implemented as a single end-to-end neural system, unlike the long-dominant phrase-based Statistical Machine Translation (SMT) (Koehn, Och, and Marcu 2003), which combines many submodules. The performance of an NMT system largely depends on the amount of parallel data available: it produces good translations when sufficient training data exist, but performs poorly otherwise, and the amount considered sufficient for NMT training is in the order of millions of parallel sentences (Lample et al. 2018). In contrast, SMT models are known to cope better than NMT in the absence of enough training data.
Although SMT performs better than NMT in the absence of a large parallel corpus, there has been growing interest among researchers in building effective NMT models for such scenarios as well. One reason that makes NMT a better choice, even in the absence of sufficient data, is that NMT shows a sharp jump in BiLingual Evaluation Understudy (BLEU) score as the data size increases, whereas SMT improves at a fixed rate (Koehn and Knowles 2017). NMT requires a huge amount of parallel data to build a good translation system, and the absence of such corpora makes NMT suffer from the adequacy problem (Koehn and Knowles 2017).
The quality of an NMT system heavily depends on the training data size. Standard systems make use of parallel corpora containing millions of sentences. However, it is not only certain language pairs that lack training data; we also face low-resource scenarios in many domains such as medical, tourism, and judicial. In those cases, translation again becomes challenging because of the absence of parallel data for the domain. For example, many Indic languages do not have enough parallel data to build robust NMT systems; only a few thousand parallel sentences are available (Jha 2010). In the absence of a sufficient amount of data, a model learns poorly because of the low counts of source–target units. One of the major challenges of NMT, irrespective of the training data size, is the handling of rare words, and if the data size is very small, then most source–target pairs occur only a handful of times.
In this work, we propose a method to substantially improve NMT for low-resource languages and/or domains. We extract phrase pairs from the original training data using phrase-based SMT (Koehn et al. 2003) training and augment the original training corpus by adding the most probable pairs as parallel sentence pairs. By phrase, we do not necessarily mean a linguistic phrase; rather, it is a consecutive sequence of words. We evaluate our approach using the BLEU score (Papineni et al. 2002) against baseline models built with the standard attention-based (Bahdanau et al. 2015) gated recurrent unit (GRU) (Cho et al. 2014) and transformer-based (Vaswani et al. 2017) NMT systems. Our experiments show that the proposed model attains significant performance gains over the baselines in a very low-resource scenario. Our approach differs from existing approaches in the following ways: (i) our system makes use of a relatively small corpus consisting of only 5–23k parallel sentences and (ii) we include the phrase pairs directly in the training corpus, treating them as sentence pairs.
We summarize the key contributions and/or characteristics of our proposed approach as follows.
• We propose an effective NMT model with feedback from SMT phrases for translating low-resource languages.
• We empirically establish that our proposed approach improves the performance of the NMT system under a low-resource scenario, showing improvements over the baselines for the English–Hindi and Hindi–Bengali language pairs.
• We empirically show that transformer works significantly better than the attention-based GRU in low-resource scenarios.
• We also build an NMT system for old-to-modern English translation using our proposed approach and observe significant improvement over the baseline. The main motivation for this was to establish how generic and effective our proposed approach is for translating texts of completely different genres and structures.
The remainder of the paper is organized as follows. In Section 2, we define the problem and present the underlying motivation of our current work. Section 3 presents an overview of the existing literature. In Section 4, we describe the proposed method. Sections 5 and 6 discuss the datasets and experimental setup, respectively. In Section 7, we report the results along with proper analysis. Finally, in Section 8, we conclude with future work directions.
2. Problem definition and motivation
Both NMT and SMT require a large-scale, high-quality parallel corpus for training a good-quality machine translation (MT) system. The absence of such corpora makes NMT suffer from the adequacy problem (Koehn and Knowles 2017). Creating a high-quality large-scale parallel corpus is expensive, as it requires time, money, and professionals to translate a large amount of text. As a result, many of the existing large-scale parallel corpora are limited to specific languages and domains.
The quality of any MT system can be characterized by its adequacy and fluency. The long-dominant SMT has been found to be good at handling adequacy but lacks fluency. Recently, NMT has become the new state-of-the-art paradigm for MT. However, it has been reported that NMT sacrifices adequacy for the sake of fluency (Koehn and Knowles 2017). The performance of an NMT system greatly depends on the amount of parallel data: the more data, the better the performance. However, having a sufficient corpus for training an NMT system is a challenge. Adequacy is a direct measure of how well an NMT system learns the mapping between the symbols of the source and target languages, and NMT fails to capture these mappings when the parallel corpus is not large enough. It is often the case that a sufficient parallel corpus is not available for many language pairs as well as for some restricted domains. So, in order to help NMT models learn the mapping between words in the absence of a sufficient parallel corpus, we extract phrases from the original training data and add them to it (i.e., the original training corpus) as additional evidence. This, in turn, provides implicit knowledge about the mappings between the source–target pairs.
In our current work, we propose an effective approach for the translation of a variety of texts and languages. Phrases extracted by SMT are fed as input to the training of NMT. Firstly, we build NMT systems for the resource-scarce Indian languages, and secondly, we translate old English texts into modern English. India is a multilingual country with great linguistic and cultural diversity. Resources and tools in the form of parallel corpora, morphological analyzers, part-of-speech taggers, etc. are not readily available in the required measure. Translating old text into modern text is important for various purposes. Human languages are constantly evolving and changing over time to reflect sociocultural changes and to fit current conventions, mores, expressions, and needs. This change in a language often requires "rewriting" the old texts for modern readers of the same language. In line with global trends, old texts are increasingly available in forms that computers can process. These ever-expanding records (e.g., historical records, scanned books, academic papers, large-scale corpora, and maps), whether digitally born or reconstructed through digitization pipelines, are too big to be "rewritten" manually. We pose this rewriting of old text as an MT problem and use our proposed method to improve it.
3. Related work
Having enough parallel data is a big challenge in NMT, and it is very unlikely that millions of parallel sentences will be available for every language pair. A few attempts have been made to build NMT systems for low-resource language pairs (Sennrich, Haddow, and Birch 2016a; Zhang and Zong 2016; Gulcehre et al. 2017), which incorporated huge monolingual corpora on the source and target sides.
Sennrich et al. (2016a) incorporated monolingual data on the target side and investigated two methods of filling the source side of the monolingual data. In the first method, they used a dummy source sentence for every target sentence, and in the second, they used a synthetic source sentence obtained via back translation. They claimed that the second method is more effective. However, if there is not enough parallel data, the quality of the back translation is itself a problem.
Zhang and Zong (2016) explored different ways of incorporating large-scale source-side monolingual data in NMT. In the first approach, inspired by Sennrich et al. (2016a), they first built a baseline model and then obtained synthetic parallel data by translating the monolingual data. These parallel data, along with the original data, are used for training an attention-based GRU system. The second method used a multitask learning framework to generate the target translation and reorder the source-side sentences at the same time. They claimed that the use of source-side monolingual data is more effective in NMT than in SMT.
Gulcehre et al. (2017) proposed two alternative methods to integrate monolingual data on the target side, namely shallow fusion and deep fusion. In shallow fusion, the top K hypotheses (produced by NMT) at each time step t are re-scored using the weighted sum of the scores given by the NMT model (trained on parallel data) and a recurrent neural network-based language model (RNNLM). In deep fusion, the hidden states obtained at each time step t of the RNNLM and the NMT model are concatenated, and the output is generated from that concatenated state.
More recently, Arthur, Neubig, and Nakamura (2016) proposed a model that incorporates translation lexicons by calculating a lexical predictive probability and adding this probability to the input of the softmax. Feng et al. (2017) proposed a method that extracts a phrase translation dictionary from the corpus using word alignment, and the phrase translation probability is used in the NMT model to construct a local memory.
Zoph et al. (2016) applied transfer learning to low-resource NMT. They trained a model on a high-resource language pair, and the learned parameters were then used for training a low-resource language pair. However, this requires the high- and low-resource language pairs to be of similar types (i.e., close to each other), so the approach may not work if the language pairs are distant. Our proposed approach differs fundamentally from theirs: unlike them, we do not use any large amount of additional parallel data; rather, we work only with a relatively small corpus.
Wang et al. (2018) proposed a simple data augmentation technique that randomly replaces words in both the source and the target sentence with other random words from their corresponding vocabularies. He et al. (2016) incorporated SMT features, such as a translation model and a language model (LM), under a log-linear framework during the beam search of the decoding step.
Wang et al. (2017) and Wang, Tu, and Zhang (2018) proposed an NMT model advised by SMT, where at each decoding step SMT offers additional recommendations, and the recommendations are scored with a classifier for combination with the NMT model in an end-to-end manner. Zhao et al. (2018) also used a phrase table as a source of recommendations, adding a bonus to words worthy of recommendation in order to help NMT predict adequate words.
NMT is known to be weak at translating rare words. Fadaee, Bisazza, and Monz (2017) proposed an approach for handling rare words through data augmentation for the English–German language pair. Their approach also made use of a huge monolingual corpus for generating sentence pairs containing rare words, and these generated sentence pairs are used for training the NMT models. Although the said pair does not fall under the low-resource category, they created simulated low-resource settings for their experiments and claimed substantial improvements in BLEU score. In our experimental settings, however, we use truly low-resource languages, for which even obtaining a large monolingual corpus is a challenge.
Song et al. (2019) investigated a data augmentation method for constraining NMT with pre-specified translations. In this method, source sentences in the training data are code-switched by replacing source phrases with their target translations, allowing the model to learn lexicon translations by copying source-side target words. In our approach, we do not code-switch the training data; instead, we use phrase pairs as additional training data.
Most of the earlier work on NMT in low-resource scenarios tried to incorporate monolingual data on either the source or the target side. The effect of adding monolingual data in NMT is similar to that of building an LM on a large-scale monolingual corpus in SMT: it makes the output more fluent, but NMT still lacks adequacy, and adding monolingual data does not contribute much to improving it. A number of attempts at low-resource NMT also tried to expand the training data by adding back-translated monolingual data. At first, a model is trained using the available training data, and then the monolingual corpus is translated with it. The quality of the parallel data obtained by translating the monolingual data depends on the size of the original parallel data; if the data size is very small, then the translated data may not help much. However, the effect of adding source–target phrases to the training data is less explored (Sen et al. in press).
Recently, unsupervised NMT (Lample et al. 2018; Artetxe, Labaka, and Agirre 2018; Ren et al. 2019; Lample and Conneau 2019), semi-supervised NMT (Zhang et al. 2018), and unsupervised pre-training (Ramachandran, Liu, and Le 2017) have emerged and shown promising results on related languages. These techniques require a huge amount of monolingual data. It has also been shown that purely unsupervised techniques do not work for distant language pairs, for example, the pairs we deal with in this work (Guzmán et al. 2019).
4. Proposed method
Our focus is on the low-resource scenario, and in order to handle it, we add the phrase pairs extracted from the training corpus as feedback to the NMT framework during training. Our proposed approach is not specific to any NMT architecture. We perform experiments on two state-of-the-art neural networks, namely the attention-based (Bahdanau et al. 2015) GRU and the transformer (Vaswani et al. 2017) models. Here, we briefly describe the two networks and then present the details of the proposed method.
Figure 1. Attention-based GRU NMT architecture.
4.1 Attention-based GRU
The goal of NMT is to translate a sequence of source words into a sequence of target words with the help of a large neural network. The basic architecture of an NMT system, shown in Figure 1, uses two recurrent neural networks: one is called the encoder and the other the decoder. The encoder converts the source sentence into a dense fixed-length vector, and the decoder then generates the target sentence from that vector. The main drawback of this encoder–decoder approach is that it degrades drastically as the length of the input sentence grows: it assumes that the encoder can encode the whole sentence into a fixed-length vector, which is not realistic, especially for longer sentences. To mitigate this drawback, Bahdanau et al. (2015) proposed an attention mechanism that lets the decoder focus on the relevant parts of the whole input sentence while generating each output word.
Formally, given a sequence of source words $x = (x_1, x_2, x_3, \ldots, x_{T_x})$ and the previously translated $i-1$ words $y = (y_1, y_2, y_3, \ldots, y_{i-1})$, the conditional probability of the $i$th output $y_i$ is calculated as

$P(y_i \mid y_1, \ldots, y_{i-1}, x) = \mathrm{softmax}(W_o\, t_i)$   (1)

where $t_i$, the input to the softmax, is computed as

$t_i = \tanh(W_s\, s_i + W_e\, E_{y_{i-1}} + W_c\, c_i)$   (2)

where $W_s$, $W_e$, $W_c$, and $W_o$ are the model parameters and $E_{y_{i-1}}$ denotes the embedding of the previously generated word. The hidden state $s_i$ in the decoder at time step $i$ is computed as

$s_i = g(s_{i-1}, E_{y_{i-1}}, c_i)$   (3)

Here, $g$ is a nonlinear transform function, which is usually a long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) or a GRU (Cho et al. 2014), and $c_i$ is the context vector at time step $i$, which is calculated as a weighted sum of the input annotations $h_j$:

$c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j$   (4)

where $T_x$ is the length of the source sequence and $h_j$ is the encoder hidden state at the $j$th time step, computed using a nonlinear transformation (such as a GRU or an LSTM) as

$h_j = f(h_{j-1}, x_j)$   (5)

The normalized weight $\alpha_{ij}$ for $h_j$ is calculated as

$\alpha_{ij} = \dfrac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$   (6)

$e_{ij} = V_a^{\top} \tanh(U_a\, s_{i-1} + W_a\, h_j)$   (7)

where $V_a$, $U_a$, and $W_a$ are the trainable parameters. All of the parameters in the NMT model are optimized to maximize the following conditional log-likelihood of the $N$ parallel sentences:

$\mathcal{L}(\theta) = \dfrac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{T_y} \log P\left(y_i^{(n)} \mid y_{<i}^{(n)}, x^{(n)}; \theta\right)$   (8)

where $T_y$ is the length of the target sequence.
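To make the attention computation concrete, the following minimal NumPy sketch implements Equations (4), (6), and (7) for a single decoder step. It is an illustration only: the dimensions, random parameters, and function names are our own and are not taken from Nematus or from the systems used in this paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_step(s_prev, H, V_a, U_a, W_a):
    """One decoder step of Bahdanau-style attention.

    s_prev : previous decoder state, shape (n,)
    H      : encoder annotations h_1..h_Tx, shape (Tx, m)
    V_a, U_a, W_a : parameters of the alignment model
    Returns the context vector c_i (Eq. 4) and the weights alpha_ij (Eq. 6).
    """
    # Eq. (7): unnormalized alignment scores e_ij
    e = np.array([V_a @ np.tanh(U_a @ s_prev + W_a @ h_j) for h_j in H])
    # Eq. (6): normalized attention weights
    alpha = softmax(e)
    # Eq. (4): context vector as a weighted sum of the annotations
    c = (alpha[:, None] * H).sum(axis=0)
    return c, alpha

# Illustrative dimensions (not the settings used in the paper)
Tx, m, n, a = 6, 8, 8, 10        # source length, annotation, state, alignment dims
rng = np.random.default_rng(0)
H = rng.normal(size=(Tx, m))
s_prev = rng.normal(size=n)
V_a = rng.normal(size=a)
U_a = rng.normal(size=(a, n))
W_a = rng.normal(size=(a, m))
c_i, alpha = attention_step(s_prev, H, V_a, U_a, W_a)
print(alpha.round(3), c_i.shape)   # weights sum to 1; c_i has shape (m,)
```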
4.2 Transformer network
In a recurrent network, the representation at time step i depends on the previous time steps. Vaswani et al. (2017) proposed the transformer network, a block diagram of which is shown in Figure 2. It relies entirely on self-attention and removes the recurrent operations found in the previous NMT approach, allowing computation at all time steps of the encoder and decoder to be parallelized. However, in the absence of recurrence, a positional encoding is added to each input embedding before it is passed to the encoder in order to capture the token position within the input sentence. The encoder consists of several identical layers, each composed of two sublayers: a multihead self-attention layer and a position-wise feed-forward network layer. The decoder also consists of several identical layers and operates in a manner similar to the encoder; in addition to the two sublayers of the encoder, the decoder inserts a third sublayer, which performs multihead attention over the output of the encoder. Each of these sublayers is followed by a layer normalization. The decoder works similarly to the decoder in Bahdanau et al. (2015) and generates one token at a time using a softmax layer. For more details of the network, please refer to Vaswani et al. (2017).
Figure 2. Transformer architecture.
Mathematically, as in Vaswani et al. (2017), the positional encoding is defined as

$PE_{(pos,\,2i)} = \sin\left(pos / 10000^{2i/d}\right)$   (9)

$PE_{(pos,\,2i+1)} = \cos\left(pos / 10000^{2i/d}\right)$   (10)

where $pos$ is the position and $i$ is the $i$th dimension of a $d$-dimensional input vector. Self-attention is defined as

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\dfrac{QK^{\top}}{\sqrt{d_k}}\right)V$   (11)

where $Q$, $K$, and $V$ are the queries, keys, and values packed together into matrices, and $d_k$ is the ratio of $d$ to the number of heads, denoted as $h$. Multihead attention is a concatenation of $h$ heads, defined as

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$   (12)

$\mathrm{head}_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)$   (13)

where the projections are parameter matrices $W^Q_i, W^K_i, W^V_i \in \mathbb{R}^{d \times d_k}$ and $W^O \in \mathbb{R}^{hd_k \times d}$. The feed-forward network in each layer of the encoder and decoder is formulated as

$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\, W_2 + b_2$   (14)

where $W_1$ and $W_2$ are parameter matrices.
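Similarly, the following sketch illustrates the positional encoding of Equations (9)–(10) and the scaled dot-product attention of Equation (11) for a single head. The sizes and the identity projections are illustrative simplifications, not the settings used in our experiments.

```python
import numpy as np

def positional_encoding(max_len, d):
    """Sinusoidal positional encoding of Eqs. (9)-(10); assumes even d."""
    pos = np.arange(max_len)[:, None]                # positions
    i = np.arange(d // 2)[None, :]                   # dimension index
    angles = pos / np.power(10000.0, 2 * i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

def scaled_dot_product_attention(Q, K, V):
    """Self-attention of Eq. (11)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

# Toy example with illustrative sizes
rng = np.random.default_rng(0)
T, d, h = 5, 16, 4                       # sequence length, model dim, heads
d_k = d // h
X = rng.normal(size=(T, d)) + positional_encoding(T, d)
Q = K = V = X[:, :d_k]                   # a single head with identity projections
print(scaled_dot_product_attention(Q, K, V).shape)   # (T, d_k)
```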
4.3 Data augmentation
NMT models are extremely data hungry, and in the absence of a large training corpus they do not learn the model parameters properly. In this work, we propose an approach for training NMT models using small corpora, especially for translating domain-specific texts with little parallel data.
The overall process flow of our architecture is depicted in Figure 3. The core idea is to provide more information about the alignment between the source and target phrases. When a sentence pair is passed through an encoder–decoder, it does not carry any explicit information about the mapping between the source and target phrases. The model learns the translation mappings implicitly by predicting and rectifying the error over a large parallel corpus; however, it fails to learn the association between the phrases when the corpus is small. So, apart from feeding sentence pairs into the network, we also feed phrase pairs as training examples. This gives the illusion of a larger corpus.
Figure 3. Our proposed phrase injection NMT approach.
In order to perform this feedback mechanism, we first extract parallel phrases from the corpus and then add them to the training set. To extract parallel phrases, we use the Moses (Koehn et al. 2007) SMT system. We train a source–target phrase-based SMT system (Koehn et al. 2003) and extract all the phrase pairs from the phrase table. Many of these parallel phrases are not sound, that is, there can be many incorrect source–target alignments (Koehn et al. 2003). We therefore set different conditions when choosing phrases from the phrase table. Assume that every source phrase $e$ is aligned to a set of target phrases $F = (f_1, f_2, \ldots, f_n)$; note that $n$ may vary for each source phrase. For each source phrase $e$ in the phrase table, we extract three sets of parallel phrases:

1. First set ($Set_{p\geq0.5}$): the set of parallel phrases $(e, f_t)$ with $P(f_t|e) \geq 0.5$;

2. Second set ($Set_{p=1.0}$): the set of parallel phrases $(e, f_t)$ with $P(f_t|e) = 1.0$;

3. Third set ($Set_{all}$): for this set, we consider all the phrase pairs from the phrase table.
Since the number of phrase pairs is larger than the number of original parallel sentences, to maintain a fair ratio between them we use the following formula for combining phrase pairs with the original training set:

$\mathrm{Train}_{aug} = (N \times \mathrm{Original\ training\ set}) + \mathrm{Extracted\ phrase\ set}$   (15)

That is, we combine the extracted set (of parallel phrases) with $N$ copies of the original corpus, where $N$ is calculated as

$N = \dfrac{|\mathrm{Extracted\ phrase\ set}|}{|\mathrm{Original\ training\ set}|}$

Without this, the training set would contain mostly phrases, and as the phrases are shorter, they could bias the model towards the phrase length.
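As an illustration of the procedure above, the sketch below filters a Moses phrase table by $P(f_t|e)$ and then concatenates the selected phrase pairs with $N$ copies of the original corpus. The file paths, the index of the direct phrase translation probability among the score fields, and the rounding used for $N$ are assumptions made for the example and should be adapted to the actual Moses output.

```python
import gzip

def load_phrase_pairs(phrase_table_path, threshold=0.5, prob_index=2):
    """Read a Moses phrase table and keep pairs with P(target|source) >= threshold.

    prob_index selects the direct phrase translation probability among the
    score fields; the index may differ across Moses configurations, so it is
    kept as a parameter here rather than hard-coded.
    """
    pairs = []
    with gzip.open(phrase_table_path, "rt", encoding="utf-8") as f:
        for line in f:
            fields = [x.strip() for x in line.split("|||")]
            if len(fields) < 3:
                continue
            src, tgt, scores = fields[0], fields[1], fields[2].split()
            if float(scores[prob_index]) >= threshold:
                pairs.append((src, tgt))
    return pairs

def augment(original_pairs, phrase_pairs):
    """Combine the phrase set with N copies of the original corpus (Eq. 15)."""
    n = max(1, round(len(phrase_pairs) / max(1, len(original_pairs))))
    return original_pairs * n + phrase_pairs

# Hypothetical usage (file names are placeholders):
# orig     = list(zip(open("train.en"), open("train.hi")))
# set_p05  = load_phrase_pairs("model/phrase-table.gz", threshold=0.5)
# set_p10  = load_phrase_pairs("model/phrase-table.gz", threshold=1.0)
# set_all  = load_phrase_pairs("model/phrase-table.gz", threshold=0.0)
# train    = augment(orig, set_p05)
```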
5. Datasets
For our experiments, we use English–Hindi and Hindi–Bengali parallel corpora from the multilingual Indian Language Corpora Initiative (ILCI) (Jha 2010). The ILCI parallel corpora cover two domains, Health (ILCI-H) and Tourism (ILCI-T), each comprising 25k parallel sentences. These corpora have an insufficient number of parallel sentences compared to other language pairs found in the literature. Indian languages do not have the corpora required to train a standard NMT system, which needs millions of parallel sentences, and thus they fall under the low-resource category. For experimentation, we randomly split each corpus into three sets: Train, Test, and Dev. Details are shown in Table 1.
Table 1. Dataset statistics showing the number of sentences and tokens. By old, we refer to old English and by mod, we refer to modern English
For the judicial domain, we use the IIT Bombay English–Hindi parallel corpus (Kunchukuttan, Mehta, and Bhattacharyya 2018). It consists of parallel sentences from miscellaneous domains, of which only 7561 parallel sentences belong to the judicial domain. For the experiments, we randomly split these judicial domain parallel sentences into Train, Dev, and Test sets of 5561, 1000, and 1000 sentences, respectively.
As the old English text, we use the publicly available The Homilies of the Anglo-Saxon Church Footnote a by Ælfric of Eynsham (c.950–c.1010), a prolific author in old English, and as the modern English text we use its translation by Benjamin Thorpe (c.1782–c.1870). We call this the Old English-Modern English (OE-ME) corpus.
The OE–ME corpus is tiny: it has 720 parallel paragraphs in 40 sections. Most of the parallel paragraphs have an equal number of OE–ME parallel sentences, which helps in aligning the parallel sentences. To avoid misalignment, paragraphs that do not have an equal number of OE and ME sentences are discarded, yielding a total of 3716 parallel sentences. We do not use any sentence aligner, as only a few sentences are discarded. We randomly split the corpus into three sets: Train, Test, and Dev, containing 2716, 500, and 500 sentences, respectively. For tokenization, we use tokenizer.perl, which is part of the Moses SMT system. Details of the datasets are presented in Table 1.
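The paragraph-level pairing heuristic described above can be sketched as follows; the data structures and placeholder sentences are purely illustrative.

```python
def pair_sentences(oe_paragraphs, me_paragraphs):
    """Keep only paragraphs whose OE and ME sides have the same number of
    sentences and pair the sentences positionally; other paragraphs are
    dropped to avoid misalignment."""
    pairs = []
    for oe_sents, me_sents in zip(oe_paragraphs, me_paragraphs):
        if len(oe_sents) == len(me_sents):
            pairs.extend(zip(oe_sents, me_sents))
    return pairs

# Illustrative usage: each element is one paragraph, already sentence-split.
oe = [["OE sentence 1", "OE sentence 2"], ["OE sentence 3"]]
me = [["ME sentence 1", "ME sentence 2"], ["ME sentence 3", "ME sentence 4"]]
print(len(pair_sentences(oe, me)))   # 2: the second paragraph is discarded
```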
Table 2. Vocabulary size for different language pairs
O, Old; M, Modern.
Table 3. Training data sizes for different models after adding phrases
E, English; H, Hindi; B, Bengali; O, old; M, modern.
6. Experimental setup
Attention-based GRU models: We use Nematus (Sennrich et al. 2017) for training the NMT models. Our neural models are trained at the word level. We create the vocabulary from the training set for the different systems; the vocabulary sizes used in training the models are shown in Table 2, and the augmented data size for each model is shown in Table 3. We set the embedding size to 128, the hidden size to 256, and the learning rate to 0.001. Note that we tried higher embedding and hidden dimensions, but they did not work well as the training data size is very small. The encoder and decoder are two-layered GRU blocks. The models are trained with a mini-batch size of 40, and we restrict the maximum sentence length to 80. We use the Adam optimizer (Kingma and Ba 2015) for optimizing the models. Training stops on meeting the early stopping criterion; we use early stopping based on the BLEU measure with a patience value of 10. All the models run for approximately 110–130k updates before early stopping. For decoding, we set the beam size to 3. Default values were used for the other parameters.
Transformer-based models: For training the models, we use Sockeye (Hieber et al. 2017), a toolkit for NMT. We use the default embedding dimension of 512, a hidden dimension of 512, a learning rate of 0.0002, and a dropout rate of 0.2. The number of layers in each of the encoder and decoder is 6, and the number of attention heads is 8. We use the Adam optimizer (Kingma and Ba 2015) and keep a mini-batch size of 2000 words.Footnote b
Phrase extraction: We use the Moses (Koehn et al. 2007) toolkit for training a phrase-based SMT system. The phrase table generated during training is used for extracting the phrases. For training, we use the following settings in Moses: the grow-diag-final-and heuristic for word alignment, msd-bidirectional-fe for the reordering model, and a 4-gram LM with modified Kneser–Ney smoothing (Kneser and Ney 1995) built using KenLM (Heafield 2011). Note, however, that the order of the LM does not affect the phrase table.
We train the following four types of NMT models for each of the health, tourism, and judicial domain corpora.
1. Baseline: the NMT model is trained only on the original parallel corpus.

2. Baseline + $Set_{p\geq0.5}$: the NMT model is trained on the original parallel corpus along with the phrase pair set $Set_{p\geq0.5}$ (see Section 4.3).

3. Baseline + $Set_{p=1.0}$: the NMT model is trained on the original parallel corpus along with the phrase pair set $Set_{p=1.0}$ (see Section 4.3).

4. Baseline + $Set_{all}$: the NMT model is trained on the original parallel corpus along with the phrase pair set $Set_{all}$ (see Section 4.3).
The numbers of training examples for the systems mentioned above are shown in Table 3.
7. Results and analysis
We evaluate the models on the test sets using the BLEU metric (Papineni et al. 2002). Table 4 summarizes the results of the different systems, and Tables 9 and 10 show some example outputs obtained from these systems. We plot the BLEU scores of the different translation systems in Figures 4 and 5 for comparison with the baselines.
7.1 Attention-based GRU versus transformer
Both the attention-based GRU and the transformer-based models are improved by our phrase-augmentation approach. Comparing the two, however, the transformer-based models are better than the attention-based GRU models for all translation directions, and they also yield stronger baselines. SMT is known to work better than neural models in the absence of sufficient training data, but with our approach, the transformer-based models outperform the SMT baselines for five translation directions (see Table 4), and for the remaining directions, our approach with the transformer network obtains results competitive with SMT.
7.2 Comparative systems
Here, we compare our proposed approach with some of the well-explored techniques in low-resource NMT, such as subword-level NMT (Sennrich, Haddow, and Birch 2016b), back translation (Sennrich et al. 2016a), and, more in line with our proposed approach, pre-translation (Niehues et al. 2016). Recently, transformer-based models have outperformed attention-based GRU models on various benchmark datasets and have become the state-of-the-art technique in NMT; this is also evident from the evaluation results that we obtain. Hence, we focus only on the transformer-based models for this comparison, which we carry out on the English–Hindi translation direction using the Health domain data.
Table 4. BLEU scores of different models
$\blacktriangle$, improvement over baseline; $\uparrow$, positive improvement; *, better than SMT; H, Health; T, Tourism; J, Judicial. The highest BLEU score for each translation direction using NMT is indicated in bold.
Figure 4. Comparison of different attention-based GRU models. HE: Hindi $\rightarrow$ English, EH: English $\rightarrow$ Hindi, HB: Hindi $\rightarrow$ Bengali, BH: Bengali $\rightarrow$ Hindi. H for Health, T for Tourism, and J for Judicial.
Pre-translation for NMT: Niehues et al. (2016) proposed two methods to improve NMT with the help of a phrase-based SMT system. In the first method, they trained a source–target SMT system and then used it to translate the entire training data from source to target; thereafter, they trained a monolingual NMT system from the translated target to the original target. The second method is almost the same as the first, but the NMT system is trained to predict the target from the combination of the original source and the translated output of the SMT system. The difference between our approach and the pre-translation technique is that we treat the phrase pairs extracted from the phrase table as additional parallel data, whereas Niehues et al. (2016) trained a monolingual NMT system to correct the outputs produced by a phrase-based SMT system. We follow the same approach as Niehues et al. (2016) for comparison, and the results are shown in Table 5.
Table 5. Comparative systems for English–Hindi for Health domain
Table 6. Comparison of our approach (using attention-based GRU) with PhraseNet for English–Hindi for the Health domain. Dimensions are shown in parentheses
Figure 5. Comparison of different transformer models. HE: Hindi $\rightarrow$ English, EH: English $\rightarrow$ Hindi, HB: Hindi $\rightarrow$ Bengali, BH: Bengali $\rightarrow$ Hindi. H for Health, T for Tourism, and J for Judicial.
From Table 5, we observe that the pre-translation and mixed pre-translation strategies do not improve over the SMT and NMT models; rather, they fall below their baselines.
Back-translation: For this, we first generate synthetic parallel data by translating 100K monolingual sentences from the Hindi monolingual data (Bojar et al. 2014) into English. Then, we use these synthetic parallel sentences along with the original parallel data to train a transformer-based system for English $\rightarrow$ Hindi. From Table 5, we observe that the BLEU score is lower than that of the system (Transformer) using only the original parallel data. Although it has been shown in the literature that back translation helps in improving the BLEU score, it is also sensitive to the domain of the back-translated data. The monolingual Hindi data (Bojar et al. 2014) come from a mixed domain, crawled from the web. This shows that back translation may not always be useful.
Table 7. Our approach using transformer at subword level for English–Hindi direction on the Health domain
PhraseNet: Tang et al. (2016) proposed PhraseNet, in which the decoder generates a word in word mode or a phrase in phrase mode. As the code of PhraseNet is not available, we re-implement it to compare with our proposed method. Here, we first describe the approach mathematically and then present our results. Suppose that at time $t-1$ the decoder has generated $y_{t-1}$ in word mode and the current decoder state is $s_t$.

1. Compute the phrase mode ($z_t=1$) and word mode ($z_t=0$) probabilities as
\begin{eqnarray*}p(z_t=1|s_t;\,\theta) & = & f_z(s_t)\\p(z_t=0|s_t;\,\theta) & = & 1-f_z(s_t)\end{eqnarray*}

2. If $z_t=0$ (word mode), generate a word $w_i$ based on $s_t$ from the regular word vocabulary as
\begin{eqnarray*}p_{w}(y_t=w_{i}|s_t,0;\,\theta) & = & f_w(s_t)\end{eqnarray*}

3. If $z_t=1$ (phrase mode), generate a target phrase $p_j$ as
\begin{eqnarray*}p_{p}(y_t=p_{j}|s_t, 1;\,\theta) & = & f_p(s_t)\end{eqnarray*}

4. Calculate the final probabilities and sample the next word (or phrase):
\begin{eqnarray*}p(y_t=w_i) &= & p(z_t=0|s_t;\,\theta)p(w_i|s_t,0;\,\theta)\\p(y_t=p_j) & = & p(z_t=1|s_t;\,\theta)p(p_j|s_t,1;\,\theta)\\p(y_t) & = & \left[ \begin{array}{c}p(y_t=w)\\p(y_t=p) \end{array} \right]\end{eqnarray*}

The dimension of $p(y_t)$ is $n_p$ plus the number of words in the vocabulary. The value of the hyper-parameter $n_p$ is set to 5. The next word or phrase is sampled according to $p(y_t)$. For more details, see Tang et al. (2016).
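For clarity, the following PyTorch sketch shows one way to combine the word-mode and phrase-mode distributions into a single distribution $p(y_t)$, as in steps 1–4 above. The layer shapes, the sigmoid gate, and the fixed candidate-phrase slot count are illustrative choices and do not reproduce the exact architecture of Tang et al. (2016) or of our re-implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordPhraseOutput(nn.Module):
    """Combine word-mode and phrase-mode distributions at one decoding step."""
    def __init__(self, state_dim, vocab_size, n_p):
        super().__init__()
        self.gate = nn.Linear(state_dim, 1)                 # f_z: phrase-mode gate
        self.word_out = nn.Linear(state_dim, vocab_size)    # f_w: word distribution
        self.phrase_out = nn.Linear(state_dim, n_p)         # f_p: n_p candidate phrases

    def forward(self, s_t):
        z = torch.sigmoid(self.gate(s_t))                   # p(z_t = 1 | s_t)
        p_word = F.softmax(self.word_out(s_t), dim=-1)      # p_w(. | s_t, 0)
        p_phrase = F.softmax(self.phrase_out(s_t), dim=-1)  # p_p(. | s_t, 1)
        # Final distribution over vocabulary words followed by n_p phrase slots
        return torch.cat([(1 - z) * p_word, z * p_phrase], dim=-1)

# Toy usage with illustrative sizes
model = WordPhraseOutput(state_dim=256, vocab_size=30000, n_p=5)
s_t = torch.randn(1, 256)
p_y = model(s_t)
print(p_y.shape, float(p_y.sum()))   # (1, 30005); probabilities sum to 1
```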
We re-implement PhraseNet using PyTorch (Paszke et al. 2017) and compare our approach with it on the English–Hindi translation direction using the Health domain data. Tang et al. (2016) experimented with an embedding dimension of 620 and a hidden dimension of 1000; as our data size is smaller, we experiment with different embedding and hidden dimensions. We present the results in Table 6, from which we observe that our proposed approach outperforms PhraseNet by a significant BLEU margin.
Subword-level NMT: We train subword-level transformer systems (baseline and our approach) for the English–Hindi direction on the Health domain, and the results are shown in Table 7. We use 10,000 merge operations for each language independently. From Table 7, we see that our approach also outperforms the baseline system at the subword level.
Table 8. Fluency: SMT versus NMT systems. Word-to-word translation shown in brackets
Table 9. Some translation outputs by different attention-based encoder–decoder NMT systems. Word-to-word English translation is shown in brackets
7.3 Quantitative analysis
From Figure 4, it is evident that the baseline NMT systems are outperformed by all of our proposed systems, based on both the attention-based GRU and the transformer. As the data size is too small for NMT, we feed the phrase-level translations during training. The intuition is that the added phrases provide more information about the association between the phrases, and this helps the learning process; the baseline model finds it too difficult to learn this association from an original training set with only a few parallel sentences. The baselines for the different language pairs (and translation directions) are trained on the original training sentences, whereas in our proposed method we additionally feed the extracted phrases into the model, and we can see the difference these additional phrase-level translations make. Although we do not use any external data, we obtain significant improvements over the baselines for all the translation systems. On the ILCI data, the improvements range from 1.38 BLEU points for Bengali $\rightarrow$ Hindi (Health) to 8.65 BLEU points for Hindi $\rightarrow$ English (Tourism), the latter being the highest improvement.
Surprisingly, for all the translation directions, the improvements are larger for the tourism domain than for the health domain. A possible reason is as follows: the tourism domain corpora contain more named entities (mostly location names) than the health domain corpora, and when we feed phrase pairs as sentences, those named entities (source–target pairs) are also included. Thus, the system learns a better alignment between them, and as a result the overall translation quality improves. We also see from the table that, without these additional phrases, the baseline for the tourism domain always performs worse than the baseline for the health domain, which makes the improvements from the additional phrases more visible.
We also notice that when translating from a morphologically rich language (Hindi) to a morphologically poorer one (English), the improvements are higher than in the opposite direction, that is, English to Hindi translation.
For the old-to-modern English system, the baseline model has a BLEU score of 10.03, and the proposed models are all better than the baseline. Out of the three proposed models, the model using all phrases (Baseline+$Set_{all}$) yields the best performance with 28.76 BLEU points. The difference between the baseline and the proposed models is huge because the old-to-modern systems have an original training set of only 2.7k parallel sentences: the baseline model does not learn the mappings well, leaving our models more scope for learning them.
Along with NMT, we also train SMT systems. SMT systems are known to be good in situations where enough parallel data are not available and, unsurprisingly, we observe that the SMT models perform better than the NMT models. However, there are good reasons for considering NMT even without a sufficient amount of parallel data: the improvement in NMT quality with increasing data size is large compared to SMT (Koehn and Knowles 2017), NMT follows an end-to-end framework, and it generates more fluent outputs than the SMT systems.
The phrases that are added to the original training corpus have lengths from 1 to 7, so we study the effect of phrase length on overall system performance. Apart from the sets mentioned in Section 4.3, we also consider only the phrases with lengths 1, 3, 5, and 7 from the set $Set_{p=1.0}$ (English $\rightarrow$ Hindi; Tourism) for augmenting the original training corpus. Experiments show that all these sets of phrases improve the performance of the baseline system; however, when we consider all the phrases (of length 1–7), we obtain the best BLEU score. The experimental results are shown in Figure 7.
We use three sizes of training data in terms of the number of sentences: 2.7k parallel sentences (for old-to-modern English), 5.5k parallel sentences (for the Judicial domain), and 23k parallel sentences (for the Health and Tourism domains). The improvements are higher for the smaller datasets (cf. Table 1 for data sizes and Table 4 for the improvements). For example, the old-to-modern English system has the highest improvement, whereas the systems trained on 23k sentences show relatively smaller improvements. Hence, we take one system with a lower improvement (English $\rightarrow$ Hindi; Tourism) and apply our proposed approach to see how it behaves with relatively smaller training sets (of 5k, 10k, and 15k parallel sentences). These smaller sets are taken from the original training data.
From the experimental results shown in Figure 6, we observe that the improvements are higher for smaller amounts of data. This implies that NMT models, as expected, struggle when the data size is small, but the extracted phrases help up to a certain level.
Since we use phrase pairs as training data, it is obvious that most of the training samples are relatively short. Thus, it is interesting to see whether this affects the translation quality for different length intervals. For this, we split the test set for English–Hindi (Health) into different length intervals ($<$10, 10–20, 20–30, 30–40, and $>$40) and score them with the transformer ($Set_{p=1.0}$)-based model. The BLEU scores for these intervals are 23.84, 26.67, 21.03, 21.23, and 27.63, and the sentence counts are 172, 508, 236, 60, and 18, respectively.
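This length-wise breakdown can be reproduced with a simple bucketing script such as the sketch below, which assumes the sacrebleu package and line-aligned, detokenized reference and hypothesis files; bucketing by reference length and the file names are illustrative assumptions.

```python
import sacrebleu

def bleu_by_length(references, hypotheses,
                   bins=((0, 10), (10, 20), (20, 30), (30, 40), (40, 10**9))):
    """Split a test set into length buckets and report per-bucket BLEU.

    Buckets here are based on reference length in tokens (an assumption;
    source length could be used instead)."""
    results = {}
    for lo, hi in bins:
        idx = [i for i, r in enumerate(references) if lo <= len(r.split()) < hi]
        if not idx:
            continue
        refs = [references[i] for i in idx]
        hyps = [hypotheses[i] for i in idx]
        bleu = sacrebleu.corpus_bleu(hyps, [refs])
        results[f"{lo}-{hi}"] = (round(bleu.score, 2), len(idx))
    return results

# Illustrative usage (file names are placeholders):
# refs = open("test.hi").read().splitlines()
# hyps = open("test.hyp.hi").read().splitlines()
# print(bleu_by_length(refs, hyps))
```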
7.4 Qualitative analysis
In this section, we present our observations on the quality of the outputs produced by the proposed systems compared to the baseline models. From Table 9, for English $\rightarrow$ Hindi (Health), we observe that the output of the baseline system is not adequate, as the translation of "lungs" is dropped and "second stage" is translated twice. These kinds of errors are very common in NMT systems because the baseline model (trained only on the original training set) does not learn the mappings between these phrases from such a small corpus. In contrast, our proposed system produces a better translation.
From Table 9, we observe that for the English $\rightarrow$ Hindi (Tourism) system, all the models, including the baseline, generate good-quality translations. However, there are some failed cases; for example, the proposed model Baseline+$Set_{all}$ over-translates the source word "idol."
We now look at the translation quality for a related language pair, Hindi–Bengali. The Hindi $\rightarrow$ Bengali (Health) baseline system generates incorrect output, whereas the outputs generated by our proposed systems are of good quality.
Figure 6. Comparison of different NMT models with incremental original training data.
Figure 7. Performance of proposed method with different phrase lengths.
For Hindi $\rightarrow$ Bengali (Tourism), the output of the baseline system is incorrect and only a few source words are correctly translated, whereas our proposed systems produce partially correct translations. However, we observe one important point for the Hindi $\rightarrow$ Bengali (Tourism) models: all three systems make similar kinds of mistakes by incorrectly translating place names. For example, while the baseline model drops the translation of "Chandigarh", our two proposed models wrongly translate the place name into "Kolkata" and "Digha", respectively. One advantage of the continuous representation of words is that it enables NMT to learn the semantic similarity between related words (e.g., house and home). However, this also introduces a drawback: an NMT system often wrongly translates into words that seem natural in the context but do not reflect the source words.
We also conduct a study on the fluency of the output translations. Although the SMT systems have better BLEU scores than the NMT systems, we find that the NMT outputs are more fluent than the SMT ones; a few examples are shown in Table 8. From Table 4, we see that Bengali $\rightarrow$ Hindi for the Health domain has the smallest improvement (1.38 BLEU points) compared to the other systems. In order to check whether this improvement is significant, we perform a statistical significance test based on bootstrap re-sampling (Koehn 2004) and find that the improvement is significant at the 95% confidence level with p-value 0.001.
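A compact sketch of the paired bootstrap re-sampling test (Koehn 2004) is given below; it again assumes sacrebleu and line-aligned system outputs, and the number of resamples and the file names are illustrative.

```python
import random
import sacrebleu

def paired_bootstrap(refs, sys_a, sys_b, n_samples=1000, seed=0):
    """Estimate how often system A beats system B on resampled test sets."""
    rng = random.Random(seed)
    n, wins = len(refs), 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]          # sample with replacement
        r = [refs[i] for i in idx]
        a = [sys_a[i] for i in idx]
        b = [sys_b[i] for i in idx]
        if sacrebleu.corpus_bleu(a, [r]).score > sacrebleu.corpus_bleu(b, [r]).score:
            wins += 1
    # Roughly, 1 minus this fraction approximates the p-value for
    # the hypothesis that B is at least as good as A.
    return wins / n_samples

# Illustrative usage (file names are placeholders):
# refs  = open("test.hi").read().splitlines()
# sys_a = open("proposed.hyp").read().splitlines()
# sys_b = open("baseline.hyp").read().splitlines()
# print(paired_bootstrap(refs, sys_a, sys_b))
```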
Table 10. Some translation outputs by different transformer-based models. Word-to-word English translation is shown in brackets
8. Conclusion and future works
In this paper, we have proposed an approach for training an NMT model using a small parallel corpus. Our approach uses the phrase pairs extracted from the original training corpus as feedback to NMT training. We used the attention-based GRU and transformer architectures for our experiments; however, the proposed approach is not specific to these architectures and can be applied to other NMT architectures as well. We used publicly available English–Hindi and Hindi–Bengali parallel corpora in three and two domains, respectively, for the evaluation. We also applied our proposed approach to an interesting translation task, namely old-to-modern English translation. We found that the proposed method improves significantly over the baseline systems when the amount of parallel data is far from sufficient: the improvement in BLEU is approximately 1.38–15.36 points, and for old-to-modern English translation we observed a significant improvement of more than 18 BLEU points over the baseline model.
In the future, our main focus will be to study the effect of phrase augmentation for language pairs with bigger (sufficient) corpora, that is, whether the extracted phrases still help or are simply redundant.
Acknowledgments
Asif Ekbal gratefully acknowledges Young Faculty Research Fellowship, supported by Visvesvaraya PhD scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia).