The following sections describe the data, the transliteration method employed for transforming the data from the HRL to the LRL, and the neural MT system workflow used in our experiments.
Choice of languages
For the purpose of our experiments, we chose Belarusian as the low-resource language. According to Lewis et al. (2016), Belarusian or Belorussian is spoken by more than 2 million people in Belarus, Ukraine, Lithuania and Poland. Even though the majority of Belarusians speak Russian as their first language (L1), there has been increasing interest in using Belarusian over the past few years. It is morphologically rich and linguistically close to Russian and Ukrainian, with transitional dialects to both. Russian was chosen as the related, high-resource language. Both Russian and Belarusian belong to the East Slavic family of languages and they share a high degree of mutual intelligibility. Russian is a good candidate for our task, i.e. translation LRL \(\leftrightarrow \) EN using data from a related HRL, since a large amount of English–Russian training data is available.
Since testing the method directly on a low-resource language poses several challenges, such as absence of readily available test sets and language specific preprocessing tools, we additionally experiment with a non-low-resource language, as an extra validation process for our method. In this scenario, Spanish acts as the pseudo-low-resource language (PS-LRL) and Italian as the closely-related HRL. Spanish and Italian both belong to the Romance family of Indo-European languages. Spanish has strong lexical similarity with Italian (82%) (Lewis et al. 2016). Among major Romance languages, Spanish and Italian have been found to be the second-closest pair (following Spanish and Portuguese) in automatic corpus comparisons (Ciobanu and Dinu 2014) and in comprehension studies (Voigt and Gooskens 2014). One more reason for choosing Spanish and Italian is to compare our results with the previous work of Nakov and Ng (2012) and Currey et al. (2016).
Data
For the low-resource setting, the training data consists of Russian–English data from WMT2016,Footnote 1 and the Belarusian monolingual data is taken from the Web to Corpus (W2C) collection, which was built from texts available on the Web and Wikipedia and covers a large number of languages (Majliš and Žabokrtský 2012). In order to provide a proper test scenario, the development and test sets were compiled from bilingual articles extracted from the Belarusian Telegraph Agency (BelTA).Footnote 2 In total, 150 articles in Belarusian and English were manually collected. The main text was then extracted using the jusText tool,Footnote 3 dates were removed, and the remaining sentences were written to files one sentence per line. The bilingual, tokenised text was aligned at sentence level using the Hunalign sentence aligner.Footnote 4 Since the highest possible quality was required, sentences with a score lower than 0.4 were excluded and only bisentences (one-to-one alignment segments) were preserved, for which both the preceding and the following segments were also one-to-one. This left 1006 sentences for the test set; the remaining 1547 sentences were used to create the validation set. Statistics of the data are presented in Tables 1 and 2.
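The bisentence filtering described above can be sketched as follows. The `(score, source_count, target_count)` triples are a simplified stand-in for Hunalign's actual ladder output format, which this sketch assumes rather than parses:

```python
def filter_bisentences(segments, min_score=0.4):
    """Keep only one-to-one segments with an alignment score of at least
    min_score whose immediate neighbours are also one-to-one alignments.

    `segments` is a list of (score, source_count, target_count) tuples,
    a simplified stand-in for Hunalign's ladder output (assumed format).
    Returns the indices of the retained segments.
    """
    def one_to_one(i):
        return 0 <= i < len(segments) and segments[i][1] == segments[i][2] == 1

    kept = []
    for i, (score, n_src, n_tgt) in enumerate(segments):
        if n_src == n_tgt == 1 and score >= min_score \
                and one_to_one(i - 1) and one_to_one(i + 1):
            kept.append(i)
    return kept
```

Segments at the corpus boundaries are excluded by this sketch, since they lack a preceding or following neighbour, which matches the conservative, quality-first filtering described above.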
Table 1 Datasets used for the low-resource language experiments and their size
For the pseudo-low-resource scenario, the training data is taken from Europarl (Koehn 2005). The size of the training data was matched to that of the low-resource scenario, to allow a direct comparison of results between the two scenarios.
Table 2 Datasets used for the pseudo-low-resource language experiments and their size
As a preprocessing step, all data was tokenised using the Moses tokeniser with the language setting for each of the languages we experiment with. Since there is no language-specific tokeniser for Belarusian, the Belarusian data was tokenised with the Russian language setting. Then, the data was truecased and sentences longer than 50 tokens were discarded. Byte-Pair Encoding (BPE) (Sennrich et al. 2016) was applied for subword segmentation to achieve common subword units. The number of merge operations was set to 50,000. In the low-resource language scenario, BPE was trained separately for English and jointly for Russian, Belarusian and transliterated Russian, since English does not share a script with the other three. For training BPE jointly, the corpora of the respective languages were concatenated. The final vocabulary sizes for the WMT data were 49,292 for Russian, 24,175 for English and 30,850 for Russian transliterated into Belarusian (BE\(_{ru}\)), and for the W2C data 46,584 for monolingual Belarusian and 32,042 for back-translated English. For the pseudo-low-resource scenario, BPE was trained jointly for all languages, i.e. English, Spanish, Italian and transliterated Italian. The final vocabulary sizes were 25,988 for Italian, 20,079 for English, 29,179 for Italian transliterated into Spanish (ES\(_{it}\)), 27,040 for monolingual Spanish and 15,754 for back-translated English.
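For reference, the core BPE learning procedure of Sennrich et al. (2016) can be sketched in a few lines. This toy version learns merge operations from a joint word list (mirroring the concatenation of corpora for joint training) and stands in for the subword-nmt toolkit that would be used in practice:

```python
from collections import Counter

def get_stats(vocab):
    """Count symbol-pair frequencies over a {word-as-symbol-tuple: freq} vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Apply one merge operation to every word in the vocabulary."""
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(words, num_merges):
    """Learn merge operations from a (possibly concatenated, joint) word list."""
    vocab = Counter(tuple(w) + ('</w>',) for w in words)
    merges = []
    for _ in range(num_merges):
        stats = get_stats(vocab)
        if not stats:
            break
        best = max(stats, key=stats.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges
```

In the experiments above the number of merges is 50,000 and the input is the full concatenated corpus; the toy version differs only in scale.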
Transliteration
In the following sections we present the creation of a bilingual glossary of the most frequent words in the HRL corpus and their LRL translations, the method of extracting transliteration pairs and the transliteration system used for transforming the HRL data into data more similar to the LRL.
Glossary
Although transliteration pairs from Wikipedia articles might be suitable training data for transliterating named entities, the goal of the presented method is to transliterate sentential data. After examining the extracted pairs, we noticed that very frequent words, which are usually function words, are absent from the pairs, which might degrade the transliteration output for full sentences. We therefore created a bilingual glossary containing many of these words.
The 200 most frequent words were extracted from the Russian training corpus and manually translated into Belarusian by linguists. In cases where multiple translations were possible, the frequency of the variants was checked in the Belarusian corpus. If the difference in frequency was large, the more frequent word was kept; when the variants were equally or almost equally frequent, the word more similar to the Russian one was kept. Table 3 shows examples of bilingual glossary entries, their English translation and their frequency. Most of them represent function words. It is worth noting that most multiple translations were caused by the letters у and ў, the latter being particular to Belarusian pronunciation. The letter ў is called the non-syllabic u because it does not form syllables. When a word beginning with у follows a vowel, the у is replaced by ў. Following the frequency principle above, the у variants were chosen for the glossary. As a first step, the glossary was used to substitute the words in the Russian part of the data with their Belarusian translations, similar to word-based translation, in order to obtain an additional baseline for our NMT experiments; later, it was used as part of the training data for the transliteration system.
Table 3 Examples of glossary entries and their frequency in the Russian corpus
The same procedure was followed for creating the glossary for the pseudo-low-resource language, IT \(\rightarrow \) ES.
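The word-based glossary substitution used as a baseline amounts to a token-level lookup. A minimal sketch follows; the glossary entries shown here are illustrative assumptions, not the actual entries of Table 3:

```python
# Toy RU -> BE glossary; the real 200-entry glossary was produced by
# linguists (see Table 3). These mappings are illustrative assumptions.
GLOSSARY = {"и": "і", "в": "у", "не": "не"}

def apply_glossary(tokens, glossary):
    """Word-based substitution: replace glossary entries, keep all other
    tokens (e.g. content words) unchanged."""
    return [glossary.get(t, t) for t in tokens]
```

Out-of-glossary tokens pass through untouched, which is why this baseline mainly affects function words.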
Cognate pair extraction from Wikipedia titles
One of the best-suited sources for obtaining bilingual resources when no explicit parallel text is available is Wikipedia. Even ‘small’ languages can have a strong Wikipedia presence; a notable example is Aragonese, with 300 articles per speaker. The most straightforward way to extract bilingual vocabularies from Wikipedia is to extract the titles of the articles. Wikipedia dumps offer per-page data, which includes article IDs and their titles in a particular language, as well as interlanguage link records, which link to articles on the same entity in other languages. The data was downloaded and the titles extracted as described in the project wikipedia-parallel-titles.Footnote 5 Table 4 shows examples of extracted bilingual entries, their translation into English, and the similarity score based on the Longest Common Subsequence Ratio (LCSR).
Table 4 Examples of extracted bilingual entries from Wikipedia articles between Belarusian and Russian and their translation into English
It is clear from Table 4 above that the majority of the titles are Named Entities referring to the same object, person or abstract notion. In other words, they are words that have similar orthography and are mutual translations of each other, i.e. cognates (Melamed 1999; Mann and Yarowsky 2001; Bergsma and Kondrak 2007). Therefore, the bilingual entries are suitable data to train a transliteration model between Russian and Belarusian after using string similarity measures to extract transliteration pairs, as shown in Nakov and Ng (2012).
For RU \(\rightarrow \) BE, 145,910 titles were extracted in total. First, the data had to be cleaned. Prefixes preceding the entities, such as Category:, Annex: etc., were erased, and then duplicate entries and entries in Latin, Greek or Asian scripts were removed (11,269 entries). In the Russian part of the extracted data, names were in many cases formatted as Surname, Name, which did not correspond to the Belarusian format Name Surname. This would cause such entries to be discarded when the similarity measure is applied, so name and surname were switched. In the Belarusian part, some entries contained a description after the named entity, e.g. Вікінг, фільм, 2016 (Viking, film, 2016), whose equivalent in Russian was simply Викинг (Viking). For the same reason, the description was removed and only the first part of the entry was kept. As a similarity measure, we chose the Longest Common Subsequence Ratio (LCSR) (Melamed 1999), which is defined as follows:
$$\begin{aligned} LCSR(s_1, s_2) = \frac{|LCS(s_1, s_2)|}{\max (|s_1|,|s_2|)} \end{aligned}$$
(1)
where LCS(\(s_1\),\(s_2\)) is the longest common subsequence of strings \(s_1\) and \(s_2\) and |s| is the length of string s.
Following Kondrak et al. (2003), we retained only pairs of entities with LCSR \(>0.58\), a value they found to be useful for cognate extraction in many language pairs. We also retained pairs without spelling differences, i.e. with LCSR \(=1\). This left 80,608 transliteration pairs, which were split into training (78,608 pairs), development (1000 pairs) and test (1000 pairs) sets. The data was then tokenised.
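A minimal implementation of Eq. (1) and the cognate-pair threshold might look as follows (the name `is_cognate_pair` is ours, for illustration):

```python
def lcs_len(s1, s2):
    """Length of the longest common subsequence, via the classic DP."""
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        cur = [0]
        for j, c2 in enumerate(s2, 1):
            cur.append(prev[j - 1] + 1 if c1 == c2 else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def lcsr(s1, s2):
    """LCSR(s1, s2) = |LCS(s1, s2)| / max(|s1|, |s2|), as in Eq. (1)."""
    if not s1 and not s2:
        return 1.0
    return lcs_len(s1, s2) / max(len(s1), len(s2))

def is_cognate_pair(s1, s2, threshold=0.58):
    """Retain pairs above the Kondrak et al. (2003) threshold; identical
    pairs (LCSR = 1) pass trivially."""
    return lcsr(s1, s2) > threshold
```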
It should be noted that the preprocessing method described above was followed to achieve the best possible quality of transliteration pairs while at the same time preserve as many pairs as possible for training. If LCSR is applied directly on the extracted titles, without any other preprocessing, only 70,911 titles in Cyrillic script are left for training, development and testing.
The same procedure was followed in the pseudo-low-resource scenario, for extracting pairs IT \(\rightarrow \) ES, allowing us to obtain 409,582 pairs after applying the data cleaning techniques mentioned above. These were further split into training (407,582 pairs), development (1000 pairs) and test (1000 pairs) sets.
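The title-cleaning heuristics described above can be sketched along the following lines; the regular expression and the comma heuristics are assumptions for illustration, not the authors' exact rules:

```python
import re

def clean_title_pair(ru, be):
    """Sketch of the cleanup applied to a Wikipedia title pair (assumed
    patterns, not the authors' exact rules):
    - drop namespace prefixes such as 'Category:' / 'Annex:'
    - reorder Russian 'Surname, Name' to 'Name Surname'
    - strip trailing descriptions after the entity on the Belarusian side
    """
    ru = re.sub(r'^[^:]+:', '', ru).strip()
    be = re.sub(r'^[^:]+:', '', be).strip()
    if ru.count(',') == 1:                    # likely 'Surname, Name'
        last, first = (p.strip() for p in ru.split(','))
        ru = f'{first} {last}'
    be = be.split(',')[0].strip()             # keep entity, drop description
    return ru, be
```

A real pipeline would also filter duplicates and non-Cyrillic entries before applying LCSR, as described above.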
The resulting bilingual titles were tokenised and split into characters. Special word-initial and word-final symbols (\(\wedge \) and $ ) were introduced to mark word boundaries, as shown in Fig. 1.
Then, transliteration was applied to the sentences of the related HRL data, i.e. the Russian part of the EN–RU corpus, and the Italian part of the EN–IT corpus. In order to transliterate the sentences, the data had to be split into characters too. In addition, the text was split one word per line, since the system was trained on short titles and therefore unable to translate long sequences. A special end-of-sentence token (\({\langle }{} \texttt {end}{\rangle }\)) was introduced to restore the original sentence alignment with the English part of the parallel data.
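The character-level preparation of sentences and the restoration of the original alignment via the end-of-sentence token can be sketched as follows (a simplified sketch of the format described above):

```python
BOW, EOW, EOS = '^', '$', '<end>'

def to_char_lines(sentence):
    """Split a sentence into one word per line, each word as space-separated
    characters with word-initial/word-final markers, terminated by an
    <end> token so the sentence alignment can be restored afterwards."""
    lines = [' '.join([BOW] + list(word) + [EOW]) for word in sentence.split()]
    return lines + [EOS]

def from_char_lines(lines):
    """Invert the transformation on the (transliterated) output lines."""
    sentences, words = [], []
    for line in lines:
        if line == EOS:
            sentences.append(' '.join(words))
            words = []
        else:
            words.append(''.join(line.split()).strip(BOW + EOW))
    return sentences
```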
Transliteration system
The system chosen for the transliteration experiments is OpenNMT, an open-source neural MT system utilising the Torch toolkit (Klein et al. 2017). The RNNs consisted of two layers of 500-node LSTMs, with 500-node embeddings on the source and target sides. For optimisation, Adam was used with a minibatch size of 64 and a learning rate of 0.002. We trained the models for 10 epochs and conducted two experimental runs, one with a bidirectional encoder and one without. We evaluated the model by transliterating the titles test set as well as a test set of full sentences. The results of the transliteration experiments are presented in Sect. 5.1.
Neural machine translation system
For the translation system, we used the same attention-based NMT system as in our transliteration experiments. We trained sequence-to-sequence models with global attention and a 2-layer LSTM with 500 hidden units in both the encoder and the decoder, with the same optimisation parameters as in the transliteration experiments. Dropout was set to 0.3 and the vocabulary size to 50,000. Models were trained for 7 epochs, after which we did not observe large improvements on the development set. We propose a method for integrating the monolingual data of the low-resource language without any changes to the system architecture, presented in Fig. 2.
Our final goal is to create a system for translating from the third language (English) into the LRL. Since there is no readily available parallel data for training such a system, we exploit the RU–EN data and train two systems, as seen in Fig. 2. The steps are the following:
1. After training the transliteration system as described in Sect. 3.3, transform the HRL–EN (RU–EN) data to LRL–EN (BE\(_{ru}\)–EN) data.
2. With the transliterated data, train a BE \(\rightarrow \) EN MT system (System 1).
3. Translate monolingual LRL data (BE\(_{mon}\)) into English, using System 1.
4. Train our final system (System 2) to translate from English into the LRL (EN \(\rightarrow \) BE), using the parallel corpus generated from System 1, i.e. the monolingual LRL data and the machine-generated English (BE\(_{mon}\)–EN\(^{\prime }\)).
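The data flow of the four steps can be sketched as follows, with stub functions standing in for the actual transliteration and OpenNMT training/decoding runs (every function here is a placeholder for illustration):

```python
def transliterate_corpus(ru_sentences):
    """Stub for the RU -> BE_ru character-level transliteration (step 1)."""
    return list(ru_sentences)  # identity placeholder

def train_nmt(src_sentences, tgt_sentences):
    """Stub for an OpenNMT training run: returns a 'model' that simply
    memorises the training pairs (a real run is a separate process)."""
    table = dict(zip(src_sentences, tgt_sentences))
    return lambda sents: [table.get(s, '<unk>') for s in sents]

def build_en_to_be(ru, en, be_mono):
    be_ru = transliterate_corpus(ru)       # step 1: RU-EN -> BE_ru-EN
    system1 = train_nmt(be_ru, en)         # step 2: BE -> EN (System 1)
    en_synth = system1(be_mono)            # step 3: back-translate BE_mon
    return train_nmt(en_synth, be_mono)    # step 4: EN -> BE (System 2)
```

The key design point is that only the data changes between the two systems; the NMT architecture itself is untouched.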
System 2 can also be used to translate from the LRL into English; however, we expect high-quality target-side monolingual data to be more beneficial, a question explored in Sect. 5.
For evaluation, BLEU (Papineni et al. 2002) was calculated on tokenised output. Because our method involves transliteration, which is applied at the character level, we found it also useful to evaluate the output with character-based metrics, which partially reward translations even when the morphology is not completely correct. For this reason, we additionally report BEER (Stanojević and Sima’an 2014) and chrF3 (Popović 2015) scores.
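As a reference point for the character-based evaluation, a minimal chrF sketch follows. It is simplified relative to Popović (2015): character n-grams only, uniform averaging over orders, and no smoothing beyond skipping empty orders:

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-gram counts, ignoring whitespace (a simplification)."""
    text = text.replace(' ', '')
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=3.0):
    """Minimal chrF sketch: F_beta over character n-gram precision and
    recall, averaged over n-gram orders 1..max_n. beta=3 weights recall
    three times as much as precision, giving chrF3."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue  # skip orders with no n-grams on either side
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

For reported results one would of course use the reference implementations of chrF3 and BEER, not this sketch.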