Introduction

Natural language processing (NLP) plays a significant role in keeping languages alive and in their continued development in the digital era [1]. One sub-field of NLP is machine translation (MT). MT has been among the most promising applications of artificial intelligence (AI) since the invention of computers, and it has been shown to increase access to information in speakers' native languages in many cases. One such critical case is the spread of vital information during a crisis or emergency [2, 3]. Recently, translation accuracy has increased and commercial systems have gained popularity. These systems have been developed for hundreds of languages, and hundreds of millions of people have gained access to them. However, less common languages do not enjoy this availability of resources. These under-resourced languages lack essential linguistic resources, e.g. corpora, POS taggers, and computational grammars. This is particularly pertinent for MT, since the most common systems require either large amounts of high-quality parallel data or linguists to write a vast set of rules. This survey studies how orthographic information and closely related languages can be exploited to improve the translation quality of under-resourced languages.

The most common MT systems are based on either Rule-Based Machine Translation (RBMT) or Corpus-Based Machine Translation (CBMT). RBMT systems [4,5,6,7,8,9,10] are based on linguistic knowledge encoded by experts. CBMT [11, 12], in contrast, depends on large numbers of aligned sentences; it includes Statistical Machine Translation (SMT) [13,14,15,16,17,18] and Neural Machine Translation (NMT) [19,20,21,22]. Unlike RBMT systems, which require linguistic expertise to write down the rules of a language, CBMT systems rely on examples in the form of sentence-aligned parallel corpora. CBMT systems such as SMT and NMT have alleviated the burden of writing rules, which is not feasible for all languages since human languages are dynamic in nature.

However, CBMT systems suffer from the lack of parallel corpora for under-resourced languages. A number of methods have been proposed to address the non-availability of parallel corpora for under-resourced languages, such as pivot-based approaches [23,24,25], zero-shot translation [26,27,28,29,30] and unsupervised methods [31,32,33], which are described in detail in the following sections. A large array of techniques has been applied to overcome the data sparsity problem in MT, and in recent years virtually all of them build on transfer learning from high-resource languages. Other techniques are based on the lexical and semantic similarities of closely related languages, which are more relevant to our survey on orthographic information in machine translation.

The main goal of this survey is to shed light on how orthographic information is utilised in MT system development and how orthography helps to overcome the data sparsity problem for under-resourced languages. More particularly, it tries to explain the nature of the interaction between orthography and the different types of machine translation. For the sake of simplicity, the analysis presented in this article is restricted to languages which have some form of Internet resources. The survey is organised as follows: the second section provides the background information needed to follow this article, including a subsection on orthographic information. The third section describes the challenges of automatically using orthographic information in RBMT. The fourth and fifth sections present analyses of orthographic information in SMT and NMT systems, respectively. The survey ends with a discussion of future directions towards utilising orthographic information.

Background

In this section, we explain the background information necessary to follow the paper: the different types of MT systems and the orthographic information available for MT.

Under-resourced Languages

Worldwide, there are around 7000 languages [34, 35]. However, most machine-readable data and natural language applications are available for only a few widely used languages, such as Chinese, English, French, or German. For other languages, resources are scarcely available and, for some, not at all. Some of these languages do not even have a writing system [36,37,38] or are not encoded in major schemes such as Unicode. Due to the unavailability of digital resources, many of these languages may go extinct. With each language that is lost, we lose a connection with the culture of its speakers and the characteristics of the language.

Alegria et al. [36] proposed a six-level language typology to develop language technologies that could be useful for several hundred languages. This typology classifies the world's languages based on the availability of Internet resources for each language. According to the study, the term resource-poor or under-resourced is relative and also depends on the year. The first level contains the most resourced languages; the second level contains the top 10 languages used on the web. The third level contains languages which have some form of NLP resources, and the fourth level those which have any lexical resources. Languages that have a writing system but not in digital form are at the fifth level. The last level is significant, comprising oral languages which do not have a writing system of their own. For the purpose of this work, we define under-resourced languages to be those at the third and fourth levels, as the challenges there are purely technical rather than social in nature. Languages that lack extensive parallel corpora are known as under-resourced or low-resourced languages [39].

Languages that seek to survive in modern society need NLP, which requires a vast amount of data and linguistic knowledge to create new language technology tools. In particular, it is a big challenge to develop MT systems for these languages due to the scarcity of data, specifically sentence-aligned data (parallel corpora) in the large amounts needed to train MT systems. For example, Irish, Scottish Gaelic and Manx of the Goidelic branch, and Tamil, Telugu and Kannada of the Dravidian family, are considered under-resourced languages due to scarcely available machine-readable resources, as discussed in Alegria et al. [36].

Orthographic Information

Humans are endowed with a language faculty that is determined by biological and genetic development. However, this is not true of the written form of language, which is the visual representation of the natural, genetically determined spoken form. With the development of orthography, humans have not only overcome the limitations of short-term memory and brain storage capacity, but have also enabled communication through space and time [40]. Orthography is a linguistic factor of mutual intelligibility which may facilitate or impede inter-comprehension [41].

The orthography of a language represents not only the information of the language but also the psychological representation of the world of its users. Chinese orthography is unique in that it uses a logographic writing system. In such a system, each Chinese character carries visual patterns along with rich linguistic information. Characters are visualised in a square space, which depends on the number of strokes a character has. Each character can be decomposed into two parts: the radical, which carries the semantic meaning, and the remaining part, which indicates the pronunciation. According to the Shuo Wen Jie Zi (Footnote 1), Chinese characters are built from 540 radicals, of which only 214 remain in modern Chinese [42]. Problems arise when this decomposition strategy does not apply to some characters. Other Asian languages, such as Korean and Japanese, have two different writing systems. Modern-day Korean uses the Hangul orthography, in which letters are grouped into syllabic blocks, alongside Hanja, which uses classical Chinese characters. As in Korea, Japan also has two writing systems, Kana and Kanji, where Kanji consists of classical Chinese characters and Kana represents sounds, each kana character corresponding to a syllable. As Korean and Japanese are very different from Chinese and are morphologically rich languages, the adoption of Chinese characters was rather difficult. These differences also pose great difficulty for translation and transliteration. Nevertheless, despite these differences and challenges, the three Asian languages share common properties which can be significant advantages in MT.

Closely related languages share similar morphological, syntactic, and orthographic properties. Orthographic similarity has two major sources. The first is a genetic relationship between languages, as within the Germanic, Slavic, Gaelic and Indo-Aryan language families. The second is contact within a geographical area, as between the Indo-Aryan and Dravidian languages of the Indian subcontinent [43]. Two languages possess orthographic similarity only when they have the following properties: overlapping phonemes, mutually compatible orthographic systems and a similar grapheme-to-phoneme mapping. Tables 1 and 2 show examples of differences and similarities between writing systems within the same language family.

Table 1 Languages of the Indo-European language family grouped by whether they share the same orthography or use different orthographies
Table 2 Languages of the Dravidian language family, which use different orthographies

A widespread and underlying problem for MT systems is variation in orthographic conventions. Two languages written in two different orthographies lead to errors in MT outputs. At the same time, orthographic information can also be used to improve a machine translation system. In the following subsections, we describe the different orthographic properties relevant to MT.

Spelling and Typographical Errors

Spelling or typographical errors must be handled very carefully in MT, as even a minor spelling error can generate an out-of-vocabulary word in the training corpus. The methodology used to correct orthographic errors is highly influenced by the source and target languages, as languages apply the same orthographic conventions very differently. These problems can be solved with different methods depending on the type and source of the problem; for example, [44] proposed solutions to overcome errors in the Catalan-Spanish language pair, such as the incorrect use of the geminated l, the apostrophe, and the coordinating conjunctions y and o.

True-casing and Capitalization

True-casing is the process of restoring case information to badly cased or uncased text [45]. To avoid orthographic mismatches, it is common, especially in SMT, to lower-case all words; this prevents the system from treating identical words as different merely because of differences in casing. In most MT systems, both pre-processing and post-processing are carried out; post-processing converts the lower-cased text back to its original case and generates the proper surface forms. This is done mostly for Latin and Slavic scripts, where the same word with different casing could be treated as different by the models; for example, the word cat and the word CAT could be put into different semantic categories just because of their case. True-casing is therefore necessary to avoid such mistakes.
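To make this concrete, the following Python sketch shows a minimal frequency-based truecaser of the kind commonly paired with lowercased training data; the training sentences, the tie-breaking behaviour and the whitespace-tokenised interface are illustrative assumptions rather than the implementation of any particular toolkit.

```python
from collections import Counter

def train_truecaser(sentences):
    """Record the most frequent casing of each word in non-sentence-initial
    positions, a common heuristic for truecasing models."""
    counts = {}
    for sent in sentences:
        tokens = sent.split()
        for tok in tokens[1:]:           # skip sentence-initial tokens
            counts.setdefault(tok.lower(), Counter())[tok] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def truecase(sentence, model):
    """Restore the most likely casing for a lowercased sentence."""
    return " ".join(model.get(tok, tok) for tok in sentence.lower().split())

model = train_truecaser([
    "The launch was delayed by NASA .",
    "A cat sat on the mat .",
])
print(truecase("nasa rescued the cat .", model))  # -> "NASA rescued the cat ."
```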

Normalization

The use of the same word with different orthographic spellings, such as colour and color, gives rise to errors when building a translation model. In such cases, orthographic normalization is required. Several other issues require orthographic normalization and may be language-specific, such as Arabic diacritization, or contextual. This approach needs some linguistic knowledge but can be adapted easily to other languages. Normalization is carried out before most natural language processing tasks; similarly, in machine translation, language-specific normalization yields good results. Examples of text normalization carried out for SMT systems are removal of HTML content, extraction of tag contents, splitting lines at proper punctuation marks, and correction of language-specific word forms [46]. Normalization reduces sparsity as it eliminates out-of-vocabulary words in the text [47].
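A minimal sketch of such a normalization step is shown below; the variant table, the HTML-stripping rule and the whitespace handling are illustrative assumptions, and a real pipeline would derive its mappings from language-specific resources.

```python
import re
import unicodedata

# Toy variant table; real systems derive such mappings from language-specific
# resources (e.g. British/American spelling lists or Arabic diacritic rules).
VARIANTS = {"colour": "color", "normalise": "normalize"}

def normalize(text):
    # Canonical Unicode form so visually identical characters share one code point
    text = unicodedata.normalize("NFC", text)
    # Strip leftover HTML tags from crawled corpora and collapse whitespace
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Map known orthographic variants to a single canonical spelling
    return " ".join(VARIANTS.get(tok, tok) for tok in text.split())

print(normalize("The  <b>colour</b> model"))   # -> "The color model"
```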

Tokenization and Detokenization

The process of splitting text into smaller elements is known as tokenization. Tokenization can be done at different levels, depending on the source and target languages as well as the goal to be achieved. It also includes processing the signs and symbols used in the text, such as hyphens, apostrophes, punctuation marks, and numbers, to make the text more accessible for further steps in MT. Like normalization, tokenization helps to reduce language sparsity. In sub-word tokenization, the most frequent words are assigned their own ids, whereas less frequent words are broken into sub-words that better reflect their internal structure. If the word few appears regularly in the corpus, it will be given its own id, while fewer and fewest, which occur infrequently in the text, will be broken into sub-words such as few, er, and est. This prevents the model from treating fewer and fewest as entirely unrelated terms and helps unknown words to be handled during training, as illustrated in the sketch below.
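The sketch below is a simplified re-implementation of the byte-pair-encoding procedure of Sennrich et al. [80] on a toy word-frequency list; the corpus, the number of merges and the `</w>` end-of-word marker are assumptions made for illustration, and the exact segments depend on both.

```python
import re
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a word-frequency dictionary
    (a simplified version of the Sennrich et al. algorithm)."""
    vocab = {" ".join(w) + " </w>": f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        pattern = r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)"
        vocab = {re.sub(pattern, "".join(best), w): f for w, f in vocab.items()}
    return merges

def apply_bpe(word, merges):
    """Apply the learned merges, in order, to segment a single word."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

merges = learn_bpe({"few": 10, "fewer": 3, "fewest": 1, "new": 8, "newer": 2}, 4)
print(apply_bpe("few", merges))     # frequent word kept whole: ['few</w>']
print(apply_bpe("fewest", merges))  # rare word split into smaller pieces
```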

Detokenization is the process of combining tokens back into the correct surface form before producing the final output. Tokenization and detokenization are not linked directly to orthographic correction; rather, they relate to morphological segmentation and reconstruction, especially for morphologically rich languages like Irish and Arabic [48]. Orthography nevertheless plays a major role, as each orthography has different rules on how to tokenize and detokenize.

Transliteration

Transliteration is the conversion of text from one orthography to another without any phonological changes. Typical targets of transliteration are named entities and generic words [49]. Data collected from social media are heavily transliterated and contain errors; using such data to build a machine translation system for resource-poor languages therefore introduces errors. One class of words with a high chance of being transliterated is cognates. Cognates are words in different languages derived from the same root. In NLP, the term cognate usually denotes words with similar orthography, for example family in English and familia in Spanish. In conventional approaches to automatic cognate detection, words with similar meanings or forms are used as probable cognates; from such sets, the ones that reveal a high phonological, lexical and/or semantic similarity are investigated to find true cognates. Therefore, cognates have a high chance of being transliterated. Although machine translation has progressed considerably in recent years, the handling of transliteration has shifted from language-independent approaches to cognate prediction when translating between closely related languages, and transliteration of cognates helps to improve results for under-resourced languages.

Code-Mixing

Code-mixing is a phenomenon which occurs commonly in most multilingual societies, where a speaker or writer alternates between more than one language in a sentence [50,51,52,53]. Most corpora for under-resourced languages come from publicly available parallel corpora which were created by voluntary annotators or aligned automatically. Translations of technical documents such as the KDE, GNOME, and Ubuntu localisations contain code-mixed data, since some technical terms may not be known to the voluntary translators. Code-mixing in the OpenSubtitles corpus arises for bilingual and historical reasons of the native speakers [51, 54]. Different combinations of languages may occur in code-mixing, for example German-Italian and French-Italian in Switzerland, Hindi-Telugu in the state of Telangana, India, and Hokkien-Mandarin Chinese in Taiwan [55]. As a result, mixing of scripts is also possible in a voluntarily annotated corpus. This poses another challenge for MT.

Orthographic Information in RBMT

RBMT was one of the first approaches to tackle the translation of a source text into a target text without human assistance, by means of collections of dictionaries, collections of linguistic rules, and special programs based on these dictionaries and rules. It depends on rules and linguistic resources such as bilingual dictionaries, morphological analysers, and part-of-speech taggers. The rules encode syntactic knowledge while the linguistic resources deal with morphological, syntactic, and semantic information. Both are grounded in linguistic knowledge and created by linguists [7, 10, 56, 57]. The strength of RBMT is that analysis can be done at both the syntactic and semantic levels. However, it requires a linguistic expert to write down all the rules that cover the language.

An open-source shallow-transfer MT engine for the Romance languages of Spain, such as Spanish, Catalan and Galician, was developed by Armentano-Oller et al. [58]. It was built as an open-source regeneration of existing non-open-source engines based on linguistic data. The post-generator in the system performs orthographical operations, such as contractions and apostrophation, to reduce orthographical errors. Dictionaries were used for string transformation operations on the target-language surface forms. Similarly, a Spanish-Portuguese system used a post-generation module to perform orthographical transformations to improve the translation quality [59, 60].

A manually constructed list of orthographic transformation rules can assist in detecting cognates by string matching [61]. Irish and Scottish Gaelic belong to the Goidelic language family and share similar orthography and many cognates. Scannell [62] developed the ga2gd software, which translates from Irish to Scottish Gaelic. In the context-sensitive syntactic rewriting submodule, the authors implemented transfer rules based on orthography, which are stored in plain text; each rule is then transformed into a finite-state recogniser for the input stream. This work also uses simple rule-based orthographic changes to find cognates.

A Czech to Polish translation system also followed the shallow-transfer method at the lexical stage: a set of transformation rules was applied to a source-language word list to produce a target-language list of cognates [63]. Another shallow-transfer MT system used frequent orthographic changes from Swedish to Danish to identify cognates, with transfer rules based on orthography [64]. A Turkmen to Turkish MT system [65, 66] uses a finite-state transducer to identify cognates even though the orthographies of the two languages differ.

Orthographic Information in SMT

Statistical Machine Translation (SMT) [15, 16, 67,68,69] is one of the CBMT approaches. SMT assumes a set of example translations (\(S^{(k)}\), \(T^{(k)}\)) for \(k=1\ldots n\), where \(S^{(k)}\) is the kth source sentence and \(T^{(k)}\) is the kth target sentence, i.e. the translation of \(S^{(k)}\) in the corpus. SMT systems try to maximise the conditional probability p(t|s) of a target sentence t given a source sentence s by separately modelling a language model p(t) and the inverse translation model p(s|t). A language model assigns a probability p(t) to any sentence t, and a translation model assigns a conditional probability p(s|t) to a source/target pair of sentences [70]. By Bayes' rule,

$$\begin{aligned} p(t|s) \propto p(t)p(s|t) \end{aligned}$$
(1)

This decomposition into a translation model and a language model improves the fluency of the generated text while making full use of the available corpora. The language model not only ensures fluent output but also supports difficult decisions about word order and word translation [68].
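As a toy illustration of Eq. (1), the sketch below scores candidate translations with invented language model and translation model probabilities; the numbers and the fixed candidate list are assumptions, and a real SMT decoder searches over a vast space of derivations rather than a handful of strings.

```python
import math

# Toy probabilities (invented numbers); real models are estimated from
# monolingual and parallel corpora respectively.
LM = {"the house": 0.02, "house the": 0.0001}
TM = {("das haus", "the house"): 0.3, ("das haus", "house the"): 0.3}

def noisy_channel_score(source, candidate):
    """log p(t) + log p(s|t), following the decomposition in Eq. (1)."""
    return math.log(LM.get(candidate, 1e-9)) + math.log(TM.get((source, candidate), 1e-9))

candidates = ["the house", "house the"]
best = max(candidates, key=lambda t: noisy_channel_score("das haus", t))
print(best)   # the language model prefers the fluent word order: "the house"
```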

The two core methodologies used in machine translation development, RBMT and SMT, come with their own advantages and disadvantages. RBMT systems were the first to be developed commercially. They are based on linguistic rules and have proved more feasible for resource-poor languages with little or no data. It is also relatively simple to carry out error analysis and improve the results, and these systems require very little computational resource.

On the contrary, SMT systems need a large amount of data but no linguistic theories, and especially for morphologically rich languages such as Irish, Persian, and Tamil, SMT suffers frequently from out-of-vocabulary problems due to orthographic inconsistencies. To mitigate this problem, orthographic normalization has been proposed to improve the quality of SMT by reducing sparsity [71]. SMT learns from data and requires less human effort in terms of creating linguistic rules, and unlike RBMT systems it does not suffer from the same disambiguation problems. Even though SMT has many advantages over rule-based approaches, it also has disadvantages: it is very difficult to conduct error analysis with SMT, and data sparsity is another drawback [72].

Spelling and Typographical Errors

The impact of spelling and typographical errors on SMT has been studied extensively [73,74,75]. Random non-word errors or real-word errors can be dealt with in many ways; one such method is the use of a character-level translator which provides various spelling alternatives. Typographical errors such as substitution, insertion, deletion, transposition, run-on, and split errors can be addressed with edit distance under a noisy channel model paradigm [76, 77]. Error recovery was performed by correcting spelling alternatives in the input before the translation process.

True-casing and Capitalization, Tokenization and Detokenization

Most SMT systems accept pre-processed input, where the pre-processing consists of tokenisation, true-casing, and punctuation normalisation. Moses [16], a toolkit for SMT, includes pre-processing tools for most languages based on hand-crafted rules. Improvements have been achieved for the recasing and tokenization processes [78]. For languages which do not use Roman characters, linguistically motivated tokenization has been shown to improve SMT results [79]. Byte Pair Encoding (BPE) avoids out-of-vocabulary issues by representing more frequent sub-words as atomic units [80]. A joint BPE model based on the lexical similarity between Czech and Polish identified a cognate vocabulary of sub-words, based on the orthographic correspondences from which words in both languages can be composed [81].

Normalization

Under-resourced languages utilise corpora from user-generated text, media text or voluntary annotators. However, SMT suffers from customisation problems, as tremendous effort is required to adapt to the style of the text. A solution to this is text normalization, that is, normalising the corpora before passing them to SMT [75], which has been shown to improve the results. The orthographies of the Irish and Scottish Gaelic languages used to be quite similar due to a shared literary tradition; however, after the spelling reform in Irish, the orthographies diverged. Scannell [82] proposed a statistical method to normalise the orthography between Scottish Gaelic and Irish as part of the translation of social media text. To be able to use current NLP tools on historical text, spelling normalization is essential, that is, converting the original spelling to present-day spelling; this was studied for historical English text by Schneider et al. [83] and Hämäläinen et al. [84]. For dialect translation, spelling normalization is an important step towards taking advantage of high-resource language resources [85, 86].

Transliteration (Cognate)

As noted above, closely related languages share many features, and the similarities between the languages are of much help in studying the cognates of two languages. Cognates can also exist within the same language and across different language families. Several methods have been proposed to exploit the features of resource-rich languages to improve SMT for resource-poor languages. Manipulating cognates to obtain transliterations is one of the methods adopted to improve SMT systems for resource-poor languages.

Language similarities and regularities in morphology and spelling variation motivate the use of character-level transliteration models. However, to avoid differences in character mappings across contexts, Nakov and Tiedemann [87] transformed the input into a sequence of character n-grams. A sequence of character n-grams increases the vocabulary and also makes the standard alignment models and their lexical translation parameters more expressive.

For languages which use the same or similar scripts, approximate string matching approaches such as Levenshtein distance [88] and the longest common subsequence ratio (LCSR) [89] are used to find cognates. For languages which use different scripts, transliteration is performed first, followed by the above approach. A number of studies have used statistical and deep learning methods along with orthographic information [90, 91] to find cognates. Since cognates can support translation between two languages that share similar properties, it is essential to know the cognateness of a given text between the two languages; cognateness here means how closely two pieces of text are related in terms of cognates. Cognates were also found useful for improving alignment: when the score of a length-based alignment function is very low, a second, cognate-based alignment function is applied to obtain a proper alignment [92].
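The following Python sketch computes the two orthographic similarity measures mentioned above, Levenshtein distance [88] and LCSR [89], for candidate cognate pairs; the example word pairs and any threshold one might apply on top of these scores are illustrative assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcsr(a, b):
    """Longest common subsequence ratio: LCS length over the longer word's length."""
    return lcs_length(a, b) / max(len(a), len(b))

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming with a rolling row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

# Higher LCSR and lower edit distance for the likely cognate pair.
print(lcsr("familia", "family"), edit_distance("familia", "family"))
print(lcsr("noche", "night"), edit_distance("noche", "night"))
```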

One application of cognates prior to MT is parallel corpus alignment. A study of using cognates to align sentences in parallel corpora was carried out by Simard et al. [93]. Character-level methods to align sentences [94] are based on this cognate approach [93].

As early as Bemova et al. [95], researchers have looked into translation between closely related languages, such as Czech-Russian RUSLAN and Czech-Slovak CESILKO [96], using syntactic rules and lexicons. The closeness of related languages makes it possible to obtain a good translation by means of simpler methods. However, both systems were rule-based, and bottlenecks included the complexities associated with a word-for-word dictionary translation approach. Nakov and Ng [97] proposed a method to use resource-rich closely related languages to improve the statistical machine translation of under-resourced languages by merging parallel corpora and combining phrase tables. The authors developed a transliteration system for Portuguese into Spanish, trained on automatically extracted likely cognates using systematic spelling variation.

Popović and Ljubešić [98] created MT systems between closely related languages of the Slavic family. Language-related issues between Croatian, Serbian and Slovenian are explained by Popović et al. [99]. Serbian is digraphic (it uses both the Cyrillic and Latin scripts), while the other two are written using only the Latin script. For Serbian, transliteration from the Latin to the Cyrillic script is possible without loss of information because there is a one-to-one correspondence between the characters, as sketched below.
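A minimal sketch of such a lossless Latin-to-Cyrillic conversion for Serbian is given below; note that the digraphs lj, nj and dž must be mapped before single letters, and the example words are only illustrative.

```python
# Digraphs must be handled before single letters; the mapping is otherwise
# one-to-one between the Serbian Latin and Cyrillic alphabets.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ", "Lj": "Љ", "Nj": "Њ", "Dž": "Џ"}
SINGLES = dict(zip("abvgdđežzijklmnoprstćufhcčš",
                   "абвгдђежзијклмнопрстћуфхцчш"))
SINGLES.update(dict(zip("ABVGDĐEŽZIJKLMNOPRSTĆUFHCČŠ",
                        "АБВГДЂЕЖЗИЈКЛМНОПРСТЋУФХЦЧШ")))

def latin_to_cyrillic(text):
    out, i = [], 0
    while i < len(text):
        if text[i:i + 2] in DIGRAPHS:      # two-letter units first
            out.append(DIGRAPHS[text[i:i + 2]])
            i += 2
        else:                              # single letters, punctuation passes through
            out.append(SINGLES.get(text[i], text[i]))
            i += 1
    return "".join(out)

print(latin_to_cyrillic("Ljubav, Beograd, džez"))  # -> "Љубав, Београд, џез"
```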

In 2013, a group of researchers used a PBSMT approach as the base method to produce cognates [100]. Instead of translating phrases, they transformed a character sequence from one language to another, using words instead of sentences and characters instead of words in the transformation process. The combination of the phrase table with transformation probabilities and language model probabilities selects the best sequence; the process thus takes the surrounding context into account and produces cognates. A joint BPE model based on the lexical similarity between Czech and Polish identifies a cognate vocabulary of sub-words, based on the orthographic correspondences from which words in both languages can be composed [81]. It has been demonstrated that the use of cognates improves translation quality [17].

Code-Switching

SMT with a code-switched parallel corpus was studied by Menacer et al. [101] and Fadaee and Monz [102] for the Arabic–English language pair. The authors manually translated foreign words or used back-translation to translate them, identifying the language of each word based on its orthography. Chakravarthi et al. [103] used the same approach for Dravidian languages and used the improved MT to create WordNets, showing improved results. For English–Hindi, Dhar et al. [104] manually translated the code-switched components and showed improvements. Machine translation of social media was studied by Rijhwani et al. [105], who tackled code-mixing for Hindi–English and Spanish–English. A similar approach translated the main language of the sentence using the Bing Translate API [106].
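A simple way to identify the language of a word from its orthography, as in the studies above, is to inspect the Unicode script of its characters; the sketch below does this for Hindi–English code-mixed text, and the example sentence and label set are assumptions for illustration.

```python
import unicodedata

def word_script(word):
    """Guess the script of a token from Unicode character names,
    e.g. to separate Devanagari (Hindi) words from Latin (English) ones."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            scripts.add(name.split()[0])   # first word of the name, e.g. 'DEVANAGARI'
    if len(scripts) == 1:
        return scripts.pop()
    return "MIXED" if scripts else "OTHER"

sentence = "मैं office जा रहा हूँ"
print([(w, word_script(w)) for w in sentence.split()])
# [('मैं', 'DEVANAGARI'), ('office', 'LATIN'), ('जा', 'DEVANAGARI'), ...]
```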

Back-transliteration from one script to the native script in code-mixed data is a challenging task. Riyadh and Kondrak [107] adopted three different methods to back-transliterate Romanised Hindi-Bangla code-mixed data into the Hindi and Bangla scripts: Sequitur, a generative joint n-gram transducer; DTLM, a discriminative string transducer; and the OpenNMT (Footnote 2) neural machine translation toolkit. Along with these three approaches, they leveraged target word lists, character language models, and synthetic training data, wherever possible, to support transliteration. Finally, these transliterations are provided to a sequence prediction module for further processing.

Pivot Translation

Pivot translation is translation from a source language to a target language through an intermediate language, called the pivot language, for which large source-pivot and pivot-target parallel corpora are usually available [25, 108]. There are different approaches to pivot translation. The first is the triangulation method, in which the corresponding translation probabilities and lexical weights in the source-pivot and pivot-target phrase tables are multiplied. In the second method, sentences are translated into the pivot language using a source-pivot translation system and then into the target language using a pivot-target translation system [109]. Finally, the source-target MT system can be used to create more data which is added back to the source-target model, which is called back-translation [80, 110]. Back-translation is simple and easy to apply without modifying the architecture of the machine translation models, and it has been studied in both SMT [111,112,113] and NMT [23, 80, 110, 114,115,116].
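A toy sketch of the triangulation method is shown below; the phrase pairs and probabilities are invented, and real systems additionally combine lexical weights and prune the resulting phrase table.

```python
from collections import defaultdict

def triangulate(src_pivot, pivot_tgt):
    """Phrase-table triangulation: p(t|s) is approximated by summing
    p(p|s) * p(t|p) over shared pivot phrases p."""
    src_tgt = defaultdict(float)
    for (s, p), p_ps in src_pivot.items():
        for (p2, t), p_tp in pivot_tgt.items():
            if p == p2:
                src_tgt[(s, t)] += p_ps * p_tp
    return dict(src_tgt)

# Toy source-pivot and pivot-target phrase probabilities (invented numbers).
src_pivot = {("maison", "house"): 0.7, ("maison", "home"): 0.3}
pivot_tgt = {("house", "casa"): 0.8, ("home", "casa"): 0.6, ("home", "hogar"): 0.4}
print(triangulate(src_pivot, pivot_tgt))
# approximately {('maison', 'casa'): 0.74, ('maison', 'hogar'): 0.12}
```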

The pivot translation method can also be used to improve MT systems for under-resourced languages. One popular approach is to train SMT systems on source-pivot or pivot-target language pairs using sub-words, where the pivot language is related to the source, the target, or both. The sub-word units consist of orthographic syllables and byte-pair-encoded units. The orthographic syllable is a linguistically motivated unit consisting of one or more consonants followed by a vowel. Unlike orthographic syllables, BPE (byte pair encoded) units [80] are motivated by the statistical properties of the text and represent stable, frequent character sequences. As orthographic syllables and BPE units are variable-length units and their vocabularies are much smaller than those of morpheme and word-level models, the data sparsity problem does not arise, while still providing appropriate context for translation between closely related languages [117].
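The sketch below segments romanised text into orthographic syllables using the consonant(s)-plus-vowel definition given above; the vowel inventory is an assumption suitable only for Latin-script text, and implementations for Indic scripts operate on the scripts' own vowel and consonant signs.

```python
import re

VOWELS = "aeiouāīūēō"   # an assumed vowel set for romanised text

def orthographic_syllables(word):
    """Split a word into orthographic syllables: zero or more consonants
    followed by a single vowel; trailing consonants form a unit of their own."""
    pattern = re.compile(f"[^{VOWELS}]*[{VOWELS}]|[^{VOWELS}]+$")
    return pattern.findall(word.lower())

print(orthographic_syllables("kannada"))      # -> ['ka', 'nna', 'da']
print(orthographic_syllables("translation"))
```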

Orthographic Information in NMT

Neural Machine Translation (NMT) is a sequence-to-sequence approach [21] based on encoder-decoder architectures with attention [19, 118] or self-attention encoders [119, 120]. Given a source sentence \({\mathbf {x}}=(x_1,x_2,x_3,\ldots )\) and a target sentence \({\mathbf {y}}=(y_1,y_2,y_3,\ldots )\), the training objective of NMT is to maximise the log-likelihood \({\mathcal {L}}\) with respect to the parameters \(\theta\):

$$\begin{aligned} {\mathcal {L}}_{\theta }=\sum _{({\mathbf {x}}, {\mathbf {y}}) \in \mathrm {C}} \log p({\mathbf {y}} | {\mathbf {x}} ; \theta ) \end{aligned}$$
(2)

The decoder produces one target word at a time by computing the probability

$$\begin{aligned} p({\mathbf {y}} | {\mathbf {x}} ; \theta )=\prod _{j=1}^{m} p\left( y_{j} | y_{<j}, {\mathbf {x}} ; \theta \right) \end{aligned}$$
(3)

where m is the number of words in \({\mathbf {y}}\), \(y_{j}\) is the current generated word, and \(y_{<j}\) are the previously generated words. At inference time, beam search is typically used to find the translation that maximises this probability. Most NMT models follow the \(Embedding\rightarrow\) \(Encoder\rightarrow\) \(Attention\rightarrow\) Decoder framework.

The attention mechanism across encoder and decoder computes the context vector \(c_t\) as the weighted sum of the source-side annotation vectors:

$$\begin{aligned} c_t= & {} \sum _{i=1}^n \alpha _{t,i} h_i \end{aligned}$$
(4)
$$\begin{aligned} \alpha _{t,i}= & {} \frac{\exp {(e_{t,i}})}{\sum _{j=1}^{m}\exp {(e_{t,j}})} \end{aligned}$$
(5)

where \(\alpha _{t,i}\) is the normalised alignment weight between each source annotation vector \(h_i\) and the word \(y_t\) to be emitted at time step t. The expected alignment \(e_{t,i}\) between each source annotation vector \(h_i\) and the target word \(y_t\) is computed using the following formula:

$$\begin{aligned} e_{t,i}= & {} a({\mathbf {s}}_{{\mathbf {t}}-{\mathbf {1}}},h_i) \end{aligned}$$
(6)
$$\begin{aligned} {\mathbf {s}}_{{\mathbf {t}}}= & {} g\left( {\mathbf {s}}_{{\mathbf {t}}-{\mathbf {1}}}, {\mathbf {y}}_{{\mathbf {t}}-{\mathbf {1}}}, {\mathbf {c}}_{{\mathbf {t}}}\right) \end{aligned}$$
(7)

where g is the decoder activation function, \(s_{t-1}\) is the previous decoder hidden state, and \(y_{t-1}\) is the embedding of the previous word. The current decoder hidden state \(s_{t}\), the previous word embedding and the context vector are fed to a feedforward layer f, and a softmax layer computes the probability of generating a target word as output:

$$\begin{aligned} P\left( y_{t} | y_{<t}, {\mathbf {x}}\right) ={\text {softmax}}\left( f\left( {\mathbf {s}}_{{\mathbf {t}}}, {\mathbf {y}}_{t-1}, {\mathbf {c}}_{{\mathbf {t}}}\right) \right) \end{aligned}$$
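The NumPy sketch below traces Eqs. (4)-(6) for a single decoding step; for brevity it assumes a bilinear scoring function in place of the feedforward alignment model a(·), and the dimensions and random vectors are purely illustrative.

```python
import numpy as np

def attention(decoder_state, encoder_states, W_a):
    """Compute attention scores e_{t,i}, weights alpha_{t,i} and the
    context vector c_t for one decoding step (simplified form of Eqs. 4-6)."""
    # e_{t,i} = a(s_{t-1}, h_i): a bilinear scoring function is assumed here
    scores = encoder_states @ W_a @ decoder_state           # shape: (n,)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                        # softmax, Eq. (5)
    context = weights @ encoder_states                       # weighted sum, Eq. (4)
    return context, weights

n, d = 5, 8                        # 5 source annotations of dimensionality 8
rng = np.random.default_rng(0)
h = rng.normal(size=(n, d))        # encoder annotation vectors h_i
s_prev = rng.normal(size=d)        # previous decoder state s_{t-1}
W_a = rng.normal(size=(d, d))      # scoring parameters (learned in practice)
context, alpha = attention(s_prev, h, W_a)
print(alpha.round(3), context.shape)   # weights sum to 1; context has dimension d
```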

Multilingual Neural Machine Translation

In recent years, NMT has substantially improved translation performance, which has led to a boom in NMT research. The most popular neural architectures for NMT are based on the encoder-decoder structure [19, 21, 121] and the use of attention or self-attention mechanisms [119, 122]. Multilingual NMT, built with or without multiway corpora, has been studied for its potential to translate between two languages without any direct parallel corpus. Zero-shot translation uses multilingual data to build translation for language pairs which have no direct parallel corpora to train on independently. Multilingual NMT with only monolingual corpora was studied by [123, 124]. Ha et al. [125] and the authors of [28] demonstrated that multilingual NMT improves translation quality; they created multilingual NMT systems without changing the architecture by introducing special tokens at the beginning of the source sentence indicating the source and target languages.

Phonetic transcription into Latin script and the International Phonetic Alphabet (IPA) was studied by Chakravarthi et al. [103], who showed that Latin script outperforms IPA for multilingual NMT of Dravidian languages. Chakravarthi et al. [126] proposed combining multilingualism, phonetic transcription and multimodal content to improve the translation quality of under-resourced Dravidian languages. The authors studied how closely related languages of the Dravidian family can be used to exploit similar syntactic and semantic structures by phonetically transcribing the corpora into Latin script, along with image features, to improve translation quality [127]. They showed that orthographic information improves translation quality in multilingual NMT [128].

Spelling and Typographical Errors

Spelling errors are amplified in under-resourced settings due to the potentially infinite number of possible misspellings, which lead to a large number of out-of-vocabulary words. Additionally, under-resourced morphologically rich languages have morphological variation, which causes orthographic errors when using character-level MT. A shared task dealing with orthographic variation, grammatical errors and informal language in noisy social media text was organised by Li et al. [129]. Data cleaning was used along with suitable corpora to handle spelling errors. Belinkov and Bisk [130] investigated noise in NMT, focusing on kinds of orthographic errors. Parallel corpora were cleaned before training NMT to reduce spelling and typographical errors.

NMT with word-embedding lookup ignores the orthographic make-up of words, such as the presence of stems, prefixes, suffixes and other kinds of affixes. To overcome this drawback, character-based word embeddings were proposed by Kim et al. [131]. Character-based NMT [132,133,134,135] was developed to cover languages which do not have explicit word segmentation, and it strengthens the relationship between the orthography of a word and its meaning in the translation system. With misspelled data for under-resourced languages, the quality of word-based translation drops severely because every non-canonical form of a word cannot be represented; character-level models overcome spelling and typographical errors without much effort.

True-casing and Capitalization, Normalization, Tokenization and Detokenization

Although NMT can be trained for end-to-end translation, many NMT systems are still language-specific and require language-dependent preprocessing, such as that used in statistical machine translation; Moses [16], a toolkit for SMT, includes preprocessing tools for most languages based on hand-crafted rules, but these are mainly available for European languages. For Asian languages which do not use spaces between words, a segmenter is required for each language independently before feeding text into NMT to indicate word boundaries. This becomes a problem when training multilingual NMT [28].

A solution for the open-vocabulary problem in NMT is to break up rare words into sub-word units [136, 137], which has been shown to handle ambiguities in languages with multiple scripts [138, 139]. A simple and language-independent tokenizer was introduced for NMT and multilingual NMT by Kudo and Richardson [140]; it is based on two sub-word segmentation algorithms, byte-pair encoding (BPE) [80] and a unigram language model [141]. This system also normalises semantically equivalent Unicode characters into canonical forms. The sub-word segmentation and true-casing models have to be rebuilt whenever the training data changes. The preprocessing tools introduced by OpenNMT normalise characters and separate punctuation from words, and can be used for any language and any orthography [142].
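A typical way to apply such a language-independent sub-word tokenizer is sketched below with the SentencePiece library; the file name corpus.txt, the vocabulary size and the choice of the unigram model are assumptions, not the settings of any system surveyed here.

```python
import sentencepiece as spm

# Train a language-independent sub-word model on raw text (assumed file name);
# the normalization rule canonicalises semantically equivalent Unicode characters.
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt --model_prefix=subword --vocab_size=8000 "
    "--model_type=unigram --normalization_rule_name=nmt_nfkc"
)

sp = spm.SentencePieceProcessor()
sp.Load("subword.model")
print(sp.EncodeAsPieces("orthographic information helps machine translation"))
# Detokenization is lossless: pieces can be joined back into the original text.
print(sp.DecodePieces(sp.EncodeAsPieces("orthographic information")))
```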

Character-level NMT systems work at the character level to capture orthographic similarity between languages. They were developed to overcome the issue of limited parallel corpora and to resolve the out-of-vocabulary problem for under-resourced languages. In the Hindi–Bhojpuri case, Bhojpuri is closely related to Hindi and is considered an under-resourced language; it has a large overlap in vocabulary with the high-resource language Hindi due to the common properties of the two languages [143]. To solve the out-of-vocabulary problem, the transduction of Hindi words into Bhojpuri words was performed with NMT models trained on Hindi–Bhojpuri cognate pairs. It was a two-level system: first, a Hindi–Bhojpuri system translated the sentence; then, the out-of-vocabulary words were transduced.

Transliteration (Cognate)

Transliteration emerged to deal with proper nouns and technical terms, which are translated while preserving their pronunciation. Transliteration can also be used to improve machine translation between closely related languages which use different scripts, since closely related languages have orthographic and phonological similarities.

Machine translation often occurs between closely related languages or through a pivot language (like English) [144]. Translation between closely related languages or dialects can be handled either as a simple transliteration from one language to another or as a post-processing step. Transliterating cognates has been shown to improve MT results, since closely related languages share linguistic features. To translate from English into Finnish and Estonian, whose words have similar orthography, Grönroos et al. [145] used Cognate Morfessor, a multilingual variant of Morfessor which learns to model cognate pairs based on the unweighted Levenshtein distance [88]. The idea is to improve the consistency of morphological segmentation of words that have similar orthography, which was shown to improve translation quality for the resource-poor Estonian language.

Cherry and Suzuki [146] used transliteration as a method to handle out-of-vocabulary (OOV) problems. To remove the script barrier, Bhat et al. [147] created machine transliteration models for a common orthographic representation of Hindi and Urdu text, transliterating in both directions between the Devanagari script (used to write Hindi) and the Perso-Arabic script (used to write Urdu). The authors demonstrated that a dependency parser trained on the augmented resources performs better than one trained on the individual resources, showed a significant improvement in BLEU (Bilingual Evaluation Understudy) [148] score, and showed that the data sparsity problem is reduced.

Recent work by Kunchukuttan et al. [149] has explored orthographic similarity for transliteration. They used related languages which share similar writing systems and phonetic properties, such as the Indo-Aryan languages, and showed that multilingual transliteration leveraging similar orthography outperforms bilingual transliteration in different scenarios. Phonetic transcription is a method of writing a language in another script while keeping the phonemic units intact. It is used extensively in speech processing research, text-to-speech, and speech database construction, and phonetic transcription into a common script has been shown to improve machine translation results [103]. The authors focus on multilingual translation of languages which use different scripts and study the effect of transcribing different orthographies into a common script for multilingual NMT. A multiway NMT system was created for Czech and Polish using IPA transcriptions of both languages in a 3-way parallel text to take advantage of the phonology of these closely related languages [81]. Orthographic correspondence rules were used as a replacement list for translation between closely related Czech and Polish, with an added back-translated corpus [81]. Dialect translation was studied by Baniata et al. [150]: to translate Arabic dialects into Modern Standard Arabic, they used multitask learning which shares one decoder for standard Arabic while every source dialect has a separate encoder, a choice motivated by the non-standard orthography of the Arabic dialects. Their experiments showed improved results for the under-resourced Arabic dialects.

Machine translation of named entities is a significant issue due to the linguistic and algorithmic challenges involved. The quality of MT of named entities, including technical terms, was improved by developing lexicons using orthographic information. Lexicon integration into NMT was studied for Japanese-Chinese MT [151], dealing with the orthographic variation of Japanese named entities using large-scale lexicons. For English-to-Japanese, English-to-Bulgarian, and English-to-Romanian, Ugawa et al. [152] proposed a model that encodes the input word based on its named-entity tag at each time step, which improves the BLEU scores of the machine translation results.

Code-Switching

A significant part of the corpora for under-resourced languages comes from movie subtitles and technical documents, which makes them even more prone to code-mixing. Most of these corpora are movie dialogues [153] transcribed to text, and they differ from other written genres: the vocabulary is informal, it includes non-linguistic sounds such as ah, and, in the case of English and native languages, scripts are mixed [154,155,156,157,158,159]. Data augmentation [160, 161] and replacing foreign words with native words using dictionaries or other methods have been studied. Removing the code-mixed words from both sides of the corpus was studied by Chakravarthi et al. [103, 127] for English–Dravidian languages. Song et al. [162] studied a data augmentation method that makes code-switched training data by replacing source phrases with their target translations. Character-based NMT [133,134,135] can naturally handle intra-sentential code-switching as a consequence of the many-to-one translation task.

Orthographic Information in Unsupervised Machine Translation

Building parallel corpora for under-resourced languages is time-consuming and expensive; as a result, parallel corpora for under-resourced languages are limited or, for some languages, unavailable. With limited parallel corpora, supervised SMT and NMT cannot achieve the desired translation quality. However, monolingual corpora can be collected from various sources on the Internet and are much easier to obtain than parallel corpora. Recent research has created machine translation systems using only monolingual corpora [163,164,165], removing the dependency on sentence-aligned parallel corpora through unsupervised methods. These systems are based on both SMT [166, 167] and NMT [168]. One related task is bilingual lexicon induction.

Bilingual lexicon induction is the task of creating word translations from monolingual corpora in two languages [169, 170]. One way to induce a bilingual lexicon is to use orthographic similarity, based on the assumption that words that are spelled similarly are sometimes good translations and may be cognates, sharing similar orthography for historical reasons. A generative model for inducing a bilingual lexicon from monolingual corpora by exploiting orthographic and contextual similarities of words in two different languages was proposed by Haghighi et al. [171]. Many methods based on edit distance and orthographic similarity have been proposed for using linguistic features in supervised and unsupervised word alignment [172,173,174]. Riley and Gildea [175] proposed a method to utilise orthographic information in word-embedding-based bilingual lexicon induction: the authors used the alphabets of the two languages to extend the word embeddings and modified the similarity score functions of previous embedding-based methods to include an orthographic similarity measure, as sketched below. Bilingual lexicons have been shown to improve machine translation in both RBMT [170] and CBMT [163, 176, 177].
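A minimal sketch of combining embedding similarity with an orthographic similarity term, loosely in the spirit of Riley and Gildea [175], is given below; the toy vectors, the interpolation weight and the use of a ratio-based string similarity instead of their exact formulation are assumptions.

```python
import numpy as np
from difflib import SequenceMatcher

def orthographic_similarity(a, b):
    """String similarity in [0, 1], a stand-in for edit-distance-based measures."""
    return SequenceMatcher(None, a, b).ratio()

def combined_score(src_word, tgt_word, src_vec, tgt_vec, lam=0.5):
    """Interpolate embedding cosine similarity with orthographic similarity;
    lam is an assumed interpolation weight."""
    cosine = float(src_vec @ tgt_vec /
                   (np.linalg.norm(src_vec) * np.linalg.norm(tgt_vec)))
    return lam * cosine + (1 - lam) * orthographic_similarity(src_word, tgt_word)

# Toy vectors in an assumed shared cross-lingual embedding space.
v_en = np.array([0.9, 0.1, 0.2])
v_es = np.array([0.8, 0.2, 0.1])
print(combined_score("family", "familia", v_en, v_es))              # high score
print(combined_score("family", "mesa", v_en, np.array([0.1, 0.9, 0.3])))  # low score
```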

In work by Bloodgood and Strauss [178], the authors performed translation lexicon induction for heavily code-switched text of historically unwritten colloquial words via loanwords, using expert knowledge together with language information. Their method takes word pronunciations (IPA) from a donor language and converts them into the borrowing language. This shows improvements in BLEU score for the induction of a Moroccan Darija-English translation lexicon bridged via French loanwords.

Discussion

From our comprehensive survey, we can see that orthographic information improves translation quality in all types of machine translation, from rule-based systems to completely unsupervised methods such as bilingual lexicon induction. For RBMT, translation between closely related languages is simplified to transliteration thanks to cognates. Statistical machine translation deals with the data sparsity problem using orthographic information; since SMT has been studied for a long time, most orthographic properties have been studied for many types of languages, and even recent neural machine translation and other methods still use preprocessing tools, such as true-casers, tokenizers, and detokenizers, that were developed for statistical machine translation. Recent neural machine translation is completely end-to-end; however, it suffers from data sparsity when dealing with morphologically rich or under-resourced languages. These issues are addressed by utilising orthographic information in neural machine translation, and one such method which improves translation is the transliteration of cognates. Code-switching is another issue with under-resourced languages, due to data collected from voluntary annotators, web crawling or similar methods; however, dealing with code-switching based on orthography or using character-based neural machine translation has been shown to improve the results significantly.

From this, we conclude that orthographic information is heavily utilised when translating between closely related languages or when using multilingual neural machine translation with closely related languages. While exciting advances have been made in machine translation in recent years, there are still promising directions for exploration in leveraging linguistic information such as orthography. One such area is unsupervised machine translation and bilingual lexicon induction: recent work shows that word vectors combined with orthographic information perform better for aligning bilingual lexicons in completely unsupervised or semi-supervised approaches. We believe that this survey will help to catalogue future research and provide a better understanding of how orthographic information can improve machine translation results.

Conclusion

In this work, we presented a review of the current state of the art in machine translation utilising orthographic information, covering rule-based, statistical, neural and unsupervised machine translation. As part of this survey, we introduced the different machine translation methods and showed how orthography plays a role in machine translation results. Methods that utilise orthographic information have already led to significant improvements in machine translation results.