Exploring the Relevance of Bilingual Morph-Units in Automatic Induction of Translation Templates

  • Kavitha Karimbi Mahesh
  • Luís Gomes
  • José Gabriel Pereira Lopes
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11238)

Abstract

To tackle the problem of out-of-vocabulary (OOV) words and improve bilingual lexicon coverage, the relevance of bilingual morph-units is explored in inducing translation patterns considering unigram to n-gram and n-gram to unigram translations. The approach relies on induction of translation templates using bilingual stems learnt from automatically acquired bilingual translation lexicons. By generalising the templates using bilingual suffix clusters, new translations are automatically suggested.

1 Introduction

Numerous investigations have been reported on learning suffixes and suffixation operations using a lexicon or corpus of a language, for tackling out-of-vocabulary (OOV) words [1, 2, 3]. Beyond mere words or word forms, morphological similarities between known word-to-word translation forms have also been explored as a means to generalise the existing examples for automatic induction of word-to-word translations [4]. Learning approaches such as these employ available bilingual examples in inducing new translations that are infrequent or have never been encountered in the corpus used for lexicon acquisition. Approaches that allow simultaneous learning of morphology from multiple languages work well in inducing morphological segmentation by exploiting cross-lingual morpheme patterns [5]. The underlying benefit is that morphological structure that is ambiguous in one language may be explicitly marked in another language. Along similar lines, the bilingual learning approach [6] improves the coverage of available bilingual lexica by employing bilingual stems, suffixes and their clusters, thereby generating those OOV word-to-word translations that remain missing.

To further enhance the coverage of an existing bilingual lexicon beyond word level, this paper discusses a generative approach based on translation templates induced from the existing bilingual lexicon augmented with bilingual stems, and bilingual stem and suffix clusters. In our previous work [6] we addressed the extraction of word-to-word translations; here we extend our method to generate word to n-gram translations through the use of translation templates. Translation templates are composed of one or more template tokens on each language side. Each template token is responsible for generating a token in the resulting translation pairs. In this paper we restrict ourselves to word to n-gram templates, but in principle the procedure can be generalised to n-gram to n-gram translations. Induction of translation templates might be further viewed as an application of bilingual morph-units, previously learnt [6] from a specific corpus of linguistically validated bilingual translations. Identifying the correspondence between units in a bilingual pair of phrases is essential for inducing translation templates and is determined by using a dictionary of bilingual stems acquired using the bilingual learning approach [6]. Induced templates serve to generate new translations (see footnote 1).

2 Related Work

Güvenir et al. [7] use analogical reasoning between translation pairs to learn structural correspondences between two languages from a corpus of translated sentence pairs.

Hu’s approach [8] relies on extracting semantic groups and phrase structure groups from the language pairs under consideration. The phrase structure groups upon alignment are post-processed to yield translation templates.

Another approach for generating generalised templates is based on finding common patterns in a bilingual corpus [9]. Common patterns are identified by combining commonly used approaches, such as identifying similar or dissimilar portions of text in groups of sentence pairs, finding semantically similar words, and finding syntactic correspondences using dictionaries and parsers. After the semantically related phrase-pairs are grouped based on their contexts, templates are induced by replacing clustered phrase-pairs by their class labels.

The use of bilingual stems [6], induced from similar bilingual pairs (translations) and employed in template induction, is in line with the identification of common and different parts proposed by Gangadharaiah et al. [9]. While this identification relies on sentence pairs in their approach, word-to-word translation pairs form the source of our study. The common parts, referred to as bilingual stems, correspond to semantically similar morph-units, and these bilingual segments conflate the meaning conveyed by similar translation pairs. The different parts represent their bilingual morphological extensions, referred to as bilingual suffixes. Our work loosely coincides with the approach proposed by Gangadharaiah et al. [9] in the use of clusters. Nevertheless, in our approach, clusters of bilingual stems aid in suggesting new translations after the induction of translation templates.

3 Background

In the current section we provide a brief overview of the bilingual resources, the validated lexicon of translations augmented with bilingual stems, suffixes, and their clusters, employed in template induction. Also, a brief overview of the approaches employed in acquiring those resources is presented.

3.1 Validated Bilingual Lexicons

We used an English-Portuguese (EN-PT) bilingual lexicon acquired automatically by employing various extraction techniques [10, 11, 12, 13, 15] applied to aligned parallel corpora (see footnote 2). Methods proposed by Brown et al. [10] and Lardilleux and Lepage [11] were employed for initial extractions. The former provides an alignment for every word in the corpus based on corpus-wide frequency counts, while the latter follows random sub-corpus sampling. In a different strategy, a bilingual lexicon was used to initially align parallel texts [14, 15]. New term-pairs (see footnote 3) were then extracted from those aligned texts. In this setting, the extraction method proposed by Aires et al. [12] employs the alignments [14, 15] as anchors to further infer alignments for neighbouring unaligned words, based on co-occurrence statistics. The extracted term-pairs were manually verified; the correct ones were added to the bilingual lexicon marked as ‘accepted’, and the incorrect ones were marked as ‘rejected’. Remaining extractions were done following two different approaches proposed by Gomes and Lopes [13, 15], one based on combining the co-occurrence statistics with SpSim, a spelling similarity score for estimating the similarity between words, and the other based on translation templates. It is to be noted that, unlike the templates induced using the approach proposed in this paper, the templates used for extraction [15] were handwritten. Using the handwritten templates enabled extraction of translation equivalents with very high precision. The 16 most productive EN-PT patterns extracted 228,645 translation equivalents with precision as high as 98.63\(\%\), and 2,631 EN-PT patterns extracted 217,775 translation equivalents with precision 97.21\(\%\) [15]. These results further motivated us to investigate the use of a human-validated bilingual lexicon in automatic template induction for generating missing translations.

Entries in the lexicon were classified as ‘accepted’ or ‘rejected’ automatically using SVM based classifiers [16] and were later validated by linguists making use of a bilingual concordancer [17]. The translation lexicon thus obtained, being sufficiently large with enough near word and phrase translation forms, was used in a bilingual morphology learning framework for lexicon augmentation, yielding bilingual stems, suffixes and their clusters (further discussed in Sect. 3.2).

3.2 Bilingual Resources

The bilingual lexicon discussed in the previous section, augmented with bilingual stems, suffixes and their clusters learnt from the EN-PT lexicon of unigram translations (see footnote 4), serves as the fundamental resource in inducing translation templates. Throughout this paper, the term bilingual morph-units is used to refer collectively to bilingual stems and suffixes. As the bilingual morph-units, primarily the bilingual stems, form the basis of the translation template induction process, a brief overview of the approach employed in learning them is presented for the language pair EN-PT.

Bilingual Stems and Suffixes. The induction of bilingual stems and suffixes follows the bilingual learning approach [6] applied to the EN-PT lexicon of unigram translations. The approach involves identification and extraction of orthographically and semantically similar bilingual segments, for instance ‘ensur’ \(\Leftrightarrow \) ‘assegur’, occurring in known translation examples such as ‘ensuring’ \(\Leftrightarrow \) ‘assegurando’, ‘ensured’ \(\Leftrightarrow \) ‘assegurou’, ‘ensure’ \(\Leftrightarrow \) ‘assegurar’, ‘ensured’ \(\Leftrightarrow \) ‘assegurado’, ‘ensured’ \(\Leftrightarrow \) ‘assegurados’, ‘ensured’ \(\Leftrightarrow \) ‘asseguradas’, ‘ensures’ \(\Leftrightarrow \) ‘assegure’, ‘ensures’ \(\Leftrightarrow \) ‘assegura’, ‘ensure’ \(\Leftrightarrow \) ‘asseguram’, ‘ensure’ \(\Leftrightarrow \) ‘assegurem’, and ‘ensured’ \(\Leftrightarrow \) ‘asseguraram’, together with their bilingual extensions constituting dissimilar bilingual segments (bilingual suffixes): (‘e’, ‘ar’), (‘e’, ‘arem’), (‘e’, ‘am’), (‘e’, ‘em’), (‘es’, ‘e’), (‘es’, ‘a’), (‘ed’, ‘ada’), (‘ed’, ‘adas’), (‘ed’, ‘ado’), (‘ed’, ‘ados’), (‘ed’, ‘aram’), (‘ed’, ‘ou’), (‘ing’, ‘ando’), (‘ing’, ‘ar’). The common part of the translations, which conflates all its bilingual variants (see footnote 5), represents a bilingual stem (‘ensur’ \(\Leftrightarrow \) ‘assegur’). The different parts of the translations, which contribute to the various surface forms, represent bilingual suffixes ((‘e’, ‘ar’), (‘e’, ‘arem’), and so forth).
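The decomposition of known translations into a bilingual stem and bilingual suffixes can be illustrated with a minimal Python sketch. It assumes the stem pair is already known and only strips it from each translation; it does not reproduce how the bilingual learning approach [6] discovers stems. The function name `suffix_pairs` is hypothetical.

```python
def suffix_pairs(stem_pair, translations):
    """Strip a known bilingual stem from each translation pair.

    Returns the set of (EN suffix, PT suffix) pairs that remain.
    Illustrative only: the stem pair is given, not learnt.
    """
    en_stem, pt_stem = stem_pair
    pairs = set()
    for en_word, pt_word in translations:
        # Keep only pairs in which both sides start with the stem.
        if en_word.startswith(en_stem) and pt_word.startswith(pt_stem):
            pairs.add((en_word[len(en_stem):], pt_word[len(pt_stem):]))
    return pairs


# A few of the translation examples listed in the text.
translations = [
    ("ensuring", "assegurando"),
    ("ensured", "assegurou"),
    ("ensure", "assegurar"),
    ("ensures", "assegura"),
]
result = suffix_pairs(("ensur", "assegur"), translations)
print(sorted(result))
```

For these four examples the sketch yields the suffix pairs (‘e’, ‘ar’), (‘ed’, ‘ou’), (‘es’, ‘a’) and (‘ing’, ‘ando’), all of which appear in the list above.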

Clusters of Bilingual Stems and Suffixes. A set of bilingual suffixes representing the bilingual extensions for a set of bilingual stems together forms a bilingual suffix cluster (see footnote 6). In other words, bilingual stems undergoing the same suffix transformations form a cluster.

Table 1. Clusters of bilingual stems sharing same morphological extensions

| Cluster number | Suffix pairs | Stem pairs |
|---|---|---|
| 17 | (‘’, er), (‘’, erem), (‘’, am), (‘’, em), (s, e), (s, a), (ed, ida), (ed, idas), (ed, ido), (ed, idos), (ed, eram), (ed, eu), (ing, endo), (ing, er) | answer \(\Leftrightarrow \) respond, reply \(\Leftrightarrow \) respond, spend \(\Leftrightarrow \) dispend |
| 32 | (e, ar), (e, arem), (e, am), (e, em), (es, e), (es, a), (ed, ada), (ed, adas), (ed, ado), (ed, ados), (ed, aram), (ed, ou), (ing, ando), (ing, ar) | declar \(\Leftrightarrow \) declar, encourag \(\Leftrightarrow \) estimul, ensur \(\Leftrightarrow \) assegur, argu \(\Leftrightarrow \) afirm |

Table 1 illustrates the bilingual stems and suffixes of the two largest verb clusters (see footnote 7) learnt for the data set size presented in Table 3 (refer to Sect. 5). The bilingual stem ‘declar’ \(\Leftrightarrow \) ‘declar’ shares the same morphological extensions as the bilingual stem ‘ensur’ \(\Leftrightarrow \) ‘assegur’, and hence both belong to the same cluster [6].
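The idea that stems undergoing the same suffix transformations form a cluster can be sketched as a grouping by suffix set. The snippet below is a simplification under stated assumptions: it groups stems only when their suffix sets are exactly identical, whereas the clustering actually used in [6, 19] may be more tolerant. The suffix sets are truncated subsets of those in Table 1, and `cluster_stems` is a hypothetical name.

```python
from collections import defaultdict


def cluster_stems(stem_to_suffixes):
    """Group bilingual stems whose sets of suffix pairs are identical.

    Assumption: exact set equality defines a cluster. This is a toy
    criterion, not the clustering procedure of the cited work.
    """
    clusters = defaultdict(list)
    for stem_pair, suffixes in stem_to_suffixes.items():
        clusters[frozenset(suffixes)].append(stem_pair)
    return dict(clusters)


# Truncated suffix sets taken from clusters 32 and 17 of Table 1.
ar_verbs = {("e", "ar"), ("ed", "ou"), ("ing", "ando")}
er_verbs = {("", "er"), ("ed", "eu"), ("ing", "endo")}

stem_to_suffixes = {
    ("declar", "declar"): ar_verbs,
    ("ensur", "assegur"): ar_verbs,
    ("answer", "respond"): er_verbs,
    ("spend", "dispend"): er_verbs,
}
clusters = cluster_stems(stem_to_suffixes)
for suffixes, stems in clusters.items():
    print(sorted(suffixes), stems)
```

Using a `frozenset` as the dictionary key makes the grouping independent of the order in which suffix pairs were observed.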

In our experiments, bilingual stems are employed in inducing translation templates. Further, clusters are used in generating new translations via generalisation of translation templates.

4 Approach

The current section presents the approach for automatic induction of translation templates using the automatically learnt bilingual morph-units [6] consisting of stem pairs. Using the clusters of bilingual stems and suffixes learnt [6], new surface translation forms are automatically suggested as discussed in Sect. 4.4.

4.1 Definitions

Let L be a Bilingual Lexicon consisting of unique word pairs.

Let P be a validated bilingual lexicon of unigram to n-gram, n-gram to unigram translations.

Let L1, L2 be languages with alphabet set \(\varSigma _1, \varSigma _2\).

Let \((p_{i},p_{j})\) be any bilingual pair (translation) in P, with \(1 \le i \le m\) and \(1 \le j \le n\), where m and n are the numbers of unique phrases in languages L1 and L2, respectively.

Let \((s_{a},s_{b})\) represent a bilingual stem in the set of bilingual stems, S, induced by the bilingual learning approach, where \(a \le m\) and \(b \le n\).

If \(S_{L1}\) and \(S_{L2}\) represent the sets of stems in languages L1 and L2, then \(s_{a} \in S_{L1}\) and \(s_{b} \in S_{L2}\).

\(\$_{a}\) and \(\$T_{a\#b}\) respectively represent wildcard symbols for a stem in the first language and its translation in the second language, where a is the identifier of the stem in the first language and \(a\#b\) is the identifier of its translation in the second language. It should be noted that a stem in the first language may have multiple translations in the second language. Thus, \(\$T_{a\#b}\) and \(\$T_{a\#c}\) represent different translations (with identifiers \(a\#b\) and \(a\#c\)) for the same stem, \(\$_{a}\), in the first language.

4.2 Inputs

Bilingual/Translation Lexicon (P). The Translation lexicon used for template induction consists of unigrams (taken as a single word - any contiguous sequence of characters) in the first language cross-listed with their corresponding translations consisting of n-grams (contiguous sequence of n words, 2 \(\le \) n \(\le \) 4) in second language or vice-versa, such that they share the same meaning or are usable in equivalent contexts. Examples illustrating bilingual variants are shown in Table 2.

Table 2. Translation examples

| Translation forms | EN | PT |
|---|---|---|
| Verb | Involving | que envolva |
| | Involving | que envolvam |
| | Involving | que envolvem |
| Noun | Forwarding agent | expedidor |
| | Watermark | marca de água |
| Adjective | Lower | mais pequena |
| | Quickest | mais rápidos |
| Adverb | Indirectly | de modo inderecto |
| | Comprehensively | de forma aprofundada |
| | Scientifically | a nível de a ciência |

List of Bilingual Stems. These are orthographically and semantically similar bilingual segments shared by similar surface translation forms and are induced by applying the bilingual learning mechanism [6] on the translation lexicon L containing only word-to-word translations. Column 3 of Table 1 lists various bilingual stems with their respective morphological extensions in column 2.

4.3 Automatic Induction of Translation Templates

The steps involved in translation template induction are outlined in Algorithm 1. The approach employs a lexicon of translations P (consisting of unigram to n-gram and n-gram to unigram translations) and a dictionary of bilingual stems, S. We begin by building separate keyword trees (tries) of all stems in \(S_{L1}\) (say, \(T_{L1}\)) and \(S_{L2}\) (say, \(T_{L2}\)). We extend each keyword tree into an automaton to allow O(k) lookup time, where k is the size of the key. The Aho-Corasick set matching algorithm [18] is then applied to look for all occurrences of matching bilingual stems in each of the translations under consideration. Specifically, for each bilingual pair (\(p_{i}\), \(p_{j}\)) in P, this involves traversing the phrase \(p_{i}\) over the automaton \(T_{L1}\) and similarly traversing \(p_{j}\) over \(T_{L2}\) to find all matching stems. If the matching stems happen to be translations of each other (i.e., they form a bilingual stem existing in S), we generalise the stem in the first language with the wildcard symbol \(\$_{a}\) and the stem in the second language with \(\$T_{a\#b}\), where a and \({a\#b}\) represent the identifiers of the matched stems in L1 and L2, respectively.
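The matching and generalisation step described above can be sketched as follows. This is a minimal illustration and not the authors' Algorithm 1: for brevity it replaces the Aho-Corasick automaton with a naive substring scan over the stem list, which gives the same matches on small inputs but not the same lookup cost. The function name `induce_templates`, the tuple layout of the stem dictionary, and the plain-text wildcard spelling (`$_2511`, `$T_2511#8`) are assumptions made for the example; the stems and phrase pairs are taken from the paper.

```python
def induce_templates(pairs, stems):
    """Induce translation templates from phrase pairs.

    pairs: list of (EN phrase, PT phrase).
    stems: list of (EN id, EN stem, PT id, PT stem) bilingual stems.
    A naive substring scan stands in for the Aho-Corasick automaton.
    """
    templates = []
    for en_phrase, pt_phrase in pairs:
        for en_id, en_stem, pt_id, pt_stem in stems:
            # Both halves of the bilingual stem must occur in the pair.
            if en_stem in en_phrase and pt_stem in pt_phrase:
                en_template = en_phrase.replace(en_stem, f"$_{en_id}", 1)
                pt_template = pt_phrase.replace(pt_stem, f"$T_{pt_id}", 1)
                templates.append((en_template, pt_template))
    return templates


stems = [
    ("2511", "declar", "2511#8", "declar"),
    ("18618", "involv", "18618#6", "envolv"),
]
pairs = [
    ("we declare", "declaramos"),
    ("involving", "que envolva"),
]
templates = induce_templates(pairs, stems)
for en_template, pt_template in templates:
    print(en_template, "<->", pt_template)
```

A production implementation would keep the automaton-based matching, since scanning every stem against every phrase grows with the size of the stem dictionary (24,223 stems in the experiments reported in Sect. 5).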

4.4 Automatic Generation of New Translations

Upon induction of preliminary translation templates as specified in Algorithm 1, new translations are automatically suggested by employing clusters of bilingual stems and suffixes. Generation of new translations involves the following steps:
  1. Identify the bilingual stem employed in template induction.

  2. Identify the cluster to which the bilingual stem employed in a particular template induction belongs.

  3. Identify all other bilingual stems that belong to the identified cluster.

  4. For each bilingual stem in the cluster (different from that used in template induction), replace the wildcards representing the bilingual stem used in template induction (\(\$_{a}\) and \(\$T_{a\#b}\)) with that bilingual stem.

     

4.5 Illustration

As an example, consider a translation with two words in the first language and one word in the second language (as in the bilingual pair ‘we declare’ \(\leftrightarrow \) ‘declaramos’). To extract the translation pattern, set matching is performed using the previously learnt bilingual stems, represented as a trie. For the example considered, this enables the induction of the translation template ‘we \(\$_{2511}\)e’ \(\leftrightarrow \) ‘\(\$T_{2511\#8}\)amos’ (see footnote 8), as the lexicon of bilingual stems contains the bilingual pair ‘declar’ \(\leftrightarrow \) ‘declar’. By identifying all the stem pairs that associate with this particular template (refer to Table 1), using the bilingual suffix clusters [19], new translations are suggested. The bilingual stem ‘declar’ \(\leftrightarrow \) ‘declar’ belongs to cluster 32. Thus, a possible translation suggestion in this case would be ‘we argue’ \(\leftrightarrow \) ‘afirmamos’, obtained by replacing ‘declar’ on the left-hand side with ‘argu’ and ‘declar’ on the right-hand side with ‘afirm’ (the translation of ‘argu’ is ‘afirm’) in the bilingual pair ‘we declare’ \(\leftrightarrow \) ‘declaramos’. Likewise, other suggestions proposed are ‘we encourage’ \(\leftrightarrow \) ‘estimulamos’, ‘we toggle’ \(\leftrightarrow \) ‘comutamos’ and so forth, all of which are instances of correct translations missing in the existing lexicon.
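The substitution described in this illustration can be sketched in a few lines of Python. The sketch assumes the template and the cluster membership are already available; the function name `suggest_translations` and the plain-text wildcard spelling are hypothetical, and the cluster contents are the four stem pairs listed for cluster 32 in Table 1.

```python
def suggest_translations(template, slots, inducing_stem, cluster):
    """Fill a template with every other stem pair of the same cluster.

    template: (EN template, PT template) containing the wildcards.
    slots: (EN wildcard, PT wildcard) as they appear in the template.
    inducing_stem: the stem pair the template was induced from.
    cluster: list of (EN stem, PT stem) pairs sharing suffix pairs.
    """
    en_template, pt_template = template
    en_slot, pt_slot = slots
    suggestions = []
    for en_stem, pt_stem in cluster:
        if (en_stem, pt_stem) == inducing_stem:
            continue  # this pair is already in the lexicon
        suggestions.append((
            en_template.replace(en_slot, en_stem),
            pt_template.replace(pt_slot, pt_stem),
        ))
    return suggestions


cluster_32 = [
    ("declar", "declar"),
    ("encourag", "estimul"),
    ("ensur", "assegur"),
    ("argu", "afirm"),
]
suggestions = suggest_translations(
    template=("we $_2511e", "$T_2511#8amos"),
    slots=("$_2511", "$T_2511#8"),
    inducing_stem=("declar", "declar"),
    cluster=cluster_32,
)
for en_phrase, pt_phrase in suggestions:
    print(en_phrase, "<->", pt_phrase)
```

The output of such a step is a list of candidates only; as stated in the conclusion, human validation of generated translations is what prevents incorrect pairs from entering the lexicon.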

5 Experimental Setup and Evaluation

5.1 Data Sets

The translations used for template induction were acquired using various extraction techniques [10, 11, 12, 13, 15] applied to the (sub-)sentence aligned parallel corpora introduced in Sect. 3.

Table 3. Statistics of EN-PT datasets used in bilingual learning and template induction

| Description | Bilingual pairs | Bilingual stems | Bilingual suffixes |
|---|---|---|---|
| Bilingual Learning | 209,739 | 24,223 | 232 |
| Template Induction | 1,476 | 24,223 | - |

The dataset used for bilingual learning (column 2) and the associated statistics of unique bilingual segments identified using the bilingual learning approach (columns 3 and 4) [6] are shown in the first row of Table 3. The last row shows the statistics of bilingual pairs used as input in inducing translation templates. A subset of the bilingual stems used in translation template induction is shown in Table 4.

Table 4. Selected list of indexed bilingual stems employed in newly induced translation templates shown in Table 6

| ID_EN - EN | ID_PT - PT | ID_EN - EN | ID_PT - PT |
|---|---|---|---|
| 18618 involv | 18618#5 interess | 1 provid | 1#21 facult |
| 18618 involv | 18618#6 envolv | 681 provid | 681#18 conced |
| 5621 precipit | 5621#2 precipit | 17882 meteor | 17882#2 meteor |
| 18758 analys | 18758#4 analis | 701 mass | 701#4 mass |
| 18758 analys | 18758#6 examin | 718 affect | 718#22 afect |
| 1996 plat | 1996#2 prat | 1 provid | 1#33 fornec |
| 435 cycl | 435#6 cicl | 1393 organ | 1393#9 organ |
| 1605 estimat | 1605#6 estimat | 3416 regular | 3416#2 regular |
| 18897 establish | 18897#18 estabelec | 800 past | 800#5 passad |
| 18897 establish | 18897#19 afix | 1078 introduc | 1078#3 introduz |
| 16585 digit | 16585#6 digit | 16585 digit | 16585#1 númer |

5.2 Results and Discussion

The statistics of translation templates learnt from EN-PT bilingual lexicon using the dataset described in Sect. 5.1 are presented in Table 5.

Table 5. Statistics of newly induced translation templates

| Description | Statistics |
|---|---|
| Total templates induced | 958 |
| Templates occurring once | 587 |
| Templates occurring more than once | 82 |
| Unigram to bigram templates induced | 580 |

Table 6 presents a few randomly chosen templates that were automatically induced from unigram to n-gram and n-gram to unigram translations.

Table 6. Unigram to bigram and bigram to unigram translation templates

| Description | EN | PT |
|---|---|---|
| Verb forms | \(\$_{18618}\)ing | que \(\$T_{18618\#6}\)a |
| | \(\$_{18618}\)ing | que \(\$T_{18618\#6}\)em |
| | \(\$_{18618}\)ing | que \(\$T_{18618\#6}\)am |
| | \(\$_{5621}\)ated | o \(\$T_{5621\#2}\)ado |
| | was \(\$_{1}\)ing | \(\$T_{1\#21}\)ava |
| | to \(\$_{1078}\)e | \(\$T_{1078\#3}\)ir |
| Noun forms | \(\$_{18758}\)er | o \(\$T_{18758\#4}\)ador |
| | \(\$_{1996}\)es | as \(\$T_{1996\#9}\)as |
| | \(\$_{435}\)ists | os \(\$T_{435\#6}\)istas |
| | \(\$_{1393}\)ism | o \(\$T_{1393\#9}\)ismo |
| | \(\$_{1605}\)es | uma \(\$T_{1605\#6}\)iva |
| | ir\(\$_{3416}\)ity | a ir\(\$T_{3416\#2}\)idade |
| | \(\$_{17882}\)ology | a \(\$T_{17882\#2}\)ologia |
| | \(\$_{18897}\)ments | os \(\$T_{18897\#18}\)imentos |
| Adjective forms | \(\$_{16585}\)al | a \(\$T_{16585\#6}\)al |

Manual evaluation of a subset of the induced templates showed that a few of them were too specific and less productive. The translation templates presented in Table 7, for instance, are unproductive as they do not contribute any new translation forms.

Table 7. Less productive translation templates

| EN | PT | Bilingual stem (EN\(\leftrightarrow \)PT) |
|---|---|---|
| \(\$_{9800}\)ol | a \(\$T_{9800\#4}\)ol | 9800 europ \(\leftrightarrow \) 9800#4 europ |
| some\(\$_{9805}\)s | por \(\$T_{9805\#6}\)s | 9805 time \(\leftrightarrow \) 9805#6 veze |
| de\(\$_{659}\)ees | os de\(\$T_{659\#10}\)ados | 659 sign \(\leftrightarrow \) 659#10 sign |
| to take ac\(\$_{2347}\) of | a fim de ter em \(\$T_{2347\#9}\) | 2347 count \(\leftrightarrow \) 2347#9 conta |

Generalising each of the induced templates by replacing the initial representations (indicating specific stem pairs such as \(\$_{9800}\) \(\leftrightarrow \) \(\$T_{9800\#4}\)) with \(\$\) \(\leftrightarrow \) \(\$T\) and counting the occurrence frequency of the resulting templates, we observed that the templates shown in Table 7 appeared only once. Thus, by generalising and filtering the induced translation templates based on their occurrence frequency, we were able to discard templates that are unproductive. In addition, to avoid over-generation, templates sharing the same contexts were grouped together, yielding generalised templates. In other words, after the stems were generalised to wildcard symbols of the form \(\$_{a}\) \(\leftrightarrow \) \(\$T_{a\#b}\) as explained in Sect. 4, the preliminary set of induced templates was clustered by finding stems that shared common contexts, where the context comprises the suffix and other surrounding words. These clustered templates are used to suggest new translation forms that remain missing from the lexicon.
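The generalise-and-count filter can be sketched with regular expressions and a counter. The snippet is an illustration under stated assumptions: wildcards are written in plain text as `$_<id>` and `$T_<id>#<n>`, the helper name `generalise` is hypothetical, and the threshold (keep templates seen more than once) mirrors the observation above rather than a documented parameter. The three input templates are taken from Table 6, Table 7 and the discussion of the stem ‘affect’.

```python
import re
from collections import Counter

# Plain-text spellings of the wildcards (an assumption of this sketch).
EN_SLOT = re.compile(r"\$_\d+")
PT_SLOT = re.compile(r"\$T_\d+#\d+")


def generalise(template):
    """Drop the stem identifiers, leaving bare `$` and `$T` wildcards."""
    en_template, pt_template = template
    return EN_SLOT.sub("$", en_template), PT_SLOT.sub("$T", pt_template)


induced = [
    ("$_18618ing", "que $T_18618#6a"),
    ("$_14815ing", "que $T_14815#22a"),
    ("$_9800ol", "a $T_9800#4ol"),
]

# Count how often each generalised template occurs.
counts = Counter(generalise(t) for t in induced)

# Keep only templates supported by more than one stem pair.
productive = [t for t, n in counts.items() if n > 1]
print(productive)
```

On this toy input the ‘-ing’ template is supported by two different stem pairs and is kept, while the template induced from ‘europ’ occurs once and is discarded.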

Templates such as \(\$_{13830}\)s \(\leftrightarrow \) os \(\$T_{13830\#3}\)s, \(\$_{13830}\)s \(\leftrightarrow \) os \(\$T_{13830\#2}\)tos and \(\$_{13830}\)s \(\leftrightarrow \) os \(\$T_{13830\#1}\)s (see footnote 9) lead to the generation of entries longer than necessary, containing articles (see footnote 10) that may or may not occur in English.

In our earlier work, we learnt bilingual morphology from word-to-word translations [6]; with the newly induced bigram to unigram templates, we can now infer other pairs of suffixes that were not learnt in our earlier experiments. For instance, ‘shall consider \(\leftrightarrow \) considerará’ includes the suffix ‘ará’ on the Portuguese side, which was not learnt previously. As it co-occurs with stems such as 815 consider \(\leftrightarrow \) 815#5 analis, 815 consider \(\leftrightarrow \) 815#1 consider, 815 consider \(\leftrightarrow \) 815#4 ponder and so forth on the Portuguese side, the suffix belongs, for bigram to unigram translations, to the same class of suffixes as those of the Portuguese verbs in the cluster characterised by the suffix pairs (ed, ada), (ed, adas), (ed, ado), (ing, ando), (ing, ar), etc. Here, we have a gapped pattern ‘shall \(\$_{a}\) \(\leftrightarrow \) \(\$T_{a\#b}\)arão’.

Further, knowing 14815 affect \(\leftrightarrow \) 14815#22 afect, 14815 affect \(\leftrightarrow \) 14815#18 influenci, 14815 affect \(\leftrightarrow \) 14815#13 prejudic, 14815 affect \(\leftrightarrow \) 14815#6 consider, 14815 affect \(\leftrightarrow \) 14815#5 interess, 14815 affect \(\leftrightarrow \) 14815#4 afet, 14815 affect \(\leftrightarrow \) 14815#3 implic and the templates learnt employing these stem pairs \(\$_{14815}\)ing \(\leftrightarrow \) que \(\$T_{14815\#22}\)em, \(\$_{14815}\)ing \(\leftrightarrow \) que \(\$T_{14815\#22}\)a, \(\$_{14815}\)ing \(\leftrightarrow \) que \(\$T_{14815\#22}\)am, different future forms ‘shall affect \(\leftrightarrow \) afectará’ or ‘shall affect \(\leftrightarrow \) afectarão’, can be generated as we also know that the suffixes ‘ará’ or ‘arão’ for those patterns apply to verbs of first conjugation ending in ‘a’.

Unlike the hand-written templates proposed by Gomes [15], which are highly precise and productive in extracting translation equivalents, the templates induced using the approach proposed in this paper are particularly suitable for automatic translation generation. The hand-written templates are intended for aligning and extracting translation equivalents from parallel corpora [15]; they lack information about the suffixes and hence are not adequate for translation generation, which is addressed in this study.

6 Conclusion

We have presented a method for automatic induction of translation templates from a lexicon of unigram to n-gram, n-gram to unigram translations using bilingual stems, suffixes and their clusters. By generalising the induced templates using clusters of bilingual stems and suffixes, new translations can be automatically suggested. The contributions of the study can be summarised as follows:
  1. Automatic induction of translation templates from a bilingual corpus of translations by employing bilingual morph-units such as bilingual stems.

  2. Continual accommodation of the newly acquired knowledge in enhancing the learning process. Human validation of newly generated translations (or templates) prevents learning from incorrectly generated or extracted translation pairs (or templates).

     

As future work, we intend to focus exclusively on generation of lexical entries considering the templates induced using the approach proposed in this paper and taking into account those stem pairs that belong to a cluster [6]. Further, the inflection-based method could be generalised so as to make it applicable to any morphological phenomenon representing grammatical information, rather than just verb forms. Learning bilingual prefixes using the previously proposed algorithm [6] could be explored in future.

Footnotes

  1. Translations not present in the existing lexicon.

  2.

  3. Not in the bilingual lexicon that was used for aligning the parallel texts.

  4. Word-to-word translations taken from the lexicon discussed in Sect. 3.1.

  5. Translations that are lexically similar.

  6. A suffix cluster may or may not correspond to Part-of-Speech such as noun or adjective but there are cases where the same suffix cluster aggregates nouns, adjectives and adverbs.

  7. Verb - (‘’,‘ar’) and (‘e’,‘ar’).

  8. \(\$_{2511}\) represents the stem ‘declar’ in English and \(\$T_{2511\#8}\) represents its translation in Portuguese, which is ‘declar’ as well.

  9. 13830 contract \(\leftrightarrow \) 13830#2 contra, 13830 contract \(\leftrightarrow \) 13830#1 contrat and 13831 buyout \(\leftrightarrow \) 13831#3 compra.

  10. masculine plural.

Notes

Acknowledgements

K. M. Kavitha and Luís Gomes acknowledge the Research Fellowship by FCT/MCTES with Ref. nos., SFRH/BD/64371/2009 and SFRH/BD/65059/2009, respectively, and the funded research project ISTRION (Ref. PTDC/EIA-EIA/114521/2009) that provided other means for the research carried out. The authors thank NOVA LINCS, FCT/UNL for the support and SJEC for the partial financial assistance provided.

References

  1. Yang, M., Kirchhoff, K.: Phrase-based backoff models for machine translation of highly inflected languages. In: Proceedings of EACL, pp. 41–48 (2006)
  2. de Gispert, A., Mariño, J.B., Crego, J.M.: Improving statistical machine translation by classifying and generalizing inflected verb forms. In: Proceedings of 9th European Conference on Speech Communication and Technology, Lisboa, Portugal, pp. 3193–3196 (2005)
  3. Poon, H., Cherry, C., Toutanova, K.: Unsupervised morphological segmentation with log-linear models. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 209–217. ACL (2009)
  4. Momouchi, H.S.K.A.Y., Tochinai, K.: Prediction method of word for translation of unknown word. In: Proceedings of the IASTED International Conference, Artificial Intelligence and Soft Computing, 27 July–1 August 1997, Banff, Canada, p. 228. Acta Pr. (1997)
  5. Snyder, B., Barzilay, R.: Unsupervised multilingual learning for morphological segmentation. In: Proceedings of ACL 2008: HLT, pp. 737–745. ACL (2008)
  6. Karimbi Mahesh, K., Gomes, L., Lopes, J.G.P.: Identification of bilingual segments for translation generation. In: Blockeel, H., van Leeuwen, M., Vinciotti, V. (eds.) IDA 2014. LNCS, vol. 8819, pp. 167–178. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-12571-8_15
  7. Cicekli, I., Güvenir, H.A.: Learning translation templates from bilingual translation examples. In: Carl, M., Way, A. (eds.) Recent Advances in Example-Based Machine Translation. TLTB, vol. 21, pp. 255–286. Springer, Dordrecht (2003). https://doi.org/10.1007/978-94-010-0181-6_9
  8. Rile, H., Zong, C., Bo, X.: An approach to automatic acquisition of translation templates based on phrase structure extraction and alignment. IEEE Trans. Audio Speech Lang. Process. 14(5), 1656–1663 (2006)
  9. Gangadharaiah, R., Brown, R.D., Carbonell, J.: Phrasal equivalence classes for generalized corpus-based machine translation. In: Gelbukh, A. (ed.) CICLing 2011. LNCS, vol. 6609, pp. 13–28. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-19437-5_2
  10. Brown, P.F., Pietra, V.J.D., Pietra, S.A.D., Mercer, R.L.: The mathematics of statistical machine translation: parameter estimation. Comput. Linguist. 19(2), 263–311 (1993)
  11. Lardilleux, A., Lepage, Y.: Sampling-based multilingual alignment. In: Proceedings of Recent Advances in Natural Language Processing, pp. 214–218 (2009)
  12. Aires, J., Lopes, G.P., Gomes, L.: Phrase translation extraction from aligned parallel corpora using suffix arrays and related structures. In: Lopes, L.S., Lau, N., Mariano, P., Rocha, L.M. (eds.) EPIA 2009. LNCS (LNAI), vol. 5816, pp. 587–597. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04686-5_48
  13. Gomes, L., Pereira Lopes, J.G.: Measuring spelling similarity for cognate identification. In: Antunes, L., Pinto, H.S. (eds.) EPIA 2011. LNCS (LNAI), vol. 7026, pp. 624–633. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24769-9_45
  14. Gomes, L., Lopes, G.P.: Parallel texts alignment. In: New Trends in Artificial Intelligence, 14th Portuguese Conference in Artificial Intelligence, EPIA 2009, Aveiro, pp. 513–524, October 2009
  15. Gomes, L.: Translation alignment and extraction within a lexica-centered iterative workflow. Ph.D. thesis, Lisboa, Portugal, December 2017
  16. Kavitha, K.M., Gomes, L., Aires, J., Lopes, J.G.P.: Classification and selection of translation candidates for parallel corpora alignment. In: Pereira, F., Machado, P., Costa, E., Cardoso, A. (eds.) EPIA 2015. LNCS (LNAI), vol. 9273, pp. 723–734. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23485-4_73
  17. Costa, J., Gomes, L., Lopes, G.P., Russo, L.M.S.: Improving bilingual search performance using compact full-text indices. In: Gelbukh, A. (ed.) CICLing 2015. LNCS, vol. 9041, pp. 582–595. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18111-0_44
  18. Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, pp. 52–61. Cambridge University Press, Cambridge (1997)
  19. Kavitha, K.M., Gomes, L., Lopes, J.G.P.: Learning clusters of bilingual suffixes using bilingual translation lexicon. In: Prasath, R., Vuppala, A.K., Kathirvalavakumar, T. (eds.) MIKE 2015. LNCS (LNAI), vol. 9468, pp. 607–615. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-26832-3_57

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Kavitha Karimbi Mahesh (1, 2)
  • Luís Gomes (1)
  • José Gabriel Pereira Lopes (1)
  1. NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Lisbon, Portugal
  2. Department of Computer Science and Engineering, St Joseph Engineering College, Vamanjoor, Mangaluru, India
