Identification of Bilingual Segments for Translation Generation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8819)


We present an approach that uses known translation forms in a validated bilingual lexicon to identify bilingual stem and suffix segments. By applying the longest sequence common to a pair of orthographically similar translations, we initially induce the bilingual suffix transformations (replacement rules). Redundant analyses are discarded by examining the distribution of stem pairs and their associated transformations. Sets of bilingual suffixes conflating various translation forms are grouped. Stem pairs sharing similar transformations are subsequently clustered, which serves as the basis for the generative approach. The primary motivation behind this work is to eventually improve lexicon coverage by using the correct bilingual entries to suggest translations for out-of-vocabulary (OOV) words. In preliminary generation experiments, 90% of the generated translations are correct. This was achieved when both bilingual segments (bilingual stem and bilingual suffix) of the bilingual pair being analysed are known to have occurred in the training data set.
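The segmentation and generation steps summarised above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the "longest common sequence" can be approximated by the longest common prefix, and the function names (`segment`, `generate`), the `min_stem` threshold, and the English-Portuguese example entries are all hypothetical choices made for illustration.

```python
from os.path import commonprefix


def segment(pair_a, pair_b, min_stem=3):
    """Split two similar translation pairs into a bilingual stem and suffixes.

    Assumption: the shared segment is the longest common *prefix* on each
    language side. Returns None when either stem is shorter than min_stem.
    """
    (src1, tgt1), (src2, tgt2) = pair_a, pair_b
    src_stem = commonprefix([src1, src2])
    tgt_stem = commonprefix([tgt1, tgt2])
    if len(src_stem) < min_stem or len(tgt_stem) < min_stem:
        return None
    # Each bilingual suffix is the pair of leftovers after removing the stems.
    suffixes = {
        (src1[len(src_stem):], tgt1[len(tgt_stem):]),
        (src2[len(src_stem):], tgt2[len(tgt_stem):]),
    }
    return (src_stem, tgt_stem), suffixes


def generate(stem_pair, suffix_pair):
    """Concatenate a known bilingual stem with a known bilingual suffix."""
    return stem_pair[0] + suffix_pair[0], stem_pair[1] + suffix_pair[1]


# Hypothetical lexicon entries used only to exercise the sketch.
stem, suffixes = segment(("ensuring", "assegurando"), ("ensured", "assegurado"))
other_stem, _ = segment(("declaring", "declarando"), ("declared", "declarado"))
candidate = generate(other_stem, ("ing", "ndo"))
```

Here the suffix pair induced from one stem pair is reused with another stem pair, which mirrors the setting in the abstract where both bilingual segments have already been seen in the training data. A fuller treatment would also need the filtering and clustering steps, which this sketch omits.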


Translation lexicon coverage; Cluster analysis; Bilingual morphology; Translation generation





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. CITI (NOVA LINCS), Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Quinta da Torre, Caparica, Portugal
  2. ISTRION BOX-Translation & Revision, Lda., Parkurbis, Covilhã, Portugal
  3. Department of Computer Applications, St. Joseph Engineering College, Vamanjoor, Mangalore, India
