Incorporating Linguistic Information to Statistical Word-Level Alignment

  • Eduardo Cendejas
  • Grettel Barceló
  • Alexander Gelbukh
  • Grigori Sidorov
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5856)


Alignment algorithms enrich parallel texts by establishing a relationship between the structures of the languages involved. Depending on the alignment level, the enrichment links paragraphs, sentences, or words of the source-language content with their counterparts in the translation. There are two main approaches to word-level alignment: statistical and linguistic. Because languages differ in their grammar rules, statistical algorithms usually give lower precision. For this reason, such algorithms are generally developed for a specific language pair using linguistic techniques. This paper presents a hybrid alignment system that combines the two traditional approaches. It provides user-friendly configuration and is adaptable to the computational environment. The system uses linguistic resources and procedures such as identification of cognates, morphological information, syntactic trees, dictionaries, and semantic domains. We show that the system outperforms existing algorithms.
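As an illustration of one of the linguistic procedures listed above, the following minimal sketch flags cognate candidates between a source and a target sentence. The abstract does not say how the system identifies cognates, so the measure used here (the longest common subsequence ratio, a common choice in the alignment literature), the function names, and the 0.7 threshold are all assumptions made for the example and are not taken from the paper.

```python
def lcs_length(a: str, b: str) -> int:
    """Return the length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio: LCS length over the longer word length."""
    if not a or not b:
        return 0.0
    return lcs_length(a.lower(), b.lower()) / max(len(a), len(b))


def cognate_candidates(source_tokens, target_tokens, threshold=0.7):
    """Return (source_index, target_index, score) for word pairs whose
    LCSR reaches the threshold. The threshold is a hypothetical value."""
    pairs = []
    for i, s in enumerate(source_tokens):
        for j, t in enumerate(target_tokens):
            score = lcsr(s, t)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    english = ["the", "information"]
    spanish = ["la", "información"]
    for i, j, score in cognate_candidates(english, spanish):
        print(english[i], spanish[j], round(score, 3))
```

In a hybrid setting, candidate pairs of this kind could serve as anchor points or as additional evidence alongside statistical alignment scores; how the described system actually combines them is not specified in the abstract.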


Keywords: parallel texts, word alignment, linguistic information, dictionary, cognates, semantic domains, morphological information



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Eduardo Cendejas
    • 1
  • Grettel Barceló
    • 1
  • Alexander Gelbukh
    • 1
  • Grigori Sidorov
    • 1
  1. Center for Computing Research, National Polytechnic Institute, Mexico City, Mexico
