Improving Statistical Word Alignments with Morpho-syntactic Transformations

  • Adrià de Gispert
  • Deepa Gupta
  • Maja Popović
  • Patrik Lambert
  • Jose B. Mariño
  • Marcello Federico
  • Hermann Ney
  • Rafael Banchs
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4139)

Abstract

This paper presents a wide range of statistical word alignment experiments that incorporate morphosyntactic information. By transforming the parallel corpus according to POS tagging, lemmatization, or stemming, we explore which linguistic information helps reduce alignment error rates. To this end, alignments are evaluated against a human word alignment reference, with the aim of obtaining an improved machine translation training scheme that eventually leads to better SMT performance. Experiments are carried out on a Spanish–English European Parliament Proceedings parallel corpus, in both a large and a small data track. As expected, the improvements from introducing morphosyntactic information are larger when data is scarce, but a significant improvement is also achieved in the large data track, meaning that certain linguistic knowledge is relevant even when large amounts of data are available.
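
As an illustration of the evaluation described above, the following minimal sketch computes an alignment error rate against a human reference. It assumes the widely used definition based on sure (S) and possible (P) reference links, AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|); the abstract does not state which variant the paper uses. The function name and the toy links are hypothetical.

```python
def alignment_error_rate(hypothesis, sure, possible):
    """Alignment error rate of hypothesis links against a human reference.

    Each link is a (source_index, target_index) pair.
    Assumption: the reference distinguishes sure and possible links,
    and every sure link also counts as possible.
    """
    a = set(hypothesis)
    s = set(sure)
    p = set(possible) | s  # sure links are a subset of possible links
    if not a and not s:
        return 0.0  # nothing to align and nothing proposed
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Toy example (hypothetical links, not data from the paper)
sure_links = {(0, 0), (1, 1)}
possible_links = {(2, 1)}
system_links = {(0, 0), (2, 1), (2, 2)}

aer = alignment_error_rate(system_links, sure_links, possible_links)
print(f"AER = {aer:.3f}")
```

Under this definition, a lower value means the automatic alignment is closer to the human reference. A system link that matches only a possible link is not penalized as a precision error, but it does not count toward recall of the sure links.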

Keywords

Statistical Machine Translation; Computational Linguistics; Alignment Quality; Parallel Corpus; Word Alignment
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Adrià de Gispert (1)
  • Deepa Gupta (2)
  • Maja Popović (3)
  • Patrik Lambert (1)
  • Jose B. Mariño (1)
  • Marcello Federico (2)
  • Hermann Ney (3)
  • Rafael Banchs (1)

  1. TALP Research Center, Universitat Politècnica de Catalunya, Barcelona, Spain
  2. ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Trento, Italy
  3. Lehrstuhl für Informatik 6, RWTH Aachen University, Aachen, Germany
