Paragraph-Level Alignment of an English-Spanish Parallel Corpus of Fiction Texts Using Bilingual Dictionaries

  • Alexander Gelbukh
  • Grigori Sidorov
  • José Ángel Vera-Félix
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4188)

Abstract

Aligned parallel corpora are important linguistic resources for many text processing tasks, such as machine translation, word sense disambiguation, and dictionary compilation. Nevertheless, few resources of this type are available, especially for fiction texts, because the texts are difficult to collect and manual alignment is costly. In this paper, we describe an automatically aligned English-Spanish parallel corpus of fiction texts and evaluate our alignment method, which uses linguistic data (namely, existing bilingual dictionaries) to calculate word similarity. The method is based on a simple idea: if a meaningful word is present in the source text, then one of its dictionary translations should be present in the target text. Experimental results of alignment at the paragraph level are described.
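
The core idea of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy dictionary, the stop list used as a stand-in for the "meaningful word" filter, the scoring formula (the fraction of meaningful source words with a translation in the target), and all function names are assumptions made only for illustration. The sketch also skips lemmatization, which a real system for inflected languages would likely need.

```python
import re
from typing import Dict, List, Set

# Hypothetical toy bilingual dictionary: English word -> Spanish translations.
TOY_DICTIONARY: Dict[str, Set[str]] = {
    "house": {"casa", "hogar"},
    "dog": {"perro"},
    "night": {"noche"},
}

# Hypothetical stop list approximating "non-meaningful" words.
STOP_WORDS: Set[str] = {"the", "a", "in", "at", "of", "and"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and extract runs of letters (no lemmatization)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def similarity(source: str, target: str,
               dictionary: Dict[str, Set[str]]) -> float:
    """Fraction of meaningful source words with a translation in the target."""
    target_words = set(tokenize(target))
    content = [w for w in tokenize(source) if w not in STOP_WORDS]
    if not content:
        return 0.0
    matched = sum(1 for w in content
                  if dictionary.get(w, set()) & target_words)
    return matched / len(content)


def best_match(source: str, candidates: List[str],
               dictionary: Dict[str, Set[str]]) -> int:
    """Index of the candidate target paragraph with the highest similarity."""
    scores = [similarity(source, c, dictionary) for c in candidates]
    return scores.index(max(scores))
```

For example, `best_match("The dog in the house", ["La noche", "El perro en la casa"], TOY_DICTIONARY)` selects the second candidate, because both meaningful source words have a dictionary translation there. A full aligner would additionally have to handle paragraphs that are merged, split, or omitted in translation, which this sketch does not attempt.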

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Alexander Gelbukh (1)
  • Grigori Sidorov (1)
  • José Ángel Vera-Félix (1)
  1. Natural Language and Text Processing Laboratory, Center for Research in Computer Science, National Polytechnic Institute, Mexico City, Mexico
