Mapping and Aligning Units from Comparable Corpora

  • Ahmet Aker
  • Alexandru Ceaușu
  • Yang Feng
  • Robert Gaizauskas
  • Sabine Hunsicker
  • Radu Ion
  • Elena Irimia
  • Dan Ștefănescu
  • Dan Tufiș
Chapter
Part of the Theory and Applications of Natural Language Processing book series (NLP)

Abstract

Extracting parallel units (e.g. sentences or phrases) from comparable corpora to enrich existing statistical translation models has attracted a lot of research in recent years. Experiments have convincingly shown that parallel sentences extracted from comparable corpora can improve statistical machine translation (SMT). Yet the existing body of research does not take into account the degree of comparability of the corpus being processed, nor the computation time needed to extract translationally similar pairs from a corpus of a given size. We will show that the performance of a parallel unit extractor depends crucially on the degree of comparability: it is more difficult to mine parallel data from a weakly comparable corpus than from a strongly comparable one.

Most of the research in parallel data mining from comparable corpora focusses on parallel sentence mining, but parallel phrase mining (i.e. mining sub-sentential fragments) is equally important, because it can be more robust on weakly comparable corpora, which usually do not contain whole translated sentences. We will present different approaches to parallel sentence and phrase mining from comparable corpora developed in the ACCURAT project, and we will evaluate them both in terms of absolute measures (e.g. precision P, recall R and F1) and with respect to their ability to generate significant improvements in the BLEU scores of a statistical translation system. Comprehensive testing of these algorithms in the context of statistical machine translation will be undertaken in Chap. 6.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Ahmet Aker (1)
  • Alexandru Ceaușu (2)
  • Yang Feng (1)
  • Robert Gaizauskas (1, corresponding author)
  • Sabine Hunsicker (3)
  • Radu Ion (2)
  • Elena Irimia (2)
  • Dan Ștefănescu (2)
  • Dan Tufiș (2)

  1. University of Sheffield, Sheffield, UK
  2. Research Institute for Artificial Intelligence of the Romanian Academy (RACAI), Bucharest, Romania
  3. The German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany
