Abstract
Comparable corpora exhibit various degrees of parallelism. Fung and Cheung [3] describe corpora ranging from noisy parallel, through comparable, to very non-parallel. The last category contains corpora composed of “... disparate, very non-parallel bilingual documents that could either be on the same topic (on-topic) or not”. This is the type of corpus that our work attempts to exploit.
References
Cettolo, M., Federico, M., Bertoldi, N.: Mining parallel fragments from comparable texts. In: Proceedings of the 7th International Workshop on Spoken Language Translation, pp. 227–234 (2010)
Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19(1), 61–74 (1993)
Fung, P., Cheung, P.: Mining very non-parallel corpora: parallel sentence and lexicon extraction via bootstrapping and EM. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pp. 57–63 (2004)
Fung, P., Cheung, P.: Mining very non-parallel corpora: parallel sentence and lexicon extraction via bootstrapping and EM. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 57–63 (2004)
Fung, P., Cheung, P.: Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. In: Proceedings of the 20th International Conference on Computational Linguistics (COLING), pp. 1051–1057 (2004)
Gaussier, E., Renders, J.M., Matveeva, I., Goutte, C., Dejean, H.: A geometric view on bilingual lexicon extraction from comparable corpora. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 527–534 (2004)
Koehn, P.: Statistical significance tests for machine translation evaluation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 388–395 (2004)
Koehn, P., Knight, K.: Learning a translation lexicon from monolingual corpora. In: Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), pp. 9–16 (2002)
Melamed, I.D.: Models of translational equivalence among words. Comput. Linguist. 26(2), 221–249 (2000)
Moore, R.C.: Improving IBM word-alignment model 1. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 519–526 (2004)
Moore, R.C.: On log-likelihood-ratios and the significance of rare events. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 333–340 (2004)
Munteanu, D.S., Marcu, D.: Improving machine translation performance by exploiting non-parallel corpora. Comput. Linguist. 31(4), 477–504 (2005)
Munteanu, D.S., Marcu, D.: Extracting parallel sub-sentential fragments from non-parallel corpora. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pp. 81–88 (2006)
Och, F.J., Ney, H.: The alignment template approach to statistical machine translation. Comput. Linguist. 30(4), 417–450 (2004)
Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
Quirk, C., Udupa, R.U., Menezes, A.: Generative models of noisy translations with applications to fragment extraction. In: Proceedings of MT Summit XI (2007)
Rapp, R.: Identifying word translations in non-parallel texts. In: Proceedings of the Conference of the Association for Computational Linguistics, pp. 320–322 (1995)
Rapp, R.: Automatic identification of word translations from unrelated English and German corpora. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 519–526 (1999)
Resnik, P., Oard, D., Levow, G.: Improved cross-language retrieval using backoff translation. In: Proceedings of the 1st International Conference on Human Language Technology Research (2001)
Resnik, P., Smith, N.A.: The web as a parallel corpus. Comput. Linguist. 29(3), 349–380 (2003)
Uszkoreit, J., Ponte, J., Popat, A., Dubiner, M.: Large scale parallel document mining for machine translation. In: Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pp. 1101–1109 (2010)
Utiyama, M., Isahara, H.: Reliable measures for aligning Japanese-English news articles and sentences. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 72–79 (2003)
Wu, D., Fung, P.: Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora. In: Proceedings of 2nd International Joint Conference on Natural Language Processing (IJCNLP), pp. 257–268 (2005)
Zhao, B., Vogel, S.: Adaptive parallel sentences mining from web bilingual news collection. In: 2002 IEEE International Conference on Data Mining, pp. 745–748 (2002)
Zhao, B., Vogel, S.: Full-text story alignment models for Chinese-English bilingual news corpora. In: Proceedings of the International Conference on Spoken Language Processing (2002)
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Munteanu, D.S., Marcu, D. (2013). Exploiting Comparable Corpora. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_11
DOI: https://doi.org/10.1007/978-3-642-20128-8_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20127-1
Online ISBN: 978-3-642-20128-8
eBook Packages: Computer Science, Computer Science (R0)