Abstract
Since the quality of statistical machine translation (SMT) is heavily dependent upon the size and quality of training data, many approaches have been proposed for automatically mining bilingual text from comparable corpora. However, the existing solutions are restricted to extract either bilingual sentences or sub-sentential fragments. Instead, we present an efficient framework to extract both sentential and sub-sentential units. At sentential level, we consider the parallel sentence identification as a classification problem and extract more representative and effective features. At sub-sentential level, we refer to the idea of phrase table’s acquisition in SMT to extract parallel fragments. A novel word alignment model is specially designed for comparable sentence pairs and parallel fragments can be extracted based on such word alignment. We integrate the two levels’ extraction task into a united framework. Experimental results on SMT show that the baseline SMT system can achieve significant improvement by adding those extra-mined knowledge.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Brown Peter, F., Della Pietra, S.A., Della Pietra, V.J., Mercer, R.L.: The mathematics of machine translation: Parameter estimation. Computational Linguistics 19(2), 263–311 (1993)
Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), 61–74 (1993)
Fung, P., Cheung, P.: Mining very non-parallel corpora: Parallel sentence and lexicon extraction vie bootstrapping and EM. In: EMNLP 2004, pp. 57–63 (2004a)
Koehn, P., Och, F.J., Marcu, D.: Statistical phrase based translation. In: Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL (2003)
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R.-C., Dyer, C., Bojar, O.: Moses: Open source toolkit for Statistical Machine Translation. In: Proceedings of the ACL 2007 Demo and Poster Sessions, pp. 177–180 (2007)
Moore, R.C.: Improving IBM word alignment model 1. In: ACL 2004, pp. 519–526 (2004a)
Moore, R.C.: On log-likelihood-ratios and the significance of rare events. In: EMNLP 2004, pp. 333–340 (2004b)
Munteanu, D.S., Marcu, D.: Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics 31(4), 477–504 (2005)
Munteanu, D.S., Marcu, D.: Extracting parallel sub-sentential fragments from nonparallel corpora. In: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, pp. 81–88 (2006)
Och, F.J., Tillmann, C., Ney, H.: Improved alignment models for statistical machine translation. In: Proceedings of the Joint Conference of Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 20–28 (1999)
Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of ACL, Philadelpha, Pennsylvania, USA, pp. 311–318 (2002)
Quirk, C., Udupa, R.U., Menezes, A.: Generative models of noisy translations with applications to parallel fragment extraction. In: Proceedings of the Machine Translation Summit XI, Copenhagen, Denmark, pp. 377–384 (2007)
Riesa, J., Marcu, D.: Automatic parallel fragment extraction from noisy data. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 538–542. Association for Computational Linguistics (2012)
Stolcke, A.: SRILM - An Extensible Language Modeling Toolkit. In: Proceedings of ICSLP, vol. 2, pp. 901–904 (2002)
Smith, J.R., Quirk, C., Toutanova, K.: Extracting parallel sentences from comparable corpora using document level alignment. In: Proceedings of the Human Language Technologies/North American Association for Computational Linguistics, pp. 403–411 (2010)
Tufiş, D., Ion, R., Ceauşu, A., Ştefănescu, D.: Improved Lexical Alignment by Combining Multiple Reified Alignments. In: Proceedings of EACL 2006, Trento, Italy, pp. 153–160 (2006)
Tillmann, C.: A Beam-Search extraction algorithm for comparable data. In: Proceedings of ACL, pp. 225–228 (2009)
Ture, F., Lin, J.: Why not grab a free lunch? Mining large corpora for parallel sentences to improve translation modeling. In: HLT-NAACL, pp. 626–630 (2012)
Zhao, B., Vogel, S.: Adaptive parallel sentences mining from web bilingual news collection. In: IEEE International Conference on Data Mining, Maebashi City, Japan, pp. 745–748 (2002)
Ştefănescu, D., Ion, R., Hunsicker, S.: Hybrid parallel sentence mining from comparable corpora. In: Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT 2012), Trento, Italy (2012)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xiang, L., Zhou, Y., Zong, C. (2013). An Efficient Framework to Extract Parallel Units from Comparable Data. In: Zhou, G., Li, J., Zhao, D., Feng, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2013. Communications in Computer and Information Science, vol 400. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41644-6_15
Download citation
DOI: https://doi.org/10.1007/978-3-642-41644-6_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41643-9
Online ISBN: 978-3-642-41644-6
eBook Packages: Computer ScienceComputer Science (R0)