An Efficient Framework to Extract Parallel Units from Comparable Data

Xiang, Lu; Zhou, Yu; Zong, Chengqing

doi:10.1007/978-3-642-41644-6_15

Part of the book series: Communications in Computer and Information Science (CCIS, volume 400)

Included in the following conference series:

CCF International Conference on Natural Language Processing and Chinese Computing

Abstract

Since the quality of statistical machine translation (SMT) depends heavily on the size and quality of the training data, many approaches have been proposed for automatically mining bilingual text from comparable corpora. However, existing solutions are restricted to extracting either bilingual sentences or sub-sentential fragments. Instead, we present an efficient framework that extracts both sentential and sub-sentential units. At the sentential level, we treat parallel sentence identification as a classification problem and extract more representative and effective features. At the sub-sentential level, we borrow the idea of phrase table acquisition in SMT to extract parallel fragments. A novel word alignment model is specially designed for comparable sentence pairs, and parallel fragments are extracted based on this word alignment. We integrate the extraction tasks at the two levels into a unified framework. Experimental results on SMT show that the baseline SMT system achieves significant improvement when the additionally mined knowledge is added.
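
The abstract does not spell out the alignment model or the extraction procedure. As a rough illustration of the phrase table idea it refers to, the following Python sketch shows the standard alignment-consistent phrase pair extraction used in phrase-based SMT. It is not the authors' method: the word alignment is assumed to be given as a set of (source index, target index) pairs, and the function name `extract_phrase_pairs` and the `max_len` parameter are hypothetical.

```python
from typing import List, Set, Tuple


def extract_phrase_pairs(
    src: List[str],
    tgt: List[str],
    alignment: Set[Tuple[int, int]],
    max_len: int = 4,
) -> List[Tuple[str, str]]:
    """Return phrase pairs consistent with a given word alignment.

    A pair (source span, target span) is kept when every alignment link
    touching either span stays inside both spans. This is the simplified
    textbook procedure; it does not extend spans over unaligned target words.
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to any word in the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue  # a span with no links yields no phrase pair
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency check: no word in the target span [j1, j2]
            # may be linked to a source word outside [i1, i2].
            violates = any(
                j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment
            )
            if violates:
                continue
            pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


if __name__ == "__main__":
    source = ["das", "Haus", "ist", "klein"]
    target = ["the", "house", "is", "small"]
    links = {(0, 0), (1, 1), (2, 2), (3, 3)}
    for pair in extract_phrase_pairs(source, target, links, max_len=2):
        print(pair)
```

For comparable (non-parallel) sentence pairs, many words have no counterpart, so the alignment is sparse and only the spans covered by links produce candidate fragments. That is the general reason an alignment model tailored to comparable data matters for this kind of extraction.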

Author information

Authors and Affiliations

NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Lu Xiang, Yu Zhou & Chengqing Zong

Editor information

Editors and Affiliations

Soochow University, 1 Shizi Street, 215006, Suzhou, China
Guodong Zhou
Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Juanzi Li
Institute of Computer Science & Technology, Peking University, 100871, Beijing, China
Dongyan Zhao & Yansong Feng

Cite this paper

Xiang, L., Zhou, Y., Zong, C. (2013). An Efficient Framework to Extract Parallel Units from Comparable Data. In: Zhou, G., Li, J., Zhao, D., Feng, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2013. Communications in Computer and Information Science, vol 400. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41644-6_15

DOI: https://doi.org/10.1007/978-3-642-41644-6_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41643-9
Online ISBN: 978-3-642-41644-6
eBook Packages: Computer Science, Computer Science (R0)
