Abstract
Parallel sentences are a relatively scarce but extremely useful resource for many applications, including cross-lingual retrieval and statistical machine translation. This research explores new methodologies for mining such data from previously obtained comparable corpora. The task is highly practical because non-parallel multilingual data exist in far greater quantities than parallel corpora, while parallel sentences are a much more useful resource. We propose a web crawling method for building subject-aligned comparable corpora from sources such as Wikipedia dumps and the Euronews web page. Improvements in machine translation are shown for the Polish-English language pair across various text domains. We also tested another method of building parallel corpora from comparable corpora data. It automatically broadens an existing corpus of sentences from the subject domain of the corpora, based on analogies between them.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Wołk, K., Rejmund, E., Marasek, K. (2015). Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics. In: Esposito, F., Pivert, O., Hacid, MS., Rás, Z., Ferilli, S. (eds) Foundations of Intelligent Systems. ISMIS 2015. Lecture Notes in Computer Science, vol 9384. Springer, Cham. https://doi.org/10.1007/978-3-319-25252-0_46
DOI: https://doi.org/10.1007/978-3-319-25252-0_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25251-3
Online ISBN: 978-3-319-25252-0
eBook Packages: Computer Science, Computer Science (R0)