Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics

Wołk, Krzysztof; Rejmund, Emilia; Marasek, Krzysztof

doi:10.1007/978-3-319-25252-0_46

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9384)

Included in the following conference series:

International Symposium on Methodologies for Intelligent Systems

Abstract

Parallel sentences are a relatively scarce but extremely useful resource for many applications, including cross-lingual retrieval and statistical machine translation. This research explores our new methodologies for mining such data from previously obtained comparable corpora. The task is highly practical: non-parallel multilingual data exist in far greater quantities than parallel corpora, yet parallel sentences are the much more useful resource. We propose a web crawling method for building subject-aligned comparable corpora from sources such as Wikipedia dumps and the Euronews website. The resulting improvements in machine translation are shown on the Polish-English language pair for various text domains. We also tested another method of building parallel corpora from comparable corpora data, which automatically extends an existing corpus with sentences from the subject area of the corpora, based on analogies between them.

Author information

Authors and Affiliations

Department of Multimedia, Polish - Japanese Academy of Information Technology, Warsaw, Poland
Krzysztof Wołk, Emilia Rejmund & Krzysztof Marasek

Corresponding author

Correspondence to Krzysztof Wołk.

Editor information

Editors and Affiliations

Computer Science, University of Bari, Bari, Italy
Floriana Esposito
Enssat, Lannion, France
Olivier Pivert
LISI-UFR d'Informatique, Université Claude Bernard Lyon 1, Villeurbanne Cedex, France
Mohand-Said Hacid
University of North Carolina, Charlotte, North Carolina, USA
Zbigniew W. Rás
Dipartimento di Informatica, Università degli Studi di Bari, Bari, Italy
Stefano Ferilli

About this paper

Cite this paper

Wołk, K., Rejmund, E., Marasek, K. (2015). Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics. In: Esposito, F., Pivert, O., Hacid, MS., Rás, Z., Ferilli, S. (eds) Foundations of Intelligent Systems. ISMIS 2015. Lecture Notes in Computer Science, vol 9384. Springer, Cham. https://doi.org/10.1007/978-3-319-25252-0_46

DOI: https://doi.org/10.1007/978-3-319-25252-0_46
Published: 30 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25251-3
Online ISBN: 978-3-319-25252-0
eBook Packages: Computer Science, Computer Science (R0)
