Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics

  • Conference paper
Foundations of Intelligent Systems (ISMIS 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9384)

Abstract

Parallel sentences are a relatively scarce but extremely useful resource for many applications, including cross-lingual retrieval and statistical machine translation. This research explores new methodologies for mining such data from previously obtained comparable corpora. The task is highly practical, since non-parallel multilingual data exist in far greater quantities than parallel corpora, while parallel sentences are the much more useful resource. We propose a web crawling method for building subject-aligned comparable corpora from sources such as Wikipedia dumps and the Euronews web page. Improvements in machine translation are shown for the Polish-English language pair across various text domains. We also tested a second method of building parallel corpora from comparable corpora data. It makes it possible to automatically broaden an existing corpus of sentences from the subject domain of the corpora, based on analogies between them.

Notes

  1. https://www.ted.com/talks

  2. https://github.com/machinalis/yalign

  3. https://github.com/jhclark/multeval

Author information

Corresponding author

Correspondence to Krzysztof Wołk.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Wołk, K., Rejmund, E., Marasek, K. (2015). Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics. In: Esposito, F., Pivert, O., Hacid, MS., Rás, Z., Ferilli, S. (eds) Foundations of Intelligent Systems. ISMIS 2015. Lecture Notes in Computer Science, vol 9384. Springer, Cham. https://doi.org/10.1007/978-3-319-25252-0_46

  • DOI: https://doi.org/10.1007/978-3-319-25252-0_46

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25251-3

  • Online ISBN: 978-3-319-25252-0

  • eBook Packages: Computer Science, Computer Science (R0)
