Automatic Parallel Data Mining After Bilingual Document Alignment

  • Krzysztof Wołk
  • Agnieszka Wołk
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 569)


Precise translation of texts from different parts of the world has become essential, but translation gaps are often difficult to fill as quickly as needed. Many dictionaries and online translators can help bridge this language gap, but even these resources can fall short of their purpose. Online translators can render the individual words of a phrase very accurately, yet they often miss the true essence of the language. The research presented here describes a method that can help close this gap by extending certain aspects of the WMT16 alignment task. The method relies on different classifiers and algorithms and on advanced computation. We carried out various experiments that allowed us to extract parallel data at the sentence level. These data proved capable of improving overall machine translation quality.


SMT, Quasi comparable corpora, Parallel corpora generation, Comparable corpora, Unsupervised corpora acquisition, Data mining



This work was financed as part of the investment in the CLARIN-PL research infrastructure funded by the Polish Ministry of Science and Higher Education, and it was backed by the PJATK legal resources.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Polish-Japanese Academy of Information Technology, Warsaw, Poland
