3-Step Parallel Corpus Cleaning Using Monolingual Crowd Workers

  • Toshiaki Nakazawa
  • Sadao Kurohashi
  • Hayato Kobayashi
  • Hiroki Ishikawa
  • Manabu Sassano
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 593)


Domains that lack sufficient existing resources require a manually created, high-quality parallel corpus to achieve good machine translation. Although asking professional translators to produce the translations improves corpus quality to some extent, mistakes cannot be avoided completely. In this paper, we propose a framework for cleaning an existing professionally translated parallel corpus quickly and cheaply. The proposed method uses a 3-step crowdsourcing procedure to efficiently detect and edit translation flaws while also guaranteeing the reliability of the edits. Experiments on a fashion-domain e-commerce site (EC-site) parallel corpus show the effectiveness of the proposed method for parallel corpus cleaning.


Parallel corpus cleaning, Crowdsourcing, Machine translation



This work is supported by Yahoo Japan Corporation. We thank the anonymous reviewers for their many useful comments.



Copyright information

© Springer Science+Business Media Singapore 2016

Authors and Affiliations

  • Toshiaki Nakazawa (1)
  • Sadao Kurohashi (1)
  • Hayato Kobayashi (2)
  • Hiroki Ishikawa (2)
  • Manabu Sassano (2)

  1. Graduate School of Informatics, Kyoto University, Kyoto, Japan
  2. Yahoo Japan Corporation, Tokyo, Japan
