Language Resources and Evaluation, Volume 51, Issue 4, pp 1019–1051

Crawl and crowd to bring machine translation to under-resourced languages

  • Antonio Toral
  • Miquel Esplá-Gomis
  • Filip Klubička
  • Nikola Ljubešić
  • Vassilis Papavassiliou
  • Prokopis Prokopidis
  • Raphael Rubino
  • Andy Way
Original Paper

Abstract

We present a widely applicable methodology to bring machine translation (MT) to under-resourced languages in a cost-effective and rapid manner. Our proposal relies on web crawling to automatically acquire parallel data to train statistical MT systems if any such data can be found for the language pair and domain of interest. If that is not the case, we resort to (1) crowdsourcing to translate small amounts of text (hundreds of sentences), which are then used to tune statistical MT models, and (2) web crawling of vast amounts of monolingual data (millions of sentences), which are then used to build language models for MT. We apply these to two respective use-cases for Croatian, an under-resourced language that has gained relevance since it recently attained official status in the European Union. The first use-case regards tourism, given the importance of this sector to Croatia’s economy, while the second has to do with tweets, due to the growing importance of social media. For tourism, we crawl parallel data from 20 web domains using two state-of-the-art crawlers and explore how to combine the crawled data with bigger amounts of general-domain data. Our domain-adapted system is evaluated on a set of three additional tourism web domains and it outperforms the baseline in terms of automatic metrics and/or vocabulary coverage. In the social media use-case, we deal with tweets from the 2014 edition of the soccer World Cup. We build domain-adapted systems by (1) translating small amounts of tweets to be used for tuning by means of crowdsourcing and (2) crawling vast amounts of monolingual tweets. These systems outperform the baseline (Microsoft Bing) by 7.94 BLEU points (5.11 TER) for Croatian-to-English and by 2.17 points (1.94 TER) for English-to-Croatian on a test set translated by means of crowdsourcing. A complementary manual analysis sheds further light on these results.
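The gains above are reported in BLEU points. As a rough illustration of what that metric measures, the following is a minimal sketch of corpus-level BLEU with a single reference per sentence. It is not the scoring script used in the paper: the function names `ngrams` and `corpus_bleu`, the whitespace tokenisation, the lack of smoothing and the toy sentences are all assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumptions for this sketch: whitespace tokenisation, no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_tok, n), ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero when any n-gram order has no match

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalise output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    refs = ["the match ended in a draw"]
    baseline = ["match is finished with draw"]
    adapted = ["the match ended in a tie"]
    print("baseline:", round(corpus_bleu(baseline, refs), 2))
    print("adapted: ", round(corpus_bleu(adapted, refs), 2))
```

A difference of a few points between two systems on the same test set, as in the comparison against the baseline above, corresponds to a difference between two such scores. Real evaluations normally rely on an established scoring tool with standardised tokenisation rather than a hand-written function like this one.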

Keywords

  • Statistical machine translation
  • Web crawling
  • Crowdsourcing

Notes

Acknowledgments

This research is supported by the European Union Seventh Framework Programme FP7/2007–2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran) and by the ADAPT Centre for Digital Content Technology, funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund.

Copyright information

© Springer Science+Business Media Dordrecht 2016

Authors and Affiliations

  • Antonio Toral (1)
  • Miquel Esplá-Gomis (2)
  • Filip Klubička (3)
  • Nikola Ljubešić (3)
  • Vassilis Papavassiliou (4)
  • Prokopis Prokopidis (4)
  • Raphael Rubino (5)
  • Andy Way (1)

  1. ADAPT Centre, School of Computing, Dublin City University, Dublin, Ireland
  2. Dep. Llenguatges i Sistemes Informàtics, Universitat d’Alacant, Alacant, Spain
  3. Faculty of Humanities and Social Sciences, University of Zagreb, Zagreb, Croatia
  4. Institute for Language and Speech Processing, Athens, Greece
  5. Prompsit Language Engineering, S.L., Elx, Spain