Parallel Texts Extraction from Multimodal Comparable Corpora

  • Haithem Afli
  • Loïc Barrault
  • Holger Schwenk
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7614)


Statistical machine translation (SMT) systems depend on the availability of domain-specific bilingual parallel text. However, parallel corpora are a limited resource and are often unavailable for certain domains or language pairs. We analyze the feasibility of extracting parallel sentences from multimodal comparable corpora. This work extends the use of comparable corpora by using audio sources instead of texts on the source side. The audio is transcribed by an automatic speech recognition system and translated with a baseline SMT system. We then use information retrieval in a large text corpus in the target language to extract parallel sentences. We have performed a series of experiments on data of the IWSLT’11 speech translation task that show the feasibility of our approach.
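
The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes the audio has already been transcribed and translated by external systems, and it swaps in a plain TF-IDF cosine similarity with a fixed acceptance threshold for the information retrieval component. The function names, the tokenizer, and the threshold value are hypothetical choices for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(corpus):
    """Compute smoothed IDF weights and TF-IDF vectors for the target corpus."""
    docs = [Counter(tokenize(sentence)) for sentence in corpus]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def extract_pairs(transcripts, translations, corpus, threshold=0.5):
    """Pair each source transcript with its best-matching target sentence.

    transcripts:  ASR output in the source language (assumed given).
    translations: baseline SMT output for each transcript, in the same order.
    corpus:       target-language sentences to search.
    threshold:    hypothetical cut-off; candidates scoring below it are dropped.
    """
    if not corpus:
        return []
    idf, vectors = build_index(corpus)
    pairs = []
    for source, hypothesis in zip(transcripts, translations):
        counts = Counter(tokenize(hypothesis))
        # Terms unseen in the corpus get weight 0 and cannot match anything.
        query = {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}
        scores = [cosine(query, vector) for vector in vectors]
        best = max(range(len(corpus)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((source, corpus[best], scores[best]))
    return pairs
```

Calling `extract_pairs` with a list of transcripts, their machine translations, and a list of target-language sentences returns `(source, target, score)` triples for the candidates that pass the threshold. Using the translated hypothesis as the query, while keeping the original transcript in the output pair, mirrors the idea that the baseline translation serves only to find the target sentence.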


statistical machine translation, automatic speech recognition, multimodal comparable corpora, extraction of parallel sentences





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Haithem Afli (1)
  • Loïc Barrault (1)
  • Holger Schwenk (1)
  1. Université du Maine, Le Mans, France
