Multimedia Tools and Applications, Volume 75, Issue 18, pp 10823–10853

Automating live and batch subtitling of multimedia contents for several European languages

  • Aitor Álvarez
  • Carlos Mendes
  • Matteo Raffaelli
  • Tiago Luís
  • Sérgio Paulo
  • Nicola Piccinini
  • Haritz Arzelus
  • João Neto
  • Carlo Aliprandi
  • Arantza del Pozo


The demand for subtitling of multimedia content has grown quickly in recent years, especially after the adoption of the new European audiovisual legislation, which requires multimedia content to be made accessible to all. As a result, TV channels have moved to produce subtitles for a high percentage of their broadcast content, and the market has consequently been seeking subtitling alternatives more productive than the traditional manual process. The large effort dedicated by the research community to the development of Large Vocabulary Continuous Speech Recognition (LVCSR) over the last decade has led to significant improvements in multimedia transcription, making it the most powerful technology for automatic intralingual subtitling. This article gives a detailed description of the live and batch automatic subtitling applications developed by the SAVAS consortium for several European languages, based on proprietary LVCSR technology tailored specifically to subtitling needs, together with the results of their quality evaluation.
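The quality of LVCSR output of the kind evaluated here is conventionally measured with Word Error Rate (WER): the word-level edit distance between a reference transcript and the system hypothesis, normalised by the reference length. The following is a minimal illustrative sketch of that metric, not the consortium's own evaluation code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                     # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                     # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six reference words -> WER of 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Live-subtitling evaluations often complement WER with latency and segmentation measures, since a perfectly transcribed but late or badly split subtitle is still hard to read.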


Keywords: Multimedia communication · Multimedia systems · Automatic speech recognition · Automatic subtitling · Subtitling quality · Access services



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Aitor Álvarez (1)
  • Carlos Mendes (2)
  • Matteo Raffaelli (3)
  • Tiago Luís (2)
  • Sérgio Paulo (2)
  • Nicola Piccinini (3)
  • Haritz Arzelus (1)
  • João Neto (2)
  • Carlo Aliprandi (3)
  • Arantza del Pozo (1)

  1. Department of Human Speech and Language Technologies, Vicomtech-IK4 Foundation, San Sebastián-Donostia, Spain
  2. VoiceInteraction - Speech Processing Technologies, Lisbon, Portugal
  3. Synthema - Language and Semantic Technologies, Pisa, Italy
