Multimedia Tools and Applications, Volume 72, Issue 1, pp 21–40

An automatic caption alignment mechanism for off-the-shelf speech recognition technologies

  • Maria Federico
  • Marco Furini


As the number of online videos grows, many producers want to add captions to make their content more accessible, and they face two main issues: producing the textual transcript and aligning it with the video. Both activities are expensive, because they require either considerable human labor or dedicated software. In this paper, we focus on caption alignment and propose a novel, automatic, simple and low-cost mechanism that requires neither human transcriptions nor dedicated software to align captions. Our mechanism uses a unique audio markup and intelligently inserts copies of it into the audio stream before passing the stream to an off-the-shelf automatic speech recognition (ASR) application; it then transforms the plain transcript produced by the ASR application into a timecoded transcript, which tells video players when to display each caption during playback. The experimental evaluation shows that our proposal is effective in producing timecoded transcripts and can therefore help expand video content accessibility.
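The transcript-to-timecode step can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not say how markups are placed or how the ASR application renders them. The sketch assumes that each inserted markup shows up in the plain transcript as a single token (the hypothetical `xmarkx`), and that the position of each insertion on the original audio timeline is known. The names `timecode_transcript`, `to_srt` and the SRT-style output are illustrative choices only.

```python
from dataclasses import dataclass

MARKER = "xmarkx"  # hypothetical token the ASR is assumed to emit for the audio markup


@dataclass
class Caption:
    start: float  # seconds on the original (markup-free) timeline
    end: float
    text: str


def split_on_marker(transcript: str, marker: str = MARKER) -> list[str]:
    """Split a plain transcript into segments at each whole-word marker token."""
    segments: list[list[str]] = [[]]
    for word in transcript.split():
        if word == marker:
            segments.append([])
        else:
            segments[-1].append(word)
    return [" ".join(words) for words in segments]


def timecode_transcript(transcript: str, marker_times: list[float],
                        duration: float, marker: str = MARKER) -> list[Caption]:
    """Attach start/end times to each segment.

    marker_times: ascending positions (seconds, original timeline) where a
    markup copy was inserted. Assumption: every markup is recognized exactly once.
    """
    segments = split_on_marker(transcript, marker)
    if len(segments) != len(marker_times) + 1:
        raise ValueError("number of recognized markers does not match insertions")
    bounds = [0.0, *marker_times, duration]
    return [Caption(bounds[i], bounds[i + 1], text)
            for i, text in enumerate(segments) if text]


def format_timestamp(seconds: float) -> str:
    """Render seconds as an SRT-style HH:MM:SS,mmm timestamp."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(captions: list[Caption]) -> str:
    """Serialize captions as numbered SRT-style blocks."""
    blocks = [
        f"{i}\n{format_timestamp(c.start)} --> {format_timestamp(c.end)}\n{c.text}"
        for i, c in enumerate(captions, start=1)
    ]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    plain = "hello everyone xmarkx welcome to the course xmarkx today we start"
    print(to_srt(timecode_transcript(plain, [2.5, 6.0], duration=9.25)))
```

In this sketch a mismatch between recognized and inserted markups raises an error rather than guessing at an alignment; a real system would need some recovery strategy for missed or spurious markers, which the abstract does not describe.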


Automatic caption alignment, Speech recognition, Off-the-shelf ASR



Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  1. Servizio Accoglienza Studenti Disabili, Università di Modena e Reggio Emilia, Modena, Italy
  2. Dipartimento di Comunicazione ed Economia, Università di Modena e Reggio Emilia, Reggio Emilia, Italy
