Correcting automatic speech recognition captioning errors in real time

  • Mike Wald
  • John-Mark Bell
  • Philip Boulain
  • Karl Doody
  • Jim Gerrard


Lectures can be digitally recorded and replayed to provide multimedia revision material for students who attended the class and a substitute learning experience for students unable to attend. Deaf and hard of hearing people can find it difficult to follow speech through hearing alone or to take notes while they are lip-reading or watching a sign-language interpreter. Synchronising the speech with text captions can ensure that deaf students are not disadvantaged and can help all learners search the synchronised text for specific relevant parts of the multimedia recording. Automatic speech recognition has been used to provide real-time captioning directly from lecturers’ speech in classrooms, but it has proved difficult to obtain accuracy comparable to stenography. This paper describes the development, testing and evaluation of a system that enables editors to correct errors in the captions as they are created by automatic speech recognition, and it suggests possible future improvements.


Keywords: Accessibility, Multimedia, Automatic speech recognition, Captioning, Real-time editing





Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Mike Wald (1)
  • John-Mark Bell (1)
  • Philip Boulain (1)
  • Karl Doody (1)
  • Jim Gerrard (1)

  1. School of Electronics and Computer Science, University of Southampton, Southampton, UK
