Correcting automatic speech recognition captioning errors in real time

Wald, Mike; Bell, John-Mark; Boulain, Philip; Doody, Karl; Gerrard, Jim

doi:10.1007/s10772-008-9014-4

Published: 18 December 2008

Volume 10, pages 1–15, (2007)

International Journal of Speech Technology

Abstract

Lectures can be digitally recorded and replayed to provide multimedia revision material for students who attended the class and a substitute learning experience for students unable to attend. Deaf and hard of hearing people can find it difficult to follow speech through hearing alone, or to take notes while lip-reading or watching a sign-language interpreter. Synchronising the speech with text captions can ensure that deaf students are not disadvantaged and can help all learners search the multimedia recording for specific relevant parts using the synchronised text. Automatic speech recognition has been used to provide real-time captioning directly from lecturers’ speech in classrooms, but it has proved difficult to obtain accuracy comparable to stenography. This paper describes the development, testing and evaluation of a system that enables editors to correct errors in the captions as they are created by automatic speech recognition, and it suggests possible future improvements.

Author information

Authors and Affiliations

School of Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, UK
Mike Wald, John-Mark Bell, Philip Boulain, Karl Doody & Jim Gerrard

Corresponding author

Correspondence to Mike Wald.

About this article

Cite this article

Wald, M., Bell, JM., Boulain, P. et al. Correcting automatic speech recognition captioning errors in real time. Int J Speech Technol 10, 1–15 (2007). https://doi.org/10.1007/s10772-008-9014-4

Received: 20 May 2006
Accepted: 25 November 2008
Published: 18 December 2008
Issue Date: March 2007
DOI: https://doi.org/10.1007/s10772-008-9014-4
