SPECOM 2017: Speech and Computer, pp. 820–828
What Speech Recognition Accuracy is Needed for Video Transcripts to be a Useful Search Interface?
Abstract
Informative videos (e.g. recorded lectures) are increasingly being made available online, but they remain difficult to browse and search. Popular platforms now let users search and navigate videos via a transcript which, to guarantee a satisfactory level of word accuracy, has typically been generated with some manual input. The goal of our work is to take a step closer to the fully automatic generation of informative video transcripts based on current automatic speech recognition technology. We present a user study designed to better understand viewers’ use of video transcripts for searching video content, with the aim of estimating the minimum word recognition accuracy needed for video transcripts to be a useful search interface. We found that transcripts with 70% word recognition accuracy are as effective as transcripts with 100% accuracy in supporting video search when using single-word search. We also found large variations in the time it takes to search a video, independent of transcript quality. With adequate, adapted search strategies, even low-accuracy transcripts can support quick video search.
Keywords
Speech recognition · Word accuracy · Video transcripts · Video search
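The accuracy figures in the abstract (70%, 100%) refer to word recognition accuracy, which is conventionally computed as 1 − WER, where WER is the word error rate obtained by Levenshtein alignment of the ASR hypothesis against a reference transcript. A minimal sketch of that computation (the function name and implementation details are illustrative assumptions, not taken from the paper):

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy = 1 - WER, where WER is the word-level
    Levenshtein (edit) distance divided by the reference length.
    Illustrative sketch only; evaluation toolkits may normalize
    text (casing, punctuation) before scoring."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return 1.0 - dp[len(ref)][len(hyp)] / len(ref)
```

Note that with this definition, accuracy can go below zero when the hypothesis contains many insertions, which is why it is usually reported alongside the raw error counts.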