A Case Study of Audio Alignment for Multimedia Language Learning: Applications of SRGS and EMMA in Colibro Publishing
Synchronizing read-aloud audio with text is a powerful reinforcement for language learners at all levels. To provide this kind of synchronized media experience, the audio must be aligned with the text so that the correct audio plays while the corresponding text is presented or highlighted. Alignment can be done manually in an audio editor, but that process is time-consuming, expensive, and error-prone. A much faster and less expensive alternative is automatic alignment using speech recognition. Because the text and its matching audio are known ahead of time, the speech recognizer can perform this task with a very low error rate. Accuracy is further improved because read-aloud stories are typically recorded in careful speech, at a lower words-per-minute rate than conversational speech. In Colibro Publishing’s approach, a Speech Recognition Grammar Specification (SRGS) grammar is generated from the text and provided to a speech recognizer, which then produces Extensible Multimodal Annotation (EMMA) output containing the exact audio timestamps for the beginning and end of each sentence. This alignment is then used in the interactive story production process so that the correct audio plays with the highlighted text.
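The two halves of this pipeline can be illustrated with a minimal Python sketch. It is not Colibro Publishing's implementation. It assumes the story is already split into sentences, that each sentence becomes one SRGS rule referenced in order from a root rule, and that the recognizer returns one `emma:interpretation` element per sentence carrying `emma:start` and `emma:end` attributes in milliseconds. The function names `build_srgs` and `sentence_timings`, the rule ids (`story`, `s1`, `s2`, ...), and the shape of the recognizer output are hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"
EMMA_NS = "http://www.w3.org/2003/04/emma"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_srgs(sentences, lang="en-US"):
    """Return an SRGS XML grammar that accepts the sentences in order.

    Assumption: one rule per sentence (ids s1, s2, ...) and a root rule
    'story' that references them in sequence. A production system would
    also normalize punctuation, numerals, and abbreviations first.
    """
    ET.register_namespace("", SRGS_NS)  # serialize SRGS as the default namespace
    grammar = ET.Element(
        f"{{{SRGS_NS}}}grammar",
        {
            "version": "1.0",
            "mode": "voice",
            "root": "story",
            f"{{{XML_NS}}}lang": lang,
        },
    )
    root_rule = ET.SubElement(
        grammar, f"{{{SRGS_NS}}}rule", {"id": "story", "scope": "public"}
    )
    for i, text in enumerate(sentences, start=1):
        # The root rule is a sequence of references to the sentence rules.
        ET.SubElement(root_rule, f"{{{SRGS_NS}}}ruleref", {"uri": f"#s{i}"})
        rule = ET.SubElement(grammar, f"{{{SRGS_NS}}}rule", {"id": f"s{i}"})
        rule.text = text
    return ET.tostring(grammar, encoding="unicode")


def sentence_timings(emma_xml):
    """Extract (id, start_ms, end_ms) for each timed EMMA interpretation.

    Assumption: the recognizer emits one emma:interpretation per sentence;
    interpretations without both timestamps are skipped.
    """
    root = ET.fromstring(emma_xml)
    timings = []
    for interp in root.iter(f"{{{EMMA_NS}}}interpretation"):
        start = interp.get(f"{{{EMMA_NS}}}start")
        end = interp.get(f"{{{EMMA_NS}}}end")
        if start is None or end is None:
            continue
        timings.append((interp.get("id"), int(start), int(end)))
    return timings
```

Calling `build_srgs(["The cat sat.", "It purred."])` yields a grammar string that can be handed to an SRGS-capable recognizer, and `sentence_timings` turns the returned EMMA document into a list that a story player could use to decide which sentence to highlight at a given playback time. In EMMA, `emma:start` and `emma:end` are absolute timestamps, so a real pipeline would subtract the recording's start time to obtain offsets into the audio file.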