Abstract
The performance of automatic speech recognition (ASR) still degrades under adverse acoustic conditions, and incorporating further modalities can improve recognition robustness. The resulting question of information fusion shows interesting parallels to problems in digital communications, where the turbo principle revolutionized reliable communication. In this paper, we examine whether the immense gains obtained in communications can also be achieved in ASR, since the decoding algorithms are often practically the same: the Viterbi algorithm or the forward-backward algorithm (FBA). First, we show that an ASR turbo recognition scheme can be implemented within the classical FBA framework by modifying only the observation likelihoods; second, we extend our solution to a generalized turbo ASR approach that is fully applicable to multimodal ASR. Applied to an audio-visual speech recognition task, our proposed method clearly outperforms a conventional coupled hidden Markov model approach as well as an iterative state-of-the-art approach, with up to 32.3 % relative reduction in word error rate.
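To illustrate the idea of changing only the observation likelihoods, the following is a minimal sketch of a standard scaled forward-backward pass in which each likelihood is multiplied by an exponent-weighted score from a second recognizer. It is not the chapter's exact formulation: the function name `forward_backward`, the `extrinsic` input, the `weight` exponent, and the toy numbers are all assumptions made for illustration, and the sketch shows a single pass rather than the iterative turbo exchange.

```python
def forward_backward(pi, A, B, extrinsic=None, weight=1.0):
    """Scaled forward-backward pass; returns state posteriors gamma[t][i].

    pi        -- initial state probabilities, length N
    A         -- transitions, A[i][j] = P(state j at t+1 | state i at t)
    B         -- observation likelihoods, B[t][i] = p(o_t | state i)
    extrinsic -- optional per-frame state scores from another recognizer
                 (hypothetical input, same shape as B)
    weight    -- exponent applied to the extrinsic scores (assumed knob)
    """
    T, N = len(B), len(pi)

    # Only the observation likelihoods change; the recursions stay standard.
    if extrinsic is not None:
        B = [[B[t][i] * extrinsic[t][i] ** weight for i in range(N)]
             for t in range(T)]

    # Forward recursion, normalized per frame to avoid underflow.
    alpha = []
    for t in range(T):
        if t == 0:
            raw = [pi[j] * B[0][j] for j in range(N)]
        else:
            raw = [B[t][j] * sum(alpha[t - 1][i] * A[i][j] for i in range(N))
                   for j in range(N)]
        s = sum(raw)
        alpha.append([r / s for r in raw])

    # Backward recursion, also normalized per frame.
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        raw = [sum(A[i][j] * B[t + 1][j] * beta[t + 1][j] for j in range(N))
               for i in range(N)]
        s = sum(raw)
        beta[t] = [r / s for r in raw]

    # State posteriors: product of forward and backward terms, renormalized.
    gamma = []
    for t in range(T):
        raw = [alpha[t][i] * beta[t][i] for i in range(N)]
        s = sum(raw)
        gamma.append([r / s for r in raw])
    return gamma


# Toy example (made-up numbers): uninformative audio, informative video.
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
audio = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
video = [[0.8, 0.2], [0.8, 0.2], [0.8, 0.2]]

gamma_audio = forward_backward(pi, A, audio)
gamma_fused = forward_backward(pi, A, audio, extrinsic=video)
```

With `weight=0.0` the extrinsic scores are raised to the power zero and the pass reduces to the unmodified FBA, which makes the exponent a convenient way to express how much the second modality is trusted.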
Notes
1. In the special case that \(\mathbf {u}_1^T=\mathbf {o}_1^T\), the turbo method could still be applied by using two different recognizers or HMMs.
2. A full proof is beyond the scope of this paper, but can be conducted along the lines of [11, Sect. IV].
3. For a more detailed description of the visual frontend, see [12, Sect. 3.2].
References
Bahl L, Cocke J, Jelinek F, Raviv J (1974) Optimal decoding of linear codes for minimizing symbol error rate. IEEE Trans Inf Theory 20(2):284–287. doi:10.1109/TIT.1974.1055186
Berrou C, Glavieux A, Thitimajshima P (1993) Near Shannon limit error-correcting coding and decoding: turbo-codes. In: Proceedings of IEEE International conference on communications (ICC 1993), Geneva, Switzerland, pp 1064–1070. doi:10.1109/ICC.1993.397441
Bourlard H, Dupont S (1996) A new ASR approach based on independent processing and recombination of partial frequency bands. In: Proceedings of 4th international conference on spoken language processing (ICSLP 1996), Philadelphia, PA, USA, pp 426–429. doi:10.1109/ICSLP.1996.607145
ten Brink S (2001) Convergence behavior of iteratively decoded parallel concatenated codes. IEEE Trans Commun 49(10):1727–1737. doi:10.1109/26.957394
Cooke M, Barker J, Cunningham S, Shao X (2006) An audio-visual corpus for speech perception and automatic speech recognition. J Acoust Soc Am 120(5):2421–2424
Garg A, Potamianos G, Neti C, Huang T (2003) Frame-dependent multi-stream reliability indicators for audio-visual speech recognition. In: Proceedings of international conference on multimedia and expo (ICME 2003), Baltimore, MD, USA, pp 605–608
Hermansky H, Morgan N (1994) RASTA processing of speech. IEEE Trans Speech Audio Process 2(4):578–589
Hermansky H, Tibrewala S, Pavel M (1996) Towards ASR on partially corrupted speech. In: Proceedings of 4th international conference on spoken language processing (ICSLP 1996), Philadelphia, PA, USA, pp 462–465
ITU-T: Rec. P.56 (2011) Objective measurement of active speech level. Int Telecommun Union
Jain U, Siegler MA, Doh SJ, Gouvea E, Huerta J, Moreno PJ, Raj B, Stern RM (1996) Recognition of continuous broadcast news with multiple unknown speakers and environments. In: Proceedings of ARPA speech recognition workshop. Harriman, NY, USA, pp 61–66
Kliewer J, Ng SX, Hanzo L (2006) Efficient computation of EXIT functions for nonbinary iterative decoding. IEEE Trans Commun 54(12):2133–2136. doi:10.1109/TCOMM.2006.885050
Kolossa D, Zeiler S, Vorwerk A, Orglmeister R (2009) Audiovisual speech recognition with missing or unreliable data. In: Proceedings of international conference on auditory-visual speech processing (AVSP 2009), Norwich, UK, pp 117–122
Kratt J, Metze F, Stiefelhagen R, Waibel A (2004) Large vocabulary audio-visual speech recognition using the Janus speech recognition toolkit. In: Proceedings of DAGM-symposium, Tübingen, Germany, pp 488–495
Luettin J, Potamianos G, Neti C (2001) Asynchronous stream modeling for large vocabulary audio-visual speech recognition. In: Proceedings of international conference on acoustics, speech, and signal processing (ICASSP 2001), Salt Lake City, UT, USA, pp 169–172. doi:10.1109/ICASSP.2001.940794
Ming J, Hanna P, Stewart D, Owens M, Smith FJ (1999) Improving speech recognition performance by using multi-model approaches. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing (ICASSP 1999), Phoenix, AZ, USA, pp 161–164
Nefian AV, Liang L, Pi X, Liu X, Murphy K (2002) Dynamic Bayesian networks for audio-visual speech recognition. EURASIP J Appl Signal Process 11(1):1274–1288
Neti C, Potamianos G, Luettin J, Matthews I, Glotin H, Vergyri D, Sison J, Mashari A, Zhou J (2000) Audio-visual speech recognition. Technical report, center lang speech process, Johns Hopkins University, Baltimore, MD, USA
Potamianos G, Neti C, Iyengar G, Helmuth E (2001) Large-vocabulary audio-visual speech recognition by machines and humans. In: Proceedings of Eurospeech, Aalborg, Denmark, pp 1027–1030
Potamianos G, Neti C, Luettin J, Matthews I (2004) Audio-visual automatic speech recognition: an overview. In: Bailly G, Vatikiotis-Bateson E, Perrier P (eds) Issues in visual and audio-visual speech processing. MIT Press, Cambridge, pp 356–396
Rogozan A, Deléglise P, Alissali M (1997) Adaptive determination of audio and visual weights for automatic speech recognition. In: Proceedings of European tutorial workshop on audio-visual speech processing, Rhodes, Greece, pp 61–64
Scheler D, Walz S, Fingscheidt T (2012) On iterative exchange of soft state information in two-channel automatic speech recognition. In: Proceedings of 10th ITG conference on speech communication, Braunschweig, Germany, pp 55–58
Shivappa ST, Rao BD, Trivedi MM (2007) An iterative decoding algorithm for fusion of multimodal information. EURASIP J Adv Signal Process 2008:1–10
Shivappa ST, Rao BD, Trivedi MM (2008) Multimodal information fusion using the iterative decoding algorithm and its application to audio-visual speech recognition. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing (ICASSP 2008), Las Vegas, NV, USA, pp 2241–2244. doi:10.1109/ICASSP.2008.4518091
Stork DG, Hennecke ME, Prasad KV (1996) Visionary speech: looking ahead to practical speechreading systems. In: Stork DG, Hennecke ME (eds) Speechreading by humans and machines. Springer, Berlin
Sumby WH, Pollack I (1954) Visual contribution to speech intelligibility in noise. J Acoust Soc Am 26(2):212–215. doi:10.1121/1.1907309
Tomlinson MJ, Russell MJ, Brooke NM (1996) Integrating audio and visual information to provide highly robust speech recognition. In: Proceedings of IEEE international conference on acoustics, speech and signal processing (ICASSP 1996), Atlanta, GA, USA, pp 821–824
Varga P, Moore RK (1990) Hidden Markov model decomposition of speech and noise. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing (ICASSP 1990), Albuquerque, NM, USA, pp 845–848
Acknowledgments
We would like to thank Dorothea Kolossa and Peter Transfeld for valuable discussions, as well as Carlos Harms for his assistance in reviewing iterative and coupled speech recognition approaches.
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Receveur, S., Scheler, D., Fingscheidt, T. (2016). A Turbo-Decoding Weighted Forward-Backward Algorithm for Multimodal Speech Recognition. In: Rudnicky, A., Raux, A., Lane, I., Misu, T. (eds) Situated Dialog in Speech-Based Human-Computer Interaction. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-21834-2_16
DOI: https://doi.org/10.1007/978-3-319-21834-2_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21833-5
Online ISBN: 978-3-319-21834-2
eBook Packages: Engineering, Engineering (R0)