A Turbo-Decoding Weighted Forward-Backward Algorithm for Multimodal Speech Recognition

Receveur, Simon; Scheler, David; Fingscheidt, Tim

doi:10.1007/978-3-319-21834-2_16

Part of the book series: Signals and Communication Technology (SCT)

Abstract

Because the performance of automatic speech recognition (ASR) still degrades under adverse acoustic conditions, recognition robustness can be improved by incorporating further modalities. The resulting question of information fusion shows interesting parallels to problems in digital communications, where the turbo principle revolutionized reliable communication. In this paper, we examine whether the immense gains obtained in communications can also be achieved in ASR, since the decoding algorithms are often practically the same: the Viterbi algorithm or the forward-backward algorithm (FBA). First, we show that an ASR turbo recognition scheme can be implemented within the classical FBA framework by modifying only the observation likelihoods. Second, we extend our solution to a generalized turbo ASR approach that is fully applicable to multimodal ASR. Applied to an audio-visual speech recognition task, our proposed method clearly outperforms both a conventional coupled hidden Markov model approach and an iterative state-of-the-art approach, with up to 32.3 % relative reduction in word error rate.
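To make the idea of "modifying only the observation likelihoods" concrete, the following is a minimal sketch and not the authors' exact formulation. It assumes that both modalities are decoded over the same state space with the same transition matrix, that all likelihoods are strictly positive, and that the extrinsic information is obtained by dividing the state posteriors by the a-priori information that was fed in. The names `forward_backward`, `turbo_decode`, `weight`, and `iterations` are hypothetical and chosen for illustration only.

```python
def forward_backward(pi, A, lik):
    """Scaled forward-backward pass; returns state posteriors gamma[t][i].

    pi: initial state probabilities, A: transition matrix A[i][j],
    lik: per-frame observation likelihoods lik[t][j] (assumed > 0).
    """
    T, N = len(lik), len(pi)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]
    # Forward recursion, normalised per frame to avoid underflow.
    for t in range(T):
        for j in range(N):
            prev = pi[j] if t == 0 else sum(alpha[t - 1][i] * A[i][j] for i in range(N))
            alpha[t][j] = prev * lik[t][j]
        norm = sum(alpha[t])
        alpha[t] = [a / norm for a in alpha[t]]
    # Backward recursion, also normalised per frame.
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * lik[t + 1][j] * beta[t + 1][j] for j in range(N))
        norm = sum(beta[t])
        beta[t] = [b / norm for b in beta[t]]
    # Posteriors are proportional to alpha * beta within each frame.
    gamma = []
    for t in range(T):
        g = [alpha[t][i] * beta[t][i] for i in range(N)]
        s = sum(g)
        gamma.append([x / s for x in g])
    return gamma


def normalise(v, floor=1e-12):
    """Normalise to sum one, then floor entries (assumption: avoids division by zero)."""
    s = sum(v)
    return [max(x / s, floor) for x in v]


def turbo_decode(pi, A, lik_audio, lik_video, weight=1.0, iterations=3):
    """Alternate between two recognisers, exchanging extrinsic state information.

    Assumption: both modalities share pi, A and the state space.
    """
    T, N = len(lik_audio), len(pi)
    prior = [[1.0 / N] * N for _ in range(T)]  # uninformative before the first pass
    streams = (lik_audio, lik_video)
    gamma = None
    for step in range(2 * iterations):
        lik = streams[step % 2]
        # The FBA itself is unchanged; only the observation likelihoods are
        # multiplied by the weighted information from the other recogniser.
        weighted = [[lik[t][j] * prior[t][j] ** weight for j in range(N)] for t in range(T)]
        gamma = forward_backward(pi, A, weighted)
        # Extrinsic information (assumed form): remove what was fed in.
        prior = [normalise([gamma[t][j] / prior[t][j] ** weight for j in range(N)])
                 for t in range(T)]
    return gamma
```

In this sketch, `weight` plays the role of a stream exponent: setting it to 0 disables the exchange and reduces each pass to a plain single-modality FBA, while larger values let the other recogniser's information dominate.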

Notes

1. In the special case that \(\mathbf{u}_1^T=\mathbf{o}_1^T\), the turbo method could still be applied by using two different recognizers or HMMs.
2. A full proof is beyond the scope of this paper, but can be conducted along the lines of [11, Sect. IV].
3. For a more detailed description of the visual frontend, see [12, Sect. 3.2].

References

1. Bahl L, Cocke J, Jelinek F, Raviv J (1974) Optimal decoding of linear codes for minimizing symbol error rate. IEEE Trans Inf Theory 20(2):284–287. doi:10.1109/TIT.1974.1055186
2. Berrou C, Glavieux A, Thitimajshima P (1993) Near Shannon limit error-correcting coding and decoding: turbo-codes. In: Proceedings of IEEE international conference on communications (ICC 1993), Geneva, Switzerland, pp 1064–1070. doi:10.1109/ICC.1993.397441
3. Bourlard H, Dupont S (1996) A new ASR approach based on independent processing and recombination of partial frequency bands. In: Proceedings of 4th international conference on spoken language processing (ICSLP 1996), Philadelphia, PA, USA, pp 426–429. doi:10.1109/ICSLP.1996.607145
4. ten Brink S (2001) Convergence behavior of iteratively decoded parallel concatenated codes. IEEE Trans Commun 49(10):1727–1737. doi:10.1109/26.957394
5. Cooke M, Barker J, Cunningham S, Shao X (2006) An audio-visual corpus for speech perception and automatic speech recognition. J Acoust Soc Am 120(5):2421–2424
6. Garg A, Potamianos G, Neti C, Huang T (2003) Frame-dependent multi-stream reliability indicators for audio-visual speech recognition. In: Proceedings of international conference on multimedia and expo (ICME 2003), Baltimore, MD, USA, pp 605–608
7. Hermansky H, Morgan N (1994) RASTA processing of speech. IEEE Trans Speech Audio Process 2(4):578–589
8. Hermansky H, Tibrewala S, Pavel M (1996) Towards ASR on partially corrupted speech. In: Proceedings of 4th international conference on spoken language processing (ICSLP 1996), Philadelphia, PA, USA, pp 462–465
9. ITU-T: Rec. P.56 (2011) Objective measurement of active speech level. Int Telecommun Union
10. Jain U, Siegler MA, Doh SJ, Gouvea E, Huerta J, Moreno PJ, Raj B, Stern RM (1996) Recognition of continuous broadcast news with multiple unknown speakers and environments. In: Proceedings of ARPA speech recognition workshop, Harriman, NY, USA, pp 61–66
11. Kliewer J, Ng SX, Hanzo L (2006) Efficient computation of EXIT functions for nonbinary iterative decoding. IEEE Trans Commun 54(12):2133–2136. doi:10.1109/TCOMM.2006.885050
12. Kolossa D, Zeiler S, Vorwerk A, Orglmeister R (2009) Audiovisual speech recognition with missing or unreliable data. In: Proceedings of international conference on auditory-visual speech processing (AVSP 2009), Norwich, UK, pp 117–122
13. Kratt J, Metze F, Stiefelhagen R, Waibel A (2004) Large vocabulary audio-visual speech recognition using the Janus speech recognition toolkit. In: Proceedings of DAGM-symposium, Tübingen, Germany, pp 488–495
14. Luettin J, Potamianos G, Neti C (2001) Asynchronous stream modeling for large vocabulary audio-visual speech recognition. In: Proceedings of international conference on acoustics, speech, and signal processing (ICASSP 2001), Salt Lake City, UT, USA, pp 169–172. doi:10.1109/ICASSP.2001.940794
15. Ming J, Hanna P, Stewart D, Owens M, Smith FJ (1999) Improving speech recognition performance by using multi-model approaches. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing (ICASSP 1999), Phoenix, AZ, USA, pp 161–164
16. Nefian AV, Liang L, Pi X, Liu X, Murphy K (2002) Dynamic Bayesian networks for audio-visual speech recognition. EURASIP J Appl Signal Process 11(1):1274–1288
17. Neti C, Potamianos G, Luettin J, Matthews I, Glotin H, Vergyri D, Sison J, Mashari A, Zhou J (2000) Audio-visual speech recognition. Technical report, Center Lang Speech Process, Johns Hopkins University, Baltimore, MD, USA
18. Potamianos G, Neti C, Iyengar G, Helmuth E (2001) Large-vocabulary audio-visual speech recognition by machines and humans. In: Proceedings of Eurospeech, Aalborg, Denmark, pp 1027–1030
19. Potamianos G, Neti C, Luettin J, Matthews I (2004) Audio-visual automatic speech recognition: an overview. In: Bailly G, Vatikiotis-Bateson E, Perrier P (eds) Issues in visual and audio-visual speech processing. MIT Press, Cambridge, pp 356–396
20. Rogozan A, Deléglise P, Alissali M (1997) Adaptive determination of audio and visual weights for automatic speech recognition. In: Proceedings of European tutorial workshop on audio-visual speech processing, Rhodes, Greece, pp 61–64
21. Scheler D, Walz S, Fingscheidt T (2012) On iterative exchange of soft state information in two-channel automatic speech recognition. In: Proceedings of 10th ITG conference on speech communication, Braunschweig, Germany, pp 55–58
22. Shivappa ST, Rao BD, Trivedi MM (2007) An iterative decoding algorithm for fusion of multimodal information. EURASIP J Adv Signal Process 2008:1–10
23. Shivappa ST, Rao BD, Trivedi MM (2008) Multimodal information fusion using the iterative decoding algorithm and its application to audio-visual speech recognition. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing (ICASSP 2008), Las Vegas, NV, USA, pp 2241–2244. doi:10.1109/ICASSP.2008.4518091
24. Stork DG, Hennecke ME, Prasad KV (1996) Visionary speech: looking ahead to practical speechreading systems. In: Stork DG, Hennecke ME (eds) Speechreading by humans and machines. Springer, Berlin
25. Sumby WH, Pollack I (1954) Visual contribution to speech intelligibility in noise. J Acoust Soc Am 26(2):212–215. doi:10.1121/1.1907309
26. Tomlinson MJ, Russell MJ, Brooke NM (1996) Integrating audio and visual information to provide highly robust speech recognition. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing (ICASSP 1996), Atlanta, GA, USA, pp 821–824
27. Varga P, Moore RK (1990) Hidden Markov model decomposition of speech and noise. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing (ICASSP 1990), Albuquerque, NM, USA, pp 845–848

Acknowledgments

We would like to thank Dorothea Kolossa and Peter Transfeld for valuable discussions, as well as Carlos Harms for his assistance in reviewing iterative and coupled speech recognition approaches.

Author information

Authors and Affiliations

Institute for Communications Technology, Technische Universität Braunschweig, 38106, Braunschweig, Germany
Simon Receveur, David Scheler & Tim Fingscheidt

Corresponding author

Correspondence to Simon Receveur.

Editor information

Editors and Affiliations

School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Alexander Rudnicky
Cupertino, California, USA
Antoine Raux
Silicon Valley, Carnegie Mellon University, Moffett Field, California, USA
Ian Lane
Mountain View, California, USA
Teruhisa Misu

About this chapter

Cite this chapter

Receveur, S., Scheler, D., Fingscheidt, T. (2016). A Turbo-Decoding Weighted Forward-Backward Algorithm for Multimodal Speech Recognition. In: Rudnicky, A., Raux, A., Lane, I., Misu, T. (eds) Situated Dialog in Speech-Based Human-Computer Interaction. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-21834-2_16

DOI: https://doi.org/10.1007/978-3-319-21834-2_16
Published: 21 April 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21833-5
Online ISBN: 978-3-319-21834-2
eBook Packages: Engineering, Engineering (R0)
