A Turbo-Decoding Weighted Forward-Backward Algorithm for Multimodal Speech Recognition

Situated Dialog in Speech-Based Human-Computer Interaction

Abstract

Because the performance of automatic speech recognition (ASR) still degrades under adverse acoustic conditions, recognition robustness can be improved by incorporating further modalities. The resulting question of information fusion shows interesting parallels to problems in digital communications, where the turbo principle revolutionized reliable communication. In this paper, we examine whether the immense gains obtained in communications can also be achieved in ASR, since the decoding algorithms are often practically the same: the Viterbi algorithm or the forward-backward algorithm (FBA). First, we show that an ASR turbo recognition scheme can be implemented within the classical FBA framework by modifying only the observation likelihoods; second, we extend our solution to a generalized turbo ASR approach that is fully applicable to multimodal ASR. Applied to an audio-visual speech recognition task, our proposed method clearly outperforms both a conventional coupled hidden Markov model approach and an iterative state-of-the-art approach, with up to a 32.3 % relative reduction in word error rate.
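The idea of running a turbo scheme inside the classical FBA by changing only the observation likelihoods can be illustrated with a minimal sketch. It is not the chapter's algorithm: it assumes both modalities share one state space and one transition matrix, and it forms extrinsic information by dividing each posterior by the prior that was fed in, a convention borrowed from generic turbo decoding. The names `forward_backward`, `turbo_fuse`, `weight` and `iterations` are hypothetical.

```python
def normalize(v):
    """Scale a list of non-negative numbers so that it sums to one."""
    s = sum(v)
    return [x / s for x in v]


def forward_backward(pi, A, B):
    """Scaled forward-backward pass.

    pi[j]   : initial state probabilities
    A[i][j] : transition probabilities
    B[t][j] : observation likelihood of frame t in state j
    Returns gamma[t][j], the state posteriors per frame.
    """
    T, N = len(B), len(pi)
    alpha = []
    for t in range(T):
        if t == 0:
            a = [pi[j] * B[0][j] for j in range(N)]
        else:
            a = [sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[t][j]
                 for j in range(N)]
        alpha.append(normalize(a))  # per-frame scaling avoids underflow
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        b = [sum(A[i][j] * B[t + 1][j] * beta[t + 1][j] for j in range(N))
             for i in range(N)]
        beta[t] = normalize(b)
    return [normalize([alpha[t][j] * beta[t][j] for j in range(N)])
            for t in range(T)]


def turbo_fuse(pi, A, B_audio, B_video, weight=1.0, iterations=3):
    """Alternate between two decoders, exchanging soft state information.

    Assumption: the only change to the FBA is that each decoder's
    observation likelihoods are multiplied by the other decoder's
    extrinsic information raised to the power `weight`.
    """
    T, N = len(B_audio), len(pi)
    prior = [[1.0 / N] * N for _ in range(T)]  # uninformative at the start
    streams = [B_audio, B_video]
    gamma = None
    for it in range(2 * iterations):
        B = streams[it % 2]
        # Modified observation likelihoods: own likelihood times weighted prior.
        B_mod = [[B[t][j] * prior[t][j] ** weight for j in range(N)]
                 for t in range(T)]
        gamma = forward_backward(pi, A, B_mod)
        # Extrinsic information (assumed form): posterior with the incoming
        # prior divided out, renormalized per frame.
        prior = [normalize([gamma[t][j] / prior[t][j] ** weight
                            for j in range(N)])
                 for t in range(T)]
    return gamma  # posteriors of the last decoder run (the video stream)
```

With `weight=0.0` the exchanged information is ignored and the result reduces to a plain FBA on the video stream, which gives a simple sanity check for the sketch.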

Notes

  1. In the special case that \(\mathbf{u}_1^T=\mathbf{o}_1^T\), the turbo method could still be applied by using two different recognizers or HMMs.

  2. A full proof is beyond the scope of this paper, but can be conducted along the lines of [11, Sect. IV].

  3. For a more detailed description of the visual frontend, see [12, Sect. 3.2].

References

  1. Bahl L, Cocke J, Jelinek F, Raviv J (1974) Optimal decoding of linear codes for minimizing symbol error rate. IEEE Trans Inf Theory 20(2):284–287. doi:10.1109/TIT.1974.1055186

  2. Berrou C, Glavieux A, Thitimajshima P (1993) Near Shannon limit error-correcting coding and decoding: turbo-codes. In: Proceedings of IEEE International conference on communications (ICC 1993), Geneva, Switzerland, pp 1064–1070. doi:10.1109/ICC.1993.397441

  3. Bourlard H, Dupont S (1996) A new ASR approach based on independent processing and recombination of partial frequency bands. In: Proceedings of 4th international conference on spoken language processing (ICSLP 1996), Philadelphia, PA, USA, pp 426–429. doi:10.1109/ICSLP.1996.607145

  4. ten Brink S (2001) Convergence behavior of iteratively decoded parallel concatenated codes. IEEE Trans Commun 49(10):1727–1737. doi:10.1109/26.957394

  5. Cooke M, Barker J, Cunningham S, Shao X (2006) An audio-visual corpus for speech perception and automatic speech recognition. J Acoust Soc Am 120(5):2421–2424

  6. Garg A, Potamianos G, Neti C, Huang T (2003) Frame-dependent multi-stream reliability indicators for audio-visual speech recognition. In: Proceedings of international conference on multimedia and expo (ICME 2003), Baltimore, MD, USA, pp 605–608

  7. Hermansky H, Morgan N (1994) RASTA processing of speech. IEEE Trans Speech Audio Process 2(4):578–589

  8. Hermansky H, Tibrewala S, Pavel M (1996) Towards ASR on partially corrupted speech. In: Proceedings of 4th international conference on spoken language (ICSLP 1996), Philadelphia, PA, USA, pp 462–465

  9. ITU-T: Rec. P.56 (2011) Objective measurement of active speech level. Int Telecommun Union

  10. Jain U, Siegler MA, Doh SJ, Gouvea E, Huerta J, Moreno PJ, Raj B, Stern RM (1996) Recognition of continuous broadcast news with multiple unknown speakers and environments. In: Proceedings of ARPA speech recognition workshop. Harriman, NY, USA, pp 61–66

  11. Kliewer J, Ng SX, Hanzo L (2006) Efficient computation of EXIT functions for nonbinary iterative decoding. IEEE Trans Commun 54(12):2133–2136. doi:10.1109/TCOMM.2006.885050

  12. Kolossa D, Zeiler S, Vorwerk A, Orglmeister R (2009) Audiovisual speech recognition with missing or unreliable data. In: Proceedings of international conference on auditory-visual speech processing (AVSP 2009), Norwich, UK, pp 117–122

  13. Kratt J, Metze F, Stiefelhagen R, Waibel A (2004) Large vocabulary audio-visual speech recognition using the Janus speech recognition toolkit. In: Proceedings of DAGM-symposium, Tübingen, Germany, pp 488–495

  14. Luettin J, Potamianos G, Neti C (2001) Asynchronous stream modeling for large vocabulary audio-visual speech recognition. In: Proceedings of international conference on acoustics speech and signal processing (ICASSP 2001), Salt Lake City, UT, USA, pp 169–172. doi:10.1109/ICASSP.2001.940794

  15. Ming J, Hanna P, Stewart D, Owens M, Smith FJ (1999) Improving speech recognition performance by using multi-model approaches. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing (ICASSP 1999), Phoenix, AZ, USA, pp 161–164

  16. Nefian AV, Liang L, Pi X, Liu X, Murphy K (2002) Dynamic Bayesian networks for audio-visual speech recognition. EURASIP J Appl Signal Process 11(1):1274–1288

  17. Neti C, Potamianos G, Luettin J, Matthews I, Glotin H, Vergyri D, Sison J, Mashari A, Zhou J (2000) Audio-visual speech recognition. Technical report, Center Lang Speech Process, Johns Hopkins University, Baltimore, MD, USA

  18. Potamianos G, Neti C, Iyengar G, Helmuth E (2001) Large-vocabulary audio-visual speech recognition by machines and humans. In: Proceedings of Eurospeech, Aalborg, Denmark, pp 1027–1030

  19. Potamianos G, Neti C, Luettin J, Matthews I (2004) Audio-visual automatic speech recognition: an overview. In: Bailly G, Vatikiotis-Bateson E, Perrier P (eds) Issues in visual and audio-visual speech processing. MIT Press, Cambridge, pp 356–396

  20. Rogozan A, Deléglise P, Alissali M (1997) Adaptive determination of audio and visual weights for automatic speech recognition. In: Proceedings of European tutorial workshop on audio-visual speech processing, Rhodes, Greece, pp 61–64

  21. Scheler D, Walz S, Fingscheidt T (2012) On iterative exchange of soft state information in two-channel automatic speech recognition. In: Proceedings of 10th ITG conference on speech communication, Braunschweig, Germany, pp 55–58

  22. Shivappa ST, Rao BD, Trivedi MM (2007) An iterative decoding algorithm for fusion of multimodal information. EURASIP J Adv Signal Process 2008:1–10

  23. Shivappa ST, Rao BD, Trivedi MM (2008) Multimodal information fusion using the iterative decoding algorithm and its application to audio-visual speech recognition. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing (ICASSP 2008), Las Vegas, NV, USA, pp 2241–2244. doi:10.1109/ICASSP.2008.4518091

  24. Stork DG, Hennecke ME, Prasad KV (1996) Visionary speech: looking ahead to practical speechreading systems. In: Stork DG, Hennecke ME (eds) Speechreading by humans and machines. Springer, Berlin

  25. Sumby WH, Pollack I (1954) Visual contribution to speech intelligibility in noise. J Acoust Soc Am 26(2):212–215. doi:10.1121/1.1907309

  26. Tomlinson MJ, Russell MJ, Brooke NM (1996) Integrating audio and visual information to provide highly robust speech recognition. In: Proceedings of IEEE international conference on acoustics, speech and signal processing (ICASSP 1996), Atlanta, GA, USA, pp 821–824

  27. Varga P, Moore RK (1990) Hidden Markov model decomposition of speech and noise. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing (ICASSP 1990), Albuquerque, NM, USA, pp 845–848

Acknowledgments

We would like to thank Dorothea Kolossa and Peter Transfeld for valuable discussions, as well as Carlos Harms for his assistance in reviewing iterative and coupled speech recognition approaches.

Corresponding author

Correspondence to Simon Receveur.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Receveur, S., Scheler, D., Fingscheidt, T. (2016). A Turbo-Decoding Weighted Forward-Backward Algorithm for Multimodal Speech Recognition. In: Rudnicky, A., Raux, A., Lane, I., Misu, T. (eds) Situated Dialog in Speech-Based Human-Computer Interaction. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-21834-2_16

  • DOI: https://doi.org/10.1007/978-3-319-21834-2_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-21833-5

  • Online ISBN: 978-3-319-21834-2

  • eBook Packages: Engineering, Engineering (R0)
