
Toward Robust Speech Recognition and Understanding



The principal cause of speech recognition errors is a mismatch between trained acoustic/language models and input speech, owing to the limited amount of training data in comparison with the vast variation of speech. It is crucial to establish methods that are robust against voice variation due to individuality, the physical and psychological condition of the speaker, telephone sets, microphones, network characteristics, additive background noise, speaking styles, and other aspects. This paper overviews robust architectures and modeling techniques for speech recognition and understanding. The topics include acoustic and language modeling for spontaneous speech recognition, unsupervised adaptation of acoustic and language models, robust architectures for spoken dialogue systems, multi-modal speech recognition, and speech summarization. This paper also discusses the most important research problems to be solved in order to achieve ultimately robust speech recognition and understanding systems.
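The mismatch-and-adaptation idea in the abstract can be sketched in miniature: a Gaussian model trained on clean features loses likelihood on channel-shifted test speech, and a simple unsupervised, bias-only mean adaptation (in the spirit of cepstral mean normalization; the function and variable names below are illustrative, not from the paper) recovers most of it.

```python
import math
import random

def gaussian_loglik(x, mean, var):
    """Total log-likelihood of scalar observations x under a Gaussian model."""
    return sum(-0.5 * (math.log(2 * math.pi * var) + (xi - mean) ** 2 / var)
               for xi in x)

def adapt_mean(mean, observations, weight=0.9):
    """Bias-only unsupervised adaptation: interpolate the trained mean
    toward the sample mean of the (untranscribed) test observations."""
    sample_mean = sum(observations) / len(observations)
    return (1 - weight) * mean + weight * sample_mean

random.seed(0)
# "Trained" model: a clean-speech feature distribution with mean 0.0
trained_mean, var = 0.0, 1.0
# Test speech from a mismatched channel: the same feature shifted by +2.0
test = [random.gauss(2.0, 1.0) for _ in range(200)]

before = gaussian_loglik(test, trained_mean, var)
adapted_mean = adapt_mean(trained_mean, test)
after = gaussian_loglik(test, adapted_mean, var)
# Adaptation recovers most of the likelihood lost to the channel mismatch
```

This toy ignores everything that makes the real problem hard (HMM states, mixture models, transcription-free alignment), but it shows why unsupervised adaptation helps: the test data itself carries enough information to reduce the train/test mismatch.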




Author information

Correspondence to Sadaoki Furui.

Additional information

Dr. Sadaoki Furui is currently a Professor at Tokyo Institute of Technology, Department of Computer Science. He is engaged in a wide range of research on speech analysis, speech recognition, speaker recognition, speech synthesis, and multimodal human-computer interaction and has authored or coauthored over 450 published articles. From 1978 to 1979, he served on the staff of the Acoustics Research Department of Bell Laboratories, Murray Hill, New Jersey, as a visiting researcher working on speaker verification. He is a Fellow of the IEEE, the Acoustical Society of America and the Institute of Electronics, Information and Communication Engineers of Japan (IEICE). He was President of the Acoustical Society of Japan (ASJ) from 2001 to 2003 and of the Permanent Council for International Conferences on Spoken Language Processing (PC-ICSLP) from 2000 to 2004. He is currently President of the International Speech Communication Association (ISCA). He was a member of the Board of Governors of the IEEE Signal Processing Society from 2001 to 2003. He has served on the IEEE Technical Committees on Speech and MMSP and on numerous IEEE conference organizing committees. He has served as Editor-in-Chief of both the Journal of Speech Communication and the Transactions of the IEICE. He is an Editorial Board member of Speech Communication, the Journal of Computer Speech and Language, and the Journal of Digital Signal Processing.

He has received the Yonezawa Prize and the Paper Awards from the IEICE (1975, 88, 93, 2003), and the Sato Paper Award from the ASJ (1985, 87). He has received the Senior Award from the IEEE ASSP Society (1989) and the Achievement Award from the Minister of Science and Technology, Japan (1989). He has received the Technical Achievement Award and the Book Award from the IEICE (2003, 1990). He has also received the Mira Paul Memorial Award from the AFECT, India (2001). In 1993 he served as an IEEE SPS Distinguished Lecturer. He is the author of “Digital Speech Processing, Synthesis, and Recognition” (Marcel Dekker, 1989, revised, 2000) in English, “Digital Speech Processing” (Tokai University Press, 1985) in Japanese, “Acoustics and Speech Processing” (Kindai-Kagaku-Sha, 1992) in Japanese, and “Speech Information Processing” (Morikita, 1998) in Japanese. He edited “Advances in Speech Signal Processing” (Marcel Dekker, 1992) jointly with Dr. M.M. Sondhi. He has translated into Japanese “Fundamentals of Speech Recognition,” authored by Drs. L.R. Rabiner and B.-H. Juang (NTT Advanced Technology, 1995) and “Vector Quantization and Signal Compression,” authored by Drs. A. Gersho and R. M. Gray (Corona-sha, 1998).



Furui, S. Toward Robust Speech Recognition and Understanding. J VLSI Sign Process Syst Sign Image Video Technol 41, 245–254 (2005).



Keywords

  • speech recognition
  • speech understanding
  • robustness
  • adaptation
  • spontaneous speech
  • corpus
  • acoustic models
  • language models
  • dialogue
  • multi-modal
  • summarization