Audio-Visual Speech Processing for Human Computer Interaction

  • Siew Wen Chin
  • Kah Phooi Seng
  • Li-Minn Ang
Part of the Intelligent Systems Reference Library book series (ISRL, volume 26)

Abstract

This chapter presents an audio-visual speech recognition (AVSR) system for human-computer interaction (HCI) built around three modules: (i) a radial basis function neural network (RBF-NN) voice activity detector (VAD), (ii) watershed lips detection with H∞ lips tracking, and (iii) multi-stream audio-visual back-end processing. The role of AVSR as a pipeline for HCI and the background of each module are discussed first, followed by the design details of the overall proposed AVSR system. Unlike conventional lips detection, which requires prerequisite skin/non-skin detection and face localization, the proposed watershed lips detection aided by H∞ lips tracking locates the lips directly, saving time and rendering those preliminary steps unnecessary. Likewise, offering better noise compensation and more precise speech localization than the conventional zero-crossing rate and short-term signal energy methods, the proposed RBF-NN VAD yields higher recognition performance through the audio modality. Finally, the developed AVSR system, which integrates audio and visual information as well as the temporally synchronized audio-visual data streams, achieves a significant improvement over unimodal speech recognition and over decision- and feature-level integration approaches.
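The conventional VAD baseline that the proposed RBF-NN detector improves upon combines short-term signal energy with the zero-crossing rate (ZCR): voiced speech frames have high energy and relatively low ZCR, while silence and unvoiced noise do not. The following is a minimal sketch of that classical frame-based decision rule, not the chapter's RBF-NN method; the frame length, hop size, and thresholds are illustrative assumptions.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def short_term_energy(frames):
    """Sum of squared samples per frame."""
    return np.sum(frames.astype(float) ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent-sample pairs whose signs differ, per frame."""
    signs = np.sign(frames)
    signs[signs == 0] = 1          # treat exact zeros as positive
    return np.mean(np.abs(np.diff(signs, axis=1)) / 2, axis=1)

def simple_vad(x, frame_len=256, hop=128, energy_ratio=0.1, zcr_max=0.25):
    """Classical energy + ZCR voice activity decision per frame.

    A frame is flagged as speech when its energy exceeds a fraction of
    the peak frame energy AND its ZCR stays below a voiced-speech cap.
    Thresholds here are illustrative, not tuned values from the chapter.
    """
    frames = frame_signal(x, frame_len, hop)
    energy = short_term_energy(frames)
    zcr = zero_crossing_rate(frames)
    return (energy > energy_ratio * energy.max()) & (zcr < zcr_max)
```

For example, on a signal that is near-silence for its first half and a loud 200 Hz tone for its second half, the mask is False over the silent frames and True over the tonal ones. The chapter's point is that such fixed thresholds localize speech poorly in noise, which motivates the learned RBF-NN decision instead.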

Keywords

Discrete Cosine Transform · Speech Recognition · Radial Basis Function Neural Network · Continuous Wavelet Transform · Visual Speech



Copyright information

© IFIP 2012

Authors and Affiliations

  • Siew Wen Chin (1)
  • Kah Phooi Seng (1)
  • Li-Minn Ang (1)

  1. The University of Nottingham, Malaysia
