Journal on Multimodal User Interfaces

Volume 1, Issue 1, pp 7–20

Comparison between different feature extraction techniques for audio-visual speech recognition

  • Alin G. Chiţu
  • Leon J. M. Rothkrantz
  • Pascal Wiggers
  • Jacek C. Wojdel


Current audio-only speech recognition still lacks the expected robustness when the Signal-to-Noise Ratio (SNR) decreases. Video information is unaffected by acoustic noise, which makes it an ideal candidate for data fusion to the benefit of speech recognition. In [1] the authors showed that most techniques for extracting static visual features result in equivalent features, or at least that the most informative features exhibit this property. We argue that one of the main problems of existing methods is that the resulting features contain no information about the motion of the speaker’s lips. Therefore, in this paper we analyze the importance of motion detection for speech recognition. We first present the Lip Geometry Estimation (LGE) method for static feature extraction, which combines an appearance-based approach with a statistical approach for extracting the shape of the mouth. The method was introduced in [2] and explored in detail in [3]. Furthermore, we introduce a second method based on a novel approach that captures the motion information relevant to speech recognition by performing optical flow analysis on the contour of the speaker’s mouth. For completeness, a middle-way approach is also analyzed: this third method recovers motion information by computing the first derivatives of the static visual features. All methods were tested and compared on a continuous speech recognizer for Dutch, evaluated under different noise conditions. We show that audio-visual recognition based on the true motion features, namely those obtained by optical flow analysis, outperforms the other settings in low-SNR conditions.
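The third method mentioned above recovers motion by differentiating the static visual features over time. As a rough illustration, the sketch below computes such delta features with the regression formula commonly used in HMM front ends (e.g. the HTK Book [52]); the window size and the idea of feeding it per-frame mouth geometry are illustrative assumptions, not details taken from this paper.

```python
from typing import List

def delta_features(static: List[List[float]], window: int = 2) -> List[List[float]]:
    """Approximate first temporal derivatives of per-frame static features
    using the standard regression formula
        d_t = sum_{k=1..W} k * (x_{t+k} - x_{t-k}) / (2 * sum_{k=1..W} k^2).
    Frames past the sequence edges are padded with the first/last frame."""
    T = len(static)
    D = len(static[0])
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    deltas = []
    for t in range(T):
        d = [0.0] * D
        for k in range(1, window + 1):
            fwd = static[min(t + k, T - 1)]  # clamp to last frame at the end
            bwd = static[max(t - k, 0)]      # clamp to first frame at the start
            for i in range(D):
                d[i] += k * (fwd[i] - bwd[i]) / denom
        deltas.append(d)
    return deltas
```

Applied, for instance, to hypothetical per-frame mouth width/height measurements from a static extractor such as LGE, this yields the derivative-based motion features that the paper compares against true optical-flow features.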


Keywords: Audio-visual fusion · Speech recognition · Automatic lipreading · Optical flow · Mouth movement



7. References

  1. [1] L. J. M. Rothkrantz, J. C. Wojdel, and P. Wiggers, “Comparison between different feature extraction techniques in lipreading applications”, in Specom’2006, SpIIRAS Petersburg, 2006.
  2. [2] J. C. Wojdel and L. J. M. Rothkrantz, “Visually based speech onset/offset detection”, in Proceedings of the 5th Annual Scientific Conference on Web Technology, New Media, Communications and Telematics Theory, Methods, Tools and Application (Euromedia 2000), (Antwerp, Belgium), pp. 156–160, 2000.
  3. [3] L. J. M. Rothkrantz, J. C. Wojdel, and P. Wiggers, “Fusing Data Streams in Continuous Audio-Visual Speech Recognition”, in Text, Speech and Dialogue: 8th International Conference, TSD 2005, vol. 3658, (Karlovy Vary, Czech Republic), pp. 33–44, Springer Berlin/Heidelberg, September 2005.
  4. [4] H. McGurk and J. MacDonald, “Hearing lips and seeing voices”, Nature, vol. 264, pp. 746–748, December 1976.
  5. [5] K. P. Green, P. K. Kuhl, A. N. Meltzoff, and E. B. Stevens, “Integrating speech information across talkers, gender, and sensory modality: female faces and male voices in the McGurk effect”, Perception and Psychophysics, vol. 50, no. 6, pp. 524–536, 1991.
  6. [6] N. Li, S. Dettmer, and M. Shah, “Lipreading using eigen sequences”, in Proc. International Workshop on Automatic Face- and Gesture-Recognition, (Zurich, Switzerland), pp. 30–34, 1995.
  7. [7] N. Li, S. Dettmer, and M. Shah, “Visually recognizing speech using eigensequences”, Motion-Based Recognition, 1997.
  8. [8] X. Hong, H. Yao, Y. Wan, and R. Chen, “A PCA Based Visual DCT Feature Extraction Method for Lip-Reading”, in IIH-MSP 2006, pp. 321–326, 2006.
  9. [9] C. Bregler and Y. Konig, “‘Eigenlips’ for robust speech recognition”, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-94), 1994.
  10. [10] P. Duchnowski, M. Hunke, D. Büsching, U. Meier, and A. Waibel, “Toward Movement-Invariant Automatic Lip-Reading and Speech Recognition”, in International Conference on Acoustics, Speech, and Signal Processing (ICASSP-95), vol. 1, pp. 109–112, 1995.
  11. [11] I. A. Essa and A. Pentland, “A Vision System for Observing and Extracting Facial Action Parameters”, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 76–83, IEEE, June 1994.
  12. [12] S. Tamura, K. Iwano, and S. Furui, “A Robust Multi-Modal Speech Recognition Method Using Optical-Flow Analysis”, in Extended Summary of IDS02, (Kloster Irsee, Germany), pp. 2–4, June 2002.
  13. [13] T. Yoshinaga, S. Tamura, K. Iwano, and S. Furui, “Audio-Visual Speech Recognition Using Lip Movement Extracted from Side-Face Images”, in AVSP2003, pp. 117–120, September 2003.
  14. [14] T. Yoshinaga, S. Tamura, K. Iwano, and S. Furui, “Audio-Visual Speech Recognition Using New Lip Features Extracted from Side-Face Images”, in Robust 2004, August 2004.
  15. [15] K. Mase and A. Pentland, “Automatic Lipreading by Optical-Flow Analysis”, Systems and Computers in Japan, vol. 22, pp. 67–76, 1991.
  16. [16] K. Iwano, S. Tamura, and S. Furui, “Bimodal Speech Recognition Using Lip Movement Measured by Optical-Flow Analysis”, in HSC2001, 2001.
  17. [17] D. J. Fleet, M. J. Black, Y. Yacoob, and A. D. Jepson, “Design and Use of Linear Models for Image Motion Analysis”, International Journal of Computer Vision, vol. 36, no. 3, pp. 171–193, 2000.
  18. [18] A. Martin, “Lipreading by Optical Flow Correlation”, tech. rep., Computer Science Department, University of Central Florida, 1995.
  19. [19] S. Tamura, K. Iwano, and S. Furui, “Multi-Modal Speech Recognition Using Optical-Flow Analysis for Lip Images”, J. VLSI Signal Process. Syst., vol. 36, no. 2–3, pp. 117–124, 2004.
  20. [20] S. Furui, “Robust Methods in Automatic Speech Recognition and Understanding”, in EUROSPEECH 2003, Geneva, 2003.
  21. [21] B. K. Horn and B. G. Schunck, “Determining optical flow”, Artificial Intelligence, vol. 17, pp. 185–203, 1981.
  22. [22] B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision”, in Proc. Seventh International Joint Conference on Artificial Intelligence, pp. 674–679, 1981.
  23. [23] A. Bruhn and J. Weickert, “Lucas/Kanade Meets Horn/Schunck: Combining Local and Global Optic Flow Methods”, International Journal of Computer Vision, vol. 61, no. 3, pp. 211–231, 2005.
  24. [24] S. Uras, F. Girosi, A. Verri, and V. Torre, “A computational approach to motion perception”, Biological Cybernetics, vol. 60, pp. 79–87, December 1988.
  25. [25] H.-H. Nagel, “On the estimation of optical flow: relations between different approaches and some new results”, Artificial Intelligence, vol. 33, no. 3, pp. 298–324, 1987.
  26. [26] P. Anandan, “A Computational Framework and an Algorithm for the Measurement of Visual Motion”, International Journal of Computer Vision, vol. 2, pp. 283–310, 1989.
  27. [27] A. Singh, Optic Flow Computation: A Unified Perspective. IEEE Computer Society Press, 1991.
  28. [28] D. J. Heeger, “Model for the extraction of image flow”, Journal of the Optical Society of America, vol. 4, pp. 1455–1471, August 1987.
  29. [29] A. Waxman, J. Wu, and F. Bergholm, “Convected activation profiles and receptive fields for real time measurement of short range visual motion”, in Proceedings of Conference on Computational Visual Pattern Recognition, pp. 771–723, 1988.
  30. [30] D. J. Fleet and A. D. Jepson, “Computation of Component Image Velocity from Local Phase Information”, International Journal of Computer Vision, vol. 5, pp. 77–104, August 1990.
  31. [31] J. L. Barron, D. J. Fleet, and S. S. Beauchemin, “Performance of optical flow techniques”, International Journal of Computer Vision, vol. 12, pp. 43–77, February 1994.
  32. [32] B. Galvin, B. McCane, K. Novins, D. Mason, and S. Mills, “Recovering Motion Fields: An Evaluation of Eight Optical Flow Algorithms”, in Proceedings of the British Machine Vision Conference (BMVC) ’98, September 1998.
  33. [33] L. Rabiner and B. Juang, Fundamentals of Speech Recognition. NJ, USA: Prentice Hall, 1993.
  34. [34] R. P. Lippmann, “Review of neural networks for speech recognition”, Neural Computation, vol. 1, no. 1, pp. 1–38, 1990.
  35. [35] L. J. Rothkrantz and D. Nollen, “Automatic speech recognition using recurrent neural networks”, Neural Network World, vol. 10, pp. 445–453, July 2000.
  36. [36] A. Ganapathiraju, Support Vector Machines for Speech Recognition. PhD thesis, Mississippi State University, 2002.
  37. [37] T. S. Andersen, K. Tiippana, and M. Lampien, “Modeling of audio-visual speech perception in noise”, in Proceedings of AVSP 2001, (Aalborg, Denmark), September 2001.
  38. [38] P. Smeele, Perceiving Speech: Integrating Auditory and Visual Speech. PhD thesis, Delft University of Technology, 1995.
  39. [39] D. Massaro, “A fuzzy logical model of speech perception”, in Proceedings of the XXIV International Congress of Psychology. Human Information Processing: Measures, Mechanisms and Models (D. Vickers and P. Smith, eds.), (Amsterdam, North Holland), pp. 367–379, 1989.
  40. [40] G. Meyer, J. Mulligan, and S. Wuerger, “Continuous audio-visual digit recognition using N-best decision fusion”, Information Fusion, vol. 5, pp. 91–101, 2004.
  41. [41] S. Dupont, H. Bourlard, and C. Ris, “Robust Speech Recognition Based on Multi-Stream Features”, in Proc. of ESCA/NATO Workshop on Robust Speech Recognition for Unknown Communication Channels, (Pont-à-Mousson, France), pp. 95–98, 1997.
  42. [42] J. Luettin and S. Dupont, “Continuous Audio-Visual Speech Recognition”, IDIAP, Dalle Molle Institute for Perceptual Artificial Intelligence, 1998.
  43. [43] Z. Ghahramani and M. I. Jordan, “Factorial Hidden Markov Models”, in Proc. Conf. Advances in Neural Information Processing Systems (NIPS), 1997.
  44. [44] P. Viola and M. Jones, “Robust Real-time Object Detection”, in Second International Workshop on Statistical and Computational Theories of Vision: Modeling, Learning, Computing, and Sampling, (Vancouver, Canada), July 2001.
  45. [45] T. Coianiz, L. Torresani, and B. Caprile, “2D deformable models for visual speech analysis”, in Speechreading by Humans and Machines: Models, Systems, and Applications (D. G. Stork and M. E. Hennecke, eds.), vol. 150 of NATO ASI Series F: Computer and Systems Sciences, Berlin and New York: Springer, 1996.
  46. [46] J. Millar, M. Wagner, and R. Goecke, “Aspects of Speaking-Face Data Corpus Design Methodology”, in Proceedings of the 8th International Conference on Spoken Language Processing (ICSLP 2004), vol. II, (Jeju, Korea), pp. 1157–1160, October 2004.
  47. [47] J. R. Movellan, “Visual Speech Recognition with Stochastic Networks”, in Advances in Neural Information Processing Systems, vol. 7, (Cambridge), MIT Press, 1995.
  48. [48] N. A. Fox, Audio and Video Based Person Identification. PhD thesis, Department of Electronic and Electrical Engineering, Faculty of Engineering and Architecture, University College Dublin, 2005.
  49. [49] K. Messer, J. Matas, and J. Kittler, “Acquisition of a large database for biometric identity verification”, in BIOSIGNAL 98 (J. Jan, J. Kozumplík, and Z. Szabó, eds.), (Technical University Brno, Purkynova 188, 612 00, Brno, Czech Republic), pp. 70–72, Vutium Press, June 1998.
  50. [50] E. Bailly-Baillière, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariéthoz, J. Matas, K. Messer, V. Popovici, F. Porée, B. Ruiz, and J.-P. Thiran, “The BANCA Database and Evaluation Protocol”, in Audio- and Video-Based Biometric Person Authentication, vol. 2688, pp. 625–638, Springer Berlin/Heidelberg, 2003.
  51. [51] E. Patterson, S. Gurbuz, Z. Tufekci, and J. Gowdy, “CUAVE: A New Audio-Visual Database for Multimodal Human-Computer Interface Research”, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002.
  52. [52] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 3.4), 2005.
  53. [53] K. Murphy, Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, University of California, Berkeley, 2002.
  54. [54] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc., 1988.
  55. [55] A. V. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, “Dynamic Bayesian Networks for Audio-Visual Speech Recognition”, EURASIP Journal on Applied Signal Processing, vol. 11, pp. 1274–1288, 2002.

Copyright information

© OpenInterface Association 2007

Authors and Affiliations

  • Alin G. Chiţu (1)
  • Leon J. M. Rothkrantz (1)
  • Pascal Wiggers (1)
  • Jacek C. Wojdel (2)

  1. Man-Machine Interaction Group, Delft University of Technology, The Netherlands
  2. Quantum Chemistry of Materials Research Group, University of Barcelona, Spain
