Multi-Modal Speech Recognition Using Optical-Flow Analysis for Lip Images

Tamura, Satoshi; Iwano, Koji; Furui, Sadaoki

doi:10.1023/B:VLSI.0000015091.47302.07

Abstract

This paper proposes a multi-modal speech recognition method that uses optical-flow analysis of lip images. Optical flow is defined as the distribution of apparent velocities in the movement of brightness patterns in an image. Because the optical flow is computed without extracting the speaker's lip contours or location, it yields robust visual features for lip movements. Our method calculates two kinds of visual feature sets in each frame. The first feature set consists of the variances of the vertical and horizontal components of the optical-flow vectors. These are useful for estimating silence/pause periods in noisy conditions since they represent movement of the speaker's mouth. The second feature set consists of the maximum and minimum values of the integral of the optical flow. This set is expected to be more effective than the first because it carries not only silence/pause information but also the open/close status of the speaker's mouth. Each feature set is combined with an acoustic feature set in the framework of HMM-based recognition. Triphone HMMs are trained using the combined parameter sets extracted from clean speech data. Noise-corrupted speech recognition experiments were carried out using audio-visual data from 11 male speakers uttering connected digits. When the visual information was used only for the silence HMM, with the integral information of optical flow as the visual feature set, digit accuracy improved over the audio-only recognition scheme by 4% at SNR = 5 dB and 13% at SNR = 10 dB.
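
As a rough illustration of the first visual feature set, the following Python sketch computes the per-frame variances of the horizontal and vertical optical-flow components and appends them to an acoustic feature vector. It is a minimal sketch under stated assumptions, not the authors' implementation: the flow fields are assumed to be already computed and stored as a NumPy array of shape `(frames, height, width, 2)`, the two streams are assumed to share a frame rate, and simple concatenation stands in for the combination step, which the abstract does not specify. The function names `flow_variance_features` and `combine_features` are hypothetical, and the second feature set is not covered.

```python
import numpy as np


def flow_variance_features(flow):
    """Per-frame variances of the optical-flow components.

    flow: array of shape (T, H, W, 2); the last axis is assumed to hold
    the (horizontal, vertical) flow component at each pixel.
    Returns an array of shape (T, 2): [var_horizontal, var_vertical].
    """
    flow = np.asarray(flow, dtype=float)
    if flow.ndim != 4 or flow.shape[-1] != 2:
        raise ValueError("expected flow of shape (T, H, W, 2)")
    # Variance over all pixels of each frame, one value per component.
    var_h = flow[..., 0].var(axis=(1, 2))
    var_v = flow[..., 1].var(axis=(1, 2))
    return np.stack([var_h, var_v], axis=1)


def combine_features(acoustic, visual):
    """Concatenate acoustic and visual features frame by frame.

    Assumption: both streams are already aligned to the same frame rate.
    acoustic: (T, Da), visual: (T, Dv) -> returns (T, Da + Dv).
    """
    acoustic = np.asarray(acoustic, dtype=float)
    visual = np.asarray(visual, dtype=float)
    if acoustic.shape[0] != visual.shape[0]:
        raise ValueError("acoustic and visual streams differ in length")
    return np.hstack([acoustic, visual])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic data: 5 still frames (zero flow) followed by 5 moving frames.
    still = np.zeros((5, 32, 48, 2))
    moving = rng.normal(size=(5, 32, 48, 2))
    flow = np.concatenate([still, moving])

    visual = flow_variance_features(flow)
    # Placeholder acoustic features; the dimension 12 is arbitrary.
    acoustic = rng.normal(size=(10, 12))
    combined = combine_features(acoustic, visual)
    print(visual.round(3))
    print(combined.shape)
```

In this synthetic example the still frames produce zero variance in both components while the moving frames do not, which is the property the abstract relies on for estimating silence/pause periods.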

Author information

Authors and Affiliations

Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo, 152-8552, Japan
Satoshi Tamura, Koji Iwano & Sadaoki Furui

About this article

Cite this article

Tamura, S., Iwano, K. & Furui, S. Multi-Modal Speech Recognition Using Optical-Flow Analysis for Lip Images. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 36, 117–124 (2004). https://doi.org/10.1023/B:VLSI.0000015091.47302.07

Published: 01 February 2004
Issue Date: February 2004
DOI: https://doi.org/10.1023/B:VLSI.0000015091.47302.07
