Abstract
This paper proposes a multi-modal speech recognition method using optical-flow analysis of lip images. Optical flow is defined as the distribution of apparent velocities in the movement of brightness patterns in an image. Since the optical flow is computed without extracting the speaker’s lip contours and location, robust visual features can be obtained for lip movements. Our method calculates two kinds of visual feature sets in each frame. The first feature set consists of the variances of the vertical and horizontal components of the optical-flow vectors. These are useful for estimating silence/pause periods in noisy conditions since they represent movement of the speaker’s mouth. The second feature set consists of the maximum and minimum values of the integral of the optical flow. This set is expected to be more effective than the first since it carries not only silence/pause information but also the open/close status of the speaker’s mouth. Each of the feature sets is combined with an acoustic feature set in the framework of HMM-based recognition. Triphone HMMs are trained using the combined parameter sets extracted from clean speech data. Noise-corrupted speech recognition experiments have been carried out using audio-visual data from 11 male speakers uttering connected digits. The following improvements in digit accuracy over the audio-only recognition scheme were achieved when the visual information was used only for the silence HMM: 4% at SNR = 5 dB and 13% at SNR = 10 dB, using the integral information of the optical flow as the visual feature set.
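As a rough illustration of the first visual feature set, the sketch below computes the variances of the horizontal and vertical components of one frame's optical-flow vectors and appends them to an acoustic feature vector. It is a minimal sketch under stated assumptions, not the authors' implementation: the flow field is assumed to be already computed and given as a flat list of (horizontal, vertical) pairs, population variance is assumed, the combination is assumed to be simple concatenation, and the names `flow_variance_features` and `combine_features` are hypothetical.

```python
from statistics import pvariance
from typing import List, Sequence, Tuple

# One optical-flow vector: (horizontal, vertical) apparent velocity at a pixel.
FlowVector = Tuple[float, float]


def flow_variance_features(flow: Sequence[FlowVector]) -> Tuple[float, float]:
    """Return (horizontal variance, vertical variance) for one frame.

    Assumption: population variance over all flow vectors in the frame.
    """
    horizontal = [u for u, _ in flow]
    vertical = [v for _, v in flow]
    return pvariance(horizontal), pvariance(vertical)


def combine_features(acoustic: Sequence[float],
                     visual: Sequence[float]) -> List[float]:
    """Assumption: audio and visual features are joined by concatenation."""
    return list(acoustic) + list(visual)


# A motionless mouth region gives zero variance in both directions.
still_frame = [(0.0, 0.0)] * 4

# A mouth opening or closing gives spread in the vertical component only.
moving_frame = [(0.0, 1.0), (0.0, -1.0), (0.0, 2.0), (0.0, -2.0)]

print(flow_variance_features(still_frame))   # (0.0, 0.0)
print(flow_variance_features(moving_frame))  # (0.0, 2.5)
```

Near-zero variances in both directions suggest a silence or pause period, which is the cue the abstract attributes to this feature set. The second feature set (maximum and minimum of the integral of the optical flow) is not sketched here because the abstract does not specify how the integral is taken.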
© 2004 Springer Science+Business Media New York
About this chapter
Cite this chapter
Tamura, S., Iwano, K., Furui, S. (2004). Multi-Modal Speech Recognition Using Optical-Flow Analysis for Lip Images. In: Wang, JF., Furui, S., Juang, BH. (eds) Real World Speech Processing. Springer, Boston, MA. https://doi.org/10.1007/978-1-4757-6363-8_4
DOI: https://doi.org/10.1007/978-1-4757-6363-8_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-5439-8
Online ISBN: 978-1-4757-6363-8
eBook Packages: Springer Book Archive