An Investigation into Audiovisual Speech Correlation in Reverberant Noisy Environments
Abstract
As evidence of a link between the various human communication production domains has become more prominent in the last decade, the field of multimodal speech processing has undergone significant expansion. Many specialized processing methods have been developed to analyze and exploit the complex relationship between multimodal data streams. This work uses information extracted from an audiovisual corpus to investigate and assess the correlation between audio and visual features in speech. A number of feature extraction techniques are assessed, with the aim of identifying the visual technique that maximizes the audiovisual correlation. Additionally, this paper aims to demonstrate that a noisy and reverberant audio environment reduces the degree of audiovisual correlation, and that the application of a beamformer remedies this. Experiments carried out in a synthetic scenario confirm the positive impact of beamforming, not only in improving the audiovisual correlation but also within a complete audiovisual speech enhancement scheme. This work thus highlights an important consideration for the development of future bimodal speech enhancement systems.
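As a rough illustration of the kind of audio-visual correlation assessment described above, the sketch below fits a least-squares regression from a visual feature stream to each audio feature dimension and reports the resulting multiple correlation coefficient. The feature names, dimensions, and synthetic data are illustrative assumptions, not the paper's actual corpus or method.

```python
import numpy as np

# Illustrative sketch only: synthetic stand-ins for frame-synchronous
# visual features (e.g. DCT coefficients of the mouth region) and audio
# features (e.g. log filterbank energies). Dimensions are assumptions.
rng = np.random.default_rng(0)

n_frames = 500
visual = rng.standard_normal((n_frames, 4))
mixing = rng.standard_normal((4, 6))
# Audio generated as a noisy linear function of the visual stream,
# so some correlation exists by construction.
audio = visual @ mixing + 0.5 * rng.standard_normal((n_frames, 6))

def multiple_correlation(X, y):
    """Multiple correlation coefficient R between predictors X and target y,
    obtained from a least-squares fit (standard regression analysis)."""
    X1 = np.column_stack([np.ones(len(X)), X])   # add intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    y_hat = X1 @ coef
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(np.sqrt(max(0.0, 1.0 - ss_res / ss_tot)))

# Correlation of each audio dimension with the full visual feature vector.
R = [multiple_correlation(visual, audio[:, j]) for j in range(audio.shape[1])]
print([round(r, 2) for r in R])
```

Degrading the audio term with additive noise and reverberation (and then restoring it with a beamformer) would, under the paper's thesis, lower and then recover these R values.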
Keywords
Discrete Cosine Transform · Automatic Speech Recognition · Speech Enhancement · Noisy Speech · Active Appearance Model