An Investigation into Audiovisual Speech Correlation in Reverberant Noisy Environments

  • Simone Cifani
  • Andrew Abel
  • Amir Hussain
  • Stefano Squartini
  • Francesco Piazza
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5641)

Abstract

As evidence of a link between the various human communication production domains has become more prominent over the last decade, the field of multimodal speech processing has undergone significant expansion. Many specialised processing methods have been developed to analyze and exploit the complex relationship between multimodal data streams. This work uses information extracted from an audiovisual corpus to investigate and assess the correlation between audio and visual features in speech. A number of different feature extraction techniques are assessed, with the intention of identifying the visual technique that maximizes the audiovisual correlation. Additionally, this paper aims to demonstrate that a noisy and reverberant audio environment reduces the degree of audiovisual correlation, and that the application of a beamformer remedies this. Experimental results obtained in a synthetic scenario confirm the positive impact of beamforming, not only for improving the audiovisual correlation but also within a complete audiovisual speech enhancement scheme. This work therefore highlights an important consideration for the development of promising future bimodal speech enhancement systems.
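The abstract's central quantity, the correlation between frame-synchronised audio and visual feature streams, can be sketched as a plain Pearson correlation between feature trajectories. The sketch below is illustrative only: the `av_correlation` helper and the synthetic data are assumptions, not the paper's actual features, corpus, or correlation analysis.

```python
import numpy as np

def av_correlation(audio_feats, visual_feats):
    """Pearson correlation between every audio and every visual feature
    trajectory; inputs are frame-synchronised arrays (one row per frame)."""
    # Standardise each feature column (zero mean, unit variance), then the
    # inner product divided by the frame count is the correlation matrix.
    a = (audio_feats - audio_feats.mean(0)) / audio_feats.std(0)
    v = (visual_feats - visual_feats.mean(0)) / visual_feats.std(0)
    return a.T @ v / len(a)  # shape: (n_audio_feats, n_visual_feats)

# Synthetic demo: a visual feature that partly tracks an audio feature,
# standing in for e.g. a lip-region coefficient vs. frame log-energy.
rng = np.random.default_rng(0)
audio = rng.standard_normal((500, 1))
visual = 0.8 * audio + 0.6 * rng.standard_normal((500, 1))

corr = av_correlation(audio, visual)
print(float(corr[0, 0]))  # close to the true correlation of 0.8
```

Additive noise or reverberation on the audio stream would lower this value, which is the effect the paper measures and which beamforming is shown to counteract.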

Keywords

Discrete Cosine Transform · Automatic Speech Recognition · Speech Enhancement · Noisy Speech · Active Appearance Model



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Simone Cifani (2)
  • Andrew Abel (1)
  • Amir Hussain (1)
  • Stefano Squartini (2)
  • Francesco Piazza (2)
  1. Dept. of Computing Science, University of Stirling, Scotland, UK
  2. 3MediaLabs, DIBET, Università Politecnica delle Marche, Ancona, Italy
