Enhancing Visual Speech Recognition with Lip Protrusion Estimation

Chapter
Part of the Studies in Computational Intelligence book series (SCI, volume 730)

Abstract

Visual speech recognition is emerging as an important research area in human–computer interaction. Most work in this area has focused on lip-reading from the frontal view of the speaker, or on views available from multiple cameras. In the absence of views from other angles, however, profile information about the speech articulators is lost. This chapter estimates lip protrusion from images of the frontal pose of the speaker alone. Our proposed methodology computes an estimate of lip profile information from frontal features, increasing system efficiency without expensive hardware and without adding computational overhead. We also show that lip protrusion is a key speech articulator and that the other prominent articulators are contained within the centre area of the mouth.
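The idea of inferring a depth cue (protrusion) from purely frontal measurements can be illustrated with a minimal sketch. The landmark names, features, and coefficients below are illustrative assumptions, not the chapter's actual model: the intuition is that a rounded, protruded mouth shape tends to appear narrow and tall from the front, so a regressor fit on frontal geometry can approximate protrusion.

```python
# Hypothetical sketch: estimating lip protrusion from frontal-view
# lip landmarks via a fixed linear model. Feature choice and
# coefficients are illustrative, not the chapter's values; in
# practice the coefficients would be fit on training data.

def lip_features(landmarks):
    """Compute simple geometric features from frontal lip landmarks.

    landmarks: dict mapping "left_corner", "right_corner",
    "upper_mid", "lower_mid" to (x, y) pixel coordinates.
    """
    left, right = landmarks["left_corner"], landmarks["right_corner"]
    top, bottom = landmarks["upper_mid"], landmarks["lower_mid"]
    width = abs(right[0] - left[0])            # horizontal mouth opening
    height = abs(bottom[1] - top[1])           # vertical mouth opening
    ratio = height / width if width else 0.0   # rounding vs. spreading cue
    return width, height, ratio

def estimate_protrusion(landmarks, coeffs=(0.05, 0.1, 2.0), bias=0.0):
    """Estimate lip protrusion (depth) as a linear combination of
    frontal geometric features."""
    feats = lip_features(landmarks)
    return bias + sum(c * f for c, f in zip(coeffs, feats))

# A rounded mouth shape (narrow and tall, as in /u/) should score
# higher than a spread shape (wide and flat, as in /i/).
rounded = {"left_corner": (40, 50), "right_corner": (60, 50),
           "upper_mid": (50, 35), "lower_mid": (50, 65)}
spread = {"left_corner": (20, 50), "right_corner": (80, 50),
          "upper_mid": (50, 45), "lower_mid": (50, 55)}
print(estimate_protrusion(rounded) > estimate_protrusion(spread))  # True
```

The height-to-width ratio does most of the work here: it rises for rounded (protruded) lip shapes and falls for spread ones, which is the kind of frontal cue a learned estimator could exploit.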

Acknowledgements

The author thanks the Department of Science and Technology, Government of India, for supporting this research.


Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

The LNM Institute of Information Technology, Jaipur, India