Visual Speech Recognition with Selected Boundary Descriptors

Image Feature Detectors and Descriptors

Part of the book series: Studies in Computational Intelligence (SCI, volume 630)

Abstract

Lipreading is an important research area for human-computer interaction. In this chapter, we explore relevant features for a visual speech recognition system by representing a person's lip movement during speech as a set of spatial points on the lip boundary, termed boundary descriptors. In a real-time system, minimizing the input feature vector is important for efficiency. To reduce the dimensionality of our feature set and identify prominent visual features, we apply a feature selection technique, Minimum Redundancy Maximum Relevance (mRMR), to our set of boundary descriptors. A sub-optimal feature set is then computed from these visual features by applying certain evaluation criteria. The features contained in the sub-optimal set are analyzed to determine which are relevant. We find that a small set of spatial points on the lip contour is sufficient to achieve the speech recognition accuracy otherwise obtained with the complete set of boundary descriptors. Experiments also show that lip width and the corner lip segments are major visual speech articulators, and that the upper and lower lips are highly correlated.
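
The selection step described above can be illustrated with a minimal sketch of greedy mRMR. This is not the chapter's implementation: it assumes features have already been discretized, it uses the mutual-information difference criterion (relevance minus mean redundancy), and the feature names and toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedily pick k feature names by relevance minus mean redundancy.

    features: dict mapping feature name -> list of discretized values
    labels:   list of class labels, same length as each feature list
    """
    # Relevance: mutual information between each feature and the class label.
    relevance = {name: mutual_information(vals, labels)
                 for name, vals in features.items()}
    selected = []
    remaining = set(features)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            # Redundancy: mean mutual information with already selected features.
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy

        # Sorting makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: four word classes, eight samples.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
features = {
    "lip_width":     [0, 0, 0, 0, 1, 1, 1, 1],  # informative
    "lip_width_dup": [0, 0, 0, 0, 1, 1, 1, 1],  # exact copy, fully redundant
    "lip_height":    [0, 0, 1, 1, 0, 0, 1, 1],  # informative, complementary
    "noise":         [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the label
}
selected = mrmr_select(features, labels, k=2)
print(selected)
```

In this toy setup the two selected features are complementary: once one copy of the width feature is chosen, its duplicate is penalized for redundancy, and the noise feature is never preferred because its relevance is zero.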



Acknowledgments

The authors would like to thank the Department of Science & Technology, Government of India, for funding and supporting this project.

Author information

Corresponding author

Correspondence to Preety Singh.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Singh, P., Laxmi, V., Gaur, M.S. (2016). Visual Speech Recognition with Selected Boundary Descriptors. In: Awad, A., Hassaballah, M. (eds) Image Feature Detectors and Descriptors. Studies in Computational Intelligence, vol 630. Springer, Cham. https://doi.org/10.1007/978-3-319-28854-3_14

  • DOI: https://doi.org/10.1007/978-3-319-28854-3_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-28852-9

  • Online ISBN: 978-3-319-28854-3

  • eBook Packages: Engineering, Engineering (R0)
