Abstract
Lipreading is an important research area for human-computer interaction. In this chapter, we explore relevant features for a visual speech recognition system by representing a speaker's lip movement during speech with a set of spatial points on the lip boundary, termed boundary descriptors. In a real-time system, minimizing the input feature vector is important for efficiency. To reduce the dimensionality of our feature set and identify prominent visual features, we apply a feature selection technique, Minimum Redundancy Maximum Relevance (mRMR), to our set of boundary descriptors. A sub-optimal feature set is then computed from these visual features by applying certain evaluation criteria. Features contained in the sub-optimal set are analyzed to determine relevant features. A small set of spatial points on the lip contour proves sufficient to achieve the speech recognition accuracy otherwise obtained with the complete set of boundary descriptors. Experiments also show that lip width and corner lip segments are major visual speech articulators, and that the upper and lower lips are highly correlated.
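To make the selection step concrete, the following is a minimal sketch of greedy mRMR in its common difference form (relevance to the class minus mean redundancy with the features already chosen). It is an illustration under stated assumptions, not the chapter's implementation: the feature names `lip_width`, `lip_width_dup`, and `lip_height`, the function names, and the toy values are all hypothetical, and the features are assumed to be already discretized.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def mrmr_select(features, labels, k):
    """Greedy mRMR, difference form: relevance minus mean redundancy.

    features: dict mapping a feature name to a list of discrete values
    labels:   list of class labels, same length as each feature list
    k:        number of features to select
    """
    relevance = {name: mutual_information(vals, labels)
                 for name, vals in features.items()}
    selected = []
    remaining = set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data (invented for illustration, not from the chapter):
# four word classes, one informative feature, an exact duplicate of it,
# and a noisier but complementary feature.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
features = {
    "lip_width":     [0, 0, 0, 0, 1, 1, 1, 1],
    "lip_width_dup": [0, 0, 0, 0, 1, 1, 1, 1],
    "lip_height":    [0, 0, 1, 1, 0, 0, 1, 0],
}

print(mrmr_select(features, labels, 2))  # ['lip_width', 'lip_height']
```

Ranking by relevance alone would place the duplicate above `lip_height`, since both width features carry one full bit about the class. The redundancy term removes that advantage once `lip_width` is selected, which is the behavior that lets a small subset of descriptors stand in for the full set.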
Acknowledgments
The authors would like to thank the Department of Science & Technology, Government of India, for funding and supporting this project.
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Singh, P., Laxmi, V., Gaur, M.S. (2016). Visual Speech Recognition with Selected Boundary Descriptors. In: Awad, A., Hassaballah, M. (eds) Image Feature Detectors and Descriptors. Studies in Computational Intelligence, vol 630. Springer, Cham. https://doi.org/10.1007/978-3-319-28854-3_14
DOI: https://doi.org/10.1007/978-3-319-28854-3_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28852-9
Online ISBN: 978-3-319-28854-3
eBook Packages: Engineering, Engineering (R0)