Visual Speech Recognition with Selected Boundary Descriptors

Singh, Preety; Laxmi, Vijay; Gaur, Manoj Singh

doi:10.1007/978-3-319-28854-3_14

Part of the book series: Studies in Computational Intelligence (SCI, volume 630)

Abstract

Lipreading is an important research area in human-computer interaction. In this chapter, we explore relevant features for a visual speech recognition system by representing a speaker's lip movement during speech as a set of spatial points on the lip boundary, termed boundary descriptors. In a real-time system, keeping the input feature vector small is important for efficiency. To reduce the dimensionality of our feature set and identify prominent visual features, we apply a feature selection technique, Minimum Redundancy Maximum Relevance (mRMR), to our set of boundary descriptors. A sub-optimal feature set is then computed from these visual features by applying certain evaluation criteria. The features in the sub-optimal set are analyzed to determine which are relevant. A small set of spatial points on the lip contour proves sufficient to achieve the speech recognition accuracy otherwise obtained with the complete set of boundary descriptors. Experiments also show that lip width and the corner lip segments are major visual speech articulators, and that the upper and lower lips are highly correlated.
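
The abstract does not spell out how mRMR ranks the boundary descriptors, so the following is only a minimal sketch of the general technique, not the chapter's implementation. It assumes the features have already been discretized into integer bins and uses the difference form of the criterion (mutual information with the class label minus mean mutual information with the features already chosen). The function names `mutual_information` and `mrmr_select` and the toy data are hypothetical.

```python
import numpy as np


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    x_vals, x_idx = np.unique(x, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)
    # Joint histogram of the two variables, normalized to a probability table.
    joint = np.zeros((len(x_vals), len(y_vals)))
    np.add.at(joint, (x_idx, y_idx), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal of x, shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal of y, shape (1, ny)
    nz = joint > 0                         # skip empty cells to avoid log(0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))


def mrmr_select(features, labels, k):
    """Greedy mRMR: return the indices of k features, in selection order.

    Assumption: `features` is an (n_samples, n_features) array of
    already-discretized values and `labels` holds the class of each sample.
    """
    features = np.asarray(features)
    n_features = features.shape[1]
    k = min(k, n_features)
    # Relevance: mutual information between each feature and the class label.
    relevance = np.array(
        [mutual_information(features[:, j], labels) for j in range(n_features)]
    )
    # Start with the single most relevant feature.
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Redundancy: mean mutual information with the chosen features.
            redundancy = np.mean(
                [mutual_information(features[:, j], features[:, s]) for s in selected]
            )
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected


# Toy data (hypothetical): feature 1 duplicates feature 0, feature 2 is independent.
a = np.array([0, 0, 1, 1, 0, 0, 1, 1])
b = np.array([0, 1, 0, 1, 0, 1, 0, 1])
X = np.column_stack([a, a, b])
y = 2 * a + b
print(mrmr_select(X, y, 2))  # -> [0, 2]
```

In the toy data all three features are equally relevant to the label, but the duplicate column is skipped because it is fully redundant with the feature already selected. This is the behaviour that lets a small subset of points stand in for a larger, highly correlated set.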

Acknowledgments

The authors would like to thank the Department of Science & Technology, Government of India, for funding and supporting this project.

Author information

Authors and Affiliations

The LNM Institute of Information Technology, Post Sumel, Jaipur, India
Preety Singh
Malaviya National Institute of Technology, JLN Marg, Jaipur, India
Vijay Laxmi & Manoj Singh Gaur

Corresponding author

Correspondence to Preety Singh.

Editor information

Editors and Affiliations

Department of Computer Science, Luleå University of Technology, Luleå, Sweden
Ali Ismail Awad
Image and Video Processing Lab, Faculty, South Valley University, Qena, Egypt
Mahmoud Hassaballah

About this chapter

Cite this chapter

Singh, P., Laxmi, V., Gaur, M.S. (2016). Visual Speech Recognition with Selected Boundary Descriptors. In: Awad, A., Hassaballah, M. (eds) Image Feature Detectors and Descriptors. Studies in Computational Intelligence, vol 630. Springer, Cham. https://doi.org/10.1007/978-3-319-28854-3_14

DOI: https://doi.org/10.1007/978-3-319-28854-3_14
Published: 23 February 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28852-9
Online ISBN: 978-3-319-28854-3
eBook Packages: Engineering, Engineering (R0)
