International Journal of Speech Technology, Volume 18, Issue 2, pp 167–175

Recognition of isolated words using Zernike and MFCC features for audio visual speech recognition

  • Prashant Borde
  • Amarsinh Varpe
  • Ramesh Manza
  • Pravin Yannawar


Automatic speech recognition by machine is an active research topic in the signal processing domain and has attracted many researchers. In recent years, there have been many advances in automatic speech reading systems that combine audio and visual speech features to recognize words under noisy conditions. The objective of an audio-visual speech recognition system is to improve recognition accuracy. In this paper, we computed visual features using Zernike moments and audio features using mel frequency cepstral coefficients (MFCC) on the visual vocabulary of independent standard words dataset, which contains isolated utterances of city names by ten speakers. The visual features were normalized, and the dimension of the feature set was reduced by principal component analysis (PCA) in order to recognize the isolated word utterances in PCA space. Recognition of isolated words based on visual-only and audio-only features achieved accuracies of 63.88 % and 100 %, respectively.
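The visual pipeline summarized above (Zernike moments, normalization, PCA) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `radial_poly`, `zernike_magnitudes` and `normalize_and_reduce`, the maximum moment order of 4, the use of z-score normalization, the SVD-based PCA and the random placeholder frames are all assumptions made for illustration. The abstract does not state which normalization, moment order or number of principal components was used.

```python
from math import factorial

import numpy as np


def radial_poly(n, m, rho):
    """Zernike radial polynomial R_{n,m}(rho), for n >= |m| and n - |m| even."""
    m = abs(m)
    out = np.zeros_like(rho, dtype=float)
    for s in range((n - m) // 2 + 1):
        coeff = ((-1) ** s * factorial(n - s)
                 / (factorial(s)
                    * factorial((n + m) // 2 - s)
                    * factorial((n - m) // 2 - s)))
        out += coeff * rho ** (n - 2 * s)
    return out


def zernike_magnitudes(img, max_order=4):
    """Magnitudes |Z_{n,m}| of a grayscale image mapped onto the unit disk.

    max_order=4 is an illustrative choice, not a value taken from the paper.
    """
    h, w = img.shape
    y, x = np.mgrid[0:h, 0:w]
    # Map pixel indices to [-1, 1] on both axes (assumes h, w >= 2).
    xn = (2 * x - (w - 1)) / (w - 1)
    yn = (2 * y - (h - 1)) / (h - 1)
    rho = np.hypot(xn, yn)
    theta = np.arctan2(yn, xn)
    inside = rho <= 1.0                      # keep pixels inside the unit disk
    pixel_area = (2 / (w - 1)) * (2 / (h - 1))
    feats = []
    for n in range(max_order + 1):
        for m in range(n % 2, n + 1, 2):     # m >= 0 with n - m even
            # Complex conjugate of the basis function V_{n,m}.
            basis = radial_poly(n, m, rho) * np.exp(-1j * m * theta)
            z = (n + 1) / np.pi * np.sum(img[inside] * basis[inside]) * pixel_area
            feats.append(abs(z))             # the magnitude is rotation invariant
    return np.array(feats)


def normalize_and_reduce(features, n_components):
    """Z-score each feature column, then project onto the top principal axes."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0                      # avoid dividing by zero
    z = (features - mean) / std
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[:n_components].T


# Placeholder data only: random arrays standing in for cropped lip-region frames.
rng = np.random.default_rng(0)
frames = rng.random((20, 32, 32))
features = np.stack([zernike_magnitudes(f) for f in frames])
reduced = normalize_and_reduce(features, n_components=3)
print(features.shape, reduced.shape)
```

Using the moment magnitudes rather than the complex values discards the phase term that rotation introduces, which is the usual reason Zernike moments are chosen as shape descriptors. A classifier would then operate on the rows of `reduced`. The abstract does not name the classifier, so none is sketched here.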


Lip tracking; Zernike moment; Principal component analysis (PCA); Mel frequency cepstral coefficients (MFCC)



The authors gratefully acknowledge the support of the Department of Science and Technology (DST), which provided financial assistance for the Major Research Project sanctioned under the Fast Track Scheme for Young Scientists, vide sanction number SERB/1766/2013/14, and the authorities of Dr. Babasaheb Ambedkar Marathwada University, Aurangabad (MS), India, for providing the infrastructure for this research work.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Prashant Borde (1)
  • Amarsinh Varpe (1)
  • Ramesh Manza (2)
  • Pravin Yannawar (1)

  1. Vision and Intelligent System Lab, Department of Computer Science and IT, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India
  2. Biomedical Image Processing Lab, Department of Computer Science and IT, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India
