About neural-network algorithms application in viseme classification problem with face video in audiovisual speech recognition systems

Optical Memory and Neural Networks

Abstract

This paper considers phoneme recognition from a speaker's facial expressions in voice-activated control systems. We develop a neural-network recognition algorithm based on the phonetic words decoding method, under the requirement that voice commands be pronounced as isolated syllables. We present experimental results for the classification of visemes (the facial and lip positions corresponding to a particular phoneme) of Russian vowels. We show how classification accuracy depends on the classifier used (multilayer feed-forward network, support vector machine, k-nearest neighbor method), the image features (histogram of oriented gradients, eigenvectors, SURF local descriptors) and the type of camera (built-in camera or Kinect). The best speaker-dependent accuracy, obtained with histogram-of-oriented-gradients features and a support vector machine, is 85% for a built-in camera and 96% for Kinect depth maps.
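
The feature-plus-classifier pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a simplified histogram of oriented gradients (per-cell histograms with a single global L2 normalisation, without the block normalisation of the full method) and uses the k-nearest neighbor classifier, one of the three listed, rather than the best-performing support vector machine. The function names, the cell and bin sizes, and the toy bar images standing in for mouth-region crops are all hypothetical.

```python
import math


def hog_descriptor(image, cell=4, bins=8):
    """Simplified HOG: per-cell histograms of unsigned gradient orientation,
    weighted by gradient magnitude, L2-normalised over the whole image."""
    h, w = len(image), len(image[0])
    cells_y, cells_x = h // cell, w // cell
    hist = [0.0] * (cells_y * cells_x * bins)
    for y in range(cells_y * cell):
        for x in range(cells_x * cell):
            # Central differences; border pixels reuse the nearest valid neighbour.
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.atan2(gy, gx) % math.pi  # unsigned orientation
            b = min(int(angle / math.pi * bins), bins - 1)
            hist[((y // cell) * cells_x + (x // cell)) * bins + b] += mag
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]


def knn_predict(train, query, k=1):
    """Majority vote among the k training descriptors nearest to `query`
    (Euclidean distance). `train` is a list of (descriptor, label) pairs."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def toy_image(kind, shift=0, size=8):
    """Hypothetical stand-in for a mouth-region crop: a bright two-pixel bar,
    either horizontal or vertical, on a dark background."""
    img = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(3 + shift, 5 + shift):
            if kind == "horizontal":
                img[j][i] = 1.0
            else:
                img[i][j] = 1.0
    return img


# Two toy classes, two training samples each; the query is an unseen shift.
train = [
    (hog_descriptor(toy_image(kind, shift)), kind)
    for kind in ("horizontal", "vertical")
    for shift in (0, 1)
]
query = hog_descriptor(toy_image("vertical", shift=-1))
predicted = knn_predict(train, query, k=1)
print(predicted)
```

In a real system the toy images would be replaced by cropped mouth regions from video frames or depth maps, and the labels by viseme classes. The sketch says nothing about the accuracy figures reported in the paper.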



Author information

Correspondence to A. V. Savchenko.

About this article

Cite this article

Savchenko, A.V., Khokhlova, Y.I. About neural-network algorithms application in viseme classification problem with face video in audiovisual speech recognition systems. Opt. Mem. Neural Networks 23, 34–42 (2014). https://doi.org/10.3103/S1060992X14010068


  • DOI: https://doi.org/10.3103/S1060992X14010068
