Optical Memory and Neural Networks

Volume 23, Issue 1, pp 34–42

About neural-network algorithms application in viseme classification problem with face video in audiovisual speech recognition systems



The paper considers phoneme recognition from a speaker's facial expressions in voice-activated control systems. We have developed a neural-network recognition algorithm that uses the phonetic words decoding method and requires isolated-syllable pronunciation of voice commands. The paper presents experimental results for the classification of visemes (the facial and lip positions corresponding to a particular phoneme) of Russian vowels. We show how the classification accuracy depends on the classifier (multilayer feed-forward network, support vector machine, k-nearest neighbor method), the image features (histogram of oriented gradients, eigenvectors, SURF local descriptors), and the type of camera (built-in or Kinect). The best speaker-dependent recognition accuracy is 85% for a built-in camera and 96% for Kinect depth maps, obtained when classification uses the histogram of oriented gradients with the support vector machine.
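The general shape of such a pipeline (gradient-orientation features fed to a classifier) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes a simplified descriptor (per-cell histograms of unsigned gradient orientation with a single global L2 normalization and no overlapping blocks), uses the k-nearest neighbor method rather than the support vector machine to stay dependency-free, and substitutes synthetic striped images for real mouth regions. All function names, parameters, and the labels "A" and "B" are hypothetical.

```python
import math
from collections import Counter


def hog_features(image, cell=4, bins=9):
    """Simplified HOG: per-cell histograms of unsigned gradient orientation.

    `image` is a list of equal-length rows of grayscale values.
    Assumption: no overlapping-block normalization; the whole
    descriptor is L2-normalized once.
    """
    h, w = len(image), len(image[0])
    cells_y, cells_x = h // cell, w // cell
    hist = [0.0] * (cells_y * cells_x * bins)
    for y in range(cells_y * cell):
        for x in range(cells_x * cell):
            # Central differences, clamped at the image border.
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.atan2(gy, gx) % math.pi  # unsigned orientation
            b = min(int(angle / math.pi * bins), bins - 1)
            c = (y // cell) * cells_x + (x // cell)
            hist[c * bins + b] += mag  # magnitude-weighted vote
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors nearest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def stripes(size, horizontal, phase):
    """Toy stand-in for a mouth-region image: stripes of width 2."""
    return [[255 if (((y if horizontal else x) + phase) // 2) % 2 else 0
             for x in range(size)] for y in range(size)]


# Hypothetical two-class training set: "A" = horizontal edges,
# "B" = vertical edges, three phase-shifted samples per class.
train = [(hog_features(stripes(8, True, p)), "A") for p in range(3)]
train += [(hog_features(stripes(8, False, p)), "B") for p in range(3)]

# Classify an unseen phase shift of each pattern.
predicted_a = knn_predict(train, hog_features(stripes(8, True, 3)))
predicted_b = knn_predict(train, hog_features(stripes(8, False, 3)))
```

In this toy setup the two classes differ only in edge orientation, so their descriptors occupy different histogram bins and the nearest-neighbor vote separates them. Real viseme images would need face and mouth localization first, and the accuracy figures quoted above apply only to the authors' experiments, not to this sketch.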


Keywords: neural-network-aided image recognition; audiovisual speech recognition; phonetic decoding method; histogram of oriented gradients; support vector machine; Kinect





Copyright information

© Allerton Press, Inc. 2014

Authors and Affiliations

  1. Department of Business Informatics and Applied Mathematics, National Research University High School of Economics, Nizhni Novgorod, Russia
