Abstract
This paper considers phoneme recognition from a speaker's facial expressions in voice-activated control systems. We develop a neural-network recognition algorithm based on the phonetic words decoding method, under the requirement that voice commands be pronounced as isolated syllables. We present experimental results on classifying the visemes (the facial and lip positions corresponding to a particular phoneme) of Russian vowels. We show how classification accuracy depends on the classifier (multilayer feed-forward network, support vector machine, k-nearest neighbor method), the image features (histogram of oriented gradients, eigenvectors, SURF local descriptors), and the type of camera (built-in or Kinect). The best speaker-dependent recognition accuracy, 85% for a built-in camera and 96% for Kinect depth maps, is obtained with histogram of oriented gradients features and a support vector machine.
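As a rough illustration of the kind of pipeline summarized above (gradient-orientation features fed to a classifier), the following minimal sketch may help. It is not the authors' implementation. It assumes a simplified histogram of oriented gradients with one global normalization instead of the block normalization of Dalal and Triggs, and it uses the k-nearest neighbor method, one of the classifiers compared in the paper, rather than the best-performing support vector machine. The function names and the synthetic 8x8 "bar" images standing in for mouth regions are hypothetical.

```python
import math


def hog_features(image, cell=4, bins=9):
    """Simplified HOG: per-cell histograms of unsigned gradient orientation,
    weighted by gradient magnitude, with one global L2 normalization.
    (Assumption: the full method normalizes over overlapping blocks.)"""
    h, w = len(image), len(image[0])
    features = []
    for cy in range(0, h - cell + 1, cell):
        for cx in range(0, w - cell + 1, cell):
            hist = [0.0] * bins
            for y in range(cy, cy + cell):
                for x in range(cx, cx + cell):
                    # Central differences, clamped at the image border.
                    gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
                    gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
                    magnitude = math.hypot(gx, gy)
                    angle = math.atan2(gy, gx) % math.pi  # unsigned orientation
                    b = min(int(angle / math.pi * bins), bins - 1)
                    hist[b] += magnitude
            features.extend(hist)
    norm = math.sqrt(sum(v * v for v in features)) or 1.0
    return [v / norm for v in features]


def knn_predict(train, query, k=1):
    """Majority vote among the k training vectors closest to the query.
    `train` is a list of (feature_vector, label) pairs."""
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(vec, query)), label)
        for vec, label in train
    )
    votes = {}
    for _, label in ranked[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def toy_image(kind, shift=0, size=8):
    """Synthetic stand-in for a mouth region: a bright two-pixel-wide bar,
    either horizontal or vertical. Hypothetical data, not from the paper."""
    img = [[0] * size for _ in range(size)]
    for i in range(size):
        for t in (3 + shift, 4 + shift):
            if kind == "horizontal":
                img[t][i] = 255
            else:
                img[i][t] = 255
    return img


# Train on two shifted examples per class, then classify an unseen shift.
train = [
    (hog_features(toy_image(kind, shift)), kind)
    for kind in ("horizontal", "vertical")
    for shift in (-1, 0)
]
print(knn_predict(train, hog_features(toy_image("horizontal", 1))))
```

In practice the pure-Python feature extractor would likely be replaced by a library implementation, and the classifier choice would be validated per speaker, since the reported accuracies are speaker-dependent.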
Cite this article
Savchenko, A.V., Khokhlova, Y.I. About neural-network algorithms application in viseme classification problem with face video in audiovisual speech recognition systems. Opt. Mem. Neural Networks 23, 34–42 (2014). https://doi.org/10.3103/S1060992X14010068