About neural-network algorithms application in viseme classification problem with face video in audiovisual speech recognition systems

Savchenko, A. V.; Khokhlova, Ya. I.

doi:10.3103/S1060992X14010068

Published: 26 March 2014

Volume 23, pages 34–42, (2014)

Optical Memory and Neural Networks

Abstract

The paper considers phoneme recognition from a speaker's facial expressions in voice-activated control systems. We have developed a neural-network recognition algorithm based on the phonetic words decoding method and on the requirement that voice commands be pronounced as isolated syllables. The paper presents experimental results for the classification of visemes (the facial and lip positions corresponding to a particular phoneme) of Russian vowels. We show how the classification accuracy depends on the classifier (multilayer feed-forward network, support vector machine, k-nearest neighbor method), the image features (histogram of oriented gradients, eigenvectors, SURF local descriptors), and the type of camera (built-in or Kinect). The best speaker-dependent recognition accuracy is 85% for a built-in camera and 96% for Kinect depth maps, obtained with histogram of oriented gradients features and a support vector machine.
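To make the feature-plus-classifier pipeline named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified histogram of oriented gradients (per-cell orientation histograms with a single global L2 normalisation, without the block normalisation of the full method) and swaps in a 1-nearest-neighbor classifier, one of the classifiers compared in the paper, in place of the support vector machine so that the example needs only the Python standard library. The function names, the 16x16 synthetic bar images, and the class labels are all hypothetical and stand in for real mouth-region frames.

```python
import math


def hog_descriptor(image, cell=4, bins=9):
    """Simplified HOG: per-cell histograms of unsigned gradient orientation,
    weighted by gradient magnitude, with one global L2 normalisation."""
    height, width = len(image), len(image[0])
    cells_y, cells_x = height // cell, width // cell
    hist = [0.0] * (cells_y * cells_x * bins)
    for y in range(cells_y * cell):
        for x in range(cells_x * cell):
            # Central differences, clamped at the image border.
            gx = image[y][min(x + 1, width - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, height - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            # Unsigned orientation folded into [0, pi), then binned.
            angle = math.atan2(gy, gx) % math.pi
            b = min(int(angle / math.pi * bins), bins - 1)
            hist[((y // cell) * cells_x + x // cell) * bins + b] += magnitude
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]


def nearest_neighbour(train, query):
    """1-NN with squared Euclidean distance; train is a list of
    (descriptor, label) pairs."""
    best = min(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    return best[1]


def toy_image(kind, size=16, shift=0):
    """Hypothetical stand-in for a mouth-region frame: a bright horizontal
    or vertical bar on a dark background, optionally shifted by a pixel."""
    img = [[0.0] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            if kind == "horizontal" and 6 + shift <= y < 10 + shift and 2 <= x < 14:
                img[y][x] = 1.0
            if kind == "vertical" and 6 + shift <= x < 10 + shift and 2 <= y < 14:
                img[y][x] = 1.0
    return img


# One training example per (hypothetical) class.
train = [
    (hog_descriptor(toy_image("horizontal")), "horizontal"),
    (hog_descriptor(toy_image("vertical")), "vertical"),
]

# Classify a slightly shifted, previously unseen frame.
predicted = nearest_neighbour(train, hog_descriptor(toy_image("vertical", shift=1)))
print(predicted)
```

In a realistic setting the toy images would be replaced by cropped mouth regions (or depth maps), and the nearest-neighbor step by a trained support vector machine or feed-forward network; the descriptor-then-classifier structure stays the same.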

Author information

Authors and Affiliations

Department of Business Informatics and Applied Mathematics, National Research University High School of Economics, Nizhni Novgorod, Russia
A. V. Savchenko & Ya. I. Khokhlova

Corresponding author

Correspondence to A. V. Savchenko.

About this article

Cite this article

Savchenko, A.V., Khokhlova, Y.I. About neural-network algorithms application in viseme classification problem with face video in audiovisual speech recognition systems. Opt. Mem. Neural Networks 23, 34–42 (2014). https://doi.org/10.3103/S1060992X14010068

Received: 03 December 2013
Accepted: 24 December 2013
Published: 26 March 2014
Issue Date: January 2014
DOI: https://doi.org/10.3103/S1060992X14010068
