About neural-network algorithms application in viseme classification problem with face video in audiovisual speech recognition systems
The paper considers phoneme recognition from a speaker's facial expressions in voice-activated control systems. We develop a neural-network recognition algorithm based on the method of phonetic decoding of words, under the requirement that voice commands are pronounced as isolated syllables. We present experimental results on the classification of visemes (the facial and lip positions corresponding to a particular phoneme) for Russian vowels. We show how classification accuracy depends on the classifier (multilayer feed-forward network, support vector machine, k-nearest neighbor method), the image features (histogram of oriented gradients, eigenvectors, SURF local descriptors), and the type of camera (built-in camera or Kinect). The best speaker-dependent accuracy, obtained with histogram of oriented gradients features and a support vector machine, is 85% for a built-in camera and 96% for Kinect depth maps.
Keywords: neural-network-aided image recognition, audiovisual speech recognition, phonetic decoding method, histogram of oriented gradients, support vector machine, Kinect
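To make the feature-plus-classifier pipeline named in the abstract more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It computes a simplified, single-cell histogram of oriented gradients (no cell grid and no block normalization, unlike the full descriptor) and classifies it with a k-nearest neighbor rule, one of the classifiers compared in the paper; a support vector machine is not shown because it would need an external library. The function names, the synthetic ramp images, and the labels `viseme_1` and `viseme_2` are all hypothetical placeholders rather than data from the paper.

```python
import math


def hog_cell_histogram(image, bins=9):
    """Simplified HOG: one unsigned orientation histogram (0-180 degrees)
    over the whole image, weighted by gradient magnitude and L2-normalized."""
    height, width = len(image), len(image[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central-difference gradients.
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # The modulo guards against a rounded angle of exactly 180.0.
            hist[int(angle // bin_width) % bins] += magnitude
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]


def knn_classify(query, train_features, train_labels, k=1):
    """Majority vote among the k training vectors closest in Euclidean distance."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    order = sorted(range(len(train_features)),
                   key=lambda i: sq_dist(query, train_features[i]))
    votes = {}
    for i in order[:k]:
        votes[train_labels[i]] = votes.get(train_labels[i], 0) + 1
    return max(votes, key=votes.get)


def horizontal_ramp(size, slope):
    """Synthetic image whose intensity grows from left to right."""
    return [[slope * x for x in range(size)] for _ in range(size)]


def vertical_ramp(size, slope):
    """Synthetic image whose intensity grows from top to bottom."""
    return [[slope * y for _ in range(size)] for y in range(size)]


# Toy training set: two synthetic patterns standing in for two viseme classes.
train_features = [hog_cell_histogram(horizontal_ramp(8, 1.0)),
                  hog_cell_histogram(vertical_ramp(8, 1.0))]
train_labels = ["viseme_1", "viseme_2"]

# A steeper vertical ramp has the same gradient orientation, so after
# normalization its histogram matches the second training pattern.
query = hog_cell_histogram(vertical_ramp(8, 3.0))
predicted = knn_classify(query, train_features, train_labels, k=1)
```

In a real system the input would be a cropped mouth region from a camera frame or a depth map, and the descriptor would be computed over a grid of cells; the sketch only shows why gradient orientation, rather than absolute intensity, drives the classification.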