Designing Advanced Geometric Features for Automatic Russian Visual Speech Recognition

  • Denis Ivanko
  • Dmitry Ryumin
  • Alexandr Axyonov
  • Miloš Železný
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11096)

Abstract

The use of visual information plays an increasingly important role in automatic speech recognition. Audio-only systems have reached a certain accuracy threshold, and many researchers see the use of the visual modality as a way to obtain better results. Although the audio modality of speech is much more informative than the visual one, their proper fusion can improve both the accuracy and the robustness of the entire recognition system, as has been demonstrated in practice by many research groups. However, no agreement among researchers on the optimal set of visual features has been reached. In this paper, we investigate this issue in more detail and propose advanced geometry-based visual features for an automatic Russian lip-reading system. The experiments were conducted on the collected HAVRUS audio-visual speech database. The average viseme recognition accuracy of our system trained on the entire corpus is 40.62%. We also evaluated the main state-of-the-art methods for visual speech recognition, applying them to continuous Russian speech recorded with a high-speed camera (200 frames per second).
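To make the notion of geometry-based visual features concrete, the following is a minimal sketch of how such features are commonly derived from facial landmarks: pairwise Euclidean distances between mouth points, normalized for scale. It assumes dlib's standard 68-point landmark layout (indices 48-67 cover the mouth); the specific distance pairs and normalization are illustrative, not the exact feature set used in the paper.

```python
# Sketch: geometry-based lip features from facial landmarks (one video frame).
# Assumes the dlib 68-point layout, where indices 48-67 mark the mouth contour.
import numpy as np

MOUTH = slice(48, 68)  # outer + inner lip contour in the 68-point model

def lip_geometry_features(landmarks: np.ndarray) -> np.ndarray:
    """landmarks: (68, 2) array of (x, y) pixel coordinates for one frame."""
    mouth = landmarks[MOUTH]                      # (20, 2) mouth points
    width = np.linalg.norm(mouth[6] - mouth[0])   # corner-to-corner (points 54 and 48)
    height = np.linalg.norm(mouth[9] - mouth[3])  # top-to-bottom (points 57 and 51)
    # Pairwise Euclidean distances between all mouth points,
    # divided by the mouth width for scale invariance.
    diffs = mouth[:, None, :] - mouth[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    iu = np.triu_indices(len(mouth), k=1)         # upper triangle: unique pairs
    return np.concatenate(([height / width], dists[iu] / width))

# Example on synthetic landmarks (a real system would use a landmark detector):
frame = np.random.default_rng(0).uniform(0, 200, size=(68, 2))
feats = lip_geometry_features(frame)
print(feats.shape)  # 1 aspect ratio + C(20, 2) = 190 pairwise distances
```

Computing such a vector per frame of the high-speed recording yields the feature sequence that a viseme classifier is trained on.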

Keywords

Lip-reading · Automatic speech recognition · Visual speech decoding · Visual features · Geometric features · Russian speech

Acknowledgments

This research is financially supported by the Ministry of Education and Science of the Russian Federation, agreement No. 14.616.21.0095 (reference RFMEFI61618X0095) and by the Ministry of Education of the Czech Republic, project No. LTARF18017.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Denis Ivanko (1)
  • Dmitry Ryumin (1)
  • Alexandr Axyonov (1)
  • Miloš Železný (2)
  1. St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg, Russia
  2. University of West Bohemia, Pilsen, Czech Republic