Using a High-Speed Video Camera for Robust Audio-Visual Speech Recognition in Acoustically Noisy Conditions

  • Denis Ivanko (Email author)
  • Alexey Karpov
  • Dmitry Ryumin
  • Irina Kipyatkova
  • Anton Saveliev
  • Victor Budkov
  • Dmitriy Ivanko
  • Miloš Železný
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10458)


The purpose of this study is to develop a robust audio-visual speech recognition system and to investigate the influence of high-speed video data on the recognition accuracy of continuous Russian speech under different noisy conditions. The developed experimental setup and the collected multimodal database allow us to explore the impact of high-speed video recordings at various frame rates, from the standard 25 frames per second (fps) up to a high-speed 200 fps. At the moment there is no research that objectively measures the dependence of speech recognition accuracy on the video frame rate, and no relevant audio-visual databases exist for model training. In this paper, we try to fill this gap for continuous Russian speech. Our evaluation experiments show an increase in absolute recognition accuracy of up to 3% and demonstrate that using the high-speed JAI Pulnix camera at 200 fps yields better recognition results under different acoustically noisy conditions.


Keywords: Audio-visual speech recognition · High-speed video camera · Noisy conditions · Russian speech · Visemes · Multimodal communication
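For context (not part of the original abstract): one practical reason the video frame rate matters in audio-visual recognition is that acoustic features are typically computed at around 100 frames per second (10 ms hop), so visual features from a 25 fps camera must be heavily interpolated to align with the audio stream, while a 200 fps camera captures lip motion at a rate exceeding the audio frame rate. The sketch below is purely illustrative and is not the paper's actual pipeline; the function name, feature dimensions, and the use of linear interpolation are assumptions.

```python
import numpy as np

def align_visual_to_audio(visual_feats, video_fps, audio_rate=100.0):
    """Resample per-frame visual features to the acoustic feature rate
    by linear interpolation along the time axis (illustrative sketch).

    visual_feats: (n_frames, dim) array of visual features.
    video_fps:    camera frame rate, e.g. 25 or 200.
    audio_rate:   acoustic feature rate in Hz (100 Hz = 10 ms frames).
    """
    n_frames, dim = visual_feats.shape
    t_video = np.arange(n_frames) / video_fps          # visual frame timestamps
    n_audio = int(round(n_frames / video_fps * audio_rate))
    t_audio = np.arange(n_audio) / audio_rate          # audio frame timestamps
    aligned = np.empty((n_audio, dim))
    for d in range(dim):                               # interpolate each feature dim
        aligned[:, d] = np.interp(t_audio, t_video, visual_feats[:, d])
    return aligned

# A 1-second clip at 25 fps has only 25 visual frames stretched to 100
# audio-rate frames; at 200 fps all 100 target frames fall between
# genuinely observed samples, so no temporal detail is invented.
print(align_visual_to_audio(np.random.rand(25, 32), 25).shape)    # (100, 32)
print(align_visual_to_audio(np.random.rand(200, 32), 200).shape)  # (100, 32)
```

This makes the abstract's comparison concrete: raising the capture rate from 25 to 200 fps changes the visual stream from an undersampled signal that must be interpolated up to the fusion rate into one that oversamples it.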



This research is partially supported by the Russian Foundation for Basic Research (projects No. 15-07-04415, 15-07-04322 and 16-37-60085), by the Council for Grants of the President of the Russian Federation (projects No. MD-254.2017.8, MK-1000.2017.8 and MK-7925.2016.9), by the Government of Russia (grant No. 074-U01), by a grant of the University of West Bohemia (project No. SGS-2016-039), and by the Ministry of Education, Youth and Sports of the Czech Republic (project No. LO1506).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Denis Ivanko (1, 3, 4) — Email author
  • Alexey Karpov (1, 3)
  • Dmitry Ryumin (1, 3)
  • Irina Kipyatkova (1)
  • Anton Saveliev (1)
  • Victor Budkov (1)
  • Dmitriy Ivanko (3)
  • Miloš Železný (2)

  1. St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg, Russia
  2. University of West Bohemia, Pilsen, Czech Republic
  3. ITMO University, St. Petersburg, Russia
  4. Ulm University, Ulm, Germany
