HAVRUS Corpus: High-Speed Recordings of Audio-Visual Russian Speech

  • Vasilisa Verkhodanova
  • Alexander Ronzhin
  • Irina Kipyatkova
  • Denis Ivanko
  • Alexey Karpov
  • Miloš Železný
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9811)

Abstract

In this paper we present a software-hardware complex for collecting audio-visual speech databases with a high-speed camera and a dynamic microphone. We describe the architecture of the developed software and give some details of HAVRUS, the collected database of Russian audio-visual speech. The software synchronizes and fuses the audio and video channels and accounts for a natural property of human speech: the asynchrony between the audio and visual speech modalities. The collected corpus comprises recordings of 20 native speakers of Russian and is intended for further research and experiments on audio-visual Russian speech recognition.

Keywords

  • Multimodal database
  • Audiovisual speech
  • Speech technology
  • Automatic speech recognition

Notes

Acknowledgments

This research is financially supported by the Ministry of Education and Science of the Russian Federation, agreement No 14.616.21.0056 (reference RFMEFI61615X0056), project “Research and development of audio-visual speech recognition system based on a microphone and a high-speed camera”, as well as by the Czech Ministry of Education, Youth and Sports, project No LO1506.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Vasilisa Verkhodanova (1)
  • Alexander Ronzhin (1)
  • Irina Kipyatkova (1)
  • Denis Ivanko (1)
  • Alexey Karpov (1)
  • Miloš Železný (2)

  1. SPIIRAS, St. Petersburg, Russia
  2. University of West Bohemia, Pilsen, Czech Republic
