Developing of a Software–Hardware Complex for Automatic Audio–Visual Speech Recognition in Human–Robot Interfaces

Ivanko, Denis; Ryumin, Dmitry; Karpov, Alexey

doi:10.1007/978-981-16-2814-6_23

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 232)

Abstract

In recent years, spoken interaction has become increasingly common in modern human–robot interfaces, and users value this natural form of communication. As the technology matures, such “native” human–robot interfaces are likely to develop further. In this paper, we propose an architecture and develop a software–hardware complex for automatic speech recognition with small and medium-sized vocabularies, intended for use in robots. A distinctive feature of the complex is an audio–visual speech synchronization module, which (1) detects the speech signal in the audio data and (2) accounts for the natural asynchrony between acoustic and visual speech. On this basis, it can (3) synchronize the speech segments of the audio and video streams in time. Another distinctive feature is a modality-combining module, which (1) combines informative data from the audio and video signals and (2) adjusts the weight of each modality depending on the signal-to-noise ratio (SNR), so that recognition accuracy stays as high as possible even in acoustically noisy conditions.
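The abstract does not state how the modality weights are computed, so the following is only a minimal sketch of one common way to realize SNR-dependent weighting: a weighted sum of per-word log-likelihoods, where the audio weight grows with the estimated SNR. The logistic mapping, its midpoint and slope, the function names, and the toy scores are all assumptions made for illustration and are not taken from the paper.

```python
import math


def audio_weight(snr_db, midpoint_db=10.0, slope=0.3):
    """Map an SNR estimate (in dB) to an audio-stream weight in (0, 1).

    The logistic shape, midpoint and slope are illustrative assumptions,
    not values reported by the authors.
    """
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint_db)))


def fuse_scores(audio_loglik, video_loglik, snr_db):
    """Combine per-word log-likelihoods; the video weight is 1 - audio weight."""
    w = audio_weight(snr_db)
    return {
        word: w * audio_loglik[word] + (1.0 - w) * video_loglik[word]
        for word in audio_loglik
    }


def recognize(audio_loglik, video_loglik, snr_db):
    """Return the word with the highest fused score."""
    fused = fuse_scores(audio_loglik, video_loglik, snr_db)
    return max(fused, key=fused.get)


# Hypothetical scores: the audio model prefers "stop", the video model "start".
audio = {"start": -4.0, "stop": -2.0}
video = {"start": -1.0, "stop": -5.0}

clean = recognize(audio, video, snr_db=30.0)  # audio weight close to 1
noisy = recognize(audio, video, snr_db=-5.0)  # audio weight close to 0
print(clean, noisy)
```

In this toy setting the decision follows the audio scores at high SNR and the video scores at low SNR, which is the behaviour the weighting is meant to produce. A real system would estimate the SNR from the detected speech and non-speech segments and tune the mapping on held-out data.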

Acknowledgements

This research is supported by the Russian Foundation for Basic Research (project No. 19-29-09081 мк).

Author information

Authors and Affiliations

St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, 39, 14th Line, 199178, St. Petersburg, Russia
Denis Ivanko, Dmitry Ryumin & Alexey Karpov

Editor information

Editors and Affiliations

St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia
Andrey Ronzhin
State University of Airspace Instrumentation, St. Petersburg, Russia
Vladislav Shishlakov

About this paper

Cite this paper

Ivanko, D., Ryumin, D., Karpov, A. (2022). Developing of a Software–Hardware Complex for Automatic Audio–Visual Speech Recognition in Human–Robot Interfaces. In: Ronzhin, A., Shishlakov, V. (eds) Electromechanics and Robotics. Smart Innovation, Systems and Technologies, vol 232. Springer, Singapore. https://doi.org/10.1007/978-981-16-2814-6_23

DOI: https://doi.org/10.1007/978-981-16-2814-6_23
Published: 29 August 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-2813-9
Online ISBN: 978-981-16-2814-6
eBook Packages: Intelligent Technologies and Robotics (R0)
