Abstract
In recent years, speech has become an increasingly common input modality in human–robot interfaces, and users value this natural form of communication. As the technology matures, such “native” human–robot interfaces are likely to develop further. In this paper, we propose an architecture and develop a software–hardware complex for automatic speech recognition with small and medium-sized vocabularies, intended for use in robots. A distinctive feature of the complex is an audio–visual speech synchronization module, which (1) detects the speech signal in the audio data and (2) takes into account the natural asynchrony between acoustic and visual speech. On this basis, it can (3) synchronize the speech segments of the audio and video streams in time. Another distinctive feature is a modality fusion module, which (1) combines informative data from the audio and video signals and (2) adjusts the weight of each modality depending on the signal-to-noise ratio (SNR), which allows optimal recognition accuracy to be achieved even in acoustically noisy conditions.
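The SNR-dependent weighting of modalities mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the logistic mapping, the names `audio_weight` and `fuse`, the parameters `midpoint_db` and `slope`, and the toy log-likelihood values are all assumptions chosen for illustration. The sketch only shows the general idea of trusting the audio stream more when the SNR is high and the video stream more when it is low.

```python
import math


def audio_weight(snr_db, midpoint_db=10.0, slope=0.3):
    """Map an SNR estimate in dB to an audio weight in (0, 1).

    A logistic curve is assumed here; midpoint_db and slope are
    hypothetical tuning values, not taken from the paper.
    """
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint_db)))


def fuse(audio_loglik, video_loglik, snr_db):
    """Combine per-word log-likelihoods from both modalities.

    The fused score is a weighted sum: w * audio + (1 - w) * video,
    where w grows with the SNR. Returns the best word and all scores.
    """
    w = audio_weight(snr_db)
    scores = {
        word: w * audio_loglik[word] + (1.0 - w) * video_loglik[word]
        for word in audio_loglik
    }
    return max(scores, key=scores.get), scores


# Toy example: the audio model prefers "stop", the video model prefers "go".
audio = {"stop": -4.0, "go": -9.0}
video = {"stop": -8.0, "go": -5.0}

clean_word, _ = fuse(audio, video, snr_db=30.0)   # audio dominates
noisy_word, _ = fuse(audio, video, snr_db=-10.0)  # video dominates
print(clean_word, noisy_word)
```

With these toy values, the decision follows the audio model at 30 dB and the video model at -10 dB. A real system would estimate the SNR from the signal and tune the weighting on held-out data.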
Acknowledgements
This research is supported by the Russian Foundation for Basic Research (project No. 19-29-09081 мк).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Ivanko, D., Ryumin, D., Karpov, A. (2022). Developing of a Software–Hardware Complex for Automatic Audio–Visual Speech Recognition in Human–Robot Interfaces. In: Ronzhin, A., Shishlakov, V. (eds) Electromechanics and Robotics. Smart Innovation, Systems and Technologies, vol 232. Springer, Singapore. https://doi.org/10.1007/978-981-16-2814-6_23
DOI: https://doi.org/10.1007/978-981-16-2814-6_23
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-2813-9
Online ISBN: 978-981-16-2814-6
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)