An automatic multimodal speech recognition system with audio and video information

Karpov, A. A.

doi:10.1134/S000511791412008X


Intellectual Control Systems
Published: 17 December 2014

Volume 75, pages 2190–2200 (2014)

Automation and Remote Control

A. A. Karpov^1,2


Abstract

This paper presents the mathematical model and software implementation of an automatic Russian speech recognition system that applies digital processing and analysis to audiovisual signals from a microphone and a video camera. It describes probabilistic modeling of audiovisual speech based on coupled hidden Markov models, information fusion methods with weight coefficients for the audio and video speech modalities, and the parametric representation of the signals. Quantitative results for multimodal recognition of continuous Russian speech indicate high accuracy and reliability of the automatic system.
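As an illustration of what fusion with weight coefficients can look like, the sketch below combines per-hypothesis log-likelihoods from an audio stream and a video stream as a weighted sum, a common form in multi-stream HMM recognizers. It is a minimal sketch and not the article's implementation: the function names, the constraint that the two weights sum to one, and all score values are assumptions made for the example.

```python
def fuse_log_likelihoods(audio_ll, video_ll, audio_weight):
    """Return weighted sums of audio and video log-likelihoods per hypothesis.

    audio_ll, video_ll: dicts mapping a hypothesis label to a log-likelihood.
    audio_weight: weight of the audio stream in [0, 1]; the video stream
    receives 1 - audio_weight (an assumed normalization, chosen for the sketch).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if audio_ll.keys() != video_ll.keys():
        raise ValueError("both streams must score the same hypotheses")
    video_weight = 1.0 - audio_weight
    return {
        label: audio_weight * audio_ll[label] + video_weight * video_ll[label]
        for label in audio_ll
    }


def recognize(audio_ll, video_ll, audio_weight):
    """Pick the hypothesis with the highest fused log-likelihood."""
    fused = fuse_log_likelihoods(audio_ll, video_ll, audio_weight)
    return max(fused, key=fused.get)


# Made-up scores for two hypotheses: audio slightly prefers "net",
# video clearly prefers "da".
audio_scores = {"da": -12.0, "net": -10.0}
video_scores = {"da": -5.0, "net": -9.0}

# Trusting audio more (for example, in quiet conditions).
print(recognize(audio_scores, video_scores, audio_weight=0.7))
# Trusting video more (for example, in acoustic noise).
print(recognize(audio_scores, video_scores, audio_weight=0.3))
```

In this toy example the decision flips as the audio weight drops, which is the reason for making the weights adjustable: the less reliable modality can be down-weighted when recording conditions degrade.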



Author information

Authors and Affiliations

St. Petersburg Institute of Informatics and Automation, Russian Academy of Sciences, St. Petersburg, Russia
A. A. Karpov
ITMO University, St. Petersburg, Russia
A. A. Karpov


Corresponding author

Correspondence to A. A. Karpov.


About this article

Cite this article

Karpov, A.A. An automatic multimodal speech recognition system with audio and video information. Autom Remote Control 75, 2190–2200 (2014). https://doi.org/10.1134/S000511791412008X


Received: 28 March 2012
Published: 17 December 2014
Issue Date: December 2014
DOI: https://doi.org/10.1134/S000511791412008X
