Automation and Remote Control, Volume 75, Issue 12, pp. 2190–2200

An automatic multimodal speech recognition system with audio and video information

  • A. A. Karpov
Intellectual Control Systems


This paper presents the mathematical model and software implementation of an automatic Russian speech recognition system that applies digital processing and analysis to audiovisual signals from a microphone and a video camera. It describes the probabilistic modeling of audiovisual speech based on coupled hidden Markov models, information fusion methods with weight coefficients for the audio and video speech modalities, and the parametric representation of signals. Quantitative results in multimodal recognition of continuous Russian speech indicate high accuracy and reliability of the automatic system.
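The abstract mentions information fusion with weight coefficients for the audio and video modalities but does not state the formula. As an illustration only, the sketch below uses a common multi-stream formulation in which each hypothesis receives a weighted sum of per-stream log-likelihoods, with the two weights summing to one. The function names, the hypothesis labels, and all scores are hypothetical and are not taken from the paper.

```python
def fuse_log_likelihoods(audio_ll, video_ll, audio_weight):
    """Weighted sum of per-hypothesis log-likelihoods from two streams.

    Assumption: the weights are non-negative and sum to 1, so the video
    weight is derived from the audio weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    video_weight = 1.0 - audio_weight
    return {
        hyp: audio_weight * audio_ll[hyp] + video_weight * video_ll[hyp]
        for hyp in audio_ll
    }


def best_hypothesis(fused):
    """Return the hypothesis with the highest fused score."""
    return max(fused, key=fused.get)


# Made-up log-likelihoods for two candidate words (illustrative only).
audio_ll = {"da": -4.0, "net": -5.0}
video_ll = {"da": -9.0, "net": -3.0}

# High audio weight, as might suit clean acoustic conditions.
clean = fuse_log_likelihoods(audio_ll, video_ll, audio_weight=0.9)
# Low audio weight, as might suit noisy acoustic conditions.
noisy = fuse_log_likelihoods(audio_ll, video_ll, audio_weight=0.3)

print(best_hypothesis(clean), best_hypothesis(noisy))
```

With these invented scores the audio stream dominates the decision at a weight of 0.9 and the video stream dominates at 0.3, which is the kind of trade-off that modality weights are meant to control.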


Speech Recognition; Automatic Speech Recognition; Speech Recognition System; Speech Corpus; Automatic Speech Recognition System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Pleiades Publishing, Ltd. 2014

Authors and Affiliations

  1. St. Petersburg Institute of Informatics and Automation, Russian Academy of Sciences, St. Petersburg, Russia
  2. ITMO University, St. Petersburg, Russia
