Abstract
To date, multimodal speech recognition systems that process both audio and video signals show significantly better results than their unimodal counterparts. Researchers generally divide the audio–visual speech recognition problem into two parts: first, extracting the most informative features from each modality, and second, fusing the two modalities in the most effective way. Ultimately, this leads to an improvement in speech recognition accuracy. Almost all modern studies follow this approach using video data recorded at the standard rate of 25 frames per second. The choice of this recording rate is easily explained, since the vast majority of existing audio–visual databases were recorded at it. However, it should be noted that 25 frames per second is a world standard across many areas and was never specifically derived for speech recognition tasks. The main purpose of this study is to investigate the effect of high-speed video data (up to 200 frames per second) on speech recognition accuracy, and to find out whether a high-speed video camera makes speech recognition systems more robust to acoustic noise. To this end, we recorded a database of audio–visual Russian speech with high-speed video, consisting of recordings of 20 speakers, each pronouncing 200 phrases of continuous Russian speech. Experiments performed on this database showed an improvement in the absolute speech recognition rate of up to 3.10%. We also showed that using a high-speed camera at 200 fps achieves better recognition results under different acoustically noisy conditions (signal-to-noise ratio varied between 40 and 0 dB) with different types of noise (e.g. white noise, babble noise).
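The noisy-condition evaluation described above requires mixing noise into clean speech at controlled signal-to-noise ratios. A minimal sketch of such SNR-controlled mixing is shown below; this helper (`mix_at_snr`) is illustrative only and is not the authors' actual evaluation pipeline.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the resulting speech-to-noise
    power ratio equals `snr_db` decibels."""
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Target noise power: P_speech / P_noise' = 10^(snr_db / 10)
    target_p_noise = p_speech / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_p_noise / p_noise)
    return speech + scale * noise

# Example: degrade a clean signal with white noise at 10 dB SNR
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # stand-in for a clean speech waveform
white = rng.standard_normal(16000)   # white Gaussian noise
noisy = mix_at_snr(clean, white, snr_db=10.0)
```

Sweeping `snr_db` from 40 down to 0 dB, as in the experiments reported in the abstract, then amounts to calling this helper once per condition; babble noise would be substituted for the white-noise array.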
Acknowledgements
This research is supported by the Ministry of Education and Science of the Russian Federation (Project No. 8.9957.2017/5.2), by the Government of Russia (Grant No. 08-08), by the Russian Foundation for Basic Research (Project Nos. 18-37-00306, 16-37-60100), by the Council for Grants of the President of the Russian Federation (Project Nos. MD-254.2017.8, MK-1000.2017.8), by the Russian state research (No. 0073-2018-0002), by the Ministry of Education of the Czech Republic (Project No. LTARF18017).
Ivanko, D., Karpov, A., Fedotov, D. et al. Multimodal speech recognition: increasing accuracy using high speed video data. J Multimodal User Interfaces 12, 319–328 (2018). https://doi.org/10.1007/s12193-018-0267-1