Abstract
Automatic speech recognition and emotion recognition have been research hotspots in human-computer interaction in recent years. However, despite significant recent advances, the problem of robust recognition of emotional speech remains unresolved. In this research, we address this gap by exploiting the multimodal nature of speech, using visual information to increase both recognition accuracy and robustness. We present an extensive experimental investigation of how different emotions (anger, disgust, fear, happy, neutral, and sad) affect automatic lip-reading. We train a 3D ResNet-18 model on the CREMA-D emotional speech database, experimenting with different model parameters. To the best of our knowledge, this is the first study to investigate the influence of human emotions on automatic lip-reading. Our results demonstrate that speech with the emotion of disgust is the most difficult to recognize correctly, because speakers strongly curl their lips and articulation is distorted. We experimentally confirm that the accuracy of models trained on all types of emotions (mean UAR 94.04%) significantly exceeds that of models trained only on the neutral emotion (mean UAR 65.81%) or on any other single emotion (mean UAR from 54.82% to 68.62%, for disgust and sadness respectively). We carefully analyze the visual manifestations of the various emotions and assess their impact on the accuracy of automatic lip-reading. This research is a first step toward the creation of emotion-robust speech recognition systems.
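To make the experimental setup concrete, below is a minimal sketch, assuming PyTorch/torchvision and scikit-learn, of fine-tuning a 3D ResNet-18 video backbone for phrase-level lip-reading and scoring it with UAR (unweighted average recall, which scikit-learn computes as balanced accuracy). The vocabulary size, clip geometry, and dummy data are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: 3D ResNet-18 lip-reading + UAR evaluation.
# Assumptions (not from the paper): vocabulary size, clip geometry,
# and the dummy tensors used to keep the snippet self-contained.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18
from sklearn.metrics import balanced_accuracy_score  # == UAR (macro-averaged recall)

NUM_CLASSES = 12  # hypothetical phrase vocabulary (CREMA-D contains 12 sentences)

# 3D ResNet-18 backbone; replace the classification head for lip-reading.
model = r3d_18(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One illustrative training step on dummy mouth-region clips:
# input shape is (batch, channels, frames, height, width).
clips = torch.randn(4, 3, 16, 112, 112)
labels = torch.randint(0, NUM_CLASSES, (4,))
loss = criterion(model(clips), labels)
loss.backward()
optimizer.step()

# Evaluation: UAR on a (hypothetical) test split restricted to one emotion,
# e.g. "disgust", to measure how that emotion affects lip-reading accuracy.
model.eval()
with torch.no_grad():
    preds = model(clips).argmax(dim=1)
print("UAR:", balanced_accuracy_score(labels.numpy(), preds.numpy()))
```

In such a protocol, training separate models on emotion-filtered subsets and comparing their UAR scores is what would yield per-emotion results like those reported above.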
Acknowledgments
This research is financially supported by the Russian Science Foundation (project No. 22-11-00321). Section 5 is supported by Grant No. MK-42.2022.4.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Ryumina, E., Ivanko, D. (2022). Emotional Speech Recognition Based on Lip-Reading. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science, vol. 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_52
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20979-6
Online ISBN: 978-3-031-20980-2