Emotional Speech Recognition Based on Lip-Reading

  • Conference paper
Speech and Computer (SPECOM 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13721)

Abstract

Automatic speech recognition and emotion recognition have been research hotspots in the field of human-computer interaction in recent years. However, despite significant recent advances, the problem of robust recognition of emotional speech remains unresolved. In this research we address this gap by exploiting the multimodal nature of speech and using visual information to increase both recognition accuracy and robustness. We present an extensive experimental investigation of how different emotions (anger, disgust, fear, happiness, neutral, and sadness) affect automatic lip-reading. We train a 3D ResNet-18 model on the CREMA-D emotional speech database, experimenting with different model parameters. To the best of our knowledge, this is the first study to investigate the influence of human emotions on automatic lip-reading. Our results demonstrate that speech with the emotion of disgust is the most difficult to recognize correctly, because the speaker significantly curves the lips and articulation is distorted. We have experimentally confirmed that the accuracy of models trained on all types of emotions (mean UAR 94.04%) significantly exceeds that of models trained only on neutral speech (mean UAR 65.81%) or on any other single emotion (mean UAR from 54.82% for disgust to 68.62% for sadness). We have carefully analyzed the visual manifestations of the various emotions and assessed their impact on the accuracy of automatic lip-reading. This research is a first step towards the creation of emotion-robust speech recognition systems.
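
A minimal, hypothetical sketch of the pipeline described above (not the authors' released code) is shown below: a torchvision 3D ResNet-18 fine-tuned on fixed-length mouth-region clips and evaluated with unweighted average recall (UAR, i.e. macro-averaged recall). The input layout, learning rate, and helper names are assumptions for illustration only; the class count follows the 12 distinct sentences of CREMA-D.

    # Hypothetical sketch of the lip-reading setup described in the abstract.
    # Assumptions: clips are pre-cropped mouth regions of shape (B, 3, T, H, W);
    # each CREMA-D utterance is mapped to one of NUM_CLASSES sentence labels.
    import torch
    import torch.nn as nn
    from torchvision.models.video import r3d_18   # 3D ResNet-18 backbone
    from sklearn.metrics import recall_score

    NUM_CLASSES = 12  # CREMA-D contains 12 distinct spoken sentences

    model = r3d_18(weights=None)                             # train from scratch
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace classifier head

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed hyperparameter
    criterion = nn.CrossEntropyLoss()

    def train_step(clips: torch.Tensor, labels: torch.Tensor) -> float:
        """One optimization step on a batch of mouth-region clips."""
        model.train()
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        loss.backward()
        optimizer.step()
        return loss.item()

    @torch.no_grad()
    def evaluate_uar(loader) -> float:
        """UAR = mean of per-class recalls (sklearn's macro-averaged recall)."""
        model.eval()
        y_true, y_pred = [], []
        for clips, labels in loader:
            y_pred += model(clips).argmax(dim=1).tolist()
            y_true += labels.tolist()
        return recall_score(y_true, y_pred, average="macro")

The comparison reported in the abstract can then be reproduced in spirit by building separate training loaders per emotion subset (e.g. disgust-only vs. all emotions) and reporting UAR for each resulting model.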

Acknowledgments

This research is financially supported by the Russian Science Foundation (project No. 22-11-00321). Section 5 is supported by Grant No. MК-42.2022.4.

Author information

Corresponding author

Correspondence to Denis Ivanko.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Ryumina, E., Ivanko, D. (2022). Emotional Speech Recognition Based on Lip-Reading. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science (LNAI), vol. 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_52

  • DOI: https://doi.org/10.1007/978-3-031-20980-2_52

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20979-6

  • Online ISBN: 978-3-031-20980-2

  • eBook Packages: Computer Science, Computer Science (R0)
