Abstract
Automatic speech recognition and emotion recognition have been research hotspots in human-computer interaction in recent years. However, despite significant recent advances, the problem of robust recognition of emotional speech remains unresolved. In this research, we address this gap by exploiting the multimodal nature of speech, using visual information to increase both recognition accuracy and robustness. We present an extensive experimental investigation of how different emotions (anger, disgust, fear, happy, neutral, and sad) affect automatic lip-reading. We train a 3D ResNet-18 model on the CREMA-D emotional speech database, experimenting with different model parameters. To the best of our knowledge, this is the first study to investigate the influence of human emotions on automatic lip-reading. Our results demonstrate that speech with the emotion of disgust is the most difficult to recognize correctly, because speakers strongly curl their lips and articulation is distorted. We experimentally confirm that the accuracy of models trained on all types of emotions (mean UAR 94.04%) significantly exceeds that of models trained only on the neutral emotion (mean UAR 65.81%) or on any other single emotion (mean UAR from 54.82% to 68.62%, for disgust and sadness respectively). We carefully analyze the visual manifestations of the various emotions and assess their impact on the accuracy of automatic lip-reading. This research is a first step toward the creation of emotion-robust speech recognition systems.
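To make the experimental setup concrete, below is a minimal sketch, assuming PyTorch/torchvision and scikit-learn, of fine-tuning a 3D ResNet-18 video backbone for phrase-level lip-reading and scoring it with UAR (unweighted average recall, which scikit-learn computes as balanced accuracy). The vocabulary size, clip geometry, and dummy data are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: 3D ResNet-18 lip-reading + UAR evaluation.
# Assumptions (not from the paper): vocabulary size, clip geometry,
# and the dummy tensors used to keep the snippet self-contained.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18
from sklearn.metrics import balanced_accuracy_score  # == UAR (macro-averaged recall)

NUM_CLASSES = 12  # hypothetical phrase vocabulary (CREMA-D contains 12 sentences)

# 3D ResNet-18 backbone; replace the classification head for lip-reading.
model = r3d_18(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One illustrative training step on dummy mouth-region clips:
# input shape is (batch, channels, frames, height, width).
clips = torch.randn(4, 3, 16, 112, 112)
labels = torch.randint(0, NUM_CLASSES, (4,))
loss = criterion(model(clips), labels)
loss.backward()
optimizer.step()

# Evaluation: UAR on a (hypothetical) test split restricted to one emotion,
# e.g. "disgust", to measure how that emotion affects lip-reading accuracy.
model.eval()
with torch.no_grad():
    preds = model(clips).argmax(dim=1)
print("UAR:", balanced_accuracy_score(labels.numpy(), preds.numpy()))
```

In such a protocol, training separate models on emotion-filtered subsets and comparing their UAR scores is what would yield per-emotion results like those reported above.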
Acknowledgments
This research is financially supported by the Russian Science Foundation (project No. 22-11-00321). Section 5 is supported by Grant No. MK-42.2022.4.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Ryumina, E., Ivanko, D. (2022). Emotional Speech Recognition Based on Lip-Reading. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science, vol. 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_52
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20979-6
Online ISBN: 978-3-031-20980-2