
Speech emotion recognition for the Urdu language

Dataset and evaluation

Abstract

Crafting reliable Speech Emotion Recognition (SER) systems is an arduous task that inevitably requires large amounts of training data. Such voluminous datasets are currently available for only a few languages, including English, German, and Italian. In this work, we present SEMOUR+, a Scripted EMOtional Speech Repository for Urdu: the first scripted database of emotion-tagged, diverse-accent speech in the Urdu language, designed to support an Urdu Speech Emotion Recognition system. Our gender-balanced, 14-hour repository contains 27,640 unique instances recorded by 24 native speakers enacting a syntactically complex script. The dataset is phonetically balanced and reliably exhibits varied emotions, as indicated by high agreement scores among human raters in our experiments. We also provide several baseline speech emotion prediction scores on SEMOUR+, which could support applications such as personalized robot assistants, diagnosis of psychological disorders, and feedback collection from low-tech-enabled populations. In a speaker-independent experimental setting, our ensemble model predicts emotions with a state-of-the-art accuracy of 56%.
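To illustrate the speaker-independent evaluation protocol mentioned above, the sketch below extracts mean-pooled MFCC features with librosa and scores a simple random-forest baseline while holding out entire speakers. The directory layout, the file-naming scheme (<speaker>_<emotion>_<id>.wav), and the classifier are illustrative assumptions, not the authors' published pipeline or ensemble model.

```python
# Hypothetical sketch of a speaker-independent SER baseline.
# Assumes WAV files named data/<speaker>_<emotion>_<id>.wav; this layout is an
# illustration, not the published SEMOUR+ distribution format.
import glob
import os

import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GroupShuffleSplit


def mfcc_features(path, n_mfcc=40):
    """Mean-pooled MFCCs as a fixed-length utterance representation."""
    y, sr = librosa.load(path, sr=16000)            # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)                        # one (n_mfcc,) vector per utterance


files = sorted(glob.glob("data/*.wav"))
speakers = [os.path.basename(f).split("_")[0] for f in files]
labels = np.array([os.path.basename(f).split("_")[1] for f in files])
X = np.stack([mfcc_features(f) for f in files])

# Speaker-independent split: whole speakers are held out for testing.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, labels, groups=speakers))

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X[train_idx], labels[train_idx])
print("speaker-independent accuracy:",
      accuracy_score(labels[test_idx], clf.predict(X[test_idx])))
```

Holding out whole speakers, rather than random utterances, is what makes the reported accuracy speaker-independent: the classifier never sees any utterance from a test speaker during training.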



Notes

  1. https://tinyurl.com/yc499z7k


Acknowledgements

This work is partially supported by the Higher Education Commission (HEC), Pakistan, under the National Center for Big Data and Cloud Computing, through funding for the Crime Investigation and Prevention Lab (CIPL) project at Information Technology University, Lahore. We acknowledge the efforts of our volunteers, including Sidra Shuja, Abbas Ahmad Khan, Abdullah Rao, Talha Riaz, Shawaiz Butt, Fatima Sultan, Naheed Bashir, Farrah Zaheer, Deborah Eric, Maryam Zaheer, Abdullah Zaheer, Anwar Said, Farooq Zaman, Fareed Ud Din Munawwar, Muhammad Junaid Ahmad, Taha Chohan, Sufyan Khalid, Iqra Safdar, Anum Zahid, Hajra Waheed, Mehvish Ghafoor, Sehrish Iqbal, Akhtar Munir, Hassaan, Hamza, Javed Iqbal, Syed Javed, Noman Khan, Mahr Muhammad Shaaf Abdullah, Talha, Tazeen Bokhari, and Muhammad Usama Irfan. We also thank the staff at ITU FM Radio 90.4 for their help in the recording process.

Funding

Funding was provided by Higher Education Commission, Pakistan.

Author information


Corresponding author

Correspondence to Nimra Zaheer.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zaheer, N., Ahmad, O.U., Shabbir, M. et al. Speech emotion recognition for the Urdu language. Lang Resources & Evaluation (2022). https://doi.org/10.1007/s10579-022-09610-7


Keywords

  • Emotional speech dataset
  • Speech emotion recognition
  • Urdu language
  • Accent diversity
  • Deep learning