ShEMO: a large-scale validated database for Persian speech emotion detection

Abstract


This paper introduces a large-scale validated database for Persian, the Sharif Emotional Speech Database (ShEMO). The database includes 3000 semi-natural utterances, equivalent to 3 h and 25 min of speech data, extracted from online radio plays. ShEMO covers speech samples of 87 native Persian speakers for five basic emotions (anger, fear, happiness, sadness and surprise) as well as the neutral state. Twelve annotators label the underlying emotional state of the utterances, and majority voting is used to decide on the final labels. According to the kappa measure, the inter-annotator agreement is 64%, which is interpreted as “substantial agreement”. We also present benchmark results based on common classification methods for the speech emotion detection task. According to the experiments, the support vector machine achieves the best results for both the gender-independent model (58.2%) and the gender-dependent models (female = 59.4%, male = 57.6%). ShEMO will be available free of charge for academic purposes, to provide a baseline for further research on Persian emotional speech.
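The majority-voting step described in the abstract can be sketched in a few lines of Python. The labels below are hypothetical and only illustrate the mechanism; they are not actual ShEMO annotations, and the optional `min_votes` threshold is an assumption, not part of the paper's protocol.

```python
from collections import Counter

def majority_label(annotations, min_votes=None):
    """Return the majority-voted emotion label for one utterance.

    annotations: list of labels from the individual annotators.
    min_votes: optional threshold; if the winning label has fewer
    votes than this, the utterance is left unlabeled (None).
    """
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    if min_votes is not None and votes < min_votes:
        return None
    return label

# Hypothetical labels from twelve annotators for a single utterance.
labels = ["anger"] * 7 + ["sadness"] * 3 + ["neutral"] * 2
print(majority_label(labels))  # anger
```

In practice a corpus builder must also decide what to do with utterances whose vote is split too evenly; the `min_votes` parameter is one simple way to flag such low-consensus items for exclusion or re-annotation.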


Fig. 1
Fig. 2

Notes


  1.

    Upon publishing this paper, we release our database for academic purposes.

  2.

    The prompt excludes any emotional content so as not to interfere with the expression and perception of emotional states.

  3.

  4.

    Cohen’s kappa generally ranges from 0 to 1, where larger values indicate higher reliability and values near zero suggest that the agreement is attributable to chance alone.

  5.

    As Landis and Koch (1977) explain, \(0.61 < \kappa < 0.80\) is interpreted as “substantial agreement” among the judges.

  6.

    The IPA was devised by the International Phonetic Association as a standardized representation of the sounds of oral language.

  7.

    It contains 88 different parameters. For further information, please refer to Eyben et al. (2016).

  8.

    After fear, happiness has the lowest number of utterances. As mentioned before, fear utterances were excluded from the classification experiments.

  9.

    Actors were asked to read 10 short emotionally neutral sentences.

  10.

    We trained the models on the audio (not video), speech (not song) files of the database.
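Footnotes 4 and 5 refer to Cohen’s kappa and the Landis and Koch (1977) interpretation bands. A minimal sketch of both, for the two-annotator case, might look like the following; the labels are hypothetical examples, not ShEMO data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two annotators labeling the same utterances."""
    n = len(rater_a)
    # Observed agreement: fraction of utterances both annotators label identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each annotator's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch (1977) agreement band."""
    if kappa < 0:
        return "poor"
    for upper, name in [(0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return name
    return "almost perfect"

# Hypothetical labels for eight utterances from two annotators.
a = ["anger", "anger", "sadness", "neutral", "happiness", "anger", "sadness", "neutral"]
b = ["anger", "sadness", "sadness", "neutral", "happiness", "anger", "neutral", "neutral"]
k = cohens_kappa(a, b)
print(round(k, 2), landis_koch(k))  # 0.66 substantial
```

Note that ShEMO's reported 64% agreement involves twelve annotators, so a multi-rater generalization such as Fleiss' kappa would be needed in practice; the two-rater version is shown only to keep the sketch short.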

References

  1. Alvarado, N. (1997). Arousal and valence in the direct scaling of emotional response to film clips. Motivation and Emotion, 21, 323–348.


  2. Audhkhasi, K., & Narayanan, S. (2010). Data-dependent evaluator modeling and its application to emotional valence classification from speech. In Proceedings of INTERSPEECH (pp. 2366–2369), Makuhari, Japan.

  3. Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3), 572–587.


  4. Batliner, A., Fischer, K., Huber, R., Spilker, J., & Noth, E. (2000). Desperately seeking emotions or: Actors, wizards, and human beings. In Proceedings of ISCA workshop on speech and emotion (pp. 195–200).

  5. Batliner, A., Fischer, K., Huber, R., Spilker, J., & Noth, E. (2003). How to find trouble in communication. Speech Communication, 40(1–2), 117–143.


  6. Bijankhan, M., Sheikhzadegan, J., Roohani, M., & Samareh, Y. (1994). FARSDAT—The speech database of Farsi spoken language. In Proceedings of Australian conference on speech science and technology (pp. 826–831), Perth, Australia.

  7. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.


  8. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., & Weiss, B. (2005). A database of German emotional speech. In Proceedings of INTERSPEECH (pp. 1517–1520), Lisbon, Portugal. ISCA.

  9. Busso, C., Bulut, M., & Narayanan, S. (2013). Toward effective automatic recognition systems of emotion in speech. In J. Gratch & S. Marsella (Eds.), Social emotions in nature and artifact: Emotions in human and human–computer interaction (pp. 110–127). New York, NY: Oxford University Press.


  10. Cawley, G. C., & Talbot, N. L. (2010). On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11, 2079–2107.


  11. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.


  12. Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., et al. (2001). Emotion recognition in human–computer interaction. IEEE Signal Processing Magazine, 18(1), 32–80.


  13. Deng, J., Han, W., & Schuller, B. (2012). Confidence measures for speech emotion recognition: A start. In Proceedings of speech communication (pp. 1–4), Braunschweig, Germany.

  14. Dickerson, R., Gorlin, E., & Stankovic, J. (2011). Empath: A continuous remote emotional health monitoring system for depressive illness. In Proceedings of the 2nd conference on wireless health (pp. 1–10), New York, NY, USA.

  15. Douglas-Cowie, E., Cowie, R., & Schroeder, M. (2000). A new emotion database: Considerations, sources and scope. In Proceedings of ISCA workshop on speech and emotion (pp. 39–44).

  16. Ekman, P. (1982). Emotion in the human face. Cambridge: Cambridge University Press.


  17. Engberg, I., Hansen, A., Andersen, O., & Dalsgaard, P. (1997). Design, recording and verification of a Danish emotional speech database. In Proceedings of EUROSPEECH (Vol. 4, pp. 1695–1698).

  18. Esmaileyan, Z., & Marvi, H. (2013). A database for automatic Persian speech emotion recognition: Collection, processing and evaluation. International Journal of Engineering, 27, 79–90.


  19. Eyben, F., Scherer, K., Schuller, B., Sundberg, J., Andre, E., Busso, C., et al. (2016). The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2), 190–202.


  20. Eyben, F., Wollmer, M., & Schuller, B. (2010). openSMILE—The Munich versatile and fast open-source audio feature extractor. In Proceedings of ACM multimedia (pp. 1459–1462), Florence, Italy.

  21. Feraru, S. M., Schuller, D., & Schuller, B. (2015). Cross-language acoustic emotion recognition: An overview and some tendencies. In Proceedings of the 6th international conference on affective computing and intelligent interaction (ACII) (pp. 125–131), Xi’an, China.

  22. Frank, M., & Stennett, J. (2001). The forced-choice paradigm and the perception of facial expressions of emotion. Journal of Personality and Social Psychology, 80(1), 75–85.


  23. Furnas, G. W., Landauer, T. K., Gomez, L. M., & Dumais, S. T. (1987). The vocabulary problem in human–system communication. Communications of the ACM, 30(11), 964–971.


  24. Gharavian, D., & Ahadi, S. (2006). Recognition of emotional speech and speech emotion in Farsi. In Proceedings of international symposium on Chinese spoken language processing (Vol. 2, pp. 299–308).

  25. Giannakopoulos, T., Pikrakis, A., & Theodoridis, S. (2009). A dimensional approach to emotion recognition of speech from movies. In Proceedings of IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 65–68).

  26. Grimm, M., Kroschel, K., Mower, E., & Narayanan, S. (2007). Primitives-based evaluation and estimation of emotions in speech. Speech Communication, 49(10–11), 787–800.


  27. Hamidi, M., & Mansoorizade, M. (2012). Emotion recognition from Persian speech with neural network. Artificial Intelligence and Applications, 3(5), 107–112.


  28. Heni, N., & Hamam, H. (2016). Design of emotional education system mobile games for autistic children. In Proceedings of the 2nd international conference on advanced technologies for signal and image processing (ATSIP).

  29. Huahu, X., Jue, G., & Jian, Y. (2010). Application of speech emotion recognition in intelligent household robot. In Proceedings of international conference on artificial intelligence and computational intelligence (Vol. 1, pp. 537–541).

  30. Russell, J. A. (1994). Is there universal recognition of emotion from facial expression? A review of the cross-cultural studies. Psychological Bulletin, 115(1), 102–141.


  31. Johnstone, T., Van Reekum, C., Hird, K., Kirsner, K., & Scherer, K. (2005). Affective speech elicited with a computer game. Emotion, 5(4), 513–518.


  32. Keshtiari, N., Kuhlmann, M., Eslami, M., & Klann-Delius, G. (2015). Recognizing emotional speech in Persian: A validated database of Persian emotional speech (Persian ESD). Behavior Research Methods, 47(1), 275–294.


  33. Kort, B., Reilly, R., & Picard, R. (2001). An affective model of interplay between emotions and learning: Reengineering educational pedagogy-building a learning companion. In Proceedings of the IEEE international conference on advanced learning technologies (ICALT) (pp. 43–46), Washington, DC, USA.

  34. Landis, J., & Koch, G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.


  35. Lee, C., Mower, E., Busso, C., Lee, S., & Narayanan, S. (2011). Emotion recognition using a hierarchical binary decision tree approach. Speech Communication, 53(9–10), 1162–1171.


  36. Lewis, P. A., Critchley, H. D., Rotshtein, P., & Dolan, J. R. (2007). Neural correlates of processing valence and arousal in affective words. Cerebral Cortex, 17(3), 742–748.


  37. Livingstone, S., Peck, K., & Russo, F. (2012). RAVDESS: The Ryerson audio-visual database of emotional speech and song. In Proceedings of the 22nd annual meeting of the Canadian Society for Brain, Behaviour and Cognitive Science (CSBBCS), ON, Canada.

  38. Mansoorizadeh, M. (2009). Human emotion recognition using facial expression and speech features fusion. PhD thesis, Tarbiat Modares University, Tehran, Iran (in Persian).

  39. McKeown, G., Valstar, M., Cowie, R., & Pantic, M. (2010). The semaine corpus of emotionally coloured character interactions. In Proceedings of IEEE international conference on multimedia and expo (ICME’10) (pp. 1079–1084), Singapore, Singapore. IEEE Computer Society.

  40. Metze, F., Batliner, A., Eyben, F., Polzehl, T., Schuller, B., & Steidl, S. (2011). Emotion recognition using imperfect speech recognition. In Proceedings of INTERSPEECH (pp. 478–481), Makuhari, Japan.

  41. Moosavian, A., Norasteh, R., & Rahati, S. (2007). Speech emotion recognition using adaptive neuro-fuzzy inference systems. In Proceedings of the 8th conference on intelligent systems (in Persian).

  42. Mower, E., Mataric, M., & Narayanan, S. (2009b). Evaluating evaluators: A case study in understanding the benefits and pitfalls of multi-evaluator modeling. In Proceedings of INTERSPEECH (pp. 1583–1586), Brighton, UK.

  43. Mower, E., Metallinou, A., Lee, C., Kazemzadeh, A., Busso, C., Lee, S., & Narayanan, S. (2009a). Interpreting ambiguous emotional expressions. In Proceedings of the 3rd international conference on affective computing and intelligent interaction and workshops (ACII) (pp. 662–669), Amsterdam, The Netherlands.

  44. Nicolaou, M., Gunes, H., & Pantic, M. (2011). Continuous prediction of spontaneous affect from multiple cues and modalities in valence–arousal space. IEEE Transactions on Affective Computing, 2(2), 92–105.


  45. Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.


  46. Sagha, H., Matejka, P., Gavryukova, M., Povolny, F., Marchi, E., & Schuller, B. (2016). Enhancing multilingual recognition of emotion in speech by language identification. In Proceedings of INTERSPEECH (pp. 2949–2953).

  47. Savargiv, M., & Bastanfard, A. (2015). Persian speech emotion recognition. In Proceedings of the 7th international conference on information and knowledge technology (IKT) (pp. 1–5).

  48. Scherer, K. (1986). Vocal affect expression: A review and a model for future research. Psychological Bulletin, 99(2), 143–165.


  49. Scherer, K., Banse, R., Wallbott, H., & Goldbeck, T. (1991). Vocal cues in emotion encoding and decoding. Motivation and Emotion, 15(2), 123–148.


  50. Schuller, B., Batliner, A., Steidl, S., Schiel, F., & Krajewski, J. (2011). The INTERSPEECH 2011 speaker state challenge. In Proceedings of INTERSPEECH (pp. 3201–3204), Florence, Italy. ISCA.

  51. Schuller, B., & Munchen, T. U. (2002). Towards intuitive speech interaction by the integration of emotional aspects. In Proceedings of IEEE international conference on systems, man and cybernetics (SMC) (Vol. 1, pp. 6–11).

  52. Schuller, B., Reiter, S., Muller, R., Al-Hames, M., Lang, M., & Rigoll, G. (2005). Speaker independent speech emotion recognition by ensemble classification. In Proceedings of IEEE international conference on multimedia and expo (ICME) (pp. 864–867).

  53. Schuller, B., Rigoll, G., & Lang, M. (2004). Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In Proceedings of IEEE international conference on acoustics, speech, and signal processing (ICASSP) (Vol. 1, pp. 577–580).

  54. Schuller, B., Steidl, S., & Batliner, A. (2009). The INTERSPEECH 2009 emotion challenge. In Proceedings of INTERSPEECH (pp. 312–315), Brighton, UK. ISCA.

  55. Schuller, B., Steidl, S., Batliner, A., Burkhardt, F., Devillers, L., Muller, C., et al. (2010). The INTERSPEECH 2010 paralinguistic challenge. In Proceedings of INTERSPEECH (pp. 2794–2797), Makuhari, Japan. ISCA.

  56. Schuller, B., Steidl, S., Batliner, A., Hirschberg, J., Burgoon, J., Baird, A., et al. (2016). The INTERSPEECH 2016 computational paralinguistics challenge: Deception, sincerity & native language. In Proceedings of INTERSPEECH (pp. 2001–2005), San Francisco, USA. ISCA.

  57. Sedaaghi, M. (2008). Documentation of the Sahand Emotional Speech Database (SES). Technical report, Department of engineering, Sahand University of Technology.

  58. Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems (pp. 2951–2959).

  59. Steidl, S. (2009). Automatic classification of emotion related user states in spontaneous children’s speech. Ph.D. thesis, University of Erlangen-Nuremberg, Erlangen, Bavaria, Germany.

  60. Wöllmer, M., Kaiser, M., Eyben, F., Schuller, B., & Rigoll, G. (2013). LSTM-modeling of continuous emotions in an audiovisual affect recognition framework. Image and Vision Computing, 31(2), 153–163.


  61. Yu, F., Chang, E., Xu, Y., & Shum, H. (2001). Emotion detection from speech to enrich multimedia content. In Proceedings of the 2nd IEEE Pacific Rim conference on multimedia: Advances in multimedia information processing (pp. 550–557), London, UK. Springer.



Acknowledgements

We would like to thank the anonymous reviewers for their insightful comments and suggestions. We are also grateful to Dr. Steve Cassidy for his helpful points.

Author information



Corresponding author

Correspondence to Omid Mohamad Nezami.


About this article


Cite this article

Mohamad Nezami, O., Jamshid Lou, P. & Karami, M. ShEMO: a large-scale validated database for Persian speech emotion detection. Lang Resources & Evaluation 53, 1–16 (2019).

Keywords


  • Emotional speech
  • Speech database
  • Emotion detection
  • Benchmark
  • Persian