Abstract
This paper introduces a large-scale, validated speech database for Persian, the Sharif Emotional Speech Database (ShEMO). The database comprises 3000 semi-natural utterances, equivalent to 3 h and 25 min of speech data, extracted from online radio plays. ShEMO covers speech samples of 87 native Persian speakers for five basic emotions (anger, fear, happiness, sadness and surprise) as well as the neutral state. Twelve annotators labeled the underlying emotional state of each utterance, and majority voting was used to decide on the final labels. According to the kappa measure, the inter-annotator agreement is 64%, which is interpreted as “substantial agreement”. We also present benchmark results based on common classification methods for the speech emotion detection task. According to the experiments, the support vector machine achieves the best results for both the gender-independent (58.2%) and gender-dependent models (female = 59.4%, male = 57.6%). ShEMO will be available free of charge for academic purposes, providing a baseline for further research on Persian emotional speech.
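The label-aggregation step described above (twelve annotators per utterance, final label by majority vote) can be sketched as follows. The function name, the tie-handling policy, and the example votes are illustrative assumptions, not the authors' actual implementation.

```python
from collections import Counter

def majority_label(annotations):
    """Return the majority emotion label for one utterance, or None
    when no single label wins a strict plurality (such utterances
    would need re-annotation or exclusion -- an assumed policy)."""
    counts = Counter(annotations)
    (top, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:  # tie between the top labels
        return None
    return top

# Example: 12 hypothetical annotator judgments for one utterance
votes = ["anger"] * 7 + ["sadness"] * 3 + ["neutral"] * 2
print(majority_label(votes))  # anger
```

With a strict-plurality rule like this, utterances whose votes split evenly between two labels receive no final label; the database size reported above reflects utterances that survived whatever disagreement policy the authors applied.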
Notes
Upon publishing this paper, we release our database for academic purposes.
The prompt excludes any emotional content so as not to interfere with the expression and perception of emotional states.
Cohen’s kappa ranges from −1 to 1, though in practice it typically falls between 0 and 1; larger values indicate higher reliability, and values near zero suggest that agreement is attributable to chance alone.
As Landis and Koch (1977) explain, \(0.61 \le \kappa \le 0.80\) is interpreted as “substantial agreement” among the judges.
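The kappa statistic referenced in these notes can be computed as follows for the two-rater case: \(\kappa = (p_o - p_e)/(1 - p_e)\), where \(p_o\) is the observed agreement and \(p_e\) the chance agreement from the raters' marginal label distributions. This is a minimal sketch with invented example labels; the paper's twelve-annotator figure would require averaging pairwise kappas or a multi-rater generalization such as Fleiss' kappa, which is not shown here.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters judging the same items:
    kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled alike
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

a = ["anger", "anger", "sadness", "neutral", "anger", "sadness"]
b = ["anger", "neutral", "sadness", "neutral", "anger", "anger"]
print(round(cohens_kappa(a, b), 2))  # 0.48
```

On this toy example the raters agree on 4 of 6 items (\(p_o = 2/3\)) but chance alone predicts \(p_e = 13/36\), giving \(\kappa = 11/23 \approx 0.48\), i.e. “moderate agreement” on the Landis–Koch scale.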
The IPA was devised by the International Phonetic Association as a standardized representation of the sounds of oral language.
It contains 88 different parameters. For further information, please refer to Eyben et al. (2016).
Happiness has the second-lowest number of utterances, after fear. As mentioned earlier, fear utterances were excluded from the classification experiments.
Actors were asked to read 10 short emotionally neutral sentences.
We trained the models on the audio (not video), speech (not song) files of the database.
References
Alvarado, N. (1997). Arousal and valence in the direct scaling of emotional response to film clips. Motivation and Emotion, 21, 323–348.
Audhkhasi, K., & Narayanan, S. (2010). Data-dependent evaluator modeling and its application to emotional valence classification from speech. In Proceedings of INTERSPEECH (pp. 2366–2369), Makuhari, Japan.
El Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3), 572–587.
Batliner, A., Fischer, K., Huber, R., Spilker, J., & Nöth, E. (2000). Desperately seeking emotions or: Actors, wizards, and human beings. In Proceedings of ISCA workshop on speech and emotion (pp. 195–200).
Batliner, A., Fischer, K., Huber, R., Spilker, J., & Nöth, E. (2003). How to find trouble in communication. Speech Communication, 40(1–2), 117–143.
Bijankhan, M., Sheikhzadegan, J., Roohani, M., & Samareh, Y. (1994). FARSDAT—The speech database of Farsi spoken language. In Proceedings of Australian conference on speech science and technology (pp. 826–831), Perth, Australia.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., & Weiss, B. (2005). A database of German emotional speech. In Proceedings of INTERSPEECH (pp. 1517–1520), Lissabon, Portugal. ISCA.
Busso, C., Bulut, M., & Narayanan, S. (2013). Toward effective automatic recognition systems of emotion in speech. In J. Gratch & S. Marsella (Eds.), Social emotions in nature and artifact: Emotions in human and human–computer interaction (pp. 110–127). New York, NY: Oxford University Press.
Cawley, G. C., & Talbot, N. L. (2010). On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11, 2079–2107.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., et al. (2001). Emotion recognition in human–computer interaction. IEEE Signal Processing Magazine, 18(1), 32–80.
Deng, J., Han, W., & Schuller, B. (2012). Confidence measures for speech emotion recognition: A start. In Proceedings of speech communication (pp. 1–4), Braunschweig, Germany.
Dickerson, R., Gorlin, E., & Stankovic, J. (2011). Empath: A continuous remote emotional health monitoring system for depressive illness. In Proceedings of the 2nd conference on wireless health (pp. 1–10), New York, NY, USA.
Douglas-Cowie, E., Cowie, R., & Schroeder, M. (2000). A new emotion database: Considerations, sources and scope. In Proceedings of ISCA workshop on speech and emotion (pp. 39–44).
Ekman, P. (1982). Emotion in the human face. Cambridge: Cambridge University Press.
Engberg, I., Hansen, A., Andersen, O., & Dalsgaard, P. (1997). Design, recording and verification of a Danish emotional speech database. In Proceedings of EUROSPEECH (Vol. 4, pp. 1695–1698).
Esmaileyan, Z., & Marvi, H. (2013). A database for automatic Persian speech emotion recognition: Collection, processing and evaluation. International Journal of Engineering, 27, 79–90.
Eyben, F., Scherer, K., Schuller, B., Sundberg, J., Andre, E., Busso, C., et al. (2016). The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2), 190–202.
Eyben, F., Wöllmer, M., & Schuller, B. (2010). openSMILE—The Munich versatile and fast open-source audio feature extractor. In Proceedings of ACM multimedia (pp. 1459–1462), Florence, Italy.
Feraru, S. M., Schuller, D., & Schuller, B. (2015). Cross-language acoustic emotion recognition: An overview and some tendencies. In Proceedings of the 6th international conference on affective computing and intelligent interaction (ACII) (pp. 125–131), Xi’an, China.
Frank, M., & Stennett, J. (2001). The forced-choice paradigm and the perception of facial expressions of emotion. Journal of Personality and Social Psychology, 80(1), 75–85.
Furnas, G. W., Landauer, T. K., Gomez, L. M., & Dumais, S. T. (1987). The vocabulary problem in human–system communication. Communications of the ACM, 30(11), 964–971.
Gharavian, D., & Ahadi, S. (2006). Recognition of emotional speech and speech emotion in Farsi. In Proceedings of international symposium on Chinese spoken language processing (Vol. 2, pp. 299–308).
Giannakopoulos, T., Pikrakis, A., & Theodoridis, S. (2009). A dimensional approach to emotion recognition of speech from movies. In Proceedings of IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 65–68).
Grimm, M., Kroschel, K., Mower, E., & Narayanan, S. (2007). Primitives-based evaluation and estimation of emotions in speech. Speech Communication, 49(10–11), 787–800.
Hamidi, M., & Mansoorizade, M. (2012). Emotion recognition from Persian speech with neural network. Artificial Intelligence and Applications, 3(5), 107–112.
Heni, N., & Hamam, H. (2016). Design of emotional education system mobile games for autistic children. In Proceedings of the 2nd international conference on advanced technologies for signal and image processing (ATSIP).
Huahu, X., Jue, G., & Jian, Y. (2010). Application of speech emotion recognition in intelligent household robot. In Proceedings of international conference on artificial intelligence and computational intelligence (Vol. 1, pp. 537–541).
James, A. (1994). Is there universal recognition of emotion from facial expression? A review of the cross-cultural studies. Psychological Bulletin, 115(1), 102–141.
Johnstone, T., Van Reekum, C., Hird, K., Kirsner, K., & Scherer, K. (2005). Affective speech elicited with a computer game. Emotion, 5(4), 513–518.
Keshtiari, N., Kuhlmann, M., Eslami, M., & Klann-Delius, G. (2015). Recognizing emotional speech in Persian: A validated database of Persian emotional speech (Persian ESD). Behavior Research Methods, 47(1), 275–294.
Kort, B., Reilly, R., & Picard, R. (2001). An affective model of interplay between emotions and learning: Reengineering educational pedagogy-building a learning companion. In Proceedings of the IEEE international conference on advanced learning technologies (ICALT) (pp. 43–46), Washington, DC, USA.
Landis, J., & Koch, G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
Lee, C., Mower, E., Busso, C., Lee, S., & Narayanan, S. (2011). Emotion recognition using a hierarchical binary decision tree approach. Speech Communication, 53(9–10), 1162–1171.
Lewis, P. A., Critchley, H. D., Rotshtein, P., & Dolan, J. R. (2007). Neural correlates of processing valence and arousal in affective words. Cerebral Cortex, 17(3), 742–748.
Livingstone, S., Peck, K., & Russo, F. (2012). RAVDESS: The Ryerson audio-visual database of emotional speech and song. In Proceedings of the 22nd annual meeting of the Canadian Society for Brain, Behaviour and Cognitive Science (CSBBCS), ON, Canada.
Mansoorizadeh, M. (2009). Human emotion recognition using facial expression and speech features fusion. PhD thesis, Tarbiat Modares University, Tehran, Iran (in Persian).
McKeown, G., Valstar, M., Cowie, R., & Pantic, M. (2010). The semaine corpus of emotionally coloured character interactions. In Proceedings of IEEE international conference on multimedia and expo (ICME’10) (pp. 1079–1084), Singapore, Singapore. IEEE Computer Society. https://doi.org/10.1109/ICME.2010.5583006.
Metze, F., Batliner, A., Eyben, F., Polzehl, T., Schuller, B., & Steidl, S. (2011). Emotion recognition using imperfect speech recognition. In Proceedings of INTERSPEECH (pp. 478–481), Makuhari, Japan.
Moosavian, A., Norasteh, R., & Rahati, S. (2007). Speech emotion recognition using adaptive neuro-fuzzy inference systems. In Proceedings of the 8th conference on intelligent systems (in Persian).
Mower, E., Mataric, M., & Narayanan, S. (2009b). Evaluating evaluators: A case study in understanding the benefits and pitfalls of multi-evaluator modeling. In Proceedings of INTERSPEECH (pp. 1583–1586), Brighton, UK.
Mower, E., Metallinou, A., Lee, C., Kazemzadeh, A., Busso, C., Lee, S., & Narayanan, S. (2009a). Interpreting ambiguous emotional expressions. In Proceedings of the 3rd international conference on affective computing and intelligent interaction and workshops (ACII) (pp. 662–669), Amsterdam, The Netherlands.
Nicolaou, M., Gunes, H., & Pantic, M. (2011). Continuous prediction of spontaneous affect from multiple cues and modalities in valence–arousal space. IEEE Transactions on Affective Computing, 2(2), 92–105.
Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.
Sagha, H., Matejka, P., Gavryukova, M., Povolny, F., Marchi, E., & Schuller, B. (2016). Enhancing multilingual recognition of emotion in speech by language identification. In Proceedings of INTERSPEECH (pp. 2949–2953).
Savargiv, M., & Bastanfard, A. (2015). Persian speech emotion recognition. In Proceedings of the 7th international conference on information and knowledge technology (IKT) (pp. 1–5).
Scherer, K. (1986). Vocal affect expression: A review and a model for future research. Psychological Bulletin, 99(2), 143–165.
Scherer, K., Banse, R., Wallbott, H., & Goldbeck, T. (1991). Vocal cues in emotion encoding and decoding. Motivation and Emotion, 15(2), 123–148.
Schuller, B., Batliner, A., Steidl, S., Schiel, F., & Krajewski, J. (2011). The INTERSPEECH 2011 speaker state challenge. In Proceedings of INTERSPEECH (pp. 3201–3204), Florence, Italy. ISCA.
Schuller, B., & Munchen, T. U. (2002). Towards intuitive speech interaction by the integration of emotional aspects. In Proceedings of IEEE international conference on systems, man and cybernetics (SMC) (Vol. 1, pp. 6–11).
Schuller, B., Reiter, S., Muller, R., Al-Hames, M., Lang, M., & Rigoll, G. (2005). Speaker independent speech emotion recognition by ensemble classification. In Proceedings of IEEE international conference on multimedia and expo (ICME) (pp. 864–867).
Schuller, B., Rigoll, G., & Lang, M. (2004). Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In Proceedings of IEEE international conference on acoustics, speech, and signal processing (ICASSP) (Vol. 1, pp. 577–580).
Schuller, B., Steidl, S., & Batliner, A. (2009). The INTERSPEECH 2009 emotion challenge. In Proceedings of INTERSPEECH (pp. 312–315), Brighton, UK. ISCA.
Schuller, B., Steidl, S., Batliner, A., Burkhardt, F., Devillers, L., Muller, C., et al. (2010). The INTERSPEECH 2010 paralinguistic challenge. In Proceedings of INTERSPEECH (pp. 2794–2797), Makuhari, Japan. ISCA.
Schuller, B., Steidl, S., Batliner, A., Hirschberg, J., Burgoon, J., Baird, A., et al. (2016). The INTERSPEECH 2016 computational paralinguistics challenge: Deception, sincerity & native language. In Proceedings of INTERSPEECH (pp. 2001–2005), San Francisco, USA. ISCA.
Sedaaghi, M. (2008). Documentation of the Sahand Emotional Speech Database (SES). Technical report, Department of Engineering, Sahand University of Technology.
Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems (pp. 2951–2959).
Steidl, S. (2009). Automatic classification of emotion related user states in spontaneous children’s speech. Ph.D. thesis, University of Erlangen-Nuremberg, Erlangen, Bavaria, Germany.
Wöllmer, M., Kaiser, M., Eyben, F., Schuller, B., & Rigoll, G. (2013). LSTM-modeling of continuous emotions in an audiovisual affect recognition framework. Image and Vision Computing, 31(2), 153–163.
Yu, F., Chang, E., Xu, Y., & Shum, H. (2001). Emotion detection from speech to enrich multimedia content. In Proceedings of the 2nd IEEE Pacific Rim conference on multimedia: Advances in multimedia information processing (pp. 550–557), London, UK. Springer.
Acknowledgements
We would like to thank the anonymous reviewers for their insightful comments and suggestions. We also gratefully thank Dr. Steve Cassidy for his helpful points.
Mohamad Nezami, O., Jamshid Lou, P. & Karami, M. ShEMO: a large-scale validated database for Persian speech emotion detection. Lang Resources & Evaluation 53, 1–16 (2019). https://doi.org/10.1007/s10579-018-9427-x