Emotion Recognition in Sound
In this paper we consider the problem of automatic emotion recognition, focusing on the case of digital audio signal processing. We describe and verify a straightforward approach in which the classification of a sound fragment is reduced to an image recognition problem: the waveform and the spectrogram serve as visual representations of the sound. The computational experiment was carried out on the open RAVDESS dataset, which includes 8 different emotions: “neutral”, “calm”, “happy”, “sad”, “angry”, “scared”, “disgust”, “surprised”. Our best accuracy of 71% was produced by the combination “mel-spectrogram + VGG-16 convolutional neural network”.
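The pipeline described above first converts each audio clip into a mel-spectrogram image before passing it to the convolutional network. The following is a minimal NumPy sketch of that conversion step, not the paper's exact implementation; the parameters (`n_fft=1024`, `hop=512`, `n_mels=64`, sample rate 22050 Hz) are illustrative assumptions rather than the settings used in the experiment.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Build a bank of triangular filters spaced evenly on the mel scale."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope of the triangle
            if center > left:
                fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):          # falling slope of the triangle
            if right > center:
                fb[i - 1, k] = (right - k) / (right - center)
    return fb

def mel_spectrogram(signal, sr=22050, n_fft=1024, hop=512, n_mels=64):
    """Frame the signal, apply a Hann window, take the power spectrum,
    and project it onto the mel filterbank: result is (n_mels, n_frames)."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return mel_filterbank(n_mels, n_fft, sr) @ power.T
```

In practice a library such as librosa (`librosa.feature.melspectrogram`) would be used instead; the resulting two-dimensional array is then rendered as an image (typically in log scale) and fed to the CNN as ordinary input pixels.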
Keywords: Deep learning · Classification · Convolutional neural networks · Audio recognition · Emotion recognition · Speech recognition
The article was prepared within the framework of the Academic Fund Program at the National Research University Higher School of Economics (HSE) in 2017 (Grant №17-05-0007) and by the Russian Academic Excellence Project “5-100”.
- 1. Krofto, E.: How Yandex recognizes music from a microphone. In: Yet another Conference 2013, Moscow (2013). (in Russian)
- 2. Wang, A.: An industrial strength audio search algorithm. In: ISMIR, vol. 2003, pp. 7–13 (2003)
- 4. Choi, K., Fazekas, G., Sandler, M.: Automatic tagging using deep convolutional neural networks. arXiv preprint arXiv:1606.00298 (2016)
- 9. Livingstone, S.R., Peck, K., Russo, F.A.: RAVDESS: the Ryerson audio-visual database of emotional speech and song. In: Annual Meeting of the Canadian Society for Brain, Behaviour and Cognitive Science (2012)
- 10. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
- 11. Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C.M., Kazemzadeh, A., Narayanan, S.: Analysis of emotion recognition using facial expressions, speech and multimodal information. In: Proceedings of the 6th International Conference on Multimodal Interfaces, pp. 205–211. ACM (2004)
- 13. Tsai, T.J., Morgan, N.: Longer features: they do a speech detector good. In: INTERSPEECH, pp. 1356–1359 (2012)
- 14. Eyben, F., Böck, S., Schuller, B.W., Graves, A.: Universal onset detection with bidirectional long short-term memory neural networks. In: ISMIR, pp. 589–594 (2010)
- 15. Ramachandran, A., Vasudevan, S., Naganathan, V.: Deep learning for music era classification. http://varshaan.github.io/Media/ml_report.pdf. Accessed 23 June 2017
- 16. Ishaq, M.: Voice activity detection and garbage modelling for a mobile automatic speech recognition application. https://aaltodoc.aalto.fi/handle/123456789/24702. Accessed 23 June 2017