
Speech emotion recognition using data augmentation method by cycle-generative adversarial networks

  • Original Paper
  • Published in: Signal, Image and Video Processing

Abstract

One of the obstacles in developing speech emotion recognition (SER) systems is data scarcity, i.e., the lack of labeled data for training. Data augmentation is an effective way to increase the amount of training data. In this paper, we propose a cycle-generative adversarial network (cycle-GAN) for data augmentation in SER systems. For each of the five emotions considered, an adversarial network is designed to generate data whose distribution is similar to that of the real data in its class and different from those of the other classes. These networks are trained adversarially to produce feature vectors similar to those in the training set, which are then added to the original training set. Instead of the common cross-entropy loss for training cycle-GANs, we use the Wasserstein divergence, which mitigates the vanishing-gradient problem and yields higher-quality samples. The proposed network is applied to SER on the EMO-DB dataset, and the quality of the generated data is evaluated with two classifiers, one based on a support vector machine and one on a deep neural network. The results show a recognition accuracy of about 83.33% in unweighted average recall, outperforming the compared baseline methods.
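To make the training objective concrete, the sketch below shows a Wasserstein-divergence (WGAN-div) critic and generator update on acoustic feature vectors, in the spirit of the loss the abstract describes. This is a minimal illustration under stated assumptions, not the authors' implementation: the network sizes, the feature and noise dimensions (FEAT_DIM, NOISE_DIM), and the hyperparameters k and p are placeholders, and the per-class cycle-consistency terms of the full cycle-GAN are omitted.

```python
# Illustrative sketch only: a plain WGAN-div update on feature vectors.
# FEAT_DIM, NOISE_DIM, the MLP sizes, and K, P are assumptions, not the
# paper's values; K = 2, P = 6 follow Wu et al.'s WGAN-div recipe.
import torch
import torch.nn as nn

FEAT_DIM, NOISE_DIM = 384, 100  # assumed sizes of the acoustic feature / noise vectors
K, P = 2.0, 6.0

# Small MLPs standing in for the generator and critic of one emotion class.
G = nn.Sequential(nn.Linear(NOISE_DIM, 256), nn.ReLU(), nn.Linear(256, FEAT_DIM))
D = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

def critic_loss(real: torch.Tensor) -> torch.Tensor:
    """Wasserstein-divergence critic loss for one batch of real feature vectors."""
    fake = G(torch.randn(real.size(0), NOISE_DIM)).detach()
    # Random interpolates between real and generated samples for the divergence term.
    eps = torch.rand(real.size(0), 1)
    mix = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(mix).sum(), mix, create_graph=True)[0]
    w_div = K * grad.norm(2, dim=1).pow(P).mean()
    return D(fake).mean() - D(real).mean() + w_div

def generator_loss(batch_size: int) -> torch.Tensor:
    """The generator is scored by how 'real' the critic finds its samples."""
    fake = G(torch.randn(batch_size, NOISE_DIM))
    return -D(fake).mean()
```

In the paper's setting, one such generator–critic pair would be trained per emotion class, with cycle-consistency constraints added on top; the feature vectors produced by the trained generators are then appended to the original training set before fitting the SVM and DNN classifiers.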





Author information

Correspondence to Arash Shilandari.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Shilandari, A., Marvi, H., Khosravi, H. et al. Speech emotion recognition using data augmentation method by cycle-generative adversarial networks. SIViP 16, 1955–1962 (2022). https://doi.org/10.1007/s11760-022-02156-9

