Training universal background models with restricted data for speech emotion recognition

Abstract

Speech emotion recognition (SER) is an important research topic that relies heavily on emotional data. Although SER has seen recent advances, the Universal Background Model (UBM), a standard concept borrowed from the neighbouring field of speaker recognition, remains the base module for newly developed methods such as Joint Factor Analysis. In principle, a UBM is a Gaussian mixture model trained on a large and representative set of speech samples from the different target classes in order to capture general feature characteristics. Obtaining large amounts of emotional data to train a UBM is challenging, and the task is further complicated by the cost of annotation and the ambiguity of the resulting labels. Moreover, a UBM remains dependent on the data it is trained on. In this paper, we present a preliminary exploration of a new approach: training UBMs, which we call restricted UBMs, on a small amount of speech that may even differ from the training data. Experiments show that this approach yields a domain-independent UBM from which an acoustic model transferable across datasets can be built. Four standard benchmark speech databases in different languages are used for the experimental evaluation, and the results show that the proposed model outperforms existing state-of-the-art baselines. We also apply the approach to emotional speaker recognition.
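
Although the article itself ships no source code, the GMM-UBM pipeline the abstract describes can be illustrated with a minimal Python sketch. The sketch below uses scikit-learn's GaussianMixture as a stand-in EM trainer together with classical mean-only MAP adaptation; the function names, the relevance-factor default, and the placeholder feature arrays are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_restricted_ubm(background_frames, n_components=64, seed=0):
    # Fit a diagonal-covariance GMM (the UBM) by EM on pooled feature
    # frames. For a "restricted" UBM, background_frames would be a
    # small, possibly out-of-domain pool of MFCC-like vectors of
    # shape (n_frames, n_dims).
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          max_iter=200, random_state=seed)
    ubm.fit(background_frames)
    return ubm

def map_adapt_supervector(ubm, utterance_frames, relevance_factor=16.0):
    # Classical mean-only MAP adaptation of the UBM to one utterance.
    post = ubm.predict_proba(utterance_frames)          # responsibilities (T, C)
    n_c = post.sum(axis=0)                              # soft frame counts per mixture
    f_c = post.T @ utterance_frames                     # first-order statistics (C, D)
    e_c = f_c / np.maximum(n_c[:, None], 1e-10)         # posterior-weighted means
    alpha = (n_c / (n_c + relevance_factor))[:, None]   # data-dependent adaptation weight
    adapted = alpha * e_c + (1.0 - alpha) * ubm.means_  # interpolate with UBM means
    return adapted.reshape(-1)                          # stacked GMM supervector

# Hypothetical usage with random stand-in features:
# ubm = train_restricted_ubm(np.random.randn(5000, 13))
# supervector = map_adapt_supervector(ubm, np.random.randn(300, 13))

The resulting supervectors would then feed a back-end classifier such as the support vector machines named in the keywords, the standard GMM-SVM recipe in both speaker and emotion recognition.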


Notes

  1. http://emodb.bilderbar.info/docu/.

  2. https://catalog.ldc.upenn.edu/LDC93S1.

  3. http://kahlan.eps.surrey.ac.uk/savee/.

  4. https://www.unige.ch/cisa/gemep.

  5. http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox.


Author information


Corresponding author

Correspondence to Imen Trabelsi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Trabelsi, I., Perotto, F.S. & Malik, U. Training universal background models with restricted data for speech emotion recognition. J Ambient Intell Human Comput (2021). https://doi.org/10.1007/s12652-021-03200-1


Keywords

  • Speech emotion recognition
  • Restricted universal background models
  • Support vector machines