Abstract
In this paper, we investigate various compression and classification algorithms for three paralinguistic signal classification tasks. These tasks are difficult even for humans because the acoustic cues that distinguish such signals are subtle. Therefore, when machine learning techniques are applied to paralinguistic signals, several complementary types of speech information, such as prosody, energy, and cepstral features, are usually combined during feature extraction. However, when the training corpus is not sufficiently large, the resulting high feature dimensionality makes it extremely difficult to apply machine learning directly; this problem is known as the curse of dimensionality. This paper addresses this limitation by means of feature compression. First, we present experimental results obtained by applying various compression algorithms to eliminate redundancy in the signal features. We observe that, compared with the original features, the compressed features retain a comparable ability to distinguish the signals, especially when a fully connected neural network classifier is used. Second, we examine the distribution of per-emotion F1-scores in the speech emotion recognition task and show that the fully connected neural network classifier performs more stably than classical methods.
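The abstract does not name the specific compression algorithms; as a minimal illustrative sketch of the pipeline it describes, the example below uses a PCA-style projection (one common compression choice) on synthetic stand-in features, followed by a simple nearest-centroid classifier and a per-class F1 computation of the kind used to study per-emotion score distributions. All data, dimensions, and the classifier choice are assumptions for illustration, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for high-dimensional paralinguistic feature vectors
# (real extractors produce thousands of dimensions per utterance):
# 300 samples, 200 dims, 4 "emotion" classes.
n, d, k, n_classes = 300, 200, 16, 4
y = rng.integers(0, n_classes, size=n)
centers = rng.normal(size=(n_classes, d))
X = centers[y] + rng.normal(scale=2.0, size=(n, d))

# PCA-style compression: project onto the top-k principal directions,
# discarding redundant dimensions.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:k].T  # compressed features, shape (n, k)

# A simple nearest-centroid classifier on the compressed features
# (a stand-in for the paper's classifiers).
cent = np.stack([Z[y == c].mean(axis=0) for c in range(n_classes)])
pred = np.argmin(((Z[:, None, :] - cent[None]) ** 2).sum(-1), axis=1)

def f1_per_class(y_true, y_pred, n_classes):
    """Per-class F1 scores, whose spread across classes indicates
    how stably a classifier handles each emotion."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return np.array(scores)

print(f1_per_class(y, pred, n_classes))
```

Compressing from 200 to 16 dimensions before classification is the key step: the classifier then fits far fewer effective parameters, which is exactly the remedy for the small-corpus, high-dimension regime the abstract describes.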
Acknowledgements
This work was supported by the Ministry of Trade, Industry and Energy (MOTIE, Korea) under the Industrial Technology Innovation Program (No.10073144). K. Jung is with ASRI, Seoul National University, Korea.
Cite this article
Byun, S., Yoon, S. & Jung, K. Comparative studies on machine learning for paralinguistic signal compression and classification. J Supercomput 76, 8357–8371 (2020). https://doi.org/10.1007/s11227-020-03346-3