Abstract
Conflict is a fundamental phenomenon that inevitably arises in human communication, yet it has only recently become a subject of study in the emerging field of computational paralinguistics. Because speech is a predominant carrier of information about the valence and level of conflict, we investigate and demonstrate how deep and hierarchical neural networks, which have become the mainstream paradigm in automatic speech recognition over the last few years, can be leveraged to automatically classify and predict levels of conflict based purely on audio recordings. For this purpose, we adopt a neural network architecture that we previously applied successfully to another paralinguistics task. On the Conflict Sub-Challenge data set of the Interspeech 2013 Computational Paralinguistics Challenge (ComParE), we obtained the best results reported so far in the literature on both the classification and the regression task. These results demonstrate that deep neural networks are also well suited to predicting conflict levels.
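To make the two tasks named in the abstract concrete, the following is a minimal sketch of how one feed-forward network could serve both: a stack of hidden layers feeding a softmax head (conflict class) and a linear head (continuous conflict level). It is an illustration only, not the authors' architecture. The class name `ConflictNet`, the shared hidden stack, the ReLU activations, the layer sizes, the two-class setup, and the random untrained weights are all assumptions made for this example.

```python
import math
import random


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(x, weights, bias):
    """Affine layer: one weight row and one bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, illustrative)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class ConflictNet:
    """Hypothetical network: shared hidden layers, two output heads."""

    def __init__(self, n_features, hidden_sizes, n_classes=2, seed=0):
        rng = random.Random(seed)
        sizes = [n_features] + list(hidden_sizes)
        self.hidden = [init_layer(a, b, rng)
                       for a, b in zip(sizes, sizes[1:])]
        self.cls_head = init_layer(sizes[-1], n_classes, rng)  # classification
        self.reg_head = init_layer(sizes[-1], 1, rng)          # regression

    def forward(self, x):
        h = list(x)
        for weights, bias in self.hidden:
            h = relu(dense(h, weights, bias))
        class_probs = softmax(dense(h, *self.cls_head))
        score = dense(h, *self.reg_head)[0]
        return class_probs, score


# Stand-in for an utterance-level acoustic feature vector (random values).
feature_rng = random.Random(1)
features = [feature_rng.gauss(0.0, 1.0) for _ in range(8)]

net = ConflictNet(n_features=8, hidden_sizes=[16, 16])
probs, score = net.forward(features)
print("class probabilities:", probs)
print("predicted conflict level:", score)
```

Because the weights are random, the printed values carry no meaning; the sketch only shows how the two output types differ. A real system would train the weights on labelled recordings, with a classification loss on the softmax head and a regression loss on the linear head.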
Acknowledgements
The research presented in this publication was conducted while the first author was employed by Nuance Communications Deutschland GmbH.
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Brueckner, R., Schuller, B. (2015). Be at Odds? Deep and Hierarchical Neural Networks for Classification and Regression of Conflict in Speech. In: D'Errico, F., Poggi, I., Vinciarelli, A., Vincze, L. (eds) Conflict and Multimodal Communication. Computational Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-14081-0_19
DOI: https://doi.org/10.1007/978-3-319-14081-0_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14080-3
Online ISBN: 978-3-319-14081-0
eBook Packages: Computer Science, Computer Science (R0)