Automatic Speech Feature Learning for Continuous Prediction of Customer Satisfaction in Contact Center Phone Calls

  • Carlos Segura
  • Daniel Balcells
  • Martí Umbert
  • Javier Arias
  • Jordi Luque
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10077)


Speech-related processing tasks have commonly been tackled using engineered features, also known as hand-crafted descriptors. These features have usually been optimized over the years by the research community, which constantly seeks the most meaningful, robust, and compact audio representations for a specific domain or task. In recent years, great interest has arisen in developing architectures that are able to learn such features by themselves, thus bypassing the required engineering effort. In this work we explore the possibility of using Convolutional Neural Networks (CNN) directly on raw audio signals to automatically learn meaningful features. Additionally, we study how well the learned features generalize to a different task. First, a CNN-based continuous conflict detector is trained on audio extracted from televised political debates in French. Then, while keeping the previously learned features, we adapt the last layers of the network to target another concept using completely unrelated data. Concretely, we predict self-reported customer satisfaction from call center conversations in Spanish. Reported results show that our proposed approach, using raw audio, obtains results similar to those of a CNN using classical Mel-scale filter banks. In addition, the transfer of learning from the conflict detection task to satisfaction prediction shows a successful generalization of the features learned by the deep architecture.
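
The transfer step described in the abstract (keep the convolutional features, retrain only the last layer on new data) can be illustrated with a minimal sketch. This is not the authors' architecture: the two filters are hand-set sinusoids standing in for filters learned on a source task, the waveforms and scores are synthetic, and every name (`conv1d`, `extract_features`, `train_head`, `make_example`) is hypothetical.

```python
import math
import random

def conv1d(signal, kernel, stride=1):
    """Valid 1-D cross-correlation of a raw waveform with a single filter."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(signal) - k + 1, stride)]

def extract_features(signal, kernels, stride=4):
    """Convolution, ReLU, then global max pooling: one value per filter."""
    return [max(max(0.0, a) for a in conv1d(signal, kern, stride))
            for kern in kernels]

def train_head(features, targets, lr=0.05, epochs=200):
    """Fit only a linear output layer by stochastic gradient descent on
    squared error. The filters are never updated here, i.e. they stay frozen."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def mse(features, targets, w, b):
    """Mean squared error of the linear head on precomputed features."""
    errs = [sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for x, y in zip(features, targets)]
    return sum(e * e for e in errs) / len(errs)

# Assumption: stand-ins for filters "learned" on the source task,
# one tuned to a slow oscillation and one to a fast oscillation.
KERNEL_LEN = 32
kernels = [
    [math.sin(2 * math.pi * j / 32) / 16 for j in range(KERNEL_LEN)],
    [math.sin(2 * math.pi * j / 8) / 16 for j in range(KERNEL_LEN)],
]

def make_example(rng, length=256):
    """Synthetic target-task sample: the score to predict is the amplitude
    of the fast component (a toy stand-in for a satisfaction score)."""
    a_low, a_high = rng.random(), rng.random()
    wave = [a_low * math.sin(2 * math.pi * t / 32)
            + a_high * math.sin(2 * math.pi * t / 8) for t in range(length)]
    return wave, a_high

rng = random.Random(0)
examples = [make_example(rng) for _ in range(20)]
features = [extract_features(wave, kernels) for wave, _ in examples]
targets = [score for _, score in examples]

mse_before = mse(features, targets, [0.0, 0.0], 0.0)  # untrained head
w, b = train_head(features, targets)                  # adapt last layer only
mse_after = mse(features, targets, w, b)
```

Only `w` and `b` change during training, which mirrors the idea of reusing features across tasks. In a real system the frozen part would be several learned convolutional layers rather than two fixed sinusoids.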


Keywords: Feature learning, End-to-end learning, Convolutional neural networks, Conflict speech retrieval, Automatic tagging



We would like to thank the members of the AVA innovation team, among them Roberto González and Nuria Oliver, for their interesting review. This project has received funding from the EU’s Horizon 2020 research and innovation programme under grant agreement No. 645323. This text reflects only the authors’ view, and the Commission is not responsible for any use that may be made of the information it contains.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Carlos Segura (1)
  • Daniel Balcells (1, 2)
  • Martí Umbert (1, 3)
  • Javier Arias (1)
  • Jordi Luque (1)
  1. Telefonica Research, Edificio Telefonica-Diagonal 00, Barcelona, Spain
  2. Department Signal Theory and Communications, Universitat Politècnica de Catalunya, Barcelona, Spain
  3. Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
