Speech Analysis in the Big Data Era

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9302)


In spoken language analysis tasks, one is often faced with comparatively small corpora of only one to a few hours of speech material, mostly annotated with a single phenomenon at a time, such as a particular speaker state. In stark contrast, engines such as those for recognising speakers' emotions, sentiment, personality, or pathologies are often expected to run independently of the speaker, the spoken content, and the acoustic conditions. This lack of large and richly annotated material likely explains, to a large degree, the headroom left for improvement in the accuracy of today's engines. Yet, in the big data era, with the increasing availability of crowd-sourcing services and recent advances in weakly supervised learning, new opportunities arise to ease this limitation. In this light, this contribution first surveys the de facto standard of data availability across a broad range of speaker analysis tasks. It then introduces highly efficient 'cooperative' learning strategies based on the combination of active and semi-supervised learning alongside transfer learning, to best exploit available data in combination with data synthesis. Further, approaches to estimating meaningful confidence measures in this domain are suggested, as they form (part of) the basis of the weakly supervised learning algorithms. In addition, first successful approaches towards holistic speech analysis are presented, using deep recurrent rich multi-target learning with partially missing label information. Finally, steps towards the needed distribution of processing for big data handling are demonstrated.
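The cooperative learning idea described above can be sketched as follows: a model's confidence on each unlabelled sample decides whether the sample is self-labelled (semi-supervised learning) or sent to a human annotator (active learning). The sketch below is purely illustrative, using a hypothetical nearest-centroid classifier, a margin-based confidence, and an arbitrary threshold; it is not the paper's actual algorithm.

```python
# Illustrative sketch of 'cooperative' learning: confident pool samples
# are self-labelled (semi-supervised), uncertain ones are queried from a
# human oracle (active learning). Classifier, confidence measure, and
# threshold are hypothetical choices, not the authors' method.
import math

def centroids(X, y):
    """Per-class mean feature vectors of the labelled data."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        s = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def predict_with_confidence(cents, x):
    """Nearest-centroid label; confidence is the normalised margin
    between the two closest class centroids."""
    dists = sorted((math.dist(x, c), label) for label, c in cents.items())
    best, second = dists[0], dists[1]
    conf = (second[0] - best[0]) / (second[0] + best[0] + 1e-9)
    return best[1], conf

def cooperative_round(X_lab, y_lab, X_pool, oracle, threshold=0.5):
    """One round: each pool sample is either self-labelled (high
    confidence) or labelled by the oracle (low confidence)."""
    cents = centroids(X_lab, y_lab)
    queried = 0
    for x in X_pool:
        label, conf = predict_with_confidence(cents, x)
        if conf < threshold:      # machine is unsure: ask a human
            label = oracle(x)
            queried += 1
        X_lab.append(x)           # either way, the sample joins the
        y_lab.append(label)       # labelled set for the next round
    return X_lab, y_lab, queried
```

In practice the round would be iterated, the model retrained after each round, and the human labelling budget tracked; the confidence measures discussed in the paper would replace the simple margin used here.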


Keywords: Speech analysis · Paralinguistics · Big data · Self-learning





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. University of Passau, Chair of Complex and Intelligent Systems, Passau, Germany
  2. Department of Computing, Imperial College London, London, UK
  3. audEERING UG, Gilching, Germany
  4. Joanneum Research, Graz, Austria
  5. CISA, University of Geneva, Geneva, Switzerland
  6. Harbin Institute of Technology, Harbin, People’s Republic of China
