Big Data, Deep Learning – At the Edge of X-Ray Speaker Analysis

Schuller, Björn W.

doi:10.1007/978-3-319-66429-3_2

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10458)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

By the age of two, one has heard roughly a thousand hours of speech; by the age of ten, around ten thousand. Similarly, an automatic speech recogniser’s data hunger these days is often fed on this scale. In stark contrast, however, few databases for training a speaker analysis system contain more than ten hours of speech. Yet these systems are ideally expected to recognise the states and traits of speakers independently of the person, spoken content, language, cultural background, and acoustic disturbances, at human parity or even super-human levels. While this has not yet been reached for many tasks such as speaker emotion recognition, deep learning – often described as leading to ‘dramatic improvements’ – in combination with sufficient learning data satisfying the ‘deep data cravings’ holds the promise of getting us there. Luckily, every second, more than five hours of video are uploaded to the web, and several hundred hours of audio and video communication take place in most languages of the world. If only a fraction of these data were shared and labelled reliably, ‘x-ray’-like automatic speaker analysis could be around the corner for next-generation human-computer interaction, mobile health applications, and many further benefits to society. In this light, first, a solution towards highly efficient exploitation of the ‘big’ (unlabelled) data available is presented. Small-world modelling in combination with unsupervised learning helps to rapidly identify potential target data of interest. Then, gamified dynamic cooperative crowdsourcing turns its labelling into an entertaining experience, while reducing the number of required labels to a minimum by learning, alongside the target task, the labellers’ behaviour and reliability. Further, increasingly autonomous deep holistic end-to-end learning solutions are presented for the task at hand. Benchmarks are drawn from the nine research challenges the author has co-organised at the annual Interspeech conference since 2009. The concluding discussion offers some crystal-ball gazing alongside practical hints, including ethical aspects.
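
The abstract mentions learning the labellers' reliability alongside the target task. As a rough illustration only, and not the chapter's actual method, the following Python sketch alternates between a reliability-weighted vote per item and a re-estimate of each labeller's reliability from their agreement with the current consensus. The function name `aggregate_labels`, the tuple layout of `votes`, and the add-one smoothing are all assumptions made for this example.

```python
from collections import defaultdict


def aggregate_labels(votes, n_iter=10):
    """Hypothetical helper: reliability-weighted label aggregation.

    votes: list of (item_id, labeller_id, label) tuples.
    Returns (consensus, reliability) as two dicts.
    """
    labellers = {labeller for _, labeller, _ in votes}
    # Assumption: every labeller starts out equally trusted.
    reliability = {labeller: 1.0 for labeller in labellers}
    consensus = {}

    for _ in range(n_iter):
        # Step 1: per item, sum the reliability of the labellers behind each label.
        scores = defaultdict(lambda: defaultdict(float))
        for item, labeller, label in votes:
            scores[item][label] += reliability[labeller]
        # Sorting the candidate labels makes tie-breaking deterministic.
        consensus = {
            item: max(sorted(label_scores), key=label_scores.get)
            for item, label_scores in scores.items()
        }

        # Step 2: reliability = smoothed rate of agreement with the consensus.
        agree = defaultdict(int)
        total = defaultdict(int)
        for item, labeller, label in votes:
            total[labeller] += 1
            agree[labeller] += int(consensus[item] == label)
        reliability = {
            labeller: (agree[labeller] + 1) / (total[labeller] + 2)
            for labeller in labellers
        }

    return consensus, reliability


# Toy usage: three labellers rate three utterances; "r3" often disagrees.
votes = [
    ("a", "r1", "happy"), ("a", "r2", "happy"), ("a", "r3", "sad"),
    ("b", "r1", "sad"), ("b", "r2", "sad"), ("b", "r3", "sad"),
    ("c", "r1", "happy"), ("c", "r2", "happy"), ("c", "r3", "sad"),
]
consensus, reliability = aggregate_labels(votes)
```

In this toy run the consensus follows the two agreeing labellers, and the third receives a lower reliability score. A system of the kind the abstract describes would presumably also use such scores to decide which items still need further labels, which this sketch does not attempt.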

Notes

1. https://www.youtube.com/yt/press/de/statistics.html (accessed 1 June 2017).
2. See http://compare.openaudio.eu/ for details on these events.
3. http://audeering.com/technology/opensmile/
4. http://www.cs.waikato.ac.nz/ml/weka/
5. http://github.com/openXBOW/openXBOW/
6. http://www.tensorflow.org/

Acknowledgment

The author acknowledges funding from the European Research Council within the European Union’s 7th Framework Programme under grant agreement no. 338164 (Starting Grant Intelligent systems’ Holistic Evolving Analysis of Real-life Universal speaker characteristics (iHEARu)) and the European Union’s Horizon 2020 Framework Programme under grant agreement no. 645378 (Research Innovation Action Artificial Retrieval of Information Assistants - Virtual Agents with Linguistic Understanding, Social skills, and Personalised Aspects (ARIA-VALUSPA)). The responsibility lies with the author. The author would further like to thank his team colleague Anton Batliner at the University of Passau, Germany, as well as Stefan Steidl at FAU Erlangen, Germany, and all other co-organisers and participants over the years for running the Interspeech Computational Paralinguistics challenge events and turning them into a meaningful benchmark.

Author information

Authors and Affiliations

Department of Computing, Imperial College London, London, SW7 2AZ, UK
Björn W. Schuller
Chair of Complex and Intelligent Systems, University of Passau, 94032, Passau, Germany
Björn W. Schuller
audEERING GmbH, 82205, Gilching, Germany
Björn W. Schuller

Corresponding author

Correspondence to Björn W. Schuller.

Editor information

Editors and Affiliations

SPIIRAS, Saint Petersburg, Russia
Alexey Karpov
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova
University of Hertfordshire, Hatfield, United Kingdom
Iosif Mporas

About this paper

Cite this paper

Schuller, B.W. (2017). Big Data, Deep Learning – At the Edge of X-Ray Speaker Analysis. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_2

DOI: https://doi.org/10.1007/978-3-319-66429-3_2
Published: 13 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66428-6
Online ISBN: 978-3-319-66429-3
eBook Packages: Computer Science (R0)