
Discussion

Chapter in: Intelligent Audio Analysis

Part of the book series: Signals and Communication Technology (SCT)

Abstract

This chapter first summarises how the state of the art in Intelligent Audio Analysis has recently been advanced. Based on this, a distilled 'best practice' recommendation is given to the reader, covering high realism; standardised, multi-faceted, and machine-aided data collection; source separation; feature brute-forcing; modelling of temporal evolution; coupling of tasks; and standardisation. A critical discussion then addresses missing aspects and remaining research steps, including increased robustness, blind separation and multi-task processing of real-life audio streams, massive weakly supervised and evolutionary learning, closing the gap between analysis and synthesis, cross-cultural and cross-lingual widening, novel tasks, further unification and transfer of methods, confidence measures, distributed processing, and new competitive research challenges.
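The 'feature brute-forcing' named in the abstract, i.e. the systematic application of statistical functionals to low-level descriptor (LLD) contours to obtain one large fixed-length feature vector, can be sketched as follows. The descriptor names, toy contours, and the particular functional set here are illustrative assumptions, not the book's actual feature set:

```python
import numpy as np

def brute_force_features(lld_contours, names):
    """Apply a bank of statistical functionals to every low-level
    descriptor (LLD) contour, yielding one large fixed-length vector."""
    functionals = {
        "mean": np.mean,
        "std": np.std,
        "min": np.min,
        "max": np.max,
        "range": lambda x: np.max(x) - np.min(x),
        # slope of a linear fit approximates the temporal trend
        "slope": lambda x: np.polyfit(np.arange(len(x)), x, 1)[0],
    }
    features = {}
    for name, contour in zip(names, lld_contours):
        contour = np.asarray(contour, dtype=float)
        for fname, func in functionals.items():
            features[f"{name}_{fname}"] = float(func(contour))
    return features

# Toy example: two LLD contours (e.g. energy and F0) over five frames.
feats = brute_force_features(
    [[0.1, 0.3, 0.2, 0.5, 0.4], [120, 118, 125, 130, 128]],
    ["energy", "f0"],
)
print(len(feats))  # 2 LLDs x 6 functionals = 12 features
```

In practice, toolkits such as openSMILE (cited in this chapter's references) apply dozens of functionals to dozens of LLDs, yielding thousands of candidate features.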

A scientist’s aim in a discussion with his colleagues is not to persuade, but to clarify.

—Leo Szilard.
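Several of the source-separation steps discussed in the chapter rest on non-negative matrix factorisation (NMF). The following minimal sketch factorises a toy magnitude 'spectrogram' V into non-negative components W and H via multiplicative updates; in the semi-supervised variants cited in this chapter's references, some columns of W would be pre-trained on a target source and held fixed during separation. The matrix sizes and toy data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf(V, rank, n_iter=300):
    """Basic NMF with multiplicative updates minimising the Euclidean
    distance between V (a non-negative spectrogram) and W @ H."""
    n, m = V.shape
    W = rng.random((n, rank)) + 1e-3
    H = rng.random((rank, m)) + 1e-3
    for _ in range(n_iter):
        # Lee-Seung multiplicative update rules; the small epsilon
        # in the denominators avoids division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy magnitude "spectrogram": an additive mixture of two rank-1 sources.
V = (np.outer([1, 0, 2], [1, 1, 0, 1])
     + np.outer([0, 3, 1], [0, 1, 1, 0])).astype(float)
W, H = nmf(V, rank=2)
# The reconstruction error should be small once the updates converge.
print(np.linalg.norm(V - W @ H))
```

Each column of W can be read as a spectral basis and each row of H as its activation over time; masking the reconstruction with a subset of components yields the separated source estimate.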



References

  1. Schuller, B., Lehmann, A., Weninger, F., Eyben, F., Rigoll, G.: Blind enhancement of the rhythmic and harmonic sections by nmf: Does it help? In: Proceedings International Conference on Acoustics including the 35th German Annual Conference on Acoustics, NAG/DAGA 2009, pp. 361–364. DEGA, Rotterdam, March 2009

    Google Scholar 

  2. Weninger, F., Wöllmer, M., Schuller B.: Automatic assessment of singer traits in popular music: gender, age, height and race. In: Proceedings 12th International Society for Music Information Retrieval Conference, ISMIR 2011, pp. 37–42. ISMIR, Miami (2011)

    Google Scholar 

  3. Weninger, F., Durrieu, J.-L., Eyben, F., Richard, G., Schuller, B.: Combining monaural source separation with long short-term memory for increased robustness in vocalist gender recognition. In: Proceedings 36th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2011), pp. 2196–2199. IEEE, Prague, Czech Republic, May 2011

    Google Scholar 

  4. Weninger, F., Lehmann, A., Schuller, B.: Openblissart: design and evaluation of a research toolkit for blind source separation in audio recognition tasks. In: Proceedings 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, pp. 1625–1628. IEEE, Prague, May 2011

    Google Scholar 

  5. Weninger, F., Geiger, J., Wöllmer, M., Schuller, B., Rigoll, G.: The munich 2011 chime challenge contribution: Nmf-blstm speech enhancement and recognition for reverberated multisource environments. In: Proceedings Machine Listening in Multisource Environments, CHiME 2011, Satellite Workshop of Interspeech 2011, pp. 24–29. ISCA, Florence, Sept 2011

    Google Scholar 

  6. Weninger, F., Wöllmer, M., Geiger, J., Schuller, B., Gemmeke, J., Hurmalainen, A., Virtanen, T., Rigoll, G.: Non-negative matrix factorization for highly noise-robust asr: to enhance or to recognize? In: Proceedings 37th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2012, pp. 4681–4684. IEEE, Kyoto, March 2012

    Google Scholar 

  7. Weninger, F., Feliu, J., Schuller, B.: Supervised and semi-supervised supression of background music in monaural speech recordings. In: Proceedings 37th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2012, pp. 61–64. IEEE, Kyoto, March 2012

    Google Scholar 

  8. Weninger, F., Amir, N., Amir, O., Ronen, I., Eyben, F., Schuller, B.: Robust feature extraction for automatic recognition of vibrato singing in recorded polyphonic music. In: Proceedings 37th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2012, pp. 85–88. IEEE, Kyoto, March 2012

    Google Scholar 

  9. Joder, C., Weninger, F., Eyben, F., Virette, D., Schuller, B.: Real-time speech separation by semi-supervised nonnegative matrix factorization. In: Theis, F.J., Cichocki, A., Yeredor, A., Zibulevsky, M. (eds.) Proceedings 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA 2012). Lecture Notes in Computer Science, vol. 7191, pp. 322–329. Springer, Tel Aviv (2012)

    Google Scholar 

  10. Batliner, A., Steidl, S., Schuller, B., Seppi, D., Laskowski, K., Vogt, T., Devillers, L., Vidrascu, L., Amir, N., Kessous, L., Aharonson, V.: Combining efforts for improving automatic classification of emotional user states. In: Proceedings 5th Slovenian and 1st International Language Technologies Conference, ISLTC 2006, pp. 240–245. Slovenian Language Technologies Society, Ljubljana, Oct 2006

    Google Scholar 

  11. Schuller, B., Wimmer, M., Mösenlechner, L., Kern, C., Arsić, D., Rigoll, G.: Brute-forcing hierarchical functionals for paralinguistics: a waste of feature space? In: Proceedings 33rd IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2008, pp. 4501–4504. IEEE, Las Vegas, April 2008

    Google Scholar 

  12. Schuller, B.: The computational paralinguistics challenge. IEEE Signal Process. Mag. 29(4), 97–101 (2012)

    Article  Google Scholar 

  13. Schuller, B., Weninger, F., Wöllmer, M., Sun, Y., Rigoll, G.: Non-negative matrix factorization as noise-robust feature extractor for speech recognition. In: Proceedings 35th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2010, pp. 4562–4565. IEEE, Dallas, March 2010

    Google Scholar 

  14. Schuller, B., Weninger, F.: Discrimination of speech and non-linguistic vocalizations by non-negative matrix factorization. In: Proceedings 35th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2010, pp. 5054–5057. IEEE, Dallas, March 2010

    Google Scholar 

  15. Weninger, F., Schuller, B., Batliner, A., Steidl, S., Seppi, D.: Recognition of non-prototypical emotions in reverberated and noisy speech by non-negative matrix factorization. EURASIP J. Adv. Signal Process. Article ID 838790, 16 (2011). Special issue on emotion and mental state recognition from speech

    Google Scholar 

  16. Weninger, F., Schuller, B., Wöllmer, M., Rigoll, G.: Localization of non-linguistic events in spontaneous speech by non-negative matrix factorization and long short-term memory. In: Proceedings 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, pp. 5840–5843. IEEE, Prague, May 2011

    Google Scholar 

  17. Schuller, B., Gollan, B.: Music theoretic and perception-based features for audio key determination. J. New Music Res. 41(2), 175–193 (2012)

    Article  Google Scholar 

  18. Wöllmer, M., Eyben, F., Graves, A., Schuller, B., Rigoll, G.: A tandem blstm-dbn architecture for keyword spotting with enhanced context modeling. In: Proceedings ISCA Tutorial and Research Workshop on Non-Linear Speech Processing, NOLISP 2009, pp. 9. ISCA, Vic, June 2009

    Google Scholar 

  19. Wöllmer, M., Eyben, F., Schuller, B., Douglas-Cowie, E., Cowie, R.: Data-driven clustering in emotional space for affect recognition using discriminatively trained lstm networks. In: Proceedings INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, pp. 1595–1598. ISCA, Brighton, Sept 2009

    Google Scholar 

  20. Eyben, F., Böck, S., Schuller, B., Graves, A.: Universal onset detection with bidirectional long-short term memory neural networks. In: Proceedings 11th International Society for Music Information Retrieval Conference, ISMIR 2010, pp. 589–594. ISMIR, Utrecht, Oct 2010

    Google Scholar 

  21. Böck, S., Eyben, F., Schuller, B.: Tempo detection with bidirectional long short-term memory neural networks. In: Proceedings Annual Meeting of the MIREX 2010 community as part of the 11th International Conference on Music Information Retrieval, pp. 3. ISMIR, Utrecht, August 2010

    Google Scholar 

  22. Böck, S., Eyben, F., Schuller, B.: Onset detection with bidirectional long short-term memory neural networks. In: Proceedings Annual Meeting of the MIREX 2010 community as part of the 11th International Conference on Music Information Retrieval, pp. 2. ISMIR, Utrecht, August 2010

    Google Scholar 

  23. Arsić, D., Wöllmer, M., Rigoll, G., Roalter, L., Kranz, M., Kaiser, M., Eyben, F., Schuller, B.: Automated 3d gesture recognition applying long short-term memory and contextual knowledge in a cave. In: Proceedings 1st Workshop on Multimodal Pervasive Video Analysis, MPVA 2010, held in conjunction with ACM Multimedia 2010, pp. 33–36. ACM, Florence, Oct 2010

    Google Scholar 

  24. M. Wöllmer, A. Metallinou, F. Eyben, B. Schuller, and S. Narayanan: Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional lstm modeling. In: Proceedings INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, pp. 2362–2365. ISCA, Makuhari, Sept 2010

    Google Scholar 

  25. Landsiedel, C., Edlund, J., Eyben, F., Neiberg, D., Schuller, B.: Syllabification of conversational speech using bidirectional long-short-term memory neural networks. In: Proceedings 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, pp. 5265–5268. IEEE, Prague, May 2011

    Google Scholar 

  26. Eyben, F., Petridis, S., Schuller, B., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: Audiovisual classification of vocal outbursts in human conversation using long-short-term memory networks. In: Proceedings 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, pp. 5844–5847. IEEE, Prague, May 2011

    Google Scholar 

  27. Wöllmer, M., Weninger, F., Eyben, F., Schuller, B.: Acoustic-linguistic recognition of interest in speech with bottleneck-blstm nets. In: Proceedings INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, pp. 3201–3204. ISCA, Florence, August 2011

    Google Scholar 

  28. Wöllmer, M., Blaschke, C., Schindl, T., Schuller, B., Färber, B., Mayer, S., Trefflich, B.: On-line driver distraction detection using long short-term memory. IEEE Trans. Intell. Transp. Syst. 12(2), 574–582 (2011)

    Article  Google Scholar 

  29. Wöllmer, M., Metallinou, A., Katsamanis, N., Schuller, B., Narayanan, S.: Analyzing the memory of blstm neural networks for enhanced emotion classification in dyadic spoken interactions. In: Proceedings 37th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2012, pp. 4157–4160. IEEE, Kyoto, March 2012

    Google Scholar 

  30. Wöllmer, M., Kaiser, M., Eyben, F., Schuller, B., Rigoll, G.: Lstm-modeling of continuous emotions in an audiovisual affect recognition framework. Image and Vision Computing, Special Issue on Affect Analysis in Continuous Input, p. 16, 2012

    Google Scholar 

  31. Reiter, S., Schuller, B., Rigoll, G.: A combined lstm-rnn-hmm-approach for meeting event segmentation and recognition. In: Proceedings 31st IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2006, vol. 2, pp. 393–396. IEEE, Toulouse, May 2006

    Google Scholar 

  32. Wöllmer, M., Eyben, F., Keshet, J., Graves, A., Schuller, B., Rigoll, G.: Robust discriminative keyword spotting for emotionally colored spontaneous speech using bidirectional lstm networks. In: Proceedings 34th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2009, pp. 3949–3952. IEEE, Taipei, April 2009

    Google Scholar 

  33. Wöllmer, M., Eyben, F., Schuller, B., Sun, Y., Moosmayr, T., Nguyen-Thien, N.: Robust in-car spelling recognition: a tandem blstm-hmm approach. In: Proceedings INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, pp. 1990–9772. ISCA, Brighton, Sept 2009

    Google Scholar 

  34. Wöllmer, M., Eyben, F., Graves, A., Schuller, B., Rigoll, G.: Bidirectional lstm networks for context-sensitive keyword detection in a cognitive virtual agent framework. Cogn. Comput. 2(3), 180–190 (2010). Special issue on non-linear and non-conventional speech processing

    Article  Google Scholar 

  35. Wöllmer, M., Eyben, F., Graves, A., Schuller, B., Rigoll, G.: Improving keyword spotting with a tandem blstm-dbn architecture. In: Sole-Casals, J., Zaiats, V. (eds.) Advances in Non-Linear Speech Processing: International Conference on Nonlinear Speech Processing, 25–27 June 2009 (NOLISP 2009). Revised Selected Papers, Lecture Notes on Computer Science (LNCS), vol. 5933/2010, pp. 68–75. Springer, Vic (2010)

    Google Scholar 

  36. Wöllmer, M., Schuller, B., Eyben, F., Rigoll, G.: Combining long short-term memory and dynamic Bayesian networks for incremental emotion-sensitive artificial listening. IEEE J. Sel. Top. Signal Proces. 4(5), 867–881 (2010). Special issue on speech processing for natural interaction with intelligent environments

    Article  Google Scholar 

  37. Wöllmer, M., Sun, Y., Eyben, F., Schuller, B.: Long short-term memory networks for noise robust speech recognition. In: Proceedings INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, pp. 2966–2969. ISCA, Makuhari, Sept 2010

    Google Scholar 

  38. Wöllmer, M., Eyben, F., Schuller, B., Rigoll, G.: Recognition of spontaneous conversational speech using long short-term memory phoneme predictions. In: Proceedings INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, pp. 1946–1949. ISCA, Makuhari, Sept 2010

    Google Scholar 

  39. Wöllmer, M., Marchi, E., Squartini, S., Schuller, B.: Multi-stream lstm-hmm decoding and histogram equalization for noise robust keyword spotting. Cogn. Neurodyn. 5(3), 253–264 (2011)

    Article  Google Scholar 

  40. Wöllmer, M., Schuller, B., Rigoll, G.: A novel bottleneck-blstm front-end for feature-level context modeling in conversational speech recognition. In: Proceedings 12th Biannual IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2011, pp. 36–41. IEEE, Big Island, Dec 2011

    Google Scholar 

  41. Wöllmer, M., Eyben, F., Schuller, B., Rigoll, G.: A multi-stream asr framework for blstm modeling of conversational speech. In: Proceedings 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, pp. 4860–4863. IEEE, Prague, May 2011

    Google Scholar 

  42. Wöllmer, M., Schuller, B.: Enhancing spontaneous speech recognition with blstm features. In: Travieso-González, C.M., Alonso-Hernández, J. (eds.) Advances in Nonlinear Speech Processing, 5th International Conference on Nonlinear Speech Processing, 7–9 Nov 2011 (NoLISP 2011). Proceedings, Lecture Notes in Computer Science (LNCS), vol. 7015/2011, pp. 17–24. Springer, Las Palmas de Gran Canaria (2011)

    Google Scholar 

  43. Schuller, B., Wöllmer, M., Moosmayr, T., Rigoll, G.: Recognition of noisy speech: a comparative survey of robust model architecture and feature enhancement. EURASIP J. Audio Speech Music Process. Article ID 942617, 17 (2009)

    Google Scholar 

  44. Schuller, B., Burkhardt, F.: Learning with synthesized speech for automatic emotion recognition. In: Proceedings 35th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2010, pp. 5150–515. IEEE, Dallas, March 2010

    Google Scholar 

  45. Zhang, Z., Schuller, B.: Semi-supervised learning helps in sound event classification. In: Proceedings 37th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2012, pp. 333–336. IEEE, Kyoto, March 2012

    Google Scholar 

  46. Zhang, Z., Weninger, F., Wöllmer, M., Schuller, B.: Unsupervised learning in cross-corpus acoustic emotion recognition. In: Proceedings 12th Biannual IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2011, pp. 523–528. IEEE, Big Island, Dec 2011

    Google Scholar 

  47. Schuller, B., Steidl, S., Batliner, A., Burkhardt, F., Devillers, L., Müller, C., Narayanan, S.: The interspeech 2010 paralinguistic challenge. In: Proceedings INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, pp. 2794–2797. ISCA, Makuhari, Sept 2010

    Google Scholar 

  48. Schuller, B., Wöllmer, M., Eyben, F., Rigoll, G., Arsić, D.: Semantic speech tagging: Towards combined analysis of speaker traits. In: Brandenburg, K., Sandler, M. (eds.) Proceedings AES 42nd International Conference, pp. 89–97. Audio Engineering Society, Ilmenau, July 2011

    Google Scholar 

  49. Schuller, B., Köhler, N., Müller, R., Rigoll, G.: Recognition of interest in human conversational speech. In: Proceedings INTERSPEECH 2006, 9th International Conference on Spoken Language Processing, ICSLP, pp. 793–796. ISCA, Pittsburgh, Sept 2006

    Google Scholar 

  50. Schuller, B., Müller, R., Hörnler, B., Höthker, A., Konosu, H., Rigoll, G.: Audiovisual recognition of spontaneous interest within conversations. In: Proceedings 9th ACM International Conference on Multimodal Interfaces, ICMI 2007, pp. 30–37. ACM, Nagoya, Nov 2007

    Google Scholar 

  51. Vlasenko, B., Schuller, B., Mengistu, K.T., Rigoll, G., Wendemuth, A.: Balancing spoken content adaptation and unit length in the recognition of emotion and interest. In: Proceedings INTERSPEECH 2008, 9th Annual Conference of the International Speech Communication Association, Incorporating 12th Australasian International Conference on Speech Science and Technology, SST 2008, pp. 805–808. ISCA/ASSTA, Brisbane, Sept 2008

    Google Scholar 

  52. Schuller, B., Rigoll, G.: Recognising interest in conversational speech: comparing bag of frames and supra-segmental features. In: Proceedings INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, pp. 1999–2002. ISCA, Brighton, Sept 2009

    Google Scholar 

  53. Schuller, B., Müller, R., Eyben, F., Gast, J., Hörnler, B., Wöllmer, M., Rigoll, G., Höthker, A., Konosu, H.: Being bored? recognising natural interest by extensive audiovisual integration for real-life application. Image Vis. Comput. 27(12), 1760–1774 (November 2009). Special issue on visual and multimodal analysis of human spontaneous behavior

    Article  Google Scholar 

  54. Wöllmer, M., Weninger, F., Eyben, F., Schuller, B.: Computational assessment of interest in speech: facing the real-life challenge. Künstliche Intelligenz (German J. Artif. Intell.) 25(3), 227–236 (2011). Special issue on emotion and computing

    Google Scholar 

  55. Schuller, B., Batliner, A., Steidl, S., Schiel, F., Krajewski, J.: The interspeech 2011 speaker state challenge. In: Proceedings INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, pp. 3201–3204. ISCA, Florence, August 2011

    Google Scholar 

  56. Weninger, F., Schuller, B.: Fusing utterance-level classifiers for robust intoxication recognition from speech. In: Proceedings MMCogEmS Workshop (Inferring Cognitive and Emotional States from Multimodal Measures), Held in Conjunction with the 13th International Conference on Multimodal Interaction, Nov 2011 (ICMI 2011). ACM, Alicante (2011)

    Google Scholar 

  57. Krajewski, J., Schnieder, S., Sommer, D., Batliner, A., Schuller, B.: Applying multiple classifiers and non-linear dynamics features for detecting sleepiness from speech. Neurocomputing 84, 65–75 (2012). Special issue from neuron to behavior: evidence from behavioral measurements

    Article  Google Scholar 

  58. Schuller, B., Kozielski, C., Weninger, F., Eyben, F., Rigoll, G.: Vocalist gender recognition in recorded popular music. In: Proceedings 11th International Society for Music Information Retrieval Conference, ISMIR 2010, pp. 613–618. ISMIR, Utrecht, Oct 2010

    Google Scholar 

  59. Schuller, B., Eyben, F., Rigoll, G.: Fast and robust meter and tempo recognition for the automatic discrimination of ballroom dance styles. In: Proceedings 32nd IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2007, vol. I, pp. 217–220. IEEE, Honolulu, April 2007

    Google Scholar 

  60. Eyben, F., Schuller, B., Reiter, S., Rigoll, G.: Wearable assistance for the ballroom-dance hobbyist: holistic rhythm analysis and dance-style classification. In: Proceedings 8th IEEE International Conference on Multimedia and Expo, ICME 2007, pp. 92–95. IEEE, Beijing, July 2007

    Google Scholar 

  61. Schuller, B., Eyben, F., Rigoll, G.: Tango or waltz?—putting ballroom dance style into tempo detection. EURASIP J. Audio Speech Music Process. Article ID 846135, 12 (2008). Special issue on intelligent audio, speech, and music processing applications

    Google Scholar 

  62. Schuller, B., Hantke, S., Weninger, F., Han, W., Zhang, Z., Narayanan, S.: Automatic recognition of emotion evoked by general sound events. In: Proceedings 37th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2012, pp. 341–344. IEEE, Kyoto, March 2012

    Google Scholar 

  63. Schuller, B., Steidl, S., Batliner, A.: The interspeech 2009 emotion challenge. In: Proceedings INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, pp. 312–315. ISCA, Brighton, Sept 2009

    Google Scholar 

  64. Schuller, B., Steidl, S., Batliner, A.: Introduction to the special issue on sensing emotion and affect: facing realism in speech processing. Speech Commun. 53(9/10), 1059–1061 (2011). Special issue sensing emotion and affect: facing realism in speech processing

    Article  Google Scholar 

  65. Schuller, B., Batliner, A., Steidl, S., Seppi, D.: Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge. Speech Commun. 53(9/10), 1062–1087 (2011). Special issue on sensing emotion and affect—facing realism in speech processing

    Article  Google Scholar 

  66. Schuller, B., Steidl, S., Batliner, A., Burkhardt, F., Devillers, L., Müller, C., Narayanan, S.: Paralinguistics in speech and language: state-of-the-art and the challenge. Comput. Speech Lang. 27(1), 4–39 (January 2013). Special issue on paralinguistics in naturalistic speech and language

    Article  Google Scholar 

  67. Schuller, B., Steidl, S., Batliner, A., Nöth, E., Vinciarelli, A., Burkhardt, F., van Son, R., Weninger, F., Eyben, F., Bocklet, T., Mohammadi, G., Weiss, B.: The interspeech 2012 speaker trait challenge. In: Proceedings INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association, p. 4. ISCA, Portland, Sept 2012

    Google Scholar 

  68. Schuller, B., Valstar, M., Cowie, R., Pantic, M. (eds.): In: Proceedings of the First International Audio/Visual Emotion Challenge and Workshop, AVEC, Oct 2011. Lecture Notes on Computer Science (lncs), Part II, vol. 6975. Springer, Memphis (2011)

    Google Scholar 

  69. Schuller, B., Valstar, M., Eyben, F., McKeown, G., Cowie, R., Pantic, M.: Avec 2011: the first international audio/visual emotion challenge. In: Schuller, B., Valstar, M., Cowie, R., Pantic, M. (eds.) Proceedings First International Audio/Visual Emotion Challenge and Workshop, Oct 2011 (AVEC 2011), Held in Conjunction with the International HUMAINE Association Conference on Affective Computing and Intelligent Interaction 2011 (ACII 2011), vol. II, pp. 415–424. Springer, Memphis (2011)

    Google Scholar 

  70. Schuller, B., Valstar, M., Eyben, F., Cowie, R., Pantic, M.: Avec 2012: the continuous audio/visual emotion challenge. In: Morency, L.-P., Bohus, D., Aghajan, H.K., Cassell, J., Nijholt, A., Epps, J. (eds.) Proceedings of the 14th ACM International Conference on Multimodal Interaction, ICMI, pp. 449–456. ACM, Santa Monica, Oct 2012

    Google Scholar 

  71. Schuller, B., Metze, F., Steidl, S., Batliner, A., Eyben, F., Polzehl, T.: Late fusion of individual engines for improved recognition of negative emotions in speech: learning versus democratic vote. In: Proceedings 35th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2010, pp. 5230–5233. IEEE, Dallas, March 2010

    Google Scholar 

  72. Eyben, F., Wöllmer, M., Schuller, B.: Opensmile: the munich versatile and fast open-source audio feature extractor. In: Proceedings of the 9th ACM International Conference on Multimedia, MM 2010, pp. 1459–1462. ACM, Florence, Oct 2010

    Google Scholar 

  73. Eyben, F., Wöllmer, M., Schuller, B.: Openear: introducing the munich open-source emotion and affect recognition toolkit. In: Proceedings 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, ACII 2009, vol. I, pp. 576–581. IEEE, Amsterdam, Sept 2009

    Google Scholar 

  74. Weninger, F., Schuller, B.: Optimization and parallelization of monaural source separation algorithms in the openblissart toolkit. J. Signal Process. Syst. 69(3), 267–277 (2012)

    Article  Google Scholar 

  75. Weninger, F., Schuller, B.: Audio recognition in the wild: Static and dynamic classification on a real-world database of animal vocalizations. In: Proceedings 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, pp. 337–340. IEEE, Prague, May 2011

    Google Scholar 

  76. Schuller, B., Knaup, T.: Learning and knowledge-based sentiment analysis in movie review key excerpts. In: Esposito, A., Esposito, A.M., Martone, R., Müller, V., Scarpetta, G. (eds.) Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces: Theoretical and Practical Issues: Third COST 2102 International Training School, 15–19 March 2010, Caserta, Italy. Revised Selected Papers of Lecture Notes on Computer Science (LNCS), vol. 6456/2010, pp. 448–472, 1st edn. Springer, Heidelberg (2011)

    Google Scholar 

  77. Schuller, B., Dorfner, J., Rigoll, G.: Determination of non-prototypical valence and arousal in popular music: features and performances. EURASIP J. Audio Speech Music Process. Article ID 735854, 19 (2010). Special issue on scalable audio-content analysis

    Google Scholar 

  78. Eyben, F., Petridis, S., Schuller, B., Pantic, M.: Audiovisual vocal outburst classification in noisy acoustic conditions. In: Proceedings 37th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2012, pp. 5097–5100. IEEE, Kyoto, March 2012

    Google Scholar 

  79. Schuller, B., Wimmer, M., Arsić, D., Rigoll, G., Radig, B.: Audiovisual behavior modeling by combined feature spaces. In: Proceedings 32nd IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2007, vol. II, pp. 733–736. IEEE, Honolulu, April 2007

    Google Scholar 

  80. Schröder, M., Bevacqua, E., Eyben, F., Gunes, H., Heylen, D., ter Maat, M., Pammi, S., Pantic, M., Pelachaud, C., Schuller, B., de Sevin, E., Valstar, M., Wöllmer, M.: A demonstration of audiovisual sensitive artificial listeners. In: Proceedings 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, ACII 2009, vol. I, pp. 263–264. IEEE, Amsterdam, Sept 2009

    Google Scholar 

  81. Schröder, M., Bevacqua, E., Cowie, R., Eyben, F., Gunes, H., Heylen, D., ter Maat, M., McKeown, G., Pammi, S., Pantic, M., Pelachaud, C., Schuller, B., de Sevin, E., Valstar, M., Wöllmer, M.: Building autonomous sensitive artificial listeners. IEEE Trans. Affect. Comput. 3(2), 165–183 (2012)

    Article  Google Scholar 

  82. Eyben, F. Wöllmer, M., Valstar, M., Gunes, H., Schuller, B., Pantic, M.: String-based audiovisual fusion of behavioural events for the assessment of dimensional affect. In: Proceedings International Workshop on Emotion Synthesis, Representation, and Analysis in Continuous Space, EmoSPACE 2011, Held in Conjunction with the 9th IEEE International Conference on Automatic Face & Gesture Recognition and Workshops, FG 2011, pp. 322–329. IEEE, Santa Barbara, March 2011

    Google Scholar 

  83. Metallinou, A., Wöllmer, M., Katsamanis, A., Eyben, F., Schuller, B., Narayanan, S.: Context-sensitive learning for enhanced audiovisual emotion classification. IEEE Trans. Affect. Comput. 3(2), 184–198 (2012)

    Article  Google Scholar 

  84. Schuller, B., Weninger, F.: Ten recent trends in computational paralinguistics. In: Esposito, A., Vinciarelli, A., Hoffmann, R., Müller, V.C. (eds.) 4th COST 2102 International Training School on Cognitive Behavioural Systems. Lecture Notes on Computer Science (LNCS), p. 15. Springer, Berlin (2012)

    Google Scholar 

  85. Schuller, B., Zhang, Z., Weninger, F., Rigoll, G.: Using multiple databases for training in emotion recognition: to unite or to vote? In: Proceedings INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, pp. 1553–1556. ISCA, Florence, August 2011

    Google Scholar 

  86. Stuhlsatz, A., Meyer, C., Eyben, F., Zielke, T., Meier, G., Schuller, B.: Deep neural networks for acoustic emotion recognition: Raising the benchmarks. In: Proceedings 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, pp. 5688–5691. IEEE, Prague, May 2011

    Google Scholar 

  87. Schuller, B., Vlasenko, B., Eyben, F., Wöllmer, M., Stuhlsatz, A., Wendemuth, A., Rigoll, G.: Cross-corpus acoustic emotion recognition: variances and strategies. IEEE Trans. Affect. Comput. 1(2), 119–131 (2010)

    Article  Google Scholar 

  88. Eyben, F., Batliner, A., Schuller, B., Seppi, D., Steidl, S.: Cross-corpus classification of realistic emotions: some pilot experiments. In: Devillers, L., Schuller, B., Cowie, R., Douglas-Cowie, E., Batliner, A. (eds.) Proceedings 3rd International Workshop on EMOTION: Corpora for Research on Emotion and Affect, Satellite of LREC 2010, pp. 77–82. European Language Resources Association, Valletta, May 2010

    Google Scholar 

  89. Jia, L., Chun, C., Jiajun, B., Mingyu, Y., Jianhua, T.: Speech emotion recognition using an enhanced co-training algorithm. In: Proceedings of the 2007 IEEE International Conference on Multimedia and Expo, ICME 2007, pp. 999–1002. IEEE, Beijing (2007)

    Google Scholar 

  90. Mahdhaoui, A., Chetouani, M.: A new approach for motherese detection using a semi-supervised algorithm. In: Machine Learning for Signal Processing XIX: Proceedings of the 2009 IEEE Signal Processing Society Workshop, MLSP 2009, pp. 1–6. IEEE, Grenoble (2009)

    Google Scholar 

  91. Yamada, M., Sugiyama, M., Matsui, T.: Semi-supervised speaker identification under covariate shift. Signal Process. 90(8), 2353–2361 (2010)

    Article  MATH  Google Scholar 

  92. Lee, K., Slaney, M.: Automatic chord recognition from audio using a supervised hmm trained with audio-from-symbolic data. In: Proceedings of the ACM Multimedia ’06, Santa Barbara, USA, pp. 11–20. ACM, New York (2006)

    Google Scholar 

  93. Wu, S., Falk, T.H., Chan, W.: Automatic speech emotion recognition using modulation spectral features. Speech Commun. 53(5), 768–785 (2011)

    Article  Google Scholar 

  94. Mahdhaoui, A., Chetouani, M., Kessous, L.: Time-frequency features extraction for infant directed speech discrimination. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5933 LNAI, pp. 120–127. Springer, Berlin, Heidelberg (2010)

  95. Ringeval, F., Chetouani, M.: A vowel based approach for acted emotion recognition. In: INTERSPEECH 2008: 9th Annual Conference of the International Speech Communication Association, pp. 2763–2766. ISCA, Brisbane (2008)

  96. Reisenzein, R., Weber, H.: Personality and emotion. In: Corr, P.J., Matthews, G. (eds.) The Cambridge Handbook of Personality Psychology, pp. 54–71. Cambridge University Press, Cambridge (2009)

  97. Provine, R.: Laughter punctuates speech: linguistic, social and gender contexts of laughter. Ethology 95, 291–298 (1993)

  98. Ververidis, D., Kotropoulos, C.: Automatic speech classification to five emotional states based on gender information. In: Proceedings of the 12th European Signal Processing Conference, pp. 341–344, Vienna, 2004

  99. Vogt, T., André, E.: Improving automatic emotion recognition from speech via gender differentiation. In: Proceedings of Language Resources and Evaluation Conference (LREC), Genoa, 2006

  100. Stadermann, J., Koska, W., Rigoll, G.: Multi-task learning strategies for a recurrent neural net in a hybrid tied-posteriors acoustic model. In: Proceedings of Interspeech 2005, pp. 2993–2996. ISCA, Lisbon (2005)

  101. Byrd, D.: Relations of sex and dialect to reduction. Speech Commun. 15(1–2), 39–54 (1994)

  102. Batliner, A., Steidl, S., Schuller, B., Seppi, D., Vogt, T., Wagner, J., Devillers, L., Vidrascu, L., Aharonson, V., Kessous, L., Amir, N.: Whodunnit: searching for the most important feature types signalling emotion-related user states in speech. Comput. Speech Lang. 25(1), 4–28 (2011). Special issue on affective speech in real-life interactions

  103. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)

  104. Baggia, P., Burnett, D.C., Carter, J., Dahl, D.A., McCobb, G., Raggett, D.: EMMA: Extensible MultiModal Annotation Markup Language. World Wide Web Consortium Recommendation REC-emma-20090210, Johnston, M. (ed.), February 2009

  105. Schröder, M., Devillers, L., Karpouzis, K., Martin, J.-C., Pelachaud, C., Peter, C., Pirker, H., Schuller, B., Tao, J., Wilson, I.: What should a generic emotion markup language be able to represent? In: Paiva, A., Picard, R.W., Prada, R. (eds.) Affective Computing and Intelligent Interaction: Second International Conference, Lisbon, Portugal, 12–14 Sept 2007 (ACII 2007). Proceedings, Lecture Notes on Computer Science (LNCS), vol. 4738/2007, pp. 440–451. Springer, Berlin (2007)

  106. Mao, X., Li, Z., Bao, H.: An extension of MPML with emotion recognition functions attached. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5208 LNAI. Springer, Berlin, Heidelberg (2008)

  107. Schuller, B.: Affective speaker state analysis in the presence of reverberation. Int. J. Speech Technol. 14(2), 77–87 (2011)

  108. Tabatabaei, T.S., Krishnan, S.: Towards robust speech-based emotion recognition. In: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, pp. 608–611. IEEE, Istanbul (2010)

  109. Cannizzaro, M., Reilly, N., Snyder, P.J.: Speech content analysis in feigned depression. J. Psycholinguist. Res. 33(4), 289–301 (2004)

  110. Reilly, N., Cannizzaro, M.S., Harel, B.T., Snyder, P.J.: Feigned depression and feigned sleepiness: a voice acoustical analysis. Brain Cogn. 55(2), 383–386 (2004)

  111. Boden, M.: Mind as Machine: A History of Cognitive Science, Chapter 9. Oxford University Press, New York (2008)

  112. Shami, M., Verhelst, W.: Automatic classification of expressiveness in speech: a multi-corpus study. In: Müller, C. (ed.) Speaker Classification II. Lecture Notes in Computer Science/Artificial Intelligence, vol. 4441, pp. 43–56. Springer, Heidelberg (2007)

  113. Chen, A.: Perception of paralinguistic intonational meaning in a second language. Lang. Learn. 59(2), 367–409 (2009)

  114. Esposito, A., Riviello, M.T.: The cross-modal and cross-cultural processing of affective information. In: Neural Nets WIRN10: Proceedings of the 20th Italian Workshop on Neural Nets, vol. 226, pp. 301–310, 2011

  115. Bellegarda, J.R.: Language-independent speaker classification over a far-field microphone. In: Mueller, C. (ed.) Speaker Classification II: Selected Projects, pp. 104–115. Springer, Berlin (2007)

  116. Kleynhans, N.T., Barnard, E.: Language dependence in multilingual speaker verification. In: Proceedings of the 16th Annual Symposium of the Pattern Recognition Association of South Africa, pp. 117–122, Langebaan, Nov 2005

  117. Weninger, F., Schuller, B., Liem, C., Kurth, F., Hanjalic, A.: Music information retrieval: an inspirational guide to transfer from related disciplines. In: Müller, M., Goto, M. (eds.) Multimodal Music Processing, Dagstuhl Follow-Ups (Seminar 11041), pp. 195–215. Schloss Dagstuhl, Dagstuhl, Germany (2012)

  118. Jiang, H.: Confidence measures for speech recognition: a survey. Speech Commun. 45(4), 455–470 (2005)

  119. Sukkar, R.: Rejection for connected digit recognition based on GPD segmental discrimination. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, 1994 (ICASSP-94), vol. 1, pp. I-393–I-396

  120. White, C., Droppo, J., Acero, A., Odell, J.: Maximum entropy confidence estimation for speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing, 2007 (ICASSP 2007), vol. 4, pp. 809–812

  121. Wessel, F., Schluter, R., Macherey, K., Ney, H.: Confidence measures for large vocabulary continuous speech recognition. IEEE Trans. Speech Audio Process. 9(3), 288–298 (2001)

  122. Rahim, M., Lee, C., Juang, B.: Discriminative utterance verification for connected digits recognition. IEEE Trans. Speech Audio Process. 5(3), 266–277 (1997)

  123. Han, W., Zhang, Z., Deng, J., Wöllmer, M., Weninger, F., Schuller, B.: Towards distributed recognition of emotion in speech. In: Proceedings 5th International Symposium on Communications, Control, and Signal Processing (ISCCSP 2012), pp. 1–4. IEEE, Rome, May 2012

  124. ETSI: ETSI ES 202 050 V1.1.5: Speech processing, transmission and quality aspects (STQ), distributed speech recognition, advanced front-end feature extraction algorithm, compression algorithms (2007)

  125. Zhang, W., He, L., Chow, Y.L., Yang, R., Su, Y.: The study on distributed speech recognition system. In: Proceedings of ICASSP, pp. 1431–1434, Istanbul, 2000

  126. Tsakalidis, S., Digalakis, V., Neumeyer, L.: Efficient speech recognition using subvector quantization and discrete-mixture HMMs. In: Proceedings of ICASSP, pp. 569–572, Phoenix, 1999

  127. Jain, A.K., Flynn, P.J., Ross, A.A.: Handbook of Biometrics. Springer, Heidelberg (2008)

Author information

Correspondence to Björn Schuller.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Schuller, B. (2013). Discussion. In: Intelligent Audio Analysis. Signals and Communication Technology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36806-6_13

  • DOI: https://doi.org/10.1007/978-3-642-36806-6_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36805-9

  • Online ISBN: 978-3-642-36806-6

  • eBook Packages: Engineering, Engineering (R0)
