
Discussion

Chapter in: Intelligent Audio Analysis

Part of the book series: Signals and Communication Technology (SCT)

Abstract

This chapter first summarises how the state of the art in Intelligent Audio Analysis has recently been advanced. Based on this, a distilled 'best practice' recommendation is given to the reader, covering high realism; standardised, multi-faceted, and machine-aided data collection; source separation; feature brute-forcing; modelling of temporal evolution; coupling of tasks; and standardisation. A critical discussion then addresses missing aspects and remaining research steps, including increased robustness, blind separation and multi-task processing of real-life audio streams, massive weakly supervised and evolutionary learning, closing the gap between analysis and synthesis, cross-cultural and cross-lingual widening, novel tasks, further unification and transfer of methods, confidence measures, distributed processing, and new competitive research challenges.
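The 'feature brute-forcing' named in the abstract, i.e. the systematic application of statistical functionals to low-level descriptor (LLD) contours to obtain one large fixed-length feature vector, can be sketched as follows. The descriptor names, toy contours, and the particular functional set here are illustrative assumptions, not the book's actual feature set:

```python
import numpy as np

def brute_force_features(lld_contours, names):
    """Apply a bank of statistical functionals to every low-level
    descriptor (LLD) contour, yielding one large fixed-length vector."""
    functionals = {
        "mean": np.mean,
        "std": np.std,
        "min": np.min,
        "max": np.max,
        "range": lambda x: np.max(x) - np.min(x),
        # slope of a linear fit approximates the temporal trend
        "slope": lambda x: np.polyfit(np.arange(len(x)), x, 1)[0],
    }
    features = {}
    for name, contour in zip(names, lld_contours):
        contour = np.asarray(contour, dtype=float)
        for fname, func in functionals.items():
            features[f"{name}_{fname}"] = float(func(contour))
    return features

# Toy example: two LLD contours (e.g. energy and F0) over five frames.
feats = brute_force_features(
    [[0.1, 0.3, 0.2, 0.5, 0.4], [120, 118, 125, 130, 128]],
    ["energy", "f0"],
)
print(len(feats))  # 2 LLDs x 6 functionals = 12 features
```

In practice, toolkits such as openSMILE (cited in this chapter's references) apply dozens of functionals to dozens of LLDs, yielding thousands of candidate features.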

A scientist’s aim in a discussion with his colleagues is not to persuade, but to clarify.

—Leo Szilard.
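Several of the source-separation steps discussed in the chapter rest on non-negative matrix factorisation (NMF). The following minimal sketch factorises a toy magnitude 'spectrogram' V into non-negative components W and H via multiplicative updates; in the semi-supervised variants cited in this chapter's references, some columns of W would be pre-trained on a target source and held fixed during separation. The matrix sizes and toy data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf(V, rank, n_iter=300):
    """Basic NMF with multiplicative updates minimising the Euclidean
    distance between V (a non-negative spectrogram) and W @ H."""
    n, m = V.shape
    W = rng.random((n, rank)) + 1e-3
    H = rng.random((rank, m)) + 1e-3
    for _ in range(n_iter):
        # Lee-Seung multiplicative update rules; the small epsilon
        # in the denominators avoids division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy magnitude "spectrogram": an additive mixture of two rank-1 sources.
V = (np.outer([1, 0, 2], [1, 1, 0, 1])
     + np.outer([0, 3, 1], [0, 1, 1, 0])).astype(float)
W, H = nmf(V, rank=2)
# The reconstruction error should be small once the updates converge.
print(np.linalg.norm(V - W @ H))
```

Each column of W can be read as a spectral basis and each row of H as its activation over time; masking the reconstruction with a subset of components yields the separated source estimate.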



References

  1. Schuller, B., Lehmann, A., Weninger, F., Eyben, F., Rigoll, G.: Blind enhancement of the rhythmic and harmonic sections by nmf: Does it help? In: Proceedings International Conference on Acoustics including the 35th German Annual Conference on Acoustics, NAG/DAGA 2009, pp. 361–364. DEGA, Rotterdam, March 2009

    Google Scholar 

  2. Weninger, F., Wöllmer, M., Schuller B.: Automatic assessment of singer traits in popular music: gender, age, height and race. In: Proceedings 12th International Society for Music Information Retrieval Conference, ISMIR 2011, pp. 37–42. ISMIR, Miami (2011)

    Google Scholar 

  3. Weninger, F., Durrieu, J.-L., Eyben, F., Richard, G., Schuller, B.: Combining monaural source separation with long short-term memory for increased robustness in vocalist gender recognition. In: Proceedings 36th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2011), pp. 2196–2199. IEEE, Prague, Czech Republic, May 2011

    Google Scholar 

  4. Weninger, F., Lehmann, A., Schuller, B.: Openblissart: design and evaluation of a research toolkit for blind source separation in audio recognition tasks. In: Proceedings 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, pp. 1625–1628. IEEE, Prague, May 2011

    Google Scholar 

  5. Weninger, F., Geiger, J., Wöllmer, M., Schuller, B., Rigoll, G.: The munich 2011 chime challenge contribution: Nmf-blstm speech enhancement and recognition for reverberated multisource environments. In: Proceedings Machine Listening in Multisource Environments, CHiME 2011, Satellite Workshop of Interspeech 2011, pp. 24–29. ISCA, Florence, Sept 2011

    Google Scholar 

  6. Weninger, F., Wöllmer, M., Geiger, J., Schuller, B., Gemmeke, J., Hurmalainen, A., Virtanen, T., Rigoll, G.: Non-negative matrix factorization for highly noise-robust asr: to enhance or to recognize? In: Proceedings 37th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2012, pp. 4681–4684. IEEE, Kyoto, March 2012

    Google Scholar 

  7. Weninger, F., Feliu, J., Schuller, B.: Supervised and semi-supervised supression of background music in monaural speech recordings. In: Proceedings 37th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2012, pp. 61–64. IEEE, Kyoto, March 2012

    Google Scholar 

  8. Weninger, F., Amir, N., Amir, O., Ronen, I., Eyben, F., Schuller, B.: Robust feature extraction for automatic recognition of vibrato singing in recorded polyphonic music. In: Proceedings 37th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2012, pp. 85–88. IEEE, Kyoto, March 2012

    Google Scholar 

  9. Joder, C., Weninger, F., Eyben, F., Virette, D., Schuller, B.: Real-time speech separation by semi-supervised nonnegative matrix factorization. In: Theis, F.J., Cichocki, A., Yeredor, A., Zibulevsky, M. (eds.) Proceedings 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA 2012). Lecture Notes in Computer Science, vol. 7191, pp. 322–329. Springer, Tel Aviv (2012)

    Google Scholar 

  10. Batliner, A., Steidl, S., Schuller, B., Seppi, D., Laskowski, K., Vogt, T., Devillers, L., Vidrascu, L., Amir, N., Kessous, L., Aharonson, V.: Combining efforts for improving automatic classification of emotional user states. In: Proceedings 5th Slovenian and 1st International Language Technologies Conference, ISLTC 2006, pp. 240–245. Slovenian Language Technologies Society, Ljubljana, Oct 2006

    Google Scholar 

  11. Schuller, B., Wimmer, M., Mösenlechner, L., Kern, C., Arsić, D., Rigoll, G.: Brute-forcing hierarchical functionals for paralinguistics: a waste of feature space? In: Proceedings 33rd IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2008, pp. 4501–4504. IEEE, Las Vegas, April 2008

    Google Scholar 

  12. Schuller, B.: The computational paralinguistics challenge. IEEE Signal Process. Mag. 29(4), 97–101 (2012)

    Article  Google Scholar 

  13. Schuller, B., Weninger, F., Wöllmer, M., Sun, Y., Rigoll, G.: Non-negative matrix factorization as noise-robust feature extractor for speech recognition. In: Proceedings 35th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2010, pp. 4562–4565. IEEE, Dallas, March 2010

    Google Scholar 

  14. Schuller, B., Weninger, F.: Discrimination of speech and non-linguistic vocalizations by non-negative matrix factorization. In: Proceedings 35th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2010, pp. 5054–5057. IEEE, Dallas, March 2010

    Google Scholar 

  15. Weninger, F., Schuller, B., Batliner, A., Steidl, S., Seppi, D.: Recognition of non-prototypical emotions in reverberated and noisy speech by non-negative matrix factorization. EURASIP J. Adv. Signal Process. Article ID 838790, 16 (2011). Special issue on emotion and mental state recognition from speech

    Google Scholar 

  16. Weninger, F., Schuller, B., Wöllmer, M., Rigoll, G.: Localization of non-linguistic events in spontaneous speech by non-negative matrix factorization and long short-term memory. In: Proceedings 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, pp. 5840–5843. IEEE, Prague, May 2011

    Google Scholar 

  17. Schuller, B., Gollan, B.: Music theoretic and perception-based features for audio key determination. J. New Music Res. 41(2), 175–193 (2012)

    Article  Google Scholar 

  18. Wöllmer, M., Eyben, F., Graves, A., Schuller, B., Rigoll, G.: A tandem blstm-dbn architecture for keyword spotting with enhanced context modeling. In: Proceedings ISCA Tutorial and Research Workshop on Non-Linear Speech Processing, NOLISP 2009, pp. 9. ISCA, Vic, June 2009

    Google Scholar 

  19. Wöllmer, M., Eyben, F., Schuller, B., Douglas-Cowie, E., Cowie, R.: Data-driven clustering in emotional space for affect recognition using discriminatively trained lstm networks. In: Proceedings INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, pp. 1595–1598. ISCA, Brighton, Sept 2009

    Google Scholar 

  20. Eyben, F., Böck, S., Schuller, B., Graves, A.: Universal onset detection with bidirectional long-short term memory neural networks. In: Proceedings 11th International Society for Music Information Retrieval Conference, ISMIR 2010, pp. 589–594. ISMIR, Utrecht, Oct 2010

    Google Scholar 

  21. Böck, S., Eyben, F., Schuller, B.: Tempo detection with bidirectional long short-term memory neural networks. In: Proceedings Annual Meeting of the MIREX 2010 community as part of the 11th International Conference on Music Information Retrieval, pp. 3. ISMIR, Utrecht, August 2010

    Google Scholar 

  22. Böck, S., Eyben, F., Schuller, B.: Onset detection with bidirectional long short-term memory neural networks. In: Proceedings Annual Meeting of the MIREX 2010 community as part of the 11th International Conference on Music Information Retrieval, pp. 2. ISMIR, Utrecht, August 2010

    Google Scholar 

  23. Arsić, D., Wöllmer, M., Rigoll, G., Roalter, L., Kranz, M., Kaiser, M., Eyben, F., Schuller, B.: Automated 3d gesture recognition applying long short-term memory and contextual knowledge in a cave. In: Proceedings 1st Workshop on Multimodal Pervasive Video Analysis, MPVA 2010, held in conjunction with ACM Multimedia 2010, pp. 33–36. ACM, Florence, Oct 2010

    Google Scholar 

  24. M. Wöllmer, A. Metallinou, F. Eyben, B. Schuller, and S. Narayanan: Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional lstm modeling. In: Proceedings INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, pp. 2362–2365. ISCA, Makuhari, Sept 2010

    Google Scholar 

  25. Landsiedel, C., Edlund, J., Eyben, F., Neiberg, D., Schuller, B.: Syllabification of conversational speech using bidirectional long-short-term memory neural networks. In: Proceedings 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, pp. 5265–5268. IEEE, Prague, May 2011

    Google Scholar 

  26. Eyben, F., Petridis, S., Schuller, B., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: Audiovisual classification of vocal outbursts in human conversation using long-short-term memory networks. In: Proceedings 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, pp. 5844–5847. IEEE, Prague, May 2011

    Google Scholar 

  27. Wöllmer, M., Weninger, F., Eyben, F., Schuller, B.: Acoustic-linguistic recognition of interest in speech with bottleneck-blstm nets. In: Proceedings INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, pp. 3201–3204. ISCA, Florence, August 2011

    Google Scholar 

  28. Wöllmer, M., Blaschke, C., Schindl, T., Schuller, B., Färber, B., Mayer, S., Trefflich, B.: On-line driver distraction detection using long short-term memory. IEEE Trans. Intell. Transp. Syst. 12(2), 574–582 (2011)

    Article  Google Scholar 

  29. Wöllmer, M., Metallinou, A., Katsamanis, N., Schuller, B., Narayanan, S.: Analyzing the memory of blstm neural networks for enhanced emotion classification in dyadic spoken interactions. In: Proceedings 37th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2012, pp. 4157–4160. IEEE, Kyoto, March 2012

    Google Scholar 

  30. Wöllmer, M., Kaiser, M., Eyben, F., Schuller, B., Rigoll, G.: Lstm-modeling of continuous emotions in an audiovisual affect recognition framework. Image and Vision Computing, Special Issue on Affect Analysis in Continuous Input, p. 16, 2012

    Google Scholar 

  31. Reiter, S., Schuller, B., Rigoll, G.: A combined lstm-rnn-hmm-approach for meeting event segmentation and recognition. In: Proceedings 31st IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2006, vol. 2, pp. 393–396. IEEE, Toulouse, May 2006

    Google Scholar 

  32. Wöllmer, M., Eyben, F., Keshet, J., Graves, A., Schuller, B., Rigoll, G.: Robust discriminative keyword spotting for emotionally colored spontaneous speech using bidirectional lstm networks. In: Proceedings 34th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2009, pp. 3949–3952. IEEE, Taipei, April 2009

    Google Scholar 

  33. Wöllmer, M., Eyben, F., Schuller, B., Sun, Y., Moosmayr, T., Nguyen-Thien, N.: Robust in-car spelling recognition: a tandem blstm-hmm approach. In: Proceedings INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, pp. 1990–9772. ISCA, Brighton, Sept 2009

    Google Scholar 

  34. Wöllmer, M., Eyben, F., Graves, A., Schuller, B., Rigoll, G.: Bidirectional lstm networks for context-sensitive keyword detection in a cognitive virtual agent framework. Cogn. Comput. 2(3), 180–190 (2010). Special issue on non-linear and non-conventional speech processing

    Article  Google Scholar 

  35. Wöllmer, M., Eyben, F., Graves, A., Schuller, B., Rigoll, G.: Improving keyword spotting with a tandem blstm-dbn architecture. In: Sole-Casals, J., Zaiats, V. (eds.) Advances in Non-Linear Speech Processing: International Conference on Nonlinear Speech Processing, 25–27 June 2009 (NOLISP 2009). Revised Selected Papers, Lecture Notes on Computer Science (LNCS), vol. 5933/2010, pp. 68–75. Springer, Vic (2010)

    Google Scholar 

  36. Wöllmer, M., Schuller, B., Eyben, F., Rigoll, G.: Combining long short-term memory and dynamic Bayesian networks for incremental emotion-sensitive artificial listening. IEEE J. Sel. Top. Signal Proces. 4(5), 867–881 (2010). Special issue on speech processing for natural interaction with intelligent environments

    Article  Google Scholar 

  37. Wöllmer, M., Sun, Y., Eyben, F., Schuller, B.: Long short-term memory networks for noise robust speech recognition. In: Proceedings INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, pp. 2966–2969. ISCA, Makuhari, Sept 2010

    Google Scholar 

  38. Wöllmer, M., Eyben, F., Schuller, B., Rigoll, G.: Recognition of spontaneous conversational speech using long short-term memory phoneme predictions. In: Proceedings INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, pp. 1946–1949. ISCA, Makuhari, Sept 2010

    Google Scholar 

  39. Wöllmer, M., Marchi, E., Squartini, S., Schuller, B.: Multi-stream lstm-hmm decoding and histogram equalization for noise robust keyword spotting. Cogn. Neurodyn. 5(3), 253–264 (2011)

    Article  Google Scholar 

  40. Wöllmer, M., Schuller, B., Rigoll, G.: A novel bottleneck-blstm front-end for feature-level context modeling in conversational speech recognition. In: Proceedings 12th Biannual IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2011, pp. 36–41. IEEE, Big Island, Dec 2011

    Google Scholar 

  41. Wöllmer, M., Eyben, F., Schuller, B., Rigoll, G.: A multi-stream asr framework for blstm modeling of conversational speech. In: Proceedings 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, pp. 4860–4863. IEEE, Prague, May 2011

    Google Scholar 

  42. Wöllmer, M., Schuller, B.: Enhancing spontaneous speech recognition with blstm features. In: Travieso-González, C.M., Alonso-Hernández, J. (eds.) Advances in Nonlinear Speech Processing, 5th International Conference on Nonlinear Speech Processing, 7–9 Nov 2011 (NoLISP 2011). Proceedings, Lecture Notes in Computer Science (LNCS), vol. 7015/2011, pp. 17–24. Springer, Las Palmas de Gran Canaria (2011)

    Google Scholar 

  43. Schuller, B., Wöllmer, M., Moosmayr, T., Rigoll, G.: Recognition of noisy speech: a comparative survey of robust model architecture and feature enhancement. EURASIP J. Audio Speech Music Process. Article ID 942617, 17 (2009)

    Google Scholar 

  44. Schuller, B., Burkhardt, F.: Learning with synthesized speech for automatic emotion recognition. In: Proceedings 35th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2010, pp. 5150–515. IEEE, Dallas, March 2010

    Google Scholar 

  45. Zhang, Z., Schuller, B.: Semi-supervised learning helps in sound event classification. In: Proceedings 37th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2012, pp. 333–336. IEEE, Kyoto, March 2012

    Google Scholar 

  46. Zhang, Z., Weninger, F., Wöllmer, M., Schuller, B.: Unsupervised learning in cross-corpus acoustic emotion recognition. In: Proceedings 12th Biannual IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2011, pp. 523–528. IEEE, Big Island, Dec 2011

    Google Scholar 

  47. Schuller, B., Steidl, S., Batliner, A., Burkhardt, F., Devillers, L., Müller, C., Narayanan, S.: The interspeech 2010 paralinguistic challenge. In: Proceedings INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, pp. 2794–2797. ISCA, Makuhari, Sept 2010

    Google Scholar 

  48. Schuller, B., Wöllmer, M., Eyben, F., Rigoll, G., Arsić, D.: Semantic speech tagging: Towards combined analysis of speaker traits. In: Brandenburg, K., Sandler, M. (eds.) Proceedings AES 42nd International Conference, pp. 89–97. Audio Engineering Society, Ilmenau, July 2011

    Google Scholar 

  49. Schuller, B., Köhler, N., Müller, R., Rigoll, G.: Recognition of interest in human conversational speech. In: Proceedings INTERSPEECH 2006, 9th International Conference on Spoken Language Processing, ICSLP, pp. 793–796. ISCA, Pittsburgh, Sept 2006

    Google Scholar 

  50. Schuller, B., Müller, R., Hörnler, B., Höthker, A., Konosu, H., Rigoll, G.: Audiovisual recognition of spontaneous interest within conversations. In: Proceedings 9th ACM International Conference on Multimodal Interfaces, ICMI 2007, pp. 30–37. ACM, Nagoya, Nov 2007

    Google Scholar 

  51. Vlasenko, B., Schuller, B., Mengistu, K.T., Rigoll, G., Wendemuth, A.: Balancing spoken content adaptation and unit length in the recognition of emotion and interest. In: Proceedings INTERSPEECH 2008, 9th Annual Conference of the International Speech Communication Association, Incorporating 12th Australasian International Conference on Speech Science and Technology, SST 2008, pp. 805–808. ISCA/ASSTA, Brisbane, Sept 2008

    Google Scholar 

  52. Schuller, B., Rigoll, G.: Recognising interest in conversational speech: comparing bag of frames and supra-segmental features. In: Proceedings INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, pp. 1999–2002. ISCA, Brighton, Sept 2009

    Google Scholar 

  53. Schuller, B., Müller, R., Eyben, F., Gast, J., Hörnler, B., Wöllmer, M., Rigoll, G., Höthker, A., Konosu, H.: Being bored? recognising natural interest by extensive audiovisual integration for real-life application. Image Vis. Comput. 27(12), 1760–1774 (November 2009). Special issue on visual and multimodal analysis of human spontaneous behavior

    Article  Google Scholar 

  54. Wöllmer, M., Weninger, F., Eyben, F., Schuller, B.: Computational assessment of interest in speech: facing the real-life challenge. Künstliche Intelligenz (German J. Artif. Intell.) 25(3), 227–236 (2011). Special issue on emotion and computing

    Google Scholar 

  55. Schuller, B., Batliner, A., Steidl, S., Schiel, F., Krajewski, J.: The interspeech 2011 speaker state challenge. In: Proceedings INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, pp. 3201–3204. ISCA, Florence, August 2011

    Google Scholar 

  56. Weninger, F., Schuller, B.: Fusing utterance-level classifiers for robust intoxication recognition from speech. In: Proceedings MMCogEmS Workshop (Inferring Cognitive and Emotional States from Multimodal Measures), Held in Conjunction with the 13th International Conference on Multimodal Interaction, Nov 2011 (ICMI 2011). ACM, Alicante (2011)

    Google Scholar 

  57. Krajewski, J., Schnieder, S., Sommer, D., Batliner, A., Schuller, B.: Applying multiple classifiers and non-linear dynamics features for detecting sleepiness from speech. Neurocomputing 84, 65–75 (2012). Special issue from neuron to behavior: evidence from behavioral measurements

    Article  Google Scholar 

  58. Schuller, B., Kozielski, C., Weninger, F., Eyben, F., Rigoll, G.: Vocalist gender recognition in recorded popular music. In: Proceedings 11th International Society for Music Information Retrieval Conference, ISMIR 2010, pp. 613–618. ISMIR, Utrecht, Oct 2010

    Google Scholar 

  59. Schuller, B., Eyben, F., Rigoll, G.: Fast and robust meter and tempo recognition for the automatic discrimination of ballroom dance styles. In: Proceedings 32nd IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2007, vol. I, pp. 217–220. IEEE, Honolulu, April 2007

    Google Scholar 

  60. Eyben, F., Schuller, B., Reiter, S., Rigoll, G.: Wearable assistance for the ballroom-dance hobbyist: holistic rhythm analysis and dance-style classification. In: Proceedings 8th IEEE International Conference on Multimedia and Expo, ICME 2007, pp. 92–95. IEEE, Beijing, July 2007

    Google Scholar 

  61. Schuller, B., Eyben, F., Rigoll, G.: Tango or waltz?—putting ballroom dance style into tempo detection. EURASIP J. Audio Speech Music Process. Article ID 846135, 12 (2008). Special issue on intelligent audio, speech, and music processing applications

    Google Scholar 

  62. Schuller, B., Hantke, S., Weninger, F., Han, W., Zhang, Z., Narayanan, S.: Automatic recognition of emotion evoked by general sound events. In: Proceedings 37th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2012, pp. 341–344. IEEE, Kyoto, March 2012

    Google Scholar 

  63. Schuller, B., Steidl, S., Batliner, A.: The interspeech 2009 emotion challenge. In: Proceedings INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, pp. 312–315. ISCA, Brighton, Sept 2009

    Google Scholar 

  64. Schuller, B., Steidl, S., Batliner, A.: Introduction to the special issue on sensing emotion and affect: facing realism in speech processing. Speech Commun. 53(9/10), 1059–1061 (2011). Special issue sensing emotion and affect: facing realism in speech processing

    Article  Google Scholar 

  65. Schuller, B., Batliner, A., Steidl, S., Seppi, D.: Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge. Speech Commun. 53(9/10), 1062–1087 (2011). Special issue on sensing emotion and affect—facing realism in speech processing

    Article  Google Scholar 

  66. Schuller, B., Steidl, S., Batliner, A., Burkhardt, F., Devillers, L., Müller, C., Narayanan, S.: Paralinguistics in speech and language: state-of-the-art and the challenge. Comput. Speech Lang. 27(1), 4–39 (January 2013). Special issue on paralinguistics in naturalistic speech and language

    Article  Google Scholar 

  67. Schuller, B., Steidl, S., Batliner, A., Nöth, E., Vinciarelli, A., Burkhardt, F., van Son, R., Weninger, F., Eyben, F., Bocklet, T., Mohammadi, G., Weiss, B.: The interspeech 2012 speaker trait challenge. In: Proceedings INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association, p. 4. ISCA, Portland, Sept 2012

    Google Scholar 

  68. Schuller, B., Valstar, M., Cowie, R., Pantic, M. (eds.): In: Proceedings of the First International Audio/Visual Emotion Challenge and Workshop, AVEC, Oct 2011. Lecture Notes on Computer Science (lncs), Part II, vol. 6975. Springer, Memphis (2011)

    Google Scholar 

  69. Schuller, B., Valstar, M., Eyben, F., McKeown, G., Cowie, R., Pantic, M.: Avec 2011: the first international audio/visual emotion challenge. In: Schuller, B., Valstar, M., Cowie, R., Pantic, M. (eds.) Proceedings First International Audio/Visual Emotion Challenge and Workshop, Oct 2011 (AVEC 2011), Held in Conjunction with the International HUMAINE Association Conference on Affective Computing and Intelligent Interaction 2011 (ACII 2011), vol. II, pp. 415–424. Springer, Memphis (2011)

    Google Scholar 

  70. Schuller, B., Valstar, M., Eyben, F., Cowie, R., Pantic, M.: Avec 2012: the continuous audio/visual emotion challenge. In: Morency, L.-P., Bohus, D., Aghajan, H.K., Cassell, J., Nijholt, A., Epps, J. (eds.) Proceedings of the 14th ACM International Conference on Multimodal Interaction, ICMI, pp. 449–456. ACM, Santa Monica, Oct 2012

    Google Scholar 

  71. Schuller, B., Metze, F., Steidl, S., Batliner, A., Eyben, F., Polzehl, T.: Late fusion of individual engines for improved recognition of negative emotions in speech: learning versus democratic vote. In: Proceedings 35th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2010, pp. 5230–5233. IEEE, Dallas, March 2010

    Google Scholar 

  72. Eyben, F., Wöllmer, M., Schuller, B.: Opensmile: the munich versatile and fast open-source audio feature extractor. In: Proceedings of the 9th ACM International Conference on Multimedia, MM 2010, pp. 1459–1462. ACM, Florence, Oct 2010

    Google Scholar 

  73. Eyben, F., Wöllmer, M., Schuller, B.: Openear: introducing the munich open-source emotion and affect recognition toolkit. In: Proceedings 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, ACII 2009, vol. I, pp. 576–581. IEEE, Amsterdam, Sept 2009

    Google Scholar 

  74. Weninger, F., Schuller, B.: Optimization and parallelization of monaural source separation algorithms in the openblissart toolkit. J. Signal Process. Syst. 69(3), 267–277 (2012)

    Article  Google Scholar 

  75. Weninger, F., Schuller, B.: Audio recognition in the wild: Static and dynamic classification on a real-world database of animal vocalizations. In: Proceedings 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, pp. 337–340. IEEE, Prague, May 2011

    Google Scholar 

  76. Schuller, B., Knaup, T.: Learning and knowledge-based sentiment analysis in movie review key excerpts. In: Esposito, A., Esposito, A.M., Martone, R., Müller, V., Scarpetta, G. (eds.) Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces: Theoretical and Practical Issues: Third COST 2102 International Training School, 15–19 March 2010, Caserta, Italy. Revised Selected Papers of Lecture Notes on Computer Science (LNCS), vol. 6456/2010, pp. 448–472, 1st edn. Springer, Heidelberg (2011)

    Google Scholar 

  77. Schuller, B., Dorfner, J., Rigoll, G.: Determination of non-prototypical valence and arousal in popular music: features and performances. EURASIP J. Audio Speech Music Process. Article ID 735854, 19 (2010). Special issue on scalable audio-content analysis

    Google Scholar 

  78. Eyben, F., Petridis, S., Schuller, B., Pantic, M.: Audiovisual vocal outburst classification in noisy acoustic conditions. In: Proceedings 37th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2012, pp. 5097–5100. IEEE, Kyoto, March 2012

    Google Scholar 

  79. Schuller, B., Wimmer, M., Arsić, D., Rigoll, G., Radig, B.: Audiovisual behavior modeling by combined feature spaces. In: Proceedings 32nd IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2007, vol. II, pp. 733–736. IEEE, Honolulu, April 2007

    Google Scholar 

  80. Schröder, M., Bevacqua, E., Eyben, F., Gunes, H., Heylen, D., ter Maat, M., Pammi, S., Pantic, M., Pelachaud, C., Schuller, B., de Sevin, E., Valstar, M., Wöllmer, M.: A demonstration of audiovisual sensitive artificial listeners. In: Proceedings 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, ACII 2009, vol. I, pp. 263–264. IEEE, Amsterdam, Sept 2009

    Google Scholar 

  81. Schröder, M., Bevacqua, E., Cowie, R., Eyben, F., Gunes, H., Heylen, D., ter Maat, M., McKeown, G., Pammi, S., Pantic, M., Pelachaud, C., Schuller, B., de Sevin, E., Valstar, M., Wöllmer, M.: Building autonomous sensitive artificial listeners. IEEE Trans. Affect. Comput. 3(2), 165–183 (2012)

    Article  Google Scholar 

  82. Eyben, F. Wöllmer, M., Valstar, M., Gunes, H., Schuller, B., Pantic, M.: String-based audiovisual fusion of behavioural events for the assessment of dimensional affect. In: Proceedings International Workshop on Emotion Synthesis, Representation, and Analysis in Continuous Space, EmoSPACE 2011, Held in Conjunction with the 9th IEEE International Conference on Automatic Face & Gesture Recognition and Workshops, FG 2011, pp. 322–329. IEEE, Santa Barbara, March 2011

    Google Scholar 

  83. Metallinou, A., Wöllmer, M., Katsamanis, A., Eyben, F., Schuller, B., Narayanan, S.: Context-sensitive learning for enhanced audiovisual emotion classification. IEEE Trans. Affect. Comput. 3(2), 184–198 (2012)

    Article  Google Scholar 

  84. Schuller, B., Weninger, F.: Ten recent trends in computational paralinguistics. In: Esposito, A., Vinciarelli, A., Hoffmann, R., Müller, V.C. (eds.) 4th COST 2102 International Training School on Cognitive Behavioural Systems. Lecture Notes on Computer Science (LNCS), p. 15. Springer, Berlin (2012)

    Google Scholar 

  85. Schuller, B., Zhang, Z., Weninger, F., Rigoll, G.: Using multiple databases for training in emotion recognition: to unite or to vote? In: Proceedings INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, pp. 1553–1556. ISCA, Florence, August 2011

    Google Scholar 

  86. Stuhlsatz, A., Meyer, C., Eyben, F., Zielke, T., Meier, G., Schuller, B.: Deep neural networks for acoustic emotion recognition: Raising the benchmarks. In: Proceedings 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, pp. 5688–5691. IEEE, Prague, May 2011

    Google Scholar 

  87. Schuller, B., Vlasenko, B., Eyben, F., Wöllmer, M., Stuhlsatz, A., Wendemuth, A., Rigoll, G.: Cross-corpus acoustic emotion recognition: variances and strategies. IEEE Trans. Affect. Comput. 1(2), 119–131 (2010)

    Article  Google Scholar 

  88. Eyben, F., Batliner, A., Schuller, B., Seppi, D., Steidl, S.: Cross-corpus classification of realistic emotions: some pilot experiments. In: Devillers, L., Schuller, B., Cowie, R., Douglas-Cowie, E., Batliner, A. (eds.) Proceedings 3rd International Workshop on EMOTION: Corpora for Research on Emotion and Affect, Satellite of LREC 2010, pp. 77–82. European Language Resources Association, Valletta, May 2010

    Google Scholar 

  89. Jia, L., Chun, C., Jiajun, B., Mingyu, Y., Jianhua, T.: Speech emotion recognition using an enhanced co-training algorithm. In: Proceedings of the 2007 IEEE International Conference on Multimedia and Expo, ICME 2007, pp. 999–1002. IEEE, Beijing (2007)

    Google Scholar 

  90. Mahdhaoui, A., Chetouani, M.: A new approach for motherese detection using a semi-supervised algorithm. In: Machine Learning for Signal Processing XIX: Proceedings of the 2009 IEEE Signal Processing Society Workshop, MLSP 2009, pp. 1–6. IEEE, Grenoble (2009)

    Google Scholar 

  91. Yamada, M., Sugiyama, M., Matsui, T.: Semi-supervised speaker identification under covariate shift. Signal Process. 90(8), 2353–2361 (2010)

    Article  MATH  Google Scholar 

  92. Lee, K., Slaney, M.: Automatic chord recognition from audio using a supervised hmm trained with audio-from-symbolic data. In: Proceedings of the ACM Multimedia ’06, Santa Barbara, USA, pp. 11–20. ACM, New York (2006)

    Google Scholar 

  93. Wu, S., Falk, T.H., Chan, W.: Automatic speech emotion recognition using modulation spectral features. Speech Commun. 53(5), 768–785 (2011)

    Article  Google Scholar 

  94. Mahdhaoui, A., Chetouani, M., Kessous, L.: Time-frequency features extraction for infant directed speech discrimination. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5933 LNAI, pp. 120–127. Springer, Berlin, Heidelberg (2010)

  95. Ringeval, F., Chetouani, M.: A vowel based approach for acted emotion recognition. In: INTERSPEECH 2008: 9th Annual Conference of the International Speech Communication Association, pp. 2763–2766. ISCA, Brisbane (2008)

  96. Reisenzein, R., Weber, H.: Personality and emotion. In: Corr, P.J., Matthews, G. (eds.) The Cambridge Handbook of Personality Psychology, pp. 54–71. Cambridge University Press, Cambridge (2009)

  97. Provine, R.: Laughter punctuates speech: linguistic, social and gender contexts of laughter. Ethology 95, 291–298 (1993)

  98. Ververidis, D., Kotropoulos, C.: Automatic speech classification to five emotional states based on gender information. In: Proceedings of the 12th European Signal Processing Conference, pp. 341–344, Vienna, 2004

  99. Vogt, T., André, E.: Improving automatic emotion recognition from speech via gender differentiation. In: Proceedings of Language Resources and Evaluation Conference (LREC), Genoa, 2006

  100. Stadermann, J., Koska, W., Rigoll, G.: Multi-task learning strategies for a recurrent neural net in a hybrid tied-posteriors acoustic model. In: Proceedings of Interspeech 2005, pp. 2993–2996. ISCA, Lisbon (2005)

  101. Byrd, D.: Relations of sex and dialect to reduction. Speech Commun. 15(1–2), 39–54 (1994)

  102. Batliner, A., Steidl, S., Schuller, B., Seppi, D., Vogt, T., Wagner, J., Devillers, L., Vidrascu, L., Aharonson, V., Kessous, L., Amir, N.: Whodunnit: searching for the most important feature types signalling emotion-related user states in speech. Comput. Speech Lang. 25(1), 4–28 (2011). Special issue on affective speech in real-life interactions

  103. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)

  104. Baggia, P., Burnett, D.C., Carter, J., Dahl, D.A., McCobb, G., Raggett, D.: EMMA: Extensible MultiModal Annotation Markup Language. World Wide Web Consortium Recommendation REC-emma-20090210, Johnston, M. (ed.), February 2009

  105. Schröder, M., Devillers, L., Karpouzis, K., Martin, J.-C., Pelachaud, C., Peter, C., Pirker, H., Schuller, B., Tao, J., Wilson, I.: What should a generic emotion markup language be able to represent? In: Paiva, A., Picard, R.W., Prada, R. (eds.) Affective Computing and Intelligent Interaction: Second International Conference, Lisbon, Portugal, 12–14 Sept 2007 (ACII 2007). Proceedings, Lecture Notes on Computer Science (LNCS), vol. 4738/2007, pp. 440–451. Springer, Berlin (2007)

  106. Mao, X., Li, Z., Bao, H.: An extension of MPML with emotion recognition functions attached. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5208 LNAI. Springer, Berlin, Heidelberg (2008)

  107. Schuller, B.: Affective speaker state analysis in the presence of reverberation. Int. J. Speech Technol. 14(2), 77–87 (2011)

  108. Tabatabaei, T.S., Krishnan, S.: Towards robust speech-based emotion recognition. In: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, pp. 608–611. IEEE, Istanbul (2010)

  109. Cannizzaro, M., Reilly, N., Snyder, P.J.: Speech content analysis in feigned depression. J. Psycholinguist. Res. 33(4), 289–301 (2004)

  110. Reilly, N., Cannizzaro, M.S., Harel, B.T., Snyder, P.J.: Feigned depression and feigned sleepiness: a voice acoustical analysis. Brain Cogn. 55(2), 383–386 (2004)

  111. Boden, M.: Mind as Machine: A History of Cognitive Science, Chapter 9. Oxford University Press, New York (2008)

  112. Shami, M., Verhelst, W.: Automatic classification of expressiveness in speech: a multi-corpus study. In: Müller, C. (ed.) Speaker Classification II. Lecture Notes in Computer Science/Artificial Intelligence, vol. 4441, pp. 43–56. Springer, Heidelberg (2007)

  113. Chen, A.: Perception of paralinguistic intonational meaning in a second language. Lang. Learn. 59(2), 367–409 (2009)

  114. Esposito, A., Riviello, M.T.: The cross-modal and cross-cultural processing of affective information. In: Neural Nets WIRN10: Proceedings of the 20th Italian Workshop on Neural Nets, vol. 226, pp. 301–310, 2011

  115. Bellegarda, J.R.: Language-independent speaker classification over a far-field microphone. In: Mueller, C. (ed.) Speaker Classification II: Selected Projects, pp. 104–115. Springer, Berlin (2007)

  116. Kleynhans, N.T., Barnard, E.: Language dependence in multilingual speaker verification. In: Proceedings of the 16th Annual Symposium of the Pattern Recognition Association of South Africa, pp. 117–122, Langebaan, Nov 2005

  117. Weninger, F., Schuller, B., Liem, C., Kurth, F., Hanjalic, A.: Music information retrieval: an inspirational guide to transfer from related disciplines. In: Müller, M., Goto, M. (eds.) Multimodal Music Processing, Dagstuhl Follow-Ups (Seminar 11041), pp. 195–215. Schloss Dagstuhl, Dagstuhl, Germany (2012)

  118. Jiang, H.: Confidence measures for speech recognition: a survey. Speech Commun. 45(4), 455–470 (2005)

  119. Sukkar, R.: Rejection for connected digit recognition based on GPD segmental discrimination. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, 1994 (ICASSP-94), vol. 1, pp. I-393–I-396

  120. White, C., Droppo, J., Acero, A., Odell, J.: Maximum entropy confidence estimation for speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing, 2007 (ICASSP 2007), vol. 4, pp. 809–812

  121. Wessel, F., Schluter, R., Macherey, K., Ney, H.: Confidence measures for large vocabulary continuous speech recognition. IEEE Trans. Speech Audio Process. 9(3), 288–298 (2001)

  122. Rahim, M., Lee, C., Juang, B.: Discriminative utterance verification for connected digits recognition. IEEE Trans. Speech Audio Process. 5(3), 266–277 (1997)

  123. Han, W., Zhang, Z., Deng, J., Wöllmer, M., Weninger, F., Schuller, B.: Towards distributed recognition of emotion in speech. In: Proceedings 5th International Symposium on Communications, Control, and Signal Processing (ISCCSP 2012), pp. 1–4. IEEE, Rome, May 2012

  124. ETSI: ETSI ES 202 050 V1.1.5: Speech processing, transmission and quality aspects (STQ), distributed speech recognition, advanced front-end feature extraction algorithm, compression algorithms (2007)

  125. Zhang, W., He, L., Chow, Y.L., Yang, R., Su, Y.: The study on distributed speech recognition system. In: Proceedings of ICASSP, pp. 1431–1434, Istanbul, 2000

  126. Tsakalidis, S., Digalakis, V., Neumeyer, L.: Efficient speech recognition using subvector quantization and discrete-mixture HMMs. In: Proceedings of ICASSP, pp. 569–572, Phoenix, 1999

  127. Jain, A.K., Flynn, P.J., Ross, A.A.: Handbook of Biometrics. Springer, Heidelberg (2008)

Author information

Correspondence to Björn Schuller.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Schuller, B. (2013). Discussion. In: Intelligent Audio Analysis. Signals and Communication Technology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36806-6_13

  • DOI: https://doi.org/10.1007/978-3-642-36806-6_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36805-9

  • Online ISBN: 978-3-642-36806-6

  • eBook Packages: Engineering, Engineering (R0)
