A Temporal Dependency Based Multi-modal Active Learning Approach for Audiovisual Event Detection

  • Patrick Thiam
  • Sascha Meudt
  • Günther Palm
  • Friedhelm Schwenker


In this work, two novel active learning approaches for the annotation and detection of audiovisual events are proposed. The underlying assumption is that events are likely to deviate substantially from the distribution of normal observations and should therefore lie in regions of low density. Thus, an event detection model can presumably be trained more efficiently by focusing on samples that appear inconsistent with the majority of the dataset. The first approach is a uni-modal method that uses rank aggregation to select informative samples, which have previously been ranked using different unsupervised outlier detection techniques in combination with an uncertainty sampling technique. The information used for the sample selection stems from a single modality (e.g. the video channel). Since most active learning approaches focus on one target channel to select informative samples, and thus do not take advantage of potentially useful and complementary information among correlated modalities, we propose an extension of the uni-modal approach to multiple modalities. From a target pool of instances belonging to a specific modality, the uni-modal approach is used to select and manually label a set of informative instances. Additionally, a second set of automatically labelled instances of the target pool is generated, based on a transfer of information stemming from an auxiliary modality that is temporally dependent on the target one. Both sets of labelled instances (automatically and manually labelled) are used for the semi-supervised training of a classification model to be used in the next active learning iteration. Both methods have been assessed on a set of participants selected from the UUlmMAC dataset and have proven effective in substantially reducing the cost of manual annotation required for the training of a facial event detection model. The assessment is based on two different methods: Support Vector Data Description and expected similarity estimation. Furthermore, given an appropriate sampling approach, the multi-modal approach outperforms its uni-modal counterpart in most cases.


Keywords: Active learning; Unsupervised outlier detection; Support Vector Data Description; Expected similarity estimation



This paper is based on work done within the Transregional Collaborative Research Centre SFB/TRR 62 Companion-Technology for Cognitive Technical Systems funded by the German Research Foundation (DFG). Patrick Thiam is supported by the Federal Ministry of Education and Research (BMBF) within the project SenseEmotion. The work of Sascha Meudt and Friedhelm Schwenker is supported by the SFB/TRR 62. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. Institute of Neural Information Processing, Ulm University, Ulm, Germany
