Multimedia Tools and Applications, Volume 78, Issue 5, pp 6277–6308

Improved TOPSIS method for peak frame selection in audio-video human emotion recognition

  • Lovejit Singh
  • Sarbjeet Singh (Email author)
  • Naveen Aggarwal


Selecting the peak frame and identifying the corresponding voice segment is a challenging problem in audio-video human emotion recognition. The peak frame is the most relevant descriptor of the facial expression, which can be inferred from varied emotional states. In this paper, an improved Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) is proposed to select the key frame based on the co-occurrence behavior of facial action units in the visual sequences. The proposed method uses experts' judgments while identifying the peak frame in the video modality. It locates the peak voiced segment in the audio modality using its synchronous and asynchronous temporal relationships with the selected peak visual frame. The facial action unit features of the peak frame are fused with nine statistical characteristics of the spectral features of the voiced segment. Weighted product rule-based decision-level fusion is performed to combine the posterior probabilities of two independent (i.e., audio and video) support vector machine-based classification models. The performance of the proposed peak frame and voiced segment selection method is evaluated and compared with the existing Maximum-Dissimilarity (MAX-DIST), Dendrogram-Clustering (DEND-CLUSTER), and Emotion Intensity (EIFS) based peak frame selection methods on two challenging emotion datasets in two different languages, namely eNTERFACE’05 in English and BAUM-1a in Turkish. The results show that the system with the proposed method performs better than the existing techniques, achieving emotion recognition accuracies of 88.03% and 84.61% on the eNTERFACE’05 and BAUM-1a datasets, respectively.
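
As a rough illustration of two building blocks named in the abstract, the following sketch ranks candidate frames with the classic TOPSIS procedure and combines two classifiers' posteriors with a weighted product rule. It is not the improved TOPSIS of the paper: the expert-judgment weighting and the action-unit co-occurrence modelling are not reproduced here. The function names, the per-frame feature matrix, the criterion weights, and the fusion weights are all hypothetical and chosen only for the example.

```python
import numpy as np


def topsis_scores(decision_matrix, weights, benefit=None):
    """Classic TOPSIS closeness scores in [0, 1]; higher means closer to ideal.

    decision_matrix: rows are alternatives (here: candidate frames),
                     columns are criteria (here: hypothetical AU intensities).
    weights:         one non-negative weight per criterion (assumed given).
    benefit:         boolean per criterion, True if larger is better;
                     defaults to all True.
    """
    X = np.asarray(decision_matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    if benefit is None:
        benefit = np.ones(X.shape[1], dtype=bool)
    benefit = np.asarray(benefit, dtype=bool)

    # Vector-normalise each criterion column, then apply the weights.
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0
    V = (X / norms) * w

    # Positive-ideal and negative-ideal solutions per criterion.
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    anti = np.where(benefit, V.min(axis=0), V.max(axis=0))

    # Euclidean distances of every alternative to both reference points.
    d_pos = np.linalg.norm(V - ideal, axis=1)
    d_neg = np.linalg.norm(V - anti, axis=1)

    # Relative closeness; identical alternatives get a neutral 0.5.
    denom = d_pos + d_neg
    return np.divide(d_neg, denom, out=np.full_like(d_neg, 0.5), where=denom > 0)


def weighted_product_fusion(p_audio, p_video, w_audio=0.5, w_video=0.5):
    """Weighted product rule: fused_k is proportional to p_audio_k**w_audio * p_video_k**w_video."""
    p_a = np.asarray(p_audio, dtype=float)
    p_v = np.asarray(p_video, dtype=float)
    fused = (p_a ** w_audio) * (p_v ** w_video)
    return fused / fused.sum()  # renormalise (assumes a non-zero total)


# Hypothetical data: 3 frames x 3 action-unit intensities.
frames = [[0.1, 0.2, 0.0],
          [0.9, 0.8, 0.7],
          [0.4, 0.5, 0.3]]
scores = topsis_scores(frames, weights=[1, 1, 1])
peak_frame = int(np.argmax(scores))

# Hypothetical class posteriors from the audio and video classifiers.
fused = weighted_product_fusion([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
predicted_class = int(np.argmax(fused))
```

In this toy setting the frame with the highest closeness score would be taken as the peak frame, and the class with the largest fused posterior as the prediction. How the actual criterion and fusion weights are derived is specific to the paper and is left as an assumption here.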


Face recognition; Improved TOPSIS method; Peak frame selection; Audio-video emotion recognition



This work was supported by the University Grants Commission (UGC), Ministry of Human Resource Development (MHRD) of India, under the Basic Scientific Research (BSR) fellowship for meritorious fellows, vide UGC letter no. F.25-1/2013-14(BSR)/7-379/2012(BSR) dated 30.5.2014.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Lovejit Singh (1)
  • Sarbjeet Singh (1), Email author
  • Naveen Aggarwal (1)
  1. UIET, Panjab University, Chandigarh, India
