
Multi-modal egocentric activity recognition using multi-kernel learning


Existing methods for egocentric activity recognition are mostly based on extracting motion characteristics from videos. On the other hand, the ubiquity of wearable sensors allows the acquisition of information from different sources. Although the increase in sensor diversity creates a need for adaptive fusion, most studies use pre-determined weights for each source. In addition, only a limited number of studies make joint use of optical, audio and wearable sensors. In this work, we propose a new framework that adaptively weighs the visual, audio and sensor features according to their discriminative abilities. For that purpose, multi-kernel learning (MKL) is used to fuse multi-modal features, with the feature and kernel selection/weighing and the recognition tasks performed concurrently. Audio-visual information is used in association with the data acquired from wearable sensors, since these sources hold information on different aspects of activities and help build better models. The proposed framework can be used with different modalities to improve the recognition accuracy and can easily be extended with additional sensors. The results show that using multi-modal features with MKL outperforms the existing methods.
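The idea of weighting one kernel per modality can be illustrated with a small sketch. This is not the method of the paper: the abstract describes learning the weights concurrently with the classifier, whereas the sketch below swaps in a simpler heuristic, centered kernel-target alignment, to score each modality. The synthetic "visual", "audio" and "sensor" features, the RBF kernel, the `gamma` values and all function names are assumptions made purely for illustration.

```python
import numpy as np


def rbf_kernel(X, Y, gamma):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def center(K):
    """Center a square kernel matrix in feature space (H K H)."""
    return K - K.mean(axis=0) - K.mean(axis=1)[:, None] + K.mean()


def alignment(K, y):
    """Centered alignment between kernel K and the ideal kernel y y^T."""
    Kc = center(K)
    Tc = center(np.outer(y, y))
    return float((Kc * Tc).sum() / (np.linalg.norm(Kc) * np.linalg.norm(Tc)))


def kernel_weights(kernels, y):
    """Heuristic weights: clipped alignment scores, normalized to sum to 1."""
    scores = np.array([max(alignment(K, y), 0.0) for K in kernels])
    if scores.sum() == 0.0:
        # No kernel aligns with the labels: fall back to uniform weights.
        return np.full(len(kernels), 1.0 / len(kernels))
    return scores / scores.sum()


def combine(kernels, weights):
    """Weighted sum of kernels; a non-negative sum of valid kernels is a valid kernel."""
    return sum(w * K for w, K in zip(weights, kernels))


# Synthetic two-class data (hypothetical stand-ins for real features).
rng = np.random.default_rng(0)
n = 40
y = np.array([1.0] * (n // 2) + [-1.0] * (n // 2))
visual = rng.normal(size=(n, 5)) + 2.0 * y[:, None]   # strongly informative
audio = rng.normal(size=(n, 8))                       # pure noise
sensor = rng.normal(size=(n, 3)) + 0.5 * y[:, None]   # weakly informative

modalities = [visual, audio, sensor]
# One kernel per modality; gamma = 1 / (2 * dimension) is an arbitrary choice.
kernels = [rbf_kernel(X, X, 1.0 / (2 * X.shape[1])) for X in modalities]

weights = kernel_weights(kernels, y)
K_combined = combine(kernels, weights)
print(dict(zip(["visual", "audio", "sensor"], np.round(weights, 3))))
```

The combined matrix `K_combined` could then be handed to any kernel classifier that accepts a precomputed kernel. A full MKL solver would instead optimize the weights together with the classifier parameters, which is the setting the abstract describes.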





Author information



Corresponding author

Correspondence to Mehmet Ali Arabacı.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was partly supported by The Scientific and Technological Research Council of Turkey under TUBITAK BIDEB-2219 grant no 1059B191500048.


About this article


Cite this article

Arabacı, M.A., Özkan, F., Surer, E. et al. Multi-modal egocentric activity recognition using multi-kernel learning. Multimed Tools Appl 80, 16299–16328 (2021).



  • Egocentric
  • First-person vision
  • Activity recognition
  • Multi-kernel learning
  • Multi-modality