
First-person activity recognition from micro-action representations using convolutional neural networks and object flow histograms


Abstract

A novel first-person human activity recognition framework is proposed in this work. Our methodology is inspired by the central role that moving objects play in egocentric activity videos. Using a deep Convolutional Neural Network we detect objects and build discriminant object flow histograms to represent fine-grained micro-actions over short temporal windows. The framework rests on the assumption that large-scale activities are composed of such fine-grained micro-actions. We gather all the micro-actions and apply Gaussian Mixture Model clustering to build a micro-action vocabulary, which is then used in a Fisher encoding scheme. Results show that our method reaches a 60% recognition rate on the benchmark ADL dataset. The capabilities of the proposed framework are further demonstrated by a thorough evaluation across a wide range of hyper-parameters and a comparison against other state-of-the-art methods.
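To make the pipeline concrete, the following is a minimal Python sketch of the stages described above: an orientation histogram of optical flow inside each detected object box, mean-pooled over a short temporal window into a micro-action descriptor, a GMM vocabulary fit on pooled training descriptors, and a Fisher encoding of a video's descriptors. All specifics here (the helper names, the 8 orientation bins, the 64-component diagonal GMM, and the simplified means-only Fisher vector) are illustrative assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of the micro-action pipeline sketched in the abstract.
# Assumptions, not the paper's exact method: hypothetical helper names,
# 8 orientation bins, a 64-component diagonal GMM, means-only Fisher vector.
import numpy as np
from sklearn.mixture import GaussianMixture

N_BINS = 8  # orientation bins for the object flow histogram (assumed)

def object_flow_histogram(flow, box):
    """Histogram of optical-flow orientations inside one detected object box.

    flow: (H, W, 2) dense optical flow for a frame pair.
    box:  (x1, y1, x2, y2) detector output in integer pixel coordinates.
    """
    x1, y1, x2, y2 = box
    fx = flow[y1:y2, x1:x2, 0].ravel()
    fy = flow[y1:y2, x1:x2, 1].ravel()
    mag = np.hypot(fx, fy)              # flow magnitude, used as bin weight
    ang = np.arctan2(fy, fx)            # flow orientation in [-pi, pi]
    hist, _ = np.histogram(ang, bins=N_BINS, range=(-np.pi, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)   # normalise so motion scale does not dominate

def micro_action_descriptor(window_flows, window_boxes):
    """Mean-pool per-frame object flow histograms over a short temporal window."""
    hists = [object_flow_histogram(f, b)
             for f, b in zip(window_flows, window_boxes)]
    return np.mean(hists, axis=0)

def fisher_vector(descriptors, gmm):
    """Simplified (means-only) Fisher vector of a set of micro-action descriptors."""
    X = np.atleast_2d(descriptors)
    q = gmm.predict_proba(X)                        # (N, K) soft assignments
    diff = X[:, None, :] - gmm.means_[None]         # (N, K, D) offsets from means
    grad = (q[:, :, None] * diff / np.sqrt(gmm.covariances_)[None]).sum(axis=0)
    fv = grad.ravel() / len(X)
    return fv / (np.linalg.norm(fv) + 1e-8)         # L2-normalised encoding

# Vocabulary: fit a GMM on micro-action descriptors pooled from training videos.
train_descriptors = np.random.rand(1000, N_BINS)    # stand-in for real data
gmm = GaussianMixture(n_components=64, covariance_type='diag').fit(train_descriptors)

# Encode one video's micro-actions into a single fixed-length vector.
video_descriptors = np.random.rand(40, N_BINS)      # stand-in for real data
encoding = fisher_vector(video_descriptors, gmm)
```

In practice the resulting Fisher vector would feed a standard classifier such as a linear SVM; the descriptor layout, vocabulary size, and encoding details used in the paper may differ.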



Acknowledgements

This work was supported by the SUITCEYES project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 780814.

Author information

Corresponding author

Correspondence to Panagiotis Giannakeris.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Giannakeris, P., Petrantonakis, P.C., Avgerinakis, K. et al. First-person activity recognition from micro-action representations using convolutional neural networks and object flow histograms. Multimed Tools Appl 80, 22487–22507 (2021). https://doi.org/10.1007/s11042-020-09902-6

