Egocentric upper limb segmentation in unconstrained real-life scenarios

  • S.I.: New Trends on Immersive Healthcare

Abstract

The segmentation of bare and clothed upper limbs in unconstrained real-life environments is a challenging and still little-explored task. We tackled it by training a deep neural network based on the DeepLabv3+ architecture. We collected about 46,000 carefully labeled real-life RGB egocentric images covering a wide variety of skin tones, clothes, occlusions, and lighting conditions. We then extensively evaluated the proposed approach and compared it with state-of-the-art methods for hand and arm segmentation, e.g., Ego2Hands, EgoArm, and HGRNet. We used our test set and a subset of the EgoGesture dataset (EgoGestureSeg) to assess the model's generalization to challenging scenarios. Moreover, we tested our network on hand-only segmentation, since it is a closely related task. We performed a quantitative analysis through standard image segmentation metrics and a qualitative evaluation by visually comparing the obtained predictions. Our approach outperforms all competing models on both tasks, proving its robustness to hand-to-hand and hand-to-object occlusions, dynamic user/camera movements, different lighting conditions, skin colors, clothes, and limb/hand poses.
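The exact metric suite is not listed on this page, but segmentation papers of this kind typically report per-pixel measures such as IoU, precision, recall, and F1. The sketch below is a minimal, hypothetical implementation of those measures for a single binary mask; the `segmentation_metrics` helper is illustrative and not taken from the authors' code.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Standard binary segmentation metrics for one image.

    pred, gt: arrays of shape (H, W); nonzero marks a pixel
    predicted/labeled as upper limb (foreground).
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)    # foreground pixels correctly predicted
    fp = np.sum(pred & ~gt)   # background predicted as foreground
    fn = np.sum(~pred & gt)   # foreground missed by the prediction
    tn = np.sum(~pred & ~gt)  # background correctly predicted

    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"iou": iou, "precision": precision, "recall": recall,
            "f1": f1, "pixel_accuracy": (tp + tn) / pred.size}
```

Averaging the per-image IoU over a test set yields the mean IoU figure commonly reported in segmentation benchmarks.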

Data availability

The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request. The source code, video, and test dataset (example samples) generated or analysed during this study are included in this published article and its supplementary information files.

Notes

  1. The pre-trained weights used are publicly available on the DeepLab project page: https://github.com/tensorflow/models/tree/master/research/deeplab (see the inference sketch after these notes).

  2. The original subset (Gonzalez-Sosa et al. 2020) contains 277 images. After careful inspection, we removed those whose labels showed small imperfections or errors.

  3. We tested the best Ego2Hands model (without custom scene adaptation), which is publicly available at https://github.com/AlextheEngineer/Ego2Hands.

  4. Since the code and the EgoArm network are not publicly available, network inference was performed by Ester Gonzalez-Sosa, one of the EgoArm authors, who sent us the obtained predictions.
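Note 1 above points to the TensorFlow DeepLab repository for the pre-trained weights. Purely as an illustration of how those checkpoints are consumed (a sketch, not the authors' code), the frozen inference graphs distributed on that page can be run as follows; the tensor names `ImageTensor:0` and `SemanticPredictions:0` are the ones used in the official DeepLab demo notebook and are assumptions with respect to this paper.

```python
import numpy as np
import tensorflow as tf  # uses the TF1-style graph API via tf.compat.v1

INPUT_TENSOR = 'ImageTensor:0'           # uint8 RGB image, shape [1, H, W, 3]
OUTPUT_TENSOR = 'SemanticPredictions:0'  # int label map, shape [1, H, W]

def load_frozen_graph(path):
    """Load a DeepLab frozen inference graph (.pb) into a fresh tf.Graph."""
    graph = tf.Graph()
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(path, 'rb') as f:
        graph_def.ParseFromString(f.read())
    with graph.as_default():
        tf.compat.v1.import_graph_def(graph_def, name='')
    return graph

def predict_label_map(graph, image):
    """Run single-image inference; `image` is an HxWx3 uint8 RGB array."""
    with tf.compat.v1.Session(graph=graph) as sess:
        labels = sess.run(OUTPUT_TENSOR,
                          feed_dict={INPUT_TENSOR: image[None, ...]})
    return labels[0]  # per-pixel class indices

# A binary upper-limb mask can then be obtained by comparing the
# returned label map against the class index of interest.
```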

References

  • Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org https://www.tensorflow.org/

  • Alletto S, Serra G, Calderara S, Cucchiara R (2015) Understanding social relationships in egocentric vision. Pattern Recognit 48(12):4082–4096

  • Bambach S, Lee S, Crandall DJ, Yu C (2015) Lending a hand: detecting hands and recognizing activities in complex egocentric interactions. In: The IEEE international conference on computer vision (ICCV)

  • Bandini A, Zariffa J (2020) Analysis of the hands in egocentric vision: a survey. IEEE Trans Pattern Anal Mach Intell

  • Betancourt A, Morerio P, Regazzoni CS, Rauterberg M (2015) The evolution of first person vision methods: a survey. IEEE Trans Circuits Syst Video Technol 25(5):744–760

  • Betancourt A, Morerio P, Barakova E, Marcenaro L, Rauterberg M, Regazzoni C (2017) Left/right hand segmentation in egocentric videos. Comput Vis Image Underst 154:73–81

  • Bojja AK, Mueller F, Malireddi SR, Oberweger M, Lepetit V, Theobalt C, Yi KM, Tagliasacchi A (2019) Handseg: an automatically labeled dataset for hand segmentation from depth images. In: 2019 16th conference on computer and robot vision (CRV), pp 151–158. IEEE

  • Brancati N, Caggianese G, Frucci M, Gallo L, Neroni P (2015) Robust fingertip detection in egocentric vision under varying illumination conditions. In: 2015 IEEE international conference on multimedia and expo workshops (ICMEW), pp 1–6. IEEE

  • Caggianese G, Gallo L, Neroni P (2015) Design and preliminary evaluation of free-hand travel techniques for wearable immersive virtual reality systems with egocentric sensing. In: International conference on augmented and virtual reality, pp 399–408. Springer

  • Caggianese G, Capece N, Erra U, Gallo L, Rinaldi M (2020) Freehand-steering locomotion techniques for immersive virtual environments: a comparative evaluation. Int J Hum Comput Interact 36(18):1734–1755

  • Cai M, Kitani KM, Sato Y (2017) An ego-vision system for hand grasp analysis. IEEE Trans Hum Mach Syst 47(4):524–535

  • Cai M, Lu F, Sato Y (2020) Generalizing hand segmentation in egocentric videos with uncertainty-guided model adaptation. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 14380–14389. https://doi.org/10.1109/CVPR42600.2020.01440

  • Capece N, Erra U, Gruosso M, Anastasio M (2020) Archaeo puzzle: an educational game using natural user interface for historical artifacts

  • Chalasani T, Ondrej J, Smolic A (2018) Egocentric gesture recognition for head-mounted ar devices. In: 2018 IEEE international symposium on mixed and augmented reality adjunct (ISMAR-Adjunct), pp 109–114. https://doi.org/10.1109/ISMAR-Adjunct.2018.00045

  • Chen LC, Papandreou G, Schroff F, Adam H (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587

  • Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848

  • Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder–decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV), pp 801–818

  • Chollet F (2017) Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1251–1258

  • Dadashzadeh A, Targhi AT, Tahmasbi M, Mirmehdi M (2019) Hgr-net: a fusion network for hand gesture segmentation and recognition. IET Comput Vis 13(8):700–707

  • Dave IR, Chaudhary V, Upla KP (2019) Simulation of analytical chemistry experiments on augmented reality platform. In: Panigrahi CR, Pujari AK, Misra S, Pati B, Li K-C (eds) Progress in advanced computing and intelligent engineering. Springer, Singapore, pp 393–403

  • Fathi A, Ren X, Rehg JM (2011) Learning to recognize objects in egocentric activities. In: CVPR 2011, pp 3281–3288. IEEE

  • Ferracani A, Pezzatini D, Bianchini J, Biscini G, Del Bimbo A (2016) Locomotion by natural gestures for immersive virtual environments. In: AltMM ’16, pp 21–24. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/2983298.2983307

  • Garcia-Hernando G, Yuan S, Baek S, Kim TK (2018) First-person hand action benchmark with rgb-d videos and 3d hand pose annotations. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)

  • Gonzalez-Sosa E, Perez P, Tolosana R, Kachach R, Villegas A (2020) Enhanced self-perception in mixed reality: egocentric arm segmentation and database with automatic labeling. IEEE Access 8:146887–146900

  • Gruosso M, Capece N, Erra U, Angiolillo F (2020) A preliminary investigation into a deep learning implementation for hand tracking on mobile devices. In: 2020 IEEE international conference on artificial intelligence and virtual reality (AIVR), pp 380–385. IEEE

  • Gruosso M, Capece N, Erra U (2021a) Human segmentation in surveillance video with deep learning. Multimed Tools Appl 80(1):1175–1199

  • Gruosso M, Capece N, Erra U (2021b) Exploring upper limb segmentation with deep learning for augmented virtuality. In: Frosini P, Giorgi D, Melzi S, Rodolá E (eds) Smart tools and apps for graphics: Eurographics Italian chapter conference. The Eurographics Association. https://doi.org/10.2312/stag.20211483

  • Gruosso M, Capece N, Erra U (2021c) Solid and effective upper limb segmentation in egocentric vision. In: The 26th international conference on 3D Web technology. https://doi.org/10.1145/3485444.3495179

  • Haria A, Subramanian A, Asokkumar N, Poddar S, Nayak JS (2017) Hand gesture recognition for human computer interaction. Procedia Comput Sci 115:367–374

  • Harkat H, Nascimento J, Bernardino A (2020) Fire segmentation using a deeplabv3+ architecture. In: Image and signal processing for remote sensing XXVI, vol 11533, p 115330. International Society for Optics and Photonics

  • Herumurti D, Yuniarti A, Kuswardayan I, Nurul W, Hariadi RR, Suciati N, Manggala MG (2017) Mixed reality in the 3d virtual room arrangement. In: 2017 11th international conference on information communication technology and system (ICTS), pp 303–306. https://doi.org/10.1109/ICTS.2017.8265688

  • Ju Z, Ji X, Li J, Liu H (2017) An integrative framework of human hand gesture segmentation for human–robot interaction. IEEE Syst J 11(3):1326–1336. https://doi.org/10.1109/JSYST.2015.2468231

  • Kapidis G, Poppe R, Van Dam E, Noldus L, Veltkamp R (2019) Egocentric hand track and object-based human action recognition. In: 2019 IEEE SmartWorld, ubiquitous intelligence & computing, advanced & trusted computing, scalable computing & communications, cloud & big data computing, internet of people and smart city innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pp 922–929. IEEE

  • Kok VJ, Chan CS (2016) Grcs: granular computing-based crowd segmentation. IEEE Trans Cybern 47(5):1157–1168

  • Kong Y, Liu Y, Yan B, Leung H, Peng X (2021) A novel deeplabv3+ network for sar imagery semantic segmentation based on the potential energy loss function of gibbs distribution. Remote Sensing 13(3):454

  • Lateef F, Ruichek Y (2019) Survey on semantic segmentation using deep learning techniques. Neurocomputing 338:321–348

  • Lee K, Kacorri H (2019) Hands holding clues for object recognition in teachable machines. In: Proceedings of the 2019 CHI conference on human factors in computing systems ACM

  • Lee S, Bambach S, Crandall DJ, Franchak JM, Yu C (2014) This hand is my hand: a probabilistic approach to hand disambiguation in egocentric video. In: 2014 IEEE conference on computer vision and pattern recognition workshops, pp 557–564. https://doi.org/10.1109/CVPRW.2014.86

  • Li C, Kitani KM (2013) Pixel-level hand detection in ego-centric videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3570–3577

  • Li Y, Ye Z, Rehg JM (2015) Delving into egocentric actions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 287–295

  • Li Y, Jia L, Wang Z, Qian Y, Qiao H (2019) Un-supervised and semi-supervised hand segmentation in egocentric images with noisy label learning. Neurocomputing 334:11–24. https://doi.org/10.1016/j.neucom.2018.12.010

  • Lin F, Martinez T (2020) Ego2hands: a dataset for egocentric two-hand segmentation and detection. arXiv preprint arXiv:2011.07252

  • Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: European conference on computer vision, pp 740–755. Springer

  • Lin G, Milan A, Shen C, Reid I (2017) Refinenet: multi-path refinement networks for high-resolution semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1925–1934

  • Lin F, Wilhelm C, Martinez T (2021) Two-hand global 3d pose estimation using monocular rgb. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 2373–2381

  • Liu W, Rabinovich A, Berg AC (2015) Parsenet: looking wider to see better. arXiv preprint arXiv:1506.04579

  • Maricchiolo F, Bonaiuto M, Gnisci A (2005) Hand gestures in speech: studies of their roles in social interaction. In: Proceedings of the conference of the international society for gesture studies

  • Matilainen M, Sangi P, Holappa J, Silvén O (2016) Ouhands database for hand detection and pose recognition. In: 2016 Sixth international conference on image processing theory, tools and applications (IPTA), pp 1–5. IEEE

  • Maurya J, Hebbalaguppe R, Gupta P (2018) Real time hand segmentation on frugal headmounted device for gestural interface. In: 2018 25th IEEE international conference on image processing (ICIP), pp 4023–4027. https://doi.org/10.1109/ICIP.2018.8451213

  • Minaee S, Boykov YY, Porikli F, Plaza AJ, Kehtarnavaz N, Terzopoulos D (2021) Image segmentation using deep learning: a survey. IEEE Trans Pattern Anal Mach Intell

  • Mueller F, Bernard F, Sotnychenko O, Mehta D, Sridhar S, Casas D, Theobalt C (2018) Ganerated hands for real-time 3d hand tracking from monocular rgb. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 49–59

  • Narasimhaswamy S, Wei Z, Wang Y, Zhang J, Hoai M (2019) Contextual attention for hand detection in the wild. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9567–9576

  • Nguyen T-H-C, Nebel J-C, Florez-Revuelta F (2018) Recognition of activities of daily living from egocentric videos using hands detected by a deep convolutional network. In: Campilho A, Karray F, ter Haar Romeny B (eds) Image analysis and recognition. Springer, Cham, pp 390–398

  • Papandreou G, Kokkinos I, Savalle PA (2015) Modeling local and global deformations in deep learning: epitomic convolution, multiple instance learning, and sliding window detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 390–399

  • Paul S, Bhattacharyya A, Mollah AF, Basu S, Nasipuri M (2020) Hand segmentation from complex background for gesture recognition. In: Mandal JK, Bhattacharya D (eds) Emerging technology in modelling and graphics. Springer, Singapore, pp 775–782

  • Pirsiavash H, Ramanan D (2012) Detecting activities of daily living in first-person camera views. In: 2012 IEEE conference on computer vision and pattern recognition, pp 2847–2854. IEEE

  • Poularakis S, Katsavounidis I (2015) Low-complexity hand gesture recognition system for continuous streams of digits and letters. IEEE Trans Cybern 46(9):2094–2108

  • Rautaray SS, Agrawal A (2015) Vision based hand gesture recognition for human computer interaction: a survey. Artif Intell Rev 43(1):1–54

  • Ren Y, Kong AWK, Jiao L (2020) A survey on image and video cosegmentation: methods, challenges and analyses. Pattern Recognit 103:107297

  • Rogez G, Khademi M, Supančič JS III, Montiel JMM, Ramanan D (2015) 3d hand pose detection in egocentric rgb-d images. In: Agapito L, Bronstein MM, Rother C (eds) Computer vision: ECCV 2014 workshops. Springer, Cham, pp 356–371

  • Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252

  • Sharma S, Huang S (2021) An end-to-end framework for unconstrained monocular 3d hand pose estimation. Pattern Recognit 115:107892

  • Shilkrot R, Narasimhaswamy S, Vazir S, Hoai M (2019) Workinghands: a hand-tool assembly dataset for image segmentation and activity mining. In: BMVC, p 258

  • Tang Y, Wang Z, Lu J, Feng J, Zhou J (2018) Multi-stream deep neural networks for rgb-d egocentric action recognition. IEEE Trans Circuits Syst Video Technol 29(10):3001–3015

  • Thalmann D, Liang H, Yuan J (2015) First-person palm pose tracking and gesture recognition in augmented reality. In: International joint conference on computer vision, imaging and computer graphics, pp 3–15. Springer

  • Urooj A, Borji A (2018) Analysis of hand segmentation in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4710–4719

  • Valli A (2008) The design of natural interaction. Multimed Tools Appl 38(3):295–305

  • Wang J, Liu X (2021) Medical image recognition and segmentation of pathological slices of gastric cancer based on deeplab v3+ neural network. Comput Methods Programs Biomed 207:106210

  • Wang W, Yu K, Hugonot J, Fua P, Salzmann M (2019) Recurrent u-net for resource-constrained segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 2142–2151

  • Wu W, Gan J, Zhou J, Wang J (2021) A lightweight and effective semantic segmentation network for ethnic clothing images based on deeplab. In: 2021 9th international conference on communications and broadband networking, pp 34–40

  • Yuan S, Ye Q, Garcia-Hernando G, Kim TK (2017) The 2017 hands in the million challenge on 3d hand pose estimation. arXiv preprint arXiv:1707.02237

  • Yueming W, Hanwu H, Tong R, Detao Z (2007) Hand segmentation for augmented reality system. In: Second workshop on digital media and its application in museum heritages (DMAMH 2007), pp 395–401. https://doi.org/10.1109/DMAMH.2007.39

  • Zhang Y, Cao C, Cheng J, Lu H (2018) Egogesture: a new dataset and benchmark for egocentric hand gesture recognition. IEEE Trans Multimed 20(5):1038–1050

  • Zimmermann C, Brox T (2017) Learning to estimate 3d hand pose from single rgb images. In: Proceedings of the IEEE international conference on computer vision, pp 4903–4911

Acknowledgements

The authors would like to thank NVIDIA’s Academic Research Team for providing the Titan Xp cards under the Hardware Donation Program. The authors also thank Ester Gonzalez-Sosa for her availability and support.

Author information

Corresponding author

Correspondence to Nicola Capece.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gruosso, M., Capece, N. & Erra, U. Egocentric upper limb segmentation in unconstrained real-life scenarios. Virtual Reality 27, 3421–3433 (2023). https://doi.org/10.1007/s10055-022-00725-4
