Harmonizing local and global features: enhanced hand gesture segmentation using synergistic fusion of CNN and transformer networks

  • Original Paper
  • Published in Signal, Image and Video Processing (2024)

Abstract

Hand gesture segmentation is an important research topic in computer vision. Despite ongoing efforts, optimal gesture segmentation remains challenging due to factors such as varied gesture morphology and cluttered backgrounds. To address these challenges, we propose a novel hand gesture segmentation approach that combines the strengths of Convolutional Neural Networks (CNNs) for local feature extraction with Transformer networks for global feature integration. Specifically, we design two feature fusion modules. The first employs an attention mechanism to learn how to fuse the features extracted by the CNN and the Transformer. The second combines group convolution with activation functions to implement a gating mechanism, enhancing the response of crucial features while suppressing interference from weaker ones. Our method achieves mIoU scores of 93.53%, 97.25%, and 90.39% on the OUHANDS, HGR1, and EgoHands hand gesture datasets, respectively, outperforming state-of-the-art methods.
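The authors' implementation is not reproduced on this page. As a rough illustration of the two fusion modules described above, the following PyTorch sketch shows (a) an attention-based blend of CNN and Transformer feature maps and (b) a group-convolution gating unit. All class names, layer choices, and hyperparameters (channel count, reduction ratio, number of groups) are our assumptions for illustration, not the authors' published design.

```python
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Hypothetical sketch of the first module: learns per-channel
    weights to blend CNN (local) and Transformer (global) feature
    maps of matching shape."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global descriptor per channel
        self.mlp = nn.Sequential(
            nn.Conv2d(2 * channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_cnn: torch.Tensor, f_trans: torch.Tensor) -> torch.Tensor:
        # Derive a gate from the concatenated global descriptors, then
        # blend the two streams as a convex combination.
        w = self.mlp(self.pool(torch.cat([f_cnn, f_trans], dim=1)))
        return w * f_cnn + (1.0 - w) * f_trans


class GatedGroupConv(nn.Module):
    """Hypothetical sketch of the second module: a group convolution
    followed by a sigmoid acts as a gate that amplifies strong feature
    responses and suppresses weak ones."""

    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=groups),
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)  # element-wise gating of the fused features


if __name__ == "__main__":
    f_cnn = torch.randn(1, 64, 32, 32)    # local features from the CNN branch
    f_trans = torch.randn(1, 64, 32, 32)  # global features from the Transformer branch
    fused = AttentionFusion(64)(f_cnn, f_trans)
    out = GatedGroupConv(64)(fused)
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```

The convex combination w * f_cnn + (1 - w) * f_trans lets the network decide, channel by channel, how much local versus global evidence to trust, while the gating unit passes the fused features through a learned soft mask rather than an additional unconstrained transform.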


Data Availability

No datasets were generated or analysed during the current study.

Acknowledgements

We sincerely thank the editor and reviewers for their work on this manuscript. This work was partially supported by the Central Government Guided Local Funds for Science and Technology Development (No. 216Z0301G), the National Natural Science Foundation of China (No. 61379065), the Hebei Natural Science Foundation (No. F2023203012), the Science Research Project of the Hebei Education Department (No. QN2024010), and the Innovation Capability Improvement Plan Project of Hebei Province (No. 22567626H).

Funding

Central Government Guided Local Funds for Science and Technology Development (216Z0301G), National Natural Science Foundation of China (61379065), Hebei Natural Science Foundation (F2023203012), Science Research Project of Hebei Education Department (QN2024010), Innovation Capability Improvement Plan Project of Hebei Province (22567626H).

Author information

Contributions

SW wrote the main manuscript text and the code. NY prepared the data and figures. ML edited the manuscript. QT validated the manuscript. SZ directed the study. All authors reviewed the manuscript.

Corresponding author

Correspondence to Shihui Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, S., Yang, N., Liu, M. et al. Harmonizing local and global features: enhanced hand gesture segmentation using synergistic fusion of CNN and transformer networks. SIViP (2024). https://doi.org/10.1007/s11760-024-03255-5
