Harmonizing local and global features: enhanced hand gesture segmentation using synergistic fusion of CNN and transformer networks

  • Original Paper
  • Published in Signal, Image and Video Processing (2024)

Abstract

Hand gesture segmentation is an important research topic in computer vision. Despite ongoing efforts, optimal gesture segmentation remains challenging due to factors such as varied gesture morphology and cluttered backgrounds. To address these challenges, we propose a novel hand gesture segmentation approach that combines the strengths of Convolutional Neural Networks (CNNs) for local feature extraction with Transformer networks for global feature integration. Specifically, we design two feature fusion modules. The first employs an attention mechanism to learn how to fuse the features extracted by the CNN and the Transformer. The second combines group convolution with activation functions to implement a gating mechanism, enhancing the response of crucial features while suppressing interference from weaker ones. Our method achieves mIoU scores of 93.53%, 97.25%, and 90.39% on the OUHANDS, HGR1, and EgoHands hand gesture datasets, respectively, outperforming state-of-the-art methods.
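The authors' implementation is not reproduced on this page. As a rough illustration of the two fusion modules described above, the following PyTorch sketch shows (a) an attention-based blend of CNN and Transformer feature maps and (b) a group-convolution gating unit. All class names, layer choices, and hyperparameters (channel count, reduction ratio, number of groups) are our assumptions for illustration, not the authors' published design.

```python
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Hypothetical sketch of the first module: learns per-channel
    weights to blend CNN (local) and Transformer (global) feature
    maps of matching shape."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global descriptor per channel
        self.mlp = nn.Sequential(
            nn.Conv2d(2 * channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_cnn: torch.Tensor, f_trans: torch.Tensor) -> torch.Tensor:
        # Derive a gate from the concatenated global descriptors, then
        # blend the two streams as a convex combination.
        w = self.mlp(self.pool(torch.cat([f_cnn, f_trans], dim=1)))
        return w * f_cnn + (1.0 - w) * f_trans


class GatedGroupConv(nn.Module):
    """Hypothetical sketch of the second module: a group convolution
    followed by a sigmoid acts as a gate that amplifies strong feature
    responses and suppresses weak ones."""

    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=groups),
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)  # element-wise gating of the fused features


if __name__ == "__main__":
    f_cnn = torch.randn(1, 64, 32, 32)    # local features from the CNN branch
    f_trans = torch.randn(1, 64, 32, 32)  # global features from the Transformer branch
    fused = AttentionFusion(64)(f_cnn, f_trans)
    out = GatedGroupConv(64)(fused)
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```

The convex combination w * f_cnn + (1 - w) * f_trans lets the network decide, channel by channel, how much local versus global evidence to trust, while the gating unit passes the fused features through a learned soft mask rather than an additional unconstrained transform.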


Data Availability

No datasets were generated or analysed during the current study.

Acknowledgements

We sincerely thank the editor and reviewers for their work on this manuscript. This work was partially supported by the Central Government Guided Local Funds for Science and Technology Development (No. 216Z0301G), the National Natural Science Foundation of China (No. 61379065), the Hebei Natural Science Foundation (No. F2023203012), the Science Research Project of the Hebei Education Department (No. QN2024010), and the Innovation Capability Improvement Plan Project of Hebei Province (No. 22567626H).

Funding

Central Government Guided Local Funds for Science and Technology Development (216Z0301G), National Natural Science Foundation of China (61379065), Hebei Natural Science Foundation (F2023203012), Science Research Project of Hebei Education Department (QN2024010), Innovation Capability Improvement Plan Project of Hebei Province (22567626H).

Author information

Contributions

SW wrote the main manuscript text and the code. NY prepared the data and figures. ML edited the manuscript. QT validated the manuscript. SZ directed the study. All authors reviewed the manuscript.

Corresponding author

Correspondence to Shihui Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, S., Yang, N., Liu, M. et al. Harmonizing local and global features: enhanced hand gesture segmentation using synergistic fusion of CNN and transformer networks. SIViP (2024). https://doi.org/10.1007/s11760-024-03255-5
