Adaptive Computationally Efficient Network for Monocular 3D Hand Pose Estimation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12349)

Abstract

3D hand pose estimation is an important task for a wide range of real-world applications. Existing work in this domain mainly focuses on designing advanced algorithms to achieve high pose estimation accuracy. Besides accuracy, however, computational efficiency, which affects both computation speed and power consumption, is also crucial for real-world applications. In this paper, we investigate the problem of reducing the overall computation cost while maintaining high accuracy for 3D hand pose estimation from video sequences. We propose a novel model, the Adaptive Computationally Efficient (ACE) network, which uses a Gaussian-kernel-based Gate Module to dynamically switch the computation between a light model and a heavy network for feature extraction. Our model employs the light model to compute efficient features for most frames and invokes the heavy model only when necessary. Combined with temporal context, the proposed model accurately estimates the 3D hand pose. We evaluate our model on two publicly available datasets and achieve state-of-the-art performance at 22% of the computation cost of traditional temporal models.
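
The abstract does not spell out how the Gate Module is formulated, so the following is only a rough sketch of the general idea of gating between a light and a heavy feature extractor. Everything in it is an assumption made for illustration: the functions `light_model`, `heavy_model`, `gaussian_gate`, `run_sequence` and `relative_cost`, the parameters `sigma` and `threshold`, the hard-threshold decision rule, and the cost model are hypothetical and are not taken from the paper.

```python
import math


def gaussian_gate(feat, ref, sigma=1.0):
    """Gaussian kernel similarity in (0, 1]; 1.0 means identical features."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(feat, ref))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def light_model(frame):
    # Hypothetical cheap extractor: a coarse two-number summary of the frame.
    return [sum(frame) / len(frame), max(frame) - min(frame)]


def heavy_model(frame):
    # Hypothetical expensive extractor: a stand-in for a deep network.
    return [x * x for x in frame]


def run_sequence(frames, sigma=1.0, threshold=0.5):
    """Return (light_feat, heavy_feat, used_heavy) for every frame.

    Assumed rule: the light model runs on every frame. The heavy model is
    invoked only when the light features have drifted away from those of
    the last frame on which the heavy model ran (gate value below threshold).
    """
    outputs = []
    ref_light = None   # light features at the last heavy invocation
    heavy_feat = None  # most recent heavy features, reused in between
    for frame in frames:
        light_feat = light_model(frame)
        if ref_light is None:
            gate = 0.0  # no reference yet, so force the heavy model
        else:
            gate = gaussian_gate(light_feat, ref_light, sigma)
        use_heavy = gate < threshold
        if use_heavy:
            heavy_feat = heavy_model(frame)
            ref_light = light_feat
        outputs.append((light_feat, heavy_feat, use_heavy))
    return outputs


def relative_cost(used_heavy, light_cost, heavy_cost):
    """Average cost per frame relative to running the heavy model on every frame."""
    n = len(used_heavy)
    total = n * light_cost + sum(used_heavy) * heavy_cost
    return total / (n * heavy_cost)


# Toy sequence: three identical frames, then an abrupt change.
frames = [[0.0, 1.0, 2.0]] * 3 + [[10.0, 20.0, 30.0]] * 2
flags = [used for _, _, used in run_sequence(frames)]
ratio = relative_cost(flags, light_cost=1.0, heavy_cost=10.0)
```

In this toy setting the heavy extractor runs only on the first frame and on the frame where the content changes abruptly, which is the kind of behaviour the abstract describes. The cost ratio is meaningful only under the made-up per-frame costs passed to `relative_cost` and says nothing about the 22% figure reported for the actual model.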

Keywords

3D hand pose estimation; Computation efficiency; Dynamic adaption; Gaussian gate

Notes

Acknowledgements

This work is partially supported by the National Institutes of Health under Grant R01CA214085 as well as SUTD Projects PIE-SGP-Al-2020-02 and SRG-ISTD-2020-153.

Supplementary material

Supplementary material 1 (pdf 127 KB)

Supplementary material 2 (mp4 19311 KB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Tandon School of Engineering, New York University, Brooklyn, USA
  2. Information Systems Technology and Design Pillar, Singapore University of Technology and Design, Singapore, Singapore
