
SportsCap: Monocular 3D Human Motion Capture and Fine-Grained Understanding in Challenging Sports Videos

Abstract

Markerless motion capture and understanding of professional, non-daily human movements is an important yet unsolved task; it suffers from complex motion patterns and severe self-occlusion, especially in the monocular setting. In this paper, we propose SportsCap—the first approach for simultaneously capturing 3D human motion and understanding fine-grained actions from challenging monocular sports video input. Our approach utilizes the semantic and temporally structured sub-motion prior in the embedding space for motion capture and understanding in a data-driven multi-task manner. To enable robust capture under complex motion patterns, we propose an effective motion embedding module that recovers both the implicit motion embedding and explicit 3D motion details via a corresponding mapping function as well as a sub-motion classifier. Based on such hybrid motion information, we introduce a multi-stream spatial-temporal graph convolutional network to predict the fine-grained semantic action attributes, and adopt a semantic attribute mapping block to assemble various correlated action attributes into a high-level action label for a detailed understanding of the whole sequence, enabling applications such as action assessment and motion scoring. Comprehensive experiments on both public and our proposed datasets show that, given challenging monocular sports video input, our novel approach not only significantly improves the accuracy of 3D human motion capture, but also recovers accurate fine-grained semantic action attributes.
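The motion embedding idea described above can be illustrated with a minimal sketch: a PCA-style linear embedding of flattened pose clips, a mapping function that decodes the implicit embedding back to explicit 3D motion, and a toy sub-motion classifier over the embedding. All shapes, names, and the linear model here are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

T, J = 16, 24          # frames per sub-motion clip, body joints (assumed values)
D = J * 3              # flattened 3D pose dimension per frame
K = 8                  # embedding dimensionality
C = 5                  # number of sub-motion classes

rng = np.random.default_rng(0)

def fit_embedding(clips):
    """PCA-style basis: rows of `clips` are flattened (T*D,) motion clips."""
    mean = clips.mean(axis=0)
    _, _, vt = np.linalg.svd(clips - mean, full_matrices=False)
    return mean, vt[:K]            # mean motion + top-K basis vectors

def encode(clip, mean, basis):
    """Project a clip into the implicit motion embedding (K,)."""
    return basis @ (clip - mean)

def decode(z, mean, basis):
    """Mapping function: embedding -> explicit 3D motion details (T*D,)."""
    return mean + basis.T @ z

def classify_submotion(z, W, b):
    """Toy linear sub-motion classifier with a softmax over the embedding."""
    logits = W @ z + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

clips = rng.normal(size=(200, T * D))   # synthetic stand-in for training clips
mean, basis = fit_embedding(clips)
z = encode(clips[0], mean, basis)
recon = decode(z, mean, basis)
probs = classify_submotion(z, rng.normal(size=(C, K)), np.zeros(C))
```

In the actual approach, the embedding, mapping function, and classifier are learned jointly in a multi-task manner rather than fit linearly as above; the sketch only shows how one latent code can simultaneously drive reconstruction and sub-motion classification.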





Acknowledgements

This work was supported by NSFC programs (61976138, 61977047), the National Key Research and Development Program (2018YFB2100500), STCSM (2015F0203-000-06) and SHMEC (2019-01-07-00-01-E00003).

Author information

Corresponding authors

Correspondence to Lan Xu or Jingyi Yu.

Additional information


Communicated by Manuel J. Marin-Jimenez.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (mp4 24178 KB)


About this article


Cite this article

Chen, X., Pang, A., Yang, W. et al. SportsCap: Monocular 3D Human Motion Capture and Fine-Grained Understanding in Challenging Sports Videos. Int J Comput Vis (2021). https://doi.org/10.1007/s11263-021-01486-4


Keywords

  • Human modeling
  • 3D motion capture
  • Motion understanding