Skip to main content

Domain Knowledge-Informed Self-supervised Representations for Workout Form Assessment

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13698))

Included in the following conference series:

Abstract

Maintaining proper form while exercising is important for preventing injuries and maximizing muscle mass gains. Detecting errors in workout form naturally requires estimating human’s body pose. However, off-the-shelf pose estimators struggle to perform well on the videos recorded in gym scenarios due to factors such as camera angles, occlusion from gym equipment, illumination, and clothing. To aggravate the problem, the errors to be detected in the workouts are very subtle. To that end, we propose to learn exercise-oriented image and video representations from unlabeled samples such that a small dataset annotated by experts suffices for supervised error detection. In particular, our domain knowledge-informed self-supervised approaches (pose contrastive learning and motion disentangling) exploit the harmonic motion of the exercise actions, and capitalize on the large variances in camera angles, clothes, and illumination to learn powerful representations. To facilitate our self-supervised pretraining, and supervised finetuning, we curated a new exercise dataset, Fitness-AQA (https://github.com/ParitoshParmar/Fitness-AQA), comprising of three exercises: BackSquat, BarbellRow, and OverheadPress. It has been annotated by expert trainers for multiple crucial and typically occurring exercise errors. Experimental results show that our self-supervised representations outperform off-the-shelf 2D- and 3D-pose estimators and several other baselines. We also show that our approaches can be applied to other domains/tasks such as pose estimation and dive quality assessment.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 89.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 119.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Benaim, S., et al.: SpeedNet: learning the speediness in videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9922–9931 (2020)

    Google Scholar 

  2. Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: OpenPose: realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 43(1), 172–186 (2019)

    Article  Google Scholar 

  3. Chen, S., Yang, R.R.: Pose trainer: correcting exercise posture using pose estimation. arXiv preprint arXiv:2006.11718 (2020)

  4. Chen, X., Pang, A., Yang, W., Ma, Y., Xu, L., Yu, J.: SportsCap: monocular 3D human motion capture and fine-grained understanding in challenging sports videos. arXiv preprint arXiv:2104.11452 (2021)

  5. Chen, X., He, K.: Exploring simple siamese representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758 (2021)

    Google Scholar 

  6. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 1, pp. 539–546. IEEE (2005)

    Google Scholar 

  7. Doughty, H., Mayol-Cuevas, W., Damen, D.: The pros and cons: rank-aware temporal attention for skill determination in long videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7862–7871 (2019)

    Google Scholar 

  8. Du, C., Graham, S., Depp, C., Nguyen, T.: Assessing physical rehabilitation exercises using graph convolutional network with self-supervised regularization. In: 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pp. 281–285. IEEE (2021)

    Google Scholar 

  9. Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 (2018)

  10. Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and imagenet? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6546–6555 (2018)

    Google Scholar 

  11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

    Google Scholar 

  12. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)

    Article  MathSciNet  Google Scholar 

  13. Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: Feragen, A., Pelillo, M., Loog, M. (eds.) SIMBAD 2015. LNCS, vol. 9370, pp. 84–92. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24261-3_7

    Chapter  Google Scholar 

  14. Honari, S., Constantin, V., Rhodin, H., Salzmann, M., Fua, P.: Unsupervised learning on monocular videos for 3D human pose estimation. arXiv preprint arXiv:2012.01511 (2020)

  15. Hyvarinen, A., Morioka, H.: Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Adv. Neural. Inf. Process. Syst. 29, 3765–3773 (2016)

    Google Scholar 

  16. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6m: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 36(7), 1325–1339 (2014)

    Google Scholar 

  17. Jenni, S., Meishvili, G., Favaro, P.: Video representation learning by recognizing temporal transformations. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12373, pp. 425–442. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58604-1_26

    Chapter  Google Scholar 

  18. Jing, L., Yang, X., Liu, J., Tian, Y.: Self-supervised spatiotemporal feature learning via video rotation prediction. arXiv preprint arXiv:1811.11387 (2018)

  19. Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: Computer Vision and Pattern Regognition (CVPR) (2018)

    Google Scholar 

  20. Kay, W., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)

  21. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  22. Kolotouros, N., Pavlakos, G., Black, M.J., Daniilidis, K.: Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In: ICCV (2019)

    Google Scholar 

  23. Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: CVPR 2011, pp. 3361–3368. IEEE (2011)

    Google Scholar 

  24. Li, J., Bhat, A., Barmaki, R.: Improving the movement synchrony estimation with action quality assessment in children play therapy. In: Proceedings of the 2021 International Conference on Multimodal Interaction, pp. 397–406 (2021)

    Google Scholar 

  25. Liu, D., et al.: Towards unified surgical skill assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9522–9531 (2021)

    Google Scholar 

  26. Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: unsupervised learning using temporal order verification. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 527–544. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_32

    Chapter  Google Scholar 

  27. Ogata, R., Simo-Serra, E., Iizuka, S., Ishikawa, H.: Temporal distance matrices for squat classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2019)

    Google Scholar 

  28. Pan, J.H., Gao, J., Zheng, W.S.: Action assessment by joint relation graphs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019

    Google Scholar 

  29. Park, T., et al.: Swapping autoencoder for deep image manipulation. Adv. Neural. Inf. Process. Syst. 33, 7198–7211 (2020)

    Google Scholar 

  30. Parmar, P., Morris, B.: Action quality assessment across multiple actions. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1468–1476. IEEE (2019)

    Google Scholar 

  31. Parmar, P., Morris, B.T.: Measuring the quality of exercises. In: 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 2241–2244. IEEE (2016)

    Google Scholar 

  32. Parmar, P., Reddy, J., Morris, B.: Piano skills assessment. arXiv preprint arXiv:2101.04884 (2021)

  33. Parmar, P., Tran Morris, B.: Learning to score olympic events. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 20–28 (2017)

    Google Scholar 

  34. Parmar, P., Tran Morris, B.: What and how well you performed? A multitask learning approach to action quality assessment. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 304–313 (2019)

    Google Scholar 

  35. Pirsiavash, H., Vondrick, C., Torralba, A.: Assessing the quality of actions. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 556–571. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_36

    Chapter  Google Scholar 

  36. Redmon, J., Farhadi, A.: Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)

  37. Rhodin, H., Salzmann, M., Fua, P.: Unsupervised geometry-aware representation for 3D human pose estimation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 750–767 (2018)

    Google Scholar 

  38. Roditakis, K., Makris, A., Argyros, A.: Towards improved and interpretable action quality assessment with self-supervised alignment. In: The 14th PErvasive Technologies Related to Assistive Environments Conference, pp. 507–513 (2021)

    Google Scholar 

  39. Russakovsky, O., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015)

    Article  MathSciNet  Google Scholar 

  40. Ryali, C.K., Schwab, D.J., Morcos, A.S.: Characterizing and improving the robustness of self-supervised learning through background augmentations. arXiv preprint arXiv:2103.12719 (2021)

  41. Sardari, F., Paiement, A., Hannuna, S., Mirmehdi, M.: VI-Net-view-invariant quality of human movement assessment. Sensors 20(18), 5258 (2020)

    Article  Google Scholar 

  42. Sermanet, P., et al.: Time-contrastive networks: self-supervised learning from video. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1134–1141. IEEE (2018)

    Google Scholar 

  43. Sigurdsson, G.A., Gupta, A., Schmid, C., Farhadi, A., Alahari, K.: Actor and observer: joint modeling of first and third-person videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7396–7404 (2018)

    Google Scholar 

  44. Tang, Y., et al.: Uncertainty-aware score distribution learning for action quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9839–9848 (2020)

    Google Scholar 

  45. Tao, L., et al.: A comparative study of pose representation and dynamics modelling for online motion quality assessment. Comput. Vis. Image Underst. 148, 136–152 (2016)

    Article  Google Scholar 

  46. Wang, J., Jiao, J., Bao, L., He, S., Liu, Y., Liu, W.: Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In: CVPR, pp. 4006–4015 (2019)

    Google Scholar 

  47. Wang, J., Jiao, J., Liu, Y.-H.: Self-supervised video representation learning by pace prediction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12362, pp. 504–521. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58520-4_30

    Chapter  Google Scholar 

  48. Wang, T., Wang, Y., Li, M.: Towards accurate and interpretable surgical skill assessment: a video-based method incorporating recognized surgical gestures and skill levels. In: Martel, A.L., et al. (eds.) MICCAI 2020. LNCS, vol. 12263, pp. 668–678. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59716-0_64

    Chapter  Google Scholar 

  49. Xu, C., Fu, Y., Zhang, B., Chen, Z., Jiang, Y.G., Xue, X.: Learning to score figure skating sport videos. IEEE Trans. Circuits Syst. Video Technol. 30(12), 4578–4590 (2019)

    Article  Google Scholar 

  50. Xu, D., Xiao, J., Zhao, Z., Shao, J., Xie, D., Zhuang, Y.: Self-supervised spatiotemporal learning via video clip order prediction. In: Computer Vision and Pattern Recognition (CVPR) (2019)

    Google Scholar 

  51. Yu, X., Rao, Y., Zhao, W., Lu, J., Zhou, J.: Group-aware contrastive regression for action quality assessment. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7919–7928 (2021)

    Google Scholar 

  52. Zeng, L.A., et al.: Hybrid dynamic-static context-aware attention network for action assessment in long videos. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 2526–2534 (2020)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Paritosh Parmar .

Editor information

Editors and Affiliations

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3311 KB)

Rights and permissions

Reprints and permissions

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Parmar, P., Gharat, A., Rhodin, H. (2022). Domain Knowledge-Informed Self-supervised Representations for Workout Form Assessment. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13698. Springer, Cham. https://doi.org/10.1007/978-3-031-19839-7_7

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-19839-7_7

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19838-0

  • Online ISBN: 978-3-031-19839-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics