Auto-encoding score distribution regression for action quality assessment

  • Original Article
  • Neural Computing and Applications

Abstract

Assessing the quality of actions in videos is a challenging vision task because the relationship between videos and action scores is difficult to model. Action quality assessment (AQA) has therefore been studied extensively in the literature. Traditional AQA methods treat the problem as a regression task and learn the underlying mapping between videos and action scores. However, these approaches overlook the data uncertainty present in AQA datasets. To address this aleatoric uncertainty, we develop a plug-and-play module called the distribution auto-encoder (DAE). DAE encodes videos into distributions and uses the reparameterization trick to sample scores, which enables a more accurate mapping between videos and scores. Additionally, we use a likelihood loss to learn the uncertainty parameters. We evaluate our approach on publicly available datasets, and extensive experiments demonstrate that DAE achieves state-of-the-art performance, with Spearman’s rank correlations of 82.58%, 92.32%, and 76.00% on the AQA-7, MTL-AQA, and JIGSAWS datasets, respectively. Plug-and-play experiments further demonstrate the extensibility of DAE. Our code is available at https://github.com/InfoX-SEU/DAE-AQA.
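
The two ingredients named in the abstract, sampling a score with the reparameterization trick and fitting the uncertainty parameters with a likelihood loss, can be illustrated with a minimal sketch. It assumes a Gaussian score distribution (one possible choice) and uses fixed, made-up values for the mean and scale; in an actual model these would be predicted per video by a network and trained in a differentiable framework. The function names `reparameterize` and `gaussian_nll` are hypothetical and are not taken from the authors' repository.

```python
import math
import random


def reparameterize(mu, sigma, rng):
    """Sample a score as mu + sigma * eps, with eps drawn from N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps


def gaussian_nll(score, mu, sigma, min_sigma=1e-6):
    """Negative log-likelihood of `score` under N(mu, sigma^2)."""
    # Clamp the variance so the logarithm and the division stay finite.
    var = max(sigma * sigma, min_sigma * min_sigma)
    return 0.5 * math.log(2.0 * math.pi * var) + (score - mu) ** 2 / (2.0 * var)


# Assumed encoder outputs for a single video (illustrative values only).
mu, sigma = 85.0, 2.5

rng = random.Random(0)
samples = [reparameterize(mu, sigma, rng) for _ in range(10000)]
sample_mean = sum(samples) / len(samples)

# The loss is smallest when the label coincides with the predicted mean
# and grows as the label moves away from it.
nll_at_mean = gaussian_nll(mu, mu, sigma)
nll_far = gaussian_nll(mu + 10.0, mu, sigma)
```

Because the randomness enters only through `eps`, the sampled score is a deterministic function of `mu` and `sigma`, which is what allows gradients to reach both parameters when the same computation is written in an automatic-differentiation library.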

Availability of data and materials

Publicly accessible data have been used by the authors.

Code availability

Most of the code has been open-sourced at https://github.com/InfoX-SEU/DAE-AQA; the rest can be obtained upon reasonable request.

Notes

  1. Note that the Gaussian distribution is just one choice and not a limitation in our method.

  2. \(N \ge k +1\).

  3. Whether the variance is positive does not affect the reparameterization trick. Here, the variance is shown in absolute value.
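
A short worked step may make Note 3 concrete. It assumes the Gaussian case of Note 1 and reads the quantity in Note 3 as the scale output \(\sigma\) of the encoder; the symbols \(s\), \(\mu\), \(\sigma\), and \(\epsilon\) are illustrative notation and may differ from the paper's.

\[
s = \mu + \sigma\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0,1)
\;\Longrightarrow\; s \sim \mathcal{N}(\mu,\sigma^{2}).
\]

\[
\mu + (-\sigma)\,\epsilon = \mu + \sigma\,(-\epsilon), \qquad -\epsilon \sim \mathcal{N}(0,1)
\;\Longrightarrow\; \mu + (-\sigma)\,\epsilon \sim \mathcal{N}(\mu,\sigma^{2}).
\]

Since the standard normal is symmetric about zero, flipping the sign of \(\sigma\) leaves the distribution of the sampled score unchanged, so only \(|\sigma|\) matters.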

Funding

This research was funded by the Zhi Shan Young Scholar Program of Southeast University.

Author information

Contributions

BZ and JC contributed equally to the manuscript preparation.

Corresponding author

Correspondence to Yinfei Xu.

Ethics declarations

Conflict of interest

The authors certify that there is no conflict of interest in the subject matter discussed in this manuscript.

Ethics approval

The work conducted is not plagiarized. No one has been harmed in this work.

Consent to participate

All the authors have given consent to submit the manuscript.

Consent for publication

Authors provide their consent for the publication.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Time cost

We present a more comprehensive comparative experiment for various methods on the MTL-AQA dataset, as shown in Table 7. The inference time is divided into the cost of extracting video features and the cost of mapping features to scores.

Table 7 Comparison of computational cost on MTL-AQA
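
The split of inference time into a feature-extraction cost and a feature-to-score cost can be measured along the following lines. This is only a sketch: `extract_features` and `map_to_score` are hypothetical toy stand-ins for a video backbone and a score head, and the "video" is synthetic data, so the timings it produces say nothing about the methods compared in Table 7.

```python
import time


def extract_features(video):
    """Hypothetical stand-in for a video backbone: one toy feature per frame."""
    return [sum(frame) / len(frame) for frame in video]


def map_to_score(features):
    """Hypothetical stand-in for the feature-to-score head."""
    return sum(features) / len(features)


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Synthetic "video": 16 frames with 64 values each.
video = [[float(i + j) for j in range(64)] for i in range(16)]

features, t_extract = timed(extract_features, video)
score, t_map = timed(map_to_score, features)
t_total = t_extract + t_map
```

Timing the two stages separately makes it possible to report how much of the total cost is due to the shared backbone and how much to the scoring module itself.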

Appendix B: Case study

We conduct more case studies here. Figures 9, 10, 11, and 12 illustrate that DAE captures the uncertainty in video quality assessment datasets well. The uncertainty of each case is learned adaptively from the case itself. Specifically, for an excellent movement, the variance of the evaluation is small and the judges’ scores are nearly stable, whereas for low-quality movements the spread of the scores increases and the reliability decreases accordingly.

Fig. 9 Case study on AQA-7 (Gym Vault) dataset. Each row indicates four frames of a video corresponding to its prediction distribution

Fig. 10 Case study on MTL-AQA dataset. Each row indicates four frames of a video corresponding to its prediction distribution

Fig. 11 Case study on JIGSAWS (KT) dataset. Each row indicates four frames of a video corresponding to its prediction distribution

Fig. 12 Case study on JIGSAWS (NP) dataset. Each row indicates four frames of a video corresponding to its prediction distribution

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, B., Chen, J., Xu, Y. et al. Auto-encoding score distribution regression for action quality assessment. Neural Comput & Applic 36, 929–942 (2024). https://doi.org/10.1007/s00521-023-09068-w

  • DOI: https://doi.org/10.1007/s00521-023-09068-w
