Auto-encoding score distribution regression for action quality assessment

Zhang, Boyu; Chen, Jiayuan; Xu, Yinfei; Zhang, Hui; Yang, Xu; Geng, Xin

doi:10.1007/s00521-023-09068-w

Original Article
Published: 03 October 2023

Volume 36, pages 929–942 (2024)

Neural Computing and Applications

Boyu Zhang¹ (equal contribution),
Jiayuan Chen¹ (equal contribution),
Yinfei Xu² (ORCID: orcid.org/0000-0003-3191-6903),
Hui Zhang³,
Xu Yang¹,⁴ &
Xin Geng¹,⁴


Abstract

Assessing the quality of actions in videos is a challenging vision task because the relationship between videos and action scores is difficult to model. Action quality assessment (AQA) has therefore been studied extensively. Traditional AQA methods treat the problem as a regression task and learn the underlying mapping between videos and action scores. However, these approaches overlook the data uncertainty present in AQA datasets. To address this aleatoric uncertainty, we develop a plug-and-play module called the distribution auto-encoder (DAE). DAE encodes videos into distributions and uses the reparameterization trick to sample scores, which yields a more accurate mapping between videos and scores. In addition, we use a likelihood loss to learn the uncertainty parameters. We evaluate our approach on publicly available datasets, and extensive experiments show that DAE achieves state-of-the-art performance, with Spearman’s correlation of 82.58%, 92.32%, and 76.00% on the AQA-7, MTL-AQA, and JIGSAWS datasets, respectively. Plug-and-play experiments further demonstrate the extensibility of DAE. Our code is available at https://github.com/InfoX-SEU/DAE-AQA.
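The pipeline described in the abstract (encode to a distribution, sample a score with the reparameterization trick, train with a likelihood loss) can be illustrated with a minimal sketch. This is not the authors' implementation: the linear `encode` function, the weights, the log-variance parameterization, and the Gaussian form of the likelihood are all assumptions made for illustration, and the feature vector is a toy stand-in for real video features.

```python
import math
import random


def encode(feature, w_mu, w_log_var):
    """Hypothetical linear encoder: maps a feature vector to (mu, log_var)."""
    mu = sum(f * w for f, w in zip(feature, w_mu))
    log_var = sum(f * w for f, w in zip(feature, w_log_var))
    return mu, log_var


def sample_score(mu, log_var, rng):
    """Reparameterization trick: score = mu + sigma * eps, with eps ~ N(0, 1)."""
    sigma = math.exp(0.5 * log_var)
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps


def gaussian_nll(target, mu, log_var):
    """Negative log-likelihood of target under N(mu, exp(log_var))."""
    return 0.5 * (math.log(2.0 * math.pi) + log_var
                  + (target - mu) ** 2 / math.exp(log_var))


rng = random.Random(0)
feature = [0.5, -1.0, 2.0]        # toy stand-in for a video feature vector
w_mu = [1.0, 0.5, 0.25]           # illustrative weights, not learned
w_log_var = [0.0, 0.1, -0.2]      # illustrative weights, not learned

mu, log_var = encode(feature, w_mu, w_log_var)
scores = [sample_score(mu, log_var, rng) for _ in range(5)]
loss = gaussian_nll(target=0.6, mu=mu, log_var=log_var)
```

Because the randomness enters only through `eps`, the sampled score is a deterministic function of `mu` and `log_var`. In a framework with automatic differentiation, that is what lets gradients of the loss reach both parameters.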


Availability of data and materials

Publicly accessible data have been used by the authors.

Code availability

Most of the code has been open sourced at https://github.com/InfoX-SEU/DAE-AQA; the rest can be obtained upon reasonable request.

Notes

Note that the Gaussian distribution is just one choice and not a limitation in our method.
\(N \ge k +1\).
Whether the variance is positive does not affect the reparameterization trick. Here, the variance is shown as an absolute value.
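The last note can be checked with a short derivation. As a sketch, assume the sampled score takes the standard reparameterized form with a Gaussian noise term, where \(\sigma\) is the raw scale output of the encoder and may be negative; the symbols \(s\), \(\mu\), \(\sigma\), and \(\epsilon\) are notation assumed here rather than taken from the article.

```latex
\begin{aligned}
s &= \mu + \sigma\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1),\\
\mathbb{E}[s] &= \mu, \qquad \operatorname{Var}[s] = \sigma^{2} = |\sigma|^{2},\\
\mu + \sigma\,\epsilon &\overset{d}{=} \mu + |\sigma|\,\epsilon
\quad \text{because } -\epsilon \overset{d}{=} \epsilon .
\end{aligned}
```

Under this assumption the sign of \(\sigma\) changes neither the mean nor the variance of the sampled score, so reporting the absolute value loses no information.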


Funding

This research was funded by the Zhi Shan Young Scholar Program of Southeast University.

Author information

Boyu Zhang and Jiayuan Chen have contributed equally to this work.

Authors and Affiliations

School of Computer Science and Engineering, Southeast University, Nanjing, 211189, China
Boyu Zhang, Jiayuan Chen, Xu Yang & Xin Geng
School of Information Science and Engineering, Southeast University, Nanjing, 210096, China
Yinfei Xu
Inspur Academy of Science and Technology, Jinan, 250000, China
Hui Zhang
Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, Nanjing, 211189, China
Xu Yang & Xin Geng


Contributions

BZ and JC contributed equally to the preparation of the manuscript.

Corresponding author

Correspondence to Yinfei Xu.

Ethics declarations

Conflict of interest

The authors certify that there is no conflict of interest in the subject matter discussed in this manuscript.

Ethics approval

The work conducted is not plagiarized. No one has been harmed in this work.

Consent to participate

All the authors have given consent to submit the manuscript.

Consent for publication

Authors provide their consent for the publication.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Time cost

We present a more comprehensive comparative experiment for various methods on the MTL-AQA dataset, as shown in Table 7. The inference time is divided into the cost of extracting video features and the cost of mapping features to scores.

Table 7 Comparison of computational cost on MTL-AQA
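The two-part split of inference time described above can be measured with a small timing harness. The sketch below is only an illustration of that split: `extract_features` and `map_to_score` are hypothetical stand-ins for a video backbone and a score head, and the toy input is not real video data.

```python
import time


def extract_features(video):
    """Hypothetical stand-in for a video backbone: averages each frame."""
    return [sum(frame) / len(frame) for frame in video]


def map_to_score(features):
    """Hypothetical stand-in for the feature-to-score head."""
    return sum(features) / len(features)


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# 16 toy "frames" of 64 values each.
video = [[float(i + j) for j in range(64)] for i in range(16)]

features, t_extract = timed(extract_features, video)
score, t_map = timed(map_to_score, features)

timing = {
    "feature_extraction_s": t_extract,
    "score_mapping_s": t_map,
    "total_s": t_extract + t_map,
}
```

In practice each stage would be timed over many videos and averaged, since a single call is dominated by warm-up and measurement noise.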


Appendix B: Case study

We present further case studies here. Figures 9, 10, 11, and 12 illustrate that DAE captures the uncertainty in video quality assessment datasets well. The uncertainty of each case is learned adaptively from the case itself. Specifically, for an excellent movement the variance of the evaluation is small and the judges’ scores are stable, whereas for a low-quality movement the spread of the scores increases and their reliability decreases accordingly.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Zhang, B., Chen, J., Xu, Y. et al. Auto-encoding score distribution regression for action quality assessment. Neural Comput & Applic 36, 929–942 (2024). https://doi.org/10.1007/s00521-023-09068-w


Received: 10 September 2022
Accepted: 14 September 2023
Published: 03 October 2023
Issue Date: January 2024
DOI: https://doi.org/10.1007/s00521-023-09068-w
