Abstract
Assessing the quality of actions in videos is a challenging vision task because the relationship between videos and action scores is difficult to model. Action quality assessment (AQA) has therefore been studied extensively in the literature. Traditional AQA methods treat the problem as a regression task and learn a mapping from videos to action scores, but they overlook the data uncertainty present in AQA datasets. To address this aleatoric uncertainty, we develop a plug-and-play module called the distribution auto-encoder (DAE). DAE encodes videos into distributions and uses the reparameterization trick to sample scores, which yields a more accurate mapping between videos and scores. We additionally use a likelihood loss to learn the uncertainty parameters. Extensive experiments on publicly available datasets show that DAE achieves state-of-the-art performance, with Spearman’s rank correlations of 82.58%, 92.32%, and 76.00% on the AQA-7, MTL-AQA, and JIGSAWS datasets, respectively. Plug-and-play experiments further demonstrate the extensibility of DAE. Our code is available at https://github.com/InfoX-SEU/DAE-AQA.
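The pipeline summarized in the abstract (encode to a distribution, sample a score with the reparameterization trick, train with a likelihood loss) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear `encode` function, the log-variance parameterization, and all names below are assumptions made for illustration, and the Gaussian is only one possible choice of distribution.

```python
import numpy as np

def encode(features, w_mu, w_log_var):
    """Hypothetical encoder: linear maps from video features to (mu, log_var)."""
    return features @ w_mu, features @ w_log_var

def reparameterize(mu, log_var, rng):
    """Sample score = mu + sigma * eps with eps ~ N(0, 1).

    The randomness lives in eps, so the sample is a deterministic,
    differentiable function of mu and log_var.
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def gaussian_nll(target, mu, log_var):
    """Mean negative log-likelihood of target under N(mu, exp(log_var))."""
    var = np.exp(log_var)
    per_item = 0.5 * (np.log(2 * np.pi) + log_var + (target - mu) ** 2 / var)
    return float(np.mean(per_item))

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8))        # 4 videos, 8-dim features (made up)
w_mu = rng.standard_normal(8)                 # assumed encoder weights
w_log_var = 0.1 * rng.standard_normal(8)
true_scores = rng.standard_normal(4)          # placeholder ground-truth scores

mu, log_var = encode(features, w_mu, w_log_var)
scores = reparameterize(mu, log_var, rng)     # one sampled score per video
loss = gaussian_nll(true_scores, mu, log_var) # likelihood loss on the labels
print(scores, loss)
```

In a real model the encoder would be a trained network on top of a video backbone, and the loss would be minimized with automatic differentiation rather than evaluated once.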
Availability of data and materials
Publicly accessible data have been used by the authors.
Code availability
Most of the code has been open-sourced at https://github.com/InfoX-SEU/DAE-AQA; the rest can be obtained upon reasonable request.
Notes
Note that the Gaussian distribution is just one choice and not a limitation in our method.
\(N \ge k +1\).
Whether the predicted variance is positive does not affect the reparameterization trick. Here, the variance is shown as its absolute value.
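The last note can be checked in one line, assuming the standard Gaussian reparameterization and reading the note's "variance" as a scale parameter \(\sigma\) predicted by the network that may come out negative (the symbols \(\mu\), \(\sigma\), and \(\epsilon\) are our notation and are not taken from the paper):

\[
\epsilon \sim \mathcal{N}(0,1) \;\Rightarrow\; -\epsilon \sim \mathcal{N}(0,1), \qquad \text{so} \qquad \mu + \sigma\epsilon \;\overset{d}{=}\; \mu + |\sigma|\,\epsilon \;\sim\; \mathcal{N}(\mu, \sigma^{2}).
\]

Because the standard normal is symmetric about zero, the sign of \(\sigma\) changes the individual sample but not the distribution being sampled.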
Funding
This research was funded by the Zhi Shan Young Scholar Program of Southeast University.
Author information
Contributions
BZ and JC contributed equally to the preparation of the manuscript.
Ethics declarations
Conflict of interest
The authors certify that there is no conflict of interest in the subject matter discussed in this manuscript.
Ethics approval
The work conducted is not plagiarized. No one has been harmed in this work.
Consent to participate
All the authors have given consent to submit the manuscript.
Consent for publication
The authors provide their consent for publication.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Time cost
We present a more comprehensive comparative experiment for various methods on the MTL-AQA dataset, as shown in Table 7. The inference time is divided into the cost of extracting video features and the cost of mapping features to scores.
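The two-part breakdown of inference time described above can be measured along the following lines. This is only a sketch: `extract_features` and `map_to_score` are hypothetical stand-ins for a video backbone and a score head, and the toy "video" is a list of numeric frames rather than real data.

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def extract_features(video):
    """Hypothetical stand-in for the feature extractor: one value per frame."""
    return [sum(frame) / len(frame) for frame in video]

def map_to_score(features):
    """Hypothetical stand-in for the feature-to-score mapping."""
    return sum(features) / len(features)

# Toy input: 32 "frames" of 16 values each (made-up data).
video = [[float(i + j) for j in range(16)] for i in range(32)]

features, t_extract = timed(extract_features, video)
score, t_map = timed(map_to_score, features)

print(f"feature extraction: {t_extract:.6f} s")
print(f"feature-to-score mapping: {t_map:.6f} s")
print(f"total inference: {t_extract + t_map:.6f} s")
```

For stable numbers, each stage would normally be run many times and averaged, with warm-up runs excluded.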
Appendix B: Case study
We present additional case studies here. Figures 9, 10, 11, and 12 illustrate that DAE captures the uncertainty in video quality assessment datasets well. The uncertainty of each case is learned adaptively from the case itself. Specifically, for an excellent movement, the variance of the evaluation is small and the judges’ scores are close and stable, whereas for low-quality movements the spread of the scores increases and their reliability decreases accordingly.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, B., Chen, J., Xu, Y. et al. Auto-encoding score distribution regression for action quality assessment. Neural Comput & Applic 36, 929–942 (2024). https://doi.org/10.1007/s00521-023-09068-w
DOI: https://doi.org/10.1007/s00521-023-09068-w