Action Quality Assessment with Temporal Parsing Transformer

Bai, Yang; Zhou, Desen; Zhang, Songyang; Wang, Jian; Ding, Errui; Guan, Yu; Long, Yang; Wang, Jingdong

doi:10.1007/978-3-031-19772-7_25

Yang Bai^12,13,
Desen Zhou¹³,
Songyang Zhang¹⁴,
Jian Wang¹³,
Errui Ding¹³,
Yu Guan¹⁵,
Yang Long¹² &
…
Jingdong Wang¹³

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13664))

Included in the following conference series:

European Conference on Computer Vision

2474 Accesses
11 Citations

Abstract

Action Quality Assessment(AQA) is important for action understanding and resolving the task poses unique challenges due to subtle visual differences. Existing state-of-the-art methods typically rely on the holistic video representations for score regression or ranking, which limits the generalization to capture fine-grained intra-class variation. To overcome the above limitation, we propose a temporal parsing transformer to decompose the holistic feature into temporal part-level representations. Specifically, we utilize a set of learnable queries to represent the atomic temporal patterns for a specific action. Our decoding process converts the frame representations to a fixed number of temporally ordered part representations. To obtain the quality score, we adopt the state-of-the-art contrastive regression based on the part representations. Since existing AQA datasets do not provide temporal part-level labels or partitions, we propose two novel loss functions on the cross attention responses of the decoder: a ranking loss to ensure the learnable queries to satisfy the temporal order in cross attention and a sparsity loss to encourage the part representations to be more discriminative. Extensive experiments show that our proposed method outperforms prior work on three public AQA benchmarks by a considerable margin.

Y. Bai and D. Zhou—Equal contribution.

Y. Bai—Work done when Yang Bai was a research intern at VIS, Baidu.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
We note that it might be better to weight each part. However, part weighting does not provide improvements during our practice. We guess that it may be during the self-attention process in the decoder, the relations between parts have already been taken into account.

References

Alayrac, J.B., Laptev, I., Sivic, J., Lacoste-Julien, S.: Joint discovery of object states and manipulation actions. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2127–2136 (2017)
Google Scholar
Bertasius, G., Soo Park, H., Yu, S.X., Shi, J.: Am i a baller? basketball performance assessment from first-person videos. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2177–2185 (2017)
Google Scholar
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Chapter Google Scholar
Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
Google Scholar
Doughty, H., Damen, D., Mayol-Cuevas, W.: Who’s better? who’s best? pairwise deep ranking for skill determination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6057–6066 (2018)
Google Scholar
Doughty, H., Mayol-Cuevas, W., Damen, D.: The pros and cons: rank-aware temporal attention for skill determination in long videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7862–7871 (2019)
Google Scholar
Gao, Y., et al.: Jhu-isi gesture and skill assessment working set (jigsaws): a surgical activity dataset for human motion modeling. In: MICCAI workshop: M2CAI, vol. 3, p. 3 (2014)
Google Scholar
Gordon, A.S.: Automated video assessment of human performance. In: Proceedings of AI-ED, vol. 2 (1995)
Google Scholar
Jug, M., Perš, J., Dežman, B., Kovačič, S.: Trajectory based assessment of coordinated human activity. In: International Conference on Computer Vision Systems, pp. 534–543. Springer (2003)
Google Scholar
Kuehne, H., Arslan, A., Serre, T.: The language of actions: recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014)
Google Scholar
Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017)
Google Scholar
Lei, P., Todorovic, S.: Temporal deformable residual networks for action segmentation in videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6742–6751 (2018)
Google Scholar
Li, J., Lei, P., Todorovic, S.: Weakly supervised energy-based learning for action segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6243–6251 (2019)
Google Scholar
Meng, D., et al.: Conditional DETR for fast training convergence. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3651–3660 (2021)
Google Scholar
Newell, A., Huang, Z., Deng, J.: Associative embedding: end-to-end learning for joint detection and grouping. In: Advances in Neural Information Processing systems 30 (2017)
Google Scholar
Pan, J.H., Gao, J., Zheng, W.S.: Action assessment by joint relation graphs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6331–6340 (2019)
Google Scholar
Parmar, P., Morris, B.: Action quality assessment across multiple actions. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1468–1476. IEEE (2019)
Google Scholar
Parmar, P., Morris, B.T.: What and how well you performed? a multitask learning approach to action quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 304–313 (2019)
Google Scholar
Parmar, P., Tran Morris, B.: Learning to score olympic events. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 20–28 (2017)
Google Scholar
Pirsiavash, H., Vondrick, C., Torralba, A.: Assessing the quality of actions. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 556–571. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_36
Chapter Google Scholar
Shao, D., Zhao, Y., Dai, B., Lin, D.: Intra-and inter-action understanding via temporal action parsing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 730–739 (2020)
Google Scholar
Tang, Y., et al.: Uncertainty-aware score distribution learning for action quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9839–9848 (2020)
Google Scholar
Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015)
Google Scholar
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30 (2017)
Google Scholar
Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of the IEEE International Conference On Computer Vision, pp. 3551–3558 (2013)
Google Scholar
Wang, L., et al.: Temporal segment networks: towards good practices for deep action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_2
Chapter Google Scholar
Wang, S., Yang, D., Zhai, P., Chen, C., Zhang, L.: TSA-NET: tube self-attention network for action quality assessment. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 4902–4910 (2021)
Google Scholar
Wang, Y., Zhang, X., Yang, T., Sun, J.: Anchor DETR: query design for transformer-based detector. arXiv preprint arXiv:2109.07107 (2021)
Xu, C., Fu, Y., Zhang, B., Chen, Z., Jiang, Y.G., Xue, X.: Learning to score figure skating sport videos. IEEE Trans. Circuits Syst. Video Technol. 30(12), 4578–4590 (2019)
Article Google Scholar
Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
Google Scholar
Yi, F., Wen, H., Jiang, T.: AsFormer: transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021)
Yu, X., Rao, Y., Zhao, W., Lu, J., Zhou, J.: Group-aware contrastive regression for action quality assessment. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7919–7928 (2021)
Google Scholar
Zhang, C., Gupta, A., Zisserman, A.: Temporal query networks for fine-grained video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4486–4496 (2021)
Google Scholar
Zhang, Q., Li, B.: Relative hidden Markov models for video-based evaluation of motion skills in surgical training. IEEE Trans. Pattern Anal. Mach. Intell. 37(6), 1206–1218 (2014)
Article Google Scholar
Čehovin Zajc, L.: A modular toolkit for visual tracking performance evaluation. SoftwareX 12, 100623 (2020). https://doi.org/10.1016/j.softx.2020.100623

Download references

Author information

Authors and Affiliations

Department of Computer Science, Durham University, Durham, UK
Yang Bai & Yang Long
Department of Computer Vision Technology (VIS), Baidu Inc., Beijing, China
Yang Bai, Desen Zhou, Jian Wang, Errui Ding & Jingdong Wang
Shanghai AI Laboratory, Shanghai, China
Songyang Zhang
University of Warwick, Coventry, UK
Yu Guan

Authors

Yang Bai
View author publications
You can also search for this author in PubMed Google Scholar
Desen Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Songyang Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Jian Wang
View author publications
You can also search for this author in PubMed Google Scholar
Errui Ding
View author publications
You can also search for this author in PubMed Google Scholar
Yu Guan
View author publications
You can also search for this author in PubMed Google Scholar
Yang Long
View author publications
You can also search for this author in PubMed Google Scholar
Jingdong Wang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jingdong Wang .

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1295 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Bai, Y. et al. (2022). Action Quality Assessment with Temporal Parsing Transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13664. Springer, Cham. https://doi.org/10.1007/978-3-031-19772-7_25

Download citation

DOI: https://doi.org/10.1007/978-3-031-19772-7_25
Published: 28 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19771-0
Online ISBN: 978-3-031-19772-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Action Quality Assessment with Temporal Parsing Transformer