
Automatic Modelling for Interactive Action Assessment

Published in: International Journal of Computer Vision

Abstract

Action assessment, the task of visually assessing how well an action is performed, has attracted much attention in recent years, with promising applications in areas such as medical treatment and sporting events. However, most existing methods mainly target actions performed by a single person; in particular, they neglect the asymmetric relations among agents (e.g., between persons and objects), which limits their performance on many non-individual actions. In this work, we formulate a framework that models asymmetric interactions among agents for action assessment, accounting for the subordinate relations among agents in many interactive actions. Specifically, we propose an asymmetric interaction learner consisting of an automatic assigner and an asymmetric interaction network search module. The automatic assigner automatically groups the agents within an action into a primary agent (e.g., a human) and secondary agents (e.g., objects); the asymmetric interaction network search module adaptively learns the asymmetric interactions between these agents. We conduct experiments on the JIGSAWS dataset of surgical actions and additionally collect two new datasets, TASD-2 and PaSk, for action assessment on interactive sporting actions. The results on these three datasets demonstrate that our framework achieves state-of-the-art performance. Extensive experiments on the AQA-7 dataset further indicate the robustness of our model in conventional action assessment settings.
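The key structural idea, that the primary-to-secondary relation is not the mirror image of the secondary-to-primary relation, can be sketched as a toy interaction step in which the two directions use different learned mappings. All names, feature shapes, and the random projections below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def asymmetric_interaction(primary, secondary, seed=0):
    """Toy sketch of an asymmetric primary/secondary interaction step.

    primary:   (d,)   feature of the primary agent (e.g., the athlete)
    secondary: (n, d) features of the secondary agents (e.g., objects)

    The two directions use *different* projections, so the influence of
    primary on secondary is not the transpose of the reverse direction.
    """
    rng = np.random.default_rng(seed)
    d = primary.shape[0]
    # Independent projections per direction: this is the asymmetry.
    W_ps = rng.standard_normal((d, d)) / np.sqrt(d)
    W_sp = rng.standard_normal((d, d)) / np.sqrt(d)

    # Primary agent attends over secondary agents (softmax weights).
    scores = secondary @ (W_ps @ primary)          # (n,)
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    primary_out = primary + attn @ secondary       # aggregate context

    # Each secondary agent receives a differently-mapped primary signal.
    secondary_out = secondary + W_sp @ primary     # broadcast over agents
    return primary_out, secondary_out
```

In the paper the per-direction operations are not fixed as above but are selected by the network search module; the sketch only shows why two distinct mappings make the interaction directional.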


Notes

  1. The Fisher transform was proposed in 1921 to address the skewed distribution of the sample correlation r (Pearson, 1913; Fisher, 1915); applying it when computing the average correlation makes the result more reliable (Corey et al., 1998).
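The averaging procedure described in the note above can be sketched as follows: each correlation is mapped to z-space with arctanh (Fisher's z), the z values are averaged, and the mean is mapped back with tanh. The function name is illustrative:

```python
import math

def average_correlation(rs):
    """Average sample correlations via Fisher's z-transform.

    Directly averaging r values is biased because the sampling
    distribution of r is skewed; averaging in z-space and
    back-transforming gives a more reliable combined estimate.
    """
    zs = [math.atanh(r) for r in rs]   # Fisher z-transform of each r
    mean_z = sum(zs) / len(zs)         # average in z-space
    return math.tanh(mean_z)           # back-transform to the r scale
```

For skewed inputs the result differs from the naive mean: averaging r = 0.9 and r = 0.1 this way gives roughly 0.66 rather than 0.5, reflecting the compression of r near its bounds.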

References

  • Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., & Sivic, J. (2016). NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR (pp. 5297–5307).

  • Azar, S. M., Atigh, M. G., Nickabadi, A., & Alahi, A. (2019). Convolutional relational machine for group activity recognition. In CVPR (pp. 7892–7901).

  • Bertasius, G., Soo Park, H., Yu, S. X., & Shi, J. (2017). Am I a baller? Basketball performance assessment from first-person videos. In ICCV (pp. 2177–2185).

  • Cai, H., Zhu, L., & Han, S. (2018). ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR.

  • Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR (pp. 6299–6308).

  • Chang, X., Zheng, W.-S., & Zhang, J. (2015). Learning person-person interaction in collective activity recognition. TIP 24(6), 1905–1918.

  • Chen, J., Wang, Y., Qin, J., Liu, L., & Shao, L. (2017). Fast person re-identification via cross-camera semantic binary transformation. In CVPR.

  • Corey, D. M., Dunlap, W. P., & Burke, M. J. (1998). Averaging correlations: Expected values and bias in combined Pearson RS and Fisher’s Z transformations. JGP, 125(3), 245–261.


  • Dong, X., & Yang, Y. (2019). Searching for a robust neural architecture in four GPU hours. In CVPR (pp. 1761–1770).

  • Doughty, H., Damen, D., & Mayol-Cuevas, W. (2018). Who's better, who's best: Skill determination in video using deep ranking. In CVPR.

  • Doughty, H., Mayol-Cuevas, W., & Damen, D. (2019). The pros and cons: Rank-aware temporal attention for skill determination in long videos. In CVPR (pp. 7862–7871).

  • Fang, H.-S., Xie, S., Tai, Y.-W., & Lu, C. (2017). RMPE: Regional multi-person pose estimation. In ICCV (pp. 2334–2343).

  • Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10(4), 507–521.


  • Gao, J., Zheng, W.-S., Pan, J.-H., Gao, C., Wang, Y., Zeng, W., & Lai, J. (2020). An asymmetric modeling for action assessment. In ECCV (pp. 222–238), Springer.

  • Gao, Y., Vedula, S. S., Reiley, C. E., Ahmidi, N., Varadarajan, B., Lin, H. C., Tao, L., Zappella, L., Béjar, B., Yuh, D. D. et al. (2014). JHU-ISI gesture and skill assessment working set (JIGSAWS): A surgical activity dataset for human motion modeling. In W2CAI (Vol. 3, p. 3).

  • Guo, Z., Zhang, X., Mu, H., Heng, W., Liu, Z., Wei, Y., & Sun, J. (2019). Single path one-shot neural architecture search with uniform sampling. In ECCV (pp. 544–560).

  • Hu, S., Xie, S., Zheng, H., Liu, C., Shi, J., Liu, X., & Lin, D. (2020). DSNAS: Direct neural architecture search without parameter retraining. In CVPR (pp. 12084–12092).

  • Ilg, W., Mezger, J., & Giese, M. (2003). Estimation of skill levels in sports based on hierarchical spatio-temporal correspondences. In JPRS (pp. 523–531), Springer.

  • International Swimming Federation (FINA) (2017). FINA diving rules. URL https://resources.fina.org/fina/document/2021/01/12/916f78f6-2a42-46d6-bea8-e49130211edf/2017-2021_diving_16032018.pdf.

  • Joachims, T. (2006). Training linear SVMs in linear time. In SIGKDD (pp. 217–226).

  • Liu, D., Li, Q., Jiang, T., Wang, Y., Miao, R., Shan, F., & Li, Z. (2021). Towards unified surgical skill assessment. In CVPR (pp. 9522–9531).

  • Liu, H., Simonyan, K., & Yang, Y. (2018). DARTS: Differentiable architecture search. In ICLR.

  • Lu, L., Lu, Y., Yu, R., Di, H., Zhang, L., & Wang, S. (2019). GAIM: Graph attention interaction model for collective activity recognition. TMM, 22(2), 524–539.

  • Malpani, A., Vedula, S. S., Chen, C. C. G., & Hager, G. D. (2014). Pairwise comparison-based objective score for automated skill assessment of segments in a surgical task. In IPCAI (pp. 138–147), Springer.

  • Martin, J., Regehr, G., Reznick, R., Macrae, H., Murnaghan, J., Hutchison, C., & Brown, M. (1997). Objective structured assessment of technical skill (OSATS) for surgical residents. BJS, 84(2), 273–278.

  • Pan, J.-H., Gao, J., & Zheng, W.-S. (2019). Action assessment by joint relation graphs. In ICCV.

  • Parmar, P., & Morris, B. T. (2019). What and how well you performed? A multitask learning approach to action quality assessment. In CVPR.

  • Parmar, P., & Tran Morris, B. (2017). Learning to score Olympic events. In CVPRW (pp. 20–28).

  • Parmar, P., & Tran Morris, B. (2019). Action quality assessment across multiple actions. In WACV (pp. 1468–1476). https://doi.org/10.1109/WACV.2019.00161.

  • Pearson, K. (1913). On the probable error of a correlation coefficient as found from a fourfold table. Biometrika. https://doi.org/10.1093/biomet/9.1-2.22

  • Pérez, J. S., Meinhardt-Llopis, E., & Facciolo, G. (2013). TV-L1 optical flow estimation. In IPOL (pp. 137–150).

  • Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., & Dean, J. (2018). Efficient neural architecture search via parameters sharing. In ICML (pp. 4092–4101).

  • Pirsiavash, H., Vondrick, C., & Torralba, A. (2014). Assessing the quality of actions. In ECCV (pp. 556–571), Springer.

  • Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., & Monfardini, G. (2009). The graph neural network model. TNN, 20(1), 61–80.


  • Sharma, Y., Bettadapura, V., Plötz, T., Hammerla, N., Mellor, S., McNaney, R., Olivier, P., Deshmukh, S., McCaskie, A., & Essa, I. (2014). Video based assessment of OSATS using sequential motion textures. Georgia Institute of Technology.

  • Shu, T., Todorovic, S., & Zhu, S.-C. (2017). CERN: Confidence-energy recurrent network for group activity recognition. In CVPR (pp. 5523–5531).

  • Tang, Y., Ni, Z., Zhou, J., Zhang, D., Lu, J., Wu, Y., & Zhou, J. (2020). Uncertainty-aware score distribution learning for action quality assessment. In CVPR (pp. 9839–9848).

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In NeurIPS (pp. 5998–6008). Curran Associates, Inc. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.

  • Wang, M., Ni, B., & Yang, X. (2017). Recurrent modeling of interaction context for collective activity recognition. In CVPR (pp. 3048–3056).

  • Wu, J., Wang, L., Wang, L., Guo, J., & Wu, G. (2019). Learning actor relation graphs for group activity recognition. In CVPR (pp. 9964–9974).

  • Xie, S., Zheng, H., Liu, C., & Lin, L. (2018). SNAS: Stochastic neural architecture search. In ICLR.

  • Xu, C., Fu, Y., Zhang, B., Chen, Z., Jiang, Y.-G., & Xue, X. (2018). Learning to score the figure skating sports videos. arXiv preprint arXiv:1802.02774.

  • Yan, R., Tang, J., Shu, X., Li, Z., & Tian, Q. (2018a). Participation-contributed temporal dynamic model for group activity recognition. In ACM MM (pp. 1292–1300).

  • Yan, S., Xiong, Y., & Lin, D. (2018b). Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI.

  • Yao, T., Mei, T., & Rui, Y. (2016). Highlight detection with pairwise deep ranking for first-person video summarization. In CVPR (pp. 982–990).

  • Zeng, L.-A., Hong, F.-T., Zheng, W.-S., Yu, Q.-Z., Zeng, W., Wang, Y.-W., & Lai, J.-H. (2020). Hybrid dynamic-static context-aware attention network for action assessment in long videos. In ACM MM (pp. 2526–2534).

  • Zhang, P., Tang, Y., Hu, J.-F., & Zheng, W.-S. (2019). Fast collective activity recognition under weak supervision. TIP 29, 29–43.

  • Zhang, Q., & Li, B. (2011). Video-based motion expertise analysis in simulation-based surgical training using hierarchical Dirichlet process hidden Markov model. In MMAR (pp. 19–24), ACM.

  • Zhang, Q., & Li, B. (2015). Relative hidden Markov models for video-based evaluation of motion skills in surgical training. TPAMI, 37(6), 1206–1218.


  • Zhang, Y., Wang, C., Wang, X., Zeng, W., & Liu, W. (2020). FairMOT: On the fairness of detection and re-identification in multiple object tracking. arXiv preprint arXiv:2004.01888.

  • Zhu, K., & Wu, J. (2021). Residual attention: A simple but effective method for multi-label recognition. In ICCV (pp. 184–193).

  • Zia, A., & Essa, I. (2018). Automated surgical skill assessment in RMIS training. IJCARS, 13, 731–739.


  • Zia, A., Sharma, Y., Bettadapura, V., Sarin, E. L., Ploetz, T., Clements, M. A., & Essa, I. (2016). Automated video-based assessment of surgical skills for training and evaluation in medical schools. IJCARS, 11(9), 1623–1636.


  • Zia, A., Sharma, Y., Bettadapura, V., Sarin, E. L., & Essa, I. (2018). Video and accelerometer-based motion analysis for automated surgical skills assessment. IJCARS, 13(3), 443–455.


Acknowledgements

This work was supported partially by the NSFC (U21A20471, U1911401, U1811461), Guangdong NSF Project (Nos. 2020B1515120085, 2018B030312002), Guangzhou Research Project (201902010037), the Key-Area Research and Development Program of Guangzhou (202007030004), and the Major Key Project of PCL (PCL2021A12). The corresponding author and principal investigator for this paper is Wei-Shi Zheng.

Author information

Correspondence to Wei-Shi Zheng.

Additional information

Communicated by Dima Damen.



Cite this article

Gao, J., Pan, JH., Zhang, SJ. et al. Automatic Modelling for Interactive Action Assessment. Int J Comput Vis 131, 659–679 (2023). https://doi.org/10.1007/s11263-022-01695-5
