Abstract
Action assessment, the task of visually assessing how well an action is performed, has attracted much attention in recent years, with promising applications in areas such as medical treatment and sporting events. However, most existing action assessment methods target actions performed by a single person; in particular, they neglect the asymmetric relations among agents (e.g., between persons and objects), which limits their performance on many non-individual actions. In this work, we formulate a framework for modelling asymmetric interactions among agents for action assessment, taking into account the subordinate relationships among agents in many interactive actions. Specifically, we propose an asymmetric interaction learner consisting of an automatic assigner and an asymmetric interaction network search module. The automatic assigner groups the agents within an action into a primary agent (e.g., a human) and secondary agents (e.g., objects); the asymmetric interaction network search module adaptively learns the asymmetric interactions between these agents. We conduct experiments on the JIGSAWS dataset, which contains surgical actions, and additionally collect two new datasets, TASD-2 and PaSk, for action assessment on interactive sporting actions. The experimental results on these three datasets demonstrate the effectiveness of our framework, which achieves state-of-the-art performance. Extensive experiments on the AQA-7 dataset also indicate the robustness of our model in conventional action assessment settings.
Acknowledgements
This work was partially supported by the NSFC (U21A20471, U1911401, U1811461), the Guangdong NSF Project (Nos. 2020B1515120085, 2018B030312002), the Guangzhou Research Project (201902010037), the Key-Area Research and Development Program of Guangzhou (202007030004), and the Major Key Project of PCL (PCL2021A12). The corresponding author and principal investigator for this paper is Wei-Shi Zheng.
Additional information
Communicated by Dima Damen.
About this article
Cite this article
Gao, J., Pan, JH., Zhang, SJ. et al. Automatic Modelling for Interactive Action Assessment. Int J Comput Vis 131, 659–679 (2023). https://doi.org/10.1007/s11263-022-01695-5
DOI: https://doi.org/10.1007/s11263-022-01695-5