
HARD-Net: Hardness-AwaRe Discrimination Network for 3D Early Activity Prediction

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12356)

Abstract

Predicting the class label of a partially observed activity sequence is a very challenging task, as the observed early segments of different activities can be very similar. In this paper, we propose a novel Hardness-AwaRe Discrimination Network (HARD-Net) to specifically investigate the relationships between similar activity pairs that are hard to discriminate. Specifically, we design a Hard Instance-Interference Class (HI-IC) bank, which dynamically records hard similar pairs. Based on the HI-IC bank, a novel adversarial learning scheme is proposed to train our HARD-Net, which grants our network a strong capability for mining subtle discriminative information for 3D early activity prediction. We evaluate the proposed HARD-Net on two public activity datasets and achieve state-of-the-art performance.
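The abstract describes an HI-IC bank that dynamically records, for each class, the training instances the model confuses with an "interference" class. The paper's actual design is not detailed here, so the following is only a hypothetical minimal sketch of that bookkeeping idea: for each misclassified sample we store the (sample, interference class) pair keyed by the true class, keeping the most badly confused pairs. All names (`HIICBank`, `capacity_per_class`) are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


class HIICBank:
    """Hypothetical sketch of a Hard Instance-Interference Class bank.

    For each sample whose top prediction is wrong, record the
    (sample id, interference class) pair under its true class,
    keeping only the hardest pairs per class.
    """

    def __init__(self, capacity_per_class=10):
        self.capacity = capacity_per_class
        # true class -> list of (sample_id, interference_class, margin)
        self.bank = defaultdict(list)

    def update(self, sample_id, true_class, scores):
        """scores: dict mapping class name -> model confidence."""
        pred = max(scores, key=scores.get)
        if pred == true_class:
            return  # correctly classified: not a hard instance
        # margin measures how badly the interference class wins
        margin = scores[pred] - scores[true_class]
        entries = self.bank[true_class]
        entries.append((sample_id, pred, margin))
        # keep only the hardest (largest-margin) pairs
        entries.sort(key=lambda e: e[2], reverse=True)
        del entries[self.capacity:]

    def hard_pairs(self, true_class):
        """Return the recorded (sample id, interference class) pairs."""
        return [(sid, ic) for sid, ic, _ in self.bank[true_class]]


bank = HIICBank(capacity_per_class=2)
bank.update("s1", "drink", {"drink": 0.2, "eat": 0.7, "wave": 0.1})
bank.update("s2", "drink", {"drink": 0.4, "eat": 0.5, "wave": 0.1})
bank.update("s3", "drink", {"drink": 0.9, "eat": 0.05, "wave": 0.05})
print(bank.hard_pairs("drink"))  # s1 is more confused than s2; s3 is skipped
```

In the paper, pairs recorded this way would then drive an adversarial training scheme that sharpens the decision boundary between each class and its interference class; the sketch above only covers the recording step.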

Keywords

Early activity prediction · Action/gesture understanding · 3D skeleton data · Hardness-aware learning

Notes

Acknowledgement

This work is supported by SUTD Project PIE-SGP-AI-2020-02, SUTD Project SRG-ISTD-2020-153, the National Natural Science Foundation of China under Grants 61991411 and U1913204, the National Key Research and Development Plan of China under Grant 2017YFB1300205, and the Shandong Major Scientific and Technological Innovation Project (MSTIP) under Grant 2018CXGC1503.


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. School of CSE, Shandong University, Jinan, China
  2. ISTD Pillar, Singapore University of Technology and Design, Singapore, Singapore
  3. School of EE and CS, Peking University, Beijing, China
