Capsule Recurrent Neural Network with Weight Update Using Dynamic Routing by Agreement: A Unified Model for Action Recognition in Videos

  • Keyang Cheng
  • Lubamba Kasangu Eric
  • Rabia Tahir
  • Maozhen Li
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1074)


Convolutional Neural Networks (CNNs) remain the state of the art for action recognition in computer vision, owing to their strong results. However, they lack the capability to resolve ambiguity between overlapping actions. This paper proposes a novel framework, the Capsule Recurrent Neural Network (Caps-RNN), which aims to achieve better accuracy in video action recognition. The proposed model comprises a CNN, a Primary Capsule, and an RNN. The CNN and Primary Capsule are in charge of extracting spatial feature information, while the RNN with dynamic routing is employed for temporal feature extraction and frame prediction. As the key component of the model, Dynamic Routing by Agreement is used to update the weights during training of the RNN. Experiments were conducted on a subset of the UCF-101 dataset, and the results show that the proposed model provides competitive performance for action classification compared to other methods.
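The abstract does not spell out how routing is wired into the RNN, so the following is only a minimal sketch of the generic routing-by-agreement loop as described by Sabour, Frosst and Hinton in "Dynamic routing between capsules", not the authors' Caps-RNN implementation. The names `squash`, `softmax`, `route_by_agreement`, and the toy input `u_hat` are illustrative assumptions, and plain Python lists are used instead of a deep learning framework to keep the example self-contained.

```python
import math


def squash(v):
    """Shrink vector v so its length lies in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in v)
    if sq_norm == 0.0:
        return [0.0 for _ in v]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in v]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_by_agreement(u_hat, num_iterations=3):
    """Route predictions u_hat[i][j] (input capsule i -> output capsule j).

    Returns (v, c): the output capsule vectors and the coupling
    coefficients that were used to compute them.
    """
    if num_iterations < 1:
        raise ValueError("num_iterations must be at least 1")
    num_in, num_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    # Routing logits start at zero, so the first pass couples uniformly.
    b = [[0.0] * num_out for _ in range(num_in)]
    for _ in range(num_iterations):
        # Each input capsule distributes its output across all output capsules.
        c = [softmax(row) for row in b]
        v = []
        for j in range(num_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(num_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Agreement (dot product) between prediction and output raises the logit.
        for i in range(num_in):
            for j in range(num_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c


# Toy input (assumed values): three input capsules, two output capsules, 2-D vectors.
# All three inputs make similar predictions for output capsule 0.
u_hat = [
    [[1.0, 0.0], [0.0, 0.1]],
    [[1.0, 0.0], [0.0, -0.1]],
    [[0.9, 0.1], [0.1, 0.0]],
]
v, c = route_by_agreement(u_hat)
print("coupling coefficients:", c)
print("output capsule lengths:", [math.sqrt(sum(x * x for x in vj)) for vj in v])
```

In this toy case the inputs agree on output capsule 0, so after a few iterations each input routes more of its output there than to capsule 1. In a trained network the prediction vectors would come from learned transformation matrices rather than being fixed by hand.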


Capsule Network, Dynamic Routing by Agreement, Recurrent Neural Network, Action classification



This research is supported by the National Natural Science Foundation of China (61972183, 61602215) and the Director Foundation Project of the National Engineering Laboratory for Public Safety Risk Perception and Control by Big Data (PSRPC).



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Keyang Cheng (1, 2), corresponding author
  • Lubamba Kasangu Eric (1)
  • Rabia Tahir (1)
  • Maozhen Li (3)
  1. Jiangsu University, Zhenjiang, China
  2. National Engineering Laboratory for Public Safety Risk Perception and Control by Big Data, Beijing, China
  3. Brunel University, London, UK
