RhyRNN: Rhythmic RNN for Recognizing Events in Long and Complex Videos

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12355)


Though many successful approaches have been proposed for recognizing events in short and homogeneous videos, doing so with long and complex videos remains a challenge. One particular reason is that events in long and complex videos can consist of multiple heterogeneous sub-activities (in terms of rhythms, activity variants, composition order, etc.) within quite a long period. This fact brings about two main difficulties: excessive/varying length and complex video dynamic/rhythm. To address this, we propose Rhythmic RNN (RhyRNN) which is capable of handling long video sequences (up to 3,000 frames) as well as capturing rhythms at different scales. We also propose two novel modules: diversity-driven pooling (DivPool) and bilinear reweighting (BR), which consistently and hierarchically abstract higher-level information. We study the behavior of RhyRNN and empirically show that our method works well even when only event-level labels are available in the training stage (compared to algorithms requiring sub-activity labels for recognition), and thus is more practical when the sub-activity labels are missing or difficult to obtain. Extensive experiments on several public datasets demonstrate that, even without fine-tuning the feature backbones, our method can achieve promising performance for long and complex videos that contain multiple sub-activities.


Video understanding Complex event recognition RNN 


  1. 1.
    Antol, S., et al.: VQA: visual question answering. In: ICCV (2015)Google Scholar
  2. 2.
    Arjovsky, M., Shah, A., Bengio, Y.: Unitary evolution recurrent neural networks. In: ICML (2016)Google Scholar
  3. 3.
    Bilen, H., Fernando, B., Gavves, E., Vedaldi, A.: Action recognition with dynamic image networks. PAMI 40(12), 2799–2813 (2017)CrossRefGoogle Scholar
  4. 4.
    Bilen, H., Fernando, B., Gavves, E., Vedaldi, A., Gould, S.: Dynamic image networks for action recognition. In: CVPR (2016)Google Scholar
  5. 5.
    Campos, V., Jou, B., Giró-i Nieto, X., Torres, J., Chang, S.F.: Skip RNN: learning to skip state updates in recurrent neural networks. arXiv preprint arXiv:1708.06834 (2017)
  6. 6.
    Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: CVPR (2017)Google Scholar
  7. 7.
    Chao, Y.W., Vijayanarasimhan, S., Seybold, B., Ross, D.A., Deng, J., Sukthankar, R.: Rethinking the faster R-CNN architecture for temporal action localization. In: CVPR (2018)Google Scholar
  8. 8.
    Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
  9. 9.
    Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: CVPR (2017)Google Scholar
  10. 10.
    Donahue, J., et al.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR (2015)Google Scholar
  11. 11.
    Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. In: CVPR (2015)Google Scholar
  12. 12.
    Duan, L., Xu, D., Tsang, I.W.H., Luo, J.: Visual event recognition in videos by learning from web data. PAMI 34(9), 1667–1680 (2012)CrossRefGoogle Scholar
  13. 13.
    Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: ICCV (2019)Google Scholar
  14. 14.
    Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: CVPR (2016)Google Scholar
  15. 15.
    Fernando, B., Gavves, E., Oramas, J., Ghodrati, A., Tuytelaars, T.: Rank pooling for action recognition. PAMI 39(4), 773–787 (2016)CrossRefGoogle Scholar
  16. 16.
    Fernando, B., Tan, C., Bilen, H.: Weakly supervised Gaussian networks for action detection. In: WACV (2020)Google Scholar
  17. 17.
    Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM (1999)Google Scholar
  18. 18.
    Girdhar, R., Ramanan, D.: Attentional pooling for action recognition. In: NIPS (2017)Google Scholar
  19. 19.
    Gong, B., Chao, W.L., Grauman, K., Sha, F.: Diverse sequential subset selection for supervised video summarization. In: NIPS (2014)Google Scholar
  20. 20.
    Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and imagenet? In: CVPR (2018)Google Scholar
  21. 21.
    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)Google Scholar
  22. 22.
    Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)CrossRefGoogle Scholar
  23. 23.
    Hussein, N., Gavves, E., Smeulders, A.W.: Timeception for complex action recognition. In: CVPR (2019)Google Scholar
  24. 24.
    Hussein, N., Gavves, E., Smeulders, A.W.: VideoGraph: recognizing minutes-long human activities in videos. In: ICCVW (2019)Google Scholar
  25. 25.
    Jiang, Y.G., Bhattacharya, S., Chang, S.F., Shah, M.: High-level event recognition in unconstrained videos. Int. J. Multimedia Inf. Retrieval 2(2), 73–101 (2013)CrossRefGoogle Scholar
  26. 26.
    Kanuparthi, B., Arpit, D., Kerg, G., Ke, N.R., Mitliagkas, I., Bengio, Y.: h-detach: Modifying the LSTM gradient towards better optimization. In: ICLR (2019)Google Scholar
  27. 27.
    Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR (2014)Google Scholar
  28. 28.
    Kay, W., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
  29. 29.
    Kim, J.H., Jun, J., Zhang, B.T.: Bilinear attention networks. In: NIPS (2018)Google Scholar
  30. 30.
    Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: ICCV (2011)Google Scholar
  31. 31.
    Kuehne, H., Arslan, A., Serre, T.: The language of actions: recovering the syntax and semantics of goal-directed human activities. In: CVPR (2014)Google Scholar
  32. 32.
    Kuehne, H., Gall, J., Serre, T.: An end-to-end generative framework for video segmentation and recognition. In: WACV (2016)Google Scholar
  33. 33.
    Lan, T., Zhu, Y., Roshan Zamir, A., Savarese, S.: Action recognition by hierarchical mid-level action elements. In: ICCV (2015)Google Scholar
  34. 34.
    Lee, Y.J., Ghosh, J., Grauman, K.: Discovering important people and objects for egocentric video summarization. In: CVPR (2012)Google Scholar
  35. 35.
    Li, S., Li, W., Cook, C., Zhu, C., Gao, Y.: Independently recurrent neural network (IndRNN): building a longer and deeper RNN. In: CVPR (2018)Google Scholar
  36. 36.
    Nguyen, P., Liu, T., Prasad, G., Han, B.: Weakly supervised action localization by sparse temporal pooling network. In: CVPR (2018)Google Scholar
  37. 37.
    Oh, S., et al.: A large-scale benchmark dataset for event recognition in surveillance video. In: CVPR (2011)Google Scholar
  38. 38.
    Paszke, A., et al.: Automatic differentiation in PyTorch (2017)Google Scholar
  39. 39.
    Piergiovanni, A., Ryoo, M.S.: Learning latent super-events to detect multiple activities in videos. In: CVPR (2018)Google Scholar
  40. 40.
    Rashid, M., Kjellström, H., Lee, Y.J.: Action graphs: Weakly-supervised action localization with graph convolution networks. arXiv preprint arXiv:2002.01449 (2020)
  41. 41.
    Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: ICPR (2004)Google Scholar
  42. 42.
    Sharma, S., Kiros, R., Salakhutdinov, R.: Action recognition using visual attention. In: NIPS Time Series Workshop (2015)Google Scholar
  43. 43.
    Shou, Z., Wang, D., Chang, S.F.: Temporal action localization in untrimmed videos via multi-stage CNNs. In: CVPR (2016)Google Scholar
  44. 44.
    Sigurdsson, G.A., Divvala, S., Farhadi, A., Gupta, A.: Asynchronous temporal fields for action recognition. In: CVPR (2017)Google Scholar
  45. 45.
    Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: crowdsourcing data collection for activity understanding. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 510–526. Springer, Cham (2016). Scholar
  46. 46.
    Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS (2014)Google Scholar
  47. 47.
    Soomro, K., Zamir, A.R., Shah, M., Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild. CoRR (2012)Google Scholar
  48. 48.
    Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., Fidler, S.: MovieQA: understanding stories in movies through question-answering. In: CVPR (2016)Google Scholar
  49. 49.
    Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: ICCV (2015)Google Scholar
  50. 50.
    Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: CVPR (2018)Google Scholar
  51. 51.
    Tran, S.D., Davis, L.S.: Event modeling and recognition using Markov logic networks. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5303, pp. 610–623. Springer, Heidelberg (2008). Scholar
  52. 52.
    Veeriah, V., Zhuang, N., Qi, G.J.: Differential recurrent neural networks for action recognition. In: ICCV (2015)Google Scholar
  53. 53.
    Wang, H., Schmid, C.: Action recognition with improved trajectories. In: ICCV (2013)Google Scholar
  54. 54.
    Wang, L., Qiao, Y., Tang, X.: Action recognition with trajectory-pooled deep-convolutional descriptors. In: CVPR (2015)Google Scholar
  55. 55.
    Wang, L., et al.: Temporal segment networks: towards good practices for deep action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 20–36. Springer, Cham (2016). Scholar
  56. 56.
    Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: CVPR (2018)Google Scholar
  57. 57.
    Wang, X., Ji, Q.: Hierarchical context modeling for video event recognition. PAMI 39(9), 1770–1782 (2017)CrossRefGoogle Scholar
  58. 58.
    Wang, Y., Wang, S., Tang, J., O’Hare, N., Chang, Y., Li, B.: Hierarchical attention network for action recognition in videos. arXiv preprint arXiv:1607.06416 (2016)
  59. 59.
    Wu, C.Y., Feichtenhofer, C., Fan, H., He, K., Krahenbuhl, P., Girshick, R.: Long-term feature banks for detailed video understanding. In: CVPR (2019)Google Scholar
  60. 60.
    Wu, Y., Zhang, S., Zhang, Y., Bengio, Y., Salakhutdinov, R.R.: On multiplicative integration with recurrent neural networks. In: NIPS (2016)Google Scholar
  61. 61.
    Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 318–335. Springer, Cham (2018). Scholar
  62. 62.
    Xu, S., Cheng, Y., Gu, K., Yang, Y., Chang, S., Zhou, P.: Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In: ICCV (2017)Google Scholar
  63. 63.
    Xu, Y., et al.: Segregated temporal assembly recurrent networks for weakly supervised multiple action detection. In: AAAI (2019)Google Scholar
  64. 64.
    Xu, Z., Yang, Y., Hauptmann, A.G.: A discriminative CNN video representation for event detection. In: CVPR (2015)Google Scholar
  65. 65.
    Yeung, S., Russakovsky, O., Jin, N., Andriluka, M., Mori, G., Fei-Fei, L.: Every moment counts: dense detailed labeling of actions in complex videos. Int. J. Comput. Vision 126(2–4), 375–389 (2018)MathSciNetCrossRefGoogle Scholar
  66. 66.
    Zhang, K., Chao, W.-L., Sha, F., Grauman, K.: Video summarization with long short-term memory. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 766–782. Springer, Cham (2016). Scholar
  67. 67.
    Zhou, B., Andonian, A., Oliva, A., Torralba, A.: Temporal relational reasoning in videos. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 831–846. Springer, Cham (2018). Scholar

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. 1.Arizona State UniversityTempeUSA

Personalised recommendations