Discriminatively Learned Hierarchical Rank Pooling Networks


Rank pooling is a temporal encoding method that summarizes the dynamics of a video sequence in a single vector and has shown good results in human action recognition in prior work. In this work, we present novel temporal encoding methods for action and activity classification by extending the unsupervised rank pooling method in two ways. First, we present discriminative rank pooling, in which the shared weights of our video representation and the parameters of the action classifiers are estimated jointly for a given training dataset of labelled vector sequences using a bilevel optimization formulation of the learning problem. When the frame-level feature vectors are obtained from a convolutional neural network (CNN), we rank pool the network activations and jointly estimate all parameters of the model, including CNN filters and fully-connected weights, in an end-to-end manner; we call this model the end-to-end trainable rank pooled CNN. Importantly, this model can make use of any existing CNN architecture (e.g., AlexNet or VGG) without modification or the introduction of additional parameters. Second, we extend rank pooling to a high-capacity video representation called hierarchical rank pooling. Hierarchical rank pooling consists of a network of rank pooling functions that encode temporal semantics over arbitrarily long video clips based on rich frame-level features. By stacking non-linear feature functions and temporal sub-sequence encoders one on top of the other, we build a high-capacity encoding network of the dynamic behaviour of the video. The resulting video representation is a fixed-length feature vector describing the entire video clip that can be used as input to standard machine learning classifiers. We demonstrate our approach on the task of action and activity recognition.
We present a detailed analysis of our approach against competing methods and explore variants such as hierarchy depth and choice of non-linear feature function. The results are comparable to state-of-the-art methods on three important activity recognition benchmarks, with classification performance of 76.7% mAP on Hollywood2, 69.4% on HMDB51, and 93.6% on UCF101.
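
To make the idea of encoding a whole sequence as one fixed-length vector more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a ridge-regression surrogate for the ranking objective (frames are regressed onto their time index, and the fitted weight vector is the encoding), a time-varying mean as smoothing, and a signed square root as the non-linear feature function. The function names, window size, stride, and regularization value are all hypothetical choices made for illustration.

```python
import numpy as np


def rank_pool(X, reg=1.0):
    """Encode a (T, D) frame sequence as one D-vector.

    Assumption: a ridge regression onto the frame index stands in for
    the ranking objective; the learned weights are the representation.
    """
    X = np.asarray(X, dtype=float)
    T, D = X.shape
    t = np.arange(1, T + 1, dtype=float)
    # Time-varying mean: row i is the average of frames 0..i.
    V = np.cumsum(X, axis=0) / t[:, None]
    # Solve (V^T V + reg * I) u = V^T t for the weight vector u.
    A = V.T @ V + reg * np.eye(D)
    return np.linalg.solve(A, V.T @ t)


def signed_sqrt(x):
    """Hypothetical choice of non-linear feature function."""
    return np.sign(x) * np.sqrt(np.abs(x))


def hierarchical_rank_pool(X, window=8, stride=4, depth=2, reg=1.0):
    """Stack sub-sequence encoders: each layer rank pools overlapping
    windows, applies the non-linearity, and passes the resulting shorter
    sequence upward; the last layer pools what remains into one vector."""
    seq = np.asarray(X, dtype=float)
    for _ in range(depth - 1):
        T = seq.shape[0]
        if T < window:
            break  # too short to split further; pool it directly
        starts = range(0, T - window + 1, stride)
        seq = np.stack(
            [signed_sqrt(rank_pool(seq[s:s + window], reg)) for s in starts]
        )
    return rank_pool(seq, reg)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(40, 5))  # 40 frames, 5-dim features
    print(rank_pool(frames).shape)
    print(hierarchical_rank_pool(frames).shape)
```

In this sketch the output dimensionality equals the frame feature dimensionality regardless of clip length, which is what allows the encoding to feed a standard classifier. The discriminative and end-to-end variants described above would additionally learn the parameters upstream of the pooling step, which this sketch does not attempt.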









This research was supported by the Australian Research Council Centre of Excellence for Robotic Vision (project number CE140100016).

Author information



Corresponding author

Correspondence to Basura Fernando.

Additional information

Communicated by Svetlana Lazebnik.


About this article


Cite this article

Fernando, B., Gould, S. Discriminatively Learned Hierarchical Rank Pooling Networks. Int J Comput Vis 124, 335–355 (2017). https://doi.org/10.1007/s11263-017-1030-x



  • Rank pooling
  • Action recognition
  • Activity recognition
  • Convolutional neural networks