International Journal of Computer Vision, Volume 124, Issue 3, pp 335–355

Discriminatively Learned Hierarchical Rank Pooling Networks

  • Basura Fernando
  • Stephen Gould


Rank pooling is a temporal encoding method that summarizes the dynamics of a video sequence in a single vector and has shown good results for human action recognition in prior work. In this work, we present novel temporal encoding methods for action and activity classification by extending the unsupervised rank pooling temporal encoding method in two ways. First, we present discriminative rank pooling, in which the shared weights of our video representation and the parameters of the action classifiers are estimated jointly for a given training dataset of labelled vector sequences using a bilevel optimization formulation of the learning problem. When the frame-level feature vectors are obtained from a convolutional neural network (CNN), we rank pool the network activations and jointly estimate all parameters of the model, including CNN filters and fully-connected weights, in an end-to-end manner; we call this model the end-to-end trainable rank pooled CNN. Importantly, this model can make use of any existing convolutional neural network architecture (e.g., AlexNet or VGG) without modification or introduction of additional parameters. Second, we extend rank pooling to a high-capacity video representation called hierarchical rank pooling. Hierarchical rank pooling consists of a network of rank pooling functions, which encode temporal semantics over arbitrarily long video clips based on rich frame-level features. By stacking non-linear feature functions and temporal sub-sequence encoders one on top of the other, we build a high-capacity encoding network of the dynamic behaviour of the video. The resulting video representation is a fixed-length feature vector describing the entire video clip that can be used as input to standard machine learning classifiers. We demonstrate our approach on the task of action and activity recognition.
We present a detailed analysis of our approach against competing methods and explore variants such as hierarchy depth and choice of non-linear feature function. The results obtained are comparable to state-of-the-art methods on three important activity recognition benchmarks, with classification performance of 76.7% mAP on Hollywood2, 69.4% on HMDB51, and 93.6% on UCF101.
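
As an illustration only, the idea of rank pooling and its hierarchical extension can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: ridge regression onto the frame index stands in for the ranking objective, frames are smoothed with a running mean, a signed square root is used as the non-linear feature function, and the names `rank_pool`, `signed_sqrt`, and `hierarchical_rank_pool`, along with the `window`, `stride`, `depth`, and `reg` parameters, are hypothetical.

```python
import numpy as np


def rank_pool(frames: np.ndarray, reg: float = 1.0) -> np.ndarray:
    """Encode a (T, D) sequence as one D-dim vector u whose score u . v_t tracks t.

    Assumption: ridge regression onto the frame index replaces the
    ranking objective; v_t is the running mean of frames 1..t.
    """
    T, D = frames.shape
    t = np.arange(1, T + 1, dtype=float)
    # Running (time-varying) mean of the frame features.
    v = np.cumsum(frames, axis=0) / t[:, None]
    # Closed-form ridge solution: (V^T V + reg * I)^{-1} V^T t
    return np.linalg.solve(v.T @ v + reg * np.eye(D), v.T @ t)


def signed_sqrt(x: np.ndarray) -> np.ndarray:
    """Assumed non-linear feature function applied before each pooling step."""
    return np.sign(x) * np.sqrt(np.abs(x))


def hierarchical_rank_pool(frames: np.ndarray, window: int = 8, stride: int = 4,
                           depth: int = 2, reg: float = 1.0) -> np.ndarray:
    """Stack rank pooling over sliding windows, then pool the final sequence."""
    seq = frames
    for _ in range(depth - 1):
        if len(seq) < window:
            break  # sequence too short for another windowed layer
        starts = range(0, len(seq) - window + 1, stride)
        # Each window of the current sequence becomes one vector of the next one.
        seq = np.stack([rank_pool(signed_sqrt(seq[s:s + window]), reg)
                        for s in starts])
    # The last layer pools whatever sequence remains into a single vector.
    return rank_pool(signed_sqrt(seq), reg)
```

Calling `hierarchical_rank_pool(features)` on a `(T, D)` array of frame-level features returns a `D`-dimensional vector regardless of `T`, which is the fixed-length property described above; the result could then be passed to any standard classifier.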


Keywords: Rank pooling, Action recognition, Activity recognition, Convolutional neural networks



This research was supported by the Australian Research Council Centre of Excellence for Robotic Vision (project number CE140100016).



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. ACRV, Research School of Engineering, The Australian National University, Canberra, Australia
