Rank pooling is a temporal encoding method that summarizes the dynamics of a video sequence in a single vector and has shown good results in human action recognition in prior work. In this work, we present novel temporal encoding methods for action and activity classification by extending the unsupervised rank pooling temporal encoding method in two ways. First, we present discriminative rank pooling, in which the shared weights of our video representation and the parameters of the action classifiers are estimated jointly for a given training dataset of labelled vector sequences using a bilevel optimization formulation of the learning problem. When the frame-level feature vectors are obtained from a convolutional neural network (CNN), we rank pool the network activations and jointly estimate all parameters of the model, including CNN filters and fully-connected weights, in an end-to-end manner; we call this model the end-to-end trainable rank pooled CNN. Importantly, this model can make use of any existing convolutional neural network architecture (e.g., AlexNet or VGG) without modification or introduction of additional parameters. Second, we extend rank pooling to a high-capacity video representation called hierarchical rank pooling. Hierarchical rank pooling consists of a network of rank pooling functions, which encode temporal semantics over arbitrarily long video clips based on rich frame-level features. By stacking non-linear feature functions and temporal sub-sequence encoders one on top of the other, we build a high-capacity encoding network of the dynamic behaviour of the video. The resulting video representation is a fixed-length feature vector describing the entire video clip that can be used as input to standard machine learning classifiers. We demonstrate our approach on the task of action and activity recognition.
We present a detailed analysis of our approach against competing methods and explore variants such as hierarchy depth and choice of non-linear feature function. The results obtained are comparable to state-of-the-art methods on three important activity recognition benchmarks, with classification performance of 76.7% mAP on Hollywood2, 69.4% on HMDB51, and 93.6% on UCF101.
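To make the idea of stacking sub-sequence encoders concrete, the following is a minimal sketch rather than the implementation evaluated in the article. It assumes a ridge regression onto the frame index as a stand-in for the ranking objective, a running mean to smooth the frame features, and a signed square root as the non-linear feature function. The names `rank_pool` and `hierarchical_rank_pool` and the parameters `reg`, `window`, `stride`, and `depth` are hypothetical and chosen only for illustration.

```python
import numpy as np


def rank_pool(frames, reg=1e-3):
    """Encode a (T, D) sequence as one D-vector.

    Assumption: the ranking objective is replaced by ridge regression
    onto the frame index, so projections of later frames tend to score
    higher than projections of earlier ones.
    """
    X = np.asarray(frames, dtype=float)
    T, D = X.shape
    steps = np.arange(1, T + 1, dtype=float)
    # Running mean of the frames (assumed smoothing step).
    V = np.cumsum(X, axis=0) / steps[:, None]
    # Solve (V^T V + reg * I) u = V^T t for the encoding vector u.
    A = V.T @ V + reg * np.eye(D)
    return np.linalg.solve(A, V.T @ steps)


def hierarchical_rank_pool(frames, window=4, stride=2, depth=2):
    """Stack rank pooling layers over overlapping sub-sequences.

    Each intermediate layer encodes every window of the current sequence
    with rank_pool and applies a non-linearity; the last layer pools the
    remaining sequence into a single fixed-length vector.
    """
    X = np.asarray(frames, dtype=float)
    for _ in range(depth - 1):
        if X.shape[0] < window:
            break  # sequence too short for another layer of windows
        starts = range(0, X.shape[0] - window + 1, stride)
        X = np.stack([rank_pool(X[s:s + window]) for s in starts])
        # Signed square root (assumed choice of non-linear feature function).
        X = np.sign(X) * np.sqrt(np.abs(X))
    return rank_pool(X)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(20, 5))  # 20 frames, 5-dimensional features
    encoding = hierarchical_rank_pool(video)
    print(encoding.shape)
```

The output has the same dimensionality as a single frame feature regardless of the number of frames, which is what allows it to be passed to a standard classifier. Under these assumptions, increasing `depth` adds further layers of windowed encoders.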
This research was supported by the Australian Research Council Centre of Excellence for Robotic Vision (project number CE140100016).
Communicated by Svetlana Lazebnik.
Fernando, B., Gould, S. Discriminatively Learned Hierarchical Rank Pooling Networks. Int J Comput Vis 124, 335–355 (2017). https://doi.org/10.1007/s11263-017-1030-x
- Rank pooling
- Action recognition
- Activity recognition
- Convolutional neural networks