Discriminatively Learned Hierarchical Rank Pooling Networks
Rank pooling is a temporal encoding method that summarizes the dynamics of a video sequence in a single vector and has shown good results for human action recognition in prior work. In this work, we present novel temporal encoding methods for action and activity classification by extending the unsupervised rank pooling method in two ways. First, we present discriminative rank pooling, in which the shared weights of our video representation and the parameters of the action classifiers are estimated jointly for a given training dataset of labelled vector sequences using a bilevel optimization formulation of the learning problem. When the frame-level feature vectors are obtained from a convolutional neural network (CNN), we rank pool the network activations and jointly estimate all parameters of the model, including CNN filters and fully-connected weights, in an end-to-end manner; we call this model the end-to-end trainable rank pooled CNN. Importantly, this model can make use of any existing convolutional neural network architecture (e.g., AlexNet or VGG) without modification or introduction of additional parameters. Second, we extend rank pooling to a high-capacity video representation called hierarchical rank pooling. Hierarchical rank pooling consists of a network of rank pooling functions, which encode temporal semantics over arbitrarily long video clips based on rich frame-level features. By stacking non-linear feature functions and temporal sub-sequence encoders one on top of the other, we build a high-capacity encoding network of the dynamic behaviour of the video. The resulting video representation is a fixed-length feature vector describing the entire video clip that can be used as input to standard machine learning classifiers. We demonstrate our approach on the task of action and activity recognition.
We present a detailed analysis of our approach against competing methods and explore variants such as hierarchy depth and choice of non-linear feature function. The results are comparable to state-of-the-art methods on three important activity recognition benchmarks, with classification performance of 76.7% mAP on Hollywood2, 69.4% on HMDB51, and 93.6% on UCF101.
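To make the idea of encoding a sequence as a single vector concrete, the following is a minimal sketch of rank pooling and of a stacked (hierarchical) variant. It is an illustration under stated assumptions, not the implementation evaluated in this work: the ranking objective is replaced by ridge regression onto the frame index, the non-linear feature function is assumed to be a signed square root, and the function names and the `window`, `stride`, `depth`, and `reg` parameters are hypothetical.

```python
import numpy as np

def rank_pool(frames, reg=1.0):
    """Encode a (T, D) sequence of frame features as one D-dimensional vector.

    Assumption: ridge regression onto the frame index stands in for the
    ranking objective, so the returned vector u scores later frames higher.
    """
    frames = np.asarray(frames, dtype=float)
    T, D = frames.shape
    t = np.arange(1, T + 1, dtype=float)
    # Smooth the sequence with a running (time-varying) mean of the frames.
    smoothed = np.cumsum(frames, axis=0) / t[:, None]
    # Solve (V^T V + reg * I) u = V^T t for u.
    A = smoothed.T @ smoothed + reg * np.eye(D)
    return np.linalg.solve(A, smoothed.T @ t)

def hierarchical_rank_pool(frames, window=5, stride=1, depth=2):
    """Stack sub-sequence rank pooling layers, then pool the final sequence."""
    seq = np.asarray(frames, dtype=float)
    for _ in range(depth - 1):
        # Non-linear feature function (assumed here: signed square root).
        seq = np.sign(seq) * np.sqrt(np.abs(seq))
        # Rank pool each sliding window to form the next, shorter sequence.
        starts = range(0, max(len(seq) - window, 0) + 1, stride)
        seq = np.stack([rank_pool(seq[s:s + window]) for s in starts])
    # The last layer encodes the whole remaining sequence as one vector.
    return rank_pool(seq)
```

For a clip with `T` frames of `D`-dimensional features, `hierarchical_rank_pool(frames)` returns a vector of length `D` regardless of `T`, which is what allows the encoding to be passed to a standard classifier.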
Keywords: Rank pooling, Action recognition, Activity recognition, Convolutional neural networks
This research was supported by the Australian Research Council Centre of Excellence for Robotic Vision (project number CE140100016).