Abstract
Determination of human attributes and recognition of actions in still images are two related and challenging tasks in computer vision, which often appear in fine-grained domains where the distinctions between the different categories are very small. Deep Convolutional Neural Network (CNN) models have demonstrated their remarkable representational learning capability through various examples. However, the successes are very limited for attributes and action recognition as the potential of CNNs to acquire both of the global and local information of an image remains largely unexplored. This paper proposes to tackle the problem with an encoding of a spatial pyramid Vector of Locally Aggregated Descriptors (VLAD) on top of CNN features. With region proposals generated by Edgeboxes, a compact and efficient representation of an image is thus produced for subsequent prediction of attributes and classification of actions. The proposed scheme is validated with competitive results on two benchmark datasets: 90.4% mean Average Precision (mAP) on the Berkeley Attributes of People dataset and 88.5% mAP on the Stanford 40 action dataset.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Fei-Fei, L., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 524–531 (2005)
Csurka, G., Perronnin, F.: Fisher vectors: beyond bag-of-visual-words image representations. In: Richard, P., Braz, J. (eds.) VISIGRAPP 2010. CCIS, vol. 229, pp. 28–42. Springer, Heidelberg (2011). doi:10.1007/978-3-642-25382-9_2
Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a compact image representation. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3304–3311. IEEE (2010)
Sharma, G., Jurie, F., Schmid, C.: Discriminative spatial saliency for image classification. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3506–3513 (2012)
Delaitre, V., Laptev, I., Sivic, J.: Recognizing human actions in still images: a study of bag-of-features and part-based representations. In: Proceedings of the British Machine Vision Conference, pp. 97.1–97.11. BMVA Press (2010). doi:10.5244/C.24.97.
Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Dense trajectories and motion boundary descriptors for action recognition. Int. J. Comput. Vis. 103, 60–79 (2013)
Peng, X., Zou, C., Qiao, Y., Peng, Q.: Action recognition with stacked fisher vectors. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 581–595. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10602-1_38
Lowe, D.G.: Object recognition from local scale-invariant features. In: The Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150–1157. IEEE (1999)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Neural Information Processing Systems (2012)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
Gkioxari, G., Girshick, R., Malik, J.: Contextual action recognition with R* CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1080–1088 (2015)
Uricchio, T., Bertini, M., Seidenari, L., Bimbo, A.D.: Fisher encoded convolutional bag-of-windows for efficient image retrieval and social image tagging. In: 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), pp. 1020–1026 (2015)
Zitnick, C.L., Dollár, P.: Edge boxes: locating object proposals from edges. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 391–405. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10602-1_26
Shin, A., Yamaguchi, M., Ohnishi, K., Harada, T.: Dense image representation with spatial pyramid VLAD coding of CNN for locally robust captioning. arXiv preprint arXiv:1603.09046 (2016)
Bourdev, L., Maji, S., Malik, J.: Describing people: a poselet-based approach to attribute classification. In: 2011 International Conference on Computer Vision, pp. 1543–1550 (2011)
Yao, B., Jiang, X., Khosla, A., Lin, A.L., Guibas, L., Fei-Fei, L.: Human action recognition by learning bases of action attributes and parts. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 1331–1338. IEEE (2011)
Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Attribute and simile classifiers for face verification. In: IEEE International Conference on Computer Vision (ICCV) (2009)
Chen, H., Gallagher, A., Girod, B.: Describing clothing by semantic attributes. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7574, pp. 609–623. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33712-3_44
Cai, J., Zha, Z.J., Zhou, W., Tian, Q.: Attribute-assisted reranking for web image retrieval. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 873–876. ACM (2012)
Peng, X., Wang, L., Wang, X., Qiao, Y.: Bag of visual words and fusion methods for action recognition: comprehensive study and good practice. arXiv preprint arXiv:1405.4506 (2014)
Oneata, D., Verbeek, J., Schmid, C.: Action and event recognition with fisher vectors on a compact feature set. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1817–1824 (2013)
Ullah, M.M., Parizi, S.N., Laptev, I.: Improving bag-of-features action recognition with non-local cues. In: BMVC, vol. 10, pp. 95–1. Citeseer (2010)
Delaitre, V., Laptev, I., Sivic, J.: Recognizing human actions in still images: a study of bag-of-features and part-based representations (2010). http://www.di.ens.fr/willow/research/stillactions/
Sun, C., Nevatia, R.: Large-scale web video event classification by use of fisher vectors. In: 2013 IEEE Workshop on Applications of Computer Vision (WACV), pp. 15–22. IEEE (2013)
Jain, M., Jégou, H., Bouthemy, P.: Better exploiting motion for better action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2555–2562 (2013)
Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32, 1627–1645 (2010)
Bourdev, L., Malik, J.: Poselets: body part detectors trained using 3D human pose annotations. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 1365–1372. IEEE (2009)
Zhang, N., Paluri, M., Ranzato, M., Darrell, T., Bourdev, L.: PANDA: pose aligned networks for deep attribute modeling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1637–1644 (2014)
Yao, B., Fei-Fei, L.: Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses. IEEE Trans. Pattern Anal. Mach. Intell. 34, 1691–1703 (2012)
Prest, A., Schmid, C., Ferrari, V.: Weakly supervised learning of interactions between humans and objects. IEEE Trans. Pattern Anal. Mach. Intell. 34, 601–614 (2012)
Girshick, R.: Fast R-CNN. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448 (2015)
Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H.S.: Conditional random fields as recurrent neural networks. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1529–1537 (2015)
Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1717–1724 (2014)
Gkioxari, G., Girshick, R., Malik, J.: Actions and attributes from wholes and parts. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2470–2478 (2015)
Diba, A., Pazandeh, A.M., Pirsiavash, H., Van Gool, L.: DeepCAMP: deep convolutional action & attribute mid-level patterns. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, pp. 3557–3565 (2016). doi:10.1109/CVPR.2016.387
Arandjelovic, R., Zisserman, A.: All about VLAD. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1578–1585 (2013)
Sánchez, J., Perronnin, F., Mensink, T., Verbeek, J.: Image classification with the fisher vector: theory and practice. Int. J. Comput. Vis. 105, 222–245 (2013)
Dixit, M., Chen, S., Gao, D., Rasiwasia, N., Vasconcelos, N.: Scene classification with semantic fisher vectors. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2974–2983 (2015)
Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2169–2178. IEEE (2006)
Zhou, R., Yuan, Q., Gu, X., Zhang, D.: Spatial pyramid VLAD. In: 2014 IEEE Visual Communications and Image Processing Conference, pp. 342–345. IEEE (2014)
Hosang, J., Benenson, R., Schiele, B.: How good are detection proposals, really? In: 25th British Machine Vision Conference, pp. 1–12. BMVA Press (2014)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a compact image representation. In: IEEE Conference on Computer Vision & Pattern Recognition (2010)
Ross, D.A., Lim, J., Lin, R., Yang, M.: Incremental learning for robust visual tracking. Int. J. Comput. Vis. 77, 125–141 (2008)
Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics (2007)
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the ACM International Conference on Multimedia, pp. 675–678. ACM (2014)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Vedaldi, A., Fulkerson, B.: VLFeat: an open and portable library of computer vision algorithms (2008)
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011). http://www.csie.ntu.edu.tw/ cjlin/libsvm
Li, L.J., Su, H., Fei-Fei, L., Xing, E.P.: Object bank: a high-level image representation for scene classification & semantic feature sparsification. In: Advances in Neural Information Processing Systems, pp. 1378–1386 (2010)
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained linear coding for image classification. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3360–3367. IEEE (2010)
Sharma, G., Jurie, F., Schmid, C.: Expanded parts model for human attribute and action recognition in still images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–659 (2013)
Khan, F.S., Xu, J., van de Weijer, J., Bagdanov, A.D., Anwer, R.M., Lopez, A.M.: Recognizing actions through action-specific person detection. IEEE Trans. Image Process. 24, 4422–4432 (2015)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Yan, S., Smith, J.S., Zhang, B. (2017). Attributes and Action Recognition Based on Convolutional Neural Networks and Spatial Pyramid VLAD Encoding. In: Chen, CS., Lu, J., Ma, KK. (eds) Computer Vision – ACCV 2016 Workshops. ACCV 2016. Lecture Notes in Computer Science(), vol 10118. Springer, Cham. https://doi.org/10.1007/978-3-319-54526-4_37
Download citation
DOI: https://doi.org/10.1007/978-3-319-54526-4_37
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54525-7
Online ISBN: 978-3-319-54526-4
eBook Packages: Computer ScienceComputer Science (R0)