Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

  • Kaiming He
  • Xiangyu Zhang
  • Shaoqing Ren
  • Jian Sun
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8691)

Abstract

Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101.

The power of SPP-net is more significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method computes convolutional features 30-170× faster than the recent leading method R-CNN (and 24-64× faster overall), while achieving better or comparable accuracy on Pascal VOC 2007.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Chang, C.C., Lin, C.J.: Libsvm: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, TIST (2011)Google Scholar
  2. 2.
    Chatfield, K., Lempitsky, V., Vedaldi, A., Zisserman, A.: The devil is in the details: An evaluation of recent feature encoding methods. In: BMVC (2011)Google Scholar
  3. 3.
    Cheng, M.M., Zhang, Z., Lin, W.Y., Torr, P.: BING: Binarized normed gradients for objectness estimation at 300fps. In: CVPR (2014)Google Scholar
  4. 4.
    Coates, A., Ng, A.: The importance of encoding versus training with sparse coding and vector quantization. In: ICML (2011)Google Scholar
  5. 5.
    Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)Google Scholar
  6. 6.
    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR (2009)Google Scholar
  7. 7.
    Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: A deep convolutional activation feature for generic visual recognition. ArXiv:1310.1531 (2013)Google Scholar
  8. 8.
    Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge, VOC 2007 Results (2007)Google Scholar
  9. 9.
    Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. CVIU (2007)Google Scholar
  10. 10.
    Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. PAMI (2010)Google Scholar
  11. 11.
    van Gemert, J.C., Geusebroek, J.-M., Veenman, C.J., Smeulders, A.W.M.: Kernel codebooks for scene categorization. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 696–709. Springer, Heidelberg (2008)CrossRefGoogle Scholar
  12. 12.
    Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)Google Scholar
  13. 13.
    Gong, Y., Wang, L., Guo, R., Lazebnik, S.: Multi-scale orderless pooling of deep convolutional activation features. ArXiv:1403.1840 (2014)Google Scholar
  14. 14.
    Grauman, K., Darrell, T.: The pyramid match kernel: Discriminative classification with sets of image features. In: ICCV (2005)Google Scholar
  15. 15.
    Howard, A.G.: Some improvements on deep convolutional neural network based image classification. ArXiv:1312.5402 (2013)Google Scholar
  16. 16.
    Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep convolutional neural networks. In: NIPS (2012)Google Scholar
  17. 17.
    Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR (2006)Google Scholar
  18. 18.
    LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Computation (1989)Google Scholar
  19. 19.
    Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV (2004)Google Scholar
  20. 20.
    Oquab, M., Bottou, L., Laptev, I., Sivic, J., et al.: Learning and transferring mid-level image representations using convolutional neural networks. In: CVPR (2014)Google Scholar
  21. 21.
    Perronnin, F., Sánchez, J., Mensink, T.: Improving the fisher kernel for large-scale image classification. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 143–156. Springer, Heidelberg (2010)CrossRefGoogle Scholar
  22. 22.
    Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: Cnn features off-the-shelf: An astounding baseline for recogniton. In: CVPR 2014, DeepVision Workshop (2014)Google Scholar
  23. 23.
    van de Sande, K.E., Uijlings, J.R., Gevers, T., Smeulders, A.W.: Segmentation as selective search for object recognition. In: ICCV (2011)Google Scholar
  24. 24.
    Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat: Integrated recognition, localization and detection using convolutional networks. ArXiv:1312.6229 (2013)Google Scholar
  25. 25.
    Sivic, J., Zisserman, A.: Video google: A text retrieval approach to object matching in videos. In: ICCV (2003)Google Scholar
  26. 26.
    Szegedy, C., Toshev, A., Erhan, D.: Deep neural networks for object detection. In: NIPS (2013)Google Scholar
  27. 27.
    Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Deepface: Closing the gap to human-level performance in face verification. In: CVPR (2014)Google Scholar
  28. 28.
    Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained linear coding for image classification. In: CVPR (2010)Google Scholar
  29. 29.
    Wang, X., Yang, M., Zhu, S., Lin, Y.: Regionlets for generic object detection. In: ICCV (2013)Google Scholar
  30. 30.
    Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. In: CVPR (2009)Google Scholar
  31. 31.
    Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional neural networks. ArXiv:1311.2901 (2013)Google Scholar
  32. 32.
    Zhang, N., Paluri, M., Ranzato, M., Darrell, T., Bourdevr, L.: Panda: Pose aligned networks for deep attribute modeling. In: CVPR (2014)Google Scholar
  33. 33.
    Zou, W.Y., Wang, X., Sun, M., Lin, Y.: Generic object detection with dense neural patterns and regionlets. ArXiv:1404.4316 (2014)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Kaiming He
    • 1
  • Xiangyu Zhang
    • 2
  • Shaoqing Ren
    • 3
  • Jian Sun
    • 1
  1. 1.Microsoft ResearchChina
  2. 2.Xi’an Jiaotong UniversityChina
  3. 3.University of Science and TechnologyChina

Personalised recommendations