Persistent Evidence of Local Image Properties in Generic ConvNets

  • Ali Sharif Razavian
  • Hossein Azizpour
  • Atsuto Maki
  • Josephine Sullivan
  • Carl Henrik Ek
  • Stefan Carlsson
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9127)

Abstract

Supervised training of a convolutional network for object classification should make explicit any information related to the class of objects and disregard any auxiliary information associated with the capture of the image or the variation within the object class. Does this happen in practice? Although this seems to hold for the very final layers of the network, at earlier layers we find that it does not: strong spatial information remains implicit. This paper addresses this observation, in particular by exploiting the image representation at the first fully connected layer, i.e. the global image descriptor which has recently been shown to be highly effective across a range of visual recognition tasks. We empirically demonstrate evidence for this finding in the context of four different tasks: 2d landmark detection, 2d object keypoint prediction, estimation of the RGB values of the input image, and recovery of the semantic label of each pixel. We base our investigation on a simple framework with ridge regression common across these tasks, and show results that all support our insight. Such spatial information can be used to compute landmark correspondences to good accuracy, and should also be useful for improving the training of convolutional nets for classification purposes.
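The common framework described above amounts to fitting a ridge regressor from the network's first fully connected layer descriptor to a spatial target (e.g. landmark coordinates). A minimal sketch of that idea follows, using the closed-form ridge solution; the synthetic random features here merely stand in for real CNN descriptors (e.g. 4096-d fc6 activations), and all variable names and dimensions are illustrative assumptions, not the authors' actual setup.

```python
import numpy as np

def ridge_fit(X, Y, lam=1.0):
    """Closed-form ridge regression: W = (X^T X + lam * I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Synthetic stand-ins for CNN image descriptors and 2d landmark targets.
# A real experiment would use fc-layer activations extracted from each image.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))                      # 500 images, 64-d descriptors
W_true = rng.standard_normal((64, 2))                   # hidden linear map to (x, y)
Y = X @ W_true + 0.01 * rng.standard_normal((500, 2))   # noisy landmark positions

W = ridge_fit(X, Y, lam=1e-2)   # learn descriptor -> landmark mapping
Y_hat = X @ W                   # predicted (x, y) landmark positions
```

If the fitted regressor predicts landmark positions well, the descriptor must still encode the corresponding spatial information, which is exactly the kind of evidence the four tasks probe.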



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Ali Sharif Razavian (1)
  • Hossein Azizpour (1)
  • Atsuto Maki (1)
  • Josephine Sullivan (1)
  • Carl Henrik Ek (1)
  • Stefan Carlsson (1)
  1. Computer Vision and Active Perception Lab (CVAP), School of Computer Science and Communication (CSC), Royal Institute of Technology (KTH), Stockholm, Sweden
