Learning to Zoom: A Saliency-Based Sampling Layer for Neural Networks

  • Adrià Recasens
  • Petr Kellnhofer
  • Simon Stent
  • Wojciech Matusik
  • Antonio Torralba
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11213)

Abstract

We introduce a saliency-based distortion layer for convolutional neural networks that helps to improve the spatial sampling of input data for a given task. Our differentiable layer can be added as a preprocessing block to existing task networks and trained altogether in an end-to-end fashion. The effect of the layer is to efficiently estimate how to sample from the original data in order to boost task performance. For example, for an image classification task in which the original data might range in size up to several megapixels, but where the desired input images to the task network are much smaller, our layer learns how best to sample from the underlying high resolution data in a manner which preserves task-relevant information better than uniform downsampling. This has the effect of creating distorted, caricature-like intermediate images, in which idiosyncratic elements of the image that improve task performance are zoomed and exaggerated. Unlike alternative approaches such as spatial transformer networks, our proposed layer is inspired by image saliency, computed efficiently from uniformly downsampled data, and degrades gracefully to a uniform sampling strategy under uncertainty. We apply our layer to improve existing networks for the tasks of human gaze estimation and fine-grained object classification. Code for our method is available at: http://github.com/recasens/Saliency-Sampler.
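The abstract describes the layer only at a high level. As a rough illustration, the sketch below shows one plausible way such a layer could be wired up in PyTorch: a small CNN predicts a saliency map from a uniformly downsampled copy of the input, the map is turned into a non-uniform sampling grid by a Gaussian attraction kernel, and grid_sample warps the original high-resolution image. All names (SaliencySampler, make_grid, the network sizes and kernel parameters) are illustrative assumptions, not the authors' API; the released repository linked above contains the reference implementation.

```python
# Minimal sketch of a saliency-based sampling layer (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencySampler(nn.Module):
    """Warps a high-res image so that salient regions occupy more output pixels."""

    def __init__(self, grid_size=31, kernel_size=31, sigma=8.0, low_res=64):
        super().__init__()
        self.grid_size = grid_size
        self.low_res = low_res
        # Small CNN predicting a single-channel, strictly positive saliency map (illustrative).
        self.saliency_net = nn.Sequential(
            nn.Conv2d(3, 24, 3, padding=1), nn.ReLU(),
            nn.Conv2d(24, 24, 3, padding=1), nn.ReLU(),
            nn.Conv2d(24, 1, 1), nn.Softplus(),
        )
        # Fixed Gaussian kernel that spreads each saliency value over its neighborhood.
        self.pad = kernel_size // 2
        xs = torch.arange(kernel_size, dtype=torch.float32) - self.pad
        g = torch.exp(-(xs ** 2) / (2 * sigma ** 2))
        kern = torch.outer(g, g)
        self.register_buffer("kernel", (kern / kern.sum())[None, None])
        # Coordinate maps in [0, 1], defined on the padded grid.
        full = grid_size + 2 * self.pad
        idx = (torch.arange(full, dtype=torch.float32) - self.pad) / (grid_size - 1)
        gy, gx = torch.meshgrid(idx, idx, indexing="ij")
        self.register_buffer("coords", torch.stack([gx, gy])[None])  # (1, 2, full, full)

    def make_grid(self, saliency):
        # saliency: (B, 1, G, G). Each grid point is pulled toward nearby high-saliency
        # locations: u(x) = sum_j k(x, x_j) s(x_j) x_j / sum_j k(x, x_j) s(x_j).
        s = F.pad(saliency, (self.pad,) * 4, mode="replicate")           # (B, 1, full, full)
        denom = F.conv2d(s, self.kernel)                                 # (B, 1, G, G)
        num = F.conv2d((s * self.coords).flatten(0, 1)[:, None], self.kernel)
        num = num.view(-1, 2, self.grid_size, self.grid_size)           # (B, 2, G, G)
        grid = (num / denom) * 2 - 1                                     # rescale to [-1, 1]
        return grid.permute(0, 2, 3, 1)                                  # (B, G, G, 2)

    def forward(self, image_hr, out_size=224):
        # Estimate saliency cheaply from a uniformly downsampled copy of the input.
        low = F.interpolate(image_hr, size=self.low_res, mode="bilinear", align_corners=False)
        s = self.saliency_net(low)
        s = F.interpolate(s, size=self.grid_size, mode="bilinear", align_corners=False)
        grid = self.make_grid(s)
        grid = F.interpolate(grid.permute(0, 3, 1, 2), size=out_size,
                             mode="bilinear", align_corners=False).permute(0, 2, 3, 1)
        # Non-uniformly resample the original high-resolution image.
        return F.grid_sample(image_hr, grid, align_corners=False)
```

In use, the warped output would be fed to the downstream task network and the whole pipeline trained end to end; because grid_sample is differentiable, the task loss back-propagates into the saliency network. If the predicted saliency is flat, the grid reduces to regular spacing and the layer falls back to plain uniform downsampling, consistent with the graceful degradation described in the abstract.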

Keywords

Task saliency · Image sampling · Attention · Spatial transformer · Convolutional neural networks · Deep learning

Notes

Acknowledgment

This research was funded by Toyota Research Institute. We acknowledge NVIDIA Corporation for hardware donations.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Massachusetts Institute of Technology, Cambridge, USA
  2. Toyota Research Institute, Cambridge, USA