Top-Down Neural Attention by Excitation Backprop

Abstract

We aim to model the top-down attention of a convolutional neural network (CNN) classifier for generating task-specific attention maps. Inspired by a top-down human visual attention model, we propose a new backpropagation scheme, called Excitation Backprop, that passes top-down signals downward through the network hierarchy via a probabilistic Winner-Take-All process. Furthermore, we introduce the concept of contrastive attention to make the top-down attention maps more discriminative. We show a theoretical connection between the proposed contrastive attention formulation and the Class Activation Map computation. An efficient implementation of Excitation Backprop for common neural network layers is also presented. In experiments, we visualize the evidence for a model's classification decision by computing the proposed top-down attention maps. For quantitative evaluation, we report the accuracy of our method on weakly supervised localization tasks on the MS COCO, PASCAL VOC07, and ImageNet datasets. The usefulness of our method is further validated on the text-to-region association task. On the Flickr30k Entities dataset, we achieve promising phrase-localization performance by leveraging the top-down attention of a CNN model trained on weakly labeled web images. Finally, we demonstrate applications of our method to model interpretation and data-annotation assistance for facial expression analysis and medical imaging tasks.
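
To make the probabilistic Winner-Take-All backward pass described in the abstract more concrete, here is a minimal NumPy sketch of the Excitation Backprop rule for a single fully connected layer. It is an illustrative sketch rather than the authors' implementation (their Caffe code is linked in the Notes below): the function name `eb_backward_linear`, the epsilon guard, and the two-layer toy example are assumptions of this sketch. Each output neuron renormalizes its top-down winning probability over its positive-weight, non-negative-activation inputs and redistributes it downward; stacking this step layer by layer yields an attention map at any chosen layer. Contrastive attention and convolutional layers are not shown.

```python
import numpy as np

def eb_backward_linear(W, a_in, p_out, eps=1e-12):
    """One Excitation Backprop step through a fully connected layer.

    W     : (out, in) weight matrix of the layer.
    a_in  : (in,)  non-negative input activations (e.g. post-ReLU).
    p_out : (out,) top-down winning probabilities at the layer's output.
    Returns the winning probabilities redistributed over the layer's inputs.
    """
    W_pos = np.maximum(W, 0.0)       # keep only excitatory (positive) weights
    z = W_pos @ a_in + eps           # per-output normalizer: sum_j a_j * w_ij^+
    s = p_out / z                    # scale each output's probability by its normalizer
    return a_in * (W_pos.T @ s)      # P(a_j) = a_j * sum_i w_ij^+ * P(a_i) / z_i

# Toy usage: start from a one-hot distribution over classes and propagate it
# down through two hypothetical fully connected layers.
rng = np.random.default_rng(0)
a0 = np.abs(rng.normal(size=8))              # bottom-layer activations (non-negative)
W1 = rng.normal(size=(6, 8))
a1 = np.maximum(W1 @ a0, 0.0)                # hidden activations after ReLU
W2 = rng.normal(size=(3, 6))                 # classifier weights for 3 classes

p_top = np.zeros(3)
p_top[1] = 1.0                               # attend to class 1
p_mid = eb_backward_linear(W2, a1, p_top)
p_bottom = eb_backward_linear(W1, a0, p_mid)
print(p_bottom, p_bottom.sum())              # attention over inputs; sums to ~1
```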

Notes

  1. http://www.cs.bu.edu/groups/ivc/excitation-backprop.

  2. https://github.com/jimmie33/Caffe-ExcitationBP.

  3. https://github.com/BVLC/caffe/wiki/Model-Zoo.

  4. On COCO, we need to compute about 116K attention maps, which amounts to over 950 hours of computation on a single machine for LRP using VGG16.

  5. https://stock.adobe.com.

  6. The Facial Action Coding System (FACS) is a taxonomy for encoding facial muscle movements into Action Units (AUs). Combinations of coded action units are used to make higher-level decisions, such as a facial emotion: happy, sad, angry, etc.

References

  • Anderson, C. H., & Van Essen, D. C. (1987). Shifter circuits: A computational strategy for dynamic aspects of visual processing. Proceedings of the National Academy of Sciences, 84(17), 6297–6301.

  • Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In CVPR.

  • Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., & Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7), e0130140.

  • Baluch, F., & Itti, L. (2011). Mechanisms of top-down attention. Trends in Neurosciences, 34(4), 210–224.

  • Bazzani, L., Bergamo, A., Anguelov, D. & Torresani, L. (2016). Self-taught object localization with deep networks. In 2016 IEEE winter conference on applications of computer vision (WACV) (pp. 1–9). IEEE.

  • Beck, D. M., & Kastner, S. (2009). Top-down and bottom-up mechanisms in biasing competition in the human brain. Vision Research, 49(10), 1154–1165.

  • Cao, C., Liu, X., Yang, Y., Yu, Y., Wang, J., Wang, Z., et al. (2015). Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In ICCV.

  • Chatfield, K., Simonyan, K., Vedaldi, A. &  Zisserman, A. (2014). Return of the devil in the details: Delving deep into convolutional nets. In BMVC.

  • Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2016). Fast and accurate deep network learning by exponential linear units (ELUs). In ICLR.

  • Desimone, R. (1998). Visual attention mediated by biased competition in extrastriate visual cortex. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 353(1373), 1245–1255.

  • Desimone, R., & Duncan, J. (1995). Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18(1), 193–222.

  • Dhall, A., Goecke, R., Lucey, S., & Gedeon, T. (2012). Collecting large, richly annotated facial-expression databases from movies. IEEE MultiMedia, 19(3), 34–41.

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.

  • Fang, H., Gupta, S., Iandola, F., Srivastava, R. K., Deng, L., Dollár, P., et al. (2015). From captions to visual concepts and back. In CVPR.

  • Fong, R., & Vedaldi, A. (2017). Interpretable explanations of black boxes by meaningful perturbation. arXiv:1704.03296.

  • Gonzalez-Garcia, A., Modolo, D., & Ferrari, V. (2016). Do semantic parts emerge in convolutional neural networks? arXiv:1607.03738.

  • Guillaumin, M., Küttel, D., & Ferrari, V. (2014). ImageNet auto-annotation with segmentation propagation. International Journal of Computer Vision, 110(3), 328–348.

  • He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770–778).

  • Huang, W., Bridge, C. P., Noble, J. A., & Zisserman, A. (2017). Temporal heartnet: Towards human-level automatic analysis of fetal cardiac screening video. arXiv:1707.00665.

  • Jamaludin, A., Kadir, T., & Zisserman, A. (2017). SpineNet: Automated classification and evidence visualization in spinal MRIs. Medical Image Analysis, 41, 63–73.

  • Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., et al. (2014). Caffe: Convolutional architecture for fast feature embedding. In ACM international conference on multimedia.

  • Kemeny, J. G., Snell, J. L., et al. (1960). Finite Markov chains. New York: Springer.

  • Koch, C., & Ullman, S. (1987). Shifts in selective visual attention: Towards the underlying neural circuitry. In L. M. Vaina (Ed.), Matters of intelligence. Synthese library (Studies in epistemology, logic, methodology, and philosophy of science) (vol 188, pp. 115–141). Dordrecht: Springer.

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In NIPS.

  • Levi, G., & Hassner, T. (2015). Emotion recognition in the wild via convolutional neural networks and mapped binary patterns. In Proceedings of the 2015 ACM on international conference on multimodal interaction (pp. 503–510). ACM.

  • Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). Microsoft COCO: Common objects in context. In ECCV.

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR (pp. 3431–3440).

  • Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2015). Is object localization for free? Weakly-supervised learning with convolutional neural networks. In CVPR.

  • Papandreou, G., Chen, L.-C., Murphy, K., & Yuille, A. L. (2015). Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. In ICCV.

  • Pathak, D., Krahenbuhl, P., & Darrell, T. (2015). Constrained convolutional neural networks for weakly supervised segmentation. In ICCV.

  • Pinheiro, P. O., & Collobert, R. (2014). Recurrent convolutional neural networks for scene parsing. In ICLR.

  • Pinheiro, P. O., & Collobert, R. (2015). From image-level to pixel-level labeling with convolutional networks. In CVPR.

  • Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., & Lazebnik, S. (2015). Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In CVPR.

  • Reynolds, J. H., & Heeger, D. J. (2009). The normalization model of attention. Neuron, 61(2), 168–185.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y.

  • Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2014). Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR.

  • Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR workshop.

  • Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In ICLR.

  • Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.

  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015). Going deeper with convolutions. In CVPR.

  • Treisman, A. M., & Gelade, G. (1980). A feature-integration theory of attention. Cognitive Psychology, 12(1), 97–136.

  • Tsotsos, J. K., Culhane, S. M., Wai, W. Y. K., Lai, Y., Davis, N., & Nuflo, F. (1995). Modeling visual attention via selective tuning. Artificial Intelligence, 78(1), 507–545.

  • Usher, M., & Niebur, E. (1996). Modeling the temporal dynamics of it neurons in visual search: A mechanism for top-down selective attention. Journal of Cognitive Neuroscience, 8(4), 311–327.

  • Wolfe, J. M. (1994). Guided search 2.0: A revised model of visual search. Psychonomic Bulletin and Review, 1(2), 202–238.

  • Wolfe, J. M., Butcher, S. J., Lee, C., & Hyle, M. (2003). Changing your mind: On the contributions of top-down and bottom-up guidance in visual search for feature singletons. Journal of Experimental Psychology: Human Perception and Performance, 29(2), 483.

  • Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., & Lipson, H. (2015). Understanding neural networks through deep visualization. arXiv:1506.06579

  • Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In ECCV.

  • Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2015). Object detectors emerge in deep scene CNNs. In ICLR.

  • Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In CVPR.

  • Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In NIPS.

  • Zitnick, C. L., & Dollár, P. (2014). Edge boxes: Locating object proposals from edges. In ECCV.

Acknowledgements

This research was supported in part by Adobe Research, US NSF Grants 0910908 and 1029430, and gifts from NVIDIA.

Author information

Corresponding author

Correspondence to Jianming Zhang.

Additional information

Communicated by Jiri Matas, Bastian Leibe, Max Welling and Nicu Sebe.

Electronic supplementary material

Supplementary material 1 (pdf 18049 KB)

About this article

Cite this article

Zhang, J., Bargal, S.A., Lin, Z. et al. Top-Down Neural Attention by Excitation Backprop. Int J Comput Vis 126, 1084–1102 (2018). https://doi.org/10.1007/s11263-017-1059-x
