Top-Down Neural Attention by Excitation Backprop
We aim to model the top-down attention of a convolutional neural network (CNN) classifier for generating task-specific attention maps. Inspired by a top-down human visual attention model, we propose a new backpropagation scheme, called Excitation Backprop, that passes top-down signals down the network hierarchy via a probabilistic Winner-Take-All process. Furthermore, we introduce the concept of contrastive attention to make the top-down attention maps more discriminative. We show a theoretical connection between the proposed contrastive attention formulation and the Class Activation Map computation. We also present efficient implementations of Excitation Backprop for common neural network layers. In experiments, we visualize the evidence for a model's classification decision by computing the proposed top-down attention maps. For quantitative evaluation, we report the accuracy of our method on weakly supervised localization tasks on the MS COCO, PASCAL VOC07, and ImageNet datasets. The usefulness of our method is further validated in the text-to-region association task. On the Flickr30k Entities dataset, we achieve promising phrase localization performance by leveraging the top-down attention of a CNN model trained on weakly labeled web images. Finally, we demonstrate applications of our method in model interpretation and data annotation assistance for facial expression analysis and medical imaging tasks.
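To make the probabilistic Winner-Take-All rule concrete, below is a minimal NumPy sketch of one Excitation Backprop step through a fully connected layer. The function name `excitation_backprop_fc` and the toy data are our own illustrations, not from the paper, and the sketch assumes non-negative (e.g., post-ReLU) activations; the paper's efficient implementation also covers convolutional and pooling layers. Each parent neuron distributes its winning probability among its children in proportion to activation times positive weight.

```python
import numpy as np

def excitation_backprop_fc(p_top, activations, weights):
    """One Excitation Backprop step through a fully connected layer.

    p_top       : (m,) marginal winning probabilities of the top layer
    activations : (n,) bottom-layer activations, assumed non-negative
                  (e.g., post-ReLU)
    weights     : (m, n) weight matrix mapping bottom to top
    returns     : (n,) marginal winning probabilities of the bottom layer
    """
    w_plus = np.maximum(weights, 0.0)   # only excitatory connections pass signal
    z = w_plus * activations[None, :]   # unnormalized P(child j | parent i) ~ a_j * w_ij^+
    denom = z.sum(axis=1, keepdims=True)
    denom[denom == 0.0] = 1.0           # guard: parents with no excited children
    cond = z / denom                    # normalize over the children of each parent
    return cond.T @ p_top               # marginalize: P(a_j) = sum_i P(a_j | a_i) P(a_i)
```

A hypothetical toy usage, which also sketches contrastive attention as the positive part of the difference between the map propagated from the target class and the map propagated from a dual unit with negated classifier weights:

```python
rng = np.random.default_rng(0)
a = np.array([0.5, 0.0, 1.2, 0.3])            # toy post-ReLU activations
W = rng.standard_normal((2, 4))               # toy classifier weights (2 classes)
p_class = np.array([1.0, 0.0])                # all winning probability on the target class

mwp = excitation_backprop_fc(p_class, a, W)    # marginal winning probability map
dual = excitation_backprop_fc(p_class, a, -W)  # dual unit: negated classifier weights
contrastive = np.maximum(mwp - dual, 0.0)      # contrastive attention: positive difference
```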
Keywords: Convolutional neural network · Top-down attention · Selective tuning
This research was supported in part by Adobe Research, US NSF Grants 0910908 and 1029430, and gifts from NVIDIA.