
Zoom Out-and-In Network with Map Attention Decision for Region Proposal and Object Detection

  • Published in: International Journal of Computer Vision

Abstract

In this paper, we propose a zoom-out-and-in network for generating object proposals. A key observation is that it is difficult to classify anchors of different sizes with the same set of features. Anchors of different sizes should therefore be placed at different depths within the network: smaller boxes on high-resolution layers with a smaller stride, and larger boxes on low-resolution layers with a larger stride. Inspired by the conv/deconv structure, we fully leverage the low-level local details and high-level regional semantics from two feature map streams, which are complementary to each other, to identify objectness in an image. A map attention decision (MAD) unit is further proposed to aggressively search for neuron activations across the two streams and attend to those that contribute most to the feature learning of the final loss. The unit serves as a decision-maker that adaptively activates maps along certain channels with the sole purpose of optimizing the overall training loss. One advantage of MAD is that the learned weights enforced on each feature channel are predicted on-the-fly based on the input context, which is more suitable than the fixed weights of a convolutional kernel. Experimental results on three datasets demonstrate the effectiveness of our proposed algorithm over other state-of-the-art methods, in terms of average recall for region proposal and average precision for object detection.
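As a rough illustration only (not the authors' implementation, whose details are in the linked repository), the behaviour the abstract ascribes to the MAD unit, per-channel gating weights predicted on-the-fly from the input feature maps rather than fixed after training, can be sketched in NumPy. The global average pooling summary and the single linear predictor below are assumptions made for the sketch:

```python
import numpy as np

def mad_gate(features, w, b):
    """Sketch of a map-attention gate: predict a weight per channel
    from the input itself, then rescale the channel maps.

    features : (C, H, W) feature maps, e.g. the two streams concatenated
    w, b     : (C, C) and (C,) parameters of an assumed 1-layer predictor
    """
    # Summarise each channel by global average pooling -> (C,)
    context = features.mean(axis=(1, 2))
    # Predict a gating weight per channel from the input context
    logits = w @ context + b
    gates = 1.0 / (1.0 + np.exp(-logits))  # sigmoid, each gate in (0, 1)
    # Rescale each channel map by its own predicted weight
    return features * gates[:, None, None]

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 8))
out = mad_gate(feats, rng.standard_normal((4, 4)), np.zeros(4))
assert out.shape == feats.shape
```

The contrast with a convolution is that `w` and `b` are fixed after training, but the gates themselves change with every input, since they are computed from `features`.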




Notes

  1. https://github.com/hli2020/zoom_network.

  2. The first row is the inner product of two vectors, which yields a scalar gradient, while the second is the ordinary multiplication of a vector by a scalar, which yields a vector.

  3. Direction ‘matches’ means the included angle between the two vectors in multi-dimensional space is within \(90{^{\circ }}\); ‘departs’ means the angle falls in \([90{^{\circ }}, 180{^{\circ }}]\).
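As a numerical check (illustrative only, not taken from the paper's code), the two footnotes reduce to simple vector identities: the match/depart test is just the sign of the inner product, and the inner product yields a scalar while scalar multiplication yields a vector:

```python
import numpy as np

def direction(u, v):
    """Classify the included angle between two vectors (footnote 3).

    'matches' : angle strictly below 90 degrees  <=>  u . v > 0
    'departs' : angle in [90, 180] degrees       <=>  u . v <= 0
    """
    return "matches" if float(np.dot(u, v)) > 0 else "departs"

u = np.array([1.0, 2.0, 3.0])
assert direction(u, u) == "matches"    # angle 0
assert direction(u, -u) == "departs"   # angle 180

# Footnote 2: inner product gives a scalar; scalar * vector gives a vector.
s = float(np.dot(u, u))   # scalar result
v = 2.0 * u               # vector result, same shape as u
assert np.isscalar(s) and v.shape == u.shape
```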


Acknowledgements

We would like to thank the reviewers for their helpful comments; S. Gidaris, X. Tong and K. Kang for fruitful discussions along the way; and W. Yang for proofreading the manuscript. H. Li is funded by the Hong Kong Ph.D. Fellowship Scheme. We are also grateful to SenseTime Group Ltd. for donating the GPU resources used in this project.

Author information


Corresponding author

Correspondence to Hongyang Li.

Additional information

Communicated by Antonio Torralba.


About this article


Cite this article

Li, H., Liu, Y., Ouyang, W. et al. Zoom Out-and-In Network with Map Attention Decision for Region Proposal and Object Detection. Int J Comput Vis 127, 225–238 (2019). https://doi.org/10.1007/s11263-018-1101-7


