
Zoom Out-and-In Network with Map Attention Decision for Region Proposal and Object Detection

  • Published in: International Journal of Computer Vision

Abstract

In this paper, we propose a zoom-out-and-in network for generating object proposals. A key observation is that it is difficult to classify anchors of different sizes with the same set of features. Anchors of different sizes should therefore be placed at different depths within the network: smaller boxes on high-resolution layers with a smaller stride, and larger boxes on low-resolution layers with a larger stride. Inspired by the conv/deconv structure, we fully leverage the low-level local details and high-level regional semantics from two feature map streams, which are complementary to each other, to identify objectness in an image. A map attention decision (MAD) unit is further proposed to aggressively search for neuron activations across the two streams and attend to those that contribute most to the feature learning of the final loss. The unit serves as a decision-maker that adaptively activates maps along certain channels with the sole purpose of optimizing the overall training loss. One advantage of MAD is that the learned weights enforced on each feature channel are predicted on-the-fly based on the input context, which is more suitable than the fixed weights of a convolutional kernel. Experimental results on three datasets demonstrate the effectiveness of our proposed algorithm over other state-of-the-art methods, in terms of average recall for region proposal and average precision for object detection.
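As a rough illustration only (not the authors' implementation, whose details are in the linked repository), the behaviour the abstract ascribes to the MAD unit, per-channel gating weights predicted on-the-fly from the input feature maps rather than fixed after training, can be sketched in NumPy. The global average pooling summary and the single linear predictor below are assumptions made for the sketch:

```python
import numpy as np

def mad_gate(features, w, b):
    """Sketch of a map-attention gate: predict a weight per channel
    from the input itself, then rescale the channel maps.

    features : (C, H, W) feature maps, e.g. the two streams concatenated
    w, b     : (C, C) and (C,) parameters of an assumed 1-layer predictor
    """
    # Summarise each channel by global average pooling -> (C,)
    context = features.mean(axis=(1, 2))
    # Predict a gating weight per channel from the input context
    logits = w @ context + b
    gates = 1.0 / (1.0 + np.exp(-logits))  # sigmoid, each gate in (0, 1)
    # Rescale each channel map by its own predicted weight
    return features * gates[:, None, None]

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 8))
out = mad_gate(feats, rng.standard_normal((4, 4)), np.zeros(4))
assert out.shape == feats.shape
```

The contrast with a convolution is that `w` and `b` are fixed after training, but the gates themselves change with every input, since they are computed from `features`.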




Notes

  1. https://github.com/hli2020/zoom_network.

  2. The first row is the inner product of two vectors, which yields a scalar gradient, while the second is the ordinary multiplication of a vector by a scalar, which yields a vector.

  3. Direction ‘matches’ means the included angle between the two vectors in multi-dimensional space is within \(90{^{\circ }}\); ‘departs’ means the angle falls in \([90{^{\circ }}, 180{^{\circ }}]\).
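As a numerical check (illustrative only, not taken from the paper's code), the two footnotes reduce to simple vector identities: the match/depart test is just the sign of the inner product, and the inner product yields a scalar while scalar multiplication yields a vector:

```python
import numpy as np

def direction(u, v):
    """Classify the included angle between two vectors (footnote 3).

    'matches' : angle strictly below 90 degrees  <=>  u . v > 0
    'departs' : angle in [90, 180] degrees       <=>  u . v <= 0
    """
    return "matches" if float(np.dot(u, v)) > 0 else "departs"

u = np.array([1.0, 2.0, 3.0])
assert direction(u, u) == "matches"    # angle 0
assert direction(u, -u) == "departs"   # angle 180

# Footnote 2: inner product gives a scalar; scalar * vector gives a vector.
s = float(np.dot(u, u))   # scalar result
v = 2.0 * u               # vector result, same shape as u
assert np.isscalar(s) and v.shape == u.shape
```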


Acknowledgements

We would like to thank the reviewers for their helpful comments; S. Gidaris, X. Tong and K. Kang for fruitful discussions along the way; and W. Yang for proofreading the manuscript. H. Li is funded by the Hong Kong Ph.D. Fellowship Scheme. We are also grateful to SenseTime Group Ltd. for donating the GPU resources used in this project.

Author information


Corresponding author

Correspondence to Hongyang Li.

Additional information

Communicated by Antonio Torralba.


About this article


Cite this article

Li, H., Liu, Y., Ouyang, W. et al. Zoom Out-and-In Network with Map Attention Decision for Region Proposal and Object Detection. Int J Comput Vis 127, 225–238 (2019). https://doi.org/10.1007/s11263-018-1101-7


