AutoDet: Pyramid Network Architecture Search for Object Detection

Li, Zhihang; Xi, Teng; Zhang, Gang; Liu, Jingtuo; He, Ran

doi:10.1007/s11263-020-01415-x

Published: 06 January 2021

Volume 129, pages 1087–1105, (2021)

International Journal of Computer Vision

Zhihang Li (ORCID: orcid.org/0000-0002-9305-7924)^1,2, Teng Xi^3,4, Gang Zhang^3, Jingtuo Liu^3 & Ran He^1,2

Abstract

Feature pyramids have delivered significant improvements in object detection. However, building effective feature pyramids relies heavily on expert knowledge and requires strenuous effort to balance effectiveness and efficiency. Automatic search methods, such as NAS-FPN, automate the design of feature pyramids, but their low search efficiency makes them difficult to apply in a large search space. In this paper, we propose a novel search framework for feature pyramid networks, called AutoDet, which automatically discovers informative connections between multi-scale features and configures detection architectures with both high efficiency and state-of-the-art performance. In AutoDet, a new search space, more general than that of NAS-FPN, is specifically designed for feature pyramids in object detectors. Furthermore, the architecture search process is formulated as a combinatorial optimization problem and solved by a Simulated Annealing-based Network Architecture Search method (SA-NAS). Compared with existing NAS methods, AutoDet dramatically reduces search time; for example, SA-NAS can be up to 30x faster than reinforcement learning-based approaches. AutoDet is also compatible with both one-stage and two-stage structures and with all kinds of backbone networks. We demonstrate the effectiveness of AutoDet with strong single-model results on the COCO dataset. Without pre-training on OpenImages, AutoDet with the ResNet-101 backbone achieves an AP of 39.7 and 47.3 for one-stage and two-stage architectures, respectively, surpassing current state-of-the-art methods.
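
To make the idea of treating architecture search as combinatorial optimization concrete, the following is a minimal, generic simulated annealing sketch in Python. It is not the authors' SA-NAS implementation, whose details are not given in this abstract. The binary connection encoding, the single-flip neighbor move, the geometric cooling schedule, and the `toy_score` function (a stand-in for training and evaluating a detector) are all illustrative assumptions.

```python
import math
import random


def simulated_annealing(score, initial, neighbor, steps=2000,
                        t_start=1.0, t_end=1e-3, seed=0):
    """Maximize score(state) with a generic simulated annealing loop."""
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        frac = step / max(steps - 1, 1)
        temperature = t_start * (t_end / t_start) ** frac
        candidate = neighbor(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with
        # probability exp(delta / T), which shrinks as T cools.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


def flip_one_connection(state, rng):
    """Hypothetical neighbor move: toggle one candidate connection."""
    i = rng.randrange(len(state))
    return state[:i] + (1 - state[i],) + state[i + 1:]


# Hypothetical stand-in for "train and evaluate a detector": the score is
# the number of connections that agree with a hidden target pattern.
TARGET = (1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1)


def toy_score(state):
    return sum(1 for a, b in zip(state, TARGET) if a == b)


initial_state = (0,) * len(TARGET)
best_state, best_value = simulated_annealing(
    toy_score, initial_state, flip_one_connection
)
print(best_state, best_value)
```

In a real search, each call to `score` would be an expensive train-and-evaluate step, so the number of evaluations, not the loop overhead, would dominate the cost.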

Notes

Dataset available from https://storage.googleapis.com/openimages/web/index.html.
https://lpcv.ai/2020CVPR/ovic-track.
https://rebootingcomputing.ieee.org/lpirc.

Acknowledgements

This work is partially funded by the Beijing Natural Science Foundation (Grant No. JQ18017), the National Natural Science Foundation of China (Grant No. U20A20223), and the Youth Innovation Promotion Association CAS (Grant No. Y201929).

Author information

Zhihang Li and Teng Xi have contributed equally to this work. This work was done when Z. Li was an intern at Baidu Inc.

Authors and Affiliations

NLPR, CRIPAC, CEBSIT, CAS, Beijing, China
Zhihang Li & Ran He
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Zhihang Li & Ran He
Department of Computer Vision Technology (VIS), Baidu Inc., Beijing, China
Teng Xi, Gang Zhang & Jingtuo Liu
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Teng Xi

Corresponding author

Correspondence to Ran He.

Additional information

Communicated by Mei Chen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, Z., Xi, T., Zhang, G. et al. AutoDet: Pyramid Network Architecture Search for Object Detection. Int J Comput Vis 129, 1087–1105 (2021). https://doi.org/10.1007/s11263-020-01415-x

Received: 03 January 2020
Accepted: 04 December 2020
Published: 06 January 2021
Issue Date: April 2021
DOI: https://doi.org/10.1007/s11263-020-01415-x
