AutoDet: Pyramid Network Architecture Search for Object Detection

International Journal of Computer Vision

Abstract

Feature pyramids have delivered significant improvements in object detection. However, building effective feature pyramids relies heavily on expert knowledge and requires strenuous effort to balance effectiveness and efficiency. Automatic search methods, such as NAS-FPN, automate the design of feature pyramids, but their low search efficiency makes them difficult to apply in a large search space. In this paper, we propose a novel search framework for feature pyramid networks, called AutoDet, which enables automatic discovery of informative connections between multi-scale features and configures detection architectures with both high efficiency and state-of-the-art performance. In AutoDet, a new search space, more general than that of NAS-FPN, is specifically designed for feature pyramids in object detectors. Furthermore, the architecture search process is formulated as a combinatorial optimization problem and solved by a Simulated Annealing-based Network Architecture Search method (SA-NAS). Compared with existing NAS methods, AutoDet dramatically reduces search time. For example, our SA-NAS can be up to 30x faster than reinforcement learning-based approaches. Furthermore, AutoDet is compatible with both one-stage and two-stage structures and with all kinds of backbone networks. We demonstrate the effectiveness of AutoDet with single-model results on the COCO dataset that outperform existing methods. Without pre-training on OpenImages, AutoDet with the ResNet-101 backbone achieves an AP of 39.7 and 47.3 for one-stage and two-stage architectures, respectively, surpassing current state-of-the-art methods.
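
To make the idea of solving a combinatorial search with simulated annealing concrete, the following is a minimal Python sketch of generic simulated annealing over a binary connection mask. It is not the SA-NAS procedure of the paper, whose details are not given in this abstract. The mask encoding, the single-bit-flip neighbourhood, the geometric cooling schedule, the `toy_score` proxy, and all parameter names are assumptions made purely for illustration; in an actual architecture search the score would come from training and evaluating a candidate detector.

```python
import math
import random


def simulated_annealing(score, n_bits, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Maximize `score` over binary vectors of length `n_bits` (generic SA, illustrative only)."""
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(n_bits)]
    current_score = score(current)
    best, best_score = list(current), current_score

    for step in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))

        # Propose a neighbour by toggling one randomly chosen connection.
        candidate = list(current)
        i = rng.randrange(n_bits)
        candidate[i] = 1 - candidate[i]
        candidate_score = score(candidate)

        # Metropolis rule: always accept improvements, and accept a worse
        # candidate with probability exp(delta / t), which shrinks as t cools.
        delta = candidate_score - current_score
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = list(current), current_score

    return best, best_score


# Hypothetical stand-in for "detector quality": the number of connections that
# agree with an arbitrary target pattern. A real search would use validation AP.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]


def toy_score(mask):
    return sum(1 for m, t in zip(mask, TARGET) if m == t)


best_mask, best_value = simulated_annealing(toy_score, len(TARGET))
print(best_mask, best_value)
```

The early, high-temperature steps accept many worsening moves and so explore the space broadly, while the late, low-temperature steps behave almost like greedy hill climbing. Each step needs only one score evaluation, which is one reason annealing-style searches can be cheaper than approaches that must train a separate controller.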

Notes

  1. Dataset available from https://storage.googleapis.com/openimages/web/index.html.

  2. https://lpcv.ai/2020CVPR/ovic-track.

  3. https://rebootingcomputing.ieee.org/lpirc.

References

  • Adelson, E. H., Anderson, C. H., Bergen, J. R., Burt, P. J., & Ogden, J. M. (1984). Pyramid methods in image processing. RCA Engineer, 29(6), 33–41.


  • Baker, B., Gupta, O., Naik, N., & Raskar, R. (2017). Designing neural network architectures using reinforcement learning. In ICLR.

  • Bell, S., Lawrence Zitnick, C., Bala, K., & Girshick, R. (2016). Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR.

  • Bender, G., Kindermans, P. J., Zoph, B., Vasudevan, V., & Le, Q. (2018). Understanding and simplifying one-shot architecture search. In ICML.

  • Bodla, N., Singh, B., Chellappa, R., & Davis, L. S. (2017). Soft-nms: Improving object detection with one line of code. ICCV, 5561–5569.

  • Brock, A., Lim, T., Ritchie, J. M., & Weston, N. J. (2018). Smash: One-shot model architecture search through hypernetworks. In ICLR.

  • Cai, Z., & Vasconcelos, N. (2018). Cascade r-cnn: Delving into high quality object detection. CVPR, 6154–6162.

  • Cai, Z., Fan, Q., Feris, R. S., & Vasconcelos, N. (2016). A unified multi-scale deep convolutional neural network for fast object detection. In ECCV.

  • Cai, H., Yang, J., Zhang, W., Han, S., & Yu, Y. (2018). Path-level network transformation for efficient architecture search. ICML, 677–686.

  • Chen, L. C., Collins, M., Zhu, Y., Papandreou, G., Zoph, B., Schroff, F., et al. (2018). Searching for efficient multi-scale architectures for dense image prediction. NIPS, 8713–8724.

  • Chen, Y., Yang, T., Zhang, X., Meng, G., Xiao, X., & Sun, J. (2019). Detnas: Backbone search for object detection. NIPS, 6638–6648.

  • Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Liu, W., & Chua, T. S. (2016). Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. arXiv preprint arXiv:1611.05594.

  • Chopard, B., & Tomassini, M. (2018). Simulated annealing. In An introduction to metaheuristics for optimization (pp. 59–79). Springer.

  • Dai, J., Li, Y., He, K., & Sun, J. (2016). R-fcn: Object detection via region-based fully convolutional networks. In NIPS.

  • Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., et al. (2017). Deformable convolutional networks. CVPR, 764–773.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR.

  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In CVPR (pp. 248–255).

  • Dollár, P., Appel, R., Belongie, S., & Perona, P. (2014). Fast feature pyramids for object detection. TPAMI, 36(8), 1532–1545.


  • Dong, X., & Yang, Y. (2019). Searching for a robust neural architecture in four gpu hours. CVPR, 1761–1770.

  • Elsken, T., Metzen, J. H., & Hutter, F. (2018). Efficient multi-objective neural architecture search via lamarckian evolution. In ICLR.

  • Elsken, T., Metzen, J. H., & Hutter, F. (2019). Neural architecture search: A survey. JMLR, 20(55), 1–21.


  • Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. IJCV, 111(1), 98–136.


  • Fu, C. Y., Liu, W., Ranga, A., Tyagi, A., & Berg, A. C. (2017). Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659.

  • Ghiasi, G., Lin, T. Y., & Le, Q. V. (2019). Nas-fpn: Learning scalable feature pyramid architecture for object detection. CVPR, 7036–7045.

  • Girshick, R. (2015). Fast r-cnn. ICCV, 1440–1448.

  • Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR.

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. In ICCV.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.

  • Howard, A., Sandler, M., Chu, G., Chen, L. C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., Le, Q. V., & Adam, H. (2019). Searching for mobilenetv3. In ICCV.

  • Hu, Y., Wu, X., & He, R. (2020). Tf-nas: Rethinking three search freedoms of latency-constrained differentiable neural architecture search. In ECCV.

  • Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., et al. (2017). Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR.

  • Jenatton, R., Archambeau, C., González, J., & Seeger, M. (2017). Bayesian optimization with tree-structured dependencies. ICML, 1655–1664.

  • Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., & Schapire, R. E. (2017). Contextual decision processes with low bellman rank are pac-learnable. In ICML (pp. 1704–1713). JMLR.org.

  • Kong, T., Sun, F., Yao, A., Liu, H., Lu, M., & Chen, Y. (2017). Ron: Reverse connection with objectness prior networks for object detection. CVPR, 5936–5944.

  • Kong, T., Yao, A., Chen, Y., & Sun, F. (2016). Hypernet: Towards accurate region proposal generation and joint object detection. In CVPR.

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In NIPS.

  • Law, H., & Deng, J. (2019). Cornernet: Detecting objects as paired keypoints. In IJCV.

  • Li, Z., & Zhou, F. (2017). Fssd: Feature fusion single shot multibox detector. arXiv preprint arXiv:1712.00960.

  • Li, Z., Xi, T., Deng, J., Zhang, G., Wen, S., & He, R. (2020). Gp-nas: Gaussian process based neural architecture search. In CVPR.

  • Li, S., Yang, L., Huang, J., Hua, X. S., & Zhang, L. (2019). Dynamic anchor feature selection for single-shot object detection. ICCV, 6609–6618.

  • Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. CVPR, 2117–2125.

  • Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. ICCV, 2980–2988.

  • Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2018). Focal loss for dense object detection. In TPAMI.

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In ECCV.

  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). Ssd: Single shot multibox detector. In ECCV.

  • Liu, C., Chen, L. C., Schroff, F., Adam, H., Hua, W., Yuille, A., & Fei-Fei, L. (2019). Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. arXiv preprint arXiv:1901.02985.

  • Liu, S., Huang, D., et al. (2018). Receptive field block net for accurate and fast object detection. ECCV, 385–400.

  • Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X., & Pietikäinen, M. (2019). Deep learning for generic object detection: A survey. In IJCV.

  • Liu, S., Qi, L., Qin, H., Shi, J., & Jia, J. (2018). Path aggregation network for instance segmentation. In CVPR.

  • Liu, H., Simonyan, K., & Yang, Y. (2018). Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055.

  • Liu, H., Simonyan, K., Vinyals, O., Fernando, C., & Kavukcuoglu, K. (2017). Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436.

  • Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L. J., et al. (2018). Progressive neural architecture search. ECCV, 19–34.

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR.

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. IJCV, 60(2), 91–110.


  • Luo, R., Tian, F., Qin, T., Chen, E., & Liu, T. Y. (2018). Neural architecture optimization. NIPS, 7816–7827.

  • Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks for human pose estimation. In ECCV.

  • Pang, Y., Wang, T., Anwer, R. M., Khan, F. S., & Shao, L. (2019). Efficient featurized image pyramid network for single shot detector. CVPR, 7336–7344.

  • Peng, C., Xiao, T., Li, Z., Jiang, Y., Zhang, X., Jia, K., et al. (2018). Megdet: A large mini-batch object detector. CVPR, 6181–6189.

  • Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., & Dean, J. (2018). Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268.

  • Real, E., Aggarwal, A., Huang, Y., & Le, Q. V. (2018). Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548.

  • Redmon, J., & Farhadi, A. (2017). Yolo9000: Better, faster, stronger. CVPR, 7263–7271.

  • Redmon, J., & Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767.

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS.

  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In MICCAI.

  • Shen, Z., Liu, Z., Li, J., Jiang, Y. G., Chen, Y., & Xue, X. (2017). Dsod: Learning deeply supervised object detectors from scratch. In ICCV.

  • Shrivastava, A., Sukthankar, R., Malik, J., & Gupta, A. (2016). Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851.

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

  • Singh, B., & Davis, L. S. (2018). An analysis of scale invariance in object detection snip. CVPR, 3578–3587.

  • Singh, B., Najibi, M., & Davis, L. S. (2018). Sniper: Efficient multi-scale training. NIPS, 9333–9343.

  • Suganuma, M., Ozay, M., & Okatani, T. (2018). Exploiting the potential of standard convolutional autoencoders for image restoration by evolutionary search. ICML, 4778–4787.

  • Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI.

  • Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. ICML, 6105–6114.

  • Tan, M., Chen, B., Pang, R., Vasudevan, V., & Le, Q. V. (2018). Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626.

  • Tan, M., Pang, R., & Le, Q. V. (2020). Efficientdet: Scalable and efficient object detection. In CVPR.

  • Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., & Tang, X. (2017). Residual attention network for image classification. arXiv preprint arXiv:1704.06904.

  • Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., et al. (2019). Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. CVPR, 10734–10742.

  • Xie, L., & Yuille, A. (2017). Genetic cnn. ICCV, 1379–1388.

  • Xie, S., Zheng, H., Liu, C., & Lin, L. (2018). Snas: Stochastic neural architecture search.

  • Zhang, Z., Qiao, S., Xie, C., Shen, W., Wang, B., & Yuille, A. L. (2018). Single-shot object detection with enriched semantics. CVPR, 5813–5821.

  • Zhang, C., Ren, M., & Urtasun, R. (2019). Graph hypernetworks for neural architecture search. In ICLR.

  • Zhang, S., Wen, L., Bian, X., Lei, Z., & Li, S. Z. (2018). Single-shot refinement neural network for object detection. CVPR, 4203–4212.

  • Zhao, Q., Sheng, T., Wang, Y., Tang, Z., Chen, Y., Cai, L., & Ling, H. (2019). M2det: A single-shot object detector based on multi-level feature pyramid network. In AAAI.

  • Zheng, X., Ji, R., Tang, L., Zhang, B., Liu, J., & Tian, Q. (2019). Multinomial distribution learning for effective neural architecture search. In ICCV.

  • Zhong, Z., Yan, J., Wu, W., Shao, J., & Liu, C. L. (2018). Practical block-wise neural network architecture generation. CVPR, 2423–2432.

  • Zhu, Y., Zhao, C., Wang, J., Zhao, X., Wu, Y., & Lu, H. (2017). Couplenet: Coupling global structure with local parts for object detection. ICCV, 4126–4134.

  • Zoph, B., & Le, Q. V. (2016). Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.

  • Zoph, B., Vasudevan, V., Shlens, J., & Le, Q. V. (2018). Learning transferable architectures for scalable image recognition. CVPR, 8697–8710.

Acknowledgements

This work is partially funded by Beijing Natural Science Foundation (Grant No. JQ18017), National Natural Science Foundation of China (Grant No. U20A20223), and Youth Innovation Promotion Association CAS (Grant No. Y201929).

Author information

Corresponding author

Correspondence to Ran He.

Additional information

Communicated by Mei Chen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, Z., Xi, T., Zhang, G. et al. AutoDet: Pyramid Network Architecture Search for Object Detection. Int J Comput Vis 129, 1087–1105 (2021). https://doi.org/10.1007/s11263-020-01415-x

  • DOI: https://doi.org/10.1007/s11263-020-01415-x
