Abstract
Much research on object detection focuses on building better model architectures and detection algorithms. Changing the model architecture, however, comes at the cost of adding more complexity to inference, making models slower. Data augmentation, on the other hand, adds no inference complexity, yet is insufficiently studied in object detection for two reasons. First, it is more difficult to design plausible augmentation strategies for object detection than for classification, because one must handle the complexity of bounding boxes when geometric transformations are applied. Second, data augmentation attracts less research attention, perhaps because it is believed to add less value and to transfer poorly compared to advances in network architectures.
This paper serves two main purposes. First, we propose to use AutoAugment [3] to design better data augmentation strategies for object detection, because it can address the difficulty of designing them by hand. Second, we use the method to assess the value of data augmentation in object detection and compare it against the value of architectures. Our investigation identifies two surprising results. First, by changing the data augmentation strategy to our method, AutoAugment for detection, we can improve RetinaNet with a ResNet-50 backbone from 36.7 to 39.0 mAP on COCO, a difference of +2.3 mAP. This gain exceeds the gain achieved by switching the backbone from ResNet-50 to ResNet-101 (+2.1 mAP), which incurs additional training and inference costs. The second surprising finding is that our strategies found on the COCO dataset transfer well to the PASCAL VOC dataset, improving accuracy by +2.7 mAP. These results, together with our systematic studies of data augmentation, call into question previous assumptions about the role and transferability of architectures versus data augmentation. In particular, changing the augmentation may lead to performance gains that are as transferable as those from changing the underlying architecture.
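As the abstract notes, geometric augmentation for detection must transform the bounding boxes along with the pixels. The following is a minimal illustrative sketch of that bookkeeping for a horizontal flip, not the paper's implementation; the function name and the `[x_min, y_min, x_max, y_max]` box convention are our own assumptions.

```python
import numpy as np

def hflip_with_boxes(image, boxes):
    """Flip an image left-right and mirror its bounding boxes.

    image: H x W x C array.
    boxes: N x 4 array of [x_min, y_min, x_max, y_max] in pixels.
    """
    height, width = image.shape[:2]
    flipped = image[:, ::-1, :]
    # Mirror the x-coordinates; after flipping, the old x_max edge
    # becomes the new left edge and the old x_min edge the new right edge.
    x_min = width - boxes[:, 2]
    x_max = width - boxes[:, 0]
    new_boxes = np.stack([x_min, boxes[:, 1], x_max, boxes[:, 3]], axis=1)
    return flipped, new_boxes
```

Even for this simplest geometric operation, every box coordinate must be recomputed consistently with the pixel transform; rotations and shears additionally require deciding how to re-fit an axis-aligned box around the transformed region, which is part of what makes augmentation design harder for detection than for classification.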
B. Zoph and E. D. Cubuk contributed equally.
Code and models are available at https://github.com/tensorflow/tpu/tree/master/models/official/detection.
Notes
1. In this paper, we use "policy" and "strategy" interchangeably.
2. The color transformations largely derive from transformations in the Python Image Library (PIL): https://pillow.readthedocs.io/en/5.1.x/.
3. We employed a base learning rate of 0.08 over 150 epochs; the image size was \(640 \times 640\); \(\alpha = 0.25\) and \(\gamma = 1.5\) were used for the focal loss parameters; the weight decay was \(1 \times 10^{-4}\); the batch size was 64.
4.
5. Specifically, we increase the number of anchors from \(3 \times 3\) to \(9 \times 9\) by changing the aspect ratios from {1/2, 1, 2} to {1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5}. When making this change, we increased the strictness of the IoU thresholding from 0.5/0.5 to 0.6/0.5 to account for the increased number of anchors, following [44]. The anchor scale was also increased from 4 to 5 to compensate for the larger image size.
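Footnote 3 quotes the focal loss parameters \(\alpha = 0.25\) and \(\gamma = 1.5\). For reference, here is a minimal NumPy sketch of the binary focal loss (Lin et al.) with those values as defaults; this is an illustration, not the paper's training code, and the function name and signature are our own.

```python
import numpy as np

def focal_loss(probs, labels, alpha=0.25, gamma=1.5, eps=1e-7):
    """Per-anchor binary focal loss.

    probs: predicted foreground probabilities in (0, 1).
    labels: 1 for positive anchors, 0 for negative anchors.
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    # p_t is the probability the model assigns to the true class.
    p_t = np.where(labels == 1, probs, 1.0 - probs)
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)
    # The (1 - p_t)^gamma factor down-weights easy, well-classified
    # examples so training focuses on hard anchors.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

A well-classified positive (e.g. probability 0.9) thus incurs a much smaller loss than a hard one (e.g. probability 0.6), which is the behavior the \(\gamma\) modulation is meant to produce.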
References
Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI 2016, pp. 265–283. USENIX Association, Berkeley (2016)
Ciregan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for image classification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3642–3649. IEEE (2012)
Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: AutoAugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501 (2018)
Cubuk, E.D., Zoph, B., Schoenholz, S.S., Le, Q.V.: Intriguing properties of adversarial examples. arXiv preprint arXiv:1711.02846 (2017)
DeVries, T., Taylor, G.W.: Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538 (2017)
DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
Dwibedi, D., Misra, I., Hebert, M.: Cut, paste and learn: surprisingly easy synthesis for instance detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1301–1310 (2017)
Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. Int. J. Comput. Vision 88(2), 303–338 (2010)
Ghiasi, G., Lin, T.Y., Le, Q.V.: DropBlock: a regularization method for convolutional networks. In: Advances in Neural Information Processing Systems, pp. 10750–10760 (2018)
Ghiasi, G., Lin, T.Y., Pang, R., Le, Q.V.: NAS-FPN: learning scalable feature pyramid architecture for object detection. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019
Girshick, R.: Fast R-CNN. In: The IEEE International Conference on Computer Vision (ICCV), December 2015
Girshick, R., Radosavovic, I., Gkioxari, G., Dollár, P., He, K.: Detectron (2018)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
Ho, D., Liang, E., Stoica, I., Abbeel, P., Chen, X.: Population based augmentation: efficient learning of augmentation policy schedules. arXiv preprint arXiv:1905.05393 (2019)
Huang, J., et al.: Speed/accuracy trade-offs for modern convolutional object detectors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7310–7311 (2017)
Inoue, H.: Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929 (2018)
Jouppi, N.P., et al.: In-datacenter performance analysis of a tensor processing unit. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1–12. IEEE (2017)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (2012)
Lemley, J., Bazrafkan, S., Corcoran, P.: Smart augmentation learning an optimal data augmentation strategy. IEEE Access 5, 5858–5869 (2017)
Lim, S., Kim, I., Kim, T., Kim, C., Kim, S.: Fast AutoAugment. arXiv preprint arXiv:1905.00397 (2019)
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, C., et al.: Progressive neural architecture search. arXiv preprint arXiv:1712.00559 (2017)
Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
Miyato, T., Maeda, S.i., Koyama, M., Ishii, S.: Virtual adversarial training: a regularization method for supervised and semi-supervised learning. In: International Conference on Learning Representations (2016)
Peng, C., et al.: MegDet: a large mini-batch object detector. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018
Ratner, A.J., Ehrenberg, H., Hussain, Z., Dunnmon, J., Ré, C.: Learning to compose domain-specific transformations for data augmentation. In: Advances in Neural Information Processing Systems, pp. 3239–3249 (2017)
Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. In: Thirty-Third AAAI Conference on Artificial Intelligence (2019)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016
Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017
Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. CoRR abs/1804.02767 (2018). http://arxiv.org/abs/1804.02767
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
Sato, I., Nishimura, H., Yokoi, K.: APAC: augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229 (2015)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
Simard, P.Y., Steinkraus, D., Platt, J.C., et al.: Best practices for convolutional neural networks applied to visual document analysis. In: Proceedings of International Conference on Document Analysis and Recognition (2003)
Tan, M., Pang, R., Le, Q.V.: EfficientDet: scalable and efficient object detection. arXiv preprint arXiv:1911.09070 (2019)
Tran, T., Pham, T., Carneiro, G., Palmer, L., Reid, I.: A Bayesian data augmentation approach for learning deep models. In: Advances in Neural Information Processing Systems, pp. 2794–2803 (2017)
Verma, V., Lamb, A., Beckham, C., Courville, A., Mitliagkas, I., Bengio, Y.: Manifold mixup: encouraging meaningful on-manifold interpolation as a regularizer. arXiv preprint arXiv:1806.05236 (2018)
Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., Fergus, R.: Regularization of neural networks using DropConnect. In: International Conference on Machine Learning, pp. 1058–1066 (2013)
Wang, X., Shrivastava, A., Gupta, A.: A-fast-RCNN: hard positive generation via adversary for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2606–2615 (2017)
Yang, T., Zhang, X., Li, Z., Zhang, W., Sun, J.: MetaAnchor: learning to detect objects with customized anchors. In: Advances in Neural Information Processing Systems, pp. 318–328 (2018)
Zagoruyko, S., Komodakis, N.: Wide residual networks. In: British Machine Vision Conference (2016)
Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. arXiv preprint arXiv:1708.04896 (2017)
Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. In: International Conference on Learning Representations (2017)
Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2017)
Acknowledgements
We thank the larger teams at Google Brain for their help and support. We also thank Dumitru Erhan for detailed comments on the manuscript.
A Appendix
A.1 AutoAugment Controller Training Details
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Zoph, B., Cubuk, E.D., Ghiasi, G., Lin, T.Y., Shlens, J., Le, Q.V. (2020). Learning Data Augmentation Strategies for Object Detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12372. Springer, Cham. https://doi.org/10.1007/978-3-030-58583-9_34
Print ISBN: 978-3-030-58582-2
Online ISBN: 978-3-030-58583-9