Abstract
Pretraining on a large dataset is the first stage of many computer vision tasks such as classification, detection, and segmentation. Conventional pretraining relies on large datasets with human annotations. In this context, self-supervised learning, which pretrains models on unlabeled data, has shown increasing promise. Within self-supervised learning, image-level contrastive representation learning has emerged as a highly effective approach for general transfer learning. However, it may lack specificity for a particular downstream task, compromising performance on that task. Recently, an object-level self-supervised pretraining framework called SoCo was proposed for object detection. To achieve object-level pretraining, it adopts the traditional selective search algorithm to generate object proposals, which incurs high space and time costs and also prevents end-to-end training toward a global optimum. In this work, we propose an end-to-end object-level contrastive pretraining framework for detection that obtains object proposals from the pretraining network itself. Specifically, we use the heat map derived from the features of the last convolutional layer of the backbone as semantic information to roughly localize objects, and then generate promising proposals with center-suppressed sampling and multiple cropping strategies. Experimental results show that our method achieves better performance with significantly lower training space and time costs.
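To make the proposal-generation step concrete, below is a minimal, hypothetical PyTorch sketch of the idea the abstract describes: a channel-averaged heat map over the last backbone feature map picks rough object centers, center-suppressed sampling zeroes out a neighborhood around each chosen center so subsequent proposals come from different regions, and several crop scales are placed around each center. The function name and hyperparameters (num_proposals, suppress_radius, scales) are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def heatmap_proposals(feat, num_proposals=8, suppress_radius=2,
                      scales=(0.1, 0.2, 0.4), image_size=224):
    """Rough object proposals from a (C, H, W) backbone feature map."""
    C, H, W = feat.shape
    # Channel-averaged activations serve as a coarse semantic heat map.
    heat = feat.mean(dim=0)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)

    boxes = []
    for _ in range(num_proposals):
        # Take the current hottest cell as a candidate object center.
        idx = int(torch.argmax(heat))
        cy, cx = divmod(idx, W)
        # Center-suppressed sampling: zero a neighborhood around the
        # chosen center so later proposals land in other regions.
        heat[max(0, cy - suppress_radius):cy + suppress_radius + 1,
             max(0, cx - suppress_radius):cx + suppress_radius + 1] = 0.0
        # Multiple cropping: several box scales around the same center,
        # mapped from feature-map cells back to image coordinates.
        px = (cx + 0.5) / W * image_size
        py = (cy + 0.5) / H * image_size
        for s in scales:
            half = s * image_size / 2
            boxes.append((max(0.0, px - half), max(0.0, py - half),
                          min(float(image_size), px + half),
                          min(float(image_size), py + half)))
    return boxes

# Example: proposals from a random tensor standing in for real features.
feat = torch.rand(2048, 7, 7)
print(heatmap_proposals(feat)[:3])
```

In a full pretraining pipeline, boxes like these would presumably be mapped onto the augmented views and used as positive pairs in an object-level contrastive loss, in the spirit of SoCo.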
References
Zaidi, S.S.A., Ansari, M.S., Aslam, A., Kanwal, N., Asghar, M., Lee, B.: A survey of modern deep learning based object detection models. Digit. Signal Process. 126, 103514 (2022)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
Wei, F., Gao, Y., Wu, Z., Hu, H., Lin, S.: Aligning pretraining for detection via object-level contrastive learning. Adv. Neural. Inf. Process. Syst. 34, 22682–22694 (2021)
Uijlings, J.R., Van De Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. Int. J. Comput. Vis. 104(2), 154–171 (2013)
Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Huang, G., Laradji, I., Vazquez, D., Lacoste-Julien, S., Rodriguez, P.: A survey of self-supervised and few-shot object detection (2021). arXiv preprint arXiv:2110.14711
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning (PMLR), pp. 1597–1607 (2020)
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural. Inf. Process. Syst. 33, 9912–9924 (2020)
Grill, J.B., et al.: Bootstrap your own latent: a new approach to self-supervised learning. Adv. Neural. Inf. Process. Syst. 33, 21271–21284 (2020)
Dai, Z., Cai, B., Lin, Y., Chen, J.: UP-DETR: unsupervised pretraining for object detection with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1601–1610 (2021)
Wang, X., Zhang, R., Shen, C., Kong, T., Li, L.: Dense contrastive learning for self-supervised visual pretraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3024–3033 (2021)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
Peng, X., Wang, K., Zhu, Z., Wang, M., You, Y.: Crafting better contrastive views for siamese representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16031–16040 (2022)
Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum contrastive learning (2020). arXiv preprint arXiv:2003.04297
Acknowledgments
This work was supported by Scientific Research Project of Beijing Educational Committee (KM202011232014).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Geng, L., Huang, X. (2024). End-to-End Object-Level Contrastive Pretraining for Detection via Semantic-Aware Localization. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds.) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science, vol. 14473. Springer, Singapore. https://doi.org/10.1007/978-981-99-8850-1_24
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8849-5
Online ISBN: 978-981-99-8850-1
eBook Packages: Computer Science, Computer Science (R0)