Abstract
The Transformer was proposed to exploit the attention mechanism in neural networks without recurrence or convolutions. First applied to machine translation, it has since been extended to vision tasks in the form of vision transformers. Among the vision transformers, we explore the DEtection TRansformer (DETR) model proposed in the paper "End-to-End Object Detection with Transformers" by the team at Facebook AI. The authors demonstrated interesting object detection results with DETR, which prompted us to use the model for detecting custom objects. Here, we present a way to fine-tune the pre-trained DETR model on a custom dataset. The fine-tuning results improve significantly as the number of training epochs increases, both visually and statistically.
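The abstract does not spell out the fine-tuning recipe, so the following is only a minimal sketch of one common approach: discard the pre-trained classification head and keep all other weights, so that a model built for a different number of classes can reuse them. The helper name `strip_class_head` is hypothetical, and the `class_embed.` prefix is an assumption based on the parameter naming in the public facebookresearch/detr code; the toy dictionary stands in for a real checkpoint.

```python
from typing import Any, Dict, Mapping

# Assumption: parameter names follow the public facebookresearch/detr code,
# where the classification head is the `class_embed` linear layer.
CLASS_HEAD_PREFIX = "class_embed."


def strip_class_head(state_dict: Mapping[str, Any],
                     prefix: str = CLASS_HEAD_PREFIX) -> Dict[str, Any]:
    """Return a copy of `state_dict` without the classification-head entries.

    The remaining weights (backbone, transformer, box head) can then be loaded
    into a model built for a different number of classes, leaving the new
    classification head at its fresh initialisation.
    """
    return {name: value for name, value in state_dict.items()
            if not name.startswith(prefix)}


if __name__ == "__main__":
    # Toy stand-in for a checkpoint's weight dictionary; real values are tensors.
    pretrained = {
        "backbone.0.body.conv1.weight": "w0",
        "transformer.encoder.layers.0.linear1.weight": "w1",
        "class_embed.weight": "head_w",
        "class_embed.bias": "head_b",
        "bbox_embed.layers.0.weight": "w2",
    }
    kept = strip_class_head(pretrained)
    print(sorted(kept))  # head entries are gone, everything else is kept
```

With PyTorch, the same filter would typically be applied to the weight dictionary of a loaded checkpoint, and the result passed to `load_state_dict(..., strict=False)` so that the missing head parameters are tolerated. Whether the authors followed this exact procedure is not stated in the abstract.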
References
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation (2013). (Online). Available: http://arxiv.org/abs/1311.2524
Girshick, R.: Fast R-CNN (2015). (Online). Available: http://arxiv.org/abs/1504.08083
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017). https://doi.org/10.1109/TPAMI.2016.2577031
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 42(2), 386–397 (2020). https://doi.org/10.1109/TPAMI.2018.2844175
Uijlings, J.R.R., van de Sande, K.E.A., Gevers, T., Smeulders, A.W.M.: Selective search for object recognition. Int. J. Comput. Vision 104(2), 154–171 (2013). https://doi.org/10.1007/s11263-013-0620-5
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You Only Look Once: Unified, Real-Time Object Detection (2015). (Online). Available: http://arxiv.org/abs/1506.02640
Redmon, J., Farhadi, A.: YOLOv3: An incremental improvement. arXiv (2018)
Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017-Janua, pp. 6517–6525 (2017). https://doi.org/10.1109/CVPR.2017.690
Liu, W., et al.: SSD: single shot multibox detector. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9905 LNCS, pp. 21–37 (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Vaswani, A., et al.: Attention Is All You Need (2017). (Online). Available: http://arxiv.org/abs/1706.03762
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-End Object Detection with Transformers (2020). (Online). Available: http://arxiv.org/abs/2005.12872
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: Deformable Transformers for End-to-End Object Detection (2020). (Online). Available: http://arxiv.org/abs/2010.04159
El-Nouby, A., et al.: XCiT: Cross-Covariance Image Transformers (2021). (Online). Available: http://arxiv.org/abs/2106.09681
Li, Y., Zhang, K., Cao, J., Timofte, R., van Gool, L.: LocalViT: Bringing Locality to Vision Transformers (2021). (Online). Available: http://arxiv.org/abs/2104.05707
Wang, W., et al.: Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021). (Online). Available: http://arxiv.org/abs/2102.12122
Zhang, P., et al.: Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding (2021). (Online). Available: http://arxiv.org/abs/2103.15358
Dubey, S.R., Singh, S.K., Chu, W.-T.: Vision Transformer Hashing for Image Retrieval (2021). (Online). Available: http://arxiv.org/abs/2109.12564
Muñoz, E.: Attention is all you need: Discovering the Transformer paper (2020)
Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8693 LNCS, no. PART 5, pp. 740–755 (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Yu, J., Li, J., Yu, Z., Huang, Q.: Multimodal Transformer with Multi-View Visual Representation for Image Captioning (2019). (Online). Available: http://arxiv.org/abs/1905.07841
Everingham, M., van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. Int. J. Comput. Vision 88(2), 303–338 (2010). https://doi.org/10.1007/s11263-009-0275-4
fmassa et al.: facebookresearch/detr (2020)
Acknowledgements
We thank DIPR, DRDO for providing the R&D environment to carry out the research work. We also thank IIIT Allahabad for providing the opportunity to carry out the PhD course under the Working Professional Scheme.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kumar, A., Singh, S.K., Dubey, S.R. (2023). Target Detection Using Transformer: A Study Using DETR. In: Tistarelli, M., Dubey, S.R., Singh, S.K., Jiang, X. (eds) Computer Vision and Machine Intelligence. Lecture Notes in Networks and Systems, vol 586. Springer, Singapore. https://doi.org/10.1007/978-981-19-7867-8_59
DOI: https://doi.org/10.1007/978-981-19-7867-8_59
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-7866-1
Online ISBN: 978-981-19-7867-8
eBook Packages: Intelligent Technologies and Robotics (R0)