Transformer based Multitask Learning for Image Captioning and Object Detection

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14646)


Abstract

In several real-world scenarios, such as autonomous navigation and mobility, image captioning and object detection play a crucial role in obtaining a better visual understanding of the surroundings. This work introduces a novel multitask learning framework that combines image captioning and object detection into a joint model. We propose TICOD, a Transformer-based Image Captioning and Object Detection model, which trains both tasks jointly by combining the losses obtained from the image captioning and object detection networks. Through joint training, the model benefits from the complementary information shared between the two tasks, leading to improved image captioning performance. Our approach uses a transformer-based architecture that enables end-to-end integration of the image captioning and object detection networks and performs both tasks jointly. We evaluate the effectiveness of our approach through comprehensive experiments on the MS-COCO dataset. Our model outperforms the baselines from the image captioning literature, achieving a \(3.65\%\) improvement in BERTScore.
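
The core idea described above is a single shared backbone whose features feed both a captioning decoder and a detection head, with the two task losses summed into one training objective. The sketch below illustrates that joint objective in PyTorch; the module names, the token-level cross-entropy captioning loss, and the scalar detection weight are illustrative assumptions and not the authors' TICOD implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointCaptionDetectModel(nn.Module):
    """Shared backbone with a captioning branch and a detection branch,
    trained by summing the two task losses (illustrative sketch)."""

    def __init__(self, backbone: nn.Module, caption_head: nn.Module,
                 detection_head: nn.Module, detection_weight: float = 1.0):
        super().__init__()
        self.backbone = backbone              # shared image encoder (e.g. a Swin-style transformer)
        self.caption_head = caption_head      # autoregressive decoder returning token logits
        self.detection_head = detection_head  # assumed to return a scalar detection loss
        self.detection_weight = detection_weight

    def forward(self, images: torch.Tensor, caption_tokens: torch.Tensor,
                detection_targets) -> torch.Tensor:
        feats = self.backbone(images)

        # Captioning branch: next-token cross-entropy against the ground-truth caption.
        logits = self.caption_head(feats, caption_tokens[:, :-1])   # (B, T-1, vocab)
        caption_loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            caption_tokens[:, 1:].reshape(-1),
        )

        # Detection branch: classification + box-regression loss, folded into one scalar here.
        detection_loss = self.detection_head(feats, detection_targets)

        # Joint objective: the weighted sum of both task losses is backpropagated
        # through the shared backbone, so each task benefits from the other's signal.
        return caption_loss + self.detection_weight * detection_loss
```

Calling backward() on the returned loss would update the shared backbone with gradients from both tasks, which is the mechanism by which the complementary information mentioned in the abstract is exchanged.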



Acknowledgements

This work was supported by DST National Mission on Interdisciplinary Cyber-Physical Systems (NM-ICPS), Technology Innovation Hub on Autonomous Navigation and Data Acquisition Systems: TiHAN Foundations at Indian Institute of Technology (IIT) Hyderabad, India. We also acknowledge the support from Japan International Cooperation Agency (JICA). We express gratitude to Suvodip Dey for his valuable insights and reviews on this work.

Author information

Corresponding author

Correspondence to Debolena Basak.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Basak, D., Srijith, P.K., Desarkar, M.S. (2024). Transformer based Multitask Learning for Image Captioning and Object Detection. In: Yang, DN., Xie, X., Tseng, V.S., Pei, J., Huang, JW., Lin, J.CW. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science (LNAI), vol 14646. Springer, Singapore. https://doi.org/10.1007/978-981-97-2253-2_21

  • DOI: https://doi.org/10.1007/978-981-97-2253-2_21

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2252-5

  • Online ISBN: 978-981-97-2253-2

  • eBook Packages: Computer Science, Computer Science (R0)
