Unified deep learning model for multitask representation and transfer learning: image classification, object detection, and image captioning

Bayisa, Leta Yobsan; Wang, Weidong; Wang, Qingxian; Ukwuoma, Chiagoziem C.; Gutema, Hirpesa Kebede; Endris, Ahmed; Abu, Turi

doi:10.1007/s13042-024-02177-5

Unified deep learning model for multitask representation and transfer learning: image classification, object detection, and image captioning

Original Article
Published: 17 May 2024

(2024)
Cite this article

International Journal of Machine Learning and Cybernetics Aims and scope Submit manuscript

Leta Yobsan Bayisa ORCID: orcid.org/0000-0001-9939-9821¹,
Weidong Wang²,
Qingxian Wang²,
Chiagoziem C. Ukwuoma^3,4,5,
Hirpesa Kebede Gutema¹,
Ahmed Endris⁶ &
…
Turi Abu⁷

122 Accesses
Explore all metrics

Abstract

The application of deep learning has demonstrated impressive performance in computer vision tasks such as object detection, image classification, and image captioning. Though most models excel at performing single vision or language tasks, designing a single architecture that balances task specialization, performance, and adaptability across diverse tasks is challenging. To effectively address vision and language integration challenges, a combination of text embeddings and visual representation is necessary to understand dependencies of each subarea for multiple tasks. This paper proposes a single architecture that can handle various tasks in computer vision with fine-tuning capabilities for other specific vision and language tasks. The proposed model employs a modified DenseNet201 as a feature extractor (network backbone), an encoder-decoder architecture, and a task-specific head for inference. To tackle overfitting and improve precision, enhanced data augmentation and normalization techniques are employed. The model’s robustness is evaluated on over five datasets for different tasks: image classification, object detection, image captioning, and adversarial attack and defense. The experimental results demonstrate competitive performance compared to other works on CIFAR-10, CIFAR-100, Flickr8, Flickr30, Caltech10, and other task-specific datasets such as OCT, BreakHis, and so on. The proposed model is flexible and easy to adapt to new tasks, as it can also be extended to other vision and language tasks through fine-tuning with task-specific input indices.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Image captioning model using attention and object features to mimic human image understanding

Article Open access 14 February 2022

Fast image captioning using LSTM

Article 29 March 2018

Domain-specific image captioning: a comprehensive review

Article 18 April 2024

Data availability

Not applicable because our model utilizes public benchmark datasets.

References

Huang DZ, Baber JC, Bahmanyar SS (2021) The challenges of generalizability in artificial intelligence for ADME/Tox endpoint and activity prediction. Expert Opin Drug Discov 16:1
Article Google Scholar
Fu Q, Wang C, Han X (2020) A CNN-LSTM network with attention approach for learning universal sentence representation in embedded system. Microprocess Microsyst. https://doi.org/10.1016/j.micpro.2020.103051
Article Google Scholar
Pang C, Liu H, Li X (2019) Multitask learning of time–frequency CNN for sound source localization. IEEE Access. https://doi.org/10.1109/ACCESS.2019.2905617
Article Google Scholar
Toshniwal S, Tang H, Lu L, Livescu K (2017) Multitask learning with low-level auxiliary tasks for encoder–decoder based speech recognition. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH
Guo S, Zhang B, Yang T et al (2020) Multitask convolutional neural network with information fusion for bearing fault diagnosis and localization. IEEE Trans Ind Electron. https://doi.org/10.1109/TIE.2019.2942548
Article Google Scholar
Kapidis G, Poppe R, Veltkamp RC (2021) Multi-dataset, multitask learning of egocentric vision tasks. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2021.3061479
Article Google Scholar
Wang XE, Jain V, Ie E et al (2020) environment-agnostic multitask learning for natural language grounded navigation. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics)
Gu J, Wang Z, Kuen J et al (2018) Recent advances in convolutional neural networks. Pattern Recognit. https://doi.org/10.1016/j.patcog.2017.10.013
Article Google Scholar
Ben-Baruch E, Ridnik T, Zamir N et al (2019) Attention Is All You Need. Adv Neural Inf Process Syst 16:1
Google Scholar
Dosovitskiy A, Beyer L, Kolesnikov A et al (2021) An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR 2021—9th international conference on learning representations
Wright LG, Onodera T, Stein MM et al (2022) Deep physical neural networks trained with backpropagation. Nature. https://doi.org/10.1038/s41586-021-04223-6
Article Google Scholar
Strezoski G, Noord N, Worring M (2019) Many task learning with task routing. In: Proceedings of the IEEE international conference on computer vision
Zhang Y, Yang Q (2018) An overview of multi-task learning. Natl Sci Rev 5:1
Article Google Scholar
Furusho Y, Ikeda K (2020) Theoretical analysis of skip connections and batch normalization from generalization and optimization perspectives. APSIPA Trans Signal Inf Process. https://doi.org/10.1017/ATSIP.2020.7
Article Google Scholar
Jain N, Singh H, Sharma V (2019) Competitor analysis and benchmarking of improved Alex net. Int J Sci Technol Res 8:1
Google Scholar
Rampersad H (2020) FAST-RCNN. Total Perform Scorec 2020:1
Google Scholar
Zhang Y, Li D, Wang Y et al (2019) Abstract text summarization with a convolutional seq2seq model. Appl Sci. https://doi.org/10.3390/app9081665
Article Google Scholar
Warrier S, Rutter EM, Flores KB (2022) Multitask neural networks for predicting bladder pressure with time series data. Biomed Signal Process Control. https://doi.org/10.1016/j.bspc.2021.103298
Article Google Scholar
Chou SH, Chao WL, Lai WS et al (2020) Visual question answering on 360° images. In: Proceedings—2020 IEEE winter conference on applications of computer vision, WACV 2020
Jha S, Dey A, Kumar R, Kumar-Solanki V (2019) A novel approach on visual question answering by parameter prediction using faster region based convolutional neural network. Int J Interact Multimed Artif Intell. https://doi.org/10.9781/ijimai.2018.08.004
Article Google Scholar
Xie J, Cai Y, Huang Q, Wang T (2021) Multiple objects-aware visual question generation. In: MM 2021—proceedings of the 29th ACM international conference on multimedia
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput. https://doi.org/10.1162/neco.1997.9.8.1735
Article Google Scholar
Toshevska M, Stojanovska F, Zdravevski E et al (2020) Exploration into deep learning text generation architectures for dense image captioning. In: Proceedings of the 2020 federated conference on computer science and information systems, FedCSIS 2020. pp 129–136
Rangamani A, Xiong T, Nair A et al (2016) Landmark detection and tracking in ultrasound using a CNN-RNN framework. Conf Neural Inf Process Syst 2016:1
Google Scholar
Sun G, Probst T, Paudel DP et al (2021) Task switching network for multi-task learning. In: Proceedings of the IEEE international conference on computer vision
Yu J, Li J, Yu Z, Huang Q (2020) Multimodal transformer with multi-view visual representation for image captioning. IEEE Trans Circuits Syst Video Technol. https://doi.org/10.1109/TCSVT.2019.2947482
Article Google Scholar
Lan Y, Hao Y, Xia K et al (2020) Stacked residual recurrent neural networks with cross-layer attention for text classification. IEEE Access. https://doi.org/10.1109/ACCESS.2020.2987101
Article Google Scholar
Ji Z, Wang H, Han J, Pang Y (2022) SMAN: stacked multimodal attention network for cross-modal image-text retrieval. IEEE Trans Cybern. https://doi.org/10.1109/TCYB.2020.2985716
Article Google Scholar
Mittal S, Lamb A, Goyal A et al (2020) Learning to combine top-down and bottom-up signals in recurrent neural networks with attention over modules. In: 37th international conference on machine learning, ICML 2020
Wang X, Wang WY, Wang Y-F (2020) Closing the loop between language and vision for embodied agents. University of California, Santa Barbara
Google Scholar
Lee J, Kim I (2021) Vision–language–knowledge co-embedding for visual commonsense reasoning. Sensors. https://doi.org/10.3390/s21092911
Article Google Scholar
Van Tu N, Cuong LA (2021) A deep learning model of multiple knowledge sources integration for community question answering. VNU J Sci Comput Sci Commun Eng. https://doi.org/10.25073/2588-1086/vnucsce.295
Article Google Scholar
Wang Y, Zhu M, Xu C et al (2022) Exploiting image captions and external knowledge as representation enhancement for VQA. Qinghua Daxue Xuebao/J Tsinghua Univ. https://doi.org/10.16511/j.cnki.qhdxxb.2022.21.010
Wu J, Hu Z, Mooney RJ (2020) Generating question relevant captions to aid visual question answering. In: ACL 2019—57th annual meeting of the association for computational linguistics, proceedings of the conference
Lin P, Yang M (2020) A shared-private representation model with coarse-to-fine extraction for target sentiment analysis. In: Findings of the association for computational linguistics findings of ACL: EMNLP 2020
Niu G, Liu E, Wang X et al (2023) Enhanced discriminate feature learning deep residual CNN for multitask bearing fault diagnosis with information fusion. IEEE Trans Ind Inform. https://doi.org/10.1109/TII.2022.3179011
Article Google Scholar
Liu Y, Li K, Yan D, Gu S (2022) A network-based CNN model to identify the hidden information in text data. Phys A Stat Mech its Appl. https://doi.org/10.1016/j.physa.2021.126744
Article Google Scholar
Chen L, Zhang H, Xiao J et al (2017) SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. In: Proceedings-30th IEEE conference on computer vision and pattern recognition, CVPR 2017
Graham B, El-Nouby A, Touvron H et al (2021) LeViT: a vision transformer in ConvNet’s clothing for faster inference. In: Proceedings of the IEEE international conference on computer vision
Tolstikhin I, Houlsby N, Kolesnikov A et al (2021) MLP-mixer: an all-MLP architecture for vision. In: Advances in neural information processing systems
Kabir HMD, Abdar M, Khosravi A et al (2022) SpinalNet: deep neural network with gradual input. IEEE Trans Artif Intell. https://doi.org/10.1109/TAI.2022.3185179
Article Google Scholar
Sudowe P, Leibe B (2016) Patchit: Self-supervised network weight initialization for fine-grained recognition. In: British machine vision conference 2016, BMVC 2016
Zakraoui J, Saleh M, Al-Maadeed S, Jaam JM (2021) Improving text-to-image generation with object layout guidance. Multimed Tools Appl. https://doi.org/10.1007/s11042-021-11038-0
Article Google Scholar
Khan S, Naseer M, Hayat M et al (2022) Transformers in vision: a survey. ACM Comput Surv. https://doi.org/10.1145/3505244
Article Google Scholar
Fotso Kamga GA, Bitjoka L, Akram T et al (2021) Advancements in satellite image classification: methodologies, techniques, approaches and applications. Int J Remote Sens 42:1
Article Google Scholar
Peters ME, Neumann M, Iyyer M et al (2018) Improving language understanding with unsupervised learning. OpenAI 2018:1
Google Scholar
Ghosh S, Ekbal A, Bhattacharyya P (2022) A multitask framework to detect depression, sentiment and multi-label emotion from suicide notes. Cognit Comput. https://doi.org/10.1007/s12559-021-09828-7
Article Google Scholar
Tan M, Le QV (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In: 36th international conference on machine learning, ICML 2019
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition
Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) DenseNet. In: Proceedings of the 30th IEEE conf comput vis pattern recognition, CVPR 2017 2017-Janua
Ivan V, Slater D, Spacagna G, et al (2019) Python deep deep learning
Sun Z, Sarma PK, Liang Y, Sethares WA (2021) A new view of multi-modal language analysis: Audio and video features as text “Styles”. In: EACL 2021-16th conference of the European chapter of the association for computational linguistics, proceedings of the conference
Tetko IV, Karpov P, Van Deursen R, Godin G (2020) State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis. Nat Commun. https://doi.org/10.1038/s41467-020-19266-y
Article Google Scholar
Barbella M, Tortora G (2022) Rouge metric evaluation for text summarization techniques. SSRN Electron J. https://doi.org/10.2139/ssrn.4120317
Article Google Scholar
Li X, Wang W, Hu X, Yang J (2019) Selective kernel networks. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition
Zhang Z, Zhang H, Zhao L et al (2022) Nested hierarchical transformer: towards accurate, data-efficient and interpretable visual understanding. In: Proceedings of the 36th AAAI conference on artificial intelligence, AAAI 2022
Yuan K, Guo S, Liu Z et al (2021) Incorporating convolution designs into visual transformers. In: Proceedings of the IEEE international conference on computer vision
Konstantinidis D, Papastratis I, Dimitropoulos KPD (2022) Multi-manifold atten vis transform
Dagli R (2023) Astroformer: more data might not be all you need for classification. In: ICLR 2023
Liu J, Wen D, Wang D et al (2020) QuantNet: learning to quantize by learning within fully differentiable framework. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics)
Belhasin O, Bar-Shalom GRE-Y (2022) TransBoost: improving the best imagenet performance using deep transductionle. In: 36th conference on neural information processing systems
Su Z, Zhang H, Chen J, Pang L, Chong-Wah-Ngo Y-GJ (2022) Adaptive split-fusion transformer. Comput Vis Pattern Recognit. https://doi.org/10.48550/arXiv.2204.12196
Article Google Scholar
Bakalo R, Goldberger J, Ben-Ari R (2021) Weakly and semi supervised detection in medical imaging via deep dual branch net. Neurocomputing. https://doi.org/10.1016/j.neucom.2020.09.037
Article Google Scholar
Cornia M, Baraldi L, Serra G, Cucchiara R (2018) Paying more attention to saliency: image captioning with saliency and context attention. ACM Trans Multimed Comput Commun Appl. https://doi.org/10.1145/3177745
Article Google Scholar
Zhou L, Palangi H, Zhang L et al (2020) Unified vision-language pre-training for image captioning and VQA. In: AAAI 2020-34th AAAI conference on artificial intelligence
Hao Y, Song H, Dong L, Huang S, Chi Z, Wang W, Shuming Ma FW (2022) Language models are general-purpose interfaces
Jin W, Cheng Y, Shen Y et al (2022) A good prompt is worth millions of parameters: low-resource prompt-based learning for vision-language models. In: Proceedings of the annual meeting of the association for computational linguistics
Cho J, Lei J, Tan T, Bansal M (2021) Unifying vision-and-language tasks via text generation. In: ICML, pp 1931–1942
Muneeb ul Hassan (2018) VGG16-convolutional network for classification and detection. Neurohive
Tan M, Le QV (2021) EfficientNetV2: smaller models and faster training. In: Proceedings of machine learning research
Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings-30th IEEE conference on computer vision and pattern recognition, CVPR 2017
Ekoputris RO (2018) MobileNet: Deteksi Objek pada Platform Mobile | by Rizqi Okta Ekoputris | Nodeflux | Medium. In: 9 May 2018
Liu Z, Mao H, Wu CY et al (2022) A ConvNet for the 2020s. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition
Goodfellow IJ, Shlens J, Szegedy C (2015) Explaining and harnessing adversarial examples. In: 3rd international conference on learning representations, ICLR 2015-conference track proceedings
Borkowski AA, Bui MM, Thomas LB et al (2019) Lung and colon cancer histopathological image dataset (LC25000)
BreakHis [OL]. https://www.kaggle.com/datasets/ambarish/breakhis
Kermany DS, Goldbaum M, Cai W et al (2018) Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell. https://doi.org/10.1016/j.cell.2018.02.010
Article Google Scholar
Retinal OCT Images (optical coherence tomography)tle. https://www.kaggle.com/datasets/paultimothymooney/kermany2018. Accessed 31 Oct 2023

Download references

Acknowledgements

This work was supported by a grant from the Key R&D Program of Planned Science and Technology Project of Sichuan Province (No. 2021YFQ0054). This paper is partially funded by Grant SCITLAB (SCITLAB-20004) of the Intelligent Terminal Key Laboratory of Sichuan Province.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, 518055, China
Leta Yobsan Bayisa & Hirpesa Kebede Gutema
School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, People’s Republic of China
Weidong Wang & Qingxian Wang
Oxford Brooks University, Sino-British Collaborative Education, Chengdu University of Technology, Chengdu, 610059, Sichuan, People’s Republic of China
Chiagoziem C. Ukwuoma
College of Nuclear Technology and Automation Engineering, Chengdu University of Technology, Chengdu, 610059, Sichuan, People’s Republic of China
Chiagoziem C. Ukwuoma
Sichuan Engineering Technology Research Center for Industrial Internet Intelligent Monitoring and Application, Chengdu University of Technology, Chengdu, 610059, Sichuan, People’s Republic of China
Chiagoziem C. Ukwuoma
Paul C. Lauterbur Research Center for Biomedical Imaging, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, People’s Republic of China
Ahmed Endris
Department of Computer Science and Technology, Tsinghua University, Beijing, People’s Republic of China
Turi Abu

Authors

Leta Yobsan Bayisa
View author publications
You can also search for this author in PubMed Google Scholar
Weidong Wang
View author publications
You can also search for this author in PubMed Google Scholar
Qingxian Wang
View author publications
You can also search for this author in PubMed Google Scholar
Chiagoziem C. Ukwuoma
View author publications
You can also search for this author in PubMed Google Scholar
Hirpesa Kebede Gutema
View author publications
You can also search for this author in PubMed Google Scholar
Ahmed Endris
View author publications
You can also search for this author in PubMed Google Scholar
Turi Abu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Leta Yobsan Bayisa or Weidong Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Bayisa, L.Y., Wang, W., Wang, Q. et al. Unified deep learning model for multitask representation and transfer learning: image classification, object detection, and image captioning. Int. J. Mach. Learn. & Cyber. (2024). https://doi.org/10.1007/s13042-024-02177-5

Download citation

Received: 21 June 2023
Accepted: 08 April 2024
Published: 17 May 2024
DOI: https://doi.org/10.1007/s13042-024-02177-5

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Unified deep learning model for multitask representation and transfer learning: image classification, object detection, and image captioning

Abstract

Access this article

Similar content being viewed by others

Image captioning model using attention and object features to mimic human image understanding

Fast image captioning using LSTM

Domain-specific image captioning: a comprehensive review

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding authors

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Unified deep learning model for multitask representation and transfer learning: image classification, object detection, and image captioning

Abstract

Access this article

Similar content being viewed by others

Image captioning model using attention and object features to mimic human image understanding

Fast image captioning using LSTM

Domain-specific image captioning: a comprehensive review

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding authors

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation