
Unified deep learning model for multitask representation and transfer learning: image classification, object detection, and image captioning

  • Original Article
  • Published:
International Journal of Machine Learning and Cybernetics

Abstract

Deep learning has demonstrated impressive performance on computer vision tasks such as object detection, image classification, and image captioning. However, while most models excel at a single vision or language task, designing one architecture that balances task specialization, performance, and adaptability across diverse tasks remains challenging. To address vision-and-language integration effectively, text embeddings must be combined with visual representations so that the dependencies of each subarea can be captured across multiple tasks. This paper proposes a single architecture that handles various computer vision tasks and can be fine-tuned for other specific vision and language tasks. The proposed model employs a modified DenseNet201 as the feature extractor (network backbone), an encoder-decoder architecture, and a task-specific head for inference. To counter overfitting and improve precision, enhanced data augmentation and normalization techniques are employed. The model's robustness is evaluated on more than five datasets across different tasks: image classification, object detection, image captioning, and adversarial attack and defense. The experimental results demonstrate performance competitive with other works on CIFAR-10, CIFAR-100, Flickr8k, Flickr30k, Caltech10, and task-specific datasets such as OCT and BreakHis. The proposed model is flexible and easy to adapt to new tasks, as it can also be extended to other vision and language tasks through fine-tuning with task-specific input indices.
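The shared-backbone, task-routed design described above can be illustrated with a minimal sketch. This is a hypothetical toy model, not the authors' implementation: the tiny convolutional stack stands in for the modified DenseNet201 backbone, the heads are simplified to single linear layers, and the `task` string plays the role of the paper's task-specific input index.

```python
# Hypothetical sketch (not the authors' code): one shared feature extractor,
# with task-specific heads selected at inference by a task index.
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, feat_dim=256, n_classes=10, vocab_size=1000):
        super().__init__()
        # Stand-in backbone; the paper uses a modified DenseNet201.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # One head per task; real heads would be task-appropriate decoders.
        self.heads = nn.ModuleDict({
            "classify": nn.Linear(feat_dim, n_classes),    # class logits
            "detect": nn.Linear(feat_dim, 4 + n_classes),  # toy: one box + class
            "caption": nn.Linear(feat_dim, vocab_size),    # toy: token logits
        })

    def forward(self, x, task):
        feats = self.backbone(x)        # shared representation
        return self.heads[task](feats)  # route to the task-specific head

model = MultiTaskModel()
x = torch.randn(2, 3, 32, 32)
print(model(x, "classify").shape)  # torch.Size([2, 10])
```

Because the backbone is shared, fine-tuning for a new task amounts to adding one more entry to `heads` and training with the corresponding task index, which mirrors the adaptability claim in the abstract.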



Data availability

Not applicable; the model is evaluated only on publicly available benchmark datasets.


Acknowledgements

This work was supported by a grant from the Key R&D Program of Planned Science and Technology Project of Sichuan Province (No. 2021YFQ0054). This paper is partially funded by Grant SCITLAB (SCITLAB-20004) of the Intelligent Terminal Key Laboratory of Sichuan Province.

Author information


Corresponding authors

Correspondence to Leta Yobsan Bayisa or Weidong Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Bayisa, L.Y., Wang, W., Wang, Q. et al. Unified deep learning model for multitask representation and transfer learning: image classification, object detection, and image captioning. Int. J. Mach. Learn. & Cyber. (2024). https://doi.org/10.1007/s13042-024-02177-5

