Abstract
Deep learning has demonstrated impressive performance in computer vision tasks such as object detection, image classification, and image captioning. Although most models excel at a single vision or language task, designing one architecture that balances task specialization, performance, and adaptability across diverse tasks remains challenging. Integrating vision and language effectively requires combining text embeddings with visual representations so that the dependencies of each subarea can be understood across multiple tasks. This paper proposes a single architecture that handles several computer vision tasks and can be fine-tuned for other specific vision and language tasks. The proposed model employs a modified DenseNet201 as the feature extractor (network backbone), an encoder-decoder architecture, and a task-specific head for inference. To reduce overfitting and improve precision, enhanced data augmentation and normalization techniques are employed. The model's robustness is evaluated on more than five datasets across different tasks: image classification, object detection, image captioning, and adversarial attack and defense. The experimental results demonstrate competitive performance compared with other works on CIFAR-10, CIFAR-100, Flickr8, Flickr30, Caltech10, and other task-specific datasets such as OCT and BreakHis. The proposed model is flexible and easy to adapt to new tasks: it can be extended to other vision and language tasks through fine-tuning with task-specific input indices.
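The shared-backbone idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a single random linear layer with ReLU stands in for the modified DenseNet201 backbone, the encoder-decoder stage is omitted, and every name here (`SharedBackboneModel`, `add_head`, the task labels, and all dimensions) is hypothetical. It only shows how one feature extractor might serve several task-specific heads, and how a new head could be attached without touching the backbone.

```python
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x: Vector) -> Vector:
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


class SharedBackboneModel:
    """Hypothetical model: one shared feature extractor, one head per task."""

    def __init__(self, in_dim: int, feat_dim: int, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self.feat_dim = feat_dim
        # Assumption: a single linear layer replaces the real backbone.
        self.backbone = random_matrix(feat_dim, in_dim, self._rng)
        self.heads: Dict[str, Matrix] = {}

    def add_head(self, task: str, out_dim: int) -> None:
        """Register a task-specific head; the backbone is left unchanged."""
        self.heads[task] = random_matrix(out_dim, self.feat_dim, self._rng)

    def features(self, x: Vector) -> Vector:
        """Shared representation used by every head."""
        return relu(linear(self.backbone, x))

    def forward(self, x: Vector, task: str) -> Vector:
        """Route the shared features through the head selected by `task`."""
        if task not in self.heads:
            raise KeyError(f"no head registered for task {task!r}")
        return linear(self.heads[task], self.features(x))


model = SharedBackboneModel(in_dim=8, feat_dim=4)
model.add_head("classification", out_dim=10)
model.add_head("captioning", out_dim=6)

x = [0.5] * 8  # toy input vector, not a real image
class_scores = model.forward(x, "classification")
caption_scores = model.forward(x, "captioning")
```

In this sketch, extending the model to a new task amounts to calling `add_head` with a new task label; in a real system the new head (and optionally the backbone) would then be fine-tuned on task-specific data.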
Data availability
Not applicable because our model utilizes public benchmark datasets.
Acknowledgements
This work was supported by a grant from the Key R&D Program of Planned Science and Technology Project of Sichuan Province (No. 2021YFQ0054). This paper is partially funded by Grant SCITLAB (SCITLAB-20004) of the Intelligent Terminal Key Laboratory of Sichuan Province.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bayisa, L.Y., Wang, W., Wang, Q. et al. Unified deep learning model for multitask representation and transfer learning: image classification, object detection, and image captioning. Int. J. Mach. Learn. & Cyber. (2024). https://doi.org/10.1007/s13042-024-02177-5
DOI: https://doi.org/10.1007/s13042-024-02177-5