Abstract
Pre-trained vision-language models have proliferated in recent years. However, most of them are pre-trained with a single objective: contrastive, image-to-text generative, or text-to-image generative. This paper presents UIT, a framework that unifies these pre-training objectives in a unicoder-decoder architecture comprising an image unicoder, a text unicoder, and a bi-modal decoder. The image and text unicoders can switch between encoding and decoding roles depending on the task, which gives the model flexibility and shared knowledge that benefits both image-to-text and text-to-image generation. UIT outperforms existing models on a variety of tasks, including retrieval, captioning, VQA, and SNLI-VE, and is particularly strong in zero-shot settings, with notable results on zero-shot ImageNet classification, zero-shot text-to-image synthesis, and zero-shot captioning.
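To make the idea of unifying the three objectives concrete, the following is a minimal NumPy sketch of one way a contrastive loss, an image-to-text loss, and a text-to-image loss could be combined into a single training signal. It is an illustration under stated assumptions, not UIT's implementation: the abstract does not say how the objectives are weighted, so the weighted sum, the `weights` parameter, the temperature value, the treatment of images as discrete token ids, and all function names are hypothetical.

```python
import numpy as np


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE loss: row i of img_emb is assumed to match row i of txt_emb.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)


def token_nll(logits, target_ids):
    # Mean negative log-likelihood of target token ids.
    # logits has shape (seq_len, vocab_size); target_ids has shape (seq_len,).
    log_probs = log_softmax(logits, axis=1)
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()


def combined_loss(img_emb, txt_emb, caption_logits, caption_ids,
                  image_logits, image_ids, weights=(1.0, 1.0, 1.0)):
    # Hypothetical weighted sum of the three objectives (an assumption,
    # not something stated in the abstract).
    w_con, w_i2t, w_t2i = weights
    return (w_con * contrastive_loss(img_emb, txt_emb)
            + w_i2t * token_nll(caption_logits, caption_ids)   # image-to-text
            + w_t2i * token_nll(image_logits, image_ids))      # text-to-image


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    total = combined_loss(
        rng.normal(size=(4, 16)), rng.normal(size=(4, 16)),
        rng.normal(size=(6, 100)), rng.integers(0, 100, size=6),
        rng.normal(size=(9, 512)), rng.integers(0, 512, size=9),
    )
    print(f"combined loss: {total:.4f}")
```

In this sketch both generative terms reduce to the same token-level cross-entropy, which is only possible because images are assumed to be represented as discrete token ids; a model with a different image representation would need a different text-to-image term.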
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Xu, G., Yan, S. (2023). UIT: Unifying Pre-training Objectives for Image-Text Understanding. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14258. Springer, Cham. https://doi.org/10.1007/978-3-031-44192-9_46
DOI: https://doi.org/10.1007/978-3-031-44192-9_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44191-2
Online ISBN: 978-3-031-44192-9
eBook Packages: Computer Science (R0)