Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

  • Conference paper
Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12375)

Included in the following conference series: European Conference on Computer Vision (ECCV)

Abstract

Large-scale pre-training methods that learn cross-modal representations on image-text pairs are becoming popular for vision-language tasks. Existing methods simply concatenate image region features and text features as input to the model to be pre-trained, and rely on self-attention to learn image-text semantic alignments in a brute-force manner. In this paper, we propose a new learning method, Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments. Our method is motivated by the observation that the salient objects in an image can be accurately detected and are often mentioned in the paired text. We pre-train an Oscar model on a public corpus of 6.5 million text-image pairs and fine-tune it on downstream tasks, creating new state-of-the-art results on six well-established vision-language understanding and generation tasks. (The code and pre-trained models are released at https://github.com/microsoft/Oscar.)
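
To make the approach concrete, the following is a minimal PyTorch-style sketch (an illustration, not the released implementation) of how the word-tag-region input triple could be assembled: caption words and detected object tags share a single word-embedding space, while region features from the object detector are linearly projected to the same width before the three segments are concatenated for the Transformer. All dimensions and variable names below are illustrative.

    import torch
    import torch.nn as nn

    hidden_size = 768        # BERT-base hidden width
    region_feat_dim = 2054   # e.g. 2048-d detector feature + 6-d box position
    vocab_size = 30522       # BERT-base uncased vocabulary

    # Toy inputs: token ids for the caption and for the detected object tags,
    # plus pre-extracted region features from a detector (e.g. Faster R-CNN).
    text_tokens = torch.randint(0, vocab_size, (1, 12))
    tag_tokens = torch.randint(0, vocab_size, (1, 5))
    region_feats = torch.randn(1, 5, region_feat_dim)

    word_embedding = nn.Embedding(vocab_size, hidden_size)
    image_projection = nn.Linear(region_feat_dim, hidden_size)

    # Words and tags are embedded in the same linguistic space, which is what
    # lets the detected tags act as anchor points between the two modalities.
    w = word_embedding(text_tokens)      # (1, 12, 768)
    q = word_embedding(tag_tokens)       # (1, 5, 768)
    v = image_projection(region_feats)   # (1, 5, 768)

    # The concatenated (w, q, v) sequence is then consumed by a multi-layer
    # Transformer, in the same way BERT consumes a pure token sequence.
    oscar_input = torch.cat([w, q, v], dim=1)   # (1, 22, 768)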


Notes

  1. It includes the coordinates of the top-left and bottom-right corners, and/or the height and width; a short sketch of this position encoding appears after these notes.

  2. A semantic space can be viewed as a vector space defined by a dictionary, which maps an input to a vector representation in the semantic space. For example, BERT can be viewed as a dictionary that defines a linguistic semantic space: it maps an input word or word sequence to a feature vector in that space.

  3. This is not necessarily the best fine-tuning choice for NLVR2; see the Pair-biattn fine-tuning in UNITER [5] for a better option, which introduces a multi-head attention layer to attend back over the concatenated text-image sequences.

  4. All the (single-model) SoTAs are from published results.
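
As a companion to note 1, here is a minimal sketch of how such a position-aware region feature could be built: the detector's visual feature for one region is concatenated with its normalized bounding-box coordinates (top-left and bottom-right corners plus height and width). The function and variable names are illustrative, not taken from the released code.

    import torch

    def region_position(box, img_w, img_h):
        # box = (x1, y1, x2, y2) in pixels; returns the position vector of
        # note 1, with every entry normalized by the image size.
        x1, y1, x2, y2 = box
        return torch.tensor([
            x1 / img_w, y1 / img_h,   # top-left corner
            x2 / img_w, y2 / img_h,   # bottom-right corner
            (y2 - y1) / img_h,        # height
            (x2 - x1) / img_w,        # width
        ])

    cnn_feat = torch.randn(2048)                               # visual feature for one region
    z = region_position((48.0, 32.0, 320.0, 240.0), 640, 480)  # 6-d position vector
    region_feat = torch.cat([cnn_feat, z])                     # 2054-d position-aware feature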

References

  1. Agrawal, H., et al.: Nocaps: novel object captioning at scale. In: ICCV (2019)

  2. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR (2018)

  3. Brown, P.F., Lai, J.C., Mercer, R.L.: Aligning sentences in parallel corpora. In: Proceedings of the 29th Annual Meeting on Association for Computational Linguistics (1991)

  4. Chen, W., Gan, Z., Li, L., Cheng, Y., Wang, W., Liu, J.: Meta module network for compositional visual reasoning (2019). arXiv preprint arXiv:1910.03230

  5. Chen, Y.C., et al.: UNITER: learning universal image-text representations (2019). arXiv preprint arXiv:1909.11740

  6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)

  7. Frome, A., et al.: DeViSE: a deep visual-semantic embedding model. In: NeurIPS (2013)

  8. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: CVPR (2017)

  9. Hao, W., Li, C., Li, X., Carin, L., Gao, J.: Towards learning a generic agent for vision-and-language navigation via pre-training. In: CVPR (2020)

  10. Huang, L., Wang, W., Chen, J., Wei, X.Y.: Attention on attention for image captioning. In: ICCV (2019)

  11. Hudson, D., Manning, C.D.: Learning by abstraction: the neural state machine. In: NeurIPS (2019)

  12. Hudson, D.A., Manning, C.D.: GQA: a new dataset for real-world visual reasoning and compositional question answering (2019). arXiv preprint arXiv:1902.09506

  13. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015)

  14. Kiros, R., Salakhutdinov, R., Zemel, R.S.: Unifying visual-semantic embeddings with multimodal neural language models (2014). arXiv preprint arXiv:1411.2539

  15. Krishna, R., et al.: Visual Genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123(1), 32–73 (2017). https://doi.org/10.1007/s11263-016-0981-7

  16. Kuznetsova, A., et al.: The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale (2018). arXiv preprint arXiv:1811.00982

  17. Lee, K.H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-text matching. In: ECCV (2018)

  18. Li, G., Duan, N., Fang, Y., Jiang, D., Zhou, M.: Unicoder-VL: a universal encoder for vision and language by cross-modal pre-training (2019). arXiv preprint arXiv:1908.06066

  19. Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: VisualBERT: a simple and performant baseline for vision and language (2019). arXiv preprint arXiv:1908.03557

  20. Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: ECCV (2014)

  21. Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: NeurIPS (2019)

  22. Lu, J., Goswami, V., Rohrbach, M., Parikh, D., Lee, S.: 12-in-1: multi-task vision and language representation learning (2019). arXiv preprint arXiv:1912.02315

  23. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)

  24. Norouzi, M., et al.: Zero-shot learning by convex combination of semantic embeddings (2013). arXiv preprint arXiv:1312.5650

  25. Ordonez, V., Kulkarni, G., Berg, T.L.: Im2text: describing images using 1 million captioned photographs. In: NeurIPS (2011)

  26. Qi, D., Su, L., Song, J., Cui, E., Bharti, T., Sacheti, A.: ImageBERT: cross-modal pre-training with large-scale weak-supervised image-text data (2020). arXiv preprint arXiv:2001.07966

  27. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)

  28. Ren, Z., Jin, H., Lin, Z., Fang, C., Yuille, A.: Joint image-text representation by Gaussian visual-semantic embedding. In: Multimedia (2016)

  29. Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., Goel, V.: Self-critical sequence training for image captioning. In: CVPR (2017)

  30. Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Annual Meeting of the Association for Computational Linguistics (2018)

  31. Socher, R., Fei-Fei, L.: Connecting modalities: semi-supervised segmentation and annotation of images using unaligned text corpora. In: CVPR (2010)

  32. Socher, R., Ganjoo, M., Manning, C.D., Ng, A.: Zero-shot learning through cross-modal transfer. In: NeurIPS (2013)

  33. Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations (2019). arXiv preprint arXiv:1908.08530

  34. Suhr, A., Zhou, S., Zhang, A., Zhang, I., Bai, H., Artzi, Y.: A corpus for reasoning about natural language grounded in photographs (2018). arXiv preprint arXiv:1811.00491

  35. Sun, C., Myers, A., Vondrick, C., Murphy, K., Schmid, C.: VideoBERT: a joint model for video and language representation learning. In: ICCV (2019)

  36. Tan, H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. In: EMNLP (2019)

  37. Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)

  38. Wu, Q., Shen, C., Liu, L., Dick, A., Van Den Hengel, A.: What value do explicit high level concepts have in vision to language problems? In: CVPR (2016)

  39. You, Q., Jin, H., Wang, Z., Fang, C., Luo, J.: Image captioning with semantic attention. In: CVPR (2016)

  40. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2, 67–78 (2014)

  41. Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J.J., Gao, J.: Unified vision-language pre-training for image captioning and VQA. In: AAAI (2020)

Author information

Corresponding author

Correspondence to Xiujun Li.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 551 KB)


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Li, X. et al. (2020). Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12375. Springer, Cham. https://doi.org/10.1007/978-3-030-58577-8_8

  • DOI: https://doi.org/10.1007/978-3-030-58577-8_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58576-1

  • Online ISBN: 978-3-030-58577-8

  • eBook Packages: Computer Science, Computer Science (R0)
