VLP: A Survey on Vision-Language Pre-training

In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) to a new era. Substantial work has shown that such models are beneficial for downstream uni-modal tasks and avoid training a new model from scratch. So can such pre-trained models be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review its recent advances from five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, we summarize the specific VLP models in detail. Finally, we discuss the new frontiers in VLP. To the best of our knowledge, this is the first survey focused on VLP. We hope that this survey can shed light on future research in the VLP field.


Introduction
Making machines respond in ways similar to humans has been a relentless goal of AI researchers. To enable machines to perceive and think, researchers propose a series of related tasks, such as face recognition, reading comprehension, and human-machine dialogue, to train and evaluate the intelligence of machines in particular aspects. Specifically, domain experts manually construct standard datasets and then train and evaluate relevant models on them. However, due to the limitations of related technologies, training on a large amount of labelled data is often necessary to obtain a strong, capable model. The recent emergence of pre-training models based on the Transformer structure [1] has alleviated this problem. These models are first pre-trained via self-supervised learning, which typically exploits auxiliary tasks (pre-training objectives) to mine supervision signals from large-scale unlabelled data, thereby learning universal representations. They can then achieve surprising effectiveness on downstream tasks by fine-tuning with only a tiny amount of manually labelled data. Since the advent of BERT [2] in natural language processing (NLP), various pre-training models have sprung up in uni-modal fields, such as Vision Transformer (ViT) [3] in computer vision (CV) and Wav2Vec [4] in speech. Substantial work has shown that they benefit downstream uni-modal tasks and avoid training a new model from scratch.
Similar to the uni-modal fields, the multi-modal field also suffers from a scarcity of high-quality labelled data. The natural question is: can the above pre-training method be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. In this paper, we focus on mainstream vision-language pre-training (VLP), including image-text and video-text pre-training. VLP mainly learns the semantic correspondence between different modalities by pre-training on large-scale data. For example, in image-text pre-training, we expect the model to associate "dog" in text with what "dog" looks like in images. In video-text pre-training, we expect the model to map objects/actions in the text to objects/actions in the video. To achieve this goal, the VLP objectives and model architecture need to be cleverly designed so that the model can mine the associations between different modalities.
To give readers a better global grasp of VLP, we first comprehensively review its recent advances, focusing on five significant aspects:

• Feature extraction. This section covers the preprocessing and representation methods for images, videos, and text in VLP models (see Section 2).

• Model architecture. We introduce the architecture of VLP models from two different perspectives: single-stream versus dual-stream from the multi-modal fusion perspective, and encoder-only versus encoder-decoder from the overall architectural design perspective (see Section 3).

• Pre-training objectives. Pre-training objectives are the core of VLP, mainly used to guide the model to learn vision-language associations. We summarize typical and characteristic pre-training objectives, divided into completion, matching, temporal, and particular types (see Section 4).

• Pre-training datasets. Data is critical for VLP. We briefly introduce mainstream corpora for VLP and their specific sizes (see Section 5).

• Downstream tasks. Various tasks require cooperative knowledge of both vision and language. We discuss the basic details and goals of these tasks (see Section 6).
Then we summarize the specific state-of-the-art (SOTA) VLP models in detail (see Section 7). Finally, we conclude the paper and broadly discuss new frontiers in VLP (see Section 8).
Although there are many surveys on pre-trained language models [5,6] and pre-trained vision models [7], to the best of our knowledge, this is the first survey focused on VLP. We hope that our survey can help researchers better understand this field and inspire them to design better models.

Feature Extraction
This section describes how VLP models preprocess and represent an image, video and text to obtain counterpart features.
Image Feature Extraction

Most previous work [8,9,10] on VLP utilizes pre-trained object detectors to extract visual features. The most commonly used object detection model is Faster R-CNN [11] with bottom-up attention [12], which is designed to identify objects belonging to certain classes and localize them with bounding boxes. Using Faster R-CNN, VLP models obtain the OD-based region feature embedding V = [o_1, o_2, ..., o_k] of an image with k selected regions. Each region feature o_i is a 2048-d Region-of-Interest (RoI) feature together with its bounding box. The bounding box is defined by the coordinates of the bottom-left and top-right corners of the region. VLP models use the bounding box to construct a 5-d vector, which is embedded into a high-dimensional representation (2048-d) called the visual geometry embedding. The OD-based region features (OD-RFs) are obtained by adding the OD-based region feature embedding to its visual geometry embedding. Although OD-RFs have brought impressive performance, extracting region features is time-consuming. To relieve this problem, the pre-trained object detectors are usually frozen during pre-training, which can limit the capacity of VLP models.
Other VLP models [13,14] extract visual features by utilizing convolutional neural networks (CNNs) to obtain grid features (CNN-GFs). On the one hand, VLP models can train the CNN end-to-end by using the grid features [15] directly. On the other hand, VLP models can first discretize the grid features using a learned vision dictionary and then feed them into the cross-modal module.
Inspired by ViT [3,16], VLP models reshape the image I ∈ R^{H×W×C} into a sequence of flattened 2D patches I_p ∈ R^{N×(P²·C)}, where (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the resolution of each image patch, and N = HW/P² is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. These patch features are referred to as ViT-PFs. An input image I is encoded into a sequence of embeddings {v_cls, v_1, ..., v_N}, where v_cls is the embedding of the [CLS] token.
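As a rough illustration, this patch flattening can be sketched with NumPy; the concrete sizes (a 224×224 RGB image and 16×16 patches) are example values, not ones fixed by the survey:

```python
import numpy as np

# Example sizes; the survey only fixes the formula N = HW / P^2.
H, W, C, P = 224, 224, 3, 16
image = np.random.rand(H, W, C)                    # I ∈ R^{H×W×C}

# Cut into N non-overlapping P×P patches, each flattened to P^2·C values.
patches = image.reshape(H // P, P, W // P, P, C)   # split both spatial axes
patches = patches.transpose(0, 2, 1, 3, 4)         # group the two patch-grid axes
patches = patches.reshape(-1, P * P * C)           # I_p ∈ R^{N×(P^2·C)}

N = H * W // P**2                                  # 196 patches of 768 values each
```

In a real model each row of `patches` would then be linearly projected and prepended with a learnable [CLS] embedding before entering the Transformer.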

Video Feature Extraction
A video clip is denoted as M frames (images). VLP models [17,18] extract frame features using the methods mentioned above. The two most commonly used features are CNN-GFs and ViT-PFs. For CNN-GFs, VLP models first use ResNet [19] pre-trained on ImageNet [20], or SlowFast [21] and I3D [22] pre-trained on Kinetics [23], to extract 2D and 3D visual features for each video frame. These features are concatenated as visual features and fed through a fully-connected (FC) layer to be projected into the same lower-dimensional space as the token embeddings. For ViT-PFs, a video clip V ∈ R^{M×H×W×C} consists of M frames of resolution H×W, where M = 1 for images. Following the protocol in ViT and TimeSformer, the input video clip is divided into M × N non-overlapping spatio-temporal patches of size P × P, where N = HW/P².
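The CNN-GF path (concatenate 2D and 3D features, then project with an FC layer) can be sketched as follows; the feature dimensions (2048-d 2D, 1024-d 3D, 768-d token space) and the random projection are illustrative assumptions, not values given in the survey:

```python
import numpy as np

M = 8                                   # frames in the clip (example value)
feat2d = np.random.rand(M, 2048)        # e.g. per-frame ResNet features
feat3d = np.random.rand(M, 1024)        # e.g. per-frame SlowFast/I3D features

# Concatenate 2D and 3D features, then project into the token-embedding space
# with a fully-connected layer (random weights stand in for learned ones).
fused = np.concatenate([feat2d, feat3d], axis=-1)   # (M, 3072)
W_fc = np.random.randn(3072, 768) * 0.01            # FC projection matrix
visual_tokens = fused @ W_fc                        # (M, 768), same space as text
```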

Text Feature Extraction
For textual features, following pre-trained language models such as BERT [2], RoBERTa [24], ALBERT [25], and XLNet [26], VLP models [9,27,28] first segment the input sentence into a sequence of subwords. Then, a start-of-sequence token and an end-of-sequence token are inserted at the beginning and the end of the sequence to generate the input text sequence. Text input representations are computed by summing the corresponding word embedding, text position embedding, and text type embedding.
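This three-way embedding sum can be sketched as follows; the vocabulary size, embedding dimension, and the [CLS]/[SEP] token ids (101/102) are toy assumptions:

```python
import numpy as np

# Toy embedding tables: input representation = word + position + type embeddings.
vocab_size, max_len, dim = 1000, 16, 32
word_emb = np.random.rand(vocab_size, dim)
pos_emb  = np.random.rand(max_len, dim)
type_emb = np.random.rand(2, dim)        # e.g. 0 = text segment, 1 = vision segment

token_ids = [101, 7, 42, 9, 102]         # [CLS] w1 w2 w3 [SEP] (hypothetical ids)
text_repr = (word_emb[token_ids]
             + pos_emb[:len(token_ids)]
             + type_emb[0])              # broadcast the text-type embedding
```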

Feature Representation
To make full use of uni-modal pre-trained models, VLP models can send the visual or textual features to a transformer encoder [1]. Specifically, VLP models can utilize a standard transformer encoder with random initialization to generate visual or textual representations. In addition, VLP models can utilize a pre-trained vision transformer, such as ViT or DeiT [29], to encode ViT-PFs, and a pre-trained textual transformer, such as BERT, to encode textual features. For simplicity, we name these transformers Xformer.

Model Architecture
In this section, we introduce the architecture of the VLP models from two different perspectives: (1) Single-stream versus Dual-stream from multi-modal fusion perspective, and (2) Encoder-only versus Encoder-decoder from the overall architectural design perspective.

Single-stream versus Dual-stream
Single-stream Architecture.
The single-stream architecture [9,30,31] refers to designs in which the text and visual features are concatenated together and then fed into a single transformer block, as shown in Figure 1 (a). The single-stream structure utilizes merged attention to fuse multi-modal inputs. It is more parameter-efficient, as the same set of parameters is used for both modalities.
Dual-stream Architecture.
The dual-stream architecture [32,33] refers to designs in which the text and visual features are not concatenated but instead sent to two different transformer blocks independently, as shown in Figure 1 (b). These two transformer blocks do not share parameters. To achieve higher performance, cross-attention (shown by the dotted line in Figure 1 (b)) is used to enable cross-modal interaction.
To achieve higher efficiency, there can also be no cross-attention between the visual transformer and textual transformer blocks.
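The difference between the two fusion styles can be sketched with a single-head, projection-free attention function; token counts and dimensions are toy values, and real models add learned projections, multiple heads, and residual layers:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, no projections, for illustration)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

d = 32
text  = np.random.rand(5, d)    # 5 text tokens (toy features)
image = np.random.rand(9, d)    # 9 visual tokens

# Single-stream: concatenate, then one self-attention over the merged sequence.
merged = np.concatenate([text, image], axis=0)        # (14, d)
single_out = attention(merged, merged, merged)

# Dual-stream: separate self-attention per modality, optionally plus cross-attention
# in which text queries attend to image keys/values (the dotted line in Figure 1 (b)).
text_out  = attention(text, text, text)
cross_out = attention(text, image, image)
```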

Encoder-only versus Encoder-decoder
Many VLP models adopt the encoder-only architecture, where the cross-modal representations are directly fed into an output layer to generate the final outputs. In contrast, other VLP models advocate a transformer encoder-decoder architecture, where the cross-modal representations are first fed into a decoder and then into an output layer.

Pre-training Objectives
This section introduces how we pre-train VLP models by using different pre-training objectives, which are crucial for learning the universal representation of vision-language.We summarize the pre-training objectives into four categories: completion, matching, temporal, and particular types.
• Completion reconstructs a masked element by leveraging the unmasked remainder, driving the model to understand each modality (see Section 4).
• Matching unifies vision and language into a shared latent space and aligns them, as in vision-language matching and contrastive learning (see Section 4).
• Temporal objectives learn the sequential order of video inputs, such as frame order modeling (see Section 4).
• Particular types adopt the training objectives of downstream tasks, such as visual question answering and visual captioning, as pre-training objectives (see Section 4).

Masked Language Modeling
Masked language modeling (MLM), first proposed by Taylor [34], is widely known because the BERT model adopted it as a novel pre-training task. To model language conditioned on vision, MLM in VLP models is similar to MLM in pre-trained language models (PLMs), but it predicts the masked textual tokens not only from the rest of the textual tokens but also from the visual tokens. Empirically, VLP models following BERT randomly mask each textual input token with probability 15% and replace a masked token with the special token [MASK] 80% of the time, a random textual token 10% of the time, and the original token 10% of the time. The formal definition is as follows:

L_MLM(θ) = −E_{(w,v)∼D} log P_θ(w_m | w_{\m}, v),

where v denotes the vision input, w denotes the textual tokens, w_m denotes the masked textual tokens, w_{\m} denotes the remaining textual tokens, and D denotes the training dataset.
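The BERT-style 15% / 80-10-10 masking procedure can be sketched as follows; the token strings and the tiny vocabulary are illustrative only:

```python
import random

VOCAB = ["dog", "cat", "runs", "park", "ball"]   # toy vocabulary for random replacement

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    out, positions = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:          # select ~15% of tokens
            positions.append(i)
            r = rng.random()
            if r < 0.8:                       # 80%: replace with [MASK]
                out[i] = "[MASK]"
            elif r < 0.9:                     # 10%: replace with a random token
                out[i] = rng.choice(VOCAB)
            # remaining 10%: keep the original token
    return out, positions

tokens = ["a", "dog", "runs", "in", "the", "park"]
masked, positions = mask_tokens(tokens)
```

The model is then trained to predict the original token at each position in `positions`, conditioned on the unmasked text and the visual tokens.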

Prefix Language Modeling
Prefix Language Modeling (PrefixLM) [14] unifies MLM and language modeling (LM). To give the model both strong understanding and generation ability, PrefixLM is proposed to equip the model with solid generation capability that enables text-induced zero-shot generalization without fine-tuning. PrefixLM differs from the standard LM in that it enables bi-directional attention on the prefix sequence and only performs autoregressive factorization on the remaining tokens. PrefixLM under the sequence-to-sequence (seq2seq) framework not only enjoys the bidirectional contextualized representation of MLM but can also perform text generation similar to LM. The formal definition is as follows:

L_PrefixLM(θ) = −E_{(w,v)∼D} log P_θ(w_{≥T_p} | w_{<T_p}, v),

where T_p denotes the length of the prefix sequence.
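The key mechanism, bidirectional attention over the prefix and causal attention over the rest, comes down to the attention mask. A minimal sketch (1 means position j is visible to position i):

```python
import numpy as np

def prefix_lm_mask(seq_len, prefix_len):
    """Bidirectional attention within the prefix, causal attention elsewhere."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=int))  # causal base: j <= i
    mask[:prefix_len, :prefix_len] = 1                      # full attention on prefix
    return mask

m = prefix_lm_mask(6, 3)   # 6 tokens, the first 3 form the prefix
```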

Masked Vision Modeling
To gain a good understanding of vision or to generate images/videos given text, masked vision modeling (MVM) [30], like MLM, samples vision (image or video) regions or patches and usually masks their visual features with a probability of 15%. VLP models need to reconstruct the masked visual features given the remaining visual features and all the textual features. The masked visual features are set to zeros. Because visual features are high-dimensional and continuous, VLP models propose two variants of MVM.
(1) Masked Feature Regression learns to regress the model output at the masked features to the original visual features. VLP models first convert the model output of the masked features to a vector of the same dimension as the original visual features and then apply L2 regression between the two. The formal definition is as follows:

L_MFR(θ) = E_{(w,v)∼D} Σ_{i=1}^{K} ||h_θ(v_m^i) − O(v_m^i)||²_2,

where h_θ(v_m^i) denotes the predicted vision representation, O(v_m^i) denotes the original vision representation, and K denotes the number of masked vision regions.
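A minimal NumPy sketch of this L2 regression; the number of masked regions and the 2048-d feature size are illustrative:

```python
import numpy as np

def masked_feature_regression_loss(pred, target):
    """Mean L2 distance between predicted and original features of masked regions.

    pred, target: arrays of shape (num_masked_regions, feature_dim).
    """
    return np.sum((pred - target) ** 2, axis=-1).mean()

orig = np.random.rand(4, 2048)     # original RoI features of 4 masked regions
```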
(2) Masked Feature Classification learns to predict the object semantic class of the masked features. VLP models first feed the output of the masked features into an FC layer to predict scores over object classes, which further go through a softmax function to obtain a normalized prediction distribution. Note that there is no ground-truth label, so there are two kinds of methods to train VLP models. One takes the most likely object class from the object detection model as a hard label (w.p. 0 or 1), assuming the detected object class is the ground-truth label for the masked features, and applies cross-entropy loss to minimize the gap between the prediction and this pseudo class. Here the object detection output from Faster R-CNN is used, and the detected object category serves as the label of the masked region:

L_MFC(θ) = E_{(w,v)∼D} Σ_{i=1}^{K} CE(g_1(v_m^i), g_θ(v_m^i)),

where g_1(v_m^i) denotes the detected object category, g_θ(v_m^i) denotes the predicted class distribution, and K denotes the number of masked vision regions.
The other avoids this assumption by using a soft label as the supervision signal, i.e., the raw output from the detector (a distribution over object classes), and minimizes the KL divergence between the two distributions:

L_MFC-KL(θ) = E_{(w,v)∼D} Σ_{i=1}^{K} D_KL(ĝ_1(v_m^i) || g_θ(v_m^i)),

where ĝ_1(v_m^i) denotes the detected object category distribution.
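The two variants can be sketched together; `detector_probs` stands in for the Faster R-CNN class distribution, and all shapes and values are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def mfc_losses(logits, detector_probs):
    """Hard-label cross-entropy vs. soft-label KL for masked feature classification.

    logits: model scores per masked region, shape (K, num_classes).
    detector_probs: detector's class distribution per region, same shape.
    """
    probs = softmax(logits)                                  # model's distribution
    hard_labels = detector_probs.argmax(-1)                  # pseudo ground-truth class
    hard = -np.log(probs[np.arange(len(probs)), hard_labels]).mean()
    soft = np.sum(detector_probs * np.log(detector_probs / probs), -1).mean()
    return hard, soft

logits = np.random.randn(4, 10)
det = softmax(np.random.randn(4, 10))
hard, soft = mfc_losses(logits, det)
```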

Vision-Language Matching
Vision-Language Matching (VLM) [35] is the most commonly used pre-training objective for aligning vision and language; it aims to project vision and language into the same space. Single-stream VLP models use the representation of the special token [CLS] as the fused representation of both modalities. Dual-stream VLP models concatenate the visual representation of the special visual token [CLS_V] and the textual representation of the special textual token [CLS_T] as the fused representation of both modalities. VLP models feed the fused representation into an FC layer and a sigmoid function to predict a score between 0 and 1, where 0 indicates that the vision and language are mismatched and 1 indicates that they are matched. During training, VLP models sample positive and negative pairs from the dataset at each step. A negative pair is created by replacing the vision or text in a paired sample with one randomly selected from another sample.
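The VLM head itself is tiny, an FC layer followed by a sigmoid; the 768-d size and random weights below are illustrative stand-ins for a learned model:

```python
import numpy as np

def vlm_match_score(fused_cls, w, b=0.0):
    """FC layer plus sigmoid on the fused [CLS] vector -> match score in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(fused_cls @ w + b)))

fused = np.random.rand(768)        # fused representation of one image-text pair
w = np.random.randn(768) * 0.01    # toy FC weights
score = vlm_match_score(fused, w)  # near 1 for matched, near 0 for mismatched pairs
```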

Vision-Language Contrastive Learning
Vision-Language Contrastive Learning (VLC) [35] also aims to align vision and language. Different from VLM, VLC predicts the N matched vision-language pairs from the N × N possible pairs in a batch of N vision-language pairs; note that there are N² − N negative pairs within a training batch. VLP models use the visual representation of the special visual token [CLS_V] and the textual representation of the special textual token [CLS_T] as the aggregated representations of the vision and language, respectively. VLP models compute the softmax-normalized vision (image or video)-to-text and text-to-vision similarities and use cross-entropy losses over these similarities to update themselves. The similarity is often implemented by dot products. The formal definitions are as follows:

p_i^{v2t}(I) = exp(s(I, T_i)/τ) / Σ_{j=1}^{N} exp(s(I, T_j)/τ),  p_i^{t2v}(T) = exp(s(T, I_i)/τ) / Σ_{j=1}^{N} exp(s(T, I_j)/τ),

L_VLC(θ) = (1/2) E_{(I,T)∼D} [CE(y^{v2t}, p^{v2t}(I)) + CE(y^{t2v}, p^{t2v}(T))],

where I and T denote the images and texts, s(·, ·) denotes the similarity function, and τ denotes the temperature coefficient. y^{v2t} and y^{t2v} denote the ground-truth labels of vision-to-text and text-to-vision retrieval.
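A CLIP-style sketch of this symmetric contrastive loss for a batch of N pairs; the L2 normalization before the dot product is a common additional assumption, not something the definition above requires:

```python
import numpy as np

def vlc_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE over an N×N dot-product similarity matrix.

    The matched pair for each row/column sits on the diagonal.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    sim = img @ txt.T / tau                      # (N, N) similarity matrix

    def ce(logits):                              # cross-entropy with diagonal labels
        p = np.exp(logits - logits.max(-1, keepdims=True))
        p /= p.sum(-1, keepdims=True)
        return -np.log(np.diag(p)).mean()

    return 0.5 * (ce(sim) + ce(sim.T))           # vision-to-text + text-to-vision
```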

Word-Region Alignment
Word-Region Alignment (WRA) [30] is an unsupervised pre-training objective that aligns vision regions (vision patches) with words. VLP models utilize Optimal Transport (OT) to learn the alignment between vision and language. Empirically, VLP models use the IPOT algorithm to approximate the OT distance, since the exact minimization is computationally intractable. After solving the minimization, the OT distance serves as the WRA loss to train VLP models. The formal definition is as follows:

L_WRA(θ) = min_{T∈Π(a,b)} Σ_{i=1}^{T_w} Σ_{j=1}^{K} T_ij · c(w_i, v_j),

where c(w_i, v_j) is the cost function evaluating the distance between w_i and v_j, Π(a, b) = {T ∈ R^{T_w×K} | T 1_K = a, Tᵀ 1_{T_w} = b} is the set of transport plans, and a and b are the coefficients of the Dirac functions centered on w_i and v_j.
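As a hedged stand-in for IPOT (which the models above actually use), plain Sinkhorn iterations for entropy-regularized OT show the same idea: find a transport plan between word and region weights and use the transport cost as the loss. Sizes, cost values, and the regularization strength are all illustrative:

```python
import numpy as np

def sinkhorn_ot(cost, a, b, eps=0.1, iters=200):
    """Entropy-regularized OT via Sinkhorn iterations (a simpler stand-in for IPOT).

    cost: (n, m) pairwise cost matrix c(w_i, v_j); a, b: marginal weights.
    """
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    T = u[:, None] * K * v[None, :]       # transport plan, approximately in Π(a, b)
    return np.sum(T * cost), T            # the OT distance serves as the WRA loss

cost = np.random.rand(3, 4)               # 3 words vs 4 regions (toy cost matrix)
a = np.full(3, 1 / 3)                     # uniform word weights
b = np.full(4, 1 / 4)                     # uniform region weights
dist, plan = sinkhorn_ot(cost, a, b)
```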

Frame Order Modeling
To better model the temporal order of video, VLP models randomly disrupt the order of some input frames and then predict the actual position of each frame. In practice, Frame Order Modeling (FOM) [36] is modeled as a classification task.
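Constructing such a training sample can be sketched as follows; the frame counts are toy values, and in a real pipeline the model would see the frame features in the disrupted order and classify each slot's true timestamp:

```python
import random

def make_fom_sample(num_frames, num_shuffled, seed=0):
    """Disrupt the order of some frames; the label at slot i is the true
    timestamp of the frame placed there (which the model must recover)."""
    rng = random.Random(seed)
    order = list(range(num_frames))            # order[i] = true index of frame at slot i
    slots = rng.sample(range(num_frames), num_shuffled)
    frames = [order[s] for s in slots]
    rng.shuffle(frames)                        # permute the chosen frames
    for s, f in zip(slots, frames):
        order[s] = f
    return order

order = make_fom_sample(8, 4)
```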

Particular Pre-training Objectives
To better adapt to downstream tasks, VLP models sometimes use the training objectives of downstream tasks, such as visual question answering (VQA) [37,38,12] and visual captioning (VC) [39,40], as pre-training objectives. For VQA, VLP models take the fused representation mentioned above, apply an FC layer, and use the transformed representation to predict a classification over predefined answer candidates. Besides tackling the task as classification over predefined answer candidates, VLP models can also directly generate answers in their original text format. For VC, to reconstruct the input sentence and endow VLP models with generation capability, VLP models employ an auto-regressive decoder to generate a textual description corresponding to the image or video. Note that, due to space limitations, we only introduce some popular pre-training objectives. We omit some specific pre-training objectives such as grounding referring expression (GRE), image-conditioned denoising autoencoding (IDA) [41], text-conditioned image feature generation (TIFG) [41], object detection (OD) [42], and aligned Kaleido patch modeling (AKPM) [43]. Moreover, we place masked action prediction in the MVM category.

Pre-training Datasets
Pre-training datasets are significant for the success of cross-modal representation learning. The quality and size of pre-training datasets sometimes outweigh the importance of training strategies and algorithms. Hence, a detailed description of several widely used pre-training datasets is necessary. Table 1 shows statistics of some popular pre-training datasets for VLP.
Since VLP includes image-language pre-training and video-language pre-training, we roughly divide pre-training datasets into two main categories. In later sections, we provide more details about representative pre-training datasets for each category. It is worth noting that, no matter which category pre-training datasets belong to, they differ in size and sources across different studies. In most works, the pre-training datasets for VLP are constructed by combining public datasets across different cross-modal tasks or scenarios. However, other works, such as VideoBERT [58], ImageBERT [53], ALIGN [55], and CLIP [16], conduct pre-training on self-constructed datasets. These self-constructed datasets are usually larger than most public datasets but might contain more noise.

Datasets for Image-language Pre-training
For image-language pre-training, the most widely used data form is image-text pairs. Most image-language pre-training datasets consist of a large number of image-caption pairs. SBU [44] and Flickr30k [45] are collected from Flickr and labelled with human-generated annotations. COCO [46] consists of images with five human-generated captions each, filtered with special procedures to guarantee the quality of images and annotations. CC3M [51] and CC12M [54] are constructed by crawling images and their alt-text HTML attributes from the Internet and annotating these pictures with filtered descriptions. Due to looser filtering strategies, CC12M contains more noise than CC3M. Another data source is the visual question answering task. Many image-language datasets are organized as structured data in the context of visual question answering. The representative large-scale dataset is Visual Genome (VG) [47]. VG contains rich information in its structured data form. Its region-level descriptions and question-answer pairs are frequently used in the study of image-language pre-training. Besides VG, VQA [48] and GQA [52] are also popular datasets of visual question-answer pairs. Compared with VQA, GQA further alleviates systematic biases.
The datasets mentioned above are suitable for most common scenarios. There are also some datasets designed for special cases. Matterport3D [49] consists of RGB-D images of building-scale scenes, annotated with labels for classification and segmentation. Fashion-Gen [50] contains fashion images paired with item descriptions generated by professional stylists.

Datasets for Video-language Pre-training
Compared to image-language pre-training datasets, video-language pre-training datasets are usually more time-consuming and more difficult to collect and process. These inconveniences restrict the development of the community and the scale of pre-training. Datasets for video-language pre-training cover different scenarios and sources. Most of them, such as Kinetics-400 [23], HowTo100M [56] and WebVid-2M [57], are collected from the Internet and processed with different procedures. These kinds of videos are usually accompanied by subtitles, thus providing weak or strong alignments between video clips and text. Although those subtitles sometimes might be too weak to align modalities, they still provide useful information, especially for pre-training on large-scale datasets. Another source of video-text pairs is television programs. TVQA [38] is a video-language pre-training dataset generated from television shows. These television shows are collected and converted into a dataset comprising many dialogues for understanding the videos and recognizing semantic concepts in them.
Considering the diversity of the sources and formation of these datasets, researchers apply different annotation and processing procedures. For example, Kinetics-400 [23] consists of many action-related videos annotated with action classes. For other datasets [38,56,57], the accompanying captions/subtitles of video clips or the classes of concepts in videos are usually processed and used as annotations.

Downstream Tasks
As shown in Figure 2, a diverse range of tasks requires a cooperative knowledge of vision and language.In this section, we introduce the fundamental details and goals of these tasks.
Visual Question Answering (VQA) [37,59,60,61]. Given a visual input (image or video), VQA is the task of correctly providing an answer to a question. It is usually regarded as a classification task where the model predicts the most suitable answer from a pool of choices. To obtain accurate performance, it is important to infer logical entailments from images (or videos) based on the question posed.
Visual Reasoning and Compositional Question Answering (GQA) [52,62,63]. GQA is an upgraded version of VQA and aims to advance research on the visual reasoning of natural scenes. The images, questions, and answers in its dataset have matching semantic representations. The advantage of this structured representation is that the distribution of answers can be more uniform, and the model's performance can be analyzed along more dimensions. Compared with the single evaluation metric (e.g., accuracy) of traditional VQA, GQA includes multi-dimensional evaluation metrics: consistency, validity, plausibility, distribution, and grounding.
Video-Language Inference (VLI) [36,64,65] .Given a video clip with aligned subtitles as a premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
Visual Entailment (VE) [66,67,68]. In the VE task, the image is the premise and the text is the hypothesis. Its goal is to predict whether the image semantically entails the text. There are three labels: Entailment, Neutral, and Contradiction.
Visual Commonsense Reasoning (VCR) [69,70,71]. VCR is the task of inferring commonsense information and cognitive understanding by a machine when it sees an image. It exists in the form of multiple-choice questions. For a question posed about the image, there are several alternative answers. The model must choose an answer from the alternatives and then select the reason for choosing this answer from several alternative rationales. Thus, VCR can be divided into two sub-tasks: question answering (selecting the best answer from a pool of expected answers to the question) and answer justification (providing the rationale behind the given answer). One can follow VCR's leaderboard to track the latest ideas in VLP.
Natural Language for Visual Reasoning (NLVR) [72,73]. NLVR is a subtask of the broader VCR category, limited to the classification paradigm. The input of the NLVR task is two images and a text description, and the output is whether the correspondence between the images and the text description is consistent (two labels: true or false). It typically differs from VQA due to longer text sequences covering various linguistic phenomena.
Grounding Referring Expressions (GRE) [74,75,76]. The GRE task aims to localize certain regions (e.g., objects and persons) in an image given a referring expression, where the main challenge is to comprehend and align various types of information from the visual and textual domains, such as visual attributes, locations, and interactions with surrounding regions. Specifically, the model outputs a score for each region, and the region with the highest score is used as the predicted region.
Category Recognition (CR). CR refers to identifying the category and sub-category of a product, such as {HOODIES, SWEATERS} and {TROUSERS, PANTS}, which are vital attributes for describing a product and are useful in many real-life applications.
Multi-modal Sentiment Analysis (MSA) [77,78,79,80]. MSA aims to detect sentiments in videos by leveraging multi-modal signals (e.g., vision, language, etc.). It predicts the affective orientation of an utterance as a continuous intensity variable.
Vision-Language Retrieval (VLR). VLR involves understanding both vision (image or video) and language domains with appropriate matching strategies. It includes two subtasks, vision-to-text and text-to-vision retrieval, where vision-to-text retrieval fetches the most relevant text description from a larger pool of descriptions given a visual input, and vice versa. VLR is widely used in domain-specific search, multi-modal search engines, and context-based vision retrieval systems.
Visual Captioning (VC). VC aims to generate semantically and syntactically appropriate text descriptions for a given visual (image or video) input. Generating relevant and explanatory captions for a visual input requires not only rich knowledge of language but also a consistent understanding of the scenes, entities, and their interactions appearing in the visual input.
Novel Object Captioning at Scale (NoCaps) [87,88] .NoCaps extends the VC task to test a model's capability of describing novel objects from the Open Images dataset, which are unseen in the training corpus.
Visual Dialogue (VD). The specific task in VD is the following: given an image, a dialogue history consisting of a sequence of question-answer pairs, and a natural-language follow-up question, the goal is to respond to the question in free-form natural language (e.g., generate an answer). VD is the visual analogue of the Turing Test.
Multi-modal Machine Translation (MMT). MMT is a two-fold task of translation and text generation: translating text from one language to another with additional information from other modalities, e.g., images. The additional visual features help remove ambiguities that may arise in plain text machine translation and help retain the context of the text descriptions. The multi-modal representation space facilitates robust latent representations that complement the semantic information preserved by the visual and linguistic embeddings, respectively.
Vision-Language Navigation (VLN). VLN is a grounded language task in which an agent navigates, seeing and exploring real-world dynamics based on linguistic instructions. Like generation tasks, it is typically framed as sequence-to-sequence transcoding. However, VLN has unique characteristics: it usually involves longer sequences, and its dynamics are quite different since it is a real-time, evolving task. Its main challenge lies in understanding the environment and making confident decisions while exploring.
Optical Character Recognition (OCR). OCR generally refers to extracting handwritten or printed text from images (such as street signs and photos of products) as well as documents (articles, bills, invoices, financial reports, etc.). It includes two parts: text detection (similar to regression) and text recognition (similar to classification).
In addition, there are some image-related downstream tasks for evaluating image-text pre-training models, including semantic segmentation [101,102] and object detection [103,104]. There are also some video-related downstream tasks for evaluating video-text pre-training models, including action classification (AC) [58], action segmentation (AS) [105], and action step localization (ASL) [106].
Recently, Changpinyo et al. [54] scaled up pre-training data for VLP tasks and benchmarked its effectiveness against Conceptual Captions 3M on multiple downstream tasks, with an emphasis on long-tail visual recognition. Rethmeier et al. [107] studied the performance of a pre-trained model on a challenging long-tail task and analyzed the resulting long-tail learning capabilities under zero-shot, few-shot, and full-supervision conditions, exploring the influence of model size and the amount of self-supervision signal.

SOTA VLP models
Image-Text VLP models.
VisualBERT [9], known as the first image-text pre-training model, uses visual features extracted by Faster R-CNN, concatenates the visual features with textual embeddings, and then feeds the concatenated features into a single transformer initialized with BERT. Many VLP models [13,110,30,53] follow a similar feature extraction pipeline and architecture to VisualBERT while adjusting the pre-training objectives and datasets. Recently, VD-BERT [134] models the common implicit vision-language alignment by pre-training on large-scale image-text pairs via transfer learning [135,136]. VLMO [129] leverages patch embeddings for images and word embeddings for text, feeds the concatenated embeddings into a single transformer with modality experts, and achieves impressive performance. METER [33] explores how to use uni-modal pre-trained models and proposes a dual-stream architecture to handle multi-modal fusion, achieving SOTA performance on many downstream tasks. A summary of mainstream image-text VLP models is shown in Table 2.

Video-Text VLP models.
VideoBERT [58], known as the first video-text pre-training model, extends BERT to process videos and texts simultaneously. VideoBERT uses a pre-trained ConvNet and S3D [137] to extract video features, which are concatenated with textual word embeddings and fed into a transformer initialized with BERT. The ConvNet and S3D are frozen while training VideoBERT, so the approach is not end-to-end. Recently, inspired by ViT, CLIP4Clip [17] and CLIP2Video [18] first split video clips into frames and obtain patch embeddings for each frame following the way ViT processes images. CLIP4Clip and CLIP2Video are optimized in an end-to-end manner and achieve SOTA performance. A summary of mainstream video-text VLP models is shown in Table 3.
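The ViT-style frame-to-patch step can be sketched as follows. This is a minimal numpy illustration under stated assumptions: the 224x224 frame size, 16-pixel patches, 512-dimensional projection, and the `frame_to_patch_embeddings` helper are all illustrative; real models additionally add positional (and temporal) embeddings before the transformer.

```python
import numpy as np

def frame_to_patch_embeddings(frame, patch, w_proj):
    """Split an (H, W, C) frame into non-overlapping patches, flatten each
    patch, and linearly project it into the model dimension (ViT-style)."""
    h, w, c = frame.shape
    patches = (frame.reshape(h // patch, patch, w // patch, patch, c)
                    .transpose(0, 2, 1, 3, 4)          # group by patch grid
                    .reshape(-1, patch * patch * c))   # one row per patch
    return patches @ w_proj                            # (n_patches, d_model)

rng = np.random.default_rng(0)
frame = rng.normal(size=(224, 224, 3))                 # one sampled video frame
w_proj = rng.normal(size=(16 * 16 * 3, 512))           # learned patch projection

emb = frame_to_patch_embeddings(frame, 16, w_proj)
print(emb.shape)                                       # (196, 512)
```

Each sampled frame yields a 14x14 grid of patch tokens here; a clip is then represented by the tokens of all its sampled frames.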

Conclusion and New Frontiers
In this paper, we provide the first VLP survey. We review its recent advances from five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks, and summarize the specific SOTA VLP models in detail. We hope our survey can help researchers understand VLP better and inspire new work to advance this field. In the future, based on existing works, VLP can be further developed along the following directions.

Incorporating Acoustic Information.
Most previous works on multi-modal pre-training emphasize the joint modeling of language and vision but ignore the information buried in audio [138,139]. Although the semantic information in audio might overlap with language, audio can provide extra emotional information, acoustic boundary information, etc. Moreover, pre-training with audio makes a model capable of handling downstream tasks with acoustic inputs. Until now, joint modeling and representation across text, vision, and audio remains an open problem left for further investigation. Several cutting-edge works have shed light on the future of this research field. Unlike previous VLP models, VATT [140] takes raw audio as input and learns multi-modal representations with noise contrastive estimation (NCE). Differing from VATT, OPT [141] learns cross-modal representations across text, image, and audio jointly with various multi-level masking strategies, and it is also capable of generating text and images. Some other works, such as AudioCLIP [142] and MERLOT Reserve [143], also show unique approaches to learning cross-modal representations over the three modalities.
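The NCE-style objective used in this line of work can be sketched as an InfoNCE loss over a batch of paired embeddings. This is a generic illustration, not VATT's exact implementation; the embedding dimension, temperature, and `info_nce` helper are assumptions for the sketch.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """InfoNCE: treat matching rows of a and b as positives and all other
    rows in the batch as negatives; return the mean cross-entropy loss."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)   # L2-normalize
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                     # pairwise similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                # positives on diagonal

rng = np.random.default_rng(1)
video_emb = rng.normal(size=(8, 32))                   # batch of 8 clip embeddings
text_emb = rng.normal(size=(8, 32))                    # unrelated embeddings
loss_matched = info_nce(video_emb, video_emb)          # perfectly aligned pairs
loss_random = info_nce(video_emb, text_emb)            # misaligned pairs
```

As expected, the loss is near zero when the paired embeddings coincide and much larger when they are unrelated, which is the signal that pulls matching modalities together during pre-training.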

Knowledgeable and Cognitive Learning.
Although existing VLP models have achieved remarkable performance, their essence is to fit large-scale multi-modal datasets. Making VLP models more knowledgeable is important for future VLP. For the input vision and text, there is rich related external knowledge, both common-sense world knowledge and illustrative situational knowledge [144], which can be used to augment the input and accelerate model training and inference. Solving this problem requires unified cognitive model architectures, knowledge-guided pre-training objectives, and support for interacting with new knowledge.

Prompt Tuning.
Currently, fine-tuning is the dominant method for transferring the knowledge of VLP models to downstream tasks. However, as model scale increases, each downstream task keeps its own set of fine-tuned parameters, leading to parameter inefficiency. Moreover, the diversity of downstream tasks makes the design of the pre-training and fine-tuning stages cumbersome, leading to a gap between them. Recently, prompt tuning has attracted increasing attention in NLP. By designing discrete or continuous prompts and using MLM for specific downstream tasks, these models can: 1) reduce the computational cost of fine-tuning enormous numbers of parameters; and 2) bridge the gap between pre-training and fine-tuning. Prompt tuning is a promising way to stimulate the linguistic and world knowledge distributed in PLMs. As a next step, it can be improved and transferred to multi-modal scenarios, breaking the traditional paradigm and solving the pain points of VLP [145].
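The discrete-prompt idea can be sketched as follows: a downstream classification task is recast as filling a [MASK] slot, and a verbalizer maps label names to vocabulary words. Everything here is hypothetical for illustration: the template, verbalizer, vocabulary, and logits stand in for a real pre-trained MLM head.

```python
# Hypothetical sketch: recast a binary image-sentiment task as mask filling.
TEMPLATE = "The picture shows a [MASK] scene."
VERBALIZER = {"positive": "happy", "negative": "sad"}   # label -> vocab word
VOCAB = {"happy": 0, "sad": 1}                          # word -> logit index

def classify(mask_logits):
    """Pick the label whose verbalizer word scores highest at the [MASK] slot."""
    return max(VERBALIZER, key=lambda lbl: mask_logits[VOCAB[VERBALIZER[lbl]]])

# Pretend the (frozen) MLM head assigned these logits at the [MASK] position.
pred = classify([2.3, 0.1])
print(pred)   # positive
```

No new classification head is trained: the pre-trained MLM objective is reused as-is, which is exactly how prompt tuning narrows the gap between pre-training and downstream usage.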

Model Compression and Acceleration.
Model compression and acceleration is an essential approach to improving the efficiency of VLP models. Large models are compressed into small ones to meet the need for faster inference and deployment in various real-life scenarios such as resource-constrained devices. For general PLMs, model compression and acceleration is a hot topic, and specific methods include parameter sharing [25], model pruning [146], knowledge distillation [147], and model quantization [148]. Recently, knowledge distillation has been used to compress VLP models [149], but other methods such as pruning and quantization of VLP models remain to be explored. Furthermore, a data-efficient VLP paradigm has been constructed [150]. However, only a few efforts currently focus on improving the efficiency of VLP models, leaving much room for exploration.
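The knowledge-distillation objective mentioned above can be sketched as a KL divergence between temperature-softened teacher and student outputs. This is the standard formulation, shown here as a minimal numpy sketch rather than any specific VLP distillation recipe; the logits and temperature are illustrative.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as is standard in knowledge distillation."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float(T * T * np.sum(p_t * (np.log(p_t) - np.log(p_s))))

teacher = np.array([4.0, 1.0, 0.5])                       # teacher logits
loss_match = distillation_loss(teacher.copy(), teacher)   # student mimics teacher
loss_off = distillation_loss(np.array([1.0, 4.0, 0.5]), teacher)
```

The loss vanishes when the student reproduces the teacher's distribution and grows as they diverge; in practice it is combined with the task loss on ground-truth labels.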

Out-of-domain Pretraining.
Despite the significant progress achieved by VLP models, part of their success can be traced back to the introduction of in-domain pre-training datasets that are used in both pre-training and downstream tasks. Out-of-domain pre-training will be an essential research direction, that is, VLP models transferring their learned knowledge and representations to downstream tasks with unknown data distributions. To mitigate the distribution bias between pre-training and fine-tuning, DeVLBert [32] is proposed to perform intervention-based learning. It borrows the idea of backdoor adjustment from causality research and designs several neural-network-based structures for BERT-style out-of-domain pre-training.

Advanced Model Architecture.
Nowadays, transformer-based architectures have made great progress in VLP. But is such a structure optimal for VLP? We note that the recently popular diffusion model [151] has achieved great success in image generation. Some researchers [152] have also extended the diffusion model to controllable text generation. So can the diffusion model be used in VLP? This may be a question worth exploring in the future. Moreover, neural networks themselves are inspired by neuroscience, and we can explore next-generation VLP frameworks with support from other disciplines. Inspirations from mathematics include frameworks over non-Euclidean (manifold) spaces and ways of injecting geometric priors into the model [153,154], which are relatively new research directions. Research on energy-efficient spiking neural networks [155,156,157] in the brain-inspired field may also provide insights for exploring novel VLP architectures.

Fig. 1 Illustration of two types of model architectures for VLP.

Table 1
Details of some popular pre-training datasets for VLP. Names of some datasets are abbreviated for convenience of subsequent description. FLKR represents Flickr30k, and HT100M represents HowTo100M.