Large-scale Multi-modal Pre-trained Models: A Comprehensive Survey

With the urgent demand for generalized deep models, many pre-trained big models have been proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT), and generative pre-trained transformers (GPT). Inspired by the success of these models in single domains (like computer vision and natural language processing), multi-modal pre-trained big models have also drawn increasing attention in recent years. In this work, we give a comprehensive survey of these models and hope this paper provides new insights and helps new researchers track the most cutting-edge works. Specifically, we first introduce the background of multi-modal pre-training by reviewing conventional deep learning and pre-training works in natural language processing, computer vision, and speech. Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-trained models (MM-PTMs), and discuss MM-PTMs with a focus on data, objectives, network architectures, and knowledge-enhanced pre-training. After that, we introduce the downstream tasks used for the validation of large-scale MM-PTMs, including generative, classification, and regression tasks. We also give a visualization and analysis of the model parameters and results on representative downstream tasks. Finally, we point out possible research directions for this topic that may benefit future works. In addition, we maintain a continuously updated paper list for large-scale pre-trained multi-modal big models: https://github.com/wangxiao5791509/MultiModal_BigModels_Survey.


Introduction
Along with the breakthrough recognition performance of AlexNet [1] in the ImageNet competition [2], artificial intelligence has developed rapidly. Many representative deep neural networks have been proposed, such as VGG [3], ResNet [4], Inception [5], and LSTM [6]. Researchers usually collect and annotate samples for their task and train their models on top of backbones pre-trained on large-scale datasets (such as ImageNet [2] for computer vision, and GloVe [7] and Skip-thought vectors [8] for natural language processing). Many tasks, such as object detection, segmentation, and recognition, can be solved well in such an end-to-end manner compared with traditional handcrafted features. However, the generalization ability of the obtained deep models is still limited. Collecting and annotating a larger dataset can address this issue to some extent, but the procedure is expensive and tedious.
To address this issue, Vaswani et al. propose the Transformer network [9], which achieves new SOTA (State-Of-The-Art) performance on machine translation. After that, the paradigm of self-supervised pre-training on a large-scale corpus followed by fine-tuning on downstream tasks attracts more and more researchers' attention. Many pre-trained big models follow this paradigm, such as BERT [10], GPT [11, 12], T5 [13], and XLNet [14], which also trigger new research highlights of pre-training in the CV community. More and more large-scale NLP and CV models, including ViT [15] and Swin-Transformer [16], demonstrate the power of the pretrain-then-finetune paradigm.
Although this progress brings new impetus to the development of artificial intelligence, the issues caused by the defects of a single modality are still hard to solve. Researchers attempt to incorporate more modalities to bridge the data gap for deep models. Many multi-modality fusion based tasks are also explored in a traditional deep learning manner, involving modalities such as RGB, depth, natural language, point cloud, audio, and event streams. Many large-scale pre-trained multi-modal models [17-23] have been proposed, setting new SOTA on downstream tasks one after another, as shown in Fig. 1. In this paper, we give a comprehensive review of these works, which aims to help new researchers interested in this area understand the history and latest developments quickly.
Organization of our review. In this paper, we first review the background of multi-modal pre-training techniques in Section 2, from the traditional deep learning paradigm to pre-training in single-modality tasks, including natural language processing, computer vision, and automatic speech processing. Then, we focus on MM-PTMs and describe the task definition, key challenges, and benefits in Sections 3.1 and 3.2. The key components are reviewed in the following sub-sections, including large-scale data, network architectures, optimization objectives, and knowledge-enhanced pre-training. To validate the effectiveness of pre-trained models, many downstream tasks are used for quantitative assessment. In Section 4, we provide detailed reviews of the task definitions and evaluation metrics of these tasks. In Section 5, we review the model parameters and hardware used for training, and also report the experimental results on several representative downstream tasks. Finally, in Section 6, we conclude this survey and propose multiple research directions that need to be studied. The architecture of this survey is visualized in Fig. 2.
Difference from existing reviews. Although two surveys [24, 25] have already been proposed for MM-PTMs, our survey differs from them as follows:
• Scope: Existing multi-modal surveys [24, 25] focus on vision-language only; however, multi-modal information processing is a wider research topic. This paper is more comprehensive than the aforementioned reviews, introducing more modalities, such as audio, video, and tables.
• Timeliness: This paper introduces the latest datasets and algorithms (from 2019 to June 2022) proposed for multi-modal pre-training, and is a long survey, whereas the existing works are short papers.
• New insights into MM-PTMs: By classifying and analyzing the existing MM-PTMs from different perspectives, this article can help readers master the cutting-edge methods and techniques from both detailed and high-level perspectives. In addition, our proposed research directions on MM-PTMs are deliberate and will provide new clues for follow-up research.

Conventional Deep Learning
With the release of AlexNet [1], a series of deep learning models have been proposed in the artificial intelligence community. These deep models show better capabilities for fitting complex data than conventional machine learning models. From the perspective of their development (LeNet [51] → AlexNet [1] → VGG [3] → ResNet [4] → DenseNet [52]), we can find that the architectures become deeper and deeper, and the corresponding performance accordingly becomes better. The success of these approaches is supported by large-scale annotated training data, such as ImageNet [2] for the classification task. The scale of the data used is much larger than for traditional methods, but it is still limited. The pursuit of robustness and generalization performance of machine learning models has never stopped.

Table: Summary of related surveys (columns: Title [Ref] | Year | Pub. | Topic | Type, Pages):
A short survey of pre-trained language models for conversational ai-a new age in nlp [26] | 2020 | ACSWM | NLP | DC, 4
A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models [27] | 2022 | arXiv | NLP | SC, 34
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models [38] | 2022 | arXiv | MM | DC, 19
A survey on vision transformer [39] | 2022 | TPAMI | CV | DC, 23
Transformers in vision: A survey [40] | 2021 | CSUR | CV | SC, 38
A Survey of Visual Transformers [41] | 2021 | arXiv | CV | DC, 21
Video Transformers: A Survey [42] | 2022 | arXiv | CV | DC, 24
Threats to Pre-trained Language Models: Survey and Taxonomy [43] | 2022 | arXiv | NLP | DC, 8
A survey on bias in deep NLP [44] | 2021 | AS | NLP | SC, 26
An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-Trained Language Models [45] | 2021 | arXiv | NLP | DC, 21
A multi-layer bidirectional transformer encoder for pre-trained word embedding: A survey of BERT [46] | 2020 | CCDSE | NLP | DC, 5
Survey of Pre-trained Models for Natural Language Processing [47] | 2021 | ICEIB | NLP | DC, 4
A Roadmap for Big Model [48] | 2022 | arXiv | NLP, CV, MM | SC, 200
Vision-and-Language Pretrained Models: A Survey [49] | 2022 | IJCAI | MM | DC, 8
Multimodal Learning with Transformers: A Survey [50] | 2022 | arXiv | MM | DC, 23

Fig. 1 The chronological milestones of multi-modal pre-trained big models from 2019 to the present (June 2022), including multi-modal datasets (shown by the orange arrow) and representative models (shown by the blue arrow). Purple font indicates that the dataset contains Chinese text (other datasets contain English text). The models highlighted in wine red are trained on more than two modalities.

Fig. 2 The overall framework of this survey.
Recently, the results of large-scale pre-trained models obtained by pre-training on massive data are constantly refreshing people's understanding of artificial intelligence. Compared with previous small-scale deep learning methods, pre-trained big models show obvious advantages in the Natural Language Processing (NLP), Computer Vision (CV), and multi-modal fields. Such a pre-training scheme takes full advantage of large-scale unlabeled data, thereby getting rid of expensive annotation costs. Therefore, the study of large-scale pre-trained models is a feasible and necessary way to explore real intelligence.

Pre-training in Natural Language Processing

Large-scale pre-trained models [29, 43, 44, 53-56] first appeared in the NLP field. Their success is mainly attributed to self-supervised learning and network structures like the Transformer [9]. Specifically, the advent of Bidirectional Encoder Representations from Transformers (BERT) [10], based on self-supervised learning, has led to revolutionary performance improvements on a wide variety of downstream tasks by fine-tuning on less training data [57]. Generative Pre-trained Transformers (GPT) [12, 58, 59] further extend the number of parameters and the amount of training data for better performance. Note that GPT-3 [12] has ten times more parameters than Turing-NLG [60]. It can not only better fulfill general NLP tasks, but also has some mathematical calculation ability. The success of GPT-3 has made it widely used in various fields, such as search engines, chatbots, music composition, graphics, and coding. XLNet [14] is developed based on a generalized permutation language modeling objective, which achieves unsupervised language representation learning. PanGu-α [61] is a large-scale pre-trained Chinese model with 200 billion parameters, implemented based on MindSpore auto-parallelism. NEZHA [62], proposed by Wei et al., is another Chinese pre-trained big model based on BERT. More large-scale pre-trained models for NLP can be found in the surveys [27, 34].

Pre-training in Computer Vision
Inspired by the revolutionary advancement of the Transformer in NLP tasks, many large-scale Transformer-based vision models have also been proposed in recent years. Chen et al. [63] attempt to auto-regressively predict pixels using a sequence Transformer. The model obtained by pre-training on low-resolution ImageNet demonstrates strong image representations. The ViT (Vision Transformer) model [64] directly adopts a pure Transformer to handle sequences of image patches for classification. Many new SOTA performances are achieved on several downstream CV tasks, including object detection [65], semantic segmentation [66], image processing [67], and video understanding [67]. The Swin-Transformer [16] is another milestone for computer vision; as a hierarchical Transformer, it adopts shifted windows for representation learning.
For the pre-training methods, Masked Image Modeling (MIM) [63, 64] is proposed to learn rich visual representations via masked-part prediction conditioned on the visible context. MIM provides another direction for the exploration of visual large-scale pre-training models. The authors of [68] re-explore pixel regression in MIM and show comparable performance on multiple image recognition tasks. BEiT [69] greatly improves MIM's performance via masked visual token prediction, and PeCo [70] finds that injecting perceptual similarity during visual codebook learning benefits MIM pre-trained representations.
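To make the masked-part-prediction idea concrete, the following is a minimal NumPy sketch of the patch-masking step in MIM-style pre-training. The patch size, mask ratio, and function name are illustrative choices, not the actual implementation of MAE or BEiT.

```python
import numpy as np

def mask_patches(image, patch=4, ratio=0.75, seed=0):
    """Zero out a random subset of non-overlapping patches and keep
    the original pixels as regression targets for the model."""
    rng = np.random.default_rng(seed)
    H, W = image.shape
    gh, gw = H // patch, W // patch
    n_masked = int(gh * gw * ratio)
    idx = rng.permutation(gh * gw)[:n_masked]
    masked, targets = image.copy(), {}
    for i in idx:
        r, c = divmod(int(i), gw)
        sl = (slice(r * patch, (r + 1) * patch),
              slice(c * patch, (c + 1) * patch))
        targets[int(i)] = image[sl].copy()  # pixels the model must predict
        masked[sl] = 0.0                    # the rest stays as visible context
    return masked, targets

# a toy 8x8 "image": a 2x2 grid of 4x4 patches, three of which get masked
image = np.arange(64, dtype=float).reshape(8, 8)
masked, targets = mask_patches(image)
```

The model then receives `masked` and is trained to regress the held-out pixels in `targets` from the visible context.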

Pre-training in Audio and Speech
As one of the most popular modalities, audio- and speech-based pre-training also draws researchers' attention. For example, wav2vec [71] is the first work that applies contrastive learning to improve supervised speech recognition, by predicting future raw audio based on past raw audio. vq-wav2vec [71] uses the context prediction task from wav2vec to learn representations of audio segments. Discrete-BERT [72] is a BERT-style model obtained by fine-tuning pre-trained BERT models on transcribed speech.
HuBERT [73] uses self-supervised speech learning, where an offline clustering step is used to generate discrete labels for masked speech signals. wav2vec 2.0 [74] solves a contrastive task to predict the masked latent representations. w2v-BERT [75] uses contrastive learning and masked speech modeling simultaneously, where one module predicts discretized speech tokens and another solves a masked prediction task.
Fig. 3 The detailed network architecture of Transformer network [9].

Task Definition and Key Challenges
Task Definition. Usually, deep neural networks are trained on a large-scale dataset; for example, the widely used residual networks [4] are pre-trained with a classification task on the ImageNet dataset [2]. In contrast, multi-modal pre-training big models are usually trained on a massive training dataset. Usually, these data are not annotated with labels, because their scale is too large to annotate. On the other hand, the number of parameters needs to reach a certain scale. As illustrated in Fig. 4, multi-modal data, big models, and computing power are tightly connected. All in all, with the support of computing power, multi-modal pre-training usually denotes the task in which a multi-modality model with huge parameters is pre-trained on massive multi-modal data in an unsupervised way.

Key Challenges. It is challenging to obtain a good multi-modal pre-training big model through the aforementioned process. In detail, we summarize the following key challenging factors:
• Acquisition and cleaning of large-scale multi-modal data. Multi-modal data is one of the most important elements of MM-PTMs. Collecting multi-modal data is significantly harder than collecting single-modal data, due to the scarcity of multi-modal imaging devices. The frequently used multi-modal cameras usually cover only two modalities, such as RGB-Depth, RGB-Thermal, RGB-Radar, and RGB-Event cameras. Most current MM-PTMs are vision-language models, because of the easy access to image and text data on the Internet. But additional cleaning of these data is also necessary due to noisy samples.
• Design of network architectures for large-scale multi-modal pre-training. The network architecture is another key component of multi-modal pre-training. The networks used for feature encoding of the multiple input modalities are worth tailoring carefully, as different modalities have their own characteristics and may require particular networks. For example, Transformers or CNNs are suggested for the image and text modalities, while spiking networks can be used for event streams. Another problem is the design of multi-modal fusion or cross-modality matching modules. Whether similar modules designed for small-scale multi-modal tasks work for large-scale pre-trained models still remains to be verified.
• Design of pre-training objectives. Due to the massive unlabelled multi-modal data, pre-training tasks usually need to be conducted in an unsupervised learning manner. Many current works adopt masked region prediction for each modality as their learning objective. Obviously, the objectives can be directly borrowed from single-modality pre-training; however, objectives designed specifically for multi-modal tasks are also necessary, intuitive, and effective. The widely used contrastive learning, modality-based matching, and modality translation are all valid and meaningful attempts. How to design new multi-modal pre-training objectives is one of the most challenging tasks for MM-PTMs.
• Support of large-scale computing power. The training of traditional deep neural networks can be executed on a server with a limited number of GPUs. In contrast, MM-PTMs need far more computing power due to the large-scale multi-modal data and super-large-scale model parameters. Therefore, the first requirement is a supercomputing device, and the subsequent model training also requires a lot of power.
• Skills in parameter tuning. It is never a simple task to train an effective large model considering the aforementioned challenging factors. The tricks used for training neural networks are also very important. While the research and techniques for small-scale pre-training are relatively mature, there is less accumulated experience with large-scale pre-training techniques.

Advantages of MM-PTMs
Compared with single-modality pre-trained big models, MM-PTMs are more suitable for practical application scenarios. Specifically, problems like multi-modal collaborative generation, modality completion, and cross-domain retrieval can be addressed well using MM-PTMs. Also, multi-modal data contains more information, which can make up for the defects of a single modality. Therefore, MM-PTMs can help extract the common features of multiple modalities. Many recent works demonstrate that the utilization of MM-PTMs indeed brings in additional prior knowledge [76-78].

Fig. 4 The relations between multi-modal data, model, and computing power.
Compared with small-scale multi-modal models, the generalizability of MM-PTMs, which are obtained by self-supervised/unsupervised learning, can be improved significantly. Some prior knowledge is only contained in massive big data, while small amounts of artificially selected, annotated data are biased; therefore, it is hard for small-scale models to master such knowledge.

Pre-training Data
As shown in Table 2, many large-scale multi-modal datasets have been proposed for the pre-training task. In this subsection, we briefly introduce these datasets to help readers quickly master the data information for pre-training.
• SBU Captions [79] is originally collected by querying Flickr using plentiful query terms. The obtained large-scale but noisy samples are then filtered to produce the final dataset, which contains more than 1M images with high-quality captions.
• COCO [111] is developed based on the MS-COCO dataset [111], which contains 123,000 images. The authors recruit Amazon Mechanical Turk workers to annotate each image with five sentences.
• Visual Genome [82] is proposed to help develop machine learning models that understand images by mining the interactions and relationships between objects. Such models perform well on cognitive tasks, such as image description and visual question answering. Statistically, the Visual Genome dataset contains more than 108K images, and each image has on average about 35 objects, 26 attributes, and 21 pairwise relationships.
• VQA v2.0 [83] is proposed to reduce the language biases that existed in previous VQA datasets. It contains about 1.1M image-question samples and 13M associated answers on 200K visual images from the COCO dataset.
• FashionGen [84] contains 325,536 high-resolution images (1360 × 1360), and each image has a paragraph-length descriptive caption sourced from experts. All fashion items are photographed from six different angles.
• CC3M [85] is a dataset annotated with conceptual captions, proposed in 2018. The image-text samples are mainly collected from the web; about 3.3M image-description pairs remain after necessary operations such as extraction, filtering, and transformation.
• CC12M [88] is the outcome of the urgent need of MM-PTMs for large-scale data. The released CC3M dataset fell far short of meeting this demand; therefore, the authors further relax the filters used in CC3M for image and text cleaning. Correspondingly, a four-times-larger dataset, CC12M, is obtained with a slight loss of accuracy.
• GQA [86] is mainly proposed for visual reasoning and compositional question answering. A robust question engine is carefully refined by considering content and structure information. Then, the associated semantic representations are adopted to greatly reduce biases within the dataset and control its question-type composition. Finally, a balanced dataset with 1.7M samples is obtained.
• LAIT [87] (Large-scale weAk-supervised Image-Text) is a large-scale image-text dataset collected from the Internet in a weakly-supervised manner. It contains about 10M visual images, and each image has a corresponding natural language description of about 13 words.
• AltText [89] is collected by following the rules for constructing the Conceptual Captions dataset [85]. To get a large-scale dataset (1.8B image-text pairs), the authors only apply minimal frequency-based filtering for data cleaning. Although the resulting dataset is noisy, big models pre-trained on it still beat many SOTA works on many downstream tasks.
• TVQA [90] is built based on six long-running TV shows from three genres: sitcoms, medical dramas, and crime dramas. Then, Amazon Mechanical Turk is used to collect VQA annotations for video clips. Finally, this dataset contains about 152,545 question-answer pairs from 21,793 video clips.
• HT100M [91] contains about 136 million video clips, which are collected from 1.22 million narrated instructional videos. The content of these videos mainly focuses on humans performing a total of 23,000 different tasks. The language description for each clip is an automatically transcribed narration. Therefore, the video and text are only weakly paired, compared with other captioning datasets.
• WebVid2M [92] is a video-text captioning dataset which contains over two million video alt-text pairs. These data are collected from the Internet following a procedure similar to that of CC3M. The authors find that more than 10% of CC3M images are thumbnails from videos; therefore, they scrape these video sources (a total of 2.5M text-video pairs) to create the WebVid2M dataset.
• YFCC-100M [93] contains 100 million media objects in total (99.2 million photos, 0.8 million videos) collected from Flickr, with a time span from 2004 to 2014. Note that the YFCC100M dataset is constantly evolving, with various expansion packs released from time to time.
• LAION-400M [94] contains 400 million image-text pairs and is released for vision-language related pre-training. It is worth noting that this dataset is filtered using CLIP [77], a very popular pre-trained vision-language model.
• RedCaps [95] is a large-scale dataset with 12M image-text samples collected from 350 subreddits. The authors first define the range of subreddits, then filter the image posts and clean the captions. Ethical issues are also considered when building the dataset: problematic images are filtered out with respect to privacy, harmful stereotypes, etc.
• Wukong [96] is currently the largest dataset collected from the Internet, containing 100 million image-text pairs. A list of 200K queries is maintained to ensure the collected samples cover diverse visual concepts. These queries are fed into the Baidu Image Search Engine, from which images and their corresponding captions are obtained. Note that each query can return at most 1,000 samples, to keep a balance between different queries, and a series of filtering strategies are adopted to produce the final Wukong dataset.
• CxC [97] is extended from the MS-COCO dataset by rating existing and new pairs with continuous (0-5) semantic similarity. In total, CxC contains human ratings for 267,095 pairs, a significant extension in scale and detail. It can be used for a variety of tasks, such as image-text, text-text, and image-image retrieval.
• Product1M [98] contains 1,182,083 image-caption pairs, 458 categories, and 92,200 instances. Each image contains about 2.83 objects. Different from regular object detection benchmark datasets, this dataset obtains instance locations in a paste manner: the target object is first segmented and then pasted into other images at a given bounding box. It can be used for multiple tasks, including weakly-supervised, multi-modal, and instance-level retrieval.
• WIT [99] is constructed by crawling Wikipedia. A set of rigorous filtering operations is then executed on these data, finally resulting in a dataset containing over 37.5 million image-text sets. Note that the WIT dataset is multi-lingual; in contrast, other image-text datasets contain only a single language (for example, English or Chinese).
• JFT-300M [100] contains about 300M images and 375M labels; each image has about 1.26 labels. Note that 18,291 categories are annotated in this dataset, including 1,165 animal and 5,720 vehicle categories. A rich hierarchy is formed from these categories. It is worth noting that this dataset is not available online.
• JFT-3B [101] is also an internal Google dataset, which contains about 3 billion images. These samples are annotated in a semi-automatic way with a class hierarchy of 30,000 labels. In other words, this dataset contains a large number of noisy samples. Note that this dataset is also not available online.
• M6-Corpus [103] is specifically constructed for the pre-training of the vision-Chinese big model M6 [103]. The samples are collected from various sources, such as product descriptions, community question answering, and forums. It contains 60.5M images and 111.8B tokens.
• M5Product [104] is a benchmark dataset specifically proposed for E-commerce. It contains 6 million multi-modal samples covering 6,000 categories, 5,000 attributes, and five modalities: visual images, tables, videos, language descriptions, and audio. It is worth noting that the M5Product dataset differs from standard multi-modal datasets with completely paired samples; that is, each sample may contain only a subset of the modalities. It also exhibits a challenging long-tailed distribution.
• Localized Narratives [105] is proposed by Pont-Tuset et al. in 2020 and provides a new form of multi-modal image annotation connecting vision and language. Each image is paired with a spoken description, a textual description, and a mouse trace, providing dense grounding between language and vision. It contains 849k images, covering the whole COCO, Flickr30k, and ADE20K [112] datasets and 671k images of Open Images.
• RUC-CAS-WenLan [106] is obtained by crawling multi-source image-text data and contains about 30M image-text pairs in total. These samples cover a wide range of topics and categories, such as sports, entertainment, news, art, and culture. It plays a fundamental role in the WenLan project and supports the training of the BriVL model [106].
• WSCD [109] (Weak Semantic Correlation Dataset) is a multi-source dataset, which contains large-scale image-text data samples (650 million).The English texts are all translated into Chinese to support the pre-training of BriVL.
• MEP-3M [108] is a large-scale image-text dataset collected from several large Chinese E-commerce platforms, containing 3 million image-text pairs of products across 599 classes. Another key feature of this dataset is its hierarchical category classification: in detail, it covers 14 classes and 599 sub-classes, and 13 sub-classes have further sub-subclasses.

Pre-training Objectives
How to design the learning objectives is a very important step for multi-modal pre-training.Currently, the following learning objectives are proposed, including contrastive loss, generative loss, etc.
• Contrastive loss (CS) usually constructs positive and negative training samples, and is widely used in dual-modality models. For example, CLIP [77] and ALIGN [21] are both trained with a contrastive learning loss. The authors of VinVL [113] adopt a 3-way contrastive loss for pre-training, replacing the binary contrastive loss function used in the Oscar model [17].
The contrastive losses in ALIGN are defined as follows:

L_{i2t} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(x_i^{\top} y_i / \sigma)}{\sum_{j=1}^{N} \exp(x_i^{\top} y_j / \sigma)},

L_{t2i} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(y_i^{\top} x_i / \sigma)}{\sum_{j=1}^{N} \exp(y_i^{\top} x_j / \sigma)},

L_{CL} = L_{i2t} + L_{t2i},

where L_{i2t}, L_{t2i}, and L_{CL} are the image-to-text classification loss, the text-to-image classification loss, and the total contrastive loss, respectively. x_i denotes the normalized image embedding of the i-th pair, y_j denotes the normalized text embedding of the j-th pair, and N and \sigma are the batch size and temperature parameter.
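The symmetric loss above can be sketched in a few lines of NumPy. This is an illustrative InfoNCE-style implementation, not the actual CLIP/ALIGN training code; the temperature value is an arbitrary choice.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, sigma=0.07):
    """Symmetric image-text contrastive (InfoNCE-style) loss.
    Row i of each matrix is the embedding of the i-th pair."""
    x = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    y = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = x @ y.T / sigma               # (N, N) similarity matrix

    def xent_diag(l):
        # cross-entropy with the matching (diagonal) entry as the target
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    return xent_diag(logits) + xent_diag(logits.T)  # L_i2t + L_t2i
```

Aligned pairs (high similarity on the diagonal) yield a small loss, while mismatched pairs yield a large one.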
• Modality Matching loss (MML) is widely used in multi-modal pre-training big models due to the explicit or implicit alignment relationships between modalities. For instance, Unicoder-VL [114] utilizes Visual-linguistic Matching (VLM) for vision-language pre-training: positive and negative image-sentence pairs are extracted, and the model is trained to predict whether a given pair is aligned (in other words, to predict the matching score). Different from regular negative image-text samples, the authors of InterBERT [115] design image-text matching with hard negatives (i.e., ITM-hn) by selecting the samples with the highest TF-IDF similarities.
• Masked Language Modeling (MLM) is another widely used pre-training objective: the researchers randomly mask input words using special tokens, and the surrounding words and corresponding image regions can be used as references for predicting the masked words. Wang et al. train SimVLM [116] using Prefix Language Modeling (PrefixLM), which applies bi-directional attention on the prefix sequence and auto-regressive factorization on the remaining tokens. The words are denoted as w = {x_1, ..., x_K} and the image regions as v = {v_1, ..., v_T}. For MLM, the input words are masked as x_m, with the mask indices m generated randomly with a probability of p%. The goal is to predict the masked words based on all image regions v and the remaining words x_{¬m} by minimizing the negative log-likelihood:

L_{MLM}(\theta) = -E_{(w,v) \sim D} \log P_{\theta}(x_m | x_{\neg m}, v),

where \theta denotes the trainable parameters. Besides MLM, the PrefixLM objective of SimVLM can also be adopted to pre-train vision-language representations:

L_{PrefixLM}(\theta) = -E_{x \sim D} \log P_{\theta}(x_{\geq T_p} | x_{< T_p}),

where x is the given text sequence, D is the pre-training data, and T_p is the length of the prefix sequence of tokens.
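As a concrete illustration of the random masking step described above, here is a minimal NumPy sketch. The mask-token id, the 15% rate, and the -100 ignore-index follow common BERT-style conventions but are assumptions, not the exact recipe of any particular MM-PTM.

```python
import numpy as np

MASK_ID = 103      # hypothetical id of the special [MASK] token
IGNORE = -100      # label value ignored by the loss

def mask_tokens(token_ids, p=0.15, seed=0):
    """Replace ~p of the tokens with [MASK]; labels keep the original
    ids at masked positions and IGNORE everywhere else."""
    rng = np.random.default_rng(seed)
    ids = np.asarray(token_ids).copy()
    is_masked = rng.random(ids.shape) < p
    labels = np.where(is_masked, ids, IGNORE)   # targets for the MLM loss
    ids = np.where(is_masked, MASK_ID, ids)     # corrupted model input
    return ids, labels

tokens = np.arange(1000)          # a toy "sentence" of 1000 token ids
inp, labels = mask_tokens(tokens)
```

The model is then trained to recover `labels` at the masked positions of `inp`, conditioning on the unmasked tokens (and, in the multi-modal case, the image regions).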
• Masked Segment Modeling (MSM) masks a continuous segment of the given text using the special token, whereas MLM masks random words.
• Image Question Answering (QA) is used in LXMERT [117] to further expand the pre-training data, as many image-sentence pairs are image-question pairs. The authors train their model to predict the answers as one of their pre-training objectives.
• Masked Object Classification (MOC) masks regions of the visual input using zero values. The labels predicted by an object detector are then often taken as the ground-truth labels. This pre-training objective is widely used, for example in Unicoder-VL [114]. Similar to MLM, the image regions can be masked by masking their visual features with a probability of p%. The goal is to predict the object category of each masked image region v_m^i. The encoder output of the masked image region v_m^i is fed into an FC layer to predict the scores of T object classes, which further go through a softmax function to be transformed into a normalized distribution g_{\theta}(v_m^i). The final objective is:

L_{MOC}(\theta) = -E_{(w,v) \sim D} \sum_{i=1}^{M} CE(c(v_m^i), g_{\theta}(v_m^i)),

where c(v_m^i) is the ground-truth label, CE is the cross-entropy loss, and M is the number of masked regions.

• Masked Object Regression (MOR) regresses the masked features or image regions. For example, LXMERT [117] considers both MOC and MOR for pre-training.
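The MOC objective above amounts to a softmax cross-entropy over detector-derived labels for the masked regions. The following NumPy sketch assumes a hypothetical FC weight matrix `W` and is only meant to illustrate the shape of the computation, not any model's actual implementation.

```python
import numpy as np

def moc_loss(masked_region_feats, W, detector_labels):
    """Cross-entropy between the softmax class distribution of masked-region
    encoder outputs (M, d), projected by an FC layer W (d, T), and the
    pseudo ground-truth labels produced by an object detector."""
    logits = masked_region_feats @ W                      # (M, T) class scores
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    M = len(detector_labels)
    return -np.mean(np.log(probs[np.arange(M), detector_labels]))
```

When the projected features put nearly all probability mass on the detector's label, the loss approaches zero.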
• Image-Text Matching (ITM) aims to align image-text data. Negative training data is generated by random sampling, including negative sentences for each image and negative images for each sentence. Let y denote the ground-truth label for each image-text pair (v, t). A binary classification loss is used for optimization:

L_{ITM}(\theta) = -E_{(v,t) \sim D} [ y \log s_{\theta}(v, t) + (1 - y) \log(1 - s_{\theta}(v, t)) ],

where s_{\theta}(v, t) is the image-text similarity score.
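The negative sampling described for ITM can be sketched as follows: each image keeps its true caption as a positive and is additionally paired with a randomly drawn non-matching caption as a negative. This is a hedged illustration, not the sampler of any specific model.

```python
import numpy as np

def build_itm_batch(n_pairs, seed=0):
    """Return (image_idx, text_idx, label) triples: each aligned pair
    gets label 1, and one mismatched caption per image gets label 0."""
    rng = np.random.default_rng(seed)
    img, txt, lab = [], [], []
    for i in range(n_pairs):
        img.append(i); txt.append(i); lab.append(1)       # positive pair
        j = int(rng.integers(n_pairs))
        while j == i:                                     # avoid the true match
            j = int(rng.integers(n_pairs))
        img.append(i); txt.append(j); lab.append(0)       # negative pair
    return np.array(img), np.array(txt), np.array(lab)

def itm_loss(scores, labels):
    """Binary cross-entropy on predicted match scores s_theta in (0, 1)."""
    s = np.clip(scores, 1e-7, 1 - 1e-7)
    return -np.mean(labels * np.log(s) + (1 - labels) * np.log(1 - s))
```

A model that scores positives high and negatives low obtains a small ITM loss; the opposite scoring is penalized.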
• Unidirectional LM (UiDT) uses single-direction history information only for masked token prediction, e.g., left-to-right and right-to-left language model objectives. Success stories include ELMo [118] and UNILM [119].
• Bidirectional LM (BiDT) Different from the Unidirectional LM, which predicts the masked token from a single direction only, the Bidirectional LM considers contextual information from both directions. Therefore, the contextual representations of text can be encoded more accurately. BERT [10], UNILM [119], and VLP [24] all adopt BiDT as one of their pre-training objectives.
• Sequence-to-Sequence LM (Seq2seq) is a pre-training objective used in VLP [24], etc. It splits the input into different parts, and each part can attend to different contexts.
• Word-Region Alignment (WRA) is used in UNITER [18], which targets explicitly achieving fine-grained alignment between the multi-modal inputs via Optimal Transport (OT) [120]. Specifically, the authors learn a transport plan, a 2D matrix, to optimize the alignment, and resort to the IPOT algorithm [121] for approximate OT distance estimation. This distance is then taken as the WRA loss to optimize the networks.
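The OT-based alignment can be illustrated with a small entropy-regularised solver. The following is a Sinkhorn sketch, not the exact IPOT algorithm used in UNITER; the uniform marginals and numeric values are illustrative assumptions.

```python
import math

def sinkhorn_plan(cost, reg=0.1, iters=200):
    """Approximate an optimal transport plan T between uniform marginals
    for a word-region cost matrix, via entropy-regularised Sinkhorn iterations."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / reg) for c in row] for row in cost]  # Gibbs kernel
    a, b = [1.0 / n] * n, [1.0 / m] * m                      # uniform marginals
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def ot_distance(cost, plan):
    """The transport cost <T, C>, used as the WRA-style alignment loss."""
    return sum(plan[i][j] * cost[i][j]
               for i in range(len(cost)) for j in range(len(cost[0])))
```

Low-cost word-region cells attract transport mass, so minimizing this distance pulls matched words and regions together in feature space.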
• Action Prediction (AP) targets evaluating whether an agent developed for vision-language navigation (VLN) can select the right actions based on the current image and instruction [122].
• Image-conditioned Denoising Autoencoding (IDA) is adopted in XGPT [11] to align the underlying image and text using an attention matrix. Even without knowing the length of the masked fragment in advance, IDA can still reconstruct the whole sentence successfully.
• Attribute Prediction (AttP) is used to recover the masked tokens of attribute pairs, as indicated in ERNIE-ViL [123].
• Relation Prediction (RelP) is used in ERNIE-ViL [123] to predict the probability of each masked relation token, thereby recovering the masked relationship tokens.
• Aligned Kaleido Patch Modeling (AKPM) is proposed for the pre-training of Kaleido-BERT [124] and contains five kaleido sub-tasks, i.e., Rotation Recognition (RR), Jigsaw Puzzle Solving (JPS), Camouflage Prediction (CP), Grey-to-Color Modeling (G2CM), and Blank-to-Color Modeling (B2CM). The sub-task losses combine cross-entropy and KL-divergence terms, where $CE$ represents the cross-entropy loss function, $y_r$ denotes the rotation angle, $K_p$ is the hidden output patch of size $p \times p$, $KLD$ denotes the KL-divergence, and the $K_p$ are kaleido patches, among which the $k_{p_i}$ are the masked-out ones.
• OBject Detection (OBD) is introduced in [125] as a direct set prediction task to enhance the pre-training. The authors also consider object attribute prediction to learn fine-grained semantic information. A negative log-likelihood loss is defined for OBD as $\mathcal{L}_{OBD}(\theta) = \sum_{i=1}^{N} [-\log \hat{p}_{\hat{\sigma}(i)}(c_i) - \log \hat{p}_{\hat{\sigma}(i)}(a_i) + \mathcal{L}_{box}(b_i, \hat{b}_{\hat{\sigma}(i)})]$, where $y$ denotes the ground-truth set of objects, $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ the predictions with $N$ elements, $\hat{\sigma}$ is the permutation of $N$ elements minimizing the total matching cost, $\mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})$ denotes the pair-wise matching loss between the prediction with index $\sigma(i)$ and ground truth $y_i$, $\hat{p}_{\hat{\sigma}(i)}(a_i)$ and $\hat{p}_{\hat{\sigma}(i)}(c_i)$ denote the attribute and class probabilities, and $\mathcal{L}_{box}(b_i, \hat{b}_{\hat{\sigma}(i)})$ is a normalized bounding-box regression loss.
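The permutation $\sigma$ in such a set-prediction loss is the one minimising the total pair-wise matching cost. For a handful of objects the assignment can be found by brute force, as sketched below; DETR-style systems instead use the Hungarian algorithm, which scales as $O(N^3)$ rather than $O(N!)$.

```python
from itertools import permutations

def best_assignment(match_cost):
    """Return the permutation sigma minimising sum_i L_match(y_i, yhat_{sigma(i)}),
    given an N x N matrix of pair-wise matching costs."""
    n = len(match_cost)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        cost = sum(match_cost[i][perm[i]] for i in range(n))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_perm, best_cost
```

Once the optimal assignment is fixed, the classification, attribute, and box losses in the OBD objective are computed against the matched ground-truth objects.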
• Image-Text Generation (ITG) also plays an important role in vision-language pre-training. Aligned image-text pairs make it possible to train a model for text generation conditioned on a given image; for example, Xu et al. train E2E-VLP [125] with the ITG objective $\mathcal{L}_{ITG} = -\sum_{(x, y) \in (X, Y)} \log \prod_{t=1}^{n} P(y_t \mid y_{<t}, x)$, where $X$ represents the visual sequence with context, $Y$ denotes the generated set of text, and $n$ is the length of tokens in text $y$.
• Video-Subtitle Matching (VSM) considers two targets for the video-text pre-training task, i.e., (i) local alignment and (ii) global alignment, as used in HERO [126]. The local alignment score is computed from the final visual frame representation $V_{temp} \in \mathbb{R}^{N_v \times d}$ produced by a temporal transformer and the final query vector $q \in \mathbb{R}^d$ of a query $s_q$ sampled from all subtitle sentences, with $v$ the whole video clip. The start and end indices $y_{st}, y_{ed} \in \{1, ..., N_v\}$ are supervised through probability vectors $p_{st}, p_{ed} \in \mathbb{R}^{N_v}$ generated from the scores, where $p[y]$ indexes the $y$-th element of the vector $p$. The global alignment is optimized by a combined hinge loss $\mathcal{L}_h$ over positive and negative query-video pairs: $(s_q, v)$ is a positive pair, while $(s_q, \hat{v})$ and $(\hat{s}_q, v)$ are negative ones constructed by replacing one element with another sample; $\delta$ is the margin hyper-parameter, and $\lambda_1, \lambda_2$ are balancing factors between the two terms.
• Frame Order Modeling (FOM) is treated as a classification problem in HERO [126], which targets reconstructing the timestamps of selected, shuffled video frames. The objective of FOM is $\mathcal{L}_{FOM}(\theta) = -\sum_{i=1}^{R} \log P[r_i, t_i]$, where $R$ is the number of reordered frames, $i \in [1, R]$, $t_i \in \{1, ..., N_v\}$ is the original timestamp, $r_i$ is the reorder index, and $P \in \mathbb{R}^{N_v \times N_v}$ is the probability matrix.
• Textual Aspect-Opinion Extraction (AOE) aims to extract aspect and opinion terms from the text, as noted in [127].To handle the lack of label information required for supervised learning, the authors resort to other models for aspect extraction and opinion extraction.The obtained aspect and opinion terms are treated as labels for the AOE task.
• Visual Aspect-Opinion Generation (AOG) targets generating the aspect-opinion pairs detected from the input image [127].
• Multimodal Sentiment Prediction (MSP) enhances the pre-trained models by capturing the subjective information from vision-language inputs [127].
• Modality-Level Masking (MoLM) is used in [22] to learn the alignment among the text, vision, and audio.The authors mask out each modality independently with a certain probability.
• Structural Knowledge Masking (SKM) is proposed in [128], which attempts to mask tokens selectively based on cues provided by a knowledge entry. Masking probabilities are calculated to obtain the mask indices $M_w$ and $M_r$ for each knowledge entry, which index the words of the sentence and the visual regions of the image to be masked, respectively. The loss function of the structural knowledge masking language model can be formulated as $\mathcal{L}_{SKM}(\theta) = -\mathbb{E} \log P_\theta(W_{M_w}, R_{M_r} \mid W_{\backslash M_w}, R_{\backslash M_r})$, where $\theta$ denotes the parameters, and $W_{\backslash M_w}$ and $R_{\backslash M_r}$ represent the non-masked words of the sequence and the remaining regions of the image, respectively.

Self-attention and Transformer
In the large-scale pre-training era, most current pre-trained models are inspired by the Transformer, which mainly consists of self-attention layers. It was originally developed for natural language processing tasks in 2017 [9] and set new SOTA performance on many downstream tasks by a large margin. This framework has also been introduced into the computer vision community; therefore, the design of unified network architectures for various tasks and inputs is a current research hotspot.
Given the input $x$, an attention module $A(x)$ is used to generate attention weights; then, the attended input is obtained from $x$ and $A(x)$ as $x' = f(A(x), x)$. Many attention models are designed based on this idea, such as channel attention, spatial attention, temporal attention, and branch attention [129]. The self-attention scheme is a special case of the attention mechanism, as shown in Fig. 6. More in detail, $Q = Linear(x)$, $K = Linear(x)$, $V = Linear(x)$, and $Attention(Q, K, V) = softmax(QK^T / \sqrt{d_k})V$, where Linear denotes fully connected layers.
On the basis of self-attention, multi-head attention aggregates several attention layers computed in parallel. Mathematically speaking, $MultiHead(Q, K, V) = [head_1, \cdots, head_h]W^O$ with $head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)$, where $[\,,\,]$ denotes the concatenate operation.
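Self-attention and multi-head attention can be sketched in plain Python with list-based matrices. The dimensions and weights below are illustrative; real implementations use batched tensor libraries.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [x / s for x in exps]

def self_attention(X, Wq, Wk, Wv):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V produced by linear projections of X."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d_k) for kr in K]
              for qr in Q]
    return matmul([softmax(r) for r in scores], V)

def multi_head_attention(X, heads, Wo):
    """Run the heads in parallel, concatenate their outputs, project with W^O."""
    outs = [self_attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    concat = [sum((out[i] for out in outs), []) for i in range(len(X))]
    return matmul(concat, Wo)
```

Each output row is a convex combination of the value rows, which is why attention can mix information across all positions of the input sequence.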

Single-and Multi-stream
The multi-layer transformer is widely used in many current MM-PTMs. The input of each modality is first extracted as feature embeddings by an independent encoder and then interacts with the other modalities. According to the manner of multi-modal information fusion, MM-PTMs fall into two categories, i.e., single- and cross-stream. In this subsection, we present these two architectures separately.
• Single-stream Multi-modal inputs such as images and text are treated equally and fused in a unified model. The uni-modal features extracted from each modality are tokenized and concatenated with separators as the input of a multi-modal transformer for fusion, as shown in Fig. 8(a). In the transformer, the MHSA (multi-head self-attention) mechanism is usually adopted to interactively fuse the uni-modal features; the multi-modal fusion features are then output from the class token of the transformer. Large-scale MM-PTMs based on the single-stream structure include VL PTMs (e.g., Oscar [17] and ALBEF [130]) and the vision-language-audio pre-training model OPT [22]. Single-stream pre-training models perform token-level matching based on strong semantic correlation, e.g., object features of the image are matched with semantic features of object tags. This provides realistic interaction between uni-modal features, and the multi-modal fusion features contain information from different modalities with better characterization capability.
• Cross-stream Features of different modalities are extracted in parallel by independent models and then aligned by self-supervised contrastive learning in the cross-stream architecture. These pre-training models obtain aligned uni-modal features rather than fused multi-modal features. As shown in Fig. 8(b), multi-modal fusion features are obtained by concatenating the uni-modal features and are fed into an MLP (Multi-Layer Perceptron) for pre-training objective learning. Representative large-scale MM-PTMs based on the cross-stream structure include BriVL [106] and CLIP [77], etc. Compared with single-stream models, cross-stream models align different modality features, such as text semantics and visual image representations, into a consistent high-dimensional feature space. Cross-stream pre-training models generally contain the CS pre-training objective and achieve embedding-level matching based on "weak semantic correlation" [106]. The structure of cross-stream models is more flexible: modifying the branch of one modality does not affect the other modalities, making such models easy to deploy in real scenarios. However, cross-stream models extract aligned multi-modal common features, and how to effectively exploit the information differences and complementarity between multi-modal data remains an issue to be studied.
In addition, depending on the needs of the pre-training objectives, the structures of pre-training models can be divided into those with and without a decoder. If the pre-training objectives contain generative tasks, such as masked image reconstruction or generating matching images based on a text description, the pre-training model adds a decoder after the encoder to convert the multi-modal fusion features into the corresponding output.
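The single- and cross-stream designs differ mainly in where fusion happens, which can be caricatured as follows. The encoder stand-ins and token names are hypothetical; real systems use transformer encoders and learned projections.

```python
def single_stream(image_tokens, text_tokens, fusion_encoder,
                  cls="[CLS]", sep="[SEP]"):
    """Single-stream: one shared transformer consumes the concatenated sequence;
    fusion happens inside its self-attention layers."""
    return fusion_encoder([cls] + image_tokens + [sep] + text_tokens + [sep])

def cross_stream(image, text, image_encoder, text_encoder, head):
    """Cross-stream: independent per-modality encoders produce aligned features,
    which are only combined afterwards (e.g. concatenated and fed to an MLP)."""
    img_feat, txt_feat = image_encoder(image), text_encoder(text)
    return head(img_feat + txt_feat)  # list concatenation as a stand-in for fusion
```

The sketch makes the trade-off concrete: the single-stream path lets every image token attend to every text token from the first layer, while the cross-stream path keeps the branches independent until a late, lightweight head.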

Modality Interactive Learning
Most current large-scale pre-trained multi-modal models adopt concatenation, addition, Merge-attention, Co-attention, or Cross-attention [132] to achieve interactive learning between modalities. An introduction to these modules is given in the following paragraphs.
• Merge-attention: As shown in Fig. 7 (a), a unified feature representation is obtained by concatenating the input modalities. Then, this feature is fed into the fusion network. For example, i-Code [131] flattens the visual inputs along the temporal and spatial dimensions. Note that the parameters of this attention model are shared by the input modalities.
• Co-attention: For the co-attention module, as shown in Fig. 7, each input modality has its own self-attention layers for modality-specific feature embedding.Then, the multiple embeddings are fused using a cross-attention layer.
• Cross-attention: For multi-modal tasks, the key step is designing a fusion module that connects the multi-modality inputs effectively. For instance, the cross-attention layer proposed by Suo et al. [132] integrates image and language subtly for visual question answering. Specifically, each modality is fed into the Q-branch of the other modality's self-attention network. Then, the outputs of the two modalities are concatenated into one unified representation for the final prediction.
• Tangled-transformer: The TaNgled Transformer (TNT) [133] is proposed to handle action-, regional object-, and linguistic features simultaneously using three Transformer modules. As shown in Fig. 7 (d), the authors inject one modality into the Transformer network designed for another modality to enhance the interactions.
• Inter-Modality Contrastive Learning: Contrastive learning is widely used for inter-modality relation modelling, such as in CLIP [77] and its follow-up works [19, 104, 134-138]. The representative work SCALE [104] is trained with Self-harmonized Inter-Modality Contrastive Learning (SIMCL), which can be written as $\mathcal{L}_{SIMCL}(d_i^{(0)}, d_i^{(1)}) = -\log \frac{\exp(Sim(f_i^{(0)}, f_i^{(1)})/\tau)}{\sum_{k=1}^{2N} 1_{[k \neq i]} \exp(Sim(f_i^{(0)}, f_k)/\tau)}$, where $(d_i^{(0)}, d_i^{(1)})$ is a positive pair, and pairing $d_i^{(0)}$ with other samples yields negative training data. $f_i^{(0)}$ and $f_i^{(1)}$ are the feature embeddings of $d_i^{(0)}$ and $d_i^{(1)}$, respectively, $Sim$ denotes the cosine similarity, $1_{[k \neq i]}$ is the binary indicator function, and $\tau$ is a temperature parameter.
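A minimal version of such an inter-modality contrastive objective is sketched below. This is an InfoNCE-style illustration of the general recipe, not the exact SIMCL formulation: the i-th cross-modal pair is the positive, and the other in-batch pairings serve as negatives.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def info_nce(feats_a, feats_b, tau=0.07):
    """Inter-modality contrastive loss: for each i, (feats_a[i], feats_b[i]) is the
    positive pair and every other feats_b[j] in the batch is a negative."""
    n, loss = len(feats_a), 0.0
    for i in range(n):
        logits = [cosine(feats_a[i], feats_b[j]) / tau for j in range(n)]
        m = max(logits)  # log-sum-exp with max subtraction for stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss -= logits[i] - log_denom
    return loss / n
```

When the two modality encoders embed matched pairs close together and mismatched ones far apart, the loss approaches zero; a misaligned embedding space is penalized heavily.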

Pre-training using Knowledge
Conventional pre-trained models suffer from poor logical reasoning and a lack of interpretability. To alleviate these problems, it is straightforward to incorporate knowledge, i.e., a deep understanding of the data, into pre-training, yielding what are known as Knowledge Enhanced Pre-Trained Models (KEPTMs), shown in Fig. 9.
Knowledge Representation Learning By learning to represent symbolic knowledge, usually in the form of entities and relations, knowledge representation learning enables neural network based models to fuse knowledge and improve their reasoning capabilities.Similarity-based models and graph neural network (GNN) models are two major methods of knowledge representation learning.
• Similarity-based Models Given similarity-based scoring functions, similarity-based models measure the similarity of latent semantics between two entities. Translation-based models are representative of this family, as distance in the vector space is often used to describe similarity. TransE first models relations as translations operating on low-dimensional entity embeddings [197]. To efficiently handle relations with complex mapping properties, such as reflexive, one-to-many, many-to-one and many-to-many relations, TransH models a relation as a translation operation on a hyperplane [198].
TransR embeds entities and relations in separate spaces to capture different aspects of entities over various relations [199]. Compared with TransR, TransD considers the diversity of not only relations but also entities [200].
To deal with heterogeneity and imbalance issues brought by knowledge graphs but ignored by aforementioned translation-based models, transfer matrices are replaced with adaptive sparse matrices in TranSparse, because the number of entities linked by relations determines sparse degrees [201].Besides translation-based models, tensor or matrix factorization approaches have also been proposed for multi-relational data by introducing scoring or ranking functions to measure how likely the semantic matching is correct.With the latent components, RESCAL is capable of collective learning and can provide an efficient algorithm of the factorization of a three-way tensor [202].NTN introduces an expressive neural tensor network for reasoning over relationships between two entities [203].DistMult presents a general framework for multi-relational learning and shows the effectiveness of a simple bilinear formulation [204].SME designs a new neural network architecture to encode multi-relational graphs or tensors into a flexible continuous vector space, so that multi-relational semantics can be learnt [205].HolE is proposed to learn compositional vector space representations of entire knowledge graphs by employing holographic models of associative memory and circular correlation to create compositional representations [206].
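The translation-based intuition behind TransE can be written down directly with toy embeddings. This is a sketch only; real systems learn the embeddings by gradient descent over a margin-based ranking loss with corrupted triples.

```python
import math

def transe_score(h, r, t):
    """TransE plausibility as the distance ||h + r - t||_2;
    a smaller score means the triple (head, relation, tail) is more plausible."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def margin_ranking_loss(positive, corrupted, gamma=1.0):
    """Push a true triple's score below a corrupted triple's score by margin gamma."""
    return max(0.0, gamma + transe_score(*positive) - transe_score(*corrupted))
```

Variants such as TransH, TransR, and TransD keep this scoring shape but first project the entity embeddings onto relation-specific hyperplanes or spaces.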
• Graph Neural Network Models To further leverage the structure of the graph rather than collections of triplets, graph neural network  [207].Inspired by the pioneering work, further efforts have been done on graph convolutional networks (GCNs), such as semi-supervised classification [208], unsupervised learning based on the variational auto-encoder (VAE) [209], inductive representation learning to sample and aggregate features from a node's local neighborhood [210], and attention mechanism by leveraging masked self-attentional layers [211].Beyond GCNs, R-GCNs is developed to deal with the highly multirelation data characteristic of realistic knowledge bases [212].A structure-aware convolutional network (SACN) takes the benefit of GCN and ConvE [213] together, where GCN as the encoder utilizes knowledge graph node structure and ConvE as the decoder enables the translational feature [214].To further enhance Graph Attention Networks (GATs) and capture both entity and relation features within any entity's neighborhood, another model is proposed for attentionbased feature embedding [215].To leverage various composition operations for embedding entities and relations in KGs and ever-increasing number of relations, a composition-based GCN named CompGCN is proposed to embed both nodes and relations jointly [216].Knowledge Fusion Methods How to fuse knowledge into pre-trained models and improve their logical understanding of data after knowledge representation learning remains a challenge to researchers.According to the category of knowledge provided, KEPTMs roughly contain two categories: unstructured knowledge and structured knowledge enhanced pre-trained models.
• Unstructured KEPTMs Unstructured knowledge often refers to knowledge without structure, in the form of plain text such as words or phrases. Although some works introduce entities as supervised data and achieve promising performance, structural information is ignored: only the entities are used to help PTMs learn semantics or attain extra key features. Word-aligned attention aligns character-level attention to the word level to exploit explicit word information in Chinese [217]. SentiLARE also introduces part-of-speech tags and sentiment polarity to build word-level linguistic knowledge [218]. As neural language models trained on unstructured text can store knowledge implicitly, PTMs can be further fine-tuned to explicitly retrieve knowledge without access to external knowledge or context [219].
• Structured KEPTMs Contrary to unstructured KEPTMs, structured KEPTMs take account of various kinds of structural information, including syntax trees, rules, and knowledge graphs. Syntax-BERT incorporates syntax trees effectively and efficiently into pre-trained Transformers [220]. LIMIT-BERT learns language representations across multiple linguistics tasks, including constituent and dependency syntactic parsing [221]. Syntax-GNN learns syntax representations by using dependency trees and fusing the embeddings into transformers [220]. Knowledge graphs (KGs) provide structural knowledge in the form of entities and the relations between them. The enhanced language representation model ERNIE is trained on both large-scale textual corpora and knowledge graphs, so that it can simultaneously leverage lexical, syntactic and knowledge information [222]. Similar work named KnowBert embeds multiple knowledge bases into large-scale models with entity linkers, which retrieve relevant entity embeddings and update contextual word representations via word-to-entity attention [223]. Moreover, reasoning capability is also developed by finding supporting facts in a large external knowledge base [224, 225]. Rules, in the form of constraints or even logical expressions, are preferred due to their interpretability and accountability. HEX graphs are proposed to enhance existing models by capturing semantic relations between labels applied to the same object [226].
Knowledge Evaluation Tasks Besides conventional performance metrics, more knowledge-oriented tasks are required to evaluate the capability of KEPTMs and inspect whether external knowledge really helps models understand data semantically. Knowledge evaluation tasks serve as testbeds to ensure the effectiveness of knowledge fusion methods. Currently, they mainly focus on NLP and can be categorized into two groups based on the type of required knowledge: factual knowledge and commonsense knowledge evaluation tasks.
• Factual Knowledge Evaluation Tasks Factual knowledge is the knowledge of facts, including specific details and elements that describe objective facts [28]. Factual knowledge evaluation tasks test models' reasoning ability on factual knowledge over various domains, like answering questions about a given fact or judging the correctness of a given fact. Natural Questions is the first large publicly available dataset of its kind, and robust metrics are also introduced to evaluate the performance of question answering (QA) systems [227]. HotpotQA, another QA dataset, provides sentence-level supporting facts for reasoning and new factoid comparison questions [228]. Different from the above two open-domain QA tasks, BoolQ only involves naturally occurring yes/no questions, namely verifying facts generated in unprompted and unconstrained settings; since those queries involve complicated and non-factoid information, the task is unexpectedly challenging [229]. FEVER, another fact extraction and verification task, introduces a new type of claim, NotEnoughInfo, besides Supported and Refuted [230]. Entity linking, which links entities from a knowledge base to the corresponding textual mentions in a corpus, can also evaluate how well a model understands factual knowledge [231].
• Commonsense Knowledge Evaluation Tasks Commonsense knowledge refers to the information generally accepted by the majority of people concerning everyday life, i.e., practical knowledge about how the world works [29]. Like factual knowledge evaluation tasks, CommonsenseQA also focuses on QA, but such QA requires prior knowledge outside the given document or context [232]. To extend the QA task Abductive Natural Language Inference (αNLI), Abductive Natural Language Generation (αNLG), a conditional generation task, is proposed to explain given observations in natural language [233]. CommonGen further explicitly tests the ability of generative commonsense reasoning due to its rigorous requirements on both relational reasoning and compositional generalization [234]. Besides general commonsense evaluation tasks checking how well models understand daily scenarios, specific commonsense knowledge tasks are further designed for different scenarios. SocialIQA, a large-scale benchmark for social commonsense reasoning, is challenging even for PTMs [235]. Beyond human interactions, physical interactions are also important in commonsense knowledge; hence the PIQA task is introduced for physical commonsense reasoning [236]. Temporal commonsense, e.g., duration, frequency, and order, is crucial for understanding the timing of events and reasoning correctly about them. McTaco defines five classes of temporal commonsense [237], while TRACIE evaluates models' temporal understanding of implicit events [238].

Characteristics of Different Pre-trained Big Models
In the aforementioned paragraphs, we reviewed the main streams of multi-modal pre-trained models and highlighted the features of each model in Table 3, Table 4, and Table 5. In this subsection, we compare and analyze the characteristics of these models. Specifically, the early multi-modal pre-trained big models usually design an interactive learning module, for example, ViLBERT [140] and LXMERT [117]. They integrate the co-attention or cross-attention mechanism into their frameworks to boost the feature representation between multiple inputs. In essence, these models follow the idea of interactive fusion used in traditional small models. This allows seamless integration with numerous downstream tasks and provides a high degree of flexibility. In contrast, many current big models directly process the inputs using projection layers and feed them into a unified network like the Transformer, including Unicoder-VL [114], VideoBERT [158], and UniVL [160]. More and more works demonstrate that the powerful Transformer network can achieve comparable or even better performance.
There are also works that make full use of existing big models and carry out secondary development to achieve higher performance [181, 190]. To address the issues caused by the shortage of paired multi-modal data, some researchers propose to train their models using unpaired data [173]. These models show the great potential of processing massive multi-modal data. Unlike general big models, some models are specifically designed for a particular task or domain, like e-commerce or indoor navigation. This provides the conditions and convenience for fully mining more detailed domain knowledge to assist the pre-training process.

Downstream Tasks
After the pre-training phase, researchers usually test their models on many downstream tasks to validate their effectiveness. Specifically, generative, classification, and regression tasks are adopted for validation, as discussed below. As a new learning paradigm, prompt learning, which targets modifying the downstream tasks to fit the pre-trained big model, draws more and more attention. In this part, several representative prompt learning algorithms are also reviewed. An overview of these downstream tasks is visualized in Fig. 10.

Generative Tasks
Image/Video Captioning attempts to describe the content of an input image or video using one or more sentences. Usually, a visual encoder is used to encode the input image/video; then, a language decoder predicts the sentence word by word. NoCaps [239] is a representative benchmark for captioning novel objects.

Classification Tasks
A Visual Question Answering (VQA) model is provided with an image and a question, and asked to produce an answer [242]. The relation between GQA [86] and VQA is similar to that between NoCaps and the standard captioning task. GQA is introduced to address key drawbacks of previous VQA datasets; it generates novel and diverse questions from a robust question engine, which sufficiently considers content and structure. Video-Language Inference (VLI) is proposed by Liu et al. [243] in 2020 and aims at understanding video and text multi-modal data. Natural Language for Visual Reasoning (NLVR) can be seen as a binary classification problem. As noted in [244], the model needs to judge the authenticity of a statement about an image. Visual Entailment (VE) [245] is a three-way classification problem derived from the Text Entailment (TE) task [246]. The VE model needs to predict whether a given image semantically entails the text; the three labels are entailment, neutral, and contradiction.

Visual Commonsense Reasoning (VCR) [247] is a variation of VQA which requires a machine to answer a challenging question correctly and provide a rationale justifying the answer. Category Recognition (CR) is a classification problem that attempts to predict the category of a given image. Many computer vision tasks belong to this downstream task, such as pedestrian attribute recognition [248] and action recognition [134]. Multi-modal Sentiment Analysis (MSA) is a multi-modal fusion task proposed for sentiment analysis [249], which attempts to aggregate various homogeneous and/or heterogeneous modalities, e.g., text, visual, and acoustic, for more accurate reasoning. Vision-Language Retrieval (VLR) can be used in many applications, such as text-based person search [250] or general object retrieval based on language [251]. Vision-Language Navigation (VLN) [252, 253] is a task in which agents learn to navigate in 3D indoor environments following a given natural language instruction. A benchmark for the popular VLN can be found at the following leaderboard. Optical Character Recognition (OCR) targets converting images of diverse text information into machine-encoded text. Usually, an OCR system contains both text detection and text recognition modules.

Regression Tasks
Grounding Referring Expressions (GRE) takes a visual image and a language description as input and outputs the location of the target object described by the language [254-256]. Similar tasks defined on videos are termed Spatio-Temporal Video Grounding (STVG) [257] or Tracking by Natural Language [258-260].

Prompt Learning
To make full use of pre-trained big models, prompt learning (also called prompt tuning) is proposed to re-formulate the downstream tasks to fit the objectives of pre-trained models, as in CPT [261] and CPL [262]. Also, some prompt tuning schemes fix the parameters of the large model and adjust as few parameters as possible to achieve good results, such as VPT [263], CoOp [264], and CoCoOp [265]. To be specific, VPT [263] fixes the parameters of the ViT model and integrates prompt vectors as additional input. It achieves good performance even when only the parameters of the classification head and the prompts are tuned. CoOp [264] achieves large improvements by turning the context words into a set of learnable prompt vectors. Conditional Context Optimization (CoCoOp) [265] builds on CoOp and learns an external network to generate input-conditional tokens for each image. Using such dynamic prompts, it significantly addresses the issue of class shift.
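CoOp-style context optimisation can be caricatured as follows. The names and dimensions are hypothetical: only the context vectors would be treated as trainable, while the backbone and class-name embeddings stay frozen.

```python
import random

def init_context(n_ctx=4, dim=8, rng=None):
    """Learnable context vectors replacing the hand-written words of
    'a photo of a {class}'; these are the only parameters updated during tuning."""
    rng = rng or random.Random(0)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_ctx)]

def build_prompt(context, class_embedding):
    """Prepend the shared learnable context to a (frozen) class-name embedding
    to form the token sequence fed to the frozen text encoder."""
    return [list(v) for v in context] + [list(class_embedding)]
```

Because the same few context vectors are shared across all classes, the number of tuned parameters stays tiny compared to the frozen backbone, which is the central appeal of this family of methods.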

Experimental Analysis
Considering the complexity and number of MM-PTMs, it is almost impossible to reproduce the pre-training tasks in a short amount of time. Therefore, experiments on and analyses of the pre-training itself are omitted in this paper. However, to provide a more complete review for the readers, we extract the experimental results on the corresponding downstream tasks from the original papers and compare them on the shared benchmark datasets. More detailed results can be found in Table 3 and Table 4.
Based on Fig. 11 (c), it is also easy to find that many large-scale MM-PTMs still have a limited number of parameters, although some of them have indeed reached new heights, for example, DALL-E [164] (12000 MB), BriVL [106] (10000 MB), M6 [103] (100000 MB), and CogView [166] (4000 MB). The reasons for this phenomenon may be as follows: 1) Many MM-PTMs are trained on several public datasets; the scale of parameters is greatly improved compared to traditional models, but not by a shocking amount. 2) The development of big models is also limited by the need for large-scale computing power, and only a few giant companies or research institutes have such computing platforms.

Performance on Representative Downstream Tasks
Here, we report the experimental results on zero-shot image retrieval, image captioning, and visual question answering. From Fig. 12 (a), we can find that the performance of different MM-PTMs differs greatly on the zero-shot image retrieval task. The blue and red vertical bars denote the Rank-1 and Rank-5 results, respectively. Some models achieve high performance on this task, which demonstrates the effectiveness of large-scale pre-training, for example, ALBEF [130] and METER [157]. For the image captioning task, we can find from Fig. 12 (b) that the compared models, for example, OSCAR [17], achieve close performance on the COCO dataset.

Research Directions
Although multi-modal pre-trained big models have achieved great progress, this is still a young research direction. Many problems and opportunities are waiting for researchers to solve. In this section, we summarize several research directions that are worth exploring.
• Pre-training on More Modalities: Existing large-scale PTMs are usually pre-trained on two modalities, e.g., vision and language. The lack of large amounts of aligned multi-modal data may be a key reason. As an old saying goes, "Sharpening your axe will not delay your job of chopping wood". The acquisition of real multi-modal data is the most important thing for large-scale pre-training, as shown in Fig. 13, covering visual images, text, audio, radar, event streams, depth images, thermal images, etc. To the best of our knowledge, no imaging device can capture so many modalities at the same time. Therefore, the manufacture of multi-modal imaging equipment could be very significant, and pre-trained big models based on such data may have a wider potential for applications.
• Incremental Learning based Pre-training: Currently, pre-trained big models are applied to downstream tasks through feature fine-tuning or prompt learning [266]. This standard deep learning procedure works well for now, but pre-training is an expensive process. Specifically, the collection and cleaning of data, the electricity consumed by pre-training, and the hardware devices all cost a huge amount of human and material resources. When another group of data is gathered, pre-training again on the mixed data is expensive, redundant, and not environmentally friendly. However, few works consider incremental learning for big models, and it is still unclear whether the incremental learning algorithms developed for traditional deep learning work well for big models.
In addition to the aforementioned data-incremental learning, there are still many aspects that can be exploited for multi-modal pre-trained big models. For example, class (or category) incremental learning is a classical machine learning problem. Another interesting problem is modality-incremental learning, in other words, how to introduce and absorb a new modality into an already pre-trained multi-modal model. Because new sensors (modalities) will appear at some indefinite time in the future, the designed multi-modal big models should be flexible enough to handle this situation.
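Whether the incremental learning algorithms mentioned above transfer to big models remains open. As a reference point, one classical regularizer from traditional (small-scale) incremental learning, elastic weight consolidation (EWC), penalizes moving weights that were important for previously learned tasks. The following is a minimal sketch of the penalty term only, not taken from any surveyed work:

```python
import numpy as np

def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Elastic Weight Consolidation regularizer.

    Adds (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2 to the new-task
    loss, where F_i is the (diagonal) Fisher information estimating how
    important weight i was for the previous task.
    """
    return 0.5 * lam * np.sum(fisher * (params - old_params) ** 2)

theta_old = np.array([1.0, 2.0])   # weights after the previous task
fisher = np.array([10.0, 0.1])     # first weight matters for the old task
theta = np.array([1.5, 3.0])       # candidate weights on the new task
print(ewc_penalty(theta, theta_old, fisher))  # 1.3
```

Moving the important first weight by 0.5 costs far more than moving the unimportant second weight by 1.0, which is exactly the behavior that protects old-task knowledge during continued training.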
• Knowledge Enhanced Multi-Modal Pre-training: Based on the aforementioned reviews of MM-PTMs, we can find that the study of knowledge-assisted pre-training is still at an early stage. Current works simply adopt an external knowledge graph or knowledge base in the pre-training phase, but these are usually single-modal, independent of the multi-modal data, and limited to improving the models' understanding of the data. Although commonsense knowledge is more ubiquitous, it is also abstract and introduces ambiguities, which poses challenges when it is applied to specific data. Therefore, we believe that further explorations of knowledge-enhanced multi-modal pre-training are worthwhile. First, specialized knowledge for multi-modal data needs to be collected or extracted through self-supervised learning. Second, more general knowledge fusion methods designed for multi-modal data are needed, beyond the limitations of the vision and language modalities. Third, knowledge evaluation tasks specific to pre-training are required to inspect the enhancement brought by knowledge at this early stage, because pre-training is the first phase of the entire training procedure while the downstream tasks are yet to be determined.
• Fine-grained Multi-Modal Pre-training: Most existing MM-PTMs are pre-trained from a global view; for example, researchers adopt the matching between a whole image and a sentence as the supervision signal for pre-training. Representative works include CLIP [77] and ALIGN [21]. Note that fine-grained local information mining or instance-level pre-training may further improve the overall performance of multi-modal pre-training. Some researchers have explored the possibilities of fine-grained pre-training strategies [98]. We hope more researchers will focus on this direction to further boost the final results.
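The global image-text matching used by CLIP-style models can be summarized as a symmetric contrastive (InfoNCE) loss over a batch of aligned pairs. The NumPy sketch below is a schematic illustration under our own naming, not the authors' code:

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of aligned image-text
    pairs: row i of img_emb and row i of txt_emb are assumed to match."""
    # Normalize so that dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (N, N) pairwise logits
    n = logits.shape[0]

    def xent_diag(lg):
        # Cross-entropy with the matched (diagonal) entry as the target.
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

# Perfectly aligned, mutually orthogonal pairs give a near-zero loss.
emb = np.eye(4)
print(clip_style_loss(emb, emb))
```

Fine-grained variants replace the single global embedding per image with region- or token-level embeddings and align them at a finer granularity, while the contrastive principle stays the same.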
• Multi-Modal Pre-trained Model based Prompt Learning: Current pre-trained big models are usually used in a "pre-train then fine-tune" way: users initialize their model with pre-trained weights and then fine-tune it on downstream tasks. Although this works well for many tasks, fine-tuning may not be the most direct way, because current multi-modal big models are pre-trained via modality matching and masked token prediction, while the downstream tasks are usually classification and regression. Therefore, a gap exists between multi-modal pre-training and fine-tuning. Recently, a new framework termed prompt learning was developed for big-model-based downstream tasks; it cleverly transforms the setting of downstream tasks to make them consistent with pre-training [266]. Many works have demonstrated its effectiveness in CV and NLP tasks [76, 135, 261, 264, 265]. Research in this direction is also interesting and has great potential.
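The core idea, reformulating a downstream classification task as the image-text matching used during pre-training, can be illustrated schematically. All names below, including the toy text encoder, are hypothetical stand-ins rather than any real model's API:

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, encode_text):
    """Prompt-based zero-shot classification with a CLIP-like model.
    `encode_text` stands in for the pre-trained text encoder."""
    # Wrap each class name in a natural-language template so the task
    # matches the image-text matching objective used at pre-training time.
    prompts = [f"a photo of a {name}" for name in class_names]
    txt = np.stack([encode_text(p) for p in prompts])
    txt /= np.linalg.norm(txt, axis=1, keepdims=True)
    img = image_emb / np.linalg.norm(image_emb)
    return class_names[int(np.argmax(txt @ img))]  # highest cosine wins

def toy_encode_text(s, dim=8):
    """Deterministic toy 'encoder' keyed on the byte sum of the string,
    for illustration only (a real model would use a learned encoder)."""
    rng = np.random.default_rng(sum(s.encode()))
    return rng.standard_normal(dim)

classes = ["cat", "dog"]
# Pretend the image embeds exactly like its matching caption.
img = toy_encode_text("a photo of a cat")
print(zero_shot_classify(img, classes, toy_encode_text))  # prints: cat
```

Note that no task-specific head is trained: the classifier is constructed entirely from text prompts, which is why the setting stays consistent with pre-training.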
• Migration of Techniques Developed for Small-scale Models: Small-scale multi-modal models have been explored for many years, and many representative models have been proposed for deep multi-modal tasks [267][268][269]. Among these works, diffusion, cross-attention, and dynamic neural networks are useful for specific multi-modal tasks. Some of these techniques have been exploited in VL-PTMs, such as the cross-attention based ViLBERT [140]. However, many algorithms and tricks have not yet been explored for large models. We believe the transfer from small-scale to large-scale PTMs is worth studying.
• Coupling and Decoupling Problems in Cross-modal Pre-training Models: Coupling involves establishing correlations between different modalities; the "cross" in cross-modal can only be realized through such correlations. Decoupling, in turn, allows the set of modalities to be expanded dynamically. How to give feasible solutions to these two problems from the aspect of framework design is worth studying.

Conclusion
We give a comprehensive review of large-scale Multi-Modal Pre-Trained Models (MM-PTMs) in this paper. Firstly, we introduce the background of MM-PTMs, with a focus on conventional deep learning and pre-training in NLP, CV, and speech. Then, the task definition, key challenges, and benefits of MM-PTMs are discussed. After that, we dive into the reviews of MM-PTMs and discuss the pre-training data, objectives, networks, knowledge-enhanced pre-training, etc. We review the downstream tasks, including generative, classification, and regression tasks, and also give an overview of the model parameters of MM-PTMs and the hardware used for pre-training. Experimental results on several representative tasks are also discussed and visualized. Finally, we point out some research directions worth focusing on, and hope our survey can provide useful insights for research on MM-PTMs.

Fig. 10
Fig. 10 An overview of downstream tasks reviewed in this paper.
is proposed by Agrawal et al. in 2019. It is also an image captioning task but focuses on developing generalized captioning models. Visual Dialogue (VD) attempts to let an AI agent talk with humans by holding a meaningful dialog about the visual content [240]. Multi-modal Machine Translation (MMT) is a task that targets translating a source sentence into a different language based on the paired image [241].

Fig. 13
Fig. 13 Representative samples of mainstream modalities frequently used.

Table 1
Summary of related single- and multi-modal pre-training surveys. SC and DC denote Single Column and Double Column. Pub. is short for Publication.

Table 2
An overview of multi-modal datasets proposed for large-scale pre-training. Lang. and Ava. are short for Language and Available, respectively.

Table 3
The summary of mainstream multi-modal pre-trained big models (Part-I).

Table 4
The summary of mainstream multi-modal pre-trained big models (Part-II).

Table 5
The summary of mainstream multi-modal pre-trained big models (Part-III).