Transformers in computational visual media: A survey

Transformers, the dominant architecture for natural language processing, have also recently attracted much attention from computational visual media researchers due to their capacity for long-range representation and high performance. Transformers are sequence-to-sequence models, which use a self-attention mechanism rather than the RNN sequential structure. Thus, such models can be trained in parallel and can represent global information. This study comprehensively surveys recent visual transformer works. We categorize them according to task scenario: backbone design, high-level vision, low-level vision and generation, and multimodal learning. Their key ideas are also analyzed. Differing from previous surveys, we mainly focus on visual transformer methods in low-level vision and generation. The latest works on backbone design are also reviewed in detail. For ease of understanding, we precisely describe the main contributions of the latest works in the form of tables. As well as giving quantitative comparisons, we also present image results for low-level vision and generation tasks. Computational costs and source code links for various important works are also given in this survey to assist further development.


Introduction
Convolutional neural networks (CNNs) [1][2][3] have become the fundamental architecture in computational visual media (CVM). Researchers began to incorporate a self-attention mechanism into CNNs to model long-range relationships, because of the locality of convolutional kernels [4][5][6][7][8]. Recently, Dosovitskiy et al. [9] found that a self-attention-only structure, without convolution, works well in computer vision. Since then, the transformer architecture [10], a non-convolutional architecture dominating the research field of natural language processing (NLP), has been used in computer vision. Introducing transformers into computer vision provides four advantages that CNNs lack:
• Transformers learn with less inductive bias and perform better when trained on large datasets (e.g., ImageNet-21K or JFT-300M) [9,11].
• Transformers provide a more general architecture suitable for most fields, including NLP, CV, and multimodal learning.
• Transformers powerfully model long-range interactions in a computationally efficient manner [12,13].
• The learned representation of relationships is more general and robust than the local patterns from convolution modules [14].
As Table 1 shows, an increasing number of works on visual transformers have appeared in various subfields of computational visual media. An instructive survey is important because of the difficulty of organizing such fast and abundant developments.

Table 1 (excerpt)
Segmentation: VisTR [28], the first transformer-based segmentation model; SegFormer [29], a lightweight and efficient segmentation transformer.
Low-level vision
  Colorization: ColTran [30], the first transformer-based image colorization model.
  Text-to-image: TIME [31], text-to-image generation; DALL·E [32], a zero-shot text-to-image generation framework.
  Super-resolution: IPT [11], an image processing model; TTSR [33], a flexible application of the transformer.
  Image generation: TransGAN [34], the first pure transformer-based GAN for generation; GANsformer [35], a bipartite transformer; VQGAN [36], a transformer-based high-resolution image generator.
  Image restoration: Uformer [37], a transformer-based hierarchical encoder-decoder network.
  Style transfer: StyTr 2 [38], the first transformer-based style transfer model.
  Point cloud learning: PCT [39], among the first transformer-based point cloud models.
Multi-modality learning
  Two-stream model: ViLBERT [40], the first two-stream model for V+L tasks.
  Single-stream model: UNITER [41], a universal model for joint multi-modal embedding.
  Mixed model: SemVLP [42], the first mixed single- and two-stream model.

Due to the fast development of visual transformer backbones, this survey specifically focuses on the latest works in that area, as well as low-level vision tasks. This study is mainly arranged into four fields: backbone design, high-level vision (e.g., object detection and semantic segmentation), low-level vision and generation, and multimodal learning. We highlight backbone design and low-level vision as our main focus in Fig. 1. The developments to be introduced are summarised in Table 1. For backbone design, several of the latest works are introduced, considering two aspects: (i) injecting convolutional prior knowledge into ViT, and (ii) boosting the richness of visual features. We also summarize the breakthrough ideas of each work in Fig. 1. For high-level vision, we introduce the mainstream DETR-based transformer detection models [24]. For low-level vision and generation, we arrange papers according to different subareas, including colorization [30, 43-45], text-to-image [31,32,46], super-resolution [47][48][49], and image generation [50][51][52][53][54]. For multimodal learning, we review some recent representative works on vision-plus-language (V+L) models and summarize pretraining objectives in this field.
We comprehensively compare results in different fields and give training details, including computational cost and source code links, to facilitate and encourage further research. Sample image results from low-level vision models are also illustrated. The rest of the paper is organized as follows. Section 2 introduces visual transformers. Section 3 reviews the latest developments in backbone networks for visual transformers in image classification. Section 4 describes several recent advanced designs using visual transformers in object detection. Section 5 introduces transformer-based methods for various low-level vision tasks. Section 6 reviews recent representative works on multimodal learning. Finally, we draw conclusions for the different research fields in Section 7.

Visual transformers
Before introducing the latest developments, we give the basic formulation of visual transformers, using ViT [9] as an example. As shown in Fig. 2, a typical ViT mainly involves the following basic procedures: splitting the input image into smaller local patches; preparing the input tokens (patch tokens, class token, and position embeddings); a series of stacked transformer blocks [55] (i.e., layer normalization (LN) [56] + multi-head self-attention (MSA) [57] + skip-connection layers [1] + a multilayer perceptron (MLP) or feed-forward network (FFN)); and a post-processing module.
Formally, given an input image X ∈ R^{H×W×C} and its labels Y, X is first reshaped into a sequence of flattened 2D image patches X_p ∈ R^{N×(P^2·C)}. Then, following BERT [10], a class token and learnable position embeddings are used to record extra meaningful information for inference. Together, the input is formulated as follows:
Z_0 = [x_cls; X_p^1 E; · · · ; X_p^N E] + [E_pos^0; E_pos^1; · · · ; E_pos^N]
where x_cls ∈ R^D is the class token, E ∈ R^{(P^2·C)×D} is a linear projection applied to each patch X_p^i, and E_pos^i ∈ R^D is the learnable position embedding for the i-th token.
Then, the input is sent through several sequential transformer blocks:
Z'_{l+1} = MSA(LN(Z_l)) + Z_l
Z_{l+1} = MLP(LN(Z'_{l+1})) + Z'_{l+1}
where l ∈ {0, · · · , L − 1} denotes the layer, L is the number of transformer blocks, the MLP includes two fully-connected layers using GELU [58] as the activation function, and LN(·) is a layer-normalization module [56]. The MSA module is formulated as
MSA(Z) = [SA_1(Z); SA_2(Z); · · · ; SA_H(Z)] U_msa,   SA_h(Z) = σ(Q_h K_h^T / √D_h) V_h
where σ(·) is the softmax function, Q_h, K_h, V_h are the query, key, and value projections of head h, and U_msa ∈ R^{(H·D_h)×D} re-casts the output from the H heads of the MSA module into one D-dimensional output. Several variants of MSA, such as Reformer [59], Performer [60], and Linformer [61], are available.
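To make the formulation above concrete, the following is a minimal sketch of the tokenization and of one LN + MSA + skip + MLP block, assuming PyTorch; module and variable names (ViTBlock, the patch size P, dimension D, etc.) are illustrative rather than taken from any released implementation.

```python
# Minimal ViT-style tokenization and transformer block (illustrative sketch).
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    def __init__(self, dim=192, heads=3, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)                      # LN before MSA
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)                      # LN before MLP/FFN
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim))

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.msa(h, h, h)[0]                      # skip-connection around MSA
        z = z + self.mlp(self.ln2(z))                     # skip-connection around MLP
        return z

# Tokenization: split the image into P x P patches, project them to D dimensions,
# prepend a class token, and add learnable position embeddings.
P, D, C, H, W = 16, 192, 3, 224, 224
N = (H // P) * (W // P)
proj = nn.Linear(P * P * C, D)
cls_token = nn.Parameter(torch.zeros(1, 1, D))
pos_emb = nn.Parameter(torch.zeros(1, N + 1, D))

x = torch.randn(2, C, H, W)
patches = x.unfold(2, P, P).unfold(3, P, P).reshape(2, C, N, P * P)
patches = patches.permute(0, 2, 1, 3).reshape(2, N, C * P * P)
z0 = torch.cat([cls_token.expand(2, -1, -1), proj(patches)], dim=1) + pos_emb
z1 = ViTBlock()(z0)                                       # one of L stacked blocks
print(z1.shape)                                           # torch.Size([2, 197, 192])
```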

Backbone design
In this section, we describe several recent designs for the backbone of ViT models. Without loss of generality, we focus on the image classification task. We divide recent progress into two mainstream approaches: (i) injecting convolutional prior knowledge into ViT, with works including T2T-ViT [15], ConViT [18], PiT [21], and Swin Transformer [20], and (ii) boosting the richness of visual features, with works including TNT [16], CPVT [17], DeepViT [19], and LocalViT [22]. We also briefly describe recent developments in visualizing the feature maps of ViT models [23,62,63], which help to better understand their working mechanism. We list the core details of their performance on ImageNet [64] in Table 2.

T2T-ViT
Yuan et al. [15] note that the method to convert input images into tokens in a typical ViT [9] ineffectively models the spatial structure of image data and may lead to poor training efficiency and suboptimal performance. They propose two effective approaches to address the aforementioned problem. First, they propose a token-to-token (T2T) module to inject spatial information into the tokenization of image patches and reduce the length of tokens progressively for the sake of computational and parameter efficiency. Inspired by CNN architectures [1][2][3], they also devise a deep-narrow ViT framework to reduce the number of parameters and enhance training efficiency. Overall, they train ViT models from scratch on ImageNet without additional datasets.
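The token-to-token idea of progressively aggregating neighbouring tokens can be sketched with an overlapping unfold, as below; this is only an illustrative approximation assuming PyTorch, and the kernel, stride, and padding values are not the exact T2T-ViT settings.

```python
# Sketch of one token-to-token (T2T) re-structurization step (illustrative).
import torch
import torch.nn.functional as F

def t2t_step(tokens, h, w, k=3, s=2, p=1):
    # tokens: (B, N, C) with N = h * w; fold the tokens back into a feature map,
    # then soft-split it with overlapping k x k windows so that each new token
    # aggregates a local neighbourhood and the sequence length shrinks.
    B, N, C = tokens.shape
    fmap = tokens.transpose(1, 2).reshape(B, C, h, w)
    unfolded = F.unfold(fmap, kernel_size=k, stride=s, padding=p)
    return unfolded.transpose(1, 2)          # (B, N', C * k * k), with N' < N

tokens = torch.randn(2, 56 * 56, 64)
out = t2t_step(tokens, 56, 56)
print(out.shape)                              # torch.Size([2, 784, 576])
```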

TNT
Han et al. [16] propose a novel Transformer-iN-Transformer (TNT) framework to further exploit the intrinsic spatial structural information in image data. As Fig. 3 shows, TNT considers both patch-level and pixel-level relations when learning useful visual features. They propose a TNT block to utilize the pixel-level representations effectively and efficiently: an additional transformer, called the Inner T-Block, models pixel-level relationships within each patch, and the calculated pixel-level features then reinforce the patch-level features. Consequently, TNT achieves 81.3% top-1 classification accuracy on ImageNet [64] at the cost of only moderate additional computation. The experimental results verify the positive effects of pixel-level relation modeling.
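A rough sketch of the patch/pixel two-level design is given below, assuming PyTorch; the block is a simplified stand-in for the actual TNT block, and the dimensions and helper names are illustrative.

```python
# Simplified TNT-style block: an inner transformer models pixel-level tokens within
# each patch, and its output is projected and added back to the patch tokens.
import torch
import torch.nn as nn

class TNTBlockSketch(nn.Module):
    def __init__(self, patch_dim=384, pixel_dim=24, pixels_per_patch=16, heads=6):
        super().__init__()
        self.inner = nn.TransformerEncoderLayer(pixel_dim, 4, batch_first=True)
        self.outer = nn.TransformerEncoderLayer(patch_dim, heads, batch_first=True)
        self.proj = nn.Linear(pixels_per_patch * pixel_dim, patch_dim)

    def forward(self, patch_tokens, pixel_tokens):
        # patch_tokens: (B, N, patch_dim); pixel_tokens: (B * N, pixels_per_patch, pixel_dim)
        B, N, _ = patch_tokens.shape
        pixel_tokens = self.inner(pixel_tokens)                # pixel-level relations per patch
        fused = self.proj(pixel_tokens.reshape(B, N, -1))      # fold pixel features into patch dim
        return self.outer(patch_tokens + fused), pixel_tokens  # reinforce the patch tokens

blk = TNTBlockSketch()
patch = torch.randn(2, 196, 384)
pixel = torch.randn(2 * 196, 16, 24)
out_patch, out_pixel = blk(patch, pixel)
print(out_patch.shape, out_pixel.shape)
```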

ConViT
D'Ascoli et al. [18] propose a novel ViT model with soft convolutional inductive biases (ConViT) to endow transformers with an adaptive receptive field. Figure 4 schematically shows the core block, called a gated positional self-attention (GPSA) module. A GPSA block has two branches: W_qry and W_key are used to model global or long-range relationships, while v_pos is utilized to model relationships within local regions. To adaptively trade off between the two branches, they adopt a learnable gating parameter λ, which is initialized as 1 for all layers and all heads of the MSA.
With the proposed GPSA module, they manage to adaptively expand the self-attention receptive field during training.
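The gating idea can be sketched as follows, assuming PyTorch; the module below only illustrates how a learnable λ blends a content-based attention map with a positional one, and is not the full GPSA implementation.

```python
# Sketch of GPSA-style gating: a per-head learnable lambda blends content attention
# with a positional (local) attention map.
import torch
import torch.nn as nn

class GPSAGateSketch(nn.Module):
    def __init__(self, heads=3):
        super().__init__()
        # one gate per head, initialized to 1 as described above
        self.lam = nn.Parameter(torch.ones(heads))

    def forward(self, content_attn, pos_attn):
        # content_attn, pos_attn: (B, heads, N, N), each already softmax-normalized
        gate = torch.sigmoid(self.lam).view(1, -1, 1, 1)
        return (1.0 - gate) * content_attn + gate * pos_attn

attn = GPSAGateSketch()(torch.rand(2, 3, 196, 196).softmax(-1),
                        torch.rand(2, 3, 196, 196).softmax(-1))
print(attn.shape)                             # torch.Size([2, 3, 196, 196])
```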

CPVT
Chu et al. [17] resort to a novel design of the position embedding module to further enrich the visual features learned by ViT. Instead of a predefined position embedding that is independent of the input data, they propose a conditional position embedding scheme that generates different positional encodings for different input tokens, akin to dynamic neural network design [66]. In their implementation, they rearrange the input tokens spatially and apply convolution operations to extract the position embedding in a learnable way. This also preserves local neighborhood information during tokenization, benefiting classification performance.
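A minimal sketch of such a conditional positional encoding generator is shown below, assuming PyTorch; the module and parameter names are illustrative and the design is simplified from the description above.

```python
# Sketch of a conditional positional encoding generator (PEG-style): tokens are folded
# back into a 2D map and a depth-wise convolution produces an input-dependent
# positional encoding that is added back to the tokens.
import torch
import torch.nn as nn

class PEGSketch(nn.Module):
    def __init__(self, dim=192):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens, h, w):
        B, N, C = tokens.shape                       # N = h * w (class token excluded)
        fmap = tokens.transpose(1, 2).reshape(B, C, h, w)
        pos = self.dwconv(fmap).flatten(2).transpose(1, 2)
        return tokens + pos                          # conditional positional encoding

tokens = torch.randn(2, 14 * 14, 192)
print(PEGSketch()(tokens, 14, 14).shape)             # torch.Size([2, 196, 192])
```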
Two further ViT models, LeViT [12] and CoaT [67], investigate the importance of position embedding and propose different implementations. We do not describe them further due to lack of space.

Swin Transformer
On the basis of the observation that image data contain much redundant spatial information, and given the success of deep-narrow CNN architectures, Liu et al. [20] propose a novel hierarchical visual transformer design. Figure 5(a) illustrates the core idea of the window MSA (W-MSA) and shifted window MSA (SW-MSA) in Swin Transformer, which partition the local patches into several windows and run the MSA module window by window. With the W-MSA mechanism, the computational complexity is reduced from O(4HWC^2 + 2(HW)^2 C) to O(4HWC^2 + 2M^2 HWC), where H × W is the number of input patches, M × M is the number of patches in each window (the window size), and C is the feature dimension. A shifted window design is also proposed to encourage cross-window communication for rich visual features. They also propose a deep-narrow architecture (see Fig. 5(b)). Extensive experiments on ImageNet, COCO, and ADE-20K demonstrate that Swin Transformer uses parameters efficiently and achieves state-of-the-art object detection and semantic segmentation.
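The window partitioning underlying W-MSA can be sketched as follows, assuming PyTorch; attention is then run independently on each M × M window, which is why the quadratic (HW)^2 term above becomes linear in HW.

```python
# Sketch of window partitioning for W-MSA: attention is computed independently
# inside each M x M window of patches.
import torch

def window_partition(x, M):
    # x: (B, H, W, C) patch features; returns (B * num_windows, M * M, C)
    B, H, W, C = x.shape
    x = x.reshape(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

x = torch.randn(2, 56, 56, 96)
windows = window_partition(x, 7)
print(windows.shape)      # torch.Size([128, 49, 96]); MSA then runs per 7 x 7 window
```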
Dong et al. [68] also propose another vision transformer model, CSWin Transformer, which utilizes a cross-shaped window self-attention mechanism (akin to criss-cross attention [69] or strip pooling [70]) and a locally-enhanced positional encoding. CSWin Transformer obtains even better performance than Swin Transformer.

DeepViT
Scaling depth (e.g., the 152-layer ResNet [1]) is an important aspect of CNN architectures. For ViT models, Zhou et al. [19] empirically find that performance saturates when more than 20 transformer blocks are stacked, even with the help of skip-connection layers. They unveil that the reason is attention collapse: the maps extracted from the heads of one MSA module share increasingly similar patterns, leading to large information redundancy and low training efficiency. If communication between the MSA heads is promoted, the redundancy between heads can be reduced and richer visual features can be learned. On the basis of this motivation, they propose a simple and effective Re-Attention module:
Re-Attention(Q, K, V) = Norm(Θ^T Softmax(QK^T/√d)) V
where Θ ∈ R^{H×H} is a learnable parameter that facilitates communication between the H heads within one MSA module. Experiments on ImageNet [64] verify that a 32-layer ViT model can be trained without performance saturation with the help of the Re-Attention module.
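A sketch of the Re-Attention idea is given below, assuming PyTorch; the initialization of Θ and the choice of normalization layer are illustrative assumptions, not the exact DeepViT settings.

```python
# Sketch of Re-Attention: a learnable H x H matrix Theta mixes the attention maps of
# the H heads before they are applied to the values.
import torch
import torch.nn as nn

class ReAttentionSketch(nn.Module):
    def __init__(self, heads=8):
        super().__init__()
        self.theta = nn.Parameter(torch.eye(heads))    # Theta in R^{H x H}
        self.norm = nn.BatchNorm2d(heads)              # normalization over the head dimension

    def forward(self, attn, v):
        # attn: (B, H, N, N) softmax attention maps; v: (B, H, N, D_h) values
        mixed = torch.einsum('hg,bgnm->bhnm', self.theta, attn)   # cross-head mixing
        return self.norm(mixed) @ v

out = ReAttentionSketch()(torch.rand(2, 8, 196, 196).softmax(-1),
                          torch.randn(2, 8, 196, 64))
print(out.shape)                                       # torch.Size([2, 8, 196, 64])
```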
Notably, a concurrent work, CaiT [71], also investigates depth scaling and offers a different perspective. Further details can be found in their paper.

PiT
Considering the importance of pooling layers to the model capability and generalization performance of CNN architectures, Heo et al. [21] investigate how to take advantage of pooling modules in ViT. The pooling layer in a conventional CNN architecture aggregates spatial information to produce spatially invariant features. On the basis of this observation, they propose to implement spatial information condensation via depth-wise convolution. As shown in Fig. 6, they first split the input tokens into the class token and spatial tokens, and then recover the spatial shape of the latter. Next, they apply a depth-wise convolution to the spatial branch to serve as a pooling layer, while a fully connected layer projects the class token into the same dimension. With this simple and effective pooling module, they propose the pooling-based ViT (PiT) and achieve a good trade-off between computational efficiency and classification performance.
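The pooling step can be sketched as below, assuming PyTorch; the dimensions and the strided depth-wise convolution settings are illustrative.

```python
# Sketch of a PiT-style pooling step: spatial tokens are folded back to a 2D map and
# downsampled with a strided depth-wise convolution, while the class token is
# projected to the new dimension by a fully connected layer.
import torch
import torch.nn as nn

class PiTPoolSketch(nn.Module):
    def __init__(self, dim=64, out_dim=128):
        super().__init__()
        self.dw = nn.Conv2d(dim, out_dim, kernel_size=3, stride=2, padding=1, groups=dim)
        self.fc = nn.Linear(dim, out_dim)

    def forward(self, cls_tok, spatial_tok, h, w):
        B, N, C = spatial_tok.shape
        fmap = spatial_tok.transpose(1, 2).reshape(B, C, h, w)
        fmap = self.dw(fmap)                                   # halve the spatial resolution
        spatial_tok = fmap.flatten(2).transpose(1, 2)
        return self.fc(cls_tok), spatial_tok

cls_tok, spatial = torch.randn(2, 1, 64), torch.randn(2, 28 * 28, 64)
cls_out, sp_out = PiTPoolSketch()(cls_tok, spatial, 28, 28)
print(cls_out.shape, sp_out.shape)             # (2, 1, 128) (2, 196, 128)
```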

LocalViT
Li et al. [22] study the differences between ViT models and CNN architectures. They find that visual transformers are good at modeling global relations but lack a mechanism for learning interactions within a local region, which is a characteristic of convolution. A local mechanism is important and useful for modeling the spatial structure of image data. Thus, they argue that visual transformers should reinforce local relation modeling to improve the learned visual features. Specifically, they investigate several possible blocks and then propose local ViT (LocalViT), as shown in Fig. 7 (right). Experiments on ImageNet [64] indicate that the LocalViT module is a practical local mechanism which boosts the performance of various ViT models [15,16,27,65].
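A sketch of a locality-enhanced feed-forward network in this spirit follows, assuming PyTorch; the exact kernel size and hidden dimension are illustrative.

```python
# Sketch of a LocalViT-style FFN: a depth-wise convolution is inserted between the two
# linear layers so the FFN also mixes information within a local spatial neighbourhood.
import torch
import torch.nn as nn

class LocalFFNSketch(nn.Module):
    def __init__(self, dim=192, hidden=768):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.dw = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.fc2 = nn.Linear(hidden, dim)
        self.act = nn.GELU()

    def forward(self, tokens, h, w):
        B, N, _ = tokens.shape
        x = self.act(self.fc1(tokens))                       # (B, N, hidden)
        x = x.transpose(1, 2).reshape(B, -1, h, w)           # back to a 2D map
        x = self.act(self.dw(x)).flatten(2).transpose(1, 2)  # local mixing
        return self.fc2(x)

print(LocalFFNSketch()(torch.randn(2, 196, 192), 14, 14).shape)  # torch.Size([2, 196, 192])
```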

Comparison on ImageNet
We compare the classification accuracies of the latest ViT models on the ImageNet benchmark [64] in Table 2, together with their implementation details, namely #FLOPs, #Params, and source code links, to facilitate further research. The experimental results indicate that ViT models can match or even outperform state-of-the-art CNN architectures such as RegNet [3] and EfficientNet [2], which are based on expert-designed basic modules and the power of neural architecture search (NAS) techniques. We also observe very recent exciting progress: the latest ViT models possess higher model capability and better parameter efficiency than the original version of ViT.

Visualization of ViT
Visualizing the feature maps of ViT is also an interesting and worthwhile research topic. As ViT models use different basic components from CNN models, different visualization methods should be adopted accordingly. Figure 8 shows class-specific visualization results; from left to right: the input image, rollout [62], raw attention, GradCAM [72], LRP [73], partial LRP [63], and Transformer-Explainability [23] (reproduced with permission from Ref. [23], © The Author(s) 2021). The latest tools specialized for MSA modules and ViT models, namely partial LRP [63] and Transformer-Explainability [23], generate better feature map visualizations than methods designed for CNNs. The visualizations indicate that ViT models can learn meaningful spatial information from image-level annotations alone. Therefore, ViT models have potential value in weakly supervised scenarios, such as weakly supervised object detection.

High-level vision
In this section, we focus on representative recent high-level vision tasks based on the transformer framework. High-level vision refers to the stages of visual processing that move from analyzing local image structure to exploring the structure of the external world that produced those images. The main tasks include object detection [24][25][26], segmentation [28, 29, 74-79], and key-point detection [80-85]. As the focus of this survey is low-level vision, we only briefly introduce some interesting works in object detection. Modern detection methods address the set prediction task by defining a large set of proposals [86,87], anchors [88], or window centers [89,90]. Unlike previous attempts [91-96], transformer-based detection raises the possibility of fully anchor-free, end-to-end models. We begin with DETR [24], followed by Deformable DETR [25] and UP-DETR [26]. PVT [27], the earliest transformer backbone for dense prediction tasks such as detection, is also introduced. Additional recent high-level backbones, such as Swin Transformer [20] and Twins [97], are introduced in Section 3. A comparison is provided in Table 3.

DETR
Carion et al. [24] were the first to provide a completely end-to-end detection model based on the transformer encoder-decoder architecture. It gave researchers the new insight that the transformer architecture can achieve state-of-the-art performance in detection. Unlike previous detection models, DETR does not rely on artificially designed anchors. The overall structure is illustrated in Fig. 9: the transformer encoders are arranged after a convolutional feature extractor.
In the DETR encoder, the output feature map of the CNN is first decomposed into patches, as for ViT [9] introduced in Section 2. The patches are then mapped to one-dimensional vectors and passed through several traditional transformer encoders. DETR encoders differ from traditional BERT encoders only in the positional embedding: the positional embedding is injected into all encoder blocks rather than only the input layer, to preserve positional information; the authors claim that high-level vision detection needs more positional information than classification. In addition, the positional embedding is added only to the queries and keys.
The decoder has two inputs. The first is the set of object queries, which serve as the queries of the second MSA layer in each decoder block. The second is the output of the encoder module, which provides the values and keys for that layer. To understand the mechanism intuitively, readers can treat the object queries as descriptions of different target objects, and the decoder as searching the image features for patterns similar to the object queries. The output of the decoder is then passed through two branches, the box branch and the class branch. The box branch predicts the positions of the target objects, while the class branch predicts the category of each predicted box.
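The decoding mechanism can be sketched as follows, assuming PyTorch; the way the object queries are fed to the decoder and the prediction heads are simplified, illustrative assumptions rather than the exact DETR implementation.

```python
# Sketch of DETR-style decoding: learned object queries attend to the encoder memory,
# and each output embedding is mapped to a box and a class prediction.
import torch
import torch.nn as nn

d_model, num_queries, num_classes = 256, 100, 91
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
object_queries = nn.Parameter(torch.randn(num_queries, d_model))
box_head = nn.Linear(d_model, 4)                  # normalized box parameters
cls_head = nn.Linear(d_model, num_classes + 1)    # classes plus "no object"

memory = torch.randn(2, 625, d_model)             # encoder output for a 25 x 25 feature map
tgt = object_queries.unsqueeze(0).expand(2, -1, -1)
hs = decoder(tgt, memory)                         # (2, 100, 256)
boxes, logits = box_head(hs).sigmoid(), cls_head(hs)
print(boxes.shape, logits.shape)                  # (2, 100, 4) (2, 100, 92)
```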

Deformable DETR
Although considerable progress has been achieved by DETR in transformer-based detection, its main deficiency is its huge computational cost: training DETR on one V100 GPU is reported to take 48 days, which is unaffordable for common institutions. Thus, Zhu et al. [25] propose a deformable self-attention module that reduces the training time to 340 GPU hours while improving on the original performance. The core idea of deformable attention is to find the nearest K values for an input query when calculating attention; nearest here refers to semantic rather than spatial distance. Deformable attention, illustrated in Fig. 10, draws on deformable convolution [99]. One linear layer learns the offsets of the nearest K values, and another linear layer learns the attention score of each value. In summary, the main contributions of the deformable attention module are that (i) only K corresponding values, rather than all values, are required to calculate the attention for one query, and (ii) the attention scores are learned by a network rather than by simple multiplication of queries and keys.
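The two contributions can be sketched together as below, assuming PyTorch; this is a simplified single-head, single-scale illustration of the idea (offsets and scores predicted by linear layers, values gathered by sampling), not the actual multi-scale deformable attention module.

```python
# Rough sketch of deformable attention: for each query, one linear layer predicts K
# sampling offsets and another predicts K attention scores, so only K sampled values
# (rather than all positions) are aggregated.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformAttnSketch(nn.Module):
    def __init__(self, dim=256, k=4):
        super().__init__()
        self.k = k
        self.offsets = nn.Linear(dim, 2 * k)     # K (dx, dy) sampling offsets per query
        self.scores = nn.Linear(dim, k)          # K attention scores per query, learned directly
        self.value = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, queries, ref_points, feat_map):
        # queries: (B, N, C); ref_points: (B, N, 2) in [-1, 1]; feat_map: (B, C, H, W)
        B, N, C = queries.shape
        off = self.offsets(queries).reshape(B, N, self.k, 2)
        w = self.scores(queries).softmax(-1)                     # (B, N, K)
        loc = (ref_points.unsqueeze(2) + off).clamp(-1, 1)       # sampling locations per query
        v = F.grid_sample(self.value(feat_map), loc, align_corners=False)  # (B, C, N, K)
        return (v * w.unsqueeze(1)).sum(-1).transpose(1, 2)      # (B, N, C)

out = DeformAttnSketch()(torch.randn(2, 300, 256),
                         torch.rand(2, 300, 2) * 2 - 1,
                         torch.randn(2, 256, 25, 25))
print(out.shape)                                                 # torch.Size([2, 300, 256])
```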

UP-DETR
Dai et al. [26] propose a novel unsupervised pretraining method, random query patch detection, for DETR [24,25], which leads to better performance. Figure 11(a) illustrates the pretraining method. A query patch is randomly cropped from an input image and then added to the object queries of the DETR decoder. The goal is to predict two things: (i) L_cls, i.e., the existence of objects in the query patch, and (ii) L_box, i.e., the location of the query patch in the image. A reconstruction loss L_rec is also designed to ensure that the CNN extracts full information from the query patch. This pretraining method makes training more flexible; as shown in Fig. 11(b), a more robust representation can be learned after augmenting the random query patches.

PVT
Diverging from the DETR line, Wang et al. [27] propose a pure transformer-based backbone, the pyramid vision transformer (PVT), for detection and segmentation. Its framework is shown in Fig. 12. After each stage, the output is rearranged to recover its spatial structure and then down-sampled to half resolution. Notably, the spatial reduction is applied only to the keys K and values V, while the spatial size of the queries Q is maintained. In practice, a full PVT-based detection model comprises a PVT backbone and a general detection head, such as RetinaNet [88] or Mask R-CNN [100]. Several dense prediction backbones have appeared after PVT [27], such as Swin Transformer [20], CPVT [17], and Twins [97], which we introduced in Section 3.
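The spatial-reduction idea can be sketched as follows, assuming PyTorch; the reduction ratio and the use of a strided convolution are illustrative simplifications of PVT's spatial-reduction attention.

```python
# Sketch of spatial-reduction attention (SRA): keys and values are spatially
# downsampled before attention, while queries keep full resolution.
import torch
import torch.nn as nn

class SRASketch(nn.Module):
    def __init__(self, dim=64, heads=1, ratio=4):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=ratio, stride=ratio)  # reduce K and V
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, h, w):
        B, N, C = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(B, C, h, w)
        kv = self.sr(fmap).flatten(2).transpose(1, 2)      # (B, N / ratio^2, C)
        return self.attn(tokens, kv, kv)[0]                # queries keep length N

print(SRASketch()(torch.randn(2, 56 * 56, 64), 56, 56).shape)  # torch.Size([2, 3136, 64])
```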

Low-level vision and generation
In this section, we focus on some representative recent transformer-based works on low-level vision tasks, as listed in Table 4. Low-level vision tasks include super-resolution [101], denoising [101], image colorization [30], text-to-image generation [31], and image generation [34,35]. We separately introduce how these tasks use transformers to achieve good results (see examples in Fig. 13).

TIME
As a pretrained NLP model is usually required for the text-to-image (T2I) task, it may introduce inflexibility into the whole model. Liu et al. [31] propose an efficient model for T2I tasks: Text and Image Mutual-Translation Adversarial Networks (TIME). TIME can jointly handle T2I and image captioning using a single network, without a pretrained NLP model. As Fig. 14 shows, TIME introduces a multi-head, multi-layer transformer into the generator and the text decoder, which serves as a text-conditioned image transformer for generation and, analogously, for image captioning [103][104][105]. T2I and image captioning are jointly trained in the generative adversarial network (GAN) manner. TIME achieves state-of-the-art T2I performance without pretraining.

DALL·E

Text-to-image generation is a classical generation problem, which requires constructing a mapping between two streams. Ramesh et al. [32] propose a transformer-based framework to better align text and image semantic information. A two-stage model is applied to model the text and image tokens: they first train a discrete variational autoencoder [106] to build 1024 image tokens, and adopt 256 BPE-encoded text tokens to represent the text information. Thereafter, an auto-regressive transformer is used to capture the joint distribution of the text and image tokens. They also use a mixed-precision training strategy and PowerSGD [57] to save GPU memory. The model consumes approximately 24 GB of memory at 16-bit precision.

IPT
Classification models can be pretrained on large-scale datasets to enlarge model representation ability. Related low-level vision tasks, such as image super-resolution, inpainting, and deraining, can be combined in one model to help one another, and the generalized pretraining procedure alleviates the problem of task-specific data limitation. Therefore, Chen et al. [101] develop a pretrained model for image processing using the transformer architecture, the Image Processing Transformer (IPT). The model architecture is shown in Fig. 15. To adapt to different vision tasks, Chen et al. [101] design a multi-head and multi-tail architecture, which involves three convolutional layers. The transformer body consists of an encoder and a decoder as described in Ref. [57]. As in the discriminator of Ref. [34], the given features are split into patches and each patch is regarded as a "word" before the features are input into the transformer body. Unlike the original transformer, a task-specific embedding is used as an additional input to the decoder. The model is pretrained on ImageNet, which is a key factor in its success.

Uformer
Wang et al. [37] propose an effective and efficient transformer-based architecture for image restoration, which uses transformer modules to construct a hierarchical encoder-decoder network. Two core designs make Uformer suitable for image restoration. The first is the locally-enhanced window transformer block: non-overlapping window-based self-attention reduces the computational cost, and a depth-wise convolution in the FFN further improves its ability to capture local context. The second is the skip-connection mechanism, which effectively delivers encoder information to the decoder. Thanks to these two designs, Uformer can capture useful dependencies for image restoration. The network structure of Uformer is shown in Fig. 16. Its performance has been verified on several image restoration tasks, including denoising, deraining, and deblurring.

TransGAN
Driven by curiosity, Jiang et al. [34] design the first GAN built purely from transformer-based structures, to determine whether transformers perform well in generative adversarial networks (GANs) [107]. The network consists of a memory-friendly transformer-based generator and a patch-level discriminator. Jiang et al. [34] also imitate the design philosophy of CNN-based GANs and devise a novel structure for image generation to avoid the high cost of directly applying NLP transformers to visual tasks. As shown in Fig. 17 (left), the memory-friendly transformer-based generator has multiple stages, increasing the feature resolution while decreasing the embedding dimension. The discriminator splits the generated images into small patches and regards them as "words"; the resulting tokens are fed to a classification head to output the real/fake prediction. The whole network is trained with three ingenious strategies: data augmentation, self-supervised auxiliary-task (super-resolution) co-training, and locality-aware initialization. The results on CIFAR-10 and STL-10 are comparable to those of some state-of-the-art CNN-based GANs.

TTSR
Texture is often damaged during downsampling and cannot easily be recovered, so traditional single-image super-resolution often produces blurred outputs. Therefore, Yang et al. [33] propose a reference-based image super-resolution method, the Texture Transformer Network for Image Super-Resolution (TTSR). As shown in Fig. 18, a learnable texture extractor (LTE) is first used to extract proper texture information, which is crucial for super-resolution. The input to the texture transformer can then be expressed as
Q = LTE(LR↑), K = LTE(Ref↓↑), V = LTE(Ref)
where LR↑, Ref, and Ref↓↑ denote the up-sampled image to be reconstructed, the reference image, and the reference image that is down-sampled and then up-sampled, respectively. The texture transformer contains a hard-attention and a soft-attention module and transfers texture from the reference image to the high-resolution features. Finally, they propose a cross-scale feature integration module to exchange information between features at different scales for better representation.

ColTran
Image colorization is a challenging task that requires understanding image semantics. Most colorization models estimate a log-likelihood using neural generative approaches. Kumar et al. [30] propose the Colorization Transformer (ColTran), which uses a self-attention mechanism to improve a probabilistic colorization model. ColTran replaces standard self-attention blocks with axial self-attention, which decreases the computational complexity from O(D^2) to O(D√D). Kumar et al. [30] adopt a conditional variant of the Axial Transformer [108] for low-resolution coarse colorization. As shown in Fig. 19, the ColTran core consists of Conditional Self-Attention, MLP, and Layer Norm modules, and applies conditioning to the auto-regressive core. They also design a Color Upsampler and a Spatial Upsampler to produce high-fidelity colorized images from low-resolution results: the Color Upsampler converts the coarse image of 512 colors into a 3-bit RGB image with 8 symbols per channel, and the Spatial Upsampler generates colorized images at high resolution. ColTran can handle grayscale images of 256 × 256 pixels.
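The axial factorization can be sketched as below, assuming PyTorch; the sketch simply applies standard multi-head attention along rows and then columns, which conveys the O(D√D) cost for D = HW positions but omits ColTran's conditioning scheme.

```python
# Sketch of axial self-attention: attention is computed along rows, then along columns,
# instead of over all H * W positions jointly.
import torch
import torch.nn as nn

class AxialAttnSketch(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.row = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C)
        B, H, W, C = x.shape
        r = x.reshape(B * H, W, C)
        x = self.row(r, r, r)[0].reshape(B, H, W, C)                    # attend within each row
        c = x.transpose(1, 2).reshape(B * W, H, C)
        x = self.col(c, c, c)[0].reshape(B, W, H, C).transpose(1, 2)    # attend within each column
        return x

print(AxialAttnSketch()(torch.randn(2, 32, 32, 128)).shape)  # torch.Size([2, 32, 32, 128])
```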

GANsformer
The cognitive science literature describes two interacting mechanisms in human perception: bottom-up and top-down processing. Previous vision models based on CNNs do not reflect this bidirectional nature, because the local receptive field limits their ability to model long-range dependencies. Hudson and Zitnick [35] therefore design a transformer network with a highly adaptive architecture centered around relational attention and dynamic interaction. They propose a Bipartite Transformer to overcome the huge computational complexity of full self-attention. Unlike the self-attention operator, which considers all pairwise relations between input elements, the Bipartite Transformer generalizes this formulation by operating over a bipartite graph between two groups of variables (latents and image features). As shown in Fig. 20, simplex attention distributes information in a single direction over the Bipartite Transformer, while duplex attention supports bidirectional interaction between the elements. The bipartite structure strikes a good balance between expressiveness and efficiency, and the interaction between latent and visual features yields good generation results.

StyTr 2
Considering the limited receptive fields of CNNs, obtaining global information about input images is difficult, yet such information is critical for image style transfer; a content-leak problem also occurs when CNN-based models are adopted for style transfer.
Therefore, Deng et al. [38] propose the first transformer-based style transfer model, exploiting the transformer's ability to capture long-range dependencies (Fig. 21). The unbiased style transfer transformer framework StyTr 2 contains two transformer encoders to obtain domain-specific information. Following encoding, a multi-layer transformer decoder generates the output sequences. Moreover, Deng et al. [38] propose a content-aware mechanism to learn the positional encoding from image semantic features and dynamically scale it to suit different image sizes.

VQGAN
High-resolution image synthesis is a difficult generation problem which aims to generate high-fidelity images within a reasonable time. Convolutional approaches exploit the local structure of the image, while transformer methods are good at establishing long-range interactions. Esser et al. [102] combine the advantages of CNNs and transformers to build a high-resolution image generation framework. They propose a variant of VQVAE [36] and adopt adversarial learning to achieve vivid results. The latent space consists of a discrete codebook, and codes from the codebook are combined according to a learned probability to represent the content information. The key to sampling in such a discrete space is predicting the distribution of the discrete codes, which the transformer handles: given the first i − 1 codes, the transformer module predicts the probability distribution of the i-th code.
The number of codes in the codebook ranges from 512 to 4096, depending on the dataset. The model can synthesize images of 1280 × 460 pixels.
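The autoregressive sampling over the discrete codebook can be sketched as follows, assuming PyTorch; the vocabulary size, sequence length, and model depth are illustrative, and the real model uses a much larger GPT-style transformer.

```python
# Sketch of an autoregressive prior over codebook indices: given the previously
# generated discrete codes, a causally-masked transformer predicts a distribution over
# the codebook for the next position, from which the next code is sampled.
import torch
import torch.nn as nn

vocab, dim, seq_len = 1024, 256, 16 * 16
tok_emb = nn.Embedding(vocab, dim)
pos_emb = nn.Parameter(torch.zeros(1, seq_len, dim))
body = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
head = nn.Linear(dim, vocab)

codes = torch.randint(0, vocab, (1, 10))                   # codes generated so far
x = tok_emb(codes) + pos_emb[:, :codes.shape[1]]
mask = torch.triu(torch.full((10, 10), float('-inf')), diagonal=1)  # causal mask
logits = head(body(x, mask=mask))                          # (1, 10, vocab)
next_code = torch.multinomial(logits[:, -1].softmax(-1), 1)
print(next_code.shape)                                     # torch.Size([1, 1])
```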

PCT
Unlike CNNs, transformers are inherently permutation invariant when processing a series of points and are thus suitable for point cloud learning. Guo et al. [39] propose a state-of-the-art transformer-based point cloud model based on offset-attention with an implicit Laplace operator. They enhance the input embedding based on farthest point sampling and nearest neighbor search to better capture the local context in the point cloud.

Multimodal learning
The above sections cover developments in conventional computer vision. Beyond pure vision tasks, transformer-based models have also achieved promising progress in language-and-vision multimodal tasks, such as visual question answering (VQA) [109,110], image captioning [111], and image retrieval [112], owing to the high performance of NLP transformers. Transformer-based vision-language (V+L) approaches are often pretrained on multiple tasks and fine-tuned on diverse downstream sub-tasks. Inputs from different modalities are processed by analogous single- or two-stream architectures.
In this section, we start from recently representative transformer-based works on V+L tasks with different frameworks (Section 6.1), and then summarise pretraining objectives (Section 6.2) and compare details (Section 6.3).

Transformer-based V+L works
Most transformer-based V+L works are based on one of two structures: the two-stream framework (one stream per modality) or the single-stream framework (a common stream that jointly learns cross-modal representations). ViLBERT [40] and UNITER [41] are representative two- and single-stream frameworks, respectively, while SemVLP [42] unifies the two mainstream architectures to align cross-modal semantics.

ViLBERT
ViLBERT [40] is a representative two-stream transformer-based model for V+L, in which two separate streams are used for vision and language processing. Figure 23 shows the architecture of ViLBERT: two parallel BERT-style models operate on image regions and text tokens, and each stream consists of a series of transformer blocks (TRM) and co-attentional transformer layers (Co-TRM). As shown in Fig. 22, the Co-TRM layers enable information exchange between modalities, and this modified attention mechanism is the key technical innovation. By exchanging key-value pairs in multi-headed attention, the Co-TRM structure allows variable network depth for each modality and enables cross-modal connections at different depths.
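The key-value exchange can be sketched as follows, assuming PyTorch; the module below only illustrates the cross-stream attention pattern and omits the feed-forward and normalization layers of a full Co-TRM block.

```python
# Sketch of a co-attentional layer: each stream uses its own queries but takes keys and
# values from the other modality.
import torch
import torch.nn as nn

class CoAttnSketch(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.v2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):
        # vis: (B, Nv, C) region features; txt: (B, Nt, C) token features
        vis_out = vis + self.v2t(vis, txt, txt)[0]   # visual queries, linguistic keys/values
        txt_out = txt + self.t2v(txt, vis, vis)[0]   # linguistic queries, visual keys/values
        return vis_out, txt_out

v, t = torch.randn(2, 36, 768), torch.randn(2, 20, 768)
v2, t2 = CoAttnSketch()(v, t)
print(v2.shape, t2.shape)                            # (2, 36, 768) (2, 20, 768)
```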

UNITER
Chen et al. [41] propose UNITER: UNiversal Image-TExt Representation. It can power heterogeneous downstream V+L tasks with joint multimodal embeddings. As shown in Fig. 24, UNITER first encodes image regions (visual features and bounding box features) and textual words (tokens and positions) into a shared embedding space with image and text embedders. Then, UNITER applies a transformer module to learn the joint embedding of the two modalities through designed pretraining tasks that include classic image-text matching (ITM), masked language modeling (MLM), and masked region modeling (MRM). UNITER uses conditional masking on MLM and MRM, which means masking only one modality while keeping the other untainted. A novel word-region alignment pretraining task via optimal transport is also proposed to encourage fine-grained alignment between words and image regions. The authors consider the matching of word tokens and RoI regions as minimizing the distance of two discrete distributions, where the distance is computed based on optimal transport. UNITER, as a single-stream model, achieved state-of-the-art performance when proposed. ViLLA [113], which combines UNITER and adversarial training, achieves higher performance.

SemVLP
Li et al. [42] present a novel V+L framework, SemVLP, which unifies both mainstream architectures. By fusing single- and two-stream architectures, SemVLP better exploits cross-modal semantics. Its framework is detailed in Fig. 25. On the basis of a shared bidirectional transformer encoder with a cross-modal attention module, SemVLP can encode the input text and image at different semantic levels. It adopts common pretraining methods with a special training strategy: the single- and two-stream configurations are updated in turn during pretraining.

Multimodal pretraining
Designing reasonable pretraining objectives for transformer-based models, such as masked language modeling (MLM) and next-sentence prediction from BERT, has brought excellent results on NLP tasks. These methods also work in the cross-modal V+L field. The key challenge is how to replicate or extend large-scale pretraining to cross-modal settings and to design novel pretraining objectives for multimodal learning. In this section, we briefly introduce pretraining tasks extended from BERT, including MLM, masked region modeling (MRM), and image-text matching (ITM). We also list other specially designed pretraining tasks for multimodal learning.

Masked language modeling
Most recent V+L works follow BERT in using MLM for cross-modal tasks. UNITER modifies MLM by introducing visual information: it attempts to predict masked words by observing the surrounding words and all image regions. InterBERT [114] changes MLM to masked segment modeling: rather than masking random words (or replacing a selected word with a random one), it masks a continuous segment of text.

Image-text matching
BERT's other pretraining task, next-sentence prediction, has been converted into an ITM problem, which determines whether a sentence and a set of image regions match. This task is widely used in advanced V+L works. InterBERT [114] performs ITM with hard negatives, regarding the image-text pairs in the dataset as positive samples and pairing images with uncorrelated texts to form negative samples. VL-BERT [115] and UnifiedVLP [116] do not use ITM, tending to use other efficient choices such as the MRM tasks introduced next.

Masked region modeling
MRM is the dual task of MLM for the visual modality. Since MLM cannot be directly applied to continuous visual inputs, researchers have proposed several novel pretraining methods that mask input visual tokens to extend masked modeling to vision. Masked region feature regression (MRFR) is one such approach, applied by ViLBERT [40]: the model is trained to regress the masked input RoI-pooled feature, which is extracted by Faster R-CNN [86], and most models optimize this with an L2 loss. VL-BERT [115] also follows MRM instead of using ITM: it uses masked RoI classification with linguistic clues, predicting the category label of a masked RoI, obtained by Fast R-CNN [117], from the other clues. In contrast, some models choose masked region classification, which lets the model predict the object semantic class for each masked region; such models are often optimized with a cross-entropy loss or KL-divergence to learn the class distribution. These MRM tasks are used in UNITER and UNIMO [118]. InterBERT [114] also changes the MRM strategy in the visual modality by masking, with zero vectors, objects that have a high proportion of mutual overlap, to avoid information leakage caused by overlapping objects. Notably, earlier transformer-based works, such as VisualBERT [119] and B2T2 [120], do not extend MLM to the visual domain.
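The MRFR objective can be sketched as below, assuming PyTorch; the feature dimensions and the stand-in tensors are illustrative, with the transformer outputs replaced by random features for brevity.

```python
# Sketch of masked region feature regression (MRFR): masked RoI features are regressed
# from the model's output embeddings with an L2 loss.
import torch
import torch.nn as nn

dim = 2048
regressor = nn.Linear(768, dim)                      # maps model output back to RoI feature space

roi_feats = torch.randn(2, 36, dim)                  # pooled region features (e.g., Faster R-CNN)
mask = torch.zeros(2, 36, dtype=torch.bool)
mask[:, :5] = True                                   # pretend the first 5 regions are masked

hidden = torch.randn(2, 36, 768)                     # stand-in for transformer outputs
pred = regressor(hidden)
loss = nn.functional.mse_loss(pred[mask], roi_feats[mask])  # L2 regression on masked regions
print(loss.item())
```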

Other designs for V+L
Some models are also trained with unique, newly designed pretraining strategies. In Oscar [121], each image-text pair is represented as a triple consisting of a word sequence, a set of object tags, and a set of image region features; in addition to MLM on words and object tags, Oscar uses a contrastive loss that encourages the model to distinguish original from modified triples. Using contrastive learning differently, UNIMO creates image-text pairs by a novel text rewriting method. ERNIE-ViL [122] introduces a scene graph to design advanced pretraining tasks, including object prediction, attribute prediction, and relationship prediction. Li et al. [123] add masked sentence generation to optimize their model: a cross-modal decoder is taught to autoregressively decode the input sentence word by word, conditioned on the input image. Training directly on downstream tasks such as QA is also used in LXMERT [124] and SemVLP [42]. The VQA [110] and GQA [128] datasets, among others, can be used for VQA pretraining; notably, some of these datasets are simultaneously used as benchmarks. Table 6 shows the performance of the models described above on different V+L benchmark datasets; the results are obtained by models fine-tuned on the corresponding datasets.

Conclusions

Backbone design
Section 3 describes several recent developments in the backbone design of visual transformers, including feature map visualization approaches. Recent progress can be technically divided into two main streams: (i) enhancing the capability of visual transformers to model spatial structure and locality, for example via a better image-to-token module, a pixel-level transformer block, a depth-wise convolution-based pooling layer, and the SW-MSA module, and (ii) boosting the richness of learned visual features and promoting efficient use of parameters, for example via conditional position encoding, a message communication scheme between MSA heads, and deep-narrow ViT architectures. As the first visual transformer was proposed very recently (October 2020), we believe that the potential of the ViT model has not been fully exploited and several research topics are worthy of consideration and effort:
• Advanced designs of basic ViT operations or modules, and the corresponding learning schemes for CV tasks, are of interest, such as injecting prior knowledge of image data or of the computer vision task into the module design or learning scheme, and making the transformer more computationally efficient. The versatility of ViT models in additional real-world scenarios, such as aesthetic visual analysis [137][138][139], face anti-spoofing [140,141], and point cloud learning [142], is also worth exploring.
• The transformer block can be considered from the perspective of NAS. One goal of NAS frameworks [143][144][145] is to search for optimal network architectures for a given task without human intervention. By building on a well-designed search space that contains transformer blocks, interesting architectures can be found and practical insights for further developments gained. Several recent works have investigated this topic: Wang et al. [146] and So et al. [147] leverage NAS techniques to search automatically for effective and efficient transformer-based architectures, and Li et al. [148] propose a novel scheme, BossNAS, to find solutions that trade off CNN modules and transformer blocks.
• Understanding of the working mechanism and theoretical rationale of visual transformers can be enhanced. Several researchers have achieved promising progress in unveiling the power of transformer models, from such perspectives as information bottlenecks [149,150] and better visualization tools [23,63].

High-level vision
In Section 4, we introduce several representative works on object detection. The basic logic follows the line of DETR [24]. PVT [27], which is a general backbone for dense prediction, is also introduced. Several problems still need to be addressed despite improvements brought by these works. Unlike CNN-based methods, such as Faster-RCNN [86], current transformers for dense prediction tasks suffer from high computation time. Thus, efficiency of transformers for high-level vision remains a pressing research direction.

Low-level vision and generation
In Section 5, we introduce some low-level vision and image generation tasks using transformer-based models. They can achieve outstanding results but have difficulty generating large images. Therefore, extending a pure transformer with CNN layers is widely adopted by many works. A pure transformer structure still faces the challenge of high computation time.

Multimodal learning
In Section 6, we introduce several representative transformer-based models proposed in the past two years for vision and language tasks, and review mainstream pretraining tasks in the V+L field. Transformer-based models have succeeded on the tasks listed in Table 6, but performance can still be improved:
• Pure transformers may be an alternative choice for the image modality.
• Design of efficient pretraining tasks can lead to better results and performance.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.