PVT v2: Improved Baselines with Pyramid Vision Transformer

Transformers have recently shown encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (1) a linear complexity attention layer, (2) overlapping patch embedding, and (3) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, the proposed PVT v2 achieves performance comparable to or better than recent works such as the Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT.


Introduction
Recent studies on vision Transformers are converging on backbone networks [8,31,33,34,23,36,10,5] designed for downstream vision tasks, such as image classification, object detection, and instance and semantic segmentation. To date, there have been some promising results. For example, Vision Transformer (ViT) [8] first proves that a pure Transformer can achieve state-of-the-art performance in image classification. Pyramid Vision Transformer (PVT v1) [33] shows that a pure Transformer backbone can also surpass CNN counterparts in dense prediction tasks such as detection and segmentation [22,41]. After that, Swin Transformer [23], CoaT [36], LeViT [10], and Twins [5] further improve the classification, detection, and segmentation performance with Transformer backbones.
This work aims to establish stronger and more feasible baselines built on the PVT v1 framework. We report that three design improvements, namely (1) a linear complexity attention layer, (2) overlapping patch embedding, and (3) a convolutional feed-forward network, are orthogonal to the PVT v1 framework, and when used with PVT v1, they bring better image classification, object detection, and instance and semantic segmentation performance. The improved framework is termed PVT v2. Specifically, PVT v2-B5 yields 83.8% top-1 accuracy on ImageNet, which is better than Swin-B [23] and Twins-SVT-L [5], while our model has fewer parameters and GFLOPs. Moreover, GFL [19] with PVT v2-B2 achieves 50.2 AP on COCO val2017, 2.6 AP higher than with Swin-T [23] and 5.7 AP higher than with ResNet50 [13]. We hope these improved baselines will provide a reference for future research in vision Transformers.

Related Work
We mainly discuss Transformer backbones related to this work. ViT [8] treats each image as a sequence of tokens (patches) with a fixed length, and then feeds them to multiple Transformer layers to perform classification. It is the first work to prove that a pure Transformer can achieve state-of-the-art performance in image classification when training data is sufficient (e.g., ImageNet-22k [7], JFT-300M). DeiT [31] further explores a data-efficient training strategy and a distillation approach for ViT.
To improve image classification performance, recent methods make tailored changes to ViT. T2T ViT [37] concatenates tokens within an overlapping sliding window into one token progressively. TNT [11] utilizes inner and outer Transformer blocks to generate pixel and patch embeddings respectively. CPVT [6] replaces the fixed size position embedding in ViT with conditional position encodings, making it easier to process images of arbitrary resolution. CrossViT [2] processes image patches of different sizes via a dual-branch Transformer. LocalViT [20] incorporates depth-wise convolution into vision Transformers to improve the local continuity of features.
To adapt to dense prediction tasks such as object detection and instance and semantic segmentation, some methods [33,23,34,36,10,5] introduce the pyramid structure of CNNs into the design of Transformer backbones. PVT v1 is the first pyramid-structure Transformer; it presents a hierarchical Transformer with four stages, showing that a pure Transformer backbone can be as versatile as CNN counterparts and perform better in detection and segmentation tasks. After that, improvements [23,34,36,10,5] are made to enhance the local continuity of features and to remove the fixed-size position embedding. For example, Swin Transformer [23] replaces the fixed-size position embedding with relative position biases and restricts self-attention within shifted windows. CvT [34], CoaT [36], and LeViT [10] introduce convolution-like operations into vision Transformers. Twins [5] combines local and global attention mechanisms to obtain stronger feature representations.

Limitations in PVT v1
There are three main limitations in PVT v1 [33]: (1) Similar to ViT [8], when processing high-resolution input (e.g., a shorter side of 800 pixels), the computational complexity of PVT v1 is relatively large. (2) PVT v1 treats an image as a sequence of non-overlapping patches, which loses the local continuity of the image to a certain extent. (3) The position encoding in PVT v1 is fixed-size, which is inflexible when processing images of arbitrary size. These problems limit the performance of PVT v1 on vision tasks.
To address these issues, we propose PVT v2, which improves PVT v1 through three designs, described in the following three sections.

Linear Spatial Reduction Attention
First, to reduce the high computational cost caused by attention operations, we propose a linear spatial reduction attention (linear SRA) layer, as illustrated in Fig. 1. Different from SRA [33], which uses convolutions for spatial reduction, linear SRA uses average pooling to reduce the spatial dimension (i.e., h×w) to a fixed size (i.e., P×P) before the attention operation. So linear SRA enjoys linear computational and memory costs, like a convolutional layer. Specifically, given an input of size h × w × c, the complexities of SRA and linear SRA are:

Ω(SRA) = 2h²w²c / R²,
Ω(Linear SRA) = 2hwP²c,

where R is the spatial reduction ratio of SRA [33] and P is the pooling size of linear SRA, which is set to 7.
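The complexity formulas above can be checked numerically: SRA's cost is quadratic in the number of tokens hw, while linear SRA's is linear. A small sketch (function names are ours, not from the paper):

```python
def sra_cost(h, w, c, R):
    """Attention FLOPs of SRA: h*w queries attend to (h*w)/R^2 reduced keys.
    Quadratic in the token count h*w."""
    return 2 * (h * w) * ((h * w) // (R * R)) * c

def linear_sra_cost(h, w, c, P):
    """Attention FLOPs of linear SRA: h*w queries attend to a pooled P x P key set.
    Linear in the token count h*w."""
    return 2 * (h * w) * (P * P) * c

# Doubling each spatial side quadruples the token count:
# SRA's cost grows 16x, linear SRA's cost only 4x.
base_sra = sra_cost(56, 56, 64, R=8)
base_lin = linear_sra_cost(56, 56, 64, P=7)
big_sra = sra_cost(112, 112, 64, R=8)
big_lin = linear_sra_cost(112, 112, 64, P=7)
```

This is what Fig. 4 later shows empirically: the GFLOPs of the linear variant grow at roughly the same rate as a CNN's as the input scale increases.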

Overlapping Patch Embedding
Second, to model local continuity information, we utilize overlapping patch embedding to tokenize images. As shown in Fig. 2(a), we enlarge the patch window, making adjacent windows overlap by half of the area, and pad the feature map with zeros to keep the resolution. In this work, we use convolution with zero padding to implement overlapping patch embedding. Specifically, given an input of size h × w × c, we feed it to a convolution with a stride of S, a kernel size of 2S − 1, a padding size of S − 1, and a kernel number of c′. The output size is (h/S) × (w/S) × c′.
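The kernel/stride/padding choice above reduces the resolution by exactly S while adjacent windows overlap. This follows from the standard convolution output-size formula, sketched below (function names are ours):

```python
def conv_out(n, kernel, stride, pad):
    # Standard convolution output-size formula for one spatial dimension.
    return (n + 2 * pad - kernel) // stride + 1

def overlap_embed_out(n, S):
    # Overlapping patch embedding: kernel 2S-1, stride S, zero padding S-1.
    # Simplifies to (n - 1) // S + 1, i.e., ceil(n / S).
    return conv_out(n, kernel=2 * S - 1, stride=S, pad=S - 1)
```

For any side length divisible by S this yields exactly n/S, matching the stated output size (h/S) × (w/S) × c′.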

Convolutional Feed-Forward
Third, inspired by [17,6,20], we remove the fixed-size position encoding [8], and introduce zero padding position encoding into PVT. As shown in Fig. 2(b), we add a 3 × 3 depth-wise convolution [16] with the padding size of 1 between the first fully-connected (FC) layer and GELU [15] in feed-forward networks.
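A minimal NumPy sketch of this convolutional feed-forward block (the helper names and weight shapes are ours; a real implementation would use learned parameters and a deep-learning framework):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU [15]
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def depthwise3x3(x, w):
    """Depth-wise 3x3 convolution with zero padding 1. x: (h, w, c), w: (3, 3, c)."""
    h, wd, c = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))  # zero padding doubles as position encoding
    out = np.empty_like(x)
    for i in range(h):
        for j in range(wd):
            # Each channel is convolved with its own 3x3 kernel.
            out[i, j] = np.einsum('klc,klc->c', xp[i:i + 3, j:j + 3], w)
    return out

def conv_ffn(tokens, h, w, W1, Wdw, W2):
    """FC -> depth-wise 3x3 conv -> GELU -> FC, as in Fig. 2(b).
    tokens: (h*w, c_in); W1: (c_in, c_hid); Wdw: (3, 3, c_hid); W2: (c_hid, c_in)."""
    x = (tokens @ W1).reshape(h, w, -1)  # first FC, then restore the 2D layout
    x = depthwise3x3(x, Wdw)             # inject local continuity between FC and GELU
    x = gelu(x).reshape(h * w, -1)
    return x @ W2                        # second FC back to the input width
```

Because the depth-wise convolution operates on the reshaped h × w feature map, the block handles variable-resolution inputs without any fixed-size position table.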

Details of PVT v2 Series
We scale up PVT v2 from B0 to B5 by changing the following hyper-parameters:
• S_i: the stride of the overlapping patch embedding in Stage i;
• C_i: the channel number of the output of Stage i;
• L_i: the number of encoder layers in Stage i;
• R_i: the reduction ratio of the SRA in Stage i;
• P_i: the adaptive average pooling size of the linear SRA in Stage i;
• N_i: the head number of the Efficient Self-Attention in Stage i;
• E_i: the expansion ratio of the feed-forward layer [32] in Stage i.
Tab. 1 shows the detailed information of the PVT v2 series. Our design follows the principles of ResNet [14]: (1) the channel dimension increases while the spatial resolution shrinks as the layers go deeper; (2) most of the computation cost is assigned to Stage 3.
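As an illustration, the hyper-parameters of one variant can be written as a plain per-stage configuration. The dictionary layout and the sanity checks are ours; the stage values below are our reading of the B2 column of Tab. 1 and should be verified against the table:

```python
# Per-stage hyper-parameters (i = 1..4) for PVT v2-B2, following Tab. 1.
pvt_v2_b2 = {
    "S": [4, 2, 2, 2],         # overlapping patch-embedding strides
    "C": [64, 128, 320, 512],  # output channel numbers
    "L": [3, 4, 6, 3],         # encoder layers per stage
    "R": [8, 4, 2, 1],         # SRA reduction ratios
    "P": [7, 7, 7, 7],         # linear-SRA pooling size (fixed to 7)
    "N": [1, 2, 5, 8],         # attention heads
    "E": [8, 8, 4, 4],         # feed-forward expansion ratios
}

# Sanity checks against the two ResNet-style design principles:
channels_increase = all(a < b for a, b in zip(pvt_v2_b2["C"], pvt_v2_b2["C"][1:]))
stage3_deepest = max(pvt_v2_b2["L"]) == pvt_v2_b2["L"][2]
```

The other variants (B0 to B5) keep the same S, R, P, and N values and scale mainly C_i, L_i, and E_i.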

Advantages of PVT v2
Combining these improvements, PVT v2 can (1) obtain more local continuity of images and feature maps; (2) pro-cess variable-resolution input more flexibly; (3) enjoy the same linear complexity as CNN.

Image Classification
Settings. Image classification experiments are performed on the ImageNet-1K dataset [27], which comprises 1.28 million training images and 50K validation images from 1,000 categories. For a fair comparison, all models are trained on the training set, and we report the top-1 accuracy on the validation set. We follow DeiT [31] and apply random cropping, random horizontal flipping [29], label-smoothing regularization [30], mixup [38], and random erasing [40] as data augmentations. During training, we employ AdamW [25] with a momentum of 0.9, a mini-batch size of 128, and a weight decay of 5 × 10⁻² to optimize models. The initial learning rate is set to 1 × 10⁻³ and decreases following the cosine schedule [24]. All models are trained for 300 epochs from scratch on 8 V100 GPUs. For benchmarking, we apply a center crop on the validation set, where a 224 × 224 patch is cropped to evaluate the classification accuracy.
Results. In Tab. 2, we see that PVT v2 is the state-of-the-art method on ImageNet-1K classification. Compared to PVT v1, PVT v2 has similar FLOPs and parameters, but the image classification accuracy is greatly improved. For example, PVT v2-B1 is 3.6% higher than PVT v1-Tiny, and PVT v2-B4 is 1.9% higher than PVT v1-Large.
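The cosine schedule used in the classification training recipe can be sketched as follows (a minimal version without warmup; function and variable names are ours):

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine-decayed learning rate [24]: starts at base_lr, ends at min_lr."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

With base_lr = 1 × 10⁻³ and 300 epochs, the rate starts at 1 × 10⁻³, passes 5 × 10⁻⁴ at the halfway point, and decays smoothly to zero at the end of training.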
Compared to other recent counterparts, the PVT v2 series also has large advantages in terms of accuracy and model size. For example, PVT v2-B5 achieves 83.8% ImageNet top-1 accuracy, which is 0.5% higher than Swin-B [23].
Table 2: Image classification performance on the ImageNet validation set. "#Param" refers to the number of parameters. "GFLOPs" is calculated under the input scale of 224 × 224. "*" indicates the performance of the method trained under the strategy of its original paper. "-Li" denotes PVT v2 with linear SRA.

Object Detection
Settings. Object detection experiments are conducted on the challenging COCO benchmark [22]. All models are trained on COCO train2017 (118k images) and evaluated on val2017 (5k images). We verify the effectiveness of PVT v2 backbones on top of mainstream detectors, including RetinaNet [21], Mask R-CNN [12], Cascade Mask R-CNN [1], ATSS [39], GFL [19], and Sparse R-CNN [28]. Before training, we use the weights pre-trained on ImageNet to initialize the backbone and Xavier [9] to initialize the newly added layers. We train all models with a batch size of 16 on 8 V100 GPUs, and adopt AdamW [25] with an initial learning rate of 1 × 10⁻⁴ as the optimizer. Following common practices [21,12,3], we adopt a 1× or 3× training schedule (i.e., 12 or 36 epochs) to train all detection models. The training image is resized to have a shorter side of 800 pixels, while the longer side does not exceed 1,333 pixels. When using the 3× training schedule, we randomly resize the shorter side of the input image within the range of [640, 800]. In the testing phase, the shorter side of the input image is fixed to 800 pixels.
Results. As reported in Tab. 3, PVT v2 significantly outperforms PVT v1 on both one-stage and two-stage object detectors with similar model sizes. For example, PVT v2-B4 achieves 46.1 AP on top of RetinaNet [21] and 47.5 AP^b on top of Mask R-CNN [12], surpassing the models with PVT v1 by 3.5 AP and 4.6 AP^b, respectively. We present some qualitative object detection and instance segmentation results on COCO val2017 [22] in Fig. 3, which also show the good performance of our models. For a fair comparison between PVT v2 and Swin Transformer [23], we keep all settings the same, including ImageNet-1K pre-training and COCO fine-tuning strategies. We evaluate Swin Transformer and PVT v2 on four state-of-the-art detectors, including Cascade R-CNN [1], ATSS [39], GFL [19], and Sparse R-CNN [28]. PVT v2 obtains much better AP than Swin Transformer on all the detectors, showing its better feature representation ability. For example, on ATSS, PVT v2 has similar parameters and FLOPs to Swin-T, but achieves 49.9 AP, which is 2.7 higher than Swin-T. Our PVT v2-Li largely reduces the computation from 258 to 194 GFLOPs, while only sacrificing a little performance.

Semantic Segmentation
Settings. Following PVT v1 [33], we choose ADE20K [41] to benchmark semantic segmentation performance. For a fair comparison, we test PVT v2 backbones by applying them to Semantic FPN [18]. In the training phase, the backbone is initialized with the weights pre-trained on ImageNet [7], and the newly added layers are initialized with Xavier [9]. We optimize our models using AdamW [25] with an initial learning rate of 1 × 10⁻⁴. Following common practices [18,4], we train our models for 40k iterations with a batch size of 16 on 4 V100 GPUs. The learning rate is decayed following the polynomial decay schedule with a power of 0.9. We randomly resize and crop the image to 512 × 512 for training, and rescale it to have a shorter side of 512 pixels during testing.
Results. As shown in Tab. 5, when using Semantic FPN [18] for semantic segmentation, PVT v2 consistently outperforms PVT v1 [33] and other counterparts. For example, with almost the same number of parameters and GFLOPs, PVT v2 achieves notably higher mIoU than PVT v1.
Table 4: Comparison with Swin Transformer on object detection. "AP^b" denotes bounding box AP. "#P" refers to the number of parameters. "GFLOPs" is calculated under the input scale of 1280 × 800. "-Li" denotes PVT v2 with linear SRA.

Computation Overhead Analysis
As shown in Figure 4, with increasing input scale, the GFLOPs growth rate of the proposed PVT v2-B2-Li is much lower than that of PVT v1-Small [33], and is similar to that of ResNet-50 [13]. This result proves that our PVT v2-Li successfully addresses the high computational overhead problem caused by the attention layer.

Conclusion
We study the limitations of Pyramid Vision Transformer (PVT v1) and improve it with three designs, which are overlapping patch embedding, convolutional feed-forward network, and linear spatial reduction attention layer. Extensive experiments on different tasks, such as image classification, object detection, and semantic segmentation demonstrate that the proposed PVT v2 is stronger than its predecessor PVT v1 and other state-of-the-art transformer-based backbones, under comparable numbers of parameters. We hope these improved baselines will provide a reference for future research in vision Transformer.