Learning convolutional multi-level transformers for image-based person re-identification

As a vital vision task, person re-identification (Re-ID) aims to retrieve the same person under non-overlapping cameras. It is a very challenging task due to the presence of complex backgrounds, diverse illuminations and different perspectives. In this work, we integrate the advantages of convolutional neural networks (CNNs) and transformers, and propose a novel learning framework named convolutional multi-level transformer (CMT) for image-based person Re-ID. More specifically, we first propose a scale-aware feature enhancement (SFE) module to extract multi-scale local features from a pre-trained CNN backbone. Then, we introduce a part-aware transformer encoder (PTE) to further mine discriminative local information guided by global semantics. Finally, a deeply-supervised learning (DSL) technique is adopted to optimize the proposed CMT and improve its training efficiency. Extensive experiments on four large-scale Re-ID benchmarks demonstrate that our method performs favorably against several state-of-the-art methods.


Introduction
Person re-identification (Re-ID) aims to retrieve specific persons in a scene based on the content of images or videos taken at different times and places. It has drawn much attention due to its diverse real-world applications, such as safe communities, intelligent surveillance and criminal investigations [1][2][3]. Although great success has been achieved, there are still many challenges in person Re-ID, such as object occlusion, illumination change, pose distortion and background clutter.
In the past two decades, great progress has been made in the typical image-based Re-ID task [4]. The accomplishment of this task largely depends on robust representations of person images. In fact, early person Re-ID methods [5][6][7] primarily focus on handcrafted feature extraction and similarity metric design. With the development of deep learning technologies, many works focus on the end-to-end learning of more discriminative features by designing complex deep convolutional neural networks (CNNs). In addition, local information is also discriminative and helpful in retrieving the target person. As illustrated in the upper row of Fig. 1, part-level feature extraction methods such as the part-based convolutional baseline (PCB) horizontally divide the features extracted by the CNN backbone into multiple parts, and PCB has been shown to achieve significant performance improvements [8]. However, convolutional layers usually model the relationship between pixels in a small neighborhood and cannot realize the global modeling of person images. Thus, most CNN-based methods [9][10][11] are ineffective when facing challenges such as varied posture, occlusion, and background clutter.
Recently, transformers [12] have achieved excellent performance in natural language processing and computer vision. The key reason is that transformers are global operations based on self-attention and can model the relationship between all input elements. As a result, several attempts have been made to accomplish person Re-ID using transformers. For example, Zhu et al. [13] introduced an auto-aligned structure and enhanced the ability of transformers to extract more discriminative features. He et al. [14] proposed a pure transformer architecture to integrate camera and viewpoint information and achieved excellent performance in object re-identification. Although effective, these transformer-based methods require a large number of transformer blocks, resulting in high model complexity. In addition, these works seldom take into account the local information of persons, which is crucial for person Re-ID. Therefore, there is still much room for improvement in current transformer-based methods.
In this work, we take advantage of CNNs and transformers, and propose a novel learning framework named convolutional multi-level transformer (CMT) for image-based person Re-ID. More specifically, we first utilize a scale-aware feature enhancement (SFE) module to extract multi-scale local features from deep CNN backbones. As a result, they can capture multi-granularity representations of various appearances in person images. Then, we introduce a part-aware transformer encoder (PTE) to further extract local discriminative information guided by global semantics. As shown in the bottom row of Fig. 1, we incorporate the idea of feature partitioning into the transformer and design a recursive transformer structure. This structure can generate hierarchical features for diverse local parts, resulting in great performance improvements. Finally, we adopt a deeply-supervised learning (DSL) technique to optimize the proposed CMT and improve its training efficiency. Extensive experiments on four large-scale Re-ID benchmarks demonstrate that our method performs favorably against most state-of-the-art methods.
The main contributions are summarized as follows:
In recent years, image-based person Re-ID has achieved great improvements in performance. Generally, existing person Re-ID methods mainly focus on extracting discriminative global features from entire images. However, focusing merely on the global information of persons has some limitations, such as ignoring the effectiveness of local cues. Fine-grained local part features, such as a T-shirt or a black backpack, can be very useful for identifying persons in complex scenes. As a typical practice, many researchers resort to part features for pedestrian image description. In particular, Sun et al. [8] proposed a method to divide spatial features into horizontal strips to improve the Re-ID performance.

Attention-based person re-identification
Visual attention mechanisms aim to highlight relevant information and suppress irrelevant information. Inspired by the advantages of attention mechanisms, researchers have proposed various attention-based methods to extract distinguishable features for person Re-ID. For example, Chen et al. [18] proposed a mixed high-order attention to capture the subtle differences among pedestrians. Rao et al. [19] presented a counterfactual attention to capture

Transformer-based person re-identification
In fact, transformers [12] were initially proposed for processing sequential data. With their global modeling ability, transformers have recently been introduced to many computer vision tasks, including person Re-ID. For image-based person Re-ID, He et al. [14] first utilized a pure transformer-based structure [23] to learn discriminative features. Zhu et al. [13] added learnable vectors of part tokens to learn part features and integrated part alignments into the self-attention. Lai et al. [24] utilized transformers to achieve adaptive part divisions. Li et al. [25] introduced a diverse part discovery with part-aware transformers for occluded person Re-ID. Liao and Shao [26] built a transformer-based deep image matching for generalizable person Re-ID. Wang et al. [27] proposed a self-guided transformer framework to explore the relations of body parts for feature alignment. Chen et al. [28] proposed an omni-relational high-order transformer for person Re-ID. Ma et al. [29] proposed a pose-guided transformer to mine the inter-part and intra-part relations for occluded person Re-ID. Liu et al. [30] designed a trigeminal transformer to simultaneously encode the spatial, temporal and spatial-temporal features in complex videos. These transformer-based methods have achieved superior performance. However, they generally lack desirable local properties. Different from them, we introduce a hybrid structure combining CNNs and transformers for more effective person Re-ID.

Proposed method
As illustrated in Fig. 2, the proposed framework mainly includes three key modules: a multi-level feature extractor (MFE), the SFE module and a PTE. More specifically, the MFE utilizes a pre-trained CNN backbone (e.g., ResNet-50 [31]) to extract multi-level features of person images. Afterwards, the SFE module adopts multi-scale dilated convolutions [32] with residual connections to capture multi-granularity feature representations. Furthermore, with a hierarchical structure, PTE further mines local discriminative information guided by global semantics. Finally, the DSL technique is utilized to optimize the whole framework. We will elaborate on these key components in the following subsections.

Multi-level feature extractor
As illustrated in the left part of Fig. 2, we utilize the ResNet-50 [31] pre-trained on ImageNet to extract multi-level features. Similar to previous works [8,10,33], we remove the fully-connected layers after the global average pooling (GAP) layer, and change the stride of the fifth stage to 1, resulting in a 1/16 feature resolution of input images. In addition, we take the outputs of stages 3, 4 and 5, and introduce an additional convolutional layer to generate size-fixed multi-level features.
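As a rough check of the feature resolutions described above, the following sketch computes the spatial size of each stage output for a 256 × 128 input (the training size reported in the implementation details). The cumulative strides 8, 16 and 32 are those of a standard ResNet-50, which is an assumption about the backbone configuration, and the name `stage_sizes` is hypothetical.

```python
# Cumulative downsampling factors of a standard ResNet-50 (assumed here).
DEFAULT_STRIDES = {3: 8, 4: 16, 5: 32}


def stage_sizes(height, width, last_stride_one=True):
    """Return {stage: (H, W)} for the outputs of stages 3, 4 and 5."""
    strides = dict(DEFAULT_STRIDES)
    if last_stride_one:
        # With the stride of the fifth stage set to 1, stage 5 keeps
        # the 1/16 resolution of stage 4.
        strides[5] = strides[4]
    return {stage: (height // f, width // f) for stage, f in strides.items()}


sizes = stage_sizes(256, 128)
print(sizes)  # {3: (32, 16), 4: (16, 8), 5: (16, 8)}
```

Under these assumptions, stages 4 and 5 share a 16 × 8 map while stage 3 is twice as large in each dimension, which is one reason an extra convolutional layer may be needed to bring the three levels to a fixed size.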

Scale-aware feature enhancement
Due to the variations of persons in scenes, multi-scale information [33] is effective for robust appearance representations. Thus, we propose the SFE module to extract multi-scale features at three stages of the backbone network. The structure of our proposed SFE module is illustrated in Fig. 3. Given an input $X_i$ ($i = 3, 4, 5$), we first reduce the number of channels to a quarter of that of $X_i$ by a convolutional layer and obtain $\tilde{X}_i$. Then, we utilize four dilated convolutional layers to generate multi-scale features and gradually extend the receptive fields [32].
Then, they are concatenated along the channel dimension and aggregated by another convolutional layer. Meanwhile, a residual connection is utilized to obtain the final output of SFE, where [;] means the concatenation along the channel dimension. In fact, due to the utilization of different kernel sizes and dilation sizes, our SFE module is able to capture multi-scale local cues for scale-aware feature enhancement.
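To make the multi-scale effect concrete, the sketch below tabulates the region covered by each SFE branch. It uses the branch configuration given in the implementation details (one 1 × 1 convolution and three 3 × 3 convolutions with dilations 1, 2 and 3) and assumes the branches run in parallel on the reduced feature, which the text suggests but does not state outright. The input width of 1024 channels and the function names are hypothetical.

```python
def effective_kernel(kernel, dilation):
    """Side length of the region covered by one dilated convolution."""
    return kernel + (kernel - 1) * (dilation - 1)


def sfe_branches(in_channels, reduction=4):
    """Describe the branches; return (branch list, concatenated channels)."""
    mid = in_channels // reduction  # channels after the reducing convolution
    specs = [(1, 1), (3, 1), (3, 2), (3, 3)]  # (kernel size, dilation)
    branches = [
        {"kernel": k, "dilation": d,
         "coverage": effective_kernel(k, d), "channels": mid}
        for k, d in specs
    ]
    return branches, mid * len(branches)


branches, concat_channels = sfe_branches(1024)
for b in branches:
    print(b)
print("channels after concatenation:", concat_channels)
```

With a reduction factor of four and four branches, the concatenated feature has the same number of channels as the input, so the residual connection can be a plain element-wise sum.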

Part-aware transformer-based encoder
In addition to SFE, we employ PTE to further extract part-aware fine-grained representations with transformers. As illustrated in Fig. 4, our PTE is designed with a recursive and hierarchical structure, which progressively generates diverse part features with global semantic guidance. Formally, the PTE takes $Y_i$ as input and introduces hierarchical divisions for diverse part features. It should be noted that all the transformers at the same stage share weights to reduce computation. The structure of the transformers is identical to that in [23]. At the $2^k$-part learning stage ($k = 1, 2, \ldots$), we first use a $1 \times 1$ convolutional layer to halve the number of channels. Then, we reshape the feature map into a sequence representation. Here, $H$ and $W$ denote the height and width of the input image, respectively, and $C$ represents the number of channels. The class token $F^{cls}_{2^{k-1}} \in \mathbb{R}^{1 \times C}$ from the $2^{k-1}$-part learning stage is concatenated into the sequence to guide the fine-grained features. In addition, a new class token $F^{cls}_{2^k} \in \mathbb{R}^{1 \times C}$ is also concatenated into the sequence to summarize contextual information. Finally, the position embedding $F^{pos}_{2^k} \in \mathbb{R}^{(HW+2) \times C}$ is added to the sequence. For the $2^k$-part learning stage, the input embedding for the $j$-th part transformer is built from these components, where $j \in \{1, 2, \ldots, 2^k\}$, and $n$ is equal to $j/2$ when $j$ is even; otherwise $n$ is equal to $(j + 1)/2$. $\phi$ is a linear projection to align the channel numbers of features. The above input goes through several transformer layers, each of which includes a multi-head self attention (MHSA) module and a feed forward network (FFN). After building the hierarchical structure, we generate the part features.
From the above formulation and Fig. 4, one can see that our proposed PTE uses transformers to generate hierarchical local features with the guidance of global semantics. This recursive and hierarchical design can not only generate multi-scale and multi-granularity features but also provide global guidance for more discriminative features, enhancing the extraction of local features. In addition, we apply transformers to extract local features and stack fewer transformer blocks, which can significantly reduce the model complexity.
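The index rule for $n$ above pairs every part with one part of the previous, coarser stage. The sketch below enumerates that pairing for the first stages. Reading $n$ as the index of the guiding class token, and splitting the feature map into equal horizontal strips, are both assumptions made for illustration; all function names are hypothetical.

```python
def parent_index(j):
    """Index n of the coarser-stage part paired with part j (1-based)."""
    return j // 2 if j % 2 == 0 else (j + 1) // 2


def part_hierarchy(num_stages):
    """Map each stage k to its list of (part j, guiding part n) pairs."""
    return {
        k: [(j, parent_index(j)) for j in range(1, 2 ** k + 1)]
        for k in range(1, num_stages + 1)
    }


def strip_rows(height, k):
    """Row ranges of 2**k equal horizontal strips (assumed partition)."""
    step = height // (2 ** k)
    return [(i * step, (i + 1) * step) for i in range(2 ** k)]


print(part_hierarchy(2))
# {1: [(1, 1), (2, 1)], 2: [(1, 1), (2, 1), (3, 2), (4, 2)]}
print(strip_rows(16, 2))
# [(0, 4), (4, 8), (8, 12), (12, 16)]
```

At the 2-part stage both parts are paired with the single global token, and at the 4-part stage parts 1 and 2 are paired with the upper half while parts 3 and 4 are paired with the lower half, which matches the recursive halving described in the text.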

Deeply-supervised learning
As illustrated in Fig. 2, we utilize both the feature $F_{CNN}$ generated from the CNN branch and the features $F_{Trans}$ from the transformer branch for inference. To train the whole framework, we adopt the DSL technique [33,34], which eases the optimization of the network. At each branch, we use the label-smoothed cross-entropy loss [35] and the batch-hard triplet loss [36]. The label-smoothed cross-entropy loss is computed from $p_i$, the predicted logit of identity $i$, and $q_i$, the ground-truth label. The batch-hard triplet loss is computed from $d_{pos}$ and $d_{neg}$, which are defined as the distances of positive sample pairs and negative sample pairs, respectively, with $[x]_+ = \max(0, x)$ and distance margin $m$.
Finally, the overall loss combines the loss terms of all branches, where $K$ is the number of stages and $\lambda$ is the balancing coefficient for the multiple loss terms.
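The two per-branch losses can be sketched in plain Python as follows. This follows the common formulations of label smoothing and batch-hard mining rather than the authors' released code: the smoothing factor `eps = 0.1`, the margin `0.3`, the Euclidean distance and the averaging over anchors are all assumed values and choices, and the function names are hypothetical. The exact way the stage losses are weighted by $\lambda$ is not reproduced here.

```python
import math


def label_smoothed_ce(logits, target, eps=0.1):
    """Cross-entropy against a smoothed one-hot target (eps is assumed)."""
    n = len(logits)
    # Numerically stable log-softmax.
    mx = max(logits)
    log_z = mx + math.log(sum(math.exp(v - mx) for v in logits))
    log_p = [v - log_z for v in logits]
    # Smoothed target: eps / n on every class, plus 1 - eps on the true class.
    q = [eps / n + (1.0 - eps if i == target else 0.0) for i in range(n)]
    return -sum(qi * lp for qi, lp in zip(q, log_p))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet(features, labels, margin=0.3):
    """Mean over anchors of [d_pos - d_neg + margin]_+ with hardest pairs."""
    losses = []
    for i, (fa, la) in enumerate(zip(features, labels)):
        pos = [euclidean(fa, fb)
               for j, (fb, lb) in enumerate(zip(features, labels))
               if lb == la and j != i]
        neg = [euclidean(fa, fb)
               for fb, lb in zip(features, labels) if lb != la]
        if not pos or not neg:
            continue  # anchor has no valid positive or negative in the batch
        # Hardest positive is the farthest; hardest negative is the closest.
        losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0


print(label_smoothed_ce([2.0, 0.5, -1.0], target=0))
print(batch_hard_triplet([[0.0], [2.0], [1.0]], [0, 0, 1], margin=0.5))
```

In the second call, both anchors of identity 0 have a hardest positive at distance 2 and a hardest negative at distance 1, so each contributes 2 - 1 + 0.5 = 1.5, while the lone sample of identity 1 is skipped.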

Datasets and evaluation metrics
We conducted extensive experiments on four widely-used person Re-ID datasets, i.e., Market1501 [37], DukeMTMC-ReID [38], CUHK03-NP [39] and MSMT17 [40]. Market1501 was collected from six cameras and has 1501 pedestrians (751 for training and 750 for testing). DukeMTMC-ReID was collected from eight cameras with 1404 pedestrians (702 for training and 702 for testing). The CUHK03-NP dataset consists of 1467 pedestrians, which are divided into two sub-datasets: one with manual labeling and the other with bounding boxes labeled by a person detector. MSMT17 is a large-scale dataset collected from 15 cameras with 4101 pedestrians (1041 for training and 3060 for testing). Table 1 provides more detailed statistics of the four datasets. Following previous works [4,33], we compute the mean average precision (mAP) and cumulative matching characteristics (CMC) at rank-1 for performance evaluation.
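For readers unfamiliar with the two metrics, the sketch below computes mAP and rank-1 accuracy from ranked match lists. It is a simplified illustration: each query is represented by a list of booleans over its ranked gallery (True where the identity matches), and the usual Re-ID filtering of gallery images from the query's own camera is assumed to have been done beforehand. The function names are hypothetical.

```python
def average_precision(matches):
    """AP of one query given its ranked gallery match flags."""
    hits = 0
    precisions = []
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall point
    return sum(precisions) / len(precisions) if precisions else 0.0


def evaluate(ranked_matches):
    """Return (mAP, rank-1) over all queries that have at least one match."""
    valid = [m for m in ranked_matches if any(m)]
    if not valid:
        return 0.0, 0.0
    mean_ap = sum(average_precision(m) for m in valid) / len(valid)
    rank1 = sum(1.0 for m in valid if m[0]) / len(valid)
    return mean_ap, rank1


queries = [[True, False, True], [False, True, False]]
mean_ap, rank1 = evaluate(queries)
print(f"mAP={mean_ap:.3f}, rank-1={rank1:.3f}")  # mAP=0.667, rank-1=0.500
```

Rank-1 only checks the top result, whereas mAP rewards placing every true match early, which is why the two numbers can diverge noticeably on large galleries such as MSMT17.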

Implementation details
In this work, all the experiments are performed with the PyTorch toolbox and one GeForce RTX 3090 GPU. We utilize the ResNet-50 pre-trained on ImageNet as our backbone. In addition, we balance the accuracy and complexity, and ultimately choose to extract four parts through the PTE. To extract the multi-scale features by the SFE module, a 1 × 1 convolutional layer and three 3 × 3 dilated convolutional layers are used to gradually extend the receptive fields. Dilation sizes d are 1, 2 and 3, respectively.
During training, all images of pedestrians are resized to 256 × 128 and augmented by random cropping, horizontal flipping and random erasing [41]. In one mini-batch, 16 identities are randomly sampled and each identity has 4 images. The Adam optimizer [42] is deployed with an initial learning rate of $3.5 \times 10^{-4}$, which is multiplied by 0.4 every 20 epochs until 180 epochs. The source code is released at https://github.com/AI-Zhpp/CMT.
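The step schedule and batch composition described above can be written out as a small sketch. Epochs are counted from zero and the decay is applied at epochs 20, 40, and so on, which is an assumption about the exact boundary; the function and constant names are hypothetical.

```python
def learning_rate(epoch, base_lr=3.5e-4, gamma=0.4, step=20):
    """Step decay: multiply the base rate by gamma once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


IDS_PER_BATCH = 16   # identities sampled per mini-batch
IMAGES_PER_ID = 4    # images drawn for each identity
BATCH_SIZE = IDS_PER_BATCH * IMAGES_PER_ID  # 64 images per mini-batch

for epoch in (0, 19, 20, 40, 179):
    print(epoch, f"{learning_rate(epoch):.3e}")
```

Over 180 epochs the rate is decayed eight times, so training ends at roughly 0.4^8 ≈ 6.6 × 10^-4 of the initial value.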

Comparison with state-of-the-art methods
In this subsection, we compare our method with other state-of-the-art methods. The comparison results on four public Re-ID benchmarks are presented in Table 2. The detailed analysis is as follows.
Market1501. As for CNN-based methods, PCB [8] and MGN [10] mine diverse part features by horizontal strip features and reach 81.6% mAP and 86.9% mAP on Market1501, respectively, which validates the reasonableness of part learning in Re-ID. In our method, we adopt a hierarchical transformer-based structure to progressively extract multi-granularity part representations. Thus, our method achieves the best mAP and outperforms PCB and MGN by 8.3% and 3.0%, respectively. Even in comparison with transformer-based methods, such as AAformer [13], TransReID [14], APD [24] and HAT [33], our method still delivers a better performance.
DukeMTMC-ReID. On this dataset, our method shows superior performance. The mAP and rank-1 accuracy are
CUHK03-NP. On the two sub-datasets of CUHK03-NP, our method consistently achieves competitive results. Meanwhile, APNet [20] utilizes a pyramid attention to explore the discriminative regions of person images, and achieves 81.1% mAP and 78.1% mAP on the labeled and detected sub-datasets of CUHK03-NP, respectively. Different from APNet, our method extracts fine-grained partial features by multi-stage transformers. Compared with APNet, our method improves the mAP on the detected CUHK03-NP by 0.3%.
MSMT17. On this dataset, our framework also attains comparable performance in terms of mAP and rank-1.
In fact, TransReID achieves the best mAP and rank-1 on MSMT17. However, TransReID uses ViT [23] as the backbone to capture long-range dependencies, which incurs a high computational cost and severely slows down inference. In contrast, our method uses ResNet-50 to extract local representations and combines part-aware transformers for fine-grained cues. Thus, our method attains a significant improvement in efficiency over TransReID. In addition, TransReID utilizes camera information for performance boosting, while our method does not utilize camera information but unifies the strengths of CNNs and transformers, which leads to the second-best performance on MSMT17.

Model complexities
To further clarify the computational advantages, we compare the model complexity of some typical methods in Table 3. We use the number of floating point operations (FLOPs) to measure our model's computational complexity. As can be seen in Table 3, our proposed model shows great advantages over other transformer-based methods in terms of FLOPs. We also note that our proposed model has more parameters. This could be mitigated by light-weight designs. CNN-based methods generally have fewer parameters and FLOPs. However, their performance is usually worse than that of transformer-based methods. Overall, our proposed model achieves a good balance between the Re-ID performance and the model complexity.
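As a reminder of how the two complexity measures differ, the sketch below counts the parameters and multiply-accumulate operations of a single convolutional layer. The layer sizes in the example (256 input and output channels on a 16 × 8 map) are hypothetical and not taken from Table 3, and whether one multiply-accumulate is reported as one or two FLOPs depends on the counting convention of the profiling tool.

```python
def conv2d_cost(c_in, c_out, kernel, out_h, out_w, bias=True):
    """Return (parameters, multiply-accumulates) of one 2D convolution."""
    weights = c_in * c_out * kernel * kernel
    params = weights + (c_out if bias else 0)
    # Each output position of each output channel needs c_in * k * k MACs.
    macs = weights * out_h * out_w
    return params, macs


params, macs = conv2d_cost(256, 256, 3, 16, 8)
print(params, macs)  # 590080 75497472
```

Parameters depend only on the kernel shape, whereas the operation count also scales with the spatial size of the output, so a model can have more parameters than another while requiring fewer operations.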

Ablation studies
To verify the effectiveness of our proposed modules, we conduct ablation experiments on the MSMT17 dataset.

Effectiveness of key modules
The ablation results of our key modules are reported in Table 4. For the baseline method, we fine-tune ResNet-50 on MSMT17 and adopt GAP to obtain a feature vector for testing, which achieves 49.8% mAP and 74.0% rank-1 accuracy. Then, we add our PTE to the baseline to further extract diverse part features at three stages. In PTE, the global feature is recursively passed into part-aware transformers and used to refine part features. Thus, our PTE brings significant improvements over the baseline (i.e., 12.8% mAP and 8.4% rank-1 accuracy). Furthermore, we insert SFE to enhance the local features before PTE. SFE can capture multi-granularity representations of person images. Therefore, it brings a further performance improvement (i.e., 0.9% mAP and 0.9% rank-1 accuracy). Overall, the resulting improvements verify the effectiveness of our SFE module and PTE, which play a critical role in the extraction of multi-scale and discriminative features.

Effects of the PTE module
In PTE, we introduce a hierarchical transformer to split and encode part features. The ablation results are summarized in Table 5. As the recursive and hierarchical structure deepens, the accuracy improves significantly. It can be observed that the mAP and rank-1 accuracy are improved by 5.3% and 2.5%, respectively, when the spatial features are divided into two parts. As the number of parts increases, more diverse fine-grained clues are captured. In our work, four parts are extracted as a trade-off between accuracy and complexity.

Effects of PTE at different levels
In experiments, we deploy PTE at different levels of ResNet-50 to realize multi-level part learning. The experimental results are listed in Table 6. From the results, one can observe that the deployment of PTE at a single level can improve performance. The best performance is achieved when PTE is deployed at three levels of ResNet-50. This fact confirms that multi-level representation learning is helpful for achieving better person Re-ID performance.

Effectiveness of DSL
In this work, we introduce the DSL for better model training. The ablation results obtained by deploying losses at different stages are reported in Table 7. It can be observed that the single deployment of supervision at the 4-part learning stage is not sufficient and more supervision is needed to train the entire framework well. When supervision is deployed at all stages, we can obtain the best performance.

Effects of different transformer layers and attention heads
The number of transformer layers and attention heads may change the structure and performance of our PTE. Thus, we perform ablation experiments to examine the effects of transformer layers and attention heads. As shown in Fig. 5, the performance of our proposed model is significantly reduced without transformer layers. The performance degradation indicates that the features obtained solely from CNNs are not robust enough, and transformers can implicitly learn more discriminative information. In addition, we observe that the best performance is achieved when the number of transformer layers is set to 2. Meanwhile, as the number of transformer layers increases further, the performance fluctuates. This may be because different transformer layers can change the local features. Furthermore, from Fig. 6, it can be observed that the retrieval accuracy keeps improving as the number of attention heads increases. Nevertheless, the performance saturates when the number of attention heads is equal to 16. Based on the aforementioned facts, we set

Effects of the balance coefficient λ
In our work, we utilize λ to balance different loss terms in Eq. (7). To verify its effect, we conduct experiments by changing the coefficient λ from 0 to 3. As displayed in Fig. 7, the performance improves as λ increases, and the best performance is achieved when λ is set to 1.5.

Retrieval results
We also visualize the retrieval results on the MSMT17 dataset in Fig. 10. It can be observed that the retrieval accuracies are improved when the proposed key modules are gradually added to the baseline method. As illustrated in Fig. 10, the matching accuracies of the baseline method are the worst because the correct samples have extremely similar global appearances to the incorrect samples. However, with the utilization of SFE and PTE, the matching accuracy is significantly improved. Our SFE module and PTE can extract the multi-scale and multi-part features from global appearances. They are useful in improving the ability of our method to distinguish similar samples. The retrieval results further validate the effectiveness of our proposed modules.

t-SNE visualization
As shown in Fig. 11, we visualize the feature distributions of the baseline method and our CMT using t-SNE [59]. We randomly select 18 persons from the MSMT17 dataset, and 50 images of each person. Different colors represent different identities. From Fig. 11(a), it can be observed that the features of the same identity are relatively scattered. There are some misclassified samples. However, with our CMT, features of the same identity are more clustered and features of different identities are relatively separated. In addition, there are fewer misclassified samples compared with the baseline method. The t-SNE visualizations show that our method indeed learns a more discriminative embedding space, which further confirms its ability to achieve robust person Re-ID.

Conclusion
In this paper, we integrate the advantages of CNNs and transformers and propose a novel learning framework named CMT for image-based person Re-ID. First, we propose an SFE module to extract the multi-scale features at different levels of the CNN backbone. Furthermore, we propose a PTE to generate and mine local diverse part features with global guidance. Experimental results on four public Re-ID benchmarks demonstrate that our method performs favorably against most state-of-the-art methods. In the future, we will reduce the computational complexity and improve the efficiency of our part-aware transformers.

Figure 1 The insight of our proposed CMT. Upper row: previous horizontal part divisions in models such as PCB [8]; bottom row: our hierarchical and recursive part encoding with transformers

Figure 2 The architecture of our proposed framework. Given a person image, we first utilize the multi-level feature extractor (MFE) to extract multi-level feature maps. Then, for the CNN branch, we employ a GAP layer to obtain the CNN feature. For three stages of the MFE, we introduce the transformer branch, and utilize the SFE module and PTE to further extract multi-scale and diverse part features, respectively. All these branches are supervised by the triplet loss and cross-entropy loss for model training

Figure 3 The structure of our scale-aware feature enhancement

Figure 4 Illustration of our proposed part-aware transformer-based encoder. H and W denote the height and width of the input image, respectively. Q, K, and V represent query, key and value, respectively. MHSA represents a multi-head self attention and FFN means a feed forward network

Figure 5 Figure 6
Figure 5 Ablation results with different transformer layers

Figure 7 Ablation results with different values of the balance coefficient λ

Visualization analysis

Visualization of feature maps
To verify the effectiveness of the proposed modules, we further visualize the features of person examples. The visualizations are shown in Fig. 8. In each example, from left to right, there are the original image, baseline features, SFE features, and PTE features. It can be observed that increasingly detailed information is captured with the gradual utilization of our key modules. Moreover, the feature maps obtained from the baseline generally focus on salient regions, such as the heads or shoes of persons. With the utilization of the SFE module to extract multi-scale features, our model can capture more meaningful information, such as bags and clothing. With the utilization of the PTE module to extract diverse local features, our model can capture more detailed information, such as torso details. The visualization results demonstrate that our PTE can indeed mine discriminative and diverse local cues guided by global semantics. The visualizations intuitively verify the effectiveness of our proposed SFE module and PTE. Meanwhile, we visualize the different parts in PTE for qualitative comparison in Fig. 9. Comparing the second, third and fourth columns, it can be observed that more local cues can be captured as the number of parts increases. These visualization comparisons further explain the reasonableness of our PTE.

Figure 8 Figure 9
Figure 8 Visualizations of features obtained from Baseline, SFE and PTE

Figure 10 Figure 11
Figure 10 Visualization of the retrieved results on the MSMT17 dataset. The top-5 retrieved images are presented. The true matches are annotated by green boxes and the wrong matches are annotated by red boxes

Table 1
Statistics of the datasets used

Table 2
Performance (%) comparison with state-of-the-art methods. The best performance is marked in bold and the second-best performance is underlined. * indicates that the method uses camera information

Table 3
Comparisons of model complexities. Both CNN-based methods and transformer-based methods are selected for comparison. Params denotes the number of parameters. FLOPs denotes the number of floating point operations

Table 4
Ablation analysis of key modules. Params denotes the number of parameters. FLOPs denotes the number of floating point operations

Table 5
Ablation analysis of the PTE module. Params denotes the number of parameters. FLOPs denotes the number of floating point operations

Table 7
Ablation results of DSL