Consistent attentive dual branch network for person re-identification

Several recent person re-identification methods focus on learning discriminative representations by designing efficient metric learning loss functions. Other approaches design part-based architectures to compute an informative descriptor from local features of semantically coherent parts. A few efforts learn the relationship between distant similar regions and parts by adjusting them to their most feasible positions with the help of soft attention. However, they focus on calibrating the features of distant similar parts and neglect to learn noise-free (e.g. blur-free) and distinct feature representations, even though person re-identification datasets contain degraded images. To tackle these issues, we propose a novel Consistent Attentive Dual Branch Network (CadNet) that models long-range dependencies (correlations) between channels as well as between feature maps. We adopt multiple classifiers trained to learn the most discriminative global features for a unique representation of a person. Correlations between channels are consistently computed with a channel attention mechanism to make the learned features noise free and distinct despite noisy and blurry data. Feature correlations, computed by the self attention mechanism, capture the relationship between distant similarities in the images. The proposed CadNet significantly enhances the performance with respect to the baseline on the person re-identification benchmarks.


Introduction
The problem of person re-identification (re-id) is, given a probe, to retrieve images of the same person from gallery sets acquired by the same or other cameras. Re-id is an essential component of intelligent surveillance systems [4,22] and its importance in research is steadily growing. Highly variable factors such as illumination, resolution, clothing, view angle, human pose, occlusions and background make re-id a very challenging task.
Another complication with respect to classical classification tasks is that in re-id the identity sets (classes) of the training and testing stages do not match. Precisely, an image classification task has the same classes in the training and testing sets, while re-id has different person identities in the two sets. Therefore, re-id requires a strong and discriminative feature descriptor to distinguish similar-looking images of new identities that appear only in the testing set.
With the development of neural networks and deep learning algorithms, ConvNets [11,16,32], originally designed for image classification tasks, perform impressively by providing discriminative feature representations for person images. Such a representation capability outperforms the traditional handcrafted low-level features by a large margin. To exploit ConvNets in re-id solutions, one research trend aims to design better metric learning loss functions [6,13,21], such as the triplet loss, triplet hard loss and quadruplet loss, for a better description of the person's image. These loss functions enlarge the inter-class variations and reduce the intra-class variations, thus improving the generalization capability of the model. The performance of such metric learning losses is highly dependent on the sampling method and on hard sample mining techniques. On the other hand, many approaches [23,42,52] address person re-identification as a general image classification task. The basic idea of these studies is to compute the cross entropy softmax classification loss for the person's images. At test time, these classification-based approaches compute the distance matrix from the output features of the images to distinguish the person identities. Due to the mismatch between training targets and testing targets, the performance of metric learning loss functions becomes inferior in the re-id task. To overcome this issue, we propose multi-classifier training instead of a general single classifier to learn the most discriminative features from person images. The effectiveness of the proposed multi-classifier learning is presented in the ablation study section of the paper.
Recently, part-based models [30,34,44,46,47,48,53] have achieved state-of-the-art performance in person re-id by learning part-based local feature representations from the person's image. Some of these methods [34,47] compute a strong discriminative representation of the image by splitting it into several body parts and then merging the local features from all the parts into a single representation. Other approaches [44,48] horizontally partition the feature maps of deep neural networks into several parts to learn more informative and fine-grained salient features in individual local parts. They distinguish one identity from another by using discriminative cues from these parts. To learn salient part features, such methods require well-aligned body parts for the same person. The resulting lack of part consistency is one of their main drawbacks.
Lately, many attention-based approaches [5,18,19,25,39,40,50,53] have been proposed to overcome the partitioning and misalignment issues occurring in part-based techniques. Attention is a powerful tool to perform spatial localization in neural networks and to interpret their decisions. AACN [40] tackles the misalignment and occlusion issues that occur in re-identification by masking out the undesirable background with a pose-guided attention mechanism. Others [19,50] focus on better matching features and essential attention regions by learning superior attention maps. Self-attention helps to compute the feature correlations [43] by giving more weight to similar parts in the image and modeling long-range dependencies in a statistically efficient way. However, these approaches fail to learn key-part features because of random part selection, and they do not account for the effect of noise (e.g. blur) on the learned features, even though most re-identification images are blurry and noisy.
To overcome the aforementioned issues and thus enhance the final matching, the features of relevant regions should be computed together with the feature correlations. For this purpose we introduce a self attention module. In addition, to address the issue of noise (e.g. blur) and thus learn noise-free features (e.g. features learned from sharp patches), we propose a consistent computation of the channels correlation of multi-scale features by exploiting the channel attention module. The proposed Consistent Attentive Dual Branch Network (CadNet) is a modified version of the self and channel attention network (SCAN) [24] such that noise-free, salient and discriminative features are learned. The overall scheme of the proposed approach is shown in Fig. 1 and our contributions with respect to SCAN [24] are:
- The channels correlations are computed at the end of each stage instead of at every residual connection, thus reducing their number. This is motivated by the hypothesis that the features computed at the end of each stage are a better representation of the image. Such a hypothesis is experimentally supported by the fact that computing the channels correlation at this stage enhances the performance and reduces the computational cost. In addition, these interdependencies make the learned features robust to noise (e.g. blur), thus contributing to improve the matching score.
- A dual branch mechanism composed of a residual and an attentive branch is introduced. The former aims to provide noise-free features while the latter captures similarities between patches at different locations in the matching images (e.g. a backpack carried by hand in the probe and on the shoulders in the gallery image). The final representation is the concatenation of both branches.
With the above mentioned changes, the performance of CadNet significantly improves on person re-id benchmarks as compared to SCAN [24] as well as to other state of the art methods.

Metric and classification losses
Deep neural networks, compared to hand-crafted features, perform better by learning the required features and metrics from the data using suitable loss functions. Ding et al. [10] proposed a triplet loss to compute the relative distance between different images. Chen et al. [6] adopted a quadruplet loss to enlarge inter-class variations and reduce intra-class ones. Yu et al. [41] proposed a soft hard sample mining technique which assigns weights to hard samples. Many research efforts address the person re-id problem as an image classification problem. Some [1,17] take paired images as input and compute the cross entropy loss for the image pairs. Others, like [34,35], propose margin-based losses or adopt simple classification losses for each part by splitting an image into multiple parts.
Unlike all the above methods, we compute a cross entropy loss for each added classifier, which achieves significantly higher performance than a standard single classifier. These multiple losses are then used along with the triplet loss.
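The combination of several identity losses with a triplet term can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the individual losses are weighted, so the sketch simply sums them, and the margin of 0.3, the Euclidean distance and all function names are hypothetical choices.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross entropy for one sample; logits is a list of floats."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin (margin is assumed)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def multi_classifier_loss(logits_per_classifier, label, anchor, positive, negative):
    """Unweighted sum of the identity losses of all classifiers plus the triplet term."""
    id_loss = sum(cross_entropy(logits, label) for logits in logits_per_classifier)
    return id_loss + triplet_loss(anchor, positive, negative)
```

With four classifiers, `logits_per_classifier` would hold four logit vectors computed from the same feature, so every classifier contributes its own gradient to the shared layer.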

Part based deep neural networks
Different works [30,34,44,46,47,48] introduced local part-based feature representations to enhance the re-id performance. Some methods perform inaccurate part localization by directly splitting images into local stripes. Other approaches deal with the alignment of these local parts by pose estimation and region proposal generation. Zhao et al. [48] proposed a part-aligned network for a better partitioning of the body parts. Sun et al. [34] introduced a uniform partition strategy to partition the person image into horizontal stripes. Zhang et al. [44] computed local and global losses by partitioning the body parts into horizontal stripes. Shu et al. [30] learned part-based features by dividing images into parts and used a network to assign weights to each part.
Unlike these part-based methods, attention-based methods directly learn local feature correlations from the image data. There is no need to split the image into stripes because the attention method computes the correlations between feature patches and hence provides a better relationship between different parts of the images.

Attention based models
Attention-based methods are used to manage localization and misalignment issues in the images. Liu et al. [19] proposed the HydraPlus network, which provides a better representation of the images by learning low-level attentive features. Li et al. [18] proposed a network for a multi-scale feature representation which simultaneously learns hard region-level and soft pixel-level attentive features. Other approaches [40,50] focused on better matching the features and proposed attentive learning to learn appropriate attention maps. Xiang et al. [39] proposed a scheme to fuse features from multiple regions and use soft attention to assign weights to each region. Zhong et al. [53] solved the alignment problem by using an attention mechanism to mix global and local features for more stable pedestrian descriptors. Unlike these works, SCAN [24] introduced self and channel attention mechanisms to give more weight to some features and to make them noise free by consistently computing the channels correlation. This improves feature matching when the images have similarities at different locations. However, SCAN focuses on attentive representations and ignores the original features, which may still carry useful information.
In this work, we modified the SCAN [24] model by applying a dual branch mechanism which preserves the original features along with attentive representations. We also improve the channels correlation by computing them at key locations in the network.

Problem definition and notations
Let a set of $n$ training images $\{I_i\}_{i=1}^{n}$ with corresponding identity labels $\{y_i\}_{i=1}^{n}$ be acquired by a camera network. The task of person re-identification is, given a probe, to retrieve the person's images from the galleries of different cameras. The problem of person re-id is usually treated as an image classification task when the cross entropy loss is used. The difference between the two tasks is that the training and testing classes (person identities) are identical in image classification while they are different in re-id. With the help of classifiers, the most discriminative features for each person are learned from the dataset. During testing, these features are used to compute the distance matrix between the probe and the gallery images to achieve person re-identification.
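The test-time matching step can be sketched as below. The text only states that a distance matrix is computed from the learned features; the L2 normalization, the Euclidean distance and the function names are assumptions made for this illustration.

```python
import math

def l2_normalize(v):
    """Scale a feature vector to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def distance_matrix(probes, gallery):
    """Entry [p][g] is the Euclidean distance between probe p and gallery item g."""
    probes = [l2_normalize(p) for p in probes]
    gallery = [l2_normalize(g) for g in gallery]
    return [[math.dist(p, g) for g in gallery] for p in probes]

def rank_gallery(dist_row):
    """Gallery indices ordered from the most to the least similar item."""
    return sorted(range(len(dist_row)), key=dist_row.__getitem__)
```

For each probe, the ranked list returned by `rank_gallery` is what the rank-k and mAP metrics are evaluated on.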

Overview
To achieve better matching in person re-id, neural networks must produce strong and discriminative feature representations of the person's images. To learn such representations, neural networks are trained in a supervised fashion on data of persons with known identities. In the testing stage, the features of unseen persons are extracted and matched against other unknown persons. The presence of unknown identities significantly reduces the matching performance. The training mechanism of the neural network plays an important role in learning the specific details of the person's image (clothes, handbags, etc.) that are important features to disambiguate between different people. We propose a training mechanism that exploits 4 classifiers. The predictions from all the classifiers are merged to make the final decision. We name it multi classifier (Multi-C) training. During training, several convolution operations take place across multiple channels of the features produced by the network's layers. The final output of each layer is the sum over all the channels. This induces channel dependencies in the learned features. Such channel dependencies cause small but effective details to be lost in the output features, especially in person re-id where the images are blurry and noisy. We compute the channels correlation to enhance the convolution features at every stage of the network so that the network becomes more sensitive to informative features that would otherwise be missed because of degraded data. Another aspect is that the convolution operations only have local information, hence they miss the long-range similarities present in the images. These long-range similarities have an essential impact in re-id when matching images. We capture these similarities with feature correlations computed by the self attention mechanism. The details of computing channel and feature correlations are explained in the next sections.

Proposed network architecture
Recent research works have shown that Convolutional Neural Networks (CNNs) efficiently learn deeper and more robust feature representations from images and are easier to train accurately if they have shorter connections between layers. Leveraging such outcomes, we adopt ResNet-50 [11] as our backbone network with several additions and modifications. We adopt multi classifier training with multiple fully connected layers, shown as classifiers in Fig. 2, to predict the identity of the person in the input probe image. The gradients from all added classifiers are gathered at the preceding 1 × 1 convolution layer and force that layer to learn the most discriminative global features. Such features are used to compute the distance matrix to overcome the issue of identity mismatch in the testing stage. Since the convolution layers have local receptive fields, the learned global features depend on local neighbourhood similarities and ignore long-range dependencies. To capture similar parts at different regions of the image and to cope with degraded re-id data, we compute feature and channels correlations with the help of self and channel attention mechanisms to learn noise-free and salient features. These two modifications are indicated as channels correlation and feature correlations in Fig. 2. After the third stage of the backbone network [11] we design two branches, named the residual and attentive branches. We add the self attention module at the start of the attentive branch because self attention produces better results when the spatial size of the features is small [43]. Channels correlations are computed after every convolutional block (stage) of the ResNet-50. The details of computing channel and feature correlations are explained in the next sections. The resulting network (CadNet) is shown in Fig. 2 and is trained with the cross entropy losses ($L_{id}$) from all classifiers and the triplet loss.

Channels correlation
Since person re-identification is applied to surveillance cameras, real scenarios and commonly used datasets consist of blurry and noisy images. Most of the existing methods are unable to grasp deep salient features from them. To build a descriptor that is robust to such degradation, noise-free and distinct feature learning is required. To fulfill this objective, we introduce several channel attention modules to compute the channels correlation consistently during the feature learning process.
Let $K = [k_1, k_2, \dots, k_C]$ be the learned set of filter kernels for $C$ output channels, with $k_l$ being the parameters of the $l$-th filter in a general convolution operation. The output of this convolution operation can be written as $U = [u_1, u_2, \dots, u_C]$, where

$$u_l = k_l * X = \sum_{n=1}^{C} k_l^n * x^n. \qquad (1)$$

In the above equation, $k_l = [k_l^1, k_l^2, \dots, k_l^C]$ and $X = [x^1, x^2, \dots, x^C]$ ($X$ being the input feature maps and $C$ the number of input channels). The convolution operation is denoted by $*$ and the 2D spatial kernel $k_l^n$ represents a single channel of $k_l$ which interacts with the corresponding channel of $X$. The output of the convolutional layer is obtained through a channel-wise sum of the computed feature values. Therefore, channel dependencies are introduced in the learned weights along with the spatial correlation captured by the convolutional filters. We follow the work in [15] for computing these channel dependencies (correlations) but apply them to the compacted features (at the end of a convolutional block) instead of at the residual connections (as done in [15]). Each unit of the output $U$ is unable to exploit contextual information outside of its region because the convolution operation has a local receptive field. To resolve this issue, global spatial information is squeezed into a channel descriptor. This operation generates channel-wise statistics and is achieved by global average pooling. A statistic $z \in \mathbb{R}^{C}$ is generated by shrinking $U$ along the spatial dimension $H \times W$. The $l$-th element of $z$, computed by global average pooling, can be written as:

$$z_l = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_l(i, j). \qquad (2)$$

To model channel-wise dependencies well, the learned function must be able to capture the nonlinear interaction between channels and must allow multiple channels to be emphasized at once rather than enforcing a one-hot activation. A gating mechanism with a sigmoid activation fulfills these requirements and can be written as:

$$n = \sigma\left(W_2\, \delta(W_1 z)\right), \qquad (3)$$

where $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ are the parameters of the dimensionality-reduction layer and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are the parameters of the dimensionality-increasing layer, while $\sigma$ denotes the sigmoid function, $\delta$ denotes the ReLU function and $r$ is the reduction ratio. Two 1 × 1 convolution layers implement $W_1$ and $W_2$ around the non-linearity. The final output of the channel attention is obtained by rescaling the output $U$ by means of the activations:

$$\tilde{x}_l = n_l \cdot u_l. \qquad (4)$$

The dot product refers to the channel-wise multiplication of the feature map $u_l \in \mathbb{R}^{H \times W}$ and the scalar $n_l$. The overall operation of the channel attention for computing the channels correlation is shown in Fig. 3 and it helps to boost feature discrimination.
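The squeeze, gating and rescaling steps described above can be sketched in plain Python. This is an illustrative sketch rather than the authors' implementation: the nested-list layout of U, the helper `matvec` and the function name `channel_attention` are assumptions, and W1 and W2 are passed in as ordinary matrices instead of 1 × 1 convolution layers.

```python
import math

def matvec(W, v):
    """Matrix-vector product for a nested-list matrix W and a list v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def channel_attention(U, W1, W2):
    """U holds C feature maps, each an H x W nested list.

    Returns the rescaled maps and the per-channel gates n.
    """
    # Squeeze: global average pooling gives one statistic z_l per channel.
    z = [sum(sum(row) for row in u) / (len(u) * len(u[0])) for u in U]
    # Gating: n = sigmoid(W2 * relu(W1 * z)), W1 reduces C to C/r, W2 restores C.
    hidden = [max(0.0, h) for h in matvec(W1, z)]
    n = [1.0 / (1.0 + math.exp(-a)) for a in matvec(W2, hidden)]
    # Rescale: every value of channel l is multiplied by its gate n_l.
    out = [[[n_l * val for val in row] for row in u] for u, n_l in zip(U, n)]
    return out, n
```

Because each gate lies in (0, 1) and is computed independently per channel, several channels can be emphasized at the same time, which is the non-one-hot behaviour the text asks for.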

Feature correlations
In the existing works, most of the models designed for person re-id consist of convolutional layers. These layers cannot efficiently model long-range dependencies and distant similarities in the images because the computations of the convolution operation are bound to local neighbourhoods. To efficiently model the relationships between widely separated spatial regions, we adapt a non-local model [38] to compute feature correlations with self-attention in a convolutional framework. The introduction of this module builds a strong descriptor alongside the original features and hence improves the matching by capturing similarities at different image locations. The image features from the previous hidden layer $x \in \mathbb{R}^{C \times N}$ are first projected into two feature spaces $f$, $g$ with the help of 1 × 1 convolution layers, where $C$ is the number of channels and $N = H \times W$. The attention is computed such that $f(x) = W_f x$ and $g(x) = W_g x$ with learnable weight matrices $W_f \in \mathbb{R}^{\bar{C} \times C}$ and $W_g \in \mathbb{R}^{\bar{C} \times C}$ ($\bar{C} = C/r$, where $r$ is the reduction ratio). The attention map $\alpha_{j,i}$ for $s_{ij} = f(x_i)^{T} g(x_j)$ is computed as:

$$\alpha_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N} \exp(s_{ij})}, \qquad (5)$$

where $\alpha_{j,i}$ denotes the extent to which the model attends to the $i$-th location when synthesizing the $j$-th region. The attention layer's output is

$$o_j = \sum_{i=1}^{N} \alpha_{j,i}\, h(x_i), \qquad (6)$$

where $h(x_i) = W_h x_i$, and $W_f$, $W_g$ and $W_h$ are the learned weight matrices implemented using 1 × 1 convolution layers. $\bar{C}$ is the number of channels after the reduction, $C/r$. To reduce memory usage, we set $r = 16$ (i.e., $\bar{C} = C/16$) in our experiments. The output of the attention layer is added back to the input feature map. The final output is written as:

$$y_i = \gamma\, o_i + x_i, \qquad (7)$$

where $\gamma$ is a learnable scalar parameter initialized to 0. At the start, $\gamma$ encourages the network to rely on the cues in the local neighbourhood; the network then gradually learns to assign more weight to non-local evidence. We append the self attention module at the start of the attentive branch, as shown in Fig. 2. The operation of computing the feature correlations through self attention is shown in Fig. 4.
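The attention computation described above can be sketched in plain Python on a small C × N feature matrix. This is a minimal illustration, not the authors' code: the helper `matmul`, the function name `self_attention` and the nested-list layout are assumptions, and the value projection W_h is assumed to keep all C channels so that the residual sum is well defined.

```python
import math

def matmul(A, B):
    """Product of two nested-list matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def self_attention(x, W_f, W_g, W_h, gamma=0.0):
    """x is a C x N feature matrix (N = H * W locations). Returns (y, alpha)."""
    f, g, h = matmul(W_f, x), matmul(W_g, x), matmul(W_h, x)
    n_loc = len(x[0])
    alpha = []
    for j in range(n_loc):
        # s_ij = f(x_i)^T g(x_j) for every location i, then a softmax over i.
        s = [sum(f[c][i] * g[c][j] for c in range(len(f))) for i in range(n_loc)]
        m = max(s)  # subtract the maximum for numerical stability
        e = [math.exp(v - m) for v in s]
        total = sum(e)
        alpha.append([v / total for v in e])
    # o_j = sum_i alpha[j][i] * h(x_i); the result is scaled by gamma and added to x.
    y = [[gamma * sum(alpha[j][i] * h[c][i] for i in range(n_loc)) + x[c][j]
          for j in range(n_loc)] for c in range(len(x))]
    return y, alpha
```

With `gamma=0.0` the function returns the input unchanged, which mirrors the initialization described in the text: the module starts as an identity mapping and only contributes non-local evidence once the scalar grows during training.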

Datasets
We performed our experiments and evaluated the proposed network on two person re-id benchmark datasets, Market-1501 [49] and DukeMTMC-reID [26]. We adopted rank-1 accuracy, rank-5 accuracy and mean average precision (mAP) as our evaluation metrics. We used the standard splits for training and testing identities. The details of the two datasets are as follows. The Market-1501 dataset has 32668 images of 1501 person identities automatically detected from six disjoint cameras. The training set consists of 12936 images of 751 identities. The

Implementation details
The backbone of the proposed CadNet network is a ResNet-50 network and is implemented using PyTorch. We trained CadNet on an NVIDIA RTX 2080 Ti GPU. Following the work of R-FCN [8], we modified the stride (stride=1) of the last downsampling block before the global average pooling to make the spatial size of the convolution features larger. We used global max pooling on these features instead of global average pooling. A 1 × 1 convolution layer followed by batch normalization, Rectified Linear Unit (ReLU) and dropout layers is appended after the max pooling to reduce the size of the features from 2048 to 1024. We add several modifications to the network as specified in Section 3.2. The channels correlations of the features at each stage are computed by the channel attention modules embedded throughout the network. The two branches of the proposed CadNet provide two feature vectors of length 1024 and are trained separately (without concatenation) by using shared multiple classifiers. The concatenation of the features from the two branches defines the final representation vector of length 2048, which is used for feature matching. We optimized the network using the Adam optimizer with momentum 0.9. The initial learning rate is set to 3e-4 and is divided by 10 after 80 epochs. We trained our model for 140 epochs with a batch size of 64. Photometric distortions [14] and the AlexNet-style color augmentation [12] are applied to 256 × 128 sized images, followed by random horizontal flipping and random erasing data augmentations. The dropout probability is set to 0.5 and the weight decay is 5e-4. The ResNet-50 baseline training time, on the exploited testing configuration, is 2.5 hours for Market-1501 and 3 hours for DukeMTMC-reID. The proposed CadNet converges in 3.5 hours for DukeMTMC-reID and takes 3 hours to train on the Market-1501 dataset.
The training time of CadNet is comparable to that of the baseline while its performance is significantly higher. The inference time is identical for both the baseline and CadNet (0.175 sec per batch).
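The step schedule and the construction of the final descriptor can be sketched as follows. The numeric values come from the text; the zero-based epoch counting, the choice to apply the decay from epoch 80 onward and the function names are assumptions of this sketch.

```python
def learning_rate(epoch, base_lr=3e-4, decay_epoch=80, factor=10.0):
    """Step schedule: the base rate for the first 80 epochs, then divided by 10."""
    return base_lr if epoch < decay_epoch else base_lr / factor

def final_descriptor(residual_feat, attentive_feat):
    """Concatenate the two 1024-d branch features into the 2048-d matching vector."""
    return list(residual_feat) + list(attentive_feat)

# Learning rate seen in each of the 140 training epochs.
schedule = [learning_rate(epoch) for epoch in range(140)]
```

The concatenation is used only for matching; during training each 1024-d branch feature is fed to the shared classifiers on its own, as stated above.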

Comparison with state-of-art methods
The results of the proposed CadNet, along with the comparison with other state of the art methods on the Market-1501 and DukeMTMC-reID datasets, are presented in Table 1 and Table 2. Unlike the other state of the art methods, the proposed CadNet introduces a multi classifier training mechanism which enhances the performance. The gradients from each added classifier are gathered at the preceding convolution layer and make that layer learn increasingly refined and discriminative features with each addition. We obtained the highest performance with 4 classifiers; adding further classifiers starts to reduce the scores because the errors of the individual classifiers also accumulate. To handle the blurriness and noise in the data, the proposed network computes the channels correlation consistently at various spatial sizes. These correlations produce noise-free feature maps from degraded data [9,45] and pass them on towards the classifiers. The sharpness of the features makes them easy to distinguish from each other and hence improves the matching scores. For further refinement of the features, the proposed network adopts a dual branch mechanism. The residual branch contributes to the performance of CadNet by providing noise-free and discriminative features of the image, which enlarge the difference between two different identities. The attentive branch merges the information from distant image locations that contribute most to the prediction of the person's identity. The concatenation of the attentive features with the residual features amplifies the information in the learned feature vectors. The final representation of the proposed method produces better matching between two persons and improves the performance. The effects of each component on the results of the proposed method are explained in Section 4.5.
In all experiments, we reported the results of the other methods directly from their papers. The results in Table 1 and Table 2 show the superior performance of the proposed method as compared to other state of the art methods. Unlike other methods, the features learned by the proposed CadNet capture distant similarities and hence provide a better and unique representation of the person. Therefore, the performance of the proposed method is higher than that of the others.

Visual results
The visual/qualitative results of the proposed network are illustrated in Fig. 5. We used the trained CadNet model to obtain the feature representations of all the images and then followed [54] to compute the visual results. We computed the class activation maps [54] for both datasets and present them visually. The class activation maps cannot be computed at the last convolution block because we reduce the size of the features to 1024: the last block returns 2048 feature maps while the input to the classifiers has size 1024. Since these sizes differ, we calculated the class activation maps at the third convolution block, where the network is split into two branches. In Fig. 6, the first row shows the original images from the Market-1501 dataset and the second row shows the corresponding class activation maps. Similarly, the third and fourth rows show the images and class activation maps for the DukeMTMC-reID dataset. Class activation maps represent how much each image region contributes to the predicted class (person identity) probabilities. The red regions contribute the most to the predictions while the blue regions contribute the least (or not at all). The images clearly show that the proposed solution gives higher consideration to the regions belonging to persons while discarding the background. This behaviour contributes to improving the quantitative performance.
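The generic class activation map formulation of [54] can be sketched as a weighted sum of feature maps. This sketch shows only that generic formula; how the class weights are obtained at the third convolution block is not described in the text, and the min-max normalization for display and the function name are assumptions.

```python
def class_activation_map(feature_maps, class_weights):
    """CAM(i, j) = sum_k w_k * F_k(i, j), min-max normalized to [0, 1] for display."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[sum(w * fm[i][j] for w, fm in zip(class_weights, feature_maps))
            for j in range(width)] for i in range(height)]
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    scale = (hi - lo) or 1.0  # avoid division by zero for a constant map
    return [[(v - lo) / scale for v in row] for row in cam]
```

Values close to 1 correspond to the red regions of the visualization and values close to 0 to the blue ones.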

Effect of classifiers
To evaluate the contribution of a single classifier versus multiple classifiers, we modified the ResNet-50 baseline to obtain unbiased results. To this end, we trained the ResNet-50 [11] baseline network with different numbers of classifiers and reported the rank-1 accuracy and mAP for both datasets. In particular, we used a ResNet-50 pretrained on the ImageNet dataset and removed its last linear layers. New linear layers, sized according to the number of classes present in the datasets, were appended before training on the re-id datasets. Figure 7 shows the contribution of different numbers of classifier layers on the ResNet-50 baseline. Both rank-1 accuracy and mAP increase linearly up to 4 classifiers. Then the curves reach a plateau or decrease gently. Since the highest performance is reached with 4 classifiers, we used this number of layers in the CadNet solution and in the aforementioned comparison results.

Parameters selection
Most of the parameters and specifications given in Section 3.2 are used throughout our experiments and are taken from previous standard person re-identification techniques. The parameter r used in the self attention module, instead, can be set to 2, 4, 8 or 16. The parameter r is the division factor used to generate patches from the input features. We reported the results for multiple sizes of the generated patches in the self attention module. The performance is slightly affected by the value of r because the number of channels $\bar{C}$ is reduced to C/r. The impact of different values of r is shown in Fig. 8.
Based on this evaluation, we chose r = 16 (i.e., $\bar{C}$ = C/16) in all our experiments. Such a selection not only improves qualitative performance but also reduces the computational cost and improves memory efficiency.

Component analysis
To illustrate the effectiveness of the proposed contributions, we provide a component analysis of the proposed network. First, we separated the three components (Channels Correlation CC, Feature Correlation FC and multi classifiers Multi-C) according to SCAN [24] and reported the results in Table 3. The ResNet-50 baseline proposed in Section 4.5.1 (one classifier) is used as the performance reference (first row). The second row shows the impact of using four classification layers. These first two rows give the numerical values of points 1 and 4 of Fig. 7. The third and fourth rows of Table 3 demonstrate the contributions of the channel attention (CA) and self attention (SA) modules: since SCAN [24] is the ResNet-50 4-C with both modules, these rows show the performance of SCAN without SA and without CA, respectively. The improvement brought by the new placement of the channel attention is shown by CadNet (residual branch) in row 6, which outperforms SCAN in row 5. To evaluate CadNet (attentive branch), we trained the proposed network shown in Fig. 2 with only that branch (i.e., the residual branch was removed). The results are in row 7 of Table 3. Finally, both branches were used (row 8) to show the results of the complete proposed solution. Each of the residual and attentive branches improves the performance over SCAN [24] separately, but the combination of both branches has a greater effect on the performance. This implies that combining all the proposed contributions in the form of CadNet provides a stronger descriptor.
With the proposed placement of channels correlation (CC), R1 and mAP are increased by 0.9% and 2.5% for Market1501 and 1.6% and 2.8% for DukeMTMC-reID as compared to SCAN. Similarly, the dual branch design with the proposed CC and feature correlation (FC) improves R1 and mAP by 0.4% and 1.6% for Market1501 and 1.0% and 1.3% for DukeMTMC-reID.

Conclusions
We proposed a novel Consistent Attentive Dual Branch Network with multiple classifiers for Person Re-Identification (CadNet). We exploited a multi classifier training strategy in which each classifier contributes to distinguishing between identities and helps the model to learn the most discriminative and unique features for each person. Because person re-id data are blurry and noisy, general re-id models miss small details. The introduction of channels correlation makes the learned features noise free and highlights these small details to build a stronger descriptor. Channels correlations are computed consistently through the channel attention module at multiple positions in the network so that this fine-grained information flows towards the final representation. Local and non-local similarities in the person images are computed by the residual and attentive branches, respectively, and merged to create a strong, unique and discriminative feature representation of each person. This has been shown to improve the matching score between two persons. Spectral normalization is applied while computing channel and self correlations to stabilize the training dynamics for the convergence of the model. The visual results show how each small component of a person participates in predicting the identity. The proposed CadNet learns small details that help to significantly enhance the person re-id performance with respect to other state of the art methods, as shown on two widely adopted benchmark datasets. The proposed network only focuses on learning the similarities and dissimilarities present for a single person. Cross correlations can be introduced in future work to learn distinguishing features between two persons.