1 Introduction

Person re-identification (ReID) aims to identify a query individual from a large set of candidates captured under non-overlapping camera views. Because ReID plays an essential role in many security and surveillance applications, it has attracted many attempts and seen dramatic improvements in recent years [22, 23, 37, 48, 58].

Despite the satisfactory performance obtained by supervised deep learning models with label annotations in a single domain, it is still a challenge to deploy trained person ReID models in different target environments. This is due to the domain bias between the training and deployment environments, e.g., a model trained on a university dataset needs to be applied in an airport or an underground station. A common approach is to finetune the deep model with the image data of the target domain and pseudo labels generated by the source pre-trained model (e.g., by clustering [12, 34, 52], reference comparison [51], or nearest neighborhood [61]). However, the predicted pseudo labels might involve much noise, which misleads the training process in the target domain. As shown in Fig. 1, the noisy labels might generate opposite gradients which undermine the discriminative ability of the model.

Fig. 1.

Difference between our DCML method and conventional methods. The left part shows that conventional metric learning methods treat all samples equally when training the model and are thus easily misled by noisy labels. The right part shows that our method adaptively mines credible samples to train the model, which avoids the damage from low-quality samples. Best viewed in color.

To address this problem, we propose a deep credible metric learning (DCML) method that avoids the damage from noisy pseudo labels by adaptively exploring credible and valuable training samples. Specifically, our DCML method consists of two parts: adaptively credible anchor sample mining and instance margin spreading. The former explores credible samples, which are effective for learning intra-class compact embeddings. We propose two credibility metrics, the k-Nearest Neighbor similarity and the prototype similarity, and implement both to demonstrate the generality of the credible anchor sample mining strategy. The k-Nearest Neighbor similarity measures the neighborhood density of a sample by calculating the maximum distance (minimum similarity) between it and its k nearest neighbors, while the prototype similarity calculates the similarity between the sample and its class prototype, which denotes the sample’s centrality. Using these credibility metrics, we select samples with higher credibility as anchors. As the training iterations increase, the credibility of the pseudo labels increases too. We therefore progressively relax the constraint of anchor sample mining to select more training samples. In addition, we propose an instance margin spreading (IMS) loss to increase the instance-wise discrimination, because the initial embeddings of target samples are confusing and non-discriminative without supervised training. We regard each sample as an independent individual and learn a spreading embedding space by pushing the samples away from each other by a large margin. We summarize the contributions of this work as follows:

  1) We propose a deep credible metric learning (DCML) method for unsupervised domain adaptation person ReID, which adaptively and progressively mines credible and valuable training samples to avoid the damage from the noise of predicted pseudo labels.

  2) We design an instance margin spreading loss to encourage instance-wise discrimination by spreading the embeddings of samples with a large margin.

  3) We conduct extensive experiments to demonstrate the superiority of our method, and achieve state-of-the-art performance on several large-scale datasets including Market-1501 [57], DukeMTMC-reID [30], and CUHK03 [21].

2 Related Work

Supervised Deep Person ReID: Most existing person ReID methods obtain excellent performance with supervised deep learning models and a large number of label annotations. Some methods are devoted to designing more effective networks with part-based models [3, 6, 36, 37, 41] or attention models [1, 2, 11, 22, 31, 47]. Other methods focus on capturing more prior knowledge or supervisory signals, including body structure [18, 19, 53, 54], human pose [29, 35], attribute labels [39, 55], and other loss functions [4, 15, 56]. Despite the recent progress in the supervised setting, deploying trained models in different target environments is still a challenge due to the large domain bias.

Unsupervised Domain Adaptation Person ReID: To address the above problem, some works [24, 49] study purely unsupervised learning from unlabeled data for ReID. However, the performance is limited without any labeled data. Many other works therefore learn an unsupervised domain adaptation person ReID model, which leverages labeled source domain data and unlabeled target domain data. Many existing works [5, 7, 44] apply a generative model (e.g., GAN) to transform the images of the source domain into the target domain as training data, aiming to reduce the domain bias at the data level. Other works finetune the deep model with the target domain data and pseudo labels generated by the source pre-trained model. Clustering methods [12, 34, 52] and reference comparison [51] are widely used to generate the supervisory signal from pre-trained models. Besides, some unsupervised domain adaptation person ReID methods explore other human prior knowledge or auxiliary supervisory signals to improve the adaptation and generalization ability from the source domain to the target domain. EANet [16] employs human parsing results to assist feature alignment, while TJ-AIDL attempts to learn a joint attribute-identity space which improves the model generalization ability with transferred attribute knowledge. Our work is related to PAST, which uses all samples as anchors, randomly selects the positive and negative samples from the top k neighbors and the k-2k neighbors respectively, and employs a cross-entropy loss in the promoting stage. However, PAST applies a fixed sampling strategy to all anchors throughout the training process, which ignores the initially low quality and the continuous improvement of the pseudo labels. Our DCML method adaptively selects credible anchors by measuring the credibility of each sample and progressively adjusts the sampling strategy for the different stages of the training process.

Deep Metric Learning: Deep metric learning aims to learn a discriminative feature embedding space instead of a final classifier, which generalizes better to unseen environments [4]. Existing deep metric learning methods mainly focus on designing effective loss functions or developing efficient sampling strategies. Loss design methods focus on utilizing higher-order relationships [26, 40, 42], global information [27, 33], or margin maximization [8, 38, 50], while sampling-based methods are devoted to mining hard negative samples to improve training efficiency. For instance, TriNet [15] samples the hardest negative samples in the batch for fast convergence. Harwood et al. [13] mine negative samples from an increasing search space defined by the nearest neighbor distance. However, these mining strategies tend to select harder samples, because such samples yield larger gradients by violating the triplet relation defined by the annotations, and hard samples are easily confused with noisy labels, especially for pseudo labels. To address this issue, we adaptively and progressively select credible anchor samples, which is appropriate for low-quality predicted pseudo labels.

3 Deep Credible Metric Learning

The goal of our deep credible metric learning method is to adaptively and progressively discover credible samples in order to reduce the damage from noisy labels. In this section, we introduce our DCML method in two parts: adaptively credible sample mining and instance margin spreading.

3.1 Problem Formulation

For the unsupervised domain adaptation person ReID problem, we have a source dataset \(\mathcal {S}= \{\mathcal {X}^{\mathcal {S}}, \mathcal {Y}^{\mathcal {S}}\}\), where \(\mathcal {X}^{\mathcal {S}}\) denotes the image data and \(\mathcal {Y}^{\mathcal {S}} \) denotes the corresponding labels. Besides, we have another dataset from the deployment environment without any annotations, which is called the target dataset \(\mathcal {X}^{\mathcal {T}} = \{x^t_i \}_{i=1}^N \). The cross-domain person ReID system aims to learn robust and generalizable representations in the target domain with the supervised source dataset and the unsupervised target one. A popular solution is to finetune the pre-trained model in the target domain with predicted pseudo labels. Suppose we have predicted pseudo labels \(\hat{\mathcal {Y}}^{\mathcal {T}} = \mathcal {P}(\mathcal {X}^{\mathcal {T}};\mathcal {X}^{\mathcal {S}}, \mathcal {Y}^{\mathcal {S}})\) generated by the model pre-trained on the source domain. We then learn feature embeddings \(f_i = \mathcal {F}_{\theta } (x^t_i) \) with a convolutional neural network (CNN) \(\mathcal {F}_{\theta }\), using the following objective function:

$$\begin{aligned} \begin{aligned} \theta = \arg \min \limits _{\theta } \mathcal {L}(\theta ;\mathcal {X}^{\mathcal {T}},\hat{\mathcal {Y}}^{\mathcal {T}}), \end{aligned} \end{aligned}$$
(1)

where the objective is to learn the CNN \(\mathcal {F}_{\theta }\) by using pseudo labels as the supervisory signal. However, the performance of this objective function depends entirely on the quality of the generated labels, which comes with no guarantee. The generated labels are usually noisy due to the large domain bias between the source and target datasets, and these noisy labels mislead the training process by providing wrong gradients. This motivates adaptively mining credible samples for more reliable model learning (Fig. 2).

Fig. 2.

Illustration of the deep credible metric learning method. The DCML method starts by pre-training a CNN with the labeled source data. In each iteration, we extract the embeddings of the unlabeled target images and generate pseudo labels with a clustering method. To avoid being misled by noisy pseudo labels, we adaptively mine credible samples as the anchor data and optimize the model with these samples. The gradients come from two objective functions: the triplet loss (red arrows) and the IMS loss (purple arrows). In addition, we progressively adjust the anchor sample mining strategy to select more anchor samples as the iterations proceed. Best viewed in color.

3.2 Adaptively Credible Sample Mining

The adaptively credible sample mining strategy aims to select the more credible samples to avoid the damage from noise labels. For one target sample and corresponding pseudo label \( (x^t_i, \hat{y}^t_i) \), we define a credibility metric \(\mathcal {C}(x^t_i, \hat{y}^t_i) \) to evaluate whether a label is credible enough as a supervisory signal. Given a threshold \(\tau \), we select the more credible samples as the training data:

$$\begin{aligned} \begin{aligned} \mathcal {X}^{\mathcal {T}}_C = \{ x^t_i \in \mathcal {X}^{\mathcal {T}} | \mathcal {C}(x^t_i,\hat{y}^t_i)> \tau \}, \end{aligned} \end{aligned}$$
(2)

where \(\mathcal {X}^{\mathcal {T}}_C \) denotes the selected credible dataset, in which each sample is credible enough to serve as an anchor sample for training the model. As described later in this section, the threshold \(\tau \) adapts to the learning process and is reduced as the pseudo labels become more credible. The main problem is how to evaluate the credibility of samples. The basic assumption of our anchor sample mining strategy is that central and dense samples are credible for training. Thus we design two credibility metrics, the k-Nearest Neighbor similarity and the prototype similarity, to measure the neighborhood density and class centrality of samples.

Prototype Similarity: In the prototype similarity, we define the credibility of one sample with the similarity between it and the class prototype. Inspired by the prototypical network  [32], we assume all support data points of the same “class” lie in a manifold, and calculate the class prototype as the center of class:

$$\begin{aligned} \begin{aligned} \mathcal {P}_k = \frac{1}{|\mathcal {M}_k|} \sum \limits _{x^t_i \in \mathcal {M}_k } \mathcal {F}_{\theta } (x^t_i), \end{aligned} \end{aligned}$$
(3)

where \(\mathcal {M}_k = \{ x^t_i \in \mathcal {X}^{\mathcal {T}} | \hat{y}^t_i=k\} \) denotes the set of examples labeled with class k, and \(\hat{y}^t_i \) is the pseudo label of \( x^t_i\). Then the intra-class centrality can be calculated with the Euclidean distance as:

$$\begin{aligned} \begin{aligned} \mathcal {C}_P(x^t, \hat{y}^t) = -|| x^t - \mathcal {P}_{\hat{y}^t} ||_2. \end{aligned} \end{aligned}$$
(4)

Larger values of \(\mathcal {C}_P(x^t, \hat{y}^t)\) correspond to more intra-class consistent samples. When the intra-class centrality \(\mathcal {C}_P(x^t, \hat{y}^t)\) is large, the sample \(x^t \) is close to the class prototype, which means that it is a trustworthy representative of its class. On the contrary, samples with small credibility values might be mislabeled, since they tend to lie close to the unreliable decision boundary.
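The prototype similarity of Eqs. (3) and (4) can be sketched in a few lines. This is a minimal illustration in plain Python, assuming that the embeddings \(\mathcal {F}_{\theta } (x^t_i)\) are already available as tuples of floats and that the distance in Eq. (4) is taken between the embedding of the sample and the prototype; the function names `class_prototypes` and `prototype_similarity` are hypothetical.

```python
import math
from collections import defaultdict


def class_prototypes(embeddings, pseudo_labels):
    """Mean embedding of every pseudo class, as in Eq. (3)."""
    groups = defaultdict(list)
    for f, y in zip(embeddings, pseudo_labels):
        groups[y].append(f)
    return {
        y: [sum(column) / len(members) for column in zip(*members)]
        for y, members in groups.items()
    }


def prototype_similarity(embeddings, pseudo_labels):
    """Negative Euclidean distance to the own-class prototype, as in Eq. (4)."""
    prototypes = class_prototypes(embeddings, pseudo_labels)
    return [-math.dist(f, prototypes[y]) for f, y in zip(embeddings, pseudo_labels)]


# Toy 2-D embeddings: class 0 holds two nearby points and one outlier.
feats = [(0.0, 0.0), (0.2, 0.0), (1.6, 0.0), (5.0, 5.0)]
labels = [0, 0, 0, 1]
scores = prototype_similarity(feats, labels)
```

In this toy example the outlier at (1.6, 0.0) receives the lowest score within class 0, so it would be the first sample of that class to be excluded by a threshold on the credibility.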

KNN Similarity: Different from the prototype similarity, which measures the intra-class centrality of a sample, the KNN similarity calculates the local density from neighborhood information. For a sample \(x^t \), the neighborhood set \(\mathcal {N}(x^t)\) consists of the k samples nearest to \(x^t \). The neighborhood set captures the local neighborhood information of a sample, which can be employed to describe the density. We define the KNN similarity as

$$\begin{aligned} \begin{aligned} \mathcal {C}_N(x^t) = -\max \limits _{x^t_i \in \mathcal {N}(x^t)} d(x^t, x^t_i), \end{aligned} \end{aligned}$$
(5)

where \(d(\cdot , \cdot ) \) is a distance metric, e.g., the Euclidean distance. We employ the minimal similarity among the k nearest neighbors to denote the local density. The larger the KNN similarity \(\mathcal {C}_N(x^t)\), the more compact the samples in the neighborhood set \(\mathcal {N}(x^t)\), which indicates that \(x^t \) resides in a high-density region. When the samples are dense in the neighborhood set and far away from other samples, a neighborhood-based pseudo label generation method, e.g., clustering, gives a more reliable result. When the samples are dense but indistinguishable, they also deserve more attention. Thus, we select the samples with higher KNN similarity as training data.
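As an illustration of Eq. (5), the KNN similarity of every sample can be computed by brute force. The sketch below assumes Euclidean distance on precomputed embeddings and a dataset small enough for an exhaustive search; the function name `knn_similarity` is hypothetical, and a practical system would likely use a batched or approximate neighbor search instead.

```python
import math


def knn_similarity(embeddings, k):
    """Negative distance to the k-th nearest neighbour, as in Eq. (5).

    The maximum distance over the k nearest neighbours equals the k-th
    smallest distance to any other sample. Requires k <= len(embeddings) - 1.
    """
    scores = []
    for a, fa in enumerate(embeddings):
        dists = sorted(
            math.dist(fa, fb) for b, fb in enumerate(embeddings) if b != a
        )
        scores.append(-dists[k - 1])
    return scores


# Three points in a dense region and one isolated point.
feats = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
scores = knn_similarity(feats, k=2)
```

The isolated point obtains by far the lowest score, which matches the intuition that samples in sparse regions carry less reliable pseudo labels.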

Progressively Learning: Throughout the training stage, we alternate between generating pseudo labels with the embedding model and training the embedding model with the pseudo labels. In each iteration, we first extract the embeddings with the current model \(\mathcal {F}_{\theta }\) and cluster them in the embedding space to generate the pseudo labels. Then, we use the pseudo labels as the supervisory signal to train and update the embedding model. Through this iterative learning process, the pseudo labels become more and more credible and the embeddings become more and more discriminative. In our DCML method, we progressively adjust the anchor sample mining strategy to select more anchor samples by reducing the selection threshold as the iterations proceed, since the pseudo labels become more credible as the model is finetuned. When the pseudo labels are credible enough, we tend to employ all the data in the target domain to train our model. Specifically, we design a linear threshold adaptation strategy, which progressively reduces the threshold \(\tau \) with the iteration index r. We formulate the threshold adaptation strategy as follows:

$$\begin{aligned} \begin{aligned} \tau = \arg \min \limits _{\tau } |\mathcal {X}^{\mathcal {T}}_C | \ge (\gamma _0+r\times \varDelta \gamma )|\mathcal {X}^{\mathcal {T}} | \end{aligned} \end{aligned}$$
(6)

where \( |\mathcal {X}^{\mathcal {T}}_C |\) and \( |\mathcal {X}^{\mathcal {T}} |\) respectively denote the number of samples in the selected and original datasets, and \(\gamma _0 \) and \(\varDelta \gamma \) are hyperparameters which respectively denote the initial sampling rate of anchor samples and its increment in each iteration. In other words, \(\tau \) is set so that the selected set covers a fraction \(\gamma _0+r\times \varDelta \gamma \) of the target samples. The basic goal of this strategy is to adapt the threshold \(\tau \) so that sufficient credible anchor samples are selected. The number of selected samples increases progressively, under the assumption that the credibility of the pseudo labels increases with the training iterations.
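One way to realize the linear schedule of Eq. (6) is to rank the samples by credibility and keep the top fraction for the current iteration. The sketch below is an assumption-laden illustration rather than the authors' implementation: the function name `select_credible`, the rounding rule, the cap at 100%, and the use of "greater than or equal" at the threshold are all choices made here for concreteness, and at least one sample is assumed.

```python
def select_credible(scores, r, gamma0=0.75, delta_gamma=0.05):
    """Pick the most credible samples for iteration r (r = 0 is the first).

    Returns the sorted indices of the selected samples and the implied
    threshold tau, i.e. the lowest credibility that is still accepted.
    """
    n = len(scores)
    rate = min(1.0, gamma0 + r * delta_gamma)   # sampling rate, capped at 100%
    n_keep = max(1, min(n, round(rate * n)))    # rounding rule is an assumption
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    tau = scores[order[n_keep - 1]]
    return sorted(order[:n_keep]), tau


scores = [float(s) for s in range(20)]          # toy credibility values
keep0, tau0 = select_credible(scores, r=0)      # 75% of the samples
keep5, tau5 = select_credible(scores, r=5)      # rate reaches 100%
```

With these toy scores the first iteration keeps the 15 most credible samples, and the threshold falls until every sample is used once the rate reaches one.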

3.3 Instance Margin Spreading

The embeddings produced on the target domain by the pre-trained model are usually confusing and non-discriminative. It is difficult to cluster such samples and generate credible pseudo labels. In order to increase the inter-class discrimination, we propose an instance margin spreading (IMS) loss, which spreads the embeddings by pushing the samples a large margin apart from each other to obtain a discriminative embedding space. Inspired by instance discrimination learning [46], which assumes each instance is an independent class, we aim to learn a spreading metric space where the distance between every instance pair exceeds a large margin. Different from conventional margin-based losses (e.g., triplet loss), our IMS loss doesn’t require any labels and learns the embedding space only by instance-wise discrimination. The basic formulation of this margin constraint is as follows:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{ims}(x^t_a) = \sum \limits _{i \ne a } \max \big (0,m-d_{a,i} \big ) \end{aligned} \end{aligned}$$
(7)

where \(x^t_a\) denotes a randomly selected sample, \(d_{a,i} = d(x^t_a,x^t_i) \) denotes the distance between a sample pair, and \( i\ne a\) ranges over all other samples in the dataset. The margin m is the lower bound on the distance between each sample pair. As shown in [4] and [33], we can obtain a smooth counterpart of this loss by replacing \( \max (0,x) \) with a continuous exponential function and a logarithmic function, which is formulated as:

$$\begin{aligned} \mathcal {L}_{ims}(x^t_a)&= \log \big (1+\sum \limits _{i \ne a }e^{m-d_{a,i}} \big ) \nonumber \\&= -\log \frac{e^{-d_{a,a}}}{e^{-d_{a,a}} +\sum \limits _{i \ne a} e^{m-d_{a,i}} } \nonumber \\&= -\log \frac{e^{-d_{a,a}}}{ \sum _{i=1}^{N} e^{m_a-d_{a,i}} }, \end{aligned}$$
(8)

where \(m_a \) is an adaptive margin: it is zero for the term of the same instance (\(i=a\)) and takes the large value m for all other terms. In this formulation, we assume that the distance between a sample and itself is zero, i.e., \(d_{a,a}=0 \). Different from other instance discrimination learning methods (e.g., [46, 61]), we learn a spreading metric space with a large margin. This metric space encourages inter-class discrimination through the margin constraint, which is beneficial for robust clustering and credible sample mining.
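The first and last lines of Eq. (8) can be checked numerically. The following sketch evaluates both forms for one anchor on toy data, assuming Euclidean distance and plain Python floats; the function names are hypothetical and the code is meant only to illustrate the formula, not to reproduce the training implementation.

```python
import math


def ims_loss(embeddings, a, margin):
    """First line of Eq. (8): log(1 + sum_{i != a} exp(m - d_{a,i}))."""
    terms = [
        margin - math.dist(embeddings[a], f)
        for i, f in enumerate(embeddings)
        if i != a
    ]
    # Log-sum-exp trick over {0} and the terms for numerical stability.
    peak = max(0.0, max(terms))
    return peak + math.log(
        math.exp(-peak) + sum(math.exp(t - peak) for t in terms)
    )


def ims_loss_softmax_form(embeddings, a, margin):
    """Last line of Eq. (8): margin 0 for i == a, margin m otherwise."""
    denominator = sum(
        math.exp((0.0 if i == a else margin) - math.dist(embeddings[a], f))
        for i, f in enumerate(embeddings)
    )
    numerator = math.exp(-0.0)  # exp(-d_{a,a}) with d_{a,a} = 0
    return -math.log(numerator / denominator)


close = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
far = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
loss_close = ims_loss(close, a=0, margin=0.1)
loss_far = ims_loss(far, a=0, margin=0.1)
```

Both forms return the same value, and the loss shrinks as the other samples move away from the anchor, which is the spreading behavior described above.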

Algorithm 1.

3.4 Objective Function

Given the anchor sample set \(\mathcal {X}^{\mathcal {T}}_C\) discovered by our adaptively credible sample mining strategy, we train our embedding model \(\mathcal {F}_{\theta } \) with the objective function combining the proposed instance margin spreading loss and conventional metric learning loss:

$$\begin{aligned} \begin{aligned} \mathcal {L} = \sum \limits _{x^t_i \in \mathcal {X}^{\mathcal {T}}_C } \mathcal {L}_{tri}(x^t_i) + \lambda \mathcal {L}_{ims}(x^t_i), \end{aligned} \end{aligned}$$
(9)

where \( \mathcal {L}_{tri}(x^t_i) \) is a common metric learning loss, the triplet loss [15], and \(\lambda \) denotes the hyper-parameter that balances the importance of the two objectives. The triplet loss aims to learn an embedding space in which an anchor sample is closer to its positive sample than to any negative one by a large margin. We formulate it as follows:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{tri}(x^t_i) = \big [|| f_i- f_i^+||^2_2-|| f_i- f_i^-||^2_2+m_{tri} \big ]_+, \end{aligned} \end{aligned}$$
(10)

where \([ \cdot ]_+\) indicates the function \( \max (0,\cdot )\), which means that the gradients vanish when the difference between the intra-class and inter-class distances is large enough. \(f_i,f_i^+,f_i^-\) respectively denote the features of the anchor, positive, and negative samples in a triplet. The selection of positive and negative samples follows [15], which only uses the hardest positive and negative points in the mini-batch. \(m_{tri}\) is a margin that enhances the discriminative ability, similar to \(m_{a} \) in the instance margin spreading loss. For a clearer explanation, Algorithm 1 describes the learning process of our DCML method in detail.
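The triplet term of Eq. (10) with hardest-positive and hardest-negative selection can be illustrated as follows. This sketch uses squared Euclidean distances as written in Eq. (10) and plain Python lists; the function names, the handling of anchors without a valid triplet, and the helper `combined_loss` (the per-anchor summand of Eq. (9)) are assumptions made for illustration.

```python
def squared_l2(f, g):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(f, g))


def batch_hard_triplet(embeddings, labels, a, margin=0.3):
    """Eq. (10) for anchor a, using the hardest positive and negative in the batch."""
    positives = [
        squared_l2(embeddings[a], f)
        for i, f in enumerate(embeddings)
        if i != a and labels[i] == labels[a]
    ]
    negatives = [
        squared_l2(embeddings[a], f)
        for i, f in enumerate(embeddings)
        if labels[i] != labels[a]
    ]
    if not positives or not negatives:
        return 0.0  # assumption: anchors without a valid triplet contribute nothing
    # Hardest positive is the farthest one, hardest negative the closest one.
    return max(0.0, max(positives) - min(negatives) + margin)


def combined_loss(triplet_value, ims_value, lam=0.01):
    """Per-anchor summand of Eq. (9): L_tri + lambda * L_ims."""
    return triplet_value + lam * ims_value


mixed = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.0), (3.0, 0.0)]
separated = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]
labels = [0, 0, 1, 1]
loss_mixed = batch_hard_triplet(mixed, labels, a=0)
loss_separated = batch_hard_triplet(separated, labels, a=0)
```

When the two pseudo classes are well separated the hinge is inactive and the loss is zero, which corresponds to the vanishing gradients mentioned above.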

3.5 Discussion

Some methods (e.g., PUL [10], UDA [34], PAST [52], and SSG [12]) also apply a clustering algorithm to generate pseudo labels for the target domain. However, the pseudo labels might involve much noise, which misleads the training process in the target domain. To solve this problem, our DCML method develops a credible sample mining strategy within metric learning to avoid the noisy labels. PUL [10] proposed a reliable objective function to regularize the sparsity of samples, and simultaneously optimized the objective of the discriminative model and the regularization term on the number of samples. However, this regularization term may disturb the original discriminative learning, since valuable samples tend to be removed during optimization. Different from PUL, our DCML method proposes a credible sample mining strategy which is inspired by hard negative mining in metric learning. The credible data sampling is separated from the metric learning process and therefore does not disturb it. As far as we know, DCML is the first metric learning method to adaptively select credible samples without breaking the discriminative learning.

Table 1. The basic statistics of all datasets in experiments.

4 Experiment

In this section, we evaluated our DCML method on three large-scale person ReID datasets: Market-1501  [57], DukeMTMC-reID  [30], and CUHK03  [21]. Quantitatively, we compared our DCML method with other state-of-the-art unsupervised domain adaptation person ReID approaches and conducted ablation studies to analyze each component. Besides, we visualized the embedding space to qualitatively analyze our method.

4.1 Datasets and Experimental Settings

Datasets: Our experiments are conducted on three large-scale datasets: Market-1501 [57], DukeMTMC-reID [30], and CUHK03 [21]. Although all of these datasets are collected from real-world university environments, there is still a large domain shift among them in aspects such as background, illumination, and clothing style. For example, the persons in the Market-1501 and DukeMTMC-reID datasets mainly come from Asia and America respectively. For all datasets, we share the same experiment settings with the standard cross-domain person ReID experimental setups in the baseline methods UDA [34] and PAST [52]. Specifically, we follow their source/target selection strategy, training/testing ID splitting strategy, and evaluation protocols. For the Market-1501 and DukeMTMC-reID datasets, we evaluated our method in the single query mode, while for the CUHK03 dataset, we only use the DPM detected images and choose the new train/test evaluation protocol in [59] for a fair comparison. Detailed information about the datasets is shown in Table 1.

Evaluation Protocol: In our experiments, we employed the standard metrics, namely the cumulative matching characteristic (CMC) curve and the mean average precision (mAP) score, to evaluate the performance of the person ReID methods. We reported rank-1, rank-5 and rank-10 accuracy and the mAP score in our experiments. Note that post-processing methods, e.g., re-ranking [59], are not applied for the final evaluation.
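For readers unfamiliar with these metrics, the per-query quantities behind CMC rank-k and mAP can be sketched as below. The sketch assumes that the gallery has already been sorted by increasing distance to the query and converted into a list of booleans marking correct identities, and that any benchmark-specific filtering (for example, dropping gallery images from the query's own camera) has been applied beforehand; the function names are hypothetical.

```python
def rank_k_hit(ranked_match_flags, k):
    """1 if a correct match appears among the top-k gallery results, else 0."""
    return int(any(ranked_match_flags[:k]))


def average_precision(ranked_match_flags):
    """Mean of the precision values measured at each correct match position."""
    hits = 0
    precisions = []
    for position, is_match in enumerate(ranked_match_flags, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0


# One query whose correct matches sit at ranks 2 and 4.
flags = [False, True, False, True]
rank1 = rank_k_hit(flags, 1)
rank5 = rank_k_hit(flags, 5)
ap = average_precision(flags)
```

Rank-k accuracy and mAP are then the averages of `rank_k_hit` and `average_precision` over all queries.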

Table 2. Ablation studies showing the influence of design choices on mAP and Rank-1,5,10 (%), with Market-1501 as the source dataset and DukeMTMC-reID as the target dataset and vice versa. The \(\dagger \) denotes that this method is reproduced by ourselves with the same backbone and hyperparameters.

4.2 Implementation Details

Source Domain Pre-training: Leveraging the labeled source domain images, we pre-train a CNN model in a supervised manner by following the training strategy described in [2]. Specifically, we use the ImageNet pre-trained ResNet50 [14] without any attention module as the backbone of our model for fairness. The original \(stride =2\) convolution layer in the last block is replaced by a \(stride =1 \) one to preserve the image resolution. For image preprocessing, we use images generated by SPGAN [7] and adopt random horizontal flipping, random cropping, and random erasing as data augmentation for image diversity. The supervisory signals in the source domain training consist of the label-smoothed cross-entropy loss and the triplet loss. Besides, other hyperparameters including image resolution, batch size, learning rate, weight decay factor, learning rate decay strategy, and max epochs are the same as in [2].

Pseudo Label Generation: We adopt the DBSCAN clustering method  [9] to generate pseudo labels, which is the same as the baseline UDA method  [34]. The input of DBSCAN algorithm is the reranked distance matrix of the target domain samples and the output is the clustering result. We give each image cluster containing more than two samples a pseudo-label and then discard the individual images.
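The filtering step after clustering can be sketched independently of the clustering library. The snippet below assumes that a clustering routine has already assigned an integer cluster id to every target image and that noise points are marked with -1, which is the convention used by scikit-learn's DBSCAN; the function name `filter_pseudo_labels` and the `min_size` parameter are hypothetical, with the default of 3 following the literal wording "more than two samples".

```python
from collections import Counter


def filter_pseudo_labels(cluster_ids, min_size=3, noise_id=-1):
    """Map image index -> pseudo label, dropping noise and undersized clusters."""
    sizes = Counter(c for c in cluster_ids if c != noise_id)
    return {
        index: c
        for index, c in enumerate(cluster_ids)
        if c != noise_id and sizes[c] >= min_size
    }


cluster_ids = [0, 0, 0, 1, 1, -1, 2]
strict = filter_pseudo_labels(cluster_ids)               # only cluster 0 survives
loose = filter_pseudo_labels(cluster_ids, min_size=2)    # clusters 0 and 1 survive
```

Images that do not appear in the returned mapping are simply left out of the current training iteration.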

DCML: In the process of target domain adaptation, we train our model for 8 iterations, each of which consists of 30 epochs. For the credible sample mining strategy, we set \(\gamma _0 \approx 0.75 \) and \( \varDelta \gamma \approx 0.05 \) to update the sample selection threshold. Taking the DukeMTMC-reID dataset as an example, we select 12000 anchor samples in the first iteration and add 1000 samples in each subsequent iteration. For the objective function, we respectively set the margins \(m_a = 0.1 \) and \(m_{tri}= 0.3 \) for the instance margin spreading loss and the triplet loss. The loss weight is set as \(\lambda = 0.01\). In each mini-batch, we randomly select 224 samples from the credible sample set, in which each identity contributes 16 images. We use the Adam optimizer with an initial learning rate of 0.0005 and a weight decay of 0.001. The learning rate is reduced to 0.1 of its value at the 3rd and 6th iterations, and within each iteration it is temporarily reduced in the last 10 epochs. We conducted all our experiments on 4 Nvidia GTX 1080Ti GPUs with PyTorch 1.2.
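A mini-batch of 224 samples with 16 images per identity corresponds to 14 identities per batch. The sketch below shows one plausible way to draw such a batch from the credible sample set; the function name `sample_pk_batch`, the fallback to sampling with replacement for small identities, and the `rng` argument are assumptions for illustration, and the pool must contain at least `num_ids` identities.

```python
import random
from collections import defaultdict


def sample_pk_batch(labels, num_ids=14, imgs_per_id=16, rng=random):
    """Draw num_ids identities and imgs_per_id image indices for each of them."""
    by_id = defaultdict(list)
    for index, label in enumerate(labels):
        by_id[label].append(index)
    batch = []
    for label in rng.sample(sorted(by_id), num_ids):
        pool = by_id[label]
        if len(pool) >= imgs_per_id:
            batch.extend(rng.sample(pool, imgs_per_id))
        else:
            # Assumption: identities with too few images are sampled with replacement.
            batch.extend(rng.choices(pool, k=imgs_per_id))
    return batch


# Toy credible set: 20 pseudo identities with 20 images each.
labels = [i % 20 for i in range(400)]
batch = sample_pk_batch(labels, rng=random.Random(0))
```

Grouping several images of each identity in a batch is what makes hardest-positive and hardest-negative selection inside the mini-batch possible.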

4.3 Ablation Study

To analyze the effectiveness of the individual components of our DCML approach, we conducted comprehensive ablation experiments on the M \(\rightarrow \) D and D \(\rightarrow \) M settings, where M \(\rightarrow \) D denotes that the source dataset is Market-1501 and the target dataset is DukeMTMC-reID. We reproduced the UDA [34] method with the same backbone and hyperparameters as our method to serve as the baseline, and applied the proposed credible anchor mining strategy, instance margin spreading loss, and GAN-based image style transfer to it. We report the comparison results for the different settings in Table 2 and analyze the components as follows.

Table 3. Performance comparisons with SOTA unsupervised domain adaptation person Re-ID methods from Market-1501 to DukeMTMC-reID and vice versa.
Table 4. Performance comparisons with other methods from CUHK03 to DukeMTMC-reID and Market-1501.

Credible Anchor Mining Strategy: In Table 2, CAMS denotes our credible anchor mining strategy. Comparing the performance of the \(UDA \dagger + GAN + IMSLoss \) setting with that of the full DCML method, we observe an obvious decline when CAMS is removed. This illustrates that progressively and adaptively mining credible samples assists the target domain training by discarding samples with noisy labels. In addition, we compared different credibility metrics. The KNN similarity and the prototype similarity perform comparably, which indicates that our sample mining strategy is robust to the choice of credibility metric.

Instance Margin Spreading Loss: The proposed IMS loss aims to increase inter-class discrimination by enlarging the margin between instances. We conducted ablation studies of the IMS loss on both the “UDA” and “UDA+GAN” baselines, and obtained consistent improvements. Besides, we observed that the improvement on the stronger baseline (UDA+GAN) is smaller than that on the original UDA method. This might be because the images generated with the GAN have a smaller domain shift than the original images, so the embedding space pre-trained with the generated images is already more spread out.

Image Style Transfer: In our final system, we employed domain-adapted images generated with SPGAN [7] to pre-train the model on the source domain. The generator transfers the style of the source domain images to the target domain style, which reduces the domain shift between the source and target datasets. With pre-training on the generated images, the baseline UDA method achieves a large improvement, which demonstrates that the quality of the predicted pseudo labels is important for target domain finetuning. This also motivates us to further enhance the quality of the pseudo labels.

4.4 Comparison with State-of-the-Art Methods

We compared our method with other SOTA unsupervised domain adaptation person ReID methods on the Market-1501, DukeMTMC-ReID and CUHK03 datasets. Specifically, we conducted the experiments following the evaluation settings in [52], including the M \(\rightarrow \) D, D \(\rightarrow \) M, C \(\rightarrow \) D, and C \(\rightarrow \) M tasks, where M, D, C respectively denote the Market-1501, DukeMTMC-ReID and CUHK03 datasets. In Tables 3 and 4, the bottom groups summarize the performance of methods that generate pseudo supervisory signals to train the model on the target domain, while the top and middle groups respectively show methods using GANs or other auxiliary attributes. Our DCML achieved consistent improvements over the compared methods, which indicates the effectiveness of our credible sample mining strategy and instance margin spreading loss.

M \(\rightarrow \) D and D \(\rightarrow \) M: As shown in Table 3, we compare our results with 7 methods that finetune the model with pseudo supervisory signals, 5 methods that reduce the domain shift with GANs, and 4 methods that use auxiliary clues. The * in the tables marks methods whose source dataset is MSMT17 [44], which is the largest ReID dataset with large-scale images and multiple cameras. We achieve state-of-the-art results in both settings.

C \(\rightarrow \) D and C \(\rightarrow \) M: We also evaluated our DCML method using CUHK03 [21] as the source dataset. The results of our DCML method and other state-of-the-art methods are summarized in Table 4. Our DCML method improves on PAST [52] by adaptively mining credible anchors and progressively adjusting the mining strategy, which avoids being misled by noisy labels. Note that we don’t use a complex part model like PCB [37] in our DCML method.

Fig. 3.

Barnes-Hut t-SNE visualization  [25] of the proposed DCML method on the gallery set of DukeMTMC-ReID, where we zoom in several areas for a clear view.

4.5 Qualitative Analysis

To validate the effectiveness of our DCML method, we qualitatively examined the learned embeddings. As shown in Fig. 3, we visualize the Barnes-Hut t-SNE  [25] map of our learned embeddings of the gallery dataset in DukeMTMC-ReID. To observe the details, we magnify several regions in the corners. Despite the large intra-class variations such as illumination, backgrounds, viewpoints and human poses, our DCML method still groups similar individuals on the target domain in an unsupervised manner.

5 Conclusion

In this paper, we have proposed a deep credible metric learning method for unsupervised domain adaptation person re-identification, which adaptively mines credible samples to train the network and progressively adjusts the sample mining strategy with the learning process. The motivation is that the generated pseudo labels are often unreliable and their noise misleads model training. We present two similarity metrics for measuring the credibility of pseudo labels: the k-Nearest Neighbor similarity for density evaluation and the prototype similarity for centrality evaluation. As training proceeds, we progressively relax the selection constraint to select more samples. In addition, we propose an instance margin spreading loss to further increase the inter-class discrimination. We have conducted extensive experiments to demonstrate the effectiveness of our DCML method. In the future, we will attempt to design a credible negative mining strategy to further improve cross-domain metric learning.