1 Introduction

The task of pedestrian re-identification involves evaluating a pedestrian image taken by one camera and then re-identifying that pedestrian among a large number of images captured by different cameras. This process has wide applications in the security field and has recently emerged as a research hotspot in computer vision. Pedestrian re-identification tasks can be decomposed into two processes: feature extraction and feature matching. Because images captured by different cameras differ considerably in background, brightness, camera resolution, and other parameters, both processes face significant challenges. The key to pedestrian re-identification (re-ID) lies in extracting robust feature representations.

Conventional pedestrian re-identification based on supervised models can achieve suitable performance on individual datasets; however, this approach is not robust, and it has difficulty adapting to new application environments after training. In general, models based on supervised learning are difficult to deploy in real environments because the number of identities is uncertain in real-life applications and manual labeling is costly. Moreover, a large amount of unlabeled data can be obtained in the field of pedestrian re-identification; therefore, semi-supervised pedestrian re-identification techniques are required.

In recent years, four main strategies have been employed for target re-identification based on unsupervised learning: (1) methods based on pseudo-labels; (2) methods based on image generation; (3) methods based on instance classification; and (4) methods based on domain adaptation.

Considering that many labeled datasets are already available, and that a significant amount of unlabeled data can be obtained during pedestrian re-identification tasks, this study used unsupervised domain-adaptive methods for modeling. First, the similarity-preserving generative adversarial network (SPGAN) was used to adapt the style of the source domain images to make them closer to the target domain style. Then, ResNet-50 was used to extract the discriminative features shared by the target domain and the source domain.

Next, a clustering algorithm was used to generate pseudo-labels for the unlabeled target-domain images. Because the number of identities is uncertain in real application scenarios, we used density-based spatial clustering of applications with noise (DBSCAN) to generate pseudo-labels. To minimize the influence of the noise contained in each pseudo-label and to reduce the impact of hard pseudo-labels, this study used the network's predicted probabilities as soft pseudo-labels instead of one-hot outputs. Meanwhile, to prevent the network from supervising itself with its own predictions, this study implemented a teacher–student model, constructing two networks for collaborative training and ensuring their relative independence. This principle is illustrated in Fig. 1.

Fig. 1

The principle of network collaborative training via the teacher–student model. During training, the probability of each identity obtained after classification is used as a soft pseudo-label. Compared with the labels generated directly by the clustering algorithm, soft pseudo-labels can elucidate the relative relationships between different identities and reduce the noise caused by the defects of the clustering algorithm itself. By using two networks for collaborative training, the labels generated by a network are prevented from directly supervising that same network, and the independence of the two networks is ensured

ENC [1] described three characteristics of pedestrian re-identification tasks, i.e., exemplar invariance, camera invariance, and neighborhood invariance, which are presented in Fig. 2.

The task of this article is to build a semi-supervised pedestrian re-identification system based on the teacher–student model and SPGAN. The main challenges we face are as follows:

1. Noise in the pseudo-labels interferes with the training of the neural network.

2. The loss function must consider exemplar invariance, camera invariance, and neighborhood invariance.

3. The decoupling ability of the network framework and its robustness in application scenarios must be improved.

4. A reliable pre-trained network must be provided for the teacher–student model.

Based on the above issues, the main contributions of this article are threefold:

1. In response to problems 1 and 2, we propose a new compound loss function that makes the teacher–student model attend to the three characteristics of the pedestrian re-identification task during training and reduces the noise of the pseudo-labels.

2. To solve problem 3, we introduce a Transformer to adjust the structure of ResNet, improving its decoupling ability in the pedestrian re-identification task.

3. By introducing SPGAN into the teacher–student model, we provide the pre-training stage with a labeled dataset whose image style is similar to that of the target domain, thereby providing a better pre-trained model for the teacher–student model.

Fig. 2

Three characteristics of pedestrian re-identification tasks: (a) the distance between different individuals should be increased; (b) the distance between images of the same individual captured by different cameras should be shortened; (c) the distance between similar individuals should be shortened

2 Related work

2.1 Re-ID via generating pseudo-labels

The re-ID method based on generating pseudo-labels involves generating high-quality pseudo-labels for unlabeled data to train and update the network. Yu et al. [2] proposed a soft label-based learning method to overcome the challenge of unsupervised pedestrian re-identification. This method generated pseudo-labels by supplementing the dataset with labeled data. Specifically, a cluster center was generated for each class in the target domain, a vector was then generated according to the similarity between each unlabeled sample and each class, and the similarities between these vectors were calculated. Yang et al. [3] proposed a block-based discriminative feature learning method. This method first brought similar images closer together and pushed dissimilar images apart. Then, the original image after style transfer was considered as a positive sample, and the hardest negative samples were identified. Finally, the system was optimized based on the triplet loss. Fu et al. [4] used the DBSCAN clustering algorithm to cluster the unlabeled data based on the features extracted from the source domain and then applied the triplet loss for training. Ding et al. [5] proposed a dispersion-based clustering method for the target domain samples. This clustering method considered not only the differences between individuals but also the compactness of similar individuals. Compared with alternative clustering methods, this approach captured the relationships among multiple samples more broadly and effectively dealt with problems caused by unbalanced data distributions.

Currently, the generation of pseudo-labels has become a mainstream technical route. This method involves clear steps and achieves good performance (similar to that of supervised learning methods). However, as their name suggests, pseudo-labels are not real labels, and they contain noise. Therefore, such methods must improve the quality of the pseudo-labels and use them effectively, e.g., by improving feature extraction and analysis so that the clustering algorithm can generate more accurate labels, or by using the extracted features as soft labels to reduce the influence of pseudo-label noise.

2.2 Re-ID via generating images

Recently, with the rapid development of generative adversarial networks (GANs), researchers have tried to solve the problem of pedestrian re-identification from the perspective of style transfer. Huang et al. proposed SBSGAN [6], which removes the background area of an image by generating a soft mask; this method can effectively suppress the errors of the image segmentation method. Zhong et al. [7] applied StarGAN [8] to transform images across the different camera styles in the target domain. The positive samples obtained during training adopted the style of the same camera and were combined with the original target domain image, the source domain image, and the transformed image to form triplets for training the neural network. Wei et al. proposed PTGAN [9] to transfer images from the source domain to the target domain. This method introduced a pedestrian background segmentation image on the basis of CycleGAN [10] to verify the consistency of the pedestrian area before and after the style transfer. SPGAN [11], proposed by Deng et al. and built on CycleGAN, added similarity preservation so that the identity remains unchanged before and after conversion; this approach made the generated images more reasonable.

This type of approach relies on the quality of the images generated by the GAN; however, images from surveillance videos generally exhibit poor quality and noise, which makes the quality of the style-transferred images unstable. Nevertheless, this method makes full use of the images in the source domain, so it is essentially complementary to pseudo-label-based methods. Therefore, this type of method must use images from the application scene to further improve the model after the style transfer. This enables the model to be transferred to the application environment and improves its robustness in that scene.

2.3 Re-ID via exemplar classification

This type of re-ID method focuses on obtaining and utilizing better relationships between samples. Zhong et al. [12] proposed a prediction method based on graph neural networks to determine whether two samples were real neighboring samples. Ding et al. [13] selected nearby samples by setting a distance threshold. Considering that the imbalance of adjacent samples for each instance can bias learning, they suppressed this phenomenon by applying a loss function.

Although this method demonstrates superior performance, the relationships between samples require further research. For example, it is necessary to design an effective loss function so that the model can learn more nuanced feature relationships among samples rather than being limited to whether or not two images show the same identity.

2.4 Re-ID via unsupervised domain adaptation

Methods based on unsupervised domain adaptation (UDA) follow the traditional domain-adaptive framework: they aim to eliminate or reduce the differences between domains and transfer discriminative information from the source domain to the target domain.

Both Delorme et al. [14] and Qi et al. [15] proposed camera-based GAN methods to address the data distribution differences in cross-domain pedestrian re-ID tasks. Ge et al. [16] used joint training of two networks to ensure their independence and employed soft labels to alleviate the noise problem of the clustering algorithm. However, their integrated loss function considered only the differences between samples and not the neighborhood invariance.

Compared with methods based on pseudo-labels or instance classification, this approach achieves lower performance. However, in pedestrian re-identification tasks, it shows that the effect of transfer learning is better at the feature level than at the image level.

According to the above works, the main challenge for pedestrian re-identification is how to effectively use labeled source domain datasets and the large number of unlabeled images in application scenarios. Among the existing methods, re-ID via generating pseudo-labels directly ignores the noise of the pseudo-labels generated by the clustering algorithm, and re-ID via generating images can only produce images similar in style to the target domain images; because these are not images from the application scenario itself, they also contain noise. Consequently, the performance of the trained network cannot meet application requirements. Although methods based on domain adaptation focus on distinguishing different samples, the model must also notice that similar samples share some discriminative features, which is the neighborhood invariance. To alleviate these problems, we propose our model, which is introduced in Section 3.

3 Proposed method

The method proposed herein has three stages. First, SPGAN is used to transfer the style of the source domain dataset, which makes the sample style of the source domain dataset similar to that of the target domain sample. Additionally, the samples from the source domain dataset after the style transfer are independent of any samples in the target domain dataset. Then, supervised training is conducted on the source domain data to search for discriminative features shared between the target domain and source domain datasets. Finally, a clustering algorithm is used to generate labels for the unlabeled data, and the teacher–student model is applied to update the parameters of the network. In this way, the model gradually adapts to the sample style of the target domain dataset.

3.1 SPGAN

SPGAN is a style-transfer-based learning framework composed of a Siamese network (SiaNet) and a CycleGAN. Through coordination between these two networks, SPGAN can generate samples that adopt the style of the target domain while maintaining their identity information, which allows the network to converge faster in the teacher–student model.

3.1.1 CycleGAN

The main principle of CycleGAN is that when an image is transferred from the style of dataset A to the style of dataset B, the new image should be restorable to an image similar to the original via a second style transfer, with the main information retained. This concept is shown schematically in Fig. 3.

Fig. 3

The defining principles of CycleGAN. When the style transfer is performed on the source domain dataset, the style of the resulting image should be the same as the style of the target domain; after the second style transfer, the resulting image should still retain the information of the source domain. The same should be true for the style transfer from the target domain to the source domain

The discriminator \(D_{Y}\) should effectively determine whether an image has the style of the target domain Y, while the generator G (\(\text {X} \rightarrow \text {Y}\)) should effectively generate images in the style of the target domain Y. The loss function involving the generator G and the discriminator \(D_{Y}\) is shown in (1),

$$ \begin{array}{@{}rcl@{}} {L_{YGAN}}(G,{D_{Y}},X,Y) &=& {E_{y\sim p_{y}}}[{({D_{Y}}(y) - 1)^{2}}]\\ &&+ {E_{x\sim p_{x}}}[{D_{Y}}{(G(x))^{2}}] \end{array} $$
(1)

where \(p_{x}\) and \(p_{y}\) represent the sample distributions of the X and Y datasets, respectively.
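To make the adversarial objective concrete, the following minimal PyTorch-style sketch computes the least-squares loss of (1); the callables `D_Y` and `G` and the batch tensors `x`, `y` are illustrative assumptions rather than the authors' released code, and the symmetric loss for the opposite direction follows the same pattern.

```python
def lsgan_loss_y(D_Y, G, x, y):
    """L_YGAN of (1): push D_Y(y) toward 1 for real target images
    and D_Y(G(x)) toward 0 for translated source images."""
    real_term = ((D_Y(y) - 1.0) ** 2).mean()   # E_{y~p_y}[(D_Y(y) - 1)^2]
    fake_term = (D_Y(G(x)) ** 2).mean()        # E_{x~p_x}[D_Y(G(x))^2]
    return real_term + fake_term
```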

Similarly, the loss function involving the generator F and the discriminator \(D_{X}\) is shown in (2):

$$ \begin{array}{@{}rcl@{}} {L_{XGAN}}(F,{D_{X}},Y,X) &=& {E_{x\sim p_{x}}}[{({D_{X}}(x) - 1)^{2}}]\\ &&+ {E_{y\sim p_{y}}}[{D_{X}}{(F(y))^{2}}] \end{array} $$
(2)

In particular, the discriminator D is used to determine whether the style of an image after the style transfer matches the style of the images in the target domain.

After the style transfer, the image should be restorable to one similar to the original by applying the other generator. The loss function for this case is shown in (3):

$$ \begin{array}{@{}rcl@{}} {L_{cyc}}(G,F) &=& {E_{x\sim p_{x}}}[{\left\| {F(G(x)) - x} \right\|_{1}}] \\ &&+ {E_{y\sim p_{y}}}[{\left\| {G(F(y)) - y} \right\|_{1}}] \end{array} $$
(3)

According to the ablation experiments (Section 4.3, vide infra), the source domain and the target domain share common discriminative features, which supports the effectiveness of the UDA method. Therefore, while satisfying the above conditions, the image after the style transfer should retain the information of the original image as much as possible, giving the loss function shown in (4):

$$ \begin{array}{@{}rcl@{}} {L_{id}}(G,F,X,Y) &=& {E_{x\sim p_{x}}}[{\left\| {F(x) - x} \right\|_{1}}]\\ && + {E_{y\sim p_{y}}}[{\left\| {G(y) - y} \right\|_{1}}] \end{array} $$
(4)
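A compact sketch of (3) and (4), under the same illustrative assumptions as above (generators `G` and `F` as callables, batched image tensors `x` and `y`); the per-element mean of the absolute difference stands in for the expected L1 norm up to a constant factor.

```python
def cycle_and_identity_losses(G, F, x, y):
    """L_cyc of (3) restores each image after a round trip through both
    generators; L_id of (4) keeps each image close to itself under the
    opposite-direction generator."""
    l_cyc = (F(G(x)) - x).abs().mean() + (G(F(y)) - y).abs().mean()
    l_id = (F(x) - x).abs().mean() + (G(y) - y).abs().mean()
    return l_cyc, l_id
```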

3.1.2 SiaNet and the Loss Function of SPGAN

For SPGAN, the sample should retain the information contained in the original sample after the style transfer is complete, and it should be independent of any sample in the target domain (reflecting the exemplar invariance and camera invariance of the re-ID task). Therefore, adding SiaNet on top of CycleGAN constrains the learning process of the mapping functions (Fig. 4).

Fig. 4

The principle of SPGAN: First, the style of the source domain image is transferred based on the image style of the target domain. Note that the image after the style transfer does not belong to any category in the target domain. Then, SiaNet is used to constrain the learning process of the mapping function

The similarity preservation loss function employed to train SiaNet is shown in (5),

$$ \begin{array}{@{}rcl@{}} {L_{con}}(i,{x_{1}},{x_{2}}) &=& (1 - i){\{ \max (0,m - d)\}^{2}} + i \cdot {d^{2}},\\ &&\qquad {m \in [0,2]} \end{array} $$
(5)

where x1 and x2 are a pair of input vectors; d is the Euclidean distance between the two vectors; i indicates whether x1 and x2 are a pair of positive samples (i = 1 for positive pairs; i = 0 for negative pairs); and m represents the discrimination boundary between positive and negative samples in the feature space and determines the proportion of positive and negative samples in the loss function. When m = 0, negative samples are ignored by the loss function and cannot contribute to back-propagation; when m > 0, both positive and negative samples are included in the loss function.
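A minimal PyTorch sketch of (5), assuming `x1` and `x2` are batches of feature vectors and `i` is a 0/1 float tensor of pair labels; the margin default is illustrative.

```python
import torch

def contrastive_loss(i, x1, x2, m=2.0):
    """Similarity-preserving loss of (5): i = 1 pulls positive pairs
    together, i = 0 pushes negative pairs beyond the margin m."""
    d = (x1 - x2).norm(dim=1)                          # Euclidean distance per pair
    pos = i * d.pow(2)
    neg = (1 - i) * torch.clamp(m - d, min=0).pow(2)
    return (pos + neg).mean()
```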

3.1.3 Loss Function of SPGAN

Through the loss function of SiaNet, it is possible to reduce the distance between positive sample pairs and increase the distance between negative sample pairs. Therefore, the overall loss function of SPGAN is shown in (6),

$$ {L_{SPGAN}} = {L_{XGAN}} + {L_{YGAN}} + {\lambda_{1}}{L_{cyc}} + {\lambda_{2}}{L_{id}} + {\lambda_{3}}{L_{con}} $$
(6)

where λ1, λ2, and λ3 are the weights that balance the loss terms (Fig. 5).

Fig. 5

Principles of the teacher–student model. Stage 1: Use SPGAN to transfer the image style of the source domain dataset to the target domain dataset. Stage 2: Pre-train the model with the source domain data set after the style transfer. Stage 3: Use the teacher–student model to train the network to adapt to the unlabeled target domain dataset

After introducing SPGAN to process the source domain dataset, the pre-training stage can be provided with images that are more similar in style to the target domain, thereby improving the reliability of the pre-trained model. From Fig. 6 in the ablation experiments (Section 4.3), it can be seen that although a model pre-trained on the dataset generated by SPGAN still cannot meet application requirements, it achieves higher performance. We therefore conclude that, compared with directly using the source domain dataset, using the SPGAN-processed dataset provides the pre-trained model with a small amount of information unique to the target domain and improves its reliability.

Fig. 6

Comparison of the mAP curves of the source domain dataset and the target domain dataset in the pretraining phase

3.2 Teacher–student model

The teacher–student model comprises two steps. First, a clustering algorithm is used to generate pseudo-labels, and then, those pseudo-labels are used for collaborative training of the network. This process is illustrated in Fig. 5.

3.2.1 Pre-training of teacher–student model

The most recent UDA methods focus on pre-training on the source domain dataset to identify common discriminative features and gradually adapting to the environment of the target domain via transfer learning. Although it is difficult for the network to achieve satisfactory performance immediately after pre-training (because of distinct camera parameters, brightness, environments, and other factors), the ablation experiments described in Section 4.3 (vide infra) indicated that while the network is trained on the source domain, its accuracy on the target domain gradually increases. SPGAN is more suitable for determining the common discriminative features while retaining the source domain information, and this approach also helps the teacher–student model learn faster.

The loss function of the traditional UDA-based re-ID task consists of a classification loss function and a triplet loss function, as shown in (7) and (8), respectively:

$$ L_{id}^{s}(\theta ) = \frac{1}{{N_{s}}}\sum\limits_{i = 1}^{N_{s}} {{L_{ce}}({C^{s}}(F({x_{i}^{s}}|\theta )),{y_{i}^{s}})} $$
(7)

In (7), \(L_{id}^{s}(\theta )\) represents the classification loss function of the source domain; \(N_{s}\) represents the sample size of the source domain; \(L_{ce}\) is the cross-entropy loss function; \(F({x_{i}^{s}}|\theta )\) is the feature extracted by the pre-training network; \({C^{s}}(F({x_{i}^{s}}|\theta ))\) is the source domain classifier, which determines whether the output of the pre-training model matches the label of the corresponding sample; and \({y_{i}^{s}}\) represents the label of the i-th sample of the source domain.

$$ L_{tri}^{s}(\theta ) = \frac{1}{{N_{s}}}\sum\limits_{i = 1}^{N_{s}} {\max \left(0,\left\| F({x_{i}^{s}}|\theta ) - F(x_{i,p}^{s}|\theta )\right\| + m - \left\| F({x_{i}^{s}}|\theta ) - F(x_{i,n}^{s}|\theta )\right\|\right)} $$
(8)

In (8), \(L_{tri}^{s}(\theta )\) represents the triplet loss function of the source domain; \(x_{i,p}^{s}\) represents the positive sample for the i-th sample; \(x_{i,n}^{s}\) represents the negative sample for the i-th sample; m represents the margin between the positive and negative samples; and \(\left\| \cdot \right\|\) denotes the norm distance.

The pre-trained loss function is shown in (9),

$$ L_{pre}^{s}(\theta ) = (1 - \lambda )L_{tri}^{s}(\theta ) + \lambda L_{id}^{s}(\theta ) $$
(9)

where λ represents the weight relationship between the classification function and the triplet loss function.
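The pre-training objective of (7)–(9) can be sketched as follows in PyTorch; `logits`, `feat`, and the mined positive/negative features are assumed inputs, and the margin and weight defaults are illustrative rather than the tuned values.

```python
import torch
import torch.nn.functional as F

def pretrain_loss(logits, labels, feat, feat_p, feat_n, m=0.3, lam=0.5):
    """L_pre of (9): lam-weighted sum of cross-entropy (7) and triplet loss (8)."""
    l_id = F.cross_entropy(logits, labels)                # classification term (7)
    d_pos = (feat - feat_p).norm(dim=1)                   # anchor-positive distance
    d_neg = (feat - feat_n).norm(dim=1)                   # anchor-negative distance
    l_tri = torch.clamp(d_pos + m - d_neg, min=0).mean()  # hinge of (8)
    return (1 - lam) * l_tri + lam * l_id
```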

In the pre-training model, the basic framework was a ResNet [17] incorporating IBN-Net [18], proposed by Pan et al. to improve the performance of cross-domain transfer learning. Considering that each extracted feature should be determined from global features, a Transformer [19] was used to replace the final convolutional layer.

The advantage of the Transformer module is that it can process features in parallel. As in a CNN, the same knowledge is applied at all image locations. The Transformer also uses the Seq2seq [20] concept to ensure that each new output feature is obtained after summarizing and analyzing the global features.

The calculation process of the Transformer involves (10)–(14):

$$ \begin{array}{@{}rcl@{}} A &=& Sigmoid(X + Station) \end{array} $$
(10)
$$ \begin{array}{@{}rcl@{}} Q &=& {W_{q}}A + {b_{q}} \end{array} $$
(11)
$$ \begin{array}{@{}rcl@{}} K &=& {W_{k}}A + {b_{k}} \end{array} $$
(12)
$$ \begin{array}{@{}rcl@{}} V &=& {W_{v}}A + {b_{v}} \end{array} $$
(13)
$$ \begin{array}{@{}rcl@{}} Output &=& softmax(\frac{{Q{K^{T}}}}{{\sqrt {{d_{k}}} }})V \end{array} $$
(14)

In (10), A represents the activation of the input X after adding the position weight matrix Station. The Transformer maps a query and a set of key–value pairs to an output, where the query, keys, values, and outputs are all vectors.

In the Transformer, the matrix Q contains a set of queries packed together. The keys and values are likewise packed into matrices K and V, respectively, and \(d_{k}\) denotes the dimension of the keys.
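The computation of (10)–(14) corresponds to the following single-head sketch; the module name and the `station` argument (standing in for the positional matrix Station of (10)) are illustrative assumptions, and the linear layers carry the biases of (11)–(13).

```python
import math
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    """One attention block following (10)-(14)."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)   # Q = W_q A + b_q, as in (11)
        self.w_k = nn.Linear(dim, dim)   # K = W_k A + b_k, as in (12)
        self.w_v = nn.Linear(dim, dim)   # V = W_v A + b_v, as in (13)

    def forward(self, x, station):
        a = torch.sigmoid(x + station)                    # activation of (10)
        q, k, v = self.w_q(a), self.w_k(a), self.w_v(a)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        return torch.softmax(scores, dim=-1) @ v          # output of (14)
```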


3.2.2 Updating parameters via distillation learning

When constructing the teacher–student model, it is important to pay attention to the following five issues:

1. Because pseudo-labels are not real labels, directly using the labels generated by the clustering algorithm will impact the accuracy of the results. Moreover, the imperfections of the clustering algorithm make the labels themselves noisy.

2. The model cannot supervise itself using the pseudo-labels that it generates; doing so achieves no learning effect and instead causes the learning results to diverge.

3. While updating the parameters, it is important to avoid forgetting the learned knowledge while adapting to the target domain.

4. Re-ID is an open-class problem, so the number of identities in the task is unknown.

5. During the training process, the three characteristics of the re-ID task (i.e., exemplar invariance, camera invariance, and neighborhood invariance) must be considered.

From the ablation experiments (Section 4.3, vide infra), the pseudo-labels generated by K-means and DBSCAN led the model to similar levels of accuracy; however, DBSCAN can cluster dense datasets of arbitrary shape, finds abnormal points while clustering, and introduces no bias into the clustering results, so it is not affected by the position of the initial cluster centers (as K-means is). Therefore, to make the model respect the open-class characteristic during learning, DBSCAN was selected to generate pseudo-labels.
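A hedged sketch of this step with scikit-learn's DBSCAN is given below; the `eps` and `min_samples` values are illustrative defaults, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def generate_pseudo_labels(features, eps=0.6, min_samples=4):
    """Cluster L2-normalized features with DBSCAN; samples labeled -1 are
    noise points and are dropped before the next training round, which is
    what lets the method respect the open-class nature of re-ID."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)
    keep = labels != -1                 # discard outliers found by DBSCAN
    return labels[keep], keep
```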

To prevent the model from using its own clustered pseudo-labels for self-supervision, a collaborative training network was built. The same input in each batch underwent two different data augmentations, and the output results supervised one another. This method guaranteed the independence of the two networks. In particular, after each training round, only the parameters with the better performance in the target domain were retained; therefore, in essence, only one network was trained.

To retain the discriminative features learned during pre-training and the subsequent steps of the learning process, this study applied the idea of distillation learning to update the parameters. The parameter update formulas are presented in (15),

$$ \begin{array}{l} {E^{(T)}}[{\theta_{1}}] = \alpha {E^{(T - 1)}}[{\theta_{1}}] + (1 - \alpha ){\theta_{1}}\\ {E^{(T)}}[{\theta_{2}}] = \alpha {E^{(T - 1)}}[{\theta_{2}}] + (1 - \alpha ){\theta_{2}} \end{array} $$
(15)

where \(E^{(T)}[\theta_{1}]\) and \(E^{(T)}[\theta_{2}]\) denote the temporally averaged parameters of the two networks at epoch T, initialized as \(E^{(0)}[\theta_{1}] = \theta_{1}\) and \(E^{(0)}[\theta_{2}] = \theta_{2}\); and α, with range (0,1], represents the proportion of old knowledge retained through distillation learning in each iteration.
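In PyTorch, the update of (15) is a one-line exponential moving average over parameters, sketched below with an illustrative `alpha`.

```python
import torch

@torch.no_grad()
def distill_update(mean_net, net, alpha=0.999):
    """E^(T)[theta] = alpha * E^(T-1)[theta] + (1 - alpha) * theta, as in (15)."""
    for p_mean, p in zip(mean_net.parameters(), net.parameters()):
        p_mean.mul_(alpha).add_(p, alpha=1.0 - alpha)
```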

To reduce the noise caused by the pseudo-labels themselves during learning, classification and triplet loss functions based on soft pseudo-labels were introduced. A soft pseudo-label represents the probability that the output corresponds to each identity. The classification losses based on soft pseudo-labels were determined using (16),

$$ \begin{array}{l} L_{sid}^{t}({\theta_{1}}|{\theta_{2}}) = - \frac{1}{{N_{t}}}\sum\limits_{i = 1}^{N_{t}} {\left({C_{2}^{t}}(F({x^{\prime}}_{i}^{t}|{E^{(T)}}[{\theta_{2}}])) \cdot \log {C_{1}^{t}}(F({x_{i}^{t}}|{E^{(T)}}[{\theta_{1}}]))\right)}\\ L_{sid}^{t}({\theta_{2}}|{\theta_{1}}) = - \frac{1}{{N_{t}}}\sum\limits_{i = 1}^{N_{t}} {\left({C_{1}^{t}}(F({x_{i}^{t}}|{E^{(T)}}[{\theta_{1}}])) \cdot \log {C_{2}^{t}}(F({x^{\prime}}_{i}^{t}|{E^{(T)}}[{\theta_{2}}]))\right)} \end{array} $$
(16)

where \({C_{j}^{t}}(F({x_{i}^{t}}|E^{(T)}[\theta_{j}]))\) represents the target domain classifier of the j-th network based on the parameters \(E^{(T)}[\theta_{j}]\); and \({x^{\prime}}_{i}^{t}\) and \({x_{i}^{t}}\) represent the two augmented versions of the i-th target-domain sample.
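A minimal sketch of one direction of (16), assuming raw classifier logits from the current network and from the temporally averaged peer; the symmetric term swaps the two arguments.

```python
import torch.nn.functional as F

def soft_id_loss(student_logits, mean_teacher_logits):
    """One direction of (16): the averaged peer's class probabilities
    supervise the current network's log-probabilities."""
    soft_label = F.softmax(mean_teacher_logits.detach(), dim=1)  # soft pseudo-label
    log_pred = F.log_softmax(student_logits, dim=1)
    return -(soft_label * log_pred).sum(dim=1).mean()
```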

The triplet loss functions based on soft labels are represented by (17):

$$ \begin{array}{l} L_{stri}^{t}({\theta_{1}}|{\theta_{2}}) = \frac{1}{{N_{t}}}\sum\limits_{i = 1}^{N_{t}} {{L_{bce}}({\tau_{i}}({\theta_{1}}),{\tau_{i}}({E^{(T)}}[{\theta_{2}}]))} \\ L_{stri}^{t}({\theta_{2}}|{\theta_{1}}) = \frac{1}{{N_{t}}}\sum\limits_{i = 1}^{N_{t}} {{L_{bce}}({\tau_{i}}({\theta_{2}}),{\tau_{i}}({E^{(T)}}[{\theta_{1}}]))} \end{array} $$
(17)
$$ {\tau_{i}}(\theta ) = \frac{{\exp (\left\| F({x_{i}^{t}}|\theta ) - F(x_{i,n}^{t}|\theta )\right\|)}}{{\exp (\left\| F({x_{i}^{t}}|\theta ) - F(x_{i,p}^{t}|\theta )\right\|) + \exp (\left\| F({x_{i}^{t}}|\theta ) - F(x_{i,n}^{t}|\theta )\right\|)}} $$
(18)
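Under the form of (18), one direction of (17) can be sketched as follows; the feature batches and the peer network's τ values are assumed inputs.

```python
import torch.nn.functional as F

def softmax_triplet(feat, feat_p, feat_n):
    """tau_i of (18): softmax weight of the anchor-negative distance."""
    d_p = (feat - feat_p).norm(dim=1)
    d_n = (feat - feat_n).norm(dim=1)
    return d_n.exp() / (d_p.exp() + d_n.exp())

def soft_triplet_loss(tau_student, tau_mean_teacher):
    """One direction of (17): binary cross-entropy between the tau values
    of the current network and the temporally averaged peer."""
    return F.binary_cross_entropy(tau_student, tau_mean_teacher.detach())
```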

The soft triplet loss function can be used to shorten the distance between a sample and its positive sample and extend the distance between a sample and its negative sample. However, the triplet loss function and the classification loss function do not consider bringing similar samples closer together; that is, only the exemplar invariance and camera invariance are considered, not the neighborhood invariance. Therefore, it is necessary to introduce a loss function that narrows the distance between similar samples, as shown in (19),

$$ \begin{array}{l} L_{pull}^{t}({\theta_{1}}|{\theta_{2}}) = \sum\limits_{i = 1}^{N_{t}} {\left\| F({x_{i}^{t}}|{E^{(T)}}[{\theta_{1}}]) - C({y_{i}^{t}})\right\|}\\ L_{pull}^{t}({\theta_{2}}|{\theta_{1}}) = \sum\limits_{i = 1}^{N_{t}} {\left\| F({x_{i}^{t}}|{E^{(T)}}[{\theta_{2}}]) - C({y_{i}^{t}})\right\|} \end{array} $$
(19)

where \(C({y_{i}^{t}})\) is the feature vector of the cluster center of a given class. Because this method employs the DBSCAN algorithm, the cluster center is taken as the center of the nearest n points of the same class around the current sample (i.e., a mini K-means-style clustering space is established for each point in each DBSCAN cluster). The overall loss function can then be summarized as shown in (20),

$$ \begin{array}{@{}rcl@{}} L({\theta_{1}},{\theta_{2}}) &=& {\beta_{1}}\left((1 - \lambda_{id}^{t})(L_{id}^{t}({\theta_{1}}) + L_{id}^{t}({\theta_{2}})) + \lambda_{id}^{t}(L_{sid}^{t}({\theta_{1}}|{\theta_{2}}) + L_{sid}^{t}({\theta_{2}}|{\theta_{1}}))\right)\\ &&+ {\beta_{2}}\left((1 - \lambda_{tri}^{t})(L_{tri}^{t}({\theta_{1}}) + L_{tri}^{t}({\theta_{2}})) + \lambda_{tri}^{t}(L_{stri}^{t}({\theta_{1}}|{\theta_{2}}) + L_{stri}^{t}({\theta_{2}}|{\theta_{1}}))\right)\\ &&+ (1 - {\beta_{1}} - {\beta_{2}})\left(L_{pull}^{t}({\theta_{1}}|{\theta_{2}}) + L_{pull}^{t}({\theta_{2}}|{\theta_{1}})\right) \end{array} $$
(20)

where \(\beta_{1}\), \(\beta_{2}\), and \(\beta_{1} + \beta_{2}\) lie in [0,1]; \(\beta_{1}\), \(\beta_{2}\), and \(1 - \beta_{1} - \beta_{2}\) are the weights of the classification loss, triplet loss, and attraction (pull) loss in the overall loss function; \(\lambda_{id}^{t}\) balances the soft and hard pseudo-labels in the classification loss function; and \(\lambda_{tri}^{t}\) balances the soft and hard pseudo-labels in the triplet loss function.
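Once the individual terms are computed, (20) reduces to the weighted sum below; each argument is assumed to be the symmetric sum over both networks, and the default weights are illustrative, not the tuned values from Section 4.

```python
def compound_loss(l_id, l_sid, l_tri, l_stri, l_pull,
                  beta1=0.5, beta2=0.3, lam_id=0.5, lam_tri=0.8):
    """Compound loss of (20): classification, triplet, and pull terms,
    each mixing hard and soft pseudo-label variants."""
    cls = (1 - lam_id) * l_id + lam_id * l_sid       # hard + soft classification
    tri = (1 - lam_tri) * l_tri + lam_tri * l_stri   # hard + soft triplet
    return beta1 * cls + beta2 * tri + (1 - beta1 - beta2) * l_pull
```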


4 Experiments

We conducted experiments on three datasets (Market-1501 [21], MSMT17 [22], and DukeMTMC [23]) and compared the performance of the developed method with others reported in the literature based on cumulative matching characteristics (CMC) and mean average precision (mAP). Ablation experiments were also carried out to illustrate the importance of each component in improving the performance.

4.1 Preparation

4.1.1 Dataset

The DukeMTMC dataset is a large-scale, labeled, multi-target, multi-camera pedestrian tracking dataset. It provides large-scale high-definition video data recorded by eight synchronized cameras, with more than 7000 single-camera tracks and more than 2700 independent identities. One image is sampled every 120 frames of video, yielding 36411 images. A total of 1404 identities appear in images captured by at least two cameras, and 408 identities (distractor IDs) appear in only one camera.

The MSMT17 dataset was introduced at CVPR 2018. It contains 126441 bounding boxes and 4101 identities, captured by 12 outdoor and three indoor cameras. Four days with different weather conditions were selected within a month, and three hours of video were collected each day, covering three time periods (morning, noon, and afternoon), allowing the dataset to better simulate real scenes with more complex lighting changes.

The Market-1501 dataset was collected on the campus of Tsinghua University in the summer; it was built and made public in 2015. It includes five high-definition cameras and one low-definition camera, which together captured 1501 pedestrians and 32668 detected pedestrian bounding boxes. Each pedestrian was captured by at least two cameras, and a given pedestrian may appear in multiple images from each camera. The training set contains 751 identities in 12936 images, an average of 17.2 training images per identity. The test set contains 750 identities in 19732 images, an average of 26.3 test images per identity. The pedestrian detection rectangles of the 3368 query images were drawn manually, whereas those in the gallery were detected using the DPM (Deformable Parts Model) detector (Table 1).

Table 1 Information from some image-based person re-identification datasets

4.1.2 Evaluation Indicators: mAP and CMC

Cumulative matching characteristic (CMC) curves and mean average precision (mAP) are commonly used to evaluate the performance of pedestrian re-identification algorithms.

The CMC curve comprehensively reflects the performance of the classifier, and it can indicate the probability that the matching target appears in a candidate list of size k. Intuitively, the CMC curve can be given in the form of a Rank-k accuracy rate, i.e., the probability that the correct match of the target appears in the top k positions of the match list. In the pedestrian re-identification problem, the algorithm performance is typically evaluated when k = 1,5,10. For example, the accuracy of Rank-1 indicates the probability that the correct match appears at the first place in the matching list, i.e., the probability that the system can return the correct result only by looking it up once. Generally, the final Rank-k accuracy rate refers to the average of the results obtained after querying all retrieval targets.

However, when there are multiple correct matches in the test set, the CMC accuracy rate cannot fully evaluate the algorithm, considering that the pedestrian re-identification algorithm should retrieve all of the correct targets. Essentially, while the algorithm considers precision, it should also consider recall. Therefore, mAP is used to account for the retrieval and recall capabilities of the algorithm. Specifically, the mAP calculation process needs to traverse all retrieval targets and calculate the average precision (AP) for each retrieval target to obtain the average. The AP calculation process involves computing the integral of the precision-recall (PR) curve. As a result, mAP considers the precision and recall of the target under certain thresholds. Therefore, mAP and Rank-k accuracy rates are typically used together as an evaluation index for pedestrian re-identification problems to ensure comprehensive evaluations of the algorithm performance.
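Both metrics can be computed from a query-by-gallery distance matrix, as in the following NumPy sketch; treating every hit uniformly and omitting the usual same-camera filtering are simplifying assumptions for brevity.

```python
import numpy as np

def cmc_and_map(dist, query_ids, gallery_ids, ks=(1, 5, 10)):
    """Rank-k (CMC) accuracy and mAP from a query-by-gallery distance matrix."""
    order = np.argsort(dist, axis=1)                     # nearest gallery first
    hits = gallery_ids[order] == query_ids[:, None]      # boolean match matrix
    cmc = {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
    aps = []
    for row in hits:
        pos = np.flatnonzero(row)                        # ranks of correct matches
        if pos.size == 0:
            continue
        precision_at_hits = np.arange(1, pos.size + 1) / (pos + 1)
        aps.append(precision_at_hits.mean())             # AP: area under PR curve
    return cmc, float(np.mean(aps))
```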

4.2 Comparative experiments

Four sets of comparative experiments were carried out: market-to-duke, duke-to-market, market-to-MSMT, and duke-to-MSMT (Table 2). In duke-to-market, DukeMTMC is used as the labeled source domain dataset to pre-train the model, and the teacher–student model then adapts the network to the unlabeled target domain dataset, Market-1501; the other settings are analogous.

Table 2 Experimental results of the proposed approach and state-of-the-art methods for Market1501, DukeMTMC-ReID, and MSMT17 datasets

All hyper-parameters were selected on the validation set of the duke-to-market task, the number of pseudo-label classes was set to 500, and IBN-ResNet50 was used as the basic framework. The same hyper-parameters were applied to the other three tasks.

These comparison experiments revealed that the method developed in this study demonstrated excellent performance in terms of mAP and CMC.

4.3 Ablation experiments

To confirm the role of each module in the developed model, ablation experiments were designed using the duke-to-market results for a self-comparison of the performance (Table 2).

When the number of K-means clusters was set to 500, the model achieved performance similar to that obtained with DBSCAN. However, it is important to recall that re-ID is an open-class problem, meaning that the number of identities in the environment is unknown in practical applications. Therefore, the samples in the source domain dataset must be independent of the samples in the target domain dataset, i.e., exemplar invariance and camera invariance must be taken into account (Table 3).

Table 3 Comparison of the impacts of pseudo-labels generated based on K-means-500, -600, and -700, and the DBSCAN clustering algorithm on model performance

By observing the changes in the mAP of the source and target domains without SPGAN processing during pre-training (Fig. 6), we determined that although the model initially performed poorly in the target domain, its target-domain performance gradually increased as training progressed. This indicates that although the two domains have different image styles, they share some discriminative features. It also confirms that SPGAN can accelerate the convergence of the teacher–student model (Table 4).

Table 4 The influence of SPGAN on the extraction of discriminative features shared by the source domain and target domain datasets during the pre-training stage of the model

Finally, to improve the decoupling ability of the model while limiting computational cost, we replaced the last two layers of the IBN-based ResNet with a Transformer and compared the performance. From Table 5, it is clear that the Transformer improves the accuracy of the algorithm.

Table 5 The impact of Transformer on model performance

5 Conclusion

In this study, we built a semi-supervised pedestrian re-identification system based on the teacher–student model and SPGAN. It allows the pedestrian re-identification system to be trained with a small amount of labeled data and a large amount of unlabeled data from application scenarios, with performance that meets application requirements. Our main work is as follows:

1. We proposed a new loss function so that the teacher–student model satisfies exemplar invariance, camera invariance, and neighborhood invariance during training, reducing the impact of pseudo-label noise on the system.

2. By locally introducing a Transformer into ResNet, the decoupling ability of the model is improved while the speed of the system is preserved.

3. By introducing SPGAN to process the labeled source domain dataset, the pre-trained model is provided with more labeled data sharing the same discriminative features as the target domain.