
1 Introduction

Person Re-identification (re-id) aims to match a person across multiple non-overlapping camera views [14]. It is a very challenging problem because a person’s appearance can change drastically across views, due to changes in various covariate factors independent of the person’s identity. These factors include viewpoint, body configuration, lighting, and occlusion (see Fig. 1). Among these factors, pose plays an important role in causing changes in a person’s appearance. Here pose is defined as a combination of viewpoint and body configuration. It is thus also a cause of self-occlusion. For instance, in the bottom-row examples in Fig. 1, the big backpacks carried by the three persons are in full display from the back, but reduced to mostly the straps from the front.

Fig. 1. The same person’s appearance can be very different across camera views, due to the presence of large pose variations.

Most existing re-id approaches [2, 9, 25, 34, 40, 47, 51, 63] are based on learning identity-sensitive and view-insensitive features using deep neural networks (DNNs). To learn such features, a large number of person images need to be collected in each camera view with variable poses. With the collected images, the model has a chance to learn which features are discriminative and invariant to camera view and pose changes. These approaches thus have a number of limitations. The first is a lack of scalability to large camera networks. Existing models require sufficient identities and sufficient images per identity to be collected from each camera view. However, manually annotating persons across views in a camera network is tedious and difficult even for humans. Importantly, in a real-world application, a camera network can easily consist of hundreds of cameras (e.g., those in an airport or shopping mall); annotating enough training identities from all camera views is infeasible. The second is a lack of generalizability to new camera networks. Specifically, when an existing deep re-id model is deployed to a new camera network, viewpoints and body poses often differ across networks; additional data thus need to be collected for model fine-tuning, which severely limits the model’s generalization ability. As a result of both limitations, although deep re-id models are far superior on large re-id benchmarks such as Market-1501 [61] and CUHK03 [25], they still struggle to beat hand-crafted feature based models on smaller datasets such as CUHK01 [24], even when they are pre-trained on the larger re-id datasets.

Even with sufficient labeled training data, existing deep re-id models face the challenge of learning identity-sensitive and view-insensitive features in the presence of large pose variations. This is because a person’s appearance is determined by a combination of identity-sensitive but view-insensitive factors and identity-insensitive but view-sensitive ones, which are inter-connected. The former correspond to semantically related identity properties, such as gender, carried items, color, and texture. The latter are the covariates mentioned earlier, including pose. Existing models aim to keep the former and remove the latter in the learned feature representations. However, these two aspects of appearance are not independent; e.g., the appearance of a carried item depends on the pose. Making the learned features pose-insensitive means that the features supposed to represent the backpacks in the bottom-row examples in Fig. 1 are reduced to those representing only the straps – a much harder type of feature to learn.

In this paper, we argue that the key to learning an effective, scalable and generalizable re-id model is to remove the influence of pose on the person’s appearance. Without the pose variation, we can learn a model with much less data, thus making the model scalable to large camera networks. Furthermore, without needing to worry about pose variation, the model can concentrate on learning identity-sensitive features and coping with other covariates such as different lighting conditions and backgrounds. The model is thus far more likely to generalize to a new dataset from a new camera network. Moreover, with this different focus, the features learned without pose variation are different from, and complementary to, those learned with pose variation.

To this end, a novel deep re-id framework is proposed. Key to the framework is a deep person image generation model. The model is based on a generative adversarial network (GAN) designed specifically for pose normalization in re-id. It is thus termed pose-normalization GAN (PN-GAN). Given any person’s image and a desirable pose as input, the model outputs a synthesized image of the same identity with the original pose replaced by the new one. In practice, we define a set of eight canonical poses and synthesize eight new images for any given image, resulting in an 8-fold increase in the training data size. The pose-normalized images are used to train a pose-normalized re-id model which produces a set of features that are complementary to the features learned from the original images. The two sets of features are then fused as the final representation.

Contributions. Our contributions are as follows. (1) We identify pose as the chief culprit preventing a deep re-id model from learning effective identity-sensitive and view-insensitive features, and propose a novel solution based on generating pose-normalized images. This also addresses the scalability and generalizability issues of existing models. (2) A novel person image generation model, PN-GAN, is proposed to generate pose-normalized images which are realistic, identity-preserving and pose-controllable. With the synthesized images of canonical poses, strong and complementary features are learned and combined with features learned from the original images. Extensive experiments on several benchmarks show the efficacy of our proposed model. (3) A more realistic unsupervised transfer learning setting is considered in this paper. Under this setting, no data from the target dataset is used for model updating: the model trained on the labeled source domain is applied to the target domain without any modification.

2 Related Work

Deep Re-id Models. Most recently proposed re-id models employ a DNN to learn discriminative view-invariant features [2, 9, 25, 34, 40, 47, 51, 63]. They differ in the DNN architectures – some adopt a standard DNN developed for other tasks, whilst others use tailor-made architectures. They also differ in the training objectives: different models use different training losses, including identity classification, pairwise verification, and triplet ranking losses. A comprehensive study on the effectiveness of different losses and their combinations for re-id can be found in [12]. The focus of this paper is not on designing a new re-id deep model architecture or loss – we use an off-the-shelf ResNet architecture [16] and the standard identity classification loss. We show that once the pose variation problem is solved, re-id performance improves even with this simple architecture and loss.

Pose-Guided Deep Re-id. The negative effects of pose variation on deep re-id models have been recognised recently. A number of models [23, 39, 45, 50, 57, 58, 62] have been proposed to address this problem. Most of them are guided by body part detection. For example, [39, 57] first detect normalized part regions from a person image, and then fuse the features extracted from the original images and the part region images. These body part regions are predefined and the region detectors are trained beforehand. In contrast, [58] combines region selection and detection with deep re-id in a single model. Our model differs significantly from these models in that we synthesize realistic whole-body images using the proposed PN-GAN, rather than only focusing on body parts for pose normalization. Note that body parts are related to semantic attributes, which are often specific to different body parts. A number of attribute-based re-id models [11, 37, 44, 52] have been proposed. They use attributes to provide additional supervision for learning identity-sensitive features. In contrast, without using additional attribute information, our PN-GAN is learned as a conditional image generation model for the re-id problem.

Deep Image Generation. Generating realistic images of objects using DNNs has received much interest recently, thanks largely to the development of GAN [15]. GAN is designed to find the optimal discriminator network D between training data and generated samples using a min-max game, while simultaneously improving an image generator network G. It is formulated to optimize the following objective function:

$$\begin{aligned} \underset{G}{\mathrm {min}}\underset{D}{\mathrm {max}}{\mathcal {L}}_{GAN}&={\mathbb {E}}_{x\sim p_{data}\left( x\right) }\left[ \mathrm {log}D\left( x\right) \right] \nonumber \\&+{\mathbb {E}}_{z\sim p_{prior}\left( z\right) }\left[ \mathrm {log}\left( 1-D\left( G\left( z\right) \right) \right) \right] \end{aligned}$$
(1)

where \(p_{data}\left( x\right) \) and \(p_{prior}\left( z\right) \) are the distributions of the real data x and the Gaussian prior \(z\sim {\mathcal {N}}\left( {\mathbf {0}},{\mathbf {1}}\right) \). The training process iteratively updates the parameters of D and G with the loss functions \({\mathcal {L}}_{D}=-{\mathcal {L}}_{GAN}\) and \({\mathcal {L}}_{G}={\mathcal {L}}_{GAN}\) for the discriminator and generator respectively. To generate an image, a sample \(z\sim p_{prior}\left( z\right) ={\mathcal {N}}\left( {\mathbf {0}},{\mathbf {1}}\right) \) is drawn and passed through the generator network G, i.e., the generated image is G(z).
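To make the alternating optimization concrete, the following is a minimal sketch of one training iteration, written in PyTorch rather than the framework used in this paper; G, D and the two optimizers are assumed to be defined elsewhere, and D is assumed to end with a sigmoid so its output can be read as a probability.

```python
import torch

def gan_step(G, D, opt_G, opt_D, x_real, z_dim=100, eps=1e-8):
    """One alternating update of Eq. (1): D minimizes -L_GAN, G minimizes L_GAN."""
    z = torch.randn(x_real.size(0), z_dim)                      # z ~ N(0, 1)

    # Discriminator update: maximize log D(x) + log(1 - D(G(z)))
    opt_D.zero_grad()
    loss_D = -(torch.log(D(x_real) + eps) +
               torch.log(1 - D(G(z).detach()) + eps)).mean()
    loss_D.backward()
    opt_D.step()

    # Generator update: minimize log(1 - D(G(z))), the only term of L_GAN that depends on G
    opt_G.zero_grad()
    loss_G = torch.log(1 - D(G(z)) + eps).mean()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```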

Among all the variants of GAN, our pose-normalization GAN is built upon deep convolutional generative adversarial networks (DCGANs) [35]. Based on a standard convolutional decoder, DCGAN scales up GAN using convolutional neural networks (CNNs), resulting in stable training across various datasets. Many other variants of GAN exist, such as VAEGAN [21], Conditional GAN [18], and StackGAN [53]. However, most of them are designed for training with high-quality images of objects such as celebrity faces, rather than low-quality surveillance video frames of pedestrians. This problem is tackled in very recent works [22, 30]. Their objective is to synthesize person images in different poses, whilst our work aims to solve the re-id problem with the synthesized images. Besides, both of them use two coarse-to-fine generators/stages to generate images. As a result, their models are more complicated and harder to train.

Overall, our model differs from the existing variants of GAN. In particular, built upon residual blocks, our PN-GAN is learned to change the pose while keeping the identity of the input person. Note that the only work so far that uses a deep image generator for re-id is [65]. However, their model is not a conditional GAN and thus cannot control either identity or pose in the generated person images. As a result, the generated images can only be used as unlabeled or weakly labeled data. In contrast, our model generates strongly labeled data thanks to its ability to preserve the identity and remove the influence of pose variation.

Fig. 2. Overview of our framework. Given a person image, we use PN-GAN to synthesize auxiliary images with different poses. Base Networks A and B are then deployed to extract features from the original image and the synthesized images, respectively. Finally, the two types of features are merged for the final re-identification task. (Color figure online)

3 Methodology

3.1 Problem Definition and Overview

Problem Definition. Assume we have a training dataset of N persons \({\mathcal {D}}_{Tr}=\left\{ {\mathbf {I}}_{k},y_{k}\right\} _{k=1}^{N}\), where \({\mathbf {I}}_{k}\) and \(y_{k}\) are the person image and person id of the k-th person. In the training stage we learn a feature extraction function \(\phi \) so that a given image \({\mathbf {I}}\) can be represented by a feature vector \({\mathbf {f}}_{{\mathbf {I}}}=\phi ({\mathbf {I}})\). In the testing stage, given a pair of person images \(\left\{ {\mathbf {I}}_{i},{\mathbf {I}}_{j}\right\} \) in the testing dataset \({\mathcal {D}}_{Te}\), we need to judge whether \(y_{i}=y_{j}\) or \(y_{i}\ne y_{j}\). This is done by simply computing the Euclidean distance between \({\mathbf {f}}_{{\mathbf {I}}_{i}}\) and \({\mathbf {f}}_{{\mathbf {I}}_{j}}\); a smaller distance indicates a more likely identity match.

Framework Overview. As shown in Fig. 2, our framework has two key components, i.e., a GAN based person image generation model (Sect. 3.2) and a person re-id feature learning model (Sect. 3.3).

3.2 Deep Image Generator

Our image generator aims to produce images of the same person under different poses. In particular, given an input person image \({\mathbf {I}}_{i}\) and a desired pose image \({\mathbf {I}}_{{\mathcal {P}}_{j}}\), our image generator synthesizes a new person image \(\hat{{\mathbf {I}}}_{j}\), which contains the same person but with a different pose defined by \({\mathbf {I}}_{{\mathcal {P}}_{j}}\). As in any GAN model, the image generator has two components, a generator \(G_{P}\) and a discriminator \(D_{P}\). The generator is learned to edit the person image conditioned on a given pose; the discriminator discriminates real data samples from generated samples and helps to improve the quality of the generated images.

Fig. 3. Schematic of our PN-GAN model.

Pose Estimation. The image generation process is conditioned on the input image and one additional factor: the desired pose, represented by a skeleton pose image. Pose estimation is performed by a pre-trained off-the-shelf model. More concretely, the off-the-shelf pose detection toolkit OpenPose [4] is deployed, which is trained without using any re-id benchmark data. Given an input person image \({\mathbf {I}}_{i}\), the pose estimator produces a pose image \({\mathbf {I}}_{{\mathcal {P}}_{i}}\), which localizes 18 anatomical key-points as well as their connections. In the pose images, the orientation of each limb is encoded by color (see Fig. 2, target pose). In theory, any pose from any person image can be used as a condition to control the pose of another person’s generated image. In this work, we focus on pose normalization, so we stick to the eight canonical poses shown in Fig. 4(a), to be detailed later.
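For illustration, a possible way to rasterize the 18 detected key-points into a 3-channel skeleton image is sketched below in Python; this is not part of OpenPose itself, and the (partial) limb list and the colour scheme are assumptions made purely for this sketch.

```python
import cv2
import numpy as np

# A partial, assumed list of (joint_a, joint_b) limb connections over the 18 key-points.
LIMBS = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
         (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13)]

def render_pose(keypoints, h=256, w=128, conf_thr=0.1):
    """keypoints: (18, 3) array of (x, y, confidence); returns a colour-coded skeleton image."""
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    for i, (a, b) in enumerate(LIMBS):
        if keypoints[a, 2] > conf_thr and keypoints[b, 2] > conf_thr:   # both joints detected
            colour = tuple(int(c) for c in np.random.RandomState(i).randint(0, 256, 3))
            pa = (int(keypoints[a, 0]), int(keypoints[a, 1]))
            pb = (int(keypoints[b, 0]), int(keypoints[b, 1]))
            cv2.line(canvas, pa, pb, colour, thickness=3)               # one colour per limb
    return canvas
```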

Generator. As shown in Fig. 3, given an input person image \({\mathbf {I}}_{i}\) and a target person image \({\mathbf {I}}_{j}\), which contains the same person as \({\mathbf {I}}_{i}\) but with a different pose \({\mathbf {I}}_{{\mathcal {P}}_{j}}\), our generator learns to replace the pose information in \({\mathbf {I}}_{i}\) with the target pose \({\mathbf {I}}_{{\mathcal {P}}_{j}}\) and generate the new image \(\hat{{\mathbf {I}}}_{j}\). The input to the generator is the concatenation of the input person image \({\mathbf {I}}_{i}\) and the target pose image \({\mathbf {I}}_{{\mathcal {P}}_{j}}\). Specifically, we treat the target body pose image \({\mathbf {I}}_{{\mathcal {P}}_{j}}\) as a three-channel image and directly concatenate it with the three-channel source person image as the input of the generator. The generator \(G_{P}\) is designed based on the “ResNet” architecture and is an encoder-decoder network [17]. The encoder-decoder network progressively down-samples \({\mathbf {I}}_{i}\) to a bottleneck layer, and then reverses the process to generate \(\hat{{\mathbf {I}}}_{j}\). The encoder contains 9 ResNet basic blocks.
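A PyTorch sketch of such a generator is given below, assuming 256×128 input images; the channel widths, normalization layers and kernel sizes are illustrative guesses, since the exact configuration is given in the supplementary material rather than here.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)          # y = f(x) + x keeps invariant (identity) information

class Generator(nn.Module):
    def __init__(self, ch=64, n_blocks=9):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, ch, 7, padding=3), nn.ReLU(True),           # 6 = 3 (image) + 3 (pose)
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(True),
            nn.Conv2d(ch * 2, ch * 4, 3, stride=2, padding=1), nn.ReLU(True),
            *[ResBlock(ch * 4) for _ in range(n_blocks)])             # 9-block bottleneck
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(ch * 4, ch * 2, 3, stride=2, padding=1, output_padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 2, ch, 3, stride=2, padding=1, output_padding=1), nn.ReLU(True),
            nn.Conv2d(ch, 3, 7, padding=3), nn.Tanh())

    def forward(self, img, pose):
        # Concatenate the source person image and the target pose image along the channel axis.
        return self.decoder(self.encoder(torch.cat([img, pose], dim=1)))
```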

The motivation for designing such a generator is to take advantage of residual learning when generating new images. A ResNet block learns \(y=f(x)+x\), which can pass invariant information from the bottom layers of the encoder to the decoder while changing the variable pose information. In this way, other appearance cues (e.g., clothing and background) are also preserved and passed to the decoder in order to generate \(\hat{{\mathbf {I}}}_{j}\). With this architecture (see Fig. 3), we have the best of both worlds: the encoder-decoder network helps extract the semantic information stored in the bottleneck layer, while the ResNet blocks pass rich invariant identity information to help synthesize more realistic images and, at the same time, change the variable pose information to realize pose normalization.

Formally, let \(G_{P}\left( \cdot \right) \) be the generator network, composed of an encoder subnet \(G_{Enc}\left( \cdot \right) \) and a decoder subnet \(G_{Dec}\left( \cdot \right) \). The objective of the generator network can be expressed as

$$\begin{aligned} {\mathcal {L}}_{G_{P}}={\mathcal {L}}_{GAN}+\lambda _{1}\cdot {\mathcal {L}}_{L_{1}}, \end{aligned}$$
(2)

where \({\mathcal {L}}_{GAN}\) is the GAN loss of Eq. (1), instantiated with the generator \(G_{P}\left( \cdot \right) \) and discriminator \(D_{P}\left( \cdot \right) \):

$$\begin{aligned} {\mathcal {L}}_{GAN}&={\mathbb {E}}_{{\mathbf {I}}_{j}\sim p_{data}\left( {\mathbf {I}}_{j}\right) }\left\{ \mathrm {log}D_{P}\left( {\mathbf {I}}_{j}\right) \right. \nonumber \\ +&\left. \mathrm {log}\left( 1-D_{P}\left( G_{P}\left( {\mathbf {I}}_{i},{\mathbf {I}}_{{\mathcal {P}}_{j}}\right) \right) \right) \right\} \end{aligned}$$
(3)

and \({\mathcal {L}}_{L_{1}}={\mathbb {E}}_{{\mathbf {I}}_{j}\sim p_{data}\left( {\mathbf {I}}_{j}\right) }\left[ \left\| {\mathbf {I}}_{j}-\hat{{\mathbf {I}}}_{j}\right\| _{1}\right] \), where \(\hat{{\mathbf {I}}}_{j}=G_{Dec}\left( G_{Enc}\left( {\mathbf {I}}_{i},{\mathbf {I}}_{{\mathcal {P}}_{j}}\right) \right) \) is the reconstructed image for \({\mathbf {I}}_{j}\) from the input image \({\mathbf {I}}_{i}\) with the body pose \({\mathbf {I}}_{{\mathcal {P}}_{j}}\). Here the \(L_{1}\)-norm is used to yield sharper and cleaner images. \(\lambda _{1}\) is the weighting coefficient to balance the importance of each term.

Discriminator. The discriminator \(D_{P}\left( \cdot \right) \) aims at learning to differentiate the input images as real or fake (i.e., a binary classification task). Given the input image \({\mathbf {I}}_{i}\) and target output image \({\mathbf {I}}_{j}\), the objective of the discriminator network can be formulated as

$$\begin{aligned} {\mathcal {L}}_{D_{P}}=-{\mathcal {L}}_{GAN}. \end{aligned}$$
(4)

Since our final goal is to obtain the best generator \(G_{P}\), the optimization iteratively minimizes the loss functions \({\mathcal {L}}_{G_{P}}\) and \({\mathcal {L}}_{D_{P}}\) until convergence. Please refer to the Supplementary Material for the detailed structures and parameters of the generator and discriminator.
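Under these definitions, one joint optimization step could look like the following PyTorch sketch (the original implementation is in TensorFlow); G_P, D_P and their optimizers are assumed to be defined as above, and D_P is assumed to output a probability.

```python
import torch

def pn_gan_step(G_P, D_P, opt_G, opt_D, img_i, pose_j, img_j, lam1=10.0, eps=1e-8):
    """One PN-GAN update following Eqs. (2)-(4)."""
    # ---- discriminator: minimize L_DP = -L_GAN -----------------------------
    opt_D.zero_grad()
    fake_j = G_P(img_i, pose_j).detach()
    loss_D = -(torch.log(D_P(img_j) + eps) +
               torch.log(1 - D_P(fake_j) + eps)).mean()
    loss_D.backward()
    opt_D.step()

    # ---- generator: minimize L_GP = L_GAN + lambda_1 * L_L1 ----------------
    opt_G.zero_grad()
    fake_j = G_P(img_i, pose_j)
    loss_gan = torch.log(1 - D_P(fake_j) + eps).mean()   # the G-dependent part of L_GAN
    loss_l1 = torch.abs(img_j - fake_j).mean()           # L1 reconstruction of the target image
    loss_G = loss_gan + lam1 * loss_l1
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```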

3.3 Person Re-id with Pose Normalization

As shown in Fig. 2, we train two re-id models. One model is trained using the original images in the training set to extract identity-sensitive features in the presence of pose variation. The other is trained using the pose-normalized images synthesized by our PN-GAN, to compute re-id features free of pose variation. The two are then fused as the final feature representation.

Fig. 4. Visualization of canonical poses. Note that the red crosses in (b) indicate the canonical poses obtained as the cluster means. (Color figure online)

Pose Normalization. We need to obtain a set of canonical poses that are representative of the typical viewpoints and body configurations exhibited by people captured by surveillance cameras in public spaces. To this end, we estimate the poses of all training images in a dataset and then group them into eight clusters \(\left\{ {\mathbf {I}}_{{\mathcal {P}}_{C}}\right\} _{C=1}^{8}\). We use VGG-19 [5] pre-trained on the ImageNet ILSVRC-2012 dataset to extract the features of each pose image, and the K-means algorithm is used to cluster the training pose images into canonical poses. The mean pose images of these clusters are then used as the canonical poses. The eight poses obtained on Market-1501 [61] are shown in Fig. 4(a). With these poses, given each image \({\mathbf {I}}_{i}\), our generator synthesizes eight images \(\left\{ \hat{{\mathbf {I}}}_{i,{\mathcal {P}}_{C}}\right\} _{C=1}^{8}\) by replacing the original pose with each canonical pose.
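A possible implementation of this clustering step is sketched below using torchvision and scikit-learn; `load_pose_images` is a hypothetical loader, and because averaging skeleton images directly would blur them, the sketch takes the pose image closest to each cluster centre as a practical stand-in for the cluster mean.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans
from torchvision import models, transforms

vgg = models.vgg19(pretrained=True).features.eval()        # ImageNet pre-trained VGG-19
preprocess = transforms.Compose([transforms.Resize((224, 224)),
                                 transforms.ToTensor()])    # pose images assumed to be PIL images

@torch.no_grad()
def pose_feature(pose_img):
    x = preprocess(pose_img).unsqueeze(0)
    return vgg(x).flatten().numpy()

pose_images = load_pose_images()                            # hypothetical loader for all training pose images
feats = np.stack([pose_feature(p) for p in pose_images])
kmeans = KMeans(n_clusters=8, random_state=0).fit(feats)

# Canonical pose: the pose image closest to each cluster centre in feature space.
canonical = []
for c in range(8):
    idx = np.where(kmeans.labels_ == c)[0]
    dists = np.linalg.norm(feats[idx] - kmeans.cluster_centers_[c], axis=1)
    canonical.append(pose_images[idx[np.argmin(dists)]])
```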

Re-id Feature with Pose Variation. We train one re-id model with the original training images to extract re-id features in the presence of pose variation. The ResNet-50 model [16] is used as the base network. It is pre-trained on the ILSVRC-2012 dataset, and fine-tuned on the training set of a given re-id dataset to classify the training identities. We name this network ResNet-50-A (Base Network A), as shown in Fig. 2. Given an input image \({\mathbf {I}}_{i}\), ResNet-50-A produces a feature set \(\left\{ {\mathbf {f}}_{{\mathbf {I}}_{i},layer}\right\} \), where layer indicates the layer of the network from which the re-id features are extracted. Note that, in most existing deep re-id models, features are computed only from the final convolutional layer. Inspired by [29], which shows that layers before the final layer of a DNN often contain useful mid-level identity-sensitive information, we merge the 5a, 5b and 5c convolutional blocks of ResNet-50 into a 1024-d feature vector via an FC layer.
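The following PyTorch sketch shows one way to read this feature head; the mapping of blocks 5a/5b/5c to `layer4[0..2]` of torchvision's ResNet-50 and the use of global average pooling before the FC layer are interpretations of the text, not the authors' Caffe definition.

```python
import torch
import torch.nn as nn
from torchvision import models

class ReidNet(nn.Module):
    def __init__(self, num_ids, feat_dim=1024):
        super().__init__()
        base = models.resnet50(pretrained=True)
        self.stem = nn.Sequential(base.conv1, base.bn1, base.relu, base.maxpool,
                                  base.layer1, base.layer2, base.layer3)
        # conv5 stage: blocks 5a, 5b, 5c correspond to layer4[0], layer4[1], layer4[2]
        self.block5a, self.block5b, self.block5c = base.layer4[0], base.layer4[1], base.layer4[2]
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2048 * 3, feat_dim)            # merge 5a, 5b, 5c into a 1024-d feature
        self.classifier = nn.Linear(feat_dim, num_ids)     # identity classification loss

    def forward(self, x):
        x = self.stem(x)
        f5a = self.block5a(x)
        f5b = self.block5b(f5a)
        f5c = self.block5c(f5b)
        feats = torch.cat([self.pool(f).flatten(1) for f in (f5a, f5b, f5c)], dim=1)
        feat = self.fc(feats)                              # 1024-d re-id feature
        return feat, self.classifier(feat)
```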

Re-id Feature Without Pose Variation. The second model, called ResNet-50-B, has the same architecture as ResNet-50-A, but performs feature learning using the pose-normalized synthetic images. For each image \({\mathbf {I}}_{i}\), we thus obtain eight feature vectors for the eight poses, \(\left\{ {\mathbf {f}}_{\hat{{\mathbf {I}}}_{i,{\mathcal {P}}_{C}}}\right\} _{C=1}^{8}\).

Testing Stage. Once ResNet-50-A and ResNet-50-B are trained, during testing each gallery image is fed into ResNet-50-A to obtain one feature vector; its eight synthesized images of the canonical poses are fed into ResNet-50-B to obtain eight pose-free features, with an extra FC layer fusing the original feature with each pose feature. This can be done offline. Given a query image \({\mathbf {I}}_{q}\), we do the same to obtain nine feature vectors \(\left\{ {\mathbf {f}}_{{\mathbf {I}}_{q}},{\mathbf {f}}_{\hat{{\mathbf {I}}}_{q,{\mathcal {P}}_{C}}}\right\} \). Since Maxout and max-pooling have been widely used in multi-query video re-id, we fuse the nine feature vectors into one final feature vector with an element-wise maximum operation. We then calculate the Euclidean distance between the final feature vectors of the query and gallery images and use this distance to rank the gallery images.
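The test-time fusion and ranking can be summarized by the short NumPy sketch below, assuming all feature vectors have already been extracted (and, for the gallery, fused offline).

```python
import numpy as np

def fuse(feat_a, feats_b):
    """feat_a: (d,) original-image feature; feats_b: (8, d) pose-normalized features."""
    return np.max(np.vstack([feat_a[None, :], feats_b]), axis=0)   # element-wise max over 9 vectors

def rank_gallery(query_fused, gallery_fused):
    """gallery_fused: (n_gallery, d) fused gallery features, pre-computed offline."""
    dists = np.linalg.norm(gallery_fused - query_fused[None, :], axis=1)
    return np.argsort(dists)          # gallery indices from most to least similar
```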

4 Experiments

4.1 Datasets and Settings

Experiments are carried out on four benchmark datasets:

Market-1501. [61] is collected from 6 different camera views. It has 32,668 bounding boxes of 1,501 identities obtained using a Deformable Part Model (DPM) person detector. Following the standard split [61], we use 751 identities with 12,936 images for training and the remaining 750 identities with 19,732 images for testing. The training set is used to train our PN-GAN model.

CUHK03. [25] contains 14,096 images of 1,467 identities, captured by six camera views with 4.8 images per identity per camera on average. We use the more realistic yet harder “detected” person images setting. The training and testing sets consist of 1,367 identities and 100 identities respectively. The testing process is repeated with 20 random splits following [25].

DukeMTMC-reID. [36] is constructed from the multi-camera tracking dataset DukeMTMC. It contains 1,812 identities. Following the evaluation protocol of [65], 702 identities are used as the training set and the remaining 1,110 identities as the testing set. During testing, one image per identity per camera is used as the query and the remaining images form the gallery set.

CUHK01. [24] has 971 identities with 2 images per person in each of two disjoint camera views. As in [24], we use the images of camera A as the probe and those of camera B as the gallery. 486 identities are randomly selected for testing and the remaining for training. The experiments are repeated 10 times and the average results are reported.

Table 1. Results on Market-1501. ‘-’ indicates not reported. Note that *: for [65], we report the results of both Basel. + LSRO and Verif.-Identif. + LSRO. Our model only uses the identification loss, so it should be compared with Basel. + LSRO, which uses the same ResNet-50 base network and the same loss.

Evaluation Metrics. Two evaluation metrics are used to quantitatively measure re-id performance. The first is rank-k matching accuracy, reported at Rank-1, Rank-5 and Rank-10. For the Market-1501 and DukeMTMC-reID datasets, the mean Average Precision (mAP) is also used.

Implementation Details. Our model is implemented in TensorFlow [1] (the PN-GAN part) and Caffe [19] (the re-id feature learning part). The \(\lambda _{1}\) in Eq. (2) is empirically set to 10 in all experiments. We use the two-step fine-tuning strategy of [13] to fine-tune the re-id networks. The input images are resized to \(256\times 128\). Adam [20] is used to train the PN-GAN model (learning rate 0.0002, \(\beta _{1}=0.5\), batch size 32) and the re-id networks (learning rate 0.00035, \(\beta _{1}=0.9\), batch size 16). The dropout ratio is set to 0.5. The PN-GAN model and the re-id networks converge in 19 h and 8 h respectively on Market-1501 with one NVIDIA 1080Ti GPU.
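For concreteness, the two optimizer configurations are written out below as PyTorch optimizers (the original code uses TensorFlow and Caffe); `generator` and `reid_net` are assumed to be defined elsewhere, and \(\beta _{2}\) is left at its default value since it is not specified here.

```python
import torch

# PN-GAN: lr 0.0002, beta1 = 0.5 (batch size 32); re-id networks: lr 0.00035, beta1 = 0.9 (batch size 16).
opt_pn_gan = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_reid = torch.optim.Adam(reid_net.parameters(), lr=3.5e-4, betas=(0.9, 0.999))
```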

Table 2. Results on CUHK01 and CUHK03. Note that both Spindle [57] and HP-net [29] reported higher results on CUHK03, but those results were obtained under a very different setting: six auxiliary re-id datasets are used, and both labeled and detected bounding boxes are used for training and testing. Their results are therefore not comparable to those in this table.

Experimental Settings. Experiments are conducted under two settings. The first is the standard Supervised Learning (SL) setting on all datasets: the models are trained on the training set of a dataset and evaluated on its testing set. The second is the Transfer Learning (TL) setting, applied only to CUHK03, CUHK01, and DukeMTMC-reID. Specifically, the re-id model is trained on the Market-1501 dataset. We then directly use the trained single model for testing (i.e., to synthesize images with canonical poses and to extract the nine feature vectors) on the test sets of CUHK03, CUHK01, and DukeMTMC-reID. That is, no model updating is done using any data from these three datasets. The TL setting is especially useful in real-world scenarios, where a pre-trained model needs to be deployed to a new camera network without any model fine-tuning. This setting thus tests how generalizable a re-id model is.

Table 3. Results on DukeMTMC-reID.

4.2 Supervised Learning Results

Results on Large-Scale Datasets. Tables 1, 3 and 2(a) compare our model with the best performing alternative models. We can make the following observations:

(1) On all three datasets, the results clearly show that, in the supervised learning setting, our results improve over the ResNet-50-A baseline by a clear margin. This validates that the synthetic person images generated by PN-GAN can indeed help the person re-id task.

(2) Compared with the existing pose-guided re-id models [39, 57, 62], our model is clearly better, indicating that synthesizing multiple normalized poses is a more effective way to deal with the large pose variation problem.

(3) Compared with the other re-id model that uses synthesized images for re-id model training [65], our model yields better performance on all datasets, with the gap on Market-1501 and DukeMTMC-reID being particularly clear. This is because our model can synthesize images with different poses, which can thus be used for supervised training. In contrast, the synthesized images in [65] do not correspond to any particular person identities or poses, so they can only be used as unlabeled or weakly-labeled data.

Results on Small-Scale Dataset. On the smaller CUHK01 dataset, Table 2(b) shows that our ResNet-50-A is again a strong baseline that beats almost all the other methods. By using the pose-normalized images generated by PN-GAN, our framework further boosts the performance of ResNet-50-A by more than \(3\%\) in the supervised setting, demonstrating the efficacy of our framework. Note that on the small CUHK01 dataset, the handcrafted feature + metric learning based models (e.g., NullReid [55]) are still quite competitive, often beating the more recent deep models. This reveals the limitations of existing deep models in scalability and generalizability. In particular, previous deep re-id models are pre-trained on large-scale training datasets such as CUHK03 and Market-1501, but they still struggle when fine-tuned on small datasets such as CUHK01 due to the covariate differences between them. With pose normalization, our model is more adaptive to small datasets, and the model pre-trained only on Market-1501 can be easily fine-tuned on the small datasets, achieving much better results than existing models.

Table 4. Ablation study: Rank-1 and Rank-5 results on the benchmarks.
Table 5. Ablation study on Market-1501: using 1 pose feature vs. 8 pose features.
Table 6. Rank-1/mAP results of ensembling two networks vs. ours. ‘A+B’ means training one ResNet-50-A and one ResNet-50-B model.

4.3 Transfer Learning Results

We report the results obtained under the TL setting on three datasets – CUHK03, CUHK01, and DukeMTMC-reID – in Tables 2(b) and 3 respectively. On the CUHK01 dataset, we achieve \(27.58\%\) Rank-1 accuracy (Table 2(b)), which is comparable to some models trained under the supervised learning setting, such as eSDC [59]. These results show that our model has the potential to generalize to new re-id data from new camera networks when operating in a ‘plug-and-play’ mode. Our results are also compared against the ResNet-50-A (TL) baseline. On all three datasets, our model improves over the ResNet-50-A (TL) baseline. Again, this demonstrates that our pose-normalized person images also help person re-id in the transfer learning setting. Note that due to the intrinsic difficulty of the transfer setting, the results are still much lower than those in the supervised setting.

Fig. 5. Visualization of different poses generated by the PN-GAN model. (Color figure online)

4.4 Further Evaluations

Ablation Studies. (1) We first evaluate the contributions of the two types of features, computed by ResNet-50-A and ResNet-50-B respectively, towards the final performance. Table 4 shows that although ResNet-50-B alone performs poorly compared to other methods, when the two types of features are combined there is an improvement in the final results on all four datasets. This clearly indicates that the two types of features are complementary to each other. (2) In a second study, we compare the result obtained when features from all 8 poses are merged with that obtained with only one pose (Table 5). The mAP on Market-1501 drops from 72.58 to 69.60. This suggests that having eight canonical poses is beneficial – the quality of the generated image under one particular pose may be poor; using all eight poses thus reduces the sensitivity to the quality of the generated images for specific poses. (3) To verify that the performance gain comes from the synthesized images rather than from ensembling two networks, we conducted experiments on ensembling two ResNet-50-A models. As shown in Table 6, the gain from ensembling two ResNet-50-A models is clearly smaller than that from ensembling one ResNet-50-A and one ResNet-50-B, despite the fact that ResNet-50-B is much weaker than the second ResNet-50-A. These results thus suggest that our approach’s performance gain is not due to ensembling but to the complementary features extracted by the ResNet-50-B model.

Examples of the Synthesized Images. Figure 5 gives some examples of the synthesized image poses. Given one input image, our image generator can produce realistic images under different poses while keeping a visual appearance similar to that of the input person image. We find that: (1) even though we did not explicitly use attributes to guide the PN-GAN, the generated images of different poses have roughly the same visual attributes as the original images; (2) our model can help alleviate the problems caused by occlusion, as shown in the last row of Fig. 5: a man in a yellow shirt and grey trousers is occluded by a bicycle, yet our image generator can synthesize images that keep his key attributes whilst removing the occlusion.

5 Conclusion

We have proposed a novel deep person image generation model that synthesizes pose-normalized person images for re-id. In contrast to previous re-id approaches that try to extract discriminative features which are identity-sensitive but view-insensitive, the proposed method learns complementary features from both the original images and the pose-normalized synthetic images. Extensive experiments on four benchmarks show that our model achieves state-of-the-art performance. More importantly, we demonstrated that our model has the potential to generalize to new re-id datasets collected from new camera networks without any additional data collection or model fine-tuning.