
1 Introduction

Person re-identification is an important topic in computer vision and is widely used in video surveillance. Its aim is to associate pedestrians across the cameras of a multi-camera surveillance network. Here, we focus on person re-identification (the pedestrian association problem) in non-overlapping camera networks.

Many works have been proposed for this problem. Generally, they can be divided into two categories, i.e., supervised and unsupervised methods. Among supervised methods, distance metric learning is a classical approach [3, 4, 9, 12, 24]. Schwartz et al. [10] extracted high-dimensional features for metric learning. Porikli and Divakaran [11] proposed a distance metric together with a non-parametric, non-linear model of the color shift function based on color histograms. Li and Wang [12] proposed a method that learns different metrics for different camera image spaces. Chilgunde et al. [13] applied a Kalman filter to predict person movement in the blind areas of the monitoring cameras, and matched motion trails with a Gaussian model. Similarly, Riccardo et al. [14] proposed the Multi Social Force Model to predict pedestrian tracklets in blind regions. Javed et al. [15] used inter-camera space-time and appearance probabilities to find an object across cameras by maximizing the conditional probability of the corresponding observations. Among unsupervised methods, Ma et al. [20] modified the BiCov descriptor based on Gabor filters for person matching, and used the covariance descriptor to cope with illumination change and image misalignment. Farenzena et al. [17] proposed Symmetry-Driven Accumulation of Local Features to handle changes of camera viewpoint, exploiting the symmetry of pedestrian images. Prosser et al. [1] formulated person re-identification as a global ranking problem, and utilized an ensemble of RankSVMs to learn global feature weights. Liu et al. [9] adopted an attribute-based weighting scheme to find unique appearance attributes. However, most existing methods ignore some simple yet discriminative features that are actually useful for matching person pairs in human images.

Recently, Zhao et al. [5] proposed a more effective unsupervised salience learning method based on distinct regions that are discriminative, reliable and useful for person matching. The main idea of this method is to detect salient regions of the pedestrian body and exploit them when computing similarity scores between pedestrian images. Human salience is distinctive and has been shown to be important for person re-identification. Thus, how to obtain the salient areas of pedestrian images is a key question for salience-based pedestrian identification.

Fig. 1. The whole process of the proposed person re-identification algorithm.

Inspired by the salience learning method in [5], in this paper we first show that this kind of salient-area detection can be formulated as a general outlier detection problem. We then propose a novel unsupervised salience learning method, called Density-Distance salience learning, which applies a local outlier-detection technique to the person re-identification problem. The whole procedure is shown in Fig. 1, and the detailed algorithm is introduced in Sect. 3. One main feature of the proposed salience learning method is that it exploits distance and density information simultaneously. Compared with traditional methods, the proposed salience detection method is simple and efficient for person re-identification. Promising experimental results on the widely used VIPeR and CUHK01 datasets demonstrate that the proposed salience-based person re-identification method is more effective and efficient than some recent methods.

2 Salience Learning with Density-Distance Outlier Detection

In this section, a novel salience learning method is proposed. As mentioned in [5], the salient patches of pedestrian images generally have the following properties: (1) they deviate markedly from the other patches and thus provide distinctive information about the pedestrian body; (2) they are robust to changes of camera viewpoint. Some examples of salient patches are shown in Fig. 2.

Fig. 2. Illustration of person salience.

Intuitively, salient patches under this definition can be regarded as outliers in the patch dataset. This motivates us to explore outlier detection methods for salient patch detection. Outlier detection is a fundamental problem in data mining, where an outlier is defined as an observation that deviates markedly from the other observations [29, 30].

In the following, we first propose a local outlier-detection based salience learning method, called Density-Distance salience learning, and then apply it to the person re-identification problem. The core idea is to integrate both distance and density information for salience learning. Each human image is first densely segmented into local patches; then the dLabSift features (detailed in Sect. 3.1) are extracted for each patch and denoted as \(x_{A,i}(m,n)\), where (A, i) indicates the i-th person in camera A, and (m, n) indexes the patch located at the m-th row and the n-th column of the human image. The combined salience value is defined as:

$$\begin{aligned} \begin{aligned}&Svalue_{DD}(x_{A,i}(m,n)) = \alpha \cdot Svalue_{density}(x_{A,i}(m,n))\\&~~~~~~~~~~~~~~~~ + (1- \alpha )\cdot Svalue_{distance}(x_{A,i}(m,n)) \end{aligned} \end{aligned}$$
(1)

where \(Svalue_{density}(x_{A,i}(m,n))\) denotes the density salience value and \(Svalue_{distance}(x_{A,i}(m,n))\) denotes the distance salience value. \(\alpha \, (0< \alpha < 1)\) is a balance parameter. In the following, we introduce the distance salience and the density salience, respectively.
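Eq. (1) is a plain convex combination of the two per-patch salience values. A minimal NumPy sketch follows; the function name, the array-based layout, and the default value of `alpha` are our own illustrative choices, not from the paper:

```python
import numpy as np

def combined_salience(density_salience, distance_salience, alpha=0.5):
    """Combine density and distance salience values per patch, as in Eq. (1).

    Both inputs are arrays of per-patch salience values over the (m, n) grid;
    alpha (0 < alpha < 1) balances the two terms.
    """
    density_salience = np.asarray(density_salience, dtype=float)
    distance_salience = np.asarray(distance_salience, dtype=float)
    return alpha * density_salience + (1.0 - alpha) * distance_salience
```

Because the combination is element-wise, it applies equally to a single patch or to the whole \(M \times N\) salience map at once.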

2.1 Distance Salience Learning

Each human image is densely segmented into local patches, as shown in Fig. 3. Patches with discriminative properties are called salient patches. Indeed, such salient patches can be regarded as outliers in the patch set, as shown in Fig. 4. Byers et al. [21] proposed a method based on KNN distances that has been demonstrated to be useful for outlier detection. In this paper, we use the mean KNN distance for salience learning.

Fig. 3. Examples of dense patches. a(1)–a(4) are human images, and b(1)–b(4) are illustrations of dense local patches. The patches are dense and overlapping.

Fig. 4. Illustration of salient patches. The patches in the red and black dotted boxes denote the salient patches and the general patches, respectively.

Before computing the average neighbor distance of each testing patch, a reference set must be initialized. Let the number of images in the reference set be \(R_n\). After building dense correspondences between a testing image and the images in the reference set, each patch in the testing image finds an optimal matched patch in every reference image; that is, each testing patch \(x_{A,i}(m,n)\) has \(R_n\) matched patches, or neighbors, from the reference set, denoted \(X_{nn}(x_{A,i}(m,n))\):

$$\begin{aligned} X_{nn}(x_{A,i}(m,n))= \left\{ \mathop {\arg \max }\limits _{\hat{x}\in S_{i,j}} s(x_{A,i}(m,n),\hat{x}) \,\Big |\, j=1,2,\dots ,R_n \right\} \end{aligned}$$
(2)
$$\begin{aligned} S_{i,j}= S(x_{A,i}(m,n),\text {x}_{B,j}) \end{aligned}$$
(3)

A more detailed introduction to Eq. (3) is given in Sect. 3.1. Byers et al. [21] utilize the K nearest neighbors to find clutter. Instead of the k-th nearest distance used in [5], the average distance is used in our method, which is more reasonable and effective. The salience value is computed as:

$$\begin{aligned} Svalue_{distance}(x_{A,i}(m,n))= D_{average} \left( X_{nn}(x_{A,i}(m,n)) \right) \end{aligned}$$
(4)

where \(D_{average}\) denotes the average distance to the \(R_n\) nearest neighbors. As shown in Fig. 4, if a patch in the testing image is salient, its \(D_{average}\) will be larger than that of a general patch. In practice, it is not necessary to take all \(R_n\) neighbors to achieve a good estimate. Therefore, the distances between the testing patch and its \(R_n\) neighbors are sorted, and the intermediate \(k\,(k = \beta R_n)\) neighbors are selected for the computation, where \(0< \beta < 1\). Figure 5 shows the feature weighting map of the salience learned by our average-distance strategy, estimated by partial least squares (PLS) [10].
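The sort-then-average step above can be sketched as follows. This is a minimal NumPy illustration under our own reading of "intermediate k neighbors" as the central k values of the sorted distance list; the function name and default `beta` are assumptions, not from the paper:

```python
import numpy as np

def distance_salience(neighbor_dists, beta=0.5):
    """Average-distance salience of one test patch, in the spirit of Eq. (4).

    neighbor_dists: distances from the patch to its R_n matched neighbors,
    one per reference image. The distances are sorted and only the
    intermediate k = beta * R_n of them are averaged, which suppresses both
    accidental perfect matches and gross mismatches.
    """
    d = np.sort(np.asarray(neighbor_dists, dtype=float))
    r_n = d.size
    k = max(1, int(beta * r_n))      # number of neighbors actually used
    start = (r_n - k) // 2           # take the middle k sorted distances
    return d[start:start + k].mean()
```

A salient patch, matching poorly across the reference set, yields a large value; a common patch yields a small one.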

Fig. 5. Illustration of the feature weighting map of our average-distance salience learning, estimated by partial least squares (PLS) [10].

2.2 Density Salience Learning

As discussed in [23, 27, 28], density information is also important for outlier detection, and therefore for salience learning as well.

Similar to Sect. 2.1, each image in the testing set is segmented into \(M \times N\) dense patches. As mentioned above, each testing patch has \(R_n\) neighbors, i.e., every testing image has a set of neighbor patches \(P^i\):

$$\begin{aligned} P^i= \left\{ X_{nn}(x_{A,i}(m,n))|m = 1,2,\dots ,M;\ n = 1,2,\dots ,N \right\} \end{aligned}$$
(5)

Each neighbor has a matching distance to the testing patch. For each testing image, the maximum and minimum matching distances are used to compute the outlier threshold \(d_{thr}^i\), i.e.,

$$\begin{aligned} d_{max} = D_{max}(P^i) \end{aligned}$$
(6)
$$\begin{aligned} d_{min} = D_{min}(P^i) \end{aligned}$$
(7)
$$\begin{aligned} d_{thr}^i = \mu (d_{max} - d_{min}) + d_{min} \end{aligned}$$
(8)

where \(D_{max}\) and \(D_{min}\) are the maximum and minimum distances respectively in \(P^i\) set, \(\mu (0 < \mu < 1)\) is a threshold parameter. For each testing patch, \(Q^i(m,n)\) denotes the set of matching distances which are larger than \(d_{thr}^i\):

$$\begin{aligned} \begin{aligned}&Q^i(m,n) = \{ D_k(X_{nn}(x_{A,i}(m,n)))|D_k(X_{nn}(x_{A,i}(m,n))) > d_{thr}^i,\\&~~~~~~~~~~~~~~~~\,\,k = 1,2, \dots ,R_n\} \end{aligned} \end{aligned}$$
(9)

where \(D_k\) denotes the distance of the k-th nearest neighbor. The density factor can be defined as:

$$\begin{aligned} fa_{A,i}(m,n) = (R_n - |Q^i(m,n)|)/ R_n \end{aligned}$$
(10)

where \(|Q^i(m,n)|\) is the number of elements in \(Q^i(m,n)\). Finally, the density salience value is obtained as follows:

$$\begin{aligned} Svalue_{density}(x_{A,i}(m,n)) = \exp (-fa_{A,i}(m,n)/2\sigma ^2) \end{aligned}$$
(11)

3 Person Re-identification Algorithm

The whole algorithm of our person re-identification is presented in Algorithm 1.

Algorithm 1. The proposed person re-identification algorithm.

3.1 Dense Correspondence for Patches

In this section, the dense correspondence proposed in [5, 9, 19] is employed to achieve patch alignment.

Each human image is densely segmented into local patches. A dense SIFT descriptor and a LAB color histogram, together named dLabSift features, are extracted for each patch, with dimension \(32 \times 3 \times 3 + 128 \times 3 = 672\) (more details can be found in [5]). Similar to [5], for each image patch \(x_{A,i}(m,n)\) in \(\mathrm {x}_{A,i}\), we first generate candidate patches in \(\mathrm {x}_{B,j}\) (shown in Fig. 6) as follows.

Fig. 6. Illustration of adjacency constrained search. The green box denotes the search region for the patch in the red box (Color figure online).

First, let \(T_{A,i}(m)\) be the m-th row patches set of the i-th image of Camera A, i.e.,

$$\begin{aligned} T_{A,i}(m)= \left\{ x_{A,i}(m,n)|n=1,2,\dots ,N \right\} \end{aligned}$$
(12)

Then, we obtain the candidate patches in \(\mathrm {x}_{B,j}\), as

$$\begin{aligned} S(x_{A,i}(m,n),\text {x}_{B,j})= \left\{ T_{B,j}(a) | a \in \theta (m) \right\} ,\forall x_{A,i}(m,n) \in T_{A,i}(m) \end{aligned}$$
(13)

where \(T_{B,j}(a)\) is the a-th row patch set of the j-th image of camera B, and \(\theta (m)\) is the relaxation of the adjacency search, since human images may have vertical misalignment.

After that, the optimal corresponding patch \(x_{B,j}(m',n')\) is obtained from gallery, as

$$\begin{aligned} x_{B,j}(m',n') = \arg \max _{\bar{x} \in S_{i,j}} s(x_{A,i}(m,n),\bar{x}) \end{aligned}$$
(14)

where \(S_{i,j} = S(x_{A,i}(m,n),\text {x}_{B,j})\), and s(x, y) is computed as

$$\begin{aligned} s(x,y)= \exp \, \left( -\frac{\mathrm {d}(x,y)^2}{2 \sigma ^2} \right) \end{aligned}$$
(15)

where \(\mathrm {d}(x,y)\) is the Euclidean distance between x and y.
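The adjacency-constrained search of Eqs. (12)–(15) can be sketched as follows. This is an illustrative NumPy version under our own assumptions: patches are stored as an \((M, N, \text{dim})\) feature array, and \(\theta(m)\) is modelled as a symmetric band of `rows_relax` rows around m; function names and defaults are not from the paper:

```python
import numpy as np

def patch_similarity(x, y, sigma=1.0):
    """Gaussian similarity of two patch feature vectors, Eq. (15)."""
    d = np.linalg.norm(np.asarray(x, float) - np.asarray(y, float))
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))

def best_match(probe_patch, gallery_patches, m, rows_relax=2, sigma=1.0):
    """Adjacency-constrained search, Eqs. (12)-(14).

    gallery_patches: array of shape (M, N, feat_dim) holding the dense
    patches of one gallery image. Only rows within +/- rows_relax of the
    probe patch's row m are searched (the relaxation theta(m)), which
    tolerates vertical misalignment between the two images.
    """
    M = gallery_patches.shape[0]
    lo, hi = max(0, m - rows_relax), min(M, m + rows_relax + 1)
    best, best_s = (lo, 0), -1.0
    for a in range(lo, hi):                    # rows allowed by theta(m)
        for n in range(gallery_patches.shape[1]):
            s = patch_similarity(probe_patch, gallery_patches[a, n], sigma)
            if s > best_s:                     # arg max of Eq. (14)
                best_s, best = s, (a, n)
    return best, best_s
```

The returned index pair \((m', n')\) and score feed directly into the similarity accumulation of Sect. 3.2.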

3.2 Similarity Computation

Testing human images always contain a large amount of background, which varies around pedestrians, especially with changes of camera viewpoint. Therefore, it is unreasonable to use all testing patches for salience learning.

In this paper, statistical information about the human head position is utilized to remove some background patches before computing the similarity scores. Since the background around the lower part of the body is normally similar pavement, only statistics of the head position are used, to remove the background around the human head.

For each image \(\text {x}_{A,i}\) in the probe, the matching result is the gallery image \(\text {x}_{B,j}\) with the maximal similarity score, where the similarity score between two images is obtained by accumulating the scores of their patch pairs:

$$\begin{aligned} \begin{aligned}&Sim(\text {x}_{A,i},\text {x}_{B,j})= \\&~~~~~~~~~~~~~~~~\sum _{m,n}\frac{Sa_{A,i}(m,n)\cdot s(x_{A,i}(m,n),x_{B,j}(m',n'))\cdot Sa_{B,j}(m',n')}{\varepsilon + |Sa_{A,i}(m,n) - Sa_{B,j}(m',n')|} \end{aligned} \end{aligned}$$
(16)

where \(\text {x}_{A,i}\) and \(\text {x}_{B,j}\) are the collections of patch features of the probe image and the gallery image, \(\varepsilon \) is a parameter controlling the salience difference, and,

$$\begin{aligned} \begin{aligned} Sa_{A,i}(m,n)= Svalue_{DD}(x_{A,i}(m,n)) \end{aligned} \end{aligned}$$
(17)
$$\begin{aligned} \begin{aligned} Sa_{B,j}(m',n')= Svalue_{DD}(x_{B,j}(m',n')) \end{aligned} \end{aligned}$$
(18)
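Given per-patch salience values and patch-pair similarities, Eq. (16) accumulates into one image-level score. A minimal NumPy sketch, with our own function name and argument layout (the matched-pair arrays are assumed to be precomputed by the dense correspondence step):

```python
import numpy as np

def image_similarity(sal_probe, sal_gallery, patch_sims, eps=0.1):
    """Accumulated image similarity, following Eq. (16).

    sal_probe[m, n]  : salience Sa_{A,i}(m, n) of each probe patch,
    sal_gallery[m, n]: salience Sa_{B,j}(m', n') of its matched gallery patch,
    patch_sims[m, n] : similarity s(...) of that matched pair.
    Pairs whose salience values agree contribute more, since the salience
    difference sits in the denominator; eps keeps it away from zero.
    """
    sal_probe = np.asarray(sal_probe, float)
    sal_gallery = np.asarray(sal_gallery, float)
    patch_sims = np.asarray(patch_sims, float)
    num = sal_probe * patch_sims * sal_gallery
    den = eps + np.abs(sal_probe - sal_gallery)
    return float((num / den).sum())
```

The probe image is then matched to the gallery image maximizing this score.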

4 Experiments

The proposed method is evaluated on two publicly available datasets, the VIPeR dataset [22] and the CUHK01 dataset [8]. Both reflect the most challenging problems in person re-identification applications, such as human pose variation, camera viewpoint and illumination changes, and occlusions between persons.

Evaluation Protocol. Similar to [5], each dataset is randomly partitioned into two parts, 50 % for salience learning and 50 % for testing, and the standard Cumulated Matching Characteristics (CMC) curve is the main evaluation measure. Images from camera A are used as the probe and images from camera B as the gallery. For each probe image, the images in the gallery are ranked by matching score. The rank-k recognition rate is the expectation of a correct match at rank k, and the cumulated recognition rates over all ranks are recorded as a one-trial CMC result. Ten trials of evaluation are executed to obtain stable statistics. Our Density-Distance salience learning method is denoted DdSal. As mentioned above, the use of head-position information to remove background patches before computing the similarity scores is denoted heaPri. The combination of the Density-Distance salience learning method and the heaPri process is denoted PriDd.
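The CMC computation described above can be sketched as follows, assuming (as in VIPeR-style protocols) that probe i's correct match is gallery i; the function name and the score-matrix layout are our own illustrative choices:

```python
import numpy as np

def cmc_curve(score_matrix):
    """Cumulated Matching Characteristics from a probe-by-gallery score matrix.

    score_matrix[i, j] is the similarity of probe i to gallery j; with
    identity-aligned probe and gallery sets, the correct match of probe i
    is gallery i. Returns the cumulative rank-k recognition rates.
    """
    scores = np.asarray(score_matrix, dtype=float)
    n = scores.shape[0]
    hits = np.zeros(scores.shape[1], dtype=float)
    for i in range(n):
        order = np.argsort(-scores[i])             # best match first
        rank = int(np.where(order == i)[0][0])     # rank of correct match
        hits[rank] += 1
    return np.cumsum(hits) / n                     # rank-k recognition rates
```

Averaging the resulting curves over the ten random trials gives the statistics plotted in Fig. 7.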

Fig. 7. CMC statistics on the VIPeR dataset and the CUHK01 Campus dataset. (a) On VIPeR, our approach (heaPri, DdSal and PriDd) is compared with BiCov [16], eBiCov [16], LDFV [20], bLDFV [20], eSDC_knn [5] and eSDC_ocsvm [5]. (b) On CUHK01 Campus, our method is compared with eSDC_knn and eSDC_ocsvm.

4.1 Evaluations on VIPeR Dataset

The VIPeR dataset [22] is one of the most challenging person re-identification datasets. It was captured outdoors by two cameras, with two images per person shot from different viewpoints, and the persons have been accurately detected from the original videos. It contains 632 pedestrian pairs; each person has two images from different cameras with different viewpoints, most differing by more than 90 degrees. All images are normalized to the same size of \(128\times 48\) [5].

On the VIPeR dataset, the results of DdSal, heaPri and PriDd are reported in Fig. 7(a), compared with several unsupervised methods including eSDC_knn [5], eSDC_ocsvm [5], BiCov [16], eBiCov [16], LDFV [20] and bLDFV [20]. Generally speaking: (1) PriDd is slightly better than heaPri and DdSal, which means that both the Density-Distance salience information and the head background removal improve the matching results. (2) Our approaches and the eSDC-based methods outperform the other four methods, while our PriDd approach is slightly better than the eSDC_knn salience learning approach and also slightly better than eSDC_ocsvm from rank 1 to rank 15. It is worth mentioning that although eSDC_ocsvm slightly outperforms our approach after rank 15, it requires a much higher computing cost.

Table 1. VIPeR dataset: top ranked matching rates in [%] with 316 persons.
Table 2. VIPeR dataset: running time in [seconds] of three salience learning methods with 316 test persons.

More comparison results on the VIPeR dataset, including some supervised methods, are reported in Table 1, which also demonstrates the effectiveness of our approaches. eSDC_knn and eSDC_ocsvm seem to have satisfactory performance close to ours; however, considering Table 2, which reports the running time of our PriDd method compared to eSDC_knn and eSDC_ocsvm on MATLAB (2013a) with 64-bit Windows 7 and an Intel Core i7 4.00 GHz CPU, our approaches achieve better performance with acceptable computing complexity.

4.2 Evaluations on CUHK01 Dataset

The CUHK01 dataset [8] is also captured by two cameras, in a campus environment, with higher resolution than the VIPeR dataset. All images are normalized to \(160 \times 60\) for evaluation.

Table 3. CUHK01 dataset: top ranked matching rates in [%] with 316 persons.
Table 4. CUHK01 dataset: running time in [seconds] of three salience learning methods with 316 test persons.

For concise presentation, only two outstanding methods, eSDC_knn and eSDC_ocsvm, are implemented on the CUHK01 dataset for comparison, as shown in Fig. 7(b). We can see that our methods, especially PriDd, achieve a significant improvement over eSDC_knn and eSDC_ocsvm. eSDC_ocsvm has almost identical performance to our DdSal method but much worse performance than our heaPri and PriDd methods, which means the background interference around the human head has a stronger influence on the CUHK01 dataset. More detailed comparison results on the CUHK01 dataset are shown in Table 3. As in the evaluation on the VIPeR dataset, Table 4 reports the running time of our PriDd method compared to eSDC_knn and eSDC_ocsvm, also on MATLAB (2013a) with 64-bit Windows 7 and an Intel Core i7 4.00 GHz CPU, which again demonstrates that our approaches achieve better performance with acceptable computing complexity.

5 Conclusion

A novel unsupervised salience learning method is proposed for person re-identification, in which a Density-Distance measure is designed for salience learning. Meanwhile, statistics of the pedestrian's head position are learnt to relieve background interference during similarity matching. Experimental results on the widely used VIPeR and CUHK01 datasets demonstrate that our method improves matching precision with acceptable computing complexity compared with state-of-the-art methods.