1 Introduction

Person re-identification (Re-Id) is useful in various intelligent video surveillance applications. The process can be considered an image retrieval problem: given a query image of a person (probe), we search for that person in a set of images extracted from different cameras (gallery). The task is difficult for various reasons. Firstly, face-based [24] and body movement-based identification [2] cannot be used due to the variations in CCTV camera positions. Secondly, the complex nature of similarity measurement and pose matching makes it harder. Recent advancements in object tracking [4] have opened up new possibilities. Video object trackers can be used to track people in real-time, and the resulting tracks containing humans can be passed to an ML framework to search for the identity in other cameras. The query can be a single image [25] or multiple images [9]. Multi-image queries often use early fusion to generate an average query image [29]; such methods thus consume more computational power than single image-based methods. Video-based re-identification research is still evolving [6, 18]. Existing algorithms are sensitive to the choice of query images or video segment; choosing an improper image or video segment may lead to poor retrieval [25]. In this paper, we detect and track humans and construct spatio-temporal tubes that are used in the re-identification framework. We also propose a method for selecting an optimum set of key-pose frames, and we apply a new learning framework to re-identify persons appearing in other cameras. To accomplish this, we make the following contributions in this paper:

  • We propose a learning-based method to select an optimum set of key-pose frames to reconstruct a query tube by minimizing its length.

  • We propose a new hierarchical video Re-Id framework using detection, self-similarity matching, and temporal correlation that can be integrated with any image-based re-identification framework.

  • We introduce a new video dataset, named Tube-based Re-identification Video Dataset (TRiViD) that has been prepared with an aim to help the re-identification research community.

The rest of the paper is organized as follows. In Section 2, we discuss the state of the art in person re-identification research. Section 3 presents the proposed Re-Id framework and its various components. Experimental results are presented in Section 4. Conclusion and future work are presented in Section 5.

2 Related work

Person re-identification applications are growing rapidly in number. However, the primary challenges remain: handling large volumes of data [33, 34], tracking in complex environments [21, 35], the presence of groups [7], occlusion [12], varying pose and style across different cameras [9, 17, 23, 36], etc. Re-Id methods can be categorized as image-guided [1, 5, 7, 9] and video-guided [6, 8, 18, 28, 31]. The image-guided methods typically use deep neural networks for feature representation and re-identification, whereas the video-guided methods typically use recurrent convolutional networks to embed temporal information such as optical flow [8], sequences of poses, etc. With the advancement of hardware and AI techniques, re-identification tasks are now often solved using deep learning. In this area of research, Zhao et al. [32] and Liu et al. [15] have used saliency- and attention-based learning to compute similarity among persons, Liu et al. [16] have proposed motion-based learning, and Xu et al. [29] have proposed joint learning of image and motion features. It may be noted that temporal information such as motion can be a good feature for re-identification. Chen et al. [6] and Zhang et al. [31] have used video sequence-based learning methods. Methods based on the fusion of multiple sources of information have also been presented: Chen et al. [7] have used a fusion of local and global features, and Chung et al. [8] have proposed a weighted fusion of spatio-temporal features. Considering pose information, Zhong et al. [36] have used style transfer to learn similarity matching and Liu et al. [17] have augmented the pose to generate training data. Recently, late fusion of different scores [1, 20] has shown significant improvement in the final ranking. Our method is similar to a typical delayed or late fusion guided method: we refine search results obtained using convolutional neural networks with the help of temporal correlation analysis.

3 Proposed approach

Our method can be regarded as tracking followed by re-identification. Moving persons are first tracked using Simple Online Deep Tracking (SODT) [3]. A tube is defined as a sequence of spatio-temporal frames. A gallery (G) contains a set of tubes {T1, T2, ..., Tn}, where Ti represents the ith tube. A tube (T) is created by arranging the image frames {I0, I1, ..., Ik} with respect to time.

The re-identification of a person in videos now can be defined by: “Given a query tube of a person, say Tq, we need to find out the tubes in other cameras that are likely to contain the queried person.”
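For concreteness, a tube and a gallery can be represented with a simple data structure. The sketch below is illustrative only; the names Tube, track_id, and frames are our own and are not taken from the framework itself.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Tube:
    """A spatio-temporal tube: person image crops ordered by frame time."""
    track_id: int                  # identity assigned by the tracker
    frames: List[np.ndarray] = field(default_factory=list)  # I_0, I_1, ..., I_k

# A gallery G is simply a collection of tubes {T_1, T_2, ..., T_n}.
Gallery = List[Tube]
```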

First, the noisy frames are eliminated and the query tube is minimized. Next, the minimized query tube is passed through a 3-stage hierarchical re-ranking process to get the final ranking of the tubes in the gallery. The method is depicted in Fig. 1.

Fig. 1
figure 1

The proposed method for tube-to-tube re-identification. The method takes a tube as the query and ranks the gallery tubes by best possible matching

3.1 Simple Online Deep Tracking (SODT)

The tubes of the persons are extracted by tracking individuals in the video sequences. The tracking is a two-step process: (i) detection and (ii) continuous track handling. The detection has been developed using the YOLO [13] framework. The outputs of YOLO are the bounding boxes marking the persons in all frames. The model is pre-trained on ImageNet [22]. Next, the bounding boxes are linked into sequences while preserving identity (track number). Each track is represented by a unique motion model. A linear constant-velocity motion model is used to estimate the inter-frame displacements of each object. The state of the target is modeled using (1).

$$ x = [u, v, s, r, \dot{u},\dot{v},\dot{s}]^{T} $$
(1)

where u and v are the horizontal and vertical coordinates of the center of the target's bounding box, s is the scale (area), and r is the aspect ratio of the bounding box; \(\dot{u}\), \(\dot{v}\), and \(\dot{s}\) are the corresponding velocities. A detection from YOLO is associated with a target, and the initial bounding box is used to update the target's state. The velocity components are solved using a Kalman filter framework [27]. Missing frames are filled in by prediction using the linear velocity model. The assignment problem between the Kalman filter predictions and newly arrived detections is solved with the Hungarian algorithm, using the (squared) Mahalanobis distance as the association cost. Finally, continuously detected bounding boxes preserving identity are included as the frames of a tube.
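The sketch below shows how the constant-velocity state of (1) and the Mahalanobis-based Hungarian assignment could be wired together. It assumes the filterpy and scipy packages; the function names make_track_filter and associate are illustrative, not part of the original framework.

```python
import numpy as np
from filterpy.kalman import KalmanFilter
from scipy.optimize import linear_sum_assignment

def make_track_filter(bbox):
    """Kalman filter over the state x = [u, v, s, r, du, dv, ds]^T of Eq. (1)."""
    kf = KalmanFilter(dim_x=7, dim_z=4)
    kf.F = np.eye(7)
    kf.F[0, 4] = kf.F[1, 5] = kf.F[2, 6] = 1.0   # u += du, v += dv, s += ds
    kf.H = np.eye(4, 7)                          # we observe [u, v, s, r]
    kf.x[:4] = np.asarray(bbox, dtype=float).reshape(4, 1)
    return kf

def associate(tracks, detections):
    """Assign detections to track predictions via the Hungarian algorithm,
    with squared Mahalanobis distance as the association cost."""
    cost = np.zeros((len(tracks), len(detections)))
    for i, kf in enumerate(tracks):
        S = kf.H @ kf.P @ kf.H.T + kf.R          # innovation covariance
        S_inv = np.linalg.inv(S)
        for j, z in enumerate(detections):
            d = np.asarray(z, dtype=float).reshape(4, 1) - kf.H @ kf.x
            cost[i, j] = float(d.T @ S_inv @ d)  # squared Mahalanobis distance
    return linear_sum_assignment(cost)           # (track indices, detection indices)
```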

3.2 Query minimization

Selecting a set of frames that can uniquely represent a tube is challenging. To address this, we use a deep similarity matching architecture to select a set of representative frames based on pose dissimilarity. First, a query tube is passed through a binary classifier to remove noisy frames (blurry, cropped, low-quality, etc.); a ResNet [14] framework has been trained for this using a few query tubes containing similar-looking images (a minimal sketch of this filtering step is given below). An energy function, defined in (4), is then used to select a set of unique frames from the query tube (Tq). The two components of the energy function (ζ and γ) account for the closeness of each pair of frames (whose impact should be minimal) and the differences between each pair of frames (whose impact should be maximal) in the query tube.
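The following sketch illustrates such a noisy-frame filter, assuming a torchvision ResNet backbone with its final layer replaced by a two-class (clean vs. noisy) head; the 0.5 threshold and the function names are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_frame_filter() -> nn.Module:
    # ResNet-18 stands in here for the ResNet variant trained in the paper.
    net = models.resnet18(weights=None)
    net.fc = nn.Linear(net.fc.in_features, 2)    # [clean, noisy] logits
    return net

@torch.no_grad()
def keep_clean_frames(net: nn.Module, frames: torch.Tensor, threshold: float = 0.5):
    """Drop frames whose predicted 'noisy' probability exceeds the threshold.
    `frames` is a (N, 3, H, W) batch of tube images."""
    net.eval()
    p_noisy = torch.softmax(net(frames), dim=1)[:, 1]
    return frames[p_noisy < threshold]
```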

The overall closeness, or similarity index, of a frame i is estimated as given in (2), where σ(i,j) is a measure that quantifies the similarity between two frames i and j in Tq.

$$ {\zeta_{i} = \min(\sigma(i,j)), \forall j \in T_{q}, j \neq i} $$
(2)

The overall dissimilarity index of a frame i is estimated as given in (3), where σ(i,j) again quantifies the similarity between two frames i and j in Tq.

$$ {\gamma_{i} = \max(\sigma(i,j)), \forall j \in T_{q}, j \neq i} $$
(3)

We now assume that the input tube contains k images and the output query tube contains l images such that l ≪ k. Our objective is to choose query images from a tube such that the most dissimilar images are retained and similar images are discarded. The optimal query energy (E) is defined in (4), where \(\hat {Q}\) denotes the images of the tube that are not included in the optimal query (Q) and ϕ is a weighting parameter. Increasing the weight (ϕ) also increases the number of images in the query.

$$ { E=\sum\limits_{i \in Q}{\phi\, \zeta_{i}}+\sum\limits_{i \in Q}{\gamma_{i}}} $$
(4)
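A greedy reading of (2)-(4) might look like the sketch below: per-frame embeddings yield the pairwise similarity σ, and the l frames with the lowest per-frame energy contribution are retained. This is our illustrative interpretation (with a fixed output length l), not the paper's exact optimizer.

```python
import numpy as np

def select_query_frames(features: np.ndarray, phi: float = 0.4, l: int = 5):
    """Greedy sketch of query minimization (Eqs. 2-4).
    `features` is a (k, d) array of per-frame embeddings; sigma is the
    cosine similarity between frames of the query tube."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sigma = f @ f.T                       # sigma(i, j) for all frame pairs
    np.fill_diagonal(sigma, np.nan)       # exclude self-similarity (j != i)
    zeta = np.nanmin(sigma, axis=1)       # closeness index, Eq. (2)
    gamma = np.nanmax(sigma, axis=1)      # redundancy index, Eq. (3)
    energy = phi * zeta + gamma           # per-frame terms of Eq. (4)
    return np.argsort(energy)[:l]         # keep the l most distinctive frames
```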

3.3 Proposed Re-identification and ranking

At the beginning, SVDNet [25] is used for re-identification at the image level. The network takes 128 × 64 images as input and produces a set of retrieved images. The outputs are then passed through late fusion layers for re-ranking of the retrieved images. Figure 2 illustrates the process. Let the set of retrieved images be denoted by \(T_{SVD}\) as given in (5). In the next stage, a network similar to ResNet50 is trained to learn self-similarity scores using the tubes of the query set.

$$ {T_{\textit{SVD}}=\{I_{1}, I_{2}, ....., I_{p}\}} $$
(5)
Fig. 2
figure 2

The proposed re-ranking scheme adopted in our work

During this process, similarity scores between every pair of output images of the SVDNet up to rank p are calculated using a self-similarity estimation layer. This layer learns to measure self-similarity from tracked tubes during training, under the assumption that a single track contains images of the same person. We use ResNet50 [10] as the baseline. It takes a set of ranked images (SVDNet outputs) and produces a new ranking by incorporating self-similarities between the retrieved images.

Finally, the scores are averaged and the images are re-ranked. This step ensures that the dissimilar images get pushed toward the end of the ranked sequence of the retrieved images.
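A minimal sketch of this fusion step is shown below, assuming both score vectors are aligned to the same p retrieved images and normalized to a common range; the function and variable names are illustrative.

```python
import numpy as np

def rerank_by_self_similarity(svd_scores, self_sim_scores):
    """Average SVDNet scores with self-similarity scores and re-rank,
    so that dissimilar images sink toward the end of the list."""
    fused = (np.asarray(svd_scores) + np.asarray(self_sim_scores)) / 2.0
    order = np.argsort(-fused)            # indices sorted by descending fused score
    return order, fused[order]
```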

3.4 Tube ranking by temporal correlation

The final step of the proposed method is to rank the tubes by temporal correlation among the retrieved images. Let the result matrix up to rank p (p ≤ k) for the query tube Tq after the first two stages be R, as given in (6). The weight of an image I in R is estimated using (7).

$$ R= \begin{bmatrix} I_{11} & I_{12} & I_{13} & {\dots} & I_{1p} \\ I_{21} & I_{22} & I_{23} & {\dots} & I_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ I_{j1} & I_{j2} & I_{j3} & {\dots} & I_{jp} \end{bmatrix} $$
(6)
$$ {\alpha=\frac{1}{I_{r}}\text{, where $I_{r}$ is the rank of the frame } I} $$
(7)

Similarly, the weight of a tube (T) is denoted by β and estimated using (8).

$$ {\upbeta =\frac{\#\text{ of images in } T \cap R}{\max\limits_{T}\left(\#\text{ of images in } T \cap R\right)}} $$
(8)

Finally, the temporal correlation cost (τI) of an image in R can be estimated as given in (9).

$$ {\tau_{I} =\alpha \times \upbeta, \text{ such that } I \in T} $$
(9)

Based on the temporal correlation, the retrieved tubes are ranked, with higher-ranked tubes having higher weights. The final ranked images are extracted by taking the highest-scoring images from the tubes. Figure 3 explains the whole process of tube ranking and selection of the final set of frames. The main motivation of the temporal correlation is to assign higher weights to the tubes containing a maximum number of retrieved images.
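The sketch below implements (7)-(9) directly, assuming the retrieved list R and a mapping from each image to its parent tube; scoring a tube by the best τ among its retrieved images is one plausible aggregation, not necessarily the paper's exact rule.

```python
import numpy as np

def rank_tubes(retrieved, tube_of):
    """Rank tubes by temporal correlation (Eqs. 7-9).
    `retrieved` is the ranked image list R, best rank first; `tube_of`
    maps an image id to its tube id. Both names are illustrative."""
    counts = {}                                        # |T ∩ R| for each tube
    for img in retrieved:
        counts[tube_of[img]] = counts.get(tube_of[img], 0) + 1
    max_count = max(counts.values())
    tau = {}
    for rank, img in enumerate(retrieved, start=1):
        t = tube_of[img]
        alpha = 1.0 / rank                             # Eq. (7): inverse rank
        beta = counts[t] / max_count                   # Eq. (8): tube coverage
        tau[t] = max(tau.get(t, 0.0), alpha * beta)    # Eq. (9), best image per tube
    return sorted(tau, key=tau.get, reverse=True)      # best tube first
```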

Fig. 3
figure 3

Explanation of the re-identification process using the proposed 3-stage framework depicted in Fig. 1

4 Experiments

We have evaluated our proposed approach on two public datasets, iLIDS-VID [26] and PRID-11 [11], that are often used for testing video-based re-identification frameworks. In addition, we have prepared a new re-identification dataset. It has been recorded using 2 cameras in an indoor environment featuring human movements in a moderately dense crowd (more than 10 people appearing within 4-6 sq. m), varying camera angles, and persons with similar clothing. Such situations have not yet been covered in existing re-identification video datasets. Details about these datasets are presented in Table 1. Several experiments have been carried out to validate our method, and a thorough comparative analysis has also been performed.

Table 1 Datasets used in our experiments. Only the TRiViD dataset is tracked to extract tubes

Evaluation Metrics and Strategy:

We have followed well-known experimental protocols for evaluating the method. For the iLIDS-VID and TRiViD dataset videos, the tubes are randomly split into 50% for training and 50% for testing. For PRID-11, we have followed the experimental setup proposed in [6, 19, 26, 29, 37]: only the first 200 persons who appear in both cameras of the PRID-11 dataset have been used in our experiments. A 10-fold cross-validation scheme has been adopted and the average results are reported. We have prepared Cumulative Matching Characteristics (CMC) and mean average precision (mAP) curves to evaluate and compare the performance.
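For reference, the per-query CMC and average precision can be computed as in the sketch below (a standard formulation, not code from the paper); averaging over queries and over the 10 folds yields the reported curves. This simplified AP treats the matches found within the top ranks as the set of relevant items.

```python
import numpy as np

def cmc_and_ap(ranked_ids, true_id, max_rank=20):
    """CMC curve and average precision for a single query."""
    hits = np.asarray(ranked_ids[:max_rank]) == true_id
    cmc = np.cumsum(hits).clip(max=1)        # 1 from the first correct match onward
    if not hits.any():
        return cmc, 0.0
    positions = np.flatnonzero(hits) + 1     # 1-based ranks of correct matches
    precisions = np.arange(1, len(positions) + 1) / positions
    return cmc, float(precisions.mean())
```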

4.1 Comparative analysis

Although unique in design, our work has some similarities with the video re-id methods proposed in [19, 30], the multiple query-based method [25], and the re-ranking method [20]. The RCNN-based method [19] uses image-level CNNs and optical flow for learning and searching. The video-based feature learning method [30] uses a sequence-based distance to compute similarity between query and gallery. SVDNet [25] is a typical image-based re-identification framework; it can use a single image or multiple images as the probe. We have therefore compared our approach with these recently proposed methods. It has been observed that our method achieves a gain of up to 9.6% over the state-of-the-art methods when top-rank accuracy is estimated. Even when accuracy is computed up to rank 20, our method performs better by a margin of 3%. This happens because our method reduces the number of false positives, an aspect that has not yet been addressed by the re-identification research community. Figures 4, 5 and 6 present the CMC curves and Table 2 summarizes the mAP up to rank 20 across the three datasets. Figure 7 shows a typical query and response on the PRID-11 dataset.

Fig. 4
figure 4

The accuracy (CMC) in PRID-11 dataset using RCNN [19], TDL [30], Video re-id [19], SVDNet [25] (single image), SVDNet (multiple images), SVDNet+Re-rank [20]

Fig. 5
figure 5

The accuracy (CMC) in iLIDS dataset using RCNN [19], TDL [30], Video re-id [19], SVDNet [25] (single image), SVDNet (multiple images), SVDNet+Re-rank [20]

Fig. 6
figure 6

The accuracy (CMC) using the TRiViD dataset with the help of RCNN [19], TDL [30], Video re-id [19], SVDNet [25] (single image), SVDNet (multiple images), SVDNet+Re-rank [20]

Table 2 mAP (%) up to rank 20 across the three video datasets
Fig. 7
figure 7

Typical results on the PRID-11 dataset using a single image query [25], a video sequence [19], and the proposed method. A green box indicates a correct retrieval

4.2 Computational complexity analysis

Re-identification in real-time is a challenging task. Research work carried out so far presumes the gallery to be a pre-recorded set of images and tries to rank the best 5, 10, 15, or 20 images from the set. However, executing a single query takes considerable time when multiple images are involved in the query. We have carried out a comparative analysis of computational complexity across various re-identification frameworks, including the proposed scheme. One Nvidia Quadro P5000 series GPU has been used to implement the frameworks. The results are reported in Fig. 8. We have observed that the proposed tube-based re-identification framework takes less time than the video re-id framework proposed in [19] and the multiple image-based re-id using SVDNet [25].

Fig. 8
figure 8

Average response time (in seconds) for a given query by varying the datasets. We have taken 100 query tubes in random and calculated the average response time using RCNN [19], TDL [30], Video re-id [19], SVDNet [25] (single image), SVDNet (multiple images), SVDNet+Re-rank [20]

4.3 Effect of ϕ

Our proposed method depends on the query threshold (ϕ). In this section, we present an analysis of the effect of ϕ on the results. Figure 9 depicts the average number of query images generated from various query tubes. It may be observed that a higher ϕ produces more query images.

Fig. 9
figure 9

Average number of query images obtained by varying the query threshold (ϕ). We have taken 100 query sequences at random and report the average number of optimized images. It may be observed that a higher ϕ produces more query images

Figure 10 depicts the average CMC obtained by varying ϕ. It may be observed that the accuracy does not increase significantly when ϕ is increased beyond 0.4.

Fig. 10
figure 10

Accuracy (CMC) by varying the query threshold (ϕ). We have taken 100 query sequences at random and the average is reported. It may be observed that a higher ϕ may not produce higher accuracy

Figure 11 presents the execution time (in seconds) obtained by varying the query threshold. It can be observed that an increase in ϕ leads to a higher response time. Therefore, we have used ϕ = 0.4 in our experiments.

Fig. 11
figure 11

Execution time by varying ϕ. It may be observed that a higher ϕ takes more time to execute as it produces more query images.

4.4 Results after various stages

In this section, we present the effect of the various stages of the overall framework on the re-identification results. Table 3 shows the accuracy (CMC) at each step of the proposed method. It may be observed that the proposed method gains 11% rank-1 accuracy after the first stage and 7% rank-1 accuracy after the second stage. The method gains 7% rank-20 accuracy in the first stage and 6% rank-20 accuracy after the second stage. Figure 12 shows an example of scores (true positives and false positives) during the self-similarity fusion. It may be observed that SVDNet output scores and similarity scores are both high for true positives, whereas similarity scores are relatively low for false positives. More results can be found in the supplementary data.

Table 3 Accuracy (CMC) in each step of the proposed method
Fig. 12
figure 12

Typical examples of failure cases using SVDNet + Self Similarity in TRiViD (first two rows) and PRID-11 [11] (last row)

5 Conclusion

In this paper, we propose a new person re-identification framework that outperforms existing re-identification schemes when applied to videos or sequences of frames. The method uses any CNN-based framework at the beginning (we have used SVDNet). A self-similarity layer refines the SVDNet scores. Finally, a temporal correlation layer aggregates multiple query outputs and matches tubes. A query optimization step has also been proposed to select an optimum set of images for a query tube. Our study reveals that the proposed method outperforms state-of-the-art single image-based, multiple image-based, and video-based re-identification methods in several cases. The computational cost is also reasonably low. The method can rank the tubes with an average increase in CMC accuracy of 6-8% across multiple datasets, and it significantly reduces the number of false positives.

One straightforward extension of the present work is to fuse methods such as camera pose-based [9], video-based [19], and description-based [5] approaches, which may lead to higher accuracy in complex situations. Group re-identification can also be attempted with a similar concept of tube-guided analysis.