Person Re-identification in Videos by Analyzing Spatio-Temporal Tubes

Typical person re-identification frameworks search for the k best matches in a gallery of images that are often collected under varying conditions. The gallery may contain image sequences when re-identification is done on videos. However, such a process is time consuming, as re-identification has to be carried out multiple times. In this paper, we extract spatio-temporal sequences of frames (referred to as tubes) of moving persons and apply multi-stage processing to match a given query tube against a gallery of stored tubes recorded through other cameras. Initially, we apply a binary classifier to remove noisy images from the input query tube. In the next step, we use key-pose detection-based query minimization, which reduces the length of the query tube by removing redundant frames. Finally, a 3-stage hierarchical re-identification framework is used to rank the output tubes by their matching scores. Experiments with publicly available video re-identification datasets reveal that our framework outperforms state-of-the-art methods. It ranks the tubes with an increase in CMC accuracy of 6-8% across multiple datasets. Our method also significantly reduces the number of false positives. A new video re-identification dataset, named Tube-based Re-identification Video Dataset (TRiViD), has been prepared to help the re-identification research community.


I. INTRODUCTION
Person re-identification (Re-Id) is useful in various intelligent video surveillance applications. The task can be considered an image retrieval problem, where a query image of a person (probe) is given and we search for the person in a set of images extracted from different cameras (gallery). The query can be a single image [1] or multiple images [2]. Multi-image queries often use early fusion of images to generate an average query image [3]. Such methods consume more computational power than single image-based methods. Advanced hardware and efficient learning frameworks have encouraged researchers to focus on designing Re-Id systems applicable to videos. However, video-based re-identification research is still in its infancy [4], [5]. Even though the existing video Re-Id applications seem promising, such methods often fail in low-resolution videos, crowded environments, or in the presence of significant camera angle variations. It has also been observed that the query image or video has to be selected judiciously to obtain good retrieval results. Choosing an improper image or video may lead to poor quality of retrieval.
Sk. Arif Ahmed (Email: arif.1984.in@ieee.org) is with NIT Durgapur, India. Debi Prosad Dogra (Email: dpdogra@iitbbs.ac.in) is with IIT Bhubaneswar, India. Heeseung Choi (Email: hschoi@kist.re.kr), Seungho Chae (Email: seungho.chae@kist.re.kr), and Ig-Jae Kim (Email: drjay@kist.re.kr) are with KIST, South Korea.
In this paper, we detect and track moving humans and construct spatio-temporal tubes that are used in the re-identification framework. We also propose a method for selecting an optimum set of key pose images and use a 3-stage learning framework to re-identify persons appearing in different cameras. To accomplish this, we have made the following contributions in this paper:
• We propose a learning-based method to select an optimum set of key pose images to reconstruct the query tube by minimizing its length in terms of number of frames.
• We propose a 3-stage hierarchical framework that has been built using (i) SVDNet guided Re-Id architecture, (ii) self-similarity estimation, and (iii) temporal correlation analysis to rank the tubes of the gallery.
• We introduce a new video dataset, named Tube-based Re-identification Video Dataset (TRiViD), that has been prepared to help the re-identification research community.
The rest of the paper is organized as follows. In Section II, we discuss the state of the art in person re-identification research. Section III presents the proposed Re-Id framework and its components. Experimental results are presented in Section IV. Conclusions and future work are presented in Section V.

II. RELATED WORK
Person re-identification applications are growing rapidly in number. However, the enormous growth in CCTV surveillance has raised various challenges for the re-identification research community. The primary challenges are handling large volumes of data [6], [7], tracking in complex environments [8], [9], the presence of groups [10], occlusion [11], and varying pose and style across different cameras [2], [12]-[14]. Re-Id methods can be categorized as image-guided [2], [10], [15], [16] and video-guided [4], [5], [17]-[19]. The image-guided methods typically use deep neural networks for feature representation and re-identification, whereas the video-guided methods typically use recurrent convolutional networks (RNN) to embed temporal information such as optical flow [17] or sequences of poses. Table I summarizes recent progress in person re-identification. In recent years, late fusion of different scores [15], [20] has shown significant improvement in the final ranking. Our method is similar to a typical delayed or late fusion guided method. We refine search results obtained using convolutional neural networks with the help of temporal correlation analysis.

Reference | Method Overview | Paper Title
Lv et al. [4] | Motion and image based features | Recurrent convolutional network for video-based person re-identification
Barman et al. [15] | Graph theory and multiple algorithm fusion-based algorithm | SHaPE: A Novel Graph Theoretic Algorithm for Making Consensus-based Decisions in Person Re-identification Systems
Chang et al. [16] | Visual appearance and multiple semantic level features | Multi-Level Factorization Net for Person Re-Identification
Chen et al. [10] | Fusion of local similarity and group similarity-based DNN and CRF | Group Consistent Similarity Learning via Deep CRF for Person Re-Identification
Chen et al. [5] | Divides a long person sequence into short snippets and matches snippets for re-identification | Video Person Re-identification with Competitive Snippet-similarity Aggregation and Co-attentive Snippet Embedding
Chung et al. [17] | Learns spatial and temporal similarity and uses weighted fusion | A Two Stream Siamese Convolutional Neural Network For Person Re-Identification
Deng et al. [2] | Learns self-similarity and domain dissimilarity | Image-Image Domain Adaptation with Preserved Self-Similarity and Domain-Dissimilarity for Person Re-identification
He et al. [21] | Deep pixel-level CNN for person re-identification from partially observed images | Deep Spatial Feature Reconstruction for Partial Person Re-identification: Alignment-free Approach
Huang et al. [11] | Proposes augmented training data generation for person re-identification | Adversarially Occluded Samples for Person Re-identification
Kalayeh et al. [22] | Proposes a human semantic parts model to train state-of-the-art deep networks and calculate a weighted average | Human Semantic Parsing for Person Re-identification
Li et al. [23] | Distinct body parts-based attention model for re-identification | Diversity Regularized Spatiotemporal Attention for Video-based Person Re-identification
Li et al. [24] | Harmonious attention network that uses pixel-level and bounding box-level attention as features | Harmonious Attention Network for Person Re-Identification
Liu et al. [12] | Augments poses of persons to generate a training set used for re-identification | Pose Transferrable Person Re-Identification
Liu et al. [25] | Tracklets are used for training and re-identification | Stepwise Metric Promotion for Unsupervised Video Person Re-identification
Lv et al. [4] | Transfer learning is used to learn spatio-temporal patterns in an unsupervised manner | Unsupervised Cross-dataset Person Re-identification by Transfer Learning of Spatial-Temporal Patterns
Fu et al. [26] | Uses multi-scale feature representation and chooses the correct scale for matching | Multi-scale Deep Learning Architectures for Person Re-identification
Tomasi et al. [27] | Proposes a method for selecting good features for re-identification | Features for Multi-Target Multi-Camera Tracking and Re-Identification
Roy et al. [28] | Minimizes the labeling effort by choosing a minimum set of images for the labeling task in re-identification | Exploiting Transitivity for Learning Person Re-identification Models on a Budget
Sarfraz et al. [13] | Uses fine and coarse pose information for deep re-identification | A Pose-Sensitive Embedding for Person Re-Identification with Expanded Cross Neighborhood Re-Ranking
Shen et al. [29] | Proposes a group-shuffling random walk network for fully utilizing train and test images | Deep Group-shuffling Random Walk for Person Re-identification
Shen et al. [30] | Proposes a Kronecker Product Matching module to match feature maps of different persons in an end-to-end trainable deep neural network | End-to-End Deep Kronecker-Product Matching for Person Re-identification
Si et al. [31] | Learns context-aware feature sequences and performs attentive sequence comparison simultaneously | Dual Attention Matching Network for Context-Aware Feature Sequence based Person Re-Identification
Wang et al. [32] | Proposes a deep architecture named BraidNet, whose cascaded Wconv structure learns to extract comparison features from images | Person Re-identification with Cascaded Pairwise Convolutions
Wu et al. [18] | Proposes an approach to exploiting unsupervised Convolutional Neural Network (CNN) feature representation via stepwise learning | Exploit the Unknown Gradually: One-Shot Video-Based Person Re-Identification by Stepwise Learning
Xu et al. [33] | Body parts-based attention network for re-identification | Attention-Aware Compositional Network for Person Re-identification
Xu et al. [3] | Joint Spatial and Temporal Attention Pooling Network (ASTPN) applied to video sequences | Jointly Attentive Spatial-Temporal Pooling Networks for Video-based Person Re-Identification
Zhang et al. [19] | Sequential decision making is used to identify each frame in a video | Multi-shot Pedestrian Re-identification via Sequential Decision Making
Zhong et al. [14] | Uses style transfer across different cameras to improve re-identification | Camera Style Adaptation for Person Re-identification

III. PROPOSED APPROACH
Our method can be regarded as tracking followed by re-identification. Moving persons are tracked using Simple Online Deep Tracking (SODT), which has been developed using the YOLO [34] framework. A tube is defined as the sequence of spatio-temporal frames of a moving person. Training is done using the videos captured by a camera. Videos captured using cameras are used to construct the gallery of tubes. Assume a gallery (G) contains n tubes as given in (1).
Suppose a tube (T) in the gallery contains m frames as given in (2).
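Equations (1) and (2) are not reproduced in this text. Assuming they are plain set definitions, which is what the two sentences above suggest, they would read roughly as follows. The element symbols T_i and I_j are assumed notation, chosen to match the symbols used later in the section.

```latex
\begin{align}
G &= \{T_1, T_2, \ldots, T_n\} \tag{1} \\
T &= \{I_1, I_2, \ldots, I_m\} \tag{2}
\end{align}
```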
At the time of re-identification, a query tube is given as a probe. First, the noisy frames are eliminated and the query tube is minimized. Next, frames of the revised query tube are passed through a 3-stage hierarchical re-ranking process to get the final ranking of the tubes in the gallery. The method is depicted in Figure 1.

A. Query Minimization
Re-identification using multiple images usually performs better than single image-based frameworks. However, the former consumes more computational power. Also, selecting a set of frames that can uniquely represent a tube can be challenging. To address this, we have used a deep similarity matching architecture to select a set of representative frames based on pose dissimilarity. First, a query tube is passed through a binary classifier to remove noisy frames, such as blurry, cropped, or low-quality ones. Next, a ResNet50 [35] framework is trained using a few query tubes containing similar-looking images. The similarity cost (σ_ij) is calculated using (3).
The input tube contains m images, whereas the output query tube contains n images such that n ≪ m. The images in the optimized query tube can be represented using (4).
The pairwise query cost function (ξ) for a given frame (I_i) and another frame (I_j) is defined in (5).
The loss of energy is defined as given in (6).
The optimal query energy (E) is defined in (7), where Q̄ is the set of images that are not included in Q and φ is a weighting parameter called the query threshold (between 0 and 1). A larger φ produces a higher number of images in Q.
Figure 2 depicts the steps and the minimized query images from the TRiViD dataset.
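The effect of the query threshold can be illustrated with a small greedy selection. This is only a sketch under stated assumptions, not the energy minimization of (5)-(7): the function name `minimize_query`, the precomputed similarity matrix, and the rule "keep a frame if its similarity to every kept frame is below φ" are all hypothetical stand-ins. The rule was chosen because it reproduces the stated behaviour that a larger φ keeps more frames.

```python
from typing import List, Sequence


def minimize_query(similarity: Sequence[Sequence[float]], phi: float) -> List[int]:
    """Greedily pick frame indices that are dissimilar to all frames kept so far.

    similarity[i][j] is an assumed pairwise pose-similarity score in [0, 1]
    (a stand-in for sigma_ij); phi plays the role of the query threshold.
    """
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must lie in [0, 1]")
    kept: List[int] = []
    for i in range(len(similarity)):
        # Keep frame i only if it is less similar than phi to every kept frame.
        if all(similarity[i][j] < phi for j in kept):
            kept.append(i)
    return kept


# Toy tube of four frames: frames 0/1 and frames 2/3 are near-duplicates.
sim = [
    [1.0, 0.9, 0.3, 0.2],
    [0.9, 1.0, 0.35, 0.25],
    [0.3, 0.35, 1.0, 0.8],
    [0.2, 0.25, 0.8, 1.0],
]
print(minimize_query(sim, 0.5))   # one frame per pose cluster
print(minimize_query(sim, 0.95))  # a larger phi keeps more frames
```

In practice the similarity matrix would come from the trained similarity network; here it is hard-coded so the sketch runs on its own.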

B. Image Re-identification using SVDNet
Our proposed method uses single image-based re-identification at the top layer of the hierarchy. We have used the Singular Vector Decomposition Network (SVDNet) [1] as the baseline. It uses a convolutional neural network with an eigenlayer before the fully connected layer. The eigenlayer consists of a set of weights. Figure 3 shows the architecture of a typical SVDNet. The output of SVDNet is a set of retrieved images with ranks up to k, as given in (8).
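The retrieval step that produces a rank-k list can be sketched as follows. The sketch assumes that each image has already been mapped to a feature vector (for example by a trained network) and that cosine similarity is the matching score; both are illustrative assumptions, and `cosine` and `top_k` are hypothetical helper names rather than part of SVDNet.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float],
          gallery: Sequence[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (gallery_index, score) pairs for the k most similar gallery images."""
    scored = [(idx, cosine(query, feat)) for idx, feat in enumerate(gallery)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Placeholder 2-D descriptors; real descriptors would be much longer.
gallery_feats = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
query_feat = [1.0, 0.0]
ranked = top_k(query_feat, gallery_feats, k=2)
print([idx for idx, _ in ranked])
```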

C. Self Similarity Guided Re-ranking
In the next step, we aggregate the self-similarity scores with the SVDNet outputs. A typical ResNet50 [35] architecture is trained to learn self-similarity scores using the tubes of the query set. We assume the images available in a tube are similar. Next, a similarity score is calculated between the query image and every output image of the SVD network up to rank k. Finally, the scores are averaged and the images are re-ranked. This step ensures that dissimilar images get pushed toward the end of the ranked sequence of retrieved images. Figure 4 illustrates this method.
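The averaging and re-ranking described above can be written in a few lines. This is a minimal sketch that assumes both scores lie on a comparable scale and are combined with equal weight; the function name `rerank` and the toy scores are hypothetical.

```python
from typing import Dict, List, Tuple


def rerank(retrieved: List[Tuple[str, float]],
           self_similarity: Dict[str, float]) -> List[Tuple[str, float]]:
    """Average each retrieval score with its self-similarity score and re-sort.

    retrieved holds (image_id, retrieval_score) pairs up to rank k;
    self_similarity maps image_id to the similarity with the query image.
    """
    fused = [(img, (score + self_similarity[img]) / 2.0) for img, score in retrieved]
    fused.sort(key=lambda pair: pair[1], reverse=True)
    return fused


# Image "a" ranks first on retrieval score alone but has low self-similarity.
retrieved = [("a", 0.9), ("b", 0.8), ("c", 0.7)]
self_sim = {"a": 0.2, "b": 0.9, "c": 0.6}
print([img for img, _ in rerank(retrieved, self_sim)])
```

In this toy case the likely false positive "a" moves from the first position to the last, which is the intended effect of the fusion.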

D. Tube Ranking by Temporal Correlation
The final step of the proposed method is to rank the tubes by temporal correlation among the retrieved images. We assume that the images belonging to a single tube are temporally correlated, as they are extracted by detection and tracking. Let the result matrix up to rank k for the query tube Q after the first two stages be denoted by R. The weight of an image of R can be estimated using (9).
Similarly, the weight of a tube (T_n) can be estimated using (10).
Finally, the temporal correlation cost (τ_{I_jk}) of an image in R can be estimated as given in (11).
Based on the temporal correlation, the retrieved tubes are ranked. Let the ranked tubes up to k be represented using (12), where higher-ranked tubes have higher weights.
The final ranked images are extracted by taking the highest scoring images from the tubes. The final ranked images are given in (13). Figure 5 explains the whole process of tube ranking and selection of the final set of frames.
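The idea of ranking tubes from the retrieved images can be sketched as below. Since (9)-(13) are not reproduced here, the weighting is an assumption: a tube's weight is taken to be the sum of the scores of its images that appear in the result set, and the representative image of a tube is its highest scoring image. The name `rank_tubes` and the tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_tubes(results: List[Tuple[str, str, float]], k: int) -> List[Tuple[str, str]]:
    """Rank tubes by accumulated score and return (tube_id, best_image_id) pairs.

    results holds (tube_id, image_id, score) triples gathered over all query
    images after the first two stages.
    """
    weight: Dict[str, float] = defaultdict(float)
    best: Dict[str, Tuple[str, float]] = {}
    for tube_id, image_id, score in results:
        # Assumed tube weight: sum of the scores of its retrieved images.
        weight[tube_id] += score
        if tube_id not in best or score > best[tube_id][1]:
            best[tube_id] = (image_id, score)
    ranked = sorted(weight, key=lambda t: weight[t], reverse=True)[:k]
    return [(tube_id, best[tube_id][0]) for tube_id in ranked]


# Tube T1 is hit three times, so it outranks T2 despite T2's single high score.
results = [
    ("T1", "a", 0.9),
    ("T2", "b", 0.95),
    ("T1", "c", 0.8),
    ("T3", "d", 0.5),
    ("T1", "e", 0.7),
]
print(rank_tubes(results, k=2))
```

Summing scores rewards tubes that are retrieved repeatedly across query images, which is one simple way to express the temporal-correlation assumption; other weightings are equally possible.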

IV. EXPERIMENTS
We have evaluated our proposed approach on two public datasets, iLIDS-VID [36] and PRID-11 [37], that are often used for testing video-based re-identification frameworks. In addition, we have prepared a new re-identification dataset. It has been recorded using 2 cameras in an indoor environment with human movement in a moderately dense crowd (more than 10 people appearing within 4-6 sq. m), varying camera angles, and persons with similar clothing. Such situations have not yet been covered in existing re-identification video datasets. Details about these datasets are presented in Table II. Several experiments have been conducted to validate our method, and a thorough comparative analysis has been performed.
Evaluation Metrics and Strategy: We have followed well-known experimental protocols for evaluating the method. For the iLIDS-VID and TRiViD datasets, the tubes are randomly split into 50% for training and 50% for testing. For PRID-11, we have followed the experimental setup proposed in [3], [5], [36], [38], [39]. Only the first 200 persons who appear in both cameras of the PRID-11 dataset have been used in our experiments. A 10-fold cross-validation scheme has been adopted and the average results are reported. We have prepared Cumulative Matching Characteristics (CMC) and mean average precision (mAP) curves to evaluate and compare the performance.
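For reference, the two metrics can be computed as in the following sketch. It assumes one ranked list of identity labels per query and uses the standard definitions of CMC and average precision; the function names and toy labels are illustrative and are not taken from the evaluation code used in the experiments.

```python
from typing import List, Sequence, Set


def cmc_curve(ranked_lists: Sequence[Sequence[str]],
              true_ids: Sequence[str],
              max_rank: int) -> List[float]:
    """Fraction of queries whose true identity appears within each rank 1..max_rank."""
    hits = [0] * max_rank
    for ranked, truth in zip(ranked_lists, true_ids):
        top = list(ranked[:max_rank])
        if truth in top:
            first = top.index(truth)
            # A match at position `first` counts for that rank and all later ranks.
            for r in range(first, max_rank):
                hits[r] += 1
    n = len(ranked_lists)
    return [h / n for h in hits] if n else [0.0] * max_rank


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant items."""
    hits, total = 0, 0.0
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / pos
    return total / len(relevant) if relevant else 0.0


# Three toy queries; the true identity is found at ranks 1, 2, and 3.
ranked_lists = [["p1", "p2", "p3"], ["p3", "p2", "p1"], ["p2", "p3", "p1"]]
true_ids = ["p1", "p2", "p1"]
cmc = cmc_curve(ranked_lists, true_ids, max_rank=3)
ap = average_precision(["a", "b", "c", "d"], {"a", "c"})
print(cmc, ap)
```

The mAP is then the mean of `average_precision` over all queries.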

A. Comparative Analysis
Although our work is unique in design, it has some similarities with the video re-id methods proposed in [38], [40], the multiple query-based method [1], and the re-ranking method [20]. Therefore, we have compared our approach with these recently proposed methods. It has been observed that the proposed method achieves a gain of up to 9.6% over the state-of-the-art methods when top-rank accuracy is estimated. Even when accuracy is computed up to rank 20, our method retains a margin of 3%. This margin is the key strength of the proposed method, and we consider it significant. It arises because our method tries to reduce the number of false positives, which has not yet been addressed by the re-identification research community. Figures 6-8 show the CMC curves and Table III summarizes the mAP up to rank 20 across the three datasets. Figure 9 shows a typical query and response on the PRID-11 dataset.

B. Computational Complexity Analysis
Re-identification in real time is a challenging task. Research carried out so far presumes the gallery to be a pre-recorded set of images and tries to rank the best 5, 10, 15, or 20 images from the set. However, executing a single query takes considerable time when multiple images are involved in the query. We have carried out a comparative analysis of computational complexity across various re-identification frameworks, including the proposed scheme. An NVIDIA Quadro P5000 series GPU has been used to implement the frameworks. The results are reported in Figure 10. We have observed that the proposed tube-based re-identification framework takes less time than the video re-id framework proposed in [38] and the multiple images-based re-id using SVDNet [1].

C. Effect of φ
Our proposed method depends on the query threshold (φ). In this section, we present an analysis of the effect of φ on the results. Figure 11 depicts the average number of query images generated from various query tubes. It may be observed that a higher φ produces more query images.
Figure 12 depicts the average CMC for varying φ. It may be observed that the accuracy does not increase significantly when φ is increased above 0.4.
Fig. 9: Typical results obtained on the PRID-11 dataset using a single image query [1], a video sequence [38], and the proposed method. A green box indicates a correct retrieval.

D. Results After Various Stages
In this section, we present the effect of the various stages of the overall framework on re-identification results. Table IV shows the accuracy (CMC) in each step of the proposed method. It may be observed that the proposed method gains 11% rank-1 accuracy after the first stage and 7% rank-1 accuracy after the second stage. The method gains 7% rank-20 accuracy in the first stage and 6% rank-20 accuracy after the second stage. Figure 14 shows an example of scores (true positives and false positives) during the self-similarity fusion. It may be observed that the SVDNet output scores and similarity scores are high in the case of true positives. Similarity scores are relatively low in the case of false positives. More results can be found in the supplementary data.

V. CONCLUSION
In this paper, we propose a new person re-identification framework that is able to outperform existing re-identification schemes when applied to videos or sequences of frames. The method uses a CNN-based framework (SVDNet) at the beginning. A self-similarity layer is used to refine the SVDNet scores. Finally, a temporal correlation layer is used to aggregate multiple query outputs and to match tubes. A query optimization has also been proposed to select an optimum set of images for a query tube. Our study reveals that the proposed method outperforms state-of-the-art single image-based, multiple images-based, and video-based re-identification methods in several cases. The computational cost is also reasonably low.
One straightforward extension of the present work is to fuse methods such as camera pose-based [2], video-based [38], and description-based [16] approaches. This may lead to higher accuracy in complex situations. Also, group re-identification can be attempted with a similar concept of tube-guided analysis.
Fig. 10: Average response time (in seconds) for a given query across the datasets. We have taken 100 query tubes at random and calculated the average response time using RCNN [38], TDL [40], Video re-id [38], SVDNet [1] (single image), SVDNet (multiple images), and SVDNet+Re-rank [20].
Fig. 11: Average number of query images by varying the query threshold (φ). We have taken 100 query sequences randomly, and the average number of optimized images is reported. It may be observed that a higher φ produces more query images.

Fig. 1: The proposed method for tube-to-tube re-identification. Our contributions are marked with circles. The method takes a tube as the query and ranks the tubes by best match.

Fig. 2: Examples of an original tube (first row), detected noisy frames (second row), the tube after noise removal (third row), and the minimized tube for query execution (fourth row), taken from the TRiViD dataset.

Fig. 3: Architecture of the SVDNet used in the first stage of the re-identification framework shown in Figure 1. It contains an Eigenlayer before the fully connected layer. The Eigenlayer contains the weights to be used during training.

Fig. 4: The self-similarity estimation layer. It learns to measure self-similarity during training. We use ResNet50 [35] as the baseline. It takes a set of ranked images (SVDNet outputs) and produces a set of re-ranked images by introducing self-similarities between the query image and the retrieved images.

Fig. 5: Explanation of re-identification with the help of the proposed 3-stage framework depicted in Figure 1.

Fig. 12: Accuracy (CMC) by varying the query threshold (φ). We have taken 100 query sequences randomly, and the average is reported. It may be observed that a higher φ may not produce higher accuracy.

Fig. 13: Execution time by varying φ. It may be observed that a higher φ takes more time to execute, as it produces more query images.

TABLE I: Recent progress in person re-identification research

TABLE II: Datasets used in our experiments. Only the TRiViD dataset is tracked to extract tubes. In the other datasets, the given sequences of images are considered as tubes.

TABLE III: mAP (%) up to rank 20 across the three video datasets

TABLE IV: Accuracy (CMC) in each step of the proposed method