1 Introduction

Searching for persons in videos is frequently needed in real-world scenarios. To catch a wanted criminal, the police may have to go through thousands of hours of videos collected from multiple surveillance cameras, probably with just a single portrait. To find the movie shots featuring a popular star, a retrieval system has to examine many hour-long films, with just a few facial photos as references. In applications like these, the reference photos are often taken in an environment that is very different from the target environments where the search is conducted. As illustrated in Fig. 1, such settings are very challenging. Even state-of-the-art recognition techniques would find it difficult to reliably identify all occurrences of a person in the face of dramatic variations in pose, makeup, clothing, and illumination.

Fig. 1. Person re-id differs significantly from the person search task. The first row shows a typical example of person re-id from the MARS dataset [44], where the reference and the targets are captured under similar conditions. The second row shows an example from our person search dataset CSM, where the reference portrait is dramatically different from the targets, which vary significantly in pose, clothing, and illumination.

It is noteworthy that two related problems, namely person re-identification (re-id) and person recognition in albums, have drawn increasing attention from the research community. However, they are substantially different from the problem of person search with one portrait, which we aim to tackle in this work. Specifically, in typical settings of person re-id [8, 13, 16, 22, 38, 44, 45], the queries and the references in the gallery set are usually captured under similar conditions, e.g. from different cameras along a street, and within a short duration. Even though some queries can be subject to issues like occlusion and pose changes, they can still be identified via other visual cues, e.g. clothing. For person recognition in albums [43], one is typically given a diverse collection of gallery samples, which may cover a wide range of conditions and can therefore be directly matched to various queries. Hence, for both problems, the references in the gallery are often good representatives of the targets, and methods based on visual cues can perform reasonably well [1, 3, 4, 14, 15, 22, 39, 43, 44]. In contrast, our task is to bridge a single portrait with a highly diverse set of samples, which is much more challenging and requires new techniques that go beyond visual matching.

To tackle this problem, we propose a new framework that propagates labels through both visual and temporal links. The basic idea is to take advantage of the identity invariance along a person trajectory, i.e. all person instances along a continuous trajectory in a video should belong to the same identity. The connections induced by tracklets, which we refer to as the temporal links, are complementary to the visual links based on feature similarity. For example, a trajectory can sometimes cover a wide range of facial images that cannot be easily associated based on visual similarity. With both visual and temporal links incorporated, our framework can form a large connected graph, allowing the identity information to be propagated over a very diverse collection of instances.

While the combination of visual and temporal links provides a broad foundation for identity propagation, it remains very challenging to carry out the propagation reliably over a large real-world dataset. As we begin with only a single portrait, a few wrong labels during propagation can result in catastrophic errors downstream. Indeed, our empirical study shows that conventional schemes like linear diffusion [46, 47] lead to substantially worse results. To address this issue, we develop a novel scheme called Progressive Propagation via Competitive Consensus, which performs the propagation prudently, spreading a piece of identity information only when there is high certainty.

To facilitate research on this problem setting, we construct a dataset named Cast Search in Movies (CSM), which contains 127K tracklets of 1,218 cast identities from 192 movies. The identities of all the tracklets are manually annotated, and each cast identity comes with a reference portrait. The benchmark is very challenging, as the person instances for each identity vary significantly in makeup, pose, clothing, illumination, and even age. On this benchmark, our approach achieves \(63.49\%\) and \(62.27\%\) mAP under the two test settings, compared to \(53.33\%\) and \(42.16\%\) mAP for the conventional visual-matching method. This shows that matching by visual cues alone cannot solve this problem well, and that our proposed framework, Progressive Propagation via Competitive Consensus, significantly raises the performance.

In summary, the main contributions of this work lie in four aspects: (1) We systematically study the problem of person search in videos, which often arises in real-world practice but remains widely open in research. (2) We propose a framework which incorporates both the visual similarity and the identity invariance along a tracklet, thus allowing the search to be carried out much further. (3) We develop the Progressive Propagation via Competitive Consensus scheme, which significantly improves the reliability of propagation. (4) We construct a dataset, Cast Search in Movies (CSM), with 127K manually annotated tracklets to promote the study of this problem.

2 Related Work

Person Re-id. Person re-id [6, 7, 41], which aims to match pedestrian images (or tracklets) from different cameras within a short period, has drawn much attention in the research community. Many datasets [8, 13, 16, 22, 38, 44, 45] have been proposed to promote re-id research. However, the videos are captured by just a few cameras in nearby locations within a short period; for example, the Airport [16] dataset is captured in an airport from 8 a.m. to 8 p.m. on a single day. Thus the instances of the same identity are usually similar enough to be identified by visual appearance, despite occlusion and pose changes. Given this characteristic of the data, most re-id methods focus on how to match a query with a gallery instance by visual cues. In early works, the matching process was split into feature design [9, 11, 26, 27] and metric learning [17, 23, 28]. Recently, many deep learning based methods have been proposed to handle the matching problem jointly. Li et al. [22] and Ahmed et al. [1] designed siamese-based networks that employ a binary verification loss to train the parameters. Ding et al. [4] and Cheng et al. [3] exploit the triplet loss to train more discriminative features. Xiao et al. [39] and Zheng et al. [44] proposed to learn features by classifying identities. Although the feature learning methods of re-id can be adopted for the Person Search with One Portrait problem, the two tasks are substantially different: in person search, the query and the gallery exhibit a huge gap in visual appearance, which makes one-to-one matching fail.

Person Recognition in Photo Albums. Person recognition [14, 15, 19, 24, 43] is another related problem, which usually focuses on persons in photo albums. It aims to recognize the identities of the queries given a set of labeled persons in the gallery. Zhang et al. [43] proposed a Pose Invariant Person Recognition method (PIPER), which combines three types of visual recognizers based on ConvNets, respectively on face, full body, and poselet-level cues. The PIPA dataset published in [43] has been widely adopted as a standard benchmark for evaluating person recognition methods. Oh et al. [15] evaluated the effectiveness of different body regions and used a weighted combination of the scores obtained from different regions for recognition. Li et al. [19] proposed a multi-level contextual model, which integrates person-level, photo-level and group-level contexts. However, person recognition is also quite different from the person search problem we aim to tackle in this paper, since the samples of the same identity in query and gallery are still similar in visual appearance, and the methods mostly focus on recognition by visual cues and context.

Person Search. There are some works that focus on the person search problem. Xiao et al. [40] proposed a person search task that aims to find the corresponding instances in gallery images without bounding box annotations. The associated data is similar to that of re-id; the key difference is that bounding boxes are unavailable, so the task can be seen as combining pedestrian detection and person re-id. Other works search for persons with different modalities of data, such as language-based [21] and attribute-based [5, 35] search, which target application scenarios different from the portrait-based problem we aim to tackle in this paper.

Label Propagation. Label propagation (LP) [46, 47], also known as graph transduction [30, 32, 37], is widely used as a semi-supervised learning method. It relies on building a graph in which nodes are data points (labeled and unlabeled) and edges represent similarities between points, so that labels can propagate from labeled points to unlabeled points. Various LP-based approaches have been proposed for face recognition [18, 48], semantic segmentation [33], object detection [36], and saliency detection [20] in the computer vision community. In this paper, we develop a novel LP-based approach called Progressive Propagation via Competitive Consensus, which differs from conventional LP in two respects: (1) propagating by competitive consensus rather than linear diffusion, and (2) iterating in a progressive manner.

Table 1. Comparing CSM with related datasets
Fig. 2. Examples from the CSM dataset. In each row, the photo on the left is the query portrait, and the tracklets that follow are its ground-truth tracklets in the gallery.

3 Cast Search in Movies Dataset

While there have been a number of public datasets for person re-id [8, 13, 16, 22, 38, 44, 45] and album-based person recognition [43], a dataset for our task, namely person search with a single portrait, remains lacking. In this work, we constructed a large-scale dataset, Cast Search in Movies (CSM), for this task. CSM comprises a query set that contains the portraits of 1,218 cast (the actors and actresses) and a gallery set that contains 127K tracklets (with 11M person instances) extracted from 192 movies.

We compare CSM with other datasets for person re-id and person recognition in Table 1. CSM is significantly larger, with 6 times more tracklets and 11 times more instances than MARS [44], which is the largest dataset for person re-id to our knowledge. Moreover, CSM has a much wider range of tracklet durations (from 1 to 4686 frames) and instance sizes (from 23 to 557 pixels in height). Figure 2 shows several example tracklets as well as their corresponding portraits, which are very diverse in pose, illumination, and clothing. It can be seen that the task is very challenging (Fig. 3).

Fig. 3. Statistics of the CSM dataset. (a) The tracklet number distribution over movies. (b) The tracklet number of each movie, both credited cast and “others”. (c) The distribution of tracklet number over cast. (d) The distribution of length (frames) over tracklets. (e) The distribution of height (px) over tracklets.

Query Set. For each movie in CSM, we acquired the cast list from IMDB. For movies with more than 10 cast, we only keep the top 10 according to the IMDB order, which covers the main characters for most of the movies. In total, we obtained 1,218 cast, whom we refer to as the credited cast. For each credited cast, we downloaded a portrait from either their IMDB or TMDB homepage, which serves as the query portrait in CSM.

Gallery Set. We obtained the tracklets in the gallery set through five steps:

  1. Detecting shots. A movie is composed of a sequence of shots. Given a movie, we first detected its shot boundaries using a fast shot segmentation technique [2, 34], resulting in a total of 200K shots over all movies. For each shot, we selected 3 frames as keyframes.

  2. Annotating bounding boxes on keyframes. We then manually annotated the person bounding boxes on keyframes and obtained around 700K bounding boxes.

  3. Training a person detector. We trained a person detector with the annotated bounding boxes. Specifically, all the keyframes are partitioned into a training set and a testing set by a ratio of 7:3. We then finetuned a Faster-RCNN [29] pre-trained on MSCOCO [25] on the training set. On the testing set, the detector achieves around \(91\%\) mAP, which is good enough for tracklet generation.

  4. Generating tracklets. With the person detector described above, we performed per-frame person detection over all the frames. By concatenating bounding boxes across frames with \(\text {IoU} > 0.7\) within each shot, we obtained 127K tracklets from the 192 movies (a sketch of this linking step follows this list).

  5. Annotating identities. Finally, we manually annotated the identities of all the tracklets. Particularly, each tracklet is annotated as one of the credited cast or as “others”. Note that the identities of the tracklets in each movie are annotated independently to ensure high annotation quality with a reasonable budget. Hence, being labeled as “others” means that the tracklet does not belong to any credited cast of the corresponding movie.
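To make step 4 concrete, below is a minimal sketch of the IoU-based linking within one shot, assuming detections are given as per-frame lists of boxes; the function names and the greedy matching strategy are illustrative assumptions, not a description of our actual pipeline.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-12)

def link_tracklets(frames, thresh=0.7):
    """Greedily concatenate per-frame detections within one shot.

    frames: list over frames, each a list of boxes.
    Returns tracklets as lists of (frame_index, box) pairs.
    """
    active, finished = [], []
    for t, boxes in enumerate(frames):
        unmatched, next_active = list(boxes), []
        for track in active:
            last_box = track[-1][1]
            scores = [iou(last_box, b) for b in unmatched]
            if scores and max(scores) > thresh:
                j = int(np.argmax(scores))
                track.append((t, unmatched.pop(j)))  # extend the tracklet
                next_active.append(track)
            else:
                finished.append(track)  # no overlapping box: track ends
        next_active.extend([(t, b)] for b in unmatched)  # start new tracklets
        active = next_active
    return finished + active
```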

Fig. 4. Visual links and temporal links in our graph. We keep only the single strongest link for each pair of tracklets. The two kinds of links are complementary: the former allows the identity information to be propagated among instances that are similar in appearance, while the latter allows propagation along a continuous tracklet, in which the instances can look significantly different. With both types of links incorporated, we can construct a more connected graph, which allows the identities to be propagated much further.

4 Methodology

In this work, we aim to develop a method to find all the occurrences of a person in a long video, e.g. a movie, with just a single portrait. The challenge of this task lies in the vast gap of visual appearance between the portrait (query) and the candidates in the gallery.

Our basic idea is to tackle this problem by leveraging the inherent identity invariance along a person tracklet and propagating the identities among instances via both visual and temporal links. The visual and temporal links are complementary; using both allows identities to be propagated much further than using either type alone. However, propagating reliably over a large, diverse, and noisy dataset remains very challenging, considering that we begin with just a small number of labeled samples (the portraits). The key to overcoming this difficulty is to be prudent, propagating only the information we are certain about. To this end, we propose a new propagation framework called Progressive Propagation via Competitive Consensus, which can effectively identify confident labels in a competitive way.

4.1 Graph Formulation

The propagation is carried out over a graph among person instances. Specifically, the propagation graph is constructed as follows. Suppose there are C cast in the query set and M tracklets in the gallery set, and that the k-th tracklet (denoted by \(\tau _k\)) has length \(n_k\), i.e. it contains \(n_k\) instances. The cast portraits and all the instances along the tracklets are treated as graph nodes; hence, the graph contains \(N = C + \sum _{k=1}^M n_k\) nodes. In particular, the identities of the C cast portraits are known, and the corresponding nodes are referred to as labeled nodes, while the other nodes are called unlabeled nodes.

The propagation framework aims to propagate the identities from the labeled nodes to the unlabeled nodes through both visual and temporal links between them. The visual links are based on feature similarity. For each instance (say the i-th), we can extract a feature vector, denoted as \(v_i\). Each visual link is associated with an affinity value: the affinity between two instances \(v_i\) and \(v_j\) is defined to be their cosine similarity, \(w_{ij} = v_i^\top v_j / (\Vert v_i \Vert \Vert v_j \Vert)\). Generally, a higher affinity value \(w_{ij}\) indicates that \(v_i\) and \(v_j\) are more likely to be from the same identity. The temporal links capture the identity invariance along a tracklet, i.e. all instances along a tracklet should share the same identity. In this framework, we treat the identity invariance as hard constraints, which are enforced via a competitive consensus mechanism.

For two tracklets with lengths \(n_k\) and \(n_l\), there can be \(n_k \cdot n_l\) links between their nodes. Among all these links, the strongest one, i.e. the link between the most similar pair, best reflects the visual similarity. Hence, we keep only the single strongest link for each pair of tracklets, as shown in Fig. 4, which makes the propagation more reliable and efficient. Also, thanks to the temporal links, this reduction does not compromise the connectivity of the whole graph.
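As an illustration, the strongest link between two tracklets can be found as in the following sketch, which assumes each tracklet is represented by a matrix of L2-normalized instance features (one row per instance), so that dot products equal the cosine affinities \(w_{ij}\):

```python
import numpy as np

def strongest_link(feats_k, feats_l):
    """Find the single strongest visual link between two tracklets.

    feats_k: (n_k, d), feats_l: (n_l, d) L2-normalized instance features.
    Returns (i, j, w): the most similar instance pair and its affinity.
    """
    sim = feats_k @ feats_l.T                     # all n_k * n_l affinities
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    return int(i), int(j), float(sim[i, j])
```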

As illustrated in Fig. 4, the visual and temporal links are complementary. The former allows the identity information to be propagated among those instances that are similar in appearance, while the latter allows the propagation along a continuous trajectory, in which the instances can look significantly different. With only visual links, we can obtain clusters in the feature space. With only temporal links, we only have isolated tracklets. However, with both types of links incorporated, we can construct a more connected graph, which allows the identities to be propagated much further.

Fig. 5. An example showing the difference between competitive consensus and linear diffusion. There are four nodes, with their probability vectors shown at their sides. We propagate labels from the left nodes to the right node; however, two of the neighbor nodes are noisy. The calculations of linear diffusion and competitive consensus are shown on the right. In a graph with much noise, our competitive consensus, which propagates only the most confident information, is more robust.

4.2 Propagating via Competitive Consensus

Each node of the graph is associated with a probability vector \(p_i\), which will be iteratively updated as the propagation proceeds. To begin with, we set the probability vector of each labeled node to be a one-hot vector indicating its label, and initialize all others to be zero vectors. Due to the identity invariance along tracklets, we enforce all nodes along a tracklet \(\tau _k\) to share the same probability vector, denoted by \(p_{\tau _k}\). At each iteration, we traverse all tracklets and update their associated probability vectors one by one.

Linear Diffusion. Linear diffusion is the most widely used propagation scheme, where a node updates its probability vector by taking a linear combination of those of its neighbors. In our setting with identity invariance, the linear diffusion scheme can be expressed as follows:

$$\begin{aligned} p^{(t+1)}_{\tau _k} = \sum _{j \in \mathcal {N}(\tau _k)} \alpha _{kj}\, p^{(t)}_j, \quad \text {with} \quad \alpha _{kj} = \frac{\tilde{w}_{kj}}{\sum _{j' \in \mathcal {N}(\tau _k)} \tilde{w}_{kj'}}. \end{aligned}$$
(1)

Here, \(\mathcal {N}(\tau _k)\) is the set of all visual neighbors of the instances in \(\tau _k\). Also, \(\tilde{w}_{kj}\) is the affinity of a neighbor node j to the tracklet \(\tau _k\). Due to the constraint that there is only one visual link between two tracklets (see Sect. 4.1), each neighbor j is connected to just one of the nodes in \(\tau _k\), and \(\tilde{w}_{kj}\) is set to the affinity between the neighbor j and that node.
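For concreteness, here is a sketch of one linear-diffusion update for a single tracklet following Eq. (1); the neighbor-list data layout is an assumption made for the example:

```python
import numpy as np

def linear_diffusion_step(p, neighbors):
    """One update of Eq. (1) for a tracklet.

    p: (N, C) current probability vectors of all nodes.
    neighbors: list of (j, w_kj) pairs -- the visual neighbors of the
    tracklet's instances with their affinities (one link per neighbor
    tracklet, as described in Sect. 4.1).
    Returns the updated (C,) probability vector of the tracklet.
    """
    idx = np.array([j for j, _ in neighbors])
    w = np.array([w for _, w in neighbors], dtype=float)
    alpha = w / w.sum()        # normalized coefficients alpha_kj
    return alpha @ p[idx]      # linear combination of neighbor vectors
```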

However, we found that the linear diffusion scheme yields poor performance in our experiments, even far worse than the naive visual matching method. An important reason for the poor performance is that errors are mixed into the updated probability vector and then propagated to other nodes. This can cause catastrophic errors downstream, especially on a real-world dataset filled with noise and challenging cases.

Competitive Consensus. To tackle this problem, it is crucial to improve the reliability and propagate the most confident information only. Particularly, we should only trust those neighbors that provide strong evidence instead of simply taking the weighted average of all neighbors. Following this intuition, we develop a novel scheme called competitive consensus.

When updating \(p_{\tau _k}\), the probability vector for the tracklet \(\tau _k\), we first collect the strongest evidence supporting each identity c from all the neighbors in \(\mathcal {N}(\tau _k)\), as

$$\begin{aligned} \eta _k(c) = \max _{j \in \mathcal {N}(\tau _k)} \alpha _{kj}\, p^{(t)}_j(c), \end{aligned}$$
(2)

where the normalized coefficient \(\alpha _{kj}\) is defined in Eq. (1). Intuitively, an identity is strongly supported for \(\tau _k\) if one of its neighbors assigns a high probability to it. Next, we turn the evidence for individual identities into a probability vector via a tempered softmax function as

$$\begin{aligned} p^{(t+1)}_{\tau _k}(c) = \exp (\eta _k(c)/T) / \sum _{c'=1}^C \exp (\eta _k(c')/T). \end{aligned}$$
(3)

Here, T is a temperature that controls how much the probabilities concentrate on the strongest identity. In this scheme, all identities compete for high probability values in \(p^{(t+1)}_{\tau _k}\) by collecting the strongest supports from the neighbors. This allows the strongest identity to stand out.
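Below is a sketch of the corresponding competitive-consensus update, using the same illustrative neighbor-list layout as the linear-diffusion example above; the default temperature is a placeholder, not a tuned value:

```python
import numpy as np

def competitive_consensus_step(p, neighbors, T=0.1):
    """One update of Eqs. (2)-(3) for a tracklet.

    For each identity c, keep only the strongest weighted support among
    the neighbors (Eq. 2), then sharpen the evidence vector with a
    tempered softmax (Eq. 3).
    """
    idx = np.array([j for j, _ in neighbors])
    w = np.array([w for _, w in neighbors], dtype=float)
    alpha = w / w.sum()
    eta = (alpha[:, None] * p[idx]).max(axis=0)   # Eq. (2): per-class max
    logits = eta / T
    logits -= logits.max()                        # for numerical stability
    e = np.exp(logits)
    return e / e.sum()                            # Eq. (3): tempered softmax
```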

Competitive consensus can be considered as a coordinate ascent method for solving Eq. (4), where we introduce a binary variable \(z_{kj}^{(c)}\) to indicate whether the j-th neighbor is a trustable source of the class c for the k-th tracklet. Here, \(\mathcal {H}\) is the entropy. The constraint means that one trustable source is selected for each class c and tracklet k.

$$\begin{aligned} \max _{\{p_{\tau _k}\}, \{z\}} \; \sum _k \Big [ \sum _{c=1}^C p_{\tau _k}(c) \sum _{j \in \mathcal {N}(\tau _k)} z_{kj}^{(c)}\, \alpha _{kj}\, p_j(c) + T \cdot \mathcal {H}(p_{\tau _k}) \Big ], \quad \text {s.t.} \; \sum _{j \in \mathcal {N}(\tau _k)} z_{kj}^{(c)} = 1. \end{aligned}$$
(4)
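To make the connection explicit, here is a brief sketch of the two coordinate-ascent steps under this formulation. Fixing z, the per-tracklet subproblem maximizes a linear score plus a scaled entropy over the probability simplex, whose maximizer is the tempered softmax of Eq. (3); fixing the probability vectors, the objective is linear in z, so the constraint puts all mass on the strongest neighbor, recovering the max of Eq. (2):

$$\begin{aligned} \max _{p} \; \sum _{c} p(c)\, \eta _k(c) + T \cdot \mathcal {H}(p) \;&\Rightarrow \; p(c) \propto \exp \big (\eta _k(c) / T\big ), \\ \max _{z} \; \sum _{j \in \mathcal {N}(\tau _k)} z_{kj}^{(c)}\, \alpha _{kj}\, p_j(c) \;&\Rightarrow \; z_{kj}^{(c)} = \mathbb {1}\Big [ j = \mathop {\mathrm {arg\,max}}_{j'}\, \alpha _{kj'}\, p_{j'}(c) \Big ]. \end{aligned}$$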

Figure 5 illustrates how linear diffusion and our Competitive Consensus work. Experiments on CSM also show that competitive consensus significantly improves the performance on the person search problem.

4.3 Progressive Propagation

In conventional label propagation, labels of all the nodes are updated until convergence, which can be prohibitively expensive when the graph contains a large number of nodes. Moreover, for the person search problem this is unnecessary: once we are very confident about the identity of a certain instance, we do not have to keep updating it.

Motivated by the analysis above, we propose a progressive propagation scheme to accelerate the propagation process. At each iteration, we fix the labels for a certain fraction of nodes that have the highest confidence, where the confidence is defined to be the maximum probability value in \(p_{\tau _k}\). We found empirically that a simple freezing schedule, e.g. adding \(10\%\) of the instances to the label-frozen set at each iteration, already brings notable benefits to the propagation process.
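A minimal sketch of this loop, assuming a generic update_fn that recomputes the probability vectors of all non-frozen tracklets (e.g. by competitive consensus); the fixed schedule and the names are illustrative:

```python
import numpy as np

def progressive_propagation(p, update_fn, n_iters=10, freeze_frac=0.1):
    """Propagation with progressive label freezing (a sketch).

    p: (M, C) tracklet probability vectors.
    update_fn(p, frozen) -> (M, C): one propagation sweep.
    After each iteration, an extra freeze_frac of the tracklets with the
    highest confidence (max probability) join the label-frozen set.
    """
    M = p.shape[0]
    frozen = np.zeros(M, dtype=bool)
    for _ in range(n_iters):
        new_p = update_fn(p, frozen)
        p[~frozen] = new_p[~frozen]      # frozen labels stay fixed
        conf = p.max(axis=1)
        conf[frozen] = -1.0              # never re-freeze
        k = int(freeze_frac * M)
        frozen[np.argsort(-conf)[:k]] = True
    return p
```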

Note that the progressive scheme not only reduces the computational cost but also improves the propagation accuracy. The reason is that without freezing, noisy and uncertain nodes keep affecting all the other nodes, which can sometimes cause additional errors. Experiments in Sect. 5.3 show more details.

Table 2. Train/Val/Test splits of CSM
Table 3. Query/Gallery size

5 Experiments

5.1 Evaluation Protocol and Metrics of CSM

The 192 movies in CSM are partitioned into training (train), validation (val) and testing (test) sets. Statistics of these sets are shown in Table 2. Note that we ensure there is no overlap between the cast of different sets, i.e. the cast in the testing set do not appear in training or validation. This ensures the reliability of the testing results.

Under the Person Search with One Portrait setting, one should rank all the tracklets in the gallery given a query. For this task, we use mean Average Precision (mAP) as the evaluation metric. We also report the recall of the tracklet identification results in terms of R@k: we rank the identities for each tracklet according to their probabilities, and R@k is the fraction of tracklets for which the correct identity is listed within the top k results.
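For clarity, a small sketch of the R@k computation under these definitions; the array layout is an assumption made for illustration:

```python
import numpy as np

def recall_at_k(probs, labels, k=1):
    """Fraction of tracklets whose true identity is ranked in the top k.

    probs: (M, C) identity probabilities per tracklet.
    labels: (M,) ground-truth identity indices.
    """
    topk = np.argsort(-probs, axis=1)[:, :k]
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))
```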

We consider two test settings in the CSM benchmark, named “search cast in a movie” (IN) and “search cast across all movies” (ACROSS). In the “IN” setting, the gallery consists of just the tracklets from one movie, including the tracklets of the credited cast and those of “others”. In the “ACROSS” setting, the gallery comprises all the tracklets of the credited cast in the testing set. Here we exclude the tracklets of “others” in the “ACROSS” setting because, as mentioned in Sect. 3, “others” only means that a tracklet does not belong to any credited cast of a particular movie, not of all the movies in the dataset. Table 3 shows the query/gallery sizes of each setting.

5.2 Implementation Details

We use two kinds of visual features in our experiments. The first is the IDE feature [44] widely used in person re-id. The IDE descriptor is a CNN feature of the whole person instance, extracted by a ResNet-50 [12] pre-trained on ImageNet [31] and finetuned on the training set of CSM. The second is the face feature, extracted by a ResNet-101 trained on MS-Celeb-1M [10]. For each instance, we extract its IDE feature and the face feature of the face region, which is detected by a face detector [42]. All visual similarities in the experiments are calculated by the cosine similarity between the visual features.

Table 4. Results on CSM under two test settings

5.3 Results on CSM

We set up four baselines for comparison: (1) FACE: matching the portrait with the tracklets in the gallery by face feature similarity, where we use the mean feature of all the instances in a tracklet to represent it. (2) IDE: similar to FACE, except that IDE features are used rather than face features. (3) IDE+FACE: combining face similarity and IDE similarity for matching, with weights 0.8 and 0.2 respectively. (4) LP: conventional label propagation with linear diffusion over both visual and temporal links. Specifically, we use the face similarity as the visual links between portraits and candidates, and the IDE similarity as the visual links between different candidates. We also consider two settings of the proposed Progressive Propagation via Competitive Consensus method: (5) PPCC-v: using only visual links. (6) PPCC-vt: the full configuration with both visual and temporal links.

From the results in Table 4, we can see that: (1) Even with a very powerful CNN trained on a large-scale dataset, matching portraits and candidates by visual cues cannot solve the person search problem well, due to the large gap in visual appearance between the portraits and the candidates. Although face features are generally more stable than IDE features, they fail when the faces are invisible, which is very common in real-world videos like movies. (2) Label propagation with linear diffusion yields very poor results, even worse than the matching-based methods. (3) Our approach raises the performance by a considerable margin. Particularly, the performance gain is especially remarkable in the more challenging “ACROSS” setting (62.27 mAP with ours vs. 42.16 with the visual matching method).

Fig. 6. mAP of different settings of competitive consensus. Comparison between different temperatures (T) of the softmax and different settings of k (in the top-k average).

Table 5. Results of different updating schemes

Analysis on Competitive Consensus. To show the effectiveness of Competitive Consensus, we study different settings of the scheme in two aspects: (1) The \(\max \) in Eq. (2) can be relaxed to a top-k average, where k indicates the number of neighbors to receive information from. When \(k=1\), it reduces to taking only the maximum, which is what we use in PPCC. Performances obtained with different k are shown in Fig. 6. (2) We also study the softmax in Eq. (3) and compare the results between different temperatures. The results are also shown in Fig. 6. Clearly, using a smaller softmax temperature significantly boosts the performance. This study supports what we claimed when designing Competitive Consensus: we should only propagate the most confident information in this task.

Analysis on Progressive Propagation. Here we compare our progressive updating scheme with the conventional scheme that updates all the nodes at each iteration. For progressive propagation, we try two kinds of freezing mechanisms: (1) The step scheme sets the freezing ratio of each iteration and raises the ratio step by step; specifically, the freezing ratio is set to \(r = 0.5 + 0.1 \times \text {iter}\) in our experiment. (2) The threshold scheme sets a threshold, and at each iteration freezes the nodes whose maximum probability for a particular identity is greater than that threshold; in our experiments, the threshold is set to 0.5. The results are shown in Table 5, from which we can see the effectiveness of the progressive scheme.

Case Study. We show some samples that are correctly identified in different iterations in Fig. 7. We can see that the easy cases, which usually contain clear frontal faces, are identified at the beginning. After iterative propagation, the information reaches the harder samples. At the end of the propagation, even some very hard samples, which are non-frontal, blurred, occluded, or under extreme illumination, receive the right identity.

Fig. 7. Some samples that are correctly identified in different iterations.

6 Conclusion

In this paper, we studied a new problem named Person Search in Videos with One Portrait, which is challenging but practical in the real world. To promote research on this problem, we constructed a large-scale dataset, CSM, which contains 127K tracklets of 1,218 cast from 192 movies. To tackle this problem, we proposed a new framework that incorporates both visual and temporal links for identity propagation, with a novel Progressive Propagation via Competitive Consensus scheme. Both quantitative and qualitative studies show the challenges of the problem and the effectiveness of our approach.