Joint Face Representation Adaptation and Clustering in Videos

Zhang, Zhanpeng; Luo, Ping; Loy, Chen Change; Tang, Xiaoou

doi:10.1007/978-3-319-46487-9_15

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 9907)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Clustering faces in movies or videos is extremely challenging since characters’ appearance can vary drastically under different scenes. In addition, the various cinematic styles make it difficult to learn a universal face representation for all videos. Unlike previous methods that assume fixed handcrafted features for face clustering, in this work, we formulate a joint face representation adaptation and clustering approach in a deep learning framework. The proposed method allows face representation to gradually adapt from an external source domain to a target video domain. The adaptation of deep representation is achieved without any strong supervision but through iteratively discovered weak pairwise identity constraints derived from potentially noisy face clustering result. Experiments on three benchmark video datasets demonstrate that our approach generates character clusters with high purity compared to existing video face clustering methods, which are either based on deep face representation (without adaptation) or carefully engineered features.

1 Introduction

Face clustering in videos aims at grouping detected faces into different subsets according to different characters. It is a popular research topic [1–5] due to its wide spectrum of applications, e.g. video summarization, content-based retrieval, story segmentation, and character interaction analysis. It can even be exploited as a tool for collecting large-scale datasets for face recognition [4].

Clustering faces in videos is challenging. As shown in Fig. 1, the appearance of a character can vary drastically under different scenes as the story progresses. The viewing angles and lighting also vary widely due to the rich cinematic techniques, such as different shots (e.g. deep focus, follow shot), a variety of lighting techniques, and aesthetics. In many cases, the face is blurred due to fast motion or occluded due to interactions between characters. The blurring and occlusion are more severe for fantasy and action movies, e.g. the Harry Potter series.

Conventional techniques that assume fixed handcrafted features [2, 4] may fail in cases such as those shown in Fig. 1. Specifically, handcrafted features are susceptible to large appearance, illumination, and viewpoint variations, and therefore cannot cope with drastic appearance changes. Deep learning approaches have achieved substantial advances for face representation learning [6–8]. These methods could arguably provide a more robust representation for our problem. However, two issues hinder a direct application of deep learning approaches. Firstly, contemporary deep models [6–8] for face recognition are trained with web images or photos from personal albums. These models overfit to the training data distributions and thus do not generalize directly to clustering faces in different videos with different cinematic styles. Secondly, faces detected in videos usually do not come with identity labels^{Footnote 1}. Hence, we cannot adopt the popular transfer learning approach [11] to adapt these models for our desired videos.

In the absence of precise face annotations, we need to provide deep models with the new capability of learning from weak and noisy supervisions to achieve model adaptation for unseen videos. To this end, we formulate a novel deep learning framework that jointly performs representation adaptation and face clustering in a target video. On one hand, deep representation adaptation provides robust features that permit for better face clustering under unconstrained variations. On the other hand, the clustering results, in return, provide weak pairwise constraints (whether two faces should/should not be assigned to the same cluster) for learning more robust deep representation.

We note that pairwise constraints derived from face tracks (i.e. detection or tracking result of face image subsequences) have been used in previous studies to improve video face clustering [3–5]. In particular, faces appearing in the same frame are unlikely to belong to the same person, while any two faces in the same face track should belong to the same person. Our approach differs from these studies in that we do not only exploit such static constraints: our method also takes advantage of weak dynamic constraints obtained from joint clustering. Carefully utilizing such noisy constraints is challenging, and we show that our approach is capable of forming a positive and alternating feedback loop between representation adaptation and clustering.

Contributions: (1) We formulate the video face clustering in a novel deep learning framework. An alternating feedback loop between representation adaptation and clustering is proposed to adapt the deep model from a source domain to a target video domain. To our knowledge, this is the first attempt to introduce deep learning for video face clustering. (2) Different from existing methods that construct static pairwise constraints from the face trajectories, we iteratively discover inter and intra person constraints that allow us to better adapt the face representation in the target video. Experiments on three benchmark video datasets show that the proposed method significantly outperforms existing state-of-the-art methods [2, 4, 12]. In addition, we apply the adapted representation for face verification in the target video. The results demonstrate the superiority of our method compared to deep face representation without adaptation [7]. Code will be released to provide details of our algorithm^{Footnote 2}.

2 Related Work

Traditional face clustering methods [13–16] are usually purely data-driven and unsupervised. In particular, these algorithms mainly focus on clustering the photo albums. How to find a good distance metric between faces or effective subspace for face representation is the key point for these algorithms. For example, Zhu et al. [14] propose a rank-order distance that measures similarity between two faces using their neighboring information. Fitzgibbon and Zisserman [17] develop a joint manifold distance (JMD) that measures the distance between two subspaces. Each subspace is invariant to a desired group of transformations. In addition, there are also techniques that utilize the user interaction [18], extra information on the web [19] and prior knowledge of family photo albums [20] to improve the performance. Another line of work on clustering employs linear classification cost as a clustering criterion, such as DIFFRAC discriminative clustering framework [21].

Recently, clustering faces in videos has attracted more attention. Existing algorithms aim at exploiting the inherent pairwise constraints obtained from the face tracks for better clustering performance. Cinbis et al. [3] learn a cast-specific metric, adapted to the people appearing in a particular video, such that pairs of faces within a track are close and faces appearing together in a video frame are far from each other. More recently, Xiao et al. [2] introduce subspace clustering to solve this problem, and design a weighted block-sparse regularizer to incorporate the pairwise constraints. These algorithms usually employ handcrafted features, so the effectiveness of the representation is limited. For example, the algorithm in [2] extracts SIFT descriptors from the detected facial landmarks; it cannot deal with profile faces. In addition, in these works, the constraints extracted from the face tracks are sparse and not updated in the clustering process, so they may fail to provide enough information to guide the clustering. To mitigate this problem, Wu et al. [4] augment the constraints by a local smoothness assumption before clustering. Different from these studies, we gain richer guidance by iteratively generating constraints based on the clustering process.

In addition to the inherent pairwise constraint, recent works on video face clustering also incorporate contextual information [1]. For example, clothing [22], speech [23], gender [12], video editing style [24], and cluster proportion prior [25] are employed as additional cues to link faces of the same person. While additional context may introduce uncertainty and its availability will limit the application scenario, in this work, we focus on adapting better face representation via dynamic clustering constraints, which are robust and readily obtainable.

3 Our Approach

This section presents our formulation of joint face representation adaptation and clustering as a probabilistic framework, and provides an alternating optimization method to solve it.

Following previous video face clustering methods [2, 4, 5], given a set of face tracks $\mathbf{O }=\{I^i_j\}$ in a target video, where $I^{i}_j$ is the j-th face image of the i-th track, our goal is to obtain a representation of the face images as well as to partition all the face images according to the different characters of the target video. We define a set of filters $\mathbf{W }$, which transform the raw pixels of each face image $I^{i}_j$ into its high-level hidden representation $\mathbf{x }^i_j$ in a Deep Convolutional Network (DCN), as shown in Fig. 2(a). The filters $\mathbf{W }$ are initialized by training on an external large-scale face dataset (see Sect. 3.3). To guide the clustering process, we also define a set of pairwise identity constraints $\mathbf{C }=\{c(I^i_j, I^{i'}_{j'})\}$ for any pair of face images:

$$\begin{aligned} c(I^i_j, I^{i'}_{j'})= {\left\{ \begin{array}{ll} 1 &{}I^i_j\ and\ I^{i'}_{j'}\ belong\ to\ the\ same\ identity,\\ -1&{}I^i_j\ and\ I^{i'}_{j'}\ belong\ to\ different\ identities,\\ 0 &{}not\ defined. \end{array}\right. } \end{aligned}$$

(1)

Note that different from previous studies [3–5], the identity constraints $\mathbf{C }$ will be updated iteratively instead of kept static. As shown in Fig. 3, at the very beginning, we initialize the identity constraints (denoted as $\mathbf{C }_0$) by assuming all the face images in the same track have the same identity, i.e. $c(I^i_j, I^{i'}_{j'})=1$, $i=i'$. In addition, for faces in partially or fully overlapped face tracks (e.g. faces appearing in the same frame of the video), their identities should be exclusive. Thus, we define $c(I^i_j, I^{i'}_{j'})=-1$. The constraints between the remaining face pairs are undefined, i.e. $c(I^i_j, I^{i'}_{j'})=0$.
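As an illustration of how the initial constraint set $\mathbf{C }_0$ could be assembled, the following minimal sketch builds the $+1$ and $-1$ entries from track membership and temporal overlap. The input layout (a dict from track id to the frame indices of its faces) and the function name `initial_constraints` are assumptions made for this example, not part of the original method description.

```python
from itertools import combinations


def initial_constraints(tracks):
    """Build C_0 from face tracks.

    tracks: dict mapping track id -> list of frame indices, one per face
            image in that track (hypothetical input layout).
    Returns a dict mapping ((i, j), (i2, j2)) -> +1 or -1. Pairs absent
    from the dict are undefined, i.e. c = 0.
    """
    constraints = {}
    # Faces within the same track share one identity (c = +1).
    for i, frames in tracks.items():
        for j, j2 in combinations(range(len(frames)), 2):
            constraints[((i, j), (i, j2))] = 1
    # Tracks that overlap in time have exclusive identities (c = -1).
    for i, i2 in combinations(sorted(tracks), 2):
        if set(tracks[i]) & set(tracks[i2]):
            for j in range(len(tracks[i])):
                for j2 in range(len(tracks[i2])):
                    constraints[((i, j), (i2, j2))] = -1
    return constraints


# Tracks 0 and 1 share frame 11; track 2 is isolated.
tracks = {0: [10, 11], 1: [11, 12], 2: [40]}
c0 = initial_constraints(tracks)
```

Storing only the defined pairs keeps the structure sparse, which matches the observation that most face pairs start out undefined.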

Then we define a set of cluster labels $\mathbf{Y }=\{y^i_j\}$, where $y^i_j=\ell $ and $\ell \in \{1,2,...,K\}$, indicating which one of the K characters the corresponding face image $I^{i}_j$ belongs to, as shown in Fig. 2(b). The clusters and face representation can then be obtained by maximizing the a posteriori probability (MAP)

$$\begin{aligned} \mathbf{X }^*,\mathbf{Y }^*,\mathbf{W }^*=\arg \max _{\mathbf{X },\mathbf{Y },\mathbf{W }}p(\mathbf{X },\mathbf{Y },\mathbf{W }|\mathbf{O },\mathbf{C }), \end{aligned}$$

(2)

where $\mathbf{O }=\{I^i_j\}$ and $\mathbf{X }=\{\mathbf{x }^i_j\}$. $\mathbf{C }$ is the dynamic identity constraint. By factorization, Eq. (2) is proportional to $p(\mathbf{C }| \mathbf{O },\mathbf{W }) p(\mathbf{C }|\mathbf{X },\mathbf{Y },\mathbf{O },\mathbf{W }) p(\mathbf{X },\mathbf{Y }|\mathbf{O },\mathbf{W })$ $p(\mathbf{W }|\mathbf{O })$. Since the image set $\mathbf{O }$ is given and fixed, we can remove it from the last term. Here we also make the following assumptions: (1) the update of constraints $\mathbf{C }$ is independent of $\mathbf{W }$, i.e. $p(\mathbf{C }|\mathbf{X },\mathbf{Y },\mathbf{O },\mathbf{W })=p(\mathbf{C }|\mathbf{X },\mathbf{Y })$; (2) $\mathbf{O }$ is independent of the inference process of $\mathbf{Y }$ because $\mathbf{Y }$ is inferred from $\mathbf{X }$, i.e. $p(\mathbf{X },\mathbf{Y }|\mathbf{O },\mathbf{W })=p(\mathbf{X },\mathbf{Y }|\mathbf{W })$; (3) inference of the cluster label $\mathbf{Y }$ is independent of $\mathbf{W }$, i.e. $p(\mathbf{X },\mathbf{Y }|\mathbf{W })=p(\mathbf{X },\mathbf{Y })$. Then we have

$$\begin{aligned} p(\mathbf{X },\mathbf{Y },\mathbf{W }|\mathbf{O },\mathbf{C })\propto p(\mathbf{C }|\mathbf{O },\mathbf{W })p(\mathbf{C }|\mathbf{X },\mathbf{Y })p(\mathbf{X },\mathbf{Y })p(\mathbf{W }), \end{aligned}$$

(3)

where the first term $p(\mathbf{C }|\mathbf{O },\mathbf{W })$ solves for the filters $\mathbf{W }$ of the DCN by using the pairwise identity constraints as supervision. This can be implemented by imposing a contrastive loss in the DCN training process (see Sect. 3.3 for details). As a result, the hidden representation $\mathbf{X }$ can be obtained using the learned filters $\mathbf{W }$. The second term $p(\mathbf{C }|\mathbf{X },\mathbf{Y })$ updates these constraints leveraging $\mathbf{X }$ and the estimated character labels $\mathbf{Y }$, as discussed in Sect. 3.2. The fourth term $p(\mathbf{W })$ regularizes the network filters.

In Eq. (3), the third term $p(\mathbf{X },\mathbf{Y })$ infers the character label $\mathbf{Y }$ given the hidden representation $\mathbf{X }$. Motivated by the fact that if two face images are close in the space of the hidden representation, the character labels are likely to be the same, we establish the relation between face pairs by Markov Random Field (MRF), where each node represents a character label $y^i_j$ and each edge represents the relation between the character labels. For each node $y^i_j$, we associate it with the observed variable $\mathbf{x }^i_j$. Then we have

$$\begin{aligned} {\begin{matrix}&p(\mathbf{X },\mathbf{Y })=p(\mathbf{X }|\mathbf{Y })p(\mathbf{Y })\propto \prod _{i,j}\varPhi (\mathbf{x }^i_j|y^i_j)\prod _{i,j}\prod _{i',j'\in \mathcal {N}^i_j}\varPsi (y^i_j,y^{i'}_{j'}), \end{matrix}} \end{aligned}$$

(4)

where $\varPhi (\cdot )$ and $\varPsi (\cdot )$ are the unary and pairwise term, respectively. $\mathcal {N}^i_j$ signifies a set of face images, which are the neighbors of $y^i_j$ and defined by the representation similarity.

The parameters of Eq. (3) are optimized by alternating between the following three steps, as illustrated in Fig. 2: (1) fix the filters $\mathbf{W }$ of the DCN, obtain the current face representation $\mathbf{X }$, and infer character labels $\mathbf{Y }$ by optimizing the MRF as defined in Eq. (4); (2) update the identity constraints $\mathbf{C }$ given $\mathbf{X }$ and the inferred character labels $\mathbf{Y }$; and (3) update the hidden face representation through $\mathbf{W }$ by minimizing the contrastive loss of the identity constraints, corresponding to maximizing $p(\mathbf{C }|\mathbf{O },\mathbf{W })p(\mathbf{W })$. This optimization process is conducted for $T=3$ iterations in our implementation. We describe these three steps in Sects. 3.1, 3.2, and 3.3 respectively.

3.1 Inferring Character Labels

Given the current face representation $\mathbf{X }$, we infer the character labels $\mathbf{Y }$ by maximizing the joint probability $p(\mathbf{X }, \mathbf{Y })$. We employ the Gaussian distribution to model the unary term $\varPhi (\cdot )$ in Eq. (4)

$$\begin{aligned} \varPhi (\mathbf{x }^i_j|y^i_j=\ell )\thicksim \mathcal {N}(\mathbf{x }^i_j|\mu _{\ell },\varSigma _{\ell }), \end{aligned}$$

(5)

where $\mu _{\ell }$ and $\varSigma _{\ell }$ denote the mean vector and covariance matrix of the $\ell $-th character, which are obtained and updated in the inference process. For the pairwise term $\varPsi (\cdot )$ in Eq. (4), it is defined as

$$\begin{aligned} {\begin{matrix} \varPsi (y^i_j,y^{i'}_{j'})=\exp \big \{\alpha v(\mathbf{x }^i_j,\mathbf{x }^{i'}_{j'})\cdot \big (\mathbf 1 (y^i_j,y^{i'}_{j'})-\mathbf 1 (v(\mathbf{x }^i_j,\mathbf{x }^{i'}_{j'})>0)\big )\big \}, \end{matrix}} \end{aligned}$$

(6)

where $\mathbf 1 (\cdot )$ is an indicator function and $\alpha $ is a trade-off coefficient updated in the inference process. Furthermore, $v(\cdot ,\cdot )$ is a pre-computed function that encodes the relation between any pair of face images $\mathbf{x }^i_j$ and $\mathbf{x }^{i'}_{j'}$. Similar to [4], positive relation (i.e. $v(\cdot ,\cdot )>0$) means that the face images are likely from the same character. Otherwise, they belong to different characters. Specifically, the computation of v is a combination of two cues: (1) the similarity between appearances of a pair of face images and (2) the pairwise spatial and temporal constraints of the face images. For instance, face images within a face track belong to the same character, while face images appearing in the same frame belong to different characters. Intuitively, Eq. (6) encourages face images with positive relation to be the same character. For example, if $v(\mathbf{x }^i_j,\mathbf{x }^{i'}_{j'})>0$ and $y^i_j=y^{i'}_{j'}$, we have $\varPsi (y^i_j,y^{i'}_{j'})=1$. However, if $v(\mathbf{x }^i_j,\mathbf{x }^{i'}_{j'})>0$ but $y^i_j\ne y^{i'}_{j'}$, we have $\varPsi (y^i_j,y^{i'}_{j'})<1$, indicating the character label assignment is violating the pairwise constraints.
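The behaviour of the pairwise term in Eq. (6) can be checked numerically with a small sketch. It assumes that $\mathbf 1 (y^i_j,y^{i'}_{j'})$ equals 1 when the two labels agree, as the examples above imply; the function name `pairwise_potential` and the default value of `alpha` are illustrative choices only.

```python
import math


def pairwise_potential(v, same_label, alpha=1.0):
    """Evaluate Psi of Eq. (6) for one face pair.

    v:          pre-computed relation value for the pair (sign matters).
    same_label: True if the two faces currently share a character label.
    alpha:      trade-off coefficient (fixed here for illustration).
    """
    indicator_same = 1.0 if same_label else 0.0
    indicator_positive = 1.0 if v > 0 else 0.0
    return math.exp(alpha * v * (indicator_same - indicator_positive))


# Label assignments that agree with the sign of v are not penalized;
# assignments that contradict it receive a potential below 1.
agree_positive = pairwise_potential(0.8, same_label=True)
violate_positive = pairwise_potential(0.8, same_label=False)
agree_negative = pairwise_potential(-0.8, same_label=False)
violate_negative = pairwise_potential(-0.8, same_label=True)
```

Under this reading, the penalty for a violated relation grows with $|v|$, so confident relations constrain the labeling more strongly than weak ones.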

To solve Eq. (4), we employ the simulated field algorithm [26], which is a classic technique for MRF optimization. To present the main steps of our work clearly, we provide the details of this algorithm and the computation of $v(\cdot ,\cdot )$ in the supplementary material.

3.2 Dynamic Pairwise Identity Constraints

Different from previous methods [2, 4, 12], where the identity constraints between a pair of face images are fixed after being initialized at the very beginning, the identity constraints $\mathbf{C }$ in our approach are updated iteratively in the adaptation process to obtain additional supervision for adapting the face representation. In particular, after inferring the character labels $\mathbf{Y }$ in Sect. 3.1, we compute a confidence value that measures how likely a face pair is to come from different characters, i.e. to be a negative pair. After that, we append pairs with high confidence to the current set of pairwise constraints $\mathbf{C }$. The negative pair generation process is motivated by two facts: diverse clusters contain large noise, while clusters with high purity are compact; and faces from the same character are likely to be close in the representation space. Specifically, for the face pairs in each cluster, we define the confidence Q by

$$\begin{aligned} Q (i_\ell ,i'_\ell ) = \frac{1}{1+\gamma e^{-trace(\varSigma _\ell )D_{i_\ell ,i'_\ell }}} \end{aligned}$$

(7)

where ${i_\ell }$ and ${i'_\ell }$ denote the faces in cluster $\ell $. $trace(\varSigma _\ell )$ is the trace of the covariance matrix, which describes the variations within the cluster. $D_{i_\ell ,i'_\ell }$ is the L2-distance between the faces in the learned face representation space $\mathbf{X }$. $\gamma $ is a scale factor for normalization. In this case, face pairs in diverse clusters with large distances will have high confidence. In our implementation, face pairs with confidence value $Q (i_\ell ,i'_\ell )>0.5$ are selected as additional negative pairs.
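A small sketch of Eq. (7) and the 0.5 selection rule follows. The function names and the numeric values of `trace_sigma`, `gamma`, and the distances are made up for illustration. Note that $Q>0.5$ is equivalent to $trace(\varSigma _\ell )D_{i_\ell ,i'_\ell }>\ln \gamma $, so $\gamma $ effectively sets how spread out and how far apart a pair must be before it is treated as negative.

```python
import math


def negative_pair_confidence(trace_sigma, distance, gamma):
    """Confidence Q of Eq. (7) that a within-cluster pair is a negative pair."""
    return 1.0 / (1.0 + gamma * math.exp(-trace_sigma * distance))


def select_negative_pairs(pairs, trace_sigma, gamma, threshold=0.5):
    """Return the pairs of one cluster whose confidence exceeds the threshold.

    pairs: list of (face_a, face_b, l2_distance) tuples (hypothetical layout).
    """
    return [
        (a, b)
        for a, b, d in pairs
        if negative_pair_confidence(trace_sigma, d, gamma) > threshold
    ]


# Illustrative cluster: trace of covariance 2.0, gamma = 20 (ln 20 is about 3.0).
pairs = [("f1", "f2", 1.0), ("f1", "f3", 2.0)]
selected = select_negative_pairs(pairs, trace_sigma=2.0, gamma=20.0)
```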

3.3 Face Representation Adaptation

Pre-training DCN. The network filter $\mathbf{W }$ is initialized by pre-training DCN to classify massive identities as discussed in DeepID2+ [7]. We adopt its network architecture due to its exceptional performance in face recognition.

Specifically, the DCN takes a face image of size 55 $\times $ 47 as input. It has four successive convolution layers followed by one fully connected layer. Each convolution layer contains learnable filters and is followed by a $2\times 2$ max-pooling layer and Rectified Linear Units (ReLUs) [27] as the activation function. The number of feature maps generated by each convolution layer is 128, and the dimension of the face representation generated by the final fully connected layer is 512. Similar to [7], our DCN is pre-trained on CelebFace [28], with around 290,000 face images from 12,000 identities. The training process is conducted by back-propagation using both the identification and verification loss functions.

Fine-tuning Face Representation by $\mathbf{C }$. After updating the identity constraints $\mathbf{C }$ in Sect. 3.2, we update the hidden face representation by back-propagating the constraint information to the DCN. In particular, given a constraint in $\mathbf{C }$, we minimize a contrastive loss function [7], $E_{c}(\mathbf{x }^i_j,\mathbf{x }^{i'}_{j'})$, which is defined as

$$\begin{aligned} E_{c}= {\left\{ \begin{array}{ll} \frac{1}{2}\parallel \mathbf{x }^i_j-\mathbf{x }^{i'}_{j'}\parallel _2^2, &{}c(I^i_j,I^{i'}_{j'})=1, \\ \frac{1}{2}\max (0,\tau -\parallel \mathbf{x }^i_j-\mathbf{x }^{i'}_{j'}\parallel _2^2), &{}c(I^i_j,I^{i'}_{j'})=-1, \end{array}\right. } \end{aligned}$$

(8)

where $\tau $ is the margin between different identities. Eq. (8) encourages face images of the same character to be close and that of the different characters to be far away from each other.
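To make Eq. (8) concrete, here is a minimal sketch of the per-pair loss on plain Python lists. It follows the equation as written, with the margin applied to the squared distance. Returning zero for undefined pairs ($c=0$) is an assumption of this example, since Eq. (8) only covers $c=\pm 1$; the function names are likewise illustrative.

```python
def squared_l2(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def contrastive_loss(x, y, c, tau=1.0):
    """Per-pair contrastive loss E_c of Eq. (8).

    c:   +1 for a same-identity pair, -1 for a different-identity pair.
    tau: margin between different identities.
    """
    d2 = squared_l2(x, y)
    if c == 1:
        return 0.5 * d2                   # pull positive pairs together
    if c == -1:
        return 0.5 * max(0.0, tau - d2)   # push negatives beyond the margin
    return 0.0                            # undefined pair: no loss (assumption)


positive_loss = contrastive_loss([0.0, 0.0], [3.0, 4.0], c=1)
close_negative_loss = contrastive_loss([0.0, 0.0], [0.5, 0.0], c=-1, tau=1.0)
far_negative_loss = contrastive_loss([0.0, 0.0], [3.0, 4.0], c=-1, tau=1.0)
```

A negative pair that is already farther apart than the margin contributes nothing, so the fine-tuning signal concentrates on negatives that are still too close.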

To facilitate representation adaptation, besides $E_c$, we fine-tune the DCN by back-propagating the errors of the MRF defined in Sect. 3.1. We take the negative logarithm of Eq. (4), drop the constant terms, and obtain $\frac{1}{2}\sum _{i,j}\sum _{\ell =1}^K\mathbf 1 (y^i_j=\ell ) \big (\ln |\varSigma _{\ell }|+{(\mathbf{x }^i_j-\mu _\ell )}^{\mathsf {T}} \varSigma _{\ell }^{-1}(\mathbf{x }^i_j-\mu _\ell )\big )$. Note that in the step of representation adaptation, we update network filters $\mathbf{W }$ while keeping the remaining parameters fixed, such as $\mathbf{Y }$, $\varSigma $, and $\mu $. Therefore, minimizing the above function is equivalent to optimizing $\mathbf{W }$ such that the distance between each face image and its corresponding cluster center is minimized. We define this loss function as below

$$\begin{aligned} E_{MRF} = \frac{1}{2}\sum _{\ell =1}^{K}\mathbf 1 (y^i_j=\ell )\parallel \mathbf{x }^i_j-\mu _\ell \parallel _2^2. \end{aligned}$$

(9)

By minimizing Eq. (9), the representation naturally reduces the intra-personal variations.
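As a rough illustration of Eq. (9), the sketch below evaluates the loss for one face and its gradient with respect to the representation, with the cluster centers held fixed as stated above. The dict-based `centers` argument and the function names are assumptions of this example.

```python
def mrf_loss(x, label, centers):
    """E_MRF of Eq. (9) for one face: half the squared distance to its center."""
    mu = centers[label]
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, mu))


def mrf_loss_grad(x, label, centers):
    """Gradient of E_MRF with respect to x, with the center mu held fixed."""
    mu = centers[label]
    return [a - b for a, b in zip(x, mu)]


centers = {0: [0.0, 0.0], 1: [5.0, 5.0]}
loss = mrf_loss([1.0, 2.0], 0, centers)
grad = mrf_loss_grad([1.0, 2.0], 0, centers)
```

The gradient points from the assigned center towards the face, so a descent step moves the representation back towards the center, which is how the intra-personal variation shrinks.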

Combining Eqs. (8) and (9), the training process is conducted by back-propagation using stochastic gradient descent (SGD) [29]. Algorithm 1 shows the entire pipeline of the proposed joint face representation adaptation and clustering.
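The alternating procedure of Sect. 3 can be summarized by the following skeleton. It is only a sketch of the control flow described in the text, not the authors' Algorithm 1 listing: the four callables (`extract`, `infer_labels`, `update_constraints`, `finetune`) are hypothetical placeholders for the DCN forward pass, the MRF inference of Sect. 3.1, the constraint update of Sect. 3.2, and the fine-tuning of Sect. 3.3.

```python
def joint_adaptation_and_clustering(images, constraints, extract, infer_labels,
                                    update_constraints, finetune, iterations=3):
    """Alternate between clustering and representation adaptation.

    images:      the face images O of the target video.
    constraints: the initial pairwise constraints C_0.
    iterations:  number of alternations T (the text uses T = 3).
    Returns the last inferred labels and the final constraint set.
    """
    labels = None
    for _ in range(iterations):
        # Step 1: fix W, compute X, and infer the character labels Y.
        features = extract(images)
        labels = infer_labels(features, constraints)
        # Step 2: update the identity constraints C from X and Y.
        constraints = update_constraints(features, labels, constraints)
        # Step 3: fine-tune W using the updated constraints and labels.
        finetune(images, constraints, labels)
    return labels, constraints
```

Passing the steps in as callables keeps the loop independent of any particular network or inference code.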

4 Experiments

4.1 Datasets

Experiments are conducted on three publicly available face clustering datasets: Accio [30], BF0502 [31] and Notting-Hill [32]. The Accio dataset is collected from the eight “Harry Potter” movies and we use the first instalment of this series in our experiment (denoted as Accio-1 in the following text). Accio-1 contains multiple challenges, such as a large number of dark scenes and many tracks with non-frontal faces. In addition, the number of faces per character is unbalanced (e.g., there are 51,620 faces of the character “Harry Potter”, but only 4,843 faces of “Albus Dumbledore”). In particular, there are 36 characters, 3,243 tracks, and 166,885 faces in the test movie. The face tracks are obtained by tracking-by-detection using a particle filter [30]. BF0502 [31] is collected from the TV series “Buffy the Vampire Slayer”. Following the protocol of other video face clustering studies [2, 4, 12], we evaluate on 6 main cast members, including 17,337 faces in 229 face tracks. The dataset Notting-Hill is gathered from the movie “Notting Hill”. It includes faces of 5 main cast members, with 4,660 faces in 76 tracks.

4.2 Evaluation Criteria and Baselines

The clustering performance is measured in two different ways. In the first one, we evaluate how the algorithm balances the precision and recall. In particular, we employ the B-cubed precision and recall [1, 33] to compute one series of score pairs for the tested methods given different numbers of clusters. Specifically, the B-cubed precision is the fraction of face pairs assigned to a cluster with matching identity labels. The B-cubed recall is the average fraction of face pairs belonging to the groundtruth identity assigned to the same cluster [15]. To combine the precision and recall, we use the $F_1$-score (the harmonic mean of these two metrics).
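For reference, a minimal sketch of B-cubed precision, recall, and $F_1$ is given below. It uses the common item-averaged form of the B-cubed metrics, in which each face counts itself; this may differ in detail from the exact evaluation code used for the reported numbers. The function name and the toy labels are illustrative.

```python
from collections import Counter


def bcubed(clusters, identities):
    """Item-averaged B-cubed precision, recall, and F1.

    clusters:   predicted cluster label per face.
    identities: ground-truth identity per face (same order).
    """
    n = len(clusters)
    cluster_size = Counter(clusters)
    identity_size = Counter(identities)
    joint = Counter(zip(clusters, identities))
    # For each face: share of its cluster that has the same identity.
    precision = sum(joint[(c, g)] / cluster_size[c]
                    for c, g in zip(clusters, identities)) / n
    # For each face: share of its identity that landed in the same cluster.
    recall = sum(joint[(c, g)] / identity_size[g]
                 for c, g in zip(clusters, identities)) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: one face of identity "b" is wrongly placed in cluster 0.
p, r, f = bcubed([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"])
```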

For the second evaluation metric, we use accuracy computed from a confusion matrix, which is derived by the best match between the cluster labels and groundtruth identities. The best match is obtained by using the Hungarian method [34]. This evaluation metric is widely employed in current video face clustering methods [2, 4, 12, 25].
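The accuracy under the best cluster-to-identity match can be sketched as follows. For simplicity this example searches all one-to-one assignments by brute force instead of using the Hungarian method [34]; both give the same optimum, but the brute-force search is only practical for a handful of clusters. The function name and toy data are illustrative.

```python
from collections import Counter
from itertools import permutations


def clustering_accuracy(clusters, identities):
    """Accuracy under the best one-to-one match of clusters to identities."""
    cluster_ids = sorted(set(clusters))
    identity_ids = sorted(set(identities))
    confusion = Counter(zip(clusters, identities))
    # Pad the shorter side so that unmatched clusters or identities are allowed.
    size = max(len(cluster_ids), len(identity_ids))
    cluster_ids += [None] * (size - len(cluster_ids))
    identity_ids += [None] * (size - len(identity_ids))
    # Brute-force stand-in for the Hungarian method: try every assignment.
    best = max(
        sum(confusion[(c, g)] for c, g in zip(cluster_ids, perm))
        for perm in permutations(identity_ids)
    )
    return best / len(clusters)


accuracy = clustering_accuracy([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"])
```

With dozens of clusters, as in Accio-1, the factorial search is infeasible and a polynomial-time assignment solver would be needed instead.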

We compare the proposed method with the following classic and state-of-the-art approaches: (1) K-means [35]; (2) Unsupervised Logistic Discriminant Metric Learning (ULDML) [3]; (3) Penalized Probabilistic Clustering (PPC) [36]; (4) DIFFRAC [21] discriminative clustering; (5) HMRF-based clustering [4]; (6) Weighted Block-Sparse Low Rank Representation (WBSLRR) method [2]; (7) Multi-cue Augmented Face Clustering (McAFC) [12]. The latter three recent approaches are specifically designed for face clustering in videos.

Table 1. B-cubed precision (P), recall (R), and $F_1$-score (F) with different iterations (T) of the proposed method on the Accio-1 [30] dataset, with cluster number $K=36$.

Table 2. B-cubed precision (P), recall (R), and $F_1$-score (F) of different methods on the Accio-1 [30] (Harry Potter) dataset.

4.3 Experiments on Accio-1 (Harry Potter) [30]

Effects of the Iterations in the Adaptation Process. The evaluation is first conducted on the Accio-1 dataset [30]. Firstly, to demonstrate the effectiveness of the alternating adaptation process, we report the performance in different iterations in Table 1. Given that there are 36 characters in this movie, we set the cluster number $K=36$ here. It is observed that the performance increases and it converges when $T=3$. This demonstrates the benefits of the alternating adaptation process.

Performance of Different Variants and Competitors. To verify other components of the proposed method, we further test different variants of our method, as well as other existing models:

DeepID2$^+$ $\cdot \mathbf{C }_0$: We perform clustering with fixed DCN filters $\mathbf{W }$ and pairwise constraints set $\mathbf{C }_0$. That means we do not perform representation adaptation after training the network on the face photo dataset and initial pairwise constraints. This variant corresponds to the typical transfer learning strategy [11] adopted in most deep learning studies. Since our network structure and pre-training data are identical to those of [7], we use the notation DeepID2$^+$.
DeepID2$^+$ $\cdot \mathbf{C }_0\cdot $Intra: We finetune DCN filters $\mathbf{W }$ only with the intra person constraints (Eq. (9)) but not the inter person constraints (Eq. (8)).
“HMRF$^+$” and “HMRF-DeepID2$^+$”: Since HMRF [4] only uses the raw pixel value or handcrafted features, for fair comparison, we also use the DCN representation initially trained on the face photo dataset for this method. Similar notation scheme is used for K-means [35], DIFFRAC [21] and WBSLRR [2] algorithms, and Fisher Vector [37] representation.

We report the B-cubed precision and recall, as well as the $F_1$-score of different methods in Table 2. It is observed that:

As the cluster number increases, the precision increases while the recall decreases. This is intuitive since larger number of clusters decreases the cluster size and improves the cluster purity.
This dataset is very challenging. For example, K-means [35] only achieves 0.379 in precision even when the cluster number is nearly six times the number of identities.
The DCN representation improves the performance substantially (e.g., the DIFFRAC [21], HMRF [4], and WBSLRR [2] method have 0.2–0.3 improvements in terms of precision when employing the DCN representation).
The proposed method (i.e. full model) performs the best, and the comparison on different variants of the proposed method demonstrates the superiority of the alternating adaptation process (e.g., the performance of full model is better than that of “DeepID2$^+$ $\cdot \mathbf{C }_0$”).
Interestingly, by comparing “DeepID2$^+$ $\cdot \mathbf{C }_0$” and “DeepID2$^+$ $\cdot \mathbf{C }_0\cdot $Intra”, we can observe an obvious improvement in recall, while the precision hardly increases. This is because using only the intra-person constraints can decrease the distances between the faces of the same character, but cannot directly provide discriminative information to correct the wrong pairs in a cluster. Thus, both intra- and inter-person constraints are important for discriminative face clustering.

Representation Visualization. Figure 4 visualizes different representations by projecting them to a 2D space. Firstly, in Fig. 4(a, b and c), we project the representations by PCA. We can observe that for the original pixel values, the representations are severely overlapped. By pre-training DCN with face dataset and adapting the representation, we can gradually obtain more discriminative representation. After that, we use the t-SNE [38] dimensionality reduction and Fig. 4(d) shows that the characters can be almost linearly separated. This demonstrates the effectiveness of the adapted face representation.

Example Results. Figure 5 shows some clustering examples, where each bank except the right bottom one denotes a cluster. It is observed that each cluster covers a character’s faces in different head poses, lighting conditions, and expressions. This demonstrates the effectiveness of the adapted face representation. We also show some failure cases indicated by the red rectangles, where each pair of different characters is incorrectly placed in the same cluster. These failures are mainly caused by the unbalanced number of faces per identity (e.g., some characters appear in only a few shots) and by some extreme lighting conditions.

4.4 Experiments on BF0502 [31] and Notting Hill [32]

We report the accuracy of our method and other competitors in Figs. 6 and 7. Following previous research [2, 4, 12], each algorithm is repeated 30 times, and the mean accuracy and standard deviation are reported. The results of the competitors are gathered from the literature [2, 4, 12]. Figure 6 shows that our method achieves substantial improvement compared to the best competitor (from 62.76 % to 92.13 %), demonstrating the superiority of our method.

4.5 Computational Cost

Training a high-capacity DCN from scratch is time consuming due to the large amount of training data. However, given the DCN pre-trained on a large face dataset, for a new target video, we only need to perform representation adaptation. Table 3 shows the running time of our algorithm on the videos. In particular, the DCN adaptation entry in Table 3 is the time used to train the DCN with an Nvidia Titan GPU, and the total time additionally includes the computation cost of the other steps (i.e. inferring the character label $\mathbf{Y }$ in Sect. 3.1 and updating the constraints in Sect. 3.2). It is observed that the time cost is feasible in many applications, where face clustering can be performed off-line.

Table 3. Running time for the Accio-1 [30] (Harry Potter), BF0502 [31], and Notting Hill [32] dataset (in minutes).

4.6 Application to Face Verification

To further demonstrate the effectiveness of the adapted face representation, we perform face verification on the Accio-1 dataset [30]. To evaluate the representation directly, for each face pair, we calculate the L2 distance between the representations to measure the pairwise similarity, instead of training a joint Bayesian model as in [7]. If the distance is larger than a threshold, the face pair is regarded as negative (i.e. different identities). The threshold is determined using 1,000 validation face pairs (500 positive and 500 negative samples) randomly chosen from the Accio-1 [30] dataset. Evaluation is performed on another 1,000 randomly chosen face pairs (500 positive and 500 negative samples) from this dataset. The validation and test faces are exclusive in terms of scenes and identities. Similar to Sect. 4.3, we compare the performance among different representations, including (1) DeepID2$^+$, (2) DeepID2$^+$ $\cdot \mathbf{C }_0$, and (3) full model. Figure 8 shows the Receiver Operating Characteristic (ROC) comparison. It is evident that the representation adapted by the proposed method outperforms the original deep representation and can handle different cinematic styles better.
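The distance-threshold verification rule can be sketched as below. The text does not state how the threshold is chosen from the validation pairs; picking the candidate that maximizes validation accuracy is an assumption of this example, as are the function names and the toy distances.

```python
def verify(distance, threshold):
    """A pair is accepted as the same identity unless its distance exceeds the threshold."""
    return distance <= threshold


def best_threshold(distances, same_identity):
    """Choose the threshold with the highest accuracy on validation pairs.

    distances:     L2 distance of each validation pair.
    same_identity: True for positive pairs, False for negative pairs.
    Selection by validation accuracy is an assumption of this sketch.
    """
    def accuracy(t):
        hits = sum(verify(d, t) == s for d, s in zip(distances, same_identity))
        return hits / len(distances)

    # Only observed distances need to be tried as candidate thresholds.
    return max(sorted(set(distances)), key=accuracy)


# Toy validation set: two positive pairs (close) and two negative pairs (far).
threshold = best_threshold([0.2, 0.4, 0.9, 1.1], [True, True, False, False])
```

Sweeping the threshold over all values instead of fixing one yields the operating points that make up an ROC curve.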

5 Conclusion

In this work, we have presented a novel deep learning framework for joint face representation adaptation and clustering in videos. In the absence of precise face annotations on the target video, we propose a feedback loop in which the deep representation provides robust features for face clustering, and the clustering results provide weak pairwise constraints for learning a deep representation better suited to the target video. Experiments on three benchmark video datasets demonstrate the superiority of the proposed method over state-of-the-art video face clustering methods that use either handcrafted features or a deep face representation without adaptation. The effectiveness of the adapted face representation is further demonstrated by a face verification experiment.

Notes

1. Unless we perform joint matching of visual appearance with the video's script [9, 10]. However, accurate visual-script matching is still far from solved. This option is beyond the scope of this study.
2. http://mmlab.ie.cuhk.edu.hk/projects/DeepFaceClustering/index.html

References

1. Zhang, L., Kalashnikov, D.V., Mehrotra, S.: A unified framework for context assisted face clustering. In: ACM Conference on International Conference on Multimedia Retrieval (2013)
2. Xiao, S., Tan, M., Xu, D.: Weighted block-sparse low rank representation for face clustering in videos. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 123–138. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10599-4_9
3. Cinbis, R., Verbeek, J., Schmid, C.: Unsupervised metric learning for face identification in TV video. In: ICCV (2011)
4. Wu, B., Zhang, Y., Hu, B.G., Ji, Q.: Constrained clustering and its application to face clustering in videos. In: CVPR (2013)
5. Wu, B., Lyu, S., Hu, B., Ji, Q.: Simultaneous clustering and tracklet linking for multi-face tracking in videos. In: ICCV (2013)
6. Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: DeepFace: closing the gap to human-level performance in face verification. In: CVPR (2014)
7. Sun, Y., Wang, X., Tang, X.: Deeply learned face representations are sparse, selective, and robust. In: CVPR (2015)
8. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: CVPR (2015)
9. Ding, L., Yilmaz, A.: Learning relations among movie characters: a social network perspective. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 410–423. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15561-1_30
10. Tapaswi, M., Bauml, M., Stiefelhagen, R.: Improved weak labels using contextual cues for person identification in videos. In: FG (2015)
11. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? In: NIPS (2014)
12. Zhou, C., Zhang, C., Fu, H., Wang, R., Cao, X.: Multi-cue augmented face clustering. In: ACM Multimedia Conference (2015)
13. Li, Z., Tang, X.: Bayesian face recognition using support vector machine and face clustering. In: CVPR (2004)
14. Zhu, C., Wen, F., Sun, J.: A rank-order distance based clustering algorithm for face tagging. In: CVPR (2011)
15. Otto, C., Klare, B., Jain, A.: An efficient approach for clustering face images. In: International Conference on Biometrics (2015)
16. Cao, X., Zhang, C., Fu, H., Liu, S., Zhang, H.: Diversity-induced multi-view subspace clustering. In: CVPR (2015)
17. Fitzgibbon, A., Zisserman, A.: Joint manifold distance: a new approach to appearance based clustering. In: CVPR (2003)
18. Tian, Y., Liu, W., Xiao, R., Wen, F., Tang, X.: A face annotation framework with partial clustering and interactive labeling. In: CVPR (2007)
19. Berg, T., Berg, A., Edwards, J., Maire, M., White, R., Teh, Y.W., Learned-Miller, E., Forsyth, D.: Names and faces in the news. In: CVPR (2004)
20. Xia, S., Pan, H., Qin, A.: Face clustering in photo album. In: ICPR (2014)
21. Bach, F.R., Harchaoui, Z.: DIFFRAC: a discriminative and flexible framework for clustering. In: NIPS, pp. 49–56 (2008)
22. El-Khoury, E., Senac, C., Joly, P.: Face-and-clothing based people clustering in video content. In: ACM International Conference on Multimedia Information Retrieval (2010)
23. Paul, G., Elie, K., Sylvain, M., Jean-Marc, O., Paul, D.: A conditional random field approach for audio-visual people diarization. In: ICASSP (2014)
24. Tapaswi, M., Parkhi, O.M., Rahtu, E., Sommerlade, E., Stiefelhagen, R., Zisserman, A.: Total cluster: a person agnostic clustering method for broadcast videos. In: Proceedings of Indian Conference on Computer Vision Graphics and Image Processing (2014)
25. Tang, Z., Zhang, Y., Li, Z., Lu, H.: Face clustering in videos with proportion prior. In: IJCAI (2015)
26. Celeux, G., Forbes, F., Peyrard, N.: EM procedures using mean field-like approximations for Markov model-based image segmentation. Pattern Recogn. 36(1), 131–144 (2003)
27. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: ICML (2010)
28. Sun, Y., Wang, X., Tang, X.: Deep learning face representation from predicting 10,000 classes. In: CVPR (2014)
29. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
30. Ghaleb, E., Tapaswi, M., Al-Halah, Z., Ekenel, H.K., Stiefelhagen, R.: Accio: a data set for face track retrieval in movies across age. In: ACM International Conference on Multimedia Retrieval (2015)
31. Everingham, M., Sivic, J., Zisserman, A.: Hello! My name is... Buffy - automatic naming of characters in TV video. In: BMVC (2006)
32. Zhang, Y., Xu, C., Lu, H., Huang, Y.: Character identification in feature-length films using global face-name matching. IEEE Trans. Multimedia 11(7), 1276–1288 (2009)
33. Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf. Retrieval 12(4), 461–486 (2009)
34. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logistics Q. 2(1–2), 83–97 (1955)
35. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)
36. Lu, Z., Leen, T.K.: Penalized probabilistic clustering. Neural Comput. 19(6), 1528–1567 (2007)
37. Parkhi, O., Simonyan, K., Vedaldi, A., Zisserman, A.: A compact and discriminative face track descriptor. In: CVPR, pp. 1693–1700 (2014)
38. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
39. Cao, X., Zhang, C., Zhou, C., Fu, H., Foroosh, H.: Constrained multi-view video face clustering. IEEE Trans. Image Process. 24(11), 4381–4393 (2015)

Acknowledgments

This work is partially supported by SenseTime Group Limited, the Hong Kong Innovation and Technology Support Programme, the General Research Fund sponsored by the Research Grants Council of the Hong Kong SAR (CUHK 416312), the External Cooperation Program of BIC, Chinese Academy of Sciences (No. 172644KYSB20150019), the Science and Technology Planning Project of Guangdong Province (2015B010129013, 2014B050505017), and the National Natural Science Foundation of China (61503366, 61472410). The corresponding author is Ping Luo.

Author information

Authors and Affiliations

Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong, China
Zhanpeng Zhang, Ping Luo, Chen Change Loy & Xiaoou Tang
Shenzhen Key Lab of Comp. Vis. & Pat. Rec., Shenzhen Institutes of Advanced Technology, CAS, Shenzhen, China
Ping Luo, Chen Change Loy & Xiaoou Tang

Corresponding author

Correspondence to Xiaoou Tang.

Editor information

Editors and Affiliations

RWTH Aachen, Aachen, Germany
Bastian Leibe
Czech Technical University, Prague 2, Czech Republic
Jiri Matas
University of Trento, Povo - Trento, Italy
Nicu Sebe
University of Amsterdam, Amsterdam, The Netherlands
Max Welling

Cite this paper

Zhang, Z., Luo, P., Loy, C.C., Tang, X. (2016). Joint Face Representation Adaptation and Clustering in Videos. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds) Computer Vision – ECCV 2016. ECCV 2016. Lecture Notes in Computer Science, vol 9907. Springer, Cham. https://doi.org/10.1007/978-3-319-46487-9_15

DOI: https://doi.org/10.1007/978-3-319-46487-9_15
Published: 17 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46486-2
Online ISBN: 978-3-319-46487-9
eBook Packages: Computer Science, Computer Science (R0)
