Dynamic Character Graph via Online Face Clustering for Movie Analysis

An effective approach to automated movie content analysis involves building a network (graph) of its characters. Existing work usually builds a static character graph to summarize the content using metadata, scripts or manual annotations. We propose an unsupervised approach to building a dynamic character graph that captures the temporal evolution of character interactions. We refer to this as the character interaction graph (CIG). Our approach has two components: (i) an online face clustering algorithm that discovers the characters in the video stream as they appear, and (ii) simultaneous creation of a CIG using the temporal dynamics of the resulting clusters. We demonstrate the usefulness of the CIG for two movie analysis tasks: narrative structure (act) segmentation, and major character retrieval. Our evaluation on full-length movies containing more than 5000 face tracks shows that the proposed approach achieves superior performance on both tasks.


Introduction
Automated analysis of media content, such as movies, has traditionally focused on extracting and using low-level features from shots and scenes for analyzing narrative structures and key events [10,11]. For humans, however, a movie is not just a collection of shots or scenes. It is the characters that usually play the most important role in storytelling [18]. More recently, character-centric representations of movies, such as character networks, have emerged as an effective approach to media content analysis [15,16,22]. A character network usually has the major characters as its nodes, where the edges summarize the relationships between them.

Fig. 1 Overview of the proposed approach: A movie is processed at shot level. For each shot, face tracks are created, and our online clustering algorithm either groups a face track with an existing cluster or creates a new one. In this example, at shot 0, three face tracks are created and grouped into two clusters. A new cluster is added at shot 1 as its face track belongs to a new character. The face track in shot 2 is added to an existing cluster belonging to the same character. The CIG is updated after each shot is processed. Note that the CIG for the (i − 1)-th shot is obtained after the (i − 1)-th shot is processed.
Ramakrishna et al. [16] used scripts to construct a character network, where an edge between two characters (nodes) is added if the characters have consecutive dialogs. This network is used to examine character analytics based on gender, race and age [16]. Weng et al. [22] constructed a character network, called RoleNet, that captures the co-occurrence statistics of movie characters via face recognition. This network is used to identify the lead characters and communities, and for story segmentation [22]. Park et al. [15] built a network by aligning scripts and subtitles. This network is employed in classification of major and minor characters, community clustering and sequence detection [15]. Along similar lines, Tran and Jung constructed CoCharNet [21] using manual annotations to encode information regarding character co-occurrences.
The work most closely related to ours is that of Yeh and Wu [28], where a character network is constructed using face clustering. This work clusters faces and constructs a character network in an iterative fashion. However, it requires prior knowledge of the number of clusters, and is an offline method. To the best of our knowledge, this is the only prior work that uses (offline) face clustering for constructing character graphs.

Face clustering in videos
Offline methods. The problem of unsupervised face clustering is less studied than its supervised counterpart, i.e., face recognition. The dominant approach to face clustering involves learning a suitable distance measure between face pairs [9,17,20,30]. Several methods propose using partial supervision to improve performance [3,23]. While image-based clustering is more common, face clustering in videos can achieve significant improvement by exploiting temporal information about the faces [1,25,26,27]. Temporal constraints have been used in frameworks based on hidden Markov random fields (HMRF) [26] and unsupervised logistic discriminative metric learning (ULDML) [2], with applications to face clustering in movies and TV series. A constrained multi-view face clustering technique combined sparse subspace representations of faces with constrained spectral clustering [1]. Recent clustering approaches use convolutional neural networks (CNN) to learn robust face representations, using aggregated deep features [19], deep features with pairwise constraints [30] and deep features with triplet loss [29].
Online methods. The approaches discussed above are all offline methods, i.e., they assume the availability of the entire data at once. In an online setting, a clustering algorithm does not have the luxury of 'seeing' the entire data simultaneously. To the best of our knowledge, there is only one work on online face clustering in videos in the existing literature [14]. This work creates small tracklets of faces from the video and clusters them in an online fashion based on temporal coherence and the Chinese restaurant process (TCCRP) [14]. An extension of this work is the temporally coherent Chinese restaurant franchise (TCCRF) [13], which jointly models short temporal segments. These online methods tend to create multiple clusters for the same person, thereby degrading the completeness of the clusters [14].

Proposed approach
Overview. In our dynamic CIG construction approach, we process a movie stream at shot level, where a shot is a contiguously recorded sequence of frames. Our approach consists of two main components: (i) face track creation and clustering, and (ii) CIG formation and update. Both components are executed simultaneously in an online fashion by processing one shot at a time. As a shot appears, faces are detected frame by frame and face tracks are created. Our online clustering algorithm then assigns the face tracks to either an existing cluster or a new one. The information about the cluster updates, including the formation of new clusters, is used to create a dynamic CIG. Fig. 1 presents an overview of the proposed method. Below, we describe each component in detail.

Face track creation and clustering
Face track creation. Consider a movie M comprising T frames: M = {I_t}_{t=1}^T. We define the i-th shot S_i as a sequence of consecutive frames {I_{t_{i−1}+1}, . . . , I_{t_i}}, where t_i is the i-th shot boundary. The shot boundary t_i corresponding to S_i is detected by computing the pixel differences between consecutive frames (as they appear) and comparing the difference to a predefined threshold. The accuracy of shot boundary detection is not critical to the performance of our method, hence we stick to this simple frame differencing method.
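To make the frame differencing step concrete, a minimal sketch is given below. The mean-absolute-difference statistic and the threshold value are our assumptions; any simple per-frame difference measure would serve.

```python
import numpy as np

def detect_shot_boundaries(frames, threshold):
    """Return the frame indices t_i at which a new shot starts,
    i.e., where the mean absolute pixel difference between
    consecutive frames exceeds a predefined threshold."""
    boundaries = []
    for t in range(1, len(frames)):
        diff = np.mean(np.abs(frames[t].astype(float) - frames[t - 1].astype(float)))
        if diff > threshold:
            boundaries.append(t)  # frame t starts a new shot
    return boundaries
```

Shot S_i then spans the frames between the (i − 1)-th and i-th boundaries.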
Once we have detected the boundaries of S_i, a standard face detector [7] is employed to detect the faces in each frame of S_i. This frame-level face detection can be done in parallel to the search for shot boundaries. The face detector returns the bounding box of each face detected in every frame. To build a robust representation of these faces, we use a pretrained CNN called FaceNet [17]. Each face f_p is forward-passed through FaceNet to obtain its corresponding d-dimensional feature vector v_p.
To create face tracks, we use a simple yet effective strategy to combine the faces detected in consecutive frames. Let f_p and f_q denote two faces detected in two consecutive frames. The overlap a(·,·) between the two faces is defined as:

a(p, q) = [area(f_p ∩ f_q) / area(f_p ∪ f_q)] × 100 (1)

where area(f) is the area of the rectangular bounding box of f. The squared distance between the feature vectors v_p and v_q is defined as δ(p, q) = ||v_p − v_q||_2^2. If a(p, q) > 85 and δ(p, q) ≤ 1.0, i.e., if the faces have more than 85% overlap and at most 1.0 feature distance in consecutive frames, they are considered to be of the same person (see Fig. 2). Detected faces that overlap this way in consecutive frames are combined to form a face track, and the sequence of features corresponding to each of these faces is defined as a feature track.
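The linking rule of eq. (1) can be sketched as follows. The intersection-over-union form of the overlap and the helper names are our assumptions; the thresholds follow the text above.

```python
import numpy as np

def box_overlap(b1, b2):
    """Percentage overlap between two boxes (x1, y1, x2, y2),
    taken here as intersection-over-union * 100 (an assumption;
    the text only specifies a percentage overlap)."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2, y2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(b1) + area(b2) - inter
    return 100.0 * inter / union if union > 0 else 0.0

def same_person(box_p, v_p, box_q, v_q, a_min=85.0, delta_max=1.0):
    """Link two detections in consecutive frames if their boxes
    overlap by more than a_min percent and their FaceNet features
    are within squared distance delta_max."""
    delta = float(np.sum((v_p - v_q) ** 2))
    return box_overlap(box_p, box_q) > a_min and delta <= delta_max
```

Chaining linked detections across frames yields the face tracks and their corresponding feature tracks.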
Online face clustering. The next task is to cluster the face tracks as they appear in each shot. For this subtask, we use our recently developed online clustering algorithm [8]. We assume the availability of all face tracks in a single shot at a given time. Our goal is to assign a face track belonging to a person who has appeared earlier to the correct existing cluster, and to form a new cluster for a face track belonging to a person appearing for the first time.
Algorithm 1: Face track clustering in a single shot.

Let us consider a shot S_i containing K face tracks {F_k}_{k=1}^K, where N_k is the number of faces in F_k. Also consider that we have already processed the previous (i − 1) shots and obtained L clusters corresponding to L unique characters. The clusters are represented by their cluster centers C = {c_l}_{l=1}^L, where c_l ∈ R^d is the feature vector obtained by averaging all features across all face tracks within the l-th cluster. Note that the number of clusters and the clusters themselves are dynamic, and they evolve as each shot is processed. We now define two matrices as follows:
- A temporal constraint matrix Q ∈ {0, 1}^{K×K} (eq. (2)) that records, for p, q ∈ {1, 2, . . . , K}, whether face tracks p and q overlap in time. The matrix Q enforces a temporal constraint on the face tracks: if two face tracks have any overlap in time, they are considered to belong to two different characters, and hence are assigned to different clusters.
- A similarity matrix D ∈ R^{L×K} (eq. (3)) that measures the similarity between a face track (represented by its feature track V_k) and a cluster center c_l for a given shot.
The similarity in eq. (3) is

D(l, k) = 4 − (1/N_k) Σ_{n=1}^{N_k} ||v_n^{(k)} − c_l||_2^2

where l = 1, 2, . . . , L, k = 1, 2, . . . , K, and v_n^{(k)} denotes the n-th feature vector in V_k. The second term is an average squared distance, whose maximum value is 4 (since each feature is a unit vector). By subtracting this distance from 4, we obtain a similarity value in [0, 4].
Given {V_k}_{k=1}^K, our task is to assign each of them to one of the L existing clusters or to create new clusters if required. This is done by computing the similarities between V_k, for all k, and {c_l}_{l=1}^L.
The assignment

(l̂, k̂) = argmax_{l,k} (D ⊙ W)

is computed, where W ∈ R^{L×K} is a weight matrix (initialized with all ones) and ⊙ denotes the element-wise product. If max_{l,k}(D ⊙ W) ≥ τ, where τ is a user-defined threshold, V_k̂ is assigned to the l̂-th cluster. Consequently, we update c_l̂ by averaging over the existing and the newly added face track. On the other hand, if max_{l,k}(D ⊙ W) < τ, a new cluster is created assuming a new character has appeared, and we add it to the set of clusters: C ← C ∪ {c_new}. Note that since W is initialized as a matrix of all ones, it has no effect on the clustering of the first face track. For the subsequent assignments, W is updated to add temporal constraints. After V_k̂ is assigned to a cluster, we update D and W as follows:

- Case I: V_k̂ is assigned to an existing cluster l̂. W is updated so that D ⊙ W becomes zero in the l̂-th row for all face tracks having any temporal overlap with V_k̂.
- Case II: V_k̂ is assigned to a new cluster, and a corresponding new row is added to D and W.

We maintain an index list ind = [1, 2, . . . , K]. As V_k̂ is processed and assigned to a cluster, its id is removed, i.e., the k̂-th element of ind, the k̂-th column of D and W, and the k̂-th row and column of Q are removed. This process continues until all tracks in S_i are processed, after which we move to the next shot. We also keep track of the clusters that are updated during each shot; this information is later used to create and update the CIG. Algorithm 1 summarizes our proposed online face clustering algorithm.
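As an illustration, the per-shot assignment loop can be sketched as below. This is a simplified sketch of Algorithm 1: the greedy selection order, the data structures and the variable names are our assumptions.

```python
import numpy as np

def cluster_shot(V, starts, ends, centers, counts, tau=3.0):
    """Greedy online assignment of one shot's feature tracks.
    V[k]: (N_k, d) array of unit-norm features for track k;
    starts/ends: frame spans of the tracks; centers/counts: the
    running cluster centers and their face counts (updated in place)."""
    K = len(V)
    # Q(p, q) = 1 if tracks p and q overlap in time (p != q).
    Q = np.array([[int(p != q and starts[p] <= ends[q] and starts[q] <= ends[p])
                   for q in range(K)] for p in range(K)])
    labels = [-1] * K
    remaining = list(range(K))
    members = {}  # cluster id -> track ids assigned during this shot
    while remaining:
        L = len(centers)
        D = np.full((max(L, 1), K), -np.inf)
        for l in range(L):
            for k in remaining:
                # temporal constraint: forbid clusters that already
                # hold a track overlapping track k in time
                if any(Q[k, j] for j in members.get(l, ())):
                    continue
                D[l, k] = 4.0 - np.mean(np.sum((V[k] - centers[l]) ** 2, axis=1))
        l_best, k_best, best = None, None, -np.inf
        for l in range(L):
            for k in remaining:
                if D[l, k] > best:
                    best, l_best, k_best = D[l, k], l, k
        if k_best is None or best < tau:
            # no sufficiently similar cluster: open a new one
            k_best = remaining[0]
            centers.append(V[k_best].mean(axis=0))
            counts.append(len(V[k_best]))
            l_best = len(centers) - 1
        else:
            # merge: update the cluster center by running average
            n, m = counts[l_best], len(V[k_best])
            centers[l_best] = (centers[l_best] * n + V[k_best].sum(axis=0)) / (n + m)
            counts[l_best] = n + m
        labels[k_best] = l_best
        members.setdefault(l_best, set()).add(k_best)
        remaining.remove(k_best)
    return labels
```

Tracks that overlap in time can never land in the same cluster, mirroring the role of Q and W in the text.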

CIG construction
We now describe the method to construct and update the CIG based on the online face clustering results. Each node in the CIG represents a single cluster corresponding to a character, and each edge captures the interaction between the two characters it connects. In our approach, the CIG is created in parallel to the online face clustering process, where new nodes are added to the CIG and the edge weights are updated after each shot is processed.
We define the relationship between two characters p and q in terms of their temporal co-occurrence in the same or consecutive shots. Considering an adjacency matrix A, the relationship between p and q is formally defined as follows.
where I(·) is the indicator function. This count defines the strength of the edge between nodes p and q in the CIG, and is denoted by the element A(p, q). A diagonal element A(p, p) denotes the number of times character p appears in two consecutive shots. To construct and update A in an online fashion, we begin with an empty A and keep populating it with new rows and columns (corresponding to newly added nodes and edges) as new shots are processed. The dimension of A thus increases as new characters are discovered, and consequently, new nodes are added to the CIG. According to our definition of character relationship in eq. (11), we need to look at the characters in the shots immediately before and after a given shot. Since we cannot peek into the future shot, at shot S_i (i > 2), we update A for S_{i−1}.

Algorithm 2: Character interaction graph (CIG) construction via online clustering. Input: movie frames I_1, . . . , I_T. Output: A.
Our clustering algorithm yields updated cluster ids U_{i−2}, U_{i−1}, and U_i pertaining to the shots S_{i−2}, S_{i−1}, and S_i. We append N_{i−1}^{new} new rows and N_{i−1}^{new} new columns to A (all new elements initialized to 0), where N_{i−1}^{new} is the number of new clusters added during the (i − 1)-th shot. Then A(p, q) is updated according to eq. (11).
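A minimal sketch of the per-shot adjacency update follows. The symmetric counting and the data layout are our assumptions; the exact counting rule is given by eq. (11).

```python
import numpy as np

def update_cig(A, chars_prev, chars_cur):
    """Grow A for newly discovered characters and increment the
    co-occurrence counts between every character seen in shot
    S_{i-1} (chars_prev) and every character seen in shot S_i
    (chars_cur). chars_* are sets of cluster ids."""
    n_needed = max(list(chars_prev | chars_cur) + [-1]) + 1
    if n_needed > A.shape[0]:
        # append rows/columns for the newly added clusters (nodes)
        grown = np.zeros((n_needed, n_needed), dtype=A.dtype)
        grown[:A.shape[0], :A.shape[1]] = A
        A = grown
    for p in chars_prev:
        for q in chars_cur:
            A[p, q] += 1
            if p != q:
                A[q, p] += 1  # symmetric counting (our assumption)
    return A
```

The diagonal entry A(p, p) then accumulates the number of times character p appears in two consecutive shots.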
Algorithm 2 summarizes the entire process of online clustering and CIG creation as they are performed in parallel. Fig. 3 shows an example of a CIG created using the proposed approach for the movie Hope Springs. The CIG has six pure clusters corresponding to the six characters discovered by our online clustering algorithm, and a noisy cluster denoted by 'X'. The edges depict the relationships between the characters, where thicker edges denote higher interaction. The numbers represent the character importance scores, described in detail in Section 4.2.

Applications to movie analysis
In this section, we demonstrate the usefulness of CIGs for two important movie analysis tasks: (i) three-act segmentation: detecting high-level semantic structures in a movie, and (ii) major character identification. Below, we describe in detail how the CIG can facilitate these tasks.

Fig. 3 The CIG shows 7 nodes corresponding to the 7 clusters discovered by our algorithm. The node marked 'X' denotes a noisy cluster. The numbers below each node in the CIG denote the importance (σ(p)) of the characters.

Fig. 4 The three-act narrative structure of a movie.

Three act segmentation
Popular films and screenplays are known to follow a well-defined storytelling paradigm. The majority of movies consist of three main segments or acts (see Fig. 4): Act I introduces the main characters and presents a key incident or plot point that drives the story; Act II consists of a series of events, including a key event which prepares the audience for the climax; and Act III includes the climax and the resolution of the story [5,12]. Discovering these high-level semantic units automatically can help in movie summarization and detection of key events [6]. Our objective is to segment a movie into its three acts by detecting the two act boundaries shown in Fig. 4. Consider the CIGs A_{S_{i−1}} and A_{S_i} obtained at shots S_{i−1} and S_i respectively. The difference between two CIGs is computed using the graph edit distance (GED):

GED(A_{S_{i−1}}, A_{S_i}) = ∆η + ∆e

where ∆η is the number of new nodes added to A_{S_i}, and ∆e is the number of edges that are modified to obtain A_{S_i} from A_{S_{i−1}}.
Using this measure, we compute how the CIG for a given movie changes over time between consecutive shots. A window of length T_w is used to sum all the GED scores within the window, to incorporate a longer context and obtain a measure of the overall interaction around each shot. Let this CIG difference be denoted as y^{ged}, where y_i^{ged} represents the change in interaction around shot S_i. We detect act boundary I as

t_{b1} = argmax_{t_i ∈ B} y_i^{ged}

where t_i is the time at the center of S_i, and B is a predefined interval. This interval B is chosen leveraging information from film grammar [5], which suggests that act boundary I lies within 25 to 30 minutes from the start of the movie. We thus set B to contain all the shots in the interval from 22 to 40 minutes from the start of the movie. Act boundary II, t_{b2}, is detected in a similar fashion, with an interval B_2 spanning 14 to 34 minutes before the end of the movie.
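The two quantities above can be sketched as follows. Treating the GED as the sum of the two terms, and the boundary as the argmax of the windowed score inside the search interval, are our reading of the text; the padding of the smaller graph is an implementation choice.

```python
import numpy as np

def graph_edit_distance(A_prev, A_cur):
    """GED between consecutive CIGs: the number of new nodes plus
    the number of modified (undirected) edges."""
    n_prev = A_prev.shape[0]
    delta_nodes = A_cur.shape[0] - n_prev
    padded = np.zeros_like(A_cur)
    padded[:n_prev, :n_prev] = A_prev
    # count modified edges in the upper triangle (undirected graph)
    delta_edges = int(np.sum(np.triu(padded != A_cur)))
    return delta_nodes + delta_edges

def detect_boundary(y_ged, shot_times, lo, hi):
    """Pick the shot time inside the search interval [lo, hi]
    (seconds) whose windowed GED score y_ged is maximal."""
    best = max((i for i, t in enumerate(shot_times) if lo <= t <= hi),
               key=lambda i: y_ged[i])
    return shot_times[best]
```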

Major character identification
Another important task in movie analysis is to identify the major characters. Past work on major character discovery using character networks usually relies on betweenness, centrality and the sum of edge weights [16,21,22]. We instead compute the eigenvector centrality for each character in our CIG. The eigenvector centrality e_p of a character (node) p measures the influence that node p has on the CIG, and is defined as

e_p = (1/ζ) Σ_q A(p, q) e_q

where ζ is the largest eigenvalue of A, and A(p, q) denotes the weight of the edge between nodes p and q. We then define a node importance measure σ(p) for each node p based on its centrality e_p. The higher the value of σ(p), the more important the node (character). We use the values of σ(p) to rank the movie characters in terms of their importance in the movie. For example, Fig. 3 shows these node importance measures for the characters in a movie.
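The centrality equation above can be sketched via power iteration; the iteration count and the normalization are implementation choices, not part of the method.

```python
import numpy as np

def eigenvector_centrality(A, iters=100):
    """Eigenvector centrality e satisfying e = (1/zeta) A e, with
    zeta the largest eigenvalue of A, computed by power iteration."""
    e = np.ones(A.shape[0])
    for _ in range(iters):
        e = A @ e                 # one application of A
        e /= np.linalg.norm(e)    # renormalize to avoid overflow
    return e
```

For a connected, non-bipartite interaction graph, the iteration converges to the dominant eigenvector of A, whose entries give each character's influence.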

Performance evaluation
In this section, we present results and performance comparisons for the different components of our proposed method. First, we present results on the performance of the online face clustering algorithm, as it is a critical component of the CIG construction algorithm and its accuracy determines the quality of the CIG. Direct evaluation of a CIG is not very meaningful, as CIGs may have different characteristics by construction. Hence, we evaluate the usefulness of the CIGs via two movie analysis tasks: act segmentation and major character discovery. Comparison with existing methods: We compare with two baselines ((i) a Gaussian mixture model (GMM) with FaceNet features, and (ii) K-means with FaceNet features) and several state-of-the-art face clustering methods: (i) ULDML [2], (ii) a recently proposed constrained clustering method, the coupled HMRF (cHMRF) [24], and (iii) an aggregated CNN feature-based clustering (aCNN) [19]. The performance of all methods is compared in Table 1 in terms of clustering accuracy (expressed in %), which compares the predicted cluster labels with the ground truth labels. Note that all the methods in Table 1 are offline methods, where the entire data, information about the face tracks and the cluster counts are provided as input to the algorithms. For the online method, however, no information about the face tracks or cluster counts is available. The performance of our algorithm on the BF2006 database is superior to that of cHMRF and ULDML, and is comparable to K-means. On the NH2016 database, our algorithm outperforms all its offline counterparts, achieving a clustering accuracy of 93.8%. We next compare with the only existing online face clustering algorithm, TCCRP [14]. We combine TCCRP with FaceNet features and use a tracklet length of 10. The comparison is made in terms of homogeneity score, completeness score and their harmonic mean, i.e., the V-measure (see Table 2).
Table 2 shows that TCCRP has higher cluster homogeneity, but this is achieved at the cost of over-clustering (note the large number of clusters created by TCCRP), thereby degrading completeness. Our method achieves significantly higher completeness and V-measure while discovering a more accurate number of clusters.

Evaluating CIGs for act segmentation
Database: We use a database of six full-length Hollywood movies: Good Deeds, Hope Springs, Joyful Noise, Resident Evil, Step Up Revolution, and The Vow. These movies are known to have a well-defined three-act structure [6]. The act boundaries for the movies were annotated by three film experts. Each expert independently marked the act boundaries for each movie, and a final time stamp (at the precision level of seconds) was then decided through discussion [6].
Experimental setup: We run the DLib face detector [7] on each frame of the movie and create face tracks. To remove false detections and very small tracks, we set the feature-distance threshold δ for track creation to 1.0, and the spatial overlap threshold α to 0.95. The online clustering threshold τ is set to 3.0 for all the movies. Face tracks of length less than 15 are discarded.
Results and discussion: We detect the two act boundaries (see Fig. 4) in each movie using the CIGs, as described in Section 4.1, and compute the error as the distance (in seconds) from the expert annotations. The parameter T_w is set to 60 s. We compare the performance of our CIG-based approach with that of an existing multimodal approach proposed in earlier work [6]. We also create a simple baseline for comparison: the baseline sets the first act boundary at the 25th minute of the movie, and the second act boundary at the 25th minute from the end of the movie. Table 3 presents the results of act boundary detection for the proposed method along with the baseline and the multimodal approach [6]. Our CIG-based approach performs best in terms of overall error, even though it uses information from only the visual stream. We also note that detecting act boundary II is more difficult, as it has higher variability across movies. Figure 5 presents an example CIG distance plot and the detected act boundaries for the movie Hope Springs.

Evaluating CIGs for major character identification
For this task, we use the same six movies as in the three-act segmentation task described in the previous section. The experimental settings remain the same.
Results and discussion: We first run our online face clustering algorithm on each movie. Some of the clusters thus obtained may be noisy, i.e., they may contain faces from multiple characters. Such noisy clusters are formed due to (i) the presence of minor characters who do not appear on-screen long enough, and (ii) some wrongly clustered faces of major characters. Since the ground truth face labels are not available for the movies, we sought manual validation of the clusters