Automatic Segmentation of TV News into Stories Using Visual and Temporal Information

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10016)

Abstract

In this paper we propose a new method for the automatic segmentation of TV news into stories using image retrieval techniques and content manipulation. Our framework performs shot boundary detection, global key-frame representation, image re-ranking based on neighborhood relations, and analysis of the temporal variance of image locations in order to construct a unimodal cluster for anchor person detection and differentiation. Finally, anchor shots are used to form video scenes. The entire technique is unsupervised, able to learn semantic models and extract natural patterns from the current video data. The experimental evaluation, performed on a dataset of 50 videos totaling more than 30 h, demonstrates the pertinence of the proposed method, with gains of more than 5–7% in recall and precision rates when compared with state-of-the-art techniques.

Keywords

News video story segmentation · Relevant interest points · Anchor person extraction · Temporal and visual constrained clustering

1 Introduction

Nowadays, with the development of video-on-demand capabilities, large collections of video archives dating back decades can be accessed over the Internet by regular users. Furthermore, most TV channels broadcast and publish their video content online. In this context, the efficient retrieval of video clips, notably in the case of television news, is not always feasible because of the poor or incomplete video indexing available.

In recent years, the semantic segmentation of news videos has captured the attention of the scientific community. The main goal is to structure the video information efficiently, so that the system is able to retrieve only a particular segment or a specific topic of interest rather than the complete broadcast video stream. The challenge is to temporally segment the multimedia information into meaningful and manageable high-level semantic parts (i.e., coherent news stories) that can be automatically indexed and then returned to users. Such video units need to be defined at a higher level of abstraction than video shots, satisfying user needs at an increased level of granularity. In the case of movies, a video scene needs to respect three rules of continuity: in space, in time and in action. However, in the case of TV news the definition of a scene is highly difficult, as most of the underlying hypotheses made within the context of scene identification are violated. In this context, we argue that the shots included in a story segment need to satisfy the following properties: temporal continuity and semantic coherence. Based on the above considerations, this paper tackles the issue of TV news segmentation into scenes from the perspective of visual similarities and semantic content. More specifically, we aim at identifying the boundaries between different news stories in order to facilitate navigation.

The rest of the paper is organized as follows. Section 2 reviews the state of the art in the field. In Sect. 3 we introduce and describe in detail the proposed video temporal segmentation method. Section 4 presents the experimental results obtained on a set of videos from the French national television broadcast and from the NBC and CNN TV stations. Finally, Sect. 5 concludes the paper and presents some perspectives for future work.

2 Related Work

In the last couple of years, the increasing volume of video news has led to the development of various systems dedicated to TV news management and understanding. The structure of news magazines was studied in [1], where the authors conclude that news programs typically begin with a highlight, the main body is organized into subjects, and each story begins with an anchorperson shot corresponding to the presenter. Thus, two categories of shots are identified [2]: (1) anchorperson shots and (2) news reports. A new subject usually starts with an anchor shot; therefore, for content-based news understanding, it is important to identify the locations of the presenters within the video stream. Anchor shots can be characterized as containing a certain percentage of similar visual features, static/dynamic camera movement and one or two news presenters.

Existing content-based approaches for structuring news videos [3] aim at automatically determining the temporal interval of each story, without any human intervention. One of the first methods in the field was proposed in [4]. Here, a trained classifier is used to categorize shots into a set of 9 predefined classes (intro, anchor, people, interview, sports, text scenes, special and logos). The story boundaries are finally detected using a Hidden Markov Model (HMM). Although the method returns good results, they are highly dependent on the quality of the annotated data. In order to avoid training a Support Vector Machine (SVM), in [5] the stories are extracted with the help of an anchorperson shot detection technique. In addition, the text associated with each subtitle is assigned to a story by using the Latent Dirichlet Allocation (LDA) method.

A different anchor person identification method is proposed in [6]. The system uses low-level image descriptors and text transcripts within a split-and-merge algorithm in order to identify subject boundaries. In [7], different sources of information are considered (i.e., features extracted from audio tracks with an automatic speech recognition system, and concepts selected from a large-scale ontology of images) and combined within a discriminative fusion scheme.

In [8], the anchor person identification system is based on a spatio-temporal analysis dedicated to news magazines with dynamic studio background motion or multiple presenters. The method starts by extracting two diagonal spatio-temporal slices that are further divided into three parts. Then, a sequential clustering method is applied to determine a set of candidate anchor shots. The actual presenter shots are established by applying a structural tensor.
Fig. 1.

The proposed news video segmentation framework

The authors in [9] introduce a cost-effective anchor shot detection method that performs search space reduction. The method combines a skin color detector, a face detector and support vector data descriptions in order to localize the presenter within the video stream. More recently, in [10] an automatic method for news video segmentation into stories was proposed, based on visual features such as face and anchor person detection, junk frame removal, motion activity and TV logo extraction. Both approaches in [9, 10] outperform state-of-the-art methods, but still require the availability of a large amount of manually annotated data.

Although there are many works addressing the problem of effective and efficient extraction of relevant subjects from news videos, the robust identification of subject boundaries remains very challenging because of camera or object movement, cropping, illumination changes, clutter or important background variations. In addition, nowadays most news videos contain interviews where different invited persons appear in the presence of the news presenter.

In order to overcome such limitations, in this paper we propose a novel segmentation method of TV news videos, which is fully unsupervised and able to automatically learn semantic models from video data. The main contributions presented in this paper concern:
  1. An anchor person identification technique based on relevant interest point extraction, global image representation using a fusion of VLAD (Vector of Locally Aggregated Descriptors) and color histograms, and confident image retrieval using neighborhood relations and the Jaccard similarity coefficient.
  2. A dedicated method that differentiates between news presenters and invited guests by analyzing the temporal locations of candidate anchor images within the video stream.

Figure 1 presents the pipeline of the proposed news video segmentation method. The following section describes in detail the proposed approach.

3 Proposed Approach

In a first stage we perform a temporal structuring of the video stream into shots. For each shot, we then identify a set of representative key-frames. Finally, we select a set of candidate anchor key-frames, corresponding to potential anchor persons.

3.1 Temporal Video Segmentation

In order to perform temporal video segmentation into shots, we consider the method first introduced in [11], which includes both an enhanced shot boundary detection algorithm and a key-frame extraction technique, briefly recalled here. Each frame of the video stream is represented as a vertex in a graph spanning structure, connected to the others by edges expressing the visual similarity between two nodes. The analysis is performed using a temporal sliding window that selects a fixed number of frames. Then, for each position of the sliding window, a graph partition is computed. In order to detect a transition, we perform the analysis within the scale space of derivatives of the local minimum vector. The computational complexity is reduced by applying a two-pass approach. Then, for each detected shot, a set of representative key-frames is extracted.
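As a rough illustration of this sliding-window analysis, the following Python sketch flags shot boundaries at pronounced local minima of the inter-frame visual similarity. It is a strong simplification of the graph-partition method of [11]: the HSV-histogram similarity, the window size and the `drop` threshold are illustrative assumptions, not values from the paper.

```python
import cv2
import numpy as np

def frame_similarity(f1, f2, bins=32):
    """Cosine similarity between the HSV histograms of two frames."""
    h = []
    for f in (f1, f2):
        hsv = cv2.cvtColor(f, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [bins, bins],
                            [0, 180, 0, 256]).flatten()
        h.append(hist / (np.linalg.norm(hist) + 1e-9))
    return float(np.dot(h[0], h[1]))

def detect_shot_boundaries(video_path, window=25, drop=0.4):
    """Flag frame t+1 as a shot start when similarity(t, t+1) is a local
    minimum inside the sliding window and falls below `drop`."""
    cap = cv2.VideoCapture(video_path)
    sims, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if prev is not None:
            sims.append(frame_similarity(prev, frame))
        prev = frame
    cap.release()
    boundaries = []
    for t in range(1, len(sims) - 1):
        lo, hi = max(0, t - window), min(len(sims), t + window)
        if sims[t] == min(sims[lo:hi]) and sims[t] < drop:
            boundaries.append(t + 1)  # index of the first frame of the new shot
    return boundaries
```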

Two additional features are introduced here with respect to the baseline method in [11]. The first one concerns the quality of the selected key-frames. In order to avoid the blurry images that can occur in real-world videos, we perform an edge sharpness analysis. The image edges are detected with a Canny edge detector [12]. A vote is associated with each contour point, depending on its gradient magnitude, in order to construct a global sharpness measure for each considered frame. Among the candidate frames, only the one with the maximum global sharpness coefficient is finally retained. The second feature is the detection of key-frames including faces, which are candidates for anchor images. To this purpose, we apply the face detection method introduced in [13], which shows high detection performance. In this way, we obtain a set of candidate key-frames (\(S_{key-fr}\)), corresponding to possible anchor persons. This set of images is further analyzed, as described in the following section.
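A possible implementation of the sharpness vote is sketched below, assuming the votes are Sobel gradient magnitudes accumulated over Canny edge pixels; the exact vote weighting is not fully specified in the paper, so the mean-vote normalization here is an assumption.

```python
import cv2
import numpy as np

def global_sharpness(frame):
    """Accumulate gradient-magnitude votes over Canny edge pixels."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    votes = magnitude[edges > 0]
    return float(votes.sum() / (votes.size + 1e-9))  # mean vote per edge point

def sharpest_frame(candidates):
    """Among the candidate key-frames of a shot, keep the sharpest one."""
    return max(candidates, key=global_sharpness)
```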

3.2 Anchor Shot Identification System

Representative Key-Point Selection. For each key-frame, we first extract interest points using the pyramidal FAST algorithm [14], which are further described using the SIFT descriptor [15]. However, in the case of anchor shot identification, we have observed that relevant objects (i.e., the news presenter) return a significantly lower number of interest points than the textured objects appearing during the TV news (e.g., background information, trees, grass, buildings). For this reason, in order to jointly reduce the computational time and ensure a sufficient number of interest points, we adopt a simple, semi-dense sampling approach. A regular rectangular grid is overlaid on each key-frame. The set of interest points included in each grid cell is then determined. Then, for each cell, we retain only the most relevant point, i.e., the one with the highest value of the Harris-Laplacian operator [16]. The size of a grid cell is defined as \(P = (H*W)/T\), where W and H are the width and height of the image, while T is the maximum total number of interest points we retain for an image. In our work, we have restricted T to 1000 interest points.
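The cell-wise filtering can be sketched as follows. For brevity, the `response` field of OpenCV's FAST keypoints is used here as a stand-in for the Harris-Laplacian strength of [16], and square cells of area P are assumed.

```python
import cv2

def grid_filtered_keypoints(gray, T=1000):
    """Keep at most one FAST keypoint per grid cell (the strongest one),
    so that at most T points survive, evenly spread over the image."""
    H, W = gray.shape
    cell_area = (H * W) / T               # P = (H * W) / T
    cell = max(1, int(cell_area ** 0.5))  # side of a square cell of area P
    kps = cv2.FastFeatureDetector_create().detect(gray, None)
    best = {}
    for kp in kps:
        key = (int(kp.pt[0]) // cell, int(kp.pt[1]) // cell)
        if key not in best or kp.response > best[key].response:
            best[key] = kp
    return list(best.values())
```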

Figure 2 illustrates three different strategies of selecting relevant points from a key-frame. Figure 2a shows the key-points extracted using the classical FAST algorithm. The key-points obtained using FAST and refined using the Harris-Laplacian operator (i.e. the points are ranked based on the magnitude of their interest strength and the first T are kept) as proposed in [17] are presented in Fig. 2b. The results obtained with our method are illustrated in Fig. 2c. Let us underline that the examples illustrated in Fig. 2b and c contain the same number of interest points.

We can observe that the proposed method yields more evenly distributed key-points and avoids over-accumulations of points in certain textured areas.
Fig. 2.

Three different strategies for interest point extraction: (a). FAST extractor; (b). FAST extractor with Harris-Laplacian filtering; (c). our method based on regular grid filtering.

In order to obtain a robust representation of the key-frame, each SIFT descriptor is transformed into a normalized RootSIFT descriptor [18] and projected onto 128 principal directions using a PCA basis learned off-line [19].
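A compact sketch of this descriptor post-processing, assuming the PCA mean and basis have been learned off-line on a held-out descriptor set:

```python
import numpy as np

def root_sift(desc):
    """RootSIFT [18]: L1-normalize each SIFT descriptor, then take the
    element-wise square root (Hellinger kernel mapping)."""
    desc = desc / (np.abs(desc).sum(axis=1, keepdims=True) + 1e-9)
    return np.sqrt(desc)

def project_pca(desc, mean, components):
    """Project descriptors onto principal directions learned off-line.
    `mean` has shape (D,) and `components` shape (128, D), both obtained
    from a PCA fitted on an independent descriptor set."""
    return (desc - mean) @ components.T
```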

Global Key-Frame Description. In order to characterize the informational content of a key-frame I, we develop a global image representation using the PCA-VLAD (Vector of Locally Aggregated Descriptors) representation [20], which encodes descriptors by exploiting the locality of the feature space. The size of the vocabulary used for VLAD is set to 256 visual words.

The VLAD image representation proves to be invariant to rotation and illumination changes. However, in the case of anchor shot identification, color is a useful cue that needs to be taken into account. In order to capture the color information present in every key-frame, we also retain, for each image, a color histogram computed in the HSV color space. Several histogram sizes have been evaluated, with equivalent performances between 128 and 256 bins (corresponding to a uniform quantization of the HSV color space).
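The two global descriptors can be sketched as follows. The codebook is assumed to be a 256-word k-means vocabulary trained off-line; the power normalization follows common VLAD practice [20] rather than a detail stated in this paper, and the 8x4x4 HSV quantization is one illustrative way to reach 128 bins.

```python
import cv2
import numpy as np

def vlad_encode(descriptors, codebook):
    """VLAD [20]: accumulate the residuals of each local descriptor with
    respect to its nearest visual word. `codebook` is a (256, D) array of
    k-means centroids learned off-line."""
    k, d = codebook.shape
    v = np.zeros((k, d))
    for desc in descriptors:
        a = int(np.argmin(((codebook - desc) ** 2).sum(axis=1)))
        v[a] += desc - codebook[a]
    v = np.sign(v) * np.sqrt(np.abs(v))    # power normalization (common practice)
    v = v.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-9)  # final L2 normalization

def hsv_histogram(frame):
    """Uniformly quantized HSV histogram (8 x 4 x 4 = 128 bins)."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, [8, 4, 4],
                        [0, 180, 0, 256, 0, 256]).flatten()
    return hist / (np.linalg.norm(hist) + 1e-9)
```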

Representative Anchor Image Retrieval. Based on the observation that news video content is highly structured in a regular and repetitive manner, we focus on identifying recurrent patterns that can correspond to the anchor person. In [1], the authors conclude that key-frames containing news readers have an occurrence rate superior to any other image appearing in the video. However, for TV news with invited guests, where the camera switches between them and the visual content has low variation (i.e., constant background), this assumption may not hold. In order to deal with all types of news videos and extract, with high confidence, the most relevant key-frames showing the presenter, we propose the following strategy, which transforms the anchor image identification problem into an image retrieval task.

Each image from the key-frame set (\(S_{key-fr}\)) is considered as a query (q) against \(S_{key-fr}\). The retrieved results are presented as a list of candidate images sorted with respect to a similarity measure that takes into account the two global image descriptors considered (i.e., VLAD and the HSV color histogram):
$$\begin{aligned} Score(q,x) = \alpha \cdot Score_{VLAD}(q,x)+(1-\alpha ) \cdot Score_{Hist}(q,x) ; \end{aligned}$$
(1)
where \(q \ne x\), x is an image from the static summary associated to the video stream (\(x \in S_{key-fr}\)), \(\alpha \) is a weighting parameter that controls the influence of each descriptor, while \(Score_{VLAD}\) and \(Score_{Hist}\) are the cosine similarities between the VLAD representations and between the HSV color histograms of the two images, respectively. In our experiments, we have considered relatively high values of the \(\alpha \) parameter (i.e., within the range 0.8–0.9). Such a strategy privileges the similarity score provided by the VLAD representation, which is highly discriminative, while taking into account a rather minimal amount of color information yielded by the global HSV histograms.
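Since both descriptors are L2-normalized, Eq. (1) reduces to a weighted sum of two dot products; the value \(\alpha = 0.85\) below is an illustrative midpoint of the reported 0.8–0.9 range.

```python
import numpy as np

def score(q_vlad, q_hist, x_vlad, x_hist, alpha=0.85):
    """Eq. (1): weighted combination of the two cosine similarities.
    alpha close to 1 favors the more discriminative VLAD term."""
    s_vlad = float(np.dot(q_vlad, x_vlad))  # cosine: vectors are L2-normalized
    s_hist = float(np.dot(q_hist, x_hist))
    return alpha * s_vlad + (1 - alpha) * s_hist
```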
After analyzing the top-k ranked candidate images, we observed that the similarity scores are high at the beginning of the list but flatten after a certain rank. This behavior indicates that relevant images, but also false positives, can have scores equivalent to the query image. To refine the results and detect the outliers (which can correspond to spurious faces), we first propose to remove from the retrieved list the key-frames not satisfying the following reciprocal neighborhood relation:
$$\begin{aligned} R_k(q,x) : q \in N_k(x) \, \wedge \, x \in N_k(q) ; \end{aligned}$$
(2)
where \(N_k(x)\) and \(N_k(q)\) denote the sets of top-k retrieved key-frames for queries x and q, respectively.
Then, a query image q is considered to be similar to a key-frame x if the associated Jaccard similarity coefficient J(q,x) is superior to a pre-established threshold:
$$\begin{aligned} J(q,x) = \frac{\mid N_k(q) \cap N_k(x) \mid }{\mid N_k(q) \cup N_k(x) \mid } \ge Th_1 ; \end{aligned}$$
(3)
where \(\mid \dots \mid \) denotes the cardinality of the corresponding sets. The values of J(q,x) range from 0 to 1, where 0 corresponds to no overlap and 1 signifies that x and q share exactly the same list of neighbors. In our experiments, we have considered a value of \(Th_1 = 0.5\), meaning that x and q are considered as belonging to the same class if they share at least half of their neighbors.
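A sketch of the reciprocal-neighborhood and Jaccard filtering, assuming a precomputed symmetric matrix `scores` of Eq. (1) values over \(S_{key-fr}\); the value of k is an assumption, as the paper does not state it explicitly.

```python
def top_k_neighbors(scores, q, k):
    """Indices of the k key-frames most similar to query q, where `scores`
    holds the pairwise Eq. (1) similarities."""
    order = sorted((i for i in range(len(scores)) if i != q),
                   key=lambda x: -scores[q][x])
    return set(order[:k])

def jaccard_filter(scores, q, k=10, th1=0.5):
    """Keep retrieved key-frames x that are reciprocal neighbors of q
    (Eq. 2) and whose Jaccard coefficient with q reaches th1 (Eq. 3)."""
    nq = top_k_neighbors(scores, q, k)
    kept = []
    for x in nq:
        nx = top_k_neighbors(scores, x, k)
        if q in nx and len(nq & nx) / len(nq | nx) >= th1:  # Eqs. (2)-(3)
            kept.append(x)
    return kept
```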
Each query image and its associated top-k ranked results form a cluster, for which we compute an intra-cluster similarity score:
$$\begin{aligned} SimScore_{q-class} = \sum _{x \in N_k(q)}Score(q,x) ; \quad x \in S_{key-fr} . \end{aligned}$$
(4)
The value of \(SimScore_{q-class}\) can be interpreted as a measure of the degree of resemblance between the different members of the considered cluster. The class with the maximum value of \(SimScore_{q-class}\) is considered as the one containing the reference anchor shots.
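Reusing the filtering above, the reference class selection of Eq. (4) can be sketched as:

```python
def reference_anchor_class(scores, k=10):
    """Eq. (4): score each query's filtered cluster by the summed
    similarity of its members; the best cluster is taken to contain
    the reference anchor shots."""
    best_q, best_sim, best_members = None, -1.0, []
    for q in range(len(scores)):
        members = jaccard_filter(scores, q, k)
        sim_score = sum(scores[q][x] for x in members)  # SimScore_{q-class}
        if sim_score > best_sim:
            best_q, best_sim, best_members = q, sim_score, members
    return best_q, best_members, best_sim
```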
The proposed strategy is effective in detecting the presence of anchor persons, but it is insufficient to differentiate between actual news presenters and invited guests. This is illustrated in Fig. 3. Here, the background variation of frames corresponding to news presenters is relatively high (images marked in green), while the key-frames showing the invited guest are very similar (images marked in magenta), thus boosting the \(SimScore_{q-class}\) of the corresponding class.
Fig. 3.

Different classes of anchor persons: news presenters (green box); invited guest (magenta box). (Color figure online)

Anchor Person Differentiation. In order to distinguish between multiple anchor persons and differentiate their various occurrence patterns, we analyze the time component of each key-frame (i.e., its temporal location within the video stream).

We propose to characterize a cluster of images not only by visual similarity, but also by the temporal positions of the key-frames included in the class. For each cluster, we compute the temporal variance of all key-frame positions:
$$\begin{aligned} VarScore_{q-class} = \frac{1}{N_k} \sum _{i=1}^{N_k}(t_i - \mu _{q-class})^2 ; \end{aligned}$$
(5)
where \(N_k\) is the number of images belonging to the current class, \(t_i\) is the timestamp of the i-th key-frame and \(\mu _{q-class}\) is the average temporal position of all images within class q. The \(VarScore_{q-class}\) value helps to automatically understand the role of each character within the video stream. Thus, the temporal variance corresponding to the news presenter tends to take high values, since the presenter's occurrences are spread out over the entire video. On the contrary, the occurrences of invited guests are much more localized in time and the corresponding variance is significantly lower. Finally, the score of each image class is given by:
$$\begin{aligned} Score_{q-class} = SimScore_{q-class} \cdot VarScore_{q-class} ; \end{aligned}$$
(6)
where \(SimScore_{q-class}\) and \(VarScore_{q-class}\) are normalized (\(SimScore_{q-class}\) to the maximum visual similarity score over all the considered classes, and \(VarScore_{q-class}\) to the highest variance over all clusters). The class with the highest value of \(Score_{q-class}\) contains the relevant anchor person. We highlight that, after this step, the anchor person class does not incorporate an exhaustive set of key-frames in which the presenter appears, but rather a model corresponding to a subset of the presenter's visual appearances, characterized by the most representative visual and temporal characteristics.
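A sketch of the class scoring of Eqs. (5)-(6), assuming one list of key-frame timestamps per candidate class:

```python
import numpy as np

def class_scores(sim_scores, timestamps_per_class):
    """Eqs. (5)-(6): combine the normalized visual similarity of each class
    with the normalized temporal variance of its key-frame positions."""
    var_scores = []
    for ts in timestamps_per_class:
        ts = np.asarray(ts, dtype=float)
        var_scores.append(((ts - ts.mean()) ** 2).mean())  # Eq. (5)
    sim = np.asarray(sim_scores) / max(sim_scores)         # normalization
    var = np.asarray(var_scores) / max(var_scores)
    return sim * var                                       # Eq. (6)

# The anchor person class is then the argmax:
# anchor_class = int(np.argmax(class_scores(sims, times)))
```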

To segment the news video into stories, we compare each key-frame from the static summary to the set of all images included in the reference anchor shot class. If the similarity measure (cf. Eq. 1) is above a threshold (\(Th_2\)), then the current key-frame is considered as the first shot of a news story. The value of \(Th_2\) is adaptively established as half of the average similarity score of all images included in the anchor shot class.
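A possible realization of this story-boundary rule, assuming the comparison of a key-frame to the anchor class is taken as the maximum similarity over its members (the paper does not specify the aggregation):

```python
import numpy as np

def segment_stories(summary_desc, anchor_desc, sim):
    """Mark a key-frame as the start of a new story when its similarity to
    the anchor model exceeds Th2, adaptively set to half the mean pairwise
    similarity inside the anchor class. `sim(a, b)` evaluates Eq. (1)."""
    n = len(anchor_desc)
    intra = [sim(anchor_desc[i], anchor_desc[j])
             for i in range(n) for j in range(n) if i != j]
    th2 = 0.5 * float(np.mean(intra)) if intra else 0.0
    story_starts = [i for i, kf in enumerate(summary_desc)
                    if max(sim(kf, a) for a in anchor_desc) > th2]
    return story_starts, th2
```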

However, the semantic content of news videos is extremely diverse. The correct identification of the anchor person alone is insufficient for obtaining a correct and comprehensive segmentation into stories for all types of video streams. Most TV news contain conversational scenes, characterized as collections of shots that exhibit a high similarity of visual content, where the camera switches back and forth between different participants. For conversational scenes, the occurrence rate of news presenter shots is significantly higher than the actual rate of change of the addressed subject. Because conversational scenes are characterized by continuity in time and space, we propose using our scene grouping method based on temporally constrained clustering, first introduced in [11]. The method consists of iteratively merging shots that fall into a temporal analysis window and satisfy the similarity clustering criteria. The influence of outlier shots, which might correspond to punctual digressions from the main action within the considered scene, is reduced through a shot neutralization process.

4 Experimental Evaluation

To evaluate the accuracy of the proposed framework we have performed extensive testing on a database of 50 videos, totaling more than 30 h of broadcast. The dataset is composed of two types of video clips. The first one is selected from the broadcast news of the French national television (30 videos with an average duration of 35 min, denoted JT13 France2, JT20 France2, Grand Soir, JT12, JT19 edition national and JT19 edition regional from France3), corresponding to various France Television (FTV) broadcast programs. The second has been acquired from YouTube and includes 20 videos from the NBC and CNN TV stations. The FTV videos have a resolution of 1280\(\,\times \,\)720 pixels, while the YouTube videos have a lower resolution of 704\(\,\times \,\)396 pixels.

The shot boundary detection method [11] produces an average of 375 shots per video, each with its associated key-frames; here, we select a single key-frame per shot. After applying the face detection method [13], an average of 150 images per video was identified as possible anchor person candidates. In order to determine the performance of the anchor person detection system, we use as evaluation metrics the traditional precision, recall and F-score, defined as:
$$\begin{aligned} Recall = \frac{N_{CDA}}{N_{GT}} ; Precision = \frac{N_{CDA}}{N_{ES}} ; F-score = \frac{2 \cdot Recall \cdot Precision}{Recall + Precision} ; \end{aligned}$$
(7)
where \(N_{CDA}\) represents the number of correctly detected anchor shots, \(N_{GT}\) is the number of anchor shots in the ground truth and \(N_{ES}\) is the number of shots extracted as anchor elements.
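These metrics translate directly into code:

```python
def evaluation_metrics(n_cda, n_gt, n_es):
    """Eq. (7): recall, precision and F-score of anchor shot detection."""
    recall = n_cda / n_gt
    precision = n_cda / n_es
    f_score = 2 * recall * precision / (recall + precision)
    return recall, precision, f_score
```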
Figure 4 presents the average results obtained for anchor person identification. The average precision over all the broadcast channels exceeds 92%, with a recall of about 95%.
Fig. 4.

Precision, Recall and F-scores obtained for the anchor person identification method

As can be observed, the videos selected from YouTube return the lowest results in terms of \(F-score\). This behavior can be explained by the reduced quality of the video stream and the continuous presence of a scrolling text bar.

In order to provide a comprehensive evaluation of our framework, we have compared our results with different state-of-the-art techniques: Chen et al. [21], Broilo et al. [22] and Ji et al. [2]. Table 1 presents the synthesized results of all methods.

The experimental evaluation clearly demonstrates the superiority of the proposed framework, with gains of more than 5% in recall and 7% in precision. These results can be explained by the fact that our method is designed to be robust to important geometric and photometric distortions. Moreover, the VLAD image representation combined with color histograms allows key-frames to be represented in a global manner that tolerates large changes in appearance. Because our method is entirely unsupervised, the system does not depend on a training phase and can naturally extract patterns from the considered video data. The increase in precision can be explained by the use of the temporal location variance when extracting the most relevant cluster, which allows us to differentiate between anchor persons and invited guests. Finally, we evaluated the quality of the story segmentation method on the considered TV news dataset. The results are presented in Fig. 5. As can be observed, we obtain an average \(F-score\) for story segmentation of 88%. A story is considered correctly identified if the automatically obtained scene covers more than 80% of the ground truth scene. These results can be explained by the use of the temporally constrained scene clustering method and by the ability of our system to differentiate between news presenters and invited guests.
Table 1.

Comparative evaluation of the proposed method

Analyzed method     | Precision [%] | Recall [%] | F-score [%]
Chen et al. [21]    | 79            | 84         | 81
Broilo et al. [22]  | 81            | 88         | 84
Ji et al. [2]       | 85            | 90         | 87
Proposed method     | 92            | 95         | 93

Fig. 5.

Precision, Recall and F-scores obtained for the story segmentation

5 Conclusion and Perspectives

In this paper we have introduced a complete framework for the automatic segmentation of TV news into stories. The method is based on a novel anchor person identification system that uses relevant interest point extraction, global image representation and image retrieval based on neighborhood relations. The system is then able to differentiate between news readers and invited guests by taking advantage of the visual similarity between key-frames and the temporal variance of candidate anchor images.

The experimental evaluation clearly demonstrates the superiority of the proposed framework, with gains of more than 5% in recall and 7% in precision. The entire system proves to be robust to important geometric and photometric distortions, tolerating large variations in object appearance. In future work we will consider including more high-level functionalities within the framework, such as the identification and clustering of news stories addressing the same subject, and face and location recognition capabilities. In addition, the textual data available in subtitles could be exploited within a multi-modal semantic approach.

Acknowledgments

This work has been partially accomplished within the framework of the FUI 19 Media4D project, supported by BPI (Banque Publique d’investissement) France and DGE (Direction Generale des Entreprises).

This work was supported by a grant of the Romanian National Authority for Scientific Research and Innovation, CNCS - UEFISCDI, project number: PN-II-RU-TE-2014-4-0202.

References

  1. Chaisorn, L., Chua, T.S., Lee, C.H.: A multi-modal approach to story segmentation for news video. World Wide Web 6(2), 187–208 (2003)
  2. Ji, P., Cao, L.J., Zhang, X.G.: News videos anchor person detection by shot clustering. Neurocomputing 123, 86–99 (2014)
  3. Dumont, E., Quenot, G.: A local temporal context-based approach for TV news story segmentation. In: IEEE ICME, pp. 973–978 (2012)
  4. Chaisorn, L., Chua, T.S.: Story boundary detection in news video using global rule induction technique. In: IEEE ICME, pp. 2101–2104 (2006)
  5. Misra, H., Hopfgartner, F., Goyal, A., Punitha, P., Jose, J.M.: TV news story segmentation based on semantic coherence and content similarity. In: Boll, S., Tian, Q., Zhang, L., Zhang, Z., Chen, Y.-P.P. (eds.) MMM 2010. LNCS, vol. 5916, pp. 347–357. Springer, Heidelberg (2010)
  6. Goyal, A., Punitha, P., Hopfgartner, F., Jose, J.M.: Split and merge based story segmentation in news videos. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds.) ECIR 2009. LNCS, vol. 5478, pp. 766–770. Springer, Heidelberg (2009)
  7. Ma, C., Byun, B., Kim, I., Lee, C.H.: A detection-based approach to broadcast news video story segmentation. In: IEEE ICASSP, pp. 1957–1960 (2009)
  8. Zheng, F., Li, S., Wu, H., Feng, J.: Anchor shot detection with diverse style backgrounds based on spatial-temporal slice analysis. In: Boll, S., Tian, Q., Zhang, L., Zhang, Z., Chen, Y.-P.P. (eds.) MMM 2010. LNCS, vol. 5916, pp. 676–682. Springer, Heidelberg (2010). doi:10.1007/978-3-642-11301-7_68
  9. Lee, H., Yu, J., Im, Y., Gil, J.M., Park, D.: A unified scheme of shot boundary detection and anchor shot detection in news video story parsing. Multimedia Tools Appl. 51(3), 1127–1145 (2011)
  10. Dumont, E., Quénot, G.: Automatic story segmentation for TV news video using multiple modalities. Int. J. Digit. Multimedia Broadcast. 2012, 11 (2012). doi:10.1155/2012/732514
  11. Tapu, R., Zaharia, T.: High level video temporal segmentation. In: Bebis, G., et al. (eds.) ISVC 2011. LNCS, vol. 6938, pp. 224–235. Springer, Heidelberg (2011). doi:10.1007/978-3-642-24028-7_21
  12. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-8(6), 679–698 (1986)
  13. Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the wild. In: CVPR, pp. 2879–2886 (2012)
  14. Tuzel, O., Porikli, F., Meer, P.: Region covariance: a fast descriptor for detection and classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 589–600. Springer, Heidelberg (2006)
  15. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
  16. Harris, C., Stephens, M.: A combined corner and edge detector. In: Alvey Vision Conference, pp. 147–151 (1988)
  17. Mikolajczyk, K., Schmid, C.: Scale and affine invariant interest point detectors. IJCV 60(1), 63–86 (2004)
  18. Arandjelovic, R., Zisserman, A.: Three things everyone should know to improve object retrieval. In: CVPR (2012)
  19. Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. J. Comput. Graph. Stat. 15(2), 265–286 (2006)
  20. Delhumeau, J., Gosselin, P.H., Jegou, H., Perez, P.: Revisiting the VLAD image representation. In: ACM Multimedia, pp. 653–656 (2013)
  21. Chen, D.M., Vajda, P., Tsai, S., Daneshi, M., Yu, M., Chen, H., Araujo, A., Girod, B.: Analysis of visual similarity in news videos with robust and memory-efficient image retrieval. In: IEEE ICME Workshops, pp. 1–6 (2013)
  22. Broilo, M., Basso, A., De Natale, F.G.B.: Unsupervised anchor persons differentiation in news video. In: 9th International Workshop on Content-Based Multimedia Indexing, pp. 115–120 (2011)

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. ARTEMIS, Institut Mines-Telecom / Telecom SudParis, CNRS MAP5, Paris, France
  2. Telecommunication, Faculty of ETTI, University Politehnica of Bucharest, Bucharest, Romania
