Deep Learning Based Semantic Video Indexing and Retrieval

Conference paper
Part of the Lecture Notes in Networks and Systems book series (LNNS, volume 16)


The vast amount of video stored in web archives makes retrieval based on manual text annotations impractical. This study presents a video retrieval system that capitalizes on image recognition techniques. The article describes the implementation and the empirical evaluation of a system based entirely on features extracted by convolutional neural networks. It is shown that these features can serve as universal signatures of the semantic content of video and are useful for implementing several types of multimedia retrieval queries defined in the MPEG-7 standard. Further, a graph-based structure for the video index storage is proposed in order to implement complicated spatial and temporal search queries efficiently. Thus, the technical approaches proposed in this work may help to build a cost-efficient and user-friendly multimedia retrieval system.


Keywords: Video indexing, Video retrieval, Shot boundary detection, Graph database, Semantic features, Convolutional neural networks, Deep learning, MPEG-7

1 Introduction

The authors focus on video search, or content-based video retrieval, for cinematography and television production. The everyday need for footage in TV production means that editors spend much of their working time in movie and broadcast archives. Non-fiction film production also relies on historical and cultural heritage content stored in scattered archives.

The amount of information stored in movie archives is huge. For example, in the Russian Federation the National Cinematography Archive stores over 70,000 titles and the State TV/Radio Foundation stores over 100,000 titles. Much of their content comprises rare documentary films dating from the early 20th century to the present day. Making these materials available to modern producers is only possible by means of search techniques.

Well-established methods for searching, navigating, and retrieving broadcast-quality video content rely on transcripts obtained from manual annotation, closed captioning and/or speech recognition [1]. Recent progress in providing descriptive audio streams for the visually impaired has led to video indexing solutions based on speech recognition of descriptive audio [2].

Video search and data exchange are governed by international standards, one of the most important of which is MPEG-7 [3]. This standard defines the Query Format (MPQF) in order to provide a standard multimedia query language that unifies access to distributed multimedia retrieval systems. Within the MPQF format it is possible to identify the following query types:
  • QueryByMedia specifies a similarity or exact-match query by example retrieval where the example media can be an image, video, audio or text.

  • QueryByFreeText specifies a free text retrieval where it is possible to declare optionally the fields that must be ignored or focused on.

  • SpatialQuery specifies the retrieval of spatial elements within media objects (e.g., a tree in an image), which can be connected by a specific spatial relation.

  • TemporalQuery specifies the retrieval of temporal elements within media objects (e.g., a scene in a video), which can be connected by a specific temporal relation.

It is clear that relying on speech recognition techniques alone is not sufficient to meet the requirements of the above standard. Querying by media (either by sample image or by sample video clip) is not possible when only text-based indexing is used. Spatial querying would be very limited as well. There is therefore a need to index video by visual content in addition to speech content.

The present paper shows that all the above-mentioned query types can be implemented using semantic features extracted from video by deep learning algorithms, namely by convolutional neural networks. Our contribution is (1) presenting a video indexing and retrieval architecture that is based on unified semantic features and is capable of implementing the MPQF query interface and (2) sharing the results of real-world testing.

2 Related Work

There are two possible approaches for video indexing based on visual content: image classification and image description. Image classification approach involves assigning preset tags to every frame, or every key frame, or every scene of a video file. Certain improvements to this approach include salient objects detection and image segmentation. In case of salient objects detection one tags essentially the bounding boxes found in video frames with preset categories. In case of segmentation one tags free-form image regions. In both cases the resulting index includes a set of time codes and categories assigned to corresponding movie parts.

Convolutional neural networks (CNN, e.g. [4, 5, 6]) have recently become the de facto standard in visual classification, segmentation and salient object detection. For example, the architecture described in [5] comprises 19 trainable layers with 144 million parameters. It achieved a 6.8% top-5 error rate at the ILSVRC2014 competition [7]. The authors of [8] extend the CNN architecture to video classification by means of temporal pooling and by adding an optical flow channel to the raw frame content. This study applies temporal pooling as well but does not use optical flow, in order to reduce the computational overhead.

CNNs, however, are often trained on individual photos that are usually carefully framed and focused so that the subject of the image (i.e. the scene) is shown clearly. Videos typically consist of “shots”, i.e. units of action filmed without interruption from a single camera view. Within a shot, objects may be occluded, blurred or poorly positioned (not centered) because shots are intended to be perceived as a whole.

Usually, scene content in videos dramatically varies in appearance, resulting in difficulty in classification of such content. For example, the subject of a video shot may be filmed from different angles and scales within the shot, from panoramic to close-up, causing the subject to appear differently across frames in the shot. Thus, because video often represents wide varieties of content and subjects, even within a particular content type, identification of that content is exceedingly difficult.

The image description approach involves generating natural text annotations based on video frame content. Karpathy et al. [9] propose a deep neural network architecture that matches image regions with parts of natural language sentences, along with a multimodal recurrent neural network that takes images as input and generates their textual descriptions. Using this architecture it is possible to generate text descriptions for key frames extracted from a video stream and to build a searchable index. Since the proposed architecture is capable of generating sentences that describe image regions defined by bounding boxes, it is possible to apply complex search queries with spatial relations between objects within a key frame.

In [10] text descriptions are generated for a video shot, i.e. a sequence of frames, using features extracted by a CNN (similarly to [9]) and applying a soft attention mechanism to generate a description for the shot as a whole.

Image description-based approach has advantages of being friendly for general-purpose search engines like Google or Yandex. This approach, however, is not efficient for searching by examples as required by MPEG-7 standard. There is an opinion that this approach is most promising as an accompanying technology for broadcasting quality content retrieval tasks.

Search by example is based on video descriptors. In [11] compact descriptors (28 bit) are obtained by layer-wise training of autoencoder, where every layer is RBM. Compact video descriptors based on oriented histograms are defined in MPEG-7 standard as well [3].

Searching by example helps to deal with the fact that current image classifiers typically have a capacity of about \(10^3\) categories, while a reasonable nomenclature of classes suitable for information retrieval amounts to about \(10^4\) categories of common concepts. In addition, typical search requests include named entities such as names of famous people, architectural and natural landmarks, and brand names (e.g. car models). This makes classifiers trained only for pre-set categories infeasible. In [12] the authors describe an elegant method that involves storing HoG features in the image archive index and training exemplar SVM classifiers online on a set of images (around \(10^2\)) provided as a template for the target concept to be found in the archive. This study extends the concept of [12] to video archives, using deep-learnt features instead of HoG.

3 Video Indexing

This section deals with video indexing architecture.

3.1 Features Extraction and Film Segmenting

The GoogLeNet network structure [6] is used in this work as the primary source of semantic features. We claim that a single CNN evaluation per frame is enough to build a powerful video indexing and retrieval system. For our experiments we use the already trained model and the image pre-processing protocol described in [6].

The first step of the video processing pipeline includes feature extraction and segmenting the film into shots (see Algorithm 1). The algorithm operates on a sub-sampled sequence of movie frames. The sub-sampling period S is chosen as a tradeoff between accuracy and speed, and the value of 320 ms (every 8th frame at 25 frames per second) is considered to be optimal.

Step 3 applies the GetFeatureVector function to the frame to obtain the feature vector that is used throughout all further indexing and searching operations. This function includes pre-processing (re-scaling the image to 256 × 256 BGR and selecting a single central crop of 224 × 224) followed by the CNN calculation. The function returns the output of the last average-pooling layer of the network [6], which has dimension 1024. In practice, several frames are packed into mini-batches and the calculations are run in GPU batch mode using the Caffe library [13] to speed up computations.

Step 5 calculates the distance between the previous and the current feature vectors. We use the squared Euclidean distance; other choices, e.g. the cosine distance, are also possible. Figure 1 shows a typical plot of distance values vs. frame number.
Fig. 1. Distance between the feature vectors of neighboring frames; red dots indicate shot boundaries detected by Algorithm 1

Intuitively, since the feature vector in a CNN is the input to the softmax classifier and contains semantic information about the frame, frames with similar content are expected to have close feature vectors. As a shot in a video is a sequence of frames filmed from a single camera view, it normally contains similar objects and background in all frames. Consequently, a shot boundary occurs where the frame content differs dramatically from the previous shot and the feature vectors differ substantially. In cinema, shot boundaries are often softened with dissolve or fade effects. However, CNNs have been shown to be robust to the illumination conditions of images, so fade effects are usually handled well. Dissolve effects, where objects from the previous shot are blended with objects from the new shot, produce spikes in the plot similar to those in Fig. 1 and are easy to filter out.

Filtering is performed at step 6. A simple low-pass filter is used, e.g. the convolution of a 4-element window of the latest distance values with the vector [0.1, 0.1, 0.1, 0.99]. Step 7 checks whether the resulting distance value exceeds the threshold value and, if it does, adds the frame number to the shot boundary list.
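The distance, filtering and thresholding steps described above can be sketched as follows. This is a minimal illustration rather than the authors' Algorithm 1: the function names, the ordering of the filter weights (oldest distance first, latest distance weighted 0.99) and the threshold value used in the usage note are assumptions made for the example.

```python
from collections import deque

# Filter weights quoted in the text; applying them oldest-first is an assumption.
FILTER_WEIGHTS = [0.1, 0.1, 0.1, 0.99]


def squared_euclidean(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def detect_shot_boundaries(features, threshold):
    """Return indices of sub-sampled frames at which a new shot starts.

    features  -- list of per-frame feature vectors (lists of floats)
    threshold -- hypothetical cut-off for the filtered distance
    """
    size = len(FILTER_WEIGHTS)
    window = deque([0.0] * size, maxlen=size)  # last four distance values
    boundaries = []
    for i in range(1, len(features)):
        # Distance between the previous and the current feature vector.
        window.append(squared_euclidean(features[i - 1], features[i]))
        # Weighted sum over the window (convolution with the filter vector).
        filtered = sum(w * d for w, d in zip(FILTER_WEIGHTS, window))
        if filtered > threshold:
            boundaries.append(i)
    return boundaries
```

For example, for five identical frames followed by five different identical frames, `detect_shot_boundaries([[0.0, 0.0]] * 5 + [[3.0, 4.0]] * 5, threshold=10.0)` reports a single boundary at index 5.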

Sample results of shot detection are shown in Fig. 2. In order to evaluate this algorithm, shot boundaries are compared to I-Frame positions in an MPEG-4 encoded movie. I-Frames are used by the MPEG-4 codec as base frames that are encoded without reference to other frames, while the following frames are encoded as differences from the latest I-Frame. Thus I-Frames are good candidates for shot boundaries, because they are specifically inserted into the video stream when the scene changes dramatically and difference encoding is no longer feasible.
Fig. 2. Example shots detected in “The great Serengeti”, National Geographic, 2011

We obtained a precision of 0.935 and a recall of 0.860, taking MPEG-4 I-Frame positions as ground truth. A detected shot boundary is counted as true if its index is within 5 frames of a ground truth position.
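The evaluation rule above can be illustrated with a short sketch. It assumes that "within 5 frames" means an absolute index difference of at most 5 and that each ground-truth position may be matched by at most one detection; neither detail is stated in the text, and the function name is hypothetical.

```python
def boundary_precision_recall(detected, ground_truth, tolerance=5):
    """Precision and recall of detected shot boundaries.

    A detection is a true positive if an unused ground-truth position
    lies within `tolerance` frames of it (one-to-one matching, assumed).
    """
    unmatched = sorted(ground_truth)
    true_positives = 0
    for d in sorted(detected):
        hit = next((g for g in unmatched if abs(g - d) <= tolerance), None)
        if hit is not None:
            unmatched.remove(hit)  # each ground-truth frame is used once
            true_positives += 1
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

With detections at frames 10, 52 and 200 and ground-truth positions 12, 50, 120 and 300, two detections match, giving a precision of 2/3 and a recall of 0.5.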

Relatively low recall is explained by the fact that I-Frames are inserted by MPEG-4 codec in order to minimize reconstruction error in video stream. Therefore it may insert numerous semantically similar key frames having just a small visual difference. Algorithm 1 considers the shot by its semantic contents and produces fewer shot boundaries.

As a by-product of Algorithm 1, feature vectors and classification vectors (the CNN output) are stored for every frame in a distributed key-value store (Apache Cassandra). This is the only time when CNN calculations are applied. Technically, this means that from this point on there is no need for a GPU for efficient video indexing and retrieval. The remaining operations may be performed on an inexpensive cluster or cloud-based infrastructure with CPU-only server nodes.

3.2 Graph-Oriented Indexing

Here a graph structure for the video index is introduced. This is partially due to the fact that the trained model [6] that we use predicts categories within the ImageNet contest framework [7], which uses the WordNet [14] lexical database. This lexical database is essentially a graph representation of words (synsets) connected by linguistic relations such as “hypernym”, “part holonym” etc. This opens wide possibilities for video retrieval by description, e.g. by a query for videos containing an object that belongs to some general category. The WordNet lexical database is represented by the graph
$$ G_{\text{WORDNET}} = \left( N_{\text{NOUNS}}, E_{\text{LEXICAL\_RELATIONS}} \right) \tag{1} $$

where N denotes nodes and E denotes edges.

The main unit of graph-based representation of videos is a shot. As it is shown below, CNN classifier more accurately categorizes shots than single frames. From the user’s experience point of view, retrieving shots is natural in case of video searching.

In our experiments a CNN trained for the ILSVRC2014 competition is used [7]. It was trained on 1000 categories, the majority of which are species of flowers, dogs and other animals. This distribution is biased with respect to what may be expected in common videos. Therefore we chose the BBC Natural World (2006) series of 102 movies, each approx. 45 min long, for evaluating the proposed system. For practical use it will be enough to train the classifier on common objects in order to remove this bias towards natural history.

Our evaluation of per-frame classification by top-5 score using 1056 random video frames labeled manually yielded the accuracy 0.36 ± 0.11. This is much lower than 0.93 that was reported in [6] but of course this is due to the fact that ImageNet dataset is closed, which means that every ImageNet image does have correct tags belonging to 1000 categories known to the classifier.

We then performed temporal pooling of the classification vectors. Specifically, min(10, <number of frames in the shot>) classification vectors were pooled, and the accuracy of average pooling and max pooling was compared. The difference between the pooling methods was negligible, and the accuracy rose to 0.46 ± 0.23. Further taking the WordNet lexical hierarchy into account, and treating a prediction that shares a one-step hypernym with the true category as correct (e.g. the CNN predicted cheetah while the true category is leopard, and both share the hypernym big_cat), results in an accuracy of 0.53 ± 0.23. Therefore we chose to index shots by average pooling of the classification vectors and to provide an option for retrieving shots using the hypernym of the queried keyword.
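A minimal sketch of the pooling and of the hypernym-tolerant correctness check described above is given below. It assumes that the first min(10, n) frames of the shot are pooled and that the hypernym relation is available as a plain dictionary; the function names and the toy labels are illustrative only.

```python
def pool_classification_vectors(frame_scores, mode="average", max_frames=10):
    """Pool per-frame classification vectors of one shot into one vector.

    Uses the first min(max_frames, n) frames (which frames are pooled is
    an assumption). `mode` is "average" or "max".
    """
    k = min(max_frames, len(frame_scores))
    columns = list(zip(*frame_scores[:k]))  # one column per category
    if mode == "average":
        return [sum(c) / k for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    raise ValueError("mode must be 'average' or 'max'")


def top_n(scores, labels, n=5):
    """Labels of the n highest-scoring categories."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in ranked[:n]]


def is_correct(predicted_labels, true_label, hypernym_of):
    """True on an exact match or when a prediction shares the one-step
    hypernym of the true label (e.g. cheetah vs. leopard via big_cat)."""
    true_parent = hypernym_of.get(true_label)
    for label in predicted_labels:
        if label == true_label:
            return True
        if true_parent is not None and hypernym_of.get(label) == true_parent:
            return True
    return False
```

For instance, with labels `["cheetah", "leopard", "zebra"]` and a two-frame shot scored `[[0.5, 0.25, 0.25], [0.75, 0.125, 0.125]]`, average pooling ranks cheetah first; this counts as correct for a shot labeled leopard once both are mapped to big_cat.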

Section 3.1 simplistically presents video processing as classifying every frame with a single CNN. In reality it is possible to apply numerous classifiers, e.g. a place classifier, a face detector and classifier, and a salient object detector and classifier. If all these classifiers are applied, it is possible to obtain numerous tags for a frame. Moreover, these tags may also have structure: e.g. after detecting two salient objects, the spatial relationship between them becomes obvious, i.e. which object is above or to the right of the other. Therefore it becomes natural to represent a film as a graph:
$$ \begin{aligned} G_{\text{FILM}} &= (N, E), \\ N &= \{ N_{\text{SHOTS}}, N_{\text{TAGS}} \}, \\ E &= \{ E_{\text{CATEGORIES}}, E_{\text{PLACES}}, E_{\text{FACES}}, E_{\text{SALIENT\_OBJ}}, E_{\text{SPATIAL}} \} \end{aligned} \tag{2} $$

It is clear that \( G_{\text{FILM}} \) can be linked with \( G_{\text{WORDNET}} \) by matching \( N_{\text{TAGS}} \) with \( N_{\text{NOUNS}} \). Figure 3 illustrates a possible graph representation of a film comprising two shots.

Fig. 3. Graph representation of a film

Neo4j graph-oriented database was used for video index due to its excellent implementation of Cypher query language. The expressive querying of Cypher is inspired by a number of different approaches and established practices from SQL, SPARQL, Haskell and Python. Its pattern matching syntax looks like ASCII art for graphs, which will be shown in Sect. 4.

4 Video Retrieval

This section deals with an implementation of video retrieval modes required by MPEG-7 standard.

4.1 Searching by Structured Queries

Basic keywords-based search in our graph index can be implemented with Cypher statement (3). It accounts for minimum confidence level of shot tags, and sorts the search results by shot duration descending.

Basic Cypher syntax denotes graph nodes in round brackets and edges in square brackets. Thus query (3) matches nodes of type Shot (\( N_{\text{SHOTS}} \), see (2)) linked to nodes of type Wordnet (\( N_{\text{NOUNS}} \), see (1)) having synset zebra by an edge of type Category with weight greater than 0.1. The edge type Category corresponds to \( E_{\text{CATEGORIES}} \) in (2). Neo4j provides indexing by node and edge attributes, so the performance of this query is quite good. In our test archive of 99,505 shots, the query with an additional LIMIT/SKIP clause took approx. 40 ms. In the performed tests, the average precision of queries by 40 random keywords from the ImageNet contest category nomenclature was 0.84 ± 0.25. We could not afford to evaluate recall because of a lack of labeled video content but, from the user’s point of view, precision is more important in information retrieval: when a user searches for zebra, she definitely does not want to see fish in the search results (the authors are aware of the existence of zebrafish).
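A keyword query of the kind described above could look roughly like the Cypher text assembled by the sketch below. The node labels Shot and Wordnet, the edge type Category, the 0.1 weight threshold and the ordering by shot duration follow the description in the text; the property names `synset`, `weight` and `duration`, the edge direction and the use of `$` parameters are assumptions for illustration and are not taken from the authors' schema. The helper only builds the query text and its parameters and does not talk to a database.

```python
def build_keyword_query(synset, min_weight=0.1, skip=0, limit=10):
    """Return (cypher_text, parameters) for a keyword search.

    Property names (synset, weight, duration) and the edge direction
    are hypothetical; adapt them to the actual index schema.
    """
    cypher = (
        "MATCH (s:Shot)-[c:Category]->(w:Wordnet {synset: $synset}) "
        "WHERE c.weight > $min_weight "
        "RETURN s "
        "ORDER BY s.duration DESC "
        "SKIP $skip LIMIT $limit"
    )
    params = {
        "synset": synset,
        "min_weight": min_weight,
        "skip": skip,
        "limit": limit,
    }
    return cypher, params


if __name__ == "__main__":
    text, params = build_keyword_query("zebra")
    print(text)
    print(params)
```

Passing the values as parameters rather than formatting them into the string keeps the query text constant across keywords, which is a common way to let a database reuse its query plan.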

It is easy to extend (3) to search for combinations of keywords, as well as logical combinations (AND, OR, NOT).

One way to improve recall is to include synonyms and/or hypernyms in the search query. The graph representation of the WordNet lexical database makes this easy in the proposed video index, as in query (4). Here the hypernym of cheetah (which is big_cat) is matched with all shots having a path to the big_cat node. A query of this type limited to 10 results executed in approx. 40 ms in our tests.

It is also possible to build a query that matches video shots having a certain spatial structure. For example, query (5) finds videos that have a lion to the left of a zebra.

4.2 Searching by Sample Video

Video retrieval by sample clip is important in content production (finding footage in archives) and in duplicates finding (for legal purposes and for archives de-duplication). In the proposed setting the sample video is limited to a single shot discussed above, and the goal is to find semantically close shots. This differs from many existing solutions based on e.g. HSV histograms or SIFT/SURF descriptors.

It was discovered that the feature vector \( fv \in \mathbb{R}^{1024} \) extracted in Algorithm 1 contains enough semantic information for retrieving video shots whose content is similar to that of the sample clip. A brute force solution compares the distance between the sample clip feature vector and every other shot’s feature vector with some threshold, and includes the shots with a smaller distance to the sample in the search results. Having compared the Euclidean and cosine distance metrics, we gave preference to the cosine distance (6).
$$ d = 1 - \operatorname{dot}(x, y) \tag{6} $$

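where x is the sample clip feature vector and y is the feature vector of another clip. Expression (6) equals the cosine distance when x and y are normalized to unit length. Both x and y are averaged feature vectors of the first K frames of each shot, \( K = \min(10, N_{\text{FRAMES\_IN\_SHOT}}) \).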
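The shot signature and the distance can be sketched in a few lines. The example assumes that the vectors are scaled to unit length before the dot product is taken, which is what makes 1 − dot(x, y) a cosine distance; whether the stored vectors are normalized beforehand is not stated in the text, and the function names are hypothetical.

```python
import math


def shot_signature(frame_features, max_frames=10):
    """Average the feature vectors of the first K frames of a shot,
    K = min(max_frames, number of frames in the shot)."""
    k = min(max_frames, len(frame_features))
    return [sum(col) / k for col in zip(*frame_features[:k])]


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def cosine_distance(x, y):
    """d = 1 - dot(x, y), computed on unit-length copies of x and y."""
    xn, yn = l2_normalize(x), l2_normalize(y)
    return 1.0 - sum(a * b for a, b in zip(xn, yn))
```

Orthogonal vectors give a distance of 1, and vectors pointing in the same direction give a distance of (approximately) 0 regardless of their lengths.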

In order to improve performance, the WordNet hierarchy is applied to limit the set of shots to check. Namely, one or two hypernyms of the categories of the sample clip are selected by query (4). Only the shots matching this condition go through the vector distance check. Thus the shots having similar lexical content are examined, and the closest ones are selected by feature vector distance. This results in retrieving relevant shots by terms that are hard to formalize, see Fig. 4. Figure 4(a) shows the results of a search by the keyword elephant. From these results the user has chosen a sample shot showing a herd of elephants, a lake and a forest (the last row in Fig. 4(a)). Searching by this sample retrieved a number of shots having just the desired characteristics, proving that one image is better than hundreds of words (see Fig. 4(b)).
Fig. 4. Search by example use case: (a) search results by keyword “elephant”; (b) results of searching by sample clip, the last row of (a); (c) histogram of precision values measured for different keywords

The average precision of search by video sample was 0.86. The precision was evaluated by searching by a keyword and then searching by one of the resulting shots with a cosine distance threshold of 0.3. A human expert counted the true and false positives. 42 keywords were used for this evaluation. Figure 4(c) shows the distribution of precision values measured for different keywords.
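The two-stage search of this section, lexical pruning followed by a distance check, can be sketched in memory as follows. The dictionary-based shot records, the one-step hypernym map and the function names are assumptions for the example; in the described system the pruning is done by a graph query rather than in Python. The 0.3 threshold is the value quoted above.

```python
import math


def cosine_distance(x, y):
    """1 - cosine similarity; returns 1.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm if norm else 1.0


def search_by_sample(sample, shots, hypernym_of, threshold=0.3):
    """Return ids of shots similar to `sample`, closest first.

    sample, shots[i] -- dicts with "categories" (list of labels) and
                        "vector" (averaged feature vector); shots also
                        carry an "id" (record layout is hypothetical)
    hypernym_of      -- maps a category to its one-step hypernym
    """
    wanted = {hypernym_of[c] for c in sample["categories"] if c in hypernym_of}
    results = []
    for shot in shots:
        parents = {hypernym_of[c] for c in shot["categories"] if c in hypernym_of}
        if not wanted & parents:
            continue  # lexically unrelated: skipped without a distance check
        d = cosine_distance(sample["vector"], shot["vector"])
        if d < threshold:
            results.append((d, shot["id"]))
    return [shot_id for _, shot_id in sorted(results)]
```

Only shots that share a hypernym with the sample reach the distance computation, so the cost of the brute force step depends on the size of the pruned candidate set rather than on the whole archive.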

4.3 Searching by Sample Images

In order to extend possibilities for video retrieval beyond the scope of pre-set nomenclature of categories on-line training of linear classifiers over feature vectors extracted by CNN was explored.

In order to train the classifier, around 100 positive samples were obtained by querying an image search engine such as Yandex or Google. For example, a search for steamboat was conducted in Yandex and the first 100 search results were chosen. Every image was scaled to 256 × 256 BGR pixels, and the CNN [6] was applied to both the original and the horizontally flipped central patch of 224 × 224 px. This resulted in 200 feature vectors taken from the output of the layer “pool5/7 × 7_s1”.

For negative samples, 25,000 shots were randomly selected from the test archive, and the feature vectors of the first K frames of each shot were averaged, \( K = \min(10, N_{\text{FRAMES\_IN\_SHOT}}) \).

Positive and negative samples were randomly shuffled for online training, and Vowpal Wabbit [15] was applied to train a logistic regression classifier. The following parameters differed from their default values: positive sample weight 200, number of epochs 3, learning rate 0.5. Training took less than a second on a standard Intel-based PC.

A brute force solution involves applying the trained classifier to every shot’s feature vector, and including the shots having positive classification into the search results.
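The training and brute force classification steps can be illustrated with a small class-weighted logistic regression trained by plain stochastic gradient descent. This is a stand-in written for the example, not Vowpal Wabbit and not its update rule; it omits the bias term for brevity, and the function names and the fixed random seed are assumptions. The default weight, epoch count and learning rate are the values quoted above.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(samples, epochs=3, learning_rate=0.5,
                   positive_weight=200.0, seed=0):
    """Train weights on (vector, label) pairs, label 1 or 0.

    Positive samples are up-weighted to compensate for the much larger
    number of negatives. No bias term is used (a simplification).
    """
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0][0])
    order = list(samples)
    for _ in range(epochs):
        rng.shuffle(order)  # random order for online training
        for x, y in order:
            p = sigmoid(sum(w * v for w, v in zip(weights, x)))
            importance = positive_weight if y == 1 else 1.0
            step = learning_rate * importance * (y - p)
            weights = [w + step * v for w, v in zip(weights, x)]
    return weights


def predict(weights, x):
    """True if the shot vector x is classified as positive."""
    return sigmoid(sum(w * v for w, v in zip(weights, x))) > 0.5
```

Applying `predict` to the averaged feature vector of every shot and keeping the positives corresponds to the brute force search described above.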

Average precision of search by sample images was 0.64. The precision was evaluated by obtaining sample images from Yandex by a random keyword and then searching our test archive by 100 sample images. A human expert performed true/false positives counting. 13 keywords were used for this evaluation. Figure 5(a) shows some of the sample images from Yandex, Fig. 5(b) shows some video shots retrieved from the test archive, Fig. 5(c) shows the distribution of precision values measured by different search requests.
Fig. 5. Search by sample images use case: (a) sample images obtained from Yandex by query “steamboat”; (b) some video clips retrieved from test archive; (c) histogram of precision values by various requests for sample images.

5 Conclusion

This work shows that the feature vector \( fv \in \mathbb{R}^{1024} \) extracted by the CNN [6] contains enough semantic information for segmenting raw video into shots with 0.94 precision, retrieving video shots by keywords with 0.84 precision, retrieving videos by sample video clip with 0.86 precision, and retrieving videos by online learning with 0.64 precision. All that is needed for indexing is a single pass of feature vector extraction and storage in the database. This is the only time when expensive GPU-enabled hardware is needed. All video retrieval operations may run on commodity servers, e.g. in a cloud-based setting.

However, more effort is needed to improve the performance of sample-based video retrieval. While lexical pruning of the search space helps to limit the scope of the brute force algorithm, the algorithm still scales linearly with the amount of data. Future work will explore several approaches to lowering the feature vector dimensionality in order to search in logarithmic time, e.g. random projections and compact binary descriptors, as well as tree-based indexing.


References

  1. Smith, J.R., Basu, S., Lin, C.-Y., Naphade, M., Tseng, B.: Interactive content-based retrieval of video. In: IEEE International Conference on Image Processing, ICIP 2002, September 2002
  2. Bangalore, S.: System and method for digital video retrieval involving speech recognition. US Patent 8487984 (2013)
  3. ISO/IEC 15938-5:2003 Information technology – Multimedia content description interface – Part 5: Multimedia description schemes. International Organization for Standardization, Geneva, Switzerland (2003)
  4. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1106–1114 (2012)
  5. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
  6. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. CoRR, arXiv:1409.4842 (2014)
  7. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. CoRR, arXiv:1409.0575 (2014)
  8. Ng, J.Y.-H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: deep networks for video classification (2015)
  9. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. arXiv preprint arXiv:1412.2306 (2014)
  10. Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., Courville, A.: Describing videos by exploiting temporal structure. arXiv preprint arXiv:1502.08029 (2015)
  11. Krizhevsky, A., Hinton, G.E.: Using very deep autoencoders for content-based image retrieval. In: ESANN (2011)
  12. Malisiewicz, T., Gupta, A., Efros, A.A.: Ensemble of exemplar-SVMs for object detection and beyond. In: ICCV (2011)
  13. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
  14. Princeton University: About WordNet. WordNet, Princeton University (2010)
  15. Langford, J., Li, L., Strehl, A.: Vowpal wabbit online learning project (2007)

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Cinema and Photo Research Institute (NIKFI), Creative Production Association “Gorky Film Studio”, Moscow, Russia
