Rebuilding Visual Vocabulary via Spatial-temporal Context Similarity for Video Retrieval
The Bag-of-visual-Words (BovW) model is one of the most popular visual content representation methods for large-scale content-based video retrieval. Visual words are quantized according to a visual vocabulary, which is generated by clustering visual features (e.g., with K-means or a GMM). In principle, two types of errors can occur in the quantization process, referred to as the UnderQuantize and OverQuantize problems. The former causes ambiguities and often leads to false visual content matches, while the latter generates synonyms and may lead to missed true matches. Unlike most state-of-the-art research, which concentrates on enhancing the BovW model by disambiguating the visual words, in this paper we aim to address the OverQuantize problem by incorporating the similarity of the spatial-temporal contexts associated with pairs of visual words. Visual words with similar context and appearance are assumed to be synonyms. These synonyms in the initial visual vocabulary are then merged to rebuild a more compact and descriptive vocabulary. Our approach was evaluated on the TRECVID2002 and CC_WEB_VIDEO datasets for two typical Query-By-Example (QBE) video retrieval applications. Experimental results demonstrate substantial improvements in retrieval performance over the initial visual vocabulary generated by the BovW model. We also show that our approach can be combined with a state-of-the-art disambiguation method to further improve QBE video retrieval performance.
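The core idea of merging synonymous visual words can be sketched as follows. This is a minimal illustration, not the paper's exact method: the thresholds, the cosine-similarity appearance cue, and the precomputed `context_sim` matrix are all assumptions standing in for the spatial-temporal context similarity the abstract describes; a union-find structure groups mutually similar words into one merged vocabulary entry.

```python
import numpy as np

def merge_synonyms(centroids, context_sim,
                   appearance_thresh=0.9, context_thresh=0.8):
    """Merge visual words (cluster centroids) whose appearance similarity
    and context similarity both exceed the given thresholds.

    centroids   : (k, d) array of visual-word centroids
    context_sim : (k, k) array of pairwise context similarities (assumed given)
    Returns a list mapping each original word id to its merged word id.
    """
    k = len(centroids)
    parent = list(range(k))

    def find(i):
        # Union-find with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Cosine similarity between centroids serves as the appearance cue here.
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    appearance_sim = unit @ unit.T

    for i in range(k):
        for j in range(i + 1, k):
            if (appearance_sim[i, j] >= appearance_thresh
                    and context_sim[i, j] >= context_thresh):
                parent[find(j)] = find(i)  # treat words i and j as synonyms

    return [find(i) for i in range(k)]
```

For example, two centroids pointing in nearly the same direction with highly similar contexts collapse into one word, while a dissimilar word keeps its own id; histograms are then re-binned over the merged, more compact vocabulary.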
Keywords: Visual Vocabulary, Synonyms, Spatial-Temporal Context, Content-based Video Retrieval, Bag-of-visual-Words