International Journal of Computer Vision, Volume 119, Issue 3, pp 291–306

Circulant Temporal Encoding for Video Retrieval and Temporal Alignment

  • Matthijs Douze
  • Jérôme Revaud
  • Jakob Verbeek
  • Hervé Jégou
  • Cordelia Schmid


We address the problem of specific video event retrieval. Given a query video of a specific event, e.g., a Madonna concert, the goal is to retrieve other videos of the same event that temporally overlap with the query. Our approach encodes the frame descriptors of a video to jointly represent their appearance and temporal order. It exploits the properties of circulant matrices to efficiently compare the videos in the frequency domain. This significantly reduces the computational complexity of the comparison and accurately localizes the matching parts of videos. The descriptors can be compressed in the frequency domain with a product quantizer adapted to complex numbers. In this case, video retrieval is performed without decompressing the descriptors. We also consider the temporal alignment of a set of videos. We exploit the matching confidence and an estimate of the temporal offset computed for all pairs of videos by our retrieval approach. Our robust algorithm aligns the videos on a global timeline by maximizing the set of temporally consistent matches. The global temporal alignment enables synchronous playback of the videos of a given scene.
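The frequency-domain comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows the generic cross-correlation identity that makes it possible to score all temporal offsets between two videos with a few FFTs instead of one dot product per offset. The function names `temporal_match_scores` and `best_offset` are hypothetical, frame descriptors are assumed to be rows of real-valued NumPy arrays, and the sketch leaves out any regularization or product-quantizer compression.

```python
import numpy as np


def temporal_match_scores(query, base):
    """Score every temporal shift between two frame-descriptor sequences.

    query: (n, d) array, one d-dimensional descriptor per frame.
    base:  (m, d) array with the same descriptor dimension d.

    Returns an array `scores` of length N = n + m such that
    scores[delta] = sum_t <query[t], base[t + delta]>, with indices taken
    modulo N and both sequences zero-padded to length N.
    """
    n, d = query.shape
    m, d_base = base.shape
    if d != d_base:
        raise ValueError("descriptor dimensions differ")
    # Zero-padding to n + m keeps positive and negative shifts from mixing.
    N = n + m
    Q = np.fft.rfft(query, n=N, axis=0)
    B = np.fft.rfft(base, n=N, axis=0)
    # Cross-correlation theorem, summed over the d descriptor dimensions.
    cross = (np.conj(Q) * B).sum(axis=1)
    return np.fft.irfft(cross, n=N)


def best_offset(query, base):
    """Return (delta, score): the shift that maximizes the match score.

    delta >= 0 means query[t] aligns with base[t + delta];
    delta < 0 means the query starts before the base video.
    """
    scores = temporal_match_scores(query, base)
    delta = int(np.argmax(scores))
    score = float(scores[delta])
    # Indices at or beyond len(base) encode negative shifts (wrap-around).
    if delta >= base.shape[0]:
        delta -= scores.shape[0]
    return delta, score
```

Calling `best_offset(query, base)` on a query cut from frames 17 to 36 of `base` should return a shift of 17 when the descriptors are distinctive enough. The cost is a handful of FFTs of length n + m per descriptor dimension, rather than one pass over the overlapping frames for each candidate offset.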


Video retrieval, Video synchronization, Fourier transform



We are grateful to the team members who participated as cameramen or actors in the shooting of the Climbing dataset. This work was supported by the European integrated project AXES, the MSR/INRIA joint project and the ERC advanced grant ALLEGRO.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Matthijs Douze (1, 2)
  • Jérôme Revaud (1, 3)
  • Jakob Verbeek (1)
  • Hervé Jégou (2, 4)
  • Cordelia Schmid (1)

  1. INRIA Grenoble, Montbonnot-Saint-Martin, France
  2. Facebook Artificial Intelligence Research, Paris, France
  3. Xerox Research Centre Europe, Meylan, France
  4. INRIA Rennes, Rennes, France
