Learning Image Matching by Simply Watching Video

  • Gucan Long
  • Laurent Kneip
  • Jose M. Alvarez
  • Hongdong Li
  • Xiaohu Zhang
  • Qifeng Yu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9910)


This work presents an unsupervised, learning-based approach to the ubiquitous computer vision problem of image matching. We start from the insight that frame interpolation implicitly solves for inter-frame correspondences. This permits an analysis-by-synthesis strategy: we first train and apply a Convolutional Neural Network for frame interpolation, then obtain correspondences by inverting the learned CNN. The key benefit of this strategy is that the interpolation CNN can be trained in an unsupervised manner, by exploiting the temporal coherence naturally present in real-world video sequences. The model therefore learns image matching by simply “watching videos”. Besides promising broader applicability, the presented approach achieves surprisingly strong performance, comparable to traditional empirically designed methods.
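The inversion step can be illustrated with a toy, hypothetical stand-in for the trained network (this is not the authors' architecture): a fixed `interpolate` function synthesizes a middle frame from two 1-D "frames" assuming 2-pixel motion, and `match_pixel` recovers a correspondence by probing which input pixel the synthesized output is sensitive to — a finite-difference sketch of the gradient-based CNN inversion the paper describes.

```python
def interpolate(f1, f2):
    """Toy frame interpolator: the middle frame is the average of
    f1 shifted right by one pixel and f2 shifted left by one pixel
    (i.e. it assumes a uniform 2-pixel motion between frames)."""
    n = len(f1)
    mid = [0.0] * n
    for j in range(n):
        a = f1[j - 1] if j - 1 >= 0 else 0.0
        b = f2[j + 1] if j + 1 < n else 0.0
        mid[j] = 0.5 * (a + b)
    return mid

def match_pixel(f1, f2, j, eps=1e-4):
    """Invert the interpolator at output pixel j: perturb each input
    pixel of f1 in turn and return the index whose perturbation most
    affects mid[j] -- the matched location in frame 1."""
    base = interpolate(f1, f2)[j]
    sensitivities = []
    for i in range(len(f1)):
        f1_perturbed = list(f1)
        f1_perturbed[i] += eps
        sensitivities.append(abs(interpolate(f1_perturbed, f2)[j] - base))
    return max(range(len(f1)), key=sensitivities.__getitem__)

f1 = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]   # a bright spot at index 1
f2 = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]   # the same spot, moved to index 3
# In the synthesized middle frame the spot sits at index 2; inverting
# the interpolator there points back to index 1 in frame 1.
print(match_pixel(f1, f2, j=2))        # -> 1
```

In the paper this sensitivity is obtained by backpropagating through the trained CNN rather than by finite differences, but the principle is the same: the synthesis model is queried in reverse to reveal the correspondences it learned implicitly.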


Keywords: Image matching · Unsupervised learning · Analysis by synthesis · Temporal coherence · Convolutional neural network



L. Kneip’s research is funded by ARC DECRA grant DE150101365. The research of L. Kneip and H. Li is also funded by the ARC Centre of Excellence for Robotic Vision CE140100016. All authors gratefully acknowledge the support of the NVIDIA Corporation through the donation of Tesla K40 GPUs. G. Long would like to give special thanks to Yuchao Dai, Stephen Gould, and Anoop Cherian for valuable discussions and feedback.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Gucan Long (1, 2)
  • Laurent Kneip (2, 3)
  • Jose M. Alvarez (2, 4)
  • Hongdong Li (2, 3)
  • Xiaohu Zhang (1)
  • Qifeng Yu (1)

  1. National University of Defense Technology, Changsha, People’s Republic of China
  2. Australian National University, Canberra, Australia
  3. Australian Centre of Excellence for Robotic Vision, Canberra, Australia
  4. Data61, CSIRO, Canberra, Australia