
Cross Pixel Optical-Flow Similarity for Self-supervised Learning

  • Conference paper
  • In: Computer Vision – ACCV 2018 (ACCV 2018)

Abstract

We propose a novel method for learning convolutional neural image representations without manual supervision. We use motion cues, in the form of optical flow, to supervise representations of static images. The obvious approach of training a network to predict flow from a single image can be needlessly difficult due to intrinsic ambiguities in this prediction task. We instead propose a much simpler learning goal: embed pixels such that the similarity between their embeddings matches that between their optical-flow vectors. At test time, the learned deep network can be used without access to video or flow information and transferred to tasks such as image classification, detection, and segmentation. Our method, which significantly simplifies previous attempts at using motion for self-supervision, achieves state-of-the-art results among self-supervised methods that use motion cues, and is overall state of the art in self-supervised pre-training for semantic image segmentation, as demonstrated on standard benchmarks.
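The learning goal above — embed pixels so that the similarity structure of the embeddings matches that of their optical-flow vectors — can be illustrated with a minimal NumPy sketch. The cosine kernel for embeddings, the negative-squared-distance kernel for flow, and the row-wise cross-entropy between the two similarity distributions are illustrative assumptions; the paper's exact kernels, pixel sampling, and normalization are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_pixel_similarity_loss(embeddings, flow):
    """Match embedding similarities to flow similarities (sketch).

    embeddings: (N, D) array of per-pixel embeddings from the network.
    flow:       (N, 2) array of optical-flow vectors at the same pixels.
    """
    # Pairwise cosine similarity between pixel embeddings.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim_emb = e @ e.T
    # Pairwise flow similarity: negative squared Euclidean distance.
    d = flow[:, None, :] - flow[None, :, :]
    sim_flow = -np.sum(d * d, axis=-1)
    # Turn each row into a distribution and align them with cross-entropy:
    # the flow distribution is the target, the embedding one the prediction.
    p = softmax(sim_flow)
    q = softmax(sim_emb)
    return float(-np.sum(p * np.log(q + 1e-12)) / len(p))
```

At test time no flow is needed: only the embedding network is kept and transferred to the downstream task.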


Notes

  1. Optical flow is stored in fixed-point 16-bit PNG files, similar to KITTI [18], for compression.

  2. Full model: ‘pool1’–‘pool5’ and ‘fc7’ (projected to 256 channels using a \(1\times 1\) convolution for faster training) constitute the hypercolumn head for pre-training on the dataset (Sect. 4.2).
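The KITTI-style flow storage mentioned in note 1 maps each flow component to an unsigned 16-bit integer. A hedged round-trip sketch follows: the value = flow × 64 + 2¹⁵ convention is taken from the KITTI development kit, and the validity channel KITTI also stores is omitted here.

```python
import numpy as np

def encode_flow(flow):
    """Map float flow (pixels) to uint16, KITTI-style: ~1/64 px precision."""
    return np.clip(flow * 64.0 + 2**15, 0, 2**16 - 1).astype(np.uint16)

def decode_flow(raw):
    """Invert encode_flow: recover float flow from the stored uint16 values."""
    return (raw.astype(np.float32) - 2**15) / 64.0
```

The fixed-point encoding keeps the files small while bounding the quantization error at 1/64 of a pixel per component.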
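The hypercolumn head in note 2 follows the idea of Hariharan et al. [22]: concatenate activations from several layers at each pixel after upsampling them to a common resolution. A minimal nearest-neighbour sketch is below; the upsampling method and the toy feature maps are assumptions, not the paper's exact configuration.

```python
import numpy as np

def upsample_nn(feat, size):
    """Nearest-neighbour upsample a (C, h, w) feature map to (C, H, W)."""
    H, W = size
    _, h, w = feat.shape
    rows = np.arange(H) * h // H   # source row for each target row
    cols = np.arange(W) * w // W   # source column for each target column
    return feat[:, rows][:, :, cols]

def hypercolumns(feature_maps, size):
    """Stack upsampled feature maps channel-wise into per-pixel hypercolumns."""
    return np.concatenate([upsample_nn(f, size) for f in feature_maps], axis=0)
```

Each output pixel then carries a single vector that mixes coarse, semantic features from deep layers with fine, localized ones from early layers.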

References

  1. Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)

  2. Agrawal, P., Carreira, J., Malik, J.: Learning to see by moving. In: ICCV (2015)

  3. Arandjelović, R., Zisserman, A.: Look, listen and learn. In: ICCV (2017)

  4. Bansal, A., Chen, X., Russell, B., Gupta, A., Ramanan, D.: PixelNet: representation of the pixels, by the pixels, and for the pixels. arXiv:1702.06506 (2017)

  5. Bojanowski, P., Joulin, A.: Unsupervised learning by predicting noise. In: ICML (2017)

  6. Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7577, pp. 611–625. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33783-3_44

  7. Cristianini, N., et al.: An Introduction to Support Vector Machines. CUP, Cambridge (2000)

  8. Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: ICCV (2015)

  9. Doersch, C., et al.: Multi-task self-supervised visual learning. In: ICCV (2017)

  10. Donahue, J., et al.: Adversarial feature learning. In: ICLR (2017)

  11. Dosovitskiy, A., et al.: FlowNet: learning optical flow with convolutional networks. In: ICCV (2015)

  12. Dosovitskiy, A., et al.: Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE PAMI 38(9), 1734–1747 (2016)

  13. Everingham, M., et al.: The PASCAL visual object classes challenge 2007 results (2007)

  14. Everingham, M., et al.: The PASCAL visual object classes challenge 2012 results (2012)

  15. Faktor, A., Irani, M.: Video segmentation by non-local consensus voting. In: BMVC (2014)

  16. Gan, C., Gong, B., Liu, K., Su, H., Guibas, L.J.: Geometry guided convolutional neural networks for self-supervised video representation learning. In: CVPR (2018)

  17. Gao, R., Jayaraman, D., Grauman, K.: Object-centric representation learning from unlabeled videos. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10115, pp. 248–263. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54193-8_16

  18. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: CVPR (2012)

  19. Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. In: ICLR (2018)

  20. Girshick, R.B.: Fast R-CNN. In: ICCV (2015)

  21. Hariharan, B., et al.: Semantic contours from inverse detectors. In: ICCV (2011)

  22. Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: CVPR, pp. 447–456 (2015)

  23. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML (2015)

  24. Isola, P., Zoran, D., Krishnan, D., Adelson, E.H.: Learning visual groups from co-occurrences in space and time. In: ICLR Workshop (2015)

  25. Jayaraman, D., Grauman, K.: Slow and steady feature analysis: higher order temporal coherence in video. In: CVPR, pp. 3852–3861 (2016)

  26. Jayaraman, D., et al.: Learning image representations tied to ego-motion. In: ICCV (2015)

  27. Jenni, S., Favaro, P.: Self-supervised feature learning by learning to spot artifacts. In: CVPR (2018)

  28. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv:1412.6980 (2014)

  29. Krähenbühl, P., et al.: Data-dependent initializations of convolutional neural networks. In: ICLR (2016)

  30. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1106–1114 (2012)

  31. Larsson, G., Maire, M., Shakhnarovich, G.: Colorization as a proxy task for visual understanding. In: CVPR (2017)

  32. Lee, H.Y., Huang, J.B., Singh, M.K., Yang, M.H.: Unsupervised representation learning by sorting sequences. In: ICCV (2017)

  33. Liu, C.: Beyond pixels: exploring new representations and applications for motion analysis. Ph.D. thesis, Massachusetts Institute of Technology, USA (2009)

  34. Mahendran, A., Vedaldi, A.: Visualizing deep convolutional neural networks using natural pre-images. IJCV 120, 1–23 (2016)

  35. Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: unsupervised learning using temporal order verification. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 527–544. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_32

  36. Mundhenk, T., Ho, D., Chen, B.Y.: Improvements to context based self-supervised learning. In: CVPR (2017)

  37. Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving Jigsaw puzzles. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 69–84. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_5

  38. Noroozi, M., Vinjimoor, A., Favaro, P., Pirsiavash, H.: Boosting self-supervised learning via knowledge transfer. In: CVPR (2018)

  39. Noroozi, M., et al.: Representation learning by learning to count. In: ICCV (2017)

  40. Owens, A., Wu, J., McDermott, J.H., Freeman, W.T., Torralba, A.: Ambient sound provides supervision for visual learning. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 801–816. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_48

  41. Pathak, D., et al.: Context encoders: feature learning by inpainting. In: CVPR (2016)

  42. Pathak, D., et al.: Learning features by watching objects move. In: CVPR (2017)

  43. Prest, A., et al.: Learning object class detectors from weakly annotated video. In: CVPR (2012)

  44. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)

  45. Ren, Z., Lee, Y.J.: Cross-domain self-supervised multi-task feature learning using synthetic imagery. In: CVPR (2018)

  46. Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: EpicFlow: edge-preserving interpolation of correspondences for optical flow. In: CVPR (2015)

  47. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV 115, 211–252 (2015)

  48. de Sa, V.R.: Learning classification with unlabeled data. In: NIPS, pp. 112–119 (1994)

  49. Sermanet, P., et al.: Time-contrastive networks: self-supervised learning from video. In: ICRA (2018)

  50. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)

  51. Srivastava, N., Mansimov, E., Salakhudinov, R.: Unsupervised learning of video representations using LSTMs. In: ICML (2015)

  52. Thomee, B., et al.: YFCC100M: the new data in multimedia research. ACM (2016)

  53. Todorovic, D.: Gestalt principles. Scholarpedia 3(12), 5345 (2008). revision #91314

  54. Walker, J.: Data-driven visual forecasting. Ph.D. thesis, Carnegie Mellon University (2018)

  55. Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. In: ICCV, pp. 2794–2802 (2015)

  56. Wang, X., He, K., Gupta, A.: Transitive invariance for self-supervised visual representation learning. In: ICCV (2017)

  57. Wei, D., et al.: Learning and using the arrow of time. In: CVPR, pp. 8052–8060 (2018)

  58. Weinzaepfel, P., Revaud, J., Harchaoui, Z., Schmid, C.: DeepFlow: large displacement optical flow with deep matching. In: ICCV, pp. 1385–1392 (2013)

  59. Xue, T., Wu, J., Bouman, K.L., Freeman, W.T.: Visual dynamics: stochastic future generation via layered cross convolutional networks. IEEE PAMI (2018). https://doi.org/10.1109/TPAMI.2018.2854726

  60. Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 649–666. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_40

  61. Zhang, R., Isola, P., Efros, A.A.: Split-brain autoencoders: unsupervised learning by cross-channel prediction. In: CVPR (2017)


Acknowledgements

The authors gratefully acknowledge ERC IDIU, the AIMS CDT (EPSRC EP/L015897/1), and the AWS Cloud Credits for Research program. The authors thank Ankush Gupta and David Novotný for helpful discussions, and Christian Rupprecht, Fatma Guney and Ruth Fong for proofreading the paper. We thank Deepak Pathak for help with reproducing some of the results from [42].

Author information

Correspondence to Aravindh Mahendran.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 7643 KB)


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Mahendran, A., Thewlis, J., Vedaldi, A. (2019). Cross Pixel Optical-Flow Similarity for Self-supervised Learning. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science, vol 11365. Springer, Cham. https://doi.org/10.1007/978-3-030-20873-8_7


  • DOI: https://doi.org/10.1007/978-3-030-20873-8_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20872-1

  • Online ISBN: 978-3-030-20873-8

  • eBook Packages: Computer Science (R0)
