
Cross Pixel Optical-Flow Similarity for Self-supervised Learning

  • Conference paper
  • In: Computer Vision – ACCV 2018 (ACCV 2018)

Abstract

We propose a novel method for learning convolutional neural image representations without manual supervision. We use motion cues, in the form of optical flow, to supervise representations of static images. The obvious approach of training a network to predict flow from a single image can be needlessly difficult due to intrinsic ambiguities in this prediction task. We instead propose a much simpler learning goal: embed pixels such that the similarity between their embeddings matches that between their optical-flow vectors. At test time, the learned deep network can be used without access to video or flow information and transferred to tasks such as image classification, detection, and segmentation. Our method, which significantly simplifies previous attempts at using motion for self-supervision, achieves state-of-the-art results among self-supervised methods that use motion cues, and is overall state of the art in self-supervised pre-training for semantic image segmentation, as demonstrated on standard benchmarks.
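The learning goal above — embed pixels so that the similarity structure of the embeddings matches that of their optical-flow vectors — can be illustrated with a minimal NumPy sketch. The cosine kernel for embeddings, the negative-squared-distance kernel for flow, and the row-wise cross-entropy between the two similarity distributions are illustrative assumptions; the paper's exact kernels, pixel sampling, and normalization are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_pixel_similarity_loss(embeddings, flow):
    """Match embedding similarities to flow similarities (sketch).

    embeddings: (N, D) array of per-pixel embeddings from the network.
    flow:       (N, 2) array of optical-flow vectors at the same pixels.
    """
    # Pairwise cosine similarity between pixel embeddings.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim_emb = e @ e.T
    # Pairwise flow similarity: negative squared Euclidean distance.
    d = flow[:, None, :] - flow[None, :, :]
    sim_flow = -np.sum(d * d, axis=-1)
    # Turn each row into a distribution and align them with cross-entropy:
    # the flow distribution is the target, the embedding one the prediction.
    p = softmax(sim_flow)
    q = softmax(sim_emb)
    return float(-np.sum(p * np.log(q + 1e-12)) / len(p))
```

At test time no flow is needed: only the embedding network is kept and transferred to the downstream task.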


Notes

  1. Optical flow is stored in fixed-point 16-bit PNG files, similar to KITTI [18], for compression.

  2. Full model: ‘pool1’–‘pool5’ and ‘fc7’ (projected to 256 channels using a \(1\times 1\) convolution for faster training) constitute the hypercolumn head for pre-training on the dataset (Sect. 4.2).
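The KITTI-style flow storage mentioned in note 1 maps each flow component to an unsigned 16-bit integer. A hedged round-trip sketch follows: the value = flow × 64 + 2¹⁵ convention is taken from the KITTI development kit, and the validity channel KITTI also stores is omitted here.

```python
import numpy as np

def encode_flow(flow):
    """Map float flow (pixels) to uint16, KITTI-style: ~1/64 px precision."""
    return np.clip(flow * 64.0 + 2**15, 0, 2**16 - 1).astype(np.uint16)

def decode_flow(raw):
    """Invert encode_flow: recover float flow from the stored uint16 values."""
    return (raw.astype(np.float32) - 2**15) / 64.0
```

The fixed-point encoding keeps the files small while bounding the quantization error at 1/64 of a pixel per component.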
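The hypercolumn head in note 2 follows the idea of Hariharan et al. [22]: concatenate activations from several layers at each pixel after upsampling them to a common resolution. A minimal nearest-neighbour sketch is below; the upsampling method and the toy feature maps are assumptions, not the paper's exact configuration.

```python
import numpy as np

def upsample_nn(feat, size):
    """Nearest-neighbour upsample a (C, h, w) feature map to (C, H, W)."""
    H, W = size
    _, h, w = feat.shape
    rows = np.arange(H) * h // H   # source row for each target row
    cols = np.arange(W) * w // W   # source column for each target column
    return feat[:, rows][:, :, cols]

def hypercolumns(feature_maps, size):
    """Stack upsampled feature maps channel-wise into per-pixel hypercolumns."""
    return np.concatenate([upsample_nn(f, size) for f in feature_maps], axis=0)
```

Each output pixel then carries a single vector that mixes coarse, semantic features from deep layers with fine, localized ones from early layers.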

References

  1. Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)

  2. Agrawal, P., Carreira, J., Malik, J.: Learning to see by moving. In: ICCV (2015)

  3. Arandjelović, R., Zisserman, A.: Look, listen and learn. In: ICCV (2017)

  4. Bansal, A., Chen, X., Russell, B., Gupta, A., Ramanan, D.: PixelNet: representation of the pixels, by the pixels, and for the pixels. arXiv:1702.06506 (2017)

  5. Bojanowski, P., Joulin, A.: Unsupervised learning by predicting noise. In: ICML (2017)

  6. Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7577, pp. 611–625. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33783-3_44

  7. Cristianini, N., et al.: An Introduction to Support Vector Machines. CUP, Cambridge (2000)

  8. Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: ICCV (2015)

  9. Doersch, C., et al.: Multi-task self-supervised visual learning. In: ICCV (2017)

  10. Donahue, J., et al.: Adversarial feature learning. In: ICLR (2017)

  11. Dosovitskiy, A., et al.: FlowNet: learning optical flow with convolutional networks. In: ICCV (2015)

  12. Dosovitskiy, A., et al.: Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE PAMI 38(9), 1734–1747 (2016)

  13. Everingham, M., et al.: The PASCAL visual object classes challenge 2007 results (2007)

  14. Everingham, M., et al.: The PASCAL visual object classes challenge 2012 results (2012)

  15. Faktor, A., Irani, M.: Video segmentation by non-local consensus voting. In: BMVC (2014)

  16. Gan, C., Gong, B., Liu, K., Su, H., Guibas, L.J.: Geometry guided convolutional neural networks for self-supervised video representation learning. In: CVPR (2018)

  17. Gao, R., Jayaraman, D., Grauman, K.: Object-centric representation learning from unlabeled videos. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10115, pp. 248–263. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54193-8_16

  18. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: CVPR (2012)

  19. Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. In: ICLR (2018)

  20. Girshick, R.B.: Fast R-CNN. In: ICCV (2015)

  21. Hariharan, B., et al.: Semantic contours from inverse detectors. In: ICCV (2011)

  22. Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: CVPR, pp. 447–456 (2015)

  23. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML (2015)

  24. Isola, P., Zoran, D., Krishnan, D., Adelson, E.H.: Learning visual groups from co-occurrences in space and time. In: ICLR Workshop (2015)

  25. Jayaraman, D., Grauman, K.: Slow and steady feature analysis: higher order temporal coherence in video. In: CVPR, pp. 3852–3861 (2016)

  26. Jayaraman, D., et al.: Learning image representations tied to ego-motion. In: ICCV (2015)

  27. Jenni, S., Favaro, P.: Self-supervised feature learning by learning to spot artifacts. In: CVPR (2018)

  28. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv:1412.6980 (2014)

  29. Krähenbühl, P., et al.: Data-dependent initializations of convolutional neural networks. In: ICLR (2016)

  30. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1106–1114 (2012)

  31. Larsson, G., Maire, M., Shakhnarovich, G.: Colorization as a proxy task for visual understanding. In: CVPR (2017)

  32. Lee, H.Y., Huang, J.B., Singh, M.K., Yang, M.H.: Unsupervised representation learning by sorting sequences. In: ICCV (2017)

  33. Liu, C.: Beyond pixels: exploring new representations and applications for motion analysis. Ph.D. thesis, Massachusetts Institute of Technology, USA (2009)

  34. Mahendran, A., Vedaldi, A.: Visualizing deep convolutional neural networks using natural pre-images. IJCV 120, 1–23 (2016)

  35. Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: unsupervised learning using temporal order verification. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 527–544. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_32

  36. Mundhenk, T., Ho, D., Chen, B.Y.: Improvements to context based self-supervised learning. In: CVPR (2017)

  37. Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving Jigsaw puzzles. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 69–84. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_5

  38. Noroozi, M., Vinjimoor, A., Favaro, P., Pirsiavash, H.: Boosting self-supervised learning via knowledge transfer. In: CVPR (2018)

  39. Noroozi, M., et al.: Representation learning by learning to count. In: ICCV (2017)

  40. Owens, A., Wu, J., McDermott, J.H., Freeman, W.T., Torralba, A.: Ambient sound provides supervision for visual learning. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 801–816. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_48

  41. Pathak, D., et al.: Context encoders: feature learning by inpainting. In: CVPR (2016)

  42. Pathak, D., et al.: Learning features by watching objects move. In: CVPR (2017)

  43. Prest, A., et al.: Learning object class detectors from weakly annotated video. In: CVPR (2012)

  44. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)

  45. Ren, Z., Lee, Y.J.: Cross-domain self-supervised multi-task feature learning using synthetic imagery. In: CVPR (2018)

  46. Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: EpicFlow: edge-preserving interpolation of correspondences for optical flow. In: CVPR (2015)

  47. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV 115, 211–252 (2015)

  48. de Sa, V.R.: Learning classification with unlabeled data. In: NIPS, pp. 112–119 (1994)

  49. Sermanet, P., et al.: Time-contrastive networks: self-supervised learning from video. In: ICRA (2018)

  50. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)

  51. Srivastava, N., Mansimov, E., Salakhudinov, R.: Unsupervised learning of video representations using LSTMs. In: ICML (2015)

  52. Thomee, B., et al.: YFCC100M: the new data in multimedia research. ACM (2016)

  53. Todorovic, D.: Gestalt principles. Scholarpedia 3(12), 5345 (2008). revision #91314

  54. Walker, J.: Data-driven visual forecasting. Ph.D. thesis, Carnegie Mellon University (2018)

  55. Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. In: ICCV, pp. 2794–2802 (2015)

  56. Wang, X., He, K., Gupta, A.: Transitive invariance for self-supervised visual representation learning. In: ICCV (2017)

  57. Wei, D., et al.: Learning and using the arrow of time. In: CVPR, pp. 8052–8060 (2018)

  58. Weinzaepfel, P., Revaud, J., Harchaoui, Z., Schmid, C.: DeepFlow: large displacement optical flow with deep matching. In: ICCV, pp. 1385–1392 (2013)

  59. Xue, T., Wu, J., Bouman, K.L., Freeman, W.T.: Visual dynamics: stochastic future generation via layered cross convolutional networks. IEEE PAMI (2018). https://doi.org/10.1109/TPAMI.2018.2854726

  60. Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 649–666. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_40

  61. Zhang, R., Isola, P., Efros, A.A.: Split-brain autoencoders: unsupervised learning by cross-channel prediction. In: CVPR (2017)


Acknowledgements

The authors gratefully acknowledge ERC IDIU, the AIMS CDT (EPSRC EP/L015897/1), and the AWS Cloud Credits for Research program. The authors thank Ankush Gupta and David Novotný for helpful discussions, and Christian Rupprecht, Fatma Guney and Ruth Fong for proofreading the paper. We thank Deepak Pathak for help with reproducing some of the results from [42].

Author information

Correspondence to Aravindh Mahendran.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 7643 KB)


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Mahendran, A., Thewlis, J., Vedaldi, A. (2019). Cross Pixel Optical-Flow Similarity for Self-supervised Learning. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science, vol 11365. Springer, Cham. https://doi.org/10.1007/978-3-030-20873-8_7


  • DOI: https://doi.org/10.1007/978-3-030-20873-8_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20872-1

  • Online ISBN: 978-3-030-20873-8

  • eBook Packages: Computer Science (R0)
