Contrastive Multiview Coding

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12356)


Humans view the world through many sensory channels, e.g., the long-wavelength light channel, viewed by the left eye, or the high-frequency vibrations channel, heard by the right ear. Each view is noisy and incomplete, but important factors, such as physics, geometry, and semantics, tend to be shared between all views (e.g., a “dog” can be seen, heard, and felt). We investigate the classic hypothesis that a powerful representation is one that models view-invariant factors. We study this hypothesis under the framework of multiview contrastive learning, where we learn a representation that aims to maximize mutual information between different views of the same scene but is otherwise compact. Our approach scales to any number of views, and is view-agnostic. We analyze key properties of the approach that make it work, finding that the contrastive loss outperforms a popular alternative based on cross-view prediction, and that the more views we learn from, the better the resulting representation captures underlying scene semantics. Code is available at:
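
The snippet below is not the authors' released code. It is a minimal illustrative sketch, in plain Python, of the kind of contrastive objective the abstract describes: embeddings of two views of the same scene are pulled together, while embeddings from other scenes in the batch serve as negatives. The function names, the `temperature` parameter, the cosine-similarity critic, and the choice to sum the loss over all pairs of views are assumptions made for illustration and are not taken from the text above.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(anchors, candidates, temperature=0.1):
    """Mean cross-entropy of picking candidates[i] as the match for anchors[i].

    anchors[i] and candidates[i] are embeddings of two views of scene i;
    candidates[j] with j != i act as negatives for anchor i.
    """
    anchors = [l2_normalize(a) for a in anchors]
    candidates = [l2_normalize(c) for c in candidates]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, c) / temperature for c in candidates]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


def multiview_loss(views, temperature=0.1):
    """Sum the two-directional loss over every pair of views (an assumed scheme)."""
    total = 0.0
    for a in range(len(views)):
        for b in range(a + 1, len(views)):
            total += contrastive_loss(views[a], views[b], temperature)
            total += contrastive_loss(views[b], views[a], temperature)
    return total


# Toy data: 16 "scenes", each with an 8-dimensional embedding per view.
rng = random.Random(0)
view1 = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
# The second view is a slightly noisy copy, so it shares most information.
view2 = [[x + 0.05 * rng.gauss(0, 1) for x in row] for row in view1]
# Rotating the batch by one breaks the scene-level correspondence.
shuffled = view2[1:] + view2[:1]

aligned = multiview_loss([view1, view2])
mismatched = multiview_loss([view1, shuffled])
print(f"aligned views:    {aligned:.3f}")
print(f"mismatched views: {mismatched:.3f}")
```

In this toy setup the loss is lower when the two views really do come from the same scene than when the pairing is broken, which is the signal such an objective trains an encoder to exploit. Passing three or more lists to `multiview_loss` sums the same term over all view pairs.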




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. MIT CSAIL, Cambridge, USA
  2. Google Research, Cambridge, USA
