Contrastive Self-supervised Representation Learning Using Synthetic Data

Abstract

Learning discriminative representations with deep neural networks often relies on massive labeled data, which is expensive and difficult to obtain in many real scenarios. As an alternative, self-supervised learning, which uses the input itself as supervision, has become a preferred approach because of its strong performance on visual representation learning. This paper introduces a contrastive self-supervised framework for learning generalizable representations from synthetic data, which can be obtained easily and with complete controllability. Specifically, we propose to optimize a contrastive learning task and a physical property prediction task simultaneously. Given a synthetic scene, the first task aims to maximize agreement between a pair of synthetic images generated by our proposed view sampling module, while the second task aims to predict three physical property maps: depth maps, instance contour maps, and surface normal maps. In addition, a feature-level domain adaptation technique with adversarial training is applied to reduce the domain gap between real and synthetic data. Experiments demonstrate that our proposed method achieves state-of-the-art performance on several visual recognition datasets.
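
As an illustration of the contrastive task described in the abstract, the following minimal sketch computes a normalized temperature-scaled cross-entropy (NT-Xent) loss over paired view embeddings. The abstract does not state which contrastive loss, temperature, or loss weights the paper uses, so the NT-Xent form, the `temperature` default, and the hypothetical `combined_objective` helper with its weights are assumptions made here for illustration only.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """NT-Xent loss for a batch of paired embeddings.

    view_a[i] and view_b[i] are assumed to be non-zero embeddings of two
    views sampled from the same synthetic scene i (the positive pair).
    All other embeddings in the batch act as negatives.
    """
    n = len(view_a)
    embeddings = [l2_normalize(v) for v in list(view_a) + list(view_b)]
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same scene
        # Cosine similarities to every other embedding, scaled by temperature.
        logits = {
            j: sum(a * b for a, b in zip(embeddings[i], embeddings[j])) / temperature
            for j in range(2 * n)
            if j != i
        }
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits.values())
        log_denominator = peak + math.log(
            sum(math.exp(value - peak) for value in logits.values())
        )
        total += log_denominator - logits[positive]
    return total / (2 * n)


def combined_objective(contrastive, property_losses, adversarial,
                       property_weight=1.0, adversarial_weight=1.0):
    """Hypothetical weighted sum of the three kinds of loss named in the abstract.

    property_losses holds the depth, instance contour, and surface normal
    prediction losses; the weights are placeholders, not values from the paper.
    """
    return (contrastive
            + property_weight * sum(property_losses)
            + adversarial_weight * adversarial)


# Two scenes, two views each: matching pairs should give a lower loss
# than deliberately mismatched pairs.
scene_views_a = [[1.0, 0.0], [0.0, 1.0]]
scene_views_b = [[1.0, 0.0], [0.0, 1.0]]
mismatched_b = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = nt_xent_loss(scene_views_a, scene_views_b)
mismatched_loss = nt_xent_loss(scene_views_a, mismatched_b)
```

In this toy batch the aligned pairs yield a smaller loss than the mismatched ones, which is the behavior "maximizing agreement" between views of the same scene relies on. A real implementation would compute the embeddings with a trained encoder and use an automatic differentiation framework rather than plain lists.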

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 61822204 and 61521002).

Author information

Corresponding author

Correspondence to Kun Xu.

Additional information

Recommended by Associate Editor Jangmyung Lee

Colored figures are available in the online version at https://link.springer.com/journal/11633

Dong-Yu She received the B. Eng. and M. Eng. degrees in computer science and technology from Nankai University, China in 2016 and 2019, respectively. She is a Ph. D. candidate in the Department of Computer Science and Technology, Tsinghua University, China.

Her research interests include deep learning and computer vision. E-mail: shedy19@mails.tsinghua.edu.cn

ORCID iD: 0000-0002-1434-562X

Kun Xu received the B. Eng. and Ph. D. degrees in computer science and technology from Tsinghua University, China in 2005 and 2009, respectively. He is an associate professor in the Department of Computer Science and Technology, Tsinghua University, China.

His research interests include realistic rendering and image/video editing.

E-mail: xukun@tsinghua.edu.cn (Corresponding author)

ORCID iD: 0000-0002-2671-4170

Rights and permissions

Open Access

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

She, DY., Xu, K. Contrastive Self-supervised Representation Learning Using Synthetic Data. Int. J. Autom. Comput. 18, 556–567 (2021). https://doi.org/10.1007/s11633-021-1297-9

Keywords

  • Self-supervised learning
  • Contrastive learning
  • Synthetic image
  • Convolutional neural network
  • Representation learning