Context Autoencoder for Self-supervised Representation Learning

Chen, Xiaokang; Ding, Mingyu; Wang, Xiaodi; Xin, Ying; Mo, Shentong; Wang, Yunhao; Han, Shumin; Luo, Ping; Zeng, Gang; Wang, Jingdong

doi:10.1007/s11263-023-01852-4

Context Autoencoder for Self-supervised Representation Learning

Published: 28 August 2023

Volume 132, pages 208–223, (2024)
Cite this article

International Journal of Computer Vision Aims and scope Submit manuscript

Xiaokang Chen¹,
Mingyu Ding^2,3,
Xiaodi Wang⁴,
Ying Xin⁴,
Shentong Mo⁴,
Yunhao Wang⁴,
Shumin Han⁴,
Ping Luo²,
Gang Zeng¹ &
…
Jingdong Wang ORCID: orcid.org/0000-0002-4888-4445⁴

3824 Accesses
40 Citations
Explore all metrics

Abstract

We present a novel masked image modeling (MIM) approach, context autoencoder (CAE), for self-supervised representation pretraining. We pretrain an encoder by making predictions in the encoded representation space. The pretraining tasks include two tasks: masked representation prediction—predict the representations for the masked patches, and masked patch reconstruction—reconstruct the masked patches. The network is an encoder–regressor–decoder architecture: the encoder takes the visible patches as input; the regressor predicts the representations of the masked patches, which are expected to be aligned with the representations computed from the encoder, using the representations of visible patches and the positions of visible and masked patches; the decoder reconstructs the masked patches from the predicted encoded representations. The CAE design encourages the separation of learning the encoder (representation) from completing the pertaining tasks: masked representation prediction and masked patch reconstruction tasks, and making predictions in the encoded representation space empirically shows the benefit to representation learning. We demonstrate the effectiveness of our CAE through superior transfer performance in downstream tasks: semantic segmentation, object detection and instance segmentation, and classification. The code will be available at https://github.com/Atten4Vis/CAE.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 4

Fig. 5

SSD: Single Shot MultiBox Detector

End-to-End Object Detection with Transformers

Microsoft COCO: Common Objects in Context

Data availability

The datasets used in this paper are publicly available. ImageNet: https://www.image-net.org/, ADE20K: https://groups.csail.mit.edu/vision/datasets/ADE20K/, COCO: https://cocodataset.org/, Food-101: https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/, Clipart: http://projects.csail.mit.edu/cmplaces/download.html, Sketch: http://projects.csail.mit.edu/cmplaces/download.html.

Code Availability

Our code will be available at https://github.com/Atten4Vis/CAE.

Notes

https://arxiv.org/abs/2202.03026.
Our encoder does not know that the subspace is about a dog, and just separates it from the subspaces of other categories.
There are a few images in which the object does not lie in the center in ImageNet-1K. The images are actually viewed as noises and have little influence for contrastive self-supervised learning.

References

Asano, Y. M., Rupprecht, C., & Vedaldi, A. (2019). Self-labelling via simultaneous clustering and representation learning. arXiv:1911.05371
Atito, S., Awais, M., & Kittler, J. (2021). Sit: Self-supervised vision transformer. arXiv:2104.03602
Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., & Auli, M. (2022). data2vec: A general framework for self-supervised learning in Speech. Technical report: Vision and Languags.
Google Scholar
Bao, H., Dong, L., & Wei, F. (2021). BEiT: BERT pre-training of image transformers. arXiv:2106.08254
Bardes, A., Ponce, J., & LeCun, Y. (2021). Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv:2105.04906
Bossard, L., Guillaumin, M., & Van Gool, L. (2014). Food-101—Mining discriminative components with random forests. In ECCV.
Cai, Z., & Vasconcelos, N. (2021). Cascade r-cnn: High quality object detection and instance segmentation. TPAMI, 43, 1483–1498.
Article Google Scholar
Caron, M., Bojanowski, P., Joulin, A., & Douze, M. (2018). Deep clustering for unsupervised learning of visual features. In ECCV (pp. 132–149).
Caron, M., Bojanowski, P., Mairal, J., & Joulin, A. (2019). Unsupervised pre-training of image features on non-curated data. In ICCV (pp. 2959–2968).
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. arXiv:2006.09882.
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A. (2021). Emerging properties in self-supervised vision transformers. CoRR, arxiv:2104.14294.
Castrejon, L., Aytar, Y., Vondrick, C., Pirsiavash, H., & Torralba, A. (2016). Learning aligned cross-modal representations from weakly aligned data. In CVPR (pp. 2940–2949).
Chen, X., & He, K. (2021). Exploring simple Siamese representation learning. In CVPR (pp. 15750–15758).
Chen, Q., Chen, X., Wang, J., Feng, H., Han, J., Ding, E., Zeng, G., & Wamg, J. (2022). Group detr: Fast detr training with group-wise one-to-many assignment.
Chen, X., Ding, M., Wang, X., Xin, Y., Mo, S., Wang, Y., Han, S., Luo, P., Zeng, G., & Wang, J. (2022). Context autoencoder for self-supervised representation learning. CoRR, arxiv:2202.03026.
Chen, J., Hu, M., Li, B., & Elhoseiny, M. (2022). Efficient self-supervised vision pretraining with local masked reconstruction. arXiv:2206.00790
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. E. (2020). A simple framework for contrastive learning of visual representations. In ICML, volume 119 of Proceedings of Machine Learning Research (pp. 1597–1607). PMLR.
Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., & Sutskever, I. (2020). Generative pretraining from pixels. In ICML (pp. 1691–1703). PMLR.
Chen, Q., Wang, J., Han, C., Zhang, S., Li, Z., Chen, X., Chen, J., Wang, X., Han, S., Zhang, G., Feng, H., Yao, K., Han, J., Ding, E., & Wang, J. (2022). Group DETR v2: Strong object detector with encoder-decoder pretraining. CoRR, arxiv:2211.03594.
Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C. C., & Lin, D. (2019). MMDetection: Open mmlab detection toolbox and benchmark. arXiv:1906.07155.
Chen, X., Xie, S., & Kaiming, H. (2021). An empirical study of training self-supervised vision transformers. CoRR, arxiv:2104.02057.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In CVPR (pp. 248–255). IEEE.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), NAACL-HLT (pp. 4171–4186). Association for Computational Linguistics.
Google Scholar
Doersch, C., Gupta, A., & Efros, Alexei, A. (2015). Unsupervised visual representation learning by context prediction. In ICCV.
Dong, X., Bao, J., Zhang, T., Chen, D., Zhang, Weiming,Y., Lu, C., Dong, W., Fang, & Yu, N. (2021). Peco: Perceptual codebook for bert pre-training of vision transformers. arXiv:2111.12710.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR: OpenReview.net.
Dosovitskiy, A., Fischer, P., Springenberg, J. T., Riedmiller, M., & Brox, T. (2015). Discriminative unsupervised feature learning with exemplar convolutional neural networks. TPAMI, 38(9), 1734–1747.
Article Google Scholar
Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., & Brox, T. (2014). Discriminative unsupervised feature learning with convolutional neural networks. NeurIPS, 27, 766–774.
Google Scholar
El-Nouby, A., Izacard, G., Touvron, H., Laptev, I., Jegou, H., & Grave, E. (2021). Are large-scale datasets necessary for self-supervised pre-training? arXiv:2112.10740
Ermolov, A., Siarohin, A., Sangineto, E., & Sebe, N. (2021). Whitening for self-supervised representation learning. In ICML (pp. 3015–3024). PMLR.
Fang, Y., Dong, L., Bao, H., Wang, X., & Wei, F. (2022). Corrupted image modeling for self-supervised visual pre-training. arXiv:2202.03382
Gallinari, P., Lecun, Y., Thiria, S., & Soulie, F. F. (1987). Mémoires associatives distribuées: une comparaison (distributed associative memories: A comparison). In Proceedings of COGNITIVA 87, Paris, La Villette, May 1987. Cesta-Afcet.
Garrido, Q., Chen, Y., Bardes, A., Najman, L. (2022). On the duality between contrastive and non-contrastive self-supervised learning. CoRR, arXiv:2206.02574
Gidaris, S., Bursuc, A., Komodakis, N., Pérez, P., & Cord, M. (2020). Learning representations by predicting bags of visual words. In CVPR (pp. 6928–6938).
Gidaris, S., Bursuc, A., Puy, G., Komodakis, N., Cord, M., & Pérez, P. (2020). Online bag-of-visual-words generation for unsupervised representation learning. arXiv:2012.11552
Goyal, P., Caron, M., Lefaudeux, B., Xu, M., Wang, P., Pai, V., Singh, M., Liptchinsky, V., Misra, I., Joulin, A., et al. (2021). Self-supervised pretraining of visual features in the wild. arXiv:2103.01988
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires, B. A., Guo, Z. D., Azar, M. G., et al. (2020). Bootstrap your own latent: A new approach to self-supervised learning. arXiv:2006.07733
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In CVPR.
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. B. (2020). Momentum contrast for unsupervised visual representation learning. In CVPR (pp. 9726–9735). Computer Vision Foundation/IEEE.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. In ICCV (pp. 2961–2969).
Henaff, O. (2020). Data-efficient image recognition with contrastive predictive coding. In ICML (pp. 4182–4192). PMLR.
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.
Article MathSciNet Google Scholar
Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length, and helmholtz free energy. NeurIPS, 6, 3–10.
Google Scholar
Huang, J., Dong, Q., Gong, S., & Zhu, X. (2019). Unsupervised deep learning by neighbourhood discovery. In ICML (pp. 2849–2858). PMLR.
Huang, Z., Jin, X., Lu, C., Hou, Q., Cheng, M.-M., Fu, D., Shen, X., & Feng, J. (2022). Contrastive masked autoencoders are stronger vision learners. arXiv:2207.13532
Huang, G., Sun, Y., Liu, Z., Sedra, D., & Weinberger, K. Q. (2016). Deep networks with stochastic depth. In ECCV (pp. 646–661). Springer.
Huang, L., You, S., Zheng, M., Wang, F., Qian, C., & Yamasaki, T. (2022). Green hierarchical vision transformer for masked image modeling. arXiv:2205.13515
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML.
Jing, L., Zhu, J., & LeCun, Y. (2022). Masked siamese convnets. arXiv:2206.07700
Kakogeorgiou, I., Gidaris, S., Psomas, B., Avrithis, Y., Bursuc, A., Karantzalos, K., & Komodakis, N. (2022). What to hide from your students: Attention-guided masked image modeling. In ECCV.
Kingma, Diederik P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv:1312.6114.
Kong, X., & Zhang, X. (2022). Understanding masked image modeling via learning occlusion invariant feature. arXiv:2208.04164
LeCun, Y. (1987). Mod‘eles connexionistes de l’apprentissage. PhD thesis, Universit’e de Paris VI.
Li, X., Ge, Y., Yi, K., Hu, Z., Shan, Y., & Duan, L.-Y. (2022). mc-beit: Multi-choice discretization for image bert pre-training. In ECCV.
Li, X., Wang, W., Yang, L., & Yang, J. (2022). Uniform masking: Enabling mae pre-training for pyramid-based vision transformers with locality. arXiv:2205.10063
Li, S., Wu, D., Wu, F., Zang, Z., Wang, K., Shang, L., Sun, B., Li, H., Li, S., et al. (2022). Architecture-agnostic masked image modeling-from vit back to cnn. arXiv:2205.13943
Li, G., Zheng, H., Liu, D., Su, B., & Zheng, C. (2022). Semmae: Semantic-guided masking for learning masked autoencoders. arXiv:2206.10207
Li, J., Zhou, P., Xiong, C., & Hoi, S. C. H. (2020). Prototypical contrastive learning of unsupervised representations. arXiv:2005.04966
Li, Z., Chen, Z., Yang, F., Li, W., Zhu, Y., Zhao, C., Deng, R., Liwei, W., Zhao, R., Tang, M., et al. (2021). Mst: Masked self-supervised transformer for visual representation. NeurIPS, 34, 13165–13176.
Google Scholar
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Lawrence Zitnick, C. (2014). Microsoft coco: Common objects in context. In ECCV (pp. 740–755). Springer.
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., Wei, F. & Guo, B. (2021). Swin transformer v2: Scaling up capacity and resolution. Cornell University.
Liu, H., Jiang, X., Li, X., Guo, A., Jiang, D., & Ren, B. (2022). The devil is in the frequency: Geminated gestalt autoencoder for self-supervised visual pre-training. arXiv:2204.08227
Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv:1711.05101
Noroozi, M., & Favaro, P. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV (pp. 69–84). Springer.
Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., & Efros, A. A. (2016). Context encoders: Feature learning by inpainting. In CVPR (pp. 2536–2544).
Peng, X., Wang, K., Zhu, Z., & You, Y. (2022). Crafting better contrastive views for siamese representation learning. In CVPR.
Qi, J., Zhu, J., Ding, M., Chen, X., Luo, P., Wang, L., Wang, X., Liu, W., & Wang, J. (2023). Understanding self-supervised pretraining with part-aware representation learning. Report: Tech.
Google Scholar
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., & Sutskever, I. (2021). Zero-shot text-to-image generation. In M. Meila & T. Zhang (Eds.), ICML (Vol. 139, pp. 8821–8831). PMLR.
Google Scholar
Ranzato, M., Poultney, C., Chopra, S., LeCun, Y., et al. (2007). Efficient learning of sparse representations with an energy-based model. NeurIPS, 19, 1137.
Google Scholar
Russakovsky, O., Deng, J., Hao, S., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). Imagenet large scale visual recognition challenge. IJCV, 115(3), 211–252.
Article MathSciNet Google Scholar
Tao, C., Zhu, X., Huang, G., Qiao, Y., Wang, X. & Dai, J. (2022). Siamese image modeling for self-supervised vision representation learning. arXiv:2206.01204
Tian, Y., Xie, L., Fang, J., Shi, M., Peng, J., Zhang, X., Jiao, J., Tian, Q., & Ye, Q. (2022). Beyond masking: Demystifying token-based pre-training for vision transformers. arXiv:2203.14313
Tian, Y., Xie, L., Zhang, X., Fang, J., Xu, H., Huang, W., Jiao, J., Tian, Q., & Ye, Q. (2021). Semantic-aware generation for self-supervised visual representation learning. arXiv:2111.13163
Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., & Isola, P. (2020). What makes for good views for contrastive learning? NeurIPS, 33, 6827–6839.
Google Scholar
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2020). Training data-efficient image transformers & distillation through attention. arXiv:2012.12877
van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv:1807.03748
Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-sne. Journal of Machine Learning Research,9(11).
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In ICML (pp. 1096–1103).
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11, 3371–3408.
MathSciNet Google Scholar
Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Khan, O., Singhal, S., Som, S., & Wei, F. (2023). Image as a foreign language: Beit pretraining for all vision and vision-language tasks.
Wang, L., Liang, F., Li, Y., Ouyang, W., Zhang, H., & Shao, J. (2022). Repre: Improving self-supervised vision transformer with reconstructive pre-training. arXiv:2201.06857
Wang, X., Zhang, R., Shen, C., Kong, T., & Li, L. (2021). Dense contrastive learning for self-supervised visual pre-training. In CVPR (pp. 3024–3033).
Wei, C., Fan, H., Xie, S., Wu, C.-Y., Yuille, A., & Feichtenhofer, C. (2021). Masked feature prediction for self-supervised visual pre-training. arXiv:2112.09133
Wei, L., Xie, L., Zhou, W., Li, H., & Tian, Q. (2022). Mvp: Multimodality-guided visual pre-training. In ECCV.
Wu, Z., Xiong, Y., Yu, S. X., & Lin, D. (2018). Unsupervised feature learning via non-parametric instance discrimination. In CVPR (pp. 3733–3742).
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018). Unified perceptual parsing for scene understanding. In ECCV (pp. 418–434).
Xie, Z., Geng, Z., Hu, J., Zhang, Z., Hu, H., & Cao, Y. (2022). Revealing the dark secrets of masked image modeling. arXiv:2205.13543.
Xie, J., Girshick, R., & Farhadi, A. (2016). Unsupervised deep embedding for clustering analysis. In ICML (pp. 478–487). PMLR.
Xie, J., Li, W., Zhan, X., Liu, Z., Ong, Y. S., & Loy, C. C. (2022). Masked frequency modeling for self-supervised visual pre-training. arXiv:2206.07706
Xie, Z., Lin, Y., Zhang, Z., Cao, Y., Lin, Stephen, & Hu, H. (2021). Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In CVPR (pp. 16684–16693).
Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., & Hu, H. (2021). Simmim: A simple framework for masked image modeling. arXiv:2111.09886
Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Wei, Y., Dai, Q., & Hu, H. (2022). On data scaling in masked image modeling. arXiv:2206.04664
Yang, J., Parikh, D., & Batra, D. (2016). Joint unsupervised learning of deep representations and image clusters. In CVPR (pp. 5147–5156).
Yi, K., Ge, Y., Li, Xiaotong, Y., Shusheng, L., Dian, W., Jianping, S. Y., & Qie, X. (2022). Masked image modeling with denoising contrast. arXiv:2205.09616
You, Y., Gitman, I., & Ginsburg, B. (2017). Large batch training of convolutional networks. arXiv:1708.03888
Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV (pp. 6023–6032).
Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow twins: Self-supervised learning via redundancy reduction. arXiv:2103.03230
Zhang, X., Chen, J., Yuan, J., Chen, Q., Wang, J., Wang, X., Han, S., Chen, X., Pi, J., Yao, K., Han, J., Ding, E., & Wang, J. (2022). CAE v2: Context autoencoder with CLIP target. CoRRarxiv:2211.09799.
Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond empirical risk minimization. In ICLR.
Zhang, R., Isola, P., & Efros, A. A. (2016). Colorful image colorization. In ECCV (pp. 649–666). Springer.
Zhang, H., Li, F., Liu, S., Zhang, L., Hang, S., Zhu, J., Ni, L. M., & Shum, H.-Y. (2023). Dino: Detr with improved denoising anchor boxes for end-to-end object detection.
Zhang, X., Tian, Y., Huang, W., Ye, Q., Dai, Q., Xie, L., & Tian, Q. (2022). Hivit: Hierarchical vision transformer meets masked image modeling. arXiv:2205.14949
Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A., & Kong, T. (2021). Ibot: Image bert pre-training with online tokenizer. arxiv:2111.07832
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralb, A.(2017). Scene parsing through ade20k dataset. In CVPR (pp. 633–641).
Zhuang, C., Zhai, A. L., & Yamins, D. (2019). Local aggregation for unsupervised learning of visual embeddings. In ICCV (pp. 6002–6012).

Download references

Acknowledgements

We would like to acknowledge Hangbo Bao, Xinlei Chen, Li Dong, Qi Han, Zhuowen Tu, Saining Xie, and Furu Wei for the helpful discussions.

Funding

This work is partially supported by the National Key Research and Development Program of China (2020YFB1708002), National Natural Science Foundation of China (61632003, 61375022, 61403005), Grant SCITLAB-20017 of Intelligent Terminal Key Laboratory of SiChuan Province, Beijing Advanced Innovation Center for Intelligent Robots and Systems (2018IRS11), and PEK-SenseTime Joint Laboratory of Machine Vision. Ping Luo is supported by the General Research Fund of HK No.27208720, No.17212120, and No.17200622.

Author information

Authors and Affiliations

Peking University, Beijing, China
Xiaokang Chen & Gang Zeng
University of Hong Kong, Pok Fu Lam, Hong Kong
Mingyu Ding & Ping Luo
UC Berkeley, Berkeley, CA, USA
Mingyu Ding
Baidu, Beijing, China
Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han & Jingdong Wang

Authors

Xiaokang Chen
View author publications
You can also search for this author in PubMed Google Scholar
Mingyu Ding
View author publications
You can also search for this author in PubMed Google Scholar
Xiaodi Wang
View author publications
You can also search for this author in PubMed Google Scholar
Ying Xin
View author publications
You can also search for this author in PubMed Google Scholar
Shentong Mo
View author publications
You can also search for this author in PubMed Google Scholar
Yunhao Wang
View author publications
You can also search for this author in PubMed Google Scholar
Shumin Han
View author publications
You can also search for this author in PubMed Google Scholar
Ping Luo
View author publications
You can also search for this author in PubMed Google Scholar
Gang Zeng
View author publications
You can also search for this author in PubMed Google Scholar
Jingdong Wang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jingdong Wang.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Communicated by Chunyuan Li.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Chen, X., Ding, M., Wang, X. et al. Context Autoencoder for Self-supervised Representation Learning. Int J Comput Vis 132, 208–223 (2024). https://doi.org/10.1007/s11263-023-01852-4

Download citation

Received: 27 January 2023
Accepted: 12 July 2023
Published: 28 August 2023
Issue Date: January 2024
DOI: https://doi.org/10.1007/s11263-023-01852-4

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Context Autoencoder for Self-supervised Representation Learning

Abstract

Access this article

Similar content being viewed by others

SSD: Single Shot MultiBox Detector

End-to-End Object Detection with Transformers

Microsoft COCO: Common Objects in Context

Data availability

Code Availability

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Context Autoencoder for Self-supervised Representation Learning

Abstract

Access this article

Similar content being viewed by others

SSD: Single Shot MultiBox Detector

End-to-End Object Detection with Transformers

Microsoft COCO: Common Objects in Context

Data availability

Code Availability

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation