Learning Contrastive Representation for Semantic Correspondence

International Journal of Computer Vision

A Correction to this article was published on 13 April 2022

This article has been updated

Abstract

Dense correspondence across semantically related images has been extensively studied, but still faces two challenges: 1) large variations in appearance, scale, and pose exist even for objects from the same category, and 2) labeling pixel-level dense correspondences is labor-intensive and infeasible to scale. Most existing methods focus on designing various matching modules using fully supervised ImageNet-pretrained networks. On the other hand, while a variety of self-supervised approaches have been proposed to explicitly measure image-level similarities, correspondence matching at the pixel level remains under-explored. In this work, we propose a multi-level contrastive learning approach for semantic matching, which does not rely on any ImageNet-pretrained model. We show that image-level contrastive learning is a key component in encouraging the convolutional features to find correspondences between similar objects, while the performance can be further enhanced by regularizing cross-instance cycle-consistency at intermediate feature levels. Experimental results on the PF-PASCAL, PF-WILLOW, and SPair-71k benchmark datasets demonstrate that our method performs favorably against state-of-the-art approaches.
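
The two ingredients named in the abstract, an image-level contrastive objective and a cycle-consistency check between local features of two images, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`info_nce`, `cycle_consistent_fraction`), the temperature value, and the use of a hard nearest-neighbor round trip in place of a differentiable regularizer are all assumptions made for illustration, and feature vectors are assumed to be non-zero.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale a non-zero vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one query embedding (hypothetical helper).

    The loss is low when the query is closer to the positive than to the negatives.
    """
    q = normalize(query)
    logits = [dot(q, normalize(positive)) / temperature]
    logits += [dot(q, normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


def nearest(feat, candidates):
    # Index of the candidate with the highest cosine similarity to feat.
    f = normalize(feat)
    sims = [dot(f, normalize(c)) for c in candidates]
    return max(range(len(sims)), key=sims.__getitem__)


def cycle_consistent_fraction(feats_a, feats_b):
    """Fraction of local features in A whose A -> B -> A nearest-neighbor
    round trip returns to the starting location (a hard stand-in for a
    cycle-consistency regularizer)."""
    hits = 0
    for i, fa in enumerate(feats_a):
        j = nearest(fa, feats_b)
        if nearest(feats_b[j], feats_a) == i:
            hits += 1
    return hits / len(feats_a)
```

In this toy setting, a query that coincides with its positive and is orthogonal to its negatives yields a loss near zero, and two feature sets whose locations match one-to-one yield a cycle-consistent fraction of 1.0. A training pipeline would instead use a differentiable, soft version of the round trip computed on intermediate convolutional feature maps.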

Acknowledgements

T. Xiao and M.-H. Yang are supported in part by NSF CAREER grant 1149783.

Author information

Corresponding author

Correspondence to Ming-Hsuan Yang.

Additional information

Communicated by Bumsub Ham.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xiao, T., Liu, S., De Mello, S. et al. Learning Contrastive Representation for Semantic Correspondence. Int J Comput Vis 130, 1293–1309 (2022). https://doi.org/10.1007/s11263-022-01602-y

  • DOI: https://doi.org/10.1007/s11263-022-01602-y
