Abstract
In this paper, we study the problem of zero-shot sketch-based image retrieval (ZS-SBIR), which is challenging because of the modal gap between sketches and images and the semantic inconsistency between seen and unseen categories. Most previous ZS-SBIR methods rely on external semantic information, i.e., texts or class labels, to minimize the modal gap or the semantic inconsistency. To tackle ZS-SBIR without such labor-intensive external semantic information, we propose a novel method that learns visual correspondences between the two modalities, i.e., sketch and image, to transfer knowledge from seen data to unseen data. The method uses a transformer-based dual-pathway structure to learn these visual correspondences. To eliminate the modal gap between sketch and image, a triplet loss and a Gaussian-distribution-based domain alignment mechanism are applied to the tokens produced by the proposed structure. In addition, knowledge distillation is introduced to preserve the generalization capability of the vision transformer (ViT) used as the backbone. Comprehensive experiments on three benchmark datasets, i.e., Sketchy, TU-Berlin and QuickDraw, demonstrate that our method outperforms the baselines on all three datasets without external semantic information.
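The paper releases no code here, but the triplet objective mentioned in the abstract can be illustrated with a minimal sketch. The function below is a generic margin-based triplet loss on embedding vectors (the names, the margin value, and the toy embeddings are hypothetical, not taken from the paper); in the described method it would act on tokens from the dual-pathway structure, with a sketch as anchor and images as positive/negative.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss on embedding vectors.

    Pulls the anchor (e.g. a sketch token) toward the positive
    (an image of the same class) and pushes it away from the
    negative (an image of a different class), using squared
    Euclidean distance and a fixed margin.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings: the same-class image already lies much closer
# to the sketch than the different-class image, so the loss is zero.
sketch = np.array([1.0, 0.0])
img_same = np.array([0.9, 0.1])
img_diff = np.array([0.0, 1.0])
loss = triplet_loss(sketch, img_same, img_diff)
```

Swapping the positive and negative in this toy example yields a positive loss, which is the gradient signal that drives sketch and image embeddings of the same class together across the modal gap.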
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Gao, Z., Wang, K. (2024). Cross-Modal Visual Correspondences Learning Without External Semantic Information for Zero-Shot Sketch-Based Image Retrieval. In: Lu, H., Cai, J. (eds) Artificial Intelligence and Robotics. ISAIR 2023. Communications in Computer and Information Science, vol 1998. Springer, Singapore. https://doi.org/10.1007/978-981-99-9109-9_34
DOI: https://doi.org/10.1007/978-981-99-9109-9_34
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-9108-2
Online ISBN: 978-981-99-9109-9