Abstract
In this paper, we study the problem of zero-shot sketch-based image retrieval (ZS-SBIR), which is challenging because of the modal gap between sketches and images and the semantic inconsistency between seen and unseen categories. Most previous ZS-SBIR methods rely on external semantic information, i.e., texts or class labels, to minimize the modal gap or the semantic inconsistency. To tackle ZS-SBIR without such labor-intensive external semantic information, we propose a novel method that learns visual correspondences between the two modalities, i.e., sketch and image, to transfer knowledge from seen data to unseen data. The method uses a transformer-based dual-pathway structure to learn these visual correspondences. To eliminate the modal gap between sketch and image, a triplet loss and a Gaussian-distribution-based domain alignment mechanism are applied to the tokens produced by the proposed structure. In addition, knowledge distillation is introduced to preserve the generalization capability of the vision transformer (ViT) used as the backbone. Comprehensive experiments on three benchmark datasets, i.e., Sketchy, TU-Berlin and QuickDraw, demonstrate that our method outperforms the baselines on all three datasets without external semantic information.
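The paper releases no code here, but the triplet objective mentioned in the abstract can be illustrated with a minimal sketch. The function below is a generic margin-based triplet loss on embedding vectors (the names, the margin value, and the toy embeddings are hypothetical, not taken from the paper); in the described method it would act on tokens from the dual-pathway structure, with a sketch as anchor and images as positive/negative.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss on embedding vectors.

    Pulls the anchor (e.g. a sketch token) toward the positive
    (an image of the same class) and pushes it away from the
    negative (an image of a different class), using squared
    Euclidean distance and a fixed margin.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings: the same-class image already lies much closer
# to the sketch than the different-class image, so the loss is zero.
sketch = np.array([1.0, 0.0])
img_same = np.array([0.9, 0.1])
img_diff = np.array([0.0, 1.0])
loss = triplet_loss(sketch, img_same, img_diff)
```

Swapping the positive and negative in this toy example yields a positive loss, which is the gradient signal that drives sketch and image embeddings of the same class together across the modal gap.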
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Gao, Z., Wang, K. (2024). Cross-Modal Visual Correspondences Learning Without External Semantic Information for Zero-Shot Sketch-Based Image Retrieval. In: Lu, H., Cai, J. (eds) Artificial Intelligence and Robotics. ISAIR 2023. Communications in Computer and Information Science, vol 1998. Springer, Singapore. https://doi.org/10.1007/978-981-99-9109-9_34
DOI: https://doi.org/10.1007/978-981-99-9109-9_34
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-9108-2
Online ISBN: 978-981-99-9109-9