Introduction

Sketch-Based Image Retrieval (SBIR) is a challenging and practical problem in computer vision [1]. Given a hand-drawn sketch as a query, sketch retrieval aims to return natural images whose visual content or class is similar to the sketch [2]. Since sketches and natural images belong to different modalities, the domain gap between them must be bridged. With the powerful modeling capabilities of deep learning, many sophisticated methods have been applied to SBIR, such as multiple independent networks [3], generative adversarial networks (GAN) [4], and graph convolution networks (GCN) [5]. However, existing SBIR models require a large-scale training dataset, which is unlikely to include all the categories encountered in realistic scenarios. This results in unsatisfactory performance when testing on unseen categories. To solve this problem, researchers proposed the zero-shot sketch-based image retrieval (ZS-SBIR) task, in which models are trained to retrieve corresponding images for free-hand sketches of categories that are not seen during the training stage. In this paper, we study the more challenging ZS-SBIR setting, which simultaneously addresses the inherent domain gap and the limited knowledge of unseen classes.

Currently, most ZS-SBIR works focus on mapping sketches and natural images into a low-dimensional common space and on domain generalization. For example, prior work has mapped sketches and natural images into a common space for retrieval via convolutional neural networks [6,7,8,9], generative adversarial networks [1, 10,11,12,13,14], and graph convolution networks [15] on the Sketchy and TU-Berlin datasets. However, there is a large domain gap between sketches and natural images, as shown in Fig. 1. The natural images in the first column of Fig. 1 and the sketches in the third column have different contents and styles. For example, natural images have textures and rich colors, while sketches are abstract, single-color representations of an object. Furthermore, there is no one-to-one correspondence between sketches and natural images. Therefore, simply projecting sketches and natural images into a common space is not effective enough to align multimodal features. Since contour maps and sketches share the same content and style, contour maps can effectively represent the target object in natural images. Therefore, effectively extracting contour maps from natural images is essential for narrowing the domain gap between sketches and natural images.

Fig. 1

The illustration of our CDNNet model, which maps natural image, contour, and sketch features into a common space. During the testing stage, the learned mappings are used to generate embeddings for the unseen classes in ZS-SBIR

Previous SBIR models [16, 17] used traditional edge detection methods, producing target contours with poor continuity and much background noise. To enhance the precision of contour detection, we explore a CNN-based contour detection model, which extracts contour maps with better continuity and less background noise. Building on this, we propose to fine-tune the contour detection network so that it is suitable for ZS-SBIR. Although deep learning approaches can effectively aggregate features into compact descriptors, they do not explore the correlation among feature layers, leading to unsatisfactory retrieval performance. Therefore, we introduce a second-order attention (SOA) mechanism [18] to capture the correlation among different feature layers and improve the feature descriptors for ZS-SBIR.

Accordingly, we propose a novel three-branch joint training network with a contour detection network (called CDNNet) for the ZS-SBIR task, which uses contour maps as a bridge to align sketches and natural images and thereby alleviate the domain gap. Specifically, CDNNet exploits a CNN-based contour detection model to generate one-to-one contour maps of natural images. Meanwhile, classification and semantic metric loss functions constrain the relationship among the branches so that contour features can align natural image features with sketch features. Moreover, this work is inspired by the second-order attention (SOA) mechanism [18], which makes the descriptors focus on the target object to improve descriptor quality. In addition, we introduce a teacher model [8] and a word embedding method [6] to ensure domain knowledge transfer. It is worth noting that CDNNet fine-tunes the contour detection network in an end-to-end manner to adapt it to the ZS-SBIR task. The main contributions are summarized as follows:

  1.

    We propose a novel three-branch joint training network with contour detection network (called CDNNet) for the ZS-SBIR task, which uses contour maps as a bridge to align sketches and natural images to alleviate the domain gap.

  2.

    We propose to fine-tune the contour detection network, which makes the contour detection network suitable for ZS-SBIR task.

  3.

    We introduce the second-order attention (SOA) mechanism to mine the second-order spatial structure of the retrieval feature layer, improving retrieval performance, and we perform a thorough ablation study on its effects.

  4.

    Extensive experiments demonstrate that CDNNet outperforms state-of-the-art CNN-based models on two large-scale datasets, Sketchy and TU-Berlin.

The remainder of this paper is organized as follows. “Related work” reviews the related work. “Proposed model” describes our CDNNet architecture and loss functions. “Experiments” compares our model with state-of-the-art ZS-SBIR models and presents a detailed ablation study analyzing the influence of each component. Finally, “Conclusion” concludes the paper.

Related work

Sketch-based image retrieval

Sketch-Based Image Retrieval (SBIR) is a cross-modal matching problem that aims to project natural images and sketches into a common space. Traditional SBIR methods mainly use SIFT [19], HOG [20], and similar descriptors to construct joint local or global representations of natural images and sketches. At present, CNN-based SBIR models mainly fall into coarse-grained (CG-SBIR) and fine-grained (FG-SBIR) approaches [21]. CG-SBIR focuses on sketch-to-natural-image retrieval at the category level, while FG-SBIR focuses on retrieval at the instance level. For CG-SBIR, Sangkloy et al. [22] proposed a triplet network that pulls samples of the same class closer and pushes samples of different classes farther apart. At present, researchers mainly focus on FG-SBIR. Yu et al. [16] proposed to reduce the domain gap between natural images and sketches by using edge maps: natural images are first converted into edge maps, and similarity between edge maps and sketches is then computed to obtain retrieval results. Lin et al. [17] argued that converting natural images into edge images for retrieval has drawbacks, such as requiring a large amount of data for pre-training and being sensitive to the quality of the edge images. In addition, Chen et al. [21] proposed an attention-based sketch retrieval network that uses a residual channel attention module, a local self-attention mechanism, and a spatial sequence Transformer to mine fine-grained details of sketches and images in each dimension. The aforementioned models improve retrieval performance from multiple perspectives. However, some of them rely on traditional edge detection, which yields discontinuous contours and introduces a lot of background noise. In contrast, we employ a CNN-based contour detection model to extract the contour maps of natural images more accurately, and use these contour maps as a bridge between sketches and natural images to reduce the domain gap. Moreover, this paper focuses on zero-shot sketch-based image retrieval, which addresses both sketch-based image retrieval and zero-shot learning.

Zero-shot learning

Zero-shot learning (ZSL) aims to identify new classes that are not present during training. Since collecting and annotating datasets requires substantial human and material resources, ZSL has received attention from several fields. Early ZSL exploited attributes in a two-stage method to infer the image label of an unseen class, whereas recent ZSL directly learns a mapping from image features to a semantic space [23]. Kodirov et al. [24] proposed a semantic auto-encoder to regularize the model, which enforces the projection of image features into a reconstructed semantic space. Besides, approaches such as learning nonlinear multimodal embeddings [25], embedding image and semantic features into a common space [26], and teacher models [8] have been successively proposed. It is worth noting that ZSL requires some form of side information, such as shared and nameable visual attributes of objects, in order to transfer knowledge learned from seen to unseen classes. However, such attributes usually require expensive manual annotation. Therefore, many researchers use other auxiliary information for retrieval to reduce manual annotation, e.g., text-based [12] and hierarchical models [10]. Motivated by [6], this paper utilizes the teacher model and constrains local metric learning and global semantics in one formulation to improve the transferability of the network.

Zero-shot sketch-based image retrieval

Zero-shot sketch-based image retrieval (ZS-SBIR) is an extremely challenging task that simultaneously addresses the inherent modal and semantic gaps. Shen et al. [15] proposed a ZS-SBIR model named ZSIH, which is the first work combining SBIR and ZSL. Yelamarthi et al. [13] used a deep conditional generative model that takes sketches as input and stochastically fills in the missing information. Dutta et al. [12] used a more complex generative model called SEM-PCYC, which maps the visual information of each branch to a common space. Subsequently, Zhu et al. [11] proposed the OCEAN model, which uses a dual learning approach to cyclically map sketch and image features to a common semantic space and projects the semantic features back to the relevant visual space through adversarial training. For multimodal input features, Lin et al. [14] proposed to find shared low-dimensional latent spaces and class embeddings using modality-specific variational autoencoders. In addition, Liu et al. [8] proposed using knowledge distillation to alleviate the problem of catastrophic forgetting. Dey et al. [7] presented a more complex sketch dataset and proposed an inter-domain mutual information mining strategy to alleviate the domain gap. Wang et al. [9] proposed a category-specific memory bank of sketch features to reduce the diversity within sketch classes. Although all of these methods effectively improve ZS-SBIR performance, they ignore contour maps in the ZS-SBIR task. In contrast, we propose a novel three-branch joint training network with a contour detection network (called CDNNet), which uses contour maps as a bridge to align sketches and natural images and reduce the domain gap. Moreover, we explore the correlation among feature layers using the second-order attention (SOA) mechanism, which makes the feature descriptors focus on the retrieval target.

Contour detection

Early contour detection approaches focused on computing the gradient magnitude in the brightness channel, such as the Canny operator [27]. Subsequently, researchers proposed contour detection methods based on the human visual system [28] and machine learning [29]. With the development of deep learning, convolutional neural networks have become widely used in computer vision, including fault diagnosis [30], road object detection [31], and contour detection [32,33,34]. Xie et al. [32] developed an end-to-end CNN model, called HED, which uses convolutional feature maps and a novel loss function to improve the effectiveness and accuracy of contour detection. Cao et al. [33] proposed a network stacking multiple refinement modules, which extracts rich feature representations. Zhang et al. [34] introduced a biological mechanism into the CNN according to the human visual system, which effectively improves the performance of the model. However, existing contour detection models are not adapted to ZS-SBIR. Therefore, we propose to fine-tune the contour detection model, which effectively improves its performance and makes it better adapted to the ZS-SBIR task.

Proposed model

Problem definition

The ZS-SBIR dataset contains a training set for training the retrieval model and a test set for evaluating retrieval performance. In this paper, we define \({O}_{tr}^{S}=\left\{{I}_{tr}^{S}, {S}_{tr}^{S}\right\}\) as the training set, also called the source domain, where \({I}_{tr}^{S}\) and \({S}_{tr}^{S}\) are natural images and sketches, respectively, and the superscript S and subscript tr denote the source domain and the training set, respectively. The test set (target domain) \({O}_{te}^{T}\) is defined analogously. Under the ZS-SBIR setting, the training set belongs to the seen classes and the test set belongs to the unseen classes, and the classes of the training and test sets do not overlap, so \({O}_{tr}^{S}\cap {O}_{te}^{T}=\varnothing \). In the training stage, CDNNet obtains a D-dimensional descriptor of each image by training the CNN model. In the test stage, CDNNet searches for natural images related to a given hand-drawn sketch under the zero-shot setting. Therefore, ZS-SBIR not only needs to align sketches with natural images, but also needs to transfer knowledge from the seen to the unseen classes.

The CDNNet architecture

As analyzed above, sketches and natural images have different styles and contents. A natural image contains rich color information and textured background, whereas a sketch contains only an abstract description of the object, and the contour map of a natural image has the same style as a sketch. Therefore, we propose to use CNN-based contour maps to align sketches and natural images in a common space and reduce the domain gap. Moreover, we exploit second-order attention to obtain continuous feature maps at different resolutions and improve the descriptor. Finally, following [6, 8], we utilize a teacher model and combine local metric learning and global semantic constraints into one formulation to ensure the transfer of seen-class knowledge to unseen classes.

We propose a novel three-branch joint training network with a contour detection network for the ZS-SBIR task, which efficiently aligns sketches with natural images using contour maps; the pipeline is shown in Fig. 2(a). Natural images and sketches are the inputs to CDNNet, and contour maps are generated by applying contour detection to the natural images. CDNNet thus mainly comprises a contour detection network (CDN), a retrieval feature extraction network, a retrieval descriptor generation network, and a teacher network. CDNNet adopts the CNN-based contour detection network HED [32] to obtain contours. In the retrieval feature extraction network, each branch uses ResNet50 [35] as the backbone to extract retrieval features. Moreover, we insert the second-order attention (SOA) module [18] into selected stages of the backbone network to improve the representation ability of the retrieval features. It is worth noting that the sketch branch in the retrieval feature extraction network shares soft weights [6] with the natural image and contour branches, respectively, to reduce imbalanced learning. Motivated by [6], we construct a retrieval descriptor generation network using a linear layer to satisfy the requirements of the retrieval dimension. In this paper, we use a ResNet50 pre-trained on ImageNet as the teacher model [8] and a word vector model pre-trained on the Google News dataset [36], which provides good transferability.
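To make the three-branch layout concrete, the following is a minimal PyTorch sketch of the structure described above, assuming a recent torchvision, separate ResNet50 backbones for the sketch, natural image, and contour branches, and a shared descriptor head corresponding to \({E}_{R}\). It omits the HED contour detector, the SOA modules, and the teacher network, and all class and variable names are illustrative rather than taken from the authors' implementation.

```python
import torch.nn as nn
from torchvision import models


def resnet50_trunk():
    """ResNet50 up to global average pooling (2048-d output)."""
    net = models.resnet50(weights="IMAGENET1K_V1")  # ImageNet pre-trained
    return nn.Sequential(*list(net.children())[:-1])


class ThreeBranchSketch(nn.Module):
    """Illustrative CDNNet-style layout: one backbone per modality,
    a shared retrieval-descriptor head E_R, and a seen-class classifier."""

    def __init__(self, descriptor_dim=512, num_seen_classes=100):
        super().__init__()
        self.sketch_enc = resnet50_trunk()
        self.image_enc = resnet50_trunk()
        self.contour_enc = resnet50_trunk()
        self.descriptor = nn.Linear(2048, descriptor_dim)        # E_R (Eq. 3)
        self.classifier = nn.Linear(descriptor_dim, num_seen_classes)

    def forward(self, sketch, image, contour):
        feats = {
            "sketch": self.sketch_enc(sketch).flatten(1),     # F^S
            "image": self.image_enc(image).flatten(1),        # F^I
            "contour": self.contour_enc(contour).flatten(1),  # F^C
        }
        descs = {k: self.descriptor(f) for k, f in feats.items()}   # R^S, R^I, R^C
        logits = {k: self.classifier(r) for k, r in descs.items()}  # inputs to L_cls
        return feats, descs, logits
```

The three backbones keep separate parameters (including independent batch normalization layers); they are only tied softly through the \({\mathcal{L}}_{SWS}\) penalty introduced below.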

Fig. 2

a The pipeline of the proposed CDNNet for the ZS-SBIR task, which uses contour maps as a bridge to align sketches and natural images. We first obtain contour maps of the natural images using the contour detection network. Next, a CNN extracts the retrieval features for each branch. Subsequently, we use the retrieval descriptor generation network to obtain D-dimensional feature descriptors. We use the soft weight sharing loss \({\mathcal{L}}_{SWS}\), knowledge distillation loss \({\mathcal{L}}_{kd}\), classification loss \({\mathcal{L}}_{cls}\), and semantic metric loss \({\mathcal{L}}_{sem}\) to guide the joint training of CDNNet. b The SOA module in our pipeline


Contour detection network There are plenty of algorithms for extracting contour maps, such as Canny [27], DIDY [28], HED [32], and DRC [33]. Compared with traditional edge detection and algorithms based on the human visual system, CNN-based contour detection networks obtain more precise contours. These precise contours reduce the gap between contour maps and natural images and effectively alleviate the domain gap between natural images and sketches. However, most existing contour detection models are trained on the BSDS dataset and are therefore not well adapted to ZS-SBIR. To address this issue, we propose to fine-tune the contour detection model on the ZS-SBIR task to alleviate the domain gap between sketches and natural images. We analyze the impact of different contour detection models and of fine-tuning on CDNNet performance in subsequent experiments. In this paper, we use HED [32] as the contour detection model.


Second-order attention module We use non-local blocks [18] to integrate second-order spatial information into the feature layer. The architecture of the SOA module is shown in Fig. 2(b). SOA uses 1 × 1 convolutions to project the feature map to a query head Q, a key head K, and a value head V. We then flatten Q and K to obtain tensors of shape D \(\times \) HW, and flatten and transpose V to obtain a tensor of shape HW \(\times \) D. The second-order attention map \(Z\) is computed as follows:

$$Z=\mathrm{softmax}\left(\mu \cdot {Q}^{\mathrm{T}}K\right),$$
(1)

where \(\mu \) is a scaling factor. Finally, the output feature map \(\widetilde{f}\) of the SOA block is computed as follows:

$$\widetilde{f}=f+\rho \left(ZV\right),$$
(2)

where \(\rho \) is a 1 \(\times \) 1 convolution and \(f\) is the input feature map. The shape of \(\widetilde{f}\) is D \(\times \) H \(\times \) W, the same as \(f\).
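A minimal PyTorch sketch of the SOA block of Eqs. (1) and (2) is given below. It follows the non-local formulation described above; the 1 × 1 projection layers are standard, while the choice of \(\mu =1/\sqrt{D}\) and the layer names are our own assumptions.

```python
import torch
import torch.nn as nn


class SecondOrderAttention(nn.Module):
    """Sketch of the SOA block: second-order (HW x HW) spatial attention
    added residually to the input feature map (Eqs. 1-2)."""

    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.key = nn.Conv2d(channels, channels, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.rho = nn.Conv2d(channels, channels, kernel_size=1)  # output projection rho
        self.mu = channels ** -0.5                               # scaling factor (assumed)

    def forward(self, f):
        b, d, h, w = f.shape
        q = self.query(f).flatten(2)                  # B x D x HW
        k = self.key(f).flatten(2)                    # B x D x HW
        v = self.value(f).flatten(2).transpose(1, 2)  # B x HW x D
        z = torch.softmax(self.mu * q.transpose(1, 2) @ k, dim=-1)  # Eq. (1): B x HW x HW
        out = (z @ v).transpose(1, 2).reshape(b, d, h, w)           # back to B x D x H x W
        return f + self.rho(out)                                    # Eq. (2)
```

In CDNNet this block would be inserted after selected residual stages of the ResNet50 backbone, e.g., after layer3 and layer4 for the SOA34 configuration examined in the ablation study.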


Retrieval descriptor generating network Following [6], our model reduces the dimensionality of the retrieval descriptors using a network composed of fully connected layers. Specifically, CDNNet uses the fully connected network \({E}_{R}\) with parameters \({\alpha }_{R}\) to obtain the retrieval descriptor of the corresponding image:

$${R}^{S}={E}_{R}\left({F}^{S};{\alpha }_{R}\right), {R}^{I}={E}_{R}\left({F}^{I};{\alpha }_{R}\right), {R}^{C}={E}_{R}\left({F}^{C};{\alpha }_{R}\right),$$
(3)

where \({R}^{S}\), \({R}^{I}\), and \({R}^{C}\) are the retrieval descriptors of sketches, natural images, and contour maps, respectively. In this paper, the semantic metric branch uses a linear layer to reduce the feature representation to the retrieval descriptor dimension, and the classification branch uses a linear layer to reduce the retrieval descriptor to the number of classes contained in the dataset.

Objective and optimization

Soft weight share loss

As mentioned in [6], hard shared weights are optimized only according to the image modality and bring many irrelevant parameters to the sketch feature branch, so the generalization ability of the network is poor. Therefore, we use a soft weight sharing mechanism that enables the network to learn similar geometric information. Specifically, given natural images, contour maps, and sketches, their retrieval features are defined as follows:

$${F}^{S}={E}_{S}\left(S;{\alpha }_{S}\right), {F}^{I}={E}_{I}\left(I;{\alpha }_{I}\right), {F}^{C}={E}_{C}\left(C;{\alpha }_{C}\right),$$
(4)

where \({E}_{S}\), \({E}_{I}\), and \({E}_{C}\) denote the sketch, natural image, and contour feature extraction networks with parameters \({\alpha }_{S}\), \({\alpha }_{I}\), and \({\alpha }_{C}\), respectively. \({F}^{S}\), \({F}^{I}\), and \({F}^{C}\) are the retrieval features of sketches, natural images, and contour maps, respectively. In this paper, the soft weight sharing loss is defined as follows:

$${\mathcal{L}}_{SWS}=\sum_{l}\mathbb{1}\left[l\notin BN\right]\cdot {\Vert {\alpha }_{S}^{l}-{\alpha }_{I}^{l}\Vert }_{2}^{2}+\sum_{l}\mathbb{1}\left[l\notin BN\right]\cdot {\Vert {\alpha }_{S}^{l}-{\alpha }_{C}^{l}\Vert }_{2}^{2},$$
(5)

where \({\alpha }_{S}^{l}\), \({\alpha }_{I}^{l}\), and \({\alpha }_{C}^{l}\) are the parameters of the sketch, natural image, and contour feature extraction networks at feature layer \(l\), respectively. \({\Vert {\alpha }_{S}^{l}-{\alpha }_{I}^{l}\Vert }_{2}^{2}\) means that the geometric modeling of the image feature extraction network is constrained by the sketch parameters; similarly, \({\Vert {\alpha }_{S}^{l}-{\alpha }_{C}^{l}\Vert }_{2}^{2}\) means that the geometric modeling of the contour feature extraction network is constrained by the sketch parameters. Besides, the indicator function \(\mathbb{1}\left[l\notin BN\right]\) equals 1 if layer \(l\) of the sketch, image, or contour extraction network is not a batch normalization layer, and 0 otherwise. The independent batch normalization layers thus allow modality-common information to be separated from modality-specific information.
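The soft weight sharing penalty can be computed directly from the named parameters of the three branches, as in the sketch below (assuming the branches share the same architecture, so parameter names correspond one-to-one); batch normalization parameters are skipped, matching the indicator in Eq. (5).

```python
import torch.nn as nn


def soft_weight_share_loss(sketch_net, image_net, contour_net):
    """Sketch of L_SWS (Eq. 5): squared L2 distance between corresponding
    (non-BN) parameters of the sketch branch and the image/contour branches."""
    bn_names = set()
    for mod_name, module in sketch_net.named_modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            bn_names.update(f"{mod_name}.{p}" for p, _ in module.named_parameters())

    image_params = dict(image_net.named_parameters())
    contour_params = dict(contour_net.named_parameters())
    loss = 0.0
    for name, p_sketch in sketch_net.named_parameters():
        if name in bn_names:
            continue  # indicator 1[l not in BN]: keep BN layers modality-specific
        loss = loss + (p_sketch - image_params[name]).pow(2).sum()
        loss = loss + (p_sketch - contour_params[name]).pow(2).sum()
    return loss
```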

Knowledge distillation loss

The teacher model pre-trained on ImageNet has reliable domain knowledge, which can build a bridge between seen and unseen classes in ZSL. Following SAKE [8], CDNNet uses a ResNet50 pre-trained on ImageNet as the teacher model to guide feature extraction. Since the parameters of the teacher model are not updated during training, the teacher model provides a robust guide for the student model. Specifically, when natural images are input into the teacher model, it produces a predicted soft label \({P}_{i}^{T}=Softmax\left({g}_{i}^{T}\right)\); likewise, the student model produces a label prediction \({P}_{i}^{s}=Softmax\left({g}_{i}^{s}\right)\), where \({g}_{i}^{T}\) and \({g}_{i}^{s}\) are the outputs of the knowledge distillation and classification branches for the i-th sample, respectively. We compute the knowledge distillation loss \({\mathcal{L}}_{kd}\) as a cross-entropy with soft labels as follows:

$${\mathcal{L}}_{kd}= \frac{1}{N}\sum_{i=1}^{N}-{P}_{i}^{s}\mathrm{log}\left({P}_{i}^{T}\right).$$
(6)
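A direct transcription of Eq. (6) into PyTorch is sketched below; the teacher logits come from the frozen ImageNet-pretrained ResNet50 and are therefore detached from the computation graph. Function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F


def knowledge_distillation_loss(student_logits, teacher_logits, eps=1e-12):
    """Sketch of L_kd (Eq. 6): cross-entropy between the student's soft
    predictions P^s and the frozen teacher's soft labels P^T."""
    p_teacher = F.softmax(teacher_logits.detach(), dim=1)  # teacher is not updated
    p_student = F.softmax(student_logits, dim=1)
    return -(p_student * torch.log(p_teacher + eps)).sum(dim=1).mean()
```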

Classification loss

To distinguish the learned feature descriptors of sketches, natural images, and contour maps, we use a cross-entropy loss to align the learned features with their labels as follows:

$${\mathcal{L}}_{cls}=-\sum_{i=1}^{N}log\frac{\mathrm{exp}({\beta }_{i}^{\mathrm{T}}{R}^{m}+{\gamma }_{i})}{\sum_{j\in {C}^{seen}}\mathrm{exp}({\beta }_{j}^{\mathrm{T}}{R}^{m}+{\gamma }_{j})},$$
(7)

where \(\beta \) and \(\gamma \) are the weight and bias of the classifier, respectively, and \(\mathrm{m}\in \left\{sketch, image, contour\right\}\).
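In practice this is a standard softmax cross-entropy over the seen classes applied to each modality's descriptors; a minimal sketch follows (using a summed reduction to match Eq. (7), with illustrative names).

```python
import torch.nn.functional as F


def classification_loss(logits_sketch, logits_image, logits_contour, labels):
    """Sketch of L_cls (Eq. 7) applied to all three modalities; the classifier
    weights beta and biases gamma live in the linear head producing the logits."""
    return sum(
        F.cross_entropy(logits, labels, reduction="sum")
        for logits in (logits_sketch, logits_image, logits_contour)
    )
```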

Semantic metric loss

CDNNet contains three branches: the natural image feature extraction branch, the contour feature extraction branch, and the sketch feature extraction branch. Features from the three branches belonging to the same class are pulled closer together, while features belonging to different classes are pushed farther apart. In addition, we use semantic information to guide class discrimination. During triplet selection, for each anchor sample we select the hardest positive (the farthest positive) and the hardest negative (the nearest negative) within each batch, and we choose the semantic vector of each category as the global anchor. Therefore, CDNNet adopts the same semantic metric learning approach as [6], constraining local metric learning and global semantics in one formulation, as in Eq. (8):

$$ \begin{gathered} R_{j}^{SA} = E_{A} \left( {Y_{j}^{S} ;\alpha_{A} } \right), \hfill \\ \tilde{R}_{j}^{A} = \omega R_{j}^{SA} + \left( {1 - \omega } \right)R_{j}^{A}, \quad \omega \sim U\left( {0,1} \right), \hfill \\ R_{j}^{P} = \mathop{\mathrm{argmax}}\limits_{k,\, Y_{k}^{C} = Y_{j}^{C}} Dist\left( {\tilde{R}_{j}^{A} ,R_{k} } \right), \hfill \\ R_{j}^{N} = \mathop{\mathrm{argmin}}\limits_{k,\, Y_{k}^{C} \ne Y_{j}^{C}} Dist\left( {\tilde{R}_{j}^{A} ,R_{k} } \right), \hfill \\ \end{gathered} $$
(8)

where \({E}_{A}\) is a fully connected layer with parameters \({\alpha }_{A}\), which transforms the semantic vector \({Y}_{j}^{S}\) into a feature embedding \({R}_{j}^{SA}\), and \({R}_{j}^{A}\) is the visual retrieval descriptor of the j-th anchor sample. The new semantic-based anchor \({\widetilde{R}}_{j}^{A}\) is generated via uniform interpolation. \({R}_{j}^{P}\) and \({R}_{j}^{N}\) are the hardest positive and negative examples, respectively. The semantic metric loss is then defined as follows:

$${\mathcal{L}}_{sem}= \sum_{j=1}^{B}\varepsilon \left(Dist\left({\widetilde{R}}_{j}^{A},{R}_{j}^{P}\right)-Dist\left({\widetilde{R}}_{j}^{A},{R}_{j}^{N}\right)\right),$$
(9)

where \(Dist\) is the standard Euclidean distance and \(\varepsilon \) is the Softplus activation. We define \({\mathcal{L}}_{\mathrm{semi}}\) as the semantic metric loss \({\mathcal{L}}_{\mathrm{sem}}\) computed between natural images and contour maps, and \({\mathcal{L}}_{\mathrm{sems}}\) as the semantic metric loss \({\mathcal{L}}_{\mathrm{sem}}\) computed between contour maps and sketches.
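The semantic metric loss can be sketched as follows: the anchor descriptors of one modality are interpolated with their embedded word vectors (Eq. 8), the hardest positive and negative are mined from the other modality within the batch, and Softplus is applied to the distance margin (Eq. 9). The sketch assumes each anchor has at least one same-class and one different-class sample in the batch; tensor names are illustrative.

```python
import torch
import torch.nn.functional as F


def semantic_metric_loss(anchors, sem_anchors, gallery, anchor_labels, gallery_labels):
    """Sketch of Eqs. (8)-(9): anchors = visual descriptors R^A, sem_anchors =
    embedded semantic vectors R^SA, gallery = candidate descriptors R_k of the
    other modality within the batch."""
    # Uniform interpolation between semantic and visual anchors (Eq. 8).
    omega = torch.rand(anchors.size(0), 1, device=anchors.device)
    mixed = omega * sem_anchors + (1 - omega) * anchors           # \tilde{R}^A

    dist = torch.cdist(mixed, gallery)                            # Euclidean distances
    same = anchor_labels.unsqueeze(1) == gallery_labels.unsqueeze(0)

    d_pos = dist.masked_fill(~same, float("-inf")).max(dim=1).values  # farthest positive
    d_neg = dist.masked_fill(same, float("inf")).min(dim=1).values    # nearest negative

    return F.softplus(d_pos - d_neg).sum()                        # Eq. (9)
```

Calling this once with (image, contour) descriptors and once with (contour, sketch) descriptors yields \({\mathcal{L}}_{\mathrm{semi}}\) and \({\mathcal{L}}_{\mathrm{sems}}\), respectively.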

Overall objective loss

With the proposed network and losses described above, the overall objective of our model is defined as follows:

$$\mathcal{L}={\lambda }_{\mathrm{SWS}}{\mathcal{L}}_{\mathrm{SWS}}+{\lambda }_{\mathrm{kd}}{\mathcal{L}}_{\mathrm{kd}}+{\lambda }_{\mathrm{cls}}{\mathcal{L}}_{\mathrm{cls}}+{\lambda }_{\mathrm{semi}}{\mathcal{L}}_{\mathrm{semi}}+{\lambda }_{\mathrm{sems}}{\mathcal{L}}_{\mathrm{sems}},$$
(10)

where \({\lambda }_{\mathrm{SWS}}\), \({\lambda }_{\mathrm{kd}}\), \({\lambda }_{\mathrm{cls}}\), \({\lambda }_{\mathrm{semi}}\), and \({\lambda }_{\mathrm{sems}}\) are coefficients that balance the individual terms, and all losses are trained end-to-end simultaneously.
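Combining the individual terms is then a weighted sum; the default weights in the sketch below follow the values reported later in the implementation details and are otherwise placeholders.

```python
def total_loss(l_sws, l_kd, l_cls, l_semi, l_sems,
               lam_sws=1.0, lam_kd=1.0, lam_cls=1.0,
               lam_semi=1000.0, lam_sems=1000.0):
    """Sketch of the overall objective in Eq. (10)."""
    return (lam_sws * l_sws + lam_kd * l_kd + lam_cls * l_cls
            + lam_semi * l_semi + lam_sems * l_sems)
```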

Experiments

In this section, we first introduce the datasets, evaluation metrics, and implementation details. We then conduct experiments on two public large-scale ZS-SBIR datasets, Sketchy [37] and TU-Berlin [37], and compare the proposed model with state-of-the-art models. We also perform ablation experiments on the Sketchy and TU-Berlin datasets to evaluate the contribution of each component.

Datasets and evaluation metric

To verify the effectiveness of our model, we conduct our experiments on two popular large-scale ZS-SBIR datasets, i.e., Sketchy and TU-Berlin.

The Sketchy dataset is a large-scale sketch retrieval dataset that originally consists of 75,749 sketches and 12,500 natural images. An extended version was later proposed in [37], which adds 60,502 images collected from ImageNet, so the Sketchy dataset currently contains 73,002 natural images in 125 classes. Following the setting in [6], we select 100 classes for training and 25 classes for testing, which we refer to as Split1. Following the setting in [13], we also use the stricter Split2, which consists of 21 test classes that do not appear in the ImageNet dataset and 104 training classes.

The TU-Berlin [37] extended dataset contains 20,000 sketches, 204,489 natural images, and 250 classes. Following the setting in [6], we randomly select 30 classes for testing and use the remaining 220 classes for training, so that each test class has at least 400 natural images, which satisfies the retrieval requirement.

To evaluate the performance of the proposed model, we follow the ZS-SBIR evaluation criteria [6, 7, 10] in terms of mean average precision (mAP) and precision (Prec). Given a query sketch and a list of K ranked retrieval results, the AP for this query is defined as follows:

$$\mathrm{AP}\left(K\right)=\frac{1}{K}\sum_{r=1}^{K}\delta \left(r\right),$$
(11)

where \(\delta \left(r\right)=1\) if the r-th retrieved candidate natural image corresponds to the query sketch, and \(\delta \left(r\right)=0\) otherwise. The mAP is then obtained by averaging AP over all Q query sketches:

$$\mathrm{mAP}\left(K\right)=\frac{1}{Q}\sum_{q=1}^{Q}{\mathrm{AP}}_{q}\left(K\right).$$
(12)
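For reference, the sketch below implements the metrics exactly as written in Eqs. (11) and (12), i.e., AP@K as the fraction of relevant candidates among the top K and mAP as its mean over query sketches; note that this simplified form does not weight hits by rank as the classical AP does.

```python
import numpy as np


def ap_at_k(relevance, k):
    """Eq. (11): fraction of relevant images among the top-k results of one query."""
    return float(np.sum(relevance[:k])) / k


def map_at_k(relevance_per_query, k):
    """Eq. (12): AP averaged over all query sketches."""
    return float(np.mean([ap_at_k(r, k) for r in relevance_per_query]))


# Toy usage: binary relevance of the ranked candidates for two query sketches.
ranked = [[1, 0, 1, 1, 0], [0, 1, 1, 0, 0]]
print(map_at_k(ranked, k=5))  # (3/5 + 2/5) / 2 = 0.5
```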

Implementation details

We implement our network with the publicly available PyTorch framework on Tesla P100 GPUs. In the CDNNet model, the feature extraction network of each branch and the teacher network for knowledge distillation adopt a ResNet-50 pre-trained on ImageNet. Following [6], we extract word vectors as category-level semantic information using a text model pre-trained on the Google News dataset [36].

We train the model with the Adam optimizer and set its weight decay to \(5\times {10}^{-4}\). The learning rate is set to \(1\times {10}^{-4}\) and exponentially decays to \(3\times {10}^{-6}\). The batch size and the maximum number of training epochs are 32 and 20, respectively. \({\lambda }_{\mathrm{SWS}}\), \({\lambda }_{\mathrm{kd}}\), and \({\lambda }_{\mathrm{cls}}\) are set to 1, while \({\lambda }_{\mathrm{semi}}\) and \({\lambda }_{\mathrm{sems}}\) are set to 1000; these values are determined empirically from the retrieval performance.
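A sketch of this training configuration in PyTorch is given below; the per-epoch exponential decay factor is our own assumption, chosen so that the learning rate reaches roughly \(3\times {10}^{-6}\) after 20 epochs.

```python
import torch

model = torch.nn.Linear(2048, 512)  # placeholder for the CDNNet parameters

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
gamma = (3e-6 / 1e-4) ** (1 / 20)   # ~0.84 per epoch, assumed decay schedule
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

for epoch in range(20):
    # ... one training pass over the data with batch size 32 ...
    scheduler.step()
```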

In the testing stage, for a fair comparison with previous hashing-based models [6, 10], CDNNet also uses the iterative quantization (ITQ) [38] algorithm to construct hash codes from the real-valued retrieval features. The retrieval results are then ranked according to Hamming distance.
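For completeness, a minimal NumPy sketch of the ITQ step is shown below: PCA-reduce the real-valued retrieval features to the code length, then alternate between binarizing the rotated projections and solving an orthogonal Procrustes problem for the rotation. This is a simplified rendering of [38], not the exact implementation used in our experiments.

```python
import numpy as np


def itq_hash(features, n_bits=64, n_iter=50, seed=0):
    """Simplified ITQ [38]: PCA projection followed by an alternating
    optimization of binary codes B and an orthogonal rotation R."""
    rng = np.random.RandomState(seed)
    X = features - features.mean(axis=0)
    # PCA: keep the top n_bits principal directions.
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:n_bits]]
    V = X @ W
    # Random orthogonal initialization of the rotation.
    R, _ = np.linalg.qr(rng.randn(n_bits, n_bits))
    for _ in range(n_iter):
        B = np.where(V @ R >= 0, 1.0, -1.0)
        # Orthogonal Procrustes: rotation best aligning V with B.
        U, _, Wt = np.linalg.svd(V.T @ B)
        R = U @ Wt
    return ((V @ R) >= 0).astype(np.uint8)


def hamming_distance(query_code, gallery_codes):
    """Rank gallery items by Hamming distance to the query code."""
    return (query_code[None, :] != gallery_codes).sum(axis=1)
```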

Comparing with existing models

To evaluate the performance of CDNNet, we compare it with state-of-the-art SBIR, ZSL, and ZS-SBIR models on the Sketchy and TU-Berlin datasets in Table 1. As can be seen from Table 1, ZS-SBIR models outperform SBIR and ZSL models. The reason is that SBIR mainly addresses the domain gap and ZSL mainly addresses knowledge transfer, whereas ZS-SBIR addresses both problems and can exploit more information, so the retrieval performance is higher. As shown in Table 1, compared with the state-of-the-art CNN-based TCN [6] model, CDNNet boosts mAP@all by up to 2.6% on Sketchy Split1 and 1.2% on TU-Berlin. It is worth noting that all models perform worse on TU-Berlin, which may be because it contains more classes and more abstract sketches. We also report the performance of CDNNet on the more challenging Sketchy Split2, where CDNNet improves by 0.3% in mAP@200 and 0.6% in Prec@200 compared with the TCN model. Therefore, CDNNet obtains significant results on multiple retrieval datasets.

Table 1 The performance between CDNNet and existing models

We compare the performance of each model more intuitively in Fig. 3, which shows the performance of each model on Sketchy Split1 mAP@all (SS1-mAP@all), Sketchy Split1 Prec@100 (SS1-Prec@100), Sketchy Split2 mAP@200 (SS2-mAP@200), Sketchy Split2 Prec@200 (SS2-Prec@200), TU-Berlin mAP@all (TU-mAP@all), and TU-Berlin Prec@100 (TU-Prec@100), respectively. For a fair comparison, we report each model without ITQ. As shown in Fig. 3, our model achieves the best performance on every benchmark, which demonstrates its effectiveness. We also conduct experiments on the more challenging Sketchy Split2, which strictly selects classes outside ImageNet as the testing set; CDNNet again performs better, indicating better transferability. We can also see that every model performs better on the Sketchy dataset than on the TU-Berlin dataset, indicating that TU-Berlin is more challenging. We attribute this to the TU-Berlin sketches being more abstract than those in Sketchy, so novel methods for abstract sketch retrieval still need to be explored.

Fig. 3

The performance of CDNNet and existing models on Sketchy and TU-Berlin using two evaluation metrics. Top: mAP. Bottom: Prec

In Fig. 4, we show the top 10 retrieval results of the CDNNet model with 512-d real-valued features on the Sketchy and TU-Berlin datasets. The first to fourth rows present the Sketchy retrieval results, and the fifth to eighth rows show the TU-Berlin retrieval results. Green boxes represent positive retrieval results and red boxes represent negative ones. As shown in Fig. 4, CDNNet returns more positive examples in most cases. Our model also produces incorrect retrieval results, such as in the third row of Fig. 4, because these objects contain similar structural information, making it difficult for the model to distinguish them from each other. Comparing the fourth and fifth rows, the windmill sketch in TU-Berlin is more abstract than that in the Sketchy dataset, so the Sketchy query returns more positive images. In addition, the pizza in the last TU-Berlin row has a shape and contour similar to a frying pan, leading to wrong retrieval results. In general, we observe that the wrongly retrieved candidates mostly have close visual and semantic relevance to the query.

Fig. 4

The top 10 zero-shot sketch-based image retrieval results of our CDNNet model with 512-d real-valued features on the Sketchy and TU-Berlin datasets. Green and red boxes represent positive and negative retrieval results, respectively

Ablation studies

Impact of CDN and SOA module

To verify the impact of each module on CDNNet, we evaluate the contour detection network (CDN) and the second-order attention (SOA) module on the Sketchy and TU-Berlin datasets, respectively. The results are shown in Table 2. The first row of Table 2 shows the results of the TCN model, while the second row shows the performance when the CDN is added to the TCN model. As shown in Table 2, the CDN improves the mAP@all of TCN by 1.6% and 0.7% on the Sketchy and TU-Berlin datasets, respectively. Although adding the SOA module alone to TCN yields only a slight improvement, adding the combination of the CDN and the SOA module improves mAP@all by 2.6% and 1.2% on the Sketchy and TU-Berlin datasets, respectively. Therefore, CDNNet significantly improves performance on the Sketchy and TU-Berlin datasets by using contour maps as a bridge between natural images and sketches, which alleviates the domain gap.

Table 2 Ablation studies of each component

To demonstrate the performance of CDN and SOA, we select a query from the Sketchy dataset and show the top 10 retrieval results in Fig. 5. The first row shows the TCN retrieval results. The second row shows the results of CDNNet with CDN but without SOA (CDNNet w/o SOA). The third row shows the results of CDNNet with SOA but without CDN (CDNNet w/o CDN). The last row shows the results of CDNNet with both CDN and SOA.

Fig. 5

Qualitative results of the CDN and SOA on the Sketchy dataset. The rows from top to bottom correspond to TCN, CDNNet w/o SOA, CDNNet w/o CDN, and CDNNet, respectively

As seen in the second and third rows of Fig. 5, the network returns more positive examples when CDN or SOA is added to the TCN model. As seen in the last row, CDNNet returns more positive examples than CDNNet w/o SOA and CDNNet w/o CDN, which indicates that adding both CDN and SOA effectively improves retrieval performance. We also find that the structural information of the windmill sketch is similar to that of the umbrella, so our model returns many windmill images. This motivates future work on using other visual information to distinguish objects that share similar structures but belong to different classes.

Figure 6 shows a bar chart comparing the mAP of the CDNNet and TCN models on the 25 test classes of the Sketchy dataset, with the class names shown on the x-axis. In Fig. 6, we can see that compared with the TCN model, CDNNet improves the mAP of the volcano and squirrel classes by 7.4% and 6%, respectively, while the mAP of the rifle class decreases by 1.3%. Analyzing the images and sketches of the rifle class, we believe that the images of this class contain a large amount of human information while the sketches contain only the rifle. Therefore, there are still some shortcomings in using contours to align images and sketches. To understand the kinds of features generated by CDNNet, we visualize the generated sketch features of the Sketchy dataset in Fig. 7 via the t-SNE [39] method. From Fig. 7, we can see that the retrieval feature distribution of CDNNet is more discriminative than that of the TCN model. In general, introducing contour information and second-order attention enhances category discrimination, thereby significantly improving retrieval performance on most categories.

Fig. 6

Comparison of mAP performance between TCN and CDNNet on the Sketchy dataset using 512-dimensional features for 25 categories

Fig. 7

The t-SNE visualization of the retrieval features learned from the sketch modality by the TCN and CDNNet models on the Sketchy dataset. We sample 25 classes from the test categories and visualize the distribution of the features. Each color represents a particular class

Impact of the SOA module

We evaluate the performance of adding SOA to ResNet50, which contains five convolutional blocks, conv1 to conv5. We insert SOA after one to three blocks among layers 3 to 5, resulting in the variants SOA3, SOA34, SOA35, and SOA345; for example, SOA3 in Table 3 means that SOA is inserted after the third layer. Table 3 shows the performance of each SOA variant on the Sketchy and TU-Berlin datasets and also includes results using ITQ. On the Sketchy dataset, inserting SOA improves mAP@all over TCN by 2.2% with SOA3, 2.6% with SOA34, 1% with SOA35, and 1.9% with SOA345. We can see that SOA34 brings a smaller improvement on the TU-Berlin dataset, which we attribute to the TU-Berlin sketches being more abstract. To illustrate the SOA performance more intuitively, we show the results of the various configurations in Fig. 8. Comparing SOA34 with SOA3, SOA35, and SOA345, we observe that inserting SOA at consecutive feature levels (SOA34) performs noticeably better. Therefore, we select SOA34 for the final model.

Table 3 Comparison of the performance of inserting SOA after different layers. The subscript “b” denotes results obtained with ITQ. The best CDNNet and best CDNNet-with-ITQ results are marked in bold and underlined, respectively
Fig. 8

Comparison of the performance of inserting SOA after different layers on the Sketchy and TU-Berlin datasets. The second column indicates the use of the ITQ algorithm to construct hash codes from the real-valued retrieval features

Impact of the different contour detection network

In our experiments, we test the impact of different contour detection networks. The results are shown in Table 4, which reports the performance of each contour detection network on BSDS500 using three standard metrics [28]: the optimal dataset scale (ODS), the F-score obtained with a single fixed threshold applied to every image; the optimal image scale (OIS), the F-score obtained when an optimal threshold is selected per image; and the average precision (AP) over a given threshold range. In Table 4, DRC-SS [33] and DRC-MS [33] denote single-scale and multi-scale testing, respectively. Compared with traditional methods, CNN-based models bring a large improvement. At present, the best contour detection model achieves an ODS of 0.83 [40]; due to space limitations, we refer the reader to [40] for more details on contour detection. To optimize the contour detection model jointly with training, we select the single-scale HED and DRC as contour extraction networks for our experiments.

Table 4 Quantitative comparison on the BSDS500 test set

Table 5 shows the performance of HED and DRC on the Sketchy and TU-Berlin datasets, where both models are trained jointly with CDNNet under the same experimental configuration. We observe that HED improves over DRC from 0.636 to 0.642 on Sketchy and from 0.505 to 0.507 on TU-Berlin. To ensure a fair comparison, we also include the results of HED without second-order attention (HED w/o SOA) and DRC without second-order attention (DRC w/o SOA); the retrieval performance of HED is still better than that of DRC. Therefore, we select HED as the contour detection network according to these results. Moreover, Table 5 also reports the results of HED without fine-tuning (HED w/o fine-tuning). We observe that jointly fine-tuning HED with CDNNet yields better results than HED w/o fine-tuning, so fine-tuning the contour detection model within the sketch retrieval model achieves better results. Based on these experiments, we conclude that a CNN-based contour detection model achieves a better alignment of sketches and natural images in the common space.

Table 5 Performance comparison of using different contour detection model in CDN with 512 dimensional features

Impact of the different loss terms

To evaluate the contribution of each loss term in the training stage, we perform an ablation study by removing one loss term from Eq. (10) at a time. The results of these variants on Sketchy and TU-Berlin are shown in Table 6, where “w/o” indicates the removed loss term. First, we train the model without the soft weight sharing loss (CDNNet w/o \({\mathcal{L}}_{\mathrm{SWS}}\)). We observe poor performance, which we attribute to model overfitting; soft weight sharing is therefore very important for the convergence of the model. The teacher model guides the student model in feature extraction and establishes a bridge between seen and unseen classes; accordingly, when we remove the knowledge distillation loss (CDNNet w/o \({\mathcal{L}}_{\mathrm{kd}}\)), we observe a decrease in performance on both the Sketchy and TU-Berlin datasets. Next, we remove the classification loss (CDNNet w/o \({\mathcal{L}}_{\mathrm{cls}}\)); the mAP of our model decreases by 3% on Sketchy and 2% on TU-Berlin, which validates the importance of \({\mathcal{L}}_{\mathrm{cls}}\) in making the clusters of the modalities class-wise discernible. Finally, we train the model without the semantic metric loss; a clear drop in all evaluation metrics verifies that the semantic loss effectively establishes the relationship among contour maps, natural images, and sketches.

Table 6 Ablation results for each loss term on Sketchy and TU-Berlin

Impact of the different retrieval dimensions

In Fig. 9, we compare the performance of CDNNet and TCN using 64, 128, and 512 retrieval dimensions on the Sketchy and TU-Berlin datasets. On the Sketchy dataset, compared with the TCN model, CDNNet improves by 0.7%, 1.6%, and 2.6% at 64, 128, and 512 retrieval dimensions, respectively. On the TU-Berlin dataset, the retrieval performance of CDNNet is lower than that of the TCN model at 64 dimensions; however, as the retrieval dimension increases, CDNNet outperforms the TCN model, improving by 0.3% and 1.2% at 128 and 512 dimensions, respectively. In addition, from 128 to 512 dimensions TCN improves from 0.466 to 0.495, while CDNNet improves from 0.469 to 0.507, indicating that our model benefits more from higher dimensions. We consider that the TU-Berlin sketches are more abstract, so CDNNet needs higher-dimensional retrieval descriptors to guarantee retrieval accuracy.

Fig. 9

The mAP@all results of CDNNet and TCN using 64, 128, and 512 retrieval dimensions on the Sketchy and TU-Berlin datasets

Conclusion

To the best of our knowledge, we are the first to propose a three-branch joint training network with a contour detection network and a second-order attention (SOA) module for the ZS-SBIR task, which uses contour maps as a bridge to align sketches and natural images and alleviate the domain gap. Specifically, the proposed CDNNet employs contour maps extracted from natural images to align natural images and sketches in a common space, and employs the teacher and word vector models to ensure knowledge transfer from seen to unseen classes. Experiments show that these components are essential for cross-modal knowledge transfer and domain gap reduction. Moreover, this paper employs SOA to mine second-order spatial structure information in the retrieval feature layer, which enables the retrieval descriptors to focus on the target object. We conduct detailed quantitative and qualitative studies on incorporating SOA, which learns to effectively re-weight feature maps and improves retrieval performance; the SOA module can therefore be seamlessly employed in other related tasks. In addition, we found that fine-tuning the contour detection network on the ZS-SBIR task achieves better results, which indicates that the fine-tuned contour detection model is better adapted to ZS-SBIR. We validate the effectiveness of the proposed model against existing ZS-SBIR models on two large-scale sketch retrieval datasets, Sketchy and TU-Berlin.

Through the above experiments, we found that existing models have difficulty distinguishing visually similar images when retrieving from abstract sketches. We also observe that the performance of CDNNet with low-dimensional retrieval descriptors degrades on TU-Berlin, so we need to explore how to represent abstract sketches with low-dimensional retrieval descriptors. In the future, for TU-Berlin and other highly abstract sketch datasets, we still need to explore more effective feature extraction models to distinguish objects with similar structures. Given the Transformer's ability to capture global context in computer vision [2], we will explore extracting image features with Transformers. In addition, most existing ZS-SBIR models use large models to enhance retrieval performance but ignore retrieval efficiency; our future work will therefore focus on lightweight and compact networks while maintaining retrieval effectiveness.