Introduction

As one of the key technologies in information management, content retrieval has a wide range of application prospects and value. It is commonly said that an image is worth a thousand words, and sketches provide an easy way to express expectations and describe target candidates visually and concisely [1, 2]. Sketch-based content retrieval has therefore drawn great attention, and improving the performance and generalization of existing retrieval algorithms is of great significance [3, 4].

With the rapid development of multimedia technology and the widespread use of smart devices, the number of images and 3D shapes has grown rapidly [5,6,7], leading to a growing demand for effective retrieval of such multimedia data. Sketch-based 3D shape retrieval (SBSR) and sketch-based image retrieval (SBIR) have therefore become hot topics in computer vision.

In recent decades, with the rapid development of computer hardware and the improvement of dataset quality, deep learning methods have been widely used in many fields [8,9,10,11,12,13]. SBSR and SBIR remain challenging because (1) query sketches and target candidates come from two domains separated by a large gap, and (2) features extracted from different modalities follow quite different distributions. Researchers have proposed many methods to address these problems; their common framework is summarized in Fig. 1. For example, Dai et al. proposed the deep correlated metric learning (DCML) and deep correlated holistic metric learning (DCHML) architectures, which use nonlinear transformations to generate unified and robust feature representations for sketches and 3D models [14, 15]. Lei et al. proposed a deep point-to-subspace metric learning (DPSML) framework, which optimizes the similarity distances between selected representative views and sketches [16]. Lin et al. proposed the triplet classification network (TC-Net), which relies on a triplet Siamese network with an auxiliary classification loss for SBIR [17]. These representative approaches share a common drawback: they focus on metric learning but ignore the data distribution within each specific domain, which limits how well the cross-domain discrepancy can be mitigated, even though eliminating the domain gap greatly benefits retrieval performance.

Fig. 1 An example of the SBSR framework

In recent years, one often wishes to conduct SBIR on object categories without training data because manual labeling is difficult and costly [18], which largely motivates zero-shot SBIR (ZS-SBIR) [19,20,21]. ZS-SBIR shares the same challenges as conventional SBIR. Additionally, the “zero-shot” setting assumes that the training class set and the test class set are disjoint, which makes it more generic and practical but introduces an even greater challenge: the knowledge gap between seen and unseen categories. To tackle these challenges, some studies transform sketches and images into a common space and rely on domain adaptation algorithms for data alignment. Semantic information such as Word2Vec, GloVe, FastText and WordNet is then adopted to constrain the relationship between the transformed features. These methods apply external text-based or hierarchical models to understand the attribute mappings in a semantic space, as shown in Fig. 2. Nevertheless, they do not explicitly encourage the correlation of features. Using convolutional neural networks (CNNs) to extract features from both domains is limited, and even domain adaptation does not fully account for the characteristics of sketches and images. As a result, the domain gap, semantic gap and knowledge gap may not be effectively reduced. Cross-domain feature correlation is the key part of cross-domain retrieval networks, and exploring visual relevance to strengthen the correlation between domains is important for improving the effectiveness and robustness of the whole network.

Fig. 2 An example of the ZS-SBIR framework

Existing methods treat the SBSR and ZS-SBIR tasks separately. Technically, the key step in both tasks is to generate discriminative feature representations of query sketches and target candidates so that their semantic similarity can be computed; for example, objects belonging to the same category should have similar or identical representations in the feature space. It is therefore entirely possible to design a general framework that addresses both tasks. In addition, previous SBSR methods only learn local correlation. Note that “local” here means that only the cross-domain feature correlation is enhanced in the common space (marked in pink in Fig. 1), while there is no, or only weak, feature correlation in the initial spaces built by the feature extractors (marked in blue and green in Fig. 1). Despite the excellent representational power of CNNs, these methods are also limited in modeling global structural information due to the inherently local nature of convolution operations.

To address these challenges and the shortcomings of previous methods, we propose a cross-domain learning framework called global semantics correlation transmitting and learning (GSCTL) to extract robust features from data with different modalities. In GSCTL, we first construct two distinct feature extractors to obtain modality-related features. Next, we construct a public space encoder that projects these features into a shared subspace to produce the projected features. Further, we propose a joint learning strategy to improve the quality of the projected features. The joint learning strategy integrates four types of learning: semantic consistency learning, domain consistency learning, feature correlation learning and feature similarity learning. Feature correlation learning and semantic consistency learning together ensure that the extracted features are highly robust and discriminative. Domain consistency learning reduces the large cross-modality heterogeneity, and feature similarity learning further reduces the discrepancy between cross-domain features.

The contributions of this paper are summarized as follows:

  1. We introduce a joint network for cross-domain feature learning to address SBSR and ZS-SBIR, which considers intra-class similarity and inter-class distinction for data alignment. The architecture simultaneously handles the vast domain gap, the semantic gap, and the limited knowledge about unseen categories.

  2. We propose a novel global correlation transmitting approach for sketch-based cross-domain visual retrieval. By introducing correlation learning in the initial spaces, more semantic and visual information can be captured and retained during training, thus effectively bridging the knowledge gap in the zero-shot setting.

  3. We elaborate a simple yet effective loss function for feature similarity learning, which increases the compactness of intra-class features and the separation of inter-class features through distance optimization.

  4. Extensive experiments are conducted on five benchmarks, i.e., Sketchy, TU-Berlin, RSketch, SHREC 2013 and SHREC 2014, demonstrating the effectiveness of the proposed approach for sketch-based 3D shape retrieval and zero-shot sketch-based image retrieval.

Related work

Domain adaptation

Domain adaptation maps data from query and target domains with different distributions into a feature space in which they are as close to each other as possible, and is thus an effective strategy for domain alignment. Widely used methods include maximum mean discrepancy (MMD) [22], central moment discrepancy (CMD) [23], adversarial learning (AD) [24] and the gradient reversal layer (GRL) [25]. MMD and CMD are distance metrics that measure the difference between the distributions of two representations; intuitively, the MMD or CMD distance decreases as the two distributions become more similar. Adversarial learning and the GRL are used to confuse a modality discriminator so that it cannot distinguish which modality the features come from. Adversarial learning is a very popular strategy widely used to handle the heterogeneous distributions of different modalities [26,27,28]. In recent years, the GRL has also been widely used in multiple cross-domain retrieval tasks and has demonstrated its effectiveness [27, 29].
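As a concrete illustration, the following is a minimal sketch of a biased Gaussian-kernel MMD estimate between two feature batches; the function name and bandwidth choice are ours and not taken from the cited works.

```python
import torch

def gaussian_mmd(x, y, sigma=1.0):
    """Biased MMD^2 estimate between two feature batches with an RBF kernel."""
    def kernel(a, b):
        # Pairwise squared Euclidean distances, then RBF kernel values.
        d2 = torch.cdist(a, b, p=2) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))

    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

# Example: MMD between sketch features and image features of shape (batch, dim).
sketch_feat, image_feat = torch.randn(32, 512), torch.randn(32, 512)
print(gaussian_mmd(sketch_feat, image_feat).item())
```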

Deep metric learning

Deep metric learning inherits the advantages of existing deep learning techniques and can learn much more complex and powerful nonlinear transformations [14, 30], making it a good strategy for feature alignment. Yan et al. designed the contrastive loss [31], which effectively handles the relationship between paired data in Siamese networks. The triplet loss [32] is computed over triplets of training samples and encourages the network to find an embedding space in which the distance between samples from the same category is smaller than the distance between samples from different categories. The center loss [33] is usually combined with a softmax loss to reduce intra-class distances. The triplet-center loss (TCL) [34] establishes a center for each class so that features of samples from the same class are pulled toward the corresponding center and pushed away from the centers of other classes. The attribute-based label smoothing (ALS) loss better regularizes the attributes of different instances to train more discriminative features [35].

Sketch-based cross-domain retrieval

Sketch-based cross-domain retrieval is a fundamental problem in the computer vision and multimedia fields. The task is to retrieve the 2D images or 3D models in a dataset that are similar in content to the query sketches submitted by the user, and many works have been proposed. Lei et al. proposed a semi-heterogeneous three-way joint embedding network (Semi3-Net) to explore the semantic correlation and capture the common feature information of cross-domain image data [36]. Chen et al. designed an attention-enhanced network (AE-Net) for constructing and understanding spatial sequence information between the sketch and image modalities [37]. Sun et al. proposed the dual local interaction network (DLI-Net) to explore an effective and efficient way to utilize local features of sketches and photos [38]. Yang et al. designed a sequential learning (SL) framework to deeply mine the category relationships between cross-domain data to achieve cross-domain retrieval [39].

With the development of artificial intelligence and deep learning, the zero-shot sketch-based cross-domain retrieval task has been proposed, and many excellent methods have recently emerged. Li et al. designed STRucture-aware Asymmetric Disentanglement (STRAD) to achieve cross-domain correspondence of sketches and images by analyzing their differences in a structure space [40]. Wang et al. proposed the transferable coupled network (TCN) to effectively improve network transferability from seen categories to unseen ones [41]. Xu et al. proposed the domain disentangled generative adversarial network (DD-GAN) for zero-shot sketch-based 3D retrieval, which not only reduces the inter-domain difference between sketches and 3D shapes but also minimizes the domain discrepancy between seen and unseen categories [42].

Methods

To achieve sketch-based cross-domain retrieval, we propose an effective end-to-end training method based on GSCTL that diminishes cross-domain heterogeneity and enhances cross-domain correlation, as shown in Fig. 3. The whole framework mainly contains three parts: (1) data pairing, (2) feature extraction and (3) learning objectives. We elaborate on the details of each part below.

Fig. 3 Training framework of our proposed method for sketch-based cross-domain visual data retrieval

Data pairing

The training data from the two modalities are composed of query-target pairs \(S = \left\{ {{s_i}} \right\} _{i = 1}^N\), \({s_i} = \left\{ {{x_i},l_i^x,{t_i},{y_i},l_i^y} \right\} \), where N is the total number of query-target pairs, \({x_i}\) is a sketch from the query dataset, and \({y_i}\) is a random data item (image or 3D model) from the target dataset. The positive or negative random pairing of multi-modal data (reflected by \({t_i}\)) is designed for feature similarity learning: \({t_i=1}\) if \({x_i}\) and \({y_i}\) have the same semantic label, and \({t_i=0}\) otherwise. \({l_i^x}\) and \({l_i^y}\) are the semantic labels of the query and target data, respectively.

Based on the above statement, we randomly match different categories of query sketches with target candidates in different domains, which on the one hand helps to explore the visual and semantic similarity of intra-class features, and is beneficial for bridging cross-domain discrepancy on the other hand.
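A minimal sketch of this random pairing step is given below; the data layout and function name are illustrative assumptions, not the paper's exact implementation.

```python
import random

def make_pairs(queries, targets, num_pairs):
    """Randomly pair query sketches with target candidates.

    queries: list of (x_i, label) sketch entries.
    targets: list of (y_i, label) image/3D-shape entries.
    Returns tuples (x_i, l_x, t_i, y_i, l_y) with t_i = 1 for same-class pairs.
    """
    pairs = []
    for _ in range(num_pairs):
        x, lx = random.choice(queries)
        y, ly = random.choice(targets)
        t = 1 if lx == ly else 0
        pairs.append((x, lx, t, y, ly))
    return pairs
```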

Feature extraction

For query sketches and target candidates, two independent sub-networks are used to learn the initial features. These two branches can be any existing image or 3D shape feature extraction network. After obtaining the initial representations of query sketches and target candidates, constructing a common feature space is an effective approach in the cross-domain retrieval task [4, 16, 43].

In the training process, we randomly select query-target pairs from the whole dataset and train our model in mini-batches. Separate feature extractors and a public shared subspace are built to generate the initial features and the projected features of each query-target pair. We denote the feature representations of a query-target pair as:

$$\begin{aligned} \left\{ {\begin{array}{*{20}{c}} {F_i^x = {f_q}\left( {{x_i},{\theta }_q } \right) }\\ {F_i^y = {f_t}\left( {{y_i},{\theta }_t } \right) }\\ {P_i^x = {f_s}\left( {{F_i^x},{\theta }_s } \right) }\\ {P_i^y = {f_s}\left( {{F_i^y},{\theta }_s } \right) }\\ \end{array}} \right. , \end{aligned}$$
(1)

where \({f_q}\left( . \right) \) and \({f_t}\left( . \right) \) represent the feature extractors for query and target data, respectively; \({\theta }_q\) and \({\theta }_t\) are their parameters; \({f_s}\left( . \right) \) represents the public shared subspace encoder and \({\theta }_s\) its parameters; \({F_i^x}\) and \({F_i^y}\) are the initial features of the given query and target data, respectively; and \({P_i^x}\) and \({P_i^y}\) are the corresponding projected features.
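The following is a minimal PyTorch sketch of Eq. (1). The class name, layer sizes and the use of two ResNet-50 backbones are our assumptions for illustration; in the actual experiments the target branch for 3D shapes is an MVCNN over multiple views.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class GSCTLFeatures(nn.Module):
    def __init__(self, feat_dim=2048, proj_dim=512):
        super().__init__()
        # f_q and f_t: independent backbones for query sketches and target candidates.
        self.f_q = nn.Sequential(*list(resnet50(weights=None).children())[:-1], nn.Flatten())
        self.f_t = nn.Sequential(*list(resnet50(weights=None).children())[:-1], nn.Flatten())
        # f_s: public shared subspace encoder applied to both modalities.
        self.f_s = nn.Sequential(nn.Linear(feat_dim, proj_dim), nn.ReLU(),
                                 nn.Linear(proj_dim, proj_dim))

    def forward(self, x, y):
        F_x, F_y = self.f_q(x), self.f_t(y)      # initial features
        P_x, P_y = self.f_s(F_x), self.f_s(F_y)  # projected features in the common space
        return F_x, F_y, P_x, P_y
```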

Joint learning objectives

The joint feature learning architecture simultaneously compensates for the divergence between domains and learns category-related characteristics. Specifically, for each query-target pair, we introduce domain consistency learning, semantic consistency learning, feature correlation learning and feature similarity learning.

Domain consistency learning

Domain consistency is an important property for cross-domain retrieval and the key to reducing the domain gap. If the distributions of the cross-domain features are consistent across modalities, the heterogeneity between them is effectively reduced. To achieve domain consistency learning, we introduce a domain (modality) discriminator \({f_d \left( {.}\right) }\) to identify the modality information. Domain consistency learning aims to confuse the domain information through adversarial learning [24], which ensures that the domain discriminator cannot accurately identify the modality of a feature.

The domain discriminator consists of fully connected layers followed by a sigmoid function. During training, the projected features of query data and target data are assigned different modality labels. Given the query-target pairs, the loss function of domain consistency learning is defined as follows:

$$\begin{aligned} L_{con}\left( P^x,P^y;{\theta }_q,{\theta }_t,{\theta }_s,{\theta }_d \right)&= - \frac{1}{N}\sum \limits _{i = 1}^N \Big ( \log {P_d}\left( f_d\left( P_i^x;{\theta }_d \right) \right) \nonumber \\&\quad + \log \left( 1 - {P_d}\left( f_d\left( P_i^y;{\theta }_d \right) \right) \right) \Big ), \end{aligned}$$
(2)

where \({{\theta }_d}\) represents the parameters of the modality discriminator, \({P_d \left( {.}\right) }\) represents the probability distribution generated by the modality discriminator for the projected feature.

Adversarial learning is used to train the domain discriminator. Unlike training a normal discriminator, which preserves and enhances the discriminability of the input features, here the domain information of the features gradually disappears as the domain discriminator is trained. Specifically, a negative hyper-parameter is multiplied with \(L_{con}\left( P^x, P^y;{\theta }_q,{\theta }_t,{\theta }_s,{\theta }_d \right) \), blurring the domain boundary of the features. In this way, the heterogeneous gap between the domains effectively decreases and domain alignment of the cross-domain features is achieved: features from different domains are encoded into one semantic space and yield domain-agnostic semantic representations.
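A minimal sketch of the adversarial domain-consistency objective of Eq. (2) is shown below, using a gradient reversal layer to realize the negative multiplier described above; the discriminator architecture and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back to the encoders.
        return -ctx.lam * grad_output, None

class DomainDiscriminator(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, p, lam=1.0):
        return torch.sigmoid(self.net(GradReverse.apply(p, lam)))

def domain_consistency_loss(disc, P_x, P_y, eps=1e-8):
    # Query features are labeled 1 and target features 0, as in Eq. (2).
    d_x, d_y = disc(P_x), disc(P_y)
    return -(torch.log(d_x + eps) + torch.log(1 - d_y + eps)).mean()
```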

Semantic consistency learning

Besides domain consistency, semantic consistency is also an important property for cross-domain retrieval. To achieve semantic alignment of cross-domain features, we introduce a category label classifier \({f_c \left( {.}\right) }\) to predict the category labels of the projected features, implemented with fully connected layers and the cross-entropy loss. Given the query-target pairs, the loss function of semantic consistency (feature discrimination) learning is defined as follows:

$$\begin{aligned} L_{dis}\left( P^x,P^y;{\theta }_q,{\theta }_t,{\theta }_s,{\theta }_c \right)&= - \frac{1}{N}\sum \limits _{i = 1}^N \Big ( l_i^x \log {P_c}\left( f_c\left( P_i^x;{\theta }_c \right) \right) \nonumber \\&\quad + l_i^y \log {P_c}\left( f_c\left( P_i^y;{\theta }_c \right) \right) \Big ), \end{aligned}$$
(3)

where \({{\theta }_c}\) represents the parameters of the category label classifier; \(P_c \left( .\right) \) represents the generated probability distribution of the categories predicted by the label classifier for projected features.

In this way, the category labels bridge the common space and the label space, and the difference between the classification predictions and the labels is minimized. The category information of each item is utilized for classification, and the discriminability of the query sketch and target candidate descriptors is significantly improved.
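A minimal sketch of the classification objective of Eq. (3), assuming a single linear classifier shared by both modalities (layer sizes and names are ours):

```python
import torch.nn as nn

class CategoryClassifier(nn.Module):
    def __init__(self, dim=512, num_classes=100):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, p):
        return self.fc(p)  # logits; the softmax is folded into the loss

def semantic_consistency_loss(clf, P_x, P_y, l_x, l_y):
    ce = nn.CrossEntropyLoss()
    # Both projected features must be classified into their ground-truth categories.
    return ce(clf(P_x), l_x) + ce(clf(P_y), l_y)
```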

Feature correlation learning

Semantic and domain alignment are important, but alone they cannot achieve superior retrieval performance, because the cross-entropy loss focuses on finding a decision boundary that separates the classes and cannot optimize inter-class and intra-class distances as well as metric learning methods can, while domain consistency learning only guarantees global domain alignment. Feature correlation learning therefore explores the semantic and visual relationships between features, which is important for improving the effectiveness and robustness of the whole network.

In ZS-SBIR, great strides have been made, but most attempts align with the larger photo-based zero-shot literature, where the key lies in leveraging external knowledge for cross-category adaptation [44]. However, SBIR is a visual task, whereas the embedded semantic knowledge comes from text, and the two do not match exactly. In SBSR, most previous methods build a common subspace with deep metric learning for correlation learning; these methods merely enhance the local feature correlation in the common subspace, i.e., there is no, or only weak, feature correlation in the original feature spaces constructed by the feature extractors. Therefore, we apply global feature learning to jointly model semantics and vision and to generate discriminative features for retrieval.

In this paper, we achieve global feature correlation by encouraging correlation learning in each local feature space. Note that “global” here means that the proposed loss is applied not only to the constructed common feature space but also to the initial feature spaces. Specifically, in each feature space we define the centers \({C} = \left\{ {{c_1},{c_2},{c_3},\ldots ,{c_M}} \right\} \) for metric learning, where \({c_i}\) is the i-th category center and M is the number of categories. We update every center and feature within a mini-batch, pulling the features toward their corresponding centers and pushing them away from the other centers. To increase the inter-class distinction and strengthen the intra-class similarity, we adopt the metric learning loss of TCL [34]:

$$\begin{aligned} L_{tc}\left( F_i;{\theta } \right) = \sum \limits _{i = 1}^N \max \Big ( D\left( F_i,c_{y^i} \right) + m - \mathop {\min }\limits _{j \ne y^i} D\left( F_i,c_j \right) , 0 \Big ), \end{aligned}$$
(4)

where \({F_i}\) represents a data feature; \({\theta }\) denotes the training parameters; \({{D}\left( {.} \right) }\) is the squared Euclidean distance; \({c_{y^i}}\) and \({c_j}\) are the center of \({F_i}\)'s class and a negative center, respectively; and m is a hyper-parameter representing the margin between different classes.
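The following is a minimal sketch of the triplet-center loss of Eq. (4) over a mini-batch with learnable class centers; the exact center-update scheme is our assumption rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class TripletCenterLoss(nn.Module):
    def __init__(self, num_classes, feat_dim, margin=5.0):
        super().__init__()
        self.margin = margin
        # One learnable center per category, initialized from N(0, 1).
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):
        # Squared Euclidean distances to all centers: shape (batch, num_classes).
        d = torch.cdist(feats, self.centers, p=2) ** 2
        d_pos = d.gather(1, labels.view(-1, 1)).squeeze(1)  # distance to own center
        # Mask out the own-class column, then take the nearest negative center.
        d_neg = d.scatter(1, labels.view(-1, 1), float('inf')).min(dim=1).values
        return torch.clamp(d_pos + self.margin - d_neg, min=0).sum()
```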

As shown in Fig. 3, the global feature correlation learning is achieved through local correlation learning (the three purple blocks in Fig. 3). We define \({L_{tc} \left( {P^x,P^y;{{\theta }_q},{{\theta }_t},{{\theta }_s}}\right) }\) as local correlation learning function for projected features, \({L_{tc} \left( {F^x;}{{\theta }_q} \right) }\) and \({L_{tc} \left( {F^y;}{{\theta }_t} \right) }\) as local correlation learning functions for original features. The global correlation learning function can then be established by combining these three local correlation learning functions:

$$\begin{aligned} L_{cl}\left( {\theta }_q,{\theta }_t,{\theta }_s \right) = \alpha L_{tc}\left( P^x,P^y;{\theta }_q,{\theta }_t,{\theta }_s \right) + \beta \left[ L_{tc}\left( F^x;{\theta }_q \right) + L_{tc}\left( F^y;{\theta }_t \right) \right] , \end{aligned}$$
(5)

where \({\alpha }\) and \({\beta }\) are hyper-parameters that control the balance between the terms.

Through this global feature correlation learning, the discrimination of features in the two initial feature spaces can be enhanced and transferred to the common space. As a result, the eventually learned features could be more discriminative and consistent, which will be demonstrated in the experiments.

Feature similarity learning

TCL focuses on pulling features toward their corresponding centers and pushing them away from unrelated centers. However, TCL ignores the distances between features, so features near the decision boundaries may end up closer to features of other categories, as shown in Fig. 4, indicating that the intra-class features are not compact enough. In other words, feature alignment and domain alignment can only be partially achieved.

Fig. 4 The distributions of deep features learned by TCL

Inspired by contrastive learning and the GRL, we design a loss function for feature similarity learning that improves the intra-class compactness and inter-class separability of cross-domain features, as shown in Fig. 5. Feature similarity learning exploits the semantics of the projected features, pulling intra-class features closer and pushing inter-class features farther apart.

Fig. 5 A toy illustration of the feature similarity learning

Given the query-target pairs, the loss function of feature-similarity learning is defined as follows:

$$\begin{aligned} L_{sim}\left( P_i^x,P_i^y;t_i;{\theta }_q,{\theta }_t,{\theta }_s \right) = \frac{1}{N}\sum \limits _{i = 1}^N \Big ( t_i\, D\left( P_i^x,P_i^y \right) + \left( 1-t_i \right) D\left( R_{\lambda }\left( P_i^x \right) ,P_i^y \right) \Big ), \end{aligned}$$
(6)

where \({R_{\lambda }\left( {.}\right) }\) denotes the GRL. If \({P_i^x}\) and \({P_i^y}\) have the same semantic label (\({t_i=1}\)), minimizing \(L_{sim}\) makes the intra-class features more compact. If they have different labels (\({t_i=0}\)), \(L_{sim}\) relies on the GRL, which multiplies the gradient by a negative constant during backpropagation; this opposes the minimization objective and therefore pushes the inter-class features farther apart.
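A minimal sketch of the similarity loss of Eq. (6) is given below; it reuses the GradReverse layer from the domain-consistency sketch above for the negative (t_i = 0) pairs, and the function name is ours.

```python
import torch

def feature_similarity_loss(P_x, P_y, t, lam=1.0):
    """t: (batch,) tensor with 1 for same-class pairs and 0 otherwise."""
    d_pos = ((P_x - P_y) ** 2).sum(dim=1)                 # pull matched pairs together
    # The reversed gradient pushes mismatched pairs apart during backpropagation.
    d_neg = ((GradReverse.apply(P_x, lam) - P_y) ** 2).sum(dim=1)
    return (t * d_pos + (1 - t) * d_neg).mean()
```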

Optimization objectives

Based on the above discussions, the overall loss function of our method can be expressed as:

$$\begin{aligned} L\left( {\theta }_q,{\theta }_t,{\theta }_s,{\theta }_c,{\theta }_d \right)&= \lambda L_{dis}\left( P^x,P^y;{\theta }_q,{\theta }_t,{\theta }_s,{\theta }_c \right) + L_{cl}\left( F^x,F^y,P^x,P^y;{\theta }_q,{\theta }_t,{\theta }_s \right) \nonumber \\&\quad - \phi L_{con}\left( P^x,P^y;{\theta }_q,{\theta }_t,{\theta }_s,{\theta }_d \right) + \varphi L_{sim}\left( P^x,P^y;t;{\theta }_q,{\theta }_t,{\theta }_s \right) , \end{aligned}$$
(7)

where \({\lambda }\), \({\phi }\) and \({\varphi }\) are positive hyper-parameters that control the balance among the terms.

For learning more robust and discriminative features, our method searches for the optimal parameters \({{\theta }_q}\), \({{\theta }_t}\), \({{\theta }_s}\), \({{\theta }_c}\), and \({{\theta }_d}\) as a min-max game, shown as follows:

$$\begin{aligned} \left( {\theta }_q^*,{\theta }_t^*,{\theta }_s^*,{\theta }_c^* \right) = \arg \mathop {\min }\limits _{{\theta }_q,{\theta }_t,{\theta }_s,{\theta }_c} L\left( {\theta }_q,{\theta }_t,{\theta }_s,{\theta }_c,{\theta }_d \right) , \end{aligned}$$
(8)
$$\begin{aligned} {\theta }_d^* = \arg \mathop {\max }\limits _{{\theta }_d} L\left( {\theta }_q^*,{\theta }_t^*,{\theta }_s^*,{\theta }_c^*,{\theta }_d \right) . \end{aligned}$$
(9)

The training can be realized using a stochastic gradient descent algorithm, as summarized in Algorithm 1; a minimal code sketch of one training step is given after Algorithm 1.

Algorithm 1 Training algorithm for the proposed method
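Below is a minimal sketch of one training step combining the objectives of Eq. (7), reusing the modules and loss functions sketched in the previous subsections; the parameter grouping and hyper-parameter names are assumptions that mirror the text.

```python
def train_step(model, clf, disc, tcl_common, tcl_q, tcl_t, batch, opt_main, opt_disc,
               lam=1.0, alpha=0.5, beta=0.01, phi=0.1, varphi=1.0):
    x, l_x, t, y, l_y = batch                     # sketch, labels, pair flag, target, labels
    F_x, F_y, P_x, P_y = model(x, y)              # Eq. (1)

    L_dis = semantic_consistency_loss(clf, P_x, P_y, l_x, l_y)          # Eq. (3)
    L_cl = alpha * (tcl_common(P_x, l_x) + tcl_common(P_y, l_y)) \
         + beta * (tcl_q(F_x, l_x) + tcl_t(F_y, l_y))                   # Eq. (5)
    L_con = domain_consistency_loss(disc, P_x, P_y)                     # Eq. (2)
    L_sim = feature_similarity_loss(P_x, P_y, t)                        # Eq. (6)

    # The GRL inside the discriminator already realizes the min-max game of
    # Eqs. (7)-(9): the discriminator minimizes L_con while the reversed gradient
    # makes the encoders maximize it, so L_con is simply added here.
    loss = lam * L_dis + L_cl + phi * L_con + varphi * L_sim

    opt_main.zero_grad()
    opt_disc.zero_grad()
    loss.backward()
    opt_main.step()
    opt_disc.step()
    return loss.item()
```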

Experiments

Dataset

For the sketch-based cross-domain visual data retrieval experiments, we select five widely used datasets. For sketch-based image retrieval, the selected datasets are Sketchy [45, 46], TU-Berlin [47, 48] and RSketch [49]. For sketch-based 3D model retrieval, SHREC 2013 [50] and SHREC 2014 [51] are selected. Table 1 summarizes the key properties of these datasets.

Table 1 Key properties of existing datasets

Training settings

We implement the proposed method in PyTorch on a computer with a Tesla V100S GPU. The modality discriminator is trained using Adam [52] with a learning rate of 0.01, and all other components are trained using SGD [53] with an initial learning rate of 0.01; after every ten iterations, the learning rate is reduced to one-tenth of its previous value. The hyper-parameter m of TCL is set to 5 following [34]. The other hyper-parameters \({\alpha }\), \({\beta }\), \({\lambda }\), \({\phi }\) and \({\varphi }\) are set to 0.5, 0.01, 1, 0.1 and 1, respectively. We use ResNet50 [54] as the backbone for feature extraction. For cross-domain 3D shape retrieval, we represent each 3D shape with multiple views and use the multi-view convolutional neural network (MVCNN) [55] to obtain the 3D shape feature. We initialize the centers from a standard normal distribution, with mean 0 and standard deviation 1.
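A minimal sketch of this optimizer setup, reusing the module names from the earlier sketches; which parameters go into which optimizer is inferred from the description above and should be treated as an assumption.

```python
import torch

# Adam for the domain discriminator, SGD for everything else, both at lr = 0.01;
# the SGD learning rate is divided by 10 on the schedule described above.
opt_disc = torch.optim.Adam(disc.parameters(), lr=0.01)
main_params = list(model.parameters()) + list(clf.parameters()) \
            + list(tcl_common.parameters()) + list(tcl_q.parameters()) + list(tcl_t.parameters())
opt_main = torch.optim.SGD(main_params, lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(opt_main, step_size=10, gamma=0.1)
```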

Zero-shot sketch-based image retrieval

The main challenge of ZS-SBIR is building a knowledge-sharing bridge between the known training classes and the unknown test classes, which tests the representational power of the features. In this section, we use Sketchy, TU-Berlin and RSketch as the datasets for ZS-SBIR. For this task, we use the evaluation metrics in [49, 56, 57] to evaluate the retrieval performance:

  1. Mean average precision (MAP): average precision (AP) records the retrieval accuracy of a single query and amounts to the area under the precision–recall (PR) curve; the APs of all query sketches are averaged to produce the MAP.

  2. Prec@K: the precision of the top K returned images is used to evaluate retrieval performance. On the Sketchy and TU-Berlin datasets, K = 100 and 200; on the RSketch dataset, K = 10 and 50. A minimal computation sketch of both metrics follows this list.
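The sketch below computes both metrics for a single ranked retrieval list, assuming binary relevance judgments (the ranking itself would come from feature distances):

```python
import numpy as np

def average_precision(relevant):
    """relevant: 1/0 array over the ranked gallery for one query."""
    relevant = np.asarray(relevant)
    if relevant.sum() == 0:
        return 0.0
    hits = np.cumsum(relevant)
    # Precision at each rank where a relevant item appears.
    precision_at_hit = hits[relevant == 1] / (np.flatnonzero(relevant) + 1)
    return precision_at_hit.mean()

def prec_at_k(relevant, k):
    return float(np.asarray(relevant)[:k].mean())

# MAP is the mean of average_precision over all query sketches.
ranking = [1, 0, 1, 1, 0]          # toy relevance list for one query
print(average_precision(ranking))   # ~0.806
print(prec_at_k(ranking, 3))        # ~0.667
```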

Retrieval result on Sketchy and TU-Berlin

We follow the splits of the Sketchy and TU-Berlin datasets used in [56, 57]. For Sketchy, we select 21 test classes that are not among the 1000 classes of ImageNet. For TU-Berlin, we randomly select 220 categories for training and use the remaining 30 categories for testing.

The retrieval results in comparison with the state-of-the-art approaches are shown in Table 2. The compared methods are ZSIH [58], conditional variational auto-encoders (CVAE) [56], semantically tied paired cycle consistency (SEM-PCYC) [21], Doodle To Search (DTS) [20], CrossATNet [19], SketchGCN [59], STRAD [40], StyleGuide [57], TCN [41], bi-level domain adaptation for zero-shot SBIR (BDA-SketRet) [60] and discriminant adversarial learning (DAL) [61]. The highest value of each metric is shown in bold.

Table 2 Comparison with the state-of-the-art approaches on Sketchy and TU-Berlin datasets

For all metrics, our method achieves the best retrieval performance, demonstrating that the proposed GSCTL method achieves both feature alignment and domain alignment. It is worth noting that our method does not use extra semantic information as previous works do, but instead adopts global correlation learning and feature similarity learning. Previous works extract word vectors of class names from word-vector models, measure the word similarity of class names through hierarchical models, or combine both. These approaches have an obvious drawback: semantic embeddings encode mostly textual information, whereas ZS-SBIR is a visual task, so such semantic guidance is suboptimal. Our approach focuses on visual relevance, avoids additional language modeling and its time cost, and reduces the burden of acquiring training resources. The comparison on the above metrics shows that these two learning schemes achieve cross-domain correspondence between sketches and images, which facilitates the transfer of semantic knowledge under the zero-shot scenario.

Figure 6 visualizes the retrieval results of the GSCTL method on the Sketchy dataset. The query sketches, listed on the left of the vertical line, are dolphin, helicopter, sword and windmill. The top 7 retrieved images are listed on the right in ranking order, with incorrect retrievals highlighted in red. For the helicopter, sword and windmill classes, all the retrieved images are correct. For the dolphin class, several retrievals are incorrect; however, even the incorrect images exhibit a high visual resemblance to the given dolphin sketch. These incorrect results also show that zero-shot sketch-based image retrieval is a very challenging task.

Fig. 6 Retrieval results on Sketchy

Retrieval result on RSketch

For the RSketch dataset, the splitting method proposed in the literature [50] is used. The method divides the 20 classes of the dataset into 4 subsets, each containing 15 seen classes and 5 unseen classes. Following this split, the proposed GSCTL method is trained and tested on the 4 subsets, and the average performance over the four subsets, in terms of the evaluated metrics, is used for comparison.

The comparison between our GSCTL with the state-of-the-art approaches is shown in Table 3. All comparison methods are based on deep learning, such as deep spatial-semantic attention (DSSA) [62], deep shape matching (DSM) [63], learned deep image-sketch features (LDF) [64], adversarial sketch-image feature learning (AFL) [49], DTS [20], deep supervised cross-modal retrieval (DSCMR) [65], cross-modal center loss (CMCL) [66] and DAL [61]. The highest value for each metric is shown in bold.

Table 3 Comparison with the state-of-the-art approaches on RSketch dataset

For the 3 metrics in the seen categories, the GSCTL method achieves the best retrieval performances. For the 3 metrics in the unseen categories, the GSCTL method attains two quasi-best values in terms of MAP and Prec@10, and the third-best performance in terms of Prec@50.

Compared with DAL, the GSCTL method mines more feature correlation information through global feature correlation learning, which enhances the retrieval performance for seen categories but reduces the inference ability for unseen categories. Compared with other methods, the GSCTL method can discover hidden relationships between features and transfer the semantic knowledge from seen to unseen categories through the strategy of joint learning, which is beneficial for improving retrieval performance. Further experimental details and conclusions on the RSketch dataset can be found in “Ablation study”.

Sketch-based 3D shape retrieval

The 3D shapes from SHREC 2013 and SHREC 2014 are not consistently aligned, i.e., these models are perturbed by random rotations. For a more detailed and comprehensive description of the 3D models, we use dodecahedron-view images to render 3D objects following the method in [67] and obtain 20 views for each 3D object.

We employ nearest neighbor (NN), first tier (FT), second tier (ST), E-measure (E), discounted cumulated gain (DCG), and mean average precision (MAP) to evaluate the retrieval performance following [50, 51]. A higher metric value indicates better performance.

On the two benchmark datasets (SHREC 2013 and SHREC 2014), we compare our method with the state-of-the-art methods, including TCL [34], AFL [49], deep cross-modality adaptation (DCA) [68], SEM [43], deep sketch-shape hashing (DSSH) [69], BV-CDSM [70], deep point-to-subspace metric learning (DPSML) [16], dual independent classification (DIC) [71], hyperbolic embedded attentive representation (HEAR) [72], JHFL_DA (joint heterogeneous feature learning and distribution alignment) [73], M-GCN [74] and DAL [61].

Table 4 Comparison with state-of-the-art approaches on SHREC 2013 dataset
Table 5 Comparison with state-of-the-art approaches on SHREC 2014 dataset

The results are summarized in Tables 4 and 5, where bold indicates the best performance, italic the quasi-best, and bold italic the third-best. Our method achieves the best performance in 7 metrics across the two benchmarks (NN, FT, ST, DCG and MAP on SHREC 2013; ST and DCG on SHREC 2014), the quasi-best performance in 3 metrics (E on SHREC 2013; E and MAP on SHREC 2014), and the third-best performance in one metric (FT on SHREC 2014), while remaining competitive on the remaining metric. Overall, the experimental results demonstrate the superiority of our GSCTL method over the existing methods.

Figure 7 visualizes example 3D models retrieved by GSCTL from SHREC 2013, with incorrect retrievals marked by red squares. The incorrect results are mainly due to small class sizes. For the “ant” class, our method first retrieves the correct samples and then incorrect ones, because the “ant” class contains only 5 samples; the additional retrieved models (bees) have visual shapes similar to the given “ant” sketch.

Fig. 7 Retrieval results on SHREC 2013

The reasons why our method is also effective in sketch-based 3D model retrieval tasks can be explained as follows: We build a public space encoder and jointly apply four types of learning. Such a learning framework is intended to further improve the information representation capability of the public space encoder in extracting modality-invariance and semantic-correlation features. Domain consistency learning essentially converts the cross-domain retrieval task into a same-domain retrieval task, which reduces the difficulty of retrieval. Global correlation learning can capture and transmit modality-invariant features effectively from two initial feature spaces to a common feature space.

Discussions

In this section, we provide in-depth analyses of various design choices to gain insight into the effectiveness and generalization of the proposed model, especially the parameter settings. The objective functions involve several hyper-parameters that balance the contributions of the loss terms. To explore the sensitivity of GSCTL to these hyper-parameters, we design experiments with different values of \({\alpha }\), \({\beta }\) and \({\varphi }\), as reported in “Comparison of the effectiveness of similarity loss” and “The impact of hyperparameter”.

Image-based 3D shape retrieval

Image-based 3D shape retrieval (IBSR) is also a hot research topic in cross-domain 3D retrieval and shares similar difficulties with SBSR. To validate the robustness of GSCTL, we conduct comparative experiments on IN2MN, SceneIBR2019 and MI3DOR, which are popular image-based 3D shape retrieval benchmarks.

Following [63, 75], different metrics are used to evaluate the retrieval performance on different datasets. On IN2MN, MAP, Prec@10, Prec@50 and Prec@100 are used. For SceneIBR2019, NN, FT, ST, E, DCG, MAP and the PR curve are used [75]. For MI3DOR, the metrics are NN, FT, ST, F, DCG, ANMRR (average normalized modified retrieval rank) and AUC (the area under the PR curve) [76]. Note that a lower ANMRR value indicates better performance.

Retrieval results on IN2MN

The 3D models of IN2MN come from ModelNet40, which are consistently aligned. In addition, IN2MN provides 12 views for each 3D shape. Therefore, in this subsection, we use 12 views to represent each 3D object following [53].

The retrieval results, in terms of the evaluated metrics, of our method and the state-of-the-art approaches are presented in Table 6. The compared methods include CDTNN (cross-domain triplet neural network) [77], DSCMR [65], AFL [49], CMCL [66] and DAL [61]. The experimental results show that our method achieves significantly better retrieval performance on every metric, which indicates the effectiveness of the proposed GSCTL method.

Table 6 Comparison with state-of-the-art approaches on IN2MN dataset

Figure 8 visualizes the retrieval results of GSCTL on IN2MN. It can be seen that all retrieved results are correct.

Fig. 8 Retrieval results on IN2MN

Retrieval results on SceneIBR2019

SceneIBR2019 is a dataset designed for image-based 3D scene retrieval. Image-based scene retrieval is more challenging because the scene contains more entities and there may be overlapping relationships between entities [78].

In this experiment, we use SketchUp to capture the representative views of all 3D scene models following MMD-VGG in [79]. Therefore, an image-based 3D scene retrieval task is transformed into an image-based 2D scene retrieval task.

Our method is compared against RNIRAP, CVAE, TCL, VMV-VGG, and DRF, which are the methods based on deep learning for the SceneIBR2019 dataset from SHREC 2019 [75]. We also extended DSCMR [65], AFL [49], CMCL [66] and DAL [61] so that they can be trained on the SceneIBR2019 dataset.

The comparison results for the first six metrics are shown in Table 7, where bold indicates the best performance, italic the quasi-best, and bold italic the third-best. The precision–recall curves are shown in Fig. 9. Our method achieves the best performance in four metrics (i.e., ST, DCG, MAP and the PR curve) and the quasi-best performance in the remaining three metrics.

Table 7 Comparison with state-of-the-art approaches on SceneIBR2019 dataset
Fig. 9 Precision–recall curves performance comparison results with state-of-the-art approaches on SceneIBR2019

Retrieval result on MI3DOR

MI3DOR followed the method in [55] to render the 3D object (.OBJ) and provided 12 views for each 3D object. In this experiment, we also use 12 views to represent each 3D object.

Our method is compared against RNF-MVCVR, SORMI, RNFETL, CLA, MLIS, ADDA-MVCNN, SRN, ALIGN, collaborative distribution alignment (CDA) [80], consistent domain structure learning and domain alignment (CDSLDA) [81], universal cross-domain (UCD) [82], AFL [49], DSCMR [65], CMCL [66], JHFL_DA [73], M-GNN [74] and DAL [61]. The first eight methods are supervised methods on the MI3DOR dataset from SHREC 2019 [55]. AFL, DSCMR and CMCL are reproduced for image-based 3D shape retrieval.

Table 8 presents the comparison with the state-of-the-art methods, where bold indicates the best performance, italic the quasi-best, and bold italic the third-best. On this dataset, our method achieves the best performance in one metric (ST), the quasi-best performance in three metrics (FT, DCG and ANMRR), and the third-best performance in three metrics (NN, F and AUC). In general, our method attains the quasi-best overall retrieval performance, behind only M-GNN.

Table 8 Comparison with state-of-the-art approaches on MI3DOR dataset
Table 9 Comparison of metrics under different similarity losses

For cross-domain 3D model retrieval tasks, several approaches have been evaluated on multiple datasets. As shown in Tables 4, 5, 7 and 8, compared with these reused methods, our GSCTL attains the best metrics on SHREC 2013, SHREC 2014 and SceneIBR2019. On the MI3DOR dataset, the retrieval performance of GSCTL falls behind M-GNN; however, the retrieval performance of M-GNN on SHREC 2014 is unsatisfactory, where our GSCTL method is clearly ahead. The superior and stable retrieval performance across multiple datasets illustrates the excellent generalization of our method.

Based on all experimental results, our approach outperforms the other comparison methods in multiple types of retrieval tasks for the following reasons:

  1. Domain consistency learning reduces the heterogeneous gap and achieves domain alignment, which turns the cross-domain retrieval task into a same-domain retrieval task and reduces the difficulty of retrieval.

  2. Global correlation learning learns more semantic-correlation attributes in the initial feature spaces and transfers them to the common space, so more semantics-related information is captured and the generated visual features are more robust.

  3. Feature similarity learning pulls intra-class features closer and pushes inter-class features farther apart, narrowing the gap between paired features for a more compact semantic alignment.

In summary, the joint network could simultaneously deal with the vast domain gap, semantic gap, and limited knowledge about the unseen categories. All these designs enable our framework to mine more semantic category relationships.

Comparison of the effectiveness of similarity loss

One contribution of our method is the designed similarity learning loss. In this subsection, we replace the similarity loss with the contrastive loss and conduct extensive experiments on the IN2MN dataset. The retrieval results under the different loss functions are recorded in Table 9.

From Table 9, we observe that the retrieval performance in terms of both MAP and Prec@K increases as the weight \({\varphi }\) of the similarity loss grows, with the optimal retrieval performance achieved at \({\varphi } = 1\). At this point, the intra-class features are more compact and GSCTL improves on several metrics. These results further validate the effectiveness of feature similarity learning, which can be used directly as a plug-in for cross-domain retrieval tasks.

Compared with the contrastive loss, the designed similarity loss has no margin hyperparameter, so training avoids tuning it. Analyzing the results in Table 9, we observe that the contrastive loss is sensitive to its hyperparameters: the retrieval performance varies greatly for different values of the margin and \({\varphi }\). Finding good values for these hyperparameters can be challenging and may require extensive experimentation.

Comparison of the effectiveness of domain adaptation algorithms

In this subsection, we investigate the impact of different domain adaptation algorithms on retrieval performance, as they play a crucial role in mitigating modal heterogeneity. Specifically, we compare our adversarial learning-based approach with three widely used algorithms (GRL, CMD, MMD) for modal consistency learning. We evaluate the retrieval results using MAP (Mean Average Precision) on four datasets (SHREC 2013, SHREC 2014, MI3DOR, and SceneIBR2019).

The comparison of MAP values under different domain adaptation algorithms is illustrated in Fig. 10. Across the four datasets, adversarial learning yields the best MAP on three datasets and the quasi-best on the remaining one, demonstrating that adversarial learning is the domain adaptation algorithm best suited to our proposed method. The key reason is that, instead of explicitly measuring the distance between the source and target domains as CMD and MMD do, adversarial learning leverages deep neural networks to learn the mapping between the source and target domains more flexibly, generating more robust and realistic feature representations.

Fig. 10 Comparison of metrics under different domain adaptation algorithms

Furthermore, we employ t-SNE [83] to visualize the distribution of the features obtained with the four domain adaptation algorithms on the MI3DOR dataset, as depicted in Fig. 11, where each color represents a different class. The visualization shows that adversarial learning (AD) and the gradient reversal layer (GRL) lead to better separation of the features; although some overlap remains among inter-class clusters, it is significantly smaller than with maximum mean discrepancy (MMD) and central moment discrepancy (CMD). Comparing AD with GRL, the inter-class clusters under AD exhibit more distinct boundaries.

Fig. 11 Visualization of the features learned by different domain adaptation algorithms: a AD, b GRL, c CMD, d MMD

Empirical analysis

Time and space cost analysis

In Table 10, we compare the convergence batch and average retrieval time cost of our model with those of DSCMR, AFL, CMCL and DAL on IN2MN. During evaluation, all data in the retrieval gallery are projected into the latent embedding space and stored in memory. Given one query sample, the time to compute the similarities between the query image and all 3D shapes in the gallery and to sort them is recorded, following previous work [21, 84], and the average time over all query samples is reported in Table 10. Denoting the number of gallery samples as N and the dimension of the latent representation as D, the time and space complexity are determined by N and D, with little difference among the methods. From the experimental results, we observe that the convergence speed and MAP of the proposed method are better than those of the others, which we attribute to its strong feature learning ability.

Table 10 Comparison of convergence batch and the retrieval time cost

Ablation study

The proposed method combines different learning objectives and leverages their advantages to achieve superior performance. To analyze the contribution of each learning objective, we conduct an ablation experiment on the RSketch dataset. In this experiment, we denote semantic consistency learning as SC, domain consistency learning as DC, local feature correlation learning as LFC, global feature correlation learning as GFC, and feature similarity learning as FS. The retrieval results under different combinations of loss functions are recorded in Table 11.

Table 11 Ablation experiments on RSketch

From Table 11, we can draw the following conclusions:

  1. Adding DC or LFC to SC improves the retrieval performance for seen categories by providing domain alignment or feature alignment capability.

  2. Adding GFC to SC improves the retrieval performance for both seen and unseen categories. Global feature correlation learning helps extract and transmit more useful information, facilitating the transfer of semantic knowledge under the zero-shot scenario.

  3. Incorporating LFC or GFC into SC + DC enhances the mutual information of cross-domain features by applying correlation learning on top of domain alignment, and the experimental results show that this measure can significantly improve retrieval performance. However, adding DC to SC + GFC enforces intermodal consistency in the distribution of cross-domain features, which loses some modality-specific semantic knowledge and affects the model's inference on unknown data.

  4. Adding FS to SC + DC + GFC improves the intra-class compactness and inter-class separability of cross-domain features, resulting in enhanced retrieval performance for both seen and unseen categories.

The main contributions of our method lie in the use of global feature correlation learning and the design of feature similarity learning. To further illustrate their roles, we compare the baseline (SC + DC + LFC) with the addition of feature correlation learning in the original feature spaces (OFC) and feature similarity learning (FS) in cross-domain 3D retrieval experiments on SHREC 2013, SHREC 2014, MI3DOR and SceneIBR2019. The retrieval performance, evaluated using MAP, is shown in Fig. 12.

Fig. 12 Ablation experiment in MAP comparison on SHREC 2013, SHREC 2014, MI3DOR and SceneIBR2019

We also adopt t-SNE [83] to visualize the distribution of the features on MI3DOR for the ablation experiment, as shown in Fig. 13. Adding FS and OFC to the baseline yields better feature clustering, with most clusters clearly separated across categories.

From Figs. 12 and  13, we can draw the following conclusions:

  1. Without OFC and FS, the baseline achieves good retrieval performance but lacks compactness in intra-class features, leading to unsatisfactory clustering.

  2. Adding OFC significantly improves the MAP, indicating that the proposed global feature correlation learning effectively learns useful information from the original spaces and transfers it to the common subspace, thus enhancing retrieval accuracy under domain alignment.

  3. Adding FS also leads to significant improvements in MAP, indicating that the proposed feature similarity learning reduces cross-domain discrepancy and enhances feature correlation.

  4. The best MAP scores are obtained when both FS and OFC are added. The combination of these two learning methods allows our method to leverage their advantages and extract more useful information.

The impact of the methods of generating initial centers

In this subsection, we explore the impact of different ways of initializing the centers on the retrieval results for the IN2MN dataset; the results are listed in Table 12. We consider two initialization methods: a normal distribution and a uniform distribution. For the normal distribution, the mean is set to 0 while the standard deviation varies from 1 to 0.01. For the uniform distribution, the centers are drawn from the interval [0, 1).
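A minimal sketch of the two initialization schemes compared here, applied to a center matrix such as the one used by the metric-learning loss (the shapes are illustrative):

```python
import torch

num_classes, feat_dim = 40, 512

# Normal initialization: mean 0, standard deviation in {1, 0.1, 0.01}.
centers_normal = torch.randn(num_classes, feat_dim) * 1.0

# Uniform initialization over [0, 1).
centers_uniform = torch.rand(num_classes, feat_dim)
```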

Fig. 13 Visualization of the features learned by ablation experiment: a baseline, b baseline + OFC, c baseline + FS, d baseline + FS + OFC

Table 12 Retrieval results in different ways of initializing the centers of TCL on the IN2MN dataset

Analyzing the results in Table 12, we observe that the proposed GSCTL method achieves the best retrieval performance when the mean is 0 and the standard deviation is 1, which supports initializing the centers from a standard normal distribution. Initializing the centers with a uniform distribution degrades performance, indicating that normal initialization contributes more to the overall performance than uniform initialization.

The impact of hyperparameter

The hyperparameters \({\alpha }\) and \({\beta }\) control the trade-off among these learning objectives. To investigate the impact of these parameters, we conducted extensive experiments on the IN2MN dataset.

Fig. 14 The performance of GSCTL in 3D model retrieval on the IN2MN dataset with different values of the hyperparameters \({\alpha }\) and \({\beta }\)

Figure 14 shows the performance of GSCTL for different values of \({\alpha }\) and \({\beta }\). The evaluation involved changing one parameter while keeping the other parameter fixed. When the value of \({\alpha }\) increases from 0.1 to 0.9, the MAP value initially increases and then decreases. When the value of \({\beta }\) increases from 0.01 to 0.09, the MAP value initially decreases, then increases, and finally decreases again. We observe that GSCTL achieves the best performance when \({{\alpha }=0.5}\) and \({{\beta } = 0.01}\). This result indicates that enhancing feature correlation in original feature spaces effectively improves retrieval performance.

Conclusion and future work

In this research, we propose an end-to-end general framework for sketch-based cross-domain retrieval. The proposed method incorporates semantic consistency, domain consistency, global feature correlation and feature similarity learning to reduce cross-domain heterogeneity and enhance global cross-domain correlation. To validate its effectiveness, we conducted experiments on 8 datasets. The results show that the proposed method outperforms state-of-the-art approaches, excelling especially in the ST metric for cross-domain 3D model retrieval tasks.

Although the proposed method achieves impressive retrieval performance, some shortcomings remain. First, the hyperparameter values are obtained from a large number of experiments. Second, the retrieval method operates in a closed setting between two pre-defined domains. Third, there is room for improvement in feature relationship mining. In future work, we will design a better approach to balancing the loss functions, move toward an open setting where multiple visual domains are available, and design a better feature correlation learning method to further improve retrieval performance.

In summary, this research is the first exploration that combines the SBSR and ZS-SBIR tasks, contributing to cross-domain retrieval, and the improved retrieval performance shows its potential for further research and practical applications in the field. The method can improve efficiency: for example, managers can quickly find the desired images and 3D shapes, and 3D designers can directly reuse retrieved 3D shapes instead of designing from scratch. More importantly, the enhanced retrieval performance under zero-shot conditions makes it possible to use existing knowledge to manage unknown categories of data when labeled samples are insufficient or even completely absent, thereby avoiding manual labeling and reducing costs.