Introduction

As one of the key technologies in information management, content retrieval has a wide range of application prospects and value. It is commonly said that an image is worth a thousand words, and sketches provide an easy way to express expectations and describe target candidates visually and concisely [1, 2]. Sketch-based content retrieval has therefore drawn great attention, and improving the performance and generalization of existing retrieval algorithms is of great significance [3, 4].

With the rapid development of multimedia technology and the widespread use of smart devices, the number of images and 3D shapes has grown rapidly [5,6,7], leading to a growing demand for effective retrieval of such multimedia data. Sketch-based 3D shape retrieval (SBSR) and sketch-based image retrieval (SBIR) have therefore become hot topics in computer vision.

In recent decades, with the rapid development of computer hardware and the improvement of dataset quality, deep learning methods have been widely used in many fields [8,9,10,11,12,13]. SBSR and SBIR remain challenging because (1) query sketches and target candidates come from two domains separated by a large gap, and (2) features extracted from different modalities follow quite different distributions. Researchers have proposed many methods to address these problems; their common framework is summarized in Fig. 1. For example, Dai et al. proposed the deep correlated metric learning (DCML) and deep correlated holistic metric learning (DCHML) architectures, which use nonlinear transformations to generate unified and robust feature representations for sketches and 3D models [14, 15]. Lei et al. proposed a deep point-to-subspace metric learning (DPSML) framework, which optimizes the similarity distances between selected representative views and sketches [16]. Lin et al. proposed the triplet classification network (TC-Net), which relies on a triplet Siamese network with an auxiliary classification loss for SBIR [17]. These representative approaches share a common drawback: they focus on metric learning but ignore the data distribution within each specific domain, which limits how well the cross-domain discrepancy can be mitigated, even though eliminating the domain gap greatly benefits retrieval performance.

Fig. 1 An example of the SBSR framework

In recent years, one often wishes to conduct SBIR on object categories without training data because manual labeling is difficult and costly [18], which largely motivates zero-shot SBIR (ZS-SBIR) [19,20,21]. ZS-SBIR shares the same challenges as conventional SBIR. Additionally, the “zero-shot” setting assumes that the training class set and the test class set are disjoint, which makes it more generic and practical but introduces an even greater challenge: the knowledge gap between seen and unseen categories. To tackle these challenges, some studies transform sketches and images into a common space and rely on domain adaptation algorithms for data alignment. Semantic information such as Word2Vec, GloVe, FastText and WordNet is then adopted to constrain the relationship between the transformed features. These methods apply external text-based or hierarchical models to understand the attribute mappings in a semantic space, as shown in Fig. 2. Nevertheless, they do not explicitly encourage the correlation of features. Using convolutional neural networks (CNNs) to extract features from both domains is limited, and even domain adaptation does not fully account for the characteristics of sketches and images. As a result, the domain gap, semantic gap and knowledge gap may not be effectively reduced. Cross-domain feature correlation is the key part of cross-domain retrieval networks, and exploring visual relevance to strengthen the correlation between domains is important for improving the effectiveness and robustness of the whole network.

Fig. 2 An example of the ZS-SBIR framework

Existing methods treat the SBSR and ZS-SBIR tasks separately. Technically, the key step in both tasks is to generate discriminative feature representations of query sketches and target candidates so that their semantic similarity can be computed; for example, objects belonging to the same category should have similar or identical representations in the feature space. It is therefore entirely possible to design a general framework that addresses both tasks. In addition, previous SBSR methods only learn local correlation. Note that “local” here means that only the cross-domain feature correlation is enhanced in the common space (marked in pink in Fig. 1), while there is no, or only weak, feature correlation in the initial spaces built by the feature extractors (marked in blue and green in Fig. 1). Despite the excellent representational power of CNNs, these methods are also limited in modeling global structural information due to the inherently local nature of convolution operations.

To address these challenges and the shortcomings of previous methods, we propose a cross-domain learning framework called global semantics correlation transmitting and learning (GSCTL) to extract robust features from data with different modalities. In GSCTL, we first construct two distinct feature extractors to obtain modality-related features. Next, we construct a public space encoder that projects these features into a shared subspace to produce the projected features. Further, we propose a joint learning strategy to improve the quality of the projected features. The joint learning strategy integrates four types of learning: semantic consistency learning, domain consistency learning, feature correlation learning and feature similarity learning. Feature correlation learning and semantic consistency learning together ensure that the extracted features are highly robust and discriminative. Domain consistency learning reduces the large cross-modality heterogeneity, and feature similarity learning further reduces the discrepancy between cross-domain features.

The contributions of this paper are summarized as follows:

  1. We introduce a joint network for cross-domain feature learning to address SBSR and ZS-SBIR, which considers intra-class similarity and inter-class distinction for data alignment. The architecture simultaneously handles the vast domain gap, the semantic gap, and the limited knowledge about unseen categories.

  2. We propose a novel global correlation transmitting approach for sketch-based cross-domain visual retrieval. By introducing correlation learning in the initial spaces, more semantic and visual information can be captured and retained during training, thus effectively bridging the knowledge gap in the zero-shot setting.

  3. We elaborate a simple yet effective loss function for feature similarity learning, which increases the compactness of intra-class features and the separation of inter-class features through distance optimization.

  4. Extensive experiments are conducted on five benchmarks, i.e., Sketchy, TU-Berlin, RSketch, SHREC 2013 and SHREC 2014, demonstrating the effectiveness of the proposed approach for sketch-based 3D shape retrieval and zero-shot sketch-based image retrieval.

Related work

Domain adaptation

Domain adaptation maps data from query and target domains with different distributions into a feature space in which they are as close to each other as possible, and is thus an effective strategy for domain alignment. Widely used methods include maximum mean discrepancy (MMD) [22], central moment discrepancy (CMD) [23], adversarial learning (AD) [24] and the gradient reversal layer (GRL) [25]. MMD and CMD are distance metrics that measure the difference between the distributions of two representations; intuitively, the MMD or CMD distance decreases as the two distributions become more similar. Adversarial learning and the GRL are used to confuse a modality discriminator so that it cannot distinguish which modality the features come from. Adversarial learning is a very popular strategy widely used to handle the heterogeneous distributions of different modalities [26,27,28]. In recent years, the GRL has also been widely used in multiple cross-domain retrieval tasks and has demonstrated its effectiveness [27, 29].
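As a concrete illustration, the following is a minimal sketch of a biased Gaussian-kernel MMD estimate between two feature batches; the function name and bandwidth choice are ours and not taken from the cited works.

```python
import torch

def gaussian_mmd(x, y, sigma=1.0):
    """Biased MMD^2 estimate between two feature batches with an RBF kernel."""
    def kernel(a, b):
        # Pairwise squared Euclidean distances, then RBF kernel values.
        d2 = torch.cdist(a, b, p=2) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))

    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

# Example: MMD between sketch features and image features of shape (batch, dim).
sketch_feat, image_feat = torch.randn(32, 512), torch.randn(32, 512)
print(gaussian_mmd(sketch_feat, image_feat).item())
```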

Deep metric learning

Deep metric learning inherits the advantages of existing deep learning techniques and can learn much more complex and powerful nonlinear transformations [14, 30], making it a good strategy for feature alignment. Yan et al. designed the contrastive loss [31], which effectively handles the relationship between paired data in Siamese networks. The triplet loss [32] is computed over triplets of training samples and encourages the network to find an embedding space in which the distance between samples from the same category is smaller than the distance between samples from different categories. The center loss [33] is usually combined with a softmax loss to reduce intra-class distances. The triplet-center loss (TCL) [34] establishes a center for each class so that features of samples from the same class are pulled toward the corresponding center and pushed away from the centers of other classes. The attribute-based label smoothing (ALS) loss better regularizes the attributes of different instances to train more discriminative features [35].

Sketch-based cross-domain retrieval

Sketch-based cross-domain retrieval is a fundamental problem in the computer vision and multimedia fields. The task is to retrieve the 2D images or 3D models in a dataset that are similar in content to the query sketches submitted by the user, and many works have been proposed. Lei et al. proposed a semi-heterogeneous three-way joint embedding network (Semi3-Net) to explore the semantic correlation and capture the common feature information of cross-domain image data [36]. Chen et al. designed an attention-enhanced network (AE-Net) for constructing and understanding spatial sequence information between the sketch and image modalities [37]. Sun et al. proposed the dual local interaction network (DLI-Net) to explore an effective and efficient way to utilize local features of sketches and photos [38]. Yang et al. designed a sequential learning (SL) framework to deeply mine the category relationships between cross-domain data to achieve cross-domain retrieval [39].

With the development of artificial intelligence and deep learning, the zero-shot sketch-based cross-domain retrieval task has been proposed, and many excellent methods have recently emerged. Li et al. designed STRucture-aware Asymmetric Disentanglement (STRAD) to achieve cross-domain correspondence of sketches and images by analyzing their differences in a structure space [40]. Wang et al. proposed the transferable coupled network (TCN) to effectively improve network transferability from seen categories to unseen ones [41]. Xu et al. proposed the domain disentangled generative adversarial network (DD-GAN) for zero-shot sketch-based 3D retrieval, which not only reduces the inter-domain difference between sketches and 3D shapes but also minimizes the domain discrepancy between seen and unseen categories [42].

Methods

To achieve sketch-based cross-domain retrieval, we propose an effective end-to-end training method based on GSCTL that diminishes cross-domain heterogeneity and enhances cross-domain correlation, as shown in Fig. 3. The whole framework mainly contains three parts: (1) data pairing, (2) feature extraction and (3) learning objectives. We elaborate on the details of each part below.

Fig. 3 Training framework of our proposed method for sketch-based cross-domain visual data retrieval

Data pairing

The training data from the two modalities are composed of query-target pairs \(S = \left\{ {{s_i}} \right\} _{i = 1}^N\), \({s_i} = \left\{ {{x_i},l_i^x,{t_i},{y_i},l_i^y} \right\} \), where N is the total number of query-target pairs, \({x_i}\) is a sketch from the query dataset, and \({y_i}\) is a random data item (image or 3D model) from the target dataset. The positive or negative random pairing of multi-modal data (reflected by \({t_i}\)) is designed for feature similarity learning: \({t_i=1}\) if \({x_i}\) and \({y_i}\) have the same semantic label, and \({t_i=0}\) otherwise. \({l_i^x}\) and \({l_i^y}\) are the semantic labels of the query and target data, respectively.

Based on the above statement, we randomly match different categories of query sketches with target candidates in different domains, which on the one hand helps to explore the visual and semantic similarity of intra-class features, and is beneficial for bridging cross-domain discrepancy on the other hand.
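A minimal sketch of this random pairing step is given below; the data layout and function name are illustrative assumptions, not the paper's exact implementation.

```python
import random

def make_pairs(queries, targets, num_pairs):
    """Randomly pair query sketches with target candidates.

    queries: list of (x_i, label) sketch entries.
    targets: list of (y_i, label) image/3D-shape entries.
    Returns tuples (x_i, l_x, t_i, y_i, l_y) with t_i = 1 for same-class pairs.
    """
    pairs = []
    for _ in range(num_pairs):
        x, lx = random.choice(queries)
        y, ly = random.choice(targets)
        t = 1 if lx == ly else 0
        pairs.append((x, lx, t, y, ly))
    return pairs
```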

Feature extraction

For query sketches and target candidates, two independent sub-networks are used to learn the initial features. These two branches can be any existing image or 3D shape feature extraction network. After obtaining the initial representations of query sketches and target candidates, constructing a common feature space is an effective approach in the cross-domain retrieval task [4, 16, 43].

In the training process, we randomly select query-target pairs from the whole dataset and train our model in mini-batches. Separate feature extractors and a public shared subspace are built to generate the initial features and the projected features of each query-target pair. We denote the feature representations of a query-target pair as:

$$\begin{aligned} \left\{ {\begin{array}{*{20}{c}} {F_i^x = {f_q}\left( {{x_i},{\theta }_q } \right) }\\ {F_i^y = {f_t}\left( {{y_i},{\theta }_t } \right) }\\ {P_i^x = {f_s}\left( {{F_i^x},{\theta }_s } \right) }\\ {P_i^y = {f_s}\left( {{F_i^y},{\theta }_s } \right) }\\ \end{array}} \right. , \end{aligned}$$
(1)

where \({f_q}\left( . \right) \) and \({f_t}\left( . \right) \) represent the feature extractors for query and target data, respectively; \({\theta }_q\) and \({\theta }_t\) are their parameters; \({f_s}\left( . \right) \) represents the public shared subspace encoder and \({\theta }_s\) its parameters; \({F_i^x}\) and \({F_i^y}\) are the initial features of the given query and target data, respectively; and \({P_i^x}\) and \({P_i^y}\) are the corresponding projected features.
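The following is a minimal PyTorch sketch of Eq. (1). The class name, layer sizes and the use of two ResNet-50 backbones are our assumptions for illustration; in the actual experiments the target branch for 3D shapes is an MVCNN over multiple views.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class GSCTLFeatures(nn.Module):
    def __init__(self, feat_dim=2048, proj_dim=512):
        super().__init__()
        # f_q and f_t: independent backbones for query sketches and target candidates.
        self.f_q = nn.Sequential(*list(resnet50(weights=None).children())[:-1], nn.Flatten())
        self.f_t = nn.Sequential(*list(resnet50(weights=None).children())[:-1], nn.Flatten())
        # f_s: public shared subspace encoder applied to both modalities.
        self.f_s = nn.Sequential(nn.Linear(feat_dim, proj_dim), nn.ReLU(),
                                 nn.Linear(proj_dim, proj_dim))

    def forward(self, x, y):
        F_x, F_y = self.f_q(x), self.f_t(y)      # initial features
        P_x, P_y = self.f_s(F_x), self.f_s(F_y)  # projected features in the common space
        return F_x, F_y, P_x, P_y
```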

Joint learning objectives

The joint feature learning architecture simultaneously compensates for the divergence between domains and learns category-related characteristics. Specifically, for each query-target pair, we introduce domain consistency learning, semantic consistency learning, feature correlation learning and feature similarity learning.

Domain consistency learning

Domain consistency is an important property for cross-domain retrieval and the key to reducing the domain gap. If the distributions of the cross-domain features are consistent across modalities, the heterogeneity between them is effectively reduced. To achieve domain consistency learning, we introduce a domain (modality) discriminator \({f_d \left( {.}\right) }\) to identify the modality information. Domain consistency learning aims to confuse the domain information through adversarial learning [24], which ensures that the domain discriminator cannot accurately identify the modality of a feature.

The domain discriminator consists of fully connected layers followed by a sigmoid function. During training, the projected features of query data and target data are assigned different modality labels. Given the query-target pairs, the loss function of domain consistency learning is defined as follows:

$$\begin{aligned} L_{con}\left( P^x,P^y;{\theta }_q,{\theta }_t,{\theta }_s,{\theta }_d \right)&= - \frac{1}{N}\sum \limits _{i = 1}^N \Big ( \log {P_d}\left( f_d\left( P_i^x;{\theta }_d \right) \right) \nonumber \\&\quad + \log \left( 1 - {P_d}\left( f_d\left( P_i^y;{\theta }_d \right) \right) \right) \Big ), \end{aligned}$$
(2)

where \({{\theta }_d}\) represents the parameters of the modality discriminator, \({P_d \left( {.}\right) }\) represents the probability distribution generated by the modality discriminator for the projected feature.

Adversarial learning is used to train the domain discriminator. Unlike training a normal discriminator, which preserves and enhances the discriminability of the input features, here the domain information of the features gradually disappears as the domain discriminator is trained. Specifically, a negative hyper-parameter is multiplied with \(L_{con}\left( P^x, P^y;{\theta }_q,{\theta }_t,{\theta }_s,{\theta }_d \right) \), blurring the domain boundary of the features. In this way, the heterogeneous gap between the domains effectively decreases and domain alignment of the cross-domain features is achieved: features from different domains are encoded into one semantic space and yield domain-agnostic semantic representations.
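A minimal sketch of the adversarial domain-consistency objective of Eq. (2) is shown below, using a gradient reversal layer to realize the negative multiplier described above; the discriminator architecture and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back to the encoders.
        return -ctx.lam * grad_output, None

class DomainDiscriminator(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, p, lam=1.0):
        return torch.sigmoid(self.net(GradReverse.apply(p, lam)))

def domain_consistency_loss(disc, P_x, P_y, eps=1e-8):
    # Query features are labeled 1 and target features 0, as in Eq. (2).
    d_x, d_y = disc(P_x), disc(P_y)
    return -(torch.log(d_x + eps) + torch.log(1 - d_y + eps)).mean()
```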

Semantic consistency learning

Besides domain consistency, semantic consistency is also an important property for cross-domain retrieval. To achieve semantic alignment of cross-domain features, we introduce a category label classifier \({f_c \left( {.}\right) }\) to predict the category labels of the projected features, implemented with fully connected layers and the cross-entropy loss. Given the query-target pairs, the loss function of semantic consistency (feature discrimination) learning is defined as follows:

$$\begin{aligned} L_{dis}\left( P^x,P^y;{\theta }_q,{\theta }_t,{\theta }_s,{\theta }_c \right)&= - \frac{1}{N}\sum \limits _{i = 1}^N \Big ( l_i^x \log {P_c}\left( f_c\left( P_i^x;{\theta }_c \right) \right) \nonumber \\&\quad + l_i^y \log {P_c}\left( f_c\left( P_i^y;{\theta }_c \right) \right) \Big ), \end{aligned}$$
(3)

where \({{\theta }_c}\) represents the parameters of the category label classifier; \(P_c \left( .\right) \) represents the generated probability distribution of the categories predicted by the label classifier for projected features.

In this way, the category labels bridge the common space and the label space, and the difference between the classification predictions and the labels is minimized. The category information of each item is utilized for classification, and the discriminability of the query sketch and target candidate descriptors is significantly improved.
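A minimal sketch of the classification objective of Eq. (3), assuming a single linear classifier shared by both modalities (layer sizes and names are ours):

```python
import torch.nn as nn

class CategoryClassifier(nn.Module):
    def __init__(self, dim=512, num_classes=100):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, p):
        return self.fc(p)  # logits; the softmax is folded into the loss

def semantic_consistency_loss(clf, P_x, P_y, l_x, l_y):
    ce = nn.CrossEntropyLoss()
    # Both projected features must be classified into their ground-truth categories.
    return ce(clf(P_x), l_x) + ce(clf(P_y), l_y)
```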

Feature correlation learning

Semantic and domain alignment are important, but alone they cannot achieve superior retrieval performance, because the cross-entropy loss focuses on finding a decision boundary that separates the classes and cannot optimize inter-class and intra-class distances as well as metric learning methods can, while domain consistency learning only guarantees global domain alignment. Feature correlation learning therefore explores the semantic and visual relationships between features, which is important for improving the effectiveness and robustness of the whole network.

In ZS-SBIR, great strides have been made, but most attempts align with the larger photo-based zero-shot literature, where the key lies in leveraging external knowledge for cross-category adaptation [44]. However, SBIR is a visual task, whereas the embedded semantic knowledge comes from text, and the two do not match exactly. In SBSR, most previous methods build a common subspace with deep metric learning for correlation learning; these methods merely enhance the local feature correlation in the common subspace, i.e., there is no, or only weak, feature correlation in the original feature spaces constructed by the feature extractors. Therefore, we apply global feature learning to jointly model semantics and vision and to generate discriminative features for retrieval.

In this paper, we achieve global feature correlation by encouraging correlation learning in each local feature space. Note that “global” here means that the proposed loss is applied not only to the constructed common feature space but also to the initial feature spaces. Specifically, in each feature space we define the centers \({C} = \left\{ {{c_1},{c_2},{c_3},\ldots ,{c_M}} \right\} \) for metric learning, where \({c_i}\) is the i-th category center and M is the number of categories. We update every center and feature within a mini-batch, pulling the features toward their corresponding centers and pushing them away from the other centers. To increase the inter-class distinction and strengthen the intra-class similarity, we adopt the metric learning loss of TCL [34]:

$$\begin{aligned} L_{tc}\left( F_i;{\theta } \right) = \sum \limits _{i = 1}^N \max \Big ( D\left( F_i,c_{y^i} \right) + m - \mathop {\min }\limits _{j \ne y^i} D\left( F_i,c_j \right) , 0 \Big ), \end{aligned}$$
(4)

where \({F_i}\) represents a data feature; \({\theta }\) denotes the training parameters; \({{D}\left( {.} \right) }\) is the squared Euclidean distance; \({c_{y^i}}\) and \({c_j}\) are the center of \({F_i}\)'s class and a negative center, respectively; and m is a hyper-parameter representing the margin between different classes.
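The following is a minimal sketch of the triplet-center loss of Eq. (4) over a mini-batch with learnable class centers; the exact center-update scheme is our assumption rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class TripletCenterLoss(nn.Module):
    def __init__(self, num_classes, feat_dim, margin=5.0):
        super().__init__()
        self.margin = margin
        # One learnable center per category, initialized from N(0, 1).
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):
        # Squared Euclidean distances to all centers: shape (batch, num_classes).
        d = torch.cdist(feats, self.centers, p=2) ** 2
        d_pos = d.gather(1, labels.view(-1, 1)).squeeze(1)  # distance to own center
        # Mask out the own-class column, then take the nearest negative center.
        d_neg = d.scatter(1, labels.view(-1, 1), float('inf')).min(dim=1).values
        return torch.clamp(d_pos + self.margin - d_neg, min=0).sum()
```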

As shown in Fig. 3, the global feature correlation learning is achieved through local correlation learning (the three purple blocks in Fig. 3). We define \({L_{tc} \left( {P^x,P^y;{{\theta }_q},{{\theta }_t},{{\theta }_s}}\right) }\) as local correlation learning function for projected features, \({L_{tc} \left( {F^x;}{{\theta }_q} \right) }\) and \({L_{tc} \left( {F^y;}{{\theta }_t} \right) }\) as local correlation learning functions for original features. The global correlation learning function can then be established by combining these three local correlation learning functions:

$$\begin{aligned} L_{cl}\left( {\theta }_q,{\theta }_t,{\theta }_s \right) = \alpha L_{tc}\left( P^x,P^y;{\theta }_q,{\theta }_t,{\theta }_s \right) + \beta \left[ L_{tc}\left( F^x;{\theta }_q \right) + L_{tc}\left( F^y;{\theta }_t \right) \right] , \end{aligned}$$
(5)

where \({\alpha }\) and \({\beta }\) are hyper-parameters that control the balance between the terms.

Through this global feature correlation learning, the discrimination of features in the two initial feature spaces can be enhanced and transferred to the common space. As a result, the eventually learned features could be more discriminative and consistent, which will be demonstrated in the experiments.

Feature similarity learning

TCL focuses on pulling features toward their corresponding centers and pushing them away from unrelated centers. However, TCL ignores the distances between features, so features near the decision boundaries may end up closer to features of other categories, as shown in Fig. 4, indicating that the intra-class features are not compact enough. In other words, feature alignment and domain alignment can only be partially achieved.

Fig. 4 The distributions of deep features learned by TCL

Inspired by contrastive learning and the GRL, we design a loss function for feature similarity learning that improves the intra-class compactness and inter-class separability of cross-domain features, as shown in Fig. 5. Feature similarity learning exploits the semantics of the projected features, pulling intra-class features closer and pushing inter-class features farther apart.

Fig. 5 A toy illustration of the feature similarity learning

Given the query-target pairs, the loss function of feature-similarity learning is defined as follows:

$$\begin{aligned} L_{sim}\left( P_i^x,P_i^y;t_i;{\theta }_q,{\theta }_t,{\theta }_s \right) = \frac{1}{N}\sum \limits _{i = 1}^N \Big ( t_i\, D\left( P_i^x,P_i^y \right) + \left( 1-t_i \right) D\left( R_{\lambda }\left( P_i^x \right) ,P_i^y \right) \Big ), \end{aligned}$$
(6)

where \({R_{\lambda }\left( {.}\right) }\) denotes the GRL. If \({P_i^x}\) and \({P_i^y}\) have the same semantic label (\({t_i=1}\)), minimizing \(L_{sim}\) makes the intra-class features more compact. If they have different labels (\({t_i=0}\)), \(L_{sim}\) relies on the GRL, which multiplies the gradient by a negative constant during backpropagation; this opposes the minimization objective and therefore pushes the inter-class features farther apart.
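A minimal sketch of the similarity loss of Eq. (6) is given below; it reuses the GradReverse layer from the domain-consistency sketch above for the negative (t_i = 0) pairs, and the function name is ours.

```python
import torch

def feature_similarity_loss(P_x, P_y, t, lam=1.0):
    """t: (batch,) tensor with 1 for same-class pairs and 0 otherwise."""
    d_pos = ((P_x - P_y) ** 2).sum(dim=1)                 # pull matched pairs together
    # The reversed gradient pushes mismatched pairs apart during backpropagation.
    d_neg = ((GradReverse.apply(P_x, lam) - P_y) ** 2).sum(dim=1)
    return (t * d_pos + (1 - t) * d_neg).mean()
```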

Optimization objectives

Based on the above discussions, the overall loss function of our method can be expressed as:

$$\begin{aligned} L\left( {\theta }_q,{\theta }_t,{\theta }_s,{\theta }_c,{\theta }_d \right)&= \lambda L_{dis}\left( P^x,P^y;{\theta }_q,{\theta }_t,{\theta }_s,{\theta }_c \right) + L_{cl}\left( F^x,F^y,P^x,P^y;{\theta }_q,{\theta }_t,{\theta }_s \right) \nonumber \\&\quad - \phi L_{con}\left( P^x,P^y;{\theta }_q,{\theta }_t,{\theta }_s,{\theta }_d \right) + \varphi L_{sim}\left( P^x,P^y;t;{\theta }_q,{\theta }_t,{\theta }_s \right) , \end{aligned}$$
(7)

where \({\lambda }\), \({\phi }\) and \({\varphi }\) are positive hyper-parameters that control the balance among the terms.

For learning more robust and discriminative features, our method searches for the optimal parameters \({{\theta }_q}\), \({{\theta }_t}\), \({{\theta }_s}\), \({{\theta }_c}\), and \({{\theta }_d}\) as a min-max game, shown as follows:

$$\begin{aligned} \left( {\theta }_q^*,{\theta }_t^*,{\theta }_s^*,{\theta }_c^* \right) = \arg \mathop {\min }\limits _{{\theta }_q,{\theta }_t,{\theta }_s,{\theta }_c} L\left( {\theta }_q,{\theta }_t,{\theta }_s,{\theta }_c,{\theta }_d \right) , \end{aligned}$$
(8)
$$\begin{aligned} {\theta }_d^* = \arg \mathop {\max }\limits _{{\theta }_d} L\left( {\theta }_q^*,{\theta }_t^*,{\theta }_s^*,{\theta }_c^*,{\theta }_d \right) . \end{aligned}$$
(9)

The training can be realized using a stochastic gradient descent algorithm, as summarized in Algorithm 1; a minimal code sketch of one training step is given after Algorithm 1.

Algorithm 1 Training algorithm for the proposed method
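Below is a minimal sketch of one training step combining the objectives of Eq. (7), reusing the modules and loss functions sketched in the previous subsections; the parameter grouping and hyper-parameter names are assumptions that mirror the text.

```python
def train_step(model, clf, disc, tcl_common, tcl_q, tcl_t, batch, opt_main, opt_disc,
               lam=1.0, alpha=0.5, beta=0.01, phi=0.1, varphi=1.0):
    x, l_x, t, y, l_y = batch                     # sketch, labels, pair flag, target, labels
    F_x, F_y, P_x, P_y = model(x, y)              # Eq. (1)

    L_dis = semantic_consistency_loss(clf, P_x, P_y, l_x, l_y)          # Eq. (3)
    L_cl = alpha * (tcl_common(P_x, l_x) + tcl_common(P_y, l_y)) \
         + beta * (tcl_q(F_x, l_x) + tcl_t(F_y, l_y))                   # Eq. (5)
    L_con = domain_consistency_loss(disc, P_x, P_y)                     # Eq. (2)
    L_sim = feature_similarity_loss(P_x, P_y, t)                        # Eq. (6)

    # The GRL inside the discriminator already realizes the min-max game of
    # Eqs. (7)-(9): the discriminator minimizes L_con while the reversed gradient
    # makes the encoders maximize it, so L_con is simply added here.
    loss = lam * L_dis + L_cl + phi * L_con + varphi * L_sim

    opt_main.zero_grad()
    opt_disc.zero_grad()
    loss.backward()
    opt_main.step()
    opt_disc.step()
    return loss.item()
```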

Experiments

Dataset

For the sketch-based cross-domain visual data retrieval experiments, we select five widely used datasets. For sketch-based image retrieval, the selected datasets are Sketchy [45, 46], TU-Berlin [47, 48] and RSketch [49]. For sketch-based 3D model retrieval, SHREC 2013 [50] and SHREC 2014 [51] are selected. Table 1 summarizes the key properties of these datasets.

Table 1 Key properties of existing datasets

Training settings

We implement the proposed method in PyTorch on a computer with a Tesla V100S GPU. The modality discriminator is trained using Adam [52] with a learning rate of 0.01, and all other components are trained using SGD [53] with an initial learning rate of 0.01; after every ten iterations, the learning rate is reduced to one-tenth of its previous value. The hyper-parameter m of TCL is set to 5 following [34]. The other hyper-parameters \({\alpha }\), \({\beta }\), \({\lambda }\), \({\phi }\) and \({\varphi }\) are set to 0.5, 0.01, 1, 0.1 and 1, respectively. We use ResNet50 [54] as the backbone for feature extraction. For cross-domain 3D shape retrieval, we represent each 3D shape with multiple views and use the multi-view convolutional neural network (MVCNN) [55] to obtain the 3D shape feature. We initialize the centers from a standard normal distribution, with mean 0 and standard deviation 1.
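A minimal sketch of this optimizer setup, reusing the module names from the earlier sketches; which parameters go into which optimizer is inferred from the description above and should be treated as an assumption.

```python
import torch

# Adam for the domain discriminator, SGD for everything else, both at lr = 0.01;
# the SGD learning rate is divided by 10 on the schedule described above.
opt_disc = torch.optim.Adam(disc.parameters(), lr=0.01)
main_params = list(model.parameters()) + list(clf.parameters()) \
            + list(tcl_common.parameters()) + list(tcl_q.parameters()) + list(tcl_t.parameters())
opt_main = torch.optim.SGD(main_params, lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(opt_main, step_size=10, gamma=0.1)
```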

Zero-shot sketch-based image retrieval

The main challenge of ZS-SBIR is building a knowledge-sharing bridge between the known training classes and the unknown test classes, which tests the representational power of the features. In this section, we use Sketchy, TU-Berlin and RSketch as the datasets for ZS-SBIR. For this task, we use the evaluation metrics in [49, 56, 57] to evaluate the retrieval performance:

  1. Mean average precision (MAP): average precision (AP) records the retrieval accuracy of a single query and amounts to the area under the precision–recall (PR) curve; the APs of all query sketches are averaged to produce the MAP.

  2. Prec@K: the precision of the top K returned images is used to evaluate retrieval performance. On the Sketchy and TU-Berlin datasets, K = 100 and 200; on the RSketch dataset, K = 10 and 50. A minimal computation sketch of both metrics follows this list.
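The sketch below computes both metrics for a single ranked retrieval list, assuming binary relevance judgments (the ranking itself would come from feature distances):

```python
import numpy as np

def average_precision(relevant):
    """relevant: 1/0 array over the ranked gallery for one query."""
    relevant = np.asarray(relevant)
    if relevant.sum() == 0:
        return 0.0
    hits = np.cumsum(relevant)
    # Precision at each rank where a relevant item appears.
    precision_at_hit = hits[relevant == 1] / (np.flatnonzero(relevant) + 1)
    return precision_at_hit.mean()

def prec_at_k(relevant, k):
    return float(np.asarray(relevant)[:k].mean())

# MAP is the mean of average_precision over all query sketches.
ranking = [1, 0, 1, 1, 0]          # toy relevance list for one query
print(average_precision(ranking))   # ~0.806
print(prec_at_k(ranking, 3))        # ~0.667
```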

Retrieval result on Sketchy and TU-Berlin

We follow the splits of the Sketchy and TU-Berlin datasets used in [56, 57]. For Sketchy, we select 21 test classes that are not among the 1000 classes of ImageNet. For TU-Berlin, we randomly select 220 categories for training and use the remaining 30 categories for testing.

The retrieval results in comparison with the state-of-the-art approaches are shown in Table 2. The compared methods are ZSIH [58], conditional variational auto-encoders (CVAE) [56], semantically tied paired cycle consistency (SEM-PCYC) [21], Doodle To Search (DTS) [20], CrossATNet [19], SketchGCN [59], STRAD [40], StyleGuide [57], TCN [41], bi-level domain adaptation for zero-shot SBIR (BDA-SketRet) [60] and discriminant adversarial learning (DAL) [61]. The highest value of each metric is shown in bold.

Table 2 Comparison with the state-of-the-art approaches on Sketchy and TU-Berlin datasets

For all metrics, our method achieves the best retrieval performance, demonstrating that the proposed GSCTL method achieves both feature alignment and domain alignment. It is worth noting that our method does not use extra semantic information as previous works do, but instead adopts global correlation learning and feature similarity learning. Previous works extract word vectors of class names from word-vector models, measure the word similarity of class names through hierarchical models, or combine both. These approaches have an obvious drawback: semantic embeddings encode mostly textual information, whereas ZS-SBIR is a visual task, so such semantic guidance is suboptimal. Our approach focuses on visual relevance, avoids additional language modeling and its time cost, and reduces the burden of acquiring training resources. The comparison on the above metrics shows that these two learning schemes achieve cross-domain correspondence between sketches and images, which facilitates the transfer of semantic knowledge under the zero-shot scenario.

Figure 6 visualizes the retrieval results of the GSCTL method on the Sketchy dataset. The query sketches, listed on the left of the vertical line, are dolphin, helicopter, sword and windmill. The top 7 retrieved images are listed on the right in ranking order, with incorrect retrievals highlighted in red. For the helicopter, sword and windmill classes, all the retrieved images are correct. For the dolphin class, several retrievals are incorrect; however, even the incorrect images exhibit a high visual resemblance to the given dolphin sketch. These incorrect results also show that zero-shot sketch-based image retrieval is a very challenging task.

Fig. 6 Retrieval results on Sketchy

Retrieval result on RSketch

For the RSketch dataset, the splitting method proposed in the literature [50] is used. The method divides the 20 classes of the dataset into 4 subsets, each containing 15 seen classes and 5 unseen classes. Following this split, the proposed GSCTL method is trained and tested on the 4 subsets, and the average performance over the four subsets, in terms of the evaluated metrics, is used for comparison.

The comparison between our GSCTL with the state-of-the-art approaches is shown in Table 3. All comparison methods are based on deep learning, such as deep spatial-semantic attention (DSSA) [62], deep shape matching (DSM) [63], learned deep image-sketch features (LDF) [64], adversarial sketch-image feature learning (AFL) [49], DTS [20], deep supervised cross-modal retrieval (DSCMR) [65], cross-modal center loss (CMCL) [66] and DAL [61]. The highest value for each metric is shown in bold.

Table 3 Comparison with the state-of-the-art approaches on RSketch dataset

For the 3 metrics in the seen categories, the GSCTL method achieves the best retrieval performances. For the 3 metrics in the unseen categories, the GSCTL method attains two quasi-best values in terms of MAP and Prec@10, and the third-best performance in terms of Prec@50.

Compared with DAL, the GSCTL method mines more feature correlation information through global feature correlation learning, which enhances the retrieval performance for seen categories but reduces the inference ability for unseen categories. Compared with other methods, the GSCTL method can discover hidden relationships between features and transfer the semantic knowledge from seen to unseen categories through the strategy of joint learning, which is beneficial for improving retrieval performance. Further experimental details and conclusions on the RSketch dataset can be found in “Ablation study”.

Sketch-based 3D shape retrieval

The 3D shapes from SHREC 2013 and SHREC 2014 are not consistently aligned, i.e., these models are perturbed by random rotations. For a more detailed and comprehensive description of the 3D models, we use dodecahedron-view images to render 3D objects following the method in [67] and obtain 20 views for each 3D object.

We employ nearest neighbor (NN), first tier (FT), second tier (ST), E-measure (E), discounted cumulated gain (DCG), and mean average precision (MAP) to evaluate the retrieval performance following [50, 51]. A higher metric value indicates better performance.

On the two benchmark datasets (SHREC 2013 and SHREC 2014), we compare our method with the state-of-the-art methods, including TCL [34], AFL [49], deep cross-modality adaptation (DCA) [68], SEM [43], deep sketch-shape hashing (DSSH) [69], BV-CDSM [70], deep point-to-subspace metric learning (DPSML) [16], dual independent classification (DIC) [71], hyperbolic embedded attentive representation (HEAR) [72], JHFL_DA (joint heterogeneous feature learning and distribution alignment) [73], M-GCN [74] and DAL [61].

Table 4 Comparison with state-of-the-art approaches on SHREC 2013 dataset
Table 5 Comparison with state-of-the-art approaches on SHREC 2014 dataset

The results are summarized in Tables 4 and 5, where bold indicates the best performance, italic the quasi-best, and bold italic the third-best. Our method achieves the best performance in 7 metrics across the two benchmarks (NN, FT, ST, DCG and MAP on SHREC 2013; ST and DCG on SHREC 2014), the quasi-best performance in 3 metrics (E on SHREC 2013; E and MAP on SHREC 2014), and the third-best performance in one metric (FT on SHREC 2014), while remaining competitive on the remaining metric. Overall, the experimental results demonstrate the superiority of our GSCTL method over the existing methods.

Figure 7 visualizes example 3D models retrieved by GSCTL from SHREC 2013, with incorrect retrievals marked by red squares. The incorrect results are mainly due to small class sizes. For the “ant” class, our method first retrieves the correct samples and then incorrect ones, because the “ant” class contains only 5 samples; the additional retrieved models (bees) have visual shapes similar to the given “ant” sketch.

Fig. 7 Retrieval results on SHREC 2013

The reasons why our method is also effective in sketch-based 3D model retrieval tasks can be explained as follows: We build a public space encoder and jointly apply four types of learning. Such a learning framework is intended to further improve the information representation capability of the public space encoder in extracting modality-invariance and semantic-correlation features. Domain consistency learning essentially converts the cross-domain retrieval task into a same-domain retrieval task, which reduces the difficulty of retrieval. Global correlation learning can capture and transmit modality-invariant features effectively from two initial feature spaces to a common feature space.

Discussions

In this section, we provide in-depth analyses of various design choices to gain insight into the effectiveness and generalization of the proposed model, especially the parameter settings. The objective functions involve several hyper-parameters that balance the contributions of the loss terms. To explore the sensitivity of GSCTL to these hyper-parameters, we design experiments with different values of \({\alpha }\), \({\beta }\) and \({\varphi }\), as reported in “Comparison of the effectiveness of similarity loss” and “The impact of hyperparameter”.

Image-based 3D shape retrieval

Image-based 3D shape retrieval (IBSR) is also a hot research topic in cross-domain 3D retrieval and shares similar difficulties with SBSR. To validate the robustness of GSCTL, we conduct comparative experiments on IN2MN, SceneIBR2019 and MI3DOR, which are popular image-based 3D shape retrieval benchmarks.

Following [63, 75], different metrics are used to evaluate the retrieval performance on different datasets. On IN2MN, MAP, Prec@10, Prec@50 and Prec@100 are used. For SceneIBR2019, NN, FT, ST, E, DCG, MAP and the PR curve are used [75]. For MI3DOR, the metrics are NN, FT, ST, F, DCG, ANMRR (average normalized modified retrieval rank) and AUC (the area under the PR curve) [76]. Note that a lower ANMRR value indicates better performance.

Retrieval results on IN2MN

The 3D models of IN2MN come from ModelNet40, which are consistently aligned. In addition, IN2MN provides 12 views for each 3D shape. Therefore, in this subsection, we use 12 views to represent each 3D object following [53].

The retrieval results, in terms of the evaluated metrics, of our method and the state-of-the-art approaches are presented in Table 6. The compared methods include CDTNN (cross-domain triplet neural network) [77], DSCMR [65], AFL [49], CMCL [66] and DAL [61]. The experimental results show that our method achieves significantly better retrieval performance on every metric, which indicates the effectiveness of the proposed GSCTL method.

Table 6 Comparison with state-of-the-art approaches on IN2MN dataset

Figure 8 visualizes the retrieval results of GSCTL on IN2MN. It can be seen that all retrieved results are correct.

Fig. 8 Retrieval results on IN2MN

Retrieval results on SceneIBR2019

SceneIBR2019 is a dataset designed for image-based 3D scene retrieval. Image-based scene retrieval is more challenging because the scene contains more entities and there may be overlapping relationships between entities [78].

In this experiment, we use SketchUp to capture the representative views of all 3D scene models following MMD-VGG in [79]. Therefore, an image-based 3D scene retrieval task is transformed into an image-based 2D scene retrieval task.

Our method is compared against RNIRAP, CVAE, TCL, VMV-VGG, and DRF, which are the methods based on deep learning for the SceneIBR2019 dataset from SHREC 2019 [75]. We also extended DSCMR [65], AFL [49], CMCL [66] and DAL [61] so that they can be trained on the SceneIBR2019 dataset.

The comparison results for the first six metrics are shown in Table 7, where bold indicates the best performance, italic the quasi-best, and bold italic the third-best. The precision–recall curves are shown in Fig. 9. Our method achieves the best performance in four metrics (i.e., ST, DCG, MAP and the PR curve) and the quasi-best performance in the remaining three metrics.

Table 7 Comparison with state-of-the-art approaches on SceneIBR2019 dataset
Fig. 9 Precision–recall curves performance comparison results with state-of-the-art approaches on SceneIBR2019

Retrieval result on MI3DOR

MI3DOR followed the method in [55] to render the 3D object (.OBJ) and provided 12 views for each 3D object. In this experiment, we also use 12 views to represent each 3D object.

Our method is compared against RNF-MVCVR, SORMI, RNFETL, CLA, MLIS, ADDA-MVCNN, SRN, ALIGN, collaborative distribution alignment (CDA) [80], consistent domain structure learning and domain alignment (CDSLDA) [81], universal cross-domain (UCD) [82], AFL [49], DSCMR [65], CMCL [66], JHFL_DA [73], M-GNN [74] and DAL [61]. The first eight methods are supervised methods on the MI3DOR dataset from SHREC 2019 [55]. AFL, DSCMR and CMCL are reproduced for image-based 3D shape retrieval.

Table 8 presents the comparison with the state-of-the-art methods, where bold indicates the best performance, italic the quasi-best, and bold italic the third-best. On this dataset, our method achieves the best performance in one metric (ST), the quasi-best performance in three metrics (FT, DCG and ANMRR), and the third-best performance in three metrics (NN, F and AUC). In general, our method attains the quasi-best overall retrieval performance, behind only M-GNN.

Table 8 Comparison with state-of-the-art approaches on MI3DOR dataset
Table 9 Comparison of metrics under different similarity losses

For cross-domain 3D model retrieval tasks, several approaches have been evaluated on multiple datasets. As shown in Tables 4, 5, 7 and 8, compared with these reused methods, our GSCTL attains the best metrics on SHREC 2013, SHREC 2014 and SceneIBR2019. On the MI3DOR dataset, the retrieval performance of GSCTL falls behind M-GNN; however, the retrieval performance of M-GNN on SHREC 2014 is unsatisfactory, where our GSCTL method is clearly ahead. The superior and stable retrieval performance across multiple datasets illustrates the excellent generalization of our method.

Based on all experimental results, our approach outperforms the other comparison methods in multiple types of retrieval tasks for the following reasons:

  1. Domain consistency learning reduces the heterogeneous gap and achieves domain alignment, which turns the cross-domain retrieval task into a same-domain retrieval task and reduces the difficulty of retrieval.

  2. Global correlation learning learns more semantic-correlation attributes in the initial feature spaces and transfers them to the common space, so more semantics-related information is captured and the generated visual features are more robust.

  3. Feature similarity learning pulls intra-class features closer and pushes inter-class features farther apart, narrowing the gap between paired features for a more compact semantic alignment.

In summary, the joint network could simultaneously deal with the vast domain gap, semantic gap, and limited knowledge about the unseen categories. All these designs enable our framework to mine more semantic category relationships.

Comparison of the effectiveness of similarity loss

One contribution of our method is the designed similarity learning loss. In this subsection, we replace the similarity loss with the contrastive loss and conduct extensive experiments on the IN2MN dataset. The retrieval results under the different loss functions are recorded in Table 9.

From Table 9, we observe that the retrieval performance in terms of both MAP and Prec@K increases as the weight \({\varphi }\) of the similarity loss grows, with the optimal retrieval performance achieved at \({\varphi } = 1\). At this point, the intra-class features are more compact and GSCTL improves on several metrics. These results further validate the effectiveness of feature similarity learning, which can be used directly as a plug-in for cross-domain retrieval tasks.

Compared with the contrastive loss, the designed similarity loss has no margin hyperparameter, so training avoids tuning it. Analyzing the results in Table 9, we observe that the contrastive loss is sensitive to its hyperparameters: the retrieval performance varies greatly for different values of the margin and \({\varphi }\). Finding good values for these hyperparameters can be challenging and may require extensive experimentation.

Comparison of the effectiveness of domain adaptation algorithms

In this subsection, we investigate the impact of different domain adaptation algorithms on retrieval performance, as they play a crucial role in mitigating modal heterogeneity. Specifically, we compare our adversarial learning-based approach with three widely used algorithms (GRL, CMD, MMD) for modal consistency learning. We evaluate the retrieval results using MAP (Mean Average Precision) on four datasets (SHREC 2013, SHREC 2014, MI3DOR, and SceneIBR2019).

The comparison of MAP values under different domain adaptation algorithms is illustrated in Fig. 10. Across the four datasets, adversarial learning yields the best MAP on three datasets and the quasi-best on the remaining one, demonstrating that adversarial learning is the domain adaptation algorithm best suited to our proposed method. The key reason is that, instead of explicitly measuring the distance between the source and target domains as CMD and MMD do, adversarial learning leverages deep neural networks to learn the mapping between the source and target domains more flexibly, generating more robust and realistic feature representations.

Fig. 10 Comparison of metrics under different domain adaptation algorithms

Furthermore, we employ t-SNE [83] to visualize the distribution of the features obtained with the four domain adaptation algorithms on the MI3DOR dataset, as depicted in Fig. 11, where each color represents a different class. The visualization shows that adversarial learning (AD) and the gradient reversal layer (GRL) lead to better separation of the features; although some overlap remains among inter-class clusters, it is significantly smaller than with maximum mean discrepancy (MMD) and central moment discrepancy (CMD). Comparing AD with GRL, the inter-class clusters under AD exhibit more distinct boundaries.

Fig. 11 Visualization of the features learned by different domain adaptation algorithms: a AD, b GRL, c CMD, d MMD

Empirical analysis

Time and space cost analysis

In Table 10, we compare the convergence batch and average retrieval time cost of our model with those of DSCMR, AFL, CMCL and DAL on IN2MN. During evaluation, all data in the retrieval gallery are projected into the latent embedding space and stored in memory. Given one query sample, the time to compute the similarities between the query image and all 3D shapes in the gallery and to sort them is recorded, following previous work [21, 84], and the average time over all query samples is reported in Table 10. Denoting the number of gallery samples as N and the dimension of the latent representation as D, the time and space complexity are determined by N and D, with little difference among the methods. From the experimental results, we observe that the convergence speed and MAP of the proposed method are better than those of the others, which we attribute to its strong feature learning ability.

Table 10 Comparison of convergence batch and the retrieval time cost

Ablation study

The proposed method combines different learning objectives and leverages their advantages to achieve superior performance. To analyze the contribution of each learning objective, we conduct an ablation experiment on the RSketch dataset. In this experiment, we denote semantic consistency learning as SC, domain consistency learning as DC, local feature correlation learning as LFC, global feature correlation learning as GFC, and feature similarity learning as FS. The retrieval results under different combinations of loss functions are recorded in Table 11.

Table 11 Ablation experiments on RSketch

From Table 11, we can draw the following conclusions:

  1. Adding DC or LFC to SC improves the retrieval performance for seen categories by providing domain alignment or feature alignment capability.

  2. Adding GFC to SC improves the retrieval performance for both seen and unseen categories. Global feature correlation learning helps extract and transmit more useful information, facilitating the transfer of semantic knowledge under the zero-shot scenario.

  3. Incorporating LFC or GFC into SC + DC enhances the mutual information of cross-domain features by applying correlation learning on top of domain alignment, and the experimental results show that this measure can significantly improve retrieval performance. However, adding DC to SC + GFC enforces intermodal consistency in the distribution of cross-domain features, which loses some modality-specific semantic knowledge and affects the model's inference on unknown data.

  4. Adding FS to SC + DC + GFC improves the intra-class compactness and inter-class separability of cross-domain features, resulting in enhanced retrieval performance for both seen and unseen categories.

The main contributions of our method lie in the use of global feature correlation learning and the design of feature similarity learning. To further illustrate their roles, we compare the baseline (SC + DC + LFC) with the addition of feature correlation learning in the original feature spaces (OFC) and feature similarity learning (FS) in cross-domain 3D retrieval experiments on SHREC 2013, SHREC 2014, MI3DOR and SceneIBR2019. The retrieval performance, evaluated using MAP, is shown in Fig. 12.

Fig. 12 Ablation experiment in MAP comparison on SHREC 2013, SHREC 2014, MI3DOR and SceneIBR2019

We also adopt t-SNE [83] to visualize the distribution of the features on MI3DOR for the ablation experiment, as shown in Fig. 13. Adding FS and OFC to the baseline yields better feature clustering, with most clusters clearly separated across categories.

From Figs. 12 and  13, we can draw the following conclusions:

  1. Without OFC and FS, the baseline achieves good retrieval performance but lacks compactness in intra-class features, leading to unsatisfactory clustering.

  2. Adding OFC significantly improves the MAP, indicating that the proposed global feature correlation learning effectively learns useful information from the original spaces and transfers it to the common subspace, thus enhancing retrieval accuracy under domain alignment.

  3. Adding FS also leads to significant improvements in MAP, indicating that the proposed feature similarity learning reduces cross-domain discrepancy and enhances feature correlation.

  4. The best MAP scores are obtained when both FS and OFC are added. The combination of these two learning methods allows our method to leverage their advantages and extract more useful information.

The impact of the methods of generating initial centers

In this subsection, we explore the impact of different ways of initializing the centers on the retrieval results for the IN2MN dataset; the results are listed in Table 12. We consider two initialization methods: a normal distribution and a uniform distribution. For the normal distribution, the mean is set to 0 while the standard deviation varies from 1 to 0.01. For the uniform distribution, the centers are drawn from the interval [0, 1).
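A minimal sketch of the two initialization schemes compared here, applied to a center matrix such as the one used by the metric-learning loss (the shapes are illustrative):

```python
import torch

num_classes, feat_dim = 40, 512

# Normal initialization: mean 0, standard deviation in {1, 0.1, 0.01}.
centers_normal = torch.randn(num_classes, feat_dim) * 1.0

# Uniform initialization over [0, 1).
centers_uniform = torch.rand(num_classes, feat_dim)
```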

Fig. 13 Visualization of the features learned by ablation experiment: a baseline, b baseline + OFC, c baseline + FS, d baseline + FS + OFC

Table 12 Retrieval results in different ways of initializing the centers of TCL on the IN2MN dataset

Analyzing the results in Table 12, we observe that the proposed GSCTL method achieves the best retrieval performance when the mean is 0 and the standard deviation is 1, which supports initializing the centers from a standard normal distribution. Initializing the centers with a uniform distribution degrades performance, indicating that normal initialization contributes more to the overall performance than uniform initialization.

The impact of hyperparameter

The hyperparameters \({\alpha }\) and \({\beta }\) control the trade-off among these learning objectives. To investigate the impact of these parameters, we conducted extensive experiments on the IN2MN dataset.

Fig. 14 The performance of GSCTL in 3D model retrieval on the IN2MN dataset with different values of the hyperparameters \({\alpha }\) and \({\beta }\)

Figure 14 shows the performance of GSCTL for different values of \({\alpha }\) and \({\beta }\). The evaluation involved changing one parameter while keeping the other parameter fixed. When the value of \({\alpha }\) increases from 0.1 to 0.9, the MAP value initially increases and then decreases. When the value of \({\beta }\) increases from 0.01 to 0.09, the MAP value initially decreases, then increases, and finally decreases again. We observe that GSCTL achieves the best performance when \({{\alpha }=0.5}\) and \({{\beta } = 0.01}\). This result indicates that enhancing feature correlation in original feature spaces effectively improves retrieval performance.

Conclusion and future work

In this research, we propose an end-to-end general framework for sketch-based cross-domain retrieval. The proposed method incorporates semantic consistency, domain consistency, global feature correlation and feature similarity learning to reduce cross-domain heterogeneity and enhance global cross-domain correlation. To validate its effectiveness, we conducted experiments on 8 datasets. The results show that the proposed method outperforms state-of-the-art approaches, excelling especially in the ST metric for cross-domain 3D model retrieval tasks.

Although the proposed method achieves impressive retrieval performance, some shortcomings remain. First, the hyperparameter values are obtained from a large number of experiments. Second, the retrieval method operates in a closed setting between two pre-defined domains. Third, there is room for improvement in feature relationship mining. In future work, we will design a better approach to balancing the loss functions, move toward an open setting where multiple visual domains are available, and design a better feature correlation learning method to further improve retrieval performance.

In summary, this research is the first exploration that combines the SBSR and ZS-SBIR tasks, contributing to cross-domain retrieval, and the improved retrieval performance shows its potential for further research and practical applications in the field. The method can improve efficiency: for example, managers can quickly find the desired images and 3D shapes, and 3D designers can directly reuse retrieved 3D shapes instead of designing from scratch. More importantly, the enhanced retrieval performance under zero-shot conditions makes it possible to use existing knowledge to manage unknown categories of data when labeled samples are insufficient or even completely absent, thereby avoiding manual labeling and reducing costs.