1 Introduction

Matching natural images with free-hand sketches, i.e. sketch-based image retrieval (SBIR) (Yu et al. 2015, 2016a; Liu et al. 2017; Pang et al. 2017; Song et al. 2017b; Shen et al. 2018; Zhang et al. 2018; Chen and Fang 2018; Kiran Yelamarthi et al. 2018; Dutta and Akata 2019; Dey et al. 2019), has received a lot of attention. Since sketches can effectively express the shape, pose and some fine-grained details of the target images, SBIR serves as a favorable scenario complementary to conventional text-image cross-modal retrieval and the classical content-based image retrieval protocol. In some situations it is difficult to provide a textual description or a suitable example image of the desired query, whereas a user can easily draw a sketch of the desired object on a touch screen.

Fig. 1

Our SEM-PCYC model learns to map visual information from seen-class sketches and images to a semantic space through an adversarial training procedure in the zero-shot SBIR setting. Furthermore, our model is flexible enough to use a few examples from novel classes to fine-tune the model in the few-shot SBIR (FS-SBIR) setting, where the novel classes contain only a few labeled samples. During the testing phase the learned mappings are used to generate embeddings of the novel classes. We refer to the combination of zero- and few-shot SBIR as any-shot SBIR (AS-SBIR)

As the visual information from all classes gets explored by the system during training, existing SBIR methods perform well with overlapping training and test classes (Zhang et al. 2018). Since for practical applications there is no guarantee that the training data will include all possible queries, a more realistic setting is low-shot or any-shot SBIR (AS-SBIR) (Shen et al. 2018; Kiran Yelamarthi et al. 2018; Dutta and Akata 2019; Dey et al. 2019), which combines zero- and few-shot learning (Lampert et al. 2014; Vinyals et al. 2016; Xian et al. 2018a; Ravi and Larochelle 2017) and SBIR into a single task, where the aim is accurate class prediction and competent retrieval performance. However, this is an extremely challenging task, as it simultaneously deals with the domain gap, intra-class variability and limited or no knowledge of novel classes. Additionally, fine-grained SBIR (Pang et al. 2017, 2019) is an alternative sketch-based image retrieval task, allowing to search for specific object images, which has already received remarkable attention in the computer vision community. However, it has never been explored in the low-shot setting, which is extremely challenging and at the same time of high practical relevance.

One of the major shortcomings of the prior work on any-shot SBIR is that a natural image is retrieved after learning a mapping from an input sketch to an output image using a training set of labelled aligned pairs (Kiran Yelamarthi et al. 2018). The supervision of the pair correspondence is meant to enhance the correlation of multi-modal data (here, sketch and image) so that learning can be guided by semantics. However, for many realistic scenarios, paired (aligned) training data is either unavailable or very expensive to obtain. Furthermore, a joint representation of two or more modalities is often learned by using a memory fusion layer (Shen et al. 2018), such as tensor fusion (Hu et al. 2017), bilinear pooling (Yu et al. 2017) etc. These fusion layers are often expensive in terms of memory (Yu et al. 2017), and extracting useful information from this high dimensional space could result in information loss (Yu et al. 2018).

To alleviate these shortcomings, we propose a semantically aligned paired cycle consistent generative adversarial network (SEM-PCYC) model for the any-shot SBIR task, where each branch maps either sketch or image features to a common semantic space via adversarial training. These two branches, dealing with two different modalities (sketch and image), constitute an essential component for solving the SBIR task. The cycle consistency constraint on each branch guarantees that sketch or image features mapped to the common semantic space can be translated back to the original modality, avoiding the necessity of aligned sketch-image pairs. Imposing a classification loss on the semantically aligned outputs from the sketch and image space enforces the generated features in the semantic space to be discriminative, which is crucial for effective any-shot SBIR. Furthermore, inspired by previous works on label embedding (Akata et al. 2015), we propose to combine side information from text-based and hierarchical models via a feature selection auto-encoder (Wang et al. 2017) which selects discriminative side information based on intra- and inter-class covariance.

This paper extends our CVPR 2019 conference paper (Dutta and Akata 2019) with the following additional contributions: (1) We propose to apply the SEM-PCYC model to the any-shot SBIR task, i.e. in addition to the zero-shot paradigm, we introduce a few-shot setting for SBIR and combine it with the generalized setting, which has been experimentally proven to be effective for difficult or confusing classes (Fig. 1). (2) We adapt the recent zero-shot SBIR models and ours to fine-grained SBIR in the generalized low-shot setting and provide an extensive benchmark including quantitative and qualitative evaluations. (3) We evaluate our model on one recent dataset, i.e. QuickDraw, in addition to extending our experiments to new settings with Sketchy and TU-Berlin. We show that our proposed model consistently improves the state-of-the-art results of any-shot SBIR on all three datasets.

2 Related Work

As our work lies at the intersection of sketch-based image retrieval and any-shot learning, we briefly review the relevant literature from these fields.

Sketch Based Image Retrieval (SBIR). Attempts at solving the SBIR task mostly focus on bridging the domain gap between sketch and image, and can roughly be grouped into hand-crafted and cross-domain deep learning-based methods (Liu et al. 2017). Hand-crafted methods mostly work by extracting the edge map from the natural image and then matching it with the sketch using a Bag-of-Words model on top of specifically designed SBIR features, viz., gradient field HOG (Hu and Collomosse 2013), histogram of oriented edges (Saavedra 2014), learned key shapes (Saavedra and Barrios 2015) etc. However, the difficulty of reducing the domain gap remained unresolved, as it is extremely challenging to match edge maps with unaligned hand-drawn sketches. This domain shift issue is further addressed by neural network models where domain transferable features from sketch to image are learned in an end-to-end manner. The majority of such models use variants of siamese networks (Qi et al. 2016; Sangkloy et al. 2016; Yu et al. 2016a; Song et al. 2017a) that are suitable for cross-modal retrieval. These frameworks use either generic ranking losses, viz., contrastive loss (Chopra et al. 2005) and triplet ranking loss (Sangkloy et al. 2016), or the more sophisticated HOLEF-based loss (Song et al. 2017b). Beyond these discriminative losses, Pang et al. (2017) introduced a discriminative-generative hybrid model for preserving all the domain invariant information useful for reducing the domain gap between sketch and image. Alternatively, Liu et al. (2017) and Zhang et al. (2018) focus on learning cross-modal hash codes for category-level SBIR within an end-to-end deep model.

In addition to the above coarse-grained SBIR models, fine-grained sketch-based image retrieval (FG-SBIR) has gained popularity recently (Li et al. 2014; Song et al. 2017a, b; Pang et al. 2017). In this more realistic setting, a FG-SBIR model allows to search for a specific object or image. Early models tackled this task using deformable part models and graph matching (Li et al. 2014). More recently, different ranking frameworks and corresponding losses, such as siamese (Pang et al. 2017), triplet (Sangkloy et al. 2016) and quadruplet (Song et al. 2017a) networks, were used for the same. Song et al. (2017b) proposed an attention model for the FG-SBIR task, while Zhang et al. (2018) improved retrieval efficiency using a hashing scheme.

Zero-Shot Learning (ZSL) and Few-Shot Learning (FSL). Zero-shot learning in computer vision refers to recognizing objects whose instances are not seen during the training phase; a comprehensive and detailed survey on ZSL is available in Xian et al. (2018a). Early works on ZSL (Lampert et al. 2014; Jayaraman and Grauman 2014; Changpinyo et al. 2016; Al-Halah et al. 2016) make use of attributes within a two-stage approach to infer the label of an image that belongs to the unseen classes. However, more recent works (Frome et al. 2013; Romera-Paredes and Torr 2015; Akata et al. 2015, 2016; Kodirov et al. 2017) directly learn a mapping from the image feature space to a semantic space. Many other ZSL approaches learn non-linear multi-modal embeddings (Socher et al. 2013; Akata et al. 2016; Xian et al. 2016; Changpinyo et al. 2017; Zhang et al. 2017), where most of the methods focus on learning a non-linear mapping from the image space to the semantic space. Mapping both image and semantic features into another common intermediate space is another direction that ZSL approaches adopt (Zhang and Saligrama 2015; Fu et al. 2015; Zhang and Saligrama 2016; Akata et al. 2016; Long et al. 2017). Although most of the deep neural network models in this domain are trained using a discriminative loss function, a few generative models also exist (Wang et al. 2018a; Xian et al. 2018b; Chen et al. 2018) that are used as a data augmentation mechanism. In ZSL, some form of side information is required so that the knowledge learned from seen classes gets transferred to unseen classes. One popular form of side information is attributes (Lampert et al. 2014), which, however, require costly expert annotation. Thus, there has been a large group of studies (Mensink et al. 2014; Akata et al. 2015; Xian et al. 2016; Reed et al. 2016; Qiao et al. 2016; Ding et al. 2017) which utilize other auxiliary information, such as text-based (Mikolov et al. 2013) or hierarchical models (Miller 1995), for label embedding.

Fig. 2

Our SEM-PCYC Model. The sketch (in light gray) and image cycle consistent networks (in light blue) respectively map the sketch and image to the semantic space and then back to the original input space. An auto-encoder (light orange) combines the semantic information based on text and hierarchical models, and produces a compressed semantic representation which acts as a true example for the discriminator. During the test phase, only the learned sketch (light gray polygonal region) and image (light blue polygonal region) encoders to the semantic space are used to generate embeddings of the novel classes for any-shot, i.e. zero- and few-shot, SBIR. (best viewed in color) (color figure online)

On the other hand, few-shot learning (FSL) refers to the task of recognizing images or detecting objects with a model trained on very few samples (Xian et al. 2019; Schönfeld et al. 2018). Directly training a model on a small number of training samples risks overfitting. Hence, a general strategy to overcome this hurdle is to initially train the model on classes with sufficient examples, and then generalize it to classes with fewer examples without learning any new parameters. This setup has already attracted a lot of attention within the computer vision community. One of the first attempts (Koch et al. 2015) is a siamese convolutional network model for computing similarity between pairs of images, where the learned similarity was then used to solve the one-shot problem by k-nearest neighbors classification. The matching network model (Vinyals et al. 2016) uses cosine distance to predict image labels based on support sets and applies an episodic training strategy that mimics few-shot learning. An extension, the prototypical network (Snell et al. 2017), used Euclidean instead of cosine distance and built a prototype representation of each class for the few-shot learning scenario. In an orthogonal direction, Ravi and Larochelle (2017) introduced a meta-learning framework for FSL, which updates the weights of a classifier for a given episode. The model-agnostic meta-learner (Finn et al. 2017) learns a better weight initialization capable of generalizing in the FSL scenario within fewer gradient descent steps. There also exist a few low-shot methods that learn a generator from the base class data to generate novel class features for data augmentation (Girshick 2015; Wang et al. 2018b). Alternatively, GNNs (Kipf and Welling 2017) were also proposed as a framework for the few-shot learning task (Satorras and Estrach 2018).

Our Work. The prior work on zero-shot sketch-based image retrieval (ZS-SBIR) (Shen et al. 2018) proposed a generative cross-modal hashing scheme using a graph convolution network for aligning the sketch and image in the semantic space. Inspired by them, Kiran Yelamarthi et al. (2018) proposed two similar autoencoder-based generative models for zero-shot SBIR, where they used aligned pairs of sketch and image for learning the semantics between them. In this work, we propose a paired cycle consistent generative model where each branch maps either sketch or image features to a common semantic space via adversarial training, which we found to be effective for reducing the domain gap between sketch and image. The cycle consistency constraint on each branch requires supervision only at the category level, and avoids the need for aligned sketch-image pairs. Furthermore, we address zero-shot and few-shot cross-modal (sketch to image) retrieval; to that end, we effectively combine different side information within an end-to-end framework, and map visual information to the semantic space through adversarial training. Finally, we unify low-shot learning models and generalize them to the fine-grained SBIR scenario.

3 Semantically Aligned Paired Cycle Consistent GAN (SEM-PCYC)

Our Semantically Aligned Paired Cycle Consistent GAN (SEM-PCYC) model uses the sketch and image data from the seen categories for training the underlying model. It then encodes and matches the sketch and image categories that remain novel during the training phase. The overall pipeline of our end-to-end deep architecture is shown in Fig. 2.

We define \({\mathcal {D}}^s=\{{\mathbf {X}}^s, {\mathbf {Y}}^s\}\) to be a collection of sketch-image data from the training categories \({\mathcal {C}}^s\), which contains sketch images \({\mathbf {X}}^s=\{{\mathbf {x}}_i^s\}_{i=1}^N\) as well as natural images \({\mathbf {Y}}^s=\{{\mathbf {y}}_i^s\}_{i=1}^N\), where N is the total number of sketch and image pairs that are not necessarily aligned. Without loss of generality, a sketch and an image have the same index i, and share the same category label. The set \({\mathbf {S}}^s=\{{\mathbf {s}}_i^s\}_{i=1}^{N}\) indicates the side information necessary for transferring knowledge from seen to novel classes (a.k.a. unseen classes in the zero-shot learning literature). In our setting, we also use an auxiliary training set \({\mathcal {D}}^a=\{{\mathbf {X}}^a, {\mathbf {Y}}^a\}\) from the unseen classes \({\mathcal {C}}^u\), which is disjoint from \({\mathcal {C}}^s\), where the number of samples per class is fixed to k.

Our aim is to learn two deep functions \(G_\text {sk}(\cdot )\) and \(G_\text {im}(\cdot )\), respectively for sketch and image, that map them to a common semantic space where the learned knowledge is applied to the novel classes. Now, given a second set \({\mathcal {D}}^u=\{{\mathbf {X}}^u, {\mathbf {Y}}^u\}\) from the test categories \({\mathcal {C}}^u\), the proposed deep networks \(G_\text {sk}:{\mathbb {R}}^d\rightarrow {\mathbb {R}}^M\), \(G_\text {im}:{\mathbb {R}}^d\rightarrow {\mathbb {R}}^M\) (d is the dimension of the original data and M is the targeted dimension of the common representation) map the sketch and natural image to a common semantic space where the retrieval is performed. Depending on k, i.e. the number of samples per class considered in the auxiliary set, the scenario is called k-shot. In the classical zero-shot sketch-based image retrieval setting, the test categories belong to \({\mathcal {C}}^u\); in other words, at test time the assumption is that every image will come from a previously unseen class. This is not realistic, as the true generalization performance of the classifier can only be measured by how well it generalizes to unseen classes without forgetting the classes it has seen. Hence, in the generalized zero-shot sketch-based image retrieval scenario the search space contains both \({\mathcal {C}}^u\) and \({\mathcal {C}}^s\). In other words, at test time an image may come either from a previously seen or an unseen class. As this setting is significantly more challenging, the accuracy decreases for all the methods considered.
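To make the retrieval protocol concrete, the following is a minimal PyTorch sketch (not the released implementation) of ranking a gallery of images for one query sketch by Euclidean distance in the learned semantic space; the encoder handles `G_sk`, `G_im` and the pre-extracted feature tensors are assumptions about the interface.

```python
import torch

def retrieve(query_sketch_feat, gallery_image_feats, G_sk, G_im, top_k=100):
    """Rank gallery images for one query sketch by distance in the shared
    semantic space. G_sk / G_im are the trained encoders; both inputs are
    pre-extracted CNN features (query: (d,), gallery: (N, d))."""
    with torch.no_grad():
        q = G_sk(query_sketch_feat.unsqueeze(0))   # (1, M) semantic code
        g = G_im(gallery_image_feats)              # (N, M) semantic codes
    # Euclidean distance in the semantic space; smaller distance = better match.
    dists = torch.cdist(q, g).squeeze(0)           # (N,)
    return torch.topk(dists, k=top_k, largest=False).indices
```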

3.1 Paired Cycle Consistent Generative Model

To achieve the flexibility to handle sketch and image individually, i.e. even without aligned sketch-image pairs, while training \(G_\text {sk}\) and \(G_\text {im}\), we propose a cycle consistent generative model in which each branch is semantically aligned with a common discriminator. The cycle consistency constraint on each branch of the model ensures that the mapping of the sketch or image modality to a common semantic space, and its translation back to the original modality, only requires supervision at the category level. Imposing a classification loss on the outputs of \(G_\text {sk}\) and \(G_\text {im}\) allows generating highly discriminative features.

Our main goal is to learn two mappings \(G_\text {sk}\) and \(G_\text {im}\) that can respectively translate unaligned sketches and natural images to a common semantic space. Zhu et al. (2017) pointed out the existence of an underlying intrinsic relationship between modalities and domains; for example, a sketch and an image of the same object category share the same semantic meaning and thus possess such a relationship. Even though we lack visual supervision, as we do not have access to aligned pairs, we can exploit semantic supervision at the category level. We train a mapping \(G_\text {sk}:{\mathbf {X}}\rightarrow {\mathbf {S}}\) so that \(\hat{{\mathbf {s}}}_i=G_\text {sk}({\mathbf {x}}_i)\) becomes indistinguishable from the corresponding side information \({\mathbf {s}}_i\in {\mathbf {S}}\), via an adversarial training procedure in which a discriminator tries to classify \(\hat{{\mathbf {s}}}_i\) as different from \({\mathbf {s}}_i\). The optimal \(G_\text {sk}\) thereby translates the modality \({\mathbf {X}}\) into a modality \(\hat{{\mathbf {S}}}\) which is identically distributed to \({\mathbf {S}}\). Similarly, another function \(G_\text {im}:{\mathbf {Y}}\rightarrow {\mathbf {S}}\) can be trained via the same discriminator such that \(\hat{{\mathbf {s}}}_i=G_\text {im}({\mathbf {y}}_i)\).

Adversarial Loss. As shown in Fig. 2, for mapping the sketch and image representation to a common semantic space, we introduce four generators \(G_\text {sk}:{\mathbf {X}}\rightarrow {\mathbf {S}}\), \(G_\text {im}:{\mathbf {Y}}\rightarrow {\mathbf {S}}\), \(F_\text {sk}:{\mathbf {S}}\rightarrow {\mathbf {X}}\) and \(F_\text {im}:{\mathbf {S}}\rightarrow {\mathbf {Y}}\). In addition, we bring in three adversarial discriminators: \(D_\text {se}(\cdot )\), \(D_\text {sk}(\cdot )\) and \(D_\text {im}(\cdot )\), where \(D_\text {se}\) discriminates among original side information \(\{{\mathbf {s}}\}\), sketch transformed to side information \(\{G_\text {sk}({\mathbf {x}})\}\) and image transformed to side information \(\{G_\text {im}({\mathbf {y}})\}\); likewise \(D_\text {sk}\) discriminates between original sketch representation \(\{{\mathbf {x}}\}\) and side information transformed to sketch representation \(\{F_\text {sk}({\mathbf {s}})\}\); in a similar way \(D_\text {im}\) distinguishes between \(\{{\mathbf {y}}\}\) and \(\{F_\text {im}({\mathbf {s}})\}\). For the generators \(G_\text {sk}\), \(G_\text {im}\) and their common discriminator \(D_\text {se}\), the objective is:

$$\begin{aligned}&{\mathcal {L}}_\text {adv}(G_\text {sk}, G_\text {im}, D_\text {se}, {\mathbf {x}}, {\mathbf {y}}, {\mathbf {s}})=2\times {\mathbb {E}}\left[ \log D_\text {se}({\mathbf {s}})\right] \nonumber \\&\quad +{\mathbb {E}}\left[ \log (1- D_\text {se}(G_\text {sk}({\mathbf {x}})))\right] +{\mathbb {E}}\left[ \log (1- D_\text {se}(G_\text {im}({\mathbf {y}})))\right] \end{aligned}$$
(1)

where \(G_\text {sk}\) and \(G_\text {im}\) generate side information similar to the ones in \({\mathbf {S}}\) while \(D_\text {se}\) distinguishes between the generated and original side information. Here, \(G_\text {sk}\) and \(G_\text {im}\) minimize the objective against an opponent \(D_\text {se}\) that tries to maximize it, namely

$$\begin{aligned} \min _{G_\text {sk}, G_\text {im}}\max _{D_\text {se}}{\mathcal {L}}_\text {adv}(G_\text {sk}, G_\text {im}, D_\text {se}, {\mathbf {x}}, {\mathbf {y}}, {\mathbf {s}}). \end{aligned}$$

In a similar way, for the generator \(F_\text {sk}\) and its discriminator \(D_\text {sk}\), the objective is:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_\text {adv}(F_\text {sk}, D_\text {sk}, {\mathbf {x}}, {\mathbf {s}})=&\text { }{\mathbb {E}}\left[ \log D_\text {sk}({\mathbf {x}})\right] \\&+{\mathbb {E}}\left[ \log (1- D_\text {sk}(F_\text {sk}({\mathbf {s}})))\right] \end{aligned} \end{aligned}$$
(2)

\(F_\text {sk}\) minimizes the objective and its adversary \(D_\text {sk}\) intends to maximize it, namely

$$\begin{aligned} \min _{F_\text {sk}}\max _{D_\text {sk}}{\mathcal {L}}_\text {adv}(F_\text {sk}, D_\text {sk}, {\mathbf {x}}, {\mathbf {s}}). \end{aligned}$$

Similarly, another adversarial loss is introduced for the mapping \(F_\text {im}\) and its discriminator \(D_\text {im}\), i.e. \(\min _{F_\text {im}}\max _{D_\text {im}}{\mathcal {L}}_\text {adv}(F_\text {im}, D_\text {im}, {\mathbf {y}}, {\mathbf {s}})\).
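To make the adversarial objective concrete, a minimal PyTorch sketch of Eq. (1) follows; it assumes sigmoid-output discriminators and uses the common non-saturating surrogate for the generator update, which is a practical substitution rather than the exact min-max form written above.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()  # assumes the discriminators end with a sigmoid

def semantic_adv_losses(G_sk, G_im, D_se, x, y, s):
    """Discriminator and generator terms corresponding to Eq. (1): D_se treats
    the true side information s as real (weighted twice, once per generator)
    and the generated semantic codes G_sk(x), G_im(y) as fake."""
    with torch.no_grad():                      # freeze generators for the D update
        fake_sk, fake_im = G_sk(x), G_im(y)
    d_real = D_se(s)
    ones, zeros = torch.ones_like(d_real), torch.zeros_like(d_real)
    d_loss = 2 * bce(d_real, ones) \
             + bce(D_se(fake_sk), zeros) + bce(D_se(fake_im), zeros)
    # Generator update: fool D_se into labeling the generated codes as real.
    g_loss = bce(D_se(G_sk(x)), ones) + bce(D_se(G_im(y)), ones)
    return d_loss, g_loss
```

The losses for \(F_\text {sk}, D_\text {sk}\) and \(F_\text {im}, D_\text {im}\) follow the same pattern with a single fake term each.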

Cycle Consistency Loss. The adversarial mechanism effectively reduces the domain or modality gap; however, it does not guarantee that an input \({\mathbf {x}}_i\) and an output \({\mathbf {s}}_i\) are matched well. To this end, we impose cycle consistency (Zhu et al. 2017). When we map the feature of a sketch of an object to the corresponding semantic space, and then translate it back from the semantic space to the sketch feature space, we should reach the original sketch feature. This cycle consistency loss also assists in learning mappings across domains where paired or aligned examples are not available. Specifically, if we have a function \(G_\text {sk}:{\mathbf {X}}\rightarrow {\mathbf {S}}\) and another mapping \(F_\text {sk}:{\mathbf {S}}\rightarrow {\mathbf {X}}\), then \(G_\text {sk}\) and \(F_\text {sk}\) are inverses of each other, and hence form a one-to-one correspondence, or bijective mapping.

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_\text {cyc}(G_\text {sk}, F_\text {sk})=&\text { }{\mathbb {E}}\left[ \Vert F_\text {sk}(G_\text {sk}({\mathbf {x}}))-{\mathbf {x}} \Vert _{1}\right] \\&+{\mathbb {E}}\left[ \Vert G_\text {sk}(F_\text {sk}({\mathbf {s}}))-{\mathbf {s}} \Vert _{1}\right] \end{aligned} \end{aligned}$$
(3)

where \({\mathbf {s}}\) is the semantic feature of the class c, which is the category label of \({\mathbf {x}}\). Similarly, a cycle consistency loss \({\mathcal {L}}_\text {cyc}(G_\text {im}, F_\text {im})\) is imposed for the mappings \(G_\text {im}:{\mathbf {Y}}\rightarrow {\mathbf {S}}\) and \(F_\text {im}:{\mathbf {S}}\rightarrow {\mathbf {Y}}\). These consistency losses also behave as a regularizer for the adversarial training, ensuring that the learned function maps a specific input \({\mathbf {x}}_i\) to a desired output \({\mathbf {s}}_i\).
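A minimal sketch of Eq. (3) in PyTorch; the image branch is identical with \(G_\text {im}\) and \(F_\text {im}\).

```python
import torch.nn.functional as F

def cycle_loss(G_sk, F_sk, x, s):
    """Cycle consistency of Eq. (3): sketch -> semantic -> sketch and
    semantic -> sketch -> semantic should both reconstruct their inputs
    under an L1 penalty."""
    return F.l1_loss(F_sk(G_sk(x)), x) + F.l1_loss(G_sk(F_sk(s)), s)
```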

Classification Loss. On the other hand, adversarial training and cycle-consistency constraints do not explicitly ensure that the features generated by the mappings \(G_\text {sk}\) and \(G_\text {im}\) are class discriminative, which is a requirement for the zero-shot sketch-based image retrieval task. We conjecture that this issue can be alleviated by introducing a discriminative classifier pre-trained on the input data. To this end, we minimize a classification loss over the generated features.

$$\begin{aligned} {\mathcal {L}}_\text {cls}(G_\text {sk})=-{\mathbb {E}}_{{\mathbf {x}} \sim {\mathbf {X}}}\left[ \log P(c|G_\text {sk}({\mathbf {x}});\theta ) \right] \end{aligned}$$
(4)

where c is the category label of \({\mathbf {x}}\) and \(P(c|G_\text {sk}({\mathbf {x}});\theta )\) denotes the probability of \(G_\text {sk}({\mathbf {x}})\) being predicted as its true class label c. The conditional probability is computed by a linear softmax classifier parameterized by \(\theta \). Similarly, a classification loss \({\mathcal {L}}_\text {cls}(G_\text {im})\) is also imposed on the generator \(G_\text {im}\).
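A sketch of Eq. (4) follows; the embedding dimension and number of seen classes are example values, and the linear layer plays the role of the classifier parameterized by \(\theta \).

```python
import torch.nn as nn

M, num_seen_classes = 64, 100                 # example sizes, not fixed by the model
classifier = nn.Linear(M, num_seen_classes)   # linear softmax classifier (theta)
ce = nn.CrossEntropyLoss()                    # cross-entropy = negative log-likelihood of Eq. (4)

def cls_loss(G, feats, labels):
    """feats: 512-d sketch (or image) features, labels: seen-class indices."""
    return ce(classifier(G(feats)), labels)
```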

3.2 Selection of Side Information

Learning a compatibility or matching function between multiple modalities in the zero-shot scenario (Shen et al. 2018; Dey et al. 2019; Liu et al. 2019) requires structure in the class embedding space to which the image features are mapped. Attributes provide one such structured class embedding space (Lampert et al. 2014); however, obtaining attributes requires costly human annotation. On the other hand, side information can also be learned at a much lower cost from large-scale text corpora such as Wikipedia. Similarly, output embeddings built from a hierarchical organization of classes such as WordNet can also provide structure in the output space and substitute for attributes. Motivated by attribute selection for zero-shot learning (Guo et al. 2018), indicating that a subset of discriminative attributes is more effective than the whole set of attributes for ZSL, we incorporate a joint learning framework integrating an auto-encoder to select side information. Let \({\mathbf {s}}\in {\mathbb {R}}^{k}\) be the side information with k as the original dimension. The loss function is:

$$\begin{aligned} {\mathcal {L}}_\text {aenc}(f,g)=\Vert {\mathbf {s}}-g(f({\mathbf {s}}))\Vert _{F}+\lambda \Vert W_1 \Vert _{2,1} \end{aligned}$$
(5)

where \(f({\mathbf {s}})=\sigma (W_1{\mathbf {s}}+b_1)\), \(g(f({\mathbf {s}}))=\sigma (W_2f({\mathbf {s}})+b_2)\), with \(W_1\in {\mathbb {R}}^{k\times m}\), \(W_2\in {\mathbb {R}}^{m\times k}\) and \(b_1\), \(b_2\) respectively the weights and biases of the functions f and g. Additionally, \(\Vert .\Vert _{F}\) denotes the Frobenius norm, defined as the square root of the sum of the absolute squares of its elements, and \(\Vert .\Vert _{2,1}\) indicates the \(\ell _{2,1}\) norm (Nie et al. 2010). Selecting side information reduces the dimensionality of the embeddings, which further improves retrieval time. Therefore, the overall training objective of our model is:

$$\begin{aligned}&{\mathcal {L}}(G_\text {sk}, G_\text {im}, F_\text {sk}, F_\text {im}, D_\text {se}, D_\text {sk}, D_\text {im}, f, g, {\mathbf {x}}, {\mathbf {y}}, {\mathbf {s}})\nonumber \\&\quad = \lambda _\text {adv}^\text {se}{\mathcal {L}}_\text {adv}(G_\text {sk}, G_\text {im}, D_\text {se}, {\mathbf {x}}, {\mathbf {y}}, {\mathbf {s}})\nonumber \\&\qquad +\lambda _\text {adv}^\text {sk}{\mathcal {L}}_\text {adv}(F_\text {sk}, D_\text {sk}, {\mathbf {x}}, {\mathbf {s}})\nonumber \\&\qquad +\lambda _\text {adv}^\text {im}{\mathcal {L}}_\text {adv}(F_\text {im}, D_\text {im}, {\mathbf {y}}, {\mathbf {s}})\nonumber \\&\qquad +\lambda _\text {cyc}^\text {sk}{\mathcal {L}}_\text {cyc}(G_\text {sk}, F_\text {sk})+\lambda _\text {cyc}^\text {im}{\mathcal {L}}_\text {cyc}(G_\text {im}, F_\text {im})\nonumber \\&\qquad +\lambda _\text {cls}^\text {sk}{\mathcal {L}}_\text {cls}(G_\text {sk})+\lambda _\text {cls}^\text {im}{\mathcal {L}}_\text {cls}(G_\text {im})\nonumber \\&\qquad +\lambda _\text {aenc}{\mathcal {L}}_\text {aenc}(f,g) \end{aligned}$$
(6)

where the different \(\lambda \)s are the weights on the respective loss terms. For obtaining the initial side information, we combine a text-based and a hierarchical model, which are complementary and robust (Akata et al. 2015). Below, we describe our text-based and hierarchical models for side information.
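Before describing the two sources of side information, the following is a minimal PyTorch sketch of the feature-selecting auto-encoder of Eq. (5); here \(\sigma \) is assumed to be a sigmoid and the \(\ell _{2,1}\) penalty is applied so that it groups the encoder weights by input dimension, which is one standard way to realize the selection behavior, not necessarily the exact released implementation.

```python
import torch
import torch.nn as nn

class SideInfoAE(nn.Module):
    """One-layer encoder/decoder implementing f and g of Eq. (5)."""
    def __init__(self, k, m):
        super().__init__()
        self.enc = nn.Linear(k, m)     # f(s) = sigma(W1 s + b1)
        self.dec = nn.Linear(m, k)     # g(f(s)) = sigma(W2 f(s) + b2)
        self.act = nn.Sigmoid()

    def forward(self, s):
        h = self.act(self.enc(s))      # compressed side information (fed to D_se as 'real')
        return self.act(self.dec(h)), h

def aenc_loss(model, s, lam=0.01):
    """s: (batch, k) matrix of side-information vectors."""
    recon, _ = model(s)
    rec = torch.norm(s - recon, p='fro')
    # l2,1 penalty: l2 norm of the weights attached to each input dimension,
    # summed over dimensions, encouraging selection of a subset of the
    # original side information.
    l21 = model.enc.weight.norm(p=2, dim=0).sum()
    return rec + lam * l21
```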

Text-based Model. We use three different types of text-based side information. (1) Word2Vec (Mikolov et al. 2013) is a two-layer neural network that is trained to reconstruct the linguistic contexts of words. During training, it takes a large corpus of text and creates a vector space of several hundred dimensions, with each unique word being assigned a corresponding vector in that space. The model can be trained with a hierarchical softmax using either the skip-gram or the continuous bag-of-words formulation for target prediction. (2) GloVe (Pennington et al. 2014) considers global word-word co-occurrence statistics that frequently appear in a corpus. Intuitively, co-occurrence statistics encode important semantic information. The objective is to learn word vectors such that their dot product equals the probability of their co-occurrence. (3) FastText (Joulin et al. 2017) extends the Word2Vec model: instead of learning vectors for words directly, FastText represents each word as n-grams of characters and then trains a skip-gram model to learn the embeddings. FastText works well with rare words; even if a word was not seen during training, it can be broken down into n-grams to obtain its embedding, which is a major advantage of this model.
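As an illustration, class-name embeddings from a pre-trained text model can be obtained as below; this is a sketch assuming the gensim package and one of its downloadable GloVe models, and averaging the word vectors of multi-word class names is a simplifying assumption rather than the procedure used in the paper.

```python
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-300")   # 300-d GloVe vectors (example model)

def text_embedding(class_name):
    """Average the word vectors of a (possibly multi-word) class name,
    e.g. 'pickup_truck' -> mean of the vectors for 'pickup' and 'truck'."""
    words = [w for w in class_name.lower().replace("_", " ").split() if w in wv]
    return np.mean([wv[w] for w in words], axis=0) if words else np.zeros(300)
```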

Hierarchical Model. The semantic distance (or similarity) between words can also be approximated by their distance (or similarity) in a large ontology such as WordNet (Footnote 1) with \(\approx 100,000\) English words. One can measure the similarity [\({\mathcal {S}}_\text {WN}\) in Eq. (7)] between words represented as nodes in the ontology using techniques such as path similarity, e.g. counting the number of hops required to reach one node from the other, or the measure of Jiang and Conrath (1997). For a set \({\mathbb {S}}\) of nodes in a dictionary \({\mathbb {D}}\) that consists of a set of classes, the similarities between every class c and all the other nodes, considered in the same order in \({\mathbb {S}}\), determine the entries of the class embedding vector (Akata et al. 2015) of c [\({\mathbf {s}}_\text {hier}(c)\) in Eq. (7)]:

$$\begin{aligned} {\mathbf {s}}_\text {hier}(c) = [{\mathcal {S}}_\text {WN}(c,c_1), \ldots , {\mathcal {S}}_\text {WN}(c,c_{|{\mathbb {S}}|})] \end{aligned}$$
(7)

Note that \({\mathbb {S}}\) considers all the nodes on the path from each node in \({\mathbb {D}}\) to its highest level ancestor. The WordNet hierarchy contains most of the classes of the Sketchy (Sangkloy et al. 2016), TU-Berlin (Eitz et al. 2012) and QuickDraw (Dey et al. 2019) datasets. A few exceptions are: jack-o-lantern, which we replaced with lantern that appears higher in the hierarchy; similarly human skeleton with skeleton, and octopus with octopods etc. \(|{\mathbb {S}}|\), i.e. the number of nodes, for the Sketchy, TU-Berlin and QuickDraw datasets is respectively 354, 664 and 344.
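A minimal sketch of the hierarchical embedding of Eq. (7), assuming NLTK with the WordNet corpus installed and using path similarity; taking the first synset of each class name is a simplification for illustration.

```python
from nltk.corpus import wordnet as wn

def hier_embedding(class_name, node_names):
    """Entry j is the WordNet path similarity between the class and the j-th
    node of S; the node order must be kept fixed across all classes."""
    c = wn.synsets(class_name.replace(" ", "_"))[0]
    return [c.path_similarity(wn.synsets(n.replace(" ", "_"))[0]) or 0.0
            for n in node_names]

# e.g. hier_embedding("duck", ["bird", "animal", "vehicle"])
```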

4 Experiments

In this section, we detail our datasets and implementation protocol, and present our results in the (generalized) zero-shot, (generalized) few-shot and fine-grained settings.

Datasets. We experimentally validate our model on three popular SBIR datasets, namely Sketchy (Extended), TU-Berlin (Extended) and QuickDraw (Extended). For brevity, we refer to these extended datasets as Sketchy, TU-Berlin and QuickDraw respectively.

The Sketchy Dataset (Sangkloy et al. 2016) is a large collection of sketch-photo pairs. The dataset originally consists of images from 125 different classes, with 100 photos each. The 75,471 sketches of the objects that appear in these 12,500 images were collected via crowd sourcing. This dataset also contains a fine-grained correspondence (alignment) between particular photos and sketches, as well as various data augmentations for deep learning based methods. Liu et al. (2017) extended the dataset by adding 60,502 photos, yielding in total 73,002 images. We randomly pick 25 classes as the novel test set and use the data from the remaining 100 classes for training.

The original TU-Berlin Dataset (Eitz et al. 2012) contains 250 categories with a total of 20,000 sketches, extended by Liu et al. (2017) with 204,489 natural images corresponding to the sketch classes. 30 classes of sketches and images are randomly chosen to respectively form the query set and the retrieval gallery. The remaining 220 classes are utilized for training. We follow Shen et al. (2018) and select classes with at least 400 images to form the test set.

QuickDraw (Extended), a large-scale dataset proposed recently by Dey et al. (2019), contains sketch-image pairs from 110 classes consisting of 203,885 images and 330,111 sketches, i.e. approximately 1854 images/class and 3000 sketches/class. The main difference of this dataset from the previous ones is the abstractness of the sketches, which are collected from the Quick, Draw! (Footnote 2) online game. The increased abstraction in the drawings enlarges the sketch-image domain gap, and hence increases the challenge of the SBIR task.

Implementation details. We implemented the SEM-PCYC model using the PyTorch (Paszke et al. 2017) deep learning toolbox (Footnote 3) on a single TITAN Xp or TITAN V graphics card. Unless otherwise mentioned, we extract sketch and image features from the VGG-16 (Simonyan and Zisserman 2014) network pre-trained on ImageNet (Deng et al. 2009) (before the last pooling layer). In Sect. 4.1, we compare the VGG-16 features with SE-ResNet-50 features for the zero-shot SBIR task, which is restricted to that experiment. Since in this work we deal with single object retrieval, and an object usually spans only certain regions of a sketch or image, we apply an attention mechanism inspired by Song et al. (2017b), without the shortcut connection, for extracting only the informative regions from sketch and image. The attended 512d representation is obtained by a pooling operation guided by the attention model and a fully connected (fc) layer. This entire model is fine-tuned on our training set (100 classes for Sketchy, 220 classes for TU-Berlin and 80 classes for QuickDraw). Both generators \(G_\text {sk}\) and \(G_\text {im}\) are built with an fc layer followed by a ReLU non-linearity that accepts a 512d vector and outputs an M-d representation, whereas the generators \(F_\text {sk}\) and \(F_\text {im}\) take M-d features and produce a 512d vector. Accordingly, all discriminators are designed to take the output of the respective generators and produce a one-dimensional output. The auto-encoder is designed by stacking two non-linear fc layers, respectively as encoder and decoder, to obtain a compressed and encoded representation of dimension M. We experimentally set \(\lambda _\text {adv}^\text {se}=1.0\), \(\lambda _\text {adv}^\text {sk}=0.5\), \(\lambda _\text {adv}^\text {im}=0.5\), \(\lambda _\text {cyc}^\text {sk}=1.0\), \(\lambda _\text {cyc}^\text {im}=1.0\), \(\lambda _\text {cls}^\text {sk}=1.0\), \(\lambda _\text {cls}^\text {im}=1.0\), \(\lambda _\text {aenc}=0.01\) to obtain the optimum performance of our model.
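The following sketch illustrates the network shapes described above; the exact discriminator architecture (a single fc layer with a sigmoid here) and the value of M are assumptions for illustration.

```python
import torch.nn as nn

M = 64  # dimension of the common semantic space (example value)

# Generators G_sk / G_im: 512-d attended feature -> M-d semantic code.
def make_encoder():
    return nn.Sequential(nn.Linear(512, M), nn.ReLU(inplace=True))

# Generators F_sk / F_im: M-d semantic code -> 512-d feature.
def make_decoder():
    return nn.Sequential(nn.Linear(M, 512), nn.ReLU(inplace=True))

# Discriminators: take the output of the corresponding generator and
# produce a single real/fake score.
def make_discriminator(in_dim):
    return nn.Sequential(nn.Linear(in_dim, 1), nn.Sigmoid())

G_sk, G_im = make_encoder(), make_encoder()
F_sk, F_im = make_decoder(), make_decoder()
D_se, D_sk, D_im = make_discriminator(M), make_discriminator(512), make_discriminator(512)
```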

While constructing the hierarchy for the class embedding, we only consider the training classes belonging to that dataset. In this way, the WordNet hierarchy, or knowledge graph, for the Sketchy, TU-Berlin and QuickDraw datasets respectively contains 354, 664 and 344 nodes. Although our method does not produce a binary hash code as the final representation for matching sketch and image, for the sake of comparison with related works that produce hash codes, such as ZSH (Yang et al. 2016a), ZSIH (Shen et al. 2018) and GDH (Zhang et al. 2018), we have used the iterative quantization (ITQ) (Gong et al. 2013) algorithm to obtain binary codes for sketch and image. We use the final representations of sketches and images from the train set to learn the optimized rotation, which is later applied to our final representation for obtaining the binary codes.
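For reference, a generic ITQ re-implementation (not the authors' evaluation code) could look as follows; it learns the orthogonal rotation on training-set embeddings and then binarizes new embeddings with it.

```python
import numpy as np

def itq(train_feats, n_iter=50):
    """Learn an orthogonal rotation R on zero-centered training embeddings
    by alternating code assignment and an orthogonal Procrustes update."""
    mean = train_feats.mean(axis=0)
    V = train_feats - mean
    dim = V.shape[1]
    R = np.linalg.qr(np.random.randn(dim, dim))[0]   # random orthogonal init
    for _ in range(n_iter):
        B = np.sign(V @ R)                           # fix R, update binary codes
        U, _, Wt = np.linalg.svd(V.T @ B)            # fix B, update R (Procrustes)
        R = U @ Wt
    return mean, R

def binarize(feats, mean, R):
    """Map continuous embeddings to {0,1} codes using the learned rotation."""
    return (np.sign((feats - mean) @ R) > 0).astype(np.uint8)
```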

4.1 (Generalized) Zero-Shot Sketch-Based Image Retrieval

Apart from the two prior zero-shot SBIR works closest to ours, i.e. ZSIH (Shen et al. 2018) and ZS-SBIR (Kiran Yelamarthi et al. 2018), we adapt fourteen ZSL and SBIR models to the zero-shot SBIR task. Note that in this setting, the training classes are indicated as "seen" and the novel classes as "unseen", since none of the sketches of these classes are visible to the model during training.

The SBIR methods that we evaluate are SaN (Yu et al. 2015), 3D Shape (Wang et al. 2015a), Siamese CNN (Qi et al. 2016), GN Triplet (Sangkloy et al. 2016), DSH (Liu et al. 2017) and GDH (Zhang et al. 2018). A softmax baseline is also added, which is based on computing the 4096d VGG-16 (Simonyan and Zisserman 2014) feature vector, pre-trained on the seen classes, for nearest neighbour search. The ZSL methods that we evaluate are CMT (Socher et al. 2013), DeViSE (Frome et al. 2013), SSE (Zhang and Saligrama 2015), JLSE (Zhang and Saligrama 2016), ZSH (Yang et al. 2016a), SAE (Kodirov et al. 2017) and FRWGAN (Felix et al. 2018). We use the same seen-unseen splits of categories for all experiments for a fair comparison. We compute the mean average precision (mAP@all) and the precision over the top 100 retrievals (Precision@100) (Su et al. 2015; Shen et al. 2018) for performance evaluation and comparison.
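For clarity, a straightforward implementation of these two metrics could look as follows; tie-breaking and truncation conventions may differ slightly from the evaluation scripts used by the compared methods.

```python
import numpy as np

def average_precision(relevant_sorted):
    """relevant_sorted: 0/1 array over the ranked gallery (1 = same class)."""
    hits = np.cumsum(relevant_sorted)
    precisions = hits / (np.arange(len(relevant_sorted)) + 1)
    return (precisions * relevant_sorted).sum() / max(relevant_sorted.sum(), 1)

def evaluate(dist, query_labels, gallery_labels, k=100):
    """dist: (num_queries, num_gallery) distance matrix in the semantic space."""
    aps, p_at_k = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                                  # ascending distance
        rel = (gallery_labels[order] == query_labels[i]).astype(np.float64)
        aps.append(average_precision(rel))
        p_at_k.append(rel[:k].mean())
    return np.mean(aps), np.mean(p_at_k)                             # mAP@all, Precision@k
```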

Table 1 (Generalized) zero-shot sketch-based image retrieval and (generalized) fine-grained sketch-based image retrieval performance comparison with existing SBIR, ZSL, zero-shot SBIR and generalized zero-shot SBIR methods
Fig. 3

ac PR curves of SEM-PCYC model and several SBIR, ZSL and zero-shot SBIR methods respectively on the Sketchy, TU-Berlin and QuickDraw datasets, d plot showing mAP@all wrt the ratio of removed side information. (best viewed in color) (color figure online)

Table 1 shows that most of the SBIR and ZSL methods perform worse than the zero-shot SBIR methods. Among them, the ZSL methods usually suffer from the domain gap between the sketch and image modalities. The majority of SBIR methods, although performing better than their ZSL counterparts, fail to generalize the learned representations to unseen classes. However, GN Triplet (Sangkloy et al. 2016), DSH (Liu et al. 2017) and GDH (Zhang et al. 2018) have shown reasonable potential to generalize information, though only for objects with common shapes.

As expected, the specialized zero-shot SBIR methods surpass most of the ZSL and SBIR baselines, as they possess both the ability to reduce the domain gap and to generalize the learned information to unseen classes. ZS-SBIR learns to generalize between sketch and image from aligned sketch-image pairs; as a result it performs well on the Sketchy dataset, but not on the TU-Berlin or QuickDraw datasets, where aligned sketch-image pairs are not available. Our proposed method exceeds the state of the art by 0.091 mAP@all on Sketchy, 0.074 mAP@all on TU-Berlin and 0.046 mAP@all on QuickDraw, which shows the effectiveness of our SEM-PCYC model due to the cycle consistency between sketch, image and semantic space, as well as the compact and discriminative side information.

In general, the main challenge in the TU-Berlin dataset is the large number of visually similar and overlapping classes. On the other hand, in the QuickDraw dataset there is a large domain gap that was intentionally introduced for designing future realistic models. The ambiguity in annotation, e.g. non-professional sketches, is also a major challenge in this dataset. Although our results are encouraging in that they show that cycle consistency helps the zero-shot SBIR task and our model sets the new state of the art in this domain, we hope that our work will encourage further research towards improving these results.

Finally, the PR-curves of SEM-PCYC and the considered baselines on Sketchy, TU-Berlin and QuickDraw are respectively shown in Fig. 3a–c, where the precision-recall curve corresponding to our SEM-PCYC model (dark blue line) always lies above the other methods. This indicates that our proposed model consistently exhibits superiority on all three datasets, which clearly shows the benefit of our proposal.

Generalized Zero-Shot Sketch-Based Image Retrieval. We conducted experiments in the generalized ZS-SBIR setting, where the search space contains both seen and unseen classes. This task is significantly more challenging than ZS-SBIR as seen classes create distraction for the test queries. Our results in Table 1 show that our model significantly outperforms both existing models (Shen et al. 2018; Kiran Yelamarthi et al. 2018), due to the benefit of our cross-modal adversarial mechanism and heterogeneous side information.

Fig. 4

Top-20 zero-shot SBIR results obtained by our SEM-PCYC model on the Sketchy (extended) dataset are shown here according to the Euclidean distances, where the green ticks denote the correctly retrieved candidates, whereas the red crosses indicate the wrong retrievals. (best viewed in color) (color figure online)

Fig. 5

Top-20 zero-shot SBIR results obtained by our SEM-PCYC model on the TU-Berlin (extended) dataset are shown here according to the Euclidean distances, where the green ticks denote the correctly retrieved candidates, whereas the red crosses indicate the wrong retrievals. (best viewed in color) (color figure online)

Fig. 6

Top-20 zero-shot SBIR results obtained by our SEM-PCYC model on the QuickDraw (extended) dataset are shown here according to the Euclidean distances, where the green ticks denote the correctly retrieved candidates, whereas the red crosses indicate the wrong retrievals. (best viewed in color) (color figure online)

Fig. 7

Inter-class similarity in TU-Berlin dataset may indicate the challenge of the task

Qualitative Results. We analyze the retrieval performance of our proposed model qualitatively in Figs. 4, 5 and 6. Some notable examples are as follows. A sketch query of tank retrieves some examples of motorcycle, probably because both have wheels in common (row 1 of Fig. 4). A similar explanation holds for car and motorcycle (row 1 of Fig. 6). Due to visual and semantic similarity, sketching guitar retrieves some violins (row 2 of Fig. 4). This can also be observed in the case of train and van in row 2 of Fig. 6.

Due to visual and semantic similarity, querying bear retrieves some squirrels (row 3 of Fig. 4). Querying objects with wheels (e.g., wheelchair, motorcycle) sometimes wrongly retrieves other vehicles, probably because they have wheels in common (row 6 of Fig. 4). A sketch query of spoon retrieves some examples of racket (row 4 of Fig. 4), possibly due to significant visual similarity. A sketch of burger retrieves some examples of jack-o-lantern (row 5 of Fig. 4), perhaps because of their similar shape. Querying castle retrieves images containing large portions of sky (row 2 of Fig. 5), because the images of its semantically similar classes, such as skyscraper and church, are mostly captured with sky in the background. A similar phenomenon can be observed in the case of tree and electrical post in row 5 of Fig. 6. Querying duck retrieves images of swan or shark (row 4 of Fig. 5), probably because of the common watery background. A sketch of pickup truck retrieves some images from the traffic light class that contain a truck-like object in the scene (row 3 of Fig. 5). Sketching bookshelf retrieves some examples of cabinet due to significant visual and semantic similarity (row 5 of Fig. 5).

Sometimes too much abstraction in sketches can produce wrong retrieval results. For example, in row 3 of Fig. 6, it is difficult to tell whether the sketch depicts the Eiffel tower, another tower or a hill. Furthermore, we have observed certain ambiguities in the annotation of images in the QuickDraw dataset. The images are often complex and contain two or more objects, while most of the currently available SBIR datasets provide single object annotations, ignoring the objects in the background. For example, in row 6 of Fig. 6, many of the wrongly retrieved images indeed contain flowers, whereas some of them are annotated as tower or trees etc. Additionally, as the images of the QuickDraw dataset are collected from the Flickr website, it contains many subsequent captures which can be confused as identical frames. Hence, although some retrievals on the QuickDraw dataset appear identical, they are not in terms of the actual pixel values.

In general, we observe that the wrongly retrieved candidates mostly have a close visual and semantic relevance to the queried ones. This effect is more prominent in the TU-Berlin dataset, which may be due to the inter-class similarity of sketches between different classes. As shown in Fig. 7, the classes swan, duck and owl, penguin have substantial visual similarity, and all of them are standing birds, which is a separate class of the same dataset. Therefore, for the TU-Berlin dataset, it is challenging to generalize to unseen classes from the learned representation of the seen classes.

Table 2 Zero-shot SBIR mAP@all using different semantic embeddings (top) and their combinations (bottom) with 32, 64 and 128 dimension

Effect of Side-Information. In zero-shot learning, side information is as important as visual information, as it is the only means by which the model can discover similarities between classes. As the type of side information has a strong effect on the performance of any method, we analyze this effect and present zero-shot SBIR results obtained with different side information and their combinations. We compare the effect of using GloVe (Pennington et al. 2014), Word2Vec (Mikolov et al. 2013) and FastText (Joulin et al. 2017) as text-based models, and three similarity measurements, i.e. path, Lin (1998) and Jiang-Conrath (Jiang and Conrath 1997), for constructing three different kinds of side information based on the WordNet hierarchy. Table 2 contains the quantitative results on the Sketchy, TU-Berlin and QuickDraw datasets with the different side information and their combinations, where we set \(M=32, 64, 128\). We observe that in the majority of cases combining different side information increases the performance by 1–3\(\%\).

On Sketchy, the combinations of Word2Vec with Jiang-Conrath hierarchical similarity and of FastText with path similarity reach the highest mAP of 0.349 with a 64d embedding, while on the TU-Berlin dataset, in addition to the combination of Word2Vec and path similarity, FastText with path similarity leads with 0.297 mAP at 64d, and for QuickDraw the combination of GloVe and Lin hierarchical similarity reaches 0.177 at 64d. We conclude from these experiments that text-based and hierarchy-based class embeddings are indeed complementary.

Effect of Visual Features. Visual features are also crucial for the zero-shot SBIR task. To get an overview, in addition to the VGG-16 (Simonyan and Zisserman 2014) features obtained before the last fc layer, we also consider SE-ResNet-50 (Hu et al. 2019; He et al. 2015) features, and perform zero-shot SBIR experiments on the Sketchy, TU-Berlin and QuickDraw datasets with the different semantic models mentioned above. In Table 3 we present the mAP@all values obtained with the considered visual features and semantic models, where we observe that SE-ResNet-50 features work consistently better than VGG-16 on all three datasets. The performance gain on the challenging TU-Berlin dataset is especially noteworthy, which we attribute to the feature recalibration strategy of the SE blocks, which effectively produces robust features minimizing the inter-class confusion illustrated in Fig. 7.

Table 3 Zero-shot SBIR mAP@all using different semantic embeddings either with VGG-16 or ResNet-50 visual features while the dimension is kept equal to 64

Model Ablations. The baselines of our ablation study are built by modifying parts of the SEM-PCYC model to analyze the effect of its different losses. First, we train the model only with the adversarial loss, and then alternately add the cycle consistency and classification losses during training. Second, we train our model withdrawing only the adversarial loss for the semantic domain, which indicates the effect of side information in our case. We also train the model without the side information selection mechanism; for that, we take only the original text or hierarchical embedding, or their combination, as side information, which gives an idea of the advantage of selecting side information via the auto-encoder. Next, we experiment with reducing the dimensionality of the class embedding to a percentage of the full dimensionality. Finally, to demonstrate the effectiveness of the regularizer used in the auto-encoder for selecting discriminative side information, we experiment with setting \(\lambda =0\) in Eq. (5).

Table 4 Ablation study on our SEM-PCYC model (64d) on three datasets (measured with mAP@all)

The mAP@all values obtained by the respective baselines mentioned above are shown in Table 4. We consider the best side information setting according to Table 2 depending on the dataset. The assessed baselines typically underperform the full SEM-PCYC model. With only the adversarial losses, the performance of our system drops significantly. We suspect that although adversarial training alone maps sketch and image inputs to a semantic space, there is no guarantee that sketch-image pairs of the same category are matched. This is because adversarial training only ensures that the mapping of the input modality to the target modality matches its empirical distribution (Zhu et al. 2017), but does not guarantee that an individual input and output are paired up.

Imposing the cycle-consistency constraint ensures a one-to-one correspondence of sketch-image categories. However, the performance of our system does not improve substantially when the model is trained with both the adversarial and cycle consistency losses. We speculate that this could be due to the lack of inter-category discriminating power of the learned embedding functions; for that reason, we add a classification criterion to train discriminative cross-modal embedding functions. We further observe that imposing the classification criterion together with the adversarial loss alone does not improve the retrieval results either. We conjecture that in this case the learned embedding could be very discriminative but the two modalities might be matched in the wrong way. Hence, it can be concluded that these three losses are complementary to each other and absolutely essential for effective zero-shot SBIR.

Next, we analyze the effect of side information and notice that without the adversarial loss for the semantic domain, our model performs better than the previously mentioned three configurations but does not come close to the full model. This is due to the fact that without the semantic mapping, the resulting embeddings are not semantically related to each other, which does not help cross-modal retrieval in the zero-shot scenario. We further observe that without the encoded and compact side information, we achieve a better mAP@all at the cost of retrieval time, as the original dimension (\(354+300=654\)d for Sketchy, \(664+300=964\)d for TU-Berlin and \(344+300=644\)d for QuickDraw) of the considered side information is much higher than that of the encoded one (64d). We further investigate by reducing its dimension to a percentage of the original one (see Fig. 3d), and observe that, at the beginning, removing a small part (mostly 5–30\(\%\)) usually leads to better performance, which reveals that not all the side information is necessary for effective zero-shot SBIR and some of it is even harmful. In fact, the first removed dimensions have low information content and can be regarded as noise.

We also observe that removing more side information (beyond 20–40\(\%\)) deteriorates the performance of the system, which is justifiable because the compression mechanism of the auto-encoder then progressively removes important side information. However, it can be observed that even with highly compressed side information, our model provides a very good trade-off between performance and retrieval time.

Finally, without the regularizer in Eq. (5) our system still performs reasonably, but the mAP@all value is lower than the best obtained performance. We explain this as a benefit of the \(\ell _{2,1}\)-norm based regularizer, which effectively selects representative side information.

Fig. 8

Few-shot sketch-based image retrieval (k = 0, 1, 5, 10, 15, 20) performance comparison with three existing state-of-the-art methods on Sketchy, TU-Berlin and Quickdraw datasets. Top: few-shot sketch based image retrieval results, Bottom: generalized few-shot sketch-based image retrieval results (color figure online)

Fig. 9

Top-5 k-shot (\(k=0, 1, 5, 10\)) SBIR results obtained by our SEM-PCYC model on the Sketchy (extended) dataset are shown here according to the Euclidean distances, where the green ticks denote the correctly retrieved candidates, whereas the red crosses indicate the wrong retrievals. (best viewed in color) (color figure online)

Fig. 10

Top-5 k-shot (\(k=0, 1, 5, 10\)) SBIR results obtained by our SEM-PCYC model on the TU-Berlin (extended) dataset are shown here according to the Euclidean distances, where the green ticks denote the correctly retrieved candidates, whereas the red crosses indicate the wrong retrievals. (best viewed in color) (color figure online)

Fig. 11

Top-5 k-shot (\(k=0, 1, 5, 10\)) SBIR results obtained by our SEM-PCYC model on the QuickDraw (extended) dataset are shown here according to the Euclidean distances, where the green ticks denote the correctly retrieved candidates, whereas the red crosses indicate the wrong retrievals. (best viewed in color) (color figure online)

4.2 (Generalized) Few-Shot Sketch-Based Image Retrieval

For the few-shot scenario, we start with the pre-trained model from the zero-shot setting, and then fine-tune it using a few example images, i.e. k per class, from the "novel" classes. For fine-tuning the model in the k-shot setting, we consider k different sketch and image instances from each of the unseen classes and cross-combine them according to the coarse-grained and fine-grained settings. The performance is evaluated on the rest of the instances from each class at test time.
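Schematically, the k-shot fine-tuning set can be built as follows; the per-class feature dictionaries and the coarse-grained cross-combination are placeholders for the actual data pipeline, not the released code.

```python
import random

def build_kshot_set(novel_sketches, novel_images, k):
    """novel_sketches / novel_images: dict class -> list of feature tensors.
    Returns cross-combined (sketch, image, class) triplets built from k
    samples per novel class; the remaining samples are kept for evaluation."""
    triplets = []
    for cls in novel_sketches:
        sk = random.sample(novel_sketches[cls], k)
        im = random.sample(novel_images[cls], k)
        triplets += [(s, i, cls) for s in sk for i in im]  # coarse-grained pairing
    return triplets

# Fine-tuning then starts from the zero-shot checkpoint and runs a few epochs
# over build_kshot_set(...) with the same combined objective of Eq. (6).
```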

Table 5 Ablation study with few shot setting on our SEM-PCYC model (64d) on three datasets (measured with mAP@all)
Fig. 12

Fine-grained (generalized) few-shot sketch-based image retrieval performance comparison (color figure online)

Few-Shot Sketch-Based Image Retrieval. Figure 8a–c present the few-shot SBIR performance of our SEM-PCYC model together with ZSIH (Shen et al. 2018) and ZS-SBIR (Kiran Yelamarthi et al. 2018), respectively on the Sketchy, TU-Berlin and QuickDraw datasets. All these plots show that the considered methods improve consistently with increasing k. However, this growth slowly saturates after \(k=10\). In this case as well, our proposed SEM-PCYC model consistently outperforms the prior works, which clearly points out the superiority of our proposal.

Generalized Few-Shot Sketch-Based Image Retrieval. We also tested our few-shot model in the generalized scenario, where during the test phase the search space includes both the seen and novel classes. Typically, this setting poses a remarkably challenging scenario, as the seen classes may create significant confusion for the novel queries. However, the generalized setting is more realistic as it allows querying the system with sketches from any class. In this setting as well, we considered ZSIH (Shen et al. 2018) and ZS-SBIR (Kiran Yelamarthi et al. 2018) as two benchmark methods and trained them with the same experimental settings as ours. In FS-SBIR the generalized setting results follow the non-generalized setting quite closely (see Fig. 8d–f), which indicates the convergence of the generalization ability of the different models. In this setting as well, our proposed model steadily surpasses both benchmark models, which indicates the advantage of our proposal.

Qualitative Results. Figures 9, 10 and 11 present a selection of qualitative results obtained by our SEM-PCYC model respectively on the Sketchy, TU-Berlin and QuickDraw datasets with an increasing number of shots, which show the evolution of model performance with increasing k (\(=0,1,5,10\)) for the classes where the 0-shot results are weak. From these results, we can see that sometimes a single unseen example is sufficient to correctly retrieve images (row 3 of Fig. 9, row 5 of Fig. 10 and row 5 of Fig. 11); however, sometimes more examples are needed (rows 2 and 5 of Fig. 9, rows 2, 3, 4 of Fig. 10 and rows 2, 3, 4 of Fig. 11) to remove the confusion with other similar classes. This uncertainty may come from either visual or semantic similarity. As expected, increasing the number of examples also improves the performance.

Model Ablations. Similar to the zero-shot setting, we perform an ablation study for the few-shot scenario as well, where we consider the same model baselines as in Table 4. The mAP@all values obtained by those baselines in the 5-shot scenario are shown in Table 5. In this case, all the baselines achieve much better performance than the corresponding zero-shot performance on that dataset, which is justified since the model has already been trained in the zero-shot setting and having a few examples from novel classes provides some gain with any combination of losses. We observe that the first three configurations (first three rows of Table 5) perform quite similarly across all three datasets, and we have not found any prominent difference among these three baselines on the considered datasets. However, the baselines with more criteria or losses (bottom three rows of Table 5) achieve much better performance than the previously mentioned three baselines. Among these baselines, we have not found much difference between the ones that do and do not use side information. This is due to the use of the pre-trained zero-shot model, which already has past knowledge of the side information; in this case training with side information could be slightly redundant.

Fine-Grained Settings. We further evaluated our model in the fine-grained setting, where the task is to find the specific object image of a drawn sketch, and we combined it with the above mentioned variations of the k-shot scenario. For this experiment, we only considered the Sketchy dataset, as only this corpus contains aligned sketch-image pairs, which are often used for fine-grained SBIR evaluation. We did not consider other fine-grained datasets, such as shoes or chairs (Song et al. 2017a), as they do not contain the class information which we need for the semantic space mapping. For this setting as well, we considered ZSIH (Shen et al. 2018) and ZS-SBIR (Kiran Yelamarthi et al. 2018) as the two benchmark methods and used the same experimental protocol.

Figure 12a and b show the performance of our model in fine-grained generalized few-shot SBIR together with ZSIH (Shen et al. 2018) and ZS-SBIR (Kiran Yelamarthi et al. 2018). In the fine-grained setting, all methods perform remarkably poorly. We explain this fact as a drawback of the semantic space mapping, which intends to map visual information from sketch and image to the same neighborhood and ignores fine-grained information. Therefore the proposed solution to the low-shot task and the notion of the fine-grained problem contradict each other, and as a consequence the performance of all considered models deteriorates. In the generalized setting, we observe that all models perform slightly better. We conjecture that the considered models can memorize the fine-grained information of the training or seen samples, which gives a slight rise (as they are very few in number) in performance in the generalized scenario. Nevertheless, we believe the low-shot fine-grained paradigm is very important for SBIR, while admitting that it is an extremely challenging task which needs substantial further research to be solved.

5 Conclusion

In this paper, we proposed the SEM-PCYC model for the any-shot SBIR task. Our SEM-PCYC model is a semantically aligned paired cycle consistent generative adversarial network in which each branch maps either a sketch or an image to a common semantic space via adversarial training with a shared discriminator. Thanks to the cycle consistency on both branches, our model does not require aligned sketch-image pairs; moreover, the cycle consistency acts as a regularizer in the adversarial training. The classification losses on the generators guarantee that the features are discriminative. We show that combining heterogeneous side information through an auto-encoder, which encodes compact side information useful for adversarial training, is effective. In addition to the model, we introduced (generalized) few-shot SBIR as a new task, which we combine with the fine-grained setting. We considered three benchmark datasets with varying difficulties and challenges, and performed an exhaustive evaluation under the above mentioned paradigms. Our assessment on these three datasets shows that our model consistently outperforms the existing methods in the (generalized) zero- and few-shot and fine-grained settings. We encourage future work to evaluate sketch-based image retrieval methods in these incrementally challenging and realistic settings.