Semantically Tied Paired Cycle Consistency for Any-Shot Sketch-based Image Retrieval

Low-shot sketch-based image retrieval is an emerging task in computer vision, which aims to retrieve natural images relevant to hand-drawn sketch queries whose classes are rarely or never seen during the training phase. Related prior works either require aligned sketch-image pairs that are costly to obtain, or rely on memory-inefficient fusion layers for mapping the visual information to a semantic space. In this paper, we address any-shot, i.e. zero-shot and few-shot, sketch-based image retrieval (SBIR) tasks, where we introduce the few-shot setting for SBIR. For solving these tasks, we propose a semantically aligned paired cycle-consistent generative adversarial network (SEM-PCYC) for any-shot SBIR, where each branch of the generative adversarial network maps the visual information from sketch and image to a common semantic space via adversarial training. Each of these branches maintains cycle consistency, which only requires supervision at the category level and avoids the need for aligned sketch-image pairs. A classification criterion on the generators' outputs ensures that the visual-to-semantic mapping is class-specific. Furthermore, we propose to combine textual and hierarchical side information via an auto-encoder that selects discriminating side information within the same end-to-end model. Our results demonstrate a significant boost in any-shot SBIR performance over the state-of-the-art on the extended versions of the challenging Sketchy, TU-Berlin and QuickDraw datasets.


Introduction
Matching natural images with free-hand sketches, i.e. sketch-based image retrieval (SBIR) [79,77,36,45,63,58,82,7,30,13,11], has received a lot of attention. Since sketches can effectively express the shape, pose and some fine-grained details of the target images, SBIR serves as a favorable complement to conventional text-image cross-modal retrieval and the classical content-based image retrieval protocol: in some situations it is difficult to provide a textual description or a suitable example image of the desired query, whereas a user can easily draw a sketch of the desired object on a touch screen.
Existing SBIR methods perform well when training and test classes overlap, since the visual information from all classes is explored by the system during training [82]. Since for practical applications there is no guarantee that the training data includes all possible queries, a more realistic setting is low-shot or any-shot SBIR (AS-SBIR) [58,30,13,11], which combines zero- and few-shot learning [33,65,72,50] and SBIR into a single task, where the aim is accurate class prediction together with competent retrieval performance. However, this is an extremely challenging task, as it simultaneously deals with the domain gap, intra-class variability and limited or no knowledge of novel classes. Additionally, fine-grained SBIR [45,44] is an alternative sketch-based image retrieval task, allowing to search for specific object images, which has already received remarkable attention in the computer vision community. However, it has never been explored in the low-shot setting, which is extremely challenging and at the same time of high practical relevance.
One of the major shortcomings of the prior work on any-shot SBIR is that a natural image is retrieved after learning a mapping from an input sketch to an output image using a training set of labelled aligned pairs [30]. The supervision of the pair correspondence enhances the correlation of multi-modal data (here, sketch and image) so that learning can be guided by semantics. However, in many realistic scenarios, paired (aligned) training data is either unavailable or very expensive to obtain. Furthermore, a joint representation of two or more modalities is often learned by using a memory-intensive fusion layer [58], such as tensor fusion [23] or bilinear pooling [81]. These fusion layers are often expensive in terms of memory [81], and extracting useful information from the resulting high-dimensional space can cause information loss [80].
To alleviate these shortcomings, we propose a semantically aligned paired cycle consistent generative adversarial network (SEM-PCYC) model for the any-shot SBIR task, where each branch maps either the sketch or the image features to a common semantic space via adversarial training. These two branches, dealing with two different modalities (sketch and image), constitute an essential component for solving the SBIR task. The cycle consistency constraint on each branch ensures that the sketch or image modality mapped to the common semantic space can be translated back to the original modality, avoiding the necessity of aligned sketch-image pairs. Imposing a classification loss on the semantically aligned outputs from the sketch and image space enforces the generated features in the semantic space to be discriminative, which is crucial for effective any-shot SBIR. Furthermore, inspired by previous work on label embedding [3], we propose to combine side information from text-based and hierarchical models via a feature selection auto-encoder [68], which selects discriminating side information based on intra- and inter-class covariance. This paper extends our CVPR 2019 conference paper [13] with the following additional contributions: (1) We propose to apply the SEM-PCYC model to the any-shot SBIR task, i.e. in addition to the zero-shot paradigm, we introduce the few-shot setting for SBIR and combine it with the generalized setting, which we experimentally show to be effective for difficult or confusing classes. (2) We adapt the recent zero-shot SBIR models and ours to fine-grained SBIR in the generalized low-shot setting and provide an extensive benchmark including quantitative and qualitative evaluations. (3) We evaluate our model on one recent dataset, i.e. QuickDraw, in addition to extending our experiments to new settings on Sketchy and TU-Berlin. We show that our proposed model consistently improves the state-of-the-art results of any-shot SBIR on all three datasets.

Related Work
As our work lies at the intersection of sketch-based image retrieval and any-shot learning, we briefly review the relevant literature from both fields.
Sketch Based Image Retrieval (SBIR). Attempts at solving the SBIR task mostly focus on bridging the domain gap between sketch and image, and can roughly be grouped into hand-crafted and cross-domain deep learning-based methods [36]. Hand-crafted methods mostly work by extracting the edge map from the natural image and then matching it with the sketch using a Bag-of-Words model on top of some specifically designed SBIR features, viz. gradient field HOG [25], histogram of oriented edges [53], learned key shapes [54], etc. However, the difficulty of reducing the domain gap remained unresolved, as it is extremely challenging to match edge maps with unaligned hand-drawn sketches. This domain shift issue is further addressed by neural network models, where domain-transferable features from sketch to image are learned in an end-to-end manner. The majority of such models use variants of siamese networks [48,55,77,62] that are suitable for cross-modal retrieval. These frameworks either use generic ranking losses, viz. contrastive loss [9] or triplet ranking loss [55], or the more sophisticated HOLEF-based loss [63]. Further to these discriminative losses, Pang et al. [45] introduced a discriminative-generative hybrid model for preserving all the domain-invariant information useful for reducing the domain gap between sketch and image. Alternatively, [36,82] focus on learning cross-modal hash codes for category-level SBIR within an end-to-end deep model.
In addition to the above coarse-grained SBIR models, fine-grained sketch-based image retrieval (FG-SBIR) has recently gained popularity [34,62,63,45]. In this more realistic setting, a FG-SBIR model allows searching for a specific object or image. Early models tackled this task using a deformable part model and graph matching [34]. More recently, different ranking frameworks and corresponding losses, such as siamese [45], triplet [55] and quadruplet [62] networks, were used for the same purpose. [63] proposed an attention model for the FG-SBIR task, while [82] improves retrieval efficiency using a hashing scheme.
Zero-Shot Learning (ZSL) and Few-Shot Learning (FSL). Zero-shot learning in computer vision refers to recognizing objects whose instances are not seen during the training phase; a comprehensive and detailed survey on ZSL is available in [72]. Early works on ZSL [33,26,5,4] make use of attributes within a two-stage approach to infer the label of an image that belongs to the unseen classes. In contrast, recent works [17,52,3,2,32] directly learn a mapping from the image feature space to a semantic space. Many other ZSL approaches learn a non-linear multi-modal embedding [61,2,71,6,83], where most of the methods focus on learning a non-linear mapping from the image space to the semantic space. Mapping both image and semantic features into a common intermediate space is another direction that ZSL approaches adopt [85,18,86,1,38]. Although most of the deep neural network models in this domain are trained using a discriminative loss function, a few generative models also exist [69,73,8], which are used as a data augmentation mechanism. In ZSL, some form of side information is required so that the knowledge learned from seen classes gets transferred to unseen classes. One popular form of side information is attributes [33], which, however, require costly expert annotation. Thus, there is a large group of studies [39,3,71,51,49,12] which utilize other auxiliary information, such as text-based [40] or hierarchical [42] models, for label embedding.
On the other hand, few-shot learning (FSL) refers to the task of recognizing images or detecting objects with a model trained on very few samples [74,57]. Directly training a given model with a small amount of training samples carries the risk of overfitting. Hence, a general strategy to overcome this hurdle is to initially train the model on classes with sufficient examples, and then generalize it to classes with fewer examples without learning any new parameters. This setup has already attracted a lot of attention within the computer vision community. One of the first attempts [31] is a siamese convolutional network model for computing similarity between pairs of images; the learned similarity is then used to solve the one-shot problem via k-nearest neighbors classification. The matching network model [65] uses cosine distance to predict image labels based on support sets and applies an episodic training strategy that mimics few-shot learning. An extension, the prototypical network [60], uses Euclidean distance instead of cosine distance and builds a prototype representation of each class for the few-shot learning scenario. In an orthogonal direction, [50] introduced a meta-learning framework for FSL, which updates the weights of a classifier for a given episode. The model-agnostic meta-learner [16] learns a better weight initialization capable of generalizing in the FSL scenario with fewer gradient descent steps. There also exist a few low-shot methods that learn a generator from the base class data to generate novel class features for data augmentation [19,70]. Alternatively, a GNN [29] was also proposed as a framework for the few-shot learning task [56].
Our work. The prior work on zero-shot sketch-based image retrieval (ZS-SBIR) [58] proposed a generative cross-modal hashing scheme using a graph convolution network for aligning sketch and image in the semantic space. Inspired by it, [30] proposed two similar autoencoder-based generative models for zero-shot SBIR, which use aligned pairs of sketch and image for learning the semantics between them. In this work, we propose a paired cycle consistent generative model where each branch maps either sketch or image features to a common semantic space via adversarial training, which we found to be effective for reducing the domain gap between sketch and image. The cycle consistency constraint on each branch allows supervision only at the category level, and avoids the need for aligned sketch-image pairs. Furthermore, we address zero-shot and few-shot cross-modal (sketch to image) retrieval; for that, we effectively combine different side information within an end-to-end framework, and map the visual information to the semantic space through adversarial training. Finally, we unify low-shot learning models and generalize them to the fine-grained SBIR scenario.

Semantically Aligned Paired Cycle Consistent GAN (SEM-PCYC)
Our Semantically Aligned Paired Cycle Consistent GAN (SEM-PCYC) model uses the sketch and image data from the seen categories for training the underlying model. It then encodes and matches the sketch and image categories that remain novel during the training phase. The overall pipeline of our end-to-end deep architecture is shown in Fig. 2.
We define $\mathcal{D}^s = \{X^s, Y^s\}$ to be a collection of sketch-image data from the training categories $\mathcal{C}^s$, which contains sketch images $X^s = \{x_i^s\}_{i=1}^{N}$ as well as natural images $Y^s = \{y_i^s\}_{i=1}^{N}$, where $N$ is the total number of sketch and image pairs that are not necessarily aligned. Without loss of generality, a sketch and an image with the same index $i$ share the same category label. The set $S^s = \{s_i^s\}_{i=1}^{N}$ indicates the side information necessary for transferring knowledge from seen to novel classes (a.k.a. unseen classes in the zero-shot learning literature). In our setting, we also use an auxiliary training set $\mathcal{D}^a = \{X^a, Y^a\}$ from the unseen classes $\mathcal{C}^u$, which is disjoint from $\mathcal{C}^s$, where the number of samples per class is fixed to $k$.
Our aim is to learn two deep functions $G_{sk}(\cdot)$ and $G_{im}(\cdot)$, for sketch and image respectively, mapping them to a common semantic space where the learned knowledge is applied to the novel classes. Now, given a second set $\mathcal{D}^u = \{X^u, Y^u\}$ from the test categories $\mathcal{C}^u$, the proposed deep networks $G_{sk}: \mathbb{R}^d \to \mathbb{R}^M$ and $G_{im}: \mathbb{R}^d \to \mathbb{R}^M$ ($d$ is the dimension of the original data and $M$ is the target dimension of the common representation) map the sketch and the natural image to a common semantic space where the retrieval is performed. Depending on $k$, i.e. the number of samples per class considered in the auxiliary set, the scenario is called $k$-shot. In the classical zero-shot sketch-based image retrieval setting, the test categories belong to $\mathcal{C}^u$; in other words, at test time the assumption is that every image comes from a previously unseen class. This is not realistic, as the true generalization performance of the classifier can only be measured by how well it generalizes to unseen classes without forgetting the seen classes. Hence, in the generalized zero-shot sketch-based image retrieval scenario, the search space contains both $\mathcal{C}^u$ and $\mathcal{C}^s$; in other words, at test time an image may come either from a previously seen or an unseen class. As this setting is significantly more challenging, the accuracy decreases for all the methods considered.
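Once $G_{sk}$ and $G_{im}$ are trained, retrieval reduces to nearest-neighbour search in the common semantic space. A minimal NumPy sketch of this step (the function name and toy data are ours, for illustration):

```python
import numpy as np

def retrieve(query_emb, gallery_embs):
    """Rank gallery images by cosine similarity to a sketch query,
    both already mapped into the common semantic space."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                  # cosine similarity per gallery item
    return np.argsort(-sims)      # indices, most similar first

# toy example: 3 gallery items in a 4-d semantic space
gallery = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.9, 0.1, 0.0, 0.0]])
ranking = retrieve(np.array([1.0, 0.0, 0.0, 0.0]), gallery)
```

The same routine serves both ZS-SBIR and the generalized setting; only the composition of the gallery (unseen-only vs. seen plus unseen) changes.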

Paired Cycle Consistent Generative Model
To achieve the flexibility to handle sketch and image individually, i.e. even without aligned sketch-image pairs, during training of $G_{sk}$ and $G_{im}$, we propose a cycle consistent generative model, each branch of which is semantically aligned via a common discriminator. The cycle consistency constraint on each branch of the model ensures that the sketch or image modality mapped to the common semantic space can be translated back to the original modality, which only requires supervision at the category level. Imposing a classification loss on the outputs of $G_{sk}$ and $G_{im}$ allows generating highly discriminative features.
Our main goal is to learn two mappings $G_{sk}$ and $G_{im}$ that can respectively translate the unaligned sketch and natural image to a common semantic space. Zhu et al. [87] pointed out the existence of an underlying intrinsic relationship between modalities and domains; for example, a sketch and an image of the same object category have the same semantic meaning, and thus possess such a relationship. Even though we lack visual supervision, as we do not have access to aligned pairs, we can exploit semantic supervision at the category level. We train a mapping $G_{sk}: \mathcal{X} \to \mathcal{S}$ so that $\hat{s}_i = G_{sk}(x_i)$, where $s_i \in \mathcal{S}$ is the corresponding side information, which is made indistinguishable from $\hat{s}_i$ via adversarial training against a discriminator that tries to classify $\hat{s}_i$ as different from $s_i$. The optimal $G_{sk}$ thereby translates the modality $\mathcal{X}$ into a modality $\hat{\mathcal{S}}$ which is identically distributed to $\mathcal{S}$. Similarly, another function $G_{im}: \mathcal{Y} \to \mathcal{S}$ can be trained via the same discriminator such that $\hat{s}_i = G_{im}(y_i)$.
Adversarial Loss. As shown in Fig. 2, for mapping the sketch and image representations to a common semantic space, we introduce four generators: $G_{sk}: \mathcal{X} \to \mathcal{S}$ and $G_{im}: \mathcal{Y} \to \mathcal{S}$, which map sketch and image features to the semantic space, and $F_{sk}: \mathcal{S} \to \mathcal{X}$ and $F_{im}: \mathcal{S} \to \mathcal{Y}$, which translate semantic features back to the sketch and image space. In addition, we bring in three adversarial discriminators: $D_{se}(\cdot)$, $D_{sk}(\cdot)$ and $D_{im}(\cdot)$, where $D_{se}$ discriminates among original side information $\{s\}$, sketch transformed to side information $\{G_{sk}(x)\}$ and image transformed to side information $\{G_{im}(y)\}$; likewise, $D_{sk}$ discriminates between original sketch representations $\{x\}$ and side information transformed to sketch representations $\{F_{sk}(s)\}$; in a similar way, $D_{im}$ distinguishes between $\{y\}$ and $\{F_{im}(s)\}$. For the generators $G_{sk}$, $G_{im}$ and their common discriminator $D_{se}$, the objective is:
$$\mathcal{L}_{adv}^{se}(G_{sk}, G_{im}, D_{se}) = \mathbb{E}_{s}\left[\log D_{se}(s)\right] + \mathbb{E}_{x}\left[\log\left(1 - D_{se}(G_{sk}(x))\right)\right] + \mathbb{E}_{y}\left[\log\left(1 - D_{se}(G_{im}(y))\right)\right],$$
where $G_{sk}$ and $G_{im}$ generate side information similar to the ones in $\mathcal{S}$ while $D_{se}$ distinguishes between the generated and original side information. Here, $G_{sk}$ and $G_{im}$ minimize the objective against an opponent $D_{se}$ that tries to maximize it, namely $\min_{G_{sk}, G_{im}} \max_{D_{se}} \mathcal{L}_{adv}^{se}(G_{sk}, G_{im}, D_{se})$. In a similar way, for the generator $F_{sk}$ and its discriminator $D_{sk}$, the objective is:
$$\mathcal{L}_{adv}^{sk}(F_{sk}, D_{sk}) = \mathbb{E}_{x}\left[\log D_{sk}(x)\right] + \mathbb{E}_{s}\left[\log\left(1 - D_{sk}(F_{sk}(s))\right)\right],$$
where $F_{sk}$ minimizes the objective and its adversary $D_{sk}$ intends to maximize it, namely $\min_{F_{sk}} \max_{D_{sk}} \mathcal{L}_{adv}^{sk}(F_{sk}, D_{sk})$. Similarly, another adversarial loss $\mathcal{L}_{adv}^{im}(F_{im}, D_{im})$ is introduced for the mapping $F_{im}$ and its discriminator $D_{im}$.
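Assuming the standard GAN form of these objectives, the discriminator side of the semantic-space loss can be sketched as follows (NumPy, with toy discriminator outputs; all names are ours):

```python
import numpy as np

def d_se_loss(d_real, d_fake_sk, d_fake_im, eps=1e-8):
    """Discriminator objective for D_se: real side information vs.
    side information generated from sketches (G_sk) and images (G_im).
    D_se maximizes this value; the generators minimize the log(1 - D) terms."""
    return (np.mean(np.log(d_real + eps))
            + np.mean(np.log(1.0 - d_fake_sk + eps))
            + np.mean(np.log(1.0 - d_fake_im + eps)))

# a near-perfect discriminator (real ~ 1, fake ~ 0) attains a value near 0
val = d_se_loss(np.array([0.99]), np.array([0.01]), np.array([0.01]))
```

The losses for $D_{sk}$ and $D_{im}$ follow the same pattern with one fake term each.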
Cycle Consistency Loss. The adversarial mechanism effectively reduces the domain or modality gap; however, it does not guarantee that an input $x_i$ and an output $s_i$ are matched well. To this end, we impose cycle consistency [87]. When we map the feature of a sketch of an object to the corresponding semantic space, and then translate it back from the semantic space to the sketch feature space, we should arrive back at the original sketch feature. This cycle consistency loss also assists in learning mappings across domains where paired or aligned examples are not available. Specifically, if we have a function $G_{sk}: \mathcal{X} \to \mathcal{S}$ and another mapping $F_{sk}: \mathcal{S} \to \mathcal{X}$, then $G_{sk}$ and $F_{sk}$ are inverses of each other, and hence form a one-to-one correspondence, i.e. a bijective mapping.
$$\mathcal{L}_{cyc}(G_{sk}, F_{sk}) = \mathbb{E}_{x}\left[\|F_{sk}(G_{sk}(x)) - x\|_1\right] + \mathbb{E}_{s}\left[\|G_{sk}(F_{sk}(s)) - s\|_1\right],$$
where $s$ is the semantic feature of the class $c$ which is the category label of $x$. Similarly, a cycle consistency loss $\mathcal{L}_{cyc}(G_{im}, F_{im})$ is imposed for the mappings $G_{im}: \mathcal{Y} \to \mathcal{S}$ and $F_{im}: \mathcal{S} \to \mathcal{Y}$. These consistency loss functions also behave as a regularizer on the adversarial training, ensuring that the learned function maps a specific input $x_i$ to a desired output $s_i$.
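Assuming both translation directions are penalized with an L1 norm, as in CycleGAN-style models, the sketch-branch cycle loss can be sketched as (NumPy; the toy mappings are ours):

```python
import numpy as np

def cycle_loss(x, s, G_sk, F_sk):
    """L1 cycle consistency for the sketch branch: x -> G_sk(x) -> F_sk(.)
    should recover x, and s -> F_sk(s) -> G_sk(.) should recover s."""
    forward = np.mean(np.abs(F_sk(G_sk(x)) - x))
    backward = np.mean(np.abs(G_sk(F_sk(s)) - s))
    return forward + backward

# toy check: if F_sk is the exact inverse of G_sk, the loss vanishes
G = lambda v: 2.0 * v
F = lambda v: 0.5 * v
loss = cycle_loss(np.ones(4), np.ones(4), G, F)
```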
Classification Loss. On the other hand, adversarial training and cycle-consistency constraints do not explicitly ensure that the features generated by the mappings $G_{sk}$ and $G_{im}$ are class discriminative, which is a requirement for the zero-shot sketch-based image retrieval task. We conjecture that this issue can be alleviated by introducing a discriminative classifier pre-trained on the input data. To this end, we minimize a classification loss over the generated features.
$$\mathcal{L}_{cls}(G_{sk}) = -\mathbb{E}_{(x,c)}\left[\log P(c \mid G_{sk}(x); \theta)\right],$$
where $c$ is the category label of $x$, and $P(c \mid G_{sk}(x); \theta)$ denotes the probability of $G_{sk}(x)$ being predicted as its true class label $c$. The conditional probability is computed by a linear softmax classifier parameterized by $\theta$. Similarly, a classification loss $\mathcal{L}_{cls}(G_{im})$ is imposed on the generator $G_{im}$.
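A minimal sketch of such a classification loss with a linear softmax classifier (the weight matrix W stands in for the parameters $\theta$; all names and toy numbers are ours):

```python
import numpy as np

def cls_loss(feats, labels, W):
    """Softmax cross-entropy on generator outputs: a linear classifier W
    predicts the true class from the generated semantic features."""
    logits = feats @ W                           # (N, num_classes)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_p[np.arange(len(labels)), labels])

# toy check: confident correct predictions give a small loss
loss = cls_loss(np.eye(2), np.array([0, 1]), 5.0 * np.eye(2))
```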

Selection of Side Information
Learning a compatibility or matching function between multiple modalities in the zero-shot scenario [58,11,37] requires structure in the class embedding space to which the image features are mapped. Attributes provide one such structured class embedding space [33]; however, obtaining attributes requires costly human annotation. On the other hand, side information can also be learned at a much lower cost from large-scale text corpora such as Wikipedia. Similarly, output embeddings built from hierarchical organizations of classes such as WordNet can also provide structure in the output space and substitute the attributes. Motivated by attribute selection for zero-shot learning [21], which indicates that a subset of discriminative attributes is more effective than the whole set of attributes for ZSL, we incorporate a joint learning framework that integrates an auto-encoder to select side information. Let $s \in \mathbb{R}^k$ be the side information with $k$ as the original dimension. The loss function is:
$$\mathcal{L}_{aenc}(f, g) = \mathbb{E}_{s}\left[\|g(f(s)) - s\|_F^2\right] + \lambda \|W_f\|_{2,1},$$
with $(W_f, b_f)$ and $(W_g, b_g)$ respectively as the weights and biases for the encoder $f$ and decoder $g$. Here, $\|\cdot\|_F$ denotes the Frobenius norm, defined as the square root of the sum of the absolute squares of its elements, and $\|\cdot\|_{2,1}$ indicates the $\ell_{2,1}$ norm [43]. Selecting side information reduces the dimensionality of the embeddings, which further improves retrieval time. Therefore, the training objective of our model is:
$$\min_{G, F, f, g} \max_{D} \; \lambda_{adv}^{se} \mathcal{L}_{adv}^{se} + \lambda_{adv}^{sk} \mathcal{L}_{adv}^{sk} + \lambda_{adv}^{im} \mathcal{L}_{adv}^{im} + \lambda_{cyc}^{sk} \mathcal{L}_{cyc}(G_{sk}, F_{sk}) + \lambda_{cyc}^{im} \mathcal{L}_{cyc}(G_{im}, F_{im}) + \lambda_{cls}^{sk} \mathcal{L}_{cls}(G_{sk}) + \lambda_{cls}^{im} \mathcal{L}_{cls}(G_{im}) + \lambda_{aenc} \mathcal{L}_{aenc},$$
where the different $\lambda$s are the weights on the respective loss terms.
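The row-sparse regularizer can be illustrated with a linear encoder/decoder for simplicity (the actual model uses non-linear fc layers; all names are ours):

```python
import numpy as np

def l21_norm(W):
    """Sum of the L2 norms of the rows of W; drives whole rows,
    i.e. whole side-information dimensions, toward zero."""
    return np.sum(np.linalg.norm(W, axis=1))

def aenc_loss(S, W_enc, W_dec, lam=0.01):
    """Reconstruction error plus a row-sparse penalty on the encoder,
    so only discriminative side-information dimensions survive."""
    recon = S @ W_enc @ W_dec        # linear encoder/decoder sketch
    return np.linalg.norm(S - recon, 'fro') ** 2 + lam * l21_norm(W_enc)

# toy check: identity weights reconstruct perfectly, leaving only the penalty
S = np.ones((2, 3))
loss = aenc_loss(S, np.eye(3), np.eye(3), lam=0.01)
```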
For obtaining the initial side information, we combine a text-based and a hierarchical model, which are complementary and robust [3]. Below, we describe our text-based and hierarchical models for side information.
Text-based Model. We use three different text-based sources of side information.
(1) Word2Vec [41] is a two-layer neural network trained to reconstruct the linguistic contexts of words. During training, it takes a large corpus of text and creates a vector space of several hundred dimensions, with each unique word assigned a corresponding vector in that space. The model can be trained with hierarchical softmax using either the skip-gram or the continuous bag-of-words formulation for target prediction.
(2) GloVe [47] considers global word-word co-occurrence statistics that frequently appear in a corpus. Intuitively, co-occurrence statistics encode important semantic information. The objective is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence.
(3) FastText [28] extends the Word2Vec model: instead of learning vectors for words directly, FastText represents each word as a bag of character n-grams and then trains a skip-gram model to learn the embeddings. FastText works well with rare words; even if a word was not seen during training, it can be broken down into n-grams to obtain its embedding, which is a huge advantage of this model.
Hierarchical Model. Semantic distance (or similarity) between words can also be approximated by their distance (or similarity) in a large ontology such as WordNet 1, which contains ≈100,000 English words. One can measure the similarity $S_{WN}$ between words represented as nodes in the ontology using techniques such as path similarity, e.g. counting the number of hops required to reach from one node to the other, or Jiang-Conrath similarity [27]. For a set $S$ of nodes derived from a dictionary $D$ that consists of a set of classes, the similarities between a class $c$ and all the nodes, considered in a fixed order in $S$, determine the entries of the class embedding vector [3] of $c$:
$$s_{hier}(c) = \left[S_{WN}(c, u)\right]_{u \in S}.$$
Note that $S$ considers all the nodes on the path from each node in $D$ to its highest-level ancestor. The WordNet hierarchy contains most of the classes of the Sketchy [55], TU-Berlin [14] and QuickDraw [11] datasets. Few exceptions are: jack-o-lantern, which we replaced with lantern that appears higher in the hierarchy; similarly, human skeleton with skeleton, and octopus with octopods. $|S|$, i.e. the number of nodes, for the Sketchy, TU-Berlin and QuickDraw datasets is respectively 354, 664 and 344.
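The construction can be illustrated on a toy ontology (a tiny stand-in for WordNet; the class names and the hop-based path similarity are ours, for illustration):

```python
# toy ontology: child -> parent edges (a tiny stand-in for WordNet)
parent = {'swan': 'bird', 'duck': 'bird', 'bird': 'animal', 'dog': 'animal'}

def ancestors(n):
    """Path from a node up to its highest-level ancestor, inclusive."""
    path = [n]
    while n in parent:
        n = parent[n]
        path.append(n)
    return path

def hops(u, v):
    """Number of edges between two nodes, via their lowest common ancestor."""
    pu, pv = ancestors(u), ancestors(v)
    common = next(a for a in pu if a in pv)
    return pu.index(common) + pv.index(common)

def path_sim(u, v):
    """Path similarity: 1 / (1 + number of hops)."""
    return 1.0 / (1 + hops(u, v))

# class embedding of 'swan': similarity to every node in S, in a fixed order
S = ['swan', 'duck', 'bird', 'animal', 'dog']
s_hier = [path_sim('swan', n) for n in S]
```

In the paper this role is played by the WordNet hierarchy restricted to the nodes on the paths from the dataset classes to their top-level ancestors.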

Experiments
In this section, we detail our datasets, implementation protocol and present our results on (generalized) zero-shot, (generalized) few-shot and fine-grained settings.

Datasets.
We experimentally validate our model on three popular SBIR datasets, namely Sketchy (Extended), TU-Berlin (Extended) and QuickDraw (Extended). For brevity, we refer to these extended datasets as Sketchy, TU-Berlin and QuickDraw respectively. The Sketchy Dataset [55] is a large collection of sketch-photo pairs. The dataset originally consists of images from 125 different classes, with 100 photos each. The 75,471 sketches of the objects that appear in these 12,500 images were collected via crowd-sourcing. This dataset also contains a fine-grained correspondence (alignment) between particular photos and sketches, as well as various data augmentations for deep learning based methods. Liu et al. [36] extended the dataset by adding 60,502 photos, yielding 73,002 images in total. We randomly pick 25 classes as the novel test set and use the data from the remaining 100 classes for training.
The original TU-Berlin Dataset [14] contains 250 categories with a total of 20,000 sketches, extended by [36] with 204,489 natural images corresponding to the sketch classes. 30 classes of sketches and images are randomly chosen to form the query set and the retrieval gallery, respectively. The remaining 220 classes are utilized for training. We follow Shen et al. [58] and select classes with at least 400 images to form the test set.
QuickDraw (Extended), a large-scale dataset proposed recently in [11], contains sketch-image pairs of 110 classes, consisting of 203,885 images and 330,111 sketches, i.e. approximately 1854 images and 3000 sketches per class. The main difference of this dataset from the previous ones lies in the abstractness of the sketches, which are collected from the Quick, Draw! 2 online game. The increased abstractness of the drawings enlarges the sketch-image domain gap, and hence increases the challenge of the SBIR task.
Implementation details. We implemented the SEM-PCYC model using the PyTorch [46] deep learning toolbox 3 on a single TITAN Xp or TITAN V graphics card. Unless otherwise mentioned, we extract sketch and image features from the VGG-16 [59] network pre-trained on ImageNet [10] (before the last pooling layer). In Section 4.1, we compare the VGG-16 features with SE-ResNet-50 features for the zero-shot SBIR task, which is restricted to that experiment only. Since in this work we deal with single object retrieval and an object usually spans only certain regions of a sketch or image, we apply an attention mechanism inspired by Song et al. [63], without the shortcut connection, for extracting only the informative regions from sketch and image. The attended 512-d representation is obtained by a pooling operation guided by the attention model and a fully connected (fc) layer. This entire model is fine-tuned on our training set (100 classes for Sketchy, 220 classes for TU-Berlin and 80 classes for QuickDraw). Both the generators $G_{sk}$ and $G_{im}$ are built with a fc layer followed by a ReLU non-linearity that accepts a 512-d vector and outputs an $M$-d representation, whereas the generators $F_{sk}$ and $F_{im}$ take $M$-d features and produce 512-d vectors. Accordingly, all discriminators are designed to take the output of the respective generators and produce a one-dimensional output. The auto-encoder is designed by stacking two non-linear fc layers, respectively as encoder and decoder, for obtaining a compressed and encoded representation of dimension $M$. We experimentally set $\lambda_{adv}^{se} = 1.0$, $\lambda_{adv}^{sk} = 0.5$, $\lambda_{adv}^{im} = 0.5$, $\lambda_{cyc}^{sk} = 1.0$, $\lambda_{cyc}^{im} = 1.0$, $\lambda_{cls}^{sk} = 1.0$, $\lambda_{cls}^{im} = 1.0$, $\lambda_{aenc} = 0.01$, which give the optimum performance of our model.
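The generator shapes described above can be sketched as plain matrix operations (NumPy with random weights; the dimensions follow the text, everything else is ours for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 64                                    # common semantic dimension

def fc_relu(x, W, b):
    """One fully connected layer followed by a ReLU non-linearity."""
    return np.maximum(0.0, x @ W + b)

# G_sk / G_im: 512-d visual feature -> M-d semantic code
W_g, b_g = rng.standard_normal((512, M)) * 0.01, np.zeros(M)
# F_sk / F_im: M-d semantic code -> 512-d visual feature
W_f, b_f = rng.standard_normal((M, 512)) * 0.01, np.zeros(512)

x = rng.standard_normal(512)              # e.g. a VGG-16 sketch feature
s_hat = fc_relu(x, W_g, b_g)              # generated semantic code
x_back = fc_relu(s_hat, W_f, b_f)         # translated back, for the cycle
```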
While constructing the hierarchy for the class embedding, we only consider the training classes belonging to that dataset. In this way, the WordNet hierarchy or knowledge graph for the Sketchy, TU-Berlin and QuickDraw datasets respectively contains 354, 664 and 344 nodes. Although our method does not produce a binary hash code as the final representation for matching sketch and image, for the sake of comparison with related works that produce hash codes, such as ZSH [75], ZSIH [58] and GDH [82], we use the iterative quantization (ITQ) [20] algorithm to obtain binary codes for sketch and image. We use the final representations of sketches and images from the training set to learn the optimized rotation, which is later applied to our final representation for obtaining the binary codes.
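A compact sketch of the ITQ rotation learning (NumPy; this assumes the input features are already centered and reduced to the code dimension, and the alternating update follows the standard orthogonal Procrustes solution):

```python
import numpy as np

def itq(V, n_bits, n_iter=50, seed=0):
    """Iterative Quantization: learn an orthogonal rotation R so that
    sign(V @ R) yields binary codes close to the real-valued features V."""
    rng = np.random.default_rng(seed)
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    for _ in range(n_iter):
        B = np.sign(V @ R)                 # fix R, update binary codes
        U, _, St = np.linalg.svd(B.T @ V)  # fix B, update rotation
        R = (U @ St).T                     # orthogonal Procrustes solution
    return R, np.sign(V @ R)

rng = np.random.default_rng(1)
V = rng.standard_normal((20, 4))           # 20 features, 4-bit codes
R, B = itq(V, n_bits=4, n_iter=20)
```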

(Generalized) Zero-Shot Sketch-based Image Retrieval
Apart from the two prior zero-shot SBIR works closest to ours, i.e. ZSIH [58] and ZS-SBIR [30], we adapt fourteen ZSL and SBIR models to the zero-shot SBIR task. Note that in this setting, the training classes are indicated as "seen" and the novel classes as "unseen", since none of the sketches of these classes are visible to the model during training.
The SBIR methods that we evaluate are SaN [79], 3D Shape [66], Siamese CNN [48], GN Triplet [55], DSH [36] and GDH [82]. A softmax baseline is also added, which is based on computing the 4096-d VGG-16 [59] feature vector pre-trained on the seen classes for nearest neighbour search. The ZSL methods that we evaluate are: CMT [61], DeViSE [17], SSE [85], JLSE [86], ZSH [75], SAE [32] and FRW-GAN [15]. We use the same seen-unseen splits of categories for all experiments for a fair comparison. We compute the mean average precision (mAP@all) and the precision over the top 100 retrievals (Precision@100) [64,58] for performance evaluation and comparison. Table 1 shows that most of the SBIR and ZSL methods perform worse than the zero-shot SBIR methods. Among them, the ZSL methods usually suffer from the domain gap between the sketch and image modalities. The majority of SBIR methods, although performing better than their ZSL counterparts, fail to generalize the learned representations to unseen classes. However, GN Triplet [55], DSH [36] and GDH [82] have shown reasonable potential to generalize information learned from objects with common shapes.
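The two evaluation measures can be sketched as follows (NumPy; the toy relevance list is ours, for illustration):

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query: ranked_relevance holds 1/0 per retrieved item,
    in ranked order. mAP@all averages this over all queries."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def precision_at(ranked_relevance, k=100):
    """Precision@k: fraction of relevant items among the top-k retrievals."""
    rel = np.asarray(ranked_relevance[:k], dtype=float)
    return float(rel.mean())

# e.g. relevant items retrieved at ranks 1 and 3: AP = (1/1 + 2/3) / 2
ap = average_precision([1, 0, 1, 0])
p2 = precision_at([1, 0, 1, 0], k=2)
```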
As expected, the specialized zero-shot SBIR methods surpass most of the ZSL and SBIR baselines, as they possess both the ability to reduce the domain gap and to generalize the learned information to the unseen classes. ZS-SBIR learns to generalize between sketch and image from aligned sketch-image pairs; as a result it performs well on the Sketchy dataset, but not on TU-Berlin or QuickDraw, where aligned sketch-image pairs are not available. Our proposed method exceeds the state-of-the-art by 0.091 mAP@all on Sketchy, 0.074 mAP@all on TU-Berlin and 0.046 mAP@all on QuickDraw, which shows the effectiveness of our proposed SEM-PCYC model, owing to the cycle consistency between the sketch, image and semantic spaces, as well as to the compact and discriminative side information.
In general, the main challenge in the TU-Berlin dataset is the large number of visually similar and overlapping classes. In the QuickDraw dataset, on the other hand, there is a large domain gap that was intentionally introduced for designing future realistic models; the ambiguity in annotation, e.g. non-professional sketches, is another major challenge in this dataset. Although our results are encouraging, in that they show that cycle consistency helps the zero-shot SBIR task and that our model sets the new state-of-the-art in this domain, we hope that our work will encourage further research towards improving these results.
Finally, the PR-curves of SEM-PCYC and the considered baselines on Sketchy, TU-Berlin and QuickDraw are respectively shown in Fig. 3(a)-(c), where the precision-recall curves corresponding to our SEM-PCYC model (dark blue line) always lie above those of the other methods. This indicates that our proposed model consistently exhibits superiority on all three datasets, which clearly shows the benefit of our proposal.
Generalized Zero-Shot Sketch-based Image Retrieval. We conducted experiments in the generalized ZS-SBIR setting, where the search space contains both seen and unseen classes. This task is significantly more challenging than ZS-SBIR, as the seen classes distract the test queries. Our results in Table 1 show that our model significantly outperforms both existing models [58,30], thanks to our cross-modal adversarial mechanism and heterogeneous side information.

Fig. 4 Inter-class similarity (swan, duck, owl, penguin, standing bird) in the TU-Berlin dataset illustrates the challenge of the task.
Qualitative Results. We analyze the retrieval performance of our proposed model qualitatively in Fig. 5, Fig. 6 and Fig. 7. Some notable examples are as follows. A sketch query of a tank retrieves some examples of motorcycle, probably because both have wheels (row 1 of Fig. 5). A similar explanation holds for car and motorcycle (row 1 of Fig. 7). Owing to visual and semantic similarity, sketching a guitar retrieves some violins (row 2 of Fig. 5); the same can be observed for train and van in row 2 of Fig. 7.
Owing to visual and semantic similarity, querying bear retrieves some squirrels (row 3 of Fig. 5). Querying objects with wheels (e.g., wheelchair, motorcycle) sometimes wrongly retrieves other vehicles, probably because they have wheels, and hence shape, in common (row 6 of Fig. 5). Querying castle retrieves images with a large portion of sky (row 2 of Fig. 6), because images of its semantically similar classes, such as skyscraper and church, are mostly captured with sky in the background. A similar phenomenon can be observed for tree and electrical post in row 5 of Fig. 7. Querying duck retrieves images of swan or shark (row 4 of Fig. 6), probably because they share a watery background. A sketch of a pickup truck retrieves some images from the traffic light class that contain a truck-like object in the scene (row 3 of Fig. 6). Sketching bookshelf retrieves some examples of cabinet owing to significant visual and semantic similarity (row 5 of Fig. 6).
Too much abstraction in sketches can also produce wrong retrieval results. For example, in row 3 of Fig. 7, it is difficult to tell whether the sketch depicts the Eiffel tower, another tower, or a hill. Furthermore, we have observed certain ambiguities in the image annotations of the QuickDraw dataset. The images are often complex and contain two or more objects, while most currently available SBIR datasets provide single-object annotations that ignore objects in the background. For example, in row 6 of Fig. 7, many of the wrongly retrieved images do contain a flower, whereas some of them are annotated as tower or tree. Additionally, as the images in the QuickDraw dataset are collected from Flickr, they contain many subsequent captures that can be confused as identical frames. Hence, although some retrievals on the QuickDraw dataset appear identical, they differ in their actual pixel values.
In general, we observe that the wrongly retrieved candidates mostly have a close visual and semantic relevance to the queried ones. This effect is more prominent in the TU-Berlin dataset, which may be due to the inter-class similarity of its sketches. As shown in Fig. 4, the classes swan, duck and owl, penguin have substantial visual similarity, and all of them are standing birds, which is itself a separate class of the same dataset. Therefore, for the TU-Berlin dataset, it is challenging to generalize to the unseen classes from the learned representation of the seen classes.
Effect of Side Information. In zero-shot learning, side information is as important as the visual information, as it is the only means by which the model can discover similarities between classes. Since the type of side information strongly affects the performance of any method, we analyze its effect and present zero-shot SBIR results with different kinds of side information and their combinations. We compare GloVe [47], Word2Vec [40] and FastText [28] as text-based models, and three similarity measures, i.e. path, Lin [35] and Jiang-Conrath [27], for constructing three different kinds of side information based on the WordNet hierarchy. Table 2 reports the quantitative results on the Sketchy, TU-Berlin and QuickDraw datasets with the above side information and their combinations, where we set M = 32, 64, 128. We observe that in the majority of cases, combining different side information increases the performance by 1% to 3%.
On Sketchy, the combination of Word2Vec and Jiang-Conrath hierarchical similarity, as well as that of FastText and path similarity, reach the highest mAP of 0.349 with a 64d embedding. On TU-Berlin, besides the combination of Word2Vec and path similarity, FastText and path similarity lead with 0.297 mAP at 64d, and on QuickDraw the combination of GloVe and Lin hierarchical similarity reaches 0.177 at 64d. We conclude from these experiments that text-based and hierarchy-based class embeddings are indeed complementary.
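To make the combination concrete, the following toy sketch builds one side-information vector by concatenating a text embedding with a vector of WordNet-style path similarities to every class. The hypernym paths, class names and embedding values are hypothetical; the real model uses the actual WordNet taxonomy and pre-trained GloVe/Word2Vec/FastText vectors, and compresses the result with an auto-encoder.

```python
def path_similarity(path_a, path_b):
    """Toy WordNet-style path similarity between two classes, given
    their root-to-leaf hypernym paths: 1 / (shortest path length + 1)."""
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    # Path length via the lowest common ancestor of the two nodes.
    distance = (len(path_a) - common) + (len(path_b) - common)
    return 1.0 / (distance + 1)

def side_information(cls, word_vectors, hypernym_paths, classes):
    """Concatenate a text embedding with a vector of hierarchical
    similarities to every class, prior to auto-encoder compression."""
    sims = [path_similarity(hypernym_paths[cls], hypernym_paths[c])
            for c in classes]
    return list(word_vectors[cls]) + sims
```

The text block and the hierarchy block capture different notions of class relatedness, which is consistent with the complementarity observed in Table 2.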
Effect of Visual Features. Visual features are also crucial for the zero-shot SBIR task. To gain some insight, in addition to the VGG-16 [59] features obtained before the last fc layer, we also consider SE-ResNet-50 [24,22] features, and perform zero-shot SBIR experiments on the Sketchy, TU-Berlin and QuickDraw datasets with the different semantic models mentioned above. In Table 3, we present the mAP@all values obtained with the considered visual features and semantic models, where we observe that SE-ResNet-50 features work consistently better than VGG-16 on all three datasets. The performance gain on the challenging TU-Berlin dataset is especially noteworthy, which we attribute to the feature recalibration strategy of the SE blocks, which produces robust features that minimize inter-class confusion such as that illustrated in Fig. 4.
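The recalibration idea of the SE block (Hu et al. [24]) can be illustrated with a minimal pure-Python sketch: each channel is squeezed to its global average, excited through two small fully connected layers, and rescaled by the resulting gate. The weights `w1` and `w2` here are assumed toy parameters; the actual blocks are learned end-to-end inside SE-ResNet-50.

```python
import math

def se_recalibrate(channels, w1, w2):
    """Squeeze-and-Excitation recalibration: squeeze each channel to
    its global average, excite through two fully connected layers
    (ReLU, then sigmoid), and rescale each channel by its gate."""
    z = [sum(c) / len(c) for c in channels]                # squeeze
    h = [max(0.0, sum(w * x for w, x in zip(row, z)))      # FC + ReLU
         for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, h))))
         for row in w2]                                    # FC + sigmoid
    return [[g * x for x in c] for g, c in zip(s, channels)]  # rescale
```

The gates suppress uninformative channels, which is one plausible reason for the robustness to inter-class confusion observed on TU-Berlin.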
Model Ablations. The baselines of our ablation study are built by modifying parts of the SEM-PCYC model to analyze the effect of its different losses. First, we train the model only with the adversarial loss, and then add the cycle-consistency and classification losses in turn. Second, we train our model after withdrawing only the adversarial loss for the semantic domain, which indicates the effect of side information in our case. We also train the model without the side-information selection mechanism, i.e. taking the original text or hierarchical embedding, or their combination, as side information, which gives an idea of the advantage of selecting side information via the auto-encoder. Next, we experiment with reducing the dimensionality of the class embedding to a percentage of the full dimensionality. Finally, to demonstrate the effectiveness of the regularizer used in the auto-encoder for selecting discriminative side information, we experiment with setting λ = 0 in eqn. (2).
The mAP@all values obtained by the respective baselines are shown in Table 4. We consider the best side-information setting for each dataset according to Table 2. The assessed baselines typically underperform the full SEM-PCYC model. With only the adversarial losses, the performance of our system drops significantly. We suspect that although adversarial training alone maps the sketch and image inputs to a semantic space, there is no guarantee that sketch-image pairs of the same category are matched. This is because adversarial training only ensures that the mapping of the input modality to the target modality matches its empirical distribution [87], but does not guarantee that an individual input and output are paired up.
Imposing the cycle-consistency constraint ensures a one-to-one correspondence of sketch-image categories. However, the performance of our system does not improve substantially when the model is trained with both the adversarial and cycle-consistency losses. We speculate that this is due to a lack of inter-category discriminating power in the learned embedding functions; to address it, we add a classification criterion to train discriminative cross-modal embedding functions. We further observe that imposing only the classification criterion together with the adversarial loss does not improve the retrieval results either. We conjecture that in this case the learned embedding may be very discriminative, but the two modalities might be matched in the wrong way. Hence, we conclude that all three losses are complementary to each other and absolutely essential for effective zero-shot SBIR.
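The interplay of the three terms can be sketched as follows. The L1 cycle term and the equal weighting of the losses are illustrative assumptions, not the exact objective of the paper; they only show how a per-sample reconstruction penalty complements a distribution-level adversarial loss.

```python
def l1_cycle_loss(x, x_cycled):
    """Cycle-consistency term: mean absolute error between an input
    feature vector and its reconstruction after a round trip through
    the semantic space, enforcing per-sample correspondence."""
    return sum(abs(a - b) for a, b in zip(x, x_cycled)) / len(x)

def total_loss(l_adv, l_cyc, l_cls, lam_cyc=1.0, lam_cls=1.0):
    """Full objective as a weighted sum of the three complementary
    terms discussed above (weights are illustrative)."""
    return l_adv + lam_cyc * l_cyc + lam_cls * l_cls
```

A perfect round trip drives the cycle term to zero, while the adversarial and classification terms keep the semantic embeddings distribution-matched and discriminative, respectively.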
Next, we analyze the effect of side information and notice that without the adversarial loss for the semantic domain, our model performs better than the three previously mentioned configurations, but does not come close to the full model. This is because without the semantic mapping, the resulting embeddings are not semantically related to each other, which does not help cross-modal retrieval in the zero-shot scenario. We further observe that without the encoded, compact side information, we achieve a better mAP@all at the cost of retrieval time, as the original dimensionality of the considered side information (354 + 300 = 654d for Sketchy, 664 + 300 = 964d for TU-Berlin and 344 + 300 = 644d for QuickDraw) is much higher than that of the encoded one (64d). We further investigate by reducing the dimensionality to a percentage of the original (see Fig. 3(c)), and observe that removing a small part (mostly 5% to 30%) usually leads to better performance, which reveals that not all of the side information is necessary for effective zero-shot SBIR and some of it is even harmful. Indeed, the dimensions removed first have low information content and can be regarded as noise.
We also observe that removing more side information (beyond 20% to 40%) deteriorates the performance of the system, which is justifiable because the compression mechanism of the auto-encoder progressively removes increasingly important side information. However, even with highly compressed side information, our model offers a very good trade-off between performance and retrieval time.
Finally, without the regularizer in eqn. (2), our system performs reasonably, but the mAP@all value is still lower than the best obtained performance. We attribute this to the benefit of the ℓ2,1-norm based regularizer, which effectively selects representative side information.
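The regularizer in question is the ℓ2,1-norm, which is straightforward to compute; the sketch below is generic and independent of the model's implementation.

```python
import math

def l21_norm(W):
    """ℓ2,1-norm of a weight matrix: the sum of the ℓ2 norms of its
    rows. As a regularizer it drives entire rows to zero, so the
    auto-encoder effectively discards whole side-information
    dimensions rather than individual weights."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in W)
```

This row-wise structure is what makes the penalty a feature selector: a zeroed row removes one input dimension from every encoded output.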

(Generalized) Few-Shot Sketch-based Image Retrieval
For the few-shot scenario, we start with the model pre-trained in the zero-shot setting, and then fine-tune it using a few example images, i.e. k shots, from the "novel" classes. To fine-tune the model in the k-shot setting, we take k different sketch and image instances from each of the unseen classes and cross-combine them according to the coarse-grained and fine-grained settings. The performance is evaluated on the remaining instances of each class at test time.
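The pair construction described above can be sketched as follows. The function name and data layout (dicts mapping class names to instance lists) are our own illustrative choices; the distinction it encodes is that the coarse-grained setting cross-combines all k x k same-class pairs, while the fine-grained setting keeps only aligned sketch-image pairs.

```python
from itertools import product

def kshot_pairs(sketches, images, k, fine_grained=False):
    """Build fine-tuning pairs from k sketch and k image instances
    per novel class. Coarse-grained: all k*k same-class pairs;
    fine-grained: only the k aligned sketch-image pairs."""
    pairs = []
    for cls in sketches:
        s, im = sketches[cls][:k], images[cls][:k]
        pairs += list(zip(s, im)) if fine_grained else list(product(s, im))
    return pairs
```

Instances not selected for these pairs remain in the test pool, matching the evaluation protocol above.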
Few-Shot Sketch-based Image Retrieval. Fig. 8(a)-(c) present the few-shot SBIR performance of our SEM-PCYC model together with ZSIH [58] and ZS-SBIR [30] on the Sketchy, TU-Berlin and QuickDraw datasets, respectively. All these plots show that the considered methods improve consistently as k increases; however, this growth slowly saturates after k = 10. Here too, our proposed SEM-PCYC model consistently outperforms the prior works, which clearly demonstrates the superiority of our proposal.
Generalized Few-Shot Sketch-based Image Retrieval. We also tested our few-shot model in the generalized scenario, where the search space at test time includes both the seen and novel classes. This setting typically poses a remarkably challenging scenario, as the seen classes may create significant confusion for the novel queries. However, the generalized setting is more realistic, as it allows querying the system with a sketch from any class. Here as well, we considered ZSIH [58] and ZS-SBIR [30] as two benchmark methods and trained them with the same experimental settings as ours. In FS-SBIR, the generalized results follow the non-generalized ones quite closely (see Fig. 8(d)-(f)), which indicates that the generalization abilities of the different models converge. In this setting as well, our proposed model steadily surpassed both benchmark models, which indicates its advantage.

Table 4 Ablation study on our SEM-PCYC model (64d) on three datasets (measured with mAP@all).

Qualitative Results. Fig. 9, Fig. 10 and Fig. 11 present a selection of qualitative results obtained by our SEM-PCYC model on the Sketchy, TU-Berlin and QuickDraw datasets, respectively, with an increasing number of shots, showing the evolution of model performance as k increases (k = 0, 1, 5, 10) for classes where the 0-shot results are weak. From these results, we can see that sometimes a single unseen example is sufficient to correctly retrieve images (row 3 of Fig. 9, row 5 of Fig. 10 and row 5 of Fig. 11), whereas sometimes more examples are needed (rows 2 and 5 of Fig. 9, rows 2, 3 and 4 of Fig. 10 and rows 2, 3 and 4 of Fig. 11) to resolve the confusion with other similar classes. This uncertainty may come from either visual or semantic similarity. As expected, increasing the number of examples also improves the performance.
Model Ablations. Similar to the zero-shot setting, we perform an ablation study for the few-shot scenario, considering the same model baselines as in Table 4. The mAP@all values obtained by these baselines in the 5-shot scenario are shown in Table 5. Here, all the baselines achieve much better performance than their zero-shot counterparts, which is expected since the model is already trained in the zero-shot setting, and a few examples from the novel classes provide some gain with any combination of losses. We observe that the first three configurations (first three rows of Table 5) perform quite similarly across all three datasets, with no prominent difference among them. However, the baselines with more criteria or losses (bottom three rows of Table 5) achieve much better performance than the first three. Among these, we have not found much difference between the variants that do and do not use side information. This is because the pre-trained zero-shot model already carries knowledge of the side information, so training with it again is somewhat redundant.

Table 5 Ablation study in the few-shot setting of our SEM-PCYC model (64d) on three datasets (measured with mAP@all).
Fine-Grained Settings. We further evaluated our model in the fine-grained setting, where the task is to find the specific object image corresponding to a drawn sketch, and we combined it with the above-mentioned variations of the k-shot scenario. For this experiment, we only considered the Sketchy dataset, as it is the only corpus containing aligned sketch-image pairs, which are often used for fine-grained SBIR evaluation. We did not consider other fine-grained datasets, such as shoes or chairs [62], as they do not contain the class information that we need for the semantic space mapping. For this setting as well, we considered ZSIH [58] and ZS-SBIR [30] as the two benchmark methods under the same experimental protocol. Fig. 12(a) and Fig. 12(b) show the performance of our model in the fine-grained generalized few-shot setting together with ZSIH [58] and ZS-SBIR [30]. In the fine-grained setting, all the methods performed remarkably poorly. We explain this as a drawback of the semantic space mapping, which maps the visual information from sketch and image to the same neighborhood and ignores fine-grained information. The proposed solution to the low-shot task thus contradicts the notion of the fine-grained problem, and as a consequence the performance of all the considered models deteriorates. In the generalized setting, we observed that all the models performed slightly better. We conjecture that the considered models can memorize the fine-grained information of the training or seen samples, which gives a slight rise in performance in the generalized scenario, as those samples are few in number. Nevertheless, the low-shot fine-grained paradigm is very important for SBIR; we admit that it is an extremely challenging task, which needs substantial research to be solved.

Conclusion
In this paper, we proposed the SEM-PCYC model for the any-shot SBIR task. SEM-PCYC is a semantically aligned paired cycle-consistent generative adversarial network, each branch of which maps either a sketch or an image to a common semantic space via adversarial training with a shared discriminator. Thanks to cycle consistency on both branches, our model does not require aligned sketch-image pairs; moreover, the cycle consistency acts as a regularizer in the adversarial training. The classification losses on the generators guarantee that the features are discriminative. We show that combining heterogeneous side information through an auto-encoder, which encodes compact side information useful for adversarial training, is effective. In addition to the model, we introduced (generalized) few-shot SBIR as a new task, which we also combined with the fine-grained setting. We considered three benchmark datasets of varying difficulty and performed an exhaustive evaluation under the above paradigms. Our assessment on these three datasets shows that our model consistently outperforms existing methods in the (generalized) zero- and few-shot, and fine-grained settings. We encourage future work to evaluate sketch-based image retrieval methods in these increasingly challenging and realistic settings.