1 Introduction

Being able to classify, detect or segment objects in images with as little annotated data as possible is one of the most important problems to address for building practical computer vision applications. Zero-shot learning (ZSL) proposes a radical framework in which not a single visual example is used during learning, relying instead on external data from another modality. Typically, the latter are semantic attributes or textual descriptions that can be represented by vectors. The task thus consists in learning a mapping between the image space and the semantic space using images from seen classes, available at training time only. In the original form of ZSL, the images of the test set belong to unseen classes, for which no sample is available at training time. A more realistic “generalized” setting (GZSL) [30] nevertheless proposes to recognize both seen and unseen classes at test time. Another classical distinction is made between the inductive and the transductive setting, the latter allowing the use of test images (without their annotations) at training time, similarly to semi-supervised learning.

Recent approaches to ZSL use generative models to produce visual samples of unseen classes from their semantic descriptions [3, 4, 27, 29]. With such synthetic samples, the unseen classes can then be handled in a classical supervised learning setting. One of the best performing approaches in this vein is f-VAEGAN-D2 [31], which shares the weights of the decoder of a variational autoencoder (VAE) with those of the generator of a generative adversarial network (GAN). Since it is trained in combination with a conditional encoder (and either a conditional or a non-conditional discriminator), it is able to benefit from unlabeled unseen visual samples (transductive setting) to synthesize discriminative image features of unseen classes. It also obtains very good performance in the inductive setting. Narayan et al. [16] proposed TF-VAEGAN, which adds a decoder that reconstructs the semantic prototypes and a feedback loop from this module to the GAN generator to refine the generated features during both the training and feature synthesis stages. The motivation for adding a decoder is that it provides information complementary to the generator, since the latter maps a single prototype to many possible visual instances while the decoder does the opposite. Both pieces of information can be used at test time to create features on which ZSL and GZSL classifiers are learned.

Beyond the usefulness of the feedback loop, one can see TF-VAEGAN as similar to f-VAEGAN-D2 with an additional loss that regularizes the generator. However, this loss essentially addresses a reconstruction task, while ZSL consists first and foremost in discriminating classes. We thus argue that a loss that regularizes in a way that favors class disambiguation would be more relevant. The loss we propose (Section 3) can be integrated into any generative model for ZSL. The idea is to train the generator on ambiguous semantic prototypes, built by mixing the available real ones, and to recognize the corresponding ambiguous classes. This idea may seem similar to the one proposed by [6], who applied mixup [33] to the ZSL task, but it has a crucial difference. When they apply mixup, Chou et al. matches each virtual prototype to a corresponding virtual visual feature. Our approach focuses on recognizing a virtual class, thus its label only. Hence, the regularization forces the generator to synthesize discriminative features from unknown class prototypes, some of them being potentially close to prototypes of the unseen classes (see Section 4.4.2 for a detailed discussion and evaluation with regards to the availability of the seen/unseen prototypes during training). The regularization is learned without visual samples from seen classes, so the generator is not constrained by particular images that may not be relevant. Indeed, mixing two particular images does not usually result in a meaningful image, while mixing semantic descriptions or attributes may make sense (Fig. 1). Therefore, restricting mixup to the semantic space corresponds better to what is expected in the (G)ZSL task. Our approach is computationally less expensive and leads to better results than Chou et al.'s in practice (Section 4). Last, the loss they defined is mainly useful in an inductive setting, while the approach we propose can be even more useful in a transductive one.

Fig. 1

An image of a tiger differs from a mix of an image of a zebra and an image of a fox. In contrast, learning to discriminate a mix of the semantic attributes (or descriptions) of a fox and a zebra makes the proposed model better able to identify a tiger at test time

Our main contribution is a regularization loss that can be applied to any conditional generative-based ZSL model. Integrated into f-VAEGAN-D2 or TF-VAEGAN, it significantly improves their performance on different benchmarks, in both the inductive and transductive settings.

2 State of the art

Early approaches to ZSL relied on attribute prediction [8], ridge regression [21] or triplet loss [7, 9, 10]. We refer to [12] for a more detailed overview of these approaches, as we focus on generative approaches in the following.

In order to address the biased prediction towards seen classes, generative approaches synthesize visual features of unseen classes from their semantic features with generative models such as variational autoencoders (VAEs) or generative adversarial networks (GANs). [29] combines a conditional Wasserstein GAN [2] with a categorization network to generate more discriminative features. Bucher et al. [4] proposed three different conditional generative models to generate features: a Generative Moment Matching Network (GMMN), AC-GAN and a Denoising Auto-Encoder. Other works use conditional VAEs. Arora et al. [3] integrates an attribute regressor and a feedback mechanism into a VAE-based model to generate more discriminative features and ensure that the generated features are semantically close to the distribution of the real features. Schonfeld et al. [22] proposes to align the visual features and the corresponding semantic embeddings in a shared latent space, using two VAEs. Recent works take advantage of both GANs and VAEs by combining them with a shared decoder and generator. Xian et al. [31] proposes a VAEGAN-based model that leverages the unlabeled instances under the transductive setting via an additional unconditional discriminator. Similarly to the idea proposed by Arora et al. [3], Narayan et al. [16] augments the f-VAEGAN-D2 method with a semantic embedding decoder and a feedback mechanism to enforce semantic consistency and improve feature synthesis. In this work, we also propose to enrich a conditional VAEGAN-based ZSL method with an auxiliary task, while focusing on another aspect, more related to the ability to discriminate classes, namely reducing ambiguities among categories. Such a goal can be useful beyond zero-shot learning, for tasks that aim at relating ambiguous visual and semantic information such as multimodal entity linking [1] and retrieval [15, 34], cross-modal retrieval [5, 24, 25] or classification [26].

To alleviate the domain shift problem, transductive ZSL methods leverage the unlabeled unseen-class data at training time. Xian et al. [30] and Ye and Guo [32] proposed to use graph-based label propagation, while [27] uses an Expectation-Maximization (EM) strategy in which pseudo-labeled unseen-class examples are used to update the parameter estimates of the unseen class distributions. Generative models were also applied to transductive ZSL. Paul et al. [19] leverages a Wasserstein GAN [2] to synthesize the unseen domain distribution by minimizing the marginal difference between the true latent space representation of the unlabeled samples of unseen classes and the synthesized space. f-VAEGAN-D2 and TF-VAEGAN can also be applied in the transductive setting.

Since its advent, interpolation-based regularization has been shown to be a surprisingly effective way to improve generalization and robustness in both supervised and semi-supervised settings. Zhang et al. [33] proposed mixup, a data augmentation technique for image classification that creates virtual training examples as convex combinations of pairs of visual data samples and of their corresponding labels. This simple approach has proven to be an effective model regularizer that favors linear behavior in-between training examples. Recently, [6] applied mixup to zero-shot learning. Similarly to our method, they interpolate both the visual samples and the semantic prototypes. However, unlike our approach, they use mixup as a direct data augmentation approach, while we apply the interpolation in the conditional space of a generative ZSL model and propose a specific regularization loss in the semantic space. More specifically, we train a conditional generative ZSL model to recognize virtual ambiguous classes. The generator synthesizes features from the corresponding ambiguous class prototypes, which are then used to perform the classification task. In practice, the difference between the linear interpolation we propose and the usual mixup setting used by [6] is reflected by the mixing proportion that leads to the best performances. Indeed, as a data augmentation approach, mixup usually performs best with a mixing proportion close to either 0 or 1, making the new virtual samples quite close to the original ones. In contrast, we obtain the best performances with a mixing proportion close to 0.5, making the new virtual classes completely distinct from the real ones. These classes differ from the actual unseen classes used at test time, but allow the generator to be regularized in some ‘empty’ parts of the semantic space.

3 Method

3.1 Problem setting and notation

Let us consider a set of images \(X = \{x_{1},\dots,x_{l},x_{l+1},\dots,x_{t}\}\) encoded in the image feature space \(\mathcal {X} = \mathbb {R}^{d}\), and two disjoint sets of class labels: a seen class label set \(\mathcal {Y}^{s}\) and an unseen one \(\mathcal {Y}^{u}\). The set of class prototypes is denoted \(C = \{ c(y) \mid y \in \mathcal {Y}^{s} \cup \mathcal {Y}^{u}, c(y) \in \mathcal {C} \}\). Usually, c(y) is a vector of binary attributes, but it may be a word embedding when one wants to describe a large set of classes [7, 11, 13]. The first l samples \(x_{s}\), with \(s \leq l\), are labeled samples from seen classes \(y^{s} \in \mathcal {Y}^{s}\), and the remaining samples \(x_{u}\), with \(l + 1 \leq u \leq t\), are unlabeled data from novel classes \(y^{u} \in \mathcal {Y}^{u}\). In the inductive setting, the training set contains only labeled examples from seen classes, together with the semantic information about both seen and unseen classes. In the transductive setting, the training set contains both labeled (seen classes) and unlabeled (unseen classes) data samples. In fact, there is an ambiguity in the definition of the transductive setting in the literature, as more than one definition coexists. Indeed, [12] defines the class-transductive setting, in which class prototypes of both seen and unseen classes are available during the training phase, and the instance-transductive setting, in which both prototypes and unlabeled images from unseen classes are available. The class-transductive setting is sometimes referred to as inductive, as the authors consider that the unseen prototypes need to be available to generate the unseen visual samples used to learn the classifiers. In this paper, we speak of the “transductive” setting when unseen prototypes are used for any purpose other than generating unseen visual features.

In zero-shot learning, the goal is to predict the label of images that belong to unseen classes, i.e. \(f_{z s l}: \mathcal {X} \rightarrow \mathcal {Y}^{u}\), while in the GZSL scenario, the task is to predict labels of images that can belong to either seen or unseen classes, i.e. \(f_{g z s l}: \mathcal {X} \rightarrow \mathcal {Y}^{s} \cup \mathcal {Y}^{u}\), with f a compatibility function that computes the likelihood of an image belonging to a class.

3.2 Learning semantic ambiguities

We consider a generative approach to ZSL, for which a conditional model G(.) is trained to minimize a loss \({\mathscr{L}}\), using visual samples \(x^{s}\) and prototypes \(c(y^{s})\) from seen classes \(y^{s}\). At test time, it is able to generate visual samples \(x^{u}\) from a prototype \(c(y^{u})\) of an unseen class \(y^{u}\), which are used to train a classifier in a fully supervised fashion. Applying mixup to this setting, as in [6], consists in augmenting the training data with virtual pairs:

$$ \begin{array}{@{}rcl@{}} \tilde{c} & = & \lambda c({y^{s}_{i}})+(1-\lambda) c({y^{s}_{j}}) \\ \tilde{x} & = & \lambda {x^{s}_{i}}+(1-\lambda) {x^{s}_{j}} \end{array} $$
(1)

where λ is a hyperparameter to be determined and \(((x^{s}_{i},y^{s}_{i}),(x^{s}_{j},y^{s}_{j}))\) is a randomly selected pair of annotated samples from seen classes. As explained above, we argue that using the visual samples biases the generator towards seen classes. Such a bias was identified for earlier ZSL approaches by [30] and led to the definition of generalized zero-shot learning (GZSL), which is the most common and challenging setting in the literature.
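As an illustration, here is a minimal PyTorch-style sketch of this visual/semantic mixup (1), assuming a batch `x_s` of seen visual features and the matching batch `c_s` of their class prototypes (variable names are ours, not those of [6]):

```python
import torch

def mixup_pairs(x_s, c_s, lam=0.5):
    """Eq. (1): build virtual pairs by mixing the visual features and the
    prototypes of two randomly paired seen-class samples with proportion lam."""
    perm = torch.randperm(x_s.size(0), device=x_s.device)  # random pairing (i, j)
    c_tilde = lam * c_s + (1 - lam) * c_s[perm]             # mixed prototypes
    x_tilde = lam * x_s + (1 - lam) * x_s[perm]             # mixed visual features
    return x_tilde, c_tilde
```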

We thus adopt a different strategy, focusing on learning the ambiguities in the semantic space only. Indeed, as illustrated in Fig. 1, mixing two particular images does not usually make sense, because of strong inconsistencies at the pixel level. In contrast, mixing semantic information may make sense. It is, after all, at the origin of a large bestiary in fantasy literature, science fiction and heroic fantasy: a unicorn is described semantically as a horse with a horn, and no picture of such a creature has been taken to date, although many artists have proposed visual representations of it. However, we do not expect the generator to produce features corresponding to a plausible representation, but rather features that emphasize the differences between classes, in order to better distinguish them. Hence, we propose to regularize the model with such a constraint only, while remaining independent of any particular existing visual representation. In practice, we create ambiguous classes as a linear interpolation of pairs of real semantic prototypes and of their labels:

$$ \begin{array}{@{}rcl@{}} \tilde{c} & = & \lambda c(y_{i})+(1-\lambda) c(y_{j})\\ \tilde{y} & = & \lambda y_{i}+(1-\lambda) y_{j} \end{array} $$
(2)

The hyperparameter λ can be a fixed value or, more generally, a random variable \(\lambda \sim {\Lambda }\). During the learning phase, the generator G synthesizes a feature \(\hat {x} \in \mathcal {X}\) from a latent code \(z \sim \mathcal {N}(0,1)\) conditioned on the ambiguous class prototype \(\tilde {c} \in \mathcal {C}\). This synthesized feature is then fed to a classifier f, leading to the proposed regularization loss:

$$ \mathcal{L}_{I} = \mathbb{E}_{z,\lambda} [ l(f (\hat{x}),\tilde{y})] $$
(3)

where l is the cross-entropy between the prediction for the input \(\hat {x}=G([z;\tilde {c}])\) and the target \(\tilde {y}\). With this generic formulation, the proposed regularization can be applied to a large number of generative models. In the following, we integrate it into f-VAEGAN-D2 [31], as illustrated in Fig. 2, by adding \({\mathscr{L}}_{I}\) to its losses. In that case, the total loss to minimize is \({\mathscr{L}}={\mathscr{L}}_{BCE}+\gamma {\mathscr{L}}_{WGAN}^{s}+{\mathscr{L}}_{WGAN}^{u}+{\mathscr{L}}_{I}\), with γ a hyperparameter (see Section 4.2). It can nevertheless be added to any generative-based ZSL model, and we show how it performs with TF-VAEGAN in Section 4.5.
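As a concrete illustration, here is a minimal PyTorch sketch of the regularization defined by (2) and (3), assuming a conditional generator `G` taking a concatenated [noise; prototype] input and a classifier `f` returning class logits (module interfaces and names are ours, not those of the released code). The soft target \(\tilde{y}\) is implemented as the usual λ-weighted sum of two cross-entropy terms, which is equivalent for one-hot labels:

```python
import torch
import torch.nn.functional as F

def ambiguity_loss(G, f, prototypes, labels, z_dim, lam=0.5):
    """L_I (Eq. 3): make G synthesize discriminative features for virtual
    ambiguous classes built as convex combinations of real prototypes (Eq. 2)."""
    perm = torch.randperm(prototypes.size(0), device=prototypes.device)  # random pairing (i, j)
    c_tilde = lam * prototypes + (1 - lam) * prototypes[perm]            # ambiguous prototypes
    z = torch.randn(prototypes.size(0), z_dim, device=prototypes.device) # latent codes z ~ N(0, 1)
    x_hat = G(torch.cat([z, c_tilde], dim=1))                            # synthesized features
    logits = f(x_hat)
    # cross-entropy with the soft label y_tilde = lam * y_i + (1 - lam) * y_j
    return lam * F.cross_entropy(logits, labels) \
        + (1 - lam) * F.cross_entropy(logits, labels[perm])
```

In the f-VAEGAN-D2 case, this term is simply added to the other losses as described above.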

Fig. 2

Our contribution, highlighted by the red dashed rectangle, integrated into a model similar to f-VAEGAN-D2 [31]. The encoder E takes the real seen features \(x^{s}\) as input and outputs a latent code z, which is then fed together with the embeddings \(c(y^{s})\) to the generator G that synthesizes features \(\tilde {x}_{s}\). The generator G also synthesizes features of unseen classes from the class prototypes \(c(y^{u})\) concatenated with random noise z. The two discriminators D1 and D2 learn to distinguish between real and synthesized features. We introduce a novel task to train the generator G: first, virtual ambiguous classes are constructed as convex combinations of real classes; then, the generator G synthesizes a feature \(\tilde {x}\) from the corresponding ambiguous class prototype concatenated with random noise; finally, the synthesized features \(\tilde {x}\) are used to perform a classification task

4 Experimental evaluation

4.1 Datasets and metrics

We evaluate our method on four datasets commonly used in the ZSL literature, namely Caltech-UCSD Birds 200-2011 (CUB) [28], the SUN Attribute dataset [18], Oxford Flowers (FLO) [17] and Animals with Attributes 2 (AWA2) [30]. Their main characteristics are reported in Table 1.

Table 1 Main characteristics of the datasets used

We applied the evaluation protocol of [30], relying on the “proposed splits” that ensure that none of the test classes appear in ImageNet, since the latter is used to pre-train the visual feature extractor. Performance is reported in terms of average per-class top-1 accuracy (T1) for the ZSL setting, and as the harmonic mean (H) of the average per-class top-1 accuracies on seen (s) and unseen (u) classes for GZSL. Unless otherwise specified, we use 2048-dimensional ResNet-101 features as visual embeddings for all datasets. As class semantic prototypes for CUB and FLO, we adopt the 1024-dimensional sentence embeddings generated from fine-grained visual descriptions by a character-based CNN-RNN model [20]. For AWA2, the binary attributes relate to, e.g., animal species (“fish, bird, plankton”), color (“black, brown, blue”), behaviour (“hibernate, timid, slow”) and other features. For SUN, they rather relate to functions/affordances, materials, spatial envelope and surface properties. We compare our method to TF-VAEGAN [16], f-VAEGAN-D2 [31], CLSWGAN [29] and LisGAN [14].

To compare methods over several benchmarks and estimate their aggregated merit, we adopt the median normalized relative gain (mNRG) [23]; such a comparison can indeed be biased when a simple average over different benchmarks is used. mNRG exhibits several interesting properties such as independence to outlier scores, coherent aggregation and time consistency. Its main drawback is that a reference method has to be chosen, from which the performance of each method is measured according to a unique aggregated score, which may be negative if the method performs globally worse. In our case, we choose CLSWGAN [29] in the inductive setting as reference. For the comparison with fine-tuned features, we use f-VAEGAN-D2 in the inductive setting as reference. We compute the mNRG by aggregating the accuracy for ZSL and the harmonic mean H of seen and unseen accuracies for GZSL. By definition, the score of the reference is 0; if mNRG < 0, the method performs globally worse than the reference over all datasets.
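For concreteness, a small sketch of the aggregation is given below, under the assumption (ours; see [23] for the exact definition) that the per-dataset normalized relative gain is (score − reference) / (100 − reference), expressed in percent, and that the aggregated score is the median of these gains:

```python
import statistics

def mnrg(scores, ref_scores):
    """Median normalized relative gain over several benchmarks.
    `scores` and `ref_scores` map dataset name -> accuracy (or H), in [0, 100].
    The NRG formula below is an assumption; refer to [23] for the exact one."""
    gains = [
        100.0 * (scores[d] - ref_scores[d]) / (100.0 - ref_scores[d])
        for d in ref_scores
    ]
    return statistics.median(gains)
```

With this convention, the reference method scores 0 by construction and a negative value means the method performs globally worse than the reference, consistently with the definition above.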

4.2 Implementation details

The generator G and discriminators D1 and D2 are implemented as two-layer fully connected networks with 4096 hidden units. The generator is updated every 5 discriminator iterations [2]. The function f used in (3) is implemented as a two-layer fully connected network that takes a synthesized feature of size d = 2048 as input, has a hidden layer of size 4096 and outputs a probability distribution over all classes of interest. We use LeakyReLU activations everywhere, except at the output of G, where a sigmoid non-linearity is applied before the binary cross-entropy loss \({\mathscr{L}}_{BCE}\). The ZSL and GZSL classifiers are implemented as single-layer perceptrons of size 2048, trained for 20 epochs. We use the Adam optimizer with a learning rate of 0.0001. Our (PyTorch) code is based on the one of [16] and is available at https://github.com/hanouticelina/lsa-zsl. We determined that the hyperparameters γ = 10 and the gradient penalty of the WGAN loss λWGAN = 10 allowed us to obtain performances similar to those reported in [16, 31], although they sometimes differ from the hyperparameter values reported in these papers. Using the code of [16], it is possible to reproduce their experiments and those of [31] for the inductive setting only; our code allows reproducing the experiments under the transductive setting as well.
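For reference, here is a minimal PyTorch sketch of the generator G and of the classifier f of (3) with the dimensions above; the attribute, noise and class dimensions, the LeakyReLU slope and the module layout are placeholders of ours, not those of the released code:

```python
import torch.nn as nn

d, att_dim, z_dim, n_cls = 2048, 312, 312, 200   # placeholder dimensions

# Generator: [noise; prototype] -> visual feature, sigmoid output before L_BCE
G = nn.Sequential(
    nn.Linear(z_dim + att_dim, 4096), nn.LeakyReLU(0.2),
    nn.Linear(4096, d), nn.Sigmoid(),
)

# Classifier f of Eq. (3): synthesized feature -> class logits
# (the softmax producing the class distribution is applied inside the cross-entropy)
f = nn.Sequential(
    nn.Linear(d, 4096), nn.LeakyReLU(0.2),
    nn.Linear(4096, n_cls),
)
```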

4.3 State-of-the-art comparison

Tables 2 and 3 show the comparison to the state of the art. In the inductive ZSL setting, our model performs globally better than all other methods, with the highest mNRG score. It also achieves the best score on CUB and SUN. In the transductive ZSL setting, our approach obtains an mNRG score of 22.8, establishing a new transductive ZSL state of the art on CUB, SUN and AWA2. The comparison to [6] is particular since they only report results on three of the considered datasets. Without fine-tuning, their results are far above the other methods on AWA2 but also far below on CUB and SUN.

Table 2 State-of-the-art accuracy comparison for ZSL on the “proposed split” of [30]; both inductive (IN) and transductive (TR) results are shown
Table 3 State-of-the-art accuracy comparison for GZSL on the “proposed split” of [30]; both inductive (IN) and transductive (TR) results are shown

Unsurprisingly, in the GZSL setting, feature-generating approaches obtain better results than the others. We also note that the accuracy on unseen classes (u) and the one on seen classes (s) are better balanced. Our model outperforms the existing methods in both the inductive and transductive GZSL settings. In particular, in the inductive GZSL setting, our model obtains 67.2% on CUB, significantly improving upon the previously obtained results (56.9%). By reducing the bias towards seen classes, we globally achieve better performance on unseen classes. However, the scores on seen classes may slightly decrease, in particular in the inductive setting; this is nevertheless compensated by the gain on the unseen classes.

We also conducted experiments with fine-tuned features, using the same features as those of [16, 31]. To compute the global mNRG score for this experiment, we used f-VAEGAN-D2 in the transductive setting as a baseline. In the ZSL setting, the results of TF-VAEGAN are globally better than ours in the transductive setting, both being significantly above f-VAEGAN-D2. However, in the inductive setting, the results of TF-VAEGAN are below the baseline while ours are still slightly above.

In the GZSL setting, TF-VAEGAN still has a lower mNRG score than the f-VAEGAN-D2 baseline in the inductive setting and a comparable score in the transductive one. Our approach obtains performances in line with f-VAEGAN-D2 in the inductive setting, but significantly outperforms the two other approaches in the transductive setting when compared over the four datasets.

4.4 Ablation study

4.4.1 Influence of the mixing proportion

In this section, we perform an ablation study on the four ZSL datasets. We evaluate our model with different values of the random mixing proportion λ (results are shown in Table 4). Note that when λ is sampled from a distribution, a new value is drawn for each minibatch.

Table 4 Transductive ZSL and GZSL results with different values of the λ hyperparameter

We found that the best performances across all datasets were obtained with λ = 0.5 or \(\lambda \sim \mathcal {N}(0.5, 0.25)\), i.e. with equal weights for the two terms of the convex combination. Other settings for λ, such as λ = 0.2 or \(\lambda \sim Uniform(0, 1)\), deteriorate the performances (\(\sim \) 1% worse for CUB, SUN and FLO and \(\sim \) 4% worse for AWA2). Even poorer performances were obtained with \(\lambda \sim Beta(0.3, 0.3)\) (\(\sim \) 1% worse for CUB and SUN and \(\sim \) 8% worse for FLO and AWA2).

For image classification [33], the random mixing proportion is sampled from a Beta distribution with a small value of α, under the assumption that the examples in the neighborhood of each data sample share the same class. Indeed, with a small α = 0.3, the Beta distribution samples more values close to either 0 or 1, making the mixing result close to one of the two original examples. However, in our method, we construct ambiguous semantic prototypes whose corresponding ambiguous classes are completely distinct from the real ones. Therefore, sampling λ from a Beta distribution with α < 1 is not a reasonable choice.
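For reference, a short sketch of the λ sampling strategies compared in this ablation (a hypothetical helper of ours; one value is drawn per minibatch, and clamping the Gaussian sample to [0, 1] is our assumption):

```python
import torch

def sample_lambda(strategy="normal"):
    """Draw the mixing proportion used for the current minibatch."""
    if strategy == "fixed":        # lambda = 0.5
        return 0.5
    if strategy == "normal":       # lambda ~ N(0.5, 0.25), among the best settings
        return torch.normal(0.5, 0.25, size=(1,)).clamp(0, 1).item()
    if strategy == "uniform":      # lambda ~ Uniform(0, 1)
        return torch.rand(1).item()
    if strategy == "beta":         # lambda ~ Beta(0.3, 0.3), the usual mixup choice
        return torch.distributions.Beta(0.3, 0.3).sample().item()
    raise ValueError(strategy)
```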

4.4.2 Influence of the subset to learn virtual prototypes

According to the nominal protocol of the proposed method, new “frontier prototypes” are learned by combining prototypes of both seen and unseen categories in the transductive setting. The usage of seen prototypes in (2) regularizes the conditional latent space, so that the unseen prototypes used subsequently result in more discriminative features. If unseen prototypes are used, the space is better regularized in their neighborhood, which is particularly interesting if some unseen prototypes do not lie in the convex envelope of the seen classes. Note nevertheless that we conducted experiments with negative λ without noticeable improvements. To evaluate the respective contributions of seen and unseen prototypes in our model, we compared the performances on CUB while using different subsets of prototypes in (2). For a fair comparison, the results reported in Table 5 were all obtained with f-VAEGAN-D2 using unlabeled images at training time, so that the ‘s+u’ results are the same as the transductive setting in Tables 2 and 3.

Table 5 Performances on CUB according to the prototype training subset used to learn ambiguous prototypes (s = seen, u = unseen)

In the ZSL setting, we obtain the same results whether we use all prototypes or unseen prototypes only. This makes sense since the test images are from the unseen classes only, and there is thus no point in modelling the ambiguities with seen classes. It is also interesting to note that f-VAEGAN-D2 has a score of 74.2 in the transductive setting, while one can reach a score of 79.1 with the regularization learned from seen prototypes only. This shows that most of the improvement is due to the global regularization of the conditional latent space rather than to a local one in the neighborhood of the prototypes used at test time.

In the GZSL setting, the results are better when the regularization is learned with unseen prototypes only rather than with seen ones only, but using both remains above. Looking at the results on the seen and unseen classes specifically, one can see that the results are, as expected, better for the classes that are regularized with (3). The comparison to the results obtained by f-VAEGAN-D2 without the regularization in Table 3 (u = 65.6, s = 68.1, H = 66.8) shows that the regularization is beneficial in any case.

4.5 Integration to TF-VAEGAN

Our contribution is generic and can be used in other conditional generative-based ZSL architectures. We therefore evaluate the generalization capability of our proposed method by integrating our contribution into the TF-VAEGAN [16] framework.

We first learn the model end-to-end with the proposed regularization added. Table 6 shows the comparison on CUB between the original TF-VAEGAN model and the one learned with the proposed regularization (3). Our contribution improves the performance of the vanilla TF-VAEGAN by 1.5 to 3 points for both ZSL and GZSL tasks, in both inductive and transductive settings. Interestingly, in GZSL, one can note that the improvement is mainly due to an increase of the scores on unseen classes, while those on seen classes remain almost identical to the vanilla TF-VAEGAN. This tends to show that our approach reduces the bias towards seen classes in the generalized context.

Table 6 Comparison between the vanilla TF-VAEGAN and the one augmented with our loss (3), on the CUB dataset, for both inductive and transductive ZSL/GZSL settings, when the model is either learned from scratch or fine-tuned (ft)

We conducted an additional experiment consisting in fine-tuning the generator learned by TF-VAEGAN with our method. To prevent the generator from losing the previously learned information, namely the marginal feature distribution, the discriminators D1 and D2 are trained from scratch. We again observe an improvement of the performance for ZSL and GZSL, in both inductive and transductive settings. The scores are nevertheless intermediate between those obtained by the original model and those obtained previously by learning from scratch. In both cases, the GZSL experiments show that most of the score improvement is due to a better recognition of the unseen classes, while the performances on the seen classes are similar to (or slightly below) those of the original model.
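A hypothetical sketch of this fine-tuning protocol is given below (the checkpoint path, the discriminator layout and its dimensions are placeholders of ours, for illustration only): the pre-trained TF-VAEGAN generator is loaded and further optimized with the total loss including \({\mathscr{L}}_{I}\), while the discriminators are re-initialized and trained from scratch.

```python
import torch
import torch.nn as nn

def build_discriminator(in_dim=2048 + 312):
    """Hypothetical two-layer discriminator (dimensions are placeholders)."""
    return nn.Sequential(nn.Linear(in_dim, 4096), nn.LeakyReLU(0.2),
                         nn.Linear(4096, 1))

# Pre-trained generator loaded from an assumed checkpoint path, then fine-tuned;
# D1 and D2 are freshly initialized and trained from scratch.
G = torch.load("tfvaegan_generator.pth")
D1, D2 = build_discriminator(), build_discriminator()
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(list(D1.parameters()) + list(D2.parameters()), lr=1e-4)
```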

5 Conclusion

We propose a novel approach to train a conditional generative-based model for zero-shot learning. The approach improves the discriminative capacity of the synthesized features by training the generator to recognize virtual ambiguous classes. We construct the corresponding ambiguous class prototypes as convex combinations of real class prototypes and then train the generator to recognize these virtual classes. This simple procedure allows the generator to learn the transitions between categories and thus to better distinguish them. Our approach can be integrated into any conditional generative model. Experiments on four benchmark datasets show the effectiveness of our approach for both zero-shot and generalized zero-shot learning. In most cases, the improvement is due to a better recognition of unseen classes while the scores on seen classes are maintained, which means that our approach reduces the bias towards seen classes in GZSL. However, most of the time, the score on seen classes remains higher than the one on unseen classes, showing that the bias still remains to some extent.

The method is limited to creating ambiguous classes from a pair of real classes by linear interpolation. To push our approach further, one could explore non-linear interpolation to construct ambiguous classes, or consider more than two real classes to construct an ambiguous one. Note that the experiments we conducted on (linear) extrapolation did not bring interesting results. Beyond this contribution to zero-shot learning, our approach can also be beneficial to other tasks that aim at relating ambiguous visual and semantic information, such as multimodal entity linking and retrieval, cross-modal retrieval or classification, and more generally to tasks in which a latent space is used for learning data features.