Abstract
In generalized zero-shot learning, datapoints from unseen classes are not available during training. The main challenge is the unbalanced data distribution, which makes it hard for the classifier to distinguish whether a given testing sample comes from a seen or an unseen class. Using a Generative Adversarial Network (GAN) to generate auxiliary datapoints from the semantic embeddings of unseen classes alleviates this problem. Current approaches combine the auxiliary datapoints with the original training data to train the generalized zero-shot learning model and obtain state-of-the-art results. Inspired by such models, we propose to use the generated data in a model selection mechanism. Specifically, we leverage two sources of datapoints (observed and auxiliary) to train a classifier that recognizes which test datapoints come from seen and which from unseen classes. This way, generalized zero-shot learning can be divided into two disjoint classification tasks, thus reducing the negative influence of the unbalanced data distribution. Our evaluations on four publicly available datasets for generalized zero-shot learning show that our model obtains state-of-the-art results.
1 Introduction
In the zero-shot learning task, a classifier is trained with datapoints from seen classes and applied to recognize previously unseen datapoints belonging to unseen classes. The main objective is to leverage knowledge from label embeddings, e.g. attributes, word embeddings or class hierarchy information, to build a universal mapping that can classify unseen datapoints without retraining the system on new unseen classes. Let us denote by \(\mathbf {X}_{tr}\) the training datapoints from seen classes \(C_s\) and by \(\mathbf {X}_{ts}\) the testing datapoints from unseen classes \(C_u\), such that \(C_s \cap C_u = \emptyset \). The model is trained on \(\mathbf {X}_{tr}\) but needs to assign a label \(l \in C_u\) to each datapoint from \(\mathbf {X}_{ts}\). Recently, researchers have argued that standard zero-shot learning protocols are biased towards good results on unseen classes while neglecting performance on seen classes. To address this issue, a generalized zero-shot learning task was proposed in which testing datapoints come from both seen and unseen classes, and the classifier needs to cope well with all classes \(C = C_s \cup C_u\).
It has emerged that most zero-shot learning methods achieve low accuracy under such a protocol because training datapoints come only from the seen classes. In most cases, the strong imbalance of the data distribution makes the classifier assign datapoints from seen classes to unseen classes.
The use of a Generative Adversarial Network (GAN) to generate auxiliary datapoints for unseen classes [1] enables the classifier to be trained on datapoints from both seen and unseen categories. Inspired by such an extension, we found that using the auxiliary and original training data to learn a classifier, e.g. a Support Vector Machine (SVM), can be further improved by treating the classification of original datapoints separately, that is, by decomposing generalized zero-shot learning into two disjoint classification tasks: one classifier dealing with datapoints from seen classes and another classifier dealing with datapoints from unseen classes.
In this paper, we propose to use the auxiliary data of unseen classes generated by GAN together with the original training data to build a model selection approach for generalized zero-shot learning. We refer to our approach as ModelSel and propose its three variants in Sect. 3. We evaluate ModelSel on four standard datasets and demonstrate state-of-the-art results.
2 Related Work
Zero-shot learning is a form of transfer learning. Specifically, it utilizes the knowledge learned on datapoints of seen classes and attribute vectors to generalize and recognize testing datapoints from new classes. The majority of previous zero-shot learning methods use a linear mapping to capture the relation between the feature and attribute vectors. Attribute Label Embedding (ALE) [2] uses the attributes as label embeddings and presents an objective inspired by the structured WSABIE ranking method, which assigns more importance to the top of the ranking list. Embarrassingly Simple Zero-Shot Learning (ESZSL) [3] uses a linear mapping and a simple empirical objective with several regularization terms that penalize the projection of features from the Euclidean space into the attribute space and the projection of attribute vectors back to the Euclidean space. Structured Joint Embedding (SJE) [4] proposes an objective inspired by the structured SVM and applies it to a linear mapping, while [5] proposes new data splits and evaluation protocols to eliminate the overlap between the classes of ImageNet [6] and those of zero-shot learning datasets. Zero-shot Kernel Learning (ZSKL) [7] proposes a non-linear kernel method with weak incoherence constraints that make the columns of the projection matrix weakly incoherent. Feature Generating Networks [1] leverage a conditional Wasserstein Generative Adversarial Network (WGAN) to generate auxiliary datapoints for unseen classes from attribute vectors, followed by training a simple Softmax classifier. SoSN [8] and So-HoT [9] use second-order statistics [10] for similarity learning and domain adaptation.
3 Approach
3.1 Notations
Let us denote seen classes as \(C_s\), unseen classes as \(C_u\). \(\mathbf {X}_{tr}\) denotes original training datapoints, \(\mathbf {X}_{ge}\) are the generated datapoints for unseen classes. Each datapoint is a column vector in one of the above matrices. \(M_{sel}\) is the selector between seen/unseen class, \(M_s\) is the model for \(C_s\), \(M_u\) is the model for \(C_u\), \(M_t\) is a model for \(C_s\cup C_u\). Moreover, \(\varvec{w}_{sel}\), \(b_{sel}\), \(\varvec{W}_s\), \(\mathbf {b}_s\), \(\varvec{W}_u\), \(\mathbf {b}_u\), \(\varvec{W}_t\) and \(\mathbf {b}_t\) are the projection vector/matrices and biases used by our models as detailed below.
3.2 Model Selection Mechanism
In this paper, we propose a mechanism that leverages several classifiers to perform generalized zero-shot learning. Firstly, we label the original datapoints as 1 and auxiliary datapoints as \(-1\) to train \(M_{sel}\), which is a linear SVM classifier.
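The labelling scheme for \(M_{sel}\) can be sketched on toy data as follows. This is only an illustration under stated assumptions: the solver (subgradient descent on the regularized hinge loss), the hyperparameters `lr`, `lam` and `epochs`, the helper names and the two-dimensional features are all hypothetical and are not taken from our experiments.

```python
# Toy sketch of training the seen/unseen selector M_sel as a linear SVM.
# Assumption: plain subgradient descent on the L2-regularized hinge loss;
# the solver, hyperparameters and toy features are illustrative only.

def train_selector(x_orig, x_aux, lr=0.1, lam=0.01, epochs=50):
    """Return (w_sel, b_sel) separating original (+1) from auxiliary (-1) data."""
    data = [(x, 1.0) for x in x_orig] + [(x, -1.0) for x in x_aux]
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:
                # Hinge loss is active: step along the data term and the regularizer.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Margin satisfied: only the regularizer contributes.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def selector_score(w, b, x):
    """Signed score: positive suggests a seen class, negative an unseen class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


x_orig = [[2.0, 2.0], [3.0, 2.5], [2.5, 3.0]]       # stand-ins for original features
x_aux = [[-2.0, -2.0], [-3.0, -2.5], [-2.5, -3.0]]  # stand-ins for GAN-generated features
w_sel, b_sel = train_selector(x_orig, x_aux)
```

In practice an off-the-shelf linear SVM solver would replace `train_selector`; the point of the sketch is only the \(+1\)/\(-1\) labelling of the two data sources.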
Model \(M_s\) is a classifier trained with datapoints from seen classes \(C_s\), model \(M_u\) is trained with auxiliary datapoints from GAN corresponding to unseen classes \(C_u\). Model \(M_t\) is trained for \(C_s\cup C_u\) simultaneously.
\(M_s\), \(M_u\) and \(M_t\) are trained separately via the SoftmaxLog classifier. While we use a single training process, we distinguish three selection models applied at the testing stage. The output of each classifier can be defined as:
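One form consistent with the notation of Sect. 3.1, assuming that every model is linear in the feature vector and writing \(f_{sel}\) (a symbol introduced here purely for illustration) for the selector score, is:

\[
f_{sel}(\mathbf{x}) = \varvec{w}_{sel}^{T}\mathbf{x} + b_{sel}, \qquad
\mathbf{g}_{k}(\mathbf{x}) = \varvec{W}_{k}^{T}\mathbf{x} + \mathbf{b}_{k}, \quad k \in \{s, u, t\}.
\]

Under this reading, \(\mathbf{g}_s(\mathbf{x})\), \(\mathbf{g}_u(\mathbf{x})\) and \(\mathbf{g}_t(\mathbf{x})\) collect the per-class scores of \(M_s\), \(M_u\) and \(M_t\), respectively.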
ModelSel-2Way. The testing mechanism of ModelSel-2Way can be illustrated as follows. Each testing datapoint \(\mathbf {x}\) from the testing set \(\mathbf {X}_{te}\) is first fed into \(M_{sel}\). The role of \(M_{sel}\) is to decide whether \(\mathbf {x}\) belongs to a seen or an unseen class, based on which we select either the \(M_s\) or the \(M_u\) model for the final classification:
Then, the final prediction for \(\mathbf {x}\) becomes:
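A minimal sketch of this hard selection at test time is given below. It assumes that a non-negative selector score routes \(\mathbf{x}\) to \(M_s\) (consistent with labelling original datapoints as 1) and that the prediction is the arg-max of the chosen model's scores; the function names, class names and score values are hypothetical.

```python
# Sketch of ModelSel-2Way inference under assumed conventions:
# a non-negative selector score routes x to M_s, otherwise to M_u.

def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def predict_2way(sel_score, g_s, g_u, seen_labels, unseen_labels):
    """sel_score: scalar output of M_sel for x.
    g_s, g_u: class scores of M_s and M_u for the same x.
    Returns a label from C_s or from C_u, never a mix of both."""
    if sel_score >= 0.0:
        return seen_labels[argmax(g_s)]
    return unseen_labels[argmax(g_u)]


seen_labels = ["horse", "dog"]     # hypothetical C_s
unseen_labels = ["zebra", "wolf"]  # hypothetical C_u
label_a = predict_2way(0.8, [0.2, 1.5], [2.0, 0.1], seen_labels, unseen_labels)
label_b = predict_2way(-0.3, [0.2, 1.5], [2.0, 0.1], seen_labels, unseen_labels)
```

Because only one of the two models is consulted per datapoint, an error of \(M_{sel}\) cannot be recovered by the downstream classifier.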
ModelSel-2Way-SA. We also propose to use the Sigmoid function to generate soft assignment scores from the output of \(M_{sel}\) and to use them as the weights assigned to the outputs of \(M_s\) and \(M_u\). We call this method ModelSel-2Way-SA. The intuition behind this model is that \(M_{sel}\) suffers from quantization errors close to the classification boundary, thus we model the assignment uncertainty in \(M_{sel}\) to reduce these errors. The probabilities that \(\mathbf {x}\) belongs to the seen classes \(C_s\) or to the unseen classes \(C_u\) are denoted \(p_s(\mathbf {x})\) and \(p_u(\mathbf {x}) = 1 - p_s(\mathbf {x})\), respectively, and \(p_s(\mathbf {x})\) is given as (Figs. 1 and 2):
where \(\sigma \) is the parameter to control the slope of the Sigmoid function. Then, the output of ModelSel-2Way-SA is given as:
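The soft assignment can be sketched as follows. Several choices here are assumptions made for illustration: the Sigmoid is applied to \(\sigma\) times the selector score, the outputs of \(M_s\) and \(M_u\) are turned into probabilities with a softmax before weighting, and the prediction is the arg-max over the concatenated weighted scores. The function names and the example values are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(scores):
    """Softmax with the usual max-shift for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_2way_sa(sel_score, g_s, g_u, seen_labels, unseen_labels, sigma=1.0):
    """Weight the outputs of M_s and M_u by soft assignments p_s and p_u = 1 - p_s."""
    p_s = sigmoid(sigma * sel_score)  # assumed form: slope controlled by sigma
    p_u = 1.0 - p_s
    weighted = [p_s * p for p in softmax(g_s)] + [p_u * p for p in softmax(g_u)]
    labels = list(seen_labels) + list(unseen_labels)
    best = max(range(len(weighted)), key=lambda i: weighted[i])
    return labels[best], p_s


seen_labels = ["horse", "dog"]     # hypothetical C_s
unseen_labels = ["zebra", "wolf"]  # hypothetical C_u
# The selector is barely on the 'seen' side, M_s is unsure, M_u is confident.
soft_label, p_seen = predict_2way_sa(0.1, [0.5, 0.6], [4.0, 0.0],
                                     seen_labels, unseen_labels, sigma=1.0)
# A very steep Sigmoid behaves like the hard selection of ModelSel-2Way.
steep_label, _ = predict_2way_sa(0.1, [0.5, 0.6], [4.0, 0.0],
                                 seen_labels, unseen_labels, sigma=50.0)
```

In this toy case the soft weighting lets the confident unseen-class prediction win although the selector score is slightly positive, whereas a steep Sigmoid reproduces the hard decision.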
ModelSel-3Way. For ModelSel-3Way, we additionally use the classifier \(M_t\), which is trained with both original and auxiliary datapoints so that it can classify data from both seen and unseen classes. While its performance is worse than that of \(M_s\) and \(M_u\) in their respective domains, we leverage the output of \(M_t\) as a mask to correct some incorrect predictions from \(M_u\) and \(M_s\). The output of our ModelSel-3Way model, shown in Fig. 3, is defined as follows:
where c, \(o_s\) and \(o_u\) adjust the importance of \(M_t\) and the offsets for \(M_s\) and \(M_u\). Intuitively, close to the classification boundaries, the predictions of \(\mathbf {g}_s(\mathbf {x})\) and \(\mathbf {g}_u(\mathbf {x})\) are replaced by \(\mathbf {g}_t(\mathbf {x})\) in this model.
Figure 4 illustrates the selection of classifiers in our ModelSel-3Way approach. We define N as the total number of testing datapoints, and \(N_s\) and \(N_u\) as the numbers of testing datapoints assigned to seen and unseen classes \(C_s\) and \(C_u\), respectively. The distribution map has the same size as \(\mathbf {g}_t(\mathbf {X})\in \mathbb {R}^{C\times N}\). The light gray color highlights successful predictions from \(\mathbf {g}_s(\mathbf {X}_{tr}) \in \mathbb {R}^{C_s \times N_s}\), while the black color highlights successful predictions from \(\mathbf {g}_u(\mathbf {X}_{te}) \in \mathbb {R}^{C_u \times N_u}\).
4 Experiments
Below we detail the datasets used in our experiments, describe the evaluation protocols and present experimental results that demonstrate the usefulness of our approach.
4.1 Setup
Datasets. We evaluate the proposed models on four datasets. Attribute Pascal and Yahoo (APY) contains 15339 images, 64 attributes and 32 classes. The 20 classes from Pascal VOC are used for training and the 12 classes collected from Yahoo! are used for testing. Animals with Attributes (AWA1) contains 30475 images from 50 classes. Each class is annotated with 85 attributes. The zero-shot learning split of AWA1 uses 40 classes for training and 10 classes for testing. Animals with Attributes 2 (AWA2), proposed by [5], is the updated and open source version of AWA1. It has the same number of classes and attributes and the same train/test split as AWA1. Flower102 (FLO) [11] contains 8189 images from 102 classes.
An evaluation paper [5] proposes novel zero-shot learning splits to eliminate the overlap between the classes in zero-shot datasets and ImageNet [6], and evaluates the most popular zero-shot learning methods. In this paper, we follow the new splits to make a fair comparison with other state-of-the-art methods.
Parameters. We perform mean subtraction and standard deviation normalization on both original and auxiliary datapoints to train \(M_{sel}\), which alleviates the imbalance between the two distributions. For \(M_s\) and \(M_u\), we simply use the original data provided in paper [5] without any preprocessing. Our models use classifiers with the SoftmaxLog objective. We use the Adam solver with mini-batches of size 60; the parameters of Adam are set to \(\beta _1 = 0.9\) and \(\beta _2 = 0.99\). We run the solver for 50 epochs. The learning rate is set to \(1e\!-\!4\). The parameters used by ModelSel-2Way and ModelSel-3Way are chosen via cross-validation.
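The normalization applied before training \(M_{sel}\) can be sketched as a per-feature standardization. The sketch assumes that the statistics are computed per feature dimension over the pooled original and auxiliary datapoints, which is one plausible reading rather than a confirmed detail; the helper names and toy values are hypothetical.

```python
import statistics


def fit_standardizer(rows):
    """Per-feature mean and (population) standard deviation of a list of vectors."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    # Guard against constant features, whose standard deviation is zero.
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return means, stds


def apply_standardizer(rows, means, stds):
    """Subtract the mean and divide by the standard deviation, feature by feature."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


x_orig = [[1.0, 10.0]]  # stand-in for original features
x_aux = [[3.0, 10.0]]   # stand-in for GAN-generated features
means, stds = fit_standardizer(x_orig + x_aux)  # assumption: pooled statistics
x_orig_n = apply_standardizer(x_orig, means, stds)
x_aux_n = apply_standardizer(x_aux, means, stds)
```

The same `means` and `stds` would then be applied to testing datapoints before they are passed to the selector.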
Protocols. For training, all models are trained at once as the training process is the same for each model. To perform testing, we follow the generalized zero-shot learning protocols in [5]. There are two testing splits, one for seen and one for unseen classes. We evaluate both testing splits and collect two per-class mean top-1 accuracies, \(Acc_S\) and \(Acc_U\), as suggested by [5]. We report the harmonic mean of the two results as the final score:
\[H = \frac{2 \cdot Acc_S \cdot Acc_U}{Acc_S + Acc_U}.\]
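The protocol above can be sketched in a few lines. The function names, class names and predictions below are hypothetical toy values, not results from our experiments.

```python
from collections import defaultdict


def per_class_top1(y_true, y_pred):
    """Mean of per-class top-1 accuracies, so every class counts equally."""
    hits, counts = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        counts[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / counts[c] for c in counts) / len(counts)


def harmonic_mean(acc_s, acc_u):
    """Harmonic mean of the seen and unseen accuracies; zero if both are zero."""
    if acc_s + acc_u == 0.0:
        return 0.0
    return 2.0 * acc_s * acc_u / (acc_s + acc_u)


# Toy seen-class split: 'dog' is right 2 out of 3 times, 'horse' 1 out of 1.
acc_s = per_class_top1(["dog", "dog", "dog", "horse"],
                       ["dog", "dog", "cat", "horse"])
# Toy unseen-class split: 'zebra' is right 1 out of 2 times.
acc_u = per_class_top1(["zebra", "zebra"], ["zebra", "horse"])
h = harmonic_mean(acc_s, acc_u)
```

The harmonic mean is high only when both accuracies are high, so a model that ignores unseen classes cannot score well.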
4.2 Evaluations
Figure 5 shows how the classification accuracy varies w.r.t. \(\sigma \) of ModelSel-2Way-SA. It can be seen that the soft assignment score obtained by passing SVM scores via the Sigmoid function helps improve the performance of our model.
Table 1 shows that our models obtain state-of-the-art results on AWA1, AWA2, FLO and APY datasets. Compared to f-CLSWGAN, our ModelSel-3Way achieves a \(2.8\%\) higher accuracy on AWA1, \(3.6\%\) on AWA2 and \(0.8\%\) on FLO. The biggest improvement for ModelSel-2Way-SA is observed on APY, where the accuracy increased from \(20.5\%\) of ZSKL [7] to \(42.3\%\). The above evaluations illustrate that our models can combine predictions on seen and auxiliary datapoints better than current state-of-the-art approaches.
5 Conclusions
In this paper, we have presented three approaches to model selection, which introduce a novel way of leveraging generated datapoints for the generalized zero-shot learning task. Different from [1], our models use original and generated datapoints to train a selector function which distinguishes between classifiers for seen and unseen training datapoints. Our ModelSel variants achieve state-of-the-art results on four publicly available datasets.
References
[1] Xian, Y., Lorenz, T., Schiele, B., Akata, Z.: Feature generating networks for zero-shot learning. In: CVPR (2018)
[2] Akata, Z., Perronnin, F., Harchaoui, Z., Schmid, C.: Label-embedding for attribute-based classification. In: CVPR, pp. 819–826 (2013)
[3] Romera-Paredes, B., Torr, P.: An embarrassingly simple approach to zero-shot learning. In: ICML, pp. 2152–2161 (2015)
[4] Akata, Z., Reed, S., Walter, D., Lee, H., Schiele, B.: Evaluation of output embeddings for fine-grained image classification. In: CVPR, pp. 2927–2936 (2015)
[5] Xian, Y., Schiele, B., Akata, Z.: Zero-shot learning - the good, the bad and the ugly. In: CVPR (2017)
[6] Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV 115(3), 211–252 (2015)
[7] Zhang, H., Koniusz, P.: Zero-shot kernel learning. In: CVPR, pp. 7670–7679 (2018)
[8] Zhang, H., Koniusz, P.: Power normalizing second-order similarity network for few-shot learning. CoRR (2018)
[9] Koniusz, P., Tas, Y., Zhang, H., Harandi, M., Porikli, F., Zhang, R.: Museum exhibit identification challenge for the supervised domain adaptation and beyond. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11220, pp. 815–833. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01270-0_48
[10] Koniusz, P., Zhang, H., Porikli, F.: A deeper look at power normalizations. In: CVPR (2018)
[11] Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: ICVGIP (2008)
[12] Lampert, C.H., Nickisch, H., Harmeling, S.: Attribute-based classification for zero-shot visual object categorization. TPAMI 36(3), 453–465 (2014)
[13] Zhang, Z., Saligrama, V.: Zero-shot learning via semantic similarity embedding. In: ICCV, pp. 4166–4174 (2015)
[14] Xian, Y., Akata, Z., Sharma, G., Nguyen, Q., Hein, M., Schiele, B.: Latent embeddings for zero-shot classification. In: CVPR, pp. 69–77 (2016)
[15] Frome, A., et al.: Devise: A deep visual-semantic embedding model. In: NIPS, pp. 2121–2129 (2013)
[16] Changpinyo, S., Chao, W.L., Gong, B., Sha, F.: Synthesized classifiers for zero-shot learning. In: CVPR, pp. 5327–5336 (2016)
[17] Kodirov, E., Xiang, T., Gong, S.: Semantic autoencoder for zero-shot learning. In: CVPR (2017)
© 2019 Springer Nature Switzerland AG
Cite this paper
Zhang, H., Koniusz, P. (2019). Model Selection for Generalized Zero-Shot Learning. In: Leal-Taixé, L., Roth, S. (eds) Computer Vision – ECCV 2018 Workshops. ECCV 2018. Lecture Notes in Computer Science(), vol 11130. Springer, Cham. https://doi.org/10.1007/978-3-030-11012-3_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11011-6
Online ISBN: 978-3-030-11012-3