1 Introduction

Zero-shot recognition (ZSR) is the problem of recognizing data instances from unseen classes (i.e. classes with no training data) at test time. The motivation for ZSR stems from the need for solutions to diverse research problems ranging from poorly annotated big data collections [1] to the problem of extreme classification [2]. In this paper we consider the classical zero-shot learning (ZSL) setting. Namely, we are given two sources of data, the so-called source domain and target domain, respectively. In the source domain, each class is represented by a single vector of side information such as attributes [3–7], language words/phrases [8–10], or even learned classifiers [11]. In the target domain, each class is represented by a collection of data instances (e.g. images or videos). During training some known classes with data are given as seen classes, while during testing some other unknown classes are revealed as unseen classes. The goal of ZSL is to learn suitable models from seen class training data such that, in ZSR, the class labels of arbitrary target domain data instances from unseen classes can be predicted at test time.

Key Insight: In batch mode we are given the ensemble of target domain data. Our main idea is that even though labels for target-domain data are unknown, subtle shifts in the data distributions can be inferred and these shifts can in turn be utilized to better adapt the learned classifiers for test-time use.

Intuitively, our insight is justified by noting that target domain data instances could form compact and disjoint clusters in their latent space embeddings. These clusters can then be reliably separated into different seen or unseen classes. Nevertheless, the locations of these clusters predicted from source domain data are somewhat inaccurate, resulting in large errors. Consequently, we can improve accuracy by adapting to the target domain ensemble distribution at test time.

Fig. 1. t-SNE [12] visualization of CNN features for the 12 unseen classes in the aP&Y dataset, one color per class. (Color figure online)

Another perspective on this issue can be gleaned from Fig. 1, which depicts the CNN feature distribution for the 12 “unseen” classes in the aPascal & aYahoo (aP&Y) dataset [3]. As we see, there exist clear gaps between most of the class pairs, indicating that CNN features are sufficiently reliable to recognize these classes. Indeed, a linear multi-class support vector machine (SVM) would suffice if we were given even a few instances from the unseen classes. Using half of the unseen class data for training, the recognition performance on the remaining data is as high as 97 %. Nevertheless, the best known result in the ZSL literature is around 50 % [13]. This huge performance difference, i.e. \(97\,\%-50\,\%=47\,\%\), suggests that the estimated unseen class classifiers are considerably less accurate than the supervised classifiers.

Obviously, there are many reasons for this significant performance degradation. First and foremost, we have no access to labels for the unseen classes, and of course no training instances for them. The difference can also stem from inaccurate source domain attribute vectors, noisy target domain data during training, imbalanced data distributions, etc. Among these, one plausible reason is the projection domain shift problem, which has been investigated recently [14, 15]. The major argument here is that the test-time data distributions in the projection/latent space can differ from the estimates based on training data, and as a result the learned ZSL classifiers for unseen classes do not work well. This leads to the question that is the focus of this paper, namely: is it possible to improve the recognition performance of the estimated classifiers for unseen classes if we posit that the unseen class target data forms well-separated clusters?

In this paper, we propose a structured prediction approach for ZSR, by assuming that the unseen data can be visually clustered but that the cluster locations predicted from the training data can be somewhat inaccurate. Our idea arises from the following two perspectives: The first perspective is from unsupervised data clustering, where we attempt to capture the correct underlying distribution in the latent space for each unseen class. Given well-clustered features such as CNN features, it is reasonable to assume that data instances in each cluster should have the same class label, as in label propagation (e.g. [16]). The second perspective is based on data assignment, which in our case is a bipartite graph matching problem with vertices representing clusters and unseen classes on each side, respectively. The edge weights between these vertices represent the (weighted) average similarities between the data instances in each cluster and the unseen class classifiers. This perspective suggests that rather than predicting class labels individually we seek to recognize class labels at the cluster level, a viewpoint closely related to multiple instance learning (MIL) [17]. Both aspects aim to globally predict a suitable data structure for unseen classes and utilize it to improve the recognition performance in an unsupervised manner.

Our approach is based on a novel structured prediction method, which in essence is equivalent to maximum a posteriori (MAP) estimation. Further, we propose a Gaussian parameterization for batch-mode ZSR, and accordingly derive an efficient algorithm. The parameters accounting for test-time shift adapt to the test data based on the learned associations between source domain attribute vectors and target domain images. Empirically we test our method on four popular benchmark image datasets for ZSL, namely, aPascal & aYahoo (aP&Y) [3], Animals with Attributes (AwA) [18], Caltech-UCSD Birds-200-2011 (CUB) [19], and SUN Attribute (SUN) [20], and achieve state-of-the-art results.

1.1 Related Work

A significant number of works for zero-shot learning are based on learning attribute classifiers that map target domain instances to those in the source domain [4, 11, 21–27]. More recently, methods based on similarity learning using linear [9, 10, 28–32] or nonlinear kernels [13, 14, 33, 34] on source and target domain embeddings have been proposed. There also exist other approaches, such as transfer learning [35], multimodal learning [36], multi-view learning [37], and multi-domain and multi-task learning [38], that have been applied to zero-shot learning. In general these learning methods can suffer from data noise (e.g. intra-class variability, inter-class similarity, noisy ground-truth attribute vectors, etc.), leading to performance degradation during test-time recognition of unseen classes.

Recently researchers have begun to incorporate test-time unseen class data into ZSL as unlabeled data to handle the projection domain shift problem [14, 15]. In [14] an unsupervised domain adaptation approach was proposed, where the target domain class label projections are utilized as regularization in a sparse coding framework to learn the target domain projection. A separate classifier, such as nearest neighbor or semi-supervised label propagation, is used as a post-processing step for recognition with the learned target domain projection. In [15] an approach based on transductive multi-class and multi-label ZSL is proposed. The idea there is to align the unlabeled data in the feature space with multiple semantic views through multi-view canonical correlation analysis and then recognize these data instances using label propagation. Underlying these methods is the need to account for target domain unseen class data structure in the learning procedure, which has led to improvement in ZSL performance.

In contrast to these previous ZSL approaches, which cannot accept trained classifiers as inputs, our method specifically focuses on the recognition task for batch-mode test-time processing. Potentially our method can be used in conjunction with any similarity learning procedure that is trained on seen-class data and can score the similarity between unseen classes and target domain data instances. We pursue our goal by formulating ZSR as a bipartite graph matching structured prediction problem, where the aim is to find the best assignment matrix between data instances and unseen classes.

While label propagation (e.g. [16]) and certain multi-class classification methods (e.g. CoConut [39]) are closely related to ours, they do not account for data/domain shift, which is fundamental. We account for domain shift by proposing a novel joint structured prediction problem at test time that accounts for unseen-class data structure (i.e. clustering) and label assignment. This is the first such work that, like CoConut, utilizes existing trained classifiers for scoring prior similarity but, in addition, deals with the data shift arising in ZSR.

Our method also differs from active learning, which selects data samples and acquires labels in order to learn classifiers; in ZSR, labels cannot be acquired. Moreover, our method involves no learning: it is a structured prediction test-time method for labelling unlabelled instances from unseen classes.

2 Zero-Shot Recognition via Structured Prediction

2.1 Problem Setting

(i) ZSL in training: Our method for training predictors using seen class training data resembles many past approaches (see [13]). Let \(\{\mathbf {x}_c^{(s)}\}_{c=1,\cdots ,C}\) and \(\{\mathbf {x}_i^{(t)}, y_i\}_{i=1,\cdots ,N}\) denote the training data for source and target domains, respectively. Here \(\mathbf {x}_c^{(s)}\in \mathbb {R}^{d_s},\,\forall \, c\in [C]\) is the \(d_s\)-dim attribute vector for class c; \(\mathbf {x}_i^{(t)}\in \mathbb {R}^{d_t}, \forall i\) is a \(d_t\)-dim data instance with class label \(y_i\in [C]\) for \(i \in [N]\). We learn two projection functions \(\phi _s: \mathbb {R}^{d_s}\rightarrow \mathbb {R}^{D_s}\) and \(\phi _t: \mathbb {R}^{d_t}\rightarrow \mathbb {R}^{D_t}\) for source and target domains, respectively, to minimize the binary prediction loss:

$$\begin{aligned} \min _{\kappa \in \mathcal {K}, \phi _s\in \varPhi _s, \phi _t\in \varPhi _t} \sum _{c=1}^{C}\sum _{i=1}^N\ell \left( \kappa (\phi _s(\mathbf {x}_{c}^{(s)}), \phi _t(\mathbf {x}_i^{(t)})), \mathbf {1}_{\{c=y_i\}}\right) , \end{aligned}$$
(1)

where \(\mathcal {K}, \varPhi _s, \varPhi _t\) denote the corresponding feasible functional spaces, \(\kappa :\mathbb {R}^{D_s}\times \mathbb {R}^{D_t}\rightarrow \mathbb {R}\) denotes a similarity function, \(\mathbf {1}_{\{c=y_i\}}\) denotes an indicator function returning 1 if the condition \(c=y_i\) holds, otherwise -1, and \(\ell : \mathbb {R}\times \{-1,+1\}\rightarrow \mathbb {R}\) denotes a loss function (e.g. hinge loss).
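As a concrete illustration (not the implementation used in the paper), the following Python sketch evaluates the objective of Eq. 1 for fixed embeddings, assuming the hinge loss mentioned above and a bilinear similarity \(\kappa (\mathbf {u}, \mathbf {v})=\mathbf {u}^T\mathbf {v}\) (cf. Eq. 12); the array layouts and toy data are assumptions for illustration only.

```python
import numpy as np

def pairwise_hinge_loss(Xs, Xt, y, kappa):
    """Evaluate the seen-class training objective of Eq. 1 for fixed embeddings.

    Xs    : (C, Ds) embedded source attribute vectors phi_s(x_c), one row per class.
    Xt    : (N, Dt) embedded target instances phi_t(x_i), one row per instance.
    y     : (N,) seen-class labels in {0, ..., C-1}.
    kappa : similarity function taking two vectors and returning a scalar.
    """
    C, N = Xs.shape[0], Xt.shape[0]
    loss = 0.0
    for c in range(C):
        for i in range(N):
            sign = 1.0 if y[i] == c else -1.0      # indicator 1_{c = y_i} in {+1, -1}
            score = kappa(Xs[c], Xt[i])            # kappa(phi_s(x_c), phi_t(x_i))
            loss += max(0.0, 1.0 - sign * score)   # hinge loss ell(score, sign)
    return loss

# Toy usage with a bilinear similarity kappa(u, v) = u^T v.
rng = np.random.default_rng(0)
Xs, Xt, y = rng.normal(size=(5, 8)), rng.normal(size=(20, 8)), rng.integers(0, 5, 20)
print(pairwise_hinge_loss(Xs, Xt, y, kappa=np.dot))
```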

(ii) Online-mode ZSR in testing: We briefly describe this mode in order to contrast our batch-mode setup of ZSR. As is the convention, in this mode, we are given \(C'\) source domain unseen class attribute vectors \(\bar{\mathcal {X}}^{(s)}=\{\mathbf {x}_{c'}^{(s)}\}_{c'=1,\cdots ,C'}\) and a single data instance, \(\mathbf {x}_{i'}^{(t)}\), chosen uniformly at random from a collection of \(N'\) target domain unseen class data instances \(\bar{\mathcal {X}}^{(t)}=\{\mathbf {x}_{i'}^{(t)}\}_{i'=1,\cdots ,N'}\). The goal is to match this instance to one of the \(C'\) unseen source-domain descriptions. Given the learned similarity kernel \(\kappa \) and the source and target domain embedding functions, the problem reduces to a multi-class classification rule.

$$\begin{aligned} y_{i'}=\mathop {\text {arg max}}\limits _{c'\in \{1,\cdots , C'\}} P_{\theta }(c'|\mathbf {x}_{c'}^{(s)}, \mathbf {x}_{i'}^{(t)}) \equiv \mathop {\text {arg max}}\limits _{c'\in \{1,\cdots , C'\}}\kappa (\phi _s(\mathbf {x}_{c'}^{(s)}), \phi _t(\mathbf {x}_{i'}^{(t)})), \end{aligned}$$
(2)

As depicted above, we can view the similarity kernel as a probability functional, where \(P_{\theta }\) denotes the probability of the instance being labeled as \(c'\) given the data pair, parameterized by \(\theta \).
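A minimal sketch of this online-mode rule, assuming the pairwise similarities have already been collected into a matrix (the matrix layout and toy values below are illustrative assumptions):

```python
import numpy as np

def online_mode_zsr(S):
    """Online-mode ZSR decision rule of Eq. 2.

    S : (C', N') similarity matrix with S[c, i] = kappa(phi_s(x_c), phi_t(x_i)).
    Returns the predicted unseen-class index for every test instance.
    """
    return np.argmax(S, axis=0)   # each instance is labeled in isolation

# Toy usage: 3 unseen classes, 4 test instances.
S = np.array([[0.9, 0.1, 0.2, 0.4],
              [0.3, 0.8, 0.1, 0.5],
              [0.2, 0.3, 0.7, 0.6]])
print(online_mode_zsr(S))   # -> [0 1 2 2]
```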

(iii) Batch-mode ZSR in testing: In contrast, our method is based on batch-mode processing. Here during test-time all the \(N'\) target-domain unseen class instances are revealed and our task is to match these \(N'\) target domain instances to \(C'\) source domain descriptions. Our goal is thus to predict a good global structure, \(\bar{\mathcal {Y}}\), among the predicted labels simultaneously by exploring useful data dependencies for unseen classes in both source and target domains, rather than in isolation as in the online-mode. We can view this problem probabilistically as attempting to jointly label all instances conditioned on combined (but unassociated) source/target test data:

$$\begin{aligned} \bar{\mathcal {Y}}=\mathop {\text {arg max}}\limits _{\omega \in \varOmega }P_{\theta }(\omega |\bar{\mathcal {X}}^{(s)}, \bar{\mathcal {X}}^{(t)}), \end{aligned}$$
(3)

where \(\omega \in \varOmega \) denotes a feasible assignment solution between target data and source attribute vectors (and hence unseen class labels). If one were to utilize only the similarity function learned on seen class training data, the problem would reduce to a standard bipartite matching problem; this, however, is infeasible because the number of instances corresponding to each class is unknown. Regardless, we hope to do better by utilizing the target-domain batch data (although unlabeled/unassociated) to improve these assignments. Note that the prediction functions described in [14, 15] that use unseen class data as unlabeled data can be abstractly represented in this way.

2.2 Structured Prediction in Testing

We propose a structured prediction method for batch-mode ZSR. Intuitively a good labeling structure for target domain unseen class data instances should result in smooth label assignments in the latent space. Namely, two close data points tend to have the same class label. To predict smooth labeling structures we consider an approach based on fusing information obtained from cross-domain similarities with empirically observed target domain data distribution.

(i) Maximum a posteriori (MAP) estimation: We will develop a generative parameterized probabilistic model for recognizing test-time target data and describe an approach based on MAP. Using Bayes’ rule we can further expand the batch-mode decision rule in Eq. 3 as follows:

$$\begin{aligned} \bar{\mathcal {Y}}&= \mathop {\text {arg max}}\limits _{\omega \in \varOmega }\sum _{c'=1}^{C'}\sum _{i'=1}^{N'}P_{\theta }(\omega _{c',i'}|\bar{\mathcal {X}}^{(s)}, \bar{\mathcal {X}}^{(t)}) =\mathop {\text {arg max}}\limits _{\omega \in \varOmega }\sum _{c'=1}^{C'}\sum _{i'=1}^{N'}P_{\theta }(\omega _{c',i'}, \bar{\mathcal {X}}^{(s)}, \bar{\mathcal {X}}^{(t)}) \nonumber \\&=\mathop {\text {arg max}}\limits _{\omega \in \varOmega }\sum _{c'=1}^{C'}\sum _{i'=1}^{N'}P_{\theta }(\omega _{c',i'})P_{\theta }(\bar{\mathcal {X}}^{(s)}|\omega _{c',i'})P_{\theta }(\bar{\mathcal {X}}^{(t)}|\omega _{c',i'}), \end{aligned}$$
(4)

where \(\omega _{c',i'}\) denotes data \(\mathbf {x}_{i'}^{(t)}\) being labeled as unseen class \(c'\), \(P_{\theta }(\omega _{c',i'})\) denotes the prior distribution, and \(P_{\theta }(\bar{\mathcal {X}}^{(s)}|\omega _{c',i'}), P_{\theta }(\bar{\mathcal {X}}^{(t)}|\omega _{c',i'})\) denote the likelihoods of generating data sets \(\bar{\mathcal {X}}^{(s)}, \bar{\mathcal {X}}^{(t)}\) given the assignment and parameter \(\theta \), respectively. Note that our MAP formulation corresponds to the online-mode ZSR if we remove \(P_{\theta }(\bar{\mathcal {X}}^{(t)}|\omega _{c',i'})\) from Eq. 4 and assume \(\omega \) is a one-to-one assignment function.

We view \(P_{\theta }(\omega _{c',i'}, \bar{\mathcal {X}}^{(s)}, \bar{\mathcal {X}}^{(t)})\) as a generative model of the likelihood of labeling data \(\mathbf {x}_{i'}^{(t)}\) as unseen class \(c'\) in the context of the source and target data \(\bar{\mathcal {X}}^{(s)}, \bar{\mathcal {X}}^{(t)}\). We posit that the data generation processes for the source and target domains are conditionally independent given the assignment variable. Consequently, we can factorize the likelihood as in the last line of Eq. 4.

Empirically we would like to maximize the log-likelihood, as many Bayesian methods [40] do. Therefore, rather than optimizing Eq. 4 directly, we prefer to optimize a lower bound on the log-likelihood for structured prediction:

$$\begin{aligned} \bar{\mathcal {Y}}&= \mathop {\text {arg max}}\limits _{\omega \in \varOmega }\sum _{c'=1}^{C'}\sum _{i'=1}^{N'}P_{\theta }(\omega _{c',i'})\left[ \log P_{\theta }(\bar{\mathcal {X}}^{(s)}|\omega _{c',i'}) + \log P_{\theta }(\bar{\mathcal {X}}^{(t)}|\omega _{c',i'})\right] . \end{aligned}$$
(5)
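To see why Eq. 5 is a lower bound, note that (under the assumption, made only for this sketch, that the prior weights \(P_{\theta }(\omega _{c',i'})\) sum to one over all assignments) the concavity of the logarithm (Jensen's inequality) gives

$$\begin{aligned} \log \sum _{c'=1}^{C'}\sum _{i'=1}^{N'}P_{\theta }(\omega _{c',i'})P_{\theta }(\bar{\mathcal {X}}^{(s)}|\omega _{c',i'})P_{\theta }(\bar{\mathcal {X}}^{(t)}|\omega _{c',i'}) \ge \sum _{c'=1}^{C'}\sum _{i'=1}^{N'}P_{\theta }(\omega _{c',i'})\left[ \log P_{\theta }(\bar{\mathcal {X}}^{(s)}|\omega _{c',i'}) + \log P_{\theta }(\bar{\mathcal {X}}^{(t)}|\omega _{c',i'})\right] , \end{aligned}$$

i.e. the logarithm of the objective in Eq. 4 is bounded from below by the objective in Eq. 5.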

(ii) Parameterization: We parameterize the log-likelihoods in Eq. 5 with Gaussian models. For the source domain, we directly utilize the similarity between data \(\mathbf {x}_{i'}^{(t)}\) and unseen class \(c'\) with the learned functions \(\kappa , \phi _s, \phi _t\) as follows:

$$\begin{aligned} \log P_{\theta }(\bar{\mathcal {X}}^{(s)}|\omega _{c',i'}) \mathop {=}\limits ^{\text{ def }} \lambda _s\kappa (\phi _s(\mathbf {x}_{c'}^{(s)}), \phi _t(\mathbf {x}_{i'}^{(t)})), \end{aligned}$$
(6)

with predefined parameter \(\lambda _s\ge 0\). For the target domain, we utilize the distance between the projected data \(\phi _t(\mathbf {x}_{i'}^{(t)})\) and the empirical mean vector \(\varvec{\mu }_{c'}^{(t)}\) for unseen class \(c'\) in the same latent space by setting parameter \(\theta =\{\varvec{\mu }_{c'}^{(t)}\}\). That is,

$$\begin{aligned} \log P_{\theta }(\bar{\mathcal {X}}^{(t)}|\omega _{c',i'}) \mathop {=}\limits ^{\text{ def }} -\lambda _t\Vert \phi _t(\mathbf {x}_{i'}^{(t)}) - \varvec{\mu }_{c'}^{(t)}\Vert _2^2, \end{aligned}$$
(7)

with another predefined parameter \(\lambda _t\ge 0\) and \(\Vert \cdot \Vert _2\) denoting the \(\ell _2\) norm operator of a vector.
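As an illustration, the two parameterized terms of Eqs. 6 and 7 can be evaluated for every class/instance pair with a few lines of Python; the array layouts below are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def log_likelihood_terms(S, Phi_t, mu, lam_s, lam_t):
    """Gaussian parameterization of Eqs. 6 and 7 for all (class, instance) pairs.

    S     : (C', N') cross-domain similarities kappa(phi_s(x_c'), phi_t(x_i')).
    Phi_t : (D_t, N') embedded target instances, one column per instance.
    mu    : (D_t, C') empirical class means mu_{c'} in the latent space.
    """
    log_p_src = lam_s * S                                       # Eq. 6
    # Squared distances ||phi_t(x_i') - mu_{c'}||_2^2 for every pair (c', i').
    d2 = ((Phi_t[:, None, :] - mu[:, :, None]) ** 2).sum(axis=0)
    log_p_tgt = -lam_t * d2                                      # Eq. 7
    return log_p_src, log_p_tgt
```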

(iii) Initial model for estimating \(\omega \) and \(\theta \): In order to account for the target data distribution efficiently, we initialize \(\theta \) as the set of cluster centers generated by K-means with \(K=C'\). Then we identify one-to-one matches between the clusters and unseen classes, so that the data instances in each cluster can be labeled with the matched class label as the initialization of parameter \(\omega \).

To identify the matches, we solve the following binary assignment problem:

$$\begin{aligned} \max _{\{\bar{B}_{c',k'}\}} \sum _{c'=1}^{C'}\sum _{k'=1}^{C'}\bar{S}_{c',k'}\bar{B}_{c',k'}, \; \text{ s.t. } \; \forall c', \forall k', \sum _{c'}\bar{B}_{c',k'}=1, \sum _{k'}\bar{B}_{c',k'}=1, \end{aligned}$$
(8)

where \(\bar{B}_{c',k'}\in \{0, 1\}, \forall c', \forall k', \) denotes the binary assignment variable, and \(\bar{S}_{c',k'}\) denotes the average similarities between unseen class \(c'\) and data in cluster \(k'\). This problem can be efficiently solved using linear programming (LP).
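A minimal sketch of this initialization, using scikit-learn's KMeans for clustering; the assignment problem of Eq. 8 is solved here with the Hungarian method (scipy's linear_sum_assignment) as a stand-in for a generic LP solver, which yields the same optimal one-to-one matching. Function names and array layouts are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def initialize_assignment(S, Phi_t, n_unseen):
    """Initialize omega and theta: cluster target data, then match clusters to classes.

    S        : (C', N') similarity matrix between unseen classes and test instances.
    Phi_t    : (N', D) embedded target instances, one row per instance.
    n_unseen : number of unseen classes C' (= number of clusters K).
    """
    # K-means with K = C' clusters in the latent space.
    km = KMeans(n_clusters=n_unseen, n_init=10, random_state=0).fit(Phi_t)
    labels = km.labels_

    # S_bar[c, k]: average similarity between class c and the instances in cluster k (Eq. 8).
    S_bar = np.stack([S[:, labels == k].mean(axis=1) for k in range(n_unseen)], axis=1)

    # One-to-one matching between classes and clusters, maximizing total similarity.
    rows, cols = linear_sum_assignment(-S_bar)
    class_to_cluster = dict(zip(rows, cols))                   # class c' -> matched cluster k'
    cluster_to_class = {k: c for c, k in class_to_cluster.items()}

    # Initial omega: every instance inherits the class matched to its cluster.
    init_labels = np.array([cluster_to_class[k] for k in labels])
    # Initial theta: the class means are the matched cluster centers.
    mu = km.cluster_centers_[[class_to_cluster[c] for c in range(n_unseen)]]
    return init_labels, mu
```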

(iv) Complete model: In fact each parameter \(\varvec{\mu }_{c'}^{(t)}\) in Eq. 7 can be estimated as the weighted mean of all the projected target domain features in the latent space for class \(c'\). Importantly, this estimation is coupled with parameter \(\omega \), as \(\omega \) describes the relationship between target data and unseen classes.

We denote by \(\mathbf {S}\in \mathbb {R}^{C'\times N'}\) the test-time source-target data similarity matrix, where \(S_{c',i'}=\kappa (\phi _s(\mathbf {x}_{c'}^{(s)}), \phi _t(\mathbf {x}_{i'}^{(t)})), \forall c', \forall i',\) is the \((c',i')\)-th entry in matrix \(\mathbf {S}\). We denote by \(\bar{\varvec{\varPhi }}_t\mathop {=}\limits ^{\text{ def }}[\phi _t(\mathbf {x}_{i'}^{(t)})]_{i'=1,\cdots ,N'}\in \mathbb {R}^{D_t\times N'}\) the target domain data matrix consisting of each instance \(\phi _t(\mathbf {x}_{i'}^{(t)}), \forall i',\) as a column. We denote by \(P_{\theta }(\omega )\mathop {=}\limits ^{\text{ def }}\mathbf {H}\in \mathbb {R}^{C'\times N'}\) the source-target assignment weighting matrix, and by \(\mathbf {S}_{c'}\in \mathbb {R}^{1\times N'}, \mathbf {H}_{c'}\in \mathbb {R}^{1\times N'}, \forall c',\) the \(c'\)-th rows in \(\mathbf {S}\) and \(\mathbf {H}\), respectively. Then by substituting Eqs. 6 and 7 into Eq. 5, we can write down our regularized structured prediction objective for ZSR as follows:

$$\begin{aligned}& \min _{\mathbf {H}, \{\varvec{\mu }_{c'}^{(t)}=\bar{\varvec{\varPhi }}_t\mathbf {H}_{c'}^T\}} \frac{1}{2}\Vert \mathbf {H}\Vert _F^2 - \lambda _s\sum _{c'=1}^{C'}\mathbf {S}_{c'}\mathbf {H}_{c'}^T + \lambda _t\sum _{i'=1}^{N'}\sum _{c'=1}^{C'}H_{c',i'}\Vert \phi _t(\mathbf {x}_{i'}^{(t)}) - \varvec{\mu }_{c'}^{(t)}\Vert _2^2 \\&\text{ s.t. } \forall i', \forall c', H_{c',i'}\ge 0, \sum _{c'=1}^{C'}H_{c',i'}\ne 0, \sum _{i'=1}^{N'}H_{c',i'}=1, \forall c_m'\ne c_n', \sum _{i'=1}^{N'}H_{c_m',i'}H_{c_n',i'} = 0, \nonumber \end{aligned}$$
(9)

where \(\Vert \cdot \Vert _F\) denotes the Frobenius norm of a matrix, and \((\cdot )^T\) denotes the matrix transpose operator. Here the constraints guarantee that: (1) Every instance is assigned to at least one unseen class, and for each unseen class, each row in \(\mathbf {H}\) represents a probability distribution over all the instances (on a simplex); (2) The additional orthogonality constraints ensure that every instance is assigned to only one unseen class.

Note that all the assignment constraints for minimizing Eq. 9 are chosen to reflect the fact that we know a priori that at test time every instance must belong to a single unseen class. Nevertheless, our method can be extended to handle missing matches between source and target domain data by suitably modifying the bipartite graph matching constraints. In reality these missing-match scenarios in ZSR may be more interesting and important, but they are outside the scope of this paper.

(v) Optimization: Solving Eq. 9 is nontrivial as it is highly non-convex. In Algorithm 1 we propose an efficient alternating optimization algorithm to solve Eq. 9 sub-optimally. The idea here is to decompose \(\mathbf {H}=\mathbf {B}\circ \mathbf {Z}\), where \(\circ \) denotes the entry-wise multiplication operator between two matrices, \(\mathbf {B}\in \{0,1\}^{C'\times N'}\) is a binary matrix indicating the assignments and \(\mathbf {Z}\in \mathbb {R}^{C'\times N'}\) is a weighting matrix for the corresponding assignments. When \(\mathbf {B}\) is learned and fixed, we can solve a weighting problem for each unseen class using quadratic programming (QP). Letting \(\mathcal {J}_{c'}\) be the index set where \(\forall j'\in \mathcal {J}_{c'}\) the \((c',j')\)-th entry in \(\mathbf {B}\) is 1, we can optimize Eq. 9 as follows: \(\forall c'\),

$$\begin{aligned} \min _{\{Z_{c',j'}\}} &\frac{1}{2}\sum _{j'\in \mathcal {J}_{c'}}Z_{c',j'}^2 -\lambda _s\sum _{j'\in \mathcal {J}_{c'}}S_{c',j'}Z_{c',j'} + \lambda _t\sum _{j'\in \mathcal {J}_{c'}}Z_{c',j'}\Vert \phi _t(\mathbf {x}_{j'}^{(t)}) - \varvec{\mu }_{c'}^{(t)}\Vert _2^2 \\ \text{ s.t. } &\forall j'\in \mathcal {J}_{c'}, Z_{c',j'}\ge 0, \sum _{j'}Z_{c',j'}=1, \nonumber \end{aligned}$$
(10)

where \(Z_{c',j'}\) is the \((c',j')\)-th entry in \(\mathbf {Z}\). This leads to a sub-optimal solution \(\mathbf {H}=\mathbf {B}\circ \mathbf {Z}\). Next, we estimate the binary assignment variable \(\mathbf {B}\) only based on the distance term in Eq. 9, which is equivalent to a nearest neighbor problem as shown below:

$$\begin{aligned} \forall i', \; B_{y',i'}=\left\{ \begin{array}{ll} 1, &{} \text{ if } \, y'={\text {arg min}}_{c'}\Vert \phi _t(\mathbf {x}_{i'}^{(t)}) - \varvec{\mu }_{c'}^{(t)}\Vert _2^2, \\ 0, &{} \text{ otherwise }, \end{array} \right. \end{aligned}$$
(11)

where \(B_{y',i'}\) is the \((y',i')\)-th entry in \(\mathbf {B}\). This step guarantees the orthogonality constraints in Eq. 9. We repeat this procedure until a stopping criterion is satisfied (e.g. a maximum number of iterations). Empirically our algorithm works well even with very few iterations, although there is no guarantee of convergence.

Note that Eq. 11 is also utilized as the recognition decision function.

Algorithm 1. Alternating optimization for solving Eq. 9.
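A minimal sketch of this alternating scheme. Since the per-class subproblem of Eq. 10 equals \(\min _{\mathbf {z}}\frac{1}{2}\Vert \mathbf {z}-\mathbf {q}\Vert _2^2\) up to a constant, with \(\mathbf {q}\) the vector of entries \(\lambda _s S_{c',j'}-\lambda _t\Vert \phi _t(\mathbf {x}_{j'}^{(t)})-\varvec{\mu }_{c'}^{(t)}\Vert _2^2\) over \(j'\in \mathcal {J}_{c'}\), it is solved below by a Euclidean projection onto the probability simplex; the ordering of the two steps and the fixed iteration count are assumptions of this sketch rather than the paper's exact Algorithm 1.

```python
import numpy as np

def project_simplex(q):
    """Euclidean projection of q onto the probability simplex {z >= 0, sum(z) = 1}."""
    u = np.sort(q)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(q) + 1) > (css - 1.0))[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(q - tau, 0.0)

def alternating_zsr(S, Phi_t, mu, lam_s, lam_t, n_iters=20):
    """Alternate the assignment step (Eq. 11) and the weighting step (Eq. 10).

    S     : (C', N') cross-domain similarities.
    Phi_t : (N', D) embedded test instances, one row per instance.
    mu    : (C', D) initial class means (e.g. from the K-means initialization).
    """
    C, N = S.shape
    H = np.zeros((C, N))
    for _ in range(n_iters):
        # Assignment step (Eq. 11): each instance goes to its nearest class mean.
        d2 = ((Phi_t[None, :, :] - mu[:, None, :]) ** 2).sum(axis=2)   # (C', N')
        labels = np.argmin(d2, axis=0)

        # Weighting step (Eq. 10): per class, a QP over the simplex, solved as a
        # projection of q = lam_s * S - lam_t * d^2 restricted to assigned instances.
        H[:] = 0.0
        for c in range(C):
            idx = np.where(labels == c)[0]
            if idx.size == 0:
                continue                                   # keep previous mean if empty
            q = lam_s * S[c, idx] - lam_t * d2[c, idx]
            H[c, idx] = project_simplex(q)
            mu[c] = Phi_t[idx].T @ H[c, idx]               # mu_{c'} = Phi_t H_{c'}^T

    # Final recognition decision (Eq. 11): nearest class mean.
    d2 = ((Phi_t[None, :, :] - mu[:, None, :]) ** 2).sum(axis=2)
    return np.argmin(d2, axis=0), H
```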

2.3 Similarity Learning in Training

Our structured prediction method can be applied in test time for ZSR as long as the similarity matrix \(\mathbf {S}\) in Eq. 9 can be calculated. Therefore our method is very flexible, and can be incorporated with other ZSL methods such as [13, 28, 29, 34] for the purpose of recognition. Inspired by the success of semantic embedding, we learn the following similarity function \(\kappa \) with embedding functions \(\phi _s, \phi _t\):

$$\begin{aligned} \kappa (\phi _s(\mathbf {x}_c^{(s)}), \phi _t(\mathbf {x}_i^{(t)})) \mathop {=}\limits ^{\text{ def }} \phi _s(\mathbf {x}_c^{(s)})^T\phi _t(\mathbf {x}_i^{(t)})=\phi _s(\mathbf {x}_c^{(s)})^T\mathbf {W}\mathbf {x}_i^{(t)}, \end{aligned}$$
(12)

where \(\phi _t(\mathbf {x}_i^{(t)})\mathop {=}\limits ^{\text{ def }}\mathbf {W}\mathbf {x}_i^{(t)}\) is a linear embedding function. Specifically we propose independent learning of embedding functions for source and target domains, respectively, as follows:

(i) Source domain semantic embedding based on mixture models: We simplify the embedding function in [34] and propose using the following optimization problem to define embedding function \(\phi _s\):

$$\begin{aligned} \phi _s(\mathbf {x}_y^{(s)})=\mathop {\text {arg min}}\limits _{\varvec{\alpha }}\Vert \mathbf {x}_y^{(s)}-\mathbf {X}_s\varvec{\alpha }\Vert _2^2, \; \text{ s.t. } \, \varvec{\alpha }\ge 0, \mathbf {e}^T\varvec{\alpha }=1, \end{aligned}$$
(13)

where \(\mathbf {x}_y^{(s)}\) denotes an arbitrary seen or unseen class attribute vector, \(\mathbf {X}_s=[\mathbf {x}_c^{(s)}]_{c=1,\cdots ,C}\in \mathbb {R}^{d_s\times C}\) denotes the matrix consisting of all the seen class attribute vectors as its columns, and \(\mathbf {e}\) denotes a vector consisting of all 1’s. Clearly the source domain mapping function \(\phi _s\) projects an arbitrary attribute vector onto a \((C-1)\)-simplex and represents it as a mixture of seen class attribute vectors. As a result all the C seen class attribute vectors are mapped to the C unique vertices of the simplex accordingly. In test time we use QP to solve Eq. 13 so that all the unseen class attribute vectors can be mapped to unique points on the simplex due to the convexity of Eq. 13.
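A minimal sketch of the simplex-constrained least-squares problem of Eq. 13. The paper solves it with QP; the sketch below uses scipy's general-purpose SLSQP solver as a stand-in, and the toy data are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def source_embedding(x_y, X_s):
    """Source-domain embedding phi_s of Eq. 13: represent an attribute vector as a
    convex mixture of the seen-class attribute vectors (a point on the simplex).

    x_y : (d_s,) attribute vector of an arbitrary (seen or unseen) class.
    X_s : (d_s, C) matrix whose columns are the seen-class attribute vectors.
    """
    C = X_s.shape[1]
    obj = lambda a: np.sum((x_y - X_s @ a) ** 2)           # squared reconstruction error
    cons = ({'type': 'eq', 'fun': lambda a: np.sum(a) - 1.0},)
    res = minimize(obj, x0=np.full(C, 1.0 / C), bounds=[(0.0, None)] * C,
                   constraints=cons, method='SLSQP')
    return res.x

# A seen-class attribute vector should be recovered (approximately) as a simplex vertex.
rng = np.random.default_rng(0)
X_s = rng.normal(size=(10, 4))
print(np.round(source_embedding(X_s[:, 2], X_s), 3))       # expected close to [0, 0, 1, 0]
```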

(ii) Target domain semantic embedding based on multi-class classification: With function \(\phi _s\) in Eq. 13, the learning of linear embedding approaches such as [28, 29] can be simplified to training multi-class SVMs, because in each source domain seen class semantic embedding there exists exactly one nonzero bin, which is equal to 1. Consequently we utilize the following optimization to learn the target domain semantic embedding function \(\phi _t\):

$$\begin{aligned} \min _{\mathbf {W}} \frac{1}{2}\Vert \mathbf {W}\Vert _F^2 + \rho \sum _{i=1}^N\sum _{c=1}^{C}\max \left\{ 0, 1 - \mathbf {1}_{\{c=y_i\}}\mathbf {W}_c\mathbf {x}_i^{(t)}\right\} , \end{aligned}$$
(14)

where \(\mathbf {W}\in \mathbb {R}^{C\times d_t}\) denotes the multi-class classifier, \(\forall c, \mathbf {W}_c\in \mathbb {R}^{1\times d_t}\) denotes the c-th row in \(\mathbf {W}\) for predicting the similarities between data instances and seen class c, and \(\rho \ge 0\) is a predefined regularization parameter. We utilize an existing linear SVM solver such as LIBLINEAR [41] to solve Eq. 14.
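A minimal sketch of this training step using scikit-learn's LinearSVC (which wraps LIBLINEAR) as a one-vs-rest linear SVM. The exact correspondence between \(\rho \) and the solver's C parameter, and the solver settings, are assumptions of this sketch.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_target_embedding(X_train, y_train, rho=1.0):
    """Learn the linear map W of Eq. 14 with a one-vs-rest linear SVM.

    X_train : (N, d_t) seen-class target-domain features.
    y_train : (N,) seen-class labels in {0, ..., C-1}.
    rho     : regularization weight (mapped here to LIBLINEAR's C parameter).
    """
    clf = LinearSVC(C=rho, multi_class='ovr', fit_intercept=False, max_iter=10000)
    clf.fit(X_train, y_train)
    return clf.coef_                      # W in R^{C x d_t}; row c scores seen class c

# With W learned, phi_t(x) = W x, so the similarity in Eq. 12 becomes phi_s(x_c)^T (W x_i).
```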

Note that the learning approach above is essentially a (source domain) denoising version of [29]. For simplicity, in the experiments that follow we denote this learning approach as BL-ZSL, namely the baseline approach for ZSL.

2.4 Cross-Validation on Predefined Parameters

As in [13, 34], we utilize cross-validation to determine suitable values for the training-time SVM regularization parameter \(\rho \) in Eq. 14 and the test-time structured prediction regularization parameters \(\lambda _s, \lambda _t\) in Eq. 9. Precisely, we randomly select two held-out seen classes for validation purposes to tune \(\lambda _s, \lambda _t\), and use the remaining data to tune \(\rho \) for training SVMs. We repeat this procedure several times and choose the parameter combination that returns the best average ZSR performance. The easiest way to set these predefined parameters for our Algorithm 1 is \(\lambda _s\gg \lambda _t\gg 1\). Then Eq. 10 simplifies to

$$\begin{aligned} \forall c', \; \max _{\{Z_{c',j'}\}} \sum _{j'\in \mathcal {J}_{c'}}S_{c',j'}Z_{c',j'}, \; \text{ s.t. } \; \forall j'\in \mathcal {J}_{c'}, Z_{c',j'}\ge 0, \sum _{j'}Z_{c',j'}=1, \end{aligned}$$
(15)

which can be solved efficiently using LP. In practice we find that this simplified version of Algorithm 1 achieves similar performance to the complete one but offers significant computational improvement.
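A minimal sketch of this simplified per-class weighting step (Eq. 15) posed as a small LP, solved here with scipy's linprog as a stand-in for whichever LP solver is used in practice:

```python
import numpy as np
from scipy.optimize import linprog

def simplified_weighting(s_c):
    """Simplified per-class weighting step of Eq. 15 as a small LP.

    s_c : (m,) similarities S_{c', j'} for the instances currently assigned to
          unseen class c' (the index set J_{c'}).
    """
    m = len(s_c)
    res = linprog(c=-np.asarray(s_c),                 # maximize  sum_j S_j Z_j
                  A_eq=np.ones((1, m)), b_eq=[1.0],   # sum_j Z_j = 1
                  bounds=[(0.0, None)] * m,           # Z_j >= 0
                  method='highs')
    return res.x

print(np.round(simplified_weighting([0.2, 0.9, 0.5]), 3))   # weight concentrates on the best match
```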

Table 1. Zero-shot recognition accuracy comparison (%) using CNN features in the form of “mean±standard deviation”. Here numbers for the comparative methods are cited from the original papers, and “-” means no repeated result available yet.

3 Experiments

We test our method with predefined attributes for ZSR on aP&Y, AwA, CUB, and SUN. In our experiments we utilize the same experimental settings, including the CNN features and data preprocessing, as [13, 34]. We denote by SP-ZSR our batch-mode ZSR method, and report results averaged over 100 trials. To account for the randomness in Algorithm 1, in each trial we run Algorithm 1 another 100 times and record the average assignment as probabilities over the unseen classes per target data instance. We predict class labels and report our performance based on this assignment probability matrix in each trial.

The computational complexity of our method SP-ZSR scales as O(#target-data × #unseen-classes). Our implementation is based on unoptimized MATLAB code with multi-thread computation, and can potentially be combined with any ZSL method. In terms of running time, on aP&Y with [13, 34], for instance, we can finish prediction within 5 min for one trial with 100 runs of Algorithm 1 on a common PC.

3.1 Zero-Shot Recognition

For this task we are only interested in whether or not the predicted class label for a target data instance is correct. Therefore, we measure the overall recognition performance by accuracy, while for each individual class we measure the performance by precision and recall (per-class recall is equivalent to per-class accuracy).

Fig. 2. Class-level recognition precision comparison, where the y-axis denotes precision and the x-axis denotes the indexes of unseen classes in the corresponding datasets.

Fig. 3. Class-level recognition recall comparison, where the y-axis denotes recall and the x-axis denotes the indexes of unseen classes in the corresponding datasets.

We summarize the benchmark comparison against recently proposed methods in Table 1. Overall our method outperforms the state-of-the-art by large margins. Our SP-ZSR improves upon the accuracy of state-of-the-art ZSL methods using the traditional online mode, such as [13], by more than 10 %. Compared against related methods such as [14, 15], which, like ours, benefit from exploring data structure, our method outperforms them by 11.58 % on AwA and 7.44 % on CUB, respectively. Our SP-ZSR also outperforms label propagation methods such as [16] by 5.03 %. These observations indicate that our method is more effective in accounting for test-time data shifts than methods [14, 15], which directly seek to associate the test data distribution with the training data.

Table 2. Average precision and recall comparison (%) for recognition.

Fig. 4. Class-level average precision (AP) comparison for retrieval, where the y-axis denotes AP and the x-axis denotes the indexes of unseen classes in the corresponding datasets.

Fig. 5. Visualization of class distributions for aP&Y using CNN features.

To better analyze the performance of our SP-ZSR for recognition, we also show the class-level precision and recall comparisons in Figs. 2 and 3, respectively. Here (and in the following experiments) we only consider [13] as the baseline comparative approach, because it achieves the state-of-the-art over the four datasets on average. In general SP-ZSR improves the performance on an individual class when its distribution can be separated from the others. In some cases, SP-ZSR decreases precision (or recall) but increases recall (or precision). In a few cases, however, we observe that recognition with estimated data distributions deteriorates the performance on both measures, such as class 2 in aP&Y and class 8 in CUB. More details can be seen from the class distributions in Fig. 5.

Fig. 6. Precision-recall curve comparison for retrieval on aP&Y, AwA, and SUN. The class names in each legend correspond to the indexes along the x-axis in Figs. 2, 3, and 4, respectively.

To summarize the precision and recall comparison, we list the averages over all unseen classes in each dataset in Table 2. Overall SP-ZSR helps improve both precision and recall by at least 5.10 % and 9.04 %, respectively. Although the learning methods [13] and BL-ZSL are different, our SP-ZSR leads to similar performance with either. The large standard deviations imply that the performance for individual classes has large variability; we will explore this issue further in future work.

3.2 Zero-Shot Retrieval

In zero-shot retrieval we rank the assignment probabilities per unseen class and measure the retrieval performance by average precision (AP) and the precision-recall curve per class. In this way we hope to explore the performance of ZSR methods from the perspective of retrieval.

Table 3. mAP comparison (%) for zero-shot retrieval.

As an overview, we first summarize the mean average precision (mAP) comparison in Table 3. SP-ZSR improves upon the retrieval performance of [13] significantly, by 30.12 %. Again, with different learning approaches, SP-ZSR works equally well. These results suggest that exploring test-time data structure is even more useful for retrieval than for recognition.

Similar to recognition, we also show the class-level AP performance in Fig. 4. Overall SP-ZSR helps improve the retrieval performance on individual classes by taking data structure into account. Unlike recognition, there are a few cases (i.e. class 10 in aP&Y, and classes 16 and 38 in CUB) where our method leads to a small degradation over using only cross-domain similarities. Possible reasons for this deterioration could be the tuning parameters or the learned similarity matrix. Note that if the initial predicted similarities for a certain class are not discriminative and its corresponding distribution is not separable either, as with class 7, “centaur”, in aP&Y, our SP-ZSR cannot be expected to work well.

Next we analyze our retrieval performance from the perspective of the precision-recall curve, as shown in Fig. 6. We do not display the figures for the CUB dataset to avoid unnecessary clutter in our illustrations. Note that the larger areas under the precision-recall curves again demonstrate the superior performance of our structured prediction method.

4 Conclusion

The focus of this paper is on improving the recognition and retrieval performance of learned classifiers for unseen classes under the supposition that target domain data forms clusters in a suitable embedded space. To deal with problems such as domain shift in ZSL, we propose a novel structured prediction approach that seeks a globally well-matched assignment structure between clusters and unseen classes at test time. Our idea is motivated by the fact that there is a substantial performance gap between supervised learning and the current state-of-the-art in ZSL; the key difference between the two is that the former benefits from utilizing the test data distribution during training. With this as justification, we propose classifying unseen target data by taking into consideration not only the learned similarities but also the empirical distribution of the unlabelled target data. In particular, we introduce an unsupervised clustering subroutine into the assignment procedure so that the target data structure in both clustering and assignment can be updated iteratively. Empirically we demonstrate consistent and significant improvement over the state-of-the-art in both zero-shot recognition and retrieval on four popular benchmark datasets for ZSR.