Zero-Shot Fine-Grained Classification by Deep Feature Learning with Semantics

Fine-grained image classification, which aims to distinguish images with subtle distinctions, is a challenging task due to two main issues: the lack of sufficient training data for every class and the difficulty of learning discriminative features for representation. In this paper, to address these two issues, we propose a two-phase framework for recognizing images from unseen fine-grained classes, i.e. zero-shot fine-grained classification. In the first phase, feature learning, we finetune deep convolutional neural networks using the hierarchical semantic structure among fine-grained classes to extract discriminative deep visual features. Meanwhile, a domain adaptation structure is introduced into the deep convolutional neural networks to avoid domain shift from training data to test data. In the second phase, label inference, a semantic directed graph is constructed over the attributes of fine-grained classes. Based on this graph, we develop a label propagation algorithm to infer the labels of images in the unseen classes. Experimental results on two benchmark datasets demonstrate that our model outperforms the state-of-the-art zero-shot learning models. In addition, the features obtained by our feature learning model also yield significant gains when they are used by other zero-shot learning models, which shows the flexibility of our model in zero-shot fine-grained classification.


I. INTRODUCTION
FINE-GRAINED image classification, which aims to recognize subordinate-level categories, has emerged as a popular research area in the computer vision community [1]-[4]. Different from general image recognition such as scene or object recognition, fine-grained image classification needs to explicitly distinguish images with subtle differences, which involves classifying many subclasses of objects belonging to the same class, such as birds [5]-[7], dogs [8] and plants [9], [10].
In general, fine-grained image classification is a challenging task due to two main issues:  (e.g. ImageNet [11]) is thus impractical. Therefore, how to recognize images from fine-grained classes when sufficient training data is not available for every class becomes a thought-provoking task in computer vision.
• Compared with general image recognition, fine-grained classification is more challenging because it must discriminate between objects that are visually similar to each other. As shown in Fig. 1, people can easily recognize that the objects in the red box are birds and the object in the blue box is a cow, but they fail to distinguish the two kinds of birds in the red box. This example demonstrates that fine-grained classification requires a more discriminative representation than general image classification.
Considering the lack of training data for every class in fine-grained classification, we can adopt zero-shot learning to recognize images from unseen classes without labelled training data. However, conventional zero-shot learning algorithms mainly explore the semantic relationship among classes (using textual information) and attempt to learn a match between images and their textual descriptions [13]-[15]. In other words, few works on zero-shot learning focus on feature learning. This is a serious limitation for fine-grained classification, which requires more discriminative features than general image recognition. Hence, we must pay our main attention to feature learning for zero-shot fine-grained image classification.
In this paper, we propose a two-phase framework to recognize images from unseen fine-grained classes, i.e. zero-shot fine-grained classification (ZSFC). The first phase of our model is to learn discriminative features. Most fine-grained classification models extract features from deep convolutional neural networks that are finetuned on images with extra annotations (e.g. bounding boxes of objects and part locations). However, these extra annotations are expensive to obtain. Different from these models, our model only exploits the implied hierarchical semantic structure among fine-grained classes for finetuning deep networks. The hierarchical semantic structure among classes is obtained from taxonomy, which can be easily collected from Wikipedia. In our model, we assume that experts recognize objects in fine-grained classes based on the discriminative visual features of images, with the hierarchical semantic structure among fine-grained classes as their prior knowledge. Under this assumption, we finetune deep convolutional neural networks using the hierarchical semantic structure among fine-grained classes to extract discriminative deep visual features. Meanwhile, a domain adaptation subnetwork is introduced into the proposed network to avoid the domain shift caused by the zero-shot setting.
Fig. 2. Overview of the proposed framework for zero-shot fine-grained image classification. The proposed framework contains two phases: feature learning and label inference. In the first phase, feature learning, hierarchical classification subnetworks and a domain adaptation structure are both integrated into VGG-16Net [12]. In the second phase, label inference, deep features from the first phase and a semantic directed graph constructed with class attributes are fed into a label propagation process to infer the labels of images in the unseen classes.
In the second phase, label inference, a semantic directed graph is first constructed over the attributes of fine-grained classes. Based on the semantic directed graph and the discriminative features obtained by our feature learning model, we develop a label propagation algorithm to infer the labels of images in the unseen classes. The flowchart of the proposed framework is illustrated in Fig. 2. Note that the proposed framework can be extended to a weakly supervised setting by replacing class attributes with semantic vectors extracted by word vector extractors (e.g. Word2Vec [16]).
To evaluate the effectiveness of the proposed model, we conduct experiments on two benchmark fine-grained image datasets (i.e. Caltech UCSD Birds-200-2011 [5] and Oxford Flower-102 [9]). Experimental results demonstrate that the proposed model outperforms the state-of-the-art zero-shot learning models in the task of zero-shot fine-grained classification. Moreover, we further test the features extracted by our feature learning model by applying them to other zero-shot learning models, and the significant gains obtained verify the effectiveness of our feature learning model.
The main contributions of this work are given as follows:
• We have proposed a two-phase learning framework for zero-shot fine-grained classification. Unlike most previous works, which focus on zero-shot learning, we pay more attention to feature learning.
• We have developed a deep feature learning method for fine-grained classification, which can learn discriminative features with the hierarchical semantic structure among classes and a domain adaptation structure. More notably, our feature learning method needs no extra annotations of images (e.g. part locations and bounding boxes of objects), which means that it can be readily used for different zero-shot fine-grained classification tasks.
• We have developed a zero-shot learning method for label inference from seen classes to unseen classes, which can help to address the lack of labelled training data in fine-grained image classification.
The remainder of this paper is organized as follows. Section II reviews related work on fine-grained classification and zero-shot learning. Section III gives the details of the proposed model for zero-shot fine-grained classification. Experimental results are presented in Section IV. Finally, conclusions are drawn in Section V.

II. RELATED WORKS

A. Fine-Grained Image Classification
There are two strategies widely used in existing fine-grained image classification algorithms. The first strategy distinguishes images according to the unique properties of object parts, which encourages the use of part-based algorithms that rely on localizing object parts and assigning them detailed attributes. Zhang et al. propose a part-based Region-based Convolutional Neural Network (R-CNN), where R-CNN is used to detect object parts and geometric relations among object parts are used for label inference [17]. Since R-CNN extracts too many proposals for each image, this algorithm is time-consuming. To solve this problem, Huang et al. propose a Part-Stacked Convolutional Neural Network (PS-CNN) [18], where a fully-convolutional network is used to detect object parts and a part-crop layer is introduced into AlexNet [19] to combine part/object features for classification. To address the limited scale of well-annotated data, Xu et al. propose an augmented part-based R-CNN that utilizes weakly labeled data from the web [20]. Different from these models, which mainly use large parts of images (i.e. proposals) for fine-grained classification, Zhang et al. detect semantic parts and classify images based on the features of these semantic parts [21]. However, the aforementioned part-based algorithms need very strong annotations (i.e. locations of parts), which are very expensive to acquire.
The second strategy is to exploit more discriminative visual representations, which is inspired by the recent success of CNNs in image recognition [22]. Lin et al. propose a bilinear CNN [23], which combines the outputs of two different feature extractors by using an outer product, to model local pairwise feature interactions in a translationally invariant manner. This structure can create robust representations and achieves significant improvement over the state of the art. Zhang et al. propose a deep filter selection strategy to choose suitable deep filters for each kind of part [24]. With suitable deep filters, they can detect parts more accurately and extract more discriminative features for fine-grained classification.
Note that the above models need extra annotations of images (e.g. bounding boxes of objects and locations of parts). Moreover, their training data include all fine-grained classes. When we only have training images from a subset of fine-grained classes, the domain shift problem will occur [25]. Besides, without extra object or part annotations, these models will fail. In contrast, our model needs no extra object or part annotations at either the training or the testing stage. Furthermore, a domain adaptation strategy is introduced into our model to avoid domain shift. In this way, we can learn more discriminative features for zero-shot fine-grained classification.

B. Zero-Shot Learning
Zero-shot learning, which aims to learn to classify in the absence of labeled data, is a challenging problem [26]-[31]. Recently, many approaches have been developed for zero-shot learning. Zhang et al. view testing instances as arising from seen instances and attempt to express test instances as a mixture of seen class proportions [14]. To this end, they propose a semantic similarity embedding (SSE) approach for zero-shot learning. Besides, they also formulate zero-shot learning as a binary classification problem and develop a joint discriminative learning framework based on dictionary learning to solve it [32]. Paredes et al. propose a general zero-shot learning framework that models the relationships between features, attributes, and classes as a network with two linear layers [13]. Bucher et al. address the task of zero-shot learning by formulating it as a metric learning problem, where a metric between class attributes and image visual features is learned for inferring the labels of test images [33]. A multi-cue framework facilitates a joint embedding of multiple language parts and visual information into a joint space to recognize images from unseen classes [34]. Considering the manifold structure of semantic categories, Fu et al. provide a novel zero-shot learning approach by formulating a semantic manifold distance between testing images and unseen classes [15]. To avoid domain shift between the sets of seen classes and unseen classes, Kodirov et al. propose a zero-shot learning method based on unsupervised domain adaptation [25]. Based on the observation that textual descriptions are noisy, Qiao et al. propose an $L_{2,1}$-norm based objective function to suppress the noisy signal in the text and learn a function to match the text document and the visual features of images [35]. However, the aforementioned works mainly focus on learning a match between images and their textual descriptions, and few of them pay attention to discriminative feature learning, which is crucial for fine-grained classification.

III. THE PROPOSED MODEL
In this section, we propose a two-phase framework for zero-shot fine-grained classification. A deep convolutional neural network that integrates the hierarchical semantic structure of classes and a domain adaptation strategy is first developed for feature learning, and a label propagation method based on a semantic directed graph is then proposed for label inference.

A. Feature Learning
Our main idea is motivated by the implied hierarchical semantic structure among fine-grained classes. For example, the winter wren (species-level name), a very small North American bird, belongs to 'Troglodytes' at the genus level and to 'Troglodytidae' at the family level (see Fig. 3). We assume that experts recognize objects in fine-grained classes by using discriminative visual features, with the hierarchical semantic structure among fine-grained classes as their prior knowledge. As shown in Fig. 2, lower-level features (from fewer network layers) are used for classifying images at a coarser level. In other words, to recognize images at a fine-grained level, we must exploit higher-level and fine-grained features.
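As a minimal illustration of how a species-level label could be expanded into the three-level labels used for finetuning, the sketch below looks up the genus and family of each species in a small taxonomy table. The table contents, the `TAXONOMY` name, and both helper functions are illustrative assumptions rather than part of the original pipeline; in practice the table would be collected from Wikipedia.

```python
# Hypothetical taxonomy table: species -> (genus, family).
TAXONOMY = {
    "winter wren": ("Troglodytes", "Troglodytidae"),
    "house wren": ("Troglodytes", "Troglodytidae"),
    "carolina wren": ("Thryothorus", "Troglodytidae"),
}


def hierarchical_labels(species):
    """Return the (family, genus, species) label triple for one species."""
    key = species.lower()
    genus, family = TAXONOMY[key]
    return family, genus, key


def build_label_index(names):
    """Map each distinct name to an integer class id (sorted for determinism)."""
    return {name: idx for idx, name in enumerate(sorted(set(names)))}


if __name__ == "__main__":
    triples = [hierarchical_labels(s) for s in TAXONOMY]
    print(build_label_index(t[0] for t in triples))  # family-level ids
    print(build_label_index(t[1] for t in triples))  # genus-level ids
```

Each training image then carries one integer label per level, so the three classification subnetworks can be trained from a single species annotation.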
To introduce the hierarchical semantic structure into feature learning, we integrate hierarchical classification subnetworks into VGG-16Net [12]. The detailed architectures of the hierarchical classification subnetworks are presented in Fig. 4. In our model, each classification subnetwork is designed to classify images into the semantic classes of the corresponding level (i.e. family level, genus level, or species level). Concretely, we place the classification subnetworks for family-level, genus-level, and species-level labels after the third, fourth, and fifth groups of convolutional layers, respectively (also see Fig. 2). For the family-level and genus-level classification subnetworks, the network structure includes a convolutional layer, two fully-connected layers, and a softmax activation layer (see Fig. 4). For the sake of quick convergence, we take the classification structure of VGG-16Net as the species-level classification subnetwork, which can be initialized with ImageNet pretrained parameters [11]. By merging VGG-16Net and the hierarchical classification subnetworks into one network, we define the loss function for image $x$ as:

$$L(x; \theta_F, \theta_f, \theta_g, \theta_s) = L_s(y_s, G_s(G(x; \theta_F); \theta_s)) + \mu_f L_f(y_f, G_f(G(x; \theta_F); \theta_f)) + \mu_g L_g(y_g, G_g(G(x; \theta_F); \theta_g)) \quad (1)$$

where $L_f$, $L_g$, and $L_s$ denote the losses of the family-, genus-, and species-level classification subnetworks, respectively. $y_f$, $y_g$, and $y_s$ denote the true labels of the image at the family, genus, and species levels, respectively. $\theta_F$ denotes the parameters of the feature extractor (the first five groups of convolutional layers) in VGG-16Net. $\theta_f$, $\theta_g$, and $\theta_s$ denote the parameters of the family-, genus-, and species-level classification subnetworks, respectively. $\mu_f$ and $\mu_g$ denote the weights of the losses of the family- and genus-level classification subnetworks, respectively. $G$ denotes the feature extractor of VGG-16Net, and $G_f$, $G_g$, and $G_s$ denote the family-, genus-, and species-level classification subnetworks, respectively.
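The weighted combination of the three classification losses can be sketched as follows. This is a minimal sketch, assuming that each subnetwork outputs softmax probabilities, that every level uses a cross-entropy loss, and that the species-level term has unit weight; the function names and the default values of `mu_f` and `mu_g` are hypothetical, since the text does not state the weights used.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class under softmax outputs."""
    return -math.log(probs[label])


def hierarchical_loss(p_family, p_genus, p_species, y_f, y_g, y_s,
                      mu_f=1.0, mu_g=1.0):
    """Species loss plus weighted family and genus losses (assumed form).

    mu_f and mu_g are placeholder weights, not values from the paper.
    """
    return (cross_entropy(p_species, y_s)
            + mu_f * cross_entropy(p_family, y_f)
            + mu_g * cross_entropy(p_genus, y_g))


if __name__ == "__main__":
    loss = hierarchical_loss(
        p_family=[0.9, 0.1], p_genus=[0.2, 0.8], p_species=[0.1, 0.6, 0.3],
        y_f=0, y_g=1, y_s=1, mu_f=0.5, mu_g=0.5)
    print(loss)
```

Setting both weights to zero recovers ordinary species-level finetuning, which makes the contribution of the coarser levels easy to ablate.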
Note that the labels of the training data do not include unseen classes, and thus domain shift will occur when we extract features for test images using deep neural networks trained on these training data [25]. To avoid domain shift, we add a domain adaptation structure [36], which includes a gradient reversal layer and a domain classifier, after the fifth group of convolutional layers in VGG-16Net (as shown in Fig. 2). The domain adaptation structure views training data and test data as two domains and aims to train a domain classifier that cannot distinguish the domain of a given sample. In this way, the difference between the features of data from the two domains can be eliminated. In our model, we aim to achieve an adversarial process, i.e. to learn features that can confuse the domain classifier while still classifying fine-grained classes. Therefore, we aim to minimize the loss of the hierarchical classification subnetworks and maximize the loss of the domain classifier. The gradient reversal layer (the GRL layer in Fig. 5) proposed by [36] is used to achieve this goal. In the following, we denote the domain classifier as $G_d$, which is also presented in Fig. 5. By merging the domain adaptation structure, the hierarchical classification subnetworks and VGG-16Net together, we define the total loss for image $x$ as: where $L_d$, $y_d$, $\mu_d$ and $\theta_d$ denote the loss of the domain classifier, the domain label of image $x$, the weight of the loss of the domain classifier, and the parameters of the domain classifier, respectively.
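The role of the gradient reversal layer can be illustrated with a small framework-free sketch: the layer passes features through unchanged in the forward pass and negates (and optionally scales) the gradient coming back from the domain classifier. The class name, the `scale` parameter, and the `feature_gradient` helper are hypothetical, and plain Python lists stand in for tensors; this is not the Caffe layer used in the experiments.

```python
class GradientReversal:
    """Identity in the forward pass; flips and scales gradients in the backward pass."""

    def __init__(self, scale=1.0):
        self.scale = scale  # hypothetical reversal strength

    def forward(self, features):
        # Features reach the domain classifier unchanged.
        return list(features)

    def backward(self, grad_from_domain_classifier):
        # The sign flip turns "minimize domain loss" into "maximize" for the features.
        return [-self.scale * g for g in grad_from_domain_classifier]


def feature_gradient(grad_classification, grad_domain, grl):
    """Gradient reaching the shared feature extractor from both branches."""
    reversed_grad = grl.backward(grad_domain)
    return [gc + gd for gc, gd in zip(grad_classification, reversed_grad)]


if __name__ == "__main__":
    grl = GradientReversal(scale=0.5)
    print(grl.forward([1.0, 2.0]))
    print(feature_gradient([1.0, 1.0], [2.0, -4.0], grl))
```

With this arrangement the domain classifier itself is still trained to separate the two domains, while the shared convolutional layers receive the reversed signal and are pushed towards domain-invariant features.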
To end this subsection, we qualitatively demonstrate the important role of the hierarchical semantic structure of fine-grained classes in extracting discriminative features for zero-shot fine-grained classification. Fig. 6 provides some samples of misclassified images when only species-level features are used, and Fig. 7 provides some samples of misclassified images when only species/genus-level features are used. It can be seen that the true labels and predicted labels of these misclassified images (in blue boxes) are hierarchically related, and these misclassified images can be correctly classified when higher-level features are used. That is, the hierarchical semantic structure of fine-grained classes can be used to capture more discriminative features.

B. Label Inference
In this subsection, with the discriminative features obtained from Section III-A, we provide a label propagation approach for zero-shot fine-grained image classification.
Let $S = \{s_1, ..., s_p\}$ denote the set of seen classes and $U = \{u_1, ..., u_q\}$ denote the set of unseen classes, where $p$ and $q$ are the total numbers of seen classes and unseen classes, respectively. These two sets of classes are disjoint, i.e. $S \cap U = \emptyset$. We are given a set of labeled training images $D_s = \{(x_i, y_i) : i = 1, ..., N_s\}$, where $x_i$ is the feature vector of the $i$-th image in the training set, $y_i \in S$ is the corresponding label, and $N_s$ denotes the total number of labeled images. Let $D_u = \{(x_j, y_j) : j = 1, ..., N_u\}$ denote a set of unlabeled test images, where $x_j$ is the feature vector of the $j$-th image in the test set, $y_j \in U$ is the corresponding unknown label, and $N_u$ denotes the total number of unlabeled images. The main goal of zero-shot learning is to predict $y_j$ by learning a classifier $f : X_u \to U$, where $X_u = \{x_j : j = 1, ..., N_u\}$.
For zero-shot fine-grained image classification, we first need to estimate the semantic relationships between seen and unseen classes, which will be used for predicting the labels of images in unseen classes. In this paper, we collect the attributes of each fine-grained class to form its semantic vector and further construct a semantic-directed graph $G = \{V, E\}$ over all the classes (both seen and unseen), where $V$ denotes the set of nodes (i.e. fine-grained classes) in the graph and $E$ denotes the set of directed edges between classes. The detailed steps of constructing the graph $G$ are given as follows:
• We first construct the edges between seen classes. Specifically, for each seen class, we perform the k-nearest-neighbors (k-NN) method on semantic vectors to find its $k_1$ nearest neighbors among the seen classes, and then construct a directed edge from this seen class to each of its neighbors. The weight of this edge is the negative exponential of the Euclidean distance between them (i.e. $e^{-d}$ for distance $d$).
• We then take the same strategy to construct edges between seen classes and unseen classes. Specifically, for each seen class, we perform the k-NN method on semantic vectors to find its $k_2$ nearest neighbors among the unseen classes, and then construct a directed edge from this seen class to each of its neighboring unseen classes. The weight of this edge is again the negative exponential of the Euclidean distance between them.
• Finally, each unseen class has only one edge, which points to itself with a weight of 1.
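The three construction steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions: semantic vectors are lists of floats, the edge weight is taken to be exp(-distance), and a seen class is excluded from its own neighbor list (the text does not say whether self-matches are allowed). The function names and the `skip_self` flag are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_edges(sources, targets, k, skip_self=False):
    """Link every source vector to its k nearest target vectors.

    Returns {(source_index, target_index): exp(-distance)}.
    """
    edges = {}
    for i, src in enumerate(sources):
        dists = [(euclidean(src, tgt), j) for j, tgt in enumerate(targets)
                 if not (skip_self and i == j)]
        for dist, j in sorted(dists)[:k]:
            edges[(i, j)] = math.exp(-dist)
    return edges


def build_graph(seen_vecs, unseen_vecs, k1, k2):
    """Return (seen->seen edges, seen->unseen edges, unseen self-loops)."""
    seen_seen = knn_edges(seen_vecs, seen_vecs, k1, skip_self=True)  # assumption
    seen_unseen = knn_edges(seen_vecs, unseen_vecs, k2)
    self_loops = {(j, j): 1.0 for j in range(len(unseen_vecs))}
    return seen_seen, seen_unseen, self_loops


if __name__ == "__main__":
    seen = [[0.0], [1.0], [3.0]]
    unseen = [[0.0], [3.0]]
    for part in build_graph(seen, unseen, k1=1, k2=1):
        print(part)
```

Because edges only leave seen classes and unseen classes carry self-loops, probability mass can flow from seen to unseen classes but never back.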
Algorithm 1 The Proposed Framework
Input: the set of labeled training images $D_s$; the set of test images in unseen classes $X_u$
Feature Learning:
1) Train the proposed neural network using the hierarchical semantic structure among fine-grained classes;
2) Run forward computation of the proposed neural network for each test image and extract deep features from the hierarchical classification subnetworks;
3) Concatenate the features from the hierarchical classification subnetworks to obtain deep features $F$;
Label Inference:
4) Compute the initial probabilities of test images belonging to unseen classes $Y$ with the LIBLINEAR toolbox [38] and deep features $F$;
5) Construct the semantic-directed graph based on semantic vectors;
6) Compute the normalized transition matrix $P$ according to Equations (3-5);
7) Find the solution $\tilde{Y}^*$ of the label propagation problem formulated in Equation (6)

By collecting the above edge weights, we can denote the weight matrix $W$ of the semantic-directed graph $G$ as:

$$W = \begin{bmatrix} R_1 & R_2 \\ 0 & I \end{bmatrix} \quad (3)$$

where $R_1 \in \mathbb{R}^{p \times p}$ collects the edge weights among seen classes, $R_2 \in \mathbb{R}^{p \times q}$ collects the edge weights between seen classes and unseen classes, and $I \in \mathbb{R}^{q \times q}$ denotes an identity matrix. A Markov chain process can be further defined over the graph $G$ by constructing the transition matrix:

$$T = D^{-1} W \quad (4)$$

where $D$ is a diagonal matrix with its $i$-th diagonal element being equal to the sum of the $i$-th row of $W$.
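The block structure of the weight matrix and the row normalization that yields the transition matrix can be sketched in a few lines. This is a minimal sketch using nested Python lists; the function names are hypothetical, and the zero block in the lower left is inferred from the statement that each unseen class has only a self-loop.

```python
def weight_matrix(r1, r2):
    """Assemble W = [[R1, R2], [0, I]] from nested lists.

    r1 is p x p (seen -> seen), r2 is p x q (seen -> unseen).
    """
    p, q = len(r1), len(r2[0])
    top = [list(r1[i]) + list(r2[i]) for i in range(p)]
    bottom = [[0.0] * p + [1.0 if j == i else 0.0 for j in range(q)]
              for i in range(q)]
    return top + bottom


def transition_matrix(w):
    """Row-normalize W, i.e. T = D^{-1} W with D the diagonal of row sums."""
    t = []
    for row in w:
        total = sum(row)
        if total == 0:
            raise ValueError("every class needs at least one outgoing edge")
        t.append([value / total for value in row])
    return t


if __name__ == "__main__":
    w = weight_matrix([[0.0, 1.0], [1.0, 0.0]], [[1.0], [3.0]])
    for row in transition_matrix(w):
        print(row)
```

Each row of the result sums to one, and the rows of unseen classes remain unit vectors, so unseen classes act as absorbing states of the Markov chain.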
To guarantee that the Markov chain process has a unique stationary solution [37], we normalize $T$ as: where $\eta$ is a normalization parameter (empirically set as $\eta = 0.001$), and $\mathbf{1}_{p+q}$ and $I_{p+q}$ are the all-ones matrix and the identity matrix of size $(p + q) \times (p + q)$, respectively. Based on the normalized transition matrix $P = [p_{uv}] \in \mathbb{R}^{(p+q) \times (p+q)}$, we formulate zero-shot fine-grained classification as the following label propagation problem: where $\tilde{Y}_{i\cdot}$ (the $i$-th row of $\tilde{Y} \in \mathbb{R}^{N_u \times (p+q)}$) and $Y_{i\cdot}$ (the $i$-th row of $Y \in \mathbb{R}^{N_u \times (p+q)}$) collect the optimal and initial probabilities of the $i$-th test image belonging to each class, respectively. That is, $Y$ is an initialization of $\tilde{Y}$, and $\tilde{Y}$ is the final solution of the problem formulated in Equation (6). Moreover, $\pi(u)$ is the sum of the $u$-th row of $P$ (i.e. $\sum_v p_{uv}$), and $\lambda$ is a regularization parameter. The first term of the above objective function sums the weighted variation of $\tilde{Y}_{i\cdot}$ on each edge of the directed graph $G$, which aims to ensure that $\tilde{Y}_{i\cdot}$ does not change too much between semantically similar classes for the $i$-th test image. The second term denotes an $L_2$-norm fitting constraint, which means that $\tilde{Y}_{i\cdot}$ should not change too much from $Y_{i\cdot}$.
To solve the above label propagation problem, we adopt the technique introduced in [37] and define the operator $\Theta$: where $\Pi$ is a $(p + q) \times (p + q)$ diagonal matrix with its $u$-th diagonal element being equal to $\pi(u)$. According to [37], the optimal solution $\tilde{Y}^*$ of the problem in Equation (6) is: where $I \in \mathbb{R}^{(p+q) \times (p+q)}$ denotes an identity matrix and $\alpha = 1/(1 + \lambda) \in (0, 1)$. To obtain the above solution, we need to provide $Y$ in advance. Note that each row of $Y$ consists of two parts: the probabilities of a test image belonging to seen classes, and the probabilities of a test image belonging to unseen classes. Given no labeled data in unseen classes, we directly set the probabilities of belonging to unseen classes to 0. To compute the initial probabilities of belonging to seen classes, we use the LIBLINEAR toolbox [38] to train an $L_2$-regularized logistic regression classifier. In general, we empirically set the parameter $c$ in $L_2$-regularized logistic regression to 0.01.
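The propagation step can be illustrated with a small sketch for a single test image. Two assumptions are made and should be checked against [37]: the operator is taken to be the symmetrized form $\Theta = (\Pi^{1/2} P \Pi^{-1/2} + \Pi^{-1/2} P^T \Pi^{1/2})/2$ that is common for regularization on directed graphs, and the closed-form solution is taken to be $(1 - \alpha) Y (I - \alpha\Theta)^{-1}$, which the code approximates by fixed-point iteration instead of a matrix inverse. The iteration only converges when $\alpha$ times the spectral radius of $\Theta$ is below one. All function names are hypothetical, and picking the highest-scoring unseen class as the prediction is also an assumption.

```python
def symmetrized_operator(p):
    """Theta = (Pi^{1/2} P Pi^{-1/2} + Pi^{-1/2} P^T Pi^{1/2}) / 2 (assumed form)."""
    n = len(p)
    pi = [sum(row) for row in p]  # pi(u): sum of the u-th row of P
    root = [x ** 0.5 for x in pi]
    return [[0.5 * (root[u] * p[u][v] / root[v] + p[v][u] * root[v] / root[u])
             for v in range(n)] for u in range(n)]


def initial_scores(seen_probs, num_unseen):
    """One row of Y: classifier probabilities for seen classes, zeros for unseen."""
    return list(seen_probs) + [0.0] * num_unseen


def propagate(y_row, theta, alpha, iterations=200):
    """Iterate y <- alpha * (y Theta) + (1 - alpha) * y0.

    If the iteration converges, its limit is (1 - alpha) * y0 (I - alpha Theta)^{-1}.
    """
    n = len(y_row)
    current = list(y_row)
    for _ in range(iterations):
        spread = [sum(current[u] * theta[u][v] for u in range(n))
                  for v in range(n)]
        current = [alpha * s + (1 - alpha) * y0 for s, y0 in zip(spread, y_row)]
    return current


def predict_unseen(scores, num_seen):
    """Index (among unseen classes) of the highest propagated score (assumption)."""
    unseen = scores[num_seen:]
    return max(range(len(unseen)), key=unseen.__getitem__)


if __name__ == "__main__":
    theta = symmetrized_operator([[0.5, 0.5], [0.0, 1.0]])
    y0 = initial_scores([1.0], num_unseen=1)
    print(propagate(y0, theta, alpha=0.5))
```

Although the unseen entries of the initial row are zero, the propagated row assigns them positive scores through the seen-to-unseen edges, which is what allows a label to be inferred for an unseen class.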
To sum up, by combining the coarse-to-fine feature learning and label propagation approaches, the complete algorithm for zero-shot fine-grained classification is outlined in Algorithm 1. It should be noted that the proposed approach can be extended to a weakly supervised setting by replacing class attributes with semantic vectors extracted by word vector extraction methods (e.g. word2vec [16]).

IV. EXPERIMENTAL RESULTS AND DISCUSSIONS

A. Experimental Setup
We conduct experiments on two benchmark fine-grained datasets, i.e. Caltech UCSD Birds-200-2011 [5] and Oxford Flower-102 [9]. Our experimental setup is given as follows.  [5]. Each species is associated with a Wikipedia article and organized by scientific classification (family, genus, species). Each class is also annotated with 312 visual attributes. In the zero-shot setting, we follow [32] to use 150 bird species as seen classes for training and the remaining 50 species as unseen classes for testing. The results are the average of four-fold cross-validation. For parameter validation, we also use a zero-shot setting within the 150 classes of the training set, i.e. we use 100 classes for training and the rest for validation. The hierarchical labels of fine-grained classes are collected from Wikipedia. For each fine-grained class, we use 312-d class attributes as the semantic description, or 300-d semantic vectors extracted by the word2vec model [16] (trained on Google News). Examples of images in CUB-200-2011 are shown in Fig. 8.   are collected from Wikipedia. For each fine-grained class, only 300-d semantic vectors extracted by the word2vec model [16] (trained on Google News) are used as the semantic description. Examples of Flower-102 are shown in Fig. 9.
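For the CUB-200-2011 protocol described above (150 seen and 50 unseen classes, averaged over four folds), one way to generate the class splits is sketched below. This is only an illustration: it assumes the four unseen sets are disjoint and drawn at random, whereas the actual experiments follow the split of [32], which is not reproduced here. The function name and the `seed` parameter are hypothetical.

```python
import random


def zero_shot_folds(num_classes=200, num_folds=4, seed=0):
    """Split class ids into disjoint folds; each fold serves once as the unseen set."""
    class_ids = list(range(num_classes))
    random.Random(seed).shuffle(class_ids)
    fold_size = num_classes // num_folds
    splits = []
    for k in range(num_folds):
        unseen = sorted(class_ids[k * fold_size:(k + 1) * fold_size])
        unseen_set = set(unseen)
        seen = [c for c in range(num_classes) if c not in unseen_set]
        splits.append((seen, unseen))
    return splits


if __name__ == "__main__":
    for seen, unseen in zero_shot_folds():
        print(len(seen), "seen classes,", len(unseen), "unseen classes")
```

The same helper can be applied again inside the 150 seen classes to obtain the 100/50 split used for parameter validation.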

B. Implementation Details
In the feature learning phase, the layers of VGG-16Net are pre-trained on ILSVRC 2012 1K classification [11] and then finetuned with the training data, while the other layers are trained from scratch. All input images are resized to 224×224 pixels. Stochastic gradient descent (SGD) [39] is used to optimize our model with a base learning rate of 0.01, a momentum of 0.9, a weight decay of 0.005 and a minibatch size of 20. For layers trained from scratch, the learning rate is 10 times the base learning rate. Our model is implemented with Caffe [40].
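The optimizer settings above can be collected into a small sketch of a single parameter update. The update rule shown is the common momentum formulation with weight decay added to the gradient, which is an assumption here; the exact behavior of the Caffe solver used in the experiments is not reproduced. The constant and function names are hypothetical.

```python
BASE_LR = 0.01           # base learning rate
MOMENTUM = 0.9
WEIGHT_DECAY = 0.005
BATCH_SIZE = 20
SCRATCH_LR_MULT = 10.0   # multiplier for layers trained from scratch


def sgd_step(weight, grad, velocity, lr_mult=1.0):
    """One momentum-SGD update for a single scalar parameter.

    Assumed rule: v <- momentum * v - lr * (grad + weight_decay * w); w <- w + v.
    """
    lr = BASE_LR * lr_mult
    velocity = MOMENTUM * velocity - lr * (grad + WEIGHT_DECAY * weight)
    return weight + velocity, velocity


if __name__ == "__main__":
    # Pretrained layer versus a layer trained from scratch, same gradient.
    print(sgd_step(1.0, 0.5, 0.0))
    print(sgd_step(1.0, 0.5, 0.0, lr_mult=SCRATCH_LR_MULT))
```

The larger multiplier lets the newly added family-level, genus-level and domain classifier layers move quickly while the pretrained VGG-16Net weights change more slowly.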
Note that the features of each level are extracted from the penultimate fully-connected layer before the corresponding softmax layer. Hence, we finally obtain three kinds of features, which are used to classify images at different levels. To find a good way to combine these features, we conduct experiments with the proposed model using concatenations of different-level features. The results are shown in Table I. It can be seen that high-level features perform better than features extracted from the shallow layers. Furthermore, we find that the combination of three-level features performs the best. In the following, we use the concatenation of three-level features as the final deep visual features in our model.

C. Effectiveness of Deep Feature Learning
To test the effectiveness of the proposed feature learning approach, we apply the features it extracts to different zero-shot learning models (e.g. [13]-[15]) under the same setting. The results are given in Fig. 10. It can be seen that the proposed feature learning approach works well in all the zero-shot learning models. This observation can be explained as follows. Compared with the traditional VGG-16Net pretrained on ImageNet, the proposed feature learning approach takes the hierarchical semantic structure of classes and the domain adaptation structure into account, and thus succeeds in generating more discriminative features for zero-shot fine-grained classification.

D. Comparison to the State of the Art
1) Test with Class Attributes: We compare the proposed approach with the state-of-the-art zero-shot fine-grained classification approaches [13]-[15], [32]-[34] using class attributes, as shown in Table II. Since Flowers-102 provides no class attributes, we do not present results on this dataset. In this table, 'ZC' denotes the zero-shot learning approach based on label propagation, 'VGG-16Net' denotes the features obtained from VGG-16Net [12] (pretrained on ImageNet [11]), 'HCS' denotes the hierarchical classification subnetworks proposed in Section III-A, and 'DA' denotes the domain adaptation structure proposed in Section III-A. It can be seen that the proposed approach significantly outperforms the state-of-the-art zero-shot learning approaches. That is, both feature learning and label inference used in the proposed approach are crucial for zero-shot fine-grained classification. Moreover, the comparison between 'ZC' and 'Ours-2' demonstrates that the proposed feature learning approach is extremely effective in the task of zero-shot fine-grained classification. Additionally, the comparison between 'Ours-1' and 'Ours-2' demonstrates that the domain adaptation structure is important for feature learning in the task of zero-shot fine-grained classification. It should be noted that [34] has achieved an accuracy of 56.5% using a multi-cue framework, where locations of parts are used as very strong supervision in both the training and test processes. The accuracy of 49.5% reported in Table II is its classification result when only annotations of the whole images are used (without locations of parts). The superior performance of the proposed approach compared with [34] further verifies the effectiveness of the proposed approach in zero-shot fine-grained classification.
2) Test with Semantic Vectors: We also evaluate the proposed approach in the weakly supervised setting, where only fine-grained labels of training images are given and the semantics among fine-grained classes are learned from text descriptions. Table III provides the classification results on both CUB-200-2011 and Flowers-102 in the weakly supervised setting. From this table, we can still observe that the proposed approach outperforms the state-of-the-art approaches, which further verifies the effectiveness of the proposed approach.

V. CONCLUSION
In this paper, we propose a two-phase framework for zero-shot fine-grained classification, which can recognize images from unseen fine-grained classes. In our approach, a feature learning strategy based on the hierarchical semantic structure of fine-grained classes and a domain adaptation structure is developed to generate robust and discriminative features, and then a label propagation method based on a semantic directed graph is proposed for label inference. Experimental results on the benchmark fine-grained classification datasets demonstrate that the proposed approach outperforms the state-of-the-art zero-shot learning algorithms. Our approach can also be extended to the weakly supervised setting (i.e. only fine-grained labels of training images are given), where it achieves better results than the state of the art. In future work, we will develop more powerful word vector extractors to explore better semantic relationships among fine-grained classes, and optimize the feature extractors and word vector extractors simultaneously.