1 Introduction

Image classification has long been an active area of research, which aims to classify images into pre-defined categories, and helps people to know what kind of objects the images contain. Traditionally, image classification is mainly performed on small datasets, by encoding local hand-crafted features and using them as input for classifiers [20, 47, 51].

In recent years, two fundamental changes occurred for this task: first, the number of digital images has been increasing exponentially. This brings people more alternatives, and more difficulties, in finding relevant images from this large volume of data. To help people access data, in an effortless and meaningful way, we need a good semantic organization of the categories. Second, deep learning methods have proven to be successful for image classification. In recent years, researchers have built various deep structures [12, 35, 41], and have achieved quite accurate predictions on small datasets [4, 10].

As a consequence, the current research focus has moved to larger and more challenging datasets [3, 34], such as ImageNet [6]. Such datasets often organize the large number of categories in a hierarchy, according to their semantic belongings. The deeper one goes in the hierarchy, the more specific the category is. In contrast to current approaches, which only focus on the leaf categories, we argue that generating hierarchical labels in a coarse-to-fine pattern can present how the semantic categories evolve, and thus can better describe what the objects are. For example, for Fig. 1c, the predicted leaf-category label is ‘Triceratops’. Without specialized knowledge, we cannot learn that this category label belongs to the higher level category label ‘Horned Dinosaur’.

Fig. 1
figure 1

The example images with the ‘coarse → fine’ label

The first contribution of this paper is a framework capable of generating hierarchical labels, by integrating the powerful Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). CNN is used to generate discriminative features, and RNN is used to generate sequential labels.

There are several notable advantages for the CNN-RNN framework:

  1. (1)

    Learning things in a hierarchical way is consistent with human perception and concept organization. By predicting the labels in a coarse-to-fine pattern, we can better understand what the objects are, such as depicted in Fig. 1c.

  2. (2)

    It can exploit the relationship between the coarse and fine categories, which, in turn, helps the traditional image classification task. For example, when we build the CNN-RNN framework with wrn-28-10 [53], we can increase the accuracy of coarse and fine categories by 2.8% and 1.68%, respectively. To the best of our knowledge, this is the first work trying to employ RNN to improve the classification performance by exploiting the relationship between hierarchical labels.

  3. (3)

    It is transferrable. In principle, the framework can be built on top of any CNN architecture which is primarily intended for single-level classification, and boost the performance for each hierarchical level. To verify this, we have conducted extensive experiments with three high-performance networks, i.e. CNN-7 [25], wrn-28-10 [53] and our proposed wider-Resnet.

  4. (4)

    The structure can be trained end-to-end. In contrast to other methods which can only model the category relationship with pre-computed image features [7, 30], the CNN-RNN framework can jointly learn the features and relationship in an end-to-end way, which can improve the final predictions considerably. For a subset of ImageNet 2010, we compared employing pre-computed CNN features to train the RNN with end-to-end training the CNN and RNN and demonstrated a significant improvement of the subcategory accuracy from 77.27% to 82%.

  5. (5)

    The number of the hierarchical labels can be variable. The flexibility of RNN allows us to generate hierarchical labels of different lengths, i.e. more specific categories would have more hierarchical labels. We have demonstrated this property on the widely used ImageNet 2012 dataset [32].

As the framework is transferrable, we intend to build a high performance CNN model, and utilize its CNN-RNN variant to further boost the accuracy. Therefore, the second contribution of this paper is, we build a high performance network, i.e. wider-ResNet. In recent years, deep residual networks (ResNet) [12] have attracted great attention because of its leading performance in several image classification tasks and Zagoruyko et al. [53] presented a thorough experimental study about several important aspects of ResNet, such as the width and depth, and proposed a wide Resnet that obtained better performance than the original ResNet. We intend to further enhance the performance and build a wider ResNet compared to [53]. Our implementation shows that, the wider-Resnet performs better than [53] on CIFAR-100, and also outperforms the original ResNet with thousands layers. In addition, by utilizing the CNN-RNN framework, we obtain considerably better results than the state-of-the-art.

The performance of deep models has benefited from the accurate and large-scale annotations, such as ImageNet [6]. However, manual labeling is an excessively tedious and expensive task, especially for the fine-grained classes, which often require expert knowledge (e.g. breeds of dogs, flower species, etc.). For example, for Fig. 1a and b, it is easy to annotate the images with the coarse label ‘dog’, but it requires specialized knowledge to divide them into subcategories ‘Bsenji’ and ‘Leonberg’. One optional thought is, if a part of the training data is only annotated with coarse category labels, whether we could utilize the coarse-labeled training data to improve the classification performance of fine categories?

The third contribution of this paper is, we investigate how to utilize the CNN-RNN framework to improve the subcategory classification when a fraction of the training data only has coarse labels. By training the CNN-RNN framework on the fully annotated data in the training set, we can exploit the relationship between the coarse and fine categories. Thereby, we can predict the fine labels of the coarse-labeled training data, and then re-train the CNN-RNN model. Experimental results demonstrate that the coarse-labeled training data can normally help the subcategory classification. In some cases, it can even surpass the performance of fully annotated training data. This alleviates the expensive process of fine-grained labeling.

2 Related work

2.1 Usage of CNN-RNN framework

In recent years, deep learning methods have attracted significant attention [11] and have achieved revolutionary successes in various applications [12, 24]. Two important structures for deep learning are CNN and RNN. CNN has proven to be successful in processing image-like data, while RNN is more appropriate in modeling sequential data. Recently, several works [8, 23, 44, 48, 52, 54] have attempted to combine them together, and have built various CNN-RNN frameworks. Generally, the combination can be divided in two types: the unified combination and the cascaded combination.

The unified combination often attempts to introduce a recurrent property into the traditional CNN structure in order to increase the classification performance. For example, Zuo et al. [54] converted each image into 1D spatial sequences by concatenating the CNN features of different regions, and utilized RNN to learn the spatial dependencies of image regions. Similar work appeared in [46]. The proposed ReNet replaced the ubiquitous convolutional+pooling layer with four recurrent neural networks that sweep horizontally and vertically in both directions across the image. In order to improve the multi-label classification, Wang et al. [48] presented the CNN-RNN framework to learn a joint embedding space in modeling semantic label dependency as well as the image-label relevance.

On the other hand, the cascaded combination would process the CNN and RNN separately, where the RNN takes the output of CNN as input, and returns sequential predictions of different timesteps. The cascaded CNN-RNN frameworks are often intended for different tasks, rather than image classification. For example, [8, 45, 52] employed CNN-RNN to address the image captioning task, and [50] utilized CNN-RNN to rank the tag list based on the visual importance.

In this paper, we propose to utilize the cascaded CNN-RNN framework to address a new task, i.e. hierarchical image classification, where we utilize CNN to generate discriminative image features, and utilize RNN to model the sequential relationship of hierarchical labels.

2.2 Hierarchical models for image classification

Hierarchical models have been used extensively for image classification. For example, Salakhutdinov et al. [33] presented a hierarchical classification model to share features between categories, and boosted the classification performance for objects with few training examples. Yan et al. [49] presented a hierarchical deep CNN (HD-CNN) that consists of a coarse component trained over all classes as well as several fine components trained over subsets of classes. Instead of utilizing a fixed architecture for classification, Murdock et al. [29] proposed a regularization method, i.e. Blockout, to automatically learn the hierarchical structure.

Another pipeline to employ hierarchical models tends to improve the classification performance by exploiting the relationship of the categories in the hierarchy. For instance, Deng et al. [7] introduced HEX graphs to capture the hierarchical and exclusive relationship between categories. Ristin et al. [30] utilized Random Forests and proposed a regularized objective function to model the relationship between the categories and subcategories. This type of hierarchical models can not only improve the traditional image classification performance, but also provide an alternative way to utilize the coarse-labeled training data.

In contrast to previous works, our paper utilizes RNN to exploit the hierarchical relationship between coarse and fine categories, and aims to adapt the model to address the hierarchical image classification task, in which we simultaneously generate hierarchical labels for the images. Compared with [7, 30] that can only process the pre-computed image features, our proposed CNN-RNN framework can be trained end-to-end.

3 Our proposed scheme

The goal of our approach is to simultaneously generate hierarchical labels of the images. To this end, we can employ two types of generators: a CNN-based generator and a CNN-RNN generator. Both of them keep the preceding layers of the basic CNN structure except for the last layer.

3.1 CNN-based generator

A CNN-based generator aims to generate coarse and fine labels by utilizing the conventional CNN structure. It acts as a common practice to fulfill this specific task. In this paper, we replace the last layer of conventional CNN with two layers, through which to provide separate supervisory signals for both the coarse categories and fine categories. The two layers can be arranged either in a serial pattern (Fig. 2: Strategy 1 & 2), or in a parallel pattern (Fig. 2: Strategy 3).

Fig. 2
figure 2

The illustration of the four strategies which can jointly train and generate the coarse and fine labels

During the training phase, we utilize the softmax loss function to jointly optimize the coarse and fine label predictions, as defined in (1).

$$\begin{array}{@{}rcl@{}} Loss = -\frac{1}{N}\sum\limits_{i = 1}^{N}\left( \sum\limits_{j = 1}^{C}1\left\{x^{i}=j\right\}\log p_{j} + \sum\limits_{k = 1}^{F}1\left\{y^{i}=k\right\}\log p_{k}\right) \end{array} $$
(1)

Where 1{⋅} is the indicator function. N, C, F denote the number of the images, coarse categories, and fine categories, respectively. p j and p k are the softmax probabilities of the coarse and fine categories, respectively.

During the inference phase, we can utilize the trained network to determine the coarse and fine labels at the same time.

There are two potential drawbacks for the CNN-based generator: first, it treats the two supervisory signals individually, and does not exploit the relationship between them. Second, when the hierarchy is of variable length, we cannot define a universal CNN-based generator to simultaneously determine the hierarchical labels.

3.2 CNN-RNN generator

A CNN-RNN generator determines hierarchical predictions using an architecture where the last layer of a CNN is replaced by an RNN (Fig. 2: Strategy 4).

RNN [9] is a class of artificial neural networks where connections between units form a directed cycle, as shown in Fig. 3. It can effectively model the dynamic temporal behavior of sequences with arbitrary lengths. However, RNN suffers from the vanishing and exploding gradient problem since the gradients need to propagate down through many layers of the recurrent network. Therefore, it is difficult to model the long-term dynamics. In contrast, Long-Short Term Memory (LSTM) [14] provides a solution by incorporating a memory cell to encode knowledge at each time step. Specifically, the behavior of the cell is controlled by three gates: an input gate, a forget gate and an output gate. These gates are used to control how much it should read its input (input gate i), whether to forget the current cell value (forget gate f) and whether to output the new cell value (output gate o). These gates help the input signal to prorogate through the recurrent hidden states without affecting the output, therefore, LSTM can deal well with exploding and vanishing gradients, and effectively model long-term temporal dynamics that RNN is not capable of learning.

Fig. 3
figure 3

The pipeline of RNN(left) and LSTM(right)

In this paper, we use LSTM neurons as our recurrent neurons. The definition of the gates and the update of LSTM at the timestep t are as follows:

$$\begin{array}{@{}rcl@{}} && i_{t} = \sigma (W_{xi}x_{t} + W_{hi}h_{t-1} + W_{vi}v + b_{i}) \end{array} $$
(2)
$$\begin{array}{@{}rcl@{}} && f_{t} = \sigma (W_{xf}x_{t} + W_{hf}h_{t-1} + W_{vf}v + b_{f}) \end{array} $$
(3)
$$\begin{array}{@{}rcl@{}} && o_{t} = \sigma (W_{xo}x_{t} + W_{ho}h_{t-1} + W_{vo}v + b_{o}) \end{array} $$
(4)
$$\begin{array}{@{}rcl@{}} && g_{t} = \varphi (W_{xc}x_{t} + W_{hc}h_{t-1} + W_{vc}v + b_{c}) \end{array} $$
(5)
$$\begin{array}{@{}rcl@{}} && c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot g_{t} \end{array} $$
(6)
$$\begin{array}{@{}rcl@{}} && h_{t} = o_{t} \odot \varphi (c_{t}) \end{array} $$
(7)

Where ⊙ represents the product operation, σ is the sigmoid function (σ(x) = (1 + e x)− 1), and φ is the hyperbolic tangent function (\(\varphi (x)=\frac {e^{x}-e^{-x}}{e^{x}+e^{-x}}\)). The definition for other symbols are: i t , f t , o t , g t denote the input gate, forget gate, output gate, and input modulation gate, respectively. x, h, v and c represent the input vector, hidden state, image visual feature, and memory cell, respectively. We propose to impose the image visual feature v at each timestep when updating the LSTM. W and b are the weights and bias that need to be learned.

The goal of our approach is to generate hierarchical labels for images. The labels are ordered in a coarse-to-fine pattern, i.e. coarser labels appear at the front of the list. To this end, we merge the C coarse categories and F fine categories as C + F super categories. For different timesteps, the CNN-RNN generator takes the labels of different levels as input, where the coarser-level labels appear at the preceding timesteps. In this way, the coarser-level labels can provide insightful information for the prediction of finer labels. The procedure is shown in Fig. 4.

Fig. 4
figure 4

The pipeline of the CNN-RNN framework

During the training phase, the CNN-RNN generator utilizes the groundtruth coarser-level labels as input, and jointly optimizes the coarse and fine predictions, as denoted in (8).

$$\begin{array}{@{}rcl@{}} Loss = -\frac{1}{N}\sum\limits_{i = 1}^{N}\left( \sum\limits_{t = 1}^{T}\sum\limits_{j = 1}^{C+F}1\left\{{x_{t}^{i}}=j\right\}\log p_{j}\right) \end{array} $$
(8)

During the inference phase, when the groundtruth coarser-level labels are not available, the CNN-RNN generator first predicts the maximum likelihood label for current timestep, i.e. \(W_{t}=argmax_{W_{t-1}}p(W_{t-1}|I)\), and then utilize the predicted label as the input for the next timestep.

Furthermore, we visualize the LSTM activations of different timesteps using the t-SNE technique [43] on CIFAR-100 [18], as shown in Fig. 5. The two activations are intended for the coarse and fine predictions, respectively. We can see that, the activation of the second timestep is more discriminative than that of the first timestep.

Fig. 5
figure 5

The visualization of the LSTM activations of different timesteps. The left one is the feature intended for the coarse label classification, while the right one is used for the fine label classification

As the CNN-RNN generator defines the super categories, and it equally trains and predicts the super categories, we do not need to design specific networks for the categories of different levels. Therefore, the CNN-RNN generator is robust, and can be employed to generate hierarchical labels of different lengths.

4 Experiments

We perform our experiments on three well-known datasets: CIFAR-100 [18], ImageNet 2012 [32] and a subset of ImageNet 2010 [30]. These three datasets have provided hierarchical image labels. The characteristics of the three datasets are summarized in Table 1.

Table 1 The characteristics of the datasets, including the depth of the hierarchy, the number of the coarse categories and fine categories

The performance is measured based on the top-1 accuracy. All the experiments are conducted using the Caffe [16] library with a NVIDIA TITAN X card.

The experiments can be divided into two parts. In the first part, we evaluate the performance of hierarchical predictions. In the second part, we investigate the performance of subcategory classification when only a part of the training data is labeled with fine labels while the rest only has coarse labels.

4.1 Hierarchical predictions

We evaluate the hierarchical predictions on two widely-used datasets: CIFAR-100 [18] and ImageNet 2012 [32].

4.1.1 CIFAR-100

CIFAR-100 contains 100 classes, and each class has 500 training images and 100 test images. These classes are further grouped into 20 superclasses. Therefore, each image comes with two level labels: a fine label (the class to which it belongs) and a coarse label (the superclass to which it belongs). For data preprocessing, we normalize the data using the channel means and standard deviations. The symbol ‘+’ means a standard data augmentation, i.e. first zero-padded with 4 pixels on each side, and randomly crop 32 × 32 images from the padded images, or their horizontal reflections.

Evaluation of the hierarchical image classification task

The CNN-based generator and CNN-RNN generator are considered as two alternative structures to fulfill the hierarchical image classification task. In contrast to CNN-based generator, the CNN-RNN generator can effectively exploit the dependency of the hierarchical labels, and thereby achieving a better classification performance for both the coarse and fine categories. We compare their performance in Table 2.

Table 2 The comparison of the accuracy for the coarse categories and fine categories. Best results are in bold face

All of the evaluations in this part are conducted based on the CNN model proposed in [25], because of its high training efficiency and decent performance on CIFAR-100. The CNN structure is shown in Fig. 6. We employ exactly the same experimental configuration as used in [25].

Fig. 6
figure 6

The CNN baseline proposed in [25]

As can be seen, the CNN-RNN generator can significantly outperform the CNN-based generator, both for the coarse/fine predictions with/without data augmentation. Specifically, for the coarse predictions, the CNN-RNN generator outperforms the CNN-based generator by at least 5.05%, while for the fine predictions, the CNN-RNN generator is even more advantageous, with an improvement of more than 6.7%. This demonstrates that, by exploiting the latent relationship between the coarse and fine categories, RNN can properly address the hierarchical-based task.

Evaluation of the traditional image classification task

The traditional image classification task consists of classifying images into one pre-defined category, rather than multiple hierarchical categories.

As the CNN-RNN generator can simultaneously generate the coarse and fine labels, in this part, we further compare its performance with ‘coarse-specific’ and ‘fine-specific’ networks. The ‘fine-specific’ network uses the common CNN structure which is specifically employed for the fine category classification. The ‘coarse-specific’ network shares the same preceding layers with the ‘fine-specific’ network, where the last layer is adapted to equal the coarse category number, e.g. 20 for CIFAR-100.

The coarse-specific, fine-specific and CNN-RNN framework can be constructed based on any CNN architecture. To make the comparison more general and convincing, we evaluate the performance on three networks: CNN-7 [25], wrn-28-10 [53] and our proposed wider-Resnet.

For wrn-28-10, we adopt the version with dropout [39], and train the network with larger mini-batch size (i.e. 200), and more iterations (a total of 7 × 104 iterations, and the learning rate dropped at 2 × 104, 4 × 104, 6 × 104 iterations). Other experimental configuration follows [53].

The structure of our proposed wider-Resnet is shown in Table 3. We adopt the pre-activation residual block as in [13], and train the models for a total of 7 × 104 iterations, with a mini-batch size of 200, a weight decay of 0.0005 and a momentum of 0.9. The learning rate is initialized with 0.1, and is dropped by 0.1 at 4 × 104 and 6 × 104 iterations.

Table 3 The framework of our proposed wider-Resnet

The results on these three datasets are shown in Table 4. We can see that, CNN-RNN can simultaneously generate the coarse and fine labels without developing two separate models, and the accuracy for both categories outperforms the specific networks. Take our proposed wider-Resnet as an example, the CNN-RNN structure increases the coarse and fine accuracy by 2.85% and 1.17% respectively, over the coarse-specific and fine-specific networks. This advantage demonstrates that, by exploiting the latent relationship of the coarse and fine categories, CNN-RNN can help the traditional image classification task.

Table 4 The comparison of the accuracy for the coarse categories and fine categories. For each network, CNN-RNN could get better results and their results are bolded

Our implementation of wrn-28-10 [53] cannot reproduce the original published results, possibly as a result of the differences in the platforms (Torch v.s. Caffe), or the differences in the preprocessing step (pad with reflections of original image v.s. pad with zero). Nevertheless, we can still improve the coarse and fine accuracy by 2.8% and 1.68% respectively, through utilizing the CNN-RNN structure.

Comparison with the state-of-the-art

We compare our wider-Resnet network, as well as its CNN-RNN variant, with the state-of-the-art, as is shown in Table 5.

Table 5 The test error of different methods on the CIFAR-100 dataset with standard data augmentation (translation/mirroring). Best results are in bold face

Through the comparison, we further demonstrate the superiority of the wider networks on CIFAR-100 dataset, as our not-very-deep wider-Resnet network (29 layers) surpasses the performance of the ResNet with super deep layers (1001 layers). In comparison with another wide ResNet [53] with similar depth, wider-Resnet also demonstrates great improvements and remarkably reduces the classification error from 25.45% to 22.03%, under the same platform and pre-processing step.

Overall, our proposed wider-Resnet achieves the best performance over previous works, and wider-Resnet-RNN further increases the state-of-the-art to 20.86%. Nevertheless, we are still seeking to build the CNN-RNN framework on top of future state-of-the-art architectures to boost the classification performance.

4.1.2 ImageNet 2012

One notable advantage of RNN is that it can generate sequences with variable lengths. To demonstrate this, we investigate the CNN-RNN framework on the widely used ImageNet 2012 dataset [32].

ImageNet is an image dataset organized according to the WordNet hierarchy [27]. It is larger in scale and diversity than other image classification datasets. ImageNet 2012 uses a subset of ImageNet with roughly 1300 images in each of 1000 categories. The images are annotated with hierarchical labels of different lengths. In total, there are about 1.2 million training images and 50000 validation images. For all the experiments, we train our model on the training images, and test on the validation images of the ImageNet 2012 dataset.

We utilize the ResNet-152 [12] as our CNN model. For simplicity, pre-trained model weights are kept fixed without fine-tuning. For the RNN model, we use 1000 dimensions for the embedding and the size of the LSTM memory. During the experiments, we first resize all the images to 224 × 224 pixels and extract the last pooling features utilizing ResNet-152, and then send the features into LSTM for modeling the category dependency.

Figure 7 demonstrates the hierarchical predictions for some example images, from which we can observe that: first, RNN is able to generate predictions with different lengths, and more specific categories would have more hierarchical labels. Second, the hierarchical labels can describe how the fine categories are evolved from higher level coarse categories, and thus can provide us a better understanding of the objects. Consider for example the upper image in Fig. 7a, we may get confused with the leaf-level label: ‘Granny Smith’. But when the coarse-level labels are provided, we observe that ‘Granny Smith’ is a breed of apple. Third, it may be more difficult to classify images into leaf-level categories than branch-level categories. When we get faulty leaf-level predictions for the given image, we might still learn what the image depicts from the coarse predictions, as shown in Fig. 7b.

Fig. 7
figure 7

The hierarchical predictions of some example images. a and c show some positive examples. b shows the examples with partly wrong predictions, e.g. correct coarse labels & wrong fine labels. d shows examples in the same category as (c), but which have totally wrong predictions

4.2 From coarse categories to fine categories

In the previous section, we have investigated the hierarchical classification performance of CNN-RNN when all of the coarse and fine labels are available for the training data. However, annotating fine labels for large amounts of training data is quite expensive, especially when it requires expert knowledge. In this subsection, we focus on a scenario in which a part of the training data is annotated with fine labels, while the rest only has coarse labels. This can be viewed as a special case of weakly supervised learning, and has ever been investigated in [30].

We follow the experiment setup of [30], and conduct our experiment on a subset of ImageNet 2010. This dataset particularly selected the classes from ImageNet 2010 that have a unique parent class, and obtained 143 coarse classes and 387 fine ones accordingly. The reduced training set contains 487K images where each coarse class has between 1.4K and 9.8K images, and each fine class has between 668 and 2.4K images. The test set contains 21450 images, and each coarse class has 150 images. More details about the dataset can be found in (http://www.vision.ee.ethz.ch/datasets_extra/mristin/ristin_et_al_cvpr15_data.zip).

All of the image features are extracted from the VGG-Net [35], as was done for the preliminary experiments in [30].

Evaluation of the classification performance when all of the training fine labels are available

When all of the coarse and fine labels are available, we can directly train the RNN on the full training set, and evaluate the classification performance on the test set. To better demonstrate the advantage of RNN, we further conduct the training process on a fraction of the training set. In addition, we investigate how much the performance may improve when the coarse labels are provided for the test data, and when we train the CNN-RNN in an end-to-end way, rather than with the off-the-shelf image features. As a comparison with CNN-RNN framework, we also finetune the VGG-Net on the ImageNet 2010 subset. The results are shown in Table 6.

Table 6 Accuracy for classifying fine labels using the ImageNet 2010 subset described in [30]

We can notice that, training on more data results in a more powerful RNN model, and thus can achieve better performance. Compared with the models trained on parts of the training set, i.e. 0.2S and 0.4S, utilizing the full training set S shows an improvement of 2.18% and 1.1%, respectively. It reveals that, a large training dataset is essential in training the deep models.

In contrast to other methods listed in [30], RNN achieves superior classification performance by inherently exploiting the relationship between the coarse and fine categories. Notably, RNN can deliver better performance even utilizing only 20 percent of the training data.

One additional advantage of the CNN-RNN framework is that it can be trained end-to-end. Compared with the predictions generated with off-the-shelf CNN features, jointly training the CNN and RNN results in a significant improvement, from 77.27% to 82%. It is also much better than directly finetuning the VGG-Net on the ImageNet 2010 subset (82% v.s. 76.01%). When provided the coarse labels for the test images, CNN-RNN achieves an accuracy of 90.69%.

Evaluation of the classification performance when part of the fine labels for training are missing

The training set S in this part are randomly divided into two disjoint sets: S c o a r s e and S f i n e . S c o a r s e has only the coarse labels, while S f i n e has both coarse and fine labels. We vary |S f i n e |∈{0.1|S|,0.2|S|,0.5|S|}, and for each S f i n e , we further vary |S c o a r s e |∈{0.1|S|,0.2|S|,0.5|S|}.

For each training/test configuration, we conduct three evaluations:

  1. 1)

    S f i n e : We train the RNN on S f i n e , and evaluate on the test set;

  2. 2)

    S f i n e + \(S_{coarse}^{-}\): We first train the RNN on S f i n e , and use it to predict the fine labels of S c o a r s e . In this way, we obtain a new training set \(S_{coarse}^{-}\), which contains both coarse and (predicted) fine labels. Next, we utilize the S f i n e and \(S_{coarse}^{-}\) to re-train the RNN, and evaluate on the test set.

  3. 3)

    S f i n e + \(S_{coarse}^{+}\): We train the RNN on S f i n e and \(S_{coarse}^{+}\), and evaluate on the test set. \(S_{coarse}^{+}\) means we utilize the groundtruth fine labels of S c o a r s e .

The results are shown in Fig. 8.

Fig. 8
figure 8

The classification performance with different training/test set

In general, S f i n e + \(S_{coarse}^{-}\) performs better than S f i n e , indicating that even some of the fine labels for the training data are missing, the fine category classification can benefit from the CNN-RNN structure.

Since the fine labels of S c o a r s e are predicted by the RNN trained on S f i n e , their accuracy cannot be guaranteed. As a consequence, the second training of RNN may be conducted on a partly wrong labeled dataset. This is particularly severe when |S f i n e | is small. As we can see in Fig. 8a, when |S f i n e | = 0.1|S|, the classification hardly benefited from using S c o a r s e when compared to the RNN trained solely on S f i n e .

On the contrary, when |S f i n e | is large, e.g. |S f i n e | = 0.5|S|, we can achieve a considerable improvement by incorporating S c o a r s e . Notably, when |S f i n e | = 0.5|S|,|S c o a r s e | = 0.1|S|, S f i n e + \(S_{coarse}^{-}\) even performs slightly better than S f i n e + \(S_{coarse}^{+}\), demonstrating its great potential in weakly supervised classification.

We further compare our method with the NN-H-RNCMF [30], which also attempted to improve the classification by exploiting the hierarchy. We set the amount of coarse-labeled data to |S c o a r s e | = 0.5|S|, and vary |S f i n e |∈{0.1|S|,0.2|S|,0.5|S|}, and the results are shown in Table 7. It can be seen that, RNN performs much better than NN-H-RNCMF in all configurations, demonstrating its great potential in exploiting the hierarchical relationship.

Table 7 Accuracy in classifying fine categories for the test set. We set the amount of coarse-labeled data to |S c o a r s e | = 0.5|S|. Best results are in bold face

5 Conclusion

In this paper, we proposed to integrate CNN and RNN to accomplish hierarchical classification task. The CNN-RNN framework can be trained end-to-end, and can be built on top of any CNN structures that are primarily intended for leaf-level classification, and further boost the prediction of the fine categories. In addition, we also investigated how the classification would benefit from coarse-labeled training data, which alleviates the professional and expensive manual process of fine-grained annotation.

Currently, it is necessary to have hierarchical labels in the training set, in order to train the RNN. However, this is not available for many small datasets. In the future, we will examine taking advantage of traditional clustering methods towards automatically constructing a hierarchy for the objects, and use CNN-RNN to boost the classification performance for general datasets.