1 Introduction

Single-label image classification, which outputs only a dominant label from a predefined label set for an input image, has been widely studied during the past years. However, real-world images mostly contain multiple objects of different categories, so multi-label image classification needs to be considered for real-world images, and it is usually a more complex and challenging task.

In recent years, the Convolutional Neural Network (CNN) [1] has achieved great success in single-label image classification [2,3,4]. Inspired by this, recent state-of-the-art works on multi-label image classification are mainly built on CNNs, and these methods can be generally categorized into two types: proposal-based and multi-network-based. The first type of methods [5,6,7] has a multi-stage training pipeline that first generates object proposals for an input image and then makes predictions from features extracted by a CNN for each proposal. Although proposal-based methods can produce high-quality proposals, most of these proposals are redundant, so proposal selection is required to reduce computation. The second type of methods [7, 8] trains a fusion model of multiple CNNs or a CNN combined with a Recurrent Neural Network (RNN). These multi-network models usually have more parameters to tune and are harder to converge in practice. Moreover, these methods do not consider combining local information of objects with global information of label representation.

To address the problems above, in this paper we propose a proposal-free, single-CNN based multi-label classification framework with a multi-task loss. Firstly, a CNN is used to extract hierarchical features for an input image. By directly taking an image as input instead of multiple region proposals, the redundant proposal extraction process is avoided. Secondly, the dilated convolution operation is adopted to expand receptive fields without additional parameters compared with the common convolution operation, which benefits further global information representation. Thirdly, inspired by [9], since stronger activations in the convolutional feature maps of higher layers generally correspond to the positions of object instances in the image, the bounding box annotation (the ground-truth rectangle tightly enclosing an object) of each instance can be treated as local constraint information with a strong label. To leverage this insight for multi-label classification, the CNN model is trained with a multi-task loss composed of two loss functions: one models local information of instance activations in the convolutional feature maps, and the other models global information of label representation.

The main contributions of our work can be briefly summarized as follows:

  • An end-to-end proposal-free method with single-CNN framework for multi-label image classification is proposed.

  • The dilated convolution operation is adopted to expand receptive fields for aggregating multi-scale context information without additional parameters.

  • A multi-task loss is utilized to leverage local information of object instances and global information of label representation to enhance the discriminative capability of CNN.

The rest of this paper is organized as follows. The proposed method is presented in Sect. 2, in which the basic structure of CNN, the dilated convolution operator and the multi-task loss are described in detail. Section 3 shows experimental results on two widely used datasets and performance comparisons of the proposed method with state-of-the-art methods. Finally, concluding remarks are drawn in Sect. 4.

2 Our Method

To address the multi-label image classification problem, we propose an end-to-end proposal-free single-CNN based framework with a multi-task loss. As Fig. 1 shows, our method comprises three main parts: hierarchical feature learning with a CNN (ConvNet in Fig. 1), a local instance constraint on convolutional feature maps (loss1 in Fig. 1) and a global representation in label space for classification (loss2 in Fig. 1). As components of hierarchical feature learning, the basic structure of CNN and the dilated convolution operation are described separately in Subsects. 2.1 and 2.2, and the multi-task loss composed of two loss functions is elaborated in Subsect. 2.3.

Fig. 1. Framework of the proposed method with a multi-task loss (loss1 represents the local instance constraint and loss2 represents the global representation in label space). Blue arrows indicate forward computation and red arrows indicate backward computation in the CNN. Red rectangles in the input represent bounding box annotations. Black dashed lines describe the outputs of the CNN. (Color figure online)

2.1 Basic Structure of Convolutional Neural Network

A CNN is generally composed of several convolutional and pooling layers (denoted as C layers and P layers) that extract hierarchical features from the original inputs or receptive fields, followed by several fully connected layers (denoted as FC layers) for specific tasks, as shown in Fig. 2.

Fig. 2. Common CNN architecture with convolutional layers (C), pooling layers (P) and fully connected layers (FC).

Assume that a CNN is constructed with \( L \) layers and the output of the \( l{\text{-th}} \) layer is denoted as \( {\mathbf{I}}^{l} \), where \( l \in \left\{ {1, 2, \ldots ,\;L} \right\} \); specifically, \( {\mathbf{I}}^{0} \) denotes the input data. As shown in Fig. 2, the input data is locally connected to a convolutional layer, where a 2-D convolution operation is performed with convolutional kernels \( {\mathbf{W}}_{l}^{\text{c}} \) and a bias term \( {\mathbf{b}}_{l}^{\text{c}} \) is added to the resultant feature maps. To model nonlinearities in the CNN, an activation function \( \partial ( \cdot ) \) is generally applied after convolutional layers. Then, a pooling operation \( {\text{pool}}( \cdot ) \) usually follows to achieve shift-invariance by reducing the resolution of the feature maps. The general C-P block of a CNN can be formulated as

$$ {\mathbf{I}}^{l} = {\text{pool}}(\partial ({\mathbf{I}}^{l - 1} *{\mathbf{W}}_{l}^{\text{c}} + {\mathbf{b}}_{l}^{\text{c}} )), $$
(1)

where \( * \) denotes the convolution operation. After several C-P blocks, the hierarchical features are further transformed into a 1-D feature vector by the FC layers. The FC layers connect all neurons in the previous layer to each single neuron of the current layer to generate global semantic information. Denoting the weight as \( {\mathbf{W}}_{l}^{\text{fc}} \) and the bias as \( {\mathbf{b}}_{l}^{\text{fc}} \), an FC layer computation can be formulated as follows:

$$ {\mathbf{I}}^{l} = \partial ({\mathbf{I}}^{l - 1} \bullet {\mathbf{W}}_{l}^{\text{fc}} + {\mathbf{b}}_{l}^{\text{fc}} ) $$
(2)

The output of the last FC layer is usually fed to an output layer using certain operations for specific tasks; for example, a softmax operation is used for multi-class classification. Suppose we have \( N \) desired input-output pairs \( \left\{ { (\varvec{x}^{n} ,\varvec{y}^{n} );n \in [1,2, \ldots ,N]} \right\} \), where \( \varvec{x}^{n} \) is the \( n{\text{-th}} \) input data, \( \varvec{y}^{n} \) is its corresponding target label, and \( \varvec{t}^{n} \) is the corresponding output of the CNN. Denoting \( {\varvec{\uptheta}} \) as all the parameters of the CNN, the loss of the CNN can be computed as

$$ L = \frac{1}{N}\sum\limits_{n = 1}^{N} {\ell ({\varvec{\uptheta}};{\mathbf{y}}^{n} ,{\mathbf{t}}^{n} )} $$
(3)

Training a CNN can be seen as optimizing a function mapping, i.e., minimizing the loss of the CNN; generally, stochastic gradient descent (SGD) is used to find the best-fitting set of parameters.
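To make the formulation above concrete, the following minimal PyTorch sketch (the layer sizes, ReLU activation and binary cross-entropy loss are illustrative assumptions, not the paper's exact choices) implements one C-P block as in Eq. (1), an FC layer as in Eq. (2), and a single SGD update minimizing an averaged loss as in Eq. (3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    """One C-P block (Eq. 1) followed by an FC layer (Eq. 2); sizes are toy values."""
    def __init__(self, num_classes=20):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # kernels W_c, bias b_c
        self.fc = nn.Linear(16 * 16 * 16, num_classes)          # weights W_fc, bias b_fc

    def forward(self, x):                            # x plays the role of I^0
        x = F.max_pool2d(F.relu(self.conv(x)), 2)    # pool(act(I * W + b)), Eq. 1
        x = x.flatten(1)                             # to a 1-D feature vector
        return self.fc(x)                            # FC output t^n, Eq. 2

model = SimpleCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

x = torch.randn(8, 3, 32, 32)                 # a toy batch of N = 8 inputs x^n
y = torch.randint(0, 2, (8, 20)).float()      # toy multi-label targets y^n
loss = F.binary_cross_entropy_with_logits(model(x), y)    # averaged loss, Eq. 3
optimizer.zero_grad(); loss.backward(); optimizer.step()  # one SGD step
```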

2.2 Dilated Convolutional Neural Network

Compared with the common convolution operation, the dilated convolution operator gains context information in a way similar to the cross-layer connections in [10]. Unlike the deconvolutional layer [10], the dilated convolution operation with a dilation rate [11] can apply the same convolutional kernel at different scales without additional memory or loss of information. Combined with proper stride and padding parameters, the dilated convolution operation can aggregate multi-scale context information, which has been demonstrated to be superior to cross-layer connections [12].

Consider a one-dimensional convolution operator with a kernel \( \omega [m] \) of length \( M \) applied to a 1-D input signal \( x[i] \); the output \( y[i] \) is defined as

$$ y[i] = \sum\limits_{m = 1}^{M} {x[i + d \cdot m]\,\omega [m]} , $$
(4)

where \( d \) is the dilation rate for input sampling. Thus the common convolution operation can be seen as a special case of dilated convolution with a dilation rate of 1. In practice, as shown in Fig. 3, the dilated convolution operator with a kernel size of \( k \times k \) and a dilation rate of \( d \) simply inserts \( d - 1 \) zeros between consecutive filter values, enlarging the kernel size from \( k \) to \( k + (k - 1)(d - 1) \) without additional computation or memory.
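As a minimal illustration, the NumPy sketch below implements Eq. (4) with 0-based indexing and only "valid" output positions (both are presentation choices of ours, not specified by the paper):

```python
import numpy as np

def dilated_conv1d(x, w, d=1):
    """1-D dilated convolution following Eq. (4), written with 0-based indices:
    y[i] = sum_{m=0}^{M-1} x[i + d*m] * w[m], where d is the dilation rate."""
    M = len(w)
    out_len = len(x) - d * (M - 1)           # positions where the kernel fits
    return np.array([sum(x[i + d * m] * w[m] for m in range(M))
                     for i in range(out_len)])

x = np.arange(10.0)
w = np.array([1.0, 2.0, 1.0])                # kernel of length M = 3
print(dilated_conv1d(x, w, d=1))             # common convolution (d = 1)
print(dilated_conv1d(x, w, d=2))             # samples every second input value
```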

Fig. 3. Illustration of the common convolution operation and the dilated convolution operation in one dimension. (a) Common convolution (dilation rate of 1). (b) Dilated convolution (dilation rate of 2, with one zero inserted between adjacent filter values).

Due to the dilation rate in the convolution operation, the effective kernel size increases, while the number of filter parameters remains the same because only zero values are inserted. By stacking dilated convolutions in a chain of layers with proper stride and padding, a CNN can produce feature maps with the desired resolution and larger receptive fields, which contain more context information and benefit semantic representation.
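A small PyTorch check of this property (the layer sizes are arbitrary assumptions): a 3 × 3 kernel with dilation rate 2 covers a 5 × 5 area, i.e. \( k + (k - 1)(d - 1) = 5 \), yet keeps the 3 × 3 parameter count, and suitable padding preserves the output resolution.

```python
import torch
import torch.nn as nn

k, d = 3, 2                                        # kernel size and dilation rate
conv = nn.Conv2d(1, 1, kernel_size=k, dilation=d,
                 padding=d * (k - 1) // 2)         # padding keeps the resolution

x = torch.randn(1, 1, 28, 28)
print(conv(x).shape)                               # torch.Size([1, 1, 28, 28])
print(sum(p.numel() for p in conv.parameters()))   # 3*3 weights + 1 bias = 10
print(k + (k - 1) * (d - 1))                       # effective kernel size: 5
```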

2.3 Multi-task Loss

The proposed single-CNN framework is trained with a multi-task loss composed of two loss functions. The first loss \( L_{\text{act}} \) operates on \( H \times W \times C \) convolutional feature maps, in which each \( H \times W \) plane represents the activation map of one category. The second loss \( L_{\text{cls}} \) involves a discrete probability distribution over the \( C \) categories.

Each input image is labeled with a multi-label ground truth and instance-level ground truth. A multi-task loss \( L \) is used for joint training for multi-label classification:

$$ L = L_{\text{act}} + \lambda L_{\text{cls}} , $$
(5)

where the hyper-parameter \( \lambda \) controls the balance between the two task losses.
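In code, combining the two task losses of Eq. (5) is a one-liner; a minimal sketch follows (the individual loss values are computed as in the sketches after Eqs. (6)-(8)):

```python
def multi_task_loss(l_act, l_cls, lam=1.0):
    """Joint loss of Eq. (5); lam is the balancing hyper-parameter lambda.
    The paper sets lambda = 1 in all experiments (Subsect. 3.2)."""
    return l_act + lam * l_cls
```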

Local Instance Constraint. As discovered in [9], a CNN can learn hierarchical features due to its deep architecture, and the more complex features of higher layers are sensitive to local structures in the input images. Following these works, we propose a loss function that considers precise instance locations and activation values in the convolutional feature maps, allowing the network to capture the local structure of each individual object instance.

Based on [12], the dilated convolution operator described in Subsect. 2.2 is employed to expand receptive fields, and after the last convolution operation, a \( 1 \times 1 \) convolutional layer with the same number of filters as the number of categories is adopted. In this way, as shown in Fig. 1, each plane of the convolutional feature maps stands for one specific category; thus higher activations in a specific feature map indicate a higher probability that the corresponding category is present. For the local instance constraint, a Euclidean distance based loss function is adopted to penalize positions with no object and to constrain the activation values at positions where objects of the corresponding category exist. Thus, for \( N \) training samples, the loss function \( L_{\text{act}} \) is the Euclidean distance between the convolutional feature map \( f^{c,i} \) and the sum of instance bounding box masks \( \sum\nolimits_{t = 1}^{T(c,i)} {b_{t}^{c,i} } \) over the \( C \) categories, which can be expressed as:

$$ L_{\text{act}} = \sum\limits_{i = 1}^{N} {\sum\limits_{c = 1}^{C} {\left\| {f^{c,i} - \sum\limits_{t = 1}^{T(c,i)} {b_{t}^{c,i} } } \right\|} } , $$
(6)

where \( b_{t}^{c,i} \in \{ 0,1\} \) (1 indicates a position covered by an instance and 0 a position without any instance) is the \( t{\text{-th}} \) instance bounding box mask for category \( c \), and \( T(c,i) \) is the number of instances of category \( c \) in the \( i{\text{-th}} \) image. Instances of the same category may overlap, and we encode the overlapped regions by summing all of the individual binary masks, so that the loss function \( L_{\text{act}} \) is aware of the higher activation values expected where objects overlap.
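A hedged NumPy sketch of Eq. (6) for a single image: the target for each category plane is the sum of that category's binary instance masks (so overlaps accumulate), and the loss is the Euclidean distance between each feature map plane and its target. The (C, H, W) tensor layout and the box tuple format are illustrative assumptions.

```python
import numpy as np

def instance_activation_loss(feat, boxes):
    """L_act of Eq. (6) for one image. feat: (C, H, W) category activation maps.
    boxes: list of (c, y0, y1, x0, x1) instance bounding boxes (assumed layout)."""
    C, H, W = feat.shape
    target = np.zeros((C, H, W))
    for c, y0, y1, x0, x1 in boxes:
        mask = np.zeros((H, W))
        mask[y0:y1, x0:x1] = 1.0                  # binary mask b_t of one instance
        target[c] += mask                         # overlapping instances sum up
    # Euclidean distance per category plane, summed over the C categories
    return sum(np.linalg.norm(feat[c] - target[c]) for c in range(C))
```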

Global Label Representation. For the global representation, previous works mainly choose the Euclidean distance [6, 7] or cross-entropy [8] as the distance metric, but no work discusses the pros and cons of the two metrics for multi-label image classification. For each input image with a ground-truth class label \( {\mathbf{u}} \) and a predicted class label \( {\mathbf{v}} \), the loss function \( L_{\text{cls}} \) based on the Euclidean distance can be defined as:

$$ L_{\text{cls}} = \frac{1}{N}\sum\limits_{i = 1}^{N} {\sqrt {\sum\limits_{c = 1}^{C} {(u^{c,i} - v^{c,i} )^{2} } } } , $$
(7)

and based on cross-entropy the loss function \( L_{\text{cls}} \) can be defined as:

$$ L_{\text{cls}} = - \frac{1}{N}\sum\limits_{i = 1}^{N} {\sum\limits_{c = 1}^{C} {u^{c,i} \log v^{c,i} } } , $$
(8)

where \( u^{c,i} \) is the ground-truth label indicator of category \( c \) for the \( i{\text{-th}} \) image and \( v^{c,i} \) is its corresponding prediction. The two losses are compared in Subsect. 3.3.
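For comparison, both variants of \( L_{\text{cls}} \) can be sketched in a few NumPy lines (the small eps guarding the logarithm is our numerical-stability addition, not part of Eq. (8)):

```python
import numpy as np

def cls_loss_euclidean(u, v):
    """Eq. (7): mean over the N images of the Euclidean distance in label space.
    u, v: (N, C) arrays of ground-truth indicators and predictions."""
    return np.mean(np.sqrt(np.sum((u - v) ** 2, axis=1)))

def cls_loss_cross_entropy(u, v, eps=1e-12):
    """Eq. (8): mean cross-entropy between labels and predicted probabilities."""
    return -np.mean(np.sum(u * np.log(v + eps), axis=1))
```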

3 Experimental Results

3.1 Datasets and Baseline

Our method is evaluated on the VOC datasets [13], which are widely used as benchmark datasets for the multi-label object recognition task. Following [5,6,7,8], VOC 2007 and VOC 2012 are chosen as our experimental datasets; each is split into three parts: TRAIN, VAL and TEST. Like [6,7,8], we take TRAIN and VAL as our training data and TEST for model evaluation. Details of these datasets are shown in Table 1, in which the 20 classes are airplane (aero), bike, bird, boat, bottle, bus, car, cat, chair, cow, table, dog, horse, motorbike (motor), person, plant, sheep, sofa, train and television (tv). The evaluation metrics are average precision (AP) and mean average precision (mAP). In particular, the scores on VOC 2007 TEST are computed with the standard VOC evaluation package, and the scores on VOC 2012 TEST are computed on the VOC evaluation server.

Table 1. Datasets information.

We compare the proposed method with several state-of-the-art approaches [6,7,8, 15,16,17, 19] in terms of mAP, and the results are shown in Subsect. 3.3.

3.2 Parameters Configuration

Our CNN architecture is based on VGG16 [3] pre-trained on ImageNet. Following DeepLab [12], layers fc6 and fc7 are converted into convolutional layers, and the dilated convolution operator is employed in layers conv5_1, conv5_2, conv5_3 and fc6. More details of the CNN architecture are given in Table 2. We fine-tune the VGG model from [12] using SGD with an initial learning rate of \( 10^{ - 5} \), momentum of 0.9 and weight decay of 0.0005, implemented in the Caffe deep learning framework [14]. The hyper-parameter \( \lambda \) in Eq. 5 is set to 1 in all experiments.
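For reference, a hedged sketch of the reported solver settings expressed with PyTorch's SGD (the original experiments used Caffe, and the stock torchvision VGG16 below only stands in for the DeepLab-modified model):

```python
import torch
import torchvision

# Stand-in for the DeepLab-style VGG16; ImageNet pre-trained, illustration only.
model = torchvision.models.vgg16(weights="IMAGENET1K_V1")
optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-5,             # initial learning rate 10^-5
                            momentum=0.9,
                            weight_decay=0.0005)
```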

Table 2. CNN architecture of the proposed method.

3.3 Multi-label Classification Results

Multi-label Image Classification on VOC 2007. Table 3 reports our experimental results compared with the state-of-the-art methods on VOC 2007. The upper part of Table 3, above the double rule, compares against methods that do not use bounding box annotations for training, while the lower part shows methods that use bounding box information. Among the state-of-the-art methods, INRIA [15] and FV [16] are based on hand-crafted features, CNN-SVM [17] uses OverFeat [18] as a feature extractor, and the rest are CNN-based methods that mainly fine-tune models pre-trained on ImageNet.

Table 3. Average precision (AP in %) comparison for the state-of-the-art methods on VOC 2007 TEST. The upper part shows the results of methods without using ground-truth annotations and the lower part shows methods with bounding box information.

From Table 3 it can be seen that the CNN-based methods outperform the hand-crafted methods by a large margin of more than 10%, which indicates that the hierarchical features of CNN greatly benefit image representation. PRE-1000C [19] fine-tunes models pre-trained on ImageNet with limited VOC data. Compared with PRE-1000C, a 2% improvement is achieved by our d-CNN (CNN with the dilated convolution operation), which takes advantage of the dilated convolution operator to learn more semantic information. HCP-1000C [6] is a proposal-based method that relies on a proposal extraction method to prepare input patches. Compared with HCP-1000C, both our CNN-L-GE (CNN with local instance constraint and global representation with Euclidean distance) and CNN-L-GC (CNN with local instance constraint and global representation with the cross-entropy metric) achieve higher mAP, which shows the positive effect of multi-task learning: the two tasks, involving local and global information respectively, influence each other through shared parameters. In terms of the loss function measuring the global representation, cross-entropy achieves a further 2.2% improvement over the Euclidean distance, which verifies the finding that the Euclidean distance is not a suitable metric for sparse high-dimensional data [20]. Compared with the state-of-the-art method CNN-RNN, which uses a CNN and an RNN to model label dependency and image-label representation, our CNN-L-GC with only one network achieves competitive performance, which demonstrates the effectiveness of multi-task learning with both local and global information. In particular, the proposed method outperforms the state-of-the-art methods by a large margin when the objects are nearly square (i.e., bus, chair, table, motor, plant and sofa), mainly due to the local instance constraint from bounding box annotations.

Multi-label Image Classification on VOC 2012. Table 4 reports our experimental results compared with the state-of-the-art methods on VOC 2012. As in Table 3, we compare with methods that do not use bounding box annotations in the upper part and with methods that use bounding box information in the lower part.

Table 4. Average precision (AP in %) comparison for the state-of-the-art methods on VOC 2012 TEST. The upper part shows the results of methods without using ground-truth annotations and the lower part shows methods with bounding box information.

The multi-label classification results on VOC 2012 in terms of mAP are consistent with those in Table 3. Compared with HCP-2000C [6], pre-trained on ImageNet with 2000 categories, and PRE-1512C [19], pre-trained on ImageNet with 1512 categories, our CNN-L-GC, pre-trained on ImageNet with only 1000 categories, outperforms the two state-of-the-art methods by 1.7% and 3.1%, respectively. Compared with the state-of-the-art proposal-based FeV [7] with a two-stream CNN, our CNN-L-GC achieves an improvement of 1.9%. As on VOC 2007, the proposed method performs well on nearly square objects because of the local instance constraint with bounding box annotations.

4 Conclusions

In this paper, we presented an end-to-end proposal-free single-CNN based multi-label image classification framework with a multi-task loss. Without region proposal extraction, the training phase of our method is a single-stage pipeline. Compared with existing works, our method adopted the dilated convolution operation to expand receptive fields without additional parameters. Furthermore, the proposed method simultaneously utilized an instance constraint for local information and a cross-entropy metric for the global label representation, leveraging multi-task learning to boost the discriminative capacity of the CNN. The experimental results on VOC 2007 and VOC 2012 showed that the proposed method achieved state-of-the-art performance.