
1 Introduction

CNNs are now widely used across many applications, such as image segmentation [22], object classification [4, 12, 31], scene image classification [18, 41], and fine-grained classification [2, 9, 43]. Fine-grained recognition has recently become popular [2, 44] because it applies to a variety of challenging domains such as bird species recognition [35] and flower species recognition [27]. An important issue in fine-grained recognition is inter-class similarity, i.e., images of birds of different species can be ambiguous because they are captured in uncontrolled natural settings. Generic scene image recognition, on the other hand, is challenging because scene images are composed of a spatially correlated layout of different objects and concepts [19]. Successful recognition methods need to extract powerful visual representations to deal with high intra-class and low inter-class variability [38], complex semantic structure [29], varying size of the same semantic concept across a dataset, and so on. To address these issues, many CNNs such as AlexNet [23], GoogLeNet [32] and VGGNet-16 [31] have already been trained on datasets like Places [45] and ImageNet [7] for image recognition tasks. These deep networks can be adapted and trained for other datasets and applications with few modifications. In all such scenarios, features acquired from pre-trained, altered or fine-tuned CNNs are used to build a standard classifier such as a fully connected neural network or a support vector machine (SVM).

One major drawback of these frameworks is that the CNNs require input images of fixed dimensions. For instance, GoogLeNet accepts images of resolution \(224 \times 224\), although standard datasets like SUN397 [38] or MIT-67 [29] consist of variable-resolution images that are much larger than \(224 \times 224\). Similarly, images in the CUB-200-2011 [35] bird dataset vary in size. Also, as we demonstrate, it is useful to consider the bird region of interest (ROI), which focuses on the subject and discards most of the background. In such cases too, the size of the ROI varies with the shape and size of the bird. The traditional way to use these CNNs is to resize the arbitrary-sized images to a common size. This loses image information before the image is given to the CNN for feature extraction, and the performance of the classifier suffers as a result, as is evident from the work published in [18]. To avoid this prior information loss, different approaches have been explored to feed varying-resolution images as input to a CNN. The work in [18] eliminates the need for fixed-resolution images by including a spatial pyramid pooling (SPP) layer in the CNN, and names the new architecture SPP-Net. The work in [11] follows a similar technique, encoding the feature maps of a conv layer into a super vector using an encoding such as Fisher vector (FV) [41] or vector of locally aggregated descriptors (VLAD) [20], built on a Gaussian mixture model (GMM).

Conv layers are an essential part of a convolutional neural network and are responsible for producing discriminative activation maps. The generated activation maps vary in resolution according to the original image size and contain more spatial layout information than the activations of the fc layers, since an fc layer integrates the spatial content present in the conv layer features. Motivated by this, in our previously published work [16] we passed images at their actual size to the convolutional neural network and obtained varying size sets of deep activation maps from the last conv layer as output.

In the literature, two main approaches have been proposed to handle classification of varying size patterns using support vector machines. In the first, a varying size set of activation maps is transformed into a fixed dimension pattern as in [11], and a kernel function for fixed dimension patterns is then used to build the support vector machine classifier. In the second, a suitable kernel function is designed directly for varying size sets of activation maps. Kernels designed for varying size sets of features are called dynamic kernels [8]. The dynamic kernels in [8, 15, 17, 24] show promising results for classification of varying resolution images and speech signals. We adopt the second approach and propose the deep spatial pyramid match kernel (DSPMK) as a dynamic kernel.

In this work, we extend our previous work [16] and explore different pooling and normalization techniques for computing DSPMK, as discussed in Sect. 3.2. Inspired by [18], we also fine-tune the CNN architecture on original-size images by adding a spatial pyramid pooling (SPP) layer to the network. The SPP-layer maps varying size activation maps to a fixed size for passing to the fully-connected layer, and allows end-to-end fine-tuning and training of the network with variable size images. This is discussed in Sect. 3.1. The key contributions of this work are:

  • A deep spatial pyramid match kernel with different pooling and normalization techniques to find the similarity score between a pair of varying size sets of deep activation maps.

  • Introduction of an SPP-layer [18] between the last convolutional layer and the first fc-layer, so that varying size deep activation maps of images can be converted into a fixed length representation.

  • End-to-end fine-tuning of the network with the SPP-layer on different datasets, handling the images at their original size.

  • Demonstration of the effectiveness of our approach and its variants, with state-of-the-art results, on two different applications: scene image classification and fine-grained bird image classification.

The rest of the paper is structured as follows. A review of related approaches for image classification using CNN-based features is presented in Sect. 2. Section 3.1 gives the details of the CNN architecture with SPP-layer. In Sect. 3.2, we discuss the DSPMK for varying size sets of deep activation maps with different pooling and normalization techniques. The experimental studies using the proposed approach on scene image classification and fine-grained bird classification tasks are presented in Sect. 4. Section 5 presents the conclusion.

2 Literature Review

In this section, we revisit the state-of-the-art techniques for fine-grained image classification and scene image classification tasks. Traditional image classification methods generate local feature vectors of images using local descriptors such as scale invariant feature transform (SIFT) [25] and histogram of oriented gradients (HOG) [6]. A GMM-based or SVM-based classifier can then be built using a standard function such as the Gaussian kernel, where the feature vectors are encoded into a fixed length representation. Generally, bag of visual words (BoVW) [5, 36, 37], sparse coding [40], and Fisher vector (FV) [26] encodings are used for fixed-dimensional representations of an image. These fixed length vector representations do not incorporate spatial information about the concepts present in the image.

As an alternative, SVM-based classifiers can be learned with matching based dynamic kernels that are designed to take spatial information into account. The spatial pyramid match kernel [24], the class independent GMM-based intermediate matching kernel [8] and the segment-level pyramid match kernel [15] are a few of the matching based kernels for matching different size images and speech signals. With the development of deep CNNs, conventional features and related methods are being replaced by features learned from datasets, used with a linear kernel (LK) based SVM classifier.

The strong performance of various deep CNN architectures on the ImageNet large scale visual recognition challenge (ILSVRC) [23, 31, 32] has motivated the research community to adapt CNNs to other challenging domains and datasets such as fine-grained classification. Initially, fc-layer features from convolutional neural networks were used directly to build LK-based SVM classifiers for vision tasks, and performed better than traditional methods [9, 46]. A few researchers also encoded learned features into a novel representation; e.g., the authors of [26] transformed the features from the fc layer into a bag of semantics representation, which is then summarized in a semantic Fisher vector representation. In the case of fine-grained bird classification, state-of-the-art approaches are based on deep CNNs [2, 9, 39, 44]. These approaches use part-based and bounding box annotations to generate the final representation. Moreover, all these approaches give fixed size input to the network because of the rigid nature of the fc-layer: it has a fixed number of fully connected neurons and expects a fixed length input representation, whereas the convolution process is not constrained to a fixed length representation. The need for a fixed resolution input image in convolutional networks is therefore a requirement of the fc-layer.

Reshaping images to a fixed size results in loss of information [18]. Convolutional layers of CNNs, on the other hand, accept input images of arbitrary size, producing deep activation maps whose size depends on the input. Deep activation maps contain the strongest responses of filters on the previous layer output and preserve the spatial information of the concepts present in the input image. A similar idea can be observed in the work of [11, 18]. The approaches in these papers use a spatial pyramid and a scale space of input images to incorporate the concept information of images into the activation maps at different scales. The work in [42] focuses on the scale characteristics of images over feature activations. The authors feed images at different scales to the CNN and obtain a seven-layer pyramid of dense activation maps, and then use the Fisher framework to encode and aggregate the activation maps into a fixed length representation. The work in [18] uses the SPP approach to remove the requirement of same-size input images for convolutional networks. There, the CNN is fed with images at their original size, whereas in [42] the CNN is fed with differently scaled images. The work in [11] similarly feeds the original-size image to the convolutional network, but converts the output to a fixed size differently: a GMM is built on the fixed size vector representations obtained from spatial pyramid pooling and is used to generate Fisher vectors [11]. Finally, all the Fisher vectors are concatenated to form a fixed dimensional representation.

In our work, we focus on integrating the power of convolutional varying size sets of deep activation maps with a dynamic kernel to obtain a matching value between a pair of images of different sizes. We use DSPMK as the dynamic kernel instead of building a GMM-based dictionary on varying size conv features, which makes our proposed approach computationally less expensive. Further, we modify the deep CNN architecture by adding spatial pyramid pooling for end-to-end fine-tuning or training. In the next section, we discuss the CNN architecture with SPP-layer and the proposed DSPMK for varying size sets of deep activation maps.

3 Approaches for Handling Variable Size Images for Classification

In this section, we discuss approaches for handling variable size images in CNNs for classification on datasets from different domains, namely scene image classification and fine-grained bird classification.

  • In Sect. 3.1, we introduce an SPP-layer, inspired by [18], between the last convolutional layer and the first dense layer, so that a variable size set of convolutional activation maps of an image can be converted into a fixed length representation for end-to-end training of the network.

  • In Sect. 3.2, we present the DSPMK proposed in [16] to compute the similarity score between two sample images represented as varying size sets of activation maps.

3.1 CNN Architecture with Spatial Pyramid Pooling (SPP) Layer

As mentioned earlier, traditional CNN architectures like AlexNet [23], GoogLeNet [32] or VGGNet-16 [31] are pre-trained on datasets with images of fixed resolution (e.g., \(227 \times 227\)). Converting an image from its original size to a fixed size results in loss of information at the very beginning of the network. A CNN architecture is mainly a combination of convolutional (conv) layers and fully-connected (fc) layers. The fc-layers demand fixed size input, whereas the conv-layers are free from such restrictions. In this work, we modify the CNN architecture so that images are accepted at their original size and the last convolutional layer outputs a varying size set of activation maps.

To handle this, we use a spatial pyramid pooling (SPP) layer, inspired by [18], which maps a varying size set of deep activation maps onto a fixed length representation for end-to-end training. With the SPP-layer, information aggregation happens at a later stage in the network, which improves the training process. We consider the VGG-19 architecture [30, 31] and add the SPP-layer between the last convolutional layer and the first fc-layer. The SPP-layer sum- or max-pools the varying size convolutional activation maps at three different levels to convert them into a fixed length vector. At the first level, the complete convolutional activation maps are considered and sum- or max-pooling is applied to obtain a fixed length vector. At the second level, the activation maps are spatially divided into 4 blocks and sum- or max-pooling is applied in each block. At the third level, the activation maps are divided into 16 blocks, and so on. In this scheme we fix the number of spatial blocks rather than the block size. The fixed length vectors obtained at each level are concatenated to form a fixed length supervector, which is passed on to the fully-connected layer for end-to-end training using backpropagation. The block diagram of the proposed CNN architecture with SPP-layer is shown in Fig. 1.

Fig. 1. Block diagram of CNN architecture with SPP-layer.
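
The three-level pooling described above can be sketched in plain Python as follows. This is only an illustrative sketch and not the implementation used in the experiments: the function names (`split_ranges`, `pool_level`, `spp_vector`), the nested-list representation of activation maps, and the way block boundaries are rounded are all assumptions made here for illustration.

```python
from typing import List, Sequence

ActivationMap = Sequence[Sequence[float]]  # one p x q activation map


def split_ranges(size: int, parts: int) -> List[range]:
    """Cut range(size) into `parts` contiguous, non-overlapping chunks."""
    return [range(i * size // parts, (i + 1) * size // parts) for i in range(parts)]


def pool_level(maps: Sequence[ActivationMap], level: int, mode: str = "max") -> List[float]:
    """Pool each map over a 2^level x 2^level grid of blocks.

    Returns f * 4^level values. Assumes every map has at least 2^level rows
    and columns, so that no block is empty.
    """
    if mode not in ("max", "sum"):
        raise ValueError("mode must be 'max' or 'sum'")
    reduce_fn = max if mode == "max" else sum
    parts = 2 ** level
    pooled: List[float] = []
    for amap in maps:
        row_blocks = split_ranges(len(amap), parts)
        col_blocks = split_ranges(len(amap[0]), parts)
        for rows in row_blocks:
            for cols in col_blocks:
                pooled.append(reduce_fn(amap[r][c] for r in rows for c in cols))
    return pooled


def spp_vector(maps: Sequence[ActivationMap], num_levels: int = 3, mode: str = "max") -> List[float]:
    """Concatenate the pooled vectors of levels 0 .. num_levels - 1."""
    vector: List[float] = []
    for level in range(num_levels):
        vector.extend(pool_level(maps, level, mode))
    return vector


# Two toy "images" with f = 2 maps each but different spatial sizes.
small = [[[float(r + c) for c in range(4)] for r in range(5)] for _ in range(2)]
large = [[[float(r * c) for c in range(9)] for r in range(7)] for _ in range(2)]
print(len(spp_vector(small)), len(spp_vector(large)))  # both 2 * (1 + 4 + 16) = 42
```

Because the number of blocks is fixed and the block size adapts to the map, both toy inputs yield a vector of the same length, which is what allows a fully-connected layer to follow.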

3.2 Deep Spatial Pyramid Match Kernel

In this section, we present the deep spatial pyramid match kernel (DSPMK) proposed in [16] for matching varying size sets of deep activation maps obtained from a convolutional neural network. The entire process of classification using a DSPMK-based SVM is shown in the block diagram of Fig. 2. As shown in the diagram, \(I_m\) and \(I_n\) are two images given as input to the convolutional layers of the network, each yielding a set of deep activation maps. Different images give activation maps of different sizes, i.e., the size of the activation maps in the set corresponding to image \(I_m\) differs from that for image \(I_n\). From these different size activation maps, we compute a similarity score using DSPMK. The DSPMK-based SVM classifier is learned by associating the feature maps of training images with their class labels. This is in contrast to [11], where varying size activation maps are encoded into a fixed length super vector (a Fisher vector) using the Fisher framework, and an LK-based SVM is then used to build the classifier. The main feature of the proposed approach is that DSPMK computes the similarity score between images of different actual sizes at spatial levels \(0, 1, \dots, L\) using varying size sets of deep activation maps.

Fig. 2. Block diagram of DSPMK as proposed in [16].

Consider a dataset of images \(\mathcal{D} = \{I_1, I_2, \dots, I_m, \dots, I_N\}\) and let \(f\) be the number of kernels or filters in the last conv layer of a pre-defined deep CNN architecture. Let the mapping \(\mathcal{F}\) take an image at its actual size as input and project it to a set of deep activation maps using the conv layers of the CNN, i.e., \(\mathcal{X}_m = \mathcal{F}(I_m)\). All activation maps obtained from the last conv layer for a given image have the same size, but this size varies from image to image because images are fed to the CNN architecture at their original resolution.

Fig. 3. Illustration of computing the similarity score using DSPMK between two images of different resolution, \(I_m\) and \(I_n\), similar to Fig. 2 of paper [16]. Here, \(\mathcal{X}_m\) and \(\mathcal{X}_n\) are sets of deep activation maps computed using the conv layers of a pre-trained CNN; the size of \(\mathcal{X}_m\) depends on the size of \(I_m\), and similarly the size of \(\mathcal{X}_n\) depends on the size of \(I_n\). The matching score at each level \(l\) (i.e., \(S_0\), \(S_1\) and \(S_2\)) is computed using Eq. (3).

First, images are fed to the pre-trained CNN at their actual size. For image \(I_n\), we have a set \(\mathcal{X}_n = \{\mathbf{x}_{n1}, \mathbf{x}_{n2}, \mathbf{x}_{n3}, \dots, \mathbf{x}_{nf}\}\) consisting of \(f\) feature maps from the mapping \(\mathcal{F}\), where \(\mathbf{x}_{ni} \in \mathbb{R}^{p_n \times q_n}\) and \(p_n \times q_n\) is the size of each feature map obtained from the last conv layer, which varies according to the input image resolution. This results in varying size deep activation maps, as shown in Fig. 3 for images \(I_m\) and \(I_n\).

Second, the deep activation maps are spatially divided into sub-blocks to form a spatial pyramid. At level-0, the activation maps are considered as they are, without spatial division. At level-1, every deep activation map is divided into 4 spatial blocks corresponding to its 4 quadrants, as depicted in Fig. 3. Let \(L+1\) be the total number of levels in the pyramid, numbered \(0, 1, \dots, L\). At any level \(l\), a deep activation map \(\mathbf{x}_{ni}\) is spatially split into \(2^{2l}\) blocks. The activation values of the cells in every spatial block of all \(f\) deep activation maps are sum- or max-pooled and concatenated to form a vector \(\mathbf{X}^l_n\) of size \(f2^{2l} \times 1\). This is illustrated in Fig. 3 for three levels, \(l = 0, 1\) and \(2\), and is also described in Algorithm 1 for \(L + 1\) pyramid levels.

In our proposed framework, we consider three spatial pyramid levels. At level-0 (i.e., \(l=0\)), each complete activation map corresponding to the input image is sum- or max-pooled; since there are \(f\) activation maps in the output of the conv layer, this results in an \(f \times 1\) vector representation. At level-1 (i.e., \(l=1\)), the same activation maps are divided into four equal spatial blocks. Each block of a single activation map is again sum- or max-pooled, giving 4 values per map; repeating this for all \(f\) activation maps results in a \(4f \times 1\) vector. Similarly, at level-2 (i.e., \(l=2\)), the same activation maps are divided into sixteen equal spatial regions, resulting in a \(16f \times 1\) vector. For image \(I_n\), concatenating all the sum- or max-pooled activation values at level \(l\) gives a single vector \(\mathbf{X}^l_n\).
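
As a worked example of these sizes, assume \(f = 512\), the number of filters reported in Sect. 4.2 for the last conv layer of VGGNet-16. The three levels then give

$$\begin{aligned} \dim \mathbf{X}^0_n&= 512, \\ \dim \mathbf{X}^1_n&= 4 \times 512 = 2048, \\ \dim \mathbf{X}^2_n&= 16 \times 512 = 8192, \end{aligned}$$

regardless of the feature map size \(p_n \times q_n\). This is why the level-wise vectors of two images of different resolution can be compared element by element.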

The vector \(\mathbf{X}^l_m\) can now be seen as the representation of image \(I_m\) at level \(l\) of the pyramid. At this stage, we compute the deep spatial pyramid match kernel (DSPMK) to match two images, rather than deriving a Fisher vector (FV) representation as in [11]. Our approach avoids building a GMM to obtain the FV and hence reduces the computational complexity compared to [11]. The computation of DSPMK is motivated by the spatial pyramid match kernel (SPMK) [24]. SPMK uses the histogram intersection function to match the frequency-based image representations or normalized vector representations of two images at every level of the pyramid [24]. However, \(\mathbf{X}^l_m\) is not a normalized vector representation of image \(I_m\). We therefore normalize \(\mathbf{X}^l_m\) using the \(\ell_1\) or \(\ell_2\) norm to obtain a normalized vector representation.

Let \(\mathbf{X}^l_m\) and \(\mathbf{X}^l_n\) be the representations at level \(l\) corresponding to two images \(I_m\) and \(I_n\), respectively. Their normalized vector representations are obtained using \(\ell_1\) or \(\ell_2\) normalization, as given in Eqs. (1) and (2), respectively:

$$\begin{aligned} \widehat{\mathbf{X}}^l_m = \frac{\mathbf{X}^l_m}{\Vert \mathbf{X}^l_m \Vert_1}, \quad \widehat{\mathbf{X}}^l_n = \frac{\mathbf{X}^l_n}{\Vert \mathbf{X}^l_n \Vert_1} \end{aligned}$$
(1)
$$\begin{aligned} \widehat{\mathbf{X}}^l_m = \frac{\mathbf{X}^l_m}{\Vert \mathbf{X}^l_m \Vert_2}, \quad \widehat{\mathbf{X}}^l_n = \frac{\mathbf{X}^l_n}{\Vert \mathbf{X}^l_n \Vert_2} \end{aligned}$$
(2)
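
The two normalizations of Eqs. (1) and (2) can be sketched as below. The function names are hypothetical, and returning an all-zero vector unchanged is an assumption made here to avoid division by zero; the paper does not discuss that case.

```python
import math
from typing import List, Sequence


def l1_normalize(vector: Sequence[float]) -> List[float]:
    """Divide by the sum of absolute values, as in Eq. (1)."""
    norm = sum(abs(x) for x in vector)
    # Assumption: an all-zero vector is returned unchanged.
    return [x / norm for x in vector] if norm > 0 else list(vector)


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Divide by the Euclidean norm, as in Eq. (2)."""
    norm = math.sqrt(sum(x * x for x in vector))
    # Assumption: an all-zero vector is returned unchanged.
    return [x / norm for x in vector] if norm > 0 else list(vector)


print(l1_normalize([1.0, 3.0]))  # [0.25, 0.75]
```

With non-negative pooled activations, \(\ell_1\) normalization makes the entries of each level-wise vector sum to one, which matches the histogram-like input expected by the histogram intersection function.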

The histogram intersection (HI) function is used to compute the intermediate matching score \(S_l\) between \(\widehat{\mathbf{X}}^l_m\) and \(\widehat{\mathbf{X}}^l_n\) at each level \(l\) as

$$\begin{aligned} S_l = \sum\limits_{j=1}^{f} \sum\limits_{k=1}^{2^{2l}} \min\left(\hat{x}_{mj(k)}^l, \hat{x}_{nj(k)}^l\right) \end{aligned}$$
(3)

Here, \(\hat{x}_{mj(k)}^l\) denotes the element of \(\widehat{\mathbf{X}}^l_m\) corresponding to the \(k\)-th spatial block of the \(j\)-th activation map. The intermediate similarity score \(S_l\) found at level \(l\) also includes all the matches found at the finer level \(l+1\). As a result, the number of new matches found at level \(l\) is given by \(S_l - S_{l+1}\) for \(l = 0, \dots, L-1\). The DSPMK is computed as a weighted sum of the number of new matches at the different levels of the spatial pyramid. The weight associated with level \(l\) is set to \(\frac{1}{2^{L-l}}\), which is inversely proportional to the width of the spatial regions at that level.
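
The level-wise score of Eq. (3) is a sum of element-wise minima over two equal-length normalized vectors. A minimal sketch, with a hypothetical function name and assuming both vectors list the \(f2^{2l}\) pooled values in the same order:

```python
from typing import Sequence


def histogram_intersection(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise minima of two equal-length vectors (Eq. (3))."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(min(x, y) for x, y in zip(a, b))


print(histogram_intersection([0.5, 0.5], [0.25, 0.75]))  # 0.75
```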

The DSPMK is computed as

$$\begin{aligned} K_{\text{DSPMK}}(\mathcal{X}_m,\mathcal{X}_n) = \sum_{l=0}^{L-1} \frac{1}{2^{L-l}} (S_l - S_{l+1}) + S_{L} \end{aligned}$$
(4)
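
For the three-level pyramid used here (\(L = 2\)), Eq. (4) expands to \(\frac{1}{4}(S_0 - S_1) + \frac{1}{2}(S_1 - S_2) + S_2 = \frac{1}{4}S_0 + \frac{1}{4}S_1 + \frac{1}{2}S_2\). The sketch below evaluates Eq. (4) from a list of level-wise scores; the function name and the list-based interface are assumptions made for illustration.

```python
from typing import Sequence


def dspmk_from_scores(scores: Sequence[float]) -> float:
    """Combine level-wise scores S_0 .. S_L into the kernel value of Eq. (4)."""
    if not scores:
        raise ValueError("at least one level score is required")
    top = len(scores) - 1  # this is L, the finest level
    value = scores[top]    # the S_L term
    for level in range(top):
        new_matches = scores[level] - scores[level + 1]
        value += new_matches / 2 ** (top - level)
    return value


print(dspmk_from_scores([1.0, 1.0, 1.0]))  # 1.0
```

When all level-wise scores are equal, the new-match terms vanish and the kernel value reduces to \(S_L\).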

The main advantages of the proposed approach are that it accepts images of any size without resizing loss, and that it combines varying size convolutional deep activation maps with an SVM based on the dynamic kernel DSPMK.

4 Experimental Studies

In this section, the efficacy of the proposed framework is studied on scene image and bird species classification tasks using an SVM-based classifier. The experiments cover two ways of handling the varying size of images: first, computing varying size deep activation maps from the last convolutional layer and computing the classification score using DSPMK; second, adding a spatial pyramid pooling layer to the network and fine-tuning it on the respective dataset to compute fully trained features.

4.1 Datasets

We tested our proposed approach on two kinds of datasets: scene classification datasets, namely MIT-8 Scene [28], Vogel-Schiele (VS) [34], MIT-67 [29] and SUN-397 [38], and the CUB-200-2011 [35] dataset for fine-grained bird species classification.

MIT-8-scene: This dataset contains a total of 2688 scene images belonging to 8 semantic classes: ‘coast’, ‘mountain’, ‘forest’, ‘open-country’, ‘inside-city’, ‘highway’, ‘tall building’ and ‘street’. We randomly select 100 scene images from each class for training the model and keep the remaining images for testing. We consider 5 such sets. The final classification scores reported in this paper correspond to the average classification accuracy over the 5 trials.
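
A random per-class split of this kind can be sketched as follows. The function name, the use of one seed per trial, and the label representation are assumptions; the paper does not specify how the random selection was implemented.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def split_per_class(labels: Sequence[Hashable], n_train: int, seed: int) -> Tuple[List[int], List[int]]:
    """Pick n_train random indices per class for training; the rest go to testing."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train: List[int] = []
    test: List[int] = []
    for label in sorted(by_class):  # sorted for reproducibility
        indices = by_class[label][:]
        rng.shuffle(indices)
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test


# Five trials, e.g. split_per_class(labels, n_train=100, seed=t) for t in range(5);
# the reported accuracy would then be the mean over the five test sets.
toy_labels = ["coast"] * 5 + ["forest"] * 4
train_idx, test_idx = split_per_class(toy_labels, n_train=2, seed=0)
print(len(train_idx), len(test_idx))  # 4 5
```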

Vogel-Schiele: This dataset contains a total of 700 scene images belonging to 6 semantic classes, viz., ‘forests’, ‘mountains’, ‘coasts’, ‘river’, ‘open-country’, and ‘sky-clouds’. We use 5-fold stratified cross validation and report the average classification score over the 5 folds.
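
Stratified fold assignment keeps the class proportions roughly equal across folds. One simple way to do this, shown here only as a sketch with hypothetical names and a round-robin assignment that the paper does not prescribe, is:

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def stratified_folds(labels: Sequence[Hashable], k: int, seed: int) -> List[List[int]]:
    """Assign sample indices to k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds: List[List[int]] = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


fold_labels = ["river"] * 10 + ["coasts"] * 5
folds = stratified_folds(fold_labels, k=5, seed=0)
print([len(fold) for fold in folds])  # [3, 3, 3, 3, 3]
```

Each fold is used once for testing while the remaining four are used for training, and the five accuracies are averaged.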

MIT-67: This is an indoor scene dataset. Most scene recognition models work well for outdoor scenes but perform poorly in the indoor domain. The dataset contains 15,620 images in 67 scene categories. All images have a minimum resolution of 200 pixels along the smallest axis. It is a challenging dataset due to its low in-class variability. The standard division [29] for this dataset consists of approximately 80 images of each class for training and 20 images for testing.

SUN397: This database contains 397 categories and is used as a benchmark in several papers. The number of images varies across categories (indoor, urban and nature), but there are at least 100 images per category and 108,754 images in total. We use the publicly available fixed train and test splits from [38], where each split has 50 training and 50 testing images per category. We consider the first five splits, and the reported result is the average classification accuracy over the 5 splits.

The Caltech-UCSD Birds CUB-200-2011 dataset consists of 11,788 images of birds belonging to 200 species. The standard division for this dataset consists of 5994 images for training and 5794 for testing [35], with approximately 30 images per species in the training set and the rest in the test set. Bird data suffers from high intra-class and low inter-class variance. The dataset is available with bird bounding boxes and other annotations. In this work, we evaluate our methods in two scenarios: one with the bounding box, which enables one to focus on the bird region rather than the background, and the other without bounding box information at training and test time.

4.2 Experiment Studies for Scene Image Classification Task

In our studies of scene image classification, we consider different CNN architectures for feature extraction, namely AlexNet [23], GoogLeNet [32] and VGGNet-16 [31], pre-trained on three different datasets: ImageNet [7], Places205 and Places365 [45]. We use networks pre-trained on different datasets because these datasets contain different kinds of images. ImageNet contains mainly object-centric images, and networks trained on it show activations for object-like structures, whereas the Places datasets comprise largely indoor/outdoor scene images. We believe that CNNs trained on the Places datasets activate for landscapes, natural structures of scenes with more spatial features, and indoor scene patterns.

In all the convolutional networks, the pre-trained weights are kept fixed, without fine-tuning. The networks are used without their fc-layers so that input images of arbitrary size can be accepted. As discussed in Sect. 3.2, we pass the original image of arbitrary size as input to the deep CNNs and extract a varying size set of deep activation maps from the last convolutional layer. The size of the set of activation maps corresponding to an image depends on the filter size, the number of filters \(f\) in the last convolutional layer and the input image size. The number of filters \(f\) in the last convolutional layer of AlexNet, GoogLeNet and VGGNet-16 is 256, 1024 and 512, respectively. The architectures of these CNNs also differ from each other, so the activation map size varies from image to image and from architecture to architecture.
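
To illustrate how the activation map size follows the input size, the sketch below assumes a VGG-16-style stack in which every convolution preserves the spatial size and four 2 × 2, stride-2 pooling layers with floor rounding precede the last conv layer. These architectural details, and the function name, are assumptions for illustration; actual sizes depend on the network definition and framework used.

```python
from typing import Tuple


def last_conv_map_size(height: int, width: int, num_pools: int = 4) -> Tuple[int, int]:
    """Spatial size after `num_pools` halvings (floor), convs assumed size-preserving."""
    for _ in range(num_pools):
        height, width = height // 2, width // 2
    return height, width


print(last_conv_map_size(224, 224))  # (14, 14)
print(last_conv_map_size(500, 375))  # (31, 23)
```

Under these assumptions, a 500 × 375 image yields \(f = 512\) maps of size 31 × 23, whereas a 224 × 224 image yields 512 maps of size 14 × 14.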

Table 1. Comparison of classification accuracy (CA) (in %) with 95% confidence interval for the SVM-based classifier using DSPMK computed with sum-pooling on different datasets, similar to the study shown in Table 1 of paper [16]. Base features for the proposed approach are extracted from different CNN architectures, namely AlexNet, GoogLeNet and VGGNet, which are deep networks pre-trained on the ImageNet, Places-365 and Places-205 datasets respectively.

Table 2. Comparison of classification accuracy (CA) (in %) with 95% confidence interval for the SVM-based classifier using DSPMK computed with max-pooling on different datasets, similar to the study shown in Table 1 of paper [16]. Base features for the proposed approach are extracted from different CNN architectures, namely AlexNet, GoogLeNet and VGGNet, which are deep networks pre-trained on the ImageNet, Places-365 and Places-205 datasets respectively.

DSPMK between the varying size deep activation maps of a pair of images is computed as in Fig. 3 using Eqs. (1) to (4). We consider \(L+1 = 3\) levels in the spatial pyramid. In computing DSPMK, we performed experiments with both sum- and max-pooling. The reason for using different pooling techniques is that max-pooling extracts the most activated features such as edges, corners and texture, whereas sum-pooling smooths the activation map and measures the total presence of a pattern in a given region. Although the results with the two pooling techniques are comparable, as shown in Tables 1 and 2, we observed that max-pooling works slightly better than sum-pooling in our case. The performance of the SVM-based classifier with DSPMK obtained using deep features from VGGNet-16 is significantly better than that obtained using deep features from GoogLeNet and AlexNet. The reason is that VGGNet-16 is a very deep network compared to the other architectures and learns the hierarchical representation of visual data more effectively. We use the LIBSVM [3] tool to build the DSPMK-based SVM classifier. Specifically, we use the one-against-the-rest approach for multi-class scene image classification, with the default value of the trade-off parameter, \(C = 1\).

In a further study, we fine-tuned the VGG-16 architecture on the respective datasets after adding the spatial pyramid pooling (SPP) layer to the network, as shown in Fig. 1. We computed the spatial pyramid pooling features and trained a neural network based classifier with two hidden layers and one soft-max layer, using a dropout of 0.5, a learning rate of 0.01 and 2048 neurons in the hidden layers. We observe that the results are comparable with the DSPMK-based SVM approach.
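
Since DSPMK is a kernel between sets rather than between fixed-length vectors, the DSPMK-based SVM described above has to be trained on a precomputed kernel (Gram) matrix. The sketch below builds such a matrix and writes it in the text format that LIBSVM reads in its precomputed-kernel mode (training option `-t 4`), where each line is `<label> 0:<serial number> 1:K(x_i, x_1) ... n:K(x_i, x_n)`. The helper names are hypothetical, and the toy kernel is a plain histogram intersection standing in for DSPMK; how the experiments in this paper actually invoked LIBSVM is not stated.

```python
from typing import Callable, List, Sequence


def gram_matrix(items: Sequence, kernel: Callable[[object, object], float]) -> List[List[float]]:
    """Symmetric matrix of kernel values between all pairs of items."""
    n = len(items)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            gram[i][j] = gram[j][i] = kernel(items[i], items[j])
    return gram


def to_libsvm_precomputed(labels: Sequence[int], gram: Sequence[Sequence[float]]) -> List[str]:
    """Format rows as '<label> 0:<serial> 1:K(i,1) ... n:K(i,n)' (serials start at 1)."""
    lines = []
    for serial, (label, row) in enumerate(zip(labels, gram), start=1):
        columns = " ".join(f"{j}:{value:.6g}" for j, value in enumerate(row, start=1))
        lines.append(f"{label} 0:{serial} {columns}")
    return lines


def toy_kernel(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in for DSPMK: histogram intersection of two normalized vectors."""
    return sum(min(x, y) for x, y in zip(a, b))


train_vectors = [[0.5, 0.5], [1.0, 0.0]]
for line in to_libsvm_precomputed([1, 2], gram_matrix(train_vectors, toy_kernel)):
    print(line)
# 1 0:1 1:1 2:0.5
# 2 0:2 1:0.5 2:1
```

For test samples the same format is used, with each row holding the kernel values between the test sample and every training sample.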

Table 3. Comparison of classification accuracy (CA) (in %) with 95% confidence interval of the proposed approach with state-of-the-art approaches on the MIT-8 scene, Vogel-Schiele, MIT-67 Indoor and SUN-397 datasets, similar to the study shown in Sect. 4, Table 2 of paper [16]. (SIFT: Scale invariant feature transform, IFK: Improved Fisher kernel, BoP: Bag of part, MOP: Multi-scale orderless pooling, FV: Fisher vector, DSP: Deep spatial pyramid, MPP: Multi-scale pyramid pooling, DSFL: Discriminative and shareable feature learning and NN: Neural network).

Table 3 compares the scene image classification accuracy of the proposed DSPMK-based SVM classifier and the SPP-based neural network classifier with that of state-of-the-art approaches. From Table 3, it is seen that both of our proposed approaches perform better than the traditional feature based approaches in [21, 25] and the CNN-based approaches in [11, 13, 26, 42, 46].

The work in [25] uses scale-invariant feature transform (SIFT) descriptors to represent an image as a set of local feature vectors, which are then converted into a bag-of-visual-words (BOVW) representation for classification with a linear kernel based SVM classifier. The work in [21] combines a learned bag-of-parts (BoP) representation with the improved Fisher vector to build a linear kernel based SVM classifier. The work in [13] extracts CNN-based features from multiple scales of an image at different levels and performs orderless vector of locally aggregated descriptors (VLAD) pooling [20] at every scale separately. The representations from the different levels are then concatenated to form a new representation, known as multi-scale orderless pooling (MOP), which is used for training a linear kernel based SVM classifier. The work in [46] takes a more direct approach: a large-scale image dataset (the Places dataset) is used to train the AlexNet architecture, and fully-connected (fc7) layer features are extracted from the trained network. The basic architecture of their Places-CNN is the same as that of AlexNet [23] trained on ImageNet. The work in [46] also trains a Hybrid-CNN by combining the training data of the Places dataset with that of the ImageNet dataset. Here, features from the fully-connected (fc7) layer are again used for training a linear kernel based SVM classifier. The work in [26] obtains a semantic Fisher vector (FV) using standard Gaussian mixture encoding of CNN-based features. A linear kernel based SVM classifier is then built on the semantic FV for classification of scene images. The work in [11] uses a generative model based approach to build a dictionary on top of CNN activation maps. An FV representation for different spatial regions of the activation map is then obtained from the dictionary. Power and \(l_2\) normalization are applied to the combined FV from the different spatial regions. A linear kernel based SVM classifier is then used for scene classification. The work in [42] combines the features from the fc7 layer of AlexNet (Alex-fc7) with complementary features obtained by discriminative and shareable feature learning (DSFL). DSFL learns discriminative and shareable filters with a target dataset. The final image representation is used with a linear kernel based SVM classifier for the scene classification task.
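The power and \(l_2\) normalization step mentioned above for the FV representation of [11] can be illustrated with a minimal sketch. The function name and the exponent value 0.5 (signed square root) are assumptions made here for illustration; the exact setting used in [11] is not stated in this section.

```python
import numpy as np

def power_l2_normalize(v, alpha=0.5, eps=1e-12):
    """Signed power normalization followed by L2 normalization.

    alpha=0.5 (signed square root) is an assumed, commonly used value.
    """
    v = np.asarray(v, dtype=np.float64)
    # Power normalization: keep the sign, compress the magnitude.
    v = np.sign(v) * np.abs(v) ** alpha
    # L2 normalization; eps guards against division by zero.
    return v / max(np.linalg.norm(v), eps)

# Toy vector standing in for a combined Fisher vector.
fv = np.array([4.0, -9.0, 0.0, 1.0])
fv_norm = power_l2_normalize(fv)
```

After this step the vector has unit Euclidean length, so a linear kernel between two such vectors reduces to their cosine similarity.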

In contrast to all the approaches briefly explained above, our proposed approaches take images of arbitrary size and produce deep activation maps of varying size without any loss of information. The deep spatial pyramid match kernel can handle sets of deep activation maps of varying size and incorporates local spatial information when computing the level-wise matching score. Specifically, our proposed approach is simple and discriminative in nature, and it outperforms the other CNN-based approaches without combining any complementary features as in [42]. Our first proposed approach, based on SPP features with a neural network (NN), also gives good results (second only to our proposed DSPMK method), as this approach uses original-size images for fine-tuning the network. Our second proposed framework shows that, for scene recognition, good performance can be achieved by using last conv layer features with a DSPMK-based SVM. The proposed framework is free of fully connected layers, relies on the actual-size image, is memory efficient and simple, and takes much less computing time than state-of-the-art techniques.
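To make the idea of a level-wise matching score over activation maps of different sizes concrete, the following is a rough sketch of a spatial pyramid match between two maps. It is not the exact DSPMK formulation: the sum pooling, the L1 normalization, the histogram intersection score, and the classic spatial pyramid level weights are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import numpy as np

def block_pool(amap, level, pool="sum"):
    """Pool a (C, H, W) activation map into a 2**level x 2**level grid.

    Block boundaries scale with H and W, so maps of any spatial size
    yield the same number of blocks. Returns an array of shape (blocks, C).
    """
    _, h, w = amap.shape
    n = 2 ** level
    reduce = np.sum if pool == "sum" else np.max
    pooled = []
    for i in range(n):
        r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
        for j in range(n):
            c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
            pooled.append(reduce(amap[:, r0:r1, c0:c1], axis=(1, 2)))
    return np.stack(pooled)

def level_match(a, b, level, pool="sum"):
    """Histogram intersection between L1-normalized block-pooled maps."""
    pa, pb = block_pool(a, level, pool), block_pool(b, level, pool)
    pa = pa / (pa.sum() + 1e-12)
    pb = pb / (pb.sum() + 1e-12)
    return np.minimum(pa, pb).sum()

def pyramid_match_sketch(a, b, levels=2, pool="sum"):
    """Weighted sum of level-wise scores (assumed classic pyramid weights)."""
    scores = [level_match(a, b, l, pool) for l in range(levels + 1)]
    k = scores[0] / 2 ** levels
    for l in range(1, levels + 1):
        k += scores[l] / 2 ** (levels - l + 1)
    return k

# Two non-negative maps with the same channel count but different sizes.
rng = np.random.default_rng(0)
map_a = rng.random((8, 7, 9))
map_b = rng.random((8, 12, 5))
k_ab = pyramid_match_sketch(map_a, map_b)
```

The sketch assumes non-negative activations (for example, after a ReLU) and the same number of channels in both maps. Because the assumed level weights sum to one, a map matched against itself scores approximately 1.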

4.3 Experiment Studies for Fine-Grained Bird Species Classification

The experiments for fine-grained bird species classification cover three main aspects of our approach. First, we compute deep activation maps of varying size by passing images at their original size, without any prior loss of information. Second, we use DSPMK to compute the matching score between them. Third, we fine-tune the VGG-19 architecture [30] by adding an SPP layer to it. We fine-tune the network on the CUB-200-2011 dataset [35] and compute variable-size deep activation map features and SPP features for further experiments. We show that our proposed approach is generic: along with scene image classification, it works well for fine-grained bird species classification.

Table 4 shows the results of fine-grained bird species classification with different methods. We report results for testing with the bounding box (Bbox) and without it. The bounding box annotation essentially helps us to crop only the prominent bird region of interest (RoI) while discarding the background. Such regions may also be obtained by a detection algorithm. The case without Bbox corresponds to the complete original image. First, for both cases we pass the image to the CNN architecture at a fixed size, i.e., “\(224 \times 224\)”, and compute fixed-length fc7 and pool5 features. We use a linear kernel based SVM to compute the classification score. Second, we pass the image at its original size, without resizing it to “\(224 \times 224\)”, and compute deep activation maps of varying size. In this context, we perform experiments using the DSPMK-based SVM with different pooling techniques for computing the classification score. Next, we fine-tune the VGG-19 architecture by adding an SPP layer between the last convolutional layer and the first fully-connected layer. We use the fine-tuned network for further experiments in two ways. In the first approach, we compute the varying-size set of deep activation maps and use the DSPMK-based SVM to compute the classification score. In the second approach, we compute SPP features from the fine-tuned network and train a neural network based classifier. In this context, we use two hidden layers with 2096 neurons in each. We empirically chose a learning rate of 0.001 and a dropout of 0.5.
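The role of the SPP layer described above, turning a last conv layer output of arbitrary spatial size into a fixed-length vector that a fully-connected layer can accept, can be sketched as follows. The pyramid bins (1, 2, 4), the use of max pooling, and the function name are assumptions for illustration; the configuration actually used in the experiments is not given in this section.

```python
import math
import numpy as np

def spp_features(amap, bins=(1, 2, 4)):
    """Max-pool a (C, H, W) map over n x n grids and concatenate the results.

    The output length is C * sum(n * n for n in bins), independent of H and W.
    The bins (1, 2, 4) are an assumed pyramid configuration.
    """
    _, h, w = amap.shape
    feats = []
    for n in bins:
        for i in range(n):
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                # One C-dimensional vector per spatial cell.
                feats.append(amap[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(feats)

# Activation maps from images of different sizes give equal-length features.
rng = np.random.default_rng(0)
small_map = rng.random((16, 9, 31))
large_map = rng.random((16, 14, 20))
f_small = spp_features(small_map)
f_large = spp_features(large_map)
```

With 16 channels and the assumed bins, both vectors have 16 × (1 + 4 + 16) = 336 entries, which is what allows a single classifier to be trained on images of varying size.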

Table 4. Comparison of classification accuracy (CA) (in %) for the SVM-based classifier using linear kernel and DSPMK, and for the fine-tuned VGG19 with SPP-layer based neural network, on the CUB-200-2011 dataset. The proposed approach uses base features extracted from VGG19 [30]. Here, NN indicates neural network.

We observe in Table 4 that, if the images are not resized and no Bbox RoI detection is available, the original images can be used instead with the proposed DSPMK-based SVM approach. In this case, the classification accuracy is only marginally affected. A drop is natural, as the case with Bbox focuses only on the bird RoI. However, the difference is relatively small for most variants of the proposed methods using DSPMK and SPP. This indicates that, for bird images of the size and scale found in the CUB dataset, the proposed methods are largely invariant to RoI selection and can thus obviate an RoI detection step. When images are used without bounding box annotation, we observe a large improvement (approximately 10%) in performance from the linear kernel based SVM with VGG-19 pool5 features to the DSPMK-based SVM with varying-size activation map features from the last conv layer. We believe that our proposed approach computes the matching score between two images more efficiently by taking spatial information into account.

Table 5. Comparison of classification accuracy (CA) (in %) on the CUB-200-2011 dataset between different state-of-the-art methods and the proposed approaches. Some of the state-of-the-art approaches use part annotations during training and testing. The proposed approaches do not use any part information. (DPD: Deformable part descriptors; POOFs: Part-based One-vs-One Features; NN: Neural Network)

In Table 5, we compare the classification results of the proposed approaches with state-of-the-art results. The deformable part descriptor (DPD) in [44] is based on the supervised version of deformable part models (DPM) [10] for training, which then allows for pose normalization by comparing corresponding parts. The work in [1] learns a linear classifier for each pair of parts and classes. The decision values from many such classifiers are used as the feature representation. This approach also requires ground-truth part annotations at training as well as at test time. The work in [14] is based on nonparametric part detection. Here, the basic idea is to use nearest neighbor matching to retrieve similar training examples with human-annotated part positions. The work in [9] is based on feature extraction from part regions detected using a DPM, which have sufficient descriptive power and generalization ability to perform the desired task. The work in [2] uses deep CNNs to extract features from image patches that are located and normalized by pose. The work in [43] generates object proposals using Selective Search [33] and uses the part locations to calculate localized features from R-CNNs.

From Table 5, we also infer that our approaches for bird species classification do not require part annotations, and yet improve over very complex state-of-the-art approaches that use part-based annotations at training and test time. In contrast to those approaches, ours are generic and easy to adapt to other datasets, as we only require a pre-trained CNN architecture. For fine-tuning the CNN architecture with the SPP layer, we perform experiments without as well as with the bounding box. We observe that the proposed framework achieves much improved performance without any extra annotations.

5 Conclusion

In this work, we propose the deep spatial pyramid match kernel (DSPMK) for improving the base features from the last conv layer of a CNN. A DSPMK-based SVM can classify images of different sizes, which are represented as varying-size sets of deep activation maps. Further, we propose adding a spatial pyramid pooling layer to the CNN architecture so that we can fine-tune pre-trained CNNs on other datasets containing images of varying size. Our model has a dynamic kernel that calculates the layer-wise intermediate matching score and strengthens the matching procedure of conv layer features. Training the DSPMK-based SVM classifier takes much less time than training the GMM in [11]. In our research, we have considered the last convolutional layer features rather than fc layer features, as the fc layers restrict these features to a fixed size and require much more computation time, since they contain approximately 90% of the total parameters of the CNN. Thus, conv layer features are effective in handling images of widely varying size in scene image classification datasets such as SUN-397 and MIT-67, as well as size variations in fine-grained classification with the CUB dataset. Almost all approaches in fine-grained classification are specialized, but we show that our approach is generic and works well for both kinds of datasets. In terms of performance, our proposed approach achieves state-of-the-art results on standard scene classification and bird species classification datasets. In future, to capture differences in the activations caused by the varying size of concepts in an image, a multi-scale deep spatial pyramid match kernel can be investigated.