1 Introduction

The complex structure of the human visual system and the extensive processing performed by the brain when looking at an image give humans the impressive ability to recognize real-world images in a fraction of a second. Although real-world image classification, which is the focus of this paper, seems trivial for humans, it is a challenging task in computer vision. In recent years, image classification has attracted considerable attention in computer vision due to the rapid development of intelligent robots and the growing need for automatic image processing.

There is a very rich literature on image classification, including methods based on bag of words [1, 2], sparse representation [3,4,5,6,7], and deep learning [8,9,10]. We should point out that nonlinear classifiers, including kernel based ones, have gained more attention due to their higher performance compared to linear classifiers [5, 7, 9].

Classifying real world images is a challenging task. The following are the two challenges on which this paper concentrates. First, images cannot be described precisely by a single feature; therefore, they should be represented by multiple features such as color, shape, and texture. Second, the intra class variance (dissimilarities between images in the same class) and the inter class relationship (similarities between images from different classes) are large. These challenges are discussed in the following sub-sections.

1.1 The effectiveness of using multiple features

Images are informative in different aspects such as color, shape, and texture. Describing images with multiple features rather than a single feature results in a more accurate classifier. For example, the approach proposed in [11] describes an image by means of multiple bag of words features and designs a classifier based on them. In addition, several kernel based classifiers built on multiple features have been proposed [12,13,14,15,16].

1.2 Large intra class variance and inter class relationship

The second principal challenge in real world image classification is the existence of large intra class variance and large inter class relationship between images. Even if we use multiple features, there are images within a class that could be considered dissimilar (large intra class variance). Moreover, there are images from different classes that may be assigned to the same class (large inter class relationship). Fig. 1 illustrates the second challenge.

Fig. 1

This figure illustrates that the intra class variance and inter class relationship are large in real world image datasets. Images in each box belong to the same class. The images on both sides of the vertical dashed line are examples of dissimilarity between images in the same class. The horizontal red arrows connect two images which are similar but belong to different classes. Images are taken from Caltech 101

In addition to the two described challenges, we should note that the feature spaces of real world images are complex, so they cannot be classified linearly. Kernel based methods have achieved major success in building nonlinear classifiers [17]. The multiple kernel learning (MKL) framework proposed by Lanckriet et al. is considered one of the most powerful classifiers [18]. To classify data, MKL considers a linearly weighted sum of kernels instead of a single kernel. By using MKL we can combine different kernels, each computed from an individual feature (for example, a color based kernel describes the color information of an image). In this way, the first challenge is addressed.

In the standard MKL framework, as stated above, the computed kernel weights are the same for all samples. This means that each kernel has a fixed share in deciding the class of each test image. With respect to the second challenge, a more accurate classifier will be achieved if the share of each kernel is not fixed, and its weight is instead computed based on its efficiency in classifying each sample. For example, in the first row of Fig. 1, to prevent misclassification the weight of the color based kernel should be reduced while the weights of the other kernels should be increased.

Gönen et al. proposed a localized multiple kernel learning (LMKL) framework which computes non-uniform weights for kernels based on their location in the feature space [19]. LMKL is briefly reviewed in section 2. To address both challenges mentioned in subsections 1.1 and 1.2, we propose a feature fusion version of the original LMKL. A comparison between a single-kernel SVM, MKL, LMKL, and the feature fusion based LMKL is illustrated in Fig. 2. The block diagram of our proposed system is depicted in Fig. 3. Our experiments on Caltech 101 and Caltech 256 achieved promising results.

Fig. 2

This figure illustrates different approaches of using kernels in combination with SVM. a When data samples from different classes are not linearly separable, they are mapped from the input space to a higher (even infinite) dimensional Hilbert space. In the mapped space, data samples are linearly classified by SVM. We should note that this mapping is done implicitly by introducing a kernel function. b In the MKL framework, multiple kernels are used instead of a single one. Fixed weights for the kernels are computed in the training phase and the weighted sum of kernels is formed. c In the LMKL framework, local weights are computed for the kernels in the training phase. Unlike MKL, they are not fixed. d In the feature fusion based LMKL, data samples are represented by heterogeneous features instead of a single one. As shown in d, data samples are represented by two features. Three kernels are computed for the feature shown in the top rectangle, and two kernels are computed for the one in the bottom rectangle

Fig. 3

Block diagram of the proposed feature fusion-based LMKL

The rest of the paper is organized as follows. A brief review of LMKL is given in section 2. In section 3 the proposed algorithm is discussed in detail. The experimental results are given and analyzed in section 4. Finally, we conclude the paper in section 5.

2 LMKL related work

In this section, we give a brief review of work related to localized multiple kernel learning (LMKL), which is an extension of the MKL framework. The original MKL computes a fixed weight for each kernel by embedding the kernel weights in the SVM optimization problem and then constructs a single kernel by summing the weighted kernels [18]. In [12], fixed kernel weights are computed by a slight modification of the MKL framework: heterogeneous features are extracted from the data, a group of kernels is assigned to each feature, and a group lasso regularization selects only a few kernels per feature.

Some other works assign fixed weights to kernels without using the standard MKL framework. Gu et al. computed fixed kernel weights by projecting the kernels in the maximum variance direction [20]. Wang et al. computed optimal fixed kernel weights by finding the projective direction which results in maximum separation between kernels in RKHS (Reproducing Kernel Hilbert Space) [21].

There are also approaches which combine kernels in a nonlinear manner while the weight of each kernel remains fixed. For example, in [22], all weighted kernel matrices are combined by the Hadamard product, where each kernel matrix and its corresponding weight are raised to the same power. Algorithms which combine weighted kernels are reviewed and discussed in [23].

As discussed in section 1.2, in problems like image classification, it is more beneficial to use variable weights for each kernel. Some algorithms which compute variable weights for kernels are discussed below.

Lewis et al. combined kernels in a non-stationary manner within a maximum entropy discrimination framework [24]. Lee et al. proposed a method to combine kernels without learning distinct kernel weights [25]. In this method, the local impact of each kernel is directly considered in the margin maximization process. Gönen et al. designed a nonlinear framework which computes separate kernel weights for each data point based on nonlinear gating functions [19]. Yang et al. defined inter class clusters of samples and found the optimal kernel combination for each cluster in an image classification task [26]. In [19, 26] the authors suggested partitioning the space linearly. Kannao et al. allowed nonlinear boundaries between clusters of the space [27]. They computed a linear kernel weight per cluster in a preprocessing step without considering the sample labels.

Despite the benefits of computing variable weights for kernels, few works in image classification are based on this approach. Lu et al. proposed a Localized Multiple Kernel Metric Learning approach to classify images taken from varying viewpoints or under varying illumination [28]. Fan et al. considered the relationship between the global and local structures of features [29]. They proposed an algorithm based on multiple empirical kernels which maps data explicitly into multiple kernel spaces.

3 Methods

In this section, we first explain the SPM model which is used to represent images. Then, we introduce the designed feature fusion-based LMKL algorithm and its optimization problem in detail. Finally, the optimization strategy used to solve the problem is discussed.

3.1 Image representation by SPM model

The introduction of the bag of words (BoW) model for computing image features significantly improved the performance of image classification systems [30]. Pyramid matching is a BoW based model to approximate the similarity between two images [31]. In this model, a pyramid of grids is placed on the feature space at different resolutions. At each resolution level, the corresponding histogram of the image is computed. A weighted sum of histograms is computed such that finer resolutions get higher weights. Finally, the intersection kernel is applied to the weighted histograms of two images to approximate their correspondence. The main shortcoming of the pyramid matching method is that it discards the spatial information of images, which plays an important role in the performance of image classification systems. Lazebnik et al. proposed the spatial pyramid match (SPM) approach to address this problem [1]. Extending BoW, the SPM method divides the original image into sub-regions in a pyramid manner and computes histograms of features in each sub-region separately. The final representation of the image is the concatenation of the extracted histograms.
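
The following Python sketch illustrates how such an SPM representation can be assembled, assuming each local descriptor has already been assigned to a visual word and has an image location normalized to [0, 1); the function and variable names are illustrative, not the authors' MATLAB implementation.

    import numpy as np

    def spm_histogram(word_ids, positions, vocab_size, levels=(1, 2, 4)):
        # word_ids  : (n,) array with the visual-word index of each local descriptor
        # positions : (n, 2) descriptor locations normalized to [0, 1)
        parts = []
        for cells in levels:                      # e.g. 1x1, 2x2, 4x4 grids
            col = np.minimum((positions[:, 0] * cells).astype(int), cells - 1)
            row = np.minimum((positions[:, 1] * cells).astype(int), cells - 1)
            for r in range(cells):
                for c in range(cells):
                    in_cell = word_ids[(row == r) & (col == c)]
                    parts.append(np.bincount(in_cell, minlength=vocab_size))
        h = np.concatenate(parts).astype(float)
        return h / max(h.sum(), 1.0)              # L1-normalize the concatenated histogram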

3.2 Preliminaries and formulation of feature fusion based LMKL

Consider the classification task \( D={\left\{\left({x}_i,{y}_i\right)\right\}}_{i=1}^N \), where N is the number of samples, \( x_i \) denotes the i-th sample, and \( y_i \in \{\pm 1\} \) is the corresponding label for binary classification. In the MKL framework, multiple kernels are combined as follows:

$$ K\left({x}_i,{x}_j\right)=\sum \limits_{k=1}^m{\pi}_k{K}_k\left({x}_i,{x}_j\right) $$
(1)

where m is the number of kernels and \( \pi_k \) is the weight of the k-th kernel.
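
As a minimal illustration of (1), the combined kernel matrix is simply a fixed weighted sum of the base kernel matrices; the weights are assumed to be given here.

    import numpy as np

    def mkl_combined_kernel(kernel_mats, pi):
        # Eq. (1): K = sum_k pi_k * K_k, with sample-independent weights pi_k
        K = np.zeros_like(kernel_mats[0], dtype=float)
        for K_k, pi_k in zip(kernel_mats, pi):
            K += pi_k * K_k
        return K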

The discriminant function \( f(x_i) \) for a test sample \( x_i \) in the standard MKL framework is formulated as follows:

$$ f\left({x}_i\right)=\sum \limits_{k=1}^m{\pi}_k\left\langle {w}_k^T,{\varphi}_k\left({x}_i\right)\right\rangle +b $$
(2)

where \( \varphi_k(x_i) \) represents the k-th mapping function, and \( w_k \) and b are the SVM parameters.

The standard MKL framework assigns fixed weights to the kernels over the entire space. As discussed in section 1.2, because of the large intra class variance and inter class relationship in complicated spaces, such as an image feature space, identical kernel weights for all samples are not suitable. For example, in some cases a kernel based on color information is more informative than a texture based kernel. Therefore, a more accurate classifier will be achieved if variable weights are assigned to a kernel in different areas of the space.

Gönen and Alpaydin proposed a localized MKL (LMKL) framework in which the kernel weights are calculated separately for each training sample [19]. The localized version of \( K(x_i, x_j) \) is as follows:

$$ K\left({x}_i,{x}_j\right)=\sum \limits_{k=1}^m{\pi}_k\left({x}_i\right){\pi}_k\left({x}_j\right){K}_k\left({x}_i,{x}_j\right) $$
(3)

where \( \pi_k(x_i) \) is the weight of the k-th kernel corresponding to \( x_i \).

In the original LMKL framework, Gönen et al. assumed that all kernels are computed from a single feature. In the proposed algorithm, multiple kernels are computed from multiple features. Using multiple features instead of a single one results in a more accurate classifier in an image classification task, as discussed in section 1.1. The kernel value between two images \( x_i \) and \( x_j \) is computed as follows:

$$ K\left({x}_i,{x}_j\right)=\sum \limits_{k=1}^m{\pi}_k\left({x}_i^k\right){\pi}_k\left({x}_j^k\right){K}_k\left({x}_i^k,{x}_j^k\right) $$
(4)

where \( x_i^k \) is the representation of training sample \( x_i \) corresponding to the k-th feature.
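
A compact sketch of (4) is given below: each base kernel is evaluated on its own feature representation, and every entry is scaled by the gate values of the two samples involved. The gate values are assumed to have been computed already.

    import numpy as np

    def localized_kernel(kernel_mats, gates):
        # Eq. (4): K(x_i, x_j) = sum_k pi_k(x_i^k) pi_k(x_j^k) K_k(x_i^k, x_j^k)
        # kernel_mats : list of (N, N) base kernel matrices, one per feature/kernel
        # gates       : list of (N,) gate values pi_k evaluated at each sample
        K = np.zeros_like(kernel_mats[0], dtype=float)
        for K_k, g_k in zip(kernel_mats, gates):
            K += np.outer(g_k, g_k) * K_k          # scale entry (i, j) by pi_k(x_i) pi_k(x_j)
        return K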

The combined kernel of (4) changes the standard kernel based margin maximization problem of SVM into a non-convex optimization problem. Instead of solving this difficult optimization problem directly, Gönen et al. estimated the kernel weights by using a gating function.

A gating function formulates the effectiveness of the k-th kernel in the classification of sample \( x_i \). There are several ways to define the gating function. The sigmoid function formulated in (5) is a good choice and was used by Gönen et al. [19]:

$$ {\pi}_k\left({x}_i^k\right)=1/\left(1+\exp \left(-\left\langle {v}_k,{x}_i^k\right\rangle -{v}_{k0}\right)\right) $$
(5)

where \( v_k \) and \( v_{k0} \) are the parameters of the gating function. As stated before, \( x_i^k \) is the representation of training sample \( x_i \) corresponding to the k-th feature, which is in the form of an SPM histogram. Comparing SPM histograms by their inner product is not accurate enough; the χ2 kernel is a better choice for histogram comparison. Therefore, we modified the gating function of (5) by using the χ2 kernel instead of the inner product. The χ2 kernel based gating function is as follows:

$$ {\pi}_k\left({x}^k\right)=1/\left(1+\mathit{\exp}\left(-{X}^2\left({v}_k,{x}^k\right)-{v}_{k0}\right)\right) $$
(6)

where the χ2 kernel is defined as:

$$ {X}^2\left({v}_k(i),x(i)\right)=2{v}_k(i)x(i)/\left({v}_k(i)+x(i)\right),i=1\dots DG $$
(7)

where DG is the dimension of the feature space; the per-dimension values of (7) are summed over i to yield the scalar similarity \( X^2(v_k, x^k) \) used in (6).

Because of the efficiency of the χ2 kernel in computing the similarity of SPM histograms, we also use the following gating function:

$$ {\pi}_k\left({x}^k\right)={X}^2\left({v}_k,{x}^k\right)+{v}_{k0} $$
(8)
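
A minimal sketch of the two gates in (6)–(8) is shown below; the small constant in the denominator only guards against empty histogram bins and is an implementation detail not stated in the text.

    import numpy as np

    def chi2_similarity(v, x, eps=1e-12):
        # Eq. (7), summed over dimensions: sum_i 2 v(i) x(i) / (v(i) + x(i))
        return np.sum(2.0 * v * x / (v + x + eps))

    def sigmoid_gate(v, v0, x):
        # Eq. (6): sigmoid gate driven by the chi-square similarity
        return 1.0 / (1.0 + np.exp(-chi2_similarity(v, x) - v0))

    def chi2_gate(v, v0, x):
        # Eq. (8): the chi-square similarity itself, plus a bias term
        return chi2_similarity(v, x) + v0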

3.3 Optimization strategy

By plugging the local kernel weights into the standard MKL formulation, the following optimization problem is obtained:

$$ {\displaystyle \begin{array}{c}{\min}_{\left\{{w}_k\right\},b,\left\{{\xi}_i\right\},\left\{{v}_k\right\},\left\{{v}_{k0}\right\}}\frac{1}{2}\sum \limits_{k=1}^m{\left\Vert {w}_k\right\Vert}^2+C\sum \limits_{i=1}^N{\xi}_i\\ {}\mathrm{subject}\ \mathrm{to}\kern1em {y}_i\left(\sum \limits_{k=1}^m{\pi}_k\left({x}_i^k\right)\left\langle {w}_k,{\varPhi}_k\left({x}_i^k\right)\right\rangle +b\right)\ge 1-{\xi}_i,\kern1.25em i=1\dots N,\kern1em {\xi}_i\ge 0\end{array}} $$
(9)

where C is the regularization parameter and the \( \xi_i \) are slack variables.

Since the standard MKL problem is convex, it can be solved by common optimization methods. Combining nonlinear gating functions with the standard MKL problem, however, turns the convex MKL problem into a nonlinear, non-convex one. This problem can be solved using the alternate optimization method, which is an iterative two step approach. In step one, some parameters are assumed to be fixed and the others are computed by solving the optimization problem. In step two, the parameters that were free in the first step are fixed and the remaining parameters are calculated by solving the new optimization problem. The optimization algorithm iterates until convergence. We considered two termination criteria: reaching the maximum number of iterations and the change of the objective function falling below a predefined threshold.
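
The overall training loop can be sketched as follows; init_gate_params, eval_gates, solve_svm_dual, and gradient_step are hypothetical placeholders for the two steps detailed below, and localized_kernel is the sketch given after Eq. (4).

    import numpy as np

    def train_lmkl(base_kernels, y, max_iter=50, tol=1e-4):
        # Alternate optimization sketch: SVM step, then gating-parameter step.
        gate_params = init_gate_params(base_kernels)       # hypothetical initializer
        prev_obj = np.inf
        for it in range(max_iter):
            gates = eval_gates(gate_params)                # pi_k(x_i^k) for every sample and kernel
            K = localized_kernel(base_kernels, gates)      # combined kernel of Eq. (4)
            alpha, b, obj = solve_svm_dual(K, y)           # step one: canonical SVM with gates fixed
            gate_params = gradient_step(gate_params, alpha, y, base_kernels, gates)  # step two
            if abs(prev_obj - obj) < tol:                  # stop when the objective barely changes
                break
            prev_obj = obj
        return alpha, b, gate_params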

Step one: Learning SVM parameters.

In this step, the optimization problem should be minimized with respect to \( w_k \), \( \xi_i \), and b, while \( v_k \) and \( v_{k0} \) are fixed. In order to remove the constraints, the Lagrangian of problem (9) is formed and the following problem is obtained:

$$ L\left(\left\{{w}_k\right\},b,\left\{{\xi}_i\right\},\left\{{\lambda}_i\right\},\left\{{\eta}_i\right\}\right)=\frac{1}{2}\sum \limits_{k=1}^m{\left\Vert {w}_k\right\Vert}^2+\sum \limits_{i=1}^N\left(C-{\lambda}_i-{\eta}_i\right){\xi}_i+\sum \limits_{i=1}^N{\lambda}_i-\sum \limits_{i=1}^N{\lambda}_i{y}_i\left(\sum \limits_{k=1}^m{\pi}_k\left({x}_i^k\right)\left\langle {w}_k,{\varPhi}_k\left({x}_i^k\right)\right\rangle +b\right) $$
(10)

where \( \lambda_i \) and \( \eta_i \) are the Lagrange multipliers.

Setting the derivatives of (10) with respect to \( w_k \), b, and \( \xi_i \) to zero results in:

$$ {\displaystyle \begin{array}{c}\partial L/\partial {w}_k=0\Rightarrow {w}_k-\sum \limits_{i=1}^N{\lambda}_i{y}_i{\pi}_k\left({x}_i^k\right){\varPhi}_k\left({x}_i^k\right)=0\\ {}\partial L/\partial b=0\Rightarrow \sum \limits_{i=1}^N{\lambda}_i{y}_i=0\ \\ {}\partial L/\partial {\xi}_i=0\Rightarrow C-{\lambda}_i-{\eta}_i=0\end{array}} $$
(11)

Substituting (11) into (10), the dual problem of (10) is obtained:

$$ {\displaystyle \begin{array}{l}J={\mathit{\max}}_{\left\{{\lambda}_i\right\}}\sum \limits_{i=1}^N{\lambda}_i-\frac{1}{2}\sum \limits_{i=1}^N\sum \limits_{j=1}^N{\lambda}_i{\lambda}_j{y}_i{y}_j\sum \limits_{k=1}^m{\pi}_k\left({x}_i^k\right){\pi}_k\left({x}_j^k\right){K}_k\left({x}_i^k,{x}_j^k\right)\\ {}\mathrm{such}\ \mathrm{that}\ \sum \limits_{i=1}^N{\lambda}_i{y}_i=0,\kern0.5em 0\le {\lambda}_i\le C\ \end{array}} $$
(12)

where \( {K}_k\left({x}_i^k,{x}_j^k\right)=\left\langle {\varPhi}_k\left({x}_i^k\right),{\varPhi}_k\left({x}_j^k\right)\right\rangle \).

If we prove that the localized weighted sum of kernels \( {\sum}_{k=1}^m{\pi}_k\left({x}_i^k\right){\pi}_k\left({x}_j^k\right){K}_k\left({x}_i^k,{x}_j^k\right) \) is a positive semi-definite kernel matrix, then (12) can be solved as a canonical SVM problem.

In order to prove that the localized weighted sum of kernels is positive semi definite, we use the definition of a quasi-conformal transformation. For a positive function c(x), a quasi-conformal transformation of K(x, y) is defined as follows:

$$ \tilde{K}\left(x,y\right)=c(x)c(y)K\left(x,y\right) $$
(13)

The gating functions in (6) and (8) used in our experiments always take positive values; therefore, \( {\pi}_k\left({x}_i^k\right){\pi}_k\left({x}_j^k\right){K}_{k}\left({x}_i^k,{x}_j^k\right) \) in (4) is a quasi-conformal transformation of K(x, y). Positive semi-definite kernels are closed under quasi-conformal transformations [32], so \( {\pi}_k\left({x}_i^k\right){\pi}_k\left({x}_j^k\right){K}_k\left({x}_i^k,{x}_j^k\right) \) is a positive semi-definite kernel. Moreover, the sum of positive semi-definite kernels is itself a positive semi-definite kernel. Thus, \( {\sum}_{k=1}^m{\pi}_k\left({x}_i^k\right){\pi}_k\left({x}_j^k\right){K}_k\left({x}_i^k,{x}_j^k\right) \) is positive semi-definite as well, and (12) can be treated as a canonical SVM and solved by common approaches.
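
Since the combined matrix is positive semi-definite, step one reduces to an ordinary SVM with a precomputed kernel. The sketch below uses scikit-learn purely for illustration (the authors' implementation is in MATLAB, see section 4.1); C = 10 is the regularization value selected later by cross validation.

    import numpy as np
    from sklearn.svm import SVC

    def solve_svm_step(K_train, y_train, C=10.0):
        # Solve the canonical SVM dual of Eq. (12) on the precomputed localized kernel.
        clf = SVC(C=C, kernel='precomputed')
        clf.fit(K_train, y_train)                 # K_train is (N, N), built as in Eq. (4)
        return clf

    # At test time, the localized train-test kernel is built with the same gates
    # evaluated on the test samples and passed to clf.predict / clf.decision_function.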

Step Two: Learning locality function parameters.

To determine the parameter values of the gating functions, we use the gradient descent method: the derivatives of the dual problem (12) are calculated with respect to \( v_k \) and \( v_{k0} \) while \( w_k \), b, and \( \xi_i \) are fixed. The step size of each iteration is determined by a line search method. Taking the derivatives of problem (12) with respect to \( v_k \) and \( v_{k0} \), we obtain:

$$ {\displaystyle \begin{array}{l}\partial J/\partial {v}_k=-\frac{1}{2}\sum \limits_{i=1}^N\sum \limits_{j=1}^N\sum \limits_{k=1}^m{\lambda}_i{\lambda}_j{y}_i{y}_j{K}_k\left({x}_i^k,{x}_j^k\right)\left({\pi_k}^{\hbox{'}}\left({x}_i^k\right){\pi}_k\left({x}_j^k\right)+{\pi}_k\left({x}_i^k\right){\pi_k}^{\hbox{'}}\left({x}_j^k\right)\right)\\ {}\partial J/\partial {v}_{k0}=-\frac{1}{2}\sum \limits_{i=1}^N\sum \limits_{j=1}^N\sum \limits_{k=1}^m{\lambda}_i{\lambda}_j{y}_i{y}_j{\pi}_k\left({x}_i^k\right){\pi}_k\left({x}_j^k\right){K}_k\left({x}_i^k,{x}_j^k\right)\left(2-{\pi}_k\left({x}_i^k\right)-{\pi}_k\left({x}_j^k\right)\right)\end{array}} $$
(14)

where \( \pi_k'(x) \) is defined as (15) for the χ2 gating function of (8),

$$ {\left\{2\left(x(i)\left(x(i)+{v}_k(i)\right)-x(i){v}_k(i)\right)/{\left(x(i)+{v}_k(i)\right)}^2\right\}}_{i=1}^{DG} $$
(15)

and \( \pi_k'(x) \) is defined as (16) for the χ2 kernel based sigmoid gating function of (6),

$$ A\left(\exp \left(-{X}^2\left({v}_k,{x}^k\right)-{v}_{k0}\right)\right)/{\left(1+\exp \left(\left(-{X}^2\left({v}_k,{x}^k\right)-{v}_{k0}\right)\right)\right)}^2 $$
(16)

where A equals the expression in (15).
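
As a concrete example, the bias gradient of the k-th sigmoid gate (the k-th term of the double sum in (14)) can be evaluated as sketched below, using the fact that the derivative of the sigmoid with respect to its bias is π(1 − π); this sketch is restricted to a single kernel k and is only an illustration of (14), not the authors' code.

    import numpy as np

    def grad_v_k0_sigmoid(lam, y, K_k, pi_k):
        # lam  : (N,) Lagrange multipliers from the SVM step
        # y    : (N,) labels in {-1, +1}
        # K_k  : (N, N) k-th base kernel matrix
        # pi_k : (N,) sigmoid gate values pi_k(x_i^k)
        ly = lam * y
        W = np.outer(ly, ly) * K_k                                        # lambda_i lambda_j y_i y_j K_k(i, j)
        S = np.outer(pi_k, pi_k) * (2.0 - pi_k[:, None] - pi_k[None, :])  # pi_i pi_j (2 - pi_i - pi_j)
        return -0.5 * np.sum(W * S)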

The block diagram of the optimization strategy to find the parameters of the training model is depicted in Fig. 4.

Fig. 4

Optimization strategy for finding the parameters of the training model. The entire process shown in the loop is repeated until convergence. The convergence criteria are based on the number of iterations and the change in the objective function

4 Results and discussion

In this section, we conduct experiments to study the classification performance of the proposed method on two widely used benchmark datasets: Caltech 101 [33] and Caltech 256 [34]. Both datasets are challenging for image classification because of their large intra class variance and inter class relationship. In particular, the intra class variance in Caltech 256 is very large, making it even more challenging.

4.1 Experimental configurations

We explain the implementation details of our proposed algorithm in this section. To describe images, the features are first extracted and the kernels are then computed based on them. We used a subset of the features suggested in [15]. The selected features for Caltech 101 include dense SIFT (scale invariant feature transform) [30], dense color SIFT, and SSIM (structural similarity) [35]. Dense SIFT is calculated over regular grids of 16 × 16 image patches with eight-pixel spacing using the VLFeat library [36]. Likewise, dense color SIFT is calculated in the three channels of CIELab. SSIM is computed in 5 × 5 patches to obtain a correlation map.

To represent images for classification, we considered spatial pyramid match (SPM) histograms based on the extracted features [1]. To this end, we trained three separate dictionaries via k-means clustering for the dense SIFT, dense color SIFT, and SSIM feature spaces. The numbers of visual words for the individual dictionaries are 600, 600, and 300, respectively. Compared to similar works, we used fewer visual words per dictionary, thereby avoiding large feature vectors and reducing the computation time. To generate the SPM representation, each image was partitioned hierarchically into 1 × 1, 2 × 2, and 4 × 4 blocks, and the feature vectors of each individual block were encoded based on the learned dictionaries.
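
A rough sketch of this dictionary learning and encoding stage is given below, reusing the spm_histogram helper from section 3.1; MiniBatchKMeans from scikit-learn is used here purely for illustration and is not the authors' implementation.

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    def build_dictionary(descriptors, vocab_size):
        # Cluster local descriptors (e.g. dense SIFT) into a visual dictionary.
        km = MiniBatchKMeans(n_clusters=vocab_size, batch_size=1024, random_state=0)
        km.fit(descriptors)
        return km

    def encode_image(descriptors, positions, dictionary, levels=(1, 2, 4)):
        # Assign descriptors to visual words and pool them into an SPM histogram
        # over the 1x1, 2x2, and 4x4 grids used in the paper.
        word_ids = dictionary.predict(descriptors)
        return spm_histogram(word_ids, positions, dictionary.n_clusters, levels)

    # Dictionary sizes used here: 600 (dense SIFT), 600 (dense color SIFT), 300 (SSIM).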

The abovementioned SPM based feature vectors were fed to the proposed classifier. To compute the train-train and train-test kernel matrices, we used the parameter-free χ2 kernel for all features. The proposed algorithm is written in MATLAB, and the source code available in [15, 23] is used as well.

We used two gating functions to compute the kernel weights: the χ2 based sigmoid gate and the χ2 gate, as formulated in (6) and (8). We partitioned the training data into training and validation sets by cross validation and then performed a grid search to tune the SVM regularization parameter and select the gating function simultaneously. The SVM regularization parameter was set to 10 and the χ2 gate of (8) was selected by cross validation.

The optimization problem discussed in section 3.3 was solved in two phases in an iterative manner. In the first phase, the parameters of the gating function are fixed and the problem is solved in the same way as a standard kernel based SVM problem. In the second phase, the problem is solved for the parameters of the gating function by a gradient descent approach.

In addition, we followed the One vs. All strategy in the training phase, training one classifier for each individual class. We should note that, compared to the One vs. One method, the One vs. All method generally suffers from higher data imbalance between one class and the remaining classes. However, because of the high intra class variance in real world image classification, the One vs. One method suffers from the same data imbalance problem. The data imbalances both inside each class and between classes are addressed by assigning variable weights to the kernels, as discussed in section 1.2.
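
A minimal one-vs-all wrapper around the binary trainer sketched in section 3.3 could look as follows; train_lmkl is the hypothetical helper introduced there.

    import numpy as np

    def train_one_vs_all(base_kernels, labels, classes):
        # Train one binary LMKL classifier per class (class c vs. the rest).
        models = {}
        for c in classes:
            y_binary = np.where(labels == c, 1, -1)
            models[c] = train_lmkl(base_kernels, y_binary)
        return models

    # At test time, each binary classifier produces a decision value for the test
    # image and the class whose classifier gives the largest value is predicted.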

4.2 Evaluations on Caltech 101

Caltech 101 contains a total of 9144 images in 101 object classes plus an extra BACKGROUND class [33]. Each class has 31 to 800 images. Most images are of medium size, about 300 × 300 pixels. Caltech 101 is a challenging dataset because of its large number of classes, intra class variance, and inter class relationship. For a fair comparison with other works, we followed the experimental setup suggested in [1] and randomly selected 30 images per class for training, leaving the rest for testing.

Table 1 reports the mean classification accuracy over the 102 classes of Caltech 101 for the related algorithms and ours. According to this table, our algorithm outperforms all of the baseline algorithms, including nearest neighbor based SVM [37], SPM [1], ScSPM [38], nearest neighbor [39], and LLC [40].

Table 1 Performance comparison of algorithms on Caltech 101 using 30 training images per class

In addition, we note that, as reported in [15], which uses the same experimental setup as ours, the classification accuracy of a single kernel SVM is around 73%, 62.5%, and 62% for the dense SIFT, dense color SIFT, and SSIM features, respectively. The confusion matrix of the classification is depicted in Fig. 5.

Fig. 5

Confusion matrix of Caltech 101 classification by the proposed algorithm

4.3 Evaluations on Caltech 256

Caltech 256 contains 30,607 images in 256 classes and a BACKGROUND class [34]. Each class contains at least 80 images. Compared to Caltech 101, Caltech 256 is more challenging because the objects are not centered in the images and the intra class variance is much higher.

As a common experimental setup for this dataset, we chose 30 images per class for training and used the rest for testing. We measured the performance of our proposed algorithm by calculating the mean classification accuracy over 257 classes. Table 2 shows the comparison results of our algorithm with the related ones. Fig. 6 illustrates the classification confusion matrix.

Table 2 Performance comparison of algorithms on Caltech 256 using 30 training images per class
Fig. 6

Confusion matrix of Caltech 256 classification by the proposed algorithm

As seen in Table 2, the classification accuracy of [2] is 3.08% higher than ours. The reason for this better performance is that, in comparison to SPM (the feature extraction used in our algorithm), the method in [2] considers not only the spatial information of images but also the shape information. To this end, it integrates the salient region and the spatial geometry structure, which makes the visual words more discriminative. This integration also makes the extracted feature vectors more robust to background clutter and to location variations of objects within each category. The approach indirectly gives more weight to shape descriptor parameters, which could be the cause of its better performance on large datasets.

4.4 Performance on difficult classes

There are some classes in Caltech 101 whose images are very difficult to classify because of high intra class variance. In [41] the average classification accuracy for nine difficult classes, namely butterfly, crab, cannon, crayfish, beaver, crocodile, cougar body, chair, and lamp, is reported as 24%, while our proposed method achieves an average accuracy of 52.38% on the same classes. Fig. 7 shows samples from four of these difficult classes.

Fig. 7

A few example images from difficult classes of Caltech 101. This figure illustrates the large intra class variance within each class

In addition, the authors of [1] tested their method on four difficult Caltech 101 classes, namely cougar body, beaver, crocodile, and ant, and reported the classification accuracy for each individual class. We compared the performance of our method with [1] on the same classes. The results are shown in Table 3.

Table 3 Comparison of the proposed method with [1] on individual difficult classes on Caltech 101

We should note that, in our proposed method, the improvement of classification accuracy on the difficult classes results from calculating local weights for the kernels, which addresses the problem of high intra class variance.

5 Conclusions

Image classification, the task of determining the semantic class of unlabeled test samples, is challenging, especially for real world images. Two issues limit the classification accuracy. First, images are better described by several types of features; thus, the designed system should be able to merge heterogeneous features. The second challenge comes from the large intra class variance and inter class relationship in real world image databases.

In this study, we designed a feature fusion-based localized multiple kernel learning algorithm using the SPM feature to overcome the mentioned difficulties. Our results demonstrate that the proposed approach performs well in image classification problems. The higher performance of our method partly stems from computing the kernel weights locally. In the future, we will compute the kernel weights directly in the kernel space.