Kernel-blending connection approximated by a neural network for image classification

This paper proposes a kernel-blending connection approximated by a neural network (KBNN) for image classification. A kernel mapping connection structure, guaranteed by the function approximation theorem, is devised to blend feature extraction and feature classification through neural network learning. First, a feature extractor learns features from the raw images. Next, an automatically constructed kernel mapping connection maps the feature vectors into a feature space. Finally, a linear classifier is used as an output layer of the neural network to provide classification results. Furthermore, a novel loss function involving a cross-entropy loss and a hinge loss is proposed to improve the generalizability of the neural network. Experimental results on three well-known image datasets illustrate that the proposed method has good classification accuracy and generalizability.


Introduction
Image classification assigns images to predefined categories by recognizing the subjects or objects they contain. It is a classic image processing task that underpins image segmentation, behavior analysis, scene understanding, and other high-level visual tasks, and it has a wide range of practical applications, such as target recognition, object tracking, and image retrieval. Based on the feature extraction approach used, existing image classification methods can be broadly classified as prior-based methods and learning-based methods.
Prior-based image classification methods first extract image features according to empirical knowledge and then apply a classifier. The support vector machine (SVM) [1] is a widely used classifier due to its good generalizability. In particular, a kernel-based SVM can deal effectively with nonlinear and high-dimensional data, and performs well on many image classification tasks. An incremental SVM that used histogram of oriented gradients (HOG) features as training vectors was proposed in Ref. [2] to classify images under different imaging conditions. In Ref. [3], using spectral features, an SVM-based sequential classifier was developed to classify multitemporal remote sensing images. Although such prior-based methods coupled with SVM classifiers show good classification performance, their feature extractors are hand-crafted and thus require domain knowledge, making them unsuitable for new data and tasks. Moreover, they cannot fully express the information in the raw images [4,5].
Learning-based methods learn features automatically and directly from raw pixels. LeCun et al. [6] successfully applied a convolutional neural network (CNN) to handwritten character recognition and achieved remarkable classification performance; since then, CNNs have been widely applied to image classification tasks. In Ref. [7], a multimodal CNN was used to classify RGB-D images. Through convolution and pooling, color and depth features were fused effectively to maintain good classification performance on images with significant noise or object occlusions. Based on a deep CNN, a fine-grained image classifier with generalized large-margin loss was proposed in Ref. [8], with improved classification performance on fine-grained images. Such CNN-based image classification methods extract salient features from raw images that are invariant to shifting or to shape distortions, but the algorithm used to train the CNN is based on empirical risk minimization: it attempts to minimize errors on the training set. Consequently, such models are less generalizable than SVMs [9], which follow structural risk minimization and aim to minimize the generalization error on unseen data drawn from the same distribution as the training set.
In recent years, combining a CNN with an SVM for image classification has attracted considerable attention. In Ref. [10], a trained CNN was first applied to extract features from functional magnetic resonance images, and then an SVM was employed for classification. Experiments showed that this combination of SVM and CNN performed better than other classifiers. In Ref. [9], a hybrid model integrating a CNN and an SVM was used to recognize handwritten digits. The CNN functioned as a trainable feature extractor and the SVM was used as a classifier. These papers show that a combination of a CNN and an SVM for image classification can perform well. However, the CNN and SVM were trained separately, so the SVM classification results could not provide feedback to assist in CNN training. Feature extraction and feature classification did not interact effectively, which limits classification accuracy.
These combined methods compensate for the limitations of CNNs and SVMs by incorporating the merits of both. However, they join two different algorithm architectures, which permits only an inflexible "hard connection". Seeking a mapping between the CNN and the SVM that establishes a "soft connection" enables more flexible interaction, and it is crucial that this mapping be precise. To better blend feature extraction with feature classification for improved classification performance, one should establish an effective and precise mapping connection on a theoretical basis.
In this paper, a kernel-blending connection approximated by a neural network (KBNN) is proposed for image classification. Considering that a three-layered neural network with nonlinear units in the hidden layer can approximate continuous and other kinds of functions, we devise a network module that learns the kernel function for the SVM. Using this kernel mapping connection, which carries a theoretical guarantee, feature extraction and feature classification are blended organically and precisely. First, the image features are automatically learned by a feature extractor. Second, the extracted feature vector is mapped into a feature space through a kernel mapping connection module. Finally, the classification results are obtained using a linear classification layer. To further improve the KBNN's generalizability, we propose a novel loss function to train the network, in which a hinge loss is added to the cross-entropy loss. The main contributions of this paper are as follows:
• A novel image classification method based on a new deep neural network, in which an SVM kernel function is learned through a subnetwork to blend a CNN and an SVM in a unified framework, providing improved classification performance.
• A kernel mapping connection, inspired by the function approximation ability of neural networks, that organically blends feature extraction with feature classification. Unlike traditional combination methods, this kernel mapping connection provides a soft connection between the CNN and the SVM, as it is realized as a component of the neural network. Furthermore, unlike traditional manual kernel selection, the kernel mapping can be trained adaptively, without kernel tricks, to improve classification accuracy. This kernel mapping has a sound theoretical basis.
• A novel loss function for improved generalizability, in which a hinge loss is combined with the traditional cross-entropy loss to minimize both empirical and structural risk.

Preliminaries
Inspired by the biological architecture of the mammalian visual cortex [11,12], CNNs are characterized by limited receptive fields, shared weight parameters, and pooling layers [6]. This architecture allows CNNs to suppress growth in the number of weight parameters and makes them robust to parallel shifts of objects in images. The backpropagation algorithm [6] is generally employed in CNNs to update the weight parameters by calculating the gradient at the output layer and propagating it back to the input layer. In recent years, CNNs have been successfully applied to practical situations and have made significant achievements in image processing [13,14]. Zhang et al. [15] proposed a learning-based method to automatically detect and localize visual distractions in video. Video frames with extracted feature maps are first used as input to the network. Then, a state-of-the-art image segmentation CNN, the end-to-end deep model SegNet, predicts a distraction map for every video frame; these maps are further refined in a post-processing step. Experimental results show that this method can efficiently improve the visual quality of video. Targeting the problem that conventional graph convolution methods fail to capture higher-order information, Wen et al. [16] presented a motif-based graph convolution with variable temporal dense blocks for skeleton-based action recognition. It effectively fuses information from the different semantic roles of physically connected and disconnected joints to learn higher-order features. Furthermore, to enhance its ability to extract global temporal features, a non-local block is applied to capture whole-range dependencies in an attention mechanism. Experimental results on two challenging large-scale datasets demonstrate the effectiveness of the method.

Our method
This paper focuses on creating a new neural network to improve the image classification performance by blending feature extraction and feature classification in a unified framework that can be jointly trained. To this end, motivated by the function approximation ability of neural networks, we introduce a theoretically guaranteed kernel mapping connection module that provides a soft connection between feature extraction and feature classification. This is achieved by using a neural network to learn the kernel functions for the classifier.
The framework of the proposed KBNN network consists of three parts: feature extraction, kernel mapping connection, and feature classification, as shown in Fig. 1.
First, a feature extraction subnetwork extracts the input image's features into a one-dimensional feature vector, using a series of convolutional and pooling operations combined in several ways. Then, to enable a soft connection from feature extraction to feature classification, a kernel mapping connection module maps the extracted feature vector into a feature space; the result serves as the learned kernel function for the classifier. Finally, a linear classification layer classifies the features in the feature space, giving the final results. To improve the generalizability of the network, a novel loss function incorporating a hinge loss is applied to train the neural network until convergence.
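To make the framework concrete, the following is a minimal PyTorch sketch of the three-part pipeline. The module layout follows the description above, but the class names, dimensions, and layer choices are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the KBNN pipeline: feature extraction -> kernel
# mapping connection -> linear classification. All names and sizes here
# are illustrative assumptions.
import torch
import torch.nn as nn

class KBNN(nn.Module):
    def __init__(self, extractor: nn.Module, feat_dim: int,
                 kernel_dim: int, num_classes: int):
        super().__init__()
        self.extractor = extractor                # CNN -> 1-D feature vector
        self.kernel_mapping = nn.Sequential(      # learned kernel mapping
            nn.Linear(feat_dim, kernel_dim),
            nn.Sigmoid(),
        )
        self.classifier = nn.Linear(kernel_dim, num_classes)  # linear output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.extractor(x)              # feature extraction
        mapped = self.kernel_mapping(features)    # map into the feature space
        return self.classifier(mapped)            # classification results
```

Because all three parts are ordinary network modules, gradients from the classification layer flow back through the kernel mapping into the feature extractor, which is precisely the interaction the "soft connection" is meant to enable.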

Feature extraction
To improve feature extraction performance, we use a feature extraction subnetwork with a series of convolutional and pooling operations combined in several ways. The subnetwork is composed of three convolutional layers (some of which are followed by pooling layers) and a fully connected layer. The convolutional layers extract feature maps using convolutional filters followed by a nonlinear activation function; we adopt the rectified linear unit (ReLU) to avoid the vanishing gradient problem. To accelerate training, we employ batch normalization [17] before each ReLU activation. The pooling layers aggregate local features from adjacent pixels; max pooling and average pooling are adopted in different pooling layers. The fully connected layer integrates local feature information into a one-dimensional feature vector. To prevent the overfitting associated with a traditional fully connected layer, we adopt global average pooling (GAP) [18], which takes the average of each feature map. Because GAP requires no parameter optimization, it helps avoid overfitting. Moreover, because spatial information is summed out, this approach is more robust to spatial translations of features in the input images.
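The following is a sketch of such a feature extraction subnetwork. The channel counts, kernel sizes, and pooling placements are placeholders (the paper's exact configuration is given in Table 1).

```python
# Sketch of the feature extraction subnetwork: three convolutional layers
# with batch normalization before ReLU, max and average pooling in
# different layers, and GAP instead of a parameterized fully connected
# layer. Channel counts and kernel sizes are illustrative assumptions.
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),        # batch normalization before activation
            nn.ReLU(inplace=True),     # ReLU avoids the vanishing gradient
            nn.MaxPool2d(2),           # max pooling over adjacent pixels
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.AvgPool2d(2),           # average pooling in a different layer
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)  # GAP: one average per feature map

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.gap(x).flatten(1)       # one-dimensional feature vector
```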

Kernel mapping connection
In most image classification methods based on combining a CNN and an SVM, feature vectors are extracted from the trained CNN and then input into the SVM classifier separately, because the CNN and SVM have different implementation frameworks. Therefore, the connection between the CNN and the SVM is a "hard connection": feature extraction and feature classification are trained separately and do not interdepend. To integrate feature extraction and feature classification into an organic whole, it is necessary to use a "soft connection". We regard the SVM kernel function as the key to addressing this problem, by applying the function approximation ability of a neural network. The kernel function in an SVM is continuous and achieves linear separability by mapping a linearly inseparable space into a higher-dimensional feature space. Cybenko [19] proved theoretically that a three-layer neural network can approximate any continuous nonlinear function arbitrarily well on a compact interval. Inspired by this function approximation theorem, we devise a kernel mapping connection that learns the kernel function using a neural network, to enable a soft connection between feature extraction and feature classification. The theorem is as follows:

Theorem 1. Let $I_n$ be the $n$-dimensional unit cube $[0,1]^n$, let $C(I_n)$ denote the space of continuous functions on $I_n$, and let $M(I_n)$ denote the space of finite, signed regular Borel measures on $I_n$. Let $\sigma$ be any continuous sigmoidal function, and let $Y_j, X \in \mathbb{R}^n$ and $\alpha_j, \theta_j \in \mathbb{R}$. Then finite sums of the form
$$G(X) = \sum_{j=1}^{N} \alpha_j\, \sigma\!\left(Y_j^{\mathrm{T}} X + \theta_j\right)$$
are dense in $C(I_n)$. In other words, given any $f \in C(I_n)$ and $\varepsilon > 0$, there is a sum $G(X)$ of the above form for which
$$|G(X) - f(X)| < \varepsilon \quad \text{for all } X \in I_n.$$

Based on this function approximation theorem, a kernel mapping connection is established to map the feature vectors from the GAP layer into a feature space that serves as the input to the linear classification layer.
As shown in Fig. 2, $D = (x_1, \cdots, x_d)$, $x_i \in \mathbb{R}$, is the feature vector output by the GAP layer with $d$ neurons. The kernel mapping layer contains $q$ neurons, and the kernel mapping output layer has $l$ neurons. The kernel mapping learned by the network, i.e., the input to neuron $y_k$ in the linear output layer, is
$$y_k = \sum_{m=1}^{q} \eta_{km}\, \sigma\!\left(\sum_{i=1}^{d} \nu_{mi}\, x_i + \beta_m\right)$$
where $1 \leqslant i \leqslant d$, $1 \leqslant m \leqslant q$, $1 \leqslant k \leqslant l$; $\nu_{mi}$ and $\eta_{km}$ are weights, $\beta_m$ is the bias of the $m$th neuron in the kernel mapping layer, and $\sigma(\cdot)$ is the sigmoid function.
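As a sanity check on this formula, the mapping can be written out directly in a few lines of PyTorch. The dimensions d, q, and l below are arbitrary illustrative values, and the weights are random rather than learned.

```python
# Direct transcription of the kernel mapping:
#   y_k = sum_m eta_{km} * sigmoid(sum_i nu_{mi} * x_i + beta_m)
import torch

d, q, l = 128, 256, 10                 # GAP, mapping, and output widths (illustrative)
x = torch.randn(d)                     # feature vector from the GAP layer
nu = torch.randn(q, d)                 # weights nu_{mi} of the kernel mapping layer
beta = torch.randn(q)                  # biases beta_m
eta = torch.randn(l, q)                # weights eta_{km} into the linear output layer

hidden = torch.sigmoid(nu @ x + beta)  # sigma(sum_i nu_{mi} x_i + beta_m), shape (q,)
y = eta @ hidden                       # y_k = sum_m eta_{km} hidden_m, shape (l,)
```

This is exactly the sigmoid-hidden-layer, linear-output form that Theorem 1 shows to be dense in $C(I_n)$, which is what justifies letting the network learn the kernel mapping.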

Feature classification
Unlike traditional neural networks, which use a softmax layer minimizing cross-entropy loss, our novel loss function combines cross-entropy loss and hinge loss to minimize both empirical and structural risk, improving the generalizability of the proposed network. The traditional softmax (cross-entropy) loss is
$$L_{\mathrm{ce}} = -\frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{K} t_{i,j} \log p_{i,j}$$
where $i = 1, \ldots, M$, $j = 1, \ldots, K$, $M$ and $K$ are the numbers of training images and classes, respectively, $p_{i,j}$ denotes the predicted probability that image $X_i$ belongs to class $j$, and $t_{i,j}$ indicates the ground truth. After introducing the positive penalty factor $C$ from the SVM, the improved hinge loss is
$$L_{\mathrm{h}} = \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{M} \max\!\left(0,\ 1 - y_i\left(\omega^{\mathrm{T}} x_i + b\right)\right)$$
where $C$ controls the tradeoff between maximizing the margin and penalizing misclassification, $\omega$ is the weight vector, and $b$ is the bias. By combining the cross-entropy loss and the improved hinge loss, the proposed loss function is defined as
$$L = L_{\mathrm{ce}} + L_{\mathrm{h}}.$$
When applying this loss function to train the KBNN, the weights and biases in the feature extraction layers and the kernel mapping layer are learned by backpropagating the gradients from the linear classification layer.
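A minimal sketch of this combined loss in PyTorch is shown below. It assumes the combined loss is the plain sum of the cross-entropy term and the C-weighted multi-class hinge term; the margin term $\|\omega\|^2/2$ can equivalently be realized as weight decay on the linear classification layer rather than as an explicit term. The exact weighting used in the paper may differ.

```python
# Sketch of the blended loss: cross-entropy (empirical risk) plus a
# C-weighted multi-class hinge term (structural risk). The exact
# combination here is an assumption.
import torch
import torch.nn as nn

class BlendedLoss(nn.Module):
    def __init__(self, C: float = 10.0):
        super().__init__()
        self.C = C                                   # SVM penalty factor C
        self.ce = nn.CrossEntropyLoss()              # cross-entropy term
        self.hinge = nn.MultiMarginLoss(margin=1.0)  # multi-class hinge term

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        return self.ce(logits, targets) + self.C * self.hinge(logits, targets)
```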

Results and discussion
In this section, we report the results of a variety of experiments performed to evaluate the performance of the proposed KBNN image classification method.

Experimental details
We conducted experiments on three datasets: MNIST, CIFAR-10, and CIFAR-100 [20,21]. These datasets are widely used and specifically intended for investigating the performance of image classification methods. MNIST is a handwritten digit dataset in which the goal is to classify handwritten numerals 0 to 9. It contains 60,000 training images and 10,000 test images, each a 28 × 28 pixel grayscale image. The CIFAR-10 dataset consists of 50,000 training images and 10,000 test images; each is a 32 × 32 RGB image belonging to one of ten natural-object categories. In CIFAR-10, object positions and scales vary significantly within categories, as do colors and textures between categories. The CIFAR-100 dataset has images of the same size and format as CIFAR-10 but contains 100 classes; it thus has only one tenth as many labeled images per class, i.e., 500 training images and 100 test images.
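All three datasets are available through torchvision; a minimal loading sketch (plain tensor conversion, no augmentation) is:

```python
# Loading the three benchmark datasets with torchvision.
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()
mnist = datasets.MNIST("data", train=True, download=True,
                       transform=to_tensor)        # 60,000 28x28 grayscale digits
cifar10 = datasets.CIFAR10("data", train=True, download=True,
                           transform=to_tensor)    # 50,000 32x32 RGB, 10 classes
cifar100 = datasets.CIFAR100("data", train=True, download=True,
                             transform=to_tensor)  # 100 classes, 500 train images each
```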
To demonstrate the generality of our proposed method, we applied the same neural network architecture, set up as shown in Table 1, to all datasets. We used the mini-batch gradient descent method to learn the parameters and adopted Adam to accelerate network training.
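Combining the sketches from the previous sections, a minimal mini-batch training loop with Adam looks as follows; the model, loss, and dataset objects are the illustrative versions defined earlier, and the learning rate, batch size, and epoch count are assumptions, not the paper's settings.

```python
# Minimal mini-batch training loop with Adam, reusing the illustrative
# KBNN, FeatureExtractor, BlendedLoss, and mnist objects sketched above.
import torch
from torch.utils.data import DataLoader

model = KBNN(FeatureExtractor(in_channels=1),  # MNIST is single-channel
             feat_dim=128, kernel_dim=256, num_classes=10)
criterion = BlendedLoss(C=10.0)                # penalty factor; the paper sets C = 10
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = DataLoader(mnist, batch_size=128, shuffle=True)

model.train()
for epoch in range(20):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)  # forward pass + blended loss
        loss.backward()   # backpropagate from the linear classification layer
        optimizer.step()  # Adam parameter update
```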

Loss function analysis
The penalty parameter C adopted in the proposed loss function controls the tradeoff between margin maximization and classification violation. We first analyzed the impact of this parameter on the performance of KBNN by investigating how classification error varies for values of C in the range 0.001-1000, on the MNIST and CIFAR-10 datasets. The mean error over 5 independent trials is plotted in Fig. 3. When C < 3, classification accuracy is low for both datasets, particularly MNIST. As C increases from 4 to 10, classification accuracy improves, and for C > 10, classification accuracy converges. As Fig. 3 shows, any value of C of at least 10 is acceptable for the loss function, so we set C = 10 in the remaining experiments.
Next, we compared the performance of the new loss function with those of cross-entropy loss and hinge loss. The results for MNIST and CIFAR-10 are presented in Figs. 4 and 5, respectively. On MNIST, with grayscale images, KBNN with the proposed loss function achieves the lowest error, while hinge loss performs worst. Moreover, compared with cross-entropy loss and hinge loss, the proposed loss function converges faster. On the more complex CIFAR-10 dataset, which is composed of RGB images, the proposed loss function is more stable, especially in comparison with cross-entropy loss. More generally, these results show that a linear output layer with the improved loss function performs better than a traditional output layer with either cross-entropy loss or hinge loss.

Comparison with the state-of-the-art
We next compare the classification performance of the proposed KBNN with that of state-of-the-art methods on these three datasets.

MNIST results
To verify the effectiveness of the proposed KBNN image classification method, we first compared it with state-of-the-art methods on the MNIST dataset, including two combined methods, DLSVM [22] and Niu and Suen's method [9]; a method using a CNN with softmax (CNN+softmax) [23]; and four other representative methods: CDBM [24], PCAnet [25], Deep NCAE [26], and Drplu [27]. All methods were trained on the original training dataset except for Niu and Suen's method, which was trained on a training dataset augmented using distortion techniques. Classification error results are summarized in Table 2. As can be seen, KBNN performs better on MNIST than most of the other methods in the comparison. In particular, KBNN achieves higher classification accuracy than the traditional combined method DLSVM, which indicates that the kernel mapping contributes to classification accuracy. KBNN also outperforms CNN+softmax, further showing the good performance of the proposed loss function. Unlike Niu and Suen's method, which uses distortion techniques to augment the training dataset and enhance generalizability, KBNN is trained directly on the original training set, so its results are slightly weaker. However, the difference is quite small: the KBNN result trails that of Niu and Suen's method by only 0.17 percentage points.

CIFAR-10 results
To further illustrate the generalizability of KBNN, we compared it with six other image classification methods on the CIFAR-10 dataset: the combined method DLSVM; two improved-loss-function methods, large-margin Gaussian mixture loss (ResNet110+L-GM [23]) and multi-loss (ML-DNN [28]); and three high-performance methods: NIN [18], maxout networks [29], and drop-connect [30]. Classification error results are presented in Table 3. It can be seen that KBNN has the lowest error on CIFAR-10. In particular, the network architecture of KBNN suffices for this more complex dataset, verifying that KBNN has good generalizability.

CIFAR-100 results
To investigate the performance of KBNN on a more complex dataset, we further compared it with six representative low-error methods on the CIFAR-100 dataset: learned pooling [31], stochastic pooling [32], maxout networks, NIN, ML-DNN, and ResNet (110-layer) [33]. Table 4 gives classification errors for our proposed KBNN and these other methods. It can be seen that KBNN surpasses several methods, with a test error of 32.71%. ResNet is an outstanding network for diverse applications that uses deep residual learning to overcome the difficulty of training deeper networks. Although the classification error of KBNN is larger than that of ResNet, KBNN has fewer network layers: it obtains good performance with a relatively small model size. Overall, the experimental results in Table 4 show that KBNN is also competitive on more complex datasets.

Conclusions and future work
This study has proposed a novel deep neural network for image classification with a theoretically grounded kernel-blending connection. To implement a soft connection between a CNN and an SVM, we established a kernel mapping connection structure, guaranteed by the function approximation theorem, to better blend feature extraction and feature classification.
Neural network learning further increases the adaptability of the connection, which avoids the need for kernel tricks as applied in traditional SVMs. Moreover, we combine a hinge loss with cross-entropy loss to improve the generalizability of KBNN.
In future research, we will focus on further improving the generalizability of KBNN through network architecture optimization, including the number of layers and hidden neurons, the size of the convolution kernels, and the value of the penalty factor. Although the experimental results on benchmark datasets indicate that KBNN has promising classification performance and generalizability, its network architecture and penalty factor must still be set manually; even though we performed a sensitivity test on the penalty factor, the upper and lower bounds on its value followed empirical settings from the literature. Such empirical choices of parameters and architecture may affect the method's performance on other datasets. We will therefore attempt to make the penalty factor a trainable parameter and to optimize the network architecture with intelligent optimization algorithms.