
1 Introduction

Nowadays, convolutional neural networks (CNNs) are a central topic in computer vision, specifically in applications such as video and image processing, where deep learning architectures have proven to outperform traditional image classification approaches. A CNN is a variation of the multilayer perceptron, but it is much more effective in machine vision tasks because each of its stages models one part of the visual cortex. The theoretical foundations of CNNs are based on the Neocognitron introduced by Fukushima in 1980 [13], improved by LeCun in 1998 [19] and fine-tuned by Ciresan in 2011 [11].

As classifiers, CNNs have a remarkable generalization capability, which increases with the size of the training set: the greater the amount of information available, the better the millions of parameters contained in their structure will be trained [3, 12]. For databases with a limited number of labeled images, it has been suggested that fine-tuning an architecture already trained on a robust database is better than training the same architecture from scratch [10, 14, 21, 27].

One of the key issues for a successful CNN is choosing the structure and parameters that represent the image information accurately and uniquely. This structure is composed of stages, where each stage consists of convolution, rectification, pooling and normalization tasks. When the CNN is used in classification problems, there is also a classification stage composed of fully connected layers followed by an output layer, where the number of neurons is equal to the number of classes.

The configuration parameters in a CNN are divided into two categories: (a) those concerning the architecture –such as kernel size, stride and padding– and (b) those concerning the training algorithm –such as mini-batch size, regularization type, number of training epochs, learning-rate drop period, learning-rate drop factor, learning-rate schedule, and initial learning rate.

In this paper, we present the design and implementation of an improved convolutional neural network for image classification, carefully crafted to avoid overfitting. The performance of the proposed architecture, which we named ACEnet, was compared against that of Alexnet [18], achieving a 5.11% increase in classification accuracy.

ACEnet consists of six feature extraction (FE) stages, instead of the five proposed in [17, 18]. The additional stage is composed of convolution, rectification and pooling tasks. Another important difference between Alexnet and our proposal lies in the second FE stage: in the sequence of tasks proposed by Alexnet (convolution, rectification, normalization and pooling), the order of the last two tasks is inverted.

To compare the performance of both architectures, each was trained from scratch using five databases: Oliva and Torralba (Oliva) [4], Paris [6], Oxford [5], Caltech-101 [1] and Caltech-256 [2]. For each selected database we report the processing time and the classification accuracy on both the training and testing sets.

The rest of the paper is organized as follows. Section 2 describes the state of the art. Section 3 presents a general convolutional neural network architecture. The proposed architecture is presented in Sect. 4. Section 5 presents the employed methodology. Results are described in Sect. 6. Finally, Sect. 7 summarizes the main conclusions and briefly talks about future directions.

2 State of the Art

The classical techniques to avoid overfitting in CNNs have been dropout and data augmentation, first proposed in [18]. Recently, several works have shifted attention to other forms of regularization that modify the loss function used by the gradient descent algorithm or its variants. An intermediate layer between pooling and convolution layers is proposed in [23]; such a layer is a micro network that can be modelled as a small classifier, such as an SVM or a softmax classifier. Each micro network adds a regularization term to the CNN’s loss function, which simultaneously minimizes the final classification error and forces hidden layers to learn more discriminative features.

The term micro network was introduced in [20] and refers to a modification of the CNN architecture in which a multi-layer perceptron (MLP), or another small network, is used instead of a linear filter at each convolutional layer. Each patch of the input volume is fed to the MLP, and the MLP output forms a voxel of the output matrix. The feature maps are built by sliding the MLP over the whole input volume. The architecture using micro networks is known as network-in-network (NiN).

An extension of [20] was proposed in [26], where the output of a NiN is routed into multiple fully-connected layers, each attached to a different loss function. The authors of [26] claim that using multiple loss functions keeps the training algorithm from overfitting to one particular loss function. The intuition behind using multiple loss functions is that each one models a different aspect of the same task. For instance, the pairwise ranking loss [24] prioritizes learning a given order of labelled pairs, while the LambdaRank loss [9] optimizes the top-k classification accuracy.

A redundancy regularizer was proposed in [25] to reduce the number of correlated kernels at each convolutional layer. Such a regularizer showed a slight improvement over dropout alone, but when combined with dropout and early stopping the improvement was significant.

Fig. 1. Hierarchical representation of a convolutional neural network.

3 Convolutional Neural Network

Figure 1 displays a generic CNN architecture. It is a hierarchical structure composed of five principal layers:

  • Input layer. It acts as a pre-processing layer, where the input images can be resized, rotated or subsampled.

  • Convolution layer. It is the basic building block of a CNN and performs most of the computational work. This layer receives an input volume of \(H^c_{in}\) pixels in height, \(W^c_{in}\) pixels in width and \(D^c_{in}\) channels in depth, with P pixels of padding. The input volume is processed with a set of k filters, which are encoded as weights and connections between neurons. Such filters are based on convolutional masks, called kernels, and are defined by two parameters: the spatial extent E (the kernel width and height, so that each kernel covers an \(E \times E \times D^c_{in}\) region of the input) and the stride \((S_x,S_y)\), which defines the number of pixels skipped between consecutive convolutions. The output volume size is \((H^c_{out}, W^c_{out}, D^c_{out})\), where each dimension is defined in terms of the input volume as \(W^c_{out}=(W^c_{in}-E+2P)/S_x+1\), \(H^c_{out}=(H^c_{in}-E+2P)/S_y+1\), and \(D^c_{out}=k\); a worked example is given at the end of this section.

  • Pooling layer. Its purpose is to shrink the convolution layer output so that the dimensionality of the extracted features is reduced while keeping the information they encode. The pooling layer receives an input volume of size \((W^p_{in}, H^p_{in}, D^p_{in})\), which is processed with a set of filters defined by the spatial extent F (the pooling-window width and height) and the stride \((S_x,S_y)\). The output volume size is \((H^p_{out}, W^p_{out}, D^p_{out})\), where each dimension is defined in terms of the input volume as \(W^p_{out}=(W^p_{in}-F)/S_x+1\), \(H^p_{out}=(H^p_{in}-F)/S_y+1\), and \(D^p_{out}=D^p_{in}\).

  • Fully connected layer. This layer computes a weighted sum of the pooling layer output, or of the convolution layer output when no pooling layer is present.

  • Output layer. This layer holds one neuron per category in the classification task.

The first three layers comprise the FE stage while the last two layers encode the classification stage. In a deep convolutional neural network, several FE stages are present, while only one classification stage remains at the end of the data flow.
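
As a quick check of the output-size formulas above, the following Python sketch computes the output volume of a convolution and a pooling layer; the layer parameters in the example calls are illustrative only and do not correspond to any particular network.

```python
def conv_output_size(w_in, h_in, e, p, s_x, s_y, k):
    """Convolution output volume: W_out = (W_in - E + 2P)/S_x + 1,
    H_out = (H_in - E + 2P)/S_y + 1, D_out = k (see Sect. 3)."""
    w_out = (w_in - e + 2 * p) // s_x + 1
    h_out = (h_in - e + 2 * p) // s_y + 1
    return w_out, h_out, k


def pool_output_size(w_in, h_in, d_in, f, s_x, s_y):
    """Pooling output volume: W_out = (W_in - F)/S_x + 1,
    H_out = (H_in - F)/S_y + 1; the depth is unchanged."""
    w_out = (w_in - f) // s_x + 1
    h_out = (h_in - f) // s_y + 1
    return w_out, h_out, d_in


# Illustrative example: a 227x227x3 input, 96 kernels of spatial extent 7,
# stride 2 and no padding, followed by 3x3 overlapped pooling with stride 2.
print(conv_output_size(227, 227, e=7, p=0, s_x=2, s_y=2, k=96))  # (111, 111, 96)
print(pool_output_size(111, 111, 96, f=3, s_x=2, s_y=2))         # (55, 55, 96)
```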

4 Proposed Architecture

The proposed architecture, named ACEnet, is shown in Table 1. It comprises 29 layers, of which 6 are convolution layers and 4 are max-pooling layers, all of them using different kernels and strides. Some of the convolution layers use padding to make their output conform with the input size needed for the next feature-extraction stage. ACEnet uses six FE stages. Each of the first three FE stages comprises four layers: a convolution layer, a rectification layer, a pooling layer, and a normalization layer. The next two FE stages have only two layers, a convolution layer followed by a rectification layer, while the last FE stage also has a pooling layer on top of the previously mentioned layers. Layers 20–25 form two generalization stages, intended to avoid overfitting in the network. Each generalization stage calculates the weighted sum of its inputs, rectifies it, and subsequently applies a regularization technique known as drop-out. In the following we describe some characteristics of our proposed architecture; a sketch of the resulting stage pattern is given after Table 1.

  • Convolution layers: They were designed using small kernels (i.e. \(E \in \{7,5,3\}\)) to obtain highly representative features and decrease the number of trainable parameters. The first convolution layer extracts low-level features (i.e. edges, lines and curves), while subsequent convolution layers extract high-level features.

  • Rectification layers: These layers have neurons with the non-linear, non-saturating activation function \(f(x)=\text {max}(0,x)\). They are known as Rectified Linear Units (ReLU) because their behavior is similar to that of the half-wave rectifier in electrical engineering. They provide a better model of biological neurons, with similar or better performance than the logistic-sigmoid and hyperbolic-tangent functions. It has been shown that ReLU units reach good performance without resorting to unsupervised pre-training and, although their training requires large amounts of labeled data, there is no negative effect on their performance [15]. CNNs with ReLU layers are trained several times faster than their equivalents with hyperbolic-tangent units [18].

  • Pooling layers with overlap: Pooling in CNNs is highly disruptive. If the filters do not overlap, the pooling layers lose information about object localization in the image, which is needed to detect the precise spatial relations among objects. The most popular pooling layers use \(2\times 2\) filters with a stride of 2, shrinking the input size by half and discarding about 75% of the activations generated by the previous layer. To avoid information loss and overfitting, the pooling layers used in the proposed architecture consist of \(3\times 3\) filters with a stride of 2, also called pooling with overlap; if pooling windows overlap enough, location information is preserved.

  • Normalization layers: Their purpose is to add generalization ability to the network. Normalizing a neuron’s output implements a kind of lateral inhibition similar to that found in real neurons, promoting competition among neuron outputs during periods of high neural activity [18].

  • Drop-out layers: These layers were included to force the fully connected layers to learn more robust features. Drop-out is a highly recommended technique to cope with overfitting [22], in which the outputs of some randomly selected neurons are set to zero so that they cannot contribute to the backpropagation of the error during training, reducing complex co-adaptation of neurons. In this way, every time a new example is fed for training, the neural network exhibits a different architecture; however, all these architectures share their weights.

Table 1. Architecture of the proposed convolutional neural network
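
Since Table 1 is not reproduced in full here, the following PyTorch-style sketch only illustrates the stage pattern described above: convolution, rectification, pooling and normalization in the first three FE stages, convolution and rectification in the next two, an additional pooling in the sixth, and two fully connected generalization stages with drop-out. The channel counts, kernel sizes and fully-connected widths are placeholders, not the actual values of Table 1, and this is not the authors' MATLAB implementation.

```python
import torch.nn as nn


def fe_stage(in_ch, out_ch, kernel, stride=1, pad=0, pool=True, norm=True):
    """One feature-extraction stage: convolution -> ReLU, optionally followed by
    overlapped 3x3/stride-2 max pooling and cross-channel normalization."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel, stride=stride, padding=pad),
              nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(kernel_size=3, stride=2))  # pooling with overlap
    if norm:
        layers.append(nn.LocalResponseNorm(size=5))            # lateral inhibition
    return layers


def generalization_stage(n_in, n_out, p=0.5):
    """One generalization stage: fully connected -> ReLU -> drop-out."""
    return [nn.Linear(n_in, n_out), nn.ReLU(inplace=True), nn.Dropout(p)]


num_classes = 101  # adjusted to the number of classes of each database
features = nn.Sequential(  # six FE stages; all sizes below are placeholders
    *fe_stage(3,    96, kernel=7, stride=2),
    *fe_stage(96,  256, kernel=5, pad=2),
    *fe_stage(256, 384, kernel=3, pad=1),
    *fe_stage(384, 384, kernel=3, pad=1, pool=False, norm=False),
    *fe_stage(384, 384, kernel=3, pad=1, pool=False, norm=False),
    *fe_stage(384, 256, kernel=3, pad=1, pool=True,  norm=False),
)
classifier = nn.Sequential(  # two generalization stages plus the output layer
    nn.Flatten(),
    *generalization_stage(256 * 6 * 6, 4096),
    *generalization_stage(4096, 4096),
    nn.Linear(4096, num_classes),
)
```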

5 Methodology

In this section we provide a detailed description of three important experiments that allow us to justify the proposed architecture as well as the training parameters. In the first subsection, we describe the model selection method used to set the mini-batch size and the number of epochs in the training algorithm for ACEnet. In the last subsection, we provide the settings for the numerical comparison of the classification performance between ACEnet and Alexnet.

We used the Stochastic Gradient Descent with Momentum (SGDM) algorithm to train the compared architectures. Let \(\theta _i\) be the vector with all the weights of the network at the i-th iteration of the SGDM algorithm, which are updated by the rule \(\theta _{i+1}= \theta _{i} - \eta \nabla E(\theta _{i}) + \mu (\theta _{i}-\theta _{i-1})\), where \(\eta \) is the learning rate, \(\mu \) is the momentum, and \(E(\theta _{i})\) is the loss function \(E(\theta _i) = \frac{1}{\beta }\sum _{k=1}^{\beta }E_k(\theta _i) + \frac{\lambda }{2}\theta _i^T\theta _i\), where \(E_k(\theta _i)\) is the loss value at the k-th training example in the mini-batch of size \(\beta \), and \(\lambda \) is the regularization coefficient. The SGDM parameters used to train Alexnet were set as recommended in [18]. Some other parameters were kept in common for both architectures and were set as follows: the initial weights at every layer were randomly drawn from the Gaussian distribution N(0, 0.01); the activation thresholds at each layer were initialized to zero; finally, the number of neurons in the output layer was adjusted to match the number of classes in each database. Both architectures were trained on an NVIDIA GeForce GTX TITAN X GPU with 3072 cores and 12 GB of memory, using MATLAB and the Deep Learning Toolbox.
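
For concreteness, the update rule and the regularized loss can be written as in the NumPy sketch below. The momentum value 0.9 is an assumption (the actual values are listed in Table 2), and N(0, 0.01) is interpreted here as a zero-mean Gaussian with standard deviation 0.01.

```python
import numpy as np


def sgdm_step(theta, theta_prev, grad, eta=0.001, mu=0.9):
    """SGDM update: theta_{i+1} = theta_i - eta*grad E(theta_i) + mu*(theta_i - theta_{i-1})."""
    return theta - eta * grad + mu * (theta - theta_prev)


def regularized_loss(sample_losses, theta, lam=0.0005):
    """E(theta) = (1/beta) * sum_k E_k(theta) + (lambda/2) * theta^T theta."""
    return np.mean(sample_losses) + 0.5 * lam * theta @ theta


rng = np.random.default_rng(0)
theta = rng.normal(0.0, 0.01, size=10_000)  # initial weights drawn from N(0, 0.01)
biases = np.zeros(1_000)                    # activation thresholds initialized to zero
```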

Table 2. SGDM parameters used for training of ACEnet and Alexnet
Table 3. Summary of the selected databases for image classification. “Min p. class” and “Max p. class” refer to the minimum and maximum number of images per class, respectively.

In the case of ACEnet, we set the learning rate to 0.001 to get finer control over the steps taken by the SGDM algorithm, and by setting \(\lambda =0.0005\) we reduced the probability of overfitting and the complexity of the CNN [16]. We performed a sequential search to find the optimal values for the mini-batch size and the number of epochs. At each step of the search, the classification error was computed using the hold-out method, and the optimum was found at the minimum value of the classification error.

To test the generalization performance of ACEnet, we selected five complex databases with real-world images in JPEG format. Relevant statistics for each database are displayed in Table 3. A sixth database was built by combining all selected databases, and it was used only for the performance comparison against Alexnet. Both model selection and performance comparison were carried out for each database. In every experiment, we estimated the classification performance by means of hold-out, where each database was split into training and testing sets. We used 70% of the total number of images as the training set and the rest as the testing set.
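
A minimal sketch of the 70/30 hold-out split is given below, assuming the image paths and labels have already been collected from a database folder; the original experiments were run in MATLAB, and the stratification shown here is an assumption (the paper only specifies the 70/30 proportion).

```python
from sklearn.model_selection import train_test_split

# Placeholder data: in practice these lists would be built by scanning the database folders.
image_paths = [f"img_{i:04d}.jpg" for i in range(1000)]
labels = [i % 8 for i in range(1000)]  # hypothetical 8-class labeling

train_paths, test_paths, y_train, y_test = train_test_split(
    image_paths, labels,
    test_size=0.30,    # 70% training / 30% testing hold-out
    stratify=labels,   # keep class proportions similar in both sets (assumption)
    random_state=0,
)
```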

5.1 Training Parameters

Mini-batch Size. The mini-batch size is one of the parameters that directly impacts a CNN's performance. It indicates the number of images from the training set that are considered at each iteration of the SGDM algorithm. The optimal mini-batch size was found by means of a sequential search in the interval [10, 120] with increments of 10 units. At each point in the search, we recorded the classification accuracy and the training time, and we selected the optimal mini-batch size as the point in the search with minimum training time and maximum accuracy. In this experiment, every time we ran the SGDM algorithm we fixed the number of epochs to 100.
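
The selection criterion can be sketched as follows, operating on measurements of (mini-batch size, hold-out accuracy, training time) recorded during the sequential search; the accuracy tolerance and the example measurements are assumptions, not values from the paper.

```python
def select_mini_batch_size(results, tolerance=0.01):
    """Among the mini-batch sizes whose accuracy is within `tolerance` of the best
    one, return the one with the shortest training time (accuracy/time trade-off)."""
    best_acc = max(acc for _, acc, _ in results)
    candidates = [r for r in results if r[1] >= best_acc - tolerance]
    return min(candidates, key=lambda r: r[2])[0]


# Made-up measurements: (mini-batch size, hold-out accuracy, training time in hours)
measurements = [(10, 0.91, 2.1), (20, 0.92, 1.6), (30, 0.92, 1.5),
                (40, 0.90, 1.4), (60, 0.88, 1.2), (120, 0.85, 1.0)]
print(select_mini_batch_size(measurements))  # -> 30 for these made-up numbers
```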

Epoch Number and Overfitting. Besides drop-out layers, controlling the number of epochs in the training algorithm is an effective way to avoid overfitting in a CNN [18]. We trained ACEnet with different numbers of epochs, ranging from 10 to 100 with increments of 10, and recorded the classification error on the training and testing sets at the end of the training algorithm. It is well known that the training and testing errors decrease steadily before overfitting and diverge once the CNN has been overfitted [8]. Thus, the optimal number of epochs is the last point before the classification errors diverge.
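
The stopping criterion described above (keep the last grid point before the training and testing errors start to diverge) can be sketched as below; the divergence threshold and the example error curves are assumptions.

```python
def optimal_epochs(grid, train_err, test_err, threshold=0.02):
    """Return the last grid point before the train/test gap grows by more than
    `threshold` with respect to the previous point (a simple divergence heuristic)."""
    gaps = [te - tr for tr, te in zip(train_err, test_err)]
    for i in range(1, len(gaps)):
        if gaps[i] - gaps[i - 1] > threshold:  # gap widening -> overfitting onset
            return grid[i - 1]
    return grid[-1]                            # no divergence observed on the grid


epochs_grid = list(range(10, 101, 10))
train_err = [0.40, 0.25, 0.15, 0.08, 0.04, 0.02, 0.01, 0.01, 0.01, 0.01]  # made-up curves
test_err  = [0.44, 0.30, 0.21, 0.15, 0.11, 0.09, 0.08, 0.07, 0.06, 0.05]
print(optimal_epochs(epochs_grid, train_err, test_err))  # -> 100 (no divergence here)
```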

5.2 Performance Comparison

We compared the classification performance of ACEnet against that of the well-known Alexnet architecture. The training parameters for Alexnet and for ACEnet were the same except for the mini-batch size, which was set using the results of the model selection method for our proposed architecture. The mini-batch size, as well as the rest of the training parameters for Alexnet, were taken from [18] and are listed in Table 2. For each of the five selected databases, we trained both architectures using 70% of the total number of images until the SGDM algorithm converged. After reaching convergence, we recorded the training time and the classification accuracy on the training set. Next, we computed and recorded the classification accuracy on the testing set.

6 Results

In this section we start by describing the results of the parameter selection experiment and then proceed to the comparison with the well-known Alexnet architecture.

6.1 Training Parameters

Mini-batch Size. Figure 2 shows the recorded training times and classification accuracies for every mini-batch size tested with the model selection method described in Sect. 5.1 for the Oliva database. It can be seen in Fig. 2 that the best accuracy is located between 1 and 50 images per batch, while the minimum training time is located between 20 and 40 images per batch. As a trade-off between accuracy and training time, we propose a mini-batch size of 25 for the Oliva database. Similar results were found for the other image databases.

Fig. 2. Mini-batch size selection for the Oliva database.

Epoch Number and Overfitting. Figure 3a shows the classification accuracy reached when ACEnet is trained with different numbers of epochs for each of the selected databases. Clearly, our proposed architecture correctly learns the training set in about 60 epochs, regardless of the database used. Thus, setting the number of epochs to 100 gives a good training margin for future databases; however, we must verify that ACEnet does not overfit to the training set with such a large number of epochs. Figure 3b displays the training and testing errors recorded for different numbers of epochs using the Oliva database. It is worth noting that although the testing error shows a slight increase at 20 epochs, both the training and testing errors keep decreasing steadily until the 50-epoch point. Such an increase could be considered an overfitting indicator; however, at the next point along the grid both errors decrease again, which points towards the ability of our architecture to recover from overfitting. The training error can be considered stable after the 50-epoch point, while the testing error keeps decreasing until the 100-epoch point. Therefore, we can safely train ACEnet for 100 epochs without incurring overfitting.

Fig. 3. Epoch number selection based on overfitting analysis.

6.2 Performance Comparison

Table 4 shows an overview of the experimental results from the performance comparison against Alexnet. In the following we elaborate on the three aspects shown in Table 4, namely the training time \(T_{tr}\) and the classification accuracy on the training (\(A_{tr}\)) and testing (\(A_{te}\)) sets. ACEnet consistently required less time than Alexnet to reach 100% classification accuracy on the training set. The achieved speed-up for the Oliva, Paris, Oxford, Caltech-101, Caltech-256 and combined databases was 1.4x, 1.2x, 1.3x, 1.9x, 1.8x and 1.4x, respectively. Although ACEnet has one more stage than Alexnet, this increase in model complexity did not directly impact the time needed for training. Regarding classification accuracy on the testing set, ACEnet performed better than Alexnet on all selected databases. The performance improvement for the Oliva, Paris, Oxford, Caltech-101, and Caltech-256 databases was 9.58%, 8.60%, 4.35%, 12.05%, and 17.60%, respectively. On average, our proposal improves the mean testing performance over the five selected databases by 5.11%. For the artificially built database, our proposal outperformed Alexnet by 6.65%.

Table 4 also shows other measures of classification performance for both networks, namely the F1-score, the G-measure, and the Matthews correlation coefficient. While the F1-score is the harmonic mean of precision and recall, the G-measure is their geometric mean. The Matthews correlation coefficient, in turn, measures the correlation between predicted and true labels and is robust to the class-imbalance problem. As can be seen, ACEnet generally outperforms Alexnet in all the mentioned measures. Additionally, we performed the nonparametric Friedman test to assess the statistical significance of the differences between the results obtained by the two architectures. The resulting p-values for all the selected performance measures are given at the bottom of Table 4. From these tests, we can say that the results obtained with ACEnet are statistically different from those obtained with Alexnet at a 97% confidence level.
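
The reported measures can be computed as in the sketch below, with the G-measure obtained directly as the geometric mean of macro-averaged precision and recall; scikit-learn and placeholder labels are used purely for illustration (SciPy's scipy.stats.friedmanchisquare provides the Friedman test, which requires at least three related samples).

```python
import numpy as np
from sklearn.metrics import (f1_score, matthews_corrcoef,
                             precision_score, recall_score)

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])  # placeholder ground-truth labels
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1])  # placeholder predictions

f1 = f1_score(y_true, y_pred, average="macro")
p = precision_score(y_true, y_pred, average="macro")
r = recall_score(y_true, y_pred, average="macro")
g = np.sqrt(p * r)                    # G-measure: geometric mean of precision and recall
mcc = matthews_corrcoef(y_true, y_pred)
print(f"F1={f1:.3f}  G={g:.3f}  MCC={mcc:.3f}")
```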

Table 4. Time and performance comparison between ACEnet and Alexnet. Training time \(T_{tr}\) is given in hours, while classification accuracy on the training (\(A_{tr}\)) and testing (\(A_{te}\)) sets is given as a percentage. F1 is the harmonic mean of precision and recall, while G is their geometric mean. MCC is the Matthews correlation coefficient.

7 Conclusions and Future Works

We have presented an improved convolutional neural network architecture for image classification. Designing a CNN for image classification involves properly setting the feature extraction layers, the classification stage, and the parameters of the learning algorithm. The architecture of a CNN can be designed empirically if the behavior and purpose of each stage are clearly understood; however, the parameters of the learning algorithm must be fine-tuned by a model selection method. The proposed architecture can be trained from scratch with any database, which avoids having to reuse pre-trained networks.

The presented results show that the proposed architecture does not suffer from overfitting, which allowed us to increase the number of epochs in the training algorithm. We showed that while the training error reaches stability, the testing error keeps decreasing as the number of epochs increases, although the training time also increases. To keep the training time low, some generalization performance must be sacrificed; nevertheless, we obtained testing errors comparable to those of the state of the art.

We compared our proposal with the well-known Alexnet architecture: it improves the mean testing performance over five real-world databases by 5.11%. Regarding mean processing time, our proposal required 17.8 h to process all five databases, while Alexnet needed 27.47 h. Our proposal is faster due to the use of smaller mini-batch and kernel sizes, and due to the reduction in the number of epochs; however, we empirically demonstrated that such reductions do not affect the classification accuracy. While Alexnet was specifically designed to obtain a low classification error on the ILSVRC-2012 database, our proposed architecture does not depend on a specific database and has better generalization performance due to the extra processing stage. As future work, we would like to use more complex image databases; in addition, we would like to use the weights of intermediate layers of a pre-trained CNN as a set of features for image classification/identification, with the final aim of building a content-based image retrieval (CBIR) system.