A novel training algorithm for convolutional neural network

Many machine learning software packages are available to help researchers accomplish various tasks. These packages implement conventional algorithms that perform well when the training and test data are independent and identically distributed. However, this assumption may not hold in the real world: the training data may not all be available at one time. In the case of neural networks, the architecture then has to be retrained with the new data that become available subsequently. In this paper, we present a novel training algorithm that avoids complete retraining of any neural network architecture meant for visual pattern recognition. To show the utility of the algorithm, we investigate the performance of a convolutional neural network (CNN) architecture on a face recognition task under transfer learning. The proposed training algorithm may enhance the utility of machine learning software by providing researchers with an approach that reduces training time under transfer learning.


Introduction
Machine learning algorithms aim to build a model from example inputs in order to make data-driven decisions or predictions. Applications that use large datasets, such as face recognition, spam filtering, and recommendation engines, rely on machine learning. Google uses machine learning to identify and deindex web spam. Various machine learning software packages exist that assist researchers in solving complex problems, such as Weka, the Java Neural Network Framework Neuroph, Scikit-learn, OpenNN, and Multiple Back-Propagation. These packages provide conventional algorithms [1][2][3][4][5][6][7][8][9] for image analysis, machine learning, and data mining that assume the training and test data have the same distribution. In many real-world applications this may not hold, for example, when one has to detect a user's current location using previously collected Wi-Fi data. It is expensive to calibrate Wi-Fi data in a large-scale environment because the user needs to label an extensive collection of Wi-Fi signals at each location. Knowledge transfer, or transfer learning, may save significant effort in labeling data [10]. Transfer learning is the transfer of knowledge from a related task that has already been learned to a new task that shares some commonality with it. The basics of transfer learning are well explained in [11]. Transfer learning aims to solve the problem that arises when the training and test data are different. Transfer learning approaches such as instance transfer, feature representation transfer, parameter transfer, and relational knowledge transfer are discussed in [12][13][14][15][16]. Transfer learning finds its motivation in the fact that human beings can intelligently apply previously acquired knowledge to solve a new problem faster or with a better solution. The NIPS-95 workshop on "Learning to Learn" had a special session that discussed the fundamental motivation behind transfer learning.
The workshop focused on the need for lifelong machine learning techniques that retain and reuse previously acquired knowledge [12,17,18]. Thus, machine learning software packages should provide a simple, automatic or semiautomatic setting for users dealing with transfer learning tasks.
Multilayer feedforward neural networks have been used effectively in machine learning. They can approximate complex nonlinear functions of high-dimensional input data. The performance of a multilayer perceptron (MLP) depends on the underlying feature extraction method used [19]. The choice of feature extraction algorithm and of the features used for classification is often empirical and therefore suboptimal. One can instead use the training algorithm to find the best feature extractors directly by adjusting the weights. However, when the input dimension is high (as in image processing applications), the number of connections, and hence the number of free parameters, increases because each hidden unit is fully connected to the input layer. This may lead to a network that overfits the data, as its complexity is too high. Moreover, the input patterns must be well aligned and normalized before being presented to such an MLP, which has no built-in invariance with respect to local distortions and translations [20]. Various neural network classifiers are explained in [18,[21][22][23][24][25][26]. A convolutional neural network (CNN) addresses these problems of the MLP by extracting local features and combining them subsequently to perform detection or recognition. The CNN and the neocognitron are neural network architectures meant for visual pattern recognition; they have integrated feature extraction and classification layers. However, no work has been reported in the literature that equips neural networks meant for visual pattern recognition with transfer learning without making changes to the architecture.
The contributions of this paper include the following: 1. A novel training algorithm for the CNN architecture under a transfer learning task. A three-phase training algorithm is proposed for this purpose; Phase I is a conventional phase in which the CNN is trained with the conventional methods in [27][28][29].

The remainder of this paper is organized as follows. Section "Related work" throws light on the work done in the areas of transfer learning and deep learning; the aim of the proposed work is to equip a CNN (a deep learning network) with a transfer learning framework, and this section surveys various ongoing applications in these fields. Section "The framework of transfer learning" explains the transfer learning framework used in this research; the framework is applied to principal component analysis (PCA) to derive the projection matrix. Section "Convolutional neural networks" explains the architecture of the CNN, followed by the proposed training algorithm for the CNN architecture. Section "Comparison of traditional algorithm (conventional) with proposed algorithm" compares the traditional algorithm with the proposed one. Section "Dataset" describes the dataset used in this research. Section "Experiments, parameter settings, and observations" describes the experiments performed on the CNN architecture, the various parameter settings in the algorithm, and the observations. Section "Conclusion" presents the conclusions of this research work.

Related work
In the last few years, the visual recognition community has shown a growing interest in transfer learning algorithms [30,31]. Transfer subspace learning (TSL) has been used effectively to understand kin relationships in photos [32]. Classification under covariate shift has been solved by transfer learning [33]. Features with meta-features that can be used in a prediction task are studied in [34]. Building classifiers for text classification by extracting positive examples from unlabeled examples to improve system performance is highlighted in [10]. Transfer subspace learning that can reduce time and space cost is proposed in [35]. Enhanced subspace clustering algorithms [36,37] are used to handle complex data and to improve clustering results. Cross-domain discriminative locally linear embedding (CDLLE) can be used to reduce human labeling effort for the social image annotation problem [38]. A framework robust against noise in the transfer learning setting is proposed in [39]. A semisupervised clustering algorithm with domain adaptation and constraint knowledge with transferred centroid regularization is proposed in [40]. Xiaoxin Yin et al. have proposed efficient classification across multiple database relations [41]. Performance improvement is seen when transfer learning is used in medical image segmentation followed by classification [42]. Low-resolution face images have been matched with high-resolution gallery images using transfer learning, which improved cross-resolution face matching [43]. Transfer learning using a Bayesian model was used in [44] for a face verification application. Ensemble-based transfer learning was used in text classification [45]. Knowledge was transferred between text and images using a matrix factorization approach by Zhu et al. [46]. Geng et al. used domain adaptation metric learning for face recognition and web image annotation [47].
A server-based spam filter learned from public sources was designed and applied to individual users with the help of transfer learning [48]. In recent years, owing to its state-of-the-art performance in many research domains, deep learning has attracted the attention of the academic community. Companies such as Google, Facebook, and Apple, which collect and analyze massive amounts of data, are putting forward many deep learning-related projects; this is the prime motivation behind this research. Deep learning challenges and perspectives are well explained in [49]. Weilong Hou et al. have performed blind quality assessment via deep learning [50]. Shuhui Bu et al. applied deep learning to 3D shape retrieval for the first time [51]. A traffic flow prediction approach based on deep learning has been proposed in [52]. Object tracking in blurred videos using deep image representations is proposed by Jianwei Ding et al. [53]. Adaptively learning a representation that is more effective for the task of vehicle color recognition using spatial pyramid deep learning is given by Chuanping Hu et al. [54]. Deep learning has also been used to grade nuclear cataracts [55]. Deep learning is widely used in medical image processing for segmentation, classification, and registration [56][57][58][59][60][61], image denoising [62], and multimodal learning [63]. Deep learning has been shown to give robust image representations for the single-training-sample-per-person face recognition task [64]. Corey Kereliuk et al. performed music content analysis with deep learning [65]. Land use classification [66], scene classification [67], and visual tracking [68] applications work well with deep learning architectures. The impact of deep learning on developmental robotics is explained in [69]. Multi-label image annotation has been achieved using semisupervised deep learning [70]. Financial signal representation is done in [71] using deep neural networks.
A pipeline for object detection and segmentation in the context of volumetric image parsing is proposed using marginal space deep learning [72]. Deep learning has also been used in indoor localization, reducing the location error compared with three existing methods [73]. The convolutional neural network (CNN), a very popular deep learning network, is used in almost all these applications, since it is believed to be one of the most appropriate networks for modeling images [74]. CNNs are used for image classification [75], pose estimation [76], face recognition [77], and modeling texts [78][79][80][81][82][83][84].
The proposed work differs clearly from concurrent work on deep learning and transfer learning in the following ways: 1. The support vector machine (SVM) is used extensively in transfer learning methods. Most transfer learning algorithms are developed only for a specific model, which makes it difficult to use them with other models and restricts their applicability. To the best of the authors' knowledge and the data available in the literature, the first attempt to equip a deep neural network with a transfer learning framework was made by Mingsheng Long et al. [85].
In their framework, they modified the architecture of the CNN. The research work proposed in this paper, however, is for the conventional CNN architecture: a novel training algorithm under transfer learning is proposed without changing the architecture of the CNN. There has also been an attempt to equip a shallow neural network with transfer learning [52]. The authors also acknowledge the work of Fan Zhang et al., who suggested neural network ensemble training to improve prediction accuracy at the expense of an increased number of trainable parameters [67]. In short, the proposed algorithm is generic and can be used with any deep learning architecture. 2. The transfer learning task has been demonstrated with applications such as medical image segmentation, text classification [86], web image annotation, and face recognition. Various standard datasets, such as the Yale Face Database, the Facial Recognition Technology (FERET) database, and Labeled Faces in the Wild (LFW), exist for experimentation on face recognition. The face images in these datasets are acquired with various poses, illuminations [87], and expressions. However, no face dataset exists with face images acquired at different distances. The authors have therefore built their own dataset, which may be used by researchers working on the problem of face recognition at a distance. The details of the dataset are explained in Sect. "Dataset".

The framework of transfer learning
Given m training samples with x_i as the input and t_i as the target for the classification task, let E(W) denote the loss of the conventional subspace method on the training samples, where W is the projection matrix to be learned. The transfer learning framework regularizes this loss with the divergence between the training and test distributions in the projected subspace:

F(W) = E(W) + ρ D_W(P_T || P_U)    (2)

with constraints, e.g., W^T W = I. In (2), ρ is the regularization parameter that controls the trade-off between E(W) and D_W(P_T || P_U), the divergence between the distribution P_T of the training samples and the distribution P_U of the test (new) samples under the projection W. The solution of (2) can be obtained by the gradient descent algorithm and is given by

W ← W − α (∂E(W)/∂W + ρ ∂D_W(P_T || P_U)/∂W)    (3)

where α is the learning rate.
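As a rough illustration of the update in (3), the following sketch performs one gradient step on F(W) and then re-imposes the orthonormality constraint by QR re-orthonormalization. This is a minimal sketch under our own assumptions: the function name, and the choice of QR as the way to restore W^T W = I, are not part of the original formulation.

```python
import numpy as np

def tsl_step(W, grad_E, grad_D, rho, alpha):
    """One gradient-descent step on F(W) = E(W) + rho * D_W(P_T || P_U).

    grad_E and grad_D are the gradients of the two terms w.r.t. W (Eq. 3);
    the QR factorization re-imposes the constraint W^T W = I afterwards.
    """
    W_new = W - alpha * (grad_E + rho * grad_D)
    Q, _ = np.linalg.qr(W_new)  # columns of Q are orthonormal
    return Q
```

Setting rho to zero recovers plain gradient descent on the conventional objective E(W), which is one way to see how the divergence term steers the subspace toward the new samples.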

Framework of TSL applied to principal component analysis (PCA)
Principal component analysis (PCA) projects high-dimensional data onto a lower-dimensional space by capturing maximum variance [88]. The PCA projection matrix maximizes the trace of the total scatter matrix:

max_W tr(W R W^T)    (4)

subject to W W^T = I, where R is the autocorrelation matrix of the training samples. Since (2) is a minimization, E(W) of PCA is given by

E(W) = −tr(W R W^T)    (5)

and its gradient is

∂E(W)/∂W = −2 W R.    (6)

By substituting (5) and (6) into (3), we can obtain the projection matrix W for transfer learning. The detailed procedure to get the solution of (3) is given in [2].
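The PCA terms (5) and (6) can be sketched as follows. This is a minimal illustration under our own assumptions: the helper name and the use of the sample autocorrelation matrix R = X^T X / m are ours, not the paper's.

```python
import numpy as np

def pca_energy_and_grad(W, X):
    """E(W) = -tr(W R W^T) and its gradient dE/dW = -2 W R (Eqs. 5-6).

    W has shape (k, d): k projection directions in d-dimensional space.
    X has shape (m, d): m training samples, one per row.
    """
    R = X.T @ X / len(X)          # autocorrelation matrix of the samples
    E = -np.trace(W @ R @ W.T)    # Eq. (5): PCA objective in minimization form
    dE = -2.0 * W @ R             # Eq. (6): gradient w.r.t. W (R is symmetric)
    return E, dE
```

Because R is positive semidefinite, E(W) is always non-positive, and minimizing it drives W toward the directions of maximum variance, as (4) requires.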
Convolutional neural networks

Figure 1 shows the convolutional neural network for the face recognition task. The input plane receives images of 74 × 74 pixels. Layer C 1 is a convolutional layer with six feature maps; each unit in each feature map is connected to an 11 × 11 neighborhood in the input, so the size of each feature map in C 1 is 64 × 64. Layer S 2 is a subsampling layer with six feature maps of size 32 × 32. Each unit in a feature map is connected to a 2 × 2 neighborhood in the corresponding feature map of C 1 . Layer S 2 has no trainable weights.
Layer C 3 is a convolutional layer consisting of 16 feature maps, i.e., 16 kernels of size 11 × 11 and 16 biases, which results in 1952 trainable weights.
Layer S 4 is a subsampling layer with 16 feature maps of size 11 × 11 (each 22 × 22 feature map of C 3 subsampled over 2 × 2 neighborhoods). The S 4 layer has no trainable parameters.
Layer C 5 is a convolutional layer consisting of 120 feature maps, i.e., 120 kernels of size 11 × 11 and 120 biases, which results in 14,640 trainable weights.
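The layer sizes and parameter counts above follow from stride-1 "valid" convolutions and non-overlapping 2 × 2 subsampling (our assumption, but the only one consistent with the quoted numbers). A quick arithmetic check:

```python
def conv_out(n, k):
    """Output width of a stride-1 'valid' convolution with a k x k kernel."""
    return n - k + 1

def pool_out(n, p=2):
    """Output width after non-overlapping p x p subsampling."""
    return n // p

n = 74                  # input image: 74 x 74
n = conv_out(n, 11)     # C1: 64 x 64
n = pool_out(n)         # S2: 32 x 32
n = conv_out(n, 11)     # C3: 22 x 22
n = pool_out(n)         # S4: 11 x 11
n = conv_out(n, 11)     # C5: 1 x 1
print(n)                # -> 1

# Trainable weights = kernel coefficients + biases
c3_params = 16 * 11 * 11 + 16    # 1952, as quoted for C3
c5_params = 120 * 11 * 11 + 120  # 14,640, as quoted for C5
```

Each C 5 map thus collapses to a single unit, so the 120 outputs of C 5 form the feature vector presented to the classification layers.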
Layer F 6 and the output layer form the classification part of the network and are trained on the features produced by C 5 .

Proposed three-phase training algorithm for CNN architecture using transfer learning approach

Figure 2 shows the proposed three-phase training algorithm.

Comparison of traditional algorithm (conventional) with proposed algorithm
Phase I of the algorithm is the training of the CNN C and S layers. Supervised learning is used to train the C and S layers, and a gradient descent method is used to update the weights. The traditional/conventional algorithm used to train a CNN has the following two steps: 1. Conventional phase: this phase is the same as the conventional phase of the proposed algorithm. MSE1 is the performance index used in this phase. This phase trains the feature extraction layers of the CNN (C 1 , C 3 and C 5 ). 2. Weight updating phase: the output of the C 5 layer, also called the features, is used to train F 6 and the output layer. Weight modification is done using all the samples in the training dataset. MSE2 is the performance goal used in this step.
In the proposed algorithm, we tap the output of the C 5 layer and reweight it using Eq. 3. The output features are reweighted until the distribution difference between the old training samples and the new samples is reduced. In Phase III of the proposed algorithm, these reweighted features are used to train F 6 and the output layer. When the new training set becomes available, the proposed algorithm does not disturb the trained CNN layers (C 1 , C 3 and C 5 ); the new information is incorporated into the network by modifying only the weights of F 6 and the output layer. This is the proposed minimum change principle.
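The division of labor among the three phases can be sketched as follows. This is an illustrative toy, not the paper's implementation: `extract_features` stands in for the frozen C 1 –C 5 stack, `W_tsl` for the reweighting matrix obtained from Eq. 3, and `train_classifier` for F 6 plus the output layer; all of these names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(x):
    """Stand-in for the frozen C1-C5 layers (Phase I output); never retrained."""
    return x

def reweight(features, W_tsl):
    """Phase II: project the C5 features with the matrix W from Eq. 3."""
    return features @ W_tsl

def train_classifier(F, t, lr=0.2, epochs=500):
    """Phase III: retrain only the classification weights (F6/output)
    by gradient descent on the mean-squared error."""
    V = np.zeros((F.shape[1], t.shape[1]))
    for _ in range(epochs):
        V -= lr * F.T @ (F @ V - t) / len(F)
    return V

# Toy usage: new data arrive; C1-C5 stay fixed, only V is re-learned.
X_new = rng.standard_normal((20, 3))
t_new = X_new @ rng.standard_normal((3, 2))
F = reweight(extract_features(X_new), np.eye(3))
V = train_classifier(F, t_new)
```

The point of the sketch is structural: incorporating the new samples touches only `V`, mirroring the minimum change principle under which the feature extraction weights are left undisturbed.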

Dataset
To the best of our knowledge, there is no public dataset constructed with a large number of samples for performing experiments on face recognition under transfer subspace learning. While preparing the database, the distance between the subject and the camera was varied, and the camera position was shifted. The distance was varied in steps of 15 cm. We refer to a distance of 15 cm as scale 1 (S1), 30 cm as scale 2 (S2), and so on up to 120 cm as scale 8 (S8). Similarly, a shift of 5 cm at S1 is S1sh1, a shift of 10 cm at S1 is S1sh2, a rotation of 5° at S7 is S7r1, a rotation of 10° is S7r2, etc. Camera positions were shifted by 5 and 10 cm at scales S1 and S2. The camera was rotated by 5° and 10° of inclination at scales S7 and S8. The images were collected in an illumination-controlled environment. To maintain consistency throughout the database, the same physical setup was used in each photography session. Because the equipment had to be reassembled for each session, there was some minor variation in images collected on different dates. The proposed database was collected in 10 sessions between December 2012 and June 2013.
The database contains 20,000 images of 50 subjects. For every subject, 25 images per scale, at four shifts, and at two angles were taken (400 images per subject in total). The details of the database are shown in Table 1. Figure 3 shows some sample images from the database.
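The scale naming convention above can be captured by a small helper (illustrative only; the function name is ours):

```python
def scale_label(distance_cm):
    """Map subject-camera distance to its scale label: S1 = 15 cm, ..., S8 = 120 cm."""
    assert distance_cm % 15 == 0 and 15 <= distance_cm <= 120
    return f"S{distance_cm // 15}"

print(scale_label(15), scale_label(120))  # -> S1 S8

# Sanity check on the database size: 50 subjects x 400 images each
assert 50 * 400 == 20000
```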

Experiments, parameter settings, and observations
The basic CNN architecture was trained using the conventional/traditional algorithm, following the steps discussed in Sect. "Comparison of traditional algorithm (conventional) with proposed algorithm", with samples from the developed database. Table 2 shows the results for the 25-user system, and Table 3 shows the results for the 50-user system. Ten images per user per scale were used for training, and 15 images per user per scale were used for testing. It was observed that when the training and testing samples are from the same scale, the classification rate is high compared with testing samples from different scales.
Various researchers have tackled different applications of transfer learning on the SVM architecture. We have proposed a generic training algorithm that can be used for any deep learning network with integrated feature extraction and classification layers. Moreover, the application of face recognition at a distance is novel. As a result, there are no data available in the literature with which the proposed work can be compared.

Conclusion
We have proposed a novel training algorithm that can be used to train any neural network architecture meant for visual pattern recognition. Such networks have feature extraction and classification layers integrated into the architecture. In many applications, training data become available incrementally; in this situation, neural networks such as the CNN and the neocognitron have to be retrained with the new data. The proposed approach can be used in such situations: one taps the output of the last feature extraction layer and reweights it in such a way that the distribution difference between the old and new training samples is reduced. (Fig. 10 shows the variation of classification accuracy as a function of rho (ρ) and alpha (α) for the 50-user system.) We have shown the utility of the algorithm for the CNN architecture. The approach, however, is generic and can be used for any neural network architecture that has feature extraction and classification layers integrated into one architecture.
The training time of any neural network increases with the number of samples. If the training samples are not all available at one time, the situation demands retraining. Many machine learning software packages have no provision to avoid retraining. The proposed algorithm can increase the utility of any machine learning software package by giving the user a method that transfers the new information into the architecture while disturbing only a few of the trainable parameters. With this approach, the training time under a transfer learning task can be reduced.
We have proposed a novel three-phase training algorithm for the CNN under transfer learning that gives a consistent average classification rate. With the proposed framework, one has to disturb only 60% of the weights in the architecture to incorporate the knowledge available from the new training samples. We proposed the minimum change principle, according to which one disturbs as few weights as possible to transfer knowledge. The work may be extended by (1) reducing negative transfer of knowledge and (2) developing an information-theoretic measure of the information transfer.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecomm ons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.