DensePILAE: a feature reuse pseudoinverse learning algorithm for deep stacked autoencoder

Autoencoders have been widely used as a feature learning technique. In many autoencoder-based works, the features of the original input are extracted layer by layer through multi-layer nonlinear mappings, and only the features of the last layer are used for classification or regression. The features of the earlier layers are therefore not used explicitly, which loses information and wastes computation. In addition, Internet of Things applications generally require fast training and inference, but stacked autoencoders are usually trained with the back propagation (BP) algorithm, which converges slowly. To solve these two problems, this paper proposes a dense connection pseudoinverse learning autoencoder (DensePILAE) from the perspective of feature reuse. The pseudoinverse learning autoencoder (PILAE) extracts features through an analytic solution, without iterative optimization, so the time cost is greatly reduced. At the same time, the features of all previous layers in the stacked PILAE are combined as the input of the next layer. In this way, the information of the previous layers is not lost and can be strengthened and refined, so that better features can be learned. Experimental results on 8 data sets from different domains show that the proposed DensePILAE is effective.


School of Information, Shanxi University of Finance and Economics, Taiyuan 030012, China

Introduction
With the development of the Internet of Things (IoT), people can obtain all kinds of data anytime and anywhere through various types of sensors, which lays the foundation for the application of deep learning. With the continuous growth of data volume, deep learning has been applied in many fields of the IoT, such as smart cities [16,17], intelligent transportation [20,27] and healthcare [4,15,29]. Deep learning usually builds a deep neural network (DNN) by stacking layers, that is, the output of the former layer is used as the input of the latter layer. The DNN is then trained by the error back propagation (BP) algorithm. This paradigm has been used in a large number of applications and has achieved very good results. The deep autoencoder [9] is a typical application.
The deep autoencoder exploits the feature reconstruction capability of the autoencoder to learn features. Many variants have been proposed, such as the stacked autoencoder (SAE) [12] and the deep denoising autoencoder [21]. It has also been applied in many fields, such as remote sensing image recognition [30] and anomaly detection [5].
Since the deep autoencoder is usually trained by the BP algorithm, the two notorious problems of BP, local minima and slow convergence, are present in the training process. As a result, the network takes a long time to train and may not reach an optimal solution. Especially in IoT applications, a DNN needs to run on resource-constrained devices, which often have low computing power and cannot handle a large amount of computation. To overcome these shortcomings, researchers have proposed many non-BP based methods, such as the pseudoinverse learning algorithm (PIL) [6,22] and the random vector functional link (RVFL) [18].
RVFL is only a single hidden layer feedforward neural network. To introduce RVFL into deep learning, dRVFL was proposed in [14]. This method extracts features quickly and achieves satisfactory performance. However, its disadvantage is that the features of each layer are obtained by random projection, which is difficult to interpret. PILAE [25] combines PIL with the autoencoder by using PIL to train the autoencoder. PILAE is an unsupervised feature learning method that learns features through an analytical solution. Therefore, PILAE is more interpretable than RVFL.
However, stacked PILAE and all other deep autoencoders suffer from information loss. An autoencoder is an unsupervised feature learning method that sets the target output equal to the input. To avoid trivial solutions, there are usually multiple bottleneck layers, which force the network to learn a compressed, abstract representation. The narrower the layer, the greater the compression. As the network depth increases, the features become more and more abstract, and the information loss becomes more and more serious [31]. This information loss impairs the learning ability of the model. To strengthen the learning ability of the model, the lost information should be supplied.
In deep convolutional neural networks (CNNs), feature reuse is used to reduce information loss. ResNet [10] introduces an identity connection, which integrates linear activation and nonlinear activation, so each residual block reuses the information of the previous layer. DenseNet [13] further extends the range of the identity connection: the output of each layer is used as the input of all subsequent layers, so previously extracted features are preserved in the later layers. Feature reuse thus alleviates the loss of information. Inspired by DenseNet, we introduce dense connections into PILAE and use feature reuse to address information loss in the autoencoder. The resulting feature learning approach is the dense connection pseudoinverse learning autoencoder (DensePILAE). The input of each layer in DensePILAE is the concatenation of all previous layer outputs, and the new feature is learned by reconstructing all historical features. By reusing the features learned in all previous layers, compact features are extracted with less information loss, so better accuracy can be obtained with fewer parameters.
In this paper, we make the following contributions: (1) aiming at the problem of information loss in stacked PILAE, a dense connection PILAE method is proposed; (2) the effectiveness of DensePILAE is analyzed from the perspective of feature reuse; (3) experiments are carried out on 8 data sets, and the accuracy, the area under curve, the time cost and the parameter sensitivity are analyzed to verify the effectiveness of DensePILAE.
The rest of the paper is organized as follows. In "Related work", we briefly review the works related to this paper. We then detail the basic theory of PILAE and the proposed DensePILAE in "DensePILAE". In "Experiments and discussion", we conduct experiments and present the comparison and analysis. Finally, we give our conclusions in "Conclusion".

Related work

Feature reuse
Feature reuse is an important concept proposed by Bengio in the seminal paper [1]. Feature reuse can be achieved through network depth: the multi-layer nonlinear transformation compresses the input step by step, so a standard deep network can be seen as a form of feature reuse. This is implicit feature reuse. Another form feeds the output of an earlier, non-adjacent layer directly into the current layer through a cross-layer connection, which can be regarded as explicit feature reuse. ResNet introduces residual blocks, which essentially combine the feature of the previous layer with the feature of the current layer. This reuses the feature of the previous layer. DenseNet goes one step further by concatenating the outputs of all previous layers as the input of subsequent layers. Not only the feature of the previous layer is reused, but also the features of all layers before it. Therefore, the subsequent layers can make use of the knowledge learned by all previous layers. Deep layer aggregation [28] extends feature reuse further. It does not simply concatenate the features of the previous layer or of all previous layers, but reuses them selectively. Two effective reuse methods are proposed, namely iterative reuse and hierarchical reuse.
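The difference between residual-style and dense-style explicit reuse can be illustrated with a toy fully connected example. This is only a sketch: the `layer` helper, the tanh activation, the random weights and all dimensions are assumptions made for illustration, and the real ResNet and DenseNet use convolutional blocks rather than these plain matrix products.

```python
import numpy as np

def layer(x, w):
    """One toy nonlinear layer, tanh(W x); columns of x are samples."""
    return np.tanh(w @ x)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5))           # 8 features, 5 samples

# Residual-style reuse: the previous feature is ADDED to the new one,
# so both must have the same dimension.
w_res = rng.standard_normal((8, 8))
residual_out = layer(x, w_res) + x        # still 8 x 5

# Dense-style reuse: all earlier features are CONCATENATED along the
# feature axis, so the input of each layer grows with depth.
w1 = rng.standard_normal((4, 8))
f1 = layer(x, w1)                         # 4 x 5
w2 = rng.standard_normal((4, 8 + 4))
f2 = layer(np.vstack([x, f1]), w2)        # input is [x; f1], 12 x 5
dense_stack = np.vstack([x, f1, f2])      # 16 x 5, nothing is discarded
```

Addition keeps the width constant but mixes old and new features, whereas concatenation preserves every earlier feature unchanged at the cost of a wider input.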
Similar feature reuse methods have also been used in fully connected networks. The deep stacking network (DSN) [3] reuses the predictions of all previous layers. In DSN, the reused features are low-dimensional, so their representation ability is limited. In addition, for some simple samples, the predictions of the layers tend to agree, which produces a lot of redundant information. Unlike DSN, DensePILAE reuses the hidden layer of the autoencoder, which has a larger dimension and contains richer information. Feature reuse for sequence data is studied in ResInNet [19], which is applied to traffic prediction in the Internet of Things.

Non-BP based fast learning network
The BP algorithm has been widely used to train deep neural networks and has become the most popular training method. However, it has two notorious and widely criticized shortcomings: local minima and a slow convergence rate. To avoid using the BP algorithm, many network architectures have been proposed, such as RVFL [18] and PIL [6-8]. In these methods, the weights of the network are obtained analytically. PIL and RVFL differ in the network structure and in how the weights are initialized. PIL adopts a standard single hidden layer feed-forward neural network (SLFN), while RVFL adds a direct connection between the input layer and the output layer. For the weight between the input layer and the hidden layer, PIL uses either random values or the pseudoinverse of the input, while RVFL uses random values.
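The structural difference described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation of either method: the function names, the uniform random input weights, the sigmoid activation and the ridge-regularized output solve are assumptions, and the pseudoinverse-based initialization that PIL also allows is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_slfn(X, Y, n_hidden, lam, rng, direct_link=False):
    """Fit a single hidden layer network without BP (columns of X are samples).

    direct_link=False: plain SLFN, as used by PIL.
    direct_link=True : RVFL-style, the input is also fed to the output layer.
    """
    W_in = rng.uniform(-1.0, 1.0, size=(n_hidden, X.shape[0]))  # random, never trained
    H = sigmoid(W_in @ X)
    D = np.vstack([H, X]) if direct_link else H
    # Analytical output weights: W_out = Y D^T (D D^T + lam I)^{-1}
    A = D @ D.T + lam * np.eye(D.shape[0])
    W_out = np.linalg.solve(A, D @ Y.T).T
    return W_in, W_out

def predict(X, W_in, W_out, direct_link=False):
    H = sigmoid(W_in @ X)
    D = np.vstack([H, X]) if direct_link else H
    return W_out @ D
```

With `direct_link=True` the output layer sees both the hidden features and the raw input, so any linear relation between input and target can be captured by the direct connection alone.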
After years of development, many variants of PIL and RVFL have been proposed, such as PILAE [25], LR-PILAE [26], CPILer [24], D-RVFL [11], dRVFL [14] and SP-RVFL [32]. PILAE applies PIL to the training of the autoencoder. LR-PILAE uses a low rank constraint to select the network structure automatically. CPILer uses graph Laplacian regularization to address the robustness of AutoML systems. In [23], the combination of PILAE and AdaBoost is used for driving stress recognition. The deep random vector functional link (D-RVFL) [11] is a multi-layer RVFL network built by stacking. The deep RVFL (dRVFL) [14] is another multi-layer RVFL network, which uses RVFL as its basic building block. Except for the first layer, the enhancement unit of each layer is obtained by multiplying the enhancement unit of the previous layer by a random weight. The enhancement units of all layers are concatenated as the enhancement unit of the dRVFL, and the weight of the output layer is then determined by the least squares method.
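The dRVFL construction described above can be sketched as follows. This is a simplified reading of the description, not the implementation from [14]: the function names, the uniform random weights and the sigmoid applied after each random projection are assumptions, and any direct link from the input to the output layer is left out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def drvfl_features(X, n_hidden, n_layers, rng):
    """Stack randomly projected layers and concatenate all of them.

    Columns of X are samples. No hidden weight is learned.
    """
    layers, inp = [], X
    for _ in range(n_layers):
        W = rng.uniform(-1.0, 1.0, size=(n_hidden, inp.shape[0]))
        inp = sigmoid(W @ inp)        # random projection of the previous layer
        layers.append(inp)
    return np.vstack(layers)          # (n_layers * n_hidden) x N

def fit_output_lstsq(D, Y):
    """Least squares output weights W_out such that W_out @ D approximates Y."""
    return np.linalg.lstsq(D.T, Y.T, rcond=None)[0].T
```

Only `fit_output_lstsq` involves learning. All earlier layers are fixed random maps, and their outputs are reused once, in the output layer.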

DensePILAE
In this section, we will introduce the basic theory of PILAE and our proposed DensePILAE.

Basic theory of PILAE
The pseudoinverse learning algorithm (PIL) [6-8] is a fast training method for a single hidden layer feed-forward neural network. It initializes the weight between the input layer and the hidden layer either randomly or from the pseudoinverse of the input data, and the weight between the hidden layer and the output layer is then obtained as an analytical solution.
Given a training set $D = \{X, Y\}$, the weight between the input layer and the hidden layer is denoted $W_{in}$, and the weight between the hidden layer and the output layer is denoted $W_{out}$. $W_{in}$ is initialized randomly or from the pseudoinverse of the input matrix $X$. The hidden layer output $H$ is
$$H = f(W_{in} X), \tag{1}$$
where $f(\cdot)$ is the activation function. The learning problem can be expressed as
$$\min_{W_{out}} \| W_{out} H - Y \|^{2}. \tag{2}$$
We can get the analytical solution of $W_{out}$ from the pseudoinverse of $H$:
$$W_{out} = Y H^{+}. \tag{3}$$
The pseudoinverse of $H$ is
$$H^{+} = H^{T} (H H^{T})^{-1}. \tag{4}$$
The autoencoder is essentially a three-layer neural network. The biggest difference is an added constraint that makes the output equal to the input:
$$Y = X. \tag{5}$$
Wang et al. [25] proposed PILAE, which trains the autoencoder with PIL. The encoder weight $W_{e}$ is initialized randomly or from the pseudoinverse of the input matrix $X$. According to Eqs. (3), (4) and (5), the decoder weight $W_{d}$ can be obtained as
$$W_{d} = X H^{T} (H H^{T})^{-1}. \tag{6}$$
To avoid ill-conditioning and to enhance the generalization ability of the network, an $L_2$ regularization constraint is imposed on the decoder weight in PILAE. The decoder weight can be rewritten as
$$W_{d} = X H^{T} (H H^{T} + \lambda I)^{-1}, \tag{7}$$
where $\lambda > 0$ is the regularization parameter. Since the autoencoder has a symmetrical structure, weight tying is used to reduce the number of parameters and the risk of overfitting, and the encoder weight is updated to
$$W_{e} = W_{d}^{T}. \tag{8}$$
Recomputing the hidden layer output with the new encoder weight gives the feature.
Because the learning ability of a single PILAE is limited, several PILAEs are stacked. However, as the depth increases, the performance improvement of stacked PILAE is not obvious. The reason is that, to avoid identity mapping, forced dimension reduction is imposed on the structure of each PILAE, so there are many bottleneck layers in stacked PILAE. Although the features are refined as the depth increases, part of the information is also lost, which increases the model error.
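The training procedure of a single PILAE described above can be sketched in NumPy. This is a minimal sketch under stated assumptions rather than the authors' code: the name `train_pilae`, the Gaussian random initialization of the encoder and the sigmoid activation are choices made here for illustration, and only the random initialization option (not the pseudoinverse one) is shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_pilae(X, n_hidden, lam, rng):
    """Train one PILAE on X (d x N, columns are samples) without iterations.

    Returns the tied encoder weight W_e (n_hidden x d) and the
    extracted feature F (n_hidden x N).
    """
    d = X.shape[0]
    W_e0 = rng.standard_normal((n_hidden, d))   # random encoder initialization
    H = sigmoid(W_e0 @ X)                       # hidden output H = f(W_e X)
    A = H @ H.T + lam * np.eye(n_hidden)        # H H^T + lambda I
    # Regularized decoder W_d = X H^T (H H^T + lambda I)^{-1},
    # computed with a linear solve instead of an explicit inverse.
    W_d = np.linalg.solve(A, H @ X.T).T
    W_e = W_d.T                                 # weight tying
    F = sigmoid(W_e @ X)                        # recomputed hidden output
    return W_e, F
```

Using `np.linalg.solve` on the symmetric matrix `A` gives the same result as multiplying by its inverse, but is usually cheaper and numerically safer.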

DensePILAE
To address this problem, we concatenate the outputs of all previous layers as the input of the subsequent layer. Figure 1 illustrates the network structure of DensePILAE. The input of the $l$th layer is
$$X_{l} = [F_{1}, F_{2}, \dots, F_{l-1}], \tag{9}$$
where $F_{i}$ is the feature extracted by the $i$th layer. According to Eq. (1), the hidden output $H_{l}$ is
$$H_{l} = f(W_{el} X_{l}), \tag{10}$$
where $W_{el}$ is the random weight of the encoder in the $l$th layer. According to Eqs. (7) and (8), the weight of the decoder in the $l$th layer can be calculated as
$$W_{dl} = X_{l} H_{l}^{T} (H_{l} H_{l}^{T} + \lambda I)^{-1}. \tag{11}$$
The weight of the $l$th layer encoder is obtained by weight tying, and the feature $F_{l}$ extracted by the $l$th layer autoencoder can be calculated as
$$F_{l} = f(W_{dl}^{T} X_{l}). \tag{12}$$
DensePILAE is implemented by applying feature reuse to stacked PILAE. It has two advantages. First, the lost information can be directly supplemented through identity connections, which reduces the error of the model. Second, the supplementary information comes from the features already learned by the lower layers, so there is no need to design new modules to learn the lost information.
On the whole, DensePILAE combines reuse in width with reuse in depth. Layer-by-layer stacking realizes feature reuse in depth, which is implicit reuse. Feature concatenation realizes feature reuse in width, which is explicit reuse. Feature reuse reduces the error of the network and improves its feature learning ability.
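The dense stacking described in this section can be sketched as a loop over PILAE layers. This is an illustrative sketch, not the authors' implementation: the function names, the Gaussian initialization and the sigmoid activation are assumptions, and it is further assumed that the first layer sees the raw data while every later layer sees only the concatenation of the previously extracted features.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pilae_layer(X, n_hidden, lam, rng):
    """One PILAE: random encoder, ridge decoder, tied weights."""
    W0 = rng.standard_normal((n_hidden, X.shape[0]))
    H = sigmoid(W0 @ X)
    A = H @ H.T + lam * np.eye(n_hidden)
    W_d = np.linalg.solve(A, H @ X.T).T     # X H^T (H H^T + lam I)^{-1}
    W_e = W_d.T                             # weight tying
    return W_e, sigmoid(W_e @ X)

def dense_pilae(X, n_hidden, n_layers, lam, rng):
    """Densely connected stack of PILAEs (columns of X are samples).

    Assumption: layer 1 takes X; layer l >= 2 takes [F_1; ...; F_{l-1}].
    """
    weights, feats = [], []
    inp = X
    for _ in range(n_layers):
        W_e, F = pilae_layer(inp, n_hidden, lam, rng)
        weights.append(W_e)
        feats.append(F)
        inp = np.vstack(feats)              # concatenate all previous features
    return weights, feats
```

Because the hidden width stays fixed, the matrix that has to be solved in every layer is always `n_hidden x n_hidden`, and only the input dimension grows with depth.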

Experiments and discussion

Data set
To verify the validity of the proposed method, experiments are performed on 8 public data sets from several fields: MNIST, USPS, BA, Yale, ORL, COIL-20, COIL-100 and NORB. MNIST, USPS and BA are handwritten character recognition data sets. Yale and ORL are face recognition data sets. COIL-20, COIL-100 and NORB are object recognition data sets. The data sets are described in detail as follows (Table 1):
• MNIST The Modified National Institute of Standards and Technology (MNIST) data set is a handwritten digit recognition data set covering the ten digits from 0 to 9. MNIST has a total of 70,000 images, of which 60,000 are used for training and 10,000 for testing. Each image is a 28 × 28 pixel grayscale image. In the experiment, we randomly selected 400 images per class to form our experimental data set, which contains a total of 4000 images.
• BA The Binary Alphadigits (BA) data set includes 1404 samples, each of which is a 20 × 16 image. There are 36 categories, covering the digits from 0 to 9 and the letters from A to Z. Each category has 39 samples.

Compared methods
We compare the proposed DensePILAE with three non-BP methods, stacked PILAE [25], RVFL [18], and dRVFL [14]. Stacked PILAE is a forward learning algorithm that uses PIL to quickly train SAE. In RVFL, the input layer is directly connected with the output layer. Therefore, RVFL is a special single hidden layer feed-forward neural network. The dRVFL is an extension of RVFL in the depth direction. Its characteristic is that only the weight of the last layer is obtained by learning, and the weights of all the previous layers are generated by random projection.

Experiment settings
To compare the different methods fairly, the number of neurons in the hidden layer of RVFL is set to 100. The number of neurons in every hidden layer of dRVFL and DensePILAE is also set to 100, and the number of layers is set to 10. The width and depth of stacked PILAE are set by cross validation. All methods use the sigmoid activation function. The regularization parameter $\lambda$ is selected from $\{2^{-6}, 2^{-4}, 2^{-2}, 2^{0}, 2^{2}, 2^{4}, 2^{6}, 2^{8}, 2^{10}\}$. To reduce the effect of randomness as much as possible, the final experimental results are obtained by 10-fold cross validation. Our experiments are performed on a GeForce GTX 1080 GPU.
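The 10-fold cross validation protocol mentioned above can be sketched with plain NumPy index handling. The helper name `kfold_indices` and the use of a random permutation are assumptions. The paper does not state how the folds were drawn or whether they were stratified by class.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    perm = rng.permutation(n_samples)       # shuffle once
    folds = np.array_split(perm, k)         # k nearly equal parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx
```

Each sample is used for testing exactly once, and the reported score would be the average over the k test folds.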

Performance comparison and analysis
To verify the effectiveness of DensePILAE, we first report the accuracy (ACC) and the area under curve (AUC) of DensePILAE and the other methods on 8 data sets. DensePILAE obtains the best ACC and AUC on 7 data sets. In other words, it outperforms the other methods on 87.5% of the data sets. The ACC of DensePILAE exceeds 99% on the COIL-20 and COIL-100 data sets, and its AUC is 100% on the ORL, COIL-20 and COIL-100 data sets. Table 2 shows the average ACC of the compared methods and the proposed method, and Table 3 shows the average AUC. The results are averages over 10-fold cross validation, and the best performance is shown in bold.
From Tables 2 and 3, we can find that DensePILAE achieves the best results on all 8 data sets. Specifically, the accuracy of DensePILAE improves most on the NORB, BA and COIL-100 data sets, where the improvements in ACC reach 9.59%, 4.68% and 2.86%, respectively. However, the improvement is smaller on the ORL and COIL-20 data sets, only 0.5% and 0.01%. The AUC of DensePILAE is significantly improved on the NORB and Yale data sets, where the improvements in AUC reach 1.95% and 1.38%, respectively. The results show that feature reuse can significantly improve the feature extraction ability of the network and helps it extract more generalized features.
In Tables 2 and 3, we can also find that DensePILAE gets the best results on 7 data sets and is defeated on the Yale data set. Specifically, the accuracy of DensePILAE is significantly improved on the BA, COIL-100, NORB and MNIST data sets, by 10.09%, 9.79%, 4.76% and 4.23%, respectively. On the COIL-20 and ORL data sets, the accuracy improvement is small, less than 1%. The AUC of DensePILAE increases by 2.71%, 2.48% and 1.65% on the NORB, BA and MNIST data sets, respectively. However, the improvement on USPS is small, only 0.46%.
The results show that, compared with the features obtained by random projection in dRVFL, the features obtained by pseudoinverse learning have stronger discrimination ability. They also show that feature reuse through dense connections can extract better features even with a simple network structure.

Time analysis
The time cost is an important criterion for evaluating the performance of a model, and feature learning takes up most of it. We report the feature learning time on 8 data sets in Table 4. The table shows that the order of feature learning speed, from fast to slow, is RVFL, dRVFL, DensePILAE and stacked PILAE. Except on the NORB data set, DensePILAE is slightly slower than dRVFL. This is because the weights of every layer of dRVFL except the output layer do not need to be learned. They are simply set as random projections of the input. In contrast, the weights of DensePILAE need to be learned. DensePILAE is faster than stacked PILAE. As the depth of DensePILAE increases, the input of each PILAE grows, but the hidden width is fixed. Because the features of the lower layers are reused, the width of the hidden layer can be set smaller. In stacked PILAE, the hidden width is closely related to the input width, which is usually larger. The width of stacked PILAE is therefore larger than that of DensePILAE, so the time cost of stacked PILAE is higher.
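A rough operation count helps explain why the hidden width dominates the cost. The estimate below is only a back-of-the-envelope sketch that assumes standard dense linear algebra. The symbols $N$ (number of samples), $d$ (input dimension of a layer) and $p$ (hidden width) are introduced here for illustration and do not appear in the original analysis. Training one PILAE layer with the regularized pseudoinverse involves the following steps:

```latex
\begin{aligned}
H = f(W_{e} X) &:\quad O(p\, d\, N) \\
H H^{T} &:\quad O(p^{2} N) \\
(H H^{T} + \lambda I)^{-1} &:\quad O(p^{3}) \\
X H^{T} &:\quad O(d\, p\, N) \\
X H^{T} (H H^{T} + \lambda I)^{-1} &:\quad O(d\, p^{2})
\end{aligned}
```

Under these assumptions the cost grows only linearly with the input dimension $d$, but quadratically or cubically with the hidden width $p$. A growing concatenated input with a small fixed $p$ is therefore cheaper than a stack of wide hidden layers.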

Parameter sensitivity analysis
In neural networks, the selection of parameters plays an important role in network performance. In DensePILAE, the regularization parameter and the number of hidden neurons are two important hyperparameters. We use grid search to analyze the influence of these two parameters on the performance of DensePILAE. The search range of the regularization parameter $\lambda$ is from $2^{-6}$ to $2^{12}$, and each sample point is four times the previous one. Therefore, the selected regularization parameters are $\{2^{-6}, 2^{-4}, 2^{-2}, 2^{0}, 2^{2}, 2^{4}, 2^{6}, 2^{8}, 2^{10}, 2^{12}\}$. The number of neurons in the hidden layer is selected from 10 to 100 with a step of 10. Therefore, the selected numbers of hidden neurons are {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}. We have carried out parameter sensitivity experiments on 8 data sets. Figures 2 and 3 show the results for ACC and AUC, respectively. On the NORB, MNIST, COIL-100 and USPS data sets, when the regularization parameter is small, ACC and AUC gradually reach their best values as the width of the hidden layer increases. The best ACC values are 92.99%, 92.58%, 99.35% and 96.25%, respectively, and the best AUC values are 99.42%, 99.51%, 100.00% and 99.83%, respectively. On the Yale and ORL data sets, larger regularization parameters lead to better ACC and AUC, and the width of the hidden layer has limited influence on the final results. The best ACC values are 83.33% and 98.75%, respectively, and the best AUC values are 97.96% and 100.00%, respectively. For the COIL-20 data set, when the regularization parameter is small, near-perfect learning is obtained as long as the number of hidden neurons exceeds a threshold. The best ACC and AUC are 99.94% and 100%, respectively. For the BA data set, although ACC and AUC increase slowly with the width of the hidden layer, the regularization parameter has a great influence and must therefore be selected carefully.
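The search grid described above is small enough to enumerate exhaustively. The sketch below builds it with the standard library. The `grid_search` helper and its `score_fn` argument are hypothetical placeholders for whatever routine trains DensePILAE and returns ACC or AUC for a given pair of hyperparameters.

```python
from itertools import product

# Regularization values 2^-6, 2^-4, ..., 2^12 (each four times the previous one).
lambdas = [2.0 ** k for k in range(-6, 13, 2)]
# Hidden widths 10, 20, ..., 100.
widths = list(range(10, 101, 10))
# All (lambda, width) combinations.
grid = list(product(lambdas, widths))

def grid_search(score_fn):
    """Return the (lambda, width) pair with the highest score.

    score_fn(lam, width) is a user-supplied evaluation routine
    (hypothetical here), e.g. cross-validated accuracy.
    """
    return max(grid, key=lambda pair: score_fn(*pair))
```

The grid contains 10 × 10 = 100 configurations, so a full sweep requires 100 training runs per data set and per cross validation fold.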
In summary, ACC and AUC improve as the number of neurons in the hidden layer increases, so a larger number of hidden neurons contributes to better results. The regularization parameter has a more important impact on the performance of the model. For most data sets, a smaller regularization parameter gives an acceptable result, but obtaining the best result requires choosing it carefully.

Discussion
There are significant differences between dRVFL and DensePILAE in feature reuse and feature learning. dRVFL reuses the features of all previous layers only in the last layer, whereas DensePILAE reuses the features of all previous layers in every layer. Therefore, the feature learned by every layer in DensePILAE makes comprehensive use of historical information. Tables 2 and 3 show that DensePILAE obtains better results than dRVFL. In addition, the hidden layer features in dRVFL are obtained by random projection, so dRVFL is similar to a width learning network, whereas the hidden layer features in DensePILAE are obtained by pseudoinverse learning. Therefore, as Table 4 shows, feature learning in DensePILAE is slightly slower.

Conclusion
In this paper, a dense connection pseudoinverse learning autoencoder based on feature reuse is proposed. The method reuses the information of the intermediate layers quickly and effectively, and the learned features have stronger discriminating ability. Our method can be seen as a combined implementation of explicit reuse and implicit reuse. The explicit reuse of features is realized by cross-layer connections, and the implicit reuse of features is realized by multi-layer stacking. In addition, the method not only greatly shortens the feature extraction time of the network, but also avoids the exploding and vanishing gradient problems. The experimental results show that the proposed method achieves better overall performance than the other non-BP based methods. This is because feature reuse makes up for the loss of information and reduces the error of the network. Moreover, this strategy can also be applied to other non-BP based learning networks to further improve their performance. In addition to image classification, DensePILAE can be applied in many scenarios. In the future, we will apply DensePILAE to object detection and fault detection.