Multilayer Fisher extreme learning machine for classification

As a special deep learning algorithm, the multilayer extreme learning machine (ML-ELM) has been extensively studied to solve practical problems in recent years. The ML-ELM is constructed from the extreme learning machine autoencoder (ELM-AE), and its generalization performance is affected by the representation learning of the ELM-AE. However, even when label information is available, the unsupervised learning of the ELM-AE can hardly build a discriminative feature space for classification tasks. To address this problem, a novel Fisher extreme learning machine autoencoder (FELM-AE) is proposed and used as the basic component of the multilayer Fisher extreme learning machine (ML-FELM). The FELM-AE introduces the Fisher criterion into the ELM-AE by adding a Fisher regularization term to the objective function, aiming to maximize the between-class distance and minimize the within-class distance of the abstract features. Different from the ELM-AE, the FELM-AE requires class labels to calculate the Fisher regularization loss, so that the learned abstract features contain sufficient category information to complete classification tasks. The ML-FELM stacks FELM-AEs to extract features and adopts the extreme learning machine (ELM) to classify samples. Experiments on benchmark datasets show that the abstract features extracted by the FELM-AE are more discriminative than those of the ELM-AE, and the classification results of the ML-FELM are more competitive and robust in comparison with the ELM, one-dimensional convolutional neural network (1D-CNN), ML-ELM, denoising multilayer extreme learning machine (D-ML-ELM), multilayer generalized extreme learning machine (ML-GELM), and hierarchical extreme learning machine with L21-norm loss and regularization (H-LR21-ELM).


Introduction
Extreme learning machine (ELM) [1] is a recent development of single-hidden layer feedforward networks (SLFNs) and attracts much attention in the pattern recognition and machine learning fields. The unique principle of the ELM is that the input weights and biases of the hidden layer are randomly chosen without fine-tuning, while the output weights are the globally optimal solution determined by the least-squares method, avoiding the dilemma of local optima. The ELM has been proven to have the universal approximation capability [2] and is adequate for regression and classification tasks [3]. To overcome the problems of online learning, unbalanced learning, and structural redundancy, many improvements have been introduced into the ELM [4][5][6][7][8]. By virtue of strong generalization ability and fast training, the ELM has been successfully applied to speech emotion recognition [9,10], power load forecasting [11,12], and medical diagnosis [13,14].
As a focus of research in recent years, deep learning [15] attaches importance to hierarchical abstract representation learning and realizes the approximation of complex functions. However, deep learning suffers from expensive computational cost [16]. To deal with this, Kasun et al. [17] proposed the extreme learning machine autoencoder (ELM-AE) and stacked the ELM-AE to create the multilayer extreme learning machine (ML-ELM). The ELM-AE is a combination of the ELM and the autoencoder (AE) and is used to extract abstract features of samples for the final mapping from features to targets. Different from traditional deep learning algorithms, the network parameters of the ML-ELM do not need fine-tuning, which greatly reduces training time. Having both the excellent generalization performance of deep learning and the high training efficiency of the ELM, the ML-ELM has gradually been applied in many fields, such as text classification [18], activity recognition [19], and dynamics identification [20].
To enhance the performance of the ML-ELM, many improvements have been proposed. Based on the denoising autoencoder (DAE), Zhang et al. [21] proposed the extreme learning machine denoising autoencoder (ELM-DAE) by introducing the denoising criterion into the ELM-AE, and developed the denoising multilayer extreme learning machine (D-ML-ELM). To avoid the uncertain influence caused by the number of hidden nodes, Wong et al. [22] combined kernel learning with the ML-ELM and proposed the multilayer kernel extreme learning machine (ML-KELM). Sun et al. [23] proposed the generalized extreme learning machine autoencoder (GELM-AE) by integrating manifold regularization into the ELM-AE, and built a deep network called the multilayer generalized extreme learning machine (ML-GELM). To improve the sparsity and robustness of the ML-ELM, Li et al. [24] replaced the mean square error with the L21-norm loss, and proposed the hierarchical extreme learning machine with L21-norm loss and regularization (H-LR21-ELM). In addition, many other methods [25][26][27] have been studied for the development and application of the ML-ELM.
However, the unsupervised learning of the ELM-AE does not exploit the available label information to build a discriminative feature space, which limits the performance of the ML-ELM on classification tasks. Thus, in the hierarchical feature extraction, it is necessary for the ML-ELM to use the class labels to learn a discriminative feature space. The Fisher criterion is the basis of Fisher discriminant analysis (FDA) and aims to extract discriminative abstract features by maximizing the between-class distance and minimizing the within-class distance [28]. Motivated by this, the Fisher extreme learning machine autoencoder (FELM-AE) is proposed in this study and used as the basic component of the multilayer Fisher extreme learning machine (ML-FELM). The FELM-AE introduces the Fisher criterion into the ELM-AE by adding a Fisher regularization term on the output weights to the objective function, which forces samples of different classes to be far apart in the feature space. The ML-FELM stacks the FELM-AE layer by layer to extract abstract features, and then utilizes the ELM to map the features to labels. To the best of our knowledge, this study is the first to combine the ELM-AE with the Fisher criterion to extract discriminative features. Compared with the ELM-AE, the label information contained in the Fisher criterion guides the FELM-AE to learn discriminative features, which effectively improves the classification performance of the ML-FELM.
The contributions of this study are summarized as follows: the FELM-AE is proposed by introducing the Fisher criterion into the ELM-AE, so that class labels guide the extraction of discriminative features; and the ML-FELM is built by stacking the FELM-AE for feature extraction and using the ELM as the final classifier. The rest of this paper is organized as follows. In section "Related works", a brief review of the ELM and ELM-AE is given. In section "Multilayer Fisher extreme learning machine", the details of the proposed FELM-AE and ML-FELM are described. In section "Experiments", the experimental implementation and results over benchmark datasets are reported. In section "Conclusion", the paper is summarized.

Related works

Extreme learning machine
ELM is an efficient single-hidden layer feedforward neural network, and its network structure is shown in Fig. 1.
Given training samples $\{X, T\} = \{(\mathbf{x}_j, \mathbf{t}_j)\}_{j=1}^{N}$, the output of the ELM is formulated as follows:

$$\sum_{i=1}^{L} \boldsymbol{\beta}_i\, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = \mathbf{t}_j, \quad j = 1, \ldots, N, \tag{1}$$

where $\mathbf{x}_j = [x_{j1}, x_{j2}, \ldots, x_{jd}]^T \in \mathbb{R}^d$ and $\mathbf{t}_j = [t_{j1}, t_{j2}, \ldots, t_{jm}]^T \in \mathbb{R}^m$ are the input and target of the $j$th training sample, and $d$ and $m$ are the dimensions of the input and target, respectively. $\mathbf{w}_i$ is the input weight vector of the $i$th hidden node, $b_i$ is its bias, and both are randomly assigned. $g(\cdot)$ is the activation function, $L$ is the number of hidden nodes, and $\boldsymbol{\beta}_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T$ is the output weight vector of the $i$th hidden node.
Due to the universal approximation capability, the ELM can fit the given samples with zero error. Therefore, Eq. (1) can be written in the following matrix form:

$$H\beta = T, \tag{2}$$

where $H$ is the hidden-layer output matrix

$$H = \begin{bmatrix} g(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & g(\mathbf{w}_L \cdot \mathbf{x}_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & \cdots & g(\mathbf{w}_L \cdot \mathbf{x}_N + b_L) \end{bmatrix}_{N \times L}, \tag{3}$$

and $\beta$ and $T$ are the output weight matrix and the target matrix,

$$\beta = \begin{bmatrix} \boldsymbol{\beta}_1^T \\ \vdots \\ \boldsymbol{\beta}_L^T \end{bmatrix}_{L \times m}, \quad T = \begin{bmatrix} \mathbf{t}_1^T \\ \vdots \\ \mathbf{t}_N^T \end{bmatrix}_{N \times m}. \tag{4}$$

To improve its generalization ability, the ELM seeks the optimal $\beta$ that minimizes

$$L_{\mathrm{ELM}} = \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\|H\beta - T\|^2, \tag{5}$$

where $C$ is the regularization parameter that controls the balance between the empirical and structural risk. By setting the gradient of $L_{\mathrm{ELM}}$ with respect to $\beta$ to zero, the closed-form solution of $\beta$ is obtained as

$$\beta = \left(\frac{I}{C} + H^T H\right)^{-1} H^T T, \tag{6}$$

where $I$ is the identity matrix.
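To make the training procedure above concrete, the following is a minimal NumPy sketch of the regularized ELM of Eqs. (1)-(6); the function names and default parameter values are illustrative choices, not taken from the paper.

```python
import numpy as np

def train_elm(X, T, L=500, C=1e5, seed=0):
    """Regularized ELM: random hidden layer (Eq. (1)) + closed-form output weights (Eq. (6))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, L))            # random input weights, never fine-tuned
    b = rng.standard_normal((1, L))            # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # sigmoid hidden-layer output matrix H (Eq. (3))
    # beta = (I/C + H^T H)^{-1} H^T T
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    return W, b, beta

def predict_elm(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                            # predicted targets; argmax over columns gives class labels
```

Here T is assumed to be a one-hot target matrix for classification, so the predicted class of a sample is the column with the largest output.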

Extreme learning machine autoencoder
As a special ELM, the ELM-AE is an unsupervised learning algorithm whose input is also used as its output. The ELM-AE consists of an encoder and a decoder. The encoder realizes the encoding of samples and extracts a random feature representation of the input samples. The decoder realizes the decoding of the abstract feature and learns the mapping from the feature space back to the original input. The network structure of the ELM-AE is shown in Fig. 2.
Assume that the numbers of nodes in the input, hidden, and output layers of the ELM-AE are $d$, $L$, and $d$, respectively. For the training samples $X = \{\mathbf{x}_i \in \mathbb{R}^d\}_{i=1}^{N}$, the output of the hidden layer $H$ is calculated as

$$H = g\left(XW + \mathbf{1}\mathbf{b}^T\right), \tag{7}$$

where $W \in \mathbb{R}^{d \times L}$ and $\mathbf{b} \in \mathbb{R}^{L}$ are the random input weights and biases and $\mathbf{1}$ is the all-ones vector, and the relationship between $H$ and the output $X$ is

$$H\beta = X. \tag{8}$$
It should be noted that the input weights $W$ and biases $\mathbf{b}$ randomly generated by the ELM-AE need to be orthogonal, that is,

$$W^T W = I, \quad \mathbf{b}^T\mathbf{b} = 1. \tag{9}$$

According to the numbers of nodes in the input and hidden layers, the ELM-AE represents the input samples in different ways: a compressed representation when $L < d$, an equal-dimension representation when $L = d$, and a sparse representation when $L > d$. The output weights $\beta$ of the ELM-AE are calculated as

$$\beta = \left(\frac{I}{C} + H^T H\right)^{-1} H^T X. \tag{10}$$

Once $\beta$ is obtained, the ELM-AE computes the final abstract representation of the samples by Eq. (11):

$$X_{\mathrm{new}} = g\left(X\beta^T\right). \tag{11}$$
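As a rough illustration of Eqs. (7)-(11), the sketch below builds an ELM-AE with orthogonal random weights and returns the abstract representation. Whether an activation is applied in Eq. (11) varies across ML-ELM implementations, so the sigmoid used here is an assumption, and all names are illustrative.

```python
import numpy as np

def elm_ae(X, L=500, C=0.1, seed=0):
    """ELM-AE sketch: orthogonal random mapping (Eq. (9)), reconstruction of X (Eqs. (8), (10)),
    and the abstract representation of the samples (Eq. (11))."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W, _ = np.linalg.qr(rng.standard_normal((d, L)))   # orthonormal columns (assumes d >= L)
    b = rng.standard_normal(L)
    b /= np.linalg.norm(b)                             # unit-norm bias
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))             # hidden-layer output, Eq. (7)
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ X)   # Eq. (10): H beta ~ X
    X_new = 1.0 / (1.0 + np.exp(-(X @ beta.T)))        # abstract representation, Eq. (11)
    return beta, X_new
```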

Fisher extreme learning machine autoencoder
The unsupervised learning of the ELM-AE can learn the intrinsic structure of samples for feature extraction. However, for classification tasks, the ELM-AE fails to make full use of the available sample labels to guide feature extraction and enrich the category information contained in the features. To address this problem, the Fisher criterion is integrated into the ELM-AE, and a new regularized ELM-AE called the FELM-AE is proposed. The Fisher criterion aims to maximize the between-class distance and minimize the within-class distance of features. Based on the Fisher criterion, class labels are utilized in the training of the FELM-AE, so that the abstract features learned by the FELM-AE enhance the class separability of samples in the feature space and improve the performance of classifiers. Compared with the ELM-AE, the major innovation of the FELM-AE is the Fisher regularization on the output weights, which helps to find a discriminative feature space.
The network structure of the FELM-AE is similar to that of the ELM-AE, as shown in Fig. 2. According to the Fisher criterion, the between-class distance is quantified by the between-class scatter matrix, which measures the dispersion among different classes of samples, and the within-class distance is quantified by the within-class scatter matrix, which measures the dispersion within a specific class. In the FELM-AE, assume that the hidden-layer outputs are $\mathbf{h}_j^{(i)}$, $i = 1, \ldots, c$, $j = 1, \ldots, n_i$, where $c$ is the number of classes and $n_i$ is the number of samples of the $i$th class, and let $S_b$ and $S_w$ denote the between-class scatter matrix and the within-class scatter matrix, respectively. $S_b$ and $S_w$ are formulated as follows:

$$S_b = \sum_{i=1}^{c} n_i \left(\bar{\mathbf{h}}^{(i)} - \bar{\mathbf{h}}\right)\left(\bar{\mathbf{h}}^{(i)} - \bar{\mathbf{h}}\right)^T, \tag{12}$$

$$S_w = \sum_{i=1}^{c} \sum_{j=1}^{n_i} \left(\mathbf{h}_j^{(i)} - \bar{\mathbf{h}}^{(i)}\right)\left(\mathbf{h}_j^{(i)} - \bar{\mathbf{h}}^{(i)}\right)^T, \tag{13}$$

where $\bar{\mathbf{h}}^{(i)}$ is the mean output vector of the hidden nodes for the $i$th class and $\bar{\mathbf{h}}$ is the mean output vector over all samples; they are calculated as Eqs. (14) and (15), respectively:

$$\bar{\mathbf{h}}^{(i)} = \frac{1}{n_i} \sum_{j=1}^{n_i} \mathbf{h}_j^{(i)}, \tag{14}$$

$$\bar{\mathbf{h}} = \frac{1}{N} \sum_{i=1}^{c} \sum_{j=1}^{n_i} \mathbf{h}_j^{(i)}. \tag{15}$$

Using these distance criteria, the FELM-AE introduces the Fisher regularization matrix $S$ to balance the two criteria:

$$S = S_w - D S_b, \tag{16}$$

where $D$ is a diagonal matrix with identical diagonal elements and is usually set to the identity matrix. The Fisher criterion requires maximizing the between-class distance and minimizing the within-class distance, i.e., maximizing $S_b$ and minimizing $S_w$. Thus, the objective function of the FELM-AE is formulated as

$$L_{\mathrm{FELM\text{-}AE}} = \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\|H\beta - X\|^2 + \frac{\lambda}{2}\,\mathrm{tr}\!\left(\beta^T S \beta\right), \tag{17}$$

where $\lambda$ is the Fisher regularization parameter. The gradient of the objective function with respect to $\beta$ is derived as

$$\frac{\partial L_{\mathrm{FELM\text{-}AE}}}{\partial \beta} = \beta + C H^T (H\beta - X) + \lambda S \beta. \tag{18}$$

By setting the gradient to zero, the closed-form solution of $\beta$ can be obtained. Depending on the number of samples and the number of hidden nodes, $\beta$ is calculated in different but equivalent forms; when $N \geq L$,

$$\beta = \left(\frac{I + \lambda S}{C} + H^T H\right)^{-1} H^T X. \tag{19}$$

After obtaining $\beta$, the FELM-AE computes the abstract representation of the samples by Eq. (11).
Therefore, the feature extraction based on the FELM-AE can be summarized as Algorithm 1.
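Since Algorithm 1 itself is not reproduced here, the following NumPy sketch outlines the FELM-AE feature extraction under the reconstruction above. In particular, the combination $S = S_w - S_b$ (i.e., $D = I$), the solution form of Eq. (19), and the sigmoid in Eq. (11) are assumptions, and all names and default values are illustrative.

```python
import numpy as np

def felm_ae(X, y, L=500, C=1e5, lam=1e2, seed=0):
    """FELM-AE sketch (cf. Algorithm 1): ELM-AE plus the Fisher regularization of Eq. (17)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W, _ = np.linalg.qr(rng.standard_normal((d, L)))   # orthogonal input weights (assumes d >= L)
    b = rng.standard_normal(L)
    b /= np.linalg.norm(b)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))             # hidden-layer output

    # Between-class and within-class scatter matrices of the hidden outputs (Eqs. (12)-(15))
    h_bar = H.mean(axis=0)
    S_b = np.zeros((L, L))
    S_w = np.zeros((L, L))
    for cls in np.unique(y):
        H_c = H[y == cls]
        h_c = H_c.mean(axis=0)
        diff = (h_c - h_bar)[:, None]
        S_b += H_c.shape[0] * (diff @ diff.T)
        S_w += (H_c - h_c).T @ (H_c - h_c)
    S = S_w - S_b                                      # Fisher regularization matrix with D = I (Eq. (16))

    # Closed-form output weights, Eq. (19): ((I + lam*S)/C + H^T H)^{-1} H^T X
    beta = np.linalg.solve((np.eye(L) + lam * S) / C + H.T @ H, H.T @ X)
    X_new = 1.0 / (1.0 + np.exp(-(X @ beta.T)))        # abstract feature of the samples (Eq. (11))
    return beta, X_new
```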

Multilayer Fisher extreme learning machine
Similar to the ML-ELM, the ML-FELM is created by stacking the FELM-AE and adding an ELM to complete classification tasks. Therefore, the training process of the ML-FELM can be divided into two stages: feature extraction and classification. In the feature extraction stage, the ML-FELM extracts representative and discriminative features by stacking the FELM-AE to train the network parameters layer-by-layer. In the classification stage, the ML-FELM puts the extracted features into the ELM to implement classification, completing the mapping from features to class labels.
Suppose that the training samples are $\{X, T\} = \{(\mathbf{x}_j, \mathbf{t}_j)\}_{j=1}^{N}$, $L_i\ (i = 1, \ldots, k)$ is the number of hidden nodes in the $i$th FELM-AE, and $L$ is the number of hidden nodes in the ELM. The network structure of the ML-FELM is shown in Fig. 3.
In the feature extraction stage, the ML-FELM uses the FELM-AE to learn the abstract representation of the samples. The relationship between the outputs of successive hidden layers is formulated as

$$H_i = g\left(H_{i-1}\beta_i^T\right), \quad i = 1, \ldots, k, \quad H_0 = X,$$

where $H_i$ is the output of the $i$th hidden layer and $\beta_i$ is the output weight matrix of the $i$th FELM-AE. $H_k$ is the final abstract feature extracted by the stacked FELM-AE and is used as the input of the ELM to complete the training of the ML-FELM.
In the classification stage, the input of the ELM is $H_k$. Assume that the input weights of the hidden layer are $W$ and the biases are $\mathbf{b}$. The hidden-layer output $H_{\mathrm{ELM}}$ and the output weights $\beta$ are then calculated as Eqs. (3) and (6), respectively.
Thus, the training process of the ML-FELM is presented in Algorithm 2.
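In the same spirit, a minimal sketch of the two-stage training is given below; it reuses the hypothetical felm_ae and train_elm/predict_elm helpers from the earlier sketches and assumes T is a one-hot target matrix, so it is an illustration rather than a reproduction of Algorithm 2.

```python
import numpy as np

def train_ml_felm(X, y, T, hidden_sizes=(500, 500), L_elm=2000, C=1e5, lam=1e5):
    """ML-FELM sketch (cf. Algorithm 2): stacked FELM-AE feature extraction, then an ELM classifier."""
    H, betas = X, []
    for L in hidden_sizes:                       # feature extraction stage, layer by layer
        beta, H = felm_ae(H, y, L=L, C=C, lam=lam)
        betas.append(beta)
    elm_params = train_elm(H, T, L=L_elm, C=C)   # classification stage on the final feature H_k
    return betas, elm_params

def predict_ml_felm(X, betas, elm_params):
    H = X
    for beta in betas:                           # propagate through the stacked FELM-AE mappings
        H = 1.0 / (1.0 + np.exp(-(H @ beta.T)))
    return predict_elm(H, *elm_params)           # ELM outputs; argmax gives the predicted class
```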

Experiments
To verify the performance of the FELM-AE and ML-FELM, the following experiments were carried out:

Experiment 1: Visualization of the original data and abstract features. The original data and the abstract features extracted by the ELM-AE and FELM-AE are visualized to compare their class separability.

Experiment 2: Influence of the hyperparameters. The classification accuracy of the ML-FELM is observed under different Fisher regularization parameters and different numbers of hidden layers and hidden nodes, and the resulting performance change of the ML-FELM is analyzed.

Experiment 3: Performance comparison. The classification performance, computational cost, and robustness to noise of the ML-FELM are compared with those of the ELM, 1D-CNN, ML-ELM, D-ML-ELM, ML-GELM, and H-LR21-ELM over the benchmark datasets.

Data description
The empirical experiments are carried out over various benchmark datasets, including five image datasets. The image datasets are summarized in Table 1, and a brief description of each is as follows:

Convex: Convex is an artificial dataset for discriminating between convex and non-convex shapes. It consists of 8000 training images and 50,000 testing images, and the size of each image is 28 × 28 pixels in gray scale.
USPS: USPS is the US Postal handwritten digit dataset. The training set has 7291 images, and the testing set has 2007 images. Each image is 16 × 16 pixels in gray scale and belongs to the digits 0-9.
Rectangles: Rectangles is an artificial dataset to discriminate between wide and tall rectangles. It contains 1200 images for training and 50,000 images for testing. Each image is in gray scale and of size 28 × 28.
MNIST: MNIST is a commonly used handwritten dataset of digits 0-9. The training samples and testing samples are 60,000 images and 10,000 images with 28×28 pixels in gray scale, respectively.
Fashion-MNIST: Fashion-MNIST is a fashion products dataset with 10 categories. It consists of 60,000 training images and 10,000 testing images, and each image is 28 × 28 pixels in gray scale. Compared with MNIST, the images in Fashion-MNIST are more difficult to classify.

Implementation details
All experiments were carried out in MATLAB R2019b running on a laptop with a 3.2 GHz AMD 5800H CPU, 16 GB RAM, and a 1 TB hard disk. To ensure fairness, all algorithms adopt the same network structure, and each experiment is repeated 20 times to obtain the mean results. The trial-and-error method is used to determine the network structure, balancing generalization performance and computational complexity over different numbers of hidden layers and hidden nodes.
The implementation details for experiment 1 are as follows: the datasets are MNIST and Fashion-MNIST, the network structure is 784-500-784, the activation function is the sigmoid function, the regularization parameter of the ELM-AE is $C = 10^{-1}$, and the Fisher regularization parameter of the FELM-AE is $\lambda = 10^{2}$. The original data and the abstract features are both reduced in dimensionality by t-distributed stochastic neighbor embedding (t-SNE) for visualization.
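As a rough sketch of this visualization pipeline, the snippet below reuses the hypothetical felm_ae helper from the earlier sketch together with scikit-learn's t-SNE; the subsampling size, the 2-D embedding, and the FELM-AE parameter C are assumptions, since experiment 1 only states λ for the FELM-AE.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize_felm_ae_features(X, y, n_samples=2000, seed=0):
    """Extract 500-dimensional FELM-AE features (structure 784-500-784) and project them to 2-D with t-SNE."""
    idx = np.random.default_rng(seed).choice(len(X), n_samples, replace=False)
    _, feats = felm_ae(X[idx], y[idx], L=500, C=0.1, lam=1e2)     # lambda = 10^2 as in experiment 1
    emb = TSNE(n_components=2, init="pca", random_state=seed).fit_transform(feats)
    plt.scatter(emb[:, 0], emb[:, 1], c=y[idx], s=3, cmap="tab10")
    plt.title("t-SNE of FELM-AE abstract features")
    plt.show()
```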
The implementation details for experiment 2 are as follows: the dataset is MNIST and the activation function is the sigmoid function. When changing the regularization parameters, the network structure is 784-500-500-2000-10, the range of the Fisher regularization parameter is $\lambda \in \{10^{-5}, 10^{-4}, \ldots, 10^{4}, 10^{5}\}$, and the range of the parameter C is $C \in \{10^{-3}, 10^{-2}, \ldots, 10^{6}, 10^{7}\}$. When changing the number of hidden layers k, the range of k is $k \in \{1, 2, 3, 4, 5\}$, the number of hidden nodes in the extraction stage is 500, the number of hidden nodes in the classification stage is 2000, and the parameters are $\lambda = 10^{5}$ and $C = 10^{5}$. When changing the number of hidden nodes, the network structure is $784\text{-}L_1\text{-}L_1\text{-}L_2\text{-}10$, the range of $L_1$ is $L_1 \in \{100, 200, \ldots, 900, 1000\}$, the range of $L_2$ is $L_2 \in \{1000, 2000, \ldots, 9000, 10{,}000\}$, and the parameters are $\lambda = 10^{5}$ and $C = 10^{5}$.

The implementation details for experiment 3 are as follows: the network structure of each algorithm is shown in Table 2, the activation function is the sigmoid function, the Fisher regularization parameter is selected from $\lambda \in \{10^{-3}, 10^{-2}, \ldots, 10^{4}, 10^{5}\}$, and the parameter C is selected from $C \in \{10^{-2}, 10^{-1}, \ldots, 10^{7}, 10^{8}\}$. For the 1D-CNN, the numbers of convolutional layers and pooling layers are both 2, the size of the convolution kernel is 1 × 3, the number of convolution kernels is 4, max-pooling is used for down-sampling, the size of the pooling kernel is 1 × 2, the activation function is the ReLU function, the learning rate of the stochastic gradient descent method is 0.05, the batch size is 100, and the number of epochs is 100. The signal-to-noise ratio (SNR) is introduced to represent different levels of noise, and the range of the SNR is SNR ∈ {30, 25, 20, 15, 10}. In the performance comparison for multiclass and binary classification tasks, the evaluation metrics are accuracy, recall, G-Mean, and F1-Measure. In multiclass classification tasks, the recall, G-Mean, and F1-Measure are the averages over all classes.
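The paper does not state how the noise is generated; the sketch below assumes additive zero-mean Gaussian noise scaled so that the signal-to-noise ratio, taken here in decibels, matches the target SNR value.

```python
import numpy as np

def add_noise_at_snr(X, snr_db, seed=0):
    """Add zero-mean Gaussian noise so that 10*log10(signal_power / noise_power) = snr_db."""
    rng = np.random.default_rng(seed)
    signal_power = np.mean(X ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return X + rng.normal(0.0, np.sqrt(noise_power), size=X.shape)

# Noisy training sets for the robustness experiment, e.g.:
# noisy_train = {snr: add_noise_at_snr(X_train, snr) for snr in (30, 25, 20, 15, 10)}
```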

Visualization of the original data and abstract feature
Visualization is a direct way to observe the distribution of abstract features and to evaluate the effectiveness of feature extraction. To verify that the Fisher criterion can improve the class separability of abstract features and enrich the category information, the original data and the abstract features extracted by the ELM-AE and FELM-AE are visualized in Figs. 4 and 5. As shown in Figs. 4 and 5, compared with the original data and the abstract features extracted by the ELM-AE, the between-class distance of the abstract features extracted by the FELM-AE is larger and the within-class distance is smaller, which indicates that the abstract features extracted by the FELM-AE are more discriminative. This result shows that introducing the Fisher criterion into the FELM-AE is an effective way to enhance the class separability of abstract features. The reason is that adding the Fisher regularization term to the objective function forces the network to balance the reconstruction error and the Fisher regularization loss, so that the output weights reduce both the reconstruction error and the Fisher regularization loss. Thus, the between-class distance of the abstract features is increased and the within-class distance is decreased. In contrast, the ELM-AE introduces the L2-norm regularization term on the output weights to prevent over-fitting rather than to improve the class separability of the features.

Influence of the hyperparameters
The regularization parameters and the numbers of hidden layers and hidden nodes are important hyperparameters of the ML-FELM. The regularization parameters control the effectiveness of regularization, and the numbers of hidden layers and hidden nodes control the generalization ability of the ML-FELM. To clarify the influence of the hyperparameters on the ML-FELM, different regularization parameters and different numbers of hidden layers and hidden nodes are used to observe the change of generalization performance. With variable regularization parameters, the classification accuracy of the ML-FELM is shown in Fig. 6. With a variable number of hidden layers, the classification accuracy of the ML-FELM is shown in Fig. 7. With a variable number of hidden nodes, the classification accuracy and training time of the ML-FELM are shown in Fig. 8.
As shown in Fig. 6, properly increasing the regularization parameters λ and C can improve the generalization performance of the ML-FELM. When λ is fixed, the classification accuracy increases with the increase of C. When $C \leq 10^{2}$, the classification accuracy first increases and then decreases with the increase of λ. When $C > 10^{2}$, the classification accuracy increases with the increase of λ. The reason is that increasing the Fisher regularization parameter λ can enhance the class separability of the extracted features and make them easier to classify, but an excessively large λ causes the FELM-AE to ignore the reconstruction error, leading to under-fitting. The parameter C controls the empirical risk of the ELM. Increasing C drives the ELM to further reduce the reconstruction error and improve the classification performance, but in practical applications an excessively large C is likely to lead to over-fitting. Therefore, proper regularization parameters λ and C need to be chosen according to the specific samples.
As shown in Fig. 7, the classification accuracy of the ML-FELM increases with the number of hidden layers. This indicates that a proper number of hidden layers can improve the approximation capability of the ML-FELM. When k = 1, the ML-FELM does not adopt the FELM-AE to extract features and its performance is not competitive. As hidden layers are added, the FELM-AE is used for representation learning, and the abstract features learned by the stacked FELM-AE become more discriminative and representative. Such features enhance the classification capability of the ELM and the overall performance of the ML-FELM.
As shown in Fig. 8, the generalization performance and training time of the ML-FELM both increase as the numbers of hidden nodes of the FELM-AE and ELM increase. This is because the ELM has the universal approximation capability. With more hidden nodes, the performance of the FELM-AE and ELM both gradually improves, making the FELM-AE extract more representative features and the ELM more capable of approximating nonlinear mappings. Moreover, when the number of samples is not less than the number of hidden nodes, i.e., N ≥ L, the computational complexity of the FELM-AE and ELM is $O(NL^2)$, and when N < L, it is $O(N^2L)$. Therefore, as the number of hidden nodes increases, the computational complexity and training time of the ML-FELM both increase.

Performance comparison
To test the comprehensive performance of the ML-FELM, all benchmark datasets are used in the comparison and analysis of the ELM, 1D-CNN, ML-ELM, D-ML-ELM, ML-GELM, H-LR21-ELM, and ML-FELM. First, different evaluation metrics are recorded to evaluate the classification performance and computational complexity of each algorithm. Then, by adding the different levels of noise to training samples, the change of classification accuracy is recorded to evaluate the robustness of each algorithm.
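For reference, one way to compute the reported metrics is sketched below; G-Mean is taken here as the geometric mean of the per-class recalls, which is a common multiclass definition but is an assumption about the paper's exact formula.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged recall, G-Mean, and F1-Measure."""
    per_class_recall = recall_score(y_true, y_pred, average=None)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "g_mean": float(np.prod(per_class_recall) ** (1.0 / len(per_class_recall))),
        "f1": f1_score(y_true, y_pred, average="macro"),
    }
```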

Classification results over benchmark datasets
Since the number of hidden nodes has a significant effect on the generalization performance, the same network structure is required for all algorithms to ensure a fair comparison, and variable regularization parameters are selected to find the best results. The classification performance of each algorithm for multiclass and binary classification tasks is shown in Tables 3 and 4, respectively. The training time and testing time of each algorithm are shown in Tables 5 and 6, respectively. The '-' indicates that the results could not be obtained, and the numbers in boldface represent the best results. As shown in Tables 3 and 4, the classification performance of the ML-FELM is better than that of the ELM, 1D-CNN, ML-ELM, D-ML-ELM, ML-GELM, and H-LR21-ELM on most of the benchmark datasets, and the generalization improvement is about 0.5-3%. The reason is that the FELM-AE in the ML-FELM introduces the Fisher criterion and uses the category information to guide feature extraction, which makes the extracted features more discriminative and the classification boundaries of the classifier more evident. Moreover, the balance of the Fisher regularization loss and the empirical error prevents over-fitting of the ML-FELM and improves the generalization performance. None of the algorithms considers differences in the importance of labels; thus, the algorithms with better classification performance tend to have higher accuracy, recall, G-Mean, and F1-Measure. In addition, the ML-GELM needs to calculate the similarity matrix and Laplacian matrix of the samples, which causes a memory overflow problem; thus, on large-scale datasets such as Drive diagnosis, Internet firewall, MNIST, and Fashion-MNIST, the ML-GELM does not obtain effective results.
As shown in Tables 5 and 6, the training time and testing time of each algorithm are recorded for comparison. To further evaluate robustness, different levels of noise are added to the training samples; the classification accuracy under each noise level is listed in Table 7, and the accuracy change curves are shown in Fig. 9.
As shown in Table 7 and Fig. 9, the classification accuracy of the ML-FELM is higher than that of the ML-ELM, D-ML-ELM, and ML-GELM under different SNRs, and the performance improvement increases as the SNR decreases. With the decrease of SNR, the noise in the datasets is gradually enhanced, and the generalization performance of each algorithm is affected. However, the performance of the ML-FELM and ML-GELM declines more slowly than that of the ML-ELM and D-ML-ELM. This is because, by introducing the Fisher regularization, the ML-FELM uses the class labels, which are not affected by noise, to guide feature extraction. Maximizing the between-class distance and minimizing the within-class distance increases the margin among classification boundaries and weakens the influence of noise. In addition, the ML-GELM enhances its robustness by introducing manifold regularization and learning the intrinsic manifold structure of the data.

Conclusion
To enhance the class separability of the features extracted by the ELM-AE and improve the generalization performance of the ML-ELM for classification tasks, this study integrates the Fisher criterion into the ELM-AE and proposes the FELM-AE and the ML-FELM. The FELM-AE uses class labels to guide feature extraction and enhances the class separability of features by adding a Fisher regularization term to the loss function. The ML-FELM stacks the FELM-AE to extract features and adopts the ELM to classify. The experimental results on various benchmark datasets show that the abstract features extracted by the FELM-AE are more discriminative than those of the ELM-AE, and the classification accuracy of the ML-FELM is higher than that of the ML-ELM, 1D-CNN, D-ML-ELM, ML-GELM, and H-LR21-ELM. However, due to the inefficiency of the trial-and-error method, it is difficult for the ML-FELM to choose proper Fisher regularization parameters on different datasets. Thus, the automatic setting of the regularization parameters is an important issue. One feasible method is to replace the trial-and-error method with particle swarm optimization (PSO) [33] or a genetic algorithm (GA) [34]. The PSO and GA are both intelligent search algorithms. By taking the regularization parameters as the optimization variables and the generalization performance as the evaluation criterion, they can automatically search for the optimal regularization parameters of the FELM-AE and ML-FELM. Moreover, all datasets used in the experiments are complete and balanced, which is often not the case in practical applications. Thus, further work needs to analyze the influence of incomplete and unbalanced datasets on the performance of the ML-ELM and find improvement methods. Combining the ML-ELM with data augmentation methods [35] or semi-supervised learning [36] is a worthwhile attempt.