1 Introduction

Neural networks, a term that usually refers to artificial neural networks, were first proposed in the last century and have a history of more than 70 years [114]. They were inspired by biological neural networks, and their basic structure mimics the nervous system of the human brain. As a machine learning method, a typical neural network consists of neurons, connections, and weights. A neuron, also called a perceptron [152], receives input signals from preceding neurons and sends output signals to subsequent neurons. Neurons are usually organized in layers. A connection is a relation between neurons of different layers, and a neuron may have many connections to neurons in its preceding and subsequent layers. Each connection carries an important parameter called a weight. It is the weights that decide how a neuron processes its input signals to produce output signals.

To train the network is to determine the weights in its layers that make it best approximate the data. Traditional training algorithms are often gradient based, like back propagation (BP) [191]. BP has been widely used in many domains because it is easy to understand and simple to implement. However, gradient-based algorithms do not always find the globally best solution. Their solutions depend on the parameter initialization and the complexity of the feature space, so they may converge to local extrema. Extreme learning machine (ELM) was proposed by Guang-Bin and Qin-Yu [41] to train single-hidden layer feedforward networks (SLFNs). Unlike gradient-based methods, ELM assigns random values to the weights between the input and hidden layers and to the biases of the hidden layer, and these parameters are frozen during training. The nonlinear activation functions in the hidden layer provide nonlinearity for the system. Because the hidden layer is fixed, the rest of the network can be regarded as a linear system, and the only parameters that need to be learned are the weights between the hidden layer and the output layer. Hence, ELM trains much faster than traditional algorithms because it learns without iteration. Meanwhile, random hidden nodes preserve the universal approximation ability. Theoretical analysis showed that ELM with random parameters is more likely to reach the globally optimal solution than traditional networks in which all parameters are trained [81]. Compared with support vector machine (SVM) [23], ELM tends to yield better classification performance with fewer optimization constraints [55]. Due to its superior training speed and good generalization capability [82], ELM is widely applied in a variety of learning problems, such as classification, regression, clustering, and feature mapping. ELM has continued to evolve, and many variants have been proposed to further improve its stability and generalization for specific applications [29, 61, 86, 223].

Over the past decade, extensive research on ELM has been carried out for three purposes: less manual intervention, higher classification accuracy, and less training time. Theoretical analysis has also been done to investigate the generalization and global approximation ability of ELM [55].

In this study, we provide a comprehensive review of the development of ELM, including its theoretical analysis, variants, recent advances, and real applications. The rest of this paper is organized as follows. Section 2 gives the detailed theoretical analysis of basic ELM. Section 3 presents recent advances in ELM for classification, regression, and clustering. Section 4 compares ELM with other machine learning algorithms. Section 5 summarizes the machine learning tasks addressed by ELM, including recognition, prediction, representation, clustering, and surrogate modeling. Section 6 summarizes the application fields in which ELM has been adopted, such as medical analysis, chemistry, transportation, economy, robotics, management, geography, and food safety. Finally, Section 8 concludes the review.

2 Theoretical fundamentals

In this part, we present the theoretical analysis of basic ELM. ELM was invented to train SLFNs, the most widely used artificial neural network structure. A conventional SLFN consists of three layers: an input layer, a hidden layer, and an output layer, as shown in Fig. 1. The notations are given in Table 1. x and o denote the input and output vectors. w and b represent the weights from the input to the hidden layer and the biases of the hidden layer. β denotes the output weights. Training the network means determining the values of these parameters that reach the optimal solution.

Fig. 1 Structure of SLFN

Table 1 Notations

2.1 SLFN training

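In this section, we briefly introduce the training problem for SLFN. Given a training set \( S=\left\{\left({\boldsymbol{x}}_i,{\boldsymbol{t}}_i\right)\mid {\boldsymbol{x}}_i={\left({x}_{i1},{x}_{i2},\dots, {x}_{in}\right)}^T\in {\mathbb{R}}^n,{\boldsymbol{t}}_i={\left({t}_{i1},{t}_{i2},\dots, {t}_{im}\right)}^T\in {\mathbb{R}}^m\right\} \), where \( {\boldsymbol{x}}_i \) denotes the input value and \( {\boldsymbol{t}}_i \) represents the target, the output o of an ELM with \( \hat{N} \) hidden neurons can be expressed as: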

$$ \sum \limits_{i=1}^{\hat{N}}{\boldsymbol{\beta}}_{\boldsymbol{i}}g\left({\boldsymbol{w}}_i{\boldsymbol{x}}_j+{b}_i\right)={\boldsymbol{o}}_j,j=1,\dots, N $$

where g(x) is the activation function of the hidden layer. In ELM, the activation functions are nonlinear so that they provide a nonlinear mapping for the system. Table 2 lists several widely used activation functions.

Table 2 Activation functions in ELM
Table 3 Training of ELM

The goal of training is to minimize the error between the target and the output of ELM. The most commonly used objective function is the mean squared error (MSE):

$$ \mathrm{MSE}=\sum \limits_{i=1}^N{\left({t}_{ij}-{o}_{ij}\right)}^2,j=1,\dots, m $$

where N is the number of training samples, and i and j are the indexes of the training sample and the output layer node, respectively. It can be proved that an SLFN is able to approximate all the training samples when the number of hidden nodes \( \hat{N} \) approaches infinity,

$$ \sum \limits_{j=1}^N\left|\left|{\boldsymbol{o}}_j-{\boldsymbol{t}}_j\right|\right|=0 $$

which is called the universal approximation capability, so there must be a set of wi, bi and βi that satisfy:

$$ \sum \limits_{i=1}^{\hat{N}}{\boldsymbol{\beta}}_{\boldsymbol{i}}g\left({\boldsymbol{w}}_i{\boldsymbol{x}}_j+{b}_i\right)={\boldsymbol{t}}_j,j=1,\dots, N $$

The formula above can be abbreviated as

$$ \boldsymbol{H}\boldsymbol{\beta } =\boldsymbol{T} $$


$$ {\displaystyle \begin{array}{c}\boldsymbol{H}\left({\boldsymbol{w}}_1,\dots, {\boldsymbol{w}}_{\hat{N}},{b}_1,\dots, {b}_{\hat{N}},{\boldsymbol{x}}_1,\dots, {\boldsymbol{x}}_N\right)=\\ {}\left[\begin{array}{ccc}g\left({\boldsymbol{w}}_1{\boldsymbol{x}}_1+{b}_1\right)& \cdots & g\left({\boldsymbol{w}}_{\hat{N}}{\boldsymbol{x}}_1+{b}_{\hat{N}}\right)\\ {}\vdots & \ddots & \vdots \\ {}g\left({\boldsymbol{w}}_1{\boldsymbol{x}}_N+{b}_1\right)& \cdots & g\left({\boldsymbol{w}}_{\hat{N}}{\boldsymbol{x}}_N+{b}_{\hat{N}}\right)\end{array}\right]\end{array}} $$
$$ \boldsymbol{\beta} ={\left[\begin{array}{c}{\boldsymbol{\beta}}_{\mathbf{1}}^{\boldsymbol{T}}\\ {}\mathbf{\vdots}\\ {}{\boldsymbol{\beta}}_{\hat{\boldsymbol{N}}}^{\boldsymbol{T}}\end{array}\right]}_{\hat{N}\times m},\boldsymbol{T}={\left[\begin{array}{c}{\boldsymbol{t}}_1^T\\ {}\vdots \\ {}{\boldsymbol{t}}_N^T\end{array}\right]}_{N\times m} $$

Training the SLFN therefore amounts to finding the best wi, bi and βi.

2.2 Principles of ELM

The basic training of ELM can be regarded as two steps: random initialization and linear parameter solution. First, ELM assigns random values to the hidden layer parameters wi and bi, and they are frozen during the whole training process. The input vector is thus mapped into a random feature space by the random parameters and the nonlinear activation functions, which is more efficient than a mapping with trained parameters. With nonlinear piecewise continuous activation functions, ELM has the universal approximation capability [41, 58]. In the second step, because Hβ = T is a linear problem, βi can be obtained with the Moore-Penrose generalized inverse of H. The training of ELM is summarized in Table 3.

ELM can yield good generalization ability without iteratively tuned hidden parameters. The theory of ELM was proved rigorously by Guang-Bin and Qin-Yu [41], and detailed information can be found in the literature [59]. ELM can approximate arbitrarily complex classification boundaries given sufficient hidden nodes.
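The two training steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `elm_train` and `elm_predict`, the tanh activation, the uniform range of the random parameters, and the toy regression data are all assumptions made for the example.

```python
import numpy as np

def elm_train(X, T, n_hidden, rng):
    """Train a basic ELM. X: (N, n) inputs, T: (N, m) targets."""
    n_features = X.shape[1]
    # Step 1: random hidden parameters, frozen after this point (assumed range).
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    # Hidden layer output matrix H, one row per training sample.
    H = np.tanh(X @ W + b)
    # Step 2: solve H beta = T with the Moore-Penrose pseudoinverse.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Apply the fixed random mapping, then the learned output weights."""
    return np.tanh(X @ W + b) @ beta

# Toy regression problem (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
T = np.sin(3.0 * X[:, :1]) + X[:, 1:] ** 2
W, b, beta = elm_train(X, T, n_hidden=50, rng=rng)
train_mse = float(np.mean((elm_predict(X, W, b, beta) - T) ** 2))
```

Note that no loop over training epochs appears: the only learned quantity, `beta`, comes from a single linear solve, which is where the speed advantage discussed above originates.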

3 Variants of ELM

This section presents recent advances in ELM variants. Various algorithms have been used to improve the performance of ELM in real applications.

3.1 Robustness improvement

Random hidden nodes enable ELM to train much faster but may also cause fluctuation in classification performance. The calculation of the output weight βi is another significant factor in ELM performance. Starting from these two points, researchers have devoted considerable effort to improving the robustness and stability of ELM.

Huang and Chen [54] proposed incremental ELM (I-ELM), in which the hidden nodes are added incrementally. With nonconstant piecewise continuous activation functions, I-ELM has universal approximation capability. Later, they improved the learning algorithm for I-ELM [53]: convergence can be accelerated by using convex optimization to recalculate the output weights of the existing nodes. Xu and Yao [202] proposed an improved version of I-ELM called incremental regularized extreme learning machine (IR-ELM). The hidden nodes are added one by one, and the output weights can be recursively updated in an efficient way. With an incremental learning scheme, ELM can automatically adjust the number of hidden nodes to obtain better regression and classification performance than ELM with random initialization.
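The idea of adding hidden nodes one at a time can be sketched as follows. The sketch assumes a single output and uses the commonly cited I-ELM update, in which each new random node receives the output weight that best fits the current residual, β = ⟨e, h⟩ / ⟨h, h⟩, while existing weights stay untouched. The names `ielm_fit` and `ielm_predict`, the sigmoid activation, and the toy data are hypothetical; the convex recalculation of [53] and the regularized update of IR-ELM are not reproduced.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ielm_fit(X, t, max_nodes, rng):
    """Add random hidden nodes one by one; t is a 1-D target vector."""
    residual = t.astype(float).copy()
    nodes = []
    sq_errors = [float(residual @ residual)]
    for _ in range(max_nodes):
        # Draw the parameters of the new node at random (assumed range).
        w = rng.uniform(-1.0, 1.0, size=X.shape[1])
        b = rng.uniform(-1.0, 1.0)
        h = sigmoid(X @ w + b)
        # Output weight of the new node: projection of the residual onto h.
        beta = float(residual @ h) / float(h @ h)
        residual = residual - beta * h
        nodes.append((w, b, beta))
        sq_errors.append(float(residual @ residual))
    return nodes, sq_errors

def ielm_predict(X, nodes):
    return sum(beta * sigmoid(X @ w + b) for w, b, beta in nodes)

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(100, 3))
t = np.sin(X[:, 0]) + 0.5 * X[:, 1]
nodes, sq_errors = ielm_fit(X, t, max_nodes=30, rng=rng)
```

Because each step subtracts the projection of the residual onto the new node's output, the squared training error recorded in `sq_errors` can never increase as nodes are added.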

Huang and Ding [55] compared ELM with support vector machine (SVM) and found that ELM was equivalent to SVM in terms of classification but was more likely to reach better generalization performance. Liu and Wang [97] found that ELM may suffer from overfitting because it is trained by minimizing the error on the training set. They therefore introduced ensemble learning and cross validation to ELM and proposed ensemble ELM (EN-ELM). In k-fold cross validation, the training set was divided into k groups. In every iteration, k ELMs were trained on k-1 groups and validated on the remaining group. In the testing phase, the prediction was obtained by majority voting. Experimental results suggested that overfitting was alleviated and that EN-ELM reached better generalization. Soria-Olivas and Gomez-Sanchis [167] combined a Bayesian approach with ELM. By using prior knowledge, the confidence intervals can be calculated efficiently. The proposed Bayesian ELM outperformed classical ELM in six different regression tasks in their experiment.

Wang and Cao [184] noticed that ELM sometimes suffers from low effectiveness because the hidden layer output matrix H does not have full column rank. They proposed to select the input weights and biases instead of using random parameters. In this way, a singular H can be avoided, which accelerates the computation of the generalized inverse. Experiments on benchmark data suggested that the proposed effective ELM (EELM) achieved better classification results with less training time. Cao and Lin [10] employed a self-adaptive differential evolution algorithm to train the hidden node parameters in ELM. With a strategy pool, the potential solutions can evolve from previous experience into better ones. In comparison with several other evolutionary ELMs, the proposed self-adaptive differential evolution ELM was better in classification. Zhang and Qu [225] proposed a classification scheme based on evaluation ELM and a weighted nearest-neighbor equality algorithm. The significance of the features was measured by the weights of the ELM input layer for feature selection, and the weighted nearest-neighbor equality algorithm served as the classifier. Cao and Lin [11] introduced a voting mechanism to improve the robustness of ELM. They trained several individual ELMs with the same structure, and the final classification result was obtained by voting on the results of these ELMs.

The choice of kernel is an important part of kernel ELM. Traditionally, the kernel is chosen empirically, but such a choice may not always be suitable. Liu and Wang [103] proposed multiple kernel extreme learning machines (MK-ELM) to solve this problem. MK-ELM optimizes both the weights of the kernel combination and the network weights. They also introduced norm constraints and the minimum enclosing ball to the learning framework. Three different optimization algorithms were employed in the experiment. The results suggested that the proposed learning framework was comparable to state-of-the-art methods with less training time. Deng and Zheng [28] proposed the reduced kernel extreme learning machine, which aims to greatly decrease time consumption without degrading performance compared to the normal kernel extreme learning machine. Their method replaced the original kernel matrix K(X, X) with a reduced kernel matrix \( K\left(X,\overset{\sim }{X}\right) \), where \( \overset{\sim }{X} \) is a subset chosen randomly from the given dataset [26]. After demonstrating its efficiency, Deng and Zheng [27] also applied this method to the cross-person activity recognition task with success.
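The voting mechanism attributed to Cao and Lin [11] above, in which several ELMs of the same structure are trained independently and their predictions are combined by majority vote, can be illustrated with the sketch below. It is an assumption-laden toy and not the procedure of [11]: the helper names `train_elm` and `vote_predict`, the tanh activation, the ensemble size of seven, and the two-blob dataset are all hypothetical.

```python
import numpy as np

def train_elm(X, Y, n_hidden, rng):
    """One basic ELM with one-hot targets Y of shape (N, n_classes)."""
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    beta = np.linalg.pinv(np.tanh(X @ W + b)) @ Y
    return W, b, beta

def vote_predict(models, X, n_classes):
    """Majority vote over the class predictions of all ELMs."""
    votes = np.zeros((X.shape[0], n_classes), dtype=int)
    rows = np.arange(X.shape[0])
    for W, b, beta in models:
        winner = np.argmax(np.tanh(X @ W + b) @ beta, axis=1)
        votes[rows, winner] += 1          # one vote per model and sample
    return votes.argmax(axis=1), votes

# Two well-separated Gaussian blobs (hypothetical data).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
               rng.normal(2.0, 0.5, size=(50, 2))])
labels = np.repeat([0, 1], 50)
Y = np.eye(2)[labels]

# Same structure for every ELM, different random hidden parameters.
models = [train_elm(X, Y, n_hidden=10, rng=rng) for _ in range(7)]
predicted, votes = vote_predict(models, X, n_classes=2)
accuracy = float(np.mean(predicted == labels))
```

Using an odd number of models avoids ties in the two-class case; with more classes a tie-breaking rule would be needed, and `argmax` here simply picks the lowest class index.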

3.2 Parameter optimization

Basic ELM combines random initialization with a pseudoinverse solution for the output weights, and its performance and robustness can be further boosted. For regression problems, Chen and Lv [14] put forward a new method to alleviate the adverse effect of outliers on ELM. They proposed a robust regularized ELM using iteratively reweighted least squares (RELM-IRLS). They further compared the generalization performance of different loss functions and regularization terms. Zhang and Yang [226] proposed a hybrid learning algorithm named robust AdaBoost.RT based ensemble ELM (RAE-ELM). In training, RAE-ELM assigned a threshold to each weak learner, and the final output of the system was calculated by a weighted ensemble. The proposed method outperformed several state-of-the-art methods on real world datasets in their experiment.

Some researchers suggested using swarm intelligence to optimize the randomly initialized weights and biases in ELM. The optimization processes of these methods can be summarized in the flowchart in Fig. 2. Every particle in the population carries a potential solution to the optimization problem, which is updated over the iterations. For ELM optimization, the potential solution is the set of hidden weights and biases, and the fitness function is the training error, which can be expressed as

$$ error=\sum \limits_{i=1}^N{\left\Vert {\boldsymbol{t}}_i-{\boldsymbol{o}}_i\right\Vert}^2 $$

where ti and oi stand for the ground truth label and the ELM output of the ith training sample.
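To make the role of the fitness function concrete, the sketch below decodes a particle into hidden weights and biases, solves the output weights by pseudoinverse, and returns the training error defined above. Plain particle swarm optimization is used as a stand-in for the swarm methods reviewed in this section; the inertia and acceleration constants (0.7, 1.5, 1.5), the function names `fitness` and `pso_elm`, and the toy data are assumptions, not settings taken from any cited work.

```python
import numpy as np

def fitness(particle, X, T, n_hidden):
    """Training error of the ELM whose hidden parameters are `particle`."""
    n_features = X.shape[1]
    W = particle[: n_features * n_hidden].reshape(n_features, n_hidden)
    b = particle[n_features * n_hidden:]
    H = np.tanh(X @ W + b)
    beta = np.linalg.pinv(H) @ T          # output weights still solved linearly
    return float(np.sum((T - H @ beta) ** 2))

def pso_elm(X, T, n_hidden, n_particles=10, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    dim = X.shape[1] * n_hidden + n_hidden
    pos = rng.uniform(-1.0, 1.0, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p, X, T, n_hidden) for p in pos])
    gbest = pbest[pbest_fit.argmin()].copy()
    history = [float(pbest_fit.min())]
    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # Assumed PSO constants: inertia 0.7, cognitive and social terms 1.5.
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = pos + vel
        fit = np.array([fitness(p, X, T, n_hidden) for p in pos])
        improved = fit < pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]
        gbest = pbest[pbest_fit.argmin()].copy()
        history.append(float(pbest_fit.min()))
    return gbest, history

rng = np.random.default_rng(4)
X = rng.uniform(-1.0, 1.0, size=(60, 2))
T = (np.sin(2.0 * X[:, 0]) * X[:, 1]).reshape(-1, 1)
best, history = pso_elm(X, T, n_hidden=5)
```

The swarm only searches over the hidden layer parameters; every fitness evaluation still computes the output weights in closed form, so the cost of the search grows with the number of particles and iterations.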

Fig. 2 ELM optimization using swarm intelligence methods

Zhang and Li [227] proposed a new method that uses the firefly algorithm to optimize the parameters in the hidden layer. The firefly algorithm employs a set of fireflies to search the solution space. Initially, every firefly carries a potential solution, and the fireflies move with a strategy inspired by the movement of firefly swarms. Zhang and Wu [233] proposed a Memetic Algorithm (MA)-based Extreme Learning Machine (M-ELM), which searches for the globally optimal solution for the hidden layer parameters. The memetic algorithm employed a set of individuals to search the solution space, and the solutions were encoded as the genes of the individuals. Through mutation, crossover and selection, the genes were updated from generation to generation until the search stopped at the global best. Nayak and Dash [120] employed the Improved Whale Optimization Algorithm (IWOA), a swarm optimization algorithm inspired by the behavior of whales, to optimize the parameters of the hidden layer in ELM. The objective function consisted of the root mean squared error and the output weight norm. Wu and Yao [194] utilized dolphin swarm optimization to train the parameters in ELM. Wang and Chen [181] trained the input weight and bias of kernel ELM (KELM) by grey wolf optimization (GWO), a swarm optimization algorithm that mimics the hunting behavior of wolves. A rigorous comparison was made between KELMs trained by GWO, genetic algorithm, particle swarm optimization, and grid search.

Zhang and Wu [234] proposed instance clone ELM (IC-ELM) to prevent overfitting on small datasets. For each testing sample, IC-ELM finds the k training samples that are nearest to it and adds weighted clones of them to the training set before training. The proposed IC-ELM outperformed traditional ELM and several other variants.

Zheng and Wang [239] proposed a multi-layer ELM for bearing fault detection. They developed a novel ant lion algorithm (NALO) to optimize the input weight and bias. The NALO can be regarded as an improved form of genetic algorithm with better searching ability. The classification performance of their approach was better than that of five other algorithms.

Missing data, which is common in real world applications, is another issue related to robustness. Gao and Liu [36] put forward a sample-based ELM (S-ELM) for learning with missing data. The traditional ELM trains the network to obtain the maximum margin and the minimum norm of the weights. However, it is hard to estimate the distance between a sample with missing data and the separating hyperplane because calculating the distance in the full feature space is unsuitable. They therefore proposed to compute the distance in the sample-relevant feature space. The proposed S-ELM achieved good classification performance on three different datasets.

3.3 Structure modification

Researchers have also explored modifying the structure of ELM to improve its performance or to adapt it to specific tasks. Bai and Kasun [4] added local receptive fields to ELM and proposed ELM-LRF for object recognition. In this framework, the input was the raw image samples. Various local receptive fields were used for feature extraction. Then, a pooling layer with tuning-free nodes was employed to extract a comprehensive representation. Finally, the representations were fed into an output layer for classification, whose weights were calculated by a least squares solution. Experiments on NORB showed that ELM-LRF achieved better classification accuracy than deep belief network (DBN) and convolutional neural network (CNN) with much less training time. Deng and Bai [29] combined singular value decomposition (SVD) with ELM for large scale data analysis. The SVD nodes were arranged in the hidden layer to capture the innate characteristics of the input and reduce the dimension of the data. A fast approximation method was utilized in SVD to improve computational efficiency. Experimental results demonstrated that the proposed approach was superior to several state-of-the-art algorithms. Ouyang and Cheng [127] proposed a hybrid system based on fuzzy theory which included SVD and ELM for online learning. Instead of the conventional Moore-Penrose inverse, they used a recursive SVD based optimization algorithm to calculate the output weight. Compared to traditional applications of fuzzy system theory [67], their combined approach showed better results. Geng and Dong [38] replaced the hidden layer with a Central Neural System (CNS) and proposed the Self-Organizing Extreme Learning Machine (SOELM). The training of SOELM consists of the following three steps. First, it tunes the input weight iteratively using the Hebbian learning rule. Then, mutual information is employed to adjust the structure of the network in a self-organizing manner. Finally, the output weight is obtained by the ELM algorithm. The proposed SOELM was applied to predicting complicated chemical processes and yielded better results than ELM. Wu and Qu [193] put forward a novel learning framework for multi-layer ELM. They implemented hybrid hidden node optimization with the ant clone algorithm and multiple GWO. Multiple hidden layer output matrices were leveraged to generate the optimal structure.

3.4 ELM for online learning

In real world applications, data come in various forms, so ELM needs to be modified to learn from them effectively. For instance, we may not have access to the whole dataset at once because the dataset itself is growing, with new samples added from time to time. A straightforward approach is to re-train the ELM every time the set grows. However, the new samples often account for only a small part of the data, so it is inefficient to re-train the network on the whole dataset again and again. To handle this issue, Huang and Liang [56] proposed online sequential ELM (OS-ELM). The basic idea of OS-ELM is to avoid re-training on old samples by using a sequential algorithm. After initialization, OS-ELM adjusts its parameters on new samples sequentially. In this way, OS-ELM can be trained sample by sample or block by block. The detailed training process is given below in Table 4:

Table 4 Training of OS-ELM
$$ {\displaystyle \begin{array}{c}{\beta}_0={M}_0{H}_0^T{T}_0\\ {}\mathrm{where}\ {M}_0={\left({H}_0^T{H}_0\right)}^{-1}\end{array}} $$
$$ {\displaystyle \begin{array}{c}{\beta}_{\left(t+1\right)}={\beta}_{(t)}+{M}_{t+1}{h}_{t+1}\left({\boldsymbol{t}}_{t+1}^T-{h}_{t+1}^T{\beta}_{(t)}\right)\\ {}\mathrm{where}\ {M}_{t+1}={M}_t-\frac{M_t{h}_{t+1}{h}_{t+1}^T{M}_t}{1+{h}_{t+1}^T{M}_t{h}_{t+1}}\end{array}} $$
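The recursions in Table 4 can be checked numerically with the sketch below, which processes the samples one by one and compares the result with a batch least-squares fit on all data. The function names `oselm_init` and `oselm_update`, the sigmoid activation, the Gaussian random parameters, and the synthetic data are assumptions made for illustration. The initialization requires the initial block to contain at least as many samples as hidden nodes so that \( {H}_0^T{H}_0 \) is invertible.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def oselm_init(H0, T0):
    """Initial phase: M0 = (H0^T H0)^-1 and beta0 = M0 H0^T T0."""
    M = np.linalg.inv(H0.T @ H0)
    return M, M @ H0.T @ T0

def oselm_update(M, beta, h, t):
    """Sequential phase for one new sample with hidden output h and target t."""
    Mh = M @ h
    M_new = M - np.outer(Mh, h @ M) / (1.0 + h @ Mh)
    beta_new = beta + np.outer(M_new @ h, t - h @ beta)
    return M_new, beta_new

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
T = np.column_stack([np.sin(X[:, 0]), X[:, 1] * X[:, 2]])

# Fixed random hidden layer with 10 nodes (hypothetical sizes).
W = rng.normal(size=(5, 10))
b = rng.normal(size=10)
H = sigmoid(X @ W + b)

# Initialize on the first 30 samples, then feed the rest one at a time.
M, beta = oselm_init(H[:30], T[:30])
for h, t in zip(H[30:], T[30:]):
    M, beta = oselm_update(M, beta, h, t)

# Reference: ordinary least squares on the whole dataset at once.
beta_batch = np.linalg.lstsq(H, T, rcond=None)[0]
```

After all samples have been processed, `beta` agrees with `beta_batch` up to rounding error, which is why no old sample ever has to be revisited.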

Later, Zhao and Wang [237] introduced a forgetting mechanism to OS-ELM to improve its performance on sequential data with time validity. Time is an important factor in many applications, such as stock markets and the shopping preferences of customers. Outdated training samples should be down-weighted and new samples emphasized. Huynh and Won [64] also proposed an approach to improve the online sequential extreme learning machine. They applied Tikhonov regularization to reduce the empirical error and to tackle ill-posed problems. Their experiments demonstrated its capacity to learn data one-by-one or chunk-by-chunk and its better generalization ability. Moreover, Mirza and Lin [118] proposed a modified method called weighted online sequential extreme learning machine to mitigate the class imbalance problem. According to their experimental results, the weighted online sequential extreme learning machine achieved better performance than other class imbalance learning methods.

3.5 ELM for imbalanced data

Another problem in real applications is the imbalanced distribution of samples in the training set, which can severely hinder the performance of ELM. In an ideal training set, samples are distributed roughly uniformly across classes, but in real world datasets, the number of samples of one class may be several times that of other classes. Hence, ELM cannot learn from minority classes effectively, and the trained network often recognizes majority class samples more accurately than minority class samples. Zong and Huang [246] put forward weighted ELM (W-ELM) for the classification of imbalanced data. They suggested assigning a weight to each training sample according to the class distribution. The formulation of W-ELM can be expressed as

$$ \underset{\boldsymbol{\beta}}{\mathit{\min}}\frac{1}{2}{\left\Vert \boldsymbol{\beta} \right\Vert}^2+\frac{1}{2}\sum \limits_{i=1}^N{C}_i{\left\Vert {\boldsymbol{e}}_i\right\Vert}^2,\kern0.5em s.t.\kern0.5em {\boldsymbol{t}}_i^T-{\boldsymbol{e}}_i^T=h\left({\boldsymbol{x}}_i\right)\boldsymbol{\beta}, i=1\dots N $$

where ei denotes the error of sample xi, and Ci is the penalty weight hyper-parameter. Practically, weights on majority classes are usually set smaller than those on minority classes. One of the most widely used weighting schemes is:

$$ {w}_i=\frac{1}{n\left({c}_i\right)} $$

where ci denotes the class label of sample xi, and n(ci) represents the number of training samples in class ci. W-ELM can achieve good generalization on both imbalanced and balanced datasets, and the scheme can be easily applied to cost-sensitive learning tasks. Its weighting algorithm is also simple and effective. Later, Tang and Chen [174] employed artificial bee colony optimization to train the W-ELM, which further improved its classification performance. Shen and Zhang [158] proposed a scheme for imbalanced binary classification. They used the minority samples as seeds to train an ELM autoencoder (ELM-AE), which was then used to generate more samples of the minority class. The generated samples were added to the dataset so that it became balanced.
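The weighting scheme \( {w}_i=1/n\left({c}_i\right) \) and the resulting weighted solution can be sketched as follows. The sketch assumes that the per-sample penalties take the form \( {C}_i=C{w}_i \) with a single hypothetical regularization constant C; setting the gradient of the W-ELM objective to zero then gives \( \boldsymbol{\beta} ={\left(I/C+{H}^TWH\right)}^{-1}{H}^TW\boldsymbol{T} \) with \( W=\mathrm{diag}\left({w}_1,\dots, {w}_N\right) \). The function names, the tanh activation, and the 90/10 toy dataset are illustrative choices, not taken from [246].

```python
import numpy as np

def class_weights(labels):
    """w_i = 1 / n(c_i): inverse of the size of the class of sample i."""
    classes, counts = np.unique(labels, return_counts=True)
    per_class = dict(zip(classes.tolist(), counts.tolist()))
    return np.array([1.0 / per_class[c] for c in labels.tolist()])

def welm_output_weights(H, T, sample_weights, C):
    """Solve (I/C + H^T W H) beta = H^T W T for the output weights."""
    HtW = H.T * sample_weights            # H^T diag(w) without an N x N matrix
    A = np.eye(H.shape[1]) / C + HtW @ H
    return np.linalg.solve(A, HtW @ T)

# Imbalanced toy problem: 90 majority and 10 minority samples (hypothetical).
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1.0, size=(90, 2)),
               rng.normal(2.5, 1.0, size=(10, 2))])
labels = np.array([0] * 90 + [1] * 10)
T = np.eye(2)[labels]

W_in = rng.uniform(-1.0, 1.0, size=(2, 20))
b = rng.uniform(-1.0, 1.0, size=20)
H = np.tanh(X @ W_in + b)

w = class_weights(labels)
C = 10.0
beta = welm_output_weights(H, T, w, C)
```

With this scheme the weights of each class sum to one, so the minority class contributes as much to the weighted error term as the majority class regardless of how many samples each contains.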

3.6 ELM for big data

ELM has also been adapted for big data. Liu and Huang [98] used ELM with spectral regression to reduce the dimension of high-dimensional data, after which the output weight was obtained. The proposed two-stage ELM converges faster on high-dimensional data.

Xin and Wang [198] found that ELM on MapReduce was inefficient for learning from updated big data and proposed elastic ELM as an improvement. The most time-consuming part of ELM learning on big data is the matrix multiplication, so they improved efficiency by calculating it incrementally. Elastic ELM computed intermediate matrix products only for the updated training subsets and obtained the output weight by centralized calculation. Various experiments were conducted for evaluation, and the results showed that the proposed elastic ELM was rapid and effective for big data learning.

Yang and Xu [204] applied ELM for stream data classification and proposed an ensemble ELM. They embedded a concept drift detection module in ELM, which can recognize both gradual and abrupt concept drift by online sequential learning and classifier updating.

Chen and Song [18] developed a domain adaptation algorithm based on ELM. In many real-world applications, the training data and testing data are from different distributions but are related. So, there is domain transfer in classification. They employed an ELM based feature learning network to transfer the training and testing data into the same feature space and minimized the distance between the two domains. Afterwards, the transferred data can be used to train an adaptive ELM for classification.

3.7 ELM for transfer learning

Annotated datasets play an important role in many artificial intelligence tasks. Despite their importance, collecting labeled data is often challenging because it usually consumes a lot of time and money. Transfer learning has a unique strength in solving this problem. By transferring knowledge from a source domain to a target domain, transfer learning can be implemented with only a small annotated dataset, which makes it well suited to fields that lack annotated data, such as computer-aided diagnosis, transportation, and emotion recognition. In recent years, extreme learning machine has been increasingly applied in transfer learning because of its fast learning speed and easy implementation. In the following, we introduce several representative works that apply extreme learning machine to construct transfer learning schemes.

In 2013, Chen and Wang [19] applied online sequential extreme learning machine in transfer learning scheme for recognizing various transportation modes. Their scheme consisted of three steps. First, they trained an original extreme learning machine on labeled source domain. Second, they calculated the mean and standard deviation for extracting trustable data from target domain. Third, they adopted online sequential extreme learning machine to retrain the original extreme learning machine classifier. According to their experimental results, this approach obtained better performance than other conventional methods.

In 2015, Zhang and Zhang [220] proposed a universal framework for improving the generalization capacity of traditional extreme learning machine. They applied both source domain adaptation method and target domain adaptation method to enhance the cross-domain ability of extreme learning machine and claimed to achieve state-of-the-art experimental results. Later they modified this generalized domain adaptation extreme learning machine and tested it in the field of electronic nose sensor drift compensation [219] and visual knowledge domain adaptation [221], proving this framework’s efficiency and potential.

In 2016, Li and Mao [87] presented an extreme learning machine based method to implement a transfer learning approach for data classification. An extreme learning machine classifier was trained with data from source domain first. During this process, they constructed two different penalty parameters. Second, they applied these two computed penalty parameters to obtain the remaining parameters of derived Lagrangian with data from target domain. In this way, the original extreme learning machine classifier was optimized, and knowledge was transferred from source domain to target domain. Their experimental results demonstrated this approach’s efficiency. Further, Li and Mao [88] improved their approach via introducing a free sparse representation from unlabeled data and claimed to achieve promising results as well.

3.8 ELM for Semi-supervised and unsupervised learning

Apart from the supervised learning, ELM has also been applied in semi-supervised and unsupervised learning.

Liu and Xia [92] combined manifold regularization with ELM for semi-supervised learning. They gave a deep analysis on relation between manifold regularization and ELM. They also proposed manifold regularization ELM. While maintaining the features of traditional ELM, manifold regularization ELM can be applied to large scale datasets. Experiment results showed that manifold regularization ELM was superior to traditional semi-supervised learning methods in terms of scalability and generalization. Krishnasamy and Paramesran [76] found that Laplacian regularization for ELM failed in extrapolating ability and proposed a new algorithm for semi-supervised learning called Hessian semi-supervised ELM (HSS-ELM). HSS-ELM used Hessian regularization which preserved the manifold structure and improved the extrapolating ability while maintaining the main features of conventional ELM. The implementation of multiclass learning was also simple.

Liu and Liu [96] studied non-parametric kernel learning (NPKL) and pointed out traditional NPKL methods often require manifold assumption and lack scalability. So they employed ELM for NPKL which can obtain a low rank kernel matrix from the samples without clear manifold structure. Mehrizi and Yazdi [115] combined growing self-organizing map (GSOM) with ELM. The use of ELM solved the parameters calibration of GSOM which substantially boosted the speed and improved the classification performance. Zhang and Ding [224] introduced wavelet analysis to ELM for unsupervised learning and semi-supervised learning.

Ding and Zhang [30] employed ELM as an autoencoder for feature learning and sparse representation. The embedded features were then used by an unsupervised ELM for clustering. Zhu and Miao [244] studied deep learning structures and proposed an ELM autoencoder (ELM-AE) for feature representation from unlabeled images. ELM-AE was used to extract low-level representations from local receptive fields, and the learnt features were fed into the last layer, called the trans-layer, for high-level feature learning. A block histogram was employed to achieve translation and rotation invariance in representation extraction. Experiments on MNIST and Caltech 101 suggested the effectiveness of their method. Cheng and Liu [20] improved ELM-AE and applied it to K-SVD. They put forward denoising deep ELM-AE (DDELM-AE) with orthonormal initialization of the input weight. Compared with completely random initialization, orthogonalization gave a better projection in the ELM feature space and boosted the performance.
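A generic ELM autoencoder with orthonormal input weights can be sketched as below. This follows one commonly described construction, in which the network is trained to reconstruct its own input and the transposed output weights are reused as the encoder; it is not the specific method of [30], [244] or [20], and the names `elm_ae_fit` and `elm_ae_encode`, the sigmoid activation, the unit-norm bias, and the random data are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_ae_fit(X, n_hidden, rng):
    """Fit an ELM autoencoder; requires n_hidden <= number of input features."""
    n_features = X.shape[1]
    # Orthonormal random input weights from the QR factor of a Gaussian matrix.
    W, _ = np.linalg.qr(rng.normal(size=(n_features, n_hidden)))
    b = rng.normal(size=n_hidden)
    b = b / np.linalg.norm(b)             # unit-norm bias (assumed convention)
    H = sigmoid(X @ W + b)
    # The target of an autoencoder is its own input, so solve H beta = X.
    beta = np.linalg.pinv(H) @ X
    return W, b, beta

def elm_ae_encode(X, beta):
    """Project the data with the transposed output weights."""
    return X @ beta.T

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 12))
W, b, beta = elm_ae_fit(X, n_hidden=4, rng=rng)
features = elm_ae_encode(X, beta)
```

The compressed `features` could then be passed to a clustering method or to a further ELM layer, which is how the stacked and unsupervised variants discussed above build on the autoencoder.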

Sun and Zhang [169] added a manifold regularization term to the objective function of ELM-AE and proposed the generalized extreme learning machine autoencoder (GELM-AE). They also developed a new deep network based on GELM-AE which converges faster than conventional deep networks. Nayak and Das [119] used ELM-AE to form a multilayer ELM and employed the leaky rectified linear unit (LReLU) as the activation function, which can be written as

$$ LReLU(x)=\left\{\begin{array}{ll}x, & if\ x>0\\ ax, & otherwise\end{array}\right. $$

where a ∈ (0, 1) is a hyper-parameter. Yousefi-Azar and McDonnell [209] proposed a hybrid approach for semi-supervised learning. They employed convolution and pooling to extract features from input images and used ELM as the classifier. She and Hu [157] proposed to use hierarchical ELM to learn features from EEG and to employ semi-supervised ELM for classification. Xia and Wang [196] studied OS-ELM and developed a density-based semi-supervised OS-ELM which can learn patterns from unlabeled samples. A summary of ELM variants is presented in Table 5.
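
As a quick illustration of the LReLU definition given earlier in this paragraph, a minimal NumPy sketch follows. The function name `lrelu` and the default slope a = 0.01 are illustrative assumptions, not values taken from [119].

```python
import numpy as np

def lrelu(x, a=0.01):
    """Leaky ReLU: returns x where x > 0 and a * x otherwise, with a in (0, 1)."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, a * x)

# Negative inputs are scaled by a instead of being set to zero.
values = lrelu([-2.0, 0.0, 3.0], a=0.1)
```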

Table 5 Summary of variants of ELM

4 ELM vs other algorithms

Researchers have devoted considerable effort to analyzing ELM over the last decade, especially to comparing ELM with other state-of-the-art algorithms including Support Vector Machine (SVM), Random Neural Network (RNN), Radial Basis Function Neural Network (RBFNN), Hopfield Neural Network (HNN), Boltzmann Machine (BM), Restricted Boltzmann Machine (RBM), Deep Belief Network (DBN), and other deep learning methods. These classic methods represent the state of the art of machine learning in different periods; they played an important role in the development of machine learning and remain active research topics today. Comparing ELM with these algorithms makes it easier to assess the significance of ELM in machine learning and to identify its strengths in related research. Generally speaking, ELM trains considerably faster than those methods, and its variants are becoming more and more robust. Despite its simple implementation, ELM achieves good accuracy.

4.1 ELM vs support vector machine (SVM)

As one of the most well-known supervised classification algorithms, the Support Vector Machine (SVM) was proposed by Cortes and Vapnik [24] in 1995. Consider data points scattered in space: many planes could separate these points into classes. In SVM, these planes are called hyperplanes. Among them, there exists one that maximizes the distance to the nearest data points of each class, and this hyperplane separates the data best. Finding this optimal hyperplane is the essential idea of SVM.

More specifically, given data sample D

$$ D=\left\{\left({X}_1,{y}_1\right),\left({X}_2,{y}_2\right),\dots, \left({X}_n,{y}_n\right)\right\} $$

where Xi is a vector of d elements representing a coordinate in d-dimensional space, and yi is the class label, normally yi ∈ {1, −1} for the positive and negative classes. In this space, a hyperplane can be written as

$$ {X}^Tw+b=0 $$

If D is linearly separable, we can always find a pair of parallel hyperplanes that separate D into positive and negative groups. This pair of hyperplanes can be written as

$$ \left\{\begin{array}{c}{X}^Tw+b=+1\\ {}{X}^Tw+b=-1\end{array}\right. $$

The distance between this pair of hyperplanes is called the margin in SVM, which can be written as

$$ margin=\frac{2}{\left\Vert w\right\Vert } $$

The margin is illustrated in Fig. 3.

Fig. 3

Illustration of hyperplanes and margins

By the principle of SVM, a support vector is defined as a training point Xi lying on one of the pair of hyperplanes, which satisfies

$$ {y}_i\left({X}_i^Tw+b\right)=1 $$

The purpose of SVM is to maximize the margin, which is equivalent to

$$ \max \frac{2}{\left\Vert w\right\Vert}\iff \min \frac{1}{2}{\left\Vert w\right\Vert}^2 $$

Including the constraints, the optimization problem of SVM is written as

$$ {\displaystyle \begin{array}{c}\min \frac{1}{2}{\left\Vert w\right\Vert}^2\\ {}s.t.\ {y}_i\left({X}_i^Tw+b\right)\ge 1,\ i=1,2,\dots, n\end{array}} $$

However, this problem is usually not solved directly. It is commonly called the primal problem, and the Lagrange function can be used to obtain its dual problem, which is much easier to solve. To this end, we introduce the Lagrange multipliers

$$ {\alpha}_i\ge 0,i=1,2,\dots, n $$

into the primal problem and obtain its Lagrange function:

$$ L\left(w,b,\alpha \right)=\frac{1}{2}{\left\Vert w\right\Vert}^2-\sum \limits_{i=1}^n{\alpha}_i\left[{y}_i\left({X}_i^Tw+b\right)-1\right] $$

When all constraints are satisfied, every bracketed term in the sum is non-negative, so the maximum of L(w, b, α) over α ≥ 0 is reached when the whole sum vanishes, which gives

$$ L\left(w,b,\alpha \right)=\frac{1}{2}{\left\Vert w\right\Vert}^2 $$

So, optimizing the margin is equivalent to optimizing its Lagrange function when all constraints are satisfied. This is the basic principle of the SVM method.
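
For completeness, the usual route from this Lagrange function to the dual problem can be sketched as follows. This is the textbook hard-margin derivation, written here with the training points Xi; it is not a result specific to the works reviewed in this section. Setting the partial derivatives of L with respect to w and b to zero gives

$$ w=\sum \limits_{i=1}^n{\alpha}_i{y}_i{X}_i,\kern1em \sum \limits_{i=1}^n{\alpha}_i{y}_i=0 $$

and substituting these back into L yields the dual problem

$$ {\displaystyle \begin{array}{c}\max \limits_{\alpha}\ \sum \limits_{i=1}^n{\alpha}_i-\frac{1}{2}\sum \limits_{i=1}^n\sum \limits_{j=1}^n{\alpha}_i{\alpha}_j{y}_i{y}_j{X}_i^T{X}_j\\ {}s.t.\ {\alpha}_i\ge 0,\ \sum \limits_{i=1}^n{\alpha}_i{y}_i=0\end{array}} $$

Only the training points with αi > 0 contribute to w; these points are the support vectors.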

Huang and Ding [55] presented a detailed analysis and comparison of ELM and SVM. They revealed that the maximal margin theory of SVM is consistent with the minimal norm of the network weights in ELM. In classification problems, ELM and SVM are equivalent, but ELM has fewer optimization constraints, so it tends to yield better performance. Moreover, ELM is robust to user-defined parameters. Later, Huang and Zhou [57] compared ELM with least square SVM (LS-SVM) and proximal SVM (PSVM) on a wide range of datasets. They found that ELM can be regarded as a unified form of LS-SVM and PSVM. The performance of ELM on binary class problems was similar to that of LS-SVM and PSVM, but on multiclass problems ELM outperformed the two substantially. Furthermore, ELM was reported to train up to thousands of times faster due to the random feature mapping. Random feature mapping not only provides nonlinearity but also gives ELM its universal approximation ability.
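
To make the role of random feature mapping concrete, the following is a minimal NumPy sketch of a basic ELM classifier. It assumes a sigmoid activation, a ridge-regularized least-squares solution for the output weights, and hypothetical names (`elm_fit`, `elm_predict`, `n_hidden`, `C`); it illustrates the general scheme rather than the exact implementations compared in [55, 57].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, T, n_hidden=50, C=1e3, seed=0):
    """Train a single-hidden-layer ELM.
    X: (n_samples, n_features) inputs, T: (n_samples, n_outputs) targets."""
    rng = np.random.default_rng(seed)
    # Input weights and hidden biases are drawn at random and never updated.
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid(X @ W + b)  # random feature mapping
    # Output weights: regularized least squares, solved in a single step.
    beta = np.linalg.solve(H.T @ H + np.eye(n_hidden) / C, H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

# Toy usage (illustrative data): two well-separated groups labelled +1 and -1.
X = np.array([[2., 2.], [2., 3.], [3., 2.],
              [-2., -2.], [-2., -3.], [-3., -2.]])
T = np.array([[1.], [1.], [1.], [-1.], [-1.], [-1.]])
W, b, beta = elm_fit(X, T)
pred = np.sign(elm_predict(X, W, b, beta))
```

Only `beta` is learned, and it is obtained from one linear solve, which is why no iterative weight updates appear in the sketch.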

Liu and Loh [106] compared SVM with ELM for text classification in 2005. Their experiments demonstrated that SVM performed better than ELM on most evaluation measures, especially the F1 value. In 2014, Chorowski and Wang [21] used a convex optimization framework to assess different machine learning models including ELM and SVM. Zhong and Miao [240] compared several machine learning methods for corporate credit rating. According to their experiments, SVM had the highest accuracy, single hidden layer feedforward networks showed the best robustness, and ELM proved its strength in efficiency, simplicity, and speed. In 2015, Qiu and Wang [140] investigated various methods for predicting the quality status of mandarin. They found that ELM and random forest showed better prediction performance than SVM. In 2016, Zhang and Zhang [222] performed experiments to find the best classifier for deep features extracted by a convolutional neural network. They claimed that ELM was clearly superior to SVM in cross-domain recognition tasks and that kernel extreme learning machine achieved state-of-the-art performance. In 2017, Olatunji [125] compared ELM with SVM for email classification. According to the experimental results, SVM had better accuracy while ELM was faster.

4.2 ELM vs random neural network (RNN)

Random Neural Network (RNN) is a special kind of neural network with a unique structure and learning strategy. It was proposed by Erol Gelenbe [37] in 1989, later than other neural network models such as the Hopfield Neural Network (HNN) or the Boltzmann Machine (BM). RNN was inspired by the way biological neural networks receive and emit stimulation signals, a process governed by each neuron's electric potential. The same holds in RNN. An RNN is an open network consisting of several neurons, and the electric potential of each neuron is recorded. If a neuron receives a positive signal, its potential increases by one; if it receives a negative signal, its potential decreases by one. Signals are emitted and received continuously within the RNN until the whole network reaches an equilibrium state.

Bhat and Merchant [5] proposed an approach for predicting the melting points of organic compounds based on extreme learning machine. They also compared the ELM-based approach with other machine learning methods including Random Neural Network and found that ELM was superior in their experiments. Feng and Huang [34] combined extreme learning machine with a random structure of hidden neurons and improved the model performance. Wang and Cao [184] reported that, under some conditions, extreme learning machine had several advantages over other neural network structures, Random Neural Network included, such as higher accuracy and lower training cost. In 2014, Zhu and Miao [245] established a novel neural network structure using constrained extreme learning machine. They adopted random weights and connections derived from RNN and reported outstanding performance. In 2015, Lima and Cannon [91] evaluated several representative neural network models in environmental sciences, including extreme learning machine and RNN. According to their survey, extreme learning machine was more widely applied in environmental sciences than RNN. Minemoto and Isokawa [117] studied the optimization of feedforward neural networks and also acknowledged that extreme learning machine was a more powerful approach than Random Neural Network in most cases.

4.3 ELM vs radial basis function neural network (RBFNN)

RBFNN is a single hidden layer feedforward neural network based on function approximation proposed by Broomhead and Lowe [7] in the late 1980s. With the increasing maturity of the research, RBFNN has attracted great attention from researchers in various fields due to its simple structure, strong nonlinear approximation ability and good generalization ability, and has been widely used in many research fields such as pattern classification, function approximation and data mining [68].

The basic idea of RBFNN is to use Radial Basis Functions (RBF) as the "basis" of the hidden units to form the hidden layer space, so that input vectors are mapped directly (without weighted connections) into the hidden space. The hidden layer thus transforms the low-dimensional input data into a high-dimensional space, so that a problem that is linearly inseparable in the low-dimensional space becomes linearly separable in the high-dimensional space. Once the centers of the RBFs are determined, this mapping is also determined. The mapping from the hidden layer space to the output space is linear: the network output is the linear weighted sum of the hidden unit outputs, and these weights are the adjustable parameters of the network.

As a three-layer feedforward network with a single hidden layer, RBFNN has the structure shown in Fig. 4. The first layer is the input layer and consists of signal source nodes. The second layer is the hidden layer, whose number of nodes depends on the problem being described. The transformation function of the hidden neurons is a radial basis function, a non-negative nonlinear function that is radially symmetric about its center and decays with distance from it. This function has a local response, which makes the input-to-hidden transformation of RBFNN different from that of other networks: the transformation functions of earlier feedforward networks are all global response functions. The third layer, the output layer, is the response to the input pattern. The input layer only transmits the signal, and the connection between the input layer and the hidden layer can be regarded as a connection with weight 1. The output layer and the hidden layer perform different tasks, so their learning strategies also differ. The output layer adjusts the linear weights with a linear optimization strategy, so its learning is fast. The hidden layer adjusts the parameters of the activation function (Green's function or Gaussian function, generally the latter) with a nonlinear optimization strategy, so its learning is slow.

Fig. 4

Illustration of RBFNN

The advantages of the radial basis function neural network are as follows: its approximation ability, classification ability and learning speed are superior to those of the BP neural network; its structure and training are simple; its learning converges quickly; it can approximate any nonlinear function; and it can overcome the local minimum problem. The reason is that its parameters are initialized by a definite method rather than at random. In the RBF neural network training process, the key problem is the reasonable determination of the center parameters of the hidden layer neurons. A common method is to select the center parameters (or their initial values) directly from the given training sample set in a certain way, or to determine them by clustering.
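
The structure described above, with centers taken from the training samples and a linear output layer, can be sketched in NumPy as follows. The Gaussian width `sigma`, the choice of the first `n_centers` samples as centers, and the function names are illustrative assumptions rather than a prescribed procedure.

```python
import numpy as np

def rbf_features(X, centers, sigma):
    """Gaussian RBF hidden layer: exp(-||x - c||^2 / (2 * sigma^2))."""
    # Squared Euclidean distance between every sample and every center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rbfnn_fit(X, y, n_centers, sigma=1.0):
    centers = X[:n_centers]  # assumption: centers picked from the training samples
    Phi = rbf_features(X, centers, sigma)
    # Linear output layer: ordinary least squares on the hidden outputs.
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return centers, w

def rbfnn_predict(X, centers, w, sigma=1.0):
    return rbf_features(X, centers, sigma) @ w

# Toy usage: XOR labels, which no linear model on the raw inputs can fit.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])
centers, w = rbfnn_fit(X, y, n_centers=4, sigma=0.5)
y_hat = rbfnn_predict(X, centers, w, sigma=0.5)
```

Once the centers are fixed, only the linear weights `w` remain to be solved, which mirrors the fast linear optimization of the output layer described earlier in this section.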

Researchers have long compared the Radial Basis Function Neural Network (RBFNN) with extreme learning machine (ELM) on specific machine learning tasks. In 2006, Li and Huang [84] compared the performance of ELM and RBFNN in channel equalization. According to their study, ELM-based methods obtained better symbol error rate and learning speed, which meant that ELM outperformed RBFNN in channel equalization applications. Besides comparing the two models, researchers have also tried to improve performance by uniting the advantages of both ELM and RBFNN. In 2011, Fernandez-Navarro and Hervas-Martinez [35] proposed a novel model which applied extreme learning machine as the training method and used a generalized radial basis function (GRBF) neural network as the basis. The merged approach was called modified extreme learning machine-generalized radial basis (MELM-GRBF) neural network. They claimed that the proposed method was superior to both the common extreme learning machine and the radial basis function neural network. Ding and Ma [31] also applied extreme learning machine to train radial basis function neural networks. They proposed a self-adaptive training strategy based on affinity propagation and demonstrated that their methodology showed clear advantages over other models. Liu and Wan [102] analyzed the universal consistency of extreme learning machine when training radial basis function neural networks, and which kernel function should be chosen in extreme learning machine under different conditions. Lam and Wunsch [77] studied various unsupervised learning approaches for classification on compute unified device architecture (CUDA), including extreme learning machine (ELM), support vector machine (SVM) and radial basis function (RBF) neural network.
Wen and Fan [190] proposed a classifier called hybrid structure-adaptive radial basis function-extreme learning machine (HSARBF-ELM) in 2017 and claimed that their method combined the advantages of both the structure-adaptive radial basis function (SARBF) network and extreme learning machine (ELM). Yan [203] adopted both ELM and RBF to achieve higher prediction accuracy. Research in similar directions has continued in recent years [65, 201].

4.4 ELM vs Hopfield neural network (HNN)

In 1982, the US physicist Hopfield [50] proposed a novel neural network aimed at solving pattern recognition and optimization problems, named the Hopfield Neural Network (HNN). Since the Hopfield network appeared, researchers have developed a series of creative neural network structures inspired by it. The Hopfield Neural Network is a kind of recurrent neural network with feedback connections from output to input. Each neuron is connected with all other neurons, so it is also known as a fully interconnected network. It combines a storage system and a binary system, and it also provides a model to simulate human memory. It is guaranteed to converge to a local minimum, but it may converge to a wrong local minimum instead of the global minimum. Because the output is fed back to the input, the output keeps changing state under the excitation of the input, and this feedback process is repeated. If the Hopfield neural network is convergent and stable, the changes generated by this feedback and iterative calculation become smaller and smaller. Once a stable equilibrium state is reached, the network outputs a stable constant value. For the Hopfield Neural Network, the key is to determine its weight coefficients under stable conditions.
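
The feedback iteration toward a stable state can be illustrated with a small discrete Hopfield network. The sketch below assumes bipolar (+1/−1) states, Hebbian weights, and asynchronous updates, which are common textbook choices; the function names are hypothetical.

```python
import numpy as np

def hopfield_train(patterns):
    """Hebbian weights for bipolar (+1/-1) patterns, without self-connections."""
    P = np.asarray(patterns, dtype=float)
    W = P.T @ P / P.shape[1]
    np.fill_diagonal(W, 0.0)
    return W

def hopfield_recall(W, state, max_sweeps=20):
    """Update neurons one at a time until no neuron changes (a stable state)."""
    s = np.array(state, dtype=float)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            new = 1.0 if W[i] @ s >= 0 else -1.0
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break
    return s

# Store one pattern, corrupt one bit, and let the feedback dynamics repair it.
stored = np.array([[1, -1, 1, -1, 1, -1, 1, -1]])
W = hopfield_train(stored)
noisy = stored[0].copy()
noisy[0] = -noisy[0]
recalled = hopfield_recall(W, noisy)
```

With several stored patterns the same dynamics can also settle in a spurious state, which corresponds to the wrong local minimum mentioned above.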

Generally speaking, Hopfield Neural Networks can be divided into two types, the Discrete Hopfield Neural Network (DHNN) and the Continuous Hopfield Neural Network (CHNN). The structure of DHNN is shown in Fig. 5. Few articles directly compare the Hopfield Neural Network (HNN) with extreme learning machine (ELM), because researchers have developed more advanced neural network models based on HNN, such as the Restricted Boltzmann Machine (RBM) and the Deep Belief Network (DBN), which have been more widely applied since they were introduced. Nevertheless, we reviewed some papers concerning the performance comparison between HNN and ELM. Liang and Cheng [90] proposed a modified model based on ELM and claimed it was superior to some other popular neural networks including the Hopfield neural network. Liu and Liu [95] constructed an improved multiple hidden layer ELM by employing random search enhancement. According to their experiments, the improved ELM-based method had advantages in training speed and accuracy over other methods including HNN.

Fig. 5

Illustration of DHNN

4.5 ELM vs Boltzmann machine (BM)

As discussed above, the Hopfield Neural Network can perform optimization. However, the network can only evolve in strict accordance with the descent of the energy function, and it is difficult to avoid spurious states. Moreover, the weights easily fall into a local minimum and cannot converge to the global optimal solution. If the iterative process of the feedback neural network is less rigid and can temporarily accept an increase of the energy function to some extent, then it is possible to jump out of the local minimum. The core idea of the stochastic neural network is to add probability factors into the network. The network does not always evolve in the direction that reduces the energy function; instead, it evolves in that direction with a large probability, which keeps the overall iteration direction correct. Meanwhile, there is a small probability that the energy function increases, which prevents the network from being trapped in a local minimum.

Gradient descent is the most widely used method in machine learning and combinatorial optimization problems. Take the BP neural network as an example: the more units the multilayer perceptron has, the larger the corresponding weight matrix, and each weight can be regarded as one degree of freedom or variable. The higher the degree of freedom, the more variables there are, and the more complex and powerful the model is. However, the more powerful the model is, the easier it is for the model to overfit and to be over-sensitive to noise. On the other hand, when gradient descent is used to search for the optimal solution, the error surface over multiple variables resembles an undulating mountain range. The more variables there are, the more peaks and valleys there are, which makes it very easy for the gradient descent method to fall into a small local valley and stop searching. This is the common local optimum problem when solving multi-dimensional optimization problems by conventional gradient descent. The reason lies in the search criterion of gradient descent, which follows the negative direction of the gradient and blindly pursues the reduction of the network error or energy function, so that the search can only "go down the mountain" but not "go up the mountain". The so-called "ability to climb" means that when the search falls into a local optimum, it can still "climb over the mountains" to escape from the local optimum and continue to search for the global optimum. As a graphic metaphor for a system with multiple local minima, imagine an uneven multi-dimensional energy surface on a tray. If a ball is placed on the surface, it will roll into the nearest trough (local minimum point) under the action of gravity and cannot escape. But this trough is not necessarily the lowest trough on the surface (global minimum point).
Therefore, the local minimum problem can only be solved by improving the algorithm. One way to do this is to give the algorithm the ability to "climb", as mentioned earlier, while ensuring that the search does not "climb" out of the globally optimal "valley".

The simulated annealing algorithm is an effective method to solve the problem of local energy minima in stochastic networks. Its basic idea is to simulate the process of metal annealing, which is roughly as follows. First, the object is heated to a high temperature so that its atoms move at high speed and the object has a high internal energy. Then, the temperature drops slowly. As the temperature drops, the atoms slow down and the internal energy decreases. Finally, the object reaches its lowest internal energy state. As one of the most significant derivatives of the simulated annealing algorithm, the Boltzmann machine was first proposed by Ackley and Hinton [1] in 1985. As shown in Fig. 6, a typical Boltzmann machine consists of hidden units, visible units, and edges of dependency. The Boltzmann machine was the backbone of many later, better-known and more widely applied neural network models such as the Restricted Boltzmann Machine (RBM) and the Deep Boltzmann Machine (DBM). Because these improved models are more widely applied in research, we do not provide a separate paragraph on literature comparing the Boltzmann Machine with extreme learning machine, but review representative papers comparing extreme learning machine with RBM or DBM in the next section.

Fig. 6

Network structure of Boltzmann Machine
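
The idea of occasionally accepting an energy increase, described in this section, can be illustrated with a short simulated annealing sketch. The Metropolis acceptance rule exp(−ΔE/T), the geometric cooling schedule, and the toy one-dimensional energy function below are common textbook choices assumed here for illustration; they are not taken from [1].

```python
import math
import random

def energy(x):
    # Toy energy with a local minimum near x = 1.3 and a lower (global) one near x = -1.5.
    return x ** 4 - 4 * x ** 2 + x

def simulated_annealing(x0, t0=2.0, cooling=0.995, steps=5000, seed=0):
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    t = t0
    for _ in range(steps):
        cand = x + rng.uniform(-0.5, 0.5)  # random neighbouring state
        e_cand = energy(cand)
        de = e_cand - e
        # Downhill moves are always accepted; uphill moves are accepted
        # with probability exp(-dE / T), which shrinks as T cools.
        if de <= 0 or rng.random() < math.exp(-de / t):
            x, e = cand, e_cand
            if e < best_e:
                best_x, best_e = x, e
        t *= cooling  # slow cooling
    return best_x, best_e

# Start inside the local valley; uphill moves at high temperature allow the search to leave it.
best_x, best_e = simulated_annealing(x0=1.3)
```

Escaping the local valley is likely at the initial temperature but not guaranteed; plain gradient descent started from the same point would stay near x = 1.3.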

4.6 ELM vs restricted Boltzmann machine (RBM)

Restricted Boltzmann Machine (RBM) is a generative stochastic neural network that can learn a probability distribution over its input data set. RBM was originally named Harmonium by its inventor, Paul Smolensky, in 1986, but the restricted Boltzmann machine did not become well known until Hinton and Salakhutdinov [49] developed a fast learning algorithm in the mid-2000s. Restricted Boltzmann machines have been used in dimension reduction, classification, collaborative filtering, feature learning and topic modeling. Depending on the task, the restricted Boltzmann machine can be trained using supervised learning or unsupervised learning.

Restricted Boltzmann Machine (whose structure is shown in Fig. 7) is a special topological structure of the Boltzmann machine (BM). The principle of BM originates from statistical physics; it is a modeling method based on an energy function, which can describe higher-order interactions between variables. The learning algorithm of BM is relatively complex, but the model and the learning algorithm rest on a relatively complete physical explanation and rigorous mathematical statistics theory. BM is a symmetrically coupled stochastic feedback neural network of binary units, composed of a visible layer and multiple hidden layers. Network nodes are divided into visible units and hidden units, which are used to express the learning model of the stochastic network and the stochastic environment, and the correlation between units is expressed through weights.
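
For reference, the energy function of an RBM with binary visible units v and hidden units h is commonly written as follows. The symbols a and b (visible and hidden biases), W (weights) and Z (normalizing constant) are notation introduced here for illustration and do not appear elsewhere in this section.

$$ E\left(v,h\right)=-{a}^Tv-{b}^Th-{v}^TWh,\kern1em P\left(v,h\right)=\frac{e^{-E\left(v,h\right)}}{Z} $$

Because an RBM has no connections within a layer, the hidden units are conditionally independent given the visible units and vice versa, which is what makes layer-wise sampling cheap.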

Fig. 7

Illustration of Restricted Boltzmann Machine (RBM)

Once an RBM has learned the structure of the input data associated with the activation values of the first hidden layer, the data is passed one layer further along the network. The first hidden layer becomes the new visible or input layer. The activation values of this layer are multiplied by the weights of the second hidden layer to produce another set of activations. This process of creating a sequence of activation values by grouping features, and then grouping groups of features, is the basis of the feature hierarchy through which neural networks learn more complex and abstract representations of data. For each new hidden layer, the weights are iteratively adjusted until the layer approximates the input from the previous layer. This is greedy, layer-wise, unsupervised pre-training. It does not require labels to improve the weights of the network, which means that we can train on unlabeled data sets that have not been manually processed, which make up the vast majority of data in reality. Often, algorithms with more data produce more accurate results, which is one reason for the rise of deep learning algorithms. Since these weights already approximate the characteristics of the data, the subsequent supervised learning stage becomes easier when deep belief networks are used for image classification. Although RBM has many applications, proper weight initialization to facilitate later classification is one of its main advantages. In a way, RBMs perform something like backpropagation: they do a good job of adjusting the weights to better model the data. We could say that pre-training and backpropagation are alternative ways to achieve the same goal.

As one of the most famous and commonly used neural network models, the Restricted Boltzmann Machine (RBM) has been an active research topic since it was introduced. Bu and Zhao [8] presented a methodology which applied RBM to enhance the capacity of extreme learning machine (ELM) in spectral processing. Their research demonstrated that the joint architecture of RBM and ELM performed better than principal component analysis (PCA) combined with ELM. Hassan [45] explored various classification methods for automated sleep apnea detection, including naïve Bayes, k nearest neighbor (KNN), AdaBoost, random forest, RBM and ELM. According to the conclusion, ELM was superior to the other methods (including RBM) for building an automated sleep apnea detection system. In 2016, He and Wang [47] proposed a novel approach based on the Gaussian Restricted Boltzmann Machine (Gaussian RBM) and claimed that it beat other popular methods including ELM, support vector machine (SVM) and deep belief network (DBN) in bearing fault diagnosis because it was more robust, accurate and effective. Chen and Yang [15] presented a method based on ELM and RBM to handle big data problems. Their method applied RBM by inserting hidden nodes into traditional ELM and was reported to achieve better performance than traditional methods. In 2017, Ramasamy and Rajaraman [147] proposed a modified approach called Meta-cognitive Restricted Boltzmann Machine (McRBM), which was superior to other approaches such as support vector machine (SVM) and extreme learning machine (ELM) according to their experimental results. It has also been demonstrated that RBM can be utilized to improve the performance of ELM by determining the input weights [130, 131]. Wang and Zhang [183] studied various methods for training multiple hidden layer feedforward neural networks, including back propagation (BP) and extreme learning machine (ELM).
Their study showed that RBM had advantages such as fast training speed, better generalization, and high understandability compared to BP and ELM. In other fields such as aircraft auxiliary power unit (APU) sensing data prediction [104] and gas path fault diagnosis [108], researchers also showed that the Restricted Boltzmann Machine can be applied to improve the performance of extreme learning machine.

4.7 ELM vs deep belief network (DBN)

Deep Belief Network (DBN) is a kind of neural network which can be used for both unsupervised learning and supervised learning. As a probabilistic generative model, DBN was also proposed by Hinton and Osindero [48]. In contrast to the neural networks of traditional discriminative models, a generative model establishes a joint distribution between observed data and labels. By training the weights among the neurons, the whole neural network can generate the training data with maximum probability. DBN can be used not only to identify features and classify data, but also to generate data. The DBN model is a very practical learning algorithm with a wide range of applications and strong extensibility. It can be used in machine learning fields such as handwriting recognition, speech recognition and image processing.

DBN is composed of multiple layers of neurons (shown in Fig. 8), which are divided into visible units and hidden units. Visible units are used to accept input, and hidden units are used to extract features, so hidden units are also called feature detectors. The connections between the top two layers are undirected and constitute an associative memory. The connections among the other, lower layers are directed. The lowest layer represents data vectors, with each neuron representing one dimension of the data vector.

Fig. 8

Illustration of Deep Belief Network (DBN)

The building block of DBN is the Restricted Boltzmann Machine (RBM). The process of training a DBN is carried out layer by layer. In each layer, the data vector is used to infer the hidden layer, and this hidden layer is then treated as the data vector of the next layer (one layer above). In fact, each RBM can also be used separately for clustering. An RBM has only two layers of neurons. The first is called the visible layer, which is composed of visible units that receive the training data. The other is called the hidden layer, which consists of hidden units used as feature detectors.

The training process of DBN has five steps. First, the first RBM is fully trained. Second, the weights and biases of the first RBM are fixed, and the states of its hidden units are used as the input vector of the next RBM. Third, the second RBM is fully trained and stacked above the first one. Fourth, this process is repeated for as many layers as required. Finally, if the training data has labels, the uppermost RBM is trained with neurons representing the classification labels added alongside its visible units.
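
The first four steps above can be sketched as greedy layer-wise training in NumPy. The sketch assumes binary RBMs trained with one-step contrastive divergence (CD-1), hidden probabilities passed upward as the next layer's input, and hypothetical names and hyper-parameters (`train_rbm`, `train_dbn`, `lr`, `epochs`); the final supervised step with label neurons is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(V, n_hidden, epochs=50, lr=0.1, seed=0):
    """Train one RBM with one-step contrastive divergence (CD-1)."""
    rng = np.random.default_rng(seed)
    n_visible = V.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a = np.zeros(n_visible)  # visible biases
    b = np.zeros(n_hidden)   # hidden biases
    for _ in range(epochs):
        # Positive phase: hidden probabilities and sampled states given the data.
        ph = sigmoid(V @ W + b)
        h = (rng.random(ph.shape) < ph).astype(float)
        # Negative phase: one reconstruction of the visible layer and back.
        pv = sigmoid(h @ W.T + a)
        ph_recon = sigmoid(pv @ W + b)
        # CD-1 parameter updates, averaged over the batch.
        W += lr * (V.T @ ph - pv.T @ ph_recon) / len(V)
        a += lr * (V - pv).mean(axis=0)
        b += lr * (ph - ph_recon).mean(axis=0)
    return W, a, b

def train_dbn(V, layer_sizes):
    """Greedy stacking: each trained RBM is frozen and its hidden
    activations become the input of the next RBM."""
    layers, data = [], V
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(data, n_hidden)
        layers.append((W, a, b))
        data = sigmoid(data @ W + b)  # fixed weights, propagate upward
    return layers, data

# Toy usage on four binary vectors (illustrative data only).
V = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]], dtype=float)
layers, top = train_dbn(V, layer_sizes=[3, 2])
```

Each call to `train_rbm` sees only the output of the layer below, so no labels are needed until the final supervised stage.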

Since the Deep Belief Network (DBN) was proposed, researchers have continued to extend its capability, including by embedding the advantages of extreme learning machine (ELM) into it. Yu and Zhuang [211] presented a modified architecture of extreme learning machine by applying deep representation with the stacked generalization philosophy. They compared their scheme with other frequently used methods including DBN and claimed that it obtained better performance. Zhang and Ding [216] also adopted deep representation to improve extreme learning machine, which turned out to be superior to DBN according to their experimental results. Ronoud and Asadi [151] proposed a novel methodology to detect breast cancer by applying extreme learning machine to train a classifier with DBN. Their research showed that DBN with ELM had an advantage in performance over other combinations such as DBN with genetic algorithm (GA). Zhang and Xiao [217] proposed an ELM-based approach called multilayer probability ELM (MP-ELM) for device-free localization. Their research indicated that the approach was better than some other popular methods including Deep Belief Network (DBN) and Deep Boltzmann Machine (DBM).

4.8 ELM vs other deep learning methods

Kasun and Zhou [75] combined an autoencoder with ELM and proposed ELM-AE. They constructed a five-layer ELM for experiments on the MNIST dataset, which consists of 70,000 handwritten digit samples, 60,000 for training and the rest for testing. Results revealed that the multilayer ELM was comparable to the deep belief network and the deep Boltzmann machine but required less training time. They also discovered that the output weight β has a good capability for representation learning. The effectiveness of ELM-AE was further confirmed and improved by the work in [20, 30, 169]. Zhao and Jiao [236] combined tensor operations with ELM and proposed the sparse deep tensor extreme learning machine (SDT-ELM), which achieved good classification performance on three open datasets. Shen and Xiao [159] proposed a hybrid of deep CNN and ELM for hyperspectral image classification. They designed a two-branch convolution module for spatial and spectral feature extraction, respectively. The generated features were concatenated and sent into a stacked ELM for classification. Results on benchmark datasets revealed that their method achieved good classification performance at a faster speed.
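To illustrate how the output weight β can serve as a learned representation, the following NumPy sketch trains an autoencoder in the ELM manner: the hidden layer is random and fixed, and β is solved in closed form so that the hidden activations reconstruct the input. It is a simplified sketch under stated assumptions, not the exact formulation of [75]: it uses a tanh activation and a ridge penalty `reg`, it skips the orthogonalization of the random weights, and the function names are hypothetical.

```python
import numpy as np

def elm_ae_fit(X, n_hidden, reg=1e-3, seed=0):
    """Solve for beta so that tanh(X A + b) @ beta approximates X."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((X.shape[1], n_hidden))  # random, never trained
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ A + b)
    # Ridge-regularized least squares: (H^T H + reg I) beta = H^T X
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ X)
    return A, b, beta

def elm_ae_transform(X, beta):
    """Project the data with the transposed output weights."""
    return X @ beta.T
```

Here `beta` has shape `(n_hidden, n_features)`, so `elm_ae_transform` maps each sample to `n_hidden` features; stacking several such layers gives a multilayer ELM.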

Finally, to give a brief comparison between ELM and these machine learning algorithms, Table 6 illustrates the strengths and popularity of each method.

Table 6 Comparison between ELM and other machine learning algorithms

5 Tasks of ELM

As ELM has many advantages over other machine learning algorithms, it has been widely used to solve real world machine learning problems.

5.1 Recognition

ELM is now an important and efficient tool for object recognition. Chacko and Vimal Krishnan [13] utilized wavelet energy as the feature and ELM as the classification algorithm for handwritten character recognition. The Malayalam character set was used for evaluation in their experiments. Rong and Jia [150] developed an aircraft recognition system based on ELM. First, three different kinds of moments were extracted from aircraft images as the features. Then, the features were divided into subsets to train a set of ELMs. Finally, the output of the system was obtained by weighting the outputs of the individual ELMs. Generic object recognition remains a challenge for machine learning because of intra-class variability. Bai and Kasun [4] proposed the local receptive fields based extreme learning machine (ELM-LRF), which used convolution and pooling for automatic feature learning and ELM for classification. A comparison was conducted on the NORB, ETH-80 and COIL datasets, and ELM-LRF achieved comparable classification results. Zhang and He [218] used a hybrid method for cross-domain learning. They employed a deep convolutional neural network to extract representations from input images and proposed adaptive ELM (AELM) for recognition. Domain-related error terms and a regularization term were added to the objective function of AELM. Experimental results suggested the effectiveness of their method. Zhou and Wang [242] used ELM for fabric defect recognition. They employed k-means singular value decomposition for dictionary learning and trained a regularized ELM optimized by an adaptive differential evolution algorithm (ADE-RELM) to recognize defects such as broken warp and broken weft. Zhang and Jiang [232] proposed a hybrid of superpixel patterns and KELM for hyperspectral image recognition and classification, which achieved a 10% improvement in accuracy over other spectral methods. Tian and Zhang [177] used ELM for sensor-based activity recognition. 
They first trained a set of base ELMs independently and removed the similar ones. Then, glowworm swarm optimization was leveraged to select the optimal sub-ensemble of the ELMs. Finally, the output of the system was generated by majority voting over the ensemble. Sharma and Giri [155] proposed to use ELM for intrusion detection. They first employed an ExtraTrees classifier for feature selection for each intrusion type. Then, a set of ELMs was trained for every type of intrusion. The results from the ELMs were refined by softmax to generate the final result. Rodriguez and Barba [149] combined wavelet packet Fourier entropy and a shallow kernel ELM for bearing fault detection. Qiu and Wu [141] proposed an ELM-based insulator pollution detection method for hyperspectral images. Lu and Zhang [109] integrated ELM with hypercomplex space for palm print recognition. Liu and Li [107] utilized 2D Gabor features and local binary patterns for feature extraction and trained an ELM for human facial expression recognition.
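Several of the systems above combine a set of independently trained ELMs by voting. The sketch below shows only that generic pattern in NumPy: each base ELM draws its own random hidden layer, and the ensemble output is the majority vote. The pruning of similar models and the swarm-based sub-ensemble selection of [177] are not reproduced, and the ridge penalty `reg`, the sigmoid activation and all function names are assumptions made for illustration.

```python
import numpy as np

def _hidden(X, W, b):
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))  # sigmoid hidden layer

def train_elm(X, y, n_classes, n_hidden=20, reg=1e-3, seed=0):
    """Base ELM classifier: random hidden layer, closed-form output weights."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = _hidden(X, W, b)
    T = np.eye(n_classes)[y]  # one-hot targets
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ T)
    return W, b, beta

def predict_elm(model, X):
    W, b, beta = model
    return (_hidden(X, W, b) @ beta).argmax(axis=1)

def majority_vote(models, X, n_classes):
    """Count one vote per base ELM and return the most voted class."""
    counts = np.zeros((X.shape[0], n_classes), dtype=int)
    rows = np.arange(X.shape[0])
    for model in models:
        counts[rows, predict_elm(model, X)] += 1
    return counts.argmax(axis=1)
```

A typical call trains the base models with different seeds, for example `models = [train_elm(X, y, 2, seed=s) for s in range(5)]`, and then evaluates `majority_vote(models, X_new, 2)`. Ties are resolved in favor of the lower class index.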

5.2 Prediction

ELM has also been employed to solve prediction problems. Yang and Yan [205] used ELM for pressure prediction in coal slurry pipelines. Zhang and Liu [223] developed an ELM-based system to predict the melt index in the propylene polymerization process. Zeng and Zhang [213] trained ELM by switching delayed particle swarm optimization (SDPSO) and applied it to short-term load forecasting. In their experiments, ELM with SDPSO outperformed the radial basis function neural network (RBFNN). Wang and Chen [181] employed grey wolf optimization to train KELM, and the trained model was used for bankruptcy prediction. Zou and Yao [247] put forward a convex bidirectional ELM (CB-ELM) to verify and predict the temperature and humidity in greenhouses. In CB-ELM, when a new node is added to the hidden layer, the output weight can be obtained by a convex optimization algorithm. Therefore, CB-ELM preserved the features of conventional ELM while improving the accuracy. Zhang and Wei [215] combined wavelet transform, principal component analysis and ELM for solar radiation forecasting. The parameters of the ELM were optimized by the bat algorithm. Sun and Duan [170] employed ELM optimized by PSO to predict the carbon price. Prates [138] proposed spatial ELM and applied it to the prediction of disease counts. Naz and Javed [121] proposed an electric load and price prediction system for smart grids based on an optimized ELM. Liu and Wang [99] leveraged wavelet transform and KELM for passenger flow prediction.

5.3 Representation/feature learning

ELM has been used by researchers for representation and feature learning. Ding and Zhang [30] proposed to apply ELM-AE to unsupervised ELM, with the output of the hidden layer serving as the features. Zhang and Ding [216] put forward a hybrid scheme for feature learning and classification, in which a deep belief network was used for feature learning and ELM was used for feature mapping. Zhu and Miao [244] employed ELM-AE as the learning unit in local receptive fields to construct a multilayer network. The low-level features were fed into the next layers to form high-level representations. Then, a block histogram was utilized to transform the obtained representations into translation- and rotation-invariant ones. Cheng and Liu [20] introduced ELM into K-SVD to improve it. First, a denoising deep ELM-AE was used to extract features. Then, K-SVD was utilized to generate a sparse representation from the features. The obtained representations can be used for classification and recognition. Experimental results showed that the proposed method was comparable with other state-of-the-art methods. Chen and Song [18] proposed an ELM-based feature transfer method. First, as in conventional ELM, random mapping was employed to generate features from the input for both source and target domain samples. Then, domain transfer was performed with the aim of minimizing the domain distance while preserving as much information in the target domain as possible. Finally, the transferred features were used to train an adaptive ELM for classification. Song and Li [165] proposed an improved ELM for sparse feature learning named OAL1-ELM. They added l1 regularization to the loss function of ELM, which was solved by the alternating direction method of multipliers (ADMM). OAL1-ELM can be trained sample by sample or in batch mode. Pei and Wang [135] used ELM for label-specific feature learning.
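An l1-regularized output layer of the kind mentioned above can be solved with the standard ADMM iteration for the lasso. The sketch below covers only the batch case and is not the OAL1-ELM algorithm of [165] itself: it minimizes 0.5·||Hβ − T||² + λ||β||₁ for a given hidden-layer output matrix H, and the penalty parameter `rho`, the iteration count and the function names are assumptions.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise shrinkage operator used by the l1 proximal step."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def l1_elm_output_weights(H, T, lam=0.1, rho=1.0, n_iter=200):
    """ADMM for min 0.5*||H beta - T||^2 + lam*||z||_1  s.t.  beta = z."""
    n_hidden = H.shape[1]
    # This matrix is constant across iterations, so invert it once.
    G = np.linalg.inv(H.T @ H + rho * np.eye(n_hidden))
    HtT = H.T @ T
    z = np.zeros_like(HtT)  # sparse copy of the output weights
    u = np.zeros_like(HtT)  # scaled dual variable
    for _ in range(n_iter):
        beta = G @ (HtT + rho * (z - u))           # ridge-like least-squares step
        z = soft_threshold(beta + u, lam / rho)    # l1 proximal step
        u = u + beta - z                           # dual update
    return z
```

The returned `z` is the sparse estimate of the output weights. Larger values of `lam` drive more of its entries to exactly zero, which is what makes the learned features sparse.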

5.4 Clustering

Researchers have also paid attention to ELM for clustering, which is an unsupervised learning problem. He and Jin [46] found that ELM feature mapping outperformed Mercer kernel function based methods for clustering. Alshamiri and Singh [3] employed the artificial bee colony algorithm to train ELM for feature projection. The proposed system can overcome the randomness in the initialization of the cluster centers and thus works more robustly. Huang and Liu [60] proposed a discriminative clustering method based on ELM and Fisher’s linear discriminant analysis (LDA). With a simple implementation, their approach achieved state-of-the-art performance. Peng and Zheng [137] put forward an improved unsupervised discriminative ELM (UDELM), which leveraged both local manifold learning and global discriminative learning. Huang and Yu [61] proposed to use Gaussian and sigmoid neurons in the ELM hidden layer for feature learning. The alternating direction method was employed to obtain the parameters. Liu and Lekamalage [100] combined ELM and joint embedding clustering. Zhang and Wu [235] exploited the random mapping ability of ELM and proposed a multi-view clustering scheme.
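The basic idea of clustering in the ELM feature space can be sketched as follows: the data are first mapped by a random, untrained hidden layer, and a conventional clustering algorithm is then run on the mapped features. This is only a generic illustration, assuming a sigmoid activation, plain k-means (Lloyd's algorithm) and a deterministic farthest-point initialization. It does not reproduce any specific method cited above, and the function names are hypothetical.

```python
import numpy as np

def elm_feature_map(X, n_hidden=30, seed=0):
    """Random ELM hidden layer used as an unsupervised feature mapping."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def kmeans(Z, k, n_iter=50):
    """Plain k-means with farthest-point initialization."""
    centers = [Z[0]]
    for _ in range(1, k):
        # Distance of every point to its nearest already chosen center.
        d = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster is empty
                centers[j] = Z[labels == j].mean(axis=0)
    return labels
```

Clustering then amounts to `labels = kmeans(elm_feature_map(X), k)`. Because the hidden layer is random, different seeds give different feature spaces, which is the source of the initialization sensitivity that some of the cited works try to reduce.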

5.5 Surrogate modeling

ELM has also been used as a meta-model or surrogate model. Pavelski and Delgado [134] leveraged ELM to solve multi-objective optimization problems (MOPs). Unlike conventional multi-objective evolutionary algorithms, their ELM surrogate-assisted method is less expensive, and it achieved state-of-the-art performance in simulation. Hao and Liu [44] proposed to convert the bottleneck stage scheduling problem into an expensive-to-evaluate optimization problem and used incremental ELM as the surrogate model. Then, a differential evolution algorithm was utilized to solve the problem. Kang and Li [73] employed an ELM-based surrogate model for the analysis of soil slope reliability. Artificial bee colony optimization was used to obtain the best values for the hidden weights and biases of the ELM. Pavelski and Delgado [133] proposed a multi-objective decomposition and differential evolution algorithm based on ELM. They tested the proposed method on three well-known benchmark datasets. Ghiasi and Ghasemi [40] proposed two algorithms to detect structural damage and leveraged 10 surrogate models, including ELM. Limited by space, we cannot cover all the subjects. A summary is presented in Table 7.
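The role of ELM as a surrogate can be illustrated with a one-dimensional toy problem: a small number of evaluations of an expensive objective are used to fit an ELM regressor, and the cheap regressor is then searched instead of the objective. Everything in this sketch is an assumption made for illustration, including the stand-in objective `expensive_objective`, the tanh activation, the ridge penalty and the grid search. None of the cited optimizers is reproduced.

```python
import numpy as np

def expensive_objective(x):
    """Hypothetical stand-in for a costly simulation or experiment."""
    return (x - 0.3) ** 2

def fit_elm_surrogate(x, y, n_hidden=15, reg=1e-6, seed=0):
    """Fit an ELM regressor to (x, y) samples and return it as a function."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((1, n_hidden))  # random hidden weights, not trained
    b = rng.standard_normal(n_hidden)
    H = np.tanh(x[:, None] @ W + b)
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)
    return lambda q: np.tanh(q[:, None] @ W + b) @ beta

# 1) Spend a small budget of true evaluations.
x_train = np.linspace(-1.0, 1.0, 40)
y_train = expensive_objective(x_train)

# 2) Fit the surrogate once; evaluating it is cheap.
surrogate = fit_elm_surrogate(x_train, y_train)

# 3) Search the surrogate densely and verify only the best candidate.
candidates = np.linspace(-1.0, 1.0, 1001)
x_best = candidates[surrogate(candidates).argmin()]
y_best = expensive_objective(x_best)
```

In a surrogate-assisted evolutionary algorithm, step 3 would be replaced by the evolutionary search, and the newly verified point would usually be added to the training set before the surrogate is refitted.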

Table 7 A summary of ELM tasks

6 Application fields of ELM

Due to its superiority in training speed, accuracy and generalization, the extreme learning machine (ELM) has been employed in many application fields such as medicine, chemistry, transportation, economy and robotics. In this section, we review some representative works in these application fields.

6.1 Medical application

Medical imaging generates a huge number of images every day. ELM is a popular tool in medical imaging applications, such as magnetic resonance imaging (MRI), computed tomography (CT), and mammography.

6.1.1 MRI

Computer-aided diagnosis, which employs machine learning approaches to detect diseases from medical images, is a hot research topic, and ELM has been used in a number of computer-aided diagnosis systems. Brain MRI analysis is one of the fields that attracts the most interest. Although many approaches have been applied to brain MRI analysis, ELM-based methods still show unique strengths. Peng and Lin [136] proposed an attention-deficit/hyperactivity disorder detection method based on ELM and structural MRI. Initially, 340 cortical features were obtained from brain MRIs and then reduced by sequential forward selection. Both SVM and ELM were trained for classification, and ELM was found to be better than SVM in generalization and robustness. Termenon and Graña [175] proposed an ELM-based cocaine dependency detection scheme for brain MRI. They first calculated volumes of correlation values and segmented the images into regions with the watershed algorithm. Then, they measured the region relevance by two methods. The ELM-based algorithm performed better than other classification algorithms in their experiments. Qureshi and Min [142] proposed a recursive feature elimination SVM based method for feature selection and trained a hierarchical ELM to detect attention-deficit/hyperactivity disorder. Termenon and Graña [176] first used ELM to generate region candidates and then employed a majority-voting-based classification strategy to detect Alzheimer’s disease. Lu and Lu [110] proposed a KELM-based pathological brain detection system for MRI and improved the performance with ELM variants. They first performed a 2-level 2D wavelet transform on brain images and computed entropies from the obtained sub-band coefficients to form the feature vector. Then, the features with their labels were sent into ELM for training and classification. 
Lama and Gwak [78] proposed a PCA-based method to extract features from brain MRIs and compared three different classification algorithms in their experiments: SVM, import vector machine (IVM) and regularized ELM. Simulation results suggested that regularized ELM performed best. Qureshi and Oh [143] put forward a schizophrenia diagnosis approach based on ELM and brain MRI. Independent component analysis was employed to obtain functional connectivity-based features. Cortex features along with shape and textural features were also calculated. Then, a hybrid feature concatenation method was utilized for feature reduction. Finally, ELM was employed for classification. Zhang and Zhao [231] employed oversampling to deal with the class imbalance problem in a brain MRI dataset and trained ELM by the Jaya algorithm for automatic diagnosis. They utilized wavelet packet Tsallis entropy as the image feature. Nayak and Das [119] utilized multilayer ELM as the classifier and trained the network with multiclass pathological brain images. Their diagnosis system can detect four different categories of brain diseases with competitive accuracy and at high speed. Nguyen and Ryu [122] used ELM to identify Alzheimer’s disease, mild cognitive impairment and the cognitively normal state based on brain MRI. In total, they combined ten 3D map features to form the feature vector. Leave-one-out validation and 10-fold cross validation were employed for evaluation. Gumaei and Hassan [42] employed a regularized ELM for brain tumor identification. They leveraged normalized GIST as the feature descriptor and combined it with PCA.

6.1.2 CT

Computed tomography (CT) images play a very important role in medical diagnosis. Huang and Tan [62] proposed a 3D liver CT segmentation method based on ELM. They treated every voxel as a sample and carried out voxel-level classification by ELM. Then, a morphological method was used to improve the segmentation results. Huang and Yang [63] used an ELM autoencoder for image feature pre-processing. They trained a set of ELMs and generated the final output by a majority voting ensemble. Ramalho and Filho [146] applied ELM to detect lung diseases in CT. First, the CT images were segmented by an adaptive contour model. Then, structural information was calculated from the segmented images as features. Finally, an ELM was trained to distinguish two types of lung diseases from healthy controls. Zhu and Huang [243] developed a liver tumor detection and segmentation system for CT based on ELM. To alleviate the overfitting problem, they employed data ensemble and feature ensemble. A sequential kernel learning strategy was utilized to improve the segmentation ability.

6.1.3 Ultrasound

As one of the most commonly used medical imaging technologies, ultrasound has irreplaceable value in the diagnosis and assessment of many diseases, including thyroid nodules. Plenty of methods have been adopted to classify thyroid nodules as malignant or benign [66]. According to some research, ELM-based methods outperformed other algorithms owing to their accuracy and speed. In 2017, Xia and Chen [195] claimed to be the first to investigate the potential of ELM in thyroid nodule classification and achieved promising experimental results, which indicated that ELM could be successfully applied to this task. In 2019, Cai and Gu [9] proposed a scheme for different machine learning tasks, including thyroid nodule classification. They improved the kernel extreme learning machine with an enhanced grey wolf optimization strategy, in which a new hierarchical mechanism was constructed to boost the stochastic behavior. Their experimental results demonstrated its advantages and efficiency.

6.1.4 RNA classification

Research on RNA helps humans explore the mysteries of life. RNA classification could enable scientists to investigate tumor formation and life mechanisms, including growth, development, and immunity, at a deeper level [70]. Thus, scientists have put effort into developing methods for RNA classification and recognition with machine learning technologies, ELM included. In 2018, Chen and Zhang [16] presented an ELM-based approach for classification between cirRNAs and lncRNAs. They first applied feature selection to construct a feature representation framework extracted from the graph, sequence, and conservation properties of RNA. Then they utilized a hierarchical extreme learning machine as the classifier and claimed to obtain convincing results. In 2020, Niu and Zhang [123] proposed a method for the recognition of cirRNAs based on an improved extreme learning machine. They applied the particle swarm optimization algorithm to improve the performance of the ELM classifier. According to the experimental results, their method showed potential for future research.

6.1.5 EEG

The electroencephalogram (EEG) is regarded as the most effective way to monitor the status of patients with epilepsy. Among many machine learning algorithms, ELM-based methods have been widely applied to seizure detection from EEG because of their efficiency and simplicity. In 2011, Yuan and Zhou [212] proposed an approach for epileptic EEG classification. They adopted ELM to train single hidden layer feedforward networks with extracted nonlinear features such as approximate entropy, the Hurst exponent, and the detrended fluctuation analysis exponent. They compared the ELM-based approach with BP and SVM and concluded that the ELM-based approach achieved the best experimental results. Song and Crowcroft [166] presented an improved ELM-based approach for automated seizure detection with an optimized sample entropy algorithm. They claimed that their approach not only obtained outstanding accuracy but also inherited the fast speed of ELM, which made it suitable for building a real-time detection system. In 2016, Song and Hu [164] put forward an automatic epileptic detection approach based on EEG. They fused features at several levels, including both the Mahalanobis-similarity-based feature and the sample-entropy-based feature, and employed ELM to build a seizure detector on this fused feature. Their experiments demonstrated the efficiency of this approach. Zhou and Tang [241] proposed an ELM-based method for epileptic EEG classification combined with cellular automata and achieved better results than BP and SVM.

6.1.6 Mammogram

Vani and Savitha [178] applied ELM to abnormality detection in mammograms. They initially selected 15 different shape and textural feature descriptors for feature extraction and gradually reduced the number of features in their experiments. They found that 9 features performed best. Malar and Kandaswamy [111] combined the grey level spatial dependence matrix with wavelet features to form the feature vector and used ELM to detect breast cancer based on those features obtained from mammograms. Wang and Yu [187] used ELM to detect breast cancer in mammography. For segmentation, wavelet transform, region growing and morphology were used. Then, textural and morphological features were obtained from the segmented images and sent into ELM for training and classification. Xie and Li [197] first segmented the masses in mammograms by a level set model. Then, multi-dimensional features were extracted and reduced by an SVM and ELM based feature selection method. Finally, those features were sent into an ELM classifier for training. The trained ELM can accurately classify the masses as benign or malignant. Wang and Qu [186] put forward a breast tumor detection method based on ELM and mammograms. They fused geometric and textural features from two views of the mammogram and optimized the feature model with feature selection. ELM served as the classification algorithm. Hu and Yang [51] proposed a hidden Markov tree model of the dual-tree complex wavelet transform for feature extraction from mammograms. They also optimized the features by a genetic algorithm. ELM was trained to classify the samples as malignant or benign. Wang and Li [185] employed ELM for breast cancer detection. They first detected masses based on deep CNN features and unsupervised ELM features. Then, several different features were fused to form the feature vector. Finally, an ELM was trained to classify the masses as benign or malignant. A summary is presented in Table 8.

Table 8 Application summary in medical imaging

6.2 Chemistry application

ELM has also been applied in other fields, such as chemistry. Geng and Dong [38] proposed a self-organizing ELM to predict variables in chemical processes. They first tuned the input weights by the Hebbian rule and optimized the structure using mutual information. Then, the output weights were obtained by the pseudo-inverse. Experimental results on UCI data revealed that their method generalized better than the original ELM. Qin and Li [139] applied ELM to evaluate green management in power generation enterprises in China. First, an evaluation indicator system for low-carbon sustainability was created by an improved dynamic hesitation degree. Then, an ELM with an RBF kernel was trained to implement the comprehensive evaluation. Their model achieved good results in experiments and can be used in similar projects. In 2013, Wang and Jin [179] introduced ELM into the prediction of nonlinear optical properties. In 2017, You and Zhou [208] employed ELM in the prediction of protein-protein interactions (PPIs) and claimed to obtain satisfactory performance. In 2018, You and Zhou [208] designed and implemented experiments to compare the performance of different kinds of classifiers in predicting protein secondary structure. According to their results, ELM had the fastest training speed, but the support vector machine (SVM) was the most accurate. Cao and Zhu [12] also compared various approaches to predict the toxicity of ionic liquids and claimed that ELM had the best performance, beating multiple linear regression (MLR) and SVM. Kang and Liu [74] made a similar comparison for predicting the Henry’s law constant (HLC) of CO2, and ELM again performed best among several competitors, including MLR and SVM. In 2019, Lei and Wen [80] proposed a method for predicting protein-protein interactions based on a regularized extreme learning machine and claimed to achieve promising performance. 
Further, in 2020, Li and Shi [83] presented a modified method to predict protein-protein interactions. They adopted a weighted extreme learning machine as the basis and combined it with the scale-invariant feature transform to improve the performance of the model, and they reported good results. Ouyang and Wang [129] applied ELM to measuring NOx in vehicle exhaust and achieved satisfactory experimental results. More researchers continue to put effort into applying ELM-based methods to chemical analysis [162, 180].

6.3 Economy application

ELM has also been applied to economic problems. In 2012, Landa-Torres and Ortiz-Garcia [79] proposed an ELM-based method to evaluate the internationalization success of companies. In 2016, Sokolov-Mladenović and Milovančević [163] applied ELM to trade, import and export data to predict economic growth and claimed to obtain convincing results. Marković and Petković [113] also adopted ELM for similar research but used national data on science and technology development and investment. Moreover, Milačić and Jović [116] employed ELM to predict economic growth with data on agriculture, manufacturing and industry. They used an artificial neural network (ANN) trained with back propagation (BP) as the baseline and concluded that ELM was superior to it. Further, Rakic and Milenkovic [145] applied ELM to forecast economic growth based on the information technology level of nations and also claimed to obtain promising results. These studies indicated that ELM could be applied successfully to forecasting economic growth from statistical data.

The energy industry is one of the most crucial sectors of the economy, and energy consumption reflects the vitality of the economy. Thus, scientists often study trends in economic development through analysis of the energy industry and its consumption. For example, the emission of carbon dioxide (CO2) is a significant indicator of energy consumption during economic activities. Many researchers have applied ELM to analyzing CO2 emission in order to predict economic development [112, 161, 171]. Besides carbon emission data, scientists have also utilized demand and price data from the energy resources market to forecast economic growth via ELM [22, 153, 172] and showed that ELM was well suited to these economic analysis settings. Beyond energy, in 2019, Shoumo and Dhruba [160] carried out a comparative study of various machine learning models for credit risk assessment, including ELM, SVM, random forest and logistic regression. ELM showed its advantages in speed and simplicity among these methods.

6.4 Transportation application

Transportation is a hot topic for applied machine learning. For example, scientists have utilized machine learning techniques to build driver drowsiness monitoring systems that help avoid dangerous driving and save lives [69]. ELM has long been applied to problems in the transportation field. Sun and Ng [173] proposed a two-stage approach that combined linear programming (LP) and ELM to optimize transportation systems. Their experiments showed that the combined approach could extend the lifetime of the transportation system and increase its reliability. In 2016, Liu and Yang [101] studied semi-supervised methods for real-time driver distraction detection. They analyzed the Laplacian support vector machine and the semi-supervised extreme learning machine and concluded that these two semi-supervised methods could reduce the cost of labeled training data in building a driver distraction detection system. In 2017, Liu and Yan [93] applied a radial basis function in ELM to predict road surface temperature and achieved promising experimental results. Oneto and Fumeo [126] applied ELM to predicting dynamic delays in large-scale railway networks. Zeng and Xu [214] adopted ELM to recognize traffic signs with deep features. Zhang and He [229] employed ELM to analyze traffic accidents based on video data. In 2018, Wang and Chow [182] applied ELM to large-scale GPS data to help taxi drivers search for the best routes. Similarly, more researchers have utilized ELM to predict traffic flow for drivers and governments [94, 199], which has become a popular trend in transportation research.

6.5 Robotics application

Because of its fast training speed, ELM has been widely employed in robotics applications. In 2015, Sun and Yu [168] proposed an ELM-based method for object grasping detection. They applied ELM as the basis of their cascade classifier and adopted histograms of oriented gradients (HOG) as features. Their method was reported to achieve better performance than other benchmark methods. In 2016, Alcin and Ucar [2] applied ELM to robotic arm control and claimed that their approach was superior to an artificial neural network (ANN) model. In 2018, She and Hu [156] introduced ELM into the field of brain-computer interfaces. They utilized ELM to classify electroencephalographic (EEG) signals and obtained promising experimental results. In 2019, Duan and Ou [32] applied ELM to build a dynamical system for robotic motion control. Based on their experiments, they claimed that the efficiency and accuracy of ELM gave their approach better performance than other approaches.

6.6 IoT Application

As the Internet of Things (IoT) has attracted continuous attention from both academia and industry in recent years, more and more researchers have developed IoT approaches and applications with modern information technologies [71]. Many of these approaches apply ELM. Rathore and Park [148] proposed an ELM-based approach for the detection of cyber-attacks. They built an attack detection framework based on fog computing and adopted an improved ELM as the classifier to differentiate attacks from normal visits. According to their experiments, the proposed approach was effective. Yin and Dong [207] presented an algorithm that uses ELM to assign bug-fixing jobs to engineers. They combined diversified feature selection with ELM to build the classifier and claimed to achieve promising results. In 2019, Li and Wei [89] analyzed DDoS attacks and proposed joint entropy features, which they combined with ELM to detect DDoS attacks. The experimental results showed that this method was efficient and rapid. In 2020, Zheng and Hong [238] put forward an approach for intrusion detection that combined several useful schemes, including ELM. They first applied improved linear discriminant analysis (LDA) to reduce the feature dimensions. The approach then adopted a single hidden layer feedforward neural network, optimized by ELM, as the classifier to process the dimension-reduced data. The experimental results demonstrated that their approach achieved outstanding real-time performance and generalization.

6.7 Geography application

ELM has also been applied in geography. Huang and Yin [52] used ELM to generate landslide susceptibility indexes. First, a rough index map was obtained by a self-organizing map (SOM) network, and the areas with very low susceptibility were removed. Then, the final landslide susceptibility index map was generated by ELM. Their SOM-ELM outperformed single ELM and SOM-SVM on real data. Wei and Liu [189] evaluated the stability of rubble-mound breakwaters using ELM. They trained 5000 different ELMs and selected the best one based on testing performance. Wei and Li [188] used ELM to predict sediment-carrying capacity. They used 30 sets of data for training and 4 sets of data for validation. The input vector of the ELM included surface velocity, slope and settling velocity, and the output was the sediment-carrying capacity. A detailed discussion of the parameters was presented. Niu and Feng [124] developed four intelligent systems for deriving the operation rules of hydropower reservoirs, based on multiple linear regression, artificial neural network, SVM and ELM. From the simulation results, they found that the artificial neural network, SVM and ELM performed better than multiple linear regression. Li and Chen [85] presented an efficient method for hyperspectral imagery classification. They adopted local binary patterns to extract local features such as edges and corners and then combined feature-level fusion and decision-level fusion. With a classifier built on ELM, they reported promising results that were better than those of other traditional methods.
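Because the hidden layer of an ELM is random, training many candidates and keeping the best one is cheap. The sketch below illustrates this selection strategy in NumPy under stated assumptions: it is not the setup of [189], it uses far fewer candidates, and it scores the candidates on a separate validation split, so that a test set could still be held out for an unbiased final estimate. The function names, the sigmoid activation and the ridge penalty are hypothetical choices.

```python
import numpy as np

def train_elm(X, y, n_hidden, seed, reg=1e-6):
    """ELM regressor: random hidden layer, closed-form output weights."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def elm_predict(model, X):
    W, b, beta = model
    return (1.0 / (1.0 + np.exp(-(X @ W + b)))) @ beta

def select_best_elm(X_tr, y_tr, X_val, y_val, n_candidates=50, n_hidden=10):
    """Train one ELM per seed and keep the one with the lowest validation MSE."""
    best_model, best_err, errors = None, np.inf, []
    for seed in range(n_candidates):
        model = train_elm(X_tr, y_tr, n_hidden, seed)
        err = float(np.mean((elm_predict(model, X_val) - y_val) ** 2))
        errors.append(err)
        if err < best_err:
            best_model, best_err = model, err
    return best_model, best_err, errors
```

The spread of `errors` also gives a quick picture of how sensitive the model is to the random initialization of the hidden layer.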

6.8 Food Industry application

ELM is a popular tool in food safety. Geng and Zhao [39] proposed a food safety monitoring system based on ELM. The analytic hierarchy process (AHP) was used to generate process characteristic information (PCIs), which was then fed into ELM for prediction. The prediction results, together with the PCIs, can provide better information for food safety monitoring. Ouyang and Chen [128] proposed an approach for detecting the amino acid nitrogen content of soy sauce. They applied near-infrared spectroscopy followed by feature selection. Using ELM as the classifier, they claimed that their approach outperformed other popular methods for building an amino acid nitrogen detection system for soy sauce quality assessment. da Costa and Llobodanin [25] used a similar combination of feature selection and ELM to classify different kinds of wine. Liu and Li [105] demonstrated that ELM was also suitable for prediction in large-scale food sampling analysis. Zhang and Zhou [228] applied ELM to the prediction of dairy food safety from big data and reported that ELM was superior to other methods such as the BP neural network and the support vector machine. Guo and Ma [43] used impact acoustic signals to collect features of wheat kernels and then fed the features to an ELM classifier to distinguish damaged kernels from undamaged ones. The experimental results showed that their ELM-based approach could successfully build a wheat kernel classification system.

6.9 Other interdisciplinary application fields

Control system: Zhang and Ma [230] put forward an error-minimized regularized OS-ELM for estimating time-varying parameters in model-free adaptive control. The improved ELM algorithm can train the parameters and optimize the network structure simultaneously. Xu and Lei [200] proposed an ELM-based adaptive neural control method for pure-feedback systems. They first transformed the system into a nonaffine system by an equivalent transformation. Then, the system state variables were estimated by a finite-time-convergence differentiator. Finally, ELM was employed to estimate the unknown term of the nonaffine system. Wong and Huang [192] leveraged ELM for automotive engine idle speed control. Based on Lyapunov analysis, a novel adaptive law was combined with ELM. Simulation results suggested that their method outperformed other state-of-the-art approaches.

Video analysis: Yi and Dai [206] proposed a moving cast shadow detection method based on ELM. First, pixel-level and region-level features from different channels were combined. Then, an ELM was trained to classify each pixel as shadow or non-shadow. Rajpal and Mishra [144] proposed a watermarking algorithm based on bi-directional ELM. Depending on whether the number of hidden nodes was odd or even, the bi-directional ELM used two different methods to update the parameters so as to drive the error toward zero. A threshold-based fuzzy algorithm was employed to select key frames, and the bi-directional ELM was trained to produce the LL3 sub-band coefficients of the blue channel for watermarking. Their scheme was fast enough for real applications. Yu and Song [210] developed an automatic violent scene detection system based on KELM. A 3D histogram of gradient orientation (HOG) was used as the feature extractor, and visual words were generated by K-means clustering. The combined features were fed into KELM for classification. Experimental results showed that the method was effective and efficient.
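
The KELM classifier used in the system of Yu and Song [210] has a closed-form solution, which the sketch below illustrates. It is not their pipeline: it assumes the standard KELM formulation with an RBF kernel, and the toy two-class data (standing in for extracted video features), the parameters `C` and `gamma`, and the function names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def train_kelm(X, T, C=10.0, gamma=1.0):
    """Solve (I / C + K) alpha = T, where K is the kernel matrix of the training set."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(np.eye(len(X)) / C + K, T)

def predict_kelm(X_new, X_train, alpha, gamma=1.0):
    """Score new samples as K(x, X_train) @ alpha and take the arg-max class."""
    return np.argmax(rbf_kernel(X_new, X_train, gamma) @ alpha, axis=1)

# Toy two-class data standing in for extracted features (hypothetical)
rng = np.random.default_rng(0)
X0 = rng.normal(loc=-2.0, scale=0.5, size=(40, 2))
X1 = rng.normal(loc=2.0, scale=0.5, size=(40, 2))
X = np.vstack([X0, X1])
labels = np.array([0] * 40 + [1] * 40)
T = np.eye(2)[labels]                      # one-hot targets

alpha = train_kelm(X, T)
train_acc = np.mean(predict_kelm(X, X, alpha) == labels)
centers_pred = predict_kelm(np.array([[-2.0, -2.0], [2.0, 2.0]]), X, alpha)
```

No random hidden layer appears here because the kernel replaces the explicit feature mapping; the regularization constant `C` and the kernel width `gamma` would normally be tuned on held-out data.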

7 ELM controversies

As two anonymous reviewers pointed out, ELM has been the subject of some controversy. The idea of randomizing some of the weights to reduce the complexity of weight learning is not new. It was employed in reservoir computing in 2002 and in convolutional learning in 1989. The Schmidt neural network (SNN) of Schmidt and Kraaijveld [154] and the random vector functional link net (RVFL) of Pao and Park [132] were both proposed more than ten years before ELM. Both SNN and RVFL used random hidden weights and biases, and ELM can be regarded as a special form of RVFL or of the radial basis function neural network. Moreover, ELM is almost the same as SNN: the only difference is that SNN has trainable output biases, whereas the output biases of ELM are always zero.
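
The stated difference between ELM and SNN, a trainable output bias, can be made concrete with a short sketch. This is an illustration under assumptions, not either original formulation: both variants share the same random hidden layer, the output parameters are solved by ordinary least squares, and the data and names (`hidden_output`, `beta_elm`, `beta_snn`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression task with a large constant offset (hypothetical data)
X = rng.uniform(-1.0, 1.0, size=(100, 2))
y = 5.0 + np.sin(3.0 * X[:, 0]) + X[:, 1]

# Shared random hidden layer: weights and biases are drawn once and never trained
n_hidden = 10
W = rng.normal(size=(2, n_hidden))
b = rng.normal(size=n_hidden)

def hidden_output(X):
    """Hidden-layer activations for the fixed random weights W and biases b."""
    return np.tanh(X @ W + b)

H = hidden_output(X)

# ELM-style output layer: y ~ H @ beta, with no output bias
beta_elm, *_ = np.linalg.lstsq(H, y, rcond=None)
rmse_elm = np.sqrt(np.mean((H @ beta_elm - y) ** 2))

# SNN-style output layer: y ~ H @ beta + b_out, bias fitted via an extra column of ones
H_aug = np.hstack([H, np.ones((len(X), 1))])
beta_snn, *_ = np.linalg.lstsq(H_aug, y, rcond=None)
rmse_snn = np.sqrt(np.mean((H_aug @ beta_snn - y) ** 2))
```

Because the augmented matrix contains every column of `H`, the least-squares training error with the output bias can never exceed the error without it. Whether the gap matters in practice depends on the data and on the hidden layer.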

8 Conclusion

In this review, we presented the development and applications of ELM over the last decade. We described the principle and theoretical analysis of ELM and reported a wide range of methods for improving its performance on specific learning tasks. Owing to its fast training, good generalization, and easy implementation, ELM is often preferred over other state-of-the-art approaches for artificial intelligence problems in a variety of fields. Through this survey, we identified several interesting directions and open problems for future research.

ELM has the potential to play a more important role in big data. The invention of deep learning structures and effective training algorithms has made it possible to solve big data machine learning problems. Deep learning has achieved astonishing performance, but its computational complexity is high and training often takes a long time. ELM shows the potential of random mapping, which can achieve the same classification performance as trained parameters do. Therefore, it will be interesting to combine deep learning and ELM to learn from big data. Such a hybrid network may be able to match the results of deep learning with less training time.

The theoretical justification of ELM continues to evolve. Random mapping in the hidden layer is the most significant idea in ELM: it enables universal approximation and contributes to the extremely fast training speed. However, justifying the randomness remains an open problem. It may be helpful to compare ELM with other randomized methods, such as random forests.

Another promising direction for ELM development may be its connection to biological learning. Researchers have found that neurons in the brain also use random mechanisms for learning [17]. Since artificial neural networks were originally inspired by the brain, it will be interesting to investigate the underlying connection between ELM and biological learning.