1 Introduction

Artificial neural networks are among the fastest-growing areas of artificial intelligence. Existing deep models can solve tasks such as visual object recognition or automatic text translation, often achieving human-level performance. These impressive results, however, frequently require weeks of training and substantial computational power. High training times are a consequence of the most commonly used training strategy: numerical optimisation with iterative methods and the backpropagation algorithm, whose purpose is to determine the error gradient for each parameter of the network. We will refer to networks trained this way as classical or standard networks. High hardware requirements and long training times are frequent obstacles to applying neural networks in practice. These difficulties explain our interest in searching for an alternative solution. In this work, we examine randomisation-based neural networks, which offer significantly lower training times and are claimed to perform on par with classical networks. Specifically, we focus on extreme learning machines (ELM), a special case of random vector functional-link (RVFL) networks.

1.1 Motivation

The motivation for this research stems from our search for models that solve real classification and regression problems and are well suited for practical application. We limit our interest to inputs in the form of vectors and raw images; we do not consider sequential data.

Random vector functional-link networks were proposed in [32], and their characteristics were discussed in [33]. Since then, they have been developed and evaluated further [11, 47, 50]. RVFL is a feedforward network with a single hidden layer and direct links between the input and output layers that bypass the hidden layer. Its most important characteristic is that the hidden-layer weights are set randomly and are not optimised; the output-layer weights are the only trainable parameters.

The authors of [13] proposed extreme learning machines. ELM follows the same principle as RVFL; in fact, it is a special case of RVFL in which the direct input-output links are disabled [5]. ELM is therefore architecturally identical to a commonly used classical feedforward network with a single hidden layer; the only difference is the training method. The authors of ELM claim that computing the ELM weights takes significantly less time than standard network training, with comparable or even better performance. Extreme learning machines have served as a baseline for many further models, such as stacked ELM [51], on-line sequential ELM [14], LRF-ELM designed for image classification [17], and biased dropout ELM [25]. ELM can also be considered a non-recurrent equivalent of reservoir computing (RC) models [40]. Because random units construct features (to deal with nonlinearly separable tasks) without learning, all remaining weights can be found in closed form, so no iterative numerical optimisation is required; this makes the model a natural candidate to overcome the time-consuming training of classical neural networks.

The use of random parameters in randomisation-based networks may raise concerns about degraded performance. However, the authors of [16] prove that a sufficiently large number of random hidden features allows such networks to learn effectively. This sounds very promising, but our literature survey did not reveal an in-depth comparison of backpropagation-based and randomisation-based learning, for either ELM or RVFL. There are, however, detailed comparisons between RVFL and ELM [9, 43, 50], in which these models are compared on multiple datasets for both classification and regression. These comparisons show that direct links between the input and output layers improve classification performance and have no significant effect on regression quality. Nonetheless, these findings do not show whether randomisation-based learning offers a true alternative to commonly used backpropagation-based learning algorithms.

The original paper on extreme learning [13] contains a simple comparison between ELMs and networks trained with backpropagation. The comparison covers two datasets, Diabetes for classification and California Housing for regression, both from the UCI repository [8]. The study ensures identical architectures for both models, so the only property compared is the training algorithm. A similar comparison was conducted on the Forest Type dataset, but in that case the ELM contained twice as many hidden neurons as the regular network, so the comparison lacks objectivity. Unfortunately, the authors do not specify the hyperparameter set-up, e.g. the stopping criterion for the backpropagation-based method, which makes the results non-reproducible. Further works on extreme learning include more comprehensive comparisons [16]. They cover more than 10 regression datasets obtained from the University of Porto repository [46] and several classification datasets, and the hyperparameter set-up is described in more detail than in the original paper on extreme learning [13].

Methods that improve the basic RVFL and ELM models [25, 47] usually use their immediate predecessors or the plain RVFL and ELM as baselines, omitting backpropagation-based networks.

When proposing the LRF-ELM model, the authors of [17] compare it to a regular convolutional network on only a single dataset, NORB, so conclusions from such experiments are far from general. Papers on extensions of extreme learning usually do not present any comparison to other training methods [53]; they focus on showing the difference between the proposed extension and the basic ELM. Other works comparing ELMs to regular deep learning models often use outdated or inadequate network architectures. For example, the authors of [21] performed a comparison on an image classification task without using convolutional networks. When that work was published (2013), CNNs were just emerging; nowadays, they are considered the primary choice for image recognition (classification). Despite the dynamic evolution of CNNs, studies comparing them to ELMs are scarce.

None of the referenced works analysed the impact of the hyperparameter set-up on ELM performance. It therefore seems worthwhile to explore some basic hyperparameters (the number of hidden neurons, the choice of activation function, and the value of the regularisation coefficient), because they can profoundly influence the results. Only [15] mentions that “Since the input weights and hidden biases of ELM are randomly generated instead of being fine-tuned, ELM usually needs more neurons than other learning algorithms”. Unfortunately, this claim is not supported by extensive research.

To summarise, our choice of using extreme learning machines for comparison was based on three factors:

  • A common use of ELM as a baseline solution among random weights networks;

  • A one-to-one correspondence between ELM and classical neural network architectures;

  • A scarcity of comparisons between ELM and classical neural networks.

1.2 Contribution

Our research was inspired by the need for practical neural network applications in classification and regression tasks. Training standard neural networks often requires computational power exceeding the capabilities of most research and development centres. A natural solution lies in alternative training methods. Because we did not find any comprehensive comparison of either ELM or RVFL with classical networks, we wanted to verify objectively whether, and to what extent, ELM can be competitive with classical feedforward networks trained with backpropagation (using mini-batch SGD) on demanding classification and regression tasks.

We performed the comparison twice. In the first series of experiments, following the approach from [13], the network's and the ELM's capacities (numbers of neurons) are comparable; in this case, we compare the two models purely in terms of the training procedure. In the second series, we tried to reflect a machine learning practitioner's approach, i.e. we tuned each model for a given dataset. For classical networks, we can draw on the extensive literature and publicly shared implementations. For ELMs, this is usually impossible, but short training times allow us to optimise hyperparameters efficiently. An essential premise of all experiments is to cover a vast and varied selection of datasets. We limited the datasets to classification and regression tasks from public repositories (UCI [8] and the University of Porto repository [46]), current image classification benchmark datasets, and artificially created tasks, skipping text and audio datasets. We can summarise our contribution as follows:

  • comparison of the models' efficiency on more than 50 datasets for both training methods, in terms of the achieved prediction quality and training times, in two scenarios: models with comparable capacity and models with well-suited, literature-supported architectures,

  • implementation of both models available at: https://github.com/mkosturek/extreme-learning-vs-backprop,

  • statistical analysis of the results, which ensures that the conclusions drawn from the experiments are credible,

  • evaluation of the decision boundaries of both models for several runs,

  • ELM’s hyperparameter sensitivity analysis,

  • qualitative comparison focused on the practical application of both methods,

  • formulation of postulates concerning the directions in which ELM should be developed to handle current demanding datasets.

These contributions help machine learning researchers and practitioners decide which model to use depending on the dataset size and what results to expect regarding training time and model efficiency. Our research also gives some insight into the decision boundaries of both models. This is crucial given the growing need to process massive datasets. Our contributions can also assist researchers working on random weights networks by showing the current challenges and obstacles in practical applications of ELM models and the need to compare enhanced ELM models to classical neural networks. Finally, the ELM hyperparameter sensitivity analysis helps in defining how to set hyperparameter values properly.

1.3 Paper structure

The paper consists of seven sections. Section 1 presents an introduction to the topic of this study. In Sect. 2, we characterise the two compared approaches to training neural networks: backpropagation in Sect. 2.1 and extreme learning in Sect. 2.2. In Sect. 3, we depict the procedure we adopted to conduct the experiments, and in Sect. 4 we describe the datasets used. Section 5 presents the experimental results and the conclusions derived directly from them. In Sect. 6, we share our insights from the experimental study and present postulates about the future development of ELMs. Finally, in Sect. 7, we summarise all conclusions drawn from this research.

2 Description of compared models

In this section, we briefly present both models, describing their architectures and training methods. We focus on classification and regression, as they are the two most common tasks in real applications. Solving these problems requires a training set used to find the parameters \(\theta\) (weights and biases) of the models. The training set consists of pairs \(<\mathbf {x_i}, \mathbf {y_i}>\), i.e. an input vector and the corresponding output vector (for classification, outputs are one-hot encoded). In the case of regression, the corresponding output is a scalar \(y_i\). Formally, the task is solved by a function \(f({\mathbf {x}};{\theta })\) implemented by a classical neural network or an ELM. For ELM, the parameters \(\theta\) are calculated in one step, in contrast to a classical network, where training is an iterative procedure that optimises a loss function \({\mathcal {L}}(f(\cdot ;\theta ), {\mathbf {x}}, {\mathbf {y}})\). Given the model parameters, the output \({\hat{{\mathbf {y}}}({\mathbf {x}})}\) specifies the predicted class or, for regression, the predicted value.

2.1 Classical neural networks

An integral building block of each neural network is the neuron. Its operation is formally described by Eq. 1.

$$\begin{aligned} {\hat{y}}({\mathbf {x}}) = f\left( \sum _{i=1}^N w_i x_i + b\right) \end{aligned}$$
(1)

where \({\hat{y}}\) is the neuron's output, N is the number of inputs, \(x_i\) is the value of the ith input, \(w_i\) is the weight assigned to the ith input, b is the bias, and f is an activation function.

Each neuron has many inputs represented by a vector \({\mathbf {x}} = [x_1, \dots , x_N]\). The weighted sum of the input signals forms the neuron's total input activity. The weights \({\mathbf {w}} = [w_1, \dots , w_N]\), one per input connection, are tuned during training. The bias b is a weight assigned to an additional input whose signal is always set to 1. The output signal \({\hat{y}}\) is produced by transforming the total input with the activation function f.
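As an illustration, a minimal NumPy sketch of a single neuron computing Eq. 1 could look as follows (the sigmoid activation and the input values are arbitrary example choices):

```python
import numpy as np

def sigmoid(z):
    # Example activation function f
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b, f=sigmoid):
    # Eq. 1: weighted sum of the inputs plus bias, passed through the activation f
    return f(np.dot(w, x) + b)

# Example with N = 3 inputs
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.05
print(neuron_output(x, w, b))
```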

In this paper, we consider layered, feedforward network architectures. The first one is the multilayer perceptron (MLP), where neurons are fully connected in neighbouring layers. Further, we will also describe the convolutional neural network (CNN).

MLP is composed of an input layer, one or many hidden layers, and an output layer. The number of hidden layers and the number of neurons in each layer are hyperparameters of a network. The number of outputs depends on the problem solved by the network (classification or regression). For simplicity, in the description below we assume the simplest MLP network with one hidden layer. The signal processing in the network can be described as a sequence of matrix operations for all patterns in the dataset \({\mathbf {X}}\) given in Eq. 2

$$\begin{aligned} {\mathbf {H}}({\mathbf {X}}) = f({\mathbf {X}}{\mathbf {W}}^{(h)} + b^{(h)}) \end{aligned}$$
(2)

where \({\mathbf {W}}^{(h)}\) is a matrix of hidden layer’s parameters, \(b^{(h)}\) is a vector of hidden layer’s bias values, and f is an activation function. The network output is defined by Eq. 3.

$$\begin{aligned} \mathbf {{\hat{Y}}}({\mathbf {X}}) = f(\mathbf {H(X)}{\mathbf {W}}^{(o)} + b^{(o)}) \end{aligned}$$
(3)

where \({\mathbf {W}}^{(o)}\) and \(b^{(o)}\) are a matrix of weights and vector of biases in the output layer.

There are many different activation functions; commonly used ones are the sigmoid, the hyperbolic tangent, the step function, ReLU [19, 30], the linear function, and softmax in the output layer for classification problems.
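For illustration, a minimal NumPy sketch of the forward pass from Eqs. 2 and 3 for a one-hidden-layer MLP, with ReLU in the hidden layer and softmax at the output as example choices, might be:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    # Row-wise softmax for a batch of output vectors
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mlp_forward(X, W_h, b_h, W_o, b_o):
    H = relu(X @ W_h + b_h)          # Eq. 2: hidden layer activations H(X)
    return softmax(H @ W_o + b_o)    # Eq. 3: network outputs

# Example: 4 samples, 3 features, 8 hidden neurons, 2 output classes
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W_h, b_h = rng.normal(size=(3, 8)), np.zeros(8)
W_o, b_o = rng.normal(size=(8, 2)), np.zeros(2)
print(mlp_forward(X, W_h, b_h, W_o, b_o))
```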

Another type of neural network considered in the comparison is the convolutional neural network (CNN) [24]. In convolutional layers, neurons are not fully connected: a given neuron is connected only to a defined local subset of neurons in the preceding layer. Moreover, the weights assigned to these connections are shared between the neurons of a single feature map of the convolutional layer. CNNs are widely used in image processing because they learn to extract the complex image features that best serve the task at hand. Each convolutional layer defines a kernel (or filter), a matrix of assumed size, significantly smaller than the image resolution. The filter values are tuned during training; they correspond to the neurons' connection weights. A single convolutional layer may consist of multiple filters, each producing a separate feature map. While processing an image, the convolutional filters are moved across the image stepwise by a constant number of pixels, and the convolution operation is calculated. It defines the total activation of one neuron in a convolutional layer, Eq. 4.

$$\begin{aligned} y_{kij}({\mathbf {X}}, {\mathbf {W}}) = \sum _{p=1}^P\sum _{q=1}^Q w_{kpq}\ x_{i+p, j+q} \end{aligned}$$
(4)

where \({\mathbf {X}}\) is the input image, \({\mathbf {W}}\) denotes the set of convolutional filters (of size \(K\times P\times Q\)), K is the number of filters (feature maps), \(P\times Q\) is the size of a single convolutional filter, and i, j are the coordinates of a single neuron.
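A direct, unoptimised NumPy sketch of Eq. 4 for a single-channel input (a naive loop implementation with stride 1 and no padding, for illustration only) might be:

```python
import numpy as np

def conv2d_single_channel(X, W):
    # X: input image, W: filters of shape (K, P, Q); stride 1, no padding
    K, P, Q = W.shape
    h_out, w_out = X.shape[0] - P + 1, X.shape[1] - Q + 1
    Y = np.zeros((K, h_out, w_out))
    for k in range(K):
        for i in range(h_out):
            for j in range(w_out):
                # Eq. 4: sum over the P x Q window weighted by filter k
                Y[k, i, j] = np.sum(W[k] * X[i:i + P, j:j + Q])
    return Y

# Example: an 8x8 image and 2 filters of size 3x3 give 2 feature maps of size 6x6
rng = np.random.default_rng(0)
print(conv2d_single_channel(rng.normal(size=(8, 8)), rng.normal(size=(2, 3, 3))).shape)
```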

Similarly to the MLP, the convolution output values are processed by an activation function, and then pooling is applied. Its role is to decrease the size of a feature map. It is usually implemented as the maximum (max-pooling) or the average (average-pooling) over a sliding window. Pooling makes a convolutional network more robust to small image rotations and translations.

Neural network training is performed as an optimisation task: we search for the minimum of the cost (loss) function, which defines the error the network makes when approximating the target function \({\mathcal {F}}\). The most commonly used loss functions are the mean square error, binary cross-entropy, and categorical cross-entropy.

The primary method of neural network training is gradient descent, shown in Algorithm 1. It requires a training dataset \({\mathcal {D}}\). In each iteration, the network computes its output (line 2), and the model parameters are then corrected by a small step in the direction opposite to the gradient of the cost function \({\mathcal {L}}\) (line 3). As a result, the cost value decreases. The procedure continues until a stopping criterion is satisfied.

Long training may cause the network to overfit; therefore, in practice, additional techniques are applied. One of them is early stopping. It relies on setting aside an additional validation dataset, which is used in every iteration to monitor the cost value. Training is stopped when the validation cost starts to increase.

Algorithm 1

Gradient descent requires calculating the gradient of the loss function with respect to all network parameters. For complex, multilayered network architectures, this is a demanding task; therefore, the backpropagation algorithm [26] is commonly used. It computes the loss function gradient for the last network layer and then, using the chain rule, the gradient for the weights in the immediately preceding hidden layer. For a network with more layers, the gradient in the nth layer is calculated analogously by propagating the loss function gradient from the \((n+1)\)th layer. The loss function gradient for the weights in the nth network layer can be calculated using the following recursive definition.

$$\begin{aligned}&\frac{\partial {\mathcal {L}}}{\partial {\mathbf {W}}_n} = f_{n-1}\ \delta _n \end{aligned}$$
(5)
$$\begin{aligned}&\hbox {where:}\quad \delta _i = {\left\{ \begin{array}{ll} \left( {\mathbf {W}}_{i+1} \delta _{i+1} \odot f_{i}'\right) ^\top &{} i < L\\ \left( {\mathcal {L}}' \odot f_L'\right) ^\top &{} i = L \end{array}\right. } \end{aligned}$$
(6)

where \({\mathcal {L}}\) is the loss function, L denotes the total number of layers (and the index of the output layer), \({\mathbf {W}}_i\) is the weight matrix between \((i-1)\)th layer and ith layer, \(f_i\) is the activation vector in the ith layer, \('\) denotes a value of the derivative (or a gradient), and the operator \(\odot\) is the Hadamard product.

The method described above is in its simplest form. Stochastic gradient descent [2] is in common use, where in a single iteration the network parameters are updated based on a small portion of the training dataset, a batch. Another improvement is the use of momentum. Further advanced methods adapt the learning rate, for instance ADAM [22]. A high level of generalisation is the aim of every machine learning model; techniques that improve generalisation are called regularisation. The most popular ones are the L2 method and dropout.
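For illustration, a minimal PyTorch sketch of mini-batch training with backpropagation and ADAM is given below; the network size, loss function, and data are arbitrary example choices and do not reproduce the set-up used in our experiments.

```python
import torch
from torch import nn

# Toy regression data: 256 samples, 10 features
X = torch.randn(256, 10)
y = torch.randn(256, 1)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    for i in range(0, len(X), 16):          # mini-batches of 16 samples
        xb, yb = X[i:i + 16], y[i:i + 16]
        optimiser.zero_grad()               # reset accumulated gradients
        loss = loss_fn(model(xb), yb)       # forward pass and loss value
        loss.backward()                     # backpropagation (chain rule, cf. Eqs. 5 and 6)
        optimiser.step()                    # parameter update
```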

2.2 Extreme learning machines

In contrast to classical neural networks, ELMs are based on a random projection of the input feature space onto hidden features, followed by linear regression. The hidden connections of an ELM can be weighted randomly, and there is no need to tune them.

The most straightforward ELM architecture is a feedforward neural network with a single hidden layer. It resembles the MLP architecture, but the weights between the input and hidden layers are assigned randomly and not tuned; they perform a random black-box transformation from the input feature space to the hidden features. Only the parameters of the output layer are tuned.

Extreme learning itself is not necessarily limited to this simple architecture; it could be incorporated into many well-known deep learning models. Huang et al. [17] propose a method based on extreme learning and CNNs for image classification, called local receptive fields ELM (LRF-ELM). Local receptive fields are a general concept: a single neuron is responsible for aggregating the signal from a specific region of the input image. For implementation purposes, however, LRF-ELM usually simplifies the network to a single convolutional layer with pooling and a fully connected output layer. Only the weights of the output layer are optimised; the convolutional layer acts as a random feature extractor, i.e. the values of all convolutional filters are chosen at random. There are two additional operations specific to LRF-ELM. The first is the orthogonalisation of the filters after their initialisation, which aims to minimise the risk of randomly initialised convolutional filters producing redundant features. The second is a specific pooling method, square root pooling, given in Eq. 7, which reduces the data dimensionality and introduces nonlinearity into the network. This operation is crucial because LRF-ELM has no activation function after the convolutional layer.

$$\begin{aligned} \text {Square Root Pooling}({\mathbf {X}}) = \sqrt{\sum _i^{D_1}\sum _j^{D_2} {\mathbf {x}}_{ij}^2} \end{aligned}$$
(7)

where \({\mathbf {X}}\) is the input feature map (the output of the convolution layer) of size \(D_1\times D_2\), with elements \(x_{ij}\).
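A minimal NumPy sketch applying Eq. 7 over sliding windows is shown below; the 5\(\times\)5 window with stride 2 mirrors the set-up used later in our experiments and serves only as an example.

```python
import numpy as np

def square_root_pooling(X, window=5, stride=2):
    # Apply Eq. 7 to each window of the feature map X
    rows = (X.shape[0] - window) // stride + 1
    cols = (X.shape[1] - window) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            patch = X[i * stride:i * stride + window, j * stride:j * stride + window]
            out[i, j] = np.sqrt(np.sum(patch ** 2))
    return out

# Example: a 28x28 feature map is reduced to a 12x12 pooled map
print(square_root_pooling(np.random.rand(28, 28)).shape)
```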

In order to compute optimal output layer weights \(\varvec{\beta }\), Eq. 9 has to be solved—network outputs should match the expected values \({\mathbf {Y}}\) from the dataset. \({\mathbf {H}}\) represents a matrix of random features extracted from the training dataset \({\mathbf {X}}\). For single hidden layer networks, they are computed according to Eq. 8.

$$\begin{aligned} {\mathbf {H}} = f({\mathbf {X}} \mathbf {W_0} + {\mathbf {b}}) \end{aligned}$$
(8)
$$\begin{aligned} {\mathbf {Y}} - {\mathbf {H}} \varvec{\beta } = {\mathbf {0}} \end{aligned}$$
(9)

where \({\mathbf {X}}\) and \({\mathbf {Y}}\) are the training inputs and expected outputs, respectively, \(\mathbf {W_0}\) and \({\mathbf {b}}\) are the weights and biases in the hidden layer (both random), and f is the activation function.

To find the output layer weights \(\varvec{\beta }\) from Eq. 9, it is necessary to invert the matrix \({\mathbf {H}}\), as presented in Eq. 10, which is computationally difficult for high-dimensional or large datasets (and, in general, \({\mathbf {H}}\) is not even square). To address this issue, the ELM authors proposed using the Moore-Penrose pseudoinverse, which leads to the final solution for the optimal weights, shown in Eq. 11. It is equivalent to least squares optimisation.

$$\begin{aligned} \varvec{\beta } = {\mathbf {H}}^{-1} {\mathbf {Y}} \end{aligned}$$
(10)
$$\begin{aligned} \varvec{\beta } = ({\mathbf {H}}^T {\mathbf {H}})^{-1} {\mathbf {H}}^T {\mathbf {Y}} \end{aligned}$$
(11)

This algorithm is called extreme learning. It is also applied in the classification layer of LRF-ELM, whose training procedure is presented in Algorithm 2.

Algorithm 2

Just as with classical neural networks, there exist techniques that improve the performance of ELMs. ELMs can be regularised using L2 regularisation [12, 17]. Equation 12 shows the regularised solution for the optimal ELM parameters, where C denotes the regularisation coefficient and \(\mathbb {1}\) is the identity matrix.

$$\begin{aligned} \varvec{\beta } = (C\mathbb {1} + {\mathbf {H}}^T {\mathbf {H}})^{-1} {\mathbf {H}}^T {\mathbf {Y}} \end{aligned}$$
(12)

Even though ELM parameters are chosen at random, the probability distribution used to sample them may impact model performance. The authors of [45] compare various distributions with different variances and recommend, as a good default choice, a Gaussian distribution with mean equal to 0 and a standard deviation less than or equal to 0.1.
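Putting Eqs. 8 and 12 together, a minimal NumPy sketch of regularised ELM training could look as follows; the hidden weights are sampled from the Gaussian distribution recommended above, while the ReLU activation and the layer sizes are example choices.

```python
import numpy as np

def train_elm(X, Y, n_hidden=100, C=0.01, seed=0):
    rng = np.random.default_rng(seed)
    # Random hidden-layer weights and biases (never trained), Gaussian with std 0.1
    W0 = rng.normal(0.0, 0.1, size=(X.shape[1], n_hidden))
    b = rng.normal(0.0, 0.1, size=n_hidden)
    H = np.maximum(0.0, X @ W0 + b)                 # Eq. 8 with ReLU activation
    # Eq. 12: regularised closed-form solution for the output weights beta
    beta = np.linalg.solve(C * np.eye(n_hidden) + H.T @ H, H.T @ Y)
    return W0, b, beta

def predict_elm(X, W0, b, beta):
    return np.maximum(0.0, X @ W0 + b) @ beta

# Example: regression with 500 samples and 8 features
rng = np.random.default_rng(1)
X, Y = rng.normal(size=(500, 8)), rng.normal(size=(500, 1))
W0, b, beta = train_elm(X, Y)
print(predict_elm(X, W0, b, beta).shape)
```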

The aim of this section was to present the essential elements of both models and their training methods. The next section is devoted to the experimental research comparing both models in terms of training time and achieved results.

3 Research methodology

The experiments on ELM hyperparameter sensitivity let us understand how the hyperparameters influence the final ELM results. This analysis is described in Appendix A. It enables us to assess which parameters have a particularly strong impact on the model responses and to determine good default hyperparameter values.

We have designed two series of experiments: in the first one, models have comparable capacities. In the second one, they have well-suited architectures based on the literature review or hyperparameter optimisation.

The first series is designed to compare just the learning algorithms. Given identical network architectures, we train them using extreme learning and backpropagation. This comparison can show whether it is beneficial to choose one algorithm over the other. This series consists of two parts. The first part refers to ELM and fully connected networks trained with backpropagation, performing classification and regression tasks on data with vector representation. This part of experimentation utilises well-known, real-life benchmark datasets, datasets used in [16] and several artificial datasets described in Sect. 4.

To ensure the objectivity of the conclusions, we used datasets with diverse properties concerning data dimensionality, class balance, sparsity of features, and the presence of continuous and discrete features. The datasets were acquired from two public repositories: UCI [8] and the University of Porto repository [46]. Tables 1 and 2 present the characteristics of the datasets for regression and classification, respectively.

The second part of this series is devoted to the image classification task. We considered two approaches to image classification:

  • using an external extractor of visual features and a fully connected network,

  • using a convolutional network, which learns features on its own.

The previously mentioned LRF-ELM model [17] allows researchers to utilise extreme learning for convolutional networks. The authors of [18] show that ELM performs well in image classification when using HOG (histogram of oriented gradients) features. That paper considers only one task, road sign classification; however, it seems worthwhile to investigate this approach on more datasets, described in Table 5. Therefore, the models covered in this comparison are:

  • LRF-ELM,

  • Convolutional network trained with backpropagation,

  • ELM using HOG features,

  • fully connected network trained with backpropagation on HOG features.

In the second series, we try to tune the architectures and hyperparameter set-ups for a given problem, just like a machine learning practitioner would do. In this task, we make use of the available literature: for classical networks, we often refer to the best results reported in other papers, while we perform our own hyperparameter search for ELMs. This series also consists of two parts. Analogously, the first part covers classification and regression on data with vector representation (this time, however, we do not use synthetic datasets), and the second concerns image classification. The choice of image datasets is extended with two variants of ImageNet, with resolutions of \(16 \times 16\) and \(32 \times 32\) pixels. We compare training times whenever possible, i.e. for training performed on our own and when the time is given in the papers we refer to. The main focus, however, lies on the comparison of performance measured with various metrics:

  • F1-score in the first series of comparison for the classification tasks. In this series, we performed training and evaluation of all models on our own. Therefore we were able to measure the F1-score, which is more robust to imbalanced classes than accuracy.

  • Accuracy in the second series of comparison for the classification tasks. This time we referred to results reported in the literature, which are usually stated using only the accuracy metric.

  • RMSE (normalised) for the regression task.

Their definitions are given below.

F1-score. It is a measure of binary classification quality that summarises the numbers of Type I and Type II errors made by the classifier. Type I errors, also called false positives, correspond to situations where the classifier mistakenly predicts the positive class. Type II errors, false negatives, mean that the classifier mistakenly predicted the negative class. The F1-score is therefore defined as stated in Eq. 13

$$\begin{aligned} F_1 = \frac{2\cdot TP}{2\cdot TP + FN + FP} \end{aligned}$$
(13)

where TP is the number of true positives (correct predictions of the positive class), FP the number of false positives, and FN the number of false negatives. To use this metric for multiclass classification, we compute the binary F1-score for each class separately and then evaluate the average F1-score weighted by the number of samples in each class. The formal description is given in Eq. 14.

$$\begin{aligned} F_1 = \frac{\sum _{k=1}^K{w_k F_1^{(k)}}}{\sum _{k=1}^K{w_k}} \end{aligned}$$
(14)

where \(F_1^{(k)}\) is the F1-score for class k, K is the number of classes, and \(w_k\) is the number of samples in class k. For simplicity, we denote both the binary and the multiclass version of this measure as F1-score.
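In practice, the weighted multiclass F1-score of Eq. 14 corresponds to what scikit-learn computes with average='weighted', for example:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

# Binary F1 per class, averaged with weights equal to the class supports (Eq. 14)
print(f1_score(y_true, y_pred, average="weighted"))
```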

Accuracy. It is the most basic measure of classification performance: the ratio of correctly classified samples to the number of all data samples in the dataset, as stated in Eq. 15.

$$\begin{aligned} Accuracy = \frac{\# \textit{of\, correct\, predictions}}{\# \textit{of\, all\, test\, data\, samples}} \end{aligned}$$
(15)

RMSE. Root mean square error is a measure of regression quality. It is defined in Eq. 16.

$$\begin{aligned} \text {RMSE}(\hat{{\mathbf {Y}}},{\mathbf {Y}}) = \sqrt{\frac{\sum _i^N{(\hat{y_i} - y_i)^2}}{N}} \end{aligned}$$
(16)

where \({\mathbf {Y}}\) is the vector of expected model responses for the N data samples, \(\hat{{\mathbf {Y}}}\) is the vector of actual model responses, and N is the number of samples in the testing dataset. In this paper, we present normalised values of RMSE. Normalisation is performed by dividing the RMSE by the range of the target variable (the difference between its greatest and lowest values), which makes the magnitude of the model error easier to interpret without any domain knowledge about the target variable.
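A minimal sketch of the normalised RMSE as used here (Eq. 16 divided by the range of the target variable) could be:

```python
import numpy as np

def normalised_rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))   # Eq. 16
    return rmse / (y_true.max() - y_true.min())       # divide by the target range

print(normalised_rmse([1.0, 2.0, 4.0], [1.1, 2.3, 3.6]))
```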

4 Datasets used

In selecting datasets for experimental research, we were guided by the diversity of the data in terms of class balancing, data dimensions, number of classes, continuous and discrete data, and sparsity of features. Datasets were acquired from two public repositories: UCI [8] and University of Porto repository [46]. Table 1 presents characteristics of datasets for regression and Table 2 for classification. Moreover, some synthetic datasets were considered.

Table 1 Characteristics and descriptions of datasets for regression from UCI and University of Porto repositories used for comparisons
Table 2 Characteristics and descriptions of datasets for classification from UCI repository used for hyperparameter sensitivity analysis and for comparisons

We also propose three types of artificial classification datasets and several datasets for regression. They are characterised in Tables 3 and 4, and described in detail in Appendix B.

Table 3 Characteristics of artificial datasets for classification used for hyperparameter sensitivity analysis and for comparisons
Table 4 Characteristics of artificial datasets for regression used for comparisons

Image classification datasets. None of the available ELM studies presents results obtained on benchmark image classification datasets, such as MNIST OCR and CIFAR-10. For this reason, we included them in the experiments. In addition, we use the NORB dataset, which was used in [17], and the GTSRB set, used in [18]. All image datasets are characterised in Table 5.

Table 5 Specifications and descriptions of image classification datasets used in this study

5 Experimental study

We conducted all experiments on a computer with a specification presented in Table 6.

Table 6 Specification of the computer on which the experiments were conducted

For the needs of the study, the ELM and LRF-ELM models were implemented using Python 3.6 and the PyTorch 1.4 library [34]. All the reference models (fully connected and convolutional networks trained with backpropagation) were implemented using the following libraries: scikit-learn, scipy, scikit-posthocs, and OpenCV. Our implementation is available at: github.com/mkosturek/extreme-learning-vs-backprop.

In each series, the first experiment compares ELM models and neural networks trained using backpropagation (also referred to as MLP). The next one focuses on image recognition datasets: we compare the simple ELM and MLP models working on HOG features extracted from images, and we also examine models dedicated to image classification, i.e. LRF-ELM and convolutional neural networks.

5.1 Models with comparable capacity

In this series, we compare models with the same architecture and comparable number of neurons.

5.1.1 Experiment 1: comparison of ELM and neural networks performance and training times for input data with vector representation

The purpose of the study is to compare the performance and training time of ELM models and neural networks of similar capacities on datasets with vector representation, for regression and classification tasks. We based the choice of values of three hyperparameters (regularisation coefficient, activation function, and hidden layer size coefficient) on the hyperparameter sensitivity analysis (Appendix A), choosing values that gave decent average performance across various datasets. Eventually, the following hyperparameter values were applied:

  • Regularisation - L2 was applied for both ELM and MLP with \(C=0.01\),

  • Activation function - ReLU for both MLP and ELM,

  • Optimisation algorithm - ADAM algorithm [22] was used while training by backpropagation,

  • Batch size - set to 16 while training by backpropagation,

  • Learning rate - set to 0.001 while training by backpropagation,

  • Training stopping criteria - early stopping with patience set to 3 epochs and maximum number of epochs set to 200,

  • Hidden layer size coefficient - based on the hyperparameter sensitivity analysis in Appendix A, \(h=10\) for both ELM and the backpropagation-trained network. (This is in contrast to the comparison presented in [16], where the number of neurons in ELM was usually about twice as high as in the MLP.)

  • Significance level - \(\alpha =0.05\)

The experiment was conducted on 23 classification datasets and 30 regression datasets, presented in Tables 2 and 1, respectively. Moreover, synthetic datasets were used; they are described in detail in Appendix B. We generated the data in multiple variants, with dimensionality varying from 1 up to 1000. Each synthetic dataset contained 2000 examples.

Each model was evaluated on every dataset using fivefold cross-validation repeated 10 times, giving a sample of 50 results for each model on each dataset. In the classification task, the F1 measure was used to compare performance. For the regression task, we used normalised RMSE, obtained by dividing the RMSE by the range of the target variable (the difference between its maximum and minimum). To evaluate the training speed, we measured the training time in seconds. To ensure the reliability of the comparison of the average values, we used the dependent t-test for paired samples; however, as indicated in [3], in the case of repeated cross-validation, the basic form of this t-test may lead to false and non-reproducible results. Therefore, we used a modified paired-samples t-test, called the corrected repeated k-fold cross-validation test.
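For reference, a minimal sketch of such a variance-corrected paired t-test is given below. It follows the commonly used correction factor \(1/(k \cdot r) + n_{test}/n_{train}\); the exact formulation adopted in [3] may differ in details, so this is an illustrative assumption rather than our exact implementation.

```python
import numpy as np
from scipy import stats

def corrected_repeated_cv_ttest(diffs, n_train, n_test, k, r):
    # diffs: per-fold differences in the metric between the two models (length k * r)
    d = np.asarray(diffs)
    mean_d, var_d = d.mean(), d.var(ddof=1)
    # Variance correction accounting for overlapping training sets across folds
    t_stat = mean_d / np.sqrt((1.0 / (k * r) + n_test / n_train) * var_d)
    p_value = 2 * stats.t.sf(np.abs(t_stat), df=k * r - 1)
    return t_stat, p_value

# Example: 10 repetitions of fivefold CV, i.e. 50 paired differences
rng = np.random.default_rng(0)
diffs = rng.normal(0.01, 0.02, size=50)
print(corrected_repeated_cv_ttest(diffs, n_train=800, n_test=200, k=5, r=10))
```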

Tables 7 and 8 show the results of comparison of performance and training times in classification and regression tasks. Whenever a p-value from the statistical test was below the significance level \(\alpha =0.05\), the better model’s result was presented in bold, denoting a significant difference between compared models.

Table 7 Comparison of the average F1-measure and training times (in seconds) for ELM and MLP
Table 8 Comparison of the average RMSE and training times (in seconds) for ELM and MLP

Discussion Considering classification performance in terms of F1 measure, the results presented in Table 7 show that for most sets (15 out of 23) the statistical test did not show any significant difference between the average results of the models. On 8 sets, where the differences were statistically significant, better results were achieved by ELM. It is worth noting that the differences between the models were found to be significant on high-dimensional synthetic sets rather than on their low-dimensional counterparts.

A qualitative comparison of the decision regions of sample ELM and MLP models is also interesting. For the simplest, two-dimensional datasets, they are visualised in Figs. 1, 2 and 3. It is worth underlining the significant differences in the shapes of the ELM decision regions across different runs.

Fig. 1 Visualisation of decision regions of sample models on the nested circles dataset (two-dimensional hyperspheres)

Fig. 2 Visualisation of decision regions of sample models on the XOR classification dataset (two-dimensional hypercube vertices)

Fig. 3 Visualisation of decision regions of sample models on the intertwined spirals dataset

Although the identified decision regions usually correctly cover the observations of the training set, in the case of ELM these regions undergo significant shape changes just beyond the areas where the observations are concentrated. This is particularly noticeable for the nested hyperspheres dataset, where the inner-circle class region also appears outside both circles. The decision boundaries of MLP appear much simpler than those of ELM, and the ELM boundaries vary across each of the 15 runs of the method. This phenomenon suggests better generalisation capabilities of MLPs. For most regression datasets, the statistical test showed significant differences in model performance (22 out of 30 datasets).

Of these 22 comparisons, only 8 indicated better results for the ELM model, and 14 indicated better results for the MLP. This is the opposite of the classification result. However, Table 8 shows that the MLP achieved significantly better results on only one real-life dataset, Servo, and on three synthetic ones (linear function, 2nd-degree polynomial, and Friedman I), although for the majority of their possible dimensionalities. ELM proved significantly better on three real datasets and on the synthetic datasets based on trigonometric functions and the Friedman II and III functions. It should be noted that all of these 8 datasets are low-dimensional, i.e. their dimensionality did not exceed 10.

The obtained results indicate that ELM models achieve better results in low-dimensional, nonlinear regression problems, which occur in many real-life tasks, such as the datasets based on trigonometric functions and the Friedman II and III curves derived from physical phenomena. MLP, on the other hand, achieves better results for high-dimensional problems and those with linear or polynomial dependencies.

The comparison shows that ELM models are significantly faster than neural networks trained by backpropagation. On each of the 53 sets, for both classification and regression tasks, the statistical test showed a significant difference in processing times, always in favour of ELM. Typically, learning times for MLP turn out to be 4 orders of magnitude higher than for ELM.

5.1.2 Experiment 2: comparison of ELM and neural networks performance in image classification

The aim of the experiment is to compare networks trained with extreme learning and backpropagation in the task of image classification. We compare F1-measure and training times. The following hyperparameter values were set after preliminary studies:

  • Regularisation coefficient - \(C=0.01\). It was set based on the preliminary experiment, which is not shown here.

  • Hidden layer size coefficient - for models based on HOG: \(h=10\),

  • Activation function - ReLU,

  • Architecture of convolutional networks - one convolutional layer, one pooling layer, with no activation function for both CNN and LRF-ELM. A flattened feature map after pooling is the input to the fully connected classification layer,

  • Number and size of convolutional filters - 10 filters, with a size of 5\(\times\)5 pixels, the stride is 1 pixel for both LRF-ELM and CNN,

  • Pooling method and window size - Square Root Pooling for LRF-ELM and CNN. Window size is 5\(\times\)5 pixels, with the stride equal to 2 pixels,

  • Optimisation algorithm when using backpropagation - ADAM [22],

  • Batch size while training by backpropagation - for MLP model with HOG features: 32; for CNNs, the batch size was 128,

  • Learning rate - set to 0.001 while training by backpropagation,

  • Training stopping criteria - early stopping with patience set to 3 epochs and maximum number of epochs set to 200,

  • Significance level for statistical testing - \(\alpha =0.05\),

  • HOG descriptor set-up - the number of gradient orientations is 9, the block size is 2\(\times\)2 cells, each cell is 8\(\times\)8 pixels.

The experiment was conducted on 4 datasets, whose characteristics are presented in Table 5. The CIFAR-10 dataset was used in its original form; its HOG vector consists of 324 features. The GTSRB dataset consists of images of various sizes; therefore, for the LRF-ELM models and convolutional networks, all images were scaled to 32\(\times\)32 pixels. The HOG features for this set were calculated on images scaled to 43\(\times\)43 pixels, the median image size in the set, giving a HOG vector of 576 features. The MNIST dataset was used in its original form; its HOG feature vector consists of 144 features. All images in the NORB dataset are 96\(\times\)96 pixels. Due to hardware limitations, for networks using convolutional layers, the images were scaled to a resolution of 32\(\times\)32; the HOG features were determined on images scaled to 48\(\times\)48 pixels, giving a HOG vector of 900 features.

On every dataset, a fivefold cross-validation was repeated five times for each of the tested models. Comparison is then based on performance measures averaged over these repetitions. For performance comparison, F1 measure was used, while the time measured in seconds was used to assess the training speed. Time measurement includes only model training and not HOG features extraction.

Figure 4 shows the results of the performance comparison for each of the four datasets. We performed the corrected repeated cross-validation t-test pairwise for all models, assuming a significance level of \(\alpha =0.05\). These tests show that on the CIFAR-10 and GTSRB datasets, the CNN and HOG+ELM approaches achieved the best performance, with no significant difference between them; on MNIST, CNN was the single best-performing model; and on NORB, the best results were achieved with the HOG+ELM approach. Moreover, the tests show that the performance of CNN models trained on CPU and GPU does not differ, so the two variants are only used to compare training times.

Figure 5 shows a comparison of the training times. Corrected repeated cross-validation tests were also performed. They showed that each of the average training times was significantly different from the others.

Fig. 4 Comparison of the classification performance of tested models on 4 datasets for image classification. Average and standard deviation of results obtained from five repetitions of fivefold cross-validation are presented

Fig. 5 Comparison of training times of the tested models on 4 image classification datasets. Average and standard deviation of results from five repetitions of fivefold cross-validation are presented. Times are shown on a logarithmic scale due to their considerable range

Discussion Figure 4 shows that CNN and ELM models using HOG features provide the best classification performance the same number of times. On CIFAR and GTSRB sets, these two models achieved results that did not differ significantly. On the MNIST set, convolutional networks using backpropagation proved to be better, and on the NORB set, the ELM model achieved the best results.

LRF-ELM achieved a result comparable to CNN only on NORB and GTSRB, while in the two other cases networks trained by backpropagation were superior.

A surprisingly large improvement in classification performance was achieved by using extreme learning instead of backpropagation on HOG features. On three out of four datasets, the ELM model using these features also proved superior to the LRF-ELM model, which was designed specifically for image processing.

It should be noted that the architecture of the convolutional networks used in this experiment was very limited compared with the usual design of such networks; for example, the number of filters was limited to 10. This is due to the high memory requirements of the LRF-ELM model and our hardware limitations. For convolutional networks trained with backpropagation alone, the network size could be increased considerably. Therefore, drawing conclusions from a comparison of image-based models with those using HOG features may not be fully justified, and the superiority of a simple extractor over a dedicated image network architecture cannot be confirmed conclusively.

Again, the comparison showed that extreme learning is significantly faster than backpropagation. On three out of four datasets, LRF-ELM proved to be the fastest, while ELM using the HOG feature extractor was the fastest on a single dataset. Using the GPU accelerated backpropagation training more than a hundredfold, to a level comparable to the ELM models using HOG features; however, it did not reach training times as short as those of LRF-ELM.

5.2 Models with well-suited architectures for a given dataset

In this series, we conduct experiments following a machine learning practitioner's approach: we compare both methods using models whose hyperparameter values are tuned to achieve the best possible results or, where available, using reported state-of-the-art results as a reference.

5.2.1 Experiment 1: comparison of ELM and neural networks performance and training times for input data with vector representation

In this scenario, the goal is to find the best-performing solution. Therefore, we run a hyperparameter search for ELMs and make use of works published throughout the years of classical network development. We then compare the best solutions found on regression and classification datasets with vector representation. Tables 9 and 10 show the best ELM hyperparameter set-ups found for each dataset. Due to discrepancies in the literature concerning the metrics used for evaluating regression models, we decided to perform the hyperparameter search for classical networks on our own; Table 11 shows the best hyperparameters found for each regression dataset. Tables 12 and 13 present the best results achieved with ELMs and classical networks.

Table 9 Hyperparameter set-up for the best found ELM models for each classification dataset
Table 10 Hyperparameter set-up for the best found ELM models for each regression dataset
Table 11 Hyperparameter set-up for the best found MLP models for each regression dataset
Table 12 Comparison of the accuracy for the best found ELM and MLP solutions for given classification datasets
Table 13 Comparison of the RMSE (normalised) and training times for best found ELM and MLP solutions for given regression datasets

Discussion This experiment shows that both extreme learning and backpropagation are able to produce a model that fits a given task well. However, classical networks proved superior on the majority of tested datasets in both classification (6 out of 9 datasets) and regression (4 out of 6 datasets). It must nonetheless be noted that there are cases where ELMs fall behind standard networks only by a small margin (e.g. the Machine CPU and Auto MPG datasets) and cases where they surpass classical networks significantly (e.g. the Sonar and Bank datasets).

Overall, this experiment demonstrates that the many years of extensive research on backpropagation-based networks have paid off: classical networks are a solid, robust framework allowing very efficient modelling. However, ELMs seem to be a good complement to classical networks, easy and quick to train and capable of producing very competitive results. Considering the disproportion in the amount of research on ELMs compared to backpropagation, extreme learning seems a promising approach that is worth further development.

5.2.2 Experiment 2: comparison of ELM and neural networks performance in image classification

Just like in the previous experiment, the aim of this study is to achieve the best possible performance with each method in the image classification task. This comparison covers a large-scale dataset—ImageNet. It was constrained to two variants, with 16\(\times\)16 and 32\(\times\)32 resolution, due to the high memory complexity of ELMs. Again, we performed hyperparameter search for ELM-based classifiers and supported the comparison with results achieved with classical networks reported in the literature. Table 14 presents the best hyperparameter set-ups found for LRF-ELM models for each dataset. Table 15 shows the performance and time comparison of ELMs and backpropagation-based models.

In this experiment, we modified the preprocessing steps on the NORB dataset to match [37], against whose results we compare our ELM.

Table 14 Hyperparameter set-up for the best found ELM models for each classification dataset
Table 15 Comparison of the accuracy and training times for the best found ELM and BP solutions for given image classification datasets

Discussion Comparing current state-of-the-art deep learning models for image classification to the LRF-ELM clearly demonstrates classical deep learning superiority. LRF-ELM was inferior on every dataset considered, often by a great margin. This outcome was expected because LRF-ELM corresponds to a very simple CNN, whereas modern deep learning architectures are significantly more complex. In the case of image classification, it is hard to argue that the training speed is an advantage of ELMs, because the performance decrease is too high. Presently, extreme learning is not a competitive alternative for deep learning-based image classification.

Again, this experiment shows a large disproportion between the development of extreme learning-based architectures and that of backpropagation-based ones. Extending ELMs into a framework allowing efficient implementation and training of more complex network architectures could prove valuable.

6 Findings based on the performed research

As mentioned before, extreme learning machines are highly memory-consuming. This is because computing the output weights requires a (pseudo)inverse involving the latent representation of the entire training dataset, which means storing the hidden features matrix in memory. One could argue that ELMs could be implemented so that this matrix is cached on disk, but the main advantage of extreme learning, short training times, would then suffer significantly. Here we present a short theoretical study of the memory required to train an ELM on the 224\(\times\)224 ImageNet dataset, which consists of roughly 14 million images. Let us make the following assumptions:

  • the training is performed on ImageNet’s 1 mln images subset, used in popular challenges such as ILSVRC,

  • all images are resized to constant 224\(\times\)224 px size,

  • the LRF-ELM architecture is configured as follows:

    • 16 convolutional filters only,

    • filter size: 5\(\times\)5,

    • pooling size: 5\(\times\)5,

    • pooling stride: 2.

  • standard 32-bit precision floating-point numbers are used.

The assumed image size and network architecture produce hidden feature maps of size 108\(\times\)108\(\times\)16 for each image. LRF-ELM flattens such feature maps; hence, a hidden representation is a vector of 186624 elements, and the hidden representation matrix \({\mathbf {H}}\) has size 1 million\(\times\)186624. Such a matrix consists of \(1.87\times 10^{11}\) numbers, each stored on 4 bytes, so the total size of the matrix \({\mathbf {H}}\) is approximately 746 GB.

Using such a limited number of convolutional filters already leads to memory requirements beyond the abilities of most systems. Let us assume the convolution and pooling configuration is changed so as to produce feature maps compressed to a 32\(\times\)32 size. Even in this case, the matrix \({\mathbf {H}}\) takes more than 60 GB. Scaling down further introduces a high risk that such randomly extracted, compressed features contain too little information for the ELM output layer to train properly; this effect is clearly seen in the experimental results in Table 15. With classical convolutional networks trained with backpropagation, the memory requirements are several orders of magnitude lower: there is no need to keep the entire dataset in memory, as training requires loading only one mini-batch at a time.
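The figures above can be verified with a few lines of arithmetic (sizes in decimal gigabytes):

```python
# Feature map size after 5x5 convolution (stride 1) and 5x5 pooling with stride 2 on a 224x224 image
conv_size = 224 - 5 + 1                  # 220
pooled_size = (conv_size - 5) // 2 + 1   # 108

n_images, n_filters, bytes_per_float = 1_000_000, 16, 4
features_per_image = pooled_size * pooled_size * n_filters       # 186624
print(n_images * features_per_image * bytes_per_float / 1e9)     # ~746.5 GB

# Feature maps compressed to 32x32 still require tens of gigabytes
print(n_images * 32 * 32 * n_filters * bytes_per_float / 1e9)    # ~65.5 GB
```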

Our experience from conducting this study is described in Table 16, which shows a qualitative comparison between ELM and classical neural networks in the current state of their development.

Table 16 Qualitative comparison of ELM models and classical neural networks

We can sum up its content as follows:

  • The number of hyperparameters is much higher in classical neural networks, but selecting their values can be supported by the extensive literature and by the help of active community members on public forums. Heuristics on how to set hyperparameter values for classical neural networks are easy to find.

  • The number of parameters in classical deep networks is huge, but because these models are popular, transfer learning can be used to speed up training. In ELM models, the number of parameters is lower, but for challenging image data the memory requirements are so vast that calculating the weights is not feasible on typical hardware.

  • There are few publicly available ELM implementations; the existing ones are usually simple and non-generic or written in Matlab, which makes them hard to use in more complex scenarios and within popular, well-supported environments such as Python with PyTorch or TensorFlow.

Summing up, in this paper we considered only the classical ELM methods, conducting experiments on image and vector-representation datasets. It should be noted, however, that some new approaches propose promising solutions; here we can mention incremental learning [42], the multilayer extreme learning machine (ML-ELM) and hierarchical extreme learning machine (H-ELM) described in [20], and non-iterative, fast learning algorithms for deep ELM [49]. This is a good prognosis for the further development of ELM, but none of these works has presented results on a dataset as challenging as ImageNet, nor compared its performance to mainstream deep learning methods.

7 Final conclusion

Based on the presented research, we conclude that the tested ELM models do not offer a real alternative to classical neural networks for contemporary problems characterised by complex patterns and huge datasets. The first series of experiments shows that their superiority over classical neural networks in terms of training time is visible only for networks with capacities corresponding to the time before the deep learning era. The demand for models that can be trained quickly for tough image, video, or audio problems is enormous. ELM models, which transform the input space into a high-dimensional hidden-neuron space with random weights, are attractive when one wants to train a model quickly. But to reach that state, we postulate developing smart algorithms for the inverse matrix calculation, so that determining the weights in the output layer for challenging datasets becomes feasible and memory efficient. Specific mechanisms are needed to avoid keeping the whole dataset in memory while computing the weights. Although we have noticed new approaches to these problems [20, 42, 49], the demanding datasets used with classical deep neural networks are still a challenge for ELMs. It also seems necessary to develop generic frameworks that give practitioners simple access to ELM models and enable the easy development of new architectures that efficiently utilise this training algorithm. Ultimately, sharing implementations should become common practice among ELM researchers. In future work, we plan to prepare a comparison with other random-weight networks, such as RVFL and its extended versions. There are also other promising solutions to consider, such as self-normalising networks, which outperformed all competing methods [23].