1 Introduction

Artificial neural networks are among the fastest-growing areas of artificial intelligence. Existing deep models can solve tasks such as visual object recognition or automatic text translation, often achieving human-level performance. These impressive results, however, frequently require weeks of training and substantial computational power. High training times are a consequence of the most commonly used training strategy: numerical optimisation with iterative methods and the backpropagation algorithm, whose purpose is to determine the error gradient for each parameter of the network. We will refer to networks trained this way as classical or standard networks. High hardware requirements and long training times are frequent obstacles to applying neural networks in practice. These difficulties explain our interest in searching for an alternative solution. In this work, we examine randomisation-based neural networks, which offer significantly lower training times and are claimed to perform on par with classical networks. Specifically, we focus on extreme learning machines (ELM), a special case of random vector functional-link (RVFL) networks.

1.1 Motivation

The motivation for this research stems from our search for models that solve real classification and regression problems and are well suited for practical application. We limit our interest to inputs in the form of vectors and raw images; we do not consider sequential data.

Random vector functional-link networks were proposed in [32], and their characteristics were discussed in [33]. Since then, they have been developed and evaluated further [11, 47, 50]. RVFL is a feedforward network with a single hidden layer and direct links between the input and output layers that bypass the hidden layer. Its most important characteristic is that the hidden-layer weights are set randomly and are not optimised; the output-layer weights are the only trainable parameters.

The authors of [13] proposed extreme learning machines. ELM follows the same principle as RVFL; in fact, it is a special case of RVFL in which the direct input-output links are disabled [5]. ELM is therefore architecturally identical to a commonly used classical feedforward network with a single hidden layer; the only difference is the training method. The authors of ELM claim that computing the ELM weights takes significantly less time than standard network training, with comparable or even better performance. Extreme learning machines have served as a baseline for many further models, such as stacked ELM [51], on-line sequential ELM [14], LRF-ELM designed for image classification [17], and biased dropout ELM [25]. ELM can also be considered a non-recurrent equivalent of reservoir computing (RC) models [40]. Because random units construct features (to deal with nonlinearly separable tasks) without learning, all remaining weights can be found in closed form, so no iterative numerical optimisation is required; this makes the model a natural candidate to overcome the time-consuming training of classical neural networks.

The use of random parameters in randomisation-based networks may raise concerns about degraded performance. However, the authors of [16] prove that a sufficiently large number of random hidden features allows such networks to learn effectively. This sounds very promising, but our literature survey did not reveal an in-depth comparison of backpropagation-based and randomisation-based learning, for either ELM or RVFL. There are, however, detailed comparisons between RVFL and ELM [9, 43, 50], in which these models are compared on multiple datasets for both classification and regression. These comparisons show that direct links between the input and output layers improve classification performance and have no significant effect on regression quality. Nonetheless, these findings do not show whether randomisation-based learning offers a true alternative to commonly used backpropagation-based learning algorithms.

The original paper on extreme learning [13] contains a simple comparison between ELMs and networks trained with backpropagation. The comparison covers two datasets, Diabetes for classification and California Housing for regression, both from the UCI repository [8]. The study ensures identical architectures for both models, so the only property compared is the training algorithm. A similar comparison was conducted on the Forest Type dataset, but in that case the ELM contained twice as many hidden neurons as the regular network, so the comparison lacks objectivity. Unfortunately, the authors do not specify the hyperparameter set-up, e.g. the stopping criterion for the backpropagation-based method, which makes the results non-reproducible. Further works on extreme learning include more comprehensive comparisons [16]. They cover more than 10 regression datasets obtained from the University of Porto repository [46] and several classification datasets, and the hyperparameter set-up is described in more detail than in the original paper on extreme learning [13].

Methods that improve the basic RVFL and ELM models [25, 47] usually use their immediate predecessors or the plain RVFL and ELM as baselines, omitting backpropagation-based networks.

When proposing the LRF-ELM model, the authors of [17] compare it to a regular convolutional network on only a single dataset, NORB, so conclusions from such experiments are far from general. Papers on extensions of extreme learning usually do not present any comparison to other training methods [53]; they focus on showing the difference between the proposed extension and the basic ELM. Other works comparing ELMs to regular deep learning models often use outdated or inadequate network architectures. For example, the authors of [21] performed a comparison on an image classification task without using convolutional networks. When that work was published (2013), CNNs were just emerging; nowadays, they are considered the primary choice for image recognition (classification). Despite the dynamic evolution of CNNs, studies comparing them to ELMs are scarce.

None of the referenced works analysed the impact of the hyperparameter set-up on ELM performance. It therefore seems worthwhile to explore some basic hyperparameters (the number of hidden neurons, the choice of activation function, and the value of the regularisation coefficient), because they can profoundly influence the results. Only [15] mentions that “Since the input weights and hidden biases of ELM are randomly generated instead of being fine-tuned, ELM usually needs more neurons than other learning algorithms”. Unfortunately, this claim is not supported by extensive research.

To summarise, our choice of using extreme learning machines for comparison was based on three factors:

  • A common use of ELM as a baseline solution among random weights networks;

  • A one-to-one correspondence between ELM and classical neural network architectures;

  • A scarcity of comparisons between ELM and classical neural networks.

1.2 Contribution

Our research was inspired by the need for practical neural network applications in classification and regression tasks. Training standard neural networks often requires computational power exceeding the capabilities of most research and development centres. A natural solution lies in alternative training methods. Because we did not find any comprehensive comparison of either ELM or RVFL with classical networks, we wanted to verify objectively whether, and to what extent, ELM can be competitive with classical feedforward networks trained with backpropagation (using mini-batch SGD) on demanding classification and regression tasks.

We performed the comparison twice. In the first series of experiments, following the approach from [13], the network's and the ELM's capacities (numbers of neurons) are comparable; in this case, we compare the two models purely in terms of the training procedure. In the second series, we tried to reflect a machine learning practitioner's approach, i.e. we tuned each model for a given dataset. For classical networks, we can draw on the extensive literature and publicly shared implementations. For ELMs, this is usually impossible, but short training times allow us to optimise hyperparameters efficiently. An essential premise of all experiments is to cover a vast and varied selection of datasets. We limited the datasets to classification and regression tasks from public repositories (UCI [8] and the University of Porto repository [46]), current image classification benchmark datasets, and artificially created tasks, skipping text and audio datasets. We can summarise our contribution as follows:

  • comparison of the models' efficiency on more than 50 datasets for both training methods, in terms of the achieved prediction quality and training times, in two scenarios: models with comparable capacity and models with well-suited, literature-supported architectures,

  • implementation of both models available at: https://github.com/mkosturek/extreme-learning-vs-backprop,

  • statistical analysis of the results, which ensures that the conclusions drawn from the experiments are credible,

  • evaluation of the decision boundaries of both models for several runs,

  • ELM’s hyperparameter sensitivity analysis,

  • qualitative comparison focused on the practical application of both methods,

  • formulation of postulates concerning the directions in which ELM should be developed to handle current demanding datasets.

These contributions help machine learning researchers and practitioners decide which model to use depending on the dataset size and what results to expect regarding training time and model efficiency. Our research also gives some insight into the decision boundaries of both models. This is crucial given the growing need to process massive datasets. Our contributions can also assist researchers working on random weights networks by showing the current challenges and obstacles in practical applications of ELM models and the need to compare enhanced ELM models to classical neural networks. Finally, the ELM hyperparameter sensitivity analysis helps in defining how to set hyperparameter values properly.

1.3 Paper structure

The paper consists of seven sections. Section 1 presents an introduction to the topic of this study. In Sect. 2, we characterise the two compared approaches to training neural networks: backpropagation in Sect. 2.1 and extreme learning in Sect. 2.2. In Sect. 3, we depict the procedure we adopted to conduct the experiments, and in Sect. 4 we describe the datasets used. Section 5 presents the experimental results and the conclusions derived directly from them. In Sect. 6, we share our insights from the experimental study and present postulates about the future development of ELMs. Finally, in Sect. 7, we summarise all conclusions drawn from this research.

2 Description of compared models

In this section, we briefly present both models, describing their architectures and training methods. We focus on classification and regression, as they are the two most common tasks in real applications. Solving these problems requires a training set used to find the parameters \(\theta\) (weights and biases) of the models. The training set consists of pairs \(<\mathbf {x_i}, \mathbf {y_i}>\), i.e. an input vector and the corresponding output vector (for classification, outputs are one-hot encoded). In the case of regression, the corresponding output is a scalar \(y_i\). Formally, the task is solved by a function \(f({\mathbf {x}};{\theta })\) implemented by a classical neural network or an ELM. For ELM, the parameters \(\theta\) are calculated in one step, in contrast to a classical network, where training is an iterative procedure that optimises a loss function \({\mathcal {L}}(f(\cdot ;\theta ), {\mathbf {x}}, {\mathbf {y}})\). Given the model parameters, the output \({\hat{{\mathbf {y}}}({\mathbf {x}})}\) specifies the predicted class or, for regression, the predicted value.

2.1 Classical neural networks

An integral building block of each neural network is the neuron. Its operation is formally described by Eq. 1.

$$\begin{aligned} {\hat{y}}({\mathbf {x}}) = f\left( \sum _{i=1}^N w_i x_i + b\right) \end{aligned}$$
(1)

where \({\hat{y}}\) is the neuron's output, N is the number of inputs, \(x_i\) is the value of the ith input, \(w_i\) is the weight assigned to the ith input, b is the bias, and f is an activation function.

Each neuron has many inputs represented by a vector \({\mathbf {x}} = [x_1, \dots , x_N]\). The weighted sum of the input signals forms the neuron's total input activity. The weights \({\mathbf {w}} = [w_1, \dots , w_N]\), one per input connection, are tuned during training. The bias b is a weight assigned to an additional input whose signal is always set to 1. The output signal \({\hat{y}}\) is produced by transforming the total input with the activation function f.
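As an illustration, a minimal NumPy sketch of a single neuron computing Eq. 1 could look as follows (the sigmoid activation and the input values are arbitrary example choices):

```python
import numpy as np

def sigmoid(z):
    # Example activation function f
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b, f=sigmoid):
    # Eq. 1: weighted sum of the inputs plus bias, passed through the activation f
    return f(np.dot(w, x) + b)

# Example with N = 3 inputs
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.05
print(neuron_output(x, w, b))
```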

In this paper, we consider layered, feedforward network architectures. The first one is the multilayer perceptron (MLP), where neurons are fully connected in neighbouring layers. Further, we will also describe the convolutional neural network (CNN).

MLP is composed of an input layer, one or many hidden layers, and an output layer. The number of hidden layers and the number of neurons in each layer are hyperparameters of a network. The number of outputs depends on the problem solved by the network (classification or regression). For simplicity, in the description below we assume the simplest MLP network with one hidden layer. The signal processing in the network can be described as a sequence of matrix operations for all patterns in the dataset \({\mathbf {X}}\) given in Eq. 2

$$\begin{aligned} {\mathbf {H}}({\mathbf {X}}) = f({\mathbf {X}}{\mathbf {W}}^{(h)} + b^{(h)}) \end{aligned}$$
(2)

where \({\mathbf {W}}^{(h)}\) is a matrix of hidden layer’s parameters, \(b^{(h)}\) is a vector of hidden layer’s bias values, and f is an activation function. The network output is defined by Eq. 3.

$$\begin{aligned} \mathbf {{\hat{Y}}}({\mathbf {X}}) = f(\mathbf {H(X)}{\mathbf {W}}^{(o)} + b^{(o)}) \end{aligned}$$
(3)

where \({\mathbf {W}}^{(o)}\) and \(b^{(o)}\) are a matrix of weights and vector of biases in the output layer.

There are many different activation functions; commonly used ones are the sigmoid, the hyperbolic tangent, the step function, ReLU [19, 30], the linear function, and softmax in the output layer for classification problems.
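For illustration, a minimal NumPy sketch of the forward pass from Eqs. 2 and 3 for a one-hidden-layer MLP, with ReLU in the hidden layer and softmax at the output as example choices, might be:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    # Row-wise softmax for a batch of output vectors
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mlp_forward(X, W_h, b_h, W_o, b_o):
    H = relu(X @ W_h + b_h)          # Eq. 2: hidden layer activations H(X)
    return softmax(H @ W_o + b_o)    # Eq. 3: network outputs

# Example: 4 samples, 3 features, 8 hidden neurons, 2 output classes
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W_h, b_h = rng.normal(size=(3, 8)), np.zeros(8)
W_o, b_o = rng.normal(size=(8, 2)), np.zeros(2)
print(mlp_forward(X, W_h, b_h, W_o, b_o))
```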

Another type of neural network considered in the comparison is the convolutional neural network (CNN) [24]. In convolutional layers, neurons are not fully connected: a given neuron is connected only to a defined local subset of neurons in the preceding layer. Moreover, the weights assigned to these connections are shared between the neurons of a single feature map of the convolutional layer. CNNs are widely used in image processing because they learn to extract the complex image features that best serve the task at hand. Each convolutional layer defines a kernel (or filter), a matrix of assumed size, significantly smaller than the image resolution. The filter values are tuned during training; they correspond to the neurons' connection weights. A single convolutional layer may consist of multiple filters, each producing a separate feature map. While processing an image, the convolutional filters are moved across the image stepwise by a constant number of pixels, and the convolution operation is calculated. It defines the total activation of one neuron in a convolutional layer, Eq. 4.

$$\begin{aligned} y_{kij}({\mathbf {X}}, {\mathbf {W}}) = \sum _{p=1}^P\sum _{q=1}^Q w_{kpq}\ x_{i+p, j+q} \end{aligned}$$
(4)

where \({\mathbf {X}}\) is the input image, \({\mathbf {W}}\) denotes the set of convolutional filters (of size \(K\times P\times Q\)), K is the number of filters (feature maps), \(P\times Q\) is the size of a single convolutional filter, and i, j are the coordinates of a single neuron.
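A direct, unoptimised NumPy sketch of Eq. 4 for a single-channel input (a naive loop implementation with stride 1 and no padding, for illustration only) might be:

```python
import numpy as np

def conv2d_single_channel(X, W):
    # X: input image, W: filters of shape (K, P, Q); stride 1, no padding
    K, P, Q = W.shape
    h_out, w_out = X.shape[0] - P + 1, X.shape[1] - Q + 1
    Y = np.zeros((K, h_out, w_out))
    for k in range(K):
        for i in range(h_out):
            for j in range(w_out):
                # Eq. 4: sum over the P x Q window weighted by filter k
                Y[k, i, j] = np.sum(W[k] * X[i:i + P, j:j + Q])
    return Y

# Example: an 8x8 image and 2 filters of size 3x3 give 2 feature maps of size 6x6
rng = np.random.default_rng(0)
print(conv2d_single_channel(rng.normal(size=(8, 8)), rng.normal(size=(2, 3, 3))).shape)
```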

Similarly to the MLP, the convolution output values are processed by an activation function, and then pooling is applied. Its role is to decrease the size of a feature map. It is usually implemented as the maximum (max-pooling) or the average (average-pooling) over a sliding window. Pooling makes a convolutional network more robust to small image rotations and translations.

Neural network training is performed as an optimisation task: we search for the minimum of the cost (loss) function, which defines the error the network makes when approximating the target function \({\mathcal {F}}\). The most commonly used loss functions are the mean square error, binary cross-entropy, and categorical cross-entropy.

The primary method of neural network training is gradient descent, shown in Algorithm 1. It requires a training dataset \({\mathcal {D}}\). In each iteration, the network computes its output (line 2), and the model parameters are then corrected by a small step in the direction opposite to the gradient of the cost function \({\mathcal {L}}\) (line 3). As a result, the cost value decreases. The procedure continues until a stopping criterion is satisfied.

Long training may cause the network to overfit; therefore, in practice, additional techniques are applied. One of them is early stopping. It relies on setting aside an additional validation dataset, which is used in every iteration to monitor the cost value. Training is stopped when the validation cost starts to increase.

Algorithm 1

Gradient descent requires calculating the gradient of the loss function with respect to all network parameters. For complex, multilayered network architectures, this is a demanding task; therefore, the backpropagation algorithm [26] is commonly used. It computes the loss function gradient for the last network layer and then, using the chain rule, the gradient for the weights in the immediately preceding hidden layer. For a network with more layers, the gradient in the nth layer is calculated analogously by propagating the loss function gradient from the \((n+1)\)th layer. The loss function gradient for the weights in the nth network layer can be calculated using the following recursive definition.

$$\begin{aligned}&\frac{\partial {\mathcal {L}}}{\partial {\mathbf {W}}_n} = f_{n-1}\ \delta _n \end{aligned}$$
(5)
$$\begin{aligned}&\hbox {where:}\quad \delta _i = {\left\{ \begin{array}{ll} \left( {\mathbf {W}}_{i+1} \delta _{i+1} \odot f_{i}'\right) ^\top &{} i < L\\ \left( {\mathcal {L}}' \odot f_L'\right) ^\top &{} i = L \end{array}\right. } \end{aligned}$$
(6)

where \({\mathcal {L}}\) is the loss function, L denotes the total number of layers (and the index of the output layer), \({\mathbf {W}}_i\) is the weight matrix between \((i-1)\)th layer and ith layer, \(f_i\) is the activation vector in the ith layer, \('\) denotes a value of the derivative (or a gradient), and the operator \(\odot\) is the Hadamard product.

The method described above is in its simplest form. Stochastic gradient descent [2] is in common use, where in a single iteration the network parameters are updated based on a small portion of the training dataset, a batch. Another improvement is the use of momentum. Further advanced methods adapt the learning rate, for instance ADAM [22]. A high level of generalisation is the aim of every machine learning model; techniques that improve generalisation are called regularisation. The most popular ones are the L2 method and dropout.
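For illustration, a minimal PyTorch sketch of mini-batch training with backpropagation and ADAM is given below; the network size, loss function, and data are arbitrary example choices and do not reproduce the set-up used in our experiments.

```python
import torch
from torch import nn

# Toy regression data: 256 samples, 10 features
X = torch.randn(256, 10)
y = torch.randn(256, 1)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    for i in range(0, len(X), 16):          # mini-batches of 16 samples
        xb, yb = X[i:i + 16], y[i:i + 16]
        optimiser.zero_grad()               # reset accumulated gradients
        loss = loss_fn(model(xb), yb)       # forward pass and loss value
        loss.backward()                     # backpropagation (chain rule, cf. Eqs. 5 and 6)
        optimiser.step()                    # parameter update
```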

2.2 Extreme learning machines

In contrast to classical neural networks, ELMs are based on a random projection of the input feature space onto hidden features, followed by linear regression. The hidden connections of an ELM can be weighted randomly, and there is no need to tune them.

The most straightforward ELM architecture is a feedforward neural network with a single hidden layer. It resembles the MLP architecture, but the weights between the input and hidden layers are assigned randomly and not tuned; they perform a random black-box transformation from the input feature space to the hidden features. Only the parameters of the output layer are tuned.

Extreme learning itself is not necessarily limited to this simple architecture; it could be incorporated into many well-known deep learning models. Huang et al. [17] propose a method based on extreme learning and CNNs for image classification, called local receptive fields ELM (LRF-ELM). Local receptive fields are a general concept: a single neuron is responsible for aggregating the signal from a specific region of the input image. For implementation purposes, however, LRF-ELM usually simplifies the network to a single convolutional layer with pooling and a fully connected output layer. Only the weights of the output layer are optimised; the convolutional layer acts as a random feature extractor, i.e. the values of all convolutional filters are chosen at random. There are two additional operations specific to LRF-ELM. The first is the orthogonalisation of the filters after their initialisation, which aims to minimise the risk of randomly initialised convolutional filters producing redundant features. The second is a specific pooling method, square root pooling, given in Eq. 7, which reduces the data dimensionality and introduces nonlinearity into the network. This operation is crucial because LRF-ELM has no activation function after the convolutional layer.

$$\begin{aligned} \text {Square Root Pooling}({\mathbf {X}}) = \sqrt{\sum _i^{D_1}\sum _j^{D_2} {\mathbf {x}}_{ij}^2} \end{aligned}$$
(7)

where \({\mathbf {X}}\) is the input feature map (the output of the convolution layer) of size \(D_1\times D_2\), with elements \(x_{ij}\).
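A minimal NumPy sketch applying Eq. 7 over sliding windows is shown below; the 5\(\times\)5 window with stride 2 mirrors the set-up used later in our experiments and serves only as an example.

```python
import numpy as np

def square_root_pooling(X, window=5, stride=2):
    # Apply Eq. 7 to each window of the feature map X
    rows = (X.shape[0] - window) // stride + 1
    cols = (X.shape[1] - window) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            patch = X[i * stride:i * stride + window, j * stride:j * stride + window]
            out[i, j] = np.sqrt(np.sum(patch ** 2))
    return out

# Example: a 28x28 feature map is reduced to a 12x12 pooled map
print(square_root_pooling(np.random.rand(28, 28)).shape)
```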

In order to compute optimal output layer weights \(\varvec{\beta }\), Eq. 9 has to be solved—network outputs should match the expected values \({\mathbf {Y}}\) from the dataset. \({\mathbf {H}}\) represents a matrix of random features extracted from the training dataset \({\mathbf {X}}\). For single hidden layer networks, they are computed according to Eq. 8.

$$\begin{aligned} {\mathbf {H}} = f({\mathbf {X}} \mathbf {W_0} + {\mathbf {b}}) \end{aligned}$$
(8)
$$\begin{aligned} {\mathbf {Y}} - {\mathbf {H}} \varvec{\beta } = {\mathbf {0}} \end{aligned}$$
(9)

where \({\mathbf {X}}\) and \({\mathbf {Y}}\) are the training inputs and expected outputs, respectively, \(\mathbf {W_0}\) and \({\mathbf {b}}\) are the weights and biases in the hidden layer (both random), and f is the activation function.

To find the output layer weights \(\varvec{\beta }\) from Eq. 9, it is necessary to invert the matrix \({\mathbf {H}}\), as presented in Eq. 10, which is computationally difficult for high-dimensional or large datasets (and, in general, \({\mathbf {H}}\) is not even square). To address this issue, the ELM authors proposed using the Moore-Penrose pseudoinverse, which leads to the final solution for the optimal weights, shown in Eq. 11. It is equivalent to least squares optimisation.

$$\begin{aligned} \varvec{\beta } = {\mathbf {H}}^{-1} {\mathbf {Y}} \end{aligned}$$
(10)
$$\begin{aligned} \varvec{\beta } = ({\mathbf {H}}^T {\mathbf {H}})^{-1} {\mathbf {H}}^T {\mathbf {Y}} \end{aligned}$$
(11)

This algorithm is called extreme learning. It is also applied in the classification layer of LRF-ELM, whose training procedure is presented in Algorithm 2.

Algorithm 2

Just as with classical neural networks, there exist techniques that improve the performance of ELMs. ELMs can be regularised using L2 regularisation [12, 17]. Equation 12 shows the regularised solution for the optimal ELM parameters, where C denotes the regularisation coefficient and \(\mathbb {1}\) is the identity matrix.

$$\begin{aligned} \varvec{\beta } = (C\mathbb {1} + {\mathbf {H}}^T {\mathbf {H}})^{-1} {\mathbf {H}}^T {\mathbf {Y}} \end{aligned}$$
(12)

Even though ELM parameters are chosen at random, the probability distribution used to sample them may impact model performance. The authors of [45] compare various distributions with different variances and recommend, as a good default choice, a Gaussian distribution with mean equal to 0 and a standard deviation less than or equal to 0.1.
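Putting Eqs. 8 and 12 together, a minimal NumPy sketch of regularised ELM training could look as follows; the hidden weights are sampled from the Gaussian distribution recommended above, while the ReLU activation and the layer sizes are example choices.

```python
import numpy as np

def train_elm(X, Y, n_hidden=100, C=0.01, seed=0):
    rng = np.random.default_rng(seed)
    # Random hidden-layer weights and biases (never trained), Gaussian with std 0.1
    W0 = rng.normal(0.0, 0.1, size=(X.shape[1], n_hidden))
    b = rng.normal(0.0, 0.1, size=n_hidden)
    H = np.maximum(0.0, X @ W0 + b)                 # Eq. 8 with ReLU activation
    # Eq. 12: regularised closed-form solution for the output weights beta
    beta = np.linalg.solve(C * np.eye(n_hidden) + H.T @ H, H.T @ Y)
    return W0, b, beta

def predict_elm(X, W0, b, beta):
    return np.maximum(0.0, X @ W0 + b) @ beta

# Example: regression with 500 samples and 8 features
rng = np.random.default_rng(1)
X, Y = rng.normal(size=(500, 8)), rng.normal(size=(500, 1))
W0, b, beta = train_elm(X, Y)
print(predict_elm(X, W0, b, beta).shape)
```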

The aim of this section was to present the essential elements of both models and their training methods. The next section is devoted to the experimental research comparing both models in terms of training time and achieved results.

3 Research methodology

The experiments on ELM hyperparameter sensitivity let us understand how the hyperparameters influence the final ELM results. This analysis is described in Appendix A. It enables us to assess which parameters have a particularly strong impact on the model responses and to determine good default hyperparameter values.

We have designed two series of experiments: in the first one, models have comparable capacities. In the second one, they have well-suited architectures based on the literature review or hyperparameter optimisation.

The first series is designed to compare just the learning algorithms. Given identical network architectures, we train them using extreme learning and backpropagation. This comparison can show whether it is beneficial to choose one algorithm over the other. This series consists of two parts. The first part refers to ELM and fully connected networks trained with backpropagation, performing classification and regression tasks on data with vector representation. This part of experimentation utilises well-known, real-life benchmark datasets, datasets used in [16] and several artificial datasets described in Sect. 4.

To ensure the objectivity of the conclusions, we used datasets with diverse properties concerning data dimensionality, class balance, sparsity of features, and the presence of continuous and discrete features. The datasets were acquired from two public repositories: UCI [8] and the University of Porto repository [46]. Tables 1 and 2 present the characteristics of the datasets for regression and classification, respectively.

The second part of this series is devoted to the image classification task. We considered two approaches to image classification:

  • using an external extractor of visual features and a fully connected network,

  • using a convolutional network, which learns features on its own.

The previously mentioned LRF-ELM model [17] allows researchers to utilise extreme learning for convolutional networks. The authors of [18] show that ELM performs well in image classification when using HOG (histogram of oriented gradients) features. That paper considers only one task, road sign classification; however, it seems worthwhile to investigate this approach on more datasets, described in Table 5. Therefore, the models covered in this comparison are:

  • LRF-ELM,

  • Convolutional network trained with backpropagation,

  • ELM using HOG features,

  • fully connected network trained with backpropagation on HOG features.

In the second series, we try to tune the architectures and hyperparameter set-ups for a given problem, just like a machine learning practitioner would do. In this task, we make use of the available literature: for classical networks, we often refer to the best results reported in other papers, while we perform our own hyperparameter search for ELMs. This series also consists of two parts. Analogously, the first part covers classification and regression on data with vector representation (this time, however, we do not use synthetic datasets), and the second concerns image classification. The choice of image datasets is extended with two variants of ImageNet, with resolutions of \(16 \times 16\) and \(32 \times 32\) pixels. We compare training times whenever possible, i.e. for training performed on our own and when the time is given in the papers we refer to. The main focus, however, lies on the comparison of performance measured with various metrics:

  • F1-score in the first series of comparison for the classification tasks. In this series, we performed training and evaluation of all models on our own. Therefore we were able to measure the F1-score, which is more robust to imbalanced classes than accuracy.

  • Accuracy in the second series of comparison for the classification tasks. This time we referred to results reported in the literature, which are usually stated using only the accuracy metric.

  • RMSE (normalised) for the regression task.

Their definitions are given below.

F1-score. It is a measure of binary classification quality that summarises the numbers of Type I and Type II errors made by the classifier. Type I errors, also called false positives, correspond to situations where the classifier mistakenly predicts the positive class. Type II errors, false negatives, mean that the classifier mistakenly predicted the negative class. The F1-score is therefore defined as stated in Eq. 13

$$\begin{aligned} F_1 = \frac{2\cdot TP}{2\cdot TP + FN + FP} \end{aligned}$$
(13)

where TP is the number of true positives (correct predictions of the positive class), FP the number of false positives, and FN the number of false negatives. To use this metric for multiclass classification, we compute the binary F1-score for each class separately and then evaluate the average F1-score weighted by the number of samples in each class. The formal description is given in Eq. 14.

$$\begin{aligned} F_1 = \frac{\sum _{k=1}^K{w_k F_1^{(k)}}}{\sum _{k=1}^K{w_k}} \end{aligned}$$
(14)

where \(F_1^{(k)}\) is the F1-score for class k, K is the number of classes, and \(w_k\) is the number of samples in class k. For simplicity, we denote both the binary and the multiclass version of this measure as F1-score.
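In practice, the weighted multiclass F1-score of Eq. 14 corresponds to what scikit-learn computes with average='weighted', for example:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

# Binary F1 per class, averaged with weights equal to the class supports (Eq. 14)
print(f1_score(y_true, y_pred, average="weighted"))
```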

Accuracy. It is the most basic measure of classification performance: the ratio of correctly classified samples to the number of all data samples in the dataset, as stated in Eq. 15.

$$\begin{aligned} Accuracy = \frac{\# \textit{of\, correct\, predictions}}{\# \textit{of\, all\, test\, data\, samples}} \end{aligned}$$
(15)

RMSE. Root mean square error is a measure of regression quality. It is defined in Eq. 16.

$$\begin{aligned} \text {RMSE}(\hat{{\mathbf {Y}}},{\mathbf {Y}}) = \sqrt{\frac{\sum _i^N{(\hat{y_i} - y_i)^2}}{N}} \end{aligned}$$
(16)

where \({\mathbf {Y}}\) is the vector of expected model responses for the N data samples, \(\hat{{\mathbf {Y}}}\) is the vector of actual model responses, and N is the number of samples in the testing dataset. In this paper, we present normalised values of RMSE. Normalisation is performed by dividing the RMSE by the range of the target variable (the difference between its greatest and lowest values), which makes the magnitude of the model error easier to interpret without any domain knowledge about the target variable.
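A minimal sketch of the normalised RMSE as used here (Eq. 16 divided by the range of the target variable) could be:

```python
import numpy as np

def normalised_rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))   # Eq. 16
    return rmse / (y_true.max() - y_true.min())       # divide by the target range

print(normalised_rmse([1.0, 2.0, 4.0], [1.1, 2.3, 3.6]))
```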

4 Datasets used

In selecting datasets for experimental research, we were guided by the diversity of the data in terms of class balancing, data dimensions, number of classes, continuous and discrete data, and sparsity of features. Datasets were acquired from two public repositories: UCI [8] and University of Porto repository [46]. Table 1 presents characteristics of datasets for regression and Table 2 for classification. Moreover, some synthetic datasets were considered.

Table 1 Characteristics and descriptions of datasets for regression from UCI and University of Porto repositories used for comparisons
Table 2 Characteristics and descriptions of datasets for classification from UCI repository used for hyperparameter sensitivity analysis and for comparisons

We also propose three types of artificial classification datasets and several datasets for regression. They are characterised in Tables 3 and 4, and described in detail in Appendix B.

Table 3 Characteristics of artificial datasets for classification used for hyperparameter sensitivity analysis and for comparisons
Table 4 Characteristics of artificial datasets for regression used for comparisons

Image classification datasets. None of the available ELM studies presents results obtained on benchmark image classification datasets, such as MNIST OCR and CIFAR-10. For this reason, we included them in the experiments. In addition, we use the NORB dataset, which was used in [17], and the GTSRB set, used in [18]. All image datasets are characterised in Table 5.

Table 5 Specifications and descriptions of image classification datasets used in this study

5 Experimental study

We conducted all experiments on a computer with a specification presented in Table 6.

Table 6 Specification of the computer on which the experiments were conducted

For the needs of the study, the ELM and LRF-ELM models were implemented using Python 3.6 and the PyTorch 1.4 library [34]. All the reference models (fully connected and convolutional networks trained with backpropagation) were implemented using the following libraries: scikit-learn, scipy, scikit-posthocs, and OpenCV. Our implementation is available at: github.com/mkosturek/extreme-learning-vs-backprop.

In each series, the first experiment compares ELM models and neural networks trained using backpropagation (also referred to as MLP). The next one focuses on image recognition datasets: we compare the simple ELM and MLP models working on HOG features extracted from images, and we also examine models dedicated to image classification, i.e. LRF-ELM and convolutional neural networks.

5.1 Models with comparable capacity

In this series, we compare models with the same architecture and comparable number of neurons.

5.1.1 Experiment 1: comparison of ELM and neural networks performance and training times for input data with vector representation

The purpose of the study is to compare the performance and training time of ELM models and neural networks of similar capacities on datasets with vector representation, for regression and classification tasks. We based the choice of values of three hyperparameters (regularisation coefficient, activation function, and hidden layer size coefficient) on the hyperparameter sensitivity analysis (Appendix A), choosing values that gave decent average performance across various datasets. Eventually, the following hyperparameter values were applied:

  • Regularisation - L2 was applied for both ELM and MLP with \(C=0.01\),

  • Activation function - ReLU for both MLP and ELM,

  • Optimisation algorithm - ADAM algorithm [22] was used while training by backpropagation,

  • Batch size - set to 16 while training by backpropagation,

  • Learning rate - set to 0.001 while training by backpropagation,

  • Training stopping criteria - early stopping with patience set to 3 epochs and maximum number of epochs set to 200,

  • Hidden layer size coefficient - based on the hyperparameter sensitivity analysis in Appendix A, \(h=10\) for both ELM and the backpropagation-trained network. (This is in contrast to the comparison presented in [16], where the number of neurons in ELM was usually about twice as high as in the MLP.)

  • Significance level - \(\alpha =0.05\)

The experiment was conducted on 23 classification datasets and 30 regression datasets, presented in Tables 2 and 1, respectively. Moreover, synthetic datasets were used; they are described in detail in Appendix B. We generated the data in multiple variants, with dimensionality varying from 1 up to 1000. Each synthetic dataset contained 2000 examples.

Each model was evaluated on every dataset using fivefold cross-validation repeated 10 times, giving a sample of 50 results for each model on each dataset. In the classification task, the F1 measure was used to compare performance. For the regression task, we used normalised RMSE, obtained by dividing the RMSE by the range of the target variable (the difference between its maximum and minimum). To evaluate the training speed, we measured the training time in seconds. To ensure the reliability of the comparison of the average values, we used the dependent t-test for paired samples; however, as indicated in [3], in the case of repeated cross-validation, the basic form of this t-test may lead to false and non-reproducible results. Therefore, we used a modified paired-samples t-test, called the corrected repeated k-fold cross-validation test.
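For reference, a minimal sketch of such a variance-corrected paired t-test is given below. It follows the commonly used correction factor \(1/(k \cdot r) + n_{test}/n_{train}\); the exact formulation adopted in [3] may differ in details, so this is an illustrative assumption rather than our exact implementation.

```python
import numpy as np
from scipy import stats

def corrected_repeated_cv_ttest(diffs, n_train, n_test, k, r):
    # diffs: per-fold differences in the metric between the two models (length k * r)
    d = np.asarray(diffs)
    mean_d, var_d = d.mean(), d.var(ddof=1)
    # Variance correction accounting for overlapping training sets across folds
    t_stat = mean_d / np.sqrt((1.0 / (k * r) + n_test / n_train) * var_d)
    p_value = 2 * stats.t.sf(np.abs(t_stat), df=k * r - 1)
    return t_stat, p_value

# Example: 10 repetitions of fivefold CV, i.e. 50 paired differences
rng = np.random.default_rng(0)
diffs = rng.normal(0.01, 0.02, size=50)
print(corrected_repeated_cv_ttest(diffs, n_train=800, n_test=200, k=5, r=10))
```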

Tables 7 and 8 show the results of comparison of performance and training times in classification and regression tasks. Whenever a p-value from the statistical test was below the significance level \(\alpha =0.05\), the better model’s result was presented in bold, denoting a significant difference between compared models.

Table 7 Comparison of the average F1-measure and training times (in seconds) for ELM and MLP
Table 8 Comparison of the average RMSE and training times (in seconds) for ELM and MLP

Discussion Considering classification performance in terms of F1 measure, the results presented in Table 7 show that for most sets (15 out of 23) the statistical test did not show any significant difference between the average results of the models. On 8 sets, where the differences were statistically significant, better results were achieved by ELM. It is worth noting that the differences between the models were found to be significant on high-dimensional synthetic sets rather than on their low-dimensional counterparts.

A qualitative comparison of the decision regions of sample ELM and MLP models is also interesting. For the simplest, two-dimensional datasets, they are visualised in Figs. 1, 2 and 3. It is worth underlining the significant differences in the shapes of the ELM decision regions across different runs.

Fig. 1 Visualisation of decision regions of sample models on the nested circles dataset (two-dimensional hyperspheres)

Fig. 2 Visualisation of decision regions of sample models on the XOR classification dataset (two-dimensional hypercube vertices)

Fig. 3 Visualisation of decision regions of sample models on the intertwined spirals dataset

Although the identified decision regions usually correctly cover the observations of the training set, in the case of ELM these regions undergo significant shape changes just beyond the areas where the observations are concentrated. This is particularly noticeable for the nested hyperspheres dataset, where the inner-circle class region also appears outside both circles. The decision boundaries of MLP appear much simpler than those of ELM, and the ELM boundaries vary across each of the 15 runs of the method. This phenomenon suggests better generalisation capabilities of MLPs. For most regression datasets, the statistical test showed significant differences in model performance (22 out of 30 datasets).

Of these 22 comparisons, only 8 indicated better results for the ELM model, and 14 indicated better results for the MLP. This is the opposite of the classification result. However, Table 8 shows that the MLP achieved significantly better results on only one real-life dataset, Servo, and on three synthetic ones (linear function, 2nd-degree polynomial, and Friedman I), although for the majority of their possible dimensionalities. ELM proved significantly better on three real datasets and on the synthetic datasets based on trigonometric functions and the Friedman II and III functions. It should be noted that all of these 8 datasets are low-dimensional, i.e. their dimensionality did not exceed 10.

The obtained results indicate that ELM models achieve better results in low-dimensional, nonlinear regression problems, which occur in many real-life tasks, such as the datasets based on trigonometric functions and the Friedman II and III curves derived from physical phenomena. MLP, on the other hand, achieves better results for high-dimensional problems and those with linear or polynomial dependencies.

The comparison shows that ELM models are significantly faster than neural networks trained by backpropagation. On each of the 53 sets, for both classification and regression tasks, the statistical test showed a significant difference in processing times, always in favour of ELM. Typically, learning times for MLP turn out to be 4 orders of magnitude higher than for ELM.

5.1.2 Experiment 2: comparison of ELM and neural networks performance in image classification

The aim of the experiment is to compare networks trained with extreme learning and backpropagation in the task of image classification. We compare F1-measure and training times. The following hyperparameter values were set after preliminary studies:

  • Regularisation coefficient - \(C=0.01\). It was set based on the preliminary experiment, which is not shown here.

  • Hidden layer size coefficient - for models based on HOG: \(h=10\),

  • Activation function - ReLU,

  • Architecture of convolutional networks - one convolutional layer, one pooling layer, with no activation function for both CNN and LRF-ELM. A flattened feature map after pooling is the input to the fully connected classification layer,

  • Number and size of convolutional filters - 10 filters, with a size of 5\(\times\)5 pixels, the stride is 1 pixel for both LRF-ELM and CNN,

  • Pooling method and window size - Square Root Pooling for LRF-ELM and CNN. Window size is 5\(\times\)5 pixels, with the stride equal to 2 pixels,

  • Optimisation algorithm when using backpropagation - ADAM [22],

  • Batch size while training by backpropagation - for MLP model with HOG features: 32; for CNNs, the batch size was 128,

  • Learning rate - set to 0.001 while training by backpropagation,

  • Training stopping criteria - early stopping with patience set to 3 epochs and maximum number of epochs set to 200,

  • Significance level for statistical testing - \(\alpha =0.05\),

  • HOG descriptor set-up - the number of gradient orientations is 9, the block size is 2\(\times\)2 cells, each cell is 8\(\times\)8 pixels.

The experiment was conducted on 4 datasets, whose characteristics are presented in Table 5. The CIFAR-10 dataset was used in its original form; its HOG vector consists of 324 features. The GTSRB dataset consists of images of various sizes; therefore, for the LRF-ELM models and convolutional networks, all images were scaled to 32\(\times\)32 pixels. The HOG features for this set were calculated on images scaled to 43\(\times\)43 pixels, the median image size in the set, giving a HOG vector of 576 features. The MNIST dataset was used in its original form; its HOG feature vector consists of 144 features. All images in the NORB dataset are 96\(\times\)96 pixels. Due to hardware limitations, for networks using convolutional layers, the images were scaled to a resolution of 32\(\times\)32; the HOG features were determined on images scaled to 48\(\times\)48 pixels, giving a HOG vector of 900 features.

On every dataset, a fivefold cross-validation was repeated five times for each of the tested models. Comparison is then based on performance measures averaged over these repetitions. For performance comparison, F1 measure was used, while the time measured in seconds was used to assess the training speed. Time measurement includes only model training and not HOG features extraction.

Figure 4 shows the results of the performance comparison for each of the four datasets. We performed the corrected repeated cross-validation t-test pairwise for all models, assuming a significance level of \(\alpha =0.05\). These tests show that on the CIFAR-10 and GTSRB datasets, the CNN and HOG+ELM approaches achieved the best performance, with no significant difference between them; on MNIST, CNN was the single best-performing model; and on NORB, the best results were achieved with the HOG+ELM approach. Moreover, the tests show that the performance of CNN models trained on CPU and GPU does not differ, so the two variants are only used to compare training times.

Figure 5 shows a comparison of the training times. Corrected repeated cross-validation tests were also performed. They showed that each of the average training times was significantly different from the others.

Fig. 4 Comparison of the classification performance of tested models on 4 datasets for image classification. Average and standard deviation of results obtained from five repetitions of fivefold cross-validation are presented

Fig. 5 Comparison of training times of the tested models on 4 image classification datasets. Average and standard deviation of results from five repetitions of fivefold cross-validation are presented. Times are shown on a logarithmic scale due to their considerable range

Discussion Figure 4 shows that CNN and ELM models using HOG features provide the best classification performance the same number of times. On CIFAR and GTSRB sets, these two models achieved results that did not differ significantly. On the MNIST set, convolutional networks using backpropagation proved to be better, and on the NORB set, the ELM model achieved the best results.

LRF-ELM achieved a result comparable to CNN only on NORB and GTSRB, while in the two other cases networks trained by backpropagation were superior.

A surprisingly large improvement in classification performance was achieved by using extreme learning instead of backpropagation on HOG features. On three out of four datasets, the ELM model using these features also proved superior to the LRF-ELM model, which was designed specifically for image processing.

It should be noted that the architecture of the convolutional networks used in this experiment was very limited compared with the usual design of such networks; for example, the number of filters was limited to 10. This is due to the high memory requirements of the LRF-ELM model and our hardware limitations. For convolutional networks trained with backpropagation alone, the network size could be increased considerably. Therefore, drawing conclusions from a comparison of image-based models with those using HOG features may not be fully justified, and the superiority of a simple extractor over a dedicated image network architecture cannot be confirmed conclusively.

Again, the comparison showed that extreme learning is significantly faster than backpropagation. On three out of four datasets, LRF-ELM proved to be the fastest, while ELM using the HOG feature extractor was the fastest on a single dataset. Using the GPU accelerated backpropagation training more than a hundredfold, to a level comparable to the ELM models using HOG features; however, it did not reach training times as short as those of LRF-ELM.

5.2 Models with well-suited architectures for a given dataset

In this series, we conduct experiments following a machine learning practitioner's approach: we compare both methods using models whose hyperparameter values are tuned to achieve the best possible results or, where available, using reported state-of-the-art results as a reference.

5.2.1 Experiment 1: comparison of ELM and neural networks performance and training times for input data with vector representation

In this scenario, the goal is to find the best-performing solution. Therefore, we run a hyperparameter search for ELMs and make use of works published throughout the years of classical network development. We then compare the best solutions found on regression and classification datasets with vector representation. Tables 9 and 10 show the best ELM hyperparameter set-ups found for each dataset. Due to discrepancies in the literature concerning the metrics used for evaluating regression models, we decided to perform the hyperparameter search for classical networks on our own; Table 11 shows the best hyperparameters found for each regression dataset. Tables 12 and 13 present the best results achieved with ELMs and classical networks.

Table 9 Hyperparameter set-up for the best found ELM models for each classification dataset
Table 10 Hyperparameter set-up for the best found ELM models for each regression dataset
Table 11 Hyperparameter set-up for the best found MLP models for each regression dataset
Table 12 Comparison of the accuracy for the best found ELM and MLP solutions for given classification datasets
Table 13 Comparison of the RMSE (normalised) and training times for best found ELM and MLP solutions for given regression datasets

Discussion This experiment shows that both extreme learning and backpropagation are able to produce a model that fits a given task well. However, classical networks proved superior on the majority of tested datasets in both classification (6 out of 9 datasets) and regression (4 out of 6 datasets). It must nonetheless be noted that there are cases where ELMs fall behind standard networks only by a small margin (e.g. the Machine CPU and Auto MPG datasets) and cases where they surpass classical networks significantly (e.g. the Sonar and Bank datasets).

Overall, this experiment demonstrates that the many years of extensive research on backpropagation-based networks have paid off: classical networks are a solid, robust framework allowing very efficient modelling. However, ELMs seem to be a good complement to classical networks, easy and quick to train and capable of producing very competitive results. Considering the disproportion in the amount of research on ELMs compared to backpropagation, extreme learning seems a promising approach that is worth further development.

5.2.2 Experiment 2: comparison of ELM and neural networks performance in image classification

Just like in the previous experiment, the aim of this study is to achieve the best possible performance with each method in the image classification task. This comparison covers a large-scale dataset—ImageNet. It was constrained to two variants, with 16\(\times\)16 and 32\(\times\)32 resolution, due to the high memory complexity of ELMs. Again, we performed hyperparameter search for ELM-based classifiers and supported the comparison with results achieved with classical networks reported in the literature. Table 14 presents the best hyperparameter set-ups found for LRF-ELM models for each dataset. Table 15 shows the performance and time comparison of ELMs and backpropagation-based models.

In this experiment, we modified the preprocessing steps on the NORB dataset to match [37], against whose results we compare our ELM.

Table 14 Hyperparameter set-up for the best found ELM models for each classification dataset
Table 15 Comparison of the accuracy and training times for the best found ELM and BP solutions for given image classification datasets

Discussion Comparing current state-of-the-art deep learning models for image classification to the LRF-ELM clearly demonstrates classical deep learning superiority. LRF-ELM was inferior on every dataset considered, often by a great margin. This outcome was expected because LRF-ELM corresponds to a very simple CNN, whereas modern deep learning architectures are significantly more complex. In the case of image classification, it is hard to argue that the training speed is an advantage of ELMs, because the performance decrease is too high. Presently, extreme learning is not a competitive alternative for deep learning-based image classification.

Again, this experiment shows a large disproportion between the development of extreme learning-based architectures and that of backpropagation-based ones. Extending ELMs into a framework allowing efficient implementation and training of more complex network architectures could prove valuable.

6 Findings based on the performed research

As mentioned before, extreme learning machines are highly memory-consuming. This is because computing the output weights requires a (pseudo)inverse involving the latent representation of the entire training dataset, which means storing the hidden features matrix in memory. One could argue that ELMs could be implemented so that this matrix is cached on disk, but the main advantage of extreme learning, short training times, would then suffer significantly. Here we present a short theoretical study of the memory required to train an ELM on the 224\(\times\)224 ImageNet dataset, which consists of roughly 14 million images. Let us make the following assumptions:

  • the training is performed on ImageNet’s 1 mln images subset, used in popular challenges such as ILSVRC,

  • all images are resized to constant 224\(\times\)224 px size,

  • the LRF-ELM architecture is configured as follows:

    • 16 convolutional filters only,

    • filter size: 5\(\times\)5,

    • pooling size: 5\(\times\)5,

    • pooling stride: 2.

  • standard 32-bit precision floating-point numbers are used.

The assumed image size and network architecture produce hidden feature maps of size 108\(\times\)108\(\times\)16 for each image. LRF-ELM flattens such feature maps; hence, a hidden representation is a vector of 186624 elements, and the hidden representation matrix \({\mathbf {H}}\) has size 1 million\(\times\)186624. Such a matrix consists of \(1.87\times 10^{11}\) numbers, each stored on 4 bytes, so the total size of the matrix \({\mathbf {H}}\) is approximately 746 GB.

Using such a limited number of convolutional filters already leads to memory requirements beyond the abilities of most systems. Let us assume the convolution and pooling configuration is changed so as to produce feature maps compressed to a 32\(\times\)32 size. Even in this case, the matrix \({\mathbf {H}}\) takes more than 60 GB. Scaling down further introduces a high risk that such randomly extracted, compressed features contain too little information for the ELM output layer to train properly; this effect is clearly seen in the experimental results in Table 15. With classical convolutional networks trained with backpropagation, the memory requirements are several orders of magnitude lower: there is no need to keep the entire dataset in memory, as training requires loading only one mini-batch at a time.
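The figures above can be verified with a few lines of arithmetic (sizes in decimal gigabytes):

```python
# Feature map size after 5x5 convolution (stride 1) and 5x5 pooling with stride 2 on a 224x224 image
conv_size = 224 - 5 + 1                  # 220
pooled_size = (conv_size - 5) // 2 + 1   # 108

n_images, n_filters, bytes_per_float = 1_000_000, 16, 4
features_per_image = pooled_size * pooled_size * n_filters       # 186624
print(n_images * features_per_image * bytes_per_float / 1e9)     # ~746.5 GB

# Feature maps compressed to 32x32 still require tens of gigabytes
print(n_images * 32 * 32 * n_filters * bytes_per_float / 1e9)    # ~65.5 GB
```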

Our experience from conducting this study is described in Table 16, which shows a qualitative comparison between ELM and classical neural networks in the current state of their development.

Table 16 Qualitative comparison of ELM models and classical neural networks

We can sum up its content as follows:

  • The number of hyperparameters is much higher in classical neural networks, but selecting their values can be supported by the extensive literature and by the help of active community members on public forums. Heuristics on how to set hyperparameter values for classical neural networks are easy to find.

  • The number of parameters in classical deep networks is huge, but because these models are popular, transfer learning can be used to speed up training. In ELM models, the number of parameters is lower, but for challenging image data the memory requirements are so vast that calculating the weights is not feasible on typical hardware.

  • There are few publicly available ELM implementations; the existing ones are usually simple and non-generic or written in Matlab, which makes them hard to use in more complex scenarios and within popular, well-supported environments such as Python with PyTorch or TensorFlow.

Summing up, in this paper we considered only the classical ELM methods, conducting experiments on image and vector-representation datasets. It should be noted, however, that some new approaches propose promising solutions; here we can mention incremental learning [42], the multilayer extreme learning machine (ML-ELM) and hierarchical extreme learning machine (H-ELM) described in [20], and non-iterative, fast learning algorithms for deep ELM [49]. This is a good prognosis for the further development of ELM, but none of these works has presented results on a dataset as challenging as ImageNet, nor compared its performance to mainstream deep learning methods.

7 Final conclusion

Based on the presented research, we conclude that the tested ELM models do not offer a real alternative to classical neural networks for contemporary problems characterised by complex patterns and huge datasets. The first series of experiments shows that their superiority over classical neural networks in terms of training time is visible only for networks with capacities corresponding to the time before the deep learning era. The demand for models that can be trained quickly for tough image, video, or audio problems is enormous. ELM models, which transform the input space into a high-dimensional hidden-neuron space with random weights, are attractive when one wants to train a model quickly. But to reach that state, we postulate developing smart algorithms for the inverse matrix calculation, so that determining the weights in the output layer for challenging datasets becomes feasible and memory efficient. Specific mechanisms are needed to avoid keeping the whole dataset in memory while computing the weights. Although we have noticed new approaches to these problems [20, 42, 49], the demanding datasets used with classical deep neural networks are still a challenge for ELMs. It also seems necessary to develop generic frameworks that give practitioners simple access to ELM models and enable the easy development of new architectures that efficiently utilise this training algorithm. Ultimately, sharing implementations should become common practice among ELM researchers. In future work, we plan to prepare a comparison with other random-weight networks, such as RVFL and its extended versions. There are also other promising solutions to consider, such as self-normalising networks, which outperformed all competing methods [23].