Extreme learning machine versus classical feedforward network

Our research is devoted to answering whether randomisation-based learning can be fully competitive with classical feedforward neural networks trained with the backpropagation algorithm on classification and regression tasks. We chose extreme learning machines (ELM) as a representative of randomisation-based networks. The models were evaluated with respect to training time and achieved performance. We conducted an extensive comparison of these two methods for various tasks in two scenarios: (i) using comparable network capacity and (ii) using network architectures tuned for each model. The comparison was conducted on multiple datasets from public repositories and some artificial datasets created for this research; overall, the experiments covered more than 50 datasets. Suitable statistical tests supported the results. They confirm that for relatively small datasets, ELMs are better than networks trained by the backpropagation algorithm. But for demanding image datasets, like ImageNet, ELM is not competitive with modern networks trained by backpropagation; therefore, in order to fully address current practical needs in pattern recognition, ELM needs further development. Based on our experience, we postulate developing smart algorithms for the inverse matrix calculation, so that determining weights for challenging datasets becomes feasible and memory efficient.
There is a need to create specific mechanisms to avoid keeping the whole dataset in memory to compute weights. These are the most problematic elements in ELM processing, establishing the main obstacle in the widespread ELM application.


Introduction
Artificial neural networks are among the fastest-growing areas of artificial intelligence. Existing deep models can solve tasks such as visual object recognition or automatic text translation, often achieving human-level performance. Frequently, these impressive results require weeks of training and substantial computational power. High training times are a consequence of the most commonly used training strategy: numerical optimisation using iterative methods and the backpropagation algorithm, whose purpose is to determine the error gradient for each parameter of the network. We will refer to such networks as classical or standard networks. High hardware requirements and long training times are frequent obstacles to applying neural networks in practice. These difficulties explain our interest in searching for an alternative solution. In this work, we examine randomisation-based neural networks, which offer significantly lower training times and are claimed to perform on par with classical networks. Specifically, we focus on extreme learning machines (ELM), a special case of random vector functional-link (RVFL) networks.

Motivation
Motivation for this research stems from our search for models useful for solving real classification and regression problems that are well suited to practical application. We limited our interest to processing inputs in the form of vectors and raw images; we do not consider sequential data. Random vector functional-link networks were proposed in [32], and their characteristics were discussed in [33]. Since then, they have been developed and evaluated [11,47,50]. RVFL is a feedforward network with a single hidden layer and direct links between the input and output layers which bypass the hidden layer. The most important characteristic of RVFL is that the weights assigned to the hidden layer are set randomly and are not optimised. The output layer weights are the only trainable parameters.

Corresponding authors: Urszula Markowska-Kaczmar (urszula.markowska-kaczmar@pwr.edu.pl), Michał Kosturek (kosturek.michal@gmail.com)
The authors of [13] proposed extreme learning machines. ELM follows the same principle as RVFL; in fact, it is a special case of RVFL in which the direct input-output links are disabled [5]. Therefore, ELM is architecturally identical to the commonly used classical feedforward network with a single hidden layer; the only difference between them is the training method. The authors of ELM claim that the time needed to assign ELM weights is significantly lower than for standard network training methods, with comparable or even better performance. Extreme learning machines were used as a baseline for many further models, such as stacked ELM [51], on-line sequential ELM [14], LRF-ELM designed for image classification [17], and biased dropout ELM [25]. ELM can also be considered a non-recurrent equivalent of reservoir computing (RC) models [40]. Because random units construct features (to deal with nonlinearly separable tasks) without learning, all remaining weights can be found in closed form, which means that no iterative numerical optimisation is required. This makes the model a natural candidate for overcoming the time-consuming training of classical neural networks.
The use of random parameters in randomisation-based networks may raise concerns about deteriorating performance. However, the authors of [16] prove that a sufficiently high number of random hidden features allows such networks to learn effectively. This statement sounds very promising, but our literature survey did not uncover an in-depth comparison of backpropagation and randomisation-based learning, covering either ELM or RVFL. There are, however, detailed comparisons between RVFL and ELM [9,43,50], where these models are compared on multiple datasets for both classification and regression. These comparisons show that using direct links between the input and output layers improves performance in classification and does not have a significant effect on the quality of regression. Nonetheless, these findings do not show whether randomisation-based learning offers a true alternative to commonly used backpropagation-based learning algorithms.
In the original paper on extreme learning [13], there is a simple comparison between ELMs and networks trained with backpropagation. The comparison covers two datasets: Diabetes for classification and California Housing for regression, both from the UCI repository [8]. The study ensures identical architectures for both models, so that the only properties compared are the training algorithms. A similar comparison was held on the Forest Type dataset, but in this case, the ELM contained twice the number of hidden neurons of the regular network; thus, this comparison lacks objectivity. Unfortunately, the authors do not specify the hyperparameter set-up, e.g. the stopping criterion for the backpropagation-based method, which makes the results non-reproducible. Further works on extreme learning include more comprehensive comparisons [16]. They cover more than 10 regression datasets obtained from the University of Porto repository [46] and several classification datasets, and the hyperparameter set-up is described in more detail than in the original paper on extreme learning [13].
Methods improving the basic RVFL and ELM models [25,47] usually use their immediate predecessors or simple RVFL and ELM as baselines, omitting backpropagation-based networks.
When proposing the LRF-ELM model, the authors of [17] conduct a comparison to a regular convolutional network using only a single dataset, NORB; conclusions from such experiments are therefore far from general. Papers on extensions of extreme learning usually do not present any comparison to other training methods [53]; they focus on showing the difference between the proposed extension and the basic ELM. Some other works comparing ELMs to regular deep learning models use outdated or inadequate network architectures. We can cite the paper [21], whose authors performed a comparison on the task of image classification without using convolutional networks. When that work was published (in 2013), CNNs were just emerging; nowadays, however, they are considered the primary choice for image recognition (classification). Despite the dynamic evolution of CNNs, studies comparing them to ELMs remain scarce.
None of the cited works analyses the impact of the hyperparameter set-up on ELM performance. Therefore, it seems worthwhile to explore some basic hyperparameters (the number of hidden neurons, the choice of activation function, and the value of the regularisation coefficient), because they can profoundly influence the results. Only [15] mentions that "Since the input weights and hidden biases of ELM are randomly generated instead of being fine-tuned, ELM usually needs more neurons than other learning algorithms". Unfortunately, this claim is not supported by extensive research.
To summarise, our choice of extreme learning machines for the comparison was based on three factors:
• the common use of ELM as a baseline solution among random-weight networks;
• the 1-to-1 correspondence between ELM and classical neural network architectures;
• the scarcity of comparisons between ELM and classical neural networks.

Contribution
Our research was inspired by the need for practical neural network applications in classification and regression tasks. Training standard neural networks often requires computational power exceeding the capabilities of most research and development centres, and a natural solution lies in alternative training methods. Because we did not find any comprehensive comparison between either ELM or RVFL and classical networks, we wanted to objectively verify whether, and to what extent, ELM can be competitive with classical feedforward networks trained with backpropagation (using batch SGD) on demanding classification and regression tasks. We performed the comparison twice. In the first series of experiments, following the approach from [13], the network's and the ELM's capacities (numbers of neurons) are comparable; in this case, we compare the properties of both models in the context of the training procedure. In the second one, we tried to reflect a machine learning practitioner's approach to training models on a given dataset. This means that we tuned each model for the given dataset. For classical networks, we can make use of the extensive literature and publicly shared implementations. For ELMs, this is usually impossible, but short training times allow us to optimise hyperparameters efficiently. An essential premise of all experiments is to cover a vast and varied selection of datasets. This research limited the datasets to classification and regression tasks from public repositories (UCI [8] and the University of Porto repository [46]), current image classification benchmark datasets, and artificially created tasks, skipping text and audio domain datasets.
We can summarise our contribution as follows:
• a comparison of the models' efficiency on more than 50 datasets using both training methods, in terms of the achieved prediction quality and training times (in two scenarios: models with comparable capacity, and well-suited network architectures supported by the literature),
• an implementation of both models, available at: https://github.com/mkosturek/extreme-learning-vs-backprop,
• a statistical analysis of the results, which ensures that any conclusions drawn from the experiments are credible,
• an evaluation of the decision boundaries of both models over several runs,
• an analysis of ELM's hyperparameter sensitivity,
• a qualitative comparison focused on the practical application of both methods,
• the formulation of postulates on development directions for ELM on current demanding datasets.
These contributions are essential for machine learning researchers and practitioners to know which model to use depending on the dataset size, and what results they can expect regarding training time and model efficiency.
Our research also gives some insights into the decision boundaries of both models. This is crucial given the growing need to process massive datasets. Our contributions can also assist researchers working on random-weight networks by showing the current challenges and obstacles in practical applications of ELM models and the need to compare enhanced ELM models to classical neural networks. It is also worth noting that the ELM hyperparameter sensitivity analysis is helpful in defining how to set hyperparameter values properly.

Paper structure
The paper consists of six sections. Section 1 presents an introduction to the topic of this study. Section 2 describes the compared models. Section 3 presents the research methodology, and Sect. 4 characterises the datasets used in the experiments. Section 5 reports the experimental study, and the final section concludes the paper.

Description of compared models
In this section, we briefly present both models, describing their architectures and training methods. We focus on classification and regression as the two most common tasks in real applications. To solve these problems, it is necessary to prepare a training set used to find the parameters $\theta$ (weights and biases) of the models. The training set consists of pairs $\langle \mathbf{x}_i, \mathbf{y}_i \rangle$, i.e. an input vector and the corresponding output vector (for classification, outputs are encoded using the one-hot principle). In the case of regression, the corresponding output is a scalar $y_i$. Formally, the task is solved by a function $f(\mathbf{x}; \theta)$ implemented by a classical neural network or an ELM. In the case of ELM, the parameters $\theta$ are calculated in one step, as opposed to a classical network, where training is an iterative procedure that optimises a loss function $L(f(\cdot; \theta); \mathbf{x}, \mathbf{y})$. Given the model parameters, the output $\hat{y}(\mathbf{x})$ specifies the predicted class or, in the case of regression, the predicted value.

Classical neural networks
The integral part of each neural network is the neuron. The formal description of its operation is given by Eq. 1:

$$\hat{y} = f\left(\sum_{i=1}^{N} w_i x_i + b\right) \qquad (1)$$

where $\hat{y}$ is the neuron's output, $N$ is the number of inputs, $x_i$ stands for the value of the $i$th input and $w_i$ for the weight assigned to the $i$th input, $b$ denotes the bias, and $f$ is an activation function. Each neuron has many inputs represented by a vector $\mathbf{x} = [x_1, \ldots, x_N]$. The input signals, combined by a weighted sum, create the total neuron input activity. The weights $\mathbf{w} = [w_1, \ldots, w_N]$ are assigned to the neuron's input connections and are tuned during the training process. The bias $b$ is a weight assigned to an additional input whose signal is always set to 1. The output signal $\hat{y}$ is created by transforming the total input with the activation function $f$.
In this paper, we consider layered, feedforward network architectures. The first one is the multilayer perceptron (MLP), where neurons are fully connected in neighbouring layers. Further, we will also describe the convolutional neural network (CNN).
MLP is composed of an input layer, one or more hidden layers, and an output layer. The number of hidden layers and the number of neurons in each layer are hyperparameters of the network. The number of outputs depends on the problem solved by the network (classification or regression). For simplicity, in the description below we assume the simplest MLP network with one hidden layer. The signal processing in the network can be described as a sequence of matrix operations applied to all patterns in the dataset $X$, given in Eq. 2:

$$H = f\left(X W^{(h)} + \mathbf{b}^{(h)}\right) \qquad (2)$$

where $W^{(h)}$ is a matrix of the hidden layer's parameters, $\mathbf{b}^{(h)}$ is a vector of the hidden layer's bias values, and $f$ is an activation function. The network output is defined by Eq. 3:

$$\hat{Y} = f\left(H W^{(o)} + \mathbf{b}^{(o)}\right) \qquad (3)$$
where $W^{(o)}$ and $\mathbf{b}^{(o)}$ are the matrix of weights and the vector of biases in the output layer. There are many different activation functions; commonly used ones are the sigmoid, hyperbolic tangent, step function, ReLU [19,30], and linear function, as well as softmax in the output layer for classification problems.
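As an illustration, the forward pass of Eqs. 2 and 3 can be sketched in a few lines of NumPy; the tanh hidden activation and the linear output used here are our own illustrative choices, since the equations leave the activation functions generic:

```python
import numpy as np

def mlp_forward(X, W_h, b_h, W_o, b_o):
    """Forward pass of a single-hidden-layer MLP (Eqs. 2 and 3).

    H = f(X W_h + b_h)  -- hidden activations (tanh chosen here)
    Y = H W_o + b_o     -- linear output layer (regression case)
    """
    H = np.tanh(X @ W_h + b_h)   # Eq. 2: hidden layer activations
    return H @ W_o + b_o         # Eq. 3 with a linear output activation

# tiny example: 4 samples, 3 inputs, 5 hidden neurons, 1 output
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
Y = mlp_forward(X, rng.normal(size=(3, 5)), np.zeros(5),
                rng.normal(size=(5, 1)), np.zeros(1))
print(Y.shape)  # (4, 1)
```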
Another type of neural network considered in the comparison is the convolutional neural network (CNN) [24]. In convolutional layers, neurons are not fully connected: a given neuron is only connected to a defined subset of neurons in the subsequent layer. Moreover, the weights assigned to these connections are shared between the neurons of a single feature map of the convolutional layer. CNNs are widely used in image processing because they learn to extract the complex image features that best serve the performed task. In each convolutional layer, one defines a kernel (or filter): a matrix of an assumed size, significantly smaller than the image resolution. The filter values are tuned during network training; they correspond to the neurons' connection weights. A single convolutional layer may consist of multiple filters, each producing a separate feature map. While processing an image, convolutional filters are moved across the image stepwise by a constant number of pixels, and the convolution operation is calculated. It defines the total activation of one neuron in a convolutional layer, Eq. 4:

$$Conv(X, W)_{k,i,j} = \sum_{p=1}^{P} \sum_{q=1}^{Q} W_{k,p,q}\, X_{i+p-1,\, j+q-1} \qquad (4)$$
where $X$ is the input image, $W$ denotes the tensor of convolutional filters (of size $K \times P \times Q$), $K$ is the number of filters (feature maps), $P \times Q$ is the size of a single convolutional filter, and $i, j$ are a single neuron's coordinates. Similarly to the MLP, convolution output values are processed by an activation function, and then pooling is applied. Its role is to decrease the size of a feature map; it is usually implemented as a max operation (max-pooling) or an average (average-pooling) over a sliding window. Pooling makes a convolutional network more robust to small image rotations and translations.
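A minimal sketch of the convolution of Eq. 4 for a single filter, followed by non-overlapping max-pooling, may look as follows; the loop-based implementation and the window sizes are purely illustrative:

```python
import numpy as np

def conv2d_valid(X, W):
    """Total activation of one feature map (Eq. 4, single filter): for each
    position (i, j), the sum of elementwise products between the P x Q
    filter W and the image patch anchored at (i, j)."""
    P, Q = W.shape
    H, Wd = X.shape
    out = np.empty((H - P + 1, Wd - Q + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(W * X[i:i + P, j:j + Q])
    return out

def max_pool(F, s=2):
    """Non-overlapping s x s max-pooling of a feature map (edges cropped)."""
    H, W = F.shape
    Fc = F[:H - H % s, :W - W % s]
    return Fc.reshape(H // s, s, W // s, s).max(axis=(1, 3))

X = np.arange(16.0).reshape(4, 4)       # toy 4x4 "image"
F = conv2d_valid(X, np.ones((2, 2)))    # 3x3 feature map of patch sums
print(max_pool(F, 2))                   # [[30.]]
```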
Neural network training is performed as an optimisation task: we search for the minimum of the cost (loss) function, which defines the error the network makes in approximating the target function $F$. The most commonly used loss functions are mean square error, binary cross-entropy, and categorical cross-entropy.
The primary method of neural network training is gradient descent, shown in Algorithm 1. It requires a training dataset $D$. In each iteration, the network computes its output (line 2), and then the model parameters are corrected by a small value in the direction opposite to the gradient of the cost function $L$ (line 3). As an effect, the cost value decreases. The procedure lasts until a stopping criterion is satisfied.
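The loop of Algorithm 1 can be sketched on a plain linear model with a mean-squared-error cost and a fixed iteration budget as the stopping criterion; all names and constants here are illustrative:

```python
import numpy as np

# Full-batch gradient descent on a linear model y_hat = X @ theta.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

theta = np.zeros(3)
lr = 0.1
for _ in range(200):                        # stopping criterion: iteration budget
    y_hat = X @ theta                       # line 2: compute the model output
    grad = 2 * X.T @ (y_hat - y) / len(y)   # gradient of the MSE cost
    theta -= lr * grad                      # line 3: step against the gradient

print(np.round(theta, 2))  # close to [ 1. -2. 0.5]
```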
Long training may cause network overfitting; therefore, in practice, additional techniques are applied. One of them is early stopping. It relies on setting aside an additional validation dataset, which is used to monitor the cost value in every iteration. Training is stopped when the validation cost starts to increase.

The gradient descent method requires calculating the gradient of the loss function with respect to all network parameters. For complex, multilayered network architectures this is a complicated task; therefore, the backpropagation algorithm [26] is in everyday use. It computes the loss function gradient for the last network layer, and then, using the chain rule, it computes the gradient for the weights in the immediately preceding hidden layer. For a network with more layers, the gradient in the $n$th layer is calculated analogously by propagating the loss function gradient from the $(n+1)$th layer. We can calculate the loss function gradient for the weights in the $n$th network layer using the following recursive definition (Eqs. 5 and 6):

$$\delta_L = \nabla_{\hat{y}} L \odot f'_L \qquad (5)$$

$$\delta_n = \left(W_{n+1}^{T}\, \delta_{n+1}\right) \odot f'_n, \qquad \nabla_{W_n} L = \delta_n\, f_{n-1}^{T} \qquad (6)$$

where $L$ is the loss function, $L$ also denotes the total number of layers (and the index of the output layer), $W_i$ is the weight matrix between the $(i-1)$th and the $i$th layer, $f_i$ is the activation vector in the $i$th layer, $'$ denotes a value of the derivative (or a gradient), and the operator $\odot$ is the Hadamard product.

The method described above is gradient descent in its simplest form. In common use is stochastic gradient descent [2], where in a single iteration the network parameters are updated based on a small portion of the training dataset, a batch. Another improvement is the use of momentum. Other advanced methods, such as ADAM [22], adapt the learning coefficient during optimisation. A high level of generalisation is the aim of every machine learning model.
Techniques that increase generalisation are called regularisation techniques. The most popular ones are the L2 method and Dropout.
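The backpropagation recursion described above can be sketched for a one-hidden-layer MLP with tanh hidden units, a linear output, and mean squared error; the finite-difference check at the end is only a sanity test of the analytic gradient, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))
W1, b1 = rng.normal(size=(3, 5)) * 0.5, np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)) * 0.5, np.zeros(1)

# forward pass
A = np.tanh(X @ W1 + b1)
y_hat = A @ W2 + b2

# backward pass: delta at the output, then propagated through W2 with the
# Hadamard product against the activation derivative (1 - A**2 for tanh)
delta2 = 2 * (y_hat - y) / len(y)          # dL/dy_hat for MSE
delta1 = (delta2 @ W2.T) * (1 - A ** 2)    # chain rule into the hidden layer
grad_W2 = A.T @ delta2
grad_W1 = X.T @ delta1

# finite-difference check on one weight confirms the analytic gradient
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
Lp = np.mean((np.tanh(X @ W1p + b1) @ W2 + b2 - y) ** 2)
Lm = np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2)
print(abs((Lp - Lm) / eps - grad_W1[0, 0]) < 1e-4)  # True
```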

Extreme learning machines
In contrast to classical neural networks, ELMs are based on a random projection of the input feature space to hidden features, followed by linear regression. ELM hidden connections are weighted randomly, and there is no need to tune them.
The most straightforward ELM architecture is a feedforward neural network containing a single hidden layer. It resembles the MLP architecture, but the weights between the input and hidden layers are randomly assigned and not tuned; they perform a random black-box transformation from the input feature space to the hidden features. The parameters are tuned in the output layer only.
Extreme learning itself is not necessarily limited to this simple architecture; it can be incorporated into many well-known deep learning models. Huang et al. [17] propose a method based on extreme learning and CNNs to perform image classification, called local receptive fields ELM (LRF-ELM). Local receptive fields are a general concept which assumes that a single neuron is responsible for aggregating the signal from a specific input image region. For implementation purposes, however, LRF-ELM usually simplifies the network to a single convolutional layer with pooling and a fully connected output layer. Only the weights of the output layer are optimised; the convolutional layer acts as a random feature extractor. In other words, the values of all convolutional filters are chosen at random. There are two additional operations specific to LRF-ELM. The first one is orthogonalisation of the filters after their initialisation, which aims to minimise the risk of randomly initialised convolutional filters producing redundant features. The second one is a specific pooling method, square-root pooling, given in Eq. 7, which reduces data dimensionality and introduces nonlinearity into the network. This operation is crucial because there is no activation function after the convolutional layer in LRF-ELM.
$$h_{ij} = \sqrt{\sum_{(p,\, q) \in \Omega_{ij}} x_{pq}^2} \qquad (7)$$

where $X$ is the input feature map (the output of the convolution layer) of size $D_1 \times D_2$, it is a matrix of elements $x_{ij}$, and $\Omega_{ij}$ denotes the pooling region associated with output position $(i, j)$. In order to compute the optimal output layer weights $\beta$, Eq. 9 has to be solved: the network outputs should match the expected values $Y$ from the dataset,

$$H \beta = Y \qquad (9)$$

where $H$ represents a matrix of random features extracted from the training dataset $X$.
For single-hidden-layer networks, they are computed according to Eq. 8:

$$H = f\left(X W_0 + \mathbf{b}\right) \qquad (8)$$

where $X$ and $Y$ are the training inputs and expected outputs, respectively, $W_0$ and $\mathbf{b}$ are the weights and biases in the hidden layer (both random), and $f$ is the activation function.
To find the output layer weights $\beta$ from Eq. 9, it is necessary to calculate the inverse of the matrix $H$, as presented in Eq. 10:

$$\beta = H^{-1} Y \qquad (10)$$

which can be computationally difficult for high-dimensional or large datasets. To address this issue, the ELM authors proposed using the Moore-Penrose pseudoinverse $H^{\dagger}$, which leads to the final solution for the optimal weights, shown in Eq. 11:

$$\beta = H^{\dagger} Y \qquad (11)$$

This solution is equivalent to least squares optimisation.
This algorithm is called extreme learning. It is also applied in the classification layer of LRF-ELM, whose training procedure is presented in Algorithm 2. Just as with classical neural networks, there exist techniques that improve the performance of ELMs. ELMs can be regularised using L2 regularisation [12,17]. Equation 12 shows the regularised solution for the optimal ELM parameters:

$$\beta = \left(H^{T} H + \frac{I}{C}\right)^{-1} H^{T} Y \qquad (12)$$

where $C$ denotes the regularisation coefficient and $I$ is the identity matrix. Even though ELM parameters are chosen at random, the probability distribution used to sample them may impact the model performance. The authors of [45] compare various distributions with different variances and recommend a good default choice: a Gaussian with mean equal to 0 and a standard deviation less than or equal to 0.1. The aim of this section was to present the essential elements of both models and their training methods. The next one is devoted to the experimental research comparing both models in terms of training time and achieved results.
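The whole ELM training procedure of Eqs. 8-12 fits in a few lines. This sketch uses tanh features, the recommended N(0, 0.1) initialisation, and an illustrative toy regression target; the function names are our own:

```python
import numpy as np

def elm_fit(X, Y, n_hidden, C=None, rng=None):
    """Minimal ELM training sketch (Eqs. 8-12): random hidden weights are
    drawn once and never trained; only the output weights beta are
    computed in closed form."""
    rng = np.random.default_rng(rng)
    W0 = rng.normal(0.0, 0.1, size=(X.shape[1], n_hidden))  # random weights
    b = rng.normal(0.0, 0.1, size=n_hidden)                 # random biases
    H = np.tanh(X @ W0 + b)                                 # Eq. 8: random features
    if C is None:
        beta = np.linalg.pinv(H) @ Y                        # Eq. 11: pseudoinverse
    else:
        # Eq. 12: L2-regularised solution (I / C added to the Gram matrix)
        beta = np.linalg.solve(H.T @ H + np.eye(n_hidden) / C, H.T @ Y)
    return W0, b, beta

def elm_predict(X, W0, b, beta):
    return np.tanh(X @ W0 + b) @ beta

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2      # toy nonlinear target
params = elm_fit(X, y, n_hidden=100, C=1e6, rng=0)
mse = np.mean((elm_predict(X, *params) - y) ** 2)
print(params[2].shape, mse < np.var(y))  # (100,) True
```

Training reduces to a single linear solve, which is exactly why ELM training is fast, and why the inverse computation becomes the bottleneck for very large feature matrices, as discussed in the abstract.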

Research methodology
The experiments on ELM hyperparameter sensitivity allow us to determine the hyperparameters' influence on the final ELM results. This analysis is described in Appendix A. It enables us to assess which parameters have a particularly strong impact on the model's responses and to determine good default values for the hyperparameters.
We have designed two series of experiments: in the first one, models have comparable capacities. In the second one, they have well-suited architectures based on the literature review or hyperparameter optimisation.
The first series is designed to compare just the learning algorithms. Given identical network architectures, we train them using extreme learning and backpropagation. This comparison can show whether it is beneficial to choose one algorithm over the other. This series consists of two parts. The first part refers to ELM and fully connected networks trained with backpropagation, performing classification and regression tasks on data with vector representation. This part of experimentation utilises well-known, real-life benchmark datasets, datasets used in [16] and several artificial datasets described in Sect. 4.
To ensure the objectivity of our conclusions, we used datasets with diverse properties concerning data dimensionality, class balance, sparsity of features, and the presence of continuous and discrete features. The datasets were acquired from two public repositories: UCI [8] and the University of Porto repository [46]. Tables 1 and 2 present the characteristics of the datasets for classification and regression.
The second part of this series is devoted to the image classification task. We considered two approaches: using an external extractor of visual features with a fully connected network, and using a convolutional network, which learns features on its own.
The previously mentioned LRF-ELM model [17] allows researchers to utilise extreme learning in convolutional networks. The authors of [18] show that ELM performs well in image classification when using HOG (histogram of oriented gradients) features. That paper considers only one task, road sign classification; however, it seems worthwhile to investigate this approach on more datasets, described in Table 5. Therefore, the models covered in this comparison are:
• LRF-ELM,
• a convolutional network trained with backpropagation,
• ELM using HOG features,
• a fully connected network trained with backpropagation on HOG features.
In the second series, we tune the architectures and hyperparameter set-ups for a given problem, just as a machine learning practitioner would. In this task, we make use of the available literature: for classical networks we often refer to the best results reported in other papers, while we perform our own hyperparameter search for ELMs. This series also consists of two parts. Analogously, the first part is based on classification and regression on data with vector representation (this time, however, we do not use synthetic datasets), and the second concerns image classification. The choice of image datasets is extended with two variants of ImageNet, with resolutions of 16 × 16 and 32 × 32 pixels. We compare training times when possible: for training performed on our own, and when the time is given in the papers we refer to. But the main focus lies on the comparison of performance measured with various metrics:
• F1-score in the first series of comparison for the classification tasks. In this series, we performed training and evaluation of all models on our own; therefore, we were able to measure the F1-score, which is more robust to imbalanced classes than accuracy.
• Accuracy in the second series of comparison for the classification tasks. This time we referred to results reported in the literature, which are usually stated using only the accuracy metric.
• RMSE (normalised) for the regression tasks.
Their definitions are given below.
F1-score. It is a measure of binary classification quality that summarises the numbers of Type I and Type II errors made by the classifier. Type I errors, also called false positives, correspond to situations where the classifier mistakenly predicts the positive class. Type II errors, false negatives, mean that the classifier mistakenly predicted the negative class. The F1-score is therefore defined as stated in Eq. 13:

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN} \qquad (13)$$

where $TP$ is the number of true positives (correct predictions of the positive class), $FP$ is the number of false positives, and $FN$ is the number of false negatives. To use this metric for multiclass classification, we can simply compute the binary F1-score for each class separately and then evaluate an average F1-score weighted by the number of samples in each class. The formal description is given in Eq. 14:

$$F_1 = \frac{\sum_{k=1}^{K} w_k\, F_1^{(k)}}{\sum_{k=1}^{K} w_k} \qquad (14)$$
where $F_1^{(k)}$ is the F1-score for class $k$, $K$ is the number of classes, and $w_k$ is the number of samples in class $k$. For simplicity, we will denote both the binary and multiclass versions of this measure as F1-score.
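Eqs. 13 and 14 can be computed directly; this sketch (the function names are our own) mirrors the per-class-then-weighted-average procedure described above:

```python
import numpy as np

def binary_f1(y_true, y_pred, positive):
    """Binary F1 for one class treated as positive (Eq. 13):
    F1 = 2 TP / (2 TP + FP + FN)."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    return 2 * tp / (2 * tp + fp + fn)

def weighted_f1(y_true, y_pred):
    """Support-weighted multiclass F1 (Eq. 14)."""
    classes, counts = np.unique(y_true, return_counts=True)
    scores = [binary_f1(y_true, y_pred, k) for k in classes]
    return np.average(scores, weights=counts)

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])
print(round(float(weighted_f1(y_true, y_pred)), 3))  # 0.833
```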
Accuracy. It is the most basic measure of classification performance: simply the ratio of the number of correctly classified samples to the number of all samples in the dataset, as stated in Eq. 15:

$$Accuracy = \frac{N_{correct}}{N} \qquad (15)$$
RMSE. Root mean square error is a measure of regression quality. It is defined in Eq. 16:

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} \qquad (16)$$
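Both RMSE (Eq. 16) and the range-normalised variant used in the experiments are one-liners; the function names here are our own:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error (Eq. 16)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def normalised_rmse(y_true, y_pred):
    """RMSE divided by the range of the target variable, which makes
    scores comparable across regression datasets with different scales."""
    return rmse(y_true, y_pred) / (y_true.max() - y_true.min())

y_true = np.array([0.0, 5.0, 10.0])
y_pred = np.array([1.0, 5.0, 9.0])
print(round(float(rmse(y_true, y_pred)), 4),
      round(float(normalised_rmse(y_true, y_pred)), 4))  # 0.8165 0.0816
```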

Datasets used
In selecting datasets for experimental research, we were guided by the diversity of the data in terms of class balancing, data dimensions, number of classes, continuous and discrete data, and sparsity of features. Datasets were acquired from two public repositories: UCI [8] and University of Porto repository [46]. Table 1 presents characteristics of datasets for regression and Table 2 for classification. Moreover, some synthetic datasets were considered.
We also propose three types of artificial classification datasets and several datasets for regression. They are characterised in Tables 3 and 4, and described in detail in Appendix B.
Image classification datasets. None of the available ELM studies presents results obtained on benchmark image classification datasets, such as MNIST OCR and CIFAR-10. For this reason, we included them in the experiments. In addition, we use the NORB dataset, which was used in [17], and the GTSRB set, used in [18]. All image datasets are characterised in Table 5.

Experimental study
We conducted all experiments on a computer with a specification presented in Table 6.
For the needs of the study, the ELM and LRF-ELM models were implemented using Python 3.6 and the PyTorch 1.4 library [34]. All the reference models (fully connected and convolutional networks with backpropagation) were also implemented based on the following libraries: scikit-learn, scipy, scikit-posthocs, OpenCV. Our implementation is available at: github.com/mkosturek/extreme-learning-vs-backprop.
In each series, the first experiment compares ELM models and neural networks trained using backpropagation (also referred to as MLP). The next one focuses on image dataset recognition. The simple ELM and MLP models working on HOG features extracted from images are compared. We also examine models dedicated to image classification: LRF-ELM and convolutional neural networks.

Models with comparable capacity
In this series, we compare models with the same architecture and a comparable number of neurons. Each model was evaluated on every dataset using five-fold cross-validation repeated 10 times, giving a sample of 50 results for each model on each dataset. In the classification task, the F1 measure was used to compare performance. For the regression task, we used normalised RMSE; normalisation was performed by dividing the RMSE by the range of the target variable (the difference between its maximum and minimum). To evaluate the training speed, we measured the training time in seconds.

To ensure the reliability of the comparison of the average values, we used the dependent t-test for paired samples. However, as indicated in [3], in the case of repeated cross-validation, the basic form of this t-test may lead to false and non-reproducible results; therefore, we used a modified paired-samples t-test, called the corrected repeated k-fold cross-validation test. Tables 7 and 8 show the results of the comparison of performance and training times in the classification and regression tasks. Whenever the p-value from the statistical test was below the significance level α = 0.05, the better model's result is presented in bold, denoting a significant difference between the compared models.
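One common formulation of the corrected test statistic (the Nadeau-Bengio variance correction adapted to repeated k-fold cross-validation) can be sketched as follows; the function name and the illustrative score differences are our own:

```python
import numpy as np

def corrected_repeated_cv_t(diffs, k, r):
    """Corrected paired t-statistic for r-times repeated k-fold CV.

    diffs: per-fold score differences between the two models, length k*r.
    The sample variance is inflated by n_test/n_train = 1/(k-1) to account
    for the overlap between training sets across folds.
    Returns the statistic and its degrees of freedom (k*r - 1).
    """
    diffs = np.asarray(diffs, dtype=float)
    n = k * r
    t = diffs.mean() / np.sqrt((1.0 / n + 1.0 / (k - 1)) * diffs.var(ddof=1))
    return t, n - 1

# illustrative differences from 5-fold CV repeated 10 times
rng = np.random.default_rng(0)
diffs = rng.normal(0.02, 0.01, size=5 * 10)
t, df = corrected_repeated_cv_t(diffs, k=5, r=10)
print(df)  # 49
```

Compared with the naive paired t-test, the extra 1/(k-1) term widens the confidence interval, which is what prevents the inflated false-positive rate reported for repeated cross-validation.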
Discussion Considering classification performance in terms of F1 measure, the results presented in Table 7 show that for most sets (15 out of 23) the statistical test did not show any significant difference between the average results of the models. On 8 sets, where the differences were statistically significant, better results were achieved by ELM. It is worth noting that the differences between the models were found to be significant on high-dimensional synthetic sets rather than on their low-dimensional counterparts.
A qualitative comparison of decision regions of sample ELM and MLP models is also interesting. For the simplest, two-dimensional datasets, they are visualised in Figs. 1, 2, 3. It is worth underlining the significant differences in the shapes of the decision regions for different runs of the ELM model.
Although the identified decision regions usually correctly cover the training set's observations, in the case of ELM these regions undergo significant shape changes just beyond the areas where observations concentrate. This is particularly noticeable for the dataset of nested hyperspheres, where the inner-circle class region also appears outside both circles. Decision boundaries for MLP seem to be much simpler than for ELM, and it can be seen that the ELM decision boundaries vary across each of the 15 runs of the method. This phenomenon suggests better generalisation by MLP. From Table 8, it can be seen that MLP achieved significantly better results on only one real-life set, Servo, and on three synthetic sets (linear function, second-degree polynomial, and Friedman I), though for the majority of their possible dimensionalities. ELM proved to be significantly better on three real sets and on the synthetic sets based on trigonometric functions and the Friedman II and III functions. It should be noted that all of these 8 sets are low-dimensional, i.e. their dimensionality did not exceed 10.
The obtained results indicate that ELM models achieve better results in low-dimensional and nonlinear regression problems, which occur in many real-life tasks, as well as in sets based on trigonometric functions and on the Friedman II and III curves, which model physical phenomena. MLP, on the other hand, achieves better results for high-dimensional problems and for those with linear or polynomial dependencies.
The comparison shows that ELM models are significantly faster than neural networks trained by backpropagation. On each of the 53 sets, for both classification and regression tasks, the statistical test showed a significant difference in processing times, always in favour of ELM. Typically, learning times for MLP turn out to be 4 orders of magnitude higher than for ELM.

Experiment 2: comparison of ELM and neural networks performance in image classification
The aim of the experiment is to compare networks trained with extreme learning and backpropagation in the task of image classification. We compare the F1 measure and training times. The following hyperparameter values were set after preliminary studies: for the HOG descriptor, the number of gradient orientations is 9, the block size is 2×2 cells, and each cell is 8×8 pixels.
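Under this set-up, the HOG vector length follows directly from the geometry. The small helper below (ours, assuming the common block stride of one cell) illustrates the arithmetic:

```python
def hog_length(width, height, cell=8, block=2, orientations=9):
    """HOG descriptor length, assuming a block stride of one cell
    (the common default in, e.g., OpenCV's HOGDescriptor)."""
    blocks_x = width // cell - block + 1   # block positions horizontally
    blocks_y = height // cell - block + 1  # and vertically
    return blocks_x * blocks_y * block * block * orientations

# e.g. a 32x32 image: 3 * 3 block positions, 2 * 2 * 9 values per block
```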
The experiment was conducted on 4 datasets, the characteristics of which are presented in Table 5. The CIFAR-10 dataset was used in its original form. On every dataset, a fivefold cross-validation was repeated five times for each of the tested models. The comparison is then based on performance measures averaged over these repetitions. For performance comparison, the F1 measure was used, while the time measured in seconds was used to assess training speed. The time measurement includes only model training, not HOG feature extraction. Figure 4 shows the results of the performance comparison for each of the four datasets. We performed the corrected repeated cross-validation t-test pairwise for all models, assuming a significance level α = 0.05. These tests allowed us to determine that on the CIFAR-10 and GTSRB datasets, both the CNN and the HOG+ELM approach resulted in the best performances, without significant differences between them; on the MNIST dataset, CNN was the single best-performing model; and on NORB, the best results were achieved with the HOG+ELM approach. Moreover, these tests show that the performance results of the CNN models trained on CPU and GPU do not differ, so they are used only to compare training times. Figure 5 shows a comparison of the training times. Corrected repeated cross-validation tests were also performed; they showed that each of the average training times was significantly different from the others.
Discussion Figure 4 shows that CNN and ELM models using HOG features provide the best classification performance the same number of times. On CIFAR and GTSRB sets, these two models achieved results that did not differ significantly. On the MNIST set, convolutional networks using backpropagation proved to be better, and on the NORB set, the ELM model achieved the best results.
LRF-ELM achieved a result comparable to CNN only on NORB and GTSRB, while in the two other cases networks trained by backpropagation were superior.
A surprisingly high improvement in classification performance was achieved by the use of extreme learning instead of backpropagation when using HOG features. Three out of four ELM models using these features also proved superior to the LRF-ELM model, which has been designed for image processing.
It should be noted that the architecture of the convolutional networks used in this experiment was very limited compared to the usual design of such networks; for example, the number of filters was limited to 10. This is due to the high memory requirements of the LRF-ELM model and hardware limitations. In the case of convolutional networks with backpropagation, the size of the network could be greatly increased. Therefore, drawing conclusions from a comparison of image-based models with those using HOG features may not be fully justified, and the apparent superiority of a simple extractor should be interpreted with caution. Again, the comparison showed that extreme learning is significantly faster than backpropagation. On three out of four sets, LRF-ELM proved to be the fastest, while ELM using the HOG feature extractor was the fastest on a single set. The use of the GPU allowed backpropagation training to be accelerated more than a hundredfold, to a level comparable to ELM models using HOG features; however, it did not achieve training times as short as LRF-ELM.

Models with well-suited architectures for a given dataset
In this series, we conduct experiments following a machine learning practitioner's approach: the experiments compare both methods' performance using models with hyperparameter values tuned to achieve the best possible results or, where available, using reference state-of-the-art results.

Experiment 1: comparison of ELM and neural networks performance and training times for input data with vector representation
In this scenario, the goal is to find the best-performing solution. Therefore, we utilise hyperparameter search for ELMs and make use of works published throughout the years of classical network development. We then compare the best solutions found on datasets for regression and classification tasks with vector representation. Tables 9 and 10 show the best ELM hyperparameter set-ups found for each dataset. Due to some discrepancy in the literature concerning metrics used for evaluating regression models, we decided to perform the hyperparameter search for classical networks on our own. Table 11 shows the best hyperparameters found for each regression dataset. Tables 12 and 13 present the best results achieved with ELMs and classical networks.
Discussion This experiment shows that both extreme learning and backpropagation are able to produce a model best fitting a given task. However, classical networks proved to be superior on the majority of tested datasets in both classification (6 out of 9 datasets) and regression (4 out of 6 datasets). It must nonetheless be noted that in some cases ELMs fall behind standard networks by only a small margin (e.g. the Machine CPU and Auto MPG datasets), while in others they surpass classical networks significantly (e.g. the Sonar and Bank datasets).
Overall, this experiment demonstrates that the many years of extensive research on backpropagation-based networks were beneficial. Classical networks are a solid, robust framework allowing very efficient modelling. However, ELMs seem to be a good complement to classical networks: easy and quick to train, with the potential of producing very competitive results. Considering the disproportion in the amount of research on ELMs compared to backpropagation, extreme learning seems to be a promising approach worth further development.

Experiment 2: comparison of ELM and neural networks performance in image classification
Just like in the previous experiment, the aim of this study is to achieve the best possible performance with each method in the image classification task. This comparison covers a large-scale dataset, ImageNet. It was constrained to two variants, with 16×16 and 32×32 resolution, due to the high memory complexity of ELMs. Again, we performed a hyperparameter search for ELM-based classifiers and supported the comparison with results achieved by classical networks as reported in the literature. Table 14 presents the best hyperparameter set-ups found for the LRF-ELM models for each dataset. Table 15 shows the performance and time comparison of ELMs and backpropagation-based models.
In this experiment, we modified the preprocessing steps on the NORB dataset to match those of [37], to whose results we compare our ELM.
Discussion Comparing current state-of-the-art deep learning models for image classification to the LRF-ELM clearly demonstrates classical deep learning superiority. LRF-ELM was inferior on every dataset considered, often by a great margin. This outcome was expected because LRF-ELM corresponds to a very simple CNN, whereas modern deep learning architectures are significantly more complex. In the case of image classification, it is hard to argue that the training speed is an advantage of ELMs, because the performance decrease is too high. Presently, extreme learning is not a competitive alternative for deep learning-based image classification.
Again, this experiment shows a high disproportion in the development of extreme learning-based architectures compared to backpropagation. Extending ELMs into a framework allowing efficient implementation and training of more complex network architectures could prove valuable.

Findings based on the performed research
As mentioned before, extreme learning machines are highly memory-consuming. This is because of the need to compute the inverse of the latent representation of the entire training dataset, which requires storing the hidden features matrix in memory. One could argue that it is possible to implement ELMs so that this matrix is cached on disk; however, the main advantage of extreme learning, namely short training times, would then suffer significantly. Here we present a short theoretical study of the memory required to train an ELM on the 224×224 ImageNet dataset, which consists of roughly 14 million images. Let us make the following assumptions: • the training is performed on ImageNet's 1-million-image subset, used in popular challenges such as ILSVRC, • all images are resized to a constant 224×224 px size, • the LRF-ELM architecture uses 16 convolutional filters only, a filter size of 5×5, a pooling size of 5×5, and a pooling stride of 2, • standard 32-bit precision floating-point numbers are used.
The assumed image size and network architecture produce hidden feature maps of size 108×108×16 for each image. LRF-ELM flattens such feature maps; hence, a hidden representation is a vector of 186624 elements. Therefore, the hidden representation matrix H has size 1 mln × 186624. Such a matrix consists of 1.87 × 10^11 numbers. Each of them is stored on 4 bytes, so the total size of the matrix H is approximately 746 GB. Even such a limited number of convolutional filters leads to memory requirements beyond the abilities of most systems. Let us assume changing the convolution and pooling configuration so as to produce feature maps compressed to a 32×32 size; even in this case, the matrix H takes more than 60 GB. Further scaling down introduces a high risk that such randomly extracted, compressed features contain too little information for the ELM output layer to train properly. This effect is clearly seen in the experiment results in Table 15. With classical convolutional networks trained by backpropagation, the memory requirements are several orders of magnitude lower: there is no need to keep the entire dataset in memory, as training requires loading only one mini-batch at a time.
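The estimate above can be reproduced with a few lines of arithmetic:

```python
# Hidden-matrix memory estimate for LRF-ELM on the ImageNet subset
n_images = 1_000_000
features_per_image = 108 * 108 * 16   # flattened hidden feature maps
bytes_per_value = 4                   # 32-bit floats
total_gb = n_images * features_per_image * bytes_per_value / 1e9
print(features_per_image)             # 186624
print(round(total_gb, 1))             # 746.5 GB
```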
Our experience from conducting this study is described in Table 16, which shows a qualitative comparison between ELM and classical neural networks in the current state of their development.
We can sum up its content as follows: • The number of hyperparameters is much higher in classical neural networks, but selecting their values can be supported by the extensive literature and by the help of active community members on public forums. It is easy to find heuristics on how to set hyperparameter values in the case of classical neural networks. • The number of parameters in classical deep networks is huge, but because the models are popular, we can use transfer learning, which speeds up training. In ELM models, the number of parameters is lower, but for challenging image data, memory requirements are so vast that it is not possible to calculate the weights on typical hardware. • There are few publicly available ELM implementations; the existing ones are usually simple and non-generic, or are written in Matlab, which makes them hard to use in more complex use cases and in popular, well-supported environments such as Python with PyTorch or TensorFlow.
Summing up, in this paper we referred only to the classical ELM methods, conducting experiments on image and vector representation datasets. It should be noted, however, that some new approaches propose promising solutions. Here, we can mention incremental learning [42], the multilayer extreme learning machine (ML-ELM) and hierarchical extreme learning machine (H-ELM) described in [20], and non-iterative and fast learning algorithms for deep ELM [49]. This is a good prognosis for the further development of ELM, but none of them has presented results on a dataset as challenging as ImageNet, nor compared their performance to mainstream deep learning methods.

Final conclusion
Based on the presented research, we can conclude that the tested ELM models do not offer a real alternative to classical neural networks for contemporary problems characterised by complex patterns and huge datasets. The first series of experiments shows that their superiority over classical neural networks in terms of training times is visible for networks with capacities typical of the pre-deep-learning era. The demand for models that train quickly on tough image, video, or audio problems is enormous. ELM models, which transform the input space into a high-dimensional hidden-neuron space with random weights, are attractive and essential when one wants to train a model quickly. But to achieve this state, we postulate developing smart algorithms for the inverse matrix calculation, so that determining the weights in the output layer for challenging datasets becomes feasible and memory efficient. There is a need to create specific mechanisms that avoid keeping the whole dataset in memory to compute the weights. Although we noticed new approaches to solve these problems [20,42,49], contemporary demanding datasets used with classical deep neural networks are still a challenge for ELMs. It also seems necessary to develop generic frameworks that give practitioners simple access to ELM models and easy development of new architectures efficiently utilising this training algorithm. Ultimately, sharing implementations should be a common practice among ELM researchers. In future work, we plan to prepare a comparison with other random networks, like RVFL or its extended version. There are also other promising solutions to consider, such as self-normalising networks, which outperformed all competing methods [23].
Appendix A ELM hyperparameter sensitivity analysis ELMs are significantly less popular than classical networks. Numerous studies on backpropagation-based training have allowed intuitions and heuristics to be established on how neural networks react to changes of hyperparameters. In this study, we aim to gain insight and build similar intuitions on ELMs' sensitivity to hyperparameter values. We conducted experiments to evaluate the level of contribution that each hyperparameter has on the response variance, as defined in [39] and [52]. A theoretical model has been defined: the analysed hyperparameters are the input variables of the model, and the output corresponds to the classification quality measure on a given dataset. Variance-based sensitivity analysis [41] can be performed using such a model. Sensitivity indices are calculated using the decomposition of the variance Var(Y) of the model response Y = f(X). In our case, the sensitivity indices were estimated based on repeated measurements within the experiments. Sensitivity indices can be interpreted as the distribution of each parameter's impact on the variance of the model response. First-order indices are defined for every single parameter; they indicate the share of the total model variance attributable to variations of that single parameter. Second-order indices are defined for each pair of parameters; they specify how much joint changes of a pair affect the variance of the model response. To identify hyperparameter values that allow reaching high performance on multiple various datasets, we utilised a pair of statistical tests used to compare classifiers [7]: Friedman's test [10] and Nemenyi's post-hoc test [31].
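A first-order index of this kind can be estimated from repeated measurements by grouping results by hyperparameter value. The sketch below is our simplified estimator, not the exact procedure of the cited works, but it conveys the idea of Var(E[Y|X_i]) / Var(Y):

```python
import numpy as np

def first_order_index(param_values, responses):
    """Estimate S_i = Var(E[Y | X_i]) / Var(Y) from repeated measurements:
    the variance of per-value group means relative to the total variance."""
    params = np.asarray(param_values)
    y = np.asarray(responses, float)
    group_means = [y[params == v].mean() for v in np.unique(params)]
    return np.var(group_means) / np.var(y)
```

An index near 1 means the hyperparameter alone explains most of the response variance; an index near 0 means its first-order effect is negligible.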
We performed a repeated K-fold cross-validation on every dataset. The use of cross-validation implies a lack of independence of observations in the test samples. This suggests using the dependent t-test for paired samples. However, as indicated in [3], in the case of a repeated cross-validation, the basic form of this t-test may lead to false and non-reproducible results. They proposed a modified paired-samples t-test, called corrected repeated k-fold cross-validation test. Therefore, we utilised this test to compare the average performance measures of the models.
Three ELM hyperparameters are considered: the regularisation coefficient C, the hyperparameter h, and the activation function f. In the experiment, 17 datasets for the classification task were used. The values of the sensitivity indices are shown in Table 17. It can be noted that the values of the S(f) index (Table 17) increase significantly as the dimensionality of the hypersphere classification problem increases. The same can be observed for hypercube vertices classification. Simultaneously, the values of the S(h,f) index increase, which means that the importance of properly choosing the activation function grows with the problem dimensionality.
The changes of the C and f hyperparameters demonstrate a comparable contribution to the overall variance considering the first-order interaction. Among the second-order indices, the S(C,f) index is usually the one with the lowest value. Thus, the optimisation of the regularisation coefficient and the activation function can be carried out independently.
The Nemenyi test for the activation function f revealed the existence of two pairs of activation functions which differ significantly with regard to the results achieved on various datasets. Both pairs include the sigmoid function. Considering the average ranking positions (approximately 3.5 for the sigmoid and about 2 for the others), it can be concluded that the sigmoid function contributes least to the improvement of the classification results.
The Nemenyi test revealed significant differences between the two smallest values of h (0.1 and 0.3) and the four largest values (1, 3, 10, and 30). Just as in the case of the activation function, it can be concluded that small values of the h hyperparameter have little positive effect on the performance quality. Nevertheless, there are no indications to exclude h = 0.1 and h = 0.3 from the process of hyperparameter tuning.
Based on the Nemenyi test for the regularisation coefficient C (Table 21), it is easy to observe that even the lowest values of the average ranking position (approximately 4) are noticeably higher than the ranking values of the other hyperparameters (approximately 2). This may indicate a high sensitivity to this hyperparameter and a lack of values that are clearly better for multiple datasets. The best ranking positions were obtained for C values in the range from 0.01 to 0.3.

Appendix B Artificial datasets for classification
In the experiments, we have used three types of artificial classification datasets. Nested hyperspheres: the dataset consists of points lying on two concentric hyperspheres with different radii, which represent the two classes. To generate random points uniformly distributed on a hypersphere, we used the procedure described in [36]. Gaussian noise was added to all points.
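The uniform-on-a-hypersphere procedure referred to above (Muller's method of normalising Gaussian vectors) can be sketched as follows; the function name, default radii, and noise level are our illustrative choices:

```python
import numpy as np

def nested_hyperspheres(n_per_class, dim, radii=(1.0, 2.0), noise=0.05, rng=None):
    """Two classes on concentric hyperspheres. Normalised Gaussian vectors
    (Muller's method) are uniformly distributed on the unit sphere."""
    rng = np.random.default_rng(rng)
    X, y = [], []
    for label, radius in enumerate(radii):
        g = rng.standard_normal((n_per_class, dim))
        pts = radius * g / np.linalg.norm(g, axis=1, keepdims=True)
        X.append(pts + rng.normal(0.0, noise, pts.shape))  # Gaussian noise
        y += [label] * n_per_class
    return np.vstack(X), np.array(y)
```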
Hypercube vertices This is a generalisation of the XOR classification task. In this dataset, each hypercube vertex is assigned one of two labels in such a way that the classes are not linearly separable. Knowing that a linear SVM achieves 100% accuracy on linearly separable classes, we repeat random vertex-class assignments until the SVM is unable to fit the data. Once a linearly non-separable labelling is obtained, the vertices are sampled into the dataset and Gaussian noise is added. Figure 6 shows a visualisation of the dataset generated in 3-D space.
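A guaranteed linearly non-separable labelling can also be obtained deterministically via parity, the classic generalisation of XOR. The sketch below uses this shortcut instead of the SVM-based rejection loop described above, so it is an alternative construction rather than the paper's exact procedure:

```python
import numpy as np
from itertools import product

def hypercube_parity(dim, n_per_vertex, noise=0.1, rng=None):
    """Generalised XOR: each vertex of the unit hypercube is labelled by the
    parity of its coordinates, a labelling that no linear classifier can fit."""
    rng = np.random.default_rng(rng)
    X, y = [], []
    for vertex in product([0, 1], repeat=dim):
        cluster = np.array(vertex) + rng.normal(0.0, noise, (n_per_vertex, dim))
        X.append(cluster)
        y += [sum(vertex) % 2] * n_per_vertex
    return np.vstack(X), np.array(y)
```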
Intertwined spirals This is a binary classification task. Samples of one class lie on a 2-D Archimedean spiral or on a 3-D conical spiral; data points of the other class lie on a similar, rotated spiral. 2-D spirals were generated using the formulas in Eqs. 17 and 18, defined in polar coordinates. We converted the points to Cartesian coordinates and added Gaussian noise. We used the following coefficient values: a1 = a2 = 1 and b1 = b2 = 0.5. Figure 7 presents visualisations of such datasets.
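A 2-D generator along these lines can look as follows. This is our sketch of the standard two-spirals construction (r = a + b·θ, second class rotated by π), not a verbatim reproduction of Eqs. 17 and 18:

```python
import numpy as np

def two_spirals(n_per_class, a=1.0, b=0.5, turns=2.0, noise=0.05, rng=None):
    """Two intertwined Archimedean spirals r = a + b * theta; the second
    class is the first spiral rotated by pi. Gaussian noise is added."""
    rng = np.random.default_rng(rng)
    theta = np.linspace(0.0, 2.0 * np.pi * turns, n_per_class)
    r = a + b * theta
    X, y = [], []
    for label, shift in enumerate((0.0, np.pi)):
        pts = np.column_stack([r * np.cos(theta + shift),
                               r * np.sin(theta + shift)])
        X.append(pts + rng.normal(0.0, noise, pts.shape))
        y += [label] * n_per_class
    return np.vstack(X), np.array(y)
```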
where (θ, r) are the polar coordinates (angle and radius) and a, b are the parameters of the Archimedean spiral.
3-D spirals were generated according to Eqs. 19 and 20 in cylindrical coordinates, with the following parameter values: a1 = a2 = 1, b1 = 0.5, where (θ, r, h) are the cylindrical coordinates (angle, radius, and height) and a, b, c are the parameters of the spiral. Regression datasets are generated similarly. They are designed to embed into the data important properties of real-valued functions: periodicity and trend.
Linear function y = a^T x + b: the simplest approach to trend modelling. It is possible to generate a set in a space of any dimensionality. The values of the slope a and the bias b are drawn randomly from the [−5, 5] range. Second-degree polynomial of many variables: the coefficients and the bias are randomly drawn from the [−5, 5] range. There may be correlations between variables, which must be correctly reproduced by the model. Sine, cosine: periodicity modelling; data generated on the [−2π, 2π] domain.
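Illustrative generators for the linear and sine datasets can be sketched as below; the input domain [−1, 1] for the linear case is our assumption, as the text specifies only the range of the slope and bias:

```python
import numpy as np

def linear_dataset(n, dim, rng=None):
    """y = a^T x + b with slope and bias drawn uniformly from [-5, 5];
    the input domain [-1, 1] is our assumption."""
    rng = np.random.default_rng(rng)
    a = rng.uniform(-5.0, 5.0, dim)
    b = rng.uniform(-5.0, 5.0)
    X = rng.uniform(-1.0, 1.0, (n, dim))
    return X, X @ a + b

def sine_dataset(n, rng=None):
    """Periodic target sampled on the [-2*pi, 2*pi] domain."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(-2.0 * np.pi, 2.0 * np.pi, n)
    return x.reshape(-1, 1), np.sin(x)
```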

SinC:
Function expressed by Eq. 21; it is a mixture of periodicity and trend. The splines were proposed as reference functions for studying the performance of regression models. The first one is a synthetic curve designed to test the ability to detect relationships between variables; the remaining two correspond to physical phenomena. The splines are defined by Eqs. 22, 23, and 24, where ε denotes random noise. Spline I has an interesting characteristic: it is defined using only 5 variables, yet the dataset consists of more variables, so the function values depend only on some of the features.
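The SinC target of Eq. 21 can be implemented with an explicit guard for the removable singularity at zero:

```python
import numpy as np

def sinc(x):
    """SinC target (Eq. 21): sin(x) / x, with the removable singularity
    at x = 0 mapped to 1."""
    x = np.asarray(x, dtype=float)
    safe = np.where(x == 0.0, 1.0, x)          # avoid division by zero
    return np.where(x == 0.0, 1.0, np.sin(safe) / safe)
```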
f3(x) = arctan((x2 x3 − 1/(x2 x4)) / x1) + ε

Declarations Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.