Empirical validation of ELM trained neural networks for financial modelling

The purpose of this work is to compare the predictive performance of single hidden layer feedforward neural networks (SFNN) trained using a relatively novel technique, the Extreme Learning Machine (ELM), with that of commonly used backpropagation-trained recurrent neural networks (RNN) on the task of financial market prediction. Evaluated on a set of large capitalisation stocks on the Australian market, specifically the components of the ASX20, ELM-trained SFNNs showed superior performance over RNNs for individual stock price prediction. While this conclusion holds generally, long short-term memory (LSTM) RNNs were found to outperform for a small subset of stocks. Subsequent analysis identified several areas of performance deviation, which we highlight as potentially fruitful areas for further research and performance improvement.


Introduction
The prediction of stock price movements is of particular interest to traders in financial markets, in addition to a wide range of finance applications not limited to the market timing strategies of funds managers and the selective hedging practices of corporations. Paradoxically the most successful theories of this field assume that stock prices undergo random walks when markets are in equilibrium and include work pioneered by the seminal publications of Black, Scholes [2,3], and Merton [16] in continuous time and the Binomial Options Pricing model of Cox, Ross, and Rubinstein [7] in discrete time. Nonetheless, this remains a developing area, with frequent deviations from the random walk assumption observed (such as those identified in Jacobs [14]) and with recent advances in data provision and computational techniques many of these approaches have found application in this area of research.
One such application has been machine learning techniques (Strader [23]). This field has seen the widespread application of Neural Network models, which are explicitly nonlinear in their mathematical representations (see, for example, Chollet [6]), with such specifications as Long Short-Term Memory (LSTM) models being particularly suited to financial time-series analysis. Nevertheless, much of the published work is accomplished using deep neural networks, with relatively few studies using shallow neural networks, notably across regionally separated markets. Strader [23], in particular, highlights these areas as requiring additional empirical research. The Extreme Learning Machine (ELM) of Huang et al. [13] is one such alternative training technique for shallow neural networks that has recently seen promising empirical results across various practical applications. These applications have been in diverse areas of research, ranging from image quality assessment (Suresh, Venkatesh Babu, and Kim [25]) and improving wireless sensor network localization (Phoemphon, So-In, and Niyato [21]) to predicting landslides (KV, Pillai, and Peethambaran [17]). A natural extension of these applications then is in the area of financial research. Tang et al. [27] present a case for using a Random Vector Functional Link (RVFL) network, a special case of single hidden layer feedforward neural networks, together with a Variational Mode Decomposition for day-ahead crude oil price prediction and compare the results with other neural networks. They find the RVFL network outperforms simpler ELM-based models (as it distinctively includes a direct input-output link, unlike an ELM), albeit taking longer to train. Bisoi et al. [4,5] support these findings on the same dataset utilising similar network choices. Bisoi et al. [4,5] also explore the same concept of Variational Mode Decomposition, applying it to next-day-ahead stock price prediction using ELM-based models.
They analyse a sample of daily prices from regional markets and find promising results for the S&P BSE 500, Hang Seng and FTSE 100 Indices. Mohanty, Parida, and Khuntia [19] show an improvement on the performance of a plain ELM-based model for stock price prediction by morphing it into a kernel ELM (KELM) combined with an auto encoder (AE). Mohanty et al. [19] test this augmented model (KELM-AE) on several bank stock indices using a normalised OHLC (open-high-low-close) ''candlestick'' dataset as input variables for next-day-ahead prediction. Göçken et al. [10] compare the effectiveness of using an ELM-trained SFNN to predict stock prices on the Borsa Istanbul (BIST 100) stock exchange, albeit focusing on a relatively narrow subset of 3 stocks. Results are compared to other popular models used for the task such as DNN, Jordan RNN, GLM, RT, and Gaussian Process Regression. The conclusion drawn on the best performing model architectures and variables is used for prediction and comparison in this study. An empirical study conducted by Zhang [30] involved an attempt to apply an ELM-trained SFNN to predict price movements of a stock on the Hong Kong stock exchange. Zhang's results [30] provide some empirical support to the work of Göçken et al. [10]. This paper utilises emerging findings from Zhang [30] within the framework of Göçken et al. [10] and designs, validates and evaluates a series of SFNN models, trained using the ELM methodology, on the 20 largest stocks on the Australian equity market, also known as constituents of the ASX20. The ELM-trained model results are shown to indeed fulfil the promise of fast learning times (as compared to training the same models using the classic backpropagation algorithm) with a comparably high, and often superior, level of accuracy. In general, improving prediction accuracy in this task, even if by a slight margin, can bring about material benefits to the interested stakeholders.
This is the driving motivation behind this study; it shows that less computationally intensive model training techniques can deliver potentially higher economic benefits to capital market participants. With varying degree of computational power availability among market participants, designing more efficient techniques, such as those based on ELM training methodology, has direct industry applications. Additionally, performance findings may be used to academically interpret the operating mechanism of neural networks, thus advancing this strand of research.
We draw particular distinction between existing studies in the underlying dataset used as individual stocks are likely to possess different movement characteristics driven by inherent risks and influential factors, as compared to, for example, combinatory stock indices or foreign exchange rates. To the best of our knowledge, there has been no study comparing stock price prediction performance of the most commonly known and successfully used LSTM to the relatively novel ELM training methodology on a broad set of large and frequently traded stocks.
Another differentiating element between these relevant studies is the extent of initial data reconstruction applied to the raw financial price series. Given the well-studied noisiness and dynamics in the financial price series, such structured methods as Empirical or Variational Mode Decomposition (Das et al. [8], Bisoi et al. [4,5]) and Discrete Wavelet Transform (Wu et al. [28]) have previously been applied to the raw stock price series. The rest of the studies rely on the logic of Takens' theorem [26], either explicitly or implicitly, by constructing technical features and statistical metrics from the raw price series to be used as the model input (Khuwaja et al. [15] provide the most comprehensive detail on the application of this methodology; Das et al. [9], Mohanty et al. [19], and Panda et al. [20] appear to implicitly follow this path). We apply the latter approach of explicitly applying the spirit of Takens' theorem by constructing technical features from the raw price series, intentionally very similar to the ones used in Zhang [30], albeit with slight modification and an addition.
This paper is organised as follows. Section 2 details the training and testing methodology and data used in this study. The subsequent Results and Discussion section evaluates model performance for the two mentioned training methodologies on the holdout test dataset. Finally, the Conclusion and Future Research section concludes the paper and identifies promising future research opportunities.

Methodology
This section first discusses the model construction and training methodology, followed by the data description and preparation.

Extreme learning machine (ELM)
Extreme Learning Machine, as introduced by Huang et al. [13], modifies the training methodology for single hidden layer feedforward neural networks by converting it to an analytical solution. It has been shown that ELMs (i.e., ELM-trained SFNNs) produce similar, if not better, results for typical NN tasks, at a much faster training speed and while avoiding the challenge of convergence to local optima.
In an ELM, the vectors of input weights and hidden node biases are first randomly assigned and used to calculate the hidden layer output matrix, $H$, using a specified transfer function. The Moore-Penrose generalised inverse of this matrix $H$, denoted as $H^{\dagger}$ as adapted from Huang et al. [13], is then used to analytically determine the vector of output weights $\gamma$ that would best fit with the data output matrix $T$.
The mathematical representation of a single hidden layer feedforward neural network (SFNN) trained with the Extreme Learning Machine (ELM) training methodology is as follows. A model with $N$ distinct samples $(X_j, t_j)$ and randomly assigned hidden neurons, where $X_j = [x_{j1}, x_{j2}, \ldots, x_{jn}]^T$ is an input and $T = [t_{j1}, t_{j2}, \ldots, t_{jn}]^T \in \mathbb{R}^n$ is the targets vector, can be represented by the following equation:
$$o_n = \sum_{j=1}^{L} \gamma_j \, h(W_j \cdot X_n + b_j), \quad n = 1, \ldots, N,$$
where $\gamma_j$ and $W_j = [w_{j1}, w_{j2}, \ldots, w_{jn}]^T$ are the output and input weights, respectively, $b_j$ is the randomly assigned bias of the $j$-th hidden neuron and $h(\cdot)$ is the nonlinear activation function. The goal of the ELM training methodology is to reduce the error between the predicted and the target (actual) values, such that $\sum_{n=1}^{N} \| o_n - t_n \| = 0$ and
$$\sum_{j=1}^{L} \gamma_j \, h(W_j \cdot X_n + b_j) = t_n, \quad \text{for } n = 1, \ldots, N.$$
In short, the ELM training methodology comprises the following steps:
1) Randomly assign hidden layer weights $W_j$ and bias $b_j$ values.
2) Calculate the hidden layer output matrix $H$:
$$H = \begin{bmatrix} h_1(x_1) & \cdots & h_L(x_1) \\ \vdots & \ddots & \vdots \\ h_1(x_N) & \cdots & h_L(x_N) \end{bmatrix},$$
where $h_j(x)$ stands for the nonlinear transfer function of the $j$-th hidden neuron, and $L$ stands for the number of hidden neurons chosen in the SFNN. Each column of the $H$ output matrix represents the $j$-th hidden neuron output vector with regards to the vector of inputs $x_1, x_2, \ldots, x_N$.
3) Calculate the vector of output weights $\gamma_j$:
$$\gamma = H^{\dagger} T,$$
where $\gamma = [\gamma_1, \gamma_2, \ldots, \gamma_L]^T$ is the vector of output weights, $T = [t_1, t_2, \ldots, t_N]^T$ represents the output matrix, and $H^{\dagger}$ is the Moore-Penrose generalised inverse of the hidden layer output matrix $H$.
Effectively, the objective function of the SFNN trained using ELM methodology is the minimisation of the cost function as follows:
$$\min_{\gamma} \, \| H(W_j, b_j) \, \gamma - T \|,$$
where $j = 1, \ldots, L$. The minimisation of the cost function is based on the sum of squared errors calculation, defined in Eq. (5):
$$E = \sum_{n=1}^{N} \Big( \sum_{j=1}^{L} \gamma_j \, h(W_j \cdot X_n + b_j) - t_n \Big)^2. \tag{5}$$
Figure 1 depicts a high level overview of the ELM-trained neural network structure, with the specific inputs and outputs chosen for this study. For full details and proof of theorems underpinning the ELM training methodology please see Huang et al. [13].
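The three training steps described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the implementation used in the study: the tanh transfer function, the uniform range of the random weights, the function names `elm_fit` and `elm_predict`, and the toy data are all choices made for the example only.

```python
import numpy as np


def elm_fit(X, T, n_hidden, rng):
    """Fit an ELM-trained SFNN; returns (W, b, gamma)."""
    n_features = X.shape[1]
    # Step 1: randomly assign input weights and hidden biases (never updated).
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    # Step 2: hidden layer output matrix H, one row per sample, one column
    # per hidden neuron. tanh is an assumed transfer function.
    H = np.tanh(X @ W + b)
    # Step 3: output weights from the Moore-Penrose generalised inverse.
    gamma = np.linalg.pinv(H) @ T
    return W, b, gamma


def elm_predict(X, W, b, gamma):
    """Network output: sum over hidden neurons of gamma_j * h(W_j . x + b_j)."""
    return np.tanh(X @ W + b) @ gamma


# Toy regression problem (synthetic, not stock data).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
T = np.sin(X.sum(axis=1))

W, b, gamma = elm_fit(X, T, n_hidden=25, rng=rng)
train_mse = float(np.mean((elm_predict(X, W, b, gamma) - T) ** 2))
print(f"training MSE: {train_mse:.6f}")
```

Because the hidden layer is fixed after the random draw, training reduces to a single linear least-squares solve, which is where the speed advantage over iterative backpropagation comes from.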
''The design of an ANN [Artificial Neural Network (ANN)] is more of an art than a science'' (Zhang, Patuwo and Hu [31]), and, in the case of the ELM-trained SFNN, it is mainly the number of hidden nodes in the single hidden layer that needs to be chosen. Some works in this area focus on developing theoretical bounds on the minimum and maximum number of hidden nodes required (for example, LeCun et al. [18]), while others develop complex algorithms for supporting this decision using the dataset at hand (Xu and Chen [29]). Yet another category focuses on providing rule-of-thumb advice for determining the optimal number of hidden nodes based on the number of inputs and outputs in the model, or merely on the number of training samples. The major risk here is discounting some of the other key attributes of the dataset used (e.g., signal-to-noise ratio or complexity of the function to be learnt) that may materially impact this decision.
Thus, we use cross-validation to determine the optimal number of hidden nodes in the network. Given we are dealing with time-series data, we split the training set (~7.5 years out of the total 10 years of data obtained) into training (the first 5 years) and validation (the remaining ~2.5 years) subsets. We train ELM neural networks starting with a very small number of hidden nodes (2) and progressing incrementally (initially, with increments of 1) to larger network sizes (> 300 hidden nodes in the largest network) for each individual stock. The best topology is then chosen based on the mean squared error results from the validation dataset for each stock across the networks trained. Before the best chosen model is used for the final test on the holdout data (the last 2.5 years of data available), the training and validation datasets are combined, and this final model is trained on the full training dataset.
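The selection procedure can be illustrated with the sketch below. It is an assumption-laden example, not the study's code: the helper names `elm_val_mse` and `select_hidden_nodes`, the tanh transfer function, the fixed random seed, the synthetic data and the reduced search range (2 to 40 nodes) are all hypothetical. The one property it is meant to show is that the split is chronological, with no shuffling, so the validation block always follows the training block in time.

```python
import numpy as np


def elm_val_mse(X_tr, y_tr, X_val, y_val, n_hidden, seed=0):
    """Train an ELM on the training block and return validation MSE."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X_tr.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    gamma = np.linalg.pinv(np.tanh(X_tr @ W + b)) @ y_tr
    pred = np.tanh(X_val @ W + b) @ gamma
    return float(np.mean((pred - y_val) ** 2))


def select_hidden_nodes(X, y, n_train, candidates):
    """Pick the hidden-layer size with the lowest validation MSE."""
    # Chronological split: first n_train rows train, the rest validate.
    X_tr, y_tr = X[:n_train], y[:n_train]
    X_val, y_val = X[n_train:], y[n_train:]
    scores = {L: elm_val_mse(X_tr, y_tr, X_val, y_val, L) for L in candidates}
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic stand-in for one stock's feature matrix and next-day target.
rng = np.random.default_rng(1)
X = rng.normal(size=(750, 4))
y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=750)

# 500 training rows and 250 validation rows mimic the 5 / 2.5 year ratio.
best_L, scores = select_hidden_nodes(X, y, n_train=500, candidates=range(2, 41))
print("selected hidden nodes:", best_L)
```

Once a size is selected, the final model would be refitted on the combined training and validation rows before being scored once on the holdout block.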

Recurrent neural network (LSTM)
The Long Short-Term Memory (LSTM) algorithm was developed by Hochreiter and Schmidhuber [11] to address the vanishing gradient problem, a persistent effect observed in simpler (i.e., non-recurrent) feedforward neural networks as their depth increases (the theoretical explanation behind this effect is discussed extensively by Bengio et al. [1]). At its core, the LSTM algorithm allows the network to ''carry'' information across many timesteps (hence the name) to be later ''reinjected'' back into the network when needed. This is particularly useful for tackling tasks where time-series are studied and locally learnt features at some previous point can then be ''remembered'' by the network and used later when a similar pattern arises. Hu et al. [12] conduct a survey of literature studying deep learning models used for stock price prediction and conclude that hybrid LSTM-based models are the most widely researched.
We train LSTM neural network models using raw daily close price data with an identical lookback period of 5 days. The decision to use the raw daily stock price data is based on an understanding of the feature extraction mechanism of an LSTM Recurrent Neural Network (RNN). An LSTM network is designed to be able to remember information and features learned several timesteps before the unit currently being processed, which is what we are attempting to accomplish with feature extraction for the ELM-trained SFNNs. A 5-day lookback period is chosen to provide a direct comparison between ELM-trained SFNN and LSTM neural networks. A three-layer stacked LSTM structure with 50 hidden nodes in each layer is used, as this is a commonly used architecture for this type of neural network in finance research (see, e.g., Sirignano and Cont [22]). Figure 2 provides a schematic overview of the training process behind an LSTM-based neural network.
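To make the 5-day lookback concrete, the sketch below turns a raw close price series into overlapping input windows and next-day targets. It covers only the data preparation, not the LSTM itself; the function name `make_lstm_windows` and the toy prices are hypothetical, and the three-dimensional (samples, timesteps, features) layout is assumed because it is the input shape that recurrent layers in common deep learning libraries typically expect.

```python
import numpy as np


def make_lstm_windows(close, lookback=5, horizon=1):
    """Slice a 1-D close price series into (samples, lookback, 1) windows
    and the price `horizon` days after the end of each window."""
    close = np.asarray(close, dtype=float)
    n = len(close) - lookback - horizon + 1
    if n <= 0:
        raise ValueError("series too short for the chosen lookback/horizon")
    # Each row holds `lookback` consecutive daily close prices.
    X = np.stack([close[i:i + lookback] for i in range(n)])
    # Target: the close price `horizon` days after the last day in the window.
    first_target = lookback + horizon - 1
    y = close[first_target:first_target + n]
    # Add a trailing feature axis: one feature (close price) per timestep.
    return X[..., np.newaxis], y


# Ten hypothetical daily close prices: 10.0, 11.0, ..., 19.0.
close = np.arange(10.0, 20.0)
X, y = make_lstm_windows(close, lookback=5, horizon=1)
print(X.shape, y.shape)
```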

Data
Individual stocks in the S&P/ASX20 index are chosen for empirical evaluation of the models. This is consistent with existing literature in the field (as discussed in detail in Sect. 1) while also capturing broad coverage of industries on the Australian share market. Daily candlestick chart price data (Open-High-Low-Close, without Volume) is obtained using yahoo-finance module in Python (Table 1). History of the index constituents for the period under study is obtained and stocks that have been present in the index for more than a year are chosen. Table 2 contains descriptive list of the dataset obtained, and the final set of individual stocks chosen for model testing.
For illustration purposes, example price charts for several stocks included in this study are provided below (Fig. 3).
This study diverges from the previous research (for example, Zhang [30] or Bisoi et al. [4,5]) by testing models on stock price data that may have moved out of a certain narrow range (e.g., within 3 standard deviations of the historical mean). The dataset under study is chosen broadly to capture stock price movements of any kind and range, to eliminate any bias from the research.
To present this data in a meaningful manner, we calculate technical indicators utilising a ''lookback period'' L = 5 and a ''forward period'' O = 1. These parameters have been chosen arbitrarily, yet driven by the logic of having at least one week of daily stock price data (hence the lookback period of 5 days) and predicting for the time period ahead where this data sensibly matters (thus, limiting this to just 1 day ahead). The choice of parameters is consistent with the existing literature in the field (e.g., Göçken et al. [10], Khuwaja et al. [15]).
Using these parameters, the daily stock price data are then converted into a set of technical indicators as below. Assuming daily stock price data contains a substantial amount of noise, Takens' delay embedding theorem is relied upon to construct these smoothed ''attractors'' (i.e., technical indicators in this case) that could be used for stock price prediction (Takens [26]). The technical indicators developed for this study were built to resemble the ones presented in the study by Zhang [30], yet with some addition, and comprise the following: . . .
where $P_i$ is the mean of the stock price for the $i$-th lookback period (calculated as the Average above);
5. Pseudo Log Return (PLR): the logarithmic difference between average prices of consecutive lookback periods;
6. Trend Indicator (TI): a simple trend indicator calculated as the difference between the last close price for the $i$-th lookback period, $P_i$, and the first close price of the $i$-th lookback period, $P_{i-4}$, and taken as an ''upward trend'' (i.e., assigned the value of ''1'') when the difference is positive (> 0), ''no trend'' (''0'') when there is no difference (i.e., $P_i = P_{i-4}$), or ''downward trend'' when it is negative.
Table 3 shows a sample set of technical indicators calculated for a stock given the original candlestick chart price data. As shown in Fig. 1, these technical indicators for each $i$-th lookback period (i.e., for each day in a sequence of daily stock price data) are fed into the ELM-trained SFNN model directly.
. . . Table 4 on a number of critical performance metrics, including speed of training, which gains importance as the prediction horizon decreases. All training and testing procedures were run on a Dell Latitude 5310 laptop with an Intel(R) Core(TM) i5-10310U CPU @ 1.70 GHz, 2208 MHz, 4 Core(s), 8 Logical Processor(s), and 16.0 GB of RAM (Installed Physical Memory).
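As an illustration of the Pseudo Log Return and Trend Indicator described in this section, the sketch below computes both over a rolling 5-day window. Several details are assumptions made for the example and are not confirmed by the text: the lookback windows are taken as overlapping and advancing one day at a time, the ''downward trend'' is encoded as -1, and the function names and prices are hypothetical.

```python
import numpy as np


def pseudo_log_return(close, lookback=5):
    """Log difference between mean prices of consecutive lookback windows."""
    close = np.asarray(close, dtype=float)
    # Rolling mean: one value per window of `lookback` consecutive days.
    means = np.convolve(close, np.ones(lookback) / lookback, mode="valid")
    return np.diff(np.log(means))


def trend_indicator(close, lookback=5):
    """Sign of (last close - first close) within each lookback window:
    1 = upward, 0 = no trend, -1 = downward (the -1 code is assumed)."""
    close = np.asarray(close, dtype=float)
    diff = close[lookback - 1:] - close[:len(close) - lookback + 1]
    return np.sign(diff).astype(int)


# Seven hypothetical daily close prices give three 5-day windows.
close = [10.0, 11.0, 12.0, 13.0, 14.0, 13.0, 9.0]
plr = pseudo_log_return(close)
ti = trend_indicator(close)
print("PLR:", plr)
print("TI :", ti)
```

With these prices the three window means are 12.0, 12.6 and 12.2, so the PLR series has two entries, and the window differences 4, 2 and -3 map to the trend codes 1, 1 and -1.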

Results and discussion
Overall, the performance of the ELM-trained models exceeds that of the LSTM networks, with only a few cases of LSTM slightly outperforming ELM. On average, Group 1 in Table 4 (where ELM models provide superior prediction results to the LSTM ones) shows that ELM performs materially better than LSTM for these time-series. However, for Group 2 (where LSTM yielded better performance than ELM) the difference in performance between the two models is minimal. This is also supported by the charts in Fig. 4, where the top two stocks (STO.AX and IAG.AX) belong to Group 1 and the other two stocks (WPL.AX and WBC.AX) are from Group 2. For Group 1 stocks, the LSTM and ELM predictions differ visibly, with the LSTM predictions noticeably off. For Group 2, the ELM and LSTM predictions move almost too close to call.
The speed of training is also noteworthy. ELM-trained models have been known to increase the speed of training by a factor of at least 10-100 compared to their more widely accepted counterparts, LSTM in this case (Wu et al. [28]). Our tests showed that ELM models were trained and tested more than 100 times faster than LSTM ones for the full set of stocks under analysis. The value of this advantage is appreciated when dealing with . . .
In general, ELM appears to capture trends in the data well, and to react better to more short-term changes in the price process (i.e., peaks and troughs). This may be attributed to the set of technical indicators used for prediction, where the network is almost able to build an expected stock price distribution and choose the most likely next point given the latest point in the data (i.e., the previous day's close price).
Let us contrast the stocks where LSTM achieved better prediction results than ELM with the ones where it did not. The former belong to either the financial services (ANZ, NAB, WBC) or metals and mining (NCM, WPL) industries. However, stocks belonging to the same industries are found within Group 1 as well (for example, CBA, MQG or BHP, STO), where ELM performed better than LSTM. Therefore, the hypothesis that differences between these two groups may be driven by industry membership does not stand. Table 5 presents initial further analysis on the issue. Descriptive statistics are calculated for stocks from the two groups, split by the dataset used to train, validate, and test the models. It should be noted that the final model is trained on the full combined training and validation dataset, before running the final test on the holdout dataset. This timewise split allows us to explore the differences in model performance between stocks.
First, it is worth noting that either model, but especially LSTM, performed comparatively worse on stocks that experienced strong growth through the period of analysis (2010-2019). For example, CSL.AX and MQG.AX had a substantial increase in the mean between datasets. Second, it appears that stocks from Group 2 had a substantial change in data distribution skewness (highlighted by bold italics in the table) between periods. For example, ANZ.AX showed positive skewness around 0.3-0.4 in the training and NCM.AX skewness has sharply increased from the negative 0.15 for training to the positive above 1 for the test datasets. This change in skewness represents a material change in the interaction between the time-series mean and standard deviation, which were used as critical input variables in the ELM-trained models. This may explain the difference in predictive performance between models for these stocks, and it is a promising avenue for further research.
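The kind of skewness comparison discussed here can be sketched as follows. The exact skewness estimator used for Table 5 is not stated, so the biased moment-based (Fisher-Pearson) coefficient below is an assumption, as are the function names, the split sizes and the synthetic price series.

```python
import numpy as np


def sample_skewness(x):
    """Moment-based skewness: third central moment over variance^1.5."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float(np.mean(d ** 3) / np.mean(d ** 2) ** 1.5)


def skewness_by_split(prices, n_train, n_val):
    """Skewness of each chronological block: train, validation, test."""
    prices = np.asarray(prices, dtype=float)
    return {
        "train": sample_skewness(prices[:n_train]),
        "validation": sample_skewness(prices[n_train:n_train + n_val]),
        "test": sample_skewness(prices[n_train + n_val:]),
    }


# Synthetic series: a symmetric training block followed by blocks with
# a long right tail, mimicking a shift in skewness between periods.
rng = np.random.default_rng(2)
prices = np.concatenate([
    20.0 + rng.normal(size=500),        # roughly symmetric
    20.0 + rng.exponential(size=250),   # right-skewed
    20.0 + rng.exponential(size=250),   # right-skewed
])
result = skewness_by_split(prices, n_train=500, n_val=250)
print(result)
```

A large gap between the training and test values in such a table is the pattern reported for the Group 2 stocks.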
To test the economic significance and practicality of the aforementioned findings, a simple trading strategy is developed using ELM model-based stock price predictions. Table 6 shows profitability results by threshold level $\delta$ that is used in the following manner: . . .
where $a_t^i$ is the action taken at time $t$, $\Delta s_t^i$ is the expected change in the stock price $s$ of stock $i$ for time period $t$ based on the ELM model prediction, and $\delta$ is the chosen threshold level in absolute dollar terms to indicate whether the predicted stock price change is considered material. Performance of the prediction-based strategy is tested on several threshold levels, as there does not appear to be agreement in the literature on a common value or even an approach. For example, Mohanty et al. [19] and Bisoi et al. [4,5] only compare the next day predicted price to the current day value, inherently assigning a value of 0 to the action threshold. We use a threshold in this strategy to avoid unnecessary trades based on minor changes in price. We test model performance using a simple return metric across the various thresholds to indicate persistence of the findings and conduct robustness checks. The main result of ELM outperformance over its LSTM counterpart holds across the full range of thresholds used. The difference is that fewer trades are made with higher thresholds. Profitable results are bolded in Table 6. In addition to the strategy being profitable for the majority of stocks, it is important to note that the overall average results of the model are positive across all threshold values. These results may be best understood as investing the same amount of funds in each individual stock from the beginning. These results are designed to reveal that real-world implementations are possible, and future research should investigate an optimal trading strategy based on such a model.
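A threshold rule of this kind might be sketched as below. The exact mapping from the predicted change to an action is not recoverable from the text, so the buy / sell / hold encoding (1 / -1 / 0), the symmetric use of the threshold, the function name `threshold_actions` and the prices are all assumptions made purely for illustration.

```python
import numpy as np


def threshold_actions(current, predicted, delta):
    """Assumed rule: buy (1) if the predicted change exceeds +delta,
    sell (-1) if it falls below -delta, otherwise hold (0)."""
    ds = np.asarray(predicted, dtype=float) - np.asarray(current, dtype=float)
    return np.where(ds > delta, 1, np.where(ds < -delta, -1, 0))


# Hypothetical current prices and next-day model predictions (in dollars).
current = [10.0, 10.0, 10.0]
predicted = [10.5, 9.4, 10.1]

for delta in (0.0, 0.25, 0.55):
    actions = threshold_actions(current, predicted, delta)
    n_trades = int(np.count_nonzero(actions))
    print(f"delta={delta:.2f} actions={actions.tolist()} trades={n_trades}")
```

Raising the threshold filters out the smaller predicted moves first, which is consistent with the observation that fewer trades are made at higher thresholds.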

Conclusion and future research
Prediction of future stock price returns is arguably one of the most challenging areas of finance. This is driven by the common belief that stock prices have a relatively low signal-to-noise ratio and a substantial array of influential . . .
The analysis confirms the proposed benefits of ELM training, specifically reduced training time with comparable predictive power without requiring as much data. It further confirms the emerging findings of Zhang [30] as to the overall efficacy of the models and potential shortfalls, albeit on a broader real stock price dataset. ELM-trained models have shown substantial improvement in predictive accuracy on the majority of individual stock price datasets used in the study. In the relatively few cases where LSTM does better, it appears that changes in the stock price data distribution might be the reason for this deviation.
The performance discrepancy between LSTM and ELM-trained models bodes well for potential future research. There are relatively few cases where LSTM outperforms and, in these cases, we have observed substantial drift in the degree of skewness between training and test datasets, as compared to less pronounced changes for the rest of the stock price series. LSTM's ability to capture changes in the underlying asymmetry of the distribution may prove a significant advantage in practice. Identifying important variables in the ELM neural networks and trialling additional input metrics directly relating to the skewness of the input series distribution also represent interesting research directions. Additionally, testing the results on a different set of financial instruments may shed further light on the findings about LSTM's ability to respond better to changes in the underlying data distribution, which would be a useful feature in investment practice.
Given the encouraging results, future research could investigate ways to further enhance the ELM-based trading system. It would be valuable to investigate adding a stronger trend indicator (such as a medium or a long-term Exponential Moving Average), given that the current ELM did not perform well with the changing distribution of the price data. Such an indicator is likely to help better identify the underlying trend in the data and adjust the price prediction accordingly. Another parameter that could be further tuned is the threshold value in the action decision step. The strategy presented used an initial set of dollar-based thresholds, but the optimal threshold might vary according to the underlying time-series data and its distribution properties. Finally, the choice of input variables used in the model can be further investigated to improve its predictive power. For example, it might be beneficial to trial various volatility measures instead of the standard deviation currently used in the model. Measures that may better reflect more recent volatility or more accurately capture both short- and long-term inherent price series variance are most promising. Such future research identifying better methods to more accurately predict security prices has widespread applications, from improving on pure profitability-driven investment objectives to better informed portfolio construction and risk management.
Funding Open Access funding enabled and organized by CAUL and its Member Institutions.
Data availability The datasets generated during and/or analysed during the current study are available in the Yahoo Finance repository, https://finance.yahoo.com/lookup?s=DATA. Data have been sourced using the API module connection direct from Python IDE, Spyder using the yahoo-finance module commands. Additionally, the datasets generated during and/or analysed during the current study are also available from the corresponding author on reasonable request.

Declarations
Conflict of interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.