A federated learning-enabled predictive analysis to forecast stock market trends

This article proposes a federated learning framework to build Random Forest, Support Vector Machine, and Linear Regression models for stock market prediction. The performance of federated learning is compared against centralised and decentralised learning frameworks to identify the best-fitting approach for stock market prediction. According to the results, federated learning outperforms both centralised and decentralised frameworks in terms of Mean Square Error if Random Forest (MSE = 0.021) and Support Vector Machine techniques (MSE = 37.596) are used, while centralised learning (MSE = 0.011) outperforms federated and decentralised frameworks if a Linear Regression model is used. Moreover, federated learning gives a lower model training delay than the benchmarks if Linear Regression (time = 9.7 s) and Random Forest models (time = 515 s) are used, whereas decentralised learning gives the lowest model training delay (time = 3847 s) for Support Vector Machine.


Introduction
Stock market development suffers from uncertainties caused by a variety of social, financial, and business factors. These uncertainties result in stock price fluctuations and macroeconomic issues such as inflation and deflation during different periods (Edwards et al. 2018). Economists, governors, and investors are therefore interested in stock market forecasting, which allows them to model the market, manage resources and enhance stock profits.
Machine learning (ML) techniques are widely used to extract knowledge and data patterns from statistics and to automate data analysis (Wang et al. 2022). They have the capacity to build classification, prediction and/or regression models for various applications, mainly stock markets. ML models can predict stock prices, resulting in efficient decision-making and market profit enhancement (Pang et al. 2020). However, ML-enabled stock market predictions are challenging due to the uncertain and dynamic behaviours of the market, data sensitivity concerns, and the complexity of the historical and/or time-series data (Gandhmal and Kumar 2019; Malle et al. 2016).
Federated learning (FL) builds a collaborative ML framework to analyse and explore data patterns. It supports data isolation, minimises data sharing and maximises computing parallelism (Yang et al. 2019; Hauschild et al. 2022).
This article aims to build an FL framework for stock trend prediction. For this, three ML models including Linear Regression (LR) (Cakra and Trisedya 2015), Random Forest (RF) (Iannace et al. 2019) and Support Vector Machine (SVM) (Patle and Chouhan 2013) are trained and tested using a public stock market dataset (Quant 2021) which comprises Chinese stock market data. The performance of the proposed FL framework is evaluated and compared against centralised and decentralised learning to find the best-fitted approach in terms of Mean Square Error (MSE) and model training time. The key contributions of this research are listed below:
• Propose a data pre-processing approach to clean and prepare stock market datasets.
• Deploy an FL framework for stock trend predictions, and evaluate its performance against centralised and decentralised learning.
• Compare the performance of three ML models including LR, RF, and SVM to find the best technique for stock market trend prediction.
The remainder of this paper is organised as follows. Section 2 reviews the literature, introduces the relevant state-of-the-art ML techniques for stock market prediction, and highlights the existing research gaps. Section 3 introduces the research methodology and experimental plan. Section 4 presents the experimental results, while Sect. 5 discusses the results to outline the research findings. Section 6 concludes the paper and highlights the future works.

Related works
This section introduces ML frameworks, particularly FL, and describes the distinctive ML techniques that are widely used in stock market predictions.

ML frameworks
ML models can be built and trained via three key frameworks: centralised, distributed and federated learning (Elbir et al. 2021). Centralised learning uses an integrated dataset to build and train an ML model at once (Abadi et al. 2016). However, it may become slow and expensive if a huge and complex dataset is used. Distributed learning aims to resolve this drawback by utilising distributed computing platforms (e.g., Apache Spark) (Chen et al. 2018). It partitions the dataset into several parts (e.g., Resilient Distributed Datasets (Zaharia et al. 2012)) and trains the ML model based on a parallel processing paradigm (Geyer et al. 2017). Federated learning (FL) is a comparatively new approach that supports collaborative machine learning. As Fig. 1 depicts, it partitions the dataset into several parts, each of which is used to train a local model. The local models are then aggregated to form a master model, which is used to analyse the dataset. Compared with centralised and distributed learning, FL offers benefits, mainly faster model training and avoidance of data leakage. FL also differs from traditional distributed machine learning (DML) in the following respects (Bonawitz et al. 2017; Konečný et al. 2016; Abdul et al. 2021):

Federated learning technology
1. Control: FL does not allow the server to manipulate worker nodes' data, either directly or indirectly, whereas traditional DML such as MapReduce (Dan et al. 2006) lets the server control the worker nodes.
2. Data distribution and load balancing: DML usually assumes independent and identically distributed (IID) data to enhance the efficiency of model training and support load balancing. However, FL does not assume IID data and may assign different data portions to the worker nodes.
3. Communication cost: DML applications incur low communication costs as worker and server nodes are usually located at the same geographical location. However, FL applications suffer from high communication overhead because the nodes are interconnected through cloud-based communication links.
4. Communication quality: DML nodes are usually provided with high-speed broadband as they are well-located. Hence, the DML network and operating environments are stable. On the other hand, FL worker nodes may experience different connection quality due to network/bandwidth restrictions.

Stock market prediction
A large number of non-FL frameworks and ML models have been used for stock market prediction. Patel et al. (2015) train Artificial Neural Network (ANN), Support Vector Machine (SVM), Random Forest, and Naive-Bayes algorithms to forecast Indian stock markets over a decade. Hassan and Nath (2005) use a Hidden Markov Model to predict the stock prices of four international airlines. Shen and Shafiq (2020) collect, pre-process and analyse 2 years of Chinese stock market records to propose an LSTM deep learning model for stock market trend prediction. Hong (2020) collects and analyses a financial dataset from an international company to study the correlation of stock prices. The study uses a bidirectional LSTM (BLSTM) to reduce the errors that usually occur in LSTM models for one-way forecasting. For this, it trains a two-way prediction model by adding macro indicators including economic growth rates, economic indicators, and interest rates, and by analysing the trade balance, exchange rates, and currency volumes.
Big data analysis techniques, mainly sentiment analysis and text mining, can also benefit stock market prediction. Awan et al. (2021) combine fundamental analysis and Big data techniques to propose machine learning models (linear regression, generalized linear regression, random forest, and decision trees) that help investors decide whether to buy or sell a stock. The proposed models use sentiment analysis and text classification of stocks, tweets, and social media news to predict stock movements. The results show that the linear regression, random forest, and generalized linear regression models achieve an acceptable prediction accuracy. Attigeri et al. (2015) use technical analysis to process social media data in real time. However, they report that fundamental analysis is still required as social media is enormous, unstructured, and rapidly changing. Therefore, they extract the sentiment expressed by individuals on social media and use Big data text mining techniques to analyse the correlation between the sentiments and stock values to train a stock market prediction model.
According to the literature review, there is still a gap in proposing an FL framework for stock market prediction and comparing its performance with non-FL frameworks such as centralised or decentralised learning. FL builds predictive models in a distributed and collaborative training fashion. However, the performance of FL can be influenced by dataset partitioning and distribution, especially if the dataset is large and integrated (highly correlated features) (Lundberg 2021). This paper aims to investigate these issues by evaluating and comparing the performance of FL and centralised/decentralised frameworks in stock market prediction applications.

Methodology
This section explains the research methodology aiming to propose a horizontal federated learning framework for stock market prediction.

Dataset selection and preprocessing
A public dataset of Chinese stock markets for nine provinces, namely Hubei, Fujian, Sichuan, Shandong, Beijing, Zhejiang, Jiangsu, Guangdong, and Shanghai, has been chosen for this research. The key rationale for choosing these nine provinces is that they are the top nine most dynamic and strongest stock regions in China (Chinadaily 2022). The dataset is live and contains 2,699,730 samples and eight features: closing price, low stock limitation, high stock limitation, stock ID, total money, stock volume, high price, and low price. A new feature, named price, is also added to the dataset as Money (the total stock market income) divided by Volume (the total number of sold stocks). This feature is used to study the stock trends as it refers to the average price of each stock.
A data cleaning, normalisation and correlation analysis approach is used to prepare the dataset for ML model training. Data cleaning removes the missing and NaN values, while a standard scaler is used for data normalisation. Moreover, the Pearson technique (Benesty et al. 2009) is used to analyse the feature correlations. According to the Pearson Correlation Coefficient matrix, the dataset features are highly correlated, except for Volume and stock ID. As a result, the dataset with seven features and stock records from 2014 to 2020 is used to train the models after the data preprocessing. It is randomly partitioned into two parts: the training dataset (80%) and the test dataset (20%). Moreover, a 5-fold cross-validation approach is used to avoid over-fitting and achieve robust results.
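The preprocessing steps described above (missing-value removal, the derived price feature, standard scaling, and Pearson correlation) can be illustrated with a minimal sketch. It uses only the Python standard library on a few made-up records; the field names (close, high, low, money, volume) and all values are assumptions for illustration and are not taken from the dataset or from the authors' code.

```python
import math
import statistics

# Hypothetical toy records; field names and values are illustrative only.
rows = [
    {"close": 10.0, "high": 10.5, "low": 9.8, "money": 1000.0, "volume": 100.0},
    {"close": 11.0, "high": 11.4, "low": 10.7, "money": 2310.0, "volume": 210.0},
    {"close": float("nan"), "high": 12.0, "low": 11.0, "money": 1200.0, "volume": 100.0},
    {"close": 12.0, "high": 12.6, "low": 11.9, "money": 1830.0, "volume": 150.0},
]


def clean(records):
    """Drop every record that contains a missing (None) or NaN value."""
    def present(v):
        return v is not None and not (isinstance(v, float) and math.isnan(v))
    return [r for r in records if all(present(v) for v in r.values())]


def add_price(records):
    """Add the derived feature price = money / volume to each record."""
    return [{**r, "price": r["money"] / r["volume"]} for r in records]


def standardise(values):
    """Scale to zero mean and unit variance (population standard deviation)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def pearson(x, y):
    """Pearson correlation coefficient; assumes neither series is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


cleaned = add_price(clean(rows))
closes = [r["close"] for r in cleaned]
prices = [r["price"] for r in cleaned]
scaled_price = standardise(prices)
r_close_price = pearson(closes, prices)
print(len(cleaned), round(r_close_price, 3))
```

In this toy example the record with a NaN closing price is dropped, and the closing price and the derived price are strongly correlated, which mirrors the kind of relationship the correlation matrix is used to detect.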

ML framework deployment
This research deploys three frameworks, namely FL, centralised and decentralised learning, to test and analyse the ML approaches. The given dataset contains stock data samples of nine Chinese provinces. Hence, a horizontal FL is required to divide the dataset based on the data samples (stock regions) and to assign the data partitions, which share the same feature space, to the worker nodes for processing. As Fig. 2a shows, the horizontal FL framework is set up using nine worker nodes, each of which is assigned the training dataset (80%) of one province.
The Flower framework (Flower 2022) with scikit-learn is used to implement the proposed FL. The minimum number of clients (worker nodes) is set to nine, and the Mean Squared Error (MSE) is used as the framework's evaluation function. Each worker trains a local ML model using the data of one province and then sends the model's gradient of loss to the server. The server utilises a Federated Averaging (FedAvg) strategy to generate a new set of model parameters. The new parameters are sent to the worker nodes to update each local model. This process is repeated iteratively, without data sharing, until the model converges and the application's requirements are met.
According to Fig. 2b, the FL framework evaluates the trained models using the local test dataset (20%) of each province. Each worker node measures the Mean Squared Error (MSE) of its ML predictions and sends the value to the server node for aggregation (i.e., averaging).
Figure 3a shows the decentralised learning framework, which is implemented on Apache Spark (Apache 2021) using Spark RDDs. The dataset is partitioned into nine RDDs, each of which is assigned to a worker thread for processing. MapReduce (Dan et al. 2006) is used to combine the RDD results and form the final output.
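The horizontal partitioning described above can be sketched as follows: records are grouped by province so that every worker holds the same feature space but different samples, and each partition is then split 80/20 into local training and test sets. This is a simplified illustration in plain Python. The "province" field, the toy records and the helper names are hypothetical, and the split helper is a stand-in rather than the authors' implementation.

```python
import random
from collections import defaultdict

PROVINCES = [
    "Hubei", "Fujian", "Sichuan", "Shandong", "Beijing",
    "Zhejiang", "Jiangsu", "Guangdong", "Shanghai",
]


def partition_by_province(records):
    """Horizontal partitioning: same features, samples grouped by region."""
    parts = defaultdict(list)
    for record in records:
        parts[record["province"]].append(record)
    return dict(parts)


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and return (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy data: ten samples per province with a single feature.
records = [
    {"province": p, "price": float(i)} for p in PROVINCES for i in range(10)
]

# One (train, test) pair per worker node, keyed by province.
workers = {
    province: train_test_split(part)
    for province, part in partition_by_province(records).items()
}

for province, (train, test) in workers.items():
    print(province, len(train), len(test))
```

Each of the nine workers receives only the samples of its own province, so no raw records need to cross partition boundaries during training or evaluation.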
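The server-side aggregation step can be illustrated with a minimal sketch of Federated Averaging, in which the new global parameters are the average of the workers' parameters weighted by each worker's number of training samples. The sketch assumes each local model is represented as a flat list of floats and uses made-up values; it does not use the Flower API and is not the authors' implementation.

```python
def fed_avg(client_params, client_sizes):
    """Sample-weighted average of client parameter vectors (FedAvg).

    client_params: list of equal-length lists of floats, one per worker.
    client_sizes:  number of local training samples held by each worker.
    """
    if len(client_params) != len(client_sizes) or not client_params:
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(params[j] * n for params, n in zip(client_params, client_sizes)) / total
        for j in range(n_params)
    ]


# Three hypothetical workers, each holding [slope, intercept] of a local
# linear model; the values and sample counts are illustrative only.
local_params = [[1.0, 0.0], [2.0, 1.0], [4.0, 2.0]]
local_sizes = [100, 100, 200]

global_params = fed_avg(local_params, local_sizes)
print(global_params)  # [2.75, 1.25]
```

In a full training loop, the server would send global_params back to every worker, each worker would continue training from them on its local data, and the aggregation would be repeated until convergence.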
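The evaluation step of the FL framework, in which each worker reports its local MSE and the server averages the reported values, can be sketched as below. The province names are taken from the dataset description, but the MSE values and test-set sizes are made up for illustration and are not results from this study. A sample-weighted variant is included as an assumption, since a plain mean and a weighted mean differ when the local test sets have unequal sizes.

```python
def aggregate_mse(worker_mse):
    """Plain (unweighted) mean of the MSE values reported by the workers."""
    return sum(worker_mse.values()) / len(worker_mse)


def aggregate_mse_weighted(worker_mse, worker_n):
    """Mean of worker MSEs weighted by each worker's test-set size.

    This equals the MSE computed over the pooled test samples.
    """
    total = sum(worker_n[w] for w in worker_mse)
    return sum(worker_mse[w] * worker_n[w] for w in worker_mse) / total


# Hypothetical values reported by three workers (not results from the paper).
reported_mse = {"Hubei": 0.02, "Fujian": 0.04, "Beijing": 0.03}
test_sizes = {"Hubei": 100, "Fujian": 300, "Beijing": 100}

plain = aggregate_mse(reported_mse)
weighted = aggregate_mse_weighted(reported_mse, test_sizes)
print(round(plain, 4), round(weighted, 4))  # 0.03 0.034
```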
As Fig. 3b depicts, centralised learning uses the whole dataset to train the ML models without parallel processing. Spyder IDE (Gerlach 2022) is used to implement centralised learning. It is a Python development environment that is widely used to build data analysis and ML applications.

Experimental result
This section evaluates and compares the performance of the federated, centralised, and decentralised learning frameworks. Each framework builds three ML models: LR, RF, and SVM. They are tested and evaluated in terms of MSE and model training time to study their similarities, differences, and relative strengths. Table 1 summarises the setup parameters of the machine learning models. MSE is measured via Eq. 1, where $P_i$ and $\hat{P}_i$ are the true price and the predicted price of sample $i$ respectively, and $n$ refers to the number of samples in the test dataset.

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(P_i - \hat{P}_i\right)^2 \tag{1}$$

Training time is measured to study the latency of ML model training. It increases with the model complexity, the dataset size, and the performance of the processing framework. Training delay reduction offers real-time prediction/classification benefits. Table 2 shows the MSE results of the LR model running on the ML frameworks. According to the results, centralised and distributed learning outperform FL if the LR model is used, while FL outperforms centralised and distributed learning when an RF model is used.
Using a grid search approach, the SVM model gives the best results with an RBF kernel and with the degree and C parameters both set to 1. According to the results, FL with SVM outperforms centralised and distributed learning. However, the MSE of SVM is significantly higher than that of LR and RF on all three frameworks. Table 3 shows the model training delay for the three ML models, each of which is trained via the three ML frameworks. According to the results, SVM is the slowest ML model due to the model convergence delay, while LR is the fastest one. Moreover, FL and decentralised learning reduce the ML training delay as compared to a centralised learning framework. This is because both FL and decentralised frameworks use parallel data processing, which results in ML training delay reduction. However, FL with SVM increases ML training time as compared to centralised and distributed learning. This is because of the model convergence delay, parameter distribution, and iteration frequency in SVM.
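The two evaluation metrics used in this section can be sketched in a few lines of Python: the MSE of Eq. 1 and a wall-clock measurement of training time. The helper names, the toy prices and the stand-in training function are assumptions for illustration. The actual experiments train LR, RF and SVM models, which are not reproduced here.

```python
import time


def mean_squared_error(actual, predicted):
    """MSE as in Eq. 1: the mean of squared differences over n test samples."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - q) ** 2 for p, q in zip(actual, predicted)) / len(actual)


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fit_mean_model(prices):
    """Hypothetical stand-in for model training: predicts the mean price."""
    return sum(prices) / len(prices)


# Made-up true and predicted prices for three test samples.
true_prices = [10.0, 11.0, 12.5]
predicted_prices = [10.5, 11.0, 12.0]

mse = mean_squared_error(true_prices, predicted_prices)
model, training_time = timed(fit_mean_model, true_prices)
print(round(mse, 4), training_time >= 0.0)
```

time.perf_counter is used because it is a monotonic, high-resolution clock, which makes it suitable for measuring short training runs.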

Discussion
The FL framework works better than centralised and distributed learning to forecast stock market trends in the following circumstances:
• FL gives better prediction results when RF and SVM models are used, as the MSE of FL is lower than that of centralised and decentralised learning.
• FL reduces ML training time as compared to the benchmarks if LR and RF models are used. However, decentralised learning gives a better ML training delay for SVM.
• FL shares only model parameters and therefore supports data privacy better than centralised and decentralised learning.
LR underperforms RF in the decentralised learning framework. This is because of the increased error rates caused by distributing the data (one province per partition) across the worker nodes. However, LR outperforms RF if centralised learning is used. This is because the centralised framework exploits the linear relationships between the selected features all at once when training the LR model. RF gives a better result than LR if FL is used. It is built from collaborative decision trees, each of which is established on a worker node to process the data features of one province. RF utilises the trained decision trees to predict the market trends, which results in MSE reduction.
SVM has the worst performance of the three models, with a much higher MSE than LR and RF. This is because the large and complex dataset leads to SVM model convergence failure. As the dataset is huge, the convex optimisation approach is unable to support the model convergence (Fine and Scheinberg 2002).
FL has the capacity to minimise data sharing and leakage. Using FL, each worker node locally uses the stock market data of one province to build a local model. The worker nodes share only the model parameters (i.e., the gradient of the loss) to train the master model, without sharing stock data. This is beneficial for predictive analysis applications, as data owners are often unwilling to share their market data.

Conclusion
This research proposes an FL framework to predict stock market trends. The proposal is established via nine worker nodes, each of which is fed by 6-year stock data (2014-2020) of a Chinese province. A data pre-processing approach is used to clean and prepare the dataset and three ML techniques including LR, RF, and SVM are used to forecast the stock market trends.
An extensive experimental plan is conducted to evaluate and compare the performance of the ML techniques and frameworks for stock market trend prediction. According to the results, FL gives the best performance as compared to centralised and decentralised learning if RF or SVM models are used. However, it underperforms centralised and distributed learning if an LR model is built. As the results show, SVM underperforms LR and RF due to the lack of model convergence during the model training process.
The performance of FL still needs to be evaluated and analysed with true parallelism on multicomputer platforms. This paper utilises a hyper-threading approach to build the FL framework. However, it is unable to simultaneously run all the available threads due to the restriction of the computing platforms. Service Oriented Architecture (SOA) can be used to establish a distributed computing environment to train the FL model on multiple computing workstations instead of threads.

Conflict of interest
The authors declare that they have no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.