1 Introduction

Data that changes over time is called time series data. Time series data analysis is critical in time-dependent application domains, such as health (Chialvo, 1987; Goldberger et al., 1988), finance (Hsieh, 1991; Peters, 1991), meteorology (Celik et al., 2014; Fraedrich, 1986; Nicolis & Nicolis, 1984; Özdoğan-Sarıkoç et al., 2023; Shekhar et al., 2008), and industry (Huang et al., 2019; Mehdiyev et al., 2017). Financial markets are one of the most challenging application areas to model and predict (Teixeira & De Oliveira, 2010). The complex nature of this field has led people to develop mathematical and statistical models that make the underlying structure clear, thereby making trade more efficient (Fayyad et al., 1996; Gandhmal & Kumar, 2019). Therefore, time series analysis and forecasting methods are frequently used in finance and economics (Abu-Mostafa & Atiya, 1996; Atsalakis & Valavanis, 2009; Kauffman et al., 2015; Kim, 2003; Pai & Lin, 2005; Rounaghi & Zadeh, 2016; Tay & Cao, 2001; Zhang, 2003). Recently, there has been remarkable progress in the development of deep learning models, a sub-branch of machine learning, which have been shown to achieve successful results in various studies (Abu-Mostafa & Atiya, 1996; Nosratabadi et al., 2020; Ozbayoglu et al., 2020; Sezer et al., 2020).

The primary purpose of analyzing financial data is to predict the effects and future directions of stock market characteristics for decision mechanisms based on market behavior (Cavalcante et al., 2016). In particular, financial time series forecasting constitutes the core of future decisions and transactions in the financial asset market. These forecasts support investors' decisions and reduce potential risks. Investors always try to determine and predict the probable value of a stock or asset before deciding on their transactions (Gandhmal & Kumar, 2019; Teixeira & De Oliveira, 2010). Therefore, when studies on financial data are analyzed, the stock market stands out as the most researched area (Nosratabadi et al., 2020). Within stock market research, the primary target is stock price or trend prediction (Cavalcante et al., 2016).

However, the large, complex, and variable structure of financial data makes it challenging to analyze and predict such data. In reviews of financial markets, studies are generally classified according to the model used and the types of inputs (Bustos & Pomares-Quimbaya, 2020). The prediction models that can be used are divided into traditional prediction models and machine learning prediction models (Cavalcante et al., 2016; Liu et al., 2021; Nosratabadi et al., 2020). Well-known traditional prediction models include the autoregressive moving average (ARMA) and the autoregressive conditional heteroskedasticity (ARCH) models, while machine learning prediction models include the support vector machine (SVM), naive Bayes (NB), and artificial neural network (ANN). However, some authors state that traditional forecasting models are not as efficient as models based on artificial intelligence because they treat financial time series as linear systems (Atsalakis & Valavanis, 2009; Cavalcante et al., 2016; Li & Bastos, 2020). Studies in the literature are classified not only by the mentioned model types but also by input types. Figure 1 presents the input types used in prediction models. The input type chosen for the forecasting model affects the forecasting model’s performance. For this reason, a dataset previously used in the literature was preferred in our study (Sethia & Raut, 2019; Thakkar & Chaudhari, 2021). At the same time, the authors emphasize that the dataset used is balanced, i.e., it includes both bullish (bull market) and bearish (bear market) movements (Thakkar & Chaudhari, 2021). The dataset used is of the structured input type, consisting of market information and technical indicators (Bustos & Pomares-Quimbaya, 2020). Our reason for choosing a dataset with this input type is that the technical analysis approach is widely preferred in the literature (Atsalakis & Valavanis, 2009; Berradi & Lazaar, 2019; Gao & Chai, 2018; Gao et al., 2021; Kakade et al., 2023; Kwon & Moon, 2007; Li & Bastos, 2020; Sethia & Raut, 2019; Teixeira & De Oliveira, 2010; Thakkar & Chaudhari, 2021; Wei & Ouyang, 2024; Wen et al., 2020; Zheng & He, 2021). For this reason, machine learning techniques and deep learning models are widely recommended in the literature for analyzing such data. In addition, in recent years, hybrid models that combine machine learning algorithms and deep learning models with different approaches have been proposed to improve prediction performance, since single models cannot produce good results in every situation (Nosratabadi et al., 2020; Ozbayoglu et al., 2020; Thakkar & Chaudhari, 2021).

Fig. 1 Classification of existing studies according to input types

Our motivations for preparing this research were as follows: (1) During our literature review, we observed that most deep learning-based asset price prediction models use different datasets and performance evaluation metrics. This situation limits the measurability of the effectiveness of new prediction models and can be considered a gap in the literature (Atsalakis & Valavanis, 2009; Bustos & Pomares-Quimbaya, 2020; Thakkar & Chaudhari, 2021). (2) Additionally, we noticed that most researchers in the existing literature focus on accuracy and low error rates in stock price prediction. However, the main goal of financial market participants is to reduce potential risks and achieve high returns amid market volatility (Deng et al., 2023, 2024; Sethia & Raut, 2019; Zhang et al., 2020). Our research not only focuses on prediction accuracy but also calculates the return rate by establishing a simple trading strategy. Thus, the performance of the proposed method is presented to researchers as a simple decision support system. (3) Lastly, it is well known that prediction models encounter challenges of dimensionality and overfitting due to large datasets and noisy data. In the literature, research aiming to address these structural issues and enhance the performance of prediction models is increasingly prevalent (Berradi & Lazaar, 2019; Chen et al., 2022; Guo et al., 2022; He & Dai, 2022; Huang et al., 2022; Jianwei et al., 2019; Kakade et al., 2023; Li et al., 2022; Ma et al., 2019; Sethia & Raut, 2019; Srijiranon et al., 2022; Wang et al., 2023; Wei & Ouyang, 2024; Zheng & He, 2021). To tackle these problems and improve the performance of prediction models, we aim to leverage the advantages of hybrid models by combining different methods and models. For all these reasons, we chose a dataset and model used in the literature, in addition to making efforts to keep the evaluation criteria of our study as broad as possible. Thus, we aimed to make our study comparable with other studies (Sethia & Raut, 2019; Thakkar & Chaudhari, 2021).

The novelty and contributions of the research to the literature are as follows: (1) We propose a hybrid PCA-ICA-LSTM model for predicting asset prices. While studies exist in the literature that combine PCA or ICA with different methods, to our knowledge, ours is the first study to combine these two statistical methods into a two-stage preprocessing and integrate them with a recurrent deep learning network to create a hybrid model. The proposed model introduces a new framework that combines the PCA and ICA statistical methods to provide input to an LSTM deep learning network used for prediction. We combine PCA for dimensionality reduction and noise removal with ICA for feature extraction from the processed data, leveraging the advantages of both methods to significantly enhance prediction performance. We support this claim with various experiments. (2) Many studies in the literature use different datasets, time scales, and evaluation metrics, making it challenging to compare studies fairly. Therefore, we initially divide our experiments into two stages. In the first stage, we use a well-established dataset and time scale from the literature (Sethia & Raut, 2019; Thakkar & Chaudhari, 2021). Additionally, we prefer commonly used evaluation metrics for a fair comparison. Thus, our goal is a study that is directly comparable with the literature in terms of dataset, time scale, and evaluation metrics. (3) The second phase of experiments expands the utilized dataset to include the turbulent period experienced in financial markets during the COVID-19 pandemic, subsequently re-evaluating the model's effectiveness. Furthermore, two additional case studies are incorporated into our work to establish a new benchmark for validating the model's efficacy. While conducting these experiments, we emphasize that the primary objective of predicting a financial asset is to achieve high returns and mitigate risks. Therefore, we go beyond focusing solely on high prediction accuracy and low error rates by incorporating the return rate metric into our study, aiming to address this gap in the existing literature. (4) Our experiments yielded promising results when compared to existing models that do not utilize dimensionality reduction for predicting asset prices. Our findings suggest that our model has the potential to offer researchers higher accuracy and lower error rates when working with high-dimensional datasets, while also achieving competitive return rates when compared to state-of-the-art approaches.

The outline of this article is as follows: Sect. 2 reviews studies on various methods that use technical indicators and dimensionality reduction techniques to analyze financial data. Section 3 presents the study methodology, including information about the operation of the study, baseline settings, the dataset used, the cross-validation method, the machine learning models used in the study, the statistical methods, and the proposed hybrid PCA-ICA-LSTM model. The experimental results of the proposed PCA-ICA-LSTM approach are analyzed in four parts in Sect. 4. First, the proposed model is compared with models from the same family using single-stage statistical methods. The second part presents comparisons with state-of-the-art models in the literature. The third part compares the results of the proposed model with similar studies in the literature. Finally, in the last part, we repeat our experiments by expanding our dataset from 2000–2017 to 2000–2024 to include the COVID-19 pandemic. We also include two additional case studies in our research as a benchmark. The conclusions are summarized in Sect. 5.

2 Research Review

Studies on financial markets have been analyzed according to various classification schemes. We believe that the two most important of these are classification according to the type of dataset input and classification according to the type of forecasting model. We organize our research review on this basis and try to justify our choices based on these two classifications.

2.1 Dataset Input Types

In their study, Bustos et al. grouped the input types under two main headings: structured and unstructured data (Bustos & Pomares-Quimbaya, 2020). The authors classified structured inputs as (1) market information, (2) technical indicators, and (3) economic indicators. Unstructured inputs are categorized as (1) news, (2) social networks, and (3) blogs. The classification of studies according to input types is shown in Fig. 1 (Bustos & Pomares-Quimbaya, 2020).

In the literature, studies on structured inputs are in the majority, and two approaches to structured inputs come to the fore (Bustos & Pomares-Quimbaya, 2020; Cavalcante et al., 2016). These approaches are called technical analysis and fundamental analysis. The technical analysis approach uses stock prices and indicators derived from this price information (Atsalakis & Valavanis, 2009). Researchers who adopt this approach argue that the effect of fundamental analysis indicators and news is already reflected in the price of financial assets (Bustos & Pomares-Quimbaya, 2020). According to this approach, it is sufficient to analyze price movements when forecasting asset prices. On the other hand, the fundamental analysis approach uses macroeconomic and financial situation information and consists of time series information that tries to explain the reasons for price movements (Bustos & Pomares-Quimbaya, 2020; Cavalcante et al., 2016). Fundamental analysis information is challenging to obtain and may require expertise to interpret. For this reason, it is not as widely used as the technical analysis approach. In addition to these approaches, studies utilizing data from social media as input and performing price prediction using sentiment analysis have also emerged in recent years (Deng et al., 2023, 2024; Srijiranon et al., 2022).

2.2 Prediction Model Types

In the literature, apart from the input type, another approach to classifying time series forecasting studies is based on the forecasting model itself (Atsalakis & Valavanis, 2009; Bustos & Pomares-Quimbaya, 2020; Cavalcante et al., 2016; Liu et al., 2021; Nosratabadi et al., 2020). According to the forecasting model used, two categories come to the fore: traditional forecasting models and machine learning forecasting models.

Traditional forecasting methods are based on mathematical and statistical foundations. It is possible to examine traditional forecasting models in linear and nonlinear classes. Famous traditional linear forecasting models include the autoregressive (AR) model, the moving average (MA) model, the autoregressive moving average (ARMA) model, and the autoregressive integrated moving average (ARIMA) model (Liu et al., 2021). Well-known traditional nonlinear forecasting models include the threshold autoregressive (TAR) model, the autoregressive conditional heteroskedasticity (ARCH) model, and the constant conditional correlation (CCC) model (Cavalcante et al., 2016; Liu et al., 2021). Traditional forecasting methods assume that the studied time series is produced by a linear process, and they usually try to model the underlying process according to this assumption. However, financial time series are complex, noisy, and uncertain. They therefore exhibit nonlinear behavior, which makes such data difficult to predict. This particular nature of financial time series means that traditional statistical methods cannot be applied effectively in the financial context (Cavalcante et al., 2016). For all these reasons, traditional forecasting models are not as reliable as necessary for predicting the price of a financial asset (Nosratabadi et al., 2020).

The other class of prediction models according to this classification is the machine learning prediction model. Machine learning models provide the ability to learn from data and offer in-depth insight into problems (Nosratabadi et al., 2020). Some authors state that traditional forecasting models are not efficient because they treat financial time series as linear systems, and that they therefore achieve lower performance than models based on artificial intelligence (Atsalakis & Valavanis, 2009; Cavalcante et al., 2016; Li & Bastos, 2020). The unpredictable, dynamic nature of financial markets and the above-mentioned advantages of machine learning models have encouraged many researchers to work in this direction (Berradi & Lazaar, 2019; Chowdhury et al., 2018; Gao et al., 2021; Gudelek et al., 2017; Huang et al., 2019; Jianwei et al., 2019; Kao et al., 2013; Kauffman et al., 2015; Kim, 2003; Kwon & Moon, 2007; Long et al., 2019; Pai & Lin, 2005; Sarıkoç & Çelik, 2022; Sethia & Raut, 2019; Tay & Cao, 2001; Teixeira & De Oliveira, 2010; Thakkar & Chaudhari, 2021; Wen et al., 2020; Zhang, 2003). Deep learning models, a sub-branch of machine learning methods, have attracted attention in recent years with their successful results (LeCun et al., 2015; Schmidhuber, 2015). The advantage of deep learning models compared to other machine learning models is that they can effectively identify highly qualified features and outputs from a wide range of inputs (Nosratabadi et al., 2020). For this reason, studies that treat deep learning models separately from other machine learning models are frequently seen in the literature (Bustos & Pomares-Quimbaya, 2020; Cavalcante et al., 2016; Li & Bastos, 2020; Nosratabadi et al., 2020; Ozbayoglu et al., 2020; Sezer et al., 2020; Thakkar & Chaudhari, 2021).

2.3 Related Work

Researchers studying financial markets have generally adopted a technical analysis approach and used market or technical analysis datasets (Berradi & Lazaar, 2019; Gao & Chai, 2018; Gao et al., 2021; Kakade et al., 2023; Kwon & Moon, 2007; Sethia & Raut, 2019; Thakkar & Chaudhari, 2021; Wei & Ouyang, 2024; Wen et al., 2020; Zheng & He, 2021). Due to its widespread use, ease of calculation, and ability to capture changes in price movements, we adopted the technical analysis approach in our study. In addition, as a prediction model, we focus on deep learning models, which have attracted attention with their successful results in recent years and have become popular compared to other machine learning methods. Accordingly, some of the studies that use technical analysis information and machine learning models on their datasets are summarized in Table 1.

Table 1 Summary of related works that include technical analysis information on the dataset and use machine learning models

Kwon and Moon used a genetic algorithm to optimize a recurrent neural network-based prediction model on a dataset consisting of a set of technical indicators. This study is one of the first to use technical indicators to forecast financial time series (Kwon & Moon, 2007). Lu et al. used the SVR model to forecast financial time series (Lu et al., 2009). The authors developed a forecasting model, which they call ICA-SVR, that uses the ICA method to remove noise from the data. Nikkei 225 Index and TAIEX Index data were used to evaluate the performance of the proposed model. The ICA-SVR model outperformed the SVR-only forecasting model and the random walk model. In another study, Liu and Wang built a prediction model by integrating dimension reduction methods into a back-propagation neural network (Liu & Wang, 2011). The PCA and ICA methods were preferred as dimensionality reduction methods. The prediction models were trained and assessed on two datasets of Shanghai Composite Index (SHCI) data. It was reported that the ICA-BPNN model proposed by the authors outperforms the PCA-BPNN and BPNN models. Kao et al. suggested connecting nonlinear independent component analysis to support vector regression (SVR) to examine the effect of feature extraction methods in predicting stock prices and obtained successful results in their experimental studies (Kao et al., 2013). Gao et al. used a dataset with a very similar scale to the dataset in our study (Gao et al., 2017). Accordingly, they tried to predict the next day's movement of the S&P 500 index using the LSTM deep learning network. Compared with other systems (i.e., moving average (MA), exponential moving average (EMA), and support vector machine (SVM)), the proposed model yielded higher prediction accuracy for the next day's closing price. In another study, Chowdhury et al. proposed a new model that combines PCA and ICA, which are dimension reduction and feature extraction mechanisms, with SVR for stock price prediction and applied it successfully (Chowdhury et al., 2018). Gao et al. conducted case studies on the Standard & Poor's 500, NASDAQ, and Apple (AAPL), utilizing dimensionality reduction techniques and the LSTM model for stock price prediction. The authors recommend principal component analysis (PCA) as the dimensionality reduction method for removing unnecessary information from the technical indicators used in the dataset and extracting highly correlated features (Gao & Chai, 2018). Long et al. stated in their studies that deep learning methods may be more suitable for asset price prediction models, since statistical methods depend on initial assumptions and machine learning techniques have performance and overfitting problems due to manual feature selection (Long et al., 2019). Berradi and Lazaar used a recurrent neural network (RNN) deep learning model for stock price prediction. They proposed the PCA-RNN model, which applies the PCA technique to reduce the dimension of a dataset consisting of 90 days of historical data and technical indicators of a stock, and emphasized that the proposed model achieves better prediction accuracy than the RNN model (Berradi & Lazaar, 2019). Jianwei et al. proposed a new model (ICA-GRU) combining the ICA approach and a gated recurrent unit (GRU) deep learning network to predict gold prices.
It has been reported that the proposed model outperforms the autoregressive integrated moving average (ARIMA), radial basis function neural network (RBFNN), LSTM, and ICA-LSTM models (Jianwei et al., 2019). Sethia and Raut (2019) proposed a model that predicts prices after five days by establishing a simple trading strategy on a dataset consisting of 18 years of historical data and technical indicators of the S&P 500 index. Within the scope of the study, deep learning models such as LSTM and GRU were compared with models such as SVM and artificial neural networks (ANNs) using the ICA technique in data preprocessing steps, and it was emphasized that the performance of the LSTM deep learning price prediction model was superior to that of other models (Sethia & Raut, 2019). Ma et al. proposed a model based on an LSTM deep learning network and preprocessing with PCA to forecast the closing price of the Shanghai Composite Index. The experimental results highlight the success of the PCA method in removing noise from the data and improving the prediction accuracy (Ma et al., 2019). In their experimental study to develop a price prediction model, Wen et al. used the PCA approach with the LSTM deep learning network to reduce dependencies and reduce the data dimension in a dataset consisting of two years of financial data and several technical indicators of a stock (Wen et al., 2020). The PCA-LSTM model can predict asset prices more successfully than traditional price-prediction models. By adopting the concept of decomposition-reconstruction-synthesis to forecast complex financial time series, Zhang and colleagues proposed a new forecasting model based on deep learning called CEEMD-PCA-LSTM. Initially, the model decomposes the time series into intrinsic mode functions (IMFs) using the complementary ensemble empirical mode decomposition (CEEMD) method to identify trends. Then, principal component analysis (PCA) is applied for dimensionality reduction to extract high-level features from the data. The new features feed the long short-term memory (LSTM) network to predict the closing price of the next trading day. The authors emphasize the high prediction accuracy and directional symmetry of the proposed model while also performing trading simulations to evaluate its profitability performance (Zhang et al., 2020). In their study focusing on the stock prices of two aviation companies, Zheng and He proposed the PCA + RNN model. This study aims to demonstrate the impact of technical indicators and the PCA method on the prediction performance of RNNs in forecasting stock prices (Zheng & He, 2021). In another study, Gao et al. studied a dataset consisting of financial data, technical indicators, and investor sentiment indicators to improve stock forecasting. Accordingly, the researchers used approaches such as the least absolute shrinkage and selection operator (LASSO) and principal component analysis (PCA) in dimension reduction processes and examined the effects of these approaches on the performance of long short-term memory (LSTM) and gated recurrent unit (GRU) deep learning prediction models (Gao et al., 2021). Building on the study by Sethia and Raut (2019), Thakkar and Chaudhari (2021) conducted a comparative analysis of deep neural networks for stock price trend prediction. They compared deep learning models by applying the same parameters and dataset across the models examined in their study (Thakkar & Chaudhari, 2021). Huang et al.
proposed a novel scaled PCA (sPCA) method that assigns higher weights to components with strong predictive power. The authors aim to address the weakness of the PCA method of not considering target information. Experiments conducted on 123 macroeconomic variables from the FRED-MD database indicate that the sPCA method outperforms PCA (Huang et al., 2022). Guo et al. introduced the scaled PCA method for forecasting oil volatility in their study. This study compares the introduced s-PCA method with two other dimensionality reduction methods, PCA and PLS. Additionally, a series of experiments was conducted on hybrid models created by integrating these methods with AR models in various variants. The proposed AR-sPCA model demonstrated robust performance in robustness tests, such as different window and lag selections (Guo et al., 2022). In another study demonstrating the effectiveness of hybrid models, Srijiranon et al. proposed the PCA-EMD-LSTM model to forecast the closing price of the Thai stock market. The model performs feature engineering using PCA and empirical mode decomposition (EMD) methods while utilizing LSTM for prediction tasks. The authors emphasized that the application of PCA to the EMD-LSTM model reduces prediction errors. Additionally, the model incorporates sentiment analysis of economic news to enhance performance based on news sensitivity (Srijiranon et al., 2022). In their study, Chen and Hu conducted a series of experiments on various models based on ANN and LSTM for predicting volatility in stock index futures trading in China and the United States, utilizing feature extraction methods such as autoencoders (AE) and PCA. The findings suggest that PCA outperforms AE in terms of prediction efficacy. Among the compared models, the LSTM (PCA) model achieved the most successful results (Chen & Hu, 2022). He and Dai conducted experiments on models combining ICA and LSTM to predict the prices of 5 stocks selected from the CSI 300 index. While the ICA method was used to eliminate noise, the predictive effect of the LSTM model was examined. The experiments confirm that the proposed ICA-Multi-LSTM model outperforms non-ICA-LSTM models in most cases, particularly from the perspective of individual stocks (He & Dai, 2022). Chen et al. proposed a new model that integrates various methods, such as CEEMD, sample entropy (SE), ICA, particle swarm optimization (PSO), and LSTM, to predict stock prices. The experiments utilize data from the Shanghai Stock Exchange (SSE) involving 4 selected stocks. The ICA technique is responsible for extracting the primary features of stock price data by separating the IMFs created with CEEMD. The proposed model achieves successful results compared to seven other models that do not include ICA (Chen et al., 2022). Li et al. proposed an optimized PCA-LSTM hybrid model for price prediction tasks. The experimental results emphasize the use of PCA to reduce noise in the dataset, improve sample quality, and eliminate input set correlations. The authors concluded that the PCA-LSTM model outperforms the original LSTM model in the prediction task (Li et al., 2022). Mendoza et al. aimed to enhance the prediction performance in the time series of the S&P 500, DAX, AEX, and SMI indices by leveraging fractal and self-similarity behaviors using simple recurrent neural networks, multilayer perceptrons, and long short-term memory architectures.
The authors noted that their proposed self-similarity approach improved the predictive capacity of deep neural network models, resulting in significant improvements of 23%, 11.26%, 21%, and 12% for the S&P 500, DAX, AEX, and SMI indices, respectively (Mendoza et al., 2023). Li and colleagues tested the performance of 12 models based on LSTM and GRU to predict volatility in energy sector indices. The authors employed PCA and SPCA methods for feature extraction, with the SPCA-MLSTM model providing the best predictions among the compared models. Additionally, LSTM models generally outperformed GRU models in terms of prediction performance, and the SPCA method proved superior to PCA in terms of prediction efficacy (Li et al., 2023). Wang et al. proposed a PCA-IGRU-based model for predicting the Shanghai Composite Index (SCI) closing price. The PCA method is utilized to reduce the high dimensionality of the data while minimizing information loss. To prevent overfitting, an anti-overfitting conversion module (ACM) is incorporated into the GRU, resulting in an enhanced gated recurrent unit (IGRU). The experimental results demonstrate that the PCA-IGRU model outperforms seven other models in terms of prediction accuracy and shorter training time (Wang et al., 2023). Kakade et al. suggested that including explanatory fundamental and technical variables as inputs to prediction models helps enhance their predictive ability. They proposed a hybrid LSTM + PCA + ARIMA model for forecasting crude oil prices. The PCA method is utilized to reduce the input dimensionality of the LSTM network, mitigating the impact of multicollinearity. The proposed LSTM + PCA + ARIMA model outperforms the LSTM model across all dimensions, with an average improvement of 41% in prediction accuracy (Kakade et al., 2023). Wei and Ouyang employed a scaled principal component analysis (s-PCA) method on a multidimensional dataset to improve carbon price prediction accuracy. The authors utilize factors such as technical indicators, financial indicators, and commodity indicators to characterize carbon prices. This study employed the s-PCA method to reduce the dimensionality of these factors and integrated it with the linear regression method and the LSTM model. The proposed model can achieve higher average returns than other benchmark strategies in terms of market timing (Wei & Ouyang, 2024).

As mentioned, ICA, PCA, and similar techniques are used for dimension reduction and feature extraction in the data preprocessing stages of machine learning algorithms. Dimension reduction techniques simplify the dataset and can eliminate the dimensionality problem by selecting the most relevant attributes (Zhong & Enke, 2017). At the same time, they provide an additional contribution by increasing the prediction accuracy of the machine learning prediction models with which they are used (Bustos & Pomares-Quimbaya, 2020; Singh & Srivastava, 2017). In this way, hybrid forecasting models with successful results have been developed (Cavalcante et al., 2016; Nosratabadi et al., 2020; Ozbayoglu et al., 2020; Thakkar & Chaudhari, 2021).

However, when the studies conducted to date are examined, many different approaches and datasets have been used for these models in financial time series forecasting. Model performance depends on the characteristics of the data used. Accordingly, the selection of inappropriate model parameters, feature sets, and training-test intervals reduces the comparability of the studies. For these reasons, our proposed PCA-ICA-LSTM model for financial time series forecasting is prepared by considering the work of Thakkar and Chaudhari (2021), who performed a comparative experimental study of deep neural networks.

3 Methodology

The first goal of our work is to predict the price of a financial asset using a deep learning network. Accordingly, this section consists of six subsections: the basic settings of the study, the dataset, the method/models, the functioning of the prediction model, our proposed model, and the evaluation criteria. The first subsection provides information about basic settings and operations. The second subsection provides detailed information about data collection and datasets. The following subsection explains the statistical methods and deep learning models used. The fourth subsection provides information about the operation of the prediction model. The fifth subsection introduces the proposed model. The last subsection includes the evaluation metrics used to compare the performance of the proposed model with other models.

3.1 Baseline Settings

In the previous sections, we emphasized that we selected the LSTM network used in our study from the literature. Optimizing the selected LSTM model to improve it may be desirable in future studies, but we have not made any changes. Thus, we maintain our aim to directly compare our proposed hybrid deep learning model with previous similar studies. Thakkar and Chaudhari (2021), who have a similar goal to ours, used the same deep learning model and hyperparameters as Sethia and Raut in their work to make the comparison fair (Sethia & Raut, 2019; Thakkar & Chaudhari, 2021). We apply the same baseline settings used by Thakkar and Chaudhari (2021) and Sethia and Raut (2019) in our study. Table 2 provides a summary of the baseline settings of our study.

Table 2 The values of the baseline settings used in this study (Thakkar & Chaudhari, 2021)

Accordingly, the prediction model architecture created in this study consists of five consecutive layers. Dropout layers are added between the layers to prevent overfitting. In the first stage, the model consists of two LSTM layers with 64 and 128 nodes, which handle long-term dependencies. The subsequent two layers are dense layers with 256 and 512 nodes; all nodes in these layers are connected to the nodes of the previous layer, forming a fully connected structure. The last layer of the model is the output layer, which produces the prediction value and consists of a single node. For the prediction model, the number of training epochs was set to 125, the batch size to 50, and the dropout rate to 0.3. The target attribute is the adjusted closing price, and a linear activation function and the Adam optimization algorithm are used. The mean squared error (MSE) loss function is used to update the parameters.
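As an illustration, a minimal Keras sketch of this baseline architecture is given below. The layer sizes, dropout rate, optimizer, loss, and training settings follow Table 2; the hidden-layer activations, dropout placement, and the input-shape placeholders (timesteps, n_features) are our own assumptions for the sake of a runnable example, not the study's exact implementation.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

def build_baseline_model(timesteps, n_features):
    """Five-layer baseline: LSTM(64) -> LSTM(128) -> Dense(256) -> Dense(512) -> Dense(1)."""
    model = Sequential([
        LSTM(64, return_sequences=True, input_shape=(timesteps, n_features)),
        Dropout(0.3),                       # dropout between layers against overfitting
        LSTM(128),
        Dropout(0.3),
        Dense(256, activation="relu"),      # hidden-layer activations are assumed
        Dropout(0.3),
        Dense(512, activation="relu"),
        Dropout(0.3),
        Dense(1, activation="linear"),      # single output node: adjusted closing price
    ])
    model.compile(optimizer="adam", loss="mse")  # Adam optimizer, MSE loss
    return model

# Training settings from the text: 125 epochs, batch size 50
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=125, batch_size=50)
```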

In addition, Thakkar and Chaudhari & Sethia and Raut used the holdout method to split the dataset in their studies (Sethia & Raut, 2019; Thakkar & Chaudhari, 2021). The holdout method is the simplest type of cross-validation. In its simplest form, the dataset is divided into two sets, the training set (optionally with a validation set) and the test set. The main advantage of this method is that it takes little time to compute. However, the evaluation can have high variance. We apply the same method in our study. Figure 2 illustrates the partitioning of the dataset into periods using the holdout method within the scope of the study. Accordingly, the whole dataset consists of 4425 records: 3053 records for the training set (~ 69%), 525 records for the validation set (~ 11%), and 847 records for the test set (~ 20%). The visualization of the data is presented in Fig. 3.

Fig. 2 Data splitting for training, validation, and test periods using the holdout method

Fig. 3 Visualization of the S&P500 index data used in this study
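As a concrete illustration of the split described above, the following sketch reproduces the chronological holdout partition; the file name is a placeholder and not part of the original study.

```python
import pandas as pd

# Chronological holdout split: 3053 / 525 / 847 records (~69% / ~11% / ~20%)
data = pd.read_csv("sp500_with_indicators.csv")   # placeholder file: 4425 daily records

train = data.iloc[:3053]
val   = data.iloc[3053:3578]
test  = data.iloc[3578:]

assert len(train) + len(val) + len(test) == len(data)
```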

In addition, there are statistical methods used in the preprocessing phase of the study. We include these methods in the preprocessing part because a new dataset is created after applying each statistical method. Therefore, after these statistical methods, the final dataset given as input to the prediction model is quite different from the initial dataset at the beginning of the study. The most critical parameter we use in the basic settings of the statistical methods is the number of components. We apply this adjustment through PCA. The number of components for PCA determines the principal components found in a dataset. By using this parameter, we both reduce the dimensionality of our dataset and control the number of ICA components used afterward. Sethia and Raut (2019), who first used the dataset we use in our study, applied the dimensionality reduction method for different numbers of attributes, such as 7, 12, 18, 25, 32, and 45. The authors reported reducing the number of features to 12 for the final dataset after the experiments (Sethia & Raut, 2019). Thakkar and Chaudhari followed the same path as Sethia and Raut (2019) in their study and performed similar tests (Thakkar & Chaudhari, 2021). Unlike Sethia and Raut (2019), the authors also conducted experiments for cases where the number of features is 5 and where it is 48 (i.e., without dimensionality reduction), but they compared the models based on a single metric. The evaluation metric used by the authors differs from that in Sethia and Raut's (2019) study, whereas Sethia and Raut (2019) conducted their experiments with more commonly used metrics. Therefore, we follow Sethia and Raut's (2019) study because one of the goals of our work is to be comparable with it. For this purpose, before the experiments, we checked whether 12 components were sufficient to represent our dataset. Accordingly, when we reduced the number of components to 12 by applying the PCA method to our dataset with 48 features, we found that the total variance explained was 95.539%. In other words, we can represent about 95% of the dataset with 12 principal components. Since this is a sufficient total variance value, we use this value for the number-of-components parameter. The effect of each component on the total variance is shown in Fig. 4.

Fig. 4 Effect of components on total variance after dimension-reduction
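The component-count check described above can be reproduced with a short sketch like the following, assuming the input is the normalized 48-feature matrix from Sect. 3.2; the function and variable names are ours.

```python
from sklearn.decomposition import PCA

def explained_variance_with(X, n_components=12):
    """Return the total variance explained by the first n_components principal components of X."""
    pca = PCA(n_components=n_components)
    pca.fit(X)
    return pca.explained_variance_ratio_.sum()

# Usage with the normalized 48-feature matrix X_scaled from Sect. 3.2
# (the result should be close to the 95.539% reported in the text):
# print(f"{explained_variance_with(X_scaled):.3%}")
```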

3.2 Dataset

This study aims to develop a price forecasting model for financial time series forecasting. Therefore, we conduct a study aiming to predict the price 5 days ahead on the Standard & Poor's 500 (S&P500) dataset, relying on the work of Thakkar and Chaudhari (Thakkar & Chaudhari, 2021). That study used the dataset prepared by Sethia and Raut (Sethia & Raut, 2019). Thakkar and Chaudhari state that the dataset used in their study is balanced for up- and downtrends (Thakkar & Chaudhari, 2021). The dataset covers 18 years, including daily data between 01.01.2000 and 23.10.2017, and is obtained from the Yahoo Finance website (SPY, 2024) (Fig. 3).

The dataset obtained from Yahoo Finance consists of the opening, closing, highest, lowest, and adjusted closing prices for each trading day in the S&P500 index, as well as volume information (SPY, 2024). An example set of the Yahoo Finance dataset is shown in Table 3. All of this information obtained on the Yahoo Finance web page regarding a financial asset is called market information. However, market information alone is often insufficient to determine the financial asset's future price trend. For this reason, researchers aim to increase forecast performance in many studies by adding technical indicators calculated using market information to datasets. Sethia and Raut conducted their studies on a dataset of 4425 records and 48 attributes, created using market information and technical indicators (Sethia & Raut, 2019). The list of attributes in the S&P500 dataset used in this study is presented in Table 4.

Table 3 An example of data obtained from Yahoo Finance
Table 4 List of attributes of the dataset

Each attribute in the dataset can be expressed in a different range. This can result in a feature set of extremely high or low values. In their study, Sethia and Raut first standardized the data attributes with Z-score standardization (Sethia & Raut, 2019), using the formula in Eq. (1) (Furey, 2023). In the equation, x̄ represents the sample mean, E[X] the population mean, σ(X) the population standard deviation, and n the sample size. Then, the data are normalized with the min–max scaling method so that each feature is scaled to the range [0, 1]. In our study, we apply the same procedure to our dataset. A sample representation of our dataset, formed as the result of these processes, is shown in Table 5.

$$ Z = \frac{\overline{X} - E[X]}{\sigma(X)/\sqrt{n}} $$
(1)
Table 5 An example representation of the dataset before dimension reduction in data preprocessing steps
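A minimal sketch of the two-step scaling is shown below. It uses the common column-wise z-score (without the √n factor in Eq. (1), which applies to sample means) followed by min–max scaling; the exact variant applied by the original authors may differ, and the function and variable names are ours.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def standardize_and_scale(X):
    """Step 1: column-wise z-score standardization; step 2: min-max scaling to [0, 1]."""
    X = np.asarray(X, dtype=float)
    z = (X - X.mean(axis=0)) / X.std(axis=0)
    return MinMaxScaler(feature_range=(0, 1)).fit_transform(z)

# X_scaled = standardize_and_scale(raw_features)   # raw_features: the 48-attribute table
```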

Because of the multidimensional nature of the dataset, the noise in it is first minimized using the dimension reduction method, and efficient features are then extracted. In the last step, the price is forecast using a deep learning model.

3.3 Methods and Models

3.3.1 Principal Component Analysis (PCA) for Dimension Reduction and Noise Removal

Principal Component Analysis (PCA) is a statistical technique introduced by Karl Pearson and is frequently used in areas such as image compression, face recognition, and classification (Pearson, 1901). Its purpose is to eliminate the low-efficiency features that arise from high input dimensionality when working with large datasets in experimental studies, to increase interpretability by reducing the data dimension, and to minimize information loss (Gao & Chai, 2018; Huang et al., 2022; Kakade et al., 2023; Li et al., 2022; Ma et al., 2019; Srijiranon et al., 2022; Wang et al., 2023; Wei & Ouyang, 2024; Zheng & He, 2021). The basic idea of this technique is to reduce the dimension of the dataset while preserving the diversity in the dataset. For this reason, PCA forms a new set of variables, called principal components, that captures the greatest possible variance in the dataset and is equal to or smaller in number than the original set of variables.

A financial time series is a particular multidimensional time series formed by the complex interactions of many factors. The dataset we use is a multidimensional dataset formed by adding more than 40 technical indicators to the S&P500 index information, which constitutes the financial time series. In our study, the PCA method allows us to eliminate the existing noise and low-efficiency features by reducing the dimension of this dataset (Bustos & Pomares-Quimbaya, 2020; Li et al., 2022; Ma et al., 2019; Singh & Srivastava, 2017; Zheng & He, 2021; Zhong & Enke, 2017). At the same time, it allows us to create a new dataset while minimizing the loss of information in the original dataset.

3.3.2 Independent Component Analysis (ICA) for Feature Extraction

Independent Component Analysis (ICA) is a linear feature extraction technique that produces new, statistically independent features by aiming to reduce first- and second-order dependencies in a dataset (Anowar et al., 2021). Applicable to many different datasets, ICA can generally analyze digital images, audio streams, radio signals, biofeedback (brain wave, etc.) information, and time series (Tharwat, 2021). ICA can separate and recover unknown source signals from a complex signal without sufficient prior knowledge of the source signals and mixing mechanisms, especially in signal processing problems such as blind source separation (Tharwat, 2021) and the cocktail party problem.

ICA's search for non-normally distributed features and for statistically independent features are the two most prominent characteristics distinguishing it from other feature extraction mechanisms. For example, PCA searches for aspects representative of the data, while ICA searches for aspects independent of each other (Anowar et al., 2021). Based on these features of the ICA method, Sethia and Raut state that the financial dataset they use in their studies consists of different individual components and exhibits a non-normal distribution (Sethia & Raut, 2019). ICA is a valuable method as a dimension-preserving transformation because it produces statistically independent components in pattern recognition (Chen et al., 2022; Draper et al., 2003; Jianwei et al., 2019; Kao et al., 2013; Liu & Wang, 2011; Lu et al., 2009; Sethia & Raut, 2019; Thakkar & Chaudhari, 2021). Studies in the literature emphasize that ICA is a better feature extraction method than other statistical methods (Chen et al., 2022; Draper et al., 2003; Kao et al., 2013; Kwak, 2008; Liu & Wang, 2011; Reza & Ma, 2016). In our study, the ICA method was used to extract features because, as stated in Sethia and Raut's study (Sethia & Raut, 2019), the dataset has suitable characteristics and ICA is a successful feature extraction technique.
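For illustration, one common way to perform this step is scikit-learn's FastICA, sketched below; the choice of FastICA, the iteration limit, and the function and variable names are our assumptions, since the study does not specify a particular ICA implementation.

```python
from sklearn.decomposition import FastICA

def extract_independent_components(X, n_components=12):
    """Decompose the input matrix into statistically independent components (ICs)."""
    ica = FastICA(n_components=n_components, random_state=0, max_iter=1000)
    return ica.fit_transform(X)

# ics = extract_independent_components(pcs)   # pcs: principal components from the PCA stage
```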

3.3.3 Deep Learning Using Long Short-Term Memory (LSTM) Network

Long Short-Term Memory (LSTM), which is essentially an improvement of the recurrent neural network (RNN), was introduced by Hochreiter and Schmidhuber (Hochreiter & Schmidhuber, 1997). What makes LSTM networks different is their ability to use historical information in time series efficiently. They can also solve the vanishing gradient problem that arises when processing long-term dependencies in RNNs. An LSTM network consists of repeating neural network cells, just like RNN networks, but unlike RNNs, it can remember and forget past information through mechanisms called gates in the memory cell (Hochreiter & Schmidhuber, 1997; Nosratabadi et al., 2020). In this way, desired historical information can be discarded or stored in the LSTM network. Figure 5 presents the internal structure of a memory cell belonging to the LSTM network (Ozkok & Celik, 2022).

Fig. 5 The internal structure of an LSTM memory cell

In an LSTM memory cell, three gate mechanisms (input, output, and forget) control the cell state. The functions and calculations of the gates are as follows:

  • The first structure of an LSTM cell is the forget gate, which determines the information to be discarded. In Eq. (2), \(\sigma \) represents the activation function, w the weights, and b the bias. The sigmoid function produces values in the range of 0 to 1: a value of 0 causes the previous cell's information to be forgotten, while a value of 1 allows it to be passed to the next cell completely. Thus, \({f}_{t}\) determines how much of the previous cell's information should be remembered.

    $${f}_{t}=\sigma ({w}_{f}\times \left[{h}_{t-1},{x}_{t}\right]+{b}_{f})$$
    (2)
  • The next structure in the LSTM cell is the input gate, responsible for determining the information to be updated. First, in Eq. (3), with the help of a sigmoid function, as in the forget gate, \({i}_{t}\) decides which values to update. Then, the vector \(\overline{{c }_{t}}\) of new candidate values that can be added to the cell state is obtained with the tanh function in Eq. (4).

    $${i}_{t}=\sigma ({w}_{i}\times \left[{h}_{t-1},{x}_{t}\right]+{b}_{i})$$
    (3)
    $$\overline{{c }_{t}}=tanh({w}_{c}\times \left[{h}_{t-1},{x}_{t}\right]+{b}_{c})$$
    (4)
  • The output gate controls the output information of the cell. The input and output gates often use tanh or logistic sigmoid functions to perform their tasks. In the first stage, the information from the forget and input gates is combined; thus, the old cell state \({c}_{t-1}\) is updated, and the new cell state \({c}_{t}\) is obtained with Eq. (5). \({o}_{t}\), which decides which parts of the cell state to output, is calculated with the help of the sigmoid function in Eq. (6). The new cell state (\({c}_{t}\)) is passed through the \(tanh\) function and multiplied by \({o}_{t}\) to produce the output decided in Eq. (7). The resulting output (\({h}_{t}\)) is based on the cell state but in filtered form.

    $${c}_{t}={f}_{t}*{c}_{t-1}+{i}_{t}*\overline{{c }_{t}}$$
    (5)
    $${o}_{t}=\sigma ({w}_{o}\times \left[{h}_{t-1},{x}_{t}\right]+{b}_{o})$$
    (6)
    $${h}_{t}={o}_{t}*{tanh(c}_{t})$$
    (7)

3.4 Prediction Model Steps

The models used in our study aim to predict the price of a financial asset. To this end, we conduct a series of experiments to evaluate the effects of dimensionality reduction and feature extraction mechanisms on prediction performance in the data preprocessing phase. In the first part of our experiments, we compare our proposed model with the plain LSTM, PCA-LSTM, and ICA-LSTM prediction models derived from the same deep learning network family. In this way, we aim to show the effectiveness of our proposed model, which uses a two-stage statistical method, against prediction models that do not use statistical methods or use a single statistical method. In the second part of our experiments, we evaluate the performance of our proposed model against different deep learning networks in the literature, namely RNN, LSTM, GRU, and CNN. In the last part of our experiments, we expand the time scale of our dataset and aim to show the effectiveness of our proposed model for two new additional cases. We conduct all these experiments by following the same procedure steps below. Figure 6 presents the pseudocode we used for our experiments.

Fig. 6 Pseudocode of the proposed PCA-ICA-LSTM prediction model

3.5 Proposed Hybrid PCA-ICA-LSTM Model

In this section, we introduce one of the main goals of our work, a hybrid deep learning model, which we call PCA-ICA-LSTM. The proposed model is built upon the LSTM neural network due to its capabilities among RNNs. Recurrent neural networks (RNNs), a type of deep learning model, have drawn attention for their success in analyzing and predicting sequential data such as time series. RNNs, which can remember past time series and make decisions accordingly, have some disadvantages when dealing with large datasets. Next-generation recurrent neural networks like LSTM and GRU have become popular for mitigating these drawbacks. One of the most well-known disadvantages is the vanishing gradient problem when dealing with long-term dependencies in RNNs. Their ability to handle this problem has enabled LSTM networks to be used in many fields, including finance. Sethia and Raut (2019) compared LSTM and GRU models and demonstrated that the LSTM model, an improved version of RNNs, achieved more successful results than GRU. Following this study, in our work, we opted for the classic LSTM deep learning network and used the same model architecture settings.

However, there are still challenges that the LSTM model has not yet overcome, such as the curse of dimensionality and issues like overfitting. To address these challenges, various methods and mechanisms are being developed. One of these methods is the hybrid model approach.

Recently, to enhance the prediction performance of machine learning models, hybrid models have been proposed by integrating both machine learning algorithms and deep learning models with different approaches, instead of relying solely on single models that may not yield satisfactory results in every scenario (Chen et al., 2022; He & Dai, 2022; Kakade et al., 2023; Li et al., 2022, 2023; Nosratabadi et al., 2020; Ozbayoglu et al., 2020; Srijiranon et al., 2022; Thakkar & Chaudhari, 2021; Wei & Ouyang, 2024). Although advanced recurrent neural network variants such as LSTM and GRU partially address specific issues such as exploding gradients in the analysis of time series data, these models still face unresolved challenges. While LSTMs, due to their advanced design, can regularly eliminate invalid information and preserve crucial information when ingesting time series data, experimental findings have indicated that direct use of raw data is not beneficial for training; hence, subjecting the data to a series of statistical preprocessing methods beforehand is advantageous (Berradi & Lazaar, 2019). The literature suggests that combining dimensionality reduction techniques with various machine learning methods positively impacts the performance of those methods (Chen et al., 2022; Gao & Chai, 2018; He & Dai, 2022; Jianwei et al., 2019; Kakade et al., 2023; Kao et al., 2013; Li et al., 2022, 2023; Liu & Wang, 2011; Lu et al., 2009; Ma et al., 2019; Srijiranon et al., 2022; Wei & Ouyang, 2024; Zhang et al., 2020; Zheng & He, 2021; Zhong & Enke, 2017).

Deep learning models capable of working with big data often utilize dimensionality reduction techniques to mitigate the dimensionality problem. Researchers continue to investigate a multitude of novel approaches to address the aforementioned structural limitations of advanced recurrent neural network variants such as LSTM and GRU. The novelty of our work is to address the structural problems faced by the LSTM deep learning network, such as overfitting and the curse of dimensionality when working with multidimensional data, by combining the network with dimensionality reduction methods through our PCA-ICA preprocessing mechanism. Leveraging the advantages of both statistical methods, our proposed hybrid PCA-ICA-LSTM approach aims to overcome these issues and improve prediction performance.

PCA and ICA methods have been widely used by the research community due to their ability to achieve effective results in many past and current studies. Their capacity to identify crucial components across diverse data types, including time series, speech and image data, and medical signals, among others, is particularly notable (Jianwei et al., 2019). While both methodologies are grounded in elementary statistical techniques, they employ distinct strategies for problem resolution. PCA generates a new feature set, equal to or fewer than the original number of variables, aimed at maximizing the variance within the dataset, known as principal components. Conversely, ICA is a linear feature extraction technique that endeavors to minimize first- and second-order dependencies within a dataset, thereby generating statistically independent new features. While PCA endeavors to identify representative data directions, ICA seeks out independent directions (Anowar et al., 2021). Upon scrutinizing studies focused on the financial domain, PCA typically finds favor in dimensionality reduction and feature selection tasks (Bustos & Pomares-Quimbaya, 2020; Gao & Chai, 2018; Huang et al., 2022; Kakade et al., 2023; Li et al., 2022; Ma et al., 2019; Singh & Srivastava, 2017; Srijiranon et al., 2022; Wang et al., 2023; Wei & Ouyang, 2024; Zheng & He, 2021; Zhong & Enke, 2017); whereas ICA is employed for feature selection and noise reduction mechanisms (Chen et al., 2022; Draper et al., 2003; Jianwei et al., 2019; Kao et al., 2013; Kwak, 2008; Liu & Wang, 2011; Lu et al., 2009; Reza & Ma, 2016; Sethia & Raut, 2019; Thakkar & Chaudhari, 2021).

The S&P 500 dataset we are working on has transformed into a multidimensional dataset as a result of adding more than 40 technical indicators, as presented in Table 4. The use of technical indicators in the dataset contributes significantly to the predictive performance of the models (Gao & Chai, 2018; Kakade et al., 2023). However, the use of technical indicators also leads to certain disadvantages. One of these disadvantages is the expansion of the feature set due to the use of numerous technical indicators, resulting in the dimensionality problem. The dimensionality problem is also known as the curse of dimensionality and implies that as the number of features increases, the risk of overfitting also increases (Srijiranon et al., 2022; Zheng & He, 2021). The second disadvantage is the proliferation of redundant information due to the similar calculation techniques used by technical indicators, ultimately leading to noise in the data. Principal Component Analysis (PCA) is a dimensionality reduction technique that utilizes statistical methods to represent the entirety of a dataset using a minimal number of fundamental components while minimizing data loss. PCA is the oldest and most well-known statistical method employed for dimensionality reduction (Kakade et al., 2023; Wang et al., 2023; Wei & Ouyang, 2024). PCA not only reduces the dimensionality of the dataset but also decreases the dimensionality of noise in the data, making it commonly used for reducing noise in the data (Li et al., 2022; Ma et al., 2019; Zheng & He, 2021). For these reasons, PCA was employed in the initial stage of the proposed model to address the high dimensionality of the dataset and eliminate noise.

While the PCA method has been successfully applied for feature extraction in numerous studies, it has some drawbacks. The most notable disadvantage is that PCA assigns equal weights to all components, potentially disregarding the prediction target (Guo et al., 2022; Huang et al., 2022). For instance, in extreme cases, assigning equal weights to all components may cause PCA to overlook strong components or, conversely, assign excessive weight to components that are irrelevant or weak for the prediction target, leading to the generation of noisy information. This aspect can hinder PCA from achieving stable prediction results (Wei & Ouyang, 2024). Furthermore, PCA has limitations in extracting higher-order statistical information because it relies on second-order statistical properties (Zare et al., 2018). Therefore, exploring complex and multifaceted data like financial time series directly with the PCA method is challenging. In our study, we aimed to address this limitation of PCA by utilizing another statistical method. To this end, we preferred the ICA method for feature extraction due to its ability to utilize higher-order statistical properties and the positive effects of applying ICA after PCA (Draper et al., 2003).

As a result, in the second preprocessing stage of our proposed prediction model, the new feature set reduced in size and denoised by PCA is passed directly to the ICA method for feature extraction. Several benefits of applying PCA before ICA have been emphasized in the literature (Draper et al., 2003). First, the PCA step allows us to control the number of independent components obtained from ICA by reducing the data dimensionality. Second, using PCA for dimension reduction before whitening helps eliminate low-variance features. Third, PCA improves computational efficiency by minimizing pairwise dependencies and thereby reducing computational complexity. All of these benefits encourage the use of ICA after PCA and improve ICA performance (Draper et al., 2003). Many studies in the literature demonstrate the effectiveness of the ICA method in feature extraction (Chen et al., 2022; Jianwei et al., 2019; Kao et al., 2013; Liu & Wang, 2011; Lu et al., 2009; Sethia & Raut, 2019; Thakkar & Chaudhari, 2021). After the initial preprocessing, the ICA method generates a new input set with a different number of features, consisting of explanatory features of high importance (see Tables 6 and 7). This new input set, whose higher-order structure cannot be captured by PCA because PCA processes only second-order statistics, undergoes higher-order statistical feature processing in the ICA step and is then applied to the LSTM deep learning network that forms the prediction system. The ability to combine the advantages of two different dimensionality reduction methods is what makes the hybrid PCA-ICA-LSTM prediction model so effective in the experimental studies conducted; a minimal sketch of this two-stage chain is given below.
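The sketch below illustrates the two-stage chain under our own assumptions: scikit-learn's PCA and FastICA are applied sequentially, and the resulting independent components are what would be fed to the LSTM. The component count of 12 follows the baseline experiments; the synthetic input matrix is only a placeholder.

```python
# Sketch of the two-stage preprocessing chain: PCA for dimension reduction and
# denoising, then FastICA on the principal components to obtain statistically
# independent components (ICs) that serve as LSTM inputs.
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
X = rng.normal(size=(4425, 48))              # placeholder normalized feature matrix

n_components = 12                            # component count used in the baseline experiments
pcs = PCA(n_components=n_components).fit_transform(X)     # stage 1: PCs (cf. Table 6)
ics = FastICA(n_components=n_components,
              random_state=0).fit_transform(pcs)          # stage 2: ICs (cf. Table 7)

print(pcs.shape, ics.shape)                  # both (4425, 12); the ICs feed the LSTM
```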

Table 6 An example representation of the principal components (PC) obtained after denoising
Table 7 An example representation of the independent components (ICs) obtained after feature extraction

While studies exist in the literature that combine PCA or ICA with different methods, to our knowledge, ours is the first study to combine these two statistical methods for a two-stage preprocessing and integrate them with a recurrent deep learning network to create a hybrid model. This highlights the novelty and theoretical contribution of our study. The proposed model is straightforward and presents a new framework that combines PCA and ICA statistical methods to provide input to an LSTM deep learning network used for prediction. The PCA-ICA-LSTM prediction model is constructed per the operation of the prediction model and basic adjustments described in the previous sections. Figure 7 shows the prediction framework of the proposed PCA-ICA-LSTM model.

Fig. 7 Proposed PCA-ICA-LSTM model

To briefly summarize the operation of the proposed model: the original dataset is first formed by adding the calculated technical indicator information to the raw dataset consisting of market information. After normalization, the dataset is input to the statistical methods for two-stage preprocessing. In the first stage, the PCA method reduces the dimensionality of the dataset and removes noise; the resulting principal components (PCs) compress the relevant information by discarding redundant information and noise. In the second stage, before the PCs are used as input to the LSTM network, they are decomposed by the ICA method into independent components (ICs) in order to extract the most useful features. The result is a noise-free, efficient feature set for the prediction system, consisting of essential and independent features that represent 95% of the total variance of our original dataset. Example representations of the PCs and ICs obtained during the two-stage preprocessing of our dataset are given in Tables 6 and 7, respectively.

In the last stage, the obtained dataset is used for training and evaluating the LSTM deep learning network to predict prices five days after the determined date.
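As an illustration of how the supervised samples for this last stage could be formed, the sketch below (our assumption, not code from the paper) builds 5-day windows of independent components and pairs each window with the closing price five days ahead.

```python
# Illustrative sketch of arranging the ICs into supervised samples: each input
# is a 5-day window of ICs and the target is the closing price five days ahead.
import numpy as np

def make_windows(features, close, window=5, horizon=5):
    X, y = [], []
    for t in range(window, len(features) - horizon + 1):
        X.append(features[t - window:t])        # past `window` days of ICs
        y.append(close[t + horizon - 1])        # price `horizon` days later
    return np.array(X), np.array(y)

rng = np.random.default_rng(0)
ics = rng.normal(size=(4425, 12))                           # placeholder ICs
close = 100 + np.cumsum(rng.normal(0.05, 1.0, size=4425))   # placeholder closing prices

X, y = make_windows(ics, close)
print(X.shape, y.shape)      # (4416, 5, 12) LSTM inputs and (4416,) targets
```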

3.6 Evaluation Criteria

Evaluation criteria are needed to measure the predictability of the forecast model and the accuracy of its forecasts. Error estimation methods such as root mean square error (RMSE), mean square error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) are frequently used in the literature to evaluate the performance of deep learning models. Within the scope of this study, the coefficient of determination (R2), MSE, MAE, MAPE, Max Error, and Return Ratio are used as evaluation metrics so that our results can be compared with other studies (Chowdhury et al., 2018; Gao et al., 2021; Jianwei et al., 2019; Sethia & Raut, 2019; Wen et al., 2020). Equations (8)–(13) below give the calculation of these evaluation criteria, where \({\widehat{y}}_{i}\) is the predicted value of the i-th sample and \({y}_{i}\) is the corresponding actual value.

$$R^{2}=1-\frac{\sum_{i}\left(y_{i}-\widehat{y}_{i}\right)^{2}}{\sum_{i}\left(y_{i}-\overline{y}\right)^{2}}$$
(8)
$$MSE=\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\widehat{y}_{i}\right)^{2}$$
(9)
$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\widehat{y}_{i}\right)^{2}}$$
(10)
$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|y_{i}-\widehat{y}_{i}\right|$$
(11)
$$MAPE=\frac{100}{n}\sum_{i=1}^{n}\frac{\left|y_{i}-\widehat{y}_{i}\right|}{\left|y_{i}\right|}$$
(12)
$$Max\;Error=\max_{i}\left(\left|y_{i}-\widehat{y}_{i}\right|\right)$$
(13)

Accordingly, the study uses two well-known evaluation metrics, the R2 score and MSE, to assess model performance. The coefficient of determination measures how successful our prediction model is; the higher its value, the better the model's performance. The MSE metric, on the other hand, is a non-negative measure indicating that the model's performance improves as its value approaches zero. In addition to R2 and MSE, regression evaluation metrics such as MAE, MAPE, and Max Error were examined to allow comparison with other studies in the literature. MAE serves the same purpose as MSE and is interpreted similarly; however, because MAE uses the absolute difference between the data and the model's predictions, outlier residuals do not contribute as much to the total error as they do in MSE. MAPE is the percentage counterpart of MAE. Since it is scale-independent, MAPE avoids the difficulties of comparing models built on series with different units, and because it expresses the estimation error as a percentage, it is interpretable on its own, which distinguishes it from the other metrics. The Max Error metric calculates the maximum residual, capturing the worst-case difference between the predicted and actual values; it therefore shows the extent of the model's error at its worst point.
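For reference, the metrics in Eqs. (8)–(13) can be computed as in the following sketch; y_true and y_pred are placeholder arrays, and the MAPE formula is written out explicitly to match Eq. (12).

```python
# Sketch of the evaluation metrics in Eqs. (8)-(13); y_true and y_pred are
# placeholders for actual and predicted (normalized) prices.
import numpy as np
from sklearn.metrics import (r2_score, mean_squared_error,
                             mean_absolute_error, max_error)

y_true = np.array([1.00, 1.02, 0.99, 1.05, 1.10])
y_pred = np.array([1.01, 1.00, 1.00, 1.06, 1.08])

r2   = r2_score(y_true, y_pred)                                   # Eq. (8)
mse  = mean_squared_error(y_true, y_pred)                         # Eq. (9)
rmse = np.sqrt(mse)                                               # Eq. (10)
mae  = mean_absolute_error(y_true, y_pred)                        # Eq. (11)
mape = 100 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))    # Eq. (12)
mx   = max_error(y_true, y_pred)                                  # Eq. (13)

print(f"R2={r2:.4f} MSE={mse:.6f} RMSE={rmse:.6f} "
      f"MAE={mae:.4f} MAPE={mape:.2f}% MaxError={mx:.4f}")
```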

Beyond these evaluation metrics, we add one more metric for use in the second part of the experimental work. In the second part of the experiments, we compare our proposed model with state-of-the-art models and examine the models' return rates through a simple trading strategy. This offers researchers the opportunity to evaluate models from a different perspective. Equation (14) below defines the return ratio metric, which measures a model's trading-strategy return relative to the "hold and wait" strategy.

$$Return\;Ratio=\frac{return_{trade\;strategy}\times 1.0}{return_{hold\;and\;wait\;strategy}}$$
(14)

4 Experimental Results and Discussion

In the experiments, we test the success of the proposed PCA-ICA-LSTM model in predicting the price of a financial asset against models that use single-stage statistical methods and against state-of-the-art models from the literature. Accordingly, we divide our experiments into four parts. In the first part, we prepare plain (LSTM) and hybrid (PCA-LSTM, ICA-LSTM, and PCA-ICA-LSTM) price prediction models and compare the proposed model against models derived from the same family; in this way, we aim to demonstrate the superiority of our PCA-ICA-LSTM model, with its two-stage preprocessing, over prediction models that perform single-stage preprocessing. In the second part, we compare our proposed model with deep learning models commonly used in the literature, including RNN, GRU, LSTM, and CNN; we also create a simple trading strategy for this comparison and evaluate the models in terms of return rates. In the third part, we compare our proposed model with previous studies that used the same dataset. In the fourth part, we make several changes: we extend the time scale of our dataset, originally selected from the literature for comparability, to 2000–2024 so that it includes the COVID-19 pandemic; we investigate the most suitable number of dimensionality reduction components for the new time scale and repeat our experiments with the component number found; and we add two additional case studies as a benchmark. All models used in the experiments follow the basic settings presented in Sect. 3.2. To ensure a fair comparison, we ran the experiments ten times and calculated the averages of the metrics. The experiments were performed on a system with an Intel Core i7 (2.5 GHz) CPU and 32 GB RAM, using the Python programming language and the Keras library in a Google Colaboratory environment.

4.1 Comparison of Our Proposed Model and Prediction Models Derived from the Same Family

The performance of the PCA-ICA-LSTM model proposed in this study was compared with the performances of the LSTM, PCA-LSTM, and ICA-LSTM deep learning price prediction models used in the literature to predict the price of a financial asset. The experimental results obtained for these three deep learning price prediction models and the proposed hybrid model are shown in Table 8. In the PCA-ICA-LSTM price prediction model, the two procedures for dimension reduction and feature extraction introduced in Sect. 3 are linked sequentially to each other and to the deep learning network. The models were run 10 times during the experiments, and the results of the evaluation criteria were analyzed under four headings: average, best, worst, and standard deviation. The best results for the evaluation metrics are shown in bold in Tables 8, 9, 10, and 11.

Table 8 Deep learning price prediction models and results

When Table 8 is examined, the forecasting performances of the models can be evaluated. The proposed hybrid PCA-ICA-LSTM price prediction model achieves the highest R2 and lowest MSE values in all categories; it reduced the average MSE value obtained by the LSTM price prediction model from 0.001457 to 0.000381, an improvement of 73.85%. An analogous situation is observed for the R2 score: the proposed hybrid model increased the average R2 value obtained by the LSTM price prediction model from 0.859670 to 0.963296, an improvement of 12.05%.

When we compare the results obtained by the PCA-ICA-LSTM model with those of the PCA-LSTM model, the effect of using the ICA method for feature extraction will be more pronounced. According to the results, the hybrid PCA-ICA-LSTM model increased the average R2 value obtained with the PCA-LSTM price prediction model from 0.956055 to 0.963297 by improving it by 0.75%. At the same time, the proposed hybrid PCA-ICA-LSTM model reduced the average MSE value obtained with the PCA-LSTM price prediction model by 16.45% from 0.000456 to 0.000381.

The graphical representations of the experimental results are presented in Fig. 8. The blue solid line represents the actual value of the financial asset used in the test set, and the red dotted line represents the value estimated by the relevant deep-learning model. Accordingly, when the graphs in Fig. 8 are examined, it is seen that the most successful model is the proposed PCA-ICA-LSTM model shown in Fig. 8d, and the most unsuccessful model is the LSTM model shown in Fig. 8a. In the graphical representation presented in Fig. 8a, it is noteworthy that there are vast differences between the estimated values obtained by the LSTM model and the actual values. In addition, in the graphical representation of the proposed PCA-ICA-LSTM model presented in Fig. 8d, it is seen that the values estimated by the model are remarkably close to the actual values.

Fig. 8 S&P 500 Index actual and predicted price values of Models: a the result of the LSTM model, b the result of the PCA-LSTM model, c the result of the ICA-LSTM model, and d the result of the proposed PCA-ICA-LSTM model

We can say that the proposed PCA-ICA-LSTM model is more stable than the other models during sudden price changes. This can be seen in Table 8: for error metrics such as MSE, RMSE, MAE, and MAPE, the best and worst results of the proposed model are remarkably close to each other compared with the other models. When Fig. 8a–d are scrutinized, the advantage of the proposed PCA-ICA-LSTM hybrid deep learning model is graphically most noticeable around the 800th record.

The actual price values of the S&P 500 index and the estimated price values obtained by the deep learning prediction models were compared during the experimental study, together with the residuals of the models. The results show that the proposed hybrid PCA-ICA-LSTM model outperforms the other models: its actual and estimated prices often match well and center around the diagonal, as presented in Fig. 9d. In Fig. 9, the graphs on the left show the actual versus estimated prices of the respective models, and the graphs on the right show their residuals.

Fig. 9 S&P 500 Index actual and predicted price values and residuals with Deep Learning Price Models: a LSTM Model, b PCA-LSTM Model, c ICA-LSTM Model, d PCA-ICA-LSTM Model

Each deep learning price prediction model compared in the experimental study achieves acceptably successful results. Apart from R2 and MSE, the MAPE values below 5% obtained by the models also support this observation (Montaño Moreno et al., 2013). Since the results are remarkably close to each other and difficult to distinguish in the graphs (Fig. 9), we present additional charts showing the density of the residuals in Fig. 10.

Fig. 10 Residual Density of Deep Learning Price Prediction Models: a LSTM Model Residual Density, b PCA-LSTM Model Residual Density, c ICA-LSTM Model Residual Density, d PCA-ICA-LSTM Model Residual Density

When the distribution of residuals in Figs. 9a and 10a is examined, the plain LSTM model mostly shows a scattered residual density in the range of −0.10 to 0. In a successful model, however, the residuals are expected to concentrate around the zero line as much as possible, as in Figs. 9d and 10d. When Fig. 10b–d are examined, the deep learning models that include dimension reduction and feature extraction methods show residual distributions much closer to a normal distribution than the plain LSTM deep learning model. A careful look at Fig. 10d shows that the proposed hybrid PCA-ICA-LSTM deep learning model has a more balanced and denser residual distribution around the zero line than the PCA-LSTM model in Fig. 10b and the ICA-LSTM model in Fig. 10c. This result demonstrates the success of our proposed PCA-ICA-LSTM hybrid deep learning model against the other models in the experimental studies.

These results reveal that deep learning price prediction models created with the addition of dimension reduction and feature extraction techniques are more effective than plain deep learning price prediction models (LSTM) in making the prediction values close to the actual price of the financial asset. In comparing the four models, the plain LSTM model obtained the worst rates in the evaluation metrics. In addition, it is observed that the proposed hybrid PCA-ICA-LSTM model reaches the highest R2 and lowest MSE, MAE, MAPE, and Max Error values even if the models obtain remarkably close results during the comparison.

4.2 Comparison of Our Proposed PCA-ICA-LSTM Model with Widely Used State-of-the-Art Models

The importance of prediction models in financial markets lies in the fact that they guide investors and assist in assessing new opportunities while mitigating risks in a dynamic market. The stock market can present a lucrative investment environment for investors, but there is a clear correlation between profitability and risk. As a result, investors seek to maximize returns while minimizing risk by predicting the probable value of financial assets. In this segment of our study, we will attempt to calculate the return ratio of models.

First, we develop a straightforward trading strategy to compute the returns of our models. Based on this strategy, the model issues a buy signal if it predicts a higher price for the financial asset on day t + 5 than on day t. Therefore, the model purchases the financial asset on day t and sells it on day t + 5. Conversely, if the model anticipates a lower price, it generates a sell signal and short-sells the financial asset. The difference between the two prices represents the model's profit or loss for that particular trade. Ultimately, the model's rate of return is calculated as the ratio between the total value earned at the end of each trade and the value obtained through the “hold and wait” approach. The equation for this calculation is provided in Sect. 3.6, Eq. (14).
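A minimal simulation of this strategy and of the return ratio in Eq. (14) could look like the sketch below; the price series and forecasts are placeholders, and the alignment of predictions with the 5-day horizon is our assumption about how the calculation is carried out.

```python
# Hedged sketch of the simple trading strategy and the return ratio of Eq. (14).
# predicted[t] is assumed to be the model's forecast of the price on day t+5;
# a predicted rise triggers a buy at day t and a sell at day t+5, otherwise a
# short sale, and the total is compared with the "hold and wait" benchmark.
import numpy as np

def return_ratio(actual, predicted, horizon=5):
    trade_return = 0.0
    for t in range(len(actual) - horizon):
        if predicted[t] > actual[t]:             # buy signal
            trade_return += actual[t + horizon] - actual[t]
        else:                                    # sell / short-sell signal
            trade_return += actual[t] - actual[t + horizon]
    hold_return = actual[-1] - actual[0]         # "hold and wait" strategy
    return (trade_return * 1.0) / hold_return    # Eq. (14)

rng = np.random.default_rng(0)
actual = 100 + np.cumsum(rng.normal(0.1, 1.0, size=300))   # placeholder test prices
predicted = actual + rng.normal(0.0, 0.5, size=300)        # placeholder forecasts

print(f"return ratio vs. hold-and-wait: {return_ratio(actual, predicted):.2f}")
```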

This section compares our proposed PCA-ICA-LSTM model with state-of-the-art models commonly used in financial data studies. Specifically, we created RNN, LSTM, GRU, and CNN models following the Baseline Settings subsection of Sect. 3.1. For the CNN model, we used 1D convolutional networks while adhering to these baseline settings; since dropout operations have drawbacks in CNN networks, we applied max-pooling operations instead of dropout. As a result, we created RNN, LSTM, GRU, and CNN models with a 5-layer architecture and no statistical preprocessing.
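A hedged Keras sketch of such a 1D-CNN baseline is shown below; the filter counts, kernel sizes, and dense-layer width are our assumptions rather than values reported in the paper, but the structure (Conv1D blocks with max pooling instead of dropout, ending in a single price output) follows the description above.

```python
# Hedged Keras sketch of the 1D-CNN baseline: Conv1D + max-pooling blocks
# (no dropout) over 5-day windows of features, ending in a single price output.
# Filter counts, kernel sizes, and the dense width are our assumptions.
from tensorflow import keras
from tensorflow.keras import layers

window, n_features = 5, 12

cnn = keras.Sequential([
    layers.Input(shape=(window, n_features)),
    layers.Conv1D(64, kernel_size=2, padding="same", activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(32, kernel_size=2, padding="same", activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(16, activation="relu"),
    layers.Dense(1),                     # predicted price five days ahead
])
cnn.compile(optimizer="adam", loss="mse")
cnn.summary()
```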

When we examine Fig. 11, which shows the S&P 500 Index's actual value and the prediction values of all models for comparison, the region between the 250th and 650th records becomes crucial as it displays sudden changes in the S&P 500 Index and allows us to observe the models' performance more clearly. According to this, we can observe that our proposed PCA-ICA-LSTM model closely mimics the actual price movements. Additionally, we can conclude that the CNN model also performs quite well. This is evident from the R2, MSE, and MAPE values obtained by the models and our observations in Fig. 11. Figure 12a–c display the R2, MSE, and MAPE values obtained by the models, respectively. When assessing model performance, a high R2 value indicates better performance. On the other hand, low MSE and MAPE values indicate successful model predictions. Out of the models we tested, the RNN model performed the worst, followed by the GRU model. Specifically, the RNN model tended to make pessimistic predictions and deviate from actual values.

Fig. 11 S&P 500 Index's actual value and the prediction values of all models

Fig. 12 Comparison of state-of-the-art models for post-experiment evaluation metric results: a R2 metric results, b MSE metric results, c MAPE metric results, d Return ratio metric results

Based on the rate of return metric results shown in Fig. 12d, it can be observed that the RNN model, which had pessimistic forecasts compared to the actual values of the S&P 500 Index, faced significant losses against the “hold and wait” strategy with the simple trading strategy that was designed. Similarly, the GRU model did not provide better returns than the hold-and-wait strategy. On the other hand, the other three models achieved much more profitable returns than the “hold and wait” strategy. Among these models, the LSTM and CNN models were particularly successful, with returns exceeding 200%. We believe this outcome is because the LSTM and CNN models produced overly optimistic results, especially after the 700th record in Fig. 11, but were still able to make profitable trades due to the significant upward trend. The predictions made by these models were far from the actual values.

Our analysis shows that while our proposed PCA-ICA-LSTM model may not be as effective as the LSTM and CNN models in this aspect, it has generated returns that are 165% higher compared to the “hold and wait” strategy. This signifies that our model is highly competitive regarding both rate of return and error rate metrics. However, it is essential to note that successful predictions with low error rates alone are not enough to calculate the rate of return. In our experiment, all models had R2 values above 0.90 and very low MSE values, but one model had negative returns compared to the “hold and wait” strategy, as illustrated in Fig. 12d. Developing a well-thought-out trading strategy is paramount in creating a model with high returns. We believe that with a well-developed trading strategy, our proposed PCA-ICA-LSTM model can achieve even more competitive results.

4.3 Comparison of the Recommended PCA-ICA-LSTM Model with Similar Studies from the Literature

Under the study's first objective, we compared the deep learning price prediction models that combine the PCA and ICA techniques in the data preprocessing stage with each other. Our PCA-ICA-LSTM hybrid model, distinguished as the most successful model in that comparison, is compared with similar studies from the literature in this section.

Although there have been many studies on the S&P 500 dataset in the past, it is not possible to compare our results with most of them. The reasons can be summarized under a few headings. The first is that, although they examine the same financial time series, the studies cover different periods. The second is the differences in dataset properties: some datasets contain only market information, while others add further information to it, for example technical indicators, fundamental indicators, or text information (news, social media messages, etc.). Another reason is the diversity of the evaluation metrics used in the models. In our study, we report many metrics that can evaluate price-prediction models; in this way, we aim to increase the comparability of the proposed hybrid model with models in studies that have been or will be carried out. However, even when the same evaluation metrics are used, differences in the hyper-parameters of the models, for example a different window size or a different forecasting horizon, can be considered a limitation of our comparison.

For the above reasons, we aimed to prepare a comparative study using a dataset and deep learning architecture with similar characteristics. We mentioned earlier that we reviewed the comparable experimental work by Thakkar and Chaudhari (2021), whose dataset and deep learning architecture were the same as those in the study by Sethia and Raut (2019). Therefore, we first compare our study with that of Sethia and Raut (2019). The related study uses LSTM and GRU deep learning models as well as SVM and multi-layer perceptron (MLP) models. All models were subjected to dimension reduction using the ICA technique, with the number of components set to 12. Two of the five metrics used as evaluation criteria in that study, R2 and MSE, are also used in our study. The results of the related study revealed that the LSTM model was the most successful. The results obtained from that study and the results of the hybrid PCA-ICA-LSTM deep learning price prediction model we recommend are shown in Table 9.

Table 9 The model we recommend and the results of the models in the literature (Sethia & Raut, 2019)

If we recall our evaluation criteria in general terms, the model with the highest R2 score follows the actual price movements of the financial asset most closely, and a correspondingly low MSE value confirms this. Accordingly, when Table 9 is examined, the ICA-LSTM model, declared the most successful model in the study of Sethia and Raut (2019), obtained R2 and MSE values of 0.948616 and 0.000428, respectively, while our proposed hybrid PCA-ICA-LSTM model obtains 0.963297 and 0.000381. According to these results, the hybrid PCA-ICA-LSTM model provides an improvement of 1.55% in the R2 score and 10.91% in the MSE value compared to the model proposed by Sethia and Raut (2019).

Another study using a similar scale and time interval is that of Gao et al. (2017). In that study, the authors tried to predict the next day's stock movement using an LSTM deep learning network. Their dataset consists of market information for 4243 trading days of the S&P 500 between January 3, 2000, and November 10, 2016. In our study, the dataset covers 4425 trading days between January 3, 2000, and October 30, 2017. The datasets are therefore of remarkably similar scales. Gao et al. (2017) used MAE, RMSE, MAPE, and AMAPE as evaluation metrics. Their experimental results showed that the prediction system they proposed gave higher prediction accuracy for the next day's closing price than other systems (moving average (MA), exponential moving average (EMA), and support vector machine (SVM)). The results of our proposed price prediction model and those of Gao et al. (2017) are given in Table 10.

Table 10 The model we recommend and the results of the models in the literature (Gao et al., 2017)

When Table 10 is examined, we see significant differences between the results. The main factor is the dataset used. Although the dataset sizes are similar in scale and are divided into training and test sets at similar rates, the dataset features differ: the study of Gao et al. (2017) uses only market information features, while the dataset in our study consists of both market information and over forty technical indicators. Thus, more suitable features were obtained for estimating price information, and the deep learning prediction models in our study achieved better results. Another difference between the studies is the size of the sliding window: Gao et al. (2017) predict the next day's price from the data of the past 20 days, whereas this parameter was set to 5 days in our study. This can be regarded as another factor affecting the evaluation-metric results.

In another study on the S&P 500 dataset, Hossain et al. performed stock price prediction using two deep learning networks, LSTM and GRU (Hossain et al., 2018). The related study used a large dataset consisting of market information covering the years 1950–2016, previously used by Di Persio and Honchar, and compared its results with that study (Di Persio & Honchar, 2016). The researchers used MAE, MSE, and MAPE metrics to demonstrate the success of the proposed model, and their results were considerably better than those of Di Persio and Honchar (2016). The study of Hossain et al. (2018) covers a much larger time series than ours. Because of this scale difference, comparing the studies with metrics such as MAE and MSE would not be correct. However, since the MAPE metric is scale-independent and expresses the error between actual and predicted values as a percentage, it can be used to compare different series or forecast scenarios, and it therefore allows these two studies to be evaluated together. Table 11 compares the proposed PCA-ICA-LSTM model with some of the models in the Hossain et al. study. Hossain et al. reported a MAPE of 4.13% for their proposed model, whereas our proposed PCA-ICA-LSTM model performs considerably better with an average MAPE of 2.05%. We can explain this difference in two ways. First, the dataset of Hossain et al. consists of market information only, while our dataset consists of both market information and technical indicators, which contributes to forming more informative features; the effect of the technical indicators is visible in the 3.83% average MAPE obtained by the plain LSTM model in our study. Second, the dimension reduction methods used in our deep learning models also have an effect, since the average MAPE obtained by all of our price prediction models that use dimension reduction methods is below 2.31%. These results reveal the effectiveness of both the technical indicators used in the dataset and the dimension reduction methods used in the prediction models.

Table 11 The model we recommend and the results of the models in the literature (Hossain et al., 2018)

4.4 The Expanded New S&P 500 Dataset and Additional Case Studies

As mentioned in previous sections, one of the objectives of our study was to achieve comparability with other studies in the literature. Therefore, we took care to select a dataset with the same time scale as the datasets used in other comparable studies in the literature. This is crucial because the performance of forecasting systems is significantly affected by the characteristics of the dataset used, as well as the parameter values defined for the models. This is particularly sensitive for studies that employ the holdout method for cross-validation, as is the case in our study. Altering the time scale of the dataset in our study would also change the training and test set partitioning, which would subsequently weaken the comparability of our study with other studies in the literature. For these reasons, we maintained the scale of the dataset in the experimental studies conducted in the previous section consistent with the studies of Sethia and Raut (2019) and Thakkar and Chaudhari (2021), thereby achieving one of our objectives.

In this section of our study, we continue our experiments by expanding the datasets. Firstly, we update our dataset to the most recent version and extend the time scale of the dataset until 2024 to encompass the fluctuations in financial markets caused by the COVID-19 pandemic. As a reminder, in the previous section, experiments were conducted with a parameter value of 12 for the number of components for dimensionality reduction and feature extraction, following the recommendation in the study by Sethia and Raut (2019) for direct comparability (See Table 2). Due to the change in the time scale of our dataset and consequently its acquisition of new dynamics, we are conducting some trials to determine the optimal number of components to be used in dimensionality reduction and feature extraction operations. To maintain the scope of the study during these trials, we are using the parameter list exactly as utilized in the studies by Sethia and Raut (2019) and Thakkar and Chaudhari (2021). Subsequently, to demonstrate the effectiveness of our method, we repeat our previous experiments by adjusting the parameter for the optimal new number of components calculated for our updated S&P 500 dataset, which now includes the pandemic period. Finally, to further validate the effectiveness of our method, we include additional case studies of selected stocks from the S&P 500 index as a comparative point in our research.

4.4.1 The Descriptions of the Expanded New S&P 500 Dataset

In the previous section, the dataset used in our experiments spanned 18 years, covering daily data from January 1, 2000, to October 23, 2017. By updating our study, we aim to demonstrate the impact of the fluctuating trajectory experienced in financial markets due to the worldwide COVID-19 pandemic on the performance of our proposed model. To achieve this objective, we extend our S&P 500 dataset to encompass the period from January 1, 2000, to January 25, 2024, and once again utilize the Yahoo Finance website to obtain this new dataset (SPY, 2024). The obtained new dataset comprises 6060 data points, and the dataset descriptions are presented in Table 12.

Table 12 The descriptions of the expanded new S&P 500 dataset

We apply the same processing steps to our new dataset as in the initial experiments. First, we calculate the technical indicators listed in Table 4 and incorporate them into the dataset, resulting in a dataset with 48 features in total. Subsequently, we perform standardization and normalization using Eq. (1).
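The data preparation could be reproduced along the lines of the following sketch, which downloads the daily series with the yfinance package and appends a few representative indicators with pandas; only three of the 40+ indicators in Table 4 are shown, and the exact indicator parameters are our assumptions.

```python
# Illustrative sketch of building the expanded dataset: download the daily
# series from Yahoo Finance and append a few technical indicators with pandas.
# Only three of the 40+ indicators in Table 4 are shown; the indicator
# parameters (20-day averages, 14-day RSI) are our assumptions.
import yfinance as yf

df = yf.Ticker("SPY").history(start="2000-01-01", end="2024-01-25")

close = df["Close"]
df["SMA_20"] = close.rolling(20).mean()                    # simple moving average
df["EMA_20"] = close.ewm(span=20, adjust=False).mean()     # exponential moving average

delta = close.diff()
gain = delta.clip(lower=0).rolling(14).mean()
loss = (-delta.clip(upper=0)).rolling(14).mean()
df["RSI_14"] = 100 - 100 / (1 + gain / loss)               # relative strength index

df = df.dropna()     # drop the warm-up rows used to seed the indicators
print(df.shape)
```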

We partition our dataset into training, validation, and test sets using the holdout method for cross-validation. To accomplish this, we employ the training (70%), validation (10%), and test (20%) ratios previously established by Sethia and Raut (2019). Initially, we utilize a two-month time frame covering January and February at the beginning of the dataset for calculating technical indicators. Accordingly, the new S&P 500 dataset is segmented into 4207 data points for training, 601 data points for validation, and 1202 data points for testing. Information regarding the splitting of the dataset into training, validation, and test sets using the holdout method is presented in Table 13, while the overall overview of the dataset after splitting is depicted in Fig. 13.

Table 13 Splitting the new S&P 500 dataset into training, validation, and test sets using the holdout method
Fig. 13 The graphical representation of the expanded new S&P 500 dataset
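A simple chronological holdout split matching these ratios can be sketched as follows; the 6010 usable rows (after the indicator warm-up period) and the 70/10/20 ratios come from the text, while the helper function itself is our illustration.

```python
# Sketch of the chronological holdout split: 70% training, 10% validation,
# 20% testing (4207 / 601 / 1202 rows after the indicator warm-up period),
# applied without shuffling so that the time order is preserved.
import numpy as np

def holdout_split(data, train=0.70, val=0.10):
    n = len(data)
    n_train, n_val = round(n * train), round(n * val)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

data = np.arange(6010)                       # placeholder for the usable rows
train_set, val_set, test_set = holdout_split(data)
print(len(train_set), len(val_set), len(test_set))   # 4207 601 1202
```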

4.4.2 Experimental Setup

We conduct our experiments on the new dataset in two steps: in the first step, we search for the optimal number of components for dimensionality reduction and feature extraction for this dataset. In the subsequent step, we repeat the experiments from the previous section for the optimal number of components on the new dataset. To accomplish this, we utilize the model and parameter adjustments outlined in Table 2 of the 3.1 Baseline Settings section. Each experiment is executed 10 times, and we compute the average values of the obtained results for a fair comparison.

4.4.3 Investigating the Optimal Number of Components for Dimension Reduction and Feature Extraction on the Extended S&P 500 Dataset

In the preprocessing stage of the model proposed in our study, the PCA and ICA statistical methods are employed sequentially. We describe these statistical methods as a preprocessing stage because the final dataset that passes through both methods and is input into the prediction model differs significantly from the initial dataset at the beginning of the study; both methods produce entirely new components as output from the datasets used as input. Previously, we noted that the critical parameter for both the PCA and ICA methods is the number of components. For PCA, the number of components determines how many principal components of the dataset are retained; by using this parameter we both reduce the dimensionality of our dataset and control the number of components to be used in the subsequent ICA step. In their study, Sethia and Raut (2019) reported using different numbers of features (components), namely 7, 12, 18, 25, 32, and 45, for dimensionality reduction in the dataset covering the years 2000–2017, as stated in Sect. 3.1 Baseline Settings. We now apply the same experiments for the model proposed in our study to the S&P 500 index dataset covering the years 2000–2024, aiming to find the most suitable number of components for dimension reduction and feature extraction on the updated dataset. Our aim is to achieve the lowest error rate and the highest return rate when predicting the closing price with a 5-day forecasting horizon. To maintain the integrity of our study, we conduct the experiments for the values specified in Table 2 without making any changes to the model structure or the deep learning hyper-parameters. The sketch below outlines this search procedure.
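```python
# Sketch of the component-count search: each candidate value is evaluated over
# 10 runs and the metric averages are compared (cf. Table 14). The helper
# `run_pca_ica_lstm` is hypothetical and stands in for the full pipeline.
import numpy as np

CANDIDATE_COMPONENTS = [7, 12, 18, 25, 32, 45]
N_RUNS = 10

def run_pca_ica_lstm(n_components, seed):
    """Hypothetical placeholder: train the PCA-ICA-LSTM pipeline with
    `n_components` and return (test MSE, return ratio)."""
    rng = np.random.default_rng(seed * 100 + n_components)
    return rng.uniform(0.0003, 0.002), rng.uniform(-1.0, 3.0)

for k in CANDIDATE_COMPONENTS:
    runs = [run_pca_ica_lstm(k, seed) for seed in range(N_RUNS)]
    avg_mse = np.mean([mse for mse, _ in runs])
    avg_rr = np.mean([rr for _, rr in runs])
    print(f"components={k:2d}  avg MSE={avg_mse:.6f}  avg return ratio={avg_rr:.2f}")
```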

For dimensionality reduction and feature extraction, 10 trials were conducted for each of the values 7, 12, 18, 25, 32, and 45, and the average error and return ratios obtained with these values are presented in Table 14. Accordingly, the three numbers of features (components) with the lowest error rates are 45, 18, and 32, respectively.

Table 14 Investigation of the optimum number of components for dimensionality reduction and feature extraction

As mentioned when outlining the objectives of our study, many studies in the literature focus on the accuracy and error rates of models, while the primary goal of developing forecasting systems in financial markets, namely providing high returns and reducing potential risks, receives little attention. In our study, we focus particularly on this issue and calculate the return rate of our model using a simple trading system, as in the previous experiments. As can be observed in Table 14, across the experiments conducted for 6 different feature counts, only the models with 7 and 18 features achieved profitable transactions compared to the "hold and wait" strategy. However, as stated above, the model with 45 features achieved the best error rates. This situation also reveals the weakness of studies that focus solely on accuracy or low error rates under real-world market conditions. Accordingly, the feature (component) counts that best support our goal of building a model that can provide high returns or reduce potential risks are 18, 7, and 12, in that order.

Upon examination of the lowest error rate list (in order: 45, 18, 32) and the highest return rate list (in order: 18, 7, 12) obtained from our experiments, we observe that there is only one feature (component) count value present in both lists, and we report this value to be 18. In light of these findings, we determine the optimum component (feature) count for use in dimensionality reduction and feature extraction to be 18, and we utilize this parameter setting in our subsequent experiments.

4.4.4 Predictive Performance of Models for the Expanded New S&P 500 Dataset

Having determined the optimal dimensionality reduction and feature extraction feature (component) count to be 18 for our S&P 500 dataset extending to 2024, we proceed to our comparative experiments for the new dataset in this section. Similar to the previous section, we compare the performance of our proposed PCA-ICA-LSTM model with RNN, LSTM, GRU, and CNN models. For this purpose, we establish the model and parameter settings provided in Table 2 of Sect. 3.1 Baseline Settings, except for setting the feature extraction feature (component) count parameter value to 18 and adjusting the dataset training-test split to that of Table 13.

The average prediction performances of the models, based on the obtained experimental results, are presented in Table 15, with the best results highlighted in bold. Additionally, the number of trainable parameters and the average training and test times of the models for complexity analysis of the models in the comparison are provided in Table 16.

Table 15 The performance of prediction models for the new S&P 500 dataset
Table 16 Complexity analysis of the proposed method versus other deep learning models

In the previous experiment, we showed that the optimal value of the dimensionality reduction and feature extraction component parameter for the expanded new dataset is 18. When we conduct our new experiments with this value, we observe that, with very slight differences, the highest R2 values are obtained by the RNN, CNN, and PCA-ICA-LSTM models, respectively, and that the same three models, in the same order, also achieve the lowest MSE values. The GRU model exhibits the worst performance for both the R2 and MSE evaluation metrics. The ranking changes slightly for the lowest MAE and MAPE values obtained from the experiments: for these two evaluation metrics, the best results are achieved by the PCA-ICA-LSTM, RNN, and CNN models, respectively.

As emphasized in the previous section, the primary goal of developing forecasting systems in financial markets should be to provide high returns and mitigate potential risks. When we examine the return ratio metric that captures this goal, we observe that 4 out of 5 models could not provide positive returns on this dataset relative to the "hold and wait" strategy. Our results demonstrate that only our proposed model was able to provide a positive return relative to the "hold and wait" strategy. We report that our proposed PCA-ICA-LSTM model achieved a return rate 220% better than the "hold and wait" trading strategy and 260% better than the RNN model, which achieved the closest result to our model in the comparison.

The comparative analysis of the actual value of the expanded new S&P 500 Index and the prediction values of all models is presented in Fig. 14. It is evident from Fig. 14 that the GRU model exhibits the poorest performance in the comparison, while the performance of the RNN model in the new S&P 500 dataset, which includes the COVID-19 pandemic period, is noteworthy. Additionally, we can state that our proposed PCA-ICA-LSTM model demonstrates more successful results in the expanded S&P 500 new dataset, which encompasses the period of significant market fluctuations caused by the COVID-19 pandemic, compared to the initial S&P 500 dataset covering the years 2000–2017.

Fig. 14 Actual values of the extended new S&P 500 dataset and predicted values of all models

4.4.5 Additional Case Studies Conducted to Demonstrate the Effectiveness of the Proposed Model

In this section of our study, we aim to further validate the effectiveness of our proposed method by incorporating additional case studies into our research. Since the focus of our study, as the title suggests, is the S&P 500 index, we did not consider focusing on different world indices. Instead, we decided that selecting stocks from the S&P 500 index would be more suitable for the additional case study. For this purpose, we identified two stocks that existed between 2000 and 2024 in the S&P 500 index.

In the selection of these stocks, the top 10 companies in the S&P 500 index by market capitalization were examined. It was observed that these companies belonged to 4 different sectors (technology, retail, finance, and healthcare) (SPX, 2024). Companies that were not active between 2000 and 2024 were excluded. Furthermore, the world was affected by the COVID-19 pandemic from the end of 2019 to approximately the beginning of 2023. Considering the pandemic period, companies from the healthcare and technology sectors among the four sectors mentioned above were included in our study as a benchmark for the additional case study. The companies we included as an additional case study were identified as Eli Lilly and Company (LLY) and NVIDIA Corporation (NVDA).

The datasets for the respective companies were prepared by applying the processing steps outlined in Sect. 4.4.1. Accordingly, the dataset descriptions for the relevant companies are presented in Table 17. The datasets for both stocks were divided into training, validation, and test sets using the hold-out method, utilizing the values in Table 13 for cross-validation. The data for the datasets was obtained from the Yahoo Finance website, and the graphical representations of both datasets are presented in Fig. 15 (LLY, 2024; NVDA, 2024).

Table 17 Dataset descriptions of the additional case studies included in the study
Fig. 15 Graphical representation of the datasets identified in the additional case study: a LLY stock dataset, b NVDA stock dataset

When examining the graphs of both stocks, it is observed that the overall trend is upward, particularly during the COVID-19 pandemic period. While the LLY dataset exhibits volatile short-term fluctuations, the general trend remains upward. On the other hand, the NVDA dataset shows volatile long-term fluctuations, eventually continuing its overall upward trend. The two datasets demonstrate different dynamics in terms of the fluctuations they experience.

Table 18 presents the prediction results of the models for the LLY stock identified in the additional case study. The best results for the evaluation metrics are shown in bold in the table. Accordingly, we observe that the proposed PCA-ICA-LSTM model provides the highest R2 of 0.986615, while the GRU model provides the lowest R2 of 0.718864 among the compared models for the LLY stock dataset. In addition, the PCA-ICA-LSTM model stands out as the most successful model by obtaining the lowest MSE, MAE, and MAPE values, while the GRU model stands out as the most unsuccessful model among the compared models with the highest MSE, MAE, and MAPE values. Furthermore, the PCA-ICA-LSTM model, while predicting the next day's price with a 5-day prediction horizon for the LLY stock, provides a return of over 230% compared to the “hold and wait” strategy, while the other models in the comparison achieve returns below the “hold and wait” strategy. Our proposed model has achieved a return rate of approximately 275% better than the LSTM model, which provided the closest return to it in the comparisons. In this way, the proposed model stands out significantly from the other compared models. The actual values of the LLY stock and the predicted values of all models are shown comparatively in Fig. 16a.

Table 18 Comparison of the performance of prediction models for the LLY stock identified in the additional case study
Fig. 16 Actual values of the stocks identified in the additional case study and predicted values of all models: a LLY stock dataset, b NVDA stock dataset

Table 19 presents the prediction results of the models for the NVDA stock, which is the other additional case study within the scope of the study. The best results for the evaluation metrics are shown in bold in the table. Similar comparison results were obtained for the NVDA stock as in the previous additional case study for the LLY stock. Accordingly, Table 19 shows that the proposed PCA-ICA-LSTM model achieved the highest R2 value of 0.958137, while the GRU model achieved the lowest R2 value of 0.871512 among the compared models for the NVDA stock dataset. In the same comparison, the PCA-ICA-LSTM achieved the lowest MSE, MAE, and MAPE values, becoming the most successful model, while the GRU model was the least successful model with the highest error values. Unlike the previous additional case study, the proposed PCA-ICA-LSTM model, while predicting the price of the NVDA stock 5 days later, provided a return rate 60% lower than the “hold and wait” strategy in the return ratio metric. Although our proposed model lags behind the “hold and wait” strategy in the return rate metric, it achieved a return rate 50% better than the LSTM model, which provided the highest return among the compared models. The actual values of the NVDA stock and the predicted values of all models are shown comparatively in Fig. 16b.

Table 19 Comparison of the performance of prediction models for the NVDA stock identified in the additional case study

5 Conclusions

This study proposes a novel deep learning model, PCA-ICA-LSTM, for predicting financial asset prices. The model incorporates a two-stage preprocessing procedure utilizing the PCA and ICA statistical methods to enhance prediction accuracy. A five-layer LSTM network then uses the preprocessed data to predict the price five days ahead. The research addresses two primary objectives: firstly, to establish comparability with the existing literature by employing a standardized dataset and evaluation metrics; and secondly, to prioritize risk mitigation and return generation in financial forecasting. For this purpose, the study utilized an 18-year dataset of the S&P 500 spanning 2000 to 2017, incorporating over 40 technical indicators. Evaluation of the model was conducted using six criteria: the R2 score, MSE, MAE, MAPE, Maximum Error, and Return Ratio, the last of which is computed through a simple trading strategy.

Comparative analyses demonstrate that PCA-ICA-LSTM outperforms single-stage statistical preprocessing and prevalent deep learning models such as RNN, GRU, LSTM, and CNN. While PCA-ICA-LSTM achieved the highest accuracy metrics, the CNN model obtained the highest return rate, surpassing the "hold and wait" strategy by 270%; the PCA-ICA-LSTM model achieved a competitive 165% return rate relative to the same strategy. Remarkably, the fact that the model with the highest accuracy and lowest error rates lags behind other models in return rate underlines the second objective of our study. Moreover, comparisons with prior studies reveal significant improvements in MSE and MAPE metrics. The performance of the PCA-ICA-LSTM model was also investigated over an extended 2000–2024 time scale, including the COVID-19 pandemic, to test its robustness under varying market conditions. Experiments determined the optimal number of dimension reduction components for the new dataset to be 18. On the extended S&P 500 data, the PCA-ICA-LSTM model demonstrated strong performance, achieving a return rate 220% higher than the "hold and wait" strategy and 260% higher than the closest competitor, the RNN model. In supplementary case studies involving LLY and NVDA stocks, the PCA-ICA-LSTM model again outperformed the other models, demonstrating its continued success; it achieved return rates approximately 275% and 50% higher than the LSTM model, which provided the closest return in the LLY and NVDA comparisons, respectively.

This study offers a PCA-ICA-LSTM model that achieves high accuracy, low error rates, and competitive return rates. The study contributes to advancing predictive accuracy and return rates in financial forecasting, offering valuable insights for researchers and practitioners alike. Future studies could explore optimizing the LSTM network and applying the model to various datasets and deep learning architectures. Additionally, developing trade mechanisms for improved return rates and investigating hybrid forecasting models are promising areas for further research.