1 Introduction

In recent years, there has been growing interest in using machine learning and data analysis methods to predict stock prices. The main objective is often to predict the direction in which a price will move.

The S&P 500 index tracks the weighted performance of 500 of the most valuable companies in the US. It is often used by economists to measure the health of the US stock market and the overall economy [1]. The widespread adoption of this index has increased interest among retail and institutional investors, leading to exponential growth in trading of Exchange-Traded Funds (ETFs) and options on the index. Modeling the performance of the index is therefore valuable.

Multiple factors like earnings season, wars, political news, and traders' psychology influence stock prices [2]. There are two main approaches to predict stock performance: fundamental and technical analysis. Fundamental analysis focuses on how a company is performing and the success of the relevant industry and market. Information such as earnings per share, dividend yield, price-to-earnings ratio, and fair market value of the stock are important metrics in fundamental analysis [3].

To predict possible price movements, some traders use technical analysis to find trends or patterns. This discipline employs past data such as price movement and volume to identify trading opportunities [4]. Although past performance does not correlate highly with future returns, people’s emotions have autocorrelation. Technical analysts create compound features using historical data. These technical indicators may find recurring patterns in stock price movement [5].

Numerous machine learning papers propose models that aim to predict such price movements. Macchiarulo applied both technical analysis and machine learning methods to predict the movement of stock prices [6]. Other models use the relationship between social sentiment and the stock price to forecast future movements [7].

Brunhuemer et al. evaluated a machine learning model for trading straddles, that is, buying a put and a call option at the same strike price. They use volatility and other technical analysis data as inputs to the model [8], and show that predicting volatility is as useful as predicting a specific market direction.

Machine learning techniques such as support vector regression, decision trees, k-nearest neighbors, random forests, and the multilayer perceptron (MLP) have been used to predict the stock market. MLP and long short-term memory (LSTM) networks have shown promising results [9].

Fu applied hybrid machine learning methods to increase the accuracy of S&P index prediction [10]. Wei applied a support vector machine to predict S&P indices, and also combined different machine learning methods with a boosting approach to increase prediction accuracy [11].

In this paper, a combination of technical and statistical analysis and machine learning methods is used to create models that improve the likelihood of accurately forecasting the absolute movement of the S&P 500 index. The predictions distinguish large from small changes in the index. An absolute change of less than 1% is coded as a small change (Group 0), while a change above 1% is coded as a large change (Group 1). The reason is that S&P 500 options have a high likelihood of yielding a daily return of over 100% when the index moves more than 1% in either direction.
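The Group 0 / Group 1 coding can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `label_move` and the sample values are hypothetical, and because the text does not say how a change of exactly 1% is coded, the sketch assumes it falls into Group 0.

```python
def label_move(pct_change: float, threshold: float = 1.0) -> int:
    """Return 1 (large move) if |pct_change| exceeds threshold, else 0 (small move).

    Assumption: a change of exactly `threshold` percent is coded as Group 0.
    """
    return 1 if abs(pct_change) > threshold else 0


# Hypothetical daily percent changes, for illustration only.
changes = [0.4, -1.7, 1.0, 2.3, -0.2]
labels = [label_move(c) for c in changes]
print(labels)
```

Taking the absolute value first means a drop of 1.7% and a rise of 1.7% receive the same label, which matches the focus on the size of the move rather than its direction.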

2 Data description

Historical data for the S&P 500 from January 2000 until August 2022 were taken from Yahoo Finance [12]. The raw attributes are the opening price, the highest price of the day, the lowest price of the day, the closing price, and the volume. From these attributes, features were constructed and preprocessed for analysis with machine learning methods. Two main groups of features were created: features based on percentages, and features based on days.

2.1 Features based on percentages

Percentage features are datapoints that come from changes among attributes of the original data, transformed into percentages. The features that belong to this group are:

  • Daily Percent Change: the percentage difference between the opening and closing price.

  • Overnight Percent Change: the percent change between the previous day’s close and the current day’s open.

  • Percentage Volatility: the percent change between the daily low and high price.

  • Absolute Percent Change (Abs Mov): the absolute value of the Daily Percent Change.

  • 5-day and 3-day Percent Moving Average: the moving average of the Daily Percent Change and of the Absolute Percent Change over 3 or 5 days.

These features were then passed through binarization and discretization to facilitate implementation in the algorithms.
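The percentage features listed above could be computed along the following lines. This is a sketch under stated assumptions: the text does not give the denominators, so the code assumes the open for the daily change, the previous close for the overnight change, and the daily low for volatility. The `Bar` type, the function names, and the sample prices are all hypothetical, and the binarization and discretization steps are not shown because their cut points are not specified.

```python
from typing import List, NamedTuple, Optional


class Bar(NamedTuple):
    """One trading day of price data (hypothetical container)."""
    open: float
    high: float
    low: float
    close: float


def pct(new: float, old: float) -> float:
    """Percent change from `old` to `new`."""
    return (new - old) / old * 100.0


def daily_pct_change(bar: Bar) -> float:
    return pct(bar.close, bar.open)        # assumed base: the open


def overnight_pct_change(prev: Bar, cur: Bar) -> float:
    return pct(cur.open, prev.close)       # assumed base: previous close


def pct_volatility(bar: Bar) -> float:
    return pct(bar.high, bar.low)          # assumed base: the daily low


def moving_average(values: List[float], window: int) -> List[Optional[float]]:
    """Trailing mean over `window` values; None until enough history exists."""
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


# Invented prices, for illustration only.
bars = [
    Bar(100.0, 102.0, 100.0, 101.0),
    Bar(101.0, 101.0, 98.0, 99.0),
    Bar(99.0, 100.0, 97.0, 100.0),
    Bar(100.0, 103.0, 100.0, 102.0),
]
daily = [daily_pct_change(b) for b in bars]
abs_mov = [abs(d) for d in daily]          # Absolute Percent Change
ma3 = moving_average(daily, 3)             # 3-day Percent Moving Average
abs_ma3 = moving_average(abs_mov, 3)       # 3-day average of Abs Mov
```

A trailing window is used here so that each value depends only on the current and earlier days; whether the original features were aligned this way is an assumption.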

2.2 Features based on days

These features investigate the characteristics of the trading day to create data points. The features that belong to this group are:

  • Day of the Week

  • Consecutive Red Days: the number of consecutive days that have closed below the open price.

  • Consecutive Green Days: the number of consecutive days that have closed above the open price.

  • Discretized Consecutive Days: a single feature obtained by merging the Red Days and Green Days counts.
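One plausible way to merge the red and green counts into a single feature is a signed streak, negative for red days and positive for green days. The text does not spell out the encoding, so the sign convention, the reset on a flat day, and the function name `signed_streaks` are all assumptions of this sketch.

```python
from typing import List


def signed_streaks(daily_changes: List[float]) -> List[int]:
    """Signed run length of consecutive up or down days.

    Assumed encoding: +n after n consecutive green days (close above open),
    -n after n consecutive red days, 0 on a day with no change.
    """
    streaks: List[int] = []
    run = 0
    for change in daily_changes:
        if change > 0:
            run = run + 1 if run > 0 else 1     # extend or start a green run
        elif change < 0:
            run = run - 1 if run < 0 else -1    # extend or start a red run
        else:
            run = 0                             # flat day resets the streak
        streaks.append(run)
    return streaks


# Invented daily percent changes, for illustration only.
example = signed_streaks([0.5, 0.2, -0.1, -0.3, -0.4, 0.0, 1.2])
print(example)
```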

3 Technical and statistical analysis

Before feeding the data into machine learning models, data analysis was performed to investigate correlations between the features. The likelihood that the S&P 500 closes with an Absolute Percent Change above 1% is 21.43%. The probability that it closes with a percent change between −1% and +1% is 78.5% (Fig. 1).

Fig. 1 Daily percent change with number of instances

On average, intraday volatility is 1.34%, and there is a 54% chance that it exceeds 1% (Fig. 2).

Fig. 2 Intraday volatility, based on percent change
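Summary statistics of this kind (the share of days beyond a threshold and the average volatility) are simple empirical frequencies. The sketch below shows the calculation on invented values; the numbers are not the paper's data and the helper names are hypothetical.

```python
from typing import List


def share_above(values: List[float], threshold: float) -> float:
    """Fraction of values strictly greater than `threshold`."""
    return sum(1 for v in values if v > threshold) / len(values)


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Invented sample, for illustration only (percent units).
abs_changes = [0.2, 1.4, 0.7, 0.3, 2.1]
volatility = [0.8, 1.9, 1.1, 0.6, 2.6]

large_move_rate = share_above(abs_changes, 1.0)   # base rate of Group 1
high_vol_rate = share_above(volatility, 1.0)
avg_volatility = mean(volatility)
```

The base rate of Group 0 is the complement of `large_move_rate`, and it is the benchmark against which the accuracy of any classifier has to be compared.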

Scatter plots of Consecutive Days against Absolute Percent Change show higher streaks than Absolute Percent Change. Similarly, Consecutive Days values ranging from −4 to +2 show the highest probabilities of producing a movement above 3% (Fig. 3).

Fig. 3 Scatter plot

4 Machine learning models and results

4.1 Decision tree

The decision tree is one of the most widely applied machine learning methods because of its easy implementation and robustness. A tree splits a continuous or discrete dataset using impurity measures such as the Gini index and entropy [13].

In this study, the tree trained on the entire 22-year dataset was 80% accurate. When the data were restricted to 2009 onward, after the crisis of 2008, the accuracy of the decision tree rose to 87%. The low Gini index (approximately 0.26) indicates relatively pure nodes, which is consistent with the high accuracy of the tree (Fig. 4).

Fig. 4 Decision tree
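For reference, the two impurity measures a decision tree typically uses can be written out directly. The sketch below uses the standard definitions, Gini = 1 − Σ p_i² and entropy = −Σ p_i log2 p_i, applied to the class counts in a node; the function names and the example counts are illustrative and are not taken from the paper's tree.

```python
import math
from typing import Sequence


def gini(counts: Sequence[int]) -> float:
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a node given its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def split_gini(left: Sequence[int], right: Sequence[int]) -> float:
    """Size-weighted Gini impurity of a binary split."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(left) + (n_right / n) * gini(right)


balanced = gini([50, 50])     # maximally impure two-class node
skewed = gini([85, 15])       # illustrative 85/15 node
```

For two classes the Gini index ranges from 0 (a pure node) to 0.5 (an even split). An 85/15 node has a Gini index of about 0.255, which gives a sense of how pure a node with a value near 0.26 is.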

4.2 Rule based classifier

After multiple iterations of decision trees, rule-based classifiers were created to predict whether the absolute percent change of a session would be below 1% (Group 0) or above that threshold (Group 1).

Rule 1 used the 3-day moving average for yesterday, today, and tomorrow to predict a small percent change, of less than 1%. The results showed an accuracy of 88.14% with a coverage of 38.62% (Fig. 5). Note that the chance of a day ending the session with a percent change below 1% is 78.5%, as seen previously (Fig. 1). Rule 1 therefore improves on this base rate by roughly 10 percentage points.

Fig. 5 Rule 1 statistical data. R1: (−0.3 ≤ Previous MA3 ≤ 0.3) ∧ (−0.3 ≤ Current MA3 ≤ 0.3) → Abs Mov = 0
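The accuracy and coverage of a rule can be computed as below. The sketch assumes the usual definitions for rule-based classifiers: coverage is the share of all records that satisfy the rule's condition, and accuracy is the share of those covered records whose label matches the rule's prediction. It reads R1 as requiring both MA3 values to lie within ±0.3. The field names (`prev_ma3`, `cur_ma3`, `label`) and the sample rows are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]


def rule_1(row: Row, bound: float = 0.3) -> bool:
    """R1 condition: previous and current MA3 both within [-bound, +bound]."""
    return abs(row["prev_ma3"]) <= bound and abs(row["cur_ma3"]) <= bound


def evaluate_rule(rows: List[Row], rule: Callable[[Row], bool],
                  predicted_label: int) -> Tuple[float, float]:
    """Return (accuracy, coverage) of `rule` on `rows`."""
    covered = [r for r in rows if rule(r)]
    if not covered:
        return 0.0, 0.0
    correct = sum(1 for r in covered if r["label"] == predicted_label)
    return correct / len(covered), len(covered) / len(rows)


# Invented rows, for illustration only.
rows = [
    {"prev_ma3": 0.10, "cur_ma3": -0.20, "label": 0},   # covered, correct
    {"prev_ma3": 0.25, "cur_ma3": 0.05, "label": 1},    # covered, wrong
    {"prev_ma3": 0.90, "cur_ma3": 0.10, "label": 0},    # not covered
    {"prev_ma3": -0.10, "cur_ma3": 0.30, "label": 0},   # covered, correct
]
accuracy, coverage = evaluate_rule(rows, rule_1, predicted_label=0)
```

Reporting the two numbers together matters: a rule can reach high accuracy simply by firing on very few days, so accuracy is only meaningful alongside coverage.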

Rule 2 used the 5-day absolute moving average and performed better than Rule 1, with an accuracy of 91.09% when predicting a small change and a coverage of 42.09% (Fig. 6). This is an improvement of roughly 13 percentage points over the 78.5% base rate.

Fig. 6 Rule 2 statistical data. R2: (Previous Abs MA5 ≤ 0.516) ∧ (Current Abs MA5 ≤ 0.516) → Abs Mov = 0

Rule 3 was created to predict absolute price changes above the 1% threshold. Since Abs MA5 performed well in Rule 2, it was used again. Rule 3 had an accuracy of 50.62% and a coverage of 14.12% (Fig. 7), compared with a base rate of 21.43% for changes above 1%.

Fig. 7 Rule 3 statistical data. R3: (Previous Abs MA5 ≥ 1) ∧ (Current Abs MA5 ≥ 1) → Abs Mov = 1
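Rules 2 and 3 share the same inputs and can be combined into one small classifier. The sketch below is one possible arrangement, not necessarily how the rules were applied in the study: the function name `classify_abs_ma5` is hypothetical, and returning `None` when neither rule fires is a design choice of this example.

```python
from typing import Optional


def classify_abs_ma5(prev_abs_ma5: float, cur_abs_ma5: float,
                     low: float = 0.516, high: float = 1.0) -> Optional[int]:
    """Apply Rule 2, then Rule 3; return None if neither rule covers the day."""
    if prev_abs_ma5 <= low and cur_abs_ma5 <= low:
        return 0    # Rule 2: small move expected (Abs Mov = 0)
    if prev_abs_ma5 >= high and cur_abs_ma5 >= high:
        return 1    # Rule 3: large move expected (Abs Mov = 1)
    return None     # no rule fires; the day is left unclassified


calm = classify_abs_ma5(0.40, 0.50)
turbulent = classify_abs_ma5(1.20, 1.00)
mixed = classify_abs_ma5(0.40, 1.30)
```

Because the two conditions cannot hold at the same time (a value cannot be both at most 0.516 and at least 1), the rules never conflict, and their reported coverages add up to 42.09% + 14.12% = 56.21% of days. The remaining days would need a default class or another model.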

4.3 K-means classifier

Based on the previous analysis, the Consecutive Days and MA5 Absolute features were fed to a K-means classifier. With three clusters (3-means), the classifier had an accuracy of 85%. It correctly classified 88% of the Absolute Percent Change Movement of Group 0 and 51% of Group 1 (Fig. 8).

Fig. 8 3-means classifier
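K-means is a clustering method, so using it as a classifier needs an extra step that maps each cluster to a class. The sketch below shows one common way to do this, a majority vote of the training labels in each cluster. The text does not describe the mapping actually used, so this is an assumption, as are the naive initialization (the first k points), the unscaled features, and the invented sample points given as (Consecutive Days, Abs MA5).

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], k: int,
           max_iter: int = 100) -> Tuple[List[Point], List[int]]:
    """Lloyd's algorithm with a naive start: the first k points as centroids."""
    centroids: List[Point] = list(points[:k])
    assignment: Optional[List[int]] = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break                       # converged: nothing moved
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, assignment or []


def majority_labels(assignment: Sequence[int], labels: Sequence[int],
                    k: int) -> List[Optional[int]]:
    """Most frequent training label in each cluster (None if cluster is empty)."""
    votes = [Counter() for _ in range(k)]
    for a, y in zip(assignment, labels):
        votes[a][y] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]


# Invented (Consecutive Days, Abs MA5) points and labels, for illustration only.
points = [(-3, 1.4), (1, 0.3), (2, 0.4), (-4, 1.6),
          (1, 0.5), (-3, 1.2), (5, 0.2), (6, 0.3)]
labels = [1, 0, 0, 1, 0, 1, 0, 0]

centroids, assignment = kmeans(points, k=3)
cluster_labels = majority_labels(assignment, labels, k=3)
predictions = [cluster_labels[a] for a in assignment]
```

On real data the two features have very different ranges, so standardizing them before clustering and using a better initialization than the first k points would usually be advisable.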

5 Conclusion

All three machine learning methods improved on the base probabilities of a low or high Absolute Percent Change Movement. The rule-based classifier had the highest accuracy, 91.09%, when predicting a low percent change in prices, while the K-means classifier gave the best prediction of a high percent change, with 51% accuracy.

Moving-average features combined with machine learning methods performed well in predicting S&P 500 movements. In this project, a decision tree with 87% accuracy was created. Three rules, the most accurate of which reached 91%, were derived from the decision tree. The features selected from the most accurate rule (Rule 2) were then used in a 3-means classifier, which reached an overall accuracy of 85% and correctly classified 88% of Group 0.

Technical and machine learning analysis made it possible to predict the size of S&P 500 index movements with high accuracy. Future work will apply Random Forest and Convolutional Neural Network (CNN) models to increase the accuracy of predictions. In addition, the methods introduced here (decision tree, rule-based classifier, and K-means) can be applied to predict other stock indices.