A Machine Learning Approach to Predict the S&P 500 Absolute Percent Change

Models of the stock market often focus on predicting the direction of the market, either up or down. Instead of following that approach, this paper creates a model for the daily absolute percent change of the S&P 500. An accurate model of this metric would greatly increase the profitability of option trading strategies such as straddles and iron condors. In this publication, novel features were created from historical data and fed to machine learning algorithms such as Decision Trees, Rule Based Classifiers, K-mean Clusters, and Kernels. Based on our findings, Decision Trees and Kernels showed an accuracy of 80% when predicting Absolute Percent Change, while Rule Based Classifiers had an accuracy of 88% with a coverage of 38% of the data.


Introduction
Recently, there has been growing interest in using machine learning and data analysis methods to study the stock market. The main objective is often to predict the direction in which a stock will trade.
The S&P 500 index tracks the weighted performance of the 500 most valuable companies in the US. It is often used by economists to measure the health of the US stock market and the overall economy [1]. The massive adoption of this index has increased interest from retail and institutional investors, leading to exponential growth in the trading of ETFs and options on the index. Modeling the performance of the index is therefore increasingly valuable.
Multiple factors influence the price of a stock. Among these are the earnings season, wars, political news, and traders' psychology [2]. There are two main approaches to predicting stock performance: fundamental analysis and technical analysis. Fundamental analysis focuses on how a company is performing and on the success of the relevant industry and market. Information such as earnings per share, dividend yield, price-to-earnings ratio, and fair market value of the stock are important metrics in fundamental analysis [3].
In order to predict possible price movements, some traders use technical analysis to find trends or patterns. This discipline employs past data such as price movement and volume to identify trading opportunities [4]. Although past performance does not correlate highly with future returns, people's emotions in the market do repeat themselves. Technical analysts create compound features from historical data. These technical indicators may find recurring patterns within the stock [5].
Plenty of machine learning and artificial intelligence papers propose models that attempt this kind of prediction. Some have compared machine learning models against technical analysis indicators such as Bollinger bands. A key insight from this line of research is that there is no statistically significant difference between the two approaches. However, it is worth noting that both outperform the index return by almost 1-fold [6]. Other models look at the relationship between social sentiment and the stock price to draw conclusions about future price movements [7].
Brunhuemer et al. evaluated a machine learning model that focuses on trading straddles, that is, buying put and call options at the same strike price. They feed volatility and other technical analysis data to the machine learning model. However, they did not find any significant signs of outperforming their previous trading strategy [8]. Nevertheless, their work suggests that predicting volatility can be as useful as predicting a specific market direction.
Machine learning techniques such as support vector regression, decision trees, k-nearest neighbor, random forest, and multilayer perceptron (MLP) have been used to predict the stock market. Techniques such as MLP and LSTM have shown promising results [9]. In this paper, a combination of technical analysis and machine learning methods is used to create models that can improve the likelihood of accurately forecasting the absolute movement of the S&P 500 index. The predictions focus on large versus small changes in the index. An absolute change of less than 1% is coded as a small change (Group 0), while a change above 1% is coded as a large change (Group 1).
The reasoning for this is that S&P 500 options have a high likelihood of yielding over 100% daily return when the index moves more than 1% in either direction.
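The coding of the target can be illustrated with a minimal sketch. The function names are hypothetical, and the handling of a move of exactly 1% is an assumption, since the text does not say which group the boundary belongs to.

```python
def daily_percent_change(open_price: float, close_price: float) -> float:
    """Percent difference between the opening and closing price."""
    return (close_price - open_price) / open_price * 100.0


def movement_group(open_price: float, close_price: float, threshold: float = 1.0) -> int:
    """Return 0 for a small absolute move and 1 for a large one.

    Assumption: a move of exactly `threshold` percent counts as small.
    """
    abs_change = abs(daily_percent_change(open_price, close_price))
    return 1 if abs_change > threshold else 0


# A 0.5% gain is a small change; a 1.5% loss is a large change.
print(movement_group(100.0, 100.5))
print(movement_group(100.0, 98.5))
```

Because the label uses the absolute value, a 1.5% drop and a 1.5% rise fall into the same group, which matches the direction-neutral payoff of the option strategies discussed above.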

Preprocessing
Historical data for the S&P 500 from January 2000 until August 2022 were taken from Yahoo Finance [10]. From the original dataset, which includes the date, price at open, highest price of the day, lowest price of the day, price at close, and volume, new features were constructed and preprocessed to work with machine learning methods. Two main groups of features were created: features based on percentages and features based on days.

Data Description
Before feeding the data into machine learning models, data analysis was performed to investigate any correlations between the features. Looking at Daily Percent Change, the likelihood that the S&P 500 will close with an Absolute Percent Change above 1% is 21.43%. Conversely, the probability that it will close with a percent change between −1% and +1% is 78.5% (Fig. 1). Furthermore, the volatility feature showed that intraday movements above 1% are more common than the closing figures suggest. On average, volatility is 1.34%, and there is a 54% chance that intraday volatility is above 1% (Fig. 2). A K-mean correlation matrix was also computed, and a correlation between Consecutive Days and Absolute Percent Change was found. Longer streaks were shown to reduce the likelihood of a large Absolute Percent Movement. Similarly, Consecutive Days ranging from −4 to +2 saw the highest probabilities of producing a movement above 3% (Fig. 3).

Decision Tree
Decision Trees are among the most widely used machine learning methods due to their easy implementation and robustness. These trees split continuous or discrete datasets based on Gini impurity or entropy until a stopping criterion is met [12]. The entire 22-year dataset was fed into the algorithm, resulting in an accuracy of 80%; however, when the dataset was restricted to the period starting in 2009, after the 2008 financial crisis, the accuracy of the model rose to 87% (Fig. 4). The low Gini impurity (~0.26) indicates that the resulting splits are relatively pure.
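To make the split criteria concrete, the following sketch computes Gini impurity and entropy for the labels in a single tree node. It uses the standard textbook definitions, not the paper's actual code, and the 85/15 class mix in the example is made up for illustration.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum(
        (count / n) * math.log2(count / n) for count in Counter(labels).values()
    )


# Hypothetical node holding 85 Group 0 days and 15 Group 1 days.
node = [0] * 85 + [1] * 15
print(round(gini_impurity(node), 3))
print(round(entropy(node), 3))
```

For two classes the Gini impurity ranges from 0 (a pure node) to 0.5 (an even mix), so a value near 0.26 corresponds to a node dominated by one group.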

Rule Based Classifier
From multiple iterations of Decision Trees, rule-based classifiers were created to predict whether the absolute percent movement ends the trading session below 1% (Group 0) or above that threshold (Group 1). The first rule used 3-day moving averages prior to the current day, together with current-day data, to predict a low percent movement. The results showed an accuracy of 88.14% with a coverage of 38.62% (Fig. 5).
Note that the chance of a day ending the session with a percent change below 1% is 78.5%, as seen previously (Fig. 1). This rule therefore yields an improvement of roughly 10 percentage points over that base rate.
Rule 2 follows a similar pattern of looking at moving averages; however, this time a 5-day absolute moving average (Abs MA5) was used. This rule performed better than Rule 1, exhibiting an accuracy of 91.00% when predicting a low percent movement, with a coverage of 42.09% (Fig. 6). This is an improvement of roughly 13 percentage points over the naïve estimate.
A third rule was created to predict when the absolute percent change would be above the 1% threshold. Since Abs MA5 performed well in the previous rule, it was used again. Rule 3 had an accuracy of 50.62% and a coverage of 14.12% (Fig. 7). Note that a naïve estimate of large changes, based on their frequency in the data, would be accurate 21.43% of the time (Fig. 1). The accuracy of this rule is more than twice that figure.
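The accuracy and coverage figures quoted for the rules can be read with the help of a short sketch. The definitions used here are assumptions, since the text does not spell them out: coverage is taken as the share of all days on which a rule fires, and accuracy as the share of those days on which the predicted group is correct. The function name, the toy data, and the 0.65 cutoff are all hypothetical; the actual rule thresholds are not given in the text.

```python
def evaluate_rule(fires, predicted_group, actual_groups):
    """Return (accuracy, coverage) for a rule that predicts a single group.

    fires[i] is True when the rule applies to day i.
    Assumed definitions: coverage = fired days / all days,
    accuracy = correct fired days / fired days.
    """
    if len(fires) != len(actual_groups):
        raise ValueError("fires and actual_groups must have the same length")
    covered = [actual for fired, actual in zip(fires, actual_groups) if fired]
    coverage = len(covered) / len(actual_groups) if actual_groups else 0.0
    if not covered:
        return float("nan"), coverage
    correct = sum(1 for actual in covered if actual == predicted_group)
    return correct / len(covered), coverage


# Hypothetical toy data: 5-day absolute moving average and the true group.
abs_ma5 = [0.40, 0.50, 1.60, 0.30, 2.00, 0.60, 0.45, 1.20]
actual = [0, 0, 1, 0, 1, 1, 0, 0]

# Hypothetical rule: predict Group 0 when Abs MA5 is below a made-up cutoff.
fires = [value < 0.65 for value in abs_ma5]
accuracy, coverage = evaluate_rule(fires, predicted_group=0, actual_groups=actual)
print(f"accuracy={accuracy:.2f} coverage={coverage:.3f}")
```

Under these definitions, the reported figures compare with the base rates as follows: 88.14% against 78.5% is a gain of about 9.6 percentage points, 91.00% against 78.5% is a gain of 12.5 percentage points, and 50.62% against 21.43% is a ratio of about 2.36.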

K-mean Classifier
Based on the previous analysis, the Consecutive Days and MA5 Absolute features were fed to a K-mean Classifier algorithm. Using 3 nearest neighbors, this method achieved an accuracy of 85%. On closer examination, it correctly classified 88% of the Absolute Percent Change Movement of Group 0 and 51% of Group 1 (Fig. 8).
Notably, the performance of the K-mean Classifier is quite similar to that of the Rule-Based Classifier. This is perhaps not surprising, since both are based on the same inputs.
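Since the classifier is described as using 3 nearest neighbors, the following sketch shows a plain k-nearest-neighbors majority vote on the two features named above. Treating the method as k-nearest neighbors is an assumption based on that description; the function name and the toy data are hypothetical, and no feature scaling is applied.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points closest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: (Consecutive Days, Abs MA5) pairs and their group.
train_points = [(3, 0.4), (4, 0.3), (5, 0.5), (-2, 1.8), (-3, 2.1), (-1, 1.5)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (4, 0.45)))
print(knn_predict(train_points, train_labels, (-2, 1.9)))
```

Because the two features live on different scales, the Euclidean distance here is dominated by Consecutive Days; whether and how the features were scaled is not stated in the text.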

Conclusion
Looking at the results, all three machine learning methods improved on the base probabilities of a low or high Absolute Percent Change Movement. The Rule Based Classifier had the highest accuracy, 91.09%, when predicting a low percent change, while the K-mean Classifier gave the best prediction of a high percent change, with 51%. Moving averages proved to be useful predictive features, and more research into them should be done in future experiments. The data analysis shows that higher accuracy can be achieved by combining methods. There is also a tradeoff in the coverage that can be achieved. Overall, the methods shown can be profitable when used with a straddle or strangle option trading strategy.

Declarations
Funding There is no funding or grant related to this project.

Author Contribution
Percentage features are all data points derived from changes among attributes of the original dataset, expressed as percentages. The features that belong to this group are:
Daily Percent Change - the percent difference between opening and closing price
Overnight Percent Change - the percent change between the previous day's close and the current day's open
Percentage Volatility - the percent change between the daily low and high
5-day and 3-day Percent Moving Average - the moving average of Daily Percent Change and Absolute Percent Change over 3 or 5 days
Absolute Percent Change - the absolute value of the Daily Percent Change
All these features were then passed through binarization and discretization to facilitate implementation in the algorithms.
Day features use the characteristics of the trading day to create data points. The features that belong to this group are as follows:
Day of the Week
Consecutive Red Days - the number of consecutive days that have closed below the open price
Consecutive Green Days - the number of consecutive days that have closed above the open price
Discretization Consecutive Days - the merging of the Red Days and Green Days
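The percentage and day features described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the function and key names are hypothetical, volatility is measured relative to the daily low, the merged Consecutive Days feature is encoded as a signed streak (positive for green days, negative for red days), and the moving average is a trailing mean that includes the current value.

```python
def build_features(rows):
    """Compute per-day features from chronological OHLC rows.

    rows: list of dicts with 'open', 'high', 'low', 'close' keys.
    """
    features = []
    streak = 0  # Assumed encoding: +n for n green days, -n for n red days.
    for i, row in enumerate(rows):
        daily = (row["close"] - row["open"]) / row["open"] * 100.0
        if i == 0:
            overnight = None  # No previous close for the first day.
        else:
            prev_close = rows[i - 1]["close"]
            overnight = (row["open"] - prev_close) / prev_close * 100.0
        # Assumption: the low is the base of the low-to-high percent change.
        volatility = (row["high"] - row["low"]) / row["low"] * 100.0
        if daily > 0:
            streak = streak + 1 if streak > 0 else 1
        elif daily < 0:
            streak = streak - 1 if streak < 0 else -1
        else:
            streak = 0
        features.append({
            "daily_pct": daily,
            "overnight_pct": overnight,
            "volatility_pct": volatility,
            "abs_pct": abs(daily),
            "consecutive_days": streak,
        })
    return features


def moving_average(values, window):
    """Trailing mean over `window` values; None until enough data exists."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)
        else:
            result.append(sum(values[i + 1 - window:i + 1]) / window)
    return result


# Hypothetical prices for three trading days.
rows = [
    {"open": 100.0, "high": 102.0, "low": 99.0, "close": 101.0},
    {"open": 101.0, "high": 104.0, "low": 100.5, "close": 103.02},
    {"open": 104.0, "high": 104.5, "low": 102.0, "close": 102.96},
]
feats = build_features(rows)
print([f["consecutive_days"] for f in feats])
print(moving_average([f["abs_pct"] for f in feats], 3))
```

The Day of the Week feature is omitted because it needs only the calendar date. To use only days prior to the current one, as the first rule describes, the moving-average output would be shifted by one position.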

Figures
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 7