Abstract
Financial news is one of the most influential sources for making long-term investment decisions. The goal of this paper is to learn whether it is possible, based on texts from news headlines, to select stocks with low forecasted volatility which outrun the market represented by the Standard and Poor’s 500 Index. We solve several binary classification problems using a range of machine learning and deep learning principles. The best classifiers are interpreted with a model-agnostic technique, LIME, to extract key words having the greatest impact on the probabilities of market outrun and low volatility.
1 Introduction
Both beginner and expert stock market participants frequently study news when making investment decisions. What makes news both valuable and challenging to analyze is that it has a certain endogenous effect. On the one hand, all positive or negative events that ought to be immediately reflected in the stock price are described in the news. On the other hand, the news is itself one of the most influential sources of information for making investment decisions. Bad news can create a strong negative drift according to Chan (2003). Hence, if news encourages buying or, conversely, selling certain assets, it becomes a key factor changing the market value of shares (Andreassen, 1987). However, not all news provokes market participants’ desire to invest in a particular stock, as this depends on their skepticism and risk aversion. Speculating based on news is therefore not trivial.
We can distinguish two types of news—financial and operations-specific (non-financial) news. To illustrate, both news titles “Tesla Reported Record Profits” (financial) and “Volkswagen falls further behind Tesla in the race to electric” (non-financial) have a positive sentiment making the reader slightly more inclined to purchase the TSLA ticker. Moreover, news can entail two different types of risk. News can describe a firm-specific event having no effect on the value of other companies and impacting one particular stock. In contrast, the news may involve greater systematic risk as some events influence the economy or the market as a whole (usually in a negative way). Distinguishing various types helps us understand that the fact of news publication can influence the value of the shares differently.
Compiling a headline for a news article is an art in itself. A title, a short sentence, is able to convey the meaning of the whole article. With an increased pace of life, people often tend to only scan the headlines (Tillinghast, 2001) to form an opinion about a company. We therefore focus our analysis on the headlines to decompose their impact on the companies’ stock prices. Apart from observing trivial indicators of stock growth, words such as “soars”, “positive”, etc., we would also like to extract hidden patterns that are unobservable by most individuals but able to affect the long-term value of the shares.
We observe that there is no uniform news provider on the Internet (Fig. 1), which makes it more difficult for investors to analyze all this data. Automatic sentiment analysis by natural language processing (NLP) therefore becomes an essential tool for stock selection. Investors may benefit from machine analysis of news headlines by NLP algorithms, as manually studying titles for all assets traded on NYSE and NASDAQ seems infeasible.
We define and formulate the following research questions to be addressed in this study:
1. Is it possible to label the future stock trend and volatility a quarter ahead using only news titles? How accurately can it be done?
2. What are the text entities that have the greatest influence on classification probabilities?
In the following work, we discuss the scientific literature relevant to the fields of financial sentiment analysis and machine learning. We observe that few papers concern the long-term effect of news and even fewer consider the degree of price volatility, which may be as important a factor for making investment decisions as the direction of the drift. The trend and volatility are modeled using a stochastic differential equation. We focus on three binary classification problems: (1) labeling the stocks that outrun the market, (2) distinguishing low-volatility and high-volatility stocks with a positive trend, and (3) separating low-volatility and high-volatility stocks with a non-positive trend. For prediction, we apply versatile machine and deep learning methods such as logistic regression, support vector machines, simple and long short-term memory neural networks, and FastText. Having evaluated the results by a range of classification metrics, we interpret the best model using the LIME procedure. The most influential entities present in the news headlines corpus are extracted.
2 Literature Review
We start our analysis with a discussion of relevant publications. Financial sentiment analysis is a popular field of research: many renowned researchers have tried to evaluate the financial sentiment of various text sources and have released many multifaceted articles. Nevertheless, there is still considerable room for further analysis. News and Twitter and their connection to the stock market have been favorite areas of study over the last decade owing to their accessibility and possible applied value.
The first paper underlying our analysis is by Xing et al. (2020). The work is of the greatest importance for us as it, firstly, discusses the latest developments in financial sentiment analysis (FSA) and, secondly, describes three different classes of state-of-the-art algorithms from recent publications on financial sentiment. To have good coverage of different types of methods, the authors investigate eight representative models from three clusters: lexicon-based (procedures like OpinionLex, SenticNet, and L&M), machine learning-based (SVM and fastText), and deep learning NLP models (bi-LSTM, S-LSTM, and BERT). Lexicon-based methods showed the poorest performance among all approaches, making them unattractive for our work. Classical statistical and deep learning models demonstrated a reasonable fit, and each class dominated in various scenarios, so we can exploit some of these models in our problem.
Many papers have tried to observe and measure the effects of different Internet sources on stock market movement. The work by Mittal and Goel (2012) conducts a financial sentiment analysis of Twitter posts in order to see an association between public and market sentiment. The authors mention that, regardless of the wide acceptance of the Efficient Market Hypothesis, which implies that market prices follow a random walk pattern, it is still possible to extract valuable inference from the posts. Based on the same dataset, Bollen and Mao (2011) also highlight the idea of textual information being a relevant price predictor: they achieved a high accuracy of 87% when labeling upside and downside movements of the Dow Jones Industrial Average.
There are many decent publications in which the authors tried to forecast the trend of a stock using news as a key predictor. One such paper (Mindell, 1961), which describes the financial intuition behind the movement of the drift caused by a news release, was published sixty years ago. Although most of the financial phenomena it describes still take place today, some of the notions may be considered outdated due to digitalization and the online availability of news. Additionally, at the time that article was written, statistical machine learning was not as applicable as it is today. The research may benefit from powerful methods like neural networks. For example, Ren et al. (2020) managed to predict the BIAS (the percentage change of the close price on the i-th day after news publication) with high accuracy using a long short-term memory neural network. We can use the architecture developed in that paper in our research.
Another relevant publication on the use of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) in stock market prediction (Vargas et al., 2017) forecasts intraday directional movements of the Standard & Poor’s 500 Index by using financial news titles and technical indicators. The study is interesting for us as it describes the most sophisticated deep-learning architectures for NLP. Furthermore, the S&P500 Index is a proxy for the whole market in our study, so the structure of the networks described in the paper may be connected to our forecasts.
For a more detailed description of deep convolutional neural network architecture, we consult Ding et al. (2015), where the authors, using simulated data close to our problem with S&P500 historical prices, developed a model for event-driven forecasting. Importantly, the authors consider not only the near-term impact but longer horizons as well. Additionally, the paper compares the developed methods with other well-performing NN architectures from other publications, covering a wide range of neural-network-related methods.
Apart from deep learning methods, we are going to implement more classical approaches from statistical learning such as logistic regression and support vector machines. One of the most relevant works is Sueno et al. (2020). The paper describes and compares different ways of transforming textual data into a numeric format to be fitted as factors. Among all procedures, we choose tf-idf as the main text vectorization approach for our analysis.
Other works exploring the connection between news and market prices are Schumaker and Chen (2009) and Li et al. (2014). The first paper studied 9,211 news articles and more than 10 million stock quotes over five weeks and used SVM to predict the future stock price 20 minutes after a news release. Results indicated that using terms from the news articles together with the price at the time of the article gave the best performance, with the lowest MSE and a 2.06% return. The paper states that the Proper Nouns scheme achieved better results than the popular Bag of Words in all considered metrics. The second paper implements six different prediction principles and shows that the use of the Loughran-McDonald financial sentiment dictionary (LMD) has the best performance. However, the authors claim that focusing primarily on the positive and negative dimensions of inputs does not bring useful predictions, so more sophisticated approaches should extend the study.
In order to represent stock prices in a formal way, we apply a mathematical model in the form of a stochastic differential equation. The simplest model, which nevertheless has shown a close fit in many scenarios, as mentioned by Marathe and Ryan (2005) and Reddy and Clinton (2016), is Geometric Brownian Motion (GBM). Our study aims to combine this approach from mathematical finance with applied machine learning, a combination that was not often practiced in prior works.
Since we want to forecast the trend and volatility, we also need to consult with a proper method that estimates these two parameters simultaneously. The work by Croghan et al. (2017) describes three different procedures to obtain the estimates for \(\mu\) and \(\sigma\). We will exploit the Maximum Likelihood estimators formulated in the article to compile various target variables for our inference.
While many publications focus mostly on the short- or moderate-term sentiment of one entity at a time, our study considers a horizon that is more relevant to an ordinary investor. In contrast to short-term AI trading algorithms, which can create orders on timescales of microseconds and earn on slight differences in prices, regular shareholders usually do not seek quick profits and prefer holding assets for much longer periods.
Additionally, it may not be wise to govern the investment decisions based only on one sentiment. Many headlines bring only noise to the trend as, if a news article entails a negative sentiment about a firm, holding it is not necessarily a loss and can even benefit its owner in the long-run. Therefore, we also want to consider the cumulative effect of the news to develop a profitable strategy.
The third contribution of this work is that it investigates the headlines’ effects on volatility, which can be a key factor for or against investing, whereas many researchers have studied only the impact of news on trends. In standard finance and portfolio theory, historical volatility represents the risk of an asset, as more fluctuating shares are usually riskier. While a minority of investors is interested in highly volatile assets, as greater dispersion is an opportunity for highly abnormal returns, in a risk-averse world most bullish investors prefer higher returns only if they do not pose a higher risk. Besides, foreseeing more volatile assets may be advantageous for building other strategies. One such way is investing not in the stocks directly but in options written on those stocks. For example, a long straddle brings a higher net profit the more the stock fluctuates, regardless of the direction of the drift.
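The straddle remark can be illustrated with a minimal sketch of the profit at expiry of a long straddle (one call and one put with the same strike). The function name and the example premiums are hypothetical and serve only as an illustration; transaction costs and early exercise are ignored.

```python
def long_straddle_profit(spot_at_expiry, strike, call_premium, put_premium):
    """Profit at expiry of a long straddle: long one call and one put, same strike."""
    call_payoff = max(spot_at_expiry - strike, 0.0)  # call pays if the price rises
    put_payoff = max(strike - spot_at_expiry, 0.0)   # put pays if the price falls
    return call_payoff + put_payoff - (call_premium + put_premium)


# A large move in either direction is profitable; no move loses both premiums.
for spot in (70.0, 100.0, 130.0):
    print(spot, long_straddle_profit(spot, 100.0, 5.0, 5.0))
```

The profit depends only on the absolute distance between the final price and the strike, which is why the position benefits from volatility rather than from the sign of the drift.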
3 Methodology
3.1 Trend and Volatility Modeling
Building a regression of the trend using only headlines, without important financials (such as P/E ratios, the EV/EBITDA multiple, etc.), sounds like an impossible task, and the obtained estimators would be unreliable. However, we do not need a regression: in order to generate abnormal profits, the direction of the future drift, \(\mu\), and some knowledge of its volatility, \(\sigma\), may suffice. Hence, we consider a problem that is in some sense simpler but still has applied value. Consider one stock as one observation in the dataset. For the drift component, we attempt to classify whether it exceeds the drift of S&P500 prices over the same time window, whereas for volatility we wish to distinguish the lower 0.25-quantile tail of adjusted sigmas from the rest (i.e., an observation is labeled as 1 if it is among the 25% least volatile stocks). We formulate three main target variables for the analysis:
Note that, for simplicity of narration, we call “growth stocks” only those whose trend direction variable equals 1.
We reckon that the headlines causing a stock to be more volatile may be different for growing and falling shares. Besides, investors usually demand a higher risk premium for stocks with a greater downside sensitivity (Ang et al., 2006). Therefore, to achieve higher performance when classifying volatilities, we consider two different samples of the initial dataset, one where the drift is positive and one where it is non-positive. We cannot know the direction of the future drift, so we may consider its estimate or the real drift over the last quarter as a proxy.
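A minimal sketch of how the three labels described above could be computed once the drift and adjusted-volatility estimates are available. The function name, the use of an inclusive quartile, and the choice to split the sample by the trend label itself are assumptions made for illustration; the text does not confirm these details.

```python
from statistics import quantiles


def make_targets(mu_hat, sigma_adj, mu_market):
    """Return (trend, low_vol) label lists, one entry per stock.

    trend[i]   = 1 if the stock's drift estimate exceeds the market drift.
    low_vol[i] = 1 if the stock's adjusted volatility is below the first
                 quartile of its own trend group (assumed grouping).
    """
    trend = [1 if m > mu_market else 0 for m in mu_hat]
    low_vol = [0] * len(mu_hat)
    for group in (1, 0):
        members = [i for i, t in enumerate(trend) if t == group]
        if len(members) < 2:
            continue  # too few stocks in the group for a meaningful quartile
        values = [sigma_adj[i] for i in members]
        threshold = quantiles(values, n=4, method="inclusive")[0]
        for i in members:
            low_vol[i] = 1 if sigma_adj[i] < threshold else 0
    return trend, low_vol


mu = [0.30, 0.20, 0.25, 0.40, -0.10, 0.00, 0.05, 0.08]
sig = [0.1, 0.2, 0.3, 0.4, 0.8, 0.5, 0.6, 0.7]
trend, low_vol = make_targets(mu, sig, mu_market=0.10)
print(trend, low_vol)
```

Computing the quartile within each trend group mirrors the idea of treating growing and falling shares as separate samples.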
Instead of calculating the sample mean and variance of price changes, which would probably result in similar estimates of \(\mu\) and \(\sigma\), we propose using a more formal model involving both parameters simultaneously. Geometric Brownian Motion (GBM), used by Black and Scholes in their framework (Black & Scholes, 1973), describes the price of a stock by the stochastic differential equation (4). It says that the stock price grows with a drift value \(\mu\) and involves a risk described by Brownian motion, scaled by the volatility parameter \(\sigma\). Despite its simple appearance, the model generates time series that are quite close to real stock prices.
\(X_t\) denotes the price of a stock, \(\mu\) and \(\sigma\) are the true trend and volatility.
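For reference, a sketch of the textbook GBM dynamics that equation (4) refers to, together with its well-known closed-form solution. The symbols \(W_t\) (a standard Brownian motion) and \(X_0\) (the initial price) are notation assumed here, and the layout may differ from the original display equations.

\[
dX_t = \mu X_t\,dt + \sigma X_t\,dW_t ,
\qquad
X_t = X_0 \exp\left( \left(\mu - \tfrac{1}{2}\sigma^2\right) t + \sigma W_t \right).
\]

Under this solution the log returns over a step \(\Delta t\) are independent and normally distributed with mean \(\left(\mu - \tfrac{1}{2}\sigma^2\right)\Delta t\) and variance \(\sigma^2 \Delta t\), which is what makes likelihood-based estimation of \(\mu\) and \(\sigma\) tractable.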
We want to find a proper way of estimating the GBM parameters given the historical prices. In contrast to many models entailing a system of differential equations solved numerically, the main benefit of the GBM is the availability of a closed-form solution, which allows us to use Maximum Likelihood Estimation (MLE) or other estimation procedures to find the estimates for the trend and volatility. Three different pairs of estimators were described by Croghan et al. (2017), and the third option, which results in the lowest MSE, is chosen for our purposes, where (6) is an estimator for the drift and (7) is the expression for the volatility.
In a similar fashion, we find \({{\hat{\mu}}}_{{\text{S}\&\text{P500}}}\) for the corresponding time period. Using the estimators obtained analytically, we fit real stock prices for each ticker and save the estimates to merge them with the corresponding news.
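The estimation step can be sketched with the standard log-return maximum likelihood estimators for GBM. This is a generic textbook version and is only assumed to be close to estimators (6) and (7); the exact expressions of Croghan et al. (2017) may differ. The function name and the convention of 252 trading days per year are assumptions.

```python
import math


def gbm_mle(prices, dt):
    """Log-return MLE of GBM drift and volatility for equally spaced prices.

    prices: sequence of positive prices; dt: time step in years.
    """
    log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    n = len(log_returns)
    mean = sum(log_returns) / n
    # The MLE of the variance divides by n rather than n - 1.
    var = sum((r - mean) ** 2 for r in log_returns) / n
    sigma_hat = math.sqrt(var / dt)
    # Log returns have mean (mu - sigma^2 / 2) * dt, so the correction is added back.
    mu_hat = mean / dt + 0.5 * sigma_hat ** 2
    return mu_hat, sigma_hat


# Noise-free check: a price path growing at exactly 10% per year.
dt = 1.0 / 252.0
path = [100.0 * math.exp(0.10 * i * dt) for i in range(64)]
print(gbm_mle(path, dt))
```

Applying the same function to the index prices over the same window would give the market drift used as the threshold for the trend label.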
3.2 Volatility Adjustment
Apart from the basic measures against overfitting described in the Data Preparation section of the Appendix, we make another adjustment to the data used in classifying volatility. Some assets tend to be more or less volatile depending on the company itself and the industry it operates in. Industries related to technologies, oil, and healthcare entail more fluctuating stocks, while the electric and water utilities sectors demonstrate low and moderate changes (Moran, 2020). Hence, the obtained volatilities should be normalized by some universal measure that captures both firm-specific and industry-related dynamics. One such measure is the market beta, the sensitivity of the asset to overall market fluctuations. Theoretically, it can be calculated as the slope of a linear regression of stock return changes onto changes in the return of the market portfolio, as in regression (8). Since we cannot consider all the stocks in the universe, in practice we again use the S&P500 Index as a proxy. To obtain the estimate of the market beta, \({\hat{\beta }}\), formula (9) is applied. The adjustment itself is (10).
Since some market sensitivities can be negative, a simple division by \({{\hat{\beta }}}_i\) would result in negative adjusted volatility. Therefore, betas are shifted by the value of the least beta, \(\nu\). We do not wish to divide by zero when \({{\hat{\beta }}}_i = \nu\), so we add a small constant \(\epsilon\) to the denominator. As has been said, some industries are much less volatile, and since the threshold is defined by a quantile function, which is a relative function, without adjustment we would only pick stocks from the least volatile industries such as health care and completely omit stocks from, say, technology. The adjustment, to some extent, sorts the stocks by their real rather than nominal volatility. As we can observe from the violin plots in Fig. 2, the distributions of the resulting trend values are similar for stocks with positive and zero direction indicators of the drift. Unlike the volatility distribution for growth stocks, the distribution for shares with a negative trend is much more positively skewed, which consequently may have a negative impact on the performance of the corresponding models.
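A minimal sketch of the adjustment described above, under two assumptions that the text implies but does not show: the beta estimate is the usual ratio of the covariance with the market return to the market variance, and the adjusted volatility is the volatility divided by the shifted beta plus \(\epsilon\). Function names and the default value of `eps` are hypothetical.

```python
def market_beta(stock_returns, market_returns):
    """OLS slope of stock returns on market returns: cov(r_i, r_m) / var(r_m)."""
    n = len(market_returns)
    mean_s = sum(stock_returns) / n
    mean_m = sum(market_returns) / n
    cov = sum((s - mean_s) * (m - mean_m)
              for s, m in zip(stock_returns, market_returns))
    var = sum((m - mean_m) ** 2 for m in market_returns)
    return cov / var


def adjust_volatility(sigmas, betas, eps=1e-6):
    """Divide each volatility by its beta shifted by the smallest beta, plus eps."""
    nu = min(betas)  # the least beta; shifting makes every denominator positive
    return [s / (b - nu + eps) for s, b in zip(sigmas, betas)]


market = [0.01, -0.02, 0.03, 0.00]
stock = [2.0 * r for r in market]  # a stock that moves twice as much as the market
print(market_beta(stock, market))
print(adjust_volatility([0.2, 0.3], [-0.5, 1.5]))
```

One consequence of this form is that the stock with the smallest beta receives a very large adjusted volatility, because its denominator is only `eps`.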
3.3 Text Vectorization
Most machine learning models are designed to work with numeric inputs rather than text data. Besides, the performance of the models relies not only on the final classifier but also on the vectorization procedure. Therefore, we need to pay great attention to the choice of text vectorization technique.
For statistical learning methods (Logit and SVM), we apply the tf-idf (term frequency-inverse document frequency) transformation, given by (13). The metric is widely practiced in NLP and shows the importance of each word in the document.
Each d is a document, which in our case is a merged string of all headlines for a particular stock within some quarter. D is the corpus of documents, i.e., the collection of all merged news titles for all considered stocks throughout the period 2020Q3–2021Q4. Each term t in each document is assigned the product of its term frequency in that document and its inverse document frequency, which decreases with the share of documents that include the term.
For neural networks (NN and LSTM), we apply default Keras embeddings to vectorize the text inputs. An embedding is a procedure that maps the text input to a dense vector of fixed size; its weights are initialized from a uniform distribution and trained during the fitting process of the neural network.
For FastText, the data is not vectorized manually; we input the text itself to the model, as it has an internal mechanism for text embedding.
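The weighting can be sketched in plain Python as follows. The unsmoothed variant used here, relative term frequency times \(\log(N / \mathrm{df}_t)\), is an assumption: formula (13) and common library implementations may use different smoothing or normalization. The example headlines are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (whitespace tokenization)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["tesla reported record profits", "tesla falls"]
for row in tf_idf(docs):
    print(row)
```

A term that occurs in every document, such as the ticker name in this toy corpus, receives a weight of zero, which is the intended down-weighting of uninformative words.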
3.4 Baseline
Knowledge about future price fluctuations is in high demand among investors but rather difficult to obtain. Even advanced forecasting algorithms are not perfect in terms of prediction accuracy, as too many (usually unobservable) factors should be taken into consideration. The algorithms showing reasonable performance are combinations of specialized stacked models based on different types of both text data (e.g., headlines, news articles, Twitter, Facebook, earnings reports, and earnings call transcripts) and numeric data (e.g., quarterly reports, financials, and multiples), so we do not expect an outstanding result from any model fitted only on news titles. We do not need to be accurate for all stocks available in the market; it is enough to be more certain about a few of them in order to outrun the market portfolio. While the performance may not be ideal, we still need to choose a reasonable baseline to ensure that our predictions are not inferior.
In standard finance theory, the Efficient Market Hypothesis (EMH) states that all information affecting a stock is already reflected in its current price, so, given a perfectly competitive structure, the opportunity to generate excess profits is excluded (Burton & Malkiel, 2003; Malkiel, 2003). The implication is that stock price changes are completely stochastic and cannot be foreseen. However, comparing the models with a single random classifier may give uncertain results, as its accuracy depends on the random seed. A null classifier, though too trivial for many tasks, is quite suitable as a baseline here, as it approximates the mean of many random classifiers (see Note 1).
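A minimal sketch of such a null classifier, assuming (as stated in the Notes) that it always predicts the majority class of the training set. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def null_classifier_accuracy(y_train, y_test):
    """Predict the training-set majority class everywhere; return (class, accuracy)."""
    majority = Counter(y_train).most_common(1)[0][0]
    accuracy = sum(1 for y in y_test if y == majority) / len(y_test)
    return majority, accuracy


# With a 75% / 25% class balance the accuracy baseline is already 0.75.
print(null_classifier_accuracy([0, 0, 0, 1], [0, 1, 0, 0]))
```

For an imbalanced target such as the low-volatility label, where only about a quarter of the observations are positive, this accuracy baseline is high by construction, so it is worth reading it together with a ranking metric.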
3.5 Logistic Regression
The logistic regression (a.k.a. Logit) is a popular classification model in the field of natural language processing. The method applies a sigmoid function to the linear combination of the numeric inputs, where the model coefficients are obtained using maximum-likelihood estimation. Even though the marginal effect of each explanatory variable is not constant in the model, it is still highly applicable and interpretable, especially if combined with the tf-idf transform.
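The remark about non-constant marginal effects can be illustrated with a short sketch. For a fitted logit with probability \(p\), the marginal effect of feature \(j\) is \(\beta_j\,p\,(1-p)\), so it depends on where the observation lies. The coefficients below are made-up numbers, not estimates from the paper.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, coef, intercept):
    """Probability of class 1: sigmoid of the linear combination of inputs."""
    return sigmoid(intercept + sum(w * v for w, v in zip(coef, x)))


def marginal_effect(x, coef, intercept, j):
    """Derivative of the predicted probability with respect to feature j."""
    p = predict_proba(x, coef, intercept)
    return coef[j] * p * (1.0 - p)


coef, intercept = [1.0, -1.0], 0.0  # hypothetical coefficients
print(marginal_effect([0.0, 0.0], coef, intercept, 0))  # largest effect, at p = 0.5
print(marginal_effect([4.0, 0.0], coef, intercept, 0))  # much smaller, near p = 1
```

The sign of each coefficient still tells whether a term pushes the probability up or down, which is what keeps the model interpretable when the inputs are tf-idf weights.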
3.6 Support Vector Machines
SVM is another popular supervised classification algorithm that draws a separating hyperplane in the n-dimensional feature space (Tong & Koller, 2001). By solving a convex optimization problem, the method maximizes the margin, i.e., the distance between the hyperplane and the support vectors (the observations that lie on the margin lines). Usually, real-world tasks such as the stock trend classification problem do not allow us to place a hyperplane in the initial feature space that easily. Applying different kernels helps to separate the classes with a non-linear boundary: by implicitly creating augmented variables from the initial predictors, we find the decision boundary in a higher-dimensional space and then consider its projection onto the initial space as the boundary. Instead of a classic polynomial function, we use a sigmoid kernel to achieve a higher prediction accuracy.
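For illustration, the sigmoid kernel in its common form \(K(x, y) = \tanh(\gamma \langle x, y \rangle + c)\) can be sketched as below. The parameter names `gamma` and `coef0` and their default values are assumptions and are not the settings used in the study.

```python
import math


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    """Sigmoid (hyperbolic tangent) kernel between two feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return math.tanh(gamma * dot + coef0)


def gram_matrix(rows, gamma=1.0, coef0=0.0):
    """Pairwise kernel values for a list of feature vectors."""
    return [[sigmoid_kernel(a, b, gamma, coef0) for b in rows] for a in rows]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in gram_matrix(vectors, gamma=0.5):
    print(row)
```

The classifier only ever needs these pairwise values, which is how the boundary can be non-linear in the original tf-idf space without constructing the augmented variables explicitly.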
3.7 Neural Network
The first deep learning model in our analysis is a basic neural network architecture. The model is a graph of connected layers of neurons in which each link carries a weight that is trained during the fitting procedure using gradient descent. Due to its design, the model can be quite powerful in extracting hidden patterns from text data, making the algorithm one of the most popular methods in NLP. For our task, we use several hidden dense layers with the ReLU activation function and one final layer with a sigmoid activation to map the outputs to probabilities in the range from 0 to 1. The model is trained by minimizing the binary cross-entropy loss with the Adam optimizer.
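The forward pass and loss of such a network can be sketched without any framework. The layer sizes and weights below are arbitrary illustrative values, not the architecture used in the study, and training (Adam updates) is deliberately left out.

```python
import math


def dense(x, weights, bias):
    """Fully connected layer; weights holds one row per output neuron."""
    return [sum(w * a for w, a in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def forward(x, hidden_layers, out_weights, out_bias):
    """ReLU hidden layers followed by a single sigmoid output neuron."""
    h = x
    for weights, bias in hidden_layers:
        h = relu(dense(h, weights, bias))
    return sigmoid(dense(h, [out_weights], [out_bias])[0])


def binary_cross_entropy(y_true, p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


hidden = [([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])]  # one hidden layer, 2 neurons
p = forward([2.0, 1.0], hidden, out_weights=[1.0, 1.0], out_bias=-1.0)
print(p, binary_cross_entropy(1, p))
```

In this toy configuration the output is exactly 0.5, so the loss equals \(\ln 2\), the value obtained by an uninformative prediction.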
3.8 LSTM
The early layers in a simple neural net are not trained as notably as the final layers, which creates a problem of short-term memory. Long short-term memory (LSTM) networks can mitigate this issue by adding special units with gates that decide which information should be passed further and which can be discarded (Gers et al., 2000). We expect the LSTM to catch a “snowball effect” of headlines, when positive or negative statements are followed by news with the same sentiment.
3.9 FastText
FastText is a model for text classification and word vector representation developed by Facebook’s AI Research lab. FastText uses a hierarchical classifier, and owing to the binary trees underlying the model, the amount of computation for each text is reduced significantly compared to the two previous algorithms. Hashing is used to improve the speed and memory efficiency of the n-gram mapping. Since our dataset contains more than 10 thousand observations, with each text being a merge of all news within a long period of time, the speed of model training becomes important. The method averages the n-gram embeddings, and a multinomial logit is the final classifier. As the loss function, skip-gram negative sampling (SGNS) is applied (Goldberg & Levy, 2014) (see Note 2).
3.10 LIME
Gaining insights from the best model may bring value to the study. We would like to know which words or phrases were the most influential in labeling the trend and the volatility of growth stocks. We base our inference on the best-performing models. There are many techniques for assessing the importance of factors in machine learning models. As the main tool for our analysis, we use Local Interpretable Model-agnostic Explanations (LIME), discussed in Ribeiro et al. (2016). Often, it is impossible to describe a complex decision boundary as a whole, so LIME approximates the black-box model locally, in the neighborhood of the prediction, by fitting a surrogate interpretable model in that local area. Mathematically, the algorithm tries to find a simple function g from a family of sparse linear models that minimizes the loss function (16), given the complex algorithm f (i.e., the predicting model) and the neighborhood of the observation x. A penalty \(\Omega \left( g\right)\) representing the complexity of the model g is added to the expression to keep the result simple. The kernel function \(\pi _x\) weights the generated neighboring points by their distances to x.
A subsample of 350 true-positive and 350 true-negative stocks with the highest likelihood of belonging to the corresponding classes was chosen to be analyzed by LIME. For each document, the 8 words having the greatest effect (in absolute value) on the probability of the positive outcome are extracted and saved. As measures of a word’s importance, we consider the mean value and the sum of such effects for a given word across the considered corpus. Some words, despite having a high mean influence, appear infrequently and are therefore removed from the analysis.
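For reference, the LIME objective as formulated by Ribeiro et al. (2016), which is presumably what expression (16) refers to, can be written with the symbols defined above. Here \(G\) is the family of interpretable models, \(Z\) is the set of perturbed samples around \(x\), \(z'\) is the interpretable representation of \(z\), \(D\) is a distance function, and \(h\) is the kernel width; the letter \(h\) is chosen here to avoid a clash with the volatility \(\sigma\).

\[
\xi(x) = \underset{g \in G}{\arg\min}\; \mathcal{L}\left(f, g, \pi_x\right) + \Omega\left(g\right),
\qquad
\mathcal{L}\left(f, g, \pi_x\right) = \sum_{z, z' \in Z} \pi_x(z)\,\bigl(f(z) - g(z')\bigr)^2,
\qquad
\pi_x(z) = \exp\left(-\frac{D(x, z)^2}{h^2}\right).
\]

The coefficients of the fitted \(g\) are the per-word effects on the predicted probability that are aggregated in the analysis.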
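The aggregation step described above can be sketched as follows, assuming each document's explanation is already available as a list of (word, effect) pairs. The function name, the `min_count` threshold, and the toy effects are hypothetical; the text does not state the frequency cutoff that was used.

```python
from collections import defaultdict


def aggregate_word_effects(explanations, min_count=2):
    """Mean and sum of per-document effects for each word.

    explanations: list of per-document lists of (word, effect) pairs.
    Words that appear in fewer than min_count explanations are dropped.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for document in explanations:
        for word, effect in document:
            totals[word] += effect
            counts[word] += 1
    return {
        word: {"mean": totals[word] / counts[word],
               "sum": totals[word],
               "count": counts[word]}
        for word in totals
        if counts[word] >= min_count
    }


toy = [[("dividends", 0.25), ("steel", -0.125)],
       [("dividends", 0.5)],
       [("ipo", -0.25)]]
print(aggregate_word_effects(toy))
```

Reporting both the mean and the sum separates words that are strong but rare from words that are moderately influential and frequent.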
4 Results and Discussion
Tables 1, 2, and 3 describe the classification performance for the future trend and for the future volatility of growing and falling stocks, respectively. From Table 1 and the left ROC curve in Fig. 3, we can see that more complex models had higher accuracy, with LSTM being the most successful among all and becoming the model that we investigate using LIME. However, none of the models surpasses the accuracy baseline. We can also see that the ROC curves coincide, which may represent a kind of performance boundary. To be more certain about the drift, we would need to include numerical financials, such as P/E indicators, EBITDA, etc. Even though the performance of the models is not ideal, all the methods managed to pass the ROC-AUC baseline in this subproblem, which again casts some doubt on the validity of the EMH.
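The difference between the accuracy baseline and the ROC-AUC baseline can be made concrete with a rank-based sketch of ROC-AUC: the share of (positive, negative) pairs in which the positive observation receives the higher score, with ties counted as one half. The function name and the toy scores are illustrative only.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a positive outranks a negative."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # a ranking with some skill
print(roc_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))   # constant scores: no ranking
```

A classifier that outputs the same score for every stock, such as the null classifier, obtains 0.5 regardless of class balance, so a model can exceed the ROC-AUC baseline while still falling short of the majority-class accuracy.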
We expected disastrous results for volatility classification. However, we were surprised to observe that the models performed better in volatility classification than in trend labeling. From Tables 2 and 3 and the center and right ROC curves in Fig. 3, we can see that FastText, NN, and Logit outperformed the other approaches significantly. Interestingly, SVM and LSTM showed poorer performance for stocks with a positive trend in comparison with the falling stocks, which may be an indication of different text factors affecting the volatilities. Consequently, the models showed slightly worse AUC for falling stocks.
Although Logit and FastText did not show the highest scores in all problems, the models still demonstrated a reasonable fit and are worth considering in similar tasks given their simplicity and speed of training.
Based on the most influential extracts provided in Table 4, we draw the following insights. As expected, words such as ‘dividends’, ‘technology’, and ‘revenues’ improve the likelihood of the market outrun. Headlines involving entities such as ‘foods’, ‘pharmaceuticals’, and ‘software’ also enhance the chances of class 1. Investors may consider the companies operating in these fields for further investigation.
The words such as ‘predictable’, ‘crossover’, ‘commercial’ had a great negative impact on the first target variable. Additionally, ‘bankcorp’, ‘bloomberg’, ‘steel’ and ‘hotels’ were mostly associated with negative importance (Fig. 4). Surprisingly, unlike ‘technology’, ‘technologies’ had a negative effect. We reckon that it may be related to the context where these two words are used.
For stocks with positive drift, words such as ‘preferred’, ‘undervalued’, ‘insurance’, ‘yield’, ‘fed’, ‘vaccines’ and ‘california’ were indicators of a smaller risk. Similar to the positive trend, ‘dividends’ again favored the positive outcome in this category. The titles entailing ‘pharmaceuticals’, ‘airlines’, ‘ipo’, ‘unaudited’, ‘therapeutics’, ‘drug’, and ‘inflation’ brought greater uncertainty to the holder of the corresponding assets. It is important to observe words such as ‘oil’ or ‘energy’ as indicators of lesser certainty, which supports the validity of our model, as oil- and energy-related industries are notoriously more volatile. Again, a surprising observation was that the word ‘technologies’ in the plural form, in contrast to its singular counterpart, could reflect less certainty in holding such a share.
5 Conclusions
All in all, we have managed to develop models that work better than a random classifier, which poses another challenge to the efficient market hypothesis. This implies that the use of news headlines can be a sufficient approach to select several shares for an investor with a bullish strategy and risk-averse preferences. The models fitted on the falling stocks can still be attractive for bearish investors. Besides, being able to foresee different types of volatility (both low and high) may benefit options traders. For example, a falling and noisy stock can be quite attractive for the holders of the corresponding long put or straddle positions. Furthermore, we have also extracted key words and phrases that appear to drastically affect the values of trend and volatility.
As for ideas on how to improve this work, we could consider more powerful pre-trained and domain-specific text-classification models such as those from the Hugging Face repository. Other vectorization techniques such as improved Naïve Bayes with Laplace smoothing (Sueno et al., 2020), Bag-of-Words (Wallach, 2006) and Word2Vec (Goldberg & Levy, 2014) could be applied with the methods. The embeddings could be used with another weight-initialization procedure, as this has proven to drastically affect the results of neural nets (Glorot & Bengio, 2010; He et al., 2015). Although we have cleaned the majority of the companies’ names, the cleaning was not ideal, and, as a result, words such as “stifel” and “dass” were left in the dataset. We also observed that the grammatical number of some words had opposite effects on classification probabilities, so the sensitivity to singular and plural forms may be researched further. As for the text inference part, the LIME analysis could be complemented by another interpretation approach such as SHapley Additive exPlanations (SHAP) (Nohara et al., 2019), which uses slightly different methods for explaining the models and could bring more value to our inference. Finally, the performance metrics, especially for trend classification, may be improved by other approaches or by further calibration of the considered models.
Data Availability
The results of the paper are not financial recommendations, as they would need to be supplemented with technical analysis. The information is provided for educational purposes only and does not constitute financial advice, investment advice, trading advice, or any other recommendation. The reader should not make an investment decision using the information presented in the paper without undertaking due diligence and consulting a professional broker or financial advisor. The generated dataset is available upon reasonable request.
Notes
To improve the accuracy of the classifier, the model predicts the majority class of the train set.
Apart from the models discussed above, we also considered FinBERT and a basic neural network with initial weights from GloVe. Unexpectedly, these two models did not show superior performance and only replicated the results of the initial fine-tuned neural networks. We therefore exclude them from the final summary.
References
Andreassen, P. B. (1987). On the social psychology of the stock market: Aggregate attributional effects and the regressiveness of prediction. Journal of Personality and Social Psychology, 53(3), 490.
Ang, A., Chen, J., & Xing, Y. (2006). Downside risk. The Review of Financial Studies, 19(4), 1191–1239.
Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of Political Economy, 81(3), 637–654.
Bollen, J., & Mao, H. (2011). Twitter mood as a stock market predictor. Computer, 44(10), 91–94.
Burton, G., & Malkiel, B. G. (2003). The efficient market hypothesis and its critics (summary). https://www.cfainstitute.org/en/research/cfa-digest/2003/11/the-efficient-market-hypothesis-and-its-critics-digest-summary.
Chan, W. S. (2003). Stock price reaction to news and no-news: Drift and reversal after headlines. Journal of Financial Economics, 70(2), 223–260.
Croghan, J., Jackman, J., & Min, K. J. (2017). Estimation of geometric Brownian motion parameters for oil price analysis. In IIE annual conference proceedings (pp. 1858–1863).
Ding, X., Zhang, Y., Liu, T., & Duan, J. (2015). Deep learning for event-driven stock prediction. In Twenty-fourth international joint conference on artificial intelligence.
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10), 2451–2471.
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 249–256).
Goldberg, Y., & Levy, O. (2014). word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE international conference on computer vision (pp. 1026–1034).
Li, X., Xie, H., Chen, L., Wang, J., & Deng, X. (2014). News impact on stock price return via sentiment analysis. Knowledge-Based Systems, 69, 14–23.
Malkiel, B. G. (2003). The efficient market hypothesis and its critics. Journal of Economic Perspectives, 17(1), 59–82.
Marathe, R. R., & Ryan, S. M. (2005). On the validity of the geometric Brownian motion assumption. The Engineering Economist, 50(2), 159–192. https://doi.org/10.1080/00137910590949904
Mindell, J. (1961). How news affects market trends. Financial Analysts Journal, 17(1), 31–34.
Mittal, A., & Goel, A. (2012). Stock prediction using Twitter sentiment analysis. Stanford University, CS229 (2011 http://cs229.stanford.edu/proj2011/GoelMittal-StockMarketPredictionUsingTwitterSentimentAnalysis.pdf) 15, 2352.
Moran, M. (2020, Jan). Performance and volatility for sectors in the 2010s. https://www.spglobal.com/en/research-insights/articles/performance-and-volatility-for-sectors-in-the-2010s
Nohara, Y., Matsumoto, K., Soejima, H., & Nakashima, N. (2019). Explanation of machine learning models using improved Shapley additive explanation. In Proceedings of the 10th ACM international conference on bioinformatics, computational biology and health informatics (pp. 546–546).
Reddy, K., & Clinton, V. (2016). Simulating stock prices using geometric Brownian motion: Evidence from Australian companies. Australasian Accounting, Business and Finance Journal, 10(3), 23–47.
Ren, Y., Liao, F., & Gong, Y. (2020). Impact of news on the trend of stock price change: An analysis based on the deep bidirectional LSTM model. Procedia Computer Science, 174, 128–140.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1135–1144).
Schumaker, R. P., & Chen, H. (2009). Textual analysis of stock market prediction using breaking financial news: The AZFin text system. ACM Transactions on Information Systems (TOIS), 27(2), 1–19.
Sueno, H. T., Gerardo, B. D., & Medina, R. P. (2020). Multi-class document classification using support vector machine (SVM) based on improved Naïve Bayes vectorization technique. International Journal of Advanced Trends in Computer Science and Engineering, 9(3), 3937.
Tillinghast, T. (2001, Jul). Are you reading the trade press correctly? some say no. https://www.clickz.com/are-you-reading-the-trade-press-correctly-some-say-no/61661/?amp=1
Tong, S., & Koller, D. (2001). Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2, 45–66.
Vargas, M. R., de Lima, B. S. L. P., & Evsukoff, A. G. (2017). Deep learning for stock market prediction from financial news articles. In 2017 IEEE international conference on computational intelligence and virtual environments for measurement systems and applications (CIVEMSA) (pp. 60–65). https://doi.org/10.1109/CIVEMSA.2017.7995302.
Wallach, H. M. (2006). Topic modeling: Beyond bag-of-words. In Proceedings of the 23rd international conference on machine learning (pp. 977–984). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/1143844.1143967.
Xing, F., Malandri, L., Zhang, Y., & Cambria, E. (2020). Financial sentiment analysis: An investigation into common mistakes and silver bullets. In Proceedings of the 28th international conference on computational linguistics (pp. 978–987).
Funding
Open access funding provided by EPFL Lausanne. The author declares that no funds, grants, or other support were received during the preparation of this manuscript.
Ethics declarations
Competing interests
The author has no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
1.1 Data Preparation
1.1.1 Collection
We scrape the historical share prices, news headlines, publication dates, and source names using the financialmodelingprep.com API. In addition, we collect some relevant financial characteristics of each company, such as market beta and the number of floating shares. These numeric financials are not used as predictors but are needed to filter the obtained dataset.
For 4671 different symbols, we iteratively download the news from each quarter within the range 2020Q3–2021Q4 and the prices from the current and following quarter. The procedure is visualized in Fig. 5. The GBM SDE is fitted to the price time series according to the procedure described in the Trend and Volatility Modeling section, and the estimates are assigned to the corresponding slice. We deliberately consider the prices within and beyond a quarter to possibly catch market momentum anomalies. The algorithm yields 721,163 collected news headlines. As a measure against overfitting, we apply an out-of-time train-test split, placing observations from 2021Q3 and 2021Q4 into the extrapolated test set and leaving the rest for training and validation.
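The out-of-time split described above can be illustrated with a minimal sketch. The record layout (a dictionary with hypothetical `symbol` and `quarter` fields) is an assumption made for illustration; only the choice of 2021Q3 and 2021Q4 as test quarters comes from the text.

```python
from typing import Dict, List, Tuple

# Quarters held out for the extrapolated test set, as stated in the text.
TEST_QUARTERS = {"2021Q3", "2021Q4"}


def out_of_time_split(
    rows: List[Dict[str, str]],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Send the latest quarters to the test set and keep the rest for training/validation."""
    train = [r for r in rows if r["quarter"] not in TEST_QUARTERS]
    test = [r for r in rows if r["quarter"] in TEST_QUARTERS]
    return train, test


# Toy observations; the field names and symbols are hypothetical.
rows = [
    {"symbol": "AAA", "quarter": "2020Q3"},
    {"symbol": "AAA", "quarter": "2021Q2"},
    {"symbol": "BBB", "quarter": "2021Q3"},
    {"symbol": "BBB", "quarter": "2021Q4"},
]
train_rows, test_rows = out_of_time_split(rows)
```

Splitting by time rather than at random keeps every test observation strictly later than the training data, which is what makes the test set "extrapolated".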
1.1.2 Capitalization Drop
Stocks with low capitalization are too unpredictable, and only speculators, possibly having some insider information, may benefit from trading those assets. Such observations may bring unnecessary noise to the training data, biasing the classifiers and making it more difficult for our model to extract the key phrases from the underlying headlines. Therefore, we propose another kind of filtering which removes all tickers with capitalization lower than the threshold of $1 billion. The capitalization is obtained as follows:
$1 billion is a psychological threshold for many investors, as companies below this value are considered too small to enter. Additionally, the drop ensures that there are enough news titles, as we do not wish to base our prediction on 2–3 headlines even if they are extremely positive.
Even though the filter reduces the size of the collected dataset significantly, from 20,859 to 11,364 observations, such cleaning removes uncertain assets from consideration beforehand, ensuring that more reliable shares are placed into the final portfolio.
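As a rough illustration of the capitalization drop, the sketch below assumes that capitalization is computed as the share price multiplied by the number of shares (for instance, the floating shares collected earlier). That definition, the field names and the toy figures are assumptions for illustration and not necessarily the exact formula used in the paper.

```python
from typing import Any, Dict, List

CAP_THRESHOLD_USD = 1_000_000_000  # the $1 billion threshold from the text


def market_cap(price: float, shares: float) -> float:
    """Assumed definition: share price times number of shares."""
    return price * shares


def drop_small_caps(stocks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Remove tickers whose assumed capitalization is lower than the threshold."""
    return [
        s for s in stocks
        if market_cap(s["price"], s["shares"]) >= CAP_THRESHOLD_USD
    ]


# Toy tickers; symbols, prices and share counts are made up.
stocks = [
    {"symbol": "AAA", "price": 50.0, "shares": 40_000_000},   # 2.0 billion
    {"symbol": "BBB", "price": 5.0, "shares": 100_000_000},   # 0.5 billion
    {"symbol": "CCC", "price": 10.0, "shares": 100_000_000},  # exactly 1.0 billion
]
kept_symbols = [s["symbol"] for s in drop_small_caps(stocks)]
```

A ticker sitting exactly at the threshold is kept here, since the text only removes tickers with capitalization lower than $1 billion.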
1.1.3 Cleaning
Now we describe the data cleaning procedure. First, irrelevant information should be removed from the text, so we proceed with basic data preprocessing. We drop special symbols, punctuation, and website links including “https”, “www.”, “.com”, etc. All characters are converted to lower case so that the algorithms treat “Bubble” and “bubble” as the same word. We remove stopwords, that is, words which occur frequently in the language but do not carry additional or special meaning: articles (“a”, “the”), prepositions (“on”, “to”, “in”, “at”, etc.), conjunctions (“and”, “because”, “but”, etc.) and other words.
Since the collected dataset contains overlapping time intervals, the problem of overfitting may occur. Instead of analyzing the overall sentiment of a sentence, a “bad” model would base its answer mostly on some specific words. We also want to avoid predicting the stock growth using only a company’s name or its symbol. If the stock used to grow or fall in the past, observing its name will make the model replicate prior trends as a prediction. For example, while Tesla has been associated with significant growth in its market value for the last two years, we do not want our models to consider “Tesla” or “TSLA” as an indicator or factor. Therefore, we attempt to eliminate any substring which directly references a firm, so companies’ short and long names, as well as their stock symbols, are removed from the strings. In addition, we wish to investigate the overall sentiment of not only one entry but all headlines within a prolonged period of time. Therefore, once we have collected and cleaned all news titles for each stock within a quarter, we merge them into a single string to be passed to the models (Table 5).
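The cleaning steps above can be sketched as follows. This is a simplified illustration, not the exact pipeline: the stopword list is a tiny sample, the regular expressions are assumptions, and company references are removed token by token, which is cruder than true substring matching.

```python
import re
from typing import Iterable, List

# A tiny illustrative stopword list; the list actually used is not given in the text.
STOPWORDS = {"a", "the", "on", "to", "in", "at", "and", "because", "but"}

# Assumed patterns for website links and for special symbols/punctuation.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+|\S+\.com\S*")
NON_ALNUM = re.compile(r"[^a-z0-9\s]")


def clean_headline(text: str, company_terms: Iterable[str]) -> str:
    """Lowercase, strip links and punctuation, drop stopwords and company references."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    text = NON_ALNUM.sub(" ", text)
    banned = set(STOPWORDS)
    for term in company_terms:
        # Normalize each name or symbol the same way as the headline text.
        banned.update(NON_ALNUM.sub(" ", term.lower()).split())
    return " ".join(tok for tok in text.split() if tok not in banned)


def merge_quarter(headlines: List[str], company_terms: Iterable[str]) -> str:
    """Clean every headline of one stock and quarter, then merge into a single string."""
    terms = list(company_terms)
    cleaned = (clean_headline(h, terms) for h in headlines)
    return " ".join(c for c in cleaned if c)


example = clean_headline(
    "Tesla (TSLA) Reported Record Profits in Q3! https://example.com/x",
    ["Tesla", "TSLA"],
)
merged = merge_quarter(
    [
        "Tesla Reported Record Profits",
        "Volkswagen falls further behind Tesla in the race to electric",
    ],
    ["Tesla", "TSLA"],
)
```

With these inputs, `example` becomes `"reported record profits q3"`, and `merged` joins both cleaned headlines into one string per stock and quarter.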
1.2 Performance Evaluation
To evaluate the performance of the models and compare them, we consider the classification metrics used by the authors of the underlying articles. The first and standard metric in all classification problems is accuracy (19), which shows the fraction of correct predictions. The second metric, the Matthews correlation coefficient, also known as the Phi-coefficient, is specially designed for unbalanced classes (Xing et al., 2020) and formulated as (20). F1-Score, Precision and Recall measures are also applied to compare the predictive performance of the models and are calculated as (21–23). We apply a probability threshold of 0.5 for all classifiers.
We supplement our analysis with the investigation of the receiver operating characteristic (ROC) curves as well as the comparison of the area under these curves (AUC). ROC is a diagram showing the relationship between the False Positive and True Positive Rates of the underlying model’s forecast. A more powerful classifier would result in a ROC curve closer to the upper-left corner, while an inferior one would lie around or below the diagonal line.
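For reference, the metrics named above can be computed from the four confusion-matrix counts using their standard textbook definitions. The sketch below is illustrative: the function names are hypothetical, and treating a probability exactly equal to 0.5 as a positive prediction is an assumption.

```python
import math
from typing import Dict, Sequence


def confusion_counts(
    y_true: Sequence[int], y_prob: Sequence[float], threshold: float = 0.5
) -> Dict[str, int]:
    """Count TP, FP, TN, FN after thresholding the predicted probabilities."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0  # assumption: ties count as positive
        if pred == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["fn" if y == 1 else "tn"] += 1
    return counts


def classification_metrics(c: Dict[str, int]) -> Dict[str, float]:
    """Accuracy, precision, recall, F1 and MCC; undefined ratios are reported as 0."""
    tp, fp, tn, fn = c["tp"], c["fp"], c["tn"], c["fn"]
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Toy labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
metrics = classification_metrics(confusion_counts(y_true, y_prob))
```

Unlike accuracy, MCC uses all four counts, which is why it remains informative when one class dominates.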
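A minimal sketch of how a ROC curve and its AUC can be obtained from labels and predicted scores is given below. It sweeps each distinct score as a threshold and integrates with the trapezoidal rule; the function names are hypothetical and the data are toy values.

```python
from typing import List, Sequence, Tuple


def roc_points(
    y_true: Sequence[int], y_score: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) pairs, sweeping thresholds from the highest score down."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC needs at least one positive and one negative label")
    points = [(0.0, 0.0)]
    for thr in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points: Sequence[Tuple[float, float]]) -> float:
    """Area under a piecewise-linear curve given by points sorted by FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy labels and predicted scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
curve = roc_points(labels, scores)
auc = auc_trapezoid(curve)
```

A classifier that ranks every positive above every negative reaches an AUC of 1, whereas uninformative scores give a curve near the diagonal and an AUC close to 0.5.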
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Karzanov, D. Headline-Driven Classification and Local Interpretation for Market Outperformance and Low-Risk Stock Prediction. Comput Econ (2023). https://doi.org/10.1007/s10614-023-10449-5
DOI: https://doi.org/10.1007/s10614-023-10449-5