1 Introduction

Investigating news for investment decisions is a common practice among both beginner and expert stock market participants. What makes news both valuable and challenging to analyze is that it has a certain endogenous effect. On the one hand, all positive or negative events that ought to be immediately reflected in the stock price are described in the news. On the other hand, news is one of the most influential sources of information for making investment decisions. Bad news can create a strong negative drift according to Chan (2003). Hence, if news encourages buying or, conversely, selling certain assets, it consequently becomes a key factor changing the market value of shares (Andreassen, 1987). But not every news item can provoke market participants’ desire to invest in a particular stock, as this depends on their skepticism and risk aversion. So speculating based on news is far from trivial.

We can distinguish two types of news: financial and operations-specific (non-financial). To illustrate, both the headline “Tesla Reported Record Profits” (financial) and “Volkswagen falls further behind Tesla in the race to electric” (non-financial) carry a positive sentiment, making the reader slightly more inclined to purchase the TSLA ticker. Moreover, news can entail two different types of risk. A news item can describe a firm-specific event that has no effect on the value of other companies and impacts one particular stock. In contrast, it may involve greater systematic risk, as some events influence the economy or the market as a whole (usually negatively). Distinguishing these types helps us understand that the publication of a news item can influence the value of shares in different ways.

Compiling a headline for a news article is an art in itself: a title, a single short sentence, must successfully convey the meaning of the whole article. With an increased pace of life, people often tend to only scan the headlines (Tillinghast, 2001) to form an opinion about a company. We therefore focus our analysis on the headlines to decompose their impact on companies’ stock prices. Apart from observing trivial indicators of stock growth, such as the words “soars”, “positive”, etc., we would also like to extract hidden patterns that are unobservable to most individuals but able to affect the long-term value of the shares.

We observe that there is no single dominant news provider on the Internet (Fig. 1), which makes it more difficult for investors to analyze all this data. This makes automatic sentiment analysis via natural language processing an essential tool for stock selection. Investors may benefit from machine analysis of news headlines by NLP algorithms, as studying the titles of all assets traded on the NYSE and NASDAQ seems an infeasible task.

Fig. 1 The count of news publishers in the dataset

We define and formulate the following research questions to be addressed in this study:

  1. Is it possible to label the future stock trend and volatility a quarter ahead using only news titles? How accurately can it be done?

  2. What are the text entities that have the greatest influence on the classification probabilities?

In the following work, we discuss the scientific literature relevant to the fields of financial sentiment analysis and machine learning. We observe that few papers concern the long-term effect of news, and even fewer consider the degree of price volatility, which may be as important a factor for investment decisions as the direction of the drift. The trend and volatility are modeled using a stochastic differential equation. We focus on three binary classification problems: (1) labeling the stocks that outrun the market, (2) distinguishing low-volatility and high-volatility stocks with a positive trend, and (3) separating low-volatility and high-volatility stocks with a non-positive trend. For prediction, we apply versatile machine and deep learning methods such as logistic regression, support vector machines, simple and long short-term memory neural networks, and FastText. Having evaluated the results with a range of classification metrics, we interpret the best model using the LIME procedure and extract the most influential entities present in the news headlines corpus.

2 Literature Review

We start our analysis with a discussion of relevant publications. While financial sentiment analysis is a popular field of research, with many renowned researchers evaluating the sentiment of various kinds of text and publishing multifaceted articles, there is still considerable room for further analysis. News and Twitter, and their connection to the stock market, have been favorite subjects of study over the last decade owing to their accessibility and potential applied value.

The first paper underlying our analysis is by Xing et al. (2020). The work is of particular importance for us as it, firstly, discusses the latest developments in financial sentiment analysis (FSA) and, secondly, describes three classes of state-of-the-art algorithms from recent publications. To achieve good coverage of different types of methods, the authors investigate eight representative models from three clusters: lexicon-based (procedures such as OpinionLex, SenticNet, and L&M), machine learning-based (SVM and fastText), and deep learning NLP models (bi-LSTM, S-LSTM, and BERT). Unfortunately, the lexicon-based methods showed the poorest performance among all approaches, making them unattractive for our work. Classical statistical and deep learning models demonstrated a reasonable fit, and each class dominated in different scenarios, so we can exploit some of these models in our problem.

Many papers have tried to observe and measure the effects of different Internet sources on stock market movements. The work by Mittal and Goel (2012) conducts a financial sentiment analysis of Twitter posts in order to find an association between public and market sentiment. The authors note that, regardless of the wide acceptance of the Efficient Market Hypothesis, which implies that market prices follow a random walk, it is still possible to extract valuable inference from the posts. Another paper that highlights the idea of textual information being a relevant price predictor, based on the same dataset, is Bollen and Mao (2011), which achieved a high 87% accuracy when labeling upside and downside movements of the Dow Jones Industrial Average.

There are many decent publications in which the authors tried to forecast the trend of a stock using news as a key predictor. One such paper (Mindell, 1961), which describes the financial intuition behind drift movements caused by news releases, was published sixty years ago. Although most of the financial phenomena it describes still occur today, some of its notions may be considered outdated due to digitalization and the online availability of news. Additionally, at the time that article was written, statistical machine learning was not as applicable as it is today, and such research could benefit from powerful methods like neural networks. For example, Ren et al. (2020) managed to predict the BIAS (the percentage change of the close price on the i-th day after news publication) with high accuracy using a long short-term memory neural network. We can use the architecture developed in that paper in our research.

Another relevant publication on the use of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) in stock market prediction (Vargas et al., 2017) forecasts intraday directional movements of the Standard & Poor’s 500 Index using financial news titles and technical indicators. The study is interesting for us as it describes some of the most sophisticated deep learning architectures for NLP. Furthermore, the S&P 500 Index serves as the proxy for the whole market in our study, so the network structures described in the paper may be relevant to our forecasts.

For a more detailed description of deep convolutional neural network architectures, we consult Ding et al. (2015), where the authors, using data close to our problem (S&P 500 historical prices), developed a model for event-driven forecasting. Importantly, the authors consider not only the near-term impact but longer horizons as well. Additionally, the paper compares the developed methods with other well-performing NN architectures from other publications, covering a wide range of neural-network-related methods.

Apart from deep learning methods, we also implement more classical approaches from statistical learning, such as logistic regression and support vector machines. One of the most relevant works is Sueno et al. (2020). The paper describes and compares different ways of transforming textual data into a numeric format to be fitted as factors. Among these procedures, we choose tf-idf as the main text vectorization approach for our analysis.

Other works exploring the connection between news and market prices are Schumaker and Chen (2009) and Li et al. (2014). The first paper studied 9,211 news articles and more than 10 million stock quotes over five weeks and used SVM to predict the stock price 20 minutes after a news release. The results indicated that using terms from the news articles together with the price at the time of publication had the best performance, with the lowest MSE and a 2.06% return; the paper also states that the Proper Nouns scheme achieved better results than the popular Bag of Words in all considered metrics. The second paper implements six different prediction principles and shows that using the Loughran-McDonald financial sentiment dictionary (LMD) gives the best performance. However, the authors claim that focusing primarily on the positive and negative dimensions of the inputs does not produce useful predictions, so more sophisticated approaches should extend the study.

In order to represent stock prices formally, we apply a mathematical model in the form of a stochastic differential equation. The simplest model, which has nevertheless shown a close fit in many scenarios, as noted by Marathe and Ryan (2005) and Reddy and Clinton (2016), is Geometric Brownian Motion (GBM). Our study incorporates this approach from mathematical finance into applied machine learning, a combination not often practiced in prior works.

Since we want to forecast the trend and volatility, we also need a proper method that estimates these two parameters simultaneously. The work by Croghan et al. (2017) describes three different procedures for obtaining estimates of \(\mu\) and \(\sigma\). We exploit the maximum likelihood estimators formulated in that article to compile the target variables for our inference.

While many publications focus mostly on the short- or moderate-term sentiment of one entity at a time, our study considers a horizon more relevant to an ordinary investor. In contrast to short-term AI trading algorithms, which can create orders on timescales of microseconds and earn on slight differences in prices, regular shareholders usually do not seek quick profits and prefer holding assets for much longer periods.

Additionally, it may not be wise to base investment decisions on a single sentiment. Many headlines bring only noise to the trend: even if a news article carries a negative sentiment about a firm, holding it is not necessarily a loss and may even benefit its owner in the long run. Therefore, we also consider the cumulative effect of the news to develop a profitable strategy.

The third contribution of this work is that it investigates the headlines’ effect on volatility, which can be a key factor for or against investing, while most researchers have studied only the impact of news on trends. In standard finance and portfolio theory, historical volatility represents the risk of an asset, as more fluctuating shares are usually riskier. Only a minority of investors is interested in highly volatile assets, since greater dispersion is an opportunity for highly abnormal returns; in a risk-averse world, most bullish investors prefer higher returns only if they do not come with higher risk. Besides, foreseeing more volatile assets may be advantageous for building other strategies, for example investing not in stocks directly but in options on those stocks: a long straddle brings a higher net profit the larger the stock’s fluctuation, regardless of the direction of the drift.

3 Methodology

3.1 Trend and Volatility Modeling

Building a regression for the trend using only headlines, without important financials (such as P/E ratios, the EV/EBITDA multiple, etc.), sounds like an impossible task, and the obtained estimators would be unreliable. However, we do not need a regression: in order to generate abnormal profits, the direction of the future drift, \(\mu\), and some knowledge of its volatility, \(\sigma\), may suffice. Hence, we consider a problem that is simpler in some sense but still has applied value. Consider one stock as one observation in the dataset. For the drift component, we attempt to classify whether it exceeds the drift of S&P 500 prices over the same time window, whereas for volatility we wish to distinguish the lower quartile (left 25% tail) of adjusted sigmas from the rest (i.e., an observation is labeled 1 if it is among the 25% least volatile stocks). We formulate three main target variables for the analysis:

$$\textbf{trend direction} := I(\hat{\mu} > \hat{\mu}_{\text{S\&P500}})$$
(1)
$$\textbf{growth volatility} := I(\hat{\sigma}' < q_1^g) \text{ such that } \Pr(D_{\sigma'} < q_1^g \mid \hat{\mu} > 0) = 0.25$$
(2)
$$\textbf{fall volatility} := I(\hat{\sigma}' < q_1^f) \text{ such that } \Pr(D_{\sigma'} < q_1^f \mid \hat{\mu} \le 0) = 0.25$$
(3)

Note that, for simplicity of narration, we call “growth stocks” only those whose trend direction variable equals 1.

We reckon that the headlines causing a stock to be more volatile may differ between growing and falling shares. Besides, investors usually demand a higher risk premium for stocks with greater downside sensitivity (Ang et al., 2006). Therefore, to achieve higher performance when classifying volatilities, we consider two separate samples of the initial dataset, one where the drift is positive and one where it is non-positive. Since we cannot know the direction of the future drift, we may use its estimate or the realized drift over the last quarter as a proxy.
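To make the construction of targets (1)-(3) concrete, the sketch below builds the three binary labels from a table of per-stock estimates. The column names (mu_hat, sigma_adj, where sigma_adj is the beta-adjusted volatility of Sect. 3.2) and the S&P 500 drift argument are hypothetical placeholders, not the exact variables used in our pipeline.

```python
import pandas as pd

def build_targets(est: pd.DataFrame, mu_sp500: float) -> pd.DataFrame:
    """Hypothetical sketch: 'est' has one row per stock-quarter with columns
    'mu_hat' (drift estimate) and 'sigma_adj' (beta-adjusted volatility)."""
    out = est.copy()
    # Eq. (1): does the stock's drift exceed the S&P 500 drift?
    out["trend_direction"] = (out["mu_hat"] > mu_sp500).astype(int)

    growth = out["mu_hat"] > 0   # growing stocks
    fall = ~growth               # non-positive drift

    # Eqs. (2)-(3): flag the 25% least volatile stocks within each subsample
    q_g = out.loc[growth, "sigma_adj"].quantile(0.25)
    q_f = out.loc[fall, "sigma_adj"].quantile(0.25)
    out.loc[growth, "growth_volatility"] = (out.loc[growth, "sigma_adj"] < q_g).astype(int)
    out.loc[fall, "fall_volatility"] = (out.loc[fall, "sigma_adj"] < q_f).astype(int)
    return out
```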

Instead of calculating the sample mean and variance of price changes, which would probably result in similar estimates of \(\mu\) and \(\sigma\), we propose a more formal model involving both parameters simultaneously. Geometric Brownian Motion (GBM), used by Black and Scholes in their framework (Black & Scholes, 1973), defines the return on a stock as the stochastic differential equation (4). It states that the return on the stock grows with a drift \(\mu\) and involves a risk described by a Brownian motion scaled by the volatility parameter \(\sigma\). Despite its simple appearance, the model generates time series that are quite close to real stock prices.

$$\begin{aligned} \frac{dX_t}{X_t} = \mu dt + \sigma dW_t \end{aligned}$$
(4)

\(X_t\) denotes the price of a stock, \(\mu\) and \(\sigma\) are the true trend and volatility.
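As an illustration of Eq. (4), the following sketch simulates a GBM price path via its exact log-increment solution; the parameter values are hypothetical.

```python
import numpy as np

def simulate_gbm(x0=100.0, mu=0.08, sigma=0.2, n_steps=252, dt=1/252, seed=0):
    """Simulate one GBM price path using the exact solution
    X_{t+dt} = X_t * exp((mu - sigma^2/2) * dt + sigma * sqrt(dt) * Z)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_steps)
    log_increments = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    return x0 * np.exp(np.concatenate([[0.0], np.cumsum(log_increments)]))

prices = simulate_gbm()  # array of n_steps + 1 simulated prices
```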

We want to find a proper way of estimating its parameters given historical prices. In contrast to many models entailing a system of differential equations solved numerically, the main benefit of GBM is the availability of a closed-form solution, allowing us to use maximum likelihood estimation (MLE) or other procedures to find estimates of the trend and volatility. Three different pairs of estimators were described by Croghan et al. (2017), and the third option, resulting in the lowest MSE, is chosen for our purposes: (6) is the estimator for the drift and (7) is the expression for the volatility.

$$\bar{X} = \frac{1}{n}\sum_{t=1}^{n}\log\left(\frac{X_t}{X_{t-1}}\right)$$
(5)
$$\hat{\mu} = \bar{X} + \frac{\hat{\sigma}^2}{2}$$
(6)
$$\hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{t=1}^{n}\left(\log\left(\frac{X_t}{X_{t-1}}\right) - \bar{X}\right)^{2}}$$
(7)

In a similar fashion, we find \(\hat{\mu}_{\text{S\&P500}}\) for the corresponding time period. Using the analytically obtained estimators, we fit real stock prices for each ticker and save the estimates to merge them with the corresponding news.
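A minimal sketch of the estimators in Eqs. (5)-(7), assuming a NumPy array of prices for one ticker over the quarter (per-period drift and volatility):

```python
import numpy as np

def gbm_mle(prices: np.ndarray):
    """Estimate per-period drift and volatility following Eqs. (5)-(7)."""
    log_returns = np.diff(np.log(prices))   # log(X_t / X_{t-1})
    x_bar = log_returns.mean()               # Eq. (5)
    sigma_hat = log_returns.std(ddof=1)      # Eq. (7): sample standard deviation
    mu_hat = x_bar + 0.5 * sigma_hat**2      # Eq. (6)
    return mu_hat, sigma_hat
```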

3.2 Volatility Adjustment

Apart from basic measures against overfitting described in the Data Preparation section of the Appendix, we make another adjustment to the data used for classifying volatility. Some assets tend to be more or less volatile depending on the company itself and the industry it operates in. Industries related to technology, oil, and healthcare entail more fluctuating stocks, while the electric and water utilities sectors demonstrate low to moderate changes (Moran, 2020). Hence, the obtained volatilities should be normalized by some universal measure that captures both firm-specific and industry-related dynamics. One such measure is the market beta, the sensitivity of an asset to overall market fluctuations. Theoretically, it can be calculated as the slope of a linear regression of changes in the stock’s return on changes in the return of the market portfolio, as in regression (8). Since we cannot consider all stocks in the universe, in practice we again use the S&P 500 Index as a proxy. To obtain the estimate of the market beta, \({\hat{\beta }}\), formula (9) is applied. The adjustment itself is (10).

$$R_X = \alpha + \beta R_M + e$$
(8)
$$\hat{\beta} = \frac{\mathrm{Sample\ Cov}(R_X,\ R_M)}{\mathrm{Sample\ Var}(R_M)}$$
(9)
$$\hat{\sigma}_i' = \frac{\hat{\sigma}_i}{\hat{\beta}_i + \nu + \epsilon}, \quad \nu = \min\{\hat{\beta}_1, \ldots, \hat{\beta}_n\}$$
(10)

Since some market sensitivities can be negative, simple division by \({{\hat{\beta }}}_i\) would result in negative adjusted volatilities. Therefore, the betas are shifted by the value of the least beta. We do not wish to divide by zero when \({{\hat{\beta }}}_i = \nu\), so we add an infinitesimal \(\epsilon\) to the denominator. As noted, some industries are much less volatile, and since the threshold is defined by a quantile function, which is relative, without the adjustment we would only pick stocks from the least volatile industries such as health care and completely omit stocks from, say, technology. The adjustment, to some extent, sorts the stocks by their real rather than nominal volatility. As we can observe from the violin plots in Fig. 2, the distributions of the resulting trend values are similar for stocks with positive and zero direction indicators of the drift. Unlike the volatility distribution for growth stocks, the distribution for shares with a negative trend is much more positively skewed, which may consequently have a negative impact on the performance of the corresponding models.

Fig. 2 Violin plots of the estimates’ distributions. Left: absolute values of drift estimates, with \(\hat{\mu} > \hat{\mu}_{\text{S\&P500}}\) in blue. Right: volatility estimates for growing (\(\hat{\mu} > 0\), blue) and falling (\(\hat{\mu} \le 0\), red) stocks. (Color figure online)
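For concreteness, a minimal sketch of the beta estimate (9) and the volatility adjustment (10). Following the stated intent of keeping the denominator non-negative, the shift here subtracts the minimum beta; this interpretation is our assumption rather than a detail confirmed by the text.

```python
import numpy as np

def market_beta(stock_returns: np.ndarray, market_returns: np.ndarray) -> float:
    """Eq. (9): sample covariance divided by the sample variance of market returns."""
    cov = np.cov(stock_returns, market_returns, ddof=1)[0, 1]
    return cov / np.var(market_returns, ddof=1)

def adjust_volatility(sigma_hat: np.ndarray, betas: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Eq. (10): normalize each volatility by a shifted market beta.
    Assumption: the shift subtracts the minimum beta so the denominator is
    non-negative; eps avoids division by zero for the least-beta stock."""
    shifted = betas - betas.min()
    return sigma_hat / (shifted + eps)
```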

3.3 Text Vectorization

Most machine learning models are designed to work with numeric inputs rather than text. Moreover, the performance of the models relies not only on the final classifier but also on the vectorization procedure. Therefore, we pay great attention to the choice of text vectorization technique.

For the statistical learning methods (Logit and SVM), we apply the tf-idf (term frequency-inverse document frequency) transformation, given by (13). The metric is widely used in NLP and reflects the importance of each word within a document.

$$\mathrm{tf}(t, d) = \frac{f_d(t)}{\max_{w\in d} f_d(w)}$$
(11)
$$\mathrm{idf}(t, D) = \ln\left(\frac{|D|}{|\{d\in D : t\in d\}|}\right)$$
(12)
$$\mathrm{tf\text{-}idf}(t, d, D) = \mathrm{tf}(t, d)\cdot \mathrm{idf}(t, D)$$
(13)
$$f_d(t) := \text{frequency of term } t \text{ in document } d$$
(14)
$$D := \text{corpus of documents}$$
(15)

Each d is a document, which in our case is a merged string of all headlines for a particular stock within a given quarter. D is the corpus of documents, i.e., the collection of all merged news titles for all considered stocks throughout the period 2020Q3–2021Q4. Each term t in each document is assigned the product of its normalized frequency in that document and the logarithm of the inverse share of documents containing the term.
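A minimal sketch using scikit-learn's TfidfVectorizer. Note that its default tf and idf variants differ slightly from Eqs. (11)-(12) (raw term counts instead of max-normalized frequencies, smoothed idf, and L2-normalized document vectors), and the toy documents below are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each "document" is the concatenation of all headlines for one stock-quarter.
documents = [
    "tesla reported record profits tesla expands production",
    "volkswagen falls further behind tesla in the race to electric",
]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(documents)   # sparse matrix: documents x vocabulary
print(X.shape, len(vectorizer.get_feature_names_out()))
```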

For the neural networks (NN and LSTM), we apply the default Keras embedding layer to vectorize the text inputs. An embedding maps each token of the text input to a dense vector of fixed size; the weights, initialized uniformly, are trained during the fitting of the neural network.

For FastText, the data is not vectorized manually; we feed the raw text to the model, as it has an internal mechanism for text embedding.

3.4 Baseline

Knowledge about future price fluctuations is highly demanded by investors but rather difficult to obtain. Even advanced forecasting algorithms are imperfect in terms of prediction accuracy, as too many (usually unobservable) factors have to be taken into account. The algorithms showing reasonable performance are combinations of specialized stacked models based on different types of both textual (e.g., headlines, news articles, Twitter, Facebook, earnings reports, and earnings call transcripts) and numeric (e.g., quarterly reports, financials, and multiples) data, so we do not expect exceptional results from any model fitted only on news titles. We do not need to be accurate for every stock available on the market; being more confident about a few of them is enough to outrun the market portfolio. While the performance may not be ideal, we still need to choose a reasonable baseline to ensure that our predictions are not inferior to it.

In standard finance theory, the Efficient Market Hypothesis (EMH) states that all information affecting a stock is already reflected in its current price, so, given a perfectly competitive market structure, the opportunity to generate excess profits is excluded (Burton & Malkiel, 2003; Malkiel, 2003). The implication is that stock price changes are completely stochastic and cannot be foreseen. However, comparing the models with a single random classifier may give uncertain results, as its accuracy depends on the random seed. A null classifier, though too trivial for many tasks, is quite suitable as a baseline, since it approximates the mean of many random classifiers.
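One possible implementation of such a baseline is scikit-learn's DummyClassifier with the "most_frequent" strategy, which always predicts the majority class; the toy class balance below mirrors the 25/75 split of the volatility targets and is purely illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((100, 1))                  # features are ignored by the dummy model
y = np.r_[np.ones(25), np.zeros(75)]    # e.g., 25% positive class, as for the volatility targets

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))             # accuracy of always predicting the majority class (0.75)
```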

3.5 Logistic Regression

Logistic regression (a.k.a. Logit) is a popular classification model in the field of natural language processing. The method applies a sigmoid function to a linear combination of the numeric inputs, where the model coefficients are obtained using maximum likelihood estimation. Even though the marginal effect of each explanatory variable is not constant in the model, it is still highly applicable and interpretable, especially when combined with the tf-idf transform.
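A minimal sketch combining the tf-idf features with scikit-learn's LogisticRegression; the hyperparameters are illustrative rather than the exact values used here.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

logit_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logit", LogisticRegression(max_iter=1000, C=1.0)),
])
# logit_clf.fit(train_documents, train_labels) expects one merged-headline string per stock-quarter
```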

3.6 Support Vector Machines

SVM is another popular supervised classification algorithm that draws a separating hyperplane in the n-dimensional feature space (Tong & Koller, 2001). By solving a convex optimization problem, the method maximizes the margin, the distance between the separating hyperplane and the support vectors (the observations lying on the margin boundaries). Usually, real-world tasks such as the stock trend classification problem do not allow us to place a hyperplane in the initial feature space that easily. Applying kernels helps separate the classes with a non-linear boundary: by creating augmented variables from the initial predictors, we find a decision boundary in a higher-dimensional space and then treat its projection onto the initial space as the boundary. Instead of a classic polynomial function, we use a sigmoid kernel to achieve higher prediction accuracy.
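An analogous sketch for the sigmoid-kernel SVM; again, the hyperparameters are illustrative.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

svm_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC(kernel="sigmoid", probability=True)),  # probability=True enables predict_proba for ROC curves
])
```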

3.7 Neural Network

The first deep learning model in our analysis is a basic neural network architecture. The model is a graph of connected layers of neurons where each connection carries a weight trained during the fitting procedure using gradient descent. Due to its design, the model can be quite powerful at extracting hidden patterns from text data, making it one of the most popular methods in NLP. For our task, we use several hidden dense layers with the ReLU activation function and a final layer with a sigmoid activation to map the output to a probability between 0 and 1. As the loss function, binary cross-entropy with the Adam optimizer is applied.
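A minimal Keras sketch of such a network, assuming tokenized integer sequences and a hypothetical vocabulary size; the pooling layer that collapses the embedded sequence is our assumption, not a detail reported in the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 20000  # hypothetical tokenizer vocabulary size

model = keras.Sequential([
    layers.Embedding(vocab_size, 64),       # dense token embeddings, trained with the network
    layers.GlobalAveragePooling1D(),        # collapse variable-length headline sequences (our assumption)
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability of the positive class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```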

3.8 LSTM

The early layers of a simple neural network are not trained as effectively as the final layers, which manifests as a short-term memory problem. A long short-term memory (LSTM) network can mitigate this issue by adding special neurons with gates that decide which information is important enough to be passed further and which can be discarded (Gers et al., 2000). We expect LSTM to catch a “snowball effect” of headlines, where positive or negative statements are followed by news with the same sentiment.
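A corresponding Keras sketch with an LSTM layer in place of the pooling and dense stack; the layer sizes are illustrative.

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 20000  # hypothetical tokenizer vocabulary size

lstm_model = keras.Sequential([
    layers.Embedding(vocab_size, 64),
    layers.LSTM(64),                        # gated memory cells retain information across long headline sequences
    layers.Dense(1, activation="sigmoid"),  # probability of the positive class
])
lstm_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```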

3.9 FastText

FastText is a model for text classification and word vector representation developed by Facebook’s AI Research lab. FastText uses a hierarchical classifier, and owing to the binary trees underlying the model, the amount of computation for each text is reduced significantly compared to the two previous algorithms. Hashing is used to improve the time and efficiency of the n-gram mapping. Since our dataset contains more than 10 thousand observations, each text being a merge of all news within a long period of time, the speed of model training becomes important. The method averages the n-gram embeddings, and a multinomial logit serves as the final classifier. As the loss function, skip-gram negative sampling (SGNS) is applied (Goldberg & Levy, 2014).
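A minimal sketch of supervised FastText training with the official fasttext Python package; the file name, label format, and hyperparameters are assumptions for illustration, not the exact configuration used here.

```python
import fasttext

# Assumed input format of train.txt: one stock-quarter per line, e.g.
# "__label__1 tesla reported record profits tesla expands production ..."
model = fasttext.train_supervised(
    input="train.txt",
    epoch=25,
    wordNgrams=2,  # use bigrams in addition to unigrams
    lr=0.5,
)
labels, probs = model.predict("volkswagen falls further behind tesla in the race to electric")
```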

3.10 LIME

Gaining insights from the best model may bring additional value to the study. We would like to know which words or phrases were the most influential in labeling trend and growth volatility, and we base this inference on the best-performing models. There are many techniques for assessing the importance of factors in machine learning models. As the main tool for our analysis, we use Local Interpretable Model-agnostic Explanations (LIME), discussed in Ribeiro et al. (2016). It is often impossible to describe a complex decision boundary as a whole, so LIME approximates the black-box model locally, in the neighborhood of the prediction, by fitting a surrogate, an interpretable model, in that local area. Mathematically, the algorithm searches for a simple function g from a family of sparse linear models that minimizes the loss function (16), given the complex model f (i.e., the predicting model) and the neighborhood of the observation x. A penalty \(\Omega \left( g\right)\), representing the complexity of the model g, is added to the expression to keep the result simple. The kernel function \(\pi _x\) weights the generated points according to their proximity to x.

$$\xi(x) = \underset{g\in G}{\mathrm{argmin}}\ \mathcal{L}(f, g, \pi_x) + \Omega(g)$$
(16)
$$\mathcal{L}(f, g, \pi_x) = \sum_{z, z'\in \mathcal{Z}} \pi_x(z)\left(f(z) - g(z')\right)^2$$
(17)

A subsample of 350 true-positive and 350 true-negative stocks with the highest likelihood of belonging to the corresponding classes was chosen to be analyzed by LIME. For each document, the 8 words having the greatest effect (in absolute value) on the probability of the positive outcome are extracted and saved. As measures of a word’s importance, we consider the mean value and the sum of such effects across the considered corpus. Some words, despite having a high mean influence, have a low appearance frequency and are therefore removed from the analysis.
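A minimal, self-contained sketch of the LIME step using lime's LimeTextExplainer on a toy tf-idf + Logit classifier; the corpus, labels, and class names below are purely illustrative.

```python
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus: merged headlines per stock-quarter and their trend labels.
docs = [
    "tesla reported record profits and raised dividends",
    "retailer cuts guidance amid weak demand",
    "chipmaker announces dividend increase and buyback",
    "airline warns of mounting losses",
]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(docs, labels)

explainer = LimeTextExplainer(class_names=["below market", "above market"])
explanation = explainer.explain_instance(docs[0], clf.predict_proba, num_features=8)
print(explanation.as_list())  # [(word, signed effect on the positive-class probability), ...]
```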

4 Results and Discussion

Tables 1, 2 and 3 describe the classification performance for the future trend and for the future volatility of growing and falling stocks, respectively. From Table 1 and the left ROC curve in Fig. 3, we can see that more complex models achieved higher accuracy, with LSTM being the most successful of all and thus the model we investigate using LIME. However, none of the models surpasses the accuracy baseline. We can also see that the ROC curves coincide, which may represent a kind of performance boundary. To be more certain about the drift, we would need to include numerical financials such as P/E indicators, EBITDA, etc. Even though the performance of the models is not ideal, all methods managed to pass the ROC-AUC baseline in this subproblem, which again casts some doubt on the validity of the EMH.

Fig. 3 ROC curves for the three classification problems. Left: trend. Middle: growth volatilities. Right: fall volatilities

Table 1 Performance metrics for future trend classification

We expected disastrous results for volatility classification; however, we were surprised to observe that the models performed better at volatility classification than at trend labeling. From Tables 2 and 3 and the center and right ROC curves in Fig. 3, we see that FastText, NN, and Logit outperformed the other approaches significantly. Interestingly, SVM and LSTM showed poorer performance for stocks with a positive trend than for falling stocks, which may indicate that different text factors affect the two volatilities. Overall, the models showed slightly worse AUC for falling stocks.

Even though Logit and FastText did not show the highest scores in all problems, these models still demonstrated a reasonable fit and are worth considering in similar tasks given their simplicity and speed of training.

Table 2 Performance metrics for future volatility classification based on growing stocks
Table 3 Performance metrics for future volatility classification based on falling stocks
Table 4 Extracts having the highest mean influence

Based on the most influential extracts listed in Table 4, we draw the following insights. As expected, words such as ‘dividends’, ‘technology’, and ‘revenues’ increase the likelihood of outrunning the market. Headlines involving entities such as ‘foods’, ‘pharmaceuticals’, and ‘software’ also enhance the chances of class 1. Investors may consider companies operating in these fields for further investigation.

Words such as ‘predictable’, ‘crossover’, and ‘commercial’ had a strong negative impact on the first target variable. Additionally, ‘bankcorp’, ‘bloomberg’, ‘steel’, and ‘hotels’ were mostly associated with negative importance (Fig. 4). Surprisingly, unlike ‘technology’, ‘technologies’ had a negative effect. We reckon that this may be related to the context in which these two words are used.

Fig. 4 Word importance sums. Left: trend (LSTM). Right: growth volatility (Logit)

For stocks with a positive drift, words such as ‘preferred’, ‘undervalued’, ‘insurance’, ‘yield’, ‘fed’, ‘vaccines’, and ‘california’ were indicators of smaller risk. As with the positive trend, ‘dividends’ again favored the positive outcome in this category. Titles containing ‘pharmaceuticals’, ‘airlines’, ‘ipo’, ‘unaudited’, ‘therapeutics’, ‘drug’, and ‘inflation’ brought greater uncertainty to the holders of the corresponding assets. Observing words such as ‘oil’ or ‘energy’ as indicators of lesser certainty supports the validity of our model, as oil- and energy-related industries are notoriously more volatile. Again, a striking observation was that the word ‘technologies’ in its plural form, in contrast to its singular counterpart, could reflect less certainty in holding such a share.

5 Conclusions

All in all, we have managed to develop models that work better than a random classifier, which creates another contradiction to the efficient market hypothesis. This implies that the use of news headlines can be a sufficient approach for selecting a few shares for an investor with a bullish strategy and risk-averse preferences. The models fitted on the falling stocks can still be attractive to bearish investors. Besides, being able to foresee different types of volatility (both low and high) may benefit options traders; for example, a falling and noisy stock can be quite attractive to holders of the corresponding long put or straddle positions. Furthermore, we have also extracted key words and phrases that appear to drastically affect the values of trend and volatility.

As for ideas on how to improve this work, we could consider more powerful pre-trained and domain-specific text-classification models, such as those from the Hugging Face repository. Other vectorization techniques, such as improved Naïve Bayes with Laplace smoothing (Sueno et al., 2020), Bag-of-Words (Wallach, 2006), and Word2Vec (Goldberg & Levy, 2014), could be applied with the methods. The embeddings could also be applied with another weight-initialization procedure, as initialization has been shown to drastically affect the results of neural nets (Glorot & Bengio, 2010; He et al., 2015). Although we cleaned the majority of the companies’ names, the cleaning was not perfect and, as a result, words such as “stifel” and “dass” were left in the dataset. We also observed that the grammatical numbers of some words had opposite effects on the classification probabilities, so the sensitivity to singular and plural forms may be researched further. As for the text-inference part, the LIME analysis could be enhanced by another interpretation approach such as SHapley Additive exPlanations (SHAP) (Nohara et al., 2019), which uses slightly different methods to explain the models but could bring more value to our inference. Finally, the performance metrics, especially for trend classification, may be improved by other approaches or by further calibration of the considered models.