Data Pre-processing
This section summarizes the approaches and methods used to process the data from raw text into machine-readable form. Data preprocessing has been divided into four major parts, namely: hourly stock returns, news analytics preprocessing, the Naive Bayes classifier, and sentiment index development. Together, these four parts give a snapshot of the data preprocessing. Let us briefly describe them one by one.
Hourly Stock Data
Hourly stock returns are calculated with the help of the opening and closing prices of all 10 companies. Hourly stock data is obtained from the Thomson Reuters data portal. The simple formula for calculating the stock return is as follows:
$$\begin{aligned} R_{ij}= \frac{closing_{ij}-opening_{ij}}{opening_{ij}} \end{aligned}$$
(1)
where \(R_{ij}\) is the return of the \(j\)th stock in the \(i\)th hour, \(closing_{ij}\) is the closing price of the \(j\)th stock in the \(i\)th hour, and \(opening_{ij}\) is the opening price of the \(j\)th stock in the \(i\)th hour.
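A minimal sketch of this computation, assuming hourly prices are held in a pandas DataFrame with hypothetical `opening` and `closing` columns:

```python
import pandas as pd

# Hypothetical hourly prices for one stock; the values are illustrative only.
prices = pd.DataFrame({
    "opening": [101.0, 102.5, 101.8],
    "closing": [102.5, 101.8, 103.2],
})

# Eq. (1): R_ij = (closing_ij - opening_ij) / opening_ij
prices["return"] = (prices["closing"] - prices["opening"]) / prices["opening"]
print(prices)
```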
News Analytics Processing
There are many sources through which information related to a specific stock flows into the stock exchanges. The news and information sources used in this paper are: mainstream media, print media, social media news feeds, blogs, investors' advisory portals, expert opinions, brokers' updates, web-based information, companies' internal news, and public announcements regarding policies and reforms. We have collected the news stories from a well-known and reliable database, Thomson Reuters. Using the Thomson Reuters API, we collected news stories related to any of the ten stocks chosen for analysis. The reason for choosing individual stocks instead of the stock exchange is that exchanges absorb and react to collective-level information, so event-level information specific to one stock is hard to separate out. Every news story has a timestamp in GMT, precise to the millisecond. The time frame for news collection is 10 years, so collecting every news story resulted in a very large text corpus. The news timestamps are strictly matched with the stock exchange's opening and closing times. Although this meant discarding a lot of otherwise useful news that falls outside the exchange's opening and closing time window, it was necessary in order to gauge the impact of news analytics on stock price movements. A sketch of this time filtering is given below.
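This is a minimal pandas sketch of the time-window filtering, assuming a DataFrame of GMT-timestamped stories; the column names and the 09:00-17:00 session window are illustrative assumptions, not the actual exchange hours used in the study.

```python
import pandas as pd

# Hypothetical GMT-timestamped stories; values are illustrative only.
news = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-03-02 06:15", "2015-03-02 10:42",
                                 "2015-03-02 18:30"]),
    "headline": ["pre-market note", "earnings beat estimates",
                 "after-hours note"],
})

# Keep only stories that fall inside the assumed trading window.
in_session = news.set_index("timestamp").between_time("09:00", "17:00")
print(in_session)  # only the 10:42 story survives
```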
Naive Bayes Classifier
After the raw news text is extracted from the sources, it is refined so that it can be used in the Naive Bayes classification model. Originally, the text was in HTML form with a lot of unnecessary information, but with the help of a parser (lxml) and some lines of code, the HTML-based text is refined and filtered into XML form, which machines can read accurately and quickly. The Naive Bayes classification model has been used to calculate sentiments from the news text. Figure 1 shows how information is filtered from raw sources down to a sentiment score. The left column of the diagram shows the raw text, which includes HTML meta-information. The first step is to split the complete sentences into lists of words, a process called tokenizing. Next comes a filter of stop words; these stop words are mostly pronouns and similar function words. At the next stage, the text is cleaned of hyperlinks and other unnecessary information. In the next step, lemmatization is applied to reduce inflected word forms to their base forms. The list of all words is then labeled with parts of speech. The data is then refined a little further to remove redundancies. As a next step, with the help of the already available NLTK database, each word is assigned a negative or positive label. In the next two steps, the data is split into training and test sets, ready to feed to the Naive Bayes model for training. After training is completed, each sentence is tested to obtain sentiment scores. The outcome of the NLP model is used to build the sentiment index and the LSTM data at later stages. A condensed sketch of this pipeline follows.
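The following is a condensed NLTK sketch of the Fig. 1 pipeline. The toy training sentences, their labels, and the test sentence are assumptions made purely for illustration; the study's real corpus is far larger.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time resource downloads (quiet no-ops if already present).
for pkg in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def to_features(sentence):
    # Tokenize, keep alphabetic tokens, drop stop words, lemmatize,
    # and return a bag-of-words feature dict for the classifier.
    tokens = nltk.word_tokenize(sentence.lower())
    words = [lemmatizer.lemmatize(t) for t in tokens
             if t.isalpha() and t not in stop_words]
    return {w: True for w in words}

# Toy labeled examples standing in for the real training set.
train_set = [
    (to_features("Profits surged and the stock rallied"), "positive"),
    (to_features("Shares plunged after the lawsuit"), "negative"),
]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(to_features("The stock rallied on strong profits")))
```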
Sentiment Index
The following variables are taken into account while building the sentiment index: the 'sentiment time window', the 'score value', the 'class of sentiment score', and the 'relevance' of the score towards the underpinning stock. The time window means how many times news or information related to the selected stock appeared during a 1-h period. The logic behind keeping the time window at 1 h is that stock exchanges need a bit of time to absorb information related to an individual stock; minute-level analysis is too early and day-level analysis is too late. The next factor is the 'score value'. The score value is the outcome of the trained NLP model, whose process is given in Fig. 1. Sentiment score values are classified into three categories based on their scores: positive, negative, and neutral. All negative scores carry a negative sign and neutral sentiments equal zero. The magnitudes for all three classes range from 0 to 1. The 'score values' are summed over the 1-h time window: if the sum of the scores is below − 0.10, it is labeled a negative score; if the sum is between − 0.10 and 0.10, it is considered a neutral score; and from 0.10 to 0.90, the score value is positive. In the next step, the sentiment score is multiplied by the variable 'relevance' to weight the sentiment with respect to its relevance score. The relevance score is a percentage, calculated as the number of times a news story mentions the name of a stock divided by the total count of words in the news story. The mathematical expression of the sentiment index is as under:
$$\begin{aligned} S_{ij} = \sum _{i\in I}\bigg (e_{i}\times R_{i}\times C_{i}\bigg ) \end{aligned}$$
(2)
where \(I\) is the set of 1-h time windows indexed by \(i\) for the \(j\)th stock, \(e_{i} = \max \big (pos_{i},neg_{i},neut_{i}\big )\), \(R_{i}\) is the relevance score, and
$$\begin{aligned} C_{i} = \left\{ \begin{array}{rl} +1 &amp; \text {if } \mathrm {argmax}\big (pos_{i},neg_{i},neut_{i}\big ) = 1\\ 0 &amp; \text {if } \mathrm {argmax}\big (pos_{i},neg_{i},neut_{i}\big ) = 3\\ -1 &amp; \text {if } \mathrm {argmax}\big (pos_{i},neg_{i},neut_{i}\big ) = 2 \end{array}\right. \end{aligned}$$
A sketch of this aggregation is given below.
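This is a sketch of the windowed aggregation described above for one stock; the input records, the toy values, and the use of mean relevance per window are simplifying assumptions for illustration.

```python
import pandas as pd

# Hypothetical per-story NLP outputs for one stock.
news = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-03-02 09:05", "2015-03-02 09:40",
                                 "2015-03-02 10:10"]),
    "score": [0.6, -0.2, 0.3],      # signed sentiment score per story
    "relevance": [0.8, 0.5, 0.9],   # share of the story mentioning the stock
})

def window_sentiment(group):
    s = group["score"].sum()
    # Threshold classes: below -0.10 negative, within +/-0.10 neutral,
    # above 0.10 positive (the C_i sign of Eq. (2)).
    c = -1 if s < -0.10 else (0 if s <= 0.10 else 1)
    # Weight the class signal by magnitude and mean relevance (e_i * R_i * C_i).
    return abs(s) * group["relevance"].mean() * c

index = (news.set_index("timestamp")
             .groupby(pd.Grouper(freq="1h"))
             .apply(window_sentiment))
print(index)
```

The per-story sentiment scores themselves come from the Naive Bayes classifier, which rests on Bayes' theorem: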
$$\begin{aligned} P ( w_{j} | x_{i} ) = \frac{ P ( x_{i} | w_{j} ). P ( w_{j} ) }{ P ( x_{i} ) } \end{aligned}$$
(3)
where \(w_{j}\) is a particular class (e.g., negative or positive) and \(x_{i}\) is a given feature. \(P ( w_{j} | x_{i} )\) is called the posterior, in other words the probability that an observation with feature \(x_{i}\) belongs to class \(w_{j}\); \(P ( x_{i} | w_{j} )\) is the likelihood; \(P (w_{j})\), the probability of the class itself with respect to the total sample, is called the prior; and finally, \(P(x_{i})\) is called the marginal probability or evidence. Based upon the Bayes theorem stated above, the conditional class probabilities can be calculated under the naive independence assumption as follows:
$$\begin{aligned} P ( \mathbf { x_{i} } | \omega _ { j } ) = P \left( x _ { 1 } | \omega _ { j } \right) \cdot P \left( x _ { 2 } | \omega _ { j } \right) \cdot \cdots \cdot P \left( x _ { d } | \omega _ { j } \right) \end{aligned}$$
Equivalently, the class-conditional likelihood can be written in compact product form:
$$\begin{aligned} P ( \mathbf { x_{i} } | \omega _ { j } )= \prod _ { k = 1 } ^ { d } P ( \mathbf { x_{k} } | \omega _ { j } ) \end{aligned}$$
The prior probability of a class can be calculated with this expression:
$$\begin{aligned}P(\omega _ { j })= \frac{N_{\omega _ { j }}}{N_c}\end{aligned}$$
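where \(N_{\omega _ { j }}\) is the number of training samples in class \(\omega _ { j }\) and \(N_c\) is the total number of samples. A hand-rolled sketch of Eq. (3) on a tiny two-class corpus; the documents and the use of add-one (Laplace) smoothing are illustrative assumptions.

```python
from collections import Counter

# Toy corpus: a few documents per class, purely for illustration.
docs = {
    "positive": ["profit rally strong", "rally gains"],
    "negative": ["loss lawsuit weak"],
}

n_total = sum(len(d) for d in docs.values())
priors = {c: len(d) / n_total for c, d in docs.items()}           # P(w_j)
counts = {c: Counter(" ".join(d).split()) for c, d in docs.items()}
vocab = set(w for cnt in counts.values() for w in cnt)

def posterior(words, cls):
    # P(w_j) * prod_k P(x_k | w_j), with add-one smoothing so unseen
    # words do not zero out the product.
    total = sum(counts[cls].values())
    p = priors[cls]
    for w in words:
        p *= (counts[cls][w] + 1) / (total + len(vocab))
    return p  # unnormalized; dividing by the evidence P(x) normalizes it

sample = ["rally", "profit"]
print({c: posterior(sample, c) for c in docs})
```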
Model Equation
Artificial intelligence-based models have proved their importance and efficiency in almost all spheres of life, and the fields of economics and finance are no exception. Our model can be used practically in a variety of ways. For example, online trading expert systems are compelled to integrate advanced methods into the prediction process. The current model could be especially relevant for trading systems, reshaping the prediction process and reducing the effort of organizing and searching relevant market information across millions of text records, whether through human effort or traditional text filtering approaches. The model uses sophisticated NLP techniques to incorporate sentiment-based market information. Building an information-related index is crucial here: keeping this point in view, we have built a customized sentiment index that collects market information at the one-minute level and sums it up over a 1-h window. On one hand, this enables the LSTM model to capture high-level precision; on the other hand, it overcomes the limitation of relying on daily market information. Many traditional models try to achieve precise forecasting using economic data, e.g., simple regression, moving averages, and autoregressive models (ARMA, ARIMA, ARCH, GARCH), along with a host of other time series forecasting models. The universal problem for all of these models is the assumption of linearity, the difficulty of handling long past lags, and very strict requirements on data structure. These limitations come with considerable compromises in efficiency and accuracy. Artificial neural network models, and most specifically the LSTM, are very good at handling long-term dependencies, i.e., the model can keep tracking past data without losing the information it carries. Moreover, with the help of different activation functions and specific approaches, the model works flexibly without imposing many assumptions. Let us elaborate on how this model works.
The current model is based upon the original publication by Hochreiter and Schmidhuber (1997). That work is highly regarded by the research community because of its ability to handle long-term dependencies and to remember important information from previous steps. In cases where dependencies among pieces of information do not matter much, simple neural network models work fine, but this is not the typical situation in the practical business world. Stock market prediction, natural language processing, sentiment analysis, and language translation are examples where information is highly interdependent and context is very important; thus, recurrent neural network models are good alternatives to simple neural networks. Here is a short description of how the model of this study is fitted.
The hidden state function can be written in the following way:
$$\begin{aligned} h_{t}=f(h_{t-n},X_{t}) \end{aligned}$$
(4)
So, the hidden state of the LSTM model can be written with the help of the following equation:
$$\begin{aligned} h_{t}=tanh(W_{h}h_{t-n}+W_{x}X_{t}) \end{aligned}$$
(5)
The input weight matrix is first multiplied with the current input, and the previous time step's hidden state is multiplied with the hidden-state weight matrix. Finally, tanh is applied to the result after adding both terms: the current input term and the previous hidden-state term. Now the output layer of the LSTM model is as under:
$$\begin{aligned} O_{t}=\sigma (W_{o}.[h_{t-1},X]_{t}+b_{o}) \end{aligned}$$
(6)
where \(W_{o}\) is the weight matrix for the output layer and \(h_{t-1}\) is the hidden state calculated in Eq. 5.
Equations 5 and 6 simply show how the hidden and output layers of the LSTM model are formulated, but this formulation is not much different from simple neural network models. The true secret of the LSTM model lies in its unique way of developing the cell and memory state with the help of a gating mechanism. Before turning to the gates, the sketch below shows how such a model could be assembled and fitted in practice.
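This is a minimal Keras sketch of an LSTM regressor of the kind described here; the 24-step window, the two features (return and sentiment), the layer sizes, and the dummy data are illustrative assumptions, not the study's exact configuration.

```python
import numpy as np
from tensorflow import keras

timesteps, n_features = 24, 2   # e.g., 24 hourly steps of (return, sentiment)

model = keras.Sequential([
    keras.layers.Input(shape=(timesteps, n_features)),
    keras.layers.LSTM(50),      # the hidden/cell state machinery of Eqs. 5-12
    keras.layers.Dense(1),      # next-hour return prediction
])
# MSE cost with RMSProp, anticipating the optimization discussion below.
model.compile(optimizer="rmsprop", loss="mse")

# Dummy arrays standing in for the real feature matrix and targets.
X = np.random.randn(100, timesteps, n_features)
y = np.random.randn(100, 1)
model.fit(X, y, epochs=2, batch_size=16, verbose=0)
```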
Signalling and Gates
Gates are basically fully connected feed-forward networks that receive information, apply functions (usually sigmoid activation functions), perform point-wise operations, and then return outputs. We have applied the sigmoid activation function here, which outputs values in the range of 0 to 1. All output values close to 0 are considered unimportant and the cell deletes them; on the other hand, all information close to 1 is important for the prediction process and is therefore written into the cell state. In this section, we describe how the signals and gates of the LSTM work. Not all information in the cell state is important for the prediction process, and an overflow of unnecessary information amounts to disinformation. Primarily, there are three gates in the LSTM, namely: the forget gate, the input gate, and the output gate.
Forget Gates
The forget gate receives the current input and the previous hidden state, applies the sigmoid function to them, and multiplies the result with the previous cell state. This decides whether we want to keep the information held in the previous cell state \(C_{t-1}\), given the new information at time \(t\). The mathematical equation of the forget gate is as under:
$$\begin{aligned} f_{t}=\sigma (W_{f}.[h_{t-1},X]_{t}+b_{f}) \end{aligned}$$
(7)
Input Gate
This is the second part of the signalling process. In the first part, we decided whether the previous cell state is important enough to keep. Now it is time to store new essential information in the cell state, which will later be judged again by the forget gate with respect to its importance for the model's learning process. The input gate multiplies the \(t-1\) hidden state and the current input by the input weight matrix; the result is later merged with the new candidate. The activation function of the input gate is the sigmoid. The mathematical equation of \(i_{t}\) is as under:
$$\begin{aligned} i_{t}=\sigma (W_{i}.[h_{t-1},X]_{t}+b_{i}) \end{aligned}$$
(8)
New Candidate
Similar to the input gate, the new candidate, denoted by \(\tilde{C_{t}}\), is the multiplication of the previous hidden state and current input with the new candidate's weight matrix. In combination with \(i_{t}\), the new candidate decides how much information the model wants to write to the new cell state. The mathematical equation of \(\tilde{C_{t}}\) is as under:
$$\begin{aligned} \tilde{C_{t}}=\tanh (W_{c}.[h_{t-1},X]_{t}+b_{c}) \end{aligned}$$
(9)
Now the cell state is updated with the help of the forget gate, the input gate, and the new candidate; the equation is as follows:
$$\begin{aligned} C_{t}=f_{t}\times C_{t-1}+i_{t}\times \tilde{C_{t}} \end{aligned}$$
(10)
The output gate is the multiplication of the output layer's weight matrix with the previous hidden state and the current input, passed through the sigmoid:
$$\begin{aligned} O_{t}=\sigma (W_{o}.[h_{t-1},X]_{t}+b_{o}) \end{aligned}$$
(11)
Finally, the output \(h_{t}\) is the product of the output gate and the tanh of the cell state; the mathematical expression of \(h_{t}\) is as under:
$$\begin{aligned} h_{t}=o_{t}\times tanh(C_{t}) \end{aligned}$$
(12)
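The following is a direct NumPy transcription of Eqs. (7)-(12) for a single LSTM step; the dimensions and random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])     # forget gate, Eq. (7)
    i_t = sigmoid(W["i"] @ z + b["i"])     # input gate, Eq. (8)
    c_hat = np.tanh(W["c"] @ z + b["c"])   # new candidate, Eq. (9)
    c_t = f_t * c_prev + i_t * c_hat       # cell state update, Eq. (10)
    o_t = sigmoid(W["o"] @ z + b["o"])     # output gate, Eq. (11)
    h_t = o_t * np.tanh(c_t)               # hidden state, Eq. (12)
    return h_t, c_t

# Toy dimensions and random weights for one forward step.
n_h, n_x = 4, 2
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((n_h, n_h + n_x)) for k in "fico"}
b = {k: np.zeros(n_h) for k in "fico"}
h, c = lstm_step(rng.standard_normal(n_x), np.zeros(n_h), np.zeros(n_h), W, b)
```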
Model Optimization
As the model optimization function, Stochastic Gradient Descent (SGD) has been used in this study. As our model is not linear, the slope of the non-linear error between two points can be calculated with the help of the derivative, as under:
$$\begin{aligned} f'(x)=\underset{\varDelta x \rightarrow 0}{\lim }\frac{\varDelta f(x)}{\varDelta x}=\underset{\varDelta x \rightarrow 0}{\lim }\frac{f(x+\varDelta x)-f(x)}{\varDelta x} \end{aligned}$$
(13)
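A quick numerical check of Eq. (13): the finite-difference quotient approaches the derivative as the step shrinks. The function and evaluation point here are arbitrary assumptions.

```python
def f(x):
    return x ** 2

x, dx = 2.0, 1e-6
approx = (f(x + dx) - f(x)) / dx   # (f(x + dx) - f(x)) / dx
print(approx)                      # ~ 4.0, the exact derivative f'(2)
```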
The cost of the model is always the outcome of a specific function. In our model, the cost is based on the Mean Squared Error between the actual price of the entity and the predicted price. There are two major parameters that need to be tuned to reach the global minimum of the error.
$$\begin{aligned} \frac{\delta f}{\delta _{\beta }} \end{aligned}$$
(14)
Two parameters are involved in our cost function, namely \(\alpha \) and \(\beta \). Because there are two parameters, we need partial derivatives \(\delta \):
$$\begin{aligned} \frac{\delta f}{\delta _{\alpha }} \end{aligned}$$
(15)
Along the direction of the slope, we can calculate all possible partial derivatives and map them into a vector, called the gradient vector. The mathematical expression is as under:
$$\begin{aligned} {{f:R}^{n}}\rightarrow R:\nabla f=\left[ \begin{array}{c} \frac{\delta f}{\delta \theta _{1}}\\ \frac{\delta f}{\delta \theta _{2}}\\ \vdots \\ \frac{\delta f}{\delta \theta _{n}} \end{array}\right] \end{aligned}$$
(16)
\(\theta \) denotes the parameters moved along the slope to reach the global minimum, and \(\delta f \) is the change in the function due to a change along the slope. In this way, we can build a vector of all possible partial derivatives in order to go downhill.
So, the gradient descent update rule is as under:
$$\begin{aligned} \theta _ { \text{ new } } = \theta _ { \text{ old } } - \eta \nabla _ { \theta } f\end{aligned}$$
(17)
where \(\theta _ { \text{ new } }\) is the updated parameter, \(\theta _ { \text{ old } }\) is the old parameter, the '−' sign means we want to go downhill, \(\eta \) is the step size the model should take along the slope to go downhill, and \(\nabla _ { \theta } f\) is the gradient with respect to the parameters. A toy illustration follows.
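This is a toy illustration of the update rule in Eq. (17) on a simple quadratic cost; the cost function, starting point, and learning rate are assumptions made for demonstration only.

```python
import numpy as np

def cost(theta):
    return np.sum((theta - 3.0) ** 2)   # minimum at theta = 3

def grad(theta):
    return 2.0 * (theta - 3.0)          # the gradient vector of Eq. (16)

theta = np.array([0.0, 10.0])
eta = 0.1                               # step size
for _ in range(100):
    theta = theta - eta * grad(theta)   # theta_new = theta_old - eta * grad
print(theta, cost(theta))               # approaches [3, 3], cost near 0
```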
RMSProp
To speed up the model's learning and error reduction, the RMSProp algorithm has been used in the model. The idea behind this algorithm is to split the gradient descent into two parts: a gradient that moves in the vertical direction and one that moves in the horizontal direction. The vertical movement is called oscillation, which is not very beneficial for error reduction. Thus, the algorithm focuses on the horizontal movement to reach the global minimum.
$$\begin{aligned} s_{dW} = \beta s_{dW} + (1 - \beta ) (dW)^2 \end{aligned}$$
(18)
$$\begin{aligned}W = W - \alpha \frac{dW}{\sqrt{s_{dW}} + \varepsilon }\\ s_{db} = \beta s_{db} + (1 - \beta ) (db)^2 \\ b = b - \alpha \frac{db}{\sqrt{s_{db}} + \varepsilon }\end{aligned}$$
where \(s_{dW}\) is the moving average of the squared gradient in the horizontal direction and \(s_{db}\) is the moving average of the squared gradient in the vertical direction, \(\alpha \) is the learning rate, and \(\beta \) is the moving-average parameter, kept separately for \(s_{dW}\) and \(s_{db}\). \((dW)^2\) is the square of the current gradient, and \(\varepsilon \) is a very small value to avoid division by zero. The moving average is effective in this algorithm because it gives higher weight to recent gradients and less weight to older squared gradients. A sketch of these updates follows.
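This is a sketch of the RMSProp updates in Eq. (18) on the same toy cost used above; the values of \(\alpha \), \(\beta \), and \(\varepsilon \) are conventional assumptions, not the study's settings.

```python
import numpy as np

def grad(theta):
    return 2.0 * (theta - 3.0)          # same toy cost, minimum at theta = 3

theta = np.array([0.0, 10.0])
s = np.zeros_like(theta)                # moving average of squared gradients
alpha, beta, eps = 0.01, 0.9, 1e-8
for _ in range(2000):
    g = grad(theta)
    s = beta * s + (1 - beta) * g**2            # s_dW update of Eq. (18)
    theta = theta - alpha * g / (np.sqrt(s) + eps)  # damped update step
print(theta)                            # approaches [3, 3]
```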
The overall schematic of the study model is as follows (Fig. 2). The next section describes the results of the model.