Keywords

1 Introduction

Predicting future stock prices is one of the most important questions for investors in the share market. Many different techniques, including mathematical formulations, genetic algorithm (GA) based models, neural network models and machine learning based techniques, have been proposed and tested with mixed success [1,2,3, 15]. Predicting the future price of a stock is inherently difficult because the price movement depends on a large number of factors: macro-economic, micro-economic and technical parameters, as well as many unknown parameters that come into play suddenly. The future stock price of a company becomes stochastic because investors differ in their perception of the company's future. One group of investors foresees a future uptrend or good earnings for the company and expects its stock price to go up in the near future. They therefore buy at the current price in order to sell at a higher price later and earn a profit. At the same time, another group of investors, who perceive that the company's outlook is not so good and that the stock price may fall, sell at the current price with a view to later buying the same or a larger quantity of shares at a lower price. The basic idea behind technical analysis is that the current stock price of a company already incorporates the effects of economic, financial, political and psychological factors. Technical analysis studies historical stock prices and assumes that the future trend will follow past behavior. It thereby offers information about the possible future evolution of the stock market. Technical analysis is based on many different technical indicators, such as the 'n-day moving average' (where 'n' can be 5, 10, 20, 50 etc. days), the 'n-day weighted average', MACD, relative strength index and momentum, along with price-to-earnings ratio, dividend yield, profit margin, return on investment etc. [2, 3, 12, 15].
However, investors' perception also depends on rumors, market speculation and unforeseen sudden big events, whose effect on the stock prices of different companies is unknown. This latter part makes the "sell" or "buy" decision of an investor a stochastic event, but because of the technical parameters it is not totally unpredictable either.

In any stock market, listed companies are categorized into different sectors depending on the business domain to which each company belongs. We have considered the following six sectors for our study: Banks, Automobiles, IT & Software, Metals, Pharmaceuticals and FMCG (Fast Moving Consumer Goods). Each sector has a sectoral index that represents its aggregated trend in a stock exchange, similar to a stock exchange index (for example SENSEX for BSE and NIFTY for NSE). These sectoral indices react differently to external and internal events, and hence move differently. The same external event may affect different industry sectors differently. Depending on many different factors, one sectoral index may move in the positive direction while at the same time another sector moves into the negative zone (or remains neutral). For example, when the value of the dollar increases with respect to the Indian Rupee (INR), almost all export companies of India gain, and the IT sector gets most of the benefit as it earns in dollars and spends in INR. At the same time importers incur losses.

This is a very complex relationship to measure. In this research work we focus on it in terms of the following questions:

  1. Are the reactions of the sectors to external factors correlated with one another?

  2. How are the different sectors related? They may be highly correlated, neutral or not correlated at all.

  3. Among the highly correlated sector pairs, which are positively correlated and which are negatively correlated?

  4. Is there any correlation among the highly correlated sector pairs with a lag of some days, i.e. is today's sectoral index movement of sector-A correlated with the sectoral index movement of sector-B d days in the future? If we find a high correlation between two different sectors with a time lag of d days, then we can forecast the sectoral index movement of sector-B 'd' days ahead.

In the next subsections we briefly discuss the Indian share market and the sectoral indices that are considered in this case study. We then discuss association rule mining, along with the support and confidence measures that are used in our analysis.

An Overview of Indian Share Market

The two most important stock exchanges in India are BSE and NSE. The Bombay Stock Exchange (BSE) is one of the oldest stock exchanges in India and one of the top stock exchanges globally with respect to number of listed companies and market capitalization. SENSEX, also known as BSE30, is a stock market index of 30 well established and financially sound companies listed on BSE. These are some of the largest and most actively traded stocks, hence the index is considered representative of various industrial sectors of the Indian economy. It has been published since 1st January 1986 and is regarded as the pulse of the domestic stock markets in India [9]. The NIFTY 50 index is the National Stock Exchange of India's (NSE) benchmark stock market index for the Indian equity market [10]. It covers 22 sectors of the Indian economy. Just as SENSEX and NIFTY are used to understand the average trend and movement of BSE and NSE for almost all financial purposes, each stock exchange has industry sectors, and each sector has sectoral indices that reflect the behavior and performance of the concerned sector. In this study the following 6 sectors are considered: Auto, Bank, Pharma, FMCG, IT and Metal. All index values are taken from NIFTY industrial sectors. Different sectoral indices consist of different numbers of representative company stocks. For example, the NSE Auto Index consists of 15 stocks and the NIFTY Bank index comprises 12 banking sector stocks.

Statistical Correlation

Let \( X_t \) and \( Y_t \) be two given time series of closing prices for N days. If we consider a lag of d days between them, then the covariance between the two series is defined as

$$ \sigma_{XY}(d) = \frac{1}{N - 1}\sum\limits_{t = 1}^{N} \left( X_{t-d} - \mu_X \right)\left( Y_t - \mu_Y \right) $$
(1)

where \( \mu_X \) and \( \mu_Y \) are the sample means of the time series X and Y, and the sum runs over those t for which \( X_{t-d} \) is available.

The cross correlation between them is defined as

$$ r_{XY}(d) = \frac{\sigma_{XY}(d)}{\sigma_X \sigma_Y} $$
(2)

where \( \sigma_X = \sqrt{S_X} \) and \( \sigma_Y = \sqrt{S_Y} \),

\( S_X \), \( S_Y \) being the sample variances of series X and Y.

The value of r varies between −1 and +1. Depending on the sign of r the following can be inferred:

  • Positive correlation: an r value closer to +1 signifies a strong positive correlation between the variables. An r value of exactly +1 indicates a perfect positive fit. Any positive r value between 0 and +1 indicates that the relationship between the variables X and Y is such that as values of X increase, values of Y also increase.

  • Negative correlation: if X and Y have a strong negative linear correlation, r is close to −1. An r value of exactly −1 indicates a perfect negative fit. Negative values indicate a relationship between X and Y such that as values of X increase, values of Y decrease.

  • No correlation: an r value closer to 0 signifies that there is no linear correlation or only a very weak one. In other words, there is no linear relationship between the two variables X and Y, although a non-linear relationship may still exist.

A perfect correlation of ±1 means that all the data points lie on a straight line. The correlation coefficient 'r' is dimensionless; hence it does not depend on the units used. Generally an 'r' value greater than 0.8 in magnitude is considered highly correlated and a value less than 0.5 in magnitude is considered weakly correlated. A point to remember is that these threshold values vary with the 'type' of data used. Generally, lower threshold values are used with noisy data.
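The lagged cross correlation of Eqs. (1) and (2) can be illustrated with a minimal Python sketch. The function name `lagged_correlation` and the toy series are hypothetical and are not part of the original study. As a simplifying assumption, the sketch computes means and variances over the overlapping window of the two series only, which may differ slightly from a computation that uses full-series statistics.

```python
import math


def lagged_correlation(x, y, d):
    """Correlation between x[t-d] and y[t], i.e. y is read d days after x.

    Assumption: statistics are taken over the overlapping window only.
    """
    if len(x) != len(y) or d < 0 or d > len(x) - 2:
        raise ValueError("need equal-length series and 0 <= d <= len(x) - 2")
    xs = x[:len(x) - d]  # X_{t-d}
    ys = y[d:]           # Y_t
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / (n - 1)
    var_x = sum((a - mean_x) ** 2 for a in xs) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in ys) / (n - 1)
    return cov / math.sqrt(var_x * var_y)


# Toy data (illustrative only): y repeats twice the value of x one day later.
x = [1, 3, 2, 5, 4, 6]
y = [9, 2, 6, 4, 10, 8]
r_lag1 = lagged_correlation(x, y, 1)
r_lag1_negated = lagged_correlation(x, [-v for v in y], 1)
```

In this toy case the lag-1 correlation is essentially +1, and negating one series flips it to essentially −1, matching the interpretation of the sign of r given above.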

Association Rule Mining

Data mining, an important part of the knowledge discovery in databases (KDD) process, employs many different techniques for knowledge discovery and prediction, such as classification, clustering, sequential pattern mining and association rule mining. Nowadays it is used in almost all data driven decision models, such as business analysis, strategic decision making, financial forecasting and future sales prediction. Agrawal et al. [13] first introduced association rules for frequent pattern mining among items in large transaction datasets. They introduced the Apriori principle, which says: any subset of a frequent itemset must be frequent. Equivalently, no superset of an infrequent itemset needs to be generated for further processing. From the frequent itemsets a set of strong rules is derived. The strength of a rule is measured by its support and confidence values. Not all rules derived from frequent itemsets are considered strong; only those with a minimum support and confidence are considered for the next step. The Apriori principle prunes the search space, mitigating the 'curse of dimensionality' and making the computation feasible. Let us consider an association rule: {bread, sugar} => {butter}. It indicates that if people buy bread and sugar then they may also buy butter. Association rule mining (ARM) is used here to show the relationship between different itemsets. It is also known as market basket analysis. An association rule is expressed in the form of an implication as:

\( {\text{X }} \to {\text{Y}} \), where X and Y are disjoint item-sets, i.e. \( {\text{X }} \cap {\text{ Y }} = \emptyset \).

Support and confidence measure the strength of an association rule. Support determines how frequently a rule is applicable, whereas confidence determines how frequently items in itemset Y also appear in transactions containing itemset X. The formal definitions of these metrics are:

Support is the fraction of the total transactions that match the rule. For a rule R it is defined as the ratio of the number of occurrences of R to the total number of transactions [3].

$$ \text{Support}\left( X \to Y \right) = P\left( X \cup Y \right) = \frac{\#\,\text{of transactions containing both X and Y}}{\text{Total}\,\#\,\text{of transactions}} $$
(3)

A support of 0.98 for the rule {tire, auto accessories} → {Automotive Service} signifies that 98% of all transactions contain tires, auto accessories and automotive services together.

Confidence signifies the strength of the rule. The confidence of a rule X → Y is the ratio of the number of transactions containing both X and Y to the number of transactions containing X [3].

$$ \begin{aligned} \text{Confidence}\left( X \to Y \right) & = P\left( Y|X \right),\ \text{the probability of Y given X} \\ & = \frac{\#\,\text{of transactions containing both X and Y}}{\#\,\text{of transactions containing X}} \\ \end{aligned} $$
(4)

A minimum support threshold (min_sup) is generally defined to select the itemsets of interest. It is used to discard those itemsets with support less than min_sup, as they may not be interesting from a business perspective. Confidence estimates the conditional probability of Y given X. It is a measure of the reliability of the inference made by a rule. A higher confidence value implies that Y is more likely to be present in transactions that contain X.

One important point to consider is that not all strong rules (based on support and confidence values) are necessarily interesting. The support-confidence framework can be misleading: it can identify a rule (A => B) as interesting (strong) when in fact the occurrence of A does not imply the occurrence of B. Correlation analysis provides an alternative framework for finding interesting relationships and helps in understanding the meaning of some association rules. The measure of interest, or lift, is one such correlation measure for association rules. Lift is defined as [19]:

$$ \text{Lift}\left( A, B \right) = \frac{P\left( A \cup B \right)}{P\left( A \right)\,P\left( B \right)} $$
(5)

If lift = 1, i.e. P(A ∪ B) = P(A) P(B), then the occurrence of itemset A is independent of the occurrence of B; otherwise the two itemsets are dependent and correlated. If the lift value is less than 1 then A and B are negatively correlated, i.e. the occurrence of one likely implies the absence of the other. A lift value of more than 1 implies a positive correlation between A and B.
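As a worked illustration of Eqs. (3) to (5), the following Python sketch computes support, confidence and lift for the rule {bread, sugar} => {butter}. The five baskets are invented purely for illustration, and the helper names `support`, `confidence` and `lift` are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Support of LHS and RHS together divided by support of LHS.

    Assumes LHS occurs at least once, otherwise the ratio is undefined.
    """
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """Joint support divided by the product of the individual supports."""
    joint = support(transactions, set(lhs) | set(rhs))
    return joint / (support(transactions, lhs) * support(transactions, rhs))


# Invented toy baskets, not data from the study.
baskets = [
    {"bread", "sugar", "butter"},
    {"bread", "sugar"},
    {"bread", "butter"},
    {"sugar", "butter"},
    {"bread", "sugar", "butter"},
]

rule_support = support(baskets, {"bread", "sugar", "butter"})   # 2 of 5 baskets
rule_confidence = confidence(baskets, {"bread", "sugar"}, {"butter"})
rule_lift = lift(baskets, {"bread", "sugar"}, {"butter"})
```

Here the rule has support 0.4 and confidence of about 0.67, yet its lift is about 0.83, i.e. below 1. This is a small instance of the point made above: a rule can pass support and confidence thresholds while its antecedent and consequent are in fact negatively associated.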

2 Related Study

Considerable research has been done over the years on predicting future stock prices or price movement direction (upward or downward), along with trend analysis, based mainly on different statistical models [3, 4, 6,7,8, 14, 15]. Rusu et al. [14] discussed stock forecasting methods used in classical approaches, such as those of fundamentalists and chartists, and also discussed various more recent stochastic methods such as white noise, random walk and auto-regressive models. Another work [4] discussed various models for stock price prediction using SAS© System tools. Models such as time series analysis, Auto Regression (AR), Exponential Smoothing and Moving Average (MA) were discussed, along with illustrated procedures for the FORECAST and ARIMA (Autoregressive Integrated Moving Average) models. Dutta et al. [15] used logistic regression with various financial ratios as independent variables to classify 30 selected stocks into good and bad performing groups based on rate of return. Another model, CARIMA [5] (Cross Correlation Autoregressive Integrated Moving Average), was proposed to predict short term stock prices. The main idea of CARIMA is to find the most highly correlated stock in order to predict the target price. Stock prices of SET50 from the Stock Exchange of Thailand were used to test the effectiveness of the model, which gave better price trend prediction with a similar % MAE (Mean Absolute Error) compared to the ARIMA model. In another study the authors investigated stock index co-movement between two different markets, namely Taiwan and Hong Kong, using association rules and cluster analysis [6]. They used 30 categories of stock indices as decision variables to observe the behavior of stock index association. That study tried to identify the correlation between sectoral index movements of similar categories in the two markets, and the results were also used to recommend investment portfolios as a follow up reference.
The forecasting horizon is the time lag between the price movement of the independent stock and that of the correlated stock. If two stocks are highly correlated with a delay of d days, then by following the trend of the former stock the latter one's trend can be predicted d days ahead. This method was proposed in [8], together with a suitable generic algorithm for automated data preprocessing and analysis using correlation. The model predicted with 67% accuracy when tested on real stock market data. The authors of [16] analyzed the correlation between stock price fluctuation, gold price and US dollar price, along with association rule induction among different stocks of the same sector. A rigorous mathematical discussion of ten different data mining techniques, such as Support vector machine (SVM), Least squares support vector machine (LS-SVM), Linear discriminant analysis (LDA), Quadratic discriminant analysis (QDA), Logit model, neural network and Bayesian models, is given in [17]. In another work the authors proposed and evaluated a stock price prediction based recommender system [18] that used historical stock prices as input and applied regression trees for dimensionality reduction and Self Organizing Maps (SOM) for clustering. The proposed system helped investors identify possible profit-making opportunities through buy or sell recommendations.

The main objective of this research work is to measure the association between sectors pair-wise instead of between specific stocks. This provides an integrated view of the stock market across several business sectors. Here we study a time lagged prediction model on well-established, industry defined sectors or business domains such as automobile, banking, realty and metal. As we work with sectors instead of specific stocks, we are able to consider a number of stocks at a time and can choose the top performing stocks of a sector as required. The sectoral index of each sector has been used to find the correlation in our study. In this way the total number of possible pairs reduces drastically, and at the same time individual investors can gain an idea of which sectoral stocks are likely to give good earnings in the short term. Similarly, mutual fund managers can use it to diversify their sectoral portfolio, as the movement trend of the sectors is identified.

3 Methodology

Research Framework

The research framework of this study is shown in Fig. 1. It involves collecting index values of 6 industrial sectors from NSE. Each trading day's closing prices are used as the raw input data for our analysis. An initial time series plot of the selected sectoral indices gives a basic graphical visualization of their co-movement pattern in the raw data. Figure 2 shows the time series plot of the selected sectors. The raw dataset is then processed into a suitable format for association rule mining and correlation analysis.

Fig. 1. Research framework followed in this study

Fig. 2. Movement of six sectoral indices (Closing price index) during Jan to Dec 2015

Correlation Analysis

Our data set consists of day-wise closing prices of 6 different sectoral indices for 2015. We calculated the pairwise correlation for all possible pairs of sectors with a lag of 0 to 5 days. A lag of 0 days means same day correlation between the two sectoral indices.

Let us say we have N days of closing prices for any two sectors S1 and S2, taken as \( X = \{p_1, p_2, \ldots, p_n\} \) for sector S1 and \( Y = \{p_{1+d}, p_{2+d}, \ldots, p_{n+d}\} \) for sector S2 (so that n + d ≤ N), where the prices in Y run from day (1 + d) to day (n + d), i.e. 'd' days ahead of the prices in X.

The correlation with a delay of 'd' days is calculated as the correlation between the two data arrays X and Y:

$$ r_{XY}(d) = \text{Correlation between } X\left\{ p_1, p_2, \ldots, p_n \right\} \text{ and } Y\left\{ p_{1+d}, p_{2+d}, \ldots, p_{n+d} \right\} $$

So we have a total of 15 sector pairs from the 6 sectors considered, and for each pair we have 6 correlation values (with lags of 0 to 5 days).
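The count of 15 pairs and 6 lags per pair can be checked with a short Python sketch. The list name `SECTORS` and the variable names are hypothetical. The sketch treats pairs as unordered, which matches the count of 15; treating each pair in both directions (which sector leads) would double the number of combinations and is not assumed here.

```python
from itertools import combinations

# The six sectors named in the study.
SECTORS = ["Auto", "Bank", "Pharma", "FMCG", "IT", "Metal"]
LAGS = range(0, 6)  # lags of 0 to 5 days

# Unordered sector pairs: C(6, 2) = 15.
pairs = list(combinations(SECTORS, 2))

# One correlation value is needed per (pair, lag) combination.
grid = [(s1, s2, d) for s1, s2 in pairs for d in LAGS]
```

This gives 15 pairs and 90 (pair, lag) combinations, one correlation value for each.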

Microsoft Excel spreadsheet based statistical tools have been used to derive the results shown in Table 1 and Fig. 3.

Table 1. Correlation values (r) among different sector pairs with different day lag.

Fig. 3. Correlations among six sectoral indices with a day lag of 0 day to 5 days.

In this study a correlation value of |r| >= 0.8 has been considered a good correlation, and a correlation value of |r| <= 0.5 has been neglected as 'weak or no correlation'. Based on the correlation between the movements of different sectors, sector pairs are selected for further analysis using association rule mining. Only sector pairs with high positive or negative correlation are considered for further analysis, as discussed in the next subsection.
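The selection thresholds can be expressed as a small Python helper. This is a sketch only: the function name `classify_correlation` and the label "intermediate" for values between the two thresholds are assumptions, since the text does not name that band, and the thresholds are applied to the magnitude of r so that strong negative correlations are retained.

```python
def classify_correlation(r, strong=0.8, weak=0.5):
    """Label a correlation value using the 0.8 and 0.5 thresholds."""
    magnitude = abs(r)
    if magnitude >= strong:
        return "strong positive" if r > 0 else "strong negative"
    if magnitude <= weak:
        return "weak or none"
    # Band between the two thresholds; this label is an assumption.
    return "intermediate"


labels = [classify_correlation(r) for r in (0.85, -0.9, 0.3, 0.65)]
```

Only pairs labelled "strong positive" or "strong negative" would be passed on to the association rule mining step.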

Data Preprocessing and Encoding

Let us consider the dataset \( P_s = \{p_i\} \), i = 1 to N, as the sectoral index closing values of some sector S, N being the number of trading days considered. The whole dataset contains such sectoral index closing prices for 6 sectors.

Let us also define the tolerance \( \Delta t \) as the percentage change up to which we ignore price changes, i.e. we consider it as no-change if the magnitude of the percentage price change is less than or equal to \( \Delta t \). We varied \( \Delta t \) from 0 to 1 and selected \( \Delta t = 0.2 \), as it gave good results in our experiments.

Step 1:

The change in index value is calculated for each sector as follows:

$$ \Delta p_i = p_{i+1} - p_i $$

Step 2:

Different sectoral indices have different bases and move by different absolute amounts, so to normalize all sectors we consider the percentage change. It is calculated as below:

$$ C_i = \frac{\Delta p_i}{p_i} \times 100 $$

The value of \( C_i \) may be positive or negative depending on the price movement of the sectoral index.

Step 3:

The percentage change of the sectoral index is encoded as follows:

$$ v_i = \left\{ \begin{array}{ll} +1 & \text{if } C_i > +\Delta t \\ 0 & \text{if } -\Delta t \le C_i \le +\Delta t \\ -1 & \text{if } C_i < -\Delta t \end{array} \right. $$
(6)

Here \( v_i \) becomes +1 if the change is in the positive direction, i.e. the sectoral index moves upwards. It becomes −1 if the change is in the negative direction, i.e. the sectoral index moves downwards. A value of 0 is assigned if the magnitude of the percentage change is within the tolerance limit. We consider this as 'no-change' or 'no-movement'.
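The three preprocessing steps can be sketched in Python as follows. The function name `encode_movements` and the sample prices are hypothetical. The sketch assumes that the change is expressed in percent of the earlier day's closing value, so that a tolerance of 0.2 means 0.2%.

```python
def encode_movements(prices, tolerance=0.2):
    """Encode day-to-day index changes as +1 (up), -1 (down) or 0 (no change).

    Assumption: the change is measured in percent of the earlier day's
    closing value and compared with `tolerance`, also in percent.
    """
    codes = []
    for today, next_day in zip(prices, prices[1:]):
        change_pct = (next_day - today) / today * 100.0
        if change_pct > tolerance:
            codes.append(1)
        elif change_pct < -tolerance:
            codes.append(-1)
        else:
            codes.append(0)
    return codes


# Invented closing values, for illustration only.
sample_prices = [100.0, 101.0, 101.1, 100.0, 100.0]
sample_codes = encode_movements(sample_prices)
```

For these sample values the changes are roughly +1%, +0.1%, −1.1% and 0%, so the encoded sequence is `[1, 0, -1, 0]`: the small +0.1% move falls inside the tolerance band and is treated as no movement.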

Mining Association Rules with Apriori

Apriori is one of the most widely used frequent itemset mining algorithms and has a good time bound, as already discussed in the Association Rule Mining section of the Introduction above. We adapt association rule mining using Apriori from [11] and use the lift value [19] as a measure of interest of the mined rules.

Generate Input Transaction set

  • For days d = 0 to 5 Do

  • For each sector pair S1, S2

  • Generate transactions T as:

    $$ T_d = \left\{ v_i^{s1}, v_{i+d}^{s2} \right\};\ i = 1 \ldots N;\ d \text{ is the day lag}. $$

    Td is a set of 2-item itemsets with possible items '+1', '−1' and '0'. For example, if sector S1 shows an upward movement on the ith day and sector S2 shows a downward movement on the (i + d)th day, then the ith itemset in Td becomes (+1, −1). Similarly, the (i + 1)th itemset will be (+1, +1) if S1 moves upward on the (i + 1)th day and S2 moves upward on the (i + 1 + d)th day.
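The transaction generation step can be sketched in Python as below. The function name `build_transactions` and the two encoded sequences are hypothetical. The sketch assumes that each transaction pairs the encoded movement of S1 on day i with the encoded movement of S2 on day i + d, and it keeps the pair ordered so that the two sectors' items remain distinguishable.

```python
def build_transactions(codes_s1, codes_s2, d):
    """Pair S1's movement on day i with S2's movement on day i + d.

    Days without a partner d days later are dropped.
    """
    usable = min(len(codes_s1), len(codes_s2)) - d
    return [(codes_s1[i], codes_s2[i + d]) for i in range(max(usable, 0))]


# Invented encoded movements (+1 up, -1 down, 0 no change).
codes_a = [1, 1, -1, 0, 1]
codes_b = [0, -1, 1, 1, -1]

same_day = build_transactions(codes_a, codes_b, 0)
one_day_lag = build_transactions(codes_a, codes_b, 1)
```

With a lag of 1 day the last day of S1 has no partner, so four transactions remain: `[(1, -1), (1, 1), (-1, 1), (0, -1)]`.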

The Apriori algorithm is now suitably modified to be used on the transactions T = {Td}, d = 0 to 5, generated above, to find the association between the movement trends of any two sectors. Here we restrict our analysis to finding associations between two sectors, where the index movement direction of one sectoral index is used to estimate the probability of the movement direction of another sectoral index. It is possible to use the same algorithm to find association rules in which multiple sectoral index movements are used to predict the movement of another sectoral index.

Let min_s = minimum support threshold for an itemset to be considered. It is used to retain only healthy rules.

In a similar way, min_c = minimum threshold for the confidence measure.

Lk is the set of k-element itemsets generated from the (k − 1)-element itemsets using the Apriori principle.

Deriving the Association rules with Apriori

  1. Find all individual elements (1-element itemsets, L1) in the transaction dataset Td with support more than min_s. L1 consists only of '+1', '−1' or '0'.

  2. DO

    a. Use the previously found j-element itemsets (Lj) to find all (j + 1)-element itemsets with a minimum support of min_s.

    b. These become the set of all frequent (j + 1)-itemsets that are interesting.

    c. Divide each frequent itemset X into two parts, antecedent (LHS) and consequent (RHS). The association rule takes the form R: LHS -> RHS.

    d. The confidence of such a rule is calculated as:

      $$ \text{Confidence}\left( R \right) := \text{support}\left( X \right)/\text{support}\left( \text{LHS} \right) $$

    e. Discard all rules whose confidence is less than min_c.

  3. WHILE itemset size is less than k.
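For the two-sector case considered here, the steps above can be illustrated with a simplified Python sketch. It is not the SPMF implementation and not a full level-wise Apriori: because every transaction holds exactly two items, the sketch counts the pairs directly, which yields the same frequent 2-itemsets. The function name `mine_pair_rules`, the default thresholds and the toy transactions are assumptions made for illustration.

```python
from collections import Counter


def mine_pair_rules(transactions, min_s=0.2, min_c=0.4):
    """Derive rules (S1 movement -> S2 movement) from 2-item transactions.

    Simplification: pairs are counted directly instead of being grown
    level by level, which is equivalent when every itemset has 2 items.
    """
    total = len(transactions)
    pair_counts = Counter(transactions)
    lhs_counts = Counter(a for a, _ in transactions)
    rhs_counts = Counter(b for _, b in transactions)

    rules = []
    for (a, b), count in pair_counts.items():
        sup = count / total
        conf = count / lhs_counts[a]  # support(X) / support(LHS)
        lift_value = sup / ((lhs_counts[a] / total) * (rhs_counts[b] / total))
        if sup >= min_s and conf >= min_c:
            rules.append({"lhs": a, "rhs": b, "support": sup,
                          "confidence": conf, "lift": lift_value})
    return rules


# Invented transactions: (S1 movement on day i, S2 movement on day i + d).
toy_transactions = [(1, -1), (1, -1), (1, 1), (-1, 1), (-1, 1), (0, 0)]
toy_rules = mine_pair_rules(toy_transactions)
```

On these toy transactions two rules survive the thresholds: S1 up -> S2 down (support 1/3, confidence 2/3) and S1 down -> S2 up (support 1/3, confidence 1). The pairs (1, 1) and (0, 0) are dropped because their support of 1/6 is below min_s.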

Rank the Generated Association Rules

Rank all the derived association rules by their support and confidence values. The top K rules are of importance. The value of K depends on the investor's risk profile and preferences.
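A possible way to rank rules is sketched below in Python. The text does not state how support and confidence are combined, so sorting by support first and using confidence to break ties is an assumption, as are the function name `rank_rules` and the example rules, whose values are invented.

```python
def rank_rules(rules, top_k):
    """Return the top_k rules, ordered by support and then confidence.

    Assumption: support is the primary key and confidence breaks ties.
    """
    ordered = sorted(rules,
                     key=lambda rule: (rule["support"], rule["confidence"]),
                     reverse=True)
    return ordered[:top_k]


# Invented rules, for illustration only.
example_rules = [
    {"name": "A", "support": 0.30, "confidence": 0.60},
    {"name": "B", "support": 0.42, "confidence": 0.50},
    {"name": "C", "support": 0.30, "confidence": 0.75},
]
top_two = [rule["name"] for rule in rank_rules(example_rules, 2)]
```

With these values the top two rules are B and then C: B has the highest support, and C beats A on confidence at equal support.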

4 Results and Analysis

We have used the open source Java based frequent pattern mining library SPMF [7] for deriving association rules with Apriori, suitably modified to incorporate the other required changes. For our experiments we have used a minimum support of 0.2, a minimum confidence of 0.4 and a minimum lift of 0.1, which gave acceptable results. Fig. 2 shows the initial time series plot of the different sectors, where co-movement patterns can be seen visually. One important point regarding Fig. 2 is that it shows the same day co-movement pattern, hence it cannot be used for prediction. The next logical step is therefore to introduce a lag of some days between any two sectoral index movements and check whether there is any correlation. The correlation coefficients among different sector pairs with delay periods from 0 to 5 days have been calculated as shown in Table 1. The correlation values are plotted against delay in Fig. 3 to show the positive, negative and no-correlation cases between different sectors with varying delays. A day lag of 0 denotes same day index movement correlation, so a high same day correlation does not help in forecasting and cannot be used to gain profit. Finally, Table 2 shows the top 15 association rules mined using the above method, ranked by the rules' support and confidence measures. Lift values are also calculated and shown as a measure of interest of the rules. From the results we see that, among the 6 sectors considered, metal and FMCG go hand in hand, whereas the auto index has very low co-movement with the other sectors. The IT and pharma indices also show a similar pattern. From a purely economic point of view, both the IT and pharma sectors depend greatly on the dollar exchange rate, and both sectors react similarly to changes in the dollar price.
Auto index movements were much lower during 2015 than those of all other sectors, which may be attributed to the non-reduction of car loan interest rates during the year. This reveals some interesting correlations between different sectors. Rule R0 is interpreted as: if the metal index goes down, then after 5 days the pharma sector may go up, with a support of 42% and a confidence of 49.5%. Similarly, rule R2 indicates that if the auto index moves up, then with a forecasting horizon of 5 days the metal index may move downwards. The corresponding lift values are less than 1, which implies negative correlation and thus supports our prediction.

Table 2. Top 15 Association rules generated

5 Conclusion and Future Work

Association rule mining along with statistical correlation analysis has been applied to a sectoral index dataset to investigate co-movement patterns among the indices. The Apriori algorithm, a well-known frequent itemset mining tool, has been modified and applied for the present analysis. This study finds that different sectoral indices are correlated among themselves. One more interesting finding is that there exists a time lagged correlation between different sectoral indices. This correlation can be exploited to predict the future index movement direction with a forecast horizon of d days, where d is the day lag considered. Hence this model can be used by investors in balancing their portfolios to minimize risk, as well as in deciding which sector to invest in next. The model is suited to short term investment, as only the next few days can be predicted from the current day's sectoral index movements. The results show that some sectors are completely uncorrelated, while some are highly correlated (positively or negatively), with correlation coefficient magnitudes of more than 0.8.

Future work will include analysis considering all sectors at a time instead of a single sector predicting another. For example, in this study association rules of the form R: S1 -> S2 are used for simplicity, but in future all possible rules of the form R: (S1…Sj) -> Sk, where several other sectors jointly predict the movement of another sector, can be studied. Artificial neural network models can also be considered in combination with association rules to predict the sectoral index movement. In this study only historical closing values of indices are considered, but there are many other factors and features, such as trading volume, market capitalization and debt ratio, that can be considered for prediction.