# Clusterization of Indices and Assets in the Stock Market

## Abstract

K-means clustering algorithm has been used to classify major world stock market indices as well as most important assets in the Warsaw stock exchange (GPW). In addition, to obtain information about mutual connections between indices and stocks, the Granger-causality test has been applied and the Pearson R correlation coefficients have been calculated. It has been found that the three procedures applied provide qualitatively different kind of information about the groups of financial data. Not surprisingly, the major world stock market indices appear to be very strictly interconnects from the point of view of both Granger-causality and correlation. Such connections are less transparent in the case of individual stocks. However, the “cluster leaders” can be identified which leads to the possibility of more efficient trading.

### Keywords

Technical analysis Clustering K-means Granger causality Pearson correlation## 1 Introduction

The technical analysis of the stock market assets [1, 2] belongs to the most controversial branches of analysis of financial data series. This is because that the aim of technical analysis is, basically, no less than the approximate predictions of trends and their corrections in the data which appear as realizations of a *random* process.

On one hand, it has been declared a kind pseudoscience, which, because of the incorrectness of its most important principles, cannot lead to any sustainable increase of returns above the market level [3, 4]. On the other hand, it has been being applied rather blindly without any serious knowledge about the market dynamics. More recent publications, e.g. [5, 6, 7, 8] have lead to considerable revision of the ultra-critical stand of the many experts regarding technical analysis,

As a part of a common knowledge about the stock market dynamics let us notice that time series generated by the prices of stocks are *not* random walks, and at least the short-time correlations *are* present. Whether they can indeed be exploited with the purpose of maximization of returns is an open question. What we investigate here is meant to be a very small contribution to answer it.

One of the many possible strategies of “beating the market” which are close to the spirit of technical analysis consists of identification of groups of stocks with similar historical behavior of prices. Then, it may happen that within such groups certain “leaders” can be found. Such “leaders” should exhibit behavior which is, to some extent, emulated - with some time delay - by some other members of the group. It is exactly that time delay which might possibly lead to forecasting of the price movement of the whole cluster or at least one of its parts.

As a preliminary but useful exercise we have first performed such grouping, or clusterization, of 24 of the world stock market indices: ALL_ORD, AMEX_MAJ, BOVESPA, B-SHARES, BUENOS, BUX, CAC40, DAX, DJIA, DJTA, DJUA, EOE, FTSE100, HANGSENG, INTERNUS, MEXICIPS, NASDAQ, NIKKEI, RUSSEL, SASESLCT, SMI, SP500, TOPIX, TSE300. Among them there have been both the most important ones like S&P500, DJIA, NIKKEI, HANGSENG and DAX, and those related to smaller markets like Budapest’s BUX and Amsterdam’s EOE. Then, 30 stocks belonging to the “blue chips” of the Warsaw stock market (WIG30 group) have also been classified.

For the purpose of that classification we have employed: (i) a state-of-the-art clustering algorithm called K-means [9, 10, 11, 12, 13]; (ii) the Granger causality test [14] with lags equal to 1, 2, and 3 trading sessions; (iii) Pearson correlation coefficient [15] with lags; that is, we have computed the Pearson correlation coefficients between shifted closing prices of asset A (\(P_{n-k}^{(A)}\) and unshifted closing prices of asset B (\(P_{n}^{(B)}\), where *k* has been equal to 1, 2, or 3.

The main body of this work is organized as follows. In Sect. 2 we provide our preliminary results for the 24 stock market indices. Section 3 is devoted to analogous results for the stocks in the Warsaw stock market. Finally, Sect. 4 comprises some concluding remarks.

## 2 Classification of the Major World Stock Market Indices

*a*we have obtained the time series \(x_{n}^{(a)}\) in the following way. For each trading session we obtained the sequence of opening (

*O*), maximum (

*X*), minimum (

*N*) and closing (

*C*) values:

*k*enumerates the trading sessions, \(k = 1, 2, ..., N\). Then the series has been normalized by substracting the mean (over

*k*) value of the closing values and dividing by the standard deviation of those values. This gave us the sequence:

The series \(x_{n}^{(a)}\) is obtained by flattening of the above sequence.

The K-means algorithm which has been applied to series generated in this way requires the number of clusters as an input parameter. We have chosen two, four, and six clusters.

We have been somewhat astonished by the fact that the clusters visible in Figs. 1, 2 and 3 do not appear to follow any pattern associated with geographical locations of the market to which an index corresponds. Also, our hopes to see, say, Euromarican indices contrasted with the East Asian ones have obviously failed. In fact, we have had some difficulties to provide any reasonable qualitative explanation of the clusterization results. For instance, it is not easy to understand why the French CAC40 index is in the same cluster as Australian ALL ORDINARIES index given completely different methods employed to calculate those indices and rather doubtful resemblance of the French and Australian markets.

*normalized closing values*of the indices. It is employed for determining whether one time series is useful in forecasting another. We have obtained all cases in which the null hypothesis that there is

*no*causality relationships between two indices has been rejected with p-value lower that 0.01. It has turned out, however, that it is actually quite difficult two find an index which

*does not*have any causal relationships (in the Granger sense) with some other. Also, very often the Granger-causality relationships has been two-sided. This is quite intuitive: the world stock markets are interconnected very strongly indeed. As an example, in Fig. 4 we have shown an example of Granger-causality relation with the lag equal to 2 between the Australian ALL ORDINARIES index and other indices. The arrows in Fig. 4 mean that the knowledge of the closing values of ALL ORDINARIES on the day \(n-2\) can be helpful to forecast the closing values of the indices displayed at the end of the arrow on the day

*n*.

*a*and

*b*with time delay. That is, we define the series \({\bar{D}}_{n}^{(a)} = {\bar{C}}_{n-m}^{(a)}\) and compute \(PCC = PCC(m)\) as:

*cov*denotes covariance and \(\sigma \) - standard deviation.

That is, covariances divided by the product of standard deviations have been computed for the series \({\bar{C}}_{n-m}^{(a)}\) and \({\bar{C}}_{n}^{(b)}\) where *m* has been equal to 1, 2, or 3.

## 3 Classification of the Most Important Stocks in Warsaw Stock Market

We have also employed three procedures described in Sect. 2 to those Polish stocks of which the WIG30 index has been composed in the beginning of June, 2015: ALIOR, ASSECOPOL, BOGDANKA, BORYSZEW, BZWBK, CCC, CYFRPLSAT, ENEA, ENERGA, EUROCASH, GRUPAAZOTY, GTC, HANDLOWY, INGBSK, JSW, KERNEL, KGHM, LOTOS, LPP, MBANK, ORANGEPL, PEKAO, PGE, PGNIG, PKNORLEN, PKOBP, PZU, SYNTHOS, TAURONPE and TVN.

All the prices have been normalized in the same way as in Sect. 2.

In the case of the WIG30 stocks classification it is again difficult to find any regularities. For instance, it is not true that the financial intitutions appear together in a single cluster or two clusters. Nor can it be said about the oil companies or those associated with production and distribution of energy. Our attempts to associate the results of classification with fundamental analysis have also failed. It appears that the clusterization algorithms simply provide knowledge of the type quite different from intuitive correlations.

Perhaps the most striking feature of the diagram shown in Fig. 9 is the very special status of the ENERGA stocks which seems to “influenced” (in the sense of Granger causality; there is of course no material influence) by eight other stocks. What we believe is also quite interesting is the fact the prices of INGBSK is in the Granger-causality relation with the prices of two other large banks, PKOBP and ALIOR. We believe that, with sufficient care, the diagram in Fig. 9 can be used to attempt forecasting the behavior of prices of stocks.

The same may be true about the stocks for which strong correlation exists. In Fig. 10 we have displayed pairs for which PCC (computed from retarded (\(m=1\)) and unretarded values of the closing prices) has been larger than 0.8 (with significance level 1 %).

As we can see, the strong correlations are often two-sided. That is, not only the sequence \({\bar{C}}_{n-1}^{(a)}\) is strongly correlated with \({\bar{C}}_{n}^{(b)}\) for, e.g., \(a = EUROCASH\), and \(b = PZU\), but the opposite is also true. Let us notice here that none of the time series analyzed in this work passes the augmented Dickey-Fuller test for the absence of the unit root, and, most likely, none of them is stationary.

## 4 Concluding Remarks

In this work we have performed classification of a group of important world stock market indices and major stocks in the Warsaw stock market using standard K-means clustering algorithm, Granger causality test, and Pearson correlation coefficients. We have found pairs of indices as well as pairs of stocks in the Warsaw stock market which are either very strongly Granger-causality related or strongly correlated in the Pearson’s sense. Let us notice that the correlations obtained with the help of Spearman rank coefficient have not differed qualitatively from those obtained from the Pearson correlation coefficient. We believe that, with sufficient care, the diagrams similar to those obtained here may be used in practice, possibly even to enhance the predictive power of technical indicators for trading purposes.

It is to be noticed, however, that the above results are very preliminary and require careful reexamination. In particular, we have not yet provided any tests for the forecasting powers based on the above classifications. We hope to report results of such improved analysis in a forthcoming publication.

### References

- 1.Murphy, J.: Technical Analysis of Financial Markets. New York Institute of Finance, New York (1999)Google Scholar
- 2.Kaufman, P.: Trading Systems and Methods. Wiley, New York (2013)Google Scholar
- 3.Malkiel, B.: A Random Walk Down the Wall Street. Norton, New York (1981)Google Scholar
- 4.Fama, E., Blume, M.: Filter rules and stock-market trading. J. Bus.
**39**, 226–241 (1966)CrossRefGoogle Scholar - 5.Brock, W., Lakonishok, J., LeBaron, B.: Simple technical trading rules and the stochastic properties of stock returns. J. Finan.
**47**(5), 1731–1764 (1992)CrossRefGoogle Scholar - 6.Lo, A., MacKinley, A.: Stock market prices do not follow random walks: evidence from a simple specification test. Rev. Finan. Stud.
**1**, 41–66 (1988)CrossRefGoogle Scholar - 7.Lo, A., MacKinley, A.: A Non-Random Walk down Wall Street. Princeton University Press, Princeton (1999)Google Scholar
- 8.Lo, A., Mamaysky, H., Wang, J.: Foundations of technical analysis: computational algorithms, statistical inference, and empirical implementation. J. Finan.
**55**(4), 1705–1765 (2000)CrossRefGoogle Scholar - 9.MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Cam, M.L., Neyman, J. (eds.) Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. University of California Press, Berkeley (1967)Google Scholar
- 10.Steinhaus, H.: Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci.
**4**(12), 801–804 (1957)MathSciNetMATHGoogle Scholar - 11.Lloyd, S.: Least square quantization in PCM (1957) Bell Telephone Laboratories PaperGoogle Scholar
- 12.Forgy, E.: Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics
**21**(3), 768–769 (1965)Google Scholar - 13.Hartigan, J.: Clustering Algorithms. Wiley, New York (1975)MATHGoogle Scholar
- 14.Granger, C.W.J.: Investigating causal relations by econometric models and cross-spectral methods. Econometrica
**37**(3), 424–438 (1969)CrossRefMATHGoogle Scholar - 15.Pearson, K.: Notes on regression and inheritance in the case of two parents. Proc. R. Soc. Lond.
**58**, 240–242 (1895)CrossRefGoogle Scholar - 16.Scikit-learn Community: Scikit-learn - machine learning in python (2015). http://scikit-learn.org
- 17.Pedregosa, F., Varoquaux, G., Gramfort, A., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res.
**12**, 2825–2830 (2011)MathSciNetMATHGoogle Scholar