Neural Computing and Applications, Volume 21, Issue 8, pp 2057–2063

Using multi-stage data mining technique to build forecast model for Taiwan stocks

Authors

  • Chien-Jen Huang
    • Department of Information Management, Aletheia University
  • Peng-Wen Chen
    • Department of Information Management, Oriental Institute of Technology
  • Wen-Tsao Pan
    • Department of Information Management, Oriental Institute of Technology
Original Article

DOI: 10.1007/s00521-011-0628-0

Cite this article as:
Huang, C., Chen, P. & Pan, W. Neural Comput & Applic (2012) 21: 2057. doi:10.1007/s00521-011-0628-0

Abstract

The Taiwan stock market changes rapidly. It is affected not only by individual investors and the three major institutional investors but also by domestic political and economic conditions. Therefore, to grasp stock market movements precisely, one must build an accurate stock forecast model. In this article, we used a multi-stage optimized stock forecast model to capture the changing trend of the stock market. First, data on two stocks, TSMC and UMC, were collected, and the sample data were fed into genetic programming to build a model and obtain arithmetic expressions. The Artificial Fish Swarm Algorithm was then used to dynamically adjust the variable coefficients and constants in these arithmetic expressions. Next, we fed the error term (ε) of the arithmetic expressions into a Gray Model Neural Network to forecast it. Finally, we used the Artificial Fish Swarm Algorithm to dynamically adjust the parameters of the Gray Model Neural Network, enhancing the precision of the stock forecast model as a whole. The results show that the forecast capability of each stage after optimization is better than that of its previous stage, and that the mixed stock forecast model (GP–AFSA+GMNN–AFSA) of stage 4 greatly enhances forecast precision.

Keywords

Data mining · Genetic programming · Gray model neural network · Artificial fish swarm algorithm · Arithmetic expressions

1 Introduction

In recent years, various financial products, including stocks, futures, and bonds, have been introduced to the market as the economy has strengthened. These products have also expanded the options for personal wealth management. Among them, listed shares have been available for a long time and are familiar to the public, so stock investment has been a major tool for building personal wealth. Stock investors are eager to know how to select stocks that will earn a profit. Generally speaking, two analytical methods are most often used when selecting stocks: fundamental analysis and technical analysis. Fundamental analysis focuses mainly on a listed company's operations and financial status to forecast future profit or loss, and stocks can then be selected accordingly. Technical analysis focuses on historical stock price movements, which may show a pattern or trend that investors use to forecast the likelihood of a future rise or fall in price and to decide whether to invest. In this article, we focus on technical analysis and use a multi-stage optimized approach to build a stock forecast model.

In the past, many studies have addressed the building of stock forecast models [1, 2]. Referring to this literature, we collected data on two well-known listed semiconductor companies, TSMC (2330) and UMC (2303), and used the sample data to build a genetic programming model (GP Model). Then, following Jabeen [3, 4], we focused on the coefficients of the variables in the arithmetic expressions produced by genetic programming and used the Artificial Fish Swarm Algorithm (AFSA) to adjust them dynamically (GP–AFSA Model). Next, we used the Gray Model Neural Network (GMNN) to forecast the error term (ε) of the GP–AFSA forecast (GP–AFSA+GMNN Model). Finally, we used the Artificial Fish Swarm Algorithm to dynamically adjust the parameters of the Gray Model Neural Network (GP–AFSA+GMNN–AFSA Model). We therefore have four stock forecast models in total, and we compare their forecast capabilities to provide a reference for investors and researchers selecting target stocks.

This article is divided into four sections. Section 1 states the research purpose. Section 2 reviews work related to genetic programming, the Artificial Fish Swarm Algorithm, and the Gray Model Neural Network. Section 3 presents the sample data and the empirical analysis. Section 4 gives conclusions and suggestions.

2 Research method

2.1 Artificial fish swarm algorithm

The Artificial Fish Swarm Algorithm is a modern evolutionary computation method, first presented by Li et al. [5]. It starts by building a model of fish behavior and then uses the optimizing behavior of individual fish (local) to drive the group toward the optimal value (global). This kind of algorithm is well suited to escaping local extrema in order to reach the global extremum, and the group-level search is achieved through the self-adaptive behavior of each individual fish. From observing the living habits of fish, three types of behavior can be identified, as shown in Fig. 1. In this article, we use Matlab to program these behaviors.
Fig. 1 Group behavior and following behavior of a school of fish

1. Random food search: A fish normally swims at random, but when it finds food nearby it swims toward it. Assume the current state is Xi and randomly pick a state Xj within the sensing range. If Yi < Yj, move one step in that direction; otherwise, reselect another state Xj and check again whether moving forward is justified. If moving forward is still not justified after several tries, move one step at random.

2. Group behavior: Fish normally form very large groups. Assume the current state is Xi and count the number of fish (nf) within the current visible range (Dij < Visible). If nf/N < δ and Yi < Yc, there is more food at the center of the group and it is not crowded, so move one step toward the center; otherwise, fall back to the food search.

3. Following behavior: When a single fish finds a spot with plenty of food, the others soon follow. Assume the current state is Xi and the best neighbor within the visible range is Xmax. If Yi < Ymax and nf/N < δ, there is more food around Xmax and it is not crowded, so move one step toward Xmax; otherwise, fall back to the food search.

To design the Artificial Fish Swarm Algorithm, we first build a model of an individual fish (Artificial Fish, AF). Each fish chooses the behavior most suitable for its current state, and the optimal result for the group emerges through the group or through some individual, as sketched below.
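To make the three behaviors concrete, the following is a minimal Matlab sketch of one AFSA iteration, assuming a food-concentration (fitness) function handle Y to be maximized and an N-by-dim matrix X of fish positions. The function name afsa_step and the exact update rules are our own illustration, not the authors' original program; the parameter names (visual, step, try_number, delta) follow the settings quoted later in Sect. 3.2.

```matlab
% A minimal sketch of one AFSA iteration over the three behaviors above.
% Y: fitness (food concentration) handle, maximized; X: N-by-dim positions.
function X = afsa_step(Y, X, visual, step, try_number, delta)
    [N, dim] = size(X);
    for i = 1:N
        Xi = X(i, :);
        d = sqrt(sum((X - repmat(Xi, N, 1)).^2, 2));  % distances D_ij
        nbr = find(d > 0 & d < visual);               % visible neighbors
        moved = false;
        if ~isempty(nbr) && numel(nbr) / N < delta    % region not crowded
            % following behavior: step toward the best visible neighbor
            [Ymax, k] = max(arrayfun(@(j) Y(X(j, :)), nbr));
            Xmax = X(nbr(k), :);
            if Y(Xi) < Ymax
                Xi = Xi + step * (Xmax - Xi) / max(norm(Xmax - Xi), eps);
                moved = true;
            else
                % group behavior: step toward the center of the neighbors
                Xc = mean(X(nbr, :), 1);
                if Y(Xi) < Y(Xc)
                    Xi = Xi + step * (Xc - Xi) / max(norm(Xc - Xi), eps);
                    moved = true;
                end
            end
        end
        if ~moved
            % random food search: probe up to try_number random states
            for t = 1:try_number
                Xj = Xi + visual * (2 * rand(1, dim) - 1);
                if Y(Xi) < Y(Xj)
                    Xi = Xi + step * (Xj - Xi) / max(norm(Xj - Xi), eps);
                    moved = true;
                    break
                end
            end
            if ~moved                        % no better state found: drift
                Xi = Xi + step * (2 * rand(1, dim) - 1);
            end
        end
        X(i, :) = Xi;
    end
end
```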

2.2 Gray model neural network

Deng [6] describes the gray problem: forecasting the development of the feature values of an uncertain (gray) system's behavior. After one accumulated generating operation (one-time adding), the original series of feature values Xt(0) (t = 0, 1, 2, …, N−1) becomes Xt(1), which shows an exponential growth pattern; a continuous function or a differential equation can therefore be fitted to the data for forecasting. For convenience, we redefine the symbols: the original series Xt(0) is written as X(t), Xt(1) as Y(t), and the forecast result Xt*(1) as Z(t). Shi-Fong et al. [16] write that the differential equation of the gray neural network model with n parameters is:
$$ \frac{{dy_{1} }}{{dt}} + ay_{1} = b_{1} y_{2} + b_{2} y_{3} + \cdots + b_{n - 1} y_{n} $$
(1)

In the equation, y2, …, yn are the system input parameters; y1 is the system output parameter; and a, b1, b2, …, bn−1 are the coefficients of the differential equation.

The time response of (1) is:
$$ \begin{aligned} z(t) = & \left( {y_{1} (0) - \frac{{b_{1} }}{a}y_{2} (t) - \frac{{b_{2} }}{a}y_{3} (t) - \cdots - \frac{{b_{n - 1} }}{a}y_{n} (t) } \right)e^{ - at} \\ & + \frac{{b_{1} }}{a}y_{2} (t) + \frac{{b_{2} }}{a}y_{3} (t) + \cdots + \frac{{b_{n - 1} }}{a}y_{n} (t) \\ \end{aligned} $$
(2)
Let \( d = \frac{{b_{1} }}{a}y_{2} (t) + \frac{{b_{2} }}{a}y_{3} (t) + \cdots + \frac{{b_{n - 1} }}{a}y_{n} (t). \)
Equation (2) can be transformed into (3):
$$ \begin{aligned} z(t) &= \left( (y_{1}(0) - d) \cdot \frac{e^{-at}}{1 + e^{-at}} + d \cdot \frac{1}{1 + e^{-at}} \right) \cdot (1 + e^{-at}) \\ &= \left( (y_{1}(0) - d)\left( 1 - \frac{1}{1 + e^{-at}} \right) + d \cdot \frac{1}{1 + e^{-at}} \right) \cdot (1 + e^{-at}) \\ &= \left( (y_{1}(0) - d) - y_{1}(0) \cdot \frac{1}{1 + e^{-at}} + 2d \cdot \frac{1}{1 + e^{-at}} \right) \cdot (1 + e^{-at}) \end{aligned} $$
(3)
When the transformed (3) is mapped onto an expanded BP neural network, we obtain a Gray Model Neural Network with n input parameters and one output parameter, as shown in Fig. 2:
Fig. 2 Gray Model Neural Network topological structure

Here, t is the serial number of the input parameters; y2(t), …, yn(t) are the network inputs; ω21, ω22, …, ω2n, ω31, ω32, …, ω3n are the network weights; y1 is the network forecast value; and LA, LB, LC, LD denote the four layers of the Gray Model Neural Network, respectively. Let \( \frac{{2b_{1} }}{a} = u_{1} ,\frac{{2b_{2} }}{a} = u_{2} , \ldots ,\frac{{2b_{n - 1} }}{a} = u_{n - 1} ; \) then the initial network weights can be written as:
$$ \omega_{11} = a,\,\quad\omega_{21} = - y_{1} (0),\,\quad\omega_{22} = u_{1} ,\,\quad\omega_{23} = u_{2} , \ldots \,\omega_{2n} = u_{n - 1} $$
$$ \omega_{31} = \omega_{32} = \cdots = \omega_{3n} = 1 + e^{ - at} $$
In the LD layer, the threshold value of the output node is:
$$ \theta = (1 - e^{ - at} )(d - y_{1} (0)) $$
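To make the mapping concrete, the following is a minimal Matlab sketch of the network's forward pass, which evaluates the response of Eq. (3) directly. The function name gmnn_response and the argument layout are our own; a and b = [b1, …, bn−1] are the differential-equation coefficients, y1_0 = y1(0), and y_in = [y2(t), …, yn(t)] are assumed known.

```matlab
% A minimal sketch of the time response in Eq. (3), i.e. one forward pass
% through the four-layer network of Fig. 2.
function z = gmnn_response(a, b, y1_0, y_in, t)
    d = sum(b .* y_in) / a;         % d as defined after Eq. (2)
    s = 1 / (1 + exp(-a * t));      % LB-layer sigmoid node 1/(1 + e^{-at})
    % LC/LD layers realize Eq. (3):
    z = ((y1_0 - d) - y1_0 * s + 2 * d * s) * (1 + exp(-a * t));
end
```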

2.3 Genetic programming and GPOLS

Genetic programming is an algorithm developed by Koza [7, 8] on the basis of Holland's genetic algorithm [9]. It shares the same concepts as the genetic algorithm, including chromosomes, fitness functions, reproduction, crossover, mutation, and so on. The difference is that genetic programming goes one step further and uses a syntax tree in place of the binary (0/1) genes of a chromosome, so each individual in the population represents a computer program. These programs act like genes, and the best one emerges through competition over the course of evolution. Genetic programming also differs from the genetic algorithm in that it represents chromosomes as tree structures of highly variable size and shape, each of which corresponds to a different equation, as seen in Fig. 3:
Fig. 3 Tree structure of genetic programming

The symbols (+) and (−) are internal nodes; the terminal nodes are the elements (X1, X2, and 3) defined by the problem. The arithmetic expression corresponding to the tree is X1 + (3 − X2); for more detail, see Koza's books on genetic programming [7, 8]. In this article, we used the existing Matlab GPOLS toolbox to construct the genetic programming model. The main idea of this toolbox is to use the orthogonal least squares (OLS) algorithm to build a GP model, and one of its outputs is a polynomial that includes the variables' initial coefficients and constants; for applications of GPOLS, see Babu and Karthik [10]. In the next section, we use the Artificial Fish Swarm Algorithm to dynamically adjust these coefficients and constants to enhance the forecast capability of the model. The Matlab GPOLS toolbox can be downloaded from http://www.fmt.vein.hu/softcomp/gp/gpols.html.
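As an illustration, the following minimal Matlab sketch evaluates such a syntax tree. The nested-cell encoding {op, left, right} and the function eval_tree are our own illustrative representation, not the internal one used by the GPOLS toolbox.

```matlab
% A minimal sketch of evaluating a syntax tree such as the one in Fig. 3.
function y = eval_tree(node, X)
    if isnumeric(node)                      % constant leaf, e.g. 3
        y = node;
    elseif ischar(node)                     % variable leaf, e.g. 'X2'
        y = X.(node);
    else                                    % internal node {op, left, right}
        l = eval_tree(node{2}, X);
        r = eval_tree(node{3}, X);
        switch node{1}
            case '+', y = l + r;
            case '-', y = l - r;
            case '*', y = l .* r;
            case '/', y = l ./ r;
        end
    end
end

% The tree of Fig. 3, X1 + (3 - X2):
%   tree = {'+', 'X1', {'-', 3, 'X2'}};
%   yhat = eval_tree(tree, struct('X1', x1, 'X2', x2));
```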

3 Empirical research

3.1 Sample data and variables

We collected share data for TSMC and UMC from the infotimes database. The collection period runs from October 2006 to November 2010, for a total of 1,000 records per stock. The variables are the 6-day moving average (X1), 6-day RSI (X2), 6-day opening price (X3), 6-day closing price (X4), 6-day deviation (X5), 6-day Williams overbought/oversold index (X6), 6-day psychological line (X7), and the daily closing price (Y). Statistics of these indices are given in Table 1, and the price movements of the two companies are shown in Fig. 4. We divided the 1,000 records of each company into five groups of 200 records each; four groups were used to construct the model, and the remaining group was used for cross-validation to test the stability of the model, as a reference for researchers.
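The grouping can be sketched in Matlab as follows; the matrix name records and its dummy values are placeholders for the 1,000 chronologically ordered infotimes rows of one stock (columns X1–X7 and Y).

```matlab
% A minimal sketch of the 5-group split described above.
records = rand(1000, 8);                    % placeholder for the real data
groups  = reshape(1:1000, 200, 5);          % column k = indices of group k
holdout = 5;                                % group kept for cross-validation
train_idx = groups(:, setdiff(1:5, holdout));
train = records(train_idx(:), :);           % 800 rows to build the model
test  = records(groups(:, holdout), :);     % 200 rows for the stability test
```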
Table 1 Statistics of technical indices of TSMC and UMC shares

Stock  Index  X1      X2      X3     X4      X5      X6      X7
TSMC   Max    72.38   88.07   5.25   151     13.99   100     100
       Min    38.17   12.27   0.09   −137    −8.66   0       0
       Avg    59.739  51.529  0.973  1.339   0.027   47.863  48.083
       Std    7.094   15.608  0.652  9.018   2.371   31.985  19.205
UMC    Max    21.82   92.59   35     52      20.07   100     100
       Min    7.07    8.84    0.07   −147    −11.27  0       0
       Avg    15.655  49.003  1.060  1.331   −0.029  53.362  43.383
       Std    3.754   17.354  1.326  7.771   3.135   32.266  19.912

Fig. 4 Movement of closing prices of TSMC and UMC shares

3.2 Use of GP and GP–AFSA to construct the initial closing-price forecast model

First, we fed the data of the first four groups into Matlab. In the first stage, we used the existing Matlab GPOLS toolbox to construct the genetic programming model (GP Model). The model parameters were set as follows: operators "+", "−", "×", "÷"; population size 20; maximum tree depth 6; 10,000 evolution generations. We put all 800 records of the independent variables (X1–X7) and the dependent variable (Y) into Matlab to construct the genetic programming model; the output is shown in Fig. 5. The fitness of the TSMC share forecast model is 0.0687 and its MSE is 0.0735; the fitness of the UMC share forecast model is 0.0684 and its MSE is 0.0742. The arithmetic expressions of the TSMC and UMC share forecast models are shown in Fig. 5 and are used below when we compare the forecast capabilities of the four share forecast models.
Fig. 5 Output by use of Matlab GPOLS to construct the forecast model

Next, in the second stage, we optimized the share forecast models for TSMC and UMC by taking the coefficients in front of the independent variables (X) in the arithmetic expressions as initial weights (W) and then optimizing these weights to enhance the models' forecast capability. Take TSMC for example; its arithmetic expression is treated as:
$$ W1 \times \left( {X1} \right) - W2 \times \left( {X4} \right)\left( {X5} \right) - W3 \times \left( {X3} \right) + W4 $$
The initial values are W1 = −0.567489, W2 = −1.038610, W3 = −0.701528, W4 = +2.139479.
Take UMC for example; its arithmetic expression is treated as:
$$ W1 \times \left( {X1} \right) - W2 \times \left( {X4} \right) - W3 \times \left( {X2} \right) - W4 \times \left( {X5} \right) + W5 $$
The initial values are W1 = −0.811989, W2 = −0.647280, W3 = −0.561026, W4 = −0.837967, W5 = +2.727951.
We used the Artificial Fish Swarm Algorithm to search for the optimal weights (W) of the arithmetic expressions and thereby enhance the forecast capability of the TSMC and UMC share forecast models (GP–AFSA Model). The initial AFSA parameters were set as follows: visual range 1, step 0.01, try number 5, population size 20 fish, 100 iterations, with iteration stopping early once RMSE falls below 0.05. The best fitness value in the initial population was taken as the initial weights (W) of the arithmetic expression, and iterative adjustments then searched for the best weights. Through the fish group's random food search, grouping, and following behaviors, the RMSE between the arithmetic expression's forecast value (\( \widehat{Y} \)) and the dependent variable (Y) is driven to a minimum; the objective is sketched below. Figure 6 shows the result of iteratively adjusting the initial weights for TSMC and UMC. The RMSE of TSMC converges at the 61st generation, with optimized W1 = −0.494107, W2 = −0.733572, W3 = −0.981022, W4 = +2.444371. The RMSE of UMC converges at the 57th generation, with optimized W1 = −0.992022, W2 = −0.435501, W3 = −0.527110, W4 = −0.397713, W5 = +2.776629. After the weight (W) adjustment by the Artificial Fish Swarm Algorithm, RMSE drops clearly and forecast capability is enhanced.
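As an illustration, the stage-2 objective for TSMC can be written as follows, assuming X1, X3, X4, X5, and Y hold the 800 training records as column vectors; the names rmse_tsmc and food are our own, and negating the RMSE turns it into the food concentration maximized by the afsa_step sketch of Sect. 2.1.

```matlab
% A minimal sketch of the stage-2 fitness: a fish position is a candidate
% weight vector W for the TSMC expression W1*X1 - W2*X4.*X5 - W3*X3 + W4.
rmse_tsmc = @(W) sqrt(mean((W(1)*X1 - W(2)*(X4.*X5) - W(3)*X3 + W(4) - Y).^2));
food = @(W) -rmse_tsmc(W);   % AFSA maximizes food concentration
% afsa_step(food, W_population, 1, 0.01, 5, delta) is then iterated up to
% 100 times, stopping once rmse_tsmc falls below 0.05.
```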
Fig. 6 Trend chart of optimized RMSE after adjustment of the weights by AFSA

3.3 Use of GMNN and GMNN–AFSA to forecast error of GP–AFSA model

Although the GP–AFSA model's forecast capability in the training stage is clearly better than that of GP, forecast error still remains. Traditional multiple regression, which is similar in form to the arithmetic expressions, defines such an error as the residual (ε), shown below:
$$ Y = \alpha + \beta_{1} X_{1} + \beta_{2} X_{2} + \varepsilon $$
wherein ε is the error term, a random variable. The residual is therefore:
$$ \varepsilon = Y - \widehat{Y} $$
In this article, the residual (ε) is predicted by the GMNN model. In this section, we therefore follow Pai and Lin [11] and forecast (ε), then combine it with the arithmetic-expression forecast of GP–AFSA (GP–AFSA+GMNN Model) to further enhance the forecast capability of the TSMC and UMC share forecast models, as shown below:
$$ Y = \widehat{Y} + \widehat{\varepsilon } $$
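In Matlab terms, assuming y_hat holds the GP–AFSA forecast and eps_hat the GMNN residual forecast (both placeholder names), the combination is simply:

```matlab
% A minimal sketch of the combination above.
eps_train = Y - y_hat;        % residual (eps) on which the GMNN is trained
y_final   = y_hat + eps_hat;  % final stage-3 forecast: Yhat + epshat
```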
In the third stage, we combined all of the independent variables (X1–X7) with the residual ε (as the dependent variable) to form the sample data used to construct the GP–AFSA+GMNN model. That is, we used the Gray Model Neural Network to forecast the residual (ε) of the arithmetic expression in the GP–AFSA model. Because the Gray Model Neural Network structure is determined by the dimensions of the input/output data, we used 7 input dimensions (X1–X7) and 1 output dimension (ε); the resulting gray neural network structure is 1-1-8-1, with 8 nodes in the LC layer. Technical-index data X1–X7 were fed into nodes 2–8, and the output is (ε). We fed the first four data groups of TSMC and UMC shares into the Gray Model Neural Network for training. The network parameters a, b1, b2, b3, b4, b5, b6, b7 were randomly initialized; the default learning rates (u1, u2, u3, u4, u5, u6, u7) were 0.005; and the network was run for 1,000 iterations. Figure 7 shows the result of executing the GMNN Matlab program.
Fig. 7 Use of the GMNN Matlab program to produce the forecast result of the residual ε

In the fourth stage, we used the Artificial Fish Swarm Algorithm to optimize the Gray Model Neural Network parameters a, b1, b2, b3, b4, b5, b6, b7 from the third stage in order to enhance forecast capability (GP–AFSA+GMNN–AFSA Model). As in Sect. 3.2, the visual range is 1, step 0.01, try number 5, population size 20 fish, 100 iterations, with iteration stopping once RMSE falls below 0.05. The best fitness values in the initial population were taken as the initial values of the Gray Model Neural Network parameters (a, b1, …, b7), and iterative adjustments then searched for the best parameters. Through the fish group's random food search, grouping, and following behaviors, the RMSE between the Gray Model Neural Network's forecast value (\( \widehat{Y} \)) and the dependent variable (Y) is driven to a minimum; the objective is sketched below. Figure 8 shows the result of the iterative adjustment of the Gray Model Neural Network parameters for TSMC and UMC. The RMSE of TSMC converges at the 41st generation, with optimized parameters a = 0.6522, b1 = 0.3572, b2 = 0.4041, b3 = 0.3997, b4 = 0.5576, b5 = 0.3176, b6 = 0.3506, b7 = 0.4733. The RMSE of UMC converges at the 44th generation, with optimized parameters a = 0.7199, b1 = 0.3337, b2 = 0.3506, b3 = 0.6707, b4 = 0.6554, b5 = 0.4224, b6 = 0.3006, b7 = 0.4111. After adjustment of the Gray Model Neural Network parameters by the Artificial Fish Swarm Algorithm, RMSE drops evidently and forecast capability is enhanced.
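As an illustration, the stage-4 objective can be sketched as follows, assuming res holds the observed residuals as a column vector, Xtr the 800-by-7 matrix of X1–X7, and gmnn_response the forward pass sketched in Sect. 2.2; the names, and the use of res(1) as y1(0), are our own illustrative choices.

```matlab
% A minimal sketch of the stage-4 fitness: a fish position is a candidate
% parameter vector p = [a, b1 ... b7] for the gray model neural network,
% scored by the RMSE between the network output and the observed residual.
gmnn_rmse = @(p) sqrt(mean(arrayfun(@(t) ...
    gmnn_response(p(1), p(2:8), res(1), Xtr(t, :), t) - res(t), ...
    (1:numel(res))').^2));
food = @(p) -gmnn_rmse(p);   % negated so that afsa_step can maximize it
```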
Fig. 8 Trend chart of optimized RMSE after adjustment of the Gray Model Neural Network parameters by AFSA

3.4 General comparison of forecast capabilities of the four models

We cross-examined the models on the five data groups of TSMC and UMC shares to test their stability, using five evaluation indicators for the four models. They are:

Root mean squared error, RMSE, and its formula is:
$$ {\text{RMSE}} = \sqrt {\frac{{\sum\nolimits_{t = 1}^{N} {\left( {x_{t} - \hat{x}_{t} } \right)^{2} } }}{N}} $$
Revision theil inequality coefficient, RTIC, and its formula is:
$$ {\text{RTIC}} = \left[ {\frac{{\sum\nolimits_{t = 1}^{N} {\left( {X_{t} - \widehat{X}_{t} } \right)^{2} } }}{{\sum\nolimits_{t = 1}^{N} {X_{t}^{2} } }}} \right]^{\frac{1}{2}} $$
Mean absolute error, MAE, and its formula is:
$$ {\text{MAE}} = \frac{1}{M}\sum\limits_{l = 1}^{M} {\left| {Z_{t + l} - \widehat{Z}_{t} (l)} \right|} $$
where M is the number of forecast values, Zt+l is the observed value l periods ahead, and \( \widehat{Z}_{t} (l) \) is the corresponding estimate.
Mean absolute percentage error, MAPE, and its formula is:
$$ {\text{MAPE}} = \left( {\frac{1}{M}\sum\limits_{l = 1}^{M} {\left| {\frac{{Z_{t + l} - \widehat{Z}_{t} (l)}}{{Z_{t + l} }}} \right|} } \right) \times 100\% $$
Coefficient of efficiency, CE, and its formula is:
$$ {\text{CE}} = 1 - \frac{{\sum {\left( {x_{t} - \widehat{x}_{t} } \right)^{2} } }}{{\sum {\left( {x_{t} - \overline{x}_{t} } \right)^{2} } }} $$
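For reference, the five indicators can be computed in Matlab as follows, given an observed series x and its forecast xh (placeholder names; column vectors of equal length), matching the formulas above.

```matlab
% A minimal sketch of the five evaluation indicators.
rmse = sqrt(mean((x - xh).^2));
rtic = sqrt(sum((x - xh).^2) / sum(x.^2));
mae  = mean(abs(x - xh));
mape = mean(abs((x - xh) ./ x)) * 100;               % in percent
ce   = 1 - sum((x - xh).^2) / sum((x - mean(x)).^2);
```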
For the first four indicators, the closer the value is to zero, the more precise the forecast; for the fifth, the closer the value is to one, the more precise the forecast. Table 2 shows the results:
Table 2 Cross-examination of the five evaluation indicators

Stock  Model               RMSE   RTIC   MAE    MAPE   CE
TSMC   GP                  0.704  0.051  0.519  0.046  0.914
       GP–AFSA             0.466  0.034  0.310  0.027  0.953
       GP–AFSA+GMNN        0.382  0.028  0.282  0.020  0.977
       GP–AFSA+GMNN–AFSA   0.184  0.021  0.223  0.018  0.985
UMC    GP                  0.588  0.092  0.664  0.044  0.918
       GP–AFSA             0.290  0.057  0.471  0.028  0.943
       GP–AFSA+GMNN        0.238  0.045  0.260  0.023  0.960
       GP–AFSA+GMNN–AFSA   0.169  0.023  0.196  0.020  0.979

From Table 2 we can see that, for TSMC, the GP–AFSA+GMNN–AFSA forecast model's RMSE is 0.184, RTIC 0.021, MAE 0.223, and MAPE 0.018, all lower than those of GP, GP–AFSA, and GP–AFSA+GMNN; its CE of 0.985 is higher than that of the other three models. Likewise, for UMC, the GP–AFSA+GMNN–AFSA model's RMSE is 0.169, RTIC 0.023, MAE 0.196, and MAPE 0.020, all lower than those of GP, GP–AFSA, and GP–AFSA+GMNN; its CE of 0.979 is higher than that of the other three models. Therefore, the multi-stage stock forecast model GP–AFSA+GMNN–AFSA has better forecast capability than the other three models.

4 Conclusions and suggestion

As many factors affect Taiwan stocks, closing prices are highly random, so a closing-price forecast model should be as precise as possible. This article has shown how to combine a multi-stage optimization process with modern data mining techniques to build a forecast model, as a reference for researchers. Table 2 shows that the forecast capability of each stage after optimization is better than that of its previous stage, and that the stage-4 stock forecast model (GP–AFSA+GMNN–AFSA) greatly enhances forecast precision.

In addition, this article used AFSA to optimize the forecast model. In future work, we suggest that other algorithms, such as Eberhart and Kennedy's Particle Swarm Optimization [12] or Teodorović's Bee Colony Optimization [13–15], be used to optimize the model.

Copyright information

© Springer-Verlag London Limited 2011