Introduction

A decision tree that recursively learns splits from higher to lower nodes is generally constructed by a series of local judgments and often involves the problem of noise overfitting. To solve this, various proposals have been made to optimize all splits collectively using evolutionary computation. However, many previous studies searched for the best tree by applying the branches of a highly rated tree to another tree. In the case of complicated data, i.e., data that contains a lot of noise and extraordinary values, the split with a high fitness value at one node is peculiar to the data sample at that node; the fitness value does not improve even if this is applied to another data sample [1, 2].

A method of searching all splits from one tree simultaneously without exchanging branches has also been proposed. However, because the number of combinations of features and their splitting points, or threshold values, is enormous, such a method is impractical for complicated data [3, 4].

Bilevel GA, which performs feature selection and searches for the order and thresholds of selected features by separate evolutionary computation, has been proposed recently as an efficient method for collectively optimizing the overall structure of a tree. In particular, this method has shown good estimation accuracy using relatively small-scale data [5,6,7].

However, the evaluation function of the tree becomes intermittent when a feature used for the split of the upper node changes owing to the simultaneous search of the feature order and the threshold value. Therefore, it is not possible to use a search method that presupposes continuity. To address this, a GA that repeats individual selection, crossover, and mutation is used as the search method because it optimizes splits that become non-deterministic polynomial time (NP)-hard when complex data are used. In addition, when employing such a GA, a split that is effective for an area region divided by specific upper nodes can be applied to a region divided by other upper nodes. Therefore, the fitness value of the tree does not always improve by generations, and the best tree might not be searched [8, 9]

In this study, we propose a method for searching splits on the premise of continuous distribution by searching for the optimal threshold under a specific feature and determining the optimal combination of features by separate evolutionary computation. In addition, we compare the computational time and prediction accuracy of this method with those using general machine learning and the results of previous studies using financial time series data as an example of complex real data.

Related work

Building a tree while avoiding overfitting

As mentioned in Sect. 2, when constructing a decision tree, there is a risk of falling into a locally optimal solution when using the greedy method, which recursively learns the splits from the upper to lower nodes. To avoid this, numerous attempts have been made to use evolutionary computation to find the optimal decision tree [10,11,12,13,14,15].

Attempts to collectively optimize the split of decision trees by genetic programming have been reported in the literature [16,17,18,19]. However, genetic programming is limited when streamlining optimization calculations by searching under continuous distribution. Moreover, it becomes difficult to converge to a global optimal solution when searching a decision tree with a large number of splits.

Building a tree by evolutionary computation

When optimizing the splits of the entire tree collectively by evolutionary computation, the fitness value of the tree changes non-linearly when the features of the splits of the upper node are changed. It is not possible to use evolutionary computation to generate individuals based on continuous distribution.

Optimization methods assuming continuous distribution include the steepest gradient method, Adam optimization, Newton’s method, and Bayesian optimization [20, 21]. Such methods include one-point search, which has a higher probability to result in a locally optimal solution when handling complex data with a large number of dimensions. Under such conditions, approaches considered to be more suitable include the stochastic search method [22], which is a black box optimization method using multipoint search, and real-valued evolutionary computation. Typical methods of real-valued evolutionary computation include real-valued GA [23], evolution strategy [24], differential evolution [25], and particle swarm optimization [26].

Real-valued GA showed high performance in the evaluation function with problems such as bad scale, intervariable dependency, and multimodality through minimal generation gap [27], unimodal distribution crossover (UNDX) [28], and real-coded ensemble crossover star (REXstar) [29]. However, none of these methods can handle multidimensional complex data that with large noise owing to their difficulty in adjusting the step size, population size, number of offspring, and other factors.

The covariance matrix adaptation evolution strategy (CMA-ES) [30] and distance-weighted exponential natural evolution strategies (DX-NES) [31] provide examples of the evolution strategy in which the need to adjust the step size, population size, and number of offspring is relatively small, which enables even a non-linear discontinuous evaluation function to be searched. However, the search performance of DX-NES deteriorates significantly when this strategy is applied to complex data with large noise [32].

CMA-ES, which is relatively resistant to noise, is considered to be desirable for the evaluation function that changes significantly by changing the threshold of the split in a decision tree. Of the many variations of CMA-ES proposed, [33] reported the best performance for data in noisy and uncertain environments, and the [34] model is suitable for searching decision trees.

Data and methods

Data

To compare the performance of a globally optimized model tree with that using general machine learning or a bilevel GA in a previous research, we evaluate the prediction accuracy using multiple benchmark data. Because the model tree constructed in this study aims to unravel a complicated data structure and extract a universal pattern for population, the data used for accuracy evaluation should also be applicable to complicated and noisy conditions.

UC Irvine Machine Learning Depository

From the UC Irvine Machine Learning Depository, we select relatively simple classification problems that have been used in many previous studies in addition to relatively complex regression problems meeting the following conditions.

  • Explanatory variable type: continuous variable

  • Data type: time series

  • Number of explanatory variables: more than 10

  • Number of samples: about 10,000 or more

  • Low sparseness

Table 1 shows a summary of the data used in this study.

Table 1 Benchmark data used in this study

Financial market time series data

In addition, time series data of financial markets are used as actual data of complex systems. Financial market data contain many one-off factors and noise and serve as representative data for which high prediction accuracy cannot be obtained even by machine learning. This study uses as an objective variable the intraday return for the TOPIX Futures nearby month from the opening price at 08:45 to the closing price at 15:15. There are many other stock indices which represent the global financial markets, such as S&P 500, Dow Jones Industrial Average, Euro Stoxx 50 and so on; however, most of them have been on a consistent upward trend since 2009. Although TOPIX Futures have a smaller trading volume than other indices, they have no long-term trends and are a better example of complex system data. As explanatory variables, we select indicators that represent the financial markets of the United States and Japan. Unlike the objective variable, the explanatory variables do not necessarily have to be the price of the product that can actually be traded, although they need to reflect the movement of the entire financial market from a different perspective. For this reason, the change in the closing value of the stock index, exchange rate, and interest rate shown in Table 2 are used.

Table 2 Financial data used to predict TOPIX Futures in next business day

The financial data used to predict the TOPIX Futures in next business day include 4,901 intraday returns between January 04, 2001, and December 30, 2020 as an objective variable.

Methods

The purpose of this study is to obtain high prediction accuracy by a globally optimized model tree. For this purpose, a model tree is constructed by splitting a sample space with certain splits of a tree, evaluating the versatility of the pattern recognition model at each final node of the tree, and searching the best splits so the average versatility of the pattern recognition models in all final nodes becomes highest. The splits are not recursively searched individually; instead, they are simultaneously searched from large-scale combination optimization.

Bilevel GA

When optimizing the overall structure of a decision tree collectively, the fitness value of the tree changes discontinuously when the features used for the splits of a certain node change; therefore, the search method based on the continuity of functions cannot be used. In this study, we first identify the features to be used for each split and their positions in a tree randomly, and we then optimize the threshold for each feature by inner level search. Finally, we search the best features and their positions in a tree by outer level search. This method is referred to as a bilevel genetic algorithm (GA) used in this study. Figure 1 shows the outline of the bilevel GA by related works, and Fig. 2 shows the outline of the bilevel GA used in this study.

Fig. 1
figure 1

Outline of bilevel genetic algorithm (GA) used in related works

Fig. 2
figure 2

Outline of bilevel genetic algorithm (GA) used in this study

In the inner level search, the features used for each split and their positions in a tree are given, which enables the threshold value of each feature to be searched by a method using continuous distribution. All of the explanatory variables in this study are continuous; therefore, the threshold value change continuously. Because the data sample under a certain split changes discontinuously when the threshold values change, the evaluation function of the tree becomes discontinuous. However, data samples under certain splits change gradually one by one as the threshold changes, which enables the use of an evolutionary computation method that generates individuals based on continuous distribution.

In previous bilevel GA research, the position of the feature was also changed at the inner level, and the search method that presupposes the continuity of the function could not be used. Conversely, the inner level search in this study can be performed efficiently using continuous distribution.

In the outer level search, multiple trees with different features and their positions are generated randomly, and the optimal tree is searched by repeating the selection, crossover, and mutation. When evaluating each individual tree at the outer level, that in which all thresholds are already optimized according to the inner level is used.

Inner level optimization

CMA-ES, an evolutionary computation that performs multipoint search based on normal distribution, is used in this study to search the threshold value at the inner level search. This method updates the covariance matrix based on the evolutionary path that accumulates the previous solutions and generates offspring in the direction of movement of the solutions. It is suitable for the inner level search because it can search non-continuous evaluation functions by assuming normal distribution. In addition, CMA-ES is more resistant to noise than other efficient search methods that can handle discontinuous evaluation functions.

However, for data with particularly high levels of complexity such as financial time series data, the evaluation function becomes steep and multimodal, and the search for a global optimal solution requires a large amount of calculation even when using CMA-ES.

In general CMA-ES applications, the degree of freedom is \(n + \frac{{n^{2} - n}}{2}\); the time complexity is \(O\left( {n^{2} } \right)\); and the spatial complexity is \(O\left( {n^{3} } \right)\) for the number of the dimension \(O\left( {n^{2} } \right)\) of the evaluation function. By limiting the variance–covariance matrix \(C^{{\left( {t + 1} \right)}}\) used for individual genetion to diagonal components, the degree of freedom becomes \(n\), and the amount of time and spatial complexity is reduced to \(O\left( n \right)\):

$$c_{jj}^{{\left( {t + 1} \right)}} = \left( {1 - c_{cov} } \right)c_{jj}^{\left( t \right)} + \frac{1}{{\mu_{cov} }}c_{cov} \left( {p_{c}^{{\left( {t + 1} \right)}} } \right)_{j}^{2} + c_{cov} \left( {1 - \frac{1}{{\mu_{cov} }}} \right)\mathop \sum \limits_{i = 1}^{\mu } w_{i} c_{jj}^{\left( t \right)} \left( {z_{i:\lambda }^{{\left( {t + 1} \right)}} } \right)_{j}^{2} , \quad j = 1, \ldots ,n$$
(1)

where \(C_{{\text{cov}}} \in [0,1]\) is the learning rate of diagonal element updates; \(\frac{1}{{\mu_{{\text{cov}}} }} \in [0,1]\) is the weighting coefficient of the evolution path \(p_{c}^{{\left( {t + 1} \right)}}\); \(z_{i:\lambda }^{{\left( {t + 1} \right)}}\) is the \(i\)-th most rated of the \(z^{{\left( {t + 1} \right)}}\); and \(\left( {z_{i:\lambda }^{{\left( {t + 1} \right)}} } \right)_{j}\) is the \(i\)-th component of \(\left( {z_{i:\lambda }^{{\left( {t + 1} \right)}} } \right)_{j}\).

Outer level optimization

In the outer level search, the features used for each split and their positions are optimized. The parameters to be optimized are discrete values. In a decision tree, if the features used in a certain split are changed, the structure below will change significantly. Therefore, in the outer level search, the search method assuming continuous distribution cannot be used. For this reason, we use a GA that searches for individuals with high fitness values by repeating the selection, selection, crossover, and mutation because the shape of the evaluation function is not the issue.

As many trees in which thresholds of all splits are already optimized according to the inner level search as the population size are randomly generated as the initial population. A certain percentage among them (preservation rate) is left for the next generation in the order of the fitness value of the tree, and a pair is randomly created from the next certain percentage (crossover rate) group. Branches at random positions are swapped in a pair, and the rest of the population will not be passed on to the next generation and will instead be replaced by trees with new features and their positions. This process is regarded as one generation, and the generation change is repeated. In this study, we use 20 population sizes, a 20% preservation rate, a 20% crossover rate, and 200 generations, as shown by the outer level search in Fig. 2.

Model evaluation method

In the classification problem, we use the weighted accuracy rate of each final node as the fitness value of the individual (model tree) generated by the bilevel GA. In the regression problem, we use the weighted prediction accuracy by linear regression analysis at the final node. The prediction accuracy is the average R2 obtained by the five-fold cross-validation method. The linear regression model uses lasso regression with a regularization parameter of 0.1.

Large-scale combination optimization is required to select the optimal model tree when using such an evaluation method. Therefore, the Oakbridge-CX supercomputer system at the Information Technology Center, University of Tokyo, is used for the calculation.

Accuracy comparison of each method

To evaluate the prediction accuracy of each method, the results obtained from following approaches are compared: linear discriminant analysis; logistic regression analysis; support vector machine; neural network; classification and regression tree (CART); random forest; XGBoost; a decision tree constructed by bilevel GA proposed in [5], hereinafter referred to as bilevel GA by related work; and a decision tree constructed by bilevel GA proposed in this study, hereinafter referred to as bilevel GA by this study. For comparing the prediction accuracy used for comparison, the average classification accuracy rate of the results of five verifications is used according to the five-fold cross-validation method.

For regression problems, the following methods are used: multiple regression analysis, lasso regression analysis, partial minimum error, neural network, XGBoost, bilevel GA by related work, and a model tree constructed by bilevel GA proposed in this study, hereinafter also referred to as bilevel GA by this study. For the prediction accuracy used for comparison, the average R2 according to the five-fold cross-validation method is used.

For problems using financial time series data, the following methods are used: multiple regression analysis, lasso regression analysis, partial minimum error, neural network, XGBoost, bilevel GA by related work, and bilevel GA by this study. For the prediction accuracy used for comparison, the average R2 according to the five-fold cross-validation method is used.

Also, hyperparameters for all of the above machine learning methods are set by the three-fold cross validation grid-search.

Results and discussion

When using the UC Irvine data in a relatively simple classification problem, bilevel GA by this study showed high prediction accuracy, as did the other methods. In regression problems with a large numbers of data, the prediction accuracy of bilevel GA by this study exceeded that of other methods. Moreover, this method showed a much higher estimation accuracy than that of other methods when using financial data.

Prediction accuracy

For the classification problems of UC Irvine, bilevel GA by this study showed high prediction accuracy, as did the other methods (Table 3). Because the other methods also showed relatively high prediction accuracy, the pattern was easily recognized.

Table 3 Accuracy comparison of classification problems

For the regression problems of UC Irvine, bilevel GA by this study showed better prediction accuracy than other methods (Table 4). Some of other methods showed relatively low prediction accuracy because the patterns were difficult to recognize in some data, although the bilevel GA by this study was high even for such complicated problems.

Table 4 Accuracy comparison of regression problems

Moreover, in the financial time series data, this method showed a prediction accuracy that greatly exceeded that of other methods (Table 5). Although many studies have been conducted on predicting financial market prices by machine learning, no clear conclusion has been reached. However, bilevel GA by this study showed explicitly higher accuracy than that using general machine learning.

Table 5 Accuracy comparison using financial time series data

In addition, bilevel GA by this study uses elitism for the outer level search; thus, the fitness of trees improves by generations. Figure 3 shows the transition of the fitness of a best model tree constructed by bilevel GA by this study in Table 5 for each generation.

Fig. 3
figure 3

Changes in tree fitness for each generation at the outer level

Furthermore, in the model tree optimized at the outer level, which is the best tree selected by using the bilevel GA by this study, we confirmed the fitness of the tree for each generation at the inner level search (Fig. 4). Because the best solution is not preserved at the inner level search, the fitness of the tree does not always increase with each generation. As indicated in Fig. 4, although the shape of the evaluation function exhibits many irregularities, a global optimum solution was obtained. Therefore, CMA-ES has succeeded in searching a global optimum solution even though the fitness fluctuated owing to the influences of noise and extraordinary values.

Fig. 4
figure 4

Changes in tree fitness for each generation at the inner level

Impact of crossover

In this study, we used 20 population sizes, a 20% preservation rate, a 20% crossover rate, and 200 alternations of generations for outer level optimization, as discussed in Sect. 3.2.3.

By using the best tree in Table 5, the outer level search was performed using different crossing rates. The optimal fitness of the tree decreased as the crossover rate increased. Moreover, the fitness did not change significantly when the crossover rate was reduced (Table 6), and the change in the prediction accuracy was small even when values other than the crossover rate changed.

Table 6 Impact on performance for each crossover rate

As discussed in Sect. 1, the split at the lower node is effective only for the subsample generated by dividing the total sample at the upper level, which explains why the relatively high crossover rate deteriorated the fitness.

Conclusions

Decision trees have been widely used for data analysis because of their ease of interpretation. However, if the Greedy method, which recursively searches for split from the upper node to the lower node, is used, overfitting is likely to occur because a tree is constructed with a series of locally optimal solutions. Many previous research has attempted to collectively optimize the structure of a decision tree using evolutionary computation; however, many of them searched attributes of each split and their thresholds simultaneously; thus, optimization methods assuming continuous distribution cannot be used. In this study, we proposed bilevel GA that improved the problems in the previous research. As a result, we found that it surpassed the conventional methods in terms of performance from relatively simple problems to complex problems.

In this study, I compared the proposed method with the several major methods from many existing machine learnings and evaluated its superiority. It cannot be denied that some of the existing machine learning methods may exceed the proposed method in the prediction accuracy. On the other hand, the proposed method was proven to be able to improve the prediction accuracy of complex data while taking advantage of the fact that the prediction model is a decision tree with which the interpretation is easy. This can be said to be the great significance of this study.

We also found that it was still not possible to derive sufficiently high prediction accuracy even by using the proposed bilevel GA, when data with increased complexity, such as financial time series data, was used. The reason for this is thought to be that the tree was constructed based on the fitness of all final nodes of the tree. We may obtain a better prediction result if a tree is constructed so some final nodes with an extremely low fitness are excluded from the evaluation of the tree and if data classified as the final nodes excluded from the evaluation are not subject to prediction.

Also, in this study, we used a binary tree; however, the space in which pattern recognition model perform well is not necessarily all on one side from a certain threshold of the entire sample data. Therefore, it is desirable to use a multi-way tree when dividing the space by a tree. We want to make these issues for the future task.