Two-Level Regression Method Using Ensembles of Trees with Optimal Divergence

The article discusses a new two-level regression analysis method in which a corrective procedure is applied to optimal ensembles of regression trees. Optimization is carried out based on the simultaneous achievement of the divergence of the algorithms in the forecast space and a good approximation of the data by individual algorithms of the ensemble. Simple averaging, random regression forest, and gradient boosting are used as corrective procedures. Experiments are presented comparing the proposed method with the standard decision forest and the standard gradient boosting method for decision trees.


INTRODUCTION
Regression modeling methods based on computing more accurate collective forecasts from predictions made by a set (ensemble) of less accurate and simpler original algorithms are widely used in modern machine learning. These methods include the random regression forest and methods based on adaptive or gradient boosting. An important role in the construction of collective algorithms is played by the method of obtaining the original ensemble of so-called weak algorithms. A theoretical analysis shows that the generalization ability can be improved by using an ensemble of algorithms that not only have high accuracy but also produce maximally diverging forecasts [1]. A low correlation between forecasts potentially makes it possible to achieve a more accurate algorithmic approximation, which objectively ensures the most accurate forecast with the use of a bounded number of algebraic operations [2, 3]. In the random regression forest method, the divergence of forecasts is achieved by training the algorithms of the ensemble on different samples generated from the original training sample with the use of bootstrap [4]. In the gradient boosting method [5], an ensemble is generated sequentially: at every iteration step, the ensemble is supplemented with trees approximating the first derivatives of the loss function with respect to the variables corresponding to the collective forecast.
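The benefit of divergence can be checked numerically: for m unbiased predictors with per-predictor error variance σ² and pairwise error correlation ρ, the variance of the averaged forecast is ρσ² + (1 − ρ)σ²/m, so weaker correlation lowers the achievable floor. A minimal simulation of this effect (all numbers illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 100_000  # number of predictors, number of simulated points
sigma2 = 1.0        # per-predictor error variance

def avg_variance(rho):
    # Draw m correlated forecast errors with pairwise correlation rho
    # and measure the variance of their simple average.
    cov = np.full((m, m), rho * sigma2)
    np.fill_diagonal(cov, sigma2)
    errors = rng.multivariate_normal(np.zeros(m), cov, size=n)
    return errors.mean(axis=1).var()

v_high = avg_variance(0.9)  # strongly correlated ensemble
v_low = avg_variance(0.1)   # divergent ensemble
# Theory predicts rho * sigma2 + (1 - rho) * sigma2 / m:
# 0.91 for rho = 0.9 versus 0.19 for rho = 0.1.
```

The simulation reproduces the closed-form values closely, illustrating why ensemble generation procedures aim for low forecast correlation.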
Another important component is the method used to compute the collective forecast, which can also be interpreted as a result of mutual correction of forecasts. In the random regression forest method, correction is carried out by simple computation of average forecasts.
Another possible method for organizing a corrective procedure is the stacking scheme, in which the outputs of the ensemble algorithms are treated as input features of an algorithm computing a corrected output forecast [6, 7]. As a rule, the efficiency of stacking is low as applied to the computation of collective decisions from sets of weak algorithms produced by random forest generation procedures. It can be assumed that this low efficiency is caused by the insufficient divergence of the weak algorithms in the forecast space in the case of standard ensemble generation methods.
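As a concrete illustration of the stacking scheme, scikit-learn's StackingRegressor feeds the base algorithms' out-of-fold forecasts to a second-level estimator; the dataset and hyperparameters below are illustrative assumptions, not the paper's setup:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base level: several regression trees; second level: a linear corrector
# trained on the trees' cross-validated forecasts.
base = [(f"tree{i}", DecisionTreeRegressor(max_depth=4, random_state=i))
        for i in range(5)]
stack = StackingRegressor(estimators=base, final_estimator=RidgeCV(), cv=5)
stack.fit(X_tr, y_tr)
score = stack.score(X_te, y_te)  # coefficient of determination r^2
```

Note that here the base trees differ only in their random seeds, which typically yields weakly divergent forecasts; this is exactly the regime in which the text notes stacking tends to be inefficient.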
The goal of this paper is to study the efficiency of a two-level method for improving the generalization ability, which involves the construction of an ensemble consisting of algorithms characterized by a high degree of divergence in the forecast space and a good approximation of the target variable. Simple averaging and stacking are used as corrective procedures.

Let A = {A_1, ..., A_m} be an ensemble of algorithms predicting the value of a variable Y from a given vector X of input variables. It is assumed that the algorithms of the ensemble are trained on a sample S = {(X_1, y_1), ..., (X_l, y_l)}. Preliminarily, a baseline regression analysis method is chosen, which usually represents a regression tree model. Define the functionals Q1(A) and Q2(A), where Q1 estimates the average error of approximating Y from the vector X by the individual algorithms of the ensemble and Q2 is the variance of the forecasts produced by the algorithms of the ensemble. According to the declared goal, an ensemble is constructed by simultaneously minimizing Q1 and maximizing Q2. This problem can be reduced to the minimization of Q(A) = Q1(A) − αQ2(A), where the coefficient α > 0 determines the contribution of the heterogeneity of the ensemble in terms of the forecast variance.
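The combined criterion can be evaluated directly from a matrix of ensemble forecasts; the sketch below assumes mean squared error for the approximation term and the per-point forecast variance across algorithms for the divergence term (notation and toy numbers are illustrative):

```python
import numpy as np

def ensemble_criterion(preds, y, alpha):
    """Q(A) = Q1(A) - alpha * Q2(A), where preds[i, j] is the forecast of
    algorithm i at sample point j and y[j] is the target value."""
    q1 = np.mean((preds - y) ** 2)    # average approximation error
    q2 = np.mean(preds.var(axis=0))   # variance of forecasts across algorithms
    return q1 - alpha * q2

y = np.array([1.0, 2.0, 3.0])
tight = np.array([[1.1, 2.0, 2.9], [0.9, 2.1, 3.0]])   # agreeing forecasts
spread = np.array([[1.4, 2.4, 2.6], [0.6, 1.6, 3.4]])  # divergent forecasts
q_tight = ensemble_criterion(tight, y, alpha=0.5)
q_spread = ensemble_criterion(spread, y, alpha=0.5)
```

The divergence term rewards the spread-out ensemble, but only up to the point where the approximation error of the individual algorithms dominates, which is exactly the trade-off the coefficient alpha controls.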

Let ΔQ1 and ΔQ2 denote the variations in the functionals Q1 and Q2 occurring when the ensemble {A_1, ..., A_k} is supplemented with an algorithm A_{k+1}. The variation ΔQ1 splits into the approximation error of A_{k+1} and a term that does not depend on A_{k+1}. To compute ΔQ2, we use the well-known variance expression D[f] = E[f^2] − (E[f])^2, which likewise splits ΔQ2 into a term determined by the forecasts of A_{k+1} and a term that does not depend on A_{k+1}. An approach for minimizing Q is thus to reduce the problem to the search for an algorithm A_{k+1} (added to the ensemble) that minimizes the functional ΔQ = ΔQ1 − αΔQ2, discarding the terms that do not depend on A_{k+1}. An algorithm to be added to the ensemble at the step k + 1 is constructed by combining bagging and a gradient boosting procedure and consists of the following two stages:
1. At the first stage, a random number generator is applied to the original sample S to produce a sample S_{k+1} with replacement, which is then used to train an algorithm A'_{k+1}.
2. At the second stage, the algorithm A_{k+1} to be added to the ensemble is constructed. It computes the forecast at a point X using the formula A_{k+1}(X) = A'_{k+1}(X) − γG_{k+1}(X), where G_{k+1} is an algorithm predicting the gradient of the functional ΔQ at the point X and γ is the gradient descent step size. The algorithm G_{k+1} is trained on the sample S.
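The two-stage construction can be sketched as follows; squared error is assumed for the approximation term, so the negative gradient at a training point reduces to the residual, and the step size gamma, tree depths, and ensemble size are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=0)
rng = np.random.default_rng(0)
gamma, n_trees = 0.5, 20  # gradient step size and ensemble size (assumed)
ensemble = []

for k in range(n_trees):
    # Stage 1: draw a bootstrap sample with replacement from S and
    # train a baseline regression tree on it.
    idx = rng.integers(0, len(y), size=len(y))
    tree = DecisionTreeRegressor(max_depth=3, random_state=k).fit(X[idx], y[idx])

    # Stage 2: correct the tree by a step against the gradient of the
    # squared-error part of the criterion; for squared loss the gradient
    # at a training point is the residual tree(X) - y, which a second
    # tree (trained on the full sample S) learns to predict.
    grad = tree.predict(X) - y
    g_tree = DecisionTreeRegressor(max_depth=3, random_state=100 + k).fit(X, grad)
    ensemble.append((tree, g_tree))

def ensemble_predict(X_new):
    # Collective forecast: simple averaging of the corrected algorithms.
    preds = [t.predict(X_new) - gamma * g.predict(X_new) for t, g in ensemble]
    return np.mean(preds, axis=0)

r2 = 1 - np.mean((ensemble_predict(X) - y) ** 2) / y.var()
```

This sketch omits the divergence term of the criterion in the gradient for brevity; in the full method the gradient of ΔQ also includes the variance component weighted by alpha.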

EXPERIMENTS
The method was implemented in the Python language with the help of the scikit-learn library [8]. The baseline trees were constructed by applying the BaggingRegressor method. At the second level, we used GradientBoostingRegressor (referred to hereafter as boosting) or RandomForestRegressor (forest), or the results of the baseline methods were averaged (average). GradientBoostingRegressor was also used as a reference technique for estimating the efficiency of the proposed method.
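A minimal version of this two-level pipeline can be assembled from the named scikit-learn estimators; the synthetic dataset and hyperparameters are illustrative assumptions, as is the use of the first-level trees' raw forecasts as second-level features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=600, n_features=12, noise=8.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# First level: bagged regression trees.
bag = BaggingRegressor(DecisionTreeRegressor(max_depth=4),
                       n_estimators=25, random_state=1).fit(X_tr, y_tr)

def level1_forecasts(X_new):
    # One column per baseline tree: forecasts become second-level features.
    return np.column_stack([est.predict(X_new) for est in bag.estimators_])

# Second level: gradient boosting as the corrective procedure ("boosting")
# versus simple averaging of the baseline forecasts ("average").
booster = GradientBoostingRegressor(random_state=1)
booster.fit(level1_forecasts(X_tr), y_tr)
r2_boost = booster.score(level1_forecasts(X_te), y_te)
r2_average = 1 - np.mean((level1_forecasts(X_te).mean(axis=1) - y_te) ** 2) / y_te.var()
```

Swapping RandomForestRegressor in for the booster gives the "forest" variant from the text; all three correctors consume the same matrix of first-level forecasts.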
The developed method was used to predict the crystal lattice parameters of complex inorganic compounds and the melting points of halides. For various space symmetry groups, we present results for some of the parameters admitting a sufficiently reliable forecast: several lattice parameters for the monoclinic space groups, one parameter for the tetragonal and hexagonal groups, and the parameter a for the cubic group. The accuracy was evaluated using the standard characteristic r², or the coefficient of determination, computed by applying cross-validation. Since the algorithms in the bagging procedure are generated to a large degree at random, the solution results for the same problem change from experiment to experiment. Accordingly, for each problem, Table 1 presents the value of r² averaged over 10 experiments.

CONCLUSIONS
The results given in Table 1 show that, in most cases, the proposed two-level method yields better results than the standard gradient boosting algorithm, which prompts further research in this direction.We intend to explore the possibility of choosing an optimal gradient descent step size (in terms of certain criteria) in correcting bagging-generated algorithms.

Table 1. Experimental results