INTRODUCTION

Regression modeling methods based on computing more accurate collective forecasts from predictions made by a set (ensemble) of simpler and less accurate original algorithms are widely used in modern machine learning. These methods include the random regression forest and methods based on adaptive or gradient boosting. An important role in the construction of collective algorithms is played by the method of obtaining the original ensemble of so-called weak algorithms. A theoretical analysis shows that the generalization ability can be improved by using an ensemble of algorithms that not only have high accuracy but also produce maximally diverging forecasts [1]. A low correlation between forecasts potentially makes it possible to achieve a more accurate algorithmic approximation, which objectively ensures the most accurate forecast with the use of a bounded number of algebraic operations [2, 3]. In the random regression forest method, the divergence of forecasts is achieved by training the algorithms of the ensemble on different samples generated from the original training sample with the use of bootstrap [4]. In the gradient boosting method [5], the ensemble is generated sequentially: at every iteration step, it is supplemented with trees approximating the first derivatives of the loss function with respect to the variables corresponding to the collective forecast.

Another important component is the method used to compute the collective forecast, which can also be interpreted as a result of the mutual correction of forecasts. In the random regression forest method, the correction consists in simply averaging the individual forecasts.

Another possible way to organize the corrective procedure is the stacking scheme, in which the outputs of the ensemble algorithms are treated as input features of an algorithm that computes the corrected output forecast [6, 7]. As a rule, the efficiency of stacking is low when it is applied to sets of weak algorithms produced by random forest generation procedures. It can be assumed that this low efficiency is caused by the insufficient divergence of the weak algorithms in the forecast space under standard ensemble generation methods.
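For illustration only, a stacking scheme of this kind can be sketched with scikit-learn's StackingRegressor; the base estimators, meta-model, and synthetic data below are placeholders chosen for the example, not the configuration studied in this paper.

```python
# A minimal stacking sketch: forecasts of several weak regressors become
# input features of a second-level (meta) regressor that corrects them.
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

base_estimators = [
    ("tree1", DecisionTreeRegressor(max_depth=4, random_state=1)),
    ("tree2", DecisionTreeRegressor(max_depth=6, random_state=2)),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=3)),
]
# The meta-model (here a ridge regression) sees the base forecasts as features.
stack = StackingRegressor(estimators=base_estimators, final_estimator=Ridge())

print(cross_val_score(stack, X, y, cv=5, scoring="r2").mean())
```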

The goal of this paper is to study the efficiency of a two-level method for improving the generalization ability, which involves constructing an ensemble of algorithms characterized by a high degree of divergence in the forecast space and a good approximation of the target variable. Simple averaging and stacking are used as corrective procedures.

TWO-LEVEL METHOD

Let \(\{A_{1}(X), \ldots ,A_{k}(X)\}\) be an ensemble of algorithms predicting the value of a variable \(y\) from a given vector \(X\) of variables \(x_{1}, \ldots ,x_{n}\). It is assumed that the algorithms of the ensemble are trained on a sample \(S = \{(X_{1},y_{1}), \ldots ,(X_{m},y_{m})\}\). Preliminarily, a baseline regression method is chosen, which is usually a regression tree model. Define \(L_{k}(X) = \tfrac{1}{k}\sum_{i = 1}^k A_{i}(X)\) and \(Q_{k}(X) = \tfrac{1}{k}\sum_{i = 1}^k A_{i}^{2}(X)\). According to the declared goal, an ensemble is constructed by simultaneously minimizing the criterion

$$\Phi_{E}(A_{1}(X), \ldots ,A_{k}(X)) = \frac{1}{mk}\sum_{i = 1}^k \sum_{j = 1}^m \left( y_{j} - A_{i}(X_{j}) \right)^2,$$

which estimates the average error in approximating \(y\) from the vector \(X\), and maximizing the criterion

$$\Phi_{V}(A_{1}(X), \ldots ,A_{k}(X)) = \frac{1}{mk}\sum_{i = 1}^k \sum_{j = 1}^m \left( L_{k}(X_{j}) - A_{i}(X_{j}) \right)^2,$$

which is the variance of the forecasts produced by the algorithms of the ensemble.

The problem of simultaneously minimizing \({{\Phi }_{E}}\) and maximizing \({{\Phi }_{V}}\) can be reduced to the minimization of

$$\Phi_{G} = (1 - \mu)\Phi_{E} - \mu \Phi_{V},$$

where \(\mu \in [0,1]\) determines the contribution of the heterogeneity of the ensemble, expressed in terms of the forecast variance.
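For concreteness, the three criteria can be transcribed directly into NumPy; in the sketch below, the hypothetical matrix P holds the ensemble forecasts, with P[i, j] = A_i(X_j).

```python
import numpy as np

def ensemble_criteria(P, y, mu):
    """P: array of shape (k, m) with P[i, j] = A_i(X_j); y: targets of length m.

    Returns (Phi_E, Phi_V, Phi_G) as defined above.
    """
    phi_e = np.mean((y[None, :] - P) ** 2)     # average squared error over all i, j
    L = P.mean(axis=0)                          # averaged forecast L_k(X_j)
    phi_v = np.mean((L[None, :] - P) ** 2)      # forecast variance across the ensemble
    phi_g = (1.0 - mu) * phi_e - mu * phi_v
    return phi_e, phi_v, phi_g
```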

Let \(D_{E}^{k}\) and \(D_{V}^{k}\) denote the changes in the functionals \(\Phi_{E}\) and \(\Phi_{V}\) that occur when the ensemble is supplemented with an algorithm \(A_{k + 1}\):

$$\begin{aligned} D_{E}^{k} &= \Phi_{E}(A_{1}(X), \ldots ,A_{k + 1}(X)) - \Phi_{E}(A_{1}(X), \ldots ,A_{k}(X)) \\ &= \frac{1}{k + 1}\left( k\,\Phi_{E}(A_{1}(X), \ldots ,A_{k}(X)) + \frac{1}{m}\sum_{j = 1}^m \left( y_{j} - A_{k + 1}(X_{j}) \right)^2 \right) - \Phi_{E}(A_{1}(X), \ldots ,A_{k}(X)) \\ &= \frac{1}{m(k + 1)}\sum_{j = 1}^m \left( y_{j} - A_{k + 1}(X_{j}) \right)^2 - \frac{1}{k + 1}\Phi_{E}(A_{1}(X), \ldots ,A_{k}(X)) \\ &= \frac{1}{m(k + 1)}\sum_{j = 1}^m \left( y_{j} - A_{k + 1}(X_{j}) \right)^2 - C_{E}, \end{aligned}$$

where \(C_{E}\) does not depend on \(A_{k + 1}(X)\).

To compute \(D_{V}^{k}\), we use the well-known variance identity

$$\frac{1}{k}\sum_{j = 1}^m \sum_{i = 1}^k \left( L_{k}(X_{j}) - A_{i}(X_{j}) \right)^2 = \sum_{j = 1}^m \left( Q_{k}(X_{j}) - L_{k}^{2}(X_{j}) \right);$$
$$\begin{aligned} D_{V}^{k} &= \Phi_{V}(A_{1}(X), \ldots ,A_{k + 1}(X)) - \Phi_{V}(A_{1}(X), \ldots ,A_{k}(X)) \\ &= \frac{1}{m}\sum_{j = 1}^m \left( Q_{k + 1}(X_{j}) - L_{k + 1}^{2}(X_{j}) \right) - \frac{1}{m}\sum_{j = 1}^m \left( Q_{k}(X_{j}) - L_{k}^{2}(X_{j}) \right) \\ &= \frac{1}{m(k + 1)}\sum_{j = 1}^m \left( -Q_{k}(X_{j}) + A_{k + 1}^{2}(X_{j}) - \frac{1}{k + 1}\left( kL_{k}(X_{j}) + A_{k + 1}(X_{j}) \right)^2 + (k + 1)L_{k}^{2}(X_{j}) \right) \\ &= \frac{k}{m(k + 1)^{2}}\sum_{j = 1}^m \left( A_{k + 1}^{2}(X_{j}) - 2L_{k}(X_{j})A_{k + 1}(X_{j}) \right) + C_{V}, \end{aligned}$$

where \({{C}_{V}}\) does not depend on \({{A}_{{k + 1}}}(X)\).
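The incremental expressions for \(D_{E}^{k}\) and \(D_{V}^{k}\) can be checked numerically. The sketch below, using random numbers in place of actual forecasts, compares the direct differences \(\Phi(A_{1}, \ldots ,A_{k + 1}) - \Phi(A_{1}, \ldots ,A_{k})\) with the explicit forms obtained above before the terms independent of \(A_{k + 1}\) are absorbed into \(C_{E}\) and \(C_{V}\).

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 200, 7
y = rng.normal(size=m)
P = rng.normal(size=(k, m))        # forecasts A_i(X_j) of the current ensemble
a_new = rng.normal(size=m)         # forecasts of the candidate algorithm A_{k+1}

def phi_e(P, y):
    # Phi_E: mean squared error over all ensemble members and sample points
    return np.mean((y[None, :] - P) ** 2)

def phi_v(P):
    # Phi_V: variance of member forecasts around the averaged forecast L_k
    return np.mean((P.mean(axis=0)[None, :] - P) ** 2)

P1 = np.vstack([P, a_new])

# Direct increments obtained by recomputing the criteria for k+1 members.
D_E = phi_e(P1, y) - phi_e(P, y)
D_V = phi_v(P1) - phi_v(P)

# Explicit incremental forms from the derivation above.
L = P.mean(axis=0)                 # L_k(X_j)
Q = (P ** 2).mean(axis=0)          # Q_k(X_j)
D_E_explicit = np.mean((y - a_new) ** 2) / (k + 1) - phi_e(P, y) / (k + 1)
D_V_explicit = np.mean(
    -Q + a_new ** 2 - (k * L + a_new) ** 2 / (k + 1) + (k + 1) * L ** 2
) / (k + 1)

assert np.isclose(D_E, D_E_explicit) and np.isclose(D_V, D_V_explicit)
```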

One approach to minimizing \(\Phi_{G}\) is to reduce the problem to searching for an algorithm \(A_{k + 1}\) (to be added to the ensemble) that minimizes the functional

$$\begin{aligned} D_{G}^{k} &= (1 - \mu)D_{E}^{k} - \mu D_{V}^{k} \\ &= \frac{1 - \mu}{m(k + 1)}\sum_{j = 1}^m \left( y_{j} - A_{k + 1}(X_{j}) \right)^2 - \frac{\mu k}{m(k + 1)^{2}}\sum_{j = 1}^m \left( A_{k + 1}^{2}(X_{j}) - 2L_{k}(X_{j})A_{k + 1}(X_{j}) \right) + C_{G}, \end{aligned}$$

where \(C_{G}\) does not depend on \(A_{k + 1}(X)\). The algorithm added to the ensemble at step \(k + 1\) is constructed by combining bagging with a gradient boosting procedure and consists of the following two stages:

1. At the first stage, a random number generator is applied to the original sample S to produce a sample with replacement \(S_{{k + 1}}^{0}\), which is then used to train the algorithm \(A_{{k + 1}}^{0}\).

2. At the second stage, an algorithm \({{A}_{{k + 1}}}\) to be added to the ensemble is constructed. The algorithm \({{A}_{{k + 1}}}\) computes the forecast of \(y\) at the point \(X\) using the formula

$$A_{k + 1}(X) = A_{k + 1}^{0}(X) - \varepsilon G_{k + 1}(X),$$

where \(G_{k + 1}(X)\) is an algorithm predicting the gradient of the functional \(D_{G}^{k}\), regarded as a function of \((A_{k + 1}(X_{1}), \ldots ,A_{k + 1}(X_{m}))\), at the point \((A_{k + 1}^{0}(X_{1}), \ldots ,A_{k + 1}^{0}(X_{m}))\).

The algorithm \(G_{k + 1}\) is trained on the sample
$$\left\{ \left( X_{1}, \left. \frac{\partial D_{G}^{k}}{\partial A_{k + 1}(X_{1})} \right|_{A_{k + 1}^{0}(X_{1})} \right), \ldots , \left( X_{m}, \left. \frac{\partial D_{G}^{k}}{\partial A_{k + 1}(X_{m})} \right|_{A_{k + 1}^{0}(X_{m})} \right) \right\}.$$

It is easy to show that

$$\frac{\partial D_{G}^{k}}{\partial A_{k + 1}(X_{i})} = -\frac{2(1 - \mu)}{m(k + 1)}\left( y_{i} - A_{k + 1}(X_{i}) \right) - \frac{2\mu k}{m(k + 1)^{2}}\left( A_{k + 1}(X_{i}) - L_{k}(X_{i}) \right).$$
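A minimal sketch of one construction step is given below, under the assumption that regression trees serve as both the base algorithm \(A_{k + 1}^{0}\) and the gradient model \(G_{k + 1}\); the function name, the tree depth, and the fixed step \(\varepsilon\) are illustrative choices, not the exact configuration used in the experiments.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def add_ensemble_member(X, y, ensemble, mu=0.3, eps=1.0, rng=None):
    """Append A_{k+1} = A^0_{k+1} - eps * G_{k+1} to `ensemble`.

    `ensemble` is a list of (base, grad_model, eps) triples; a member's
    forecast at X is base.predict(X) - eps * grad_model.predict(X).
    """
    rng = rng or np.random.default_rng()
    m, k = len(y), len(ensemble)

    # Stage 1: bootstrap sample S^0_{k+1} and the base algorithm A^0_{k+1}.
    idx = rng.integers(0, m, size=m)
    base = DecisionTreeRegressor(max_depth=6).fit(X[idx], y[idx])
    a0 = base.predict(X)

    # Averaged forecast L_k(X_j) of the current ensemble (zero if it is empty,
    # in which case the variance term vanishes anyway because of the factor k).
    L = (np.mean([b.predict(X) - e * g.predict(X) for b, g, e in ensemble], axis=0)
         if k > 0 else np.zeros(m))

    # Stage 2: gradient of D_G^k with respect to A_{k+1}(X_i), evaluated at
    # A^0_{k+1}(X_i), approximated by the regression tree G_{k+1}.
    grad = (-2 * (1 - mu) / (m * (k + 1)) * (y - a0)
            - 2 * mu * k / (m * (k + 1) ** 2) * (a0 - L))
    grad_model = DecisionTreeRegressor(max_depth=6).fit(X, grad)

    ensemble.append((base, grad_model, eps))
    return ensemble
```

Iterating this step up to the desired ensemble size yields the first-level ensemble, whose forecasts are then passed to the corrective procedure.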

EXPERIMENTS

The method was implemented in Python using the scikit-learn library [8]. The baseline trees were constructed with the BaggingRegressor method. At the second level, we used GradientBoostingRegressor (hereafter, boosting), RandomForestRegressor (forest), or simple averaging of the base forecasts (average). GradientBoostingRegressor was also used as a reference technique for estimating the efficiency of the proposed method.
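Assuming the first-level forecasts have already been collected into matrices (one column per ensemble member), the three second-level variants can be sketched as follows; the estimator settings and the helper name are illustrative and are not those used to produce Table 1.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score

def second_level_r2(P_train, y_train, P_test, y_test, variant="boosting"):
    """P_*: matrices of first-level forecasts of shape (n_samples, n_members)."""
    if variant == "average":
        pred = P_test.mean(axis=1)                 # simple averaging of the forecasts
    else:
        corrector = (GradientBoostingRegressor(random_state=0)
                     if variant == "boosting"
                     else RandomForestRegressor(n_estimators=200, random_state=0))
        # Stacking: the member forecasts serve as input features of the corrector.
        pred = corrector.fit(P_train, y_train).predict(P_test)
    return r2_score(y_test, pred)
```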

The developed method was used to predict the crystal lattice parameters of the complex inorganic compounds \(A_{2}^{2+}B^{3+}C^{5+}O_{6}\) and the melting points of the halides \(A_{3}BHal_{6}\) and \(ABHal_{3}\). For various space symmetry groups, we present results for some of the parameters admitting a sufficiently reliable forecast, namely: the parameters \(a\), \(b\), and \(c\) for the monoclinic space groups (\(I2/m\) and \(P2_{1}/n\)); the parameter \(c\) for the tetragonal (\(I4/m\)) and hexagonal (\(R\bar{3}\)) groups; and the parameter \(a\) for the cubic group \(Fm\bar{3}m\). The accuracy was evaluated using the standard coefficient of determination \(r^{2}\), computed by cross-validation. Since the algorithms in the bagging procedure are generated largely at random, the results for the same problem vary from experiment to experiment. Accordingly, for each problem, Table 1 presents the value of \(r^{2}\) averaged over 10 experiments.

Table 1. Experimental results

CONCLUSIONS

The results given in Table 1 show that, in most cases, the proposed two-level method yields better results than the standard gradient boosting algorithm, which prompts further research in this direction. We intend to explore the possibility of choosing an optimal gradient descent step size (in terms of certain criteria) in correcting bagging-generated algorithms.