INTRODUCTION

Regression modeling methods based on computing more accurate collective forecasts from predictions made by a set (ensemble) of simpler and less accurate original algorithms are widely used in modern machine learning. These methods include the random regression forest and methods based on adaptive or gradient boosting. An important role in the construction of collective algorithms is played by the method of obtaining the original ensemble of so-called weak algorithms. A theoretical analysis shows that the generalization ability can be improved by using an ensemble of algorithms that not only have high accuracy but also produce maximally diverging forecasts [1]. A low correlation between forecasts potentially makes it possible to achieve a more accurate algorithmic approximation, which objectively ensures the most accurate forecast with the use of a bounded number of algebraic operations [2, 3]. In the random regression forest method, the divergence of forecasts is achieved by training the algorithms of the ensemble on different samples generated from the original training sample with the use of bootstrap [4]. In the gradient boosting method [5], the ensemble is generated sequentially: at every iteration step, it is supplemented with trees approximating the first derivatives of the loss function with respect to the variables corresponding to the collective forecast.

Another important component is the method used to compute the collective forecast, which can also be interpreted as a result of the mutual correction of forecasts. In the random regression forest method, the correction consists in simply averaging the individual forecasts.

Another possible way to organize the corrective procedure is the stacking scheme, in which the outputs of the ensemble algorithms are treated as input features of an algorithm that computes the corrected output forecast [6, 7]. As a rule, the efficiency of stacking is low when it is applied to sets of weak algorithms produced by random forest generation procedures. It can be assumed that this low efficiency is caused by the insufficient divergence of the weak algorithms in the forecast space under standard ensemble generation methods.
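For illustration only, a stacking scheme of this kind can be sketched with scikit-learn's StackingRegressor; the base estimators, meta-model, and synthetic data below are placeholders chosen for the example, not the configuration studied in this paper.

```python
# A minimal stacking sketch: forecasts of several weak regressors become
# input features of a second-level (meta) regressor that corrects them.
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

base_estimators = [
    ("tree1", DecisionTreeRegressor(max_depth=4, random_state=1)),
    ("tree2", DecisionTreeRegressor(max_depth=6, random_state=2)),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=3)),
]
# The meta-model (here a ridge regression) sees the base forecasts as features.
stack = StackingRegressor(estimators=base_estimators, final_estimator=Ridge())

print(cross_val_score(stack, X, y, cv=5, scoring="r2").mean())
```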

The goal of this paper is to study the efficiency of a two-level method for improving the generalization ability, which involves constructing an ensemble of algorithms characterized by a high degree of divergence in the forecast space and a good approximation of the target variable. Simple averaging and stacking are used as corrective procedures.

TWO-LEVEL METHOD

Let \(\{A_{1}(X), \ldots ,A_{k}(X)\}\) be an ensemble of algorithms predicting the value of a variable \(y\) from a given vector \(X\) of variables \(x_{1}, \ldots ,x_{n}\). It is assumed that the algorithms of the ensemble are trained on a sample \(S = \{(X_{1},y_{1}), \ldots ,(X_{m},y_{m})\}\). Preliminarily, a baseline regression method is chosen, which is usually a regression tree model. Define \(L_{k}(X) = \tfrac{1}{k}\sum_{i = 1}^k A_{i}(X)\) and \(Q_{k}(X) = \tfrac{1}{k}\sum_{i = 1}^k A_{i}^{2}(X)\). According to the declared goal, an ensemble is constructed by simultaneously minimizing the criterion

$$\Phi_{E}(A_{1}(X), \ldots ,A_{k}(X)) = \frac{1}{mk}\sum_{i = 1}^k \sum_{j = 1}^m \left( y_{j} - A_{i}(X_{j}) \right)^2,$$

which estimates the average error in approximating \(y\) from the vector \(X\), and maximizing the criterion

$$\Phi_{V}(A_{1}(X), \ldots ,A_{k}(X)) = \frac{1}{mk}\sum_{i = 1}^k \sum_{j = 1}^m \left( L_{k}(X_{j}) - A_{i}(X_{j}) \right)^2,$$

which is the variance of the forecasts produced by the algorithms of the ensemble.

The problem of simultaneously minimizing \({{\Phi }_{E}}\) and maximizing \({{\Phi }_{V}}\) can be reduced to the minimization of

$$\Phi_{G} = (1 - \mu)\Phi_{E} - \mu \Phi_{V},$$

where \(\mu \in [0,1]\) determines the contribution of the heterogeneity of the ensemble, expressed in terms of the forecast variance.
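For concreteness, the three criteria can be transcribed directly into NumPy; in the sketch below, the hypothetical matrix P holds the ensemble forecasts, with P[i, j] = A_i(X_j).

```python
import numpy as np

def ensemble_criteria(P, y, mu):
    """P: array of shape (k, m) with P[i, j] = A_i(X_j); y: targets of length m.

    Returns (Phi_E, Phi_V, Phi_G) as defined above.
    """
    phi_e = np.mean((y[None, :] - P) ** 2)     # average squared error over all i, j
    L = P.mean(axis=0)                          # averaged forecast L_k(X_j)
    phi_v = np.mean((L[None, :] - P) ** 2)      # forecast variance across the ensemble
    phi_g = (1.0 - mu) * phi_e - mu * phi_v
    return phi_e, phi_v, phi_g
```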

Let \(D_{E}^{k}\) and \(D_{V}^{k}\) denote the changes in the functionals \(\Phi_{E}\) and \(\Phi_{V}\) that occur when the ensemble is supplemented with an algorithm \(A_{k + 1}\):

$$\begin{aligned} D_{E}^{k} &= \Phi_{E}(A_{1}(X), \ldots ,A_{k + 1}(X)) - \Phi_{E}(A_{1}(X), \ldots ,A_{k}(X)) \\ &= \frac{1}{k + 1}\left( k\,\Phi_{E}(A_{1}(X), \ldots ,A_{k}(X)) + \frac{1}{m}\sum_{j = 1}^m \left( y_{j} - A_{k + 1}(X_{j}) \right)^2 \right) - \Phi_{E}(A_{1}(X), \ldots ,A_{k}(X)) \\ &= \frac{1}{m(k + 1)}\sum_{j = 1}^m \left( y_{j} - A_{k + 1}(X_{j}) \right)^2 - \frac{1}{k + 1}\Phi_{E}(A_{1}(X), \ldots ,A_{k}(X)) \\ &= \frac{1}{m(k + 1)}\sum_{j = 1}^m \left( y_{j} - A_{k + 1}(X_{j}) \right)^2 - C_{E}, \end{aligned}$$

where \(C_{E}\) does not depend on \(A_{k + 1}(X)\).

To compute \(D_{V}^{k}\), we use the well-known variance identity

$$\frac{1}{k}\sum_{j = 1}^m \sum_{i = 1}^k \left( L_{k}(X_{j}) - A_{i}(X_{j}) \right)^2 = \sum_{j = 1}^m \left( Q_{k}(X_{j}) - L_{k}^{2}(X_{j}) \right);$$
$$\begin{aligned} D_{V}^{k} &= \Phi_{V}(A_{1}(X), \ldots ,A_{k + 1}(X)) - \Phi_{V}(A_{1}(X), \ldots ,A_{k}(X)) \\ &= \frac{1}{m}\sum_{j = 1}^m \left( Q_{k + 1}(X_{j}) - L_{k + 1}^{2}(X_{j}) \right) - \frac{1}{m}\sum_{j = 1}^m \left( Q_{k}(X_{j}) - L_{k}^{2}(X_{j}) \right) \\ &= \frac{1}{m(k + 1)}\sum_{j = 1}^m \left( -Q_{k}(X_{j}) + A_{k + 1}^{2}(X_{j}) - \frac{1}{k + 1}\left( kL_{k}(X_{j}) + A_{k + 1}(X_{j}) \right)^2 + (k + 1)L_{k}^{2}(X_{j}) \right) \\ &= \frac{k}{m(k + 1)^{2}}\sum_{j = 1}^m \left( A_{k + 1}^{2}(X_{j}) - 2L_{k}(X_{j})A_{k + 1}(X_{j}) \right) + C_{V}, \end{aligned}$$

where \({{C}_{V}}\) does not depend on \({{A}_{{k + 1}}}(X)\).
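The incremental expressions for \(D_{E}^{k}\) and \(D_{V}^{k}\) can be checked numerically. The sketch below, using random numbers in place of actual forecasts, compares the direct differences \(\Phi(A_{1}, \ldots ,A_{k + 1}) - \Phi(A_{1}, \ldots ,A_{k})\) with the explicit forms obtained above before the terms independent of \(A_{k + 1}\) are absorbed into \(C_{E}\) and \(C_{V}\).

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 200, 7
y = rng.normal(size=m)
P = rng.normal(size=(k, m))        # forecasts A_i(X_j) of the current ensemble
a_new = rng.normal(size=m)         # forecasts of the candidate algorithm A_{k+1}

def phi_e(P, y):
    # Phi_E: mean squared error over all ensemble members and sample points
    return np.mean((y[None, :] - P) ** 2)

def phi_v(P):
    # Phi_V: variance of member forecasts around the averaged forecast L_k
    return np.mean((P.mean(axis=0)[None, :] - P) ** 2)

P1 = np.vstack([P, a_new])

# Direct increments obtained by recomputing the criteria for k+1 members.
D_E = phi_e(P1, y) - phi_e(P, y)
D_V = phi_v(P1) - phi_v(P)

# Explicit incremental forms from the derivation above.
L = P.mean(axis=0)                 # L_k(X_j)
Q = (P ** 2).mean(axis=0)          # Q_k(X_j)
D_E_explicit = np.mean((y - a_new) ** 2) / (k + 1) - phi_e(P, y) / (k + 1)
D_V_explicit = np.mean(
    -Q + a_new ** 2 - (k * L + a_new) ** 2 / (k + 1) + (k + 1) * L ** 2
) / (k + 1)

assert np.isclose(D_E, D_E_explicit) and np.isclose(D_V, D_V_explicit)
```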

One approach to minimizing \(\Phi_{G}\) is to reduce the problem to searching for an algorithm \(A_{k + 1}\) (to be added to the ensemble) that minimizes the functional

$$\begin{aligned} D_{G}^{k} &= (1 - \mu)D_{E}^{k} - \mu D_{V}^{k} \\ &= \frac{1 - \mu}{m(k + 1)}\sum_{j = 1}^m \left( y_{j} - A_{k + 1}(X_{j}) \right)^2 - \frac{\mu k}{m(k + 1)^{2}}\sum_{j = 1}^m \left( A_{k + 1}^{2}(X_{j}) - 2L_{k}(X_{j})A_{k + 1}(X_{j}) \right) + C_{G}, \end{aligned}$$

where \(C_{G}\) does not depend on \(A_{k + 1}(X)\). The algorithm added to the ensemble at step \(k + 1\) is constructed by combining bagging with a gradient boosting procedure and consists of the following two stages:

1. At the first stage, a random number generator is applied to the original sample S to produce a sample with replacement \(S_{{k + 1}}^{0}\), which is then used to train the algorithm \(A_{{k + 1}}^{0}\).

2. At the second stage, an algorithm \({{A}_{{k + 1}}}\) to be added to the ensemble is constructed. The algorithm \({{A}_{{k + 1}}}\) computes the forecast of \(y\) at the point \(X\) using the formula

$$A_{k + 1}(X) = A_{k + 1}^{0}(X) - \varepsilon G_{k + 1}(X),$$

where \(G_{k + 1}(X)\) is an algorithm predicting the gradient of the functional \(D_{G}^{k}\), regarded as a function of \((A_{k + 1}(X_{1}), \ldots ,A_{k + 1}(X_{m}))\), at the point \((A_{k + 1}^{0}(X_{1}), \ldots ,A_{k + 1}^{0}(X_{m}))\).

The algorithm \(G_{k + 1}\) is trained on the sample
$$\left\{ \left( X_{1}, \left. \frac{\partial D_{G}^{k}}{\partial A_{k + 1}(X_{1})} \right|_{A_{k + 1}^{0}(X_{1})} \right), \ldots , \left( X_{m}, \left. \frac{\partial D_{G}^{k}}{\partial A_{k + 1}(X_{m})} \right|_{A_{k + 1}^{0}(X_{m})} \right) \right\}.$$

It is easy to show that

$$\frac{\partial D_{G}^{k}}{\partial A_{k + 1}(X_{i})} = -\frac{2(1 - \mu)}{m(k + 1)}\left( y_{i} - A_{k + 1}(X_{i}) \right) - \frac{2\mu k}{m(k + 1)^{2}}\left( A_{k + 1}(X_{i}) - L_{k}(X_{i}) \right).$$
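A minimal sketch of one construction step is given below, under the assumption that regression trees serve as both the base algorithm \(A_{k + 1}^{0}\) and the gradient model \(G_{k + 1}\); the function name, the tree depth, and the fixed step \(\varepsilon\) are illustrative choices, not the exact configuration used in the experiments.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def add_ensemble_member(X, y, ensemble, mu=0.3, eps=1.0, rng=None):
    """Append A_{k+1} = A^0_{k+1} - eps * G_{k+1} to `ensemble`.

    `ensemble` is a list of (base, grad_model, eps) triples; a member's
    forecast at X is base.predict(X) - eps * grad_model.predict(X).
    """
    rng = rng or np.random.default_rng()
    m, k = len(y), len(ensemble)

    # Stage 1: bootstrap sample S^0_{k+1} and the base algorithm A^0_{k+1}.
    idx = rng.integers(0, m, size=m)
    base = DecisionTreeRegressor(max_depth=6).fit(X[idx], y[idx])
    a0 = base.predict(X)

    # Averaged forecast L_k(X_j) of the current ensemble (zero if it is empty,
    # in which case the variance term vanishes anyway because of the factor k).
    L = (np.mean([b.predict(X) - e * g.predict(X) for b, g, e in ensemble], axis=0)
         if k > 0 else np.zeros(m))

    # Stage 2: gradient of D_G^k with respect to A_{k+1}(X_i), evaluated at
    # A^0_{k+1}(X_i), approximated by the regression tree G_{k+1}.
    grad = (-2 * (1 - mu) / (m * (k + 1)) * (y - a0)
            - 2 * mu * k / (m * (k + 1) ** 2) * (a0 - L))
    grad_model = DecisionTreeRegressor(max_depth=6).fit(X, grad)

    ensemble.append((base, grad_model, eps))
    return ensemble
```

Iterating this step up to the desired ensemble size yields the first-level ensemble, whose forecasts are then passed to the corrective procedure.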

EXPERIMENTS

The method was implemented in Python using the scikit-learn library [8]. The baseline trees were constructed with the BaggingRegressor method. At the second level, we used GradientBoostingRegressor (hereafter, boosting), RandomForestRegressor (forest), or simple averaging of the base forecasts (average). GradientBoostingRegressor was also used as a reference technique for estimating the efficiency of the proposed method.
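Assuming the first-level forecasts have already been collected into matrices (one column per ensemble member), the three second-level variants can be sketched as follows; the estimator settings and the helper name are illustrative and are not those used to produce Table 1.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score

def second_level_r2(P_train, y_train, P_test, y_test, variant="boosting"):
    """P_*: matrices of first-level forecasts of shape (n_samples, n_members)."""
    if variant == "average":
        pred = P_test.mean(axis=1)                 # simple averaging of the forecasts
    else:
        corrector = (GradientBoostingRegressor(random_state=0)
                     if variant == "boosting"
                     else RandomForestRegressor(n_estimators=200, random_state=0))
        # Stacking: the member forecasts serve as input features of the corrector.
        pred = corrector.fit(P_train, y_train).predict(P_test)
    return r2_score(y_test, pred)
```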

The developed method was used to predict the crystal lattice parameters of the complex inorganic compounds \(A_{2}^{2+}B^{3+}C^{5+}O_{6}\) and the melting points of the halides \(A_{3}BHal_{6}\) and \(ABHal_{3}\). For various space symmetry groups, we present results for some of the parameters admitting a sufficiently reliable forecast, namely: the parameters \(a\), \(b\), and \(c\) for the monoclinic space groups (\(I2/m\) and \(P2_{1}/n\)); the parameter \(c\) for the tetragonal (\(I4/m\)) and hexagonal (\(R\bar{3}\)) groups; and the parameter \(a\) for the cubic group \(Fm\bar{3}m\). The accuracy was evaluated using the standard coefficient of determination \(r^{2}\), computed by cross-validation. Since the algorithms in the bagging procedure are generated largely at random, the results for the same problem vary from experiment to experiment. Accordingly, for each problem, Table 1 presents the value of \(r^{2}\) averaged over 10 experiments.

Table 1. Experimental results

CONCLUSIONS

The results given in Table 1 show that, in most cases, the proposed two-level method yields better results than the standard gradient boosting algorithm, which prompts further research in this direction. We intend to explore the possibility of choosing an optimal gradient descent step size (in terms of certain criteria) in correcting bagging-generated algorithms.