KTBoost: Combined Kernel and Tree Boosting

In this article, we introduce a novel boosting algorithm called `KTBoost', which combines kernel boosting and tree boosting. In each boosting iteration, the algorithm adds either a regression tree or a reproducing kernel Hilbert space (RKHS) regression function to the ensemble of base learners. Intuitively, the idea is that discontinuous trees and continuous RKHS regression functions complement each other, and that this combination allows for better learning of both continuous and discontinuous functions as well as functions that exhibit parts with varying degrees of regularity. We empirically show that KTBoost outperforms both tree and kernel boosting in terms of predictive accuracy on a wide array of data sets.


Introduction
Boosting algorithms [Freund et al., 1996, Friedman et al., 2000, Mason et al., 2000, Friedman, 2001, Bühlmann and Hothorn, 2007] enjoy great popularity in both applied data analysis and machine learning research due to their high predictive accuracy observed on a wide range of data sets [Chen and Guestrin, 2016]. Boosting additively combines base learners by sequentially minimizing a risk functional. To the best of our knowledge, with one exception [Hothorn et al., 2010], existing boosting algorithms use only one type of function as base learners. In this article, we relax this restriction by combining trees [Breiman et al., 1984] and reproducing kernel Hilbert space (RKHS) regression functions [Smola, 2001, Berlinet and Thomas-Agnan, 2011] as base learners, and we empirically show that this combination of different base learners results in increased predictive accuracy compared to both tree-only and kernel-only boosting.
To date, regression trees are the most common choice of base learners for boosting in both applied data analysis and machine learning research. In particular, a lot of effort has been made in recent years to develop tree-based boosting methods that scale to large data [Chen and Guestrin, 2016, Ke et al., 2017, Ponomareva et al., 2017, Prokhorenkova et al., 2018]. On the other hand, kernel machines show state-of-the-art predictive accuracy for many data sets as they can, for instance, achieve near-optimal test error [Belkin et al., 2018b,a], and kernel classifiers parallel the behaviors of deep networks as noted in Zhang et al. [2017]. Roughly speaking, RKHS regression learns a non-parametric regression function in an infinite-dimensional RKHS. See Sections 2.2 and 2.3 for more information on kernel machines and trees.

Summary of results
We introduce a novel boosting algorithm and a novel way of combining multiple types of base learners. Specifically, we propose a boosting algorithm denoted by 'KTBoost', which combines kernel and tree boosting. In each boosting iteration, the KTBoost algorithm chooses to add either a regression tree or a penalized RKHS regression function, also known as kernel ridge regression [Murphy, 2012], to the ensemble. This is done by first learning both a regression tree and an RKHS regression function using one step of a functional version of Newton's method or functional gradient descent, and then selecting the base learner whose addition to the ensemble results in the lowest empirical risk.
The idea is that the KTBoost algorithm chooses in each iteration the best base learner from two fundamentally different function classes. Functions in an RKHS are continuous and, depending on the kernel function, they can also have higher regularity. Trees, on the other hand, are discontinuous functions. Intuitively, this combination is thus well suited for learning both continuous and discontinuous functions or functions that exhibit parts with varying degrees of regularity and discontinuities. In our experiments, we show on a large collection of data sets that this combination of trees and RKHS ridge regression functions as base learners leads to higher predictive accuracy compared to using only trees or only RKHS regression functions.
To briefly illustrate that the combination of tree and kernel boosting can lead to higher predictive accuracy, we report in Figure 1 the test mean square error (MSE) versus the number of boosting iterations for one data set (wine). The solid lines show average test MSEs over ten random splits into training, validation, and test data sets versus the number of boosting iterations. The confidence bands are obtained after point-wise excluding the largest and smallest MSEs. Tuning parameters of all methods are chosen on the validation data sets. See Section 4 for more details on the data set and the choice of tuning parameters. For better comparability, the shrinkage parameter ν, or learning rate (see Equation (4)), in the boosting update is set to a fixed value (ν = 0.1) in this example. The figure clearly illustrates how the combination of tree and kernel boosting (KTBoost) results in a lower test MSE compared to both tree and kernel boosting. In Section 4, we perform a detailed empirical comparison where we also consider the shrinkage parameter as a tuning parameter.
To the best of our knowledge, our proposed way of combining different base learners, in particular regression trees and RKHS regression functions, is novel. In addition, the use of Newton boosting with RKHS ridge regression functions as base learners is also novel. Our approach is implemented in the Python package KTBoost, which is openly available on the Python Package Index (PyPI) repository. Friedman et al. [2008] also propose to use different families of base learners in an ensemble. They combine rule-based learners, where the rules are learned using a boosting-type algorithm, with linear functions, and they find that the combination of the two can lead to increased predictive accuracy. However, their approach is different from ours since their base learners are combined by directly minimizing an L1-penalized loss, whereas we propose to combine the different base learners within the boosting algorithm itself. Further, the mboost R package of Hothorn et al. [2010] also allows for combining different base learners, including linear functions, one- and two-dimensional smoothing splines, spatial terms, regression trees, as well as user-defined ones. This approach differs from ours for the following reasons. First, in each boosting update, mboost chooses the best term by minimizing a least squares approximation to the negative gradient of the empirical risk. In contrast, our approach allows learning the tree and RKHS regression base learners using either Newton's method or gradient descent. In addition, we then select the base learner whose addition to the ensemble directly results in the lowest empirical risk, instead of choosing the one that best approximates the negative gradient. Further, in mboost, one has to explicitly choose a type of base learner for each feature or group of features.

Related work
Concerning kernel boosting, Raskutti et al. [2014] and Wei et al. [2017] use boosting with early stopping to find minimizers in an RKHS. Apart from the fact that we combine kernel boosting with tree boosting, these approaches also differ from our proposed way of kernel boosting. For instance, Wei et al. [2017] follow the approach of Mason et al. [2000] by finding boosting updates as maximizers of the inner product of the risk functional gradient with the base learner under the condition that the base learner's norm is at most one. In contrast, we choose boosting updates using either Newton's method or a least squares approximation to the negative gradient. Further, Raskutti et al. [2014] and Wei et al. [2017] only consider interpolating kernel machines, whereas we also allow for a ridge penalty as a form of regularization in addition to early stopping.

Boosting
There exist population as well as sample versions of boosting algorithms. For the sake of brevity, we only consider the latter here. Assume that we have data {(x_i, y_i) ∈ R^p × R, i = 1, …, n} from a probability distribution P_{X,Y}. The goal of boosting is to find a function F : R^p → R in a Hilbert space Ω_S which allows for predicting y given x. Note that y can be categorical, discrete, continuous, or of mixed type depending on whether the conditional distribution P_{Y|X} is absolutely continuous with respect to the Lebesgue measure, a counting measure, or a mixture of the two; see, e.g., Sigrist and Hirnschall [2017] for an example of the latter. Depending on the data or the goal of the application, the function F = (F_1, …, F_d) can also be multivariate. For instance, this is the case for multiclass classification, where a function is learned for every class when using the entropy loss with a softmax function, or for generalized additive models for location, scale and shape (GAMLSS), where location, scale, and shape parameters are modeled as functions of predictor variables [Rigby and Stasinopoulos, 2005, Mayr et al., 2012]. For the sake of simplicity, we assume in the following that F is univariate. The extension to the multivariate case is straightforward; see, e.g., Sigrist [2018].
The goal of boosting is to find a minimizer F*(·) of the empirical risk functional R(F),

R(F) = Σ_{i=1}^n L(y_i, F(x_i)),  (1)

where L(Y, F) is an appropriately chosen loss function. For instance, for regression, one often uses the squared loss L(Y, F) = (Y − F)²/2 with Y ∈ R. For binary classification with Y ∈ {0, 1}, popular choices are the logistic regression loss L(Y, F) = −Y F + log(1 + e^F) and the exponential loss L(Y, F) = e^{−(2Y−1)F}, and for multiclass classification with Y ∈ {1, …, K}, one often uses the cross-entropy loss with the softmax function. For boosting, one assumes that the function space Ω_S = span(S) is the span of a set of base learners S = {f_j : R^p → R}. Boosting then finds F*(·) in a sequential way by iteratively adding an update f_m to the current estimate F_{m−1},

F_m(x) = F_{m−1}(x) + f_m(x),  (2)

such that the empirical risk is minimized,

f_m = argmin_{f ∈ S} R(F_{m−1} + f).  (3)

Since the risk in (3) usually cannot be minimized explicitly, one uses an approximate minimizer. Depending on whether gradient or Newton boosting is used, the update f_m is either obtained as the least squares approximation to the negative functional gradient or by applying one step of the functional Newton method, which corresponds to minimizing a second-order Taylor expansion of the risk functional; see Section 3 or Sigrist [2018] for more information. For increased predictive accuracy [Friedman, 2001], an additional shrinkage parameter ν > 0 is included in the update equation to obtain

F_m(x) = F_{m−1}(x) + ν f_m(x).  (4)
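To make the recursion (2)–(4) concrete, the following is a minimal sketch of gradient boosting with the squared loss, using brute-force regression stumps as base learners. The helper names (`fit_stump`, `gradient_boost`) are illustrative and not part of any package.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares regression stump on a single feature (illustrative helper)."""
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    best = None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        split = (xs[i] + xs[i - 1]) / 2.0
        left, right = rs[:i].mean(), rs[i:].mean()
        sse = ((rs[:i] - left) ** 2).sum() + ((rs[i:] - right) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, split, left, right)
    _, split, left, right = best
    return lambda z: np.where(z <= split, left, right)

def gradient_boost(x, y, M=50, nu=0.1):
    f0 = y.mean()
    F = np.full_like(y, f0)
    learners = []
    for _ in range(M):
        neg_grad = y - F            # negative gradient of L = (y - F)^2 / 2
        f = fit_stump(x, neg_grad)  # least-squares fit to the negative gradient, cf. (3)
        learners.append(f)
        F = F + nu * f(x)           # shrunken update F_m = F_{m-1} + nu * f_m, cf. (4)
    return lambda z: f0 + nu * sum(f(z) for f in learners)

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, 200)
y = np.sin(x) + 0.1 * rng.standard_normal(200)
model = gradient_boost(x, y)
```

Each iteration fits one base learner to the current pseudo-residuals and adds a shrunken copy of it to the ensemble.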

Kernel ridge regression
Let K : R^p × R^p → R be a positive definite kernel function. Then there exists a reproducing kernel Hilbert space (RKHS) H with an inner product ⟨·, ·⟩_H such that (i) K(·, x) ∈ H for all x ∈ R^p and (ii) the reproducing property ⟨f, K(·, x)⟩_H = f(x) holds for all f ∈ H. Suppose we are interested in finding the minimizer

f* = argmin_{f ∈ H} Σ_{i=1}^n (y_i − f(x_i))²/2 + (λ/2) ‖f‖²_H,  (5)

where λ ≥ 0 is a regularization parameter. The representer theorem then states that there is a unique minimizer of the form

f*(x) = Σ_{i=1}^n α_i K(x, x_i),  (6)

and (5) can be written as

min_α ‖y − Kα‖²/2 + (λ/2) α^T Kα,  (7)

where y = (y_1, …, y_n)^T, K ∈ R^{n×n} with K_{ij} = K(x_i, x_j), and α = (α_1, …, α_n)^T. Taking derivatives and setting them to zero, we find the explicit solution

α = (K + λ I_n)^{−1} y,  (8)

where I_n denotes the n-dimensional identity matrix. Popular kernel choices include the exponential or Laplace kernel K(x_1, x_2) = exp(−‖x_1 − x_2‖/ρ) as well as the Gaussian or radial kernel K(x_1, x_2) = exp(−‖x_1 − x_2‖²/ρ²), where ‖·‖ denotes the L2 norm. The choice of the kernel determines the properties of the learned functions, such as their degree of smoothness. The parameter λ determines the amount of regularization, and, for the above-mentioned kernels, ρ controls how fast the kernel decays with the distance ‖x_1 − x_2‖.

There is a close connection between Gaussian process regression and kernel regression. The solution to (5) is the posterior mean, conditional on the data, of a zero-mean Gaussian process with covariance function K. In spatial statistics [Diggle et al., 1998], this is also called a Kriging estimator. The regularization parameter λ is the variance of the so-called idiosyncratic nugget effect, and, e.g., for the Laplace and the radial kernel, ρ is a range parameter that determines how fast the correlation of the process at points x_1 and x_2 decays with distance. Kernel regression can also be interpreted as a two-layer neural network.
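The closed-form kernel ridge solution can be sketched in a few lines of NumPy. This assumes a Gaussian kernel; the helper names are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, rho):
    """Gaussian kernel matrix with entries exp(-||a_i - b_j||^2 / rho^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / rho ** 2)

def krr_fit(X, y, rho, lam):
    K = gaussian_kernel(X, X, rho)
    # explicit solution alpha = (K + lam * I_n)^(-1) y
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, rho):
    # f(x) = sum_i alpha_i K(x, x_i)
    return gaussian_kernel(X_new, X_train, rho) @ alpha

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(100)
alpha = krr_fit(X, y, rho=1.0, lam=0.1)
fitted = krr_predict(X, alpha, X, rho=1.0)
```

Larger λ shrinks the fit toward zero; smaller ρ makes the fitted function rougher.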

Regression trees
We denote by T the space of regression trees [Breiman et al., 1984]. Regression and classification trees can be represented in various forms. For notational brevity, we follow the notation used in Chen and Guestrin [2016]; i.e., a regression tree is given by

f_T(x) = w_{s(x)},  (9)

where s : R^p → {1, …, J}, w ∈ R^J, and J ∈ N denotes the number of terminal nodes of the tree f_T(x). The function s determines the structure of the tree, i.e., the partition of the space, and w contains the leaf values. As in Breiman et al. [1984], we assume that the partition of the space induced by s is obtained from a binary tree, where each cell in the partition is a rectangle of the form (a_1, b_1] × ⋯ × (a_p, b_p] with −∞ ≤ a_k < b_k ≤ ∞.
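In this notation, a toy two-leaf tree looks as follows; the split point and leaf values are made up for illustration.

```python
import numpy as np

# A tree in the f_T(x) = w_{s(x)} notation: s maps each input to a leaf
# index, and w holds the leaf values (J = 2 terminal nodes here).
w = np.array([1.5, -0.5])

def s(x):
    """Tree structure: partition R^p by splitting on the first feature at 0."""
    return np.where(x[:, 0] <= 0.0, 0, 1)

def f_T(x):
    return w[s(x)]

X = np.array([[-1.0, 2.0], [0.5, -1.0]])
print(f_T(X))  # [ 1.5 -0.5]
```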

Combined kernel and tree boosting
Let R_2(F_{m−1} + f) denote the following functional, which equals, up to an additive constant, the second-order Taylor approximation around F_{m−1} of the empirical risk in (1):

R_2(F_{m−1} + f) = Σ_{i=1}^n [ g_{m,i} f(x_i) + (1/2) h_{m,i} f(x_i)² ],

where g_{m,i} and h_{m,i} are the gradient and the Hessian of the loss evaluated at the current estimate:

g_{m,i} = ∂/∂F L(y_i, F) |_{F = F_{m−1}(x_i)},  h_{m,i} = ∂²/∂F² L(y_i, F) |_{F = F_{m−1}(x_i)}.
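For example, for the logistic regression loss L(y, F) = −yF + log(1 + e^F), the per-sample gradient and Hessian take a simple closed form; the helper name below is illustrative.

```python
import numpy as np

def grad_hess_logistic(y, F):
    """Gradient and Hessian of L(y, F) = -y*F + log(1 + exp(F))
    with respect to F, evaluated per sample."""
    p = 1.0 / (1.0 + np.exp(-F))   # sigmoid(F)
    return p - y, p * (1.0 - p)

y = np.array([0.0, 1.0, 1.0])
F = np.array([0.5, -0.2, 1.0])
g, h = grad_hess_logistic(y, F)
```

Note that the Hessian p(1 − p) is always positive, so the Newton step for this loss is well defined.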
The KTBoost algorithm then works as follows. In each boosting iteration m, a candidate regression tree f T m (x) and a candidate RKHS ridge regression function f K m (x) are found as minimizers to the second order Taylor approximation R 2 (F m−1 + f ) of the empirical risk around the current estimate F m−1 . This corresponds to applying one step of Newton's method. The algorithm then chooses to add either the regression tree or the RKHS regression function to the ensemble in such a way that the addition of the base learner to the ensemble according to Equation (4) results in the lowest empirical risk. Algorithm 1 summarizes this.
It can be shown that candidate trees f T m (x) can be found as weighted least squares minimizers; see, e.g., Chen and Guestrin [2016] or Sigrist [2018]. The candidate penalized RKHS regression functions f K m (x) can be found as shown in the following Proposition 3.1.
Algorithm 1: KTBoost

1. Initialize F_0(x) = ȳ = (1/n) Σ_{i=1}^n y_i.
2. For m = 1 to M:
   (a) Compute the gradients and Hessians g_{m,i} and h_{m,i}, i = 1, …, n.
   (b) Find a candidate regression tree f^T_m(x) and a candidate RKHS regression function f^K_m(x) as minimizers of R_2(F_{m−1} + f).
   (c) Select f_m ∈ {f^T_m, f^K_m} such that the empirical risk R(F_{m−1} + ν f_m) is minimal.
   (d) Update F_m(x) = F_{m−1}(x) + ν f_m(x).
3. Return F_M(x).

Proposition 3.1. The optimal kernel ridge regression solution f^K_m(x) in the Newton update step of boosting is of the form (6), f^K_m(x) = Σ_{i=1}^n α_{m,i} K(x, x_i), with coefficients

α_m = (D_m K + λ I_n)^{−1} y_m,  (10)

where D_m = diag(h_{m,1}, …, h_{m,n}), y_m = (−g_{m,1}, …, −g_{m,n})^T, and I_n is the identity matrix of dimension n.

Proof.
With f(x) = Σ_{i=1}^n α_i K(x, x_i), the penalized second-order approximation can be written as

R_2(F_{m−1} + f) + (λ/2) α^T K α = g_m^T K α + (1/2) α^T K D_m K α + (λ/2) α^T K α,

where g_m = (g_{m,1}, …, g_{m,n})^T. If we take derivatives with respect to α, set them to zero, and solve for α, we find that α_m = (D_m K + λ I_n)^{−1} y_m.

In the above formulation, KTBoost uses Newton boosting; i.e., the updates f_m are found by minimizing a second-order approximation to the empirical risk. If the loss function is not twice differentiable in its second argument, or if the second derivative is zero or constant on a non-null set of the support of X, one can alternatively use gradient descent instead of Newton's method. Further, gradient boosting is computationally less expensive than Newton boosting since the kernel matrix that needs to be factorized does not depend on the iteration number m; see Section 3.1 for more details.
In general, the gradient boosting version of KTBoost is obtained as a special case of Algorithm 1 by setting h_{m,i} = 1. For the RKHS regression function, it follows that D_m = I_n. The updates f_m then correspond to L2 approximations of either regression trees or RKHS regression functions to the negative functional gradient.
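The following is a compact sketch of the gradient-descent variant of Algorithm 1 (D_m = I_n) on one-dimensional inputs with the squared loss. The median-split stump is a crude stand-in for a fully grown regression tree, and all names are illustrative rather than the package's API.

```python
import numpy as np

def kernel(a, b, rho=1.0):
    """Gaussian kernel matrix for one-dimensional inputs."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / rho ** 2)

def ktboost_sketch(x, y, M=30, nu=0.1, lam=1.0):
    f0 = y.mean()
    F = np.full_like(y, f0)
    K = kernel(x, x)
    K_inv = np.linalg.inv(K + lam * np.eye(len(x)))  # computed once, reused in every iteration
    ensemble = []
    for _ in range(M):
        r = y - F                                    # negative gradient of the squared loss
        # candidate 1: a stump with a median split (crude stand-in for a fitted tree)
        split = np.median(x)
        wl, wr = r[x <= split].mean(), r[x > split].mean()
        f_tree = lambda z, s=split, a=wl, b=wr: np.where(z <= s, a, b)
        # candidate 2: kernel ridge regression fitted to the negative gradient
        alpha = K_inv @ r
        f_kern = lambda z, a=alpha: kernel(z, x) @ a
        # select the base learner whose shrunken addition yields the lower empirical risk
        risks = [((y - (F + nu * f(x))) ** 2).mean() for f in (f_tree, f_kern)]
        f = (f_tree, f_kern)[int(np.argmin(risks))]
        ensemble.append(f)
        F = F + nu * f(x)
    return lambda z: f0 + nu * sum(f(z) for f in ensemble)

rng = np.random.default_rng(2)
x = rng.uniform(-3.0, 3.0, 200)
y = np.sin(x) + 0.1 * rng.standard_normal(200)
model = ktboost_sketch(x, y)
```

The key line is the risk comparison: both candidates are always fitted, but only the one that lowers the empirical risk the most is appended to the ensemble.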
Concerning the RKHS boosting part, the update equation F_m(x) = F_{m−1}(x) + νf_m(x) can be replaced by simply updating a cumulative coefficient vector ᾱ_m, which collects the shrunken coefficients of all kernel updates up to iteration m. I.e., when the update step consists of an RKHS function with coefficients α_m, one sets ᾱ_m = ᾱ_{m−1} + ν α_m, with the convention ᾱ_m = ᾱ_{m−1} if a regression tree is selected in iteration m. If stage-wise predictions should be made, e.g., for trace plots of error measures versus iteration numbers, one still needs to keep track of the full set of coefficients.
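Since all kernel base learners share the same expansion points, the shrunken coefficient vectors can be folded into a single running vector. This sketch verifies the equivalence; random coefficient vectors stand in for actual boosting updates.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)          # shared kernel matrix
nu = 0.1
alphas = [rng.standard_normal(50) for _ in range(5)]  # per-iteration coefficients (illustrative)

# naive: keep every kernel base learner and sum their shrunken predictions
pred_naive = sum(nu * (K @ a) for a in alphas)

# cumulative: fold the shrinkage into one running coefficient vector
alpha_bar = np.zeros(50)
for a in alphas:
    alpha_bar += nu * a
pred_cum = K @ alpha_bar

print(np.allclose(pred_naive, pred_cum))  # True
```

This reduces prediction cost from one kernel evaluation per kernel update to a single evaluation with ᾱ_M.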

Reducing computational costs for large data
The computationally expensive part for learning the regression trees consists of finding the splits when growing the trees. There are several approaches in the literature on how this can be done efficiently for large data; see, e.g., Chen and Guestrin [2016] or Ke et al. [2017].
Concerning RKHS ridge regression, there are also several approaches that allow for computational efficiency in the large-data case. Examples include low-rank approximations based on, e.g., the Nyström method [Williams and Seeger, 2001] and extensions of it such as divide-and-conquer kernel ridge regression [Zhang et al., 2013, 2015], and early stopping of iterative optimization methods [Yao et al., 2007, Blanchard and Krämer, 2010, Raskutti et al., 2014, Ma and Belkin, 2017]. Another approach, adopted mainly in spatial statistics, is to use a kernel function that has compact support [Gneiting, 2002, Bevilacqua et al., 2018]. The resulting kernel matrix K is then sparse and can thus be efficiently inverted or factorized.
If gradient descent is used instead of the Newton-Raphson method, the RKHS function f^K_m(x) can be found efficiently by noting that α_m = (K + λ I_n)^{−1} y_m, where y_m is the negative gradient vector y_m = (−g_{m,1}, …, −g_{m,n})^T. This means that the matrix (K + λ I_n)^{−1}, or a Cholesky factor of K + λ I_n, needs to be calculated only once and not in every iteration m.
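This precomputation can be sketched with SciPy's Cholesky helpers (assuming SciPy is available; the negative gradient vectors are simulated here rather than computed from an actual loss).

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(4)
x = rng.standard_normal(80)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)  # Gaussian kernel matrix
lam = 1.0

# Factorize K + lam * I_n once, before the boosting loop starts (O(n^3)).
factor = cho_factor(K + lam * np.eye(len(x)))

alphas = []
for m in range(5):
    y_m = rng.standard_normal(len(x))        # stand-in for the negative gradient at iteration m
    alphas.append(cho_solve(factor, y_m))    # O(n^2) back-substitution per iteration
```

With Newton boosting, by contrast, the matrix D_m K + λ I_n changes with m and no single factorization can be reused.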
In our empirical analysis, we use the Nyström method for dealing with larger data sets. The Nyström method approximates the kernel K(·, ·) as described in the following. One first chooses a set of l samples x*_1, …, x*_l, often simply by sampling uniformly from the data; we call these the Nyström samples in the following. Further, we denote the l × l kernel matrix that corresponds to these points by K*, with (K*)_{ij} = K(x*_i, x*_j). The Nyström method then approximates the kernel as

K(x, x′) ≈ k*(x)^T (K*)^{−1} k*(x′),  where k*(x) = (K(x, x*_1), …, K(x, x*_l))^T.

In particular, the reduced-rank Nyström approximation to the full kernel matrix K is given by

K ≈ K_{nl} (K*)^{−1} K_{nl}^T,

where K_{nl} ∈ R^{n×l} with (K_{nl})_{ij} = K(x_i, x*_j).
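The reduced-rank approximation can be sketched as follows; the pseudo-inverse guards against an ill-conditioned K*, and the uniform sampling of the Nyström points mirrors the description above.

```python
import numpy as np

def gaussian_kernel(A, B, rho=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / rho ** 2)

rng = np.random.default_rng(5)
X = rng.standard_normal((500, 3))
l = 50
idx = rng.choice(len(X), size=l, replace=False)  # uniform Nyström samples
X_star = X[idx]

K_nl = gaussian_kernel(X, X_star)                # n x l cross-kernel matrix
K_star = gaussian_kernel(X_star, X_star)         # l x l kernel of the Nyström samples
K_approx = K_nl @ np.linalg.pinv(K_star) @ K_nl.T  # rank-l approximation of the full K
```

Downstream linear algebra then operates on rank-l factors instead of the full n × n matrix.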

Experimental results
In the following, we compare the KTBoost algorithm with tree and kernel boosting using several Delve, Keel, Kaggle, and UCI data sets shown in Table 1. We consider both regression as well as binary and multiclass classification data sets. Further, our choice of data sets includes data sets of different sizes in order to investigate the performance on both smaller and larger data sets. We use the squared loss for regression, the logistic regression loss for binary classification, and the cross-entropy loss with the softmax function for multiclass classification. For the RKHS ridge regression, we use a Gaussian kernel. Concerning the boosting updates, we use gradient boosting for the regression data sets, and for the classification data sets, we use boosting with Newton updates since this can result in more accurate predictions [Sigrist, 2018]; see Section 5 for a brief comparison. However, for some classification data sets (adult, ijcnn, epileptic, and satimage), Newton boosting is computationally infeasible on a standard single-CPU computer with the current implementation of KTBoost, despite the use of the Nyström method with a reasonable number of Nyström samples, say 1000, since the weighted kernel matrix in Equation (10) changes with every boosting iteration. We thus use gradient boosting in this case. All calculations are done with the Python package KTBoost on a standard laptop with a 2.9 GHz quad-core processor and 16 GB of RAM.
For the larger data sets (liberty, sarcos, adult, ijcnn), we run into memory limitations, in particular when choosing tuning parameters, which is done in parallel, as well as infeasible computational times. We thus use the Nyström method described in Section 3.1. Specifically, we use l = 1000 Nyström samples, which are sampled uniformly from the training data. In general, the larger the number of Nyström samples, the lower the approximation error but the higher the computational costs. We restrict ourselves to l = 1000 due to computational constraints. Williams and Seeger [2001] report good results with l ≈ 1000 for several data sets.
All data sets are randomly split into three parts of equal size to obtain training, validation, and test sets. Learning is done on the training data, tuning parameters are chosen on the validation data, and model comparison is done on the test data. All input features are standardized using the training data to have approximately mean zero and variance one. In order to quantify variability in the results, we use ten different random splits of the data into training, validation, and test sets.

Table 1: Data sets used in the experiments (task, i.e., regression or number of classes, sample size n, and number of features p).

Data set     Task         n        p
abalone      regression   4177     10
ailerons     regression   13750    40
bank8FM      regression   8192     8
elevators    regression   16599    18
energy       regression   768      8
housing      regression   506      13
liberty      regression   50999    117
NavalT       regression   11934    16
parkinsons   regression   5875     16
puma32h      regression   8192     32
sarcos       regression   48933    21
wine         regression   4898     11
adult        2 classes    48842    108
cancer       2 classes    699      9
ijcnn        2 classes    141691   22
ionosphere   2 classes    351      34
sonar        2 classes    208      60
car          4 classes    1728     21
epileptic    5 classes    11500    178
glass        7 classes    214      9
satimage     6 classes    6438     36

Concerning tuning parameters, we select the number of boosting iterations M from {1, 2, …, 1000}, the learning rate ν from {1, 10⁻¹, 10⁻², 10⁻³}, the maximal depth of the trees from {1, 5, 10}, and the kernel ridge regularization parameter λ from {1, 10}. The kernel range parameter ρ is chosen using k-nearest-neighbor distances as described in the following. We first calculate the average distance of all k-nearest neighbors in the training data, where k is a tuning parameter selected from {5, 50, 500, 5000, n − 1} and n is the size of the training data. We then choose ρ such that the kernel function has decayed to a value of 0.01 at this average k-nearest-neighbor distance. This is motivated by the fact that, for a corresponding Gaussian process with such a covariance function, the correlation has decayed to a level of 1% at this distance.
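The range-selection heuristic can be sketched as follows, assuming the Gaussian-kernel parameterization exp(−d²/ρ²); the helper name is made up for illustration.

```python
import numpy as np

def rho_from_knn(X, k):
    """Choose the Gaussian-kernel range rho so that the kernel has decayed
    to 0.01 at the average k-nearest-neighbor distance in X (sketch of the
    heuristic described in the text)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)               # exclude each point itself
    d_bar = np.sort(D, axis=1)[:, :k].mean()  # average distance over the k nearest neighbors
    # solve exp(-(d_bar / rho)^2) = 0.01  =>  rho = d_bar / sqrt(ln 100)
    return d_bar / np.sqrt(np.log(100.0))

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 4))
rho = rho_from_knn(X, k=5)
```

Larger k yields a larger average neighbor distance and hence a larger, more slowly decaying range ρ.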
If the training data contains fewer than 5000 (or 500) samples, we use n − 1 as the maximal number of nearest neighbors k. In addition, we include the value of ρ that corresponds to the average (n − 1)-nearest-neighbor distance. The latter choice is made in order to also include a range which results in a kernel that decays slowly over the entire space. For the large data sets where the Nyström method is used, we calculate the average k-nearest-neighbor distance based on the Nyström samples; i.e., in this case, the maximal k equals l − 1.
The results are shown in Table 2. For the regression data sets, we show the average test mean square error (MSE) over the different sample splits, and for the classification data sets, we calculate the average test error rate. The numbers in parentheses show standard deviations over the different sample splits. In addition, we report in Table 3 p-values from t-tests comparing KTBoost with tree and kernel boosting. All tests are done using a regression model with a random effect at the sample split level to account for correlation among the results for different sample splits. We find that KTBoost outperforms both tree and kernel boosting for the large majority of data sets that we consider. In particular, KTBoost performs either significantly better than both tree and kernel boosting or the difference in predictive accuracy is not significant. In more detail, KTBoost has a higher predictive accuracy than both tree and kernel boosting for seventeen out of twenty-one data sets, and it is significantly better at the 5% significance level for six data sets. In only four cases either kernel or tree boosting performs marginally better than KTBoost. In all these four cases, the difference is clearly not significant with p-values above 0.5. In summary, our empirical findings show that the combination of regression trees and RKHS regression functions as base learners can result in increased predictive accuracy compared to using only regression trees or kernel regression functions as base learners. We note that we consider data sets with a sample size between a few hundred and approximately one hundred thousand samples. Very large data sets with several hundred thousands or millions of samples are not considered since the computational costs of the current Python implementation of KTBoost are prohibitive, in particular when choosing tuning parameters on a single CPU computer. 
Nonetheless, we believe that our results are important for both future research and applied data analysis, where small- to moderately-sized data sets are common.
Further, we note that for both tree and kernel boosting, one can consider additional tuning parameters. For instance, for trees, this includes the minimal number of samples per leaf, row and column sub-sampling, and penalization of leaf values. Concerning the kernel boosting part, further tuning parameters include varying range parameters for different features, the smoothness of the kernel function, or, in general, the class of kernel functions.
Due to limits on computational costs, we have not considered all possible combinations of tuning parameters. However, it is likely that a potential increase in predictive performance in either tree or kernel boosting will also result in an increase in accuracy of the combined KTBoost algorithm.

Conclusions
We have introduced a novel boosting algorithm, which combines trees and RKHS ridge regression functions as base learners. Intuitively, the idea is that discontinuous trees and continuous RKHS regression functions complement each other since trees are better suited for learning rougher parts of functions and RKHS regression functions can better learn smoother parts of functions. We have compared the predictive accuracy of KTBoost with tree and kernel boosting in an extensive empirical study, and we find that KTBoost results in higher predictive accuracy compared to tree and kernel boosting for the large majority of data sets. In particular, KTBoost has either significantly higher accuracy or no significant differences are observed in our empirical evaluation.
In the future, the KTBoost algorithm should be implemented in a computationally efficient way so that it scales better to larger data sets. This is important not only for applying the gradient descent version of the KTBoost algorithm to very large data sets but also for applying the Newton boosting version to moderately-sized data sets. In fact, on the smaller classification data sets (cancer, ionosphere, sonar, car, glass), we generally observe higher predictive accuracy when using Newton boosting instead of gradient boosting; a finding which is in line with the results of Sigrist [2018]. In Table 4, we additionally report the results when using gradient boosting instead of Newton boosting. We find increased predictive accuracy of Newton boosting compared to gradient boosting for 11 out of the 15 combinations of the five data sets and the three boosting algorithms (KTBoost, tree boosting, kernel boosting). This highlights the importance of a computationally efficient implementation that allows for applying Newton boosting to moderately-sized and large data sets. Such an implementation can use both a fast tree-building algorithm as in Chen and Guestrin [2016], Ke et al. [2017], or Prokhorenkova et al. [2018] as well as an RKHS ridge regression algorithm and implementation that scales to very large data. Concerning the latter, several potential strategies for how this can be done are briefly outlined in Section 3.1. Another interesting direction for future research is to investigate to which extent other base learners such as neural networks [Huang et al., 2018, Nitanda and Suzuki, 2018] can be used in addition to trees and RKHS regression functions.