Decision tree boosted varying coefficient models

Varying coefficient models are a flexible extension of generic parametric models whose coefficients are functions of a set of effect-modifying covariates instead of fitted constants. They are capable of achieving higher model complexity while preserving the structure of the underlying parametric models, hence generating interpretable predictions. In this paper we study the use of gradient boosted decision trees as those coefficient-deciding functions in varying coefficient models with linearly structured outputs. In contrast to the traditional choices of splines or kernel smoothers, boosted trees are more flexible since they require no structural assumptions in the effect modifier space. We introduce our proposed method from the perspective of a localized version of gradient descent, prove its theoretical consistency under mild assumptions commonly adopted in decision tree research, and empirically demonstrate that the proposed tree boosted varying coefficient models perform well in terms of training speed, prediction accuracy and intelligibility as compared to several benchmark algorithms.


Introduction
A varying coefficient model (VCM: Hastie and Tibshirani 1993) is, as the name intuitively suggests, a "parametric" model whose coefficients are allowed to vary based on individual model inputs. When presented with a set of covariates and some response of interest, a VCM isolates part of those covariates as effect modifiers based on which model coefficients are determined through a few varying coefficient mappings. These coefficients then join the remaining covariates to assume the form of a parametric model which produces structured predictions.
Consider the data generating distribution (X, Z, Y) ∈ ℝ^p × A × ℝ where X = (X^1, ..., X^p); X and Z are the covariates and Y is the response. A VCM regressing Y on (X, Z) can take a form resembling a generalized linear model (GLM) with link function g:

$$g\big(\mathbb{E}[Y \mid X, Z]\big) = \beta^0(Z) + \sum_{i=1}^{p} X^i \beta^i(Z). \tag{1}$$

In this context we refer to X as the predictive covariates and Z as the action covariates (effect modifiers), which are drawn from an action space A. The β^i(·) : A → ℝ, i = 0, 1, ..., p, are varying coefficient mappings. This is similar to a fixed effects model with the fixed effects on the slopes. To give an intuitive example, we can imagine a study associating someone's resting heart rate linearly with their weight, height, and daily caffeine intake, while expecting the linear coefficients to depend on their age Z, since younger people might be more responsive to caffeine consumption. In this case, A would be the set of all possible ages. While (1) preserves the linear structure, the dependence of the coefficients β on Z grants the model more complexity compared with the underlying GLM. In this paper we assume that each β^i(Z) uses the same action space, but a generalization that allows different coefficients to depend on different features is immediate. The varying coefficient mappings β(·) are conventionally nonparametric to distinguish (1) from any parametric model on a transformed sample space of (X, Z). In this paper we are particularly interested in the use of gradient boosting, especially gradient boosted decision trees (GBDT or GBM: Friedman 2001), as those mappings. Here we propose tree boosted VCM: for each β^i, i = 0, ..., p, we represent the mapping by an additive boosted tree ensemble of size b,

$$\beta^i(\cdot) = \sum_{j=1}^{b} t^i_j(\cdot),$$

with each t^i_j being a decision tree constructed sequentially through gradient boosting as detailed in Sect. 2.
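To make the structure of (1) concrete, here is a small numpy sketch (ours) of a VCM prediction with an identity link, using the heart-rate example above; the specific mapping `beta(z)` and all constants are illustrative assumptions, not fitted values.

```python
import numpy as np

# Hypothetical varying coefficient mappings beta_i(z): z is age, and the
# caffeine coefficient shrinks with age, as in the heart-rate example.
def beta(z):
    # returns (intercept, weight, height, caffeine) coefficients at age z
    return np.array([60.0, 0.1, -0.05, 2.0 / (1.0 + 0.05 * z)])

def vcm_predict(x, z):
    # VCM with identity link: y = beta_0(z) + sum_i x_i * beta_i(z)
    b = beta(z)
    return b[0] + x @ b[1:]

x = np.array([70.0, 175.0, 200.0])  # weight (kg), height (cm), caffeine (mg)
young, old = vcm_predict(x, 20.0), vcm_predict(x, 60.0)
```

The same predictive covariates x yield different predictions because the coefficients, not the covariates, respond to the effect modifier z.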
This strategy yields the model

$$g\big(\mathbb{E}[Y \mid X, Z]\big) = \sum_{j=1}^{b} t^0_j(Z) + \sum_{i=1}^{p} X^i \sum_{j=1}^{b} t^i_j(Z).$$

Apart from upgrading the complexity of parametric models, introducing tree boosted VCM aligns with our attempt to address concerns about model intelligibility and transparency.
There are two mainstream approaches to model interpretation in the literature. The first category suggests we can attach a post hoc explanatory mechanism to a fitted model to inspect its results, hence it is model agnostic. There is a sizable literature on this topic, from the appearance of local methods (LIME: Ribeiro et al. 2016) to recent applications on neural nets (Zhang and Zhu 2018; Sundararajan et al. 2017), random forests (Mentch and Hooker 2016; Basu et al. 2018) and complex model distillation (Lou et al. 2012, 2013; Tan et al. 2018), to name a few. These approaches are especially favored for "black box" models since they do not interfere with model building and training. However, introducing this extra explanatory step also poses questions that involve additional decision making: which form of interpretability is preferred, which method to apply, and if possible, how to relate the external interpretation to the model's inner workings.
In contrast, we can design inherently interpretable models. In other words, we simply consider models that consist of concise and intelligible building blocks to spontaneously achieve interpretability. Examples of this approach range from basic models such as generalized linear models and decision trees, to complicated ones that guarantee monotonicity (You et al. 2017; Chipman et al. 2016), have identifiable and accountable components (Melis and Jaakkola 2018), or allow more control over the covariate structure through shape constraints (Cotter et al. 2019). Although they have the advantage of not requiring additional inspection, self-explanatory models are more restricted in terms of their flexibility and often face a trade-off between generalization accuracy and intelligibility.
Following this discussion, a VCM belongs to the second category as long as the involved parametric form is intelligible. It is an immediate generalization of the parametric form that allows the use of local coefficients, which leads to improvements in both model complexity and accuracy. It is worth pointing out again that a VCM is structured: the predictions are produced through parametric (e.g., generalized linear) relations between predictive covariates and coefficients. This combination demonstrates a feasible means to balance the trade-off between flexibility and intelligibility.
There is a substantial literature examining the asymptotic properties of different VCMs when traditional splines or kernel smoothers are implemented as the nonparametric varying coefficient mappings. We refer the readers to Park et al. (2015) for a comprehensive review. In this paper we present similar results regarding the asymptotics of tree boosted VCM.

Trees and VCMs
While popular choices of varying coefficient mappings are either splines or kernel smoothers, we consider employing decision trees (CART: Breiman et al. 2017) and decision tree ensembles as those nonparametric mappings.
The advantages of applying tree ensembles are twofold. They form a highly complex and flexible model class, and they are capable of handling any action space A compatible with decision tree splitting rules. The latter is especially convenient when compared to, for instance, cases where A is a high dimensional mixture of continuous and discrete covariates, for which constructing a spline basis or defining a kernel smoother distance measure would require considerably greater attention from the modeler.
There are existing straightforward attempts to utilize a single decision tree as the varying coefficient mapping (Buergin and Ritschard 2017; Berger et al. 2017). Although they have a simple form, these implementations are subject to the instability rooted in using a single tree. Moreover, the mathematical intractability of single decision trees prevents such varying coefficient mappings from achieving provable optimality. On the other hand, by switching to tree ensembles built by either random forests or gradient boosting we obtain better variance reduction and asymptotic guarantees. One example applying random forests is based on the local linear forests introduced in Friedberg et al. (2018), which perform local linear regression with an honest random forest kernel, while the predictive covariates X are reused as the effect modifiers Z.
In terms of tree boosting methods, Wang and Hastie (2014) proposed the first tree boosted VCM algorithm. They applied functional trees (Gama 2004) with a linear model in each leaf to fit the negative gradient of the empirical risk. Here, a functional tree is a regression tree whose leaves are linear models instead of constants. Their resulting model updates the coefficients additively, with β̂_{b+1} = β̂_b + t_{b+1}. Here we assume the first column of X is the constant 1 and each decision tree t_j returns a (p+1)-dimensional response. When applied to a regression problem with L2 loss, their algorithm can be summarized as follows.

1. Start with an initial guess of β̂_0(·). A common choice is β̂_0(·) = 0.
2. For each iteration b, calculate the residual for i = 1, ..., n: r_i = y_i − x_i^T β̂_b(z_i).
3. Fit a functional tree t_{b+1} on (z_i, r_i), i = 1, ..., n, whose leaf-wise linear models produce coefficient updates taking the same value for all x_i in the same leaf node.
4. Update β̂_{b+1} = β̂_b + t_{b+1}.
5. Keep boosting until the stopping rule is met.
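The functional-tree iteration above can be sketched in a few lines of numpy (our illustration). For brevity the functional tree is replaced by a fixed depth-1 partition of the action space at z = 0.5; a real implementation searches the split greedily, and all constants here are toy assumptions.

```python
import numpy as np

# Toy FVCM-style iteration: leaf-wise linear fits on residuals.
rng = np.random.default_rng(0)
n = 400
z = rng.uniform(0, 1, n)                                  # effect modifier
x = np.column_stack([np.ones(n), rng.normal(size=n)])     # intercept + predictor
beta_true = np.where(z[:, None] < 0.5, [1.0, 2.0], [3.0, -1.0])
y = (x * beta_true).sum(axis=1)                           # noiseless linear signal

lam = 0.1                                  # learning rate
beta_hat = np.zeros((n, 2))                # beta_b(z_i) tracked per sample point

for b in range(200):
    resid = y - (x * beta_hat).sum(axis=1)        # step 2: residuals
    for leaf in (z < 0.5, z >= 0.5):              # step 3: fixed depth-1 "tree"
        t, *_ = np.linalg.lstsq(x[leaf], resid[leaf], rcond=None)
        beta_hat[leaf] += lam * t                 # step 4: coefficient update
```

With the partition fixed at the true boundary, each iteration shrinks the coefficient error geometrically, so the piecewise coefficients are recovered essentially exactly.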
In practice this approach can present a number of challenges. Most immediately, constructing optimal functional trees is usually time consuming. Beyond this, correlation between Z and X can cause problems, especially when a deep tree is used and the X covariates in a deep node appear more correlated or even singular as a result of selection by similar Z's. As solutions, we can either prevent the tree from making narrow splits or use a penalized leaf model. However, selecting a penalty parameter leaves us without a closed-form expression for the solution, requiring us to optimize numerically, presumably through gradient descent methods.
To address these issues, we propose a new tree boosted VCM framework summarized as follows when applied to regression problems with L 2 loss. A general form is presented in Sect. 2.
1. Start with an initial guess of β̂_0(·). A common choice is β̂_0(·) = 0.
2. For each iteration b and for each dimension j = 0, ..., p of β̂_b:
   - calculate the negative gradient of the L2 loss at each x_i, i = 1, ..., n, namely Δ_j β̂(z_i) = (y_i − x_i^T β̂_b(z_i)) x_i^j;
   - fit a decision tree t^j_{b+1} regressing Δ_j β̂(z_i) on z_i;
   - update the j-th coordinate β̂^j_{b+1} = β̂^j_b + λ t^j_{b+1} with learning rate λ.
3. Keep boosting until the stopping rule is met.

In short, instead of functional trees, we directly use tree boosting to perform gradient descent in the functional space of the varying coefficients in the parametric part of the model. We will elaborate later that this framework has several advantages:

- It has provable theoretical performance guarantees.
- This method takes advantage of the standard boosting framework, which allows us to use and extend existing gradient descent and gradient boosting techniques and theories.
- Each coefficient can be boosted separately, thereby avoiding singular submodel fitting.
- Any gradient descent or gradient boosting constraints can be applied directly in this approach, which helps fit constrained or penalized models.
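The boosting loop described above can be sketched end-to-end in numpy (our illustration). A depth-1 tree search over a fixed grid of candidate cuts stands in for a full CART fit, and all names and constants are toy assumptions.

```python
import numpy as np

# Minimal tree boosted VCM with L2 loss: per-dimension stumps on z fit
# the negative gradients of the coefficients.
rng = np.random.default_rng(1)
n = 500
z = rng.uniform(0, 1, n)
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.column_stack([np.where(z < 0.5, 1.0, 3.0),
                             np.where(z < 0.5, 2.0, -1.0)])
y = (x * beta_true).sum(axis=1)

def stump(z, g):
    """Depth-1 tree on z: pick the candidate cut minimizing the SSE of g."""
    best_sse, best_pred = np.inf, np.full_like(g, g.mean())
    for cut in np.quantile(z, np.linspace(0.05, 0.95, 19)):
        left = z < cut
        if left.sum() < 10 or (~left).sum() < 10:   # minimal node size
            continue
        pred = np.where(left, g[left].mean(), g[~left].mean())
        sse = ((g - pred) ** 2).sum()
        if sse < best_sse:
            best_sse, best_pred = sse, pred
    return best_pred

lam = 0.1
beta_hat = np.zeros((n, 2))
for b in range(300):
    resid = y - (x * beta_hat).sum(axis=1)
    for j in range(2):
        grad_j = resid * x[:, j]                  # negative L2 gradient w.r.t. beta_j
        beta_hat[:, j] += lam * stump(z, grad_j)  # boosted coefficient update
```

Away from the decision boundary at z = 0.5, the boosted coefficient estimates settle near the true piecewise-constant values, while each coefficient dimension is boosted by its own tree.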

Models under VCM
Under the settings of (1), Hastie and Tibshirani (1993) pointed out that the VCM is a generalization of generalized linear models, generalized additive models, and various other semi-parametric models with careful choices of the varying coefficient mapping β. In particular, the partially linear regression model in which β is a global linear coefficient (see Härdle et al. 2012) is equivalent to a least squares VCM with all varying coefficient mappings except the intercept being constant. We treat the VCM in general throughout this paper, but examine a partially linear regression in Sect. 4.4.
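The partially linear special case can be sketched directly (our toy numpy example): only the intercept β_0(z) varies, boosted with depth-1 stumps over a fixed grid of candidate cuts, while the slope is a global constant refit by least squares. All names and constants are illustrative assumptions.

```python
import numpy as np

# Partially linear model as a VCM: varying intercept + constant slope.
rng = np.random.default_rng(2)
n = 600
z = rng.uniform(0, 1, n)
x1 = rng.normal(size=n)
f0 = np.sin(2 * np.pi * z)          # true varying intercept beta_0(z)
y = f0 + 1.5 * x1                   # constant slope 1.5

slope, intercept_hat, lam = 0.0, np.zeros(n), 0.2
cuts = np.quantile(z, np.linspace(0.1, 0.9, 9))
for b in range(200):
    resid = y - intercept_hat - slope * x1
    best_sse, best_pred = np.inf, None
    for c in cuts:                  # boost the varying intercept with a stump
        left = z < c
        pred = np.where(left, resid[left].mean(), resid[~left].mean())
        sse = ((resid - pred) ** 2).sum()
        if sse < best_sse:
            best_sse, best_pred = sse, pred
    intercept_hat += lam * best_pred
    # refit the global (constant) slope on what the intercept cannot explain
    slope = np.dot(x1, y - intercept_hat) / np.dot(x1, x1)
```

The stump ensemble recovers a piecewise-constant approximation of the smooth intercept while the global slope estimate converges to the true constant coefficient.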
Within this modeling framework, the closest to our approach are functional trees, introduced in Gama (2004) and Zeileis et al. (2008). A functional tree segments the action space into disjoint regions, after which a parametric model is fitted within each region using the sample points inside. Zheng and Chen (2019) and Chaudhuri et al. (1994) discussed the cases where these regional models are linear or polynomial. Another example is logistic regression trees, for which there is a sophisticated building algorithm (LOTUS: Chan and Loh 2004). Their prediction at a new point (x_0, z_0) is

$$\hat y(x_0, z_0) = \sum_{i} \mathbb{I}(z_0 \in A_i)\, x_0^T \beta_i = x_0^T \beta_k,$$

where the A_i form the tree segmentation, z_0 ∈ A_k and β_k = β(z), ∀z ∈ A_k. As mentioned above, constructing functional trees is time consuming. The conventional way of determining the tree structure is to enumerate candidate splits and choose the one that most reduces the empirical risk between before and after splitting. Such a greedy strategy requires substantial computation and yields mathematically intractable results.
The rest of the paper is structured as follows. In Sect. 2 we present the formal definition of our proposed method and share an alternative perspective of looking at it as local gradient descent: a localized functional version of gradient descent that treats parameters in a parametric model as functions on the action space and optimizes locally. We will prove the consistency of our algorithm in Sect. 3 and present our empirical results in Sect. 4. Further discussions on potential variations of this method follow in Sect. 5. Code to produce results throughout the paper is available at https://github.com/siriuz42/treeboostVCM.

Tree boosted varying coefficient models
For the varying coefficient mapping of interest β, we use superscripts 0, ..., p to indicate individual dimensions, i.e. β = (β^0, ..., β^p)^T, and subscripts 1, ..., n to indicate sample points or boosting iterations when there is no ambiguity. For any X, we assume X contains the intercept column, i.e. X = (1, X^1, ..., X^p), so that $X^T\beta = \beta^0 + \sum_{i=1}^{p} X^i \beta^i$ can be used to specify a linear regression.

Boosting framework
Here we present the general form of our proposed tree boosted VCM. We first consider fitting a generalized linear model with constant coefficients β ∈ ℝ^{p+1}. Given the sample (x_1, z_1, y_1), ..., (x_n, z_n, y_n) and a target loss function l, gradient descent is used to minimize the empirical risk by searching for the optimal β̂* (regardless of z_i at this moment) as

$$\hat\beta^* = \arg\min_{\beta} \frac{1}{n}\sum_{i=1}^{n} l\big(y_i, x_i^T\beta\big).$$

First order gradient descent improves an intermediate β̂ to β̃ by moving it in the negative gradient direction with a learning rate 0 < λ ≤ 1:

$$\tilde\beta = \hat\beta - \lambda\cdot\frac{1}{n}\sum_{i=1}^{n} \nabla_\beta\, l\big(y_i, x_i^T\beta\big)\Big|_{\beta=\hat\beta}.$$

This formula averages the contribution from each of the sample points, $-\nabla_\beta l(y_i, x_i^T\beta)|_{\beta=\hat\beta}$ for i = 1, ..., n. We now consider β to be a mapping β = β(z) : A → ℝ^{p+1} on the action space so that the effect modifiers vary the coefficients. The empirical risk remains the same,

$$\frac{1}{n}\sum_{i=1}^{n} l\big(y_i, x_i^T\beta(z_i)\big),$$

and the contributed gradient at each sample point is still

$$\Delta\hat\beta(z_i) = -\nabla_\beta\, l\big(y_i, x_i^T\beta\big)\Big|_{\beta=\hat\beta(z_i)}.$$

These gradients are the ideal functional improvement of β̂ captured at each of the sample points, i.e. (z_1, Δβ̂(z_1)), ..., (z_n, Δβ̂(z_n)). We can now employ boosting (Friedman 2001) to estimate a functional improvement. For any function family T capable of regressing Δβ̂(z_1), ..., Δβ̂(z_n) on z_1, ..., z_n, the corresponding boosting framework works as follows.

(B1) Start with an initial guess β̂_0(·); a common choice is β̂_0(·) = 0.
(B2) For each iteration b and for each dimension j = 0, ..., p of β̂_b, we calculate the pseudo gradient at each point as

$$\Delta_j\hat\beta(z_i) = -\frac{\partial}{\partial\beta^j}\, l\big(y_i, x_i^T\beta\big)\Big|_{\beta=\hat\beta_b(z_i)}$$

for i = 1, ..., n. Notice that we separate the dimensions of β on purpose.

(B3) For each j, find a good fit t^j_{b+1} ∈ T : A → ℝ on (z_i, Δ_j β̂(z_i)), i = 1, ..., n.

(B4) Update β̂^j_{b+1} = β̂^j_b + λ t^j_{b+1} with learning rate λ, and repeat from (B2) until the stopping rule is met.
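To make (B2) concrete, here is a short worked derivation (ours) of the pseudo gradient under the logistic loss, the setting used in our later classification examples:

```latex
% Logistic loss for y \in \{0, 1\} with linear predictor \eta = x^T \beta:
l(y, \eta) = -y\eta + \log\left(1 + e^{\eta}\right),
\qquad \sigma(\eta) = \frac{1}{1 + e^{-\eta}}.
% Since \partial l / \partial \eta = \sigma(\eta) - y and
% \partial \eta / \partial \beta^j = x^j, the pseudo gradient in (B2) is
\Delta_j \hat\beta(z_i)
  = -\frac{\partial}{\partial \beta^j}\, l\left(y_i, x_i^T \beta\right)\Big|_{\beta = \hat\beta_b(z_i)}
  = \left(y_i - \sigma\!\left(x_i^T \hat\beta_b(z_i)\right)\right) x_i^j.
```

For the L2 loss l(y, η) = (y − η)²/2 the same computation gives Δ_j β̂(z_i) = (y_i − x_i^T β̂_b(z_i)) x_i^j, the per-coordinate residual form used in the introduction.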
When spline methods are implemented in (B3), T is closed under addition and the result of (B4) is a list of coefficients of basis functions for T. On the other hand, we propose applying decision trees in (B3); we refer to this tree-based instantiation as (B3'). The resulting varying coefficient mapping will be an additive tree ensemble, whose model space varies based on the ensemble size. We will refer to this method as tree boosted VCM for the rest of the paper when there is no ambiguity.

Local gradient descent with tree kernels
Each terminal node of the decision tree fit in (B3') linearly combines the contributed gradients of the sample points grouped in this node. From a generic viewpoint, for each dimension j we introduce a kernel smoother K_j = K : A × A → ℝ (we drop the j when there is no ambiguity) such that the estimated gradient at any new z is given by

$$\Delta\hat\beta(z) = \frac{\sum_{i=1}^{n} K(z, z_i)\, \Delta\hat\beta(z_i)}{\sum_{i=1}^{n} K(z, z_i)}. \tag{4}$$

In other words, with a fast decaying K, (4) can estimate the gradient at z locally using weights given by K(z, ·).
We refer to such method as local gradient descent.
For the CART strategy, after a decision tree is constructed at each iteration, its induced smoother K assigns equal weights to all sample points in the same terminal node. Writing A(z_i) ⊆ A for the region in the action space corresponding to the terminal node containing z_i, we have K(z, z_i) = I(z ∈ A(z_i)), and we define the tree structure function

$$s(z, z_i) = \frac{\mathbb{I}(z \in A(z_i))}{\sum_{j=1}^{n} \mathbb{I}(z_j \in A(z_i))},$$

where we also use the convention that 0/0 = 0. The denominator is the size of z_i's terminal node and is equal to Σ_{j=1}^n I(z_j ∈ A(z)) when z and z_i fall in the same terminal node. In the cases where subsampling or completely random trees are employed for the purpose of variance reduction, K will be taken to be the expectation

$$s(z, z_i) = \mathbb{E}\left[\frac{\mathbb{I}(z \in A(z_i))}{\sum_{j=1}^{n} \mathbb{I}(z_j \in A(z_i))}\right],$$

whose properties have been studied by Zhou and Hooker (2018). This expectation is taken over all possible tree structures and, if subsampling is applied, all possible subsamples w of a fixed size; the denominator in the expectation is again the size of z_i's terminal node. In particular, by carefully choosing the rates for tree construction, this tree structure function is related to the random forest kernel introduced in Scornet (2016), which takes the expectation of the numerator and the denominator separately as

$$k(z, z_i) = \frac{\mathbb{E}\big[\mathbb{I}(z \in A(z_i))\big]}{\mathbb{E}\big[\sum_{j=1}^{n} \mathbb{I}(z_j \in A(z_i))\big]},$$

in the sense that the deviations from these expectations are mutually bounded by constants.
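The tree structure function can be illustrated with a fixed depth-1 partition of a one-dimensional action space (our toy example; the leaf boundary and gradient values are arbitrary assumptions):

```python
import numpy as np

# Fixed depth-1 "tree" on [0, 1]: leaves A_1 = [0, 0.5), A_2 = [0.5, 1].
# The kernel weight is I(z in A(z_i)) divided by the size of z_i's leaf.
z_train = np.array([0.1, 0.2, 0.3, 0.6, 0.9])
leaf = (z_train >= 0.5).astype(int)            # leaf id of each training point

def tree_kernel(z, z_i):
    lz, lzi = int(z >= 0.5), int(z_i >= 0.5)
    if lz != lzi:                              # different leaves: weight 0
        return 0.0
    return 1.0 / np.sum(leaf == lzi)           # uniform weight over the leaf

# Smoothed gradient estimate at a new z: kernel-weighted average of the
# per-point pseudo gradients (toy values below).
grads = np.array([1.0, 2.0, 3.0, 10.0, 20.0])
est = sum(tree_kernel(0.25, zi) * g for zi, g in zip(z_train, grads))
```

Here z = 0.25 falls in the left leaf, so the estimate is simply the average of the three left-leaf gradients; points in the other leaf receive weight zero.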

Regularizing tree boosted VCM
Perfectly "chasing after" the pseudo gradients Δ_j β̂(z_i) can lead to severe overfitting. In general, gradient boosting applied in a nonparametric regression setting has to be accompanied by regularization, such as a complexity penalty or early stopping, to prevent overfitting (see, for example, Bühlmann et al. 2007; Cortes et al. 2019). Hence we partially rely on the strategy of building a decision tree in (B3') to safeguard the behavior of the obtained tree boosted VCM. Consider the greedy CART strategy for dimension j:

(D1) Start at the root node. Given a node, enumerate candidate splits and evaluate their splitting metrics using all (z_i, Δ_j β̂(z_i)) such that z_i is contained in the node.
(D2) Split on the best candidate split. Keep splitting until stopping rules are met to form terminal nodes.

(D3) Calculate the fitted terminal value in each terminal node as the average of all Δ_j β̂(z_i) such that z_i is contained in the terminal node.
This model has several tuning parameters: the splitting metric, the depth of trees, the minimal number of sample points in a terminal node, and any pruning if needed. All these regularizers are intended to keep the terminal node large enough to avoid isolating a few points with extreme values. Meanwhile, recent theoretical results on decision trees suggest alternative strategies that produce better theoretical guarantees. We may consider subsampling that generates a subset w ⊂ {1, . . . , n} and only uses sample points indexed by w in (D1).
(D1') … Given a node, enumerate candidate splits and evaluate their splitting metrics using all (z_i, Δ_j β̂(z_i)) such that i ∈ w and z_i is contained in the node.

We may also consider honesty (Wager and Athey 2018), which avoids using the responses, in our case Δ_j β̂(z_i), twice for both deciding the tree structure and deciding the terminal values. For instance, a version of completely random trees discussed in Zhou and Hooker (2018) chooses the splits using solely the z_i, without evaluating candidate splits against the responses Δ_j β̂(z_i), in place of step (D1):

(D2*) … Given a node, choose a random split based on the z_i's contained in the node.
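An honest, completely random tree in the spirit of (D2*) can be sketched as follows (our illustration; the stopping thresholds and toy pseudo gradients are arbitrary assumptions):

```python
import numpy as np

# Splits are drawn at random from z alone; the responses g (toy pseudo
# gradients) are only used for the terminal averages, as in (D3).
rng = np.random.default_rng(3)
z = rng.uniform(0, 1, 200)
g = np.where(z < 0.5, 1.0, -1.0) + rng.normal(0, 0.1, 200)

def random_tree(z, g, depth, rng):
    if depth == 0 or len(z) < 20:              # stopping rule: node too small
        return ("leaf", g.mean())
    cut = rng.uniform(z.min(), z.max())        # split chosen from z only
    left = z < cut
    if left.sum() < 10 or (~left).sum() < 10:  # reject tiny children
        return ("leaf", g.mean())
    return ("node", cut,
            random_tree(z[left], g[left], depth - 1, rng),
            random_tree(z[~left], g[~left], depth - 1, rng))

def predict(tree, z0):
    if tree[0] == "leaf":
        return tree[1]
    return predict(tree[2] if z0 < tree[1] else tree[3], z0)

tree = random_tree(z, g, depth=3, rng=rng)
```

A single such tree is noisy, but averaging predictions over many independent random trees approximates the expected kernel discussed above.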
From the boosting point of view, multiple other methods may also be applied to the additive ensemble to prevent overfitting: early stopping, dropouts (Rashmi and Gilad-Bachrach 2015), or adaptive learning rates.

Examples
Tree boosted VCM generates a structured model joining the varying coefficient mappings and the predictive covariates. In order to understand these varying coefficient mappings on the action space, we provide a visual example here by considering the following generative distribution: We generate a sample of 1000 examples from this distribution, apply the tree boosted VCM with 400 trees, a learning rate λ = 0.1, tree maxdepth 3 (excluding the root, the same for the rest of the paper) and minsplits 10 (the minimal node size allowed for further splitting), and obtain the estimates of the varying coefficient mappings β on Z shown in Fig. 1. The fitted values accurately capture the true coefficients.
Switching to a logistic regression setting and assuming a similar model, with a sample of size 1000, Fig. 2 presents the corresponding plots of the tree boosted VCM results, with the same set of parameters as the previous experiment except for the learning rate λ = 0.05. These results are less clear as logistic regression produces noisier gradients. In both cases our method correctly identifies β(z) as segmenting along the diagonal in z, providing clear visual identification of the behavior of β(z) without positing any structural assumptions on the action space. Further empirical studies are presented in Sect. 4.

Consistency
There is a large literature providing statistical guarantees and asymptotic analyses of different versions of VCM with varying coefficient mappings obtained via splines or local smoothers (Park et al. 2015; Fan et al. 1999, 2005). For decision trees, Bühlmann (2002) established a consistency guarantee for tree-type basis functions for L2 boosting that bounds the gap between the boosting procedure and its population version by the uniform convergence in distribution of the family of hyper-rectangular indicators. We take a similar approach to expand this family of indicators to reflect the tree boosted VCM structure and study its asymptotics. In this section we will sketch the proofs, while the details can be found in the appendix.
Consider the L2 boosting setting for regression. Given the relationship

$$y_i = f(x_i, z_i) + \epsilon_i = x_i^T \beta(z_i) + \epsilon_i,$$

we work with the following assumptions.
For a terminal node R ⊆ A for dimension j, as per (4), the decision tree update in R with a possible subsample w is

$$\Delta_j\hat\beta(R) = \frac{\sum_{i \in w} \mathbb{I}(z_i \in R)\, \Delta_j\hat\beta(z_i)}{\sum_{i \in w} \mathbb{I}(z_i \in R)}.$$

Notations
We assume the action space A can be embedded into a Euclidean space. Write R for the family of all hyper rectangles in A, which includes all possible terminal nodes of any decision tree built on A.
Given the distribution (Z, 1, X) ∼ P, we define the inner product ⟨f_1, f_2⟩ = E_P[f_1 f_2] and the norm ‖·‖ = ‖·‖_{P,2} on the sample space A × {1} × [−1, 1]^p. For a sample of size n, we write the empirical counterpart as ⟨f_1, f_2⟩_n = (1/n) Σ_{i=1}^n f_1(z_i, x_i) f_2(z_i, x_i), which converges to ⟨f_1, f_2⟩ by the law of large numbers, with the corresponding norm ‖·‖_n.

Decomposing tree boosted VCM
Our study of decision trees requires their decomposition into the following rectangular indicators:

$$\mathcal{H} = \big\{h_R(z) = \mathbb{I}(z \in R) : R \in \mathcal{R}\big\}.$$

Tree boosted VCM expands H by adding in coordinate mappings within hyper rectangles:

$$\mathcal{G} = \big\{g_{R,j}(x, z) = x^j\, \mathbb{I}(z \in R) : R \in \mathcal{R},\ j = 0, \dots, p\big\}.$$

In particular we set 1 = x^0 so that g_{R,0} = h_R. Define the empirical remainder function r̂_b (Bühlmann 2002) as

$$\hat r_b(x, z) = y - x^T \hat\beta_b(z),$$

i.e. the remainder term after the b-th boosting iteration. Further, consider the b-th iteration utilizing p + 1 decision trees whose disjoint terminal nodes are R^j_1, ..., R^j_m ∈ R for j = 0, ..., p respectively. Then (5) is equivalent to the following expression for the boosting update of the remainder:

$$\hat r_{b+1} = \hat r_b - \lambda \sum_{j=0}^{p} \sum_{i=1}^{m} \frac{\langle \hat r_b,\, g_{R^j_i, j} \rangle_n}{\| h_{R^j_i} \|_n^2}\, g_{R^j_i, j},$$

where g_{R^j_i, j} is the coordinate mapping of dimension j in its corresponding hyper rectangle R^j_i. We flatten the subscripts for simplicity when there is no ambiguity. Although the update involves m(p + 1) terms, only p + 1 of them are applicable for a given (x, z) pair as the result of using disjoint terminal nodes. Mallat and Zhang (1993) and Bühlmann (2002) suggested we consider the population counterparts of these processes, defined by the remainder functions with r_0 = f and using the same tree structures. These processes converge to the consistent estimate in the completion of the decision tree family T. As a result, we can achieve asymptotic consistency as long as the gap between the sample process and this population process diminishes fast enough with the increase of sample size. The following lemma provides uniform bounds on the asymptotic variability pertaining to G using Donsker's theorem (see van der Vaart and Wellner 1996).
Lemma 1 For a given L2 function f and random sub-Gaussian noise ε, the empirical gaps sup_{g∈G} |⟨f + ε, g⟩_n − ⟨f, g⟩| vanish in probability as n → ∞.

Consistency
Lemma 1 quantifies the discrepancy between tree boosted VCM fits and their population versions conditioned on the sequence of trees used during boosting. In addition, we propose several conditions for consistency, which also serve as practical guidelines for employing tree boosted VCM.

(C4) As commonly assumed in the boosting literature, we require that each boosted tree effectively reduces the empirical risk. Consider the best functional rectangular fit during the b-th population iteration,

$$g^* = \arg\max_{g \in \mathcal{G}} \frac{|\langle r_b, g \rangle|}{\|g\|}.$$

We expect to empirically select at least one (R*, j) pair during the iteration to approximate g* such that

$$\frac{|\langle r_b, \hat g^* \rangle|}{\|\hat g^*\|} \geq \nu\, \frac{|\langle r_b, g^* \rangle|}{\|g^*\|}$$

for some 0 < ν < 1. Lemma 1 indicates that by choosing the sample version optimum

$$\hat g^* = \arg\max_{g \in \mathcal{G}} \frac{|\langle \hat r_b, g \rangle_n|}{\|g\|_n},$$

we meet the requirement in probability for any fixed number of iterations.

(C5) ‖f‖² = M < ∞. In addition, due to the linear models in the VCM, to achieve consistency we require that f ∈ span(G).

(C6) We also require the identifiability of the linear models, such that the distribution of the predictive covariates is nondegenerate at a random prediction point (x*, z*), which is independent of but identically distributed as the training data.
Corollary 1 If we further assume (C6), the fitted varying coefficient mappings are consistent at the prediction point, i.e. β̂(z*) → β(z*) in probability.

Corollary 1 justifies the varying coefficient mappings as valid estimators of the true varying linear relationship. Although we have not explicitly introduced any continuity condition on β, it is worth noting that (C5) requires β to have relatively invariant local behavior. Hence we hypothesize that tree boosted VCM should be most effective when A consists of a few flat regions.
When we consider the interpretability of tree boosted VCM, consistency is also a sufficient theoretical guarantee for the local fidelity discussed in Ribeiro et al. (2016): an interpretable local method should yield an accurate local relation between covariates and responses.

Benchmarking tree boosted VCMs
As shown by the earlier examples, tree boosted VCMs are capable of adapting to the structure of the action space. An additional example is presented in Appendix C.2 to demonstrate that they reconstruct the right signal levels.
In this subsection we focus on the training and coefficient estimation aspects to compare our proposed tree boosted VCM (denoted TBVCM) with Wang and Hastie (2014) (denoted FVCM, functional-tree boosted VCM). Discussions of prediction accuracy will be presented in the next subsection. We would like to again mention their key difference: TBVCM is gradient descent based, while FVCM relies on constructing functional models within trees.
We consider four scenarios: the error is small (σ² = 1) or large (σ² = 4), and the tree is shallow (only split nodes with more than 50 examples) or deep (only split nodes with more than 20 examples). We run TBVCM and FVCM 10 times in each scenario using the same setup: 200 trees in the ensemble, a 0.1 learning rate for each tree, and the same 10 random samples. For FVCM we use the R code provided by Wang and Hastie (2014), with a custom functional tree logic using LM to extract terminal models. For TBVCM we implement the gradient calculation and utilize rpart to build the regression trees. Table 1 shows different metrics for the two methods. We focus on three aspects: execution time, convergence speed and vulnerability to overfitting. We draw several conclusions based on Table 1:

- TBVCM runs faster than FVCM. This is most likely due to the implementation, but it also suggests that the TBVCM algorithm is simpler.
- Both methods overfit, and therefore need cross validation or other methods to define early stopping criteria.
- FVCM takes fewer iterations to overfit, while it is also more sensitive to deeper and more flexible trees. We expect FVCM to converge faster: fitting one submodel is equivalent to executing multiple gradient steps were we to replace the explicit solution of linear models by gradient descent. However, the final MSEs of FVCM and TBVCM are close, suggesting similar convergence of both methods.
Figures 3 and 4 plot the distributions of estimated coefficients of the two models at the full ensemble size. Both models capture the horizontal transition of β_1, the vertical transition of β_2, and the circular boundary of β_3. However, the overly flexible ensembles in Fig. 4 produce less interpretable plots, indicating again the need for proper regularization. FVCM also produces a wider range of coefficient values. TBVCM behaves more conservatively and appears visually similar to applying a small shrinkage.

Model accuracy
To examine the accuracy of our proposed methods, we have selected 12 real world datasets and run tree boosted VCM (denoted TBVCM) against other benchmark methods. FVCM is excluded from this experiment due to singularity issues when applying the method to these datasets. Tables 2 and 3 present the results under classification settings with three benchmark algorithms: logistic regression (GLM), a partially saturated logistic regression model (GLM(S)) in which each combination of discrete action covariates acts as a fixed effect with its own level, and AdaBoost. Table 4 presents the results under regression settings, with GBM as an additional benchmark. Despite its additional constrained model structure, TBVCM performs on a par with both AdaBoost and GBM. It benefits from its capability of modeling the action space without structural conditions, outperforming the fixed effect linear model in certain cases.

Regularization and interpretability
One advantage of applying a gradient descent based algorithm is that it adapts easily to constrained optimization problems.
Here we show the results of applying tree boosted VCM to the Beijing housing data (Kaggle 2018). We take the housing unit price as the target, regressed on covariates of location, floor, number of living rooms and bathrooms, whether the unit has an elevator and whether the unit has been refurbished. Specifically, location is treated as the action space, represented as pairs of longitude and latitude. Location specific linear coefficients of the other covariates are displayed in Fig. 5. We allow 200 trees of depth 5 in the ensemble with a constant learning rate of 0.05. In particular, from domain knowledge we would expect that the number of bathrooms should not influence the unit price negatively. The same expectation applies to the number of living rooms, the existence of elevator service and the degree of refurbishment. To reflect this domain knowledge, we constrain the model to always predict nonnegative coefficients for these covariates by regularizing their local gradient descent to never drop below zero. Results of the regularization-free model are instead presented in Appendix C.2. Figure 5 provides a clear visualization of the fitted tree boosted VCM and can be interpreted in the following way. The urban landscape of Beijing consists of an inner circle with a low skyline and a modern outskirt rim of new buildings. The model intercept provides the baseline of the unit housing prices in each area. Despite their high values, most buildings inside the inner circle are decades old, so refurbishment gets more attention whereas the floor contributes negatively due to the absence of elevators. In contrast, the outskirt housing is new and its unit pricing is unmoved by refurbishment, but a higher floor with a better city view is more valuable.

(Table caption: results are shown as mean(sd). Sources of some datasets: BEIJINGPM (Liang et al. 2015), BIKEHOUR (Fanaee-T and Gama 2014), ONLINENEWS (Fernandes et al. 2015) and ENERGY (Tsanas and Xifara 2012).)
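The nonnegativity constraint described above amounts to projecting the coefficients back onto [0, ∞) after each boosting update. Here is a schematic numpy example (ours), with the tree again reduced to a fixed depth-1 split for brevity; all constants are toy assumptions, not the housing model.

```python
import numpy as np

# Projected gradient boosting of a single varying coefficient beta_1(z)
# constrained to be nonnegative.
rng = np.random.default_rng(4)
n = 400
z = rng.uniform(0, 1, n)
x = rng.normal(size=n)
beta1_true = np.where(z < 0.5, 0.0, 2.0)      # true effect is never negative
y = beta1_true * x + rng.normal(0, 0.5, n)

lam = 0.1
beta1 = np.zeros(n)
for b in range(200):
    grad = (y - beta1 * x) * x                # negative L2 gradient for beta1
    left = z < 0.5                            # fixed depth-1 "tree" for brevity
    step = np.where(left, grad[left].mean(), grad[~left].mean())
    beta1 = np.maximum(beta1 + lam * step, 0.0)   # projection: never below zero
```

Noise may push the unconstrained fit slightly negative where the true coefficient is zero; the projection clips those excursions while leaving the genuinely positive region essentially unchanged.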
Based on their coefficient plots, we can also conclude that the contribution of numbers of

Fitting partially linear models
As noted above, since the VCM is a generalization of many specific models, our proposed fitting algorithm and analysis should apply to them as well. We take partially linear models as an example and consider the following data set from the Cornell Lab of Ornithology, consisting of recorded observations of four species of vireos along with the location and surrounding terrain types. We apply a tree boosted VCM under a logistic regression setting using longitude, latitude and year as the action space and all remaining covariates as linear effects without varying coefficients, obtaining the model summarized in Table 5. The intercept plot suggests that the observed vireos favor cold climates and inland environments, while the slopes of the different territory types indicate a strong preference for the low elevations between deciduous forests and evergreen needles. The model can also be used to compare the baselines across different years in the past decade.

Shrinkage, selection and serialization
Tree boosted VCM is compatible with any alternative boosting strategy in place of the boosting steps (B3) and (B4), such as the use of subsampled trees (Friedman 2002; Zhou and Hooker 2018), univariate or bivariate trees (Lou et al. 2012; Hothorn et al. 2013) or adaptive shrinkage (dropout) (Rashmi and Gilad-Bachrach 2015; Rogozhnikov and Likhomanenko 2017; Zhou and Hooker 2018). These alternative approaches have been empirically shown to help avoid overfitting or to provide more model interpretability. We also anticipate that the corresponding varying coefficient mappings would inherit certain theoretical properties. For instance, Zhou and Hooker (2018) introduced a regularized boosting framework named Boulevard that guarantees finite sample convergence and asymptotic normality of its predictions. Incorporating Boulevard into our tree boosted VCM framework requires changing (B3) and (B4) as follows:
(B3*) For each j, find a good fit t_{b+1}^j ∈ T : A → R on (z_i, Δ_j β_i) for i ∈ w ⊂ {1, . . . , n}, a random subsample.
(B4*) Update β̂_b with learning rate λ < 1, truncating each tree by Γ_M, where Γ_M truncates the absolute value at some M > 0.
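One Boulevard-style iteration under (B3*) and (B4*) can be sketched as below. This is a hedged illustration, assuming Boulevard's averaged update form; `fit_tree` and every other name are hypothetical placeholders for an actual tree learner:

```python
import numpy as np

def boulevard_step(beta_hat, b, z, pseudo_grads, fit_tree,
                   lam=0.5, M=10.0, subsample=0.5, rng=None):
    """Sketch of one Boulevard-style boosting iteration for tree boosted VCM.

    beta_hat     : (n, p+1) current coefficient estimates after b iterations
    b            : iteration counter (used by the averaged update)
    z            : (n, d) action covariates
    pseudo_grads : (n, p+1) pseudo-gradients, one column per coefficient
    fit_tree     : assumed callable fit_tree(z_sub, r_sub) -> predictor on z
    """
    rng = np.random.default_rng(rng)
    n, p1 = pseudo_grads.shape
    # (B3*): fit each coordinate's tree on a random subsample w.
    w = rng.choice(n, size=max(1, int(subsample * n)), replace=False)
    new_beta = np.empty_like(beta_hat)
    for j in range(p1):
        t_j = fit_tree(z[w], pseudo_grads[w, j])
        # Gamma_M truncation of the tree output at level M.
        pred = np.clip(t_j(z), -M, M)
        # (B4*): averaged update with learning rate lam < 1.
        new_beta[:, j] = (b * beta_hat[:, j] + lam * pred) / (b + 1)
    return new_beta
```

The averaging over iterations (rather than pure summation) is what drives Boulevard's fixed-point behavior; the subsample and truncation are the ingredients behind its asymptotic normality guarantees.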
By taking the same approach as in the original paper, we can show that boosting VCM with Boulevard will also yield finite sample convergence to a fixed point. Boulevard modifies the standard boosting strategy to the extent that new theoretical results have to be developed specifically for it. In contrast, there are other boosting variations that fall directly under the theoretical umbrella of tree boosted VCM. Our discussion so far assumes that each boosting iteration builds a distinct tree for each coefficient dimension, with these trees constructed simultaneously from the same batch of pseudo-residuals. Although a single decision tree with multidimensional response could produce outputs for all dimensions, as long as we build separate trees sequentially, the question arises whether we should update the pseudo-residuals on the fly and, if so, in what order.
One advantage of this approach is that it reduces the boosting iteration from (1 + p) trees down to one tree, allowing us to use a much larger learning rate λ ≤ 1/2 instead of λ ≤ (1 + p)^{-1} without changing the arguments we used to establish consistency. We also anticipate that doing so in practice moderately reduces the cost, as the gradients become more accurate for each tree. Here we consider two approaches to conducting the on-the-fly updates.
The first approach is to introduce a coefficient selection rule that decides which coefficient to update at each boosting iteration. For instance, Hothorn et al. (2013) proposed component-wise linear least squares for boosting, where the β to update is selected by the stepwise optimal strategy: choose j_b and update the component tree that reduces the empirical risk the most, where e_j is the (p + 1)-vector with j-th element 1 and the rest 0. As a result, (B4) in our algorithm now updates β̂ along that single coordinate. Notice that finding this optimum still requires a comparison among all dimensions and therefore does not save computational cost when there are no better means or prior knowledge to detect which dimension stands out. That being said, the optimal move is compatible with the key condition (C4) we posed to ensure consistency: it still guarantees that the population counterpart of boosting is efficient in reducing the gap between the estimate and the truth. However, this greedy strategy also complicates the pattern of the sequence in which the β's get updated. We can also consider serialization, meaning the β's are updated in a predetermined order. An example is given in Lou et al. (2012, 2013), where trees were used for univariate functions of each feature in a GAM model, later used in Tan et al. (2018) for model distillation. Their models can either be built through backfitting, which eventually produces one additive component for each covariate, or through boosting, which generates a sequence of additive terms.
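The stepwise-optimal selection can be sketched as follows, assuming squared error loss and a generic `fit_tree` callable; both names, and the constant-predictor trees used in practice, are illustrative assumptions:

```python
import numpy as np

def greedy_coordinate_step(X1, beta_hat, y, z, pseudo_grads, fit_tree, lam=0.5):
    """Sketch of one component-wise (stepwise-optimal) boosting step:
    fit a candidate tree for every coordinate, keep only the one whose
    update reduces empirical squared risk the most.

    X1           : (n, p+1) design matrix with a leading column of ones
    beta_hat     : (n, p+1) current coefficient estimates
    pseudo_grads : (n, p+1) pseudo-gradients, one column per coefficient
    fit_tree     : assumed callable fit_tree(z, r) -> predictor on z
    """
    n, p1 = X1.shape
    best_j, best_risk, best_pred = None, np.inf, None
    for j in range(p1):
        t_j = fit_tree(z, pseudo_grads[:, j])
        candidate = beta_hat.copy()
        candidate[:, j] += lam * t_j(z)  # move along e_j only
        risk = np.mean((y - np.sum(X1 * candidate, axis=1)) ** 2)
        if risk < best_risk:
            best_j, best_risk, best_pred = j, risk, t_j(z)
    # Apply only the winning coordinate's update.
    beta_hat = beta_hat.copy()
    beta_hat[:, best_j] += lam * best_pred
    return beta_hat, best_j
```

As the text notes, all (p + 1) candidate trees must still be fit to find the winner, so this saves model size per iteration but not fitting cost.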
Applying this rotation of coordinates to tree boosted VCM, we can break each of the original boosting iterations into (p + 1) micro steps, updating β̂ one coordinate at a time with j rotating through 0, . . . , p. This procedure immediately updates the pseudo-residuals after each component tree is built.
However, this procedure is not compatible with our consistency conclusion, as the serialized boosting fails to guarantee (C4): each micro boosting step on a single coordinate relies on the current pseudo-gradients instead of the gradients before the entire rotation. One solution to this theoretical issue is to replace the predetermined updating sequence by randomly and uniformly proposing the coordinate to boost, i.e., updating β̂ along coordinate j with j ∼ Unif{0, . . . , p}. This stochastic sequence resolves the compatibility issue by satisfying (C4) with probability bounded from below.
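The random-coordinate alternative amounts to a single micro step per iteration, sketched below with hypothetical names (`fit_tree` again stands in for any tree learner):

```python
import numpy as np

def random_coordinate_step(beta_hat, z, pseudo_grads, fit_tree, lam=0.5, rng=None):
    """Sketch of one stochastic micro boosting step: pick one coordinate j
    uniformly from {0, ..., p} and boost only that coordinate, so the
    population condition (C4) holds with probability bounded from below.
    """
    rng = np.random.default_rng(rng)
    p1 = pseudo_grads.shape[1]
    j = int(rng.integers(p1))  # j ~ Unif{0, ..., p}
    t_j = fit_tree(z, pseudo_grads[:, j])
    beta_hat = beta_hat.copy()
    beta_hat[:, j] += lam * t_j(z)
    return beta_hat, j
```

Unlike the greedy rule, only one tree is fit per micro step, and the uniform proposal is what restores compatibility with the consistency argument.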

Conclusion
In this paper we studied the combination of gradient boosted decision trees and varying coefficient models. We introduced a gradient descent-based algorithm for fitting tree boosted VCM under a generalized linear model setting. We proved the consistency of its coefficient estimators with an early stopping strategy and demonstrated its performance through our empirical study. As a result, we are confident in adding gradient boosted decision trees to the collection of apt varying coefficient mappings.
We also discussed the model interpretability of tree boosted VCM, especially its parametric part as a self-explanatory component of reasoning about the covariates through the linear structure. Consistency guarantees the effectiveness of the interpretation.
There are a few future directions for this research. The performance of tree boosted VCM depends on the boosting scheme as well as the component decision trees, so it is worth analyzing the discrepancies between different approaches to building the boosted tree ensemble. Finally, our discussion has concentrated on the general form of the proposed tree boosted VCM. As mentioned, several regressors and additive models can be considered as its special cases, hence we expect model-specific conclusions and training algorithms.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A.1 Proof of Lemma 1
Proof ξ_{n,6} is simply the CLT. For the rest, we conclude that the corresponding function classes are P-Donsker. The collection of indicators of hyperrectangles (−∞, a_1] × ⋯ × (−∞, a_p] ⊆ R^p is Donsker. By taking differences at most p times we get all elements of H; therefore G, the indicators of R, is Donsker. Thus ξ_{n,1} = O_p(n^{−1/2}). The class of basis functions E = {1, x_j, j = 1, . . . , p} is Donsker since all elements are monotonic and bounded, as x ∈ [−1, 1]^p. So G = H × E is Donsker, which gives ξ_{n,2} = O_p(n^{−1/2}) and ξ_{n,4} = O_p(n^{−1/2}).
In addition, for fixed f, the class f G is therefore Donsker, which gives ξ_{n,3} = O_p(n^{−1/2}).

A.2 Proof of Theorem 1
To supplement our discussion of norms, it is immediate that ||g_{R,j}|| ≤ ||h_R|| ≤ 1.
Another key relation is ||g_{R,j}||_{n,1} ≤ ||h_R||_{n,1} = ||h_R||_n^2. We also assume that all R's satisfy the terminal node condition.
Proof Consider the p + 1 trees used for one boosting iteration, with their terminal nodes denoted by R. The same argument can be applied to the population version, hence we get the second part.
Lemma 3 For any b ≥ 0, with r_b as defined in (7), Proof As implied by (6), for (x, z) such that z ∈ R, the key observation follows. Recursively we can conclude that sup_{x,z} |r_b| ≤ 2^b sup_{x,z} |r_0|.

Lemma 4 Under conditions (C1)-(C6),
Proof Recall that, since for each fixed j all R_i^j are disjoint, we may define the corresponding quantities accordingly. To bound γ_b we may, without loss of generality, consider a single term. Per Lemma 3, we have ⟨1_n, g_b⟩_n ≤ ξ_n and, by iteratively applying Lemma 3 with C_0 = max(sup_{x,z} |f|, 1), the last term can be bounded as well. To bound |û|, note that this yields an upper bound. Denote by ĥ the global minimizer of the ensemble. We mention the elementary result that for a series {x_n} satisfying the stated recursion, the partial sums satisfy the stated bound; hence we can verify the upper bound above, which is equivalent to the claim. Combining all of the above completes the proof.

Lemma 5 Under conditions (C1)–(C6), for any ρ > 0 there exist B_0 = B_0(ρ) and n_0 = n_0(ρ) such that the bound holds for all n > n_0.

Proof Lemma 3 in Bühlmann (2002) proves this statement for rectangular indicators. By fixing λ = (1 + p)^{−1} and introducing conditions (C3) and (C4), formula (11) in Bühlmann (2002) still holds in terms of the single terminal node in each of the trees that corresponds to our defined R*. Therefore the cited Lemma 3 holds for our boosted trees. The conclusion is then reached by the assumption that f ∈ span(G).

Proof of main Theorem
For a given ρ > 0, since r̂ satisfies the preceding bounds, we reach the conclusion by sending ρ → 0.
where t_0 = ∫_{B(0,s)} x_1^2 dx = O(s^p), which is equivalent to a contradiction of Theorem 1.

Table 6 reports the details of the datasets and parameters used for the real world experiments. All experiments build 100 trees and apply 10-fold cross validation to measure the metrics and their uncertainties. We show, respectively, the sample size, the number of effect modifiers, the number of covariates, the learning rate λ, the maximal allowed tree depth (Max Depth) and the minimal node size allowed for further splitting (Min Split). The codebase, along with all empirical study code and data, can be found at https://github.com/siriuz42/treeboostVCM

C Additional empirical study

C.1 Identifying signals
Our theory suggests that tree boosted VCM is capable of identifying local linear structures and their coefficients accurately. To demonstrate this in practice, we apply it to the following regression problem with higher order feature interactions on the action space: z = (z_1, z_2, z_3, z_4), z_1, z_2 ∼ Unif{1, . . . , 10}, z_3, z_4 ∼ Unif[0, 1].
The data generating process is described by the following pseudo code.
if z_1 < 4:               y = 1 + 3 x_1 + 7 x_2
else if z_1 > 8:          y = −5 + 2 x_1 + 4 x_2 + 6 x_3
else if z_2 = 1, 3 or 5:  y = 5 + 5 x_2 + 5 x_3
else if z_3 < 0.5:        y = 10 + 10 x_4
else if z_4 < 0.4:        y = 10 + 10 x_5
else if z_3 < z_4:        y = 5 − 5 x_2 − 10 x_3
else:                     y = −10 x_1 + 10 x_3
We use a sample of size 10,000 and 100 trees of maximal depth 6 for boosting, with a constant learning rate of 0.2. Figure 6 plots the fitted distribution of each coefficient in red against the ground truth in grey, with reported MSE 3.28. We observe that all peaks and their intensities properly reflect the coefficient distributions on the action space. Despite the linear expressions, the interactions among all four action covariates under a tree depth of 6 have not yet achieved convergence, which we conclude to be the reason for the large MSE. The result manifests the effectiveness of our straightforward implementation of decision trees segmenting the action space.
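The generating process above can be sketched in Python as follows. The distribution of x_1, . . . , x_5 is an assumption here (taken as i.i.d. Unif[−1, 1], which the text does not restate), as is every function name:

```python
import numpy as np

def generate(n, rng=None):
    """Simulate the Appendix C.1 data generating process (sketch).
    Action covariates follow the text; the predictive covariates x are
    assumed i.i.d. Unif[-1, 1]."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(-1, 1, size=(n, 5))
    z1 = rng.integers(1, 11, size=n)        # Unif{1, ..., 10}
    z2 = rng.integers(1, 11, size=n)
    z3 = rng.uniform(0, 1, size=n)          # Unif[0, 1]
    z4 = rng.uniform(0, 1, size=n)
    y = np.empty(n)
    for i in range(n):
        x1, x2, x3, x4, x5 = x[i]
        if z1[i] < 4:
            y[i] = 1 + 3 * x1 + 7 * x2
        elif z1[i] > 8:
            y[i] = -5 + 2 * x1 + 4 * x2 + 6 * x3
        elif z2[i] in (1, 3, 5):
            y[i] = 5 + 5 * x2 + 5 * x3
        elif z3[i] < 0.5:
            y[i] = 10 + 10 * x4
        elif z4[i] < 0.4:
            y[i] = 10 + 10 * x5
        elif z3[i] < z4[i]:
            y[i] = 5 - 5 * x2 - 10 * x3
        else:
            y[i] = -10 * x1 + 10 * x3
    return x, (z1, z2, z3, z4), y
```

Each branch assigns a distinct local linear model, so a fitted VCM should recover piecewise-constant coefficient surfaces over the action space.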

C.2 Constrained Beijing housing model
Here we present the plots for the Beijing housing model fit without the non-negativity regularization on the regressors. The training error in this case is smaller (1.858024 after 200 trees, in contrast to 1.911139 after 200 trees for the regularized model), but its results are harder to interpret.

C.3 Beijing housing model by mboost
We present the plots of the Beijing housing dataset fit through mboost, which is a standard fitting method provided along with the dataset. Here the model assumes a generalized linear relationship between the covariates and the label, but instead of a fixed effect slope, generalized linear terms are created for each level of the covariates, so the model is indeed more complex than that in Fig. 7. Here we show the coefficients for bathroom, living room and elevator at levels equal to 1, floor at level 10 and refurbished at level 2.
The shapes of the coefficients here mostly agree with Fig. 7, complying with the overall outline of the city. For floor and elevator, the mboost fit seems to be more smoothed out.