# Transformation boosting machines


## Abstract

The broad class of conditional transformation models includes interpretable and simple as well as potentially very complex models for conditional distributions. This makes conditional transformation models attractive for predictive distribution modelling, especially because models featuring interpretable parameters and black-box machines can be understood as extremes in a whole cascade of models. So far, algorithms and corresponding theory were developed for special forms of conditional transformation models only: maximum likelihood inference is available for rather simple models, there exists a tailored boosting algorithm for the estimation of additive conditional transformation models, and a special form of random forests targets the estimation of interaction models. Here, I propose boosting algorithms capable of estimating conditional transformation models of arbitrary complexity, starting from simple shift transformation models featuring linear predictors to essentially unstructured conditional transformation models allowing complex nonlinear interaction functions. A generic form of the likelihood is maximized. Thus, the novel boosting algorithms for conditional transformation models are applicable to all types of univariate response variables, including randomly censored or truncated observations.

## Keywords

Transformation model, Distribution regression, Conditional distribution function, Conditional quantile function, Conditional hazard function, Probabilistic forecasting

## 1 Introduction

The future remains unknown, yet we have witnessed considerably improved predictions owing to advances in statistical and machine learning over the last two decades. Numerous procedures, such as support vector machines, random forests, and tree boosting, deliver accurate point predictions of conditional means. However, in many applications, a mean prediction is not good enough. Full predictive distributions, also known as probabilistic forecasts, are required in applications where an assessment of the associated uncertainty is essential, for example in models of future disease progression (Küffner et al. 2015), electricity demand (Cabrera and Schulz 2017), stock asset returns (Mitrodima and Griffin 2017), and counterfactual distributions (Chernozhukov et al. 2013). In these applications, the prediction “takes the form of a predictive probability distribution over future quantities or events of interest” (Gneiting and Katzfuss 2014). Here, I present a novel generic boosting approach to the estimation of full predictive distributions under mild assumptions.

Apart from completely model-free procedures (such as kernel smoothing, Li and Racine 2008), four main approaches to obtaining predictive distributions exist. (1) Flexible parametric models for conditional density functions rely on a strict parametric model of the response distribution whose parameters might be linked to predictor variables in complex ways, for example, in generalized additive models for location, scale, and shape (GAMLSS, Rigby and Stasinopoulos 2005) and in heteroscedastic Bayesian additive regression tree ensembles (Pratola et al. 2017). (2) Quantile regression models for conditional quantiles of interest can be specified in a linear or nonlinear additive form (Koenker 2005); more complex relationships can be estimated by quantile regression forests (Meinshausen 2006; Athey et al. 2019). (3) Distribution regression and transformation models potentially allow response-varying (or time-varying) effects (Foresi and Peracchi 1995; Rothe and Wied 2013; Chernozhukov et al. 2013; Wu and Tian 2013; Leorato and Peracchi 2015) in models for conditional distribution functions on the probit, logit, or complementary log–log scale. (4) Hazard regression (Kooperberg et al. 1995) aims at estimating conditional nonproportional hazard functions directly.

Boosting, and especially the statistical view on boosting (Friedman et al. 2000; Bühlmann and Hothorn 2007), have already proved helpful in these four different approaches. Mayr et al. (2012) developed boosting for GAMLSS, conditional quantile boosting was introduced by Fenske et al. (2011), and nonproportional hazard boosting was recently introduced by Lee and Chen (2018). Distribution regression is a special case of conditional transformation models (Hothorn et al. 2014). What is interesting about conditional transformation models is that very simple models, such as the linear proportional odds and hazards models, and essentially unstructured models for conditional distribution functions can be understood in a unified theoretical framework (Hothorn et al. 2018). The same level of generality is, however, lacking from an algorithmic perspective. The boosting algorithm introduced by Hothorn et al. (2014) is limited to additive models and explicitly excludes tree-based interaction models. Furthermore, the target function is approximate and applicable to responses observed without censoring or truncation only. The aim of this work is to establish a general computational framework that allows specification, estimation, evaluation, and comparison in a cascade of models starting with very simple linear models and ending with essentially unstructured models for conditional distribution functions for arbitrary response variables.

Section 2 gives a dense introduction to transformation models. An elaborate description and connections to well-established models can be found in Hothorn et al. (2014) and Hothorn et al. (2018). Sections 3 and 4 develop two boosting algorithms for complex and simple transformation models based on a generic form of the likelihood (technical details regarding the definition of the likelihood for all types of response variables, including random censoring and truncation, are discussed by Hothorn et al. 2018). Empirical evaluations are presented in Sect. 5.

## 2 Transformation models

Structured additive transformation functions that allow interactions between the two arguments \(y\) and \(\varvec{x}\) of the form \(h(y\mid \varvec{x}) = \sum _{j = 1}^J h_j(y\mid \varvec{x})\) lead to conditional transformation models (CTM, Hothorn et al. 2014). The *J* partial transformation functions \(h_j\) allow formulation of problem-specific effects of the predictors \(\varvec{x}\), such as linear, nonlinear, spatio-temporal, or other model terms. Distribution regression models featuring response-varying effects are an important special case of this model class. When \(\varvec{x}= (x_1, \dots , x_J) \in {\mathbb {R}}^J\), a distribution regression model is characterized by partial transformation functions \(h_j(y\mid x_j) = \beta _j(y) x_j\) and corresponding interpretable response-varying effects \(\beta _j: {\varXi }\rightarrow {\mathbb {R}}\). The analogon of an additive model features partial transformation functions \(h_j(y\mid x_j)\), *i.e.* bivariate smooth functions of both \(y\) and \(x_j\). These bivariate terms are more complex than the one-dimensional coefficient functions \(\beta _j(y)\) but can still be visualized and interpreted. If \(x_j\) is more complex, for example, if it describes a spatial location, \(h_j(y\mid x_j)\) might be a spatially smooth term that captures unexplained spatial heterogeneity (Hothorn et al. 2014).

Models with transformation function \(h(y\mid \varvec{x}) = \sum _{j = 1}^J h_j(y\mid x_j)\) and potential applications are discussed in Hothorn et al. (2014) and Hothorn et al. (2018). The standard estimation of maximizing the continuously ranked probability score over a discrete grid covering \({\varXi }\) (Foresi and Peracchi 1995; Chernozhukov et al. 2013; Hothorn et al. 2014), potentially with inverse probability of censoring weight adjustment for right censoring (Möst and Hothorn 2015; Garcia et al. 2019), does not allow essentially unstructured transformation functions \(h(y\mid \varvec{x})\), including higher-order interactions, and thus relaxation of the additivity assumption on the scale of the transformation function \(h\). Furthermore, it is computationally inefficient (because the data have to be expanded) and is unable to handle censoring or truncation directly.

This paper addresses these issues by introducing computationally efficient boosted likelihood estimation for unstructured or structured additive conditional transformation functions (Sect. 3) and shift transformation functions (Sect. 4) under all forms of random censoring and truncation for at least ordered responses based on potentially correlated observations.

## 3 Boosting the likelihood of conditional transformation models

In the following, the conditional transformation function \(h(y\mid \varvec{x}) = \varvec{a}(y)^\top \varvec{\vartheta }(\varvec{x})\) is parameterized in terms of basis functions \(\varvec{a}: {\varXi }\rightarrow {\mathbb {R}}^P\) of the response and a conditional parameter function \(\varvec{\vartheta }: \chi \rightarrow {\mathbb {R}}^P\); the latter function will be estimated.
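The parameterization \(h(y\mid \varvec{x}) = \varvec{a}(y)^\top \varvec{\vartheta }(\varvec{x})\) can be illustrated with a small sketch. It assumes a Bernstein basis for \(\varvec{a}\) (the basis reported for most applications in Sect. 5) and \(F_Z= {\varPhi }\); the support \([0, 1]\), the coefficient values, and all function names are hypothetical choices for illustration and are not taken from the reference implementation.

```python
import math

def bernstein_basis(y, order, lower, upper):
    """Evaluate the order + 1 Bernstein basis polynomials a(y) at y."""
    t = (y - lower) / (upper - lower)  # rescale the response to [0, 1]
    return [math.comb(order, k) * t**k * (1.0 - t)**(order - k)
            for k in range(order + 1)]

def std_normal_cdf(z):
    """Standard normal distribution function, used here as F_Z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def conditional_cdf(y, theta_x, lower, upper, F_Z=std_normal_cdf):
    """P(Y <= y | x) = F_Z(a(y)^T theta(x)) for a given value theta(x)."""
    a = bernstein_basis(y, len(theta_x) - 1, lower, upper)
    h = sum(a_k * th_k for a_k, th_k in zip(a, theta_x))
    return F_Z(h)

# Hypothetical value of theta(x) for one fixed x (order M = 6, so P = 7).
# Increasing coefficients give an increasing transformation function.
theta_x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
cdf_values = [conditional_cdf(y, theta_x, 0.0, 1.0) for y in (0.1, 0.5, 0.9)]
```

In this sketch the covariates enter only through the coefficient vector: a different \(\varvec{x}\) yields a different \(\varvec{\vartheta }(\varvec{x})\) and hence a different conditional distribution function, while the basis \(\varvec{a}\) stays fixed.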

### 3.1 Definition of the likelihood

*N* independent observations \((y_i, \varvec{x}_i), i = 1, \dots , N\) from \({\mathbb {P}}_{Y, \varvec{X}}\), empirical risk minimization with negative log-likelihood loss
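As a rough illustration of a negative log-likelihood loss for different observation types, the sketch below evaluates log-likelihood contributions for \(F_Z= {\varPhi }\). It assumes the standard forms discussed by Hothorn et al. (2018): the log-density \(\log f_Z(h) + \log h'\) for an exactly observed continuous response, and \(\log (F_Z(h_{\text {upper}}) - F_Z(h_{\text {lower}}))\) for an interval-censored one. The function names are hypothetical, and \(h\) and \(h'\) stand for \(\varvec{a}(y)^\top \varvec{\vartheta }(\varvec{x})\) and its derivative with respect to \(y\).

```python
import math

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def loglik_exact(h, h_prime):
    """Contribution of an exactly observed continuous response:
    log f_Z(h(y | x)) + log h'(y | x), with h' > 0."""
    return math.log(std_normal_pdf(h)) + math.log(h_prime)

def loglik_interval(h_lower, h_upper):
    """Contribution of a response known to lie in (y_lower, y_upper]:
    log(F_Z(h(y_upper | x)) - F_Z(h(y_lower | x))).
    Use -inf / +inf for left- or right-censored observations."""
    return math.log(std_normal_cdf(h_upper) - std_normal_cdf(h_lower))

def empirical_risk(contributions):
    """Negative log-likelihood summed over all observations."""
    return -sum(contributions)

# One exact, one right-censored, and one interval-censored observation,
# each with made-up values of the transformation function.
contributions = [
    loglik_exact(h=0.0, h_prime=1.0),
    loglik_interval(h_lower=0.0, h_upper=float("inf")),
    loglik_interval(h_lower=-1.0, h_upper=1.0),
]
risk = empirical_risk(contributions)
```

Because every observation type reduces to evaluating \(F_Z\) or \(f_Z\) at transformed values, the same risk function covers censored and uncensored data without reweighting.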

### 3.2 Boosting the likelihood

*j*(*b*) returns the index of the basis function \(\varvec{b}_j\) which was selected in the *b*th iteration of the algorithm. Each basis may be equipped with an explicit penalty function \(\text {Pen}_j\). The corresponding penalty parameter \(\lambda _j\) is chosen such that the degrees of freedom are the same for all *J* basis functions to facilitate unbiased model selection (Hofner et al. 2011). The number of terms *B*, selected basis functions *j*(*b*), and corresponding coefficient matrices \({\varvec{\Gamma }}^{[b]} \in {\mathbb {R}}^{P\times P_{j(b)}}\) are unknowns and are estimated from data. The basis functions \(\varvec{b}_j\) may feature unknown parameters. With relatively deep regression trees \(\varvec{b}_j\) (where the tree structure is estimated from the data in every boosting iteration and \({\varvec{\Gamma }}\) are the parameters in each terminal node), model (3) is the sum of *B* trees and as such is potentially highly unstructured. Similar to GAMLSS-boosting (Mayr et al. 2012), a parameter vector \(\varvec{\vartheta }\) is modelled instead of a scalar predictor function. The main difference is that all dimensions of the parameter vector \(\varvec{\vartheta }\) are updated simultaneously whereas each dimension is assigned its own predictor function in GAMLSS-boosting.

*N*. The number of boosting iterations \(b_\text {stop}\) is a tuning parameter that has to be chosen by the out-of-sample log-likelihood for a validation sample \(i = N + 1, \dots , N + {\tilde{N}}\)
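Choosing \(b_\text {stop}\) by the out-of-sample log-likelihood can be sketched as follows. The sketch assumes that the validation log-likelihood has already been recorded after each boosting iteration; the function name and the numbers are hypothetical.

```python
def select_b_stop(validation_loglik):
    """Return the (1-based) iteration with the largest validation
    log-likelihood; ties are resolved in favour of the earliest one."""
    best = max(range(len(validation_loglik)),
               key=lambda b: validation_loglik[b])
    return best + 1

# Made-up validation log-likelihoods after iterations 1, ..., 6:
# the fit improves at first and then starts to overfit.
path = [-120.4, -98.7, -91.2, -90.5, -90.9, -92.3]
b_stop = select_b_stop(path)
```

Preferring the earliest maximizer is a design choice of this sketch that favours the smaller model when two iterations perform equally well on the validation sample.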

An additional advantage of this algorithm over boosted continuously ranked probability scores (“CTM-CRPS-boosting”, Hothorn et al. 2014) is that computations of tensor products in \(\varvec{a}(y)^\top \otimes \varvec{b}_j(\varvec{x})^\top \text {vec}({\varvec{\Gamma }}) = \varvec{a}(y)^\top {\varvec{\Gamma }}\varvec{b}_j(\varvec{x})\) are never explicitly required because the linear array model formulation (*i.e.* the right-hand side of the equation, see Currie et al. 2006, formula 2.5) is implemented by Algorithm 1. This allows estimation of potentially highly unstructured models by choosing relatively deep multivariate regression trees as basis functions \(\varvec{b}\). Moreover, the algorithm does not require expansion of the data set (to size \(N^2\), in the worst case).
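The equivalence of the tensor-product form and the array form can be checked numerically with a small sketch. It assumes that \(\text {vec}\) stacks \({\varvec{\Gamma }}\) row by row, which is the ordering under which \(\varvec{a}^\top \otimes \varvec{b}^\top \) lines up with the entries of \({\varvec{\Gamma }}\); the helper names and the numbers are hypothetical.

```python
def kron_row(a, b):
    """Kronecker product of two row vectors, a^T (x) b^T."""
    return [a_p * b_q for a_p in a for b_q in b]

def vec_rowwise(G):
    """Stack the rows of the P x P_j matrix G into one long vector."""
    return [g for row in G for g in row]

def tensor_form(a, G, b):
    """Left-hand side: builds the length P * P_j Kronecker vector."""
    return sum(k * g for k, g in zip(kron_row(a, b), vec_rowwise(G)))

def array_form(a, G, b):
    """Right-hand side a^T G b: no Kronecker product is formed."""
    Gb = [sum(g * b_q for g, b_q in zip(row, b)) for row in G]
    return sum(a_p * v for a_p, v in zip(a, Gb))

a = [1, 2]                   # basis of the response, P = 2
b = [1, 0, -1]               # basis of the predictors, P_j = 3
G = [[1, 2, 3], [4, 5, 6]]   # coefficient matrix, P x P_j

lhs = tensor_form(a, G, b)
rhs = array_form(a, G, b)
```

Both forms give the same number, but the array form only needs one matrix-vector product followed by an inner product, so the Kronecker vector of length \(P \times P_j\) is never stored.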

### 3.3 Model interpretation

## 4 Boosting the likelihood of shift transformation models

*i.e.* with \(\varvec{\gamma }^\top = {\varvec{\Gamma }}\in {\mathbb {R}}^{1 \times P_j}\). In contrast to conditional transformation models, the model term \(\beta (\varvec{x})\) does not depend on \(y\) and thus shift transformation models are easier to interpret. \(L_2\) boosting in this setup is performed based on log-likelihood contributions \(\ell _i(\varvec{\vartheta }, \beta (\varvec{x}_i))\) for \(\ell _i: {\mathbb {R}}^{P+ 1} \rightarrow {\mathbb {R}}\) from densities

## 5 Empirical evaluation

Table 1 Applications. Eight prediction problems with continuous, ordered categorical (number of categories in parentheses), or right-censored response (per cent censored in parentheses)

Application | Source | \(N\) | \(Y\) | \(\varvec{X}\) | \(F_Z\) |
---|---|---|---|---|---|
Beetle Extinction Risk | Seibold et al. (2015) | 1025 | Ordered (6) | (7, 3) | expit |
Birth Weight Prediction | Schild et al. (2008) | 150 | Continuous | (5, 0) | \({\varPhi }\) |
Body Fat Mass | Garcia et al. (2005) | 71 | Continuous | (9, 0) | \({\varPhi }\) |
CAO/ARO/AIO-04 DFS | Rödel et al. (2015) | 1153 | Survival (\(71\%\)) | (3, 15) | MEV |
Childhood Malnutrition | Fenske et al. (2011) | 24166 | Continuous | (6, 14) | \({\varPhi }\) |
Head Circumference | Fredriks et al. (2000) | 7040 | Continuous | (1, 0) | \({\varPhi }\) |
PRO-ACT ALSFRS | Küffner et al. (2015) | 1013 | Ordered (50) | (43, 16) | expit |
PRO-ACT OS | Seibold et al. (2017) | 2711 | Survival (\(69\%\)) | (3, 16) | MEV |

Table 2 Applications. Median out-of-sample log-likelihoods (centered by out-of-sample log-likelihoods of the corresponding unconditional model) from 100 subsampled divisions into learning and test samples. Columns give the basis \(\varvec{b}(\varvec{x})\) (N, L, or T) for conditional transformation models \(\varvec{\vartheta }(\varvec{x})\) and shift transformation models \(\beta (\varvec{x})\), followed by transformation trees and transformation forests

Application | \(\varvec{\vartheta }(\varvec{x})\): N | \(\varvec{\vartheta }(\varvec{x})\): L | \(\varvec{\vartheta }(\varvec{x})\): T | \(\beta (\varvec{x})\): N | \(\beta (\varvec{x})\): L | \(\beta (\varvec{x})\): T | Transformation Tree | Transformation Forest |
---|---|---|---|---|---|---|---|---|
Beetle Extinction Risk | 0.27 (0.15, 0.28) | 0.26 (0.16, 0.28) | 0.18 (0.11, 0.29) | 0.30 (0.29, 0.32) | 0.29 (0.27, 0.31) | (0.30, 0.35) | 0.03 (\(-0.13\), 0.12) | 0.30 (0.28, 0.33) |
Birth Weight Prediction | 1.47 (1.31, 1.55) | 1.43 (1.19, 1.51) | 1.25 (1.13, 1.30) | (1.32, 1.57) | 1.50 (1.31, 1.56) | 1.37 (1.28, 1.46) | 0.39 (\(-0.17\), 0.69) | 1.32 (1.21, 1.37) |
Body Fat Mass | (1.05, 1.27) | 1.08 (0.96, 1.20) | 0.76 (0.58, 0.87) | 0.12 (0.11, 0.14) | 0.12 (0.11, 0.14) | 0.24 (0.22, 0.27) | 0.08 (\(-0.61\), 0.49) | 1.03 (0.95, 1.09) |
CAO/ARO/AIO-04 DFS | 0.00 (\(-0.00\), 0.01) | 0.01 (\(-0.00\), 0.01) | 0.00 (\(-0.00\), 0.00) | 0.02 (0.01, 0.03) | (0.01, 0.03) | 0.01 (0.01, 0.02) | \(-0.00\) (\(-0.01\), 0.00) | \(-0.01\) (\(-0.02\), \(-0.00\)) |
Childhood Malnutrition | 0.05 (0.05, 0.05) | 0.05 (0.04, 0.05) | (0.14, 0.15) | 0.13 (0.13, 0.14) | 0.12 (0.11, 0.12) | 0.14 (0.13, 0.14) | 0.12 (0.12, 0.13) | 0.15 (0.14, 0.15) |
Head Circumference | (1.08, 1.11) | 0.96 (0.94, 0.97) | 1.09 (1.07, 1.10) | 1.01 (1.00, 1.02) | 0.96 (0.95, 0.97) | 1.08 (1.07, 1.10) | 1.06 (1.04, 1.08) | 0.99 (0.92, 1.02) |
PRO-ACT ALSFRS | 0.52 (0.50, 0.55) | (0.49, 0.55) | 0.48 (0.46, 0.50) | 0.51 (0.47, 0.54) | 0.49 (0.46, 0.52) | 0.49 (0.46, 0.53) | 0.38 (0.33, 0.43) | 0.32 (0.30, 0.33) |
PRO-ACT OS | 0.04 (0.03, 0.05) | 0.04 (0.03, 0.05) | 0.04 (0.02, 0.05) | 0.05 (0.04, 0.06) | 0.05 (0.04, 0.06) | (0.05, 0.06) | 0.02 (\(-0.02\), 0.04) | 0.04 (\(-0.01\), 0.05) |

### 5.1 Applications

Eight life science applications in which estimation of a predictive distribution is of special interest are listed in Table 1. Four applications are described by a continuous response, two feature an ordered categorical response, and two feature a right-censored response. Except for the Beetle Extinction Risk application, which requires a discrete basis \(\varvec{a}\), a Bernstein basis \(\varvec{a}\) of order \(M = 6\) (for technical details see Hothorn et al. 2018) was used to parameterize the transformation functions. Conditional transformation models (Algorithm 1) with nonlinear (N, using B-splines), linear (L), and tree-based (T, of depth two and thus allowing only two-way interactions) basis functions \(\varvec{b}\) as well as shift transformation models (Algorithm 2) using the same bases were evaluated. The performance of these boosted transformation models was compared to the performance of transformation trees and transformation forests (Hothorn and Zeileis 2017). The latter two procedures estimate conditional transformation models of the form \(F_Z(\varvec{a}(y)^\top \varvec{\vartheta }(\varvec{x}))\), where \(\varvec{\vartheta }(\varvec{x})\) is obtained either from a single tree (transformation trees) or from a nonlinear interaction function (transformation forest). I hypothesized a priori that transformation trees should perform worst across all applications because this method corresponds to the most simple (but easily interpretable) model. Also, I expected transformation forests to outperform transformation trees and to perform only slightly worse than the best performing boosting procedure because of the high adaptivity of the underlying random forest procedure. My motivation for this experiment was the hope that I would be able to find a simple and interpretable transformation model that outperforms the most complex transformation forests by means of either Algorithm 1 or 2.

*n* and a validation sample \(i = n + 1, \dots , {\tilde{n}}\), the centered out-of-sample log-likelihood is given by
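The idea of centering can be sketched in a few lines. The sketch assumes that per-observation log-likelihoods on the validation sample are available for both the fitted model and the corresponding unconditional model, and that centering means subtracting the two sums; whether the difference is additionally averaged over the validation sample is not specified here and is left out. All names and numbers are hypothetical.

```python
def centered_oos_loglik(model_ll, unconditional_ll):
    """Out-of-sample log-likelihood of a model minus that of the
    unconditional model, both evaluated on the same validation sample."""
    if len(model_ll) != len(unconditional_ll):
        raise ValueError("both models must be evaluated on the same sample")
    return sum(model_ll) - sum(unconditional_ll)

# Made-up per-observation log-likelihoods for three validation cases.
model_ll = [-1.0, -0.5, -2.0]
unconditional_ll = [-1.5, -1.0, -2.0]
centered = centered_oos_loglik(model_ll, unconditional_ll)
```

Under this convention, a positive value indicates that the predictors improve on the unconditional model, and a value near zero indicates that they carry little information about the response distribution.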

The results presented in Table 2 demonstrate that the best-performing method was always a boosted transformation model. Transformation forests performed only slightly worse than the top model for the Beetle Extinction Risk, Birth Weight Prediction, Body Fat Mass, and Childhood Malnutrition applications. In the remaining four applications, the best boosting procedure outperformed transformation forests substantially. Nonlinear conditional transformation models (N \(\varvec{\vartheta }(\varvec{x})\)) performed best twice, as did tree-based shift transformation models (T \(\beta (\varvec{x})\)). Each of the remaining models ranked at the top once. Transformation trees outperformed transformation forests for two applications (Head Circumference and PRO-ACT ALSFRS) but never performed better than any of the boosted transformation models.

Graphical representations of the distributions of out-of-sample log-likelihoods along with the exact model and algorithm specification and corresponding software implementation are presented for all eight applications in Hothorn (2019).

### 5.2 Artificial data-generating processes

*x*, and a corresponding linear interaction effect. The linear main and interaction effects of *x* are replaced by nonlinear effects (a scaled \(\sin \) function) of *x* in the shift transformation model “Nonlinear \(\beta (\varvec{x})\)”. The extension to response-varying main and interaction effects defines the distribution regression model “Linear \(\varvec{\vartheta }(\varvec{x})\)” and the conditional transformation model “Nonlinear \(\varvec{\vartheta }(\varvec{x})\)”. The coefficients of the terms introduced in Table 3 are given in Table 4. Details of the implementation of these DGPs are explained in Hothorn (2019). The conditional densities associated with the four DGPs are shown in Fig. 1.

Table 3 Artificial data-generating processes (DGPs). Description of four simulation models

DGP | \({\varPhi }^{-1}({\mathbb {P}}(Y\le y\mid \text {Group 1}, x))\) | \({\varPhi }^{-1}({\mathbb {P}}(Y\le y\mid \text {Group 2}, x))\) |
---|---|---|
Linear \(\beta (\varvec{x})\) | \(h_Y(y) - 2 x\) | \(h_Y(y) + 2 - x\) |
Nonlinear \(\beta (\varvec{x})\) | \(h_Y(y) - 2 g(x)\) | \(h_Y(y) + 2 - g(x)\) |
Linear \(\varvec{\vartheta }(\varvec{x})\) | \(h_Y(y) - \beta _1(y) - \beta _2(y) x\) | \(h_Y(y) + \beta _1(y) - (\beta _2(y) + \beta _3(y)) x\) |
Nonlinear \(\varvec{\vartheta }(\varvec{x})\) | \(h_Y(y) - \beta _1(y) - \beta _2(y) g(x)\) | \(h_Y(y) + \beta _1(y) - (\beta _2(y) + \beta _3(y)) g(x)\) |

Table 4 Artificial data-generating processes (DGPs). Coefficients of baseline transformation \(h_Y\) and response-varying effects \(\beta _1, \beta _2\), and \(\beta _3\)

Function | \(\vartheta _1\) | \(\vartheta _2\) | \(\vartheta _3\) | \(\vartheta _4\) | \(\vartheta _5\) | \(\vartheta _6\) | \(\vartheta _7\) |
---|---|---|---|---|---|---|---|
\(h_Y(y)\) | \(-4.000\) | \(-0.601\) | 1.065 | 1.000 | 2.667 | 4.333 | 6.000 |
\(\beta _1(y)\) | 0.000 | \(-0.518\) | \(-1.000\) | \(-1.414\) | \(-1.732\) | \(-1.932\) | \(-2.000\) |
\(\beta _2(y)\) | 0.000 | 0.816 | 1.155 | 1.414 | 1.633 | 1.826 | 2.000 |
\(\beta _3(y)\) | 0.000 | \(-0.259\) | \(-0.500\) | \(-0.707\) | \(-0.866\) | \(-0.966\) | \(-1.000\) |

In Part A of this simulation, nonlinear (N, using B-splines), linear (L), and tree-based (T, of depth six, which allows higher-order interactions) basis functions \(\varvec{b}\) for shift transformation models, *i.e.* models for \(\beta (\varvec{x})\), and for conditional transformation models, *i.e.* models for \(\varvec{\vartheta }(\varvec{x})\), were evaluated for sample sizes \(N = 75, 150, 300\) under correctly specified models; this means that models were fitted using the correct distribution function \(F_Z= {\varPhi }\), the correct order \(M = 6\) of \(\varvec{a}\), no uninformative predictor variables, and the correct basis functions. Both the linear and nonlinear models were fitted with basis functions representing a main effect of group, a main effect of *x*, and a corresponding interaction effect, whereas trees had to learn this structure from the data. In Part B, these models were evaluated under model misspecification, *i.e.* using the incorrect distribution function \(F_Z= \text {expit}\) (standard logistic distribution) or \(F_Z= \text {MEV}\) (standard minimum extreme value distribution), a too large dimension of the Bernstein basis \(\varvec{a}\) (\(M = 12\)), or \(J^+ = 5, 25\) additional uninformative uniform predictor variables. The same “correct” basis functions as in Part A were used in Part B. I hypothesized a priori that models exactly matching the DGP would perform best and that tree-based boosting would outperform boosting with linear basis functions in nonlinear problems. Under misspecification, I expected the performance of all models to decrease, but this general ranking to persist.

The results for Part A presented in the top three rows of Table 5 show that the model corresponding to the underlying DGP was associated with the largest median out-of-sample log-likelihood. For linear DGPs, the performance of boosted models with nonlinear basis functions was only slightly inferior to the performance of boosted models with linear basis functions, while tree-based boosting performed substantially worse in this situation. By contrast, the signal in nonlinear DGPs was captured relatively well by tree-based boosting, whereas linear basis functions were not able to recover this signal. This shows that tree-based boosting was able to adapt to the underlying nonlinear interaction signal in the two nonlinear simulation models “Nonlinear \(\beta (\varvec{x})\)” and “Nonlinear \(\varvec{\vartheta }(\varvec{x})\)”.

The out-of-sample log-likelihoods for misspecified models presented in Table 5, Part B for \(F_Z= {\varPhi }\), follow this general pattern in that the model corresponding to the DGP performed best and tree-based boosting outperformed boosting with linear basis functions on nonlinear problems. In only two cases, which were characterized by small samples, did a linear model for \(\varvec{\vartheta }(\varvec{x})\) outperform a true linear model for \(\beta (\varvec{x})\) or vice versa. More frequently, the too complex nonlinear model for \(\varvec{\vartheta }(\varvec{x})\) outperformed the nonlinear model for \(\beta (\varvec{x})\) slightly. Overall, Algorithms 1 and 2 seemed to be robust against overly complex basis functions \(\varvec{a}\) and additional noninformative predictor variables.

This was also true under a misspecified distribution function \(F_Z= \text {expit}\) for linear shift transformation models “Linear \(\beta (\varvec{x})\)”. More severe deviations occurred when an incorrect \(F_Z= \text {expit}\) was used for model specification in Algorithms 1 and 2 under the nonlinear shift transformation model “Nonlinear \(\beta (\varvec{x})\)”, distribution regression model “Linear \(\varvec{\vartheta }(\varvec{x})\)”, and conditional transformation model “Nonlinear \(\varvec{\vartheta }(\varvec{x})\)”. The absolute differences in the corresponding out-of-sample log-likelihoods were, however, marginal in most of these cases.

When the asymmetric standard minimum extreme value distribution was used (\(F_Z= \text {MEV}\)), the distortions were more pronounced. The general pattern observed for \(F_Z= \text {expit}\) was the same, but the centered out-of-sample log-likelihoods were in general smaller in this setup. Visualizations of the distributions underlying the figures in Table 5 are presented in Hothorn (2019).

## 6 Discussion

Models defined in terms of simple linear transformation functions up to models featuring unstructured complex transformation functions can be specified, estimated, evaluated, and compared in the unified computational framework of Algorithms 1 and 2. Data analysts are no longer limited in their freedom to define and estimate transformation models, because the strong ties between models of a certain complexity and a tailored estimation procedure (such as CTM-CRPS-boosting for additive or transformation forests for interaction models) can be cut with the boosting algorithms presented here.

*B* deep trees. The log-likelihood risk function employed here, which is also able to handle time-varying covariates through appropriate truncation, avoids the technical obstacles reported by Lee and Chen (2018) when defining an appropriate nonparametric risk function for boosting in a class of models for conditional log-hazard functions.

Artificial data-generating processes (DGPs). Median out-of-sample log-likelihoods (centered by out-of-sample log-likelihoods of the true model) based on 100 simulation runs of varying sample sizes \(N = 75, 150, 300\), number of additional noninformative predictor variables \(J^+ = 0, 5, 25\), choices of \(F_Z\) (\({\varPhi }= \) standard normal, \(\text {expit} = \) standard logistic, and \(\text {MEV} = \) standard minimum extreme value), and choices of the Bernstein polynomial order \(M = 6, 12\) for four different DGPs and Boosting CTM Likelihoods (parameter \(\varvec{\vartheta }(\varvec{x})\)) with nonlinear (N), linear (L), and tree-based (T, depth two) basis functions \(\varvec{b}\) as well as Boosting STM Likelihoods (parameter \(\beta (\varvec{x})\)) with the same basis functions

In contrast to quantile regression, where separate models for each quantile are fitted, likelihood boosting for transformation models estimates conditional distribution functions simultaneously for all quantiles. It is interesting to note that a recently suggested Bayesian approach to simultaneous linear quantile regression (Yang and Tokdar 2017) maximizes a log-likelihood obtained from a numerical inversion of the quantile function instead of using the traditional check risk minimization. In light of this approach, it seems computationally attractive to model the distribution function in the distribution regression model \(F_{Y\mid \varvec{X}= \varvec{x}}(y\mid \varvec{X}= \varvec{x}) = F_Z(h_Y(y) - \varvec{x}^\top \varvec{\beta }(y))\) rather than the quantile function in a quantile regression model \(Q_{Y\mid \varvec{X}= \varvec{x}}(\tau \mid \varvec{X}= \varvec{x}) = \alpha (\tau ) + \varvec{x}^\top \varvec{\delta }(\tau )\) of the same complexity (\(\tau \in [0, 1]\); \(\alpha \) and \(\varvec{\delta }\) being the probability-varying intercept and coefficient functions, respectively). Bayesian inference for the corresponding model parameters in conditional transformation models is, however, still under development (Mitrodima and Griffin 2017).
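Once a conditional distribution function has been estimated, any conditional quantile can be obtained by numerical inversion. The sketch below does this by bisection for a distribution regression model of the form \(F_Z(h_Y(y) - x \beta (y))\) with a single predictor and \(F_Z= {\varPhi }\). The baseline \(h_Y(y) = y\), the effect \(\beta (y) = 0.5 + 0.1 y\), the search interval, and all function names are hypothetical choices made only for illustration.

```python
import math

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def conditional_cdf(y, x):
    """F(y | x) = Phi(h_Y(y) - x * beta(y)) with hypothetical
    h_Y(y) = y and response-varying effect beta(y) = 0.5 + 0.1 * y."""
    return std_normal_cdf(y - x * (0.5 + 0.1 * y))

def conditional_quantile(tau, x, lower=-10.0, upper=10.0, tol=1e-10):
    """Invert the conditional distribution function by bisection.
    Assumes F(. | x) is increasing in y on [lower, upper] and that
    the tau-quantile lies inside this interval."""
    lo, hi = lower, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if conditional_cdf(mid, x) < tau:
            lo = mid   # quantile lies to the right of mid
        else:
            hi = mid   # quantile lies at or to the left of mid
    return 0.5 * (lo + hi)

# Several quantiles of the same conditional distribution at x = 2.
quantiles = [conditional_quantile(tau, x=2.0) for tau in (0.1, 0.5, 0.9)]
```

Because all quantiles are read off a single monotone distribution function, they cannot cross, which is one practical consequence of estimating the distribution function simultaneously for all quantiles.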

### 6.1 Computational details

A reference implementation of transformation boosting machines (Algorithms 1 and 2) is available in the **tbm** package (Hothorn 2019). Analyses of all applications and simulation results can be reproduced in the dynamic document Hothorn (2019). All computations were performed using R version 3.5.2 (R Core Team 2018).


## References

- Athey, S., Tibshirani, J., Wager, S.: Generalized random forests. Ann. Stat.
**47**(2), 1148–1178 (2019). https://doi.org/10.1214/18-AOS1709 MathSciNetCrossRefzbMATHGoogle Scholar - Box, G.E.P., Cox, D.R.: An analysis of transformations. J. R. Stat. Soc. Ser. B (Stat. Methodol.)
**26**(2), 211–252 (1964)zbMATHGoogle Scholar - Bühlmann, P., Hothorn, T.: Boosting algorithms: regularization, prediction and model fitting. Stat. Sci.
**22**(4), 477–505 (2007). https://doi.org/10.1214/07-STS242. with discussionMathSciNetCrossRefzbMATHGoogle Scholar - Bühlmann, P., Yu, B.: Boosting with the \(L_2\) loss: regression and classification. J. Am. Stat. Assoc.
**98**(462), 324–339 (2003). https://doi.org/10.1198/016214503000125 CrossRefzbMATHGoogle Scholar - Cabrera, B.L., Schulz, F.: Forecasting generalized quantiles of electricity demand: a functional data approach. J. Am. Stat. Assoc.
**112**(517), 127–136 (2017). https://doi.org/10.1080/01621459.2016.1219259 MathSciNetCrossRefGoogle Scholar - Chernozhukov, V., Fernández-Val, I., Melly, B.: Inference on counterfactual distributions. Econometrica
**81**(6), 2205–2268 (2013). https://doi.org/10.3982/ECTA10582 MathSciNetCrossRefzbMATHGoogle Scholar - Currie, I.D., Durban, M., Eilers, P.H.C.: Generalized linear array models with applications to multidimensional smoothing. J. R. Stat. Soc. Ser. B (Stat. Methodol.)
**68**(2), 259–280 (2006). https://doi.org/10.1111/j.1467-9868.2006.00543.x MathSciNetCrossRefzbMATHGoogle Scholar - Fenske, N., Kneib, T., Hothorn, T.: Identifying risk factors for severe childhood malnutrition by boosting additive quantile regression. J. Am. Stat. Assoc.
**106**(494), 494–510 (2011). https://doi.org/10.1198/jasa.2011.ap09272 MathSciNetCrossRefzbMATHGoogle Scholar - Foresi, S., Peracchi, F.: The conditional distribution of excess returns: an empirical analysis. J. Am. Stat. Assoc.
**90**(430), 451–466 (1995). https://doi.org/10.1080/01621459.1995.10476537 CrossRefGoogle Scholar - Fredriks, A.M., van Buuren, S., Burgmeijer, R.J.F., Meulmeester, J.F., Beuker, R.J., Brugman, E., Roede, M.J., Verloove-Vanhorick, S.P., Wit, J.: Continuing positive secular growth change in the Netherlands 1955–1997. Pediatr. Res.
**47**(3), 316–323 (2000)CrossRefGoogle Scholar - Friedman, J.H., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting (with discussion). Ann. Stat.
**28**, 337–407 (2000). https://doi.org/10.1214/aos/1016218223 CrossRefzbMATHGoogle Scholar - Garcia, A.L., Wagner, K., Hothorn, T., Koebnick, C., Zunft, H.J.F., Trippo, U.: Improved prediction of body fat by measuring skinfold thickness, circumferences, and bone breadths. Obesity
**13**(3), 626–634 (2005). https://doi.org/10.1038/oby.2005.67 CrossRefGoogle Scholar - Garcia, T.P., Marder, K., Wang, Y.: Time-varying proportional odds model for mega-analysis of clustered event times. Biostatistics
**20**(1), 129–146 (2019). https://doi.org/10.1093/biostatistics/kxx065 MathSciNetCrossRefGoogle Scholar - Gneiting, T., Katzfuss, M.: Probabilistic forecasting. Annu. Rev. Stat. Its Appl.
**1**(1), 125–151 (2014). https://doi.org/10.1146/annurev-statistics-062713-085831 CrossRefGoogle Scholar - Hofner, B., Hothorn, T., Kneib, T., Schmid, M.: A framework for unbiased model selection based on boosting. J. Comput. Graph. Stat.
**20**(4), 956–971 (2011). https://doi.org/10.1198/jcgs.2011.09220 MathSciNetCrossRefGoogle Scholar - Hothorn, T.: tbm: Transformation Boosting Machines. R package and vignette version 0.3-0 (2019). http://CRAN.R-project.org/package=tbm
- Hothorn, T., Zeileis, A.: Transformation forests. Tech. rep. v2, https://arxiv.org/abs/1701.02110 (2017)
- Hothorn, T., Kneib, T., Bühlmann, P.: Conditional transformation models. J. R. Stat. Soc. Ser. B (Stat. Methodol.) **76**(1), 3–27 (2014). https://doi.org/10.1111/rssb.12017
- Hothorn, T., Möst, L., Bühlmann, P.: Most likely transformations. Scand. J. Stat. **45**(1), 110–134 (2018). https://doi.org/10.1111/sjos.12291
- Kneib, T., Hothorn, T., Tutz, G.: Variable selection and model choice in geoadditive regression models. Biometrics **65**(2), 626–634 (2009). https://doi.org/10.1111/j.1541-0420.2008.01112.x
- Koenker, R.: Quantile Regression. Economic Society Monographs. Cambridge University Press, New York (2005)
- Kooperberg, C., Stone, C.J., Truong, Y.K.: Hazard regression. J. Am. Stat. Assoc. **90**(429), 78–94 (1995). https://doi.org/10.1080/01621459.1995.10476491
- Küffner, R., Zach, N., Norel, R., Hawe, J., Schoenfeld, D., Wang, L., Li, G., Fang, L., Mackey, L., Hardiman, O., Cudkowicz, M., Sherman, A., Ertaylan, G., Grosse-Wentrup, M., Hothorn, T., van Ligtenberg, J., Macke, J.H., Meyer, T., Schölkopf, B., Tran, L., Vaughan, R., Stolovitzky, G., Leitner, M.L.: Crowdsourced analysis of clinical trial data to predict amyotrophic lateral sclerosis progression. Nat. Biotechnol. **33**, 51–57 (2015). https://doi.org/10.1038/nbt.3051
- Lee, D.K.K., Chen, N.: Boosting hazard regression with time-varying covariates. Tech. rep. v3, https://arxiv.org/abs/1701.07926 (2018)
- Leorato, S., Peracchi, F.: Comparing distribution and quantile regression. Tech. Rep. 1511, Einaudi Institute for Economics and Finance, Rome, Italy (2015). https://ideas.repec.org/p/eie/wpaper/1511.html. Accessed 24 Nov 2018
- Li, Q., Racine, J.S.: Nonparametric estimation of conditional CDF and quantile functions with mixed categorical and continuous data. J. Bus. Econ. Stat. **26**(4), 423–434 (2008). https://doi.org/10.1198/073500107000000250
- Lu, W., Li, L.: Boosting method for nonlinear transformation models with censored survival data. Biostatistics **9**(4), 658–667 (2008). https://doi.org/10.1093/biostatistics/kxn005
- Mayr, A., Hofner, B.: Boosting for statistical modelling—a non-technical introduction. Stat. Model. **18**(3–4), 365–384 (2018). https://doi.org/10.1177/1471082X17748086
- Mayr, A., Fenske, N., Hofner, B., Kneib, T., Schmid, M.: GAMLSS for high-dimensional data—a flexible approach based on boosting. J. R. Stat. Soc. Ser. C (Appl. Stat.) **61**(3), 403–427 (2012). https://doi.org/10.1111/j.1467-9876.2011.01033.x
- Meinshausen, N.: Quantile regression forests. J. Mach. Learn. Res. **7**, 983–999 (2006). http://jmlr.org/papers/v7/meinshausen06a.html
- Mitrodima, G., Griffin, J.E.: A Bayesian quantile time series model for asset returns. Tech. rep., SSRN, https://doi.org/10.2139/ssrn.3050989 (2017)
- Möst, L., Hothorn, T.: Conditional transformation models for survivor function estimation. Int. J. Biostat. **11**(1), 23–50 (2015). https://doi.org/10.1515/ijb-2014-0006
- Pratola, M., Chipman, H., George, E.I., McCulloch, R.: Heteroscedastic bart using multiplicative regression trees. Tech. rep. v1, http://arxiv.org/abs/1709.07542 (2017)
- R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2018). http://www.R-project.org/
- Ridgeway, G.: The state of boosting. Comput. Sci. Stat. **31**, 172–181 (1999)
- Rigby, R.A., Stasinopoulos, D.M.: Generalized additive models for location, scale and shape. J. R. Stat. Soc. Ser. C (Appl. Stat.) **54**(3), 507–554 (2005). https://doi.org/10.1111/j.1467-9876.2005.00510.x
- Rödel, C., Graeven, U., Fietkau, R., Hohenberger, W., Hothorn, T., Arnold, D., Hofheinz, R.D., Ghadimi, M., Wolff, H.A., Lang-Welzenbach, M., Raab, H.R., Wittekind, C., Ströbel, P., Staib, L., Wilhelm, M., Grabenbauer, G.G., Hoffmanns, H., Lindemann, F., Schlenska-Lange, A., Folprecht, G., Sauer, R.: Torsten Liersch on behalf of the German Rectal Cancer Study Group: Oxaliplatin added to fluorouracil-based preoperative chemoradiotherapy and postoperative chemotherapy of locally advanced rectal cancer (the German CAO/ARO/AIO-04 study): final results of the multicentre, open-label, randomised, phase 3 trial. Lancet Oncol. **16**(8), 979–989 (2015). https://doi.org/10.1016/S1470-2045(15)00159-X
- Rothe, C., Wied, D.: Misspecification testing in a class of conditional distributional models. J. Am. Stat. Assoc. **108**(501), 314–324 (2013). https://doi.org/10.1080/01621459.2012.736903
- Schild, R.L., Maringa, M., Siemer, J., Meurer, B., Hart, N., Goecke, T.W., Schmid, M., Hothorn, T., Hansmann, M.E.: Weight estimation by three-dimensional ultrasound in the small fetus. Ultrasound Obstetr. Gynecol. **32**(2), 168–175 (2008). https://doi.org/10.1002/uog.6111
- Schmid, M., Hothorn, T.: Flexible boosting of accelerated failure time models. BMC Bioinform. **9**, 269 (2008). https://doi.org/10.1186/1471-2105-9-269
- Schmid, M., Hothorn, T., Maloney, K.O., Weller, D.E., Potapov, S.: Geoadditive regression modeling of stream biological condition. Environ. Ecol. Stat. **18**(4), 709–733 (2011). https://doi.org/10.1007/s10651-010-0158-4
- Seibold, S., Brandl, R., Schmidl, J., Busse, J., Thorn, S., Hothorn, T., Müller, J.: Extinction risk status of saproxylic beetles reflects the ecological degradation of forests in Europe. Conserv. Biol. **29**(2), 382–390 (2015). https://doi.org/10.1111/cobi.12427
- Seibold, H., Zeileis, A., Hothorn, T.: Individual treatment effect prediction for ALS patients. Stat. Methods Med. Res. (2017). https://doi.org/10.1177/0962280217693034
- Wu, C.O., Tian, X.: Nonparametric estimation of conditional distributions and rank-tracking probabilities with time-varying transformation models in longitudinal studies. J. Am. Stat. Assoc. **108**(503), 971–982 (2013). https://doi.org/10.1080/01621459.2013.808949
- Yang, Y., Tokdar, S.T.: Joint estimation of quantile planes over arbitrary predictor spaces. J. Am. Stat. Assoc. **112**(519), 1107–1120 (2017). https://doi.org/10.1080/01621459.2016.1192545
- Yue, M., Li, J., Ma, S.: Sparse boosting for high-dimensional survival data with varying coefficients. Stat. Med. **37**(5), 789–800 (2017). https://doi.org/10.1002/sim.7544

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.