Transformation boosting machines

The broad class of conditional transformation models includes interpretable and simple as well as potentially very complex models for conditional distributions. This makes conditional transformation models attractive for predictive distribution modelling, especially because models featuring interpretable parameters and black-box machines can be understood as extremes in a whole cascade of models. So far, algorithms and corresponding theory have been developed for special forms of conditional transformation models only: maximum likelihood inference is available for rather simple models, a tailored boosting algorithm exists for the estimation of additive conditional transformation models, and a special form of random forests targets the estimation of interaction models. Here, I propose boosting algorithms capable of estimating conditional transformation models of arbitrary complexity, starting from simple shift transformation models featuring linear predictors up to essentially unstructured conditional transformation models allowing complex nonlinear interaction functions. A generic form of the likelihood is maximized. Thus, the novel boosting algorithms for conditional transformation models are applicable to all types of univariate response variables, including randomly censored or truncated observations.


Introduction
The future remains unknown, yet we have witnessed considerably improved predictions owing to advances in statistical and machine learning over the last two decades. Numerous procedures, such as support vector machines, random forests, and tree boosting, deliver accurate point predictions of conditional means. However, in many applications, a mean prediction is not good enough. Full predictive distributions, also known as probabilistic forecasts, are required in applications where an assessment of the associated uncertainty is essential, for example in models of future disease progression (Küffner et al. 2015), electricity demand (Cabrera and Schulz 2017), stock asset returns (Mitrodima and Griffin 2017), and counterfactual distributions (Chernozhukov et al. 2013). In these applications, the prediction "takes the form of a predictive probability distribution over future quantities or events of interest" (Gneiting and Katzfuss 2014). Here, I present a novel generic boosting approach to the estimation of full predictive distributions under mild assumptions.
Apart from completely model-free procedures (such as kernel smoothing, Li and Racine 2008), four main approaches to obtaining predictive distributions exist. (1) Flexible parametric models for conditional density functions rely on a strict parametric model of the response distribution whose parameters might be linked to predictor variables in complex ways, for example, in generalized additive models for location, scale, and shape (GAMLSS, Rigby and Stasinopoulos 2005) and in heteroscedastic Bayesian additive regression tree ensembles (Pratola et al. 2017). (2) Quantile regression models for conditional quantiles of interest can be formulated in a linear or nonlinear additive form (Koenker 2005); more complex relationships can be estimated by quantile regression forests (Meinshausen 2006; Athey et al. 2019). (3) Distribution regression and transformation models potentially allow response-varying (or time-varying) effects (Foresi and Peracchi 1995; Rothe and Wied 2013; Chernozhukov et al. 2013; Wu and Tian 2013; Leorato and Peracchi 2015) in models for conditional distribution functions on the probit, logit, or complementary log-log scale. (4) Hazard regression (Kooperberg et al. 1995) aims at estimating conditional nonproportional hazard functions directly.
Boosting, and especially the statistical view on boosting (Friedman et al. 2000; Bühlmann and Hothorn 2007), has already proved helpful in these four approaches. Mayr et al. (2012) developed boosting for GAMLSS, conditional quantile boosting was introduced by Fenske et al. (2011), and nonproportional hazard boosting was recently introduced by Lee and Chen (2018). Distribution regression is a special case of conditional transformation models (Hothorn et al. 2014). What is interesting about conditional transformation models is that very simple models, such as the linear proportional odds and hazards models, and essentially unstructured models for conditional distribution functions can be understood in a unified theoretical framework (Hothorn et al. 2018). The same level of generality is, however, lacking from an algorithmic perspective. The boosting algorithm introduced by Hothorn et al. (2014) is limited to additive models and explicitly excludes tree-based interaction models. Furthermore, its target function is approximate and applicable only to responses observed without censoring or truncation. The aim of this work is to establish a general computational framework that allows specification, estimation, evaluation, and comparison in a cascade of models starting with very simple linear models and ending with essentially unstructured models for conditional distribution functions for arbitrary response variables.
Section 2 gives a dense introduction to transformation models. An elaborate description and connections to well-established models can be found in Hothorn et al. (2014) and Hothorn et al. (2018). Sections 3 and 4 develop two boosting algorithms for complex and simple transformation models based on a generic form of the likelihood (technical details regarding the definition of the likelihood for all types of response variables, including random censoring and truncation, are discussed by Hothorn et al. 2018). Empirical evaluations are presented in Sect. 5.

Transformation models
Let Y denote a univariate and at least ordered response variable on a measurable space (Ξ, C) and X ∈ χ a set of predictor variables with joint distribution (Y, X) ∼ P_{Y,X}. Based on random samples from P_{Y,X}, the goal is to estimate the conditional distribution P_{Y|X=x} of a response given predictors. For each x ∈ χ, the conditional cumulative distribution function is modelled as

P(Y ≤ y | X = x) = F_Z(h(y | x)),    (1)

where F_Z is an a priori given cumulative distribution function of an absolutely continuous random variable Z with log-concave density f_Z (Hothorn et al. 2018). The conditional transformation function h is monotonic in y. Starting with Box and Cox (1964), shift transformation functions based on the decomposition h(y | x) = h_Y(y) − β(x), featuring a baseline transformation function h_Y : Ξ → R and a shift term β : χ → R, have been studied intensively. The proportional hazards (with F_Z(z) = 1 − exp(−exp(z))) and proportional odds (with F_Z(z) = expit(z)) models are the most well-known representatives of this class of shift transformation models (STM, often also referred to as linear or nonlinear transformation models, depending on the functional form of β(x)). Boosting procedures that allow flexible estimation of β(x) have been studied for proportional hazards models under right censoring (Ridgeway 1999; Schmid and Hothorn 2008; Lu and Li 2008; Yue et al. 2017), and proportional odds models have been studied for ordered responses (Schmid et al. 2011). A comparison of prominent and less prominent members of this model class is given in Hothorn et al. (2018).
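To make the two classical special cases concrete, the following sketch (hypothetical code, not from any reference implementation) evaluates the conditional distribution function of a shift transformation model F_Z(h_Y(y) − β(x)) under the proportional hazards and proportional odds choices of F_Z; the baseline transformation h_Y = log and the linear shift β are illustrative assumptions only.

```python
import numpy as np

def expit(z):
    """Standard logistic CDF; F_Z for the proportional odds model."""
    return 1.0 / (1.0 + np.exp(-z))

def mev_cdf(z):
    """F_Z(z) = 1 - exp(-exp(z)), the minimum extreme value CDF
    underlying the proportional hazards model."""
    return 1.0 - np.exp(-np.exp(z))

def shift_tm_cdf(y, x, h_Y, beta, F_Z):
    """P(Y <= y | X = x) = F_Z(h_Y(y) - beta(x)) for a shift
    transformation model with baseline transformation h_Y."""
    return F_Z(h_Y(y) - beta(x))

# illustrative choices: monotone baseline h_Y = log on y > 0, linear shift
h_Y = np.log
beta = lambda x: 0.5 * x

y_grid = np.linspace(0.1, 10, 200)
p = shift_tm_cdf(y_grid, 1.0, h_Y, beta, mev_cdf)   # proportional hazards
q = shift_tm_cdf(y_grid, 1.0, h_Y, beta, expit)     # proportional odds
# both implied conditional CDFs are nondecreasing in y and lie in [0, 1]
assert np.all(np.diff(p) >= 0) and np.all(np.diff(q) >= 0)
```

Because h is monotone in y and F_Z is a distribution function, any such pair yields a valid conditional distribution; only the interpretation of β(x) (log-hazard ratio vs. log-odds ratio) changes with F_Z.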
Structured additive transformation functions that allow interactions between the two arguments y and x of the form h(y | x) = ∑_{j=1}^{J} h_j(y | x) lead to conditional transformation models (CTM, Hothorn et al. 2014). The J partial transformation functions h_j allow formulation of problem-specific effects of the predictors x, such as linear, nonlinear, spatio-temporal, or other model terms. Distribution regression models featuring response-varying effects are an important special case of this model class. When x = (x_1, …, x_J) ∈ R^J, a distribution regression model is characterized by partial transformation functions h_j(y | x_j) = β_j(y) x_j and corresponding interpretable response-varying effects β_j : Ξ → R. The analogue of an additive model features partial transformation functions h_j(y | x_j), i.e. bivariate smooth functions of both y and x_j. These bivariate terms are more complex than the one-dimensional coefficient functions β_j(y) but can still be visualized and interpreted. If x_j is more complex, for example, if it describes a spatial location, h_j(y | x_j) might be a spatially smooth term that captures unexplained spatial heterogeneity (Hothorn et al. 2014).

Models with transformation function h(y | x) = ∑_{j=1}^{J} h_j(y | x_j) and potential applications are discussed in Hothorn et al. (2014) and Hothorn et al. (2018). The standard estimation approach, maximizing the continuously ranked probability score over a discrete grid covering Ξ (Foresi and Peracchi 1995; Chernozhukov et al. 2013; Hothorn et al. 2014), potentially with inverse probability of censoring weight adjustment for right censoring (Möst and Hothorn 2015; Garcia et al. 2019), does not allow essentially unstructured transformation functions h(y | x), including higher-order interactions, and thus does not permit relaxation of the additivity assumption on the scale of the transformation function h. Furthermore, it is computationally inefficient (because the data have to be expanded) and is unable to handle censoring or truncation directly.
This paper addresses these issues by introducing computationally efficient boosted likelihood estimation for unstructured or structured additive conditional transformation functions (Sect. 3) and shift transformation functions (Sect. 4) under all forms of random censoring and truncation for at least ordered responses based on potentially correlated observations.

Boosting the likelihood of conditional transformation models
In the following, the conditional transformation function h(y | x) = a(y)⊤ϑ(x) is parameterized in terms of basis functions a : Ξ → R^P of the response and a conditional parameter function ϑ : χ → R^P; the latter function will be estimated.

Definition of the likelihood
The parameterisation of h implies a conditional cumulative distribution function P(Y ≤ y | X = x) = F_Z(a(y)⊤ϑ(x)) and thus a conditional density f(y | x) = f_Z(a(y)⊤ϑ(x)) a′(y)⊤ϑ(x) when y ∈ R comes from an absolutely continuous distribution (a′ is the derivative of a). For discrete y ∈ Ξ = {y_1, y_2, …}, the density function is f(y_k | x) = F_Z(a(y_k)⊤ϑ(x)) − F_Z(a(y_{k−1})⊤ϑ(x)). There are also other forms of the density, for example, in mixed discrete-continuous distributions. The population optimizer for the conditional parameter function ϑ is the maximizer of the expected log-likelihood. Based on N independent observations (y_i, x_i), i = 1, …, N from P_{Y,X}, empirical risk minimization with negative log-likelihood loss can be applied to estimate the conditional parameter function ϑ. The log-likelihood contribution ℓ_i : R^P → R for the ith observation is

ℓ_i(ϑ(x_i)) = log f(y_i | x_i)    (2)

when an exact observation y_i from an absolutely continuous or ordered response is available, and

ℓ_i(ϑ(x_i)) = log(F_Z(a(y_u)⊤ϑ(x_i)) − F_Z(a(y_l)⊤ϑ(x_i)))

when a set or interval (y_l, y_u] was observed (for example, for a left-, right-, or interval-censored observation y_i). Integration is with respect to the measure μ dominating P_{Y|X=x}. Details on the likelihood function for models parameterized by a(y)⊤ϑ(x), including gradients (denoted u_i in Algorithm 1) and Hessians under censoring and truncation, are given in Hothorn et al. (2018).
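As a hedged illustration of these two likelihood contributions, the sketch below evaluates them for a Bernstein parameterization a(y)⊤ϑ with F_Z = Φ; the basis helpers and the coefficient values are assumptions for the example, not the paper's implementation.

```python
import numpy as np
from math import comb
from scipy.stats import norm   # F_Z = Phi, f_Z = phi

def bernstein(y, M, lo, hi):
    """Bernstein basis a(y) of order M on [lo, hi] as an (M+1)-vector."""
    t = (y - lo) / (hi - lo)
    return np.array([comb(M, m) * t**m * (1 - t)**(M - m)
                     for m in range(M + 1)])

def bernstein_deriv(y, M, lo, hi):
    """Derivative a'(y), via B'_{m,M} = M (B_{m-1,M-1} - B_{m,M-1})."""
    low = bernstein(y, M - 1, lo, hi)
    d = np.zeros(M + 1)
    d[:-1] -= low
    d[1:] += low
    return M * d / (hi - lo)

def loglik_exact(y, vartheta, M, lo, hi):
    """Exact continuous observation y:
    log f_Z(a(y)' vartheta) + log(a'(y)' vartheta)."""
    z = bernstein(y, M, lo, hi) @ vartheta
    dz = bernstein_deriv(y, M, lo, hi) @ vartheta
    return norm.logpdf(z) + np.log(dz)

def loglik_interval(y_lo, y_up, vartheta, M, lo, hi):
    """Censored observation (y_lo, y_up]:
    log(F_Z(a(y_up)' vartheta) - F_Z(a(y_lo)' vartheta))."""
    z_up = bernstein(y_up, M, lo, hi) @ vartheta
    z_lo = bernstein(y_lo, M, lo, hi) @ vartheta
    return np.log(norm.cdf(z_up) - norm.cdf(z_lo))

vartheta = np.linspace(-2, 2, 7)   # increasing coefficients => monotone h
ll_exact = loglik_exact(0.5, vartheta, 6, 0.0, 1.0)
ll_cens = loglik_interval(0.2, 0.8, vartheta, 6, 0.0, 1.0)
```

Increasing Bernstein coefficients guarantee a monotone transformation, so both the log-density term and the censored probability term are well defined.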

Boosting the likelihood
The proposed boosting Algorithm 1 outputs a model of the form

ϑ^[B](x) = ϑ^[0](x) + ∑_{b=1}^{B} Γ^[b] b_{j(b)}(x)    (3)

with offset term ϑ^[0](x) and j = 1, …, J a priori defined basis functions b_j : χ → R^{P_j} of the predictor variables. The function j(b) returns the index of the basis function b_j which was selected in the bth iteration of the algorithm. Each basis may be equipped with an explicit penalty function Pen_j. The corresponding penalty parameter λ_j is chosen such that the degrees of freedom are the same for all J basis functions to facilitate unbiased model selection (Hofner et al. 2011). The number of terms B, the selected basis functions j(b), and the corresponding coefficient matrices Γ^[b] ∈ R^{P×P_{j(b)}} are unknowns and are estimated from the data. The basis functions b_j may feature unknown parameters. With relatively deep regression trees b_j (where the tree structure is estimated from the data in every boosting iteration and the parameters are in each terminal node), model (3) is the sum of B trees and as such is potentially highly unstructured. Similar to GAMLSS-boosting (Mayr et al. 2012), a parameter vector ϑ is modelled instead of a scalar predictor function. The main difference is that all dimensions of the parameter vector ϑ are updated simultaneously, whereas each dimension is assigned its own predictor function in GAMLSS-boosting. Algorithm 1 is essentially a multivariate version of L_2 boosting (Bühlmann and Yu 2003) using the negative transformation log-likelihood (2) as loss function.

Algorithm 1 (Boosting CTM Likelihoods)
1. Start with b = 0 and offset ϑ^[0](x_i). Initialize base learners j = 1, …, J via basis functions b_j and a stepsize ν ∈ (0, 1).
2. Compute the negative gradients u_i^[b] = (u_{i,1}^[b], …, u_{i,P}^[b]) of the loss for i = 1, …, N.
3. Fit base learners j = 1, …, J by (penalized) multivariate least squares.
4. Select the base learner j(b) with minimal quadratic error and set Γ^[b] = ν Γ̂_{j(b)}.
5. Update ϑ^[b] and iterate.

This choice of loss makes the algorithm agnostic with respect to the scale of the response variable and potential censoring or truncation. The default offset is the unconditional maximum-likelihood estimator ϑ^[0](x_i) ≡ θ̂_ML for i = 1, …, N that maximizes the unconditional log-likelihood. The algorithm is also applicable to the high-dimensional setting where the number of predictor variables exceeds the number of observations N. The number of boosting iterations b_stop is a tuning parameter that has to be chosen by the out-of-sample log-likelihood for a validation sample. Model choice, for example using cross-validation, subsampling, or the bootstrap, can also be implemented conveniently by comparing this out-of-sample log-likelihood of different candidate models.
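A minimal sketch may help fix ideas; it implements the steps of Algorithm 1 for exact continuous observations with F_Z = Φ and componentwise linear base learners b_j(x) = x_j. All names, the crude offset, and the derivative guard are illustrative simplifications, not the reference implementation in tbm.

```python
import numpy as np
from math import comb

def bernstein(y, M):
    """Bernstein basis a(y) of order M on [0, 1]; rows = observations."""
    y = np.clip(y, 1e-6, 1 - 1e-6)
    return np.column_stack([comb(M, m) * y**m * (1 - y)**(M - m)
                            for m in range(M + 1)])

def bernstein_deriv(y, M):
    """Derivative a'(y) of the Bernstein basis."""
    y = np.clip(y, 1e-6, 1 - 1e-6)
    low = np.column_stack([comb(M - 1, m) * y**m * (1 - y)**(M - 1 - m)
                           for m in range(M)])
    d = np.zeros((len(y), M + 1))
    d[:, :-1] -= low
    d[:, 1:] += low
    return M * d

def boost_ctm(y, X, M=6, nu=0.05, B=100):
    """Sketch of Algorithm 1 with F_Z = Phi and componentwise linear
    base learners b_j(x) = x_j, so each Gamma^[b] is a P-vector."""
    N, J = X.shape
    A, Ad = bernstein(y, M), bernstein_deriv(y, M)
    P = M + 1
    # crude unconditional offset: increasing coefficients => monotone h
    theta = np.tile(np.linspace(-1.5, 1.5, P), (N, 1))
    for b in range(B):
        z = np.sum(A * theta, axis=1)                  # h(y_i | x_i)
        hp = np.clip(np.sum(Ad * theta, axis=1), 1e-3, None)  # guard h' > 0
        U = -z[:, None] * A + Ad / hp[:, None]         # score wrt vartheta
        best, best_sse, best_g = 0, np.inf, None
        for j in range(J):                             # step 3: LS fits
            xj = X[:, j]
            g = (xj @ U) / (xj @ xj)
            sse = np.sum((U - np.outer(xj, g))**2)
            if sse < best_sse:                         # step 4: selection
                best, best_sse, best_g = j, sse, g
        theta += nu * np.outer(X[:, best], best_g)     # step 5: update
    return theta

rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 2))
y = rng.beta(2 + 2 * X[:, 0], 2)       # distribution shifts with x_1
theta = boost_ctm(y, X)
```

All P coordinates of ϑ(x_i) are updated in a single least-squares fit per base learner, which is the point distinguishing this scheme from GAMLSS-boosting.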
An additional advantage of this algorithm over boosted continuously ranked probability scores ("CTM-CRPS-boosting", Hothorn et al. 2014) is that computations of tensor products in (a(y) ⊗ b_j(x))⊤ vec(Γ) = a(y)⊤ Γ b_j(x) are never explicitly required, because the linear array model formulation (i.e. the right-hand side of the equation, see Currie et al. 2006, formula 2.5) is implemented by Algorithm 1. This allows estimation of potentially highly unstructured models by choosing relatively deep multivariate regression trees as basis functions b. Moreover, the algorithm does not require expansion of the data set (to size N², in the worst case).
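The array identity underlying this saving can be checked numerically; the snippet below verifies a(y)⊤Γb_j(x) = (a(y) ⊗ b_j(x))⊤vec(Γ) for random bases (with row-major vectorization, an implementation detail of this sketch).

```python
import numpy as np

# Identity behind the linear array model formulation: evaluating
# a(y)' Gamma b_j(x) directly avoids materializing the tensor product
# basis a(y) (x) b_j(x) for every observation.
rng = np.random.default_rng(0)
P, Pj = 7, 4
a = rng.normal(size=P)            # basis of the response, a(y)
bx = rng.normal(size=Pj)          # basis of the predictors, b_j(x)
Gamma = rng.normal(size=(P, Pj))  # coefficient matrix

lhs = a @ Gamma @ bx                        # array model form (cheap)
rhs = np.kron(a, bx) @ Gamma.reshape(-1)    # Kronecker form, row-major vec
assert np.isclose(lhs, rhs)
```

For N observations the left-hand side costs O(N P P_j), whereas forming the Kronecker basis explicitly costs O(N P P_j) memory per row as well, which is exactly the data expansion the algorithm avoids.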

Model interpretation
The partial transformation functions h_j can be obtained from the boosted model (3) as h_j(y | x_j) = x_j β_j(y) = x_j a(y)⊤ϑ_j with corresponding response-varying effects β_j(y) = a(y)⊤ϑ_j. Thus, this boosting procedure can also be used to estimate Cox models with time-varying effects under all forms of random censoring and truncation. Nonlinear effects can be implemented by a B-spline basis b_j(x) = b_j(x_j), and more complex bases allow specification of terms that capture spatio-temporal correlations or other forms of unexplained heterogeneity (Kneib et al. 2009; Hofner et al. 2011). A collection of commonly used basis functions b, along with corresponding penalty functions and interpretable model terms, is reviewed in Mayr and Hofner (2018). Specific choices of basis functions underlying the empirical results presented in Sect. 5 are discussed in detail in Hothorn (2019).

Boosting the likelihood of shift transformation models
A comparison of conditional transformation models, which allow interactions of y and x in the transformation function h, with shift transformation models, in which these terms are absent, can help to identify situations where the simpler models perform as well as or even better than the more complex models. Likelihood boosting for shift transformation models of the form h(y | x) = a(y)⊤ϑ − β(x) proceeds analogously: Algorithm 2 again fits base learners j = 1, …, J by (penalized) multivariate least squares and selects the base learner with minimal quadratic error. The procedure outputs a model with univariate shift function β(x) ∈ R, i.e. with coefficient vectors γ ∈ R^{1×P_j}. In contrast to conditional transformation models, the model term β(x) does not depend on y, and thus shift transformation models are easier to interpret. L_2 boosting in this setup is performed based on log-likelihood contributions for the absolutely continuous case and for the discrete case. The core idea is to update the nuisance parameter ϑ before computing the gradients in every boosting iteration (following Schmid and Hothorn 2008). For discrete proportional odds models, Algorithm 2 is equivalent to the procedure proposed by Schmid et al. (2011). The ability to handle censoring and truncation in the likelihood allows proportional hazards models with potentially very flexible log-hazard ratios β(x) to be fitted to randomly censored (including left- and interval-censored) or truncated responses.
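For the special case F_Z = Φ with exact continuous observations, the score with respect to β(x_i) reduces to the transformed residual z_i = a(y_i)⊤ϑ − β(x_i), which yields a particularly simple sketch of Algorithm 2. The nuisance parameter ϑ is held fixed here for brevity, whereas the actual algorithm updates it before each gradient step; all names are illustrative.

```python
import numpy as np

def boost_stm(z0, X, nu=0.1, B=250):
    """Componentwise linear likelihood boosting of the shift term beta.
    z0 holds a(y_i)' vartheta for the (fixed) nuisance vartheta; with
    F_Z = Phi the score wrt beta(x_i) is just z0[i] - beta_hat[i]."""
    N, J = X.shape
    beta_hat = np.zeros(N)     # fitted shift beta(x_i)
    coefs = np.zeros(J)        # accumulated linear coefficients
    for b in range(B):
        u = z0 - beta_hat                          # score = residual
        best, sse, gam = 0, np.inf, 0.0
        for j in range(J):                         # componentwise LS fit
            g = (X[:, j] @ u) / (X[:, j] @ X[:, j])
            s = np.sum((u - g * X[:, j])**2)
            if s < sse:
                best, sse, gam = j, s, g
        coefs[best] += nu * gam                    # shrunken update
        beta_hat += nu * gam * X[:, best]
    return coefs

rng = np.random.default_rng(2)
N, J = 500, 5
X = rng.normal(size=(N, J))
z0 = 1.5 * X[:, 0] + rng.normal(size=N)   # only x_1 carries signal
coefs = boost_stm(z0, X)
```

With small steps and enough iterations the accumulated coefficients approach the least-squares solution, and the informative predictor dominates; early stopping performs the variable selection described in the text.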

Empirical evaluation
The rationale for the empirical evaluation of Algorithms 1 and 2 was to demonstrate that interpretable transformation models can be estimated for applications where information about the predictive distribution matters (Sect. 5.1) and to investigate the robustness of boosted transformation models under model misspecification (Sect. 5.2).

Applications
Eight life science applications in which estimation of a predictive distribution is of special interest are listed in Table 1. Four applications are described by a continuous response, two feature an ordered categorical response, and two feature a right-censored response. Except for the Beetle Extinction Risk application, which requires a discrete basis a, a Bernstein basis a of order M = 6 (for technical details see Hothorn et al. 2018) was used to parameterize the transformation functions. Conditional transformation models (Algorithm 1) with nonlinear (N, using B-splines), linear (L), and tree-based (T, of depth two and thus allowing only two-way interactions) basis functions b as well as shift transformation models (Algorithm 2) using the same bases were evaluated. The performance of these boosted transformation models was compared to the performance of transformation trees and transformation forests (Hothorn and Zeileis 2017). The latter two procedures estimate conditional transformation models of the form F_Z(a(y)⊤ϑ(x)), where ϑ(x) is obtained either from a single tree (transformation trees) or from a nonlinear interaction function (transformation forests). I hypothesized a priori that transformation trees should perform worst across all applications because this method corresponds to the simplest (but easily interpretable) model. I also expected transformation forests to outperform transformation trees and to perform only slightly worse than the best-performing boosting procedure because of the high adaptivity of the underlying random forest procedure. My motivation for this experiment was the hope of finding a simple and interpretable transformation model that outperforms the most complex transformation forests by means of either Algorithm 1 or 2.
Subsampling (with n = 3N/4 observations in the learning sample and ñ = N/4 observations in the validation sample) was performed 100 times. Performance of the competitors was assessed by the out-of-sample log-likelihood, centered by the out-of-sample log-likelihood of the unconditional transformation model F_Z(a(y)⊤θ̂_ML). For a learning sample of size n and a validation sample i = n + 1, …, N, the centered out-of-sample log-likelihood is given by ∑_{i=n+1}^{N} (ℓ_i(ϑ̂(x_i)) − ℓ_i(θ̂_ML)). Values close to zero indicate that the conditional model did not outperform the unconditional model. The results presented in Table 2 demonstrate that the best-performing method was always a boosted transformation model. Transformation forests performed only slightly worse than the top model for the Beetle Extinction Risk, Birth Weight Prediction, Body Fat Mass, and Childhood Malnutrition applications. In the remaining four applications, the best boosting procedure outperformed transformation forests substantially. Nonlinear conditional transformation models (N ϑ(x)) performed best twice, as did tree-based shift transformation models (T β(x)). Each of the remaining models ranked at the top once. Transformation trees outperformed transformation forests for two applications (Head Circumference and PRO-ACT ALSFRS) but never performed better than any of the boosted transformation models.
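A hedged sketch of this evaluation criterion, with normal log-densities standing in for transformation model log-likelihoods (the data and both "models" below are toy assumptions):

```python
import numpy as np
from scipy.stats import norm

def centered_oos_loglik(loglik_model, loglik_uncond):
    """Mean out-of-sample log-likelihood of a conditional model,
    centered by that of the unconditional reference model; values
    near zero mean the predictors did not help."""
    return np.mean(loglik_model) - np.mean(loglik_uncond)

# toy validation sample: y depends on x, Var(y) = 0.8^2 + 1 = 1.64
rng = np.random.default_rng(3)
x = rng.normal(size=1000)
y = 0.8 * x + rng.normal(size=1000)
ll_cond = norm.logpdf(y, loc=0.8 * x, scale=1.0)          # conditional model
ll_uncond = norm.logpdf(y, loc=0.0, scale=np.sqrt(1.64))  # unconditional
assert centered_oos_loglik(ll_cond, ll_uncond) > 0
```

The criterion is an estimate of the expected log-likelihood difference, so a model that genuinely exploits the predictors scores positive, as the conditional toy model does here.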
Graphical representations of the distributions of out-of-sample log-likelihoods, along with the exact model and algorithm specifications and the corresponding software implementation, are presented for all eight applications in Hothorn (2019).

Artificial data-generating processes
The response Y was generated conditionally on two groups and one numeric predictor variable x ∈ [0, 1] following a transformation model of the form P(Y ≤ y | Group, x) = Φ(h(y | Group, x)), where the conditional transformation function h(y | Group, x) = Φ^{−1}(P(Y ≤ y | Group, x)) for the four data-generating processes (DGPs) is given in Table 3. The model labelled "Linear β(x)" is a shift transformation model with a main effect of group, a linear main effect of x, and a corresponding linear interaction effect. The linear main and interaction effects of x are replaced by nonlinear effects (a scaled sine function) of x in the shift transformation model "Nonlinear β(x)". The extension to response-varying main and interaction effects defines the distribution regression model "Linear ϑ(x)" and the conditional transformation model "Nonlinear ϑ(x)". The coefficients of the terms introduced in Table 3 are given in Table 4. Details of the implementation of these DGPs are explained in Hothorn (2019). The conditional densities associated with the four DGPs are shown in Fig. 1.
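Sampling from such a shift transformation DGP is straightforward by inversion: draw Z ∼ N(0, 1) and solve h_Y(Y) = Z + β(Group, x). The following sketch uses a hypothetical baseline transformation h_Y = log and illustrative coefficient values, not the ones from Tables 3 and 4.

```python
import numpy as np

# Hedged sketch of a shift transformation DGP,
# P(Y <= y | group, x) = Phi(h_Y(y) - beta(group, x)),
# with beta containing a group main effect, a linear effect of x,
# and their interaction (coefficients are illustrative).
rng = np.random.default_rng(4)
N = 10_000
group = rng.integers(0, 2, size=N)            # two groups
x = rng.uniform(size=N)
beta = 0.5 * group + 1.0 * x + 0.8 * group * x
Z = rng.normal(size=N)
Y = np.exp(Z + beta)                          # h_Y = log, so h_Y^{-1} = exp

# sanity check: transforming back recovers standard normal scores
z_back = np.log(Y) - beta
```

By construction h(Y | Group, x) = log(Y) − β(Group, x) is standard normal, which the back-transformed scores confirm empirically.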
Models were evaluated by out-of-sample log-likelihoods centered by the out-of-sample log-likelihood of the true DGP (for test samples of size Ñ = 2000), where ϑ_True is as given in Tables 3 and 4.
In Part A of this simulation, nonlinear (N, using B-splines), linear (L), and tree-based (T, of depth six, which allows higher-order interactions) basis functions b for shift transformation models, i.e. models for β(x), and for conditional transformation models, i.e. models for ϑ(x), were evaluated for sample sizes N = 75, 150, 300 under correctly specified models; this means that models were fitted using the correct distribution function F_Z = Φ, the correct order M = 6 of a, no uninformative predictor variables, and the correct basis functions. Both the linear and nonlinear models were fitted with basis functions representing a main effect of group, a main effect of x, and a corresponding interaction effect, whereas trees had to learn this structure from the data. In Part B, these models were evaluated under model misspecification, i.e. using an incorrect distribution function F_Z = expit (standard logistic distribution) or F_Z = MEV (standard minimum extreme value distribution), a too large dimension of the Bernstein basis a (M = 12), or J+ = 5, 25 additional uninformative uniform predictor variables. The same "correct" basis functions as in Part A were used in Part B.
I hypothesized a priori that models exactly matching the DGP would perform best and that tree-based boosting would outperform boosting with linear basis functions in nonlinear problems. Under misspecification, I expected the performance of all models to decrease, but this general ranking to persist. The results for Part A presented in the top three rows of Table 5 show that the model corresponding to the underlying DGP was associated with the largest median out-of-sample log-likelihood. For linear DGPs, the performance of boosted models with nonlinear basis functions was only slightly inferior to the performance of boosted models with linear basis functions, while tree-based boosting performed substantially worse in this situation. By contrast, the signal in nonlinear DGPs was captured relatively well by tree-based boosting, whereas linear basis functions were not able to recover this signal. This shows that tree-based boosting was able to adapt to the underlying nonlinear interaction signal in the two nonlinear simulation models "Nonlinear β(x)" and "Nonlinear ϑ(x)".
The out-of-sample log-likelihoods for misspecified models presented in Table 5, Part B, for F_Z = Φ follow this general pattern in that the model corresponding to the DGP performed best and tree-based boosting outperformed boosting with linear basis functions on nonlinear problems. In only two cases, both characterized by small samples, did a linear model for ϑ(x) outperform a true linear model for β(x) or vice versa. More frequently, the overly complex nonlinear model for ϑ(x) slightly outperformed the nonlinear model for β(x). Overall, Algorithms 1 and 2 seemed to be robust against overly complex basis functions a and additional noninformative predictor variables.
This was also true under a misspecified distribution function F_Z = expit for linear shift transformation models ("Linear β(x)"). More severe deviations occurred when an incorrect F_Z = expit was used for model specification in Algorithms 1 and 2 under the nonlinear shift transformation model "Nonlinear β(x)", the distribution regression model "Linear ϑ(x)", and the conditional transformation model "Nonlinear ϑ(x)". The absolute differences in the corresponding out-of-sample log-likelihoods were, however, marginal in most of these cases.
When the asymmetric standard minimum extreme value distribution was used (F_Z = MEV), the distortions were more pronounced. The general pattern observed for F_Z = expit was the same, but the centered out-of-sample log-likelihoods were in general smaller in this setup. Visualizations of the distributions underlying the figures in Table 5 are presented in Hothorn (2019).

Discussion
Models defined in terms of simple linear transformation functions up to models featuring unstructured complex transformation functions can be specified, estimated, evaluated, and compared in the unified computational framework of Algorithms 1 and 2. Data analysts are no longer limited in their freedom to define and estimate transformation models, because the strong ties between models of a certain complexity and a tailored estimation procedure (such as CTM-CRPS-boosting for additive models or transformation forests for interaction models) can be cut with the boosting algorithms presented here.
For model specification, the choice of F_Z is important in simple shift transformation models because it affects the interpretation of model parameters (log-odds ratios vs. log-hazard ratios, for example). In more complex models, a direct interpretation of parameters is hardly possible, and the estimated conditional distribution functions are insensitive to the choice of F_Z (Hothorn et al. 2018). However, one could use Algorithm 1 to estimate an unstructured log-hazard function a(y)⊤ϑ(x) + log(a′(y)⊤ϑ(x)) in the model P(Y ≤ y | X = x) = 1 − exp(−exp(a(y)⊤ϑ(x))), with ϑ(x) being, for example, the sum of B deep trees. The log-likelihood risk function employed here, which is also able to handle time-varying covariates through appropriate truncation, avoids the technical obstacles reported by Lee and Chen (2018) when defining an appropriate nonparametric risk function for boosting in a class of models for conditional log-hazard functions.

Computational details
A reference implementation of transformation boosting machines (Algorithms 1 and 2) is available in the tbm package (Hothorn 2019). Analyses of all applications and simulation results can be reproduced in the dynamic document Hothorn (2019). All computations were performed using R version 3.5.2 (R Core Team 2018).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Table 2: Median out-of-sample log-likelihoods (centered by out-of-sample log-likelihoods of the corresponding unconditional model) from 100 subsampled divisions into learning and validation samples; larger values indicate a superior performance of the conditional model, taking predictor variables into account, over the unconditional model. Interquartile ranges are given in parentheses. Boosting CTM Likelihoods (parameter ϑ(x)) with nonlinear (N), linear (L), and tree-based (T, depth two) basis functions b as well as Boosting STM Likelihoods (parameter β(x)) with the same basis functions were compared to transformation trees and transformation forests. The result of the best-performing model is printed in boldface.

Table 4: The coefficient functions (…, β_3) are Bernstein polynomials of order M = 6 on the interval (−4, 6) (see Hothorn et al. 2018).

Fig. 1: Artificial data-generating processes (DGPs). Conditional densities given two groups (left and right panels) and x ∈ [0, 1] (gray color coding) for four different data-generating processes.

Table 5: Artificial data-generating processes (DGPs). Median out-of-sample log-likelihoods (centered by out-of-sample log-likelihoods of the true model) based on 100 simulation runs with varying sample sizes N = 75, 150, 300, numbers of additional noninformative predictor variables J+ = 0, 5, 25, choices of F_Z (Φ = standard normal, expit = standard logistic, and MEV = standard minimum extreme value), and choices of the Bernstein polynomial order M = 6, 12, for four different DGPs and Boosting CTM Likelihoods (parameter ϑ(x)) with nonlinear (N), linear (L), and tree-based (T, depth two) basis functions b as well as Boosting STM Likelihoods (parameter β(x)) with the same basis functions.