Automatic piecewise linear regression

von Ottenbreit, Mathias; De Bin, Riccardo

doi:10.1007/s00180-024-01475-4

Original Paper
Open access
Published: 01 March 2024

Volume 39, pages 1867–1907, (2024)

Computational Statistics

Abstract

Regression modelling often presents a trade-off between predictiveness and interpretability. Highly predictive and popular tree-based algorithms such as Random Forest and boosted trees predict very well the outcome of new observations, but the effect of the predictors on the result is hard to interpret. Highly interpretable algorithms like linear effect-based boosting and MARS, on the other hand, are typically less predictive. Here we propose a novel regression algorithm, automatic piecewise linear regression (APLR), that combines the predictiveness of a boosting algorithm with the interpretability of a MARS model. In addition, as a boosting algorithm, it automatically handles variable selection, and, as a MARS-based approach, it takes into account non-linear relationships and possible interaction terms. We show on simulated and real data examples how APLR’s performance is comparable to that of the top-performing approaches in terms of prediction, while offering an easy way to interpret the results. APLR has been implemented in C++ and wrapped in a Python package as a Scikit-learn compatible estimator.

1 Introduction

Prediction models are central to modern data science. Being able to correctly predict an outcome of interest can allow better planning and may provide a competitive advantage. While high predictiveness is crucial, it is often also necessary to explain the prediction, for example to understand the reasons why a model predicts the way it does. This can be legally mandatory: for example, an insurance company using a customer scoring model to price customers must be able to justify why it has increased the premium of a customer. It can also be useful for making decisions: for example, a feature correlated with an increase in sales of a specific product can be highlighted in future advertisements. Even from a purely statistical perspective, it is easier to perform goodness-of-fit checks on model predictions when the model is interpretable.

Regression modelling often presents a trade-off between predictiveness and interpretability. Highly predictive and popular tree-based algorithms such as Random Forest (Breiman 2001) and boosted trees perform very well in terms of prediction, but have very limited interpretability. Interpretable algorithms such as linear effect-based boosting and MARS (Friedman 1991) are totally interpretable but are typically less predictive.

One way to tackle the interpretability issue is to use frameworks that attempt to interpret black-box models. LIME (Ribeiro et al. 2016) and Shapley values (Shapley 1997) are two methods that implement this idea. LIME attempts to explain predictions from a black-box model by using local and interpretable models trained on a subset of the data. Shapley values attempt to estimate how much each predictor contributes to a prediction by running the black-box model many times with different predictor values. Unfortunately, both LIME and Shapley values have important drawbacks, for example instability (LIME) and a high computational cost (Shapley values). A different strategy consists of developing an algorithm that provides a model intrinsically interpretable. The challenge, then, is to get as close as possible to the predictive ability offered by the non-interpretable methods. This is the aim of our work.

Another trade-off that should be taken into account when considering prediction models is the ease of use. Sophisticated methods can perform well, but may never be used in practice. For example, in a company, an algorithm that is easy to use is preferable because it may increase productivity by reducing the model development time. Variable selection, handling of interactions, and non-linear relationships are tasks that can be time-consuming to address. Algorithms such as Random Forest, boosted trees and MARS handle those tasks automatically, while more classical approaches often leave at least some of those tasks to the user. To interpret the predictions from Random Forests or boosted trees with LIME and/or Shapley values, one needs to run these methods on top of the underlying regression models. This adds code complexity and decreases ease of use.

In this paper, a new regression algorithm, Automatic Piecewise Linear Regression (APLR), is proposed. APLR automatically handles variable selection, interactions, and non-linear relationships, so APLR has an ease of use comparable to Random Forest and boosted trees. Our empirical results show that APLR is able to compete with boosted trees and Random Forest on predictiveness. Importantly, in contrast to the latter two approaches, APLR provides highly interpretable models. While introduced here only for regression, APLR can be easily extended to classification tasks (see Sect. 6).

The rest of the paper is organised as follows: the novel APLR algorithm is described in Sect. 3 after a brief review, in Sect. 2, of the basic tools on which it is built. Sections 4 and 5 contrast APLR to other relevant regression algorithms on simulated and real data, respectively. Finally, some remarks in Sect. 6 conclude the paper.

2 Background

There are two important concepts that are central in the APLR algorithm: gradient boosting (Friedman et al. 2000) and MARS (Friedman 1991). We briefly review them in this section, focusing, as in most of the paper, on regression.

2.1 Gradient boosting

Gradient boosting builds on boosting, a supervised learning method introduced by Schapire (1990) and Freund (1995) to address the fundamental question of whether an ensemble of weak learners could produce a good estimate. From a modelling perspective, gradient boosting is a forward stagewise additive procedure in which, at each stage (boosting step), a base learner is fitted to the negative gradient of the loss function computed at the estimate from the previous boosting step. In order to better explore the model space, the base learners are made artificially weak, in the sense of reduced fitting ability, by a penalty term (or learning rate). A base learner can be any fitting procedure, such as a regression tree or a linear regression, that relates the predictors to the negative gradient. The final estimate is the ensemble of the results from each base learner. One way of conceptually understanding how gradient boosting works is the following: in each boosting step, a base learner is trained to predict the residuals of the model made by all the base learners fitted in the previous boosting steps.

There are two important tuning parameters in gradient boosting: a penalty term, called the learning rate, $v\in (0,1]$, and the number of boosting steps, $m_\textrm{stop}$. Normally, the former is set to a default value ($v = 0.1$), and the latter is computed via bootstrapping or cross-validation. $m_\textrm{stop}$ is a crucial parameter: a too-small value leads to underfitting, i.e., a situation in which the algorithm is not able to capture the relationship between predictor and response; in contrast, a too-large value leads to overfitting, i.e., the resulting model also tries to explain the randomness in the training data and the resulting prediction rule is not generalizable to new data. In addition to $m_\textrm{stop}$ and $v$, the base learners may also have tuning parameters of their own that should be computed, for example the depth of a regression tree.

A particularly important version of the boosting algorithm is the so-called componentwise gradient boosting (Bühlmann and Yu 2003), in which at each boosting step only simple (understood as involving only one predictor) base learners are considered. One simple base learner is fitted for each predictor to the negative gradient, and the one that helps the most to minimise the loss function is kept. The procedure is computationally heavier than the basic method, but it allows variable selection and extending the use of boosting to high-dimensional settings. A description of the componentwise gradient boosting algorithm is given in Algorithm 1.
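The selection mechanism described above can be illustrated with a minimal sketch. It is not a reproduction of Algorithm 1: it assumes squared error loss, centred predictors with non-zero variance, and a one-predictor least-squares fit without intercept as the simple base learner. The function name `componentwise_l2_boost` and its arguments are hypothetical and chosen only for this illustration.

```python
import numpy as np

def componentwise_l2_boost(X, y, m_stop=100, v=0.1):
    """Illustrative componentwise boosting with squared error loss.

    Assumes the columns of X are centred and not constant.
    Returns the intercept and one coefficient per predictor.
    """
    n, p = X.shape
    intercept = float(y.mean())      # start from the null model
    coef = np.zeros(p)
    pred = np.full(n, intercept)
    denom = (X ** 2).sum(axis=0)     # per-predictor sum of squares
    for _ in range(m_stop):
        u = y - pred                 # negative gradient of 0.5 * (y - f)^2
        # Least-squares slope of u on each single predictor.
        slopes = (X * u[:, None]).sum(axis=0) / denom
        # Residual sum of squares of each candidate base learner.
        rss = ((u[:, None] - X * slopes) ** 2).sum(axis=0)
        j = int(np.argmin(rss))      # keep only the best predictor
        coef[j] += v * slopes[j]     # shrink the update by the learning rate
        pred += v * slopes[j] * X[:, j]
    return intercept, coef
```

Predictors that are never selected keep a coefficient of exactly zero, which is how the componentwise scheme performs variable selection.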

2.2 Multivariate adaptive regression splines (MARS)

MARS (Friedman 1991) is also a supervised learning algorithm that fits a model by iteratively adding new variables into it. It works like a classical stepwise linear regression procedure, as it starts from the null model (only intercept) and adds new terms step by step. In MARS, however, two piecewise linear basis functions of the selected predictor (or of an interaction term) are added to the model at each step. These basis functions form a so-called reflected pair around a constant, $t$. When the predictor value is smaller than the constant $t$, one of the two basis functions is zero and the other basis function is negative and linear. Similarly, when the value is larger than $t$, the function that was zero when $x<t$ is positive and linear, while the other basis function is zero. These basis functions work locally because their values may be zero for wide ranges of values, and allow the procedure to save degrees of freedom and to capture non-linearities in the relation between predictors and response.

MARS also has tuning parameters that must be chosen, the most important of which is the number of iterations $max\_terms$. Its role is similar to that of $m_\textrm{stop}$, so values that are too small lead to underfitting and values that are too large lead to overfitting. In practice, optimizing this parameter leads to models that are in general too complex, so a pruning procedure is often implemented to improve the results.

An important feature of MARS is that it handles interactions among variables. At each step, in addition to the candidate pair of basis functions of a predictor, potential interactions between the candidate pair and a term that already belongs to the model are also considered. It is possible to set an upper limit on the order of interaction through an additional tuning parameter $max\_degree$.

3 Automatic piecewise linear regression

APLR is fundamentally a gradient boosting procedure adapted to allow the application of specific base learners inspired by the MARS algorithm. We start this section by introducing the two base learners. Later we will see how they are incorporated into the boosting procedure to generate APLR. Finally, we describe how the algorithm is fitted, including the choice of the tuning parameters.

3.1 APLR basis functions

Being the building block of boosting, a base learner is a function used to capture the effect of the predictors on the response (through the negative gradient). In addition to the simple linear effect, APLR uses specific basis functions that are able to capture non-linearity and interactions through local effects. There are two types of basis functions available, for main and interaction effects, respectively.

3.1.1 APLR basis functions without interactions

APLR basis functions without interactions are similar to the basis functions used in MARS, but they are used differently. In MARS, a reflected pair of basis functions is entered into the model at each step, while in APLR only one basis function can enter in a boosting step. The first reason is that in componentwise gradient boosting the base learner only uses one dimension. The second is that in gradient boosting it is advantageous to use weak learners. A single basis function is a weaker learner than a pair of them. Definition 1 formally defines APLR basis functions without interactions.

Definition 1

(APLR basis function without interactions) A basis function in APLR for a predictor x is one of the following two piecewise linear functions:

$$\begin{aligned} &\max (x-t,0), \\ &\min (x-t,0), \end{aligned}$$

where t is a value that defines the split point for the basis function. A basis function of the form $\max (x-t,0)$ is defined as a right basis function because non-zero values of it are to the right of the split point when plotted on the x-axis in a chart. Conversely, a basis function of the form $\min (x-t,0)$ is defined as a left basis function. The split point for a right basis function is defined as the right split point and the split point for a left basis function is defined as the left split point. The number of effective observations $n_{eff}$ is defined as the number of observations that do not get a zero value due to the max or min functions.

These basis functions have the ability to work locally since their values can be zero for wide ranges of predictor values. This also enables them to be weaker learners than a linear effect which is useful in gradient boosting. The type of APLR basis functions described in Definition 1 cannot handle interactions unless x itself is an interaction term.
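Definition 1 translates directly into a few lines of NumPy. The sketch below is illustrative only: the helper names `right_basis`, `left_basis` and `n_effective` are hypothetical, and counting an observation with $x = t$ as non-effective is an assumption made here, since its basis function value is zero.

```python
import numpy as np

def right_basis(x, t):
    """Right basis function max(x - t, 0): non-zero only to the right of t."""
    return np.maximum(x - t, 0.0)

def left_basis(x, t):
    """Left basis function min(x - t, 0): non-zero only to the left of t."""
    return np.minimum(x - t, 0.0)

def n_effective(values):
    """Number of observations with a non-zero basis function value.

    Assumption: observations exactly at the split point count as zeroed.
    """
    return int(np.count_nonzero(values))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
right = right_basis(x, t=3.0)   # zero for x <= 3, linear above
left = left_basis(x, t=3.0)     # linear below, zero for x >= 3
```

Because each function is zero on one side of the split point, a single term only changes predictions for part of the data, which is what makes it a weaker learner than a global linear effect.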

3.1.2 APLR basis functions with interactions

The MARS-like basis functions described in Sect. 2 work well in the case of independent predictors, but may have problems when interaction terms are relevant. In MARS interactions are handled by allowing terms that are products of MARS basis functions. Such product terms may cause problems. For example, higher-order interactions form higher power products that may result in interaction terms with very large values (potentially causing computational problems) or very small values (potentially causing rounding errors), depending on the data. Another problem is related to the meaning of the interaction term when the sign of the predictors that interact changes. To illustrate this, let $x_1$ and $x_2$ be two predictors and let $x_{12} = x_1 \cdot x_2$ be the estimated interaction between them. The combination $x_1 = 1$ and $x_2 = -1$ gives $x_{12} = -1$. But the combination of $x_1 = -1$ and $x_2 = 1$ also gives $x_{12} = -1$. These two sets of combinations could have vastly different response values, but $x_{12}$ would not be able to discriminate between them.

In APLR, interactions are handled in a way that avoids the aforementioned problems and that resembles the handling of interactions in regression trees. Namely, interactions are formed by subsetting the data. As an example, let the first split in a regression tree be on $x_1 \le 50$. The next split could be on $x_2 > 10$ when $x_1 \le 50$. Then $x_2$ and $x_1$ form a local interaction when $x_1 \le 50$. An APLR basis function with an interaction term gets values of zero when the interaction term has a value of zero. This type of basis function produces interaction terms that work on local subsets of the data. Definition 2 formally defines these basis functions.

Definition 2

(APLR basis function with interactions) A basis function in APLR with interactions is similar to Definition 1 except that the form can be either of the following:

$$\begin{aligned} &\max (x-t,0) \cdot \mathbbm {1}(i \ne 0), \\ &\min (x-t,0) \cdot \mathbbm {1}(i \ne 0), \end{aligned}$$

where i is an APLR basis function of a potentially different predictor, with or without interactions. Here $\mathbbm {1}$ denotes the indicator function that assumes the value 1 if its argument is true and 0 otherwise. The depth of interactions is called interaction level. The interaction level is zero for a basis function without interactions (see Definition 1). For a basis function with interactions, the interaction level is one more than the interaction level of i. The number of effective observations $n_{eff}$ is defined as in Definition 1, except that it also excludes observations that get a zero value due to the indicator function.
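A small sketch may help to see how Definition 2 restricts a basis function to a local subset of the data. The function name `interaction_basis` and its `side` argument are hypothetical; the example values loosely mirror the regression tree illustration above, with a left basis function of $x_1$ at 50 acting as the term $i$.

```python
import numpy as np

def interaction_basis(x, t, partner, side="right"):
    """Basis function of x that is forced to zero wherever `partner` is zero.

    `partner` holds the values of another APLR basis function (the term i).
    """
    if side == "right":
        base = np.maximum(x - t, 0.0)
    else:
        base = np.minimum(x - t, 0.0)
    return base * (partner != 0)    # indicator 1(i != 0)

x1 = np.array([10.0, 40.0, 60.0, 80.0])
x2 = np.array([5.0, 20.0, 30.0, 8.0])

partner = np.minimum(x1 - 50.0, 0.0)           # non-zero only where x1 < 50
term = interaction_basis(x2, 10.0, partner)    # active where x2 > 10 and x1 < 50
n_eff = int(np.count_nonzero(term))
```

Only the second observation has both $x_1 < 50$ and $x_2 > 10$, so it is the only one with a non-zero value. Unlike a product term $x_1 \cdot x_2$, the value of the term depends on $x_2$ alone within the subset selected by $i$.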

3.2 APLR fitting procedure

As a boosting-like approach, APLR follows, at a high level, the steps described in Algorithm 1. The implementation, however, is more elaborate, and it is explained in detail in the following.

3.2.1 Initialization

APLR starts with a zero intercept term and no other terms in the model. This is similar to the initialization step of Algorithm 1.

In the first boosting step the set of potential terms that can enter into the model consists of the APLR basis functions without interactions (Definition 1) of all available predictors. This set is called $\varvec{P}$. After a term other than the intercept has entered the model, $\varvec{P}$ can potentially expand in each following boosting step if interaction terms are added to the model using the basis functions of Definition 2 (Sect. 3.1.2). $\varvec{P}$ can grow large and it can become computationally heavy to evaluate each potential term in every boosting step. APLR provides tuning parameters that can prevent all terms in $\varvec{P}$ from being evaluated in each boosting step. This process is described in Sect. 3.2.5. To facilitate this functionality the set $\varvec{E}$ holds terms that can be evaluated in the next boosting step. Initially $\varvec{E} = \varvec{P}$ so that all terms in $\varvec{P}$ are eligible in the first boosting step.

The final part of the initialization step is to define an empty set $\varvec{C}_0$ for storing terms other than the intercept that are included in the model. In a general boosting step m, $\varvec{C}_m$ can increase by up to one additional term. If $\varvec{C}_m$ does not increase in a boosting step, then the regression coefficient for a term already in $\varvec{C}_{m-1}$ can be updated.

Algorithm 2 summarizes how APLR is initialized.

3.2.2 Componentwise boosting step

Each boosting step starts with a calculation of the negative gradient. For ease of description, we focus on the squared error loss function. The negative gradient is computed at the model estimate from the previous boosting step. For a generic boosting step m, the set that holds terms in the model other than the intercept, $\varvec{C}_m$, is initialized to be the same as in the previous boosting step ($\varvec{C}_{m-1}$). The intercept is updated in each boosting step.
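As a worked illustration, assume that the squared error loss is written with the conventional factor of one half (an assumption made here, since the scaling is not specified) and write $\hat{f}_{m-1}$ for the model estimate from the previous boosting step (notation introduced only for this illustration). The negative gradient for observation $i$ then reduces to the ordinary residual:

$$\begin{aligned} L\bigl(y_i, \hat{f}_{m-1}(x_i)\bigr) &= \tfrac{1}{2}\bigl(y_i - \hat{f}_{m-1}(x_i)\bigr)^2, \\ u_{m,i} &= -\frac{\partial L}{\partial \hat{f}_{m-1}(x_i)} = y_i - \hat{f}_{m-1}(x_i). \end{aligned}$$

Under this convention, fitting a basis function to $\varvec{u}_m$ amounts to fitting it to the current residuals.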

The next step is to find the optimal split points for each eligible term in $\varvec{E}$ and to consider interaction terms. These parts of the procedure are described in Sects. 3.2.3 and 3.2.4, respectively. The following cases are possible:

Add a new term from $\varvec{E}$ to the model ($\varvec{C}_m$).
Update a term already in $\varvec{C}_m$ that is also in $\varvec{E}$.
Add a new interaction term to $\varvec{C}_m$.
Terminate the boosting procedure if none of the above options (or updating the intercept) reduces the training error. In this case, no more boosting steps are carried out.

The choice that results in the lowest loss is selected. Unless the boosting procedure is terminated, the eligibility of terms ($\varvec{E}$) for the next boosting step is updated. This is described in Sect. 3.2.5. Algorithm 3 formally describes how the componentwise boosting step is performed in APLR.

3.2.3 Fitting an APLR basis function to the negative gradient

When fitting an APLR basis function to the negative gradient $\varvec{u}_m$, the first step is to determine if there are any observations for which the APLR basis function will be zero as a consequence of interactions (see Sect. 3.1.2). For such observations, the prediction from a linear regression model using the APLR basis function as the only predictor would be zero and the loss contribution would not change from the prior boosting step. It is computationally more efficient to avoid recalculating the loss for such observations. Therefore such observations are excluded from the remaining steps except that the loss contribution from them (unchanged from the previous boosting step) is used in the final step to determine the overall loss for the APLR basis function.

APLR has a tuning parameter to control model robustness, $min\_observations\_in\_split$. It prevents terms with a lower number of effective observations ($n_{eff}$) than its value from entering into the model ($\varvec{C}_m$). This tuning parameter is comparable with the minimum node size in a regression tree. The main idea is to avoid having terms in the model that rely on too few observations. The default value for $min\_observations\_in\_split$ is 20. For large datasets a larger value of $min\_observations\_in\_split$ is recommended, while for very small datasets a lower value may be preferred. If $n_{eff}$ is less than $min\_observations\_in\_split$, then the fitting procedure is aborted, setting loss to infinity so that the APLR basis function cannot enter into the model.

One of the key aspects of fitting an APLR basis function to the negative gradient is to find the optimal splitting point. Searching for this point by iterating through all observations is computationally intensive. To ease the computational burden, APLR implements an approximation technique inspired by the algorithm used in the XGBoost implementation of gradient tree boosting (Chen and Guestrin 2016). XGBoost discretizes data into bins and uses the discretized data to find optimal splits.

APLR sorts predictor values $\varvec{x}$, the negative gradient $\varvec{u}_m$ and, if provided, sample weights $\varvec{w}$, ascending by $\varvec{x}$. Then APLR discretizes these sorted vectors into bins. The maximum number of bins that APLR can create in this process is determined by the tuning parameter $bins$. The default value of $bins$ is 300. This value decreases the computational burden significantly for larger datasets and does not seem to degrade predictiveness (see Sect. 4.5).

When splitting the data into bins, APLR first finds the left edges of the bins. The left edge of a bin is the lowest value of $\varvec{x}$ in the bin. The first observation in the sorted $\varvec{x}$ is always a left edge since it has the lowest value of $\varvec{x}$. Apart from that, the first or last $min\_observations\_in\_split$ observations cannot be left edges. Potential left edges are found by iterating through the sorted $\varvec{x}$, starting from the lowest value. Potential left edges are required to have a higher value of $\varvec{x}$ than the previous observation, otherwise the bins would overlap.

If the number of potential left edges, $b$, is less than $bins$, then APLR creates a bin for each potential left edge. For ordered categorical variables with no more categories than $bins$, this enables each category to get a separate bin. If $b=0$, then one bin will contain all the observations. In the latter case, the APLR basis function cannot have split points and can only be used as a linear effect. If $b > bins$, then APLR first creates the two bins that have the lowest and highest potential left edges. This ensures that bin edges are placed as close as possible after the first and before the last $min\_observations\_in\_split$ observations. Then, APLR calculates a minimum number of observations, $n_{min}$, that any further bins must contain, so that the number of bins created does not exceed $bins$. By iterating through the remaining potential left edges, further bins are added under the constraint that they must contain at least $n_{min}$ observations. Once the left edges for all the bins are known, it is trivial to compute the right edges.

For each bin, the discretized values of $\varvec{x}$ and $\varvec{u}_m$ are their averages based on the observations in the bin. If sample weights were provided by the user, then, for each bin, the discretized values of $\varvec{w}$ are sums of $\varvec{w}$ for observations in the bin. Otherwise, the discretized values of $\varvec{w}$ are, for each bin, the number of observations in the bin. The goal is to weight the bins by the number of observations that they contain.
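The per-bin summaries can be sketched as follows. This is a deliberately simplified illustration, not the edge-finding procedure described above: it assumes equal-count bins obtained with `numpy.array_split`, so tied values of the predictor may end up in different bins and the $min\_observations\_in\_split$ constraint on edges is ignored. The function name `discretize` is hypothetical.

```python
import numpy as np

def discretize(x, u, w=None, bins=300):
    """Summarise sorted data into at most `bins` equal-count bins.

    Simplifying assumption: bins are equal-count chunks of the sorted data,
    not the left-edge construction described in the text.
    """
    order = np.argsort(x, kind="stable")
    x, u = x[order], u[order]
    # Without sample weights, each observation counts once.
    w = np.ones(len(x)) if w is None else w[order]
    n_bins = min(bins, len(x))
    chunks = np.array_split(np.arange(len(x)), n_bins)
    x_d = np.array([x[c].mean() for c in chunks])   # bin average of x
    u_d = np.array([u[c].mean() for c in chunks])   # bin average of u
    w_d = np.array([w[c].sum() for c in chunks])    # bin weight
    return x_d, u_d, w_d
```

With no sample weights, the returned bin weight equals the number of observations in the bin, so fuller bins carry more influence in the split search.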

To increase computational efficiency, the creation of bins and discretization of $\varvec{x}$ and $\varvec{w}$ is only executed the first time that the APLR basis function is fitted to the negative gradient. For APLR basis functions without interactions, which are eligible in the first boosting step, this happens only in the first boosting step. For an APLR basis function with interactions this only happens in the boosting step when the basis function is added to $\varvec{P}$. However, discretization of $\varvec{u}_m$ happens in every boosting step when the APLR basis function is eligible.

The next step is to find the best split point by using the discretized data $\varvec{x}_d$, $\varvec{u}_{m, d}$ and $\varvec{w}_d$. A copy of the APLR basis function is made. This copy, here defined as $f(\varvec{x}_d)$, uses $\varvec{x}_d$ as predictor instead of $\varvec{x}$. For each bin, the loss is calculated for the left and for the right split points, respectively. In addition, the loss is calculated for a linear effect of $\varvec{x}_d$ (without any split). The split point (or linear effect) with the lowest loss is selected. If there is a tie, then the split point (or linear effect) giving the largest $n_{eff}$ is preferred to increase model robustness. When calculating the loss for a split point, the weighted linear regression coefficient $\beta _d$ is estimated as follows,

$$\begin{aligned} \beta _d = v \cdot \frac{\sum _{i=1}^{bins} f(x_{d,i}) \cdot w_{d,i} \cdot u_{m,d,i}}{\sum _{i=1}^{bins} f(x_{d,i})^2 \cdot w_{d,i}}, \end{aligned}$$

where $v \in (0,1]$ is the learning rate. The loss is then $L_d = (\varvec{u}_{m,d} - f(\varvec{x}_d) \cdot \beta _d)^T \cdot (\varvec{u}_{m,d} - f(\varvec{x}_d) \cdot \beta _d)$.
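The split search on the discretized data can be sketched as below, using the coefficient and loss formulas above. Two choices here are assumptions rather than details taken from the text: each bin's discretized predictor value is used as its candidate split point, and ties are resolved by keeping the first candidate instead of the one with the largest $n_{eff}$. The function name `best_split` is hypothetical.

```python
import numpy as np

def best_split(x_d, u_d, w_d, v=0.1):
    """Return (loss, kind, split_point, beta) of the best candidate.

    Candidates: a linear effect plus a right and a left basis function
    for every discretized predictor value (assumed split points).
    """
    candidates = [("linear", None, x_d)]
    for t in x_d:
        candidates.append(("right", t, np.maximum(x_d - t, 0.0)))
        candidates.append(("left", t, np.minimum(x_d - t, 0.0)))
    best = None
    for kind, t, f in candidates:
        denom = np.sum(f ** 2 * w_d)
        if denom == 0:                   # basis function is zero everywhere
            continue
        beta = v * np.sum(f * w_d * u_d) / denom
        resid = u_d - f * beta
        loss = float(resid @ resid)      # L_d as defined in the text
        if best is None or loss < best[0]:
            best = (loss, kind, t, beta)
    return best
```

The search costs a number of evaluations proportional to the number of bins rather than the number of observations, which is where the computational saving comes from.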

Finally, the loss is calculated for the original APLR basis function, $f(\varvec{x})$, using the approximately optimal split point (or linear effect) that was estimated on the discretized data. The weighted linear regression coefficient $\beta$ is estimated as follows:

$$\begin{aligned} \beta = v \cdot \frac{\sum _{i=1}^{n_{eff}} f(x_{i}) \cdot w_{i} \cdot u_{m,i}}{\sum _{i=1}^{n_{eff}} f(x_{i})^2 \cdot w_{i}} \end{aligned}$$

If sample weights were not provided by the user then $\beta$ is estimated without the $w$ terms. The loss is $L = (\varvec{u}_m - f(\varvec{x}) \cdot \beta )^T \cdot (\varvec{u}_m - f(\varvec{x}) \cdot \beta ) + L_0$ where $L_0$ represents the loss from the observations excluded in the first step of this section. Algorithm 3 selects candidates to enter into $\varvec{C}_m$ (the model) based on $L$ for each APLR basis function considered.
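The bookkeeping for excluded observations can be sketched as follows. The sketch assumes squared error loss, so that the unchanged loss contribution of an excluded observation is taken to be its squared negative gradient; this reading of $L_0$ is an assumption. The function name `fit_on_original` and the boolean mask `excluded` are hypothetical.

```python
import numpy as np

def fit_on_original(f_x, u, excluded, v=0.1, w=None):
    """Coefficient and loss of a basis function on the undiscretized data.

    f_x      : basis function values at the chosen split point
    u        : negative gradient
    excluded : True for observations zeroed by an interaction term
    """
    keep = ~excluded
    f, r = f_x[keep], u[keep]
    wk = np.ones(len(f)) if w is None else w[keep]
    # Observations with f == 0 add nothing to either sum, so summing over
    # all kept observations equals summing over the effective ones.
    beta = v * np.sum(f * wk * r) / np.sum(f ** 2 * wk)
    resid = r - f * beta
    loss_0 = float(np.sum(u[excluded] ** 2))   # assumed form of L_0
    return beta, float(resid @ resid) + loss_0
```

Adding the excluded contribution back keeps the losses of terms with and without interactions on the same scale, so they can be compared directly when a candidate is selected.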

Algorithm 4 summarizes how an APLR basis function is fitted to the negative gradient.

3.2.4 Considering interactions

In each boosting step, before possible interactions are considered, APLR has already found a candidate term for model update from $\varvec{E}$ (see Sect. 3.2.2). Then APLR considers the possible interactions between terms already in the model other than the intercept ($\varvec{C}_m$) and terms in $\varvec{E}$ without interactions. If any interaction terms are added, then they are added to both $\varvec{P}$ and $\varvec{E}$. An interaction term is an APLR basis function with interactions (Definition 2), where the predictor, $\varvec{x}$, is the predictor used in a term from $\varvec{E}$ and i is a term in $\varvec{C}_m$. Considering all possible interactions may be computationally intensive. APLR can reduce the number of interaction terms to evaluate with the help of three tuning parameters:

$max\_interactions$ specifies the maximum number of interaction terms that can be added to $\varvec{P}$.
$max\_interaction\_level$ specifies the maximum interaction level allowed in an interaction term.
$max\_eligible\_terms$ sets a limit on how many of the terms in $\varvec{C}_m$ can be considered as interaction partners for terms in $\varvec{E}$.

Note that if $max\_eligible\_terms$ is smaller than the number of terms in $\varvec{C}_m$, then the $max\_eligible\_terms$ terms in $\varvec{C}_m$ with the lowest previous loss are considered. For each term in $\varvec{C}_m$, the previous loss is the loss in the most recent boosting step when the term was either added to $\varvec{C}$ or had its regression coefficient updated.

The loss is calculated for each interaction term fitted to the negative gradient. Only interaction terms having a lower loss than the candidate term for model update from $\varvec{E}$ (see Sect. 3.2.2) can be added to $\varvec{P}$ and $\varvec{E}$. These interaction terms are added to $\varvec{P}$ and $\varvec{E}$ starting with the term having the lowest loss, then the term having the second lowest loss, and so on, as long as the total number of interaction terms in $\varvec{P}$ does not exceed $max\_interactions$. The reasons for not adding terms with higher losses are:

To increase the chance that interaction terms in $\varvec{C}_m$ are predictive. This can be especially relevant if $max\_interactions$ is low. In such case, it can be advantageous to only add the most promising interaction terms.
To avoid evaluating terms that are likely less predictive in future boosting steps. This can potentially reduce the computational burden.

If any terms are added to $\varvec{P}$ and $\varvec{E}$ by the above procedure, then the term with the lowest loss becomes a candidate for entry to the model.

Algorithm 5 formally describes how interaction terms are considered.
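The selection rule for interaction terms can be illustrated with a short sketch. It is not Algorithm 5: it only covers the filtering and ordering of already-evaluated interaction terms, and the function name `select_interactions` and its dictionary input are hypothetical.

```python
def select_interactions(candidate_loss, interaction_losses,
                        n_existing, max_interactions):
    """Pick the interaction terms to add to P and E, best first.

    candidate_loss     : loss of the candidate term found from E
    interaction_losses : mapping from term name to its loss
    n_existing         : interaction terms already in P
    """
    # Keep only interaction terms that beat the current candidate.
    better = sorted(
        (loss, name)
        for name, loss in interaction_losses.items()
        if loss < candidate_loss
    )
    # Respect the cap on the total number of interaction terms in P.
    room = max(max_interactions - n_existing, 0)
    return [name for _, name in better[:room]]

added = select_interactions(
    candidate_loss=10.0,
    interaction_losses={"a": 12.0, "b": 8.0, "c": 9.5, "d": 7.0},
    n_existing=1,
    max_interactions=3,
)
```

In this example three terms beat the candidate, but only two slots remain, so the two with the lowest loss are added and the first of them becomes the new candidate for entry to the model.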

3.2.5 Updating eligibility of terms

Evaluating all terms in $\varvec{P}$ in every boosting step may be computationally costly. At the end of each boosting step APLR decides which terms in $\varvec{P}$ will be eligible in the next boosting step by redefining $\varvec{E}$. First, only the $max\_eligible\_terms$ terms in $\varvec{E}$ with the lowest loss are kept. The main idea is to avoid evaluating less predictive terms in every boosting step. Terms removed from $\varvec{E}$ become ineligible for the next $ineligible\_boosting\_steps\_added$ boosting steps. Finally, terms that have already been ineligible for $ineligible\_boosting\_steps\_added$ boosting steps are reentered into $\varvec{E}$. The reason is that a previously less predictive term may become more predictive compared to other terms in the future.

The above tuning parameters allow the user to control how and if the terms can become ineligible for some future boosting steps. The aim is to reduce the computational burden without significantly degrading predictiveness. The default values for $max\_eligible\_terms$ and $ineligible\_boosting\_steps\_added$ are 5 and 10 respectively. These defaults can notably reduce the computational burden and do not seem to degrade predictiveness (see Sect. 4.5).

Algorithm 6 summarizes how the eligibility of terms is updated at the end of each boosting step.
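The eligibility update can be sketched as below. This is not Algorithm 6: the countdown bookkeeping (a per-term counter of remaining ineligible steps) is an assumption about one possible implementation, and the function name `update_eligibility` is hypothetical.

```python
def update_eligibility(losses, ineligible_for,
                       max_eligible_terms=5,
                       ineligible_boosting_steps_added=10):
    """Return the new eligible list E and the updated ineligibility counters.

    losses         : mapping term -> loss for terms currently in E
    ineligible_for : mapping term -> remaining ineligible boosting steps
    """
    # One boosting step has passed for terms that were already ineligible.
    remaining = {term: k - 1 for term, k in ineligible_for.items()}
    # Keep the terms with the lowest loss, suspend the rest.
    ranked = sorted(losses, key=losses.get)
    kept = ranked[:max_eligible_terms]
    for term in ranked[max_eligible_terms:]:
        remaining[term] = ineligible_boosting_steps_added
    # Terms that have served their suspension re-enter E.
    reentered = [term for term, k in remaining.items() if k <= 0]
    for term in reentered:
        del remaining[term]
    return kept + reentered, remaining
```

A larger `ineligible_boosting_steps_added` saves more evaluations per step but delays the moment at which a suspended term can be reconsidered.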

3.3 Finding APLR’s tuning parameters

3.3.1 Number of boosting iterations

As mentioned in Sect. 2.1, the most important tuning parameter to determine in a boosting algorithm is the optimal number of boosting steps $m_{stop}$. Tuning $m_{stop}$ in APLR by doing a grid search or similar would be computationally expensive.

Because APLR uses parametric learners, it is possible to store regression coefficients for each boosting step at negligible computational cost. APLR automatically tunes $m_{stop}$ by performing cross-validation. For each training data subset an APLR model is trained. Afterwards $m_{stop}$ is set to the value that minimized validation loss on the hold-out fold, retrieving the stored regression coefficients for that boosting step. Finally the models are merged. For example, the intercept for the final model is the average (weighted by the sum of observation weights in each training data subset) of the intercept terms in each model that was trained for cross-validation. The merging is faster than re-training a final model on the entire training dataset. An additional benefit of merging the models is that the predictiveness may increase due to an effect similar to bagging (variance reduction) because each of the models is trained on a different subset of the training data. This procedure is significantly faster than a grid search or similar, because $m_{stop}$ is estimated in one cross-validation run instead of many. The user needs to specify the maximum number of boosting steps to try, $M$. The default value of $M$ is 1000, but this default is not appropriate for all datasets. Plotting cross-validation loss versus boosting step can help the user to determine a reasonable value of $M$. The goal is to select $M$ so that there are enough boosting steps to find the minimum validation loss (if it exists) while avoiding unnecessary computational costs associated with a too high $M$. Note that the optimal $m_{stop}$ is affected by the learning rate. The learning rate, $v$, has a default value of 0.1 in APLR, which is reasonable (low enough) in many cases according to our empirical results and the literature (see, e.g., Bühlmann and Hothorn 2007). If needed, however, this value can be changed.
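The two ingredients of this procedure, picking $m_{stop}$ from a stored validation loss path and merging intercepts across folds, can be sketched as follows. The function names `pick_m_stop` and `merge_intercepts` are hypothetical and do not refer to the APLR package; boosting steps are counted from one, which is an assumption.

```python
import numpy as np

def pick_m_stop(validation_loss_path):
    """Boosting step (counted from 1) with the lowest validation loss."""
    return int(np.argmin(validation_loss_path)) + 1

def merge_intercepts(intercepts, weight_sums):
    """Average of fold intercepts, weighted by each fold's training weight sum."""
    return float(np.average(intercepts, weights=weight_sums))

# One stored loss per boosting step for a single fold.
m_stop = pick_m_stop([5.0, 3.0, 2.0, 2.5, 4.0])

# Two folds whose training subsets have weight sums 1 and 3.
intercept = merge_intercepts([1.0, 3.0], [1.0, 3.0])
```

Because the loss path is recorded during a single training run per fold, no model has to be refitted for each candidate value of $m_{stop}$.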

By default, APLR randomly splits the data into 5 folds for cross-validation. The tuning parameter $cv\_folds$ specifies the number of folds to use. Selecting the number of folds involves a bias-variance trade-off as well as a trade-off regarding computational time. The default value has worked well in our empirical tests. Sometimes it is not feasible to split the data randomly. As an alternative, APLR makes it possible to specify how particular observations are used in each training and validation subset. This can be useful, for example, when modelling time series, where it can be important to ensure that the validation set contains more recent observations than the training set.
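As an illustration of user-defined folds for time-ordered data, the following sketch builds expanding-window folds in which every validation observation is more recent than every training observation. The function names and the indicator encoding (1 for training, -1 for validation, 0 for unused) are assumptions made for this example; the format that the APLR package actually expects should be checked in its documentation.

```python
from typing import List, Tuple

Fold = Tuple[List[int], List[int]]


def expanding_window_folds(n_obs: int, n_folds: int) -> List[Fold]:
    """Split time-ordered indices 0..n_obs-1 into (train, validation) pairs.

    In every pair, all validation indices are later than all training indices.
    """
    block = n_obs // (n_folds + 1)
    if block == 0:
        raise ValueError("too few observations for the requested number of folds")
    folds = []
    for k in range(1, n_folds + 1):
        train = list(range(0, k * block))
        # The last fold absorbs any remaining observations.
        stop = (k + 1) * block if k < n_folds else n_obs
        validation = list(range(k * block, stop))
        folds.append((train, validation))
    return folds


def as_indicator_matrix(n_obs: int, folds: List[Fold]) -> List[List[int]]:
    """Rows are observations, columns are folds (hypothetical encoding:
    1 = training, -1 = validation, 0 = not used in that fold)."""
    matrix = [[0] * len(folds) for _ in range(n_obs)]
    for col, (train, validation) in enumerate(folds):
        for i in train:
            matrix[i][col] = 1
        for i in validation:
            matrix[i][col] = -1
    return matrix


folds = expanding_window_folds(n_obs=10, n_folds=3)
matrix = as_indicator_matrix(10, folds)
```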

APLR allows the user to specify observation weights. This can be useful for example when handling data that is over- or undersampled. If sample weights are specified, then they are also split into cross-validation folds.

Algorithm 7 formally describes how training data is prepared.

3.3.2 Other tuning parameters

APLR has other tuning parameters that should be tuned. This can be done within APLR’s split into training and validation sets, or with an external procedure, for example cross-validation. Below is a complete list of the APLR tuning parameters, together with some advice on how to tune them:

M is the maximum number of boosting steps to try. Ideally, it should be large enough to find the minimum validation error (if it exists) but not so large that unnecessary computational costs are incurred. A reasonable tuning strategy may be to start with the default value of 1000 and increase it if the validation error does not flatten out during those 1000 boosting steps. Please note that in the APLR package M is denoted as m to adhere to the naming convention of having variable names in lowercase letters.
v is the learning rate and should be set to a reasonably low value (see, e.g., Bühlmann and Hothorn 2007). The choice of M is affected by the choice of v, as the optimal number of boosting steps usually decreases when v increases. The default value of 0.1 should work in most cases, but higher values can sometimes be considered to reduce the computational burden, or lower values to avoid overly fast convergence (i.e., early overfitting).
$max\_interaction\_level$ specifies the maximum allowed depth of interactions. When tuning this parameter by, for example, a grid search, the values that should be tested are 0 (no interactions allowed), 1, 2, and a few larger values. Although there are no constraints on the maximum value allowed for this parameter, one must be careful when choosing larger values as the risk of overfitting increases significantly. The default value of 1 is often a safe choice, as it allows for interactions but avoids adding too much complexity to the model. Sometimes, however, adding a few interaction terms with a high interaction level reduces the loss more than adding many terms with a low interaction level.
$max\_interactions$ specifies the maximum number of interaction terms that APLR can consider. The default value of 100000 effectively imposes no constraint. Because it is more reasonable to control the level of complexity related to interactions through $max\_interaction\_level$, $max\_interactions$ should be set to the highest value that is computationally affordable.
$min\_observations\_in\_split$ determines the minimum number of effective observations ($n_{eff}$) that a term in the model must have. Higher values may give more robust models where terms rely on more observations. However, higher values may also increase bias because fewer terms are allowed to enter the model. The default value is 20. This parameter should be tuned in a grid search or similar.
$cv\_folds$ specifies the number of randomly selected folds that the training data is split into. If a random selection is not desired, then the tuning parameter $cv\_observations$ can be used instead of $cv\_folds$ to specify user-defined folds. Neither of these two parameters is meant to be tuned; rather, they determine how APLR performs cross-validation.
Tuning parameters that are intended for reducing computational costs:
- $max\_eligible\_terms$ limits (1) the number of terms already in the model that can be considered as interaction partners for terms in $\varvec{E}$ without interactions in a boosting step and (2) how many terms from $\varvec{E}$ remain in $\varvec{E}$ in the next boosting step.
- $ineligible\_boosting\_steps\_added$ controls how many boosting steps a term in $\varvec{E}$ that becomes ineligible has to remain ineligible.
- bins determines the maximum number of bins that can be created for discretizing the data when searching for the optimal split point in an APLR basis function.
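The tuning advice above can be sketched in a few lines. This is an illustrative helper, not part of the APLR package: the parameter names are taken from the list above, while `candidate_settings`, `should_increase_m`, the `tail_fraction` heuristic, and the candidate values are assumptions. Each resulting dictionary could, for instance, be passed as keyword arguments to an estimator in an external cross-validation loop.

```python
from itertools import product
from typing import Dict, List, Sequence


def candidate_settings(grid: Dict[str, Sequence]) -> List[Dict]:
    """Expand a parameter grid into a list of settings (full Cartesian product)."""
    names = list(grid)
    return [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]


def should_increase_m(validation_loss: Sequence[float],
                      tail_fraction: float = 0.05) -> bool:
    """Heuristic: return True if the minimum validation loss lies in the last
    `tail_fraction` of the boosting steps, i.e. the loss has not flattened out."""
    n = len(validation_loss)
    best = min(range(n), key=validation_loss.__getitem__)
    return best >= n - max(1, int(n * tail_fraction))


# Grid over the two parameters that the text recommends tuning by grid search.
grid = {
    "max_interaction_level": [0, 1, 2],
    "min_observations_in_split": [10, 20, 50],
}
settings = candidate_settings(grid)

# Made-up loss curves: one still decreasing at step M, one with an interior minimum.
still_decreasing = [1.0 / (step + 1) for step in range(100)]
interior_minimum = [(step - 30) ** 2 for step in range(100)]
```

In this sketch, a `True` result from `should_increase_m` suggests re-running with a larger M, in line with the strategy of starting from the default of 1000 and increasing it only when the validation error has not flattened out.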

3.4 APLR model interpretation

One of the most important features of APLR is the interpretability of the prediction rule. APLR uses the bases defined in Definitions 1 and 2 to capture the effect of each predictor on the response. Table 1 provides the results of the first 15 terms added to the model, other than the intercept, in an example.

Table 1 Example of the effect of the predictors on the response captured by APLR
