1 Introduction

Prediction models are central to modern data science. Being able to correctly predict an outcome of interest allows better planning and may provide a competitive advantage. While high predictiveness is crucial, it is often also necessary to explain the prediction, that is, to understand why a model predicts the way it does. An explanation can be legally mandatory: for example, an insurance company using a customer scoring model to price customers must be able to justify why it has increased the premium of a customer. It can also support decision-making: for example, a feature correlated with an increase in sales of a specific product can be highlighted in future advertisements. Even from a purely statistical perspective, it is easier to perform goodness-of-fit checks on model predictions when the model is interpretable.

Regression modelling often presents a trade-off between predictiveness and interpretability. Popular tree-based algorithms such as Random Forest (Breiman 2001) and boosted trees are highly predictive, but have very limited interpretability. Algorithms such as linear effect-based boosting and MARS (Friedman 1991) are fully interpretable, but are typically less predictive.

One way to tackle the interpretability issue is to use frameworks that attempt to interpret black-box models. LIME (Ribeiro et al. 2016) and Shapley values (Shapley 1997) are two methods that implement this idea. LIME attempts to explain predictions from a black-box model by using local and interpretable models trained on a subset of the data. Shapley values attempt to estimate how much each predictor contributes to a prediction by running the black-box model many times with different predictor values. Unfortunately, both LIME and Shapley values have important drawbacks, for example instability (LIME) and a high computational cost (Shapley values). A different strategy consists of developing an algorithm that provides an intrinsically interpretable model. The challenge, then, is to get as close as possible to the predictive ability offered by the non-interpretable methods. This is the aim of our work.

Another trade-off that should be taken into account when considering prediction models is the ease of use. Sophisticated methods can perform well, but may never be used in practice. For example, in a company, an algorithm that is easy to use is preferable because it may increase productivity by reducing the model development time. Variable selection, handling of interactions, and non-linear relationships are tasks that can be time-consuming to address. Algorithms such as Random Forest, boosted trees and MARS handle those tasks automatically, while more classical approaches often leave at least some of those tasks to the user. When LIME and/or Shapley values are used to interpret the predictions from Random Forests or boosted trees, these methods must be run on top of the underlying regression models. This adds code complexity and decreases ease of use.

In this paper, a new regression algorithm, Automatic Piecewise Linear Regression (APLR), is proposed. APLR automatically handles variable selection, interactions, and non-linear relationships, so its ease of use is comparable to that of Random Forest and boosted trees. Our empirical results show that APLR is able to compete with boosted trees and Random Forest on predictiveness. Importantly, in contrast to the latter two approaches, APLR provides highly interpretable models. While introduced here only for regression, APLR can be easily extended to classification tasks (see Sect. 6).

The rest of the paper is organised as follows: the novel APLR algorithm is described in Sect. 3 after a brief review, in Sect. 2, of the basic tools on which it is built. Sections 4 and 5 contrast APLR with other relevant regression algorithms on simulated and real data, respectively. Finally, some remarks in Sect. 6 conclude the paper.

2 Background

There are two important concepts that are central in the APLR algorithm: gradient boosting (Friedman et al. 2000) and MARS (Friedman 1991). We briefly review them in this section, focusing, as in most of the paper, on regression.

2.1 Gradient boosting

Gradient boosting is a supervised learning method introduced by Schapire (1990) and Freund (1995) to address the fundamental question of whether an ensemble of weak learners could produce a good estimate. From a modelling perspective, it is a forward stagewise additive procedure, in which at each stage (boosting step), a base learner is fitted to the negative gradient of the loss function computed at the estimate from the previous boosting step. In order to better explore the model space, the base learners are made artificially weak, in the sense of reduced fitting ability, by a penalty term (or learning rate). A base learner can be any fitting procedure, such as a regression tree, linear regression, etc. that relates the predictors to the negative gradient. The final estimate is the ensemble of the results from each base learner. One way of conceptually understanding how gradient boosting works is the following: in each boosting step, a base learner is trained to predict the residuals of the model made by all the base learners fitted in the previous boosting steps.

There are two important tuning parameters in gradient boosting: a penalty term, called the learning rate, \(v\in (0,1]\), and the number of boosting steps, \(m_\textrm{stop}\). Normally, the former is set to a default value (\(v = 0.1\)), and the latter is computed via bootstrapping or cross-validation. \(m_\textrm{stop}\) is a crucial parameter: a too-small value leads to underfitting, i.e., a situation in which the algorithm is not able to capture the relationship between predictor and response. In contrast, a too-large value leads to overfitting, i.e., the resulting model also tries to explain the randomness in the training data and the resulting prediction rule is not generalizable to new data. In addition to \(m_\textrm{stop}\) and \(v\), the base learners may also have tuning parameters of their own that should be computed, for example the depth of a regression tree.

A particularly important version of the boosting algorithm is the so-called componentwise gradient boosting (Bühlmann and Yu 2003), in which at each boosting step only simple (understood as involving only one predictor) base learners are considered. One simple base learner is fitted for each predictor to the negative gradient, and the one that helps the most to minimise the loss function is kept. The procedure is computationally heavier than the basic method, but it allows variable selection and extending the use of boosting to high-dimensional settings. A description of the componentwise gradient boosting algorithm is given in Algorithm 1.

Algorithm 1 (figure a): Componentwise gradient boosting
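The procedure described above can be sketched in a few lines. The following is a minimal illustration, not the content of Algorithm 1 itself: it assumes squared error loss, one-predictor linear base learners on centred predictors, and non-constant predictor columns. The function name `componentwise_boosting` and its arguments are hypothetical.

```python
import numpy as np

def componentwise_boosting(X, y, m_stop=100, v=0.1):
    """Componentwise boosting sketch: squared error loss, linear base learners.

    Assumes every column of X is non-constant (otherwise col_ss contains zeros).
    """
    n, p = X.shape
    x_mean = X.mean(axis=0)
    Xc = X - x_mean                      # centre predictors: no per-learner intercept needed
    col_ss = (Xc ** 2).sum(axis=0)       # sum of squares of each centred predictor
    coefs = np.zeros(p)
    f = np.full(n, y.mean())             # start from the intercept-only model
    for _ in range(m_stop):
        u = y - f                        # negative gradient of squared error loss
        betas = Xc.T @ u / col_ss        # least-squares fit of each predictor to u
        losses = ((u[:, None] - Xc * betas) ** 2).sum(axis=0)
        j = int(np.argmin(losses))       # keep the base learner that reduces the loss most
        coefs[j] += v * betas[j]         # shrink the update by the learning rate
        f += v * betas[j] * Xc[:, j]
    intercept = y.mean() - coefs @ x_mean
    return intercept, coefs
```

Because only one coefficient is updated per step, predictors that are never selected keep a coefficient of exactly zero, which is how the componentwise version performs variable selection.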

2.2 Multivariate adaptive regression splines (MARS)

MARS (Friedman 1991) is also a supervised learning algorithm that fits a model by iteratively adding new variables into it. It works like a classical stepwise linear regression procedure, as it starts from the null model (only intercept) and adds new terms step by step. In MARS, however, two piecewise linear basis functions of the selected predictor (or of an interaction term) are added to the model at each step. These basis functions form a so-called reflected pair around a constant, t. When the predictor value is smaller than t, one of the two basis functions is zero and the other basis function is negative and linear. Similarly, when the value is larger than t, the function that was zero when \(x<t\) is positive and linear, while the other basis function is zero. These basis functions work locally because their values may be zero for wide ranges of values. This allows the procedure to save degrees of freedom and to capture non-linearities in the relation between predictors and response.

MARS also has tuning parameters that must be chosen, the most important of which is the number of iterations \(max\_terms\). It has a role similar to that of \(m_\textrm{stop}\), so too-small values lead to underfitting and too-large values to overfitting. In practice, optimizing this parameter leads to models that are in general too complex, so a pruning procedure is often implemented to improve the results.

An important feature of MARS is that it handles interactions among variables. At each step, in addition to the candidate pair of basis functions of a predictor, potential interactions between the candidate pair and a term that already belongs to the model are also considered. It is possible to set an upper limit on the order of interaction through an additional tuning parameter \(max\_degree\).

3 Automatic piecewise linear regression

APLR is fundamentally a gradient boosting procedure adapted to allow the application of specific base learners inspired by the MARS algorithm. We start this section by introducing the two base learners. Later we will see how they are incorporated into the boosting procedure to generate APLR. Finally, we describe how the algorithm is fitted, including the choice of the tuning parameters.

3.1 APLR basis functions

Being the building block of boosting, a base learner is a function used to capture the effect of the predictors on the response (through the negative gradient). In addition to the simple linear effect, APLR uses specific basis functions that are able to capture non-linearity and interactions through local effects. There are two types of basis functions available, for main and interaction effects, respectively.

3.1.1 APLR basis functions without interactions

APLR basis functions without interactions are similar to the basis functions used in MARS, but they are used differently. In MARS, a reflected pair of basis functions is entered into the model at each step, while in APLR only one basis function can enter in a boosting step. The first reason is that in componentwise gradient boosting the base learner only uses one dimension. The second is that in gradient boosting it is advantageous to use weak learners. A single basis function is a weaker learner than a pair of them. Definition 1 formally defines APLR basis functions without interactions.

Definition 1

(APLR basis function without interactions) A basis function in APLR for a predictor x is one of the following two piecewise linear functions:

$$\begin{aligned} & \max (x-t,0), \\ & \min (x-t,0), \end{aligned}$$

where t is a value that defines the split point for the basis function. A basis function of the form \(\max (x-t,0)\) is defined as a right basis function because non-zero values of it are to the right of the split point when plotted on the x-axis in a chart. Conversely, a basis function of the form \(\min (x-t,0)\) is defined as a left basis function. The split point for a right basis function is defined as the right split point and the split point for a left basis function is defined as the left split point. The number of effective observations \(n_{eff}\) is defined as the number of observations that do not get a zero value due to the max or min functions.

These basis functions have the ability to work locally since their values can be zero for wide ranges of predictor values. This also enables them to be weaker learners than a linear effect which is useful in gradient boosting. The type of APLR basis functions described in Definition 1 cannot handle interactions unless x itself is an interaction term.
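As a small illustration of Definition 1, the two basis functions and the number of effective observations can be written as follows. The function names are hypothetical, and counting an observation that lies exactly on the split point as non-effective is an assumption of this sketch.

```python
import numpy as np

def right_basis(x, t):
    """Right basis function max(x - t, 0): non-zero only to the right of t."""
    return np.maximum(x - t, 0.0)

def left_basis(x, t):
    """Left basis function min(x - t, 0): non-zero only to the left of t."""
    return np.minimum(x - t, 0.0)

def n_eff(basis_values):
    """Number of observations with a non-zero basis function value.

    Assumption: observations exactly at the split point count as zero.
    """
    return int(np.count_nonzero(basis_values))

x = np.array([-2.0, 0.0, 1.0, 3.0, 5.0])
t = 1.0
r = right_basis(x, t)   # values 0, 0, 0, 2, 4
l = left_basis(x, t)    # values -3, -1, 0, 0, 0
```

In this toy example both basis functions have two effective observations, which illustrates why a single basis function is a weaker learner than a linear effect that uses all five.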

3.1.2 APLR basis functions with interactions

The MARS-like basis functions described in Sect. 2 work well in the case of independent predictors, but may have problems when interaction terms are relevant. In MARS interactions are handled by allowing terms that are products of MARS basis functions. Such product terms may cause problems. For example, higher-order interactions form higher power products that may result in interaction terms with very large values (potentially causing computational problems) or very small values (potentially causing rounding errors), depending on the data. Another problem is related to the meaning of the interaction term when the sign of the predictors that interact changes. To illustrate this, let \(x_1\) and \(x_2\) be two predictors and let \(x_{12} = x_1 \cdot x_2\) be the estimated interaction between them. The combination \(x_1 = 1\) and \(x_2 = -1\) gives \(x_{12} = -1\). But the combination of \(x_1 = -1\) and \(x_2 = 1\) also gives \(x_{12} = -1\). These two sets of combinations could have vastly different response values, but \(x_{12}\) would not be able to discriminate between them.

In APLR, interactions are handled in a way that avoids the aforementioned problems and resembles the handling of interactions in regression trees. Namely, interactions are formed by subsetting the data. As an example, let the first split in a regression tree be on \(x_1 \le 50\). The next split could be on \(x_2 > 10\) when \(x_1 \le 50\). Then \(x_2\) and \(x_1\) form a local interaction when \(x_1 \le 50\). An APLR basis function with an interaction term gets values of zero when the interaction term has a value of zero. This type of basis function produces interaction terms that work on local subsets of the data. Definition 2 formally defines these basis functions.

Definition 2

(APLR basis function with interactions) A basis function in APLR with interactions is similar to Definition 1 except that the form can be either of the following:

$$\begin{aligned} & \max (x-t,0) \cdot \mathbbm {1}(i \ne 0), \\ & \min (x-t,0) \cdot \mathbbm {1}(i \ne 0), \end{aligned}$$

where i is an APLR basis function of a potentially different predictor, with or without interactions. Here \(\mathbbm {1}\) denotes the indicator function that assumes the value 1 if its argument is true and 0 otherwise. The depth of interactions is called interaction level. The interaction level is zero for a basis function without interactions (see Definition 1). For a basis function with interactions, the interaction level is one more than the interaction level of i. The number of effective observations \(n_{eff}\) is defined as in Definition 1, except that it also excludes observations that get a zero value due to the indicator function.
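The sign example from the beginning of this section can be used to illustrate Definition 2. The sketch below is only an illustration under assumed split points of zero; the function name `interaction_basis` and its arguments are hypothetical.

```python
import numpy as np

def interaction_basis(x, t, i_values, side="right"):
    """Basis function with interactions: main effect times the indicator 1(i != 0)."""
    if side == "right":
        main = np.maximum(x - t, 0.0)
    else:
        main = np.minimum(x - t, 0.0)
    return main * (i_values != 0)

# Two observations with opposite signs: (x1, x2) = (1, -1) and (-1, 1).
x1 = np.array([1.0, -1.0])
x2 = np.array([-1.0, 1.0])

product = x1 * x2                       # both observations get -1

# i is a left basis function of x1 (interaction level 0), split point 0 assumed.
i = np.minimum(x1 - 0.0, 0.0)           # values 0, -1
# Right basis function of x2, active only where i is non-zero (interaction level 1).
local = interaction_basis(x2, 0.0, i, side="right")   # values 0, 1
```

The product term assigns the same value to both observations, whereas the local interaction term separates them because it is active only on the subset where \(x_1 < 0\).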

3.2 APLR fitting procedure

As a boosting-like approach, APLR follows, at a high level, the steps described in Algorithm 1. The implementation, however, is more involved, and it is explained in detail in the following.

3.2.1 Initialization

APLR starts with a zero intercept term and no other terms in the model. This is similar to the initialization step of Algorithm 1.

In the first boosting step, the set of potential terms that can enter into the model consists of the APLR basis functions without interactions (Definition 1) of all available predictors. This set is called \(\varvec{P}\). After a term other than the intercept has entered the model, \(\varvec{P}\) can potentially expand in each following boosting step if interaction terms are added to the model using the basis functions of Definition 2. \(\varvec{P}\) can grow large and it can become computationally heavy to evaluate each potential term in every boosting step. APLR provides tuning parameters that can prevent all terms in \(\varvec{P}\) from being evaluated in each boosting step. This process is described in Sect. 3.2.5. To facilitate this functionality, the set \(\varvec{E}\) holds terms that can be evaluated in the next boosting step. Initially \(\varvec{E} = \varvec{P}\) so that all terms in \(\varvec{P}\) are eligible in the first boosting step.

The final part of the initialization step is to define an empty set \(\varvec{C_0}\) for storing terms other than the intercept that are included in the model. In a general boosting step m, \(\varvec{C_m}\) can increase by up to one additional term. If \(\varvec{C_m}\) does not increase in a boosting step, then the regression coefficient for a term already in \(\varvec{C_{m-1}}\) can be updated.

Algorithm 2 summarizes how APLR is initialized.

Algorithm 2 (figure b): APLR fitting step 1: Initialization

3.2.2 Componentwise boosting step

Each boosting step starts with a calculation of the negative gradient. For ease of description, we focus on the squared error loss function. The negative gradient is computed at the model estimate from the previous boosting step. For a generic boosting step m, the set that holds terms in the model other than the intercept, \(\varvec{C}_m\), is initialized to be the same as in the previous boosting step (\(\varvec{C}_{m-1}\)). The intercept is updated in each boosting step.
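As a worked illustration, assume that the squared error loss is written with the conventional factor of one half (an assumption; the scaling is not stated here) and let \(\hat{f}_{m-1}\) denote the model estimate after boosting step \(m-1\) (notation introduced only for this illustration). The negative gradient for observation i is then the ordinary residual:

$$\begin{aligned} L(y_i, f) = \tfrac{1}{2}(y_i - f)^2, \qquad u_{m,i} = -\left. \frac{\partial L(y_i, f)}{\partial f}\right| _{f = \hat{f}_{m-1}(x_i)} = y_i - \hat{f}_{m-1}(x_i). \end{aligned}$$

Each base learner is therefore fitted to the residuals of the current model, in line with the interpretation of gradient boosting given in Sect. 2.1.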

The next step is to find the optimal split points for each eligible term in \(\varvec{E}\) and to consider interaction terms. These parts of the procedure are described in Sects. 3.2.3 and 3.2.4, respectively. The following cases are possible:

  • Add a new term from \(\varvec{E}\) to the model (\(\varvec{C}_m\)).

  • Update a term already in \(\varvec{C}_m\) that is also in \(\varvec{E}\).

  • Add a new interaction term to \(\varvec{C}_m\).

  • Terminate the boosting procedure if none of the above options (or updating the intercept) reduces the training error. In this case, no more boosting steps are carried out.

The choice that results in the lowest loss is selected. Unless the boosting procedure is terminated, the eligibility of terms (\(\varvec{E}\)) for the next boosting step is updated. This is described in Sect. 3.2.5. Algorithm 3 formally describes how the componentwise boosting step is performed in APLR.

Algorithm 3 (figure c): APLR fitting step 2: Componentwise boosting step

3.2.3 Fitting an APLR basis function to the negative gradient

When fitting an APLR basis function to the negative gradient \(\varvec{u}_m\), the first step is to determine if there are any observations for which the APLR basis function will be zero as a consequence of interactions (see Sect. 3.1.2). For such observations, the prediction from a linear regression model using the APLR basis function as the only predictor would be zero and the loss contribution would not change from the prior boosting step. It is computationally more efficient to avoid recalculating the loss for such observations. Therefore such observations are excluded from the remaining steps except that the loss contribution from them (unchanged from the previous boosting step) is used in the final step to determine the overall loss for the APLR basis function.

APLR has a tuning parameter to control model robustness, \(min\_observations\_in\_split\). It prevents terms with a lower number of effective observations (\(n_{eff}\)) than its value from entering into the model (\(\varvec{C}_m\)). This tuning parameter is comparable with the minimum node size in a regression tree. The main idea is to avoid having terms in the model that rely on too few observations. The default value for \(min\_observations\_in\_split\) is 20. For large datasets a larger value of \(min\_observations\_in\_split\) is recommended, while for very small datasets a lower value may be preferred. If \(n_{eff}\) is less than \(min\_observations\_in\_split\), then the fitting procedure is aborted, setting loss to infinity so that the APLR basis function cannot enter into the model.

One of the key aspects of fitting an APLR basis function to the negative gradient is to find the optimal splitting point. Searching for this point by iterating through all observations is computationally intensive. To ease the computational burden, APLR implements an approximation technique inspired by the algorithm used in the XGBoost implementation of gradient tree boosting (Chen and Guestrin 2016). XGBoost discretizes data into bins and uses the discretized data to find optimal splits.

APLR sorts predictor values \(\varvec{x}\), the negative gradient \(\varvec{u}_m\) and, if provided, sample weights \(\varvec{w}\), ascending by \(\varvec{x}\). Then APLR discretizes these sorted vectors into bins. The maximum number of bins that APLR can create in this process is determined by the tuning parameter \(bins\). The default value of \(bins\) is 300. This value decreases the computational burden significantly for larger datasets and does not seem to degrade predictiveness (see Sect. 4.5).

When splitting the data into bins, APLR first finds the left edges of the bins. The left edge of a bin is the lowest value of \(\varvec{x}\) in the bin. The first observation in the sorted \(\varvec{x}\) is always a left edge since it has the lowest value of \(\varvec{x}\). Apart from that, the first or last \(min\_observations\_in\_split\) observations cannot be left edges. Potential left edges are found by iterating through the sorted \(\varvec{x}\), starting from the lowest value. Potential left edges are required to have a higher value of \(\varvec{x}\) than the previous observation, otherwise the bins would overlap.

If the number of potential left edges, b, is less than \(bins\), then APLR creates a bin for each potential left edge. For ordered categorical variables with no more categories than \(bins\), this enables each category to get a separate bin. If \(b=0\), then one bin will contain all the observations. In the latter case, the APLR basis function cannot have split points and can only be used as a linear effect. If \(b > bins\), then APLR first creates the two bins that have the lowest and highest potential left edges. This ensures that bin edges are placed as close as possible after the first and before the last \(min\_observations\_in\_split\) observations. Then, APLR calculates a minimum number of observations, \(n_{min}\), that any further bins must contain, so that the number of bins created does not exceed \(bins\). By iterating through the remaining potential left edges, further bins are added under the constraint that they must contain at least \(n_{min}\) observations. Once the left edges for all the bins are known, it is trivial to compute the right edges.

For each bin, the discretized values of \(\varvec{x}\) and \(\varvec{u}_m\) are their averages based on the observations in the bin. If sample weights were provided by the user, then, for each bin, the discretized values of \(\varvec{w}\) are sums of \(\varvec{w}\) for observations in the bin. Otherwise, the discretized values of \(\varvec{w}\) are, for each bin, the number of observations in the bin. The goal is to weight the bins by the number of observations that they contain.
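The binning and discretization can be sketched as follows for the simple case in which every potential left edge gets its own bin. This is an illustration only: the function names are hypothetical, the exact index boundaries are an assumption, and the case \(b > bins\) (which requires \(n_{min}\)) is not covered.

```python
import numpy as np

def potential_left_edges(x_sorted, min_obs):
    """Indices of potential left edges in the sorted predictor.

    Assumption: indices in the first and last `min_obs` positions are excluded,
    and an edge must have a strictly higher value than the previous observation.
    """
    n = len(x_sorted)
    return [i for i in range(min_obs, n - min_obs) if x_sorted[i] > x_sorted[i - 1]]

def discretize(x_sorted, u_sorted, edge_idx, w_sorted=None):
    """Collapse sorted observations into bins that start at index 0 and at edge_idx."""
    starts = np.asarray([0] + list(edge_idx))
    counts = np.diff(np.append(starts, len(x_sorted)))   # observations per bin
    x_d = np.add.reduceat(x_sorted, starts) / counts     # bin averages of x
    u_d = np.add.reduceat(u_sorted, starts) / counts     # bin averages of u_m
    if w_sorted is None:
        w_d = counts.astype(float)                       # weight = bin size
    else:
        w_d = np.add.reduceat(w_sorted, starts)          # weight = sum of sample weights
    return x_d, u_d, w_d

x_sorted = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0])
u_sorted = np.arange(8, dtype=float)
edges = potential_left_edges(x_sorted, min_obs=2)
x_d, u_d, w_d = discretize(x_sorted, u_sorted, edges)
```

In this toy example the last bin absorbs the values 3 and 4 because no left edge is allowed among the last two observations.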

To increase computational efficiency, the creation of bins and discretization of \(\varvec{x}\) and \(\varvec{w}\) is only executed the first time that the APLR basis function is fitted to the negative gradient. For APLR basis functions without interactions, which are eligible in the first boosting step, this happens only in the first boosting step. For an APLR basis function with interactions this only happens in the boosting step when the basis function is added to \(\varvec{P}\). However, discretization of \(\varvec{u}_m\) happens in every boosting step when the APLR basis function is eligible.

The next step is to find the best split point by using the discretized data \(\varvec{x}_d\), \(\varvec{u}_{m, d}\) and \(\varvec{w}_d\). A copy of the APLR basis function is made. This copy, here defined as \(f(\varvec{x}_d)\), uses \(\varvec{x}_d\) as predictor instead of \(\varvec{x}\). For each bin, the loss is calculated for the left and for the right split points, respectively. In addition, the loss is calculated for a linear effect of \(\varvec{x}_d\) (without any split). The split point (or linear effect) with the lowest loss is selected. If there is a tie, then the split point (or linear effect) giving the largest \(n_{eff}\) is preferred to increase model robustness. When calculating the loss for a split point, the weighted linear regression coefficient \(\beta _d\) is estimated as follows,

$$\begin{aligned} \beta _d = v \cdot \frac{\sum _{i=1}^{bins} f(x_{d,i}) \cdot w_{d,i} \cdot u_{m,d,i}}{\sum _{i=1}^{bins} f(x_{d,i})^2 \cdot w_{d,i}}, \end{aligned}$$

where \(v \in (0,1]\) is the learning rate. The loss is then \(L_d = (\varvec{u}_{m,d} - f(\varvec{x}_d) \cdot \beta _d)^T \cdot (\varvec{u}_{m,d} - f(\varvec{x}_d) \cdot \beta _d)\).
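The split search on the discretized data can be sketched as below. This is an illustrative reading of the text, not the package implementation: taking the discretized bin value as the candidate split point, the handling of a basis function that is zero everywhere, and the computation of \(n_{eff}\) from bin weights are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def fit_coefficient(f_x, u, w, v=0.1):
    """Shrunken weighted least-squares coefficient and the resulting loss."""
    denom = np.sum(f_x ** 2 * w)
    if denom == 0.0:
        # Assumption: a term that is zero everywhere leaves the loss unchanged.
        return 0.0, float(u @ u)
    beta = v * np.sum(f_x * w * u) / denom
    resid = u - f_x * beta
    return float(beta), float(resid @ resid)

def best_split(x_d, u_d, w_d, v=0.1):
    """Try a linear effect plus a left and a right split point for every bin."""
    candidates = [("linear", None, x_d, float(w_d.sum()))]
    for t in x_d:
        for kind, f in (("right", np.maximum(x_d - t, 0.0)),
                        ("left", np.minimum(x_d - t, 0.0))):
            candidates.append((kind, float(t), f, float(w_d[f != 0].sum())))
    best = None
    for kind, t, f, n_eff in candidates:
        beta, loss = fit_coefficient(f, u_d, w_d, v)
        key = (loss, -n_eff)             # lowest loss first, ties broken by larger n_eff
        if best is None or key < best[0]:
            best = (key, kind, t, beta, loss)
    return best[1:]                      # (kind, split point, beta, loss)
```

For example, if the discretized negative gradient equals \(\max (x_d - 2, 0)\) exactly and \(v = 1\), the search returns a right basis function with split point 2 and zero loss.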

Finally, the loss is calculated for the original APLR basis function, \(f(\varvec{x})\), using the approximately optimal split point (or linear effect) that was estimated on the discretized data. The weighted linear regression coefficient \(\beta\) is estimated as follows:

$$\begin{aligned} \beta = v \cdot \frac{\sum _{i=1}^{n_{eff}} f(x_{i}) \cdot w_{i} \cdot u_{m,i}}{\sum _{i=1}^{n_{eff}} f(x_{i})^2 \cdot w_{i}} \end{aligned}$$

If sample weights were not provided by the user, then \(\beta\) is estimated without the w terms. The loss is \(L = (\varvec{u}_m - f(\varvec{x}) \cdot \beta )^T \cdot (\varvec{u}_m - f(\varvec{x}) \cdot \beta ) + L_0\), where \(L_0\) represents the loss from the observations excluded in the first step described at the beginning of this section. Algorithm 3 selects candidates to enter into \(\varvec{C}_m\) (the model) based on L for each APLR basis function considered.

Algorithm 4 summarizes how an APLR basis function is fitted to the negative gradient.

Algorithm 4 (figure d): Details on the fitting step 2 of APLR: Fit a basis function to \(\varvec{u}_m\)

3.2.4 Considering interactions

In each boosting step, before possible interactions are considered, APLR has already found a candidate term for model update from \(\varvec{E}\) (see Sect. 3.2.2). Then APLR considers the possible interactions between terms already in the model other than the intercept (\(\varvec{C}_m\)) and terms in \(\varvec{E}\) without interactions. If any interaction terms are added, then they are added to both \(\varvec{P}\) and \(\varvec{E}\). An interaction term is an APLR basis function with interactions (Definition 2), where the predictor, \(\varvec{x}\), is the predictor used in a term from \(\varvec{E}\) and i is a term in \(\varvec{C}_m\). Considering all possible interactions may be computationally intensive. APLR can reduce the number of interaction terms to evaluate with the help of three tuning parameters:

  • \(max\_interactions\) specifies the maximum number of interaction terms that can be added to \(\varvec{P}\).

  • \(max\_interaction\_level\) specifies the maximum interaction level allowed in an interaction term.

  • \(max\_eligible\_terms\) sets a limit on how many of the terms in \(\varvec{C}_m\) can be considered as interaction partners for terms in \(\varvec{E}\).

Note that if \(max\_eligible\_terms\) is smaller than the number of terms in \(\varvec{C}_m\), then the \(max\_eligible\_terms\) terms in \(\varvec{C}_m\) with the lowest previous loss are considered. For each term in \(\varvec{C}_m\), the previous loss is the loss in the most recent boosting step when the term was either added to \(\varvec{C}_m\) or had its regression coefficient updated.

The loss is calculated for each interaction term fitted to the negative gradient. Only interaction terms having a lower loss than the candidate term for model update from \(\varvec{E}\) (see Sect. 3.2.2) can be added to \(\varvec{P}\) and \(\varvec{E}\). These interaction terms are added to \(\varvec{P}\) and \(\varvec{E}\) starting with the term having the lowest loss, then the term having the second lowest loss, and so on, as long as the total number of interaction terms in \(\varvec{P}\) does not exceed \(max\_interactions\). The reasons for not adding terms with higher losses are:

  • To increase the chance that interaction terms in \(\varvec{C}_m\) are predictive. This can be especially relevant if \(max\_interactions\) is low. In such case, it can be advantageous to only add the most promising interaction terms.

  • To avoid evaluating terms that are likely less predictive in future boosting steps. This can potentially reduce the computational burden.

If any terms are added to \(\varvec{P}\) and \(\varvec{E}\) by the above procedure, then the term with the lowest loss becomes a candidate for entry to the model.

Algorithm 5 formally describes how interaction terms are considered.

Algorithm 5 (figure e): Details on the fitting step 2 of APLR: Interactions
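The rule for admitting interaction terms into \(\varvec{P}\) and \(\varvec{E}\) can be sketched as follows. The function and argument names are hypothetical; the sketch assumes that the losses of the candidate interaction terms have already been computed.

```python
def select_interactions(candidate_losses, baseline_loss, n_existing, max_interactions):
    """Return the interaction terms to add to P and E, best first.

    candidate_losses: {term name: loss when fitted to the negative gradient}
    baseline_loss:    loss of the candidate term for model update found in E
    n_existing:       number of interaction terms already in P
    """
    # Only terms that beat the current candidate from E are admissible.
    better = sorted(
        (loss, name) for name, loss in candidate_losses.items() if loss < baseline_loss
    )
    # Respect the cap on the total number of interaction terms in P.
    room = max(max_interactions - n_existing, 0)
    return [name for _, name in better[:room]]
```

The first element of the returned list, if any, is the term that becomes the new candidate for entry into the model.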

3.2.5 Updating eligibility of terms

Evaluating all terms in \(\varvec{P}\) in every boosting step may be computationally costly. At the end of each boosting step APLR decides which terms in \(\varvec{P}\) will be eligible in the next boosting step by redefining \(\varvec{E}\). First, only the \(max\_eligible\_terms\) terms in \(\varvec{E}\) with the lowest loss are kept. The main idea is to avoid evaluating less predictive terms in every boosting step. Terms removed from \(\varvec{E}\) become ineligible for the next \(ineligible\_boosting\_steps\_added\) boosting steps. Finally, terms that have already been ineligible for \(ineligible\_boosting\_steps\_added\) boosting steps are reentered into \(\varvec{E}\). The reason is that a previously less predictive term may become more predictive compared to other terms in the future.

The above tuning parameters allow the user to control how and if the terms can become ineligible for some future boosting steps. The aim is to reduce the computational burden without significantly degrading predictiveness. The default values for \(max\_eligible\_terms\) and \(ineligible\_boosting\_steps\_added\) are 5 and 10 respectively. These defaults can notably reduce the computational burden and do not seem to degrade predictiveness (see Sect. 4.5).

Algorithm 6 summarizes how the eligibility of terms is updated at the end of each boosting step.

Algorithm 6 (figure f): Details on the fitting step 2 of APLR: Updating term eligibility
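One possible way to keep track of eligibility is sketched below. The countdown bookkeeping is an assumption made for illustration (the text does not prescribe a data structure), and the function name `update_eligibility` is hypothetical.

```python
def update_eligibility(losses, cooldown, max_eligible_terms=5,
                       ineligible_boosting_steps_added=10):
    """Redefine E at the end of a boosting step.

    losses:   {term: loss} for the terms that were in E in this boosting step
    cooldown: {term: number of further boosting steps the term stays ineligible}
    Returns (new_E, new_cooldown).
    """
    new_cooldown = {}
    returning = []
    # Terms already sitting out count down; those that have served their time re-enter E.
    for term, remaining in cooldown.items():
        if remaining - 1 <= 0:
            returning.append(term)
        else:
            new_cooldown[term] = remaining - 1
    # Keep only the lowest-loss terms of the current E.
    ranked = sorted(losses, key=losses.get)
    kept = ranked[:max_eligible_terms]
    # The rest becomes ineligible for the next boosting steps.
    for term in ranked[max_eligible_terms:]:
        new_cooldown[term] = ineligible_boosting_steps_added
    return kept + returning, new_cooldown
```

With this bookkeeping, a term removed at the end of boosting step m is skipped in the following `ineligible_boosting_steps_added` steps and is evaluated again afterwards.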

3.3 Finding APLR’s tuning parameters

3.3.1 Number of boosting iterations

As mentioned in Sect. 2.1, the most important tuning parameter to determine in a boosting algorithm is the optimal number of boosting steps \(m_{stop}\). Tuning \(m_{stop}\) in APLR by doing a grid search or similar would be computationally expensive.

Because APLR uses parametric learners, it is possible to store regression coefficients for each boosting step at negligible computational cost. APLR automatically tunes \(m_{stop}\) by performing cross-validation. For each training data subset an APLR model is trained. Afterwards \(m_{stop}\) is set to the value that minimized the validation loss on the hold-out fold, and the stored regression coefficients for that boosting step are retrieved. Finally, the models are merged. For example, the intercept for the final model is the average (weighted by the sum of observation weights in each training data subset) of the intercept terms in each model that was trained for cross-validation. The merging is faster than re-training a final model on the entire training dataset. An additional benefit of merging the models is that the predictiveness may increase due to an effect similar to bagging (variance reduction), because each of the models is trained on a different subset of the training data. This procedure is significantly faster than a grid search or similar, because \(m_{stop}\) is estimated in one cross-validation run instead of many.

The user needs to specify the maximum number of boosting steps to try, M. The default value of M is 1000, but this default is not appropriate for all datasets. Plotting cross-validation loss versus boosting step can help the user to determine a reasonable value of M. The goal is to select M so that there are enough boosting steps to find the minimum validation loss (if it exists) while avoiding unnecessary computational costs associated with a too-high M. Note that the optimal \(m_{stop}\) is affected by the learning rate. The learning rate, v, has a default value of 0.1 in APLR, which is reasonable (low enough) in many cases according to our empirical results and the literature (see, e.g., Bühlmann and Hothorn 2007). If needed, however, this value can be changed.
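The selection of the stopping step and the merging of intercepts can be sketched as follows. The sketch covers only the intercept, which is the example given in the text; how the remaining terms are merged is not shown. The function names and the array layout (one stored value per boosting step and fold) are assumptions.

```python
import numpy as np

def pick_m_stop(validation_loss_per_step):
    """Boosting step (1-based) with the lowest validation loss for one fold."""
    return int(np.argmin(validation_loss_per_step)) + 1

def merge_intercepts(intercept_paths, validation_losses, train_weight_sums):
    """Weighted average of the fold intercepts retrieved at each fold's best step.

    intercept_paths:   list of arrays, stored intercept after each boosting step
    validation_losses: list of arrays, validation loss after each boosting step
    train_weight_sums: sum of observation weights in each training subset
    """
    chosen = [
        path[pick_m_stop(losses) - 1]
        for path, losses in zip(intercept_paths, validation_losses)
    ]
    return float(np.average(chosen, weights=train_weight_sums))

val = [np.array([3.0, 2.0, 4.0]), np.array([5.0, 4.0, 1.0])]
paths = [np.array([0.1, 0.2, 0.3]), np.array([1.0, 1.1, 1.2])]
merged = merge_intercepts(paths, val, train_weight_sums=[1.0, 3.0])
```

Because the coefficients are stored for every boosting step, choosing the best step is a lookup rather than a refit, which is what makes a single cross-validation run sufficient.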

By default, APLR randomly splits the data into 5 folds for cross-validation. Selecting the number of folds represents a bias-variance trade-off and a trade-off regarding computational time. The default value has worked well in our empirical tests. The tuning parameter \(cv\_folds\) specifies the number of folds to use. Sometimes it is not feasible to split the data randomly. As an alternative, APLR makes it possible to specify how particular observations are used in each training and validation subset. This can be useful, for example, when modelling time series, where it can be important to ensure that the validation set has more recent observations than the training set.

APLR allows the user to specify observation weights. This can be useful, for example, when handling data that is over- or undersampled. If observation weights are specified, then they are also split into cross-validation folds.

Algorithm 7 formally describes how training data is prepared.

Algorithm 7 APLR pre-processing step: Preparing training data

3.3.2 Other tuning parameters

APLR has other tuning parameters that should be tuned. This can be done using APLR’s built-in split into training and validation sets, or with an external procedure, for example cross-validation. A complete list of the APLR tuning parameters follows, with some advice on how to tune them:

  • M is the maximum number of boosting steps to try. Ideally, it should be large enough to find the minimum validation error (if it exists) but not so large that unnecessary computational costs are incurred. A reasonable tuning strategy may be to start with the default value of 1000 and increase it if the validation error does not flatten out during those 1000 boosting steps. Please note that in the APLR package M is denoted as m to adhere to the naming convention of having variable names in lowercase letters.

  • v is the learning rate and should be set to a reasonably low value (see, e.g., Bühlmann and Hothorn 2007). The choice of M is affected by the choice of v, as the optimal number of boosting steps usually decreases if v increases. The default value of 0.1 should work in most cases, but sometimes higher values can be considered to reduce the computational burden, or lower values to avoid overly fast convergence (i.e., early overfitting).

  • \(max\_interaction\_level\) specifies the maximum allowed depth of interactions. When tuning this parameter by, for example, a grid search, the values that should be tested are 0 (no interactions allowed), 1, 2, and a few larger values. Although there are no constraints on the maximum value allowed for this parameter, one must be careful when choosing larger values as the risk of overfitting increases significantly. The default value of 1 is often a safe choice, as it allows for interactions but avoids adding too much complexity to the model. Sometimes, however, adding a few interaction terms with a high interaction level reduces the loss more than adding many terms with a low interaction level.

  • \(max\_interactions\) specifies the maximum number of interaction terms that APLR can consider. The default value of 100000 imposes practically no constraint. Because it is more reasonable to control the level of complexity related to interactions through \(max\_interaction\_level\), \(max\_interactions\) should be set to the highest value that is computationally affordable.

  • \(min\_observations\_in\_split\) determines the minimum number of effective observations (\(n_{eff}\)) that a term in the model must have. Higher values may give more robust models where terms rely on more observations. However, higher values may also increase bias because fewer terms are allowed to enter the model. The default value is 20. This parameter should be tuned in a grid search or similar.

  • \(cv\_folds\) specifies the number of randomly selected folds that the training data is split into. If a random selection is not desired, then the tuning parameter \(cv\_observations\) can be used instead of \(cv\_folds\) to specify user-defined folds. Neither of these two parameters is intended for tuning; rather, they determine how APLR performs cross-validation.

  • Tuning parameters that are intended for reducing computational costs:

    • \(max\_eligible\_terms\) limits (1) the number of terms already in the model that can be considered as interaction partners for terms in \(\varvec{E}\) without interactions in a boosting step and (2) how many terms from \(\varvec{E}\) remain in \(\varvec{E}\) in the next boosting step.

    • \(ineligible\_boosting\_steps\_added\) controls how many boosting steps a term in \(\varvec{E}\) that becomes ineligible has to remain ineligible.

    • bins determines the maximum number of bins that can be created for discretizing the data when searching for the optimal split point in an APLR basis function.

3.4 APLR model interpretation

One of the most important features of APLR is the interpretability of the prediction rule. APLR uses the bases defined in Definitions 1 and 2 to capture the effect of each predictor on the response. Table 1 provides the results of the first 15 terms added to the model, other than the intercept, in an example.

Table 1 Example of the effect of the predictors on the response captured by APLR

In this example, the intercept is 0.249. The first two terms other than the intercept that were added to the model are P0 and P1. They are linear effects of predictors X6 and X5, respectively, and may be seen as a sign that APLR does not add complexity unless it is necessary.

The terms that are distinctive of APLR start from P2, which is an APLR basis function without interactions (Definition 1). In P2, we can see that when X2 is smaller than 340 there is an expected decrease of 0.011 in the response for each unit increase of X2, while its effect is zero when \(X2 \ge 340\).

The first APLR basis function with interactions (Definition 2) is found in term P8, having an interaction level of 1. It only contributes to the prediction when P6 is nonzero (which happens when \(X6>72\)) and when \(X3<97\). In this region there is an expected decrease of 0.012 in the response for each unit increase of X3.

P10 and P11 are examples of terms having an interaction level of 2. P10 contributes to the prediction when \(X6>73\) (due to the interaction with P9), \(X4<2945\) (because P9 interacts with P7) and when \(X3<86\). For each unit increase of X3 in this region, the prediction decreases by 0.034.
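To make the reading of such terms concrete, the following sketch evaluates P2 and P8 as described above. Definitions 1 and 2 are not reproduced here, so the hinge form min(x - t, 0) and the use of an indicator for "P6 is nonzero" are assumptions chosen to match the verbal description; the coefficients and thresholds are the ones quoted in the text, and the function names are hypothetical.

```python
import numpy as np

def term_p2(x2, coefficient=-0.011, threshold=340.0):
    """P2 as described in the text: slope -0.011 in X2 below 340, zero from 340.
    Assumed form: coefficient * min(x2 - threshold, 0)."""
    return coefficient * np.minimum(np.asarray(x2, dtype=float) - threshold, 0.0)

def term_p8(x3, x6, coefficient=-0.012, threshold=97.0, x6_threshold=72.0):
    """P8 as described in the text: active only where P6 is nonzero (X6 > 72)
    and X3 < 97; slope -0.012 in X3 inside that region.
    Assumed form: indicator(x6 > 72) * coefficient * min(x3 - threshold, 0)."""
    x3 = np.asarray(x3, dtype=float)
    x6 = np.asarray(x6, dtype=float)
    main_effect = coefficient * np.minimum(x3 - threshold, 0.0)
    return np.where(x6 > x6_threshold, main_effect, 0.0)

# Effect of a one-unit increase of the main predictor inside the active region.
p2_slope = float(term_p2(339.0) - term_p2(338.0))
p8_slope = float(term_p8(90.0, 80.0) - term_p8(89.0, 80.0))
```

Outside the active region (for example X6 at or below 72 for P8) the term contributes exactly zero, which is what makes each term readable as a local linear effect.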

APLR has other functionality that can be used to interpret a model, such as estimation of feature and term importance. This is discussed in Sect. 5.1.3.

3.5 Software implementation

APLR has been implemented in C++ for speed and memory efficiency. Because C++ is usually not practical to work with in data science, the C++ implementation of APLR has been wrapped as a Python package. In this package APLR is provided as a Scikit-learn compatible estimator. More information about this package and how to install it is available at https://github.com/ottenbreit-data-science/aplr.

4 Simulation study

4.1 Settings

In this section the novel APLR is compared with several competitors in four different scenarios to evaluate the algorithm’s performance and identify situations in which APLR has strengths and weaknesses. For each scenario, the simulations were run 10 times and the mean and standard deviation of both the MSE and the \(\hbox {R}^2\) were computed. The values of the MSE are reported relative to the MSE obtained by the oracle estimator, i.e., the estimator based on the true model. In each of the 10 runs we generated 60000 observations, of which half were randomly assigned to a training set and the remaining half to a test set. All models were trained and tuned on the training set and evaluated on the test set.
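The evaluation metrics can be sketched as follows. The sketch assumes that the reported MSE is the ratio of a model's test MSE to the oracle's test MSE; this reading, the function names and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """Coefficient of determination on the evaluation data."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def relative_mse(y_true, y_pred_model, y_pred_oracle):
    """Model MSE divided by the MSE of the oracle (true-model) predictions."""
    return mse(y_true, y_pred_model) / mse(y_true, y_pred_oracle)

# Toy test set: the oracle predicts the noise-free part of the response.
y_test = [1.0, 2.0, 3.0, 4.0]
oracle_pred = [1.5, 2.0, 3.0, 3.5]
model_pred = [2.0, 2.0, 3.0, 3.0]

ratio = relative_mse(y_test, model_pred, oracle_pred)
r2 = r_squared(y_test, model_pred)

# Over repeated runs, the mean and standard deviation would be summarised as:
ratios_over_runs = np.array([ratio, ratio])  # placeholder for 10 run results
summary = (float(ratios_over_runs.mean()), float(ratios_over_runs.std()))
```

Under this reading, a value of 1 means that a model predicts as well as the true model, and values above 1 measure the excess error.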

In each run 20 predictors (correlated or uncorrelated, depending on the scenario) were generated from a multivariate standard normal distribution. Afterward the true model for the response variable was calculated (additive or non-additive, depending on the scenario). In all scenarios, there are non-linear dependencies between predictors and the response. To provide an additional challenge for the algorithms, all scenarios also include noise predictors that do not affect the response variable. The following scenarios have been simulated:

  1. The true model is additive (no interactions) and predictors are uncorrelated.

  2. The true model is additive (no interactions) and predictors are correlated.

  3. The true model is not additive and predictors are uncorrelated.

  4. The true model is not additive and predictors are correlated.

4.1.1 Additive model with uncorrelated predictors

In this scenario, the relationship between the 20 predictors and the response variable is additive and non-linear. The true model is defined as

$$\begin{aligned} y = c + \sum _{j=1}^{10} \beta _j x_j^{d_j}+\epsilon \, \end{aligned}$$

where y is the response variable and \(\epsilon\) is an error term randomly drawn from a normal distribution with zero mean and a standard deviation equal to the standard deviation, over the simulated observations, of the predictable component of y (y without the error term). Moreover, c is a constant chosen to be equal to 5, \(\beta _j\) is the regression coefficient for the predictor \(x_j\) and \(x_j\) is raised to the power of \(d_j\). The regression coefficients are randomly drawn from a standard normal distribution and the power coefficients are randomly drawn from a uniform distribution with values in the interval [2, 4]. The last 10 predictors are noise predictors that do not affect the response variable.
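A minimal sketch of this data-generating process is given below. One point is an assumption and not stated in the text: the powers \(d_j\) are drawn as integers from {2, 3, 4}, because a non-integer power of a negative predictor value would not be real-valued. The function name and the seed are hypothetical.

```python
import numpy as np

def simulate_additive(n=60000, n_predictors=20, n_active=10, c=5.0, seed=0):
    """Simulate the additive scenario with uncorrelated predictors."""
    rng = np.random.default_rng(seed)
    # Uncorrelated standard normal predictors; the last ones are pure noise.
    X = rng.standard_normal((n, n_predictors))
    beta = rng.standard_normal(n_active)
    # ASSUMPTION: integer powers drawn uniformly from {2, 3, 4}.
    d = rng.integers(2, 5, size=n_active)
    # Predictable component: c + sum_j beta_j * x_j ** d_j over active predictors.
    signal = c + (beta * X[:, :n_active] ** d).sum(axis=1)
    # Error term with the same standard deviation as the predictable component.
    epsilon = rng.normal(0.0, signal.std(), size=n)
    return X, signal + epsilon, signal

X, y, signal = simulate_additive()

# Random 50/50 split into training and test sets.
order = np.random.default_rng(1).permutation(len(y))
train_idx, test_idx = order[: len(y) // 2], order[len(y) // 2 :]
```

Because the error term has the same standard deviation as the predictable component, even the oracle estimator can only reach an \(\hbox {R}^2\) of roughly 0.5 in this design.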

4.1.2 Additive model with correlated predictors

This scenario is similar to the scenario in Sect. 4.1.1, except that the simulated predictors are correlated with pairwise Pearson correlation coefficients of 0.9.
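Correlated predictors of this kind can be drawn from a multivariate normal distribution with unit variances and a common off-diagonal correlation. The sketch below shows one way to do it; the function name and the seed are hypothetical.

```python
import numpy as np

def simulate_equicorrelated(n, n_predictors=20, rho=0.9, seed=0):
    """Draw standard normal predictors with pairwise correlation rho."""
    rng = np.random.default_rng(seed)
    # With unit variances the covariance matrix equals the correlation matrix.
    covariance = np.full((n_predictors, n_predictors), rho)
    np.fill_diagonal(covariance, 1.0)
    return rng.multivariate_normal(np.zeros(n_predictors), covariance, size=n)

X_corr = simulate_equicorrelated(n=20000)
sample_corr = np.corrcoef(X_corr, rowvar=False)
off_diagonal = sample_corr[np.triu_indices_from(sample_corr, k=1)]
```

The response is then computed from these predictors exactly as in the uncorrelated case.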

4.1.3 Non-additive model with uncorrelated predictors

In this scenario the predictors are simulated in the same manner as in Sect. 4.1.1. However, the relationship between the predictors and the response is not additive. The true model is defined as the Euclidean distance between pairs of predictors,

$$\begin{aligned} y = \beta \cdot \sqrt{ \sum _{j=2}^{10} (x_{j-1}-x_j)^2 } \cdot \epsilon \, \end{aligned}$$

where y is the response variable, \(\beta\) is a regression coefficient chosen to equal 1.2, \(x_j\) is the jth predictor and \(\epsilon\) is a multiplicative error term randomly drawn from a uniform distribution with values in the interval [0.5, 1.5].
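The non-additive response can be computed from a predictor matrix as in the sketch below. The function name and the seed are hypothetical. Since the multiplicative error has expectation 1, the noise-free part returned as the second value is also the conditional mean of y.

```python
import numpy as np

def simulate_non_additive_response(X, beta=1.2, seed=0):
    """Return (y, noise_free) for the non-additive scenario."""
    rng = np.random.default_rng(seed)
    # Differences x_{j-1} - x_j for j = 2, ..., 10 (columns 0..8 minus 1..9).
    differences = X[:, 0:9] - X[:, 1:10]
    distance = np.sqrt((differences ** 2).sum(axis=1))
    # Multiplicative error drawn uniformly from [0.5, 1.5].
    epsilon = rng.uniform(0.5, 1.5, size=X.shape[0])
    noise_free = beta * distance
    return noise_free * epsilon, noise_free

X_demo = np.random.default_rng(2).standard_normal((1000, 20))
y_demo, noise_free_demo = simulate_non_additive_response(X_demo)
```

Only the first 10 predictors enter the response; the remaining columns act as noise predictors.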

4.1.4 Non-additive model with correlated predictors

This scenario is similar to the scenario in Sect. 4.1.3 except that the simulated predictors are correlated with pairwise Pearson correlation coefficients of 0.9.

4.2 Competitors

4.2.1 Random Forest

Random Forest (Breiman 2001) regression consists of averaging \(n\_estimators\) regression trees, each computed on a different bootstrap sample. In order to reduce the variance component of the prediction error, the correlation among trees is reduced by randomly selecting a fraction of all available predictors (\(max\_features\)) each time the best split of a node in a tree is computed. The depth of the trees is controlled by the tuning parameter \(min\_samples\_leaf\) (minimum number of observations required in a node).

In this study, we use the algorithm implemented in the RandomForestRegressor class of the sklearn package in Python. The tuning parameters were selected by five-fold cross-validation in a grid search: \(max\_features\) in \(\{0.125,0.25,0.5,0.75,1.0\}\), \(min\_samples\_leaf\) in \(\{1, 20, 50, 100, 500\}\). The number of trees \(n\_estimators\) was set to 300 (results with \(n\_estimators = 100\) were only marginally worse).
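A grid search of this kind can be set up with scikit-learn as sketched below. The grid values are those given above; the scoring criterion (negative mean squared error), the random seed and the helper name are assumptions, since they are not specified in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Grid of tuning parameter values taken from the text.
PARAM_GRID = {
    "max_features": [0.125, 0.25, 0.5, 0.75, 1.0],
    "min_samples_leaf": [1, 20, 50, 100, 500],
}

def tune_random_forest(X, y, n_estimators=300, seed=0):
    """Five-fold cross-validated grid search for a Random Forest regressor."""
    search = GridSearchCV(
        estimator=RandomForestRegressor(n_estimators=n_estimators, random_state=seed),
        param_grid=PARAM_GRID,
        cv=5,
        scoring="neg_mean_squared_error",  # ASSUMPTION: criterion not stated in the text
    )
    search.fit(X, y)
    return search

# Usage (X_train and y_train are the simulated training data):
# search = tune_random_forest(X_train, y_train)
# best_model, best_params = search.best_estimator_, search.best_params_
```

With 25 grid points and five folds, 125 forests are fitted before the final refit, which is why the number of trees matters for the running time.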

4.2.2 Gradient boosting regression tree

In tree-based gradient boosting, regression trees are used as base learners. Using trees as base learners allows for capturing non-linearities in the effect of the predictors on the response and for automatic handling of interactions. The number of terminal nodes allowed in each of the regression trees, \(num\_leaves\), is a useful additional tuning parameter that controls the strength of the base learner and the level of interaction to consider.

Here we use LightGBM to implement gradient boosting regression trees. The tuning parameters \(n\_estimators\) (number of boosting steps) and \(num\_leaves\) (maximum number of leaves in each tree) were tuned by using the Bayesian probabilistic model-based approach for finding optimal tuning parameters found in the Optuna package for Python. The allowed ranges of integers for these parameters were [1, 3000] and [2, 128], respectively. In total, 100 unique combinations were tried. The learning rate, v, was held constant at 0.1 as suggested in the literature.

4.2.3 Other gradient boosting regression approaches

The characteristics of a gradient boosting regression approach highly depend on the nature of the base learner implemented. Particularly relevant from this point of view is the choice between linear or non-linear base learners. To evaluate the performance of boosting algorithms with different base learners, we include two additional implementations of component-wise gradient boosting, available in the R package mboost.

  • The routine glmboost uses linear base learners. This procedure provides the most interpretable prediction model, as it is a penalised version of a GLM. On the other hand, it is not able to capture any non-linearity or interaction. Here the tuning parameters to set are the number of boosting steps \(m_{stop}\), selected by five-fold cross-validation on the grid [0, 5000], and the learning rate v, set to its default value of 0.1.

  • To allow capturing non-linearities, but not interactions, we also use the routine gamboost, with smoothing splines as base learners. The tuning parameter dfbase specifies the desired effective degrees of freedom and is here kept at its default value of 4. As for glmboost, a grid search is performed by five-fold cross-validation to find the best number of boosting steps \(m_{stop}\) between 0 and 5000, while v is set to 0.1 and 0.3 (for computational reasons) in the non-additive and additive scenarios, respectively.

Note that these two routines are only used to evaluate the boosting approach when choosing specific base learners, and the results should not be interpreted as a test of mboost itself.

4.2.4 MARS

MARS was described in Sect. 2.2. Here we used the Py-earth implementation and selected its tuning parameters by five-fold cross-validation: \(max\_degree\) (maximum interaction depth) was searched on the grid \(\{1, 2, 3, 4, 5, 6\}\) and \(max\_terms\) (maximum number of terms generated prior to pruning the model) on the grid \(\{10, 50, 100, 150\}\).

4.3 APLR tuning parameters

APLR was tuned by using its built-in splitting of the (training) data into 5 folds. The tuning parameters \(max\_interaction\_level\) and \(min\_observations\_in\_split\) were searched in a grid, while M, v, and \(max\_interactions\) were held constant.

The values allowed for \(max\_interaction\_level\) in the grid search were 0, 1, 2, and 100. These values correspond to the special case of no interactions (0), low depths of interactions (1 and 2), and interactions with a potentially high depth (100), respectively. The related tuning parameter \(max\_interactions\) was held constant at 100000 to allow APLR to fit as many interaction terms as possible in accordance with \(max\_interaction\_level\).

The value of the tuning parameter \(min\_observations\_in\_split\) was searched in \(\{20, 100, 500\}\) in all scenarios.

M was set to 3000, while v was set to 0.5 in the non-additive scenario with correlated predictors and to 0.1 in the other scenarios. In the former scenario, the higher learning rate avoided the need to increase M, which would have been significantly more computationally intensive.
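The resulting search space for APLR in the simulation study can be enumerated as in the sketch below. The parameter names follow Sect. 3.3.2 (with M denoted as m); the scenario label, the helper names and the idea of passing each dictionary as keyword arguments to the estimator are assumptions for illustration, and no APLR call is shown.

```python
from itertools import product

# Values searched in the grid (Sect. 4.3).
GRID = {
    "max_interaction_level": [0, 1, 2, 100],
    "min_observations_in_split": [20, 100, 500],
}

def fixed_settings(scenario):
    """Parameters held constant; v depends on the scenario."""
    # Hypothetical scenario label for the non-additive, correlated case.
    v = 0.5 if scenario == "non_additive_correlated" else 0.1
    return {"m": 3000, "v": v, "max_interactions": 100000, "cv_folds": 5}

def candidate_settings(scenario):
    """All combinations of the grid values merged with the fixed settings."""
    names = list(GRID)
    combinations = product(*(GRID[name] for name in names))
    return [dict(fixed_settings(scenario), **dict(zip(names, values)))
            for values in combinations]

candidates = candidate_settings("additive_uncorrelated")
```

Each of the 12 candidates needs only one cross-validation run, because \(m_{stop}\) is tuned inside APLR rather than being part of the grid.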

4.4 Results

4.4.1 Additive model with uncorrelated predictors

Table 2 Results on the test set for the additive scenario with uncorrelated predictors

Table 2 shows the results for the additive scenario with uncorrelated predictors. The algorithms that use a continuous base learner and are able to automatically handle non-linear relationships performed best in this scenario and were very close to the predictiveness of the oracle estimator. gamboost did marginally better than APLR, which, in turn, performed marginally better than MARS. These results are not surprising, as the true model has characteristics (additivity, continuity) that these algorithms can easily capture. The poor performance of glmboost was also expected, as it cannot handle non-linear relationships. The tree-based algorithms were able to predict reasonably well, but not as well as the top algorithms.

When considering the choice of the tuning parameters (see Table 3 for the results of the first simulation run), the best results for APLR were obtained when \(max\_interaction\_level\) was zero. This is reassuring, as there were no interactions. The tuning of LightGBM also shows that in this situation there was no need to capture interaction terms (\(num\_leaves = 2\), i.e., the simplest regression tree, which cannot capture interactions). Along the same lines, it is surprising that the best value for MARS was obtained with \(max\_degree \ge 3\), whereas one would expect \(max\_degree = 1\). The difference in performance was nonetheless quite small.

Table 3 Best tuning parameter values in the first simulation run in the additive scenario with uncorrelated predictors

4.4.2 Additive model with correlated predictors

Table 4 Results on the test set for the additive scenario with correlated predictors

Table 4 shows the results for the additive scenario with correlated predictors. Here the results are similar to those obtained in the previous scenario (Sect. 4.4.1). In absolute terms, this might be a bit surprising, as one could expect the correlation among predictors to pose an additional challenge for the algorithms. In terms of comparison, instead, the key aspect is still the absence of interactions, so algorithms using a continuous base learner again had an advantage. When looking at the best values for the tuning parameters (see Table 5 for the results of the first simulation run), in this case only APLR selected the correct interaction level.

Table 5 Best tuning parameter values in the first simulation run in the additive scenario with correlated predictors

4.4.3 Non-additive model with uncorrelated predictors

Table 6 Results on the test set for the non-additive scenario with uncorrelated predictors

Table 6 shows test results for the non-additive scenario with uncorrelated predictors. As expected, the response in this scenario was more difficult to predict for the algorithms due to its non-additive relationship with the predictors. When considering the algorithms that automatically handle interactions, one sees that MARS, APLR and LightGBM predicted reasonably well, while Random Forest did not. Somewhat surprisingly, MARS had the best results in this scenario, followed by APLR and LightGBM. However, APLR was only slightly behind MARS. The algorithms that do not handle interactions automatically, gamboost and glmboost, performed poorly, especially when the base learner is linear (glmboost). Looking at the choice of the tuning parameters, it seems that glmboost did not even try to fit a model and preferred the null one (at least in the first simulation run, see Table 7). In contrast, the tuning parameters for APLR, LightGBM, and MARS seem to be more or less in line with expectations, considering that there are interactions in this scenario.

Table 7 Best tuning parameter values in the first simulation run in the non-additive scenario with uncorrelated predictors

4.4.4 Non-additive model with correlated predictors

Table 8 Results on the test set for the non-additive scenario with correlated predictors

Finally, Table 8 shows test results for the non-additive scenario with correlated predictors. As expected, this was the most difficult scenario for the algorithms because of the combination of correlated predictors and a non-additive model structure. APLR had the best performance in terms of prediction, followed by LightGBM and Random Forest. While these three algorithms did not predict well compared to the oracle estimator, one may claim that they were still able to fit good prediction models. The remaining algorithms, instead, performed quite badly. In particular, while MARS performed surprisingly well in Sect. 4.4.3 when predictors were uncorrelated, its performance was poor in this scenario. Regarding the tuning parameters, it seems that APLR, LightGBM, and MARS chose values more or less in line with what we would have expected in this scenario (for the values obtained in the first run of the simulations, see Table 9).

Table 9 Best tuning parameter values in the first simulation run in the non-additive scenario with correlated predictors

4.5 Evaluating the APLR tuning parameters that are intended for reducing computational costs

For one of the simulation runs of the first scenario (described in Sect. 4.1.1), APLR was run without utilizing the tuning parameters that are intended for reducing computational costs: bins, \(max\_eligible\_terms\) and \(ineligible\_boosting\_steps\_added\). When we let the algorithm run without restrictions, the mean squared error and the \(R^2\) did not substantially change (mean squared error 82.2174 and 82.2166, \(R^2\) 0.49860 and 0.49863, with and without restrictions, respectively). In contrast, the training time increased from 5 min (with the tuning parameters set to the default values) to 120 h (without restrictions), clearly supporting the use of bins, \(max\_eligible\_terms\) and \(ineligible\_boosting\_steps\_added\). While it is possible that on some specific datasets it is worth running the algorithm without utilizing the above-mentioned tuning parameters to get a better prediction, one should be aware of the huge increase in the associated computational cost. In this work, for example, due to this reason APLR has not been tested on other datasets without utilizing these tuning parameters.

5 Real data applications

The performance of APLR has been evaluated on three datasets that are publicly available in the UCI Machine Learning Repository (Dua and Graff 2017). The datasets vary in size and in the number of predictors, and they also seem to vary with respect to the level of interactions between predictors. We can then reasonably state that APLR and its competitors were evaluated in a reasonably wide range of different situations.

The tuning parameter values were searched as in the simulation study, with a few exceptions. The tuning parameters \(max\_interaction\_level\) and \(min\_observations\_in\_split\) were tuned in a grid search, while M, v and \(max\_interactions\) were held constant. The values allowed for \(max\_interaction\_level\) in the grid search were 0, 1, 2, and 100. The related parameter \(max\_interactions\) was held constant at its default value of 100000, while the values allowed for \(min\_observations\_in\_split\) in the grid search were dataset specific. The learning rate v was set to 0.5 (except for the first dataset, where the default of 0.1 was used) to avoid the computationally costly need to increase M too much. The latter (M) was allowed to be up to 3500. For the same reason (to avoid having too many steps), in the last example the tuning parameter v of glmboost and gamboost was allowed to be larger than 0.1.

5.1 Auto MPG dataset

5.1.1 Data

This dataset stems from the one described in Quinlan (1993), but it is a slightly modified version of the original, with 8 of the original instances removed because they had unknown response values. In this dataset, there are 398 observations. The response variable is mpg (miles per gallon). There are eight potential predictors: cylinders (the number of cylinders), displacement (the displacement of the engine), horsepower (engine power measured in horsepower), weight (the car’s weight), acceleration (the car’s acceleration), \(model\_year\) (the year when the car was produced), origin (categorical variable with instances USA, Europe and Japan, denoting the country/region of origin for the car). An additional predictor with the name of the car was not used in the analysis. The categorical predictor origin was transformed into dummy variables. Six observations that had partially missing data were dropped. The remaining 392 rows were randomly split into a training set of 274 observations (approximately \(70\%\) of the observations) and a test set of 118 observations. The average absolute value of the pairwise Spearman rank correlation between the predictors in the training dataset is 0.47, indicating that the predictors are fairly correlated.
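The correlation summary used here, the average absolute value of the pairwise Spearman rank correlations, can be computed as in the following sketch. It uses only NumPy, handles ties with average ranks, and the function names are hypothetical; it assumes that no predictor column is constant.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1, with tied values receiving their average rank."""
    values, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)                # highest rank within each tie group
    mean_rank = last_rank - (counts - 1) / 2.0   # average rank within each tie group
    return mean_rank[inverse]

def mean_abs_spearman(X):
    """Average absolute pairwise Spearman rank correlation between columns."""
    X = np.asarray(X, dtype=float)
    ranks = np.column_stack([average_ranks(X[:, j]) for j in range(X.shape[1])])
    # Spearman correlation is the Pearson correlation of the ranks.
    corr = np.corrcoef(ranks, rowvar=False)
    upper = np.triu_indices_from(corr, k=1)      # each pair counted once
    return float(np.mean(np.abs(corr[upper])))

# Toy example: three perfectly monotonically related columns.
X_toy = np.array([[1.0, 2.0, 4.0],
                  [2.0, 4.0, 3.0],
                  [3.0, 6.0, 2.0],
                  [4.0, 8.0, 1.0]])
toy_value = mean_abs_spearman(X_toy)
```

Taking absolute values before averaging prevents positive and negative correlations from cancelling, so the summary measures the strength of dependence and not its direction.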

5.1.2 Results

Table 10 Results on the test set for the Auto MPG dataset

APLR predicted slightly better than the other algorithms on this dataset. The latter algorithms, except glmboost, had similar results (see Table 10). The \(\hbox {R}^2\) was relatively high, perhaps indicating that there is much signal in the data. glmboost performed worse than the others, likely because of non-linear dependencies between the predictors and the response, and because glmboost does not automatically handle interactions. It was not possible to train a gamboost model for this dataset, probably due to a bug in the mboost package. The best values of the tuning parameter \(max\_interaction\_level\) in APLR and of \(num\_leaves\) in LightGBM (see Table 11) indicate that the depth of interactions in the Auto MPG dataset is relatively low, but that relevant interaction terms exist. The best value of \(max\_degree\) in MARS was 3, indicating that interactions with a higher depth may also exist. However, in the simulation study, we noted that the best choice of the tuning parameters for MARS does not necessarily agree with the presence/absence of interactions.

Table 11 Best tuning parameter values for the Auto MPG dataset

5.1.3 APLR model interpretation

For the Auto MPG dataset, we report in Table 12 the final model obtained by APLR, where we can see the effect of the various predictors on the response. As an example of how to interpret a term in the APLR model, term T10 in the aforementioned table is an interaction term where its main predictor, horsepower, interacts with displacement and \(model\_year\). In the region where displacement is less than 318 and \(model\_year\) is greater than 76, the prediction decreases by 0.0191 for every unit increase in horsepower.

APLR has functionality that calculates, for each observation in new data (or in the training data), the contribution of each term in the model to the linear predictor. Due to the identity link function used in this model, the contribution to the linear predictor equals the contribution to the prediction. This can be used to estimate term importance. APLR does so by computing, for each term in the model, the standard deviation of its contribution to the linear predictor in the training data. If a term is important, then the contribution to the linear predictor that it delivers should vary among observations in the training data. Estimated term importance is displayed in Table 12, which is sorted by it in descending order.
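The term importance described above reduces to a column-wise standard deviation once the per-term contributions are available. The sketch below assumes a hypothetical contribution matrix (observations in rows, terms in columns) and uses the population standard deviation; which variant APLR uses is not stated in the text.

```python
import numpy as np

def term_importance(contributions):
    """Standard deviation of each term's contribution across observations.
    ASSUMPTION: population standard deviation (ddof=0)."""
    return np.std(np.asarray(contributions, dtype=float), axis=0)

# Hypothetical contributions of three terms for four training observations.
contributions = np.array([
    [0.5, 1.0,  2.0],
    [0.5, 3.0, -2.0],
    [0.5, 1.0,  2.0],
    [0.5, 3.0, -2.0],
])

importance = term_importance(contributions)
# Term indices sorted by importance in descending order, as in Table 12.
ranking = np.argsort(-importance)
```

The first term contributes the same amount to every observation and therefore gets an importance of zero: it shifts all predictions but does not help to distinguish between observations.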

From the table one can see that the most important terms in the model use weight, horsepower and \(model\_year\) as main predictors. The most important terms using weight or horsepower as the main predictor have negative regression coefficients, meaning that an increase in weight or horsepower generally decreases predicted mpg. The opposite is the case for terms using \(model\_year\) as the main predictor, so that an increase in \(model\_year\) generally increases predicted mpg. There are terms in the model that go against this trend, such as term T35, but they are less important. Term T3 for instance decreases the prediction when \(model\_year\) increases for cars with less than 105 horsepower.

Looking at main effect terms only (those with no interactions) using weight as the main predictor (such as term T1), one can see that they all have negative regression coefficients and that all of them affect the prediction when weight is less than some value. Consequently the main effect of weight on the prediction is strongest when weight is low. APLR has functionality that calculates the regression coefficient shape for the main effect of a predictor. This can simplify the interpretation of main effects significantly. An example is provided for weight in Table 13. The table shows that the main effect of weight declines as weight increases and disappears when \(weight \ge 4165\).

There is a similar story regarding the main effect terms that use horsepower as the main predictor. The strongest horsepower related main effects occur when horsepower is less than 149 (terms T3 and T8). However, when \(horsepower \ge 150\) then the terms T154 and T160 still give a weakly negative effect on the prediction when horsepower increases.

For \(model\_year\) there is only one main effect term with a negative regression coefficient, term T117. It affects the predictions when \(model\_year<72\), slightly mitigating the most important \(model\_year\) term, T5, when \(model\_year>71\). However, when \(model\_year \le 71\) then term T117 is the only main effect term using \(model\_year\) that affects the predictions and thus gives a slightly negative effect on the prediction when \(model\_year\) increases.

Briefly looking at the predictors not mentioned so far, the main effect of displacement is that the prediction decreases when displacement increases. The same is the case for acceleration, except that no main effects of it affect the prediction when \(13.5 \le acceleration \le 14.7\). All terms involving cylinders as the main predictor have positive regression coefficients. Terms involving the categorical variables USA, Europe and Japan as main predictors generally have lower term importance compared to other terms. However, the most important terms with USA as the main predictor have negative regression coefficients (for example terms T48, T64, T66 and T85), while the most important terms with Europe or Japan as main predictors generally have positive regression coefficients.

The overall interpretation of the model is perhaps not surprising: heavier cars and cars with stronger and larger engines consume more fuel per mile than lighter cars with weaker and smaller engines. Newer cars consume less fuel than older cars. In addition, the model indicates that American-made cars consume more fuel than European or Japanese cars.

However, two of the predictors, acceleration and cylinders, work differently in the model than one could expect: higher values of acceleration (a higher value of this variable means slower acceleration) decrease the prediction, while higher values of cylinders increase the prediction. A possible interpretation of this is that acceleration and cylinders mitigate other predictors in the model, such as weight, horsepower and displacement. This may be supported by the fact that the vast majority of terms having acceleration or cylinders as the main predictor are interaction terms.

APLR also has functionality to estimate feature importance. The key assumption when doing this is that model terms are attributed to their main predictor (this assumption may be reasonable but will probably not always hold). So, for example, all terms having weight as the main predictor are used when estimating the feature importance of weight. The methodology for the estimation is otherwise similar to the previously described estimation of term importance. Table 14 shows estimated feature importance for this model. According to this, weight, horsepower and \(model\_year\) are the most important predictors, whereas cylinders and each of the categorical predictors are the least important ones.
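One possible reading of this feature importance is sketched below: the contributions of all terms sharing a main predictor are summed per observation, and the standard deviation of that sum is taken. Whether APLR aggregates in exactly this way is an assumption; the contribution matrix, the term-to-predictor mapping and the function name are hypothetical.

```python
import numpy as np

def feature_importance(contributions, main_predictor_of_term):
    """Standard deviation of the summed contributions of all terms that are
    attributed to the same main predictor (ASSUMED aggregation rule)."""
    contributions = np.asarray(contributions, dtype=float)
    totals = {}
    for column, predictor in enumerate(main_predictor_of_term):
        # Accumulate the per-observation contribution for this predictor.
        totals[predictor] = totals.get(predictor, 0.0) + contributions[:, column]
    return {predictor: float(np.std(total)) for predictor, total in totals.items()}

# Hypothetical contributions of three terms for four training observations.
contributions = np.array([
    [1.0, 1.0, 0.0],
    [3.0, 3.0, 1.0],
    [1.0, 1.0, 0.0],
    [3.0, 3.0, 1.0],
])
# Main predictor of each term (column); an interaction term is attributed
# to its main predictor only.
main_predictor_of_term = ["weight", "weight", "horsepower"]

importance_by_feature = feature_importance(contributions, main_predictor_of_term)
```

Attributing an interaction term entirely to its main predictor is the key simplification: the predictors it interacts with receive no credit for that term.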

Table 12 Final APLR model for the Auto MPG dataset
Table 13 Regression coefficient shape for main effects of weight
Table 14 Estimated feature importance of each predictor

5.2 YearPredictionMSD dataset

5.2.1 Data

This dataset is a subset of the Million Song Dataset (Bertin-Mahieux et al. 2011) and consists of 515345 observations. The response variable is the release year of a song, labelled year. There are 90 potential predictors that measure attributes of the songs: 12 of them measure the timbre average and the remaining 78 measure the timbre covariance. All predictors were used here. We followed the recommendation of Bertin-Mahieux et al. (2011) to use the first 463715 observations for training and the last 51630 observations for testing. This avoids the “producer effect” by making sure that no song from a given artist ends up in both the training and test sets. Overall, the predictors seem to be only slightly correlated: the average absolute value of the pairwise Spearman rank correlation in the training dataset is 0.11. gamboost and glmboost were unable to handle the full training dataset on the computer used for the analysis because of excessive memory consumption. They were therefore trained on random subsamples of the full training dataset, consisting of 20000 observations for gamboost and 30000 observations for glmboost.
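The correlation summary used above, the average absolute value of the pairwise Spearman rank correlation, can be computed as in the following sketch. It is an illustration only and not the code used for the analysis: the function names and the toy data are hypothetical, and ties are assumed to receive average ranks.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def mean_abs_spearman(columns):
    """Average |Spearman correlation| over all pairs of predictor columns."""
    ranked = [average_ranks(column) for column in columns]
    return mean(abs(pearson(a, b)) for a, b in combinations(ranked, 2))


# Toy predictors (hypothetical values), one list per predictor.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
x3 = [5, 4, 3, 2, 1]
summary = mean_abs_spearman([x1, x2, x3])
```

Spearman correlation is the Pearson correlation of the ranks, so the summary only reflects monotone association between pairs of predictors. With 90 predictors there are 4005 pairs to average over.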

5.2.2 Results

Table 15 Results on the test set for the YearPredictionMSD dataset

Table 15 shows the results for this example. APLR performed the best, marginally ahead of LightGBM. Random Forest performed worse than LightGBM but better than MARS. The best values for \(max\_interaction\_level\) in APLR (100), for \(num\_leaves\) in LightGBM (12), and for \(max\_degree\) in MARS (4) indicate that the data contain interactions of some depth. In light of this, it is somewhat surprising that gamboost almost matched the predictiveness of MARS, since gamboost does not automatically handle interactions and was trained on a small subset of the data. In contrast, glmboost had the worst performance among the algorithms evaluated, indicating that there are most probably non-linear relationships between the response and the predictors. Table 16 shows the best tuning parameter values for all of the algorithms.

Table 16 Best tuning parameter values for the YearPredictionMSD dataset

5.3 Individual household electric power consumption dataset

5.3.1 Data

This dataset consists of 2075259 observations, each of which is a measurement of the electric power consumption in one household located in Sceaux (7 km from Paris, France) between December 2006 and November 2010 (47 months). Approximately \(1.25\%\) of the observations contain missing measurements; these observations were dropped. There are no recommendations available about which variable should be used as the response and which variables should be used as predictors. Here, \(sub\_metering\_3\) was selected as the response variable. It measures the electric power consumption of an electric water heater and an air-conditioner. The following predictors were used: \(global\_active\_power\), \(global\_reactive\_power\), voltage, \(global\_intensity\), \(sub\_metering\_1\) (electric power consumption in the kitchen) and \(sub\_metering\_2\) (electric power consumption in the laundry room).

The data were randomly split into training and test datasets containing \(60\%\) and \(40\%\) of the observations, respectively. There seems to be some correlation among the predictors, since the average absolute value of the pairwise Spearman rank correlation in the training dataset is 0.28. Again, gamboost and glmboost were unable to handle the full training dataset on the computer used for the analysis because they used too much memory. Consequently, 30000 observations were randomly sampled from the full training dataset and used to train these two algorithms.
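The data preparation described above (dropping incomplete observations, a random \(60\%\)/\(40\%\) split, and a random subsample of the training data for the memory-hungry algorithms) can be sketched as follows. This is a minimal illustration and not the code used for the analysis: the function name, the seed, the rounding of the training size and the toy rows are all assumptions.

```python
import random


def split_and_subsample(rows, train_fraction=0.6, subsample_size=30000, seed=0):
    """Drop incomplete rows, split randomly, and subsample the training set.

    rows: list of tuples, where None marks a missing measurement.
    Returns (train, test, subsample); subsample is drawn from train only.
    """
    # Keep only observations without missing measurements.
    complete = [row for row in rows if all(value is not None for value in row)]
    rng = random.Random(seed)  # ASSUMPTION: fixed seed, for reproducibility only
    shuffled = complete[:]
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    train, test = shuffled[:n_train], shuffled[n_train:]
    # Sample without replacement; never ask for more rows than are available.
    subsample = rng.sample(train, min(subsample_size, len(train)))
    return train, test, subsample


# Toy data (hypothetical): 1000 rows, 13 of which have a missing value.
rows = [(i, None if i % 80 == 0 else float(i)) for i in range(1000)]
train, test, subsample = split_and_subsample(rows, subsample_size=100)
```

Because the subsample is drawn from the training set only, the test set stays identical for all algorithms, so their test results remain comparable.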

5.3.2 Results

Table 17 Results on the test set for the Individual household electric power consumption dataset

In this dataset, LightGBM and Random Forest performed the best and were very close to each other in terms of predictiveness (see Table 17). This was the only case evaluated in this paper where the tree-based methods predicted better than APLR. However, APLR was only slightly behind and predicted better than the rest of the algorithms. The best values of the tuning parameters \(max\_interaction\_level\) in APLR (100), \(num\_leaves\) in LightGBM (124), and \(max\_degree\) in MARS (8) indicate that there are interactions of high depth (see also Table 18). In fact, this dataset seems to have higher interaction depths than any of the other datasets used in this paper. It is therefore surprising that gamboost beat MARS, since gamboost does not automatically handle interactions and was trained on a small subset of the data. Again, glmboost was the worst performer, likely due to non-linear relationships between predictors and response.

Table 18 Best tuning parameter values for the Individual household electric power consumption dataset

6 Conclusions

In this paper a novel regression algorithm, Automatic Piecewise Linear Regression (APLR), has been introduced. The algorithm is interpretable and automatically handles non-linear relationships, variable selection, and interactions.

Our simulation study showed that APLR has predictiveness comparable to that of the “parsimonious” algorithms in the simple scenarios in which these algorithms outperformed the tree-based ones. APLR also performed slightly better than the “more complex” algorithms in the scenarios in which non-additivities and interactions were present. In other words, APLR was among the top-performing approaches regardless of the characteristics of the data. From the results obtained on real datasets, moreover, one can argue that APLR was, in terms of predictiveness, at the same level as LightGBM and performed better than the other algorithms considered. These datasets seem to contain interactions between the predictors, with the depth of interactions varying between the datasets, ranging from a relatively low depth in the Auto MPG dataset (Sect. 5.1) to a high depth in the Individual household electric power consumption dataset (Sect. 5.3). The pairwise correlation between predictors in those datasets also varied, being relatively low in the YearPredictionMSD dataset (Sect. 5.2) and highest in the Auto MPG dataset (Sect. 5.1). It is encouraging that APLR worked well on datasets with such different characteristics.

Considering that APLR is highly interpretable, unlike the tree-based algorithms, one can argue that APLR can often increase interpretability at no cost in terms of predictiveness. Further testing on other simulation scenarios and other real datasets is needed to confirm these impressions.

In this paper the focus has been on regression, but APLR can also be used for classification by setting the binomial negative log-likelihood as the loss function (and, consequently, the logit function as the link function). The APLR package has also been extended with a classifier that can directly handle multi-class problems. The classifier fits a logit model for each response category and, when predicting, chooses the class with the highest predicted probability. Predicted class probabilities for each class can also be generated. Other features of APLR not explored in this paper include: the possibility of choosing the loss function and link (for example, the Tweedie loss function and the log link, or custom user-defined functions); the possibility of specifying a validation set metric different from the loss function (for example, the negative Gini coefficient, or a custom user-defined metric); the possibility of specifying monotonic constraints on individual predictors and interaction constraints on individual predictors or groups of them; and the possibility of passing a list of predictors that should be prioritized. The latter option is very useful in the case of mandatory predictors (see, e.g., Binder and Schumacher 2008), i.e. predictors that, for any reason, must be included in the model.
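The prediction rule of the multi-class classifier (one logit model per response category, then the class with the highest predicted probability) can be sketched as follows. This is an illustration of the rule only and does not use the APLR package: the function names, class labels and linear predictor values are hypothetical, and the sketch does not normalize the probabilities to sum to one, which the package may or may not do.

```python
import math


def sigmoid(z):
    """Inverse of the logit link, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_class(linear_predictors):
    """Pick the class whose logit model gives the highest probability.

    linear_predictors: dict mapping a class label to the linear predictor
        (on the logit scale) produced by that class's model for one observation.
    Returns (predicted_label, probabilities).
    """
    probabilities = {label: sigmoid(z) for label, z in linear_predictors.items()}
    predicted_label = max(probabilities, key=probabilities.get)
    return predicted_label, probabilities


# Hypothetical linear predictors for one observation and three classes.
label, probabilities = predict_class({"a": -1.2, "b": 0.4, "c": 2.0})
```

Since the sigmoid is increasing, choosing the highest probability is equivalent to choosing the class with the largest linear predictor.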

One possible direction for further development is an algorithm that performs Automatic Smoothing Spline Regression instead of Automatic Piecewise Linear Regression. The results in Sects. 4 and 5 showed that gamboost with smoothing spline base learners performed well in many situations. However, gamboost does not automatically handle interactions. Implementing smoothing spline regression within the APLR framework may lead to good results.