1 Introduction

Ensemble methods have proved successful in the majority of machine learning competitions, as they integrate multiple machine learning algorithms into one predictive model. There are three main types of ensemble methods: bootstrap aggregating (bagging), boosting, and stacking. Bagging and stacking induce base learners independently and aggregate them following a deterministic averaging process, while boosting learns sequentially in an adaptive way and combines learners using a pre-specified strategy. The main goal of bagging is to arrive at an ensemble with less variance than its base learners, whereas boosting aims to produce a strong model that is less biased than its base learners. Stacking aims to reduce both variance and bias by combining diversified base learners. Compared with a single model, all of these ensemble methods can significantly improve predictive performance through bias or variance reduction.

However, the final model produced by an ensemble method is still regarded as a black-box model, since the combination of base learners leads to a complicated model structure and makes the inner decision-making process opaque to humans. The ability to explain the rationale behind one’s decisions to others is an important aspect of human intelligence in both social interaction and educational contexts. Interpretable results enable business owners and regulators to better understand risk management decision processes and help companies meet regulatory requirements. Some commonly used interpretable models, such as the linear model and the generalized additive model, cannot compete with ensemble models including Xgboost [1] and random forest [2], since their simple structure cannot capture complicated, dynamic data patterns.

In recent decades, a number of research works have focused on enhancing interpretability by developing tools to ‘open up the black box.’ There are, broadly speaking, three inter-related model-based areas of research: (a) global diagnostics (Sobol & Kucherenko (2009) [3], Kucherenko (2010) [4]); (b) local diagnostics (Sundararajan et al. (2017) [5], Ancona et al. (2018) [6]); and (c) development of approximate or surrogate models that may be easier to understand and explain. However, Rudin (2019) [7] suggests avoiding explainable black-box models in high-stakes decisions, since their explanations can be problematic for several reasons, such as being unreliable or lacking detail about what the original model computes. Therefore, other researchers have made efforts to build inherently interpretable models, such as the explainable neural network (xNN) proposed by Vaughan et al. (2018) [8], adaptive explainable neural networks (AxNNs) by Chen et al. (2020) [9], and explainable neural networks with constraints by Yang et al. (2020) [10]. Following this research direction, we aim to build an inherently interpretable model whose predictive performance matches or exceeds that of some black-box models.

In this paper, our LIFE algorithm fulfills three main goals: competitive predictive performance, improved computational efficiency, and an interpretable model. In theory, the single-hidden-layer NN has the universal approximation property and is easy to interpret due to its simple architecture. However, strong predictive performance requires a wide single-hidden-layer NN with a large number of hidden neurons, which is numerically difficult to estimate in practice. Therefore, by leveraging the ensemble method and the simple structure of a single-hidden-layer NN, we develop an innovative and flexible framework called LIFE to train wide single-hidden-layer NNs, which can achieve both high accuracy and easy interpretability in both regression and classification settings.

In this algorithm, we first use a hierarchical structure of multiple single-hidden-layer NNs to perform data sampling based on the linear projections of the neurons, and then train multiple narrow single-hidden-layer NNs with ReLU activation as base learners on different subsets of the dataset. Finally, we aggregate the neurons from the multiple base learners as features into a wide single-hidden-layer NN and perform a joint estimation via a linear model. Compared with the traditional training strategy, LIFE avoids directly training a wide single-hidden-layer NN by instead extracting features from multiple narrow single-hidden-layer NNs trained on different subsets of the dataset. In this way, we introduce diversity among the base learners, which tends to decrease the total uncertainty after ensembling and thus yields better results empirically. To achieve good diversity among base learners, we build a hierarchical structure of multiple single-hidden-layer NNs and leverage the linear projections of the neurons to define the sampling subsets. In addition, the algorithm combines the features defined by the neurons of the single-hidden-layer NN base learners, which we call neuron flattening, and this can further improve results compared with traditional ensemble methods. This technique also allows LIFE to take advantage of parallel computing, improving computational efficiency by training multiple narrow single-hidden-layer NNs simultaneously.

Extensive analyses on both simulated data and public real data verify the effectiveness of LIFE in both predictive and computational performance. Furthermore, we provide a theoretical foundation for LIFE and demonstrate the importance of the diversity of base learners by exploring the relationship with two-stage ensemble stacking and the ambiguity loss decomposition for two-stage ensemble stacking. The final single-hidden-layer NN architecture obtained from LIFE makes it possible to visualize the network weights and biases and to understand the input-output relationship easily. In particular, a single-hidden-layer NN with the rectified linear unit (ReLU) activation function [8] is equivalent to an additive index model with linear splines on linear projections. Moreover, it can be considered a local linear model, where all predictors are easily visualized by a parallel coordinates plot [11]. The main and interaction effects can also be identified by aggregating local linear model coefficients.
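
To make the local linear view concrete, the following minimal sketch in plain Python shows how a single-hidden-layer ReLU network reduces to a linear model within the region where the set of active neurons is fixed. The weights and the function names `relu_net` and `local_linear_coefficients` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def relu_net(x, W, b, beta, beta0):
    """Single-hidden-layer ReLU network: beta0 + sum_k beta_k * max(0, b_k + w_k . x)."""
    out = beta0
    for w_k, b_k, beta_k in zip(W, b, beta):
        z = b_k + sum(wi * xi for wi, xi in zip(w_k, x))
        out += beta_k * max(0.0, z)
    return out


def local_linear_coefficients(x, W, b, beta, beta0):
    """Return (intercept, coefs) of the linear model valid in the activation region of x."""
    intercept = beta0
    coefs = [0.0] * len(x)
    for w_k, b_k, beta_k in zip(W, b, beta):
        z = b_k + sum(wi * xi for wi, xi in zip(w_k, x))
        if z > 0:  # the neuron is active at x, so it contributes a linear term
            intercept += beta_k * b_k
            for d, wi in enumerate(w_k):
                coefs[d] += beta_k * wi
    return intercept, coefs


# Hypothetical weights for a network with 2 inputs and 3 hidden neurons.
W = [[1.0, -1.0], [0.5, 2.0], [-1.5, 0.5]]
b = [0.1, -0.3, 0.2]
beta = [2.0, -1.0, 0.5]
beta0 = 0.25

x = [0.8, 0.4]
intercept, coefs = local_linear_coefficients(x, W, b, beta, beta0)
linear_value = intercept + sum(c * xi for c, xi in zip(coefs, x))
```

Here `linear_value` agrees with `relu_net(x, W, b, beta, beta0)`, and the same intercept and coefficients remain valid for any nearby point that activates the same neurons.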

In general, our main contributions are summarized below:

  1. LIFE is an innovative and flexible framework for ensemble methods, which allows different kinds of variants.

  2. LIFE can achieve better predictive performance than traditional single-hidden-layer NN training methods, as demonstrated by the theoretical background and empirical experiments.

  3. LIFE keeps the model interpretable, and a new interpretation tool is introduced to detect main and interaction effects.

  4. LIFE improves computational efficiency via easy parallelization, making wide single-hidden-layer NN training faster.

  5. A theoretical foundation for LIFE based on the ambiguity loss decomposition and the diversity of base learners is provided.

The rest of the paper is organized as follows. In Sect. 2, we introduce the LIFE algorithm and the theoretical rationale behind it through loss decomposition. Extensive experiments on simulated and real data are conducted in Sect. 3 to test the performance of LIFE under various conditions in comparison with other benchmark algorithms. In Sect. 4, we explore model interpretation, such as the detection of main and interaction effects. In Sect. 5, we discuss model pruning and extensions of the LIFE algorithm in more depth. In Sect. 6, we provide our conclusions.

2 Methodology

In this section, we introduce a newly proposed ensemble method called LIFE. The LIFE algorithm is mainly used to train a wide single-hidden-layer NN in an iterative way. First, the general framework for LIFE, which consists of three steps, is introduced in Sect. 2.1. Then, details of the LIFE algorithm are provided in Sect. 2.2. Finally, the theoretical foundation is discussed in Sect. 2.3.

2.1 General framework for LIFE

As shown in Fig. 1, the LIFE algorithm consists of three steps: (1) data sampling; (2) base learner training and feature extraction; and (3) model aggregation and pruning. Figure 1 presents several options in the white box for each step, which allow for various combinations of those steps to achieve pre-specified goals.

  1. The first step consists in defining subsets of data via the activation functions of NN neurons, which are obtained by training multiple single-hidden-layer NNs in a hierarchical structure as illustrated in Fig. 2. Diversified base learners can then be generated by training on these subsets of the dataset. More theoretical explanation of ensembles with diversified base learners is provided in Sect. 2.3. There are alternative ways to generate samples for base learner training, e.g., bootstrapping as in traditional methods or data splitting via random projection. By leveraging linear projections from trained NNs, our sampling method in the supervised setting generates diversity among base learners more effectively, as demonstrated by the empirical experiments in Sect. 5.1.

  2. The second step consists in training base learners on the different subsets of data sampled during the first step. Various options can be considered, e.g., single-hidden-layer NN, multilayer NN, regression or decision tree, etc. In this paper, the single-hidden-layer NN is used as the base learner given its interpretability. After estimating multiple NN base learners, all activation functions of the neurons from the NN base learners are extracted as new features.

  3. The third step consists in combining all new features extracted from the different base learners in the second step to construct the final predictive model. In addition, the new features can be pruned through regularization or other methods to generate a more parsimonious model. In this paper, we use the linear model and the elastic net in our LIFE algorithm, which is simple and straightforward; more complicated model aggregation methods, e.g., adaptive regression model screen, and pruning methods, e.g., the base learner selection algorithm (Algorithm 3), are discussed in Sect. 5.2.

Fig. 1 General framework of the LIFE algorithm. The options colored in red in each step are the ones used in this paper (colour figure online)

2.2 LIFE algorithm

The LIFE algorithm is an iterative process with multiple single-hidden-layer NN base learners trained in each iteration. Assume there are J iterations. The first \(J-1\) iterations are used to define the data sampling through a hierarchical structure, and the last (\(J^{th}\)) iteration is used to build the features from single-hidden-layer NN base learners. Moreover, let \([K_1,\cdots , K_J]\) denote the numbers of hidden neurons for the single-hidden-layer NNs in all iterations, where \(K_j\) is the number of hidden neurons of each single-hidden-layer NN trained in the \(j^{th}\) iteration. Figure 2 illustrates the LIFE framework with \([K_1, K_2, K_3] = [3,3,2]\), where the \({\hat{b}}_k^{(j)}\)’s and \({\hat{w}}_k^{(j)}\)’s represent the biases and weights of the single-hidden-layer NNs, respectively, the \({\hat{\beta }}_k\)’s are the coefficients used to linearly combine all new neurons in the final step, and cp is the cutoff point that defines the subsets and thereby controls the subset size.

Fig. 2 An illustration of the LIFE framework with \([K_1,K_2,K_3] = [3,3,2]\). Regions with a smiling emoji face indicate subsets that satisfy the sampling conditions and are used for NN model training in the next iteration

The LIFE framework illustrated in Fig. 2 can be separated into the three steps displayed in Fig. 1. The two iterations in the first step perform data sampling. We first fit a single-hidden-layer NN with three hidden neurons and then, given the estimated bias and weight of each neuron, define subsets by \(b_k^{(1)}+ x_i^Tw_k^{(1)}>cp\), where \(k=1,2,3;\,i=1,\cdots ,N\). A neuron is dropped entirely and no longer used in the next iteration if the subset it defines is either too large or too small according to pre-specified criteria. For example, the green neuron in the middle is dropped because its subset is too small. If the subset size is close to the full data size, the sample is almost identical to the original training data, which is not beneficial for generating diversified samples. On the other hand, if the subset size is too small, the sample is not representative of the original data, and the base learner built on this sample does not perform well on the entire training data. As we will show in Sect. 2.3, the diversity and accuracy of the base learners are the two key elements for the final ensemble performance. In the first iteration of the first step, we thus leverage the linear hyperplanes of the NN neurons to partition the data as shown in Fig. 3, an idea similar to oblique trees [12]. Plots (a), (b), and (c) in Fig. 3 show how the data are partitioned in different ways by the three neurons based on the \( w_k^{(1)} \)’s and \( b_k^{(1)} \)’s, \(k = 1,2,3\), obtained from the first iteration. The non-white regions (yellow, green or red) indicate the subsets, in which all observations satisfy \(b_k^{(j)}+ x_i^Tw_k^{(j)}>cp\) and which are used to fit single-hidden-layer NNs in the second iteration. Notice that the non-white region in plot (b) is very small; it corresponds to the dropped green neuron in Fig. 2. Plot (d) in Fig. 3 shows the combination of plots (a) and (c). As we can see, the two subsets overlap, unlike the subsets defined in a traditional regression or decision tree structure, which uses a mutually exclusive partition.
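
The subset rule described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `define_subsets`, the neuron weights and biases, and the bound values are hypothetical, and the arguments `lower` and `upper` play the role of the minimal and maximal subset proportions.

```python
import random


def define_subsets(X, W, b, cp=0.0, lower=0.1, upper=0.9):
    """For each neuron k, select the rows i with b_k + x_i . w_k > cp.

    Neurons whose subset proportion falls outside [lower, upper] are dropped.
    Returns a dict mapping each kept neuron index to its list of row indices.
    """
    n = len(X)
    kept = {}
    for k, (w_k, b_k) in enumerate(zip(W, b)):
        idx = [i for i, x in enumerate(X)
               if b_k + sum(wj * xj for wj, xj in zip(w_k, x)) > cp]
        if lower <= len(idx) / n <= upper:
            kept[k] = idx
    return kept


random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(1000)]

# Hypothetical first-iteration weights and biases for three neurons.
W = [[1.0, 0.5], [0.2, -0.1], [-0.7, 1.0]]
b = [0.0, -3.0, 0.3]

subsets = define_subsets(X, W, b, cp=0.0, lower=0.1, upper=0.9)
overlap = set(subsets[0]) & set(subsets[2])
```

With these illustrative values the middle neuron selects almost no observations and is dropped, while the two kept subsets share observations, so the partition is not mutually exclusive.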

Fig. 3 Data partition by NN linear projection after the first iteration. Observations in the colored areas of (a), (b), and (c) are selected into the corresponding subsets. Plot (d) shows the two kept data partitions in one plot

In the second iteration of the first step, single-hidden-layer NNs with three hidden neurons are fitted independently on the subsets defined in the first iteration. After that, the entire training set is evaluated for data sampling through the NN linear projections \(b_k^{(2)}+ x_i^Tw_k^{(2)}\), where \(k=1,\cdots ,6; i=1,\cdots ,N,\) and new subsets are defined by \(b_k^{(2)}+ x_i^Tw_k^{(2)}>cp\). Note that the new subsets are defined on the entire training data, not on the samples from the previous iteration, which is also very different from a traditional regression or decision tree structure. Again, neurons whose subsets are too small or too large are dropped permanently, as shown by the first node (in brown) and the last node (in pink) of the second hidden layer in Fig. 2, corresponding to (a1) and (c3) in Fig. 4. Figure 4 shows the data partitions obtained from the second iteration, where plots (a1), (a2), and (a3) represent the partitions generated by the NN trained on the subset of the first node (yellow) in the first iteration, while plots (c1), (c2), and (c3) correspond to the third node (red) in the first iteration. Plot (d2) shows that each training data point belongs to at least one of the four kept subsets, (a2), (a3), (c1), and (c2). We expect all or most data points to be covered by the different subsets. The reason is that neurons with different active regions in an NN represent different features and patterns in the data. LIFE trains the same type of base learner on part of the dataset but evaluates it on the entire dataset, which leads to small errors inside the sampled region and large errors outside it. Therefore, data sampling with these active regions can effectively define subsets that generate more diverse representations of the data and produce less correlated prediction errors. Sampling in this supervised manner is better than random sampling, as further discussed in Sect. 5. In practice, the first step can be reduced to one iteration or extended to more than two iterations, as described in Algorithm 1.

Fig. 4 Data partition by NN linear projection after the second iteration. Observations in the colored areas of (a1)–(a3) and (c1)–(c3) are selected into the corresponding subsets. Plot (d1) displays all training data points in blue, and plot (d2) shows the four kept data partitions in one plot

figure a

In the second step, four single-hidden-layer NNs, each with two hidden neurons, are trained independently as base learners on the subsets defined at the end of the first step, and the features are then generated by the ReLU activation functions. Please note that the features are evaluated on the entire training dataset. In the third step, we combine all these new features. Since all features are evaluated on the entire training set, they form a design matrix of dimension \(N \times m_J\), where N is the size of the training set and \(m_J\) is the total number of features extracted in step 2, with \(m_J=8\) in Fig. 2. Then, a linear model is fitted on these features, and \(\beta _i, i = 1, \cdots , m_J,\) is the coefficient of the \(i^{th}\) feature. In the default setting, a linear regression or logistic regression model is applied to combine the neurons extracted from the different base learners and make the final prediction. However, there may be too many features, and some of them can be highly correlated, leading to overfitting. Therefore, we need to prune redundant neurons by adding regularization or removing some base learners. Both methods not only prevent overfitting but also produce a more parsimonious NN model with fewer nodes, which is beneficial for interpretation. As a regularized regression method that linearly combines the \(L_1\) and \(L_2\) penalties, the elastic net is one of the options, and it is used in this paper for the third step of model aggregation and pruning. LASSO and ridge regression are special cases of the elastic net.
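
As a rough illustration of how the flattened neurons form the \(N \times m_J\) design matrix, the sketch below evaluates every hidden neuron of every base learner on all rows of the data. The base-learner weights are hypothetical placeholders rather than fitted values, and `flatten_neurons` is an illustrative name, not part of the paper.

```python
def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)


def flatten_neurons(X, base_learners):
    """Build the N x m_J design matrix of hidden-neuron outputs.

    base_learners is a list of (W, b) pairs, one per narrow network. Every
    neuron is evaluated on all rows of X, not only on its training subset.
    """
    features = []
    for x in X:
        row = []
        for W, b in base_learners:
            for w_k, b_k in zip(W, b):
                row.append(relu(b_k + sum(wj * xj for wj, xj in zip(w_k, x))))
        features.append(row)
    return features


# Four hypothetical base learners with two hidden neurons each (as in Fig. 2).
base_learners = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0], [-1.0, 1.0]], [0.5, -0.5]),
    ([[0.5, -0.5], [2.0, 0.0]], [0.1, -1.0]),
    ([[-1.0, -1.0], [0.0, -2.0]], [0.0, 1.0]),
]
X = [[0.2, -0.4], [1.0, 1.0], [-0.5, 0.3]]
Z = flatten_neurons(X, base_learners)  # 3 rows, 8 columns
```

A linear, logistic, or elastic net model fitted on `Z` (not shown here) would then jointly estimate one coefficient per neuron.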

By combining multiple relatively narrow but diversified single-hidden-layer NNs, each with \(K_J\) hidden neurons, the LIFE algorithm finally constructs a wide single-hidden-layer NN with \(m_J\) hidden neurons, which is 8 in Fig. 2. This training process is completely different from traditional methods, in which an NN is optimized as a whole by stochastic gradient-based optimizers. In most cases, it is numerically difficult and computationally expensive to train a wide single-hidden-layer NN and achieve good performance. LIFE helps overcome this difficulty and achieve decent performance. In addition, LIFE can leverage parallel computation to significantly reduce the training time. The detailed pseudo-code of LIFE with J iterations is given in Algorithm 1, where \(m_j\) denotes the number of remaining neurons after the \(j^{th}\) iteration, \(j=1,\cdots ,J\). The parameters u and l are the maximal and minimal proportions of the training set size, which provide the upper and lower bounds for the subset size, respectively. A neuron is dropped if the proportion of its subset size falls outside this range.

2.3 Theoretical foundation

An ensemble method, such as stacking [13], bagging [2], boosting [14] or Bayesian model averaging [15, 16], is composed of multiple independently or sequentially trained regressors or classifiers whose predictions are combined to make final predictions. Empirically, ensembles tend to yield better results than a single model when there is significant diversity among the models [17]. Over the past few decades, many studies have focused on the accuracy and diversity of ensemble methods in either the regression [18] or the classification case [19,20,21]. Krogh and Vedelsby (1994) [22] proposed the ambiguity decomposition and a computable approach to minimize the quadratic error of the ensemble estimator, while Ueda and Nakano (1996) [23] derived a general expression of the bias-variance-covariance decomposition. Brown et al. (2005) [24] and Hansen (2000) [25] investigated the connections between the ambiguity decomposition and the bias-variance-covariance decomposition and showed that they are identical. Based on the ambiguity decomposition, we establish the theoretical foundation for LIFE and extend the loss decomposition to both regression and classification settings, as discussed in Sects. 2.3.1 and 2.3.2.

2.3.1 Connection to stacking

Stacking is a type of ensemble method in which a final model is trained on the combined predictions of other models. In stacking, the predictions from different machine learning models are used as new inputs and are combined to generate a new set of predictions. Those predictions can be fed into additional layers, or the process can stop there with a final result. One important assumption behind stacking is that different base learners produce weakly correlated prediction errors that are complementary. With weighted averaging, base learners believed to be better or more accurate can be assigned higher weights. In the framework of stacking, an even better approach is to estimate these weights more intelligently by using another layer of learning algorithm, such as a linear model.

Table 1 Difference between LIFE and stacking

Fig. 5 Relationship between LIFE and stacking (regression)

Fig. 6 Relationship between LIFE and stacking (classification)

Some major differences between the stacking method and the LIFE algorithm are shown in Table 1. Without neuron flattening, LIFE is very similar to stacking. Despite the differences between the two methods, minimizing the loss function of LIFE is approximately equivalent to minimizing the loss function of the two-stage stacking method, which first fits multiple base learners (NNs trained on the different subsets sampled by the LIFE algorithm) and then uses the predictions of the base learners as inputs to train a model-averaging model. Figures 5 and 6 indicate a strong linear relationship between the losses of LIFE and the stacking method under different hyper-parameter setups, in terms of MSE loss for regression and cross-entropy loss for classification. This linear relationship has been verified on both simulated data (MIM) and real data (California Housing for regression and Gamma Telescope for classification). Owing to the BLUE (best linear unbiased estimator) property of the OLS estimator in linear regression, the joint estimation of the coefficients of all the combined features in the third step makes LIFE always outperform the two-stage stacking ensemble under the same setting. This can be verified in Fig. 5, where all the points lie below the green diagonal line. For the classification case, LIFE also performs better than the stacking method, with a smaller minimum loss, as shown in the white box of Fig. 6.

2.3.2 Loss function decomposition

Krogh and Vedelsby (1994) [22] proposed the ambiguity decomposition for the quadratic error of the ensemble estimator, which expresses it as the weighted average quadratic loss of the individual base learners minus an ambiguity term that measures diversity. We extend the ambiguity decomposition to both the mean squared error and the cross-entropy error via Taylor expansion, where the two loss functions correspond to regression and classification, respectively. Let an ensemble model with M base learners be expressed as \(f_{ens}= \sum _{j=1}^M \beta _jf^{(j)}\), where \(\sum _{j=1}^M\beta _j=1\) and \(\beta _j\ge 0\).

For any loss function that is twice differentiable, we can expand the loss function of the \(j^{th}\) base learner around the output of the ensemble model based on Taylor’s theorem with the Lagrange form of the remainder as follows:

$$\begin{aligned} \begin{aligned} l(y,f^{(j)})&= l(y,f_{ens})+ l^{\prime }(y,f_{ens})(f^{(j)}-f_{ens})\\&\quad + \frac{1}{2} l^{\prime \prime }(y,f^{(j)\star })(f^{(j)}-f_{ens})^2, \end{aligned} \end{aligned}$$
(1)

where the value of \(f^{(j)\star }\) is between \(f_{ens}\) and \(f^{(j)}\). Multiplying both sides of Eq. (1) by \(\beta _j\) and summing over j yields:

$$\begin{aligned} \begin{aligned}&\sum _{j=1}^{M} \beta _jl(y,f^{(j)}) \\&\quad = \sum _{j=1}^{M} \beta _j l(y,f_{ens})+ \sum _{j=1}^{M} \beta _j l^{\prime }(y,f_{ens})(f^{(j)}-f_{ens}) \\&\quad\quad\quad + \frac{1}{2} \sum _{j=1}^{M} \beta _jl^{\prime \prime }(y,f^{(j)\star })(f^{(j)}-f_{ens})^2. \end{aligned} \end{aligned}$$
(2)

The second term on the right side of (2) is expressed by:

$$\begin{aligned} \begin{aligned}&\sum _{j=1}^{M} \beta _j l^{\prime }(y,f_{ens})(f^{(j)}-f_{ens})\\&\quad = l^{\prime }(y,f_{ens}) \left\{\sum _{j=1}^{M} \beta _j f^{(j)}-f_{ens}\sum _{j=1}^M\beta _j \right\}\\&\quad =l^{\prime }(y,f_{ens})\{f_{ens}-f_{ens}\}=0. \end{aligned} \end{aligned}$$
(3)

Since this term is zero and \(\sum _{j=1}^M\beta _j=1\), the loss function \(l(y,f_{ens})\) of the ensemble can be decomposed into:

$$\begin{aligned} \begin{aligned} l(y,f_{ens}) = \sum _{j=1}^{M} \beta _j l(y,f^{(j)}) - \frac{1}{2} \sum _{j=1}^{M}\beta _j l^{\prime \prime }(y,f^{(j)\star })(f_{ens}-f^{(j)})^2. \end{aligned} \end{aligned}$$
(4)

In the regression case, let \(f_i=\sum _{j=1}^M \beta _j f_{i}^{(j)}\) be the predicted value of the ensemble model for the \(i^{th}\) observation, where M is the number of base learners, \(f_{i}^{(j)}\) is the predicted value of the \(j^{th}\) base learner for the \(i^{th}\) observation, and \(\beta _j\) denotes the regression coefficient of the \(j^{th}\) base learner. The mean squared error (MSE), \(l(y,f)=\frac{1}{N}\sum _{i=1}^N(y_i-f_i)^2\), is a commonly used loss function for regression problems, where N is the total number of observations in the entire dataset. Since the second derivative of the squared loss \((y_i-f_i)^2\) with respect to \(f_i\) equals 2, Eq. (4) implies that the MSE of an ensemble model can be written in terms of the ambiguity decomposition given \(x_i, i=1,\cdots ,N\):

$$\begin{aligned} \begin{aligned} MSE =& \frac{1}{N}\sum _{i=1}^N(y_i-f_i)^2= \frac{1}{N}\sum _{i=1}^N \left(y_i-\sum _{j=1}^M \beta _j f_{i}^{(j)}\right)^2 \\&\quad =\underbrace{\frac{1}{N}\sum _{i=1}^N\sum _{j=1}^M \beta _j(y_i- f_{i}^{(j)})^2}_\text {accuracy}-\underbrace{\frac{1}{N}\sum _{i=1}^N\sum _{j=1}^M \beta _j(f_i- f_{i}^{(j)})^2}_\text {diversity}. \end{aligned} \end{aligned}$$
(5)

On the right-hand side of Eq. (5), the first term measures the weighted average prediction error of the base learners, while the second term is called the ambiguity (hence the name of the decomposition) and can be easily interpreted as the diversity among individual base learners. Unlike the bias-variance-covariance decomposition, the ambiguity decomposition highlights a trade-off between the average accuracy of the base learners and their deviation from the ensemble output.
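
The identity in Eq. (5) can be checked numerically. The sketch below is only an illustrative check under assumptions: the base-learner predictions are synthetic (the target plus noise) and the convex weights are drawn at random, so none of the numbers relate to the experiments in the paper.

```python
import random

random.seed(1)
N, M = 200, 4

y = [random.gauss(0, 1) for _ in range(N)]
# Hypothetical base-learner predictions: the target plus learner-specific noise.
preds = [[yi + random.gauss(0, 0.5 + 0.2 * j) for yi in y] for j in range(M)]

# Convex weights: non-negative and summing to one.
raw = [random.random() for _ in range(M)]
beta = [r / sum(raw) for r in raw]

# Ensemble prediction f_i = sum_j beta_j * f_i^(j).
f_ens = [sum(beta[j] * preds[j][i] for j in range(M)) for i in range(N)]

mse = sum((y[i] - f_ens[i]) ** 2 for i in range(N)) / N
accuracy = sum(beta[j] * (y[i] - preds[j][i]) ** 2
               for i in range(N) for j in range(M)) / N
diversity = sum(beta[j] * (f_ens[i] - preds[j][i]) ** 2
                for i in range(N) for j in range(M)) / N
```

Up to floating-point error, `mse` equals `accuracy - diversity`, and because `diversity` is non-negative the ensemble error never exceeds the weighted average error of the base learners.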

In LIFE, each base learner is a single-hidden-layer neural network trained on a subset of the observations. A stronger base learner yields better model performance, which is reflected in the first term of Eq. (5). If the subset size is too small, the base learner is weak, which deteriorates performance; thus, a lower bound on the subset size is set. The power of the LIFE framework comes from the second term, diversity, which results from the data sampling during the iterations. Creating different subsets through sampling allows the base learners to be trained on different aspects of the data, which produces diversity deliberately without resorting to other machine learning algorithms. In general, the more diverse the subsets, the better the predictive performance of LIFE. Hence, an upper bound is also necessary: a subset that contains almost all observations is nearly identical to the full training set, so the subsets lose diversity. The cutoff point is a further parameter used to balance accuracy and diversity.

For binary classification, let \(f_{i}= \sum _{j=1}^M \beta _jf_i^{(j)}\) be the predicted probability of the ensemble model for the \(i^{th}\) observation, which is a weighted average of the predicted probabilities or log-odds \(f_i^{(j)}\) of the base learners, where \(\sum _{j=1}^M\beta _j=1\) and \(\beta _j\ge 0\). The cross-entropy loss is widely used for classification and can be written as follows for a single observation:

$$\begin{aligned} \begin{aligned} l(y_i,f_i)=-y_ilog(f_i)-(1-y_i)log(1-f_i). \end{aligned} \end{aligned}$$
(6)

By plugging the loss function (6) into Eq. (4), we can write the cross-entropy loss of an ensemble method, summed over the training set \(\{x_i, y_i\}_{i=1,\cdots ,N}\), in the probability space as follows:

$$\begin{aligned} \begin{aligned}&\sum _{i=1}^N [-y_ilog(f_i)-(1-y_i)log(1-f_i)] \\&\quad =\underbrace{\sum _{i=1}^N\sum _{j=1}^M \beta _j [-y_ilog(f_i^{(j)})-(1-y_i)log(1-f_i^{(j)})]}_\text {accuracy} \\&\quad -\underbrace{\frac{1}{2} \sum _{i=1}^N\sum _{j=1}^M \beta _j \left\{\frac{y_i-2f_i^{(j)\star }y_i+\left(f_i^{(j)\star }\right)^2}{[f_i^{(j)\star }(1-f_i^{(j)\star })]^2}\right\}(f_i- f_{i}^{(j)})^2}_\text {diversity}, \end{aligned} \end{aligned}$$
(7)

where \(f_i^{(j)\star }\) takes a value between \(f_i^{(j)}\) and \(f_i\). The term \((f_i-f_i^{(j)})^2\) measures the difference between the base learner and the ensemble. The cross-entropy loss and its decomposition in the log-odds space are provided in the “Appendix”. Unlike the diversity term in the regression case, the second term (diversity) on the right-hand side of Eq. (7) also includes the true class label \(y_i\) and the unknown intermediate value \(f_i^{(j)\star }\). However, the interpretation of the decomposition remains clear. It shows that a lower average accuracy of the individual base learners can be compensated by a higher disagreement with the ensemble, scaled by \(\frac{1}{(f_i^{(j)\star })^2}\) if \(y_i=1\) or \(\frac{1}{(1-f_i^{(j)\star })^2}\) if \(y_i=0\) in the probability space. Since \(\frac{\beta _j}{(f_i^{(j)\star })^2}\) and \(\frac{\beta _j}{(1-f_i^{(j)\star })^2}\) are non-negative, a larger deviation between the predicted probabilities of a base learner and the ensemble model implies more diversity.
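
The scaling factor in the diversity term of Eq. (7) is the second derivative of the cross-entropy loss (6) with respect to the predicted probability. As a short worked step, writing f for a generic probability in (0, 1) and y for the binary label:

$$\begin{aligned} \frac{\partial l(y,f)}{\partial f}&= -\frac{y}{f}+\frac{1-y}{1-f}, \\ \frac{\partial ^2 l(y,f)}{\partial f^2}&= \frac{y}{f^2}+\frac{1-y}{(1-f)^2} = \frac{y(1-f)^2+(1-y)f^2}{[f(1-f)]^2} = \frac{y-2fy+f^2}{[f(1-f)]^2}. \end{aligned}$$

Setting \(y=1\) gives \(\frac{1}{f^2}\) and setting \(y=0\) gives \(\frac{1}{(1-f)^2}\), which are the two scaling factors quoted above. Both are positive, so the cross-entropy loss is convex in f and the diversity term can only reduce the ensemble loss.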

Fig. 7 Loss decomposition for regression

Fig. 8 Loss decomposition for classification

We apply the loss decomposition on simulated data (MIM) and real data (California Housing for regression and Gamma Telescope for classification) to LIFE without neuron flattening, which ensembles the predictions of the single-hidden-layer NN base learners directly and is thus the two-stage stacking (model averaging) method discussed above. As the subset size in the sampling step of LIFE affects the strength of diversity, we examine the overall loss, the weighted sum of the individual base learner losses, and the ambiguity measure against the average subset size over all the single-hidden-layer NNs in the first step of LIFE. Here, we vary the subset size by controlling the cutoff point cp for the linear projection. Figures 7 and 8 show the relationship between the average subset size (as a proportion of the original training data size) and the loss for accuracy, the loss for diversity, and the overall loss in the regression and classification cases. In plots (a) and (b), the blue curves indicate that the ambiguity (diversity) measure decreases as the subset size increases. This is consistent with the intuition that the more the subsets overlap, the less diverse the base learners are. On the other hand, the accuracy of the individual base learners is higher when the subset size is larger, as the sample is more representative of the whole training set. Conversely, training on smaller subsets yields stronger diversity but lower accuracy. Plots (c) and (d) illustrate the trade-off between average accuracy and diversity in minimizing the total loss, where the dashed line shows the optimal subset size achieving the minimum loss. Combining the results from Sects. 2.3.1 and 2.3.2, we conclude that the competitive predictive performance of LIFE benefits from the diversity due to data sampling in the first step and from the feature ‘flattening’ and joint estimation in the third step.
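
To illustrate how the cutoff point controls the average subset size, the following sketch sweeps cp over a small grid and records the average proportion of observations selected per neuron. The neuron weights, the grid of cutoffs, and the standard normal predictors are all assumptions made for this illustration only.

```python
import random


def subset_proportion(X, w, b, cp):
    """Share of rows with b + x . w > cp."""
    hits = sum(1 for x in X if b + sum(wj * xj for wj, xj in zip(w, x)) > cp)
    return hits / len(X)


random.seed(2)
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(2000)]

# Hypothetical neurons, given as (weight vector, bias) pairs.
neurons = [
    ([1.0, -0.5, 0.2], 0.1),
    ([0.3, 0.8, -1.0], -0.2),
    ([-0.6, 0.4, 0.9], 0.0),
]

cutoffs = [-1.0, -0.5, 0.0, 0.5, 1.0]
avg_size = [
    sum(subset_proportion(X, w, b, cp) for w, b in neurons) / len(neurons)
    for cp in cutoffs
]
```

Raising cp can only shrink each subset, so `avg_size` is non-increasing along the grid; this is the knob that is varied to trace out the accuracy and diversity curves.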

3 Empirical experiment

In this section, we conduct multiple empirical experiments on both simulated and real data for regression and classification cases to confirm the competitive performance of LIFE. We generate multiple datasets with Normal and heavy-tailed (Laplace) predictor distributions, as well as different functional forms, to analyze predictive performance and computational efficiency. All datasets are split into \(80\%\) training data and \(20\%\) testing data. Benchmark models, including single-hidden-layer NNs trained by different optimizers (local linear approximation and the Adam algorithm) and other machine learning algorithms (multilayer FFNN, Xgboost, and random forest), are tested on the same data for comparison after extensive hyper-parameter tuning. The local linear approximation (LLA) algorithm is a recently proposed method that estimates the weights and biases of a single-hidden-layer NN by iterative linear regression and linear approximation of the ReLU activation function [26]. The LLA algorithm is distinguished from existing gradient descent algorithms in that it utilizes the Hessian matrix in the same spirit as the Fisher scoring algorithm for nonlinear regression models with normal errors. An outline of the LLA algorithm is included in the “Appendix”.

3.1 Simulated data

3.1.1 Regression

For the regression scenario, we consider three functional forms: the generalized additive model (GAM), the additive index model (AIM), and the multiple index model (MIM), expressed as follows:

$$\begin{aligned}{} & {} \begin{aligned} GAM: y_i&=\beta _1 x_{1i} +\beta _2\sqrt{ |x_{2i} |} +\beta _3 |x_{3i} |\\&\quad + \beta _4exp(x_{4i})+\beta _5log( |x_{5i} |)+\beta _6max(1,x_{6i})+\epsilon _i,\\ \beta&= \{\beta _1, \cdots , \beta _6\}=\{1.5,\sqrt{5}, 2,4e^{(\frac{-1.5}{7})}, 4log(1.5),-4\}, \epsilon _i \sim N(0,1), i=1,\dots ,N, \end{aligned} \end{aligned}$$
(8)
$$\begin{aligned}{} & {} \begin{aligned} AIM: y_i&= 2log( |\beta _1x_{1i}+\cdots +\beta _4x_{4i} |)\\&\quad +exp(\frac{\beta _3x_{3i}+\cdots +\beta _6x_{6i}}{9})\\&\quad +max(0,\beta _5x_{5i}+\beta _6x_{6i})+\epsilon _i, \\ \beta&= \{\beta _1, \cdots , \beta _6\}\\&=\{3,-2.5,2,-1.5,1.5,-1\}, \end{aligned} \end{aligned}$$
(9)
$$\begin{aligned}{} & {} \begin{aligned} MIM: y_i&= exp(\beta _1 x_{1i}+\beta _2 x_{2i})\beta _3x_{3i}\\&\quad +\frac{\beta _4x_{4i}}{1+\beta _5 |x_{5i} |}+max(2,\beta _6x_{6i})+\epsilon _i, \\ \beta&= \{\beta _1, \cdots , \beta _6\}\\&=\{0.03,-0.025,1,-3,1.5,-2\}, \end{aligned} \end{aligned}$$
(10)

where \(N=20k\) and all predictors \(\{x_{ji}\}_ {j=1,\cdots ,6;i=1,\cdots ,N}\) are drawn from a Normal or Laplace distribution. For regression, Tables 2 and 3 report the mean and standard deviation of RMSE, \(R^2\), and training time T over five replications, while logloss and AUC are used as the performance metrics in the classification case, as shown in Tables 4 and 5. Bayesian optimization allows us to jointly tune more parameters with fewer experiments and find better values, so we use it to perform extensive hyper-parameter tuning on all the algorithms. The important hyper-parameters for LIFE include the number of iterations, the number of neurons in each iteration, and the upper and lower bounds. The best results are marked in bold.
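As an illustration of the simulation setup, the GAM form in Eq. 8 can be generated as in the sketch below. The text does not state the location and scale of the predictor distributions, so zero location and unit scale are assumed here; the function name and the `seed` argument are likewise hypothetical.

```python
import numpy as np

def simulate_gam(n=20_000, dist="normal", seed=0):
    """Draw (X, y) from the GAM functional form of Eq. 8.

    Assumptions not stated in the text: predictors are i.i.d. with zero
    location and unit scale; `seed` is a hypothetical convenience argument.
    """
    rng = np.random.default_rng(seed)
    if dist == "normal":
        X = rng.normal(0.0, 1.0, size=(n, 6))
    elif dist == "laplace":
        X = rng.laplace(0.0, 1.0, size=(n, 6))
    else:
        raise ValueError("dist must be 'normal' or 'laplace'")

    # Coefficients beta_1, ..., beta_6 as listed in Eq. 8.
    beta = np.array([1.5, np.sqrt(5), 2.0,
                     4 * np.exp(-1.5 / 7), 4 * np.log(1.5), -4.0])

    signal = (beta[0] * X[:, 0]
              + beta[1] * np.sqrt(np.abs(X[:, 1]))
              + beta[2] * np.abs(X[:, 2])
              + beta[3] * np.exp(X[:, 3])
              + beta[4] * np.log(np.abs(X[:, 4]))
              + beta[5] * np.maximum(1.0, X[:, 5]))
    y = signal + rng.normal(0.0, 1.0, size=n)   # epsilon_i ~ N(0, 1)
    return X, y
```

The AIM and MIM forms in Eqs. 9 and 10 can be generated in the same way by replacing the `signal` expression.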

Table 2 Regression on simulated data (normal distribution)
Table 3 Regression on simulated data (laplace distribution)

As illustrated in Tables 2 and 3, the results are consistent regardless of the distribution from which the predictors are drawn. The LIFE algorithm with the LLA optimizer achieves the highest accuracy among all methods in terms of predictive performance on the test set. The values of the two metrics from LIFE (LLA) are close to the oracle values, which implies that LIFE performs well on data with a smooth response surface. Comparing one-hidden-layer FFNNs trained by the LIFE algorithm (with either LLA or Adam base learners) against their non-ensemble counterparts trained by LLA or Adam alone, LIFE always outperforms the corresponding optimization method, owing to the diversity generated by data sampling. In addition, the performance of LIFE also depends on the strength of the individual NN base learners, as is evident in Tables 2 and 3, where LIFE (LLA) outperforms LIFE (Adam). From the perspective of computational efficiency, the LIFE algorithm also shows some advantages over other single-hidden-layer NN training algorithms. In general, the LIFE algorithm can not only boost the predictive performance of a one-hidden-layer NN but also speed up training, especially for wide NNs with a large hidden-layer dimension.

3.1.2 Classification

For the classification case, the functional forms in the simulation setup are similar to those in the regression case, except that the coefficients differ slightly. The detailed formulas can be found in the “Appendix”. As in the regression case, we choose \(N=20k\), and all predictors are drawn from either a Normal or a Laplace distribution. The response variable is sampled from a Bernoulli distribution with probability calculated using the logit link function.
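The sampling of the binary response can be sketched as follows, where `eta` stands for the value of the chosen functional form (GAM, AIM, or MIM) for each observation. The function name and the `seed` argument are hypothetical.

```python
import numpy as np

def sample_binary_response(eta, seed=0):
    """Sample y_i ~ Bernoulli(p_i) with logit(p_i) = eta_i.

    eta : array with the value of the functional form for each observation
    """
    rng = np.random.default_rng(seed)
    eta = np.asarray(eta, dtype=float)
    p = 1.0 / (1.0 + np.exp(-eta))                      # inverse of the logit link
    y = (rng.uniform(size=eta.shape) < p).astype(int)   # Bernoulli draw
    return y, p
```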

Table 4 Classification on simulated data (normal distribution)
Table 5 Classification on simulated data (laplace distribution)

Tables 4 and 5 show the simulation results for the binary scenario, where data are generated from the Normal distribution and the Laplace distribution, respectively. Similar to the regression case, LIFE (LLA) achieves the best result in four out of six functional forms. LIFE (Adam) also performs well, especially in Table 5 for the Laplace distribution. For data drawn from the Normal distribution (Table 4), LIFE (Adam) is also quite close to the optimal result. Furthermore, there is strong evidence that the LIFE algorithm improves the performance of the base learners, with larger AUC, smaller logloss, and smaller standard errors of both metrics. Even if the base learner is not strong enough, as with Adam for GAM and MIM under the Laplace distribution (Table 5), where AUC or logloss or both have a large standard error, the ensemble approach in LIFE (Adam) can dramatically reduce the variance. Xgboost ranks at the top for GAM in both distributions; however, the differences between LIFE and Xgboost are negligible, with only a 0.1% difference in AUC and a 3% difference in logloss.
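For reference, the two classification metrics used in Tables 4 and 5 can be computed as in the following sketch. The AUC is written in its pairwise (Mann-Whitney) form, which is quadratic in memory and therefore only meant for illustration; the function names are hypothetical, and the `eps` clipping constant is an assumption.

```python
import numpy as np

def log_loss(y_true, p, eps=1e-15):
    """Average negative Bernoulli log-likelihood (logloss)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def auc(y_true, score):
    """Probability that a random positive is scored above a random negative."""
    y_true = np.asarray(y_true)
    score = np.asarray(score, dtype=float)
    pos = score[y_true == 1]
    neg = score[y_true == 0]
    diff = pos[:, None] - neg[None, :]                      # all positive-negative pairs
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)
```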

3.2 Real data

Besides applying the LIFE algorithm to simulated data, we also tested it on seven public datasets for regression and eight datasets for classification and compared it with other benchmark models, including single-hidden-layer FFNN and Xgboost. All datasets are split into \(80\%\) training data and \(20\%\) testing data with 10 different random seeds, which yields results over 10 replications. For all the datasets, we transformed categorical variables into dummy variables and standardized the continuous variables, so that the mean and the variance of each continuous variable are equal to 0 and 1, respectively. A detailed description of all datasets and the corresponding data preprocessing steps is given in the “Appendix”.
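A minimal sketch of the split and standardization step is given below. It assumes that categorical variables have already been converted to dummy columns and that the standardization statistics are computed on the training split; the text does not specify the latter, so this is an assumption, and all names are hypothetical.

```python
import numpy as np

def split_and_standardize(X, y, cont_cols, test_frac=0.2, seed=0):
    """Random 80/20 split with standardization of the continuous columns.

    cont_cols : list of column indices holding continuous variables
                (dummy columns are left unchanged)
    Assumption: mean and standard deviation are estimated on the training split.
    """
    X = np.array(X, dtype=float)          # copy, so the caller's array is untouched
    y = np.asarray(y)
    n = X.shape[0]
    perm = np.random.default_rng(seed).permutation(n)
    n_test = int(round(test_frac * n))
    test_idx, train_idx = perm[:n_test], perm[n_test:]

    mean = X[train_idx][:, cont_cols].mean(axis=0)
    std = X[train_idx][:, cont_cols].std(axis=0)
    std[std == 0] = 1.0                   # guard against constant columns
    X[:, cont_cols] = (X[:, cont_cols] - mean) / std
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]
```

Calling the function with ten different values of `seed` gives the ten replications described above.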

3.2.1 Regression

The experiment results averaged over ten replications are reported in Table 6, including root mean squared error (RMSE), R-squared (\(R^2\)), and training time (T).

Table 6 Regression on real data

As observed in Table 6, LIFE (LLA) is ranked as the best algorithm in four datasets. For the remaining three datasets, LIFE (LLA) is still the second or third best algorithm among all models, with predictive performance close to or slightly worse than that of the best one (Xgboost or random forest), which implies that the LIFE algorithm is competitive with other state-of-the-art machine learning algorithms. In addition, there is an average \(4.6\%\) or \(1.8\%\) improvement in \(R^2\) across all real datasets when the single-hidden-layer NN is trained by LIFE (LLA or Adam) instead of the corresponding optimization method alone (LLA or Adam), which is consistent with the conclusion drawn from the experiments on simulated data. It is also worth mentioning that the computational efficiency of NN training is significantly improved for almost all datasets via LIFE compared with traditional NN training methods. In particular, for the largest real dataset, CASP, the training time of the NN via LIFE drops from 2881 seconds to 216 seconds with LLA as the optimizer and from 339 seconds to 54 seconds with Adam, that is, more than six times faster in both cases. Although tree-based ensemble methods such as Xgboost and random forest show strong predictive power on some datasets, they are still black-box models and are hard to interpret. The biggest advantage of our proposed LIFE algorithm is that it preserves the interpretability of the model, which remains a single-hidden-layer NN, while offering very strong predictive performance and improved computational efficiency.

3.2.2 Classification

In the classification case, the original LLA algorithm is not stable, since it involves matrix inversion. We therefore add a ridge parameter to the matrix inversion and treat it as a hyper-parameter of the LLA algorithm. Experiments show that adding the ridge parameter in LLA gives better and more stable predictions than omitting it. After testing LLA and LIFE (LLA) with and without the ridge parameter, we report the better-performing variant.
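The stabilizing effect of a ridge parameter can be illustrated on a generic least-squares system. The sketch below is not the LLA algorithm itself, and the text does not specify exactly where the ridge term enters LLA; it only shows, under that caveat, how adding `ridge * I` keeps the linear solve well-posed when the Gram matrix is singular or nearly so. The function name and default value are hypothetical.

```python
import numpy as np

def ridge_solve(A, b, ridge=1e-3):
    """Solve (A^T A + ridge * I) x = A^T b rather than inverting A^T A directly."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    gram = A.T @ A
    # The ridge term shifts all eigenvalues of the Gram matrix away from zero.
    return np.linalg.solve(gram + ridge * np.eye(gram.shape[0]), A.T @ b)
```

With two identical columns in `A`, the unregularized Gram matrix is singular, whereas the ridge version returns a finite solution that splits the coefficient equally between the two columns.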

Table 7 Classification on real data

Table 7 shows patterns in the real data analyses similar to those in the simulation studies, with LIFE (LLA) and LIFE (Adam) taking turns as the best performer for most of the datasets. LIFE performs much better than Xgboost and random forest (RF) in most experiments. For example, with Breast Cancer Wisconsin data, the logloss of LIFE (Adam) is 26.3% lower than that of Xgboost, and with MAGIC Gamma Telescope data, the logloss drops by 7% from random forest to LIFE (LLA). For some datasets, such as Bank Marketing data, LIFE cannot outperform Xgboost or RF; however, its performance is competitive, with only 0.5% and 0.7% differences in AUC and logloss, respectively. Another aspect worth mentioning is that Higgs Boson data contains quite a few highly correlated variables. The results show that the LIFE algorithm outperforms all the other models, which indicates that LIFE predicts well on highly correlated structures.

Table 8 ResNet18 vs. its potential improvers

We have also investigated whether the LIFE algorithm can further improve the performance of a trained deep neural network on image data. We use ResNet18, proposed by He et al. (2016) [27], as an example and apply the algorithms to MNIST data [28]. The details of data preprocessing can be found in Case 8 from the description lists of real datasets 6. After training ResNet18 on the MNIST data, we extract the output of the final convolutional layer, which has size \(8000 \times 512\) and is also the input of the feed forward neural network (FFNN) in the final step of ResNet18. This \(8000 \times 512\) data is then treated as the input of LIFE (Adam). We also attach Xgboost and Adam to ResNet18 and compare the results.

3.2.3 Classification on image data

Figure 9 shows that the performance of LIFE (orange line) is better than that of ResNet18 (blue line), with consistently larger AUC and smaller logloss in all 10 replications. LIFE is comparable to Xgboost in terms of logloss in general, with all 10 values below those of ResNet18. However, in terms of AUC, Xgboost is worse than ResNet18, with significantly lower AUC in two replications (seed = 2 and seed = 8). On the other hand, Adam (red line) is not able to further improve the performance of ResNet18, which is consistent with what we found in the previous empirical studies. Table 8 also indicates that LIFE and Xgboost are both capable of remarkably enhancing a trained deep NN, with similar performance. One last observation is that, since the input of LIFE contains 512 columns, LIFE can handle high dimensionality quite well in terms of prediction.

Fig. 9 ResNet18 vs. its potential improvers

4 Interpretation

Interpretability is the degree to which a human being can understand the cause of a decision or predict the result of a model. The higher the interpretability of a machine learning or deep learning model, the easier it is for someone to comprehend why certain decisions or predictions have been made. A key advantage of LIFE is that it remains an inherently interpretable model. From the perspective of the NN structure, the model is a single-hidden-layer NN with ReLU activation function, in which all the weights and biases can be easily extracted and visualized. Moreover, the single-layer NN with ReLU activation function can be rewritten as a local linear model and interpreted by exploring the patterns of the local linear model coefficients. Finally, the main and interaction effects can be identified by exploring and aggregating the local linear coefficients.

We use the bike sharing data result as an example to illustrate the intrinsic interpretability of LIFE. Bike sharing data is a public dataset hosted on the UCI machine learning repository, with around 17,000 observations on hourly (and daily) bike rental counts, along with weather and time information, between 2011 and 2012 in the Capital Bikeshare system. Out of the original 17 predictors, we removed some non-meaningful and highly correlated ones, leaving 9 predictors to predict hourly rental counts. At a small cost in predictive performance, we applied both the base learner selection method shown in Algorithm 3 in Sect. 5.2 and elastic net to reduce the number of base learners and features, so that the final single-hidden-layer NN has a small number of significant neurons and is easier to interpret.

Fig. 10 Variable importance detection (LIFE)

Fig. 11 Variable importance detection (random forest)

4.1 Explore the weights and bias of single-layer NN

As LIFE finally generates a single-hidden-layer NN in the third step, we can explore the weights \({\hat{w}}_k\) and biases \({\hat{b}}_k\) of the NN directly and identify which variables are important. For bike sharing data, there are finally 116 new features (or neurons) after base learner selection and elastic net regularization. We measure the neuron importance by \(std({\hat{\beta }}_k\sigma ({\hat{b}}_k+x^T{\hat{w}}_k))/ std({\hat{f}})\), where std is the standard deviation, \({\hat{f}}\) is the predicted value of the response variable for regression or the log-odds for classification, \({\hat{w}}_k\) and \({\hat{b}}_k\) are the weights and bias of neuron k, and \({\hat{\beta }}_k\) is the output-layer coefficient for the neuron. This quantity measures the importance of each neuron (feature) by comparing the variation of the feature with the total variation of the prediction. The histogram of the neuron importance for the 116 features in Fig. 10 shows that there are only 13 neurons whose importance values are greater than \(2\%\) of the maximum importance.

Then, we can detect how each variable contributes to each neuron by applying the following measurement:

$$\begin{aligned} {\hat{w}}_k{\hat{\beta }}_k \frac{std(\sigma ({\hat{b}}_k+x^T{\hat{w}}_k))}{std({\hat{f}})} \end{aligned}$$

where we allocate the neuron importance to each variable by multiplying by \({\hat{w}}_k\). This contribution measurement can be visualized by a heatmap between neurons and original variables, as in Fig. 10. It shows that hour (hr) and working day (workingday) are the most significant variables, with darker colors for almost every important neuron compared with other variables. Temperature (temp) can also be considered relatively important, after hr and workingday.
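Both measures can be computed directly from the fitted weights, as in the following sketch. It assumes that \(\sigma \) is the ReLU function and that the weights are stored as a \(p \times K\) matrix; the function and argument names are hypothetical.

```python
import numpy as np

def neuron_importance(X, W, b, beta, beta0=0.0):
    """Neuron importance and per-variable contribution for a ReLU network.

    X    : (n, p) standardized predictors
    W    : (p, K) hidden-layer weights, column k is w_k
    b    : (K,)  hidden-layer biases b_k
    beta : (K,)  output-layer coefficients beta_k
    """
    H = np.maximum(0.0, X @ W + b)          # ReLU activations sigma(b_k + x^T w_k), shape (n, K)
    f_hat = beta0 + H @ beta                # fitted value (or log-odds), shape (n,)
    std_f = f_hat.std()

    importance = (H * beta).std(axis=0) / std_f           # std(beta_k * sigma(.)) / std(f), shape (K,)
    contribution = W * (beta * H.std(axis=0) / std_f)     # w_k * beta_k * std(sigma(.)) / std(f), shape (p, K)
    return importance, contribution
```

The absolute value of `contribution[j, k]` equals `abs(W[j, k]) * importance[k]`, so a heatmap of `contribution` distributes each neuron's importance over the original variables.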

Variable importance detection using random forest in Fig. 11 shows similar findings, with hour (hr) having the highest relative importance score, followed by temperature (temp) and working day (workingday).

4.2 Treat a single-layer NN as a local linear model

As we may have many features in the final wide single-layer NN, it is difficult to visualize and explore the weights and biases of all the neurons. Hence, we also propose to interpret the single-layer NN from a local linear model perspective. A single-layer NN with the ReLU function can be considered a type of local linear model. Each linear projection determines the active or inactive state of a ReLU neuron in the hidden layer, and these states together define the layered pattern. The activation regions are constructed as combinations of those distinct patterns. They are mutually exclusive and can be regarded as convex polytopes with closed-form boundaries [29]. Within an activation region, a single linear equation represents the relationship between the response and the independent variables for all data points. After determining the region to which each observation belongs, we can easily extract a linear equation for each region based on the estimated weights in the hidden and output layers. The detailed algorithm that performs the linear equation extraction is the following:

figure b
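The idea behind the extraction can be sketched in NumPy as follows. This is an illustrative reading of the description above and not the authors' Algorithm 2 verbatim; the function name and array layout are assumptions.

```python
import numpy as np

def extract_local_linear(X, W, b, beta, beta0=0.0):
    """Local linear coefficients of f(x) = beta0 + sum_k beta_k * relu(b_k + x^T w_k).

    Inside an activation region with on/off pattern a (a_k in {0, 1}):
        f(x) = [beta0 + sum_k a_k beta_k b_k] + x^T [W (a * beta)].

    X : (n, p), W : (p, K), b : (K,), beta : (K,)
    """
    active = (X @ W + b > 0).astype(float)       # (n, K) activation pattern per observation
    coefs = (active * beta) @ W.T                # (n, p) local slopes alpha_1, ..., alpha_p
    intercepts = beta0 + (active * beta) @ b     # (n,)  local intercept alpha_0

    # Observations with the same activation pattern lie in the same region.
    _, region_id = np.unique(active, axis=0, return_inverse=True)
    region_id = np.asarray(region_id).reshape(-1)
    return intercepts, coefs, region_id
```

For every observation, `intercepts + (coefs * X).sum(axis=1)` reproduces the network output exactly, which is a convenient sanity check on the extracted equations.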

We can visualize those linear equations by a parallel coordinate plot, which allows comparing the estimated coefficients of all predictors across different local linear regions. Through the visualization of the local linear equations, we can not only obtain an overview of the importance of each predictor in each region by comparing the magnitudes of the coefficients, but also check the validity of the effect of each predictor on the response variable. It is worth mentioning that those coefficients are comparable after standardizing all the predictors. There are three scenarios for a particular independent variable:

  1. Relatively large coefficients of the variable (in absolute value, compared with the others) that all share the same sign imply that this variable has a significant positive or negative effect on the response variable, depending on whether the coefficients are all positive or all negative.

  2. Relatively large coefficients of the variable with both positive and negative signs strongly imply that this variable has inconsistent slopes across local activation regions, which might be due to either its own nonlinear main effect or interaction effects with other variables.

  3. Small and close-to-zero coefficients indicate that this feature is not important for explaining the variation of the response variable and can be removed from the model.

Furthermore, we can verify whether the signs of the estimated coefficients of a predictor in all regions are consistent with domain knowledge or business sense. Figure 12 displays the estimated coefficients of all predictors in the local activation regions for the bike sharing dataset. There are 116 neurons extracted from the NN base learners and 47 local regions created by Algorithm 2, each of which contains at least three data points.

Fig. 12 Parallel coordinates plot for bikesharing data

Figure 12 clearly indicates that hour (hr), working day (workingday), and temperature (temp) are the three most important predictors, with relatively large absolute values of their coefficients in several local regions, which is consistent with the result from Fig. 11. Their estimated coefficients take different signs across local activation regions, which corresponds to our second scenario.

This gives us a hint of interactions between those variables. Other variables, such as humidity (hum) and wind speed (windspeed), are insignificant based on the absolute values of their estimated coefficients in the plot. Sometimes there are too many local regions; Sudjianto et al. (2020) [29] provide two approaches, merging and flattening, to simplify and reduce the number of local linear equations, along with a variety of other diagnostic tools and plots for local linear models.

4.3 Main and interaction effect detection

Even though the parallel coordinates plot provides a guideline about the variable importance in each local region, we still need a solid technique to detect nonlinear main effects and interaction effects. For this purpose, we can treat the single-hidden-layer NN as a varying coefficient model through the linear equation extraction shown in Algorithm 2. Because all the local linear equation coefficients vary over local regions, and the region definition depends on the predictors, the coefficients can be treated as functions of the predictors, as in Eq. 11.

$$\begin{aligned} \begin{aligned}& {\hat{f}}_i=\alpha _{0i}+\alpha _{1i}x_{1i}+\cdots +\alpha _{mi}x_{mi}+\cdots +\alpha _{pi}x_{pi}, \\ &\ i=1,\cdots ,n, \\ &\text {where } \alpha _{mi}=g(x_{1i},\cdots ,x_{pi}), \end{aligned} \end{aligned}$$
(11)

where p is the number of predictors and n is the number of observations. \({\hat{f}}_i\) is the predicted value for regression and the predicted log-odds for classification. \(\alpha _{mi}\) is the coefficient for the \(m^{th}\) variable at the \(i^{th}\) observation; it can be a function of all predictors and therefore varies across observations. Our goal is to investigate the functional forms of the estimated coefficients. Therefore, we separate \(\alpha _{mi}\) into two components representing main and interaction effects in Eq. 12:

$$\begin{aligned} \begin{aligned} \alpha _{mi}=g(x_{1i},\cdots ,x_{pi}) = \underbrace{g_{main}(x_{mi})}_\text {main effect}+\underbrace{g_{int}(x_{1i},\cdots ,x_{pi})}_\text {interaction effect}.\\ \end{aligned} \end{aligned}$$
(12)

The first term in Eq. 12 is a function of \(x_{mi}\), including the intercept of \(\alpha _{mi}\), and captures the main effect of \(x_{mi}\). If \(\alpha _{mi}\) has a significant intercept, a linear main effect can be identified, while a strong relationship with \(x_{mi}\) indicates a nonlinear main effect. The second term is a function of the other predictors and may or may not contain \(x_{mi}\). This term can be used to detect interactions between \(x_{mi}\) and other predictors. As an illustration, consider a simple example in which all estimated coefficients are constant except \(\alpha _{1i}=\theta _0+\theta _1 x_{1i}+\theta _2 x_{2i}\); the varying coefficient model can then be expressed as follows:

$$\begin{aligned} \begin{aligned} &y_i=\alpha _{0i}+\theta _0 x_{1i}+\theta _1 x_{1i}^2+\theta _2(x_{2i})x_{1i}+\alpha _{2}x_{2i}+\\&\quad\cdots +\alpha _{p}x_{pi}+\epsilon _i. \end{aligned} \end{aligned}$$
(13)

where \(g_{main}(x_{1i})=\theta _0+\theta _1x_{1i}\) and \(g_{int}(x_{2i})=\theta _2x_{2i}\). We can easily identify the interaction term between \(x_{1i}\) and \(x_{2i}\), and \(x_{1i}\) shows a nonlinear main effect via its quadratic term. To detect main effects and interaction effects from \(\alpha _{mi}\), we propose the two-stage process below:

  1. Check nonlinearity: Calculate the conditional expectation \(E({\hat{\alpha }}_{mi} |x_{mi})\) by smoothing the estimated coefficients of each predictor against the predictor itself: \({\hat{\alpha }}_{mi} \sim g_{main}(x_{mi})\ \ \ m=1,\cdots ,p\).

  2. Check interactions: Remove the main effect from \(\alpha _{mi}\), and calculate the conditional expectation \(E({\hat{\alpha }}_{mi}-{\hat{g}}_{main}(x_{mi}) |x_{ki})\) by smoothing the remainder against each other variable: \({\hat{\alpha }}_{mi}-{\hat{g}}_{main}(x_{mi}) \sim g_{k}^m(x_{ki})\ \ \ \ k\ne m\).
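The two stages can be sketched as follows. The smoother is not specified in this section, so a piecewise-constant smoother on quantile bins is assumed here purely for illustration; any scatterplot smoother could be substituted, and all function names are hypothetical.

```python
import numpy as np

def binned_smooth(x, target, n_bins=20):
    """Piecewise-constant estimate of E(target | x) on quantile bins (assumed smoother).

    Returns the fitted value for every observation, and the per-bin means
    of x and target for plotting the smoothed curve.
    """
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.searchsorted(edges[1:-1], x, side="right")      # bin index in 0 .. n_bins - 1
    counts = np.bincount(idx, minlength=n_bins).astype(float)
    safe = np.maximum(counts, 1.0)                           # avoid division by zero
    bin_x = np.bincount(idx, weights=x, minlength=n_bins) / safe
    bin_t = np.bincount(idx, weights=target, minlength=n_bins) / safe
    keep = counts > 0                                        # drop empty bins from the curve
    return bin_t[idx], bin_x[keep], bin_t[keep]

def two_stage_effects(alpha_m, x_m, X_others, n_bins=20):
    """Stage 1: smooth alpha_m on x_m. Stage 2: smooth the remainder on each other predictor."""
    g_main, main_x, main_curve = binned_smooth(x_m, alpha_m, n_bins)
    resid = alpha_m - g_main                                 # remove the estimated main effect
    interactions = [binned_smooth(X_others[:, k], resid, n_bins)[1:]
                    for k in range(X_others.shape[1])]
    return (main_x, main_curve), interactions
```

Each entry of `interactions` is a pair of bin centers and curve values for \(g_k^m\); a visibly non-flat curve for some \(k\) suggests an interaction between \(x_m\) and \(x_k\).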

Fig. 13 Plot matrix between \(\alpha _m\) and \(x_m\)

We choose a two-stage process instead of a one-stage process because it estimates the main and interaction effects more accurately in the correlated predictor case and separates the two effects effectively. Note that some special interaction effects may not be identified by the one-stage process, for example \(y=\alpha _0+\alpha _1 x_1\) with \(\alpha _1=x_1 x_2\). In this case, \(g_2^1 (x_{2i})\) is the zero curve. Fortunately, most common interaction patterns can be identified by our two-stage process. As long as \(g_k^m (x_{ki})\) has a significant pattern on \(x_{ki}\), an interaction effect can be identified.

For the bike sharing data, we visualize all pairs of varying coefficients and variables \((\alpha _{mi}\ vs\ x_{ki})\) with scatter plots in Fig. 13. Because \(\frac{\partial {\hat{f}}(\textbf{x})}{\partial x_m} = \alpha _m\), this is also a scattered partial derivative plot for \({\hat{f}}(\textbf{x})\). On top of the scatter plots, we also draw \(g_{main} (x_{mi})\) against \(x_{mi}\) in the diagonal plots and \(g_k^m (x_{ki})\) against \(x_{ki}\) in the (m, k) off-diagonal plots. To further quantify the magnitude of the interaction effects, we calculate the weighted standard deviation of \(g_{main} (x_{mi})\) and \(g_k^m (x_{ki})\) with the population density as the weight. The heatmap of the interaction measures for bike sharing data is provided in Fig. 14, where the diagonal is masked with zeros.

Fig. 14 Heatmap for interaction measures

Fig. 15 ALE plots for predictors based on LIFE

The nonlinear patterns of the variables can be clearly seen in the diagonal plots in Fig. 13. The most important variable, hour (hr), displays drastic fluctuation compared with the others, indicating its nonlinear effect on the response. As evidenced in both Figs. 13 and 14, the top three interaction pairs, hr vs workingday, hr vs weekday, and hr vs temp, can be easily identified.

In addition to interaction detection, we can obtain and visualize the main effect of each predictor directly by aggregating the local linear coefficients. Because \(\frac{\partial {\hat{f}}(\textbf{x})}{\partial x_m} = \alpha _m\) in the varying coefficient setting, we compute the exact main effect of \(x_m\) by constructing a relationship between \(f(x_m)\) and \(x_m\) based on the formula \(\int _0^x E(\frac{\partial {\hat{f}}(\textbf{x})}{\partial x_m} |x_m)dx_m\) from the Accumulated Local Effects (ALE) plot, where the variable is transformed back to its original scale, as seen in Fig. 15. This ALE formulation simplifies to \(\int _0^x E(\alpha _m |x_m)dx_m=\int _0^x g_{main}(x_m)dx_m\), and its numerical implementation uses the Midpoint Rule.
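A minimal sketch of the midpoint-rule accumulation is given below. It assumes that \(g_{main}\) is available as a vectorized callable (for example, an interpolant of the smoothed curve) and that the grid starts at the lower integration limit; the function name is hypothetical.

```python
import numpy as np

def main_effect_midpoint(g_main, x_grid):
    """Accumulate the integral of g_main from x_grid[0] to each grid point (midpoint rule).

    g_main : vectorized callable returning g_main(x) for an array of x values
    x_grid : increasing grid; start it at 0 to match the lower limit in the text
    """
    x_grid = np.asarray(x_grid, dtype=float)
    mids = 0.5 * (x_grid[:-1] + x_grid[1:])      # midpoints of the grid cells
    widths = np.diff(x_grid)                     # cell widths
    increments = g_main(mids) * widths           # midpoint-rule area of each cell
    return np.concatenate([[0.0], np.cumsum(increments)])
```

For a linear \(g_{main}\) the midpoint rule is exact, so with \(g_{main}(x)=2x\) and a grid starting at 0 the accumulated effect is \(x^2\) at every grid point.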

The main effect of hr has two peaks and one trough, which is similar to the partial dependence plots from other machine learning algorithms, while the main effects of temp and hum show a quadratic relationship. More specifically, bike rentals peak around 7 am and 5–6 pm, while very few people rent bikes around 3–4 am. People usually prefer to rent bikes on a nice day with moderate temperature and humidity. Both findings are consistent with common sense.

5 Discussion

5.1 Different sampling schemes

The LIFE algorithm can be considered a general framework with three steps, as discussed in the methodology section. LIFE is very flexible and allows users to try different combinations of choices for the three steps. The first step admits several data sampling options. In this paper, we use the linear projections inside NN neurons to split the data and select data points from the active region for base learner training, as shown in Fig. 2. Those linear projections are obtained from trained NNs in a supervised setting. Given a fixed hyper-parameter setup for LIFE, we have implemented our method with different sampling choices, including NN projection, random projection, and bootstrapping. Table 9 shows that all the ensemble methods outperform a single single-hidden-layer NN model optimized by LLA or Adam. Most importantly, sampling by NN linear projection is better than the other sampling methods, which create subsets in a random way.
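The three sampling choices can be contrasted with the following sketch. The exact subset rule used by LIFE is defined in the methodology section and is not reproduced here, so the rule below (keep the points whose projection exceeds a cutoff `cp`) is an assumption, as are the function names.

```python
import numpy as np

def projection_subset(X, w, b, cp=0.0):
    """Indices of points whose linear projection x^T w + b exceeds the cutoff cp.

    For NN projection, (w, b) would come from a neuron of a trained NN.
    """
    return np.flatnonzero(X @ w + b > cp)

def random_projection_subset(X, rng, cp=0.0):
    """Same rule, but with a randomly drawn, unsupervised projection direction."""
    w = rng.normal(size=X.shape[1])
    return projection_subset(X, w, 0.0, cp)

def bootstrap_subset(X, rng):
    """Classical bootstrap: n row indices drawn with replacement."""
    n = X.shape[0]
    return rng.integers(0, n, size=n)
```

Raising `cp` shrinks the subset, which is one way to trace out the relationship between subset size and diversity.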

Table 9 Sampling Method Comparison

5.2 Base learner selection

For model aggregation and pruning, we can prune neurons to obtain a single-layer NN with fewer neurons. We used elastic net for pruning due to its simplicity, but other pruning methods can also be considered. Based on the properties of LIFE, we also developed an alternative pruning method called base learner selection to reduce the number of nodes in the final step. It is assumed that LIFE works well because the correlations of prediction errors among different base learners are not strong. Therefore, we can remove base learners one by one, according to the correlation between each learner's prediction errors and those of the other base learners. In this way, we can still maintain diversity and mitigate overfitting by keeping fewer, necessary base learners without sacrificing much predictive performance. This method is described in detail in Algorithm 3.

figure c
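One possible greedy reading of this procedure is sketched below. It is not the authors' Algorithm 3 verbatim: the removal criterion (largest total absolute error correlation with the remaining learners) and all names are assumptions made for illustration.

```python
import numpy as np

def select_base_learners(errors, tau=0.5):
    """Greedy pruning of base learners by the correlation of their prediction errors.

    errors : (K, n) prediction errors of K base learners on common data
    tau    : fraction of base learners to keep
    Assumed criterion: repeatedly drop the learner whose errors are most
    correlated (in absolute value, summed) with those of the remaining learners.
    """
    errors = np.asarray(errors, dtype=float)
    K = errors.shape[0]
    n_keep = max(1, int(np.ceil(tau * K)))

    corr = np.abs(np.corrcoef(errors))       # (K, K) absolute error correlations
    np.fill_diagonal(corr, 0.0)

    keep = list(range(K))
    while len(keep) > n_keep:
        sub = corr[np.ix_(keep, keep)]
        worst = int(np.argmax(sub.sum(axis=1)))   # most redundant remaining learner
        keep.pop(worst)
    return keep
```

The returned indices identify the base learners whose neurons would be carried into the final aggregation step.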

In Algorithm 3, the threshold \(\tau \) is the percentage of base learners to be retained from the pool of candidates. When there is a large number of neurons in the LIFE setting, the elastic net is usually computationally expensive, and base learner selection, which can be run in parallel, is a good alternative. Moreover, we can combine these two pruning methods to obtain a simpler NN model from a wide NN faster. We illustrate this using two simulated models (GAM and MIM). Plots (a) and (b) in Fig. 16 show the relationship between \(R^2\) and the number of hidden neurons for the feature extraction. By setting different thresholds in the base learner selection Algorithm 3, we can construct single-hidden-layer NNs with different numbers of neurons. In general, base learner selection can effectively reduce the number of neurons to produce a simpler model without sacrificing predictive performance, and sometimes with even better results.

Fig. 16 Relationship between \(R^2\) and number of neurons. The red point indicates the model using linear regression as a final step without base learner selection, while the blue points indicate model aggregation by base learner selection with different thresholds

6 Conclusion

In this paper, we have proposed a novel algorithm that fits a single-hidden-layer NN to achieve three goals: ensuring competitive predictive performance, boosting computational efficiency, and preserving the interpretability of the model. Unlike traditional NN training methods, we train it iteratively through layer-by-layer training of multiple NNs and then effectively combine them via neural nodes flattening. We have evaluated the performance of our approach using simulated and empirical data in terms of predictive accuracy and computational efficiency and found that it consistently outperforms single-hidden-layer NNs trained directly by the LLA or Adam optimizer and achieves results competitive with those of Xgboost.

This superior performance has three sources. First, as an ensemble method, the LIFE algorithm performs data sampling through linear projection inside neural nodes, which creates diversity among the models and contributes to bias and variance reduction in the prediction of the combined models. Second, the LIFE algorithm takes advantage of the single-hidden-layer NN structure to combine multiple narrow single-hidden-layer NNs into a wide one via neural nodes flattening. Third, the LIFE algorithm benefits from parallel computing to train multiple NNs on subsets of the data simultaneously. Moreover, the base learner selection method introduced in the paper helps us prune redundant neural nodes and produce a more parsimonious model after several iterations of the LIFE algorithm. We have also proposed a new method for main and interaction effect detection from the perspective of interpretation.