Linear iterative feature embedding: an ensemble framework for an interpretable model

A new ensemble framework for interpretable models, called linear iterative feature embedding (LIFE), has been developed to achieve high prediction accuracy, easy interpretation, and efficient computation simultaneously. The LIFE algorithm is able to fit a wide single-hidden-layer neural network (NN) accurately in three steps: defining subsets of a dataset by the linear projections of neural nodes, creating features from multiple narrow single-hidden-layer NNs trained on the different subsets of the data, and combining the features with a linear model. The theoretical rationale behind LIFE is provided by its connection to the loss ambiguity decomposition of stack ensemble methods. Both simulation and empirical experiments confirm that LIFE consistently outperforms directly trained single-hidden-layer NNs and, in many experiments, also outperforms other benchmark models, including the multilayer feed-forward neural network (FFNN), XGBoost, and random forest (RF). As a wide single-hidden-layer NN, LIFE is intrinsically interpretable. Meanwhile, both variable importance and global main and interaction effects can be easily created and visualized. In addition, because the base learners are built independently of one another, LIFE can leverage parallel computing to be computationally efficient.


Introduction
Ensemble methods have proved successful in the majority of machine learning competitions, as they integrate multiple machine learning algorithms into one predictive model. In particular, there are three main types of ensemble methods: bootstrap aggregating (bagging), boosting, and stacking. Bagging and stacking learn base learners independently and aggregate them following a deterministic averaging process, while boosting learns sequentially in a very adaptive way and combines learners using a pre-specified strategy. The main goal of bagging is to arrive at an ensemble with less variance than its base learners, whereas boosting mainly tries to produce a strong model that is less biased than its base learners. Stacking aims to reduce both variance and bias by combining diversified base learners. Compared with a single model, all of these ensemble methods can significantly improve predictive performance through either bias or variance reduction.
However, the final model produced by an ensemble method is still regarded as a black-box model, since the combination of base learners leads to a complicated model structure and makes the inner decision-making process opaque to human beings. The ability to explain the rationale behind one's decisions to others is an important aspect of human intelligence in both social interaction and educational contexts. Interpretable results enable business owners and regulators to better understand risk management decision processes and help companies meet regulatory requirements. Some commonly used interpretable models, such as the linear model and the generalized additive model, cannot compete with ensemble models such as XGBoost [10] and Random Forest [4], since their simple structures cannot capture complicated data patterns.
In recent decades, some research has focused on enhancing interpretability by developing tools to "open up the black box". There are, broadly speaking, three inter-related model-based areas of research: a) global diagnostics (Sobol & Kucherenko (2009) [16], Kucherenko (2010) [17]); b) local diagnostics (Sundararajan et al. (2017) [24], Ancona et al. (2018) [2]); and c) development of approximate or surrogate models that may be easier to understand and explain. However, Rudin (2019) [22] suggests avoiding explainable black-box models in high-stakes decisions, since their explanations can be problematic for several reasons, such as being unreliable or lacking detail about what the original model does. Therefore, other researchers have made efforts to build inherently interpretable models, such as the explainable neural network (xNN) proposed by Vaughan et al. (2018) [26], adaptive explainable neural networks (AxNNs) by Chen et al. (2020) [8], and explainable neural networks with constraints by Yang et al. (2020) [28]. Following this research direction, we try to build an inherently interpretable model whose predictive performance is as strong as, or even better than, that of some black-box models.
In this paper, our LIFE algorithm fulfills three main goals: competitive predictive performance, improved computational efficiency, and an interpretable model. The single-hidden-layer NN has the universal approximation property in theory and is easy to interpret due to its simple architecture. However, we need to resort to a wide single-hidden-layer NN with a large number of neural nodes to obtain strong predictive performance, and such a network is numerically difficult to estimate in practice. Therefore, by leveraging the ensemble method and the simple structure of a single-hidden-layer NN, we developed an innovative and flexible framework called LIFE to train wide single-hidden-layer NNs, which can achieve both high accuracy and easy interpretability in both regression and classification settings.
In this algorithm, we first use a special hierarchical structure of multiple single-hidden-layer NNs to perform data sampling based on the linear projections of neurons, and then train multiple narrow single-hidden-layer NNs with ReLU activation as base learners on different subsets of the dataset; finally, we aggregate the neural nodes from the multiple base learners as features into a wide single-hidden-layer NN and perform a joint estimation via a linear model. Compared with the traditional training strategy, LIFE avoids directly training a wide single-hidden-layer NN by instead extracting features from multiple narrow single-hidden-layer NNs trained on different subsets of the dataset. In this way, we can introduce diversity among the base learners, which tends to decrease the total uncertainty after ensembling and thus yields better results empirically. To achieve good diversity among base learners, we build a hierarchical structure of multiple single-hidden-layer NNs and leverage the linear projections of the neurons to define the subsets for sampling. In addition, the algorithm combines the features defined by the neurons of the single-hidden-layer NN base learners, which we call neuron flattening, and this can further improve results compared with traditional ensemble methods. This technique also allows LIFE to take advantage of parallel computing to improve computational efficiency by training multiple narrow single-hidden-layer NNs simultaneously.
Extensive analyses on both simulated data and public real data verify the effectiveness of LIFE in both predictive and computational performance. Furthermore, we provide a theoretical foundation for LIFE and demonstrate the importance of the diversity of base learners by exploring the relationship with two-stage ensemble stacking and the ambiguity loss decomposition for two-stage ensemble stacking. The final single-hidden-layer NN architecture obtained from LIFE makes it possible to visualize the neural network weights and biases and to understand the input-output relationship easily. In particular, a single-hidden-layer NN with the rectified linear unit (ReLU) activation function [26] is equivalent to an additive index model with linear splines on linear projections. Moreover, it can be considered a local linear model, where all predictors are easily visualized by a parallel coordinates plot [3]. The main and interaction effects can also be identified by aggregating the local linear model coefficients.
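The local-linear view can be made concrete with a short sketch. It assumes a single-hidden-layer ReLU network of the form $f(x) = c + \sum_k \beta_k \max(0, b_k + w_k^\top x)$; the names `W`, `b`, `beta`, `c` and the helper `local_linear_coefs` are illustrative and do not come from the paper. Within the region where a fixed set of neurons is active, the network is exactly linear, and its intercept and slopes can be read off from the active neurons:

```python
import numpy as np

def predict(X, W, b, beta, c):
    """Single-hidden-layer ReLU network: c + sum_k beta_k * relu(b_k + w_k . x)."""
    return c + np.maximum(0.0, X @ W.T + b) @ beta

def local_linear_coefs(x, W, b, beta, c):
    """Intercept and slopes of the linear model valid in x's activation region."""
    active = (W @ x + b) > 0                 # which hidden neurons fire at x
    slopes = (beta * active) @ W             # sum of beta_k * w_k over active neurons
    intercept = c + np.sum(beta * active * b)
    return intercept, slopes

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # 4 hidden neurons, 3 predictors (hypothetical sizes)
b = rng.normal(size=4)
beta = rng.normal(size=4)
c = 0.5
x = rng.normal(size=3)

intercept, slopes = local_linear_coefs(x, W, b, beta, c)
# The local linear model reproduces the network output at x.
print(intercept + slopes @ x, predict(x[None, :], W, b, beta, c)[0])
```

Averaging such local slopes over observations is one plausible way to summarize main effects, although the aggregation used in the paper may differ.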
In general, our main contributions are summarized below:
1. LIFE is an innovative and flexible framework for ensemble methods, which allows different kinds of variants.
2. LIFE can achieve better predictive performance than traditional single-hidden-layer NN training methods, as demonstrated by the theoretical background and empirical experiments.
3. LIFE keeps the model interpretable, and a new interpretation tool is introduced to detect main and interaction effects.

General framework for LIFE
As shown in Figure 1, the LIFE algorithm consists of three steps: data sampling; base learner training and feature extraction; and model aggregation and pruning. Figure 1 presents several options in the white box for each step, which allow for various combinations of those steps to achieve pre-specified goals.
1. The first step consists of defining subsets of the data via the activation functions of NN neurons, which are obtained by training multiple single-hidden-layer NNs in a hierarchical structure, as illustrated in Figure 2. Diversified base learners can be generated by training on these subsets. More theoretical explanation of ensembles with diversified base learners is provided in Section 2.3. There are alternative ways to generate samples for base learner training, e.g., bootstrapping as in the traditional method, or data splitting via random projection. By leveraging linear projections from trained NNs, our sampling method in the supervised setting can generate diversity among base learners more effectively, which is demonstrated by empirical experiments in Section 5.1.
2. The second step consists of training base learners on the different subsets of data sampled during the first step. Various options can be considered, e.g., a single-hidden-layer NN, a multi-layer NN, or a regression or decision tree. In this paper, the single-hidden-layer NN is used as the base learner given its interpretability. After estimating multiple NN base learners, all activation functions of the neurons from the NN base learners are extracted as new features.
3. The third step consists of combining all the new features extracted from the different base learners in the second step to construct the final predictive model. In addition, the new features can be pruned through regularization or other methods to generate a more parsimonious model. In this paper, we use the linear model and the elastic net in our LIFE algorithm, which is simple and straightforward, but some more complicated model aggregation methods, e.g., adaptive regression model screen, and pruning methods, e.g., the base learner selection algorithm (Algorithm 3), will be discussed in Section 5.2.
Figure 1: General framework of the LIFE algorithm. The options colored in red in each step are the ones used in this paper.

LIFE Algorithm
The LIFE algorithm is an iterative process in which multiple single-hidden-layer NN base learners are trained in each iteration. Assume there are $J$ iterations. The first $J - 1$ iterations are used to define the data sampling through a hierarchical structure, and the last ($J$-th) iteration is used to build the features from single-hidden-layer NN base learners. Moreover, let $[K_1, \cdots, K_J]$ denote the numbers of hidden neurons of the single-hidden-layer NNs in all iterations, where $K_j$ is the number of hidden neural nodes of the single-hidden-layer NNs trained in the $j$-th iteration. Figure 2 illustrates the LIFE framework with $[K_1, K_2, K_3] = [3, 3, 2]$, where the $\hat{b}^{(j)}_k$'s and $\hat{w}^{(j)}_k$'s represent the biases and weights of the single-hidden-layer NNs, respectively, the $\beta_k$'s are the coefficients used to linearly combine all new neurons in the final step, and $cp$ is the cutoff point that defines the subsets and thereby controls the subset size.

The LIFE framework illustrated in Figure 2 can be separated into the three steps displayed in Figure 1. The two iterations in the first step are used to perform data sampling. We first fit a single-hidden-layer NN with three hidden neurons, then define subsets by $\hat{b}^{(1)}_k + \hat{w}^{(1)\top}_k x_i > cp$, where $k = 1, 2, 3$ and $i = 1, \cdots, N$, given the estimated bias and weight in each neuron. Further, an entire neuron is dropped and no longer used in the next iteration if the subset defined by this neuron is either too large or too small according to pre-specified criteria. For example, the green neuron in the middle is dropped due to the small size of its subset. If the subset size is close to the full data size, the sample is almost identical to the original training data, which is not beneficial for generating diversified samples. On the other hand, if the subset size is too small, the sample is not representative of the original data, and the base learner built on this sample does not perform well on the entire training data. As we will show in Section 2.3, the diversity and the accuracy of the base learners are the two key elements for the final ensemble performance.

As the result of the first iteration in the first step, we leverage the linear hyperplanes of the NN neurons to partition the data as shown in Figure 3, an idea similar to oblique trees [14]. Plots (a), (b) and (c) in Figure 3 show how the data are partitioned in different ways by the three neurons based on $\hat{b}^{(1)}_k + \hat{w}^{(1)\top}_k x_i > cp$; the resulting subsets are used to fit single-hidden-layer NNs in the second iteration. Notice that the non-white region in plot (b) is very small, which corresponds to the dropped green neuron in Figure 2. Plot (d) in Figure 3 shows the combination of plots (a) and (b). As we can see, the two subsets overlap, which differs from the subsets defined in a traditional regression or decision tree structure, where the partition is mutually exclusive.

In the second iteration of the first step, single-hidden-layer NNs with three hidden neurons are fitted independently on the subsets defined in the first iteration. After that, the entire training set is evaluated for data sampling through the NN linear projections $\hat{b}^{(2)}_k + \hat{w}^{(2)\top}_k x_i > cp$. Note that the new subsets are defined on the entire training data, not on the samples from the previous iteration, which is also very different from a traditional regression or decision tree structure. Again, neurons with subsets that are too small or too large are dropped permanently, as shown by the first node (in brown) and the last node (in pink) of the second hidden layer in Figure 2, corresponding to (a1) and (c3) in Figure 4. In addition, Figure 4 shows the data partitions obtained from the second iteration, where plots (a1), (a2) and (a3) represent the partitions generated from the NN trained on the subset of the first node (yellow) in the first iteration, while plots (c1), (c2) and (c3) correspond to the third node (red) in the first iteration. Plot (d2) shows that each training data point belongs to at least one of the six subsets in (a1-a3) and (c1-c3).

We do expect all or most data points to be covered by different subsets. The reason is that neurons with different active regions in an NN represent different features and patterns in the data. LIFE trains the same type of base learner on part of the dataset but evaluates it on the entire dataset. This leads to small errors inside the region of the sampled data and large errors outside that region. Therefore, data sampling with these active regions can effectively define subsets that generate more diverse representations of the data and produce less correlated prediction errors. Sampling in this supervised manner is better than sampling at random, which is further discussed in Section 5. In practice, the first step can be reduced to one iteration or extended to more than two iterations, as displayed in Algorithm 1.
Figure 4: Data partition by NN linear projection after the second iteration. Observations in the colored areas of (a1-a3) and (c1-c3) are selected into each subset. Plot (d1) displays all training data points in green, and plot (d2) shows the four kept data partitions in one plot.
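The subset definition used in the first step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `define_subsets`, the argument names, and the toy weights are hypothetical, and the projection is assumed to take the form $b_k + w_k^\top x_i > cp$ with a neuron dropped when its subset proportion falls outside the open interval between the lower and upper bounds.

```python
import numpy as np

def define_subsets(X, W, b, cp=0.0, lower=0.1, upper=0.9):
    """Return one boolean mask per kept neuron, selecting rows with b_k + w_k . x > cp.

    A neuron is dropped when its subset proportion s/N is not strictly
    between `lower` and `upper`.
    """
    proj = X @ W.T + b                     # (N, K) linear projections
    masks = proj > cp                      # (N, K) subset membership
    frac = masks.mean(axis=0)              # s / N for each neuron
    keep = (frac > lower) & (frac < upper)
    return masks[:, keep], keep, frac

# Toy example: 6 observations, 2 predictors, 3 neurons (all values hypothetical).
X = np.array([[-2.0, 0.0], [-1.0, 1.0], [0.0, -1.0],
              [1.0, 0.5], [2.0, -0.5], [3.0, 1.0]])
W = np.array([[1.0, 0.0],     # splits on the first predictor
              [0.0, 0.0],     # constant projection: subset is all or nothing
              [-1.0, 0.0]])   # mirror image of the first neuron
b = np.array([0.0, 1.0, 0.5])

masks, keep, frac = define_subsets(X, W, b, cp=0.0, lower=0.2, upper=0.8)
print(keep)   # the second neuron covers every observation and is dropped
```

Because each mask is computed on the full training set, the resulting subsets are free to overlap, unlike the exclusive partition of a decision tree.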
In the second step, four single-hidden-layer NNs, each with two hidden neurons, are trained independently as base learners on the subsets defined at the end of the first step, and then the features are generated by the ReLU activation functions. Please note that the features are evaluated on the entire training dataset. In the third step, we combine all these new features together. Since all features are obtained on the entire training set, they form a design matrix with dimension $N \times m_J$, where $N$ is the size of the training set and $m_J$ is the total number of features extracted in step 2, with $m_J = 8$ in Figure 2. Then a linear model is fitted on these features, and $\beta_i$, $i = 1, \cdots, m_J$, is the coefficient of each feature. In the default setting, linear regression or the logistic model is applied to combine the neural nodes extracted from the different base learners and make a final prediction. However, there may be too many features, and some of them can even be highly correlated, leading to overfitting. Therefore, we need to prune redundant neural nodes by adding regularization or removing some base learners. Both methods not only prevent overfitting, but also produce a more parsimonious NN model with fewer nodes, which is beneficial for interpretation. As a regularized regression method that linearly combines the $L_1$ and $L_2$ penalties, the elastic net is one of the options, and it is used in this paper for the third step of model aggregation and pruning. LASSO and ridge regression are treated as special cases of the elastic net.

By combining multiple relatively narrow but diversified single-hidden-layer NNs with $K_J$ hidden neurons each, the LIFE algorithm finally constructs a wide single-hidden-layer NN with $m_J$ hidden neurons, which is 8 in Figure 2. This training process is completely different from the traditional methods, in which an NN is optimized as a whole by stochastic gradient based optimizers. In most cases, it is numerically difficult and computationally expensive to train a wide single-hidden-layer NN and achieve good performance. LIFE can help overcome this difficulty and achieve decent performance. In addition, LIFE can leverage parallel computation to significantly reduce the training time.

Finally, we provide the detailed pseudo-code of LIFE with $J$ iterations in Algorithm 1, where $m_j$ indicates the number of remaining neural nodes after the $j$-th iteration in the first step, $j = 1, \cdots, J$. Here $u$ and $l$ are the maximal and minimal proportions of the training set size, which provide the upper and lower bounds for the subset size, respectively. A neuron is dropped if the proportion of its subset size falls outside this range.

Algorithm 1 (excerpt):
  [...], where $m_j$ is the total number of neural nodes left in the $j$-th iteration.
  3. Record the size of the sampled observations as $s$.
  4. if $l < s/N < u$ then
     1. Train a single-hidden-layer NN regressor or classifier with the number of hidden neurons equal to $K_j$ on the sampled subset of the training set $(x, y)$ by the specified optimizer.
     2. Collect the estimated parameters $\hat{b}^{(j)}_q$, $\hat{w}^{(j)}_q$, where $q = 1,$ [...]
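The three steps described in this section can be sketched end to end for the regression case. The code below is a simplified illustration and not the authors' implementation: the helper names (`train_relu_net`, `life_fit`, `life_predict`), the plain full-batch gradient descent trainer (standing in for LLA or Adam), the use of ordinary least squares instead of the elastic net in the third step, and the fallback to the full data when every neuron is dropped are all assumptions made to keep the example short.

```python
import numpy as np

def train_relu_net(X, y, k, rng, epochs=300, lr=0.05):
    """Fit y ~ c + relu(X W^T + b) beta by full-batch gradient descent on the MSE.

    Only the hidden layer (W, b) is returned; LIFE re-estimates the output
    coefficients jointly in the final step.
    """
    n, p = X.shape
    W = rng.normal(scale=1.0 / np.sqrt(p), size=(k, p))
    b = np.zeros(k)
    beta = rng.normal(scale=0.1, size=k)
    c = y.mean()
    for _ in range(epochs):
        z = X @ W.T + b                    # (n, k) linear projections
        h = np.maximum(0.0, z)
        r = (c + h @ beta - y) / n         # scaled residuals
        g = np.outer(r, beta) * (z > 0)    # gradient with respect to z
        beta -= lr * (h.T @ r)
        c -= lr * r.sum()
        W -= lr * (g.T @ X)
        b -= lr * g.sum(axis=0)
    return W, b

def relu_features(X, W, b):
    return np.maximum(0.0, X @ W.T + b)

def life_fit(X, y, widths=(3, 3, 2), cp=0.0, lower=0.1, upper=0.9, seed=0):
    """Sketch of LIFE with J = len(widths) iterations."""
    rng = np.random.default_rng(seed)
    n = len(y)
    subsets = [np.ones(n, dtype=bool)]             # iteration 1 uses all data
    for j, k in enumerate(widths):
        # Base learners are independent, so this loop could run in parallel.
        learners = [train_relu_net(X[m], y[m], k, rng) for m in subsets]
        if j == len(widths) - 1:
            break                                  # last iteration supplies the features
        subsets = []
        for W, b in learners:
            masks = (X @ W.T + b) > cp             # evaluated on the entire training set
            for m in masks.T:
                if lower < m.mean() < upper:       # drop too-small or too-large subsets
                    subsets.append(m)
        if not subsets:                            # fallback (assumption, not from the paper)
            subsets = [np.ones(n, dtype=bool)]
    # Neuron flattening: stack the neurons of all base learners, then fit jointly.
    H = np.hstack([relu_features(X, W, b) for W, b in learners])
    A = np.column_stack([np.ones(n), H])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return learners, coef

def life_predict(X, learners, coef):
    H = np.hstack([relu_features(X, W, b) for W, b in learners])
    return coef[0] + H @ coef[1:]

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = np.sin(X[:, 0]) + np.abs(X[:, 1])              # hypothetical target
learners, coef = life_fit(X, y)
mse = np.mean((life_predict(X, learners, coef) - y) ** 2)
print(len(learners), mse)
```

Replacing the least-squares fit in the last step with an elastic-net solver would add the pruning described above; the rest of the structure would stay the same.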

Theoretical Foundation
An ensemble method, such as stacking [27], bagging [4], boosting [9] or Bayesian model averaging [20,21], is composed of multiple trained regressors or classifiers whose predictions are combined to make final predictions. Empirically, ensembles tend to yield better results than a single model when there is significant diversity among the models [18]. Over the past few decades, many studies have focused on the accuracy and diversity of ensemble methods in either the regression [6] or the classification case [1,11,7]. Krogh and Vedelsby (1994) [15] proposed the ambiguity decomposition and a computable approach to minimize the quadratic error of the ensemble estimator, while Ueda and Nakano (1996) [25] derived a general expression of the bias-variance-covariance decomposition. Brown et al. (2005) [5] and Hansen (2000) [12] investigated the connections between the ambiguity decomposition and the bias-variance-covariance decomposition, and showed that they are identical. Based on the ambiguity decomposition, we establish the theoretical foundation for LIFE and extend the loss decomposition to both the regression and the classification setting, which will be discussed in Sections 2.3.1 and 2.3.2.

Connection to Stacking
Stacking is a type of ensemble method in which a final model is trained on the combined predictions of other models. In stacking, the predictions from different machine learning models are used as new inputs and are combined to generate a new set of predictions. Those predictions can be used in additional layers, or the process can stop here with a final result.
One important assumption behind stacking is that different base learners can produce weakly correlated prediction errors that are complementary. If we use weighted averages, we might believe that some of the base learners are better or more accurate and can be assigned higher weights. In the framework of stacking, an even better approach might be to estimate these weights more intelligently by using another layer of the learning algorithm, such as a linear model. Some major differences between the stacking method and the LIFE algorithm are shown in Table 1. Without neural node flattening, LIFE is very similar to stacking. Despite the differences between the two methods, minimizing the loss function of LIFE is approximately equivalent to minimizing the loss function of the two-stage stacking method, which first fits multiple base learners (NNs trained on the different subsets sampled by the LIFE algorithm) and then uses the predictions of the base learners as input to train a model averaging model. [...] real data (California Housing for regression and Gamma Telescope for classification). Due to the BLUE (best linear unbiased estimator) property of the OLS estimator in linear regression, the joint estimation of the coefficients of all the combined features in the third step makes LIFE always outperform the two-stage stack ensemble method with the same setting. This can be verified in Figure 5, where all the points are below the green diagonal line. For the classification case, LIFE also performs better than the stacking method, with a smaller minimum loss, as shown in the white box of Figure 6.
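The comparison between two-stage stacking and joint estimation over flattened neurons can be illustrated numerically. The sketch below is an illustration under stated assumptions, not a reproduction of the paper's experiment: hidden layers are drawn at random as stand-ins for trained base learners, random half-samples stand in for LIFE subsets, and stacking is done with unconstrained least-squares weights. Since every base-learner prediction is a linear combination of that learner's neurons, the joint fit on all neurons can never have a larger training MSE than the fit on the stacked predictions.

```python
import numpy as np

def ols_fit_mse(A, y):
    """Training MSE of least squares with an intercept column prepended."""
    A1 = np.column_stack([np.ones(len(y)), A])
    coef, *_ = np.linalg.lstsq(A1, y, rcond=None)
    return np.mean((A1 @ coef - y) ** 2)

rng = np.random.default_rng(0)
n, p, k, M = 500, 3, 2, 4     # hypothetical sizes: M base learners with k neurons each
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=n)

features, preds = [], []
for _ in range(M):
    W, b = rng.normal(size=(k, p)), rng.normal(size=k)   # stand-in for a trained hidden layer
    H = np.maximum(0.0, X @ W.T + b)                     # its ReLU neurons on all data
    idx = rng.random(n) < 0.5                            # stand-in for a LIFE subset
    A = np.column_stack([np.ones(idx.sum()), H[idx]])
    a, *_ = np.linalg.lstsq(A, y[idx], rcond=None)       # output layer fitted on the subset only
    features.append(H)
    preds.append(a[0] + H @ a[1:])                       # base-learner prediction on all data

mse_stack = ols_fit_mse(np.column_stack(preds), y)   # two-stage stacking on M predictions
mse_flat = ols_fit_mse(np.hstack(features), y)       # joint fit on all M * k neurons
print(mse_stack, mse_flat)
```

The guarantee shown here concerns the training loss only; performance on held-out data depends on regularization and on the quality of the base learners.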

Loss Function Decomposition
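Krogh and Vedelsby (1994) [15] proposed the ambiguity decomposition for the quadratic error of the ensemble estimator, which is expressed in terms of the quadratic loss of the individual base learners and an ambiguity measure of diversity. We extend the ambiguity decomposition to both the mean squared error and the cross-entropy error via Taylor expansion, where the two loss functions correspond to regression and classification, respectively. Let an ensemble model with $M$ base learners be expressed as $f_{ens} = \sum_{j=1}^{M} \beta_j f^{(j)}$, where $\sum_{j=1}^{M} \beta_j = 1$ and $\beta_j \geq 0$. For any loss function that is twice differentiable, we can expand the loss function of the $j$-th base learner around the output of the ensemble model based on Taylor's theorem with the Lagrange form of the remainder as follows:

$$l(y, f^{(j)}) = l(y, f_{ens}) + l'(y, f_{ens})\,(f^{(j)} - f_{ens}) + \frac{1}{2}\, l''(y, \tilde{f}^{(j)})\,(f^{(j)} - f_{ens})^2, \tag{1}$$

where the derivatives are taken with respect to the second argument and the value of $\tilde{f}^{(j)}$ is between $f_{ens}$ and $f^{(j)}$. Multiplying both sides of equation (1) by $\beta_j$ and taking the sum over $j$ yields:

$$\sum_{j=1}^{M} \beta_j\, l(y, f^{(j)}) = l(y, f_{ens}) + l'(y, f_{ens}) \sum_{j=1}^{M} \beta_j (f^{(j)} - f_{ens}) + \frac{1}{2} \sum_{j=1}^{M} \beta_j\, l''(y, \tilde{f}^{(j)})\,(f^{(j)} - f_{ens})^2. \tag{2}$$

The second term on the right side of (2) is expressed by:

$$l'(y, f_{ens}) \sum_{j=1}^{M} \beta_j (f^{(j)} - f_{ens}) = l'(y, f_{ens}) \Big( \sum_{j=1}^{M} \beta_j f^{(j)} - f_{ens} \sum_{j=1}^{M} \beta_j \Big) = 0. \tag{3}$$

Since this term is zero, the loss function $l(y, f_{ens})$ of the ensemble can be decomposed into:

$$l(y, f_{ens}) = \sum_{j=1}^{M} \beta_j\, l(y, f^{(j)}) - \frac{1}{2} \sum_{j=1}^{M} \beta_j\, l''(y, \tilde{f}^{(j)})\,(f^{(j)} - f_{ens})^2. \tag{4}$$

In the regression case, let $f_{ens,i} = \sum_{j=1}^{M} \beta_j f^{(j)}_i$ be the individual predicted value of the ensemble model for the $i$-th observation, where $M$ is the number of base learners, $f^{(j)}_i$ represents the individual predicted value of the $j$-th base learner for the $i$-th observation, and $\beta_j$ denotes the regression coefficient of the $j$-th base learner. The mean squared error (MSE), $l(y, f) = \frac{1}{N} \sum_{i=1}^{N} (y_i - f_i)^2$, is the commonly used loss function for regression problems, where $N$ is the total number of observations in the entire dataset. Based on equation (4), the MSE of an ensemble model can be written in terms of the ambiguity decomposition as

$$\frac{1}{N} \sum_{i=1}^{N} (y_i - f_{ens,i})^2 = \sum_{j=1}^{M} \beta_j \frac{1}{N} \sum_{i=1}^{N} (y_i - f^{(j)}_i)^2 - \sum_{j=1}^{M} \beta_j \frac{1}{N} \sum_{i=1}^{N} (f^{(j)}_i - f_{ens,i})^2. \tag{5}$$

On the right-hand side of equation (5), the first term of this decomposition measures the average prediction accuracy of the base learners, while the second term is called the ambiguity (hence the name of the decomposition) and can be easily interpreted in terms of the diversity between individual base learners. Unlike the bias-variance-covariance decomposition, the ambiguity decomposition highlights a trade-off between the average accuracy of the base learners and their deviation from the ensemble output.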
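The ambiguity decomposition of the MSE can be checked numerically. The following sketch uses synthetic predictions and randomly drawn convex weights (all sizes and variable names are illustrative assumptions) and verifies that the ensemble MSE equals the weighted base-learner MSE minus the weighted spread of the base learners around the ensemble:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 5                                # hypothetical sizes
y = rng.normal(size=N)
F = y[:, None] + rng.normal(size=(N, M))     # column j holds f^(j), a noisy base learner
beta = rng.random(M)
beta /= beta.sum()                           # convex weights: beta_j >= 0 and sum to 1

f_ens = F @ beta
ensemble_mse = np.mean((y - f_ens) ** 2)
accuracy = beta @ np.mean((y[:, None] - F) ** 2, axis=0)        # weighted base-learner MSE
ambiguity = beta @ np.mean((F - f_ens[:, None]) ** 2, axis=0)   # weighted spread around ensemble

print(ensemble_mse, accuracy - ambiguity)    # the two values agree up to rounding
```

Because the ambiguity term is non-negative, the ensemble MSE never exceeds the weighted average MSE of its base learners when the weights are convex.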
Regarding LIFE, the base learner is a single-hidden-layer neural network trained on a subset of all observations. A stronger base learner indicates better model performance, which is reflected by the first term of equation (5). If the subset size is small, the base learner is weak, which deteriorates performance. Thus, a lower bound on the subset size is set. The power of the LIFE framework comes from the second term, diversity, which is due to the data sampling during the iterations. Creating different subsets through sampling allows the model to be trained on different aspects of the data, which produces diversity deliberately without resorting to other machine learning algorithms. In general, the more diverse the subsets, the better the predictive performance of LIFE. Hence, an upper bound is also necessary to ensure the diversity of the subsets: a subset that contains almost all observations is nearly identical to the full training set, so the subsets lose diversity. Another parameter, the cutoff point, is set to balance accuracy and diversity as well.
For binary classification, let $f_{ens,i} = \sum_{j=1}^{M} \beta_j f^{(j)}_i$ be the individual predicted probability of the ensemble model for the $i$-th observation, which is a weighted average of the predicted probabilities or log-odds $f^{(j)}_i$ of the base learners, where $\sum_{j=1}^{M} \beta_j = 1$ and $\beta_j \geq 0$. The cross-entropy loss is widely used for classification, and it can be written as follows for a single observation:

$$l(y_i, f_i) = -y_i \log f_i - (1 - y_i) \log (1 - f_i). \tag{6}$$

By plugging the loss function (6) into equation (4), we can write the average cross-entropy loss of an ensemble method on the training set $\{x_i, y_i\}_{i=1,\cdots,N}$ in the probability space as follows:

$$\frac{1}{N} \sum_{i=1}^{N} l(y_i, f_{ens,i}) = \sum_{j=1}^{M} \beta_j \frac{1}{N} \sum_{i=1}^{N} l(y_i, f^{(j)}_i) - \frac{1}{2N} \sum_{j=1}^{M} \beta_j \sum_{i=1}^{N} \left[ \frac{y_i}{(\tilde{f}^{(j)}_i)^2} + \frac{1 - y_i}{(1 - \tilde{f}^{(j)}_i)^2} \right] (f^{(j)}_i - f_{ens,i})^2, \tag{7}$$

where $(f^{(j)}_i - f_{ens,i})^2$ is a measure of the difference in value between the base learner and the ensemble. The cross-entropy loss and its decomposition in the log-odds space are provided in Appendix 6. Unlike the diversity term in the regression case, the second term (diversity) on the right-hand side of equation (7) also includes the true class label $y_i$ and a parameter with unknown value, $\tilde{f}^{(j)}_i$. However, the interpretation of the decomposition is still clear. It shows that a lower average accuracy of the individual base learners can be compensated by a higher disagreement with the ensemble, scaled by $\frac{y_i}{(\tilde{f}^{(j)}_i)^2} + \frac{1 - y_i}{(1 - \tilde{f}^{(j)}_i)^2}$. Since this factor is positive, a larger deviation of the predicted probability between a base learner and the ensemble model implies more diversity.

We apply the loss decomposition on simulated data (MIM) and real data (California Housing for regression and Gamma Telescope for classification) to LIFE without neural node flattening. This variant ensembles the predictions of the single-hidden-layer NN base learners directly, which is the two-stage stacking model averaging method discussed above. As the subset size in the sampling step of LIFE impacts the strength of diversity, we explore the overall loss, the weighted sum of the individual base learner accuracies, and the ambiguity measure against the average subset size over all the single-hidden-layer NNs in the first step of LIFE. Here we vary the subset size by controlling the cutoff point $cp$ of the linear projection. Figures 7 and 10 show the relationship between the average subset size (in terms of the proportion of the original training data size) and the loss for accuracy, the loss for diversity, and the MSE loss in both the regression and classification cases. In plots (a) and (b), the blue curves indicating the ambiguity diversity measure have a decreasing trend as the subset size increases. This is consistent with the intuition that the more the subsets overlap, the less diverse the base learners are. On the other hand, the accuracy of the individual base learners is higher when the subset size is larger, as the sample is more representative of the whole training set. Similarly, training on

Empirical Experiment
In this section, we conduct multiple empirical experiments via both simulated and real data for regression and classification cases to confirm the competitive performance of LIFE.
We have generated multiple datasets with a normal predictor distribution and a heavy-tailed predictor distribution (the Laplace distribution), as well as different functional forms, to analyze predictive performance and computational efficiency. Other benchmark models, including single-hidden-layer NNs trained by different optimizers (the local linear approximation and Adam algorithms), and other machine learning algorithms, including the multi-layer FFNN, XGBoost, and Random Forest, are tested on the same data for comparison after extensive hyper-parameter tuning.
The Local Linear Approximation (LLA) algorithm is a recently proposed method to estimate the weights and biases of a single-hidden-layer NN by iterative linear regression and a linear approximation of the ReLU activation function [29]. The LLA algorithm is distinguished from existing gradient descent algorithms in that it utilizes the Hessian matrix, in the same spirit as the Fisher scoring algorithm for nonlinear regression models with normal errors. The outline of the LLA algorithm is included in Appendix 6.
3.1 Simulated Data

Regression
For the regression scenario, there are three different functional forms, namely the Generalized Additive Model (GAM), the Additive Index Model (AIM) and the Multiple Index Model (MIM), which are expressed as follows: GAM: [...] AIM: [...] MIM: [...] where $N = 20\mathrm{k}$ and all predictors $\{x_{ji}\}_{j=1, \cdots}$ [...] Tables 2 and 3, while log-loss and AUC are used as performance metrics in the classification case, as shown in Tables 4 and 5.
Bayesian optimization allows us to jointly tune more parameters with fewer experiments and find better values, so we use it to perform extensive hyper-parameter tuning on all the algorithms. The important hyper-parameters for LIFE include the number of iterations, the number of neurons in each iteration, and the upper and lower bounds. The best results are marked in red.
As illustrated in Tables 2 and 3, the results are as expected regardless of the distribution from which the predictors are drawn. The LIFE algorithm with the LLA optimizer achieves higher accuracy among all [...] Tables 2 and 3 that LIFE (LLA) outperforms LIFE (Adam). From the perspective of computational efficiency, the LIFE algorithm also shows some advantages over other single-hidden-layer NN training algorithms. In general, the LIFE algorithm can not only boost the predictive performance of a single-hidden-layer NN, but also speed up training, especially for a wide NN with a large hidden layer dimension.

Classification
For the classification case, the functional forms in the simulation setup are similar to the ones in the regression case, except that the coefficients are slightly different. Detailed information on the formulas can be found in Appendix 6. Similar to the setup in the regression case, we choose $N = 20\mathrm{k}$, and all predictors are drawn from either the Normal or the Laplace distribution. The response variable is sampled from a Bernoulli distribution with the probability calculated using the logit link function. Tables 4 and 5 show the simulation results for the binary scenario, where data are generated from the Normal distribution and the Laplace distribution, respectively. Similar to the results in the regression case, LIFE (LLA) achieves the best result in four out of six cases. LIFE (Adam) also performs well, especially in Table 5 for the Laplace distribution. For data drawn from the Normal distribution, shown in Table 4, LIFE (Adam) is also quite close to the optimal result. Furthermore, there is strong evidence that the LIFE algorithm improves the performance of the base learners, with larger AUC, smaller log-loss, and smaller standard errors of both metrics. Even if the base learner is not strong enough, like Adam for GAM and MIM with the Laplace distribution (Table 5), where AUC, log-loss, or both have large standard errors, the ensemble approach in LIFE (Adam) can dramatically reduce the variance. XGBoost ranks at the top for GAM in both distributions; however, the differences between LIFE and XGBoost are negligible, with only a 0.1% difference in AUC and a 3% difference in log-loss.
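The data-generating mechanism for the binary case can be sketched as follows. The signal `eta` below is a hypothetical index-style function chosen only for illustration, since the actual functional forms are given in the appendix; the function name `simulate_binary` and its arguments are likewise assumptions. The parts taken from the text are the sample size, the choice between Normal and Laplace predictors, and the Bernoulli response with a logit link.

```python
import numpy as np

def simulate_binary(n=20_000, p=5, dist="normal", seed=0):
    """Draw predictors, map a signal through the inverse logit, and sample labels."""
    rng = np.random.default_rng(seed)
    if dist == "normal":
        X = rng.normal(size=(n, p))
    else:
        X = rng.laplace(size=(n, p))       # heavy-tailed predictors
    # Hypothetical signal; NOT the functional form used in the paper.
    eta = np.maximum(0.0, X[:, 0] + X[:, 1]) - 0.5 * np.abs(X[:, 2] - X[:, 3])
    prob = 1.0 / (1.0 + np.exp(-eta))      # inverse of the logit link
    y = rng.binomial(1, prob)              # Bernoulli draws
    return X, y, prob

X, y, prob = simulate_binary(dist="laplace")
print(X.shape, y.mean())
```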

Real Data
Besides applying the LIFE algorithm to simulated data, we also tested it on seven public datasets for regression and eight datasets for classification, and compared it with other benchmark models, including the single-hidden-layer FFNN and XGBoost. All datasets are split into 80% training data and 20% testing data with 10 different random seeds, which yields results over 10 replications. For all the datasets, we transformed categorical variables into dummy variables and standardized the continuous variables, so that the mean and the variance of each continuous variable are equal to 0 and 1, respectively. A detailed description of all datasets and the corresponding data preprocessing steps is given in Appendix 6.
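The preprocessing and replication scheme described above can be sketched with NumPy alone. The helper names `preprocess` and `split_80_20` and the synthetic columns are hypothetical; the sketch also assumes full one-hot encoding (no reference level dropped) and standardization computed on the whole dataset before splitting, neither of which is specified in the text.

```python
import numpy as np

def preprocess(cont, cat):
    """Standardize continuous columns and one-hot encode one categorical column."""
    z = (cont - cont.mean(axis=0)) / cont.std(axis=0)          # mean 0, variance 1
    levels = np.unique(cat)
    dummies = (cat[:, None] == levels[None, :]).astype(float)  # one column per level
    return np.column_stack([z, dummies])

def split_80_20(n, seed):
    """Return train and test indices for one replication."""
    perm = np.random.default_rng(seed).permutation(n)
    cut = int(0.8 * n)
    return perm[:cut], perm[cut:]

rng = np.random.default_rng(0)
cont = rng.normal(loc=5.0, scale=3.0, size=(100, 2))   # synthetic continuous predictors
cat = rng.choice(np.array(["a", "b", "c"]), size=100)  # synthetic categorical predictor
X = preprocess(cont, cat)
splits = [split_80_20(len(X), seed) for seed in range(10)]   # 10 replications
print(X.shape, len(splits[0][0]), len(splits[0][1]))
```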

Regression
The experimental results averaged over ten replications are reported in Table 6, including root mean squared error (RMSE), R-squared ($R^2$) and training time ($T$). As observed in Table 6, LIFE(LLA) is ranked as the best algorithm in four of the seven datasets. For the remaining three datasets, LIFE(LLA) is still the second or third best algorithm among all models, with predictive performance close to or slightly worse than the optimal one (Xgboost or Random Forest), which implies that the LIFE algorithm is competitive with other state-of-the-art machine learning algorithms. In addition, there is an average 4.6% or 1.8% improvement in R-squared across all real datasets when the single-hidden-layer NN is trained by LIFE (LLA or Adam) instead of other

Classification
In the classification case, the original LLA algorithm is not stable, since it involves matrix inversion. We added a ridge parameter to the matrix inversion and treat it as a hyperparameter of the LLA algorithm. Experiments have shown that adding a ridge parameter in LLA gives better and more stable predictions than omitting it. After testing LLA and LIFE(LLA) with and without the ridge parameter, we report the better of the two. Table 7 presents similar patterns in real data analyses as in simulation studies with

Classification on Image Data
We have also investigated whether the LIFE algorithm can further improve the performance of a trained deep neural network on image data. We use ResNet18, proposed by He et al. (2016) [13], as an example and apply the algorithms to MNIST data [19]. Details of the data preprocessing can be found in Case 8 of the real dataset descriptions in Appendix 6. After training ResNet18 on MNIST, the output of the final convolutional layer, which has size 8000 × 512, is extracted; it is also the input of the feed forward neural network (FFNN) in the final step of ResNet18. This 8000 × 512 matrix is then treated as the input of LIFE(Adam). We also attach Xgboost and Adam to ResNet18 and compare the results.
Figure 9 shows that the performance of LIFE (orange line) is better than that of ResNet18 (blue line), with consistently larger AUC and smaller log-loss in all 10 replications. LIFE is comparable to Xgboost in terms of log-loss in general, with all 10 values below ResNet18. However, in terms of AUC, Xgboost is worse than ResNet18, with significantly lower AUC in two replications (seed=2 and seed=8). On the other hand, Adam (red line) is not able to further improve the performance of ResNet18, which is consistent with what we found in the previous empirical studies. Table 8 also indicates that LIFE and Xgboost are both capable of remarkably enhancing a trained deep NN, with similar performance. One last finding is that, since the input of LIFE contains 512 columns, it also indicates that LIFE

Interpretation
Interpretability is the degree to which a human can understand the cause of a decision or predict the result of a model. The higher the interpretability of a machine learning or deep learning model, the easier it is for someone to comprehend why certain decisions or predictions have been made. A key advantage of LIFE is that it remains an inherently interpretable model. From the perspective of the NN structure, the model is a single-hidden-layer NN with ReLU activation function, where all the weights and biases can be easily extracted and visualized. Moreover, the single-layer NN with ReLU activation function can be rewritten as a local linear model and interpreted by exploring the patterns of the local linear model coefficients. Finally, the main and interaction effects can be identified by exploring and aggregating the local linear coefficients.
We use the bike sharing data as an example to illustrate the intrinsic interpretability of LIFE. The bike sharing data is a public dataset hosted on the UCI machine learning repository, with around 17,000 observations of hourly (and daily) bike rental counts, along with weather and time information, between 2011 and 2012 in the Capital Bikeshare system. Out of the original 17 predictors, we removed some non-meaningful and highly correlated ones, leaving 9 predictors to predict hourly rental counts. At a small expense of predictive performance, we applied both the base learner selection method shown in Algorithm 3 in Section 5.2 and elastic net to reduce the number of base learners and features, so that the final single-hidden-layer NN has a small number of significant neurons and is easier to interpret.

Explore the weights and bias of single layer NN
As LIFE finally generates a single-hidden-layer NN in the third step, we can explore the weights $\hat w_k$ and biases $\hat b_k$ of the NN directly and identify which variables are important. For the bike sharing data, there are 116 new features (or neurons) after base learner selection and elastic net regularization. We measure the neuron importance by $\mathrm{std}(\hat\beta_k \sigma(\hat b_k + x^T \hat w_k)) / \mathrm{std}(\hat f)$, where std is the standard deviation, $\hat f$ is the predicted value of the response variable for regression or the log-odds for classification, $\hat w_k$ is the neuron weight, $\hat b_k$ is the neuron bias, and $\hat\beta_k$ is the coefficient of the neuron. This quantity measures the importance of a neuron (feature) by comparing the variation of that feature with the total variation of the prediction. The histogram of the neuron importance for the 116 features in Figure 10a shows that there are only 13 neurons whose importance values are greater than 2% of the maximum importance.
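The neuron importance measure can be computed directly from the fitted weights. The following is a minimal sketch assuming ReLU activation; the array names (`W`, `b`, `beta`) and the random toy network are illustrative assumptions, not fitted values from the paper.

```python
import numpy as np

def neuron_importance(X, W, b, beta, intercept=0.0):
    """std(beta_k * relu(b_k + x^T w_k)) / std(f_hat) for every neuron k.

    X: (n, p) inputs, W: (K, p) hidden weights, b: (K,) biases,
    beta: (K,) output-layer coefficients.
    """
    hidden = np.maximum(X @ W.T + b, 0.0)     # ReLU activations, shape (n, K)
    contrib = hidden * beta                   # contribution of each neuron to f_hat
    f_hat = contrib.sum(axis=1) + intercept   # prediction (or log-odds)
    return contrib.std(axis=0) / f_hat.std()

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
W = rng.normal(size=(6, 4))
b = rng.normal(size=6)
beta = np.array([2.0, -1.0, 0.5, 0.0, 1.5, -0.2])
importance = neuron_importance(X, W, b, beta)
```

A neuron whose output coefficient is zero (the fourth one here) receives zero importance, which is the behaviour one would expect after elastic net pruning.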
Then we can measure how each variable contributes to each neuron by allocating the neuron importance to the variables through multiplication by $\hat w_k$. This contribution measure can be visualized with a heatmap between neurons and original variables, as in Figure 10b. It shows that hour (hr) and working day (workingday) are the most significant variables, with darker colors for almost every important neuron compared with other variables. Temperature (temp) can also be considered relatively important, after hr and workingday.
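One plausible form of this allocation is sketched below. Because the exact formula is not available here, the use of the absolute weight $|\hat w_{km}|$ as the multiplier is an assumption of this sketch, and the numbers are toy values.

```python
import numpy as np

def variable_contribution(importance, W):
    """Allocate each neuron's importance to the input variables.

    importance: (K,) neuron importance, W: (K, p) hidden weights.
    Assumption: the allocation multiplies by the absolute weight.
    """
    return importance[:, None] * np.abs(W)    # shape (K, p), one row per neuron

importance = np.array([0.6, 0.3, 0.1])        # toy neuron importances
W = np.array([[1.0, -2.0],
              [0.0, 0.5],
              [3.0, 0.0]])                    # toy weights: 3 neurons, 2 variables
contribution = variable_contribution(importance, W)
```

The resulting matrix has neurons as rows and variables as columns, which is the layout needed for a neuron-by-variable heatmap.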

Treat a single layer NN as a local linear model
As we may have many features in the final wide single-layer NN, it is difficult to visualize and explore the weights and biases of all the neurons. Hence, we also propose to interpret the single-layer NN from a local linear model perspective. A single-layer NN with ReLU activation can be considered a type of local linear model. Each linear projection determines the active or inactive state of a ReLU neuron in the hidden layer, and these states define the layered pattern. The activation region is constructed as a combination of those distinct patterns. The activation regions are mutually exclusive and can be regarded as convex polytopes with closed-form boundaries [23]. Within an activation region, a single linear equation represents the relationship between the response and the independent variables for all data points. After determining the region each observation belongs to, we can easily extract a linear equation for each region based on the estimated weights in the hidden and output layers. The algorithm that performs the linear equation extraction is the following:
Algorithm 2: Local Linear Equation Extraction
Input: estimated weights and biases $\{\hat w_k, \hat b_k, \hat\beta_k\}_{k=1,\cdots,m_J}$; $\tau$: threshold for the number of observations.
Observations are grouped into local regions depending on whether they have the same set of active neurons (boundary condition).
We can visualize those linear equations with a parallel coordinate plot, which allows comparing the estimated coefficients of all predictors across different local linear regions. Through this visualization, we not only get an overview of the importance of each predictor in each region by comparing the magnitudes of the coefficients, but can also check the validity of the effect of each predictor on the response variable. Note that the coefficients are comparable only after standardizing all the predictors. There are three scenarios for a particular independent variable:
1. Relatively large coefficients of the variable (in absolute value, compared with the others) that all share the same sign imply that the variable has a significant positive or negative effect on the response variable, according to whether the coefficients are all positive or all negative.
2. Relatively large coefficients of the variable with both positive and negative signs strongly imply that the variable has inconsistent slopes across local activation regions, which might be due to either its own nonlinear main effect or interaction effects with other variables.
3. Small and close-to-zero coefficients indicate that the feature is not important for explaining the variation of the response variable and can be removed from the model.
Furthermore, we are able to verify whether the sign of the estimated coefficients of a predictor in all regions is consistent with domain knowledge or business sense. Figure 11 displays the estimated coefficients of all predictors in the local activation regions for the bike sharing dataset. There are 116 neurons extracted from the NN base learners and 47 local regions created by Algorithm 2, each containing at least three data points. The plot clearly indicates that hour (hr), working day (workingday) and temperature (temp) are the three most important predictors, with relatively large absolute values of their coefficients in several local regions, which is consistent with the result from Figure 10b. Their estimated coefficients point in different directions across local activation regions, which matches our second scenario and hints at interactions between those variables. Other variables such as humidity (hum) and wind speed (windspeed) are insignificant based on the absolute values of their estimated coefficients in the plot. Sometimes there are too many local regions; Sudjianto et al. (2020) [23] provide two approaches to simplify and reduce the number of local linear equations, merging and flattening, along with a variety of other diagnostic tools and plots for local linear models.
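The extraction idea can be illustrated with a short sketch: observations are grouped by their set of active ReLU neurons, and within each group the network collapses to one linear equation. This is a minimal illustration rather than the paper's Algorithm 2; the function name, the `min_obs` argument (playing the role of the threshold $\tau$) and the random toy network are assumptions.

```python
import numpy as np
from collections import defaultdict

def extract_local_linear(X, W, b, beta, intercept=0.0, min_obs=1):
    """Return one linear equation per activation region of a ReLU network.

    X: (n, p), W: (K, p), b: (K,), beta: (K,).
    """
    active = (X @ W.T + b) > 0                    # (n, K) activation pattern
    regions = defaultdict(list)
    for i, pattern in enumerate(active):
        regions[pattern.tobytes()].append(i)      # same pattern -> same region
    equations = []
    for idx in regions.values():
        if len(idx) < min_obs:
            continue                              # skip sparsely populated regions
        mask = active[idx[0]]
        coef = (beta[mask][:, None] * W[mask]).sum(axis=0)   # slope per predictor
        const = intercept + (beta[mask] * b[mask]).sum()     # region intercept
        equations.append({"index": np.array(idx), "n_obs": len(idx),
                          "intercept": const, "coef": coef})
    return equations

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
W, b, beta = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=5)
equations = extract_local_linear(X, W, b, beta, intercept=0.3)
f_hat = np.maximum(X @ W.T + b, 0.0) @ beta + 0.3   # direct network prediction
```

Each row of `coef` across regions is what a parallel coordinate plot of local coefficients would display; inside a region, the local equation reproduces the network output exactly.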

Main and Interaction Effect Detection
Even though the parallel coordinates plot provides a guideline about the variable importance in each local region, we still need a solid technique to detect nonlinear main effects and interaction effects. To this end, we can treat the single-hidden-layer NN as a varying coefficient model through the linear equation extraction shown in Algorithm 2. As the local linear equation coefficients vary over local regions, and the region definition depends on the predictors, the coefficients can be treated as functions of the predictors in equation 11, where $p$ is the number of predictors and $n$ is the number of observations; $\hat f_i$ is the predicted value for regression and the predicted log-odds for classification; and $\alpha_{mi}$ is the coefficient of the $m$-th variable at the $i$-th observation, which could itself be a function of all predictors and therefore varies across observations. Our goal is to investigate the functional forms of the estimated coefficients. Therefore, we separate $\alpha_{mi}$ into two components representing main and interaction effects in equation 12. The first term in equation 12 is a function of $x_{mi}$, including the intercept of $\alpha_{mi}$, and captures the main effect of $x_{mi}$. If $\alpha_{mi}$ has a significant intercept, a linear main effect can be identified, while a strong relationship with $x_{mi}$ indicates a nonlinear main effect. The second term is a function of the other predictors, and it may or may not contain $x_{mi}$. This term can be used to detect interactions between $x_{mi}$ and other predictors. As an illustration, consider a simple example in which all estimated coefficients are constant except $\alpha_{1i} = \theta_0 + \theta_1 x_{1i} + \theta_2 x_{2i}$. Then $\alpha_{1i} = g_{main}(x_{1i}) + g_{int}(x_{2i})$, where $g_{main}(x_{1i}) = \theta_0 + \theta_1 x_{1i}$ and $g_{int}(x_{2i}) = \theta_2 x_{2i}$. Because $\alpha_{1i}$ multiplies $x_{1i}$ in the model, $g_{int}$ contributes the interaction term $\theta_2 x_{1i} x_{2i}$ and $g_{main}$ contributes the quadratic term $\theta_1 x_{1i}^2$. We can therefore easily identify the interaction between $x_{1i}$ and $x_{2i}$, and $x_{1i}$ shows a nonlinear main effect via its quadratic term. To detect main effects and interaction effects from $\alpha_{mi}$, we propose the two-stage process below:
1. Check nonlinearity: calculate the conditional expectation $E(\alpha_{mi} \mid x_{mi})$ by smoothing the estimated coefficients of each predictor against the predictor itself: $\alpha_{mi} \sim g_{main}(x_{mi})$, $m = 1, \cdots, p$.
We choose a two-stage process instead of a one-stage process because it estimates the main and interaction effects more accurately in the correlated predictor case and separates the two effects effectively. Note that some special interaction effects may not be identified by this process; an example is $y = \alpha_0 + \alpha_1 x_1$ with $\alpha_1 = x_1 x_2$, in which case $g^{1}_{2}(x_{2i})$ is a zero curve. Fortunately, most common interaction patterns can be identified by our two-stage process. As long as $g^{m}_{k}(x_{ki})$ shows a significant pattern in $x_{ki}$, an interaction effect can be identified.
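The two-stage idea can be sketched on the simple example $\alpha_{1i} = \theta_0 + \theta_1 x_{1i} + \theta_2 x_{2i}$. The sketch rests on two assumptions that are not taken from the text: the smoother is a plain equal-frequency bin average, and the second stage smooths the first-stage residuals against each other predictor.

```python
import numpy as np

def binned_smooth(x, y, n_bins=20):
    """Estimate E(y | x) by equal-frequency bin means; returns fitted values."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    sums = np.bincount(bins, weights=y, minlength=n_bins)
    counts = np.bincount(bins, minlength=n_bins)
    return (sums / np.maximum(counts, 1))[bins]

def two_stage_effects(alpha_m, X, m):
    """Stage 1: smooth alpha_m on x_m. Stage 2 (assumed): smooth residuals on x_k."""
    g_main = binned_smooth(X[:, m], alpha_m)
    resid = alpha_m - g_main
    g_int = {k: binned_smooth(X[:, k], resid)
             for k in range(X.shape[1]) if k != m}
    return g_main, g_int

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3))
alpha_1 = 0.5 + 1.0 * X[:, 0] + 2.0 * X[:, 1]    # theta = (0.5, 1.0, 2.0)
g_main, g_int = two_stage_effects(alpha_1, X, m=0)
strength = {k: g.std() for k, g in g_int.items()}  # size of each interaction curve
```

In this toy setting the curve for the second predictor carries a clear pattern, while the curve for the unrelated third predictor stays close to zero.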
For the bike sharing data, we visualized all pairs of varying coefficients and variables ($\alpha_{mi}$ vs $x_{ki}$) with scatter plots in Figure 12. Since $\partial \hat f(x) / \partial x_m = \alpha_m$, this is also a scatter plot of the partial derivatives of $\hat f(x)$. On top of the scatter plots, we draw $g_{main}(x_{mi})$ against $x_{mi}$ in the diagonal plots and $g^{m}_{k}(x_{ki})$ against $x_{ki}$ in the $(m, k)$ off-diagonal plots. To further quantify the magnitude of the interaction effects, we calculated the weighted standard deviation of $g_{main}(x_{mi})$ and $g^{m}_{k}(x_{ki})$, with the population density as weight. The heatmap of the interaction measures for the bike sharing data is provided in Figure 13, where the diagonal entries are masked by zero. The nonlinear patterns of the variables can be clearly spotted in the diagonal plots in Figure 12. The most important variable, hour (hr), fluctuates drastically compared with the others, indicating its nonlinear effect on the response. As evidenced in both Figure 12 and Figure 13, the top three interaction pairs, hr vs workingday, hr vs weekday and hr vs temp, can be easily identified.
In addition to interaction detection, we can obtain and visualize the main effect of each predictor directly by aggregating the local linear coefficients. Since $\partial \hat f(x) / \partial x_m = \alpha_m$ in the varying coefficient setting, we compute the exact main effect of $x_m$ by constructing a relationship between $\hat f$ and $x_m$ based on the formula $\int_{x_0}^{x} E\left(\frac{\partial \hat f(x)}{\partial x_m} \mid x_m\right) dx_m$ from the Accumulated Local Effects (ALE) plot, where the variable is transformed back to its original scale, as seen in Figure 14. This ALE formulation simplifies to $\int_{x_0}^{x} g_{main}(x_m)\, dx_m$, and its numerical implementation uses the midpoint rule. The main effect of hr has two peaks and one trough, which is similar to the partial dependence plots from other machine learning algorithms, while the main effects of temp and hum show a quadratic relationship. More specifically, the peaks of bike rentals occur around 7 am and 5-6 pm, while very few
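The midpoint-rule accumulation of $g_{main}$ can be sketched as follows. The grid and the linear `g_main` used in the demonstration are illustrative assumptions; in practice $g_{main}$ would be the smoothed coefficient curve, and any centering of the curve is left out here.

```python
import numpy as np

def ale_main_effect(grid, g_main):
    """Accumulate g_main over an increasing grid with the midpoint rule.

    grid: (T+1,) increasing points x_0 < ... < x_T
    g_main: vectorized callable returning E(alpha_m | x_m)
    Returns the accumulated effect at every grid point (0 at x_0).
    """
    mid = 0.5 * (grid[:-1] + grid[1:])       # midpoint of each interval
    widths = np.diff(grid)                   # interval lengths
    increments = g_main(mid) * widths        # midpoint-rule area per interval
    return np.concatenate([[0.0], np.cumsum(increments)])

grid = np.linspace(0.0, 1.0, 11)
ale = ale_main_effect(grid, lambda x: 2.0 * x)   # integral of 2x is x^2
```

For a linear $g_{main}$ the midpoint rule is exact, so the accumulated curve here coincides with $x^2$ on the grid, which illustrates how a linearly varying coefficient yields a quadratic main effect.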

Different Sampling Schemes
The LIFE algorithm can be considered a general framework with three steps, as discussed in the methodology section. LIFE is very flexible and allows users to try different combinations of the three steps. The first step offers several data sampling options. In this paper, we use the linear projections inside NN neurons to split the data and select data points from the active region for base learner training, as shown in Figure 2. Those linear projections are obtained from trained NNs in a supervised setting. Given a fixed hyper-parameter setup for LIFE, we have implemented our method with different sampling choices, including NN projection, random projection and bootstrapping. In Table 9, we can easily see that all the ensemble methods outperform a single single-hidden-layer NN model optimized by LLA or Adam. Most importantly, sampling by NN linear projection is better than the other sampling methods, which create subsets in a random way.
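The three sampling choices can be contrasted in a short sketch. It assumes that the active region of a neuron is the set of observations with $b + x^T w > 0$, and the weights `w_hat` and `b_hat` are hypothetical stand-ins for a trained neuron.

```python
import numpy as np

def nn_projection_subset(X, w, b):
    """Indices of observations in the active region of a (trained) neuron."""
    return np.flatnonzero(X @ w + b > 0)

def random_projection_subset(X, rng):
    """Same rule, but weights and bias are drawn from a standard normal."""
    w = rng.standard_normal(X.shape[1])
    b = rng.standard_normal()
    return np.flatnonzero(X @ w + b > 0)

def bootstrap_subset(X, rng):
    """Indices drawn uniformly with replacement from the training set."""
    return rng.integers(0, X.shape[0], size=X.shape[0])

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_hat = np.array([1.0, -0.5, 0.0, 2.0, 0.3])   # hypothetical trained weights
b_hat = 0.2                                    # hypothetical trained bias
idx_nn = nn_projection_subset(X, w_hat, b_hat)
idx_rand = random_projection_subset(X, rng)
idx_boot = bootstrap_subset(X, rng)
```

Only the first option uses information from a supervised fit; the other two create subsets without looking at the response.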

Base Learner Selection
For model aggregation and pruning, we can prune neurons to obtain a single-layer NN with fewer neurons. We used elastic net for pruning due to its simplicity, but other pruning methods can also be considered. Based on the properties of LIFE, we also developed an alternative pruning method, called base learner selection, to reduce the number of nodes in the final step. It is assumed that LIFE works well because the prediction errors of different base learners are not strongly correlated. Therefore, we can remove base learners one by one, according to the correlation between a learner's prediction errors and the prediction errors of the other base learners. In this way, we maintain diversity and mitigate overfitting by keeping fewer, necessary base learners without sacrificing much predictive performance. This method is described in Algorithm 3.
2. Collect $R^2_j$ from this linear model.
Remove $B_j$ based on the value of $R^2_j$ in descending order until τ is achieved (the one with the highest $R^2$ is removed first).
In Algorithm 3, the threshold θ is the percentage of base learners to be retained from the pool of candidates. When there is a large number of neurons in the LIFE setting, elastic net is usually computationally expensive, and base learner selection is a good alternative because it can be run in parallel. Moreover, we can combine these two pruning methods to obtain a simpler NN model from a wide NN faster. We illustrate this using two simulated models (GAM and MIM). Plots (a) and (b) in Figure 15 show the relationship between $R^2$ and the number of hidden neurons used for feature extraction. By setting different thresholds in base learner selection (Algorithm 3), we can construct single-hidden-layer NNs with different numbers of neurons. In general, base learner selection can effectively reduce the number of neurons to produce a simpler model without sacrificing predictive performance, and sometimes obtains even better results.
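The selection rule can be sketched as follows: for each base learner, its prediction errors are regressed on the errors of all other learners, and the learners with the highest $R^2$ (the most redundant ones) are dropped first. This is a minimal illustration under assumptions: the $R^2$ values are computed once rather than refitted after each removal, and `keep_fraction` is a hypothetical name for the threshold.

```python
import numpy as np

def base_learner_selection(errors, keep_fraction):
    """errors: (n, J) prediction errors of J base learners.

    Returns the sorted indices of retained learners and the R^2 of each learner.
    """
    n, J = errors.shape
    r2 = np.empty(J)
    for j in range(J):
        others = np.delete(errors, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # intercept + other errors
        coef, *_ = np.linalg.lstsq(A, errors[:, j], rcond=None)
        resid = errors[:, j] - A @ coef
        total = errors[:, j] - errors[:, j].mean()
        r2[j] = 1.0 - (resid @ resid) / (total @ total)
    n_keep = max(1, int(round(keep_fraction * J)))
    order = np.argsort(-r2)                            # highest R^2 first
    kept = np.sort(order[J - n_keep:])                 # drop the first J - n_keep
    return kept, r2

rng = np.random.default_rng(0)
e = rng.normal(size=(2000, 3))
# Learner 3 nearly duplicates learner 0, so both are highly redundant.
errors = np.column_stack([e, e[:, 0] + 0.05 * rng.normal(size=2000)])
kept, r2 = base_learner_selection(errors, keep_fraction=0.5)
```

Because each regression is independent of the others, the loop over `j` is the part that could be distributed across workers.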

Conclusion
In this paper, we have proposed a novel algorithm that fits a single-hidden-layer NN to achieve three goals: ensuring competitive predictive performance, boosting computational efficiency, and preserving the interpretability of the model. Unlike traditional NN training methods, we train the network iteratively through layer-by-layer training of multiple NNs and then combine them via neural node flattening. We have evaluated the performance of our approach on simulated and empirical data in terms of predictive accuracy and computational efficiency and found that it consistently outperforms a single-hidden-layer NN trained directly by the LLA or Adam optimizer and achieves results competitive with those of Xgboost.
This superior performance has three sources. First, as an ensemble method, the LIFE algorithm performs data sampling through linear projection inside neural nodes, which creates diversity among the models and contributes to the bias and variance reduction of predictions from the combined models. Second, the LIFE algorithm takes advantage of the single-hidden-layer NN structure to combine multiple narrow single-hidden-layer NNs into a wide one via neural node flattening. Third, the LIFE algorithm benefits from parallel computing to train multiple NNs on subsets of the data simultaneously. Moreover, the base learner selection method introduced in the paper helps prune redundant neural nodes and produce a more parsimonious model after several iterations of the LIFE algorithm. We have also proposed a new method for main and interaction effect detection from the perspective of interpretation.
, and $z_i = [z_{1i}, z_{2i}, z_{3i}]^T$, which is a $J_1(p + 2)$-dimensional vector. The objective function is approximated by the LS function of a linear regression with response $y_i$ and predictors $z_i$. Denote the resulting LS estimates of $\beta_j$, $\gamma_j$ and $\boldsymbol{\eta}_j$ by $\hat\beta_j$, $\hat\gamma_j$ and $\hat{\boldsymbol{\eta}}_j$, respectively. By the definition of $\gamma_j$ and $\boldsymbol{\eta}_j$, we can update $b_j$ and $w_j$ as shown in the

Functional Forms for Classification Case in Simulation Study
In the classification case, we use three functional forms, the Generalized Additive Model (GAM), the Additive Index Model (AIM) and the Multiple Index Model (MIM), expressed as follows, where $\epsilon_i \sim N(0, 1)$, $i = 1, \ldots, N$.
Structure, which is taken from CASP 5-9. The goal is to predict the RMSD-size of the residue given other physical attributes. There are 45730 observations and 9 predictors, including total surface area, non-polar exposed area, fractional area of exposed non polar residue, fractional area of exposed non polar part of residue, molecular mass weighted exposed area, average deviation from standard exposed area of residue, Euclidean distance, secondary structure penalty, and special Distribution constraints (N,K Value).
Case 7: Electrical Grid. The local stability analysis of the 4-node star system (electricity producer is in the center) implementing the Decentral Smart Grid Control concept. There are 10000 observations and 11 predictors, including tau[x], x = 1, 2, 3, 4, which are the reaction times of the participants; p[x], x = 2, 3, 4, which are the nominal power consumed (negative) or produced (positive) (real); and g[x], x = 1, 2, 3, 4, which are coefficients (gamma) proportional to price elasticity. The continuous response variable is the maximal real part of the characteristic equation root.

Classification
Case 1: Bank Marketing. The data provides information regarding direct marketing campaigns of a Portuguese banking institution. The classification goal is to predict whether the clients, who were generally contacted through at least two phone calls, would subscribe to a term deposit or not. The data contains 45211 examples and 16 variables, including 6 numerical variables (age, balance, day, campaign, pdays, previous) and 9 categorical variables (job, marital, education, default, housing, loan, contact, month, poutcome). Note that the input variable 'duration' is not included in the analyses, following the suggestion of the provider of this dataset: it strongly affects the output target and should be discarded when the intention is to build a realistic predictive model.
Case 2: Breast Cancer Wisconsin. The data comes from the original Wisconsin Breast Cancer Database, with the purpose of detecting whether the tissue is benign or malignant. The original dataset contains 699 instances and 9 attributes. After deleting rows with missing values, 683 instances are used in our analyses. All predictive variables are numerical on a scale of 1 to 10, including Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli and Mitoses.
Case 3: Higgs Boson. This dataset was built from the official ATLAS full-detector simulation in 2014 that mixes 'Higgs to tautau' events with different backgrounds, where the events in which Higgs bosons were produced are comprised in the signal sample, while other known processes mimicking the signal are considered as background noise. The dataset contains 818238 events and 63 variables, among which a few are highly correlated. The objective is to detect signal from background based on characteristics of events such as mass (estimated, transverse or invariant), transverse momentum, centrality of the pseudo rapidity of

Figure 2: An illustration of the LIFE framework with $[K_1, K_2, K_3] = [3, 3, 2]$. Regions with a smiling emoji face indicate the subsets that satisfy the sampling conditions and are used for NN model training in the next iteration.

k's, k = 1, 2, 3, obtained from the first iteration. The non-white regions (yellow, green or red) indicate the subsets where all observations satisfy b (j)

Figure 3: Data partition by NN linear projection after the first iteration. Observations in the colored areas of (a), (b) and (c) are selected into each subset; (d) shows the two kept data partitions in one plot.
and new subsets are defined satisfying b (2)

1. Train a single-hidden-layer NN regressor or classifier with the number of hidden neurons equal to $K_j$ on the training set $(x, y)$ using the specified optimizer.

Figure 5: Relationship between LIFE and Stacking (Regression)
Figures 5 and 6 indicate a strong linear relationship between LIFE and the stacking method under different hyper-parameter setups, in terms of MSE loss or cross entropy loss, in both the regression and classification cases. This linear relationship has been verified on both simulated data (MIM) and

Figure 8: Loss decomposition for classification

Figure 9: ResNet18 vs. Its Potential Improvers
Figure 10: (a) Neuron Importance (b) Contribution of Variables to Neurons

Algorithm 2 (steps recovered from the listing): Construct the activation set $A_t$ of the local region $R_t$ from its boundary condition. Calculate coefficients: $\hat E_t = \sum_{k \in R_t} \hat w_k \hat\beta_k$

Figure 12: Plot Matrix Between $\alpha_m$ and $x_m$

Figure 14: ALE Plots for Predictors Based on LIFE

Figure 15: Relationship between $R^2$ and the number of neurons. The red point indicates the model using linear regression as the final step without base learner selection, while the blue points indicate model aggregation by base learner selection with different thresholds.

step 2 of algorithm. If $|\hat\beta_j|$ is very close to zero, one may simply set b , we may estimate $W$ and $b$ iteratively by regressing $y_i$ on the updated $z_i$. The procedure can be summarized as the following algorithm.
1. Set initial values for $W^{(0)} = [w^{(0)}_1, \cdots, w^{(0)}_{J_1}]^T$ and $b^{(0)}_j$, and let $c = 0$.
2. Calculate $z_i$ defined in the text based on $w^{(c)}_j$ and $b^{(c)}_j$, obtain the least squares estimates (LSE) $\hat\beta_j$, $\hat\gamma_j$ and $\hat\eta_j$ by running a linear regression of $y_i$ on the covariates $z_i$, and update the biases and weights by b + ηj / βj if $|\hat\beta_j| \ge \varepsilon$, where $\varepsilon$ is a constant for numerical stability, set to $10^{-3}$ in our numerical experiments; keep the corresponding biases and weights unchanged if $|\hat\beta_j| < \varepsilon$.
which are called new features.

Table 1: Difference between LIFE and Stacking
$\cdots, 6;\ i = 1, \cdots, N$ are drawn from a Normal or Laplace distribution. For regression, the experimental results show the mean and standard deviation of RMSE, $R^2$ and training time $T$ over five replications in Tables

Table 2: Regression on Simulated Data (Normal Distribution)
Note: On the left side of the vertical line, all columns represent single-hidden-layer NNs optimized by LIFE (LLA), LLA, LIFE (Adam) and Adam, while on the right side there are three state-of-the-art machine learning methods: FFNN, Xgboost, and Random Forest (RF). FFNN is a multi-hidden-layer feed forward NN, where the number of hidden layers ranges from two to four. Xgboost and RF are two tree-based ensemble methods. The figures inside parentheses indicate the standard deviations of the metrics, and time represents the training time of one replication. The numbers colored in red represent the optimal results for each metric.
methods in terms of predictive performance on the test set. The values of the two metrics from LIFE (LLA) are close to the oracle values, which implies that LIFE performs well on data with a smooth response surface. If we compare the results of one-hidden-layer FFNNs trained by the LIFE algorithm with either LLA or Adam base learners against the non-ensemble algorithms LLA or Adam, LIFE always outperforms the corresponding optimization method used to train the single-hidden-layer FFNN as a whole, owing to the diversity generated by data sampling. In addition, the performance of LIFE also depends on the strength of the individual NN base learners, which can be easily spotted in the Tables

Table 3: Regression on Simulated Data (Laplace Distribution)
Note: The column layout, the standard deviations in parentheses, the training time and the red highlighting follow the same conventions as in the note to Table 2.

Table 4: Classification on Simulated Data (Normal Distribution)
Note: The column layout, the standard deviations in parentheses, the training time and the red highlighting follow the same conventions as in the note to Table 2.

Table 5: Classification on Simulated Data (Laplace Distribution)
Note: The column layout, the standard deviations in parentheses, the training time and the red highlighting follow the same conventions as in the note to Table 2.
optimization methods (LLA or Adam), which is consistent with the conclusion drawn from the experiments on simulated data. It is also worth mentioning that the computational efficiency of NN training is significantly boosted by LIFE for almost all datasets compared with traditional NN training methods. In particular, for the largest real dataset, CASP, the training time of the NN via LIFE drops to 216 seconds from 2881 seconds when LLA is the optimizer, and to 54 seconds from 339 seconds when Adam is the optimizer, which is more than six times faster in both cases. Although tree-based ensemble methods such as Xgboost and Random Forest show strong predictive power on some datasets, they are still black-box models and are hard to interpret. The biggest advantage of our proposed LIFE algorithm is that it preserves the interpretability of the model, which remains a single-hidden-layer NN, while offering very strong predictive performance and improved computational efficiency.

Table 6: Regression on Real Data
Note: The column layout, the standard deviations in parentheses, the training time and the red highlighting follow the same conventions as in the note to Table 2.

Table 7: Classification on Real Data
Note: The column layout, the standard deviations in parentheses, the training time and the red highlighting follow the same conventions as in the note to Table 2.

Table 8: ResNet18 vs. Its Potential Improvers
Adam) taking turns to occupy the dominant position for most of the datasets. LIFE performs much better than Xgboost and Random Forest (RF) in most experiments. For example, on the Breast Cancer Wisconsin data, the log-loss of LIFE(Adam) is 26.3% lower than that of Xgboost, and on the MAGIC Gamma Telescope data, the log-loss drops by 7% from Random Forest to LIFE(LLA). On some datasets, such as the Bank Marketing data, LIFE cannot outperform Xgboost or RF; however, its performance remains competitive, with only 0.5% and 0.7% difference in AUC and log-loss, respectively. Another aspect worth mentioning is that the Higgs Boson data contains quite a few highly correlated variables. The results show that the LIFE algorithm outperforms all the other models on this dataset, which indicates that LIFE predicts well on highly correlated structures.

Table 9: Sampling Method Comparison
Note: The different sampling methods in the first step of the general LIFE framework are tested on two simulated datasets and two real datasets. NN projection partitions the data by the linear projections inside the neurons of NNs and then samples data from the selected region, which is the main approach of the LIFE algorithm used in the paper. Random projection partitions the data by random linear projections, where weights and biases are randomly drawn from a standard normal distribution. Bootstrapping selects observations randomly from the training set. The number colored in red represents the optimal result for each metric. LLA is used to optimize the single-hidden-layer NN base learner in the regression case, while Adam is used in the classification case.