1 Introduction

In high-dimensional data modeling, multivariate adaptive regression splines (MARS) is a popular nonparametric regression technique used to study the nonlinear relationship between a response variable and a set of predictor variables with the help of splines. MARS uses piecewise linear or cubic splines for the local fit and applies an adaptive procedure to select the final model. MARS can be viewed as a generalization of stepwise linear regression, or as a modification of classification and regression trees (CART) designed to further improve CART’s performance in regression modeling (Friedman 1991).

In passing, we note that the underlying idea of MARS modeling appears to be similar to the group method of data handling (GMDH), a combinatorial heuristic developed by the Ukrainian cyberneticist Ivakhnenko dating back to 1966 (Ivakhnenko 1966), which constructs a mathematical model of a system in an evolutionary fashion. The algorithm is designed to model the functional relationship between the response and predictor variables, learned directly from a self-organization of the data. It constructs high-order regression-type models, beginning with a few basic quadratic equations and building up a high-order polynomial of the Kolmogorov–Gabor type. The difference between MARS and GMDH is that MARS uses piecewise linear or cubic splines instead of quadratic polynomials in several variables. For more on GMDH, we refer the readers to Ivakhnenko (1966) and Hild and Bozdogan (1995).

The popularity of MARS as a nonparametric modeling tool can be seen in its successful applications across many disciplines: in medical research, in detecting disease-risk relationship differences among gender subgroups (York et al. 2006), in studies of HIV reverse transcriptase inhibitors (Xu et al. 2004), and in breast cancer diagnosis (Chou et al. 2004); in business, in mining customer credit (Lee et al. 2006) and in intrusion detection systems (Mukkamala et al. 2006); and in molecular biology, in chromatographic retention prediction of peptides (Put and Vander Heyden 2007), among many others.

In terms of the most recent algorithmic developments on the performance of MARS, the literature includes the use of a genetic algorithm for knot selection in Pitmann and McCulloch (2002). In the study of Weber et al. (2012), the MARS algorithm is modified by introducing a penalized residual sum of squares, and the problem is solved as a Tikhonov regularization problem with conic quadratic optimization. In the studies of Yazici (2011) and Ozmen et al. (2011), the complexity of the method proposed in Weber et al. (2012) is reduced by bootstrapping, and the capability of the method is enhanced to handle random input and output variables by a robust method, respectively. Further, a time-efficient forward selection procedure is proposed in Kartal Koc and Iyigun (2013) for MARS modeling.

A critical aspect in determining the form of the nonparametric regression model in the MARS strategy is the evaluation of submodels to select the best one, with a proper number of knots, over the best subset of predictors. In the model fitting process, the function estimate is generated via a two-step procedure: forward selection and backward elimination. At each forward step, a candidate term (spline function) that most improves the overall ’goodness-of-fit’ of the fitted model is added to the model. As discussed in Friedman (1991), at the end of this step there may be model terms that no longer contribute sufficiently to the model fit. Thus, in a backward step, the candidate term that least degrades the overall ’goodness-of-fit’ of the fitted model is eliminated from the model. In this respect, evaluation and selection of the relevant subset of predictor variables with the corresponding proper knots are the main concerns of MARS for reducing the curse of dimensionality.

The problem of selecting the best spline functions, which are treated as the inputs in MARS, is solved by Friedman through a stepwise procedure using the modified generalized cross-validation (GCV) criterion (not accounting for the selection bias) of Craven and Wahba (1979). Although Friedman avoids the overfitting problem in MARS by the modified GCV, questions have been raised in the literature about whether the modified GCV criterion is the ’best’ criterion for model selection in the MARS algorithm.

In the literature, Stevens (1991) appears to be the first to apply Akaike’s information criterion (AIC) (Akaike 1974), AIC’s modification (Akaike 1979), the Schwarz Bayesian criterion (SBC) (Schwarz 1978), and Amemiya’s prediction criterion (PC) (Amemiya 1980), along with the modified GCV, in MARS for modeling univariate and semi-multivariate time series systems. Although these criteria are specifically designed for model selection and not just for the estimation of risk, as stated in Barron and Xiao (1991), in regression modeling, when a large number of predictor variables are presented to the model and there is no precise information about the exact relationships among the variables, such criteria still overfit the model. In addition, the complexity of a model increases as the number of independent and adjustable parameters (i.e., the effective degrees of freedom of the model) increases. According to the qualitative principle of Occam’s Razor, we need to find the simplest model that judiciously balances overfitting and underfitting. To achieve this in MARS, our major objective is to introduce and develop, for the first time, Bozdogan’s information-theoretic measure of complexity (ICOMP) criterion (“I” for information and “COMP” for complexity) (Bozdogan 1988, 1990, 1994, 2000) within the MARS modeling framework.

In contrast to AIC-based information criteria, ICOMP approximates the sum of two Kullback and Leibler (1951) distances, measuring both the lack of fit of the model and the model complexity in one criterion function using an entropic measure of the estimated covariance matrix of the model parameters. In this sense, the concept of model complexity here takes into account not only the number of free parameters in the model but also the interdependency of the parameter estimates. Hence, ICOMP provides a general model selection criterion with insight into the correlational structure among the parameter estimates of the selected model. Using ICOMP, a better tradeoff between how well the model fits the data and the model complexity is also achieved for MARS modeling. In addition, our objective is to carry out a comprehensive Monte Carlo simulation study to compare the performance of model selection criteria such as ICOMP, AIC, SBC, and GCV in MARS modeling, which to our knowledge does not exist in the literature.

This paper is organized as follows. In Sect. 2, the requisite background on MARS modeling and the GCV criterion is given. Section 3 provides the analytically derived forms of ICOMP based on the estimated inverse Fisher information matrix (IFIM) and the posterior expected utility form of ICOMP, along with the derived forms of AIC and SBC in MARS modeling. As an alternative to Tikhonov regularization, in this section we introduce new smoothed (or robust) covariance estimation procedures to resolve the problem of ill-conditioned model covariance matrices in MARS modeling, and we also use the eigenvalue stabilization method given in Thomaz (2004). In Sect. 4, the performances of the model selection criteria in selecting the best subset of models are studied via two Monte Carlo simulations and on a real dataset to predict body fat in obesity studies. Section 5 concludes the paper with a discussion and provides future directions for MARS modeling research.

2 Multivariate adaptive regression splines

MARS was developed by Friedman (1991) as a nonparametric regression technique to approximate a general model of the form

$$\begin{aligned} y=f(\mathbf {x})+\varepsilon , \end{aligned}$$
(1)

where, \(\varepsilon \) denotes the error term, \(\varvec{x}=(x_{1},x_{2},\ldots ,x_{p})^{T}\) is the vector of \(p\) predictor variables, and \(y\) is the response variable.

To approximate the nonlinear relationship between the predictor variables \(\varvec{x}\) and the response variable \(y\), a flexible model estimate is provided using piecewise linear basis functions (BFs) of the form

$$\begin{aligned} \begin{array}{lll} (x-t)_{+}={\left\{ \begin{array}{ll} x-t, &{} \textit{if}\, x>t\\ 0 &{} \textit{otherwise} \end{array}\right. } &{} \,\hbox {and}\, &{} (t-x)_{+}={\left\{ \begin{array}{ll} t-x, &{} \textit{if}\, x<t\\ 0 &{} \textit{otherwise} \end{array}\right. } \end{array} \end{aligned}$$

where, the subscript “+” denotes the positive part.

As an example, for a univariate variable \(x\), the piecewise linear BFs (also called a reflected pair) for \(t=0.5\) are shown in Fig. 1, where \(t\) denotes the knot point (or breaking point).

Fig. 1 The forms of BFs in MARS

The idea of MARS is to form reflected pairs for each predictor variable, \(x_{j}\), \(j\in \{1,\ldots ,p\}\) with knots at each observed value, \(x_{ij}\), \(i\in \{1,\ldots ,n\}\) of that variable, where \(n\) is the sample size. For the example given in Fig. 1, two other possible BFs with knots at \(t=0.2\) and \(t=0.8\) are displayed by shadow lines. The set of all possible reflected pairs with the corresponding knots, therefore, can be expressed by the set \(\mathcal {S}\) in (2).

$$\begin{aligned} {{\mathcal {S}}}=\{(x_{j}-t)_{+},(t-x_{j})_{+}|\, t\in \{x_{1j},x_{2j},\ldots ,x_{nj}\},\, j\in \{1,\ldots ,p\}\}. \end{aligned}$$
(2)
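As a concrete illustration of the reflected pairs and the candidate set \(\mathcal {S}\) in (2), a minimal sketch in Python/NumPy is given below; it is not part of the original MARS implementation, and the helper names hinge and candidate_pairs are ours.

```python
import numpy as np

def hinge(x, t):
    """Return the reflected pair ((x - t)_+, (t - x)_+)."""
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

def candidate_pairs(X):
    """Enumerate the candidate set S of Eq. (2): one reflected pair per
    predictor j and per observed knot value x_ij."""
    n, p = X.shape
    S = []
    for j in range(p):
        for t in np.unique(X[:, j]):       # knots at the observed values
            S.append(("+", j, float(t)))   # encodes (x_j - t)_+
            S.append(("-", j, float(t)))   # encodes (t - x_j)_+
    return S

X = np.random.rand(5, 2)                   # n = 5 observations, p = 2 predictors
print(len(candidate_pairs(X)))             # at most 2 * n * p candidate BFs
```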

The model building strategy of MARS is similar to that of classical linear regression. However, instead of the original predictor variables, MARS uses the functions in the set \(\mathcal {S}\) or their products. The MARS model used to approximate the function in (1) is defined as

$$\begin{aligned} f(\mathbf {x})=\beta _{0}+\sum _{m=1}^{M}\beta _{m}B_{m}(\mathbf {x}), \end{aligned}$$
(3)

where, \(B_{m}(\mathbf {x})\) represents a BF from the set \(\mathcal {S}\) or a product of two or more such functions, and \(M\) is the number of BFs in the current model (Friedman 1991; Friedman and Silverman 1989).

For the multiple-variable case, the expression \(B_{m}(\mathbf {x})\) in (3) can also incorporate interactions between predictors. The interaction terms are created in MARS by multiplying an existing BF with a truncated linear function involving a new variable. Hence, the product of two BFs is nonzero only over the region of the predictor space where both components are nonzero. In Fig. 2, the form of the function \(B(x_{1},x_{2})\) resulting from the multiplication of the two piecewise linear functions \((x_{1}+0.5)_{+}\) and \((x_{2}+1)_{+}\) is illustrated.

Fig. 2 Two-way interaction BFs
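The following sketch, assuming a simple hinge helper as in the previous sketch, illustrates how such an interaction BF and a small basis matrix for the model in (3) could be formed; the knot values used here are arbitrary.

```python
import numpy as np

def hinge(x, t):
    return np.maximum(x - t, 0.0)          # (x - t)_+

# B(x1, x2) = (x1 + 0.5)_+ * (x2 + 1)_+ : nonzero only where both factors are
x1, x2 = np.meshgrid(np.linspace(-1, 1, 50), np.linspace(-2, 1, 50))
B_interaction = hinge(x1, -0.5) * hinge(x2, -1.0)

# A small design matrix for the model in Eq. (3): intercept, two main-effect
# BFs and their product (knots 0.5 and 0.2 chosen only for illustration)
n = 100
x = np.random.rand(n, 2)
B = np.column_stack([
    np.ones(n),                                   # intercept beta_0
    hinge(x[:, 0], 0.5),                          # (x1 - 0.5)_+
    hinge(x[:, 1], 0.2),                          # (x2 - 0.2)_+
    hinge(x[:, 0], 0.5) * hinge(x[:, 1], 0.2),    # two-way interaction BF
])
```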

An example of MARS models built by piecewise linear and cubic splines for a two-dimensional noise-free function given by

$$\begin{aligned} y=f(x_{1},x_{2})=\mathrm{sin}(2\pi x_{1})\, \mathrm{cos}(1.25\pi x_{2}) \end{aligned}$$
(4)

are shown in Fig. 3a, b, respectively. The regression surface is built using only nonzero components, which are obtained locally from the product of two BFs only when they are needed (Hastie et al. 2001).

Fig. 3 The plots of the piecewise linear and cubic MARS models. a Piecewise linear approximation, b piecewise cubic approximation

2.1 Traditional model selection in MARS with GCV

MARS builds a model by searching over all combinations of the variables and all values of each variable as the candidate knots through an adaptive procedure including a two-stage process: forward selection and backward elimination.

In the forward step, the algorithm starts with a model consisting of the intercept term, \(\beta _{0}\), and then the reflected pairs that give the maximum reduction in the residual sum of squares are added to the model iteratively until the maximum number of terms specified by the user is reached. Each new BF consists of a term already in the model multiplied by a new truncated linear function. At the end of this step, a large model that typically overfits the data is obtained. Figure 4a illustrates a simple example of how MARS would attempt to fit data during the forward step in a two-dimensional space using piecewise linear regression splines.

Fig. 4 Examples of MARS models fitted after the forward and backward steps. a Forward-step model, b backward-step model

Following the forward step, backward elimination is implemented to refine the model fitting process. In this pruning step, the BFs contributing least to the model are eliminated step by step through the modified GCV (Craven and Wahba 1979) until the best submodel is found. GCV is based on the idea of minimizing the average squared residuals of the model fit, given by

$$\begin{aligned} GCV(M)=\frac{1}{n}\frac{\sum _{i=1}^{n}\left( y_{i}-\hat{f}_{M}(\mathbf {x}_{i})\right) ^{2}}{\left( 1-P(M)^{*}/n\right) ^{2}}, \end{aligned}$$
(5)

where, \(y_{i}\) is the ith observed response value; \(\hat{f}_{M}(\mathbf {x}_{i})\) is the fitted response value obtained for the ith observed predictor vector, \(\mathbf {x}_{i}=(x_{i1},\ldots ,x_{ip})^{T},\, i=1,\ldots ,n\); \(n\) is the number of observations; and \(M\) represents the maximum number of BFs in the model.

In general, \(P(M)\) is calculated by

$$\begin{aligned} P(M)=\textit{trace}\left( \mathbf {B}(\mathbf {B}^{T}\mathbf {B})^{-1}\mathbf {B}^{T}\right) +1 \end{aligned}$$
(6)

and it represents the cost penalty measure of a model when there are \(M\) BFs in the model (Friedman 1991). In (6), \(\mathbf {B}\) denotes the \(n\times M\) matrix of BFs, with one column per BF.

Further, \(P(M)\) in (6) represents the effective number of parameters, which serves as a penalty measure for complexity. A modified form of \(P(M)\) is used in the current MARS algorithm, namely \(P(M)^{*}=P(M)+dM\), where \(M\) is the number of non-constant BFs in the MARS model. Note that, for an additive model, \(d\) is taken to be two, while it is taken to be three for a model with interactions (Friedman 1991; Hastie et al. 2001). If the value of \(P(M)\) is small, a large model including many BFs is built; otherwise, a smaller model is obtained. To obtain a simple model with small lack-of-fit, the model with minimum GCV is chosen.
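A minimal sketch of the GCV score in (5) with the modified cost \(P(M)^{*}\) of (6) is given below, assuming the basis matrix B contains an intercept column and that \(d=3\) (interactions allowed); it follows the formulas above rather than the ARESLab code.

```python
import numpy as np

def gcv(B, y, d=3):
    """GCV of Eqs. (5)-(6) for an n x (M+1) basis matrix B (intercept included)."""
    n = B.shape[0]
    M = B.shape[1] - 1                                  # non-constant BFs
    beta = np.linalg.lstsq(B, y, rcond=None)[0]         # least-squares fit
    rss = np.sum((y - B @ beta) ** 2)
    hat = B @ np.linalg.pinv(B.T @ B) @ B.T             # hat matrix B(B'B)^{-1}B'
    P = np.trace(hat) + 1                               # Eq. (6)
    P_star = P + d * M                                  # modified cost P(M)*
    return (rss / n) / (1.0 - P_star / n) ** 2          # Eq. (5)
```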

Figure 4b gives an example of a fitted model obtained after a backward step. As can be seen, the models obtained by the backward elimination step are smoother while keeping fidelity to the data.

Friedman (1991) provides valuable insights into the use of the GCV criterion for various types of MARS modeling. However, the criterion does not consider complexity in terms of the correlation among the model parameters.

3 ICOMP: a new information theoretic model selection criterion

In recent years, the statistical literature has placed more and more emphasis on information-based model selection and evaluation criteria. The necessity of model evaluation has been recognized as one of the important technical areas, and the problem is posed as the choice of the best approximating model among a class of competing models by a suitable model evaluation criterion given a data set. Several of the popular model selection criteria have their underpinnings in statistical information theory. They are based on the estimation of Kullback–Leibler information in high dimensions as a loss function (Kullback and Leibler 1951; Kullback 1968). The objective of information-based model selection criteria is to select a model that best incorporates the inference uncertainty (i.e., a measure of the lack-of-fit or badness-of-fit of the model) and the parametric uncertainty (i.e., a measure of model parsimony and complexity).

Recently, based on Akaike’s original AIC (Akaike 1973), many model-selection procedures which take the form of a penalized likelihood (a negative log likelihood plus a penalty term) have been proposed (Sclove 1987). For example, for AIC this form is given by

$$\begin{aligned} AIC(k)=-2\mathrm{log}L(\hat{\theta }_{k})+2k, \end{aligned}$$
(7)

where, \(L(\hat{\theta }_{k})\) is the maximized likelihood function, \(\hat{\theta }_{k}\) is the maximum likelihood estimate of the parameter vector \(\theta _{k}\), and \(k\) is the number of independent parameters estimated. The first term in (7), \(-2\mathrm{log}L(\hat{\theta }_{k})\) is a measure of lack of fit, and \(2k\) is the penalty term for the number of free parameters estimated in the model.

In AIC, a compromise takes place between the measure of lack-of-fit, and the number of parameters, which is considered as a measure of complexity that compensates for the bias in the lack-of-fit. The model with minimum AIC value is chosen as the best model to fit the data.

The use of AIC as a model selection criterion is popular because of its simplicity. However, it is well known that in complex modeling situations AIC overfits the model order. In response to this over-fitting phenomenon in model selection, Schwarz (1978) introduced a Bayesian model selection criterion, abbreviated as SBC, assuming the data are generated from an exponential family of distributions. Independently, Rissanen (1978) introduced his minimum description length (MDL) criterion, which takes the same form as SBC; both are defined as

$$\begin{aligned} MDL/SBC(k)=-2\mathrm{log}L(\hat{\theta }_{k})+k\mathrm{log}(n). \end{aligned}$$
(8)

Compared with AIC, the SBC in (8) increases the penalty for adding additional terms to the model by a factor of \((1/2)\mathrm{ln}(n)\). In general, the model with minimum SBC or MDL is chosen as the best model to fit the data.
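For concreteness, the following sketch evaluates (7) and (8) for a Gaussian regression fit on a basis matrix B; counting the error variance among the \(k\) estimated parameters is our convention, not a prescription from the paper.

```python
import numpy as np

def neg2loglik(B, y):
    """-2 log L for a Gaussian linear model: n*log(2*pi) + n*log(sigma2_hat) + n."""
    n = len(y)
    beta = np.linalg.lstsq(B, y, rcond=None)[0]
    sigma2 = np.sum((y - B @ beta) ** 2) / n
    return n * np.log(2 * np.pi) + n * np.log(sigma2) + n

def aic(B, y):
    k = B.shape[1] + 1                      # coefficients plus error variance
    return neg2loglik(B, y) + 2 * k         # Eq. (7)

def sbc(B, y):
    n, k = len(y), B.shape[1] + 1
    return neg2loglik(B, y) + k * np.log(n) # Eq. (8)
```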

The development of ICOMP has been motivated in part by AIC, and in part by information complexity concepts and indices. In contrast to AIC, the new ICOMP procedure is based on the structural complexity of an element or set of random vectors via a generalization of the information-based covariance complexity index of Van Emden (1971).

A rationale for ICOMP as a model selection criterion is that it combines a badness-of-fit term (such as minus twice the maximum log likelihood) with a measure of model complexity differently than AIC or its variants, by taking into account the interdependencies of the parameter estimates as well as the dependencies of the model residuals. The general form of ICOMP is based on the quantification of the concept of overall model complexity in terms of the estimated inverse Fisher information matrix (IFIM). This approach results in an approximation to the sum of two Kullback–Leibler distances (Kullback and Leibler 1951).

In contrast to AIC and SBC, ICOMP is designed to estimate a loss function given by Bozdogan (2004) as

$$\begin{aligned} \textit{Loss}=\textit{Lack}\, \textit{of}\, \textit{Fit}+\textit{Lack}\, \textit{of}\, \textit{Parsimony}+\textit{Profusion}\, \textit{of}\, \textit{Complexity} \end{aligned}$$
(9)

This is achieved by using the additivity property of information theory and the entropic developments of Rissanen (1976) in his final estimation criterion (FEC) for estimation and model identification problems, as well as AIC (Akaike 1973) and its analytical extensions in Bozdogan (1987). In the loss function in (9), by the third term, profusion of complexity, we mean the interdependencies or correlations among the parameter estimates and the random error term of a model.

We define the general form of ICOMP as

$$\begin{aligned} \textit{ICOMP}(k)=-2\mathrm{log}L(\hat{\theta }_{k})+2C\left( \hat{\Sigma }_{model}\right) , \end{aligned}$$
(10)

where \(L(\hat{\theta }_{k})\) is the maximized likelihood function, \(\hat{\theta }_{k}\) is the maximum likelihood estimate of the parameter vector \(\theta _{k}\), and \(C\) represents a real-valued complexity measure. In (10), \(\hat{\Sigma }_{model}=\hat{C}ov(\hat{\theta })\) represents the estimated covariance matrix of the parameter vector of the model. This covariance matrix can be estimated in several ways, one of which uses the celebrated Cramér–Rao lower bound (CRLB) matrix. The estimated inverse Fisher information matrix (IFIM) of the model is obtained from

$$\begin{aligned} \hat{\mathcal {F}}^{-1}=\left\{ -E\left( \frac{\partial ^{2}\mathrm{log}L(\theta )}{\partial \theta \partial \theta ^{'}}\right) _{\hat{\theta }}\right\} ^{-1}. \end{aligned}$$
(11)

In (11), the expression in brackets is the matrix of second partial derivatives of the log-likelihood function of the fitted model, evaluated at the maximum likelihood estimates. For more on the IFIM, we refer the readers to Cramér (1946) and Rao (1945, 1947, 1948). The estimated IFIM provides an inherent measure of uncertainty, or a precise measure of the accuracy, of the parameters estimated from the available data. The diagonal elements of the IFIM contain the estimated variances of the estimated parameters, while the corresponding off-diagonals contain their covariances. Thus, ICOMP with the IFIM provides a universal criterion that takes into account the entire parameter space of the model.

There are several forms and justifications of ICOMP based on (10) discussed in Bozdogan (1988, 1990, 2000, 2004, 2010) and Bozdogan and Bearse (1998). Here, we present only two of the general forms of ICOMP to be used in MARS modeling and show their derived analytical forms in the next section.

3.1 ICOMP based on estimated inverse Fisher information matrix (IFIM)

For a multivariate normal linear or nonlinear structural model, based on IFIM, ICOMP in (10) is defined as

$$\begin{aligned} \textit{ICOMP}(\textit{IFIM})=-2\mathrm{log}L(\hat{\theta }_{k})+2C_{1}\left( \hat{\mathcal {F}}^{-1}(\hat{\theta }_{k})\right) , \end{aligned}$$
(12)

where, \(C_{1}\) denotes the maximal entropic complexity of the estimated IFIM, given by

$$\begin{aligned} C_{1}(\hat{\mathcal {F}}^{-1})=\frac{s}{2}\mathrm{log}\left[ \frac{tr(\hat{\mathcal {F}}^{-1})}{s}\right] -\frac{1}{2}\mathrm{log}|\hat{\mathcal {F}}^{-1}|. \end{aligned}$$
(13)

In (13), \(s\) refers to the dimension or the rank of \(\hat{\mathcal {F}}^{-1}\).
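A direct transcription of (13) is sketched below; it takes any estimated covariance (or inverse-Fisher information) matrix and returns its \(C_{1}\) complexity. The function name is ours.

```python
import numpy as np

def c1_complexity(cov):
    """Maximal entropic complexity C1 of Eq. (13) for a covariance/IFIM matrix."""
    s = np.linalg.matrix_rank(cov)              # dimension or rank of the matrix
    mean_eig = np.trace(cov) / s                # arithmetic mean of eigenvalues
    _, logdet = np.linalg.slogdet(cov)          # log determinant
    return 0.5 * s * np.log(mean_eig) - 0.5 * logdet
```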

After some work, for a MARS model under the assumption that the random noise is normally distributed, the estimated IFIM is obtained as

$$\begin{aligned} \hat{C}ov(\hat{\beta },\hat{\sigma }^{2})=\hat{\mathcal {F}}^{-1}=\left[ \begin{array}{c@{\quad }c} \hat{\sigma }^{2}(B'B)^{-1} &{} \quad 0\\ 0 &{} \quad \frac{2\hat{\sigma }^{4}}{n} \end{array}\right] , \end{aligned}$$
(14)

where

$$\begin{aligned} \hat{\beta }=(B'B)^{-1}B'y \quad \text {and} \quad \hat{\sigma }^{2}=\frac{(y-B\hat{\beta })'(y-B\hat{\beta })}{n}. \end{aligned}$$

Using the definition in (12), ICOMP(IFIM) becomes

$$\begin{aligned} \textit{ICOMP}(\textit{IFIM})=n\mathrm{ln}(2\pi )+n\mathrm{ln}(\hat{\sigma }^{2})+n+2C_{1}(\hat{\mathcal {F}}^{-1}(\hat{\theta }_{M})), \end{aligned}$$
(15)

where, the \(C_{1}\) complexity is given by

$$\begin{aligned} C_{1}(\hat{\mathcal {F}}^{-1}(\hat{\theta }_{M}))=(M+1)\mathrm{log}\left[ \frac{tr\hat{\sigma }^{2}(B'B)^{-1}+\frac{2\hat{\sigma }^{4}}{n}}{M+1}\right] -\mathrm{log}|\hat{\sigma }^{2}(B'B)^{-1}|-\mathrm{log}\left( \frac{2\hat{\sigma }^{4}}{n}\right) . \end{aligned}$$
(16)

In (14), as the number of free parameters increases (i.e., as the size of \(B\) increases), the error variance \(\hat{\sigma }^{2}\) gets smaller even though the complexity gets larger. Also, as \(\hat{\sigma }^{2}\) increases, \((B'B)^{-1}\) decreases. Therefore, the use of \(C_{1}(\mathcal {\hat{F}}^{-1}(\hat{\theta }_{M}))\) in information-theoretic model evaluation criteria achieves a trade-off between these two extremes and guards against the presence of multicollinearity. In ICOMP(IFIM), complexity is defined as a measure of the interaction or dependency between the components of the model. Hence, ICOMP(IFIM) provides a more judicious penalty term than AIC and SBC, and chooses simple models that provide more accurate and efficient parameter estimates over more complex models.
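The sketch below assembles the block-diagonal IFIM of (14) and scores a MARS basis matrix with (15), reusing the c1_complexity helper from the previous sketch; it is an illustrative reading of the formulas, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import block_diag

def icomp_ifim(B, y):
    """ICOMP(IFIM) of Eq. (15) for a basis matrix B (n x (M+1), intercept included).
    Assumes c1_complexity() from the previous sketch is available."""
    n = len(y)
    beta = np.linalg.solve(B.T @ B, B.T @ y)              # least-squares estimate
    sigma2 = np.sum((y - B @ beta) ** 2) / n
    ifim = block_diag(sigma2 * np.linalg.inv(B.T @ B),    # Cov(beta_hat), Eq. (14)
                      np.array([[2 * sigma2 ** 2 / n]]))  # Var(sigma2_hat)
    neg2ll = n * np.log(2 * np.pi) + n * np.log(sigma2) + n
    return neg2ll + 2 * c1_complexity(ifim)
```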

3.2 ICOMP as an estimate of posterior expected utility

By introducing utility functions for both the lack-of-fit component of the model and the complexity of the parameter space of the model, a new class of ICOMP(IFIM) criteria is developed as a Bayesian criterion maximizing a posterior expected utility (PEU) (Bozdogan and Haughton 1998; Bozdogan 2010). The idea of using two utility functions \(U_{1}\) and \(U_{2}\) that are multiplied to define a utility \(U\) whose posterior expectation is maximized to select a model is also considered by Poskitt (1987) and others. Poskitt defines \(\mathrm{log}(U_{1})=KL\), where KL is the Kullback–Leibler information, which is considered as a utility function by many authors. See Chaloner and Verdinelli (1995) for more details.

In this paper, a version of ICOMP(IFIM) derived from the multiplication of the utility \(U_{1}\) by a utility \(U_{2}\) equal to

$$\begin{aligned} U_{2}=\mathrm{exp}\left[ -\frac{k}{2}\mathrm{log}(n)-C_{1}(\hat{\mathcal {F}}^{-1})\right] \end{aligned}$$
(17)

is used. With the choice of these utility functions, a more consistent ICOMP(IFIM) criterion, whose formulation is given in (18), is proposed and used in MARS modeling. This criterion provides a severe penalization for overparametrization. Thus, the simplest MARS models are chosen whenever there is nothing to be lost by doing so. This is crucial for determining the nonlinear relationship between input and output variables with high multicollinearity.

$$\begin{aligned} \textit{ICOMP}(\textit{IFIM})_{PEU}=-2\mathrm{log}L(\hat{\theta }_{k})+k(1+\mathrm{log}(n))+2C_{1}\left( \hat{\mathcal {F}}^{-1}(\hat{\theta }_{k})\right) . \end{aligned}$$
(18)
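Continuing the previous sketch, the PEU form in (18) simply adds the penalty term \(k(1+\mathrm{log}(n))\) to ICOMP(IFIM); counting \(k\) as the regression coefficients plus the error variance is our assumption.

```python
import numpy as np

def icomp_ifim_peu(B, y):
    """ICOMP(IFIM)_PEU of Eq. (18); assumes icomp_ifim() from the previous sketch."""
    n = len(y)
    k = B.shape[1] + 1                      # regression coefficients plus error variance
    return icomp_ifim(B, y) + k * (1 + np.log(n))
```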

3.3 Robust covariance estimation

In regression modeling, the covariance matrices of the parameter estimates can often be ill-conditioned; that is, near-singularity occurs and the condition number becomes very large. This is especially the case under high multicollinearity between predictors, which is usually unavoidable in MARS modeling. To remedy such singular solutions, as an alternative to Tikhonov regularization (Taylan et al. 2010), we propose a new regularization of the covariance matrix of the parameter estimates in MARS modeling by adjusting the eigenvalues of the estimated covariance matrix \(\hat{\Sigma }\), given by

$$\begin{aligned} \Sigma ^{*}=\hat{\Sigma }+\alpha I_{p}, \end{aligned}$$
(19)

where, \(I_{p}\) is the \(p\) dimensional identity matrix. This is often called the “naive” ridge regularization.

Usually, the ridge parameter \(\alpha \) is chosen to be very small. For different ridge parameters, many smoothed (or robust) covariance estimators have been developed as a way to data-adaptively improve an ill-conditioned and/or singular covariance matrix in MARS. Several of these smoothed covariance estimators perturb the diagonals, and hence the eigenvalues, enough to achieve a well-conditioned covariance matrix. In this study, we propose to use the Maximum Likelihood/Empirical Bayes (MLE/EB) estimator given in (20), the Stipulated Ridge (SRE) estimator (Shurygin 1983) given in (21), and the Thomaz stabilization method (Thomaz 2004) given in (23).

  • MLE/EB:

    $$\begin{aligned} \hat{\Sigma }_{MLE/EB}=\hat{\Sigma }+\frac{p-1}{ntr(\hat{\Sigma })}I_{p}, \end{aligned}$$
    (20)
  • SRE:

    $$\begin{aligned} \hat{\Sigma }_{SRE}=\hat{\Sigma }+\frac{p(p-1)}{2n\, tr(\hat{\Sigma })}I_{p} \end{aligned}$$
    (21)
  • Thomaz Stabilization:

    $$\begin{aligned} \hat{\Sigma }_{Thomaz}=V\Lambda ^{*}V^{T}, \end{aligned}$$
    (22)

where

    $$\begin{aligned} \Lambda ^{*}=\left[ \begin{array}{c@{\quad }c@{\quad }c@{\quad }c} \mathrm{max}(\lambda _{1},\bar{\lambda }) &{}\quad 0 &{}\quad \cdots &{}\quad 0\\ 0 &{}\quad \mathrm{max}(\lambda _{2},\bar{\lambda }) &{}\quad \cdots &{}\quad 0\\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ 0 &{}\quad 0 &{}\quad \cdots &{}\quad \mathrm{max}(\lambda _{p},\bar{\lambda }) \end{array}\right] \end{aligned}$$
    (23)

where \(\lambda _{i}\) is the ith eigenvalue, \(\bar{\lambda }\) is the arithmetic mean of the eigenvalues, and \(V\) is the matrix of eigenvectors.
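The three estimators in (20)-(23) can be sketched as follows, with Sigma the estimated \(p\times p\) covariance matrix and n the sample size; this is an illustrative transcription of the formulas above.

```python
import numpy as np

def mle_eb(Sigma, n):
    """MLE/EB smoothed estimator, Eq. (20)."""
    p = Sigma.shape[0]
    return Sigma + (p - 1) / (n * np.trace(Sigma)) * np.eye(p)

def sre(Sigma, n):
    """Stipulated ridge estimator, Eq. (21)."""
    p = Sigma.shape[0]
    return Sigma + p * (p - 1) / (2 * n * np.trace(Sigma)) * np.eye(p)

def thomaz(Sigma):
    """Thomaz eigenvalue stabilization, Eqs. (22)-(23)."""
    lam, V = np.linalg.eigh(Sigma)              # eigenvalues and eigenvectors
    lam_star = np.maximum(lam, lam.mean())      # raise small eigenvalues to the mean
    return V @ np.diag(lam_star) @ V.T
```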

4 Numerical examples

In this section, the ICOMP(IFIM)PEU criterion given in (18) is implemented for MARS modeling, and its performance in best subset variable selection is compared with that of the GCV, AIC and SBC criteria using Monte Carlo simulations. We also show our results on a real dataset to predict body fat in obesity studies. As mentioned before, the model selection criterion has an important effect on the selection of proper knots and of the variables that influence the response. In Fig. 5, we illustrate how different models can be fitted by using different model selection criteria. Since the selected number and locations of knots differ over the same variables, different forms of MARS models may be obtained for the same underlying dataset. Because of this fact, MARS modeling is studied here under the model selection framework. Note that the model selection criteria studied in this paper are implemented in the MARS algorithm using the ARESLab toolbox (Jekabsons 2011), written entirely in the MATLAB\(^{(R)}\) (2010) environment. This toolbox uses the main functionality of the MARS technique described by Friedman (1991).

Fig. 5 MARS fits obtained by different model selection criteria

To carry out subset selection of variables, two Monte Carlo simulation protocols are implemented. The first protocol includes a nonlinear functional form between predictors and response, while the second involves a collinear variable structure. MARS models are built for 100 different datasets generated using the same function in each protocol. To provide some insight regarding the importance of the variables as predictors of the dependent variable, and to see whether the true model can be selected or not, the final MARS model fit is analyzed by the ANOVA decomposition form given in (24).

$$\begin{aligned}&\hat{f}(\mathbf {x})=\beta _{0}+\sum _{k_{m}=1}\beta _{m}B_{m}(x_{i})+\sum _{k_{m}=2}\beta _{m}B_{m}(x_{i},x_{j})\nonumber \\&\quad +\sum _{k_{m}=3}\beta _{m}B_{m}(x_{i},x_{j},x_{k})+\cdots \end{aligned}$$
(24)

MARS refits the model after removing all terms involving the variable to be assessed and calculates the reduction in goodness of fit. All variables are then ranked according to their impact on goodness of fit. By the ANOVA decomposition in (24), it is possible to identify which variables enter the model, whether they are purely additive, or whether they are involved in interactions with other variables. In (24), the first sum of the ANOVA decomposition represents only the main effects, while the second and third sums reflect the two-way and three-way interactions, respectively. The remaining terms denote four-way and higher-order interactions. For each MARS model, the resulting ANOVA decomposition is examined to see whether the correct subset model can be selected or not. Furthermore, the prediction and accuracy performances of the final models selected by each criterion are evaluated using measures such as the mean squared error (MSE), the residual sum of squares, the number of terms in the model, and the multiple coefficient of determination (\(R^{2}\)).

4.1 Monte Carlo simulation: example 1

In our first Monte Carlo simulation study, the performance of the ICOMP(IFIM)PEU criterion is demonstrated on a simulated dataset using a nonlinear function given in Friedman (1991). We start by creating datasets using a ten-dimensional function with Gaussian noise. The predictors are sampled uniformly from the 10-dimensional unit hypercube (\(x_{i}=rand(0,1),\, i=1,\ldots ,10\)).

$$\begin{aligned} y=10sin(\pi x_{1}x_{2})+20(x_{3}-0.5)^{2}+10x_{4}+5x_{5}+0.5\varepsilon , \end{aligned}$$
(25)

where, \(\varepsilon \sim N(0,1)\), the standard normal distribution.

Note that, while the first three variables enter the function in (25) nonlinearly, the next two are linear in the output, and the last five variables have no effect on the response \(y\). Therefore, the true model includes the predictors \(x_{1}\), \(x_{2},\) \(x_{3}\), \(x_{4}\) and \(x_{5}\).
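A sketch of this data-generating protocol, Eq. (25), is given below; the function name and the seed argument (used only for reproducibility) are ours.

```python
import numpy as np

def simulate_friedman(n, seed=None):
    """Generate (X, y) from Eq. (25): ten uniform predictors, of which only
    x1-x5 influence the response."""
    rng = np.random.default_rng(seed)
    X = rng.random((n, 10))                          # 10-dimensional unit hypercube
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4]
         + 0.5 * rng.standard_normal(n))             # 0.5 * epsilon, epsilon ~ N(0,1)
    return X, y

X_train, y_train = simulate_friedman(200, seed=1)
```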

For this simulation protocol, the maximal number of basis functions (BFs) is set to 21, including the intercept term, and the maximum interaction level is limited to 2; that is, only pairwise products of BFs are allowed. The model is of piecewise linear type. The MARS algorithm is applied under these specifications using GCV, AIC, SBC and ICOMP(IFIM)PEU. An example of a MARS model obtained through the GCV criterion for a dataset generated with 200 observations is illustrated in Table 1.

Table 1 MARS equation

The final MARS model in Table 1 includes 15 BFs, comprising the main effects of \(\{x_{1},x_{2},x_{3},x_{4},x_{5}\}\) and the interaction terms \(x_{1}x_{2}\) and \(x_{1}x_{3}\). The ANOVA decomposition of the corresponding MARS model is given in Table 2. As described in the paper of Friedman (1991), the first column lists the function number. Each function is a sum over all basis functions involving only the predictors in the last column. The second column gives the standard deviation of the function. This gives an indication of the (relative) importance of the corresponding function to the overall model and can be interpreted in a manner similar to a standardized regression coefficient in a linear regression model. The third column lists the GCV score for a model obtained by removing all of the basis functions corresponding to that particular ANOVA function. Hence, one can judge whether this ANOVA function makes an important contribution to the model, or whether it only slightly improves the global GCV score. The fourth column gives the number of BFs comprising the ANOVA function, while the fifth column provides an estimate of the additional number of linear degrees of freedom. The last column gives the particular predictor variables associated with the ANOVA function.

Table 2 ANOVA decomposition of MARS model selected with GCV criterion

In this example, the MARS model in Table 1 selects \(\{x_{1},x_{2},x_{3},x_{4},x_{5},x_{1}x_{2},x_{1}x_{3}\}\) as the best subset model using the GCV criterion. The first two ANOVA functions (corresponding to \(x_{1}\) and \(x_{2}\)) give the largest contributions to the model, as does the interaction between \(x_{1}\) and \(x_{2}\). The last ANOVA function, which corresponds to the BFs built on the predictors \(x_{1}\) and \(x_{3}\), gives a very small contribution. Hence, this term may be removed from the model.

MARS models are built using ICOMP(IFIM)PEU, AIC, SBC and GCV for 100 replications of the above simulation protocol with different sample sizes. The performances of the model selection criteria in selecting the best subset of predictors are analyzed in terms of percentage hits over the 100 trials through ANOVA tables such as Table 2. In Table 3, the percent number of hits is given for three types of models (column 2). The first refers to models including exactly the true predictors \(\{x_{1},x_{2},x_{3},x_{4},x_{5}\}\). The second refers to models in which the relative importance of the true predictors is more than 90 %, and the last describes models for which at least one of the true predictors is not selected by the criterion. Note that the percent contribution is calculated from the standard deviations of the functions given in the ANOVA decomposition shown in Table 2. Based on the results in Table 3, the following conclusions can be drawn:

  • For the small sample size, \(n=50\), the GCV criterion selects the true model with the highest frequency and with fewer BFs. However, the rate at which GCV selects exactly the true model does not improve dramatically as the sample size increases. For \(n\ge 100\), all criteria are able to select models including all of the true predictors in about 100 % of the trials. In other words, the percent hit of models for which at least one of the true predictors is not selected is zero. The highest percent hit ratio of models with exactly the true predictors is achieved by ICOMP(IFIM)PEU for \(n\ge 100\).

  • When the relative importance of the true predictors is examined, it is observed that the percent hit of models dominated by the true predictors and interaction terms is at least 96 % for all criteria when \(n\ge 100\).

  • For the large sample size, \(n=500\), all criteria are able to select the true model including the predictors and interaction terms, with 100 % for GCV and ICOMP(IFIM)PEU, and 99 % for AIC and SBC.

  • For \(n\ge 100\), the simplest models, the ones with minimum average number of BFs, are selected by ICOMP(IFIM)PEU and SBC criteria.

Table 3 % model hits out of 100 simulations

In order to evaluate and compare the prediction (generalization) ability of the MARS models, the models are analyzed over 100 simulated training and test datasets, generated with n = 100 and n = 20 observations, respectively, through the MSE and \(R^{2}\) measures. The corresponding results are shown in Table 4. Although the average values of the performance measures are close to each other for all criteria, the models selected by AIC show better performance both on the training and on the test datasets. However, the highest numbers of BFs in the final models belong to the AIC models. The simplest models, those including fewer BFs, are selected by the ICOMP(IFIM)PEU and GCV criteria.

Table 4 Comparison of MARS models with respect to model selection criteria for n = 100

4.2 Monte Carlo simulation: example 2

In this Monte Carlo simulation example, a different simulation protocol (Bozdogan 2004) is used, in which highly collinear input variables are generated. The first five variables are simulated using the following protocol:

$$\begin{aligned} \begin{array}{l} x_{1}=10+\varepsilon _{1}\\ x_{2}=10+0.3\varepsilon _{1}+\alpha \varepsilon _{2}\\ x_{3}=10+0.3\varepsilon _{1}+0.5604\alpha \varepsilon _{2}+0.8282\alpha \varepsilon _{3}\\ x_{4}=-8+x_{1}+0.5x_{2}+0.3x_{3}+0.5\varepsilon _{4}\\ x_{5}=-5+0.5x_{1}+x_{2}+0.5\varepsilon _{5}\\ \end{array} \end{aligned}$$

where, \(\varepsilon _{1},\varepsilon _{2},\varepsilon _{3},\varepsilon _{4},\varepsilon _{5}\sim N(0,\sigma ^{2}=1)\). The parameter \(\alpha =\sqrt{1-0.3^{2}}\) controls the degree of collinearity in the predictors. Then, the response variable \(y\) is generated from:

$$\begin{aligned} y=-8+x_{1}+0.5x_{2}+0.3x_{3}+0.5\varepsilon , \end{aligned}$$
(26)

where, \(\varepsilon \) is independent and identically distributed (i.i.d.) according to \(N(0,\sigma ^{2}=1)\) for \(i=1,2,\ldots ,n\).

Further, five redundant variables \(x_{6},\ldots ,x_{10}\) are generated using uniform random numbers, \(x_{6}=6\times rand(0,1),\ldots ,x_{10}=10\times rand(0,1)\), and a MARS model of \(y\) on \(X=\{x_{0},x_{1},x_{2},x_{3},x_{4},x_{5},x_{6},x_{7},x_{8},x_{9},x_{10}\}\) is fitted for the sample sizes n = 50, 100, 200, 500 and 1,000, where \(x_{0}\) is the constant \((n\times 1)\) column vector of ones.
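The sketch below reproduces this protocol under our reading of it; in particular, the \(\varepsilon _{4}\) term in \(x_{4}\) is our interpretation of the protocol, and the redundant variables follow the scaling described above.

```python
import numpy as np

def simulate_collinear(n, seed=None):
    """Generate the collinear protocol of Sect. 4.2 and the response of Eq. (26)."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal((n, 5))                 # epsilon_1, ..., epsilon_5
    a = np.sqrt(1 - 0.3 ** 2)                       # alpha controls collinearity
    x1 = 10 + e[:, 0]
    x2 = 10 + 0.3 * e[:, 0] + a * e[:, 1]
    x3 = 10 + 0.3 * e[:, 0] + 0.5604 * a * e[:, 1] + 0.8282 * a * e[:, 2]
    x4 = -8 + x1 + 0.5 * x2 + 0.3 * x3 + 0.5 * e[:, 3]
    x5 = -5 + 0.5 * x1 + x2 + 0.5 * e[:, 4]
    redundant = [(j + 6) * rng.random(n) for j in range(5)]   # x6 = 6*rand, ..., x10 = 10*rand
    X = np.column_stack([x1, x2, x3, x4, x5] + redundant)
    y = -8 + x1 + 0.5 * x2 + 0.3 * x3 + 0.5 * rng.standard_normal(n)   # Eq. (26)
    return X, y
```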

In this simulation study, it is expected that model selection criteria would pick the set of variables \(\{x_{0},x_{1},x_{2},x_{3}\}\) to be the best subset through the MARS algorithm. Using the ANOVA decomposition analysis, model selection performances of criteria are analyzed by examining the predictors selected in the final model and their corresponding relative importance within the model.

In Table 5, the percentages of simulations in which the criteria select three types of models over 100 trials are given. The first row for each criterion refers to models including exactly the true predictors \(\{x_{0},x_{1},x_{2},x_{3}\}\). The second refers to models in which the relative importance of the true predictors is more than 90 %, and the last describes models in which at least one of the true predictors is not selected by the criterion. Based on the results presented in Table 5, the following conclusions can be drawn:

  • For small sample sizes, ICOMP(IFIM)PEU performs better in selecting the true set of predictors than the others, although it misses the exact true subset in 44 trials. AIC, in contrast, selects models that do not include all of the true predictors in 16 % of the simulations. This may be due to the fact that the models built by ICOMP(IFIM)PEU include fewer BFs than the models of AIC and the other criteria.

    Table 5 % model hits out of 100 simulations
  • As the sample size increases, the percent hit of models including exactly the true predictors improves for ICOMP(IFIM)PEU, rising dramatically to 84 % for n = 500. However, the other criteria, especially GCV and AIC, do not show such an improvement in selecting exactly the true model. In this respect, GCV performs much better than AIC and SBC for small sample sizes. However, as the sample size increases, it tends to pick models with extra predictor variables, as is always the case for AIC. AIC cannot select the exact true model for any sample size. The performance of SBC in selecting exactly the true model improves slightly, rising to 50 % for n = 1,000.

  • GCV and AIC show a higher tendency to pick models including extra variables other than the true predictors. This result can be validated by examining the percent hits in picking models in which the true predictors have more than 90 % contribution. While the exact true model is selected by GCV in only 12 % of the trials, the rate of selecting models that are mainly dominated by the true predictors becomes 96 % for n = 1,000. Over all simulations, SBC selects models including 8.6 BFs on average. This may indicate that there is no substantial tendency for SBC to overfit. It is difficult to draw the same conclusion for AIC. All models selected by AIC include the true predictors (e.g., \(\{x_{0},x_{1},x_{2},x_{3}\}\)). For 31 trials, the contributions of the true predictors are even less than 90 %, which may indicate overparameterization. This conclusion is also supported by the excessive number of BFs in the final models.

  • ICOMP(IFIM)PEU also performs very well in picking models in which the true predictors have more than 90 % contribution. Even if some extra variables are selected by ICOMP(IFIM)PEU, their contributions to the model are less than 10 % in at least 86 % of the hits for \(n\ge 200\).

The estimation and prediction performances of the MARS models are also evaluated for each criterion with respect to MSE and \(R^{2}\). Table 6 shows that the models selected by the AIC criterion perform better on the training datasets than the other models with respect to all performance measures. However, the prediction performance of the corresponding models on new datasets is not as good as that of the ICOMP(IFIM)PEU models, even though the latter include fewer BFs. This may be because of the overfitting caused by the excessive number of BFs in the AIC models.

Table 6 Comparison of MARS models selected by the corresponding model selection criteria for n=200

4.3 A real benchmark example: prediction of body fat in obesity studies

The practical utility and importance of the new class of model selection criteria proposed for MARS modeling are demonstrated on a real body fat dataset to determine a predictive model in obesity studies. This data is composed of body measurement observations from n = 252 men. There are 13 regressors, listed in Table 7. A method for accurately computing the percent body fat from simple body measurements, without requiring underwater weighing, is highly desirable (Bozdogan and Howe 2012). Before the construction of the models, the variables are normalized to make them comparable, and the MARS algorithm is run with a maximum of 30 BFs and no interaction terms.

Table 7 List of variables for body fat data

The equation of the MARS model built by the ICOMP(IFIM)PEU criterion is given in Table 8. It is noted that, during the implementation of the ICOMP criterion for the body fat data, we encountered the problem of ill-conditioned covariance matrices. The high condition numbers obtained for the covariance matrices may indicate the existence of high correlation between the selected BFs. This situation may affect the model selection performance of the criteria. In order to make the ICOMP criterion reliable, the methods proposed in Sect. 3 were applied to the model covariance matrices. Since the method that was able to decrease the condition numbers was the Thomaz regularization given in (23), this regularization method is applied for the body fat dataset.

Table 8 MARS model with ICOMP(IFIM)PEU

Minimizing ICOMP(IFIM)PEU leads to a best model that includes the weight, abdomen 2, hip, ankle, extended biceps and wrist circumference variables. The ANOVA decomposition of the corresponding MARS model, given in Table 9, shows that weight and abdomen 2 circumference are the important predictors with the highest contributions.

Table 9 ANOVA decomposition of MARS model selected with ICOMP(IFIM)PEU criterion

The MARS model fitted using the GCV criterion is given in Table 10. The final model includes 16 BFs with the constant term. According to the ANOVA decomposition in Table 11, although the model includes 8 different variables, the highest contributions under GCV are again supplied by weight and abdomen 2 circumference.

Table 10 MARS model with GCV
Table 11 ANOVA decomposition of MARS model selected with GCV criterion

When the MARS algorithm is run with AIC, the fitted model in Table 12 is obtained. The model includes 21 BFs including the constant term, and the variables selected into the model are similar to those selected by GCV. However, the model selected by AIC is larger than the model selected by GCV. The main contribution is again supplied by two variables: weight and abdomen 2 circumference (see Table 13).

Table 12 MARS model with AIC
Table 13 ANOVA decomposition of MARS model selected with AIC criterion

Finally, the MARS algorithm is applied to the body fat data using SBC. The fitted model in Table 14 includes only 8 BFs involving 6 variables: weight, abdomen 2, hip, ankle, extended biceps and wrist circumferences. Again, weight and abdomen 2 circumference are the most significant variables in the model (see Table 15).

Table 14 MARS model with SBC
Table 15 ANOVA decomposition of MARS model selected with SBC criterion

Overall, weight and abdomen 2 circumference are selected as the important predictors with the highest contributions by all criteria. As is always the case, the model selected with AIC includes more BFs than the models selected by the other model selection criteria. ICOMP(IFIM)PEU and SBC capture the contribution of the important variables with fewer BFs.

Table 16 gives the performances of the MARS models obtained from a 10-fold cross-validation study. The models selected with the different criteria show similar performances both on the training and on the test datasets. Although the AIC models perform better on the training data, their prediction performance is not as good as that of the others. This is due to the fact that the model selected by AIC includes an excessive number of BFs, which causes overfitting. On the other hand, the simplest models are selected by the ICOMP(IFIM)PEU criterion, and the corresponding models have better prediction performance on new datasets.

Table 16 Estimation and prediction performance of MARS models with the corresponding model selection criteria

5 Conclusion and discussion

In multivariate adaptive regression splines (MARS), the comparison of submodels during the backward elimination step plays a crucial role in the estimation of the nonlinear relationship between predictors and output. By minimizing a model selection criterion, both the accuracy and the complexity of the models can be controlled in each backward iteration. Most of the criteria in the literature, such as AIC, BIC and GCV, consider complexity only as the number of free parameters within a model, and determine the model dimension with an additional penalty term that is a cost function of the number of free parameters in the model. In this paper, however, a new information-based model selection criterion, ICOMP, is proposed for use in MARS, which also accounts for the interdependency of the parameter estimates in the model complexity. ICOMP selects the best number of breaking points and the corresponding basis functions in MARS by taking into account the interaction or dependency between the components, as well as the lack of model fit and model parsimony.

In this paper, the model selection performance of ICOMP is evaluated and compared with AIC, SBC, and GCV using two Monte Carlo simulation protocols and a real body fat dataset. From the results of the first simulation protocol, over 100 simulated datasets, it is observed that the capability of ICOMP(IFIM)PEU for selecting models with exactly the true predictors is higher than that of the other criteria. For small sample sizes, GCV selects the true model with the highest frequency of all the criteria. However, the rate at which GCV selects exactly the true model does not improve dramatically as the sample size increases. For large sample sizes, all criteria are able to select the true model, including the main effects and the interaction between true predictors, in about 100 % of the trials.

With the second Monte Carlo simulation protocol, the model selection performances of the criteria are evaluated for datasets with a highly multicollinear structure. The results show that ICOMP(IFIM)PEU performs better in selecting the true set of predictors than the others for small sample sizes. As the sample size increases, the rate at which ICOMP(IFIM)PEU selects exactly the true model improves dramatically. AIC never selects the exact true model for any sample size. Like GCV, AIC shows a higher tendency to pick models including extra variables besides the true predictors. This conclusion is also supported by the excessive number of BFs in the final models.

Overall, the ICOMP(IFIM)PEU criterion can be used as a powerful criterion for the submodel selection of the MARS algorithm due to its better performance in selecting true models with fewer BFs and with high generalization capability.

The existing forward selection and backward elimination procedures of MARS are computationally expensive and do not guarantee a globally optimal solution. As observed in the simulation studies, MARS selects many redundant terms into the model. Although this can be prevented by a model selection criterion to some extent, it is not always possible to select the correct model due to the stepwise nature of MARS. In this respect, we will try to develop a data-adaptive “open architecture” for model building via the intelligent Genetic Algorithm (GA) as our optimizer, along with the ICOMP criterion. In a future study, we shall develop and score the misspecified forms of the ICOMP criteria given in Bozdogan (2004).