Goodness of Fit in Nonparametric Regression Modelling

Abstract

When a model is fitted by nonparametric regression, how to judge the goodness of fit of the fitted model is the issue addressed in this paper. A goodness of fit statistic is proposed, and its statistical properties, in terms of its asymptotic distribution, are derived and studied.

Introduction

Statistics and data science have an important role in finding relationships between a response and predictors, and various methodologies have been developed in statistics to find such relationships. Regression analysis is one of the most important and popular tools among researchers and practitioners for deriving statistical models; it aims to derive a relationship between the response and the predictors and has been developed from various perspectives, e.g. linear regression, nonlinear regression, parametric regression, nonparametric regression, Bayesian regression, etc. Parametric regression methods depend upon the form of the model and involve parameters which determine the shape of the regression function. In many situations, it is difficult to determine the shape of the regression function, and nonparametric regression procedures are free from such constraints. In the parametric case, finding the regression model is equivalent to finding estimates of the involved parameters. Although each type of regression model has its advantages and limitations, researchers have devoted their energy to finding good estimation procedures that yield better models. A practitioner, however, will be more interested in knowing the performance of the model so that it can be used in further applications such as forecasting values of the response variable, and would like to know the performance of the model obtained by any estimator in quantitative terms.

The coefficient of determination, popularly known as R-Square (\(R^2\)), measures the model performance quantitatively in the multiple linear regression model. It does so by measuring the variation in the response variable that is explained by the fitted model through its relationship with the predictors, and it determines the degree of linear relationship, in terms of the multiple correlation coefficient between the response and the predictors, based on the sample observations and the ordinary least squares estimator of the regression parameters. Despite several limitations, it is still a popular measure of goodness of fit among practitioners, see Hahn [11]. Nonparametric regression procedures occupy an important place in applications and are more versatile from the application point of view. However, how to measure the goodness of fit in nonparametric regression models is an area which has not received much attention from researchers. This paper is a modest effort in this direction, and a measure of goodness of fit is proposed for judging the degree of fit in nonparametric regression modelling.

The coefficient of determination in the multiple linear regression model can be viewed as a consistent estimator of the population multiple correlation coefficient between the response and the explanatory variables. So the coefficient of determination is itself a statistic, and its properties have been studied to judge its performance. A systematic study of the properties of the coefficient of determination and its adjusted version under normality of the disturbances is given in Crämer [5]. The coefficient of determination is usable only in the multiple linear regression model when the model fitting is obtained by the ordinary least squares estimator of the regression coefficients. Such a coefficient of determination has been used, and suitable forms of it have been explored, for a variety of conditions. Eshima and Tabata [6, 7] developed a version of the coefficient of determination for generalized linear models in entropy form, whereas Renaud and Victoria-Feser [28] developed a robust coefficient of determination in the regression setup. The coefficient of determination for the logistic regression model has been studied in Tjur [33], Hong et al. [14], and Liao and McGee [18]. The coefficient of determination in local polynomial models and mixed regression models has been developed and studied in Huang and Chen [16] and Hössjer [15], respectively. Cheng et al. [2] and Cheng et al. [3] proposed goodness of fit measures in measurement error models and restricted measurement error models, respectively, and Cheng et al. [4] proposed a goodness of fit measure for models obtained through shrinkage estimation. A variety of roles of the coefficient of determination in the linear regression model are illustrated in van der Linde and Tutz [35], Srivastava and Shobhit [30], Lipsitz et al. [19], and Nagelkerke [24]. Further, Knight [17] and Hilliard and Lloyd [13] discussed the coefficient of determination in simultaneous equation models. Marchand [20, 21] addressed the point estimation of the coefficient of determination, and Ohtani [26] considered the setup of a misspecified linear regression model and derived the density of the coefficient of determination and its adjusted version along with their risks under an asymmetric loss function. Considering arbitrary generalized least squares estimation, Tanaka and Huba [32] found a general coefficient of determination for covariance structure models, whereas McKean and Sievers [22] developed a version of the coefficient of determination under least absolute deviation analysis. The partial correlation coefficient and coefficient of determination for multivariate normal repeated measures data are discussed in Lipsitz et al. [19]. Ohtani and Hasegawa [27] derived and analyzed the properties of the coefficient of determination and its adjusted version in the presence of specification error under a multivariate t-distribution of the random errors. Ullah and Srivastava [34] found the approximate moments of the coefficient of determination under small disturbance asymptotic theory. Later, Srivastava et al. [31] derived the large sample asymptotic biases and mean squared errors of the coefficient of determination and its adjusted version.

The philosophy behind the development of the coefficient of determination in the multiple linear regression model is utilized in this paper to develop a measure of goodness of fit in nonparametric regression modelling. The use of the developed goodness of fit statistic based on any estimator is demonstrated by using the Nadaraya–Watson and Priestley–Chao estimators. The asymptotic distributions of the proposed statistics are derived, which will be useful for various statistical problems such as testing of hypotheses and constructing confidence intervals for the population version of the coefficient of determination.

The plan of the paper is as follows: The coefficient of determination in the parametric multiple linear regression model, along with its salient properties, is discussed in Sect. 2. The setup of the nonparametric regression model is described in Sect. 3. The development of the proposed goodness of fit statistic in the nonparametric setup is detailed in Sect. 4 and its subsections. The asymptotic distribution of the proposed goodness of fit statistic is derived in Sect. 5. Some concluding remarks are placed in Sect. 6.

Coefficient of Determination in Classical Multiple Linear Regression Model

We here first discuss the development and role of coefficient of determination \(R^2\) in the parametric multiple linear regression model under the usual assumptions. Let us consider the following multiple linear regression model connecting the response and predictors as

$$\begin{aligned} y=\alpha e_n + X\beta + u, \end{aligned}$$
(2.1)

where y is the \(n\times 1\) vector of values on response variable, X is the \(n\times p\) matrix of values on p non-stochastic predictors \(X_1,X_2,\ldots , X_p\), \(\alpha \) is the intercept term, \(\beta \) is the \(p\times 1\) vector of regression coefficients \(\beta _1,\beta _2,\ldots , \beta _p\) associated with \(X_1,X_2,\ldots , X_p\), respectively, \(e_n\) is the \(n\times 1\) vector of elements unity (i.e. 1’s) and u is the \(n\times 1\) vector of disturbances with mean vector 0 and covariance matrix \(\sigma ^2I\). Under this classical multiple linear regression model, the coefficient of determination is defined as

$$\begin{aligned} R^2&= \frac{b'X'PXb}{y'Py}\nonumber \\&= \frac{y'PX(X'PX)^{-1}X'Py}{y'Py}, \end{aligned}$$
(2.2)

where \(b=(X'PX)^{-1}X'Py\) is the ordinary least squares estimator of \(\beta \) and \(P=I_n-n^{-1}e_n e_n'\).

The coefficient of determination \(R^2\) is obtainable from the analysis of variance in multiple linear regression model (2.1) as

$$\begin{aligned} R^2=\frac{SS_{regression}}{SS_{total}}=1-\frac{SS_{error}}{SS_{total}},\;\; 0 \le R^2 \le 1, \end{aligned}$$
(2.3)

where the total sum of squares \((SS_{total})\) is partitioned into two orthogonal components, viz., sum of squares due to regression \((SS_{regression})\) and sum of squares due to error \((SS_{error})\) as \(SS_{total}=SS_{regression}+SS_{error}\), where \(SS_{total}=y'Py\), \(SS_{regression}=y'PX(X'PX)^{-1}X'Py\), and \(SS_{error}=y'[I-X(X'X)^{-1}X']y\). This \(R^2\) measures the proportion of variation being explained by the fitted model based on \(X_1,X_2,\ldots , X_p\) with respect to the total variation.
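As a small numerical illustration (not part of the original development), the following Python sketch computes \(R^2\) both through the matrix form (2.2) and through the residual form (2.4) on simulated data; the simulated model, parameter values, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))                 # values of the p predictors (treated as fixed)
beta = np.array([1.5, -2.0, 0.5])
y = 2.0 + X @ beta + rng.normal(size=n)     # model (2.1) with intercept alpha = 2

P = np.eye(n) - np.ones((n, n)) / n         # centering matrix P = I_n - n^{-1} e_n e_n'

# R^2 via the matrix form (2.2): b = (X'PX)^{-1} X'Py
b = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)
R2_matrix = (b @ X.T @ P @ X @ b) / (y @ P @ y)

# R^2 via the residual form (2.4): fit intercept and slopes by least squares
Xc = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(Xc, y, rcond=None)     # [alpha_hat, b_1, ..., b_p]
resid = y - Xc @ coef
R2_resid = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

print(R2_matrix, R2_resid)                  # the two expressions coincide
```

The agreement of the two printed values reflects the orthogonal partitioning \(SS_{total}=SS_{regression}+SS_{error}\) described above.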

Using the definition of the ith residual in (2.1), namely the difference between the observed and fitted values of y, i.e. \(y_i - {\hat{y}}_i\), the \(R^2\) can also be expressed as

$$\begin{aligned} R^2=1 - \frac{\sum _{i = 1}^n (y_i - {\hat{y}}_i)^2}{\sum _{i = 1}^n (y_i - {\bar{y}})^2},\;\; 0 \le R^2 \le 1 \end{aligned}$$
(2.4)

where \(y_i - {\hat{y}}_i\) is the \(i\)th residual in (2.1), \({\hat{y}}_i={\hat{\alpha }}+b_1 x_{i1}+b_2 x_{i2} +\cdots +b_p x_{ip}\), \(i=1,2,\ldots ,n\), \({\bar{x}}_j=\frac{1}{n}\sum _{i=1}^n x_{ij}\) and \({\bar{y}}=\frac{1}{n}\sum _{i=1}^n y_{i}\). Here, \({\hat{\alpha }}, b_1, b_2, \ldots , b_p\) are the least squares estimators of \(\alpha , \beta _1, \beta _2, \ldots , \beta _p\), respectively.

Values of \(R^2\) indicate the degree of goodness of fit of (2.1) based on ordinary least squares estimation. Ideally, \(R^2=0\) and \(R^2=1\) indicate the worst and the best fit of the model, respectively, and any other value of \(0 \le R^2 \le 1\) indicates the degree of adequacy of the fitted model. Such an interpretation is comparable with the interpretation of the population quantity \(\theta \) defined in (2.5) below. A model is considered best fitted if \(\sigma ^2=0\), which implies \(\theta =1\). Similarly, the model is considered worst fitted when \(\beta =0\), implying \(\theta =0\) and indicating that none of the predictors contributes to explaining the variation. Any other value of \(0< \theta < 1\) can be considered as measuring the degree of goodness of the fitted model. So an estimator of \(\theta \) compares and measures the contribution of the predictors in explaining the variability among the values of the response variable in the population. Hence, \(R^2\) can be considered as an estimator of \(\theta \), thereby measuring the goodness of fit indicated by the fitted model. The quantity \(\theta \) is defined as

$$\begin{aligned} \theta = \frac{\beta '\Sigma \beta }{\beta '\Sigma \beta +\sigma ^2}, \end{aligned}$$
(2.5)

where \(\Sigma ={\text{ plim}}_{n\rightarrow \infty } n^{-1}X'PX\) and \({\text{ plim}}_{n\rightarrow \infty } \) denotes the probability limit, i.e. convergence in probability.

Note that \(\theta \) is the population counterpart of \(R^2\). Taking the variance of the disturbance terms as \(\sigma ^2\) and under the usual assumptions of classical linear regression analysis, it can be proved that \(R^2\) converges to \(\theta \) in probability, i.e.

$$\begin{aligned} {\text{ plim}}_{n\rightarrow \infty } \;R^2 = \theta , \end{aligned}$$

i.e. \(R^2\) is a consistent estimator of \(\theta \), and it is to be noted that \(R^2\) is a biased estimator of \(\theta \).
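To make the consistency statement concrete, the following small Monte Carlo sketch (an illustration only; the parameter values are hypothetical, the predictors are drawn as independent standard normals so that \(\Sigma \approx I\), and the randomly generated design is treated as if it were fixed) shows \(R^2\) approaching \(\theta = \beta '\Sigma \beta /(\beta '\Sigma \beta +\sigma ^2)\) as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([1.0, -0.5])
sigma2 = 2.0
Sigma = np.eye(2)                             # plim n^{-1} X'PX = I for i.i.d. N(0, I) predictors
theta = beta @ Sigma @ beta / (beta @ Sigma @ beta + sigma2)

def r_squared(n):
    X = rng.normal(size=(n, 2))
    y = 1.0 + X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    Xc = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ coef
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

print("theta =", round(theta, 4))
for n in (50, 500, 5000, 50000):
    print(n, round(r_squared(n), 4))          # R^2 settles down near theta as n grows
```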

Nonparametric Regression Model

In nonparametric regression, no parametric form is assumed for the relationship between the predictors and the response variable. In other words, by estimating the regression function from the observed data, we strive to find out the true nature of the relationship between the response variable and the predictors.

Now, we first describe the setup of the nonparametric regression model. Suppose that \((y_1, \mathbf{x}_{1}),\ldots , (y_{n}, \mathbf{x}_{n})\) are n observations, and the associated regression model is \(y_{i} = m (\mathbf{x}_{i}) + \nu _{i}\) for all \(i = 1, \ldots , n\), where \(y_{i}\) is the i-th observation of the response variable, \(\mathbf{x}_{i}\) is the i-th observation of the d-dimensional (\(d\ge 1\)) predictor variable, and \(\nu _{i}\) is the error corresponding to the i-th observation. Here, \(m : {\mathbb {R}}^{d}\rightarrow {\mathbb {R}}\) is the unknown nonparametric regression function. More technical assumptions on m and the error random variables will be stated as required by the context of the study.

To carry out any methodology using a nonparametric regression model, one needs to estimate the nonparametric regression curve, and the “quality” of the estimated regression curve depends on various factors such as the smoothness of the curve. If m is believed to be smooth, one can reasonably estimate the regression curve \(m(\mathbf{x})\) by a center of the collection of the response observations whose corresponding predictor observations lie in a neighbourhood of \(\mathbf{x}\). As a center of these response observations, a natural choice is the mean or any other suitable measure of central tendency of the observations in the neighbourhood of \(\mathbf{x}\). In other words, we may consider a “local average” based on the Y observations whose predictor observations lie in a small neighbourhood of \(\mathbf{x}\); if instead the predictor observations are far away from \(\mathbf{x}\), the resulting mean will be distant from the true value of \(m(\mathbf{x})\). Such a local averaging procedure can be considered as the basis of smoothing.

Let us now be formal about the smoothed estimator of \(m(\mathbf{x})\), where \(\mathbf{x}\) is a fixed point. Suppose that \({\hat{m}}_{n} (\mathbf{x})\) is a locally smoothed estimator of \(m(\mathbf{x})\) at the fixed point \(\mathbf{x}\), and \(w_{i} (\mathbf{x})\) is the positive weight corresponding to i-th observation such that \(\sum \limits _{i = 1}^{n} w_{i} (\mathbf{x}) = 1\). Define the smoothed estimator \({\hat{m}}_{n} (\mathbf{x})\) as \({\hat{m}}_{n} (\mathbf{x}) = \sum \limits _{i = 1}^{n} w_{i}(\mathbf{x}) y_{i}\), i.e. a certain weighted average of the response variables, where weights \(w_{i} (\mathbf{x})\) depend on the point of evaluation \(\mathbf{x}\). Further, note that \({\hat{m}}_{n} (\mathbf{x})\) is the solution of a certain minimization problem. Note that

$$\begin{aligned} {\hat{m}}_{n}(\mathbf{x}) = arg\min _{\zeta \in \mathbb {R}}\sum \limits _{i = 1}^{n}(y_{i} - \zeta )^{2} \omega _{i}(\mathbf{x}), \end{aligned}$$

i.e. the locally smoothed estimator is nothing but a locally weighted least squares estimator. As the smoothness of the estimator is controlled by the choice of \(\omega _{i}(.)\), one now needs to address the possible choices of \(\omega _{i} (.)\), which are discussed below.
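To see this equivalence explicitly (a routine step not spelled out above), differentiate the objective with respect to \(\zeta \) and equate the derivative to zero:

$$\begin{aligned} \frac{\partial }{\partial \zeta }\sum \limits _{i = 1}^{n}(y_{i} - \zeta )^{2} \omega _{i}(\mathbf{x}) = -2\sum \limits _{i = 1}^{n} \omega _{i}(\mathbf{x})\,(y_{i} - \zeta ) = 0 \quad \Rightarrow \quad {\hat{\zeta }} = \frac{\sum \nolimits _{i = 1}^{n} \omega _{i}(\mathbf{x})\, y_{i}}{\sum \nolimits _{i = 1}^{n} \omega _{i}(\mathbf{x})}, \end{aligned}$$

which reduces to \(\sum \nolimits _{i = 1}^{n} w_{i}(\mathbf{x})\, y_{i} = {\hat{m}}_{n}(\mathbf{x})\) when the weights sum to one.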

A simple way to choose the weight function \(w_{i} (.)\) is by considering the kernel density estimator (see, e.g. Silverman [29]) as the weight function, which is well known as the kernel smoothing technique. The kernel is usually denoted by k(.), satisfying the conditions \(k (.)\ge 0\) and \(\int k(v) dv = 1\), and one can consider

$$\begin{aligned} \omega _{i} (\mathbf{x}) = \frac{k \left(\frac{\mathbf{x}_{i} - \mathbf{x}}{h_{n}}\right)}{\sum \nolimits _{i = 1}^{n}k \left(\frac{\mathbf{x}_{i} - \mathbf{x}}{h_{n}}\right)}, \end{aligned}$$

where \(h_{n}\) is a sequence of bandwidths satisfying \(h_{n}\rightarrow 0\) as \(n\rightarrow \infty \) and some other technical conditions, which will be stated as required. Using the aforementioned \(w_{i} (\mathbf{x})\), we have

$$\begin{aligned} {\hat{m}}_{n} (\mathbf{x}) = \frac{\sum \nolimits _{i = 1}^{n}k \left(\frac{\mathbf{x}_{i} - \mathbf{x}}{h_{n}}\right) y_{i}}{\sum \nolimits _{i = 1}^{n}k \left(\frac{\mathbf{x}_{i} - \mathbf{x}}{h_{n}}\right)}, \end{aligned}$$

which is the well-known Nadaraya–Watson estimator (see, e.g. Nadaraya [23] and Watson [36]). For \(d = 1\), the following example, taken from Härdle [12], gives some insight into the behaviour of \({\hat{m}}_{n} (.)\) when \(h_{n}\rightarrow 0\) and when \(h_{n}\rightarrow \infty \) as \(n\rightarrow \infty \).

Let \(k(s) = 0.75 (1 - s^2) 1_{\{|s|\le 1\}}\), where \(1_{A} = 1\) if A is true and \(1_{A} = 0\) otherwise. On the one hand, as \(h_{n}\rightarrow 0\), we have

$$\begin{aligned} {\hat{m}}_{n} (x_i)\rightarrow \frac{k(0)y_{i}}{k(0)} = y_{i}, \end{aligned}$$

i.e. as the bandwidth becomes smaller, the estimator \({\hat{m}}_{n}(.)\) at a design point converges to the corresponding observation of the response variable. On the other hand, as \(h_{n}\rightarrow \infty \), we have

$$\begin{aligned} {\hat{m}}_{n}(x)\rightarrow \frac{\sum \nolimits _{i = 1}^{n} k(0)y_{i}}{\sum \nolimits _{i = 1}^{n} k(0)} = \frac{1}{n}\sum \limits _{i = 1}^{n} y_{i}, \end{aligned}$$

i.e. as the bandwidth becomes larger, the estimated curve flattens towards the sample mean of the response observations. Besides the Nadaraya–Watson estimator, for \(d = 1\), there are a few more well-known estimators in the literature; the Priestley–Chao estimator [see Priestley and Chao [25]] and the Gasser–Müller estimator [see Gasser and Müller [8]] are two such examples. For the Priestley–Chao estimator,

$$\begin{aligned} \omega _{i} (x) = \frac{n (x_{i} - x_{i - 1})}{h_{n}}\, k\left( \frac{x - x_{i}}{h_{n}}\right) , \end{aligned}$$

where \(x_{0} = 0\), and for Gasser–Müller estimator,

$$\begin{aligned} \omega _{i} (x) = \frac{n}{h_{n}}\int _{s_{i - 1}}^{s_{i}} k\left( \frac{x - u}{h_{n}}\right) du, \end{aligned}$$

where \(x_{i - 1}\le s_{i - 1}\le x_{i}\); for these two estimators, the weights enter through \({\hat{m}}_{n}(x) = n^{-1}\sum \nolimits _{i = 1}^{n}\omega _{i}(x)\, y_{i}\) [see Härdle [12]]. In the literature, there are many articles on the various large sample properties of the aforesaid estimators; for those results, the reader is referred to Härdle [12] and the references therein. Using these basic toolkits related to the estimation of the nonparametric regression function, we propose and study the goodness of fit in the nonparametric regression model in the next section.
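Before moving on, a minimal Python sketch (illustrative only; the equally spaced design, the regression function, the noise level, and all names are hypothetical) shows the Nadaraya–Watson and Priestley–Chao estimators with the Epanechnikov kernel of the example above, including the small- and large-bandwidth behaviour.

```python
import numpy as np

def k(s):
    # Epanechnikov kernel k(s) = 0.75 (1 - s^2) on |s| <= 1
    return 0.75 * (1.0 - s ** 2) * (np.abs(s) <= 1.0)

def nw(z, x, y, h):
    # Nadaraya-Watson estimate at the point z
    w = k((x - z) / h)
    return np.sum(w * y) / np.sum(w)

def pc(z, x, y, h):
    # Priestley-Chao estimate at z for an equally spaced design on (0, 1]
    return np.sum(k((x - z) / h) * y) / (len(x) * h)

def m(t):
    # hypothetical "true" regression function used only for this illustration
    return np.sin(2 * np.pi * t)

rng = np.random.default_rng(2)
n = 200
x = np.arange(1, n + 1) / n                 # fixed, equally spaced design points
y = m(x) + rng.normal(scale=0.3, size=n)

z = 0.5
for h in (0.01, 0.1, 0.9):
    print(h, round(nw(z, x, y, h), 3), round(pc(z, x, y, h), 3))
# a very small bandwidth essentially reproduces the nearby observations,
# a very large bandwidth flattens the Nadaraya-Watson estimate towards the mean of y
print(round(np.mean(y), 3))
```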

Goodness of Fit in Nonparametric Regression Model

In Sect. 2, the coefficient of determination, or goodness of fit, was discussed extensively for the multiple linear regression model, which is a parametric model. A natural question now arises: how should the goodness of fit be measured in the nonparametric regression model? One option is to follow the way \(R^2\) is developed in the parametric multiple linear regression model and use this philosophy to construct a measure of goodness of fit in the nonparametric regression model.

Looking at the construction of \(R^2\) in (2.3), it is clear that \(R^2\) measures the proportion of the variability in y which is explained by the fitted model based on the ordinary least squares estimators of the regression parameters. The total variability is measured by \(SS_{total}\), and the variability due to the fitted model is given by \(SS_{regression}\). If functions like \(SS_{total}\) and \(SS_{regression}\) can somehow be defined in the case of nonparametric regression modelling, then a measure similar to \(R^2\) can be defined, which is expected to measure the goodness of fit in the nonparametric regression model. It is important to observe that \(R^2\) in the multiple linear regression model (2.1) is based on the orthogonal partitioning of the total sum of squares into the sum of squares due to regression and the sum of squares due to random errors. Such a partitioning is not possible to achieve in the nonparametric regression modelling setup.

To define the goodness of fit, recall the nonparametric regression model:

$$\begin{aligned} y_{i} = m (\mathbf{x}_{i}) + \nu _{i}, \end{aligned}$$
(4.1)

where m is the unknown regression function, and \(\nu _{i}\) is the error corresponding to the i-th observation with variance \(\sigma ^{2} < \infty \). A set of n observations \((y_1, \mathbf{x}_1), (y_2, \mathbf{x}_2), \ldots , (y_n, \mathbf{x}_n)\) on \((y, \mathbf{x})\) is available.

Let \({\hat{m}}_{n}^{2}(\mathbf{z})\) be an appropriate estimator of \(m^{2}(\mathbf{z})\), and suppose that \({\hat{\sigma }}_{n}^{2}\) is an appropriate estimator of the error variance \(\sigma ^{2}\). Under such a specification, we propose to measure the variation explained by the fitted model obtained by \({\hat{m}}_{n}(\mathbf{z})\) as \({\hat{m}}_{n}^{2}(\mathbf{z})\), and the total variation as \({\hat{\sigma }}_{n}^{2} + {\hat{m}}_{n}^{2}(\mathbf{z})\). We now propose the following measure of goodness of fit statistic under nonparametric (NP) regression for a fixed \(\mathbf{z}\):

$$\begin{aligned} {\hat{\theta }}_{n} (\mathbf{z}, NP) = \frac{{\hat{m}}_{n}^{2}(\mathbf{z})}{{\hat{\sigma }}_{n}^{2} + {\hat{m}}_{n}^{2}(\mathbf{z})}, \end{aligned}$$
(4.2)

and the population version of \({\hat{\theta }}_{n} (\mathbf{z}, NP)\) is

$$\begin{aligned} \theta (\mathbf{z}, NP) = \frac{m^{2}(\mathbf{z})}{\sigma ^{2} + m^{2}(\mathbf{z})}. \end{aligned}$$
(4.3)

It is clear from the expression of \({\hat{\theta }}_{n} (\mathbf{z}, NP)\) in (4.2) that such a goodness of fit statistic depends upon the choice of \({\hat{m}}_{n}^{2}(\mathbf{z})\) and \({\hat{\sigma }}_{n}^{2}\). A particular choice of \({\hat{m}}_{n}^{2}(\mathbf{z})\) and \({\hat{\sigma }}_{n}^{2}\) will indicate the degree of goodness of fit achieved by fitting the model using these estimators.

For example, two popular estimators of \(m(\mathbf{z})\) in nonparametric regression are the Nadaraya–Watson (NW) estimator and the Priestley–Chao (PC) estimator, which are defined as follows:

$$\begin{aligned} {\hat{m}}_{n}(\mathbf{z}, NW)&= \frac{\sum \nolimits _{i = 1}^{n} k \left(\frac{\mathbf{x}_{i} - \mathbf{z}}{h_{n}}\right) y_{i}}{\sum \nolimits _{i = 1}^{n} k \left(\frac{\mathbf{x}_{i} - \mathbf{z}}{h_{n}}\right)} \end{aligned}$$
(4.4)
$$\begin{aligned} {\hat{m}}_{n}(\mathbf{z}, PC)&= \frac{\sum \nolimits _{i = 1}^{n} k \left(\frac{\mathbf{x}_{i} - \mathbf{z}}{h_{n}}\right) y_{i}}{n h_{n}}. \end{aligned}$$
(4.5)

Suppose that \(\sigma ^{2}\) is estimated under the Nadaraya–Watson estimator and the Priestley–Chao estimator by (4.6) and (4.7), respectively, as follows:

$$\begin{aligned} {\hat{\sigma }}_{n, NW}^{2}&= \frac{1}{n - 1}\sum \limits _{i = 1}^{n}\{y_{i} - {\hat{m}}_{n} (\mathbf{x}_{i}, NW)\}^{2} \end{aligned}$$
(4.6)
$$\begin{aligned} {\hat{\sigma }}_{n, PC}^2&= \frac{1}{n - 1}\sum \limits _{i = 1}^{n}\{y_{i} - {\hat{m}}_{n}(\mathbf{x}_{i}, PC)\}^{2}. \end{aligned}$$
(4.7)

Next, we consider the Nadaraya–Watson and the Priestley–Chao estimators to estimate the unknown regression function. The goodness of fit statistics based on the Nadaraya–Watson and Priestley–Chao estimators are defined by (4.8) and (4.9), respectively, as follows:

$$\begin{aligned} {\hat{\theta }}_{n} (\mathbf{z}, NW)&= \frac{{\hat{m}}_{n}^{2}(\mathbf{z}, NW)}{{\hat{\sigma }}_{n, NW}^{2} + {\hat{m}}_{n}^{2}(\mathbf{z}, NW)} \end{aligned}$$
(4.8)
$$\begin{aligned} {\hat{\theta }}_{n} (\mathbf{z}, PC)&= \frac{{\hat{m}}_{n}^{2}(\mathbf{z}, PC)}{{\hat{\sigma }}^2_{n, PC}+ {\hat{m}}_{n}^{2}(\mathbf{z}, PC)}. \end{aligned}$$
(4.9)

We now illustrate how the proposed goodness of fit statistics can be used. Suppose we want to know which of the estimators \({\hat{m}}_{n} (\mathbf{z}, NW)\) and \({\hat{m}}_{n} (\mathbf{z}, PC)\) provides a better fit of the model. One can compute \({\hat{\theta }}_{n} (\mathbf{z}, NW)\) and \({\hat{\theta }}_{n} (\mathbf{z}, PC)\). If \({\hat{\theta }}_{n} (\mathbf{z}, NW) \ge {\hat{\theta }}_{n} (\mathbf{z}, PC)\) for all \(\mathbf{z}\), we can infer that the Nadaraya–Watson estimator based model provides a better fit than the Priestley–Chao estimator based model. On the other hand, if \({\hat{\theta }}_{n} (\mathbf{z}, NW) \le {\hat{\theta }}_{n} (\mathbf{z}, PC)\) for all \(\mathbf{z}\), we can infer that the Priestley–Chao estimator based model provides a better fit than the Nadaraya–Watson estimator based model. However, when neither \({\hat{\theta }}_{n} (\mathbf{z}, NW)\) nor \({\hat{\theta }}_{n} (\mathbf{z}, PC)\) dominates the other for all \(\mathbf{z}\), one may consider global criteria such as \(\displaystyle \sup _{z}{\hat{\theta }}_{n}^{2}(\mathbf{z})\) or \(\int {\hat{\theta }}_{n}^{2} (\mathbf{z}) d\mathbf{z}\). For instance, one may check whether \(\displaystyle \int _{z}{\hat{\theta }}_{n}^{2}(\mathbf{z}, NW) d\mathbf{z}\) is larger than \(\displaystyle \int _{z}{\hat{\theta }}_{n}^{2}(\mathbf{z}, PC) d\mathbf{z}\) and, if so, prefer the Nadaraya–Watson estimator based model over the Priestley–Chao estimator based model. In fact, \(\displaystyle \int _{z}{\hat{\theta }}_{n}^{2}(\mathbf{z}) d\mathbf{z}\) can be used in problems related to variable selection as well. For instance, suppose that the model involves five predictors (denote them as \(x_1\), \(x_2\), \(x_3\), \(x_4\) and \(x_5\)), and a is the value of \(\displaystyle \int _{z} {\hat{\theta }}_{n}^{2}(\mathbf{z}) d\mathbf{z}\) for that model. Now, suppose that two more predictors (denote them as \(x_6\) and \(x_7\)) are added, and \(a^{*}\) is the value of \(\displaystyle \int _{z} {\hat{\theta }}_{n}^{2}(\mathbf{z}) d\mathbf{z}\) for the model involving all seven predictors (i.e. \(x_1\), \(x_2\), \(\ldots \), \(x_7\)). If \(a^{*} > a\), one may conclude that \(x_6\) and \(x_7\) are significant predictors.
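A compact Python sketch of these computations follows (illustration only: the simulated data, the bandwidth, the integration grid, and all names are hypothetical assumptions; the integral \(\int {\hat{\theta }}_{n}^{2}(\mathbf{z})\, d\mathbf{z}\) is approximated by a simple Riemann sum over a grid of z values for \(d = 1\)).

```python
import numpy as np

def k(s):
    return 0.75 * (1.0 - s ** 2) * (np.abs(s) <= 1.0)        # Epanechnikov kernel

def m_nw(z, x, y, h):
    w = k((x - z) / h)
    return np.sum(w * y) / np.sum(w)                         # (4.4)

def m_pc(z, x, y, h):
    return np.sum(k((x - z) / h) * y) / (len(x) * h)         # (4.5)

def theta_hat(z, x, y, h, estimator):
    # goodness of fit statistic (4.8) / (4.9) at the point z
    fitted = np.array([estimator(xi, x, y, h) for xi in x])
    sigma2_hat = np.sum((y - fitted) ** 2) / (len(x) - 1)    # (4.6) / (4.7)
    m2_hat = estimator(z, x, y, h) ** 2
    return m2_hat / (sigma2_hat + m2_hat)

rng = np.random.default_rng(3)
n, h = 200, 0.1
x = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

# pointwise comparison at a single z
z0 = 0.25
print(theta_hat(z0, x, y, h, m_nw), theta_hat(z0, x, y, h, m_pc))

# integrated criterion: approximate int theta_hat^2(z) dz on a grid away from the boundary
z_grid = np.linspace(0.05, 0.95, 181)
dz = z_grid[1] - z_grid[0]
crit_nw = np.sum([theta_hat(z, x, y, h, m_nw) ** 2 for z in z_grid]) * dz
crit_pc = np.sum([theta_hat(z, x, y, h, m_pc) ** 2 for z in z_grid]) * dz
print(crit_nw, crit_pc)      # the larger value points to the better-fitting estimator
```

The same integrated criterion, recomputed after adding or dropping predictors, can serve as the variable selection device described above.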

Asymptotic Distribution of \({\hat{\theta }}_{n} (\mathbf{z}, NW)\) and \({\hat{\theta }}_{n} (\mathbf{z}, PC)\)

We have already discussed the usefulness of \({\hat{\theta }}_{n} (\mathbf{z}, NW)\) and \({\hat{\theta }}_{n} (\mathbf{z}, PC)\). In order to implement them, one needs to know their distributions. However, since deriving the exact distributions of \({\hat{\theta }}_{n} (\mathbf{z}, NW)\) and \({\hat{\theta }}_{n} (\mathbf{z}, PC)\) is intractable, we present here their asymptotic distributions.

Before stating the result, one needs to assume the following technical conditions.

  1. (C1)

    The predictor observations \(\mathbf{x}_{i}\) \((i = 1,2, \ldots , n)\) lie in \([0, 1]^{d}\) (\(d\ge 1\)) in such a way that \(\displaystyle \sup _{i, j = 1, \ldots , n}||\mathbf{x}_{i} - \mathbf{x}_{j}|| = O\left( \frac{1}{n}\right) \).

  2. (C2)

    The random errors \(\nu _{1}, \nu _2, \ldots , \nu _{n}\) form an i.i.d. sequence of random variables with \(E|\nu |^{1 + \frac{1}{s}} < \infty \) for some \(0< s < 1\), where \(\nu \) has the same distribution as \(\nu _{i}\), \(i = 1,2, \ldots , n\).

  3. (C3)

    The sequence of bandwidths \(h_{n}\) is such that \(h_{n}\rightarrow 0\) and \(n h_{n}^{d}\rightarrow \infty \) as \(n\rightarrow \infty \), where \(d\ge 1\).

  4. (C4)

    The kernel function k(.) has compact support, and it satisfies that \(||\mathbf{x}||^{d + \delta } k(\mathbf{x})\rightarrow 0\) as \(||\mathbf{x}||\rightarrow \infty \) for some \(\delta > 0\).

Theorem 1

Let us denote \(\sigma _{1}^{2} (\mathbf{z}) = \displaystyle \lim _{n\rightarrow \infty }~n h_{n}^{d}{\text{ Var }}~\{{\hat{m}}_{n} (\mathbf{z}, NW)\}\) and \(\sigma _{2}^{2} (\mathbf{z}) = \displaystyle \lim _{n\rightarrow \infty }~n h_{n}^{d}{\text{ Var }}~\{{\hat{m}}_{n} (\mathbf{z}, PC)\}\), where Var denotes the variance. Then, under (C1)–(C4), \(\sqrt{nh_{n}^{d}}(\hat{\theta}_{n} (\mathbf{z}, NW) - E({\hat{\theta }}_{n} (\mathbf{z}, NW)))\) converges weakly to the distribution of \(\frac{Z_{1}}{\sigma ^{2} + Z_{1}}\), and \(\sqrt{nh_{n}^{d}}(\hat{\theta}_{n} (\mathbf{z}, PC) - E({\hat{\theta }}_{n} (\mathbf{z}, PC)))\) converges weakly to the distribution of \(\frac{Z_{2}}{\sigma ^{2} + Z_{2}}\). Here, \(Z_{1}\) is a random variable associated with normal distribution with zero mean and variance \(\sigma _{1}^{2} (\mathbf{z})\), and \(Z_{2}\) is a random variable associated with normal distribution with zero mean and variance \( \sigma _{2}^{2} (\mathbf{z})\).

Corollary 1

Under (C1)–(C4), \(n h_{n}^{d}\int \limits _{\mathbf{z}\in [0, 1]^{d}}[{\hat{\theta }}_{n} (\mathbf{z}, NW) - E({\hat{\theta }}_{n} (\mathbf{z}, NW))]^{2}d\mathbf{z}\) converges weakly to the distribution of \(\displaystyle\int\limits_{\mathbf{z}\in [0, 1]^{d}}\left\{\frac{Z_{1}}{\sigma^{2} + Z_{1}}\right\}^{2} d\mathbf{z}\), and \(n h_{n}^{d}\int \limits _{\mathbf{z}\in [0, 1]^{d}}[{\hat{\theta }}_{n} (\mathbf{z}, PC) - E({\hat{\theta }}_{n} (\mathbf{z}, PC))]^{2}d\mathbf{z}\) converges weakly to the distribution of \(\displaystyle\int\limits_{\mathbf{z}\in [0, 1]^{d}}\left\{\frac{Z_{2}}{\sigma^{2} + Z_{2}}\right\}^{2} d\mathbf{z}\).

Here \(Z_1\) and \(Z_2\) are the same as defined in the statement of Theorem 1.

Theorem 1 describes the asymptotic distributions of \({\hat{\theta }}_{n} (\mathbf{z}, NW)\) and \({\hat{\theta }}_{n} (\mathbf{z}, PC)\) after appropriate normalization. Note that for both estimators, the centering is done by the expectation of the estimator instead of \(\theta (\mathbf{z}, NP)\) defined in (4.3). However, under some regularity conditions, it is possible to establish that the difference between \(E({\hat{\theta }}_{n} (\mathbf{z}, NW))\) (or \(E({\hat{\theta }}_{n} (\mathbf{z}, PC))\)) and \(\theta (\mathbf{z}, NP)\) is negligible, and hence, the assertion of Theorem 1 is useful in practice. As discussed at the end of Sect. 4, one may often need to consider \(\displaystyle \int _{z}{\hat{\theta }}_{n}^{2}(\mathbf{z}, NW) d\mathbf{z}\) and \(\displaystyle \int _{z}{\hat{\theta }}_{n}^{2}(\mathbf{z}, PC) d\mathbf{z}\) as measures of goodness of fit, and Corollary 1 describes the asymptotic behaviour of the corresponding integrated quantities, which is indeed useful in practice; a small simulation sketch at the end of this section indicates how such quantities can be approximated numerically. The proofs of Theorem 1 and Corollary 1 are provided below.

To prove this theorem, one needs to prove the following lemma first.

Lemma 1

\({\hat{\sigma }}_{n, NW}^{2} - \sigma ^{2}{\mathop {\rightarrow }\limits ^{p}}0\) as \(n\rightarrow \infty \) and \({\hat{\sigma }}_{n, PC}^{2} - \sigma ^{2}{\mathop {\rightarrow }\limits ^{p}}0\) as \(n\rightarrow \infty \), where \({\hat{\sigma }}_{n, NW}^{2}\) and \({\hat{\sigma }}_{n, PC}^{2}\) are defined in (4.6) and (4.7), respectively, and \(\sigma ^{2}\) is the error variance.

Proof of Lemma 1

We provide here the derivation of \({\hat{\sigma }}_{n, NW}^{2} - \sigma ^{2}{\mathop {\rightarrow }\limits ^{p}}0\) as \(n\rightarrow \infty \); the proof of \({\hat{\sigma }}_{n, PC}^{2} - \sigma ^{2}{\mathop {\rightarrow }\limits ^{p}}0\) as \(n\rightarrow \infty \) follows from the same arguments.

Observe that

$$\begin{aligned} {\hat{\sigma }}_{n, NW}^{2}&= \frac{1}{n - 1}\sum \limits _{i = 1}^{n}\{y_{i} - {\hat{m}}_{n} (\mathbf{x}_{i}, NW)\}^{2}\\&= {} \frac{1}{n - 1}\sum \limits _{i = 1}^{n}\{m(\mathbf{x}_{i}) + \nu _{i} - {\hat{m}}_{n} (\mathbf{x}_{i}, NW)\}^{2}\\&= \frac{1}{n - 1}\sum \limits _{i = 1}^{n} \nu _{i}^{2} + \frac{1}{n - 1}\sum \limits _{i = 1}^{n} \{m(\mathbf{x}_{i}) - {\hat{m}}_{n} (\mathbf{x}_{i}, NW)\}^{2}\\&\quad+ \frac{2}{n - 1}\sum \limits _{i = 1}^{n} \nu _{i}\{m(\mathbf{x}_{i}) - {\hat{m}}_{n} (\mathbf{x}_{i}, NW)\}\\&:= t(I) + t(II) + t(III), \end{aligned}$$

where \(t(I)=\frac{1}{n - 1}\sum \limits _{i = 1}^{n} \nu _{i}^{2}\), \(t(II) = \frac{1}{n - 1}\sum \limits _{i = 1}^{n} \{m(\mathbf{x}_{i}) - {\hat{m}}_{n} (\mathbf{x}_{i}, NW)\}^{2}\) and \(t(III)=\frac{2}{n - 1}\sum \limits _{i = 1}^{n} \nu _{i}\{m(\mathbf{x}_{i}) - {\hat{m}}_{n} (\mathbf{x}_{i}, NW)\}\).

Note that \(t(I){\mathop {\rightarrow }\limits ^{p}}E (\nu ^{2}) = \sigma ^{2}\) as \(n\rightarrow \infty \) by the weak law of large numbers. Next, under (C1)–(C4), it follows from Theorem 2 of Georgiev [9] that \(t(II){\mathop {\rightarrow }\limits ^{p}}0\) as \(n\rightarrow \infty \), and \(t(III){\mathop {\rightarrow }\limits ^{p}}0\) as \(n\rightarrow \infty \) by a straightforward application of the Cauchy–Schwarz inequality along with the assertion of Theorem 2 of Georgiev [9] and the weak law of large numbers. Hence, the proof of this lemma is complete. For the proof of \({\hat{\sigma }}_{n, PC}^{2} - \sigma ^{2}{\mathop {\rightarrow }\limits ^{p}}0\) as \(n\rightarrow \infty \), one needs to follow similar arguments using the assertion of Theorem 3 in Georgiev [10]. \(\square \)

Proof of Theorem 1

Here also, we provide the proof for \(\sqrt{nh_{n}^{d}}(\hat{\theta}_{n} (\mathbf{z}, NW) - E({\hat{\theta }}_{n} (\mathbf{z}, NW)))\); the proof for \(\sqrt{nh_{n}^{d}}(\hat{\theta}_{n} (\mathbf{z}, PC) - E({\hat{\theta }}_{n} (\mathbf{z}, PC)))\) follows from the same arguments.

Note that it follows from Theorem 7 of Georgiev [9] that \(\sqrt{nh_{n}^{d}}(\hat{m}_{n} (\mathbf{z}, NW) - m(\mathbf{z}))\) converges weakly to \(Z_{1}\). Now, since \({\hat{\sigma }}_{n, NW}^{2}{\mathop {\rightarrow }\limits ^{p}}\sigma ^{2}\) as \(n\rightarrow \infty \) by Lemma 1, the proof follows directly from the application of the continuous mapping theorem and Slutsky's theorem [see Billingsley [1]]. For the asymptotic distribution of \(\sqrt{nh_{n}^{d}}(\hat{\theta}_{n} (\mathbf{z}, PC) - E({\hat{\theta }}_{n} (\mathbf{z}, PC)))\), the same arguments together with the assertion of Theorem 6 in Georgiev [10] complete the proof. \(\square \)

Proof of Corollary 1

Since the integration and the square function operators are continuous, direct application of continuous mapping theorem on the assertion of Theorem 1 completes the proof. \(\square \)
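To indicate how Theorem 1 and Corollary 1 can be examined in practice, the following Monte Carlo sketch (purely illustrative: the fixed design, the regression function, the noise level, the bandwidth, the grid, the number of replications, and all variable names are hypothetical assumptions) approximates the sampling distribution of \(n h_{n}^{d}\int [{\hat{\theta }}_{n} (\mathbf{z}, NW) - E({\hat{\theta }}_{n} (\mathbf{z}, NW))]^{2}d\mathbf{z}\) for \(d = 1\), replacing the expectation by the Monte Carlo average and the integral by a Riemann sum.

```python
import numpy as np

rng = np.random.default_rng(4)

def k(s):
    return 0.75 * (1.0 - s ** 2) * (np.abs(s) <= 1.0)       # Epanechnikov kernel

def theta_nw_grid(z_grid, x, y, h):
    # hat{theta}_n(z, NW) evaluated on a grid of z values
    fitted = np.array([np.sum(k((x - xi) / h) * y) / np.sum(k((x - xi) / h)) for xi in x])
    sigma2_hat = np.sum((y - fitted) ** 2) / (len(x) - 1)
    m_hat = np.array([np.sum(k((x - z) / h) * y) / np.sum(k((x - z) / h)) for z in z_grid])
    return m_hat ** 2 / (sigma2_hat + m_hat ** 2)

n, h, reps = 200, 0.1, 500
x = np.arange(1, n + 1) / n                                 # fixed design, d = 1
z_grid = np.linspace(0.05, 0.95, 91)
dz = z_grid[1] - z_grid[0]

thetas = np.array([theta_nw_grid(z_grid, x,
                                 np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n), h)
                   for _ in range(reps)])
centered = thetas - thetas.mean(axis=0)                     # E(theta_hat) replaced by the MC mean
stat = n * h * np.sum(centered ** 2, axis=1) * dz           # n h_n^d * int [theta_hat - E theta_hat]^2 dz
print(np.percentile(stat, [5, 50, 95]))                     # empirical quantiles of the statistic
```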

Conclusions

A measure of goodness of fit for judging the fit of a model obtained through nonparametric regression is proposed in this paper. The idea behind the development of such a goodness of fit statistic is borrowed from the coefficient of determination in the context of the multiple linear regression model, and the limits and interpretation of the proposed statistic are similar to those of the coefficient of determination. Various estimators can be used for fitting the nonparametric regression model, and the value of the proposed goodness of fit statistic reflects the degree of fit of the model obtained by each of them; this helps in deciding which of the estimators yields a better fitted model. The proposed statistic can also be used for variable selection, i.e. for determining whether an added variable is important or not. If a predictor is added to the model and the value of the goodness of fit statistic increases significantly, it can be concluded that the added predictor is important and contributes to explaining the variation in the response variable; if the value does not change significantly, the added predictor is not important in the sense that it does not help in explaining the variation in the values of the response variable. Predictors can be added to the model as long as each addition produces a significant increase in the value of the goodness of fit statistic. Finding the exact finite sample moments or distribution of the proposed statistics is difficult, so their asymptotic distributions have been derived; these can help in the construction of large sample tests and confidence intervals.

References

  1. Billingsley P (1995) Probability and measure. Wiley, New York
  2. Cheng C-L, Shalabh, Garg G (2014) Coefficient of determination for multiple measurement error models. J Multivar Anal 12:137–152
  3. Cheng C-L, Shalabh, Garg G (2016) Goodness of fit in restricted measurement error models. J Multivar Anal 14:101–116
  4. Cheng C-L, Shalabh, Chaturvedi A (2019) Goodness of fit for generalized shrinkage estimation. Theory Probab Math Stat (Am Math Soc) 10:177–197
  5. Crämer JS (1987) Mean and variance of \(R^2\) in small and moderate samples. J Economet 35:253–266
  6. Eshima N, Tabata M (2010) Entropy coefficient of determination for generalized linear models. Comput Stat Data Anal 54(5):1381–1389
  7. Eshima N, Tabata M (2011) Three predictive power measures for generalized linear models: the entropy coefficient of determination, the entropy correlation coefficient and the regression correlation coefficient. Comput Stat Data Anal 55(11):3049–3058
  8. Gasser T, Müller HG (1984) Estimating regression functions and their derivatives by the kernel method. Scand J Stat 11:171–185
  9. Georgiev AA (1989) Asymptotic properties of the multivariate Nadaraya–Watson regression function estimate: the fixed design case. Stat Prob Lett 7:35–40
  10. Georgiev AA (1990) Non-parametric multiple function fitting. Stat Prob Lett 10:203–211
  11. Hahn GJ (1973) The coefficient of determination exposed! Chem Technol 3:609–614
  12. Härdle W (1990) Applied nonparametric regression. Cambridge University Press, Cambridge
  13. Hilliard JE, Lloyd WP (1980) Coefficient of determination in a simultaneous equation model: a pedagogic note. J Bus Res 8(1):1–6
  14. Hong CS, Ham JH, Kim HI (2005) Variable selection for logistic regression model using adjusted coefficients of determination. Korean J Appl Stat 18(2):435–443
  15. Hössjer O (2008) On the coefficient of determination for mixed regression models. J Stat Plan Inference 138(10):3022–3038
  16. Huang L-S, Chen J (2008) Analysis of variance, coefficient of determination and F-test for local polynomial regression. Ann Stat 36(5):2085–2109
  17. Knight JL (1980) The coefficient of determination and simultaneous equation systems. J Economet 14(2):265–270
  18. Liao JG, McGee D (2003) Adjusted coefficients of determination for logistic regression. Am Stat 57(3):161–165
  19. Lipsitz SR, Leong T, Ibrahim J, Lipshultz S (2001) A partial correlation coefficient and coefficient of determination for multivariate normal repeated measures data. Statistician 50(1):87–95
  20. Marchand É (1997) On moments of beta mixtures, the noncentral beta distribution, and the coefficient of determination. J Stat Comput Simul 59(2):161–178
  21. Marchand É (2001) Point estimation of the coefficient of determination. Stat Decis 19(2):137–154
  22. McKean JW, Sievers GL (1987) Coefficients of determination for least absolute deviation analysis. Stat Prob Lett 5(1):49–54
  23. Nadaraya EA (1964) On estimating regression function. Theory Prob Appl 10:186–190
  24. Nagelkerke NJD (1991) A note on a general definition of the coefficient of determination. Biometrika 78(3):691–692
  25. Priestley MB, Chao MT (1972) Non-parametric function fitting. J R Stat Soc Ser B 34:385–392
  26. Ohtani K (1994) The density functions of \(R^2\) and \({\bar{R}}^2\), and their risk performance under asymmetric loss in misspecified linear regression models. Econom Modell 11(4):463–471
  27. Ohtani K, Hasegawa H (1993) On small sample properties of \(R^2\) in a linear regression model with multivariate t errors and proxy variables. Economet Theory 9(3):504–515
  28. Renaud O, Victoria-Feser M-P (2010) A robust coefficient of determination for regression. J Stat Plan Inference 140(7):1852–1862
  29. Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall, London
  30. Srivastava AK, Shobhit (2002) A family of estimators for the coefficient of determination in linear regression models. J Appl Stat Sci 11(2):133–144
  31. Srivastava AK, Srivastava VK, Ullah A (1995) The coefficient of determination and its adjusted version in linear regression models. Economet Rev 14(2):229–240
  32. Tanaka JS, Huba GJ (1998) A general coefficient of determination for covariance structure models under arbitrary GLS estimation. British J Math Stat Psychol 42(2):233–239
  33. Tjur T (2009) Coefficients of determination in logistic regression models—a new proposal: the coefficient of discrimination. Am Stat 63(4):366–372
  34. Ullah A, Srivastava VK (1994) Moments of the ratio of quadratic forms in non-normal variables with econometric examples. J Economet 62(2):129–141
  35. van der Linde A, Tutz G (2008) On association in regression: the coefficient of determination revisited. Statistics 42(1):1–24
  36. Watson GS (1964) Smooth regression analysis. Sankhya 26:359–372

Acknowledgements

The authors gratefully acknowledge the support from the MATRICS Project from the Science and Engineering Research Board (SERB), Department of Science and Technology, Government of India (Grant No. MTR/2019/000033).

Author information

Corresponding author

Correspondence to Shalabh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Celebrating the Centenary of Professor C. R. Rao” guest edited by Ravi Khattree, Sreenivasa Rao Jammalamadaka, and M. B. Rao.

About this article

Cite this article

Shalabh, Dhar, S.S. Goodness of Fit in Nonparametric Regression Modelling. J Stat Theory Pract 15, 18 (2021). https://doi.org/10.1007/s42519-020-00148-x

Keywords

  • Asymptotic distribution
  • Coefficient of determination
  • Goodness of fit
  • Kernel density estimator
  • Nadaraya–Watson estimator
  • Nonparametric regression
  • Priestley–Chao estimator