Nonparametric series density estimation and testing

This paper first establishes consistency of the exponential series density estimator when nuisance parameters are estimated as a preliminary step. Convergence in relative entropy of the density estimator is preserved, which in turn implies that the quantiles of the population density can be consistently estimated. The density estimator can then be employed to provide a test for the specification of fitted density functions. Commonly, this testing problem has been addressed with statistics based upon the empirical distribution function (edf), such as those of Kolmogorov-Smirnov or Cramér-von Mises type. The tests of this paper, however, are shown to be asymptotically pivotal, having a limiting standard normal distribution, unlike those based on the edf. For comparative purposes, the numerical properties of both the density estimator and the test are explored in a series of experiments. Some general superiority over commonly used edf-based tests is evident, whether standard or bootstrap critical values are used.


Introduction
Testing whether a sample of data has been generated from a hypothesized distribution is one of the fundamental problems in statistics and econometrics. Traditionally such tests have been constructed from the empirical distribution function (edf). Even under the simplest of sampling schemes such tests are known not to be asymptotically pivotal, e.g. see Stephens (1976), Conover (1999) and Babu and Rao (2004). Moreover, under more sophisticated sampling schemes such tests can become prohibitively complex, see Bai (2003) and Corradi and Swanson (2006).
Instead, this paper provides tests based on a generalization of the consistent series density estimator of Crain (1974) and Barron and Sheu (1991). Consistency is maintained when nuisance parameters are estimated as a preliminary step. This, when applied to the infinite-dimensional likelihood ratio test of Portnoy (1988), generalizes the tests of Claeskens and Hjort (2004) and Marsh (2007) to tests for specification.
The proposed procedure offers three advantages over tests based on the edf. First, the tests are asymptotically pivotal, and numerical experiments are designed and reported in support of this. This also implies automatic validity, including second-order validity as in Beran (1988), of bootstrap critical values. Valid bootstrap critical values for the non-pivotal edf-based tests, e.g. as in Kojadinovic and Yan (2012), do not enjoy this property. Second, the tests are generally more powerful than the most commonly used edf-based tests; again, numerical evidence is presented in support. Lastly, because the tests are based on a consistent density estimator, in the event of rejection the density estimator itself can be used to, for instance, consistently estimate the quantiles of the underlying variable.
The plan of the paper is as follows. The next section presents the density estimator and demonstrates that it converges in relative entropy to the population density. A corollary provides consistent quantile estimation, with accuracy demonstrated in numerical experiments. Section 3 presents the nonparametric test and establishes that it is asymptotically pivotal and consistent against fixed alternatives. A corollary establishes the validity of bootstrap critical values. Numerical experiments are presented in support of these results, as well as demonstrating some superiority over edf-based tests. Section 4 concludes, while two appendices present the proofs of the two theorems and tables containing the outcomes of the experiments, respectively.

Consistent nonparametric estimation of possibly misspecified densities

Theoretical results
Suppose that our sample $y = \{Y_i\}_{i=1}^n$ consists of independent copies of a random variable $Y$ having distribution $G(y) = \Pr[Y \le y]$ and density $g(y) = dG(y)/dy$. For this sample we fit the parametric likelihood $L = \prod_{i=1}^n f(Y_i; \beta)$ for some chosen density function $f(y; \beta)$, where $\beta$ is an unknown $k \times 1$ parameter. Denote the (quasi) maximum likelihood estimator for $\beta$ by $\hat\beta_n$.
In this context the hypothesis to be tested is

$$H_0: G(y) = F(y; \beta_0), \qquad (1)$$

where $F(y; \beta) = \int_{-\infty}^{y} f(z; \beta)\,dz$ and $\beta_0$ is some (unknown) value. Tests for $H_0$ will be detailed in the next section. First, however, we assume the following, whether or not $H_0$ holds:

Assumption 1
(i) The density $f(y; \beta)$ is measurable in $y$ for every $\beta \in B$, a compact subset of $p$-dimensional Euclidean space, and is continuous in $\beta$ for every $y$. (ii) $G(y)$ is an absolutely continuous distribution function and $E[\log g(Y)]$ exists. That is, $\hat\beta_n$ is a $\sqrt{n}$-consistent quasi maximum likelihood estimator for the pseudo-true value $\beta^*$. Note that under $H_0$ we have $\beta^* = \beta_0$.

To proceed, denote $\hat X_i = F(Y_i, \hat\beta_n)$, which has a mean value expansion around $\beta^*$, where $\beta^+$ lies on a line segment joining $\hat\beta_n$ and $\beta^*$. As a consequence we can write

$$\hat X_i = X_i + e_i, \qquad (2)$$

where $X_i = F(Y_i, \beta^*)$ and, by construction and as a consequence of Assumption 1(iv),

$$e_i = O_p(n^{-1/2}), \qquad (3)$$

that is, $e_i$ is both bounded and degenerate. Since the $X_i$ are IID, denote their common distribution and density function by $U(x) = \Pr[X < x]$ and $u(x) = dU(x)/dx$, respectively.

Here we will apply the series density estimator of Crain (1974) and Barron and Sheu (1991) to consistently estimate $u(x)$, and thus the quantiles of $U(x)$, from which the quantiles of $G(y)$ can be consistently recovered. Application of the density estimator requires a choice of approximating basis; here we choose the simplest polynomial basis, similar to Marsh (2007). We will approximate $u(x)$ via the exponential family

$$p_x(\theta) = \exp\left(\sum_{j=1}^{m} \theta_j x^j - \psi_m(\theta)\right), \qquad (4)$$

where $\psi_m(\theta)$ is the cumulant function, defined so that $\int_0^1 p_x(\theta)\,dx = 1$. From Assumption 1, $\log[u(x)]$ has at least $r - 1$ absolutely continuous derivatives and its $r$th derivative is square integrable. According to Barron and Sheu (1991) there exists a unique $\theta(m)$, satisfying

$$\int_0^1 x^j p_x(\theta(m))\,dx = \mu_j(m) := \int_0^1 x^j u(x)\,dx, \quad j = 1, \dots, m, \qquad (5)$$

such that, as $m \to \infty$, $p_x(\theta(m))$ converges in relative entropy to $u(x)$ at rate $m^{-2r}$. Moreover, if a sample $\{X_i\}_1^n$ were available then, if $m^3/n \to 0$ and letting $\bar\theta(m)$ be the unique solution to

$$\int_0^1 x^j p_x(\bar\theta(m))\,dx = \frac{1}{n}\sum_{i=1}^{n} X_i^j, \quad j = 1, \dots, m, \qquad (6)$$

$p_x(\bar\theta(m))$ converges in relative entropy to $u(x)$; see Theorem 1 of Barron and Sheu (1991).
Here, however, the sample $\{X_i\}_1^n$ is not available; instead we only observe $\{\hat X_i\}_1^n$ and consequently have $\hat\theta(m)$ as the unique solution to

$$\int_0^1 x^j p_x(\hat\theta(m))\,dx = \frac{1}{n}\sum_{i=1}^{n} \hat X_i^j, \quad j = 1, \dots, m. \qquad (7)$$

Note that Eqs. (5), (6) and (7) define one-to-one mappings between the sample space $\Omega(m) \subset R^m$ and the parameter space $\Theta(m) \subset R^m$ in the exponential family, see Barndorff-Nielsen (1978). We can therefore define three pairs of $m$-dimensional parameters and statistics, respectively, as $\theta(m) : \mu(m)$, $\bar\theta(m) : \bar X(m)$ and $\hat\theta(m) : \hat X(m)$. Generically, these mappings can be expressed via the derivative of the cumulant function,

$$\psi_m'(\theta) = E_{\theta}[X(m)]. \qquad (8)$$

The uniqueness of these mappings can be exploited in the following theorem, proved in Appendix A, to show that the density estimator $p_x(\hat\theta(m))$ converges in relative entropy at the same rate as $p_x(\bar\theta(m))$.
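As a numerical illustration, $\hat\theta(m)$ in (7) can be computed by maximizing the strictly concave exponential-family log-likelihood, whose first-order conditions are exactly the moment equations. The sketch below is an assumption-laden illustration, not the paper's code: function names are hypothetical and the use of `scipy` quadrature and BFGS is a convenience choice.

```python
# Sketch: exponential series density p_x(theta) = exp(sum_j theta_j x^j - psi_m(theta))
# on (0, 1), fitted by moment matching as in Barron and Sheu (1991).
# Hypothetical helper names; scipy-based numerical integration is an assumption.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

def psi_m(theta):
    """Cumulant function: log of int_0^1 exp(sum_j theta_j x^j) dx."""
    coeffs = np.r_[theta[::-1], 0.0]            # highest power first, zero constant term
    val, _ = quad(lambda x: np.exp(np.polyval(coeffs, x)), 0.0, 1.0)
    return np.log(val)

def fit_series_density(x, m):
    """Maximize theta'xbar - psi_m(theta); the gradient condition reproduces
    the moment equations int_0^1 x^j p_x(theta) dx = (1/n) sum_i x_i^j."""
    xbar = np.array([np.mean(x ** j) for j in range(1, m + 1)])
    res = minimize(lambda th: psi_m(th) - th @ xbar, np.zeros(m), method="BFGS")
    return res.x

rng = np.random.default_rng(0)
x_hat = rng.beta(2.0, 2.0, size=500)            # stand-in for the PIT sample X-hat
theta_hat = fit_series_density(x_hat, m=3)

# the fitted density integrates to one by construction of psi_m
coeffs = np.r_[theta_hat[::-1], 0.0]
mass, _ = quad(lambda t: np.exp(np.polyval(coeffs, t) - psi_m(theta_hat)), 0.0, 1.0)
```

Because the log-likelihood is strictly concave in $\theta$, the maximizer is unique, mirroring the one-to-one mappings in (8).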
Theorem 1 Let $\hat\theta(m)$ denote the estimated exponential parameter determined by (7). Then, under Assumption 1 and for $m, n \to \infty$ with $m^3/n \to 0$, $p_x(\hat\theta(m))$ converges in relative entropy to $u(x)$.

According to Theorem 1, in terms of the density estimator at least, the effect of observing $\hat X_1, \dots, \hat X_n$ rather than $\{X_1, \dots, X_n\}$ is asymptotically negligible under Assumption 1, and for either choice of basis. Moreover, if the goal were only nonparametric estimation of the density, then the optimal choice of the dimension $m$ is the same as when no parameters are estimated, i.e. $m_{opt} \propto n^{1/(1+2r)}$ (so that $m_n^* = O(n^{1/5})$, since $r \ge 2$ by assumption). The rate of convergence of the estimator remains of order $O_p(n^{-2r/(1+2r)})$. It should not be surprising that the rate of convergence is unaffected when parameters are replaced by $\sqrt{n}$-consistent estimators. Theorem 1 thus generalises the results of Crain (1974) and Barron and Sheu (1991), as summarized in Lemma 1 of Marsh (2007), by permitting estimation of nuisance parameters as a preliminary step.
Additionally, we may recover the quantiles of Y from those implied by the approximating series density estimator. This is captured in the following Corollary, which follows immediately since convergence in relative entropy implies convergence in law.

Numerical application of a quantile estimator
The consequence of Corollary 1 is that the quantiles associated with $\hat T_{n,m}$ converge to those of $Y$: letting $q_A(\pi)$, for $0 < \pi < 1$, denote the quantile function of a random variable $A$, the quantiles implied by $\hat T_{n,m}$ consistently estimate $q_Y(\pi)$. The following set of experiments compares the mean square errors (MSE) of estimators for the quantiles of $Y$ based on those of $\hat T_{n,m}$, for $m = 3, 9$ and for quantiles calculated at the probabilities $\pi = .05, .25, .50, .75, .95$. We also compare the accuracy of estimated quantiles when unknown parameters are estimated against cases where they are not.
First suppose that $Y_i \sim IID\; Y := t(4)$, but we estimate the Gaussian likelihood implied by $N(\mu, \sigma^2)$. Define

$$X_i^* = \Phi(Y_i) \quad \text{and} \quad \hat X_i = \Phi\left(\frac{Y_i - \hat\mu_n}{\hat\sigma_n}\right),$$

the first obtained from the (misspecified) Gaussian model imposing zero mean and unit variance, and the second from the Gaussian model with estimated mean and variance. Following the development above, as well as that of Barron and Sheu (1991), let $\theta^*(m)$ and $\hat\theta(m)$ denote the estimated parameters for the exponential series density estimators for the samples $\{X_i^*\}_1^n$ and $\{\hat X_i\}_1^n$, respectively. Let $T^*_{n,m}$ have density $p_t(\theta^*(m))$ (note that this is just a straightforward application of the original set-up of Barron and Sheu 1991) and let $\hat T_{n,m}$ have density $p_t(\hat\theta(m))$, as in Corollary 1. The pairs of estimated quantiles for $Y$ are then constructed by inverting the implied distribution functions. The MSE of these quantiles, for each probability $\pi$, are presented in Appendix B, Table 1a for $m = 3$ and Table 1b for $m = 9$. Next suppose that $Y_i \sim IID\; Y := \Gamma(1.2, 1)$, with the analogous transformations imposing the parameter values in the first case and estimating them in the second. Analogously to the above, let $T^*_{n,m}$ and $\hat T_{n,m}$ have densities $p_t(\theta^*(m))$ and $p_t(\hat\theta(m))$, so that pairs of estimated quantiles for $Y$ are constructed in the same way. The MSE of these quantiles, for each probability $\pi$, are presented in Tables 1c ($m = 3$) and 1d ($m = 9$).
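The quantile recovery can be sketched numerically: invert the fitted series cdf on (0, 1) to obtain the quantile of $\hat T_{n,m}$, then map back through the fitted parametric cdf, which in the Gaussian case with estimated mean and variance is $\hat\mu + \hat\sigma\,\Phi^{-1}(q)$. The helper names are hypothetical; a convenient check is that with $\theta = 0$ the series density is uniform, so the estimator reduces to the fitted Gaussian quantile.

```python
# Sketch: quantiles of Y recovered from a fitted series density on (0, 1).
# Hypothetical helper names; the Gaussian back-transform matches the fitted
# N(mu, sigma^2) model of the quantile experiments.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm

def series_cdf(theta):
    """cdf of the exponential series density with polynomial basis."""
    coeffs = np.r_[theta[::-1], 0.0]
    dens = lambda x: np.exp(np.polyval(coeffs, x))
    total, _ = quad(dens, 0.0, 1.0)
    return lambda x: quad(dens, 0.0, x)[0] / total

def quantile_of_y(pi, theta, mu_hat, sigma_hat):
    cdf = series_cdf(theta)
    q_t = brentq(lambda x: cdf(x) - pi, 1e-10, 1.0 - 1e-10)  # quantile on (0, 1)
    return mu_hat + sigma_hat * norm.ppf(q_t)                # map back through F^{-1}

# check: theta = 0 gives the uniform density, so q_T(pi) = pi and the
# estimator is exactly the fitted Gaussian quantile
q95 = quantile_of_y(0.95, np.zeros(3), 0.0, 1.0)
```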
The consistency of the quantiles obtained from, in particular, $\hat T_{n,m}$ is illustrated clearly in Table 1. More relevant, however, is that estimating the parameters of the fitted model as a preliminary step produces quantile estimators that can be superior, as the sample size becomes large, to those obtained by simply imposing parameter values, as can be seen by comparing the right and left panels of Table 1. Note also that although the larger value of $m$ yields more accurate quantile estimates in these cases, this comes at some computational cost and, in other cases, potential numerical instability. This latter possibility is greatly mitigated here, however, since the $\{\hat X_i\}_{i=1}^n$ are bounded.

Main results
Here we provide a test of the null hypothesis that the fitted likelihood is correctly specified, as in (1). The previous section generalized the Barron and Sheu (1991) series density estimator, and the resulting nonparametric likelihood ratio test then generalizes the test of Marsh (2007).
To proceed, note that when $H_0$ is true then, in Assumption 1, $\beta^* = \beta_0$ and, in (2), the $\hat X_i$ are asymptotically IID uniform on $(0,1)$. As in Marsh (2007), this means that (1) can be tested via

$$H_0: \theta(m) = 0(m) \qquad (10)$$

in the exponential family (4), where $\theta(m)$ is the solution to (5) and $0(m)$ is an $m \times 1$ vector of zeros. The likelihood ratio test of Portnoy (1988), applied via the density estimator of Crain (1974) and Barron and Sheu (1991) obtained from the sample $\hat X_1, \dots, \hat X_n$, is

$$\hat\lambda_m = n\left[\hat\theta(m)'\hat X(m) - \psi_m(\hat\theta(m))\right],$$

and the null hypothesis is rejected for large values of $\hat\lambda_m$. For every fixed alternative distribution for $Y$ there is a unique alternative distribution for $X$ on $(0,1)$, and associated with that distribution will be another consistent density estimator, given by $p_x(\theta^1(m))$ say. In practice, of course, $\theta^1(m)$ will be neither specified nor known. The following theorem, again proved in Appendix A, gives the asymptotic distribution of the likelihood ratio test statistic under the null hypothesis (10) and also demonstrates consistency against any such fixed alternative.
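To fix ideas, the statistic can be sketched as follows. The standardization used below, $(2\hat\lambda_m - m)/\sqrt{2m}$, is an assumption patterned on the chi-squared-type centering in Portnoy (1988) and is not taken from the paper; all function names are illustrative.

```python
# Sketch of the nonparametric likelihood ratio test against uniformity on (0, 1).
# The standardization (2*lam - m)/sqrt(2m) is an assumed Portnoy-type centering.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

def psi_m(theta):
    coeffs = np.r_[theta[::-1], 0.0]
    val, _ = quad(lambda x: np.exp(np.polyval(coeffs, x)), 0.0, 1.0)
    return np.log(val)

def lr_test(x_hat, m):
    """lambda_m = n * [theta_hat' xbar - psi_m(theta_hat)] is the log-likelihood
    ratio of the fitted exponential family against the uniform density p(x) = 1."""
    n = len(x_hat)
    xbar = np.array([np.mean(x_hat ** j) for j in range(1, m + 1)])
    theta = minimize(lambda th: psi_m(th) - th @ xbar, np.zeros(m), method="BFGS").x
    lam = n * (theta @ xbar - psi_m(theta))
    return (2.0 * lam - m) / np.sqrt(2.0 * m)   # assumed standardization

rng = np.random.default_rng(42)
stat_null = lr_test(rng.uniform(size=500), m=3)          # H0 true: moderate values
stat_alt = lr_test(rng.beta(3.0, 1.0, size=500), m=3)    # H0 false: large values
```

Since $\hat\lambda_m \ge 0$ by construction, the standardized statistic is bounded below by $-\sqrt{m/2}$ and diverges under a fixed alternative.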

P. Marsh
Theorem 2 Suppose that Assumption 1 holds, that we construct $\{\hat X_i\}_{i=1}^n$ as described in (2), and that $m, n \to \infty$ with $m^3/n \to 0$. Then:

(i) Under $H_0$, $\hat\Lambda_m \to_d N(0, 1)$.

(ii) Under any fixed alternative $H_1: G(y) \ne F(y; \beta)$, for any $\beta$ and for any finite $\kappa$, $\Pr[\hat\Lambda_m > \kappa] \to 1$.

Theorem 2 generalizes the test of Marsh (2007), establishing asymptotic normality and consistency against fixed alternatives when $\beta$ has to be estimated. Via Claeskens and Hjort (2004) it is demonstrated that, as $n \to \infty$ with $m^3/n \to 0$, the test $\tilde\Lambda_m$ (i.e. the, here, unfeasible test based on the notional sample $\{X_i\}_1^n$) has power against local alternatives parametrized by $\theta_n(m)$ with $c'c = 1$. Heuristically, implicit from the proof of Theorem 2, the properties of the test follow from $\hat\Lambda_m - \tilde\Lambda_m = O_p\left(\sqrt{m/n}\right)$, and so $\hat\Lambda_m$ has power against that same rate of local alternatives.

Testing for normality or exponentiality
The likelihood ratio test $\hat\Lambda_m$ is asymptotically pivotal, specifically standard normal. Competitor tests, such as the Kolmogorov-Smirnov (KS), Cramér-von Mises (CM) and Anderson-Darling (AD) tests (mathematically detailed in Stephens 1976 or Conover 1999), are not pivotal, although asymptotic critical values are readily available for all cases of testing for exponentiality and normality.
First we will demonstrate that asymptotic critical values for the nonparametric likelihood tests do indeed give close to nominal size for large values of $n$ and $m$. We are interested in testing the null hypotheses of exponentiality, $H_0^E$, and normality, $H_0^N$, with nominal significance levels 10, 5 and 1% and based on sample sizes $n = 25, 50, 100$ and $200$. Letting $\bar y_n$ and $\hat\sigma_n^2$ be the estimated mean and variance (i.e. $\hat\beta_n = \bar y_n$ for $H_0^E$ and $\hat\beta_n = (\bar y_n, \hat\sigma_n^2)$ for $H_0^N$), the tests are constructed from the mappings to $(0,1)$:

$$\hat X_i = 1 - \exp(-Y_i/\bar y_n) \quad \text{to test } H_0^E, \qquad (11)$$

and

$$\hat X_i = \Phi\left((Y_i - \bar y_n)/\hat\sigma_n\right) \quad \text{to test } H_0^N. \qquad (12)$$

Table 2 in Appendix B provides rejection frequencies for the tests constructed for values of $m = 3, 5, 7, 9, 11, 17$. The left-hand panel of numbers corresponds to testing $H_0^E$ and the right to $H_0^N$; critical values at the 1, 5 and 10% significance levels from the standard normal distribution are used throughout.
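The two mappings to (0, 1) are straightforward to implement. The following sketch assumes the standard probability integral transforms under each fitted null (exponential with estimated mean; normal with estimated mean and ML variance); the function names are illustrative.

```python
# Sketch: estimated probability integral transforms used to build the tests.
# Under each null, X-hat_i = F(Y_i; beta_hat_n) should be close to uniform.
import numpy as np
from scipy.stats import norm

def pit_exponential(y):
    # exponential null with mean estimated by the sample mean
    return 1.0 - np.exp(-y / np.mean(y))

def pit_normal(y):
    # normal null with estimated mean and (ML) variance
    return norm.cdf((y - np.mean(y)) / np.std(y))

rng = np.random.default_rng(7)
x_e = pit_exponential(rng.exponential(2.0, size=400))
x_n = pit_normal(rng.normal(1.0, 3.0, size=400))
# under each null, both samples lie in (0, 1) and are approximately uniform
```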
The purpose of these experiments is only to demonstrate that the finite sample performance of the tests clearly improves as both n and m increase, as predicted by Theorem 2(i). Note the use of three significance levels to better illustrate convergence for large values of both m and n.
Since the competitor tests are not asymptotically pivotal, no comparisons under the null are made; instead, Table 3 compares the 5% size-corrected powers of two variants of the tests, with $m = 3$ and $m = 9$, against the three direct competitors for a single sample size of $n = 100$. Tables 3a and b present rejection frequencies for these tests and the KS, CM and AD tests for testing $H_0^N$ under alternatives in which the data are instead drawn from other models, where $1(.)$ denotes the indicator function. These latter alternatives represent simplistic variants of common types of misspecification in econometric or financial data, i.e. misspecification of a conditional mean, misspecification of a conditional variance, or the possibility of a break in the mean (here half way through the sample). Note that these models imply that (2) will not be IID on $(0,1)$, but ergodicity implies the sample moments will still converge. Finally, Table 3f considers instead testing $H_0^E$ against a further fixed alternative. In each table the left-hand panel corresponds to the case where we construct the test imposing the parameter values specified in the null rather than estimating them (i.e. using the unfeasible test of Marsh 2007). The right-hand panel reports the rejection frequencies for tests based on estimated values, i.e. using (11) and (12), respectively.
The outcomes in Table 3 imply the following broad conclusions. The nonparametric likelihood test based on $\hat\Lambda_3$ is almost uniformly the most powerful, across all alternatives and whether parameters are estimated or not. The lack of power of the most commonly used test, KS, is particularly evident; it is consistently the poorest performing test. The other edf-based tests and $\hat\Lambda_9$ are broadly comparable in terms of their rejection frequencies, although AD is perhaps on average slightly more powerful and CM less powerful.

Bootstrap critical values
The proposed tests require a choice of the dimension $m$. The results presented in Tables 2 and 3 suggest an inevitable compromise: larger values of $m$ imply tests having size closer to nominal, while smaller values of $m$ imply tests having greater power. To overcome this compromise we can consider the properties of these tests when bootstrap critical values are employed instead.
For these tests the bootstrap procedure is as follows. On obtaining the MLE $\hat\beta_n$ and calculating $\hat\Lambda_3$, as described above, generate $B$ samples from the fitted null model $f(y; \hat\beta_n)$ and, on each, re-estimate $\beta$ and recompute the statistic, giving $\hat\Lambda_3^{(1)}, \dots, \hat\Lambda_3^{(B)}$. Denote by $\hat I_{\Lambda^B}$ the indicator that $\hat\Lambda_3$ exceeds the relevant bootstrap quantile. We then reject $H_0$ if $\hat I_{\Lambda^B} = 1$. The required asymptotic justification for the bootstrap is automatic given that $\hat\Lambda_m \to_d N(0,1)$, giving the following corollary to Theorem 2.
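The parametric bootstrap can be sketched generically as follows. Here `fit`, `simulate` and `statistic` are user-supplied stand-ins (assumptions for illustration, not the paper's code); the essential point is that the statistic is recomputed on each bootstrap sample after re-estimating the nuisance parameters.

```python
# Sketch of the parametric bootstrap: simulate from the fitted null model,
# re-estimate parameters and recompute the statistic on each draw, then
# compare the observed statistic with the bootstrap distribution.
import numpy as np

def bootstrap_pvalue(y, fit, simulate, statistic, B=200, seed=0):
    rng = np.random.default_rng(seed)
    beta_hat = fit(y)                          # preliminary (Q)ML estimate
    stat_obs = statistic(y)
    stats = np.empty(B)
    for b in range(B):
        yb = simulate(beta_hat, len(y), rng)   # draw from f(y; beta_hat)
        stats[b] = statistic(yb)               # parameters re-estimated inside
    return np.mean(stats >= stat_obs)          # bootstrap p-value; reject if below alpha

# toy illustration with an exponential null; the placeholder statistic
# compares the sample mean and standard deviation (equal under the null)
fit = lambda y: np.mean(y)
simulate = lambda b, n, rng: rng.exponential(b, size=n)
statistic = lambda y: abs(np.mean(y) / np.std(y) - 1.0)

y0 = np.random.default_rng(3).exponential(1.0, size=200)
p_null = bootstrap_pvalue(y0, fit, simulate, statistic, B=199)
```

Replacing the placeholder statistic with $\hat\Lambda_3$ and the exponential model with the fitted null of interest gives the procedure described above.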

Corollary 2 Under Assumption 1, and if $n, m \to \infty$ with $m^3/n \to 0$, the bootstrap critical values for $\hat\Lambda_m$ are asymptotically valid.
Here we will compare the performance of bootstrap critical values for $\hat\Lambda_3$ with those of CM and AD by repeating many of the experiments of Kojadinovic and Yan (2012). All experiments described in this sub-section are performed on the basis of $B = 200$ bootstrap replications. All nuisance parameters were estimated via maximum likelihood using Mathematica 8's own numerical optimization algorithm.
The first set of experiments mimics those presented in Kojadinovic and Yan (2012, Table 1). Specifically, we define Normal, Logistic, Gamma and Weibull distributions, denoted $N^*$, $L^*$, $\Gamma^*$ and $W^*$, in (13). The specific parameter values for $L^*$, $\Gamma^*$ and $W^*$ are chosen to minimize the relative entropy ($I(\beta)$ in Assumption 1(iii)) of each family to the distribution of $N^*$. Sample sizes of $n = 25, 50, 100, 200$ are used in the experiments described below.

Table 4a contains the finite sample size of each test. It is clear that, under $H_0$, the parametric bootstrap provides highly accurate critical values for all of the tests; on size alone there is nothing to choose between them. It is, however, worth reporting the computational time required for each bootstrap critical value. For the $\hat\Lambda_3$ test, critical values were obtained after 2.0 and 3.2 seconds for sample sizes $n = 100$ and $200$, respectively. The times for the other tests were similar to each other, taking around 0.9 and 2.9 seconds, respectively.

Tables 4b and c contain the finite sample rejection frequencies under various alternative hypotheses, covering all pairwise permutations of the distributions in (13). As with the finite sample sizes it is not possible to pick a clear winner; moreover, where they overlap, the results are in line with those of Kojadinovic and Yan (2012). There is, of course, no uniformly most powerful test of goodness-of-fit, so it is not surprising that the power of $\hat\Lambda_3$ is not always the largest. However, its performance over this range of nulls and alternatives is far less volatile, and in no circumstance is the test dominated by either of the other two.

Conclusions
This paper has generalized the series density estimator of Barron and Sheu (1991) to cover the case where parameters are estimated in the context of misspecified models. The nonparametric likelihood ratio tests of Marsh (2007) can thus be extended to cover the case of estimated parameters. The general aim has been to provide a testing procedure which overcomes the three main criticisms of edf-based tests: that they are not pivotal, have low power, and offer no direction in the case of rejection.
The tests of this paper, in contrast, are shown to be asymptotically standard normal, and they have power advantages over edf tests whether critical values are size corrected or obtained by a consistent bootstrap. This suggests the proposed tests will be much simpler to generalize to the settings of Bai (2003) or Corradi and Swanson (2006). Finally, in the event of rejection, the series density estimator upon which the tests are built may be employed to consistently estimate the quantiles of the density from which the sample is taken.

Acknowledgements This article was funded by the University of Nottingham.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

A Appendix A: Proofs
In order to avoid any ambiguity, throughout this appendix the order of magnitude symbol $O(.)$ is defined by

$$a_{n,m} = O(b_{n,m}) \iff \lim_{m,n \to \infty;\; m^3/n \to 0} \left|\frac{a_{n,m}}{b_{n,m}}\right| \le c_1 < \infty,$$

and analogously for the probabilistic versions $O_p(.)$ and $o_p(.)$. If the quantity under scrutiny does not depend upon the dimension $m$ then the condition $m^3/n \to 0$ becomes redundant.

Proof of Theorem 1:
First recall the definitions of the polynomial sufficient statistics $\bar X(m)$ and $\hat X(m)$. Taking the $j$th element of their difference and noting $\hat X_i = X_i + e_i$, then since $\hat X_i \in (0,1)$ while, as in (3), $e_i = O_p(n^{-1/2})$ and $e_i \in (-1,1)$, the Euclidean distance between the two polynomial sufficient statistics is of the stated order. Consequently, and also from the definition of Euclidean distance, we have the corresponding bound in (15). Consider now $\mu(m)$; then, from the triangle inequality,

$$\|\hat X(m) - \mu(m)\| \le \|\hat X(m) - \bar X(m)\| + \|\bar X(m) - \mu(m)\|,$$

which follows from (15), noting that the same order of magnitude applies for the second distance as in Barron and Sheu (1991, eq. 6.5), which represents the distance in the case that the sequence $\{X_i^j\}_1^n$ were observed directly. Following Barron and Sheu (1991, eq. 6.9) we obtain the analogous bound for the parameter estimates. Given that Assumption 1 assures that the required conditions of Barron and Sheu (1991, Theorem 1) are met, the first two terms in (17) are of the stated orders. Barron and Sheu (1991, Lemma 5), which holds for any two values in $\Omega_m \subset R^m$, here uniquely defined by Eqs. (6) and (7), then implies the required relative entropy bound, and hence the result, as required.
Part (i): To proceed, we have defined $\hat\lambda_m$, where $\hat\theta(m)$ solves (7) or, equivalently, the corresponding moment equations; similarly, the value $0(m)$ defines the uniform density on $(0,1)$. The exponential log-likelihood is strictly concave, so that the mapping $\psi_m'(\theta(m)) = \mu(m)$ is one-to-one between the parameter space $\Theta_m \subset R^m$ and the sample space $\Omega_m \subset R^m$, similar to (8). Application of Barron and Sheu (1991, eq. 5.6), together with (16), thus gives the required expansion. As a consequence of both (18) and (16), the expansions provided in the proofs of Theorems 3.1 and 3.2 of Portnoy (1988) apply for any two pairs of values, here $(\theta(m), 0(m))$ and $(\hat X(m), \mu(m))$.
To continue, noting that expectations under the null hypothesis can be written here as $E_U[.]$, since $\hat X \sim U := U[0,1]$, the uniform distribution with density $p_{0(m)}(x) = 1$, we then have expansions analogous to Portnoy (1988, eqs. 3.5 and 3.6). Subtracting (20) from (19) and applying arguments identical to those given below Portnoy (1988, Theorem 3.1, eq. 3.7) yields the expansion for the log-likelihood ratio, and from the definition of the likelihood ratio test we therefore have an expression as in Portnoy (1988, eq. 3.12). Let $\bar e = \hat X(m) - \bar X(m)$; then, from the proof of Theorem 1, $\|\bar e\|$ is of the order established there. Now define the $m \times 1$ random variable $V_m$. Since the likelihood ratio statistic is parameterization invariant, the likelihood ratio test based on observations on $V_m$ would be identical to that based on $\hat X(m)$. Rather than defining a new triple of values, analogous to those in (5), (6) and (7), in both the parameter space $\Theta_m$ (note that, in particular, the hypothesized value would no longer satisfy $\theta(m) = 0(m)$) and the sample space $\Omega_m$, we will instead, and without any loss of generality, assume a parameterization in which both $E[\bar X(m)] = 0$ and $V[\bar X(m)] = I_m$. Note, however, that it is the unobserved $\bar X(m)$ which is assumed to be standardized, not the observed $\hat X(m)$.

Part (ii): Under any fixed alternative, the density of $\hat X$ on $(0,1)$ is some $u^1(x) \ne 1$, and so let $\theta^1(m)$ be the unique solution to (23). The uniqueness of solutions to (23) implies $\theta^1(m) \ne 0(m)$. To take the least favorable case, define $\theta^1(m) = (\theta^1_1, \theta^1_2, \dots, \theta^1_m)'$ and suppose that $\theta^1_k \ne 0$ for some finite $k$ but that $\theta^1_j = 0$ for all $j \ne k$. The series density estimator is consistent for $\theta^1(m)$, analogous to (18) above, and so we can write the likelihood ratio as

$$\hat\lambda_m = \hat\lambda^1_m + (\hat\lambda_m - \hat\lambda^1_m),$$

where $\hat\lambda^1_m$ is the likelihood ratio for testing $H_1: \theta(m) = \theta^1(m)$, and thus, under $H_1$, we can write $\hat\Lambda_m$ accordingly. Immediate from Part (i) of this theorem is that, as $m, n \to \infty$ with $m^3/n \to 0$, $\hat\Lambda_m$ diverges under $H_1$, and hence $\Pr[\hat\Lambda_m > \kappa] \to 1$, as required.