A class of asymptotically efficient estimators based on sample spacings

In this paper, we consider general classes of estimators based on higher-order sample spacings, called the Generalized Spacings Estimators. Such classes of estimators are obtained by minimizing the Csiszár divergence between the empirical and true distributions for various convex functions, include the “maximum spacing estimators” as well as the maximum likelihood estimators (MLEs) as special cases, and are especially useful when the latter do not exist. These results generalize several earlier studies on spacings-based estimation, by utilizing non-overlapping spacings that are of an order which increases with the sample size. These estimators are shown to be consistent as well as asymptotically normal under a fairly general set of regularity conditions. When the step size and the number of spacings grow with the sample size, an asymptotically efficient class of estimators, called the “Minimum Power Divergence Estimators,” are shown to exist. Simulation studies give further support to the performance of these asymptotically efficient estimators in finite samples and compare well relative to the MLEs. Unlike the MLEs, some of these estimators are also shown to be quite robust under heavy contamination.


Introduction
Let {F θ , θ ∈ Θ} be a family of absolutely continuous distribution functions on the real line and denote the corresponding densities by { f θ , θ ∈ Θ}. For any convex function φ on the positive half real line, the quantity is called the φ-divergence between the distributions F θ and F θ 0 . The φ-divergences, introduced by Csiszár (1963) as information-type measures, have several statistical applications including estimation. Although Csiszár (1977) describes how this measure can also be used for discrete distributions, we are concerned with the case of absolutely continuous distributions in the present paper. Let ξ 1 , . . . , ξ n−1 be a sequence of independent and identically distributed (i.i.d.) random variables (r.v.'s) from F θ 0 , θ 0 ∈ Θ. The goal is to estimate the unknown parameter θ 0 . If φ(x) = − log x, then S φ (θ ) is known as the Kullback-Leibler divergence between F θ and F θ 0 . In this case, a consistent estimator of this S φ (θ ) is given by Minimization of this statistic with respect to θ is equivalent to maximization of the log-likelihood function, n i=1 log f θ (ξ i ). Thus, by finding a value of θ ∈ Θ that minimizes (1), we obtain the well-known maximum likelihood estimator (MLE) of θ 0 . Note, in order to minimize the right-hand side of (1) with respect to θ , we do not need to know the value of θ 0 . On the other hand, for convex functions other than φ(x) = − log x, a minimization of the left-hand side of (1) with respect to θ would require the knowledge of θ 0 , the parameter that is to be estimated. Thus, for general convex φ functions, it is not obvious how to approximate S φ (θ ) in order to obtain a statistic that can be used for estimating the parameters.
One solution to this problem was provided by Beran (1977), who proposed that f θ 0 could be estimated by a suitable nonparametric estimatorf (e.g., a kernel estimator) in the first stage, and in the second stage, the estimator of θ 0 should be chosen as any parameter value θ ∈ Θ that minimizes the approximation of S φ (θ ). In the estimation method suggested by Beran (1977), the function φ(x) = 1 2 |1 − √ x| 2 was used, and this particular case of φ-divergence is known as the squared Hellinger distance. Read and Cressie (1988, p. 124) put forward a similar idea based on power divergences, and it should be noted that the family of power divergences is a subclass of the family of φ-divergences. The general approach of estimating parameters by minimizing a distance (or a divergence) between a nonparametric density estimate and the model density over the parameter space has been further extended by subsequent authors, and many of these procedures combine strong robustness features with asymptotic efficiency. See Basu et al. (2011) and the references therein for details.
Here, we propose an alternate approach obtained by approximating S φ (θ ), using the sample spacings. Let ξ (1) ≤ · · · ≤ ξ (n−1) denote the ordered sample of ξ 1 , · · · , ξ n−1 , and let ξ (0) = − ∞ and ξ (n) = ∞. For an integer m = m(n), sufficiently smaller than n, we put k = n/m. Without loss of generality, when stating asymptotic results, we may assume that k = k(n) is an integer and define non-overlapping mth-order spacings as In Eq. (3), the reciprocal of the argument of φ is related to a nonparametric histogram density estimator considered in Prakasa Rao (1983, Section 2.4). More precisely, When both k and m increase with the sample size, then, for large n, where m/2 is the largest integer smaller than or equal to m/2. Thus, intuitively, as k, m → ∞ as n → ∞, S φ,n (θ ) should converge in probability to S φ (θ ). Furthermore, since φ is a convex function, by Jensen's inequality, we have S φ (θ ) ≥ S φ (θ 0 ) = φ(1). This suggests that if the distribution F θ is a smooth function in θ , an argument minimizing S φ,n (θ ), θ ∈ Θ, should be close to the true value of θ 0 , and hence be a reasonable estimator. An argument θ =θ φ,n which minimizes S φ,n (θ ), θ ∈ Θ, will be referred to as a Generalized Spacings Estimator (GSE) of θ 0 . When convenient, a root of the equation (d/dθ)S φ,n (θ ) = 0 will also be referred to as a GSE.
By using different functions φ and different values of m, we get various criteria for statistical estimation. The ideas behind this proposed family of estimation methods generalize the ideas behind the maximum spacing (MSP) method, as introduced by Ranneby (1984); the same method was introduced from a different point of view by Cheng and Amin (1983). Ranneby derived the MSP method from an approximation of the Kullback-Leibler divergence between F θ and F θ 0 , i.e., S φ (θ ) with φ(x) = − log x and defined the MSP estimator as any parameter value in Θ that minimizes (3) with φ(x) = − log x and m = 1. This estimator has been shown, under general conditions, to be consistent (Ranneby 1984;Ekström 1996Ekström , 1998Shao and Hahn 1999) and asymptotically efficient (Shao and Hahn 1994;Ghosh and Jammalamadaka 2001). Based on the maximum entropy principle, Kapur and Kesavan (1992) proposed to estimate θ 0 by selecting the value of θ ∈ Θ that minimizes (3) with φ(x) = x log x and m = 1. With this particular choice of φ-function, S φ (θ ) becomes the Kullback-Leibler divergence between F θ 0 and F θ (rather than F θ and F θ 0 ). We refer to Ekström (2008) for a survey of estimation methods based on spacings and the Kullback-Leibler divergence. Ekström (1997Ekström ( , 2001 and Ghosh and Jammalamadaka (2001) considered a subclass of the estimation methods proposed in the current paper, namely the GSEs with m = 1. Under general regularity conditions, it turns out that this subclass produces estimators that are consistent and asymptotically normal and that the MSP estimator, which corresponds to the special case when φ(x) = − log x, has the smallest asymptotic variance in this subclass (Ekström 1997;Ghosh and Jammalamadaka 2001). Estimators based on overlapping (rather than non-overlapping) mth-order spacings was considered in Ekström (1997Ekström ( , 2008, where small Monte Carlo studies indicated, in an asymptotic sense, that larger orders of spacings are always better (when φ(x) is a convex function other than the negative log function). Menéndez et al. (1997Menéndez et al. ( , 2001a and Mayoral et al. (2003) consider minimum divergence estimators based on spacings that are related to our GSEs. In the asymptotics, they use only a fixed number of spacings, and their results suggest that GSEs will not be asymptotically fully efficient when k in (3) is held fixed.
In the present paper, it is shown that GSEs are consistent and asymptotically normal under general conditions. In contrast to the aforementioned papers, we allow both the number of spacings, k, and the order of spacings, m, to increase to infinity with the sample size. We show that if both of them do tend to infinity, then there exists a class of asymptotically efficient GSEs that we call the Minimum Power Divergence Estimators (MPDEs). In contrast, if m is held fixed, then the only asymptotically optimal GSE is the one based on φ(x) = − log x. The main results are stated in the next section, followed by a simulation study assessing (i) the performance of these estimators for different m and n and (ii) the robustness of these estimators under contamination. In the latter case, we also assess a suggested data-driven rule for choosing the order of spacings, m. Detailed proofs are to be found in "Appendix" and online Supplementary Material.

Main results
In this section, we state the main results and the assumptions that are needed. Unless otherwise stated, it will henceforth be assumed that ξ 1 , . . . , ξ n−1 are i.i.d. r.v.'s from F θ 0 , θ 0 ∈ Θ.

Assumption 2
The parameter space Θ ⊂ R contains an open interval of which θ 0 is an interior point, and F θ (·) is differentiable with respect to θ .

Assumption 3
The function φ(t), t > 0, satisfies the following conditions: (a) it is strictly convex; and c 3 are some nonnegative constants; (d) it is twice differentiable.
Assumption 3 is valid for a wide class of convex functions including the following, where the cases λ = − 1 and λ = 0 are given by continuity, i.e., by noting that lim λ→0 (x λ − 1)/λ = log x.
For the purpose of the next theorem, we will use the notation f (x, θ) for the density f θ (x), and we denote its partial derivatives by Let W 1 , W 2 , . . . be a sequence of independent standard exponentially distributed r.v.'s, We then have the following important result: Theorem 2 Let m = o(n). In addition to Assumptions 1 and 2, assume the following conditions: (i) The function φ is a strictly convex function and thrice continuously differentiable.
Then, for any consistent rootθ φ,n of (d/dθ)S φ,n (θ ) = 0, we have and The theorems will be proved using proof methods related to those of, e.g., Lehmann and Casella (1998) for the MLE. A generalization to the multiparameter case is possible, much like in the case of maximum likelihood estimation. However, as in Lehmann and Casella (1998) for the MLE, this will require somewhat more complex assumptions and proofs and we refrain from attempting it here.
For the case m = 1 and by assuming that Ekström (1997) and Ghosh and Jammalamadaka (2001) showed that e m (φ) ≥ 1, with equality if and only if φ(x) = a log x +bx +c, for some constants a, b, and c. That this inequality holds true for general m can be seen by integrating by parts, i.e., assuming that where the inequality on the right-hand side follows by noting that where a < 0. It should be observed that the corresponding estimatorθ φ,n does not depend on the chosen values of a < 0, b, and c. Thus, without loss of generality, we may choose a = − 1 and b = c = 0, i.e., for m fixed, the asymptotically optimal estimatorθ φ,n is based on the function φ(x) = − log x. If, however, we let m → ∞, then the asymptotically optimal estimator is no longer unique. For example, let us consider the family of power divergences (Read and Cressie 1988;Pardo 2006) given by where the cases λ = − 1 and λ = 0 are given by continuity, i.e., Pardo and Pardo 2000).
, the family of power divergences is a subclass of the family of φ-divergences.
, and Theorem 1 establishes the existence of a consistent root of the equation (d/dθ)S φ λ ,n (θ ) = 0. The next result asserts that any such sequence is asymptotically normal and efficient when k, m → ∞ as n → ∞.
It is desired to have statistical estimation procedures that perform well when an assumed parametric model is correctly specified, thereby attaining high efficiency at the assumed model. A problem is that assumed parametric assumptions are almost never literally true. Thus, in addition, it is desired to have estimation procedures that are relatively insensitive to small departures from the model assumptions and that somewhat larger deviations from the model assumptions do not cause a "catastrophe" (Huber and Ronchetti 2009). Procedures satisfying these features are called robust. Due to its relationship with the estimator suggested by Beran (1977), we conjecture that our minimum Hellinger distance estimator, i.e., the MPDE with λ = − 1/2, is robust (in the sense of Beran 1977 and Lindsay 1994). In addition, by arguments put forward in Lindsay (1994), we conjecture that GSEs based on bounded φ functions are robust with respect to contaminations of the original data (cf. Mayoral et al. 2003). In the next section, we consider the MPDE with λ = − 1/2 and apply Monte Carlo simulations to compare its performance with those of the MLE and the MPDEs with λ = − 1 and − 0.9.
First, we consider estimating the mean θ of a N (θ, 1) distribution and compare the root mean square errors (RMSEs) of various MPDEs, with the RMSE of the MLE. These are shown in Fig. 1. MPDEs are computed for λ = − 1, − 0.9, and − 0.5, and for all values of m, the order of the spacings, which are divisors of n. We define a relative RMSE of an MPDE to be its RMSE divided by the RMSE of the MLE. Each RMSE was estimated from 1000 Monte Carlo samples with n = 840 from the N (θ, 1) distribution with θ = 0. We present relative RMSEs for m up to 150, because when m is larger than that the relative RMSEs tend to be quite large in comparison. From Fig. 1, we see that MPDEs with λ equal to − 1 or − 0.9 are about as good as the MLE for comparatively small values of m (and less well for larger m). The MPDE with λ = − 0.5 is not quite as good in terms of RMSE. For example, with an optimally chosen m, it had an RMSE about 0.5% larger than that of the MLE. The simulation results indicate that the optimal choice of m is 15, 15, and 14 for λ = − 1, − 0.9, Next, we consider the issue of stability/robustness of these estimators. It is known that the MSP estimator (Cheng and Amin 1983;Ranneby 1984), i.e., the MPDE with λ = − 1 and m = 1, much like the MLE, suffers from lack of stability under even small deviations from the underlying model, i.e., the distributions of the MSP and ML estimators can be greatly perturbed if the assigned model is only approximately true. This is demonstrated in a simulation study by Nordahl (1992) and by Fujisawa and Eguchi (2008) in a numerical study on the MLE. As in Nordahl (1992), we will assume a proportion 1−ε of the data is generated from a normal distribution, while a proportion ε is generated by some unknown mechanism that produces "outliers." For example, measurements are made, which are 95% of the time correct, while 5% of the time operator reading/writing errors are made or the recording instrument malfunctions. Therefore, we assume that a random sample ξ 1 , . . . , ξ n−1 is generated from an ε- in which H (x) denotes an arbitrary distribution that models the outliers and ε is the level of contamination. Of interest is to estimate θ 0 , the mean of the observations in the case when no recording errors occur. If nε is rather small, we may have few observations from H , making it difficult to assess the model for H . In such a case, instead of modeling the mixture distribution G, one may (wrongly) assume that all observations come from Φ(x − θ) with θ = θ 0 , and then use an estimation method that provides a good estimate of θ 0 even in presence of outliers coming from H . That is, robust estimation aims at finding an estimatorθ that efficiently estimates θ 0 even though the data are contaminated by an outlier distribution H (Fujisawa and Eguchi 2008).
In our Monte Carlo simulation, we used H (x) = Φ((x − ρ)/τ ), τ > 0, with ρ = 10 and τ = 1. For each ε = 0.0, 0.1, . . . , 0.4, we generated 1000 Monte Carlo samples with n = 840 from G(x − θ 0 ) with θ 0 = 0, and for every sample, the MLE of θ 0 was computed using the model F θ (x) = Φ(x − θ). MPDEs for this case were also computed for λ = − 1, − 0.9, and − 0.5 and for the previously found optimal values of m for the respective values of λ. For each level of contamination, we used the 1000 samples for computing estimated RMSEs for the respective estimators of θ 0 . The resulting RMSEs are shown in Fig. 2. In case of contamination, we see that the MPDEs with λ equal to − 0.9 or − 0.5 are superior to the MLE and the MPDE with λ = − 1. In other words, the MLE and MPDEs such as the MSP estimator, which all can be derived from the Kullback-Leibler divergence, perform poorly under contamination, and other MPDEs are to be preferred.
Looking at Fig. 1, it is clear that the choice of m, the order of spacings, is important for the quality of MPDE estimators. We now propose a data-based approach for choosing m for MPDEs (and more generally for GSEs). No asymptotic optimality is claimed for the approach. The main purpose is rather to provide sensible answers for finite sample sizes. For a given λ (or φ-function), letθ m denote the MPDE (or GSE) using the order of spacings m. The suggested approach is given by the following algorithm: Step 1 Computeθ 1 .
Step 2 For r in 1, . . . , R: Draw a bootstrap sample x r ,1 , . . . , x r ,n−1 from Fθ 1 . For some set of positive integers, M, computeθ r ,m for each m ∈ M, whereθ r ,m denotes the r th bootstrap replicate ofθ m .
Step 3 Choose m opt = arg min m∈M Under the same settings as in Fig. 2 and with m chosen according to the above algorithm, with M defined as the set of divisors of n, we consider two MPDEs, with λ = − 0.9 and − 0.5, respectively. We compare these with Karunamuni and Wu's (2011) one-step minimum Hellinger distance estimator (MHDE), obtained from a one-step Newton-Raphson approximation to the solution of the equationŜ φ (θ ) = 0, whereŜ φ (θ ) is defined as in (2), with φ(x) = 1 2 |1 − √ x| 2 andf a kernel density density estimator. Karunamuni and Wu (ibid.) show that their one-step MHDE has the same asymptotic behavior as Beran's (1977) MHDE, as long as the initial estimator in the Newton-Raphson algorithm is reasonably good and that it retains excellent robustness properties of the MHDEs. In our simulations, we used the median as the initial estimator of θ 0 , and the kernel estimator was based on the Epanechnikov kernel with bandwidth chosen to be (15e) 1/5 (π/32) 1/10σ n −1/5 , whereσ = median{|ξ i − median{ξ j }|}/Φ(3/4) (cf. Basu et al. 2011, pp. 108-109). The resulting RMSEs are shown in Fig. 3. When ε = 0, the MPDE with λ = − 0.9 is the winner in terms of RMSE (In comparison, the MPDE estimators with λ = − 0.9 and − 0.5, and the onestep MHDE had RMSEs that were about 0.0, 1.8, and 0.4 percent larger than that of the MLE, respectively). For ε = 0.1, the one-step MHDE performs somewhat better than the two MPDEs, but for ε = 0.2, 0.3, and 0.4, the most efficient estimator is the MPDE with λ = − 0.5. Under heavy contamination, i.e., for levels of contamination equal to or larger than 0.3, both MPDEs are clearly more robust than the one-step MHDE.
By Corollary 1, MPDEs are asymptotically normal and efficient under a general set of regularity conditions. When applying an MPDE, a particular value of λ needs to be chosen. In our simulations, we considered three choices, λ = − 1, − 0.9, and − 0.5. Much like the MLE, the MPDE with λ = − 1 can be greatly perturbed if the assigned model is only approximately true. In the choice between λ = − 0.9 and λ = − 0.5, the former appears to provide better estimates if there is no contamination, while the latter seems to give more robust estimators when some contamination is suspected.

Concluding remarks
In this paper, we propose classes of estimators, called Generalized Spacings Estimators or GSEs, based on non-overlapping higher-order spacings and show that under some regularity conditions, they are consistent and asymptotically normal. Within these classes, we demonstrate the existence of asymptotically efficient estimators, called MPDEs. Through simulations, we demonstrate that they perform well also in moderate sample sizes relative to the MLEs. However, unlike the MLEs, some of these spacings estimators are quite robust under contamination. In this article, we also propose a data-driven choice for the order of spacings, m, based on bootstrapping, and the Monte Carlo studies indicate that this practical way of choosing m leads to MPDEs which perform comparatively well and even much better at higher levels of contamination, than the one-step MHDEs proposed in the literature. Moreover, the GSEs suggested here can be suitably extended and used in more general situations. For example, by using mth nearest neighbor balls as a multidimensional analogue to univariate mth-order spacings, our proposed classes of estimators can be extended to multivariate observations (Kuljus and Ranneby (2015) studied this problem for m = 1), but specifics need further exploration. Another possibility is to define GSEs based on overlapping spacings of order m. The derivation of the asymptotic distribution of such estimators is an open problem.
Proof See online Supplementary Material.

Proof of Theorem 1
The proof is very similar to that of Theorem 3.7 in Lehmann and Casella (1998, p. 447), and therefore, we omit the details. Note, however, that in Lehmann and Casella (1998), the log-likelihood needs to be replaced with S n (θ ), and the reference to Theorem 3.2 needs to be replaced with Proposition 1 of the current paper.
The second term here tends to zero in probability due to continuity of the function g θ 0 and (9). For the first term, the central limit theorem (Corollary 1 and Remark 1 of Mirakhmedov (2005)) is valid with asymptotical expectation (1) Thus, putting last two relations into (24), we obtain Consider ∇ k . By noting that and by using Lemma 1, we see that On the other hand, it is easy to see that 1 0 g θ 0 (u)du = −I (θ 0 ), and therefore, by (11), ∇ k p → 0. This fact, together with (23) and (25), implies that Similar arguments show that Theorem 2 follows from (8), (22), (26), and (27).
Proof of Corollary 1 By straightforward algebra, we find that By Stirling's approximation formula, Γ (x + 1) = √ 2π x(x/e) x (1 + O(x −1 )), and we find for large enough m that where c λ is a constant depending on λ only. Hence, the corollary follows from Theorem 2.