Minimax estimation of a bivariate cumulative distribution function

The problem of estimating a bivariate cumulative distribution function F under the weighted squared error loss and the weighted Cramér–von Mises loss is considered. No restrictions are imposed on the unknown function F. Estimators that are minimax among procedures that are affine transformations of the bivariate empirical distribution function are found. It is then proved that these procedures are minimax among all decision rules. The result for the weighted squared error loss is generalized to the case where F is assumed to be a continuous cumulative distribution function. Extensions to higher dimensions are briefly discussed.


Introduction
Minimax estimation of a one-dimensional cumulative distribution function (c.d.f.) was initiated by Aggarwal (1955) and has been extensively studied since then (see the references given in the next paragraph). To the best of our knowledge, extensions of this approach to higher dimensions have not been investigated. In this paper we therefore consider estimating a bivariate c.d.f. and we generalize to this case some known results concerning minimax estimation of a univariate c.d.f. We also briefly discuss a multivariate generalization of these results.
Minimax estimation of a univariate c.d.f. F was considered by many authors. Using an invariance structure relative to the group of continuous and strictly increasing transformations, Aggarwal (1955) found the best invariant estimator of a continuous c.d.f. F under the Cramér–von Mises loss. Ferguson (1967, pages 191–197) generalized this result to the case where L(F, F̂) = ∫_R [F̂(t) − F(t)]² h(F(t)) dF(t), where h(·) is a continuous and positive function. He also asked whether the best invariant estimators are minimax among the larger class of (not necessarily invariant) procedures. Yu (1992b) established the minimaxity of the best invariant procedure in Ferguson's setup; in particular, he found the minimax estimator of a continuous c.d.f. F under losses of this form. Analogous minimaxity results were obtained by Mohammadi and van Zwet (2002) (entropy loss), Ning and Xie (2007) (Linex loss), and Stępień-Baran (2010) (strictly convex loss). Phadia and Yu (1992) proved minimaxity of the empirical distribution function under the Kolmogorov–Smirnov loss sup_{t∈R} |F̂(t) − F(t)|. Jafari Jozani et al. (2014) considered the problem of estimating a continuous distribution function F, as well as meaningful functions τ(F), under a large class of loss functions. They obtained best invariant estimators and established their minimaxity for Hölder continuous τ's and strictly bowl-shaped losses with a bounded derivative. Phadia (1973) considered a different model in which it is not assumed that F is continuous. He found the minimax estimator of F under the noninvariant loss function L(F, F̂) = ∫_R [F̂(t) − F(t)]² [F(t)]^{−δ} [1 − F(t)]^{−γ} dw(t), where δ, γ ∈ {0, 1} are fixed numbers and w is a given non-null finite measure on (R, B_R), with B_R denoting the σ-algebra of Borel sets on R. Yu (1992a) considered minimax estimation of F with a more general loss function of the form L(F, F̂) = ∫_R G(F̂(t) − F(t)) h(F(t)) dF(t), where G(·) is quadratic and h(·) is continuous and positive. He proved that this problem is equivalent to that of finding a minimax estimator of a binomial proportion p under the loss L_B(p, d) = G(p − d) h(p).
His result was generalized by Jokiel-Rokita and Magiera (2007) to the case where G(·) is convex.
In the first part of the paper we generalize the result of Phadia (1973) to the two-dimensional case. Let (X_1, Y_1), (X_2, Y_2), …, (X_n, Y_n) be i.i.d. two-dimensional random vectors from an unknown bivariate (not necessarily continuous) c.d.f. F on R². On the basis of this sample we find a minimax estimator F̂_1 of F under the weighted squared error loss

L_1(F, F̂) = ∫_{R²} [F̂(s, t) − F(s, t)]² [F(s, t)]^{−δ} [1 − F(s, t)]^{−γ} dW(s, t).    (1)

Here W is a given non-null finite measure on (R², B_{R²}) and δ, γ ∈ {0, 1} are fixed numbers. In the case where δ = 1 or γ = 1, the loss function L_1 is more sensitive to departures from F in the tails of the distribution. We also show that the decision rule F̂_1 remains minimax even if F is assumed to be absolutely continuous with respect to the Lebesgue measure on R². In the second part of the paper we find a minimax estimator F̂_2 of an arbitrary bivariate c.d.f. F under the invariant weighted Cramér–von Mises loss

L_2(F, F̂) = ∫_{R²} [F̂(s, t) − F(s, t)]² [F(s, t)]^{−δ} [1 − F(s, t)]^{−γ} dF(s, t).    (2)

The latter result cannot be viewed as a natural generalization of the above-mentioned result of Yu (1992b), because we derive F̂_2 without assuming that F is continuous. Since we use a much larger class of c.d.f.'s, the minimax inference for the loss L_2 becomes easier, but it is still nontrivial.
In the univariate case, the minimax estimators φ and d_0 found by Phadia (1973) and Yu (1992b), respectively, are linear functions of the empirical distribution function (e.d.f.). Therefore, it is not surprising that the minimax procedures F̂_1 and F̂_2 are linear functions of the bivariate e.d.f. Moreover, for each δ, γ ∈ {0, 1}, F̂_1 = F̂_2, and F̂_1 has the same form as φ except that the univariate e.d.f. is replaced by the bivariate e.d.f. On the other hand, φ ≠ d_0, and the relationship between F̂_2 and d_0 is slightly different from that between F̂_1 and φ. The reason for this difference is that the parameter space considered by Yu (1992b) contains only continuous c.d.f.'s.

Formulation of the problem
In the problem of estimating a univariate distribution function, the action space is restricted to nondecreasing functions a(·) : R → [0, 1]. To define the appropriate action space for estimating a bivariate distribution function, we note that the two-dimensional analog of a nondecreasing function of one variable is a 2-increasing function.
Let F̂(Z; s, t) be an estimator of a bivariate distribution function F(s, t), based on the sample Z = ((X_1, Y_1), …, (X_n, Y_n)). To simplify the notation we also write F̂(·, ·) and F̂(s, t) for F̂(Z; ·, ·) and F̂(Z; s, t), respectively. We assume that for each realization z of Z, the decision rule F̂(z; ·, ·) is an element of the action space A consisting of all 2-increasing functions a(·, ·) : R² → [0, 1]. It is obvious that any bivariate c.d.f. F belongs to the class A. However, A also contains estimates a(·, ·) which are not c.d.f.'s, because they do not satisfy the conditions a(−∞, −∞) = 0, a(−∞, y) = 0, a(x, −∞) = 0 and a(∞, ∞) = 1. Such estimates are often referred to as defective distribution functions. We include them in the action space in order to obtain satisfactory results. The necessity of using defective distribution functions to estimate a univariate distribution function was already recognized by Aggarwal (1955).
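Since the 2-increasing property is central to the definition of the action space, it may help to see it checked numerically: a function a is 2-increasing when every rectangle increment a(s₂, t₂) − a(s₁, t₂) − a(s₂, t₁) + a(s₁, t₁) with s₁ ≤ s₂, t₁ ≤ t₂ is nonnegative. A minimal Python sketch (the function names and the test grid are our own illustrative choices, not from the paper):

```python
import itertools

def rect_increment(a, s1, t1, s2, t2):
    """Rectangle increment of a over [s1, s2] x [t1, t2]."""
    return a(s2, t2) - a(s1, t2) - a(s2, t1) + a(s1, t1)

def is_2_increasing(a, grid):
    """Check the 2-increasing property of a on all rectangles with grid corners."""
    pts = sorted(grid)
    return all(
        rect_increment(a, s1, t1, s2, t2) >= 0
        for (s1, s2) in itertools.combinations(pts, 2)
        for (t1, t2) in itertools.combinations(pts, 2)
    )

# The independence c.d.f. a(s, t) = s * t is 2-increasing ...
print(is_2_increasing(lambda s, t: s * t, [0.0, 0.25, 0.5, 1.0]))      # True
# ... while a(s, t) = min(s + t, 1) is not: the increment over
# [0, 0.5] x [0.5, 1] equals 1 - 1 - 1 + 0.5 = -0.5.
print(is_2_increasing(lambda s, t: min(s + t, 1.0), [0.0, 0.5, 1.0]))  # False
```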
The family of all estimators F̂ which satisfy the above condition is denoted by D, i.e. we put

D = {F̂ : F̂(z; ·, ·) ∈ A for every realization z of Z}.

We use the symbol D_A to denote the subclass of D which consists of the affine estimators

F̂_{ab}(s, t) = a ∑_{i=1}^n 1(X_i ≤ s, Y_i ≤ t) + b,    a, b ∈ R.

The estimates from D_A are computationally tractable. An important member of D_A is the empirical bivariate c.d.f. defined by

F̂_n(s, t) = (1/n) ∑_{i=1}^n 1(X_i ≤ s, Y_i ≤ t).

Let E_F denote the expectation with respect to the c.d.f. F. Then the risk function of an estimate F̂ under the loss function L_i, i = 1, 2 [see (1) and (2)], is given by

R_i(F̂, F) = E_F L_i(F, F̂).

When the value of i is clear from the context, we write R instead of R_i. We are interested in finding the minimax estimator of F, i.e. the estimator F̂_N ∈ D for which

sup_{F∈F} R(F̂_N, F) = inf_{F̂∈D} sup_{F∈F} R(F̂, F).

Here F is the family of all bivariate distributions on R², i.e. F = {F : F is a cumulative distribution function on R²}.
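The bivariate e.d.f. and the affine estimators from D_A are straightforward to compute; a minimal sketch (function names are illustrative):

```python
import numpy as np

def ecdf2(sample, s, t):
    """Bivariate empirical c.d.f.: fraction of points with X_i <= s and Y_i <= t."""
    x, y = sample[:, 0], sample[:, 1]
    return np.mean((x <= s) & (y <= t))

def affine_estimate(sample, s, t, a, b):
    """Affine estimator from D_A: a * sum_i 1(X_i <= s, Y_i <= t) + b."""
    n = len(sample)
    return a * n * ecdf2(sample, s, t) + b

sample = np.array([[0.1, 0.2], [0.4, 0.9], [0.8, 0.3], [0.5, 0.5]])
print(ecdf2(sample, 0.5, 0.5))  # 0.5  (two of the four points lie in the quadrant)
# With a = 1/n and b = 0 the affine estimator reduces to the e.d.f. itself.
print(affine_estimate(sample, 0.5, 0.5, a=1/4, b=0.0))  # 0.5
```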
Unfortunately, many estimators from the class D are not computationally tractable and finding F̂_N may be a difficult task. Therefore, we first look for affine minimax estimators. We say that a decision rule F̂_A ∈ D_A is minimax affine if

sup_{F∈F} R(F̂_A, F) = inf_{F̂∈D_A} sup_{F∈F} R(F̂, F) =: ρ_A,

and we write ρ_N = inf_{F̂∈D} sup_{F∈F} R(F̂, F). The quantities ρ_N and ρ_A are called the minimax risk and the minimax affine risk, respectively. Clearly, ρ_N ≤ ρ_A. We prove that under both L_1 and L_2 these two risks are equal, i.e. ρ_A = ρ_N. Hence F̂_A is minimax among all estimators from the class D.
We also discuss the minimax approach in the case where the unknown distribution F is assumed to be absolutely continuous with respect to the Lebesgue measure. To this end, we denote F_AC = {F : F is an absolutely continuous distribution function on R²}.

Auxiliary results
As mentioned above, Yu (1992a) proved that results on minimax estimation of a binomial proportion p can help to find minimax estimators of a univariate c.d.f. It is not surprising that these results can also be applied in the bivariate case. For the sake of completeness we recall here the following well-known facts concerning inference on a binomial proportion, which will be used in the next sections (see, e.g., Hodges and Lehmann 1950, Olkin and Sobel 1979, Lehmann and Casella 1998, pages 311–312, or Phadia 1973). Suppose that on the basis of an observation X from the binomial distribution B(n, p) we wish to estimate the unknown success probability p under the weighted squared error loss

L_B(p, d) = (p − d)² p^{−δ} (1 − p)^{−γ},    δ, γ ∈ {0, 1}.

For each of the four choices of (δ, γ), the affine estimator d_1(X) = a_1 X + b_1 attains the minimax risk

r_1 = inf_d sup_{0<p<1} E_p L_B(p, d(X)),    (4)

where the infimum is over measurable functions d : R → R. The constants a_1, b_1, r_1 are given by

(δ, γ) = (0, 0):  a_1 = 1/(n + √n),  b_1 = √n/(2(n + √n)),  r_1 = 1/(4(1 + √n)²),
(δ, γ) = (1, 0):  a_1 = 1/(n + √n),  b_1 = 0,               r_1 = 1/(1 + √n)²,
(δ, γ) = (0, 1):  a_1 = 1/(n + √n),  b_1 = √n/(n + √n),     r_1 = 1/(1 + √n)²,
(δ, γ) = (1, 1):  a_1 = 1/n,         b_1 = 0,               r_1 = 1/n.    (5)

In each of the four cases, d_1(X) has constant risk and is the Bayes estimator of p when p has the beta prior B(α, β) with (α, β) equal to (√n/2, √n/2) in the first case, (1, √n) in the second, (√n, 1) in the third and (1, 1) in the fourth. Therefore, d_1(X) is minimax (see Lehmann and Casella 1998, Corollary 1.5, page 311).
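For the first case (δ = γ = 0, plain squared error), d_1(X) = (X + √n/2)/(n + √n), and its constant risk can be verified exactly, since the risk of any affine rule aX + b under squared error is a²np(1 − p) + (anp + b − p)². A short Python check (the value of n is an arbitrary choice):

```python
import math

n = 25
a = 1.0 / (n + math.sqrt(n))                  # slope of d_1(X) = a*X + b
b = math.sqrt(n) / (2 * (n + math.sqrt(n)))   # intercept of d_1

def risk(p):
    """Exact squared-error risk of the affine rule a*X + b with X ~ B(n, p)."""
    variance = a * a * n * p * (1 - p)
    bias_sq = (a * n * p + b - p) ** 2
    return variance + bias_sq

target = 1.0 / (4 * (1 + math.sqrt(n)) ** 2)  # = n / (4 (n + sqrt(n))^2)
for p in [0.05, 0.3, 0.5, 0.9]:
    assert abs(risk(p) - target) < 1e-12      # constant risk, as minimaxity requires
print(target)  # 1/144 for n = 25
```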
From (1) it follows that the risk function of an affine estimator F̂_{ab}(s, t) = a ∑_{i=1}^n 1(X_i ≤ s, Y_i ≤ t) + b can be written, at each point (s, t), in terms of the risk of the affine rule a N(s, t) + b in a binomial problem, where N(s, t) = ∑_{i=1}^n 1(X_i ≤ s, Y_i ≤ t) has the B(n, F(s, t)) distribution; integrating with respect to W yields the expression (6) for R_1(F̂_{ab}, F). To find the minimax affine rule, we first derive a lower bound for the minimax affine risk ρ_A. The method used here is closely related to that of Phadia (1973). For any fixed integer k ≥ 1 and any p ∈ (0, 1), let F_{kp} be an absolutely continuous bivariate c.d.f. satisfying F_{kp}(s, t) = p for all (s, t) ∈ [−k, k]². Restricting the supremum over F ∈ F to the family {F_{kp} : p ∈ (0, 1)} yields the inequality (7), which bounds ρ_A from below. We use this inequality to prove the following lemma.

Lemma 1 Let δ, γ ∈ {0, 1} be fixed and let a_1, b_1, r_1 be the corresponding numbers defined by (5).
Then F̂_A = F̂_{a_1 b_1} is the minimax affine rule under the loss function L_1 and the minimax affine risk is given by ρ_A = r_1 W(R²).

Proof Since (7) holds for any positive integer k, we conclude by (3) that ρ_A ≥ r_1 W([−k, k]²) for every k, and hence ρ_A ≥ r_1 W(R²). By straightforward calculations it is easy to verify that the integrand in the second line of (6) does not depend on F and equals r_1 when a = a_1 and b = b_1 (see again Phadia 1973). Therefore, the risk function of the estimator F̂_A = F̂_{a_1 b_1} is constant and equal to r_1 W(R²).

Remark 1 Note that the constants a_1, b_1 and r_1, which define both the form of the minimax affine rule and the corresponding minimax affine risk, depend on both δ and γ. However, for simplicity of notation, this dependence is not shown throughout the paper.
The next theorem, which is the main result of this section, states that the estimator F̂_{a_1 b_1} is minimax in D. The proof is based on a method of Yu (1992a) (cf. also Jokiel-Rokita and Magiera 2007).
Theorem 1 Let δ, γ ∈ {0, 1} be fixed and let a_1, b_1, r_1 be the corresponding numbers defined by (5). Then the estimator F̂_A(s, t) = a_1 ∑_{i=1}^n 1(X_i ≤ s, Y_i ≤ t) + b_1 of an arbitrary bivariate c.d.f. F is minimax under the loss function L_1 and the minimax risk is given by ρ_N = r_1 W(R²).

Proof To prove the theorem, it suffices to show that sup_{F∈F} R_1(F̂, F) ≥ ρ_A for any F̂ ∈ D. Let k > 0 be a fixed integer and let (X_1, Y_1), …, (X_n, Y_n) be i.i.d. random vectors from the c.d.f. F_{kp} defined above. Since F_{kp}(s, t) = p on [−k, k]², it follows that for any F̂ ∈ D the supremum of the risk over F is bounded below by the supremum of the risk over the family {F_{kp} : p ∈ (0, 1)} [see (8)]. Note first that the joint distribution of the vector Z = ((X_1, Y_1), …, (X_n, Y_n)) depends on p only through n_1, the value of the random variable N_1 which counts the number of observations (x_j, y_j) that fall into the square A_1. Since N_1 is a sufficient statistic for p, we may assume that F̂ depends on Z only through N_1, i.e. F̂(Z; s, t) = F̂(N_1; s, t) for some Borel measurable function F̂. Since N_1 has the binomial distribution B(n, p), the estimation problem on [−k, k]² reduces to the binomial problem, and the last inequality implies by (8) and (4) that sup_{F∈F} R_1(F̂, F) ≥ r_1 W([−k, k]²). Here we use the fact that F̂(Z; ·, ·) = F̂(N_1; ·, ·). Letting k → ∞ we obtain the lower bound sup_{F∈F} R_1(F̂, F) ≥ r_1 W(R²) = ρ_A, which proves that under the loss L_1 the decision rule F̂_A := F̂_{a_1 b_1} is minimax among all estimators.
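The constant-risk property behind Theorem 1 can also be seen by simulation. The sketch below assumes the classical case δ = γ = 0, with a_1 = 1/(n + √n) and b_1 = √n/(2(n + √n)) as in (5), and estimates the pointwise risk E[F̂_A(s, t) − F(s, t)]² for independent uniform data; the sample size, grid points and replication count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 20000
a1 = 1.0 / (n + np.sqrt(n))
b1 = np.sqrt(n) / (2 * (n + np.sqrt(n)))
target = 1.0 / (4 * (1 + np.sqrt(n)) ** 2)   # binomial minimax risk 1/(4(1+sqrt(n))^2)

def pointwise_risk(draw_sample, F, s, t):
    """Monte Carlo estimate of E (F_A(s,t) - F(s,t))^2 under a given bivariate F."""
    xy = draw_sample((reps, n))               # reps independent samples of size n
    counts = ((xy[..., 0] <= s) & (xy[..., 1] <= t)).sum(axis=1)
    est = a1 * counts + b1                    # the affine rule F_A evaluated at (s, t)
    return np.mean((est - F(s, t)) ** 2)

# Independent uniforms on [0, 1]^2, so F(s, t) = s * t there.
unif = lambda size: rng.random(size + (2,))
F_unif = lambda s, t: s * t
for (s, t) in [(0.2, 0.7), (0.5, 0.5), (0.9, 0.9)]:
    r = pointwise_risk(unif, F_unif, s, t)
    assert abs(r - target) / target < 0.1     # constant in (s, t) up to MC error
```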

Remark 2
Since for any integer k ≥ 1 and any p ∈ (0, 1) the c.d.f. F_{kp} is absolutely continuous (with respect to the Lebesgue measure), we have in fact proved a stronger result than the statement of the last theorem: under the loss L_1, F̂_A is the minimax estimator of an unknown absolutely continuous bivariate c.d.f. F. Moreover, Theorem 1 can easily be generalized to dimensions d > 2.
In a slight modification of the proof, F_{kp} is a d-variate c.d.f. which equals p over the hypercube [−k, k]^d.

Minimax estimation under the loss function L_2
Under the loss L_2 the risk of an affine decision rule F̂_{ab} is given by (9), an expression analogous to (6) with integration with respect to F in place of W. To find the minimax affine estimator, we first derive a lower bound for the minimax affine risk ρ_A. For this purpose, we choose a suitable family of bivariate c.d.f.'s and consider estimation of F in the resulting submodel. Let (x̃_n)_{n≥1} be a given increasing sequence of points from (0, 1) and let m ≥ 1 be a fixed integer. We put x̃_0 = ỹ_0 = 0 and ỹ_i = 1 − x̃_i for i ≥ 1. Let the set S_m be defined by S_m = {s = (s_0, …, s_m) ∈ [0, 1]^{m+1} : s_0 + ⋯ + s_m = 1}.
For any probability vector p = (p_0, …, p_m) ∈ S_m, we denote by F_{m,p} the bivariate c.d.f. which corresponds to a discrete random vector (X, Y) with support {(x̃_0, ỹ_0), …, (x̃_m, ỹ_m)} and with joint probability mass function P(X = x̃_i, Y = ỹ_i) = p_i for i = 0, …, m. This yields, for any integer k ≥ 0, an explicit expression for the moments appearing in the risk (9) evaluated at F_{m,p}. We use this equality to find the lower bound for the minimax affine risk ρ_A. For each p_0 ∈ [0, 1] and each integer m ≥ 1, let p_0^{(m)} denote the corresponding vector from S_m with first coordinate p_0. Using (9) and applying the resulting equality with k = 0, 1, 2, and then (3) and (5), we obtain the lower bound ρ_A ≥ r_1 for the risk of affine estimators.

Lemma 2 Let δ, γ ∈ {0, 1} be fixed and let a_1, b_1, r_1 be the corresponding constants defined by (5).
Then F̂_A = F̂_{a_1 b_1} is the minimax affine rule under the loss L_2 and the minimax affine risk is given by ρ_A = r_1.

Proof The risk function of F̂_A = F̂_{a_1 b_1} is constant and equal to r_1, because the integrand in (9) does not depend on F and equals r_1 when a = a_1 and b = b_1 (cf. the proof of Lemma 1). This completes the proof, because r_1 is the lower bound for ρ_A and hence inf_{a,b∈R} sup_{F∈F} R_2(F̂_{ab}, F) = r_1.

To prove that F̂_A is minimax among all estimators we use the Bayes approach. More precisely, we take a specific sequence of priors on F such that the corresponding sequence of Bayes risks converges to the supremum of the risk of F̂_A. Since the limit of the Bayes risks is a lower bound for the minimax risk, we conclude that F̂_A is minimax.
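The discrete family F_{m,p} used in this lower bound has a convenient structure: since (x̃_i) increases while ỹ_i = 1 − x̃_i decreases, the only support points dominated by (x̃_i, ỹ_i) are (x̃_0, ỹ_0) = (0, 0) and (x̃_i, ỹ_i) itself, so F_{m,p}(x̃_0, ỹ_0) = p_0 and F_{m,p}(x̃_i, ỹ_i) = p_0 + p_i for i ≥ 1 (the values that produce the binomial counts N_0 and N_0 + N_i appearing below). A small numerical check (the particular sequence x̃_i and the vector p are arbitrary choices):

```python
m = 3
x = [0.0, 0.2, 0.5, 0.8]              # x~_0 = 0, then an increasing sequence in (0, 1)
y = [0.0] + [1 - xi for xi in x[1:]]  # y~_0 = 0, y~_i = 1 - x~_i
p = [0.1, 0.2, 0.3, 0.4]              # a probability vector from S_3

def cdf(s, t):
    """C.d.f. of the discrete law putting mass p_j at the support point (x~_j, y~_j)."""
    return sum(pj for xj, yj, pj in zip(x, y, p) if xj <= s and yj <= t)

assert cdf(x[0], y[0]) == p[0]                                # F(x~_0, y~_0) = p_0
for i in range(1, m + 1):
    assert abs(cdf(x[i], y[i]) - (p[0] + p[i])) < 1e-12       # F(x~_i, y~_i) = p_0 + p_i
print([round(cdf(x[i], y[i]), 3) for i in range(m + 1)])      # [0.1, 0.3, 0.4, 0.5]
```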
Suppose that we know a priori that (X_1, Y_1), …, (X_n, Y_n) are i.i.d. random vectors from F_{m,p}, where m is a fixed positive integer and p = (p_0, …, p_m) ∈ S_m is an unknown probability vector. Then the joint distribution of Z = ((X_1, Y_1), …, (X_n, Y_n)) depends on p only through the vector of counts N = (N_0, …, N_m), which has the multinomial distribution on m + 1 categories with n draws and probability vector p = (p_0, …, p_m). Note that for any estimator F̂ ∈ D, the risk R_2(F̂, F_{m,p}) involves F̂ only through its values at the support points. Since N is a sufficient statistic for p, we may assume that F̂(x̃_i, ỹ_i), i = 0, …, m, depends on Z only through N, i.e. F̂(Z; x̃_i, ỹ_i) = d_i(N) for some real-valued Borel measurable function d_i. Therefore, the problem of estimating F_{m,p} with the loss L_2 and the sample Z from F_{m,p} is equivalent to the estimation of the multinomial probabilities p = (p_0, …, p_m) under the loss (11) with the sample N. This means that the corresponding two minimax risks are equal.
where p_m = 1 − (p_0 + ⋯ + p_{m−1}) and I_{S_m}(·) is the indicator function of the set S_m. Since the Dirichlet prior is conjugate to the multinomial distribution, the posterior of p given N = n is D(α_0 + n_0, …, α_m + n_m).
Let d^π = (d_0^π, …, d_m^π) be the Bayes estimator of p = (p_0, …, p_m) under the loss L(d, p) given by (11). Its coordinates are ratios of posterior moments, provided that these posterior moments are finite. If δ = γ = 0, then both the integrals and the resulting Bayes risk can be calculated easily, because for each k, l ∈ {0, 1, 2, 3} the required mixed moments of the Dirichlet distribution are available in closed form. To find these moments in the case where δ = 1 or γ = 1, we use the following Liouville formula (cf. Fichtenholz 1992): if a function φ : [0, 1] → R is continuous, p and q are positive real numbers, and the integral ∫_0^1 |φ(u)| u^{p+q−1} du is finite, then

∫∫_{u≥0, v≥0, u+v≤1} φ(u + v) u^{p−1} v^{q−1} du dv = B(p, q) ∫_0^1 φ(u) u^{p+q−1} du.

Let (α_0, …, α_m) ∈ A_m^{δ,γ}, the set of admissible prior parameters. Then, since the posterior of p given N = n is D(α_0 + n_0, …, α_m + n_m), it follows by (14) and (15) that the Bayes estimator d^π is well defined, the finiteness of the above posterior means being guaranteed by the choice of (α_0, …, α_m). Since the random variables N_0 and N_0 + N_i, i = 1, …, m, have the binomial distributions B(n, p_0) and B(n, p_0 + p_i), respectively, the risk of d^π can be written explicitly. Then, by (14) and (15), the Bayes risk r(π) = E_π R(d^π, p) can be computed, and by (14), (15) and (17) we obtain a closed formula (18) for r(π). Before stating the main result of this section, we introduce some notation: if δ = γ = 0 we put n_0 = 4, and when δ = 1 or γ = 1 we set n_0 = 1.
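The conjugacy step admits a quick numerical illustration in the simplest, unweighted case: with a Dirichlet D(α_0, …, α_m) prior and multinomial counts n_0, …, n_m, the posterior is D(α_0 + n_0, …, α_m + n_m), so under plain squared error the Bayes estimate of p_i is the posterior mean (α_i + n_i)/(∑_j α_j + n). A minimal sketch (the weighted-loss estimators in the paper involve further ratios of Beta moments, which are not reproduced here):

```python
from fractions import Fraction

def posterior_mean(alpha, counts):
    """Posterior mean of p under a Dirichlet prior updated with multinomial counts."""
    total = sum(alpha) + sum(counts)
    return [Fraction(a + c, 1) / total for a, c in zip(alpha, counts)]

alpha = [1, 1, 1, 1]   # uniform Dirichlet prior on the 3-simplex
counts = [5, 2, 2, 1]  # observed multinomial counts, n = 10
# Posterior is D(6, 3, 3, 2); posterior means are (6, 3, 3, 2)/14.
print(posterior_mean(alpha, counts))
# [Fraction(3, 7), Fraction(3, 14), Fraction(3, 14), Fraction(1, 7)]
```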
Theorem 2 Let δ, γ ∈ {0, 1} be fixed and let a_1, b_1, r_1 be the corresponding constants defined by (5). If n ≥ n_0, then the estimator F̂_A(s, t) = a_1 ∑_{i=1}^n 1(X_i ≤ s, Y_i ≤ t) + b_1 of F is minimax under the weighted Cramér–von Mises loss function L_2(F, F̂) and the minimax risk is given by ρ_N = r_1.

Proof Lemma 2 implies that for any integer m > 1 and any (α_0, …, α_m) ∈ A_m^{δ,γ}, we have r(α_0, …, α_m) ≤ ρ_A = r_1, where r(α_0, …, α_m) stands for the Bayes risk given by the right-hand side of (18). It is clear that to prove the minimaxity of F̂_A it suffices to find a sequence of points ((α_0^{(m)}, …, α_m^{(m)})) such that the corresponding Bayes risks r(α_0^{(m)}, …, α_m^{(m)}) converge to r_1 as m → ∞ (see, e.g., Lehmann and Casella 1998, Theorem 1.12, page 316).