Abstract
Consider independent observations \((X_i,R_i)\) with random or fixed ranks \(R_i\), while conditional on \(R_i\), the random variable \(X_i\) has the same distribution as the \(R_i\)-th order statistic within a random sample of size k from an unknown distribution function F. Such observation schemes are well known from ranked set sampling and judgment post-stratification. Within a general, not necessarily balanced setting we derive and compare the asymptotic distributions of three different estimators of the distribution function F: a stratified estimator, a nonparametric maximum-likelihood estimator and a moment-based estimator. Our functional central limit theorems generalize and refine previous asymptotic analyses. In addition, we discuss briefly pointwise and simultaneous confidence intervals for the distribution function with guaranteed coverage probability for finite sample sizes. The methods are illustrated with a real data example, and the potential impact of imperfect rankings is investigated in a small simulation experiment.
References
Balakrishnan, N., Li, T. (2006). Confidence intervals for quantiles and tolerance intervals based on ordered ranked set samples. Annals of the Institute of Statistical Mathematics, 58, 757–777.
Bhoj, D. S. (2001). Ranked set sampling with unequal samples. Biometrics, 57(3), 957–962.
Chen, Z. (2001). Non-parametric inferences based on general unbalanced ranked-set samples. Journal of Nonparametric Statistics, 13(2), 291–310.
Chen, Z., Bai, Z., Sinha, B. K. (2004). Ranked set sampling. Theory and applications. New York: Springer.
Clopper, C. J., Pearson, E. S. (1934). The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4), 404–413.
Dastbaravarde, A., Arghami, N. R., Sarmad, M. (2016). Some theoretical results concerning non parametric estimation by using a judgment poststratification sample. Communications in Statistics, Theory and Methods, 45(8), 2181–2203.
David, H. A., Nagaraja, H. N. (2003). Order statistics (3rd ed.). Hoboken, NJ: Wiley-Interscience.
Dell, T. R., Clutter, J. L. (1972). Ranked set sampling theory with order statistics background. Biometrics, 28(2), 545–555.
Frey, J., Ozturk, O. (2011). Constrained estimation using judgement post-stratification. Annals of the Institute of Statistical Mathematics, 63, 769–789.
Ghosh, K., Tiwari, R. C. (2008). Estimating the distribution function using \(k\)-tuple ranked set samples. Journal of Statistical Planning and Inference, 138(4), 929–949.
Huang, J. (1997). Properties of the NPMLE of a distribution function based on ranked set samples. Annals of Statistics, 25(3), 1036–1049.
Kvam, P. H., Samaniego, F. J. (1994). Nonparametric maximum likelihood estimation based on ranked set samples. Journal of the American Statistical Association, 89(426), 526–537.
MacEachern, S. N., Stasny, E. A., Wolfe, D. A. (2004). Judgement post-stratification with imprecise rankings. Biometrics, 60, 207–215.
McIntyre, G. A. (1952). A method of unbiased selective sampling, using ranked sets. Australian Journal of Agricultural Research, 3, 385–390.
Presnell, B., Bohn, L. L. (1999). U-Statistics and imperfect ranking in ranked set sampling. Journal of Nonparametric Statistics, 10(2), 111–126.
R Core Team. (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. Accessed July 2018.
Shorack, G. R., Wellner, J. A. (1986). Empirical processes with applications to statistics. New York: Wiley.
Stokes, S. L., Sager, T. W. (1988). Characterization of a ranked-set sample with application to estimating distribution functions. Journal of the American Statistical Association, 83(402), 374–381.
Terpstra, J. T., Miller, Z. A. (2006). Exact inference for a population proportion based on a ranked set sample. Communications in Statistics, Simulation and Computation, 35(1), 19–26.
Wang, X., Wang, K., Lim, J. (2012). Isotonized CDF estimation from judgement poststratification data with empty strata. Biometrics, 68(1), 194–202.
Wolfe, D. A. (2004). Ranked set sampling: An approach to more efficient data collection. Statistical Science, 19(4), 636–643.
Wolfe, D. A. (2012). Ranked set sampling: Its relevance and impact on statistical inference. ISRN Probability and Statistics, 2012, 568385. https://doi.org/10.5402/2012/568385.
Acknowledgements
Constructive comments by an associate editor and two referees are gratefully acknowledged.
Appendix
We first recall two well-known facts about uniform empirical processes; see Shorack and Wellner (1986).
Proposition 6
Let \(U_1, U_2, U_3, \ldots \) be independent random variables with uniform distribution on [0, 1]. For \(N \in \mathbb {N}\) and \(u \in [0,1]\) define
Then, as \(N \rightarrow \infty \), \(\mathbb {V}^{(N)}\) converges in distribution in \(\ell _\infty ([0,1])\) to a standard Brownian bridge \(\mathbb {V}\) on [0, 1]. Moreover, for any fixed \(\delta \in [0,1/2)\) and \(\epsilon > 0\),
For the estimators \(\widehat{F}_n^\mathrm{M}\), \(\widehat{F}_n^\mathrm{L}\) we need some basic facts and inequalities for the auxiliary functions \(w_k\) and \(B_k\) which are proved in the supplement:
Lemma 7
(a) For \(r = 1,2,\ldots ,k\), the function \(w_r\) on (0, 1) may be written as \(w_r(t) = \widetilde{w}_r(t) / (t(1-t))\) with \(\widetilde{w}_r : [0,1] \rightarrow (0,\infty )\) continuously differentiable. Moreover, for \(r = 1,2,\ldots ,k\) and \(t \in (0,1)\),
$$\begin{aligned} 1 \ \le \ \widetilde{w}_r(t) \ \le \ \max (r,k+1-r). \end{aligned}$$
(b) For any constant \(c \in (0,1)\) there exists a number \(c' = c'(k,c) > 0\) with the following property: If \(t,p \in (0,1)\) such that
$$\begin{aligned} \frac{|p - t|}{t(1-t)} \ \le \ c , \end{aligned}$$
then for \(r = 1,2,\ldots ,k\),
$$\begin{aligned} \max \left\{ \left| \frac{w_r(p)}{w_r(t)} - 1 \right| , \left| \frac{B_r(p) - B_r(t)}{\beta _r(t) (p - t)} - 1 \right| \right\} \ \le \ c' \frac{|p - t|}{t(1-t)}. \end{aligned}$$
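The bounds in Lemma 7(a) can be checked numerically. The following is a minimal sketch, not the authors' code. It assumes that \(\beta _r\) is the density of the beta distribution with parameters \(r\) and \(k+1-r\), that \(B_r\) is its distribution function, and that \(w_r = \beta _r / (B_r(1 - B_r))\), as used in the proof of Theorem 5. All function names are illustrative.

```python
from math import comb


def beta_density(r, k, t):
    """Density of the r-th order statistic of k uniforms, i.e. Beta(r, k+1-r)."""
    return k * comb(k - 1, r - 1) * t ** (r - 1) * (1 - t) ** (k - r)


def beta_cdf_pair(r, k, t):
    """Return (B_r(t), 1 - B_r(t)) as two binomial sums to avoid cancellation."""
    upper = sum(comb(k, j) * t ** j * (1 - t) ** (k - j) for j in range(r, k + 1))
    lower = sum(comb(k, j) * t ** j * (1 - t) ** (k - j) for j in range(0, r))
    return upper, lower


def w_tilde(r, k, t):
    """Rescaled weight t (1 - t) w_r(t), assuming w_r = beta_r / (B_r (1 - B_r))."""
    cdf, survival = beta_cdf_pair(r, k, t)
    return t * (1 - t) * beta_density(r, k, t) / (cdf * survival)


def bounds_hold(k, grid_size=999, tol=1e-9):
    """Check 1 <= w_tilde_r(t) <= max(r, k+1-r) on an equispaced grid in (0, 1)."""
    for r in range(1, k + 1):
        upper_bound = max(r, k + 1 - r)
        for i in range(1, grid_size + 1):
            t = i / (grid_size + 1)
            value = w_tilde(r, k, t)
            if not (1 - tol <= value <= upper_bound + tol):
                return False
    return True


if __name__ == "__main__":
    for k in range(1, 7):
        print(k, bounds_hold(k))
```

A grid check of this kind is only a sanity check and does not replace the proof given in the supplement.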
Proof of Theorem 2
We start with the weight functions \(\gamma _{nr}^\mathrm{Z}\): Note that by Lemma 7,
with the probability weights \(\pi _{nr} := N_{nr}/n\) and continuous functions \(\widetilde{w}_r : [0,1] \rightarrow [1,k]\). Since the beta densities \(\beta _r\) are also continuous with \(\beta _1(0) = \beta _k(1) = k\), this shows that \(\gamma _{nr}^\mathrm{Z}\) is well-defined and continuous, provided that its denominator is strictly positive, i.e.,
For sufficiently large n this is the case, because \(\lim _{n\rightarrow \infty } \pi _{nr} = \pi _r\) for all r. The functions \(\gamma _r^\mathrm{Z}\) in Corollary 4 are continuous, too, and elementary considerations reveal that
as \(n \rightarrow \infty \). In particular, \(\max _{t \in [0,1], 1 \le r \le k} \gamma _{nr}^\mathrm{Z}(t) = O(1)\).
Note that for \(n \ge 1\) and \(1 \le r \le k\), the empirical process \(\mathbb {V}_{nr}\) is distributed as \(\mathbb {V}^{(N_{nr})}\) in Proposition 6. Note also that the distribution functions \(B_r\) satisfy \(B_1 \ge B_2 \ge \cdots \ge B_k\), because for \(1 \le r < k\) the density ratio \(\beta _{r+1}/\beta _r\) is a positive multiple of \(t/(1 - t)\) and thus strictly increasing. Consequently, for \(1 \le r \le k\),
so
Consequently,
as \(n \rightarrow \infty \) and \(c \downarrow 0\). All in all, we may conclude that
It remains to be shown that the process \(\sqrt{n} (\widehat{B}_n^\mathrm{Z} - B)\) may be approximated by \(\mathbb {V}_n^\mathrm{Z}\). In case of \(\mathrm{Z} = \mathrm{S}\) it follows from \(\sum _{r=1}^k \beta _r \equiv k\) that \(\sum _{r=1}^k B_r = k B\), and this implies that
For \(\mathrm{Z} = \mathrm{M}, \mathrm{L}\) it suffices to show that for any fixed number \(b \ne 0\) and
the following statements are true: If \(b < 0\), then with asymptotic probability one,
If \(b > 0\), then with asymptotic probability one,
Here we use the conventions that \(L_n'(t,\cdot ) := \infty \) and \(B_r := 0\) on \((-\infty ,0]\) while \(L_n'(t,\cdot ) := -\infty \) and \(B_r := 1\) on \([1,\infty )\).
To verify these claims, we split the interval (0, 1) into \((0,c_n]\), \([c_n,1-c_n]\) and \([1-c_n,1)\) with numbers \(c_n \in (0,1/2)\) to be specified later, where \(c_n \downarrow 0\).
On \([c_n,1-c_n]\) we utilize Lemma 7: For \(t \in [c_n, 1 - c_n]\) and \(p \in (0,1)\) such that \(|p - t| \le t(1-t)/2\) we may write
and
where
Note that for \(t \in [c_n,1-c_n]\),
Hence we choose \(c_n\) such that \(c_n \downarrow 0\) but \(n c_n^{2(1-\delta )} \rightarrow \infty \). With this choice, we may conclude that uniformly in \(t \in [c_n,1-c_n]\),
On the other hand, since \(\beta _1(t) + \beta _k(t) \ge \beta _1(1/2) + \beta _k(1/2) = k 2^{2-k}\),
Consequently,
and
for some random functions \(\kappa _n^\mathrm{M}, \kappa _n^\mathrm{L} : [c_n,1-c_n] \rightarrow [-1,1]\). These considerations show that (8) and (9) are satisfied with \([c_n,1-c_n]\) in place of (0, 1).
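For concreteness, one admissible choice of the sequence \((c_n)_n\) can be written down explicitly. The following is an illustration only, under the assumption that \(\delta \in [0,1/2)\) as in Proposition 6; the choice \(c_n = n^{-1/4}\) is ours and is not taken from the original argument:

$$\begin{aligned} c_n := n^{-1/4} \quad \text {yields} \quad c_n \downarrow 0 \quad \text {and} \quad n c_n^{2(1-\delta )} \ = \ n^{1 - (1-\delta )/2} \ = \ n^{(1+\delta )/2} \ \rightarrow \ \infty , \end{aligned}$$

and \(c_n \in (0,1/2)\) as soon as \(n > 16\). More generally, \(c_n = n^{-a}\) satisfies both requirements whenever \(0< a < 1/(2(1-\delta ))\).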
It remains to verify (8) and (9) with \((0,c_n]\) in place of (0, 1); the interval \([1-c_n,1)\) may be treated analogously. Note first that for \(2 \le r \le k\),
so
Furthermore, since \(B_1(t) = 1 - (1 - t)^k\),
Hence for \(t \in (0,c_n]\) and \(p \in (0,2c_n]\),
and
where
Note also that
In particular, \(\sup _{t \in (0,c_n]} p_n^\mathrm{Z}(t) = c_n + o_p(n^{-1/2} c_n^\delta ) = c_n (1 + o_p(1))\), and in case of \(b > 0\), \(\mathop {\mathrm {I\!P}}\nolimits \bigl ( p_n^\mathrm{Z}(t) > 0 \ \text {for} \ 0 < t \le c_n \bigr ) \rightarrow 1\).
In case of \(b > 0\), these considerations show that for \(0 < t \le c_n\),
and
Analogously, in case of \(b < 0\), for any \(t \in (0,c_n]\) we obtain the inequalities
Hence, (8) and (9) are satisfied with \((0,c_n]\) in place of (0, 1). \(\square \)
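The proof above relies on a few elementary facts: \(\sum _{r=1}^k \beta _r \equiv k\), hence \(\sum _{r=1}^k B_r = k B\); the ordering \(B_1 \ge B_2 \ge \cdots \ge B_k\); and \(\beta _1(t) + \beta _k(t) \ge k 2^{2-k}\). The sketch below checks them on a grid. It assumes that \(\beta _r\) is the density of the beta distribution with parameters \(r\) and \(k+1-r\) (consistent with \(\beta _1(0) = \beta _k(1) = k\) and \(B_1(t) = 1 - (1-t)^k\)) and that \(B(t) = t\) on [0, 1]. The function names are hypothetical.

```python
from math import comb


def beta_density(r, k, t):
    """Assumed form of beta_r: density of Beta(r, k+1-r) at t."""
    return k * comb(k - 1, r - 1) * t ** (r - 1) * (1 - t) ** (k - r)


def beta_cdf(r, k, t):
    """Assumed form of B_r: P(Binomial(k, t) >= r), the Beta(r, k+1-r) cdf."""
    return sum(comb(k, j) * t ** j * (1 - t) ** (k - j) for j in range(r, k + 1))


def identities_hold(k, grid_size=99, tol=1e-9):
    """Check the sum identities, the ordering of B_r and the lower bound on beta_1 + beta_k."""
    for i in range(1, grid_size + 1):
        t = i / (grid_size + 1)
        densities = [beta_density(r, k, t) for r in range(1, k + 1)]
        cdfs = [beta_cdf(r, k, t) for r in range(1, k + 1)]
        if abs(sum(densities) - k) > tol:          # sum_r beta_r(t) = k
            return False
        if abs(sum(cdfs) - k * t) > tol:           # sum_r B_r(t) = k t
            return False
        if any(cdfs[j] + tol < cdfs[j + 1] for j in range(k - 1)):  # B_1 >= ... >= B_k
            return False
        if densities[0] + densities[-1] < k * 2.0 ** (2 - k) - tol:
            return False
    return True


if __name__ == "__main__":
    for k in range(1, 9):
        print(k, identities_hold(k))
```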
Proof of Theorem 3
For symmetry reasons it suffices to prove the first part about the left tails. Let \((c_n)_n\) be a sequence of numbers in (0, 1/2] converging to zero. Then for \(t \in (0,c_n]\) and \(\delta := \kappa /2 \in (0,1/2)\),
Concerning \(\widehat{B}_n^\mathrm{M}\) and \(\widehat{B}_n^\mathrm{L}\), for any \(t \in (0,c_n]\) and \(p \in (0,1)\),
and
where
Now we proceed similarly as in the proof of Theorem 2, defining
for some fixed \(b \ne 0\). Note that for \(t \in (0,c_n]\),
because \(\kappa > \delta \). Note also that
because \(\widehat{B}_{n1} \ge 0\) and \(t \mapsto t - (1 - (1-t)^k)/k\) is strictly convex on [0, 1] with derivative 0 at 0. Thus \(p_n(t) > 0\) for all \(t \in (0,c_n]\) in case of \(b > 0\).
In case of \(b > 0\), we may conclude that
and
Hence for any fixed \(b > 0\),
Similarly we can show that for any fixed \(b < 0\), with asymptotic probability one, \(\sqrt{n} (\widehat{B}_n^\mathrm{Z}(t) - t) \le \mathbb {V}_n^{(\ell )}(t) + b t^\kappa \) for all \(t \in (0,c_n]\). \(\square \)
Proof of Corollary 4
It follows from Proposition 6 that
Together with (5) this entails that \(\sup _{t \in [0,1]} \bigl | \mathbb {V}_n^\mathrm{Z}(t) - \widetilde{\mathbb {V}}_n^\mathrm{Z}(t) \bigr | \rightarrow _p 0\), where \(\widetilde{\mathbb {V}}_n^\mathrm{Z} := \sum _{r=1}^k \gamma _r^\mathrm{Z} \, \mathbb {V}_{nr} \circ B_r\). But \(\gamma _r^\mathrm{Z} \equiv 0\) whenever \(\pi _r = 0\). In case of \(\pi _r > 0\) it follows from Proposition 6 that \(\mathbb {V}_{nr}\) converges in distribution to \(\mathbb {V}_r\). Consequently, \(\widetilde{\mathbb {V}}_n^\mathrm{Z}\) converges in distribution to the Gaussian process \(\mathbb {V}^\mathrm{Z} = \sum _{r=1}^k \gamma _r^\mathrm{Z} \, \mathbb {V}_r \circ B_r\). \(\square \)
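The convergence in Proposition 6 that drives this argument can be illustrated by simulation. The sketch below is not part of the original analysis. It assumes the standard definition of the uniform empirical process, \(\mathbb {V}^{(N)}(u) = \sqrt{N} \bigl ( N^{-1} \sum _{i=1}^N 1_{[U_i \le u]} - u \bigr )\), and compares simulated second moments with the Brownian bridge covariance \(\min (u,v) - uv\). Function names, sample sizes and the seed are arbitrary choices.

```python
import random
from math import sqrt


def empirical_process(uniforms, u):
    """Value at u of sqrt(N) * (empirical cdf - u) for a uniform sample."""
    n = len(uniforms)
    return sqrt(n) * (sum(x <= u for x in uniforms) / n - u)


def second_moments(n, u, v, reps=4000, seed=0):
    """Monte Carlo estimates of E[V(u)^2] and E[V(u) V(v)] for sample size n."""
    rng = random.Random(seed)
    sum_uu = 0.0
    sum_uv = 0.0
    for _ in range(reps):
        sample = [rng.random() for _ in range(n)]
        a = empirical_process(sample, u)
        b = empirical_process(sample, v)
        sum_uu += a * a
        sum_uv += a * b
    return sum_uu / reps, sum_uv / reps


if __name__ == "__main__":
    u, v = 0.3, 0.7
    var_u, cov_uv = second_moments(50, u, v)
    # Brownian bridge targets: u (1 - u) and min(u, v) - u v
    print(var_u, u * (1 - u))
    print(cov_uv, min(u, v) - u * v)
```

The process has mean zero at every \(u\), so the raw second moments estimate the variance and covariance directly.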
Proof of Theorem 5
The asserted inequalities follow from Jensen’s inequality. On the one hand, it follows from \(w_r = \beta _r / (B_r(1 - B_r))\) and \(\sum _{r=1}^k \beta _r \equiv k\) that
Equality holds if, and only if,
But
so
is strictly decreasing in t. Hence there is at most one solution of the equation \(\pi _1 w_1(t) = \pi _k w_k(t)\).
Similarly, with \(a_r(t) := \pi _r \beta _r(t) \big / \sum _{s=1}^k \pi _s \beta _s(t)\),
Here the inequality is strict unless
But \(w_1(t) = w_k(t)\) implies that \(t = 1/2\). Moreover, \(w_1(1/2) = 2k/(1 - 2^{-k})\) and
are identical if, and only if, \(k^2 + k + 2 = 2^{k+1}\). But \(2^{k+1} = 2 \sum _{j=0}^k \left( {\begin{array}{c}k\\ j\end{array}}\right) \) is strictly larger than \(2(1 + k + k(k-1)/2) = k^2 + k + 2\) if \(k \ge 3\).
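The equivalence with \(k^2 + k + 2 = 2^{k+1}\) can be confirmed in exact rational arithmetic. The displayed quantity compared with \(w_1(1/2)\) is not reproduced in this text; the sketch below assumes it is \(w_2(1/2)\), which is consistent with the stated condition. It also assumes \(w_r = \beta _r / (B_r(1 - B_r))\) with \(\beta _r\) the density of the beta distribution with parameters \(r\) and \(k+1-r\). The function names are illustrative.

```python
from fractions import Fraction
from math import comb


def w(r, k, t):
    """w_r(t) = beta_r(t) / (B_r(t) (1 - B_r(t))); exact when t is a Fraction."""
    density = k * comb(k - 1, r - 1) * t ** (r - 1) * (1 - t) ** (k - r)
    cdf = sum(comb(k, j) * t ** j * (1 - t) ** (k - j) for j in range(r, k + 1))
    return density / (cdf * (1 - cdf))


def compare_at_half(k):
    """Return (w_1(1/2), w_2(1/2)) as exact fractions, for k >= 2."""
    half = Fraction(1, 2)
    return w(1, k, half), w(2, k, half)


if __name__ == "__main__":
    for k in range(2, 11):
        w1, w2 = compare_at_half(k)
        closed_form = Fraction(2 * k) / (1 - Fraction(1, 2 ** k))
        condition = (k * k + k + 2 == 2 ** (k + 1))
        # w1 should match the closed form; w1 == w2 should match the condition
        print(k, w1 == closed_form, w1 == w2, condition)
```

Under these assumptions the two weights coincide for \(k = 2\), where \(w_2 = w_k\) and symmetry applies, and differ for every \(k \ge 3\) in the range checked.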
As to the ratios \(E^\mathrm{Z}(t) := K^\mathrm{Z}(t)/K^\mathrm{L}(t)\), note first that
On the other hand, with \(a_r(t)\) as above,
with a random variable W with distribution \(\sum _{r=1}^k a_r(t) \delta _{w_r(t)}\). But with \(\ell (t) := \min _r w_r(t)\) and \(u(t) := \max _r w_r(t)\), convexity of \(w \mapsto w^{-1}\) on \([\ell (t),u(t)]\) implies that
so
This upper bound for \(E^\mathrm{M}(t)\) is attained approximately if the distribution of W approaches the uniform distribution on \(\{\ell (t),u(t)\}\). Hence we should choose \((\pi _r)_{r=1}^k\) as follows: Let r(1), r(2) be two different numbers in \(\{1,\ldots ,k\}\) such that \(w_{r(1)}(t) = \ell (t)\) and \(w_{r(2)}(t) = u(t)\). Then let
The inequality \(\rho (t) \le k\) follows from Lemma 7 and the fact that \(\rho (t)\) remains unchanged if we replace \(w_r(t)\) with \(\widetilde{w}_r(t) = t(1-t) w_r(t) \in [1,k]\). \(\square \)
About this article
Cite this article
Dümbgen, L., Zamanzade, E. Inference on a distribution function from ranked set samples. Ann Inst Stat Math 72, 157–185 (2020). https://doi.org/10.1007/s10463-018-0680-y