Abstract
With the emergence of the big data era, the need for sampling methods that select samples based on the order in which units are observed is felt more than ever. To meet this need, a new sequential unequal probability sampling method is proposed. The decision to select or reject each unit is made in the order in which the units appear. A variant of this method allows a sample to be selected from a stream. It uses sliding windows, which act as strata of controllable size, and it allows the sample to be spread in a controlled manner throughout the population. In the special case of windows of size one, the decision on each sampling unit is made immediately after observing it. The implementation with windows of size one is simple and is presented here as an algorithm with a single condition. With windows of size two, we obtain one of the optimal stream sampling methods, which results in a well-spread stream sample with positive second-order inclusion probabilities.
References
Aubry P (2023) On the correct implementation of the Hanurav-Vijayan selection procedure for unequal probability sampling without replacement. Commun Stat-Simul Comput 52(5):1849–1877
Boley M, Lucchese C, Paurat D, Gartner T (2011) Direct local pattern sampling by efficient two-step random procedures. In: ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD’11, San Diego, USA, 21–24 August 2011. ACM Press, New York, USA, pp 582–590
Busnel Y, Tillé Y (2020) Attack-tolerant unequal probability sampling methods over sliding window for distributed streams. In: 4th international conference on compute and data analysis (ICCDA 2020), Mar 2020, San Jose, United States, pp 72–78
Chao M-T (1982) A general purpose unequal probability sampling plan. Biometrika 69:653–656
Chaudhuri A, Pal S (2022) Sampling with Varying Probabilities. Springer Nature Singapore, Singapore, pp 43–109
Chauvet G (2012) On a characterization of ordered pivotal sampling. Bernoulli 18(4):1320–1340
Chauvet G (2021) A note on Chromy's sampling procedure. J Surv Stat Methodol 9(5):1050–1061
Chauvet G (2022) A cautionary note on the Hanurav-Vijayan sampling algorithm. J Surv Stat Methodol 10(5):1276–1291
Chromy JR (1979) Sequential sample selection methods. In: Proceedings of the American statistical association, survey research methods section, pp 401–406
Cohen E, Duffield N, Kaplan H, Lund C, Thorup M (2009) Stream sampling for variance-optimal estimation of subset sums. In: Proceedings of the twentieth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, pp 1255–1264
Deville J-C, Tillé Y (1998) Unequal probability sampling without replacement through a splitting method. Biometrika 85:89–101
Diop L, Diop CT, Giacometti A, Li D, Soulet A (2018) Sequential pattern sampling with norm constraints. In: 2018 IEEE international conference on data mining (ICDM), pp 89–98
Gabler S (1990) Minimax Solutions in Sampling from Finite Populations. Springer, New York
Giacometti A, Soulet A (2021) Reservoir pattern sampling in data streams. In: Oliver N, Pérez-Cruz F, Kramer S, Read J, Lozano JA (eds) Machine learning and knowledge discovery in databases. Research track. Springer International Publishing, Cham, pp 337–352
Grafström A, Lundström NLP (2013) Why well spread probability samples are balanced? Open J Stat 3(1):36–41
Grafström A, Lundström NLP, Schelin L (2012) Spatially balanced sampling through the pivotal method. Biometrics 68(2):514–520
Grafström A, Matei A, Qualité L, Tillé Y (2012) Size constrained unequal probability sampling with a non-integer sum of inclusion probabilities. Electron J Stat 6:1477–1489
Hanif M, Brewer KRW (1980) Sampling with unequal probabilities without replacement: A review. Int Stat Rev 48:317–335
Horvitz DG, Thompson DJ (1952) A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 47(260):663–685
Jauslin R, Panahbehagh B, Tillé Y (2022) Sequential spatially balanced sampling. Environmetrics 33(8):e2776
Jauslin R, Tillé Y (2020) Spatial spread sampling using weakly associated vectors. J Agric Biol Environ Stat 25(3):431–451
Madow WG (1949) On the theory of systematic sampling, II. Ann Math Stat 20:333–354
Narain RD (1951) On sampling without replacement with varying probabilities. J Indian Soc Agric Stat 3:169–174
R Core Team (2022) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria
Sunter AB (1977) List sequential sampling with equal or unequal probabilities without replacement. Appl Stat 26:261–268
Sunter AB (1986) Solutions to the problem of unequal probability sampling without replacement. Int Stat Rev 54:33–50
Tillé Y (1996) An elimination procedure of unequal probability sampling without replacement. Biometrika 83:238–241
Tillé Y (2006) Sampling Algorithms. Springer, New York
Tillé Y (2019) A general result for selecting balanced unequal probability samples from a stream. Inf Process Lett 152:1–6
Vijayan K (1968) An exact \(\pi ps\) sampling scheme, generalization of a method of Hanurav. J Roy Stat Soc B 30:556–566
Appendix
Proof of Result 1
Without loss of generality and for ease of notation, we give the proof only for \(t = 1\). For \(k=1\) it is obvious that \(\text {E}(\pi ^1_1)=\pi _1\), and for \(k=2,3,\dots ,N\),
Also for the sum of inclusion probabilities, we have
and
\(\square \)
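As a numerical illustration of Result 1, the sketch below applies one update step and checks the expectation and the sums. It assumes, consistently with the proofs of Results 2 and 3, that when unit 1 is not selected the remaining probabilities become \(\pi ^{1(0)}_k=\min (c_1\pi _k,1)\), with \(c_1\) chosen so that they sum to n, and that when unit 1 is selected they become \(\pi ^{1(1)}_k=\{\pi _k-(1-\pi _1)\pi ^{1(0)}_k\}/\pi _1\). The function names `scale_factor` and `update_after_first` are illustrative and do not come from the paper.

```python
def scale_factor(rest, target):
    """Return c such that sum(min(c * p, 1) for p in rest) equals target.

    Assumes 0 < target <= number of positive entries in rest.
    """
    values = sorted((p for p in rest if p > 0), reverse=True)
    tail = sum(values)                    # total of the entries not capped at 1
    for capped, p in enumerate(values):   # capped = number of entries set to 1
        c = (target - capped) / tail
        if c * p <= 1 + 1e-12:            # largest uncapped entry stays <= 1
            return c
        tail -= p
    raise ValueError("target exceeds the number of positive probabilities")


def update_after_first(pi):
    """Updated probabilities of units 2..N given the decision on unit 1.

    Returns (if_selected, if_not_selected). Assumes 0 < pi[0] < 1.
    The update rule is an assumption inferred from the proofs, see the text.
    """
    pi1, rest = pi[0], pi[1:]
    n = sum(pi)
    c1 = scale_factor(rest, n)
    if_not_selected = [min(c1 * p, 1.0) for p in rest]
    if_selected = [(p - (1 - pi1) * q) / pi1
                   for p, q in zip(rest, if_not_selected)]
    return if_selected, if_not_selected


pi = [0.5, 0.9, 0.3, 0.3]                 # sums to n = 2
sel, not_sel = update_after_first(pi)
# Expected updated probability of each remaining unit over the two outcomes.
expected = [pi[0] * s + (1 - pi[0]) * q for s, q in zip(sel, not_sel)]
```

Under these assumptions, `expected` reproduces the original probabilities of units 2 to N, `not_sel` sums to n, and `sel` sums to n minus one. In a stream, unit 1 would be selected with probability \(\pi _1\) and the corresponding vector carried forward to the next unit.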
Proof of Result 2
Without loss of generality and for ease of notation, we give the proof only for \(t = 1\). Since
for \(k=2,3,\dots ,N\), a necessary and sufficient condition is that \(\pi _1\ge 1-1/c_1\), which gives the result. \(\square \)
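A sketch of the inequality behind this condition, assuming (as in the analogous step of the proof of Result 4) that \(\pi ^{1(0)}_k=\min (c_1\pi _k,1)\) and that \(\pi ^{1(1)}_k\) denotes the updated probability of unit k when unit 1 is selected:

$$\begin{aligned} \pi ^{1(1)}_k=\frac{\pi _k-\min (c_1\pi _k,1)(1-\pi _1)}{\pi _1}\ge \frac{\pi _k-c_1\pi _k(1-\pi _1)}{\pi _1}=\frac{\pi _k\left\{ 1-c_1(1-\pi _1)\right\} }{\pi _1}. \end{aligned}$$

For \(\pi _k>0\) the right-hand side is non-negative exactly when \(c_1(1-\pi _1)\le 1\), that is, when \(\pi _1\ge 1-1/c_1\). The bound is attained for every unit with \(c_1\pi _k\le 1\).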
Proof of Result 3
Define
Then it is possible to decompose n as
and
where “\(\#\)” indicates cardinality. Now, if n is an integer, \(c_1A\) is an integer denoted by d, thus \(A=d/c_1\). In this case, we have \(\#U_B=n-d\). Furthermore, it is easy to see that \(B\le \#U_B\). Now, from Eq. (10), we have
Now if \(d=1,2,\dots \) then \(\pi _1\ge (1-1/c_1)\) and if \(d=0\) then \(\#U_B=n\) or in other words, \(\pi ^{1(0)}_k=1\) for all \(k=2,3,\dots ,N\). Therefore, to have all \(\pi ^{1(1)}_k\ge 0\), it is necessary to have
But in such cases, since \(\pi ^{1(0)}_k=\pi _k+\alpha _k \pi _1=1\) for some \(0<\alpha _k\le 1\), we have \(\pi _k+\pi _1\ge 1\), and therefore Condition (11) is satisfied.
Hence, if n is an integer, the condition of Result 2 is always fulfilled. \(\square \)
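A small numerical instance of this argument, under the assumption that the sets and sums of the proof are \(U_B=\{k\ge 2: c_1\pi _k\ge 1\}\), \(A=\sum _{k\ge 2,\,k\notin U_B}\pi _k\) and \(B=\sum _{k\in U_B}\pi _k\), with \(c_1\) solving \(\sum _{k\ge 2}\min (c_1\pi _k,1)=n\). These definitions are inferred from how A, B and \(U_B\) are used above. Take \(\pi =(0.5,0.9,0.3,0.3)\), so that \(n=2\):

$$\begin{aligned} c_1&=\tfrac{5}{3},\quad U_B=\{2\},\quad A=0.6,\quad B=0.9,\\ d&=c_1A=1,\quad \#U_B=1=n-d,\quad B=0.9\le \#U_B,\\ \pi _1&=0.5\ge 1-\frac{1}{c_1}=0.4. \end{aligned}$$

Here \(d\ge 1\), so the condition of Result 2 holds, as the proof predicts for an integer n.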
Proof of Result 4
The proof for \(w_1\) follows directly from Result 1. We prove the result for \(w_2\):
(i) For \(k=k_1+1,\dots ,a_2\),
$$\begin{aligned} E(\pi ^*_k)=\pi _{a_1}\pi ^{*(1)}_k+(1-\pi _{a_1})\pi ^{*(0)}_k=\pi _{a_1}\pi ^{*(1)}_k+\pi _k-\pi _{a_1}\pi ^{*(1)}_k=\pi _k, \end{aligned}$$
and for \(k=k_1\),
$$\begin{aligned} E(\pi ^*_{k_1})=\pi _{a_1}\times 1+ (1-\pi _{a_1})\frac{\pi _{k_1}-\pi _{a_1}}{(1-\pi _{a_1})}=\pi _{k_1}. \end{aligned}$$
(ii) For \(\pi ^{+}_{a_1}=1\), in (7), \(\pi _{b_1}\) is distributed over \(\pi _{k_1+1},\dots ,\pi _{a_2}\). To show that the inclusion probabilities in (8) are non-negative, we need
$$\begin{aligned} \frac{\pi _k-\min (c^*_2\pi _k,1)\pi _{a_1}}{(1-\pi _{a_1})}\ge \frac{\pi _k-c^*_2\pi _k\pi _{a_1}}{(1-\pi _{a_1})}\ge 0, \end{aligned}$$
which leads to
$$\begin{aligned} (1-\pi _{a_1})\ge \left( 1-\frac{1}{c^*_{2}}\right) . \end{aligned}$$
But since the size of \(w_2\) is an integer, we know that
$$\begin{aligned} \frac{\pi _k-\min (c^*_2\pi _k,1)(1-\pi _{b_1})}{\pi _{b_1}}\ge \frac{\pi _k-c^*_2\pi _k(1-\pi _{b_1})}{\pi _{b_1}}\ge 0, \end{aligned}$$
and then
$$\begin{aligned} \pi _{b_1}\ge \left( 1-\frac{1}{c^*_{2}}\right) . \end{aligned}$$
Therefore, as \(\pi _{a_1}+\pi _{b_1}=\pi _{k_1}\le 1\), we have
$$\begin{aligned} (1-\pi _{a_1})\ge \pi _{b_1}\ge \left( 1-\frac{1}{c^*_{2}}\right) . \end{aligned}$$
(iii) The proof that the sum of the inclusion probabilities is preserved is straightforward: it follows by summing \(\pi ^{*(0)}_k\) and \(\pi ^{*(1)}_k\) inside \(w_2\).
For the other windows, the proof is the same. \(\square \)
Proof of Result 5
For calculating the second-order inclusion probability \(\pi _{k\ell }\) where \(k<\ell \), \(k\in w_i\) and \(\ell \in w_j\), we have
In \(\pi _{\ell \mid k}\), the selection of k affects the selection of \(\ell \) by changing \(\pi _{a_{j-1}}\). Then, using a recursive relation, we can calculate \(\pi _{a_{j-1}\mid k}\) from \(\pi _{a_{j-2}\mid k}\), and so on. We then need to consider the cases in Result 5:
(i) in this case, the second-order inclusion probabilities can be calculated based on the design \(p_i\) implemented inside the respective window,
(ii) here, after the recursive calculation of \(\pi _{a_i\mid k}\), as \(a_i\) and k are in the same window, we have
$$\begin{aligned} \pi _{a_i\mid k}=\frac{\pi _{ka_i}}{\pi _k}=\frac{\pi ^{p_i}_{ka_i}}{\pi _k}, \end{aligned}$$
(iii) here, since unit k is a cross-border unit, if \(a_i\) is selected, \(\pi _{a_{i+1}}\) will be updated as \(\min (c_{i+1}\pi _{a_{i+1}},1)\), and if \(a_i\) is not selected, then \(\pi _{a_{i+1}}\) will be updated as
$$\begin{aligned} \frac{\pi ^{p_i(a_i\notin S)}_{{b_{i}}{a_{i+1}}}}{\pi _{b_{i}}/(1-\pi _{a_{i}})}. \end{aligned}$$
For the conditional probability of \(a_i\) itself, since \(a_i\) is a part of k, with \(\pi _k=\pi _{a_i}+\pi _{b_i}\), we have
$$\begin{aligned} \{a_i\in S\}\subset \{k\in S\}, \end{aligned}$$
and therefore
$$\begin{aligned} \pi _{a_i\mid k}=\frac{\text{Pr}(k\in S, a_i \in S)}{\pi _k}=\frac{\text{Pr}(a_i \in S)}{\pi _k}=\frac{\pi _{a_i}}{\pi _k}, \end{aligned}$$
(iv) when \(\pi _{\ell }=\pi _{a_j}+\pi _{b_j}\), then
$$\begin{aligned} \pi _{\ell \mid k}=\text{Pr}(a_j\in S)+\text{Pr}(a_j\notin S)\text{Pr}(b_j\in S\mid a_j \notin S)=\pi _{a_{j}\mid k}+(1-\pi _{a_{j}\mid k})\frac{\pi _{b_j}}{1-\pi _{a_{j}}}, \end{aligned}$$
and the rest of the proof is the same as in case (ii),
(v) the first part of the proof of this case is the same as in case (iv), and after the recursive calculations, the last part is the same as the last part of case (iii).
\(\square \)
Proof of Result 6
After deciding on the first window, as \(a_1\) is not a real unit, depending on the decision for this unit, the units inside \(w_2\) will be initially updated as
and
Consider unit \(\ell \), inside \(w_2\)
(I) \(\ell \) is not a cross-border unit,
(1) if \(n_\ell <i\) and \(\pi ^{+}_{a_1}=1\), then according to (13) we have
$$\begin{aligned} \pi ^{*}_\ell =\frac{\frac{\pi _\ell }{1-\pi _{b_1}}}{1-\sum _{i=k_1+1}^{\ell -1}\frac{\pi _{i}}{1-\pi _{b_1}}}=\frac{\pi _\ell }{1-(F_{\ell -1}-\lfloor F_{\ell -1}\rfloor )}, \end{aligned}$$
and if \(\pi ^{+}_{a_1}=0\),
$$\begin{aligned} \pi ^{*}_\ell =\frac{\frac{\pi _{\ell }-\pi ^{*(1)}_{\ell }\pi _{a_1}}{1-\pi _{a_1}}}{1-\frac{\pi _{k_1}-\pi _{a_1}}{1-\pi _{a_1}}-\sum _{i=k_1+1}^{\ell -1}\frac{\pi _{i}-\pi ^{*(1)}_{i}\pi _{a_1}}{1-\pi _{a_1}}}, \end{aligned}$$
which, after replacing each \(\pi ^{*(1)}_i\) by \(\pi _i/(1-\pi _{b_1})\), gives
$$\begin{aligned} \pi ^{*}_\ell =\frac{\pi _\ell }{1-(F_{\ell -1}-\lfloor F_{\ell -1}\rfloor )}. \end{aligned}$$
(2) if \(n_\ell =i\), according to the window size, this unit will not be selected.
(II) \(\ell \) is a cross-border unit,
(1) if \(n_\ell <i\), according to the window size, this unit will be selected with probability one.
(2) if \(n_\ell =i\), we can calculate \(\pi ^{*}_\ell \) directly using (12).
For the other windows, the proof is the same.
\(\square \)
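The substitution step in case (I)(1) can be written out as follows. This is a sketch that reads the sums as running over \(\pi _i\), uses the shorthand \(R=\sum _{i=k_1+1}^{\ell -1}\pi _i\) (introduced here for brevity), and assumes \(F_{\ell -1}-\lfloor F_{\ell -1}\rfloor =\pi _{b_1}+R\) together with \(\pi _{k_1}=\pi _{a_1}+\pi _{b_1}\). With \(\pi ^{*(1)}_i=\pi _i/(1-\pi _{b_1})\),

$$\begin{aligned} \frac{\pi _i-\pi ^{*(1)}_i\pi _{a_1}}{1-\pi _{a_1}}&=\frac{(1-\pi _{k_1})\,\pi _i}{(1-\pi _{a_1})(1-\pi _{b_1})},\qquad 1-\frac{\pi _{k_1}-\pi _{a_1}}{1-\pi _{a_1}}=\frac{1-\pi _{k_1}}{1-\pi _{a_1}},\\ \pi ^{*}_\ell &=\frac{\dfrac{(1-\pi _{k_1})\,\pi _\ell }{(1-\pi _{a_1})(1-\pi _{b_1})}}{\dfrac{1-\pi _{k_1}}{1-\pi _{a_1}}\left( 1-\dfrac{R}{1-\pi _{b_1}}\right) }=\frac{\pi _\ell }{1-\pi _{b_1}-R}=\frac{\pi _\ell }{1-(F_{\ell -1}-\lfloor F_{\ell -1}\rfloor )}. \end{aligned}$$

The factor \((1-\pi _{k_1})/(1-\pi _{a_1})\) cancels, which is why both values of \(\pi ^{+}_{a_1}\) lead to the same expression.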
Proof of Result 7
The structure of the population and the cross-border units are the same in both methods, and the updating principle of Deville’s method is the same as Eqs. (12) and (13). In Deville’s method, consider the first two windows,
(I) \(\ell \) is not a cross-border unit,
(1) If the cross-border unit is selected inside the previous window, then
$$\begin{aligned} \text{Pr}(\ell \in S)= & {} \int \limits _{F_{\ell -1}}^{ F_{\ell }}f(x)dx=\int \limits _{F_{\ell -1}}^{ F_{\ell }}\frac{1}{\lceil F_{k_1}\rceil -F_{k_1}}dx=\frac{1}{\lceil F_{k_1}\rceil -F_{k_1}}\pi _\ell \\= & {} \frac{1}{1-\pi _{b_1}}\pi _\ell \end{aligned}$$
which is equivalent to the first term of (13),
(2) If the cross-border unit is not selected inside the previous window, then
$$\begin{aligned} \text{Pr}(\ell \in S)= & {} \int \limits _{F_{\ell -1}}^{ F_{\ell }}\left[ 1-\frac{(\lceil F_{k_1-1}\rceil -F_{k_1-1})(F_{k_1}-\lfloor F_{k_1} \rfloor )}{\left\{ 1-(\lceil F_{k_1-1}\rceil -F_{k_1-1})\right\} \left\{ 1-(F_{k_1}-\lfloor F_{k_1} \rfloor )\right\} }\right] dx\\= & {} \frac{1-\pi _{k_1}}{(1-\pi _{a_1})(1-\pi _{b_1})}\pi _\ell \end{aligned}$$
which is equivalent to the second term of (13),
(II) \(\ell \) is a cross-border unit (\(\ell =k_1\)),
(1) If the cross-border unit is selected inside the previous window, then the method ignores the second part of \(k_1\), i.e. \(\pi ^*_{a_1}=0\), which is equivalent to the first term of (12),
(2) If the cross-border unit is not selected inside the previous window, then
$$\begin{aligned} \text{Pr}(\ell \in S)= & {} \int \limits _{\lfloor F_{\ell }\rfloor }^{ F_{\ell }}\frac{1}{1-(\lceil F_{k_1-1}\rceil -F_{k_1-1})}dx\\= & {} \frac{1}{1-(\lceil F_{k_1-1}\rceil -F_{k_1-1})}(F_{\ell }-\lfloor F_{\ell }\rfloor )\\= & {} \frac{1}{1-\pi _{a_1}}\pi _{b_1}=\frac{\pi _{k_1} -\pi _{a_1}}{1-\pi _{a_1}} \end{aligned}$$
which is equivalent to the second term of (12).
For the other windows, the proof is the same.
Now if s is a fixed sample and \(p_{_I}(.)\) and \(p_{_D}(.)\) are the designs of IDS and Deville’s method respectively, then, since all the units inside s have to be selected under the same principle in both methods, \(p_{_I}(s)=p_{_D}(s).\) Furthermore, it is proved in Chauvet (2012) and Chauvet (2021) that Deville’s method, Chromy’s sequential method and the ordered pivotal method lead to the same design, which completes the proof. \(\square \)
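Since the proof identifies the design with that of the ordered pivotal method, a small exact enumeration can make the resulting design concrete. The sketch below implements the ordered pivotal method as it is usually described (Deville and Tillé 1998), not the IDS procedure of the paper, and the function names are illustrative. It uses exact fractions so that the inclusion probabilities can be checked without simulation.

```python
from collections import defaultdict
from fractions import Fraction


def ordered_pivotal_design(pi):
    """Exact sampling design of the ordered pivotal method.

    `pi` holds the target first-order inclusion probabilities in stream
    order. Returns {frozenset(sample): probability}.
    """
    pi = [Fraction(p) for p in pi]
    # State: (undecided unit or None, its current probability, selected set).
    states = {(None, Fraction(0), frozenset()): Fraction(1)}
    for j, q in enumerate(pi):
        nxt = defaultdict(Fraction)
        for (cur, p, chosen), w in states.items():
            if cur is None:
                branches = [(w, j, q, chosen)]
            else:
                s = p + q
                if s <= 1:
                    # One unit absorbs the total s, the other is rejected.
                    branches = [(w * p / s, cur, s, chosen),
                                (w * q / s, j, s, chosen)]
                else:
                    # One unit is selected, the other keeps the excess s - 1.
                    branches = [
                        (w * (1 - q) / (2 - s), j, s - 1, chosen | {cur}),
                        (w * (1 - p) / (2 - s), cur, s - 1, chosen | {j}),
                    ]
            for bw, unit, prob, sel in branches:
                if bw == 0:
                    continue
                if prob == 0:                      # unit rejected
                    unit = None
                elif prob == 1:                    # unit selected
                    sel, unit, prob = sel | {unit}, None, Fraction(0)
                nxt[(unit, prob, sel)] += bw
        states = nxt
    design = defaultdict(Fraction)
    for (cur, p, chosen), w in states.items():
        if cur is None:
            design[chosen] += w
        else:  # non-integer total: settle the last undecided unit
            design[chosen | {cur}] += w * p
            design[chosen] += w * (1 - p)
    return dict(design)


def inclusion_probabilities(design, size):
    """First-order inclusion probabilities implied by a design."""
    return [sum(w for s, w in design.items() if k in s) for k in range(size)]


pi = [Fraction(1, 2), Fraction(1, 4), Fraction(3, 4), Fraction(1, 2)]
design = ordered_pivotal_design(pi)
first_order = inclusion_probabilities(design, len(pi))
```

For these four probabilities, `first_order` equals `pi` and every sample in `design` has size two. The first two units, whose probabilities sum to less than one, are never selected together, which illustrates the spreading effect of the ordered pivotal method.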
About this article
Cite this article
Panahbehagh, B., Jauslin, R. & Tillé, Y. A general stream sampling design. Comput Stat (2023). https://doi.org/10.1007/s00180-023-01408-7