
A general stream sampling design

  • Original Paper
  • Published in: Computational Statistics

Abstract

With the emergence of the big data era, sampling methods that select units based on the order in which they are observed are needed more than ever. To meet this need, a new sequential unequal probability sampling method is proposed, in which the decision to select or reject each unit is made according to the order in which the units appear. A variant of the method selects a sample from a stream by using sliding windows, which act as strata of controllable size; this also allows the sample to be spread over the population in a controlled manner. A special case of the method, with windows of size one, decides on each sampling unit immediately after observing it; its implementation is simple and is presented here as an algorithm with a single condition. With windows of size two, the method becomes one of the optimal stream sampling methods, yielding a well-spread stream sample with positive second-order inclusion probabilities.

Fig. 1


References

  • Aubry P (2023) On the correct implementation of the Hanurav-Vijayan selection procedure for unequal probability sampling without replacement. Commun Stat-Simul Comput 52(5):1849–1877

  • Boley M, Lucchese C, Paurat D, Gartner T (2011) Direct local pattern sampling by efficient two-step random procedures. In: ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD’11, San Diego, USA, 21–24 August 2011. ACM Press, New York, USA, pp 582–590

  • Busnel Y, Tillé Y (2020) Attack-tolerant unequal probability sampling methods over sliding window for distributed streams. In: 4th international conference on compute and data analysis (ICCDA 2020), Mar 2020, San Jose, United States, pp 72–78

  • Chao M-T (1982) A general purpose unequal probability sampling plan. Biometrika 69:653–656

  • Chaudhuri A, Pal S (2022) Sampling with Varying Probabilities. Springer Nature Singapore, Singapore, pp 43–109

  • Chauvet G (2012) On a characterization of ordered pivotal sampling. Bernoulli 18(4):1320–1340

  • Chauvet G (2021) A note on Chromy's sampling procedure. J Surv Stat Methodol 9(5):1050–1061

  • Chauvet G (2022) A Cautionary Note on the Hanurav-Vijayan Sampling Algorithm. J Surv Stat Methodol 10(5):1276–1291

  • Chromy JR (1979) Sequential sample selection methods. In: Proceedings of the American statistical association, survey research methods section, pp 401–406

  • Cohen E, Duffield N, Kaplan H, Lund C, Thorup M (2009) Stream sampling for variance-optimal estimation of subset sums. In: Proceedings of the twentieth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, pp 1255–1264

  • Deville J-C, Tillé Y (1998) Unequal probability sampling without replacement through a splitting method. Biometrika 85:89–101

  • Diop L, Diop CT, Giacometti A, Li D, Soulet A (2018) Sequential pattern sampling with norm constraints. In: 2018 IEEE international conference on data mining (ICDM), pp 89–98

  • Gabler S (1990) Minimax Solutions in Sampling from Finite Populations. Springer, New York

  • Giacometti A, Soulet A (2021) Reservoir pattern sampling in data streams. In: Oliver N, Pérez-Cruz F, Kramer S, Read J, Lozano JA (eds) Machine learning and knowledge discovery in databases. Research track. Springer International Publishing, Cham, pp 337–352

  • Grafström A, Lundström NLP (2013) Why well spread probability samples are balanced? Open J Stat 3(1):36–41

  • Grafström A, Lundström NLP, Schelin L (2012) Spatially balanced sampling through the pivotal method. Biometrics 68(2):514–520

  • Grafström A, Matei A, Qualité L, Tillé Y (2012) Size constrained unequal probability sampling with a non-integer sum of inclusion probabilities. Electron J Stat 6:1477–1489

  • Hanif M, Brewer KRW (1980) Sampling with unequal probabilities without replacement: A review. Int Stat Rev 48:317–335

  • Horvitz DG, Thompson DJ (1952) A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 47(260):663–685

  • Jauslin R, Panahbehagh B, Tillé Y (2022) Sequential spatially balanced sampling. Environmetrics 33(8):e2776

  • Jauslin R, Tillé Y (2020) Spatial spread sampling using weakly associated vectors. J Agric Biol Environ Stat 25(3):431–451

  • Madow WG (1949) On the theory of systematic sampling, II. Ann Math Stat 20:333–354

  • Narain RD (1951) On sampling without replacement with varying probabilities. J Indian Soc Agric Stat 3:169–174

  • R Core Team (2022) R: a language and environment for statistical computing. R foundation for statistical computing, Vienna, Austria

  • Sunter AB (1977) List sequential sampling with equal or unequal probabilities without replacement. Appl Stat 26:261–268

  • Sunter AB (1986) Solutions to the problem of unequal probability sampling without replacement. Int Stat Rev 54:33–50

  • Tillé Y (1996) An elimination procedure of unequal probability sampling without replacement. Biometrika 83:238–241

  • Tillé Y (2006) Sampling Algorithms. Springer, New York

  • Tillé Y (2019) A general result for selecting balanced unequal probability samples from a stream. Inf Process Lett 152:1–6

  • Vijayan K (1968) An exact \(\pi ps\) sampling scheme, generalization of a method of Hanurav. J Roy Stat Soc B30:556–566

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Bardia Panahbehagh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Proof of Result 1

Without loss of generality, and for ease of notation, we prove the result only for \(t = 1\). For \(k=1\) it is obvious that \(\text {E}(\pi ^1_1)=\pi _1\), and for \(k=2,3,\dots ,N\),

$$\begin{aligned} \text {E}(\pi ^1_k) =\pi ^{1(0)}_k(1-\pi _1)+\pi ^{1(1)}_k\pi _1=\pi _k^{1(0)} (1-\pi _1)+\frac{\pi _k-\pi ^{1(0)}_k(1-\pi _1)}{\pi _1}\pi _1=\pi _k. \end{aligned}$$

Also for the sum of inclusion probabilities, we have

$$\begin{aligned} \sum _{k\in U}\pi _k^{1(0)} = 0+\sum _{k=2}^{N}\pi ^{1(0)}_k=\sum _{k=2}^{N}\min (c_1\pi _k,1) =n=\sum _{k=1}^{N}\pi _k, \end{aligned}$$

and

$$\begin{aligned} \begin{aligned} \displaystyle \sum _{k\in U}\pi _k^{1(1)}&=\displaystyle 1+\sum _{k=2}^{N}\pi ^{1(1)}_k\\&=\displaystyle 1+\frac{\sum _{k=2}^{N}\pi _k-(1-\pi _1)\sum _{k=2}^{N}\pi ^{1(0)}_k}{\pi _1}\\&=\displaystyle \frac{\pi _1+\{(n-\pi _1)-n+n\pi _1\}}{\pi _1}\\&=n\\&=\displaystyle \sum _{k=1}^{N}\pi _k. \end{aligned} \end{aligned}$$

\(\square \)
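Applied recursively, the one-unit update in the proof above yields the size-one-window stream procedure. The following is a minimal sketch, not the authors' implementation: it is written in Python with NumPy, the constant \(c_t\) is found by bisection, and all function names are our own illustrative choices.

```python
import numpy as np

def find_c(pi, target):
    """Bisection for c >= 1 such that sum(min(c * pi, 1)) == target."""
    if len(pi) <= target:                  # only reachable by capping all units at 1
        return np.inf
    lo, hi = 1.0, 2.0
    while np.minimum(hi * pi, 1.0).sum() < target:
        hi *= 2.0                          # bracket the solution from above
    for _ in range(100):                   # bisect down to float precision
        mid = 0.5 * (lo + hi)
        if np.minimum(mid * pi, 1.0).sum() < target:
            lo = mid
        else:
            hi = mid
    return hi

def stream_sample(pi, rng):
    """Size-one-window sampling: decide each unit on arrival, then update."""
    pi = np.clip(np.asarray(pi, dtype=float), 0.0, 1.0)
    sample = []
    for t in range(len(pi)):
        p, rest = pi[t], pi[t + 1:]
        if rest.size == 0:                 # last unit: just draw
            if rng.random() < p:
                sample.append(t)
            break
        target = p + rest.sum()            # total that the update must preserve
        c = find_c(rest, target)
        if np.isinf(c):
            pi0 = np.ones_like(rest)       # rejection update pi^{t(0)}, all capped
        else:
            pi0 = np.minimum(c * rest, 1.0)
        if rng.random() < p:               # unit t enters the sample
            sample.append(t)
            pi[t + 1:] = np.clip((rest - pi0 * (1.0 - p)) / p, 0.0, 1.0)
        else:                              # unit t is rejected
            pi[t + 1:] = pi0
    return sample
```

When \(n=\sum _{k}\pi _k\) is an integer, Result 3 below guarantees that the selection-branch probabilities stay non-negative, and every run returns exactly n units.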

Proof of Result 2

Without loss of generality, and for ease of notation, we prove the result only for \(t = 1\). Since

$$\begin{aligned} \pi ^{1(1)}_k=\left\{ \begin{array}{ll} \displaystyle \frac{\pi _k-\min (c_1\pi _k,1)(1-\pi _1)}{\pi _1} \le \frac{\pi _k-c_1\pi _k(1-\pi _1)}{\pi _1} &{} \text{ if } \pi ^{1(0)}_k=1\\ \displaystyle \frac{\pi _k-c_1\pi _k(1-\pi _1)}{\pi _1} &{}\text{ if } \pi ^{1(0)}_k<1, \end{array}\right. \end{aligned}$$

for \(k=2,3,\dots ,N\), a necessary and sufficient condition for all \(\pi ^{1(1)}_k\) to be non-negative is \(\pi _1\ge 1-1/c_1\), which gives the result. \(\square \)

Proof of Result 3

Define

$$\begin{aligned} U_{A}= & {} \{k\in U|0< \pi _k^{1(0)}<1\},\; U_{B} = \{k\in U|\pi _k^{1(0)}=1\},\\ A= & {} \sum _{k\in U_A}\pi _k \text{ and } B=\sum _{k\in U_{B}}\pi _k. \end{aligned}$$

Then it is possible to decompose n as

$$\begin{aligned} \pi _1+A+B=n \end{aligned}$$
(10)

and

$$\begin{aligned} c_1A+\#U_B=n, \end{aligned}$$

where “\(\#\)” denotes cardinality. Now, if n is an integer, then \(c_1A=n-\#U_B\) is an integer, denoted by d; thus \(A=d/c_1\) and \(\#U_B=n-d\). Furthermore, it is easy to see that \(B\le \#U_B\). Now, from Eq. (10), we have

$$\begin{aligned} \pi _1=n-A-B=n-\frac{d}{c_1}-B\ge n-\frac{d}{c_1}-\#U_B=n-\frac{d}{c_1}-(n-d)=d\left( 1-\frac{1}{c_1}\right) . \end{aligned}$$

Now, if \(d\ge 1\), then \(\pi _1\ge (1-1/c_1)\), and if \(d=0\), then \(\#U_B=n\); in other words, \(\pi ^{1(0)}_k=1\) for all \(k=2,3,\dots ,N\). Therefore, to have all \(\pi ^{1(1)}_k\ge 0\), it is necessary to have

$$\begin{aligned} \pi ^{1(1)}_k=\frac{\pi _k-(1-\pi _1)}{\pi _1}\ge 0, \Rightarrow \pi _k+\pi _1-1\ge 0. \end{aligned}$$
(11)

But in such cases, \(\pi ^{1(0)}_k=\pi _k+\alpha _k \pi _1=1\) for some \(0<\alpha _k\le 1\); hence \(\pi _k+\pi _1\ge 1\), and therefore Condition (11) is satisfied.

Thus, if n is an integer, the condition of Result 2 is always fulfilled. \(\square \)
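Result 3 can also be checked numerically. In the sketch below (Python with NumPy; the exact computation of \(c_1\) by scanning the capped units is our own illustrative construction, not taken from the paper), random vectors with integer sum n are drawn, and both the bound \(\pi _1\ge 1-1/c_1\) (or the degenerate all-capped case \(d=0\)) and the non-negativity of the \(\pi ^{1(1)}_k\) are verified.

```python
import numpy as np

def exact_c(pi, m):
    """Exact c with sum(min(c * pi, 1)) == m, found by scanning capped units."""
    p = np.sort(pi)[::-1]          # descending: capped units come first
    tail = p.sum()                 # sum of the (so far) uncapped probabilities
    for j in range(len(p)):        # j = number of units capped at 1
        c = (m - j) / tail
        if c * p[j] < 1.0:         # unit j is not capped: c is consistent
            return c
        tail -= p[j]
    return np.inf                  # m >= len(pi): every unit capped

rng = np.random.default_rng(7)
n, N = 3.0, 12
checked = 0
while checked < 100:
    pi = rng.dirichlet(np.ones(N)) * n        # sums to the integer n exactly
    if pi.max() >= 1.0:
        continue                              # keep a valid inclusion vector
    rest = pi[1:]
    c1 = exact_c(rest, n)                     # sum(min(c1 * pi_k, 1)) = n over k >= 2
    pi10 = np.minimum(c1 * rest, 1.0)         # rejection-branch update
    assert abs(pi10.sum() - n) < 1e-9
    # Result 3: pi_1 >= 1 - 1/c_1, or every pi^{1(0)}_k equals 1
    assert pi[0] >= 1.0 - 1.0 / c1 - 1e-12 or np.all(pi10 >= 1.0 - 1e-12)
    pi11 = (rest - pi10 * (1.0 - pi[0])) / pi[0]   # selection-branch update
    assert pi11.min() >= -1e-12                    # the conclusion of Result 2
    checked += 1
```

The scan relies on the candidate c being nondecreasing as more units are capped, so the first consistent candidate is the solution.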

Proof of Result 4

The proof for \(w_1\) follows directly from Result 1. We prove the result for \(w_2\):

  1. (i)

    For \(k=k_1+1,\dots ,a_2\),

    $$\begin{aligned} E(\pi ^*_k)=\pi _{a_1}\pi ^{*(1)}_k+(1-\pi _{a_1})\pi ^{*(0)}_k=\pi _{a_1}\pi ^{*(1)}_k+\pi _k-\pi _{a_1}\pi ^{*(1)}_k=\pi _k, \end{aligned}$$

    and for \(k=k_1\),

    $$\begin{aligned} E(\pi ^*_{k_1})=\pi _{a_1}\times 1+ (1-\pi _{a_1})\frac{\pi _{k_1}-\pi _{a_1}}{(1-\pi _{a_1})}=\pi _{k_1}. \end{aligned}$$
  2. (ii)

    If \(\pi ^{+}_{a_1}=1\), then in (7), \(\pi _{b_1}\) is distributed over \(\pi _{k_1+1},\dots ,\pi _{a_2}\). To show that the inclusion probabilities in (8) are non-negative, we have

    $$\begin{aligned} \frac{\pi _k-\min (c^*_2\pi _k,1)\pi _{a_1}}{(1-\pi _{a_1})}\ge \frac{\pi _k-c^*_2\pi _k\pi _{a_1}}{(1-\pi _{a_1})}\ge 0 \end{aligned}$$

    which leads to

    $$\begin{aligned} (1-\pi _{a_1})\ge (1-\frac{1}{c^*_{2}}). \end{aligned}$$

    But since the size of \(w_2\) is an integer, we know that

    $$\begin{aligned} \frac{\pi _k-\min (c^*_2\pi _k,1)(1-\pi _{b_1})}{\pi _{b_1}}\ge \frac{\pi _k-c^*_2\pi _k(1-\pi _{b_1})}{\pi _{b_1}}\ge 0 \end{aligned}$$

    and then

    $$\begin{aligned} \pi _{b_1}\ge \left( 1-\frac{1}{c^*_{2}}\right) . \end{aligned}$$

    Therefore, since \(\pi _{a_1}+\pi _{b_1}=\pi _{k_1}\le 1\), we have

    $$\begin{aligned} (1-\pi _{a_1})\ge \pi _{b_1}\ge \left( 1-\frac{1}{c^*_{2}}\right) . \end{aligned}$$
  3. (iii)

    The proof that the sums of the inclusion probabilities are respected is straightforward, by computing the sums of \(\pi ^{*(0)}_k\) and \(\pi ^{*(1)}_k\) inside \(w_2\).

For the other windows, the proof is the same. \(\square \)
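The two expectations in case (i) can be checked with a few lines. In the sketch below, the numerical values are purely illustrative, and the update \(\pi ^{*(1)}_k=\pi _k/(1-\pi _{b_1})\) is borrowed from (13) in the proof of Result 6.

```python
# Numeric check of Result 4(i): the w2 updates preserve first-order
# inclusion probabilities in expectation. Values are illustrative.
pi_a1, pi_b1 = 0.3, 0.4          # split of the cross-border unit k1
pi_k1 = pi_a1 + pi_b1            # pi_{k1} = pi_{a1} + pi_{b1}

# E(pi*_{k1}) = pi_{a1} * 1 + (1 - pi_{a1}) * (pi_{k1} - pi_{a1}) / (1 - pi_{a1})
e_k1 = pi_a1 * 1.0 + (1 - pi_a1) * (pi_k1 - pi_a1) / (1 - pi_a1)
assert abs(e_k1 - pi_k1) < 1e-12

# For an ordinary unit k in w2, any pair satisfying
# pi*(0)_k = (pi_k - pi_{a1} * pi*(1)_k) / (1 - pi_{a1}) has E(pi*_k) = pi_k:
pi_k = 0.5
pi1_k = pi_k / (1 - pi_b1)                     # the update (13) if a1 is selected
pi0_k = (pi_k - pi_a1 * pi1_k) / (1 - pi_a1)   # the update if a1 is rejected
assert abs(pi_a1 * pi1_k + (1 - pi_a1) * pi0_k - pi_k) < 1e-12
```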

Proof of Result 5

For calculating the second-order inclusion probability \(\pi _{k\ell }\) where \(k<\ell \), \(k\in w_i\) and \(\ell \in w_j\), we have

$$\begin{aligned} \text{ Pr }(k\in S, \ell \in S)=\text{ Pr }(k\in S)\text{ Pr }(\ell \in S\mid k \in S)=\pi _k\pi _{\ell \mid k}. \end{aligned}$$

In \(\pi _{\ell \mid k}\), conditioning on the selection of k affects the selection of \(\ell \) through the change in \(\pi _{a_{j-1}}\). Based on a recursive relation, we can calculate \(\pi _{a_{j-1}\mid k}\) step by step from \(\pi _{a_{j-2}\mid k}\), and so on. We then need to consider the cases in Result 5:

  1. (i)

    in this case, the second-order inclusion probabilities can be calculated based on the design, \(p_i\), implemented inside the respective window,

  2. (ii)

    here, after the recursive calculation of \(\pi _{a_i\mid k}\), since \(a_i\) and k are in the same window, we have

    $$\begin{aligned} \pi _{a_i\mid k}=\frac{\pi _{ka_i}}{\pi _k}=\frac{\pi ^{p_i}_{ka_i}}{\pi _k}, \end{aligned}$$
  3. (iii)

    here, since unit k is a cross-border unit, if \(a_i\) is selected, \(\pi _{a_{i+1}}\) will be updated as \(\min (c_{i+1}\pi _{a_{i+1}},1)\), and if \(a_i\) is not selected, then \(\pi _{a_{i+1}}\) will be updated as

    $$\begin{aligned} \frac{\pi ^{p_i(a_i\notin S)}_{{b_{i}}{a_{i+1}}}}{\pi _{b_{i}}/(1-\pi _{a_{i}})}. \end{aligned}$$

    For the conditional probability of \(a_i\) itself, since \(\pi _{a_i}\) is a part of \(\pi _k=\pi _{a_i}+\pi _{b_i}\), we have

    $$\begin{aligned} \{a_i\in S\}\subset \{k\in S\}, \end{aligned}$$

    therefore we have

    $$\begin{aligned} \pi _{a_i\mid k}=\frac{\text{ Pr }(k\in S, a_i \in S)}{\pi _k}=\frac{\text{ Pr }(a_i \in S)}{\pi _k}=\frac{\pi _{a_i}}{\pi _k}, \end{aligned}$$
  4. (iv)

    when \(\pi _{\ell }=\pi _{a_j}+\pi _{b_j}\), then

    $$\begin{aligned} \pi _{\ell \mid k}=\text{ Pr }(a_j\in S\mid k\in S)+\text{ Pr }(a_j\notin S\mid k\in S)\,\text{ Pr }(b_j\in S\mid a_j \notin S)=\pi _{a_{j}|k}+(1-\pi _{a_{j}|k})\frac{\pi _{b_j}}{1-\pi _{a_{j}}}, \end{aligned}$$

    and the rest of the proof is the same as case (ii),

  5. (v)

    the first part of the proof of this case is the same as case (iv), and after recursive calculations, the last part is the same as the last part of case (iii).

\(\square \)

Proof of Result 6

After deciding on the first window, since \(a_1\) is not a real unit, the units inside \(w_2\) will be initially updated, depending on the decision for this unit, as

$$\begin{aligned} \pi ^*_{b_1}=\left\{ \begin{array}{ll} \pi ^{*(1)}_{b_1}=0 &{} \text{ if } \pi ^+_{a_1}=1\\[2mm] \displaystyle \pi ^{*(0)}_{b_1}=\frac{\pi _{k_1}-\pi _{a_1}}{1-\pi _{a_1}} &{} \text{ if } \pi ^+_{a_1}=0, \end{array} \right. \end{aligned}$$
(12)

and

$$\begin{aligned} \pi ^*_k=\left\{ \begin{array}{ll} \pi ^{*(1)}_k=\frac{\pi _{k}}{1-\pi _{b_1}} &{} \text{ if } \pi ^+_{a_1}=1\\[2mm] \displaystyle \pi ^{*(0)}_k=\frac{\pi _{k}-\pi ^{*(1)}_{k}\pi _{a_1}}{1-\pi _{a_1}} &{} \text{ if } \pi ^+_{a_1}=0, \end{array} \right. \text { for } k = k_1+1,\dots ,a_2. \end{aligned}$$
(13)

Consider unit \(\ell \) inside \(w_2\):

  1. (I)

    \(\ell \) is not a cross-border unit,

    1. (1)

      if \(n_\ell <i\) and \(\pi ^{+}_{a_1}=1\), then according to (13) we have

      $$\begin{aligned} \pi ^{*}_\ell =\frac{\frac{\pi _\ell }{1-\pi _{b_1}}}{1-\sum _{i=k_1+1}^{\ell -1}\frac{\pi _{i}}{1-\pi _{b_1}}}=\frac{\pi _\ell }{1-(F_{\ell -1}-\lfloor F_{\ell -1}\rfloor )}, \end{aligned}$$

      and if \(\pi ^{+}_{a_1}=0\),

      $$\begin{aligned} \pi ^{*}_\ell =\frac{\frac{\pi _{\ell }-\pi ^{*(1)}_{\ell }\pi _{a_1}}{1-\pi _{a_1}}}{1-\frac{\pi _{k_1}-\pi _{a_1}}{1-\pi _{a_1}}-\sum _{i=k_1+1}^{\ell -1}\frac{\pi _{i}-\pi ^{*(1)}_{i}\pi _{a_1}}{1-\pi _{a_1}}}, \end{aligned}$$

      which, after replacing each \(\pi ^{*(1)}_k\) by \(\pi _k/(1-\pi _{b_1})\), gives

      $$\begin{aligned} \pi ^{*}_\ell =\frac{\pi _\ell }{1-(F_{\ell -1}-\lfloor F_{\ell -1}\rfloor )}. \end{aligned}$$
    2. (2)

      if \(n_\ell =i\), according to the window size, this unit will not be selected.

  2. (II)

    \(\ell \) is a cross-border unit,

    1. (1)

      if \(n_\ell <i\), according to the window size, this unit will be selected with probability one.

    2. (2)

      if \(n_\ell =i\), we can calculate \(\pi ^{*}_\ell \) directly using (12).

    For the other windows, the proof is the same.

\(\square \)

Proof of Result 7

The structure of the population and of the cross-border units is the same in both methods, and the updating principle of Deville's method is the same as in Eqs. (12) and (13). In Deville's method, consider the first two windows:

  1. (I)

    \(\ell \) is not a cross-border unit,

    1. (1)

      If the cross-border unit is selected inside the previous window, then

      $$\begin{aligned} \text{ Pr }(\ell \in S)= & {} \int \limits _{F_{\ell -1}}^{ F_{\ell }}f(x)dx=\int \limits _{F_{\ell -1}}^{ F_{\ell }}\frac{1}{\lceil F_{k_1}\rceil -F_{k_1}}dx=\frac{1}{\lceil F_{k_1}\rceil -F_{k_1}}\pi _\ell \\= & {} \frac{1}{1-\pi _{b_1}}\pi _\ell \end{aligned}$$

      which is equivalent to the first term of (13),

    2. (2)

      If the cross-border unit is not selected inside the previous window, then

      $$\begin{aligned} \text{ Pr }(\ell \in S)= & {} \int \limits _{F_{\ell -1}}^{ F_{\ell }}\left\{ 1-\frac{(\lceil F_{k_1-1}\rceil -F_{k_1-1})(F_{k_1}-\lfloor F_{k_1} \rfloor )}{\left\{ 1-(\lceil F_{k_1-1}\rceil -F_{k_1-1})\right\} \left\{ 1-(F_{k_1}-\lfloor F_{k_1} \rfloor )\right\} }\right\} dx\\= & {} \frac{1-\pi _{k_1}}{(1-\pi _{a_1})(1-\pi _{b_1})}\pi _\ell \end{aligned}$$

      which is equivalent to the second term of (13),

  2. (II)

    \(\ell \) is a cross-border unit (\(\ell =k_1\)),

    1. (1)

      If the cross-border unit is selected inside the previous window, then the method ignores the second part of \(k_1\), i.e. \(\pi ^*_{b_1}=0\), which is equivalent to the first term of (12),

    2. (2)

      If the cross-border unit is not selected inside the previous window, then

      $$\begin{aligned} \text{ Pr }(\ell \in S)= & {} \int \limits _{\lfloor F_{\ell }\rfloor }^{ F_{\ell }}\frac{1}{1-(\lceil F_{k_1-1}\rceil -F_{k_1-1})}dx\\= & {} \frac{1}{1-(\lceil F_{k_1-1}\rceil -F_{k_1-1})}(F_{\ell }-\lfloor F_{\ell }\rfloor )\\= & {} \frac{1}{1-\pi _{a_1}}\pi _{b_1}=\frac{\pi _{k_1} -\pi _{a_1}}{1-\pi _{a_1}} \end{aligned}$$

      which is equivalent to the second term of (12),

For the other windows, the proof is the same.

Now if s is a fixed sample and \(p_{_I}(\cdot )\) and \(p_{_D}(\cdot )\) are the designs of IDS and Deville's method respectively, since all the units inside s have to be selected under the same principle in both methods, we have \(p_{_I}(s)=p_{_D}(s)\). Furthermore, it is proved in Chauvet (2012) and Chauvet (2021) that Deville's method, Chromy's sequential method and the ordered pivotal method lead to the same design, which completes the proof. \(\square \)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Panahbehagh, B., Jauslin, R. & Tillé, Y. A general stream sampling design. Comput Stat (2023). https://doi.org/10.1007/s00180-023-01408-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s00180-023-01408-7

Keywords

Navigation