1 Introduction

Generating random variables from finite and discrete distributions has long been an important building block in many applications. For example, computer simulations typically make a huge number of random decisions based on prespecified or dynamically changing distributions. In this work we consider two fundamental computational problems, namely sampling from a distribution and sampling independent events. We consider these problems on general probabilities as well as restricted to sorted probabilities. The latter case is motivated by the fact that many natural distributions, such as the geometric or binomial distribution, are unimodal, i.e., they change monotonicity at most once. After splitting up such a distribution at its only extremum, we obtain two sorted sequences of probabilities; see Sect. 5 for a thorough discussion. As we will see, there is a rich interplay in designing efficient algorithms that solve these different problem variants.

We present our results on the classical Real RAM model of computation [1, 13]. In particular, we will assume that the following operations take constant time: (1) accessing the content of any memory cell, (2) generating a uniformly distributed real number in the interval [0, 1], and (3) performing basic arithmetic operations on real numbers, such as addition, multiplication, division, comparison, and truncation, as well as evaluating elementary functions like exp and log. We argue in Sect. 5 that our algorithms can also be adapted to work on the Word RAM model of computation.

1.1 Proportional Sampling

We first focus on the classic problem of sampling from a given distribution. Given \(\mathbf {p}= (p_1,\ldots ,p_n) \in \mathbb {R}_{\ge 0}^n\), we define a random variable \(Y = Y_{\mathbf {p}}\) that takes values in \([n] = \{1,\ldots ,n\}\) such that \({\text {Pr}}[Y = i] = {p_i}/{\mu }\), where \(\mu = \sum _{i=1}^n p_i\) is assumed to be positive. Note that if \(\mu =1\) then \(\mathbf {p}\) is indeed a probability distribution, otherwise we need to normalize first. We concern ourselves with the problem of sampling Y. We study this problem on two different classes of input sequences, sorted and general (i.e., not necessarily sorted) sequences; depending on the class under consideration we call the problem SortedProportionalSampling or UnsortedProportionalSampling.

A single-sample algorithm for SortedProportionalSampling or UnsortedProportionalSampling gets input \(\mathbf {p}\) and outputs a number \(s \in [n]\) that has the same distribution as Y. When we speak of “input \(\mathbf {p}\)” we mean that the algorithm gets to know n and can access every \(p_{i}\) in constant time. This can be achieved by storing all \(p_{i}\)’s in an array, but also, e.g., by having access to an algorithm computing any \(p_i\) in constant time. In particular, the algorithm does not know the number of i’s with \(p_{i} = 0\). Moreover, the input format is not sparse. For this problem we prove the following result.

Theorem 1.1

There is a single-sample algorithm for SortedProportionalSampling with expected time \(\mathcal {O}\big (\frac{\log n}{\log \log n}\big )\) and for UnsortedProportionalSampling with expected time \(\mathcal {O}(n)\). Both bounds are asymptotically tight.

We remark that all our lower bounds only hold for algorithms that work for all n and all (sorted) sequences \(p_{1},\ldots ,p_{n}\). They are worst-case bounds over the input sequence \(\mathbf {p}\) and asymptotic in n. For particular instances \(\mathbf {p}\) there can be faster algorithms. To avoid any confusion, note that we mean worst-case bounds whenever we speak of (running) time and expected bounds whenever we speak of expected (running) time.

To obtain faster sampling times, we consider sampling data structures that support ProportionalSampling as a query. We view building the data structure as preprocessing of the input. More precisely, in this preprocessing-query variant we consider the interplay of two algorithms. First, the preprocessing algorithm P gets \(\mathbf {p}\) as input and computes some auxiliary data \(D = D(\mathbf {p})\). Second, the query algorithm Q gets input \(\mathbf {p}\) and D, and samples Y, i.e., for any \(s \in [n]\) we have \({\text {Pr}}[Q(\mathbf {p},D) = s] = {\text {Pr}}[Y = s]\). For \({\text {Pr}}[Q(\mathbf {p},D) = s]\) the probability goes only over the random choices of Q, so that, after running the preprocessing once, running the query algorithm multiple times generates multiple independent samples. In this setting we prove the following tight result.

Theorem 1.2

For any \(2 \le \beta \le \mathcal {O}(\frac{\log n}{\log \log n})\), SortedProportionalSampling can be solved in preprocessing time \(\mathcal {O}(\log _\beta n)\) and expected query time \(\mathcal {O}(\beta )\). This is optimal, as there is a constant \(\varepsilon > 0\) such that for all \(2 \le \beta \le \mathcal {O}(\frac{\log n}{\log \log n})\) SortedProportionalSampling has no data structure with preprocessing time \(\varepsilon \log _\beta (n)\) and expected query time \(\varepsilon \, \beta \).

Note that if we can afford a preprocessing time of \(\mathcal {O}(\log n)\) then the query time is already \(\mathcal {O}(1)\), which is optimal. Thus, larger preprocessing times cannot yield better query times. Moreover, for \(\beta = \Theta (\frac{\log n}{\log \log n})\) the preprocessing time is equal to the query time. Thus, we may skip the preprocessing phase and run both the preprocessing and query algorithm for every sample. We obtain a single-sample algorithm with runtime \(\mathcal {O}(\frac{\log n}{\log \log n})\). This shows that \(\beta \gg \frac{\log n}{\log \log n}\) makes no sense and explains why we allow preprocessing time \(\mathcal {O}(\log _\beta n)\) with \(2 \le \beta \le \mathcal {O}(\frac{\log n}{\log \log n})\). Varying \(\beta \) yields a trade-off between preprocessing and query time; if one wants to have a large number of samples, one should set \(\beta = 2\) to minimize query time, while a large \(\beta \) yields superior runtimes if one wants only a small number of samples. Note that we prove a matching lower bound for this trade-off for all \(\beta \).

For general input sequences, ProportionalSampling can be solved by the technique known as pairing or aliasing [8, 17]; see also Mihai Pătraşcu’s blog [14] for an excellent exposition. Basically, we use \(\mathcal {O}(n)\) preprocessing to distribute the probabilities of all elements over n urns such that any urn contains exactly 1 / n probability mass, stemming from at most two elements. For querying we first choose an urn uniformly at random. Then we choose one of the two included elements randomly according to their probability mass in the urn, resulting in an \(\mathcal {O}(1)\) (worst-case) query time. This result is not new, but will be used in the proofs of Theorem 1.5 and Theorem 1.6 below, so we include it for completeness.
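For concreteness, the following Python sketch illustrates this pairing technique; the function names build_alias and alias_sample are ours, and the sketch is meant as an illustration of the urn construction rather than as the exact data structure behind Theorem 1.3.

```python
import random

def build_alias(p):
    """O(n) preprocessing: spread the mass of p over n urns, each holding a 1/n
    share of the total mass that stems from at most two elements."""
    n, mu = len(p), sum(p)
    scaled = [x * n / mu for x in p]                 # rescale so the average is 1
    small = [i for i, x in enumerate(scaled) if x < 1.0]
    large = [i for i, x in enumerate(scaled) if x >= 1.0]
    prob = [0.0] * n          # prob[k]: share of urn k belonging to element k itself
    alias = list(range(n))    # alias[k]: the second element stored in urn k
    while small and large:
        s, b = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], b             # fill urn s up with mass of b
        scaled[b] -= 1.0 - scaled[s]
        (small if scaled[b] < 1.0 else large).append(b)
    for k in large + small:                          # leftover urns hold only themselves
        prob[k] = 1.0
    return prob, alias

def alias_sample(prob, alias):
    """O(1) query: pick an urn uniformly, then one of its two elements
    according to their share of the urn (returns a 0-based index)."""
    k = random.randrange(len(prob))
    return k if random.random() <= prob[k] else alias[k]
```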

Theorem 1.3

UnsortedProportionalSampling can be solved in preprocessing time \(\mathcal {O}(n)\) and query time \(\mathcal {O}(1)\). This is optimal, as there is a constant \(\varepsilon > 0\) such that UnsortedProportionalSampling has no data structure with preprocessing time \(\varepsilon n\) and expected query time \(\varepsilon n\).

Note that any data structure with preprocessing time \(t_p\) and query time \(t_q\) can be transformed into a single-sample algorithm with expected time \(t_p+t_q\), so the single-sample variant of the problem is also solved by the preprocessing-query variant. This argument proves that Theorem 1.1 follows from Theorems 1.2 and 1.3.

Related work The fundamental problem of the exact and efficient generation of random values from discrete and continuous distributions has been studied extensively in the literature. The seminal work [9] examines the power of several restricted devices, like finite-state machines; the articles [6, 18] provide a further refined treatment of the topic. However, their results are not directly comparable to ours, since on the one hand they do not make any assumption on the sequence of probabilities and use unbiased coin flips as the only source of randomness, but on the other hand they cannot guarantee efficient precomputation on general sequences. Furthermore, [7] and [10] provided algorithms for a dynamic version of UnsortedProportionalSampling, where the probabilities may change over time. In particular, under certain mild conditions their results guarantee the same bounds as in Theorem 1.3. Finally, there is a solution to UnsortedProportionalSampling [3] that can be implemented on a WordRAM (i.e., the \(p_i\)’s are each represented by w bits, and the usual arithmetic operations on w-bit integers take constant time) that improves upon Walker’s technique and has optimal space and time requirements.

1.2 Subset Sampling

In the previous section we considered the problem of sampling from a distribution. In this section we give an algorithm to randomly pick a subset S of \(\{1,\ldots ,n\}\), where the values \(p_i = {\text {Pr}}[i \in S]\) are given as an input, and the events “\(i \in S\)” are independent. In other words, we are given \(\mathbf {p}= (p_1,\ldots ,p_n)\) as input and we want to sample a random variable \(X \subseteq [n]\) with

$$\begin{aligned} {\text {Pr}}[X = S] = \bigg ( \prod _{i \in S} p_i \bigg ) \cdot \bigg ( \prod _{i \in [n] \setminus S} (1-p_i) \bigg ). \end{aligned}$$

As a shorthand we write \(\mu = \mu _{\mathbf {p}} = \sum _{i=1}^n p_i = \mathbb {E}[|X|]\). We call the problem of sampling X SortedSubsetSampling or UnsortedSubsetSampling, if we consider it on sorted or general input sequences, respectively.

The motivation for these problems comes from sampling certain random graphs. Consider for instance the Chung-Lu random graph model [4]: We are given weights \(w_1 \ge \cdots \ge w_n\) and sample a graph on vertex set [n] where the edge \(\{i,j\}\) is independently present with probability \({\text {min}}\{1, \frac{w_i w_j}{\sum _k w_k}\}\). Note that for any fixed vertex i, the edge probabilities to vertices \(j>i\) are sorted in descending order. Thus, sampling the set of neighbors of vertex i is an instance of SortedSubsetSampling. Solving these instances for all vertices i yields a Chung-Lu random graph, and our algorithms from this paper do this in total time \(\mathcal {O}(n \log n + m)\), where m is the expected number of edges. This does not match the optimal \(\mathcal {O}(n+m)\) [11], because we ignore the structure connecting the different arising instances. However, it serves as a motivating example.

As previously, we consider two variations of SubsetSampling. In the single-sample variant we are given \(\mathbf {p}\) and we want to compute an output that has the same distribution as X. Moreover, in the preprocessing-query variant we have a precomputation algorithm that is given \(\mathbf {p}\) and computes some auxiliary data D, and a query algorithm that is given \(\mathbf {p}\) and D and has an output with the same distribution as X; where the results of multiple calls to the query algorithm are independent.

No query algorithm can run in expected time \(o(1 + \mu )\), as its expected output size is \(\mu \) and any algorithm requires a running time of \(\Omega (1)\). Whether the query time \(\mathcal {O}(1 + \mu )\) is achievable depends on \(\mu \) and the allotted preprocessing time, as our results below make precise. Note that the single-sample variant of UnsortedSubsetSampling can be solved trivially in time \(\mathcal {O}(n)\); we just toss a biased coin for every \(p_i\). This algorithm is optimal, as shown by Theorem 1.4 below.
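For illustration, this trivial single-sample algorithm reads as follows in Python (a minimal sketch; indices are reported 1-based as in the rest of the paper).

```python
import random

def subset_sample_naive(p):
    """Toss an independent biased coin for every p_i; time O(n)."""
    return [i + 1 for i, pi in enumerate(p) if random.random() < pi]
```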

Theorem 1.4

There is a single-sample algorithm for SortedSubsetSampling with expected time

$$\begin{aligned} t(n,\mu ) = {\left\{ \begin{array}{ll} \mathcal {O}(\mu ), &{} \text {if } \mu \ge \tfrac{1}{2} \log n, \\ \mathcal {O}\left( 1 + \frac{\log n}{\log (\frac{\log n}{\mu })}\right) , &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$

and for UnsortedSubsetSampling with expected time \(\mathcal {O}(n)\). Both bounds are asymptotically tight for any fixed \(\mu = \mu (n)\).

Let us discuss what we mean by “asymptotically tight for any fixed \(\mu = \mu (n)\)”. Fix any \(\mu = \mu (n)\). Consider any single-sample algorithm for SortedSubsetSampling that, given any \(\mathbf {p}\) (not necessarily with \(\mu _\mathbf {p}= \mu \)), correctly samples from the desired distribution. Then there exists an input \(\mathbf {p}\) with \(\mu _\mathbf {p}= \mu \) such that the expected time of the algorithm on input \(\mathbf {p}\) is \(\Omega (t(n,\mu ))\), where \(t(n,\mu )\) is defined in Theorem 1.4. This holds even if we allow the algorithm to have a very large runtime for all instances with \(\mu _\mathbf {p}\ne \mu \). In particular, our runtime bound is not only tight for one infinite family of input \(\mathbf {p}\) (realizing a particular function \(\mu (n)\)), but for every \(\mu (n)\) we construct a hard family of inputs. A similar discussion applies to Theorems 1.5 and 1.6 below.

As for ProportionalSampling, the single-sample result Theorem 1.4 follows from our results on the preprocessing-query variant below.

Theorem 1.5

For any \(2 \le \beta < n\), SortedSubsetSampling can be solved in preprocessing time \(\mathcal {O}(\log _\beta n)\) and expected query time \(\mathcal {O}(t_q^\beta (n,\mu ))\), where

$$\begin{aligned} t_q^\beta (n,\mu ) = {\left\{ \begin{array}{ll} \mu , &{} \text {if}\,\mu \ge \tfrac{1}{2} \log n, \\ 1 + \beta \mu , &{} \text {if } \mu < \tfrac{1}{\beta }\log _\beta n, \\ \frac{\log n}{\log (\frac{\log n}{\mu })}, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

In particular, the query time is always at most \(\mathcal {O}(1 + \beta \mu )\). This is optimal, as there is a constant \(\varepsilon >0\) such that for all \(2 \le \beta < n\) SortedSubsetSampling has no data structure with preprocessing time \(\varepsilon \log _\beta n\) and expected query time \(\varepsilon \, t_q^\beta (n,\mu )\) for any fixed \(\mu = \mu (n)\).

Observe that setting \(\beta =2\) in the above result yields a preprocessing time of \(\mathcal {O}(\log n)\) and an (optimal) expected query time of \(\mathcal {O}(1+\mu )\).

The next result addresses the case of general, i.e., not necessarily sorted, probabilities.

Theorem 1.6

UnsortedSubsetSampling can be solved in preprocessing time \(\mathcal {O}(n)\) and expected query time \(\mathcal {O}(1 + \mu )\). This is optimal, as there is a constant \(\varepsilon > 0\) such that UnsortedSubsetSampling has no data structure with preprocessing time \(\varepsilon n\) and expected query time \(\varepsilon n\) for any fixed \(\mu = \mu (n)\).

Both positive results in the previous theorems depend heavily on each other. In particular, as demonstrated in Sect. 2.2, we prove them by repeatedly reducing the instance size n and switching from one problem variant to the other.

We also present a relation between ProportionalSampling and SubsetSampling that suggests that the classic problem ProportionalSampling is the easier of the two problems (or can be seen as a special case of SubsetSampling). Specifically, we present a reduction that allows one to infer the upper bounds for ProportionalSampling (Theorems 1.2 and 1.3) from the upper bounds for SubsetSampling (Theorems 1.5 and 1.6), see Sect. 4 for details.

Related work A classic algorithm solves SubsetSampling for \(p_1 = \ldots = p_n = p\) in the optimal expected time \(\mathcal {O}(1 + \mu )\), see, e.g., the monographs [5] and [8], where also many other cases are discussed. Indeed, observe that the index \(i_1\) of the first sampled element is geometrically distributed, i.e., \({\text {Pr}}[i_1 = i] = (1-p)^{i-1} p\). Such a random value can be generated by setting \(i_1 = \lfloor \frac{\log {\text {rand}}()}{\log (1-p)} \rfloor \). Moreover, after having sampled the index of the first element, we iterate the process starting at \(i_1+1\) to sample the second element, and so on, until we arrive for the first time at an index \(i_k > n\). In [16] the “orthogonal” problem is considered, where we want to uniformly sample a fixed number of elements from a stream of objects. The problem of UnsortedSubsetSampling was considered also in [15], where algorithms with linear preprocessing time and suboptimal query time \(\mathcal {O}(\log n + \mu )\) were designed. Our results improve upon this running time, and provide matching lower bounds.
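In Python, the classic equal-probability algorithm just described may be sketched as follows (a minimal sketch; the max-guard only protects against boundary effects of the floating-point random number generator).

```python
import math
import random

def subset_sample_equal(n, p):
    """Sample a subset of {1,...,n} that contains every i independently with
    probability p, in expected time O(1 + n*p), by jumping over the
    geometrically distributed gaps between consecutive sampled indices."""
    if p <= 0.0:
        return []
    if p >= 1.0:
        return list(range(1, n + 1))
    S, i = [], 0
    while True:
        u = 1.0 - random.random()                       # uniform in (0, 1]
        # gap to the next sampled index: Pr[gap = g] = (1-p)^(g-1) * p for g >= 1
        gap = max(1, math.ceil(math.log(u) / math.log(1.0 - p)))
        i += gap
        if i > n:
            return S
        S.append(i)
```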

1.3 Notation and Organization

In the remainder, we will write \(\ln x\) for the natural logarithm of x, \(\log _t x = \ln x / \ln t\), and \(\log x = \log _2 x\). Finally, we will write \({\text {rand}}()\) for a uniform random number in [0, 1].

The rest of the paper is structured as follows. In Sect. 2 we present our new algorithms, proving (the upper bounds of) Theorem 1.2 in Sect. 2.1 and Theorems 1.5 and 1.6 in Sect. 2.2. In Sect. 3 we present the lower bounds, proving (the lower bounds of) Theorems 1.3 and 1.6 in Sect. 3.1, Theorem 1.2 in Sect. 3.2, and Theorem 1.5 in Sect. 3.3. We present our reduction from ProportionalSampling to SubsetSampling in Sect. 4. We discuss relaxations to our input and machine model and possible extensions in Sect. 5.

2 Upper Bounds

2.1 A Simple Algorithm for Sorted Proportional Sampling

In this section, we prove the upper bound of Theorem 1.2 by presenting an algorithm for SortedProportionalSampling with \(\mathcal {O}(\beta )\) expected query time after \(\mathcal {O}(\log _\beta n)\) preprocessing, where \(2 \le \beta \le \mathcal {O}(\frac{\log n}{\log \log n})\) is a parameter. We remark that our algorithm also works for \(\beta \gg \frac{\log n}{\log \log n}\), but is not meaningful in this case, because then the preprocessing time is less than the query time.

Let \(p_{1},\ldots ,p_{n}\) be an input sequence to SortedProportionalSampling. Consider the blocks \(B_k := \{i \in [n] \mid \beta ^k \le i < \beta ^{k+1}\}\) with \(0 \le k \le L := \lfloor \log _\beta n \rfloor \). Note that \(B_0,\ldots ,B_L\) partition [n]. For \(i \in B_k\) we set \(\overline{p}_i := p_{\beta ^k}\), which is an upper bound for \(p_i\). Let \(\mu := \sum _i p_i\) and \(\overline{\mu }:= \sum _i \overline{p}_i\). We also set for \(0 \le k \le L\)

$$\begin{aligned} q_k := \sum _{i \in B_k} \overline{p}_i = |B_k| \cdot p_{\beta ^k} = \big (\!{\text {min}}(\beta ^{k+1},n+1) - \beta ^k\big ) \cdot p_{\beta ^k}. \end{aligned}$$

For preprocessing, we run the preprocessing of UnsortedProportionalSampling on \(q_0,\ldots ,q_L\). This takes time \(\mathcal {O}(L) = \mathcal {O}(\log _\beta n)\) using Theorem 1.3, since \(q_k\) can be evaluated in constant time.

Our query algorithm consists of two steps. First, we sample an index i with distribution \(\overline{p}_1,\ldots ,\overline{p}_n\). To this end, we sample a block \(B_k\) proportional to the distribution \(q_0,\ldots ,q_L\) and then sample an index \(i \in B_k\) uniformly at random. Second, with probability \(1 - p_i/\overline{p}_i\) we reject i and repeat the whole process. Otherwise we return i. This culminates in Algorithm 1.

Algorithm 1
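The following Python sketch reconstructs the preprocessing and the query loop just described. It assumes that \(\beta \) is an integer and, for simplicity, replaces the constant-time block selection provided by Theorem 1.3 with a linear scan over \(q_0,\ldots ,q_L\); it is therefore an illustration of Algorithm 1 rather than the exact procedure analyzed below.

```python
import random

def preprocess(p, beta):
    """O(log_beta n) preprocessing: compute the block sums q_k = |B_k| * p[beta^k]
    (p is non-increasing; indices are 1-based as in the paper)."""
    n = len(p)
    L = 0
    while beta ** (L + 1) <= n:                      # L = floor(log_beta n)
        L += 1
    q = []
    for k in range(L + 1):
        lo = beta ** k                               # first index of block B_k
        hi = min(beta ** (k + 1), n + 1)             # one past the last index of B_k
        q.append((hi - lo) * p[lo - 1])              # upper bound on the mass of B_k
    return q

def query(p, q, beta):
    """Rejection-based query: expected O(beta) rounds of the outer loop."""
    n, total = len(p), sum(q)
    while True:
        # choose a block proportionally to q_0,...,q_L (a linear scan stands in
        # for the O(1)-time structure of Theorem 1.3)
        r, k = random.random() * total, 0
        while k + 1 < len(q) and r > q[k]:
            r -= q[k]
            k += 1
        lo = beta ** k
        hi = min(beta ** (k + 1), n + 1)
        i = random.randrange(lo, hi)                 # uniform index in B_k (1-based)
        p_bar = p[lo - 1]                            # upper bound for all p_i with i in B_k
        if random.random() <= p[i - 1] / p_bar:      # accept with probability p_i / p_bar
            return i
```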

Note that we pick index \(i \in B_k\) with probability proportional to \(\overline{p}_i\) and do not reject it with probability \(p_i/\overline{p}_i\). Thus, the probability of returning a particular index i is proportional to \(\overline{p}_i \cdot p_i/\overline{p}_i = p_i\), so we obtain an exact sampling algorithm. Moreover, in any iteration of the loop the probability r of not rejecting, i.e., of leaving the loop, is

$$\begin{aligned} r = \frac{1}{\overline{\mu }} \sum _{i=1}^n \overline{p}_i \cdot p_i/\overline{p}_i. \end{aligned}$$

In this equation, note the first step of sampling with respect to \(\overline{p}_1,\ldots ,\overline{p}_n\) (\(\frac{1}{\overline{\mu }} \sum _{i=1}^n \overline{p}_i\)) and the second step of rejection (\(p_i/\overline{p}_i\)). Clearly, this simplifies to \(r = \mu / \overline{\mu }\). The following lemma shows that \(\overline{\mu }\le \beta \cdot \mu \), implying \(r \ge 1/\beta \). Hence, the expected number of iterations of the loop is \(\mathcal {O}(\beta )\), and in total querying takes expected time \(\mathcal {O}(\beta )\).

Lemma 2.1

We have \(\mu \le \overline{\mu }\le \beta \cdot \mu \).

Proof

The first inequality follows from \(p_i \le \overline{p}_i\). Note that for \(i \in B_k\) we have \(\lceil i/\beta \rceil \le \beta ^k\). Thus, \(p_{\lceil i/\beta \rceil } \ge p_{\beta ^k}\). Hence,

$$\begin{aligned} \overline{\mu }= \sum _{i=1}^n \overline{p}_i \le \sum _{i=1}^n p_{\lceil i/\beta \rceil } \le \beta \sum _{i=1}^n p_i = \beta \cdot \mu . \end{aligned}$$

\(\square \)

2.2 Subset Sampling

In this section we consider SortedSubsetSampling and UnsortedSubsetSampling and prove the upper bounds of Theorems 1.5 and 1.6. An interesting interplay between both of these problem variants will be revealed on the way.

We begin with an algorithm for unsorted probabilities that has a rather large preprocessing time, but will be used as a base case later. The algorithm uses Theorem 1.3.

Lemma 2.2

UnsortedSubsetSampling can be solved in preprocessing time \(\mathcal {O}(n^2)\) and expected query time \(\mathcal {O}(1+\mu )\).

Proof

For \(i \in [n]\) let us denote by \(S_i\) the smallest sampled element that is at least i, or \(\infty \), if no such element is sampled. Then \(S_i\) is a random variable such that

$$\begin{aligned} {\text {Pr}}[S_i = j] = p_j \prod _{i \le k < j} (1-p_k) \quad \text {and}\quad {\text {Pr}}[S_i = \infty ] = \prod _{i\le k \le n} (1-p_k). \end{aligned}$$

All these probabilities can be computed on a Real RAM in time \(\mathcal {O}(n)\) for any i, i.e., in time \(\mathcal {O}(n^2)\) for all i. After having computed the distribution of the \(S_i\)’s, we execute, for each \(i \in [n]\), the preprocessing of Theorem 1.3, which allows us to quickly sample \(S_i\) later on. This preprocessing takes time \(\mathcal {O}(n^2)\).

For querying, we start at \(i=1\) and iteratively sample the smallest element \(j \ge i\) (i.e., sample \(S_i\)), output j, and start over with \(i = j+1\). This is done until \(j = \infty \) or \(i = n+1\). Note that any sample of \(S_i\) can be computed in \(\mathcal {O}(1)\) time with our preprocessing, so that sampling \(S \subseteq [n]\) will be done in time \(\mathcal {O}(1 + |S|)\). The expected runtime is, thus, \(\mathcal {O}(1+\mu )\). \(\square \)
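A Python sketch of this construction is given below; binary search on the cumulative distributions stands in for the constant-time structure of Theorem 1.3, so the sketch illustrates the proof idea rather than the exact data structure.

```python
import bisect
import random

def preprocess_quadratic(p):
    """O(n^2) preprocessing: for every i, the cumulative distribution of S_i,
    the smallest sampled element that is at least i (last entry = 'infinity')."""
    n = len(p)
    cum = []
    for i in range(n):
        dist, stay = [], 1.0
        for j in range(i, n):
            dist.append(p[j] * stay)        # Pr[S_i = j] = p_j * prod_{i <= k < j} (1 - p_k)
            stay *= 1.0 - p[j]
        dist.append(stay)                   # Pr[S_i = infinity]
        c, acc = [], 0.0
        for x in dist:
            acc += x
            c.append(acc)
        cum.append(c)
    return cum

def query_quadratic(cum):
    """Expected O(1 + mu) jumps: repeatedly sample the next element S_i."""
    n = len(cum)
    S, i = [], 0
    while i < n:
        j = i + bisect.bisect_left(cum[i], random.random() * cum[i][-1])
        if j == n:                          # the 'infinity' outcome: nothing left
            break
        S.append(j + 1)                     # report 1-based indices
        i = j + 1
    return S
```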

After having established this base case, we turn towards reductions between SortedSubsetSampling and UnsortedSubsetSampling. First, we give an algorithm for UnsortedSubsetSampling that reduces the problem to SortedSubsetSampling. For this, we roughly sort the probabilities so that we get good upper bounds for each probability. Then these upper bounds will be a sorted instance. After querying from this sorted instance, we use rejection (see, e.g., [8]) to sample with the original probabilities.

Lemma 2.3

Assume that SortedSubsetSampling can be solved in preprocessing time \(t_p(n,\mu )\) and expected query time \(t_q(n,\mu )\), where \(t_p\) and \(t_q\) are monotonically increasing in n and \(\mu \). Then UnsortedSubsetSampling can be solved in preprocessing time \(\mathcal {O}(n + t_p(n, 2\mu + 1))\) and expected query time \(\mathcal {O}(1 + \mu + t_q(n, 2\mu + 1))\).

Proof

Let \(\mathbf {p}= (p_{1},\ldots ,p_{n})\) be an input sequence to UnsortedSubsetSampling. For preprocessing, we permute the input \(\mathbf {p}\) so that it is approximately sorted, by partitioning it into buckets \(U_k := \{i \in [n] \mid 2^{-k} \ge p_i > 2^{-k-1}\}\), for \(k \in \{0,1,\ldots ,L-1\}\), and \(U_L := \{i \in [n] \mid 2^{-L} \ge p_i\}\), where \(L = \lceil \log n \rceil \). For each \(i \in U_k\) we set \(\overline{p}_i := 2^{-k}\), which is an upper bound on \(p_i\). We sort the probabilities \(\overline{p}_i\), \(i \in [n]\), in descending order using bucket sort with the buckets \(U_k\), yielding \(\overline{p}_1' \ge \ldots \ge \overline{p}_n'\). In this process we store the original index \({\text {ind}}(i)\) corresponding to \(\overline{p}_i'\), so that we can find \(p_{{\text {ind}}(i)}\) corresponding to \(\overline{p}_i'\) in constant time. Then we run the preprocessing of SortedSubsetSampling on \(\overline{p}_1',\ldots ,\overline{p}_n'\). Note that

$$\begin{aligned} \overline{\mu } := \sum _{i=1}^n \overline{p}_i' = \sum _{i=1}^n \overline{p}_i \le \sum _{i=1}^n {\text {max}}\left\{ 2 p_i,\frac{1}{n}\right\} \le 2\mu + 1. \end{aligned}$$

Thus, the total preprocessing time is bounded by

$$\begin{aligned} \mathcal {O}(n) + t_p(n, \overline{\mu }) = \mathcal {O}(n + t_p(n, 2\mu + 1)), \end{aligned}$$

establishing the first claim.

For querying, we query \(\overline{p}_1',\ldots ,\overline{p}_n'\) using SortedSubsetSampling, yielding \(S' \subseteq [n]\). We compute \(S := \{{\text {ind}}(i) \mid i \in S'\}\). Each \(i \in S\) was sampled with probability  \(\overline{p}_i \ge p_i\). We use rejection to get this probability down to \(p_i\). For this, we generate for each \(i \in S\) a random number \({\text {rand}}()\) and check whether it is smaller than or equal to \({p_i}/{\overline{p}_i}\). If this is not the case, we delete i from S. Note that we have thus sampled i with probability \(p_i\), and all elements are sampled independently, so S has the desired distribution. Moreover, since the expected size of \(S'\) is \(\overline{\mu }\), the expected query time is bounded by

$$\begin{aligned} t_q(n, \overline{\mu }) + \mathcal {O}(1 + \mathbb {E}[|S'|]) = \mathcal {O}(1 + \mu + t_q(n, 2\mu +1)), \end{aligned}$$

and the second claim is also established. \(\square \)
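The reduction can be sketched as follows in Python; here sorted_query stands for an assumed query algorithm for SortedSubsetSampling on the sorted instance \(\overline{p}_1' \ge \ldots \ge \overline{p}_n'\) (returning 0-based positions into that instance), and the sketch only illustrates the bucketing and rejection steps of the proof.

```python
import math
import random

def preprocess_unsorted(p):
    """Bucket the p_i by powers of two; return the sorted (non-increasing) upper
    bounds pbar together with the map ind back to the original indices."""
    n = len(p)
    L = math.ceil(math.log2(n))
    buckets = [[] for _ in range(L + 1)]
    for i, pi in enumerate(p):
        k = L if pi <= 2.0 ** (-L) else min(L, math.floor(-math.log2(pi)))
        buckets[k].append(i)                       # bucket U_k: 2^-k >= p_i > 2^-(k+1)
    pbar, ind = [], []
    for k, bucket in enumerate(buckets):
        for i in bucket:
            pbar.append(2.0 ** (-k))               # upper bound on p_i
            ind.append(i)                          # original index behind this position
    return pbar, ind

def query_unsorted(p, pbar, ind, sorted_query):
    """Query the sorted instance pbar, then thin the result by rejection so that
    every original index i survives with probability exactly p_i."""
    S = []
    for pos in sorted_query(pbar):                 # 0-based positions into pbar
        i = ind[pos]
        if random.random() <= p[i] / pbar[pos]:
            S.append(i)
    return S
```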

We also give a reduction in the other direction, solving SortedSubsetSampling by UnsortedSubsetSampling.

Lemma 2.4

Let \(2 \le \beta < n\). Assume that UnsortedSubsetSampling can be solved in preprocessing time \(t_p(n,\mu )\) and expected query time \(t_q(n,\mu )\), where \(t_p\) and \(t_q\) are monotonically increasing in n and \(\mu \). Then SortedSubsetSampling can be solved in preprocessing time \(\mathcal {O}(\log _\beta n + t_p(1+ \log _\beta n, \beta \mu ))\) and expected query time \(\mathcal {O}(1 + \beta \mu + t_q(1 + \log _\beta n, \beta \mu ))\). More precisely, our preprocessing computes a value \(\overline{\mu }\) with \(\mu \le \overline{\mu }\le \beta \mu \) and the expected query time is \(\mathcal {O}(1 + \overline{\mu }+ t_q(1 + \log _\beta n, \overline{\mu }))\).

Proof

Let \(p_{1},\ldots ,p_{n}\) be an input sequence to SortedSubsetSampling. As in Sect. 2.1, we consider blocks \(B_k = \{i \in [n] \mid \beta ^k \le i < \beta ^{k+1}\}\), with \(k \in \{0,\ldots ,L\}\) and \(L := \lfloor \log _\beta n \rfloor \), and let \(\overline{p}_i := p_{\beta ^k}\) for \(i \in B_k\). We will first sample with respect to the probabilities \(\overline{p}_i\)—call the sampled elements potential—and then use rejection. For this, let \(X_k\) be an indicator random variable for the event that we sample at least one potential element in \(B_k\). Then

$$\begin{aligned} q_k := {\text {Pr}}[X_k = 1] = 1 - (1 - p_{\beta ^k})^{|B_k|}. \end{aligned}$$

Moreover, let \(Y_{k}\) be a random variable for the index of the first potential element in block \(B_{k}\), minus \(\beta ^{k}\). Let \(Y_k = \infty \), if no element in \(B_k\) is sampled as a potential element. Then \({\text {Pr}}[Y_k = i] = p_{\beta ^k} (1-p_{\beta ^k})^i\) for \(i \in \{ 0, \ldots , |B_k|-1 \}\), and \({\text {Pr}}[Y_k = \infty ] = {\text {Pr}}[X_k = 0] = 1 - q_k\). We calculate

$$\begin{aligned} {\text {Pr}}[Y_k = i \mid X_k = 1] = \frac{{\text {Pr}}[Y_k = i]}{{\text {Pr}}[X_k = 1]} = \frac{p_{\beta ^k}}{q_k} (1-p_{\beta ^k})^{i}, \qquad i \in \{ 0, \ldots , |B_k|-1 \}. \end{aligned}$$

Since this is a (truncated) geometric distribution, we can sample from it in constant time. Indeed, consider a geometric random variable Z with parameter p truncated at m, i.e., \({\text {Pr}}[Z = i] = p(1-p)^i / q\) for \(i \in \{0,\ldots ,m-1\}\), where \(q := 1-(1-p)^m\). Then \(\lfloor \log (1-q \cdot {\text {rand}}()) / \log (1-p) \rfloor \) samples from Z; see also [8].
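In code, this constant-time sampler may be written as follows (a small sketch for \(0< p < 1\); the final truncation only guards against floating-point rounding at the boundary).

```python
import math
import random

def truncated_geometric(p, m):
    """Sample Z with Pr[Z = i] = p * (1-p)**i / q for i in {0,...,m-1},
    where q = 1 - (1-p)**m, using a single uniform random number."""
    q = 1.0 - (1.0 - p) ** m
    z = math.floor(math.log(1.0 - q * random.random()) / math.log(1.0 - p))
    return min(int(z), m - 1)
```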

For preprocessing, we first compute the probabilities \(q_k\), \(k \in \{0, \dots , L\}\). This can be done in time \(\mathcal {O}(L) = \mathcal {O}(\log _\beta n)\) (as \(a^{b} = \exp ( b \ln a )\) can be computed in constant time on a Real RAM). Then we run the preprocessing of UnsortedSubsetSampling on them; note that the \(q_k\)’s are in general not sorted. In total, the preprocessing time is at most

$$\begin{aligned} \mathcal {O}(\log _\beta n) + t_p(1 + \log _\beta n , \nu ), \qquad \text {where} \qquad \nu = \sum _{k=0}^{\lfloor \log _\beta n \rfloor } q_k. \end{aligned}$$

Using that \((1 - x)^y \ge 1-xy\) for \(0<x<1\) and \(y \ge 1\) we obtain

$$\begin{aligned} \nu = \sum _{k=0}^{\lfloor \log _\beta n \rfloor } \big ( 1 - (1 - p_{\beta ^k})^{|B_k|} \big ) \le \sum _{k=0}^{\lfloor \log _\beta n \rfloor } p_{\beta ^k} |B_k| = \sum _{i=1}^n \overline{p}_i = \overline{\mu }. \end{aligned}$$

Using Lemma 2.1 we obtain \(\nu \le \beta \mu \), and the bound \(\mathcal {O}(\log _\beta n + t_p(1 + \log _\beta n , \beta \mu ))\) for the total preprocessing time follows immediately.

For querying, we query the blocks \(B_k\) that contain potential elements, using the query algorithm for UnsortedSubsetSampling on \(q_0,\ldots ,q_L\). Then, for each block \(B_k\) that contains a potential element, we sample all potential elements in this block. Note that the first of the potential elements in \(B_k\) is distributed as \({\text {Pr}}[Y_k = i \mid X_k = 1]\), which is geometric, so we can sample from it in constant time, while all further potential elements are distributed as \(Y_k\) (but only on the remainder of the block), which is still geometric. Then, after having sampled a set \(\overline{S}\) of potential elements, we keep each \(i \in \overline{S}\) only if \({\text {rand}}() \le {p_i}/{\overline{p}_i}\). This yields a random sample \(S \subseteq \overline{S}\) with the desired distribution. The overall query time is then at most

$$\begin{aligned} t_q(1 + \log _\beta n, \nu ) + \mathcal {O}(1 + |\overline{S}|) \le t_q(1 + \log _\beta n, \overline{\mu }) + \mathcal {O}(1 + |\overline{S}|) \end{aligned}$$

As the expected value of \(|\overline{S}|\) is \(\overline{\mu }\le \beta \mu \), the proof is complete. \(\square \)

Next, we put the above three lemmas together to prove the upper bounds of Theorems 1.5 and 1.6.

Proof of Theorem 1.6, upper bound

To solve UnsortedSubsetSampling, we use the reduction of Lemma 2.3, then the reduction of Lemma 2.4 (where we set \(\beta = 2\)), followed by the base case of Lemma 2.2. This reduces the instance size from n to \(\mathcal {O}(\log n)\), so that preprocessing costs \(\mathcal {O}(n)\) for the invocation of the first lemma, \(\mathcal {O}(\log n)\) for the second, and \(\mathcal {O}(\log ^2 n)\) for the third. Note that \(\mu \) is increased only by constant factors, so that we indeed get a query time of \(\mathcal {O}(1 + \mu )\). \(\square \)

For SortedSubsetSampling we first prove a weaker statement than Theorem 1.5, which follows from simply putting together the reductions of this section.

Lemma 2.5

Let \(2 \le \beta < n\). Then SortedSubsetSampling can be solved in preprocessing time \(\mathcal {O}(\log _\beta n)\) and expected query time \(\mathcal {O}(1 + \beta \mu )\). More precisely, our preprocessing computes a value \(\overline{\mu }\) with \(\mu \le \overline{\mu }\le \beta \mu \) and the expected query time is \(\mathcal {O}(1 + \overline{\mu })\).

Proof

To solve SortedSubsetSampling, we use the reduction presented in Lemma 2.4 followed by the upper bound of Theorem 1.6 that we proved above. This reduces the instance size from n to \(\mathcal {O}(\log _\beta n)\) while \(\mu \) is increased to \(\mathcal {O}(1+\beta \mu )\). We obtain the desired preprocessing time \(\mathcal {O}(\log _\beta n)\) and query time \(\mathcal {O}(1 + \beta \mu )\). \(\square \)

Proof of Theorem 1.5, upper bound

Assume that we are allowed preprocessing time \(\mathcal {O}(\log _{\tilde{\beta }} n)\) for some \(2 \le \tilde{\beta }< n\). Our algorithm for SortedSubsetSampling simply runs the preprocessing of Lemma 2.5 with \(\beta = \tilde{\beta }\) to satisfy the preprocessing time constraint.

For querying, we improve upon the runtime of Lemma 2.5 as follows. For any \(\beta \in \{2,\ldots ,n\}\), let \(\overline{\mu }(\beta )\) be the upper bound on \(\mu \) computed by Lemma 2.5 given \(\mathcal {O}(\log _\beta n)\) preprocessing time. Initially, we set \(\beta := \tilde{\beta }\) so that \(\overline{\mu }(\beta ) = \overline{\mu }(\tilde{\beta })\) was computed by our preprocessing. If \(1 + \overline{\mu }(\tilde{\beta }) \le \log _{\tilde{\beta }} n\) then we run the query algorithm of Lemma 2.5 and are done. Otherwise, we repeatedly set \(\beta := \lceil \beta ^{1/2} \rceil \) and rerun the preprocessing of Lemma 2.5, until \(\beta = 2\) or \(1 + \overline{\mu }(\beta ) \le \log _\beta n\). Then we run the query algorithm of Lemma 2.5.
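The following Python sketch summarizes this query procedure; lemma25_preprocess and lemma25_query stand for assumed implementations of the preprocessing and query algorithms of Lemma 2.5, where the preprocessing also returns the computed value \(\overline{\mu }(\beta )\).

```python
import math

def adaptive_query(p, beta_tilde, pre, lemma25_preprocess, lemma25_query):
    """Query step of the proof: starting from the structure `pre` built for
    beta_tilde, repeatedly replace beta by ceil(sqrt(beta)) and rebuild via
    Lemma 2.5 until beta = 2 or 1 + mu_bar(beta) <= log_beta(n)."""
    n = len(p)
    beta, (data, mu_bar) = beta_tilde, pre
    while beta > 2 and 1.0 + mu_bar > math.log(n, beta):
        beta = math.ceil(math.sqrt(beta))
        data, mu_bar = lemma25_preprocess(p, beta)   # rerun the Lemma 2.5 preprocessing
    return lemma25_query(p, beta, data)
```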

It remains to analyze the runtime of this query algorithm. We consider three cases. (1) If \(1 + \overline{\mu }(\tilde{\beta }) \le \log _{\tilde{\beta }} n\) then the \(\beta \)-decreasing loop does not start and the query time is \(\mathcal {O}(1 + \overline{\mu }(\tilde{\beta })) \le \mathcal {O}(1 + \tilde{\beta }\mu )\). (2) If the \(\beta \)-decreasing loop breaks at \(\beta =2\), then since it did not stop at \(\beta \in \{3,4\}\) we have \(1+4\mu > \log _4 n\), or \(\mu = \Omega (\log n)\). In this case, the total query time is \(\mathcal {O}(1 + \mu + \log n) = \mathcal {O}(\mu )\). (3) Otherwise the \(\beta \)-decreasing loop stopped at some \(\beta ^*\) with \(1 + \overline{\mu }(\beta ^*) \le \log _{\beta ^*} n\). Using \(\overline{\mu }(\beta ) \le \beta \mu \) and that we decrease \(\beta \) by taking its square root, we obtain \(\beta ^* \ge \gamma ^{1/2}\), where \(\gamma \ge 2\) satisfies

$$\begin{aligned} 1 + \gamma \mu = \log _\gamma n. \end{aligned}$$

The above equation solves to \(\gamma = \Theta \big (\frac{\log n}{\mu } \big / \log \big ( \frac{\log n}{\mu } \big ) \big )\). This yields a total query time of \(\mathcal {O}(\log _{\beta ^*} n) = \mathcal {O}(\log _\gamma n) = \mathcal {O}\big (\frac{\log n}{\log ( \frac{\log n}{\mu })} \big )\), which proves the claimed query time. \(\square \)

3 Lower Bounds

We prove most of our lower bounds by reducing ArraySearch, introduced in the following fact, to the various sampling problems; the fact states that searching in an unordered array of length m takes time \(\Omega (m)\). A notable exception is Lemma 3.4.

Fact 3.1

Consider problem ArraySearch: Given m and query access to an array \(A \in \{0,1\}^m\) consisting of m bits, with exactly one bit set to 1, find the position of this bit. Any randomized algorithm for ArraySearch needs \(\Omega (m)\) accesses to A in expectation.

3.1 Proportional Sampling on Unsorted Probabilities

The lower bound for Theorem 1.3 is provided by the following lemma that reduces ArraySearch to UnsortedProportionalSampling. Moreover, the same proof yields the lower bound of Theorem 1.6 for UnsortedSubsetSampling.

Lemma 3.2

Any single-sample algorithm for UnsortedProportionalSampling has expected time \(\Omega (n)\). Moreover, any single-sample algorithm for UnsortedSubsetSampling has expected time \(\Omega (n)\).

Proof

Let A be an instance of ArraySearch of size n, say with 1-bit at position \(\ell ^*\). We consider the instance

$$\begin{aligned} \mathbf {p}^A = (p_1^{A},\ldots ,p_n^{A}) \quad \text {with}\quad p_i^{A} = A[i]. \end{aligned}$$

Any sampling algorithm for UnsortedProportionalSampling returns \(\ell ^*\) on instance \(\mathbf {p}^{A}\) with probability 1. Thus, simulating any algorithm for UnsortedProportionalSampling (by computing \(p_i^A\) on the fly) we obtain an algorithm for finding the 1-bit of array A. Hence, by Fact 3.1, any algorithm for UnsortedProportionalSampling takes expected time \(\Omega (n)\).

Observe that on the same instance \(\mathbf {p}^A\) any sampling algorithm for UnsortedSubsetSampling returns the set \(\{\ell ^*\}\) with probability 1. This needs expected time \(\Omega (n)\) for the same reasons. With varying \(\mu \), no better bound is possible, either: If \(\mu \ge 1\), consider an ArraySearch instance A of length \(n-s\), where \(s := \lceil \mu - 1 \rceil \). Let \(p_i^{A} = A[i]\) for \(1 \le i \le n-s\) and set the last s probabilities \(p_i^A\) to values that sum up to \(\mu - 1\). Then we still need runtime \(\Omega (n - \mu )\) by Fact 3.1. As we also need runtime \(\Omega (\mu )\) for outputting the result, the lower bound of \(\Omega (n)\) follows. Otherwise, if \(\mu < 1\), then we consider \(\tilde{p}_i^A := \mu \cdot A[i]\). Since the algorithm does not know \(\mu \), it behaves just as in the case \(\mu = 1\) until it reads \(p_{\ell ^*}^A\). However, finding \(\ell ^*\) takes time \(\Omega (n)\), which yields the result. \(\square \)

3.2 Proportional Sampling on Sorted Probabilities

Here we present the proof of the lower bound of Theorem 1.2 for SortedProportionalSampling.

Proof of Theorem 1.2, lower bound

Let \(n \in \mathbb {N}\) and \(2 \le \beta \le \mathcal {O}(\frac{\log n}{\log \log n})\). Let \(s_i := \sum _{j=0}^{i-1} \beta ^j = (\beta ^i - 1)/(\beta - 1)\). Let L be maximal with \(s_{L+1}-1 \le n\) and note that \(L = \Theta (\log _\beta n)\). Then \(\beta \le \mathcal {O}(\frac{\log n}{\log \log n})\) implies \(\beta = \mathcal {O}(L)\). We consider blocks \(B_i := \{ s_i, s_i+1, \ldots , s_i+\beta ^{i}-1 \} = \{ s_i, \ldots , s_{i+1}-1 \}\), for \(i=1,\ldots ,L\), that partition \(\{1,\ldots ,s_{L+1}-1\}\).

Let A be an instance of ArraySearch of size L, say with 1-bit at position \(\ell ^*\). To construct the instance \(\mathbf {p}= \mathbf {p}^A = (p_1^A,\ldots ,p_n^A)\) we set for any \(\ell \in \{1,\ldots ,L\}\) and \(j \in B_\ell \)

$$\begin{aligned} p_j^A := \beta ^{-\ell + A[\ell ]}, \end{aligned}$$

and \(p_j^A := 0\) for \(s_{L+1}-1 < j \le n\). As block \(B_\ell \) has size \(\beta ^\ell \), the total probability mass of \(B_\ell \) is \(\sum _{j \in B_\ell } p_j^A = \beta ^{A[\ell ]}\), i.e., it is \(\beta \) for \(A[\ell ] = 1\), and 1 otherwise. Observe that

$$\begin{aligned} \mu = \sum _{i=1}^n p_i^A = L + \beta - 1, \end{aligned}$$

since block \(B_{\ell ^*}\) contributes \(\beta \) and each of the other \(L-1\) blocks contributes 1 as total probability mass. Furthermore, note that \(p_1^A,\ldots ,p_n^A\) is indeed sorted, as the probability of an element in block \(B_\ell \) is smaller by a factor of (at least) \(\beta \) than the probability of an element in \(B_{\ell -1}\), except if \(\ell = \ell ^*\), in which case these probabilities coincide.

In the following we will prove that there is no sampling algorithm where the preprocessing reads at most \(\varepsilon L\) input values and the querying reads at most \(\varepsilon \beta \) input values in expectation, for a sufficiently small constant \(\varepsilon >0\). Assume, for the sake of contradiction, that such an algorithm exists. On \(\mathbf {p}^A\) we run the preprocessing and then K times the query algorithm, sampling K numbers \(X_1,\ldots ,X_K \in \{1,\ldots ,n\}\). Denote by \(Y_k\) the block of \(X_k\), i.e., \(X_k \in B_{Y_k}\). If \(A[Y_k] = 1\) for some \(1 \le k \le K\) then we return \(Y_k\), otherwise we linearly search for the 1-bit of A.

This yields an algorithm for ArraySearch; let us analyze its expected number of accesses to A. Since the total probability mass of block \(B_{\ell ^*}\) is \(\beta \), we have

$$\begin{aligned} {\text {Pr}}[Y_k = \ell ^*] = \frac{\beta }{\mu } = \frac{\beta }{L+\beta -1} = \Omega \Big (\frac{\beta }{L}\Big ), \end{aligned}$$

since \(\beta = \mathcal {O}(L)\). Thus, \({\text {Pr}}[\not \exists k:A[Y_k] = 1] = (1 - \Omega (\beta /L))^K = \exp (-\Omega (K \beta / L))\). Setting \(K = \Theta (\log (1/\varepsilon ) L/\beta )\) (with sufficiently large hidden constant), this probability is at most \(\varepsilon \). Hence, the expected number of accesses to A of the constructed algorithm is (counting preprocessing, K queries, and a possible linear search through A)

$$\begin{aligned} \varepsilon L + K \cdot \varepsilon \beta + {\text {Pr}}[\not \exists k:A[Y_k] = 1] \cdot L \le \mathcal {O}(\log (1/\varepsilon ) \varepsilon L). \end{aligned}$$

For sufficiently small \(\varepsilon > 0\) this contradicts Fact 3.1.\(\square \)

Note that the same proof also works for single-sample algorithms. In this case the preprocessing reads no input values, and the only restriction is \(\beta \le \mathcal {O}(L)\). Setting \(\beta = \Theta (\log (n)/\log \log (n))\) this yields a lower bound of \(\Omega (\log (n)/\log \log (n))\) on the expected runtime of any single-sample algorithm for SortedProportionalSampling.

3.3 Subset Sampling on Sorted Probabilities

We first prove two lemmas proving lower bounds for SortedSubsetSampling in different situations. Then we show how the lower bound of Theorem 1.5 follows from these lemmas.

Lemma 3.3

Let \(\beta \in \{2,\ldots ,n\}\). Consider any data structure for SortedSubsetSampling with preprocessing time \(\varepsilon \log _{\beta } n\) (where \(\varepsilon >0\) is a sufficiently small constant) and query time \(t_q(n,\mu )\). Then for any \(\mu = \mu (n)\) with \(\beta (1+\mu ) = \mathcal {O}(\log _\beta n)\) we have \(t_q(n,\mu ) = \Omega (\beta \mu )\).

Proof

We closely follow the proof of the lower bound of Theorem 1.2 (Sect. 3.2). Let \(s_i := \sum _{j=0}^{i-1} \beta ^j = (\beta ^i - 1)/(\beta - 1)\). Let L be maximal with \(s_{L+1}-1 \le n\) and note that \(L = \Theta (\log _\beta n)\). We consider blocks \(B_i := \{ s_i, s_i+1, \ldots , s_i+\beta ^{i}-1 \} = \{ s_i, \ldots , s_{i+1}-1 \}\), for \(i=1,\ldots ,L\), that partition \(\{1,\ldots ,s_{L+1}-1\}\).

Note that our assumptions imply \(\beta = \mathcal {O}(\log _\beta n)\), from which it follows that \(\beta = \mathcal {O}(\log n)\) and thus \(L = \Theta (\log _\beta n) = \Omega (\log n / \log \log n)\) grows with n. Since we can assume that n is sufficiently large, we thus can assume the same for L. By assumption we also have \(\mu = \mathcal {O}(\log _\beta n) = \mathcal {O}(L)\). If \(\mu > L\), then we introduce elements \(p_1=\ldots =p_{\lceil \mu -L\rceil } = 1\). Then on the remainder \(p_{\lceil \mu -L\rceil +1},\ldots ,p_n\) we have a probability mass \(\mu - \lceil \mu - L \rceil \), which is at most L, but still \(\Omega (\mu )\) (where we use that L is at least a sufficiently large constant). Hence, it suffices to show that sampling from the remainder takes query time \(\Omega (\beta \mu )\). Focussing on this remainder, without loss of generality we can from now on assume \(\mu \le L\).

Let A be an instance of ArraySearch of size L, say with 1-bit at position \(\ell ^*\). To construct the instance \(\mathbf {p}= \mathbf {p}^A = (p_1^A,\ldots ,p_n^A)\), for some \(0 \le \alpha \le 1\) we set for any \(\ell \in \{1,\ldots ,L\}\) and \(j \in B_\ell \) the input to \( p_j^A := \alpha \cdot \beta ^{-\ell + A[\ell ]} \), and for \(s_{L+1}-1 < j \le n\) to \(p_j^A := 0\). As block \(B_\ell \) has size \(\beta ^\ell \), the total probability mass of \(B_\ell \) is \(\sum _{j \in B_\ell } p_j^A = \alpha \cdot \beta ^{A[\ell ]}\). Observe that \( \mu = \sum _{i=1}^n p_i^A = \alpha (L + \beta - 1)\) indeed has a solution \(0 \le \alpha \le 1\), since \(\mu \le L\). Furthermore, note that \(p_1^A,\ldots ,p_n^A\) is indeed sorted.

Assume for the sake of contradiction that there is a data structure for SortedSubsetSampling where the preprocessing reads at most \(\varepsilon \log _\beta n\) input values and the querying reads at most \(\varepsilon \beta \mu \) input values in expectation, for a sufficiently small constant \(\varepsilon > 0\).

On \(\mathbf {p}^A\) we run the preprocessing and then K times the query algorithm, sampling K sets \(X_1,\ldots ,X_K \subseteq \{1,\ldots ,n\}\). For every \(x \in \bigcup _{k=1}^K X_k\) we determine its block \(B_y\) and check whether \(A[y] = 1\). If so, we have found the 1-bit of A. Otherwise we linearly search for the 1-bit of A.

This yields an algorithm for ArraySearch; let us analyze its expected number of accesses to A. Let \(\ell ^*\) be the position of the 1-bit in A. The probability of not sampling any \(i \in B_{\ell ^*}\) in any of the K queries is

$$\begin{aligned} \prod _{i \in B_{\ell ^*}} (1-p_i)^K = (1 - \alpha \cdot \beta ^{-\ell ^* + 1})^{K \beta ^{\ell ^*}} \le \exp (-K \alpha \beta ). \end{aligned}$$

This probability becomes at most \(\varepsilon \) by setting \(K = \lceil \ln (1/\varepsilon ) / ( \alpha \beta ) \rceil = \Theta (1 + \log (1/\varepsilon ) / (\alpha \beta ))\). Hence, the expected number of accesses to A of the constructed algorithm is (counting preprocessing, K queries, and a linear search through A with probability at most \(\varepsilon \))

$$\begin{aligned} \mathcal {O}(\varepsilon L + K\cdot \varepsilon \beta \mu + \varepsilon \cdot L)\le & {} \mathcal {O}(\varepsilon (L + \beta \mu + \log (1/\varepsilon ) \mu / \alpha ))\\\le & {} \mathcal {O}(\varepsilon (\log (1/\varepsilon ) (L + \beta ) + \beta \mu )), \end{aligned}$$

using \(\mu = \alpha (L + \beta - 1)\). Because of the condition \(\beta (1+\mu ) = \mathcal {O}(\log _\beta n)\) we can further bound the expected number of accesses to A by \(\mathcal {O}(\log (1/\varepsilon ) \varepsilon L)\), which contradicts Fact 3.1 for sufficiently small \(\varepsilon >0\). \(\square \)

Lemma 3.4

Consider any data structure for SortedSubsetSampling with preprocessing time \(t_p(n)\) and expected query time \(t_q(n,\mu )\). For any \(\mu = \mu (n) \le \tfrac{1}{2}\) we have

$$\begin{aligned} t_p(n) + t_q(n,\mu ) = \Omega \Big ( \frac{\log n}{\log \frac{\log n}{\mu }} \Big ). \end{aligned}$$

Note that this lemma directly implies the lower bound of Theorem 1.4 for SortedSubsetSampling assuming \(\mu \le \tfrac{1}{2}\).

Proof

Let \((P,Q)\) be a preprocessing and a query algorithm, and let \(\mathbf {p}\) be an instance. Let \(D = P(\mathbf {p})\) be the result of the precomputation. By definition we have

$$\begin{aligned} {\text {Pr}}[Q(\mathbf {p},D) = \emptyset ] = \prod _{i \in [n]} (1 - p_{i}) =: \Delta (\mathbf {p}), \end{aligned}$$

where the probability goes only over the randomness of the query algorithm, not the preprocessing. If \(\mu _\mathbf {p}\le \frac{1}{2}\), since \(p_i \le \mu \) one can easily check that \(1-p_i \ge 4^{-p_i}\), which yields

$$\begin{aligned} \Delta (\mathbf {p}) \ge 4^{-\mu _\mathbf {p}} \ge 4^{-1/2} \ge \tfrac{1}{2}. \end{aligned}$$

Let \(\mathcal P \subseteq [n]\) be the positions \(i \in [n]\) at which the preprocessing reads the value \(p_{i}\) during the computation of D; note that \(|\mathcal P| \le t_p = t_p(n)\). Without loss of generality, we can assume that \(1,n \in \mathcal P\), i.e., that the preprocessing reads \(p_{1}\) and \(p_{n}\), as this adjustment of the algorithm does not increase its runtime asymptotically.

For an instance \(\mathbf {p}\) and \(\mathcal Q \subseteq [n]\), let \(\Delta (\mathbf {p},\mathcal Q)\) be the probability that query algorithm Q (with input \(\mathbf {p},D\)) reads exactly the values \(p_{i}\) with \(i \in \mathcal Q\) before returning \(\emptyset \). We clearly have

$$\begin{aligned} \sum _{\mathcal Q \subseteq [n]} \Delta (\mathbf {p},\mathcal Q) = \Delta (\mathbf {p}). \end{aligned}$$
(1)

Furthermore, if \(\mu _\mathbf {p}\le \frac{1}{2}\) and the expected query time is at most \(t_q = t_q(n,\mu )\), we have

$$\begin{aligned} \sum _{\begin{array}{c} \mathcal Q \subseteq [n] \\ |\mathcal Q| \le 4 t_q \end{array}} \Delta (\mathbf {p},\mathcal Q) \ge \frac{1}{4}. \end{aligned}$$
(2)

Indeed, since \(|\mathcal Q|\) is a lower bound on the runtime of the query algorithm, denoting by \(\mathcal E\) the event that algorithm Q on input \(\mathbf {p},D\) runs for time at most \(4 t_q\) we have

$$\begin{aligned} \sum _{\begin{array}{c} \mathcal Q \subseteq [n] \\ |\mathcal Q| \le 4 t_q \end{array}} \Delta (\mathbf {p},\mathcal Q) \ge {\text {Pr}}[Q(\mathbf {p},D) = \emptyset \text { and } \mathcal E] \ge {\text {Pr}}[Q(\mathbf {p},D) = \emptyset ] + {\text {Pr}}[\mathcal E] - 1. \end{aligned}$$

Since \({\text {Pr}}[Q(\mathbf {p},D) = \emptyset ] = \Delta (\mathbf {p}) \ge \frac{1}{2}\) and \({\text {Pr}}[\text {not }\mathcal E] \le \tfrac{1}{4}\) by Markov’s inequality, we obtain (2).

By (2) and since the number of subsets of [n] of size at most \(4t_q\) is \(\sum _{s=0}^{4t_q} {n \atopwithdelims ()s} \le \big (e n / 4t_q \big )^{4t_q} \le n^{4t_q}/4\), there exists a set \(\mathcal Q^* \subseteq [n]\), \(|\mathcal Q^*| \le 4t_q\), with

$$\begin{aligned} \Delta (\mathbf {p},\mathcal Q^*) \ge \frac{1}{4} \cdot \bigg ( \sum _{s=0}^{4t_q} {n \atopwithdelims ()s} \bigg )^{-1} \ge n^{-4t_q}. \end{aligned}$$
(3)

Now we fix the instance \(\mathbf {p}= (p_1,\ldots ,p_n)\) by setting

$$\begin{aligned} p_i := \frac{\alpha }{i}, \end{aligned}$$

for a parameter \(\alpha > 0\) chosen such that \(\sum _{i=1}^n p_i = \alpha H_n = \mu = \mu (n)\), implying \(\alpha = \Theta (\mu /\log n)\). Fixing a set \(\mathcal Q^*\) as above for this instance \(\mathbf {p}\), we define a second instance \(\mathbf {p}' = (p_1',\ldots ,p_n')\) by setting

$$\begin{aligned} p_i' := {\text {min}}\{p_j \mid i \ge j \in \mathcal Q^* \cup \mathcal P\}. \end{aligned}$$

That is, \(\mathbf {p}\) and \(\mathbf {p}'\) agree on the read positions \(\mathcal Q^*\) and \(\mathcal P\), and at all other positions \(p_i'\) is as large as possible with \(\mathbf {p}'\) still being sorted. This means that the preprocessing and the query algorithm cannot distinguish between both instances, implying a critical property we will use,

$$\begin{aligned} \Delta (\mathbf {p}',\mathcal Q^*) = \Delta (\mathbf {p},\mathcal Q^*). \end{aligned}$$

With this, we obtain

$$\begin{aligned} \Delta (\mathbf {p}') \mathop {\ge }\limits ^{(1)} \Delta (\mathbf {p}',\mathcal Q^*) = \Delta (\mathbf {p},\mathcal Q^*) \mathop {\ge }\limits ^{(3)} n^{-4t_q}. \end{aligned}$$
(4)

We next bound \(\Delta (\mathbf {p}')\). Denote the read positions by \(\mathcal Q^* \cup \mathcal P = \{i_1,\ldots ,i_k\}\) with \(i_1 \le \ldots \le i_k\). Note that \(k \le t_p + 4t_q\). By assumption, we have \(i_1 = 1, \, i_k = n\), and we define \(i_{k+1} := n+1\). We obtain

$$\begin{aligned} \Delta (\mathbf {p}')&= \prod _{i \in [n]} (1 - p_i') = \prod _{\ell = 1}^k (1 - p_{i_{\ell }})^{i_{\ell +1} - i_{\ell }}. \end{aligned}$$

Using \(1-x \le e^{-x}\) for \(x \ge 0\) this yields

$$\begin{aligned} \Delta (\mathbf {p}') \le \exp \left( - \sum _{\ell = 1}^k p_{i_{\ell }} (i_{\ell +1} - i_{\ell }) \right) = \exp \left( - \alpha \sum _{\ell = 1}^k \left( \frac{i_{\ell +1}}{i_{\ell }} - 1 \right) \right) . \end{aligned}$$

Using the arithmetic-geometric mean inequality we obtain

$$\begin{aligned} \frac{1}{k} \sum _{\ell = 1}^k \frac{i_{\ell +1}}{i_{\ell }} \ge \left( \prod _{\ell = 1}^k \frac{i_{\ell +1}}{i_{\ell }} \right) ^{1/k} \ge n^{1/k}, \end{aligned}$$

which yields \( \Delta (\mathbf {p}') \le \exp \left( - \alpha k (n^{1/k} - 1) \right) \le \exp (- \alpha (n^{1/k} - 1)) \). Combining this with (4),

$$\begin{aligned} \exp \left( - \alpha (n^{1/k} - 1) \right) \ge n^{-4t_q} \ge \exp \left( - \mathcal {O}( \log ^2 n) \right) , \end{aligned}$$

as \(t_q = \mathcal {O}(\log n)\) (otherwise the claim follows directly). Taking the logarithm twice and rearranging,

$$\begin{aligned} k \ge \frac{\log n}{\log ( 1 + \mathcal {O}(\log ^2 (n) / \alpha ))}. \end{aligned}$$

Using \(t_p + 4t_q \ge k\) and \(\alpha = \Theta (\mu /\log n)\), we obtain

$$\begin{aligned} t_p + 4t_q \ge \frac{\log n}{\log ( \mathcal {O}( \log ^3 (n) / \mu ))}, \end{aligned}$$

and thus \(t_p + t_q = \Omega ( \frac{\log n}{\log ( \log (n) / \mu )})\). \(\square \)

A tedious case distinction now shows that the lower bound of Theorem 1.5 follows from the above two lemmas.

Proof of Theorem 1.5, lower bound

We prove that any data structure for SortedSubsetSampling with \(\varepsilon \log _\beta n\) preprocessing time (where \(\varepsilon >0\) is a sufficiently small constant) needs query time \(\Omega (t_q^\beta (n,\mu ))\) for any \(\mu = \mu (n)\), where

$$\begin{aligned} t_q^\beta (n,\mu ) = {\left\{ \begin{array}{ll} \mu , &{} \text {if } \mu \ge \tfrac{1}{2} \log n, \\ 1 + \beta \mu , &{} \text {if } \mu < \tfrac{1}{\beta }\log _\beta n, \\ \frac{\log n}{\log (\frac{\log n}{\mu })}, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

We consider six (sub-)cases depending on \(\mu \) and \(\beta \), in each case reducing the claim to Lemma 3.3 or 3.4.

Case 1, \(\mu \ge \frac{1}{2}\): We split this into 3 subcases as follows.

Case 1.1, \(\mu \ge \tfrac{1}{2} \log n\): As the expected output size is \(\mu \), the expected query time is always \(\Omega (\mu )\), which is tight in this case.

Case 1.2, \(\mu \ge \frac{1}{2}\) and \(\tfrac{1}{\beta }\log _\beta n \le \mu < \tfrac{1}{2} \log n\): In this case, we can choose \(2 \le \gamma \le \beta \) such that \(\mu = \Theta (\tfrac{1}{\gamma }\log _\gamma n)\). Solving for \(\gamma \) yields \(\gamma = \Theta \big ( \tfrac{\log n}{\mu } \big / \log \tfrac{\log n}{\mu }\big )\). We have \(\gamma \le 2 \gamma \mu \le \mathcal {O}(\log _\gamma n)\), so Lemma 3.3 (applied with \(\beta \) replaced by \(\gamma \)) yields a lower bound of \(\Omega (\gamma \mu ) = \Omega \big (\tfrac{\log n}{\log \frac{\log n}{\mu }}\big )\) for any data structure with preprocessing time \(\mathcal {O}(\log _\beta n) \le \mathcal {O}(\log _\gamma n)\).

Case 1.3, \(\frac{1}{2} \le \mu < \tfrac{1}{\beta }\log _\beta n\): These inequalities imply \(\beta \le 2 \beta \mu \le 2 \log _\beta n\). Thus, Lemma 3.3 applies, showing that the query time is \(\Omega (\beta \mu )\). As any algorithm takes time \(\Omega (1)\), the query time is also bounded by \(\Omega (1 + \beta \mu )\), as desired.

Case 2, \(\mu < \frac{1}{2}\): We split this into three subcases as follows.

Case 2.1, \(\tfrac{1}{\beta }\log _\beta n \le \mu < \frac{1}{2}\): Note that \(\mu \ge \tfrac{1}{\beta }\log _\beta n\) implies \(\beta ^2 \ge \beta \log \beta \ge \tfrac{1}{\mu }\log n\) so that \(\log \beta = \Omega \big (\log \tfrac{\log n}{\mu }\big )\). Hence, the preprocessing time is \(\varepsilon \log _\beta n = \mathcal {O}\big (\varepsilon \tfrac{\log n}{\log \frac{\log n}{\mu }}\big )\). For sufficiently small \(\varepsilon > 0\), Lemma 3.4 now implies \(t_q(n,\mu ) = \Omega \big ( \tfrac{\log n}{\log \frac{\log n}{\mu }}\big )\), as desired.

Case 2.2, \(\mu < \frac{1}{2}\) and \(\tfrac{1}{\beta ^3} \log n \le \mu < \tfrac{1}{\beta }\log _\beta n\): Then \(\log \beta = \Omega \big (\log \tfrac{\log n}{\mu }\big )\) and \(\log _\beta n = \mathcal {O}\big (\tfrac{\log n}{\log \frac{\log n}{\mu }}\big )\). Hence, with \(\varepsilon \log _\beta n\) preprocessing time and sufficiently small \(\varepsilon >0\), Lemma 3.4 implies that \(t_q(n,\mu ) = \Omega \big (\tfrac{\log n}{\log \frac{\log n}{\mu }}\big ) \ge \Omega (\log _\beta n) \ge \Omega (\beta \mu )\), where the last inequality follows from \(\mu < \tfrac{1}{\beta }\log _\beta n\). Since any algorithm takes time \(\Omega (1)\), this yields a lower bound of \(\Omega (1+\beta \mu )\), as desired.

Case 2.3, \(\mu < \frac{1}{2}\) and \(\mu < \tfrac{1}{\beta ^3} \log n\): Note that \(\mu < \tfrac{1}{\beta ^3} \log n\) implies \(\mu < \tfrac{1}{\beta }\log _\beta n\). Thus, if \(\beta \mu < 1\) then our query time is \(\mathcal {O}(1 + \beta \mu ) = \mathcal {O}(1)\), which is clearly optimal. Hence, assume \(\beta \mu \ge 1\). Together with \(\mu < \tfrac{1}{\beta ^3} \log n\) this implies \(\beta \le \sqrt{\log n}\). Hence,

$$\begin{aligned} \log _\beta n \ge \Omega (\tfrac{\log n}{\log \log n}) \gg \mathcal {O}(\sqrt{\log n}) \ge \beta \ge \Omega (\beta (1+\mu )), \end{aligned}$$

where the last inequality uses \(\mu < \frac{1}{2}\). Thus, Lemma 3.3 is applicable and we obtain a lower bound of \(t_q(n,\mu ) = \Omega (\beta \mu ) = \Omega (1+\beta \mu )\), as desired.\(\square \)

4 Reduction from Proportional Sampling to Subset Sampling

In this section, we present a reduction from (Sorted or Unsorted) ProportionalSampling to (Sorted or Unsorted) SubsetSampling. This yields an alternative proof of the upper bounds for ProportionalSampling (Theorems 1.2 and 1.3) using the upper bounds for SubsetSampling (Theorems 1.5 and 1.6). Moreover, it shows that the classic ProportionalSampling problem is easier than SubsetSampling (or the former can be seen as a special case of the latter).

We first present a reduction that works for \(\mu \le 1\) and yields a query time proportional to \(1/\mu \). Then we show how to ensure \(1/\beta \le \mu \le 1\) after \(\mathcal {O}(\log _\beta n)\) preprocessing, which together with the first reduction shows the main result of this section, Proposition 4.5.

4.1 Special Case

Let \(\mathbf {p}\) be an instance of SortedProportionalSampling or UnsortedProportionalSampling. We assume \(\mu \le 1\) and obtain a running time proportional to \(\frac{1}{\mu }\), which is most useful when \(\mu \) lies in a small interval \([1/\beta ,1]\). Instead of \(\mathbf {p}\) we consider \(\mathbf {p}' = (p_1',\ldots ,p_n')\) with \(p_i' := {p_i}/({1+p_i})\). Note that if \(\mathbf {p}\) is sorted then \(\mathbf {p}'\) is also sorted. Moreover, since \(\mu \le 1\) implies \(p_i \le 1\) for all i, we have \(p_i/2 \le p_i' \le p_i\), so \(\mu ' := \sum _{i=1}^n p_i'\) lies in the range \([{\mu }/{2},\mu ]\).

Let \(Y = \textsc {ProportionalSampling}(\mathbf {p})\) be the random variable denoting proportional sampling on input \(\mathbf {p}\), and let \(X = \textsc {SubsetSampling}(\mathbf {p}')\) be the random variable denoting subset sampling on input \(\mathbf {p}'\). Then, conditioned on the event that exactly one element is sampled, say \(X = \{i\}\), this element i is distributed exactly as Y, as formalized in the following lemma.

Lemma 4.1

We have for all \(i \in [n]\)

$$\begin{aligned} {\text {Pr}}[X = \{i\} \mid |X| = 1] = {\text {Pr}}[Y = i]. \end{aligned}$$

Proof

Since \(X = \{i\}\) implies \(|X| = 1\), the definition of conditional probability yields

$$\begin{aligned} {\text {Pr}}\left[ X = \{i\} \mid |X| = 1\right] \;&= \,{\text {Pr}}[X = \{i\}] / {\text {Pr}}[|X| = 1] \\&= \left( \frac{p_i'}{1-p_i'} \prod _{k=1}^n (1-p_k') \right) / \left( \sum _{j=1}^n \frac{p_j'}{1-p_j'} \prod _{k=1}^n (1-p_k') \right) \\&= \left( \frac{p_i'}{1-p_i'} \right) / \left( \sum _{j=1}^n \frac{p_j'}{1-p_j'} \right) \end{aligned}$$

Plugging in the definition of \(p_i'\), which gives \(p_i'/(1-p_i') = p_i\), yields

$$\begin{aligned} {\text {Pr}}[X = \{i\} \mid |X| = 1]&= \frac{p_i}{\sum _{j=1}^n p_j} = {\text {Pr}}[Y = i]. \end{aligned}$$

This proves the statement. \(\square \)

Moreover, the probability of sampling exactly one element is not too small, as shown in the following lemma. This bound is not best possible but sufficient for our purposes.

Lemma 4.2

If \(\mu \le 1\) then we have

$$\begin{aligned} {\text {Pr}}[|X| = 1] \ge \mu /4. \end{aligned}$$

Proof

First, observe that by Markov’s inequality

$$\begin{aligned} {\text {Pr}}[|X| \ge 2] \le \mathbb {E}[|X|] / 2 = \mu ' / 2 \le 1/2, \end{aligned}$$

and thus, \({\text {Pr}}[|X| \in \{0,1\}] \ge 1/2\). Moreover, the definition of X implies that

$$\begin{aligned} {\text {Pr}}[|X|&= 0] = \prod _{k=1}^n (1-p_k') \quad \text {and} \quad {\text {Pr}}[|X| = 1] \\&= \sum _{j=1}^n \frac{p_j'}{1-p_j'}\, \prod _{k=1}^n (1-p_k') = \mu \cdot {\text {Pr}}[|X| = 0]. \end{aligned}$$

By putting everything together we obtain that \({\text {Pr}}[|X| = 1](1 + \frac{1}{\mu }) \ge 1/2\), and thus

$$\begin{aligned} {\text {Pr}}[|X| = 1] \ge \mu \cdot \frac{1}{2(1 + \mu )} \ge \frac{\mu }{4}, \end{aligned}$$

as claimed. \(\square \)

We put these facts together to show the following result. We need \(\mu \le 1\), and we want \(\mu \) to be as large as possible, since the obtained running time is proportional to \(\frac{1}{\mu }\). In the next section we will see that we can ensure \(\frac{1}{\beta }\le \mu \le 1\) after preprocessing time \(\mathcal {O}(\log _\beta n)\).

Lemma 4.3

Assume that (Sorted or Unsorted) SubsetSampling can be solved in preprocessing time \(t_p(n,\mu )\) and expected query time \(t_q(n,\mu )\), where \(t_p\) and \(t_q\) are monotonically increasing in n and \(\mu \). Then (Sorted or Unsorted, respectively) ProportionalSampling on instances with \(\mu \le 1\) can be solved in preprocessing time \(\mathcal {O}(t_p(n,\mu ))\) and expected query time \(\mathcal {O}(\tfrac{1}{\mu }\cdot \, t_q(n,\mu ))\).

Proof

For preprocessing, given input \(\mathbf {p}\), we run the preprocessing of SubsetSampling on input \(\mathbf {p}'\). We do not compute the vector \(\mathbf {p}'\) beforehand; instead, whenever the preprocessing algorithm of SubsetSampling reads the i-th input value, we compute \(p_i' = {p_i}/{(1+p_i)}\) on the fly, so that the preprocessing takes time \(\mathcal {O}(t_p(n,\mu ))\) (recall that \(\mu ' \le \mu \)). This allows us to sample X later on in expected time \(\mathcal {O}(t_q(n,\mu ))\), again computing the entries of \(\mathbf {p}'\) on the fly.

For querying, we repeatedly sample X until we obtain a set S of size one. Returning the unique element of S yields a sample with the correct distribution for (Sorted or Unsorted, respectively) ProportionalSampling by Lemma 4.1. Moreover, by Lemma 4.2 the expected number of repetitions is at most \(4/\mu \), and after our preprocessing each sample of X takes expected time \(\mathcal {O}(t_q(n,\mu ))\), so the total expected query time is \(\mathcal {O}(\tfrac{1}{\mu }\cdot t_q(n,\mu ))\). \(\square \)
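To make the reduction concrete, here is a minimal Python sketch of the query procedure of Lemma 4.3. It is only illustrative: `subset_sample` is a placeholder for any SubsetSampling routine (given a callable returning the i-th probability and the length n, it returns the random set X), and the naive stand-in at the end runs in time \(\Theta (n)\) and is included merely so that the sketch is self-contained.

```python
import random

def proportional_sample(p, subset_sample):
    """Sample an index i with probability p[i] / sum(p), assuming sum(p) <= 1.

    `subset_sample(q, n)` is a placeholder for any SubsetSampling routine:
    given a callable q and the length n, it returns a set X that contains
    each i independently with probability q(i) (indices are 0-based here).
    """
    n = len(p)
    # Transformed probabilities p'_i = p_i / (1 + p_i), computed on the fly.
    q = lambda i: p[i] / (1.0 + p[i])
    while True:
        X = subset_sample(q, n)
        if len(X) == 1:              # accept only if exactly one element was drawn
            return next(iter(X))     # by Lemma 4.1 this index is distributed as Y

# Naive stand-in for SubsetSampling, for illustration only (time Theta(n)):
def naive_subset_sample(q, n):
    return {i for i in range(n) if random.random() < q(i)}
```

By Lemma 4.2 the rejection loop terminates after an expected \(\mathcal {O}(1/\mu )\) iterations, matching the query time claimed above.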

4.2 General Case

In this subsection we reduce the general case with arbitrary \(\mu \) to the special case \(1/\beta \le \mu \le 1\). In the unsorted case, we simply compute \(\mu \) exactly in time \(\mathcal {O}(n)\), which shows the following proposition. In the sorted case, we approximate \(\mu \) using an idea of Sect. 2.1, see Proposition 4.5.

Proposition 4.4

Assume that UnsortedSubsetSampling can be solved in preprocessing time \(t_p(n,\mu )\) and expected query time \(t_q(n,\mu )\), where \(t_p\) and \(t_q\) are monotonically increasing in n and \(\mu \). Then UnsortedProportionalSampling can be solved in preprocessing time \(\mathcal {O}(n + t_p(n,1))\) and expected query time \(\mathcal {O}(t_q(n,1))\).

Note that plugging Theorem 1.6 into the above proposition yields the upper bound of Theorem 1.3.

Proof

In the preprocessing we compute \(\mu \) in time \(\mathcal {O}(n)\), and set \(\tilde{p}_i := p_i/\mu \) for \( i\in [n]\). This rescaling ensures \(\tilde{\mu }= \sum _i \tilde{p}_i = 1\). Then we run the algorithm guaranteed by Lemma 4.3 on \(\tilde{p}_1, \ldots , \tilde{p}_n\). \(\square \)
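As a small illustration, and under the same (hypothetical) naming as the sketch in Sect. 4.1, the preprocessing of Proposition 4.4 amounts to the following rescaling:

```python
def preprocess_unsorted(p):
    """Preprocessing of Proposition 4.4 (sketch): compute mu exactly in O(n)
    and rescale so that the new weights sum to 1; Lemma 4.3 then applies."""
    mu = sum(p)                      # O(n) exact computation of mu
    return [x / mu for x in p]       # rescaled instance with tilde(mu) = 1
```

The returned sequence can then be handed to `proportional_sample` from the previous sketch.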

Proposition 4.5

Let \(\beta \in \{2, \dots , n\}\). Assume that SortedSubsetSampling can be solved in preprocessing time \(t_p(n,\mu )\) and expected query time \(t_q(n,\mu )\), where \(t_p\) and \(t_q\) are monotonically increasing in n and \(\mu \). Then SortedProportionalSampling can be solved in preprocessing time \(\mathcal {O}(\log _\beta n + t_p(n,1))\) and expected query time \(\mathcal {O}({\text {max}}_{1/\beta \le \nu \le 1} \tfrac{1}{\nu } t_q(n, \nu ))\).

Note that plugging Theorem 1.5 into the above proposition yields the upper bound of Theorem 1.2 (to see the bound on the query time, note that we can set \(t_q(n,\mu ) = \mathcal {O}(1+\beta \mu )\) by Theorem 1.5 or Lemma 2.5, so that \({\text {max}}_{1/\beta \le \nu \le 1} \tfrac{1}{\nu } t_q(n, \nu ) = \mathcal {O}( {\text {max}}_{1/\beta \le \nu \le 1} \tfrac{1}{\nu } (1 + \beta \nu ) ) = \mathcal {O}( \beta )\)).

Proof

Let \(\mathbf {p}\) be an instance of SortedProportionalSampling with \(\mu = \sum _{i=1}^n p_i\). As in Sect. 2.1 we consider the blocks \(B_k := \{i \in [n] \mid \beta ^k \le i < \beta ^{k+1}\}\) with \(0 \le k \le L := \lfloor \log _\beta n \rfloor \) and set \(\overline{p}_i := p_{\beta ^k}\) for \(i \in B_k\). Then for \(\overline{\mu }:= \sum _{i=1}^n \overline{p}_i\) we have \(\mu \le \overline{\mu }\le \beta \cdot \mu \) by Lemma 2.1. Note that we can compute \(\overline{\mu }\) in time \(\mathcal {O}(\log _\beta n)\), as

$$\begin{aligned} \overline{\mu } = \sum _{k=0}^{L} p_{\beta ^k} \cdot \big (\!{\text {min}}(\beta ^{k+1},n+1) - \beta ^k \big ). \end{aligned}$$

With these observations at hand, for preprocessing we compute \(\overline{\mu }\) and consider \(\mathbf {p}' = (p_1',\ldots ,p_n')\) with \(p_i' := {p_i}/{\overline{\mu }}\). Since \(\mu \le \overline{\mu }\le \beta \cdot \mu \), the sum \(\mu ' := \sum _{i=1}^n p_i'\) lies in the range \([{1}/\beta ,1]\). Thus, Lemma 4.3 is applicable to \(\mathbf {p}'\) (as \(\mu ' \le 1\)), and we run the preprocessing of the algorithm it provides on \(\mathbf {p}'\). We do this without computing the whole vector \(\mathbf {p}'\); instead, whenever the preprocessing algorithm reads the i-th input value, we compute \(p_i'\) on the fly. This way the total preprocessing time is \(\mathcal {O}(\log _\beta n + t_p(n,1))\).

For querying, Lemma 4.3 allows us to query according to \(\mathbf {p}'\) in expected time \(\mathcal {O}(\tfrac{1}{\mu '} t_q(n,\mu ')) \le \mathcal {O}({\text {max}}_{1/\beta \le \nu \le 1} \tfrac{1}{\nu } t_q(n, \nu ))\), where we again compute values of \(\mathbf {p}'\) on the fly as needed. Since \(\mathbf {p}'\) is a rescaling of \(\mathbf {p}\), a proportional sample with respect to \(\mathbf {p}'\) has the same distribution as one with respect to \(\mathbf {p}\), so we simply return the sampled index. \(\square \)
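The only step of this preprocessing that is not covered by the earlier sketches is the computation of \(\overline{\mu }\) from the block representatives. A minimal sketch, assuming 1-based access to a non-increasing input via a callable `p` and \(\beta \ge 2\) (the function name is ours):

```python
def approximate_mu(p, n, beta):
    """Compute mu_bar = sum_k p(beta^k) * (min(beta^(k+1), n + 1) - beta^k)
    in O(log_beta n) time.  Here p(i) returns the i-th probability (1-based,
    non-increasing) and beta >= 2.  By Lemma 2.1, mu <= mu_bar <= beta * mu."""
    mu_bar = 0.0
    block_start = 1                                 # beta^k for the current k
    while block_start <= n:
        block_end = min(block_start * beta, n + 1)  # exclusive end of block B_k
        mu_bar += p(block_start) * (block_end - block_start)
        block_start *= beta
    return mu_bar
```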

5 Relaxations

In this section we describe some natural relaxations for the input and machine model studied so far in this paper.

Large Deviations for the Running Times The query runtimes in Theorems 1.2, 1.5 and 1.6 are, in fact, not only small in expectation, but they are also concentrated, i.e., they satisfy large deviation estimates in the following sense. Let t be the expected runtime bound and T the actual runtime. Then

$$\begin{aligned} {\text {Pr}}[T > k t] = e^{-\Omega (k)}, \end{aligned}$$

where the asymptotics are with respect to k. This follows rather straightforwardly along the lines of our proofs of these theorems; the only additional ingredient is that the size of the random set X in SubsetSampling is concentrated, which we argue now. Note that for any \(a>1\) the Chernoff bound shows that

$$\begin{aligned} {\text {Pr}}[ |X| > a \mu ] < \bigg ( \frac{e^{a-1}}{a^a} \bigg )^\mu \le \Big ( \frac{e}{a} \Big )^{a \mu }. \end{aligned}$$

For \(\mu \ll 1\) this inequality does not give a tail bound of \(e^{-\Omega (k)}\) for \({\text {Pr}}[|X| > k \mu ]\), and in fact such a tail bound does not hold. However, to bound our algorithms’ running times it suffices that |X| is not much larger than \(1+\mu \), and this indeed has an exponential tail bound: setting \(a = k (\mu + 1)/\mu \) yields, for every \(k \ge e\) (for smaller k the bound below is trivial),

$$\begin{aligned} {\text {Pr}}[|X| > k (\mu +1)] < \Big ( \frac{e \mu }{k(\mu +1)} \Big )^{k(\mu +1)} \le \Big ( \frac{k}{e} \Big )^{-k}. \end{aligned}$$

Partially Sorted Input The condition of sorted input for SortedSubsetSampling and SortedProportionalSampling can easily be relaxed, as long as we have sorted upper bounds of the probabilities. Given input \(\mathbf {p}\) and sorted \(\overline{\mathbf {p}}\) with \(p_{i} \le \overline{p}_{i}\) for all \(i \in [n]\), we simply sample according to \(\overline{\mathbf {p}}\) and use rejection to get down to the probabilities \(\mathbf {p}\). This allows for the optimal query time \(\mathcal {O}(1+\mu )\) as long as \(\overline{\mu }= \sum _{i=1}^{n} \overline{p}_{i} = \mathcal {O}(1 + \mu )\), where \(\mu = \sum _{i=1}^{n} p_{i}\).
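A minimal sketch of this rejection step for SubsetSampling, assuming a routine `sorted_subset_sample` for sorted inputs with the same calling convention as in the sketch of Sect. 4.1 (both function names are placeholders):

```python
import random

def subset_sample_with_upper_bounds(p, p_bar, sorted_subset_sample):
    """Subset sampling for an unsorted p, given pointwise sorted upper bounds
    p_bar with p[i] <= p_bar[i] for all i.  Each index drawn according to
    p_bar is kept with probability p[i] / p_bar[i], so that overall
    Pr[i in result] = p_bar[i] * (p[i] / p_bar[i]) = p[i]."""
    S_bar = sorted_subset_sample(lambda i: p_bar[i], len(p))
    return {i for i in S_bar if random.random() * p_bar[i] < p[i]}
```

The expected query time is then \(\mathcal {O}(1+\overline{\mu })\), which is \(\mathcal {O}(1+\mu )\) under the stated condition on \(\overline{\mu }\).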

Unimodal Input Many natural distributions \(\mathbf {p}\) are not sorted but unimodal, meaning that \(p_{i}\) is monotonically increasing for \(1 \le i \le m\) and monotonically decreasing for \(m \le i \le n\) (or the other way round). Knowing m, we can run the algorithms developed in this paper on both sorted halves and combine the return values (see the sketch after the next paragraph), which gives an optimal query algorithm for unimodal inputs. Alternatively, if the input is strictly monotone on both sides of m, we can find m in time \(\mathcal {O}(\log n)\) using ternary search.

This naturally generalizes to k-modal inputs, where the monotonicity changes k times.
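For SubsetSampling, combining the two halves simply means sampling each half independently and taking the union, since the elements are independent; a hedged sketch follows (0-based indices, `sorted_subset_sample` again a placeholder for a routine handling non-increasing input). For ProportionalSampling one can, for instance, first pick one of the two halves with probability proportional to its total weight and then sample within it.

```python
def unimodal_subset_sample(p, m, sorted_subset_sample):
    """Subset sampling for a unimodal p: non-decreasing up to index m, then
    non-increasing (0-based).  After reversing the first half, both halves
    are non-increasing; they are sampled independently and the resulting
    index sets are translated back to positions in the original sequence."""
    left = p[:m + 1][::-1]           # reversed, hence non-increasing
    right = p[m + 1:]                # already non-increasing
    S_left = sorted_subset_sample(lambda i: left[i], len(left))
    S_right = sorted_subset_sample(lambda i: right[i], len(right))
    return {m - i for i in S_left} | {m + 1 + j for j in S_right}
```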

Approximate Input In some applications it may be costly to compute the probabilities \(p_{i}\) exactly, but we are able to compute approximations \(\overline{p}_{i}(\varepsilon ) \ge p_{i} \ge \underline{p}_{i}(\varepsilon )\), with relative error at most \(\varepsilon \), where the cost of computing these approximations depends on \(\varepsilon \). We can still guarantee optimal query time, if the costs of computing these approximations are small enough, see e.g. [12].

We sketch this for SubsetSampling. We can sample a superset \(\overline{S}\) with respect to the probabilities \(\overline{p}_{i}(\frac{1}{2})\). Then we use rejection: for each element \(i \in \overline{S}\) we compute a random number \(r := {\text {rand}}()\) and delete i from \(\overline{S}\) if \(r \cdot \overline{p}_{i}(\frac{1}{2}) > p_{i}\), obtaining a sample set S. This check can be performed as follows. We initialize \(k:=1\). If \(r \cdot \overline{p}_{i}(\frac{1}{2}) > \overline{p}_{i}(2^{-k})\) we delete i from \(\overline{S}\). If \(r \cdot \overline{p}_{i}(\frac{1}{2}) \le \underline{p}_{i}(2^{-k})\) we keep i and are done. Otherwise, we increase k by 1. This method needs an expected number of \(\mathcal {O}(1)\) rounds of increasing k, since the probability of reaching round k is \(\mathcal {O}(2^{-k})\). Hence, if the cost of computing \(\overline{p}_{i}(\varepsilon )\) and \(\underline{p}_{i}(\varepsilon )\) is \(\mathcal {O}(\varepsilon ^{-c})\) with \(c < 1\), the expected overall cost is constant, and we get an optimal expected query time of \(\mathcal {O}(1+\mu )\).
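A sketch of this adaptive-precision acceptance test; `p_upper(i, eps)` and `p_lower(i, eps)` stand for the assumed approximation oracles \(\overline{p}_{i}(\varepsilon )\) and \(\underline{p}_{i}(\varepsilon )\), and the function name is ours:

```python
import random

def keep_element(i, p_upper, p_lower):
    """Decide whether to keep element i that was included with probability
    p_upper(i, 1/2), given oracles with p_lower(i, eps) <= p_i <= p_upper(i, eps).
    The precision is refined until the comparison of r * p_upper(i, 1/2) with
    p_i is decided; round k is reached with probability O(2^{-k}), so the
    expected number of rounds is O(1)."""
    r = random.random()
    threshold = r * p_upper(i, 0.5)
    k = 1
    while True:
        if threshold > p_upper(i, 2.0 ** -k):
            return False             # certainly threshold > p_i: reject i
        if threshold <= p_lower(i, 2.0 ** -k):
            return True              # certainly threshold <= p_i: keep i
        k += 1                       # undecided: refine the approximation
```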

Word RAM Throughout the paper we worked in the Real RAM model of computation, where every memory cell can store a real number. In the more realistic Word RAM model, each cell consists of \(w = \Omega (\log n)\) bits and any reasonable operation on two words can be performed in constant time. In addition to the standard repertoire of operations, we assume that we can generate a uniformly random word in constant time. It is known that in this model Bernoulli and geometric random variates can be drawn in constant time [2] and that the classic alias method for UnsortedProportionalSampling still works [3]. This already allows one to translate large parts of the algorithms of this paper to the Word RAM. Unfortunately, terms like \(\prod _{1\le k \le n} (1-p_k)\) (see Sect. 2.2) cannot be evaluated exactly on the Word RAM, as the result would need at least n bits. This difficulty can be overcome by working with \(\mathcal {O}(\log n)\)-bit approximations and increasing the precision as needed, similarly to the generalization to approximate input discussed in the previous paragraph. This way one can obtain a complete translation of our algorithms to the Word RAM. We omit the details.