On practical privacy-preserving fault-tolerant data aggregation

  • Krzysztof Grining
  • Marek Klonowski
  • Piotr Syga
Regular Contribution

Abstract

In this paper, we propose a fault-tolerant privacy-preserving data aggregation protocol which utilizes limited local communication between nodes. As a starting point, we analyze the Binary Protocol presented by Chan et al. Compared to previous work, their scheme guaranteed provable privacy of individuals and could work even if some number of users refused to participate. In our paper we demonstrate that, despite its merits, their method provides unacceptably low accuracy of aggregated data for a wide range of assumed parameters and cannot be used in the majority of real-life systems. To show this we use both analytic and experimental methods. On the positive side, we present a precise data aggregation protocol that provides a provable level of privacy even when facing massive failures of nodes. Moreover, our protocol requires significantly less computation (limited use of heavy cryptography) than most currently known fault-tolerant aggregation protocols and offers better security guarantees that make it suitable for systems with limited resources (including sensor networks). Most importantly, our protocol significantly decreases the error (compared to the Binary Protocol). However, to obtain our result we relax the model and allow some limited communication between the nodes. Our approach is a general way to enhance privacy of nodes in networks that allow such limited communication, e.g., social networks, VANETs or other IoT appliances. Additionally, we conduct experiments on real data (Facebook social network) to compare our protocol with the protocol presented by Chan et al.

Keywords

Data aggregation, Differential privacy, Fault tolerance, Distributed systems, Untrusted aggregator

1 Introduction

Aggregation of data is a fundamental problem that has been approached from different perspectives. Recently, many papers have presented methods of data aggregation that preserve the privacy of individual users. More precisely, the goal of such a protocol is to reveal some general aggregated statistics (like an average value) while keeping the value of each individual secret, even if the aggregator is untrusted (e.g., tries to learn the input of individual users). The general idea is to design a protocol that allows the aggregator to learn a perturbed sum, but no intermediate results.

In [44] the authors introduced a new approach to aggregation of information in distributed systems, based on combining cryptographic techniques with typical “methods of differential privacy” (users themselves adding appropriately calibrated noise), which were originally used for protecting the privacy of individuals in statistical databases under the differential privacy regime. Privacy preservation is usually realized by adding some carefully prepared noise to the aggregated values. A similar approach was independently proposed in [40].

These papers shed new light on the problem of privacy-preserving data aggregation. The authors constructed a protocol that can be very useful; however, its applicability is limited to a narrow class of scenarios due to a few shortcomings. One of them is the fact that all members of a group of users have to cooperate to compute the aggregated data, which is forced by the underlying cryptographic primitives. Thus, this approach is not appropriate for dynamic, real-life systems (e.g., mobile sensor networks), even though it seems to be a perfect solution for a fixed system of devices, where series of data are generated periodically for a long time and the number of failures is small (e.g., collecting measurements of electricity consumption in a neighborhood).

To circumvent this problem, in [8] the authors introduced a protocol that extends the one presented in [44]. This protocol, called the Binary Protocol, is essentially built on top of the one presented in [44] and is a privacy-preserving data aggregation protocol that is, to some extent, fault tolerant.

In our paper we focus on showing some shortcomings of the solution from [8] by pointing out the extent to which it is fault tolerant (Sect. 4). On the positive side, we present our own privacy-preserving fault-tolerant data aggregation protocol (Sect. 5). Its main idea is to utilize naturally emerging communication structures in networks, e.g., those due to proximity or friend relations (in social networks). Because of that, it can be used in every network that has some underlying communication graph (say the Facebook social network, VANETs or various smart appliances which can, to some extent, communicate with some other nodes). Note that we do not assume that the communication graph is complete; our protocol works even if each node can communicate with just a very small number of other nodes (say one or two). We also present an experimental comparison of the Binary Protocol and our protocol conducted on real-life data (Sect. 6).

1.1 Our contribution and organization of the paper

In Sect. 2 we briefly describe the model assumed in our paper and provide some notation used throughout it as well as introduce some definitions we use further on. In Sect. 3 we recall the Binary Protocol presented in [8], followed by discussion of its disadvantages in Sect. 4. Further, in Sect. 5 we present and analyze our own privacy-preserving and fault-tolerant data aggregation protocol addressing the issues concerning the Binary Protocol. In Sect. 6 we experimentally compare our protocol with Binary Protocol using Facebook social network data. Section 7 is devoted to recalling some of the previous work related to the problem addressed in the paper. Finally, in Sect. 8 we conclude and indicate some possible future work. The contribution of our paper is as follows.
  • We show that the fault-tolerant protocol from [8] (called the Binary Protocol) offers a very low level of accuracy of aggregated data even for a small number of faults and a reasonable network size. This holds despite very good asymptotic guarantees.

  • On the positive side, we construct a new protocol that offers much better accuracy and significantly lower computational requirements. However, we assume a weaker security model where users may trust a few others, and we allow some limited, local communication between users. This assumption is justified in various scenarios, specifically when users possess some local knowledge about a few other participants. This is a natural assumption for electricity meters, where the privacy concern is that the adversary can deduce, e.g., the sleep/work habits or the number of inhabitants of the household. Your neighbors may know most of your habits anyway (e.g., by simply observing the lights in your window). Similarly, in cloud services or social networks, one naturally has some friends or users to whom he or she gives the data willingly. More precisely, all my neighbors/friends can cooperate to break my privacy much more easily than any outside party.

  • We experimentally show, using real data, that our protocol, utilizing this limited local communication between nodes, allows us to maintain privacy and fault tolerance even for a massive number of failing nodes, while the Binary Protocol becomes far too inaccurate even for a small number of faults.

2 Definitions and tools

Below we present some definitions and facts that will be used throughout this paper. We will denote the set of real numbers by \(\mathbb {R}\), integers by \(\mathbb {Z}\) and natural numbers by \(\mathbb {N}\).

Definition 1

(Symmetric geometric distribution). Let \(\alpha > 1\). We denote by \(Geom(\alpha )\) the symmetric geometric distribution that takes integer values such that the probability mass function at \(k \in \mathbb {Z}\) is \(\frac{\alpha -1 }{\alpha +1} \cdot \alpha ^{-|k|}\).

Fact 1

(From [8]) Let \(\epsilon >0\). Let u, v be integers such that \(|u-v| \le \varDelta \) for fixed \(\varDelta \in \mathbb {N}^+\). Let r be a random variable having distribution \(Geom(\exp (\frac{\epsilon }{\varDelta }))\). Then for any integer k
$$\begin{aligned} P[v+r=k] \le \exp (\epsilon )P[u+r=k]. \end{aligned}$$
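Fact 1 can be checked numerically for concrete parameters. The following Python sketch is only an illustration; the names `geom_pmf` and `satisfies_fact1` are ours and do not come from [8], and the check covers only the finite range of k that is passed in.

```python
import math


def geom_pmf(k: int, alpha: float) -> float:
    """Probability mass of Geom(alpha) at integer k (Definition 1)."""
    return (alpha - 1) / (alpha + 1) * alpha ** (-abs(k))


def satisfies_fact1(u: int, v: int, sensitivity: int, eps: float, ks) -> bool:
    """Check P[v + r = k] <= exp(eps) * P[u + r = k] for every k in ks,
    where r ~ Geom(exp(eps / sensitivity))."""
    alpha = math.exp(eps / sensitivity)
    for k in ks:
        lhs = geom_pmf(k - v, alpha)                  # P[v + r = k]
        rhs = math.exp(eps) * geom_pmf(k - u, alpha)  # exp(eps) * P[u + r = k]
        if lhs > rhs * (1 + 1e-12):                   # small slack for rounding
            return False
    return True
```

For instance, `satisfies_fact1(3, 5, 2, 0.5, range(-50, 51))` respects the assumption \(|u-v| \le \varDelta \), whereas calling it with \(|u-v| > \varDelta \) shows that the inequality can then fail.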

Definition 2

(Diluted geometric distribution). Let \(\alpha > 1\) and \(0< \beta \le 1\). A random variable has \(\beta \)-diluted Geometric distribution \(Geom^{\beta }(\alpha )\) if with probability \(\beta \) it is sampled from \(Geom(\alpha )\), and with probability \(1-\beta \) is set to 0.
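Definitions 1 and 2 translate directly into a sampler. The sketch below is an assumption-laden illustration rather than the procedure used in [8]: it draws from \(Geom(\alpha )\) as the difference of two independent one-sided geometric variables with ratio \(1/\alpha \), which yields exactly the probability mass function of Definition 1. The function names are hypothetical.

```python
import math
import random


def sample_geom(alpha: float, rng: random.Random) -> int:
    """Draw from Geom(alpha) as the difference of two i.i.d. one-sided
    geometric variables with ratio q = 1/alpha."""
    log_q = math.log(1.0 / alpha)

    def one_sided() -> int:
        u = 1.0 - rng.random()           # uniform on (0, 1], avoids log(0)
        return int(math.log(u) / log_q)  # P[result >= k] = q**k

    return one_sided() - one_sided()


def sample_diluted_geom(alpha: float, beta: float, rng: random.Random) -> int:
    """Draw from the beta-diluted distribution of Definition 2:
    Geom(alpha) with probability beta, and 0 otherwise."""
    return sample_geom(alpha, rng) if rng.random() < beta else 0
```

Passing an explicit `random.Random` instance keeps experiments reproducible; a deployment would need a cryptographically suitable source of randomness instead.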

In the same manner as in [8], we use computational differential privacy as a measure of privacy protection. This notion has been introduced (in a similar form) in [34] and is in fact a computational counterpart of differential privacy from [11].

Definition 3

(Computational differential privacy against compromise (from [8])) Suppose that the users are being compromised by some underlying randomized process \(\mathcal {C}\), and we use C to denote the information obtained by the adversary from the compromised users. Let \(\varepsilon ,\, \delta >0\). A (randomized) protocol \(\varPi \) preserves computational \((\varepsilon ,\, \delta )\)-differential privacy (against the compromising process \(\mathcal {C}\)) if there exists a negligible function \(\eta : \mathbb {N}\rightarrow \mathbb {R}^+\) such that for all \(\lambda \in \mathbb {N}\), for all \(i \in \left\{ 1,\,2,\,\ldots ,\,n\right\} \), for all vectors \(x,\,y\in \left\{ 0,\,1\right\} ^n\) that differ only at position i, for all probabilistic polynomial-time Turing machines \(\mathcal {A}\), for any output \(b\in \left\{ 0,\,1\right\} \),
$$\begin{aligned}&P_{\mathcal {C}_i}\left[ \mathcal {A}\left( \varPi \left( \lambda ,\,x\right) ,\,C\right) =b\right] \\&\quad \le e^{\varepsilon }P_{\mathcal {C}_i}\left[ \mathcal {A}\left( \varPi \left( \lambda ,\,y\right) ,\,C\right) =b\right] +\delta +\eta \left( \lambda \right) ~, \end{aligned}$$
where the probability is taken over the randomness of \(\mathcal {A}\), \(\varPi \) and \(\mathcal {C}_i\), which denotes the underlying compromising process conditioning on the event that user i is uncompromised.

In a similar manner to regular differential privacy, we say that protocol \(\varPi \) preserves computational \(\varepsilon \)-differential privacy if it preserves computational \((\varepsilon ,\,0)\)-differential privacy. The intuition behind this definition is as follows. Every party has some bit b. From observing some processing of data, it is not feasible for any computationally bounded adversary to learn too much about b. This should hold with probability at least \(1-\delta \).

3 Protocol by Chan et al.—description

In [8] the authors propose a fault-tolerant, privacy-preserving data aggregation protocol named the Binary Protocol. The purpose of the protocol is to allow some untrusted aggregator \(\mathrm {\mathbf {AGG}}\) to learn the sum of values \(x_i\), \(1\le i\le n\), where \(x_i\) is kept by the ith user. We will denote the ith user by \(\mathrm {\mathbf {N}}_i\). The idea is based on earlier work [44], in particular the block aggregation protocol. In this setting, we do not have a trusted party who is authorized to collect the real data and then perform some specific actions to preserve privacy (e.g., add noise of appropriate magnitude). The users themselves have to be responsible for securing their privacy by adding noise from some specific distribution, encrypting the noisy value and then sending it to the aggregator. This problem requires a combination of both cryptographic and privacy-preserving techniques. Note that we have essentially two adversaries here. The first is an external one, against whom we have to use cryptography to protect the communication between users and the aggregator. This external adversary should not be able to decrypt anything, including the noisy sum of all data. On the other hand, the aggregator himself is an adversary as well. This adversary, however, should be able to decrypt only the noisy sum and should not be able to compromise the privacy of any subset of users. The general idea behind block aggregation is to generate a random secret key \(sk_i\) for each of the n users as well as an additional \(sk_0\) given to the aggregator, such that \(\sum _{i=0}^n{sk_i}=0\). Before sending the encrypted data, the ith user adds noise \(r_i\) coming from the diluted geometric distribution (Definition 2 in Sect. 2). We will denote the noisy data of the ith user by \(\tilde{x}_i = x_i + r_i\).
Namely, each user transmits \(\mathrm {Enc}_{{\mathrm {sk}}_i}\left( \tilde{x}_i\right) \) so that upon receiving all shares and having \(sk_0\), the secret keys cancel out and the aggregator is left with the desired noisy sum. One may easily note that as long as each user transmits its value, \(\mathrm {\mathbf {AGG}}\) may use \(sk_0\) to decrypt the sum. The symmetric geometric distribution \(Geom(\alpha )\) can be viewed as a discrete version of the Laplace distribution, which is widely used in differential privacy papers. Having discrete values is essential for the cryptographic part of the protocol. The dilution parameter \(\beta \) is the probability that a specific user will add noise from \(Geom(\alpha )\). This is done because, intuitively, we want at least one user to add geometric noise, but we do not want too many of these noises, so that the accumulated noise stays sufficiently small. The problem with so-called block aggregation is that whenever a single user fails to deliver their share (and, more importantly, their \(sk_i\)), the blindings do not cancel out, which makes it impossible for the aggregator to decrypt the desired value.
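The cancellation of keys, and its failure when one share is missing, can be illustrated with a toy model. The sketch below does not reproduce the encryption scheme of [44]; as a simplifying assumption it treats each \(sk_i\) as a one-time additive mask modulo a hypothetical modulus `M`, which is enough to show why all shares are needed. All names are ours.

```python
import random

M = 2**61 - 1  # assumed modulus, large compared to any noisy sum


def generate_keys(n: int, rng: random.Random):
    """Return (sk_0, [sk_1, ..., sk_n]) with sk_0 + sk_1 + ... + sk_n = 0 mod M."""
    user_keys = [rng.randrange(M) for _ in range(n)]
    sk0 = (-sum(user_keys)) % M
    return sk0, user_keys


def encrypt(sk_i: int, noisy_value: int) -> int:
    """Blind a noisy value with the user's key (toy stand-in for Enc_{sk_i})."""
    return (sk_i + noisy_value) % M


def aggregate(sk0: int, ciphertexts) -> int:
    """Add sk_0 to the sum of ciphertexts and decode as a signed integer."""
    total = (sk0 + sum(ciphertexts)) % M
    return total - M if total > M // 2 else total
```

When every user contributes, `aggregate` returns the sum of the noisy values; if any ciphertext is left out, the missing key no longer cancels and the result is an unrelated value.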

The Binary Protocol presented in [8] addresses the incompleteness of the data by arranging the users in a virtual binary tree. One may visualize each user as a leaf of a binary tree, with all the tree nodes up to the root being virtual. The aggregator is identified with an additional tree node, which is located “above” the root and is connected only to the tree root. In order to simulate the tree structure, the users and \(\mathrm {\mathbf {AGG}}\) are equipped with appropriate secret keys and generate random noises for each tree layer, where a layer corresponds to the depth of a tree node, i.e., the first layer consists of the root and the second layer consists of the two direct children of the root. Finally, the \(\left\lceil \log n\right\rceil +1{\text {st}}\) layer consists of the leaves. Also, each virtual node corresponds to a segment of users, namely those who are in the leaves which are descendants of this specific virtual node. Each user performs the block aggregation protocol for each of the layers, i.e., they generate their block \(\mathrm {Enc}_{{\mathrm {sk}}_i}\left( \tilde{x}_i\right) \) for the \(\left\lceil \log n\right\rceil +1{\text {st}}\) layer and their shares for larger blocks of higher layers. In each of the layers, the noise \(r_i\) is taken from a different distribution, namely the parameter \(\beta \) of the diluted geometric distribution is derived as follows: \(\beta =\min \left( \frac{1}{\left| B\right| }\ln \frac{1}{\delta _0},\,1\right) \), where \(\left| B\right| \) is the length of the segments corresponding to nodes in the layer and \(\delta _0>0\) is a privacy parameter. One may note that the higher the layer is, the sparser the blinding becomes. If all users present their shares, the problem is reduced to the original block aggregation (except for computational costs like the distribution of additional secret keys, each user generating not one, but more noises, etc.).
The aggregator may decrypt the root-layer block, obtaining the sum of all the \(\tilde{x}_i\) with the blinding canceled out. However, if at least one user \(\mathrm {\mathbf {N}}_i\) fails, all the blocks containing \(\mathrm {\mathbf {N}}_i\) will suffer the same issues as block aggregation with a missing user. Namely, the underlying value cannot be decrypted due to the lack of necessary secret keys (they all have to be present to cancel each other). In order to provide the aggregation of the working users, the authors of [8] allow the aggregator to find a covering of the tree by blocks from different layers such that all the working users are covered, none of the failed users is included, and \(\mathrm {\mathbf {AGG}}\) is able to recover the result. It is easy to see that for any subset of failed users in such a virtual binary tree we can find an appropriate coverage by working leaves and virtual nodes. For a more detailed description of the Binary Protocol, we refer the reader to [8].
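The covering step can be sketched as a simple recursion over the virtual tree. This is our own illustration of the idea, not code from [8]: users are assumed to be numbered 0 to n-1 with n a power of 2, a block is written as a half-open interval, and the hypothetical function `covering` returns the largest failure-free blocks.

```python
def covering(lo: int, hi: int, failed: set) -> list:
    """Maximal failure-free dyadic blocks [a, b) covering the working users
    among leaves lo..hi-1; hi - lo must be a power of 2."""
    if not any(lo <= f < hi for f in failed):
        return [(lo, hi)]   # clean block: a single ciphertext suffices
    if hi - lo == 1:
        return []           # a single failed leaf is skipped
    mid = (lo + hi) // 2
    return covering(lo, mid, failed) + covering(mid, hi, failed)
```

For n = 8 and a single failure at user 2, the call `covering(0, 8, {2})` selects the blocks (0, 2), (3, 4) and (4, 8), so one failure already forces the aggregator to use three blocks from three different layers.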

The Binary Protocol provides security under the computational differential privacy model and results in \({{\mathrm{O}}}\left( n\log n\right) \) communications exchanged in the network and, even more importantly, guarantees \(\tilde{{{\mathrm{O}}}}\left( \left( \log n\right) ^{\frac{3}{2}}\right) \) error. This notation, however, hides significant constants. In a practical setting, the results of [8] are less satisfying than one would expect. The issues concerning the protocol and the resulting error are raised in Sect. 4.

4 Analysis of Chan et al.’s protocol—the magnitude of error

In this section we will show that the error magnitude in the Binary Protocol is significant for a moderate number of participants. Note that in [8] the authors assumed that each user has value \(x_i \in \{0,1\}\), which means that the range of the sum of aggregated data is [0, n]. Thus, an error of magnitude \(\gamma n\) shall be regarded as large for a moderate constant \(\gamma \). Note that if we simply ignore all the data (thus making the protocol private and without any communication) and make a coin toss for each user to determine which value he holds, we will have expected accuracy 0.5n. Obviously, this is an absolutely preposterous idea, yet it shows that any protocol with error of linear magnitude should be considered extremely inaccurate.

The authors of [8] have shown that the magnitude of error is o(n) asymptotically. However, in practical applications we are also interested in the performance of this protocol for moderate values of n, e.g., \(n \leqslant 2^{14}\). We will show that for a reasonable range of values of the number of users n and number of failures \(\kappa \) the error is large (\(\gamma n\) for some constant \(\gamma \)) with significant probability. Obviously, as n increases, the Binary Protocol becomes better because of the asymptotic guarantees. However, our aim here is to show that if the number of participants is at most moderate (e.g., around \(2^{12}\)) or the number of failures is significant (e.g., \(\kappa = \log _2(n)\) or \(\kappa = \lfloor \frac{n}{2^6}\rfloor \)), then the accuracy of the Binary Protocol is too low to be used. Furthermore, if the number of users is quite small (e.g., \(2^{10}\) or less), then even for \(\kappa = 5\) the errors generated are unacceptably high.

We aim to show the precise magnitude of error in the Binary Protocol. To achieve this, we will use a subtler method than those presented by the authors of [8]. To support our analysis we show results of simulations. Note that in [8] the authors described only simulations without failed nodes, even though their protocol is specifically designed to handle cases with failed nodes.

4.1 Analytical approach

The size of the error depends on the number of failed users and the way they are distributed among all participants. Let us fix n as the number of participants. Like the authors of [8], we assume for simplicity that n is a power of 2. However, our reasoning can be generalized to every n. We also assume that \(\kappa \) users have failed and these users are uniformly distributed among all participants, which is a standard approach in failure models. The error generated during the Binary Protocol is the sum of all noises in the aggregated blocks. Throughout this section, we will use the following notation: \(\delta _0 = \frac{\delta }{\lfloor \log _2(n)\rfloor + 1}\), where \(\delta \) is a privacy parameter. We also have \(\beta _i=\min \left( \frac{1}{\left| B_i\right| }\ln \frac{1}{\delta _0},\,1\right) \), where \(\left| B_i\right| \) is the size of the segment corresponding to a node on the \(i\hbox {th}\) level of the tree. Because we assumed that n is a power of 2, the binary tree is full, and \(\left| B_i\right| \) is simply the number of leaves that are descendants of any node on the \(i\hbox {th}\) level of the tree. In our analysis, we first show an exact formula for the expected number of noises added by individual nodes. Note that in [8] the authors gave only asymptotic formulas for the sum of generated noises. Here we give an exact formula in the following theorem.

Theorem 1

Let Y be a random variable which denotes the number of noises added during the Binary Protocol. Let \(\kappa > 0\) and fix n as the number of participants. Then, the expected value of random variable Y is given by the following formula:
$$\begin{aligned} EY = n-\kappa + n\cdot \sum _{i=1}^{\log _2(n)-1}\left( \frac{\left( {\begin{array}{c}n-\frac{n}{2^i}\\ \kappa \end{array}}\right) }{\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) } \cdot \left( \beta _i - \beta _{i+1}\right) \right) , \end{aligned}$$
where \(\beta _i=\min \left( \frac{1}{\left| B_i\right| }\ln \frac{1}{\delta _0},\,1\right) \).

Proof

Consider the Binary Protocol described in Sect. 3. We aim to give a precise formula for the expected value of the number of noises added in this protocol. For simplicity, we assume that n is a power of 2. We also assume that \(\kappa \) leaves have failed, and they are uniformly chosen from all n leaves. We will use random variables \(X_i\) to denote the number of segments (on the ith level of the tree) corresponding to subsets of users where none failed. We will also use random variable \(X^*_i\) to denote the number of aggregating nodes on the ith level of the tree. By EX we will denote the expected value of random variable X. Let us begin by stating and proving the following lemma.

Lemma 1

Consider Binary Protocol with fixed \(\kappa \) and n. We call a node an aggregating node, if it is used by the aggregator to obtain a sum of data from some subset of users. We have the following formula for \(i \geqslant 1\)
$$\begin{aligned} EX_i^* = EX_i - 2EX_{i-1} = 2^i\cdot \left( \frac{\left( {\begin{array}{c}n-\frac{n}{2^{i}}\\ \kappa \end{array}}\right) }{\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) } - \frac{\left( {\begin{array}{c}n-\frac{n}{2^{i-1}}\\ \kappa \end{array}}\right) }{\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) } \right) . \end{aligned}$$

Proof

First of all, we will call a segment in the Binary Protocol tree clean if and only if there are no failures in this segment. Each node in the tree corresponds to a specific segment, according to the Binary Protocol rules. Note that on a given tree level, all nodes correspond to segments of the same size, denoted here by \(|B_i|\). Throughout this reasoning we will call the “root level” level 0; children of the root are on level 1 and so on, up to level \(\log _2(n)\), which is the “leaves level”.

The value held by each user is aggregated in exactly one node, which belongs to some \(i\hbox {th}\) level and corresponds to a specific segment. This user generates a symmetric geometric noise with probability \(\beta _i\) (see Definition 2), where:
$$\begin{aligned} \beta _i = \min \left( \frac{1}{|B_i|}\ln \left( \frac{\log _2(n)+1}{\delta }\right) , 1\right) . \end{aligned}$$
We want to know the expected number of noises generated throughout the whole protocol. To do this, we first denote the number of “clean” segments of size \(|B_i|\) (corresponding to nodes on the \(i\hbox {th}\) level of the tree) by a random variable \(X_i\). Note that \(X_i \in \{0, 1, \ldots , 2^i\}\). Furthermore, we see that:
$$\begin{aligned} X_i = \sum _{j=1}^{2^i}X_{i,j}, \end{aligned}$$
where
$$\begin{aligned} X_{i,j} = {\left\{ \begin{array}{ll} 1, \qquad \text {if segment } j \text { on level } i \text { has no failures},\\ 0, \qquad \text {otherwise}. \end{array}\right. } \end{aligned}$$
This, and the fact that \(EX_{i,j} = EX_{i,k}\) for every \(j, k \in \{1, \ldots , 2^i\}\), allows us to use linearity of expectation to calculate \(EX_i\):
$$\begin{aligned} EX_i = E\sum _{j=1}^{2^i}X_{i,j} = 2^i EX_{i,1} = 2^i \cdot P(X_{i,1} = 1). \end{aligned}$$
(1)
Now see that
$$\begin{aligned} P(X_{i,1} = 1) = \frac{\left( {\begin{array}{c}n-|B_i|\\ \kappa \end{array}}\right) }{\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) }, \end{aligned}$$
and also \(|B_i| = \frac{n}{2^i}\), thus plugging these into (1) we get
$$\begin{aligned} EX_i = 2^i \cdot \frac{\left( {\begin{array}{c}n-\frac{n}{2^i}\\ \kappa \end{array}}\right) }{\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) }. \end{aligned}$$
(2)
Now let us consider the number of segments which actually aggregate the data. Note that a node is an aggregating one exactly when it corresponds to a clean segment but its parent does not. We denote the number of aggregating nodes on the \(i\hbox {th}\) level by \(X_i^*\). We can see that \(X_i^* = X_i - 2X_{i-1}\), where \(i \in \{1, \ldots , \log _2(n)\}\).
There are \(X_i\) clean nodes on the \(i\hbox {th}\) level, but we have to subtract twice the number of clean nodes on the level above, because each clean node on level \(i-1\) is the parent of two nodes on the ith level, which are therefore not aggregating nodes. That gives us
$$\begin{aligned} EX_i^* = EX_i - 2EX_{i-1} = 2^i\cdot \left( \frac{\left( {\begin{array}{c}n-\frac{n}{2^{i}}\\ \kappa \end{array}}\right) }{\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) } - \frac{\left( {\begin{array}{c}n-\frac{n}{2^{i-1}}\\ \kappa \end{array}}\right) }{\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) } \right) , \end{aligned}$$
which completes the proof of this lemma. \(\square \)
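For small trees, Lemma 1 can be verified exhaustively. The sketch below is our own check and not part of the original analysis: it enumerates every set of \(\kappa \) failed leaves, counts the nodes on level i that are clean while their parent is not, and compares the average with the closed form. Exact rational arithmetic is used so that the comparison is an equality; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def aggregating_formula(n: int, kappa: int, i: int) -> Fraction:
    """Closed form for E[X_i^*] from Lemma 1 (n a power of 2, i >= 1)."""
    def clean_prob(level: int) -> Fraction:
        return Fraction(comb(n - n // 2**level, kappa), comb(n, kappa))
    return 2**i * (clean_prob(i) - clean_prob(i - 1))


def aggregating_bruteforce(n: int, kappa: int, i: int) -> Fraction:
    """Average number of level-i nodes that are clean but have a non-clean
    parent, over all sets of kappa failed leaves."""
    size = n // 2**i
    count = 0
    for failed in combinations(range(n), kappa):
        for j in range(2**i):
            lo, parent_lo = j * size, (j // 2) * 2 * size
            clean = not any(lo <= f < lo + size for f in failed)
            parent_clean = not any(
                parent_lo <= f < parent_lo + 2 * size for f in failed)
            if clean and not parent_clean:
                count += 1
    return Fraction(count, comb(n, kappa))
```

The enumeration grows as \(\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) \), so it is only practical for very small n, but it confirms the identity \(X_i^* = X_i - 2X_{i-1}\) on which the lemma rests.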

Lemma 1 gives us an explicit formula for \(EX_i^*\). Now, when we have a formula for the expected value of the number of aggregating nodes on each level, we can proceed to calculating the expected value of the number of geometric noises generated during the Binary Protocol.

Let \(Y_i\) be a random variable which denotes the number of noises generated on the \(i\hbox {th}\) level of the tree. On the \(i\hbox {th}\) level we aggregate \(X_i^*\) segments, each of these segments has \(2^{\log _2(n)-i}\) users, and each of these users generates geometric noise with probability \(\beta _i\). Therefore, we have \(Y_i \sim \text {Bin}\left( 2^{\log _2(n)-i} \cdot X_i^*, \beta _i\right) \), where \(\text {Bin}(n,p)\) denotes the binomial distribution. After observing this, we can see that
$$\begin{aligned} EY_i = EX_i^* \cdot 2^{\log _2(n)-i} \cdot \beta _i. \end{aligned}$$
Every user is aggregated only on one level, so if we take a sum over all levels of the tree, we will get all the noises generated during the Binary Protocol. Let Y be a random variable that denotes the number of noises generated. We have
$$\begin{aligned} Y = \sum _{i=0}^{\log _2(n)}Y_i, \end{aligned}$$
and we can also safely assume that if \(\kappa > 0\), then \(Y_0 = 0\), because if at least one user has failed, then we cannot possibly aggregate all users in the root of the tree. Furthermore, using linearity of expectation and the fact that if \(X \sim \text {Bin}(n,p)\) then \(EX = np\), we have
$$\begin{aligned}&EY = \sum _{i=1}^{\log _2(n)} EX_i^* \cdot 2^{\log _2(n)-i} \cdot \beta _i\\&\quad = \sum _{i=1}^{\log _2(n)} (EX_i - 2EX_{i-1}) \quad \cdot 2^{\log _2(n)-i} \cdot \beta _i. \end{aligned}$$
After straightforward calculations, we can get
$$\begin{aligned} EY&= \sum _{i=1}^{\log _2(n)} EX_i \cdot 2^{\log _2(n)-i} \cdot \beta _i - \sum _{i=1}^{\log _2(n)} 2EX_{i-1} \\&\qquad \cdot 2^{\log _2(n)-i} \cdot \beta _i = \\&= \sum _{i=1}^{\log _2(n)} EX_i \cdot 2^{\log _2(n)-i} \cdot \beta _i - \sum _{i=0}^{\log _2(n)-1} EX_{i} \cdot 2^{\log _2(n)-i} \\&\qquad \cdot \beta _{i+1} = \\&= EX_{\log _2(n)} \cdot \beta _{\log _2(n)} - n \beta _1 EX_0 + \sum _{i=1}^{\log _2(n)-1} EX_i \\&\qquad \cdot 2^{\log _2(n)-i} \cdot \left( \beta _i-\beta _{i+1}\right) . \end{aligned}$$
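Since \(|B_{\log _2(n)}| = 1\), we have \(\beta _{\log _2(n)} = 1\) whenever \(\delta _0 \leqslant e^{-1}\). Also, as \(\kappa > 0\), we have \(X_0 = 0\) with probability 1. These facts yield the following result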
$$\begin{aligned} EY&= EX_{\log _2(n)} + \sum _{i=1}^{\log _2(n)-1} EX_i \cdot 2^{\log _2(n)-i} \cdot \left( \beta _i - \beta _{i+1}\right) = \\&= n \cdot \frac{\left( {\begin{array}{c}n-1\\ \kappa \end{array}}\right) }{\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) } + \sum _{i=1}^{\log _2(n)-1} 2^i \cdot \frac{\left( {\begin{array}{c}n-\frac{n}{2^i}\\ \kappa \end{array}}\right) }{\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) } \cdot 2^{\log _2(n)-i} \cdot \left( \beta _i - \beta _{i+1}\right) = \\&= n-\kappa + n\cdot \sum _{i=1}^{\log _2(n)-1}\left( \frac{\left( {\begin{array}{c}n-\frac{n}{2^i}\\ \kappa \end{array}}\right) }{\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) } \cdot \left( \beta _i - \beta _{i+1}\right) \right) . \end{aligned}$$
This gives us a formula for calculating EY and completes the proof of this theorem. \(\square \)
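The formula of Theorem 1 is easy to evaluate numerically. The following sketch is a direct transcription under the assumptions of the theorem (n a power of 2, \(\kappa > 0\)); the function names `beta` and `expected_noises` and the default \(\delta = 0.05\) are our choices for illustration.

```python
from math import comb, log


def beta(n: int, i: int, delta: float) -> float:
    """Dilution parameter beta_i for level i, where |B_i| = n / 2**i."""
    levels = n.bit_length() - 1          # log2(n) for n a power of 2
    delta0 = delta / (levels + 1)
    return min(log(1 / delta0) / (n // 2**i), 1.0)


def expected_noises(n: int, kappa: int, delta: float = 0.05) -> float:
    """E[Y] from Theorem 1 for n a power of 2 and kappa > 0 failed users."""
    levels = n.bit_length() - 1
    total = 0.0
    for i in range(1, levels):
        # probability that a fixed level-i segment contains no failure
        clean_prob = comb(n - n // 2**i, kappa) / comb(n, kappa)
        total += clean_prob * (beta(n, i, delta) - beta(n, i + 1, delta))
    return n - kappa + n * total
```

Since \(\beta _i \leqslant \beta _{i+1}\), every term of the sum is non-positive, so the result never exceeds \(n-\kappa \); for very small n all the \(\beta _i\) with \(i \geqslant 1\) equal 1 and the result is exactly \(n-\kappa \).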

Now we show a lower bound for the expected number of noises for a limited range of n. We present it in the following corollary.

Corollary 1

Let \(2^4 \leqslant n \leqslant 2^{21}\) and \(\delta = 0.05\). Then EY has the following lower bound:
$$\begin{aligned} EY \geqslant n-\kappa - n\cdot \left( e^{-\frac{8\kappa }{n}} + \frac{\ln \left( \frac{\log _2(n)+1}{\delta }\right) }{8} \cdot \left( e^{-\frac{16\kappa }{n}} - e^{-\frac{8\kappa }{n}} \right) \right) . \end{aligned}$$

Proof

We fix \(\delta = 0.05\). First observe that for \(2^4 \leqslant n \leqslant 2^{21}\) we have
$$\begin{aligned} \beta _{\log _2(n)}=\beta _{\log _2(n)-1}=\beta _{\log _2(n)-2} = 1, \end{aligned}$$
as for these levels we have \(\frac{1}{|B_i|}\cdot \ln \left( \frac{\log _2(n)+1}{\delta }\right) > 1\). This means that users aggregated in segments of length 1, 2 and 4 generate noise with probability 1. Furthermore, for \(i \leqslant (\log _2(n)-3)\) we have \(\beta _{i} < 1\). Also, for \(i \leqslant (\log _2(n)-4)\), we have
$$\begin{aligned} \frac{\beta _{i+1}}{\beta _i} = \frac{|B_i|}{|B_{i+1}|} = 2. \end{aligned}$$
Another observation is that we can get an upper bound for \(\frac{\left( {\begin{array}{c}n-\frac{n}{2^i}\\ \kappa \end{array}}\right) }{\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) }\) in the following way
$$\begin{aligned}&\frac{\left( {\begin{array}{c}n-\frac{n}{2^i}\\ \kappa \end{array}}\right) }{\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) } = \frac{(n-\frac{n}{2^i})! \cdot (n-\kappa )!}{(n-\frac{n}{2^i}-\kappa )! \cdot n!} = \\&\quad = \frac{(n\cdot \frac{2^i-1}{2^i})\cdot (n\cdot \frac{2^i-1}{2^i}-1) \cdot \ldots \cdot (n\cdot \frac{2^i-1}{2^i}-\kappa +1)}{n\cdot (n-1) \cdot \ldots \cdot (n-\kappa +1)} = \\&\quad = \left( \frac{2^i-1}{2^i}\right) ^{\kappa } \cdot \frac{n\cdot (n-\frac{2^i}{2^i-1}) \cdot \ldots \cdot (n-(\kappa -1) \cdot \frac{2^i}{2^i-1})}{n\cdot (n-1) \cdot \ldots \cdot (n-\kappa +1)} \leqslant \\&\quad \leqslant \left( \frac{2^i-1}{2^i}\right) ^{\kappa } = \left( 1-\frac{1}{2^i}\right) ^{\kappa } = \left( \left( 1-\frac{1}{2^i}\right) ^{2^i}\right) ^{\frac{\kappa }{2^i}} \leqslant e^{-\frac{\kappa }{2^i}}, \end{aligned}$$
where the last inequality comes from the fact that \((1-x) \leqslant e^{-x}\). We can use all these observations to obtain a lower bound. Let \(\beta ^* = \ln \left( \frac{\log _2(n)+1}{\delta }\right) \). Then we have
$$\begin{aligned} EY&= n-\kappa + n\cdot \sum _{i=1}^{\log _2(n)-1}\left( \frac{\left( {\begin{array}{c}n-\frac{n}{2^i}\\ \kappa \end{array}}\right) }{\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) } \cdot \left( \beta _i - \beta _{i+1}\right) \right) = \\&= n-\kappa - n\\ {}&\quad \cdot \left( \sum _{i=1}^{\log _2(n)-4}\left( \frac{\left( {\begin{array}{c}n-\frac{n}{2^i}\\ \kappa \end{array}}\right) }{\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) } \cdot \beta _i\right) + \frac{\left( {\begin{array}{c}n-8\\ \kappa \end{array}}\right) }{\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) } \cdot \left( 1-\beta _{\log _2(n)-3} \right) \right) \geqslant \\&\geqslant n-\kappa - n\\ {}&\quad \cdot \left( \sum _{i=1}^{\log _2(n)-4}\left( e^{-\frac{\kappa }{2^i}} \cdot \beta _i\right) + e^{\frac{8\kappa }{n}} \cdot \left( 1-\beta _{\log _2(n)-3} \right) \right) \geqslant \\&\geqslant n-\kappa - n\\ {}&\quad \cdot \left( \sum _{i=1}^{\log _2(n)-4}\left( e^{-\frac{\kappa }{2^{\log _2(n)-4}}} \cdot \beta _i\right) + e^{\frac{8\kappa }{n}} \cdot \left( 1-\beta _{\log _2(n)-3} \right) \right) = \\&= n-\kappa - n\\ {}&\quad \cdot \left( e^{-\frac{16 \kappa }{n}} \cdot \frac{\beta ^*}{n} \cdot \sum _{i=1}^{\log _2(n)-4}\left( 2^i\right) + e^{\frac{8\kappa }{n}} \cdot \left( 1-\beta _{\log _2(n)-3} \right) \right) = \\&= n-\kappa - n\\ {}&\quad \cdot \left( e^{-\frac{16 \kappa }{n}} \cdot \frac{\beta ^*}{n} \cdot \left( \frac{n}{8} - 2\right) + e^{\frac{8\kappa }{n}} \cdot \left( 1-\frac{\beta ^*}{8} \right) \right) \geqslant \\&\geqslant n-\kappa - n\cdot \left( e^{-\frac{16\kappa }{n}} \cdot \frac{\beta ^*}{8} + e^{\frac{8\kappa }{n}} \cdot \left( 1-\frac{\beta ^*}{8} \right) \right) = \\&= n-\kappa - n\cdot \left( e^{-\frac{8\kappa }{n}} + \frac{\beta ^*}{8} \cdot \left( e^{-\frac{16\kappa }{n}} - e^{-\frac{8\kappa }{n}} \right) \right) . \end{aligned}$$
This gives the lower bound for EY and completes the proof of the corollary. \(\square \)

Note that if \(n < 2^4\) then we have \(\beta _i = 0\), which means that every remaining user has to add noise (even if there are no failures, i.e., \(\kappa =0\)). There is no need to give a lower bound in that case, because then the number of noisy inputs is exactly \(n-\kappa \). Note also that even though we fixed a specific \(\delta \) that is used broadly in previous papers (including [8]), similar reasoning can be made for different values of \(\delta \).

We can use this bound to show the following.

Example 1

Fix \(\delta = 0.05\). We plot the lower bound for the fraction of nodes that added noise in Binary Protocol, i.e., the lower bound for \(\frac{EY}{n}\), using Corollary 1. In Fig. 1 we assumed \(\kappa = \log _2(n)\) failures. Observe that for a moderate number of nodes, the number of “noisy” nodes is linear in n. In Fig. 2 we set \(\kappa = \frac{n}{2^6}\), which is still less than 2% failures. The fraction of users that generated noise is over 17%. Recall that ideally there should be only a single noise or a constant number of those.

Fig. 1

Lower bound for \(\frac{EY}{n}\) in Binary Protocol with \(\delta = 0.05\) and \(\kappa = \log _2(n)\)

Fig. 2

Lower bound for \(\frac{EY}{n}\) in Binary Protocol with \(\delta = 0.05\) and \(\kappa = \frac{n}{2^6}\)

It can easily be seen in Example 1 that even if the number of failures is very small (i.e., fewer than 2% of users fail), the number of noises generated is linear in n for a realistic number of nodes. Note that this does not yet mean that the size of the error is linear, because the noises could cancel each other out to some extent. We address this problem in the next paragraph.
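The closed-form bound at the end of the proof of Corollary 1 is easy to evaluate directly. The following minimal sketch does so for the setting of Fig. 2; the function name `noisy_fraction_lower_bound` is ours and purely illustrative, and the formula is only meaningful for \(n \geqslant 2^4\).

```python
import math

def noisy_fraction_lower_bound(n, kappa, delta=0.05):
    """Lower bound on EY/n from the last line of the proof of Corollary 1.

    Assumes n >= 2**4; for smaller n every remaining user adds noise anyway.
    """
    if n < 2 ** 4:
        raise ValueError("the bound is stated for n >= 2**4")
    beta_star = math.log((math.log2(n) + 1) / delta)
    e8 = math.exp(-8 * kappa / n)
    e16 = math.exp(-16 * kappa / n)
    ey_bound = n - kappa - n * (e8 + beta_star / 8 * (e16 - e8))
    return ey_bound / n

# Setting of Fig. 2: kappa = n / 2**6, i.e. fewer than 2% of users fail.
for exponent in range(6, 13):
    n = 2 ** exponent
    print(n, round(noisy_fraction_lower_bound(n, n / 2 ** 6), 4))
```

Because \(\frac{\kappa }{n}\) is constant in this setting, only \(\beta ^*\) changes with n, so the printed fraction grows slowly with the number of nodes.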

Having an exact formula and also a lower bound for the expected number of noises generated, we can calculate the error. Let us assume that we have m noises generated. Recall that each of them comes from the symmetric geometric distribution \(Geom(\alpha )\) with \(\alpha > 1\), which is comprehensively described in both [8] and [44]. We denote the sum of all noises as Z. One can easily see that \(EZ = 0\) due to the symmetry of the distribution. However, the expected additional error, i.e., E|Z|, might be, and we will show that it often is, quite large.

Theorem 2

Consider Binary Protocol with fixed \(\alpha \), and let m denote the number of noises generated, each coming from the \(Geom(\alpha )\) distribution. Let Z be a random variable denoting the sum of the generated noises. Then the following holds (the proof is based on techniques comprehensively described in [39]).
$$\begin{aligned} E|Z| = \int \limits _{0}^{\infty } \frac{4\cdot \alpha \cdot m \cdot \sin {t} \cdot \left( \alpha -1\right) ^{2m}}{t \cdot \pi \cdot \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}\text {d}t. \end{aligned}$$

Proof

We are interested in the absolute value of the sum of m noises, in order to estimate the error in Binary Protocol. First, let Z be a random variable that denotes the sum of noises. See that
$$\begin{aligned} Z = \sum _{i=1}^m Z_i, \end{aligned}$$
where \(Z_i\) is a random variable with distribution \(\hbox {Geom}(\alpha \)), where \(\alpha = e^{\frac{\epsilon }{\log _2(n)+1}}\).
Let \(\varphi _{Z_i}(t)\) denote the characteristic function of \(Z_i\). We have
$$\begin{aligned} \varphi _{Z_i}(t) = \frac{(\alpha -1)^2}{\alpha ^2 - \alpha (e^{it}+e^{-it}) + 1} = \frac{(\alpha -1)^2}{\alpha ^2 - 2\alpha \cos {t} + 1}. \end{aligned}$$
Let \(\varphi _{Z}(t)\) denote the characteristic function of Z. As \(Z_i\) are i.i.d. random variables, we get
$$\begin{aligned} \varphi _{Z}(t) = \left( \varphi _{Z_1}\right) ^m = \left( \frac{(\alpha -1)^2}{\alpha ^2 - 2\alpha \cos {t} + 1}\right) ^m. \end{aligned}$$
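As a sanity check of the closed form \(\frac{(\alpha -1)^2}{\alpha ^2 - 2\alpha \cos {t} + 1}\), one can sum the defining series numerically. The sketch below assumes the usual probability mass function of the symmetric geometric distribution, \(\Pr [Z_i = k] = \frac{\alpha -1}{\alpha +1}\alpha ^{-|k|}\), which is not restated in this section; the function names are illustrative.

```python
import math

def char_fn_closed_form(alpha, t):
    """Closed-form characteristic function of a single noise Z_i."""
    return (alpha - 1) ** 2 / (alpha ** 2 - 2 * alpha * math.cos(t) + 1)

def char_fn_series(alpha, t, terms=5000):
    """Truncated series E[exp(i t Z_i)] under the assumed pmf.

    The imaginary parts cancel by symmetry, so only cosines remain.
    """
    norm = (alpha - 1) / (alpha + 1)
    total = norm          # contribution of k = 0
    weight = 1.0
    for k in range(1, terms):
        weight /= alpha   # weight == alpha ** (-k)
        total += 2 * norm * weight * math.cos(t * k)
    return total

alpha = math.exp(0.5 / 11)    # epsilon = 0.5, n = 2**10
for t in (0.0, 0.1, 1.0, 3.0):
    print(t, char_fn_closed_form(alpha, t), char_fn_series(alpha, t))
```

The characteristic function of the sum Z is then simply the m-th power of either value.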
We will use techniques comprehensively described in [39] to calculate the expected value of |Z|. We have the following fact.

Fact 2

(From [39])
$$\begin{aligned} \varphi _{Z_+}(t)= & {} Ee^{itZ_+} = \frac{1}{2}[1+\varphi _{Z}(t)] \\&\quad + \frac{1}{2\pi i} \int \limits _{-\infty }^{\infty }\left[ \varphi _{Z}(t+u)-\varphi _{Z}(u)\right] \frac{du}{u}, \end{aligned}$$
where \(Z_+\) denotes \(\max (0,Z)\), and the integral is understood in the principal value sense (see [39]). Now see that
$$\begin{aligned} |Z| = Z_+ + Z_-, \end{aligned}$$
where \(Z_-\) denotes \(\max (0,-Z)\). Since Z is symmetric, \(Z_-\) has the same distribution as \(Z_+\), so \(EZ_- = EZ_+\). Consequently, we have
$$\begin{aligned} E|Z| = 2EZ_+ = 2\frac{\varphi _{Z_+}'(0)}{i}. \end{aligned}$$
(3)
We have to calculate the derivative of \(\varphi _{Z_+}(t)\) at 0. It can be done in the following way
$$\begin{aligned} \varphi _{Z_+}'(0)= & {} \frac{\varphi _{Z}'(0)}{2} + \frac{d}{\text {d}t}\left( \frac{1}{2\pi i} \displaystyle \int \limits _{-\infty }^{\infty }\left[ \varphi _{Z}(t+u)-\varphi _{Z}(u)\right] \frac{du}{u}\right) \left( 0\right) \nonumber \\= & {} \frac{1}{2\pi i} \left( \int \limits _{-\infty }^{\infty }\left[ \varphi _{Z}'(t+u)\right] \frac{du}{u}\right) (0) \nonumber \\= & {} \frac{1}{2\pi i} \displaystyle \int \limits _{-\infty }^{\infty }\left[ \varphi _{Z}'(u)\right] \frac{du}{u}. \end{aligned}$$
(4)
We used the fact that \(\varphi _{Z}'(0) = 0\), because Z is symmetric. Moreover, because EZ exists, E|Z| also has to exist. That is why the integral has to be finite, so we were able to use the Lebesgue theorem to swap the order of differentiation and integration. Differentiating \(\varphi _{Z}(t)\) yields the following
$$\begin{aligned} \varphi _{Z}'(t) = \frac{-2\cdot \alpha \cdot m \cdot \sin {t} \cdot \left( \alpha -1\right) ^{2m}}{\left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}. \end{aligned}$$
(5)
Combining (3), (4), (5) and observing that \(\frac{\varphi _{Z}'(u)}{u}\) is an even function of u, we obtain the following formula for E|Z|
$$\begin{aligned} E|Z| = \displaystyle \int \limits _{0}^{\infty } \frac{4\cdot \alpha \cdot m \cdot \sin {t} \cdot \left( \alpha -1\right) ^{2m}}{t \cdot \pi \cdot \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}\text {d}t, \end{aligned}$$
which completes the proof of this theorem. \(\square \)
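The integral in Theorem 2 is oscillatory, so a quick way to get a feeling for E|Z| is to estimate it by simulation instead. The following minimal sketch is not the method of the paper: it assumes that a symmetric geometric noise can be sampled as the difference of two independent geometric variables with ratio \(\frac{1}{\alpha }\) (which has the characteristic function used in the proof), and all names are illustrative.

```python
import math
import random

def sample_symmetric_geometric(alpha, rng):
    """One noise Z_i, sampled as the difference of two geometric variables.

    Each geometric variable lives on {0, 1, 2, ...} with P(K >= k) = alpha**(-k)
    and is drawn by inverse-transform sampling.
    """
    log_q = -math.log(alpha)
    x = int(math.log(1.0 - rng.random()) / log_q)
    y = int(math.log(1.0 - rng.random()) / log_q)
    return x - y

def estimate_abs_noise_sum(alpha, m, trials=2000, seed=1):
    """Monte Carlo estimate of E|Z| for Z = Z_1 + ... + Z_m."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        z = sum(sample_symmetric_geometric(alpha, rng) for _ in range(m))
        total += abs(z)
    return total / trials

# epsilon = 0.5 and n = 2**10 give alpha = exp(epsilon / (log2(n) + 1)).
alpha = math.exp(0.5 / (math.log2(2 ** 10) + 1))
print(estimate_abs_noise_sum(alpha, m=20))
```

Increasing `trials` tightens the estimate at the cost of running time; larger m makes the growth of E|Z| with the number of noises visible.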

We also show a lower bound for E|Z| in the following fact.

Fact 3

For fixed n and \(\epsilon \), which is a privacy parameter, provided that \(\alpha = e^{\frac{\epsilon }{\log _2(n)+1}}\) and \(m = \gamma n\), for \(\gamma \in [0,1]\) we have
$$\begin{aligned} E|Z| \geqslant c_{n,\epsilon } \cdot \sqrt{\gamma } \cdot \frac{\log _2(n) \cdot \sqrt{n}}{\epsilon \sqrt{\pi }} - 0.1~, \end{aligned}$$
where \(c_{n,\epsilon }\) is a constant, which is at least 1.4 for moderate values of n and \(\epsilon \).

Proof

Let us define \(\omega (t)\)
$$\begin{aligned} \omega (t) = \frac{4\cdot \alpha \cdot m \cdot \sin {t} \cdot \left( \alpha -1\right) ^{2m}}{\pi \cdot \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}} \end{aligned}$$
We have
$$\begin{aligned} E|Z| = \displaystyle \int \limits _{0}^{\infty } \frac{4\cdot \alpha \cdot m \cdot \sin {t} \cdot \left( \alpha -1\right) ^{2m}}{t \cdot \pi \cdot \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}\text {d}t = \int \limits _{0}^{\infty } \frac{\omega (t)}{t}\text {d}t. \end{aligned}$$
One can easily see that \(\omega (t)\) is periodic with period \(2\pi \). We can therefore consider splitting the integral into \([2k \pi , 2(k+1)\pi ]\) intervals and try to find a good lower bound for this integral. We have
$$\begin{aligned} E|Z| = \sum _{k=0}^{\infty } \left( \displaystyle \int \limits _{2k\pi }^{2(k+1)\pi } \frac{4\cdot \alpha \cdot m \cdot \sin {t} \cdot \left( \alpha -1\right) ^{2m}}{t \cdot \pi \cdot \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}\text {d}t \right) . \end{aligned}$$
Consider any of these integrals for \(k > 0\)
$$\begin{aligned} \displaystyle \int \limits _{2k\pi }^{2(k+1)\pi } \frac{4\cdot \alpha \cdot m \cdot \sin {t} \cdot \left( \alpha -1\right) ^{2m}}{t \cdot \pi \cdot \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}\text {d}t \geqslant 0. \end{aligned}$$
(6)
We will now explain why this inequality holds. First, observe that \(\omega (t)\) is antisymmetric about the midpoint \((2k+1)\pi \) of the interval \([2k\pi ,2(k+1)\pi ]\), i.e., \(\omega ((2k+1)\pi + s) = -\omega ((2k+1)\pi - s)\). One can easily see that \(\omega (t)\) is positive on \([2k\pi ,2k\pi +\pi ]\) and negative on \([2k\pi + \pi , 2(k+1)\pi ]\). Furthermore, the absolute value of \(\frac{\omega (t)}{t}\) is greater on the first half of the interval, because of the decreasing factor \(\frac{1}{t}\). This yields (6), which is true for all these intervals. We use it for all \(k > 0\), which leaves us with
$$\begin{aligned} E|Z|&= \sum _{k=0}^{\infty } \left( \displaystyle \int \limits _{2k\pi }^{2(k+1)\pi } \frac{4\cdot \alpha \cdot m \cdot \sin {t} \cdot \left( \alpha -1\right) ^{2m}}{t \cdot \pi \cdot \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}\text {d}t \right) \\&\geqslant \int \limits _{0}^{2\pi } \frac{4\cdot \alpha \cdot m \cdot \sin {t} \cdot \left( \alpha -1\right) ^{2m}}{t \cdot \pi \cdot \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}\text {d}t. \end{aligned}$$
Plotting this function shows that almost all of the mass is concentrated around 0, especially for \(\alpha \) close to 1. We could use the lower bound (6); however, there is no point in using it on the whole interval, because we would obtain the trivial inequality \(E|Z| \geqslant 0\). The interval requires slightly more subtle handling. Clearly, the argument behind (6) applies to any interval of type \([\pi - x, \pi + x]\), for \(x \leqslant \pi \). This yields the following
$$\begin{aligned} E|Z|&\geqslant \int \limits _{0}^{2\pi } \frac{4\cdot \alpha \cdot m \cdot \sin {t} \cdot \left( \alpha -1\right) ^{2m}}{t \cdot \pi \cdot \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}\text {d}t \\&\geqslant \int \limits _{0}^{\eta _{\alpha ,m}} \frac{4\cdot \alpha \cdot m \cdot \sin {t} \cdot \left( \alpha -1\right) ^{2m}}{t \pi \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}\text {d}t \\&\quad +\, \int \limits _{2\pi -\eta _{\alpha ,m}}^{2\pi } \frac{4\cdot \alpha \cdot m \cdot \sin {t} \cdot \left( \alpha -1\right) ^{2m}}{t \pi \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}\text {d}t, \end{aligned}$$
which is true for every \(\eta _{\alpha ,m} \in [0,\pi ]\). Now see that if \(\eta _{\alpha ,m} < \frac{\pi }{2}\), we can bound the first integral in the following way
$$\begin{aligned}&\int \limits _{0}^{\eta _{\alpha ,m}} \frac{4\alpha m \cdot \sin {t} \cdot \left( \alpha -1\right) ^{2m}}{t \cdot \pi \cdot \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}\text {d}t \nonumber \\&\qquad \geqslant \int \limits _{0}^{\eta _{\alpha ,m}} \frac{4 \alpha m \cdot \cos {t} \cdot \left( \alpha -1\right) ^{2m}}{\pi \cdot \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}\text {d}t, \end{aligned}$$
(7)
which follows from the fact that \(x \leqslant \tan {x}\) for \(x \in [0,\frac{\pi }{2})\). Furthermore,
$$\begin{aligned}&\int \limits _{2\pi -\eta _{\alpha ,m}}^{2\pi } \frac{4 \alpha m \cdot \sin {t} \cdot \left( \alpha -1\right) ^{2m}}{t \cdot \pi \cdot \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}\text {d}t \nonumber \\&\qquad \geqslant \int \limits _{2\pi -\eta _{\alpha ,m}}^{2\pi } \frac{4\alpha m \sin {t} \left( \alpha -1\right) ^{2m}}{t \cdot \pi \cdot \left( \alpha - 1\right) ^{2m+2}}\text {d}t, \end{aligned}$$
(8)
which comes from plugging 1 instead of \(\cos {t}\), which makes the function greater in terms of absolute value, but as it is negative on this interval, it yields a lower bound. The function from (7) has an explicit anti-derivative. On the other hand, in (8) we have, in fact, an integral of \(\frac{\sin {t}}{t}\) multiplied by a constant depending on \(\alpha \) and m. There also still remains a problem of choosing \(\eta _{\alpha , m}\). First we can observe that, for small enough \(\eta _{\alpha ,m}\), we have
$$\begin{aligned} \int \limits _{2\pi -\eta _{\alpha ,m}}^{2\pi } \frac{\sin {t}}{t}\text {d}t \geqslant -\frac{\eta _{\alpha ,m}^2}{10}. \end{aligned}$$
Obviously, this holds for \(\eta _{\alpha ,m} = 0\). Let Si(x) denote the anti-derivative of \(\frac{\sin {x}}{x}\). Differentiating the left-hand side with respect to \(\eta _{\alpha ,m}\), we obtain
$$\begin{aligned}&\frac{\text {d}\left( Si(2\pi )-Si(2\pi -\eta _{\alpha ,m})\right) }{\text {d}\eta _{\alpha ,m}} = -\,\frac{\text {d}\left( Si(2\pi -\eta _{\alpha ,m})\right) }{\text {d}\eta _{\alpha ,m}} \\&\quad = \frac{\sin \left( 2\pi - \eta _{\alpha ,m}\right) }{2\pi -\eta _{\alpha ,m}} = -\,\frac{\sin (\eta _{\alpha ,m})}{2\pi - \eta _{\alpha ,m}} \geqslant -\frac{\eta _{\alpha ,m}}{2\pi - \eta _{\alpha ,m}}. \end{aligned}$$
Differentiating the right-hand side yields \(-0.2\eta _{\alpha ,m}\). Since both sides are equal at \(\eta _{\alpha ,m} = 0\), it suffices to check when the derivative of the left-hand side is at least the derivative of the right-hand side
$$\begin{aligned} -\frac{\eta _{\alpha ,m}}{2\pi - \eta _{\alpha ,m}} \geqslant -0.2\eta _{\alpha ,m} \iff \eta _{\alpha ,m} \leqslant 2\pi -5 \end{aligned}$$
So for \(\eta _{\alpha ,m} \leqslant \left( 2\pi -5\right) \) we have
$$\begin{aligned} \int \limits _{2\pi -\eta _{\alpha ,m}}^{2\pi } \frac{\sin {t}}{t}\text {d}t \geqslant -\frac{\eta _{\alpha ,m}^2}{10} \end{aligned}$$
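The inequality above can also be checked numerically on the whole admissible range \(\eta _{\alpha ,m} \in (0, 2\pi -5]\). The following sketch uses a plain composite Simpson rule; it is only a numerical sanity check under that discretization, not a replacement for the argument, and the helper names are ours.

```python
import math

def simpson(f, a, b, steps=2000):
    """Composite Simpson rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def tail_integral(eta):
    """Integral of sin(t)/t over [2*pi - eta, 2*pi]."""
    return simpson(lambda t: math.sin(t) / t, 2 * math.pi - eta, 2 * math.pi)

eta_max = 2 * math.pi - 5
for j in range(1, 11):
    eta = eta_max * j / 10
    # Left-hand side, right-hand side, and whether the bound holds.
    print(round(eta, 4), tail_integral(eta), -eta ** 2 / 10,
          tail_integral(eta) >= -eta ** 2 / 10)
```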
Now we pick \(\eta _{\alpha ,m}\) so that
$$\begin{aligned} -0.1\eta _{\alpha ,m}^2 \cdot \frac{4\alpha m}{\pi (\alpha -1)^2} = -\,0.1. \end{aligned}$$
That gives us
$$\begin{aligned} \eta _{\alpha ,m} = \sqrt{\frac{\pi (\alpha -1)^2}{4\alpha m}}. \end{aligned}$$
Plugging it all into our formula for the expected magnitude of noise yields
$$\begin{aligned} E|Z| \geqslant \int \limits _{0}^{\eta _{\alpha ,m}} \frac{4\cdot \alpha \cdot m \cdot \cos {t} \cdot \left( \alpha -1\right) ^{2m}}{\pi \cdot \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}\text {d}t - 0.1. \end{aligned}$$
We are now interested in the lower bound for this integral. One can see that
$$\begin{aligned}&\displaystyle \int \limits _{0}^{\eta _{\alpha ,m}} \frac{4\cdot \alpha \cdot m \cdot \cos {t} \cdot \left( \alpha -1\right) ^{2m}}{\pi \cdot \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}\text {d}t \\&\displaystyle \qquad \geqslant \int \limits _{0}^{\eta _{\alpha ,m}} \frac{4\cdot \alpha \cdot m \cdot \cos ({\eta _{\alpha ,m}}) \cdot \left( \alpha -1\right) ^{2m}}{\pi \cdot \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}\text {d}t. \end{aligned}$$
This inequality follows from plugging in the smallest possible value of the cosine in the numerator on this interval. Furthermore, we have
$$\begin{aligned}&\int \limits _{0}^{\eta _{\alpha ,m}} \frac{4\alpha m \cdot \cos ({\eta _{\alpha ,m}}) \cdot \left( \alpha -1\right) ^{2m}}{\pi \left( \alpha ^2 - 2\alpha \cos {t} + 1\right) ^{m+1}}\text {d}t \\&\quad \geqslant \int \limits _{0}^{\eta _{\alpha ,m}} \frac{4\alpha m \cdot \left( 1-\frac{\eta _{\alpha ,m}^2}{2}\right) \cdot \left( \alpha -1\right) ^{2m}}{\pi \left( \alpha ^2 - 2\alpha \cdot \left( 1-\frac{t^2}{2}\right) + 1\right) ^{m+1}}\text {d}t. \end{aligned}$$
This bound comes from the fact that \(\cos {t} \geqslant \left( 1-\frac{t^2}{2}\right) \). Let us call the integrand function g(t). This function has the following anti-derivative G(t):
$$\begin{aligned} G(t) = \frac{4(\alpha -1)^{2m-2} \alpha m t \left( 1+\frac{\alpha t^2}{(\alpha -1)^2}\right) ^m \left( 1-\frac{\eta _{\alpha ,m}^2}{2}\right) {}_2F_1\left( \frac{1}{2},1+m,\frac{3}{2},-\frac{\alpha \cdot t^2}{(\alpha -1)^2}\right) }{\left( \alpha ^2 +\alpha (t^2 - 2) + 1\right) ^{m}\cdot \pi }, \end{aligned}$$
where \({}_2F_1(a,b,c,z)\) denotes the ordinary hypergeometric function (see [46]). One can easily see that \(G(0) = 0\). That leaves us with
$$\begin{aligned} E|Z| \geqslant G(\eta _{\alpha ,m}) - 0.1. \end{aligned}$$
Function \(G(\eta _{\alpha ,m})\) is quite complicated, but we can greatly simplify it. Let us begin with taking some of the \(G(\eta _{\alpha ,m})\) factors
$$\begin{aligned}&\frac{(\alpha -1)^{2m-2}\cdot \left( 1+\frac{\alpha \cdot \eta _{\alpha ,m}^2}{(\alpha -1)^2}\right) ^m}{\left( \alpha ^2 +\alpha \cdot (\eta _{\alpha ,m}^2 - 2) + 1\right) ^{m}} \\&\quad = \frac{(\alpha -1)^{-2}\cdot \left( 1+\frac{\alpha \cdot \eta _{\alpha ,m}^2}{(\alpha -1)^2}\right) ^m}{\left( \frac{\alpha ^2}{(\alpha -1)^{2}} +\frac{\alpha }{(\alpha -1)^{2}} \cdot (\eta _{\alpha ,m}^2 - 2) + \frac{1}{(\alpha -1)^{2}}\right) ^{m}} = \\&\quad = \frac{\left( \alpha -1\right) ^{-2} \cdot \left( 1+\frac{\alpha \cdot \eta _{\alpha ,m}^2}{(\alpha -1)^2}\right) ^m}{\left( 1+\frac{\alpha \cdot \eta _{\alpha ,m}^2}{(\alpha -1)^2}\right) ^m} = \left( \alpha -1\right) ^{-2}. \end{aligned}$$
Furthermore, we can expand \({}_2F_1(a,b,c,z)\) into a Taylor series around 0 in the following way:
$$\begin{aligned}&{}_2F_1\left( \frac{1}{2},1+m,\frac{3}{2},-\frac{\alpha \cdot t^2}{(\alpha -1)^2}\right) = 1-\frac{\alpha (m+1)t^2}{3(\alpha -1)^2} + O(t^4) \\&\qquad \geqslant 1-\frac{\alpha \cdot (m+1) \cdot \eta _{\alpha ,m}^2}{3\cdot \left( \alpha -1\right) ^2}. \end{aligned}$$
Using these two observations, we obtain
$$\begin{aligned}&G(\eta _{\alpha ,m})\\&\quad \geqslant \frac{4(\alpha -1)^{-2}\cdot \alpha \cdot m \cdot \eta _{\alpha ,m} \cdot \left( 1-\frac{\eta _{\alpha ,m}^2}{2}\right) \cdot \left( 1-\frac{\alpha \cdot (m+1) \cdot \eta _{\alpha ,m}^2}{3\cdot \left( \alpha -1\right) ^2}\right) }{\pi } \end{aligned}$$
We can further simplify this by recalling that \(\alpha = e^{\frac{\epsilon }{\log _2(n)+1}}\) and \(m = \gamma n\) and observing that \(\left( 1-\frac{\eta _{\alpha ,m}^2}{2}\right) \cdot \left( 1-\frac{\alpha \cdot (m+1) \cdot \eta _{\alpha ,m}^2}{3\cdot \left( \alpha -1\right) ^2}\right) \) is increasing with n. Let us call this value \(c_n^*\). We can fix this for the smallest n that we want to consider. See that, for example, for \(n \geqslant 2^7\) we have \(c_n^* \geqslant 1.43\). This leaves us with
$$\begin{aligned} G(\eta _{\alpha ,m})&\geqslant \frac{4 c_n^* \cdot (\alpha -1)^{-2}\cdot \alpha \cdot m \cdot \eta _{\alpha ,m}}{\pi } = \\&= \frac{4 c_n^* \cdot (\alpha -1)^{-2} \cdot \alpha \cdot m \cdot \sqrt{\frac{\pi (\alpha -1)^2}{4\alpha m}}}{\pi } = \\&= \frac{2 c_n^* \cdot \sqrt{\alpha \cdot m}}{\sqrt{\pi } \cdot (\alpha -1)} \geqslant \frac{2 c_n^* \cdot \sqrt{m}}{\sqrt{\pi } \cdot (\alpha -1)} \\&= \frac{2c_n^* \cdot \sqrt{\gamma n}}{\sqrt{\pi } \cdot (e^{\frac{\epsilon }{\log _2(n)+1}}-1)} \geqslant \\&\geqslant \frac{2c_n^* \cdot \sqrt{\gamma n}}{\sqrt{\pi } \cdot (e^{\frac{\epsilon }{\log _2(n)}}-1)} \geqslant \frac{\xi \log _2(n) \cdot 2c_n^* \cdot \sqrt{\gamma n}}{\epsilon \sqrt{\pi }}, \end{aligned}$$
where \(\xi \) is such that \(e^{\xi \cdot x} \leqslant (1+x)\) for \(x = (\frac{1}{2\log _2(n)})\). For example, in case we have \(\epsilon = 0.5\) and \(n \geqslant 2^7\) it suffices to take \(\xi = 0.96\). In the end we have
$$\begin{aligned} G(\eta _{\alpha ,m}) \geqslant c_{n,\epsilon } \cdot \sqrt{\gamma } \cdot \frac{\log _2(n) \cdot \sqrt{n}}{\epsilon \sqrt{\pi }}, \end{aligned}$$
where \(c_{n,\epsilon } = 2\xi c_n^* \), which is, for moderate n and \(\epsilon \), greater than 1.4. In fact, for \(\epsilon = 0.5\) and \(n \geqslant 2^7\) it is greater than 2. Since \(E|Z| \geqslant G(\eta _{\alpha ,m}) - 0.1\), we finally obtain
$$\begin{aligned} E|Z| \geqslant c_{n,\epsilon } \cdot \sqrt{\gamma } \cdot \frac{\log _2(n) \cdot \sqrt{n}}{\epsilon \sqrt{\pi }} - 0.1, \end{aligned}$$
which completes the proof of this fact. \(\square \)

Using Fact 3, we can obtain the following.

Example 2

Consider Binary Protocol for \(\delta = 0.05\), \(\epsilon = 0.5\), \(n \leqslant 2^{10}\) and \(\kappa = \log _2(n)\). Let |Z| be the absolute value of all noises aggregated during this protocol. We have \(E|Z| \geqslant 0.15\cdot n\). Moreover, if we take \(\kappa = \frac{n}{2^6}\) and \(2^6 \leqslant n \leqslant 2^{12}\) we have \(E|Z| \geqslant 0.12 \cdot n\).

This follows immediately from Fact 3: \(\frac{E|Z|}{n}\) is a decreasing function of n, so it is enough to plug \(n = 2^{10}\) into the lower bound for E|Z| for the first part of the claim and \(n=2^{12}\) for the second part.
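The two constants in Example 2 can be reproduced by chaining the bound of Corollary 1 (to obtain \(\gamma \), the fraction of noisy nodes) with the bound of Fact 3. The sketch below is a hedged illustration: it takes \(c_{n,\epsilon } = 1.4\), the value quoted in Fact 3 for moderate n and \(\epsilon \), and the function names are ours.

```python
import math

def noisy_fraction_lower_bound(n, kappa, delta=0.05):
    """Lower bound on EY/n from Corollary 1 (stated for n >= 2**4)."""
    beta_star = math.log((math.log2(n) + 1) / delta)
    e8 = math.exp(-8 * kappa / n)
    e16 = math.exp(-16 * kappa / n)
    return (n - kappa - n * (e8 + beta_star / 8 * (e16 - e8))) / n

def abs_noise_lower_bound(n, epsilon, gamma, c=1.4):
    """Lower bound on E|Z| from Fact 3, with c playing the role of c_{n,epsilon}."""
    return (c * math.sqrt(gamma) * math.log2(n) * math.sqrt(n)
            / (epsilon * math.sqrt(math.pi)) - 0.1)

def relative_error_bound(n, kappa, epsilon=0.5, delta=0.05):
    """Lower bound on E|Z| / n obtained by combining both results."""
    gamma = noisy_fraction_lower_bound(n, kappa, delta)
    return abs_noise_lower_bound(n, epsilon, gamma) / n

print(relative_error_bound(2 ** 10, math.log2(2 ** 10)))   # first part of Example 2
print(relative_error_bound(2 ** 12, 2 ** 12 / 2 ** 6))     # second part of Example 2
```

Both printed values can be compared against the constants 0.15 and 0.12 stated in Example 2.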

This clearly shows that even if we consider the lower bound for the number of noises and their magnitude, the Binary Protocol is far from perfect for many realistic scenarios, i.e., when the number of participants is moderate. Even worse conclusions will be drawn in Sect. 4.2, where we use the exact formulas given in Theorems 1 and 2 to numerically analyze the errors generated in this protocol.

4.2 Numerical approach

In Sect. 4.1 we gave both exact formulas and lower bounds for the number of noises generated and their sum. Note that the lower bounds are not very tight for many n. In this subsection, we show that the errors generated are, in fact, even larger. We use the exact formulas to calculate the errors numerically. First let us consider the case where \(n \leqslant 2^{10}\), \(\kappa = \left\lfloor {\log _2(n)}\right\rfloor \), and the privacy parameters are \(\epsilon = 0.5\), \(\delta = 0.05\). See Fig. 3. It clearly shows that the error magnitude in Binary Protocol is, in fact, significantly greater than the lower bound given in Corollary 2. Now let \(2^6 \leqslant n \leqslant 2^{12}\), \(\kappa = \frac{n}{2^6}\), with the privacy parameters unchanged. See Fig. 4. Again we can see that the error magnitude is unacceptably high, greater than 0.2n. Recall our reasoning about ignoring all the data and deciding the value for each user via a coin toss. This is obviously a preposterous idea, yet in our case it yields an expected error of 0.5n, i.e., the same order of magnitude as the Binary Protocol. Moreover, the noise is independent of the data, so such an error could be very problematic, especially if the sum of the real data is small (e.g., o(n)). In such a case the noise could be greater than the data itself. We can also see how large the errors are for a constant value of \(\kappa = 5\). See Fig. 5.
Fig. 3

Error magnitude in Binary Protocol with \(\epsilon = 0.5\), \(\delta = 0.05\) and \(\kappa = \left\lfloor {\log _2(n)}\right\rfloor \)

Fig. 4

Error magnitude in Binary Protocol with \(\epsilon = 0.5\), \(\delta = 0.05\) and \(\kappa = \left\lfloor {\frac{n}{2^6}}\right\rfloor \)

Fig. 5

Error magnitude in Binary Protocol with \(\epsilon = 0.5\), \(\delta = 0.05\) and \(\kappa = 5\)

5 Precise aggregation algorithm with local communication

In this part, we present an alternative protocol PAALC (Precise Aggregation Algorithm with Local Communication) that in some scenarios offers much better accuracy of aggregated data when failures occur, while preserving a high level of users’ privacy protection. In fact, our protocol works in a substantially different way and in a slightly modified model. Thus, although its performance and accuracy exceed those of the original protocol of Chan et al., the two are not fully comparable.
Fig. 6

Example of a clusterized network with global aggregator (\(\mathrm {\mathbf {AGG}}\)) and local aggregators (\(\mathrm {\mathbf {Agg}}\)) marked

First of all, we assume that users may communicate (also in order to bypass the lower bound pointed out in [7]). Let us stress that the communication is limited to a small circle of “neighbors”. The idea behind the presented construction is to take advantage of some natural structures emerging in distributed systems (e.g., social networks) wherein, apart from logical connections between each user and a server/aggregator, there are also some direct links between individual users. Clearly, such a model is not adequate for some real-life problems discussed in [8], for example in sensor fields with unidirectional communication. Thus there are applications where the original approach without any local communication is the only one possible.

5.1 Modified model

We assume that the network consists of n users \(V=\{v_1,v_2,\ldots , v_n\}\), the aggregator \(\mathrm {\mathbf {AGG}}\), and a set of \(k<n\) local aggregators \(\mathrm {\mathbf {Agg}}_1,\dots ,\, \mathrm {\mathbf {Agg}}_k\). Please note that the local aggregators may be separate entities, but without any significant changes they may be selected from the set of regular users V. The only issue with this approach is that we have to ensure that the local aggregator is either selected during the aggregation round or cannot fail during a single execution of the aggregation process. We assume that each user is assigned to exactly one local aggregator. We denote the set of nodes assigned to the local aggregator \(\mathrm {\mathbf {Agg}}_i\) by \(V_i\). An example of the network’s topology is depicted in Fig. 6.

We can derive a graph \(G=(V,\,E)\) from the network structure, where V is the set of all nodes and the set of edges is created based on the ability to establish communication (e.g., transmission range in a sensor network, friendship relation in a social network). Namely, the edge \(\left\{ v,\,v'\right\} \in E\) if and only if v and \(v'\) are neighbors and can communicate via a private channel. In our protocol we assume that each node can perform some basic cryptographic operations and has access to a source of randomness. By N(v) we denote the set of vertices \(v'\) of G such that \(\left\{ v,\,v'\right\} \in E\). Security of the protocol described in Sect. 5.3 depends on the structure of graph G and on how many parties the adversary can corrupt. Discussion on security of the protocol is given in Sect. 5.5.

Adversary The adversary may corrupt and therefore control a subset of users, local aggregators and the aggregator. He can read all messages that the controlled parties send or receive. The goal of the adversary in this model is to obtain the sum of aggregated data of any subset of uncorrupted users with worse privacy parameters than those guaranteed. If the adversary cannot obtain such information, we consider the protocol to preserve differential privacy with appropriate parameters.

5.2 Building blocks

Similarly to previous papers, to obtain a high level of data privacy we combine cryptographic techniques with data perturbation methods typical for research concentrated on differential privacy of databases.

The first technique we use in our protocol is a homomorphic encryption scheme based on the original ElGamal construction enriched by some extra techniques introduced in [16]. More precisely, encrypted messages can be “aggregated” and re-encrypted. Moreover, one can “add” an extra encryption layer to a given ciphertext, in such a way that the message can be decrypted only using both respective keys. Clearly this operation preserves the homomorphic property.

Let p denote a large prime number and let \(\mathbf {G}\) be a group of order p such that the decisional Diffie–Hellman problem is hard. Let g be a generator of \(\mathbf {G}\). Let \({\mathrm {sk}}, {\mathrm {sk}}'\) be private keys and let \(g^{{\mathrm {sk}}}, g^{{\mathrm {sk}}'}\) be the respective public keys.
  • Encryption of ’1’ A pair \(\mathrm {Enc}_{{\mathrm {sk}}}\left( 1\right) =(g^r,\,g^{r\cdot {\mathrm {sk}}})\) for a random \(r\in \mathbf {Z_p}\) is an encryption of 1 using secret key \({\mathrm {sk}}\).

  • Re-encryption A ciphertext representing 1 can be re-encrypted (based on re-randomization). Namely, one can obtain another ciphertext representing 1 without knowing the private key: having \(\mathrm {Enc}_{{\mathrm {sk}}}\left( 1\right) =(g^r,\,g^{r\cdot {\mathrm {sk}}})\) one can choose \(r'\) and compute \(\text{ Re }(\mathrm {Enc}_{{\mathrm {sk}}}\left( 1\right) )=(g^{r\cdot r'},\,g^{{r\cdot r'}\cdot {\mathrm {sk}}})\), which represents 1 as well.

  • Adding layer of encryption Having a ciphertext \(\mathrm {Enc}_{{\mathrm {sk}}}\left( 1\right) =(g^r,\,g^{r\cdot {\mathrm {sk}}})\), a party having private key \({\mathrm {sk}}'\) can “add an encryption layer” to the ciphertext, obtaining
    $$\begin{aligned}&\mathrm {Enc}_{{\mathrm {sk}}+{\mathrm {sk}}' }\left( 1\right) =((g^r)^{r'},\,(g^{r\cdot {\mathrm {sk}}})^{r'}\cdot (g^r)^{r'{\mathrm {sk}}'})\\&=(g^{r\cdot r'},\,g^{r\cdot r'\cdot ({\mathrm {sk}}+{\mathrm {sk}}')}). \end{aligned}$$
  • Filling the ciphertext Having \(\mathrm {Enc}_{{\mathrm {sk}}}\left( 1\right) =(g^r,\,g^{r\cdot {\mathrm {sk}}})\), for any message \(m \in \mathbf {G}\) one can compute
    $$\begin{aligned} \mathrm {Enc}_{{\mathrm {sk}}}\left( m\right) =(g^r,\,g^{r\cdot {\mathrm {sk}}}\cdot m). \end{aligned}$$
  • Partial decryption Having \(\mathrm {Enc}_{{\mathrm {sk}}+{\mathrm {sk}}'}\left( m\right) =(g^{r\cdot r'},\,g^{r\cdot r'\cdot ({\mathrm {sk}}+{\mathrm {sk}}')}m)\) and a private key \({\mathrm {sk}}'\), for \(m \in \mathbf {G}\) one can “remove one layer of encryption” and obtain
    $$\begin{aligned} \mathrm {Enc}_{{\mathrm {sk}}}\left( m\right) =\left( g^{r\cdot r'},\,\frac{g^{r\cdot r'\cdot ({\mathrm {sk}}+{\mathrm {sk}}')}m}{(g^{r\cdot r'})^{{\mathrm {sk}}'}}\right) =(g^{r\cdot r'},\,g^{r\cdot r'\cdot {\mathrm {sk}}}m). \end{aligned}$$
For the sake of clarity we skip some technical details (e.g., choice of the group size, generators) as well as a full security discussion of this encryption scheme. Note that these are quite standard techniques used in many papers including [16, 17].
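The operations listed above can be sketched in a few lines. The following is a toy illustration only: the modulus `P = 2039`, the subgroup order `Q = 1019` and the generator `G = 4` are our own, deliberately tiny parameters, the function names are hypothetical, and `decrypt` is a helper we add to close the loop; none of this is the authors' implementation.

```python
import random

P = 2039   # toy safe prime, P = 2 * Q + 1 (far too small for real use)
Q = 1019   # prime order of the subgroup generated by G
G = 4      # generator of the order-Q subgroup of squares modulo P

def enc_one(sk, rng):
    """Encryption of 1: (g^r, g^(r*sk))."""
    r = rng.randrange(1, Q)
    return pow(G, r, P), pow(G, r * sk, P)

def reencrypt(ct, rng):
    """Re-randomize a ciphertext of 1 without any private key."""
    r2 = rng.randrange(1, Q)
    return pow(ct[0], r2, P), pow(ct[1], r2, P)

def add_layer(ct, sk2, rng):
    """Turn Enc_sk(1) into Enc_{sk+sk2}(1)."""
    r2 = rng.randrange(1, Q)
    first = pow(ct[0], r2, P)
    second = pow(ct[1], r2, P) * pow(ct[0], r2 * sk2, P) % P
    return first, second

def fill(ct, message):
    """Multiply the second component of an encryption of 1 by the message."""
    return ct[0], ct[1] * message % P

def remove_layer(ct, sk2):
    """Partial decryption: strip the layer that belongs to sk2."""
    return ct[0], ct[1] * pow(pow(ct[0], sk2, P), -1, P) % P

def decrypt(ct, sk):
    """Helper (ours): final decryption of a single-layer ciphertext."""
    return ct[1] * pow(pow(ct[0], sk, P), -1, P) % P

rng = random.Random(0)
sk, sk_local = 123, 456
layered = add_layer(enc_one(sk, rng), sk_local, rng)    # Enc_{sk+sk'}(1)
message = pow(G, 7, P)                                  # g^7, i.e. the value 7
ciphertext = fill(reencrypt(layered, rng), message)     # Enc_{sk+sk'}(g^7)
print(decrypt(remove_layer(ciphertext, sk_local), sk) == message)
```

Modular inverses are computed with the three-argument `pow` and a negative exponent, which requires Python 3.8 or newer.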

Similarly to previous papers (including [8, 44]), we utilize the following method: If we know that each user \(v \in V\) has a value \(\xi _v\) from an interval of moderate size, \(\xi _v \in [0,\varDelta ]\), then the sum of all \(\xi _v\)’s cannot exceed \( n\varDelta \). Thus one can find the discrete logarithm of \(g^{\sum _{v\in V} \xi _v }\) even if finding the discrete logarithm of \(g^r\) is not feasible for a random exponent \(r\in \mathbf {Z_p}\). Using Pollard’s Rho method, this can be completed in average time \(O(\sqrt{n\varDelta })\).
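To illustrate why a bounded exponent is easy to recover, the sketch below uses baby-step giant-step in place of Pollard's method mentioned above; it is a different algorithm with the same square-root running time, at the cost of storing about \(\sqrt{n\varDelta }\) group elements. The toy group parameters and the function name are assumptions made for the example.

```python
import math

P = 2039   # toy safe prime (illustrative only)
G = 4      # generator of the subgroup of prime order 1019

def bounded_dlog(y, bound):
    """Return s in [0, bound] with G**s == y (mod P), or None if there is none.

    Baby-step giant-step: write s = i * step + j with 0 <= j < step.
    """
    step = math.isqrt(bound) + 1
    baby = {}
    current = 1
    for j in range(step):
        baby.setdefault(current, j)       # remember the exponent j of G**j
        current = current * G % P
    giant = pow(G, -step, P)              # G**(-step) modulo P
    current = y
    for i in range(step + 1):
        if current in baby:
            return i * step + baby[current]
        current = current * giant % P
    return None

n_users, delta_max = 50, 6                # the sum is at most n * Delta = 300
total = 137
print(bounded_dlog(pow(G, total, P), n_users * delta_max))
```

The bound \(n\varDelta \) has to stay below the group order, otherwise the recovered exponent would be ambiguous.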

5.3 Protocol description

We start this subsection by recalling the decisional Diffie–Hellman assumption used in further presentation.

Definition 4

Consider a cyclic group \(\mathbf {G}\) of order q. The decisional Diffie–Hellman assumption states that, given \((g, g^a, g^b, g^c)\) for a randomly chosen generator \(g \in \mathbf {G}\) and random \(a,b,c \in \{0, \ldots , q-1\}\), the values \(g^{ab}\) and \(g^c\) are computationally indistinguishable for the adversary. We say that the decisional Diffie–Hellman problem is hard in group \(\mathbf {G}\) if the group satisfies the decisional Diffie–Hellman assumption.

During the protocol, we assume that the aggregator \(\mathrm {\mathbf {AGG}}\) has a private key \({\mathrm {sk}}\); moreover, each of the local aggregators \(\mathrm {\mathbf {Agg}}_i\) has its own private key \({\mathrm {sk}}_i\). We also assume that there is a public parameter g that is a generator of some finite group \(\mathbf {G}\), in which the decisional Diffie–Hellman problem is hard. By \(\mathrm {Enc}_{{\mathrm {sk}}}\left( c\right) \) we denote the encryption structure introduced in Sect. 5.2. Let us assume that each user v has a private value \(\xi _v\) from the range \([0,\,\varDelta ]\). We also assume that there are private channels between some of the users (underlying communication graph). We do not get into the details of implementing such secure channels between some of the users; various techniques are described in the asymmetric cryptography literature. The final aim is to provide \(\mathrm {\mathbf {AGG}}\) with the sum \(\sum _{v\in V}\xi _v\) perturbed in such a way that the privacy (expressed in terms of differential privacy) of all \(v\in V\) is preserved. Clearly, the privacy of users can be endangered both by revealing the output and by collecting information about the aggregation process.
  • Setup
    • \(\mathrm {\mathbf {AGG}}\) broadcasts to the local aggregators \(\mathrm {Enc}_{{\mathrm {sk}}}\left( 1\right) \).

    • Each of the local aggregators \(\mathrm {\mathbf {Agg}}_i\) constructs \(\mathrm {Enc}_{{\mathrm {sk}}+{\mathrm {sk}}_i}\left( 1\right) \) and publishes it for all users from \(V_i\).

    The setup phase is performed only once during network’s lifetime. Moreover if needed, each \(\mathrm {\mathbf {Agg}}_i\) may provide a non-interactive proof that the operations were performed correctly and honestly [4, 15].
  • Aggregation

  • Algorithm for node v
    • For each node \(v'\in N(v)\) generate a random value \(x^{v}_{v'}\in \mathbf {G}\).

    • Using a private channel send each value \(x^{v}_{v'}\) to the appropriate neighbor \(v'\).

    • Having received all \(x^{v'}_{v}\) from each of the neighbors, select random \(r_v\) from \(Geom^{\beta }(\alpha )\) and calculate
      $$\begin{aligned} c_v=\sum _{v'\in N(v)}{x^{v'}_{v}}-\sum _{v'\in N(v)}{x^{v}_{v'}} +r_v+\xi _v. \end{aligned}$$
    • Compute \(\text{ Re }(\mathrm {Enc}_{{\mathrm {sk}}+{\mathrm {sk}}_i}\left( g^{c_v}\right) )\) and send it to \(\mathrm {\mathbf {Agg}}_i\).

  • Algorithm for local aggregator \(\mathrm {\mathbf {Agg}}_i\)
    • Having received \(\mathrm {Enc}_{{\mathrm {sk}}+{\mathrm {sk}}_i}\left( g^{c_v}\right) \) from all nodes from \(V_i\), compute
      $$\begin{aligned} \mathrm {Enc}_{{\mathrm {sk}}}\left( g^{c_v}\right) =\left( g^{r_i},\,\frac{g^{r_i({\mathrm {sk}}+{\mathrm {sk}}_i)+c_v}}{g^{r_i\cdot {\mathrm {sk}}_i}}\right) . \end{aligned}$$
      These operations yield the shares
      $$\begin{aligned} \mathrm {Enc}_{{\mathrm {sk}}}\left( g^{c_{v_1}}\right)= & {} (g^{r_{v_1}},g^{r_{v_1}\cdot {\mathrm {sk}}+c_{v_1}}),\, \dots ,\,\mathrm {Enc}_{{\mathrm {sk}}}\left( g^{c_{v_l}}\right) \\&\quad =(g^{r_{v_l}},g^{{r_{v_l}}\cdot {\mathrm {sk}}+c_{v_l}}) \end{aligned}$$
      of all \(l=|V_i|\) users from \(V_i\).
    • Compute
      $$\begin{aligned}&\mathrm {Enc}_{{\mathrm {sk}}}\left( g^{c_{v_1}+\dots +c_{v_l}}\right) =\left( \prod _{i=1}^l g^{r_i},\,\prod _{i=1}^l{g^{r_{i}{\mathrm {sk}}+c_{v_i}}} \right) \\&\quad =\left( g^{\sum _{i=1}^l r_i},\,g^{(\sum _{i=1}^l r_i){\mathrm {sk}}+\sum _{i=1}^l{c_{v_i}}} \right) ~. \end{aligned}$$
    • Send the value \(\mathrm {Enc}_{{\mathrm {sk}}}\left( g^{c_{v_1}+\dots +c_{v_l}}\right) \) to the aggregator \(\mathrm {\mathbf {AGG}}\).

  • Final aggregation
    • Having received the aggregated values from each \(V_i\), \(\mathrm {\mathbf {AGG}}\) uses its private key \({\mathrm {sk}}\) to calculate \(y_i= g^{\sum _{v\in V_i}{c_{v}}}\) for each \(i=1,\ldots , k\). Then it computes
      $$\begin{aligned} y = \prod _{i=1}^{k} y_i = \prod _{i=1}^{k} g^{\sum _{v\in V_i}{c_{v}}}=g^{\sum _{v\in V}{c_{v}}}. \end{aligned}$$
    • Finally, \(\mathrm {\mathbf {AGG}}\) computes the discrete logarithm of y, which is the final (perturbed) value of the sum \(\sum _{v\in V}\xi _{v}\).
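The encryption-related steps of the listing above can be sketched as follows. This is only an illustration under stated assumptions: the toy group (`P`, `Q`, `G`), the key values and all function names are hypothetical, and for brevity a user encrypts directly under the combined key \({\mathrm {sk}}+{\mathrm {sk}}_i\) instead of deriving the ciphertext from the published encryption of 1 by universal re-encryption. Ciphertexts have the pair form \((g^{r}, g^{r\cdot \mathrm {key}+c})\) used in the formulas above.

```python
import random

# Toy parameters (assumptions for illustration only): P = 2Q + 1 with P and Q
# prime, and G = 4 generates the subgroup of prime order Q. A real deployment
# needs a group in which the decisional Diffie-Hellman problem is hard.
P, Q, G = 2039, 1019, 4


def inv(x):
    """Modular inverse in Z_P^* via Fermat's little theorem."""
    return pow(x, P - 2, P)


def enc(key, c, rng):
    """Return the pair (g^r, g^(r*key + c)) for a fresh random r."""
    r = rng.randrange(1, Q)
    return pow(G, r, P), pow(G, (r * key + c) % Q, P)


def strip_local_key(ct, sk_i):
    """Local aggregator: divide the second component by (g^r)^sk_i."""
    a, b = ct
    return a, b * inv(pow(a, sk_i, P)) % P


def combine(cts):
    """Multiply ciphertexts componentwise, which adds the exponents c_v."""
    a, b = 1, 1
    for x, y in cts:
        a, b = a * x % P, b * y % P
    return a, b


def decrypt(ct, sk):
    """Aggregator: recover y_i = g^(sum of c_v) from a ciphertext under sk."""
    a, b = ct
    return b * inv(pow(a, sk, P)) % P


def discrete_log(y, bound):
    """Brute-force search for s in [0, bound] with g^s = y."""
    for s in range(bound + 1):
        if pow(G, s, P) == y:
            return s
    raise ValueError("value exceeds the search bound")


def aggregate(groups, sk, local_keys, bound, rng):
    """groups[i] lists the blinded values c_v of the users assigned to Agg_i."""
    y = 1
    for c_values, sk_i in zip(groups, local_keys):
        shares = [strip_local_key(enc(sk + sk_i, c, rng), sk_i) for c in c_values]
        y = y * decrypt(combine(shares), sk) % P
    return discrete_log(y, bound)


result = aggregate([[3, 5], [7, 2]], sk=123, local_keys=[45, 67],
                   bound=100, rng=random.Random(1))
```

In this sketch the search bound must stay below the group order `Q`, otherwise the discrete logarithm is no longer unique.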

An example of a node’s communication is shown in Fig. 7. Note that the protocol depends on two security parameters, \(\beta \) and \(\alpha \). They strongly depend on the topology of the underlying graph. We discuss this issue in the next subsection.
Fig. 7

An example of communication in a single aggregation round from a perspective of node v. The dotted line marks the set of nodes assigned to a single local aggregator \(\mathrm {\mathbf {Agg}}\). Note that neighbors may have different local aggregators

5.4 Adversarial model

In this subsection, we comment on each actor in the system and on their goals and abilities if corrupted by the adversary. Note that an adversary can corrupt any subset of actors.

Actors
  • User when corrupted can reveal both his data and the randomly generated value to the Adversary. Moreover, he can reveal all random values received from his neighbors. Tampering with these values, however, would make no sense because it would make the final aggregation impossible to decrypt, which is not the goal of the Adversary (recall that the Adversary aims at gathering knowledge about uncorrupted participants, not at breaking the protocol).

  • Local aggregator when corrupted can attempt decrypting received shares. His goal is to decrypt the sum of data of any subset of uncorrupted users (note that if he can decrypt the sum of any subset of users containing at least one uncorrupted user, he can calculate the sum over the uncorrupted users of this subset by subtracting the data revealed by corrupted users).

  • Aggregator when corrupted will try to recover the data of any subset of users. Recall that we allow him to recover only the sum of data of all uncorrupted users. Obviously, if we want to reveal the sum to the public, we cannot prevent the Adversary from subtracting the known sum leaked by corrupted users from the revealed sum of data of all users; moreover, the privacy parameters are calibrated in such a way that they protect the fully aggregated sum. If the corrupted aggregator can get the sum of any smaller subset of the honest users, privacy is breached.

5.5 Comparison and analysis

In this section we outline the analysis of the presented aggregation protocol with respect to correctness, level of privacy provided and error of the result obtained by the aggregator. The analysis is slightly more complicated since the parameters of the protocol strongly depend on the underlying network. We argue, however, that they offer very good properties for wide classes of networks.

Correctness First, let us look at the result obtained by the aggregator \(\mathrm {\mathbf {AGG}}\) in the last step of the protocol. This is a discrete logarithm of \(g^{\sum _{v\in V}{c_{v}}}\). Let us observe that
$$\begin{aligned} \sum _{v\in V}{c_v}&=\sum _{v\in V}{\left( \sum _{v'\in N(v)}{x^{v'}_{v}}-\sum _{v'\in N(v)}{x^{v}_{v'}} +r_v+\xi _v\right) }\\&=\sum _{v\in V}{\sum _{v'\in N(v)}{x^{v'}_{v}}}- \sum _{v\in V}{\sum _{v'\in N(v)}{x^{v}_{v'}}}\\&\quad +\,\sum _{v\in V}{\xi _v}+\sum _{v\in V}{r_v}=\sum _{v\in V}{\xi _v}+\sum _{v\in V}{r_v}. \end{aligned}$$
The result is thus the exact sum \(\sum _{v\in V}{\xi _v}\) of the values kept by the nodes plus the sum of all the noises \(\sum _{v\in V}{r_v}\). This leads to two conclusions. First, the result is correct. Second, retrieving the data using Pollard’s Rho method (or even a brute-force method) is feasible, since the absolute value of the first sum is at most \(n\varDelta \). One can easily see that the sum of added noises is of the magnitude O(n) with high probability (as a sum of independent geometric random variables).
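The cancellation of the pairwise masks can be checked numerically. The following minimal sketch assumes a small hypothetical graph, works with integers modulo an arbitrary illustrative modulus instead of group elements, and uses hand-picked noise values; the function name is not part of the protocol description.

```python
import random


def blinded_contributions(edges, xi, noise, modulus, rng):
    """Compute c_v for every node v of an undirected graph given as edge pairs."""
    sent = {}  # sent[(v, w)] is the mask x^v_w that v sends to its neighbor w
    for v, w in edges:
        sent[(v, w)] = rng.randrange(modulus)
        sent[(w, v)] = rng.randrange(modulus)
    c = {}
    for v in xi:
        received = sum(x for (_, dst), x in sent.items() if dst == v)
        given = sum(x for (src, _), x in sent.items() if src == v)
        c[v] = (received - given + noise[v] + xi[v]) % modulus
    return c


edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
xi = {"a": 1, "b": 0, "c": 1, "d": 1}        # private inputs
noise = {"a": 0, "b": 2, "c": 0, "d": -1}    # illustrative noise values r_v
c = blinded_contributions(edges, xi, noise, modulus=1019, rng=random.Random(7))
total = sum(c.values()) % 1019               # equals sum(xi) + sum(noise) = 4
```

Every mask is added once by its receiver and subtracted once by its sender, so the total depends only on the inputs and the noise, while each individual \(c_v\) looks uniformly random.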

Privacy protection We assume that the encryption scheme \(\mathrm {Enc}_{{\mathrm {sk}}}\left( \right) \) is semantically secure. In particular, after the re-encryption operation, one cannot retrieve any non-trivial information about the plaintext without the private key \({\mathrm {sk}}\), except possibly with some negligible probability \(\eta \left( \lambda \right) \) with respect to the key length \(\lambda \) or some other security parameters. In particular, in our protocol, the local aggregator \(\mathrm {\mathbf {Agg}}_i\) cannot learn the contributions sent to \(\mathrm {\mathbf {Agg}}_j\) for \(i\ne j\) without access to the keys \({\mathrm {sk}}_j\) and \({\mathrm {sk}}\).

Note that all neighboring users exchange purely random values \(x_{v}^{v'}\) that finally cancel out; however, as long as they remain unknown to the adversary, they perfectly obfuscate the results sent to the aggregator (in exactly the same manner as the one-time pad cipher). This can be easily adapted to our protocol to get the following fact.

Fact 4

Let us assume that the adversary can control any subset of aggregators and a subset of users \(V\setminus V^{H}\). Let \(\mathcal {S}\) be a connected component of a subgraph of \(\mathcal {G}=(V,E)\) induced by the subset \(V^{H}\). Then, the adversary can learn nothing but \(\sum _{v\in \mathcal {S}} (\xi _{v} + r_v)\) about the values \(\xi _{v}\)’s from the execution of PAALC for any \(v\in V^{H}\).

Theorem 3

Let us assume that PAALC with parameter \(\alpha = \exp (\frac{\epsilon }{\varDelta })\) is executed in the network represented by a graph \(\mathcal {G}=(V,E)\) and \(\mathcal {G}'\) is a subgraph of \(\mathcal {G}\) induced by the set of uncompromised users \(V^{H}\). Moreover, we assume that each user v contributes a value \(\xi _{v} \in [0,\varDelta ]\).

If in each connected component \(\mathcal {S}\) of \(\mathcal {G}'\), there is a user s, such that its added noise r is taken from \(Geom(\exp (\frac{\epsilon }{\varDelta }))\), then PAALC preserves computational \((\varepsilon ,\,0)\)-differential privacy.

Proof

Let \(\varXi = \sum _{s\in \mathcal {S}} \xi _s\) and let \(\varXi '\) be the same sum with a single value \(\xi _s\) changed. By the assumption about the range of the aggregated values we get \(|\varXi '- \varXi |\le \varDelta \). Let r be a random variable taken from the symmetric geometric distribution \(Geom(\exp (\frac{\epsilon }{\varDelta }))\). From Fact 1, we know that \(Pr[\varXi +r=k]\) may differ from \(Pr[\varXi '+r=k]\) by at most a multiplicative factor \(\exp (\epsilon )\). Moreover, from Fact 4, we know that the adversary may learn nothing more than the sum of all values from the component \(\mathcal {S}\). To complete the proof it is enough to recall that we assumed that the probability of gaining some other knowledge if weak parameters of the cipher are chosen is at most a negligible function \(\eta \left( \lambda \right) \).\(\square \)
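For the reader’s convenience, the multiplicative bound used in the proof can be written out explicitly. The derivation below assumes that the symmetric geometric distribution \(Geom(\alpha )\) has the probability mass function \(\Pr [r=k]=\frac{\alpha -1}{\alpha +1}\,\alpha ^{-|k|}\) for integer k (the form used in [8]); with \(\alpha =\exp (\frac{\epsilon }{\varDelta })\) and the triangle inequality we get
$$\begin{aligned} \frac{\Pr [\varXi +r=k]}{\Pr [\varXi '+r=k]} = \frac{\alpha ^{-|k-\varXi |}}{\alpha ^{-|k-\varXi '|}} = \alpha ^{|k-\varXi '|-|k-\varXi |} \le \alpha ^{|\varXi -\varXi '|} \le \alpha ^{\varDelta } = \exp (\epsilon ). \end{aligned}$$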

The following corollary follows from this theorem.

Corollary 2

If PAALC is executed on a graph such that the subgraph induced by the set of uncompromised users \(V^{H}\) is connected and, with probability at least \(1-\delta \), at least one uncompromised user adds its value r from \(Geom(\exp (\frac{\epsilon }{\varDelta }))\), then PAALC computationally preserves \((\varepsilon ,\,\delta )\)-differential privacy.

Translated into practical terms, Theorem 3 together with Corollary 2 says that if the connections between honest users are dense enough and we can somehow guarantee that at least one honest node adds the noise, the system is secure. The core of the problem is to judge whether real-world networks are dense enough and what noise parameters are sufficient. This problem is discussed in the next paragraph.
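The noise-adding step can be sketched as follows. The sketch assumes that \(Geom(\alpha )\) is the symmetric geometric distribution with \(\Pr [r=k]\) proportional to \(\alpha ^{-|k|}\), and that \(Geom^{\beta }(\alpha )\) means: with probability \(\beta \) draw from \(Geom(\alpha )\), otherwise add 0. The function names and the example value of \(\beta \) are hypothetical.

```python
import math
import random


def symmetric_geometric(alpha, rng):
    """Sample integer k with probability proportional to alpha^(-|k|), alpha > 1."""
    q = 1.0 / alpha

    def one_sided():
        # Inverse-transform sample with Pr[X >= k] = q^k for k = 0, 1, 2, ...
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
        return math.floor(math.log(u) / math.log(q))

    # The difference of two i.i.d. one-sided geometric variables is two-sided.
    return one_sided() - one_sided()


def diluted_noise(alpha, beta, rng):
    """With probability beta return symmetric geometric noise, otherwise 0."""
    return symmetric_geometric(alpha, rng) if rng.random() < beta else 0


rng = random.Random(0)
epsilon, Delta = 0.5, 1                 # one-bit inputs, so Delta = 1
alpha = math.exp(epsilon / Delta)
samples = [diluted_noise(alpha, 0.01, rng) for _ in range(10)]
```

Sampling the two-sided variable as a difference of two one-sided ones avoids any floating-point normalization of the probability mass function.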

Accuracy The level of accuracy and security in this protocol strongly depends on the graph topology and the chosen security parameters. We will consider a random graph, where each possible edge is independently added with probability p. Moreover, the adversary controls up to \(n-m\) randomly chosen users.

Theorem 4

Let us consider a random network with n nodes. Each of the possible \({n \atopwithdelims ()2}\) connections (edges) is independently added to the network with probability \(p \ge \frac{8\log n}{n}\). Let \(\mathcal {S}\) be a subgraph induced by a subset of at least \(m \ge n/2\) randomly chosen nodes. Then \(\mathcal {S}\) is connected with probability at least \(1 - 1/n\).

Proof

Let us note that \(\mathcal {S}\) is not connected if and only if there exists a subset of nodes from \(\mathcal {S}\) with cardinality \(1\le k \le m/2\) such that there is no connection to any of the remaining \(m-k\) nodes. For a given subset of \(\mathcal {S}\) of cardinality k, the probability that no edge connects it to the other \(m-k\) nodes of \(\mathcal {S}\) is \((1-p)^{k(m-k)}\).

Let \(A_k\) be the event that there exists such a “cutoff” subset of cardinality k. Clearly, using a union bound argument we get
$$\begin{aligned} \Pr [A_k]\le (1-p)^{k(m-k)} {m \atopwithdelims ()k}. \end{aligned}$$
The event that \(\mathcal {S}\) is not connected is the union \(A_1\cup \ldots \cup A_{m/2}\). Again, using the union bound, we get
$$\begin{aligned} \Pr [A_1\cup \ldots \cup A_{\frac{m}{2}}]&\le \sum \limits _{k=1}^{m/2} \Pr [A_k] \le \sum \limits _{k=1}^{m/2} (1-p)^{k(m-k)} {m \atopwithdelims ()k} \le \\&\le \sum \limits _{k=1}^{m/2} (1-p)^{k\frac{m}{2}} {m \atopwithdelims ()k} = (\star ). \end{aligned}$$
Since \({m \atopwithdelims ()k} \leqslant m^k\), we get
$$\begin{aligned}&(\star ) \le \sum \limits _{k=1}^{m/2} \left( (1-p)^{\frac{m}{2}} m \right) ^{k} \le \sum \limits _{k=1}^{\infty } \left( (1-p)^{\frac{m}{2}} m \right) ^{k} \\&\quad = \frac{(1-p)^{m/2}m}{1- (1-p)^{m/2}m} = (\star \star ). \end{aligned}$$
Since the function \(f(x)= \frac{a^x x}{ 1 - a^x x } \) is decreasing for \(x> -\frac{1}{\log (a)}\) (if \(0<a<1\)) and from the assumption that \(m \ge n/2\) we have
$$\begin{aligned} (\star \star ) \le \frac{(1-p)^{n/4}\frac{n}{2}}{1- (1-p)^{n/4}\frac{n}{2}}~. \end{aligned}$$
Applying inequality \(\exp (x) \ge 1+x\) and substituting \(p=\frac{8\log n}{n}\), we obtain
$$\begin{aligned} (\star \star ) \le \frac{\exp \left( -\frac{8\log (n)}{n}\cdot \frac{n}{4}\right) \frac{n}{2}}{1- 1/2} \le \exp \left( -\log (n^2)\right) n = \frac{1}{n}, \end{aligned}$$
which concludes the proof of this theorem. \(\square \)

Note that the presented model boils down to the classic Erdős–Rényi model [22].

From Theorem 4 we learn that a “typical” network of n nodes with random connections such that the average number of neighbors is \(8\log n = \varTheta (\log n)\) is dense enough even if the adversary is able to compromise as many as n/2 nodes.
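The statement of Theorem 4 can be explored empirically. The sketch below samples a random graph with \(p=\frac{8\log n}{n}\) (assuming the natural logarithm), picks m random nodes as the honest set and checks whether the induced subgraph is connected. The values n = 200, m = 100, the seed and the function names are arbitrary illustrative choices.

```python
import math
import random
from collections import deque


def is_connected(nodes, adj):
    """Breadth-first search restricted to the node set `nodes`."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w in nodes and w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(nodes)


def honest_subgraph_connected(n, m, rng):
    """Sample G(n, p) with p = 8 ln(n) / n and test a random m-node induced subgraph."""
    p = min(1.0, 8 * math.log(n) / n)
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for w in range(v + 1, n):
            if rng.random() < p:
                adj[v].add(w)
                adj[w].add(v)
    return is_connected(rng.sample(range(n), m), adj)


rng = random.Random(2024)
trials = 20
hits = sum(honest_subgraph_connected(200, 100, rng) for _ in range(trials))
print(f"connected in {hits} of {trials} trials")
```

Such a simulation does not replace the proof; it only indicates how the bound behaves for moderate n.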

If we are guaranteed at least n/2 honest (uncompromised and working) nodes, the probability that none of them adds the noise is at most \((1-\beta )^{n/2}\). To have \((1-\beta )^{n/2} \le \delta \) one needs to have \(\beta \) such that \(\log (1-\beta ) \le \frac{2\log \delta }{n}\). Since \(\log (1+x) \le x \) for \(x>-1\), it is enough to use \(\beta \ge \frac{2\log (1/\delta )}{n}\). Clearly the expected error cannot exceed \(2 \sqrt{\log (1/\delta )}\) for \(\beta = \frac{2\log (1/\delta )}{n}\). Using standard methods, one can also show that the expected error is concentrated.
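As a worked example of this calculation, the sketch below evaluates the bound for n = 4039 and \(\delta = 0.05\), the values used in the experiment of Sect. 6. The function names are hypothetical and the natural logarithm is assumed.

```python
import math


def min_beta(n, delta):
    """Value of beta taken from the bound beta >= 2 * ln(1 / delta) / n."""
    return 2 * math.log(1 / delta) / n


def prob_no_honest_noise(n, beta):
    """Bound (1 - beta)^(n / 2) on the event that no honest node adds noise."""
    return (1 - beta) ** (n / 2)


n, delta = 4039, 0.05
beta = min_beta(n, delta)
print(f"beta = {beta:.6f}, failure bound = {prob_no_honest_noise(n, beta):.4f}")
```

With these values each node adds noise only rarely, yet the probability that no honest node adds noise stays below \(\delta \).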

Remarks and Extensions We proved that the proposed protocol guarantees very good accuracy even when facing massive failures and compromising of nodes. Half of the nodes may fail or cooperate with the adversary (in fact, this result can be generalized to any constant fraction of users). The analysis and the model can be relaxed/extended in many directions. One can instantly observe that the analysis can be extended to smaller \(\delta \) at the price of a moderate increase of the expected noise. Note that the value of \(\delta \) set to the celebrated magic constant 0.05 seems to be definitely too big for practice. Indeed, this implies that one out of every 20 may lose its privacy.

We believe that this approach can be useful for other graphs, including those representing social networks. Note that if a graph guarantees a specific level of privacy, then a denser graph (with some added edges) offers at least the same level of privacy. Thus it is enough if each user adds something like \(\varTheta (\log n)\) “randomly” chosen neighbors to protect the privacy in any network.

Note that our protocol is not immune against adversarial nodes that send incoherent random data. To the best of our knowledge, all protocols of this type (including [8, 44]) are prone to so-called contaminating attacks. To mitigate this problem, as in other cases, one may apply the orthogonal methods presented in [6].

6 PAALC and binary protocol comparison on real data

In this section we experimentally compare the Binary Protocol from [8] and our PAALC described in Sect. 5. We conduct an experiment on real data from the Facebook social network collected in the SNAP dataset by Stanford University (see [31, 33]), where nodes denote users and edges denote the friend relation. We assume that each user holds one bit of information which we want to aggregate, i.e., a value \(x_i \in \{0,1\}\). For the number of node failures \(\kappa \in \{0,1,\ldots ,200\}\) we experimentally check the size of the error in our protocol with parameters \(\epsilon = 0.5\) and \(\delta = 0.05\). Then we compare it to the Binary Protocol with the same privacy parameters.

First, Fig. 8 shows the average fraction of nodes remaining in the giant component after \(\kappa \) node failures. One can easily see that the overwhelming majority of nodes remain connected in a single giant component. Our protocol will preserve the privacy of these nodes. We emphasize that the nodes remaining outside the giant component may be prone to privacy loss. This is the price we have to pay to significantly decrease the error.
Fig. 8

Average fraction of nodes remaining in the giant component after \(\kappa \) failures
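A measurement of this kind can be sketched as follows. The sketch assumes that failures hit uniformly random nodes and that the fraction is taken over the surviving nodes; the adjacency-dictionary input and the function names are hypothetical, and loading the actual SNAP data is left out.

```python
import random


def component_sizes(adj, alive):
    """Sizes of the connected components of the subgraph induced by `alive`."""
    alive = set(alive)
    seen, sizes = set(), []
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w in alive and w not in seen:
                    seen.add(w)
                    stack.append(w)
        sizes.append(size)
    return sizes


def giant_fraction(adj, kappa, rng):
    """Fail kappa random nodes (kappa < number of nodes); return the size of
    the largest remaining component divided by the number of surviving nodes."""
    failed = set(rng.sample(sorted(adj), kappa))
    alive = [v for v in adj if v not in failed]
    return max(component_sizes(adj, alive)) / len(alive)


# Toy graph: a path 0-1-2-3 plus an isolated node 4.
toy = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}, 4: set()}
fraction = giant_fraction(toy, kappa=0, rng=random.Random(5))
```

Averaging `giant_fraction` over many random failure sets for each \(\kappa \) would give a curve of the kind described in the caption.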

Observe that the Binary Protocol does not utilize the connections and communication between users in any way. Our protocol, on the other hand, depends on the structure of the underlying graph, which means that on a denser dataset (with a higher clustering coefficient) it will perform better than on a sparse graph. That is why it is more suitable for various dense datasets, such as social networks or cloud users. Fig. 9 shows the size of the error in both protocols.
Fig. 9

Blue line denotes the error in the Binary Protocol, red line denotes the error in PAALC

Observe that the error in PAALC is constant, while in the Binary Protocol the errors are much higher. Recall that the real sum of the data is at most 4039 (if all users hold 1). Thus, an error of magnitude \(10^3\) renders the aggregated data unsuitable for statistical inference. On the other hand, our protocol gives a constant error of size approximately 5, which makes the aggregated data not only private, but also useful for statistical analysis. Unfortunately, our vast error decrease comes at the price of not protecting the privacy of those nodes that do not belong to the giant component. This, obviously, is a significant drawback of our protocol. However, it can be avoided by, for example, enriching the graph with additional edges or performing an additional check whether a specific node belongs to the giant component (the outlying nodes would then always have to add noise), which we leave as future work.

We want to emphasize again that PAALC and the Binary Protocol are not fully comparable. The Binary Protocol gives the privacy guarantee to all users with no communication between them. However, it is not very robust to failures, despite the fact that it is designed precisely as a fault-tolerant protocol. Even if fewer than 5% of users are prone to failure, the error in the aggregated data is too big for the data to remain useful for statistical purposes. On the other hand, PAALC requires some communication (albeit very limited) between users and, maybe even more importantly, strongly depends on the connections between users (communication channels). The denser the network, the more secure PAALC is. Without the improvements mentioned in the previous paragraph, the privacy of users not belonging to the giant connected component in PAALC is prone to attacks.

To conclude, in a scenario where we have a quite dense network and we expect not just a constant number of failures but rather, say, a few percent of failures among users, the Binary Protocol, even though it preserves the privacy of users, returns the aggregated data with such a huge error that it might not be appropriate for any reasonable statistical analysis. For such scenarios, if we can pay the price of a few (say less than 0.1%) users possibly losing their privacy, PAALC gives us a constant-size error, significantly smaller than that of the Binary Protocol.

7 Previous and related work

Data aggregation in distributed networks has been thoroughly studied due to the practical importance of such protocols. Measuring the target environment, aggregating data and raising alarms are arguably the three most important functionalities of distributed sensing networks, and with the increased number of personal mobile devices, aggregation is becoming the one of greatest interest among the three. Exemplary protocols that address neither security nor privacy may be found in [20, 32], with the latter being often presented as a model aggregation algorithm.

There are several settings considering data aggregation. They differ both in the abilities and constraints of the nodes performing the aggregation and in the issues that the algorithm addresses. Some of the adversities that may be addressed include data confidentiality (i.e., protecting the data from disclosure), privacy of the nodes (inability to learn exact values of each node), node failure and spontaneous node joining the network as well as data poisoning (i.e., injecting malicious data by the adversary that allows them to significantly influence the outcome of the algorithm or to learn more information about the execution than they would gain when following the protocol honestly). Note that we are using the notion of differential privacy. A survey concerning this privacy definition is presented in [10]. A very well presented tutorial on differential privacy basics can be found in [12]. Our paper follows the model considered in [8], where the nodes have constrained abilities and their energy pool is limited. The authors present a privacy-preserving aggregation protocol that assumes a malicious aggregator; moreover, they claim tolerance for failures and joins, hence addressing the majority of the issues. Similar problems that focus on a narrower range of properties have also been studied in [40, 44] and more recently in [3]. An interesting line of research concerning this protocol can be seen in [27, 29, 30] and also more recently in [9, 13, 26], but the authors of papers in this line of research focus more on the cryptographic part, with less emphasis on the privacy part (i.e., how to improve utility of the aggregated data). In [19, 38] the authors present some aggregation protocols that preserve privacy; however, they do not consider dynamic changes inside the network. The latter also considers data-poisoning attacks; however, the authors do not provide rigorous proofs.
A different approach was presented in [37, 43], where the authors present a framework for some aggregation functions and consider the confidentiality of the result, leaving, however, nodes’ privacy out of the scope of their papers. On the other hand, there is a bulk of research that focuses on fault tolerance and leaves privacy and security issues either out of scope or merely mentioned, not treating them as a priority. Examples of such work may be found in [14, 25, 28]. In [7] the authors present an asymptotic lower bound on the error of the aggregation that preserves privacy, showing that in order to reduce the errors, one has to give up perfect privacy and focus rather on a computational variant of privacy preservation.

An example of work on secure data aggregation in stronger models may be found in [23, 42], where the authors consider data aggregation in a smart grid. Also, a survey concerning privacy methods (seen from the smart grid perspective, but most of these protocols can also be applied in a more general way) can be found in [24]. Note that the authors of this survey do not focus on the magnitude of errors, not to mention non-asymptotic error considerations. Another fruitful branch of the research on data aggregation considers data aggregation in vehicular ad hoc networks (VANET). The research in this field is motivated by the increasing number of “smart cars” with an internal computational unit. Among the first works addressing this issue were [21, 36, 45]. A practical scenario for data aggregation in VANET has been presented in [5]. The security issue in VANET data aggregation has been mentioned in [18, 41]. A survey of the known protocols has been performed in [35]. One may note that retrieval of encrypted or blinded data by one entity that requires the cooperation of others is similar to cryptographic secret sharing. Some of the most important work on secret sharing may be found in [1, 2]; however, in our paper we draw from the Universal Re-encryption method presented in [16].

8 Conclusions

In our paper we provided a precise analysis of the accuracy of the data aggregation protocol presented in [8]. We have shown that in many cases its accuracy may not be sufficient even if the number of faults is moderate. We proposed a new fault-tolerant, privacy-preserving aggregation protocol that offers much better precision. In order to obtain this, we allowed limited communication between the nodes. This assumption deviates from the classic model. We experimentally compared both protocols using a real-life Facebook social network structure.

We believe that our approach and security model are justified in many real-life scenarios; however, much research is left to be done in the field. First of all, our protocol, as well as all other similar protocols we are aware of, is not immune against so-called data-poisoning attacks. Another problem is finding a solution for statistics other than the sum. Authors of aggregation schemes usually limit the scope of their work to the sum, product and average of the values of all nodes in the network. In many cases we need, however, other statistics, e.g., the minimum or the median. We suppose that finding more general statistics with guaranteed privacy of individuals is possible using methods explored in e-voting protocols. However, they are very demanding in terms of required resources. From the theoretical point of view, the important question concerns the possible trade-offs between privacy protection, volume of communication and achievable accuracy of the results of aggregation.

References

  1. Beimel, A.: Secret-sharing schemes: a survey. In: Proceedings of the Third International Conference on Coding and Cryptology, IWCC’11, pp. 11–46. Springer, Berlin (2011)
  2. Benaloh, J.C.: Secret sharing homomorphisms: keeping shares of a secret secret. In: Advances in Cryptology. Springer, Berlin (1987)
  3. Benhamouda, F., Joye, M., Libert, B.: A new framework for privacy-preserving aggregation of time-series data. ACM Trans. Inf. Syst. Secur. (TISSEC) 18(3), 10 (2016)
  4. Blum, M., Feldman, P., Micali, S.: Non-interactive zero-knowledge and its applications. In: Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing, STOC ’88, pp. 103–112. ACM, New York, NY (1988)
  5. Caliskan, M., Graupner, D., Mauve, M.: Decentralized discovery of free parking places. In: Proceedings of the 3rd International Workshop on Vehicular Ad Hoc Networks, VANET ’06, pp. 30–39. ACM, New York, NY (2006)
  6. Chan, H., Perrig, A., Przydatek, B., Song, D.: Sia: Secure information aggregation in sensor networks. J. Comput. Secur. 15(1), 69–102 (2007)
  7. Chan, T.-H.H., Shi, E., Song, D.: Optimal lower bound for differentially private multi-party aggregation. IACR Cryptology ePrint Archive 2012:373, informal publication (2012)
  8. Chan, T.-H.H., Shi, E., Song, D.: Privacy-preserving stream aggregation with fault tolerance. In: Keromytis, A.D. (ed.) Financial Cryptography, volume 7397 of Lecture Notes in Computer Science, pp. 200–214. Springer, Berlin (2012)
  9. Corrigan-Gibbs, H., Boneh, D.: Prio: private, robust, and scalable computation of aggregate statistics. In: NSDI, pp. 259–282 (2017)
  10. Dwork, C.: Differential privacy: a survey of results. In: TAMC, pp. 1–19 (2008)
  11. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Theory of Cryptography, Third Theory of Cryptography Conference, TCC 2006, March 4–7, 2006, Proceedings, pp. 265–284. New York, NY (2006)
  12. Dwork, C., Roth, A.: The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9(3–4), 211–407 (2014)
  13. Emura, K.: Privacy-preserving aggregation of time-series data with public verifiability from simple assumptions. In: Australasian Conference on Information Security and Privacy, pp. 193–213. Springer, Berlin (2017)
  14. Feng, Y., Tang, S., Dai, G.: Fault tolerant data aggregation scheduling with local information in wireless sensor networks. Tsinghua Sci. Technol. 16(5), 451–463 (2011)
  15. Goldreich, O., Oren, Y.: Definitions and properties of zero-knowledge proof systems. J. Cryptol. 7(1), 1–32 (1994)
  16. Golle, P., Jakobsson, M., Juels, A., Syverson, P.F.: Universal re-encryption for mixnets. In: Okamoto, T. (ed.) Topics in Cryptology–CT-RSA 2004, The Cryptographers’ Track at the RSA Conference 2004, San Francisco, CA, USA, February 23–27, 2004, Proceedings, volume 2964 of Lecture Notes in Computer Science, pp. 163–178. Springer, Berlin (2004)
  17. Gomulkiewicz, M., Klonowski, M., Kutylowski, M.: Onions based on universal re-encryption–anonymous communication immune against repetitive attack. In: Lim, C.H., Yung, M. (eds.) Information Security Applications, 5th International Workshop, WISA 2004, Jeju Island, Korea, August 23–25, 2004, Revised Selected Papers, volume 3325 of Lecture Notes in Computer Science, pp. 400–410. Springer, Berlin (2004)
  18. Han, Q., Du, S., Ren, D., Zhu, H.: SAS: a secure data aggregation scheme in vehicular sensing networks. In: Proceedings of IEEE International Conference on Communications, ICC 2010, Cape Town, South Africa, 23–27 May 2010, pp. 1–5. IEEE, New York (2010)
  19. He, W., Liu, X., Nguyen, H., Nahrstedt, K.: A cluster-based protocol to enforce integrity and preserve privacy in data aggregation. In: ICDCS Workshops, pp. 14–19. IEEE Computer Society, New York (2009)
  20. Heinzelman, W.R., Kulik, J., Balakrishnan, H.: Adaptive protocols for information dissemination in wireless sensor networks. In: Proceedings of the 5th Annual ACM/IEEE International Conference on Mobile Computing and Networking, MobiCom ’99, pp. 174–185. ACM, New York, NY (1999)
  21. Hermann. SOTIS–a self-organizing traffic information system. In: Proceedings of the IEEE Vehicular Technology Conference Spring, pp. 2442–2246 (2003)
  22. Janson, S., Luczak, T., Rucinski, A.: Random Graphs. Wiley, New York (2011)
  23. Jawurek, M., Kerschbaum, F.: Fault-tolerant privacy-preserving statistics. In: Fischer-Hubner, S., Wright, M. (eds.) Privacy Enhancing Technologies, volume 7384 of Lecture Notes in Computer Science, pp. 221–238. Springer, Berlin (2012)
  24. Jawurek, M., Kerschbaum, F., Danezis, G.: Sok: Privacy Technologies for Smart Grids–A Survey of Options. Microsoft Res, Cambridge (2012)
  25. Jhumka, A., Bradbury, M., Saginbekov, S.: Efficient fault-tolerant collision-free data aggregation scheduling for wireless sensor networks. J. Parallel Distrib. Comput. 74(1), 1789–1801 (2014)
  26. Joye, M.: Cryptanalysis of a privacy-preserving aggregation protocol. IEEE Trans. Dependable Secure Comput. 14(6), 693–694 (2017)
  27. Joye, M., Libert, B.: A scalable scheme for privacy-preserving aggregation of time-series data. In: International Conference on Financial Cryptography and Data Security, pp. 111–125. Springer, Berlin (2013)
  28. Larrea, M., Martin, C., Astrain, J.J.: Hierarchical and fault-tolerant data aggregation in wireless sensor networks. In: 2nd International Symposium on Wireless Pervasive Computing, 2007. ISWPC ’07 (2007)
  29. Leontiadis, I., Elkhiyaoui, K., Molva, R.: Private and dynamic time-series data aggregation with trust relaxation. In: International Conference on Cryptology and Network Security, pp. 305–320. Springer, Berlin (2014)
  30. Leontiadis, I., Elkhiyaoui, K., Önen, M., Molva, R.: Puda–privacy and unforgeability for data aggregation. In: International Conference on Cryptology and Network Security, pp. 3–18. Springer, Berlin (2015)
  31. Leskovec, J., Krevl, A.: SNAP Datasets: Stanford Large Network Dataset Collection (2014). http://snap.stanford.edu/data
  32. Madden, S., Franklin, M.J., Hellerstein, J.M., Hong, W.: Tag: A tiny aggregation service for ad-hoc sensor networks. SIGOPS Oper. Syst. Rev. 36(SI), 131–146 (2002)
  33. McAuley, J.J., Leskovec, J.: Learning to discover social circles in ego networks. In: NIPS, volume 2012, pp. 548–56 (2012)
  34. Mironov, I., Pandey, O., Reingold, O., Vadhan, S.P.: Computational differential privacy. In: 29th Annual International Cryptology Conference Advances in Cryptology–CRYPTO 2009, Santa Barbara, CA, USA, August 16–20, 2009. Proceedings, pp. 126–142 (2009)
  35. Mohanty, S., Jena, D.: Secure data aggregation in vehicular-adhoc networks: a survey. Proced. Technol. 6, 922–929 (2012). 2nd International Conference on Communication, Computing and Security [ICCCS-2012]
  36. Nadeem, T., Dashtinezhad, S., Liao, C., Iftode, L.: Trafficview: traffic data dissemination using car-to-car communication. SIGMOBILE Mob. Comput. Commun. Rev. 8(3), 6–19 (2004)
  37. Papadopoulos, S., Kiayias, A., Papadias, D.: Exact in-network aggregation with integrity and confidentiality. IEEE Trans. Knowl. Data Eng. 24(10), 1760–1773 (2012)
  38. PDA: Privacy-preserving data aggregation in wireless sensor networks (2007)
  39. Pinelis, I.: Characteristic function of the positive part of a random variable and related results, with applications. Stat. Probab. Lett. 106, 281–286 (2015)
  40. Rastogi, V., Nath, S.: Differentially private aggregation of distributed time-series with transformation and encryption. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pp. 735–746. ACM, New York, NY (2010)
  41. Rivas, D.A., Barceló-Ordinas, J.M., Zapata, M.G., Morillo-Pozo, J.D.: Security on VANETs: privacy, misbehaving nodes, false information and secure data aggregation. J. Netw. Comput. Appl. 34(6), 1942–1955 (2011)
  42. Rottondi, C., Verticale, G., Krauss, C.: Distributed privacy-preserving aggregation of metering data in smart grids. IEEE J. Sel. Areas Commun. (JSAC), JSAC Smart Grid Commun. Ser. 31, 1342–1354 (2013)
  43. Roy, S., Conti, M., Setia, S., Jajodia, S.: Secure data aggregation in wireless sensor networks: filtering out the attacker’s impact. Trans. Info. For. Sec. 9(4), 681–694 (2014)
  44. Shi, E., Chow, R., Chan, T.-H.H., Song, D., Rieffel, E.: Privacy-preserving aggregation of time-series data. In: NDSS (2011)
  45. Wischhof, L., Ebner, A., Rohling, H.: Information dissemination in self-organizing intervehicle networks. IEEE Trans. Intell. Transp. Syst. 6(1), 90–101 (2005)
  46. Wolfram Research: Hypergeometric2F1. From Wolfram Research (2011). http://functions.wolfram.com/HypergeometricFunctions/Hypergeometric2F1

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Krzysztof Grining (1), Email author
  • Marek Klonowski (1)
  • Piotr Syga (1)
  1. Department of Computer Science, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
