1 Introduction

Consider a scenario in which a survey is conducted among a sample of random individuals and data mining techniques are applied to learn information on the entire population. If such information will disclose information on the individuals participating in the survey, then they will be reluctant to participate in the survey. To address this question, Kasiviswanathan et al. (2011) introduced the notion of private learning, where a private learner is required to output a hypothesis that gives accurate classification while protecting the privacy of the individual samples from which the hypothesis was obtained.

The definition of a private learner is a combination of two qualitatively different notions. One is that of probably approximately correct (PAC) learning (Valiant 1984), the other of differential privacy (Dwork et al. 2006). PAC learning, on one hand, is an average-case requirement, which requires that the output of the learner on most samples is good. Differential privacy, on the other hand, is a worst-case requirement. It is a strong notion of privacy that provides meaningful guarantees in the presence of powerful attackers and is increasingly accepted as a standard for providing rigorous privacy. Recent research on privacy has shown, somewhat surprisingly, that it is possible to design differentially private variants of many analyses. Further discussions on differential privacy can be found in the surveys of Dwork (2009, 2011).

We next give more details on PAC learning and differential privacy. In PAC learning, a collection of samples (labeled examples) is generalized into a hypothesis. It is assumed that the examples are generated by sampling from some (unknown) distribution \(\mathcal{D}\) and are labeled according to an (unknown) concept c taken from some concept class \(\mathcal{C}\). The learned hypothesis h should predict with high accuracy the labeling of examples taken from the distribution \(\mathcal{D}\), an average-case requirement. In differential privacy the output of a learner should not be significantly affected if a particular example is replaced with an arbitrary example. Concretely, differential privacy considers the collection of samples as a database, defines that two databases are neighbors if they differ in exactly one sample, and requires that for every two neighboring databases the output distribution of a private learner should be similar.

In this paper, we consider private learning of finite, discrete domains. Finite domains are natural as computers only store information with finite precision. The work of Kasiviswanathan et al. (2011) demonstrated that private learning in such domains is feasible: any concept class that is PAC learnable can be learned privately (but not necessarily efficiently), by a “private Occam’s razor” algorithm, with sample complexity that is logarithmic in the size of the hypothesis class (Footnote 1). Furthermore, taking into account the earlier result of Blum et al. (2005) (that all concept classes that can be efficiently learned in the statistical queries model can be learned privately and efficiently) and the efficient private parity learner of Kasiviswanathan et al. (2011), we get that most “natural” computational learning tasks can be performed privately and efficiently (i.e., with polynomial resources). This is important as learning problems generalize many of the computations performed by analysts over collections of sensitive data.

The results of Blum et al. (2005), Kasiviswanathan et al. (2011) show that private learning is feasible in an extremely broad sense, and hence, one can essentially equate learning and private learning. However, the costs of the private learners constructed in Blum et al. (2005), Kasiviswanathan et al. (2011) are generally higher than those of non-private ones by factors that depend not only on the privacy, accuracy, and confidence parameters of the private learner. In particular, the well-known relationship between the sample complexity of PAC learners and the VC-dimension of the concept class (ignoring computational efficiency) (Blumer et al. 1989) does not hold for the above constructions of private learners; the sample complexity of the algorithms of Blum et al. (2005), Kasiviswanathan et al. (2011) is proportional to the logarithm of the size of the concept class. Recall that the VC-dimension of a concept class is bounded by the logarithm of its size, and is significantly lower for many interesting concept classes. Hence, there may exist learning tasks for which a “very practical” non-private learner exists, but any private learner is “impractical” (with respect to the sample size required).

The focus of this work is on a fine-grain examination of the differences in complexity between private and non-private learning. The hope is that such an examination will eventually lead to an understanding of which complexity measure is relevant for the sample complexity of private learning, similar to the well-understood relationship between the VC-dimension and sample complexity of PAC learning. Such an examination is interesting also for other tasks, and a second task we examine is that of releasing a sanitization of a data set that simultaneously protects privacy of individual contributors and offers utility to the data analyst. See the discussion in Sect. 1.1.2.

1.1 Our contributions

We now give a brief account of our results. Throughout this rather informal discussion we will treat the accuracy, confidence, and privacy parameters as constants (a detailed analysis revealing the dependency on these parameters is presented in the technical sections). We use the term “efficient” for polynomial time computations.

Following standard computational learning terminology, we will call learners for a concept class \(\mathcal{C}\) that only output hypotheses in \(\mathcal{C}\) proper, and other learners improper. The original motivation in computational learning theory for this distinction is that there exist concept classes \(\mathcal{C}\) for which proper learning is computationally intractable (Pitt and Valiant 1988), whereas it is tractable to learn \(\mathcal{C}\) improperly (Valiant 1984). As we will see below, the distinction between proper and improper learning is useful also when discussing private learning, and for reasons other than making intractable learning tasks tractable. Our results on private learning are summarized in Table 1.

Table 1 Our separation results (ignoring dependence on ϵ,α,β), where (d) is any function that grows as ω(logd)

1.1.1 Proper and improper private learning

It is instructive to look into the construction of the private Occam’s razor algorithm of Kasiviswanathan et al. (2011) and see why its sample complexity is proportional to the logarithm of the size of the hypothesis class used. The algorithm uses the exponential mechanism of McSherry and Talwar (2007) to choose a hypothesis. The choice is probabilistic, where the probability mass that is assigned to each of the hypotheses decreases exponentially with the number of samples that are inconsistent with it. A union-bound argument is used in the claim that the construction actually yields a learner, and a sample size that is logarithmic in the size of the hypothesis class is needed for the argument to go through. The question is whether such a sample size is really required.
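The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the construction of Kasiviswanathan et al. (2011) verbatim: the function names `selection_probabilities` and `private_occam` are hypothetical, the score of a hypothesis is taken to be minus its number of inconsistent samples, and the factor 1/2 in the exponent is the usual normalization of the exponential mechanism for a score of sensitivity 1.

```python
import math
import random

def selection_probabilities(samples, hypotheses, epsilon):
    """Probability assigned to each hypothesis by the exponential mechanism.

    The score of a hypothesis is minus the number of labeled samples it
    misclassifies (an assumption of this sketch); replacing one sample
    changes every score by at most 1.
    """
    errors = [sum(1 for x, y in samples if h(x) != y) for h in hypotheses]
    # Shifting by the minimum keeps the exponentials well scaled; the shift
    # cancels in the normalization below.
    best = min(errors)
    weights = [math.exp(-epsilon * (e - best) / 2.0) for e in errors]
    total = sum(weights)
    return [w / total for w in weights]

def private_occam(samples, hypotheses, epsilon, rng=random):
    """Sample one hypothesis according to the probabilities above."""
    probs = selection_probabilities(samples, hypotheses, epsilon)
    return rng.choices(hypotheses, weights=probs, k=1)[0]

def point(j):
    """Point function: 1 on input j and 0 elsewhere."""
    return lambda x: 1 if x == j else 0

# Toy run: eight point functions, samples labeled by the point function at 3.
hypotheses = [point(j) for j in range(8)]
samples = [(3, 1)] * 20 + [(5, 0)] * 20
probs = selection_probabilities(samples, hypotheses, epsilon=1.0)
chosen = private_occam(samples, hypotheses, epsilon=1.0)
```

In this toy run almost all of the probability mass falls on the consistent hypothesis. With fewer samples, the many inconsistent hypotheses jointly retain a noticeable share of the mass, which is where the union bound over the hypothesis class enters.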

To address the above question, we consider a simple, but natural, class \(\operatorname {\mathtt {POINT}}=\{\operatorname {\mathtt {POINT}}_{d}\}\) containing the concepts \(c_{j}:\{0,1\}^{d}\rightarrow\{0,1\}\) where \(c_{j}(x)=1\) for \(x=j\), and 0 otherwise. The VC-dimension of \(\operatorname {\mathtt {POINT}}_{d}\) is one, and hence, it can be learned (non-privately and efficiently, properly or improperly) with merely O(1) samples.

In sharp contrast (when used for properly learning \(\operatorname {\mathtt {POINT}}_{d}\)), the above-mentioned private Occam’s razor algorithm from Kasiviswanathan et al. (2011) requires \(O(\log(|\operatorname {\mathtt {POINT}}_{d}|)) = O(d)\) samples, obtaining the largest possible gap in sample complexity when compared to non-private learners! Our first result is a matching lower bound. We prove that any proper private learner for \(\operatorname {\mathtt {POINT}}_{d}\) must use Ω(d) samples, thereby answering negatively the question (from Kasiviswanathan et al. (2011)) of whether proper private learners should exhibit sample complexity that is approximately the VC-dimension (or even a function of the VC-dimension) of the concept class (Footnote 2).

A natural way to improve the sample complexity is to use the private Occam’s razor to improperly learn \(\operatorname {\mathtt {POINT}}_{d}\) with a smaller hypothesis class that is still expressive enough for \(\operatorname {\mathtt {POINT}}_{d}\), reducing the sample complexity to the logarithm of the size of the smaller hypothesis class. We show that this indeed is possible, as there exists a hypothesis class of size O(d) that can be used for learning \(\operatorname {\mathtt {POINT}}_{d}\) improperly, yielding an algorithm with sample complexity O(log d). Furthermore, this bound is tight: any hypothesis class for learning \(\operatorname {\mathtt {POINT}}_{d}\) must contain Ω(d) hypotheses. These bounds are interesting as they give a separation between proper and improper private learning: proper private learning of \(\operatorname {\mathtt {POINT}}_{d}\) requires Ω(d) samples, whereas \(\operatorname {\mathtt {POINT}}_{d}\) can be improperly privately learned using O(log d) samples. Note that such a combinatorial separation does not exist for non-private learning, as a number of samples proportional to the VC-dimension is both necessary and sufficient for proper and improper non-private learners alike. Furthermore, the Ω(d) lower bound on the size of the hypothesis class maps a clear boundary to what can be achieved in terms of sample complexity using the private Occam’s razor for \(\operatorname {\mathtt {POINT}}_{d}\). It might even suggest that any private learner for \(\operatorname {\mathtt {POINT}}_{d}\) should use Ω(log d) samples.

It turns out, however, that the intuition expressed in the last sentence is at fault. We construct an efficient improper private learner for \(\operatorname {\mathtt {POINT}}_{d}\) that uses merely O(1) samples, hence, establishing the strongest possible separation between proper and improper private learners. For the construction, we extrapolate on a technique from the efficient private parity learner of Kasiviswanathan et al. (2011). The construction of Kasiviswanathan et al. (2011) utilizes a natural non-private proper learner, and hence, results in a proper private learner. Due to the bounds mentioned above, we cannot use a proper learner for \(\operatorname {\mathtt {POINT}}_{d}\), and hence, we construct an improper (rather unnatural) learner to base our construction upon. Our construction utilizes a double-exponential hypothesis class, and hence, is inefficient (even outputting a hypothesis requires super-polynomial time). We use a simple compression using pseudorandom functions (akin to Mishra and Sandler (2006)) to make the algorithm efficient.

The above two improper learning algorithms use “heavy” hypotheses, that is, the hypotheses are Boolean functions that return 1 on many inputs (in contrast to a point function that returns 1 on exactly one input). Informally, each such heavy hypothesis protects the privacy since it could have been returned on many different concepts. The main technical point in these algorithms is how to choose a heavy hypothesis with a small error. To complete the picture, we prove that using heavy hypotheses is unavoidable: Every private learning algorithm for \(\operatorname {\mathtt {POINT}}_{d}\) that uses o(d) samples must use heavy hypotheses.

Next we look into the concept class \(\operatorname {\mathtt {INTERVAL}}=\{\operatorname {\mathtt {INTERVAL}}_{d}\}\), where for \(T=2^{d}\) we define \(\operatorname {\mathtt {INTERVAL}}_{d}=\{ c_{1},\ldots,c_{T+1} \}\) and, for \(1\leq j\leq T+1\), the concept \(c_{j}:\{1,\ldots,T+1\}\rightarrow\{0,1\}\) is defined as follows: \(c_{j}(x)=1\) for \(x<j\) and \(c_{j}(x)=0\) otherwise. As with \(\operatorname {\mathtt {POINT}}_{d}\), it is easy to show that the sample complexity of any proper private learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) is Ω(d). We give two results regarding the sample complexity of improper private learning of \(\operatorname {\mathtt {INTERVAL}}_{d}\). The first result shows that if a sublinear (in d) sample complexity private learner exists for \(\operatorname {\mathtt {INTERVAL}}_{d}\), then it must output, with high probability, a very “complex looking” hypothesis in the sense that the hypothesis must switch from zero to one (and vice-versa) exponentially many times, unlike any concept \(c_{j} \in \operatorname {\mathtt {INTERVAL}}_{d}\) that switches only once from one to zero at j. The second result considers a generalization of the technique that yielded the O(1) sample improper private learner for \(\operatorname {\mathtt {POINT}}_{d}\), and shows that it alone would not yield a private learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) with sublinear (in d) sample complexity.

We apply the above lower bound on the number of samples for proper private learning \(\operatorname {\mathtt {POINT}}_{d}\) to show a separation in the sample complexity of efficient proper private learners (under a slightly relaxed definition of proper learning) and inefficient proper private learners. More concretely, assuming the existence of a pseudorandom generator with exponential stretch, we present a concept class \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\)—a subset of \(\operatorname {\mathtt {POINT}}_{d}\)—such that every efficient private learner that learns \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) using \(\operatorname {\mathtt {POINT}}_{d}\) requires Ω(d) samples. In contrast, an inefficient proper private learner exists that uses only a super-logarithmic number of samples. This is the first example in private learning where requiring efficiency on top of privacy comes at a price of larger sample size.

1.1.2 The sample size of non-interactive sanitization mechanisms

Given a database containing a collection of individual information, a sanitization is a release of information that protects the privacy of the individual contributors while offering utility to the analyst using the database. The setting is non-interactive if once the sanitization is released, then the original database and the curator play no further role. Blum et al. (2008) presented a construction of such non-interactive sanitizers for count queries. Let \(\mathcal{C}\) be a concept class consisting of efficiently computable predicates from a discretized domain X to {0,1}. Given a collection D of data items taken from X, Blum et al. employ the exponential mechanism (McSherry and Talwar 2007) to (inefficiently) obtain another collection D′ with data items from X such that D′ maintains an approximately correct count \(\sum_{d\in D} c(d)\) for all concepts \(c\in \mathcal{C}\), provided that the size of D is \(O(\log(|X|) \cdot\mathrm{\it VCDIM}(\mathcal{C}))\). As D′ is generated using the exponential mechanism, the differential privacy of D is protected. The database D′ is referred to as a synthetic database as it contains data items drawn from the same universe (i.e., from X) as the original database D.

We provide a new lower bound for non-interactive sanitization mechanisms. We show that for \(\operatorname {\mathtt {POINT}}_{d}\) every non-interactive sanitization mechanism that is useful (Footnote 3) for \(\operatorname {\mathtt {POINT}}_{d}\) requires a database of size Ω(d). This lower bound is tight as the sanitization mechanism of Blum et al. for \(\operatorname {\mathtt {POINT}}_{d}\) uses a database of size \(O(d \cdot\mathrm{\it VCDIM}(\operatorname {\mathtt {POINT}}_{d})) = O(d)\). Our lower bound holds even if the sanitized output is an arbitrary data structure, i.e., not necessarily a synthetic database.

A preliminary version of this paper appeared in the 7th Theory of Cryptography Conference (TCC), 2010. The TCC paper contained a proof sketch of the results presented in Sects. 3, 4.2, 6, and 7. The results presented in Sects. 4.1, 4.3, and 5 are new.

1.2 Related work

The notion of PAC learning was introduced by Valiant (1984). The notion of differential privacy was introduced by Dwork et al. (2006). Private learning was introduced in Kasiviswanathan et al. (2011). Beyond proving that (ignoring computation) every concept class with finite, discrete domain can be PAC learned privately (see Theorem 3.2 below), Kasiviswanathan et al. proved an equivalence between learning in the statistical queries model and private learning in the local communication model (a.k.a. randomized response). The general private data release mechanism we mentioned above was introduced in Blum et al. (2008) along with a specific construction for halfspace queries. Also as mentioned above, both Kasiviswanathan et al. (2011) and Blum et al. (2008) use the exponential mechanism of McSherry and Talwar (2007), a generic construction of differentially private analyses, which (in general) does not yield efficient algorithms.

A recent work of Dwork et al. (2009) considered the complexity of non-interactive sanitization under two settings: (a) sanitized output is a synthetic database, and (b) sanitized output is some arbitrary data structure. For the task of sanitizing with a synthetic database they show a separation between efficient and inefficient sanitization mechanisms based on whether the size of the instance space and the size of the concept class is polynomial in a (security) parameter or not. For the task of sanitizing with an arbitrary data structure they show a tight connection between the complexity of sanitization and traitor tracing schemes used in cryptography. They leave the problem of separating efficient private and inefficient private learning open.

Following the preliminary version of our paper (Beimel et al. 2010), Chaudhuri and Hsu (2011) study the sample complexity of privately learning infinite concept classes when the data is drawn from a continuous distribution. Using techniques very similar to ours, they show that, under these settings, there exists a simple concept class for which any proper learner that uses a finite number of examples and guarantees differential privacy fails to satisfy the accuracy guarantee for at least one unlabeled data distribution. This implies that the results of Kasiviswanathan et al. (2011) do not extend to infinite hypothesis classes on continuous data distributions.

Chaudhuri and Hsu (2011) also study learning algorithms that are only required to protect the privacy of the labels (and do not necessarily protect the privacy of the examples themselves). They prove upper bounds and lower bounds for this scenario. In particular, they prove a lower bound on the sample complexity using the doubling dimension of the disagreement metric of the hypothesis class with respect to the unlabeled data distribution. This result does not imply our results. For example, the class \(\operatorname {\mathtt {POINT}}_{d}\) can be properly learned using O(1) samples while protecting the privacy of the labels, while we prove that Ω(d) samples are required to properly learn this class while protecting the privacy of the examples and the labels. It seems that label privacy may give enough protection in the restricted setting where the content of the underlying examples is publicly known. However, in many settings this information is highly sensitive. For example, in a database containing medical records we wish to protect the identity of the people in the sample (i.e., we do not want to disclose that they have been to a hospital).

It is well known that for all concept classes \(\mathcal{C}\), every learner for \(\mathcal{C}\) requires \(\varOmega(\mathrm{\it VCDIM(\mathcal{C})})\) samples (Ehrenfeucht et al. 1989). This lower bound on the sample size also holds for private learning. Blum et al. (2013) show that this result extends to the setting of private data release. They show that for all concept classes \(\mathcal{C}\), every non-interactive sanitization mechanism that is useful for \(\mathcal{C}\) requires \(\varOmega(\mathrm{\it VCDIM(\mathcal{C})})\) samples (remember that the best upper bound is \(O(\log(|X|) \cdot\mathrm{\it VCDIM}(\mathcal{C}))\)). We show in Sect. 7 that the lower bound of \(\varOmega(\mathrm{\it VCDIM(\mathcal{C})})\) is not tight—there exists a concept class \(\mathcal{C}\) of constant VC-dimension such that every non-interactive sanitization mechanism that is useful for \(\mathcal{C}\) requires a much larger sample size.

Tools for private learning (not in the PAC setting) were studied in a few papers; such tools include, for example, private logistic regression (Chaudhuri and Monteleoni 2008) and private empirical risk minimization (Chaudhuri et al. 2011; Kifer et al. 2012).

1.3 Questions for future exploration

The motivation of this work was to study the connection between non-private and private learning. We believe that the ideas developed in this work are a first step in developing a general theory of private learning. In particular, we believe that there is a combinatorial measure that characterizes private learning (for non-private learning such combinatorial measure exists—the VC dimension). Such characterization was given recently in Beimel et al. (2013).

In this paper, the ideas used for lower bounding the sample size for proper private learning of points are also used to establish a lower bound on the sample size for sanitization of databases. Other connections between private learning and sanitization were explored in (Blum et al. 2008). An open question is whether there is a deeper connection between the models, i.e., does any bound for one task imply a similar bound for the other?

1.4 Organization

In Sect. 2, we define private learning. In Sect. 3, we prove lower bounds on proper private learning, and in Sect. 4, we describe efficient improper private learning algorithms for the \(\operatorname {\mathtt {POINT}}\) concept class. In Sect. 5, we discuss private learning of the \(\operatorname {\mathtt {INTERVAL}}\) concept class. In Sect. 6, we show a separation between efficient and inefficient proper private learning. Finally, in Sect. 7, we prove a lower bound for non-interactive sanitization.

2 Preliminaries

Notation

We use [n] to denote the set {1,2,…,n}. The notation \(O_{\gamma}(g(n))\) is a shorthand for \(O(h(\gamma)\cdot g(n))\) for some non-negative function h. The notation \(\varOmega_{\gamma}(g(n))\) is defined similarly. We use \(\mathop {\rm negl}(\cdot)\) to denote functions from \(\mathbb {R}^{+}\) to [0,1] that decrease faster than any inverse polynomial.

2.1 Preliminaries from privacy

A database is a vector \(D=(d_{1},\ldots,d_{m})\) over a domain X, where each entry \(d_{i}\in D\) represents information contributed by one individual. Databases D and D′ are called neighbors if they differ in exactly one entry (i.e., the Hamming distance between D and D′ is 1). An algorithm is private if neighboring databases induce nearby distributions on its outcomes. Formally:

Definition 2.1

(Differential Privacy (Dwork et al. 2006))

A randomized algorithm \(\mathcal{A}\) is ϵ-differentially private if for all neighboring databases D,D′, and for all sets \(\mathcal{S}\) of outputs,

$$\begin{aligned} \Pr\bigl[\mathcal{A}(D ) \in\mathcal{S}\bigr] \leq\exp(\epsilon) \cdot\Pr\bigl[\mathcal{A}\bigl(D'\bigr) \in\mathcal{S}\bigr]. & \end{aligned}$$
(1)

The probability is taken over the random coins of \(\mathcal{A}\).

An immediate consequence of (1) is that for any two databases D,D′ (not necessarily neighbors) of size m, and for all sets \(\mathcal{S}\) of outputs, \(\Pr[\mathcal{A}(D ) \in\mathcal{S}] \geq\exp(-\epsilon m) \cdot \Pr [\mathcal{A}(D') \in\mathcal{S}]\).
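One way to see this consequence is the standard chaining argument, sketched here with intermediate databases \(D_{0},\ldots,D_{m}\) that are introduced only for this illustration. Let \(D_{0}=D\) and, for \(i\in[m]\), let \(D_{i}\) be obtained from \(D_{i-1}\) by replacing its i-th entry with the i-th entry of D′. Then \(D_{m}=D'\), and consecutive databases are either identical or neighbors, so (1) applies to each consecutive pair:

$$\begin{aligned} \Pr\bigl[\mathcal{A}\bigl(D'\bigr) \in\mathcal{S}\bigr] = \Pr\bigl[\mathcal{A}(D_{m}) \in\mathcal{S}\bigr] &\leq\exp(\epsilon) \cdot\Pr\bigl[\mathcal{A}(D_{m-1}) \in\mathcal{S}\bigr] \\ &\leq\exp(2\epsilon) \cdot\Pr\bigl[\mathcal{A}(D_{m-2}) \in\mathcal{S}\bigr] \\ &\leq\cdots\leq\exp(\epsilon m) \cdot\Pr\bigl[\mathcal{A}(D_{0}) \in\mathcal{S}\bigr] = \exp(\epsilon m) \cdot\Pr\bigl[\mathcal{A}(D ) \in\mathcal{S}\bigr]. \end{aligned}$$

Multiplying both sides by \(\exp(-\epsilon m)\) gives the stated inequality.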

2.2 Preliminaries from learning theory

We consider Boolean classification problems. A concept c:X→{0,1} is a function that labels examples taken from the domain X by either 0 or 1. The domain X is understood to be an ensemble \(X=\{X_{d}\}_{d\in \mathbb {N}}\) (typically, \(X_{d}=\{0,1\}^{d}\)) and a concept class \(\mathcal{C}\) is an ensemble \(\mathcal{C}= \{\mathcal{C}_{d}\}_{d\in \mathbb {N}}\) where \(\mathcal{C}_{d}\) is a class of concepts mapping \(X_{d}\) to {0,1}. In this paper \(X_{d}\) is always a finite, discrete set. A concept class comes implicitly with a way to represent concepts and \(\mathop {\rm size}(c)\) is the size of the (smallest) representation of the concept c under the given representation scheme.

PAC learning algorithms are given examples sampled according to an unknown probability distribution \(\mathcal{D}\) over \(X_{d}\), and labeled according to an unknown target concept \(c_{d}\in \mathcal{C}_{d}\). Define the error of a hypothesis \(h:X_{d}\rightarrow\{0,1\}\) as

$$\mathop {\rm error}_{\mathcal{D}}(c,h)=\Pr_{x \sim \mathcal{D}}\bigl[h(x)\neq c(x)\bigr]. $$

Definition 2.2

(PAC Learning (Valiant 1984))

An algorithm \(\mathcal{A}\) is an (α,β)-PAC learner of a concept class \(\mathcal{C}_{d}\) over \(X_{d}\) using hypothesis class \(\mathcal{H}_{d}\) and sample size n if for all concepts \(c \in \mathcal{C}_{d}\), all distributions \(\mathcal{D}\) on \(X_{d}\), given an input \(D=(d_{1},\ldots,d_{n})\), where \(d_{i}=(x_{i},c(x_{i}))\) with \(x_{i}\) drawn i.i.d. from \(\mathcal{D}\) for all i∈[n], algorithm \(\mathcal{A}\) outputs a hypothesis \(h\in \mathcal{H}_{d}\) satisfying

$$\begin{aligned} \Pr\bigl[\mathop {\rm error}_{\mathcal{D}}(c,h) \leq\alpha\bigr] \geq 1-\beta. \end{aligned}$$

The probability is taken over the randomness of the learner \(\mathcal{A}\) and the sample points chosen according to \(\mathcal{D}\).

An algorithm \(\mathcal{A}\), whose inputs are d,α,β, and a set of samples (labeled examples) D, is a PAC learner of a concept class \(\mathcal{C}=\{\mathcal{C}_{d}\}_{d\in \mathbb {N}}\) over \(X=\{X_{d}\}_{d\in \mathbb {N}}\) using hypothesis class \(\mathcal{H}=\{\mathcal{H}_{d}\}_{d\in \mathbb {N}}\) if there exists a polynomial p(⋅,⋅,⋅,⋅) such that for all \(d \in \mathbb {N}\) and 0<α,β<1, the algorithm \(\mathcal{A}(d,\alpha,\beta,\cdot)\) is an (α,β)-PAC learner of the concept class \(\mathcal{C}_{d}\) over \(X_{d}\) using hypothesis class \(\mathcal{H}_{d}\) and sample size \(n=p(d,\mathop {\rm size}(c),1/\alpha ,\log (1/\beta))\) (Footnote 4). If \(\mathcal{A}\) runs in time polynomial in \(d,\mathop {\rm size}(c),1/\alpha,\log(1/\beta)\), we say that it is an efficient PAC learner. Also the learner is called a proper PAC learner if \(\mathcal{H}=\mathcal{C}\), otherwise it is called an improper PAC learner.

A concept class \(\mathcal{C}= \{\mathcal{C}_{d}\}_{d\in \mathbb {N}}\) over \(X= \{X_{d}\}_{d\in \mathbb {N}}\) is PAC learnable using hypothesis class \(\mathcal{H}= \{\mathcal{H}_{d}\}_{d\in \mathbb {N}}\) if there exists a PAC learner \(\mathcal{A}\) learning \(\mathcal{C}\) over X using hypothesis class \(\mathcal{H}\). If \(\mathcal{A}\) is an efficient PAC learner, we say that \(\mathcal{C}\) is efficiently PAC learnable.

It is well known that improper learning is more powerful than proper learning. For example, Pitt and Valiant (1988) show that unless RP=NP, k-term DNF are not efficiently learnable by k-term DNF, whereas it is possible to learn a k-term DNF efficiently using k-CNF (Valiant 1984). For more background on learning theory, see (Kearns and Vazirani 1994).

Definition 2.3

(VC-Dimension (Vapnik and Chervonenkis 1971))

Let \(\mathcal{C}=\{\mathcal{C}_{d}\}\) be a class of concepts over \(X=\{X_{d}\}\). We say that \(\mathcal{C}_{d}\) shatters a point set \(Y\subseteq X_{d}\) if \(|\{c(Y):c\in \mathcal{C}_{d}\} | = 2^{|Y|}\), i.e., the concepts in \(\mathcal{C}_{d}\) when restricted to Y produce all the \(2^{|Y|}\) possible assignments on Y. The VC-dimension of \(\mathcal{C}_{d}\) (\(\mathrm{\it VCDIM}(\mathcal{C}_{d})\)) is defined as the size of a maximum point set that is shattered by \(\mathcal{C}_{d}\), as a function of d.
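For small domains the definition can be checked by brute force. The following Python sketch is an illustration only: the helper names (`shatters`, `vc_dimension`, `point_class`, `interval_class`) are hypothetical, and the classes are instantiated with T=8 (that is, d=3) following the descriptions of \(\operatorname {\mathtt {POINT}}_{d}\) and \(\operatorname {\mathtt {INTERVAL}}_{d}\) given in the introduction.

```python
from itertools import combinations

def shatters(concepts, points):
    """True if the concepts realize all 2^|points| labelings of the points."""
    labelings = {tuple(c(x) for x in points) for c in concepts}
    return len(labelings) == 2 ** len(points)

def vc_dimension(concepts, domain):
    """Size of a largest shattered subset of the domain (exhaustive search)."""
    dim = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(concepts, subset) for subset in combinations(domain, k)):
            dim = k
        else:
            # Every subset of a shattered set is shattered, so if no set of
            # size k is shattered, no larger set can be shattered either.
            break
    return dim

def point_class(T):
    """Concepts c_j with c_j(x) = 1 exactly when x = j, for j in {1,...,T}."""
    return [lambda x, j=j: 1 if x == j else 0 for j in range(1, T + 1)]

def interval_class(T):
    """Concepts c_j with c_j(x) = 1 exactly when x < j, for j in {1,...,T+1}."""
    return [lambda x, j=j: 1 if x < j else 0 for j in range(1, T + 2)]

T = 8  # corresponds to d = 3
point_dim = vc_dimension(point_class(T), list(range(1, T + 1)))
interval_dim = vc_dimension(interval_class(T), list(range(1, T + 2)))
print(point_dim, interval_dim)
```

The search is exponential in the domain size, so it is meant only as a sanity check on tiny instances. Both classes have T or T+1 concepts yet shatter no pair of points, which is the gap between VC-dimension and the logarithm of the class size that the paper exploits.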

Theorem 2.4

(Blumer et al. 1989)

Let \(\mathcal{C}_{d}\) be a concept class over \(X_{d}\). There exists an (α,β)-PAC learner that learns \(\mathcal{C}_{d}\) using hypothesis class \(\mathcal{C}_{d}\) and \(O((\mathrm{\it VCDIM}(\mathcal{C}_{d})\cdot\log(\frac{1}{\alpha})+\log(\frac{1}{\beta }))/\alpha )\) samples.

2.3 Private learning

Definition 2.5

(Private PAC Learning (Kasiviswanathan et al. 2011))

Let d,α,β be as in Definition 2.2 and ϵ>0. A concept class \(\mathcal{C}\) is privately PAC learnable using \(\mathcal{H}\) if there exists a learning algorithm \(\mathcal{A}\) that takes inputs ϵ,d,α,β,D, returns a hypothesis \(\mathcal{A}(\epsilon ,d,\alpha,\beta,D)\), and satisfies

Sample efficiency. The number of samples (labeled examples) in D is polynomial in 1/ϵ, d, \(\mathop {\rm size}(c)\), 1/α, and log(1/β);

Privacy. For all d and ϵ,α,β>0, algorithm \(\mathcal{A}(\epsilon ,d,\alpha,\beta,\cdot)\) is ϵ-differentially private (as formulated in Definition 2.1);

Utility. For all ϵ>0, algorithm \(\mathcal{A}(\epsilon ,\cdot ,\cdot,\cdot,\cdot)\) PAC learns \(\mathcal{C}\) using \(\mathcal{H}\) (as formulated in Definition 2.2).

An algorithm \(\mathcal{A}\) is an efficient private PAC learner if it runs in time polynomial in 1/ϵ, d, \(\mathop {\rm size}(c)\), 1/α, log(1/β). Also the private learner is called proper if \(\mathcal{H}=\mathcal{C}\), otherwise it is called improper.

Remark 2.6

The privacy requirement in Definition 2.5 is a worst-case requirement. That is, Inequality (1) must hold for every pair of neighboring databases D,D′ (even if these databases are not consistent with any concept in \(\mathcal{C}\)). In contrast, the utility requirement is an average-case requirement, where we only require the learner to succeed with high probability over the distribution of the databases. This qualitative difference between the utility and privacy of private learners is crucial. A wrong assumption on how samples are formed that leads to a meaningless outcome can usually be replaced with a better one with very little harm. No such amendment is possible once privacy is lost due to a wrong assumption. See Kasiviswanathan et al. (2011) for further discussion.

Note also that each entry d i in a database D is a labeled example. That is, we protect the privacy of both the example and its label.

Observation 2.7

The computational separation between proper and improper learning also holds when we add the privacy constraint. That is, unless RP=NP, no efficient proper private learner can learn k-term DNF, whereas there exists an efficient improper private learner that can learn k-term DNF using a k-CNF. The efficient k-term DNF learner of Valiant (1984) uses statistical queries (SQ) (Kearns 1998), which can be simulated efficiently and privately as shown by Blum et al. (2005), Kasiviswanathan et al. (2011).

More generally, such a gap can be shown for any concept class that cannot be properly PAC learned, but can be efficiently learned (improperly) in the statistical queries model.

2.4 Concentration bounds

Chernoff bounds give exponentially decreasing bounds on the tails of distributions. Specifically, let \(X_{1},\ldots,X_{n}\) be independent random variables where \(\Pr[X_{i}=1]=p\) and \(\Pr[X_{i}=0]=1-p\) for some \(0<p<1\). Clearly, \(\operatorname{\mathbb{E}}[\sum_{i} X_{i}]=pn\). Chernoff bounds show that the sum is concentrated around this expected value: For every 0<δ≤1,

$$\begin{aligned} \Pr \biggl[\sum_i X_i \geq(1+\delta) \operatorname{\mathbb{E}}\biggl[ \sum_i X_i \biggr] \biggr] & \leq\exp \biggl(-\operatorname{\mathbb{E}}\biggl[{ \sum_i} X_i \biggr]\delta^2/3 \biggr), \\ \Pr \biggl[\sum_i X_i \leq(1-\delta) \operatorname{\mathbb{E}}\biggl[ \sum_i X_i \biggr] \biggr] & \leq\exp \biggl(-\operatorname{\mathbb{E}}\biggl[ \sum_i X_i \biggr]\delta^2/2 \biggr), \\ \Pr \biggl[ \biggl \vert \sum_i X_i - \operatorname{\mathbb{E}}\biggl[ \sum_i X_i \biggr] \biggr \vert \geq\delta \biggr] & \leq2 \cdot\exp \bigl(-2\delta^2/n \bigr). \end{aligned}$$
(2)

The first two inequalities are known as the multiplicative Chernoff bounds (Chernoff 1952), and the last inequality is known as the Chernoff-Hoeffding bound (Hoeffding 1963).
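As a sanity check, the three bounds in (2) can be compared with exact binomial tails. The following is only an illustrative sketch: the helper names (`upper_tail`, `lower_tail`, `chernoff_report`, `hoeffding_report`) are hypothetical and not part of the source, and the small tolerance of 1e-9 is an assumption made to guard against floating-point rounding of the thresholds.

```python
from math import ceil, comb, exp, floor


def binom_pmf(n, p, k):
    """Pr[S = k] for S the sum of n independent Bernoulli(p) variables."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(n, p, t):
    """Exact Pr[S >= t]."""
    lo = max(0, ceil(t - 1e-9))  # tolerance guards against float rounding
    return sum(binom_pmf(n, p, k) for k in range(lo, n + 1))


def lower_tail(n, p, t):
    """Exact Pr[S <= t]."""
    hi = min(n, floor(t + 1e-9))
    return sum(binom_pmf(n, p, k) for k in range(0, hi + 1))


def chernoff_report(n, p, delta):
    """Pairs (exact tail, multiplicative Chernoff bound) for 0 < delta <= 1."""
    mean = p * n
    upper = (upper_tail(n, p, (1 + delta) * mean), exp(-mean * delta**2 / 3))
    lower = (lower_tail(n, p, (1 - delta) * mean), exp(-mean * delta**2 / 2))
    return upper, lower


def hoeffding_report(n, p, dev):
    """Pair (exact two-sided tail, Chernoff-Hoeffding bound) for dev > 0."""
    mean = p * n
    exact = upper_tail(n, p, mean + dev) + lower_tail(n, p, mean - dev)
    return exact, 2 * exp(-2 * dev**2 / n)
```

For instance, `chernoff_report(100, 0.5, 0.2)` returns the exact probabilities of the events "sum ≥ 60" and "sum ≤ 40" next to the corresponding bounds; in each pair the first entry should not exceed the second.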

3 Proper learning vs. proper private learning

We begin by recalling the upper bound on the sample (database) size for private learning from Kasiviswanathan et al. (2011). The bound in Kasiviswanathan et al. (2011) is for agnostic learning, and we restate it for (non-agnostic) PAC learning using the following notion of α-representation:

Definition 3.1

We say that a hypothesis class \(\mathcal{H}_{d}\) α-represents a concept class \(\mathcal{C}_{d}\) over the domain X d if for every \(c \in \mathcal{C}_{d}\) and every distribution \(\mathcal{D}\) on X d there exists a hypothesis \(h \in \mathcal{H}_{d}\) such that \(\mathop {\rm error}_{\mathcal{D}}(c,h)\leq\alpha\).

Theorem 3.2

(Kasiviswanathan et al. (2011), restated)

Assume that there is a hypothesis class \(\mathcal{H}_{d}\) that α/2-represents a concept class \(\mathcal{C}_{d}\). Then, for every 0<β<1, there exists a private PAC learner for \(\mathcal{C}_{d}\) using \(\mathcal{H}_{d}\) that uses \(O((\log(|\mathcal{H}_{d}|) +\log(1/\beta))/(\epsilon\alpha))\) samples, where ϵ,α, and β are the parameters of the private learner. The learner might not be efficient.

In other words, using Theorem 3.2 the number of samples that suffices for learning a concept class \(\mathcal{C}_{d}\) is logarithmic in the size of the smallest hypothesis class that α-represents \(\mathcal{C}_{d}\). For comparison, the number of samples required for learning \(\mathcal{C}_{d}\) non-privately is characterized by the VC-dimension of \(\mathcal{C}_{d}\) (by the lower bound of Ehrenfeucht et al. (1989) and the upper bound of Blumer et al. (1989)).

We next investigate private learning of the following simple concept class. Let \(T=2^{d}\) and \(X_{d}=\{1,\ldots,T\}\). Define the concept class \(\operatorname {\mathtt {POINT}}_{d}\) to be the set of points over {1,…,T}:

Definition 3.3

(Concept Class \(\operatorname {\mathtt {POINT}}_{d}\))

For j∈[T], define c j  : [T]→{0,1} as c j (x)=1 if x=j, and c j (x)=0 otherwise. Furthermore, define \(\operatorname {\mathtt {POINT}}_{d} = \{c_{j}\}_{j\in[T]}\).

We note that we use the set {1,…,T} for notational convenience only; when discussing the concept class \(\operatorname {\mathtt {POINT}}_{d}\) we never use the fact that the elements of {1,…,T} are integers.

The class \(\operatorname {\mathtt {POINT}}_{d}\) trivially α-represents itself, and hence, we get using Theorem 3.2 that it is (properly) PAC learnable using \(O((\log(|\operatorname {\mathtt {POINT}}_{d}|) +\log(1/\beta))/(\epsilon\alpha)) = O((d +\log(1/\beta))/(\epsilon\alpha))\) samples. For completeness, we give an efficient implementation of this learner.

Lemma 3.4

There is an efficient proper private PAC learner for \(\operatorname {\mathtt {POINT}}_{d}\) that uses O((d+log(1/β))/(ϵα)) samples.

Proof

We adapt the learner of Kasiviswanathan et al. (2011). Let \(\operatorname {\mathtt {POINT}}_{d} = \{c_{1},\ldots,c_{2^{d}}\}\). The learner uses the exponential mechanism of McSherry and Talwar (2007). Let D=((x 1,y 1),…,(x m ,y m )) be a database of samples (the labels y i ’s are assumed to be consistent with some concept in \(\operatorname {\mathtt {POINT}}_{d}\)). Define for every \(c_{j} \in \operatorname {\mathtt {POINT}}_{d}\),

$$q(D,c_j) = - \bigl|\bigl\{i \, :\, y_i \neq c_j(x_i)\bigr\}\bigr|, $$

i.e., q(D,c j ) is the negative of the number of points in D misclassified by c j . The private learner \(\mathcal{A}\) is defined as follows: output hypothesis \(c_{j} \in \operatorname {\mathtt {POINT}}_{d}\) with probability proportional to exp(ϵq(D,c j )/2). Since the exponential mechanism is ϵ-differentially private (McSherry and Talwar 2007), \(\mathcal{A}\) is ϵ-differentially private. By Kasiviswanathan et al. (2011), if m=O((d+log(1/β))/(ϵα)), then \(\mathcal{A}\) is also a proper PAC learner.

We now show that \(\mathcal{A}\) can be implemented efficiently. Implementing the exponential mechanism requires computing q(D,c j ) for \(1\leq j\leq 2^{d}\). However, q(D,c j ) is the same for all j∉{x 1,…,x m } and can be computed in O(m) time, that is, q(D,c j )=q D , where q D =−|{i : y i =1}|. Also, for any j∈{x 1,…,x m }, the value of q(D,c j ) can be computed in O(m) time. Let

$$P = \biggl( \sum_{j \in \{x_1,\ldots,x_m\}}\exp\bigl(\epsilon \cdot q(D,c_j)/2\bigr) \biggr) + \bigl(2^d-m\bigr)\exp(\epsilon \cdot q_D/2). $$

The algorithm \(\mathcal{A}\) can be efficiently implemented as the following sampling procedure:

  1. For j∈{x 1,…,x m }, with probability exp(ϵq(D,c j )/2)/P, output c j .

  2. With probability \((2^{d}-m)\cdot\exp(\epsilon q_{D}/2)/P\), pick uniformly at random a hypothesis from \(\operatorname {\mathtt {POINT}}_{d} \setminus\{c_{x_{1}},\ldots ,c_{x_{m}}\}\) and output it.

 □
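The sampling procedure in the proof of Lemma 3.4 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, a hypothesis c j is represented by its index j, and the sketch counts the number of distinct points in the sample (rather than m) when weighting the unseen hypotheses, so that the weights stay correct when some x i repeat. For large ϵ·m the exponentials may underflow, which a careful implementation would handle in log-space.

```python
import math
import random


def exponential_weights(sample, d, eps):
    """Unnormalized exponential-mechanism weights for POINT_d.

    sample: list of (x, y) with x in {1, ..., 2**d} and y in {0, 1}.
    Returns (seen points, their weights, number of unseen points,
    total weight of all unseen points).
    """
    T = 2 ** d
    seen = sorted({x for x, _ in sample})

    def q(j):
        # Negative number of sample points misclassified by c_j.
        return -sum(1 for x, y in sample if y != (1 if x == j else 0))

    # Every c_j with j outside the sample errs exactly on the positive labels.
    q_unseen = -sum(1 for _, y in sample if y == 1)
    seen_weights = [math.exp(eps * q(j) / 2) for j in seen]
    unseen_count = T - len(seen)
    unseen_weight = unseen_count * math.exp(eps * q_unseen / 2)
    return seen, seen_weights, unseen_count, unseen_weight


def sample_point_hypothesis(sample, d, eps, rng=random):
    """Draw an index j with probability proportional to exp(eps*q(D,c_j)/2)."""
    seen, seen_weights, unseen_count, unseen_weight = exponential_weights(
        sample, d, eps)
    r = rng.random() * (sum(seen_weights) + unseen_weight)
    for j, w in zip(seen, seen_weights):
        if r < w:
            return j
        r -= w
    if unseen_count == 0:  # only reachable through floating-point rounding
        return seen[-1]
    # Uniform choice among unseen points without enumerating all 2**d of them:
    # draw a rank and skip over the sorted seen points.
    candidate = rng.randrange(unseen_count) + 1
    for s in seen:
        if s <= candidate:
            candidate += 1
        else:
            break
    return candidate
```

The running time is polynomial in m and d (the sketch recomputes q(D,c j ) for each seen point, giving O(m²) arithmetic operations), even though the hypothesis class has 2^d elements.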

3.1 Separation between proper learning and proper private learning

We now show that private learners may require many more samples than non-private ones. We prove that for any proper private learner for the concept class \(\operatorname {\mathtt {POINT}}_{d}\) the required number of samples is at least logarithmic in the size of the concept class, matching Theorem 3.2, whereas there exist non-private proper learners for \(\operatorname {\mathtt {POINT}}_{d}\) that use only a constant number of samples.

To prove the lower bound, we show that a large collection of m-record databases D 1,…,D N exists, with the property that every PAC learner has to output a different hypothesis for each of these databases (recall that in our context a database is a collection of labeled examples, supposedly drawn from some distribution and labeled consistently with some target concept). As any two databases D a and D b differ on at most m entries, differential privacy implies that a private learner must output on input D a the hypothesis that is accurate for D b (and not accurate for D a ) with probability at least (1−β)⋅exp(−ϵm). Since this holds for every pair of databases, unless m is large enough we get that the private learner’s output on D a is, with high probability, a hypothesis that is not accurate for D a .

In Theorem 3.6, we prove a general lower bound on the sample complexity of private learning of a class \(\mathcal{C}_{d}\) by a hypothesis class \(\mathcal{H}_{d}\) that is α-minimal for \(\mathcal{C}_{d}\) as defined in Definition 3.5. In Corollary 3.8, we prove that Theorem 3.6 implies the claimed lower bound for proper private learning of \(\operatorname {\mathtt {POINT}}_{d}\). In Lemma 3.9, we improve this lower bound for \(\operatorname {\mathtt {POINT}}_{d}\) by a factor of 1/α.

Definition 3.5

If \(\mathcal{H}_{d}\) α-represents \(\mathcal{C}_{d}\), and every \(\mathcal{H}'_{d} \subsetneq \mathcal{H}_{d}\) does not α-represent \(\mathcal{C}_{d}\), then we say that \(\mathcal{H}_{d}\) is α-minimal for \(\mathcal{C}_{d}\).

Theorem 3.6

Let \(\mathcal{H}_{d}\) be an α-minimal representation for \(\mathcal{C}_{d}\). Then, any private PAC learner that learns \(\mathcal{C}_{d}\) using \(\mathcal{H}_{d}\) requires \(\varOmega((\log(|\mathcal{H}_{d}|)+\log(1/\beta))/\epsilon )\) samples, where ϵ,α, and β are the parameters of the private learner.

Proof

Let \(\mathcal{C}_{d}\) be a class of concepts over the domain X d and let \(\mathcal{H}_{d}\) be α-minimal for \(\mathcal{C}_{d}\). Since for every \(h \in \mathcal{H}_{d}\), the class \(\mathcal{H}_{d} \setminus\{h\} \) does not α-represent \(\mathcal{C}_{d}\), we get that there exists a concept \(c_{h} \in \mathcal{C}_{d}\) and a distribution \(\mathcal{D}_{h}\) on X d such that on inputs drawn from \(\mathcal{D}_{h}\) and labeled by c h , every PAC learner (that learns \(\mathcal{C}_{d}\) using \(\mathcal{H}_{d}\)) has to output h with probability at least 1−β.

Let \(\mathcal{A}\) be a private learner that learns \(\mathcal{C}_{d}\) using \(\mathcal{H}_{d}\), and suppose \(\mathcal{A}\) uses m samples. We next show that for every \(h\in \mathcal{H}_{d}\) there exists a database \(D_{h}\in X_{d}^{m}\) on which \(\mathcal{A}\) has to output h with probability at least 1−β. To see that, note that if \(\mathcal{A}\) is run on m examples chosen i.i.d. from the distribution \(\mathcal{D}_{h}\) and labeled according to c h , then \(\mathcal{A}\) outputs h with probability at least 1−β (where the probability is taken over the randomness of \(\mathcal{A}\) and the sample points chosen according to \(\mathcal{D}_{h}\)). Hence, a collection of m labeled examples over which \(\mathcal{A}\) outputs h with probability at least 1−β exists, and D h is set to contain these m samples.

Take \(h,h'\in \mathcal{H}_{d}\) such that \(h\not=h'\) and consider the two corresponding databases \(D_{h}\) and \(D_{h'}\) with m entries each. Clearly, they differ in at most m entries, and hence, we get by the differential privacy of \(\mathcal{A}\) that

$$\begin{aligned} \Pr\bigl[\mathcal{A}(D_{h}) = h'\bigr] \geq& \exp(-\epsilon m) \cdot\Pr\bigl[\mathcal{A}(D_{h'})=h'\bigr] \\ \geq& \exp(-\epsilon m) \cdot(1-\beta). \end{aligned}$$

Since the above inequality holds for every two databases corresponding to a pair of hypotheses in \(\mathcal{H}_{d}\), we fix an arbitrary \(h\in \mathcal{H}_{d}\) and get,

$$\begin{aligned} \Pr\bigl[\mathcal{A}(D_{h}) \neq h\bigr] = & \Pr\bigl[ \mathcal{A}(D_{h}) \in \mathcal{H}_d\setminus\{ h\}\bigr] = \sum _{h'\in \mathcal{H}_d\setminus\{h\}} \Pr\bigl[\mathcal{A}(D_{h}) = h'\bigr] \\ \geq& (|\mathcal{H}_d|-1) \cdot\exp(-\epsilon m) \cdot(1-\beta). \end{aligned}$$

On the other hand, we chose D h such that \(\Pr[\mathcal{A}(D_{h}) = h] \geq 1-\beta\), equivalently, \(\Pr[\mathcal{A}(D_{h}) \neq h] \leq\beta\). Therefore, \((|\mathcal{H}_{d}|-1)\cdot\exp(-\epsilon m)\cdot(1-\beta) \leq\beta\). Solving the last inequality for m, we get \(m=\varOmega((\log(|\mathcal{H}_{d}|) + \log(1/\beta))/\epsilon )\) as required. □
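For concreteness, the last step of the proof ("solving the last inequality for m") can be written out as follows. This is only a sketch: the constants hidden in the Ω-notation are worked out here under the additional assumptions \(|\mathcal{H}_{d}|\geq 3\) and \(\beta\leq 1/2\), which the text above does not state explicitly.

$$\begin{aligned} \bigl(|\mathcal{H}_d|-1\bigr)\cdot\exp(-\epsilon m)\cdot(1-\beta) \leq\beta \quad\Longleftrightarrow\quad & \exp(\epsilon m) \geq\frac{(|\mathcal{H}_d|-1) (1-\beta)}{\beta} \\ \Longleftrightarrow\quad & m \geq\frac{1}{\epsilon} \biggl(\ln\bigl(|\mathcal{H}_d|-1\bigr) + \ln\frac{1-\beta}{\beta} \biggr). \end{aligned}$$

Under the stated assumptions, \(\ln(|\mathcal{H}_{d}|-1)\geq\frac{1}{2}\ln|\mathcal{H}_{d}|\) and \(\ln((1-\beta)/\beta)\geq\frac{1}{2}\ln(1/\beta)\), so the right-hand side is at least \((\ln|\mathcal{H}_{d}|+\ln(1/\beta))/(2\epsilon)\).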

Using Theorem 3.6, we now prove a lower bound on the number of samples needed for proper private learning of the concept class \(\operatorname {\mathtt {POINT}}_{d}\).

Proposition 3.7

\(\operatorname {\mathtt {POINT}}_{d}\) is α-minimal for itself for every α<1.

Proof

Clearly, \(\operatorname {\mathtt {POINT}}_{d}\) α-represents itself. To show minimality, consider a subset \(\mathcal{H}'_{d} \subsetneq \operatorname {\mathtt {POINT}}_{d}\), where \(c_{i} \notin \mathcal{H}'_{d}\). Under the distribution \(\mathcal{D}\) that chooses i with probability one, \(\mathop {\rm error}_{\mathcal{D}}(c_{i},c_{j}) = 1\) for all \(j\not=i\). Hence, \(\mathcal{H}'_{d}\) does not α-represent \(\operatorname {\mathtt {POINT}}_{d}\). □

The VC-dimension of \(\operatorname {\mathtt {POINT}}_{d}\) is one.Footnote 5 It is well known that a standard (non-private) proper learner uses approximately VC-dimension number of samples to learn a concept class (Blumer et al. 1989). In contrast, we get that far more samples are needed for any proper private learner for \(\operatorname {\mathtt {POINT}}_{d}\). The following corollary follows directly from Theorem 3.6 and Proposition 3.7:

Corollary 3.8

Every proper private PAC learner for \(\operatorname {\mathtt {POINT}}_{d}\) requires Ω((d+log(1/β))/ϵ) samples.

We now show that the lower bound for \(\operatorname {\mathtt {POINT}}_{d}\) can be improved by a factor of 1/α, matching (up to constant factors) the upper bound in Theorem 3.2.

Lemma 3.9

Every proper private PAC learner for \(\operatorname {\mathtt {POINT}}_{d}\) requires Ω((d+log(1/β))/(ϵα)) samples.

Proof

Define the distributions \(\mathcal{D}_{i}\) (where \(2\leq i\leq T\)) on X d as follows: point 1 is picked with probability 1−α and point i is picked with probability α. The support of \(\mathcal{D}_{i}\) is on points 1 and i.

We say a database D=(d 1,…,d m ) where d j =(x j ,y j ) for all j∈[m] is good for distribution \(\mathcal{D}_{i}\) if at most 2αm points from x 1,…,x m equal i. Let D i be a database where x 1,…,x m are i.i.d. samples from \(\mathcal{D}_{i}\) with y j =c i (x j ) for all j∈[m]. By Chernoff bound, the probability that D i is good for distribution \(\mathcal{D}_{i}\) is at least 1−exp(−αm/3). Let \(\mathcal{A}\) be a proper private learner. On D i , \(\mathcal{A}\) has to output h=c i with probability at least 1−β (otherwise, if \(\mathcal{A}\) outputs some h=c j , where \(j\neq i\), then \(\mathop {\rm error}_{\mathcal{D}_{i}}(c_{i},h) = \mathop {\rm error}_{\mathcal{D}_{i}}(c_{i},c_{j})= \Pr_{x \sim \mathcal{D}_{i}}[c_{i}(x) \neq c_{j}(x)] > \alpha\), thus, violating the PAC learning condition for accuracy). Hence, the probability that either D i is not good or \(\mathcal{A}\) fails to return c i on D i is at most exp(−αm/3)+β. Therefore, with probability at least 1−β−exp(−αm/3), the database D i is good and \(\mathcal{A}\) returns c i on D i . Thus, for every i there exists a database D i that is good for \(\mathcal{D}_{i}\) such that \(\mathcal{A}\) returns c i on D i with probability at least 1−Γ, where Γ=β+exp(−αm/3).

Fix such databases D 2,…,D T . For every j, the databases D 2 and D j differ in at most 4αm entries (since each of them contains at most 2αm entries that are not 1). Therefore, by the guarantees of differential privacy,

$$\Pr\bigl[\mathcal{A}(D_2) \in\{c_3,\ldots,c_{T}\} \bigr] \geq(T-2) \exp(-4\epsilon \alpha m) (1-\varGamma) = \bigl(2^d-2 \bigr) \exp(-4\epsilon \alpha m) (1-\varGamma). $$

Algorithm \(\mathcal{A}\) on input D 2 outputs c 2 with probability at least 1−Γ. Therefore,

$$\bigl(2^{d}-2\bigr) \exp(-4\epsilon \alpha m) (1-\varGamma) \leq \varGamma. $$

Solving for m, we get the claimed bound. □

We conclude this section showing that every hypothesis class \(\mathcal{H}\) that α-represents \(\operatorname {\mathtt {POINT}}_{d}\) should have at least d hypotheses. Therefore, if we use Theorem 3.2 to learn \(\operatorname {\mathtt {POINT}}_{d}\) we need Ω(logd) samples.

Lemma 3.10

Let α<1/2. \(|\mathcal{H}| \geq d\) for every hypothesis class \(\mathcal{H}\) that α-represents \(\operatorname {\mathtt {POINT}}_{d}\).

Proof

Let \(\mathcal{H}\) be a hypothesis class with \(|\mathcal{H}| < d\). Consider a table whose \(T=2^{d}\) columns correspond to the possible \(2^{d}\) inputs 1,…,T, and whose \(|\mathcal{H}|\) rows correspond to the hypotheses in \(\mathcal{H}\). The (i,j)th entry in the table is 0 or 1 depending on whether the ith hypothesis gives 0 or 1 on input j. Since \(|\mathcal{H}| < d=\log (T)\), at least two columns \(j\neq j'\) are identical, that is, h(j)=h(j′) for every \(h \in \mathcal{H}\). Consider the concept \(c_{j} \in \operatorname {\mathtt {POINT}}_{d}\) (defined as c j (x)=1 if x=j, and 0 otherwise), and the distribution \(\mathcal{D}\) with probability mass 1/2 on both j and j′. We get that \(\mathop {\rm error}_{\mathcal{D}}(c_{j},h) \geq 1/2 > \alpha\) for all \(h \in \mathcal{H}\) (since for any hypothesis h(j)=h(j′), the hypothesis either errs on j or on j′). Therefore, \(\mathcal{H}\) does not α-represent \(\operatorname {\mathtt {POINT}}_{d}\). □
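The pigeonhole argument in this proof is easy to illustrate on a small instance. The sketch below is not from the source: the function names are hypothetical, and a hypothesis is represented (as an assumption of the sketch) by the set of inputs on which it outputs 1.

```python
def find_identical_columns(hypotheses, T):
    """Return inputs (j, j2), j < j2, on which every hypothesis agrees.

    hypotheses: list of sets A, each a subset of {1, ..., T}, where the set A
    stands for the hypothesis that outputs 1 exactly on A.
    Returns None if all T columns of the table are distinct.
    """
    first_with_column = {}
    for x in range(1, T + 1):
        column = tuple(x in h for h in hypotheses)
        if column in first_with_column:
            return first_with_column[column], x
        first_with_column[column] = x
    return None


def best_error_on_pair(hypotheses, j, j2):
    """Smallest error of any hypothesis w.r.t. c_j when the distribution
    puts probability mass 1/2 on each of j and j2."""
    def error(h):
        errs_on_j = 0 if j in h else 1    # c_j(j) = 1
        errs_on_j2 = 1 if j2 in h else 0  # c_j(j2) = 0
        return 0.5 * errs_on_j + 0.5 * errs_on_j2
    return min(error(h) for h in hypotheses)
```

With d=3 (so T=8) and only two hypotheses, there are at most four distinct columns, so `find_identical_columns` must return a pair, and `best_error_on_pair` on that pair is at least 1/2.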

4 Proper private learning vs. improper private learning

We now use \(\operatorname {\mathtt {POINT}}_{d}\) to show a separation between proper and improper private PAC learning. One way of achieving a smaller sample complexity is to use Theorem 3.2 to improperly learn \(\operatorname {\mathtt {POINT}}_{d}\) with a hypothesis class \(\mathcal{H}\) that α-represents \(\operatorname {\mathtt {POINT}}_{d}\), but is of size smaller than \(|\operatorname {\mathtt {POINT}}_{d}|\). By Lemma 3.10, we know that every such \(\mathcal{H}\) should have at least d hypotheses.

In Sect. 4.1, we show that there does exist a hypothesis class \(\mathcal{H}\) with \(|\mathcal{H}|=O(d)\) that α-represents \(\operatorname {\mathtt {POINT}}_{d}\). This immediately gives a separation: proper private learning of \(\operatorname {\mathtt {POINT}}_{d}\) requires \(\Omega_{\alpha,\beta,\epsilon}(d)\) samples, whereas \(\operatorname {\mathtt {POINT}}_{d}\) can be improperly privately learned using \(O_{\alpha,\beta,\epsilon}(\log d)\) samples.Footnote 6

We conclude that α-representing hypothesis classes can be a natural and powerful tool for constructing efficient private learners. One may even be tempted to think that no better learners exist, and furthermore, that the sample complexity of private learning is characterized by the size of the smallest hypothesis class that α-represents the concept class. Our second result, presented in Sect. 4.2, shows that this is not the case, and in fact, other techniques yield a much more efficient learner using only \(O_{\alpha,\beta,\epsilon}(1)\) samples, hence demonstrating the strongest possible separation between proper and improper private learners. The reader interested only in the stronger result may choose to skip directly to Sect. 4.2.

4.1 Improper private learning of \(\operatorname {\mathtt {POINT}}_{d}\) using \(O_{\alpha,\beta,\epsilon}(\log d)\) samples

We next construct a private learner applying the construction of Theorem 3.2 to the class \(\operatorname {\mathtt {POINT}}_{d}\). For that we (randomly) construct a hypothesis class \(\mathcal{H}_{d}\) that α-represents the concept class \(\operatorname {\mathtt {POINT}}_{d}\), where \(|\mathcal{H}_{d}| = O_{\alpha}(d)\). Lemma 3.10 shows that this is optimal up to constant factors. In the rest of this section, a set A⊆[T] represents the hypothesis h A , where h A (i)=1 if \(i\in A\) and h A (i)=0 otherwise.

To demonstrate the main idea of our construction, we begin with a construction of a hypothesis class \(\mathcal{H}_{d} = \{ A_{1},\ldots,A_{k} \}\) that α-represents \(\operatorname {\mathtt {POINT}}_{d}\), where \(k = O(\sqrt{T}/\alpha)=O(\sqrt{2^{d}}/\alpha)\) (this should be compared to the size of \(\operatorname {\mathtt {POINT}}_{d}\) which is \(2^{d}\)). Every \(A_{i} \in \mathcal{H}_{d}\) is a subset of {1,…,T}, such that

  (1) For every j∈{1,…,T} there are more than 1/α sets in \(\mathcal{H}_{d}\) that contain j; and

  (2) For every \(1\leq i_{1}<i_{2}\leq k\), \(|A_{i_{1}} \cap A_{i_{2}}|\leq1\).

We next argue that the class \(\mathcal{H}_{d}\) α-represents \(\operatorname {\mathtt {POINT}}_{d}\). For every concept \(c_{j} \in \operatorname {\mathtt {POINT}}_{d}\) there are hypotheses \(A_{1},\ldots,A_{p} \in \mathcal{H}_{d}\) that contain j (where p=⌊1/α⌋+1) and are otherwise disjoint (that is, the intersection between any two sets \(A_{i_{1}}\) and \(A_{i_{2}}\) is exactly j). Fix a distribution \(\mathcal{D}\). For every A i , \(\mathop {\rm error}_{\mathcal{D}}(c_{j},A_{i})=\Pr_{\mathcal{D}} [A_{i} \setminus \{j\}]\). Since there are more than 1/α such sets and the sets A i ∖{j} are disjoint, there exists at least one set such that \(\mathop {\rm error}_{\mathcal{D}}(c_{j},A_{i})\leq\alpha\). Thus, \(\mathcal{H}_{d}\) α-represents the concept class \(\operatorname {\mathtt {POINT}}_{d}\).

We want to show that there is a hypothesis class, whose size is \(O(\sqrt{T}/\alpha)\), that satisfies the above two requirements. As an intermediate step, we show a construction of size O(T). We consider a projective plane with T points and T lines (each line is a set of points) such that for any two points there is exactly one line containing them and for any two lines there is exactly one point contained in both of them. Such a projective plane exists whenever \(T=q^{2}+q+1\) for a prime power q (see, e.g., Hughes and Piper 1973). Furthermore, the number of lines passing through each point is q+1. If we take the lines as the hypothesis class for q≥1/α, then they satisfy the above requirements, thus, they α-represent \(\operatorname {\mathtt {POINT}}_{d}\). However, the number of hypotheses in the class is T and no progress was made.
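The intermediate projective-plane construction can be checked on small instances. The following sketch builds the plane PG(2,q) from homogeneous coordinates and tests requirements (1) and (2). It rests on several assumptions that are not in the source: q is taken to be a prime (not a general prime power) so that arithmetic modulo q is a field, the function names are hypothetical, and the number of points is q²+q+1 rather than a power of two. It uses the three-argument `pow` with exponent −1 for modular inverses, which requires Python 3.8 or later.

```python
from itertools import combinations, product


def projective_plane(q):
    """Lines of the projective plane PG(2, q) for a prime q.

    Points are numbered 1, ..., q*q + q + 1; each line is returned as a
    frozenset of point numbers.
    """
    def normalized(v):
        # Scale v so that its first nonzero coordinate equals 1.
        for a in v:
            if a % q:
                inv = pow(a, -1, q)
                return tuple((inv * c) % q for c in v)
        raise ValueError("the zero vector is not a projective point")

    reps = sorted({normalized(v) for v in product(range(q), repeat=3) if any(v)})
    number = {pt: i + 1 for i, pt in enumerate(reps)}
    lines = []
    for normal in reps:
        # A point lies on the line with this normal iff the dot product is 0 mod q.
        line = frozenset(
            number[pt] for pt in reps
            if sum(a * b for a, b in zip(normal, pt)) % q == 0
        )
        lines.append(line)
    return lines


def satisfies_requirements(lines, num_points, alpha):
    """Check requirements (1) and (2) for the sets in `lines`."""
    covered = all(
        sum(1 for line in lines if j in line) > 1 / alpha
        for j in range(1, num_points + 1)
    )
    small_overlap = all(len(a & b) <= 1 for a, b in combinations(lines, 2))
    return covered and small_overlap
```

For q=3 the plane has 13 points and 13 lines, every point lies on q+1=4 lines, and so the lines satisfy both requirements for α=1/3 but not for α=1/4, where more than four sets per point would be needed.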

We modify the above projective plane construction. We start with a projective plane with 2T points and choose a subset of the lines: We choose each line at random with probability \(O(1/(\sqrt{T}\alpha))\). Since these lines are part of the projective plane, they satisfy the above requirement (2). It can be shown that with positive probability for at least half of the j’s requirement (1) is satisfied and the number of chosen lines is \(O(\sqrt{T}/\alpha)\). We choose such lines, eliminate points that are contained in less than 1/α chosen lines, and get the required construction with T points and \(O(\sqrt{T}/\alpha)\) lines. The details of the last steps are omitted. We next show a much more efficient construction based on the above idea.

Lemma 4.1

For every α<1, there is a hypothesis class \(\mathcal{H}_{d}\) that α-represents \(\operatorname {\mathtt {POINT}}_{d}\) such that \(|\mathcal{H}_{d}| = O(d/\alpha^{2})\).

Proof

We will show how to construct a hypothesis class \(\mathcal{H}_{d}=\{ S_{1},\ldots,S_{k} \}\), where every \(S_{i} \in \mathcal{H}_{d}\) is a subset of {1,…,T} and for every j

$$ \begin{aligned} &\mbox{There are}\ p=\log T\cdot (1+\lfloor 1/\alpha \rfloor )\ \mbox{sets}\ A_1,\ldots,A_p\ \mbox{in}\ \mathcal{H}_d\ \mbox{that contain}\ j\ \mbox{such that}\\ &\mbox{for every}\ b\neq j,\ \mbox{the point}\ b\ \mbox{is contained in less than}\ \log T\ \mbox{of the sets}\ A_1,\ldots,A_p. \end{aligned} $$
(3)

First we show that \(\mathcal{H}_{d}\) α-represents \(\operatorname {\mathtt {POINT}}_{d}\). Fix a concept \(c_{j} \in \operatorname {\mathtt {POINT}}_{d}\) and a distribution \(\mathcal{D}\), and consider hypotheses A 1,…,A p in \(\mathcal{H}_{d}\) that contain j. Since every point in these hypotheses is contained in less than logT sets,

$$\begin{aligned} \sum_{i=1}^p\Pr_{\mathcal{D}} \bigl[A_i \setminus \{ j \}\bigr] < \log T \cdot\Pr_{\mathcal{D}}\Biggl[\bigcup _{i=1}^p \bigl(A_i \setminus \{ j \}\bigr) \Biggr] \leq \log T. \end{aligned}$$

Thus, there exists at least one set A i such that \(\mathop {\rm error}_{\mathcal{D}}(c_{j},A_{i}) =\Pr_{\mathcal{D}}[A_{i} \setminus \{j\}] \leq\log T /p < \alpha\). This implies that \(\mathcal{H}_{d}\) α-represents the concept class \(\operatorname {\mathtt {POINT}}_{d}\).

We next show how to construct \(\mathcal{H}_{d}\). Let \(k=8\mathrm{e}p^{2}/\log T\) (that is, \(k=O(\log T/\alpha^{2})\)). We choose k random subsets of {1,…,2T} of size 4pT/k. We will show that a point j satisfies (3) with probability at least 3/4. We assume d≥16 (and hence, p≥16 and T≥16).

Fix j. The expected number of sets that contain j is k⋅(4pT/k)/(2T)=2p, thus, by Chebyshev inequality, the probability that less than p sets contain j is less than 2/p≤1/8. We call this event BAD 1.

Let j be such that there are at least p sets that contain j and let A 1,…,A p be p of them. Notice that A 1∖{j},…,A p ∖{j} are random subsets of {1,…,2T}∖{j} of size (4pT/k)−1. Now fix \(b\neq j\). The probability that a random subset of {1,…,2T}∖{j} of size (4pT/k)−1 contains b is (4pT/k−1)/(2T−1)<2p/k. For log T random sets of size (4pT/k)−1, the probability that all of them contain b is less than \((2p/k)^{\log T}\). Thus, the probability that there is a b∈{1,…,2T}, where \(b\neq j\), and log T sets among A 1,…,A p such that these log T sets contain b is less than

$$\begin{aligned} 2T \cdot\binom{p}{\log T} (2p/k )^{\log T} \leq& 2T \cdot(\mathrm {e}p/\log T )^{\log T} (2p/k )^{\log T}\quad\bigl(\mbox{where}\ \mathrm{e}= \exp(1)\bigr) \\ =& 2T \cdot\bigl(2\mathrm{e}p^2/(k \log T) \bigr)^{\log T}. \end{aligned}$$

By the choice of k, \(2\mathrm{e}p^{2}/(k\log T)=1/4\), thus, the above probability is at most \(2T\cdot(1/4)^{\log T}=2/T\leq1/8\). We call this event BAD 2.

To conclude, the probability that j does not satisfy (3) is the probability that either BAD 1 or BAD 2 happens which is at most 1/4. Therefore, the expected number of j’s that do not satisfy (3) is less than T/2. By Markov's inequality, the probability that more than T points j do not satisfy (3) is less than 1/2. We take \(k=O(\log T/\alpha^{2})\) subsets of {1,…,2T}, denoted S 1,…,S k , such that at least T points j satisfy (3). By the probabilistic argument above, such sets exist. Let V be a set of size T of the points that satisfy (3), and define \(\mathcal{H}_{d}=\{ S_{1}\cap V,\ldots,S_{k}\cap V \}\). Finally, by a simple renaming, we can assume that \(\mathcal{H}_{d}\) contains subsets of {1,…,T} as required. □
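The parameter choices in the proof of Lemma 4.1 can be checked numerically. The sketch below simply evaluates the quantities defined in the proof for given d and α; the function name and the returned dictionary keys are hypothetical, logarithms are taken to base 2 (so log T = d), and k and the set size are left as real numbers, whereas an actual construction would have to round them (an assumption not discussed in the text).

```python
import math


def lemma41_parameters(d, alpha):
    """Evaluate p, k, the subset size, and the bounds on BAD_1 and BAD_2."""
    log_t = d                      # log T with T = 2**d
    T = 2 ** d
    p = log_t * (1 + math.floor(1 / alpha))
    k = 8 * math.e * p ** 2 / log_t
    set_size = 4 * p * T / k
    expected_hits = k * set_size / (2 * T)   # expected number of sets containing j
    bad1_bound = 2 / p
    bad2_bound = 2 * T * (2 * math.e * p ** 2 / (k * log_t)) ** log_t
    return {
        "p": p,
        "k": k,
        "set_size": set_size,
        "expected_hits": expected_hits,
        "bad1_bound": bad1_bound,
        "bad2_bound": bad2_bound,
    }
```

For d=16 and α=1/4 this gives p=80, an expected 2p=160 sets containing any fixed point, and bounds 2/p and 2/T on the two bad events, whose sum is well below 1/4.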

From Lemma 4.1 and Theorem 3.2 we get:

Theorem 4.2

There exists an improper private PAC learner for \(\operatorname {\mathtt {POINT}}_{d}\) that uses \(O((\log d +\log\frac{1}{\alpha} +\log\frac{1}{\beta})/(\epsilon \alpha))\) samples, where ϵ,α, and β are the parameters of the private learner.

There is a difference between the use of improper learning in Theorem 4.2 and the typical use of improper learning in non-private settings. Typically, a non-private learner uses a hypothesis class that is larger than the concept class. This larger class enables learning in polynomial time. In contrast, we get an improved sample complexity by learning with a hypothesis class that is smaller than the concept class.

4.2 Improper private learning of \(\operatorname {\mathtt {POINT}}_{d}\) using \(O_{\alpha,\beta,\epsilon}(1)\) samples

We now show a stronger separation result, namely, that \(\operatorname {\mathtt {POINT}}_{d}\) can be privately (and efficiently) learned by an improper learner using just \(O_{\alpha,\beta,\epsilon}(1)\) samples. We begin by presenting a non-private improper PAC learner \(\mathcal{A}_{1}\) for \(\operatorname {\mathtt {POINT}}_{d}\) that succeeds with only constant probability. Roughly, \(\mathcal{A}_{1}\) applies a simple proper learner for \(\operatorname {\mathtt {POINT}}_{d}\), and then modifies its outcome by adding random “noise”. We then use sampling to convert \(\mathcal{A}_{1}\) into a private learner \(\mathcal{A}_{2}\); like \(\mathcal{A}_{1}\) the probability that \(\mathcal{A}_{2}\) succeeds in learning \(\operatorname {\mathtt {POINT}}_{d}\) is only a constant. Later we amplify the success probability of \(\mathcal{A}_{2}\) to get a private PAC learner. Both \(\mathcal{A}_{1}\) and \(\mathcal{A}_{2}\) are inefficient as they output hypotheses with exponential description length. However, using a pseudorandom function it is possible to compress the outputs of \(\mathcal{A}_{1}\) and \(\mathcal{A}_{2}\), and obtain a private learning algorithm whose running time is efficient. This is explained in Sect. 4.2.1.

Algorithm \(\mathcal {A}_{2}\) described below is \(\epsilon^{\star}\)-differentially private, where \(\epsilon^{\star}=\ln(4)\) is a fixed constant. To construct an ϵ-differentially private algorithm for every ϵ, we describe a transformation in Lemma 4.4 that takes a bigger sample, replaces some samples with ⋆, and executes \(\mathcal{A}_{2}\) on the resulting sample. Therefore, we assume that some of the sample points given to \(\mathcal{A}_{1}\) and \(\mathcal{A}_{2}\) are ⋆.

Algorithm \(\mathcal{A}_{1}\)

Given a sample z 1,…,z m , where every z i is either a labeled example (x i ,y i ) or ⋆, Algorithm \(\mathcal {A}_{1}\) performs the following:

  1. If z 1,…,z m is not consistent with any concept in \(\operatorname {\mathtt {POINT}}_{d}\), return ⊥ (this happens only if there are two indices i,j∈[m] with z i =(x i ,y i ) and z j =(x j ,y j ) such that either (1) \(x_{i}\neq x_{j}\) and y i =y j =1 or (2) x i =x j and \(y_{i}\neq y_{j}\)).

  2. If y i =0 for all i∈[m] such that z i ≠⋆, then let \(c= {\bf0}\) (the all zero hypothesis); otherwise, let c be the (unique) hypothesis from \(\operatorname {\mathtt {POINT}}_{d}\) that is consistent with the labeled examples in the sample.

  3. Modify c at random to get a hypothesis h: for each x∈[T] independently let h(x)=1−c(x) with probability α/8 and, otherwise let h(x)=c(x). Return h.
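The three steps of Algorithm \(\mathcal{A}_{1}\) can be sketched directly for small T. This is an illustration only, under assumptions that are not in the source: the symbol ⋆ is represented by the string `"*"`, the output ⊥ by `None`, and a hypothesis by an explicit list of T+1 bits (index 0 unused). The explicit list reflects the exponential description length mentioned above, so the sketch is practical only for small d.

```python
import random

STAR = "*"  # stands for the placeholder symbol ⋆ (a choice made for this sketch)


def algorithm_a1(sample, T, alpha, rng=random):
    """Non-private improper learner for POINT_d over the domain {1, ..., T}.

    sample: list whose entries are either STAR or pairs (x, y) with
    x in {1, ..., T} and y in {0, 1}.
    Returns None (for ⊥) or a list h with h[x] in {0, 1} for x = 1, ..., T.
    """
    labeled = [z for z in sample if z != STAR]
    positives = {x for x, y in labeled if y == 1}
    negatives = {x for x, y in labeled if y == 0}

    # Step 1: two distinct positive points, or one point with both labels,
    # means that no concept in POINT_d is consistent with the sample.
    if len(positives) > 1 or positives & negatives:
        return None

    # Step 2: the all-zero hypothesis, or the unique consistent point concept.
    c = [0] * (T + 1)
    for x in positives:
        c[x] = 1

    # Step 3: flip each value independently with probability alpha / 8.
    h = list(c)
    for x in range(1, T + 1):
        if rng.random() < alpha / 8:
            h[x] = 1 - c[x]
    return h
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible; with `alpha` set to 0 no value is flipped, so the output equals the hypothesis c chosen in Step 2.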

We next argue that if the sample z 1,…,z m contains at least 2ln(4)/α examples z i =(x i ,y i ) such that each x i is drawn i.i.d. according to a distribution \(\mathcal{D}\) on [T], and the examples are labeled consistently according to some \(c_{j} \in \operatorname {\mathtt {POINT}}_{d}\), then \(\Pr[\mathop {\rm error}_{\mathcal{D}}(c_{j},c) \geq\alpha/2] \leq1/4\). If the examples are labeled consistently according to some \(c_{j} \ne{\bf0}\), then \(c\neq c_{j}\) only if (j,1) is not in the sample and in this case \(c= {\bf0}\). If \(\Pr_{x \sim \mathcal{D}}[x=j] < \alpha/2\) and (j,1) is not in the sample, then \(c={\bf0}\) and \(\mathop {\rm error}_{\mathcal{D}}(c_{j},{\bf0}) < \alpha/2\). Otherwise \(\Pr_{x \sim \mathcal{D}}[x=j] \geq\alpha/2\); thus, the probability that all examples of the form (x i ,y i ) are not (j,1) is at most \(((1-\alpha/2)^{2/\alpha})^{\ln(4)}\leq1/4\) (as there are at least 2ln(4)/α such examples).

To see that \(\mathcal{A}_{1}\) PAC learns \(\operatorname {\mathtt {POINT}}_{d}\) (with confidence at least 1/2) note that,

$$\operatorname{\mathbb{E}}_h\bigl[\mathop {\rm error}_\mathcal{D}(c,h)\bigr] =\operatorname{\mathbb{E}}_h \operatorname{\mathbb{E}}_{x\sim \mathcal{D}}\bigl[ \bigl|h(x)-c(x)\bigr|\bigr] = \operatorname{\mathbb{E}}_{x\sim \mathcal{D}}\operatorname{\mathbb{E}}_h \bigl[ \bigl|h(x)-c(x)\bigr|\bigr]= \frac{\alpha}{8}, $$

and hence, using Markov’s inequality,

$$\Pr_h\bigl[\mathop {\rm error}_\mathcal{D}(c,h) \geq\alpha/2\bigr] \leq1/4. $$

Combining this with \(\Pr[\mathop {\rm error}_{\mathcal{D}}(c_{j},c) \geq\alpha/2] \leq1/4\) and \(\mathop {\rm error}_{\mathcal{D}}(c_{j},h) \leq \mathop {\rm error}_{\mathcal{D}}(c_{j},c) + \mathop {\rm error}_{\mathcal{D}}(c,h)\), implies that \(\Pr[\mathop {\rm error}_{\mathcal{D}}(c_{j},h) \geq\alpha] \leq1/2\).

Algorithm \(\mathcal{A}_{2}\)

We now modify the learner \(\mathcal{A}_{1}\) to get a private learner \(\mathcal{A}_{2}\) (a similar idea was used in Kasiviswanathan et al. (2011) for learning parity functions). Given a sample \(z_{1},\ldots,z_{m'}\), where every z i is either a labeled example (x i ,y i ) or ⋆, Algorithm \(\mathcal {A}_{2}\) performs the following:

  1. With probability α/8, return ⊥.

  2. Construct a set S⊆[m′] by picking each element of [m′] with probability p=α/4.

  3. Run the non-private learner \(\mathcal{A}_{1}\) on the examples indexed by S.
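The steps of Algorithm \(\mathcal{A}_{2}\) amount to a thin randomized wrapper around the non-private learner. In the sketch below, which is an illustration rather than the authors' code, the non-private learner is passed in as a callable named `learner` (a hypothetical parameter), ⊥ is represented by `None`, and S is sampled over all entries of the sample, including ⋆ entries, on the assumption that the learner ignores them.

```python
import random


def algorithm_a2(sample, alpha, learner, rng=random):
    """Private wrapper: abort with small probability, subsample, then learn.

    sample: list of entries z_1, ..., z_{m'} (labeled examples or ⋆).
    learner: callable taking a list of entries, e.g. an implementation of A_1.
    Returns None (for ⊥) or whatever `learner` returns on the subsample.
    """
    # Step 1: with probability alpha / 8, return ⊥ regardless of the data.
    if rng.random() < alpha / 8:
        return None

    # Step 2: keep each index independently with probability p = alpha / 4.
    p = alpha / 4
    subsample = [z for z in sample if rng.random() < p]

    # Step 3: run the non-private learner on the entries indexed by S.
    return learner(subsample)
```

A call such as `algorithm_a2(sample, alpha, lambda s: my_a1(s, T, alpha))` would plug in a concrete learner, where `my_a1` is a placeholder for any implementation of \(\mathcal{A}_{1}\). The data-independent ⊥ in Step 1 is what keeps the probability of outputting ⊥ bounded away from zero on every database, which the privacy analysis below relies on.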

Claim 4.3

Let α<1/2, \(\epsilon^{\star}=\ln(4)\), and \(\beta^{\star}=3/4\). Algorithm \(\mathcal {A}_{2}\) is an \(\epsilon^{\star}\)-differentially private \((\alpha,\beta^{\star})\)-PAC learner for the class \(\operatorname {\mathtt {POINT}}_{d}\) provided that it is given a sample which contains at least \(32\ln(4)/\alpha^{2}\) labeled examples (i.e., \(m'\geq 32\ln(4)/\alpha^{2}\)).

Proof

We first show that \(\mathcal{A}_{2}\) PAC learns \(\operatorname {\mathtt {POINT}}_{d}\) with confidence at least \(\beta^{\star}=3/4\). Let S be the set chosen by \(\mathcal{A}_{2}\). The expected number of labeled examples indexed by S is at least \(p\cdot 32\ln(4)/\alpha^{2}=8\ln(4)/\alpha\). By Chernoff bound, the probability that the sample indexed by S contains fewer than 2ln(4)/α (in fact, 4ln(4)/α) samples is less than exp(−ln(4)/α)<1/16 (since \(\mathcal{A}_{2}\) gets at least \(32\ln(4)/\alpha^{2}\) labeled examples and α<1/2). Algorithm \(\mathcal {A}_{2}\) can err only when either \(\mathcal{A}_{1}\) does not get 2ln(4)/α labeled examples, or when \(\mathcal{A}_{1}\) errs, or when \(\mathcal{A}_{2}\) returns ⊥ in Step (1). Therefore, we get that \(\mathcal{A}_{2}\) PAC learns \(\operatorname {\mathtt {POINT}}_{d}\) with accuracy parameter α′=α and confidence parameter β′=1/16+1/2+α/8≤3/4.

We next show that \(\mathcal{A}_{2}\) is \(\epsilon^{\star}\)-differentially private. Let D,D′ be two neighboring databases, and assume that they differ on the ith entry. Recall that after sampling S, one of them can be consistent with some c j , while the other might not be consistent. First let us analyze the probability of \(\mathcal{A}_{2}\) outputting ⊥:

$$\begin{aligned} \frac{\Pr[ \mathcal{A}_2(D)=\perp]}{\Pr[ \mathcal{A}_2(D')=\perp]} =& \frac {p \cdot\Pr[ \mathcal{A}_2(D)=\perp\ | \ i\in S] + (1-p) \cdot\Pr [ \mathcal{A}_2(D)=\perp\ | \ i\notin S]}{p\cdot\Pr[ \mathcal{A}_2(D')=\perp\ | \ i\in S] + (1-p) \cdot\Pr[ \mathcal{A}_2(D')=\perp\ | \ i\notin S]} \\ \leq& \frac{p\cdot1 + (1-p) \cdot\Pr[ \mathcal{A}_2(D)=\perp\ | \ i\notin S]}{p\cdot0 + (1-p) \cdot\Pr[ \mathcal{A}_2(D')=\perp\ | \ i\notin S]} \\ =&\frac{p}{(1-p) \cdot\Pr[ \mathcal{A}_2(D')=\perp\ | \ i\notin S]}+1 \leq\frac{8p}{\alpha(1-p)} + 1, \end{aligned}$$

where the last equality follows by noting that if \(i\notin S\) then \(\mathcal{A}_{2}\) is equally likely to output ⊥ on D and D′, and the last inequality follows as ⊥ is returned with probability α/8 in Step (1) of Algorithm \(\mathcal {A}_{2}\).

For the more interesting case, where \(\mathcal{A}_{2}\) outputs a hypothesis h, we get:

$$\begin{aligned} \frac{\Pr[ \mathcal{A}_2(D)=h]}{\Pr[ \mathcal{A}_2(D')=h]} =& \frac{p \cdot\Pr[ \mathcal{A}_2(D)=h \ | \ i\in S] + (1-p) \cdot\Pr [ \mathcal{A}_2(D)=h \ | \ i\notin S]}{p\cdot\Pr[ \mathcal{A}_2(D')=h \ | \ i\in S] + (1-p) \cdot\Pr[ \mathcal{A}_2(D')=h \ | \ i\notin S]} \\ \leq& \frac{p \cdot\Pr[ \mathcal{A}_2(D)=h \ | \ i\in S] + (1-p) \cdot \Pr [ \mathcal{A}_2(D)=h \ | \ i\notin S]}{p\cdot0 + (1-p) \cdot\Pr[\mathcal{A}_2(D')=h \ | \ i\notin S]} \\ =&\frac{p}{1-p}\cdot\frac{\Pr[ \mathcal{A}_2(D)=h \ | \ i\in S]}{\Pr [\mathcal{A}_2(D)=h \ | \ i\notin S]}+1, \end{aligned}$$

where the last equality uses the fact that if i∉S then \(\mathcal{A}_{2}\) is equally likely to output h on D and D′. If in D the ith row is ⋆, then \(\Pr[ \mathcal{A}_{2}(D)=h \ | \ i\in S]=\Pr[ \mathcal{A}_{2}(D)=h \ | \ i\notin S] =\Pr[ \mathcal{A}_{2}(D')=h \ | \ i\notin S]\), and the above ratio is bounded by \(p/(1-p)+1=1/(1-\alpha/4) < 4/3 < e^{\epsilon^{\star}}\).

To complete the proof, we need to bound the ratio of \(\Pr[ \mathcal{A}_{2}(D)=h \ | \ i\in S]\) to \(\Pr[ \mathcal{A}_{2}(D)=h \ | \ i\notin S]\) when \(z_{i}=(x_{i},y_{i})\).

$$\begin{aligned} &\frac{\Pr[ \mathcal{A}_2(D)=h \ | \ i\in S]}{\Pr [ \mathcal{A}_2(D)=h \ | \ i\notin S]} \\ &\quad {} = \frac{\sum_{R \subseteq[m'] \setminus\{i\}}\Pr[ \mathcal{A}_2(D)=h \ | \ S =R\cup\{i\}]\cdot\Pr[\text{$ \mathcal{A}_{2}$ selects $R$ from $[m']\setminus\{i\}$}]}{\sum_{R \subseteq[m'] \setminus\{i\}}\Pr [ \mathcal{A}_2(D)=h \ | \ S =R]\cdot\Pr[\text{$ \mathcal{A}_{2}$ selects $R$ from $[m']\setminus\{i\}$}]} \\ &\quad {} \leq \max_{R \subseteq[m'] \setminus\{i\}}\frac{\Pr[ \mathcal{A}_2(D)=h \ | \ S =R\cup\{i\}]}{\Pr[ \mathcal{A}_2(D)=h \ | \ S =R]} . \end{aligned}$$
(4)

In the max in (4), we only need to consider sets R such that the sample labeled by the elements in R is consistent, that is, \(\Pr[ \mathcal{A}_{2}(D)=h \ | \ S =R] > 0\). Now having or not having access to \((x_{i},y_{i})\) can only affect the choice of \(h(x_{i})\), and since \(\mathcal{A}_{1}\) flips the output with probability α/8, we get

$$\max_{R \subseteq[m'] \setminus\{i\}} \frac{\Pr[ \mathcal{A}_2(D)=h \ | \ S= R \cup\{i\}]}{\Pr[ \mathcal{A}_2(D)=h \ | S=R]} \leq\frac{1-\alpha /8}{\alpha /8} \leq \frac{8}{\alpha}. $$

Putting everything together, we get

$$\frac{\Pr[ \mathcal{A}_2(D)=h]}{\Pr[ \mathcal{A}_2(D')=h]} \leq\frac {8p}{\alpha (1-p)} + 1 = \frac{8}{(4-\alpha)} + 1 < 3 + 1 = e^{\epsilon^\star}. $$

 □
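The final chain of inequalities can be checked numerically. The sketch below assumes the sampling probability is p=α/4, which is inferred from the identities \(p/(1-p)+1=1/(1-\alpha/4)\) and \(8p/(\alpha(1-p))=8/(4-\alpha)\) used in the proof; the function names are illustrative only.

```python
import math

def ratio_bound(alpha: float) -> float:
    """Evaluate 8p / (alpha * (1 - p)) + 1 with p = alpha / 4."""
    p = alpha / 4.0  # sampling probability, inferred from p/(1-p)+1 = 1/(1-alpha/4)
    return 8.0 * p / (alpha * (1.0 - p)) + 1.0

def check_bound(alphas=(0.01, 0.1, 0.25, 0.49)) -> bool:
    """Confirm the closed form 8/(4-alpha)+1 and the bound 4 = exp(ln 4) on a grid."""
    for alpha in alphas:
        bound = ratio_bound(alpha)
        if not math.isclose(bound, 8.0 / (4.0 - alpha) + 1.0):
            return False
        if not bound < 4.0:
            return False
    return True
```

For every α<1/2 the bound stays below 4, which matches the value \(e^{\epsilon^{\star}}\) with \(\epsilon^{\star}=\ln(4)\).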

Algorithm \(\mathcal {A}_{2}\) is \(\epsilon^{\star}\)-differentially private for some fixed \(\epsilon^{\star}\). We reduce \(\epsilon^{\star}\) to any desired ϵ using the following lemma (implicit in Kasiviswanathan et al. (2011)). In this lemma, we assume that the learning algorithm can handle “undefined entries”, i.e., entries of the form ⋆ (Footnote 7).

Lemma 4.4

Let \(\mathcal{A}\) be an \(\epsilon^{\star}\)-differentially private algorithm. Construct an algorithm \(\mathcal{B}\) that on input a database \(D=(d_{1},\ldots,d_{n})\) constructs a new database \(D_{s}\) whose ith entry is \(d_{i}\) with probability \(f(\epsilon,\epsilon^{\star})=(\exp(\epsilon)-1)/(\exp(\epsilon^{\star})+\exp(\epsilon)-\exp(\epsilon-\epsilon^{\star})-1)\) and ⋆ otherwise, and then runs \(\mathcal{A}\) on \(D_{s}\). Then, \(\mathcal{B}\) is ϵ-differentially private.

Proof

Let D,D′ be neighboring databases, and assume they differ on the ith entry. Let S⊆[n] denote the indices of the random set of entries that are not changed to ⋆. Let \(q=f(\epsilon,\epsilon^{\star})\). Since D and D′ differ in just the ith entry, for any outcome t, \(\Pr[\mathcal{A}(D_{s}) = t | i \notin S] = \Pr[\mathcal{A}(D'_{s}) = t | i \notin S]\). Thus,

$$\begin{aligned} &\frac{\Pr[\mathcal{B}(D) = t]}{\Pr[\mathcal{B}(D')=t]} \\ &\quad {} = \frac{q\cdot\Pr[\mathcal{A}(D_s) = t | i\in S] + (1-q)\cdot\Pr [\mathcal{A}(D_s) = t | i \notin S]}{q\cdot\Pr[\mathcal{A}(D'_s) = t | i\in S] + (1-q)\cdot\Pr[\mathcal{A}(D_s) = t | i \notin S]} \\ &\quad {} = \frac{\sum_{R \subseteq[n]\setminus\{i\}}\Pr[S\setminus\{ i\} = R ]\cdot(q \cdot\Pr[\mathcal{A}(D_s) = t | S = R\cup\{i\}] + (1-q)\cdot \Pr [\mathcal{A}(D_s) = t | S = R])}{\sum_{R \subseteq[n]\setminus\{i\}}\Pr [S\setminus\{i\} = R ]\cdot(q \cdot\Pr[\mathcal{A}(D'_s) = t | S = R\cup \{ i\}] + (1-q)\cdot\Pr[\mathcal{A}(D_s) = t | S = R])} \\ &\quad {} \leq \max_{R \subseteq[n]\setminus\{i\}} \frac{q \cdot\Pr [\mathcal{A}(D_s) = t | S = R\cup\{i\}] + (1-q)\cdot\Pr[\mathcal{A}(D_s) = t | S = R]}{q \cdot\Pr[\mathcal{A}(D'_s) = t | S = R\cup\{i\}] + (1-q)\cdot\Pr[\mathcal{A}(D_s) = t | S = R]} \\ &\quad {} \leq \max_{R \subseteq[n]\setminus\{i\}} \frac{q\cdot\exp (\epsilon ^\star)\cdot\Pr[\mathcal{A}(D_s) = t | S = R] + (1-q)\cdot\Pr[\mathcal{A}(D_s) = t | S = R]}{q \cdot\exp(-\epsilon ^\star)\cdot\Pr[\mathcal{A}(D_s) = t | S = R] + (1-q)\cdot\Pr[\mathcal{A}(D_s) = t | S = R]} \\ &\quad {} = \frac{1 + q\cdot(\exp(\epsilon ^\star) -1)}{1 - q \cdot(1-\exp (-\epsilon ^\star))} = \exp(\epsilon ). \end{aligned}$$

The last inequality follows because by the guarantees of differential privacy

$$\begin{aligned} \Pr\bigl[\mathcal{A}(D_s) = t | S = R\cup\{i\}\bigr] \leq \exp\bigl( \epsilon ^\star\bigr) \cdot\Pr\bigl[\mathcal{A}(D_s) = t | S = R \cup \emptyset\bigr], \end{aligned}$$

and

$$\begin{aligned} \Pr\bigl[\mathcal{A}\bigl(D'_s\bigr) = t | S = R\cup\{i\} \bigr] \geq& \exp\bigl(-\epsilon ^\star\bigr)\cdot\Pr\bigl[\mathcal{A}\bigl(D'_s\bigr) = t | S = R \cup\emptyset\bigr] \\ = & \exp\bigl(-\epsilon ^\star\bigr)\cdot\Pr\bigl[\mathcal{A}(D_s) = t | S = R \cup\emptyset\bigr] \quad\bigl(\mbox{as}\ R \subseteq [n]\setminus \{i\}\bigr). \end{aligned}$$

Therefore, \(\mathcal{B}\) is an ϵ-differentially private algorithm. □
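The transformation of Lemma 4.4 can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the undefined entry ⋆ is represented by `None`, and the names `keep_probability`, `subsample`, and `amplified_ratio` are hypothetical. The last function evaluates the closing expression of the proof so that the identity with exp(ϵ) can be checked numerically.

```python
import math
import random

STAR = None  # stand-in for the undefined entry (this representation is an assumption)

def keep_probability(eps: float, eps_star: float) -> float:
    """f(eps, eps_star) as defined in Lemma 4.4."""
    return (math.exp(eps) - 1.0) / (
        math.exp(eps_star) + math.exp(eps) - math.exp(eps - eps_star) - 1.0
    )

def subsample(database, eps, eps_star, rng=random):
    """Build D_s: keep each entry with probability f, otherwise replace it by STAR."""
    q = keep_probability(eps, eps_star)
    return [d if rng.random() < q else STAR for d in database]

def amplified_ratio(eps: float, eps_star: float) -> float:
    """Last expression in the proof; it should equal exp(eps)."""
    q = keep_probability(eps, eps_star)
    numerator = 1.0 + q * (math.exp(eps_star) - 1.0)
    denominator = 1.0 - q * (1.0 - math.exp(-eps_star))
    return numerator / denominator
```

Algorithm \(\mathcal{B}\) would then run \(\mathcal{A}\) on the output of `subsample`; the sketch stops before that step because \(\mathcal{A}\) is arbitrary.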

Claim 4.5

Let α<1/2, 0<β≤1 and 0<ϵ<1. There exists an ϵ-differentially private (α,β)-PAC learner for the class \(\operatorname {\mathtt {POINT}}_{d}\) which uses a sample of size \(\mathop{\rm{poly}}\nolimits (1/\epsilon,1/\alpha, \log(1/\beta))\).

Proof

We first apply the transformation described in Lemma 4.4 on Algorithm \(\mathcal {A}_{2}\). Call the resulting Algorithm \(\mathcal{A}_{3}\). In this case \(\epsilon^{\star}=\ln(4)\) and

$$\begin{aligned} f\bigl(\epsilon ,\epsilon ^\star\bigr)=\frac{\exp(\epsilon )-1}{\exp(\epsilon ^\star)+\exp(\epsilon )-\exp (\epsilon -\epsilon ^\star)-1} > \epsilon /6 \end{aligned}$$

for ϵ<1 (since exp(ϵ)−1≥ϵ). By the Chernoff bound, if we take a sample of size \(384\ln(4)/(\epsilon\alpha^{2})\) and choose each example with probability at least ϵ/6, then with probability at least 1−exp(−32ln(4)) the resulting sample size is at least \(32\ln(4)/\alpha^{2}\). Now if given \(32\ln(4)/\alpha^{2}\) samples, \(\mathcal{A}_{2}\) returns a hypothesis with error at most α with probability at least 1/4. Therefore, the total probability that \(\mathcal{A}_{2}\) returns a hypothesis with error greater than α is at most exp(−32ln(4))+3/4 (the first term comes from \(\mathcal{A}_{2}\) not getting enough samples and the second term comes from \(\mathcal{A}_{2}\) returning a hypothesis with error greater than α even after getting enough samples). Thus, the algorithm resulting from the transformation described in Lemma 4.4 returns a hypothesis with error at most α with probability at least 1−(exp(−32ln(4))+3/4)>1/5 (i.e., the confidence parameter of the above learner is 4/5).

We next privately boost the confidence parameter of the learner from 4/5 to any value β>0, similarly to Kasiviswanathan et al. (2011). We execute algorithm \(\mathcal{A}_{3}\) \(N=\log_{5/4}(5/\beta)\) times with accuracy α/8 and disjoint samples; we get N hypotheses \(\mathrm{Hyp}=\{h_{1},\ldots,h_{N}\}\). With probability at least \(1-(4/5)^{N}=1-\beta/5\) at least one of the hypotheses has error less than α/8. We need to privately choose such a hypothesis. To achieve this goal we take a fresh sample of size \(m=24\ln(3/\beta^{2})/(\epsilon\alpha)\), compute the number of mistakes of each hypothesis on this sample, and use the exponential mechanism of McSherry and Talwar (2007) to choose the hypothesis. Specifically, let \(m_{i}\) be the number of errors that hypothesis \(h_{i}\) has on the sample; return the hypothesis \(h_{i}\) with probability

$$\frac{\exp(-\epsilon m_i/2)}{\sum_{j=1}^{N} \exp(-\epsilon m_j/2)}. $$

Changing one example can reduce \(m_{i}\) by at most one and increase \(m_{j}\) by at most one for every j≠i (thus multiplying \(\sum_{j=1}^{N} \exp(-\epsilon m_{j}/2)\) by a factor of at least exp(−ϵ/2)); therefore the selection of the hypothesis is ϵ-differentially private.
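The selection rule above can be sketched as follows. This is a minimal illustration, assuming hypotheses are Python callables and the sample is a list of labeled pairs (x, y); the function names are hypothetical. Subtracting the smallest mistake count before exponentiating is a numerical convenience that leaves the probabilities unchanged.

```python
import math
import random

def selection_probabilities(mistakes, eps):
    """Probability of returning h_i: exp(-eps*m_i/2) / sum_j exp(-eps*m_j/2)."""
    m_min = min(mistakes)  # shift exponents for numerical stability; ratios are unchanged
    weights = [math.exp(-eps * (m - m_min) / 2.0) for m in mistakes]
    total = sum(weights)
    return [w / total for w in weights]

def choose_hypothesis(hypotheses, sample, eps, rng=random):
    """Count each hypothesis's mistakes on the labeled sample, then sample one hypothesis."""
    mistakes = [sum(1 for x, y in sample if h(x) != y) for h in hypotheses]
    probs = selection_probabilities(mistakes, eps)
    return rng.choices(hypotheses, weights=probs, k=1)[0]
```

Hypotheses with fewer mistakes receive exponentially more weight, and shifting every mistake count by at most one changes each selection probability by a factor of at most exp(ϵ).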

We next argue that with probability at least 1−β the selected hypothesis \(h_{i}\) has error at most α. With probability at least 1−β/5, at least one of the hypotheses from Hyp has error less than α/8; by the Chernoff bound, with probability at least \(1-\beta^{2}/3\) this hypothesis has empirical error (Footnote 8) at most α/4. Let us call \(\mathcal{E}_{1}\) the event that there exists a hypothesis with error less than α/8 and empirical error less than α/4 in Hyp. Event \(\mathcal{E}_{1}\) happens with probability at least \((1-\beta/5)(1-\beta^{2}/3)>1-(\beta/5+\beta^{2}/3)\).

On the other hand, the probability that a hypothesis \(h_{j}\) that has error greater than α has empirical error ≤α/2 is less than \(\beta^{2}/3\). By the union bound, the probability that there is such a hypothesis in Hyp is at most β/3 (since N≤1/β for β≤0.01). Let us call \(\mathcal{E}_{2}\) the event that all hypotheses in Hyp with error greater than α have empirical error greater than α/2. Event \(\mathcal{E}_{2}\) happens with probability at least 1−β/3.

Conditioned on \(\mathcal{E}_{1}\), the probability that a hypothesis with empirical error ≥α/2 is selected by the exponential mechanism is at most

$$\begin{aligned} \frac{\exp(-\epsilon \alpha m/4)}{\sum_{j=1}^{N} \exp(-\epsilon m_j/2)} \leq \frac{\exp(-\epsilon \alpha m/4)}{\exp(-\epsilon \alpha m/8)} = \exp(-\epsilon \alpha m/8). \end{aligned}$$

The first inequality holds because conditioned on \(\mathcal{E}_{1}\) there exists a hypothesis (say, \(h_{\ell}\)) in Hyp with empirical error less than α/4. Therefore, \(m_{\ell}\leq(\alpha/4)m\), and

$$\sum_{j=1}^{N} \exp(-\epsilon m_j/2) \geq\exp(-\epsilon m_\ell/2) \geq\exp(-\epsilon \alpha m/8). $$

Since m=24ln(3/β)/(ϵα), the value of exp(−ϵαm/8) is at most \(\beta^{3}/27\). Therefore, conditioned on \(\mathcal{E}_{1}\) and \(\mathcal{E}_{2}\), the probability that a specific hypothesis with error greater than α is selected by the exponential mechanism is at most \(\beta^{3}/27\), and by the union bound, the probability that a hypothesis with error greater than α is selected by the exponential mechanism is at most \(N\beta^{3}/27\leq\beta^{2}/27\). By removing all the conditioning, we get that the selected hypothesis has error greater than α with probability at most \(\beta/5+\beta^{2}/3+\beta/3+\beta^{2}/27\leq\beta\). □

4.2.1 Making the learner efficient

The outcome of \(\mathcal{A}_{1}\) (hence, \(\mathcal{A}_{2}\)) is a hypothesis whose description is exponentially long (since it contains a list of the indices where the output was flipped). We now complete our construction by compressing this description using a pseudorandom function. The running time of the resulting algorithm is polynomial and the hypothesis it returns has a short description.

We use a slightly non-standard definition of (non-uniform) pseudorandom functions from binary strings of size d to bits; these pseudorandom functions can be easily constructed given standard pseudorandom functions (which in turn can be constructed under standard assumptions (Goldreich 2001)). Roughly speaking, a collection of functions is pseudorandom if it cannot be distinguished from truly random functions. We start by defining the random functions in our definition.

Definition 4.6

Define \(H^{q}_{d}: \{0,1\}^{d} \rightarrow\{0,1\}\) as a random variable, where each value \(H^{q}_{d}(x)\) for \(x\in\{0,1\}^{d}\) is selected i.i.d. to be 1 with probability q and 0 otherwise.

We consider a (non-uniform) polynomial-time distinguishing algorithm (represented by a circuit) \(C_{d}\) that can query a function at polynomially many points. No such algorithm should be able to distinguish whether its queries are answered by the random function \(H^{q}_{d}\) or by a random function from the pseudorandom family. Formally,

Definition 4.7

Let \(F = \{F_{d}\}_{d \in \mathbb {N}}\) be a function ensemble, where for every d, \(F_{d}\) is a set of functions from \(\{0,1\}^{d}\) to {0,1}. We say that the function ensemble F is q-biased pseudorandom if for every family of polynomial-size circuits with oracle access \(\{C_{d}\}_{d\in \mathbb {N}}\), every polynomial p(⋅), and all sufficiently large d’s,

$$\begin{aligned} \bigl \vert \Pr\bigl[C_d^f\bigl(1^d\bigr) = 1\bigr] - \Pr\bigl[C_d^{H^q_d}\bigl(1^d\bigr) = 1 \bigr]\bigr \vert < & \frac{1}{p(d)}. \end{aligned}$$
(5)

In the above inequality, the first probability is taken over the random choice of f with uniform distribution from \(F_{d}\), and the second probability is taken over the random variable \(H^{q}_{d}\).

For convenience, for \(d \in \mathbb {N}\), we consider \(F_{d}\) as a set of functions from {1,…,T} to {0,1}, where \(T=2^{d}\). We set q=α/4 in the above definition. Using an α/4-biased pseudorandom function ensemble F (such functions can be constructed from standard pseudorandom functions (Goldreich 2001)), we change Step (3) of Algorithm \(\mathcal {A}_{1}\) as follows:

(3)′:

If \(c={\bf0}\), let h be a random function from \(F_{d}\). Otherwise (i.e., \(c=c_{j}\) for some j∈[T]), let h be a random function from \(F_{d}\) subject to h(j)=1. Return h.

Call the resulting modified Algorithm \(\mathcal{A}_{4}\). We next show that \(\mathcal{A}_{4}\) is a PAC learner. Note that there exists a negligible function \(\mathop {\rm negl}\) such that for large enough d,

$$\bigl \vert \,\Pr\bigl[h(x)=1|h(j)=1\bigr] - \alpha/4\,\bigr \vert \leq \mathop {\rm negl}(d) $$

for every x∈{1,…,T} (as otherwise, we get a non-uniform distinguisher for the ensemble F). Thus,

$$\begin{aligned} \operatorname{\mathbb{E}}_{h\in F_d}\bigl[\mathop {\rm error}_\mathcal{D}(c,h)\bigr] = & \operatorname{\mathbb{E}}_{h\in F_d} \operatorname{\mathbb{E}}_{x\sim \mathcal{D}}\bigl[ \bigl|h(x)-c(x)\bigr|\bigr] \\ \leq& \operatorname{\mathbb{E}}_{h\in F_d}\operatorname{\mathbb{E}}_{x\sim \mathcal{D}}\bigl[h(x)\bigr]= \operatorname{\mathbb{E}}_{x\sim \mathcal{D}} \operatorname{\mathbb{E}}_{h\in F_d}\bigl[h(x)\bigr] \leq\frac{\alpha}{4}+\mathop {\rm negl}(d). \end{aligned}$$

The first inequality follows as for all x∈[T], h(x)≥c(x) by our restriction on the choice of h. Thus, by the same arguments as for \(\mathcal{A}_{1}\), Algorithm \(\mathcal{A}_{4}\) is a PAC learner.
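Step (3)′ can be illustrated with a keyed function whose key is resampled until h(j)=1, which is one way to draw a function from a family conditioned on its value at j. The sketch below is an assumption-laden illustration: it uses HMAC-SHA256 as a stand-in for the α/4-biased pseudorandom ensemble (the text only requires that such an ensemble exists), it assumes points fit in 16 bytes, and the names `biased_bit` and `sample_hypothesis` are hypothetical.

```python
import hashlib
import hmac
import os

def biased_bit(key: bytes, x: int, q: float) -> int:
    """Evaluate a keyed function at x; over a random key the output is 1 with probability about q."""
    digest = hmac.new(key, x.to_bytes(16, "big"), hashlib.sha256).digest()
    # Interpret the first 8 bytes as a number in [0, 1) and compare it with q.
    u = int.from_bytes(digest[:8], "big") / 2**64
    return 1 if u < q else 0

def sample_hypothesis(alpha: float, j=None, max_tries: int = 100_000):
    """Draw a random key; if a point j is given, resample until h(j) = 1."""
    q = alpha / 4.0
    for _ in range(max_tries):
        key = os.urandom(32)
        if j is None or biased_bit(key, j, q) == 1:
            # The hypothesis is described by the short key, not by a table of 2^d values.
            return lambda x, key=key: biased_bit(key, x, q)
    raise RuntimeError("no key with h(j) = 1 found; increase max_tries")
```

The returned hypothesis has a short description (the key), and the expected number of resampling rounds is about 4/α.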

We next modify Algorithm \(\mathcal {A}_{2}\) by executing the learner \(\mathcal{A}_{4}\) instead of the learner \(\mathcal{A}_{1}\). Call the resulting modified Algorithm \(\mathcal{A}_{5}\). To see that Algorithm \(\mathcal{A}_{5}\) preserves differential privacy it suffices to give a bound on (4). By comparing the case where S=R with S=R∪{i}, we get that the probability for a hypothesis h can increase only if \(c={\bf0}\) when S=R, and \(c=c_{i}\) when S=R∪{i}. Therefore,

$$\max_{R \subseteq[m'] \setminus\{i\}}\ \frac{\Pr[ \mathcal{A}_5(D)=h \ | \ S= R \cup\{i\}]}{\Pr[ \mathcal{A}_5(D)=h \ | S=R]} \leq\frac{1}{(\alpha /4) - \mathop {\rm negl}(d)} \leq \frac{1}{(\alpha/8)} = \frac {8}{\alpha}. $$

Applying the same steps as in the proof of Claim 4.5, we get the following result.

Theorem 4.8

There exists an efficient improper private PAC learner for \(\operatorname {\mathtt {POINT}}_{d}\) that uses \(O_{\alpha,\beta,\epsilon}(1)\) samples, where ϵ,α, and β are the parameters of the private learner.

Lemma 3.9 and Theorem 4.8 give the following separation:

Theorem 4.9

Every proper private PAC learner for \(\operatorname {\mathtt {POINT}}_{d}\) requires Ω((d+log(1/β))/(ϵα)) samples, whereas there exists an efficient improper private PAC learner that can learn \(\operatorname {\mathtt {POINT}}_{d}\) using \(O_{\alpha,\beta,\epsilon}(1)\) samples. Here, ϵ,α, and β are the parameters of the private learners.

4.3 Restrictions on the hypothesis class of private learners with low sample complexity

We conclude this section by showing that every (improper) private learner for \(\operatorname {\mathtt {POINT}}_{d}\) using o(d) samples must return hypotheses that evaluate to one on many points (in contrast, every hypothesis in \(\operatorname {\mathtt {POINT}}_{d}\) returns the value one on just one input). This explains why our algorithms for \(\operatorname {\mathtt {POINT}}_{d}\) that use o(d) samples return “complex” hypotheses.

Definition 4.10

(weight)

The weight of a hypothesis h is the number of points for which it returns the value one, i.e., |{i:h(i)=1}|.

Theorem 4.11

There exists no private PAC learner for \(\operatorname {\mathtt {POINT}}_{d}\) with sample complexity \(o_{\alpha,\beta,\epsilon}(d)\) that for every distribution returns, with probability at least half, hypotheses with weight \(2^{o_{\alpha,\beta,\epsilon}(d)}\) (where the probability is taken over the randomness of the learner and the sample points chosen according to the distribution). Here, ϵ,α, and β are the parameters of the private learner.

Proof

Assume the contrary, i.e., there exists a private learner that for every distribution returns hypotheses with weight \(2^{o_{\alpha ,\beta,\epsilon}(d)}\) with probability at least half. We prove that, under this assumption, there is a proper private learning algorithm for \(\operatorname {\mathtt {POINT}}_{d}\) with sample complexity \(o_{\alpha,\beta,\epsilon}(d)\), in contradiction with Lemma 3.9.

Let \(c_{t} \in \operatorname {\mathtt {POINT}}_{d}\) be the target concept. Assume for contradiction that there exists an ϵ-differentially private (α,β)-PAC learner \(\mathcal{A}'\) for \(\operatorname {\mathtt {POINT}}_{d}\) with sample complexity \(o_{\alpha,\beta,\epsilon}(d)\) that for every distribution returns, with probability at least 1/2, hypotheses of weight less than z, for \(z=2^{o_{\alpha,\beta,\epsilon}(d)}\) (where the probability is taken over the randomness of \(\mathcal{A}'\) and the sample points chosen according to the distribution).

Let \(\mathcal{D}\) denote the underlying sample distribution. Construct a proper learner \(\mathcal{A}\) (for \(\operatorname {\mathtt {POINT}}_{d}\)) which on input ϵ,d,α,β does the following:

  1. Let k=ln(β/2)/ln(3/4).

  2. Invoke the algorithm \(\mathcal{A}'\) k times with parameters ϵ,d,α/2,β′=1/4, each time on a fresh i.i.d. sample of size log z drawn from \(\mathcal{D}\) and labeled by \(c_{t}\). Let \(h_{1},\ldots,h_{k'}\) (where k′≤k) be the hypotheses returned in these executions with weight less than z.

  3. If k′=0 halt with failure, otherwise set \(\mathcal{H}_{d} = \{c_{j}: h_{i}(j) = 1\ \textrm{for some}\ i\in[k']\}\).

  4. Invoke the proper private learner of Lemma 3.4 with parameters ϵ,α,β/2 and hypothesis class \(\mathcal{H}_{d}\) on a fresh \(\ell= O((\log(|\mathcal{H}_{d}|) +\log(1/\beta))/(\epsilon\alpha)) \) sized i.i.d. sample drawn from \(\mathcal{D}\) and labeled by \(c_{t}\). Output the hypothesis returned by the learner.
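Steps 1 to 3 of this construction can be sketched as follows. The sketch assumes each returned hypothesis is represented by its support {j : h(j)=1}, and that k is rounded up to an integer (the text does not say how k is rounded); the names `repetitions` and `candidate_points` are hypothetical.

```python
import math

def repetitions(beta: float) -> int:
    """Step 1: k = ln(beta/2) / ln(3/4), rounded up to an integer (rounding is an assumption)."""
    return math.ceil(math.log(beta / 2.0) / math.log(3.0 / 4.0))

def candidate_points(supports, z):
    """Steps 2-3: keep hypotheses of weight < z and collect the points they map to 1.

    Each hypothesis is given by its support {j : h(j) = 1}. The returned set of
    points j indexes the class H_d = {c_j}. Returns None to signal failure (k' = 0).
    """
    light = [s for s in supports if len(s) < z]
    if not light:
        return None
    return set().union(*light)
```

Since at most k hypotheses of weight less than z contribute, the resulting class has fewer than k·z concepts, which is what keeps the sample size in Step 4 small.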

Note that \(\ell= O((\log(|\mathcal{H}_{d}|) +\log(1/\beta))/(\epsilon \alpha)) = o_{\alpha,\beta,\epsilon}(d)\), and that the sample complexity of \(\mathcal{A}\) is \(k\log z+\ell=o_{\alpha,\beta,\epsilon}(d)\). Furthermore, \(\mathcal{A}\) always returns a hypothesis in \(\operatorname {\mathtt {POINT}}_{d}\) (note that \(\mathcal{H}_{d}\subset \operatorname {\mathtt {POINT}}_{d}\)). Hence, if \(\mathcal{A}\) is a private learner for \(\operatorname {\mathtt {POINT}}_{d}\), we get a contradiction to Lemma 3.9.

Note that \(\mathcal{A}\) is ϵ-differentially private (follows since \(\mathcal{A}'\) is ϵ-differentially private and in Step (4), we invoke the ϵ-differentially private algorithm from Lemma 3.4 on a fresh sample).

To conclude the proof we show that \(\mathcal{A}\) is indeed a learner for \(\operatorname {\mathtt {POINT}}_{d}\). Note that for each of the hypotheses \(h_{i}\) returned by \(\mathcal{A}'\) in Step (2), we have that

$$\begin{aligned} \mbox{Condition 1:}\,\, \Pr\bigl[\mathop {\rm error}_{\mathcal{D}}(c_t,h_i) \leq\alpha/2\bigr] \geq 1-\beta'=\frac{3}{4}, \end{aligned}$$

and

$$\begin{aligned} \mbox{Condition 2:}\,\, \Pr[h_i\ \mbox{has weight less than}\ z] \geq \frac{1}{2}, \end{aligned}$$

where the probability is taken over the randomness of \(\mathcal{A}'\) and the sample points chosen according to \(\mathcal{D}\). We get that \(h_{i}\) satisfies both the above conditions with probability at least 1/4, and the probability that none of the hypotheses \(\mathcal{A}'\) outputs satisfy both these conditions is at most \((3/4)^{k}=\beta/2\).

We henceforth assume that a hypothesis, \(h_{i}\), returned by \(\mathcal{A}'\) in Step (2) is of weight less than z and \(\mathop {\rm error}_{\mathcal{D}}(c_{t},h_{i})\leq\alpha/2\). We claim that in this case \(\mathcal{H}_{d}\) contains a hypothesis \(c_{j}\in \mathcal{H}_{d}\) for which \(\mathop {\rm error}_{\mathcal{D}}(c_{t},c_{j}) \leq \alpha/2\), as if \(h_{i}(t)=1\) then we can set j=t, and otherwise, j can be any point such that \(h_{i}(j)=1\), as

$$\begin{aligned} \mathop {\rm error}_{\mathcal{D}}(c_t,c_j) =& \Pr_{x\sim \mathcal{D}}[x=t] + \Pr_{x\sim \mathcal{D}}[x=j] \leq\Pr_{x\sim \mathcal{D}}[x=t] + \Pr_{x\sim \mathcal{D}} \bigl[h_i(x)=1\bigr]\\ =& \mathop {\rm error}_{\mathcal{D}}(c_t,h_i) \leq\alpha/2. \end{aligned}$$

In other words, \(\mathcal{H}_{d}\) α/2-represents \(\{c_{t}\}\).

To conclude the proof, we observe that having \(\mathcal{H}_{d}\) α/2-represent \(\{c_{t}\}\) suffices for the proof of Theorem 3.2, and hence, the hypothesis (in Step (4)) returned by the learner of Theorem 3.2 is with probability at least 1−β/2 within error α from \(c_{t}\).

To summarize, we get that \(\mathcal{A}\) is a proper private learner for \(\operatorname {\mathtt {POINT}}_{d}\) under distribution \(\mathcal{D}\) with sample complexity \(o_{\alpha,\beta,\epsilon}(d)\). Since this holds for every \(\mathcal{D}\) this leads to a contradiction to Lemma 3.9 (the lemma shows that there exists a distribution for which there is no proper private learner for \(\operatorname {\mathtt {POINT}}_{d}\) with sample complexity \(o_{\alpha,\beta,\epsilon}(d)\)). □

5 Private learning of intervals (partial results)

In this section, we examine \(\operatorname {\mathtt {INTERVAL}}_{d}\), a concept class that, like \(\operatorname {\mathtt {POINT}}_{d}\), is very natural and simple and has VC-dimension 1. By Theorem 3.6, any proper private learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) requires \(\Omega_{\alpha,\beta,\epsilon}(d)\) samples (as \(\operatorname {\mathtt {INTERVAL}}_{d}\) is α-minimal for itself), and we ask whether stronger separation results than we showed for \(\operatorname {\mathtt {POINT}}_{d}\) can be proved for \(\operatorname {\mathtt {INTERVAL}}_{d}\). Specifically, we ask if we can prove a lower bound of \(\omega_{\alpha,\beta,\epsilon}(1)\) for any private learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) (i.e., also for improper private learners).

We give partial results towards answering this question. In Sect. 5.1, we show that if there exists an \(O_{\alpha,\beta,\epsilon}(1)\) sample sized improper private learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\), then it must use hypotheses that are very unlike intervals, and in fact must switch exponentially many times between zero and one (this is similar to the result presented for \(\operatorname {\mathtt {POINT}}_{d}\) in Sect. 4.3). Then, in Sect. 5.2, we take a deeper look into improper private learning of \(\operatorname {\mathtt {INTERVAL}}_{d}\), and prove that the technique from Sect. 4.2 that yielded the efficient private learner for \(\operatorname {\mathtt {POINT}}_{d}\) with sample complexity \(O_{\alpha,\beta,\epsilon}(1)\) cannot yield an algorithm for \(\operatorname {\mathtt {INTERVAL}}_{d}\) with sample complexity \(o_{\alpha,\beta,\epsilon}(d)\). In other words, the technique of adding independent noise from Sect. 4.2, even with exponentially many switch points, does not yield a learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) with \(o_{\alpha,\beta,\epsilon}(d)\) sample complexity.

Before proving the above results, let us first formally define \(\operatorname {\mathtt {INTERVAL}}_{d}\) and establish a sample complexity lower bound for proper private learning of this concept class.

Definition 5.1

The concept class \(\operatorname {\mathtt {INTERVAL}}_{d}\) is \(\{c_{j}: j\in\{1,\ldots,T+1\}\}\) where \(T=2^{d}\) and the concept \(c_{j}:[T]\rightarrow\{0,1\}\) maps all x<j to 1 and all x≥j to 0.

Unlike the concept class \(\operatorname {\mathtt {POINT}}_{d}\), the values of elements of \(X_{d}\) are significant in the sense that the geometric relation of which point is to the left of the other is meaningful. Note that the cardinality of \(\operatorname {\mathtt {INTERVAL}}_{d}\) is \(2^{d}+1\), and that it is α-minimal for itself (for all α<1/2), and hence, we can use Theorem 3.6 and get a lower bound on the sample complexity of proper private learners for \(\operatorname {\mathtt {INTERVAL}}_{d}\).

Lemma 5.2

Every proper private PAC learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) requires Ω((d+(1/β))/ϵ) samples.

5.1 Restrictions on the hypothesis class of private learners with low sample complexity

We give some insight into the structure of the hypothesis class of an improper private learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) with sample complexity \(o_{\alpha,\beta,\epsilon}(d)\). We show that if such a learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) exists, then it must return, with high probability, a hypothesis that switches frequently between zero and one. Therefore, the hypothesis output by the learner has a very different structure compared to the concepts in \(\operatorname {\mathtt {INTERVAL}}_{d}\), which switch exactly once from 1 to 0. This result resembles Theorem 4.11, where we proved a similar structural statement for privately learning the \(\operatorname {\mathtt {POINT}}_{d}\) class.

Definition 5.3

(Switching Point)

We say that j is a switching point in hypothesis h if h(j)≠h(j−1). If h(j−1)=1 we say that j is a decreasing switching point. Otherwise, we say the switching point is increasing. The points 1 and T+1 are also referred to as switching points. The point 1 is an increasing switching point if h(1)=1 and decreasing otherwise. The point T+1 is an increasing switching point if h(T)=0 and decreasing otherwise.

We next prove that every private learner with sample complexity \(o_{\alpha,\beta,\epsilon}(d)\) returns with high probability a hypothesis with an exponential number of switching points. We prove this using a method similar to the proof of the previous theorem. We assume that a learner exists which returns with constant probability a hypothesis with too few switching points. We then show that a proper private learner can be reconstructed from this hypothesis. For the reconstruction, we use a simplified version of the exponential mechanism of McSherry and Talwar (2007). Existence of a proper private learner for the class \(\operatorname {\mathtt {INTERVAL}}_{d}\) with sample complexity \(o_{\alpha,\beta,\epsilon}(d)\) leads to a contradiction to Lemma 5.2.

Theorem 5.4

There exists no private PAC learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) with sample complexity \(o_{\alpha,\beta,\epsilon}(d)\) that for every distribution returns, with probability at least half, hypotheses with \(2^{o_{\alpha ,\beta,\epsilon}(d)}\) switching points (where the probability is taken over the randomness of the learner and the sample points chosen according to the distribution). Here, ϵ,α, and β are the parameters of the private learner.

Proof

Let \(\mathcal{D}\) denote the underlying sample distribution. Every concept \(c \in \operatorname {\mathtt {INTERVAL}}_{d}\) has exactly one decreasing switching point. Discovering this point amounts to discovering the concept itself. Assume first that the target concept is \(c_{t}\) for some 1≤t≤T+1 and we have a hypothesis h such that \(\mathop {\rm error}_{\mathcal{D}}(c_{t},h) \leq\alpha\). Let j and k be two consecutive switching points in h such that j≤t≤k (Footnote 9). Assume first that the switching point j is decreasing (and, thus, k is increasing). Note that \(c_{j}(x)=c_{t}(x)=1\) for every x<j and \(c_{j}(x)=c_{t}(x)=0\) for every x≥t. Therefore, \(c_{j}\) is a hypothesis which only errs on {j,…,t−1}. Also \(c_{j}(x)=h(x)=0\) for every x∈{j,…,t−1}.

Therefore, we can refer to \(c_{j}\) as a concept which is reconstructed from h (it is chosen from h’s switching points) and which fixes all of h’s errors in {1,…,j−1}∪{t,…,T}. On the other hand, h errs on every point in {j,…,t−1}, so \(c_{j}\) does not introduce new errors to h. We get that

$$ \mathop {\rm error}_\mathcal{D}(c_t,c_j) \le \mathop {\rm error}_\mathcal{D}(c_t,h) \leq\alpha. $$

Similarly, if j is an increasing switching point, then k is decreasing, and \(c_{k}\) satisfies

$$ \mathop {\rm error}_\mathcal{D}(c_t,c_k) \le \mathop {\rm error}_\mathcal{D}(c_t,h) \leq\alpha. $$

Define

$$\operatorname {\mathtt {SWITCH}}(h) = \{c_j : j \text{ is a switching point in } h\}. $$

Note that \(\operatorname {\mathtt {SWITCH}}(h) \neq\emptyset\) by construction. By our discussion above, if h is such that \(\mathop {\rm error}_{\mathcal{D}}(c_{t},h)\leq \alpha\) then so is the case for at least one concept in \(\operatorname {\mathtt {SWITCH}}(h)\). Clearly, \(|\operatorname {\mathtt {SWITCH}}(h)|\) is bounded by the number of switching points in h.
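The set SWITCH(h) and the claim that it contains a concept at least as accurate as h can be illustrated on small domains. The sketch below assumes h is stored as a list with `h[x-1]` holding h(x) and the distribution as a list of point probabilities; all function names are hypothetical.

```python
def switching_points(h):
    """Switching points of h: [T] -> {0,1}, given as a list with h[x-1] = h(x).

    Points 1 and T+1 always count as switching points (Definition 5.3).
    """
    T = len(h)
    inner = [j for j in range(2, T + 1) if h[j - 1] != h[j - 2]]
    return [1] + inner + [T + 1]

def interval_concept(j, T):
    """c_j maps every x < j to 1 and every x >= j to 0."""
    return [1 if x < j else 0 for x in range(1, T + 1)]

def error(c, h, dist):
    """Probability, under dist, of the points where c and h disagree."""
    return sum(p for a, b, p in zip(c, h, dist) if a != b)

def best_switch_concept(h, target_t, dist):
    """Return (j, error) for the concept in SWITCH(h) closest to c_t under dist."""
    T = len(h)
    c_t = interval_concept(target_t, T)
    return min(
        ((j, error(c_t, interval_concept(j, T), dist)) for j in switching_points(h)),
        key=lambda pair: pair[1],
    )
```

For example, with h = [1, 1, 0, 1, 0, 0, 1, 0] and target \(c_{4}\), the switching points are 1, 3, 4, 5, 7, 8, 9, and SWITCH(h) already contains the target itself.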

Remark 5.5

Note that if the empirical error of h on some sample database D is less than α, then using the same arguments as above there exists a concept in \(\operatorname {\mathtt {SWITCH}}(h)\) whose empirical error on D is also less than α.

As in Kasiviswanathan et al. (2011), we use the exponential mechanism in order to choose a hypothesis out of \(\operatorname {\mathtt {SWITCH}}(h)\) (we used the same mechanism in the proof of Claim 4.5).

We now have enough tools for the proof. Assume that \(\mathcal{A}'\) is an ϵ-differentially private (α,β)-PAC learner for the class \(\operatorname {\mathtt {INTERVAL}}_{d}\) with a sample complexity \(o_{\alpha,\beta,\epsilon}(d)\) that on every distribution returns, with probability at least 1/2, hypotheses with at most \(z=z(\alpha,\beta,\epsilon ,d)=2^{o_{\alpha,\beta,\epsilon}(d)}\) switching points. Let \(s = 8\ln (\frac{12}{\beta}) / (\alpha^{2}) + 8 \ln(\frac{(6-\beta) z}{\beta } ) / (\alpha\epsilon) + K ( \frac{1}{\alpha} \log\frac{1}{\beta} + \frac{1}{\alpha} \log\frac{1}{\alpha} )\) for some constant K to be set below.

Construct a proper private learner \(\mathcal{A}\) as follows:

  1. Let \(\alpha'=\frac{\alpha}{4}; \beta'= \frac{\beta}{6}\).

  2. For i in \(\{1,\ldots,\log\frac{1}{\beta'}\}\):

     (a) Draw \(o_{\alpha,\beta,\epsilon}(d)\) new samples from \(\mathcal{D}\) and label them by \(c_{t}\). Let D′ denote these labeled examples.

     (b) Apply \(\mathcal{A}'\) with parameters ϵ,α′,β′ on D′. Let \(h_{i}\) be the returned hypothesis.

  3. Let \(\hat{h}\) denote the first hypothesis in \(\{h_{1},\ldots,h_{\log(1/\beta')}\}\) such that \(\lvert \operatorname {\mathtt {SWITCH}}(h_{i}) \rvert\leq z\). If no such \(\hat{h}\) exists, return “FAIL”.

  4. Draw s additional samples according to \(\mathcal{D}\) and label them by \(c_{t}\). Let \(D_{s}\) denote these labeled examples.

  5. Choose a concept c out of \(\operatorname {\mathtt {SWITCH}}(\hat{h})\) using the exponential mechanism on \(D_{s}\) with parameter ϵ and return it.
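The five steps can be put together in a short sketch. It is an illustration under several assumptions: `a_prime` stands for the assumed learner \(\mathcal{A}'\) (already configured with ϵ,α′,β′) and `draw` for a hypothetical oracle returning labeled samples (x, y); hypotheses are lists with `h[x-1]` holding h(x); the logarithm in Step 2 is taken base 2 and rounded up; and the sample sizes `n_small` and `s` are passed in rather than derived.

```python
import math
import random

def switch_points(h):
    """Switching points of h (list with h[x-1] = h(x)), including 1 and T+1."""
    T = len(h)
    return [1] + [j for j in range(2, T + 1) if h[j - 1] != h[j - 2]] + [T + 1]

def mistakes(j, sample):
    """Number of labeled examples (x, y) misclassified by c_j (c_j(x) = 1 iff x < j)."""
    return sum(1 for x, y in sample if (1 if x < j else 0) != y)

def proper_learner(a_prime, draw, n_small, s, z, eps, beta, rng=random):
    """Steps 1-5; a_prime and draw are hypothetical stand-ins supplied by the caller."""
    rounds = math.ceil(math.log2(6.0 / beta))  # log(1/beta') with beta' = beta/6
    hyps = [a_prime(draw(n_small)) for _ in range(rounds)]  # Step 2
    # Step 3: first hypothesis with at most z switching points.
    h_hat = next((h for h in hyps if len(switch_points(h)) <= z), None)
    if h_hat is None:
        return "FAIL"
    d_s = draw(s)  # Step 4: fresh labeled sample
    cands = switch_points(h_hat)  # indices j of the concepts in SWITCH(h_hat)
    counts = [mistakes(j, d_s) for j in cands]
    low = min(counts)  # shift exponents for numerical stability
    weights = [math.exp(-eps * (m - low) / 2.0) for m in counts]
    return rng.choices(cands, weights=weights, k=1)[0]  # Step 5: exponential mechanism
```

The function returns the index j of the chosen concept \(c_{j}\), so the output is always a member of \(\operatorname {\mathtt {INTERVAL}}_{d}\) or the failure symbol.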

We now show that \(\mathcal{A}\) is a proper private (α,β)-PAC learner with sample complexity \(o_{\alpha,\beta,\epsilon}(d)\). This is a contradiction to Lemma 5.2.

First, note that according to the assumption, Step (2a) is given enough samples. Also according to the assumption, for every i we have that \(\Pr[\lvert \operatorname {\mathtt {SWITCH}}(h_{i}) \rvert\ge z ] \le 1/2\). Therefore, Step (3) fails with probability at most \((1/2)^{\log(1/\beta')}=\beta'\). Since the chosen hypothesis \(\hat{h}\) is a uniformly distributed hypothesis conditioned on \(\lvert \operatorname {\mathtt {SWITCH}}(\hat {h}) \rvert \leq z\) (an event with probability at least half), the probability that \(\mathop {\rm error}_{\mathcal{D}}(c_{t},\hat{h}) \geq\alpha'\) is at most 2β′+β′=3β′ (2β′ comes from Step (2b) and β′ from Step (3)).

In our next analysis, we assume that \(\mathop {\rm error}_{\mathcal{D}}(c_{t},\hat{h}) < \alpha '\). Denote by \(\widehat{\mathop {\rm error}}_{D_{s}}(h')\) the empirical error of a hypothesis h′ on the samples \(D_{s}\), and let \(Q = \widehat{\mathop {\rm error}}_{D_{s}}(\hat{h})\). Clearly, \(\operatorname{\mathbb{E}}_{D_{s}} [ Q ] =\mathop {\rm error}_{\mathcal{D}}(c_{t},\hat{h}) \leq\alpha'\), where the expectation is over the drawing of the samples \(D_{s}\) in Step (4). We can bound Q with high probability using the Chernoff-Hoeffding bound (Inequality (2)) and get

$$ \Pr\bigl[ \bigl\vert Q - \operatorname{\mathbb{E}}_{D_s}[Q] \bigr\vert\geq\alpha' \bigr] \le2 \exp\bigl( - 2s \alpha'^2 \bigr). $$

Since \(s > 8\ln (\frac{12}{\beta}) / (\alpha^{2})= \ln(\frac{2}{\beta'}) / (2 \alpha '^{2})\), we have

$$ \Pr\bigl[ \bigl\vert Q - \operatorname{\mathbb{E}}_{D_s}[Q] \bigr\vert\ge\alpha' \bigr] \le\beta'. $$
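As a quick numerical sanity check of this sample-size condition, the following Python sketch uses the relations α′=α/4 and β′=β/6 that appear later in the proof. It confirms that the two expressions for the threshold on s coincide and that the Hoeffding tail equals β′ at the threshold. The function names and the concrete values of α and β are illustrative assumptions, not part of the paper.

```python
import math


def hoeffding_threshold(alpha: float, beta: float) -> float:
    """Threshold on s written in terms of alpha' = alpha/4 and beta' = beta/6."""
    alpha_p, beta_p = alpha / 4, beta / 6
    return math.log(2 / beta_p) / (2 * alpha_p ** 2)


def hoeffding_tail(s: float, alpha_p: float) -> float:
    """Right-hand side of the Chernoff-Hoeffding bound: 2 * exp(-2 * s * alpha'^2)."""
    return 2 * math.exp(-2 * s * alpha_p ** 2)


alpha, beta = 0.1, 0.05  # illustrative values only
s = hoeffding_threshold(alpha, beta)
print(s, 8 * math.log(12 / beta) / alpha ** 2, hoeffding_tail(s, alpha / 4))
```

Any s above this threshold makes the tail strictly smaller than β′, which is what the next step of the argument uses.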

Since \(\operatorname{\mathbb{E}}_{D_{s}}[Q] \leq\alpha'\), we now have Pr[Q≥2α′]≤β′. For the analysis of the last step we assume that indeed

$$ \widehat{\mathop {\rm error}}_{D_s}(\hat{h}) \le2 \alpha'. $$

Next, we analyze the complexity and accuracy of the exponential mechanism step. Let

$$\operatorname {\mathtt {good}}(D_s,\hat{h}) = \bigl\{c_j \in \operatorname {\mathtt {SWITCH}}(\hat{h}) : \widehat{\mathop {\rm error}}_{D_s}(c_j) \leq3\alpha' \bigr\}. $$

That is, \(\operatorname {\mathtt {good}}(D_{s},\hat{h})\) contains the concepts in \(\operatorname {\mathtt {SWITCH}}(\hat{h})\) that are inconsistent with at most \(3\alpha' s\) samples, i.e., concepts such that \(m_{c_{j}} \leq3 \alpha' s\). Let \(\operatorname {\mathtt {bad}}(D_{s},\hat{h})\) be all the other concepts in \(\operatorname {\mathtt {SWITCH}}(\hat{h})\). Let \({\mathcal{E}}_{\operatorname {\mathtt {good}}}\) (resp. \({\mathcal{E}}_{\operatorname {\mathtt {bad}}}\)) be the event that a concept in \(\operatorname {\mathtt {good}}(D_{s},\hat{h})\) (resp. \(\operatorname {\mathtt {bad}}(D_{s},\hat {h})\)) is chosen by the exponential mechanism in Step (5). Recall that we assumed \(\widehat{\mathop {\rm error}}_{D_{s}}(\hat{h}) \le2 \alpha'\). Under this assumption, according to the observations in Remark 5.5, there is at least one concept \(c^{\star}\in \operatorname {\mathtt {SWITCH}}(\hat{h})\) whose empirical error is also bounded by 2α′ (therefore, \(c^{\star}\in \operatorname {\mathtt {good}}(D_{s},\hat{h})\)). So in Step (5),

$$\begin{aligned} \frac{\Pr[{\mathcal{E}}_{\operatorname {\mathtt {good}}}]}{\Pr[{\mathcal{E}}_{\operatorname {\mathtt {bad}}}]} = & \frac{ \sum _{c_j \in \operatorname {\mathtt {good}}(D_s,\hat{h})} \exp(-\epsilon\cdot m_{c_j}/2) }{\sum_{c_j \in \operatorname {\mathtt {bad}}(D_s,\hat{h})} \exp(-\epsilon\cdot m_{c_j}/2)} \\ \ge& \frac{\exp(-\epsilon\cdot m_{c^\star}/2)}{\sum_{c_j \in \operatorname {\mathtt {bad}}(D_s,\hat{h})} \exp(-\epsilon\cdot m_{c_j}/2)} \ge\frac{\exp(-\alpha' s \epsilon)}{\sum_{c_j \in \operatorname {\mathtt {bad}}(D_s,\hat{h})} \exp(-3\alpha' s \epsilon/2)} \\ \ge& \frac{\exp(-\alpha' s \epsilon)}{\lvert \operatorname {\mathtt {SWITCH}}(\hat{h}) \rvert\cdot\exp(-3\alpha' s \epsilon/2)} = \frac{\exp (\alpha' s \epsilon/2)}{\lvert \operatorname {\mathtt {SWITCH}}(\hat{h}) \rvert} \\ \ge& \frac{\exp(\alpha' s \epsilon/2)}{z}. \end{aligned}$$

Since \(s > 8 \ln(\frac{(6-\beta) z}{\beta} ) / (\alpha\epsilon) = 2 \ln(\frac{(1-\beta') z}{\beta'}) / (\alpha' \epsilon)\), we get that

$$\frac{\Pr[{\mathcal{E}}_{\operatorname {\mathtt {good}}}]}{1-\Pr[{\mathcal{E}}_{\operatorname {\mathtt {good}}}]}=\frac{\Pr [{\mathcal{E}}_{\operatorname {\mathtt {good}}}]}{\Pr[{\mathcal{E}}_{\operatorname {\mathtt {bad}}}]} \ge\frac{1-\beta'}{\beta'} $$

and, thus, \(\Pr[{\mathcal{E}}_{\operatorname {\mathtt {good}}}] \ge1-\beta'\). Therefore, if \(\hat{h}\) satisfies \(\widehat{\mathop {\rm error}}_{D_{s}}(\hat{h}) \le2 \alpha'\) and it has less than z switching points, then Step (5) returns with probability at least 1−β′ a concept \(c \in \operatorname {\mathtt {INTERVAL}}_{d}\) such that \(\widehat{\mathop {\rm error}}_{D_{s}}(c) \leq3\alpha '\). For our last analysis, we assume that indeed a concept with empirical error bounded by 3α′ was chosen in Step (5).
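The selection step analyzed above can be sketched in Python as follows. This is a minimal illustration, assuming each candidate concept is summarized by its number of misclassified samples (the quantity written m with subscript c_j in the text) and is chosen with probability proportional to exp(−ϵ·m/2). The function names and the toy numbers are hypothetical.

```python
import math
import random


def _weights(mistakes, epsilon):
    # Subtracting the minimum improves numerical stability and leaves the
    # normalized distribution unchanged.
    m_min = min(mistakes)
    return [math.exp(-epsilon * (m - m_min) / 2) for m in mistakes]


def exponential_mechanism(mistakes, epsilon, rng=random):
    """Return an index j chosen with probability proportional to exp(-epsilon * m_j / 2)."""
    weights = _weights(mistakes, epsilon)
    return rng.choices(range(len(mistakes)), weights=weights, k=1)[0]


def prob_good(mistakes, epsilon, threshold):
    """Exact probability that the chosen candidate makes at most `threshold` mistakes."""
    weights = _weights(mistakes, epsilon)
    good = sum(w for w, m in zip(weights, mistakes) if m <= threshold)
    return good / sum(weights)


# Toy instance: alpha' = 0.05, beta' = 0.1, epsilon = 1, z = 20, s = 209.
# One candidate has at most 2*alpha'*s mistakes; the other 19 have more than 3*alpha'*s.
mistakes = [20] + [32] * 19
print(prob_good(mistakes, epsilon=1.0, threshold=31))
```

In this toy instance s exceeds 2 ln((1−β′)z/β′)/(α′ϵ) ≈ 207.7, and the computed probability of picking a good candidate is above 1−β′=0.9, in line with the bound derived above.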

Finally, we show that c, the concept returned by \(\mathcal{A}\), has indeed \(\mathop {\rm error}_{\mathcal{D}}(c,c_{t}) \leq\alpha\) with high probability. As the VC-dimension of \(\operatorname {\mathtt {INTERVAL}}_{d}\) is 1, by Blumer et al. (1989), there exists a constant \(\ell\) such that whenever more than \(\ell( \frac {1}{\alpha'} \log\frac{1}{\beta'} + \frac{1}{\alpha'} \log\frac{1}{\alpha'} )\) samples are drawn from some distribution \(\mathcal{D}\), then \(\Pr[ \lvert \mathop {\rm error}_{\mathcal{D}}(c_{t},c) - \widehat{\mathop {\rm error}}_{D_{s}} (c) \rvert\geq \alpha' ] \leq\beta'\). Recall that \(s > K ( \frac{1}{\alpha} \log\frac {1}{\beta} + \frac{1}{\alpha} \log\frac{1}{\alpha} )\) for some constant K (depending on \(\ell\)). As we assumed \(\widehat{\mathop {\rm error}}_{D_{s}}(c) \le 3\alpha'\), we finally have that \(\mathop {\rm error}_{\mathcal{D}}(c_{t},c) \le4 \alpha' = \alpha\) with probability at least 1−β′.

Next we analyze the confidence parameter of \(\mathcal{A}\). We now list the bad events. As said before, the probability of \(\mathop {\rm error}_{\mathcal{D}}(c_{t},\hat{h}) \geq\alpha'\) at the end of Step (3) is bounded by 3β′. After this \(\hat{h}\) is chosen in Step (3), its empirical error on the samples D s is too high with probability bounded by β′. The exponential mechanism fails to return a concept c with low empirical error on D s with probability bounded by β′. Finally, if the exponential mechanism successfully returned a concept with low empirical error, then the misclassification error of c is too high with probability bounded by β′. Using the union bound, we get that the probability of any of the above bad events happening is bounded by 6β′. Therefore,

$$ \Pr\bigl[ \mathop {\rm error}_{\mathcal{D}}(c_t,c) \geq\alpha\bigr] \leq6 \beta' = \beta. $$

We now calculate the sample complexity. Note that samples are drawn in Step (4) and many times in Step (2a). As we assumed the sample complexity of \(\mathcal{A}'\) is o α,β,ϵ (d) and it is executed log(1/β′) times, we get that the total sample complexity of this step is o α,β,ϵ (d). (Remember that α′ and β′ are of the same order as α and β.) Also note that since \(z=2^{o_{\alpha,\beta,\epsilon}(d)}\), the sample complexity of Step (4) is s=o α,β,ϵ (d). Therefore, the sample complexity of \(\mathcal{A}\) is log(1/β′)⋅o α,β,ϵ (d)+s=o α,β,ϵ (d).

Finally, note that we assumed \(\mathcal{A}'\) maintains ϵ-differential privacy. Also the exponential mechanism maintains ϵ-differential privacy. Since any execution of the inner algorithms is on different independently drawn samples of the whole sample set, the learner \(\mathcal{A}\) maintains ϵ-differential privacy.

Combining all the above statements we have that if there is an ϵ-differentially private (α/4,β)-PAC learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) with sample complexity \(o_{\alpha,\beta,\epsilon}(d)\) that for every distribution returns, with probability at least half, a hypothesis with \(2^{\varOmega_{\alpha,\beta,\epsilon}(d)}\) switching points, then there is a proper ϵ-differentially private (α,β)-PAC learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) with sample complexity \(o_{\alpha,\beta,\epsilon}(d)\). This contradicts Lemma 5.2. □

5.2 Impossibility of private independent noise learners with low sample complexity

We next show that the ideas used in Sect. 4.2 to construct a private learner for \(\operatorname {\mathtt {POINT}}_{d}\) with sample complexity \(O_{\alpha,\beta,\epsilon}(1)\) cannot be used for \(\operatorname {\mathtt {INTERVAL}}_{d}\). We begin by formalizing a class of independent noise learners that generalizes the construction in Sect. 4.2. We note that independent noise learners are allowed to output hypotheses whose description is exponential in d (recall that this issue was resolved for \(\operatorname {\mathtt {POINT}}_{d}\) by using compression with pseudorandom functions).

Definition 5.6

(Private Independent Noise Learner)

A private independent noise learner for a concept class \(\mathcal{C}_{d}\) over \(X_{d}\) using sample size m′ and parameters α′,β′,ϵ is a pair of algorithms \(( \mathcal{A}^{\rm outer}, \mathcal{A}^{\rm inner})\), called the outer and inner learners respectively, that for all concepts \(c \in \mathcal{C}_{d}\) and all distributions \(\mathcal{D}\) on \(X_{d}\), given an input \(D=(d_{1},\ldots,d_{m'})\), where \(d_{i}=(x_{i},c(x_{i}))\) with \(x_{i}\) drawn i.i.d. from \(\mathcal{D}\) for all i∈[m′], does the following:

  1. 1.

    The outer learner \(\mathcal{A}^{\rm outer}\) is a private PAC learner (as defined in Definition 2.5) for \(\mathcal{C}_{d}\) using the class of all \(2^{|X_{d}|}\) functions X d →{0,1}. Furthermore, \(\mathcal{A}^{\rm outer}(\epsilon,d,\alpha',\beta',D)\) is restricted to execute as follows:

    1. (a)

      Select parameters \(\alpha^{\star} \le \alpha'\), \(\beta^{\star} \le \beta'\), and a noise rate μ as a (deterministic) function of ϵ,α′,β′.

    2. (b)

      Run \(\mathcal{A}^{\rm inner}(d,\alpha^{\star},\beta^{\star},D)\). Denote the output hypothesis by \(c^{\star}\).

    3. (c)

      If \(c^{\star}\notin \mathcal{C}_{d}\) then output “fail” and halt. Otherwise, produce a hypothesis h by adding noise to all entries of \(c^{\star}\) independently, i.e., for all \(x \in X_{d}\) set \(h(x)=1-c^{\star}(x)\) with probability μ, and \(h(x)=c^{\star}(x)\) otherwise.

  2. 2.

    The inner learner \(\mathcal{A}^{\rm inner}\) outputs with probability at least \(1-\beta^{\star}\) (over the randomness of \(\mathcal{A}^{\rm inner}\) and the sampling of D according to \(\mathcal{D}\)) a hypothesis \(c^{\star}\in \mathcal{C}_{d}\) such that \(\mathop {\rm error}_{\mathcal{D}}(c^{\star}, c)\leq\alpha^{\star}\).
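The noise-addition step of the outer learner can be illustrated with a minimal Python sketch. It assumes a hypothesis is stored as an explicit 0/1 truth table (a list indexed by the points of the domain), which is feasible only for toy domains since the real table has exponentially many entries in d. The names `add_independent_noise`, `outer_step`, and `in_class` are illustrative and do not come from the paper.

```python
import random


def add_independent_noise(c_star, mu, rng):
    """Flip each entry of the 0/1 truth table c_star independently with probability mu."""
    return [1 - bit if rng.random() < mu else bit for bit in c_star]


def outer_step(inner_output, in_class, mu, rng):
    """Step (c): output "fail" if the inner output is not in the class, else a noisy copy."""
    if inner_output is None or not in_class(inner_output):
        return "fail"
    return add_independent_noise(inner_output, mu, rng)


def is_threshold(table):
    """Hypothetical membership test: the table is a run of ones followed by a run of zeros."""
    return all(table[i] >= table[i + 1] for i in range(len(table) - 1))


rng = random.Random(0)
print(outer_step([1, 1, 1, 0, 0, 0], is_threshold, mu=0.1, rng=rng))
```

Because every entry is flipped independently, the noisy output is in general no longer a member of the concept class, which is why such learners are improper.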

Example 5.7

We show that Algorithm \(\mathcal {A}_{2}\), described in Sect. 4.2, is a private independent noise learner for \(\operatorname {\mathtt {POINT}}_{d}\). In order to do this, we describe Algorithm \(\mathcal {A}_{2}\) in a different way than the description in Sect. 4.2 (see Footnote 10). The outer learner is the learner defined in Definition 5.6 selecting parameters \(\alpha^{\star}=\alpha'/2\), \(\beta'=3/4\), \(\beta^{\star}=1/2\), and a noise rate μ=α′/8. The inner learner does the following:

  1. 1.

    Set α=α′.

  2. 2.

    Get a sample \((x_{1},y_{1}),\ldots,(x_{m'},y_{m'})\), where the \(x_{i}\)’s are chosen according to \(\mathcal{D}\) and \(m'=32\ln(4)/\alpha^{2}\).

  3. 3.

    With probability α/8, return ⊥.

  4. 4.

    Construct a set S⊆[m′] by picking each element of [m′] with probability α/4.

  5. 5.

    If \(((x_{i},y_{i}))_{i \in S}\) is not consistent with any concept in \(\operatorname {\mathtt {POINT}}_{d}\), return ⊥.

  6. 6.

    If \(y_{i}=0\) for all \(i \in S\), then let \(c= {\bf0}\) (the all zero hypothesis); otherwise, let c be the (unique) hypothesis from \(\operatorname {\mathtt {POINT}}_{d}\) that is consistent with the labeled examples \(((x_{i},y_{i}))_{i \in S}\).
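Steps 3 to 6 of this inner learner can be sketched in Python as follows. The sketch assumes the sample is a list of (x, y) pairs, represents ⊥ by `None`, the all-zero hypothesis by the string `"all-zero"`, and a point concept by the pair `("point", j)`. These encodings and the function name are illustrative choices, not the paper's.

```python
import random

BOTTOM = None  # stands for the symbol ⊥


def inner_learner_point(sample, alpha, rng):
    """Subsample-and-fit sketch of Steps 3-6 for the class POINT_d."""
    # Step 3: with probability alpha/8, give up.
    if rng.random() < alpha / 8:
        return BOTTOM
    # Step 4: keep each example independently with probability alpha/4.
    sub = [(x, y) for (x, y) in sample if rng.random() < alpha / 4]
    positives = {x for (x, y) in sub if y == 1}
    negatives = {x for (x, y) in sub if y == 0}
    # Step 5: a point concept labels exactly one x with 1, and that x is never labeled 0.
    if len(positives) > 1 or positives & negatives:
        return BOTTOM
    # Step 6: no positive example gives the all-zero hypothesis, else the unique point.
    if not positives:
        return "all-zero"
    (j,) = positives
    return ("point", j)


rng = random.Random(0)
sample = [(7, 1)] * 50 + [(3, 0)] * 50  # toy sample labeled by the point concept at 7
print(inner_learner_point(sample, alpha=0.5, rng=rng))
```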

As analyzed in Sect. 4.2, Algorithm \(\mathcal {A}_{2}\) is ln(4)-differentially private. It is also an (α′,β′)-PAC learner. To construct an algorithm that is ϵ-differentially private for smaller values of ϵ, we use a transformation described in Lemma 4.4. It can be seen that the resulting algorithm is also a private independent noise learner.

Furthermore, in the above description of \(\mathcal{A}_{2}\), the confidence parameter is β′=3/4. In Sect. 4.2, we boosted the confidence parameter by using the exponential mechanism. The resulting learning algorithm is not a private independent noise learner. However, for any constant β′, we can modify \(\mathcal{A}_{2}\) such that the resulting algorithm has confidence β′ and is a private independent noise learner, although its sample complexity is not polynomial in log(1/β′).

We next show that there is no private independent noise learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) using only \(o_{\alpha,\beta,\epsilon}(d)\) samples. We will show that in this case, we can essentially recover the outcome of the inner learner (with probability at least \(1-\beta^{\star}\) a hypothesis in \(\operatorname {\mathtt {INTERVAL}}_{d}\)) from the outcome of the outer learner. It follows then that the existence of a private independent noise learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) that uses \(o_{\alpha,\beta,\epsilon}(d)\) samples implies a proper private learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) that uses \(o_{\alpha,\beta,\epsilon}(d)\) samples, in contradiction with Lemma 5.2.

Theorem 5.8

There is no private independent noise learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) for β′<1/4 and α′<β′/100 that learns using m′=o α′,β′,ϵ (d) samples.

Proof

Assume towards a contradiction that a private independent noise learner \(( \mathcal{A}^{\rm outer}, \mathcal{A}^{\rm inner})\) exists for \(\operatorname {\mathtt {INTERVAL}}_{d}\). Let \(\mathcal{D}\) denote the underlying sample distribution and \(c_{t} \in \operatorname {\mathtt {INTERVAL}}_{d}\) denote the target concept. Consider an execution of \(\mathcal{A}^{\rm outer}\) when invoked with parameters α′,β′ where β′<1/2 (we will further restrict α′,β′ below). We first show a simple bound on the noise rate μ=μ(α′,β′) selected by \(\mathcal{A}^{\rm outer}\). Denote by \(\alpha^{\star} \le \alpha'\), \(\beta^{\star} \le \beta'\) the parameters that \(\mathcal{A}^{\rm outer}\) selects for the inner learner. Denote by \(c^{\star}\) the concept returned by \(\mathcal{A}^{\rm inner}\) and by h the concept returned by \(\mathcal{A}^{\rm outer}\) (or ⊥ if \(\mathcal{A}^{\rm outer}\) halts without an output).

Note that by the definition of a private independent noise learner, \(\mathcal{A}^{\rm inner}\) outputs \(c^{\star}\in \operatorname {\mathtt {INTERVAL}}_{d}\) satisfying \(\mathop {\rm error}_{\mathcal{D}}(c_{t},c^{\star}) \le\alpha^{\star}\) with probability at least \(1-\beta^{\star}\). Similarly, since \(\mathcal{A}^{\rm outer}\) is a learner, we get that \(\mathcal{A}^{\rm outer}\) outputs h satisfying \(\mathop {\rm error}_{\mathcal{D}}(c_{t},h) \le\alpha'\) with probability at least 1−β′. In both cases, the probability is taken over the randomness in the execution of the learner (for \(\mathcal{A}^{\rm outer}\) this includes the randomness of \(\mathcal{A}^{\rm inner}\)) and the sample points chosen according to \(\mathcal{D}\). We, hence, define the event

$${\mathcal{E}}: \begin{array}{l} \mathcal{A}^{\rm inner}\ \mbox{outputs}\ c^\star\in \operatorname {\mathtt {INTERVAL}}_d\ \mbox{satisfying}\ \mathop {\rm error}_{\mathcal{D}}(c_t,c^\star) \le\alpha^\star;\ \mbox{and} \\ \mathcal{A}^{\rm outer}\ \mbox{outputs}\ h\ \mbox{satisfying}\ \mathop {\rm error}_{\mathcal{D}}(c_t,h) \le \alpha' \end{array} $$

and conclude that \(\Pr[{\mathcal{E}}] \geq1-\beta' - \beta^{\star}> 0\).

In the following, we bound \(\operatorname{\mathbb{E}}_{h} [ \mathop {\rm error}_{\mathcal{D}}(c_{t},h) ] \triangleq \operatorname{\mathbb{E}}_{h} \operatorname{\mathbb{E}}_{x \sim \mathcal{D}} [ \lvert h(x)-c_{t}(x) \rvert ]\), assuming \({\mathcal{E}}\). This will yield an upper bound on μ.

$$\begin{aligned} \operatorname{\mathbb{E}}_h \bigl[ \mathop {\rm error}_{\mathcal{D}}(c_t,h)\; \arrowvert {\mathcal{E}}\bigr] & = \operatorname{\mathbb{E}}_h \operatorname{\mathbb{E}}_{x \sim \mathcal{D}} \bigl[ \bigl\vert h(x)-c_t(x) \bigr\vert\; | {\mathcal{E}}\bigr] \\ & \ge \operatorname{\mathbb{E}}_h \bigl[ \operatorname{\mathbb{E}}_{x \sim \mathcal{D}} \bigl[ \bigl\vert h(x)-c^\star(x) \bigr\vert\; | {\mathcal{E}}\bigr] - \operatorname{\mathbb{E}}_{x \sim \mathcal{D}} \bigl[ \bigl\vert c_t(x)-c^\star(x) \bigr\vert\; | {\mathcal{E}}\bigr] \bigr] && \text{(6)} \\ & \ge \operatorname{\mathbb{E}}_h \operatorname{\mathbb{E}}_{x \sim \mathcal{D}} \bigl[ \bigl\vert h(x)-c^\star(x) \bigr\vert\; | {\mathcal{E}}\bigr] - \alpha^\star && \text{(7)} \\ & = \operatorname{\mathbb{E}}_{x \sim \mathcal{D}} \operatorname{\mathbb{E}}_h \bigl[ \bigl\vert h(x)-c^\star(x) \bigr\vert\; | {\mathcal{E}}\bigr] - \alpha^\star= {\mu }- \alpha^\star. && \text{(8)} \end{aligned}$$

Inequality (6) follows from the triangle inequality, i.e., \(\lvert h(x)-c^{\star}(x) \rvert \le \lvert h(x)-c_{t}(x) \rvert + \lvert c_{t}(x)-c^{\star}(x) \rvert\), and Inequality (7) follows from \(\mathop {\rm error}_{\mathcal{D}}(c_{t},c^{\star}) \le\alpha^{\star}\). On the other hand, by the definition of \({\mathcal{E}}\)

$$ \operatorname{\mathbb{E}}_h \bigl[ \mathop {\rm error}_{\mathcal{D}}(c_t,h) \; | {\mathcal{E}}\bigr] < \alpha'. \qquad \text{(9)} $$

Noting that the setting of μ is deterministic (and, hence, the setting of μ does not depend on whether the event \({\mathcal{E}}\) holds), we get from Inequalities (8) and (9) that \(\alpha' \ge \mu - \alpha^{\star}\), and hence, since \(\alpha^{\star} \le \alpha'\), that μ≤2α′. It follows that by choosing α′ to be small enough, we restrict μ to be small.

We now show how to reconstruct \(c^{\star}\) from h. The reconstruction algorithm is as follows:

  1. 1.

    For every t∈{1,…,T+1} define mismatch(t,h)=|{x<t:h(x)=0}|+|{x≥t:h(x)=1}|.

  2. 2.

    Find ℓ for which mismatch(ℓ,h) is the lowest and return \(c_{\ell}\).

  3. 3.

    If no such unique point exists, return “FAIL”.
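The reconstruction steps above can be sketched in Python. The sketch assumes that the concept with index t labels x by 1 exactly when x<t (so the second set in the definition of mismatch is {x≥t : h(x)=1}) and that h is given as a 0/1 list with `h[x-1]` holding h(x) for x∈{1,…,T}. The function names are illustrative.

```python
def mismatch(t, h):
    """Number of points x in {1,...,T} where h disagrees with the threshold concept c_t."""
    T = len(h)
    # Disagreement: h(x) = 0 although x < t, or h(x) = 1 although x >= t.
    return sum(1 for x in range(1, T + 1) if (h[x - 1] == 0) == (x < t))


def reconstruct(h):
    """Return the unique t minimizing mismatch(t, h), or "FAIL" if the minimizer is not unique."""
    T = len(h)
    scores = [sum(h)]  # mismatch(1, h): every point satisfies x >= 1
    for x in range(1, T + 1):
        # Moving from t = x to t = x + 1 shifts point x from the "x >= t" side to the "x < t" side.
        scores.append(scores[-1] + (1 if h[x - 1] == 0 else -1))
    best = min(scores)
    minimizers = [t for t, score in enumerate(scores, start=1) if score == best]
    return minimizers[0] if len(minimizers) == 1 else "FAIL"


clean = [1, 1, 1, 1, 0, 0, 0]   # truth table of c_5 over {1,...,7}
noisy = [1, 0, 1, 1, 0, 0, 0]   # the same table with the entry at x = 2 flipped
print(reconstruct(clean), reconstruct(noisy))
```

The incremental update computes all T+1 mismatch values in time linear in T, rather than recomputing each one from scratch.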

We now bound the probability that \(c_{\ell} \neq c^{\star}\). We call a point x for which noise was added by \(\mathcal{A}^{\rm outer}\) (i.e., \(h(x)\not =c^{\star}(x)\)) dirty, otherwise we call x clean. Let j be such that \(c_{j}=c^{\star}\). Then, mismatch(j,h) is the number of dirty points. The reconstruction algorithm fails to return \(c^{\star}\) if and only if there is some point k≠j such that mismatch(k,h)≤mismatch(j,h). In this case, we say that k is bad. We show that for small enough μ, such a bad point exists only with constant probability. In the following, we assume that k>j (the case k<j is symmetric). First note that \(c_{j}\) and \(c_{k}\) disagree only on points in {j,…,k−1} (i.e., mismatch(j,h) and mismatch(k,h) have the same contribution from points not between j and k). Now every dirty point in {j,…,k−1} contributes 1 to mismatch(j,h) and nothing to mismatch(k,h), and similarly each clean point in {j,…,k−1} contributes 1 to mismatch(k,h) and nothing to mismatch(j,h). Since we assumed that mismatch(k,h)≤mismatch(j,h), it should be the case that at least half the entries in {j,…,k−1} are dirty.

We consider the case where there is a bad point bigger than j (the case where it is smaller than j is handled analogously). Let k>j be the smallest bad point which is bigger than j, that is, k is the smallest such that the number of dirty points in {j,…,k−1} is at least the number of clean points. Hence, k=j+1 if and only if j is a dirty point; if k>j+1 then for all j<ℓ<k the number of clean entries in {j,…,ℓ−1} exceeds the number of dirty points (otherwise ℓ is a bad point smaller than k). From the above arguments it follows that the number of clean points in {j,…,k−1} equals the number of dirty points in {j,…,k−1}.

Let \(\operatorname {\mathtt {noise}}_{j}\) be a sequence starting from j which indicates which entries in \(c^{\star}\) were flipped by \(\mathcal{A}^{\rm outer}\), i.e., every dirty point bigger than j is marked by 1 in \(\operatorname {\mathtt {noise}}_{j}\), and every clean point is marked by 0. According to the above analysis, we get that there exists a bad point k>j only if

  • \(\operatorname {\mathtt {noise}}_{j}\) begins with 1 (this is the case when k=j+1), or

  • \(\operatorname {\mathtt {noise}}_{j}\) begins with some Dyck word, where a Dyck word is a balanced string of “parentheses” in the sense that it consists of n zeros and n ones, and in every prefix the number of ones does not exceed the number of zeros (this is the case when k>j+1).

The probability of \(\operatorname {\mathtt {noise}}_{j}\) to begin with 1 is μ. The probability of \(\operatorname {\mathtt {noise}}_{j}\) to start with a specific Dyck word of length 2n is \(\mu^{n}(1-\mu)^{n}\). The number of Dyck words of length 2n is the n-th Catalan number, \(C_{n} = \frac{1}{n+1} {2n \choose n} \), and we get that the probability of a bad k>j is bounded by

$$ {\mu }+ \sum_{n=1}^{\infty} C_n \cdot {\mu }^n(1-{\mu })^n. $$

Note that this is a loose bound because every Dyck word is a prefix of longer Dyck words, and so we overcount many possibilities of bad noise. Using the Stirling approximation, \(C_{n} \approxeq \frac{4^{n}}{n^{3/2}\sqrt{\pi}} \le\frac{4^{n}}{n\sqrt{\pi}}\) for every n≥1. Therefore, the probability of failure to reconstruct \(c_{j}\) from h due to a bad k>j is bounded by

$$\begin{aligned} {\mu }+ \sum_{n=1}^{\infty} C_n \cdot {\mu }^n(1-{\mu })^n & \le {\mu }+ \sum_{n=1}^{\infty} C_n \cdot {\mu }^n \\ &\le {\mu }+ \sum_{n=1}^{\infty} \frac{(4{\mu })^n}{n\sqrt{\pi}} = {\mu }+ \frac{1}{\sqrt{\pi}} \sum_{n=1}^{\infty} \frac{(4{\mu })^n}{n} \\ &= {\mu }+ \frac{1}{\sqrt{\pi}} \bigl(-\ln{(1-4{\mu })}\bigr). \end{aligned}$$

The last equality follows from the Taylor series \(-\ln(1-x)=\sum_{n=1}^{\infty} x^{n}/n\), valid for 0≤x<1. As (−ln(1−4μ))<5μ for every 0<μ≤0.09, the probability of failure to reconstruct \(c^{\star}\) out of h due to a bad k>j is bounded by \({\mu }+ \frac{1}{\sqrt{\pi}}\cdot5{\mu }< 4{\mu }\). Due to symmetry, the probability of failing because of a bad k<j is also bounded by 4μ. Thus, for small enough values of μ, the probability of failure to reconstruct \(\mathcal{A}^{\rm inner}\)’s original output \(c^{\star}\) (i.e., the probability that \(c_{\ell} \neq c^{\star}\)) from h is bounded by 8μ.
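The chain of bounds above can be checked numerically. The following Python sketch evaluates a truncated version of the Dyck-word series and compares it with the logarithmic bound and with 4μ. The truncation at 200 terms, the function names, and the sample values of μ are illustrative assumptions.

```python
import math


def dyck_series(mu, terms=200):
    """Partial sum of mu + sum_{n>=1} C_n * mu^n * (1 - mu)^n, with C_n the n-th Catalan number."""
    total = mu
    for n in range(1, terms + 1):
        catalan = math.comb(2 * n, n) // (n + 1)
        total += catalan * (mu * (1 - mu)) ** n
    return total


def log_bound(mu):
    """The bound mu + (1/sqrt(pi)) * (-ln(1 - 4*mu)), defined for mu < 1/4."""
    return mu + (-math.log(1 - 4 * mu)) / math.sqrt(math.pi)


for mu in (0.01, 0.05, 0.09):
    print(mu, dyck_series(mu), log_bound(mu), 4 * mu)
```

For each of these values the series stays below the logarithmic bound, which in turn stays below 4μ.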

To conclude the proof, we construct \(\mathcal{A}\), a proper private learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\), using \(\mathcal{A}^{\rm outer}\). Learner \(\mathcal{A}\) executes as follows:

  1. 1.

    Let \(\beta' = \frac{\beta}{4}\) and \(\alpha' = \frac{\min(\alpha ,\beta)}{100}\).

  2. 2.

    Apply \(\mathcal{A}^{\rm outer}\) with parameters ϵ,d,α′,β′ to improperly learn \(\operatorname {\mathtt {INTERVAL}}_{d}\) using o α′,β′,ϵ (d) samples. Let h be the output of \(\mathcal{A}^{\rm outer}\). If \(\mathcal{A}^{\rm outer}\) fails then halt.

  3. 3.

    Reconstruct a concept \(c_{\ell}\in \operatorname {\mathtt {INTERVAL}}_{d}\) out of the noisy hypothesis h (as described in the reconstruction algorithm above) and return it.

Note that the sample complexity of \(\mathcal{A}\) is \(o_{\alpha',\beta',\epsilon}(d)=o_{\alpha,\beta,\epsilon}(d)\). Also note that the reconstruction step does not access D, but only the output of \(\mathcal{A}^{\rm outer}\). As \(\mathcal{A}^{\rm outer}\) is ϵ-differentially private, so is \(\mathcal{A}\). Finally, note that the probability that \(\mathcal{A}\) fails to output \(c_{\ell}\in \operatorname {\mathtt {INTERVAL}}_{d}\) such that \(\mathop {\rm error}_{\mathcal{D}}(c_{\ell},c)\leq\alpha\) is bounded by the sum of the probability that the reconstruction algorithm fails (i.e., \(c_{\ell} \neq c^{\star}\)) and the probability that \(\mathcal{A}^{\rm inner}\) fails to output \(c^{\star}\in \operatorname {\mathtt {INTERVAL}}_{d}\) such that \(\mathop {\rm error}_{\mathcal{D}}(c^{\star},c)\leq \alpha^{\star}\leq\alpha' \leq\alpha\). Recall that μ≤2α′. Since 2α′≤0.02 (for α≤1) this implies that μ≤0.02 and the above condition μ≤0.09 is satisfied, and hence,

$$\Pr\bigl[\mathop {\rm error}_\mathcal{D}(c_\ell,c_t) \geq\alpha \bigr] \leq\beta^\star+ 8{\mu }\leq\beta' + 8\cdot2 \alpha' \leq\frac{\beta}{4} + 16\cdot\frac{\beta }{100} \leq \beta. $$

Note that \(\beta^{\star} \le \beta'\) from the definition of private independent noise learner. Thus, the algorithm \(\mathcal{A}\) returns a concept \(c_{\ell}= c^{\star}\in \operatorname {\mathtt {INTERVAL}}_{d}\) such that \(\Pr[\mathop {\rm error}_{\mathcal{D}}(c_{\ell},c_{t}) \geq \alpha] \leq\beta\), and so it is a proper ϵ-differentially private (α,β)-PAC learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) with sample complexity \(o_{\alpha,\beta,\epsilon}(d)\), in contradiction to Lemma 5.2. □

6 Separation between efficient and inefficient proper private PAC learning

In this section, we use the sample size lower bound for proper private learning \(\operatorname {\mathtt {POINT}}_{d}\) (Corollary 3.8) to obtain a separation between the sample complexities of efficient and inefficient proper private PAC learning. In the case of efficient proper private learning, we use a slightly relaxed notion of proper learning for reasons explained below.

In our separation we use pseudorandom generators, which we now define. Let \(U_{r}\) represent a uniformly random string from \(\{0,1\}^{r}\). Let \(\ell(d):\mathbb {N}\rightarrow \mathbb {N}\) be a function and \(G=\{G_{d}\}_{d \in \mathbb {N}}\) be a deterministic algorithm such that on input from \(\{0,1\}^{\ell(d)}\) it returns an output from \(\{0,1\}^{d}\). Informally, we say that G is a pseudorandom generator if on ℓ(d) truly random bits it outputs d bits that are indistinguishable from d random bits. Formally, for every probabilistic polynomial time algorithm \(\mathcal{B}\) there exists a negligible function \(\mathop {\rm negl}(d)\) (i.e., a function that is asymptotically smaller than \(1/d^{c}\) for all c>0) such that

$$\begin{aligned} \bigl \vert \Pr\bigl[\mathcal{B}\bigl(G_d(U_{\ell(d)}) \bigr)=1\bigr]-\Pr\bigl[\mathcal{B}(U_d)=1\bigr]\bigr \vert \leq& \mathop {\rm negl}(d). && \text{(10)} \end{aligned}$$

Pseudorandom generators G with ℓ(d)=ω(log d) exist under various strong hardness assumptions (Goldreich 2001). The difference d−ℓ(d) is defined as the stretch of the pseudorandom generator. Let \(\operatorname {\mathtt {POINT}}_{d} = \{c_{1},\ldots,c_{2^{d}}\}\). To an efficient (polynomially bounded) private learner, the concept \(c_{G_{d}(U_{\ell(d)})}\) would appear as a uniformly random concept picked from \(\operatorname {\mathtt {POINT}}_{d}\). Define the concept class

$${\widehat {\operatorname {\mathtt {POINT}}}}_d = \bigl\{ c_{G_d(r)} \,|\, r \in\{0,1 \}^{\ell(d)}\bigr\}. $$

First, we show that, assuming G is a pseudorandom generator, there exists no efficient proper learner for \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) (note that this statement holds even without the privacy constraint). Assume \(\mathcal{A}_{p}\) is an efficient proper learner for \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\). We use \(\mathcal{A}_{p}\) to construct a distinguisher for the pseudorandom generator as follows: Given \(j\in\{1,\ldots,2^{d}\}\), we construct the database D with m entries (j,1). If \(\mathcal{A}_{p}(D)=c_{j}\), then the distinguisher returns 1, otherwise it returns 0.

  1. (1)

    If j is in the image of \(G_{d}\), then by the utility guarantee of the proper learner, \(\mathcal{A}_{p}\) has to return \(c_{j}\) on D with probability at least 1−β. Thus, the distinguisher returns 1 with probability at least 1−β when j is chosen from \(G_{d}(U_{\ell(d)})\).

  2. (2)

    If j is not in the image of \(G_{d}\), then the database D is not labeled consistently by any concept in \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\). For any such j, a proper learner, which returns a hypothesis from \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\), cannot return \(c_{j}\), so the distinguisher never returns 1 (i.e., always returns 0). Therefore, the probability that the distinguisher returns 1 when \(j=U_{d}\) is at most the probability that j is in the image of \(G_{d}\), which is at most \(2^{\ell(d)}/2^{d}\); this is \(\mathop {\rm negl}(d)\) whenever the stretch d−ℓ(d) is ω(log d).

To summarize, assuming \(\mathcal{A}_{p}\) is an efficient proper learner for \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\), the distinguisher will return 1 with probability at least 1−β when \(j=G_{d}(U_{\ell(d)})\), and with probability at most \(\mathop {\rm negl}(d)\) when \(j=U_{d}\), in contradiction to (10). We conclude that no efficient proper learner exists for \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) and, therefore, we relax in the following our notion of proper private learners for \({\widehat {\operatorname {\mathtt {POINT}}}}\) to allow outputting hypotheses from \(\operatorname {\mathtt {POINT}}\). We show that under this liberal relaxation, efficient proper learning of \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) with sample complexity o(d) is not possible. However, we show that inefficient proper private learning of \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) with sample complexity o(d) is possible under the strict definition of proper learning.
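The distinguisher described above can be illustrated on a toy domain. In the Python sketch below, a concept from the class of point functions is identified with its index j, the set `image` is a small stand-in for the image of the generator (it is not a pseudorandom generator), and `toy_proper_learner` is a hypothetical brute-force proper learner. All of these names and values are assumptions made for illustration.

```python
def distinguisher(j, learner, m):
    """Return 1 if the learner, run on m copies of the example (j, 1), outputs c_j; else 0."""
    database = [(j, 1)] * m
    return 1 if learner(database) == j else 0


def toy_proper_learner(image):
    """A proper learner for {c_j : j in image}: it always outputs an index inside `image`."""
    def learner(database):
        x = database[0][0]
        # If x is not in the image, any fixed concept of the class is returned.
        return x if x in image else min(image)
    return learner


def acceptance_rate(points, learner, m):
    """Fraction of the given points on which the distinguisher returns 1."""
    return sum(distinguisher(j, learner, m) for j in points) / len(points)


image = {3, 10}                       # stand-in for the image of the generator
learner = toy_proper_learner(image)
print(acceptance_rate(sorted(image), learner, m=4))     # inputs drawn from the image
print(acceptance_rate(range(1, 17), learner, m=4))      # inputs drawn from the whole domain
```

The gap between the two acceptance rates is what the argument exploits: a proper learner can only succeed on points inside the image, which is a small fraction of the whole domain.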

Sample complexity of efficiently private learning \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) using \(\operatorname {\mathtt {POINT}}_{d}\)

Consider an efficient private learner \(\mathcal{A}_{\mathop {\rm eff}}\) that learns \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) using \(\operatorname {\mathtt {POINT}}_{d}\) and has sample complexity m. We now show that either a distinguisher exists for the pseudorandom generator G d or m=Ω β,ϵ (d). Assume β<1/4.

We use \(\mathcal{A}_{\mathop {\rm eff}}\) to construct a distinguisher for the pseudorandom generator as follows: Given \(j\in\{1,\ldots,2^{d}\}\), we construct the database D with m entries (j,1). If \(\mathcal{A}_{\mathop {\rm eff}}(D)=c_{j}\), then the distinguisher returns 1, otherwise it returns 0.

If for at least a 3/4 fraction of the values \(j\in[2^{d}]\), algorithm \(\mathcal{A}_{\mathop {\rm eff}}\), when applied to a database with m entries (j,1), does not return \(c_{j}\) with probability at least 3/4, then the distinguisher succeeds in breaking the pseudorandom generator. This is because if the above statement is not true then the distinguisher returns 1 with probability at most 3/4 when \(j=U_{d}\), and the distinguisher will return 1 with probability at least 1−β>3/4 when \(j=G_{d}(U_{\ell(d)})\) (see Footnote 11).

However, arguments similar to those in the proof of Theorem 3.6 show that it is not possible to have a learner that on a 3/4 fraction of the values \(j\in[2^{d}]\), when applied to a database with m=o((d+log(1/β))/ϵ) entries (j,1), returns \(c_{j}\) with probability at least 3/4. This means that either we have a distinguisher for the pseudorandom generator or the sample complexity of \(\mathcal{A}_{\mathop {\rm eff}}\) is at least \(\varOmega_{\beta,\epsilon}(d)\). So, assuming the existence of a pseudorandom generator, we get that there exists no efficient private learner that learns \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) using \(\operatorname {\mathtt {POINT}}_{d}\) and has o((d+log(1/β))/ϵ) sample complexity (see Footnote 12).

Sample complexity of inefficient proper private learners for \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\)

If the learner is not polynomially bounded, then it can use the algorithm from Theorem 3.2 to privately learn \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\). Since \(|{\widehat {\operatorname {\mathtt {POINT}}}}_{d}|=2^{\ell(d)}\), the private learner from Theorem 3.2 uses O((ℓ(d)+log(1/β))/(ϵα)) samples.

We get the following separation between efficient and inefficient proper private learning:

Theorem 6.1

Let ℓ(d) be any function that grows as ω(log d). Assuming the existence of a pseudorandom generator \(G_{d} : \{0,1\}^{\ell(d)}\rightarrow\{0,1\}^{d}\), there exists no efficient proper PAC learner for \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) and every efficient (polynomial-time) private PAC learner that learns \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) using \(\operatorname {\mathtt {POINT}}_{d}\) requires Ω((d+log(1/β))/ϵ) samples, whereas there exists an inefficient proper private PAC learner that can learn \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) using O((ℓ(d)+log(1/β))/(ϵα)) samples.

Remark 6.2

In the non-private setting, there exists an efficient proper learner that can learn \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) using \(\operatorname {\mathtt {POINT}}_{d}\) with O((log(1/α)+log(1/β))/α) samples (as \(\mathrm{\it VCDIM}({\widehat {\operatorname {\mathtt {POINT}}}}_{d})=1\)). In the non-private setting, we also know that even inefficient learners require Ω(log(1/β)/α) samples (Ehrenfeucht et al. 1989; Kearns and Vazirani 1994). Therefore, for \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) the sample complexity difference that we observe in Theorem 6.1 does not exist without the privacy constraint.

7 Lower bounds for non-interactive sanitization

We now prove a lower bound on the database size (or sample size) needed to privately release an output that is useful for all concepts in a concept class. We start by recalling a definition and a result of Blum et al. (2008).

Let \(X = \{X_{d}\}_{d\in \mathbb {N}}\) be some discretized domain and consider a class of predicates \(\mathcal{C}\) over X. A database D contains points taken from X d . A predicate query Q c for c:X d →{0,1} in \(\mathcal{C}\) is defined as

$$Q_c(D) = \frac{|\{d_i \in D \,:\, c(d_i) =1\} |}{|D|}. $$

A sanitizer (or data release mechanism) is a differentially private algorithm \(\mathcal{A}\) that gets as input a database D and outputs another database \(\widehat{D}\) with entries taken from \(X_{d}\). An algorithm \(\mathcal{A}\) is (α,β)-useful for predicates in the class \(\mathcal{C}\) if for every database D with probability at least 1−β the algorithm \(\mathcal{A}(D)\) returns a database \(\widehat{D}\) such that for every \(c \in \mathcal{C}\),

$$\bigl|Q_c(D)-Q_c(\widehat{D})\bigr| < \alpha. $$

Theorem 7.1

(Blum et al. 2008)

For any class of predicates \(\mathcal{C}\), and any database \(D \in X_{d}^{m}\), such that

$$m \geq O \biggl(\frac{\log(|X_d|) \cdot\mathrm{\it VCDIM}(\mathcal{C}) \log (1/\alpha)}{\alpha^3\epsilon }+ \frac{\log(1/\beta)}{\epsilon \alpha} \biggr), $$

there exists an (α,β)-useful mechanism \(\mathcal{A}\) that preserves ϵ-differential privacy. The algorithm might not be efficient.

We show that the dependency on \(\log(|X_{d}|)\) in Theorem 7.1 is essential: there exists a class of predicates \(\mathcal{C}\) with VC-dimension O(1) that requires \(|D| = \Omega_{\alpha,\beta,\epsilon}(\log(|X_{d}|))\). For our lower bound, the sanitized output \(\widehat{D}\) could be an arbitrary data structure (not necessarily a synthetic database). Recall that a synthetic database contains data drawn from the same domain as the original database, and that the mechanism of Theorem 7.1 outputs a synthetic database. For simplicity, however, we focus here on the case where the output is a synthetic database. The proof of this lower bound uses ideas from Sect. 3.1.

Theorem 7.2

Every ϵ-differentially private non-interactive mechanism that is (α,β)-useful for \(\operatorname {\mathtt {POINT}}_{d}\) requires an input database of size Ω((d+log(1/β))/(ϵα)).

Proof

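Let \(T = 2^{d}\) and let \(X_{d} = [T]\) be the domain. Consider the class \(\operatorname {\mathtt {POINT}}_{d}\). For every i∈[T], construct a database \(D_{i} \in X_{d}^{m}\) by setting (1−3α)m entries to 1 and the remaining 3αm entries to i (for i=1 all entries of \(D_{1}\) are 1). For i∈[T]∖{1}, we say that a database \(\widehat{D}\) is α-useful for \(D_{i}\) if \(2\alpha < Q_{c_{i}}(\widehat{D}) < 4\alpha\) and \(1-4\alpha< Q_{c_{1}}(\widehat{D}) < 1-2\alpha\). We say that \(\widehat{D}\) is α-useful for \(D_{1}\) if \(1-\alpha< Q_{c_{1}}(\widehat{D}) \leq1\). It follows that for i≠j, if \(\widehat{D}\) is α-useful for \(D_{i}\) then it is not α-useful for \(D_{j}\). Indeed, if i=1 or j=1, the two admissible ranges for \(Q_{c_{1}}(\widehat{D})\) are disjoint; and if i,j≠1, usefulness for both would give \(Q_{c_{i}}(\widehat{D}) + Q_{c_{j}}(\widehat{D}) + Q_{c_{1}}(\widehat{D}) > 2\alpha + 2\alpha + (1-4\alpha) = 1\), which is impossible because the three fractions count disjoint sets of entries of \(\widehat{D}\).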
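The construction in this paragraph can be checked on a small instance. The Python sketch below builds the databases for toy parameters and tests the usefulness conditions exactly as stated in the proof. The function names `build_database`, `frac`, and `is_useful_for` and the values α = 1/10, m = 20, T = 8 are assumptions chosen for illustration; the sketch also assumes that 3αm is an integer, which the proof leaves implicit.

```python
from fractions import Fraction

def build_database(i, m, alpha):
    """D_i: (1 - 3*alpha)*m entries equal to 1, remaining 3*alpha*m equal to i."""
    heavy = 3 * alpha * m
    assert heavy == int(heavy), "this sketch assumes 3*alpha*m is an integer"
    heavy = int(heavy)
    return [1] * (m - heavy) + [i] * heavy

def frac(value, database):
    """Q_{c_value}(D): fraction of entries of D equal to `value`."""
    return Fraction(database.count(value), len(database))

def is_useful_for(i, candidate, alpha):
    """The proof's alpha-usefulness condition for D_i, applied to a candidate."""
    q1 = frac(1, candidate)
    if i == 1:
        return 1 - alpha < q1 <= 1
    qi = frac(i, candidate)
    return 2 * alpha < qi < 4 * alpha and 1 - 4 * alpha < q1 < 1 - 2 * alpha

# Toy parameters (hypothetical): alpha = 1/10, m = 20, T = 8.
alpha, m, T = Fraction(1, 10), 20, 8
databases = {i: build_database(i, m, alpha) for i in range(1, T + 1)}
```

On this instance each database, viewed as a candidate output, is α-useful for its own index and for no other, which is consistent with the disjointness claim. Passing `alpha` as a `Fraction` rather than a float keeps the integrality check and the strict inequalities exact.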

Let \(\widehat{\mathbb{D}}_{i}\) be the set of all databases that are α-useful for \(D_{i}\). Note that for all i≠1, databases \(D_{1}\) and \(D_{i}\) differ on 3αm entries, and by our previous observation, \(\widehat{\mathbb{D}}_{1} \cap\widehat{\mathbb{D}}_{i} = \emptyset\). Let \(\mathcal{A}\) be an (α,β)-useful private release mechanism for \(\operatorname {\mathtt {POINT}}_{d}\). For all i, on input \(D_{i}\) mechanism \(\mathcal{A}\) should pick an output from \(\widehat{\mathbb{D}}_{i}\) with probability at least 1−β. We get by the differential privacy of \(\mathcal{A}\) that

$$\Pr\bigl[\mathcal{A}(D_1) \in\widehat{ \mathbb{D}}_i\bigr] \geq\exp(-3\epsilon \alpha m) \Pr\bigl[\mathcal{A}(D_i) \in\widehat{ \mathbb{D}}_i\bigr] \geq\exp(-3 \epsilon \alpha m) \cdot(1-\beta). $$

Hence,

$$\begin{aligned} \Pr\bigl[\mathcal{A}(D_1) \notin\widehat{\mathbb{D}}_1\bigr] \geq& \Pr\biggl[\mathcal{A}(D_1) \in\bigcup_{i\neq1} \widehat{ \mathbb{D}}_i\biggr] \\ = & \sum_{i\neq1} \Pr\bigl[\mathcal{A}(D_1) \in\widehat{ \mathbb{D}}_i\bigr] \quad (\mbox{sets}\ \widehat{ \mathbb{D}}_i\ \mbox{are disjoint}) \\ \geq& (T-1)\exp(-3\epsilon \alpha m) \cdot(1-\beta). \end{aligned}$$

On the other hand, since \(\mathcal{A}\) is (α,β)-useful, \(\Pr[\mathcal{A}(D_{1}) \notin\widehat{\mathbb{D}}_{1}] < \beta\), and hence, we get that m=Ω((d+log(1/β))/(ϵα)). □
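For readers who want the last step spelled out, the following is a sketch of the calculation. It combines the two bounds on \(\Pr[\mathcal{A}(D_{1}) \notin\widehat{\mathbb{D}}_{1}]\) and solves for m; the conditions d ≥ 2 and β bounded away from 1/2 (say β ≤ 1/4) are assumptions added here to make the asymptotic statement explicit, and are not stated in the proof.

$$\begin{aligned} (T-1)\exp(-3\epsilon \alpha m) \cdot(1-\beta) < \beta \quad \Longrightarrow\quad m >& \frac{1}{3\epsilon \alpha} \ln\frac{(T-1)(1-\beta)}{\beta} \\ = & \frac{1}{3\epsilon \alpha} \biggl(\ln\bigl(2^{d}-1\bigr) + \ln\frac{1-\beta}{\beta} \biggr). \end{aligned}$$

Since \(\ln(2^{d}-1) \geq(d-1)\ln2 \geq(d/2)\ln2\) for d ≥ 2, and \(\ln((1-\beta)/\beta) = \Omega(\log(1/\beta))\) under the assumption on β, the right-hand side is Ω((d+log(1/β))/(ϵα)).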