Bounds on the sample complexity for private learning and private data release
Abstract
Learning is a task that generalizes many of the analyses that are applied to collections of data, in particular, to collections of sensitive individual information. Hence, it is natural to ask what can be learned while preserving individual privacy. Kasiviswanathan et al. (in SIAM J. Comput., 40(3):793–826, 2011) initiated such a discussion. They formalized the notion of private learning, as a combination of PAC learning and differential privacy, and investigated what concept classes can be learned privately. Somewhat surprisingly, they showed that for finite, discrete domains (ignoring time complexity), every PAC learning task could be performed privately with polynomially many labeled examples; in many natural cases this could even be done in polynomial time.
While these results seem to equate nonprivate and private learning, there is still a significant gap: the sample complexity of (nonprivate) PAC learning is crisply characterized in terms of the VC-dimension of the concept class, whereas this relationship is lost in the constructions of private learners, which generally exhibit a higher sample complexity.
Looking into this gap, we examine several private learning tasks and give tight bounds on their sample complexity. In particular, we show strong separations between sample complexities of proper and improper private learners (such a separation does not exist for nonprivate learners), and between sample complexities of efficient and inefficient proper private learners. Our results show that VC-dimension is not the right measure for characterizing the sample complexity of proper private learning.
We also examine the task of private data release (as initiated by Blum et al. in STOC, pp. 609–618, 2008), and give new lower bounds on the sample complexity. Our results show that the logarithmic dependence on the size of the instance space is essential for private data release.
Keywords
Differential privacy, PAC learning, Sample complexity, Private data release
1 Introduction
Consider a scenario in which a survey is conducted among a sample of random individuals and data mining techniques are applied to learn information on the entire population. If such information will disclose information on the individuals participating in the survey, then they will be reluctant to participate in the survey. To address this question, Kasiviswanathan et al. (2011) introduced the notion of private learning, where a private learner is required to output a hypothesis that gives accurate classification while protecting the privacy of the individual samples from which the hypothesis was obtained.
The definition of a private learner is a combination of two qualitatively different notions. One is that of probably approximately correct (PAC) learning (Valiant 1984), the other of differential privacy (Dwork et al. 2006). PAC learning, on one hand, is an average-case requirement, which requires that the output of the learner on most samples is good. Differential privacy, on the other hand, is a worst-case requirement. It is a strong notion of privacy that provides meaningful guarantees in the presence of powerful attackers and is increasingly accepted as a standard for providing rigorous privacy. Recent research on privacy has shown, somewhat surprisingly, that it is possible to design differentially private variants of many analyses. Further discussions on differential privacy can be found in the surveys of Dwork (2009, 2011).
We next give more details on PAC learning and differential privacy. In PAC learning, a collection of samples (labeled examples) is generalized into a hypothesis. It is assumed that the examples are generated by sampling from some (unknown) distribution \(\mathcal{D}\) and are labeled according to an (unknown) concept c taken from some concept class \(\mathcal{C}\). The learned hypothesis h should predict with high accuracy the labeling of examples taken from the distribution \(\mathcal{D}\), an average-case requirement. In differential privacy the output of a learner should not be significantly affected if a particular example is replaced with an arbitrary example. Concretely, differential privacy considers the collection of samples as a database, defines two databases to be neighbors if they differ in exactly one sample, and requires that for every two neighboring databases the output distributions of a private learner be similar.
In this paper, we consider private learning of finite, discrete domains. Finite domains are natural as computers only store information with finite precision. The work of Kasiviswanathan et al. (2011) demonstrated that private learning in such domains is feasible—any concept class that is PAC learnable can be learned privately (but not necessarily efficiently), by a “private Occam’s razor” algorithm, with sample complexity that is logarithmic in the size of the hypothesis class.^{1} Furthermore, taking into account the earlier result of Blum et al. (2005) (that all concept classes that can be efficiently learned in the statistical queries model can be learned privately and efficiently) and the efficient private parity learner of Kasiviswanathan et al. (2011), we get that most “natural” computational learning tasks can be performed privately and efficiently (i.e., with polynomial resources). This is important as learning problems generalize many of the computations performed by analysts over collections of sensitive data.
The results of Blum et al. (2005) and Kasiviswanathan et al. (2011) show that private learning is feasible in an extremely broad sense, and hence, one can essentially equate learning and private learning. However, the costs of the private learners constructed in Blum et al. (2005) and Kasiviswanathan et al. (2011) are generally higher than those of nonprivate ones by factors that depend not only on the privacy, accuracy, and confidence parameters of the private learner. In particular, the well-known relationship between the sample complexity of PAC learners and the VC-dimension of the concept class (ignoring computational efficiency) (Blumer et al. 1989) does not hold for the above constructions of private learners; the sample complexity of the algorithms of Blum et al. (2005) and Kasiviswanathan et al. (2011) is proportional to the logarithm of the size of the concept class. Recall that the VC-dimension of a concept class is bounded by the logarithm of its size, and is significantly lower for many interesting concept classes. Hence, there may exist learning tasks for which a “very practical” nonprivate learner exists, but any private learner is “impractical” (with respect to the sample size required).
The focus of this work is on a fine-grained examination of the differences in complexity between private and nonprivate learning. The hope is that such an examination will eventually lead to an understanding of which complexity measure is relevant for the sample complexity of private learning, similar to the well-understood relationship between the VC-dimension and sample complexity of PAC learning. Such an examination is interesting also for other tasks, and a second task we examine is that of releasing a sanitization of a data set that simultaneously protects privacy of individual contributors and offers utility to the data analyst. See the discussion in Sect. 1.1.2.
1.1 Our contributions
We now give a brief account of our results. Throughout this rather informal discussion we will treat the accuracy, confidence, and privacy parameters as constants (a detailed analysis revealing the dependency on these parameters is presented in the technical sections). We use the term “efficient” for polynomial time computations.
Our separation results (ignoring the dependence on ϵ, α, β), where ℓ(d) is any function that grows as ω(log d)
Concept class | Sample complexity
\(\operatorname {\mathtt {POINT}}_{d}\) | Nonprivate learning (proper or improper): Θ(1) | Improper private learning: Θ(1) | Proper private learning: Θ(d)
\({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) | Nonprivate learning (efficient or inefficient): Θ(1) | Inefficient proper private learning: Θ(ℓ(d)) | Efficient proper private learning^{a}: Θ(d)
1.1.1 Proper and improper private learning
It is instructive to look into the construction of the private Occam’s razor algorithm of Kasiviswanathan et al. (2011) and see why its sample complexity is proportional to the logarithm of the size of the hypothesis class used. The algorithm uses the exponential mechanism of McSherry and Talwar (2007) to choose a hypothesis. The choice is probabilistic, where the probability mass that is assigned to each of the hypotheses decreases exponentially with the number of samples that are inconsistent with it. A union-bound argument is used in the claim that the construction actually yields a learner, and a sample size that is logarithmic in the size of the hypothesis class is needed for the argument to go through. The question is whether such a sample size is required.
To address the above question, we consider a simple, but natural, class \(\operatorname {\mathtt {POINT}}=\{\operatorname {\mathtt {POINT}}_{d}\}\) containing the concepts c_{j}:{0,1}^{d}→{0,1} where c_{j}(x)=1 for x=j, and 0 otherwise. The VC-dimension of \(\operatorname {\mathtt {POINT}}_{d}\) is one, and hence, it can be learned (nonprivately and efficiently, properly or improperly) with merely O(1) samples.
In sharp contrast, when used for properly learning \(\operatorname {\mathtt {POINT}}_{d}\), the above-mentioned private Occam’s razor algorithm from Kasiviswanathan et al. (2011) requires \(O(\log|\operatorname {\mathtt {POINT}}_{d}|) = O(d)\) samples, obtaining the largest possible gap in sample complexity when compared to nonprivate learners! Our first result is a matching lower bound. We prove that any proper private learner for \(\operatorname {\mathtt {POINT}}_{d}\) must use Ω(d) samples, therefore answering negatively the question (from Kasiviswanathan et al. (2011)) of whether proper private learners should exhibit sample complexity that is approximately the VC-dimension (or even a function of the VC-dimension) of the concept class.^{2}
A natural way to improve the sample complexity is to use the private Occam’s razor to improperly learn \(\operatorname {\mathtt {POINT}}_{d}\) with a smaller hypothesis class that is still expressive enough for \(\operatorname {\mathtt {POINT}}_{d}\), reducing the sample complexity to the logarithm of the size of the smaller hypothesis class. We show that this indeed is possible, as there exists a hypothesis class of size O(d) that can be used for learning \(\operatorname {\mathtt {POINT}}_{d}\) improperly, yielding an algorithm with sample complexity O(log d). Furthermore, this bound is tight: any hypothesis class for learning \(\operatorname {\mathtt {POINT}}_{d}\) must contain Ω(d) hypotheses. These bounds are interesting as they give a separation between proper and improper private learning: proper private learning of \(\operatorname {\mathtt {POINT}}_{d}\) requires Ω(d) samples, whereas \(\operatorname {\mathtt {POINT}}_{d}\) can be improperly privately learned using O(log d) samples. Note that such a combinatorial separation does not exist for nonprivate learning, as a number of samples proportional to the VC-dimension is both necessary and sufficient for proper and improper nonprivate learners. Furthermore, the Ω(d) lower bound on the size of the hypothesis class maps a clear boundary to what can be achieved in terms of sample complexity using the private Occam’s razor for \(\operatorname {\mathtt {POINT}}_{d}\). It might even suggest that any private learner for \(\operatorname {\mathtt {POINT}}_{d}\) should use Ω(log d) samples.
It turns out, however, that the intuition expressed in the last sentence is at fault. We construct an efficient improper private learner for \(\operatorname {\mathtt {POINT}}_{d}\) that uses merely O(1) samples, hence, establishing the strongest possible separation between proper and improper private learners. For the construction, we extrapolate on a technique from the efficient private parity learner of Kasiviswanathan et al. (2011). The construction of Kasiviswanathan et al. (2011) utilizes a natural nonprivate proper learner, and hence, results in a proper private learner. Due to the bounds mentioned above, we cannot use a proper learner for \(\operatorname {\mathtt {POINT}}_{d}\), and hence, we construct an improper (rather unnatural) learner to base our construction upon. Our construction utilizes a double-exponential hypothesis class, and hence, is inefficient (even outputting a hypothesis requires super-polynomial time). We use a simple compression using pseudorandom functions (akin to Mishra and Sandler (2006)) to make the algorithm efficient.
The above two improper learning algorithms use “heavy” hypotheses, that is, the hypotheses are Boolean functions that return 1 on many inputs (in contrast to a point function that returns 1 on exactly one input). Informally, each such heavy hypothesis protects the privacy since it could have been returned on many different concepts. The main technical point in these algorithms is how to choose a heavy hypothesis with a small error. To complete the picture, we prove that using heavy hypotheses is unavoidable: Every private learning algorithm for \(\operatorname {\mathtt {POINT}}_{d}\) that uses o(d) samples must use heavy hypotheses.
Next we look into the concept class \(\operatorname {\mathtt {INTERVAL}}=\{\operatorname {\mathtt {INTERVAL}}_{d}\} \), where for T=2^{d} we define \(\operatorname {\mathtt {INTERVAL}}_{d}=\{ c_{1},\ldots,c_{T+1} \}\) and, for 1≤j≤T+1, the concept c_{j}:{1,…,T+1}→{0,1} is defined as follows: c_{j}(x)=1 for x<j and c_{j}(x)=0 otherwise. As with \(\operatorname {\mathtt {POINT}}_{d}\), it is easy to show that the sample complexity of any proper private learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) is Ω(d). We give two results regarding the sample complexity of improper private learning of \(\operatorname {\mathtt {INTERVAL}}_{d}\). The first result shows that if a sublinear (in d) sample complexity private learner exists for \(\operatorname {\mathtt {INTERVAL}}_{d}\), then it must output, with high probability, a very “complex looking” hypothesis in the sense that the hypothesis must switch from zero to one (and vice versa) exponentially many times, unlike any concept \(c_{j} \in \operatorname {\mathtt {INTERVAL}}_{d}\) that switches only once from one to zero at j. The second result considers a generalization of the technique that yielded the O(1) sample improper private learner for \(\operatorname {\mathtt {POINT}}_{d}\), and shows that it alone would not yield a private learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) with sublinear (in d) sample complexity.
We apply the above lower bound on the number of samples for proper private learning \(\operatorname {\mathtt {POINT}}_{d}\) to show a separation in the sample complexity of efficient proper private learners (under a slightly relaxed definition of proper learning) and inefficient proper private learners. More concretely, assuming the existence of a pseudorandom generator with exponential stretch, we present a concept class \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\)—a subset of \(\operatorname {\mathtt {POINT}}_{d}\)—such that every efficient private learner that learns \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) using \(\operatorname {\mathtt {POINT}}_{d}\) requires Ω(d) samples. In contrast, an inefficient proper private learner exists that uses only a superlogarithmic number of samples. This is the first example in private learning where requiring efficiency on top of privacy comes at a price of larger sample size.
1.1.2 The sample size of noninteractive sanitization mechanisms
Given a database containing a collection of individual information, a sanitization is a release of information that protects the privacy of the individual contributors while offering utility to the analyst using the database. The setting is noninteractive if, once the sanitization is released, the original database and the curator play no further role. Blum et al. (2008) presented a construction of such noninteractive sanitizers for count queries. Let \(\mathcal{C}\) be a concept class consisting of efficiently computable predicates from a discretized domain X to {0,1}. Given a collection D of data items taken from X, Blum et al. employ the exponential mechanism (McSherry and Talwar 2007) to (inefficiently) obtain another collection D′ with data items from X such that D′ maintains an approximately correct count of ∑_{d∈D} c(d) for all concepts \(c\in \mathcal{C}\), provided that the size of D is \(O(\log|X| \cdot\mathrm{\it VCDIM}(\mathcal{C}))\). As D′ is generated using the exponential mechanism, the differential privacy of D is protected. The database D′ is referred to as a synthetic database as it contains data items drawn from the same universe (i.e., from X) as the original database D.
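As a minimal sketch of what “approximately correct counts” can mean, the Python snippet below compares a database with a candidate synthetic database on every count query of a concept class. The helper names `fractional_count` and `max_count_error` are hypothetical, and normalizing the counts by the database size is an assumption made here so that databases of different sizes can be compared; the mechanism of Blum et al. itself is not reproduced.

```python
def fractional_count(db, concept):
    """Fraction of the items in db that the concept labels 1."""
    return sum(concept(x) for x in db) / len(db)


def max_count_error(db, synthetic, concepts):
    """Largest gap, over all concepts, between the fractional counts on db
    and on the synthetic database (assumed notion of utility)."""
    return max(
        abs(fractional_count(db, c) - fractional_count(synthetic, c))
        for c in concepts
    )


# Point concepts over the toy domain {1, 2, 3, 4}.
point_concepts = [lambda x, j=j: int(x == j) for j in range(1, 5)]
error = max_count_error([1, 1, 2, 3], [1, 2, 2, 3], point_concepts)
```

A sanitizer would be judged useful in this sense when `max_count_error` is small with high probability over its randomness.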
We provide a new lower bound for noninteractive sanitization mechanisms. We show that for \(\operatorname {\mathtt {POINT}}_{d}\) every noninteractive sanitization mechanism that is useful^{3} for \(\operatorname {\mathtt {POINT}}_{d}\) requires a database of size Ω(d). This lower bound is tight as the sanitization mechanism of Blum et al. for \(\operatorname {\mathtt {POINT}}_{d}\) uses a database of size \(O(d \cdot\mathrm{\it VCDIM}(\operatorname {\mathtt {POINT}}_{d})) = O(d)\). Our lower bound holds even if the sanitized output is an arbitrary data structure, i.e., not necessarily a synthetic database.
A preliminary version of this paper appeared in the 7th Theory of Cryptography Conference (TCC), 2010. The TCC paper contained a proof sketch of the results presented in Sects. 3, 4.2, 6, and 7. The results presented in Sects. 4.1, 4.3, and 5 are new.
1.2 Related work
The notion of PAC learning was introduced by Valiant (1984). The notion of differential privacy was introduced by Dwork et al. (2006). Private learning was introduced in Kasiviswanathan et al. (2011). Beyond proving that (ignoring computation) every concept class with finite, discrete domain can be PAC learned privately (see Theorem 3.2 below), Kasiviswanathan et al. proved an equivalence between learning in the statistical queries model and private learning in the local communication model (a.k.a. randomized response). The general private data release mechanism we mentioned above was introduced in Blum et al. (2008) along with a specific construction for halfspace queries. Also as mentioned above, both Kasiviswanathan et al. (2011) and Blum et al. (2008) use the exponential mechanism of McSherry and Talwar (2007), a generic construction of differentially private analyses, which (in general) does not yield efficient algorithms.
A recent work of Dwork et al. (2009) considered the complexity of noninteractive sanitization under two settings: (a) sanitized output is a synthetic database, and (b) sanitized output is some arbitrary data structure. For the task of sanitizing with a synthetic database they show a separation between efficient and inefficient sanitization mechanisms based on whether the size of the instance space and the size of the concept class are polynomial in a (security) parameter or not. For the task of sanitizing with an arbitrary data structure they show a tight connection between the complexity of sanitization and traitor tracing schemes used in cryptography. They leave the problem of separating efficient private and inefficient private learning open.
Following the preliminary version of our paper (Beimel et al. 2010), Chaudhuri and Hsu (2011) study the sample complexity of privately learning infinite concept classes when the data is drawn from a continuous distribution. Using techniques very similar to ours, they show that, under these settings, there exists a simple concept class for which any proper learner that uses a finite number of examples and guarantees differential privacy fails to satisfy the accuracy guarantee for at least one unlabeled data distribution. This implies that the results of Kasiviswanathan et al. (2011) do not extend to infinite hypothesis classes on continuous data distributions.
Chaudhuri and Hsu (2011) also study learning algorithms that are only required to protect the privacy of the labels (and not necessarily the privacy of the examples themselves). They prove upper bounds and lower bounds for this scenario. In particular, they prove a lower bound on the sample complexity using the doubling dimension of the disagreement metric of the hypothesis class with respect to the unlabeled data distribution. This result does not imply our results. For example, the class \(\operatorname {\mathtt {POINT}}_{d}\) can be properly learned using O(1) samples while protecting the privacy of the labels, while we prove that Ω(d) samples are required to properly learn this class while protecting the privacy of the examples and the labels. It seems that label privacy may give enough protection in the restricted setting where the content of the underlying examples is publicly known. However, in many settings this information is highly sensitive. For example, in a database containing medical records we wish to protect the identity of the people in the sample (i.e., we do not want to disclose that they have been to a hospital).
It is well known that for all concept classes \(\mathcal{C}\), every learner for \(\mathcal{C}\) requires \(\varOmega(\mathrm{\it VCDIM(\mathcal{C})})\) samples (Ehrenfeucht et al. 1989). This lower bound on the sample size also holds for private learning. Blum et al. (2013) show that this result extends to the setting of private data release. They show that for all concept classes \(\mathcal{C}\), every noninteractive sanitization mechanism that is useful for \(\mathcal{C}\) requires \(\varOmega(\mathrm{\it VCDIM(\mathcal{C})})\) samples (remember that the best upper bound is \(O(\log|X| \cdot\mathrm{\it VCDIM}(\mathcal{C}))\)). We show in Sect. 7 that the lower bound of \(\varOmega(\mathrm{\it VCDIM(\mathcal{C})})\) is not tight: there exists a concept class \(\mathcal{C}\) of constant VC-dimension such that every noninteractive sanitization mechanism that is useful for \(\mathcal{C}\) requires a much larger sample size.
Tools for private learning (not in the PAC setting) were studied in a few papers; such tools include, for example, private logistic regression (Chaudhuri and Monteleoni 2008) and private empirical risk minimization (Chaudhuri et al. 2011; Kifer et al. 2012).
1.3 Questions for future exploration
The motivation of this work was to study the connection between nonprivate and private learning. We believe that the ideas developed in this work are a first step in developing a general theory of private learning. In particular, we believe that there is a combinatorial measure that characterizes private learning (for nonprivate learning such a combinatorial measure exists: the VC-dimension). Such a characterization was given recently in Beimel et al. (2013).
In this paper, the ideas used for lower bounding the sample size for proper private learning of points are also used to establish a lower bound on the sample size for sanitization of databases. Other connections between private learning and sanitization were explored in Blum et al. (2008). An open question is whether there is a deeper connection between the models, i.e., does any bound for one task imply a similar bound for the other?
1.4 Organization
In Sect. 2, we define private learning. In Sect. 3, we prove lower bounds on proper private learning, and in Sect. 4, we describe efficient improper private learning algorithms for the \(\operatorname {\mathtt {POINT}}\) concept class. In Sect. 5, we discuss private learning of the \(\operatorname {\mathtt {INTERVAL}}\) concept class. In Sect. 6, we show a separation between efficient and inefficient proper private learning. Finally, in Sect. 7, we prove a lower bound for noninteractive sanitization.
2 Preliminaries
Notation
We use [n] to denote the set {1,2,…,n}. The notation O_{γ}(g(n)) is a shorthand for O(h(γ)⋅g(n)) for some nonnegative function h. The notation Ω_{γ}(g(n)) is defined similarly. We use \(\mathop {\rm negl}(\cdot)\) to denote functions from \(\mathbb {R}^{+}\) to [0,1] that decrease faster than any inverse polynomial.
2.1 Preliminaries from privacy
A database is a vector D=(d_{1},…,d_{m}) over a domain X, where each entry d_{i} of D represents information contributed by one individual. Databases D and D′ are called neighbors if they differ in exactly one entry (i.e., the Hamming distance between D and D′ is 1). An algorithm is private if neighboring databases induce nearby distributions on its outcomes. Formally:
Definition 2.1
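(Differential Privacy (Dwork et al. 2006))
A randomized algorithm \(\mathcal{A}\) is ϵ-differentially private if for all neighboring databases D,D′, and for all sets \(\mathcal{S}\) of outputs,
\[\Pr[\mathcal{A}(D) \in\mathcal{S}] \leq\exp(\epsilon) \cdot\Pr[\mathcal{A}(D') \in\mathcal{S}]. \qquad (1)\]
The probability is taken over the random coins of \(\mathcal{A}\).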
An immediate consequence of (1) is that for any two databases D,D′ (not necessarily neighbors) of size m, and for all sets \(\mathcal{S}\) of outputs, \(\Pr[\mathcal{A}(D) \in\mathcal{S}] \geq\exp(-\epsilon m) \cdot \Pr [\mathcal{A}(D') \in\mathcal{S}]\).
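The step from neighboring databases to arbitrary databases of size m can be sketched as a chain argument. The intermediate databases \(D_{0},\ldots,D_{m}\) are names introduced here for illustration: \(D_{i}\) agrees with D′ on the first i entries and with D on the remaining ones, so consecutive databases are either equal or neighbors. Applying the privacy guarantee of Definition 2.1 to each consecutive pair gives
\[
\begin{aligned}
\Pr[\mathcal{A}(D_{i}) \in\mathcal{S}] &\geq \exp(-\epsilon)\cdot\Pr[\mathcal{A}(D_{i+1}) \in\mathcal{S}] \qquad (0\leq i<m),\\
\Pr[\mathcal{A}(D) \in\mathcal{S}] = \Pr[\mathcal{A}(D_{0}) \in\mathcal{S}] &\geq \exp(-\epsilon)^{m}\cdot\Pr[\mathcal{A}(D_{m}) \in\mathcal{S}] = \exp(-\epsilon m)\cdot\Pr[\mathcal{A}(D') \in\mathcal{S}].
\end{aligned}
\]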
2.2 Preliminaries from learning theory
We consider Boolean classification problems. A concept c:X→{0,1} is a function that labels examples taken from the domain X by either 0 or 1. The domain X is understood to be an ensemble \(X=\{X_{d}\}_{d\in \mathbb {N}}\) (typically, X _{ d }={0,1}^{ d }) and a concept class \(\mathcal{C}\) is an ensemble \(\mathcal{C}= \{\mathcal{C}_{d}\}_{d\in \mathbb {N}}\) where \(\mathcal{C}_{d}\) is a class of concepts mapping X _{ d } to {0,1}. In this paper X _{ d } is always a finite, discrete set. A concept class comes implicitly with a way to represent concepts and \(\mathop {\rm size}(c)\) is the size of the (smallest) representation of the concept c under the given representation scheme.
Definition 2.2
(PAC Learning (Valiant 1984))
An algorithm \(\mathcal {A}_{1}\), whose inputs are d,α,β, and a set of samples (labeled examples) D, is a PAC learner of a concept class \(\mathcal{C}=\{\mathcal{C}_{d}\}_{d\in \mathbb {N}}\) over \(X=\{X_{d}\}_{d\in \mathbb {N}}\) using hypothesis class \(\mathcal{H}=\{\mathcal{H}_{d}\}_{d\in \mathbb {N}}\) if there exists a polynomial p(⋅,⋅,⋅,⋅) such that for all \(d \in \mathbb {N}\) and 0<α,β<1, the algorithm \(\mathcal {A}_{1}\)(d,α,β,⋅) is an (α,β)-PAC learner of the concept class \(\mathcal{C}_{d}\) over X_{d} using hypothesis class \(\mathcal{H}_{d}\) and sample size \(n=p(d,\mathop {\rm size}(c),1/\alpha ,\log (1/\beta))\).^{4} If \(\mathcal {A}_{1}\) runs in time polynomial in \(d,\mathop {\rm size}(c),1/\alpha,\log(1/\beta)\), we say that it is an efficient PAC learner. Also, the learner is called a proper PAC learner if \(\mathcal{H}=\mathcal{C}\); otherwise it is called an improper PAC learner.
A concept class \(\mathcal{C}= \{\mathcal{C}_{d}\}_{d\in \mathbb {N}}\) over \(X= \{X_{d}\}_{d\in \mathbb {N}}\) is PAC learnable using hypothesis class \(\mathcal{H}= \{\mathcal{H}_{d}\}_{d\in \mathbb {N}}\) if there exists a PAC learner \(\mathcal{A}\) learning \(\mathcal{C}\) over X using hypothesis class \(\mathcal{H}\). If \(\mathcal{A}\) is an efficient PAC learner, we say that \(\mathcal{C}\) is efficiently PAC learnable.
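For reference, the (α,β)-PAC guarantee invoked in Definition 2.2 is usually formalized as follows. This is the standard textbook formulation, stated here as an assumption about the intended meaning; \(\mathcal{D}\) denotes the unknown distribution on X_{d} and D the labeled sample.
\[
\mathop{\rm error}_{\mathcal{D}}(c,h) = \Pr_{x\sim\mathcal{D}}[h(x)\neq c(x)], \qquad
\Pr\bigl[\mathop{\rm error}_{\mathcal{D}}(c,\mathcal{A}(D)) \leq\alpha\bigr] \geq 1-\beta,
\]
where \(D=((x_{i},c(x_{i})))_{i=1}^{n}\) with \(x_{1},\ldots,x_{n}\) drawn independently from \(\mathcal{D}\), the second probability is over the choice of D and the coins of \(\mathcal{A}\), and the requirement must hold for every concept \(c\in\mathcal{C}_{d}\) and every distribution \(\mathcal{D}\) on X_{d}.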
It is well known that improper learning is more powerful than proper learning. For example, Pitt and Valiant (1988) show that unless RP=NP, k-term DNF are not efficiently learnable by k-term DNF, whereas it is possible to learn a k-term DNF efficiently using k-CNF (Valiant 1984). For more background on learning theory, see Kearns and Vazirani (1994).
Definition 2.3
(VC-Dimension (Vapnik and Chervonenkis 1971))
Let \(\mathcal{C}=\{\mathcal{C}_{d}\}\) be a class of concepts over X={X_{d}}. We say that \(\mathcal{C}_{d}\) shatters a point set Y⊂X_{d} if \(|\{c(Y):c\in \mathcal{C}_{d}\}| = 2^{|Y|}\), i.e., the concepts in \(\mathcal{C}_{d}\) when restricted to Y produce all the \(2^{|Y|}\) possible assignments on Y. The VC-dimension of \(\mathcal{C}_{d}\) (\(\mathrm{\it VCDIM}(\mathcal{C}_{d})\)) is defined as the size of a maximum point set that is shattered by \(\mathcal{C}_{d}\), as a function of d.
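To make the definition concrete, the following brute-force Python sketch computes the VC-dimension of a small finite class by checking shattering directly. The function name `vc_dimension` and the toy domain size T=8 are choices made here for illustration; the search is exponential in the domain size and is only meant for tiny examples. The point and interval concepts follow the descriptions of POINT and INTERVAL given in the introduction.

```python
from itertools import combinations


def vc_dimension(domain, concepts):
    """Size of the largest subset Y of domain on which the concepts realize
    all 2^|Y| labelings (brute force; for tiny examples only)."""
    best = 0
    for k in range(1, len(domain) + 1):
        shattered = any(
            len({tuple(c(x) for x in Y) for c in concepts}) == 2 ** k
            for Y in combinations(domain, k)
        )
        if not shattered:
            break  # subsets of shattered sets are shattered, so larger k fail too
        best = k
    return best


T = 8
domain = list(range(1, T + 1))
# Point concepts: c_j(x) = 1 exactly when x = j.
point = [lambda x, j=j: int(x == j) for j in domain]
# Interval (threshold) concepts: c_j(x) = 1 exactly when x < j.
interval = [lambda x, j=j: int(x < j) for j in range(1, T + 2)]
```

For both toy classes the search stops at sets of size two, since the labeling (1,1) is never produced by a point concept and the labeling (0,1) on a pair a<b is never produced by a threshold.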
Theorem 2.4
(Blumer et al. 1989)
Let \(\mathcal{C}_{d}\) be a concept class over X_{d}. There exists an (α,β)-PAC learner that learns \(\mathcal{C}_{d}\), using \(\mathcal{C}_{d}\) as its hypothesis class, with \(O((\mathrm{\it VCDIM}(\mathcal{C}_{d})\cdot\log(\frac{1}{\alpha})+\log(\frac{1}{\beta }))/\alpha )\) samples.
2.3 Private learning
Definition 2.5
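(Private PAC Learning (Kasiviswanathan et al. 2011))
Let d, α, β be as in Definition 2.2 and let ϵ>0. A concept class \(\mathcal{C}\) is privately PAC learnable using \(\mathcal{H}\) if there exists an algorithm \(\mathcal{A}\) that takes inputs ϵ, d, α, β, and a database D of samples, and satisfies the following requirements:
Sample efficiency. The number of samples (labeled examples) in D is polynomial in 1/ϵ, d, \(\mathop {\rm size}(c)\), 1/α, and log(1/β);
Privacy. For all d and ϵ,α,β>0, algorithm \(\mathcal{A}(\epsilon ,d,\alpha,\beta,\cdot)\) is ϵ-differentially private (as formulated in Definition 2.1);
Utility. For all ϵ>0, algorithm \(\mathcal{A}(\epsilon ,\cdot ,\cdot,\cdot,\cdot)\) PAC learns \(\mathcal{C}\) using \(\mathcal{H}\) (as formulated in Definition 2.2).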
Remark 2.6
The privacy requirement in Definition 2.5 is a worst-case requirement. That is, Inequality (1) must hold for every pair of neighboring databases D,D′ (even if these databases are not consistent with any concept in \(\mathcal{C}\)). In contrast, the utility requirement is an average-case requirement, where we only require the learner to succeed with high probability over the distribution of the databases. This qualitative difference between the utility and privacy of private learners is crucial. A wrong assumption on how samples are formed that leads to a meaningless outcome can usually be replaced with a better one with very little harm. No such amendment is possible once privacy is lost due to a wrong assumption. See Kasiviswanathan et al. (2011) for further discussion.
Note also that each entry d _{ i } in a database D is a labeled example. That is, we protect the privacy of both the example and its label.
Observation 2.7
The computational separation between proper and improper learning also holds when we add the privacy constraint. That is, unless RP=NP, no efficient proper private learner can learn k-term DNF, whereas there exists an efficient improper private learner that can learn k-term DNF using a k-CNF. The efficient k-term DNF learner of Valiant (1984) uses statistical queries (SQ) (Kearns 1998), which can be simulated efficiently and privately as shown by Blum et al. (2005) and Kasiviswanathan et al. (2011).
More generally, such a gap can be shown for any concept class that cannot be properly PAC learned, but can be efficiently learned (improperly) in the statistical queries model.
2.4 Concentration bounds
3 Proper learning vs. proper private learning
We begin by recalling the upper bound on the sample (database) size for private learning from Kasiviswanathan et al. (2011). The bound in Kasiviswanathan et al. (2011) is for agnostic learning, and we restate it for (non-agnostic) PAC learning using the following notion of α-representation:
Definition 3.1
We say that a hypothesis class \(\mathcal{H}_{d}\) α-represents a concept class \(\mathcal{C}_{d}\) over the domain X_{d} if for every \(c \in \mathcal{C}_{d}\) and every distribution \(\mathcal{D}\) on X_{d} there exists a hypothesis \(h \in \mathcal{H}_{d}\) such that \(\mathop {\rm error}_{\mathcal{D}}(c,h)\leq\alpha\).
Theorem 3.2
(Kasiviswanathan et al. (2011), restated)
Assume that there is a hypothesis class \(\mathcal{H}_{d}\) that α/2-represents a concept class \(\mathcal{C}_{d}\). Then, for every 0<β<1, there exists a private PAC learner for \(\mathcal{C}_{d}\) using \(\mathcal{H}_{d}\) that uses \(O((\log|\mathcal{H}_{d}| +\log(1/\beta))/(\epsilon\alpha))\) samples, where ϵ,α, and β are the parameters of the private learner. The learner might not be efficient.
In other words, using Theorem 3.2 the number of samples that suffices for learning a concept class \(\mathcal{C}_{d}\) is logarithmic in the size of the smallest hypothesis class that α-represents \(\mathcal{C}_{d}\). For comparison, the number of samples required for learning \(\mathcal{C}_{d}\) nonprivately is characterized by the VC-dimension of \(\mathcal{C}_{d}\) (by the lower bound of Ehrenfeucht et al. (1989) and the upper bound of Blumer et al. (1989)).
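The selection step behind Theorem 3.2, as described in Sect. 1.1.1, can be sketched in a few lines of Python. This is an illustrative sketch rather than the construction of Kasiviswanathan et al. (2011) verbatim: the function name `private_occam` is hypothetical, hypotheses are assumed to be Python callables, and the exponent ϵ/2 times the number of errors is the usual exponential-mechanism weighting for a score that changes by at most one between neighboring databases. The running time is linear in the size of the hypothesis class, so the sketch is not efficient for large classes.

```python
import math
import random


def private_occam(samples, hypotheses, eps, rng=random):
    """Pick a hypothesis with probability proportional to
    exp(-eps * (number of samples it mislabels) / 2)."""
    errors = [sum(1 for x, y in samples if h(x) != y) for h in hypotheses]
    best = min(errors)  # shifting by the minimum leaves the distribution unchanged
    weights = [math.exp(-eps * (e - best) / 2) for e in errors]
    return rng.choices(hypotheses, weights=weights, k=1)[0]


# Toy usage: four point concepts over {1, 2, 3, 4}, all samples labeled by c_2.
toy_hypotheses = [lambda x, j=j: int(x == j) for j in range(1, 5)]
toy_samples = [(2, 1)] * 100
chosen = private_occam(toy_samples, toy_hypotheses, eps=1.0, rng=random.Random(0))
```

With few samples the weights of good and bad hypotheses are close, which is why the accuracy argument needs a sample size that grows with the logarithm of the number of hypotheses.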
In the following, we will investigate private learning of the following simple concept class. Let T=2^{ d } and X _{ d }={1,…,T}. Define the concept class \(\operatorname {\mathtt {POINT}}_{d}\) to be the set of points over {1,…,T}:
Definition 3.3
(Concept Class \(\operatorname {\mathtt {POINT}}_{d}\))
For j∈[T], define c _{ j } : [T]→{0,1} as c _{ j }(x)=1 if x=j, and c _{ j }(x)=0 otherwise. Furthermore, define \(\operatorname {\mathtt {POINT}}_{d} = \{c_{j}\}_{j\in[T]}\).
We note that we use the set {1,…,T} for notational convenience only: when discussing the concept class \(\operatorname {\mathtt {POINT}}_{d}\) we never use the fact that the elements of [T] are integers.
The class \(\operatorname {\mathtt {POINT}}_{d}\) trivially α-represents itself, and hence, by Theorem 3.2, it is (properly) privately PAC learnable using \(O((\log|\operatorname {\mathtt {POINT}}_{d}| +\log(1/\beta))/(\epsilon\alpha)) = O((d +\log(1/\beta))/(\epsilon\alpha))\) samples. For completeness, we give an efficient implementation of this learner.
Lemma 3.4
There is an efficient proper private PAC learner for \(\operatorname {\mathtt {POINT}}_{d}\) that uses O((d+log(1/β))/(ϵα)) samples.
Proof
 1. For j∈{x _{1},…,x _{ m }}, with probability exp(ϵ⋅q(D,c _{ j })/2)/P, output c _{ j }.
 2. With probability (2^{ d }−m)⋅exp(ϵ⋅q _{ D }/2)/P, pick uniformly at random a hypothesis from \(\operatorname {\mathtt {POINT}}_{d} \setminus\{c_{x_{1}},\ldots ,c_{x_{m}}\}\) and output it.
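As an illustration only, the two sampling steps can be sketched in Python. The sketch assumes that the score q(D,c _{ j }) is minus the number of examples in D that c _{ j } mislabels (the surviving text does not restate the score, so this is an assumption), in which case all concepts c _{ j } with j outside the sample share one score q _{ D }. It also weights the unseen concepts by the number of distinct sample points rather than by m. The function name and its arguments are hypothetical.

```python
import math
import random


def private_point_learner(sample, d, eps, rng=random):
    """Return j so that c_j is chosen with probability proportional to
    exp(eps * q(D, c_j) / 2), without enumerating all 2**d concepts.

    sample: list of (x, y) pairs with x in {1..2**d} and y in {0, 1}.
    """
    T = 2 ** d
    seen = sorted({x for x, _ in sample})  # distinct points in the sample
    positives = sum(1 for _, y in sample if y == 1)

    def score(j):
        # Assumed score: minus the number of examples that c_j mislabels.
        return -sum(1 for x, y in sample if (1 if x == j else 0) != y)

    # A concept c_j with j outside the sample labels every example 0,
    # so it errs exactly on the positively labeled examples.
    q_unseen = -positives

    # Shift by the maximum score for numerical stability (ratios unchanged).
    top = max([score(j) for j in seen] + [q_unseen])
    weights = [math.exp(eps * (score(j) - top) / 2) for j in seen]
    rest = (T - len(seen)) * math.exp(eps * (q_unseen - top) / 2)

    r = rng.random() * (sum(weights) + rest)
    for j, w in zip(seen, weights):
        if r < w:
            return j  # step 1: output a concept whose point is in the sample
        r -= w
    if len(seen) == T:
        return seen[-1]  # guard against floating-point fall-through
    # Step 2: uniform choice among unseen points, by rejection sampling.
    seen_set = set(seen)
    while True:
        j = rng.randint(1, T)
        if j not in seen_set:
            return j
```

The running time is polynomial in m and d because only the points that occur in the sample are scored individually.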
3.1 Separation between proper learning and proper private learning
We now show that private learners may require many more samples than nonprivate ones. We prove that for any proper private learner for the concept class \(\operatorname {\mathtt {POINT}}_{d}\) the required number of samples is at least logarithmic in the size of the concept class, matching Theorem 3.2, whereas there exist nonprivate proper learners for \(\operatorname {\mathtt {POINT}}_{d}\) that use only a constant number of samples.
To prove the lower bound, we show that a large collection of m-record databases D _{1},…,D _{ N } exists, with the property that every PAC learner has to output a different hypothesis for each of these databases (recall that in our context a database is a collection of labeled examples, supposedly drawn from some distribution and labeled consistently with some target concept). As any two databases D _{ a } and D _{ b } differ on at most m entries, differential privacy implies that a private learner must output on input D _{ a } the hypothesis that is accurate for D _{ b } (and not accurate for D _{ a }) with probability at least (1−β)⋅exp(−ϵm). Since this holds for every pair of databases, unless m is large enough we get that the private learner’s output on D _{ a } is, with high probability, a hypothesis that is not accurate for D _{ a }.
In Theorem 3.6, we prove a general lower bound on the sample complexity of private learning of a class \(\mathcal{C}_{d}\) by a hypothesis class \(\mathcal{H}_{d}\) that is α-minimal for \(\mathcal{C}_{d}\) as defined in Definition 3.5. In Corollary 3.8, we prove that Theorem 3.6 implies the claimed lower bound for proper private learning of \(\operatorname {\mathtt {POINT}}_{d}\). In Lemma 3.9, we improve this lower bound for \(\operatorname {\mathtt {POINT}}_{d}\) by a factor of 1/α.
Definition 3.5
If \(\mathcal{H}_{d}\) α-represents \(\mathcal{C}_{d}\), and every \(\mathcal{H}'_{d} \subsetneq \mathcal{H}_{d}\) does not α-represent \(\mathcal{C}_{d}\), then we say that \(\mathcal{H}_{d}\) is α-minimal for \(\mathcal{C}_{d}\).
Theorem 3.6
Let \(\mathcal{H}_{d}\) be an α-minimal representation for \(\mathcal{C}_{d}\). Then, any private PAC learner that learns \(\mathcal{C}_{d}\) using \(\mathcal{H}_{d}\) requires \(\varOmega((\log|\mathcal{H}_{d}|+\log(1/\beta))/\epsilon )\) samples, where ϵ, α, and β are the parameters of the private learner.
Proof
Let \(\mathcal{C}_{d}\) be a class of concepts over the domain X _{ d } and let \(\mathcal{H}_{d}\) be αminimal for \(\mathcal{C}_{d}\). Since for every \(h \in \mathcal{H}_{d}\), the class \(\mathcal{H}_{d} \setminus\{h\} \) does not αrepresent \(\mathcal{C}_{d}\), we get that there exists a concept \(c_{h} \in \mathcal{C}_{d}\) and a distribution \(\mathcal{D}_{h}\) on X _{ d } such that on inputs drawn from \(\mathcal{D}_{h}\) and labeled by c _{ h }, every PAC learner (that learns \(\mathcal{C}_{d}\) using \(\mathcal{H}_{d}\)) has to output h with probability at least 1−β.
Let \(\mathcal{A}\) be a private learner that learns \(\mathcal{C}_{d}\) using \(\mathcal{H}_{d}\), and suppose \(\mathcal{A}\) uses m samples. We next show that for every \(h\in \mathcal{H}_{d}\) there exists a database \(D_{h}\in X_{d}^{m}\) on which \(\mathcal{A}\) has to output h with probability at least 1−β. To see that, note that if \(\mathcal{A}\) is run on m examples chosen i.i.d. from the distribution \(\mathcal{D}_{h}\) and labeled according to c _{ h }, then \(\mathcal{A}\) outputs h with probability at least 1−β (where the probability is taken over the randomness of \(\mathcal{A}\) and the sample points chosen according to \(\mathcal{D}_{h}\)). Hence, a collection of m labeled examples over which \(\mathcal{A}\) outputs h with probability at least 1−β exists, and D _{ h } is set to contain these m samples.
Using Theorem 3.6, we now prove a lower bound on the number of samples needed for proper private learning of the concept class \(\operatorname {\mathtt {POINT}}_{d}\).
Proposition 3.7
\(\operatorname {\mathtt {POINT}}_{d}\) is αminimal for itself for every α<1.
Proof
Clearly, \(\operatorname {\mathtt {POINT}}_{d}\) αrepresents itself. To show minimality, consider a subset \(\mathcal{H}'_{d} \subsetneq \operatorname {\mathtt {POINT}}_{d}\), where \(c_{i} \notin \mathcal{H}'_{d}\). Under the distribution \(\mathcal{D}\) that chooses i with probability one, \(\mathop {\rm error}_{\mathcal{D}}(c_{i},c_{j}) = 1\) for all \(j\not=i\). Hence, \(\mathcal{H}'_{d}\) does not αrepresent \(\operatorname {\mathtt {POINT}}_{d}\). □
The VC-dimension of \(\operatorname {\mathtt {POINT}}_{d}\) is one.^{5} It is well known that a standard (nonprivate) proper learner needs a number of samples roughly proportional to the VC-dimension to learn a concept class (Blumer et al. 1989). In contrast, we get that far more samples are needed for any proper private learner for \(\operatorname {\mathtt {POINT}}_{d}\). The following corollary follows directly from Theorem 3.6 and Proposition 3.7:
Corollary 3.8
Every proper private PAC learner for \(\operatorname {\mathtt {POINT}}_{d}\) requires Ω((d+log(1/β))/ϵ) samples.
We now show that the lower bound for \(\operatorname {\mathtt {POINT}}_{d}\) can be improved by a factor of 1/α, matching (up to constant factors) the upper bound in Theorem 3.2.
Lemma 3.9
Every proper private PAC learner for \(\operatorname {\mathtt {POINT}}_{d}\) requires Ω((d+log(1/β))/(ϵα)) samples.
Proof
Define the distributions \(\mathcal{D}_{i}\) (where 2≤i≤T) on X _{ d } as follows: point 1 is picked with probability 1−α and point i is picked with probability α. The support of \(\mathcal{D}_{i}\) is on points 1 and i.
We say a database D=(d _{1},…,d _{ m }) where d _{ j }=(x _{ j },y _{ j }) for all j∈[m] is good for distribution \(\mathcal{D}_{i}\) if at most 2αm points from x _{1},…,x _{ m } equal i. Let D _{ i } be a database where x _{1},…,x _{ m } are i.i.d. samples from \(\mathcal{D}_{i}\) with y _{ j }=c _{ i }(x _{ j }) for all j∈[m]. By Chernoff bound, the probability that D _{ i } is good for distribution \(\mathcal{D}_{i}\) is at least 1−exp(−αm/3). Let \(\mathcal{A}\) be a proper private learner. On D _{ i }, \(\mathcal{A}\) has to output h=c _{ i } with probability at least 1−β (otherwise, if \(\mathcal{A}\) outputs some h=c _{ j }, where j≠i, then \(\mathop {\rm error}_{\mathcal{D}_{i}}(c_{i},h) = \mathop {\rm error}_{\mathcal{D}_{i}}(c_{i},c_{j})= \Pr_{x \sim \mathcal{D}_{i}}[c_{i}(x) \neq c_{j}(x)] > \alpha\), thus, violating the PAC learning condition for accuracy). Hence, the probability that either D _{ i } is not good or \(\mathcal{A}\) fails to return c _{ i } on D _{ i } is at most exp(−αm/3)+β. Therefore, with probability at least 1−β−exp(−αm/3), the database D _{ i } is good and \(\mathcal{A}\) returns c _{ i } on D _{ i }. Thus, for every i there exists a database D _{ i } that is good for \(\mathcal{D}_{i}\) such that \(\mathcal{A}\) returns c _{ i } on D _{ i } with probability at least 1−Γ, where Γ=β+exp(−αm/3).
We conclude this section by showing that every hypothesis class \(\mathcal{H}\) that α-represents \(\operatorname {\mathtt {POINT}}_{d}\) must have at least d hypotheses. Therefore, if we use Theorem 3.2 to learn \(\operatorname {\mathtt {POINT}}_{d}\) we need Ω(log d) samples.
Lemma 3.10
Let α<1/2. Then \(|\mathcal{H}| \geq d\) for every hypothesis class \(\mathcal{H}\) that α-represents \(\operatorname {\mathtt {POINT}}_{d}\).
Proof
Let \(\mathcal{H}\) be a hypothesis class with \(|\mathcal{H}| < d\). Consider a table whose T=2^{ d } columns correspond to the possible 2^{ d } inputs 1,…,T, and whose \(|\mathcal{H}|\) rows correspond to the hypotheses in \(\mathcal{H}\). The (i,j)th entry in the table is 0 or 1 depending on whether the ith hypothesis gives 0 or 1 on input j. Since \(|\mathcal{H}| < d=\log (T)\), at least two columns j≠j′ are identical, that is, h(j)=h(j′) for every \(h \in \mathcal{H}\). Consider the concept \(c_{j} \in \operatorname {\mathtt {POINT}}_{d}\) (defined as c _{ j }(x)=1 if x=j, and 0 otherwise), and the distribution \(\mathcal{D}\) with probability mass 1/2 on both j and j′. We get that \(\mathop {\rm error}_{\mathcal{D}}(c_{j},h) \geq 1/2 > \alpha\) for all \(h \in \mathcal{H}\) (since for any hypothesis h(j)=h(j′), the hypothesis either errs on j or on j′). Therefore, \(\mathcal{H}\) does not α-represent \(\operatorname {\mathtt {POINT}}_{d}\). □
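The pigeonhole step of this proof can be checked on a toy instance. The following Python sketch is a hypothetical helper, not part of the original argument: it represents each hypothesis by the set of points on which it returns 1, searches for two identical columns of the table, and evaluates the error of a hypothesis against a point concept under a finite distribution.

```python
def find_identical_columns(hypotheses, T):
    """Return a pair (j, j2) of inputs on which all hypotheses agree, or None.

    hypotheses: list of sets; the set A stands for h_A with h_A(x) = 1 iff x in A.
    """
    first_seen = {}
    for j in range(1, T + 1):
        column = tuple(j in A for A in hypotheses)  # column j of the table
        if column in first_seen:
            return first_seen[column], j
        first_seen[column] = j
    return None


def error(point, A, dist):
    """Probability under dist that c_point and h_A disagree.

    dist: dict mapping inputs to probabilities.
    """
    return sum(p for x, p in dist.items() if (x == point) != (x in A))


# Toy instance: d = 3, T = 8, and only two hypotheses (fewer than d).
H = [{1, 2, 3, 4}, {1, 2, 5, 6}]
j, j2 = find_identical_columns(H, T=8)
half_half = {j: 0.5, j2: 0.5}
worst_case = min(error(j, A, half_half) for A in H)
```

With fewer than d hypotheses there are at most 2^{d−1} distinct columns, so a collision among the 2^{ d } inputs is guaranteed; on the toy instance every hypothesis has error at least 1/2 against c _{ j }.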
4 Proper private learning vs. improper private learning
We now use \(\operatorname {\mathtt {POINT}}_{d}\) to show a separation between proper and improper private PAC learning. One way of achieving a smaller sample complexity is to use Theorem 3.2 to improperly learn \(\operatorname {\mathtt {POINT}}_{d}\) with a hypothesis class \(\mathcal{H}\) that α-represents \(\operatorname {\mathtt {POINT}}_{d}\), but is of size smaller than \(\operatorname {\mathtt {POINT}}_{d}\). By Lemma 3.10, we know that every such \(\mathcal{H}\) must have at least d hypotheses.
In Sect. 4.1, we show that there does exist a \(\mathcal{H}\) with \(|\mathcal{H}|=O(d)\) that α-represents \(\operatorname {\mathtt {POINT}}_{d}\). This immediately gives a separation: proper private learning of \(\operatorname {\mathtt {POINT}}_{d}\) requires Ω _{ α,β,ϵ }(d) samples, whereas \(\operatorname {\mathtt {POINT}}_{d}\) can be improperly privately learned using O _{ α,β,ϵ }(log d) samples.^{6}
We conclude that α-representing hypothesis classes can be a natural and powerful tool for constructing efficient private learners. One may even be tempted to think that no better learners exist, and furthermore, that the sample complexity of private learning is characterized by the size of the smallest hypothesis class that α-represents the concept class. Our second result, presented in Sect. 4.2, shows that this is not the case: other techniques yield a much more efficient learner using only O _{ α,β,ϵ }(1) samples, demonstrating the strongest possible separation between proper and improper private learners. The reader interested only in the stronger result may choose to skip directly to Sect. 4.2.
4.1 Improper private learning of \(\operatorname {\mathtt {POINT}}_{d}\) using O _{ α,β,ϵ }(logd) samples
We next construct a private learner applying the construction of Theorem 3.2 to the class \(\operatorname {\mathtt {POINT}}_{d}\). For that we (randomly) construct a hypothesis class \(\mathcal{H}_{d}\) that α-represents the concept class \(\operatorname {\mathtt {POINT}}_{d}\), where \(|\mathcal{H}_{d}| = O_{\alpha}(d)\). Lemma 3.10 shows that this is optimal up to constant factors. In the rest of this section, a set A⊆[T] represents the hypothesis h _{ A }, where h _{ A }(i)=1 if i∈A and h _{ A }(i)=0 otherwise.
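We require the hypothesis class \(\mathcal{H}_{d}=\{A_{1},\ldots,A_{k}\}\) to satisfy the following two requirements:
 (1) For every j∈{1,…,T} there are more than 1/α sets in \(\mathcal{H}_{d}\) that contain j; and
 (2) For every 1≤i _{1}<i _{2}≤k, \(|A_{i_{1}} \cap A_{i_{2}}|\leq1\).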
We next argue that the class \(\mathcal{H}_{d}\) α-represents \(\operatorname {\mathtt {POINT}}_{d}\). For every concept \(c_{j} \in \operatorname {\mathtt {POINT}}_{d}\) there are hypotheses \(A_{1},\ldots,A_{p} \in \mathcal{H}_{d}\) that contain j (where p=⌊1/α⌋+1) and are otherwise disjoint (that is, the intersection between any two sets \(A_{i_{1}}\) and \(A_{i_{2}}\) is exactly {j}). Fix a distribution \(\mathcal{D}\). For every A _{ i }, \(\mathop {\rm error}_{\mathcal{D}}(c_{j},A_{i})=\Pr_{\mathcal{D}} [A_{i} \setminus \{j\}]\). Since there are more than 1/α such sets and the sets A _{ i }∖{j} are disjoint, there exists at least one set such that \(\mathop {\rm error}_{\mathcal{D}}(c_{j},A_{i})\leq\alpha\). Thus, \(\mathcal{H}_{d}\) α-represents the concept class \(\operatorname {\mathtt {POINT}}_{d}\).
We want to show that there is a hypothesis class, whose size is \(O(\sqrt{T}/\alpha)\), that satisfies the above two requirements. As an intermediate step, we show a construction of size O(T). We consider a projective plane with T points and T lines (each line is a set of points) such that for any two points there is exactly one line containing them and for any two lines there is exactly one point contained in both of them. Such a projective plane exists whenever T=q ^{2}+q+1 for a prime power q (see, e.g., Hughes and Piper 1973). Furthermore, the number of lines passing through each point is q+1. If we take the lines as the hypothesis class for q≥1/α, then they satisfy the above requirements; thus, they α-represent \(\operatorname {\mathtt {POINT}}_{d}\). However, the number of hypotheses in the class is T and no progress was made.
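For a concrete feel of the intermediate construction, the smallest projective plane (the Fano plane, q=2, with 7 points and 7 lines) can be checked directly. The Python sketch below is illustrative only; the helper names are hypothetical, and 7 is of course not of the form 2^{ d }, so this only demonstrates the two requirements and the resulting error bound, not the class \(\operatorname {\mathtt {POINT}}_{d}\) itself.

```python
from itertools import combinations

# The seven lines of the Fano plane over the points 1..7.
FANO_LINES = [
    {1, 2, 3}, {1, 4, 5}, {1, 6, 7},
    {2, 4, 6}, {2, 5, 7}, {3, 4, 7}, {3, 5, 6},
]


def satisfies_requirements(lines, points, alpha):
    """Check requirement (1): more than 1/alpha sets contain each point,
    and requirement (2): any two sets share at most one point."""
    enough_sets = all(sum(j in A for A in lines) > 1 / alpha for j in points)
    small_overlap = all(len(A & B) <= 1 for A, B in combinations(lines, 2))
    return enough_sets and small_overlap


def best_error(j, lines, dist):
    """Smallest error against c_j among the sets containing j.

    For a set A containing j, the error is the probability of A minus {j}.
    dist: dict mapping points to probabilities.
    """
    return min(
        sum(dist.get(x, 0.0) for x in A if x != j)
        for A in lines
        if j in A
    )


points = range(1, 8)
uniform = {x: 1 / 7 for x in points}
skewed = {2: 0.5, 4: 0.5}
```

Each point lies on q+1=3 lines, so the requirements hold for α=1/2, and for any distribution at least one of the three lines through j carries probability at most 1/3 outside j.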
We modify the above projective plane construction. We start with a projective plane with 2T points and choose a subset of the lines: We choose each line at random with probability \(O(1/(\sqrt{T}\alpha))\). Since these lines are part of the projective plane, they satisfy the above requirement (2). It can be shown that with positive probability for at least half of the j’s requirement (1) is satisfied and the number of chosen lines is \(O(\sqrt{T}/\alpha)\). We choose such lines, eliminate points that are contained in less than 1/α chosen lines, and get the required construction with T points and \(O(\sqrt{T}/\alpha)\) lines. The details of the last steps are omitted. We next show a much more efficient construction based on the above idea.
Lemma 4.1
For every α<1, there is a hypothesis class \(\mathcal{H}_{d}\) that α-represents \(\operatorname {\mathtt {POINT}}_{d}\) such that \(|\mathcal{H}_{d}| = O(d/\alpha^{2})\).
Proof
We next show how to construct \(\mathcal{H}_{d}\). Let k=8ep ^{2}/logT (that is, k=O(logT/α ^{2})). We choose k random subsets of {1,…,2T} of size 4pT/k. We will show that a point j satisfies (3) with probability at least 3/4. We assume d≥16 (and hence, p≥16 and T≥16).
Fix j. The expected number of sets that contain j is k⋅(4pT/k)/(2T)=2p; thus, by Chebyshev's inequality, the probability that fewer than p sets contain j is less than 2/p≤1/8. We call this event BAD _{1}.
To conclude, the probability that j does not satisfy (3) is the probability that either BAD _{1} or BAD _{2} happens which is at most 1/4. Therefore, the expected number of j’s that do not satisfy (3) is less than T/2. By Markov inequality, the probability that more than T points j do not satisfy (3) is less than 1/2. We take k=O(logT/α ^{2}) subsets of {1,…,2T}, denoted S _{1},…,S _{ k }, such that at least T points j satisfy (3). By the probabilistic argument above, such sets exist. Let V be a set of size T of the points that satisfy (3), and define \(\mathcal{H}_{d}=\{ S_{1}\cap V,\ldots,S_{k}\cap V \}\). Finally, by a simple renaming, we can assume that \(\mathcal{H}_{d}\) contains subsets of {1,…,T} as required. □
From Lemma 4.1 and Theorem 3.2 we get:
Theorem 4.2
There exists an improper private PAC learner for \(\operatorname {\mathtt {POINT}}_{d}\) that uses \(O((\log d +\log\frac{1}{\alpha} +\log\frac{1}{\beta})/(\epsilon \alpha))\) samples, where ϵ, α, and β are the parameters of the private learner.
There is a difference between the use of improper learning in Theorem 4.2 and the typical use of improper learning in nonprivate settings. Typically, a nonprivate learner uses a hypothesis class that is larger than the concept class. This larger class enables learning in polynomial time. Here, in contrast, we get an improved sample complexity by learning with a hypothesis class whose size is smaller than that of the concept class.
4.2 Improper private learning of \(\operatorname {\mathtt {POINT}}_{d}\) using O _{ α,β,ϵ }(1) samples
We now show a stronger separation result, namely, that \(\operatorname {\mathtt {POINT}}_{d}\) can be privately (and efficiently) learned by an improper learner using just O _{ α,β,ϵ }(1) samples. We begin by presenting a nonprivate improper PAC learner \(\mathcal{A}_{1}\) for \(\operatorname {\mathtt {POINT}}_{d}\) that succeeds with only constant probability. Roughly, \(\mathcal{A}_{1}\) applies a simple proper learner for \(\operatorname {\mathtt {POINT}}_{d}\), and then modifies its outcome by adding random “noise”. We then use sampling to convert \(\mathcal{A}_{1}\) into a private learner \(\mathcal{A}_{2}\); like \(\mathcal{A}_{1}\), the probability that \(\mathcal{A}_{2}\) succeeds in learning \(\operatorname {\mathtt {POINT}}_{d}\) is only a constant. Later we amplify the success probability of \(\mathcal{A}_{2}\) to get a private PAC learner. Both \(\mathcal{A}_{1}\) and \(\mathcal{A}_{2}\) are inefficient as they output hypotheses with exponential description length. However, using a pseudorandom function it is possible to compress the outputs of \(\mathcal{A}_{1}\) and \(\mathcal{A}_{2}\), and obtain a private learning algorithm that runs in polynomial time. This is explained in Sect. 4.2.1.
Algorithm \(\mathcal {A}_{2}\) described below is ϵ ^{⋆}differentially private, where ϵ ^{⋆}=ln(4) is a fixed constant. To construct an ϵdifferentially private algorithm for every ϵ, we describe a transformation in Lemma 4.4 that takes a bigger sample and replaces some samples with ⋆ and executes \(\mathcal{A}_{2}\) on the resulting sample. Therefore, we assume that some of the sample points given to \(\mathcal{A}_{1}\) and \(\mathcal{A}_{2}\) are ⋆.
Algorithm \(\mathcal{A}_{1}\)
 1. If z _{1},…,z _{ m } is not consistent with any concept in \(\operatorname {\mathtt {POINT}}_{d}\), return ⊥ (this happens only if there are two indices i,j∈[m] such that z _{ i }=(x _{ i },y _{ i }) and z _{ j }=(x _{ j },y _{ j }) and either (1) x _{ i }≠x _{ j } and y _{ i }=y _{ j }=1 or (2) x _{ i }=x _{ j } and y _{ i }≠y _{ j }).
 2. If y _{ i }=0 for all i∈[m] such that z _{ i }≠⋆, then let \(c= {\bf0}\) (the all zero hypothesis); otherwise, let c be the (unique) hypothesis from \(\operatorname {\mathtt {POINT}}_{d}\) that is consistent with the labeled examples in the sample.
 3. Modify c at random to get a hypothesis h: for each x∈[T] independently let h(x)=1−c(x) with probability α/8 and, otherwise, let h(x)=c(x). Return h.
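The three steps of \(\mathcal{A}_{1}\) can be sketched as follows, for small T only. This is a minimal Python illustration under stated assumptions: `None` stands for an undefined entry ⋆, the string `"bottom"` stands for ⊥, and the returned hypothesis is represented by the set of points on which it evaluates to 1; all names are hypothetical. The explicit loop over [T] takes time exponential in d, in line with the remark that \(\mathcal{A}_{1}\) is inefficient.

```python
import random

STAR = None        # assumed encoding of an undefined entry
BOTTOM = "bottom"  # assumed encoding of the failure symbol


def a1(sample, T, alpha, rng=random):
    """Nonprivate improper learner for point concepts over {1..T}.

    sample: list whose entries are (x, y) pairs or STAR.
    Returns BOTTOM or the set {x : h(x) = 1}.
    """
    examples = [z for z in sample if z is not STAR]

    # Step 1: reject samples that no point concept can label.
    labels = {}
    for x, y in examples:
        if labels.setdefault(x, y) != y:
            return BOTTOM  # same point with two different labels
    positives = [x for x, y in labels.items() if y == 1]
    if len(positives) > 1:
        return BOTTOM  # two distinct points labeled 1

    # Step 2: the consistent concept; None encodes the all-zero hypothesis.
    target = positives[0] if positives else None

    # Step 3: flip each value independently with probability alpha / 8.
    h = set()
    for x in range(1, T + 1):
        c_x = (x == target)
        flip = rng.random() < alpha / 8
        if c_x != flip:  # h(x) = 1 - c(x) when flipped, c(x) otherwise
            h.add(x)
    return h
```

Setting `alpha=0` disables the noise and makes the output deterministic, which is convenient for checking the first two steps in isolation.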
We next argue that if the sample z _{1},…,z _{ m } contains at least 2ln(4)/α examples z _{ i }=(x _{ i },y _{ i }) such that each x _{ i } is drawn i.i.d. according to a distribution \(\mathcal{D}\) on [T], and the examples are labeled consistently according to some \(c_{j} \in \operatorname {\mathtt {POINT}}_{d}\), then \(\Pr[\mathop {\rm error}_{\mathcal{D}}(c_{j},c) \geq\alpha/2] \leq1/4\). If the examples are labeled consistently according to some \(c_{j} \ne{\bf0}\), then c≠c _{ j } only if (j,1) is not in the sample and in this case \(c= {\bf0}\). If \(\Pr_{x \sim \mathcal{D}}[x=j] < \alpha/2\) and (j,1) is not in the sample, then \(c={\bf0}\) and \(\mathop {\rm error}_{\mathcal{D}}(c_{j},{\bf0}) < \alpha/2\). Otherwise \(\Pr_{x \sim \mathcal{D}}[x=j] \geq\alpha/2\); thus, the probability that all examples of the form (x _{ i },y _{ i }) are not (j,1) is at most ((1−α/2)^{2/α })^{ln(4)}≤1/4 (as there are at least 2ln(4)/α such examples).
Algorithm \(\mathcal{A}_{2}\)
 1. With probability α/8, return ⊥.
 2. Construct a set S⊆[m′] by picking each element of [m′] with probability p=α/4.
 3. Run the nonprivate learner \(\mathcal{A}_{1}\) on the examples indexed by S.
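The subsampling wrapper can be sketched in a few lines of Python. In this hedged illustration the base learner is passed in as a parameter (in the text it is \(\mathcal{A}_{1}\)), the string `"bottom"` stands for ⊥, and the function and parameter names are hypothetical.

```python
import random


def a2(sample, alpha, base_learner, rng=random):
    """Subsample-and-run wrapper around a nonprivate learner.

    sample: list of m' entries; base_learner: callable taking a list of entries.
    """
    # Step 1: fail outright with probability alpha / 8.
    if rng.random() < alpha / 8:
        return "bottom"
    # Step 2: keep each index independently with probability p = alpha / 4.
    subsample = [z for z in sample if rng.random() < alpha / 4]
    # Step 3: run the nonprivate learner on the selected examples.
    return base_learner(subsample)
```

Passing the learner as an argument keeps the sketch self-contained; any callable that accepts a list of examples can be plugged in.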
Claim 4.3
Let α<1/2, ϵ ^{⋆}=ln(4), and β ^{⋆}=3/4. Algorithm \(\mathcal {A}_{2}\) is an ϵ ^{⋆}-differentially private (α,β ^{⋆})-PAC learner for the class \(\operatorname {\mathtt {POINT}}_{d}\) provided that it is given a sample which contains at least 32ln(4)/α ^{2} labeled examples (i.e., m′≥32ln(4)/α ^{2}).
Proof
We first show that \(\mathcal{A}_{2}\) PAC learns \(\operatorname {\mathtt {POINT}}_{d}\) with confidence parameter β ^{⋆}=3/4. Let S be the set chosen by \(\mathcal{A}_{2}\). The expected number of labeled examples indexed by S is at least p⋅(32ln(4))/α ^{2}=8ln(4)/α. By the Chernoff bound, the probability that the sample indexed by S contains fewer than 2ln(4)/α (in fact, 4ln(4)/α) labeled examples is less than exp(−ln(4)/α)<1/16 (since \(\mathcal{A}_{2}\) gets at least 32ln(4)/α ^{2} labeled examples and α<1/2). Algorithm \(\mathcal {A}_{2}\) can err only when either \(\mathcal{A}_{1}\) does not get 2ln(4)/α labeled examples, or when \(\mathcal{A}_{1}\) errs, or when \(\mathcal{A}_{2}\) returns ⊥ in Step (1). Therefore, we get that \(\mathcal{A}_{2}\) PAC learns \(\operatorname {\mathtt {POINT}}_{d}\) with accuracy parameter α′=α and confidence parameter β′=1/16+1/2+α/8≤3/4.
Algorithm \(\mathcal {A}_{2}\) is ϵ ^{⋆}differentially private for some fixed ϵ ^{⋆}. We reduce ϵ ^{⋆} to any desired ϵ using the following lemma (implicit in Kasiviswanathan et al. (2011)). In this lemma, we assume that the learning algorithm can handle “undefined entries”, i.e., entries of the form ⋆.^{7}
Lemma 4.4
Let \(\mathcal{A}\) be an ϵ ^{⋆}differentially private algorithm. Construct an algorithm \(\mathcal{B}\) that on input a database D=(d _{1},…,d _{ n }) constructs a new database D _{ s } whose ith entry is d _{ i } with probability f(ϵ,ϵ ^{⋆})=(exp(ϵ)−1)/(exp(ϵ ^{⋆})+exp(ϵ)−exp(ϵ−ϵ ^{⋆})−1) and ⋆ otherwise, and then runs \(\mathcal{A}\) on D _{ s }. Then, \(\mathcal{B}\) is ϵdifferentially private.
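The transformation in the lemma is easy to state in code. The Python sketch below computes the keep probability f(ϵ,ϵ ^{⋆}) exactly as written in the lemma and replaces every other entry by an undefined entry before calling the inner algorithm. It assumes `None` as the encoding of ⋆; the function names are hypothetical.

```python
import math
import random

STAR = None  # assumed encoding of an undefined entry


def keep_probability(eps, eps_star):
    """f(eps, eps_star) as given in the statement of the lemma."""
    numerator = math.exp(eps) - 1
    denominator = (
        math.exp(eps_star) + math.exp(eps) - math.exp(eps - eps_star) - 1
    )
    return numerator / denominator


def amplify_privacy(algorithm, database, eps, eps_star, rng=random):
    """Run `algorithm` on a copy of `database` in which each entry is kept
    with probability f(eps, eps_star) and replaced by STAR otherwise."""
    f = keep_probability(eps, eps_star)
    masked = [entry if rng.random() < f else STAR for entry in database]
    return algorithm(masked)
```

For example, with ϵ ^{⋆}=ln(4) and ϵ=0.1 the keep probability is roughly 0.027, so the input database must be larger by about a factor of 1/f for the inner algorithm to see the same number of defined entries in expectation.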
Proof
Claim 4.5
Let α<1/2, 0<β≤1 and 0<ϵ<1. There exists an ϵdifferentially private (α,β)PAC learner for the class \(\operatorname {\mathtt {POINT}}_{d}\) which uses a sample of size \(\mathop{\rm{poly}}\nolimits (1/\epsilon,1/\alpha, \log(1/\beta))\).
Proof
We next argue that with probability at least 1−β the selected hypothesis h _{ i } has error at most α. With probability at least 1−β/5, at least one of the hypotheses from Hyp has error less than α/8; by Chernoff bound with probability at least 1−β ^{2}/3 this hypothesis has empirical error^{8} at most α/4. Let us call \(\mathcal{E}_{1}\) the event that there exists a hypothesis with error less than α/8 and empirical error less than α/4 in Hyp. Event \(\mathcal{E}_{1}\) happens with probability at least (1−β/5)(1−β ^{2}/3)>1−(β/5+β ^{2}/3).
On the other hand, the probability that a hypothesis h _{ j } that has error greater than α has empirical error ≤α/2 is less than β ^{2}/3. By the union bound, the probability that there is such hypothesis in Hyp is at most β/3 (since N≤1/β for β≤0.01). Let us call \(\mathcal{E}_{2}\) the event that all hypotheses in Hyp with error greater than α have empirical error greater than α/2. Event \(\mathcal{E}_{2}\) happens with probability at least 1−β/3.
4.2.1 Making the learner efficient
The outcome of \(\mathcal{A}_{1}\) (hence, \(\mathcal{A}_{2}\)) is a hypothesis whose description is exponentially long (since it contains a list of the indices where the output was flipped). We now complete our construction by compressing this description using a pseudorandom function. The running time of the resulting algorithm is polynomial and the hypothesis it returns has a short description.
We use a slightly nonstandard definition of (nonuniform) pseudorandom functions from binary strings of size d to bits; these pseudorandom functions can be easily constructed given standard pseudorandom functions (which in turn can be constructed under standard assumptions (Goldreich 2001)). Roughly speaking, a collection of functions is pseudorandom if it cannot be distinguished from truly random functions. We start by defining the random functions in our definition.
Definition 4.6
Define \(H^{q}_{d}: \{0,1\}^{d} \rightarrow\{0,1\}\) as a random variable, where each value \(H^{q}_{d}(x)\) for x∈{0,1}^{ d } is selected i.i.d. to be 1 with probability q and 0 otherwise.
We consider a (nonuniform) polynomialtime distinguishing algorithm (represented by a circuit) C _{ d } that can query a function in polynomially many points. Any such algorithm should not be able to distinguish if the answers of the function are random or are answered according to a random function from the pseudorandom family. Formally,
Definition 4.7
 (3)′ If \(c={\bf0}\), let h be a random function from F _{ d }. Otherwise (i.e., c=c _{ j } for some j∈[T]), let h be a random function from F _{ d } subject to h(j)=1. Return h.
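To make the modified step concrete, here is a heuristic Python sketch in which a keyed hash plays the role of the family F _{ d }. This is an assumption made purely for illustration: the definition of F _{ d } is not reproduced above, HMAC-SHA256 is only a stand-in for a pseudorandom function, and the bias parameter `q`, the key length, and all function names are hypothetical. The hypothesis is described by its short key, and conditioning on h(j)=1 is done by rejection sampling over keys.

```python
import hashlib
import hmac
import os


def prf_bit(key, x, q):
    """Evaluate the keyed function at x: 1 with (heuristic) probability q."""
    digest = hmac.new(key, x.to_bytes(16, "big"), hashlib.sha256).digest()
    # Interpret the first 8 bytes as a number in [0, 2**64) and threshold it.
    return 1 if int.from_bytes(digest[:8], "big") < q * 2 ** 64 else 0


def sample_hypothesis(target, q, max_tries=100000):
    """Return a key describing h = prf_bit(key, ., q).

    target: None for the all-zero concept, or the point j for c_j, in which
    case keys are resampled until h(j) = 1 (about 1/q tries in expectation).
    """
    for _ in range(max_tries):
        key = os.urandom(16)
        if target is None or prf_bit(key, target, q) == 1:
            return key
    raise RuntimeError("no suitable key found; q may be too small")
```

The returned key has constant length regardless of T, which is the point of the compression; evaluating the hypothesis at a point costs one hash computation.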
Applying the same steps as in the proof of Claim 4.5, we get the following result.
Theorem 4.8
There exists an efficient improper private PAC learner for \(\operatorname {\mathtt {POINT}}_{d}\) that uses O _{ α,β,ϵ }(1) samples, where ϵ,α, and β are the parameters of the private learner.
Lemma 3.9 and Theorem 4.8 give the following separation:
Theorem 4.9
Every proper private PAC learner for \(\operatorname {\mathtt {POINT}}_{d}\) requires Ω((d+log(1/β))/(ϵα)) samples, whereas there exists an efficient improper private PAC learner that can learn \(\operatorname {\mathtt {POINT}}_{d}\) using O _{ α,β,ϵ }(1) samples. Here, ϵ,α, and β are the parameters of the private learners.
4.3 Restrictions on the hypothesis class of private learners with low sample complexity
We conclude this section by showing that every (improper) private learner for \(\operatorname {\mathtt {POINT}}_{d}\) using o(d) samples must return hypotheses that evaluate to one on many points (in contrast, every hypothesis in \(\operatorname {\mathtt {POINT}}_{d}\) returns the value one on just one input). This explains why our algorithms for \(\operatorname {\mathtt {POINT}}_{d}\) that use o(d) samples return “complex” hypotheses.
Definition 4.10
(weight)
The weight of a hypothesis h is the number of points for which it returns the value one, i.e., |{i:h(i)=1}|.
Theorem 4.11
There exists no private PAC learner for \(\operatorname {\mathtt {POINT}}_{d}\) with sample complexity o _{ α,β,ϵ }(d) that for every distribution returns, with probability at least half, hypotheses with weight \(2^{o_{\alpha,\beta,\epsilon}(d)}\) (where the probability is taken over the randomness of the learner and the sample points chosen according to the distribution). Here, ϵ,α, and β are the parameters of the private learner.
Proof
Assume the contrary, i.e., that there exists a private learner with sample complexity o _{ α,β,ϵ }(d) that for every distribution returns hypotheses with weight \(2^{o_{\alpha ,\beta,\epsilon}(d)}\) with probability at least half. We prove that, under this assumption, there is a proper private learning algorithm for \(\operatorname {\mathtt {POINT}}_{d}\) with sample complexity o _{ α,β,ϵ }(d), in contradiction with Lemma 3.9.
Let \(c_{t} \in \operatorname {\mathtt {POINT}}_{d}\) be the target concept. Assume for contradiction that there exists an ϵdifferentially private (α,β)PAC learner \(\mathcal{A}'\) for \(\operatorname {\mathtt {POINT}}_{d}\) with sample complexity o _{ α,β,ϵ }(d) that for every distribution returns, with probability at least 1/2, hypotheses of weight less than z, for \(z=2^{o_{\alpha,\beta,\epsilon}(d)}\) (where the probability is taken over the randomness of \(\mathcal{A}'\) and the sample points chosen according to the distribution).
 1. Let k=ln(β/2)/ln(3/4).
 2. Invoke k times the algorithm \(\mathcal{A}'\) with parameters ϵ,d,α/2,β′=1/4, each time on a fresh log z sized i.i.d. sample drawn from \(\mathcal{D}\) and labeled by c _{ t }. Let h _{1},…,h _{ k′} (where k′≤k) be the hypotheses returned in these executions with weight less than z.
 3. If k′=0 halt with failure, otherwise set \(\mathcal{H}_{d} = \{c_{j}: h_{i}(j) = 1\ \textrm{for some}\ i\in[k']\}\).
 4. Invoke the proper private learner of Lemma 3.4 with parameters ϵ,α,β/2 and hypothesis class \(\mathcal{H}_{d}\) on a fresh \(\ell= O((\log|\mathcal{H}_{d}| +\log(1/\beta))/(\epsilon\alpha)) \) sized i.i.d. sample drawn from \(\mathcal{D}\) and labeled by c _{ t }. Output the hypothesis returned by the learner.
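Steps 1 and 3 of this reduction are simple enough to sketch in Python. In the illustration below a hypothesis is represented by the set of points on which it returns 1, and the number of repetitions is rounded up to an integer (the rounding is an assumption; the text states k without it). The function names are hypothetical.

```python
import math


def repetitions(beta):
    """Step 1: k = ln(beta / 2) / ln(3 / 4), rounded up to an integer."""
    return math.ceil(math.log(beta / 2) / math.log(3 / 4))


def candidate_points(hypotheses, z):
    """Step 3: collect the points j with h_i(j) = 1 for some returned
    hypothesis h_i of weight less than z; None signals failure (k' = 0).

    hypotheses: list of sets, each set being {j : h(j) = 1}.
    """
    light = [h for h in hypotheses if len(h) < z]
    if not light:
        return None
    return set().union(*light)
```

The resulting class has at most k⋅z concepts, so its logarithm is at most log k + log z, which is what keeps the sample size in Step 4 small.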
Note that \(\ell= O((\log|\mathcal{H}_{d}| +\log(1/\beta))/(\epsilon \alpha)) = o_{\alpha,\beta,\epsilon}(d)\), and that the sample complexity of \(\mathcal{A}\) is k log z+ℓ=o _{ α,β,ϵ }(d). Furthermore, \(\mathcal{A}\) always returns a hypothesis in \(\operatorname {\mathtt {POINT}}_{d}\) (note that \(\mathcal{H}_{d}\subset \operatorname {\mathtt {POINT}}_{d}\)). Hence, if \(\mathcal{A}\) is a private learner for \(\operatorname {\mathtt {POINT}}_{d}\), we get a contradiction to Lemma 3.9.
Note that \(\mathcal{A}\) is ϵdifferentially private (follows since \(\mathcal{A}'\) is ϵdifferentially private and in Step (4), we invoke the ϵdifferentially private algorithm from Lemma 3.4 on a fresh sample).
To conclude the proof, we observe that having \(\mathcal{H}_{d}\) α/2represent {c _{ t }} suffices for the proof of Theorem 3.2, and hence, the hypothesis (in Step (4)) returned by the learner of Theorem 3.2 is with probability at least 1−β/2 within error α from c _{ t }.
To summarize, we get that \(\mathcal{A}\) is a proper private learner for \(\operatorname {\mathtt {POINT}}_{d}\) under distribution \(\mathcal{D}\) with sample complexity o _{ α,β,ϵ }(d). Since this holds for every \(\mathcal{D}\) this leads to a contradiction to Lemma 3.9 (the lemma shows that there exists a distribution for which there is no proper private learner for \(\operatorname {\mathtt {POINT}}_{d}\) with sample complexity o _{ α,β,ϵ }(d)). □
5 Private learning of intervals (partial results)
In this section, we examine \(\operatorname {\mathtt {INTERVAL}}_{d}\), a concept class that like \(\operatorname {\mathtt {POINT}}_{d}\) is very natural and simple and has VC-dimension 1. By Theorem 3.6, any proper private learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) requires Ω _{ α,β,ϵ }(d) samples (as \(\operatorname {\mathtt {INTERVAL}}_{d}\) is α-minimal for itself), and we ask whether stronger separation results than we showed for \(\operatorname {\mathtt {POINT}}_{d}\) can be proved for \(\operatorname {\mathtt {INTERVAL}}_{d}\). Specifically, we ask if we can prove a lower bound of ω _{ α,β,ϵ }(1) for any private learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) (i.e., also for improper private learners).
We give partial results towards answering this question. In Sect. 5.1, we show that if there exists an O _{ α,β,ϵ }(1) sample sized improper private learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\), then it must use hypotheses that are very unlike intervals, and in fact must switch exponentially many times between zero and one (this is similar to the result presented for \(\operatorname {\mathtt {POINT}}_{d}\) in Sect. 4.3). Then, in Sect. 5.2, we take a deeper look into improper private learning of \(\operatorname {\mathtt {INTERVAL}}_{d}\), and prove that the technique from Sect. 4.2 that yielded the efficient private learner for \(\operatorname {\mathtt {POINT}}_{d}\) with sample complexity O _{ α,β,ϵ }(1) cannot yield an algorithm for \(\operatorname {\mathtt {INTERVAL}}_{d}\) with sample complexity o _{ α,β,ϵ }(d). In other words, the technique of adding independent noise from Sect. 4.2, even with exponentially many switch points, does not yield a learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) with o _{ α,β,ϵ }(d) sample complexity.
Before proving the above results, let us first formally define \(\operatorname {\mathtt {INTERVAL}}_{d}\) and establish a sample complexity lower bound for proper private learning of this concept class.
Definition 5.1
The concept class \(\operatorname {\mathtt {INTERVAL}}_{d}\) is {c _{ j }:j∈{1,…,T+1}} where T=2^{ d } and the concept c _{ j }:[T]→{0,1} maps all x<j to 1 and all x≥j to 0.
Unlike the concept class \(\operatorname {\mathtt {POINT}}_{d}\), the values of elements of X _{ d } are significant in the sense that the geometric relation of which point is to the left of the other is meaningful. Note that the cardinality of \(\operatorname {\mathtt {INTERVAL}}_{d}\) is 2^{ d }+1, and that it is α-minimal for itself (for all α<1/2), and hence, we can use Theorem 3.6 and get a lower bound on the sample complexity of proper private learners for \(\operatorname {\mathtt {INTERVAL}}_{d}\).
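Definition 5.1 can be made concrete with a small sketch. The Python below is only an illustration under the stated definition; the function name `interval_concept` and the choice d=3 are assumptions made here, not part of the paper.

```python
def interval_concept(j: int, T: int):
    """Return c_j : [T] -> {0,1} with c_j(x) = 1 for x < j and c_j(x) = 0 for x >= j."""
    if not 1 <= j <= T + 1:
        raise ValueError("j must lie in {1, ..., T+1}")

    def c(x: int) -> int:
        if not 1 <= x <= T:
            raise ValueError("x must lie in [T] = {1, ..., T}")
        return 1 if x < j else 0

    return c


# Illustrative parameters (hypothetical): d = 3, so T = 2^d = 8.
d = 3
T = 2 ** d
concepts = [interval_concept(j, T) for j in range(1, T + 2)]
# Truth table of every concept over the domain [T].
tables = [tuple(c(x) for x in range(1, T + 1)) for c in concepts]
```

The T+1 truth tables are pairwise distinct, which matches the stated cardinality 2^{ d }+1; c _{1} is the all-zero function and c _{ T+1} is the all-one function.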
Lemma 5.2
Every proper private PAC learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) requires Ω((d+(1/β))/ϵ) samples.
5.1 Restrictions on the hypothesis class of private learners with low sample complexity
We give some insight into the structure of the hypothesis class of an improper private learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) with sample complexity o _{ α,β,ϵ }(d). We show that if such a learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) exists, then it must return, with high probability, a hypothesis that switches frequently between zero and one. Therefore, the hypothesis output by the learner has a very different structure from the concepts in \(\operatorname {\mathtt {INTERVAL}}_{d}\), which switch exactly once from 1 to 0. This result resembles Theorem 4.11, where we proved a similar structural statement for private learning of the \(\operatorname {\mathtt {POINT}}\) class.
Definition 5.3
(Switching Point)
We say that j is a switching point in hypothesis h if h(j)≠h(j−1). If h(j−1)=1 we say that j is a decreasing switching point. Otherwise, we say the switching point is increasing. The points 1 and T+1 are also referred to as switching points. The point 1 is an increasing switching point if h(1)=1 and decreasing otherwise. The point T+1 is an increasing switching point if h(T)=0 and decreasing otherwise.
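As an illustration of Definition 5.3, the following Python sketch lists the switching points of a hypothesis given as a truth table. The representation (a list with `h[x-1]` holding h(x)) and the function name are assumptions introduced here.

```python
def switching_points(h):
    """List the switching points of h as (point, kind) pairs.

    h is a non-empty list of 0/1 values with h[x-1] = h(x) for x in 1..T.
    The points 1 and T+1 are always included, following Definition 5.3.
    """
    if not h:
        raise ValueError("h must be non-empty")
    T = len(h)
    # Point 1 is increasing if h(1) = 1 and decreasing otherwise.
    points = [(1, "increasing" if h[0] == 1 else "decreasing")]
    for j in range(2, T + 1):
        if h[j - 1] != h[j - 2]:  # h(j) != h(j-1)
            kind = "decreasing" if h[j - 2] == 1 else "increasing"
            points.append((j, kind))
    # Point T+1 is increasing if h(T) = 0 and decreasing otherwise.
    points.append((T + 1, "increasing" if h[T - 1] == 0 else "decreasing"))
    return points


# The concept c_3 over T = 4 has exactly one decreasing switching point, at 3.
print(switching_points([1, 1, 0, 0]))
```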
We next prove that every private learner with sample complexity o _{ α,β,ϵ }(d) returns with high probability a hypothesis with an exponential number of switching points. We prove this using a method similar to the proof of the previous theorem. We assume that a learner exists which returns with constant probability a hypothesis with too few switching points. We then show that a proper private learner can be reconstructed from this hypothesis. For the reconstruction, we use a simplified version of the exponential mechanism of McSherry and Talwar (2007). The existence of a proper private learner for the class \(\operatorname {\mathtt {INTERVAL}}_{d}\) with sample complexity o _{ α,β,ϵ }(d) contradicts Lemma 5.2.
Theorem 5.4
There exists no private PAC learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) with sample complexity o _{ α,β,ϵ }(d) that for every distribution returns, with probability at least half, hypotheses with \(2^{o_{\alpha ,\beta,\epsilon}(d)}\) switching points (where the probability is taken over the randomness of the learner and the sample points chosen according to the distribution). Here, ϵ,α, and β are the parameters of the private learner.
Proof
Let \(\mathcal{D}\) denote the underlying sample distribution. Every concept \(c \in \operatorname {\mathtt {INTERVAL}}_{d}\) consists of exactly one decreasing switching point. Discovering this point is discovering the accurate concept. Assume first that the target concept is c _{ t } for some 1≤t≤T+1 and we have a hypothesis h such that \(\mathop {\rm error}_{\mathcal{D}}(c_{t},h) \leq\alpha\). Let j and k be two consecutive switching points in h such that j≤t≤k.^{9} Assume first that the switching point j is decreasing (and, thus, k is increasing). Note that c _{ j }(x)=c _{ t }(x)=1 for every x<j and c _{ j }(x)=c _{ t }(x)=0 for every x≥t. Therefore, c _{ j } is a hypothesis which only errs on {j,…,t−1}. Also c _{ j }(x)=h(x)=0 for every x∈{j,…,t−1}.
Remark 5.5
Note that if the empirical error of h on some sample database D is less than α, then using the same arguments as above there exists a concept in \(\operatorname {\mathtt {SWITCH}}(h)\) whose empirical error on D is also less than α.
As in Kasiviswanathan et al. (2011), we use the exponential mechanism in order to choose a hypothesis out of \(\operatorname {\mathtt {SWITCH}}(h)\) (we used the same mechanism in the proof of Claim 4.5).
We now have enough tools for the proof. Assume that \(\mathcal{A}'\) is an ϵ-differentially private (α,β)-PAC learner for the class \(\operatorname {\mathtt {INTERVAL}}_{d}\) with a sample complexity o _{ α,β,ϵ }(d) that on every distribution returns, with probability at least 1/2, hypotheses with at most \(z=z(\alpha,\beta,\epsilon ,d)=2^{o_{\alpha,\beta,\epsilon}(d)}\) switching points. Let \(s = 8\ln (\frac{12}{\beta}) / (\alpha^{2}) + 8 \ln(\frac{(6\beta) z}{\beta } ) / (\alpha\epsilon) + K ( \frac{1}{\alpha} \log\frac{1}{\beta} + \frac{1}{\alpha} \log\frac{1}{\alpha} )\) for some constant K to be set below.
 1. Let \(\alpha'=\frac{\alpha}{4}; \beta'= \frac{\beta}{6}\).
 2. For i in \(\{1,\ldots,\log\frac{1}{\beta'}\}\):
    (a) Draw o _{ α,β,ϵ }(d) new samples from \(\mathcal{D}\) and label them by c _{ t }. Let D′ denote these labeled examples.
    (b) Apply \(\mathcal{A}'\) with parameters ϵ,α′,β′ on D′. Let h _{ i } be the returned hypothesis.
 3. Let \(\hat{h}\) denote the first hypothesis in {h _{1},…,h _{log(1/β′)}} such that \(\lvert \operatorname {\mathtt {SWITCH}}(h_{i}) \rvert\leq z\). If no such \(\hat{h}\) exists, return “FAIL”.
 4. Draw s additional samples according to \(\mathcal{D}\) and label them by c _{ t }. Let D _{ s } denote these labeled examples.
 5. Choose a concept c out of \(\operatorname {\mathtt {SWITCH}}(\hat{h})\) using the exponential mechanism on D _{ s } with parameter ϵ and return it.
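The final selection step can be sketched as follows. This is a minimal Python illustration, not the paper's exact procedure: it assumes the candidate set is given as a list of thresholds j (each standing for the concept c _{ j }), scores a candidate by minus its number of errors on the labeled sample, and uses the standard exponential-mechanism weighting exp(ϵ·score/2); the "simplified version" referred to in the text may differ in its constants. All function names are hypothetical.

```python
import math
import random


def empirical_errors(candidates, sample):
    """Number of labeled examples (x, y) misclassified by each candidate c_j,
    where c_j(x) = 1 if x < j and 0 otherwise."""
    return [
        sum(1 for x, y in sample if (1 if x < j else 0) != y)
        for j in candidates
    ]


def exponential_mechanism(candidates, sample, epsilon, rng=random):
    """Pick a threshold j with probability proportional to exp(-epsilon * errors_j / 2).

    Changing one labeled example changes every error count by at most 1
    (sensitivity 1), which is the assumption behind the factor 1/2.
    """
    errs = empirical_errors(candidates, sample)
    best = min(errs)
    # Subtracting the minimum only rescales the weights (numerical stability).
    weights = [math.exp(-epsilon * (e - best) / 2.0) for e in errs]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

For example, with candidates `[1, 3, 5, 9]` and a sample of 400 points over [8] labeled by c _{5}, the error counts are 200, 100, 0, and 200, so the mechanism returns the threshold 5 except with negligible probability.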
We now show that \(\mathcal{A}\) is a proper private (α,β)-PAC learner with sample complexity o _{ α,β,ϵ }(d). This is a contradiction to Lemma 5.2.
First, note that according to the assumption, Step (2a) is given enough samples. Also according to the assumption, for every i we have that \(\Pr[\lvert \operatorname {\mathtt {SWITCH}}(h_{i}) \rvert\ge z ] \le 1/2\). Therefore, Step (3) fails with probability at most (1/2)^{log(1/β′)}=β′. Since the chosen hypothesis \(\hat{h}\) is a uniformly distributed hypothesis conditioned on \(\lvert \operatorname {\mathtt {SWITCH}}(\hat {h}) \rvert \leq z\) (an event with probability at least half), the probability that \(\mathop {\rm error}_{\mathcal{D}}(c_{t},\hat{h}) \geq\alpha'\) is at most 2β′+β′=3β′ (2β′ comes from the Step (2b) and β′ from Step (3)).
Finally, we show that c, the concept returned by \(\mathcal{A}\), indeed has \(\mathop {\rm error}_{\mathcal{D}}(c,c_{t}) \leq\alpha\) with high probability. As the VC-dimension of \(\operatorname {\mathtt {INTERVAL}}_{d}\) is 1, by Blumer et al. (1989), there exists a constant ℓ such that whenever more than \(\ell( \frac {1}{\alpha'} \log\frac{1}{\beta'} + \frac{1}{\alpha'} \log\frac{1}{\alpha'} )\) samples are drawn from some distribution \(\mathcal{D}\), then \(\Pr[ \lvert \mathop {\rm error}_{\mathcal{D}}(c_{t},c) - \widehat{\mathop {\rm error}}_{D_{s}} (c) \rvert\geq \alpha' ] \leq\beta'\). Remember that \(s > K ( \frac{1}{\alpha} \log\frac {1}{\beta} + \frac{1}{\alpha} \log\frac{1}{\alpha} )\) for some constant K (depending on ℓ). As we assumed \(\widehat{\mathop {\rm error}}_{D_{s}}(c) \le 3\alpha'\), we finally have that \(\mathop {\rm error}_{\mathcal{D}}(c_{t},c) \le4 \alpha' = \alpha\) with probability at least 1−β′.
We now calculate the sample complexity. Note that samples are drawn in Step (4) and many times in Step (2a). As we assumed the sample complexity of \(\mathcal{A}'\) is o _{ α,β,ϵ }(d) and it is executed log(1/β′) times, we get that the total sample complexity of this step is o _{ α,β,ϵ }(d). (Remember that α′ and β′ are of the same order as α and β.) Also note that since \(z=2^{o_{\alpha,\beta,\epsilon}(d)}\), the sample complexity of Step (4) is s=o _{ α,β,ϵ }(d). Therefore, the sample complexity of \(\mathcal{A}\) is log(1/β′)⋅o _{ α,β,ϵ }(d)+s=o _{ α,β,ϵ }(d).
Finally, note that we assumed \(\mathcal{A}'\) maintains ϵ-differential privacy. The exponential mechanism also maintains ϵ-differential privacy. Since each execution of the inner algorithms is on different, independently drawn samples of the whole sample set, the learner \(\mathcal{A}\) maintains ϵ-differential privacy.
Combining all the above statements we have that if there is an ϵ-differentially private (α/4,β)-PAC learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) with sample complexity o _{ α,β,ϵ }(d) that for every distribution returns, with probability at least half, a hypothesis with \(2^{o_{\alpha,\beta,\epsilon}(d)}\) switching points, then there is a proper ϵ-differentially private (α,β)-PAC learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) with sample complexity o _{ α,β,ϵ }(d). This contradicts Lemma 5.2. □
5.2 Impossibility of private independent noise learners with low sample complexity
We next show that the ideas used in Sect. 4.2 to construct a private learner for \(\operatorname {\mathtt {POINT}}_{d}\) with sample complexity O _{ α,β,ϵ }(1) cannot be used for \(\operatorname {\mathtt {INTERVAL}}_{d}\). We begin by formalizing a class of independent noise learners that generalizes the construction in Sect. 4.2. We note that independent noise learners are allowed to output hypotheses whose description is exponential in d (recall that this issue was resolved for \(\operatorname {\mathtt {POINT}}_{d}\) by using compression with pseudorandom functions).
Definition 5.6
(Private Independent Noise Learner)
 1. The outer learner \(\mathcal{A}^{\rm outer}\) is a private PAC learner (as defined in Definition 2.5) for \(\mathcal{C}_{d}\) using the class of all \(2^{X_{d}}\) functions X _{ d }→{0,1}. Furthermore, \(\mathcal{A}^{\rm outer}(\epsilon,d,\alpha',\beta',D)\) is restricted to execute as follows:
    (a) Select parameters α ^{⋆}≤α′,β ^{⋆}≤β′, and a noise rate μ as a (deterministic) function of ϵ,α′,β′.
    (b) Run \(\mathcal{A}^{\rm inner}(d,\alpha^{\star},\beta^{\star},D)\). Denote the output hypothesis c ^{⋆}.
    (c) If \(c^{\star}\notin \mathcal{C}_{d}\) then output “fail” and halt. Otherwise, produce a hypothesis h by addition of noise to all entries of c ^{⋆} independently, i.e., for all x∈X _{ d } set h(x)=1−c ^{⋆}(x) with probability μ, and h(x)=c ^{⋆}(x) otherwise.
 2. The inner learner \(\mathcal{A}^{\rm inner}\) outputs with probability at least 1−β ^{⋆} (over the randomness of \(\mathcal{A}^{\rm inner}\) and the sampling of D according to \(\mathcal{D}\)) a hypothesis \(c^{\star}\in \mathcal{C}_{d}\) such that \(\mathop {\rm error}_{\mathcal{D}}(c^{\star}, c)\leq\alpha^{\star}\).
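The noise step (c) of the outer learner can be sketched in Python as below. The sketch assumes that c ^{⋆} is given explicitly as a truth table over X _{ d } (so its size is exponential in d, as noted above); the function name and the `rng` parameter are introduced here for illustration only.

```python
import random


def add_independent_noise(c_star, mu, rng=random):
    """Flip each entry of the truth table c_star independently with probability mu.

    c_star is a list of 0/1 values, one per domain element; the result h satisfies
    h(x) = 1 - c_star(x) with probability mu and h(x) = c_star(x) otherwise.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must be a probability")
    return [1 - bit if rng.random() < mu else bit for bit in c_star]


# Illustrative use: c_5 over T = 8 with a hypothetical noise rate of 0.1.
c_star = [1, 1, 1, 1, 0, 0, 0, 0]
h = add_independent_noise(c_star, 0.1, rng=random.Random(1))
```

Because every entry is perturbed independently of the data, the output h reveals c ^{⋆} only through a noisy channel; the theorem below shows that for \(\operatorname {\mathtt {INTERVAL}}_{d}\) this channel leaks enough to recover c ^{⋆}.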
Example 5.7
 1. Set α=α′.
 2. Get a sample (x _{1},y _{1}),…,(x _{ m′},y _{ m′}), where x _{ i }’s are chosen according to \(\mathcal{D}\) and m′=32ln(4)/α ^{2}.
 3. With probability α/8, return ⊥.
 4. Construct a set S⊆[m′] by picking each element of [m′] with probability α/4.
 5. If ((x _{ i },y _{ i }))_{ i∈S } is not consistent with any concept in \(\operatorname {\mathtt {POINT}}_{d}\), return ⊥.
 6. If y _{ i }=0 for all i∈S, then let \(c= {\bf0}\) (the all zero hypothesis); otherwise, let c be the (unique) hypothesis from \(\operatorname {\mathtt {POINT}}_{d}\) that is consistent with the labeled example ((x _{ i },y _{ i }))_{ i∈S }.
As analyzed in Sect. 4.2, Algorithm \(\mathcal {A}_{2}\) is ln(4)-differentially private. It is also an (α′,β′)-PAC learner. To construct an algorithm that is ϵ-differentially private for smaller values of ϵ, we use a transformation described in Lemma 4.4. It can be seen that the resulting algorithm is also a private independent noise learner.
Furthermore, in the above description of \(\mathcal{A}_{2}\), the confidence parameter is β′=3/4. In Sect. 4.2, we boosted the confidence parameter by using the exponential mechanism. The resulting learning algorithm is not a private independent noise learner. However, for any constant β′, we can modify \(\mathcal{A}_{2}\) such that the resulting algorithm has confidence β′ and is a private independent noise learner; however, the sample complexity of the resulting algorithm is not polynomial in log(1/β′).
We next show that there is no private independent noise learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) using only o _{ α,β,ϵ }(d) samples. We will show that in this case, we can essentially recover the outcome of the inner learner (with probability at least 1−β a hypothesis in \(\operatorname {\mathtt {INTERVAL}}_{d}\)) from the outcome of the outer learner. It follows then that the existence of a private independent noise learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) that uses o _{ α,β,ϵ }(d) samples implies a proper private learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) that uses o _{ α,β,ϵ }(d) samples, in contradiction with Lemma 5.2.
Theorem 5.8
There is no private independent noise learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) for β′<1/4 and α′<β′/100 that learns using m′=o _{ α′,β′,ϵ }(d) samples.
Proof
Assume towards a contradiction that a private independent noise learner \(( \mathcal{A}^{\rm outer}, \mathcal{A}^{\rm inner})\) exists for \(\operatorname {\mathtt {INTERVAL}}_{d}\). Let \(\mathcal{D}\) denote the underlying sample distribution and \(c_{t} \in \operatorname {\mathtt {INTERVAL}}_{d}\) denote the target concept. Consider an execution of \(\mathcal{A}^{\rm outer}\) when invoked with parameters α′,β′ where β′<1/2 (we will further restrict α′,β′ below). We first show a simple bound on the noise rate μ=μ(α′,β′) selected by \(\mathcal{A}^{\rm outer}\). Denote by α ^{⋆}≤α′,β ^{⋆}≤β′ the parameters that \(\mathcal{A}^{\rm outer}\) selects for the inner learner. Denote by c ^{⋆} the concept returned by \(\mathcal{A}^{\rm inner}\) and by h the concept returned by \(\mathcal{A}^{\rm outer}\) (or ⊥ if \(\mathcal{A}^{\rm outer}\) halts without an output).
 1. For every t∈{1,…,T+1} define mismatch(t,h)=|{x<t:h(x)=0}|+|{x≥t:h(x)=1}|.
 2. Find ℓ for which mismatch(ℓ,h) is the lowest and return c _{ ℓ }.
 3. If no such unique point exists, return “FAIL”.
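The reconstruction steps above can be sketched in Python as follows. The truth-table representation (`h[x-1]` holding h(x)), the function names, and the use of `None` for “FAIL” are assumptions made for this illustration; the incremental update is just one convenient way to evaluate mismatch(t,h) for all t in time linear in T.

```python
def mismatch_counts(h):
    """Return [mismatch(1,h), ..., mismatch(T+1,h)] for a 0/1 truth table h,
    where mismatch(t,h) = |{x < t : h(x) = 0}| + |{x >= t : h(x) = 1}|."""
    T = len(h)
    current = sum(h)  # mismatch(1,h): every x >= 1 with h(x) = 1
    counts = [current]
    for t in range(2, T + 2):
        # Moving the threshold from t-1 to t moves the point x = t-1 from the
        # "x >= t" side to the "x < t" side.
        current += 1 if h[t - 2] == 0 else -1
        counts.append(current)
    return counts


def reconstruct(h):
    """Return the unique minimizer of mismatch(., h), or None ("FAIL") if it is not unique."""
    counts = mismatch_counts(h)
    best = min(counts)
    minimizers = [t for t, m in enumerate(counts, start=1) if m == best]
    return minimizers[0] if len(minimizers) == 1 else None


print(reconstruct([1, 1, 1, 1, 0, 0, 1, 0]))  # one flipped entry far from the threshold
print(reconstruct([1, 1, 0, 1, 0, 0, 0, 0]))  # one flipped entry next to the threshold
```

In the first call the threshold 5 of c _{5} is recovered despite the noise at point 7; in the second, the single dirty point 3 makes the thresholds 3 and 5 tie, so the reconstruction fails.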
We now bound the probability that c _{ ℓ }≠c ^{⋆}. We call a point x for which noise was added by \(\mathcal{A}^{\rm outer}\) (i.e., \(h(x)\not =c^{\star}(x)\)) dirty, otherwise we call x clean. Let j be such that c _{ j }=c ^{⋆}. Then, mismatch(j,h) is the number of dirty points. The reconstruction algorithm fails to return c ^{⋆} if and only if there is some point k≠j such that mismatch(k,h)≤mismatch(j,h). In this case, we say that k is bad. We show that for small enough μ, such a bad point exists only with constant probability. In the following, we assume that k>j (the case k<j is symmetric). First note that c _{ j } and c _{ k } disagree only on points in {j,…,k−1} (i.e., mismatch(j,h) and mismatch(k,h) have the same contribution from points not between j and k). Now every dirty point in {j,…,k−1} contributes 1 to mismatch(j,h) and nothing to mismatch(k,h), and similarly each clean point in {j,…,k−1} contributes 1 to mismatch(k,h) and nothing to mismatch(j,h). Since we assumed that mismatch(k,h)≤mismatch(j,h), it should be the case that at least half the entries in {j,…,k−1} are dirty.
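The counting argument in this paragraph can be summarized in one identity. Writing \(\mathrm{dirty}(j,k)\) and \(\mathrm{clean}(j,k)\) for the numbers of dirty and clean points in {j,…,k−1} (notation introduced here only for illustration), we have for every k>j:

\[
\mathrm{mismatch}(k,h) - \mathrm{mismatch}(j,h) = \mathrm{clean}(j,k) - \mathrm{dirty}(j,k), \qquad \mathrm{clean}(j,k) + \mathrm{dirty}(j,k) = k - j .
\]

Hence mismatch(k,h)≤mismatch(j,h) holds exactly when \(\mathrm{dirty}(j,k) \geq (k-j)/2\).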
We consider the case where there is a bad point bigger than j (the case where it is smaller than j is handled analogously). Let k>j be the smallest bad point which is bigger than j, that is, k is the smallest such that the number of dirty points in {j,…,k−1} is at least the number of clean points. Hence, k=j+1 if and only if j is a dirty point; if k>j+1 then for all j<ℓ<k the number of clean entries in {j,…,ℓ−1} exceeds the number of dirty points (otherwise ℓ is a bad point smaller than k). From the above arguments it follows that the number of clean points in {j,…,k−1} equals the number of dirty points in {j,…,k−1}.

\(\operatorname {\mathtt {noise}}_{j}\) begins with 1 (this is the case when k=j+1), or

\(\operatorname {\mathtt {noise}}_{j}\) begins with some Dyck word, where a Dyck word is a balanced string of “parentheses” in the sense that it consists of n zeros and n ones, and in every prefix the number of ones does not exceed the number of zeros (this is the case when k>j+1).
 1. Let \(\beta' = \frac{\beta}{4}\) and \(\alpha' = \frac{\min(\alpha ,\beta)}{100}\).
 2. Apply \(\mathcal{A}^{\rm outer}\) with parameters ϵ,d,α′,β′ to improperly learn \(\operatorname {\mathtt {INTERVAL}}_{d}\) using o _{ α′,β′,ϵ }(d) samples. Let h be the output of \(\mathcal{A}^{\rm outer}\). If \(\mathcal{A}^{\rm outer}\) fails then halt.
 3. Reconstruct a concept \(c_{\ell}\in \operatorname {\mathtt {INTERVAL}}_{d}\) out of the noisy hypothesis h (as described in the reconstruction algorithm above) and return it.
Note that β ^{⋆}≤β′ from the definition of private independent noise learner. Thus, the algorithm \(\mathcal{A}\) returns a concept \(c_{\ell}= c^{\star}\in \operatorname {\mathtt {INTERVAL}}_{d}\) such that \(\Pr[\mathop {\rm error}_{\mathcal{D}}(c_{\ell},c_{t}) \geq \alpha] \leq\beta\), and so it is a proper ϵ-differentially private (α,β)-PAC learner for \(\operatorname {\mathtt {INTERVAL}}_{d}\) with sample complexity o _{ α,β,ϵ }(d), in contradiction to Lemma 5.2. □
6 Separation between efficient and inefficient proper private PAC learning
In this section, we use the sample size lower bound for proper private learning \(\operatorname {\mathtt {POINT}}_{d}\) (Corollary 3.8) to obtain a separation between the sample complexities of efficient and inefficient proper private PAC learning. In the case of efficient proper private learning, we use a slightly relaxed notion of proper learning for reasons explained below.
 (1) If j is in the image of G _{ d }, then by the utility guarantee of the proper learner, \(\mathcal{A}_{p}\) has to return c _{ j } on D with probability at least 1−β. Thus, the distinguisher returns 1 with probability at least 1−β when j is chosen from G _{ d }(U _{ ℓ(d)}).
 (2) If j is not in the image of G _{ d }, then the database D is not labeled consistently by any concept in \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\). For any such j, a proper learner returns a hypothesis from \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\), which cannot be c _{ j }, so the distinguisher never returns 1 (i.e., it always returns 0). Therefore, the probability that the distinguisher returns 1 when j=U _{ d } is at most the probability that j is in the image of G _{ d }, which is at most \(\ell(d)/2^{d} = \mathop {\rm negl}(d)\).
To summarize, assuming \(\mathcal{A}_{p}\) is an efficient proper learner for \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\), the distinguisher will return 1 with probability at least 1−β when j=G _{ d }(U _{ ℓ(d)}), and with probability at most \(\mathop {\rm negl}(d)\) when j=U _{ d }, in contradiction to (10). We conclude that no efficient proper learner exists for \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) and, therefore, in the following we relax our notion of proper private learners for \({\widehat {\operatorname {\mathtt {POINT}}}}\) to allow outputting hypotheses from \(\operatorname {\mathtt {POINT}}\). We show that even under this liberal relaxation, efficient proper learning of \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) with sample complexity o(d) is not possible. However, we show that inefficient proper private learning of \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) with sample complexity o(d) is possible under the strict definition of proper learning.
Sample complexity of efficiently private learning \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) using \(\operatorname {\mathtt {POINT}}_{d}\)
Consider an efficient private learner \(\mathcal{A}_{\mathop {\rm eff}}\) that learns \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) using \(\operatorname {\mathtt {POINT}}_{d}\) and has sample complexity m. We now show that either a distinguisher exists for the pseudorandom generator G _{ d } or m=Ω _{ β,ϵ }(d). Assume β<1/4.
We use \(\mathcal{A}_{\mathop {\rm eff}}\) to construct a distinguisher for the pseudorandom generator as follows: Given j∈{1,…,2^{ d }}, we construct the database D with m entries (j,1). If \(\mathcal{A}_{\mathop {\rm eff}}(D)=c_{j}\), then the distinguisher returns 1, otherwise it returns 0.
If for at least a 3/4th fraction of the values j∈[2^{ d }], algorithm \(\mathcal{A}_{\mathop {\rm eff}}\), when applied to a database with m entries (j,1), does not return c _{ j } with probability at least 3/4, then the distinguisher succeeds in breaking the pseudorandom generator. This is because if the above statement is not true then the distinguisher returns 1 with probability at most 3/4 when j=U _{ d }, and the distinguisher will return 1 with probability at least 1−β>3/4 when j=G _{ d }(U _{ ℓ(d)}).^{11}
However, arguments similar to those in the proof of Theorem 3.6 show that it is not possible to have a learner that on a 3/4th fraction of the values j∈[2^{ d }], when applied to a database with m=o((d+log(1/β))/ϵ) entries (j,1), returns c _{ j } with probability at least 3/4. This means that either we have a distinguisher for the pseudorandom generator or the sample complexity of \(\mathcal{A}_{\mathop {\rm eff}}\) is at least Ω _{ β,ϵ }(d). So, assuming the existence of a pseudorandom generator, we get that there exists no efficient private learner that learns \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) using \(\operatorname {\mathtt {POINT}}_{d}\) and has o((d+log(1/β))/ϵ) sample complexity.^{12}
Sample complexity of inefficient proper private learners for \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\)
If the learner is not polynomially bounded, then it can use the algorithm from Theorem 3.2 to privately learn \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\). Since \(\lvert {\widehat {\operatorname {\mathtt {POINT}}}}_{d} \rvert =2^{\ell(d)}\), the private learner from Theorem 3.2 uses O((ℓ(d)+log(1/β))/(ϵα)) samples.
We get the following separation between efficient and inefficient proper private learning:
Theorem 6.1
Let ℓ(d) be any function that grows as ω(log d). Assuming the existence of a pseudorandom generator G _{ d } : {0,1}^{ ℓ(d)}→{0,1}^{ d }, there exists no efficient proper PAC learner for \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) and every efficient (polynomial-time) private PAC learner that learns \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) using \(\operatorname {\mathtt {POINT}}_{d}\) requires Ω((d+log(1/β))/ϵ) samples, whereas there exists an inefficient proper private PAC learner that can learn \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) using O((ℓ(d)+log(1/β))/(ϵα)) samples.
Remark 6.2
In the nonprivate setting, there exists an efficient proper learner that can learn \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) using \(\operatorname {\mathtt {POINT}}_{d}\) with O((log(1/α)+log(1/β))/α) samples (as \(\mathrm{\it VCDIM}({\widehat {\operatorname {\mathtt {POINT}}}}_{d})=1\)). In the nonprivate setting, we also know that even inefficient learners require Ω(log(1/β)/α) samples (Ehrenfeucht et al. 1989; Kearns and Vazirani 1994). Therefore, for \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) the sample complexity difference that we observe in Theorem 6.1 does not exist without the privacy constraint.
7 Lower bounds for noninteractive sanitization
We now prove a lower bound on the database size (or sample size) needed to privately release an output that is useful for all concepts in a concept class. We start by recalling a definition and a result of Blum et al. (2008).
Theorem 7.1
(Blum et al. 2008)
We show that the dependency on log(|X _{ d }|) in Theorem 7.1 is essential: there exists a class of predicates \(\mathcal{C}\) with VC-dimension O(1) that requires |D|=Ω _{ α,β,ϵ }(log(|X _{ d }|)). For our lower bound, the sanitized output \(\widehat{D}\) could be any arbitrary data structure (not necessarily a synthetic database). Remember that a synthetic database contains data drawn from the same domain as the original database and Theorem 7.1 outputs a synthetic database. For simplicity, however, here we focus on the case where the output is a synthetic database. The proof of this lower bound uses ideas from Sect. 3.1.
Theorem 7.2
Every ϵ-differentially private noninteractive mechanism that is (α,β)-useful for \(\operatorname {\mathtt {POINT}}_{d}\) requires an input database of size Ω((d+log(1/β))/(ϵα)).
Proof
Let T=2^{ d } and X _{ d }=[T] be the domain. Consider the class \(\operatorname {\mathtt {POINT}}_{d}\). For every i∈[T], construct a database \(D_{i} \in X_{d}^{m}\) by setting (1−3α)m entries as 1 and the remaining 3αm entries as i (for i=1 all entries of D _{1} are 1). For i∈[T]∖{1}, we say that a database \(\widehat{D}\) is α-useful for D _{ i } if \(2\alpha < Q_{c_{i}}(\widehat{D}) < 4\alpha\) and \(1-4\alpha< Q_{c_{1}}(\widehat{D}) < 1-2\alpha\). We say that \(\widehat{D}\) is α-useful for D _{1} if \(1-\alpha< Q_{c_{1}}(\widehat{D}) \leq1\). It follows that for i≠j, if \(\widehat{D}\) is α-useful for D _{ i } then it is not α-useful for D _{ j }.
On the other hand, since \(\mathcal{A}\) is (α,β)-useful, \(\Pr[\mathcal{A}(D_{1}) \notin\widehat{\mathbb{D}}_{1}] < \beta\), and hence, we get that m=Ω((d+log(1/β))/(ϵα)). □
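The databases D _{ i } used in this proof, and the usefulness conditions attached to them, can be checked on a toy instance. The Python sketch below assumes that \(Q_{c}(D)\) denotes the fraction of entries of D on which the concept c evaluates to 1 (the definition is recalled from Blum et al. (2008) and is not restated in this section), and that 3αm is an integer; all function names are hypothetical.

```python
def build_database(i, m, alpha):
    """D_i: (1 - 3*alpha)*m entries equal to 1 and 3*alpha*m entries equal to i."""
    k = round(3 * alpha * m)  # assumed to be an integer number of entries
    if i == 1:
        return [1] * m
    return [1] * (m - k) + [i] * k


def point_query(j, database):
    """Q_{c_j}(D): fraction of entries equal to j (c_j is the point concept at j)."""
    return sum(1 for x in database if x == j) / len(database)


def is_useful_for(i, sanitized, alpha):
    """Check the alpha-usefulness condition for D_i from the proof."""
    q1 = point_query(1, sanitized)
    if i == 1:
        return 1 - alpha < q1 <= 1
    qi = point_query(i, sanitized)
    return 2 * alpha < qi < 4 * alpha and 1 - 4 * alpha < q1 < 1 - 2 * alpha


# Toy instance (hypothetical parameters): alpha = 0.1, m = 100.
D1 = build_database(1, 100, 0.1)
D2 = build_database(2, 100, 0.1)
print(is_useful_for(2, D2, 0.1), is_useful_for(3, D2, 0.1), is_useful_for(1, D2, 0.1))
```

On this instance D _{2} satisfies the condition for i=2 only, in line with the claim that the usefulness sets for different i are disjoint.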
Footnotes
 1. Chaudhuri and Hsu (2011) prove that this is not true for continuous domains.
 2. Our proof technique yields lower bounds not only on private learning \(\operatorname {\mathtt {POINT}}_{d}\) properly, but on private learning of any concept class \(\mathcal{C}\) with various hypothesis classes that we call α-minimal for \(\mathcal{C}\).
 3. Informally, a mechanism is useful for a concept class if for every input, the output of the mechanism maintains approximately correct counts for all concepts in the concept class.
 4. The definition of PAC learning usually only requires that the sample complexity is polynomial in 1/β (rather than log(1/β)). However, these two requirements are equivalent (see, e.g., Kearns and Vazirani 1994, Sect. 4.2).
 5. Note that every singleton {j} where j∈[T] is shattered by \(\operatorname {\mathtt {POINT}}_{d}\) as c _{ j }(j)=1 and c _{ j′}(j)=0 for all \(j'\not =j\). No set of two points {j,j′} is shattered by \(\operatorname {\mathtt {POINT}}_{d}\) as c _{ j″}(j)=c _{ j″}(j′)=1 for no j″∈[T].
 6. Remember, the notation O _{ α,β,ϵ }(g(n)) is a shorthand for O(h(α,β,ϵ)⋅g(n)) for some nonnegative function h. The notation Ω _{ α,β,ϵ }(g(n)) is defined similarly.
 7. These ⋆ entries cannot be simply removed, as the question of whether two databases are neighbors depends on the locations of the ⋆’s.
 8. Given an input D=(d _{1},…,d _{ m }) where each d _{ i }=(x _{ i },c(x _{ i })) is a labeled example, the empirical error of h is \(\frac{1}{m} \lvert \{i \,:\, h(x_{i}) \neq c(x_{i}) \} \rvert\).
 9. The switching points j and k exist as points 1 and T+1 are always switching points.
 10. For simplicity of the description, we ignore the fact that some of the sample points can be ⋆.
 11. If j is in the image of G _{ d }, then the analysis is the same as in (1) above. By the utility guarantees, \(\mathcal{A}_{\mathop {\rm eff}}\) has to return c _{ j } on D with probability at least 1−β. Thus, the distinguisher returns 1 with probability at least 1−β when j is chosen from G _{ d }(U _{ ℓ(d)}).
 12. An almost matching upper bound of O((d+log(1/β))/(ϵα)) on the sample complexity for efficiently private learning \({\widehat {\operatorname {\mathtt {POINT}}}}_{d}\) using \(\operatorname {\mathtt {POINT}}_{d}\) can be obtained as in Lemma 3.4.
Notes
Acknowledgements
We thank Benny Applebaum, Eyal Kushilevitz, and Adam Smith for helpful initial discussions.
Amos Beimel’s research was partly supported by the Israel Science Foundation (grant No. 938/09) and by the Frankel Center for Computer Science at BenGurion University. Shiva Prasad Kasiviswanathan thanks Los Alamos National Laboratory and IBM T.J. Watson Research Center for supporting him while this research was performed. Hai Brenner and Kobbi Nissim’s research was supported by the Israel Science Foundation (grant No. 860/06).
References
 Beimel, A., Kasiviswanathan, S. P., & Nissim, K. (2010). Bounds on the sample complexity for private learning and private data release. In D. Micciancio (Ed.), LNCS: Vol. 5978. TCC (pp. 437–454). Berlin: Springer.
 Beimel, A., Nissim, K., & Stemmer, U. (2013). Characterizing the sample complexity of private learners. In ITCS (pp. 97–110).
 Blum, A., Dwork, C., McSherry, F., & Nissim, K. (2005). Practical privacy: the SuLQ framework. In PODS (pp. 128–138). New York: ACM.
 Blum, A., Ligett, K., & Roth, A. (2008). A learning theory approach to non-interactive database privacy. In STOC (pp. 609–618). New York: ACM.
 Blum, A., Ligett, K., & Roth, A. (2013). A learning theory approach to non-interactive database privacy. Journal of the ACM, 60(2), 12.
 Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4), 929–965.
 Chaudhuri, K., & Hsu, D. (2011). Sample complexity bounds for differentially private learning. Journal of Machine Learning Research, 19, 155–186.
 Chaudhuri, K., & Monteleoni, C. (2008). Privacy-preserving logistic regression. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), NIPS. Cambridge: MIT Press.
 Chaudhuri, K., Monteleoni, C., & Sarwate, A. D. (2011). Differentially private empirical risk minimization. Journal of Machine Learning Research, 12, 1069–1109.
 Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23, 493–507.
 Dwork, C. (2009). The differential privacy frontier. In O. Reingold (Ed.), LNCS: Vol. 5444. TCC (pp. 496–502). Berlin: Springer.
 Dwork, C. (2011). A firm foundation for private data analysis. Communications of the ACM, 54(1), 86–95.
 Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In S. Halevi & T. Rabin (Eds.), LNCS: Vol. 3876. TCC (pp. 265–284). Berlin: Springer.
 Dwork, C., Naor, M., Reingold, O., Rothblum, G., & Vadhan, S. (2009). On the complexity of differentially private data release. In STOC (pp. 381–390). New York: ACM.
 Ehrenfeucht, A., Haussler, D., Kearns, M. J., & Valiant, L. G. (1989). A general lower bound on the number of examples needed for learning. Information and Computation, 82(3), 247–261.
 Goldreich, O. (2001). Foundations of cryptography, volume basic tools. Cambridge: Cambridge University Press.
 Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30.
 Hughes, D. R., & Piper, F. C. (1973). Projective planes (Vol. 6). Berlin: Springer.
 Kasiviswanathan, S. P., Lee, H. K., Nissim, K., Raskhodnikova, S., & Smith, A. (2011). What can we learn privately? SIAM Journal on Computing, 40(3), 793–826.
 Kearns, M. J. (1998). Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6), 983–1006. Preliminary version in proceedings of STOC’93.
 Kearns, M. J., & Vazirani, U. V. (1994). An introduction to computational learning theory. Cambridge: MIT Press.
 Kifer, D., Smith, A. D., & Thakurta, A. (2012). Private convex optimization for empirical risk minimization with applications to high-dimensional regression. Journal of Machine Learning Research, 23, 25.
 McSherry, F., & Talwar, K. (2007). Mechanism design via differential privacy. In FOCS (pp. 94–103). New York: IEEE Press.
 Mishra, N., & Sandler, M. (2006). Privacy via pseudorandom sketches. In PODS (pp. 143–152). New York: ACM.
 Pitt, L., & Valiant, L. G. (1988). Computational limitations on learning from examples. Journal of the ACM, 35(4), 965–984.
 Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27, 1134–1142.
 Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16, 264.