Statistical Active Learning Algorithms for Noise Tolerance and Differential Privacy


Abstract

We describe a framework for designing efficient active learning algorithms that are tolerant to random classification noise and are differentially-private. The framework is based on active learning algorithms that are statistical in the sense that they rely on estimates of expectations of functions of filtered random examples. It builds on the powerful statistical query framework of Kearns (JACM 45(6):983–1006, 1998). We show that any efficient active statistical learning algorithm can be automatically converted to an efficient active learning algorithm which is tolerant to random classification noise as well as other forms of “uncorrelated” noise. The complexity of the resulting algorithms has information-theoretically optimal quadratic dependence on \(1/(1-2\eta )\), where \(\eta \) is the noise rate. We show that commonly studied concept classes including thresholds, rectangles, and linear separators can be efficiently actively learned in our framework. These results combined with our generic conversion lead to the first computationally-efficient algorithms for actively learning some of these concept classes in the presence of random classification noise that provide exponential improvement in the dependence on the error \(\epsilon \) over their passive counterparts. In addition, we show that our algorithms can be automatically converted to efficient active differentially-private algorithms. This leads to the first differentially-private active learning algorithms with exponential label savings over the passive case.


Notes

  1. The sample complexity of the SQ analogues might be polynomially larger though.

  2. For any function \(f(x)\) that does not depend on the label, we have: \(\mathop {\mathbf {E}}\nolimits _{ P^\eta _{|\chi }}[f(x)\cdot \ell ] = (1-\eta )\mathop {\mathbf {E}}\nolimits _{ P_{|\chi }}[f(x)\cdot \ell ] + \eta \cdot \mathop {\mathbf {E}}\nolimits _{ P_{|\chi }}[f(x)\cdot (-\ell )] = (1-2\cdot \eta )\mathop {\mathbf {E}}\nolimits _{ P_{|\chi }}[f(x)\cdot \ell ]\). The first equality follows from the fact that under \( P^\eta _{|\chi }\), for any given \(x\), there is a \((1-\eta )\) chance that the label is the same as under \( P_{|\chi }\), and an \(\eta \) chance that the label is the negation of the label obtained from \( P_{|\chi }\). A numerical illustration of this identity is sketched after these notes.

  3. As usual, we can bring the distribution close enough to this form using unlabeled samples or \(O(b/\epsilon )\) target-independent queries, where \(b\) is the number of bits needed to represent our examples.

  4. In [8] a related but different definition of privacy was used. However, as pointed out in [38], the same translation can be used to achieve differential privacy.
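The identity in note 2 can also be checked numerically. Below is a minimal illustrative sketch; the distribution, the target concept, the noise rate and the label-independent function \(f\) are arbitrary choices made for the illustration and are not taken from the paper.

```python
# Monte Carlo check of the identity in note 2: under random classification
# noise of rate eta, E[f(x) * label] is attenuated by exactly (1 - 2*eta).
import numpy as np

rng = np.random.default_rng(0)
eta, n = 0.2, 1_000_000

x = rng.normal(size=n)                       # arbitrary choice of distribution
clean_label = np.sign(x - 0.3)               # arbitrary target concept
flip = rng.random(n) < eta                   # independent label flips with prob. eta
noisy_label = np.where(flip, -clean_label, clean_label)

f = np.tanh(x)                               # arbitrary label-independent f(x)
lhs = np.mean(f * noisy_label)               # estimate of E_{P^eta}[f(x) * ell]
rhs = (1 - 2 * eta) * np.mean(f * clean_label)
print(lhs, rhs)                              # the two values agree up to sampling error
```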

References

  1. Awasthi, P., Balcan, M.-F., Long, P.: The power of localization for efficiently learning linear separators with noise. In: Proceedings of the 46th ACM Symposium on Theory of Computing (2014)

  2. Aslam, J., Decatur, S.: Specification and simulation of statistical query algorithms for efficiency and noise tolerance. JCSS 56, 191–208 (1998)


  3. Angluin, D., Laird, P.: Learning from noisy examples. Mach. Learn. 2, 343–370 (1988)


  4. Bousquet, O., Boucheron, S., Lugosi, G.: Theory of classification: a survey of recent advances. ESAIM Probab. Stat. 9, 323–375 (2005)


  5. Balcan, M.-F., Beygelzimer, A., Langford, J.: Agnostic active learning. In: ICML (2006)

  6. Balcan, M.-F., Broder, A., Zhang, T.: Margin based active learning. In: COLT, pp. 35–50 (2007)

  7. Beygelzimer, A., Dasgupta, S., Langford, J.: Importance weighted active learning. In: ICML, pp. 49–56 (2009)

  8. Blum, A., Dwork, C., McSherry, F., Nissim, K.: Practical privacy: the SuLQ framework. In: Proceedings of PODS, pp. 128–138 (2005)

  9. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.: Learnability and the Vapnik–Chervonenkis dimension. J. ACM 36(4), 929–965 (1989)


  10. Bshouty, N., Feldman, V.: On using extended statistical queries to avoid membership queries. JMLR 2, 359–395 (2002)


  11. Blum, A., Furst, M., Jackson, J., Kearns, M., Mansour, Y., Rudich, S.: Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In: STOC, pp. 253–262 (1994)

  12. Blum, A., Frieze, A., Kannan, R., Vempala, S.: A polynomial time algorithm for learning noisy linear threshold functions. Algorithmica 22(1/2), 35–52 (1997)


  13. Balcan, M.-F., Hanneke, S.: Robust interactive learning. In: COLT (2012)

  14. Beygelzimer, A., Hsu, D., Langford, J., Zhang, T.: Agnostic active learning without constraints. In: NIPS (2010)

  15. Balcan, M.-F., Hanneke, S., Wortman, J.: The true sample complexity of active learning. In: COLT (2008)

  16. Balcan, M.-F., Long, P.M.: Active and passive learning of linear separators under log-concave distributions. In: COLT (2013)

  17. Bylander, T.: Learning linear threshold functions in the presence of classification noise. In: Proceedings of COLT, pp. 340–347 (1994)

  18. Cavallanti, G., Cesa-Bianchi, N., Gentile, C.: Learning noisy linear classifiers via adaptive and selective sampling. Mach. Learn. 83, 71–102 (2011)

  19. Chaudhuri, K., Hsu, D.: Sample complexity bounds for differentially private learning. In: JMLR COLT Proceedings, vol. 19, pp. 155–186 (2011)

  20. Chu, C., Kim, S., Lin, Y., Yu, Y., Bradski, G., Ng, A., Olukotun, K.: Map-reduce for machine learning on multicore. In: Proceedings of NIPS, pp. 281–288 (2006)

  21. Castro, R., Nowak, R.: Minimax bounds for active learning. In: COLT (2007)

  22. Dasgupta, S.: Coarse sample complexity bounds for active learning. In: NIPS, vol. 18 (2005)

  23. Dasgupta, S.: Active learning theory. In: Sammut, C., Webb, G.I. (eds.) Encyclopedia of Machine Learning, pp. 14–19 (2010)

  24. Dekel, O., Gentile, C., Sridharan, K.: Selective sampling and active learning from single and multiple teachers. J. Mach. Learn. Res. 13(1), 2655–2697 (2012)

  25. Dasgupta, S., Hsu, D.: Hierarchical sampling for active learning. In: ICML, pp. 208–215 (2008)

  26. Dasgupta, S., Hsu, D.J., Monteleoni, C.: A general agnostic active learning algorithm. In: NIPS (2011)

  27. Dasgupta, S., Tauman Kalai, A., Monteleoni, C.: Analysis of perceptron-based active learning. J. Mach. Learn. Res. 10, 281–299 (2009)


  28. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: TCC, pp. 265–284 (2006)

  29. Dunagan, J., Vempala, S.: A simple polynomial-time rescaling algorithm for solving linear programs. In: STOC, pp. 315–320 (2004)

  30. Feldman, V.: A complete characterization of statistical query learning with applications to evolvability. J. Comput. Syst. Sci. 78(5), 1444–1459 (2012)


  31. Feldman, V., Grigorescu, E., Reyzin, L., Vempala, S., Xiao, Y.: Statistical algorithms and a lower bound for detecting planted cliques. In: ACM STOC (2013)

  32. Freund, Y., Seung, H.S., Shamir, E., Tishby, N.: Selective sampling using the query by committee algorithm. Mach. Learn. 28(2–3), 133–168 (1997)


  33. Gonen, A., Sabato, S., Shalev-Shwartz, S.: Efficient pool-based active learning of halfspaces. In: ICML (2013)

  34. Hanneke, S.: Theory of disagreement-based active learning. Found. Trends Mach. Learn. 7(2–3), 131–309 (2014)

  35. Hanneke, S.: A bound on the label complexity of agnostic active learning. In: ICML (2007)

  36. Jagannathan, G., Monteleoni, C., Pillaipakkamnatt, K.: A semi-supervised learning approach to differential privacy. In: Proceedings of the 2013 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE Workshop on Privacy Aspects of Data Mining (PADM) (2013)

  37. Kearns, M.: Efficient noise-tolerant learning from statistical queries. JACM 45(6), 983–1006 (1998)


  38. Kasiviswanathan, S.P., Lee, H.K., Nissim, K., Raskhodnikova, S., Smith, A.: What can we learn privately? SIAM J. Comput. 40(3), 793–826 (2011)


  39. Koltchinskii, V.: Rademacher complexities and bounding the excess risk in active learning. JMLR 11, 2457–2485 (2010)


  40. Kearns, M., Vazirani, U.: An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA (1994)


  41. Kanade, V., Valiant, L.G., Wortman Vaughan, J.: Evolution with drifting targets. In: Proceedings of COLT, pp. 155–167 (2010)

  42. Long, P.M.: On the sample complexity of PAC learning halfspaces against the uniform distribution. IEEE Trans. Neural Netw. 6(6), 1556–1559 (1995)


  43. Lovász, L., Vempala, S.: The geometry of logconcave functions and sampling algorithms. Random Struct. Algorithms 30(3), 307–358 (2007)


  44. McCallum, A., Nigam, K.: Employing EM in pool-based active learning for text classification. In: ICML, pp. 350–358 (1998)

  45. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65, 386–407 (1958)


  46. Raginsky, M., Rakhlin, A.: Lower bounds for passive and active learning. In: NIPS, pp. 1026–1034 (2011)

  47. Valiant, L.G.: A theory of the learnable. Commun. ACM 27(11), 1134–1142 (1984)


  48. Vapnik, V.: Statistical Learning Theory. Wiley-Interscience, New York (1998)


  49. Vempala, S.: Personal Communication (2013)

  50. Wang, L.: Smoothness, disagreement coefficient, and the label complexity of agnostic active learning. J. Mach. Learn. Res. 12, 2269–2292 (2011)


Acknowledgments

We thank Avrim Blum and Santosh Vempala for useful discussions. This work was supported in part by NSF Grants CCF-0953192, CCF-110128, and CCF 1422910, AFOSR Grant FA9550-09-1-0538, ONR Grant N00014-09-1-0751, and a Microsoft Research Faculty Fellowship.

Author information


Corresponding author

Correspondence to Maria Florina Balcan.

Appendices

Appendix 1: Passive SQ Learning of Halfspaces

The first SQ algorithm for learning general halfspaces was given by Blum et al. [12]. This algorithm requires access to unlabeled samples from the unknown distribution and is therefore only label-statistical. This algorithm can be used as a basis for our active SQ algorithm, but the resulting active algorithm will also be only label-statistical. As we have noted in Sect. 2, this is sufficient to obtain our RCN-tolerant active learning algorithm given in Corollary 3.7. However, our differentially-private simulation needs the algorithm to be (fully) statistical. Therefore we base our algorithm on the algorithm of Dunagan and Vempala for learning halfspaces [29]. While [29] does not contain an explicit statement of the SQ version of the algorithm, it is known and easy to verify that the algorithm has a SQ version [49]. This follows from the fact that the algorithm in [29] relies on a combination of the Perceptron [45] and the modified Perceptron [12] algorithms, both of which have SQ versions [12, 17]. Another small issue that we need to take care of to apply the algorithm is that the running time and tolerance of the algorithm depend polynomially (in fact, linearly) on \(\log (1/\rho _0)\), where \(\rho _0\) is the margin of the points given to the algorithm. Namely, \(\rho _0 = \min _{x \in S}\frac{|w \cdot x|}{\Vert x\Vert }\), where \(h_w\) is the target homogeneous halfspace and \(S\) is the set of points given to the algorithm. We are dealing with continuous distributions for which the margin is 0, and therefore we make the following observation. In place of \(\rho _0\) we can use any margin \(\rho _1\) such that the probability of being within margin \(\le \rho _1\) around the target hyperplane is small enough that it can be absorbed into the tolerance of the statistical queries of the Dunagan–Vempala algorithm for margin \(\rho _1\). Formally,

Definition 7.1

For positive \(\delta <1\) and distribution \(D\), we denote

$$\begin{aligned} \gamma (D,\delta ) = \inf _{\Vert w\Vert =1}\sup _{\gamma >0} \left\{ \gamma \ \left| \ \mathop {\Pr }\limits _D\left[ \frac{|w \cdot x|}{\Vert x\Vert }\le \gamma \right] \le \delta \ \right. \right\} , \end{aligned}$$

namely, \(\gamma (D,\delta )\) is the smallest, over all halfspaces \(h_w\), of the largest margin \(\gamma \) such that the probability of being within margin \(\gamma \) of \(h_w\) under \(D\) is at most \(\delta \). Let \(\tau _{{\mathtt {DV}}}(\rho ,\epsilon )\) be the tolerance of the SQ version of the Dunagan–Vempala algorithm when the initial margin is equal to \(\rho \) and the error is set to \(\epsilon \). Let

$$\begin{aligned} \rho _1(D,\epsilon ) = \frac{1}{2} \sup _{\rho \ge 0}\left\{ \rho \ |\ \gamma (D,\tau _{{\mathtt {DV}}}(\rho ,\epsilon /2)/3) \ge \rho \right\} . \end{aligned}$$

Now, for \(\rho _1 = \rho _1(D,\epsilon )\) we know that \(\gamma (D,\tau _{{\mathtt {DV}}}(\rho _1,\epsilon /2)/3) \ge \rho _1\). Let \(D'\) be defined as the distribution \(D\) conditioned on having margin at least \(\rho _1\) around the target hyperplane \(h_w\). By the definition of the function \(\gamma \), the probability of being within margin \(\le \rho _1\) is at most \(\tau _{{\mathtt {DV}}}(\rho _1,\epsilon /2)/3\). Therefore for any query function \(g:X \times \{-1,1\}\rightarrow [-1,1]\),

$$\begin{aligned} \left| \mathop {\mathbf {E}}_D[g(x,h_w(x))] - (1-\tau _{{\mathtt {DV}}}(\rho _1,\epsilon /2)/3) \mathop {\mathbf {E}}_{D'}[g(x,h_w(x))] \right| \le \tau _{{\mathtt {DV}}}(\rho _1,\epsilon /2)/3 \end{aligned}$$

and hence \(|\mathop {\mathbf {E}}\nolimits _D[g(x,h_w(x))] - \mathop {\mathbf {E}}\nolimits _{D'}[g(x,h_w(x))]| \le 2 \tau _{{\mathtt {DV}}}(\rho _1,\epsilon /2)/3\) (by the triangle inequality, using \(|\mathop {\mathbf {E}}\nolimits _{D'}[g(x,h_w(x))]| \le 1\)). This implies that we can obtain an answer to any SQ relative to \(D'\) with tolerance \(\tau _{{\mathtt {DV}}}(\rho _1,\epsilon /2)\) by using the same SQ relative to \(D\) with tolerance \(\tau _{{\mathtt {DV}}}(\rho _1,\epsilon /2)/3\). This means that by running the Dunagan–Vempala algorithm in this way we will obtain a hypothesis with error at most \(\epsilon /2\) relative to \(D'\). This hypothesis has error at most \(\epsilon /2 + 2 \tau _{{\mathtt {DV}}}(\rho _1,\epsilon /2)/3\) relative to \(D\), which, without loss of generality, is at most \(\epsilon \). Combining these observations about the Dunagan–Vempala algorithm, we obtain the following statement.
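The conversion just described is mechanical: a query relative to \(D'\) with tolerance \(\tau \) is answered by issuing the same query relative to \(D\) with tolerance \(\tau /3\). The following is a minimal illustrative sketch, where `sq_oracle_D` is a hypothetical callable that answers statistical queries relative to \(D\) within a requested tolerance.

```python
# Illustrative sketch of the tolerance conversion described above.
# sq_oracle_D(g, tau) is assumed to return a value within tau of E_D[g(x, f(x))].
def make_sq_oracle_for_D_prime(sq_oracle_D):
    def sq_oracle_D_prime(g, tau):
        # The returned value is within tau/3 of E_D[g], and E_D[g] is within
        # 2*tau/3 of E_{D'}[g], so it is within tau of E_{D'}[g].
        return sq_oracle_D(g, tau / 3.0)
    return sq_oracle_D_prime
```

Running an SQ learner against such a wrapped oracle is exactly how the argument above invokes the Dunagan–Vempala algorithm relative to \(D'\).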

Theorem 7.2

([29]) There exists a SQ algorithm LearnHS-DV that learns \({\mathcal {H}}_d\) to accuracy \(1-\epsilon \) over any distribution \(D\). Further, LearnHS-DV outputs a homogeneous halfspace, runs in time polynomial in \(d,\,1/\epsilon \) and \(\log (1/\rho _1)\), and uses SQs of tolerance \( \ge 1/\text{ poly }(d,1/\epsilon ,\log (1/\rho _1))\), where \(\rho _1 = \rho _1(D,\epsilon )\).

To apply Theorem 7.2 we need to obtain bounds on \(\rho _1(D,\epsilon )\) for any distribution \(D\) on which we might run the Dunagan–Vempala algorithm.

Lemma 7.3

Let \(D\) be an isotropic log-concave distribution. Then for any \(\delta \in (0,1/20),\,\gamma (D,\delta ) \ge \delta /(6 \ln (1/\delta ))\).

Proof

Let \(\gamma \in (0,1/16)\) and \(w\) be any unit vector. We first upper-bound \(\mathop {\Pr }\nolimits _D\left[ \frac{|w \cdot x|}{\Vert x\Vert }\le \gamma \right] \).

$$\begin{aligned} \mathop {\Pr }\limits _D\left[ \frac{|w \cdot x|}{\Vert x\Vert }\le \gamma \right]&\le \mathop {\Pr }\limits _D\left[ \Vert x\Vert \le \ln (1/\gamma ) \text{ and } \frac{|w \cdot x|}{\Vert x\Vert }\le \gamma \right] + \mathop {\Pr }\limits _D\left[ \Vert x\Vert > \ln (1/\gamma ) \right] \nonumber \\&\le \mathop {\Pr }\limits _D\left[ |w \cdot x| \le \gamma \cdot \ln (1/\gamma ) \right] + \mathop {\Pr }\limits _D\left[ \Vert x\Vert > \ln (1/\gamma )\right] . \end{aligned}$$
(8)

By Lemma 5.7 in [43], for an isotropic log-concave \(D\) and any \(R > 1,\,\mathop {\Pr }\nolimits _D[\Vert x\Vert > R] < e^{-R+1}\). Therefore

$$\begin{aligned} \mathop {\Pr }\limits _D\left[ \Vert x\Vert > \ln (1/\gamma ) \right] \le e \cdot \gamma . \end{aligned}$$

Further, by Lemma 3.2,

$$\begin{aligned} \mathop {\Pr }\limits _D\left[ |w \cdot x| \le \gamma \cdot \ln (1/\gamma ) \right] \le 2 \gamma \cdot \ln (1/\gamma ). \end{aligned}$$

Substituting these inequalities into Eq. (8), we obtain that for \(\gamma \in (0,1/16)\),

$$\begin{aligned} \mathop {\Pr }\limits _D\left[ \frac{|w \cdot x|}{\Vert x\Vert }\le \gamma \right] \le 2 \gamma \cdot \ln (1/\gamma ) + e \cdot \gamma \le 3 \gamma \cdot \ln (1/\gamma ). \end{aligned}$$

This implies that for \(\gamma = \delta /(6 \ln (1/\delta ))\) and any unit vector \(w\),

$$\begin{aligned} \mathop {\Pr }\limits _D\left[ \frac{|w \cdot x|}{\Vert x\Vert }\le \gamma \right] \le 3 \delta /(6 \ln (1/\delta )) \cdot \left( \ln (1/\delta ) + \ln (6 \ln (1/\delta ))\right) \le \delta , \end{aligned}$$

where we used that for \(\delta < 1/20,\,6 \ln (1/\delta ) \le 1/\delta \). By definition of \(\gamma (D,\delta )\), this implies that \(\gamma (D,\delta ) \ge \delta /(6 \ln (1/\delta ))\). \(\square \)
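Lemma 7.3 lends itself to a quick numerical sanity check. The sketch below is illustrative only: it takes the standard Gaussian in \(\mathbb {R}^d\) as a concrete isotropic log-concave distribution and checks that the probability of margin at most \(\delta /(6\ln (1/\delta ))\) is indeed below \(\delta \).

```python
# Monte Carlo check of Lemma 7.3 for a standard Gaussian (an isotropic
# log-concave distribution); by spherical symmetry a single unit vector w suffices.
import numpy as np

rng = np.random.default_rng(1)
d, n, delta = 20, 500_000, 0.01
gamma = delta / (6 * np.log(1 / delta))      # the margin from the lemma

w = np.zeros(d)
w[0] = 1.0
x = rng.normal(size=(n, d))
margin = np.abs(x @ w) / np.linalg.norm(x, axis=1)
print(np.mean(margin <= gamma), "should be at most", delta)
```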

We are now ready to prove Theorem 3.5, which we restate here. There exists a SQ algorithm LearnHS that learns \({\mathcal {H}}_d\) to accuracy \(1-\epsilon \) over any distribution \(D_{|\chi }\), where \(D\) is an isotropic log-concave distribution and \(\chi :\mathbb {R}^d \rightarrow [0,1]\) is a filter function. Further, LearnHS outputs a homogeneous halfspace, runs in time polynomial in \(d,\,1/\epsilon \) and \(\log (1/\lambda )\), and uses SQs of tolerance \( \ge 1/\text{ poly }(d,1/\epsilon ,\log (1/\lambda ))\), where \(\lambda = \mathop {\mathbf {E}}\nolimits _D[\chi (x)]\).

Proof of Theorem 3.5

To prove the theorem we bound \(\rho _1 = \rho _1(D_{|\chi },\epsilon )\) and then apply Theorem 7.2. We first observe that for any event \(\Lambda \),

$$\begin{aligned} \mathop {{\mathbf {Pr}}}\limits _{D_{|\chi }}[\Lambda ] \le \mathop {{\mathbf {Pr}}}\limits _D[\Lambda ]/\mathop {\mathbf {E}}_D[\chi ]. \end{aligned}$$

Indeed, \(\mathop {{\mathbf {Pr}}}\nolimits _{D_{|\chi }}[\Lambda ] = \mathop {\mathbf {E}}\nolimits _D[\chi (x) \cdot \mathbf {1}_{\Lambda }(x)]/\mathop {\mathbf {E}}\nolimits _D[\chi (x)]\) and \(\chi \le 1\). Applying this to the event \(\frac{|w \cdot x|}{\Vert x\Vert } \le \gamma \) in Definition 7.1 we obtain that \(\gamma (D_{|\chi },\delta ) \ge \gamma (D,\delta \cdot \mathop {\mathbf {E}}\nolimits _D[\chi ])\). By Lemma 7.3, we get that \(\gamma (D_{|\chi },\delta ) = \Omega (\lambda \delta /\log (1/(\lambda \delta )))\).

In addition, by Theorem 7.2, \(\tau _{{\mathtt {DV}}}(\rho ,\epsilon ) \ge 1/p(d,1/\epsilon ,\log (1/\rho ))\) for some polynomial \(p\). This implies that

$$\begin{aligned} \gamma (D_{|\chi },\tau _{{\mathtt {DV}}}(\rho ,\epsilon /2)/3)&\ge \gamma \left( D_{|\chi },\Omega \left( \frac{1}{p(d,1/\epsilon ,\log (1/\rho ))}\right) \right) \\&= \tilde{\Omega }\left( \frac{\lambda }{p(d,1/\epsilon ,\log (1/\rho ))}\right) . \end{aligned}$$

Therefore, we obtain that

$$\begin{aligned} \rho _1(D_{|\chi }, \epsilon ) =\tilde{\Omega }\left( \frac{\lambda }{p(d,1/\epsilon ,1)}\right) . \end{aligned}$$

By plugging this bound into Theorem 7.2 we obtain the claim. \(\square \)

Appendix 2: Proofs from Sect. 4

We now prove Lemmas 4.6 and 4.7, which we restate for convenience.

Lemma 8.1

(Lemma 4.6 restated) For any \(v,w\in S_{d-1}\) such that \(\Vert v-w\Vert = \Delta \le \sqrt{2}\) and \(\gamma > 0\),

$$\begin{aligned} {\mathbf {Pr}}[h_v(x)&\ne h_w(x) \ |\ |\langle v, x \rangle | \le \gamma ]\\&= \frac{A_{d-3} \int _0^\gamma (1-r^2)^{(d-3)/2}\int _{\frac{r \cdot \sqrt{2-\Delta ^2}}{\Delta \cdot \sqrt{1-r^2}}}^1 (1-s^2)^{(d-4)/2}ds\cdot dr }{A_{d-2} \int _0^\gamma (1-r^2)^{(d-3)/2}dr}. \end{aligned}$$

We denote the probability by \({\mathsf {cp}}_d(\gamma ,\Delta )\).

Proof

By using spherical symmetry, we can assume without loss of generality that \(v = (1,0,0,\ldots ,0)\) and \(w=(\sqrt{1-\Delta ^2/2}, -\Delta /\sqrt{2},0,0,\ldots ,0)\). We now examine the surface area of the points that satisfy \(h_w(x) = -1\) and \(0 \le \langle v, x \rangle \le \gamma \) (which is a half of the error region at distance at most \(\gamma \) from \(v\)). To compute it we consider the points on \(S_{d-1}\) that satisfy \(\langle v, x \rangle = r\). These points form a hypersphere \(\sigma \) of dimension \(d-2\) and radius \(\sqrt{1-r^2}\). In this hypersphere the points that satisfy \(h_w(x) = -1\) are the points \((r,s,x_3,\ldots ,x_d) \in S_{d-1}\) for which \(r\sqrt{1-\Delta ^2/2} - s \Delta /\sqrt{2} \le 0\). In other words, \(s \ge r\sqrt{2-\Delta ^2}/\Delta \), that is, the points of \(\sigma \) which are at least \(r\sqrt{2-\Delta ^2}/\Delta \) away from the hyperplane with normal \((0,1,0,0,\ldots ,0)\) passing through the origin of \(\sigma \) (such a set is also referred to as a hyperspherical cap). As in Eq. (6), we obtain that its \((d-2)\)-dimensional surface area is:

$$\begin{aligned} (1-r^2)^{(d-2)/2}\int _{\frac{r \cdot \sqrt{2-\Delta ^2}}{\Delta \cdot \sqrt{1-r^2}}}^1 A_{d-3} (1-s^2)^{(d-4)/2}ds \end{aligned}$$

Integrating over all \(r\) from \(0\) to \(\gamma \), with an additional factor of \(1/\sqrt{1-r^2}\) that accounts for the width on \(S_{d-1}\) of the band \(\langle v, x \rangle \in [r,r+dr]\), gives the surface area of the region \(h_w(x) = -1\) and \(0 \le \langle v, x \rangle \le \gamma \):

$$\begin{aligned} \int _0^\gamma (1-r^2)^{(d-3)/2}\int _{\frac{r \cdot \sqrt{2-\Delta ^2}}{\Delta \cdot \sqrt{1-r^2}}}^1 A_{d-3} (1-s^2)^{(d-4)/2}ds\cdot dr. \end{aligned}$$

Hence the conditional probability is as claimed. \(\square \)
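The closed form for \({\mathsf {cp}}_d(\gamma ,\Delta )\) can be compared against direct sampling. The sketch below is illustrative only; it assumes, as in Sect. 4, that the probability is taken over the uniform distribution on \(S_{d-1}\), evaluates the two integrals numerically with SciPy, and estimates the conditional disagreement probability by Monte Carlo using the parametrization of \(v\) and \(w\) from the proof.

```python
# Numerical check of Lemma 8.1: closed-form cp_d(gamma, Delta) vs sampling.
import numpy as np
from scipy.integrate import quad, dblquad
from scipy.special import gamma as Gamma

def sphere_area(k):
    # surface area A_k of the unit k-dimensional sphere S_k (a subset of R^{k+1})
    return 2 * np.pi ** ((k + 1) / 2) / Gamma((k + 1) / 2)

def cp(d, gam, Delta):
    lower = lambda r: min(r * np.sqrt(2 - Delta**2) / (Delta * np.sqrt(1 - r**2)), 1.0)
    num = dblquad(
        lambda s, r: (1 - r**2) ** ((d - 3) / 2) * (1 - s**2) ** ((d - 4) / 2),
        0.0, gam, lower, lambda r: 1.0,
    )[0]
    den = quad(lambda r: (1 - r**2) ** ((d - 3) / 2), 0.0, gam)[0]
    return sphere_area(d - 3) * num / (sphere_area(d - 2) * den)

d, gam, Delta, n = 10, 0.2, 0.3, 1_000_000
rng = np.random.default_rng(2)
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)          # uniform on S_{d-1}
v = np.zeros(d); v[0] = 1.0
w = np.zeros(d); w[0] = np.sqrt(1 - Delta**2 / 2); w[1] = -Delta / np.sqrt(2)

cond = np.abs(x @ v) <= gam
disagree = np.sign(x @ v) != np.sign(x @ w)
print(np.mean(disagree[cond]), "vs", cp(d, gam, Delta))
```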

Lemma 8.2

(Lemma 4.7 restated) For \(\Delta \le \sqrt{2}\), any \(d\ge 4\), and \(\gamma \ge \Delta /(2\sqrt{d}),\,\partial _\Delta {\mathsf {cp}}_d(\gamma ,\Delta ) \ge 1/(56\gamma \cdot \sqrt{d})\).

Proof

First note that

$$\begin{aligned} \tau (\gamma )=\frac{A_{d-3}}{A_{d-2}\int _0^\gamma (1-r^2)^{(d-3)/2} dr} \end{aligned}$$

is independent of \(\Delta \) and therefore it is sufficient to differentiate

$$\begin{aligned} \theta (\gamma ,\Delta ) = \int _0^\gamma (1-r^2)^{(d-3)/2}\int _{\frac{r \cdot \sqrt{2-\Delta ^2}}{\Delta \cdot \sqrt{1-r^2}}}^1 (1-s^2)^{(d-4)/2}ds\cdot dr. \end{aligned}$$

Let \(\gamma ' = \Delta /(2\sqrt{d})\) (note that by our assumption \(\gamma ' \le \gamma \)). By the Leibniz integral rule,

$$\begin{aligned} \partial _\Delta \theta (\gamma ,\Delta )&= \int _0^\gamma (1-r^2)^{(d-3)/2} \partial _\Delta \left( \int _{\frac{r \cdot \sqrt{2-\Delta ^2}}{\Delta \cdot \sqrt{1-r^2}}}^1(1-s^2)^{(d-4)/2}ds \right) \cdot dr \\&= \int _0^\gamma (1-r^2)^{(d-3)/2} \left( 1- \frac{r^2 (2-\Delta ^2)}{\Delta ^2(1-r^2)}\right) ^{\frac{d-4}{2}} \cdot \frac{2r}{\Delta ^2 \sqrt{1-r^2} \sqrt{2-\Delta ^2}} \cdot dr \\&\ge \int _0^\gamma (1-r^2)^{(d-4)/2} \left( 1- \frac{2 r^2}{\Delta ^2 (1-r^2)}\right) ^{\frac{d-4}{2}} \cdot \frac{2r}{\sqrt{2} \Delta ^2} \cdot dr \\&\ge \int _0^{\gamma '} (1-r^2)^{(d-4)/2} \left( 1- \frac{2 r^2}{\Delta ^2(1-r^2)}\right) ^{\frac{d-4}{2}} \cdot \frac{\sqrt{2} \cdot r}{\Delta ^2} \cdot dr. \end{aligned}$$

Now using the conditions \(\Delta \le \sqrt{2},\,d\ge 4\), we obtain that \(\gamma ' \le 1/(2\sqrt{2})\) and hence for all \(r \in [0,\gamma '],\,1-r^2 \ge 7/8\) and \(r^2/\Delta ^2 \le \gamma '^2/\Delta ^2 = 1/(4d)\). This implies that for all \(r\in [0,\gamma ']\),

$$\begin{aligned} 1- \frac{2 r^2}{\Delta ^2(1-r^2)} \ge 1- \frac{2}{\frac{7}{8} 4d} = 1-\frac{4}{7d}. \end{aligned}$$

Now,

$$\begin{aligned} \left( 1-\frac{4}{7d}\right) ^{(d-4)/2} \ge 1- \frac{4 (d-4)}{14 d} \ge \frac{5}{7}. \end{aligned}$$

Substituting this into our expression for \(\partial _\Delta \theta (\gamma ,\Delta )\) we get

$$\begin{aligned}&\partial _\Delta \theta (\gamma ,\Delta )\\&\quad \ge \int _0^{\gamma '} (1-r^2)^{(d-4)/2} \frac{\sqrt{2} \cdot 5r}{7 \Delta ^2} \cdot dr \ge \frac{1}{\Delta ^2}\int _0^{\gamma '} (1-r^2)^{(d-4)/2} \cdot r \cdot dr \\&\quad = \frac{1}{\Delta ^2(d-2)} \left( 1 - (1-\gamma '^2)^{(d-2)/2}\right) \ge \frac{1}{\Delta ^2(d-2)} \left( 1 -e^{-\gamma '^2 (d-2)/2}\right) \\&\quad \ge ^{(*)}\frac{1}{\Delta ^2(d-2)} \left( 1 -\left( 1-\frac{(d-2)\gamma '^2}{4}\right) \right) = \frac{\gamma '^2}{4 \Delta ^2} =\frac{1}{16d}, \end{aligned}$$

where to derive \((*)\) we use the fact that \(e^{-\gamma '^2 (d-2)/2} \le 1-\gamma '^2 (d-2)/4\) since \(e^{-x} \le 1-x/2\) for every \(x \in [0,1]\) and \(\gamma '^2 (d-2)/2 \le \frac{\Delta ^2 (d-2)}{8d} < 1\). At the same time, \(\int _0^\gamma (1-r^2)^{(d-3)/2} dr \le \gamma \) and therefore,

$$\begin{aligned} \partial _\Delta {\mathsf {cp}}_d(\gamma ,\Delta ) = \tau (\gamma ) \cdot \partial _\Delta \theta (\gamma ,\Delta ) \ge \frac{A_{d-3}}{16 d \gamma A_{d-2}} \ge \frac{1}{32 \sqrt{3} \gamma \sqrt{d} } > \frac{1}{56 \gamma \sqrt{d}}, \end{aligned}$$

where we used that \(\frac{A_{d-3}}{A_{d-2}} \ge \sqrt{d}/(2\sqrt{3})\) (e.g. [27]). \(\square \)
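The derivative bound of Lemma 8.2 can similarly be sanity-checked with a central finite difference. The sketch below is illustrative only; it recomputes \({\mathsf {cp}}_d\) by numerical integration as in the previous sketch and compares the estimated derivative with \(1/(56\gamma \sqrt{d})\) at \(\gamma =\Delta /(2\sqrt{d})\).

```python
# Finite-difference check of Lemma 8.2: the numerical estimate of
# d/dDelta cp_d(gamma, Delta) should be at least 1/(56 * gamma * sqrt(d)).
import numpy as np
from scipy.integrate import quad, dblquad
from scipy.special import gamma as Gamma

def sphere_area(k):
    return 2 * np.pi ** ((k + 1) / 2) / Gamma((k + 1) / 2)

def cp(d, gam, Delta):
    lower = lambda r: min(r * np.sqrt(2 - Delta**2) / (Delta * np.sqrt(1 - r**2)), 1.0)
    num = dblquad(
        lambda s, r: (1 - r**2) ** ((d - 3) / 2) * (1 - s**2) ** ((d - 4) / 2),
        0.0, gam, lower, lambda r: 1.0,
    )[0]
    den = quad(lambda r: (1 - r**2) ** ((d - 3) / 2), 0.0, gam)[0]
    return sphere_area(d - 3) * num / (sphere_area(d - 2) * den)

d, Delta, h = 10, 0.3, 1e-4
gam = Delta / (2 * np.sqrt(d))               # the smallest gamma allowed by the lemma
deriv = (cp(d, gam, Delta + h) - cp(d, gam, Delta - h)) / (2 * h)
print(deriv, "should be at least", 1 / (56 * gam * np.sqrt(d)))
```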


Cite this article

Balcan, M.F., Feldman, V. Statistical Active Learning Algorithms for Noise Tolerance and Differential Privacy. Algorithmica 72, 282–315 (2015). https://doi.org/10.1007/s00453-014-9954-9
