Abstract
We describe a framework for designing efficient active learning algorithms that are tolerant to random classification noise and are differentially-private. The framework is based on active learning algorithms that are statistical in the sense that they rely on estimates of expectations of functions of filtered random examples. It builds on the powerful statistical query framework of Kearns (JACM 45(6):983–1006, 1998). We show that any efficient active statistical learning algorithm can be automatically converted to an efficient active learning algorithm which is tolerant to random classification noise as well as other forms of “uncorrelated” noise. The complexity of the resulting algorithms has information-theoretically optimal quadratic dependence on \(1/(1-2\eta )\), where \(\eta \) is the noise rate. We show that commonly studied concept classes including thresholds, rectangles, and linear separators can be efficiently actively learned in our framework. These results combined with our generic conversion lead to the first computationally-efficient algorithms for actively learning some of these concept classes in the presence of random classification noise that provide exponential improvement in the dependence on the error \(\epsilon \) over their passive counterparts. In addition, we show that our algorithms can be automatically converted to efficient active differentially-private algorithms. This leads to the first differentially-private active learning algorithms with exponential label savings over the passive case.
Notes
The sample complexity of the SQ analogues might, however, be polynomially larger.
For any function \(f(x)\) that does not depend on the label, we have: \(\mathop {\mathbf {E}}\nolimits _{ P^\eta _{|\chi }}[f(x)\cdot \ell ] = (1-\eta )\mathop {\mathbf {E}}\nolimits _{ P_{|\chi }}[f(x)\cdot \ell ] + \eta \cdot \mathop {\mathbf {E}}\nolimits _{ P_{|\chi }}[f(x)\cdot (-\ell )] = (1-2\cdot \eta )\mathop {\mathbf {E}}\nolimits _{ P_{|\chi }}[f(x)\cdot \ell ]\). The first equality follows from the fact that under \( P^\eta _{|\chi }\), for any given \(x\), there is a \((1-\eta )\) chance that the label is the same as under \( P_{|\chi }\), and an \(\eta \) chance that the label is the negation of the label obtained from \( P_{|\chi }\).
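The \((1-2\eta )\) attenuation in this note is easy to check empirically. Below is a minimal Python sketch; the uniform marginal, threshold target, and query function \(f(x)=x\) (which does not depend on the label) are our own illustrative choices, not part of the paper:

```python
import random


def noisy_correlation(eta, trials=200_000, seed=0):
    """Estimate E[f(x) * l] with and without random classification noise.

    Toy setup (illustrative only): x ~ Uniform[-1, 1], true label
    l = sign(x), and query function f(x) = x.
    """
    rng = random.Random(seed)
    clean = noisy = 0.0
    for _ in range(trials):
        x = rng.uniform(-1.0, 1.0)
        label = 1.0 if x >= 0 else -1.0
        clean += x * label
        # flip the label independently with probability eta
        noisy_label = -label if rng.random() < eta else label
        noisy += x * noisy_label
    return clean / trials, noisy / trials


eta = 0.25
clean, noisy = noisy_correlation(eta)
# the noisy expectation should be (1 - 2*eta) times the clean one
print(noisy, (1 - 2 * eta) * clean)
```

Here the clean expectation is \(\mathop {\mathbf {E}}[|x|] = 1/2\), so the noisy one should concentrate around \((1-2\eta )/2\).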
As usual, we can bring the distribution to be close enough to this form using unlabeled samples or \(O(b/\epsilon )\) target-independent queries, where \(b\) is the number of bits needed to represent our examples.
References
Awasthi, P., Balcan, M.-F., Long, P.: The power of localization for efficiently learning linear separators with noise. In: Proceedings of the 46th ACM Symposium on Theory of Computing (2014)
Aslam, J., Decatur, S.: Specification and simulation of statistical query algorithms for efficiency and noise tolerance. J. Comput. Syst. Sci. 56, 191–208 (1998)
Angluin, D., Laird, P.: Learning from noisy examples. Mach. Learn. 2, 343–370 (1988)
Bousquet, O., Boucheron, S., Lugosi, G.: Theory of classification: a survey of recent advances. ESAIM Probab. Stat. 9, 323–375 (2005)
Balcan, M.-F., Beygelzimer, A., Langford, J.: Agnostic active learning. In: ICML (2006)
Balcan, M.-F., Broder, A., Zhang, T.: Margin based active learning. In: COLT, pp. 35–50 (2007)
Beygelzimer, A., Dasgupta, S., Langford, J.: Importance weighted active learning. In: ICML, pp. 49–56 (2009)
Blum, A., Dwork, C., McSherry, F., Nissim, K.: Practical privacy: the SuLQ framework. In: Proceedings of PODS, pp. 128–138 (2005)
Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.: Learnability and the Vapnik–Chervonenkis dimension. J. ACM 36(4), 929–965 (1989)
Bshouty, N., Feldman, V.: On using extended statistical queries to avoid membership queries. J. Mach. Learn. Res. 2, 359–395 (2002)
Blum, A., Furst, M., Jackson, J., Kearns, M., Mansour, Y., Rudich, S.: Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In: STOC, pp. 253–262 (1994)
Blum, A., Frieze, A., Kannan, R., Vempala, S.: A polynomial time algorithm for learning noisy linear threshold functions. Algorithmica 22(1/2), 35–52 (1997)
Balcan, M.-F., Hanneke, S.: Robust interactive learning. In: COLT (2012)
Beygelzimer, A., Hsu, D., Langford, J., Zhang, T.: Agnostic active learning without constraints. In: NIPS (2010)
Balcan, M.-F., Hanneke, S., Wortman, J.: The true sample complexity of active learning. In: COLT (2008)
Balcan, M.-F., Long, P.M.: Active and passive learning of linear separators under log-concave distributions. In: COLT (2013)
Bylander, T.: Learning linear threshold functions in the presence of classification noise. In: Proceedings of COLT, pp. 340–347 (1994)
Cavallanti, G., Cesa-Bianchi, N., Gentile, C.: Learning noisy linear classifiers via adaptive and selective sampling. Mach. Learn. 83, 71–102 (2011)
Chaudhuri, K., Hsu, D.: Sample complexity bounds for differentially private learning. In: JMLR COLT Proceedings, vol. 19, pp. 155–186 (2011)
Chu, C., Kim, S., Lin, Y., Yu, Y., Bradski, G., Ng, A., Olukotun, K.: Map-reduce for machine learning on multicore. In: Proceedings of NIPS, pp. 281–288 (2006)
Castro, R., Nowak, R.: Minimax bounds for active learning. In: COLT (2007)
Dasgupta, S.: Coarse sample complexity bounds for active learning. In: NIPS, vol. 18 (2005)
Dasgupta, S.: Active learning theory. In: Sammut, C., Webb, G.I. (eds.) Encyclopedia of Machine Learning, pp. 14–19 (2010)
Dekel, O., Gentile, C., Sridharan, K.: Selective sampling and active learning from single and multiple teachers. J. Mach. Learn. Res. 13(1), 2655–2697 (2012)
Dasgupta, S., Hsu, D.: Hierarchical sampling for active learning. In: ICML, pp. 208–215 (2008)
Dasgupta, S., Hsu, D.J., Monteleoni, C.: A general agnostic active learning algorithm. In: NIPS (2011)
Dasgupta, S., Tauman Kalai, A., Monteleoni, C.: Analysis of perceptron-based active learning. J. Mach. Learn. Res. 10, 281–299 (2009)
Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: TCC, pp. 265–284 (2006)
Dunagan, J., Vempala, S.: A simple polynomial-time rescaling algorithm for solving linear programs. In: STOC, pp. 315–320 (2004)
Feldman, V.: A complete characterization of statistical query learning with applications to evolvability. J. Comput. Syst. Sci. 78(5), 1444–1459 (2012)
Feldman, V., Grigorescu, E., Reyzin, L., Vempala, S., Xiao, Y.: Statistical algorithms and a lower bound for detecting planted cliques. In: ACM STOC (2013)
Freund, Y., Seung, H.S., Shamir, E., Tishby, N.: Selective sampling using the query by committee algorithm. Mach. Learn. 28(2–3), 133–168 (1997)
Gonen, A., Sabato, S., Shalev-Shwartz, S.: Efficient pool-based active learning of halfspaces. In: ICML (2013)
Hanneke, S.: Theory of disagreement-based active learning. Found. Trends Mach. Learn. 7(2–3), 131–309 (2014)
Hanneke, S.: A bound on the label complexity of agnostic active learning. In: ICML (2007)
Jagannathan, G., Monteleoni, C., Pillaipakkamnatt, K.: A semi-supervised learning approach to differential privacy. In: Proceedings of the 2013 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE Workshop on Privacy Aspects of Data Mining (PADM) (2013)
Kearns, M.: Efficient noise-tolerant learning from statistical queries. J. ACM 45(6), 983–1006 (1998)
Kasiviswanathan, S.P., Lee, H.K., Nissim, K., Raskhodnikova, S., Smith, A.: What can we learn privately? SIAM J. Comput. 40(3), 793–826 (2011)
Koltchinskii, V.: Rademacher complexities and bounding the excess risk in active learning. J. Mach. Learn. Res. 11, 2457–2485 (2010)
Kearns, M., Vazirani, U.: An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA (1994)
Kanade, V., Valiant, L.G., Wortman Vaughan, J.: Evolution with drifting targets. In: Proceedings of COLT, pp. 155–167 (2010)
Long, P.M.: On the sample complexity of PAC learning halfspaces against the uniform distribution. IEEE Trans. Neural Netw. 6(6), 1556–1559 (1995)
Lovász, L., Vempala, S.: The geometry of logconcave functions and sampling algorithms. Random Struct. Algorithms 30(3), 307–358 (2007)
McCallum, A., Nigam, K.: Employing EM in pool-based active learning for text classification. In: ICML, pp. 350–358 (1998)
Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65, 386–407 (1958)
Raginsky, M., Rakhlin, A.: Lower bounds for passive and active learning. In: NIPS, pp. 1026–1034 (2011)
Valiant, L.G.: A theory of the learnable. Commun. ACM 27(11), 1134–1142 (1984)
Vapnik, V.: Statistical Learning Theory. Wiley-Interscience, New York (1998)
Vempala, S.: Personal Communication (2013)
Wang, L.: Smoothness, disagreement coefficient, and the label complexity of agnostic active learning. J. Mach. Learn. Res. 12, 2269–2292 (2011)
Acknowledgments
We thank Avrim Blum and Santosh Vempala for useful discussions. This work was supported in part by NSF Grants CCF-0953192, CCF-110128, and CCF 1422910, AFOSR Grant FA9550-09-1-0538, ONR Grant N00014-09-1-0751, and a Microsoft Research Faculty Fellowship.
Appendices
Appendix 1: Passive SQ Learning of Halfspaces
The first SQ algorithm for learning general halfspaces was given by Blum et al. [12]. This algorithm requires access to unlabeled samples from the unknown distribution and therefore is only label-statistical. This algorithm can be used as a basis for our active SQ algorithm, but the resulting active algorithm will also be only label-statistical. As we have noted in Sect. 2, this is sufficient to obtain our RCN tolerant active learning algorithm given in Corollary 3.7. However, our differentially-private simulation needs the algorithm to be (fully) statistical. Therefore we base our algorithm on the algorithm of Dunagan and Vempala for learning halfspaces [29]. While [29] does not contain an explicit statement of the SQ version of the algorithm, it is known and easy to verify that the algorithm has a SQ version [49]. This follows from the fact that the algorithm in [29] relies on a combination of the Perceptron [45] and the modified Perceptron [12] algorithms, both of which have SQ versions [12, 17]. Another small issue that we need to take care of to apply the algorithm is that the running time and tolerance of the algorithm depend polynomially (in fact, linearly) on \(\log (1/\rho _0)\), where \(\rho _0\) is the margin of the points given to the algorithm. Namely, \(\rho _0 = \min _{x \in S}\frac{|w \cdot x|}{\Vert x\Vert }\), where \(h_w\) is the target homogeneous halfspace and \(S\) is the set of points given to the algorithm. We are dealing with continuous distributions, for which the margin is 0, and therefore we make the following observation. In place of \(\rho _0\) we can use any margin \(\rho _1\) such that the probability of being within margin at most \(\rho _1\) around the target hyperplane is small enough that it can be absorbed into the tolerance of the statistical queries of the Dunagan–Vempala algorithm for margin \(\rho _1\). Formally,
Definition 7.1
For positive \(\delta <1\) and distribution \(D\), we denote \(\gamma (D,\delta ) \doteq \sup \left\{ \gamma > 0 \,:\, \mathop {\Pr }\nolimits _D\left[ \frac{|w \cdot x|}{\Vert x\Vert } \le \gamma \right] \le \delta \hbox { for every halfspace } h_w \right\} \),
namely, \(\gamma (D,\delta )\) is the largest value of \(\gamma \) such that for every halfspace \(h_w\), the probability of being within margin \(\gamma \) of \(h_w\) under \(D\) is at most \(\delta \). Let \(\tau _{{\mathtt {DV}}}(\rho ,\epsilon )\) be the tolerance of the SQ version of the Dunagan–Vempala algorithm when the initial margin is equal to \(\rho \) and the error is set to \(\epsilon \). Let
Now, for \(\rho _1 = \rho _1(D,\epsilon )\) we know that \(\gamma (D,\tau _{{\mathtt {DV}}}(\rho _1,\epsilon /2)/3) \ge \rho _1\). Let \(D'\) be defined as distribution \(D\) conditioned on having margin at least \(\rho _1\) around the target hyperplane \(h_w\). By the definition of the function \(\gamma \), the probability of being within margin at most \(\rho _1\) is at most \(\tau _{{\mathtt {DV}}}(\rho _1,\epsilon /2)/3\). Therefore, for any query function \(g:X \times \{-1,1\}\rightarrow [-1,1]\),
and hence \(|\mathop {\mathbf {E}}\nolimits _D[g(x,h_w(x))] - \mathop {\mathbf {E}}\nolimits _{D'}[g(x,h_w(x))]| \le 2 \tau _{{\mathtt {DV}}}(\rho _1,\epsilon /2)/3\). This implies that we can obtain an answer to any SQ relative to \(D'\) with tolerance \(\tau _{{\mathtt {DV}}}(\rho _1,\epsilon /2)\) by using the same SQ relative to \(D\) with tolerance \(\tau _{{\mathtt {DV}}}(\rho _1,\epsilon /2)/3\). This means that by running the Dunagan–Vempala algorithm in this way we will obtain a hypothesis with error at most \(\epsilon /2\) relative to \(D'\). This hypothesis has error at most \(\epsilon /2 + 2 \tau _{{\mathtt {DV}}}(\rho _1,\epsilon /2)/3\) which, without loss of generality, is at most \(\epsilon \). Combining these observations about the Dunagan–Vempala algorithm, we obtain the following statement.
Theorem 7.2
([29]) There exists a SQ algorithm LearnHS-DV that learns \({\mathcal {H}}_d\) to accuracy \(1-\epsilon \) over any distribution \(D\). Further, LearnHS-DV outputs a homogeneous halfspace, runs in time polynomial in \(d,\,1/\epsilon \) and \(\log (1/\rho _1)\) and uses SQs of tolerance \( \ge 1/\text{ poly }(d,1/\epsilon ,\log (1/\rho _1))\), where \(\rho _1 = \rho _1(D,\epsilon )\).
To apply Theorem 7.2 we need to obtain bounds on \(\rho _1(D,\epsilon )\) for any distribution \(D\) on which we might run the Dunagan–Vempala algorithm.
Lemma 7.3
Let \(D\) be an isotropic log-concave distribution. Then for any \(\delta \in (0,1/20),\,\gamma (D,\delta ) \ge \delta /(6 \ln (1/\delta ))\).
Proof
Let \(\gamma \in (0,1/16)\) and \(w\) be any unit vector. We first upper-bound \(\mathop {\Pr }\nolimits _D\left[ \frac{|w \cdot x|}{\Vert x\Vert }\le \gamma \right] \).
By Lemma 5.7 in [43], for an isotropic log-concave \(D\) and any \(R > 1,\,\mathop {\Pr }\nolimits _D[\Vert x\Vert > R] < e^{-R+1}\). Therefore
Further, by Lemma 3.2,
Substituting these inequalities into Eq. (8) we obtain that for \(\gamma \in (0,1/16)\),
This implies that for \(\gamma = \delta /(6 \ln (1/\delta ))\) and any unit vector \(w\),
where we used that for \(\delta < 1/20,\,6 \ln (1/\delta ) \le 1/\delta \). By definition of \(\gamma (D,\delta )\), this implies that \(\gamma (D,\delta ) \ge \delta /(6 \ln (1/\delta ))\). \(\square \)
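As a sanity check of Lemma 7.3, one can estimate the margin mass of a standard Gaussian in \(\mathbb {R}^d\) (one example of an isotropic log-concave distribution; by spherical symmetry it suffices to take \(w = e_1\)). The dimension, \(\delta \), and sample size below are arbitrary illustrative choices:

```python
import random
import math


def margin_mass(d, gamma, trials=100_000, seed=1):
    """Monte Carlo estimate of Pr_D[|w.x| / ||x|| <= gamma] for D the
    standard Gaussian in R^d and w = e_1 (enough, by symmetry)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in x))
        if abs(x[0]) / norm <= gamma:
            hits += 1
    return hits / trials


delta = 0.1
gamma = delta / (6 * math.log(1 / delta))  # the lemma's lower bound on gamma(D, delta)
# the mass within this margin should indeed be at most delta
print(margin_mass(d=20, gamma=gamma), "<=", delta)
```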
We are now ready to prove Theorem 3.5, which we restate for convenience: there exists a SQ algorithm LearnHS that learns \({\mathcal {H}}_d\) to accuracy \(1-\epsilon \) over any distribution \(D_{|\chi }\), where \(D\) is an isotropic log-concave distribution and \(\chi :\mathbb {R}^d \rightarrow [0,1]\) is a filter function. Further, LearnHS outputs a homogeneous halfspace, runs in time polynomial in \(d,\,1/\epsilon \) and \(\log (1/\lambda )\) and uses SQs of tolerance \( \ge 1/\text{ poly }(d,1/\epsilon ,\log (1/\lambda ))\), where \(\lambda = \mathop {\mathbf {E}}\nolimits _D[\chi (x)]\).
Proof of Theorem 3.5
To prove the theorem we bound \(\rho _1 = \rho _1(D_{|\chi },\epsilon )\) and then apply Theorem 7.2. We first observe that for any event \(\Lambda \), \(\mathop {\Pr }\nolimits _{D_{|\chi }}[\Lambda ] = \mathop {\mathbf {E}}\nolimits _D[\chi (x)\cdot \mathbf {1}_\Lambda (x)]/\mathop {\mathbf {E}}\nolimits _D[\chi ] \le \mathop {\Pr }\nolimits _D[\Lambda ]/\mathop {\mathbf {E}}\nolimits _D[\chi ]\), since \(\chi (x) \le 1\).
Applying this to the event \(\frac{|w \cdot x|}{\Vert x\Vert } \le \gamma \) in Definition 7.1 we obtain that \(\gamma (D_{|\chi },\delta ) \ge \gamma (D,\delta \cdot \mathop {\mathbf {E}}\nolimits _D[\chi ])\). By Lemma 7.3, we get that \(\gamma (D_{|\chi },\delta ) = \Omega (\lambda \delta /\log (1/(\lambda \delta )))\).
In addition, by Theorem 7.2, \(\tau _{{\mathtt {DV}}}(\rho ,\epsilon ) \ge 1/p(d,1/\epsilon ,\log (1/\rho ))\) for some polynomial \(p\). This implies that
Therefore, we will obtain that,
By plugging this bound into Theorem 7.2 we obtain the claim. \(\square \)
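The inequality behind the first step of the proof, \(\mathop {\Pr }\nolimits _{D_{|\chi }}[\Lambda ] \le \mathop {\Pr }\nolimits _D[\Lambda ]/\mathop {\mathbf {E}}\nolimits _D[\chi ]\), can be verified on a toy example; the distribution, filter, and event below are our own illustrative choices:

```python
import random


def conditional_vs_scaled(trials=200_000, seed=2):
    """Check Pr_{D|chi}[Lambda] <= Pr_D[Lambda] / E_D[chi] on a toy example:
    x ~ Uniform[0, 1], filter chi(x) = x, event Lambda = {x < 0.3}."""
    rng = random.Random(seed)
    e_chi = p_event = p_joint = 0.0
    for _ in range(trials):
        x = rng.random()
        chi = x                          # filter value chi(x) in [0, 1]
        event = 1.0 if x < 0.3 else 0.0
        e_chi += chi
        p_event += event
        p_joint += chi * event           # contributes to E_D[chi * 1_Lambda]
    e_chi /= trials
    p_event /= trials
    p_joint /= trials
    cond = p_joint / e_chi               # Pr_{D|chi}[Lambda]
    bound = p_event / e_chi              # Pr_D[Lambda] / E_D[chi]
    return cond, bound


cond, bound = conditional_vs_scaled()
print(cond, "<=", bound)
```

In this toy case the exact values are \(\mathop {\mathbf {E}}[\chi ]=1/2\), \(\mathop {\mathbf {E}}[\chi \cdot \mathbf {1}_\Lambda ]=0.045\), so the conditional probability is \(0.09\), well below the bound \(0.6\).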
Appendix 2: Proofs from Sect. 4
We now prove Lemmas 4.6 and 4.7 which we restate for convenience.
Lemma 8.1
(Lemma 4.6 restated) For any \(v,w\in S_{d-1}\) such that \(\Vert v-w\Vert = \Delta \le \sqrt{2}\) and \(\gamma > 0\),
We denote the probability by \({\mathsf {cp}}_d(\gamma ,\Delta )\).
Proof
By using spherical symmetry, we can assume without loss of generality that \(v = (1,0,0,\ldots ,0)\) and \(w=(\sqrt{1-\Delta ^2/2}, \Delta /\sqrt{2},0,0,\ldots ,0)\). We now examine the surface area of the points that satisfy \(h_w(x) = -1\) and \(0 \le \langle v, x \rangle \le \gamma \) (which is a half of the error region at distance at most \(\gamma \) from \(v\)). To compute it we consider the points on \(S_{d-1}\) that satisfy \(\langle v, x \rangle = r\). These points form a hypersphere \(\sigma \) of dimension \(d-2\) and radius \(\sqrt{1-r^2}\). In this hypersphere, the points that satisfy \(h_w(x) = -1\) are the points \((r,s,x_3,\ldots ,x_d) \in S_{d-1}\) for which \(r\sqrt{1-\Delta ^2/2} + s \Delta /\sqrt{2} \le 0\). In other words, \(s \le -r\sqrt{2-\Delta ^2}/\Delta \), that is, the points of \(\sigma \) which are at distance at least \(r\sqrt{2-\Delta ^2}/\Delta \) from the hyperplane through the origin of \(\sigma \) with normal \((0,1,0,0,\ldots ,0)\) (such a region is also referred to as a hyperspherical cap). As in Eq. (6), we obtain that its \((d-2)\)-dimensional surface area is:
Integrating over all \(r\) from \(0\) to \(\gamma \) gives the surface area of the region \(h_w(x) = -1\) and \(0 \le \langle v, x \rangle \le \gamma \):
Hence the conditional probability is as claimed. \(\square \)
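The quantity \({\mathsf {cp}}_d(\gamma ,\Delta )\) can also be explored by direct simulation. The sketch below estimates what we read the lemma as computing: for \(x\) uniform on \(S_{d-1}\), the probability that \(h_w(x) \ne h_v(x)\) conditioned on \(|\langle v, x \rangle | \le \gamma \), using the same parametrization of \(v\) and \(w\) as in the proof (the concrete parameters are illustrative):

```python
import random
import math


def cp_estimate(d, gamma, delta_dist, trials=200_000, seed=3):
    """Estimate Pr[h_w(x) != h_v(x) | |<v, x>| <= gamma] for x uniform
    on S_{d-1}, where ||v - w|| = delta_dist."""
    rng = random.Random(seed)
    # as in the proof: v = e_1 and w = (sqrt(1 - Delta^2/2), Delta/sqrt(2), 0, ...)
    w1 = math.sqrt(1 - delta_dist ** 2 / 2)
    w2 = delta_dist / math.sqrt(2)
    cond = disagree = 0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(t * t for t in g))
        x = [t / norm for t in g]  # normalized Gaussian is uniform on the sphere
        if abs(x[0]) <= gamma:
            cond += 1
            # h_v(x) = sign(x_1), h_w(x) = sign(w1*x_1 + w2*x_2)
            if (x[0] >= 0) != (w1 * x[0] + w2 * x[1] >= 0):
                disagree += 1
    return disagree / cond


# the disagreement mass near v's boundary grows with the distance Delta
print(cp_estimate(d=5, gamma=0.1, delta_dist=0.2))
```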
Lemma 8.2
(Lemma 4.7 restated) For \(\Delta \le \sqrt{2}\), any \(d\ge 4\), and \(\gamma \ge \Delta /(2\sqrt{d}),\,\partial _\Delta {\mathsf {cp}}_d(\gamma ,\Delta ) \ge 1/(56\gamma \cdot \sqrt{d})\).
Proof
First note that
is independent of \(\Delta \) and therefore it is sufficient to differentiate
Let \(\gamma ' = \Delta /(2\sqrt{d})\) (note that by our assumption \(\gamma ' \le \gamma \)). By the Leibniz integral rule,
Now using the conditions \(\Delta \le \sqrt{2},\,d\ge 4\), we obtain that \(\gamma ' \le 1/(2\sqrt{2})\) and hence for all \(r \in [0,\gamma '],\,1-r^2 \ge 7/8\) and \(r^2/\Delta ^2 \le \gamma '^2/\Delta ^2 = 1/(4d)\). This implies that for all \(r\in [0,\gamma ']\),
Now,
Substituting this into our expression for \(\partial _\Delta \theta (\gamma ,\Delta )\) we get
where to derive \((*)\) we use the fact that \(e^{-\gamma '^2 (d-2)/2} \le 1-\gamma '^2 (d-2)/4\) since \(e^{-x} \le 1-x/2\) for every \(x \in [0,1]\) and \(\gamma '^2 (d-2)/2 \le \frac{\Delta ^2 (d-2)}{8d} < 1\). At the same time, \(\int _0^\gamma (1-r^2)^{(d-3)/2} dr \le \gamma \) and therefore,
where we used that \(\frac{A_{d-3}}{A_{d-2}} \ge \sqrt{d}/(2\sqrt{3})\) (e.g. [27]). \(\square \)
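The elementary inequality invoked at step \((*)\), \(e^{-x} \le 1-x/2\) for every \(x \in [0,1]\), is easy to confirm numerically with a quick grid check:

```python
import math

# e^{-x} <= 1 - x/2 on [0, 1]: equality at x = 0, and the gap
# 1 - x/2 - e^{-x} stays nonnegative up to x = 1 (where e^{-1} < 1/2).
xs = [i / 1000 for i in range(1001)]
gaps = [1 - x / 2 - math.exp(-x) for x in xs]
assert min(gaps) >= -1e-12
print("e^{-x} <= 1 - x/2 holds on the grid")
```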
Balcan, M.F., Feldman, V. Statistical Active Learning Algorithms for Noise Tolerance and Differential Privacy. Algorithmica 72, 282–315 (2015). https://doi.org/10.1007/s00453-014-9954-9