Near-Optimal Coresets of Kernel Density Estimates

Abstract

We construct near-optimal coresets for kernel density estimates for points in \({\mathbb {R}}^d\) when the kernel is positive definite. Specifically, we provide a polynomial-time construction for a coreset of size \(O(\sqrt{d}/\varepsilon \cdot \sqrt{\log 1/\varepsilon } )\), and we show a near-matching lower bound of size \(\Omega (\min \{\sqrt{d}/\varepsilon , 1/\varepsilon ^2\})\). When \(d\ge 1/\varepsilon ^2\), it is known that the coreset size can be \(O(1/\varepsilon ^2)\). The upper bound is a polynomial-in-\((1/\varepsilon )\) improvement when \(d \in [3,1/\varepsilon ^2)\), and the lower bound is the first known lower bound for this problem that depends on d. Moreover, the restriction in the upper bound that the kernel be positive definite is significant in that it still applies to a wide variety of kernels, specifically those most important for machine learning. This includes kernels for information distances and the sinc kernel, which can be negative.
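To make the quantity being bounded concrete, the following sketch (illustrative code, not the paper's construction; the function and variable names are our own) measures the worst-case difference between a Gaussian-kernel density estimate over a full point set and over a small subset. Here the subset is a uniform random sample, which is only known to achieve coreset size \(O(1/\varepsilon ^2)\), not the improved bound above.

```python
import numpy as np

def kde(points, queries, sigma=1.0):
    """Gaussian-kernel density estimate of `points`, evaluated at `queries`.

    Returns the average kernel value per query, so outputs lie in [0, 1].
    """
    # Pairwise squared distances: shape (num_queries, num_points).
    d2 = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma**2)).mean(axis=1)

rng = np.random.default_rng(0)
P = rng.normal(size=(2000, 2))                      # full point set in R^2
S = P[rng.choice(len(P), size=100, replace=False)]  # uniform-sample "coreset"
queries = rng.normal(size=(500, 2))                 # evaluation points

# An epsilon-coreset must make this worst-case difference at most epsilon.
err = np.max(np.abs(kde(P, queries) - kde(S, queries)))
```

A coreset construction replaces the uniform sample `S` with a carefully chosen (weighted) subset so that `err` is at most \(\varepsilon \) with far fewer points.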



Notes

  1. This combines results published in SoCG 2018 [39] and SODA 2018 [38].

References

  1. Arias-Castro, E., Mason, D., Pelletier, B.: On the estimation of the gradient lines of a density and the consistency of the mean-shift algorithm. J. Mach. Learn. Res. 17, 43 (2016)

  2. Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68(3), 337–404 (1950)

  3. Bach, F., Lacoste-Julien, S., Obozinski, G.: On the equivalence between herding and conditional gradient algorithms. In: Proceedings of the 29th International Conference on Machine Learning (ICML'12), pp. 1355–1362. Omnipress (2012)

  4. Banaszczyk, W.: Balancing vectors and Gaussian measures of \(n\)-dimensional convex bodies. Random Struct. Algorithms 12(4), 351–360 (1998)

  5. Bansal, N., Dadush, D., Garg, S., Lovett, S.: The Gram–Schmidt walk: a cure for the Banaszczyk blues. In: Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC'18), pp. 587–597. ACM, New York (2018)

  6. Bentley, J.L., Saxe, J.B.: Decomposable searching problems I: static-to-dynamic transformations. J. Algorithms 1, 4 (1980)

  7. Bobrowski, O., Mukherjee, S., Taylor, J.E.: Topological consistency via kernel estimation. Bernoulli 23(1), 288–328 (2017)

  8. Chazelle, B.: The Discrepancy Method. Cambridge University Press, Cambridge (2000)

  9. Chazelle, B., Matoušek, J.: On linear-time deterministic algorithms for optimization problems in fixed dimensions. J. Algorithms 21(3), 579–597 (1996)

  10. Chen, Y., Welling, M., Smola, A.: Super-samples from kernel herding. In: Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI'10), pp. 109–116. AUAI Press, Arlington (2010)

  11. Clarkson, K.: Coresets, sparse greedy approximation, and the Frank–Wolfe algorithm. ACM Trans. Algorithms 4(6), 63 (2010)

  12. Cortés, E.C., Scott, C.: Sparse approximation of a kernel mean. IEEE Trans. Signal Process. 65(5), 1310–1323 (2016)

  13. Devroye, L., Györfi, L.: Nonparametric Density Estimation: The \(L_1\) View. Wiley Series in Probability and Mathematical Statistics: Tracts on Probability and Statistics. Wiley, New York (1985)

  14. Drineas, P., Mahoney, M.W.: On the Nyström method for approximating a Gram matrix for improved kernel-based learning. J. Mach. Learn. Res. 6, 2153–2175 (2005)

  15. Dunn, J.C.: Convergence rates for conditional gradient sequences generated by implicit step length rules. SIAM J. Control Optim. 18(5), 473–489 (1980)

  16. Fan, J., Gijbels, I.: Local Polynomial Modelling and Its Applications. Monographs on Statistics and Applied Probability, vol. 66. Chapman & Hall, London (1996)

  17. Fasy, B.T., Lecci, F., Rinaldo, A., Wasserman, L., Balakrishnan, S., Singh, A.: Confidence sets for persistence diagrams. Ann. Stat. 42(6), 2301–2339 (2014)

  18. Freund, R.M., Grigas, P.: New analysis and results for the Frank–Wolfe method. Math. Program. 155(1–2), 199–230 (2016)

  19. Gärtner, B., Jaggi, M.: Coresets for polytope distance. In: Proceedings of the 25th Annual Symposium on Computational Geometry (SCG'09), pp. 33–42. ACM, New York (2009)

  20. Glaunès, J.: Transport by diffeomorphisms of points, measures, and currents for shape comparison and computational anatomy. PhD thesis, Université Paris 13 (2005) (in French)

  21. Gonzalez, T.F.: Clustering to minimize the maximum intercluster distance. Theoret. Comput. Sci. 38(2–3), 293–306 (1985)

  22. Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773 (2012)

  23. Harvey, N., Samadi, S.: Near-optimal herding. In: Proceedings of the 27th Conference on Learning Theory, vol. 35, pp. 1165–1183 (2014)

  24. Hein, M., Bousquet, O.: Hilbertian metrics and positive definite kernels on probability measures. In: Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 136–143 (2005)

  25. Hofmann, T., Schölkopf, B., Smola, A.J.: Kernel methods in machine learning. Ann. Stat. 36(3), 1171–1220 (2008)

  26. Jaggi, M.: Revisiting Frank–Wolfe: projection-free sparse convex optimization. In: Proceedings of the 30th International Conference on Machine Learning, vol. 28(1), pp. 427–435 (2013)

  27. Jaggi, M., Lacoste-Julien, S.: On the global linear convergence of Frank–Wolfe optimization variants. In: Advances in Neural Information Processing Systems, vol. 28 (2015)

  28. Joshi, S., Kommaraji, R.V., Phillips, J.M., Venkatasubramanian, S.: Comparing distributions and shapes using the kernel distance. In: Proceedings of the 27th Annual Symposium on Computational Geometry (SoCG'11), pp. 47–56. ACM, New York (2011)

  29. Lacoste-Julien, S., Lindsten, F., Bach, F.: Sequential kernel herding: Frank–Wolfe optimization for particle filtering. In: Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pp. 544–552 (2015)

  30. Li, Y., Long, P.M., Srinivasan, A.: Improved bounds on the sample complexity of learning. J. Comput. Syst. Sci. 62(3), 516–527 (2001)

  31. Lopez-Paz, D., Muandet, K., Schölkopf, B., Tolstikhin, I.: Towards a learning theory of cause-effect inference. In: Proceedings of the 32nd International Conference on Machine Learning, vol. 37, pp. 1452–1461 (2015)

  32. Matoušek, J.: Geometric Discrepancy: An Illustrated Guide. Algorithms and Combinatorics, vol. 18, 2nd edn. Springer, Berlin (2010)

  33. Matoušek, J., Nikolov, A., Talwar, K.: Factorization norms and hereditary discrepancy. Int. Math. Res. Not. https://doi.org/10.1093/imrn/rny033

  34. Muandet, K., Fukumizu, K., Sriperumbudur, B.K., Schölkopf, B.: Kernel mean embedding of distributions: a review and beyond. Found. Trends Mach. Learn. 10, 1–141 (2017)

  35. Müller, A.: Integral probability metrics and their generating classes of functions. Adv. Appl. Probab. 29(2), 429–443 (1997)

  36. Phillips, J.M.: Algorithms for \(\varepsilon \)-approximations of terrains. In: ICALP (2008)

  37. Phillips, J.M.: \(\varepsilon \)-Samples for kernels. In: Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'13), pp. 1622–1632. SIAM, Philadelphia (2013)

  38. Phillips, J.M., Tai, W.M.: Improved coresets for kernel density estimates. In: Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'18), pp. 2718–2727. SIAM, Philadelphia (2018)

  39. Phillips, J.M., Tai, W.M.: Near-optimal coresets for kernel density estimates. In: Proceedings of the 34th International Symposium on Computational Geometry (SoCG'18), pp. 66:1–66:13. Schloss Dagstuhl–Leibniz-Zentrum für Informatik (2018)

  40. Phillips, J.M., Venkatasubramanian, S.: A gentle introduction to the kernel distance. arXiv:1103.1625 (2011)

  41. Phillips, J.M., Wang, B., Zheng, Y.: Geometric inference on kernel density estimates. In: Proceedings of the 31st International Symposium on Computational Geometry (SoCG'15). Schloss Dagstuhl–Leibniz-Zentrum für Informatik (2015)

  42. Rinaldo, A., Wasserman, L.: Generalized density clustering. Ann. Stat. 38(5), 2678–2722 (2010)

  43. Schoenberg, I.J.: Metric spaces and completely monotone functions. Ann. Math. 39(4), 811–841 (1938)

  44. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge (2002)

  45. Schubert, E., Zimek, A., Kriegel, H.P.: Generalized outlier detection with flexible kernel density estimates. In: Proceedings of the SIAM International Conference on Data Mining, pp. 542–550 (2014)

  46. Scott, D.W.: Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York (1992)

  47. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, London (1986)

  48. Song, L., Zhang, X., Smola, A., Gretton, A., Schölkopf, B.: Tailoring density estimation via reproducing kernel moment matching. In: Proceedings of the 25th International Conference on Machine Learning (ICML'08), pp. 992–999. ACM, New York (2008)

  49. Sriperumbudur, B.K., Gretton, A., Fukumizu, K., Schölkopf, B., Lanckriet, G.R.G.: Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res. 11, 1517–1561 (2010)

  50. Wahba, G.: Support vector machines, reproducing kernel Hilbert spaces, and randomized GACV. In: Advances in Kernel Methods—Support Vector Learning, pp. 69–88. MIT Press, Cambridge (1999)

  51. Zheng, Y., Phillips, J.M.: L\(_{\infty }\) error and bandwidth selection for kernel density estimates of large data. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'15), pp. 1533–1542. ACM, New York (2015)


Author information


Correspondence to Wai Ming Tai.

Additional information


J.M. Phillips gratefully acknowledges support from NSF grants CCF-1350888, IIS-1251019, ACI-1443046, CNS-1514520, and CNS-1564287.

Editor in Charge: Kenneth Clarkson


Cite this article

Phillips, J.M., Tai, W.M. Near-Optimal Coresets of Kernel Density Estimates. Discrete Comput Geom 63, 867–887 (2020). https://doi.org/10.1007/s00454-019-00134-6


Keywords

  • Coreset
  • Kernel density estimate
  • Discrepancy theory