Near-Optimal Coresets of Kernel Density Estimates

  • Jeff M. Phillips
  • Wai Ming Tai

Abstract

We construct near-optimal coresets for kernel density estimates of points in \({\mathbb {R}}^d\) when the kernel is positive definite. Specifically, we provide a polynomial-time construction for a coreset of size \(O(\sqrt{d}/\varepsilon \cdot \sqrt{\log (1/\varepsilon )})\), and we show a near-matching lower bound of \(\Omega (\min \{\sqrt{d}/\varepsilon ,\, 1/\varepsilon ^2\})\). When \(d\ge 1/\varepsilon ^2\), it is known that the coreset size can be \(O(1/\varepsilon ^2)\). The upper bound is a polynomial-in-\((1/\varepsilon )\) improvement when \(d \in [3,1/\varepsilon ^2)\), and the lower bound is the first known lower bound for this problem that depends on \(d\). Moreover, the restriction to positive definite kernels in the upper bound is not severe: it applies to a wide variety of kernels, specifically those most important for machine learning, including kernels for information distances and the sinc kernel, which can take negative values.
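To make the guarantee concrete: a coreset \(S\) of a point set \(P\) must satisfy \(\sup _q |{\mathrm {KDE}}_P(q)-{\mathrm {KDE}}_S(q)| \le \varepsilon \), where \({\mathrm {KDE}}_P(q) = \frac{1}{|P|}\sum _{p\in P} K(p,q)\). The following minimal NumPy sketch evaluates a Gaussian kernel density estimate and measures this error empirically for a subset. The subset here is a uniform random sample used only as a baseline stand-in, not the paper's discrepancy-based construction, and the function and parameter names (kde, bandwidth, the sample sizes) are illustrative assumptions; the supremum is approximated by a maximum over sampled query points.

    import numpy as np

    def kde(query, points, bandwidth=1.0):
        # Gaussian KDE: kde_P(q) = (1/|P|) * sum_p exp(-||p - q||^2 / (2 h^2)).
        d2 = ((query[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * bandwidth ** 2)).mean(axis=1)

    rng = np.random.default_rng(0)
    P = rng.normal(size=(5000, 2))                      # full point set in R^2
    Q = rng.normal(size=(200, 2))                       # query points standing in for sup_q
    S = P[rng.choice(len(P), size=100, replace=False)]  # uniform-sample "coreset" baseline

    # Empirical L_inf error of the subset S relative to the full set P.
    err = np.abs(kde(Q, P) - kde(Q, S)).max()
    print(f"empirical L_inf error over sampled queries: {err:.4f}")

A uniform random sample needs roughly \(1/\varepsilon ^2\) points to drive this error below \(\varepsilon \); the construction in this paper achieves the same guarantee with only \(O(\sqrt{d}/\varepsilon \cdot \sqrt{\log (1/\varepsilon )})\) points.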

Keywords

Coreset · Kernel density estimate · Discrepancy theory

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. University of Utah, Salt Lake City, USA
