Nyström-SGD: Fast Learning of Kernel-Classifiers with Conditioned Stochastic Gradient Descent

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11052)


Kernel methods are a popular choice for classification problems, but when solving large-scale learning tasks, computing the kernel matrix, whose size grows quadratically with the number of training examples, quickly becomes infeasible. To circumvent this problem, the Nyström method, which approximates the kernel matrix from only a small sample of its columns, has been proposed. Other techniques for speeding up kernel learning include stochastic first-order optimization and conditioning. We introduce Nyström-SGD, a learning algorithm that trains kernel classifiers by minimizing a convex loss function with conditioned stochastic gradient descent while exploiting the low-rank structure of a Nyström kernel approximation. Our experiments suggest that Nyström-SGD enables us to rapidly train high-accuracy classifiers for large-scale classification tasks. Code related to this paper is available at:
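The abstract combines two ideas: a Nyström low-rank approximation of the kernel matrix built from a small set of landmark points, and SGD on the resulting problem, conditioned so that the optimization is well behaved. The sketch below is not the authors' implementation; it is a minimal illustration of the general recipe, assuming an RBF kernel, landmarks taken as a subsample of the data, and a hinge loss. Whitening the Nyström features by the inverse square roots of the landmark-block eigenvalues plays the role of conditioning here.

```python
import numpy as np

def nystroem_features(X, landmarks, gamma):
    """Map X to an m-dimensional feature space whose inner products
    approximate an RBF kernel (Nystrom approximation)."""
    def rbf(A, B):
        # squared Euclidean distances between all rows of A and B
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    K_mm = rbf(landmarks, landmarks)   # small m x m landmark block
    K_nm = rbf(X, landmarks)           # n x m cross-kernel
    # eigendecompose the small block; clip guards against tiny
    # numerically negative eigenvalues
    vals, vecs = np.linalg.eigh(K_mm)
    vals = np.clip(vals, 1e-12, None)
    # scaling by vals**-0.5 whitens the features, which is one way to
    # realize the "conditioning" mentioned in the abstract
    return K_nm @ vecs @ np.diag(vals ** -0.5)

def sgd_hinge(Z, y, epochs=20, lr=0.1, lam=1e-4, seed=0):
    """Plain SGD on an L2-regularized hinge loss in feature space."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Z.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            margin = y[i] * (Z[i] @ w)
            grad = lam * w - (y[i] * Z[i] if margin < 1 else 0.0)
            w -= lr * grad
    return w

# toy usage: two Gaussian blobs, 10 landmarks subsampled from the data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
Z = nystroem_features(X, X[::10], gamma=0.5)
w = sgd_hinge(Z, y)
acc = np.mean(np.sign(Z @ w) == y)
```

Because only the m x m landmark block is ever decomposed, the cost stays linear in the number of training points, which is the point of the Nyström step; the paper's contribution lies in how the conditioning and SGD interact with this low-rank structure.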



Part of the work on this paper has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 “Providing Information by Resource-Constrained Analysis”, project C3.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

Artificial Intelligence Group, TU Dortmund University, Dortmund, Germany
