How Effectively Train Large-Scale Machine Learning Models?

  • Aven SamarehEmail author
  • Mahshid Salemi Parizi
Part of the Springer Optimization and Its Applications book series (SOIA, volume 152)


The stochastic gradient method (SGM) is widely used as an optimization tool in many machine learning applications including support vector machines (SVM)s, logistic regression, graphical models and deep learning. SGM computes the estimates of the gradient from a single randomly chosen sample in each iteration. Therefore, applying a stochastic gradient method for large-scale machine learning problems can be computationally efficient. In this work, we focus on generating generalization bounds for a randomized algorithm such as Random Fourier features learned with stochastic gradient descent algorithm. Our findings are based on a mutual relationship between the generalization error of an algorithm and its stability properties. The stability of an algorithm is measured by the generalization error regarding the absolute difference between the testing and the training error. Overall, an algorithm is called stable if by changing any single training data point the training error varies slightly. In this work, we measured the stability of stochastic gradient method (SGM) for learning an approximated Fourier primal support vector machine. In particular, under certain regularity assumptions, we showed that a randomized algorithm such as Random Fourier feature where is trained by a stochastic gradient method (SGM) with few iterations has vanishing generalization error. Therefore, the iterative optimization algorithm can stop long before its convergence to reduce computational cost. We empirically verified the theoretical findings for different parameters using several data sets.


Convex optimization  Random fourier features  Support vector machine Generalization error 


  1. 1.
    Duan, J., Luo, L., Li, J., Gao, X., Zhang, W.: Measuring of train wheel surface defect based on linear CCD imaging. In: 2016 18th International Wheelset Congress (IWC), pp. 65–70. IEEE (2016)Google Scholar
  2. 2.
    Moyne, J., Iskandar, J.: Big data analytics for smart manufacturing: case studies in semiconductor manufacturing. Processes 5(3), 39 (2017)CrossRefGoogle Scholar
  3. 3.
    Scime, L., Beuth, J.: Anomaly detection and classification in a laser powder bed additive manufacturing process using a trained computer vision algorithm. Addit. Manuf. 19, 114–126 (2018)CrossRefGoogle Scholar
  4. 4.
    Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification and Scene Analysis, vol. 3. Wiley, New York (1973)zbMATHGoogle Scholar
  5. 5.
    Hagan, M.T., Demuth, H.B., Beale, M.H., De Jesús, O.: Neural Network Design, vol. 20. Pws Pub., Boston (1996)Google Scholar
  6. 6.
    Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)Google Scholar
  7. 7.
    Keller, J.M., Gray, M.R., Givens, J.A.: A fuzzy k-nearest neighbor algorithm. IEEE Trans. Syst. Man Cybern. 4, 580–585. IEEE (1985)Google Scholar
  8. 8.
    Scholkopf, B., Smola, A.J.: Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge (2001)Google Scholar
  9. 9.
    Niyogi, P., Burges, C., Ramesh, P.: Distinctive feature detection using support vector machines. In: 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258), vol. 1, pp. 425–428. IEEE (1999)Google Scholar
  10. 10.
    Schölkopf, B., Smola, A., Muller, K.-R., Burges, C., Vapnik, V.: Support Vector Methods in Learning and Feature Extraction. Citeseer (1998)Google Scholar
  11. 11.
    Jonsson, K., Kittler, J, Li, Y.P., Matas, J.: Support vector machines for face authentication. Image Vis. Comput. 20(5–6), 369–375 (2002)CrossRefGoogle Scholar
  12. 12.
    Jabri, M., Flower, B.: Weight perturbation: an optimal architecture and learning technique for analog vlsi feedforward and recurrent multilayer networks. IEEE Trans. Neural Netw. 3(1), 154–157 (1992)CrossRefGoogle Scholar
  13. 13.
    Flower, B., Jabri, M.: Summed weight neuron perturbation: an O(N) improvement over weight perturbation. In: Advances in Neural Information Processing Systems, pp. 212–219 (1993)Google Scholar
  14. 14.
    Nemirovski, A., Yudin, D.: On Cezari’s convergence of the steepest descent method for approximating saddle point of convex-concave functions. Soviet Math. Dokl. 19, 258–269 (1978)Google Scholar
  15. 15.
    Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)MathSciNetCrossRefGoogle Scholar
  16. 16.
    Bousquet, O., Elisseeff, A.: Stability and generalization. J. Mach. Learn. Res. 2, 499–526 (2002)MathSciNetzbMATHGoogle Scholar
  17. 17.
    Hardt, M., Recht, B., Singer, Y.: Train faster, generalize better: stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240 (2015)Google Scholar
  18. 18.
    Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, pp. 1177–1184 (2008)Google Scholar
  19. 19.
    Chapelle, O.: Training a support vector machine in the primal. Neural Comput. 19(5), 1155–1178 (2007)MathSciNetCrossRefGoogle Scholar
  20. 20.
    Yang, T., Li, Y.-F., Mahdavi, M., Jin, R., Zhou, Z.-H.: Nyström method vs random fourier features: a theoretical and empirical comparison. In: Advances in Neural Information Processing Systems, pp. 476–484 (2012)Google Scholar
  21. 21.
    Bubeck, S. et al.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)CrossRefGoogle Scholar
  22. 22.
    Lu, J., Hoi, S.C.H., Wang, J., Zhao, P., Liu, Z.-Y.: Large scale online kernel learning. J. Mach. Learn. Res. 17(1), 1613–1655 (2016)MathSciNetzbMATHGoogle Scholar
  23. 23.
    Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A.: Pegasos: primal estimated sub-gradient solver for SVM. Math. program. 127(1), 3–30 (2011)MathSciNetCrossRefGoogle Scholar
  24. 24.
    Sutherland, D.J., Schneider, J.: On the error of random fourier features. arXiv preprint arXiv:1506.02785 (2015)Google Scholar
  25. 25.
    Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. 1.Industrial and Systems Engineering DepartmentUniversity of WashingtonSeattleUSA

Personalised recommendations