
Smooth Minimization of Nonsmooth Functions with Parallel Coordinate Descent Methods

  • Olivier Fercoq
  • Peter Richtárik
Conference paper
Part of the Springer Proceedings in Mathematics & Statistics book series (PROMS, volume 279)

Abstract

We study the performance of a family of randomized parallel coordinate descent methods for minimizing a nonsmooth, nonseparable convex function. The problem class includes, as special cases, L1-regularized L1 regression and the minimization of the exponential loss (the “AdaBoost problem”). We assume that the input data defining the loss function is contained in a sparse \(m\times n\) matrix \(A\) with at most \(\omega \) nonzeros in each row, and that the objective function has a “max structure”, allowing us to smooth it. Our main contribution is to identify, in closed form, parameters that guarantee a parallelization speedup depending on basic quantities of the problem, such as its size and the number of processors. The theory relies on a careful study of the Lipschitz constant of the smoothed objective restricted to low-dimensional subspaces and shows that the speedup increases with problem sparsity.
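
To make the setting concrete, the NumPy sketch below illustrates the kind of iteration the abstract describes: a nonsmooth term (here the absolute value, with its max structure) is smoothed, and a random block of tau coordinates of the smoothed objective is updated in parallel. The choice of objective (smoothed L1 regression), the Huber-type smoothing, the damping factor beta and the sampling scheme are illustrative assumptions made for this sketch only; the paper derives specific closed-form parameters from quantities such as the row sparsity omega and the number of processors, and this is not its exact algorithm.

import numpy as np

def smoothed_abs(t, mu):
    # Huber-type smoothing of |t| = max_{|z|<=1} z*t, obtained by adding a
    # strongly convex prox term (mu/2)*z^2 inside the max (Nesterov-style smoothing).
    return np.where(np.abs(t) <= mu, t ** 2 / (2 * mu), np.abs(t) - mu / 2)

def parallel_cd_sketch(A, b, mu=0.1, tau=4, iters=1000, seed=0):
    # Illustrative randomized parallel coordinate descent on the smoothed
    # L1 regression objective f_mu(x) = sum_i smoothed_abs((Ax - b)_i, mu).
    # beta and the sampling are simple placeholders, not the closed-form
    # parameters derived in the paper.
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    omega = int(np.max((A != 0).sum(axis=1)))             # max nonzeros per row of A
    beta = 1.0 + (tau - 1) * (omega - 1) / max(n - 1, 1)  # placeholder step damping
    L = (A ** 2).sum(axis=0) / mu + 1e-12                 # coordinate-wise Lipschitz bounds
    for _ in range(iters):
        r = A @ x - b
        g = A.T @ np.clip(r / mu, -1.0, 1.0)              # gradient of the smoothed objective
        S = rng.choice(n, size=min(tau, n), replace=False)  # random block of coordinates
        x[S] -= g[S] / (beta * L[S])                      # update the sampled coordinates
    return x

# Tiny usage example with random sparse-ish data.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((50, 20)) * (rng.random((50, 20)) < 0.3)
    b = rng.standard_normal(50)
    x = parallel_cd_sketch(A, b)
    print(smoothed_abs(A @ x - b, 0.1).sum())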

Keywords

Coordinate descent · Parallel computing · Smoothing · Lipschitz constant

Acknowledgements

The work of both authors was supported by the EPSRC grant EP/I017127/1 (Mathematics for Vast Digital Resources). The work of P.R. was also supported by the Centre for Numerical Algorithms and Intelligent Software (funded by EPSRC grant EP/G036136/1 and the Scottish Funding Council).


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. LTCI, Télécom ParisTech, Université Paris-Saclay, Paris, France
  2. School of Mathematics, The University of Edinburgh, Edinburgh, UK
