
Smooth Minimization of Nonsmooth Functions with Parallel Coordinate Descent Methods

  • Conference paper
  • In: Modeling and Optimization: Theory and Applications (MOPTA 2017)
  • Part of the book series: Springer Proceedings in Mathematics & Statistics (PROMS, volume 279)

Abstract

We study the performance of a family of randomized parallel coordinate descent methods for minimizing a nonsmooth nonseparable convex function. The problem class includes, as special cases, L1-regularized L1 regression and the minimization of the exponential loss (the “AdaBoost problem”). We assume that the input data defining the loss function are contained in a sparse \(m\times n\) matrix A with at most \(\omega \) nonzeros in each row, and that the objective function has a “max structure”, allowing us to smooth it. Our main contribution is the identification of parameters, given in closed form, that guarantee a parallelization speedup depending on basic quantities of the problem (such as its size and the number of processors). The theory relies on a fine study of the Lipschitz constant of the smoothed objective restricted to low-dimensional subspaces, and shows an increased acceleration for sparser problems.
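
For concreteness, the two special cases mentioned above can be written as follows. This is an illustrative sketch in our own notation (data matrix \(A \in \mathbb {R}^{m\times n}\), right-hand side b, regularization parameter \(\lambda > 0\)); in particular, the log-of-average form of the exponential loss is one common convention and is not taken verbatim from the paper:

\[ \min _{x \in \mathbb {R}^n} \; \Vert Ax - b\Vert _1 + \lambda \Vert x\Vert _1 \qquad \text {(L1-regularized L1 regression)}, \]

\[ \min _{x \in \mathbb {R}^n} \; \log \Big ( \frac{1}{m} \sum _{j=1}^{m} e^{-(Ax)_j} \Big ) \qquad \text {(AdaBoost problem)}. \]

Both objectives can be cast in the “max structure” \(f(x) = \max _{z}\{\langle Ax, z\rangle - g(z)\}\) assumed in the paper, which is what makes the smoothing technique applicable (a sketch of the smoothing itself is given after the notes below).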

Notes

  1.

    The results presented in this paper were obtained in the Fall of 2012 and the Spring of 2013; the follow-up work [6] was prepared in the Summer of 2013.

  2.

    We coined the term Nesterov separability in honor of Yu. Nesterov’s seminal work on the smoothing technique [19], which is applicable to functions represented in the form (8). Nesterov did not study problems with row-sparse matrices A, as we do in this work, nor did he study parallel coordinate descent methods. However, he proposed the celebrated smoothing technique which we also employ in this paper (a brief sketch is given after these notes).

  3.

    This is the case for many common choices, including (i) \(\Psi _i(t) = \lambda _i|t|\), (ii) \(\Psi _i(t) = \lambda _i t^2\), and (iii) \(\Psi _i(t) = 0\) for \(t \in [a_i,b_i]\) and \(+\infty \) outside this interval (as well as the multivariate/block generalizations of these functions); a sketch of these closed-form updates is given after these notes. For more complicated functions \(\Psi _i(t)\), one may need to perform one-dimensional optimization, which costs O(1) for each i provided that we are content with an inexact solution. An analysis of PCDM in the \(\tau =1\) case in such an inexact setting can be found in Tappenden et al. [39], and it can be extended to the parallel setting.

  4.

    Note that \(h^{(S)}\) is different from \(h_{[S]}=\sum _{i \in S} U_i h^{(i)}\), which is a vector in \(\mathbb {R}^N\), although both \(h^{(S)}\) and \(h_{[S]}\) are composed of the blocks \(h^{(i)}\) for \(i \in S\) (a toy example is given after these notes).

  5.

    In fact, the proof of the former is essentially identical to the proof of (44), and (46) follows from (44) by choosing \(J_1=J_2=J\) and \(\theta _{ij} = 1\).

  6.

    This assumption is not restrictive, since \(\beta ' \ge 1\), \(n \ge \tau \), and \(\epsilon \) is usually small; however, it is technically needed.

  7.

    Without the assumption \(\beta '=\min \{\omega ,\tau \}\), the algorithm still converges, but with a proven complexity of \(O\big (1/(\epsilon \rho )\big )\) instead of \(O\big (1/\epsilon \log (1/(\epsilon \rho ))\big )\) [40]. In our experiments, we never encountered a problem when using the more efficient \(\tau \)-nice sampling (sketched after these notes), even in the non-strongly convex case. In fact, this weaker result may just be an artifact of the analysis.

  8.

    Some synchronization does take place from time to time for monitoring purposes.
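
To make Note 2 concrete, here is a brief sketch of the smoothing technique of [19] for a function in generic max form; the symbols Q, g, d and \(\sigma \) are our illustrative notation, and equation (8) of the paper is not reproduced here. Given

\[ f(x) = \max _{z \in Q} \big \{ \langle Ax, z\rangle - g(z) \big \}, \]

one defines, for a smoothing parameter \(\mu > 0\) and a prox-function d that is \(\sigma \)-strongly convex and nonnegative on Q,

\[ f_\mu (x) = \max _{z \in Q} \big \{ \langle Ax, z\rangle - g(z) - \mu \, d(z) \big \}. \]

The function \(f_\mu \) is convex and differentiable, satisfies \(f_\mu (x) \le f(x) \le f_\mu (x) + \mu \max _{z \in Q} d(z)\), and its gradient is Lipschitz continuous with constant \(\Vert A\Vert ^2/(\mu \sigma )\); the paper refines this global constant on the low-dimensional subspaces touched by a parallel coordinate descent step.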
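
As an illustration of the closed-form updates mentioned in Note 3, the minimal Python sketch below evaluates \(\mathrm {prox}_{\Psi _i}(v) = \arg \min _t \{\tfrac{1}{2}(t-v)^2 + \Psi _i(t)\}\) for the three examples listed there. The function names and the unit step size are our own choices and are not taken from the paper.

    import numpy as np

    def prox_abs(v, lam):
        # psi(t) = lam * |t|  ->  soft-thresholding
        return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

    def prox_square(v, lam):
        # psi(t) = lam * t**2  ->  shrinkage, from the optimality condition t - v + 2*lam*t = 0
        return v / (1.0 + 2.0 * lam)

    def prox_box(v, a, b):
        # psi(t) = 0 on [a, b] and +inf outside  ->  projection onto [a, b]
        return np.minimum(np.maximum(v, a), b)

In an actual coordinate step the quadratic term carries a coordinate-dependent weight (a Lipschitz constant times the step size), which in the first two cases simply rescales lam above.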
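
The notational distinction in Note 4 can be illustrated on a toy example (ours; the block sizes are arbitrary): \(h^{(S)}\) stacks only the selected blocks, whereas \(h_{[S]}=\sum _{i \in S} U_i h^{(i)}\) is a full-length vector that is zero outside the blocks in S.

    import numpy as np

    # Toy setting: N = 6 coordinates partitioned into n = 3 blocks of size 2.
    blocks = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
    h = np.arange(1.0, 7.0)   # h = (1, 2, 3, 4, 5, 6) in R^N
    S = [0, 2]                # sampled set of blocks

    # h^(S): only the selected blocks, stacked together -> [1. 2. 5. 6.]
    h_S = np.concatenate([h[blocks[i]] for i in S])

    # h_[S]: a vector in R^N keeping the selected blocks and zeroing the rest
    # -> [1. 2. 0. 0. 5. 6.]
    h_bracket_S = np.zeros_like(h)
    for i in S:
        h_bracket_S[blocks[i]] = h[blocks[i]]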
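
Note 7 refers to the \(\tau \)-nice sampling: at each iteration a set of \(\tau \) blocks is chosen uniformly at random among all subsets of size \(\tau \), and the corresponding blocks are updated in parallel. A minimal sketch (the function name is ours):

    import numpy as np

    def tau_nice_sampling(n, tau, rng=None):
        # Uniformly random subset of {0, ..., n-1} of size tau, without replacement.
        rng = np.random.default_rng() if rng is None else rng
        return rng.choice(n, size=tau, replace=False)

    # Example: with n = 10 blocks and tau = 4 processors, each iteration
    # updates the 4 blocks indexed by S.
    S = tau_nice_sampling(10, 4)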

References

  1. Bradley, J.K., Kyrola, A., Bickson, D., Guestrin, C.: Parallel coordinate descent for L1-regularized loss minimization. In: 28th International Conference on Machine Learning (2011)

  2. Bian, Y., Li, X., Liu, Y.: Parallel coordinate descent Newton for large-scale L1-regularized minimization. arXiv:1306.4080v1 (2013)

  3. Collins, M., Schapire, R.E., Singer, Y.: Logistic regression, AdaBoost and Bregman distances. Mach. Learn. 48(1–3), 253–285 (2002)

  4. Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Prog. 1–39 (2013)

  5. Dang, C.D., Lan, G.: Stochastic block mirror descent methods for nonsmooth and stochastic optimization. SIAM J. Opt. 25(2), 856–881 (2015)

  6. Fercoq, O.: Parallel coordinate descent for the AdaBoost problem. In: International Conference on Machine Learning and Applications—ICMLA ’13 (2013)

  7. Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Opt. 25(4), 1997–2023 (2015)

  8. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In: Computational Learning Theory, pp. 23–37. Springer (1995)

  9. Guyon, I., Gunn, S., Ben-Hur, A., Dror, G.: Result analysis of the NIPS 2003 feature selection challenge. Adv. Neural Inf. Process. Syst. 17, 545–552 (2004)

  10. Journée, M., Nesterov, Y., Richtárik, P., Sepulchre, R.: Generalized power method for sparse principal component analysis. J. Mach. Learn. Res. 11, 517–553 (2010)

  11. Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P.: Block-coordinate Frank-Wolfe optimization for structural SVMs. In: 30th International Conference on Machine Learning (2013)

  12. Leventhal, D., Lewis, A.S.: Randomized methods for linear constraints: convergence rates and conditioning. Math. Op. Res. 35(3), 641–654 (2010)

  13. Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res. 16, 285–322 (2015)

  14. Lu, Z., Xiao, L.: On the complexity analysis of randomized block-coordinate descent methods. Math. Prog. 152(1–2), 615–642 (2015)

  15. Mukherjee, I., Canini, K., Frongillo, R., Singer, Y.: Parallel boosting with momentum. In: Machine Learning and Knowledge Discovery in Databases (ECML PKDD). Lecture Notes in Computer Science, vol. 8188 (2013)

  16. Mukherjee, I., Rudin, C., Schapire, R.E.: The rate of convergence of AdaBoost. J. Mach. Learn. Res. 14(1), 2315–2347 (2013)

  17. Ma, J., Saul, L.K., Savage, S., Voelker, G.M.: Identifying suspicious URLs: an application of large-scale online learning. In: Proceedings of the 26th International Conference on Machine Learning, pp. 681–688. ACM (2009)

  18. Necoara, I., Clipici, D.: Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC. J. Process Control 23, 243–253 (2013)

  19. Nesterov, Y.: Smooth minimization of nonsmooth functions. Math. Prog. 103, 127–152 (2005)

  20. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Opt. 22(2), 341–362 (2012)

  21. Nesterov, Y.: Subgradient methods for huge-scale optimization problems. Math. Prog. 146(1), 275–297 (2014)

  22. Necoara, I., Nesterov, Y., Glineur, F.: Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest (2012)

  23. Necoara, I., Patrascu, A.: A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints. Comput. Opt. Appl. 57(2), 307–337 (2014)

  24. Palit, I., Reddy, C.K.: Scalable and parallel boosting with MapReduce. IEEE Trans. Knowl. Data Eng. 24(10), 1904–1916 (2012)

  25. Richtárik, P., Takáč, M.: Efficiency of randomized coordinate descent methods on minimization problems with a composite objective function. In: 4th Workshop on Signal Processing with Adaptive Sparse Structured Representations (2011)

  26. Richtárik, P., Takáč, M.: Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. In: Operations Research Proceedings, pp. 27–32. Springer (2012)

  27. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Prog. 144(2), 1–38 (2014)

  28. Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Prog. 156(1), 433–484 (2016)

  29. Richtárik, P., Takáč, M., Damla Ahipaşaoğlu, S.: Alternating maximization: unifying framework for 8 sparse PCA formulations and efficient parallel codes. arXiv:1212.4137 (2012)

  30. Ruszczyński, A.: On convergence of an augmented Lagrangian decomposition method for sparse convex optimization. Math. Op. Res. 20(3), 634–656 (1995)

  31. Schapire, R.E., Freund, Y.: Boosting: Foundations and Algorithms. The MIT Press (2012)

  32. Shalev-Shwartz, S., Tewari, A.: Stochastic methods for \(\ell _1\)-regularized loss minimization. J. Mach. Learn. Res. 12, 1865–1892 (2011)

  33. Shalev-Shwartz, S., Zhang, T.: Accelerated mini-batch stochastic dual coordinate ascent. In: Advances in Neural Information Processing Systems, pp. 378–385 (2013)

  34. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)

  35. Takáč, M., Bijral, A., Richtárik, P., Srebro, N.: Mini-batch primal and dual methods for SVMs. In: 30th International Conference on Machine Learning (2013)

  36. Telgarsky, M.: A primal-dual convergence analysis of boosting. J. Mach. Learn. Res. 13, 561–606 (2012)

  37. Tao, Q., Kong, K., Chu, D., Wu, G.: Stochastic coordinate descent methods for regularized smooth and nonsmooth losses. In: Machine Learning and Knowledge Discovery in Databases, pp. 537–552 (2012)

  38. Tappenden, R., Richtárik, P., Büke, B.: Separable approximations and decomposition methods for the augmented Lagrangian. Opt. Methods Softw. 30(3), 643–668 (2015)

  39. Tappenden, R., Richtárik, P., Gondzio, J.: Inexact coordinate descent: complexity and preconditioning. J. Opt. Theory Appl. 1–33 (2016)

  40. Tappenden, R., Takáč, M., Richtárik, P.: On the complexity of parallel coordinate descent. Opt. Methods Softw. 1–24 (2017)

Acknowledgements

The work of both authors was supported by the EPSRC grant EP/I017127/1 (Mathematics for Vast Digital Resources). The work of P.R. was also supported by the Centre for Numerical Algorithms and Intelligent Software (funded by EPSRC grant EP/G036136/1 and the Scottish Funding Council).

Author information

Correspondence to Olivier Fercoq.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Fercoq, O., Richtárik, P. (2019). Smooth Minimization of Nonsmooth Functions with Parallel Coordinate Descent Methods. In: Pintér, J.D., Terlaky, T. (eds) Modeling and Optimization: Theory and Applications. MOPTA 2017. Springer Proceedings in Mathematics & Statistics, vol 279. Springer, Cham. https://doi.org/10.1007/978-3-030-12119-8_4
