
Smooth Minimization of Nonsmooth Functions with Parallel Coordinate Descent Methods

  • Conference paper
  • In: Modeling and Optimization: Theory and Applications (MOPTA 2017)
  • Part of the book series: Springer Proceedings in Mathematics & Statistics (PROMS, volume 279)

Abstract

We study the performance of a family of randomized parallel coordinate descent methods for minimizing a nonsmooth nonseparable convex function. The problem class includes, as special cases, L1-regularized L1 regression and the minimization of the exponential loss (the “AdaBoost problem”). We assume that the input data defining the loss function are contained in a sparse \(m\times n\) matrix A with at most \(\omega \) nonzeros in each row, and that the objective function has a “max structure”, allowing us to smooth it. Our main contribution is the identification of parameters, given in closed form, that guarantee a parallelization speedup depending on basic quantities of the problem (such as its size and the number of processors). The theory relies on a fine study of the Lipschitz constant of the smoothed objective restricted to low-dimensional subspaces, and shows an increased acceleration for sparser problems.
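
For concreteness, the two special cases mentioned above can be written as follows. This is an illustrative sketch in our own notation (data matrix \(A \in \mathbb {R}^{m\times n}\), right-hand side b, regularization parameter \(\lambda > 0\)); in particular, the log-of-average form of the exponential loss is one common convention and is not taken verbatim from the paper:

\[ \min _{x \in \mathbb {R}^n} \; \Vert Ax - b\Vert _1 + \lambda \Vert x\Vert _1 \qquad \text {(L1-regularized L1 regression)}, \]

\[ \min _{x \in \mathbb {R}^n} \; \log \Big ( \frac{1}{m} \sum _{j=1}^{m} e^{-(Ax)_j} \Big ) \qquad \text {(AdaBoost problem)}. \]

Both objectives can be cast in the “max structure” \(f(x) = \max _{z}\{\langle Ax, z\rangle - g(z)\}\) assumed in the paper, which is what makes the smoothing technique applicable (a sketch of the smoothing itself is given after the notes below).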

Notes

  1.

    The results presented in this paper were obtained in the Fall of 2012 and the Spring of 2013; the follow-up work [6] was prepared in the Summer of 2013.

  2.

    We coined the term Nesterov separability in honor of Yu. Nesterov’s seminal work on the smoothing technique [19], which is applicable to functions represented in the form (8). Nesterov did not study problems with row-sparse matrices A, as we do in this work, nor did he study parallel coordinate descent methods. However, he proposed the celebrated smoothing technique which we also employ in this paper (a brief sketch is given after these notes).

  3.

    This is the case for many common choices, including (i) \(\Psi _i(t) = \lambda _i|t|\), (ii) \(\Psi _i(t) = \lambda _i t^2\), and (iii) \(\Psi _i(t) = 0\) for \(t \in [a_i,b_i]\) and \(+\infty \) outside this interval (as well as the multivariate/block generalizations of these functions); a sketch of these closed-form updates is given after these notes. For more complicated functions \(\Psi _i(t)\), one may need to perform one-dimensional optimization, which costs O(1) for each i provided that we are content with an inexact solution. An analysis of PCDM in the \(\tau =1\) case in such an inexact setting can be found in Tappenden et al. [39], and it can be extended to the parallel setting.

  4.

    Note that \(h^{(S)}\) is different from \(h_{[S]}=\sum _{i \in S} U_i h^{(i)}\), which is a vector in \(\mathbb {R}^N\), although both \(h^{(S)}\) and \(h_{[S]}\) are composed of the blocks \(h^{(i)}\) for \(i \in S\) (a toy example is given after these notes).

  5.

    In fact, the proof of the former is essentially identical to the proof of (44), and (46) follows from (44) by choosing \(J_1=J_2=J\) and \(\theta _{ij} = 1\).

  6.

    This assumption is not restrictive, since \(\beta ' \ge 1\), \(n \ge \tau \), and \(\epsilon \) is usually small; however, it is technically needed.

  7.

    Without the assumption \(\beta '=\min \{\omega ,\tau \}\), the algorithm still converges, but with a proven complexity of \(O\big (1/(\epsilon \rho )\big )\) instead of \(O\big (1/\epsilon \log (1/(\epsilon \rho ))\big )\) [40]. In our experiments, we never encountered a problem when using the more efficient \(\tau \)-nice sampling (sketched after these notes), even in the non-strongly convex case. In fact, this weaker result may just be an artifact of the analysis.

  8.

    Some synchronization does take place from time to time for monitoring purposes.
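
To make Note 2 concrete, here is a brief sketch of the smoothing technique of [19] for a function in generic max form; the symbols Q, g, d and \(\sigma \) are our illustrative notation, and equation (8) of the paper is not reproduced here. Given

\[ f(x) = \max _{z \in Q} \big \{ \langle Ax, z\rangle - g(z) \big \}, \]

one defines, for a smoothing parameter \(\mu > 0\) and a prox-function d that is \(\sigma \)-strongly convex and nonnegative on Q,

\[ f_\mu (x) = \max _{z \in Q} \big \{ \langle Ax, z\rangle - g(z) - \mu \, d(z) \big \}. \]

The function \(f_\mu \) is convex and differentiable, satisfies \(f_\mu (x) \le f(x) \le f_\mu (x) + \mu \max _{z \in Q} d(z)\), and its gradient is Lipschitz continuous with constant \(\Vert A\Vert ^2/(\mu \sigma )\); the paper refines this global constant on the low-dimensional subspaces touched by a parallel coordinate descent step.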
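
As an illustration of the closed-form updates mentioned in Note 3, the minimal Python sketch below evaluates \(\mathrm {prox}_{\Psi _i}(v) = \arg \min _t \{\tfrac{1}{2}(t-v)^2 + \Psi _i(t)\}\) for the three examples listed there. The function names and the unit step size are our own choices and are not taken from the paper.

    import numpy as np

    def prox_abs(v, lam):
        # psi(t) = lam * |t|  ->  soft-thresholding
        return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

    def prox_square(v, lam):
        # psi(t) = lam * t**2  ->  shrinkage, from the optimality condition t - v + 2*lam*t = 0
        return v / (1.0 + 2.0 * lam)

    def prox_box(v, a, b):
        # psi(t) = 0 on [a, b] and +inf outside  ->  projection onto [a, b]
        return np.minimum(np.maximum(v, a), b)

In an actual coordinate step the quadratic term carries a coordinate-dependent weight (a Lipschitz constant times the step size), which in the first two cases simply rescales lam above.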
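
The notational distinction in Note 4 can be illustrated on a toy example (ours; the block sizes are arbitrary): \(h^{(S)}\) stacks only the selected blocks, whereas \(h_{[S]}=\sum _{i \in S} U_i h^{(i)}\) is a full-length vector that is zero outside the blocks in S.

    import numpy as np

    # Toy setting: N = 6 coordinates partitioned into n = 3 blocks of size 2.
    blocks = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
    h = np.arange(1.0, 7.0)   # h = (1, 2, 3, 4, 5, 6) in R^N
    S = [0, 2]                # sampled set of blocks

    # h^(S): only the selected blocks, stacked together -> [1. 2. 5. 6.]
    h_S = np.concatenate([h[blocks[i]] for i in S])

    # h_[S]: a vector in R^N keeping the selected blocks and zeroing the rest
    # -> [1. 2. 0. 0. 5. 6.]
    h_bracket_S = np.zeros_like(h)
    for i in S:
        h_bracket_S[blocks[i]] = h[blocks[i]]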
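
Note 7 refers to the \(\tau \)-nice sampling: at each iteration a set of \(\tau \) blocks is chosen uniformly at random among all subsets of size \(\tau \), and the corresponding blocks are updated in parallel. A minimal sketch (the function name is ours):

    import numpy as np

    def tau_nice_sampling(n, tau, rng=None):
        # Uniformly random subset of {0, ..., n-1} of size tau, without replacement.
        rng = np.random.default_rng() if rng is None else rng
        return rng.choice(n, size=tau, replace=False)

    # Example: with n = 10 blocks and tau = 4 processors, each iteration
    # updates the 4 blocks indexed by S.
    S = tau_nice_sampling(10, 4)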

References

  1. Bradley, J.K., Kyrola, A., Bickson, D., Guestrin, C.: Parallel coordinate descent for L1-regularized loss minimization. In: 28th International Conference on Machine Learning (2011)

  2. Bian, Y., Li, X., Liu, Y.: Parallel coordinate descent Newton for large-scale L1-regularized minimization. arXiv:1306.4080v1 (2013)

  3. Collins, M., Schapire, R.E., Singer, Y.: Logistic regression, AdaBoost and Bregman distances. Mach. Learn. 48(1–3), 253–285 (2002)

  4. Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Prog. 1–39 (2013)

  5. Dang, C.D., Lan, G.: Stochastic block mirror descent methods for nonsmooth and stochastic optimization. SIAM J. Opt. 25(2), 856–881 (2015)

  6. Fercoq, O.: Parallel coordinate descent for the AdaBoost problem. In: International Conference on Machine Learning and Applications—ICMLA ’13 (2013)

  7. Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Opt. 25(4), 1997–2023 (2015)

  8. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In: Computational Learning Theory, pp. 23–37. Springer (1995)

  9. Guyon, I., Gunn, S., Ben-Hur, A., Dror, G.: Result analysis of the NIPS 2003 feature selection challenge. Adv. Neural Inf. Process. Syst. 17, 545–552 (2004)

  10. Journée, M., Nesterov, Y., Richtárik, P., Sepulchre, R.: Generalized power method for sparse principal component analysis. J. Mach. Learn. Res. 11, 517–553 (2010)

  11. Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P.: Block-coordinate Frank-Wolfe optimization for structural SVMs. In: 30th International Conference on Machine Learning (2013)

  12. Leventhal, D., Lewis, A.S.: Randomized methods for linear constraints: convergence rates and conditioning. Math. Op. Res. 35(3), 641–654 (2010)

  13. Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res. 16, 285–322 (2015)

  14. Lu, Z., Xiao, L.: On the complexity analysis of randomized block-coordinate descent methods. Math. Prog. 152(1–2), 615–642 (2015)

  15. Mukherjee, I., Canini, K., Frongillo, R., Singer, Y.: Parallel boosting with momentum. In: Machine Learning and Knowledge Discovery in Databases (ECML PKDD). Lecture Notes in Computer Science, vol. 8188 (2013)

  16. Mukherjee, I., Rudin, C., Schapire, R.E.: The rate of convergence of AdaBoost. J. Mach. Learn. Res. 14(1), 2315–2347 (2013)

  17. Ma, J., Saul, L.K., Savage, S., Voelker, G.M.: Identifying suspicious URLs: an application of large-scale online learning. In: Proceedings of the 26th International Conference on Machine Learning, pp. 681–688. ACM (2009)

  18. Necoara, I., Clipici, D.: Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC. J. Process Control 23, 243–253 (2013)

  19. Nesterov, Y.: Smooth minimization of nonsmooth functions. Math. Prog. 103, 127–152 (2005)

  20. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Opt. 22(2), 341–362 (2012)

  21. Nesterov, Y.: Subgradient methods for huge-scale optimization problems. Math. Prog. 146(1), 275–297 (2014)

  22. Necoara, I., Nesterov, Y., Glineur, F.: Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest (2012)

  23. Necoara, I., Patrascu, A.: A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints. Comput. Opt. Appl. 57(2), 307–337 (2014)

  24. Palit, I., Reddy, C.K.: Scalable and parallel boosting with MapReduce. IEEE Trans. Knowl. Data Eng. 24(10), 1904–1916 (2012)

  25. Richtárik, P., Takáč, M.: Efficiency of randomized coordinate descent methods on minimization problems with a composite objective function. In: 4th Workshop on Signal Processing with Adaptive Sparse Structured Representations (2011)

  26. Richtárik, P., Takáč, M.: Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. In: Operations Research Proceedings, pp. 27–32. Springer (2012)

  27. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Prog. 144(2), 1–38 (2014)

  28. Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Prog. 156(1), 433–484 (2016)

  29. Richtárik, P., Takáč, M., Damla Ahipaşaoğlu, S.: Alternating maximization: unifying framework for 8 sparse PCA formulations and efficient parallel codes. arXiv:1212.4137 (2012)

  30. Ruszczyński, A.: On convergence of an augmented Lagrangian decomposition method for sparse convex optimization. Math. Op. Res. 20(3), 634–656 (1995)

  31. Schapire, R.E., Freund, Y.: Boosting: Foundations and Algorithms. The MIT Press (2012)

  32. Shalev-Shwartz, S., Tewari, A.: Stochastic methods for \(\ell _1\)-regularized loss minimization. J. Mach. Learn. Res. 12, 1865–1892 (2011)

  33. Shalev-Shwartz, S., Zhang, T.: Accelerated mini-batch stochastic dual coordinate ascent. In: Advances in Neural Information Processing Systems, pp. 378–385 (2013)

  34. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)

  35. Takáč, M., Bijral, A., Richtárik, P., Srebro, N.: Mini-batch primal and dual methods for SVMs. In: 30th International Conference on Machine Learning (2013)

  36. Telgarsky, M.: A primal-dual convergence analysis of boosting. J. Mach. Learn. Res. 13, 561–606 (2012)

  37. Tao, Q., Kong, K., Chu, D., Wu, G.: Stochastic coordinate descent methods for regularized smooth and nonsmooth losses. In: Machine Learning and Knowledge Discovery in Databases, pp. 537–552 (2012)

  38. Tappenden, R., Richtárik, P., Büke, B.: Separable approximations and decomposition methods for the augmented Lagrangian. Opt. Methods Softw. 30(3), 643–668 (2015)

  39. Tappenden, R., Richtárik, P., Gondzio, J.: Inexact coordinate descent: complexity and preconditioning. J. Opt. Theory Appl. 1–33 (2016)

  40. Tappenden, R., Takáč, M., Richtárik, P.: On the complexity of parallel coordinate descent. Opt. Methods Softw. 1–24 (2017)

Acknowledgements

The work of both authors was supported by the EPSRC grant EP/I017127/1 (Mathematics for Vast Digital Resources). The work of P.R. was also supported by the Centre for Numerical Algorithms and Intelligent Software (funded by EPSRC grant EP/G036136/1 and the Scottish Funding Council).

Author information

Correspondence to Olivier Fercoq.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Fercoq, O., Richtárik, P. (2019). Smooth Minimization of Nonsmooth Functions with Parallel Coordinate Descent Methods. In: Pintér, J.D., Terlaky, T. (eds) Modeling and Optimization: Theory and Applications. MOPTA 2017. Springer Proceedings in Mathematics & Statistics, vol 279. Springer, Cham. https://doi.org/10.1007/978-3-030-12119-8_4
