Abstract
We study the performance of a family of randomized parallel coordinate descent methods for minimizing a nonsmooth, nonseparable convex function. The problem class includes, as special cases, L1-regularized L1 regression and the minimization of the exponential loss (the “AdaBoost problem”). We assume that the input data defining the loss function are contained in a sparse \(m\times n\) matrix A with at most \(\omega \) nonzeros in each row, and that the objective function has a “max structure”, allowing us to smooth it. Our main contribution is the identification of parameters, given by closed-form expressions, that guarantee a parallelization speedup depending on basic quantities of the problem (such as its size and the number of processors). The theory relies on a careful study of the Lipschitz constant of the smoothed objective restricted to low-dimensional subspaces, and shows that the acceleration increases with the sparsity of the problem.
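To fix ideas, the following minimal sketch (our own illustration, not the paper's algorithm verbatim) shows the smoothing of the absolute value arising in L1-regularized L1 regression and one \(\tau \)-nice parallel coordinate descent step on the smoothed loss; the function names and the parameters `mu`, `beta` and `w` are placeholders for the smoothing and step-size quantities whose closed-form values the paper derives.

```python
import numpy as np

def smoothed_abs(t, mu):
    # Nesterov smoothing of |t| = max_{|z| <= 1} z*t with prox term (mu/2) z^2,
    # giving a Huber-like approximation.
    return np.where(np.abs(t) <= mu, t ** 2 / (2 * mu), np.abs(t) - mu / 2)

def smoothed_abs_grad(t, mu):
    # Gradient of the smoothed absolute value: clip(t / mu, -1, 1).
    return np.clip(t / mu, -1.0, 1.0)

def pcdm_step(A, b, x, mu, tau, beta, w, rng):
    # One tau-nice parallel coordinate descent step on the smoothed loss
    # f_mu(x) = sum_j smoothed_abs((A x - b)_j, mu).
    S = rng.choice(A.shape[1], size=tau, replace=False)  # tau coordinates, uniformly without replacement
    g = smoothed_abs_grad(A @ x - b, mu)                  # "soft signs" of the residuals
    for i in S:                                           # independent updates -> parallelizable
        x[i] -= (A[:, i] @ g) / (beta * w[i])
    return x
```

All \(\tau \) selected coordinates are updated from the same residual vector, which is what makes the updates independent of one another and hence parallelizable.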
Notes
- 1.
The results presented in this paper were obtained in the Fall of 2012 and the Spring of 2013; the follow-up work [6] was prepared in the Summer of 2013.
- 2.
We coined the term Nesterov separability in honor of Yu. Nesterov’s seminal work on the smoothing technique [19], which is applicable to functions represented in the form (8). Nesterov did not study problems with row-sparse matrices A, as we do in this work, nor did he study parallel coordinate descent methods. However, he proposed the celebrated smoothing technique which we also employ in this paper.
- 3.
This holds in many common cases, including (i) \(\Psi _i(t) = \lambda _i|t|\), (ii) \(\Psi _i(t) = \lambda _i t^2\), and (iii) \(\Psi _i(t) = 0\) for \(t \in [a_i,b_i]\) and \(+\infty \) outside this interval (and the multivariate/block generalizations of these functions); an illustrative sketch of these closed-form updates is given after these notes. For more complicated functions \(\Psi _i(t)\), one may need to perform one-dimensional optimization, which costs O(1) for each i, provided that we are content with an inexact solution. An analysis of PCDM in the \(\tau =1\) case in such an inexact setting can be found in Tappenden et al. [39], and it can be extended to the parallel setting.
- 4.
Note that \(h^{(S)}\) is different from \(h_{[S]}=\sum _{i \in S} U_i h^{(i)}\), which is a vector in \(\mathbb {R}^N\), although both \(h^{(S)}\) and \(h_{[S]}\) are composed of blocks \(h^{(i)}\) for \(i \in S\).
- 5.
- 6.
This assumption is not restrictive, since \(\beta ' \ge 1\), \(n \ge \tau \), and \(\epsilon \) is usually small; however, it is technically needed.
- 7.
Without the assumption \(\beta '=\min \{\omega ,\tau \}\), the algorithm still converges, but with a proven complexity of \(O\big (1/(\epsilon \rho )\big )\) instead of \(O\big ((1/\epsilon )\log (1/(\epsilon \rho ))\big )\) [40]. In our experiments, we have never encountered a problem when using the more efficient \(\tau \)-nice sampling, even in the non-strongly convex case. In fact, this weaker result may just be an artifact of the analysis.
- 8.
Some synchronization does take place from time to time for monitoring purposes.
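As a concrete illustration of Note 3, here is a minimal sketch (in Python, with an ad hoc encoding of \(\Psi _i\) that is not the paper's notation) of the closed-form one-dimensional updates: soft-thresholding for \(\Psi _i(t)=\lambda _i|t|\), shrinkage for \(\Psi _i(t)=\lambda _i t^2\), and projection onto \([a_i,b_i]\) for the interval indicator; `L_i` stands in for the coordinate-wise step-size constant.

```python
import numpy as np

def coordinate_update(xi, grad_i, L_i, psi):
    # Solve  min_t  grad_i * t + (L_i / 2) * t**2 + psi(xi + t)  in closed form
    # for the three choices of psi from Note 3; psi is encoded as
    # ('l1', lam), ('l2', lam) or ('box', a, b).
    u = xi - grad_i / L_i                      # minimizer of the quadratic model alone
    if psi[0] == 'l1':                         # psi(t) = lam * |t|   -> soft-thresholding
        lam = psi[1]
        new = np.sign(u) * max(abs(u) - lam / L_i, 0.0)
    elif psi[0] == 'l2':                       # psi(t) = lam * t^2   -> shrinkage
        lam = psi[1]
        new = L_i * u / (L_i + 2.0 * lam)
    else:                                      # psi(t) = 0 on [a, b], +inf outside -> projection
        _, a, b = psi
        new = min(max(u, a), b)
    return new - xi                            # the coordinate step t

# Illustrative call: step = coordinate_update(xi=0.3, grad_i=1.2, L_i=2.0, psi=('l1', 0.5))
```

For functions \(\Psi _i\) outside these cases, the same one-dimensional subproblem would be solved numerically, possibly inexactly, as discussed in the note.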
References
Bradley, J.K., Kyrola, A., Bickson, D., Guestrin, C.: Parallel coordinate descent for L1-regularized loss minimization. In: 28th International Conference on Machine Learning (2011)
Bian, Y., Li, X., Liu, Y.: Parallel coordinate descent Newton for large-scale L1-regularized minimization. arXiv:1306.4080v1 (2013)
Collins, M., Schapire, R.E., Singer, Y.: Logistic regression, AdaBoost and Bregman distances. Mach. Learn. 48(1–3), 253–285 (2002)
Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Prog. 1–39 (2013)
Dang, C.D., Lan, G.: Stochastic block mirror descent methods for nonsmooth and stochastic optimization. SIAM J. Opt. 25(2), 856–881 (2015)
Fercoq, O.: Parallel coordinate descent for the AdaBoost problem. In: International Conference on Machine Learning and Applications—ICMLA ’13 (2013)
Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Opt. 25(4), 1997–2023 (2015)
Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In: Computational Learning Theory, pp. 23–37. Springer (1995)
Guyon, I., Gunn, S., Ben-Hur, A., Dror, G.: Result analysis of the NIPS 2003 feature selection challenge. Adv. Neural Inf. Process. Syst. 17, 545–552 (2004)
Journée, M., Nesterov, Y., Richtárik, P., Sepulchre, R.: Generalized power method for sparse principal component analysis. J. Mach. Learn. Res. 11, 517–553 (2010)
Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P.: Block-coordinate Frank-Wolfe optimization for structural SVMs. In: 30th International Conference on Machine Learning (2013)
Leventhal, D., Lewis, A.S.: Randomized methods for linear constraints: convergence rates and conditioning. Math. Op. Res. 35(3), 641–654 (2010)
Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res. 16(1), 285–322 (2015)
Liu, Z., Xiao, L.: On the complexity analysis of randomized block-coordinate descent methods. Math. Prog. 152(1–2), 615–642 (2015)
Mukherjee, I., Canini, K., Frongillo, R., Singer, Y.: Parallel boosting with momentum. In: Machine Learning and Knowledge Discovery in Databases, ECML. Lecture Notes in Computer Science, vol. 8188 (2013)
Mukherjee, I., Rudin, C., Schapire, R.E.: The rate of convergence of AdaBoost. J. Mach. Learn. Res. 14(1), 2315–2347 (2013)
Ma, J., Saul, L.K., Savage, S., Voelker, G.M.: Identifying suspicious URLs: an application of large-scale online learning. In: Proceedings of the 26th International Conference on Machine Learning, pp. 681–688. ACM (2009)
Necoara, I., Clipici, D.: Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC. J. Process Control 23, 243–253 (2013)
Nesterov, Y.: Smooth minimization of nonsmooth functions. Math. Prog. 103, 127–152 (2005)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Opt. 22(2), 341–362 (2012)
Nesterov, Y.: Subgradient methods for huge-scale optimization problems. Math. Prog. 146(1), 275–297 (2014)
Necoara, I., Nesterov, Y., Glineur, F.: Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest (2012)
Necoara, I., Patrascu, A.: A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints. Comput. Opt. Appl. 57(2), 307–337 (2014)
Palit, I., Reddy, C.K.: Scalable and parallel boosting with MapReduce. IEEE Trans. Knowl. Data Eng. 24(10), 1904–1916 (2012)
Richtárik, P., Takáč, M.: Efficiency of randomized coordinate descent methods on minimization problems with a composite objective function. In: 4th Workshop on Signal Processing with Adaptive Sparse Structured Representations, June 2011
Richtárik, P., Takáč, M.: Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. In: Operations Research Proceedings, pp. 27–32. Springer (2012)
Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Prog. 144(2), 1–38 (2014)
Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Prog. 156(1), 433–484 (2016)
Richtárik, P., Takáč, M., Damla Ahipaşaoğlu, S.: Alternating maximization: unifying framework for 8 sparse PCA formulations and efficient parallel codes. arXiv:1212.4137 (2012)
Ruszczyński, A.: On convergence of an augmented Lagrangian decomposition method for sparse convex optimization. Math. Op. Res. 20(3), 634–656 (1995)
Schapire, R.E., Freund, Y.: Boosting: Foundations and Algorithms. The MIT Press (2012)
Shalev-Shwartz, S., Tewari, A.: Stochastic methods for \(\ell _1\)-regularized loss minimization. J. Mach. Learn. Res. 12, 1865–1892 (2011)
Shalev-Shwartz, S., Zhang, T.: Accelerated mini-batch stochastic dual coordinate ascent. In: Advances in Neural Information Processing Systems, pp. 378–385 (2013)
Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)
Takáč, M., Bijral, A., Richtárik, P., Srebro, N.: Mini-batch primal and dual methods for SVMs. In: 30th International Conference on Machine Learning (2013)
Telgarsky, M.: A primal-dual convergence analysis of boosting. J. Mach. Learn. Res. 13, 561–606 (2012)
Tao, Q., Kong, K., Chu, D., Wu, G.: Stochastic coordinate descent methods for regularized smooth and nonsmooth losses. In: Machine Learning and Knowledge Discovery in Databases, pp. 537–552 (2012)
Tappenden, R., Richtárik, P., Büke, B.: Separable approximations and decomposition methods for the augmented Lagrangian. Opt. Methods Softw. 30(3), 643–668 (2015)
Tappenden, R., Richtárik, P., Gondzio, J.: Inexact coordinate descent: complexity and preconditioning. J. Opt. Theory Appl. 1–33 (2016)
Tappenden, R., Takáč, M., Richtárik, P.: On the complexity of parallel coordinate descent. Opt. Methods Softw. 1–24 (2017)
Acknowledgements
The work of both authors was supported by the EPSRC grant EP/I017127/1 (Mathematics for Vast Digital Resources). The work of P.R. was also supported by the Centre for Numerical Algorithms and Intelligent Software (funded by EPSRC grant EP/G036136/1 and the Scottish Funding Council).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Fercoq, O., Richtárik, P. (2019). Smooth Minimization of Nonsmooth Functions with Parallel Coordinate Descent Methods. In: Pintér, J.D., Terlaky, T. (eds) Modeling and Optimization: Theory and Applications. MOPTA 2017. Springer Proceedings in Mathematics & Statistics, vol 279. Springer, Cham. https://doi.org/10.1007/978-3-030-12119-8_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-12118-1
Online ISBN: 978-3-030-12119-8