Abstract
Subgradient methods (SM) have long been the preferred way to solve the large-scale Nondifferentiable Optimization problems arising from the solution of Lagrangian Duals (LD) of Integer Programs (IP). Although other methods can have better convergence rates in practice, SM have certain advantages that may make them competitive under the right conditions. Furthermore, SM have significantly progressed in recent years, and new versions have been proposed with better theoretical and practical performances in some applications. We computationally evaluate a large class of SM in order to assess whether these improvements carry over to the IP setting. For this we build a unified scheme that covers many of the SM proposed in the literature, comprising some often-overlooked features like projection and dynamic generation of variables. We fine-tune the many algorithmic parameters of the resulting large class of SM, and we test them on two different LDs of the Fixed-Charge Multicommodity Capacitated Network Design problem, in order to assess the impact of the characteristics of the problem on the optimal algorithmic choices. Our results show that, if extensive tuning is performed, SM can be competitive with more sophisticated approaches when the required tolerance is not too tight, which is the case when solving LDs of IPs.
References
Ahookhosh, M.: Optimal subgradient algorithms with application to large-scale linear inverse problems. Tech. rep., Optimization Online (2014)
Anstreicher, K., Wolsey, L.: Two “well-known” properties of subgradient optimization. Math. Program. 120(1), 213–220 (2009)
Astorino, A., Frangioni, A., Fuduli, A., Gorgone, E.: A nonmonotone proximal bundle method with (potentially) continuous step decisions. SIAM J. Optim. 23(3), 1784–1809 (2013)
Bacaud, L., Lemaréchal, C., Renaud, A., Sagastizábal, C.: Bundle methods in stochastic optimal power management: a disaggregated approach using preconditioners. Comput. Optim. Appl. 20, 227–244 (2001)
Bahiense, L., Maculan, N., Sagastizábal, C.: The volume algorithm revisited: relation with bundle methods. Math. Program. 94(1), 41–70 (2002)
Barahona, F., Anbil, R.: The volume algorithm: producing primal solutions with a subgradient method. Math. Program. 87(3), 385–399 (2000)
Beck, A., Teboulle, M.: Smoothing and first order methods: a unified framework. SIAM J. Optim. 22(2), 557–580 (2012)
Ben Amor, H., Desrosiers, J., Frangioni, A.: On the choice of explicit stabilizing terms in column generation. Discrete Appl. Math. 157(6), 1167–1184 (2009)
Bertsekas, D., Nedić, A.: Incremental subgradient methods for nondifferentiable optimization. SIAM J. Optim. 12(1), 109–138 (2001)
Borghetti, A., Frangioni, A., Lacalandra, F., Nucci, C.: Lagrangian heuristics based on disaggregated bundle methods for hydrothermal unit commitment. IEEE Trans. Power Syst. 18(1), 313–323 (2003)
Bot, R., Hendrich, C.: A variable smoothing algorithm for solving convex optimization problems. TOP 23, 124–150 (2014)
Brännlund, U.: A generalised subgradient method with relaxation step. Math. Program. 71, 207–219 (1995)
Briant, O., Lemaréchal, C., Meurdesoif, P., Michel, S., Perrot, N., Vanderbeck, F.: Comparison of bundle and classical column generation. Math. Program. 113(2), 299–344 (2008)
Camerini, P., Fratta, L., Maffioli, F.: On improving relaxation methods by modified gradient techniques. Math. Program. Study 3, 26–34 (1975)
Cappanera, P., Frangioni, A.: Symmetric and asymmetric parallelization of a cost-decomposition algorithm for multi-commodity flow problems. INFORMS J. Comput. 15(4), 369–384 (2003)
Censor, Y., Davidi, R., Herman, G., Schulte, R., Tetruashvili, L.: Projected subgradient minimization versus superiorization. J. Optim. Theory Appl. 160(3), 730–747 (2014)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)
Crainic, T.G., Frangioni, A., Gendron, B.: Multicommodity capacitated network design. In: Soriano, P., Sanso, B. (eds.) Telecommunications Network Planning, pp. 1–19. Kluwer Academic Publishers (1999)
Crainic, T., Frangioni, A., Gendron, B.: Bundle-based relaxation methods for multicommodity capacitated fixed charge network design problems. Discrete Appl. Math. 112, 73–99 (2001)
Crema, A., Loreto, M., Raydan, M.: Spectral projected subgradient with a momentum term for the Lagrangean dual approach. Comput. Oper. Res. 34, 3174–3186 (2007)
d’Antonio, G., Frangioni, A.: Convergence analysis of deflected conditional approximate subgradient methods. SIAM J. Optim. 20(1), 357–386 (2009)
du Merle, O., Goffin, J.L., Vial, J.P.: On improvements to the analytic center cutting plane method. Comput. Optim. Appl. 11, 37–52 (1998)
Feltenmark, S., Kiwiel, K.: Dual applications of proximal bundle methods, including Lagrangian relaxation of nonconvex problems. SIAM J. Optim. 10(3), 697–721 (2000)
Frangioni, A.: Solving semidefinite quadratic problems within nonsmooth optimization algorithms. Comput. Oper. Res. 21, 1099–1118 (1996)
Frangioni, A.: Generalized bundle methods. SIAM J. Optim. 13(1), 117–156 (2002)
Frangioni, A., Gallo, G.: A bundle type dual-ascent approach to linear multicommodity min cost flow problems. INFORMS J. Comput. 11(4), 370–393 (1999)
Frangioni, A., Gendron, B.: A stabilized structured Dantzig–Wolfe decomposition method. Math. Program. 140, 45–76 (2013)
Frangioni, A., Gorgone, E.: A library for continuous convex separable quadratic knapsack problems. Eur. J. Oper. Res. 229(1), 37–40 (2013)
Frangioni, A., Gorgone, E.: Generalized bundle methods for sum-functions with “easy” components: applications to multicommodity network design. Math. Program. 145(1), 133–161 (2014)
Frangioni, A., Lodi, A., Rinaldi, G.: New approaches for optimizing over the semimetric polytope. Math. Program. 104(2–3), 375–388 (2005)
Fumero, F.: A modified subgradient algorithm for Lagrangean relaxation. Comput. Oper. Res. 28(1), 33–52 (2001)
Geoffrion, A.: Lagrangian relaxation and its uses in integer programming. Math. Program. Study 2, 82–114 (1974)
Gondzio, J., González-Brevis, P., Munari, P.: New developments in the primal–dual column generation technique. Eur. J. Oper. Res. 224(1), 41–51 (2013)
Görtz, S., Klose, A.: A simple but usually fast branch-and-bound algorithm for the capacitated facility location problem. INFORMS J. Comput. 24(4), 597–610 (2012)
Guignard, M.: Efficient cuts in Lagrangean ‘relax-and-cut’ schemes. Eur. J. Oper. Res. 105, 216–223 (1998)
Held, M., Karp, R.: The traveling salesman problem and minimum spanning trees. Oper. Res. 18, 1138–1162 (1970)
Hiriart-Urruty, J.B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms II–Advanced Theory and Bundle Methods, Grundlehren Math. Wiss., vol. 306. Springer, New York (1993)
Ito, M., Fukuda, M.: A family of subgradient-based methods for convex optimization problems in a unifying framework. Tech. rep., Optimization Online (2014)
Jones, K., Lustig, I., Farwolden, J., Powell, W.: Multicommodity network flows: the impact of formulation on decomposition. Math. Program. 62, 95–117 (1993)
Kelley, J.: The cutting-plane method for solving convex programs. J. SIAM 8, 703–712 (1960)
Kiwiel, K.: Convergence of approximate and incremental subgradient methods for convex optimization. SIAM J. Optim. 14(3), 807–840 (2003)
Kiwiel, K., Goffin, J.: Convergence of a simple subgradient level method. Math. Program. 85(4), 207–211 (1999)
Kiwiel, K., Larsson, T., Lindberg, P.: The efficiency of ballstep subgradient level methods for convex optimization. Math. Oper. Res. 23, 237–254 (1999)
Lan, G., Zhou, Y.: Conditional gradient sliding for convex optimization. Technical report, University of Florida (2014)
Larsson, T., Patriksson, M., Strömberg, A.B.: Conditional subgradient optimization—theory and applications. Eur. J. Oper. Res. 88(2), 382–403 (1996)
Larsson, T., Patriksson, M., Strömberg, A.B.: Ergodic, primal convergence in dual subgradient schemes for convex programming. Math. Program. 86, 283–312 (1999)
Lemaréchal, C.: An extension of Davidon methods to nondifferentiable problems. In: Balinski, M., Wolfe, P. (eds.) Nondifferentiable Optimization, Mathematical Programming Study, vol. 3, pp. 95–109. North-Holland, Amsterdam (1975)
Lemaréchal, C., Renaud, A.: A geometric study of duality gaps, with applications. Math. Program. 90, 399–427 (2001)
Necoara, I., Suykens, J.: Application of a smoothing technique to decomposition in convex optimization. IEEE Trans. Autom. Control 53(11), 2674–2679 (2008)
Nedic, A., Bertsekas, D.: Incremental subgradient methods for nondifferentiable optimization. Math. Program. 120, 221–259 (2009)
Nemirovski, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)
Nesterov, Y.: Excessive gap technique in nonsmooth convex minimization. SIAM J. Optim. 16, 235–249 (2005)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103, 127–152 (2005)
Nesterov, Y.: Primal-dual subgradient methods for convex optimization. Math. Program. 120, 221–259 (2009)
Nesterov, Y.: Universal gradient methods for convex optimization problems. Math. Program. 152, 381–404 (2014)
Neto, E., De Pierro, A.: Incremental subgradients for constrained convex optimization: a unified framework and new methods. SIAM J. Optim. 20(3), 1547–1572 (2009)
Ouorou, A.: A proximal cutting plane method using Chebychev center for nonsmooth convex optimization. Math. Program. 119(2), 239–271 (2009)
Polyak, B.: Minimization of unsmooth functionals. Zh. Vychisl. Mat. Fiz 9(3), 509–521 (1969)
Sherali, B., Choi, B., Tuncbilek, C.: A variable target value method for nondifferentiable optimization. Oper. Res. Lett. 26, 1–8 (2000)
Sherali, B., Lim, C.: On embedding the volume algorithm in a variable target value method. Oper. Res. Lett. 32, 455–462 (2004)
Shor, N.: Minimization Methods for Nondifferentiable Functions. Springer, Berlin (1985)
Solodov, M., Zavriev, S.: Error stability properties of generalized gradient-type algorithms. J. Optim. Theory Appl. 98(3), 663–680 (1998)
Tseng, P.: Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program. 125, 263–295 (2010)
Wolfe, P.: A method of conjugate subgradients for minimizing nondifferentiable functions. In: Balinski, M., Wolfe, P. (eds.) Nondifferentiable Optimization, Mathematical Programming Study, vol. 3, pp. 145–173. North-Holland, Amsterdam (1975)
Acknowledgements
The first author acknowledges the contribution of the Italian Ministry for University and Research under the PRIN 2012 Project 2012JXB3YF “Mixed-Integer Nonlinear Optimization: Approaches and Applications”. The work of the second author has been supported by NSERC (Canada) under Grant 184122-09. The work of the third author has been supported by the Post-Doctoral Fellowship D.R. No 2718/201 (Regional Operative Program Calabria ESF 2007/2013) and the Interuniversity Attraction Poles Programme P7/36 “COMEX: combinatorial optimization metaheuristics & exact methods” of the Belgian Science Policy Office. All the authors gratefully acknowledge the contribution of the anonymous referees and of the editors of the journal to improving the initial version of the manuscript.
Additional information
The software that was reviewed as part of this submission has been issued the Digital Object Identifier doi:10.5281/zenodo.556738.
Appendix
We now describe all the details of the SM that we have tested, together with the results of the tuning phase. We remark that for some parameters it is nontrivial even to set a reasonable range of values. Our approach has been to select the initial range heuristically, and then test it: if the best value consistently ended up being at one extreme, this was taken as an indication that the interval should be enlarged accordingly. This hinges on the assumption that the behaviour of the algorithm is somewhat “monotonic” in the parameters; while this is not necessarily true, for the vast majority of parameters a “monotonic” behaviour has been verified experimentally, in that we almost never found a case where settings “far apart” provided better performances than those “in the middle.”
1.1 General parameters of SM
The following parameters are common to all variants of SM we tested, basically irrespective of the specific rules for choosing the stepsize and the deflection.
-
We denote by pr \(\subseteq \{ \, g_i , d_{i-1} , d_i \, \}\) the subset of vectors that are projected on the tangent cone \(T_i\) of \(\varLambda \) at \(\bar{\lambda }_i\); in all our tests, pr does not depend on the iteration. As already remarked, pr \(= \{ \, g_i , d_{i-1} , d_i \, \}\) makes no sense as \(T_i\) is convex. Furthermore, when no deflection is done \(d_i = g_i\) and therefore only pr \(= \{ \, g_i \, \}\) and pr \(= \emptyset \) make sense.
-
Regarding the order in which the stepsize and the deflection are chosen, we denote by sg \(\in \{\, \mathrm {drs}, \mathrm {dr0}, \mathrm {srs}, \mathrm {sr0}\,\}\) the four possible schemes, where “dr” and “sr” refer to the deflection-restricted and stepsize-restricted approach, respectively, while “s” and “0” refer to using or not the safe rule ((9) and (10), respectively). Of course, drs and dr0 only apply if deflection is performed.
-
We denote by \(\chi \) the parameter used to adjust the Lipschitz constant L in the incremental case, cf. (14), for which we tested the values \(\chi =\) 1e-v for v \(\in \{ 0, \ldots , 8 \}\).
-
For the AS, one crucial decision is how often separation is performed: doing it less often avoids some computations, but at the risk of ignoring possibly relevant information for too long. We performed separation after a fixed number \(s_1 \in \{0, 1\}\) of iterations, i.e., either not using the AS at all or separating at every iteration. Initial tests showed that larger values of \(s_1\) were not effective.
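As an illustration of the projection step governed by pr (first item above): when \(\varLambda = \mathbb {R}^n_+\), which is the case actually arising here (cf. the discussion around Lemma 2 below), the tangent cone \(T_i\) at \(\bar{\lambda }_i\) and the projection onto it have a simple componentwise form. The following sketch assumes this setting; the function name is ours.

```python
def project_tangent_cone(v, lam_bar, tol=1e-12):
    """Project v onto the tangent cone of the nonnegative orthant at lam_bar.

    For Lambda = R^n_+ the tangent cone at lam_bar is
    T = { d : d_j >= 0 for every j with lam_bar_j = 0 },
    so projecting just clips negative entries on the active coordinates
    (those with lam_bar_j = 0) and leaves the others untouched.
    """
    return [max(vj, 0.0) if lj <= tol else vj for vj, lj in zip(v, lam_bar)]
```

Any of \(g_i\), \(d_{i-1}\) or \(d_i\) selected in pr would be passed through this map before being used.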
1.2 Parameters of the SR
We now examine in detail the parameters of the three SR. Since all of them have the form (7), we are looking at different ways of determining \(\beta _i\) and \(f^{lev}_i\).
Polyak In this SR \(\beta _i\) and \(f^{lev}_i\) are kept fixed at all iterations. Here, we exploit the fact that in our application we know the “target value” \(\underline{f}\), and simply test the two cases \(f^{lev} \in \{ f_*, 10\%f_*\}\). As for the other parameter, we tested \(\beta \in \{\, 0.01 , 0.1 , 1 , 1.5 , 1.99 \, \}\).
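A minimal sketch of this SR, assuming (7) has the standard Polyak form with the record value in the numerator and the current direction in the denominator (the exact display of (7) is in the paper; the function name is ours):

```python
def polyak_stepsize(beta, f_rec, f_lev, d):
    """Polyak-type stepsize: nu = beta * (f_rec - f_lev) / ||d||^2.

    beta in (0, 2) is the fixed relaxation parameter, f_lev the fixed
    target level, f_rec the current record (best) value, and d the
    current (possibly deflected) direction.
    """
    norm_sq = sum(dj * dj for dj in d)
    return beta * (f_rec - f_lev) / norm_sq
```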
ColorTV This SR is based on the improvement \(\varDelta f = \bar{f}_{i-1} - f_i\) of f and the scalar product \(d_i g_i\) to estimate “how successful a step has been.” Note, however, that in deflection-restricted schemes (i.e., drs and dr0) \(d_i\) is not available and we use \(d_{i-1} g_{i}\) instead. Iteration i is marked as green if \(d_i g_i > \rho \) and \(\varDelta f \ge \rho \, \max \{|f_i^\mathrm{{rec}}|, 1\}\), as yellow if \(d_i g_i < \rho \) and \(\varDelta f \ge 0\), and as red otherwise, where \(\rho > 0\) is a tolerance. Intuitively, green is a “good” step possibly indicating that a larger \(\nu _i\) may have been preferable, whereas red is a “bad” one suggesting that \(\nu _i\) is too large. Given three parameters \(c_g, c_y\) and \(c_r\), and denoting by \(n_g, n_y\) and \(n_r\) the number of consecutive green, yellow and red iterations, respectively, \(\beta _i\) is updated as:
1. if \(n_g \ge c_g\) then set \(\beta _i = \min \{ \, 2 , 2\beta _{i-1} \, \}\);
2. if \(n_y \ge c_y\) then set \(\beta _i = \min \{ \, 2 , 1.1\beta _{i-1} \, \}\);
3. if \(n_r \ge c_r\) then set \(\beta _i = \max \{\) 5e-4 \(, 0.67\beta _{i-1} \, \}\);
4. if none of the above cases occurs, then set \(\beta _i = \beta _{i-1}\).
One important parameter is therefore the arbitrarily fixed value \(\beta _0\). Also, the SR includes a simple target-following scheme whereby if \(f_i \le 1.05 f_i^{lev}\) then \(f_i^{lev} = f_i - 0.05 f_i^{lev}\) (note that this never happens for \(f^{lev} = 10\%f_*\)). For this SR we kept \(\rho = \) 1e-6 fixed and we tested all combinations of \(\beta _0\in \{ \, 0.01 , 0.1 , 1 , 1.5 , 1.99 \, \}\), \(c_g\in \{ \, 1 , 10 , 50 \, \}\), \(c_y \in \{ \, 50 , 100 , 400 \, \}\), and \(c_r \in \{ \, 10 , 20 , 50 \, \}\).
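The four-case \(\beta \)-update above can be transcribed directly (a minimal sketch; the function name and interface are ours):

```python
def colortv_update(beta_prev, n_g, n_y, n_r, c_g, c_y, c_r):
    """ColorTV beta update: n_g / n_y / n_r are the numbers of consecutive
    green / yellow / red iterations, c_g / c_y / c_r the thresholds."""
    if n_g >= c_g:
        return min(2.0, 2.0 * beta_prev)    # many "good" steps: enlarge beta
    if n_y >= c_y:
        return min(2.0, 1.1 * beta_prev)    # many "yellow" steps: mild enlarge
    if n_r >= c_r:
        return max(5e-4, 0.67 * beta_prev)  # many "bad" steps: shrink beta
    return beta_prev                        # otherwise keep it unchanged
```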
FumeroTV This SR has a complex management of \(f_i^{lev}\) and \(\beta _i\), motivated by experimental considerations [31], that is subdivided into two distinct phases. The switch between the two is governed by an iteration counter r, which is increased each time there is no improvement in the function value. This counter is used to define the exponential function \(\sigma (r) = e^{-0.6933 (r/r_1)^{3.26}}\), where \(r_1\) is a parameter; note that \(\sigma (r_1) \approx 1/2\), which is how the two apparently weird numerical constants have been selected. The function \(\sigma \), which is decreasing in r, is used in two ways. The first is to determine the maximum number of non-improving steps, which is the smallest integer \(r_2\) such that \(\sigma _{\infty } \ge \sigma (r_2)\), where the threshold \(\sigma _{\infty } > 0\) is another parameter: given \(r_1\) and \(\sigma _{\infty }\), \(r_2\) can be obtained with a simple closed formula. The second is to construct at each iteration the value of \(f_i^{lev}\) as a convex combination of the known global lower bound \(\underline{f}\) (which, not incidentally, this algorithm, specifically tailored for IP, is the only one to use explicitly) and the current record value, as \(f^{lev}_i = \sigma (r) \underline{f} + (1-\sigma (r))f^{rec}_i\). In the first phase, when r varies, the threshold varies as well: as \(\sigma (r)\) decreases when r grows, \(f^{lev}_i\) is kept closer and closer to \(f^{rec}_i\) as the algorithm proceeds. In the second phase (\(r \ge r_2\)), where r is no longer updated, \(\sigma (r) = \sigma _{\infty }\). The procedure for updating r and \(\beta _i\), divided into the two phases according to whether r (initialized to 0) is smaller or larger than the threshold \(r_2\), uses four algorithmic parameters: a tolerance \(\delta > 0\), two integers \(\eta _1 \ge 1\) and \(\eta _2 \ge 1\), and the initial value \(\beta _0 \in (0, 2)\).
Similarly to ColorTV, the rule keeps a record value \(\bar{f}_i\) (similar, but not necessarily identical, to \(f^{rec}_i\)) and declares a “good” step whenever \(f_i \le \bar{f}_i - \delta \max \{|\bar{f}_i|,1\}\), in which case \(\bar{f}_i\) is updated to \(f_i\). In either phase, the number of consecutive “non-good” steps is counted. In the first phase, after \(\eta _2\) such steps r is increased by one, and \(\beta _i\) is updated as \(\beta _i = \beta _{i-1}/ (2\beta _{i-1} + 1)\). In the second phase r is no longer updated: after every “good” step \(\beta _i\) is doubled, whereas after \(\eta _1\) “non-good” steps \(\beta _i\) is halved. In the tuning phase we tested the following values for the parameters: \(\sigma _{\infty } \in \{\) 1e-4, 1e-3, 1e-2 \(\}\), \(\delta =\) 1e-6, \(r_1 \in \{\, 10 , 50 , 100 , 150 , 200 , 250 , 300 , 350 \,\}\), \(\beta _0 \in \{\, 0.01 , 0.1 , 1 , 1.5 , 1.99 \,\}\), \(\eta _1 \in \{\, 10 , 50 , 100 , 150 , 200 , 250 , 300 , 350 \,\}\), \(\eta _2 \in \{\, 10 , 50 , 100 , 150 , 200 \,\}\).
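The switching function \(\sigma \) and the level construction can be sketched as follows; the closed formula for \(r_2\) is our derivation from \(\sigma (r_2) \le \sigma _{\infty }\) (function names are ours):

```python
import math

def sigma(r, r1):
    """FumeroTV switching function; sigma(r1) ~ 1/2 and sigma decreases in r."""
    return math.exp(-0.6933 * (r / r1) ** 3.26)

def max_nonimproving(r1, sigma_inf):
    """Smallest integer r2 with sigma(r2) <= sigma_inf, by inverting sigma:
    (r2/r1)^3.26 >= ln(1/sigma_inf) / 0.6933."""
    return math.ceil(r1 * (math.log(1.0 / sigma_inf) / 0.6933) ** (1.0 / 3.26))

def target_level(r, r1, f_lb, f_rec):
    """f_lev as a convex combination of the global lower bound and the record."""
    s = sigma(r, r1)
    return s * f_lb + (1.0 - s) * f_rec
```

As r grows, `target_level` slides from the lower bound \(\underline{f}\) towards the record value, as described above.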
1.3 Parameters of the DR
We now describe in detail the two “complex” DR that we have tested (STSubgrad, where \(\alpha _i = 1 \Longrightarrow d_i = g_i\) and \(\bar{\lambda }_{i+1} = \lambda _{i+1}\) for all i, hardly needs any comment). Note that the selection of \(\bar{\lambda }_{i+1}\) is also done by the Deflection() object.
Primal–Dual The PDSM is based on a sophisticated convergence analysis aimed at obtaining optimal a-priori complexity estimates [54]. A basic assumption of PDSM is that \(\varLambda \) is endowed with a prox-function \(d(\lambda )\), and that one solves the modified form of (1)
restricted to a compact subset of the feasible region, where \(D\ge 0\) is a parameter. D is never directly used in the algorithm, except to optimally tune its parameters; hence, (21) can always be considered, provided that f has a minimum \(\lambda _*\). In particular, we take \(d(\lambda ) = \Vert \lambda - \lambda _0 \Vert ^2/2\), in which case \(D = \Vert \lambda _* - \lambda _0 \Vert ^2/2\). In general D is unknown; however, the parameter “\(t^*\)” in the stopping formulæ (12)/(13) is somehow related. Roughly speaking, \(t^*\) estimates how far at most one can move along a subgradient \(g_i \in \partial f(\lambda _i)\) when \(\lambda _i\) is an approximately optimal solution. The parameter, which is used in the same way by Bundle methods, is independent of the specific solution algorithm and has been individually tuned (which is simple enough, ex-post); hence, \(D = (t^*)^2L\) is a possible estimate. Yet, \(t^*\) is supposed to measure \(\Vert \lambda ^* - \lambda _i \Vert \) for a “good” \(\lambda _i\), whereas D requires the initial \(\lambda _0\), which typically is not “good”: hence, we introduced a further scaling factor \(F > 0\), i.e., took \(\gamma = (F\sqrt{L}) / (t^*\sqrt{2})\) for SA and \(\gamma = F / (t^*\sqrt{2L})\) for WA (cf. (25)), and we experimentally tuned F. In general one would expect \(F > 1\), and the results confirm this; however, to be on the safe side we tested all the values \(F \in \{\) 1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4 \(\}\). As suggested by one referee, we also tested using \(D = \Vert \lambda _* - \lambda _0 \Vert ^2/2\), with \(\lambda _*\) obtained by some previous optimization. The results clearly showed that the “exact” estimate of D did not always translate into the best performance; in particular, for the FR the results were always consistently worse, whereas for the KR the results were much worse for WA, and completely comparable (but not any better) for SA. This is why, in the end, we reported results with the tuned value of F.
For the rest, PDSM basically have no tunable parameters. It has to be remarked, however, that PDSM are not, at the outset, based on a simple recurrence of the form (5); rather, given two sequences of weights \(\{\upsilon _i\}\) and \(\{\omega _i\}\), the next iterate is obtained as
Yet, when \(\varLambda = \mathbb {R}^n\) (22) readily reduces to (5), as the following Lemma shows.
Lemma 1
Assume \(\varLambda = \mathbb {R}^n\), select \(d(\lambda ) = \Vert \lambda - \lambda _0 \Vert ^2/2\), and fix \(\bar{\lambda }_i = \lambda _0\) for all \(i \ge 0\) in (5). By defining \(\varDelta _i = \sum _{k=1}^i \upsilon _k\), the following DR and SR
are such that \(\lambda _{i+1}\) produced by (22) is the same produced by (5) and (8).
Proof
Under the assumptions, (22) is a strictly convex unconstrained quadratic problem, whose optimal solution is immediately available by the closed formula
This clearly is (5) under the SR in (23) provided that one shows that the DR in (23) produces
This is indeed easy to show by induction. For \(i = 1\) one immediately obtains \(d_1 = g_1\). For the inductive case, one just has to note that
to obtain
\(\square \)
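For concreteness, the computation behind Lemma 1 can be summarized as follows; this is our sketch of the forms that the displays (22)–(24) must take, given the surrounding text:

```latex
\[
  \lambda_{i+1} = \mathop{\mathrm{argmin}}_{\lambda \in \varLambda}
  \Big\{ \textstyle\sum_{k=1}^{i} \upsilon_k \langle g_k , \lambda \rangle
         + \omega_i \, d(\lambda) \Big\} .
\]
With $d(\lambda) = \Vert \lambda - \lambda_0 \Vert^2 / 2$ and
$\varLambda = \mathbb{R}^n$, setting the gradient to zero gives the
closed formula
\[
  \lambda_{i+1} = \lambda_0
                  - \tfrac{1}{\omega_i} \textstyle\sum_{k=1}^{i} \upsilon_k g_k ,
\]
which is (5) with $\bar{\lambda}_i = \lambda_0$ under the DR and SR
\[
  \alpha_i = \upsilon_i / \varDelta_i , \qquad
  \nu_i = \varDelta_i / \omega_i ,
\]
since by induction
$d_i = \alpha_i g_i + (1-\alpha_i) d_{i-1}
     = \textstyle\sum_{k=1}^{i} (\upsilon_k / \varDelta_i) g_k$,
whence $\lambda_0 - \nu_i d_i
      = \lambda_0 - \tfrac{1}{\omega_i}\sum_{k=1}^{i} \upsilon_k g_k$.
```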
Interestingly, the same happens if simple sign constraints \(\lambda \ge 0\) are present, which is what we actually have whenever \(\varLambda \ne \mathbb {R}^n\).
Lemma 2
If \(\varLambda = \mathbb {R}^n_+\), the same conclusion as in Lemma 1 holds after applying \(\mathrm{P}_{\varLambda }(\lambda _{i+1})\).
Proof
It is easy to see that the optimal solution of (22) with \(\varLambda = \mathbb {R}^n_+\) is equal to that with \(\varLambda = \mathbb {R}^n\), i.e. (24), projected over \(\mathbb {R}^n_+\). \(\square \)
Therefore, implementing the DR and the SR as in (23), and never updating \(\bar{\lambda }_i = \lambda _0\), allows us to fit PDSM in our general scheme. To choose \(\upsilon _i\) and \(\omega _i\) we follow the suggestions in [54]: the SA approach corresponds to \(\upsilon _i = 1\), and the WA one to \(\upsilon _i = 1 / \Vert g_i\Vert \). We then set \(\omega _i = \gamma \hat{\omega }_i\), where \(\gamma > 0\) is a constant, and \(\hat{\omega }_0 = \hat{\omega }_1 = 1\), \(\hat{\omega }_i = \hat{\omega }_{i-1} + 1 / \hat{\omega }_{i-1}\) for \(i \ge 2\), which implies \(\hat{\omega }_{i+1} = \sum _{k=0}^i 1 / \hat{\omega }_k\). The analysis in [54] suggests settings for \(\gamma \) that provide the best possible theoretical convergence, i.e.,
for the SA and WA, respectively, L being the Lipschitz constant of f.
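The weight sequences above can be sketched directly from the recurrences (the function name and interface are ours):

```python
import math

def pdsm_weights(n, gamma, grads=None):
    """PDSM weight sequences, as described above.

    omega_hat follows omega_hat[i] = omega_hat[i-1] + 1/omega_hat[i-1]
    with omega_hat[0] = omega_hat[1] = 1, which implies
    omega_hat[i+1] = sum_{k=0}^{i} 1/omega_hat[k].
    upsilon_i is 1 for the SA approach, 1/||g_i|| for WA (when the
    gradients `grads` are supplied).  Returns (upsilon, omega) with
    omega[i] = gamma * omega_hat[i].
    """
    omega_hat = [1.0, 1.0]
    for _ in range(2, n):
        omega_hat.append(omega_hat[-1] + 1.0 / omega_hat[-1])
    if grads is None:  # simple averages (SA)
        upsilon = [1.0] * n
    else:              # weighted averages (WA)
        upsilon = [1.0 / math.sqrt(sum(gj * gj for gj in g)) for g in grads]
    return upsilon, [gamma * w for w in omega_hat[:n]]
```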
Volume In this DR, \(\alpha _i\) is obtained as the optimal solution of a univariate quadratic problem. As suggested in [5], and somewhat differently from the original [6], we use exactly the “poorman’s form” of the master problem of the proximal Bundle method
where the linearization errors \(\sigma _i(\bar{\lambda }_i)\) and \(\epsilon _{i-1}(\bar{\lambda }_i)\) have been discussed in detail in Sect. 2.2. Note that we use the stepsize \(\nu _{i-1}\) of the previous iteration as the stability weight, since that term corresponds to the stepsize that one would take along the dual optimal solution in a Bundle method [3, 5, 25]. It may be worth remarking that the dual of (26)
where \(d = \lambda - \bar{\lambda }_i\), is closely tied to (22) in PDSM. The difference is that (27) uses two (approximate) subgradients, \(g_i\) and \(d_i\), whereas in (22) one uses only one (approximate) subgradient obtained as weighted average of the ones generated at previous iterations. Problem (26) is inexpensive, because without the constraint \(\alpha \in [0, 1]\) it has the closed-form solution
and thus one can obtain its optimal solution by simply projecting \(\alpha ^*_i\) over [0, 1]. However, as suggested in [5, 6], we instead chose \(\alpha _i\) in the more safeguarded way
where \(\tau _i\) is initialized to \(\tau _0\) and every \(\tau _p\) iterations is decreased by multiplying it by \(\tau _f < 1\), while ensuring that it remains larger than \(\tau _{\min }\). The choice of the stability center is also dictated by a parameter \(m > 0\), akin to that used in Bundle methods: if \(\bar{f}_i - f_{i+1} \ge m \max \{1,|f^{ref}_i|\}\) a Serious Step occurs and \(\bar{\lambda }_{i+1} = \lambda _{i+1}\); otherwise a Null Step takes place and \(\bar{\lambda }_{i+1} = \bar{\lambda }_i\). For the tuning phase we searched all the combinations of the following values for the above parameters: \(\tau _0\in \{\, 0.01 , 0.1 , 1 , 10 \,\}\), \(\tau _p \in \{\, 10 , 50 , 100 , 200 , 500 \,\}\), \(\tau _f \in \{\, 0.1 , 0.4 , 0.8 , 0.9 , 0.99 \,\}\), \(\tau _{\min } \in \{\) 1e-4, 1e-5 \(\}\), \(m \in \{\, 0.01 , 0.1 \,\}\).
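Assuming (26) has the quadratic form described in the text, \(\phi (\alpha ) = (\nu _{i-1}/2)\Vert \alpha g_i + (1-\alpha )d_{i-1}\Vert ^2 + \alpha \sigma _i + (1-\alpha )\epsilon _{i-1}\), its unconstrained minimizer and the projection over [0, 1] can be sketched as follows; the \(\tau \)-based safeguard itself is not reproduced, since its exact form is given in the paper's display (function name is ours):

```python
def volume_alpha(g, d_prev, sigma_i, eps_prev, nu_prev):
    """Unconstrained minimizer of
    phi(a) = (nu/2)*||a*g + (1-a)*d||^2 + a*sigma + (1-a)*eps,
    projected over [0, 1].  Setting phi'(a) = 0 gives
    a* = (nu * d.(d - g) + eps - sigma) / (nu * ||g - d||^2).
    """
    diff = [dj - gj for gj, dj in zip(g, d_prev)]            # d - g
    denom = nu_prev * sum(x * x for x in diff)               # nu * ||g - d||^2
    numer = nu_prev * sum(dj * x for dj, x in zip(d_prev, diff)) + eps_prev - sigma_i
    return min(1.0, max(0.0, numer / denom))
```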
1.4 Detailed results of the tuning phase
The tuning phase required substantial computational work, and a nontrivial analysis of the results. As discussed in Sect. 3.2, each SM configuration gave rise to an aggregated convergence graph. To select the best configurations, the graphs were visually inspected, and the ones corresponding to better overall convergence rates were selected. This usually was the configuration providing the best final gap for all instances. Occasionally, other configurations gave better results than the chosen one in the earlier stages of the algorithm on some subsets of the instances; usually the advantage was marginal at best, and only on a fraction of the cases, while the disadvantage in terms of final result was pronounced. In general it has always been possible to find “robust” settings that provided the best (or nearly best) gap at termination, while not being too far from the best gaps in all the other stages. Furthermore, although the total number of possible combinations was rather large, it turned out that only a relatively small set of parameters had a significant impact on the performance, and in most cases their effects were almost orthogonal to each other. This allowed us to effectively single out “robust” configurations for our test sets; for several of the parameters, the “optimal” choice has been unique across all instances, which may provide useful indications even for different problems.
For the sake of clarity and conciseness, in Tables 3 and 4, we report the chosen values of the parameters for FR and KR, respectively, briefly remarking about the effect of each parameter and their relationships. The behaviour of SM was pretty similar in the two cases \(\underline{f} = f_*\) and \(\underline{f} = 10\%f_*\); hence, the tables report the values for \(\underline{f} = f_*\), indicating in “[]” these for \(\underline{f} = 10\%f_*\) if they happen to be different. The tables focus on the combinations between the three SR and the two DR, plus the incremental case; the parameters of Primal–Dual variant are presented separately since the SR is combined with the DR.
Results for the FR. The results for FR are summarized in Table 3, except for those settings that are constantly optimal. In particular, STSubgrad and Incremental have better performances with \(\mathrm{{pr}} = \{ g_i \}\), irrespective of the SR. For Volume, instead, the optimal setting of \(\mathrm{{pr}}\) does depend on the SR, although \(\mathrm{{pr}} = \{ d_i \}\) and \(\mathrm{{pr}} = \{ d_{i-1} \}\) were hardly different. All the other parameters of Volume depend on the SR (although the stepsize-restricted scheme with no safe rule is often good), except \(\tau _{\min }\) and m, which are always best set to 1e-4 and 0.1, respectively. Another interesting observation is that, while Volume does have several parameters, they seem to operate quite independently of each other, as changing one of them always has a similar effect irrespective of the others. We also mention that for ColorTV the parameters \(\mathrm {c}_y\) and \(\mathrm {c}_r\) have little impact on the performance, whereas \(\mathrm {c}_g\) plays an important role and significantly influences the quality of the results. As for FumeroTV, \(\sigma _{\infty }\) and \(\eta _2\) have hardly any impact, and we arbitrarily set them to 1e-4 and 50, respectively.
In PDSM, the only crucial value is F, used to compute the optimal value of \(\gamma \) in (25). We found its best value to be 1e2 and 1e3 for SA and WA, respectively. The choice has a large impact on performances, which significantly worsen for values far from these.
Results for the KR. The best parameters for the KR are reported in Table 4. Although the best values are in general different from the FR, confirming the (unfortunate) need for problem-specific parameter tuning, similar observations as in that case can be made. For instance, for Volume, the parameters were still more or less independent of each other, and \(\tau _{\min }\) and m still had little impact, with the values 1e-4 and 0.1 still very adequate. For ColorTV, results are again quite stable when varying \(\mathrm {c}_y\). Yet, differences can be noted: for instance, for the FR \(\mathrm {c}_g\) is clearly the most significant parameter and dictates most of the performance variations, while for the KR the relationship between the two parameters \(\mathrm {c}_r\) and \(\mathrm {c}_g\) and the results is less clear. Similarly, for FumeroTV some settings are conserved: \(\sigma _{\infty }\) and \(\eta _2\) have very little effect and can be set to 1e-4 and 50, respectively. Other cases were different: for instance, the parameters \(\eta _1\), \(r_1\) and \(\beta _0\) were more independent of each other than in the FR.
The parameters of Primal–Dual proved to be quite independent of the underlying Lagrangian approach, with the best value of F still being 1e2 for SA and 1e3 for WA. This confirms the higher overall robustness of the approach.
We conclude the Appendix with a short table detailing which of the variants of SM that we tested have a formal proof of convergence, indicating the references wherein the proofs are given. The columns DR and SR, as usual, indicate which among the possible deflection and stepsize rules are adopted; an entry “any” means that the corresponding proof holds for all the rules. Moreover, PR, AS and IN stand for the strategies: (i) projection, (ii) active set and (iii) incremental, respectively.
Cite this article
Frangioni, A., Gendron, B. & Gorgone, E. On the computational efficiency of subgradient methods: a case study with Lagrangian bounds. Math. Prog. Comp. 9, 573–604 (2017). https://doi.org/10.1007/s12532-017-0120-7
Keywords
- Subgradient methods
- Nondifferentiable Optimization
- Computational analysis
- Lagrangian relaxation
- Multicommodity Network Design