On the computational efficiency of subgradient methods: a case study with Lagrangian bounds

  • Full Length Paper
  • Mathematical Programming Computation

Abstract

Subgradient methods (SM) have long been the preferred way to solve the large-scale Nondifferentiable Optimization problems arising from the solution of Lagrangian Duals (LD) of Integer Programs (IP). Although other methods can have better convergence rates in practice, SM have certain advantages that may make them competitive under the right conditions. Furthermore, SM have significantly progressed in recent years, and new versions have been proposed with better theoretical and practical performances in some applications. We computationally evaluate a large class of SM in order to assess whether these improvements carry over to the IP setting. For this we build a unified scheme that covers many of the SM proposed in the literature, including some often-overlooked features such as projection and dynamic generation of variables. We fine-tune the many algorithmic parameters of the resulting large class of SM, and we test them on two different LDs of the Fixed-Charge Multicommodity Capacitated Network Design problem, in order to assess the impact of the characteristics of the problem on the optimal algorithmic choices. Our results show that, if extensive tuning is performed, SM can be competitive with more sophisticated approaches when the tolerance required for the solution is not too tight, which is the case when solving LDs of IPs.


References

  1. Ahookhosh, M.: Optimal subgradient algorithms with application to large-scale linear inverse problems. Tech. rep., Optimization Online (2014)

  2. Anstreicher, K., Wolsey, L.: Two “well-known” properties of subgradient optimization. Math. Program. 120(1), 213–220 (2009)

  3. Astorino, A., Frangioni, A., Fuduli, A., Gorgone, E.: A nonmonotone proximal bundle method with (potentially) continuous step decisions. SIAM J. Optim. 23(3), 1784–1809 (2013)

  4. Bacaud, L., Lemaréchal, C., Renaud, A., Sagastizábal, C.: Bundle methods in stochastic optimal power management: a disaggregated approach using preconditioners. Comput. Optim. Appl. 20, 227–244 (2001)

  5. Bahiense, L., Maculan, N., Sagastizábal, C.: The volume algorithm revisited: relation with bundle methods. Math. Program. 94(1), 41–70 (2002)

  6. Barahona, F., Anbil, R.: The volume algorithm: producing primal solutions with a subgradient method. Math. Program. 87(3), 385–399 (2000)

  7. Beck, A., Teboulle, M.: Smoothing and first order methods: a unified framework. SIAM J. Optim. 22(2), 557–580 (2012)

  8. Ben Amor, H., Desrosiers, J., Frangioni, A.: On the choice of explicit stabilizing terms in column generation. Discrete Appl. Math. 157(6), 1167–1184 (2009)

  9. Bertsekas, D., Nedić, A.: Incremental subgradient methods for nondifferentiable optimization. SIAM J. Optim. 12(1), 109–138 (2001)

  10. Borghetti, A., Frangioni, A., Lacalandra, F., Nucci, C.: Lagrangian heuristics based on disaggregated bundle methods for hydrothermal unit commitment. IEEE Trans. Power Syst. 18(1), 313–323 (2003)

  11. Bot, R., Hendrich, C.: A variable smoothing algorithm for solving convex optimization problems. TOP 23, 124–150 (2014)

  12. Brännlund, U.: A generalised subgradient method with relaxation step. Math. Program. 71, 207–219 (1995)

  13. Briant, O., Lemaréchal, C., Meurdesoif, P., Michel, S., Perrot, N., Vanderbeck, F.: Comparison of bundle and classical column generation. Math. Program. 113(2), 299–344 (2008)

  14. Camerini, P., Fratta, L., Maffioli, F.: On improving relaxation methods by modified gradient techniques. Math. Program. Study 3, 26–34 (1975)

  15. Cappanera, P., Frangioni, A.: Symmetric and asymmetric parallelization of a cost-decomposition algorithm for multi-commodity flow problems. INFORMS J. Comput. 15(4), 369–384 (2003)

  16. Censor, Y., Davidi, R., Herman, G., Schulte, R., Tetruashvili, L.: Projected subgradient minimization versus superiorization. J. Optim. Theory Appl. 160(3), 730–747 (2014)

  17. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)

  18. Crainic, T.G., Frangioni, A., Gendron, B.: Multicommodity capacitated network design. In: Soriano, P., Sanso, B. (eds.) Telecommunications Network Planning, pp. 1–19. Kluwer Academic Publishers (1999)

  19. Crainic, T., Frangioni, A., Gendron, B.: Bundle-based relaxation methods for multicommodity capacitated fixed charge network design problems. Discrete Appl. Math. 112, 73–99 (2001)

  20. Crema, A., Loreto, M., Raydan, M.: Spectral projected subgradient with a momentum term for the Lagrangean dual approach. Comput. Oper. Res. 34, 3174–3186 (2007)

  21. d’Antonio, G., Frangioni, A.: Convergence analysis of deflected conditional approximate subgradient methods. SIAM J. Optim. 20(1), 357–386 (2009)

  22. du Merle, O., Goffin, J.L., Vial, J.P.: On improvements to the analytic center cutting plane method. Comput. Optim. Appl. 11, 37–52 (1998)

  23. Feltenmark, S., Kiwiel, K.: Dual applications of proximal bundle methods, including Lagrangian relaxation of nonconvex problems. SIAM J. Optim. 10(3), 697–721 (2000)

  24. Frangioni, A.: Solving semidefinite quadratic problems within nonsmooth optimization algorithms. Comput. Oper. Res. 21, 1099–1118 (1996)

  25. Frangioni, A.: Generalized bundle methods. SIAM J. Optim. 13(1), 117–156 (2002)

  26. Frangioni, A., Gallo, G.: A bundle type dual-ascent approach to linear multicommodity min cost flow problems. INFORMS J. Comput. 11(4), 370–393 (1999)

  27. Frangioni, A., Gendron, B.: A stabilized structured Dantzig–Wolfe decomposition method. Math. Program. 140, 45–76 (2013)

  28. Frangioni, A., Gorgone, E.: A library for continuous convex separable quadratic knapsack problems. Eur. J. Oper. Res. 229(1), 37–40 (2013)

  29. Frangioni, A., Gorgone, E.: Generalized bundle methods for sum-functions with “easy” components: applications to multicommodity network design. Math. Program. 145(1), 133–161 (2014)

  30. Frangioni, A., Lodi, A., Rinaldi, G.: New approaches for optimizing over the semimetric polytope. Math. Program. 104(2–3), 375–388 (2005)

  31. Fumero, F.: A modified subgradient algorithm for Lagrangean relaxation. Comput. Oper. Res. 28(1), 33–52 (2001)

  32. Geoffrion, A.: Lagrangian relaxation and its uses in integer programming. Math. Program. Study 2, 82–114 (1974)

  33. Gondzio, J., González-Brevis, P., Munari, P.: New developments in the primal–dual column generation technique. Eur. J. Oper. Res. 224(1), 41–51 (2013)

  34. Görtz, S., Klose, A.: A simple but usually fast branch-and-bound algorithm for the capacitated facility location problem. INFORMS J. Comput. 24(4), 597–610 (2012)

  35. Guignard, M.: Efficient cuts in Lagrangean ‘relax-and-cut’ schemes. Eur. J. Oper. Res. 105, 216–223 (1998)

  36. Held, M., Karp, R.: The traveling salesman problem and minimum spanning trees. Oper. Res. 18, 1138–1162 (1970)

  37. Hiriart-Urruty, J.B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms II–Advanced Theory and Bundle Methods, Grundlehren Math. Wiss., vol. 306. Springer, New York (1993)

  38. Ito, M., Fukuda, M.: A family of subgradient-based methods for convex optimization problems in a unifying framework. Tech. rep., Optimization Online (2014)

  39. Jones, K., Lustig, I., Farvolden, J., Powell, W.: Multicommodity network flows: the impact of formulation on decomposition. Math. Program. 62, 95–117 (1993)

  40. Kelley, J.: The cutting-plane method for solving convex programs. J. SIAM 8, 703–712 (1960)

  41. Kiwiel, K.: Convergence of approximate and incremental subgradient methods for convex optimization. SIAM J. Optim. 14(3), 807–840 (2003)

  42. Kiwiel, K., Goffin, J.: Convergence of a simple subgradient level method. Math. Program. 85(4), 207–211 (1999)

  43. Kiwiel, K., Larsson, T., Lindberg, P.: The efficiency of ballstep subgradient level methods for convex optimization. Math. Oper. Res. 23, 237–254 (1999)

  44. Lan, G., Zhou, Y.: Conditional gradient sliding for convex optimization. Tech. rep., University of Florida (2014)

  45. Larsson, T., Patriksson, M., Strömberg, A.B.: Conditional subgradient optimization—theory and applications. Eur. J. Oper. Res. 88(2), 382–403 (1996)

  46. Larsson, T., Patriksson, M., Strömberg, A.B.: Ergodic, primal convergence in dual subgradient schemes for convex programming. Math. Program. 86, 283–312 (1999)

  47. Lemaréchal, C.: An extension of Davidon methods to nondifferentiable problems. In: Balinski, M., Wolfe, P. (eds.) Nondifferentiable Optimization, Mathematical Programming Study, vol. 3, pp. 95–109. North-Holland, Amsterdam (1975)

  48. Lemaréchal, C., Renaud, A.: A geometric study of duality gaps, with applications. Math. Program. 90, 399–427 (2001)

  49. Necoara, I., Suykens, J.: Application of a smoothing technique to decomposition in convex optimization. IEEE Trans. Autom. Control 53(11), 2674–2679 (2008)

  50. Nedić, A., Bertsekas, D.: Incremental subgradient methods for nondifferentiable optimization. Math. Program. 120, 221–259 (2009)

  51. Nemirovski, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)

  52. Nesterov, Y.: Excessive gap technique in nonsmooth convex minimization. SIAM J. Optim. 16, 235–249 (2005)

  53. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103, 127–152 (2005)

  54. Nesterov, Y.: Primal-dual subgradient methods for convex optimization. Math. Program. 120, 221–259 (2009)

  55. Nesterov, Y.: Universal gradient methods for convex optimization problems. Math. Program. 152, 381–404 (2014)

  56. Neto, E., De Pierro, A.: Incremental subgradients for constrained convex optimization: a unified framework and new methods. SIAM J. Optim. 20(3), 1547–1572 (2009)

  57. Ouorou, A.: A proximal cutting plane method using Chebychev center for nonsmooth convex optimization. Math. Program. 119(2), 239–271 (2009)

  58. Polyak, B.: Minimization of unsmooth functionals. Zh. Vychisl. Mat. Fiz 9(3), 509–521 (1969)

  59. Sherali, B., Choi, B., Tuncbilek, C.: A variable target value method for nondifferentiable optimization. Oper. Res. Lett. 26, 1–8 (2000)

  60. Sherali, B., Lim, C.: On embedding the volume algorithm in a variable target value method. Oper. Res. Lett. 32, 455–462 (2004)

  61. Shor, N.: Minimization Methods for Nondifferentiable Functions. Springer, Berlin (1985)

  62. Solodov, M., Zavriev, S.: Error stability properties of generalized gradient-type algorithms. J. Optim. Theory Appl. 98(3), 663–680 (1998)

  63. Tseng, P.: Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program. 125, 263–295 (2010)

  64. Wolfe, P.: A method of conjugate subgradients for minimizing nondifferentiable functions. In: Balinski, M., Wolfe, P. (eds.) Nondifferentiable Optimization, Mathematical Programming Study, vol. 3, pp. 145–173. North-Holland, Amsterdam (1975)


Acknowledgements

The first author acknowledges the contribution of the Italian Ministry for University and Research under the PRIN 2012 Project 2012JXB3YF “Mixed-Integer Nonlinear Optimization: Approaches and Applications”. The work of the second author has been supported by NSERC (Canada) under Grant 184122-09. The work of the third author has been supported by the Post-Doctoral Fellowship D.R. No 2718/201 (Regional Operative Program Calabria ESF 2007/2013) and the Interuniversity Attraction Poles Programme P7/36 “COMEX: combinatorial optimization metaheuristics & exact methods” of the Belgian Science Policy Office. All the authors gratefully acknowledge the contribution of the anonymous referees and of the editors of the journal to improving the initial version of the manuscript.

Author information

Corresponding author

Correspondence to Antonio Frangioni.

Additional information

The software that was reviewed as part of this submission has been issued the Digital Object Identifier doi:10.5281/zenodo.556738.

Appendix

We now describe all the details of the SM that we have tested, together with the results of the tuning phase. We remark that for some parameters it is nontrivial even to set a reasonable range of values. Our approach has been to select the initial range heuristically, and then test it: if the best value consistently ended up being at one extreme, this was taken as an indication that the interval should be enlarged accordingly. This hinges on the assumption that the behaviour of the algorithm is somewhat “monotonic” in the parameters; while this is not necessarily true, for the vast majority of parameters a “monotonic” behaviour has been verified experimentally, in that we almost never found a case where settings “far apart” provided better performances than those “in the middle.”
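The range-enlargement heuristic just described can be sketched as follows. This is our own illustration of the idea, not code from the paper; the function and parameter names (`evaluate`, `grow`) are hypothetical, and a lower score is taken to mean a better configuration.

```python
def tune(evaluate, values, rounds=5, grow=10.0):
    """Pick the best value of one parameter, enlarging the tested range
    whenever the winner sits at one of its extremes (the "monotonicity"
    assumption discussed in the text)."""
    vals = sorted(values)
    for _ in range(rounds):
        best = min(vals, key=evaluate)           # lower score = better
        if best == vals[0]:
            vals = [vals[0] / grow] + vals       # widen range downwards
        elif best == vals[-1]:
            vals = vals + [vals[-1] * grow]      # widen range upwards
        else:
            return best                          # interior optimum found
    return best
```

For instance, starting from the range {1e-4, 1e-3, 1e-2} with an objective minimized at 1.5, the range is repeatedly extended upwards until the winner is interior.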

1.1 General parameters of SM

The following parameters are common to all variants of SM we tested, basically irrespective of the specific rules for choosing the stepsize and the deflection.

  • We denote by pr \(\subseteq \{ \, g_i , d_{i-1} , d_i \, \}\) the subset of vectors that are projected on the tangent cone \(T_i\) of \(\varLambda \) at \(\bar{\lambda }_i\); in all our tests, pr does not depend on the iteration. As already remarked, pr \(= \{ \, g_i , d_{i-1} , d_i \, \}\) makes no sense as \(T_i\) is convex. Furthermore, when no deflection is done \(d_i = g_i\) and therefore only pr \(= \{ \, g_i \, \}\) and pr \(= \emptyset \) make sense.

  • Regarding the order in which the stepsize and the deflection are chosen, we denote by sg \(\in \{\, \mathrm {drs}, \mathrm {dr0}, \mathrm {srs}, \mathrm {sr0}\,\}\) the four possible schemes, where “dr” and “sr” refer to the deflection-restricted and stepsize-restricted approach, respectively, while “s” and “0” refer to using or not the safe rule ((9) and (10), respectively). Of course, drs and dr0 only apply if deflection is performed.

  • We denote by \(\chi \) the parameter used to adjust the Lipschitz constant L in the incremental case, cf. (14), for which we tested the values \(\chi =\) 1e-v for v \(\in \{ 0, \ldots , 8 \}\).

  • For the AS, one crucial decision is how often separation is performed: doing it less often avoids some computations, but at the risk of ignoring possibly relevant information for too long. We performed separation after a fixed number \(s_1 \in \{0, 1\}\) of iterations, i.e., either not using the AS at all or separating at every iteration. Initial tests showed that larger values of \(s_1\) were not effective.
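As an illustration of the projection step in the first bullet, consider the common case \(\varLambda = \mathbb {R}^n_+\): the tangent cone at \(\bar{\lambda }_i\) leaves free the components where \(\bar{\lambda }_i > 0\), while components at the bound must be nonnegative in the direction. A minimal sketch (our own illustrative code, not the paper's implementation):

```python
def project_on_tangent_cone(d, lam, tol=1e-12):
    """Project direction d on the tangent cone of R^n_+ at the point lam:
    negative components are clipped to zero wherever lam is (numerically)
    at its lower bound, all other components are left untouched."""
    return [max(di, 0.0) if li <= tol else di for di, li in zip(d, lam)]
```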

1.2 Parameters of the SR

We now examine in detail the parameters of the three SR. Since all of them have the form (7), we are looking at different ways of determining \(\beta _i\) and \(f^{lev}_i\).

Polyak In this SR \(\beta _i\) and \(f^{lev}_i\) are kept fixed at all iterations. Here, we exploit the fact that in our application we know the “target value” \(\underline{f}\), and simply test the two cases \(f^{lev} \in \{ f_*, 10\%f_*\}\). As for the other parameter, we tested \(\beta \in \{\, 0.01 , 0.1 , 1 , 1.5 , 1.99 \, \}\).
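Assuming the classical Polyak formula \(\nu _i = \beta (f_i - f^{lev}) / \Vert d_i \Vert ^2\) as the fixed-parameter instance of the general SR (7), the rule is a one-liner; the sketch below is our illustration:

```python
def polyak_stepsize(beta, f_i, f_lev, d_i):
    """Classical Polyak stepsize: nu = beta * (f_i - f_lev) / ||d_i||^2.
    In theory beta in (0, 2) and f_lev <= f_* guarantee convergence."""
    norm2 = sum(dk * dk for dk in d_i)
    return beta * (f_i - f_lev) / norm2
```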

ColorTV This SR is based on the improvement \(\varDelta f = \bar{f}_{i-1} - f_i\) of f and the scalar product \(d_i g_i\) to estimate “how successful a step has been.” Note, however, that in deflection-restricted schemes (i.e., drs and dr0) \(d_i\) is not available and we use \(d_{i-1} g_{i}\) instead. Iteration i is marked as green if \(d_i g_i > \rho \) and \(\varDelta f \ge \rho \, \max \{|f_i^\mathrm{{rec}}|, 1\}\), as yellow if \(d_i g_i < \rho \) and \(\varDelta f \ge 0\), and as red otherwise, where \(\rho > 0\) is a tolerance. Intuitively, green is a “good” step possibly indicating that a larger \(\nu _i\) may have been preferable, whereas red is a “bad” one suggesting that \(\nu _i\) is too large. Given three parameters \(c_g, c_y\) and \(c_r\), and denoting by \(n_g, n_y\) and \(n_r\) the number of consecutive green, yellow and red iterations, respectively, \(\beta _i\) is updated as:

  1. if \(n_g \ge c_g\) then set \(\beta _i = \min \{ \, 2 , 2\beta _{i-1} \, \}\);

  2. if \(n_y \ge c_y\) then set \(\beta _i = \min \{ \, 2 , 1.1\beta _{i-1} \, \}\);

  3. if \(n_r \ge c_r\) then set \(\beta _i = \max \{\) 5e-4 \(, 0.67\beta _{i-1} \, \}\);

  4. if none of the above cases occurs, set \(\beta _i = \beta _{i-1}\).

One important parameter is therefore the arbitrarily fixed value \(\beta _0\). Also, the SR includes a simple target-following scheme whereby if \(f_i \le 1.05 f_i^{lev}\) then \(f_i^{lev} = f_i - 0.05 f_i^{lev}\) (note that this never happens for \(f^{lev} = 10\%f_*\)). For this SR we kept \(\rho = \) 1e-6 fixed and we tested all combinations of \(\beta _0\in \{ \, 0.01 , 0.1 , 1 , 1.5 , 1.99 \, \}\), \(c_g\in \{ \, 1 , 10 , 50 \, \}\), \(c_y \in \{ \, 50 , 100 , 400 \, \}\), and \(c_r \in \{ \, 10 , 20 , 50 \, \}\).
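The four update rules above translate directly into code; the following sketch is our illustration, with the classification into green/yellow/red assumed to be done elsewhere, and with the hard bounds 5e-4 and 2 on \(\beta \) taken from the rules:

```python
def colortv_update(beta, n_g, n_y, n_r, c_g, c_y, c_r):
    """Update beta from the counts n_g/n_y/n_r of consecutive
    green/yellow/red iterations and the thresholds c_g/c_y/c_r."""
    if n_g >= c_g:
        return min(2.0, 2.0 * beta)      # enough green steps: enlarge
    if n_y >= c_y:
        return min(2.0, 1.1 * beta)      # enough yellow steps: enlarge mildly
    if n_r >= c_r:
        return max(5e-4, 0.67 * beta)    # enough red steps: shrink
    return beta                          # otherwise leave beta unchanged
```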

FumeroTV This SR has a complex management of \(f_i^{lev}\) and \(\beta _i\), motivated by experimental considerations [31], that is subdivided into two distinct phases. The switch between the two is governed by an iteration counter r, which is increased each time there is no improvement in the function value. This counter is used to define the exponential function \(\sigma (r) = e^{-0.6933 (r/r_1)^{3.26}}\), where \(r_1\) is a parameter; note that \(\sigma (r_1) \approx 1/2\), which is how the two apparently weird numerical constants have been selected. The function \(\sigma \), which is decreasing in r, is used in two ways. The first is to determine the maximum number of non-improving steps, which is the smallest integer \(r_2\) such that \(\sigma _{\infty } \ge \sigma (r_2)\), where the threshold \(\sigma _{\infty } > 0\) is another parameter: given \(r_1\) and \(\sigma _{\infty }\), \(r_2\) can be obtained with a simple closed formula. The second is to construct at each iteration the value of \(f_i^{lev}\) as a convex combination of the known global lower bound \(\underline{f}\) (which, not incidentally, this algorithm, the only one specifically tailored for IP, is also the only one to use explicitly) and the current record value, as \(f^{lev}_i = \sigma (r) \underline{f} + (1-\sigma (r))f^{rec}_i\). In the first phase, when r varies, the level varies as well: as \(\sigma (r)\) decreases when r grows, \(f^{lev}_i\) is kept closer and closer to \(f^{rec}_i\) as the algorithm proceeds. In the second phase (\(r \ge r_2\)), where r is no longer updated, \(\sigma (r) = \sigma _{\infty }\). The procedure for updating r and \(\beta _i\) uses four algorithmic parameters: a tolerance \(\delta > 0\), two integer numbers \(\eta _1\) and \(\eta _2 \ge 1\), and the initial value \(\beta _0 \in (0, 2)\). The procedure is divided into two phases, according to whether the iteration counter r (initialized to 0) is smaller or larger than the threshold \(r_2\). 
Similarly to ColorTV, the rule keeps a record value \(\bar{f}_i\) (similar, but not necessarily identical, to \(f^{rec}_i\)) and declares a “good” step whenever \(f_i \le \bar{f}_i - \delta \max \{|\bar{f}_i|,1\}\), in which case \(\bar{f}\) is updated to \(f_i\). In either phase, the number of consecutive “non-good” steps is counted. In the first phase, after \(\eta _2\) such steps r is increased by one, and \(\beta _i\) is updated as \(\beta _i = \beta _{i-1}/ (2\beta _{i-1} + 1)\). In the second phase r is no longer updated: after every “good” step \(\beta _i\) is doubled, whereas after \(\eta _1\) “non-good” steps \(\beta _i\) is halved. In the tuning phase we tested the following values for the parameters: \(\sigma _{\infty } \in \{\) 1e-4, 1e-3, 1e-2 \(\}\), \(\delta =\) 1e-6, \(r_1 \in \{\, 10 , 50 , 100 , 150 , 200 , 250 , 300 , 350 \,\}\), \(\beta _0 \in \{\, 0.01 , 0.1 , 1 , 1.5 , 1.99 \,\}\), \(\eta _1 \in \{\, 10 , 50 , 100 , 150 , 200 , 250 , 300 , 350 \,\}\), \(\eta _2 \in \{\, 10 , 50 , 100 , 150 , 200 \,\}\).
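The function \(\sigma \), the closed formula for \(r_2\) (obtained by inverting \(\sigma \)), and the level computation can be sketched as follows; the code is our illustration, not the authors':

```python
import math

def sigma(r, r1):
    """Exponential switch function; sigma(r1) ~ 1/2 by construction,
    which is how the constants 0.6933 and 3.26 were selected."""
    return math.exp(-0.6933 * (r / r1) ** 3.26)

def r2_threshold(sigma_inf, r1):
    """Smallest integer r2 with sigma(r2) <= sigma_inf: invert sigma,
    i.e. r2 = ceil(r1 * (ln(1/sigma_inf)/0.6933)^(1/3.26))."""
    return math.ceil(r1 * (math.log(1.0 / sigma_inf) / 0.6933) ** (1 / 3.26))

def f_level(r, r1, f_lb, f_rec):
    """Level as a convex combination of the global lower bound f_lb and
    the record value f_rec; it tracks f_rec more closely as r grows."""
    s = sigma(r, r1)
    return s * f_lb + (1.0 - s) * f_rec
```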

1.3 Parameters of the DR

We now describe in detail the two “complex” DR that we have tested (STSubgrad, where \(\alpha _i = 1 \Longrightarrow d_i = g_i\) and \(\bar{\lambda }_{i+1} = \lambda _{i+1}\) for all i, hardly needs any comment). Note that the selection of \(\bar{\lambda }_{i+1}\) is also done by the Deflection() object.

Primal–Dual The PDSM is based on a sophisticated convergence analysis aimed at obtaining optimal a-priori complexity estimates [54]. A basic assumption of PDSM is that \(\varLambda \) is endowed with a prox-function \(d(\lambda )\), and that one solves the modified form of (1)

$$\begin{aligned} \min \{ \, f(\lambda ) \,:\, d(\lambda )\le D , \lambda \in \varLambda \,\} \end{aligned}$$
(21)

restricted upon a compact subset of the feasible region, where \(D\ge 0\) is a parameter. D is never directly used in the algorithm, except to optimally tune its parameters; hence, (21) can always be considered if f has a minimum \(\lambda _*\). In particular, we take \(d(\lambda ) = \Vert \lambda - \lambda _0 \Vert ^2/2\), in which case \(D = \Vert \lambda _* - \lambda _0 \Vert ^2/2\). In general D is unknown; however, the parameter “\(t^*\)” in the stopping formulæ (12)/(13) is somehow related. Roughly speaking, \(t^*\) estimates how far at most one can move along a subgradient \(g_i \in \partial f(\lambda _i)\) when \(\lambda _i\) is an approximately optimal solution. The parameter, which is used in the same way by Bundle methods, is independent of the specific solution algorithm and has been individually tuned (which is simple enough, ex-post); hence, \(D = (t^*)^2L\) is a possible estimate. Yet, \(t^*\) is supposed to measure \(\Vert \lambda ^* - \lambda _i \Vert \) for a “good” \(\lambda _i\), whereas D requires the initial \(\lambda _0\), which typically is not “good”: hence, we introduced a further scaling factor \(F > 0\), i.e., took \(\gamma = (F\sqrt{L}) / (t^*\sqrt{2})\) for SA and \(\gamma = F / (t^*\sqrt{2L})\) for WA (cf. (25)), and we experimentally tuned F. In general one would expect \(F > 1\), and the results confirm this; however, to be on the safe side we tested all the values \(F \in \{\) 1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4 \(\}\). As suggested by one Referee we also tested using \(D = \Vert \lambda _* - \lambda _0 \Vert ^2/2\), with \(\lambda _*\) obtained by some previous optimization. The results clearly showed that the “exact” estimate of D did not always translate into the best performance; in particular, for the FR the results were always consistently worse, whereas for the KR the results were much worse for WA, and completely comparable (but not any better) for SA. 
This is why in the end we reported results with the tuned value of F.

For the rest, PDSM basically have no tunable parameters. It has to be remarked, however, that PDSM are not, at the outset, based on a simple recurrence of the form (5); rather, given two sequences of weights \(\{\upsilon _i\}\) and \(\{\omega _i\}\), the next iterate is obtained as

$$\begin{aligned} \textstyle \lambda _{i+1} = {\mathrm{argmin}} \big \{ \, \lambda \sum _{k=1}^i \upsilon _k g_k + \omega _i d(\lambda ) \,:\, \lambda \in \varLambda \, \big \} \;\;. \end{aligned}$$
(22)

Yet, when \(\varLambda = \mathbb {R}^n\) (22) readily reduces to (5), as the following Lemma shows.

Lemma 1

Assume \(\varLambda = \mathbb {R}^n\), select \(d(\lambda ) = \Vert \lambda - \lambda _0 \Vert ^2/2\), and fix \(\bar{\lambda }_i = \lambda _0\) for all \(i \ge 0\) in (5). By defining \(\varDelta _i = \sum _{k=1}^i \upsilon _k\), the following DR and SR

$$\begin{aligned} \alpha _i = \upsilon _i / \varDelta _i \;\; (\in [0,1]) \qquad \text{ and }\qquad \nu _i = \varDelta _i / \omega _i \end{aligned}$$
(23)

are such that \(\lambda _{i+1}\) produced by (22) is the same produced by (5) and (8).

Proof

Under the assumptions, (22) is a strictly convex unconstrained quadratic problem, whose optimal solution is immediately available by the closed formula

$$\begin{aligned} \textstyle \lambda _{i+1} = \lambda _0 - (1/\omega _i) \sum _{k=1}^i \upsilon _k g_k. \end{aligned}$$
(24)

This clearly is (5) under the SR in (23) provided that one shows that the DR in (23) produces

$$\begin{aligned} \textstyle d_i = \left( \sum _{k=1}^i \upsilon _k g_k\right) / \varDelta _i. \end{aligned}$$

This is indeed easy to show by induction. For \(i = 1\) one immediately obtains \(d_1 = g_1\). For the inductive case, one just has to note that

$$\begin{aligned} 1 - \frac{\upsilon _{i+1}}{\varDelta _{i+1}} = \frac{\varDelta _{i+1} - \upsilon _{i+1}}{\varDelta _{i+1}} = \frac{\varDelta _i}{\varDelta _{i+1}} \end{aligned}$$

to obtain

$$\begin{aligned} d_{i+1} = \alpha _{i+1} g_{i+1} {+} (1-\alpha _{i+1}) d_i {=} \frac{\upsilon _{i+1}}{\varDelta _{i+1}}g_{i+1} + \frac{\varDelta _i}{\varDelta _{i+1}} \frac{\sum _{k=1}^i \upsilon _k g_k}{\varDelta _i} =\frac{1}{\varDelta _{i+1}} \sum _{k=1}^{i+1}\upsilon _k g_k. \quad \end{aligned}$$

\(\square \)
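Lemma 1 can also be checked numerically: with the DR of (23), the deflected direction obtained by the recursion \(d_i = \alpha _i g_i + (1 - \alpha _i) d_{i-1}\) must equal the \(\upsilon \)-weighted average \(\sum _{k} \upsilon _k g_k / \varDelta _i\). A small self-contained check (illustrative code; vectors are plain lists):

```python
def weighted_average_direction(upsilons, gs):
    """Run the recursion d_i = alpha_i g_i + (1 - alpha_i) d_{i-1}
    with alpha_i = upsilon_i / Delta_i; by Lemma 1 the result equals
    sum_k upsilon_k g_k / Delta_i."""
    d, delta = None, 0.0
    for u, g in zip(upsilons, gs):
        delta += u
        alpha = u / delta
        d = g[:] if d is None else [alpha * gi + (1 - alpha) * di
                                    for gi, di in zip(g, d)]
    return d
```

For instance, with weights (1, 2, 3) and subgradients (1, 0), (0, 1), (1, 1), the recursion returns (4/6, 5/6), i.e., exactly the weighted average.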

Interestingly, the same happens if simple sign constraints \(\lambda \ge 0\) are present, which is what we actually have whenever \(\varLambda \ne \mathbb {R}^n\).

Lemma 2

If \(\varLambda = \mathbb {R}^n_+\), the same conclusion as in Lemma 1 holds after the projection \(\mathrm{P}_{\varLambda }(\lambda _{i+1})\).

Proof

It is easy to see that the optimal solution of (22) with \(\varLambda = \mathbb {R}^n_+\) is equal to that with \(\varLambda = \mathbb {R}^n\), i.e. (24), projected over \(\mathbb {R}^n_+\). \(\square \)

Therefore, implementing the DR and the SR as in (23), and never updating \(\bar{\lambda }_i = \lambda _0\), allows us to fit PDSM in our general scheme. To choose \(\upsilon _i\) and \(\omega _i\) we follow the suggestions in [54]: the SA approach corresponds to \(\upsilon _i = 1\), and the WA one to \(\upsilon _i = 1 / \Vert g_i\Vert \). We then set \(\omega _i = \gamma \hat{\omega }_i\), where \(\gamma > 0\) is a constant, and \(\hat{\omega }_0 = \hat{\omega }_1 = 1\), \(\hat{\omega }_i = \hat{\omega }_{i-1} + 1 / \hat{\omega }_{i-1}\) for \(i \ge 2\), which implies \(\hat{\omega }_{i+1} = \sum _{k=0}^i 1 / \hat{\omega }_k\). The analysis in [54] suggests settings for \(\gamma \) that provide the best possible theoretical convergence, i.e.,

$$\begin{aligned} \textstyle \gamma = L/\sqrt{2 D} \qquad \text{ and }\qquad \gamma = 1/\sqrt{2 D} \;\; , \end{aligned}$$
(25)

for the SA and WA, respectively, L being the Lipschitz constant of f.
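The recurrence for \(\hat{\omega }_i\) and the identity \(\hat{\omega }_{i+1} = \sum _{k=0}^i 1/\hat{\omega }_k\) are easy to verify in a few lines (our illustrative sketch):

```python
def omega_hats(n):
    """Generate hat_omega_0, ..., hat_omega_n with hat_omega_0 =
    hat_omega_1 = 1 and hat_omega_i = hat_omega_{i-1} + 1/hat_omega_{i-1}
    for i >= 2; then hat_omega_{i+1} = sum_{k=0}^{i} 1/hat_omega_k."""
    w = [1.0, 1.0]
    for _ in range(2, n + 1):
        w.append(w[-1] + 1.0 / w[-1])
    return w[:n + 1]
```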

Volume In this DR, \(\alpha _i\) is obtained as the optimal solution of a univariate quadratic problem. As suggested in [5], and somewhat differently from the original [6], we use exactly the “poorman’s form” of the master problem of the proximal Bundle method

$$\begin{aligned} \min \big \{ \; \nu _{i-1} \left\| \alpha g_i + (1 - \alpha ) d_{i-1} \right\| ^2/2 + \alpha \sigma _i(\bar{\lambda }_i) + (1 - \alpha )\epsilon _{i-1}(\bar{\lambda }_i) \;:\; \alpha \in [0, 1]\; \big \}\nonumber \\ \end{aligned}$$
(26)

where the linearization errors \(\sigma _i(\bar{\lambda }_i)\) and \(\epsilon _{i-1}(\bar{\lambda }_i)\) have been discussed in detail in Sect. 2.2. Note that we use the stepsize \(\nu _{i-1}\) of the previous iteration as the stability weight, since that term corresponds to the stepsize that one would take along the dual optimal solution in a Bundle method [3, 5, 25]. It may be worth remarking that the dual of (26)

$$\begin{aligned} \textstyle \min \big \{ \, \max \{ \, g_i d - \sigma _i(\bar{\lambda }_i) , d_{i-1} d - \epsilon _{i-1}(\bar{\lambda }_i) \, \} + \Vert d \Vert ^2/(2\nu _{i-1}) \,\big \} \;\; , \end{aligned}$$
(27)

where \(d = \lambda - \bar{\lambda }_i\), is closely tied to (22) in PDSM. The difference is that (27) uses two (approximate) subgradients, \(g_i\) and \(d_i\), whereas in (22) one uses only one (approximate) subgradient obtained as weighted average of the ones generated at previous iterations. Problem (26) is inexpensive, because without the constraint \(\alpha \in [0, 1]\) it has the closed-form solution

$$\begin{aligned} \alpha ^*_i \;=\; \frac{\epsilon _{i-1}(\bar{\lambda }_i) - \sigma _i(\bar{\lambda }_i) - \nu _{i-1} d_{i-1}(g_i - d_{i-1})}{\nu _{i-1} \Vert g_i - d_{i-1}\Vert ^2} \;\; , \end{aligned}$$

and thus one can obtain its optimal solution by simply projecting \(\alpha ^*_i\) over [0, 1]. However, as suggested in [5, 6] we rather chose \(\alpha _i\) in the more safeguarded way

$$\begin{aligned} \alpha _i = \left| \begin{array}{ll} \alpha _{i-1} / 10 &{} \text { if } \alpha ^*_i \le \mathtt{1e-8}\\ \min \{\tau _i,1.0\} &{} \text { if } \alpha ^*_i \ge 1\\ \alpha ^*_i &{} \text { otherwise } \end{array}\right. \end{aligned}$$

where \(\tau _i\) is initialized to \(\tau _0\), and every \(\tau _p\) iterations it is decreased by multiplying it by \(\tau _f < 1\), while ensuring that it remains larger than \(\tau _{\min }\). The choice of the stability center is also dictated by a parameter \(m > 0\), akin to that used in Bundle methods: if \(\bar{f}_i - f_{i+1} \ge m \max \{1,|f^{ref}_i|\}\) a Serious Step occurs and \(\bar{\lambda }_{i+1} = \lambda _{i+1}\), otherwise a Null Step takes place and \(\bar{\lambda }_{i+1} = \bar{\lambda }_i\). For the tuning phase we have searched all the combinations of the following values for the above parameters: \(\tau _0\in \{\, 0.01 , 0.1 , 1 , 10 \,\}\), \(\tau _p \in \{\, 10 , 50 , 100 , 200 , 500 \,\}\), \(\tau _f \in \{\, 0.1 , 0.4 , 0.8 , 0.9 , 0.99 \,\}\), \(\tau _{\min } \in \{\) 1e-4, 1e-5 \(\}\), \(m \in \{\, 0.01 , 0.1 \,\}\).
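The closed-form minimizer \(\alpha ^*_i\) of (26) and the safeguarded choice of \(\alpha _i\) can be sketched as follows (our illustration; vectors are plain lists, and the argument names are ours):

```python
def volume_alpha(eps_prev, sigma_cur, nu_prev, g, d_prev, alpha_prev, tau):
    """Unconstrained minimizer alpha* of the quadratic (26) via its
    closed formula, followed by the safeguarded choice of alpha."""
    diff = [gi - di for gi, di in zip(g, d_prev)]          # g_i - d_{i-1}
    norm2 = sum(x * x for x in diff)                       # ||g_i - d_{i-1}||^2
    dot = sum(di * x for di, x in zip(d_prev, diff))       # d_{i-1}(g_i - d_{i-1})
    alpha_star = (eps_prev - sigma_cur - nu_prev * dot) / (nu_prev * norm2)
    # safeguarded choice, as in the rule displayed above
    if alpha_star <= 1e-8:
        return alpha_prev / 10.0
    if alpha_star >= 1.0:
        return min(tau, 1.0)
    return alpha_star
```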

1.4 Detailed results of the tuning phase

The tuning phase required substantial computational work and a nontrivial analysis of the results. As discussed in Sect. 3.2, each SM configuration gave rise to an aggregated convergence graph. To select the best configurations, the graphs were visually inspected, and those with the best overall convergence rate were selected; usually this was the configuration providing the best final gap on all instances. Occasionally, other configurations outperformed the chosen one in the earlier stages of the algorithm on some subsets of the instances; however, the advantage was marginal at best, and only on a fraction of the cases, while the disadvantage in terms of final result was pronounced. In general it was always possible to find “robust” settings that provided the best (or nearly best) gap at termination, while remaining not too far from the best gaps at all the other stages. Furthermore, although the total number of possible combinations was rather large, only a relatively small set of parameters turned out to have a significant impact on performance, and in most cases their effects were almost orthogonal to each other. This allowed us to effectively single out “robust” configurations for our test sets; for several of the parameters, the “optimal” choice was unique across all instances, which may provide useful indications even for different problems.
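To give a sense of the size of the search space, the parameter grid listed above can be enumerated with a few lines of Python. This is a hypothetical reconstruction for illustration only; the variable names are ours, not taken from the paper's code.

```python
from itertools import product

# Grids for tau_0, tau_p, tau_f, tau_min and m as listed in the text
grid = {
    "tau_0":   [0.01, 0.1, 1, 10],
    "tau_p":   [10, 50, 100, 200, 500],
    "tau_f":   [0.1, 0.4, 0.8, 0.9, 0.99],
    "tau_min": [1e-4, 1e-5],
    "m":       [0.01, 0.1],
}

# Every combination tried in the tuning phase: 4 * 5 * 5 * 2 * 2 = 400
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 400
```

Each of the 400 resulting configurations would then be run on every instance to produce the aggregated convergence graphs described above.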

For the sake of clarity and conciseness, Tables 3 and 4 report the chosen parameter values for FR and KR, respectively, with brief remarks about the effect of each parameter and their relationships. The behaviour of SM was quite similar in the two cases \(\underline{f} = f_*\) and \(\underline{f} = 10\%f_*\); hence, the tables report the values for \(\underline{f} = f_*\), indicating in “[ ]” those for \(\underline{f} = 10\%f_*\) whenever they differ. The tables focus on the combinations of the three SR and the two DR, plus the incremental case; the parameters of the Primal–Dual variant are presented separately, since there the SR is combined with the DR.

Table 3 Optimal parameters for the Flow Relaxation
Table 4 Optimal parameters for the Knapsack Relaxation

Results for the FR. The results for FR are summarized in Table 3, except for those settings that are uniformly optimal. In particular, STSubgrad and Incremental perform better with \(\mathrm{{pr}} = \{ g_i \}\), irrespective of the SR. For Volume, instead, the optimal setting of \(\mathrm{{pr}}\) does depend on the SR, although \(\mathrm{{pr}} = \{ d_i \}\) and \(\mathrm{{pr}} = \{ d_{i-1} \}\) made hardly any difference. All the other parameters of Volume depend on the SR (although the stepsize-restricted scheme with no safe rule is often good), except \(\tau _{\min }\) and m, which are always best set to 1e-4 and 0.1, respectively. Another interesting observation is that, while Volume does have several parameters, they seem to operate quite independently of each other: changing one of them always has a similar effect irrespective of the others. We also mention that for ColorTV the parameters \(\mathrm {c}_y\) and \(\mathrm {c}_r\) have little impact on performance, whereas \(\mathrm {c}_g\) plays an important role and significantly influences the quality of the results. As for FumeroTV, \(\sigma _{\infty }\) and \(\eta _2\) have hardly any impact, and we arbitrarily set them to 1e-4 and 50, respectively.

In PDSM, the only crucial value is F, used to compute the optimal value of \(\gamma \) in (25). We found its best value to be 1e2 for SA and 1e3 for WA. The choice has a large impact on performance, which worsens significantly for values far from these.

Results for the KR. The best parameters for the KR are reported in Table 4. Although the best values generally differ from those for the FR, confirming the (unfortunate) need for problem-specific parameter tuning, similar observations can be made. For instance, for Volume the parameters were still more or less independent of each other, and \(\tau _{\min }\) and m still had little impact, with the values 1e-4 and 0.1 remaining very adequate. For ColorTV, results are again quite stable as \(\mathrm {c}_y\) varies. Yet, differences can be noted: for instance, for the FR \(\mathrm {c}_g\) is clearly the most significant parameter and dictates most of the performance variations, while for the KR the relationship between the two parameters \(\mathrm {c}_r\) and \(\mathrm {c}_g\) and the results is less clear. Similarly, for FumeroTV some settings carry over: \(\sigma _{\infty }\) and \(\eta _2\) have very little effect and can be set to 1e-4 and 50, respectively. Other settings differ: for instance, the parameters \(\eta _1\), \(r_1\) and \(\beta _0\) were more independent of each other than in the FR.

The parameters of Primal–Dual proved to be quite independent of the underlying Lagrangian approach, with the best value of F still being 1e2 for SA and 1e3 for WA. This confirms the higher overall robustness of the approach.

We conclude the Appendix with a short table detailing which of the tested SM variants have a formal proof of convergence, indicating the references wherein the proofs are given. The columns DR and SR, as usual, indicate which of the possible deflection and stepsize rules are adopted; an entry “any” means that the corresponding proof holds for all the rules. Moreover, PR, AS and IN stand, respectively, for the strategies: (i) projection, (ii) active set and (iii) incremental.

Cite this article

Frangioni, A., Gendron, B. & Gorgone, E. On the computational efficiency of subgradient methods: a case study with Lagrangian bounds. Math. Prog. Comp. 9, 573–604 (2017). https://doi.org/10.1007/s12532-017-0120-7
