Abstract
Subgradient methods (SM) have long been the preferred way to solve the large-scale Nondifferentiable Optimization problems arising from the solution of Lagrangian Duals (LD) of Integer Programs (IP). Although other methods can have better convergence rates in practice, SM have certain advantages that may make them competitive under the right conditions. Furthermore, SM have significantly progressed in recent years, and new versions have been proposed with better theoretical and practical performances in some applications. We computationally evaluate a large class of SM in order to assess whether these improvements carry over to the IP setting. For this we build a unified scheme that covers many of the SM proposed in the literature, comprising some often-overlooked features like projection and dynamic generation of variables. We fine-tune the many algorithmic parameters of the resulting large class of SM, and we test them on two different LDs of the Fixed-Charge Multicommodity Capacitated Network Design problem, in order to assess the impact of the characteristics of the problem on the optimal algorithmic choices. Our results show that, if extensive tuning is performed, SM can be competitive with more sophisticated approaches when the required tolerance is not too tight, which is the case when solving LDs of IPs.
References
Ahookhosh, M.: Optimal subgradient algorithms with application to large-scale linear inverse problems. Tech. rep., Optimization Online (2014)
Anstreicher, K., Wolsey, L.: Two “well-known” properties of subgradient optimization. Math. Program. 120(1), 213–220 (2009)
Astorino, A., Frangioni, A., Fuduli, A., Gorgone, E.: A nonmonotone proximal bundle method with (potentially) continuous step decisions. SIAM J. Optim. 23(3), 1784–1809 (2013)
Bacaud, L., Lemaréchal, C., Renaud, A., Sagastizábal, C.: Bundle methods in stochastic optimal power management: a disaggregated approach using preconditioners. Comput. Optim. Appl. 20, 227–244 (2001)
Bahiense, L., Maculan, N., Sagastizábal, C.: The volume algorithm revisited: relation with bundle methods. Math. Program. 94(1), 41–70 (2002)
Barahona, F., Anbil, R.: The volume algorithm: producing primal solutions with a subgradient method. Math. Program. 87(3), 385–399 (2000)
Beck, A., Teboulle, M.: Smoothing and first order methods: a unified framework. SIAM J. Optim. 22(2), 557–580 (2012)
Ben Amor, H., Desrosiers, J., Frangioni, A.: On the choice of explicit stabilizing terms in column generation. Discrete Appl. Math. 157(6), 1167–1184 (2009)
Bertsekas, D., Nedić, A.: Incremental subgradient methods for nondifferentiable optimization. SIAM J. Optim. 12(1), 109–138 (2001)
Borghetti, A., Frangioni, A., Lacalandra, F., Nucci, C.: Lagrangian heuristics based on disaggregated bundle methods for hydrothermal unit commitment. IEEE Trans. Power Syst. 18(1), 313–323 (2003)
Bot, R., Hendrich, C.: A variable smoothing algorithm for solving convex optimization problems. TOP 23, 124–150 (2014)
Brännlund, U.: A generalised subgradient method with relaxation step. Math. Program. 71, 207–219 (1995)
Briant, O., Lemaréchal, C., Meurdesoif, P., Michel, S., Perrot, N., Vanderbeck, F.: Comparison of bundle and classical column generation. Math. Program. 113(2), 299–344 (2008)
Camerini, P., Fratta, L., Maffioli, F.: On improving relaxation methods by modified gradient techniques. Math. Program. Study 3, 26–34 (1975)
Cappanera, P., Frangioni, A.: Symmetric and asymmetric parallelization of a cost-decomposition algorithm for multi-commodity flow problems. INFORMS J. Comput. 15(4), 369–384 (2003)
Censor, Y., Davidi, R., Herman, G., Schulte, R., Tetruashvili, L.: Projected subgradient minimization versus superiorization. J. Optim. Theory Appl. 160(3), 730–747 (2014)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)
Crainic, T.G., Frangioni, A., Gendron, B.: Multicommodity capacitated network design. In: Soriano, P., Sanso, B. (eds.) Telecommunications Network Planning, pp. 1–19. Kluwer Academic Publishers (1999)
Crainic, T., Frangioni, A., Gendron, B.: Bundle-based relaxation methods for multicommodity capacitated fixed charge network design problems. Discrete Appl. Math. 112, 73–99 (2001)
Crema, A., Loreto, M., Raydan, M.: Spectral projected subgradient with a momentum term for the Lagrangean dual approach. Comput. Oper. Res. 34, 3174–3186 (2007)
d’Antonio, G., Frangioni, A.: Convergence analysis of deflected conditional approximate subgradient methods. SIAM J. Optim. 20(1), 357–386 (2009)
du Merle, O., Goffin, J.L., Vial, J.P.: On improvements to the analytic center cutting plane method. Comput. Optim. Appl. 11, 37–52 (1998)
Feltenmark, S., Kiwiel, K.: Dual applications of proximal bundle methods, including Lagrangian relaxation of nonconvex problems. SIAM J. Optim. 10(3), 697–721 (2000)
Frangioni, A.: Solving semidefinite quadratic problems within nonsmooth optimization algorithms. Comput. Oper. Res. 21, 1099–1118 (1996)
Frangioni, A.: Generalized bundle methods. SIAM J. Optim. 13(1), 117–156 (2002)
Frangioni, A., Gallo, G.: A bundle type dual-ascent approach to linear multicommodity min cost flow problems. INFORMS J. Comput. 11(4), 370–393 (1999)
Frangioni, A., Gendron, B.: A stabilized structured Dantzig–Wolfe decomposition method. Math. Program. 140, 45–76 (2013)
Frangioni, A., Gorgone, E.: A library for continuous convex separable quadratic knapsack problems. Eur. J. Oper. Res. 229(1), 37–40 (2013)
Frangioni, A., Gorgone, E.: Generalized bundle methods for sum-functions with “easy” components: applications to multicommodity network design. Math. Program. 145(1), 133–161 (2014)
Frangioni, A., Lodi, A., Rinaldi, G.: New approaches for optimizing over the semimetric polytope. Math. Program. 104(2–3), 375–388 (2005)
Fumero, F.: A modified subgradient algorithm for Lagrangean relaxation. Comput. Oper. Res. 28(1), 33–52 (2001)
Geoffrion, A.: Lagrangian relaxation and its uses in integer programming. Math. Program. Study 2, 82–114 (1974)
Gondzio, J., González-Brevis, P., Munari, P.: New developments in the primal–dual column generation technique. Eur. J. Oper. Res. 224(1), 41–51 (2013)
Görtz, S., Klose, A.: A simple but usually fast branch-and-bound algorithm for the capacitated facility location problem. INFORMS J. Comput. 24(4), 597–610 (2012)
Guignard, M.: Efficient cuts in Lagrangean ‘relax-and-cut’ schemes. Eur. J. Oper. Res. 105, 216–223 (1998)
Held, M., Karp, R.: The traveling salesman problem and minimum spanning trees. Oper. Res. 18, 1138–1162 (1970)
Hiriart-Urruty, J.B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms II–Advanced Theory and Bundle Methods, Grundlehren Math. Wiss., vol. 306. Springer, New York (1993)
Ito, M., Fukuda, M.: A family of subgradient-based methods for convex optimization problems in a unifying framework. Tech. rep., Optimization Online (2014)
Jones, K., Lustig, I., Farwolden, J., Powell, W.: Multicommodity network flows: the impact of formulation on decomposition. Math. Program. 62, 95–117 (1993)
Kelley, J.: The cutting-plane method for solving convex programs. J. SIAM 8, 703–712 (1960)
Kiwiel, K.: Convergence of approximate and incremental subgradient methods for convex optimization. SIAM J. Optim. 14(3), 807–840 (2003)
Kiwiel, K., Goffin, J.: Convergence of a simple subgradient level method. Math. Program. 85(4), 207–211 (1999)
Kiwiel, K., Larsson, T., Lindberg, P.: The efficiency of ballstep subgradient level methods for convex optimization. Math. Oper. Res. 23, 237–254 (1999)
Lan, G., Zhou, Y.: Conditional gradient sliding for convex optimization. Technical report, University of Florida (2014)
Larsson, T., Patriksson, M., Strömberg, A.B.: Conditional subgradient optimization—theory and applications. Eur. J. Oper. Res. 88(2), 382–403 (1996)
Larsson, T., Patriksson, M., Strömberg, A.B.: Ergodic, primal convergence in dual subgradient schemes for convex programming. Math. Program. 86, 283–312 (1999)
Lemaréchal, C.: An extension of Davidon methods to nondifferentiable problems. In: Balinski, M., Wolfe, P. (eds.) Nondifferentiable Optimization, Mathematical Programming Study, vol. 3, pp. 95–109. North-Holland, Amsterdam (1975)
Lemaréchal, C., Renaud, A.: A geometric study of duality gaps, with applications. Math. Program. 90, 399–427 (2001)
Necoara, I., Suykens, J.: Application of a smoothing technique to decomposition in convex optimization. IEEE Trans. Autom. Control 53(11), 2674–2679 (2008)
Nedic, A., Bertsekas, D.: Incremental subgradient methods for nondifferentiable optimization. Math. Program. 120, 221–259 (2009)
Nemirovski, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)
Nesterov, Y.: Excessive gap technique in nonsmooth convex minimization. SIAM J. Optim. 16, 235–249 (2005)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103, 127–152 (2005)
Nesterov, Y.: Primal-dual subgradient methods for convex optimization. Math. Program. 120, 221–259 (2009)
Nesterov, Y.: Universal gradient methods for convex optimization problems. Math. Program. 152, 381–404 (2014)
Neto, E., De Pierro, A.: Incremental subgradients for constrained convex optimization: a unified framework and new methods. SIAM J. Optim. 20(3), 1547–1572 (2009)
Ouorou, A.: A proximal cutting plane method using Chebychev center for nonsmooth convex optimization. Math. Program. 119(2), 239–271 (2009)
Polyak, B.: Minimization of unsmooth functionals. Zh. Vychisl. Mat. Fiz 9(3), 509–521 (1969)
Sherali, B., Choi, B., Tuncbilek, C.: A variable target value method for nondifferentiable optimization. Oper. Res. Lett. 26, 1–8 (2000)
Sherali, B., Lim, C.: On embedding the volume algorithm in a variable target value method. Oper. Res. Lett. 32, 455–462 (2004)
Shor, N.: Minimization Methods for Nondifferentiable Functions. Springer, Berlin (1985)
Solodov, M., Zavriev, S.: Error stability properties of generalized gradient-type algorithms. J. Optim. Theory Appl. 98(3), 663–680 (1998)
Tseng, P.: Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program. 125, 263–295 (2010)
Wolfe, P.: A method of conjugate subgradients for minimizing nondifferentiable functions. In: Balinski, M., Wolfe, P. (eds.) Nondifferentiable Optimization, Mathematical Programming Study, vol. 3, pp. 145–173. North-Holland, Amsterdam (1975)
Acknowledgements
The first author acknowledges the contribution of the Italian Ministry for University and Research under the PRIN 2012 Project 2012JXB3YF “Mixed-Integer Nonlinear Optimization: Approaches and Applications”. The work of the second author has been supported by NSERC (Canada) under Grant 184122-09. The work of the third author has been supported by the Post-Doctoral Fellowship D.R. No 2718/201 (Regional Operative Program Calabria ESF 2007/2013) and the Interuniversity Attraction Poles Programme P7/36 “COMEX: combinatorial optimization metaheuristics & exact methods” of the Belgian Science Policy Office. All the authors gratefully acknowledge the contribution of the anonymous referees and of the editors of the journal to improving the initial version of the manuscript.
Additional information
The software that was reviewed as part of this submission has been issued the Digital Object Identifier doi:10.5281/zenodo.556738.
Appendix
We now describe all the details of the SM that we have tested, together with the results of the tuning phase. We remark that for some parameters it is nontrivial even to set a reasonable range of values. Our approach has been to select the initial range heuristically, and then test it: if the best value consistently ended up being at one extreme, this was taken as an indication that the interval should be enlarged accordingly. This hinges on the assumption that the behaviour of the algorithm is somewhat “monotonic” in the parameters; while this is not necessarily true, for the vast majority of parameters a “monotonic” behaviour has been verified experimentally, in that we almost never found a case where settings “far apart” provided better performances than those “in the middle.”
1.1 General parameters of SM
The following parameters are common to all variants of SM we tested, basically irrespective of the specific rules for choosing the stepsize and the deflection.
-
We denote by pr \(\subseteq \{ \, g_i , d_{i-1} , d_i \, \}\) the subset of vectors that are projected on the tangent cone \(T_i\) of \(\varLambda \) at \(\bar{\lambda }_i\); in all our tests, pr does not depend on the iteration. As already remarked, pr \(= \{ \, g_i , d_{i-1} , d_i \, \}\) makes no sense as \(T_i\) is convex. Furthermore, when no deflection is done \(d_i = g_i\) and therefore only pr \(= \{ \, g_i \, \}\) and pr \(= \emptyset \) make sense.
-
Regarding the order in which the stepsize and the deflection are chosen, we denote by sg \(\in \{\, \mathrm {drs}, \mathrm {dr0}, \mathrm {srs}, \mathrm {sr0}\,\}\) the four possible schemes, where “dr” and “sr” refer to the deflection-restricted and stepsize-restricted approach, respectively, while “s” and “0” refer to using or not the safe rule ((9) and (10), respectively). Of course, drs and dr0 only apply if deflection is performed.
-
We denote by \(\chi \) the parameter used to adjust the Lipschitz constant L in the incremental case, cf. (14), for which we tested the values \(\chi =\) 1e-v for v \(\in \{ 0, \ldots , 8 \}\).
-
For the AS, one crucial decision is how often separation is performed: doing it less often avoids some computations, but at the risk of ignoring possibly relevant information for too long. We performed separation after a fixed number \(s_1 \in \{0, 1\}\) of iterations, i.e., either not using the AS at all or separating at every iteration. Initial tests showed that larger values of \(s_1\) were not effective.
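As an illustration of the projection step governed by pr (first item above): when \(\varLambda = \mathbb {R}^n_+\), which is the case actually arising here (cf. the discussion around Lemma 2 below), the tangent cone \(T_i\) at \(\bar{\lambda }_i\) and the projection onto it have a simple componentwise form. The following sketch assumes this setting; the function name is ours.

```python
def project_tangent_cone(v, lam_bar, tol=1e-12):
    """Project v onto the tangent cone of the nonnegative orthant at lam_bar.

    For Lambda = R^n_+ the tangent cone at lam_bar is
    T = { d : d_j >= 0 for every j with lam_bar_j = 0 },
    so projecting just clips negative entries on the active coordinates
    (those with lam_bar_j = 0) and leaves the others untouched.
    """
    return [max(vj, 0.0) if lj <= tol else vj for vj, lj in zip(v, lam_bar)]
```

Any of \(g_i\), \(d_{i-1}\) or \(d_i\) selected in pr would be passed through this map before being used.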
1.2 Parameters of the SR
We now examine in detail the parameters of the three SR. Since all of them have the form (7), we are looking at different ways of determining \(\beta _i\) and \(f^{lev}_i\).
Polyak In this SR \(\beta _i\) and \(f^{lev}_i\) are kept fixed at all iterations. Here, we exploit the fact that in our application we know the “target value” \(\underline{f}\), and simply test the two cases \(f^{lev} \in \{ f_*, 10\%f_*\}\). As for the other parameter, we tested \(\beta \in \{\, 0.01 , 0.1 , 1 , 1.5 , 1.99 \, \}\).
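A minimal sketch of this SR, assuming (7) has the standard Polyak form with the record value in the numerator and the current direction in the denominator (the exact display of (7) is in the paper; the function name is ours):

```python
def polyak_stepsize(beta, f_rec, f_lev, d):
    """Polyak-type stepsize: nu = beta * (f_rec - f_lev) / ||d||^2.

    beta in (0, 2) is the fixed relaxation parameter, f_lev the fixed
    target level, f_rec the current record (best) value, and d the
    current (possibly deflected) direction.
    """
    norm_sq = sum(dj * dj for dj in d)
    return beta * (f_rec - f_lev) / norm_sq
```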
ColorTV This SR is based on the improvement \(\varDelta f = \bar{f}_{i-1} - f_i\) of f and the scalar product \(d_i g_i\) to estimate “how successful a step has been.” Note, however, that in deflection-restricted schemes (i.e., drs and dr0) \(d_i\) is not available and we use \(d_{i-1} g_{i}\) instead. Iteration i is marked as green if \(d_i g_i > \rho \) and \(\varDelta f \ge \rho \, \max \{|f_i^\mathrm{{rec}}|, 1\}\), as yellow if \(d_i g_i < \rho \) and \(\varDelta f \ge 0\), and as red otherwise, where \(\rho > 0\) is a tolerance. Intuitively, green is a “good” step possibly indicating that a larger \(\nu _i\) may have been preferable, whereas red is a “bad” one suggesting that \(\nu _i\) is too large. Given three parameters \(c_g, c_y\) and \(c_r\), and denoting by \(n_g, n_y\) and \(n_r\) the number of consecutive green, yellow and red iterations, respectively, \(\beta _i\) is updated as:
1. if \(n_g \ge c_g\) then set \(\beta _i = \min \{ \, 2 , 2\beta _{i-1} \, \}\);
2. if \(n_y \ge c_y\) then set \(\beta _i = \min \{ \, 2 , 1.1\beta _{i-1} \, \}\);
3. if \(n_r \ge c_r\) then set \(\beta _i = \max \{\) 5e-4 \(, 0.67\beta _{i-1} \, \}\);
4. if none of the above cases occurs, then set \(\beta _i = \beta _{i-1}\).
One important parameter is therefore the arbitrarily fixed value \(\beta _0\). Also, the SR includes a simple target-following scheme whereby if \(f_i \le 1.05 f_i^{lev}\) then \(f_i^{lev} = f_i - 0.05 f_i^{lev}\) (note that this never happens for \(f^{lev} = 10\%f_*\)). For this SR we kept \(\rho = \) 1e-6 fixed and we tested all combinations of \(\beta _0\in \{ \, 0.01 , 0.1 , 1 , 1.5 , 1.99 \, \}\), \(c_g\in \{ \, 1 , 10 , 50 \, \}\), \(c_y \in \{ \, 50 , 100 , 400 \, \}\), and \(c_r \in \{ \, 10 , 20 , 50 \, \}\).
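The four-case \(\beta \)-update above can be transcribed directly (a minimal sketch; the function name and interface are ours):

```python
def colortv_update(beta_prev, n_g, n_y, n_r, c_g, c_y, c_r):
    """ColorTV beta update: n_g / n_y / n_r are the numbers of consecutive
    green / yellow / red iterations, c_g / c_y / c_r the thresholds."""
    if n_g >= c_g:
        return min(2.0, 2.0 * beta_prev)    # many "good" steps: enlarge beta
    if n_y >= c_y:
        return min(2.0, 1.1 * beta_prev)    # many "yellow" steps: mild enlarge
    if n_r >= c_r:
        return max(5e-4, 0.67 * beta_prev)  # many "bad" steps: shrink beta
    return beta_prev                        # otherwise keep it unchanged
```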
FumeroTV This SR has a complex management of \(f_i^{lev}\) and \(\beta _i\), motivated by experimental considerations [31], that is subdivided into two distinct phases. The switch between the two is governed by an iteration counter r, which is increased each time there is no improvement in the function value. This counter is used to define the exponential function \(\sigma (r) = e^{-0.6933 (r/r_1)^{3.26}}\), where \(r_1\) is a parameter; note that \(\sigma (r_1) \approx 1/2\), which is how the two apparently weird numerical constants have been selected. The function \(\sigma \), which is decreasing in r, is used in two ways. The first is to determine the maximum number of non-improving steps, which is the smallest integer \(r_2\) such that \(\sigma _{\infty } \ge \sigma (r_2)\), where the threshold \(\sigma _{\infty } > 0\) is another parameter: given \(r_1\) and \(\sigma _{\infty }\), \(r_2\) can be obtained with a simple closed formula. The second is to construct at each iteration the value of \(f_i^{lev}\) as a convex combination of the known global lower bound \(\underline{f}\) (which, not incidentally, this algorithm, specifically tailored for IP, is the only one to use explicitly) and the current record value, as \(f^{lev}_i = \sigma (r) \underline{f} + (1-\sigma (r))f^{rec}_i\). In the first phase, when r varies, the threshold varies as well: as \(\sigma (r)\) decreases when r grows, \(f^{lev}_i\) is kept closer and closer to \(f^{rec}_i\) as the algorithm proceeds. In the second phase (\(r \ge r_2\)), where r is no longer updated, \(\sigma (r) = \sigma _{\infty }\). The procedure for updating r and \(\beta _i\), divided into the two phases according to whether r (initialized to 0) is smaller or larger than the threshold \(r_2\), uses four algorithmic parameters: a tolerance \(\delta > 0\), two integers \(\eta _1 \ge 1\) and \(\eta _2 \ge 1\), and the initial value \(\beta _0 \in (0, 2)\).
Similarly to ColorTV, the rule keeps a record value \(\bar{f}_i\) (similar, but not necessarily identical, to \(f^{rec}_i\)) and declares a “good” step whenever \(f_i \le \bar{f}_i - \delta \max \{|\bar{f}_i|,1\}\), in which case \(\bar{f}_i\) is updated to \(f_i\). In either phase, the number of consecutive “non-good” steps is counted. In the first phase, after \(\eta _2\) such steps r is increased by one, and \(\beta _i\) is updated as \(\beta _i = \beta _{i-1}/ (2\beta _{i-1} + 1)\). In the second phase r is no longer updated: after every “good” step \(\beta _i\) is doubled, whereas after \(\eta _1\) “non-good” steps \(\beta _i\) is halved. In the tuning phase we tested the following values for the parameters: \(\sigma _{\infty } \in \{\) 1e-4, 1e-3, 1e-2 \(\}\), \(\delta =\) 1e-6, \(r_1 \in \{\, 10 , 50 , 100 , 150 , 200 , 250 , 300 , 350 \,\}\), \(\beta _0 \in \{\, 0.01 , 0.1 , 1 , 1.5 , 1.99 \,\}\), \(\eta _1 \in \{\, 10 , 50 , 100 , 150 , 200 , 250 , 300 , 350 \,\}\), \(\eta _2 \in \{\, 10 , 50 , 100 , 150 , 200 \,\}\).
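The switching function \(\sigma \) and the level construction can be sketched as follows; the closed formula for \(r_2\) is our derivation from \(\sigma (r_2) \le \sigma _{\infty }\) (function names are ours):

```python
import math

def sigma(r, r1):
    """FumeroTV switching function; sigma(r1) ~ 1/2 and sigma decreases in r."""
    return math.exp(-0.6933 * (r / r1) ** 3.26)

def max_nonimproving(r1, sigma_inf):
    """Smallest integer r2 with sigma(r2) <= sigma_inf, by inverting sigma:
    (r2/r1)^3.26 >= ln(1/sigma_inf) / 0.6933."""
    return math.ceil(r1 * (math.log(1.0 / sigma_inf) / 0.6933) ** (1.0 / 3.26))

def target_level(r, r1, f_lb, f_rec):
    """f_lev as a convex combination of the global lower bound and the record."""
    s = sigma(r, r1)
    return s * f_lb + (1.0 - s) * f_rec
```

As r grows, `target_level` slides from the lower bound \(\underline{f}\) towards the record value, as described above.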
1.3 Parameters of the DR
We now describe in detail the two “complex” DR that we have tested (STSubgrad, where \(\alpha _i = 1 \Longrightarrow d_i = g_i\) and \(\bar{\lambda }_{i+1} = \lambda _{i+1}\) for all i, hardly needs any comment). Note that the selection of \(\bar{\lambda }_{i+1}\) is also done by the Deflection() object.
Primal–Dual The PDSM is based on a sophisticated convergence analysis aimed at obtaining optimal a-priori complexity estimates [54]. A basic assumption of PDSM is that \(\varLambda \) is endowed with a prox-function \(d(\lambda )\), and that one solves the modified form of (1)
restricted to a compact subset of the feasible region, where \(D\ge 0\) is a parameter. D is never directly used in the algorithm, except to optimally tune its parameters; hence, (21) can always be considered, provided that f has a minimum \(\lambda _*\). In particular, we take \(d(\lambda ) = \Vert \lambda - \lambda _0 \Vert ^2/2\), in which case \(D = \Vert \lambda _* - \lambda _0 \Vert ^2/2\). In general D is unknown; however, the parameter “\(t^*\)” in the stopping formulæ (12)/(13) is somehow related. Roughly speaking, \(t^*\) estimates how far at most one can move along a subgradient \(g_i \in \partial f(\lambda _i)\) when \(\lambda _i\) is an approximately optimal solution. The parameter, which is used in the same way by Bundle methods, is independent of the specific solution algorithm and has been individually tuned (which is simple enough, ex-post); hence, \(D = (t^*)^2L\) is a possible estimate. Yet, \(t^*\) is supposed to measure \(\Vert \lambda ^* - \lambda _i \Vert \) for a “good” \(\lambda _i\), whereas D requires the initial \(\lambda _0\), which typically is not “good”: hence, we introduced a further scaling factor \(F > 0\), i.e., took \(\gamma = (F\sqrt{L}) / (t^*\sqrt{2})\) for SA and \(\gamma = F / (t^*\sqrt{2L})\) for WA (cf. (25)), and we experimentally tuned F. In general one would expect \(F > 1\), and the results confirm this; however, to be on the safe side we tested all the values \(F \in \{\) 1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4 \(\}\). As suggested by one referee, we also tested using \(D = \Vert \lambda _* - \lambda _0 \Vert ^2/2\), with \(\lambda _*\) obtained by some previous optimization. The results clearly showed that the “exact” estimate of D did not always translate into the best performance; in particular, for the FR the results were always consistently worse, whereas for the KR the results were much worse for WA, and completely comparable (but not any better) for SA. This is why, in the end, we reported results with the tuned value of F.
For the rest, PDSM basically have no tunable parameters. It has to be remarked, however, that PDSM are not, at the outset, based on a simple recurrence of the form (5); rather, given two sequences of weights \(\{\upsilon _i\}\) and \(\{\omega _i\}\), the next iterate is obtained as
Yet, when \(\varLambda = \mathbb {R}^n\) (22) readily reduces to (5), as the following Lemma shows.
Lemma 1
Assume \(\varLambda = \mathbb {R}^n\), select \(d(\lambda ) = \Vert \lambda - \lambda _0 \Vert ^2/2\), and fix \(\bar{\lambda }_i = \lambda _0\) for all \(i \ge 0\) in (5). By defining \(\varDelta _i = \sum _{k=1}^i \upsilon _k\), the following DR and SR
are such that \(\lambda _{i+1}\) produced by (22) is the same produced by (5) and (8).
Proof
Under the assumptions, (22) is a strictly convex unconstrained quadratic problem, whose optimal solution is immediately available by the closed formula
This clearly is (5) under the SR in (23) provided that one shows that the DR in (23) produces
This is indeed easy to show by induction. For \(i = 1\) one immediately obtains \(d_1 = g_1\). For the inductive case, one just has to note that
to obtain
\(\square \)
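For concreteness, the computation behind Lemma 1 can be summarized as follows; this is our sketch of the forms that the displays (22)–(24) must take, given the surrounding text:

```latex
\[
  \lambda_{i+1} = \mathop{\mathrm{argmin}}_{\lambda \in \varLambda}
  \Big\{ \textstyle\sum_{k=1}^{i} \upsilon_k \langle g_k , \lambda \rangle
         + \omega_i \, d(\lambda) \Big\} .
\]
With $d(\lambda) = \Vert \lambda - \lambda_0 \Vert^2 / 2$ and
$\varLambda = \mathbb{R}^n$, setting the gradient to zero gives the
closed formula
\[
  \lambda_{i+1} = \lambda_0
                  - \tfrac{1}{\omega_i} \textstyle\sum_{k=1}^{i} \upsilon_k g_k ,
\]
which is (5) with $\bar{\lambda}_i = \lambda_0$ under the DR and SR
\[
  \alpha_i = \upsilon_i / \varDelta_i , \qquad
  \nu_i = \varDelta_i / \omega_i ,
\]
since by induction
$d_i = \alpha_i g_i + (1-\alpha_i) d_{i-1}
     = \textstyle\sum_{k=1}^{i} (\upsilon_k / \varDelta_i) g_k$,
whence $\lambda_0 - \nu_i d_i
      = \lambda_0 - \tfrac{1}{\omega_i}\sum_{k=1}^{i} \upsilon_k g_k$.
```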
Interestingly, the same happens if simple sign constraints \(\lambda \ge 0\) are present, which is what we actually have whenever \(\varLambda \ne \mathbb {R}^n\).
Lemma 2
If \(\varLambda = \mathbb {R}^n_+\), the same conclusion as in Lemma 1 holds after applying \(\mathrm{P}_{\varLambda }(\lambda _{i+1})\).
Proof
It is easy to see that the optimal solution of (22) with \(\varLambda = \mathbb {R}^n_+\) is equal to that with \(\varLambda = \mathbb {R}^n\), i.e. (24), projected over \(\mathbb {R}^n_+\). \(\square \)
Therefore, implementing the DR and the SR as in (23), and never updating \(\bar{\lambda }_i = \lambda _0\), allows us to fit PDSM in our general scheme. To choose \(\upsilon _i\) and \(\omega _i\) we follow the suggestions in [54]: the SA approach corresponds to \(\upsilon _i = 1\), and the WA one to \(\upsilon _i = 1 / \Vert g_i\Vert \). We then set \(\omega _i = \gamma \hat{\omega }_i\), where \(\gamma > 0\) is a constant, and \(\hat{\omega }_0 = \hat{\omega }_1 = 1\), \(\hat{\omega }_i = \hat{\omega }_{i-1} + 1 / \hat{\omega }_{i-1}\) for \(i \ge 2\), which implies \(\hat{\omega }_{i+1} = \sum _{k=0}^i 1 / \hat{\omega }_k\). The analysis in [54] suggests settings for \(\gamma \) that provide the best possible theoretical convergence, i.e.,
for the SA and WA, respectively, L being the Lipschitz constant of f.
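The weight sequences above can be sketched directly from the recurrences (the function name and interface are ours):

```python
import math

def pdsm_weights(n, gamma, grads=None):
    """PDSM weight sequences, as described above.

    omega_hat follows omega_hat[i] = omega_hat[i-1] + 1/omega_hat[i-1]
    with omega_hat[0] = omega_hat[1] = 1, which implies
    omega_hat[i+1] = sum_{k=0}^{i} 1/omega_hat[k].
    upsilon_i is 1 for the SA approach, 1/||g_i|| for WA (when the
    gradients `grads` are supplied).  Returns (upsilon, omega) with
    omega[i] = gamma * omega_hat[i].
    """
    omega_hat = [1.0, 1.0]
    for _ in range(2, n):
        omega_hat.append(omega_hat[-1] + 1.0 / omega_hat[-1])
    if grads is None:  # simple averages (SA)
        upsilon = [1.0] * n
    else:              # weighted averages (WA)
        upsilon = [1.0 / math.sqrt(sum(gj * gj for gj in g)) for g in grads]
    return upsilon, [gamma * w for w in omega_hat[:n]]
```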
Volume In this DR, \(\alpha _i\) is obtained as the optimal solution of a univariate quadratic problem. As suggested in [5], and somewhat differently from the original [6], we use exactly the “poorman’s form” of the master problem of the proximal Bundle method
where the linearization errors \(\sigma _i(\bar{\lambda }_i)\) and \(\epsilon _{i-1}(\bar{\lambda }_i)\) have been discussed in detail in Sect. 2.2. Note that we use the stepsize \(\nu _{i-1}\) of the previous iteration as the stability weight, since that term corresponds to the stepsize that one would take along the dual optimal solution in a Bundle method [3, 5, 25]. It may be worth remarking that the dual of (26)
where \(d = \lambda - \bar{\lambda }_i\), is closely tied to (22) in PDSM. The difference is that (27) uses two (approximate) subgradients, \(g_i\) and \(d_i\), whereas in (22) one uses only one (approximate) subgradient obtained as weighted average of the ones generated at previous iterations. Problem (26) is inexpensive, because without the constraint \(\alpha \in [0, 1]\) it has the closed-form solution
and thus one can obtain its optimal solution by simply projecting \(\alpha ^*_i\) over [0, 1]. However, as suggested in [5, 6], we instead chose \(\alpha _i\) in the more safeguarded way
where \(\tau _i\) is initialized to \(\tau _0\) and every \(\tau _p\) iterations is decreased by multiplying it by \(\tau _f < 1\), while ensuring that it remains larger than \(\tau _{\min }\). The choice of the stability center is also dictated by a parameter \(m > 0\), akin to that used in Bundle methods: if \(\bar{f}_i - f_{i+1} \ge m \max \{1,|f^{ref}_i|\}\) a Serious Step occurs and \(\bar{\lambda }_{i+1} = \lambda _{i+1}\); otherwise a Null Step takes place and \(\bar{\lambda }_{i+1} = \bar{\lambda }_i\). For the tuning phase we searched all the combinations of the following values for the above parameters: \(\tau _0\in \{\, 0.01 , 0.1 , 1 , 10 \,\}\), \(\tau _p \in \{\, 10 , 50 , 100 , 200 , 500 \,\}\), \(\tau _f \in \{\, 0.1 , 0.4 , 0.8 , 0.9 , 0.99 \,\}\), \(\tau _{\min } \in \{\) 1e-4, 1e-5 \(\}\), \(m \in \{\, 0.01 , 0.1 \,\}\).
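Assuming (26) has the quadratic form described in the text, \(\phi (\alpha ) = (\nu _{i-1}/2)\Vert \alpha g_i + (1-\alpha )d_{i-1}\Vert ^2 + \alpha \sigma _i + (1-\alpha )\epsilon _{i-1}\), its unconstrained minimizer and the projection over [0, 1] can be sketched as follows; the \(\tau \)-based safeguard itself is not reproduced, since its exact form is given in the paper's display (function name is ours):

```python
def volume_alpha(g, d_prev, sigma_i, eps_prev, nu_prev):
    """Unconstrained minimizer of
    phi(a) = (nu/2)*||a*g + (1-a)*d||^2 + a*sigma + (1-a)*eps,
    projected over [0, 1].  Setting phi'(a) = 0 gives
    a* = (nu * d.(d - g) + eps - sigma) / (nu * ||g - d||^2).
    """
    diff = [dj - gj for gj, dj in zip(g, d_prev)]            # d - g
    denom = nu_prev * sum(x * x for x in diff)               # nu * ||g - d||^2
    numer = nu_prev * sum(dj * x for dj, x in zip(d_prev, diff)) + eps_prev - sigma_i
    return min(1.0, max(0.0, numer / denom))
```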
1.4 Detailed results of the tuning phase
The tuning phase required substantial computational work, and a nontrivial analysis of the results. As discussed in Sect. 3.2, each SM configuration gave rise to an aggregated convergence graph. To select the best configurations, the graphs were visually inspected, and the ones corresponding to better overall convergence rates were selected. This usually was the configuration providing the best final gap for all instances. Occasionally, other configurations gave better results than the chosen one in the earlier stages of the algorithm on some subsets of the instances; usually the advantage was marginal at best, and only on a fraction of the cases, while the disadvantage in terms of final result was pronounced. In general it has always been possible to find “robust” settings that provided the best (or nearly best) gap at termination, while not being too far from the best gaps in all the other stages. Furthermore, although the total number of possible combinations was rather large, it turned out that only a relatively small set of parameters had a significant impact on the performance, and in most cases their effects were almost orthogonal to each other. This allowed us to effectively single out “robust” configurations for our test sets; for several of the parameters, the “optimal” choice has been unique across all instances, which may provide useful indications even for different problems.
For the sake of clarity and conciseness, in Tables 3 and 4, we report the chosen values of the parameters for FR and KR, respectively, briefly remarking about the effect of each parameter and their relationships. The behaviour of SM was pretty similar in the two cases \(\underline{f} = f_*\) and \(\underline{f} = 10\%f_*\); hence, the tables report the values for \(\underline{f} = f_*\), indicating in “[]” these for \(\underline{f} = 10\%f_*\) if they happen to be different. The tables focus on the combinations between the three SR and the two DR, plus the incremental case; the parameters of Primal–Dual variant are presented separately since the SR is combined with the DR.
Results for the FR. The results for FR are summarized in Table 3, except for those settings that are constantly optimal. In particular, STSubgrad and Incremental have better performances with \(\mathrm{{pr}} = \{ g_i \}\), irrespective of the SR. For Volume, instead, the optimal setting of \(\mathrm{{pr}}\) does depend on the SR, although \(\mathrm{{pr}} = \{ d_i \}\) and \(\mathrm{{pr}} = \{ d_{i-1} \}\) were hardly different. All the other parameters of Volume depend on the SR (although the stepsize-restricted scheme with no safe rule is often good), except \(\tau _{\min }\) and m, which are always best set to 1e-4 and 0.1, respectively. Another interesting observation is that, while Volume does have several parameters, they seem to operate quite independently of each other, as changing one of them always has a similar effect irrespective of the others. We also mention that for ColorTV the parameters \(\mathrm {c}_y\) and \(\mathrm {c}_r\) have little impact on the performance, whereas \(\mathrm {c}_g\) plays an important role and significantly influences the quality of the results. As for FumeroTV, \(\sigma _{\infty }\) and \(\eta _2\) have hardly any impact, and we arbitrarily set them to 1e-4 and 50, respectively.
In PDSM, the only crucial value is F, used to compute the optimal value of \(\gamma \) in (25). We found its best value to be 1e2 and 1e3 for SA and WA, respectively. The choice has a large impact on performances, which significantly worsen for values far from these.
Results for the KR. The best parameters for the KR are reported in Table 4. Although the best values are in general different from the FR, confirming the (unfortunate) need for problem-specific parameter tuning, similar observations as in that case can be made. For instance, for Volume, the parameters were still more or less independent of each other, and \(\tau _{\min }\) and m still had little impact, with the values 1e-4 and 0.1 still very adequate. For ColorTV, results are again quite stable when varying \(\mathrm {c}_y\). Yet, differences can be noted: for instance, for the FR \(\mathrm {c}_g\) is clearly the most significant parameter and dictates most of the performance variations, while for the KR the relationship between the two parameters \(\mathrm {c}_r\) and \(\mathrm {c}_g\) and the results is less clear. Similarly, for FumeroTV some settings are conserved: \(\sigma _{\infty }\) and \(\eta _2\) have very little effect and can be set to 1e-4 and 50, respectively. Other cases were different: for instance, the parameters \(\eta _1\), \(r_1\) and \(\beta _0\) were more independent of each other than in the FR.
The parameters of Primal–Dual proved to be quite independent of the underlying Lagrangian approach, with the best value of F still being 1e2 for SA and 1e3 for WA. This confirms the higher overall robustness of the approach.
We conclude the Appendix with a short table detailing which of the variants of SM that we tested have a formal proof of convergence, indicating the references wherein the proofs are given. The columns DR and SR, as usual, indicate which among the possible deflection and stepsize rules are adopted; an entry “any” means that the corresponding proof holds for all the rules. Moreover, PR, AS and IN stand for the strategies: (i) projection, (ii) active set and (iii) incremental, respectively.
Cite this article
Frangioni, A., Gendron, B. & Gorgone, E. On the computational efficiency of subgradient methods: a case study with Lagrangian bounds. Math. Prog. Comp. 9, 573–604 (2017). https://doi.org/10.1007/s12532-017-0120-7
Keywords
- Subgradient methods
- Nondifferentiable Optimization
- Computational analysis
- Lagrangian relaxation
- Multicommodity Network Design