Submodular functions: from discrete to continuous domains


Submodular set-functions have many applications in combinatorial optimization, as they can be minimized and approximately maximized in polynomial time. A key element in many of the algorithms and analyses is the possibility of extending the submodular set-function to a convex function, which opens up tools from convex optimization. Submodularity goes beyond set-functions and has naturally been considered for problems with multiple labels or for functions defined on continuous domains, where it corresponds essentially to cross second-derivatives being nonpositive. In this paper, we show that most results relating submodularity and convexity for set-functions can be extended to all submodular functions. In particular, (a) we naturally define a continuous extension in a set of probability measures, (b) show that the extension is convex if and only if the original function is submodular, (c) prove that the problem of minimizing a submodular function is equivalent to a typically non-smooth convex optimization problem, and (d) propose another convex optimization problem with better computational properties (e.g., a smooth dual problem). Most of these extensions from the set-function situation are obtained by drawing links with the theory of multi-marginal optimal transport, which provides also a new interpretation of existing results for set-functions. We then provide practical algorithms which are based on function evaluations, to minimize generic submodular functions on discrete domains, with associated convergence rates.

This is a preview of subscription content, access via your institution.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7


  1. 1.

    Bach, F.: Learning with Submodular Functions: A Convex Optimization Perspective, Volume 6 of Foundations and Trends in Machine Learning. NOW, Breda (2013)

    Google Scholar 

  2. 2.

    Bach, F.: Duality between subgradient and conditional gradient methods. SIAM J. Optim. 25(1), 115–129 (2015)

    MathSciNet  MATH  Article  Google Scholar 

  3. 3.

    Bae, E., Yuan, J., Tai, X.-C., Boykov, Y.: A fast continuous max-flow approach to non-convex multi-labeling problems. In: Bruhn, A., Pock, T., Tai, X.-C. (eds.) Efficient Algorithms for Global Optimization Methods in Computer Vision, pp. 134–154. Springer (2014)

  4. 4.

    Best, M.J., Chakravarti, N.: Active set algorithms for isotonic regression: a unifying framework. Math. Program. 47(1), 425–439 (1990)

    MathSciNet  MATH  Article  Google Scholar 

  5. 5.

    Blumensath, T., Davies, M.E.: Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal. 27(3), 265–274 (2009)

    MathSciNet  MATH  Article  Google Scholar 

  6. 6.

    Borwein, J.M., Lewis, A.S.: Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer, Berlin (2006)

    MATH  Book  Google Scholar 

  7. 7.

    Boyd, S.P., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

    MATH  Book  Google Scholar 

  8. 8.

    Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 23(11), 1222–1239 (2001)

    Article  Google Scholar 

  9. 9.

    Carlier, G.: On a class of multidimensional optimal transportation problems. J. Convex Anal. 10(2), 517–530 (2003)

    MathSciNet  MATH  Google Scholar 

  10. 10.

    Chambolle, A., Darbon, J.: On total variation minimization and surface evolution using parametric maximum flows. Int. J. Comput. Vis. 84(3), 288–307 (2009)

    MATH  Article  Google Scholar 

  11. 11.

    Choquet, G.: Theory of capacities. Annales de l’Institut Fourier 5, 131–295 (1954)

    MathSciNet  MATH  Article  Google Scholar 

  12. 12.

    Djolonga, J., Krause, A.: From MAP to marginals: variational inference in Bayesian submodular models. In: Advances in Neural Information Processing Systems (NIPS) (2014)

  13. 13.

    Djolonga, J., Krause, A.: Scalable variational inference in log-supermodular models. In: International Conference on Machine Learning (ICML) (2015)

  14. 14.

    Edmonds, J.: Submodular functions, matroids, and certain polyhedra. In: Jünger, M., Reinelt, G., Rinaldi, G. (eds.) Combinatorial optimization—Eureka, you shrink! pp. 11–26. Springer (2003)

  15. 15.

    Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)

    MathSciNet  MATH  Article  Google Scholar 

  16. 16.

    Favati, P., Tardella, F.: Convexity in nonlinear integer programming. Ric. Oper. 53, 3–44 (1990)

    Google Scholar 

  17. 17.

    Feige, U., Mirrokni, V.S., Vondrák, J.: Maximizing non-monotone submodular functions. SIAM J. Comput. 40(4), 1133–1153 (2011)

    MathSciNet  MATH  Article  Google Scholar 

  18. 18.

    Frank, M., Wolfe, P.: An algorithm for quadratic programming. Nav. Res. Logist. Q. 3(1–2), 95–110 (1956)

    MathSciNet  Article  Google Scholar 

  19. 19.

    Friedland, S., Gaubert, S.: Submodular spectral functions of principal submatrices of a hermitian matrix, extensions and applications. Linear Algebra Appl. 438(10), 3872–3884 (2013)

    MathSciNet  MATH  Article  Google Scholar 

  20. 20.

    Fujishige, S.: Submodular Functions and Optimization. Elsevier, Amsterdam (2005)

    MATH  Google Scholar 

  21. 21.

    Fujishige, S., Murota, K.: Notes on l-/m-convex functions and the separation theorems. Math. Program. 88(1), 129–146 (2000)

    MathSciNet  MATH  Article  Google Scholar 

  22. 22.

    Groenevelt, H.: Two algorithms for maximizing a separable concave function over a polymatroid feasible region. Eur. J. Oper. Res. 54(2), 227–236 (1991)

    MathSciNet  MATH  Article  Google Scholar 

  23. 23.

    Hebiri, M., Van De Geer, S.: The smooth-lasso and other \(\ell _1 +\ell _2\)-penalized methods. Electron. J. Stat. 5, 1184–1226 (2011)

    MathSciNet  MATH  Article  Google Scholar 

  24. 24.

    Ishikawa, H.: Exact optimization for markov random fields with convex priors. IEEE Trans. Pattern Anal. Mach. Intell. 25(10), 1333–1336 (2003)

    Article  Google Scholar 

  25. 25.

    Iwata, S., Fleischer, L., Fujishige, S.: A combinatorial strongly polynomial algorithm for minimizing submodular functions. J. ACM 48(4), 761–777 (2001)

    MathSciNet  MATH  Article  Google Scholar 

  26. 26.

    Jaggi, M.: Revisiting Frank–Wolfe: projection-free sparse convex optimization. In: Proceedings of the International Conference on Machine Learning (ICML) (2013)

  27. 27.

    Juditsky, A., Nemirovski, A.: First order methods for nonsmooth convex large-scale optimization, I: general purpose methods. In: Sra, S., Nowozin, S., Wright, S.J. (eds.) Optimization for Machine Learning, pp. 121–148. MIT Press, Cambridge (2011)

    MATH  Google Scholar 

  28. 28.

    Kantorovich, L.V.: On the translocation of masses. Dokl. Akad. Nauk SSSR 37, 199–201 (1942)

    MathSciNet  MATH  Google Scholar 

  29. 29.

    Karlin, S., Rinott, Y.: Classes of orderings of measures and related correlation inequalities. I. Multivariate totally positive distributions. J. Multivar. Anal. 10(4), 467–498 (1980)

    MathSciNet  MATH  Article  Google Scholar 

  30. 30.

    Kim, S., Kojima, M.: Exact solutions of some nonconvex quadratic optimization problems via SDP and SOCP relaxations. Comput. Optim. Appl. 26(2), 143–154 (2003)

    MathSciNet  MATH  Article  Google Scholar 

  31. 31.

    Krause, A., Guestrin, C.: Submodularity and its applications in optimized information gathering. ACM Trans. Intell. Syst. Technol. 2(4), 1–20 (2011)

    Article  Google Scholar 

  32. 32.

    Lacoste-Julien, S., Jaggi, M.: On the global linear convergence of Frank–Wolfe optimization variants. In: Advances in Neural Information Processing Systems (NIPS) (2015)

  33. 33.

    Lange, K., Hunter, D.R., Yang, I.: Optimization transfer using surrogate objective functions. J. Comput. Gr. Stat. 9(1), 1–20 (2000)

    MathSciNet  Google Scholar 

  34. 34.

    Lin, H., Bilmes, J.: A class of submodular functions for document summarization. In: Proceedings of the 49th Annual Meeting of the Association for computational Linguistics: Human Language Technologies, vol. 1. Association for Computational Linguistics (2011)

  35. 35.

    Lorentz, G.G.: An inequality for rearrangements. Am. Math. Mon. 60(3), 176–179 (1953)

    MathSciNet  MATH  Article  Google Scholar 

  36. 36.

    Lovász, L.: Submodular functions and convexity. In: Bachem, A., Korte, B., Grötschel, M. (eds.) Mathematical Programming: The State of the Art, Bonn, pp. 235–257. Springer, Bonn (1982)

  37. 37.

    Milgrom, P., Shannon, C.: Monotone comparative statics. Econ. J. Econ. Soc. 62(1), 157–180 (1994)

    MathSciNet  MATH  Google Scholar 

  38. 38.

    Mitter, S.K.: Convex optimization in infinite dimensional spaces. In: Blondel, V.D., Boyd, S.P., Kimura, H. (eds.) Recent Advances in Learning and Control, pp. 161–179. Springer (2008)

  39. 39.

    Monge, G.: Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences, pp. 666–704 (1781)

  40. 40.

    Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2012)

    MATH  Google Scholar 

  41. 41.

    Nedić, A., Ozdaglar, A.: Approximate primal solutions and rate analysis for dual subgradient methods. SIAM J. Optim. 19(4), 1757–1780 (2009)

    MathSciNet  MATH  Article  Google Scholar 

  42. 42.

    Nemhauser, G.L., Wolsey, L.A., Fisher, M.L.: An analysis of approximations for maximizing submodular set functions-I. Math. Program. 14(1), 265–294 (1978)

    MathSciNet  MATH  Article  Google Scholar 

  43. 43.

    Orlin, J.B.: A faster strongly polynomial time algorithm for submodular function minimization. Math. Program. 118(2), 237–251 (2009)

    MathSciNet  MATH  Article  Google Scholar 

  44. 44.

    Pock, T., Chambolle, A., Cremers, D., Bischof, H.: A convex relaxation approach for computing minimal partitions. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2009)

  45. 45.

    Rachev, S.T.: Probability Metrics and the Stability of Stochastic Models. Wiley, New York (1991)

    MATH  Google Scholar 

  46. 46.

    Rudin, W.: Real and Complex Analysis. McGraw-Hill, New York (1986)

    MATH  Google Scholar 

  47. 47.

    Ruozzi, N.: Exactness of approximate MAP inference in continuous MRFs. In: Advances in Neural Information Processing Systems (NIPS) (2015)

  48. 48.

    Santambrogio, F.: Optimal Transport for Applied Mathematicians. Springer, Berlin (2015)

    MATH  Book  Google Scholar 

  49. 49.

    Schlesinger, D., Flach, B.: Transforming an arbitrary MinSum problem into a binary one. Technical Report TUD-FI06-01, Dresden University of Technology, Germany (2006)

  50. 50.

    Schrijver, A.: A combinatorial algorithm minimizing submodular functions in strongly polynomial time. J. Comb. Theory Ser. B 80(2), 346–355 (2000)

    MathSciNet  MATH  Article  Google Scholar 

  51. 51.

    Schrijver, A.: Combinatorial Optimization: Polyhedra and Efficiency, vol. 24. Springer, Berlin (2003)

    MATH  Google Scholar 

  52. 52.

    Shapley, L.S.: Complements and substitutes in the optimal assignment problem. Nav. Res. Logist. Q. 9(1), 45–48 (1962)

    Article  Google Scholar 

  53. 53.

    Shor, N.S.: Nondifferentiable Optimization and Polynomial Problems, vol. 24. Springer, Berlin (2013)

    MATH  Google Scholar 

  54. 54.

    Topkis, D.M.: Minimizing a submodular function on a lattice. Oper. Res. 26(2), 305–321 (1978)

    MathSciNet  MATH  Article  Google Scholar 

  55. 55.

    Topkis, D.M.: Supermodularity and Complementarity. Princeton University Press, Princeton (2011)

    Book  Google Scholar 

  56. 56.

    Villani, C.: Optimal Transport: Old and New, vol. 338. Springer, Berlin (2008)

    MATH  Google Scholar 

  57. 57.

    Vondrák, J.: Optimal approximation for the submodular welfare problem in the value oracle model. In: Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pp. 67–74. ACM (2008)

  58. 58.

    Wald, Y., Globerson, A.: Tightness results for local consistency relaxations in continuous MRFs. In: Proceedings of Uncertainty in Artifical Intelligence (UAI) (2014)

  59. 59.

    Werner, T.: A linear programming approach to max-sum problem: a review. IEEE Trans. Pattern Anal. Mach. Intell. 29(7), 1165–1179 (2007)

    Article  Google Scholar 

  60. 60.

    Wolfe, P.: Finding the nearest point in a polytope. Math. Program. 11(1), 128–149 (1976)

    MathSciNet  MATH  Article  Google Scholar 

  61. 61.

    Živný, S., Werner, T., Průša, D.: The power of lp relaxation for map inference. In: Nowozin, S., Gehler, P.V., Jancsary, J., Lampert, C.H. (eds.) Advanced Structured Prediction. MIT Press, Cambridge (2014)

    Google Scholar 

Download references

Author information



Corresponding author

Correspondence to Francis Bach.


Appendix A: Proof of miscellaneous results

Appendix A.1: Submodularity of the Lovász extension of a submodular set-function

We consider a submodular set-function \(G: \{0,1\}^n \rightarrow \mathbb {R}\) and its Lovász extension \(g: \mathbb {R}^n \rightarrow \mathbb {R}\). In order to show the submodularity of g, we simply apply the definition and consider \(x \in \mathbb {R}^n\) and two distinct basis vectors \(e_i\) and \(e_j\) with infinitesimal positive displacements \(a_i\) and \(a_j\). If (ij) belongs to two different level sets of x, then, for \(a_i\) and \(a_j\) small enough, \(g(x+a_i e_i) + g(x+a_j e_j) - g(x) - g(x+a_i e_i + a_j e_j)\) is equal to zero. If (ij) belongs to the same level sets, then, by explicitly computing the quantity for \(a_i>a_j\) and \(a_i<a_j\), it is non-negative (as a consequence of submodularity).

Note that then, the extension has another expression when \({\mathcal {X}}= [0,1]^n\), as, if \(H = g\),

$$\begin{aligned} h(\mu )= & {} \int _0^1 g( F_{\mu _1}^{-1}(t),\dots ,F_{\mu _n}^{-1}(t)) dt \\= & {} \int _0^1 \int _0^1 G( \{ i, F_{\mu _i}^{-1}(t) \geqslant z \} ) dt dz = \int _0^1 \int _0^1 G( \{ i, F_{\mu _i}(z) \geqslant t \} ) dt dz \\= & {} \int _0^1 g( F_{\mu _1}(z),\dots ,F_{\mu _n}(z)) dz, \end{aligned}$$

which provides another proof of submodularity since g is convex.

Appendix A.2: Submodularity of the multi-linear extension of a submodular set-function

We consider a submodular set-function \(G: \{0,1\}^n \rightarrow \mathbb {R}\) and its multi-linear extension [57], defined as a function \(\tilde{g}:[0,1]^n \rightarrow \mathbb {R}\) with

$$\begin{aligned} \tilde{g}(x_1,\dots ,x_n) = {\mathbb {E}}_{y_i\sim \mathrm{Bernoulli}(x_i), i \in \{1,\dots ,n\}} G(y), \end{aligned}$$

where all Bernoulli random variables \(y_i\) are independent. In order to show submodularity, we only need to consider the case \(n=2\), for which the function \(\tilde{g}\) is quadratic in x and the cross-term is non-positive because of the submodularity of G. See more details in [57]. The extension on a product of measures on \([0,1]^n\) does not seem to have the same simple interpretation as for the Lovász extension in Appendix A.1.

Appendix A.3: Proof of Proposition 6

In this proof, we are going to use a reparameterization of w as follows; we define a vector \(v_i \in \mathbb {R}^{\{1,\dots ,k_i-1\}}\) as the cumulative sum of \(w_i\), that is, such that \(v_i(x_i) = \sum _{ y_i=1}^{x_i} w_i(y_i)\) (this is a bijection from \(w_i\) to \(v_i\)). We then have the constraint that \(\sum _{i=1}^n v_i(x_i) \leqslant H(x_1,\dots ,x_n) -H(0) \), for all \(x \in {\mathcal {X}}\), with the convention that \(v_i(0)=0\). The extra constraint is that \(\sum _{i=1}^n v_i(k_i-1) = H(k_1-1,\dots ,k_n-1)-H(0)\). Moreover, we have, with \(\mu _i(x_i) = \rho _i(x_i) - \rho _i(x_i+1) \) for \(x_i \in \{1,\dots ,k_i-2\}\) and \(\mu _i(k_i-1) = \rho _i(k_i-1)\), an expression for the linear function we aim to maximize, that is, by Abel’s summation formula:

$$\begin{aligned} \sum _{x_i=1}^{k_i-1} w_i(x_i) \rho _i(x_i)= & {} \sum _{x_i=2}^{k_i-1} \big [ v_i(x_i) - v_i(x_i-1) \big ] \rho _i(x_i) + v_i(1) \rho _i(1)\\= & {} \sum _{x_i=1}^{k_i-2} v_i(x_i) \big [\rho _i(x_i) - \rho _i(x_i+1) \big ] + v_i(k_i-1) \rho _i(k_1 - 1) \\= & {} \sum _{x_i=1}^{k_i-1} v_i(x_i) \mu _i(x_i). \end{aligned}$$

We first assume that each \(\rho _i\) is non-decreasing. We are going to follow the same proof than for set-functions, based on convex duality. First, given the piecewise-linearity of \(h_\downarrow - H(0)\) (and hence the homogeneity), and the fact that \(h_\downarrow (\rho + C) = h_\downarrow (\rho ) + C \big [ H(k_1-1,\dots ,k_n-1) - H(0) \big ] \), we only need to show the result for \(\rho _i(x_i) \in [0,1]\) for all i, and \(x_i \in {\mathcal {X}}_i\). Each vector \(\rho _i\) is then uniquely associated to a probability measure (with non-negative values because \(\rho _i\) is non-increasing) \(\mu _i\) on \({\mathcal {X}}_i = \{0,\dots , k_i-1\}\), and, from Lemma 1, \(h_\downarrow (\rho ) = h(\mu )\).

Using the parameterization in terms of v and \(\mu \), we can now consider the Lagrangian, with dual values \(\gamma \in \mathbb {R}_+^{\mathcal {X}}\):

$$\begin{aligned} \mathcal {L}(v,\gamma ) = \sum _{i=1}^n \sum _{x_i=1}^{k_i-1} v_i(x_i) \mu _i(x_i) + \sum _{x \in {\mathcal {X}}} \gamma (x) \bigg ( H(x) -H(0) - \sum _{i=1}^n v_i(x_i) \delta (x_i>0)\bigg ). \end{aligned}$$

By maximizing with respect to the primal variable v, the dual problem thus becomes the one of minimizing \(\sum _{x \in {\mathcal {X}}} \gamma (x) \big [H (x)- H(0) \big ]\) such that

$$\begin{aligned} \forall i \in \{1,\dots ,n\}, \ \forall x_i \in \{1,\dots ,k_i-1\},&\gamma _i(x_i) = \mu _i(x_i) \\ \forall x \in {\mathcal {X}}\backslash \{(k_1-1,\dots ,k_n-1)\},&\gamma (x) \geqslant 0, \end{aligned}$$

where \(\gamma _i\) is the marginal of \(\gamma \) on the i-th variable, that is, \(\gamma _i(x_i) = \sum _{ x_j \in {\mathcal {X}}_j, j \ne i} \gamma (x_1,\dots ,x_n)\).

We now exhibit primal/dual pairs with equal objective values. For the dual variable \(\gamma \), we consider the solution of the usual optimal transport problem (which happens to satisfy extra constraints, that is, nonnegativity also for \(x=0\) and summing to one), and the dual value is exactly the extension \(h(\mu )- H(0)\). For the primal variable, we consider the w parameterization (which is in bijection with v). From the greedy algorithm described in Sect. 4.1, the vector w satisfies the sum constraint \(\sum _{i=1}^n \sum _{y_i = 1}^{k_i-1} w_i(y_i) = H(k_1-1,\dots ,k_n-1) -H(0)\). We simply need to show that for all \(x \in {\mathcal {X}}\), \( \sum _{i=1}^n \sum _{z_i = 1}^{x_i} w_i(z_i) \leqslant H(x) - H(0)\). We have, using the notation of the greedy algorithm from Sect. 4.1:

$$\begin{aligned}&\sum _{i=1}^n \sum _{z_i = 1}^{x_i} w_i(z_i) \\&\quad = \sum _{s=1}^r \big ( H[y(s)] - H[y(s-1)] \big ) \delta (y(s)_{i(s)} \leqslant x_{i(s)} ) \hbox { by definition of } i(s) \hbox { and } y(s),\\&\quad = \sum _{a=1}^n \sum _{s, i(s) = a} \big ( H[y(s)] - H[y(s-1)] \big ) \delta (y(s)_{a} \leqslant x_{a} ) \hbox { by splitting the values of } i(s),\\&\quad = \sum _{a=1}^n \sum _{s, i(s) = a} \big ( H[y(s-1) + e_a] - H[y(s-1)] \big ) \delta (y(s)_{a} \leqslant x_{a} ) \\&\qquad \hbox { because we go from } y(s-1) \hbox { to } y(s) \hbox { by incrementing the component } i(s)=a, \\&\quad \leqslant \sum _{a=1}^n \sum _{s, i(s) = a} \big ( H[\min \{y(s-1) + e_a,x\} ] - H[\min \{y(s-1), x\} ] \big ) \delta (y(s-1)_a + 1 \leqslant x_{a} ) \\&\qquad \hbox { by submodularity and because } y(s)_{i(s)} = y(s-1)_{i(s)} + 1,\\&\quad = \sum _{a=1}^n \sum _{s, i(s) = a} \big ( H[\min \{y(s-1) + e_a,x\} ] - H[\min \{y(s-1), x\} ] \big ) \\&\qquad \hbox { because the difference in values of } H \hbox { is equal to zero for } y(s-1)_a + 1 > x_{a}, \\&\quad = \sum _{s=1}^r \big ( H[\min \{y(s) ,x\} ] - H[\min \{y(s-1), x\} ] \big ) \\&\quad = H[\min \{x,y(r)\}] - H[\min \{x,y(0)\}] = H(x)-H(0) . \end{aligned}$$

Thus, w is feasible. By construction the primal value is equal to \(h(\mu )-H(0)\). We thus have a primal/dual optimal pair and the result is proved for non-decreasing \(\rho _i\)’s. This notably shows that the polyhedron \(\mathcal {W}(H)\) is non-empty.

We can now show that if one \(\rho _i\) is not non-decreasing, then the supremum is equal to infinity. In such a case, then there exists \(x_i \in \{1,\dots ,k_i-2\}\) such that \(\mu _i(x_i) < 0 \). We may then let the corresponding \(v_i(x_i)\) tend to \(-\infty \).

Appendix A.4: Proof of Proposition 7

In the statements above, \(\langle w, \rho \rangle \) stands for \(\sum _{i=1}^n \sum _{x_i=1}^{k_i-1} w_i(x_i) \rho _i(x_i)\). The statements (a) and (b) were shown in the proof of Proposition 6. We may also show another representation: \(\displaystyle \mathcal {W}(H) = \mathcal {B}(H) + \mathcal {K}\) with\( \mathcal {K}\) the set of \( w \in \prod _{i=1}^n \mathbb {R}^{k_i-1}\) such that \(\forall i \in \{1,\dots ,n\}, \ \forall x_i \in \{1,\dots ,k_i-2\}, \ \sum _{y_i=1}^{x_i} w_i(y_i) \leqslant 0 \) and \(\sum _{y_i=1}^{k_i-1} w_i(y_i) = 0\).

Statement (c) is a simple consequence of the fact that the polar cone to \(\prod _{i=1}^n \mathbb {R}^{k_i-1}_\downarrow \) is what needs to be added to \(\mathcal {B}(H)\) to get to \(\mathcal {W}(H)\). In other words, (c) is the equality of two convex sets and is thus equivalent to the equality of their support functions; and we have, for any \(\rho _i \in \mathbb {R}^{k_i-1}\), using Abel’s summation formula:

$$\begin{aligned} \langle w_i, \rho _i \rangle= & {} \sum _{x_i=1}^{k_i-1} w_i(x_i) \rho _i(x_i) = \sum _{x_i=1}^{k_i-2} \bigg ( \sum _{y_i=1}^{x_i} w_i(y_i) \bigg ) \big [\rho _i(x_i) - \rho _i(x_i+1) \big ] \\&+ \bigg ( \sum _{y_i=1}^{x_i-1} w_i(y_i) \bigg ) \rho _i(k_1 - 1), \end{aligned}$$

from which we see that the supremum of \(\langle w, \rho \rangle \) with respect to \(\rho \in \prod _{i=1}^n \mathbb {R}^{k_i-1}_\downarrow \) is equal to zero if \( w \in \mathcal {K}\) and \(+\infty \) otherwise. By convex duality, this implies that the supremum of \(\langle w, \rho \rangle \) with respect to \(w \in \mathcal {K}\) is equal to zero if \(\rho \) has non-increasing components and zero otherwise. We thus have, for any \(\rho \in \prod _{i=1}^n \mathbb {R}^{k_i-1}\), \(\sup _{ w \in \mathcal {B}(H) + \mathcal {K} } \langle w, \rho \rangle = h_\downarrow (\rho )\) if \(\rho \in \prod _{i=1}^n \mathbb {R}^{k_i-1}_\downarrow \) and \(+\infty \) otherwise. Given Proposition 6, this leads to the desired result.

Appendix A.5: Proof of Theorem 3

The first statement (a) is a classical consequence of submodularity [54]. The main idea is that when we go from t to \(t'<t\), then the function difference is increasing by convexity, hence the minimizer has to decrease.

For the second statement (b), we provide expressions for all elements of the gap, following the proof in [1, Prop. 8.5]. For \(M>0\) large enough, we have from Eq. (11):

$$\begin{aligned} h_\downarrow (\rho ) - H(0) = \int _{-M}^{+M} H\big [ \theta (\rho ,t) \big ] dt - M H(k_1-1,\dots ,k_n-1) - M H(0). \end{aligned}$$

Moreover, the integral of the i-th “non-H-dependent” part of the primal objective for the submodular minimization problem in Eq. (16) is equal to, for \(i \in \{1,\dots ,n\}\) and the primal candidate \(x_i = \theta (\rho _i,t)\):

$$\begin{aligned}&\int _{-M}^M \sum _{y_i=1}^{x_i} a'_{i y_i}(t) dt \\&\quad = \int _{-M}^{+ M} \bigg \{ \sum _{x_i=1}^{k_i-1} \delta ( \theta (\rho _i,t)= x_i) \sum _{y_i=1}^{x_i}a_{i y_i}' (t) \bigg \} dt \\&\quad = \int _{-M}^{\rho _i(k_i-1)}\sum _{y_i=1}^{k_i-1}a_{i y_i}' (t) dt + \sum _{s_i=2}^{k_i-1} \int _{\rho _i(s_i)}^{\rho _i(s_i-1)} \sum _{y_i=1}^{s_i-1}a_{i y_i}' (t) dt \hbox { by definition of } \theta , \\&\quad = \sum _{y_i=1}^{k_i-1}\big \{ a_{i y_i}( \rho _i(k-1)) - a_{i y_i}(-M) \big \} + \sum _{s_i=2}^{k_i-1} \sum _{y_i=1}^{s_i-1}\bigg \{ a_{i y_i} ( \rho _i(s_i-1) )- a_{i y_i} ( \rho _i(s_i) ) \bigg \} \\&\quad = - \sum _{y_i=1}^{k_i-1} a_{i y_i}(-M) + \sum _{x_i=1}^{k_i-1} a_{i x_i}(\rho _i(x_i)). \end{aligned}$$

We may now compute for any \(i \in \{1,\dots ,n\}\) and \(x_i \in \{0,\dots ,k_i-1\}\), the necessary pieces for the integral of the dual objective values from Eq. (12):

$$\begin{aligned}&\int _{-M}^M \min \{ w_i(x_i) + a'_{i x_i}(t), 0\} dt \\&\quad = \int _{-M}^{(a_{ix_i}^*)'(-w_i(x_i))} \big ( w_i(x_i) + a'_{i x_i}(t) \big ) dt \hbox { because } w_i(x_i) \\&\qquad +\,a'_{i x_i}(t) \leqslant 0 \Leftrightarrow t \leqslant (a_{i x_i}^*)'(-w_i(x_i)),\\&\quad = a_{i x_i}\big ( (a_{ix_i}^*)'(-w_i(x_i)) \big ) - a_{i x_i}(-M) + w_i(x_i) \big [ M + (a_{ix_i}^*)'(-w_i(x_i))\big ] \\&\quad = (\psi _{i y_{i}}^*)' ( - w_i(x_i)) ( - w_i(x_i)) - \psi _{i y_{i}}^*( - w_i(x_i))- a_{i y_i} (-M) \\&\qquad + w_i(x_i) \big [ M + (a_{ix_i}^*)'(-w_i(x_i))\big ] \\&\quad = -a_{i x_{i}}^*( - w_i(x_i)) - a_{i x_i} (-M) + w_i(x_i) M. \end{aligned}$$

By putting all pieces together, we obtain the desired result, that is,

$$\begin{aligned} h_\downarrow (\rho ) - H(0) + \sum _{x_i=1}^{k_i-1} a_{i x_i}(\rho _i(x_i)) + \sum _{i=1}^n \sum _{x_i=1}^{k_i-1} a_{i x_{i}}^*( - w_i(x_i)) \end{aligned}$$

is exactly the integral from \(-M\) to M of

$$\begin{aligned}&H\big [ \theta (\rho ,t) \big ] + \sum _{i=1}^n \sum _{x_i=1}^{k_i-1} \delta ( \theta (\rho _i,t)= x_i) \sum _{y_i=1}^{x_i}a_{i y_i}' (t) - H(0) \\&\quad - \sum _{i=1}^n \min \{ w_i(x_i) + a'_{i x_i}(t), 0\}. \end{aligned}$$

This implies (c). The statement (d) is proved exactly like for set-functions [1, Prop. 8.3].

Appendix B: Strongly-convex separable submodular function minimization: general case

In this appendix, we show how the separable problem described in Sect. 4.3 can seen from optimal transport theory for all possible sets \({\mathcal {X}}_i\), potentially non finite. We choose strongly convex separable costs of the form \(\sum _{i=1}^n \varphi _i(\mu _i)\), with:

$$\begin{aligned} \varphi _i(\mu _i) = \int _{{\mathcal {X}}_i} a_i(x_i, F_{\mu _i}(x_i)) dx_i, \end{aligned}$$

where for all \(x_i \in {\mathcal {X}}_i\), \(a_i(x_i,\cdot )\) is a differentiable \(\lambda \)-strongly convex function on [0, 1]. This implies that the function \(\varphi _i\), as a function of \(F_{\mu _i}\) is \(\lambda \)-strongly convex (for the \(L_2\)-norm on cumulative distribution functions). A key property is that it may be expressed as a transport cost, as we now show.

Proposition 8

(Convex functions of cumulative distributions) For \(a_i: {\mathcal {X}}_i \times [0,1] \rightarrow \mathbb {R}\) convex and differentiable with respect to the second variable, the function \(\varphi _i: \mathcal {P}({\mathcal {X}}_i) \rightarrow \mathbb {R}\) defined in Eq. (18) is equal to

$$\begin{aligned} \varphi _i(\mu _i) = \int _0^1 c_i( F_{\mu _i}^{-1} (t),1-t) dt, \end{aligned}$$

for \(\displaystyle c_i(z_i,t_i) = \int _{{\mathcal {X}}_i} a_i(x_i,0) dx_i + \int _{{\mathcal {X}}_i} \bigg ( \int _{{\mathcal {X}}_i \cap (-\infty , z_i] } \frac{\partial a_i}{\partial t_i}(x_i, 1- t_i) dx_i \bigg ) \) is a submodular cost on \({\mathcal {X}}_i \times [0,1]\).


We have the following sequence of equalities:

$$\begin{aligned} \varphi _i(\mu _i)= & {} \int _{{\mathcal {X}}_i} a_i(x_i, F_{\mu _i}(x_i)) dx_i \\= & {} \int _{{\mathcal {X}}_i} \bigg ( a_i(x_i,0) + \int _{0}^{F_{\mu _i}(x_i)} \frac{\partial a_i}{\partial t_i}(x_i,t_i) dt_i \bigg )dx_i \\= & {} \int _{{\mathcal {X}}_i} a_i(x_i,0) dx_i + \int _{{\mathcal {X}}_i} \bigg ( \int _{{\mathcal {X}}_i \cap [x_i,+\infty ) } \frac{\partial a_i}{\partial t_i}(x_i, F_{\mu _i}(y_i)) d\mu _i(y_i) \bigg )dx_i \\&\quad \hbox { by the change of variable } t_i = F_{\mu _i}(y_i), \\= & {} \int _{{\mathcal {X}}_i} a_i(x_i,0) dx_i + \int _{{\mathcal {X}}_i} \bigg ( \int _{{\mathcal {X}}_i \cap (-\infty , y_i] } \frac{\partial a_i}{\partial t_i}(x_i, F_{\mu _i}(y_i)) dx_i \bigg ) \\&\quad d\mu _i(y_i) \hbox { by Fubini's theorem} , \\= & {} \int _{{\mathcal {X}}_i} c_i(y_i, 1 - F_{\mu _i}(y_i)) d\mu _i(y_i) \hbox { by definition of } c_i , \\= & {} \int _{0}^1 c_i( F_{\mu _i}^{-1}(t),1-t) dt \hbox { by the change of variable } t_i = F_{\mu _i}(y_i). \end{aligned}$$

Moreover, \(c_i\) is indeed a submodular function as, for any \(z_i' > z_i\) in \({\mathcal {X}}_i\), we have \(c_i(z_i',t_i) - c_i(z_i,t_i) = \displaystyle \int _{{\mathcal {X}}_i} \bigg ( \int _{{\mathcal {X}}_i \cap (z_i, z_i'] } \frac{\partial a_i}{\partial t_i}(x_i, 1- t_i) dx_i \bigg ) \), which is a decreasing function in \(t_i\) because \(a_i\) is convex. Thus \(c_i\) is submodular. It is also strictly submodular if \(a_i(x_i,\cdot )\) is strongly convex for all \(x_i \in {\mathcal {X}}_i\). \(\square \)

Because \(c_i\) is submodular, we have, following Proposition 3, a formulation of \(\varphi _i\) as an optimal transport problem between the measure \(\mu _i\) on \({\mathcal {X}}_i\) and the uniform distribution U[0, 1] on [0, 1], as

$$\begin{aligned} \varphi _i(\mu _i) = \inf _{\gamma _i \in \mathcal {P}({\mathcal {X}}_i \times [0,1] } \int _{{\mathcal {X}}_i \times [0,1]} c_i(x_i,t_i) d \gamma _i(x_i,t_i), \end{aligned}$$

such that \(\gamma _i\) has marginals \(\mu _i\) and the uniform distribution on [0, 1]. It is thus always convex (as as the minimum of a jointly convex problem in \(\gamma _i\) and \(\mu _i\))—note that it is already convex from the definition in Eq. (19) and the relationship between \(a_i\) and \(c_i\) in Proposition 8.

We now consider the following problem:

$$\begin{aligned} \inf _{\mu \in \mathcal {P}^\otimes ({\mathcal {X}}) } h(\mu ) + \sum _{i=1}^n \varphi _i(\mu _i) , \end{aligned}$$

which is an optimization problem with h, with additional separable transport costs \(\varphi _i(\mu _i)\). Given our assumption regarding the strong-convexity of the functions \(a_i\) above, this system has a unique solution. We may derive a dual problem using the representation from Eq. (5):

$$\begin{aligned} \inf _{\mu \in \mathcal {P}^\otimes ({\mathcal {X}}) } h(\mu ) + \sum _{i=1}^n \varphi _i(\mu _i)= & {} \inf _{\mu \in \mathcal {P}^\otimes ({\mathcal {X}}) } \sup _{v \in \mathcal {V}(H)} \sum _{i=1}^n \bigg \{ \int _{{\mathcal {X}}_i} v_i(x_i) d \mu _i(x_i) + \varphi _i(\mu _i) \bigg \} \nonumber \\= & {} \sup _{v \in \mathcal {V}(H)} \sum _{i=1}^n \inf _{\mu _i \in \mathcal {P}({\mathcal {X}}_i) } \bigg \{ \int _{{\mathcal {X}}_i} v_i(x_i) d \mu _i(x_i) + \varphi _i(\mu _i) \bigg \} \nonumber \\= & {} \sup _{v \in \mathcal {V}(H)} - \sum _{i=1}^n\varphi _i^*( -v_i) , \end{aligned}$$

where we use the Fenchel-dual notation \(\varphi _i^*( v_i) = \sup _{\mu _i \in \mathcal {P}({\mathcal {X}}_i) } \{ \int _{{\mathcal {X}}_i} v_i(x_i) d \mu _i(x_i) - \varphi _i(\mu _i) \}\). The equation above provides a dual problem to Eq. (20). We may also consider a family of submodular minimization problems, parameterized by \(t \in [0,1]\):

$$\begin{aligned} \min _{x \in {\mathcal {X}}} \ H(x) + \sum _{i=1}^n c_i( x_i,1- t), \end{aligned}$$

with their duals defined from Eq. (9). Note that the two dual problems are defined on the same set \(\mathcal {V}(H)\). We can now prove the following theorem relating the two optimization problems.

Theorem 4

(Separable optimization—general case) Assume H is continuous and submodular and all \(c_i\), \(i=1,\dots ,n\) are defined as in Proposition 8. Then:

  1. (a)

    If x and \(x'\) are minimizers of Eq. (22) for \(t> t'\), then \(x \leqslant x'\) (for the partial order on \(\mathbb {R}^n\)), i.e., the solutions of Eq. (22) are non-increasing in \(t \in [0,1]\).

  2. (b)

    Given a primal candidate \(\mu \in \mathcal {P}^\otimes ({\mathcal {X}})\) and a dual candidate \(v \in \mathcal {V}(H)\), then the duality gap for the problem in Eq. (20) is the integral from \(t=0\) to \(t=1\) of the gaps for the problem in Eq. (22) for the same dual candidate and the primal candidate \((F_{\mu _1}^{-1}(t),\dots , F_{\mu _n}^{-1}(t)) \in {\mathcal {X}}\).

  3. (c)

    Given the unique solution \(\mu \) of Eq. (20), for all \(t \in [0,1]\), \((F_{\mu _1}^{-1}(t),\dots , F_{\mu _n}^{-1}(t)) \in {\mathcal {X}}\) is a solution of Eq. (22).

  4. (d)

    Given any solutions \(x^t \in {\mathcal {X}}\) for all problems in Eq. (22), we may define \(\mu \) through \(F_{\mu _i}(x_i) = \sup \big \{ t \in [0,1], \ x_i^t \geqslant x_i \big \}\), for all i and \(x_i \in {\mathcal {X}}_i\), so that \(\mu \) is the optimal solution of Eq. (20).


The first statement (a) is a direct and classical consequence of the submodularity of \(c_i\) [55, Section 2.8]. The main idea is that when we go from t to \(t'<t\), then the function difference, i.e., \(x_i \mapsto c_i( x_i, t') - c_i( x_i, t)\) is strictly increasing, hence the minimizer has to decrease.

For the second statement (b), we may first re-write the cost function in Eq. (20) as an integral in t, that is, for any \(\mu \in \mathcal {P}^\otimes ({\mathcal {X}})\):

$$\begin{aligned} h(\mu ) + \sum _{i=1}^n \varphi _i(\mu _i)= & {} \int _0^1 \bigg \{ H\big [ F_{\mu _1}^{-1}(t),\dots , F_{\mu _n}^{-1}(t) \big ]+ \sum _{i=1}^n c_i\big [ F_{\mu _i}^{-1} (t), 1-t \big ] \bigg \} dt. \end{aligned}$$

The gap defined in Proposition 4 for a single submodular minimization problem in Eq. (22) is, for a primal candidate \(x \in {\mathcal {X}}\) and a dual candidate \(v \in \mathcal {V}(H)\):

$$\begin{aligned}&H(x) + \sum _{i=1}^n c_i\big [ x_i, F_{\mu _i}^{-1} (t) \big ] - \sum _{i=1}^n \min _{y_i \in {\mathcal {X}}_i} \Big \{ v_i(y_i) + c_i \big [ y_i, 1-t\big ]\Big \}, \end{aligned}$$

and its integral with respect to \(t \in [0,1]\) for \(x_i = F_{\mu _i}^{-1}(t)\), for all \(i \in \{1,\dots ,n\}\), is equal to

$$\begin{aligned}&h(\mu ) + \sum _{i=1}^n \varphi _i(\mu _i) - \sum _{i=1}^n \int _0^1 \min _{y_i \in {\mathcal {X}}_i} \Big \{ v_i(y_i) + c_i \big [ y_i, 1-t \big ]\Big \} dt . \end{aligned}$$

Finally, we have, by using the formulation of \(\varphi _i(\mu _i)\) through optimal transport, an expression of the elements appearing in the dual problem in Eq. (21):

$$\begin{aligned} - \varphi _i^*(-v_i)= & {} \inf _{\mu _i \in \mathcal {P}_i({\mathcal {X}}_i) } \bigg \{ \int _{{\mathcal {X}}_i} v_i(x_i) d \mu _i(x_i) + \varphi _i(\mu _i) \bigg \} \\= & {} \inf _{\mu _i \in \mathcal {P}_i({\mathcal {X}}_i) } \inf _{\gamma _i \in {\varPi }(\mu _i,U[0,1]) } \int _{{\mathcal {X}}_i \times [0,1]} \big [ v_i(x_i) + c_i(x_i,1-t_i) \big ] d \gamma _i(x_i,t_i) \\= & {} \int _0^1 \inf _{y_i \in {\mathcal {X}}_i} \big [ v_i(y_i) + c_i(y_i,1-t_i) \big ] d t_i, \end{aligned}$$

because we may for any \(t_i\) choose the conditional distribution of \(x_i\) given \(t_i\) equal to a Dirac at the minimizer \(y_i\) of \(v_i(y_i) + c_i(y_i,1-t_i)\). This implies (b) and thus, by continuity, (c). The statement (d) is proved exactly like for set-functions [1, Prop. 8.3]. \(\square \)

In Sect. 4.2, we will consider a formulation for finite sets that will exactly recover the set-function case, with the additional concept of base polytopes.

Appendix C: Divide-and-conquer algorithm for separable optimization

We consider the optimization problem studied in Sect. 4.3:

$$\begin{aligned} \min _{\rho } h_\downarrow (\rho ) + \sum _{i=1}^n \sum _{x_i=1}^{k_i-1} a_{i x_i}\big [\rho _i(x_i) \big ] \hbox { such that } \rho \in \prod _{i=1}^n \mathbb {R}^{k_i-1}_{\downarrow } , \end{aligned}$$

whose dual problem is the one of maximizing \( \sum _{i=1}^n \sum _{x_i=1}^{k_i-1} -a_{i x_i}^*(-w_i(x_i))\) such that \(w \in \mathcal {W}(H)\). We assume that all functions \(a_{ix_i}\) are strictly convex and differentiable with a Fenchel-conjugate with full domain. From Theorem 3, we know that it is equivalent to a sequence of submodular minimization problems of the form:

$$\begin{aligned} \min _{x \in {\mathcal {X}}} H(x) + \sum _{i=1}^n \sum _{ y_i = 1}^{x_i} a_{i y_i}' (t), \end{aligned}$$

i.e., minimizing H plus a modular function. We now show that by solving a sequence of such problems (with added restrictions on the domains of the variables), we recover exactly the solution \(\rho \) of the original problem. This algorithm directly extends the one from [22], and we follow the exposition from [1, Section 9.1]. The recursive algorithm is as follows:

  1. (1)

    Find the unique global maximizer of \(\displaystyle - \sum _{i=1}^n \sum _{x_i=1}^{k_i-1} a_{ix_i}^*(-w_i(x_i))\) with the single constraint that \(\displaystyle \sum _{i=1}^n \sum _{x_i=1}^{k_i-1} w_i(x_i) = H(k_1,\dots ,k_n) - H(0)\). This can be typically obtained in closed form, or through the following one-dimensional problem (obtained by convex duality), \(\displaystyle \min _{ t \in \mathbb {R}} \big [ H(k_1-1,\dots ,k_n-1) - H(0) \big ] t + \sum _{i=1}^n \sum _{x_i=1}^{k_i-1} a_{ix_i}(t) \), with \(w_i(x_i)\) then obtained as \(w_i(x_i) = - a_{i x_i}'(t)\).

  2. (2)

    Find an element \(y \in {\mathcal {X}}\) that minimizes \(\displaystyle H(y) - \sum _{i=1}^n \sum _{z_i=1}^{y_i} w_i(z_i)\). This is a submodular function minimization problem.

  3. (3)

    If \(\displaystyle H(y) - \sum _{i=1}^n \sum _{z_i=1}^{y_i} w_i(z_i) = H(0)\), then exit (w is optimal).

  4. (4)

    Maximize \( \sum _{i=1}^n \sum _{x_i=1}^{y_i} -a_{i x_i}^*(-w_i(x_i))\) over \(w \in \mathcal {W}(H_{x \leqslant y })\), where \(H_{x \leqslant y}\) is the restriction of H to \(\{x \leqslant y\}\), to obtain \(w^{\{x \leqslant y\}}\).

  5. (5)

    Maximize \( \sum _{i=1}^n \sum _{x_i=y_i+1}^{k_i-1} -a_{i x_i}^*(-w_i(x_i))\) over \(w \in \mathcal {W}(H_{x \geqslant y+1 })\), where \(H_{x \geqslant y+1}\) is the restriction of H to \(\{x \geqslant y+1\}\), to obtain \(w^{\{x \geqslant y+1\}}\).

  6. (6)

    Concatenate the two vectors \(w^{\{x \geqslant y+1\}}\) and \(w^{\{x \leqslant y\}}\) into w, which is the optimal solution.

The proof of correctness is the same as for set-functions [1, Section 9.1]: if the algorithm stops at step (3), then we indeed have the optimal solution because the optimum on a wider set happens to be in \(\mathcal {W}(H)\). Since the optimal w in (1) is such that \(w_i(x_i) = - a_{i x_i}'(t)\) for all i and \(x_i\) and a single real number t, the problem solved in (2) corresponds to one of the submodular function minimization problems from Theorem 3. From the statement (d) of that theorem, we know that the optimal primal solution \(\rho \) will be such that \(\rho _i(x_i) \geqslant t\) for \(x_i \leqslant y_i\) and \(\rho _i(x_i) \leqslant t\) for \(x_i \geqslant y_i+1\). Given the expression of \(h_\downarrow (\rho )\) obtained from the greedy algorithm, if we impose that \(\min _{ x_i \leqslant y_i} \rho _i(x_i) \geqslant \max _{ x_i \geqslant y_i+1} \rho _i(x_i) \), the function \(h_\downarrow (\rho )\) is the sum of two independent terms and the minimization of the primal problem can be done into two separate pieces, which correspond exactly to \(h_{x \leqslant y }^\downarrow (\rho ^{x \leqslant y })\) and \(h_{x \geqslant y+1 }^\downarrow ( \rho ^{x \geqslant y+1 })\). If we minimize the two problems separately, like done in steps (4) and (5), we thus only need to check that the decoupled solutions indeed have the correct ordering, that is \(\min _{ x_i \leqslant y_i} \rho _i(x_i) \geqslant \max _{ x_i \geqslant y_i+1} \rho _i(x_i) \), which is true because of Theorem 3 applied to the two decoupled problems with the same value of t.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Bach, F. Submodular functions: from discrete to continuous domains. Math. Program. 175, 419–459 (2019).

Download citation


  • Submodularity
  • Optimal transport
  • Continuous optimization

Mathematics Subject Classification

  • 90C25
  • 90C27