
Convex Optimization

  • Francisco J. Aragón
  • Miguel A. Goberna
  • Marco A. López
  • Margarita M. L. Rodríguez
Part of the Springer Undergraduate Texts in Mathematics and Technology book series (SUMAT)

Convex optimization deals with problems of the form
$$\begin{aligned} \begin{array}{lll} P: &{} \text {Min} &{} f\left( x\right) \\ &{} \text {s.t.} &{} x\in F, \end{array} \end{aligned}$$
(4.1)
where \(\emptyset \ne F\subset \mathbb {R}^{n}\) is a convex set and \(f:F\rightarrow \mathbb {R}\) is a convex function. In this chapter, we analyze four particular cases of problem P in (4.1):

(a) Unconstrained convex optimization, where the constraint set F represents a given convex subset of \(\mathbb {R}^{n}\) (such as \(\mathbb {R} _{++}^{n} \)).

(b) Convex optimization with linear constraints, where
$$\begin{aligned} F=\left\{ x\in \mathbb {R}^{n}:g_{i}\left( x\right) \le 0,i\in I\right\} , \end{aligned}$$
with \(I=\left\{ 1,\ldots , m\right\} \), \(m\ge 1\), and \(g_{i}\) are affine functions for all \(i\in I\). In this case, F is a polyhedral convex set (an affine manifold in the particular case where F is the solution set of a system of linear equations, as each equation can be replaced by two inequalities).
(c) Convex optimization with inequality constraints, where
$$\begin{aligned} F=\left\{ x\in C:g_{i}\left( x\right) \le 0,i\in I\right\} , \end{aligned}$$
with \(\emptyset \ne C\subset \mathbb {R}^{n}\) being a given convex constraint set (\(C=\mathbb {R}^{n}\) by default) and \(g_{i}:C\rightarrow \mathbb {R} \) being convex functions for all \(i\in I\). Observe that \(F=\cap _{i\in I}S_{0}\left( g_{i}\right) \) is convex, as each sublevel set \(S_{0}\left( g_{i}\right) =\left\{ x\in C:g_{i}\left( x\right) \le 0\right\} \) is convex by the convexity of \(g_{i}\), \(i\in I\). If C is closed and \(g_{i}\) is continuous on C for all \(i\in I\), then F is closed by the closedness of \( S_{0}\left( g_{i}\right) \), \(i\in I\) (if \(C=\mathbb {R}\), all constraint functions \(g_{i}\) are continuous due to their Lipschitz continuity on open bounded intervals; in fact, it can be proved that any convex function on \(C= \mathbb {R}^{n}\) is also continuous).

Sensitivity analysis gives an answer to “what if” questions such as “how do small perturbations in some part of the data affect the optimal value of P?”. Subsection 4.4.1 tackles this question when the perturbations affect the right-hand side of the constraints (representing the available resources when P models a production planning problem). Subsections 4.4.3 and 4.4.4 introduce two different dual problems for P and provide their corresponding strong duality theorems.

(d) Conic optimization, where F is defined by means of constraints involving closed convex cones in certain Euclidean spaces. This type of convex optimization problem is very important in practice and can be solved by means of efficient numerical methods based on advanced mathematical tools. Since such problems can hardly be solved with pen and paper, the purpose of Section 4.5 is to introduce the reader to this fascinating field through the construction of dual problems for this class of problems and the review of the corresponding strong duality theorems.

4.1 The Optimal Set

The next result concerns the optimal set \(F^{*}=\mathrm{argmin}\left\{ f\left( x\right) :x\in F\right\} \) of \(P\).

Proposition 4.1

(The optimal set in convex optimization) Let P be the convex optimization problem in (4.1). Then:

(i) \(F^{*}\) is a convex set.

(ii) If f is strictly convex on F, then \(\left| F^{*}\right| \le 1\).

(iii) If F is an unbounded closed set such that \(\mathop {\mathrm{int}}F\ne \emptyset \) and f is strongly convex and continuous on F, then \(\left| F^{*}\right| =1\).

(iv) If f is differentiable on \(\mathop {\mathrm{int}}F\ne \emptyset \) and continuous on F, then
$$\begin{aligned} \left\{ x\in \mathop {\mathrm{int}}F:\nabla f\left( x\right) =0_{n}\right\} \subset F^{*}. \end{aligned}$$
(v) If F is open and f is differentiable on F, then
$$\begin{aligned} F^{*}=\left\{ x\in F:\nabla f\left( x\right) =0_{n}\right\} . \end{aligned}$$

Proof

(i) We can assume that \(F^{*}\ne \emptyset \). Then, due to the convexity of F and f, \(F^{*}=S_{v\left( P\right) }\left( f\right) \) is convex.

(ii) Assume that f is strictly convex on F and there exist \(x, y\in F^{*}\), with \(x\ne y\). Then, \(\frac{x+y}{2}\in F^{*}\) and
$$\begin{aligned} f\left( \frac{x+y}{2}\right) <\frac{1}{2}f\left( x\right) +\frac{1}{2} f\left( y\right) =v\left( P\right) , \end{aligned}$$
which is a contradiction.

(iii) Since F is an unbounded closed set and f is coercive by Proposition  2.60, we have \(\left| F^{*}\right| \ge 1\). Then, \(\left| F^{*}\right| =1\) by (ii).

(iv) We must show that any critical point \(\overline{x}\in \mathop {\mathrm{int}}F\) of f is a global minimum. In fact, by Proposition  2.40, for any \( x\in \mathop {\mathrm{int}}F\), one has
$$\begin{aligned} f\left( x\right) \ge f\left( \overline{x}\right) +\nabla f\left( \overline{x }\right) ^{T}\left( x-\overline{x}\right) =f\left( \overline{x}\right) . \end{aligned}$$
(4.2)
Given \(x\in F\), as \(F\subset \mathop {\mathrm{cl}}F=\mathop {\mathrm{cl}}\mathop {\mathrm{int}}F\) by Proposition  2.4, there exists a sequence \(\left\{ x_{r}\right\} \subset \mathop {\mathrm{int}}F\) such that \(x_{r}\rightarrow x\). By (4.2), \(f\left( \overline{x}\right) \le f\left( x_{r}\right) \) for all r and, taking limits, we get \(f\left( \overline{x}\right) \le f\left( x\right) \). Hence, \( \overline{x}\in F^{*}\).

(v) The inclusion \(\left\{ x\in F:\nabla f\left( x\right) =0_{n}\right\} \subset F^{*}\) is a consequence of (iv) applied to all \(x\in F\), while the reverse inclusion follows from Fermat’s principle.       \(\square \)

Example 4.2

In Example  1.40, we considered the function
$$\begin{aligned} f\left( x\right) =\sqrt{x^{2}+a^{2}}+\sqrt{\left( x-1\right) ^{2}+b^{2}}, \end{aligned}$$
with \(a, b\in \mathbb {R}\), \(a<b\), in order to prove the reflection law. Since \(f^{\prime \prime }\) is positive on \(\mathbb {R}\), f is strictly convex on \(\mathbb {R}\). Thus, f attains its minimum at a unique point (the critical point \(\bar{x}=\frac{a}{a+b}\)).

The same happens with the function \(f\left( x\right) =\frac{\sqrt{x^{2}+a^{2} }}{v_{1}}+\frac{\sqrt{\left( x-1\right) ^{2}+b^{2}}}{v_{2}}\) used to prove the refraction law.

Example 4.3

Let \(f:F=\mathbb {R}\times ] 0,2 [ \rightarrow \mathbb {R}\) be given by \(f\left( x, y\right) =\ln y+\frac{1+x^{2}}{y}\). Since
$$\begin{aligned} \nabla ^{2}f\left( x, y\right) =\left[ \begin{array}{cc} \frac{2}{y} &{} -\frac{2x}{y^{2}} \\ -\frac{2x}{y^{2}} &{} \frac{2\left( x^{2}+1\right) }{y^{3}}-\frac{1}{y^{2}} \end{array} \right] \Rightarrow \det \nabla ^{2}f\left( x, y\right) =\frac{2}{y^{3}} \left( \frac{2}{y}-1\right) >0 \end{aligned}$$
for all \(y\in ] 0,2 [ \), \(\nabla ^{2}f\) is positive definite on F and f is strictly convex on \(F\). According to Proposition 4.1(ii), \(\left| F^{*}\right| \le 1\). Indeed, by (v), \(F^{*}=\left\{ {\left( 0,1\right) }^{T} \right\} \). Observe that the existence of optimal solution does not follow from (iii) as F is not closed and f is not strongly convex because
$$\begin{aligned} \lim _{r\rightarrow \infty }\nabla ^{2}f\left( 0,2-\frac{1}{r}\right) =\left[ \begin{array}{cc} 1 &{} 0 \\ 0 &{} 0 \end{array} \right] . \end{aligned}$$
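Such computations are easy to double-check numerically. The following sketch (our own; the helper names grad_f and hess_f are not from the text) evaluates the closed-form gradient and Hessian of the function of Example 4.3 at the claimed minimizer \({\left( 0,1\right) }^{T}\) and tests the sign of \(\det \nabla ^{2}f\) on F:
```python
# Numerical check of Example 4.3: the gradient of f(x, y) = ln y + (1 + x^2)/y
# vanishes at (0, 1) and the Hessian is positive definite on R x ]0, 2[.
import numpy as np

def grad_f(x, y):
    return np.array([2.0 * x / y, 1.0 / y - (1.0 + x**2) / y**2])

def hess_f(x, y):
    return np.array([[2.0 / y, -2.0 * x / y**2],
                     [-2.0 * x / y**2, 2.0 * (x**2 + 1.0) / y**3 - 1.0 / y**2]])

print(grad_f(0.0, 1.0))                      # [0. 0.]  -> (0, 1) is a critical point
print(np.linalg.eigvalsh(hess_f(0.0, 1.0)))  # [1. 2.]  -> positive definite
for y in (0.5, 1.0, 1.9):                    # det = (2/y^3)(2/y - 1) > 0 for 0 < y < 2
    print(y, np.linalg.det(hess_f(0.3, y)) > 0)
```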

It is easy to see that the inclusion in Proposition 4.1(iv) may be strict (consider \(f\left( x\right) =e^{x}\) with F being any closed interval in \(\mathbb {R}\)).

We now show an immediate application of this result to systems of linear inequalities, while the next two sections provide applications to statistical inference and operations research.

A problem arising in signal processing consists in searching for points in the intersection of finitely many closed half-spaces which, frequently, have no point in common. This problem can be formulated as the computation of an approximate solution (in the least squares sense) to a (possibly inconsistent) linear system \(\left\{ Ax\le b\right\} \), where A is an \( m\times n\) matrix, \(b\in \mathbb {R}^{m}\), and \(\le \) denotes the partial ordering in \(\mathbb {R}^{m}\) such that \(y\le z\) when \(y_{i}\le z_{i}\) for all \(i=1,\ldots , m\). In formal terms,
$$\begin{aligned} P_{1}:\,\ \text {Min}_{x\in \mathbb {R}^{n}}\text { }\,\left\| \left( Ax-b\right) _{+}\right\| , \end{aligned}$$
where
$$\begin{aligned} \left( Ax-b\right) _{+}:={\left( \max \left\{ a_{1}^{T}x-b_{1}, 0\right\} ,\ldots ,\max \left\{ a_{m}^{T}x-b_{m}, 0\right\} \right) }^{T} , \end{aligned}$$
with \(a_{i}^{T}\) denoting the ith row of A, \(i=1,\ldots , m\).
Let us illustrate the meaning of \((Ax-b)_+\) with a simple example, with \(n=2\), \(m=3\) and \(\Vert a_i\Vert =1\), \(i=1,2,3\). Let \(S_i:=\{x\in \mathbb {R}^2:a_i^Tx\le b_i\}\), \(i=1,2,3\), and assume that \(\cap _{i=1}^3 S_i=\emptyset \). Given \(x\in \mathbb {R}^2\), we have (see, e.g., (4.19) below) that
$$\max \left\{ a_i^Tx-b_i, 0\right\} =\left\{ \begin{array}{ll} a_i^Tx-b_i=d(x,\mathop {\mathrm{bd}} S_i),&{} \text {if }x\not \in S_i,\\ 0,&{} \text {otherwise}.\end{array}\right. $$
Hence, for x as in Fig. 4.1, we get \((Ax-b)_+={(d_1,0,d_3)}^{T}\).
Fig. 4.1

Illustration of \((Ax-b)_+\)

Corollary 4.4

(Least squares solution to an inconsistent linear inequality system) The optimal solutions to \(P_{1}\) are the solutions to the nonlinear system of equations
$$\begin{aligned} A^{T}\left( Ax-b\right) _{+}=0_{n}. \end{aligned}$$
(4.3)

Proof

We replace \(P_{1}\) by the equivalent unconstrained optimization problem
$$\begin{aligned} \begin{array}{lll} P_{2}:&\text {Min}_{x\in \mathbb {R}^{n}}&f\left( x\right) =\left\| \left( Ax-b\right) _{+}\right\| ^{2}. \end{array} \end{aligned}$$
Denoting by \(p_{+}:\mathbb {R}\rightarrow \mathbb {R}\) the positive part function, that is, \(p_{+}\left( y\right) :=\max \left\{ y, 0\right\} \), and by \(h_{i}:\mathbb {R}^{n}\rightarrow \mathbb {R}\) the affine function \(h_{i}\left( x\right) :=a_{i}^{T}x-b_{i}\), \(i=1,\ldots , m\), we can write \(f=\sum _{i=1}^{m}\left( p_{+}^{2}\circ h_{i}\right) \). Since \( p_{+}^{2}\) is convex and differentiable, with
$$\begin{aligned} \frac{dp_{+}^{2}\left( y\right) }{dy}=2p_{+}\left( y\right) ,\quad \forall y\in \mathbb {R}, \end{aligned}$$
while \(\nabla h_{i}\left( x\right) =a_{i}\) for all \(x\in \mathbb {R}^{n}, i=1,\ldots , m\), we deduce that f is convex and differentiable as it is the sum of m functions satisfying these properties, with
$$\begin{aligned} \nabla f\left( x\right) =2\mathop {\displaystyle \sum }\limits _{i=1}^{m}\left( a_{i}^{T}x-b_{i}\right) _{+}a_{i}=2A^{T}\left( Ax-b\right) _{+}. \end{aligned}$$
The conclusion follows from Proposition 4.1(v).       \(\square \)

Three elementary proofs of the existence of solution for the nonlinear system in (4.3) can be found in [25], but the analytical computation of such a solution is usually a hard task. The uniqueness is not guaranteed as the objective function of \(P_{2}\) is not strictly convex.

Example 4.5

The optimality condition (4.3) for the inconsistent system
$$\begin{aligned} \left\{ x_{1}\le -1,-x_{1}\le -1,-x_{2}\le -1,x_{2}\le -1\right\} \end{aligned}$$
reduces to
$$\begin{aligned} \left( \begin{array}{c} \left( x_{1}+1\right) _{+}-\left( 1-x_{1}\right) _{+} \\ -\left( 1-x_{2}\right) _{+}+\left( x_{2}+1\right) _{+} \end{array} \right) =\left( \begin{array}{c} 0 \\ 0 \end{array} \right) , \end{aligned}$$
i.e., \(\left( x_{1}+1\right) _{+}=\left( 1-x_{1}\right) _{+}\) and \(\left( x_{2}+1\right) _{+}=\left( 1-x_{2}\right) _{+}\), whose unique solution is \(\left( 0,0\right) \) (see Fig. 4.2).
Fig. 4.2

Solution of \((x+1)_+=(1-x)_{+}\)

Example 4.6

Consider now the inconsistent system
$$\begin{aligned} \left\{ x_{1}\le -1,-x_{1}\le -1,x_{2}\le 1\right\} . \end{aligned}$$
The optimality condition (4.3) is
$$\begin{aligned} \left( \begin{array}{c} \left( x_{1}+1\right) _{+}-\left( -x_{1}+1\right) _{+} \\ \left( x_{2}-1\right) _{+} \end{array} \right) =\left( \begin{array}{c} 0 \\ 0 \end{array} \right) , \end{aligned}$$
which holds if and only if \(x_{1}=0\) and \(x_{2}\le 1\). Then, the set of least squares solutions of the above system is \(\left\{ 0\right\} \times ] -\infty , 1]\).
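Least squares solutions of such small systems can also be approximated numerically. The sketch below (our own illustration; the step size and the number of iterations are chosen by hand) runs plain gradient descent on the smooth function \(f\left( x\right) =\left\| \left( Ax-b\right) _{+}\right\| ^{2}\) of \(P_{2}\), using the gradient formula from the proof of Corollary 4.4, for the system of Example 4.6:
```python
# Gradient descent on f(x) = ||(Ax - b)_+||^2 for the inconsistent system of
# Example 4.6:  x1 <= -1,  -x1 <= -1,  x2 <= 1.
import numpy as np

A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
b = np.array([-1.0, -1.0, 1.0])

def grad(x):
    return 2.0 * A.T @ np.maximum(A @ x - b, 0.0)   # = 2 A^T (Ax - b)_+ (Corollary 4.4)

x = np.array([3.0, -2.0])                # arbitrary starting point
for _ in range(2000):
    x = x - 0.1 * grad(x)                # constant step size, chosen by hand

print(x)                                 # x1 ~ 0, x2 = -2 <= 1
print(A.T @ np.maximum(A @ x - b, 0.0))  # the optimality condition (4.3) holds (~ 0_2)
```
The second component never moves because its constraint is satisfied from the start, which is consistent with the solution set \(\left\{ 0\right\} \times ] -\infty , 1]\) computed above.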

4.2 Unconstrained Convex Optimization

This section presents two interesting applications of unconstrained convex optimization. The first one is a classical location problem posed by Fermat in the seventeenth century that gave rise to a vast literature illustrating the three paradigms of optimization: the geometrical, the analytical, and the numerical ones. The second application consists in a detailed resolution of an important optimization problem arising in statistical inference that is commonly solved in a nonrigorous way.

4.2.1 The Fermat–Steiner Problem

The following problem was first posed by Fermat to Torricelli: Given three points in the plane \(P_{1}\), \(P_{2}\), and \(P_{3}\), compute a fourth one P that minimizes the sum of distances to the first three points, i.e.,
$$\begin{aligned} P_{FS}:\text { Min }\ f\left( P\right) :=d\left( P, P_{1}\right) +d\left( P, P_{2}\right) +d\left( P, P_{3}\right) , \end{aligned}$$
where \(d\left( P, P_{i}\right) \) denotes the distance from P to \(P_{i}, i=1,2,3\). This type of problem frequently arises in practical situations, where P may represent the point where a water well should be sunk, a fire station must be installed, a rural hospital should be constructed, etc., with the purpose of serving three nearby villages. We can assume that \(P_{1}, P_{2}\), and \(P_{3}\) are not aligned (otherwise, the optimal solution of \(P_{FS}\) is obviously the one of the three given points which belongs to the segment determined by the other two).

The problem \(P_{FS}\) appears under different names in the mathematical literature (Fermat, Fermat–Steiner, Fermat–Torricelli, Fermat–Steiner–Weber, etc.), and its variants, generically called location (or p-center) problems, are still being studied by operational researchers and analysts. The most prominent of these variants are those that consider more than three given points, those which assign weights to the given points (e.g., number of residents at each village in the examples above), those which replace the Euclidean distance with non-Euclidean ones, those which replace the plane by spaces of greater (even infinite) dimension, those which replace the objective function by another one as \(d\left( P, P_{1}\right) ^{2}+d\left( P, P_{2}\right) ^{2}+d\left( P, P_{3}\right) ^{2}\) (in which case it is easy to prove that the optimal solution is the barycenter of the triangle of vertices \(P_{1}\), \(P_{2}\), and \(P_{3}\)) or \(\max \left\{ d\left( P, P_{1}\right) ,d\left( P, P_{2}\right) ,d\left( P, P_{3}\right) \right\} \) (frequently used when locating emergency services), etc. There exists an abundant literature on the characterization of the triangle centers as optimal solutions to suitable unconstrained optimization problems. One of the last centers to be characterized in this way has been the orthocenter (see [55]).

Solving an optimization problem by means of the geometric approach starts with a preliminary empirical phase, which is seldom mentioned in the literature. This provides a conjecture on the optimal solution (a point in the case of \(P_{FS}\)), which is followed by the conception of a rigorous compass and ruler constructive proof, in the style of Euclid's Elements. The limitation of this approach is the lack of a general method to get conjectures and proofs, which obliges the decision maker to conceive ad hoc experiments and proofs. We illustrate the geometric approach by solving \(P_{FS}\) for triples of points \(P_{1}\), \(P_{2}\), and \(P_{3}\) determining obtuse-angled triangles. We denote by \(\alpha _{i}\) the angle (expressed in degrees) corresponding to vertex \(P_{i}\), \(i=1,2,3\). Intuition suggests that when \(\max \left\{ \alpha _{1},\alpha _{2},\alpha _{3}\right\} \) is large enough, the optimal solution is the vertex corresponding to the obtuse angle. Indeed, we now show that this conjecture is true when \(\max \left\{ \alpha _{1},\alpha _{2},\alpha _{3}\right\} \ge 120\) degrees by means of a compass and ruler constructive proof.

Proposition 4.7

(Trivial Fermat–Steiner problems) If \(\alpha _{i}\ge 120\) degrees, then \(P_{i}\) is a global minimum of \(P_{FS} \).

Fig. 4.3

Sketch of the proof of Proposition 4.7

Proof

Assume that the angle at \(P_{2}\), say \(\alpha _{2}\), is not less than 120 degrees. By applying to \(P_{1}\) and to an arbitrary point \(P\ne P_{2}\) a counterclockwise rotation centered at \(P_{2}\), with angle \(\alpha :=180-\alpha _{2}\) degrees, one gets the image points \(P_{1}^{\prime }\) and \(P^{\prime }\), so that the points \(P_{1}^{\prime }, P_{2}\), and \(P_{3}\) are aligned (see Fig. 4.3).

We now recall two known facts: First, rotations preserve the Euclidean distance between pairs of points, and second, in any isosceles triangle whose unequal angle measures \(0< \alpha \le 60\) degrees, the length of the opposite side is less than or equal to the length of the equal sides. Hence, by the triangle inequality,
$$\begin{aligned} f(P_{2})&=d\left( P_{1}, P_{2}\right) +d\left( P_{2}, P_{3}\right) =d\left( P_{1}^{\prime }, P_{3}\right) \\&\le d\left( P_{1}^{\prime },P^{\prime }\right) +d\left( P^{\prime }, P_{3}\right) \\&\le d\left( P_{1},P\right) +d\left( P^{\prime },P\right) +d\left( P, P_{3}\right) \\&\le d\left( P_{1},P\right) +d\left( P, P_{2}\right) +d\left( P, P_{3}\right) =f\left( P\right) , \end{aligned}$$
so \(P_{2}\) is a global minimum of \(P_{FS}\).       \(\square \)

The converse statement claiming that the attainment of the minimum at a vertex entails that \(\max \left\{ \alpha _{1},\alpha _{2},\alpha _{3}\right\} \ge 120\) degrees can also be proved geometrically [68, pp. 280–282].

From now on, we assume that \(\max \left\{ \alpha _{1},\alpha _{2},\alpha _{3}\right\} <120\) degrees, i.e., that the minimum of f is not attained at one of the points \(P_{1}, P_{2}, P_{3}\). The first two solutions to \( P_{FS}\) under this assumption were obtained by Torricelli and his student Viviani, but were published in 1659, 12 years after the publication of Cavalieri's solution. Another interesting solution was provided by Steiner in 1842, which was later rediscovered by Gallai and, much later, by Hoffmann in 1929 (many results and methods have been rediscovered again and again before the creation of the first mathematical abstracting and reviewing services, Zentralblatt MATH, in 1930, and Mathematical Reviews, in 1940). Steiner's solution was based on successive 60 degree rotations in the plane, while Torricelli's consisted in maximizing the area of those equilateral triangles whose sides contain exactly one of the points \(P_{1}\), \(P_{2}\), and \(P_{3}\). A particular case of the latter problem was posed by Moss, in 1755, in the women's magazine The Ladies Diary or the Woman's Almanack, which suggests the existence of an important cohort of women with a good mathematical training during the Age of Enlightenment. The solution to the Torricelli–Moss problem, actually the dual problem to \(P_{FS}\), was rediscovered by Vecten, in 1810, and by Fasbender, in 1846, to whom it was mistakenly attributed until 1991.

The second approach used to solve \(P_{FS}\), called analogical, is inspired by the metaphysical principle of least action, which asserts that nature always works in the most economic way, i.e., by minimizing a certain magnitude: time in optics (as observed by Fermat), surface tension in soap bubbles, potential energy in gravitational fields, etc. The drawback of this analogical approach is that it is not easy to design an experiment involving some physical magnitude whose minimization is equivalent to the optimization problem to be solved. Moreover, nature does not compute global minima but local ones (in fact, critical points), as happens with beads threaded on wires, whose equilibrium points are the local minima of the height (recall that the potential energy of a particle of given mass m and variable height y is mgy, where g denotes the gravitational acceleration, so the gravitational field provides local minima of y on the curve represented by the wire). Moreover, these local minima might be attained in a parsimonious way, at least in the case of wildlife (evolution by natural selection is a very slow process).

The analogical approach has inspired many exotic numerical optimization methods imitating nature or even social behavior, but not always in a rigorous way. Such methods, called metaheuristic, are designed to generate, or select, a heuristic (partial search algorithm) that may provide a sufficiently good solution to a given optimization problem. In fact, “in recent years, the field of combinatorial optimization [where some decision variables take integer values] has witnessed a true tsunami of ‘novel’ metaheuristic methods, most of them based on a metaphor of some natural or man-made process. The behavior of virtually any species of insects, the flow of water, musicians playing together—it seems that no idea is too far-fetched to serve as inspiration to launch yet another metaheuristic” [82].

We illustrate the analogical approach by solving \(P_{FS}\) through two experiments that are based on the minimization of the surface tension of a soap bubble and the minimization of the potential energy of a set of mass points, respectively.

4.2.1.1 Minimizing the Surface Tension

The physicist Plateau witnessed, in 1840, a domestic accident caused by a servant who wrongly poured oil on a receptacle containing a mixture of alcohol and water. Intrigued by the spherical shape of the oil bubbles, Plateau started to perform various experiments with soapy water solutions, which led him to the conclusion that the surface tension of the soap film is proportional to its surface area, so that the bubbles are spherical when they are at equilibrium (i.e., when the pressure of the air contained in the bubble equals the atmospheric pressure). Due to the lack of analytic tools, he could not justify his conjecture, which was proved in the twentieth century by Radó, in 1930, and by Douglas, in 1931, independently.

To analogically solve \(P_{FS}\), it is sufficient to take two transparent plastic layers, a felt-tip pen, three sticks of equal length l, glue, and a soapy solution. Represent on both layers the three points \(P_{1}, P_{2}\), and \( P_{3};\) put both layers in parallel, at a distance l, so that each pair of points with the same label is vertically aligned, and link these pairs of points with the sticks and the glue. Submerging this frame in the soapy solution and taking it out, one obtains, at equilibrium, a minimal surface soap film linking the layers and the sticks. This surface is formed by rectangles of height l, so that the sum of areas is a local minimum for the area of perturbed soap films provided that the perturbations are sufficiently small (see Fig. 4.4). One of these equilibrium configurations corresponds to the triangular prism whose bases are the triangles of vertices \(P_{1}, P_{2}\), and \(P_{3}\) drawn on both layers, so in order to minimize the surface tension we may be obliged to repeat the experiment several times until we get the desired configuration, which provides the global minimum for \(P_{FS}\) on both layers. A similar construction allows one to solve the Fermat–Steiner problem with more than three points.
Fig. 4.4

Minimizing the surface tension

4.2.1.2 Minimizing the Potential Energy

The famous Scottish Book was a notebook used in the 1930s and 1940s by mathematicians of the Lwów School of Mathematics for collecting interesting solved, unsolved, and even probably unsolvable problems (Lwów is the Polish name of the, at present, Ukrainian city of Lviv). The notebook was named after the Scottish Café where it was kept. Among the active participants at these meetings, there were the functional analysts and topologists Banach, Ulam, Mazur, and Steinhaus, who conceived an ingenious experiment to solve the weighted Fermat–Steiner problem. This problem seems to have been first considered in the calculus textbook published by Thomas Simpson in 1750. Tong and Chua (1995) have proposed a geometric solution inspired by Torricelli's solution of \(P_{FS}\).

Steinhaus’ experiment requires a table, whose height will be taken as the unit length, a drill, and three pieces of string of unit length tied together at one of their ends. We assume that the three sides of the triangle of vertices \(P_{1}, P_{2}\), and \(P_{3}\) have length less than 1 (a rescaling may be necessary to get this assumption). Draw the three points \(P_{1}, P_{2}\), and \(P_{3}\) on the table, drill a hole at each of these points, and introduce one of the strings through each of the holes. Tie a small solid of mass \( \lambda _{i}\) to the string hanging from \(P_{i}\), \(i=1,2,3\) (see Fig. 4.5).
Fig. 4.5

Minimizing the gravitational potential energy

Assume that the knot is at P, placed at a distance \(d_{i}=d\left( P, P_{i}\right) \) from \(P_{i}\), and denote by \(y_{i}\) the height of the ith solid above the floor. Then, since \(y_{i}+1-d_{i}=1\) (i.e., \(y_{i}=d_{i}\)), the potential energy of the three-mass system is the product of the gravitational acceleration g times
$$\begin{aligned} \mathop {\displaystyle \sum }\limits _{i=1}^{3}\lambda _{i}y_{i}=\mathop {\displaystyle \sum }\limits _{i=1}^{3}\lambda _{i}d\left( P, P_{i}\right) . \end{aligned}$$
Taking \(\lambda _{1}=\lambda _{2}=\lambda _{3}\), the potential energy is proportional to \(f\left( P\right) :=d\left( P, P_{1}\right) +d\left( P, P_{2}\right) +d\left( P, P_{3}\right) \), subject to some constraints that do not need to be explicitly stated while P lies in the interior of the triangle determined by \(P_{1}\), \(P_{2}\), and \(P_{3}\), as they are not active at the optimal solution. To prove the existence and uniqueness of optimal solution, one needs an analytical argument based on coercivity and convexity. Nothing changes when one considers more than three points, whether weighted or not.

4.2.1.3 The Analytical Solution

Consider the plane equipped with rectangular coordinate axes. We can identify the four points \(P_{1}\), \(P_{2}\), \(P_{3}\), and P with \( x_{1}, x_{2}, x_{3}, x\in \mathbb {R}^{2}\), and \(P_{FS}\) is then formulated as:
$$\begin{aligned} P_{FS}:\ \text {Min}_{x\in \mathbb {R}^{2}}\text { }f\left( x\right) =\mathop {\displaystyle \sum }\limits _{i=1}^{3}\left\| x-x_{i}\right\| . \end{aligned}$$
We assume that \(x_{1}\), \(x_{2}\), \(x_{3}\) are not aligned and that the angles of the triangle with vertices \(x_{1}\), \(x_{2}\), \(x_{3}\) are all less than 120 degrees.

We start by exploring the properties of the objective function \( f=\sum _{i=1}^{3}f_{i}\), where \(f_{i}\left( x\right) =\left\| x-x_{i}\right\| \). Obviously, \(f_{i}\) is convex as it is the result of composing an affine function, \(x\mapsto x-x_{i}\), and a convex one (the norm), \(i=1,2,3\), so f is convex too. Thanks to the convexity of \(P_{FS}\), the local minima obtained via the analogical approach are actually global minima. We now prove that f is strictly convex and coercive.

We first show the strict convexity of f in an intuitive way. Consider the norm function, \(h\left( x\right) :=\left\| x\right\| \), whose graph, \( \mathop {\mathrm{gph}}h\), is the ice-cream cone of Fig. 4.6. Given \( x, d\in \mathbb {R}^{2}\), with \(d\ne 0_{2}\), the graph of the one variable real-valued function \(h_{x, d}\) given by \(h_{x, d}\left( t\right) =\left\| x+td\right\| \) (the section of h determined by x and d) is a branch of hyperbola if \(x\notin \mathop {\mathrm{span}}\left\{ d\right\} \), and it is the union of two half-lines if \(x\in \mathop {\mathrm{span}}\left\{ d\right\} \). So, \(h_{x, d}\) is strictly convex if and only if \(x\notin \mathop {\mathrm{span}} \left\{ d\right\} \). Consequently, the section of \(f_{i}\) determined by the line containing x in the direction of d is strictly convex if and only if \(x-x_{i}\notin \mathop {\mathrm{span}}\left\{ d\right\} \). Since the sum of two convex functions is strictly convex when one of them is strictly convex, all sections of \( f_{1}+f_{2}\) are strictly convex except in the case that the line containing x in the direction of d contains \(x_{1}\) and \(x_{2}\). Finally, \( f=\sum _{i=1}^{3}f_{i}\) is strictly convex due to the assumptions that \(x_{1}, \) \(x_{2}\), and \(x_{3}\) are not aligned, so at least one of the three (convex) sections is strictly convex.
Fig. 4.6

Intersection of \(\mathop {\mathrm{gph}}\Vert \cdot \Vert \) with vertical planes

We now prove the strict convexity of f analytically, by contradiction, exploiting the fact that, given two vectors \(u, v\in \mathbb {R}^{2}\setminus \left\{ 0_{2}\right\} \), if \(\left\| u+v\right\| =\left\| u\right\| +\left\| v\right\| \), then \(u^{T}v=\left\| u\right\| \left\| v\right\| \), and so, there exists \(\mu >0\) such that \(u=\mu v\). Suppose that f is not strictly convex. Let \(y, z\in \mathbb { R}^{2}\), \(y\ne z\) and \(\lambda \in ] 0,1 [ \), be such that
$$\begin{aligned} f\left( \left( 1-\lambda \right) y+\lambda z\right) =\left( 1-\lambda \right) f\left( y\right) +\lambda f(z), \end{aligned}$$
i.e.,
$$\begin{aligned} \sum _{i=1}^{3}f_{i}\left( \left( 1-\lambda \right) y+\lambda z\right) =\left( 1-\lambda \right) \sum _{i=1}^{3}f_{i}\left( y\right) +\lambda \sum _{i=1}^{3}f_{i}(z) . \end{aligned}$$
Then, due to the convexity of \(f_{i}\), \(i=1,2,3\), one necessarily has
$$\begin{aligned} f_{i}\left( \left( 1-\lambda \right) y+\lambda z\right) =\left( 1-\lambda \right) f_{i}\left( y\right) +\lambda f_{i}(z) ,\; i=1,2,3, \end{aligned}$$
or, equivalently,
$$\begin{aligned} \left\| \left( 1-\lambda \right) \left( y-x_{i}\right) +\lambda \left( z-x_{i}\right) \right\| =\left\| \left( 1-\lambda \right) \left( y-x_{i}\right) \right\| +\left\| \lambda \left( z-x_{i}\right) \right\| ,\; i=1,2,3. \end{aligned}$$
(4.4)
Let \(i\in \left\{ 1,2,3\right\} \). If \(y\ne x_{i}\ne z\), one has \(\left( 1-\lambda \right) \left( y-x_{i}\right) ,\lambda \left( z-x_{i}\right) \in \mathbb {R} ^{2}\setminus \left\{ 0_{2}\right\} \) and (4.4) implies the existence of \(\mu _{i}>0\) such that \(\left( 1-\lambda \right) \left( y-x_{i}\right) =\mu _{i}\lambda \left( z-x_{i}\right) \). Defining \(\gamma _{i}:=\frac{\mu _{i}\lambda }{1-\lambda }>0\), we have \(y-x_{i}=\gamma _{i}\left( z-x_{i}\right) \), with \(\gamma _{i}\ne 1\) as \(y\ne z\). Hence,
$$\begin{aligned} x_{i}=\left( \frac{1}{1-\gamma _{i}}\right) y-\left( \frac{\gamma _{i}}{ 1-\gamma _{i}}\right) z\in L, \end{aligned}$$
where L is the line containing y and z. Alternatively, if \(y\ne x_{i}\ne z\) does not hold, we have \(x_{i}\in \left\{ y, z\right\} \subset L\). We conclude that \(\left\{ x_{1}, x_{2}, x_{3}\right\} \subset L \); i.e., the points \( x_{1}\), \(x_{2}\), and \(x_{3}\) are aligned (contradiction).
The functions \(f_{i}\), \(i=1,2,3\), are coercive as \(f_{i}\left( x\right) =\left\| x-x_{i}\right\| \ge \left\| x\right\| -\left\| x_{i}\right\| \) implies that
$$\begin{aligned} \lim _{\left\| x\right\| \rightarrow +\infty }f_{i}\left( x\right) =+\infty . \end{aligned}$$
Thus, their sum f is also coercive. In summary, \(P_{FS}\) has a unique optimal solution. From this fact, and the observation that \(\nabla f_{i}\left( x\right) =\frac{x-x_{i}}{\left\| x-x_{i}\right\| }\) for all \(x\ne x_{i}\), \(i=1,2,3\), we deduce the following result.

Proposition 4.8

(Analytic solution to the Fermat–Steiner problem) There exists a unique global minimizer of \(f\). Moreover, such a global minimizer is the element \(x\in \mathbb {R}^{2}\setminus \left\{ x_{1}, x_{2}, x_{3}\right\} \) such that
$$\begin{aligned} \nabla f\left( x\right) =\mathop {\displaystyle \sum }\limits _{i=1}^{3}\frac{x-x_{i}}{\left\| x-x_{i}\right\| }=0_{2}. \end{aligned}$$
(4.5)

Proposition 4.9

(Geometric solution to the Fermat–Steiner problem) The optimal solution of \(P_{FS}\) is the isogonic center of the triangle with vertices \(x_{1}\), \(x_{2}\), and \(x_{3}\), that is, the point \( \overline{x}\) from which the three sides are seen under the same angle of 120 degrees.

Proof

Let \(\alpha _{12}\), \(\alpha _{13}\), and \(\alpha _{23}\) be the angles under which the sides \(\left[ x_{1}, x_{2}\right] \), \(\left[ x_{1}, x_{3}\right] \), and \(\left[ x_{2}, x_{3}\right] \) are seen by an observer placed at x satisfying (4.5). Since f is differentiable at x, the Fermat necessary optimality condition yields \(\nabla f\left( x\right) =0_{2}\), i.e.,
$$\begin{aligned} u+v+w=0_{2}, \end{aligned}$$
(4.6)
where \(u:=\frac{x-x_{1}}{\left\| x-x_{1}\right\| }\), \(v:=\frac{x-x_{2}}{ \left\| x-x_{2}\right\| }\), and \(w:=\frac{x-x_{3}}{\left\| x-x_{3}\right\| }\). Since \(\left\| u\right\| = \left\| v\right\| = \left\| w\right\| = 1\), we have \(u^{T}v=\cos \alpha _{12}\), \(u^{T}w=\cos \alpha _{13}\), and \(v^{T}w=\cos \alpha _{23}\).
By multiplying both sides of (4.6) by u, v, and w, one gets a linear system of equations whose unknowns are the cosines of the angles \(\alpha _{12}\), \(\alpha _{13}\), and \(\alpha _{23}\),
$$\begin{aligned} \left\{ \begin{array}{ccc} 1+\cos \alpha _{12}+\cos \alpha _{13} &{} = &{} 0 \\ \cos \alpha _{12}+1+\cos \alpha _{23} &{} = &{} 0 \\ \cos \alpha _{13}+\cos \alpha _{23}+1 &{} = &{} 0 \end{array} \right\} , \end{aligned}$$
whose unique solution is
$$\begin{aligned} \cos \alpha _{12}=\cos \alpha _{13}=\cos \alpha _{23}=-\frac{1}{2}, \end{aligned}$$
that is,
$$\begin{aligned} \alpha _{12}=\alpha _{13}=\alpha _{23}=120\text { degrees}. \end{aligned}$$
Hence, the unique solution to \(\nabla f\left( x\right) =0_{2}\) is the isogonic center \(\overline{x}\).       \(\square \)

Example 4.10

([68, pp. 284–285]). In the 1960s, Bell Telephone installed telephone networks connecting the different headquarters of the customer companies with a cost, regulated by law, which was proportional to the total length of the installed network. One of these customers, Delta Airlines, wanted to connect its hubs placed at the airports of Atlanta, Chicago, and New York, which approximately formed an equilateral triangle. Bell Telephone proposed to install cables connecting one of the hubs with the other two, but Delta Airlines proposed instead to create a virtual hub at the isogonic center of the triangle formed by the three hubs, linking this point with the three hubs by cables. The Federal Authority agreed with this solution, and Delta Airlines earned a significant amount of money that we can estimate. Take the average distance between hubs as length unit. Since the height of the equilateral triangle is \(\frac{\sqrt{3}}{2}\), the distance from the isogonic center (here coinciding with the orthocenter) to any vertex is \(\frac{2}{3}\times \frac{\sqrt{3}}{2}=\frac{\sqrt{3}}{3}\). Therefore, the lengths of the networks proposed by Bell Telephone and by Delta Airlines were 2 and \(\sqrt{3}\) units, respectively, with an estimated saving of \(\left( \frac{2-\sqrt{3}}{2}\right) \times 100\approx 13.4\%\).

4.2.2 The Fixed-Point Method of Weiszfeld

A specific numerical method for \(P_{FS}\) was proposed by Weiszfeld in 1937, when he was only 16. Indeed, since
$$\begin{aligned} \mathop {\displaystyle \sum }\limits _{i=1}^{3}\frac{\overline{x}-x_{i}}{\left\| \overline{x}-x_{i}\right\| }=0_{2} \end{aligned}$$
is equivalent to
$$\begin{aligned} \left( \mathop {\displaystyle \sum }\limits _{i=1}^{3}\frac{1}{\left\| \overline{x}-x_{i}\right\| }\right) \overline{x}=\left( \mathop {\displaystyle \sum }\limits _{i=1}^{3}\frac{x_{i}}{\left\| \overline{x}-x_{i}\right\| }\right) , \end{aligned}$$
that is,
$$\begin{aligned} \overline{x}=\frac{\left( \mathop {\displaystyle \sum }\limits _{i=1}^{3}\frac{x_{i}}{\left\| \overline{x}-x_{i}\right\| }\right) }{\left( \mathop {\displaystyle \sum }\limits _{i=1}^{3}\frac{1}{\left\| \overline{x}-x_{i}\right\| }\right) }, \end{aligned}$$
the optimality condition (4.5) is equivalent to asserting that \(\overline{x}\) is a fixed point of the function
$$\begin{aligned} h\left( x\right) :=\frac{\left( \mathop {\displaystyle \sum }\limits _{i=1}^{3}\frac{x_{i}}{ \left\| x-x_{i}\right\| }\right) }{\left( \mathop {\displaystyle \sum }\limits _{i=1}^{3}\frac{1 }{\left\| x-x_{i}\right\| }\right) }. \end{aligned}$$
Inspired by the argument of the Banach fixed-point theorem, proved in 1922, Weiszfeld considered sequences \(\left\{ x^{k}\right\} \subset \mathbb {R}^{2}\) whose initial element (seed) \(x^{0}\in \mathbb {R}^{2}\setminus \left\{ x_{1}, x_{2}, x_{3}\right\} \) is arbitrary, and \(x^{k+1}=h\left( x^{k}\right) \), \(k=0,1,2,\ldots \) Of course, \(\left\{ x^{k}\right\} \) could be finite, as h is not defined at the three points \(x_{1}, x_{2}, x_{3}\), and, in the best case that it is infinite, it does not necessarily converge, as h is not contractive. Harold W. Kuhn rediscovered Weiszfeld’s method in 1973 and showed that \( \left\{ x^{k}\right\} \) always converges whenever it is infinite. Recent research has shown that the set of bad seeds, formed by the initial elements for which \(\left\{ x^{k}\right\} \) attains an element of \(\left\{ x_{1}, x_{2}, x_{3}\right\} \) after a finite number of steps, is countable [9, 48]. Since the set of bad seeds is very small (more precisely, it is a Lebesgue zero-measure set), one concludes that the sequences generated by Weiszfeld’s method converge to the isogonic point with probability one when the seeds are selected at random in the triangle. A similar behavior has been observed, in practice, for all the general optimization methods described in Chap.  5, including those which assume the existence of gradients or Hessian matrices of \(f\).
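A minimal implementation of Weiszfeld’s iteration is sketched below (our own code; the three points are arbitrary, chosen so that all the angles of their triangle are below 120 degrees, and the barycenter is used as seed). It also checks numerically the characterization of Proposition 4.9: at the limit point, the three unit vectors \(\frac{x-x_{i}}{\left\| x-x_{i}\right\| }\) add up to \(0_{2}\) and form pairwise angles of 120 degrees.
```python
# Weiszfeld's fixed-point iteration x^{k+1} = h(x^k) for three non-collinear points.
import numpy as np

pts = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([1.0, 3.0])]

def h(x):
    # h(x) = (sum_i x_i/||x - x_i||) / (sum_i 1/||x - x_i||), defined off the three points
    w = np.array([1.0 / np.linalg.norm(x - p) for p in pts])
    return sum(wi * p for wi, p in zip(w, pts)) / w.sum()

x = sum(pts) / 3                         # seed at the barycenter of the triangle
for _ in range(200):
    x = h(x)

u = [(x - p) / np.linalg.norm(x - p) for p in pts]
print(x, sum(u))                         # the sum of the three unit vectors is ~ 0_2, cf. (4.5)
angles = [np.degrees(np.arccos(np.clip(u[i] @ u[j], -1.0, 1.0)))
          for i, j in [(0, 1), (0, 2), (1, 2)]]
print(angles)                            # ~ [120, 120, 120], as in Proposition 4.9
```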

4.2.3 An Application to Statistical Inference

One of the main aims of statistical inference consists in estimating the parameters of the density function of a scalar random variable x, which is assumed to belong to a given family, from a sample \(\left( x_{1},\ldots , x_{n}\right) \in \mathbb {R}^{n}\) (the result of \(n\ge 2\) realizations of x). The most common method to solve this statistical inference problem is the maximum likelihood one, which consists in computing the unique global maximum \(\overline{\theta }\) of the likelihood function of the sample, that is,
$$\begin{aligned} f\left( x_{1},\ldots , x_{n}\mid \theta \right) :=\prod _{i=1}^{n}f\left( x_{i}\mid \theta \right) , \end{aligned}$$
where \(f\left( x \mid \theta \right) \) represents the density function of x and \(\theta \) is the parameter to be estimated.
Assume, for instance, that x has a normal distribution of parameter \(\theta =\left( \mu ,\sigma ^{2}\right) \in \mathbb {R}\times \mathbb {R}_{++}\), i.e., that the density function of x is
$$\begin{aligned} f\left( x\mid \mu ,\sigma ^{2}\right) =\frac{1}{\sigma \sqrt{2\pi }}\exp \left( -\frac{\left( x-\mu \right) ^{2}}{2\sigma ^{2}}\right) , \end{aligned}$$
so we have that the maximum likelihood function is, in this case,
$$\begin{aligned} f\left( x_{1},\ldots , x_{n}\mid \mu ,\sigma ^{2}\right) :=\prod _{i=1}^{n}f\left( x_{i}\mid \mu ,\sigma ^{2}\right) ={\left( \frac{1}{2\pi \sigma ^{2}}\right) }^{\frac{n}{2}}\exp \left( -\frac{ \sum _{i=1}^{n}{\left( x_{i}-\mu \right) }^{2}}{2\sigma ^{2}}\right) . \end{aligned}$$
The optimization problem to be solved is then
$$\begin{aligned} \begin{array}{lll} P_{1}: &{} \text {Max} &{} f\left( x_{1},\ldots , x_{n}\mid \mu ,\sigma ^{2}\right) \\ &{} \text {s.t.} &{} \left( \mu ,\sigma ^{2}\right) \in \mathbb {R\times R}_{++}, \end{array} \end{aligned}$$
which can be reformulated, recalling that \(\overline{x}:=\frac{1}{n} \sum _{i=1}^{n}x_{i}\) is the sample mean while \(s^{2}:=\frac{1}{n} \sum _{i=1}^{n}\left( x_{i}-\overline{x}\right) ^{2}\) is the sample variance. We first observe that \({\left( \frac{1}{2\pi }\right) }^{\frac{n}{2} }\) is constant, while
$$\begin{aligned} \sum _{i=1}^{n}{\left( x_{i}-\mu \right) }^{2}&=\sum _{i=1}^{n}{\left( x_{i}-\overline{x}+\overline{x}-\mu \right) }^{2} \\&=\sum _{i=1}^{n}{\left( x_{i}-\overline{x}\right) }^{2}+2\left( \overline{x} -\mu \right) \sum _{i=1}^{n}\left( x_{i}-\overline{x}\right) +n\left( \overline{x}-\mu \right) ^{2} \\&=n\left[ s^{2}+\left( \overline{x}-\mu \right) ^{2}\right] . \end{aligned}$$
Replacing in \(P_{1}\) the (positive) objective function by its natural logarithm, changing the task “Max” into the task “Min” (i.e., changing the sign of the objective function), and making the change of variable \(\delta :=\sigma ^{2}>0\), one gets the equivalent problem:
$$\begin{aligned} \begin{array}{lll} P_{2}: &{} \text {Min} &{} g\left( \mu ,\delta \right) :=\frac{1}{2}\ln \delta + \frac{1}{2\delta }\left[ s^{2}+\left( \overline{x}-\mu \right) ^{2}\right] \\ &{} \text {s.t.} &{} \left( \mu ,\delta \right) \in \mathbb {R}\times \mathbb {R} _{++}, \end{array} \end{aligned}$$
whose constraint set \(\mathbb {R}\times \mathbb {R}_{++}\) is open and convex, and \(g\in \mathcal {C}^{2}\left( \mathbb {R}\times \mathbb {R}_{++}\right) \); see Fig. 4.7.
Fig. 4.7

\(\mathop {\mathrm{gph}} g\) for \(\overline{x}=0\) and \(s=\frac{1}{4}\)

Since \(\mathop {\mathrm{dom}}g=\mathbb {R}\times \mathbb {R}_{++}\) is not closed, we cannot derive the existence of a global minimum via coercivity. Fortunately, we can prove this existence by a suitable convexity argument. We have
$$\begin{aligned} \frac{\partial g}{\partial \mu }=-\frac{\overline{x}-\mu }{\delta }, \end{aligned}$$
$$\begin{aligned} \frac{\partial g}{\partial \delta }=\frac{1}{2\delta }-\frac{1}{2\delta ^{2}} \left( s^{2}+\left( \overline{x}-\mu \right) ^{2}\right) , \end{aligned}$$
and
$$\begin{aligned} \nabla ^{2}g\left( \mu ,\delta \right) =\left[ \begin{array}{cc} \frac{1}{\delta } &{} \frac{\overline{x}-\mu }{\delta ^{2}} \\ \frac{\overline{x}-\mu }{\delta ^{2}} &{} \frac{1}{\delta ^{3}}\left( s^{2}+\left( \overline{x}-\mu \right) ^{2}\right) -\frac{1}{2\delta ^{2}} \end{array} \right] , \end{aligned}$$
with
$$\begin{aligned} \det \nabla ^{2}g\left( \mu ,\delta \right) =\frac{1}{\delta ^{4}}\left( s^{2}+\left( \overline{x}-\mu \right) ^{2}\right) -\frac{1}{2\delta ^{3}}- \frac{\left( \overline{x}-\mu \right) ^{2}}{\delta ^{4}} =\frac{s^{2}}{\delta ^{4}}-\frac{1}{2\delta ^{3}}. \end{aligned}$$
Thus, \(\nabla ^{2}g\) is positive definite on the open convex set \(C:=\mathbb { R}\times ] 0,2s^{2} [\). Since g is convex and differentiable on C, the minima of g on C are its critical points. Obviously, \(\frac{ \partial g}{\partial \mu }=0\) if and only if \(\mu =\overline{x}\). Then,
$$\begin{aligned} \frac{\partial g}{\partial \delta }=\frac{1}{2\delta }-\frac{1}{2\delta ^{2}} \left( s^{2}+\left( \overline{x}-\mu \right) ^{2}\right) =\frac{1}{2\delta }- \frac{s^{2}}{2\delta ^{2}}=0 \end{aligned}$$
if and only if \(\delta =s^{2}\), so we have that
$$\begin{aligned} \left( \widehat{\mu },\widehat{\delta }\right) :=\left( \overline{x} , s^{2}\right) \end{aligned}$$
is the unique global minimum of g on C, by Proposition 4.1(v). In particular,
$$\begin{aligned} g\left( \widehat{\mu },\widehat{\delta }\right) \le g\left( \mu ,\delta \right) ,\quad \forall \mu \in \mathbb {R},\forall \delta \in \left] 0,\widehat{\delta }\right] . \end{aligned}$$
It remains to prove that
$$\begin{aligned} g\left( \widehat{\mu },\widehat{\delta }\right) \le g\left( \mu ,\delta \right) ,\quad \forall \mu \in \mathbb {R},\forall \delta >\widehat{\delta }, \end{aligned}$$
or, equivalently, since \(\inf \left\{ g(\mu ,\delta ):\mu \in \mathbb {R}\right\} =g(\overline{x},\delta )\) for each \(\delta >\widehat{\delta }\),
$$\begin{aligned} g\left( \overline{x},\widehat{\delta }\right) \le g\left( \overline{x} ,\delta \right) ,\quad \forall \delta >\widehat{\delta }, \end{aligned}$$
as
$$\begin{aligned} g\left( \mu ,\delta \right) =\frac{1}{2}\ln \delta +\frac{1}{2\delta }\left( s^{2}+\left( \overline{x}-\mu \right) ^{2}\right) \ge g\left( \overline{x} ,\delta \right) ,\quad \forall \delta >0. \end{aligned}$$
Consider the function \(h\left( \delta \right) :=g\left( \overline{x},\delta \right) =\frac{1}{2}\ln \delta +\frac{s^{2}}{2\delta }\). Since
$$\begin{aligned} h^{\prime }\left( \delta \right) =\frac{1}{2\delta }-\frac{s^{2}}{2\delta ^{2}}\ge 0,\quad \forall \delta \in \left[ \right. s^{2},+\infty [ =\left[ \right. \widehat{\delta },+\infty [ , \end{aligned}$$
one has
$$\begin{aligned} h\left( \widehat{\delta }\right) \le h\left( \delta \right) ,\quad \forall \delta \ge \widehat{\delta }, \end{aligned}$$
so \(\left( \widehat{\mu },\widehat{\delta }\right) \) is a global minimum of g on \(\mathbb {R}\times \mathbb {R}_{++}\). The uniqueness follows from the fact that \(\left( \widehat{\mu },\widehat{\delta }\right) \) is the unique critical point of g on \(\mathbb {R}\times \mathbb {R}_{++}\). We thus obtain the desired conclusion:

Proposition 4.11

(Maximum likelihood estimators) The maximum likelihood estimators of the parameters \(\mu \) and \( \sigma ^{2}\) of a normally distributed random variable are the sample mean and the sample variance, respectively.
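Proposition 4.11 is easy to confirm numerically. The sketch below (our own; it uses scipy’s bound-constrained L-BFGS-B solver merely as a black box, which plays no role in the analytical argument above) minimizes \(g\left( \mu ,\delta \right) \) for a simulated sample and recovers the sample mean and the sample variance:
```python
# Numerical check of Proposition 4.11: minimizing g over R x ]0, +oo[ returns
# (sample mean, sample variance) with the 1/n convention used in the text.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.5, size=500)
xbar, s2 = sample.mean(), sample.var()   # numpy's var() uses the 1/n convention

def g(theta):
    mu, delta = theta
    return 0.5 * np.log(delta) + (s2 + (xbar - mu) ** 2) / (2.0 * delta)

res = minimize(g, x0=np.array([0.0, 1.0]), method="L-BFGS-B",
               bounds=[(None, None), (1e-9, None)])
print(res.x)                             # ~ (xbar, s2)
print(xbar, s2)
```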

4.3 Linearly Constrained Convex Optimization

We consider in this section convex optimization problems of the form
$$\begin{aligned} \begin{array}{lll} P: &{} \text {Min} &{} f\left( x\right) \\ &{} \text {s.t.} &{} g_{i}\left( x\right) \le 0,\text { }i\in I:=\left\{ 1,\ldots , m\right\} , \end{array} \end{aligned}$$
where \(g_{i}\left( x\right) =a_{i}^{T}x-b_{i}\), \(i\in I\), and f is convex and differentiable on an open convex set C that contains the feasible set, denoted by \(F\).

4.3.1 Optimality Conditions

Definition 4.12

We say that \(d\in \mathbb {R}^{n}\) is a feasible direction at \( \overline{x}\in F\) if there exists \(\varepsilon >0\) such that \(\overline{x} +td\in F\) for all \(t\in [0,\varepsilon ]\) (i.e., if we can move from \( \overline{x}\) in the direction d without leaving F). The set of feasible directions at \(\overline{x}\) forms a convex cone, called the feasible direction cone, which we denote by \(D\left( \overline{x}\right) \), i.e.,
$$\begin{aligned} D\left( \overline{x}\right) :=\left\{ d\in \mathbb {R}^{n}:\exists \varepsilon >0\text { such that }\overline{x}+td\in F,\text { }\forall t\in \left[ 0,\varepsilon \right] \right\} . \end{aligned}$$
We denote by \(I(\overline{x})\) the set of active indices at \( \overline{x}\in F\), that is,
$$\begin{aligned} I(\overline{x}):=\left\{ i\in I:g_{i}(\overline{x})=0\right\} . \end{aligned}$$
We define the active cone at \( \overline{x}\in F\) as the convex cone generated by the gradients of the active constraints at that point, i.e.,
$$\begin{aligned} A\left( \overline{x}\right) :=\mathop {\mathrm{cone}}\left\{ \nabla g_{i}(\overline{x} ):\, i\in I(\overline{x})\right\} . \end{aligned}$$
(4.7)
In other words,
$$\begin{aligned} I(\overline{x})=\left\{ i\in I:a_{i}^{T}\overline{x}=b_{i}\right\} \quad \text {and}\quad A\left( \overline{x}\right) =\mathop {\mathrm{cone}}\left\{ a_{i}:\, i\in I( \overline{x})\right\} . \end{aligned}$$

Since \(I(\overline{x})\) is finite, \(A\left( \overline{x}\right) \) is a finitely generated convex cone, and so, it is closed (see Proposition  2.13).

Definition 4.13

The (negative) polar cone of a nonempty set \(Y\subset \mathbb {R}^{n}\) is defined as
$$\begin{aligned} Y^{\circ }:=\left\{ z\in \mathbb {R}^{n}:y^{T}z\le 0, \text { }\forall y\in Y\right\} . \end{aligned}$$
By convention, the polar cone of the empty set \(\emptyset \) is the whole space \(\mathbb {R}^n\).
Obviously, the polar cone is a closed convex cone (even if Y is not convex). It is easy to see that
$$\begin{aligned} A\left( \overline{x}\right) ^{\circ } =\left\{ y\in \mathbb {R} ^{n}:a^{T}y\le 0,\text { }\forall a\in A\left( \overline{x}\right) \right\} =\left\{ y\in \mathbb {R}^{n}:a_{i}^{T}y\le 0,\text { }\forall i\in I( \overline{x})\right\} . \end{aligned}$$
Since \(I(\overline{x})\) is finite, \(A\left( \overline{x}\right) ^{\circ }\) is a polyhedral cone.

Proposition 4.14

(Computing the feasible direction cone) The feasible direction cone at \(\overline{x}\in F\) satisfies the equation
$$\begin{aligned} D\left( \overline{x}\right) =A\left( \overline{x}\right) ^{\circ }. \end{aligned}$$
(4.8)

Proof

We shall prove both inclusions.

Let \(d\in D(\bar{x})\) and \(\varepsilon >0\) be such that \(\overline{x}+t d\in F\) for all \(t\in [0,\varepsilon ]\). For any \(i\in I(\overline{x})\), one has \(a_{i}^{T}\overline{x}=b_{i}\), so we have that
$$\begin{aligned} b_{i}\ge a_{i}^{T}(\overline{x}+\varepsilon d)=a_{i}^{T}\overline{x}+\varepsilon a_{i}^{T}d=b_{i}+\varepsilon a_{i}^{T}d. \end{aligned}$$
Hence, \(a_{i}^{T}d\le 0\) for all \(i\in I(\overline{x})\). Thus, \(d\in A(\overline{x})^{\circ }\).
To prove the reverse inclusion, take an arbitrary \(d\in A(\overline{x})^{\circ }\). If \( i\not \in I(\overline{x})\), then \(a_{i}^{T}\overline{x}<b_{i}\), so there exists a constant \(\eta >0\) such that
$$\begin{aligned} a_{i}^{T}\overline{x}+\eta <b_{i},\quad \forall i\not \in I(\overline{x}). \end{aligned}$$
Moreover, there exists a constant \(\varepsilon >0\) such that \(\varepsilon |a_{i}^{T}d|<\eta \) for all \(i\in I\). Take now any \(t\in [0,\varepsilon ]\) and any \(i\in I\).
If \( i\in I(\bar{x})\), as \(d\in A(\overline{x})^{\circ }\), then
$$\begin{aligned} a_{i}^{T}(\overline{x}+td)=b_{i}+ta_{i}^{T}d\le b_{i}. \end{aligned}$$
If \(i\not \in I(\overline{x})\), then
$$\begin{aligned} a_{i}^{T}(\overline{x}+td) =a_{i}^{T}\overline{x}+ta_{i}^{T}d\le a_{i}^{T}\overline{x}+t|a_{i}^{T}d| \le a_{i}^{T}\overline{x}+\varepsilon |a_{i}^{T}d|\le a_{i}^{T}\overline{x}+\eta <b_{i}, \end{aligned}$$
so we have that \(a_{i}^{T}(\overline{x}+td)\le b_{i}\), for all \(i\in I\). Hence, \(\overline{x}+td\in F\) for all \(t\in [0,\varepsilon ]\), that is, \(d\in D(\overline{x})\).       \(\square \)

Lemma 4.15

(Generalized Farkas lemma) For any \(Y\subset \mathbb {R}^{n}\), it holds
$$ Y^{\circ \circ }=\mathop {\mathrm{cl}}\mathop {\mathrm{cone}} Y.$$

Proof

Let \(y\in Y^{\circ \circ }\), and let us assume, by contradiction, that \(y\not \in \mathop {\mathrm{cl}}\mathop {\mathrm{cone}}Y\). Since \(\mathop {\mathrm{cl}}\mathop {\mathrm{cone}}Y\) is a closed convex cone, by the separation theorem for convex cones (Corollary  2.15), there exists \( a\in \mathbb {R}^{n}\) such that \(a^{T}y>0\) and
$$\begin{aligned} a^{T}x\le 0,\quad \forall x\in \mathop {\mathrm{cl}}\mathop {\mathrm{cone}}Y. \end{aligned}$$
(4.9)
From (4.9), one gets that \(a\in Y^{\circ }\). Since \(y\in Y^{\circ \circ }\), we should have \(a^{T}y\le 0\), which yields the desired contradiction.

To prove the reverse inclusion, we now show that \(Y\subset Y^{\circ \circ }\) (from this inclusion, one concludes that \(\mathop {\mathrm{cl}}\mathop {\mathrm{cone}}Y\subset \mathop {\mathrm{cl}}\mathop {\mathrm{cone}}Y^{\circ \circ }=Y^{\circ \circ }\), as \(Y^{\circ \circ }\) is a closed convex cone). Take any \(y\in Y\). By the definition of \(Y^{\circ }\), one has \(x^{T}y\le 0\) for all \(x\in Y^{\circ }\), and so, \(y\in Y^{\circ \circ }\).       \(\square \)

We next characterize those homogeneous linear inequalities which are satisfied by all the solutions to a given homogeneous linear inequality system. In two dimensions, this result characterizes those half-planes which contain a given angle with apex at the origin and whose boundaries go through the origin.

Corollary 4.16

(Generalized Farkas lemma for linear inequalities) Consider the system \(\left\{ a_{i}^{T}x\le 0,\text { }i\in I\right\} \) of linear inequalities in \( \mathbb {R}^{n}\), with I being an arbitrary (possibly infinite) index set. Then,
$$\forall x\in \mathbb {R}^n: \left[ a_i^Tx\le 0,\;\forall i\in I\right] \Rightarrow [a^Tx\le 0]$$
if and only if
$$\begin{aligned} a\in \mathop {\mathrm{cl}} \mathop {\mathrm{cone}}\left\{ a_{i},\text { }i\in I\right\} . \end{aligned}$$

Proof

This is a direct application of Lemma 4.15 to the set \(Y:=\left\{ a_i, i\in I\right\} \). Indeed, \(a\in Y^{\circ \circ }\), by definition, if \(a^Tx\le 0\) for all \(x\in \mathbb {R}^n\) such that \(a_i^Tx\le 0\) for all \(i\in I\). By Lemma 4.15, we have that \(Y^{\circ \circ }=\mathop {\mathrm{cl}}\mathop {\mathrm{cone}}\left\{ a_i, i\in I\right\} \), which proves the claim.      \(\square \)

Trying to characterize in a rigorous way the equilibrium points of dynamic systems, the Hungarian physicist G. Farkas proved in 1901, after several failed attempts, the particular case of the above result where the index set I is finite and the closure operator can be removed (recall that any finitely generated convex cone is closed). This classical Farkas lemma will be used in this chapter to characterize optimal solutions to convex problems and to linearize conic systems. Among its many applications, let us mention machine learning [63], probability, economics, and finance [32, 37]. The generalized Farkas lemma (whose infinite dimensional version was proved by Chu [22]) is just one of the many available extensions of the classical Farkas lemma [31]. We use this version in Chapter  6 to obtain necessary optimality conditions in nonconvex optimization.
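In the classical finite case, membership of a vector a in the (automatically closed) cone generated by \(a_{1},\ldots , a_{m}\) can be tested numerically as a nonnegative least squares problem. The sketch below (our own illustration; the generators and the two test vectors are arbitrary) compares this membership test with a sampling check of the implication in Corollary 4.16:
```python
# Classical Farkas lemma, numerically: a is a nonnegative combination of a_1, a_2
# if and only if a^T x <= 0 on the solution set of {a_1^T x <= 0, a_2^T x <= 0}.
import numpy as np
from scipy.optimize import nnls

a1, a2 = np.array([-2.0, 1.0]), np.array([-1.0, -1.0])
G = np.column_stack([a1, a2])               # columns are the generators a_i

rng = np.random.default_rng(1)
xs = rng.normal(size=(10000, 2))
sols = xs[(xs @ a1 <= 0) & (xs @ a2 <= 0)]  # sampled solutions of {a_i^T x <= 0, i in I}

for a in [np.array([-3.0, 0.0]), np.array([1.0, 1.0])]:
    lam, residual = nnls(G, a)              # min ||G lam - a||  s.t.  lam >= 0
    in_cone = residual < 1e-10              # a in cone{a_1, a_2} iff the residual is ~ 0
    implication = bool(np.all(sols @ a <= 1e-12))
    print(a, in_cone, implication)          # the two booleans coincide
```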

The following proposition lists some basic properties of polar cones to be used hereinafter.

Proposition 4.17

(Handling polar sets)  Given \(Y, Z\subset \mathbb {R}^{n}\), the following properties hold:

(i) If \(Y\subset Z\), then \(Z^{\circ }\subset Y^{\circ }.\)

(ii) \(Y^{\circ }={\left( \mathop {\mathrm{cone}}Y\right) }^{\circ }={\left( \mathop {\mathrm{cl }}\mathop {\mathrm{cone}}Y \right) }^{\circ }.\)

(iii) \(Y^{\circ \circ }=Y\) if and only if Y is a closed convex cone.

Proof

Statements (i) and (ii) come straightforwardly from the definition of polar cone, while (iii) is immediate from the Farkas Lemma 4.15.       \(\square \)

Corollary 4.18

The feasible direction cone at \(\overline{x}\in F\) satisfies the equation
$$\begin{aligned} D(\overline{x})^{\circ }=A(\overline{x}). \end{aligned}$$
(4.10)

Proof

Taking the polar at both members of (4.8), and applying Lemma 4.15, one gets
$$\begin{aligned} D(\overline{x})^{\circ }=A(\overline{x})^{\circ \circ }=\mathop {\mathrm{cl}}A(\overline{x})=A(\overline{x}), \end{aligned}$$
as \(A(\overline{x})\) is a closed convex cone.       \(\square \)

Example 4.19

Consider the polyhedral set
$$\begin{aligned} F:=\{x\in \mathbb {R}^{2}:-2x_{1}+x_{2}\le 0,-x_{1}-x_{2}\le 0,-x_{2}-4\le 0\} \end{aligned}$$
and the point \(\overline{x}=0_{2}\in F\). We have \(I(\overline{x})=\{1,2\}\) and
$$\begin{aligned} A(\overline{x})=\mathop {\mathrm{cone}}\left\{ \left( \begin{array}{c} -2 \\ 1 \end{array} \right) ,\left( \begin{array}{c} -1 \\ -1 \end{array} \right) \right\} . \end{aligned}$$
The feasible direction cone at \(\overline{x}\) can be expressed as
$$\begin{aligned} D\left( \overline{x}\right) =\left\{ x\in \mathbb {R}^{2}:-2x_{1}+x_{2}\le 0,\,-x_{1}-x_{2}\le 0\right\} . \end{aligned}$$
We can easily check that \(D(\overline{x})^{\circ }=A(\overline{x})\) and \(A(\overline{x})^{\circ }=D(\overline{x}) \) (see Fig. 4.8).
Fig. 4.8

The feasible direction cone and the active cone are polar of each other
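Equation (4.8) can also be checked by brute force in Example 4.19. In the sketch below (our own; since the constraints are linear and \(\overline{x}\in F\), testing a single small step along d suffices to decide feasibility of the direction here), a sampled direction d passes the polar-cone test \(a_{i}^{T}d\le 0\), \(i\in I(\overline{x})\), exactly when \(\overline{x}+td\in F\) for a small \(t>0\):
```python
# Sampling check of D(xbar) = A(xbar)^o (equation (4.8)) for Example 4.19.
import numpy as np

A = np.array([[-2.0, 1.0], [-1.0, -1.0], [0.0, -1.0]])
b = np.array([0.0, 0.0, 4.0])
xbar = np.zeros(2)
active = [0, 1]                          # I(xbar) = {1, 2}, written with 0-based indices

rng = np.random.default_rng(0)
for d in rng.normal(size=(1000, 2)):
    in_polar = bool(np.all(A[active] @ d <= 0))               # d in A(xbar)^o
    feasible_step = bool(np.all(A @ (xbar + 1e-3 * d) <= b))  # xbar + t d in F for t = 1e-3
    assert in_polar == feasible_step
print("all 1000 sampled directions satisfy (4.8)")
```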

We now prove, from the previously obtained relationships between the feasible direction cone and the active cone, the simplest version of the Karush–Kuhn–Tucker (KKT in brief) theorem, which provides a first-order characterization of the optimal solutions of linearly constrained convex optimization problems in terms of the so-called nonnegativity condition (NC), the stationarity condition (SC), and the complementarity condition (CC), altogether called the KKT conditions. The general version of the KKT theorem, which provides a necessary optimality condition for nonlinear optimization problems with inequality constraints, was first proved by Karush in his unpublished master’s thesis on the extension of the method of Lagrange multipliers for equality constrained problems, in 1939, and rediscovered by Kuhn and Tucker in 1951.

Theorem 4.20

(KKT theorem with linear constraints)  Let \(\overline{x}\in F\). Then, the following statements are equivalent:

(i) \(\overline{x}\in F^{*}\).

(ii) \(-\nabla f\left( \overline{x}\right) \in A\left( \overline{x}\right) \).

(iii) There exists \(\overline{\lambda }\in \mathbb {R}^{m}\) such that:
$$\begin{aligned}\begin{array}{ll}\text {(NC)} &{} \overline{\lambda }\in \mathbb {R}_{+}^{m}{} ; \\ \text {(SC)} &{} \nabla f\left( \overline{x}\right) +\sum _{i\in I}\overline{\lambda }_{i}a_{i}=0_{n};\text { and} \\ \text {(CC)} &{} \overline{\lambda }_{i}\left( b_{i}-a_{i}^{T}\overline{x}\right) =0,\quad \forall i\in I.\text { }\end{array}\end{aligned}$$

Proof

We shall prove that (i)\(\Leftrightarrow \)(ii) and (ii)\(\Leftrightarrow \) (iii).

(i)\(\Rightarrow \)(ii) Assume that \( \overline{x}\in F^{*}\). Let \(d\in D\left( \overline{x}\right) \) and \( \varepsilon >0\) be such that \(\overline{x}+td\in F\) for all \(t\in [0,\varepsilon ]\). Since \(f\left( \overline{x}+td\right) \ge f\left( \overline{x}\right) \) for all \(t\in [0,\varepsilon ]\),
$$\begin{aligned} \nabla f(\overline{x})^{T}d=f^{\prime }\left( \overline{x};d\right) =\lim _{t\searrow 0}\frac{f\left( \overline{x}+td\right) -f\left( \overline{x} \right) }{t}\ge 0. \end{aligned}$$
Then, recalling (4.10), one gets
$$\begin{aligned} -\nabla f(\overline{x})\in D\left( \overline{x}\right) ^{\circ }=A\left( \overline{x}\right) . \end{aligned}$$
(ii)\(\Rightarrow \)(i) Assume now that \(\overline{x}\in F\) satisfies \(-\nabla f(\overline{x})\in A\left( \overline{x}\right) \), that is, \(-\nabla f( \overline{x})\in D\left( \overline{x}\right) ^{\circ }\). Let \(x\in F\). Since \(x-\overline{x}\in D\left( \overline{x}\right) \), we have \(\nabla f\left( \overline{x}\right) ^{T}\left( x-\overline{x}\right) \ge 0\). Then, due to the convexity of f (see Proposition  2.40),
$$\begin{aligned} f\left( x\right) \ge f\left( \overline{x}\right) +\nabla f\left( \overline{x }\right) ^{T}\left( x-\overline{x}\right) \ge f\left( \overline{x}\right) , \end{aligned}$$
so we have that \(\overline{x}\in F^{*}\).
(ii)\(\Leftrightarrow \)(iii) Let \(-\nabla f\left( \overline{x}\right) \in A\left( \overline{x}\right) \), i.e.,
$$\begin{aligned} -\nabla f(\overline{x})\in \mathop {\mathrm{cone}}\,\left\{ a_{i},\, i\in I(\overline{x })\right\} . \end{aligned}$$
Then, there exist scalars \(\overline{\lambda }_{i}\ge 0\), \(i\in I(\overline{x })\), such that \(-\nabla f\left( \overline{x}\right) =\sum _{i\in I(\overline{x })}\overline{\lambda }_{i}a_{i}\). Defining \(\overline{\lambda }_{i}=0\) for all \(i\in I\setminus I(\overline{x})\), the KKT vector \( \overline{\lambda }:={\left( \overline{\lambda }_{1},\ldots ,\overline{ \lambda }_{m}\right) }^{T} \in \mathbb {R}^{m}\) satisfies the three conditions: (NC), (SC), and (CC).

The converse statement is trivial.       \(\square \)

The KKT conditions occasionally allow us to solve small convex optimization problems with pen and paper, by enumerating the feasible solutions which could be optimal on each of the \(2^{m}\) (possibly empty) subsets of F that result from discussing the possible values of the index set \(I(x) \subset \left\{ 1,\ldots , m\right\} \). For instance, if F is a triangle in \( \mathbb {R}^{2}\) described by a linear system \(\left\{ a_{i}^{T}x\le b_{i}, \text { }i\in I\right\} \), then \(I( x) =\emptyset \) for \(x\in \mathop {\mathrm{int}}F\) (the green set in Fig. 4.9),  \(\left| I( x) \right| =2\) for the three vertices (the red points), and \(\left| I( x) \right| =1\) for the points on the three sides which are not vertices (the blue segments). Observe that, since there is no \(x\in F\) such that \( \left| I( x) \right| =3\), the \(7=2^{3}-1\) mentioned subsets of F provide a partition of the feasible set.
Fig. 4.9

Partition of F

Once the set I(x) has been fixed, the optimal solutions in the corresponding subset of F are the x-part of the pairs \(( x,\lambda ) \in \mathbb {R}^{n}\times \mathbb {R}^{\left| I( x) \right| }\) such that (NC) and (SC) hold, i.e.,
$$\begin{aligned} \begin{array}{c} -\nabla f( x) =\sum _{i\in I( x) }\lambda _{i}a_{i}, \\ a_{i}^{T}x=b_{i},\quad \forall i\in I( x), \\ \lambda _{i}\ge 0,\quad \forall i\in I( x), \\ a_{i}^{T}x<b_{i},\quad \forall i\in I\setminus I( x) . \end{array} \end{aligned}$$
(4.11)
The difficulty of solving the subsystem of \(n+\left| I( x) \right| \) equations and unknowns from the KKT system  (4.11) derives from the nonlinear nature of \(\nabla f(x)\). A solution \(( x,\lambda ) \) is accepted when the m inequalities in (4.11) are satisfied.
In the presence of equations, the mentioned discussion on I(x) only affects the inequality constraints. For instance, if the constraints of P are \(a_{1}^{T}x=b_{1}\) and \(a_{2}^{T}x\le b_{2}\), the possible values of I(x) are \(\{ 1\} \) and \(\{ 1,2\} \), with associated KKT systems
$$ \left\{ -\nabla f( x) =\lambda _{1}a_{1}, a_{1}^{T}x=b_{1}, a_{2}^{T}x<b_{2}\right\} $$
and
$$ \left\{ -\nabla f( x) =\lambda _{1}a_{1}+\lambda _{2}a_{2},\lambda _{2}\ge 0,a_{1}^{T}x=b_{1}, a_{2}^{T}x=b_{2}\right\} , $$
respectively.

Example 4.21

The design problem in Example  1.6 was first formulated as
$$\begin{aligned} \begin{array}{lll} P_{1}: &{} \text {Min} &{} f_{1}\left( x\right) =x_{1}x_{2}+x_{1}x_{3}+x_{2}x_{3} \\ &{} \text {s.t.} &{} x_{1}x_{2}x_{3}=1, \\ &{} &{} x\in \mathbb {R}_{++}^{3}. \end{array} \end{aligned}$$
The change of variables \(y_{i}:=\ln x_{i}\), \(i=1,2,3\), provides the equivalent linearly constrained convex problem
$$\begin{aligned} \begin{array}{lll} P_{2}: &{} \text {Min} &{} f_{2}\left( y\right) =e^{y_{1}+y_{2}}+e^{y_{1}+y_{3}}+e^{y_{2}+y_{3}} \\ &{} \text {s.t.} &{} y_{1}+y_{2}+y_{3}=0. \end{array} \end{aligned}$$
The unique point satisfying the KKT conditions is \(0_{3}\), with KKT vectors \( {\left( \overline{\lambda }_{1},\overline{\lambda }_{2}\right) }^{T} \) such that \( \overline{\lambda }_{1}-\overline{\lambda }_{2}=2\). Then, \(F_{2}^{*}=\left\{ 0_{3}\right\} \) and thus \(F_{1}^{*}=\left\{ {\left( 1,1,1\right) }^{T} \right\} \). This shows that we may have a unique optimal solution to P with infinitely many corresponding KKT vectors.
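The stationarity condition at \(\overline{y}=0_{3}\) can be verified numerically. The sketch below (ours) treats the single linear equation of \(P_{2}\) with one free multiplier, which is equivalent to working with the difference of the two multipliers attached to the split inequalities:
```python
import numpy as np

def grad_f2(y):
    # f2(y) = exp(y1+y2) + exp(y1+y3) + exp(y2+y3)
    e12, e13, e23 = np.exp(y[0] + y[1]), np.exp(y[0] + y[2]), np.exp(y[1] + y[2])
    return np.array([e12 + e13, e12 + e23, e13 + e23])

ybar = np.zeros(3)
a = np.array([1.0, 1.0, 1.0])        # gradient of the constraint y1 + y2 + y3 = 0

g = grad_f2(ybar)                    # equals (2, 2, 2)
mu = -g[0]                           # free multiplier of the equality constraint
print(np.allclose(g + mu * a, 0.0), a @ ybar == 0.0)   # True True
```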

In the particular case of linear optimization, where \(f\left( x\right) =c^{T}x\) for some vector \(c\in \mathbb {R}^{n}\), the global minima are characterized by the condition \(-c\in A\left( \overline{x} \right) \).

It is worth observing that we have proved (i)\(\Rightarrow \)(ii)\( \Leftrightarrow \)(iii) in Theorem 4.20 under the sole assumption that \(\overline{x}\) is a local minimum of the problem. So, if f is not convex, (ii) and the equivalent statement (iii) are necessary conditions for \(\overline{x}\) to be a local minimum. This useful necessary condition allows us to filter the candidates for local minima and to compute the global minima, by comparing the candidates, whenever F is compact or f is coercive on F. Another useful trick consists in tackling, instead of the given problem, the result of eliminating some constraints, called the relaxed problem, whose feasible set is generally larger than the initial one (this is systematically done when the constraint set C is discrete, e.g., when the decision variables are integer, i.e., \(C=\mathbb {Z} ^{n}\)). The next example illustrates the application of both tools to solve nonconvex optimization problems.

Example 4.22

The objective function f of
$$\begin{aligned} \begin{array}{lll} P_{1}: &{} \text {Min} &{} f\left( x\right) =\frac{50}{x_{1}}+\frac{20}{x_{2}}+x_{1}x_{2} \\ &{} \text {s.t.} &{} 1-x_{1}\le 0, \\ &{} &{} 1-x_{2}\le 0, \end{array} \end{aligned}$$
is not convex on \(F\). After discussing the four possible cases for \(I\left( x\right) \) (\(\emptyset \), \(\left\{ 1\right\} \), \(\left\{ 2\right\} \), and \( \left\{ 1,2\right\} \)), we find a unique candidate, corresponding to \(I\left( x\right) =\emptyset \): \(\overline{x}={\left( 5,2\right) }^{T} \), at which \(\nabla f\left( \overline{x}\right) =0_{2}\). The case \(I(x)=\{1,2\}\), corresponding to \(x={\left( 1,1\right) }^{T}\), where \(\nabla f\left( x\right) ={\left( -49,-19\right) }^{T}\), is represented in Fig. 4.10. To prove that \(F^{*}=\left\{ \overline{x}\right\} \) it is enough to show that f is coercive on F. Indeed, given \(x\in F\), one has
$$\begin{aligned} f\left( x\right) \le \gamma \Rightarrow x_{1}x_{2}\le \gamma \Rightarrow [1\le x_{1}\le \gamma \text { }\wedge \text { }1\le x_{2}\le \gamma ]. \end{aligned}$$
Observe that the relaxed problem obtained by eliminating the linear constraints is a geometric optimization problem already solved (see Exercise  3.14):
$$\begin{aligned} \begin{array}{lll} P_{2}: &{} \text {Min} &{} f\left( x\right) =\frac{50}{x_{1}}+\frac{20}{x_{2}} +x_{1}x_{2} \\ &{} \text {s.t.} &{} x\in \mathbb {R}_{++}^{2}. \end{array} \end{aligned}$$
Since the optimal solution of \(P_{2}\) is \(\overline{x}={\left( 5,2\right) }^{T} \) and \(\overline{x}\in F_{1}\), we conclude that \(F_{1}^{*}=\left\{ \overline{x}\right\} \).
Fig. 4.10

The KKT conditions fail at \(x={(1,1)}^{T}\)
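A quick numerical check of the two cases singled out in Example 4.22 (a NumPy sketch; the constraint gradients \((-1,0)^{T}\) and \((0,-1)^{T}\) come from the constraints \(1-x_{1}\le 0\) and \(1-x_{2}\le 0\)):
```python
import numpy as np

def grad_f(x):
    # f(x) = 50/x1 + 20/x2 + x1*x2
    return np.array([-50.0 / x[0] ** 2 + x[1], -20.0 / x[1] ** 2 + x[0]])

# Case I(x) = {}: the gradient vanishes at xbar = (5, 2)
print(grad_f(np.array([5.0, 2.0])))      # [0. 0.]

# Case I(x) = {1, 2} at x = (1, 1): -grad f(x) = (49, 19) is not in
# cone{(-1, 0), (0, -1)} (the nonpositive orthant), so the KKT conditions fail there
print(-grad_f(np.array([1.0, 1.0])))     # [49. 19.]
```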

4.3.2 Quadratic Optimization

Linearly constrained convex quadratic optimization problems enjoy specific theoretical properties, such as the so-called Frank–Wolfe Theorem [36], which guarantees the existence of optimal solutions even when the polyhedral feasible set is unbounded and the objective function is neither convex nor coercive.

Theorem 4.23

(Frank–Wolfe) Let \(Q\in \mathcal {S}_n\) and \(c\in \mathbb {R}^n\), and suppose that
$$\begin{aligned} \begin{array}{lll} P_{Q}: &{} \text {Min} &{} f\left( x\right) =\frac{1}{2}x^{T}Qx-c^{T}x \\ &{} \text {s.t.} &{} a_{i}^{T}x\le b_{i},\text { }i\in I, \end{array} \end{aligned}$$
(4.12)
is a bounded problem (not necessarily convex). Then, its optimal set \(F^*\) is nonempty.

Proof

The original proof given by Frank and Wolfe in [36] is beyond the scope of this book. An analytical direct proof was offered by Blum and Oettli in [14].       \(\square \)

The problem \(P_Q\) can be efficiently solved by ad hoc numerical methods, which include active set, primal-dual, gradient projection, interior point, Frank–Wolfe, and simplex algorithms [12, 70, 87]. This type of problem arises directly in a variety of fields, e.g., in statistics, data fitting, portfolio problems (see [84, 87] and references therein), multiclass classification [84] (a widely used tool of machine learning), and also as subproblems in certain algorithms for nonlinear optimization problems (e.g., sequential quadratic programming and augmented Lagrangian algorithms). In [85], the reader can find several chapters related to quadratic optimization, e.g., a review on numerical methods (pp. 3–11) and applications to electrical engineering (pp. 27–36) and to chemical engineering (pp. 37–46).
Fig. 4.11

Strip separating two finite sets

Example 4.24

Consider the geometric problem consisting in the separation of two finite sets in \(\mathbb {R}^{n}\), \(\left\{ u_{1},\ldots , u_{p}\right\} \) and \(\left\{ v_{1},\ldots , v_{q}\right\} \), by means of the thickest possible sandwich (the region of \(\mathbb {R}^{n}\) bounded by two parallel hyperplanes, also called a strip when \(n=2\)); see Fig. 4.11.

A necessary and sufficient condition for the existence of such sandwiches is the existence of a hyperplane strictly separating the polytopes \(U:=\mathop {\mathrm{conv}}\left\{ u_{1},\ldots , u_{p}\right\} \) and \(V:=\mathop {\mathrm{conv}}\left\{ v_{1},\ldots , v_{q}\right\} \). It is easy to show that, if \(U\cap V=\emptyset \), the optimal solution to this geometric problem can be obtained by computing a pair of vectors \(\left( \overline{u},\overline{v}\right) \in U\times V\) such that
$$ \left\| \overline{u}-\overline{v}\right\| \le \left\| u-v\right\| ,\quad \forall \left( u, v\right) \in U\times V, $$
whose existence is a consequence of the compactness of \(U\times V\) and the continuity of the Euclidean norm. In fact, let \(w:=\overline{v}-\overline{u}\ne 0_{n}\). Since \(\overline{u}\) is the point of U closest to \(\overline{v}\ne \overline{u}\), \(w^{T}\left( x-\overline{u}\right) \le 0\) for all \(x\in U\) (by the same argument as in the proof of Theorem  2.6). Similarly, \(w^{T}\left( x-\overline{v}\right) \ge 0\) for all \(x\in V\). So, the thickest sandwich separating U from V is
$$\left\{ x\in \mathbb {R}^{n}:w^{T}\overline{u}\le w^{T}x\le w^{T}\overline{v}\right\} ,$$
whose width (also called margin) is \(\left\| w\right\| =\left\| \overline{u}- \overline{v}\right\| \); see Fig. 4.12.
Since U and V are polytopes, they can be expressed (recall Example  2.14) as solution sets of the finite systems \(\left\{ c_{i}^{T}x\le h_{i}, i\in I\right\} \) and \(\left\{ d_{j}^{T}x\le e_{j}, j\in J\right\} \), respectively. Defining \(y:=\left( \begin{array}{c} u \\ v \end{array} \right) \in \mathbb {R}^{2n}\) and \(A:=\left[ I_{n}\mid -I_{n}\right] \), so that \(Ay=u-v\), and denoting by \( 0_{n\times n}\) the \(n\times n\) null matrix, the optimization problem to be solved can be formulated as the following quadratic optimization problem
$$ \begin{array}{lll} P_{U, V}: &{} \mathop {\mathrm{Min}}\nolimits _{y\in \mathbb {R}^{2n}} &{} y^{T}\left( A^{T}A\right) y \\ &{} \text {s.t.} &{} c_{i}^{T}\left[ I_{n}\mid 0_{n\times n}\right] y\le h_{i},\;i\in I, \\ &{} &{} d_{j}^{T}\left[ 0_{n\times n}\mid I_{n}\right] y\le e_{j},\;j\in J, \end{array} $$
which is convex, as the Gram matrix \(A^{T}A\) is positive semidefinite (though not positive definite, because A does not have full column rank). So, if \(\overline{y}={\left( \overline{u}^{T},\overline{v}^{T}\right) }^{T}\) is an optimal solution to \(P_{U, V}\), the hyperplanes \(w^{T}x=w^{T} \overline{u}\) and \(w^{T}x=w^{T}\overline{v}\) support U at \(\overline{u}\) and V at \(\overline{v}\), respectively.
Fig. 4.12

Maximizing the margin between U and V
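To make \(P_{U, V}\) concrete, here is a minimal sketch (not from the book) that solves a small planar instance with SciPy's general-purpose SLSQP routine instead of a dedicated quadratic optimization method; the sets \(U=[0,1]^{2}\) and \(V=\{x\in [0,3]^{2}:x_{1}+x_{2}\ge 3\}\), described by linear inequalities, and the starting point are chosen only for the illustration:
```python
import numpy as np
from scipy.optimize import minimize

n = 2
A = np.hstack([np.eye(n), -np.eye(n)])            # A y = u - v for y = (u, v)

def objective(y):
    return y @ (A.T @ A) @ y                      # ||u - v||^2

def gradient(y):
    return 2.0 * (A.T @ A) @ y

# SLSQP convention: each "ineq" constraint function must be >= 0 at feasible points
cons = (
    [{"type": "ineq", "fun": lambda y, i=i: 1.0 - y[i]} for i in range(2)] +   # u <= (1, 1)
    [{"type": "ineq", "fun": lambda y, i=i: y[i]} for i in range(2)] +         # u >= 0
    [{"type": "ineq", "fun": lambda y: y[2] + y[3] - 3.0}] +                   # v1 + v2 >= 3
    [{"type": "ineq", "fun": lambda y, i=i: 3.0 - y[i]} for i in (2, 3)] +     # v <= (3, 3)
    [{"type": "ineq", "fun": lambda y, i=i: y[i]} for i in (2, 3)]             # v >= 0
)

res = minimize(objective, x0=np.array([0.5, 0.5, 2.0, 2.0]),
               jac=gradient, constraints=cons, method="SLSQP")
u_bar, v_bar = res.x[:n], res.x[n:]
w = v_bar - u_bar
print(u_bar, v_bar, np.linalg.norm(w))   # approx (1, 1), (1.5, 1.5) and margin 0.7071
```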

In the simplest application of the machine learning  methodology, the sets of vectors \(\{ u_{1},\ldots , u_{p}\} \) and \(\{ v_{1},\ldots , v_{q}\} \) in Example 4.24 represent two samples of a random vector, called learning sets (or training sets), of items which have been classified according to some criterion, e.g., in automatic diagnosis, patients sharing some symptom who are actually healthy or not (Classes I and II); in credit scoring, mortgage customers who respected the terms of their loans or not (Classes I and II), etc. In that case, if \(\overline{y}={\left( \overline{u}^{T},\overline{v}^{T}\right) }^{T}\) is an optimal solution to \(P_{U, V}\), its optimal value \(\overline{y }^{T}\left( A^{T}A\right) \overline{y}=\left\| \overline{u}-\overline{v}\right\| ^{2}\) is the square of the margin between the learning sets, and the affine function \(h(x):=w^{T}x-w^{T}\left( \frac{\overline{v}+ \overline{u}}{2}\right) \), with \(w=\overline{v}-\overline{u}\), called classifier, allows us to classify future items according to the sign of h on their corresponding observed vector x: As shown in Fig. 4.13, if \(h(x)>0\) (respectively, \(h(x)<0\)), the item with observed vector x is likely in Class I (Class II, respectively), while the classification of future items with observed vector x such that \(h(x)=0\) (the hyperplane of symmetry of the thickest separating sandwich) is dubious.
Fig. 4.13

Classification of new items

The quadratic problem \(P_{U, V}\) presents two inconveniences: It may have multiple optimal solutions, and the computed optimal solution \(\overline{y}\) may have many nonzero entries. Observe that, if \(\overline{u}_{k}= \overline{v}_{k}=0\), the kth component of the observations is useless for the classification and can be eliminated from the medical check or from the list of documents attached by the borrower to his/her application.

In order to overcome the first inconvenience, we can add to the objective function of \(P_{U, V}\) a regularization term  \(\gamma \left\| y\right\| ^{2}\), with \(\gamma >0\) selected by the decision maker, guaranteeing the existence of a unique optimal solution to the resulting convex quadratic optimization problem (thanks to the strong convexity of the regularization term). This optimal solution, depending on \(\gamma \), approaches the optimal set of \(P_{U, V}\) as \(\gamma \) decreases to 0.

Regarding the second inconvenience, we could follow a similar strategy, consisting in adding to the objective function of \(P_{U, V}\) a sparsity term  \(\gamma s\left( y\right) \), with \(s\left( y\right) \) denoting the number of nonzero components of y, and \(\gamma >0\). Unfortunately, the function s is not even continuous. For this reason, it is usually replaced by the \( \ell _{1}\) norm, as the optimal solutions of the resulting problems tend to be sparse vectors  (i.e., vectors with many zero components).

For example, consider the quadratic optimization problem \(P_Q\) in (4.12) with \(Q=I_2\) and \(c=0_{2}\), whose polyhedral feasible set F is represented in Fig. 4.14. The function s is equal to 0 at \(0_{2}\), is equal to 1 on the coordinate axes outside the origin, and is equal to 2 on the rest of the plane. In Fig. 4.14, we represent the optimal solution to \(P_Q\) and that of the modified problem \(P_Q^1\), which has the same feasible set but whose objective function is \(\frac{1}{2}x^Tx+\Vert x\Vert _1\). Hence, s is equal to 2 at the optimal solution of \(P_Q\), while it has a value of 1 at the solution of \(P_Q^1\). The addition of the \(\ell _1\) norm to the objective function is responsible for the pointedness of the level curves of the resulting function. Therefore, in contrast to the strategy for the regularization term \(\gamma \Vert \cdot \Vert ^2\), the value of \(\gamma \) in the sparsity term should not be chosen too small.
Fig. 4.14

Level curves and optimal solution to \(\frac{1}{2}x^Tx\) (left) and \(\frac{1}{2}x^Tx+\Vert x\Vert _1\) (right)

We now show that the addition of a term \(\gamma \left\| x\right\| _{1}, \) with \(\gamma >0\), to the objective function of a convex quadratic problem
$$ \begin{array}{lll} P_{Q}: &{} \text {Min} &{} \frac{1}{2}x^{T}Qx-c^{T}x \\ &{} \text {s.t.} &{} a_{i}^{T}x\le b_{i}, i\in I, \end{array} $$
preserves these desirable properties, since the resulting problem can still be recast as a linearly constrained convex quadratic one. In fact, recalling the change of variables used in Subsubsection  1.1.5.6 to linearize \(\ell _{1}\) regression problems, we write \(x=u-v\), with \(u, v\in \mathbb {R}_{+}^{n}\), to reformulate the convex but nonquadratic problem
$$ \begin{array}{lll} P_{Q}^{1}: &{} \text {Min} &{} \frac{1}{2}x^{T}Qx-c^{T}x+\gamma \left\| x\right\| _{1} \\ &{} \text {s.t.} &{} a_{i}^{T}x\le b_{i}, i\in I, \end{array} $$
as the convex quadratic one
$$ \begin{array}{lll} P_{Q}^{2}: &{} \text {Min} &{} \frac{1}{2}\left( u-v\right) ^{T}Q\left( u-v\right) -c^{T}\left( u-v\right) +\gamma 1_{n}^{T}\left( u+v\right) \\ &{} \text {s.t.} &{} a_{i}^{T}\left( u-v\right) \le b_{i},\;i\in I, \\ &{} &{} u_{i}\ge 0,\text { }v_{i}\ge 0,\;i=1,\ldots , n. \end{array} $$
This modeling trick has been successfully used in different fields to get convex quadratic reformulations of certain optimization problems. For instance, in compressed sensing [35] one can find unconstrained optimization problems of the form
$$ \begin{array}{ll} \mathop {\mathrm{Min}}_{x\in \mathbb {R}^{n}}&\frac{1}{2}x^{T}Qx-c^{T}x+\gamma \left\| x\right\| _{1} \end{array} , $$
with \(\gamma >0\), which can be reformulated as a convex quadratic problem as above. The same change of variables reduces the convex parametric problem with convex quadratic objective
$$ \begin{array}{lll} P^{\gamma }: &{} \text {Min} &{} \frac{1}{2}x^{T}Qx-c^{T}x \\ &{} \text {s.t.} &{} \left\| x\right\| _{1}\le \gamma , \end{array} $$
which arises in the homotopy method  for variable selection [73] (where the optimal solutions to \(P^{\gamma }\) form, as the parameter \(\gamma \) increases, the so-called path of minimizers) to the linearly constrained quadratic problem
$$ \begin{array}{lll} P_{Q}^{\gamma }: &{} \text {Min} &{} \frac{1}{2}\left( u-v\right) ^{T}Q\left( u-v\right) -c^{T}\left( u-v\right) \\ &{} \text {s.t.} &{} 1_{n}^{T}\left( u+v\right) \le \gamma , \\ &{} &{} u_{i}\ge 0,\text { }v_{i}\ge 0,\text { }i=1,\ldots , n. \end{array} $$
An important variant of the least squares problem \(P_{LS}\) in ( 3.10) arising in machine learning is the so-called \(\ell _{1}\)-penalized least absolute shrinkage and selection operator (LASSO ) [18],
$$ \begin{array}{ll} P_{LASSO}^{\gamma }:&\mathop {\mathrm{Min}}\nolimits _{x\in \mathbb {R}^{n}}f(x)= \frac{1}{2}\left\| Ax-b\right\| ^{2}+\gamma \left\| x\right\| _{1}, \end{array} $$
where the parameter \(\gamma \) controls the weight of the penalization. Its main advantage over the LS estimator is the observed sparsity of its optimal solution, the so-called LASSO estimator . Due to the lack of differentiability of the penalizing term \(\gamma \left\| x\right\| _{1}\), \(P_{LASSO}^{\gamma } \) is a hard problem for which specific first- and second-order methods have been proposed [18, Section 8]. However, as observed in [87], \( P_{LASSO}^{\gamma }\) can be more easily solved once it has been reformulated as a linearly constrained quadratic optimization problem, by replacing the decision variable x by \(u, v\in \mathbb {R}_{+}^{n}\) such that \(x=u-v\).
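As an illustration of this reformulation (a sketch with synthetic data, not taken from [18] or [87]), the split \(x=u-v\) turns \(P_{LASSO}^{\gamma }\) into a smooth problem with nonnegativity bounds, which is solved below with SciPy's generic L-BFGS-B routine rather than a dedicated quadratic optimization code:
```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
m_obs, n = 30, 8
A = rng.normal(size=(m_obs, n))
x_true = np.zeros(n)
x_true[[0, 3]] = [2.0, -1.5]                       # sparse ground truth
b = A @ x_true + 0.01 * rng.normal(size=m_obs)
gamma = 1.0

# After the split x = u - v with u, v >= 0:  min 0.5*||A(u - v) - b||^2 + gamma*1^T(u + v)
def fun(z):
    u, v = z[:n], z[n:]
    r = A @ (u - v) - b
    return 0.5 * r @ r + gamma * np.sum(u + v)

def jac(z):
    u, v = z[:n], z[n:]
    g = A.T @ (A @ (u - v) - b)
    return np.concatenate([g + gamma, -g + gamma])

res = minimize(fun, x0=np.zeros(2 * n), jac=jac,
               method="L-BFGS-B", bounds=[(0.0, None)] * (2 * n))
x_hat = res.x[:n] - res.x[n:]
print(np.round(x_hat, 3))        # most entries should be (close to) zero
```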
Observe finally that the quadratic cone optimization problems of the type
$$ \begin{array}{lll} P_{K}: &{} \text {Min} &{} \frac{1}{2}x^{T}Qx-c^{T}x \\ &{} \text {s.t.} &{} Ax+b\in K, \end{array} $$
where A is an \(m\times n\) matrix, \(b\in \mathbb {R}^{m}\)  and K is a polyhedral convex cone in \(\mathbb {R}^{m}\), can also be reduced to a linearly constrained quadratic optimization problem thanks to the Farkas Lemma 4.15. In fact, since \(K^{\circ }\) is polyhedral too, we can write \(K^{\circ }=\mathop {\mathrm{cone}}\left\{ y_{1},\ldots , y_{p}\right\} \). Since \( K=K^{\circ \circ }=\left\{ y_{1},\ldots , y_{p}\right\} ^{\circ }\), we have that
$$ Ax+b\in K\Longleftrightarrow y_{j}^{T}\left( Ax+b\right) \le 0,\quad \forall j\in \left\{ 1,\ldots , p\right\} . $$
So, the conic constraint \(Ax+b\in K\) can be replaced in \(P_{K}\) by the linear system \(\left\{ \left( y_{j}^{T}A\right) x\le -y_{j}^{T}b, j=1,\ldots , p\right\} \).
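For instance, if \(K=\mathbb {R}_{+}^{m}\), then \(K^{\circ }=\mathop {\mathrm{cone}}\left\{ -e_{1},\ldots ,-e_{m}\right\} \), and the above recipe turns the conic constraint \(Ax+b\in K\) into the componentwise linear system \(\left\{ -\left( e_{j}^{T}A\right) x\le e_{j}^{T}b,\; j=1,\ldots , m\right\} \), that is, \(Ax+b\ge 0_{m}\).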

The main reasons to devote a subsection to this particular class of convex optimization problems are their importance in practice and the fact that they can be solved with pen and paper when the number of inequality constraints is sufficiently small. Moreover, as shown next, their optimal sets can be expressed by means of closed formulas in the favorable case where all constraints are linear equations, as happens in the design of electric circuits (recall that the heat generated in a conductor is proportional to the resistance times the square of the current intensity, so Q is diagonal and \(c=0_{n}\)).

Proposition 4.25

(Quadratic optimization with linear inequality constraints)  Suppose that the problem \(P_Q\) in (4.12) is bounded and that \(Q\in \mathcal {S}_{n}\) is positive semidefinite. Then, \(F^{*}\ne \emptyset \) (in particular, \( \left| F^{*}\right| =1\) whenever Q is positive definite). Moreover, \(\overline{x}\in F\) is an optimal solution to \(P_{Q}\) if and only if there exists \(\overline{\lambda }\in \mathbb {R}^{m}\) such that:
$$ \begin{array}{ll} \text {(NC)} &{} \overline{\lambda }\in \mathbb {R}_{+}^{m}{} ; \\ \text {(SC)} &{} c-Q\overline{x}=\sum _{i\in I}\overline{\lambda }_{i}a_{i}; \text { and} \\ \text {(CC)} &{} \overline{\lambda }_{i}\left( b_{i}-a_{i}^{T}\overline{x} \right) =0,\quad \forall i\in I.\text { } \end{array} $$

Proof

It is a straightforward consequence of the Frank–Wolfe Theorem 4.23, Propositions  3.1 and 4.1, and Theorem 4.20.       \(\square \)

The advantage of the quadratic optimization problems in comparison with those considered in Subsection 4.3.1 is that the KKT system to be solved for each possible value of \(I\left( x\right) \),
$$ \left\{ \begin{array}{c} c-Qx=\sum _{i\in I\left( x\right) }\lambda _{i}a_{i} \\ a_{i}^{T}x=b_{i},\quad i\in I\left( x\right) \end{array} \right\} $$
is now linear.

A multifunction (or set-valued mapping)  between two nonempty sets \(X\subset \mathbb {R}^m\) and \(Y\subset \mathbb {R}^n\) is a correspondence associating with each \(x\in X\) a subset of \(Y\).

Definition 4.26

The metric projection onto a closed set \(\emptyset \ne C\subset \mathbb {R}^{n}\) is the multifunction \(P_{C}:\mathbb {R}^{n}\rightrightarrows \mathbb {R}^{n}\) associating with each \(y\in \mathbb {R}^{n}\) the set
$$\begin{aligned} P_{C}\left( y\right) =\left\{ x\in C:d\left( y,x\right) =d\left( y, C\right) \right\} , \end{aligned}$$
that is, the set of global minima of the function \(x\mapsto d\left( y, x\right) =\left\| x-y\right\| \) on \(C\).

Obviously, \(P_{C}\left( y\right) \ne \emptyset \) as \(d\left( y,\cdot \right) \) is a continuous coercive function on \(\mathbb {R}^n\) and the set C is closed. For instance, if C is a sphere centered at z, \(P_{C}(z) =C\) and \( P_{C}\left( y\right) =C\cap \left\{ z+\lambda \left( y-z\right) :\lambda \ge 0\right\} \) (a unique point) for all \(y\ne z\).

Proposition 4.25 allows us to compute \(P_{C}\left( y\right) \) for any \(y\in \mathbb {R}^{n}\) whenever C is a polyhedral convex set, but it does not provide an explicit expression for \(P_{C}\).

Example 4.27

To compute the projection of \({\left( 3,1,-1\right) }^{T} \) onto the polyhedral cone K of Example  2.14,
$$ K=\left\{ x\in \mathbb {R}^{3}:x_{1}+x_{2}-2x_{3}\le 0,x_{2}-x_{3}\le 0,-x_{3}\le 0\right\} $$
we must minimize \(\left\| {\left( x_{1}-3,x_{2}-1,x_{3}+1\right) }^{T} \right\| ^{2}\) on K; i.e., we must solve the convex quadratic problem
$$ \begin{array}{llll} P_{Q}: &{} \text {Min} &{} f\left( x\right) =x^{T}x-\left( 6,2,-2\right) x &{} \\ &{} \text {s.t.} &{} x_{1}+x_{2}-2x_{3}\le 0,\\ &{} &{} x_{2}-x_{3}\le 0,\\ &{} &{} -x_{3}\le 0, \end{array} $$
which has a positive definite matrix \(Q=2I_{3}\), so that \(P_{Q}\) has a unique optimal solution. The KKT system for \(I\left( x\right) =\left\{ 1\right\} \) is
$$ \left\{ \begin{array}{c} -2x+{\left( 6,2,-2\right) }^{T} =\lambda _{1}{\left( 1,1,-2\right) }^{T} \\ x_{1}+x_{2}-2x_{3}=0 \\ \lambda _{1} \ge 0 \\ x_{2}-x_{3}<0 \\ -x_{3} < 0 \end{array} \right\} , $$
whose unique solution is \(\left( x^{T},\lambda _{1} \right) =\left( 2,0,1, 2\right) \). Thus, \(P_{K}\left( {\left( 3,1,-1\right) }^{T}\right) =\left\{ {( 2,0,1)}^{T}\right\} \), as shown in Fig. 4.15.
Fig. 4.15

Projection of \({(3,1,-1)}^{T}\) onto the polyhedral cone K
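To illustrate the enumeration strategy described after Proposition 4.25, here is a minimal sketch (Python/NumPy; the helper name and tolerances are ours) that loops over the candidate active sets, solves the corresponding linear KKT system, and accepts the first candidate satisfying (NC) and feasibility, applied to the data of Example 4.27:
```python
import numpy as np
from itertools import combinations

def qp_active_set_enumeration(Q, c, A, b, tol=1e-10):
    """Min 0.5*x^T Q x - c^T x s.t. A x <= b, with Q positive definite, by
    enumerating candidate active sets and solving Q x + A_J^T lam = c, A_J x = b_J."""
    m, n = A.shape
    for k in range(m + 1):
        for J in map(list, combinations(range(m), k)):
            K = np.zeros((n + k, n + k))
            K[:n, :n] = Q
            K[:n, n:] = A[J].T
            K[n:, :n] = A[J]
            rhs = np.concatenate([c, b[J]])
            try:
                sol = np.linalg.solve(K, rhs)
            except np.linalg.LinAlgError:
                continue                        # singular KKT matrix: skip this active set
            x, lam = sol[:n], sol[n:]
            rest = [i for i in range(m) if i not in J]
            if np.all(lam >= -tol) and np.all(A[rest] @ x <= b[rest] + tol):
                return x, J, lam                # (NC), (SC), (CC) and feasibility hold
    return None

# Data of Example 4.27 (projection of (3, 1, -1) onto the cone K): Q = 2 I_3, c = (6, 2, -2)
Q = 2 * np.eye(3)
c = np.array([6.0, 2.0, -2.0])
A = np.array([[1.0, 1.0, -2.0], [0.0, 1.0, -1.0], [0.0, 0.0, -1.0]])
b = np.zeros(3)
print(qp_active_set_enumeration(Q, c, A, b))   # x = (2, 0, 1), J = [0] (the first constraint), lam = [2]
```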

We now consider the particular case of the minimization of a quadratic function on an affine manifold.

Proposition 4.28

(Quadratic optimization with linear equality constraints)  Let
$$\begin{aligned} \begin{array}{lll} P_{Q}: &{} \text {Min} &{} f\left( x\right) =\frac{1}{2}x^{T}Qx-c^{T}x+b \\ &{} \text {s.t.} &{} Mx=d, \end{array} \end{aligned}$$
(4.13)
be such that \(Q\in \mathcal {S}_{n}\) is positive definite, \(c\in \mathbb {R}^{n}\), \(b\in \mathbb {R}\), M is a full row rank \(m\times n\) matrix, and \(d\in \mathbb {R} ^{m}\). Then, the unique optimal solution to \(P_{Q}\) is
$$\begin{aligned} \overline{x}=Q^{-1}M^{T}{\left( MQ^{-1}M^{T}\right) }^{-1}\left( d-MQ^{-1}c\right) +Q^{-1}c. \end{aligned}$$
(4.14)

Proof

Let \(a_{i}^{T}\), \(i=1,\ldots , m\), be the rows of M. By assumption, we know that the set \(\left\{ a_{i}, i=1,\ldots , m\right\} \) is linearly independent.

Since Q is a positive definite symmetric matrix, f is coercive and strictly convex on \(\mathbb {R}^{n}\) by Proposition  3.1, and hence also on the affine manifold \(F:=\left\{ x\in \mathbb {R}^{n}:Mx=d\right\} \). Therefore, \(P_{Q}\) has a unique optimal solution \(\overline{x}\), which must satisfy the KKT conditions. Hence, by Proposition 4.25, there exists a multiplier vector \( \lambda \in \mathbb {R}^{m}\) such that
$$\begin{aligned} Q\overline{x}-c+\sum _{i=1}^{m}\lambda _{i}a_{i}=Q\overline{x}-c+M^{T}\lambda =0_{n}, \end{aligned}$$
which, by the nonsingularity of Q, yields
$$\begin{aligned} \overline{x}=-Q^{-1}M^{T}\lambda +Q^{-1}c. \end{aligned}$$
(4.15)
We now show that the symmetric matrix \(MQ^{-1}M^{T}\) is positive definite. Indeed, given \(y\in \mathbb {R}^{m}\setminus \left\{ 0_{m}\right\} \), one has
$$\begin{aligned} y^{T}\left( MQ^{-1}M^{T}\right) y={\left( M^{T}y\right) }^{T}Q^{-1}\left( M^{T}y\right) >0, \end{aligned}$$
as \(Q^{-1}\) is positive definite and the columns of \(M^{T}\) are linearly independent. Thus, \(MQ^{-1}M^{T}\) is nonsingular and (4.15) provides
$$\begin{aligned} d=M\overline{x}=-\left( MQ^{-1}M^{T}\right) \lambda +MQ^{-1}c. \end{aligned}$$
This equation allows us to obtain the multiplier vector corresponding to \( \overline{x}\):
$$\begin{aligned} \lambda =-{\left( MQ^{-1}M^{T}\right) }^{-1}\left( d-MQ^{-1}c\right) . \end{aligned}$$
Replacing this vector \(\lambda \) in (4.15), and simplifying, one gets (4.14).       \(\square \)

Example 4.29

Let
$$\begin{aligned} \begin{array}{lll} P: &{} \text {Min} &{} f\left( x\right) =\frac{1}{2}x_{1}^{2}+x_{2}^{2} + \frac{3}{2}x_{3}^{2}+x_{1}x_{3}-6x_{1}-4x_{2}+6 \\ &{} \text {s.t.} &{} x_{1}+x_{2}=4,\text { }x_{1}+2x_{3}=3. \end{array} \end{aligned}$$
We directly get the unique optimal solution to P from (4.14): \( \overline{x}={\left( \frac{43}{11},\frac{1}{11},\frac{-5}{11}\right) }^{T}\).
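Formula (4.14) can be checked numerically on this example; the following sketch (ours) evaluates it with the data of Example 4.29, replacing the inner inverse by a linear solve:
```python
import numpy as np

# Example 4.29 written in the form (4.13): f(x) = 0.5 x^T Q x - c^T x + b
Q = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [1.0, 0.0, 3.0]])
c = np.array([6.0, 4.0, 0.0])
M = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 2.0]])
d = np.array([4.0, 3.0])

# Formula (4.14)
Qinv = np.linalg.inv(Q)
xbar = Qinv @ M.T @ np.linalg.solve(M @ Qinv @ M.T, d - M @ Qinv @ c) + Qinv @ c
print(np.round(11 * xbar, 8))   # (43, 1, -5), i.e., xbar = (43/11, 1/11, -5/11)
```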

4.3.3 Some Closed Formulas

Proposition 4.28 has an immediate application to statistics, more concretely to \(\ell _{2}\) regression with interpolation: Given the point cloud \(\left\{ (t_{i},s_{i}),\, i=1,\ldots , p\right\} \), with \( t_{i}\ne t_{j}\) for all \(i\ne j\), we try to get the polynomial \(p\left( t\right) =x_{1}+x_{2}t+\ldots +x_{q+1}t^{q}\) of degree at most q which minimizes the Euclidean norm of the residual vector corresponding to the first m observations
$$\begin{aligned} r:=\left( \begin{array}{c} x_{1}+x_{2}t_{1}+\ldots +x_{q+1}t_{1}^{q}-s_{1} \\ \vdots \\ x_{1}+x_{2}t_{m}+\ldots +x_{q+1}t_{m}^{q}-s_{m} \end{array} \right) \in \mathbb {R}^{m},\text { }m<p, \end{aligned}$$
and such that it interpolates the remaining points; i.e., it satisfies \(p\left( t_{i}\right) =s_{i}\), \(i=m+1,\ldots , p\) (see Fig. 4.16).
Fig. 4.16

\(\ell _2\) regression with interpolation

The problem to be solved is
$$\begin{aligned} \begin{array}{lll} P: &{} \text {Min} &{} f(x):=\sum \limits _{i=1}^{m}{\left( \sum \limits _{j=1}^{q+1}x_{j}t_{i}^{j-1}-s_{i}\right) }^{2} \\ &{} \text {s.t.} &{} \sum \limits _{j=1}^{q+1}x_{j}t_{i}^{j-1}=s_{i},\;i=m+1,\ldots , p, \end{array} \end{aligned}$$
(4.16)
which is of the form of Proposition 4.28, with \(Q=2N^{T}N\) (a positive definite matrix whenever \(m\ge q+1\), since the \(t_{i}\) are pairwise distinct), \( c=2N^{T}s\), and \(b=\left\| s\right\| ^{2}\), where
$$\begin{aligned} N:=\left[ \begin{array}{cccc} 1 &{} t_{1} &{} \ldots &{} t_{1}^{q} \\ 1 &{} t_{2} &{} \ldots &{} t_{2}^{q} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ 1 &{} t_{m} &{} \ldots &{} t_{m}^{q} \end{array} \right] \quad \text {and}\quad s:=\left( \begin{array}{c} s_{1} \\ s_{2} \\ \vdots \\ s_{m} \end{array} \right) , \end{aligned}$$
while
$$\begin{aligned} M:=\left[ \begin{array}{cccc} 1 &{} t_{m+1} &{} \ldots &{} t_{m+1}^{q} \\ 1 &{} t_{m+2} &{} \ldots &{} t_{m+2}^{q} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ 1 &{} t_{p} &{} \ldots &{} t_{p}^{q} \end{array} \right] \quad \text {and}\quad d:=\left( \begin{array}{c} s_{m+1} \\ s_{m+2} \\ \vdots \\ s_{p} \end{array} \right) . \end{aligned}$$
A straightforward application of Proposition 4.28 provides the following closed formula for this problem.

Corollary 4.30

(Regression with interpolation) The unique optimal solution to the \(\ell _{2}\) regression problem with interpolation in (4.16) is the polynomial \(\overline{p}\left( t\right) :=\overline{x}_{1}+\overline{x}_{2}t+\ldots +\overline{x}_{q+1}t^{q}\), where
$$\begin{aligned} \overline{x}= {\left( N^{T}N\right) }^{-1}M^{T}{\left( M{\left( N^{T}N\right) }^{-1}M^{T}\right) }^{-1}\left( d-M{\left( N^{T}N\right) }^{-1}N^{T}s\right) +{\left( N^{T}N\right) }^{-1}N^{T}s. \end{aligned}$$
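As a numerical illustration of Corollary 4.30 (a sketch with made-up data; the point cloud and the values of m and q are ours), the closed formula below reproduces the least-squares fit with one interpolated point:
```python
import numpy as np

# Synthetic point cloud: least-squares fit of a degree-q polynomial on the first m
# points, with exact interpolation of the remaining ones
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
s = np.array([1.0, 2.7, 5.8, 10.1, 17.0])
m, q = 4, 2                                     # here p = 5, so one interpolated point

N = np.vander(t[:m], q + 1, increasing=True)    # rows (1, t_i, ..., t_i^q), i = 1,...,m
M = np.vander(t[m:], q + 1, increasing=True)    # rows for i = m+1,...,p
d = s[m:]

G = np.linalg.inv(N.T @ N)                      # (N^T N)^{-1}
xbar = (G @ M.T @ np.linalg.solve(M @ G @ M.T, d - M @ G @ N.T @ s[:m])
        + G @ N.T @ s[:m])
print(xbar)                  # coefficients of the optimal polynomial
print(M @ xbar - d)          # interpolation residual, approximately 0
```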

We now apply Proposition 4.28 to the computation of \( P_{C}\left( y\right) \), whenever C is an affine manifold.

Corollary 4.31

(Metric projection onto affine manifolds)  Let M be a full row rank \(m\times n\) matrix. The metric projection of any \(y\in \mathbb {R}^{n}\) onto the affine manifold \(F=\left\{ x\in \mathbb {R}^{n}:Mx=d\right\} \) is
$$\begin{aligned} P_{F}\left( y\right) =\left\{ M^{T}{\left( MM^{T}\right) }^{-1}\left( d-My\right) +y\right\} , \end{aligned}$$
(4.17)
and the distance from y to F is
$$\begin{aligned} d\left( y, F\right) =\left\| M^{T}{\left( MM^{T}\right) }^{-1}\left( d-My\right) \right\| . \end{aligned}$$

Proof

The problem to be solved is \(P_{Q}\) in (4.13), with
$$\begin{aligned} f\left( x\right) =\left\| x-y\right\| ^{2}=x^{T}x-2y^{T}x+\left\| y\right\| ^{2}. \end{aligned}$$
Taking \(Q=2I_{n}\), \(c=2y\), and \(b=\left\| y\right\| ^{2}\) in (4.14), one gets (4.17). Moreover,
$$\begin{aligned} d\left( y, F\right) =\left\| P_{F}\left( y\right) -y\right\| =\left\| M^{T}{\left( MM^{T}\right) }^{-1}\left( d-My\right) \right\| , \end{aligned}$$
which completes the proof.       \(\square \)
In practice, it is convenient to compute \(P_{F}\left( y\right) \) in two steps, without inverting the Gram matrix \(G:=MM^{T}\) of M, as follows:
  • Step 1: Find w such that \(\left\{ Gw=d-My\right\} \).

  • Step 2: Compute \(P_{F}\left( y\right) =M^{T}w+y\).
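A minimal sketch of these two steps (ours; the data are those of Example 4.34 below):
```python
import numpy as np

def project_onto_affine_manifold(M, d, y):
    """Metric projection of y onto F = {x : M x = d}, M with full row rank,
    computed with the two steps above (no explicit inverse of G = M M^T)."""
    w = np.linalg.solve(M @ M.T, d - M @ y)   # Step 1: solve G w = d - M y
    return M.T @ w + y                        # Step 2: P_F(y) = M^T w + y

# Projection of the origin onto F = {x : x1+x2+x3 = 1, -x1-x2+x3 = 0}
M = np.array([[1.0, 1.0, 1.0], [-1.0, -1.0, 1.0]])
d = np.array([1.0, 0.0])
print(project_onto_affine_manifold(M, d, np.zeros(3)))   # (0.25, 0.25, 0.5)
```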

The next result is an immediate consequence of Corollary 4.31.

Corollary 4.32

(The distance from the origin to a linear manifold) Let M be a full row rank \(m\times n\) matrix. The solution to the consistent system \(\left\{ Mx=d\right\} \) of minimum Euclidean norm is
$$\begin{aligned} P_{F}\left( 0_{n}\right) =\left\{ M^{T}{\left( MM^{T}\right) }^{-1}d\right\} . \end{aligned}$$
(4.18)

Example 4.33

The point of the hyperplane \(H=\left\{ x\in \mathbb {R}^{n}:c^{T}x=d\right\} , \) with \(c\in \mathbb {R}^{n}\setminus \left\{ 0_{n}\right\} \) and \(d\in \mathbb {R}\), closest to a given point y can be obtained by taking \( M=c^{T} \) in (4.17), i.e.,
$$\begin{aligned} P_{H}\left( y\right) =\left\{ \left\| c\right\| ^{-2}\left( d-c^{T}y\right) c+y\right\} , \end{aligned}$$
so the Euclidean distance from y to H is given by the well-known formula
$$\begin{aligned} d\left( y, H\right) =\left\| P_{H}\left( y\right) -y\right\| = \frac{\left| c^{T}y-d\right| }{\left\| c\right\| }. \end{aligned}$$
(4.19)

Example 4.34

We are asked to compute the point of the affine manifold
$$\begin{aligned} F:=\left\{ x\in \mathbb {R}^{3}:x_{1}+x_{2}+x_{3}=1,-x_{1}-x_{2}+x_{3}=0\right\} \end{aligned}$$
closest to the origin (see Fig. 4.17). We have
$$\begin{aligned} M=\left[ \begin{array}{ccc} 1 &{} 1 &{} 1 \\ -1 &{} -1 &{} 1 \end{array} \right] \quad \text {and}\quad d=\left( \begin{array}{c} 1 \\ 0 \end{array} \right) . \end{aligned}$$
We can apply (4.18) in two ways:
1. By matrix calculus:
$$\begin{aligned} P_{F}\left( 0_{3}\right) =\left\{ M^{T}G^{-1}d\right\} =\left\{ \left[ \begin{array}{cc} 1 &{} -1 \\ 1 &{} -1 \\ 1 &{} 1 \end{array} \right] \left[ \begin{array}{cc} \frac{3}{8} &{} \frac{1}{8} \\ \frac{1}{8} &{} \frac{3}{8} \end{array} \right] \left( \begin{array}{c} 1 \\ 0 \end{array} \right) \right\} =\left\{ \left( \begin{array}{c} \frac{1}{4} \\ \frac{1}{4} \\ \frac{1}{2} \end{array} \right) \right\} . \end{aligned}$$
2. By Gaussian elimination: The unique solution to \(\left\{ Gw=d\right\} \) is \( w={\left( \frac{3}{8},\frac{1}{8}\right) }^{T}\). Thus,
$$\begin{aligned} P_{F}\left( 0_{3}\right) =\left\{ M^{T}w\right\} =\left\{ \left[ \begin{array}{cc} 1 &{} -1 \\ 1 &{} -1 \\ 1 &{} 1 \end{array} \right] \left( \begin{array}{c} \frac{3}{8} \\ \frac{1}{8} \end{array} \right) \right\} =\left\{ \left( \begin{array}{c} \frac{1}{4} \\ \frac{1}{4} \\ \frac{1}{2} \end{array} \right) \right\} . \end{aligned}$$

4.4 Arbitrarily Constrained Convex Optimization\(^{\star }\)

The rest of the chapter is devoted to the study of convex optimization problems of the form
$$\begin{aligned} \begin{array}{lll} P: &{} \text {Min} &{} f\left( x\right) \\ &{} \text {s.t.} &{} g_{i}\left( x\right) \le 0,\text { }i\in I, \\ &{} &{} x\in C, \end{array} \end{aligned}$$
(4.20)
where \(\emptyset \ne C\subset \mathbb {R}^{n}\) is a convex set, the functions \(g_{i}\) are convex on C, \(i\in I\), and f is convex on the feasible set F of \(P\). For this type of problem, we are not able to give a closed-form solution, but we can at least obtain optimal solutions in an analytic way via the KKT conditions. When P arises in industrial production planning, any perturbation of the right-hand side of \(g_{i}\left( x\right) \le 0\) represents a variation of the available amount of the ith resource. Sensitivity analysis allows us to estimate the variation of the optimal value under this type of perturbation when there exists a so-called sensitivity vector, which is related to the KKT necessary conditions at an optimal solution to \(P\). Stopping rules for the numerical optimization algorithms for P can be obtained either from the mentioned KKT necessary conditions or by simultaneously solving P and a suitable dual problem providing lower bounds for the optimal value of \(P\).
Fig. 4.17

Minimum Euclidean norm solution to the system \(\left\{ x_{1}{+}x_{2}{+}x_{3}=1,-x_{1}-x_{2}+x_{3}=0\right\} \)

4.4.1 Sensitivity Analysis

This subsection deals with the parametric problem that results from replacing 0 by a parameter \(z_{i}\in \mathbb {R}\) in the functional constraint \(g_{i}\left( x\right) \le 0\), \(i\in I=\{1,\ldots , m\}\):
$$\begin{aligned} \begin{array}{lll} P(z) : &{} \text {Min} &{} f\left( x\right) \\ &{} \text {s.t.} &{} g_{i}\left( x\right) \le z_{i}, i\in I, \\ &{} &{} x\in C. \end{array} \end{aligned}$$
The parameter vector \(z={\left( z_{1},\ldots , z_{m}\right) }^{T} \in \mathbb {R}^{m}\) will be called right-hand side vector in P(z). The feasible set and the optimal value of P(z) will be denoted by \(\mathcal {F}( z) \) and \(\mathcal {\vartheta }(z) \), respectively, with the convention that \(\mathcal {\vartheta }(z) =+\infty \) when \( \mathcal {F}(z) =\emptyset \). Obviously, \(P\left( 0_{m}\right) \equiv P\) and \(\mathcal {\vartheta }\left( 0_{m}\right) =v\left( P\right) \in \overline{\mathbb {R}}:=\mathbb {R}\cup \left\{ \pm \infty \right\} \) (the extended real line). Observe that \( \mathcal {F}:\mathbb {R}^{m}\rightrightarrows \mathbb {R}^{n}\) is a multifunction, while \(\mathcal {\vartheta }: \mathbb {R}^{m}\rightarrow \overline{\mathbb {R}}\) is an extended real function, called feasible set multifunction  and value function , respectively. Both the value function \( \mathcal {\vartheta }\) and the feasible set multifunction \(\mathcal {F }\) are represented through their corresponding graphs
$$\begin{aligned} \mathop {\mathrm{gph}}\mathcal {\vartheta }:=\left\{ \left( \begin{array}{c} z\\ y \end{array} \right) \in \mathbb {R} ^{m+1}:y=\mathcal {\vartheta }(z) \right\} \end{aligned}$$
and
$$\begin{aligned} \mathop {\mathrm{gph}}\mathcal {F}:=\left\{ \left( \begin{array}{c} z\\ x \end{array} \right) \in \mathbb {R} ^{m+n}:x\in \mathcal {F}(z) \right\} . \end{aligned}$$
The epigraph of \(\mathcal {\vartheta }\) is the set
$$\begin{aligned} \mathop {\mathrm{epi}}\mathcal {\vartheta }:=\left\{ \left( \begin{array}{c} z\\ y \end{array} \right) \in \mathbb {R} ^{m+1}:\mathcal {\vartheta }(z) \le y\right\} . \end{aligned}$$
The domains of \(\mathcal {F}\) and \(\mathcal {\vartheta }\) are
$$\begin{aligned} \mathop {\mathrm{dom}}\mathcal {F}:=\left\{ z\in \mathbb {R}^{m}:\mathcal {F}(z) \ne \emptyset \right\} \quad \text {and}\quad \mathop {\mathrm{dom}}\mathcal { \vartheta }:=\left\{ z\in \mathbb {R}^{m}:\mathcal {\vartheta }(z) <+\infty \right\} , \end{aligned}$$
which obviously coincide. Observe that \(z\in \mathop {\mathrm{int}}\mathop {\mathrm{dom}} \mathcal {\vartheta }\) when small perturbations of z preserve the consistency of P(z).

We now study the properties of \(\mathcal {\vartheta }\), whose argument z is interpreted as a perturbation of \(0_{m}\). One of these properties is convexity, which we now define for extended real functions. To handle these functions, we must first extend the algebraic operations (sum and product) and the natural ordering of \(\mathbb {R}\) to the extended real line \(\overline{ \mathbb {R}}\).

Concerning the sum, inspired by the algebra of limits of sequences, we define
$$\begin{aligned} \alpha +\left( +\infty \right) =\left( +\infty \right) +\alpha =+\infty ,\quad \forall \alpha \in \mathbb {R\cup }\left\{ +\infty \right\} \end{aligned}$$
and
$$\begin{aligned} \alpha +\left( -\infty \right) =\left( -\infty \right) +\alpha =-\infty ,\quad \forall \alpha \in \mathbb {R\cup }\left\{ -\infty \right\} . \end{aligned}$$
In the indeterminate case, it is convenient to define
$$\begin{aligned} \left( -\infty \right) +\left( +\infty \right) =\left( +\infty \right) +\left( -\infty \right) =+\infty . \end{aligned}$$
(4.21)
In this way, the sum is well defined on \(\overline{\mathbb {R}}\) and is commutative.
The product of elements of \(\overline{\mathbb {R}}\) is also defined as in the algebra of limits (e.g., \(\alpha \left( +\infty \right) =+\infty \) if \( \alpha >0\)), with the following convention for the indeterminate cases:
$$\begin{aligned} 0\left( +\infty \right) =\left( +\infty \right) 0=0\left( -\infty \right) =\left( -\infty \right) 0=0. \end{aligned}$$
(4.22)
So, the product is also well defined on \(\overline{\mathbb {R}}\) and is commutative (but \(\left( \overline{\mathbb {R}};+,\cdot \right) \) is not a field as \(+\infty \) and \(-\infty \) do not have inverse elements).
The extension of the ordering of \(\mathbb {R}\) to \(\overline{\mathbb {R}}\) is made by the convention that
$$\begin{aligned} -\infty<\alpha <+\infty ,\quad \forall \alpha \in \mathbb {R}. \end{aligned}$$
The extension of the absolute value from \(\mathbb {R}\) to \(\overline{\mathbb { R}}\) consists in defining \(\left| +\infty \right| =\left| -\infty \right| =+\infty \), and has similar properties. The calculus rules for sums, products, and inequalities on \(\overline{\mathbb {R}}\) are similar to those of \(\mathbb {R}\), just taking into account (4.21) and (4.22). For instance, if \(\alpha ,\beta \in \left\{ \pm \infty \right\} \), then \(\left| \alpha -\beta \right| =+\infty \). Any nonempty subset of \(\overline{\mathbb {R}}\) (defined as in \(\mathbb {R}\)) has an infimum (supremum), possibly \(-\infty \) ( \(+\infty \), respectively). We also define \(\inf \emptyset =+\infty \) and \( \sup \emptyset =-\infty \).

We are now in a position to define convexity of extended real functions, which, unlike Definition  2.26, does not involve a particular convex set C of \(\mathbb {R}^{n}\).

Definition 4.35

The extended real function \(h:\mathbb {R}^{n}\rightarrow \overline{\mathbb {R}} \) is convex if
$$\begin{aligned} h\left( \left( 1-\mu \right) x+\mu y\right) \le \left( 1-\mu \right) h\left( x\right) +\mu h\left( y\right) , \quad \forall x, y\in \mathbb {R}^{n},\text { }\forall \mu \in ] 0,1 [, \end{aligned}$$
and it is concave  when \(-h\) is convex.

It is easy to prove that h is convex if and only if \(\mathop {\mathrm{epi}}h\) is a convex subset of \(\mathbb {R}^{n+1}\). Hence, the supremum of convex functions (in particular, the supremum of affine functions) is also convex. Since linear mappings between linear spaces preserve the convexity of sets and \( \mathop {\mathrm{dom}}h\) is the image of \(\mathop {\mathrm{epi}}h\) under the vertical projection \(\left( \begin{array}{c} x \\ x_{n+1} \end{array} \right) \mapsto x\), if h is convex, then \(\mathop {\mathrm{dom}}h\) is convex too.

Definition 4.36

A given point \(\widehat{x}\in C\) is a Slater point for P if \(g_{i}\left( \widehat{x}\right) <0\) for all \(i\in I\) (that is, \(\widehat{x}\in F\) and \( I\left( \widehat{x}\right) =\emptyset \)). We say that the problem P satisfies the Slater constraint qualification (SCQ) when there exists a Slater point.

The next example shows that \(\mathcal {\vartheta }\) might be nondifferentiable at \(0_{m}\) even when SCQ holds, although it is a convex function, as we shall see in Theorem 4.38. It can even be discontinuous; see Exercise 4.16.

Example 4.37

Consider the convex optimization problem
$$\begin{aligned} \begin{array}{lll} P: &{} \text {Min} &{} f\left( x\right) =\left| x_{1}\right| +x_{2} \\ &{} \text {s.t.} &{} g\left( x\right) =x_{1}\le 0 \\ &{} &{} x\in C=\mathbb {R\times R}_{+}. \end{array} \end{aligned}$$
It can be easily checked that P satisfies SCQ, \(F^{*}=\left\{ 0_{2}\right\} \), \(v\left( P\right) =0\), \(\mathcal {F}(z) =\left] -\infty , z\right] \times \mathbb {R}_{+}\), \(\mathcal {\vartheta }(z) =-z\) if \(z<0\), and \(\mathcal {\vartheta }(z) =0\) when \(z\ge 0\). In Fig. 4.18, we show \(\mathop {\mathrm{gph}}\mathcal {F}=\left\{ {\left( z, x_{1}, x_{2}\right) }^{T} \in \mathbb {R}^{3}:x_{1}\le z,\, x_{2}\ge 0\right\} \) and \(\mathop {\mathrm{gph}}\mathcal {\vartheta }\). Finally, observe that \(\mathcal {\vartheta }\) is continuous on \(\mathbb {R}\) and differentiable on \(\mathbb {R}\setminus \left\{ 0\right\} \).
Fig. 4.18

\(\mathop {\mathrm{gph}}\mathcal {F}\) (left) and \(\mathop {\mathrm{gph}}\mathcal {\vartheta }\) (right)
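For a quick numerical illustration (a sketch of ours, not from the book), the value function of Example 4.37 can be approximated by brute force over a grid of feasible values of \(x_{1}\) and compared with the closed form \(\mathcal {\vartheta }(z)=\max \{-z,0\}\) obtained above:
```python
import numpy as np

# Brute-force approximation of the value function of Example 4.37:
# theta(z) = min{|x1| + x2 : x1 <= z, x2 >= 0}; clearly x2 = 0 at the optimum
grid = np.arange(-3.0, 3.25, 0.25)            # grid of candidate values for x1

def theta(z):
    feasible_x1 = grid[grid <= z]
    return np.min(np.abs(feasible_x1))

for z in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(z, theta(z), max(-z, 0.0))          # the two last columns coincide
```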

Theorem 4.38

(Convexity of the value function) The value function \(\mathcal {\vartheta }\) of a convex optimization problem P is convex. Moreover, if P satisfies SCQ, then \(0_{m}\in \mathop {\mathrm{int}}\mathop {\mathrm{dom}}\mathcal {\vartheta }. \)

Proof

Observe that the restriction of \(\mathcal {\vartheta }\) to \(\mathop {\mathrm{dom}} \mathcal {\vartheta }\) takes values in \(\mathbb {R}\cup \left\{ -\infty \right\} \). Let \(y, z\in \mathbb {R}^{m}\) and \(\mu \in ] 0,1 [ \). If either \(\mathcal {\vartheta }\left( y\right) =+\infty \) or \(\mathcal {\vartheta }\left( z\right) =+\infty \), then \(\left( 1-\mu \right) \mathcal {\vartheta }\left( y\right) +\mu \mathcal { \vartheta }\left( z\right) =+\infty \), according to the calculus rules on \( \overline{\mathbb {R}}\), and the convexity inequality holds trivially. Hence, we may assume that \(y, z\in \mathop {\mathrm{dom}}\mathcal { \vartheta }\).

We claim the inclusion \(A\subset B\) for the sets A and B defined as follows:
$$\begin{aligned} A:=&\big \{\left( 1-\mu \right) x_{1}+\mu x_{2}:x_{1}, x_{2}\in C, \\&\quad \left( 1-\mu \right) g_{i}\left( x_{1}\right) +\mu g_{i}\left( x_{2}\right) \le \left( 1-\mu \right) y_{i}+\mu z_{i},\text { }\forall i\in I \big \} \end{aligned}$$
and
$$\begin{aligned} B:=\left\{ x\in C:g_{i}\left( x\right) \le \left( 1-\mu \right) y_{i}+\mu z_{i},\text { }\forall i\in I\right\} =\mathcal {F}\left( \left( 1-\mu \right) y+\mu z\right) . \end{aligned}$$
Indeed, take an arbitrary \(x:=\left( 1-\mu \right) x_{1}+\mu x_{2}\in A\). By the convexity of \(g_{i}\) on C, \(i\in I\), one has
$$\begin{aligned} g_{i}\left( x\right) \le \left( 1-\mu \right) g_{i}\left( x_{1}\right) +\mu g_{i}\left( x_{2}\right) \le \left( 1-\mu \right) y_{i}+\mu z_{i},\quad \forall i\in I, \end{aligned}$$
so we have that \(x\in B\).
The inclusion \(A\subset B\) is preserved when one takes images by f, i.e., \(f\left( A\right) \subset f\left( B\right) \), and the order is inverted when one takes infima: \(\inf f\left( B\right) \le \inf f\left( A\right) \). By the definition of \(\mathcal {\vartheta }\) and the convexity of f, we get
$$\begin{aligned} \begin{aligned} \mathcal {\vartheta }( \left( 1-\mu \right) y&+\mu z) =\inf \left\{ f\left( x\right) :x\in \mathcal {F}\left( \left( 1-\mu \right) y+\mu z\right) \right\} \\&=\inf f\left( B\right) \le \inf f\left( A\right) \\&=\inf \big \{ f\left( \left( 1-\mu \right) x_{1}+\mu x_{2}\right) :x_{1}, x_{2}\in C,\\&\qquad \quad \left( 1-\mu \right) g_{i}\left( x_{1}\right) +\mu g_{i}\left( x_{2}\right) \le \left( 1-\mu \right) y_i+\mu z_i\text { }\forall i\in I\big \} \\&\le \inf \big \{ \left( 1-\mu \right) f\left( x_{1}\right) +\mu f\left( x_{2}\right) :x_{1}, x_{2}\in C,\\&\qquad \quad \left( 1-\mu \right) g_{i}\left( x_{1}\right) +\mu g_{i}\left( x_{2}\right) \le \left( 1-\mu \right) y_i+\mu z_i\text { }\forall i\in I\big \} . \end{aligned} \end{aligned}$$
(4.23)
Consider now the sets
$$\begin{aligned} D:=\left\{ \left( x_{1}, x_{2}\right) \in C\times C:g_{i}\left( x_{1}\right) \le y_{i}\text { and }g_{i}\left( x_{2}\right) \le z_{i},\text { }\forall i\in I\right\} \end{aligned}$$
and
$$\begin{aligned} E:=\left\{ \left( x_{1}, x_{2}\right) \in C\times C:\left( 1-\mu \right) g_{i}\left( x_{1}\right) +\mu g_{i}\left( x_{2}\right) \le \left( 1-\mu \right) y_{i}+\mu z_{i},\text { }\forall i\in I\right\} . \end{aligned}$$
Since, for any \(i\in I\),
$$\begin{aligned} \left[ g_{i}\left( x_{1}\right) \le y_{i}\text { and }g_{i}\left( x_{2}\right) \le z_{i}\right] \Rightarrow \left( 1-\mu \right) g_{i}\left( x_{1}\right) +\mu g_{i}\left( x_{2}\right) \le \left( 1-\mu \right) y_{i}+\mu z_{i}, \end{aligned}$$
we have \(D\subset E\). By (4.23), the latter inclusion, and the fact that the infimum of the sum of two sets of real numbers is the sum of their respective infima, one has
$$\begin{aligned} \mathcal {\vartheta }\left( \left( 1-\mu \right) y+\mu z\right)&\le \inf \left\{ \left( 1-\mu \right) f\left( x_{1}\right) +\mu f\left( x_{2}\right) :\left( x_{1}, x_{2}\right) \in E\right\} \\&\le \inf \left\{ \left( 1-\mu \right) f\left( x_{1}\right) +\mu f\left( x_{2}\right) :\left( x_{1}, x_{2}\right) \in D\right\} \\&=\inf \left\{ \left( 1-\mu \right) f\left( x_{1}\right) :x_{1}\in C, g_{i}\left( x_{1}\right) \le y_{i},\text { }\forall i\in I\right\} \\&\quad +\inf \left\{ \mu f\left( x_{2}\right) :x_{2}\in C, g_{i}\left( x_{2}\right) \le z_{i},\text { }\forall i\in I\right\} \\&=\left( 1-\mu \right) \mathcal {\vartheta }\left( y\right) +\mu \mathcal { \vartheta }(z) . \end{aligned}$$
Hence, \(\mathcal {\vartheta }\) is convex.

Assume now that P satisfies SCQ. Let \(\widehat{x}\in C\) be such that \( g_{i}\left( \widehat{x}\right) <0\) for all \(i\in I\). Let \(\rho :=\min _{i\in I}\left( -g_{i}\left( \widehat{x}\right) \right) >0\), that is, \(\max _{i\in I}g_{i}\left( \widehat{x}\right) =-\rho \).

Given \(z\in \rho \mathbb {B}\), one has \(\left| z_{i}\right| \le \rho \), so \(g_{i}\left( \widehat{x}\right) \le -\rho \le z_{i}\), \(i\in I\). Thus, \(\widehat{x}\in \mathcal {F}(z)\) and one has \(z\in \mathop {\mathrm{ dom}}\mathcal {F}=\mathop {\mathrm{dom}}\mathcal {\vartheta }\). Since \(\rho \mathbb {B} \mathbb {\subset }\mathop {\mathrm{dom}}\mathcal {\vartheta }\), we conclude that \( 0_{m}\in \mathop {\mathrm{int}}\mathop {\mathrm{dom}}\mathcal {\vartheta }\), which completes the proof.       \(\square \)

In Exercise 4.16, \(\mathcal {\vartheta }\) is finite valued on \(\mathop {\mathrm{dom}} \mathcal {\vartheta }\). We now show that this fact is a consequence of the convexity of \(\mathcal {\vartheta }\) and its finiteness at some point of \(\mathop {\mathrm{int}}\mathop {\mathrm{dom}}\mathcal {\vartheta }=\mathbb {R}\).

Corollary 4.39

If \(0_{m}\in \mathop {\mathrm{int}}\mathop {\mathrm{dom}}\mathcal { \vartheta }\) and \(\mathcal {\vartheta }\left( 0_{m}\right) \in \mathbb {R}\), then \(\mathcal {\vartheta }\) is finite on the whole of \(\mathop {\mathrm{dom}}\mathcal { \vartheta }\).

Proof

Suppose that \(0_{m}\in \mathop {\mathrm{int}}\mathop {\mathrm{dom}}\mathcal {\vartheta }\), and assume, by contradiction, that there exists a point \(z^{1}\in \mathop {\mathrm{dom}} \mathcal {\vartheta }\) such that \(\mathcal {\vartheta }\left( z^{1}\right) =-\infty .\ \)Let \(\varepsilon >0\) be such that \(\varepsilon \mathbb {B} \subset \mathop {\mathrm{dom}}\vartheta \). The point \(z^{2}:=-\frac{\varepsilon }{ \left\| z^{1}\right\| }z^{1}\in \varepsilon \mathbb {B}\) satisfies \( 0_{m}\in ] z^{1}, z^{2} [ \). Let \(\mu \in ] 0,1 [ \) be such that \(0_{m}=\left( 1-\mu \right) z^{1}+\mu z^{2}\). Then, due to the convexity of \(\mathcal {\vartheta }\) and the fact that \(\mathcal {\vartheta } \left( z^{2}\right) \in \mathbb {R}\cup \left\{ -\infty \right\} \), we have
$$ \mathcal {\vartheta }\left( 0_{m}\right) \le \left( 1-\mu \right) \mathcal { \vartheta }\left( z^{1}\right) +\mu \mathcal {\vartheta }\left( z^{2}\right) =-\infty , $$
so that \(\mathcal {\vartheta }\left( 0_{m}\right) =-\infty \) (contradiction).      \(\square \)

As a consequence of Theorem 4.38, \(\mathop {\mathrm{epi}}\mathcal {\vartheta }\) is convex, but it can be closed (as in Example 4.37) or not. It always satisfies the inclusion \(\mathop {\mathrm{gph}}\vartheta \subset \mathop {\mathrm{bd}} \mathop {\mathrm{cl}}\mathop {\mathrm{epi}}\mathcal {\vartheta }\) because \(\mathop {\mathrm{gph}} \vartheta \subset \mathop {\mathrm{epi}}\mathcal {\vartheta }\) and, for any \(x\in \mathop {\mathrm{dom}}\vartheta \), \({\left( x,\vartheta \left( x\right) -\frac{1}{k} \right) }^{T} \notin \mathop {\mathrm{epi}}\mathcal {\vartheta }\) for all \(k\in \mathbb {N}\). Thus, by the supporting hyperplane theorem, there exists a hyperplane supporting \(\mathop {\mathrm{cl}}\mathop {\mathrm{epi}}\mathcal {\vartheta }\) at any point of \( \mathop {\mathrm{gph}}\vartheta \). This hyperplane might be unique or not, and when it is unique, it can be vertical or not. For instance, in Example 4.37, any line of the form \(y=-\lambda z\) with \(\lambda \in \left[ 0,1 \right] \) supports \(\mathop {\mathrm{epi}}\mathcal {\vartheta }\) at \(0_2\), while the vertical line \(z=0\) does not support \(\mathop {\mathrm{epi}}\mathcal {\vartheta }\).

Theorem 4.40

(Sensitivity theorem)  If P is bounded and satisfies SCQ, then \(\mathcal { \vartheta }\) is finite valued on \(\mathop {\mathrm{dom}}\mathcal {\vartheta }\) and there exists \(\lambda \in \mathbb {R}_{+}^{m}\) such that
$$\begin{aligned} \mathcal {\vartheta }(z) \ge \mathcal {\vartheta }\left( 0_{m}\right) -\lambda ^{T}z,\quad \forall z\in \mathop {\mathrm{dom}}\mathcal { \vartheta }. \end{aligned}$$
(4.24)
If, additionally, \(\mathcal {\vartheta }\) is differentiable at \(0_{m}\), then the unique \(\lambda \in \mathbb {R}_{+}^{m}\) satisfying (4.24) is \( -\nabla \mathcal {\vartheta }\left( 0_{m}\right) \).

Proof

By Theorem 4.38, \(\mathcal {\vartheta }\) is convex and \( 0_{m}\in \mathop {\mathrm{int}}\mathop {\mathrm{dom}}\mathcal {\vartheta }\). Since \(\mathcal { \vartheta }\left( 0_{m}\right) =v\left( P\right) \in \mathbb {R}\), Corollary 4.39 allows us to assert that \(\mathcal {\vartheta }\) is finite valued on \(\mathop {\mathrm{dom}}\mathcal {\vartheta }\).
Fig. 4.19

Geometric interpretation of the subgradients when \(m=1\)

Since \(\mathcal {\vartheta }\) is convex and \(0_{m}\in \mathop {\mathrm{int}}\mathop {\mathrm{dom}}\mathcal {\vartheta }\), by Proposition  2.33, there exists a nonvertical hyperplane \(a^{T}z+a_{n+1}z_{n+1}=b\), with \(a_{n+1}\ne 0\), which supports \( \mathop {\mathrm{cl}}\mathop {\mathrm{epi}}\mathcal {\vartheta }\) at \(\left( \begin{array}{c} 0_{m} \\ \mathcal {\vartheta }\left( 0_{m}\right) \end{array} \right) \). Then, \(a_{n+1}\mathcal {\vartheta }\left( 0_{m}\right) =b\), and we can assume that
$$\begin{aligned} a^{T}z+a_{n+1}z_{n+1}\le b,\quad \forall \left( \begin{array}{c} z \\ z_{n+1} \end{array} \right) \in \mathop {\mathrm{cl}}\mathop {\mathrm{epi}}\mathcal {\vartheta }. \end{aligned}$$
(4.25)
Since \(\left( \begin{array}{c} 0_{m} \\ \mathcal {\vartheta }\left( 0_{m}\right) +\delta \end{array} \right) \in \mathop {\mathrm{cl}}\mathop {\mathrm{epi}}\mathcal {\vartheta }\) for all \(\delta \ge 0\), we have \( a_{n+1}<0\). Dividing both sides of  (4.25) by \(\left| a_{n+1}\right| \), and defining \(\gamma :=\frac{a}{\left| a_{n+1}\right| }\) and \(\beta :=\frac{b}{\left| a_{n+1}\right| } \), one gets
$$\begin{aligned} \gamma ^{T}z-z_{n+1}\le \beta ,\quad \forall \left( \begin{array}{c} z\\ z_{n+1} \end{array} \right) \in \mathop {\mathrm{gph}}\mathcal {\vartheta }, \end{aligned}$$
with \(\beta =\gamma ^{T}0_{m}-\mathcal {\vartheta }\left( 0_{m}\right) =-\mathcal { \vartheta }\left( 0_{m}\right) \). In other words,
$$\begin{aligned} \mathop {\mathrm{gph}}\mathcal {\vartheta }\subset \left\{ \left( \begin{array}{c} z \\ z_{n+1} \end{array} \right) \in \mathbb {R}^{m+1}:{\left( \begin{array}{c} \gamma \\ -1 \end{array} \right) }^{T}\left[ \left( \begin{array}{c} z \\ z_{n+1} \end{array} \right) -\left( \begin{array}{c} 0_{m} \\ \mathcal {\vartheta }\left( 0_{m}\right) \end{array} \right) \right] \le 0\right\} \end{aligned}$$
(4.26)
(the vectors \(\gamma \in \mathbb {R}^{m}\) satisfying (4.26) are called subgradients  of \(\mathcal {\vartheta }\) at \(0_{m}\); see Fig. 4.19.)
Given \(z\in \mathop {\mathrm{dom}}\mathcal {\vartheta }\), as \(\left( \begin{array}{c} z \\ \mathcal {\vartheta }(z) \end{array} \right) \in \mathop {\mathrm{gph}}\mathcal {\vartheta }\), we have
$$\begin{aligned} {\left( \begin{array}{c} \gamma \\ -1 \end{array} \right) }^{T}\left( \begin{array}{c} z \\ \mathcal {\vartheta }(z) -\mathcal {\vartheta }\left( 0_{m}\right) \end{array} \right) =\gamma ^{T}z-\mathcal {\vartheta }(z) +\mathcal { \vartheta }\left( 0_{m}\right) \le 0. \end{aligned}$$
Hence, the vector \(\lambda :=-\gamma \) satisfies (4.24).
We now prove that \(\lambda \in \mathbb {R}_{+}^{m}\) by contradiction. Assume that \(\lambda \) has a negative component \(\lambda _{i}<0\). Since \( 0_{m}\in \mathop {\mathrm{int}}\mathop {\mathrm{dom}}\mathcal {\vartheta }\), there exists a sufficiently small positive number \(\varepsilon \) such that \(\varepsilon \mathbb {B}\subset \mathop {\mathrm{dom}}\mathcal {\vartheta }\). Then, \(\varepsilon e_{i}\in \mathop {\mathrm{dom}}\mathcal {\vartheta }\) and, by (4.24), one has
$$\begin{aligned} \mathcal {\vartheta }\left( \varepsilon e_{i}\right) \ge \mathcal {\vartheta } \left( 0_{m}\right) -\lambda ^{T}\left( \varepsilon e_{i}\right) =v\left( P\right) -\varepsilon \lambda _{i}>v\left( P\right) . \end{aligned}$$
(4.27)
But \(\varepsilon e_{i}\in \mathbb {R}_{+}^{m}\) implies \(\mathcal {F}\left( 0_{m}\right) \subset \mathcal {F}\left( \varepsilon e_{i}\right) \), which in turn implies \(v\left( P\right) \ge \mathcal {\vartheta }\left( \varepsilon e_{i}\right) \), in contradiction with (4.27).

Finally, the case where \(\mathcal {\vartheta }\) is differentiable at \(0_{m}\) is a direct consequence of the last assertion in Proposition  2.33.       \(\square \)

A vector \(\lambda \in \mathbb {R}_{+}^{m}\) satisfying (4.24) is said to be a sensitivity vector for \(P\). The geometrical meaning of (4.24) is that there exists an affine function whose graph contains \(\left( \begin{array}{c} 0_{m} \\ \mathcal {\vartheta }\left( 0_{m}\right) \end{array} \right) \) and is a lower approximation of \(\mathcal {\vartheta }\).

In Example 4.37, the sensitivity vectors (scalars here as \(m=1\)) for P are the elements of the interval \(\left[ 0,1\right] \); see Fig. 4.18.

When \(\mathcal {\vartheta }\) is differentiable at \(0_{m}\), then the unique sensitivity vector is \(\lambda =-\nabla \mathcal {\vartheta }\left( 0_{m}\right) \), which will be interpreted in Chapter   5 as the steepest descent direction for \(\mathcal {\vartheta }\) at \(0_{m}\) (the search direction used by the steepest descent method to improve the current iterate). Alternatively, if \(\mathcal {\vartheta }\) is not differentiable at \( 0_{m}\) and \(t>0\) is sufficiently small, we have the following lower bound for the variation of \(\mathcal {\vartheta }\) in the direction of \(e_{i}\), \( i=1,\ldots , m\):
$$\begin{aligned} \frac{\mathcal {\vartheta }\left( te_{i}\right) -\mathcal {\vartheta }\left( 0_{m}\right) }{t}\ge -\lambda _{i}. \end{aligned}$$
As the section functions of \(\mathcal {\vartheta }\) have one-sided derivatives, we can write
$$\begin{aligned} \mathcal {\vartheta }^{\prime }\left( 0_{m};e_{i}\right) =\lim _{t\searrow 0} \frac{\mathcal {\vartheta }\left( te_{i}\right) -\mathcal {\vartheta }\left( 0_{m}\right) }{t}\ge -\lambda _{i}, \end{aligned}$$
that is, the rate of variation of \(\mathcal {\vartheta }\) in the direction \( e_{i}\) is at least \(-\lambda _{i}\).

In Example 4.37, \(\mathcal {\vartheta }^{\prime }(0;1)\ge -\lambda \) for all \(\lambda \in [0,1]\), so \(\mathcal {\vartheta }^{\prime }(0;1)\ge 0\). Similarly, \(\mathcal {\vartheta }^{\prime }(0;-1)\ge \lambda \) for all \(\lambda \in [0,1]\), so \(\mathcal {\vartheta }^{\prime }(0;-1)\ge 1\). Actually, one has that \(\mathcal {\vartheta }^{\prime }(0;1)=0\) and \(\mathcal {\vartheta }^{\prime }(0;-1)=1\).
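The quantities appearing in this discussion can be checked numerically. The short Python sketch below is an illustrative addition (it is not part of the original development); it uses NumPy, takes for Example 4.37 the problem whose Lagrange function is \(L\left( x,\lambda \right) =\left| x_{1}\right| +x_{2}+\lambda x_{1}\) on \(C=\mathbb {R\times R}_{+}\), so that its value function is \(\mathcal {\vartheta }(z) =\max \left\{ 0,-z\right\} \), and tests the inequality (4.24) on a grid together with the one-sided derivatives at 0.

```python
import numpy as np

# Value function of Example 4.37, reconstructed from its Lagrange function
# L(x, lam) = |x_1| + x_2 + lam*x_1 on C = R x R_+ (an assumption made here):
# theta(z) = inf{ |x_1| + x_2 : x_1 <= z, x_2 >= 0 } = max(0, -z).
theta = lambda z: max(0.0, -z)

# Sensitivity inequality (4.24): theta(z) >= theta(0) - lam*z on dom(theta).
zs = np.linspace(-2.0, 2.0, 401)
for lam in (0.0, 0.5, 1.0, 1.5):
    ok = all(theta(z) >= theta(0.0) - lam * z - 1e-12 for z in zs)
    print(f"lambda = {lam}: inequality (4.24) holds on the grid? {ok}")

# One-sided directional derivatives of theta at 0.
t = 1e-6
print("theta'(0; +1) ~", (theta(+t) - theta(0.0)) / t)   # expected 0
print("theta'(0; -1) ~", (theta(-t) - theta(0.0)) / t)   # expected 1
```

The test reports that (4.24) holds exactly for \(\lambda \in \left[ 0,1\right] \) (the run with \(\lambda =1.5\) fails), in agreement with the sensitivity vectors and the one-sided derivatives computed above.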

4.4.2 Optimality Conditions

Given a feasible solution \(\widetilde{x}\) of the convex optimization problem P in (4.20) and a vector \(\lambda \in \mathbb {R}_{+}^{m}\), we have
$$\begin{aligned} \inf \left\{ f\left( x\right) +\mathop {\displaystyle \sum }\limits _{i\in I}\lambda _{i}g_{i}\left( x\right) :x\in C\right\} \le f\left( \widetilde{x}\right) +\mathop {\displaystyle \sum }\limits _{i\in I}\lambda _{i}g_{i}(\widetilde{x})\le f\left( \widetilde{x }\right) , \end{aligned}$$
so one has that
$$\begin{aligned} h(\lambda ):=\inf \left\{ f\left( x\right) +\mathop {\displaystyle \sum }\limits _{i\in I}\lambda _{i}g_{i}\left( x\right) :x\in C\right\} \le v\left( P\right) . \end{aligned}$$
Thus, \(h\left( \lambda \right) \) is a lower bound for \(v\left( P\right) \). In other words, the maximization of \( h\left( \lambda \right) \) for \(\lambda \in \mathbb {R}_{+}^{m}\) provides a dual problem for \(P\). This motivates the study of the function depending on x and \( \lambda \) involved in the definition of h.

Definition 4.41

The Lagrange function  of problem P in (4.20) is \(L:C\times \mathbb {R}_{+}^{m}\rightarrow \mathbb {R}\) given by
$$\begin{aligned} L\left( x,\lambda \right) =f\left( x\right) +\mathop {\displaystyle \sum }\limits _{i\in I}\lambda _{i}g_{i}\left( x\right) . \end{aligned}$$

We now prove that, if we know a sensitivity vector, we can reformulate P as an unconstrained convex problem.

Theorem 4.42

(Reduction to an unconstrained problem) If \(\lambda \) is a sensitivity vector for a bounded convex problem P, then
$$\begin{aligned} v\left( P\right) =\inf _{x\in C}L\left( x,\lambda \right) . \end{aligned}$$

Proof

We associate with a given \(x\in C\) the vector \(z:={\left( g_{1}\left( x\right) ,\ldots , g_{m}\left( x\right) \right) }^{T} \). Since \(x\in \mathcal {F} (z) \), we have z \(\in \mathop {\mathrm{dom}}\mathcal {\vartheta }\) and \(f\left( x\right) \ge \mathcal { \vartheta }(z) \). Moreover, by the assumption on \(\lambda \), we get
$$\begin{aligned} f\left( x\right) \ge \mathcal {\vartheta }(z) \ge v\left( P\right) -\lambda ^{T}z=v\left( P\right) -\mathop {\displaystyle \sum }\limits _{i\in I}\lambda _{i}g_{i}\left( x\right) , \end{aligned}$$
which yields \(L\left( x,\lambda \right) \ge v\left( P\right) \). Taking the infimum of both sides for \(x\in C\), we deduce \(v\left( P\right) \le \inf _{x\in C}L\left( x,\lambda \right) \).
Conversely, as \(\lambda \in \mathbb {R}_{+}^{m}\), one has
$$\begin{aligned} \inf _{x\in C}L\left( x,\lambda \right)&\le \inf \left\{ L\left( x,\lambda \right) :x\in C, g_{i}\left( x\right) \le 0\text { },\forall i\in I\right\} \\&\le \inf \left\{ f\left( x\right) :x\in C, g_{i}\left( x\right) \le 0 \text { },\forall i\in I\right\} =v\left( P\right) . \end{aligned}$$
We thus conclude that \(v\left( P\right) =\inf _{x\in C}L\left( x,\lambda \right) \).       \(\square \)

The Lagrange function for the problem in Example 4.37 is \(L\left( x,\lambda \right) =\left| x_{1}\right| +x_{2}+\lambda x_{1}\). Taking the sensitivity vector (here a scalar) \(\lambda =1\), one has, for any \(x\in C= \mathbb {R\times }\mathbb {R}_{+}\), \(L\left( x, 1\right) =x_{2}\), if \(x_{1}<0\), and \(L\left( x, 1\right) =2x_{1}+x_{2}\), if \(x_{1}\ge 0\), so we have that \( \inf _{x\in C}L\left( x, 1\right) =0=v\left( P\right) \).

Theorem 4.43

(Saddle point theorem)  Assume that P is a bounded convex problem satisfying SCQ. Let \(\overline{x}\in C\). Then, \(\overline{x}\in F^{*}\) if and only if there exists \(\overline{\lambda }\in \mathbb {R}^{m}\) such that:

(NC) \(\overline{\lambda }\in \mathbb {R}_{+}^{m}\);

(SPC) \(L\left( \overline{x},\lambda \right) \le L\left( \overline{x}, \overline{\lambda }\right) \le L\left( x,\overline{\lambda }\right) ,\quad \forall x\in C,\lambda \in \mathbb {R}_{+}^{m}\); and

(CC) \(\overline{\lambda }_{i}g_{i}\left( \overline{x}\right) =0,\quad \forall i\in I\).

The new acronym (SPC) refers to saddle point condition .

Remark 4.44

(Comment preceding the proof). (SPC) can be interpreted by observing that \(\left( \overline{x},\overline{ \lambda }\right) \) is a saddle point for the function \(\left( x,\lambda \right) \mapsto L\left( x,\lambda \right) \) on \(C\times \mathbb {R} _{+}^{m}\), as \(\overline{x}\) is a minimum of \(L\left( x,\overline{\lambda } \right) \) on C, while \(\overline{\lambda }\) is a maximum of \(L\left( \overline{x},\lambda \right) \) on \(\mathbb {R}_{+}^{m}\). For instance, \(0_{2} \) is a saddle point for the function \(\left( x,\lambda \right) \mapsto x^{2}-\lambda ^{2}\) on \(\mathbb {R}^{2}\), as Fig. 4.20 shows. One can easily check that \(\left( 0,0,1\right) \) is a saddle point of the Lagrange function of the problem in Example 4.37, as \(L\left( x, 1\right) =\left| x_{1}\right| +x_{2}+x_{1}\ge 0=L\left( 0,0,1\right) \) for all \(x\in C\) and \(L\left( 0,0,\lambda \right) =0=L\left( 0,0,1\right) \) for all \(\lambda \in \mathbb {R}_{+}\).

Fig. 4.20

\(\left( 0,0\right) \) is a saddle point of \((x,\lambda )\mapsto x^2-\lambda ^2\) on \(\mathbb {R}^2\)

Proof

Throughout this proof, we represent by (SPC1) and (SPC2) the first and the second inequalities in (SPC), respectively.

Let \(\overline{x}\in F^{*}\). Under the assumptions on P, Theorem 4.40 implies the existence of a sensitivity vector \( \overline{\lambda }\in \mathbb {R} _{+}^{m}\) for \(P\). Obviously, \(\overline{\lambda }\) satisfies (NC). Then, by Theorem 4.42, we have that \(v\left( P\right) =\inf _{x\in C}L\left( x,\overline{ \lambda }\right) \). Since \(\overline{x}\in F^{*}\subset F\) and \(\overline{\lambda }\in \mathbb {R}_{+}^{m}\), we obtain
$$\begin{aligned} f\left( \overline{x}\right) =v\left( P\right) =\inf _{x\in C}L\left( x,\overline{\lambda }\right) \le L\left( \overline{x},\overline{\lambda }\right) =f\left( \overline{x}\right) +\mathop {\displaystyle \sum }\limits _{i\in I}\overline{\lambda }_{i}g_{i}\left( \overline{x}\right) \le f\left( \overline{x}\right) . \end{aligned}$$
(4.28)
From (4.28), we get, on the one hand, that \(\sum _{i\in I}\overline{\lambda }_{i}g_{i}\left( \overline{x}\right) =0\) and hence, since each term of the sum is nonpositive, \(\overline{\lambda } _{i}g_{i}\left( \overline{x}\right) =0\) for all \(i\in I\), which amounts to saying that (CC) holds. On the other hand, since \(L\left( \overline{x},\overline{\lambda }\right) =\inf _{x\in C}L\left( x,\overline{\lambda } \right) \), we have \(L\left( \overline{x},\overline{\lambda }\right) \le L\left( x,\overline{\lambda }\right) \) for all \(x\in C\), so (SPC2) holds. In order to prove that (SPC1) also holds, take an arbitrary \(\lambda \in \mathbb {R}_{+}^{m}\). By (CC),
$$\begin{aligned} L\left( \overline{x},\overline{\lambda }\right) -L\left( \overline{x} ,\lambda \right) =\mathop {\displaystyle \sum }\limits _{i\in I}\left( \overline{\lambda } _{i}-\lambda _{i}\right) g_{i}\left( \overline{x}\right) =-\mathop {\displaystyle \sum }\limits _{i\in I}\lambda _{i}g_{i}\left( \overline{x}\right) \ge 0. \end{aligned}$$
Thus, \(L\left( \overline{x},\lambda \right) \le L\left( \overline{x}, \overline{\lambda }\right) \) for all \(\lambda \in \mathbb {R}_{+}^{m}\); that is, (SPC1) holds true.
Conversely, we now suppose that \(\overline{x}\in C\) and there exists \( \overline{\lambda }\in \mathbb {R}^{m}\) such that (NC), (SPC), and (CC) hold. We associate with \(\overline{\lambda }\) the vectors \(\lambda ^{i}:=\overline{ \lambda }+e_{i}\), \(i=1,\ldots , m\). By (NC), we have \(\lambda ^{i}\in \mathbb {R}_{+}^{m}\), \(i\in I\). Further, by (SPC1),
$$\begin{aligned} 0\ge L\left( \overline{x},\lambda ^{i}\right) -L\left( \overline{x}, \overline{\lambda }\right) =g_{i}\left( \overline{x}\right) ,\quad \forall i\in I, \end{aligned}$$
so \(\overline{x}\in F\). From (SPC1) and (CC), we get
$$\begin{aligned} f\left( \overline{x}\right) =L\left( \overline{x}, 0_{m}\right) \le L\left( \overline{x},\overline{\lambda }\right) =f\left( \overline{x}\right) +\mathop {\displaystyle \sum }\limits _{i\in I}\overline{\lambda }_{i}g_{i}\left( \overline{x}\right) =f\left( \overline{x}\right) , \end{aligned}$$
so we get \(f\left( \overline{x}\right) =L\left( \overline{x},\overline{\lambda }\right) \). We thus have, by (SPC2),
$$\begin{aligned} f\left( \overline{x}\right) =L\left( \overline{x},\overline{\lambda }\right)&=\inf \left\{ L\left( x,\overline{\lambda }\right) :x\in C\right\} \\&\le \inf \left\{ L\left( x,\overline{\lambda }\right) :x\in C,\text { } g_{i}\left( x\right) \le 0,\ \forall i\in I\right\} \\&=\inf \left\{ f\left( x\right) +\mathop {\displaystyle \sum }\limits _{i\in I}\overline{\lambda } _{i}g_{i}\left( x\right) :x\in C,\text { }g_{i}\left( x\right) \le 0,\ \forall i\in I\right\} \\&\le \inf \left\{ f\left( x\right) :x\in C,\text { }g_{i}\left( x\right) \le 0,\ \forall i\in I\right\} =v\left( P\right) . \end{aligned}$$
Hence, \(\overline{x}\in F^{*}\). The proof is complete.       \(\square \)

What Theorem 4.43 asserts is that, under the assumptions on P (boundedness and SCQ), a feasible solution is optimal if and only if there exists a vector \(\overline{\lambda }\in \mathbb {R}^{m}\) such that (NC), (SPC), and (CC) hold, in which case we say that \(\overline{\lambda }\) is a Lagrange vector .

The next simple example, where \(n=m=1\), allows us to visualize the saddle point of the Lagrange function of a convex optimization problem, as \(\mathop {\mathrm{gph}}L\subset \mathbb {R}^{3}\).

Example 4.45

Consider
$$\begin{aligned} \begin{array}{lll} P: &{} \text {Min} &{} f\left( x\right) =x^{2} \\ &{} \text {s.t.} &{} x^{2}-1\le 0, \\ &{} &{} x\in \mathbb {R}. \end{array} \end{aligned}$$
It is easy to see that \(F^{*}=\left\{ 0\right\} \), \(v\left( P\right) =0\), \(\mathcal {F}(z) =\left[ -\sqrt{z+1},\sqrt{z+1}\right] \) and \(\mathcal {\vartheta }(z) =0\), if \(z\ge -1\), while \(\mathcal {F} (z) =\emptyset \ \)and \(\mathcal {\vartheta }(z) =+\infty \), if \(z<-1\). Since \(\mathop {\mathrm{epi}}\mathcal {\vartheta }=\left[ -1,+\infty \right[ \times \mathbb {R}_{+}\), the unique sensitivity vector is 0. Figure 4.21 shows that \(\left( 0,0\right) \) is a saddle point for \(L\left( x,\lambda \right) =x^{2}+\lambda \left( x^{2}-1\right) \) on \( \mathbb {R\times R}_{+}\) (observe that \(L\left( x,\cdot \right) \) is an affine function on \(\mathbb {R}_{+}\) for all \(x\in \mathbb {R}\)).
Fig. 4.21

\(\left( 0,0\right) \) is a saddle point of L in Example 4.45
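As an optional numerical complement (not part of the original text; NumPy is our choice), the saddle point condition of Theorem 4.43 can be checked on a grid for the Lagrange function of Example 4.45:

```python
import numpy as np

# Lagrange function of Example 4.45: L(x, lam) = x**2 + lam*(x**2 - 1).
L = lambda x, lam: x**2 + lam * (x**2 - 1.0)

xs   = np.linspace(-2.0, 2.0, 201)
lams = np.linspace(0.0, 3.0, 151)

# (SPC) at (xbar, lambdabar) = (0, 0):
# L(0, lam) <= L(0, 0) <= L(x, 0) for all x in R and lam >= 0.
left  = all(L(0.0, lam) <= L(0.0, 0.0) + 1e-12 for lam in lams)
right = all(L(0.0, 0.0) <= L(x, 0.0) + 1e-12 for x in xs)
print("(SPC) holds at (0, 0) on the grid:", left and right)   # expected: True
```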

We now show that the condition \(-\nabla f\left( \overline{x}\right) \in A( \overline{x})=\mathop {\mathrm{cone}}\left\{ \nabla g_{i}\left( \overline{x}\right) , i\in I\left( \overline{x}\right) \right\} \) (the active cone at \(\overline{x}\) defined in (4.7)) also characterizes the optimality of \(\overline{x}\in F\) when P is a differentiable convex optimization problem and SCQ holds (which means that the assumptions of the KKT Theorems 4.20 and 4.46 are independent of each other).

Theorem 4.46

(KKT theorem with convex constraints)  Assume that P is a bounded convex problem satisfying SCQ, with \(f,g_{i}, i\in I\), differentiable on some open set containing \(C\). Let \( \overline{x}\in F\cap \mathop {\mathrm{int}}C\). Then, the following statements are equivalent:

(i) \(\overline{x}\in F^{*}\).

(ii) \(-\nabla f\left( \overline{x}\right) \in A\left( \overline{x}\right) \).

(iii) There exists some \(\overline{\lambda }\in \mathbb {R}^{m}\) such that:
$$\begin{aligned} \begin{array}{ll} \text {(NC)} &{} \overline{\lambda }\in \mathbb {R}_{+}^{m}{} ; \\ \text {(SC)} &{} \nabla _{x}L\left( \overline{x},\overline{\lambda }\right) =\nabla f\left( \overline{x}\right) +\sum _{i\in I}\overline{\lambda } _{i}\nabla g_{i}\left( \overline{x}\right) =0_{n};\text { and} \\ \text {(CC)} &{} \overline{\lambda }_{i}g_{i}\left( \overline{x}\right) =0,\quad \forall i\in I. \end{array} \end{aligned}$$

Proof

Since (ii) and (iii) are trivially equivalent, it is sufficient to prove that (i)\(\Leftrightarrow \)(iii).

Assume that \(\overline{x}\in F^{*}\). By Theorem 4.43, there exists a sensitivity vector \(\overline{\lambda }\in \mathbb {R}_{+}^{m}\) satisfying (CC) and (SPC). As the second inequality in (SPC) is
$$\begin{aligned} L\left( \overline{x},\overline{\lambda }\right) \le L\left( x,\overline{ \lambda }\right) , \quad \forall x\in C, \end{aligned}$$
we have that \(\overline{x}\in \mathop {\mathrm{int}}C\) is a global minimum of \(x\mapsto L\left( x, \overline{\lambda }\right) \) on \(C\). Then, the Fermat principle implies that
$$\begin{aligned} \nabla _{x}L\left( \overline{x},\overline{\lambda }\right) =\nabla f\left( \overline{x}\right) +\mathop {\displaystyle \sum }\limits _{i\in I}\overline{\lambda }_{i}\nabla g_{i}\left( \overline{x}\right) =0_{n}, \end{aligned}$$
that is, (SC) holds.
We now assume the existence of \(\overline{\lambda }\in \mathbb {R}^{m}\) such that (NC), (SC), and (CC) hold. Let \(x\in F\). Due to the convexity of f and \(g_{i}\), \(i\in I\), we have
$$\begin{aligned} f\left( x\right)&\ge f\left( x\right) +\sum _{i\in I}\overline{\lambda } _{i}g_{i}\left( x\right) \\&\ge \left( f\left( \overline{x}\right) +\nabla f{\left( \overline{x} \right) }^{T}\left( x-\overline{x}\right) \right) +\sum _{i\in I}\overline{ \lambda }_{i}\left( g_{i}\left( \overline{x}\right) +\nabla g_{i}{\left( \overline{x}\right) }^{T}\left( x-\overline{x}\right) \right) \\&=\left( f\left( \overline{x}\right) +\sum _{i\in I}\overline{\lambda } _{i}g_{i}\left( \overline{x}\right) \right) +{\left( \nabla f\left( \overline{x}\right) +\sum _{i\in I}\overline{\lambda }_{i}\nabla g_{i}\left( \overline{x}\right) \right) }^{T}\left( x-\overline{x}\right) \\&=f\left( \overline{x}\right) . \end{aligned}$$
Hence, \(\overline{x}\in F^{*}\).       \(\square \)

A vector \(\overline{\lambda }\in \mathbb {R}^{m}\) such that (NC), (SC), and (CC) hold is said to be a KKT vector.

Remark 4.47

Observe that SCQ was not used in the proof of (iii)\(\Rightarrow \)(i). Therefore, in a convex problem, the existence of a KKT vector associated with \(\overline{x}\) implies the global optimality of \(\overline{x}\).

Revisiting Example 4.37, \(\left( {\overline{x}}^{T},\overline{\lambda } \right) =\left( 0,0,1\right) \) does not satisfy (SC) of Theorem 4.46 as f is not even differentiable at \(0_{2} \). Concerning Example 4.45, it is easy to check that \(0_{2} \) satisfies (NC), (SC), and (CC).

The KKT conditions are used either to confirm or reject the optimality of a given feasible solution (e.g., the current iterate of some convex optimization algorithm) or as a filter that allows one to draw up a list of candidates for global minima. Under the assumptions of Theorem 4.46, the sensitivity theorem guarantees the existence of some sensitivity vector, and the proofs of the saddle point and the KKT theorems show that such a vector is a KKT vector. So, if P has a unique KKT vector, then it is a sensitivity vector too.

Example 4.48

We try to solve the optimization problem
$$\begin{aligned} \begin{array}{lll} P: &{} \text {Min} &{} f\left( x\right) =2x_{1}^{2}+2x_{1}x_{2}+x_{2}^{2}-10x_{1}-10x_{2} \\ &{} \text {s.t.} &{} g_{1}\left( x\right) =x_{1}^{2}+x_{2}^{2}-5\le 0, \\ &{} &{} g_{2}\left( x\right) =3x_{1}+x_{2}-6\le 0. \end{array} \end{aligned}$$
We have \(F\subset \sqrt{5}\mathbb {B}\), so F is compact,
$$\begin{aligned} \nabla f(x)=\left( \begin{array}{c} 4x_{1}+2x_{2}-10 \\ 2x_{1}+2x_{2}-10 \end{array} \right) ,\quad \nabla g_{1}(x)=\left( \begin{array}{c} 2x_{1} \\ 2x_{2} \end{array} \right) \quad \text {and}\quad \nabla g_{2}(x)=\left( \begin{array}{c} 3 \\ 1 \end{array} \right) , \end{aligned}$$
with \(\nabla ^{2}f\) and \(\nabla ^{2}g_{1}\) positive definite while \(\nabla ^{2}g_{2}\) is positive semidefinite. Thus, the three functions are convex and differentiable on \(\mathbb {R}^{2}\) and P has a unique optimal solution. Moreover, \(0_{2}\) is a Slater point, so SCQ holds. There are four possible values for \(I\left( x\right) \) (the subsets of \(I=\left\{ 1,2\right\} \)):

\(\bullet \) \(I(x)=\emptyset \): The unique solution to \(\nabla f(x)=0_{2}\) is \(x={\left( 0,5\right) }^{T} \notin F\).

\(\bullet \) \(I(x)=\left\{ 1\right\} \): We must solve the nonlinear system
$$\begin{aligned} \left\{ \left( \begin{array}{c} 4x_{1}+2x_{2}-10 \\ 2x_{1}+2x_{2}-10 \end{array} \right) +\lambda _{1}\left( \begin{array}{c} 2x_{1} \\ 2x_{2} \end{array} \right) =0_{2};x_{1}^{2}+x_{2}^{2}=5\right\} . \end{aligned}$$
Eliminating \(x_{1}\) and \(x_{2}\), one gets \(\lambda _{1}^{4}+6\lambda _{1}^{3}+\lambda _{1}^{2}-4\lambda _{1}-4=0\), whose unique positive root is \( \lambda _{1}=1\) (see Fig. 4.22).
Fig. 4.22

Plot of the polynomial \(\lambda _{1}^{4}+6\lambda _{1}^{3}+\lambda _{1}^{2}-4\lambda _{1}-4\)

Substituting this value, we obtain \(x^{1}={\left( 1,2\right) }^{T} \in F\), which is a minimum of P with corresponding KKT vector \( {\left( 1,0\right) }^{T}\). Once we have obtained the unique minimum of P, there is no need to complete the discussion of the remaining cases \(I\left( x\right) =\left\{ 2\right\} \) and \(I\left( x\right) =\left\{ 1,2\right\} \).

Figure 4.23 shows the partition of F into the sets \( F_{1}, F_{2}, F_{3}, F_{4}\) corresponding to the possible values of \(I\left( x\right) \): \( \emptyset \), \(\{1\}\), \(\{2\}\), and \(\{1,2\}\), respectively. Observe that \( F_{1}=\mathop {\mathrm{int}}F\), \(F_{2}\) is an arc of a circle without its endpoints, \(F_{3}\) is a segment without its endpoints, and, finally, \( F_{4} \) is formed by two isolated points (the endpoints of the mentioned segment).

Fig. 4.23

Applying the KKT conditions: partition of F depending on I(x)
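The computations of Example 4.48 are easy to double-check by computer. The sketch below is an illustrative addition (NumPy/SciPy and the SLSQP solver are our own choices, not part of the text); it verifies (NC), (SC), and (CC) at \(\overline{x}={\left( 1,2\right) }^{T}\) with \(\overline{\lambda }={\left( 1,0\right) }^{T}\) and then solves P numerically as a cross-check.

```python
import numpy as np
from scipy.optimize import minimize

# Data of Example 4.48.
f  = lambda x: 2*x[0]**2 + 2*x[0]*x[1] + x[1]**2 - 10*x[0] - 10*x[1]
g1 = lambda x: x[0]**2 + x[1]**2 - 5
g2 = lambda x: 3*x[0] + x[1] - 6
grad_f  = lambda x: np.array([4*x[0] + 2*x[1] - 10, 2*x[0] + 2*x[1] - 10])
grad_g1 = lambda x: np.array([2*x[0], 2*x[1]])
grad_g2 = lambda x: np.array([3.0, 1.0])

xbar, lam = np.array([1.0, 2.0]), np.array([1.0, 0.0])
print("(SC):", grad_f(xbar) + lam[0]*grad_g1(xbar) + lam[1]*grad_g2(xbar))  # ~ 0_2
print("(CC):", lam[0]*g1(xbar), lam[1]*g2(xbar))                            # 0, 0
print("(NC) and feasibility:", (lam >= 0).all(), g1(xbar) <= 0, g2(xbar) <= 0)

# Cross-check: solve P with a generic solver (SLSQP expects constraints >= 0).
res = minimize(f, x0=np.zeros(2), method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda x: -g1(x)},
                            {"type": "ineq", "fun": lambda x: -g2(x)}])
print("numerical minimizer:", res.x)   # expected close to (1, 2)
```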

When P has multiple solutions, one can use the next result, which directly proves that the Lagrange vectors are sensitivity vectors.

Theorem 4.49

(Sensitivity and saddle points)  Let \(\overline{x}\in F^{*}\). If \(\overline{\lambda }\in \mathbb {R}^{m}\) satisfies

(NC) \(\overline{\lambda }\in \mathbb {R}_{+}^{m}\),

(SPC) \(L\left( \overline{x},\lambda \right) \le L\left( \overline{x}, \overline{\lambda }\right) \le L\left( x,\overline{\lambda }\right) ,\quad \forall x\in C,\lambda \in \mathbb {R}_{+}^{m}\), and

(CC) \(\overline{\lambda }_{i}g_{i}\left( \overline{x}\right) =0,\quad \forall i\in I\),

then \(\overline{\lambda }\) is a sensitivity vector for P, i.e.,
$$\begin{aligned} \mathcal {\vartheta }(z) \ge \mathcal {\vartheta }\left( 0_{m}\right) -\overline{\lambda }^{T}z, \ \ \forall z\in \mathop {\mathrm{dom}} \mathcal {\vartheta }. \end{aligned}$$

Proof

Let \(\overline{x}\in F^{*}\) and \(\overline{\lambda }\in \mathbb {R}^{m}\) be such that (NC), (SPC), and (CC) hold. By (CC),
$$\begin{aligned} v(P)=f\left( \overline{x}\right) =f\left( \overline{x}\right) +\mathop {\displaystyle \sum }\limits _{i\in I}\overline{\lambda }_{i}g_{i}\left( \overline{x}\right) =L\left( \overline{x},\overline{\lambda }\right) , \end{aligned}$$
and, by (SPC),
$$\begin{aligned} v(P)=L\left( \overline{x},\overline{\lambda }\right) \le L\left( x, \overline{\lambda }\right) =f\left( x\right) +\mathop {\displaystyle \sum }\limits _{i\in I}\overline{ \lambda }_{i}g_{i}\left( x\right) ,\quad \forall x\in C. \end{aligned}$$
Let \(z\in \mathop {\mathrm{dom}}\mathcal {\vartheta }\) and \(x\in C\) be such that \(g_{i}\left( x\right) \le z_{i}\), \(i\in I\). Then, by (NC),
$$\begin{aligned} v\left( P\right) \le f\left( x\right) +\mathop {\displaystyle \sum }\limits _{i\in I}\overline{ \lambda }_{i}g_{i}\left( x\right) \le f\left( x\right) +\overline{\lambda } ^{T}z, \ \forall x\in \mathcal {F}(z) . \end{aligned}$$
(4.29)
Finally, taking the infimum over \(\mathcal {F}(z) \) on both sides of (4.29), we get
$$\begin{aligned} \mathcal {\vartheta }\left( 0_{m}\right) =v\left( P\right) \le \mathcal {\vartheta }(z) +\overline{\lambda }^{T}z \end{aligned}$$
This completes the proof.       \(\square \)

4.4.3 Lagrange Duality

We have seen in the last subsection that \(h\left( y\right) :=\inf _{x\in C}L\left( x, y\right) \), with \(y\in \mathbb {R}_{+}^{m}\), provides a lower bound for \(v\left( P\right) \). Thus, we associate with P the following problem, called Lagrange dual  (or Lagrangian dual) of \(P\),
$$\begin{aligned} \begin{array}{lll} D^L: &{} \text {Max} &{} h\left( y\right) :=\inf _{x\in C}L\left( x, y\right) \\ &{} \text {s.t.} &{} y\in \mathbb {R}_{+}^{m}, \end{array} \end{aligned}$$
whose optimal value is denoted by \(v\left( D^L\right) \in \overline{\mathbb {R}} \). It is worth observing that \(L\left( x,\cdot \right) \) is an affine function for all \(x\in C\), so \(h\left( \cdot \right) :=\inf _{x\in C}L\left( x,\cdot \right) \) is a concave function and \(D^L\) is equivalent to a linearly constrained convex optimization problem. Hence, its feasible set \(G= \mathbb {R}_{+}^{m}\) and its optimal set \(G^{*}\) are both convex subsets of \(\mathbb {R}^{m}\). The weak duality theorem \(v\left( D^L\right) \le v\left( P\right) \) is a straightforward consequence of the definition of \(D^L\) (observe that the weak duality holds even in nonconvex optimization). The difference \(v\left( P\right) -v\left( D^{L}\right) \ge 0\) is called the duality gap  of the dual pair \(P-D^{L}\).

The main result in any duality theory establishes that \(v\left( D^L\right) =v\left( P\right) \) under suitable assumptions. This equation allows one to certify the optimality of the current iterate or to stop the execution of primal-dual algorithms whenever an \(\varepsilon \)-optimal solution has been attained. Strong duality holds when, in addition to \(v\left( D^L\right) =v\left( P\right) \), \(G^{*}\ne \emptyset \). In linear optimization, it is known that the simultaneous consistency of the primal problem P and its dual \(D^L\) guarantees that \( v\left( D^L\right) =v\left( P\right) \) with \(F^{*}\ne \emptyset \ \)and \( G^{*}\ne \emptyset \). This is not the case in convex optimization, where strong duality requires the additional condition that SCQ holds (which is not enough in nonconvex optimization).

Theorem 4.50

(Strong Lagrange duality)  If P satisfies SCQ and is bounded, then \(v\left( D^L\right) =v\left( P\right) \) and \(G^{*}\ne \emptyset \).

Proof

The assumptions guarantee, by Theorem 4.40, the existence of a sensitivity vector \(\overline{y}\in \mathbb {R}_{+}^{m}\), and this vector satisfies \(h\left( \overline{y}\right) =v\left( P\right) \) by Theorem 4.42. Then, \(v\left( P\right) =h\left( \overline{y}\right) \le v\left( D^L\right) \) and the conclusion follows from the weak duality theorem.       \(\square \)

Therefore, in the simple Example 4.45, where SCQ holds, one has \(h\left( y\right) =\inf _{x\in \mathbb {R}}\left( x^{2}+y\left( x^{2}-1\right) \right) =-y\), for all \(y\in \mathbb {R}_{+}\). So, \(v\left( D^L\right) =\sup _{y\in \mathbb {R}_{+}}h \left( y\right) =0=v\left( P\right) \) and the optimal value of \(D^L\) is attained at \(\overline{ y}=0\).

Let us revisit Example 4.37. There, the value of \(h:\mathbb {R} _{+}\rightarrow \overline{\mathbb {R}}\) at \(y\in \mathbb {R}_{+}\) is
$$\begin{aligned} \begin{array}{ll} h\left( y\right) &{} =\inf _{x\in \mathbb {R\times R}_{+}}\left( \left| x_{1}\right| +x_{2}+yx_{1}\right) \\ &{} =\min \left\{ \inf _{x\in \mathbb {R}_{+}^{2}}\left( \left( y+1\right) x_{1}+x_{2}\right) ,\inf _{x\in \left( -\mathbb {R}_{+}\right) \times \mathbb { R}_{+}}\left( \left( y-1\right) x_{1}+x_{2}\right) \right\} \\ &{} =\left\{ \begin{array}{cc} 0, &{} 0\le y \le 1 \\ -\infty , &{} y > 1 \end{array} \right. , \end{array} \end{aligned}$$
as \(\inf _{x\in \mathbb {R}_{+}^{2}}\left( \left( y+1\right) x_{1}+x_{2}\right) = 0\). So, \(v\left( D^L\right) =0\) with optimal set \(G^{*}=\left[ 0,1\right] \).
We now obtain an explicit expression for the Lagrange dual of the quadratic problem with inequalities
$$ \begin{array}{lll} P_{Q}: &{} \text {Min} &{} f\left( x\right) =\frac{1}{2}x^{T}Qx-c^{T}x \\ &{} \text {s.t.} &{} a_{i}^{T}x\le b_{i},\text { }i\in I, \end{array} $$
considered in Subsection 4.3.2, assuming that Q is positive definite. Denote by A the \(m\times n\) matrix whose rows are \(a_{1}^{T},\ldots , a_{m}^{T}\) and \(b=\left( b_{1},\ldots , b_{m}\right) ^{T}\). The Lagrange function of \(P_{Q}\) is
$$ L_{Q}\left( x, y\right) =\frac{1}{2}x^{T}Qx+\left( -c+A^{T}y\right) ^{T}x-b^{T} y. $$
Since \(L_{Q}\left( \cdot , y\right) \) is strongly convex, its minimum on \( \mathbb {R}^{n}\) is attained at the unique zero of \(\nabla _{x}L_{Q}\left( x, y\right) =Qx-c+A^{T}y\), i.e., at \(Q^{-1}\left( c-A^{T}y\right) \). Thus,
$$\begin{aligned} h\left( y\right)&:=L_{Q}\left( Q^{-1}\left( c-A^{T}y\right) , y\right) \\&\,=-\frac{1}{2}y^{T}\left( AQ^{-1}A^{T}\right) y+\left( AQ^{-1}c-b\right) ^{T}y-\frac{1}{2}c^{T}Q^{-1}c, \end{aligned}$$
which is a concave function. Hence, the Lagrange dual problem of \(P_{Q}\),
$$ \begin{array}{lll} D_{Q}^{L}: &{} \text {Max} &{} h\left( y\right) \\ &{} \text {s.t.} &{} y_{i}\ge 0,\text { }i=1,\ldots , m, \end{array} $$
is a convex quadratic problem as well, with very simple linear constraints (usually much simpler than those of \(P_{Q}\)). If SCQ holds, then \(v\left( P_{Q}\right) =v\left( D_{Q}^{L}\right) \), with \( G^{*}\ne \emptyset \), and one can exploit this simplicity in two different ways:
  • Obtaining, as in linear optimization, an exact optimal solution of \( D_{Q}^{L}\) by means of some quadratic solver, and then the desired optimal solution of \(P_{Q}\) by using (SC) and (CC);

  • Interrupting the execution of any primal-dual algorithm whenever \(f\left( x_{k}\right) -h\left( y_{k}\right) <\varepsilon \) for some tolerance \(\varepsilon >0\) (approximate stopping rule), as shown in Fig. 4.24.

Fig. 4.24

Computing an \(\varepsilon \)-optimal solution
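To make the previous discussion concrete, the following sketch (an illustrative addition with data of our own, not taken from the text) builds the closed-form dual objective h for a small instance of \(P_{Q}\) with Q positive definite, maximizes it over \(y\ge 0\) with SciPy, recovers the primal solution \(x=Q^{-1}\left( c-A^{T}y\right) \), and checks that the duality gap vanishes, as predicted under SCQ.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative data (our own): P_Q: min (1/2) x^T Q x - c^T x  s.t.  A x <= b.
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.array([2.0, 4.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
Qinv = np.linalg.inv(Q)

# Closed-form dual objective derived above:
# h(y) = -(1/2) y^T (A Q^{-1} A^T) y + (A Q^{-1} c - b)^T y - (1/2) c^T Q^{-1} c.
M = A @ Qinv @ A.T
q = A @ Qinv @ c - b
h = lambda y: -0.5 * y @ M @ y + q @ y - 0.5 * c @ Qinv @ c

# Maximize h over y >= 0 (i.e., minimize -h with nonnegativity bounds).
res  = minimize(lambda y: -h(y), x0=np.zeros(1), bounds=[(0.0, None)])
ybar = res.x
xbar = Qinv @ (c - A.T @ ybar)          # minimizer of L_Q(., ybar), cf. (SC)

fval = 0.5 * xbar @ Q @ xbar - c @ xbar
print("dual solution y*:", ybar)              # expected ~ [2.]
print("recovered x     :", xbar)              # expected ~ [0., 1.]
print("duality gap     :", fval - h(ybar))    # ~ 0
```

For this instance the dual optimum is \(\overline{y}=2\), the recovered primal point is \({\left( 0,1\right) }^{T}\), and both optimal values equal \(-3\).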

4.4.4 Wolfe Duality

Ph. Wolfe, an expert in quadratic optimization, proposed in 1961 [90] an alternative dual problem that makes it possible to cope with convex quadratic optimization problems whose objective function fails to be strongly convex.

Let P be an optimization problem of type (4.20), with convex and differentiable objective and constraint functions, and \(C=\mathbb {R}^n\). The Wolfe dual  of P is
$$ \begin{array}{lll} D^{W}: &{} \text {Max} &{} L\left( u,y\right) \\ &{} \text {s.t.} &{} \nabla _{u}L\left( u, y\right) =0_{n}, \\ &{} &{} y_{i}\ge 0,\text { }i=1,\ldots , m, \end{array} $$
where L denotes the Lagrange function of \(P\). As for the previous dual problems, we denote by G and \(G^{*}\) the feasible and the optimal sets of \(D^{W}\), respectively. In contrast to the pair \(P-D^{L}\), the weak duality of \(P-D^{W}\) is not immediate and follows from the convexity assumption.

Proposition 4.51

(Weak duality) It holds that \(v\left( D^{W}\right) \le v\left( P\right) \).

Proof

Let \(x\in F\) and \(\left( u, y\right) \in G\). By the characterization of convex differentiable functions in Proposition  2.40,
$$\begin{aligned} f\left( x\right) -f\left( u\right) \ge \nabla f\left( u\right) ^{T}\left( x-u\right) \end{aligned}$$
(4.30)
and
$$\begin{aligned} g_{i}\left( x\right) -g_{i}\left( u\right) \ge \nabla g_{i}\left( u\right) ^{T}\left( x-u\right) ,\quad i=1,\ldots , m. \end{aligned}$$
(4.31)
From (4.30), the definition of G, inequality (4.31), and the primal feasibility of x, in this order, one gets
$$\begin{aligned} f\left( x\right) -f\left( u\right)&\ge \nabla f\left( u\right) ^{T}\left( x-u\right) \\&=-\mathop {\displaystyle \sum }\nolimits _{i=1}^{m}y_{i}\nabla g_{i}\left( u\right) ^{T}\left( x-u\right) \\&\ge -\mathop {\displaystyle \sum }\nolimits _{i=1}^{m}y_{i}\left( g_{i}\left( x\right) -g_{i}\left( u\right) \right) \\&\ge \mathop {\displaystyle \sum }\nolimits _{i=1}^{m}y_{i}g_{i}\left( u\right) , \end{aligned}$$
so that
$$ f\left( x\right) \ge f\left( u\right) +\mathop {\displaystyle \sum }\nolimits _{i=1}^{m}y_{i}g_{i}\left( u\right) =L\left( u, y\right) , $$
showing that \(v\left( P\right) \ge v\left( D^{W}\right) \).      \(\square \)
We cannot associate a Wolfe dual to the problem of Example 4.37 as f is not differentiable. Regarding the problem P in Example 4.45, \(L\left( u, y\right) =\left( 1+y\right) u^{2}-y\) and
$$ G=\left\{ \left( u, y\right) \in \mathbb {R\times R}_{+}:\left( 1+y\right) u=0\right\} =\left\{ 0\right\} \mathbb {\times R}_{+}. $$
So, \(G^{*}=\left\{ 0_{2}\right\} \) and \(v\left( D^{W}\right) =v\left( P\right) \). We now prove that, in fact, the fulfillment of the strong duality property in this example is a consequence of SCQ.

Theorem 4.52

(Strong Wolfe duality) If P is solvable and SCQ holds, then \(v\left( D^{W}\right) =v\left( P\right) \) and \(D^{W}\) is solvable.

Proof

Let \(\overline{x}\in F^{*}\). It will be sufficient to show the existence of some \(\overline{y}\in \mathbb {R}_{+}^{m}\) such that \( \left( \overline{x},\overline{y}\right) \in G^{*}\) and \(f\left( \overline{x}\right) =L\left( \overline{x},\overline{y}\right) \).

We first observe that, by the convexity assumption and Proposition  2.40,
$$\begin{aligned} L\left( u,y\right) -L\left( x,y\right) \ge \nabla _{u}L\left( x,y\right) ^{T}\left( u-x\right) ,\quad \forall y\in \mathbb {R}_{+}^{m},\forall u, x\in \mathbb {R}^{n}. \end{aligned}$$
(4.32)
Let \(\left( u_{1}, y\right) ,\left( u_{2}, y\right) \in G\). Applying (4.32) to the pairs \(\left( u_{1}, y\right) \) and \(\left( u_{2}, y\right) \), since \(\nabla _{u}L\left( u_{j}, y\right) =0_{n}\), \(j=1,2\), one gets \( L\left( u_{1}, y\right) -L\left( u_{2}, y\right) \ge 0\) and \(L\left( u_{2}, y\right) -L\left( u_{1}, y\right) \ge 0\); i.e., \(L\left( u_{1}, y\right) =L\left( u_{2}, y\right) \). Thus, \(L\left( \cdot , y\right) \) is constant on G for all \(y\in \mathbb {R}_{+}^{m}\).
By the saddle point Theorem 4.43, with \(C=\mathbb {R}^{n}\), there exists \(\overline{y}\in \mathbb { R}_{+}^{m}\) such that
$$\begin{aligned} L\left( \overline{x},y\right) \le L\left( \overline{x},\overline{y}\right) \le L\left( x,\overline{y}\right) ,\quad \forall x\in \mathbb {R}^{n}, y\in \mathbb {R}_{+}^{m}, \end{aligned}$$
(4.33)
and
$$\begin{aligned} \overline{y}_{i}g_{i}\left( \overline{x}\right) =0,\quad i=1,\ldots , m. \end{aligned}$$
(4.34)
Thus, \(\overline{x}\) is a global minimum of \(x\mapsto L\left( x, \overline{y }\right) \), so the Fermat principle implies that \(\nabla _u L\left( \overline{x},\overline{y}\right) =0_n\); i.e., \((\overline{x},\overline{y})\in G\).
From (4.33) and the constancy of \(L\left( \cdot , y\right) \) on G, we have
$$\begin{aligned} L\left( \overline{x},\overline{y}\right)&=\max \left\{ L\left( \overline{x} ,y\right) :y\in \mathbb {R}_{+}^{m}\right\} \\&\ge \max \left\{ L\left( \overline{x},y\right) :\left( \overline{x} ,y\right) \in G\right\} \\&=\max \left\{ L\left( x,y\right) :\left( x, y\right) \in G\right\} \\&\ge L\left( \overline{x},\overline{y}\right) , \end{aligned}$$
so that \(\left( \overline{x},\overline{y}\right) \in G^{*}\), with
$$ L\left( \overline{x},\overline{y}\right) =f\left( \overline{x}\right) +\mathop {\displaystyle \sum }\nolimits _{i=1}^{m}\overline{y}_{i}g_{i}\left( \overline{x}\right) =f\left( \overline{x}\right) $$
thanks to (4.34).      \(\square \)
Generally speaking, \(D^{W}\) may be difficult to deal with computationally, because its objective function is not jointly concave in \(\left( u, y\right) \). However, in convex quadratic optimization, \(D^{W}\) is preferable to \(D^{L}\) when Q is only positive semidefinite. Indeed, let
$$ \begin{array}{lll} P_{Q}: &{} \text {Min} &{} f\left( x\right) =\frac{1}{2}x^{T}Qx-c^{T}x\\ &{} \text {s.t.} &{} a_{i}^{T}x\le b_{i},\text { }i\in I, \end{array} $$
be a convex quadratic problem, and let A be the \(m\times n\) matrix whose rows are \(a_{1}^{T},\ldots , a_{m}^{T}\). Since its Lagrange function is
$$ L_{Q}\left( u, y\right) =\frac{1}{2}u^{T}Qu+\left( -c+A^{T}y\right) ^{T}u-b^{T}y, $$
the u-gradient reads
$$ \nabla _{u}L_{Q}\left( u, y\right) =Qu+\left( -c+A^{T}y\right) . $$
Then,
$$ G=\left\{ \left( u, y\right) \in \mathbb {R}^{n}\mathbb {\times R} _{+}^{m}:-c+A^{T}y=-Qu\right\} $$
and the Wolfe dual of \(P_{Q}\) can be expressed as
$$ \begin{array}{lll} D_{Q}^{W}: &{} \text {Max} &{} -\frac{1}{2}u^{T}Qu-b^{T}y \\ &{} \text {s.t.} &{} -c+A^{T}y=-Qu \\ &{} &{} y_{i}\ge 0,\text { }i=1,\ldots , m, \end{array} $$
whose objective function is concave (but not strongly concave) and its linear constraints are very simple, while \(D_{Q}^{L}\) does not have an explicit expression as Q is not invertible.
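The next sketch (again an illustrative addition with data of our own; SciPy's SLSQP solver is one possible choice) sets up \(D_{Q}^{W}\) for an instance whose matrix Q is singular, so that \(D_{Q}^{L}\) admits no closed-form expression, and solves both \(P_{Q}\) and its Wolfe dual numerically, observing the strong duality of Theorem 4.52.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative data (our own), with Q only positive SEMIdefinite:
# P_Q: min (1/2) x^T Q x - c^T x  s.t.  A x <= b.
Q = np.array([[1.0, 0.0], [0.0, 0.0]])   # singular
c = np.array([1.0, 1.0])
A = np.array([[0.0, 1.0]])
b = np.array([1.0])

L = lambda u, y: 0.5 * u @ Q @ u + (A.T @ y - c) @ u - b @ y

# Wolfe dual D_Q^W: max L(u, y)  s.t.  Q u + A^T y = c,  y >= 0,
# solved over the stacked variable v = (u, y).
res = minimize(lambda v: -L(v[:2], v[2:]), x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "eq",
                             "fun": lambda v: Q @ v[:2] + A.T @ v[2:] - c}],
               bounds=[(None, None), (None, None), (0.0, None)])
print("Wolfe dual value:", -res.fun)     # expected ~ -1.5

# Primal problem P_Q solved directly, for comparison.
resP = minimize(lambda x: 0.5 * x @ Q @ x - c @ x, x0=np.zeros(2), method="SLSQP",
                constraints=[{"type": "ineq", "fun": lambda x: b - A @ x}])
print("primal value    :", resP.fun)     # expected ~ -1.5 (strong duality)
```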

4.5 A Glimpse on Conic Optimization\(^{\star }\)

We have already found in Subsection 4.3.2 convex feasible sets of the form \( F=\left\{ x\in \mathbb {R}^{n}:Ax+b\in K\right\} \), where A is an \(m\times n \) matrix, \(b\in \mathbb {R}^{m}\)  and K is a polyhedral convex cone in \( \mathbb {R}^{m}\). The following three convex cones frequently arise in practice:
  • The positive orthant of \(\mathbb {R}^{m}\), \(\mathbb {R}_{+}^{m}\), which converts the conic constraint \(Ax+b\in K\) into an ordinary linear inequality system;

  • The second-order cone,  also called the ice-cream cone,
    $$ K_{p}^{m}:=\left\{ x\in \mathbb {R}^{m}:x_{m}\ge \left\| {\left( x_{1},\ldots , x_{m-1}\right) }^{T} \right\| \right\} , $$
    and Cartesian products of the form \(\prod _{j=1}^{l}K_{p}^{m_{j}+1}\) (\(K_{p}^{3}\) is represented twice in Fig. 4.6).
  • The cone of positive semidefinite symmetric matrices in \( \mathcal {S}_{q}\),  usually denoted by \(\mathcal {S}_{q}^{+}\). Here, we identify the space of all \( q\times q\) symmetric matrices \(\mathcal {S}_{q}\) with \(\mathbb {R}^{q(q+1)/2}=\mathbb {R }^{m}\). From now on, we write \(A\succeq 0\) when \(A\in \mathcal {S}_{q}\) is positive semidefinite and \(A\succ 0\) when it is positive definite.

It is easy to see that the above cones are closed and convex, and they have nonempty interior (observe that \(A\succ 0\) implies that \(A\in \mathop {\mathrm{int}} \mathcal {S}_{q}^{+}\) as the eigenvalues of A are continuous functions of its entries). Moreover, they are pointed  and symmetric  (meaning that \(K\cap -K=\left\{ 0_{m}\right\} \) and that \(K^{\circ }=-K\), respectively). The pointedness of \(\mathbb {R}_{+}^{m}\) and \(K_{p}^{m}\) is evident, while, for \(\mathcal {S}_{q}^{+}\), it follows from the characterization of the symmetric positive semidefinite matrices by the nonnegativity of their eigenvalues. The symmetry is also evident for \(\mathbb {R}_{+}^{m}\), it can be easily proved from the Cauchy–Schwarz inequality for \(K_{p}^{m}\), and it is a nontrivial result proved by Moutard for \(\mathcal {S}_{q}^{+}\) [58, Theorem 7.5.4].

We first consider a linear conic problem  of the form
$$ \begin{array}{lll} P_{K}: &{} \text {Min} &{} c^{T}x \\ &{} \text {s.t.} &{} Ax+b\in K, \end{array} $$
where \(c\in \mathbb {R}^{n}\), A is an \(m\times n\) matrix, \(b\in \mathbb {R} ^{m}\), and K is a pointed closed convex cone in \(\mathbb {R}^{m}\) such that \(\mathop {\mathrm{int}}K\ne \emptyset \). The assumptions on K guarantee that \( K^{\circ }\) enjoys the same properties and that there exists a  compact base of \(K^{\circ }\), i.e., a compact convex set W such that \(0_{m}\notin W\) and \(K^{\circ }=\mathop {\mathrm{cone}}W\) [40, p. 447]. For instance, compact bases of \(K^{\circ }\) are \(W=-\mathop {\mathrm{conv}}\left\{ e_{1},\ldots , e_{m}\right\} \), for \(K=\mathbb {R}_{+}^{m}\), and \(W=\mathbb { B\times }\left\{ -1\right\} \), for \(K=K_{p}^{m}\). By the same argument as in Subsection 4.3.2, since \(K^{\circ }=\mathop {\mathrm{cone}}W\),
$$\begin{aligned} Ax+b\in K\Longleftrightarrow w^{T}\left( Ax+b\right) \le 0,\quad \forall w\in W. \end{aligned}$$
(4.35)
So, the conic constraint \(Ax+b\in K\) can be replaced in \(P_{K}\) by the right-hand side of (4.35), getting the equivalent linear semi-infinite problem
$$ \begin{array}{lll} P_{K}^{1}: &{} \text {Min} &{} c^{T}x \\ &{} \text {s.t.} &{} \left( w^{T}A\right) x\le -w^{T}b,\;w\in W, \end{array} $$
(a linear optimization problem with infinitely many inequalities), whose theory, methods, and applications are presented in [43] (see also [44]). From the existence theorem [43, Corollary 3.1.1] for this type of problem,
$$ F\ne \emptyset \Longleftrightarrow \left( \begin{array}{c} 0_{n} \\ -1 \end{array} \right) \notin \mathop {\mathrm{cone}}\left\{ \left[ A\mid -b\right] ^{T}w,\, w\in W\right\} , $$
in which case, optimal solutions can be computed by the corresponding numerical methods. Stopping rules for primal-dual algorithms can be obtained from the Haar dual problem  of \(P_{K}^{1}\),
$$ \begin{array}{lll} D_{K}^{1}: &{} \text {Max} &{} \sum _{w\in W}b^{T}w\lambda _{w} \\ &{} \text {s.t.} &{} \sum _{w\in W}\lambda _{w}A^Tw=-c, \\ &{} &{} \lambda \in \mathbb {R}_{+}^{\left( W\right) }, \end{array} $$
where \(\mathbb {R}_{+}^{\left( W\right) }\) denotes the convex cone of real-valued functions on W which vanish except on a finite subset of W. One always has the weak inequality \(v\left( D_{K}^{1}\right) \le v\left( P_{K}^{1}\right) \) as, given a pair \(\left( x,\lambda \right) \) of primal-dual feasible solutions,
$$ c^{T}x=-\sum _{w\in W}\lambda _{w}\left( A^{T}w\right) ^{T}x\ge \sum _{w\in W}w^{T}b\lambda _{w}. $$
Moreover, the strong duality holds under the Slater constraint qualification (still abbreviated as SCQ) that there exists some \(\widehat{x} \in \mathbb {R}^{n}\) such that \(\left( w^{T}A\right) \widehat{x}<-w^{T}b\) for all \(w\in W\) [44]. Taking into account the compactness of the index set in \(P_{K}^{1}\) and the continuity of the coefficients with respect to the index, it is also possible to associate with \(P_{K}^{1}\) another dual problem, which is also linear and is posed on a certain infinite-dimensional space of measures on W, the so-called continuous dual [4]. Once again, strong duality holds for this continuous dual under the same SCQ as above.
We can also reformulate \(P_{K}^{1}\) as a convex optimization problem with a single constraint function \(g\left( x\right) :=\max _{w\in W}\left\{ \left( w^{T}A\right) x+w^{T}b\right\} \) for all \(x\in \mathbb {R}^{n}\). Obviously, \( P_{K}^{1}\) is equivalent to
$$ \begin{array}{lll} P_{K}^{2}: &{} \text {Min} &{} f\left( x\right) =c^{T}x \\ &{} \text {s.t.} &{} g\left( x\right) \le 0, \end{array} $$
whose Lagrange dual is
$$ \begin{array}{lll} D_{K}^{2}: &{} \text {Max} &{} h\left( y\right) :=\inf _{x\in \mathbb {R} ^{n}}\left\{ c^{T}x+yg\left( x\right) \right\} \\ &{} \text {s.t.} &{} y\ge 0. \end{array} $$
Due to the lack of differentiability of g, the Wolfe dual of \(P_{K}^{2}\) is not well defined.
Observe that the dual pair \(P_{K}^{1}-D_{K}^{1}\) is preferable to the pair \( P_{K}^{2}-D_{K}^{2}\) because:
  • The Slater constraint qualification for the pair \(P_{K}^{1}-D_{K}^{1}\) is weaker than the one corresponding to the pair \(P_{K}^{2}-D_{K}^{2}\), i.e., the existence of some \(\widehat{x}\in \mathbb {R}^{n}\) such that \( g\left( \widehat{x}\right) <0\).

  • \(D_{K}^{2}\) can hardly be solved in practice as no explicit expression of g is available except for particular cases.

However, the preferable dual of \(P_{K}\), from a computational point of view, is the so-called conic dual problem :
$$ \begin{array}{lll} D_{K}: &{} \text {Max} &{} b^{T}y \\ &{} \text {s.t.} &{} A^{T}y=-c, \\ &{} &{} y\in K^{\circ }. \end{array} $$
To check the weak duality, take an arbitrary primal-dual feasible solution \( \left( x, y\right) \). Then, since \(Ax+b\in K\), we have
$$ c^{T}x=-\left( A^{T}y\right) ^{T}x=-y^{T}Ax=-y^{T}(Ax+b)+b^Ty\ge b^{T}y. $$
The strong duality holds for the pair \(P_{K}-D_{K}\) when there exists some \(\widehat{x}\in \mathbb {R}^{n}\) such that \(A\widehat{x}+b\) belongs to the relative interior of K, i.e., to the interior of K relative to its affine hull in \(\mathbb {R}^{m}\), which is the particular form that SCQ takes for this class of linear conic problems [74].
In order to motivate the definition of a more general class of conic problems, let us represent by \(x_{1}\), instead of x, the decision variable in \(P_{K}\) and introduce a new variable \(x_{2}:=Ax_{1}+b\). Denoting \(\widetilde{c}:=\left( \begin{array}{c} c \\ 0_{m} \end{array} \right) \) and \(x:=\left( \begin{array}{c} x_{1} \\ x_{2} \end{array} \right) \in \mathbb {R}^{n+m}\), problem \(P_{K}\) turns out to be equivalent to
$$ \begin{array}{lll} P_{K}^{3}: &{} \text {Min} &{} f\left( x\right) =\widetilde{c}^{T}x \\ &{} \text {s.t.} &{} \left[ A\mid -I_{m}\right] x=-b \\ &{} &{} x\in \mathbb {R}^{n}\times K. \end{array} $$
Obviously, \(P_{K}^{3}\) is a particular case of the so-called general linear conic problem [6]
$$ \begin{array}{lll} P_{\mathcal {K}}: &{} \text {Min} &{} \left\langle c,x\right\rangle \\ &{} \text {s.t.} &{} \left\langle a_{i}, x\right\rangle =b_{i},\;i\in I=\left\{ 1,\ldots , m\right\} , \\ &{} &{} x\in \mathcal {K}, \end{array} $$
where \(\mathcal {K}\) is a convex cone in a given linear space \(\mathcal {Z}\) equipped with an inner product \(\left\langle \cdot ,\cdot \right\rangle \), \( c, a_{1},\ldots , a_{m}\in \mathcal {Z}\), and \(b_{1},\ldots , b_{m}\in \mathbb {R}\) (observe that the definitions of convex set, convex function, cone, polar cone, and convex optimization problem make sense when the usual decision space \(\mathbb {R}^{n}\) is replaced by \(\mathcal {Z})\). We associate with \(P_{ \mathcal {K}}\) the following linear conic problem:
$$ \begin{array}{lll} D_{\mathcal {K}}: &{} \text {Max} &{} b^{T}y \\ &{} \text {s.t.} &{} \sum _{i\in I}y_{i}a_{i}+z=c, \\ &{} &{} \left( y, z\right) \in \mathbb {R}^{m}\times \left( -\mathcal {K}^{\circ }\right) , \end{array} $$
where \(b=\left( b_{1},\ldots , b_{m}\right) ^{T}\) and \(\mathcal {K}^{\circ }\) is the polar cone of \(\mathcal {K}\). It is easy to check the convexity of \(P_{ \mathcal {K}}\) and \(D_{\mathcal {K}}\). Taking an arbitrary primal-dual feasible solution \(\left( x,\left( y, z\right) \right) \), one has \(\left\langle z,x\right\rangle \ge 0\) (as \(-z\in \mathcal {K}^{\circ }\) and \(x\in \mathcal {K}\)), so that
$$ b^{T}y=\left\langle \sum _{i\in I}y_{i}a_{i},x\right\rangle =\left\langle c-z,x\right\rangle =\left\langle c,x\right\rangle -\left\langle z,x\right\rangle \le \left\langle c, x\right\rangle ; $$
that is, weak duality holds. For this reason, \(D_{\mathcal {K}}\) is said to be the conic dual problem of \(P_{\mathcal {K}}\). The way of guaranteeing strong duality for the pair \(P_{\mathcal {K}}-D_{\mathcal {K}} \) depends on \(\mathcal {Z}\) and \(\mathcal {K}\):
  • In linear optimization, \(\mathcal {Z}=\mathbb {R}^{n}\), \( \left\langle c, x\right\rangle =c^{T}x\), and \(\mathcal {K}=\mathbb {R}_{+}^{n}\). Strong duality holds just assuming the existence of some primal-dual feasible solution; i.e., no CQ is needed.

  • In second-order cone optimization, \(\mathcal {Z} =\prod _{j=1}^{l}\mathbb {R}^{n_{j}+1}\), \(\left\langle c, x\right\rangle =\sum _{j=1}^{l}c_{j}^{T}x_{j}\), and \(\mathcal {K} =\prod _{j=1}^{l}K_{p}^{n_{j}+1}\).

  • In semidefinite optimization, \(\mathcal {Z}=\mathcal {S}_{n}\), \( \left\langle C, X\right\rangle \) is the trace of the product matrix CX, and \(\mathcal {K}=\mathcal {S}_{n}^{+}\).

The SCQ in second-order cone optimization and semidefinite optimization reads as follows: There exists a primal-dual feasible solution \(\left( \widehat{x},\left( \widehat{y},\widehat{z}\right) \right) \) such that \( \widehat{x},\widehat{z}\in \mathop {\mathrm{int}}\mathcal {K}\). For the last two classes of optimization problems just mentioned, SCQ also guarantees the existence of a primal optimal solution [6]. A favorable consequence of the symmetry of the three cones mentioned above is that the corresponding conic problems admit efficient primal-dual algorithms [69].

It is worth mentioning that second-order cone optimization problems can be reformulated as semidefinite optimization ones as a consequence of the identity
$$ K_{p}^{m}=\left\{ x\in \mathbb {R}^{m}:\left[ \begin{array}{ccccc} x_{m} &{} 0 &{} \ldots &{} 0 &{} x_{1} \\ 0 &{} x_{m} &{} &{} 0 &{} x_{2} \\ \vdots &{} \vdots &{} \ddots &{} \vdots &{} \vdots \\ 0 &{} 0 &{} \ldots &{} x_{m} &{} x_{m-1} \\ x_{1} &{} x_{2} &{} \ldots &{} x_{m-1} &{} x_{m} \end{array} \right] \succeq 0\right\} , $$
but these reformulations are not used in practice as the specific second-order cone optimization methods are much more efficient than the adaptations of the semidefinite optimization ones [3].
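This identity is easy to test numerically. The sketch below (an illustrative addition; NumPy is our choice) compares, on random vectors, membership in \(K_{p}^{m}\) with positive semidefiniteness of the associated arrow-head matrix.

```python
import numpy as np

def arrow(x):
    """Arrow-head matrix whose positive semidefiniteness encodes x in K_p^m."""
    x = np.asarray(x, dtype=float)
    m = x.size
    M = x[-1] * np.eye(m)      # x_m on the whole diagonal
    M[:-1, -1] = x[:-1]        # last column: x_1, ..., x_{m-1}
    M[-1, :-1] = x[:-1]        # last row   : x_1, ..., x_{m-1}
    return M

rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.normal(size=4)
    lam_min = np.linalg.eigvalsh(arrow(x)).min()
    # The smallest eigenvalue of arrow(x) equals x_m - ||(x_1,...,x_{m-1})||,
    # so arrow(x) is positive semidefinite exactly when x lies in K_p^m.
    assert abs(lam_min - (x[-1] - np.linalg.norm(x[:-1]))) < 1e-9
print("arrow-matrix characterization of K_p^4 confirmed on random samples")
```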

Many textbooks on convex optimization, e.g., [10, 15], pay particular attention to the theory and methods of conic optimization. Several chapters of [7] also deal with the theory and methods of conic and semidefinite optimization, while [2] is focused on second-order cone optimization. Regarding applications, [10, 15] present interesting applications to engineering and finance (e.g., the portfolio problem), while [85] contains chapters reviewing applications of conic optimization to nonlinear optimal control (pp. 121–133), truss topology design (pp. 135–147), and financial engineering (pp. 149–160).

4.6 Exercises

4.1

Prove that \(\overline{x}={\left( 1,1\right) }^{T} \) is the unique optimal solution to the problem
$$\begin{aligned} \begin{array}{lll} P: &{} \text {Min} &{} f\left( x\right) ={\left( x_{1}-2\right) }^{2}+{\left( x_{2}-1\right) }^{2} \\ &{} \text {s.t.} &{} g_{1}\left( x\right) =x_{1}^{2}-x_{2}\le 0, \\ &{} &{} g_{2}\left( x\right) =x_{1}^{2}+x_{2}^{2}-2\le 0. \end{array} \end{aligned}$$

4.2

Express a positive number a as the sum of three numbers in such a way that the sum of their cubes is minimized, under the following assumptions:

(a) The three numbers are arbitrary.

(b) The three numbers are positive.

(c) The three numbers are nonnegative.

4.3

Solve the following problem by using the KKT conditions
$$\begin{aligned} \begin{array}{lll} P: &{} \text {Min} &{} e^{x^{2}+y^{2}} \\ &{} \text {s.t.} &{} 2x-y=4. \end{array} \end{aligned}$$

4.4

Solve
$$\begin{aligned} \begin{array}{lll} P: &{} \text {Min} &{} f\left( x_{1}, x_{2}\right) =x_{1}^{-1}x_{2}^{-1} \\ &{} \text {s.t.} &{} x_{1}+x_{2}\le 2, \\ &{} &{} x_{1}>0,x_{2}>0. \end{array} \end{aligned}$$

4.5

Solve the optimization problem posed in Exercise  1.3 by using the KKT conditions.

4.6

Determine whether it is true or false that \(\overline{x}={\left( -\frac{1}{3},-\frac{1}{ 3},-\frac{1}{3},-3\right) }^{T} \) is, among all the solutions of the linear system
$$\begin{aligned} \left\{ \begin{array}{c} x_{1}+x_{2}-x_{3}\ge -1 \\ 3x_{1}+3x_{2}+3x_{3}-4x_{4}\ge 9 \\ -x_{1}-x_{2}-x_{3}\ge 1 \\ -3x_{1}+x_{2}+x_{3}\ge -1 \\ x_{1}-3x_{2}+x_{3}\ge -1 \\ -x_{1}-x_{2}+3x_{3}\ge -3 \end{array} \right\} , \end{aligned}$$
the closest one to the origin in \(\mathbb {R}^{4}\).

4.7

Consider the following problem in \(\mathbb {R}^{2}\):
$$\begin{aligned} \begin{array}{lll} P: &{} \text {Min} &{} f\left( x\right) =4x_{1}^{2}+x_{2}^{2}-8x_{1}-4x_{2} \\ &{} \text {s.t.} &{} x_{1}+x_{2}\le 4 \\ &{} &{} 2x_{1}+x_{2}\le 5 \\ &{} &{} -x_{1}+4x_{2}\ge 2 \\ &{} &{} x_{1}\ge 0,x_{2}\ge 0. \end{array} \end{aligned}$$
(a) Reformulate P in such a way that the level curves are circles.

(b) Solve P graphically.

(c) Prove analytically that the result obtained in (b) is indeed correct.

(d) Solve the problem obtained when we add to P the constraint \( x_{2}\ge 3\).

4.8

Solve
$$\begin{aligned} \begin{array}{lll} P: &{} \text {Max} &{} \ f\left( x\right) =20x_{1}+16x_{2}-2x_{1}^{2}-x_{2}^{2}-x_{3}^{2} \\ &{} \text {s.t.} &{} x_{1}+x_{2}\le 5, \\ &{} &{} x_{1}+x_{2}-x_{3}=0, \\ &{} &{} x_{1}\ge 0,x_{2}\ge 0. \end{array} \end{aligned}$$

4.9

Consider the problem
$$\begin{aligned} \begin{array}{lll} P: &{} \text {Min} &{} x_{1}^{2}+x_{2}^{2}+5x_{1} \\ &{} \text {s.t.} &{} x_{1}^{2}+x_{2}^{2}\le 4, \\ &{} &{} -x_{2}\le -2. \end{array} \end{aligned}$$
(a) Solve it graphically.

(b) Analyze the fulfillment of the KKT conditions at the point obtained in (a).

4.10

Check that the convex optimization problem
$$ \begin{array}{lll} P: &{} \text {Min} &{} f\left( x, y\right) =x \\ &{} \text {s.t.} &{} x^{2}+{\left( y-1\right) }^{2}\le 1, \\ &{} &{} x^{2}+{\left( y+1\right) }^{2}\le 1, \end{array} $$
has a unique global minimum that does not satisfy the KKT conditions. Why does this happen?

4.11

Prove that the bounded problem
$$ \begin{array}{lll} P: &{} \text {Min} &{} f\left( x\right) =x_{1}^{2}-2x_{1}+x_{2}^{2} \\ &{} \text {s.t.} &{} x_{1}+x_{2}\le 0, \\ &{} &{} x_{1}^{2}-4\le 0, \end{array} $$
has a unique optimal solution with a unique KKT vector associated with it. From that information, what can you say about the effect on the optimal value of small perturbations on the right-hand side of each constraint?

4.12

Consider the parametric problem
$$\begin{aligned} \begin{array}{lll} P(z) : &{} \text {Min} &{} f\left( x\right) =x_{1}^{2}-2x_{1}+x_{2}^{2}+4x_{2} \\ &{} \text {s.t.} &{} 2x_{1}+x_{2}\le z, \\ &{} &{} x\in \mathbb {R}^{2}. \end{array} \end{aligned}$$
(a) Compute the value function \(\mathcal {\vartheta }\).

(b) Check that \(\mathcal {\vartheta }\) is convex and differentiable at 0.

(c) Find the set of sensitivity vectors for \(P\left( 0\right) \).

4.13

Consider the problem
$$\begin{aligned} \begin{array}{lll} P: &{} \text {Min} &{} f\left( x\right) =x_{1}^{2}-x_{1}x_{2} \\ &{} \text {s.t.} &{} x_{1}-x_{2}\le 2, \end{array} \end{aligned}$$
and its corresponding parametric problem
$$\begin{aligned} \begin{array}{lll} P(z) : &{} \text {Min} &{} f\left( x\right) =x_{1}^{2}-x_{1}x_{2} \\ &{} \text {s.t.} &{} x_{1}-x_{2}-2\le z. \end{array} \end{aligned}$$
(a) Identify the feasible solutions to P satisfying the KKT conditions, and determine whether any of them is an optimal solution.

(b) Determine whether the value function is convex and whether P satisfies SCQ.

(c) Compute a sensitivity vector.

4.14

The utility function of a consumer is \(u\left( x, y\right) =xy\), where x and y denote the quantities consumed of two goods A and B, whose unit prices are 2 and 3 c.u., respectively. Maximize the consumer's utility, knowing that she has 90 c.u. available.

4.15

A tetrahedron (or triangular pyramid) is rectangular when three of its faces are right triangles, which we will call its legs (cateti), whereas the fourth face will be called its hypotenuse. Design a rectangular tetrahedron whose hypotenuse has minimum area, given that the height of the pyramid over the hypotenuse is h meters.

4.16

Consider the convex optimization problem
$$ \begin{array}{lll} P: &{} \text {Min} &{} f\left( x\right) =e^{-x_{2}} \\ &{} \text {s.t.} &{} \left\| x\right\| -x_{1}\le 0, \\ &{} &{} x\in \mathbb {R}^{2}. \end{array} $$
(a) Identify the feasible set multifunction \(\mathcal {F}\) and the value function \(\mathcal {\vartheta }\).

(b) Study the continuity and differentiability of \(\mathcal {\vartheta }\) on its domain.

(c) Compute the sensitivity vectors.

(d) Determine whether the optimal values of P and of its Lagrange dual problem \(D^{L}\) are equal.

4.17

Consider the convex optimization problem
$$ \begin{array}{lll} P: &{} \text {Min} &{} f\left( x\right) =\left\| x\right\| \\ &{} \text {s.t.} &{} x_{1}+x_{2}\le 0, \\ &{} &{} x\in \mathbb {R}^{2}. \end{array} $$
(a) Identify the feasible set multifunction \(\mathcal {F}\) and the value function \(\mathcal {\vartheta }\).

(b) Study the continuity and differentiability of \(\mathcal {\vartheta }\) on its domain.

(c) Compute the sensitivity vectors of \(\mathcal {\vartheta }\).

(d) Compute the optimal set of \(P\).

(e) Compute the optimal set of its Lagrange dual problem \(D^{L}\).

(f) Determine whether strong duality holds.

4.18

Solve analytically and geometrically the problem
$$\begin{aligned} \begin{array}{ccc} P: &{} \text {Min} &{} f\left( x\right) =x_{1}^{2}+x_{2}^{2}+x_{3}^{2} \\ &{} \text {s.t.} &{} x_{1}+x_{2}+x_{3}\le -3, \end{array} \end{aligned}$$
and compute its sensitivity vectors. Solve also the Lagrange dual \(D^{L}\) and the Wolfe dual \(D^{W}\) of P, showing that the strong duality holds for both duality pairs without using the SCQ.

4.19

Consider the problem
$$\begin{aligned} \begin{array}{lll} P: &{} \text {Min} &{} f\left( x\right) =x \\ &{} \text {s.t.} &{} g\left( x\right) =x^{2}\le 0. \end{array} \end{aligned}$$
(a) Solve P analytically, if possible.

(b) Express analytically and represent graphically \(\mathop {\mathrm{gph}}\mathcal {F}\).

(c) Identify the value function \(\mathcal {\vartheta }\) and represent graphically \(\mathop {\mathrm{gph}}\mathcal {\vartheta }\).

(d) Analyze the differentiability of \(\mathcal {\vartheta }\) on the interior of its domain.

(e) Compute the sensitivity vectors of \(\mathcal {\vartheta }\) (here scalars, since \(m=1\)).

(f) Compute the saddle points of the Lagrange function L.

(g) Compute the KKT vectors on the optimal solutions of \(P\).

(h) Check the strong duality property for the Lagrange dual \(D^{L}\) and for the Wolfe dual \(D^{W}\) of \(P\).
