In this section, we will present a method to compute descent directions of nonsmooth MOPs that generalizes the method from [12] to the multiobjective case. As described in the previous section, when computing descent directions via Theorem 2.2, the problem of computing subdifferentials arises. Since this is difficult in practice, we will instead replace W in Theorem 2.2 by an approximation of \({{\,\mathrm{conv}\,}}\left( \cup _{i = 1}^k \partial f_i(x) \right) \) such that the resulting direction is guaranteed to have sufficient descent. To this end, we replace the Clarke subdifferential by the so-called \(\varepsilon \)-subdifferential and take a finite approximation of the latter.
The Epsilon-Subdifferential
By definition, \(\partial f_i(x)\) is the convex hull of the limits of \(\nabla f_i\) along all sequences of points at which \(f_i\) is differentiable and which converge to x. Thus, if we evaluate \(\nabla f_i\) at a number of points close to x (where it is defined) and take the convex hull, we expect the resulting set to be an approximation of \(\partial f_i(x)\). To formalize this, we introduce the following definition [21, 28].
Definition 3.1
Let \(\varepsilon \ge 0\), \(x \in \mathbb {R}^n\), \(B_\varepsilon (x) := \{ y \in \mathbb {R}^n : \Vert x - y \Vert \le \varepsilon \}\) and \(i \in \{1,\ldots ,k\}\). Then,
$$\begin{aligned} \partial _\varepsilon f_i(x) := {{\,\mathrm{conv}\,}}\left( \bigcup _{y \in B_\varepsilon (x)} \partial f_i(y) \right) \end{aligned}$$
is the (Goldstein) \(\varepsilon \)-subdifferential of \(f_i\) in x. \(\xi \in \partial _\varepsilon f_i(x)\) is an \(\varepsilon \)-subgradient.
Note that \(\partial _0 f_i(x) = \partial f_i(x)\) and \(\partial f_i(x) \subseteq \partial _\varepsilon f_i(x)\). For \(\varepsilon \ge 0\), we define the corresponding set for the multiobjective setting as
$$\begin{aligned} F_\varepsilon (x) := {{\,\mathrm{conv}\,}}\left( \bigcup _{i=1}^k \partial _\varepsilon f_i(x) \right) . \end{aligned}$$
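As a rough illustration of the sampling idea described before Definition 3.1, one could collect gradients of the objectives at randomly drawn points of \(B_\varepsilon (x)\); the convex hull of the collected vectors then approximates \(F_\varepsilon (x)\). The following Python sketch is introduced here for intuition only (the gradient oracles `grads[i]` are an assumed interface, and the method developed below enriches its approximation adaptively rather than by random sampling).

```python
import numpy as np

def sample_F_eps(grads, x, eps, n_samples=50, rng=None):
    """Crude finite approximation of F_eps(x): evaluate each gradient oracle
    grads[i](y) (assumed to return nabla f_i(y), which exists almost everywhere)
    at random points y of the closed eps-ball around x and collect the results;
    their convex hull approximates conv of the union of the eps-subdifferentials."""
    rng = np.random.default_rng(rng)
    n = x.size
    samples = []
    for _ in range(n_samples):
        d = rng.normal(size=n)
        y = x + eps * rng.uniform() ** (1.0 / n) * d / np.linalg.norm(d)  # uniform in B_eps(x)
        samples.extend(g(y) for g in grads)
    return np.array(samples)
```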
To be able to choose \(W = F_\varepsilon (x)\) in Theorem 2.2, we first need to establish some properties of \(F_\varepsilon (x)\).
Lemma 3.1
\(\partial _\varepsilon f_i(x)\) is nonempty, convex and compact for all \(i \in \{1,\ldots ,k\}\). In particular, the same holds for \(F_\varepsilon (x)\).
Proof
For \(\partial _\varepsilon f_i(x)\), this was shown in [21], Prop. 2.3. For \(F_\varepsilon (x)\), it then follows directly from the definition. \(\square \)
We immediately get the following corollary from Theorems 2.1 and 2.2.
Corollary 3.1
Let \(\varepsilon \ge 0\).
(a) If x is Pareto optimal, then
$$\begin{aligned} 0 \in F_\varepsilon (x). \end{aligned}$$
(4)
(b) Let \(x \in \mathbb {R}^n\) and
$$\begin{aligned} \bar{v} := \mathop {\mathrm{arg min}}\limits _{\xi \in -F_\varepsilon (x)} \Vert \xi \Vert ^2. \end{aligned}$$
(5)
Then, either \(\bar{v} \ne 0\) and
$$\begin{aligned} \langle \bar{v}, \xi \rangle \le - \Vert \bar{v} \Vert ^2 < 0 \quad \forall \xi \in F_\varepsilon (x), \end{aligned}$$
(6)
or \(\bar{v} = 0\) and there is no \(v \in \mathbb {R}^n\) with \(\langle v, \xi \rangle < 0\) for all \(\xi \in F_\varepsilon (x)\).
The previous corollary states that when working with the \(\varepsilon \)-subdifferential instead of the Clarke subdifferential, we still have a necessary optimality condition and a way to compute descent directions, although the optimality condition is weaker and the resulting direction yields a weaker descent. This is illustrated in the following example.
Example 3.1
Consider the locally Lipschitz function
$$\begin{aligned} f : \mathbb {R}^2 \rightarrow \mathbb {R}^2, \quad x \mapsto \begin{pmatrix} (x_1 - 1)^2 + (x_2 - 1)^2 \\ x_1^2 + | x_2 | \end{pmatrix}. \end{aligned}$$
The set of nondifferentiable points of f is \(\mathbb {R}\times \{ 0 \}\). For \(\varepsilon > 0\) and \(x \in \mathbb {R}^2\) we have
$$\begin{aligned} \nabla f_1(x) = \begin{pmatrix} 2(x_1 - 1) \\ 2(x_2 - 1) \end{pmatrix} \quad \text {and} \quad \partial _\varepsilon f_1(x) = 2 B_\varepsilon (x) - \begin{pmatrix} 2 \\ 2 \end{pmatrix}. \end{aligned}$$
For \(x \in \mathbb {R}\times \{ 0 \}\) we have
$$\begin{aligned} \partial f_2(x) = \{ 2 x_1 \} \times [-1,1] \quad \text {and} \quad \partial _\varepsilon f_2(x) = [2 x_1 - 2\varepsilon , 2 x_1 + 2\varepsilon ] \times [-1,1]. \end{aligned}$$
Figure 1 shows the Clarke subdifferentials (a) and the \(\varepsilon \)-subdifferentials for \(\varepsilon = 0.2\) (b), together with the corresponding sets \(F_\varepsilon (x)\), for \(x = (1.5, 0)^\top \).
Additionally, the corresponding solutions of (5) are shown. In this case, the predicted descent \(-\Vert \bar{v} \Vert ^2\) (cf. (3)) is approximately \(-3.7692\) in (a) and \(-2.4433\) in (b).
Figure 2 shows the same scenario for \(x = (0.5,0)^\top \).
Here, the Clarke subdifferential still yields a descent direction, while \(\bar{v} = 0\) for the \(\varepsilon \)-subdifferential. In other words, x satisfies the necessary optimality condition (4) but not (1).
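To make these values reproducible, the following Python sketch solves the min-norm problem (5) for the sets from Figs. 1 and 2 by treating it as a quadratic program over the unit simplex of convex-combination weights. It is not taken from the paper's implementation; in particular, the discretization of the disk-shaped set \(\partial _\varepsilon f_1(x)\) by boundary points is an approximation introduced here.

```python
import numpy as np
from scipy.optimize import minimize

def min_norm_direction(xi):
    """Solve (5)/(7) for finitely many (eps-)subgradients xi_1,...,xi_m:
    minimizing ||sum_j lam_j xi_j||^2 over the unit simplex gives the closest
    point of conv({xi_1,...,xi_m}) to the origin; v_bar is its negative."""
    m = xi.shape[0]
    Q = xi @ xi.T
    res = minimize(lambda lam: lam @ Q @ lam, np.full(m, 1.0 / m),
                   jac=lambda lam: 2.0 * Q @ lam, bounds=[(0.0, 1.0)] * m,
                   constraints=({'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0},),
                   method='SLSQP')
    return -xi.T @ res.x

def eps_sets(x1, eps, n=64):
    """Discretized eps-subdifferentials of f1 and f2 from Example 3.1 at x = (x1, 0)."""
    th = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    disk = np.column_stack([2 * x1 - 2 + 2 * eps * np.cos(th), -2 + 2 * eps * np.sin(th)])
    box = np.array([[2 * x1 - 2 * eps, -1], [2 * x1 + 2 * eps, -1],
                    [2 * x1 + 2 * eps, 1], [2 * x1 - 2 * eps, 1]])
    return np.vstack([disk, box])

# Fig. 1a: Clarke case at x = (1.5, 0), F_0(x) = conv({(1,-2)} U {3} x [-1,1])
v = min_norm_direction(np.array([[1.0, -2.0], [3.0, -1.0], [3.0, 1.0]]))
print(-np.linalg.norm(v) ** 2)        # approx. -3.7692 (= -49/13)

# Fig. 1b: eps = 0.2 at x = (1.5, 0)
v = min_norm_direction(eps_sets(1.5, 0.2))
print(-np.linalg.norm(v) ** 2)        # approx. -2.44

# Fig. 2b: eps = 0.2 at x = (0.5, 0); here 0 lies in F_eps(x), so v_bar = 0
v = min_norm_direction(eps_sets(0.5, 0.2))
print(np.linalg.norm(v))              # approx. 0
```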
The following lemma shows that for the direction from (5), there is a lower bound on the step size up to which descent in each objective function \(f_i\) is guaranteed.
Lemma 3.2
Let \(\varepsilon \ge 0\) and \(\bar{v}\) be the solution of (5). Then,
$$\begin{aligned} f_i(x + t \bar{v}) \le f_i(x) - t \Vert \bar{v} \Vert ^2 \quad \forall t \le \frac{\varepsilon }{\Vert \bar{v} \Vert }, i \in \{1,\ldots ,k\}. \end{aligned}$$
Proof
Let \(t \le \frac{\varepsilon }{\Vert \bar{v} \Vert }\). Since \(f_i\) is locally Lipschitz continuous on \(\mathbb {R}^n\), it is in particular Lipschitz continuous on an open set containing \(x + [0,t] \bar{v}\). By applying the mean value theorem (cf. [24], Thm. 2.3.7), we obtain the existence of some \(r \in \ ]0,t[\) with
$$\begin{aligned} f_i(x + t \bar{v}) - f_i(x) \in \langle \partial f_i(x + r \bar{v}), t \bar{v} \rangle . \end{aligned}$$
From \(\Vert x - (x + r \bar{v}) \Vert = r \Vert \bar{v} \Vert < \varepsilon \) it follows that \(\partial f_i(x + r \bar{v}) \subseteq \partial _\varepsilon f_i(x)\). This means that there is some \(\xi \in \partial _\varepsilon f_i(x)\) such that
$$\begin{aligned} f_i(x + t \bar{v}) - f_i(x) = t \langle \xi , \bar{v} \rangle . \end{aligned}$$
Combined with (6) we obtain
$$\begin{aligned}&f_i(x + t \bar{v}) - f_i(x) \le - t \Vert \bar{v} \Vert ^2 \\ \Leftrightarrow \quad&f_i(x + t \bar{v}) \le f_i(x) - t \Vert \bar{v} \Vert ^2, \end{aligned}$$
which completes the proof. \(\square \)
In the following, we will describe how we can obtain a good approximation of (5) without requiring full knowledge of the \(\varepsilon \)-subdifferentials.
Efficient Computation of Descent Directions
In this part, we will describe how the solution of (5) can be approximated when only a single subgradient can be computed at every \(x \in \mathbb {R}^n\). Similar to the gradient sampling approach, the idea behind our method is to replace \(F_\varepsilon (x)\) in (5) by the convex hull of a finite number of \(\varepsilon \)-subgradients \(\xi _1,\ldots ,\xi _m \in F_\varepsilon (x)\), \(m \in \mathbb {N}\). Since it is impossible to know a priori how many and which \(\varepsilon \)-subgradients are required to obtain a good descent direction, we iteratively solve (5) with \(F_\varepsilon (x)\) replaced by increasingly rich finite approximations until a satisfactory direction has been found. To this end, we have to specify how to enrich the current approximation \({{\,\mathrm{conv}\,}}(\{ \xi _1, \ldots , \xi _m \})\) and how to characterize an acceptable descent direction.
Let \(W = \{\xi _1, \ldots , \xi _m\} \subseteq F_\varepsilon (x)\) and
$$\begin{aligned} \tilde{v} := \mathop {\mathrm{arg min}}\limits _{v \in -{{\,\mathrm{conv}\,}}(W)} \Vert v \Vert ^2. \end{aligned}$$
(7)
Let \(c \in \ ]0,1[\). Motivated by Lemma 3.2, we regard \(\tilde{v}\) as an acceptable descent direction, if
$$\begin{aligned} f_i \left( x + \frac{\varepsilon }{\Vert \tilde{v} \Vert } \tilde{v} \right) \le f_i(x) - c \varepsilon \Vert \tilde{v} \Vert \quad \forall i \in \{1,\ldots ,k\}. \end{aligned}$$
(8)
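For later reference, the test (8) can be written as a small helper that returns the set of objectives violating it. This is only a sketch; the objectives are assumed to be given as a list `f` of callables, and `v` is a nonzero candidate direction.

```python
import numpy as np

def violated_indices(f, x, v, eps, c):
    """Return the index set I of objectives violating the acceptance criterion (8):
    at the trial point x + (eps/||v||) * v, each f_i must decrease by at least
    c * eps * ||v||. Assumes v != 0."""
    nv = np.linalg.norm(v)
    y = x + (eps / nv) * v
    return [i for i, fi in enumerate(f) if fi(y) > fi(x) - c * eps * nv]
```

The direction \(\tilde{v}\) is acceptable in the sense of (8) exactly if this set is empty.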
If the set \(I \subseteq \{1,\ldots ,k\}\) of indices for which (8) is violated is nonempty, then we have to find a new \(\varepsilon \)-subgradient \(\xi ' \in F_\varepsilon (x)\) such that \(W \cup \{ \xi ' \}\) yields a better descent direction. Intuitively, a violation of (8) means that the local behavior of \(f_i\), \(i \in I\), in x in the direction \(\tilde{v}\) is not sufficiently captured in W. Thus, for each \(i \in I\), we expect that there exist some \(t' \in \ ]0,\frac{\varepsilon }{\Vert \tilde{v} \Vert }]\) and some \(\xi ' \in \partial f_i(x + t' \tilde{v})\) that improve the approximation of \(F_\varepsilon (x)\). This is proven in the following lemma.
Lemma 3.3
Let \(c \in \ ]0,1[\), \(W = \{ \xi _1, \ldots , \xi _m \} \subseteq F_\varepsilon (x)\) and \(\tilde{v}\) be the solution of (7). If
$$\begin{aligned} f_i \left( x + \frac{\varepsilon }{\Vert \tilde{v} \Vert } \tilde{v} \right) > f_i(x) - c \varepsilon \Vert \tilde{v} \Vert \end{aligned}$$
for some \(i \in \{1,\ldots ,k\}\), then there is some \(t' \in \ ]0,\frac{\varepsilon }{\Vert \tilde{v} \Vert }]\) and \(\xi ' \in \partial f_i(x + t' \tilde{v})\) such that
$$\begin{aligned} \langle \tilde{v}, \xi ' \rangle > - c \Vert \tilde{v} \Vert ^2. \end{aligned}$$
(9)
In particular, \(\xi ' \in F_\varepsilon (x) {\setminus } {{\,\mathrm{conv}\,}}(W)\).
Proof
Assume that for all \(t' \in \ ]0,\frac{\varepsilon }{\Vert \tilde{v} \Vert }]\) and \(\xi ' \in \partial f_i(x + t' \tilde{v})\) we have
$$\begin{aligned} \langle \tilde{v}, \xi ' \rangle \le - c \Vert \tilde{v} \Vert ^2. \end{aligned}$$
(10)
By applying the mean value theorem as in Lemma 3.2, we obtain
$$\begin{aligned} f_i \left( x + \frac{\varepsilon }{\Vert \tilde{v} \Vert } \tilde{v} \right) - f_i(x) \in \langle \partial f_i(x + r \tilde{v}), \frac{\varepsilon }{\Vert \tilde{v} \Vert } \tilde{v} \rangle \end{aligned}$$
for some \(r \in \ ]0,\frac{\varepsilon }{\Vert \tilde{v} \Vert }[\). This means that there is some \(\xi ' \in \partial f_i(x + r \tilde{v})\) such that
$$\begin{aligned} f_i \left( x + \frac{\varepsilon }{\Vert \tilde{v} \Vert } \tilde{v} \right) - f_i(x) = \langle \xi ', \frac{\varepsilon }{\Vert \tilde{v} \Vert } \tilde{v} \rangle = \frac{\varepsilon }{\Vert \tilde{v} \Vert } \langle \xi ', \tilde{v} \rangle . \end{aligned}$$
By (10) it follows that
$$\begin{aligned}&f_i \left( x + \frac{\varepsilon }{\Vert \tilde{v} \Vert } \tilde{v} \right) - f_i(x) \le -c \varepsilon \Vert \tilde{v} \Vert \\ \Leftrightarrow \quad&f_i \left( x + \frac{\varepsilon }{\Vert \tilde{v} \Vert } \tilde{v} \right) \le f_i(x) - c \varepsilon \Vert \tilde{v} \Vert , \end{aligned}$$
which is a contradiction. In particular, (3) yields \(\xi ' \in F_\varepsilon (x) {\setminus } {{\,\mathrm{conv}\,}}(W)\). \(\square \)
The following example visualizes the previous lemma.
Example 3.2
Consider f as in Example 3.1, \(\varepsilon = 0.2\) and \(x = (0.75, 0)^\top \). The dashed lines in Fig. 3 show the \(\varepsilon \)-subdifferentials, \(F_\varepsilon (x)\) and the resulting descent direction (cf. Figs. 1 and 2).
Let \(y = (0.94,-0.02)^\top \). Then, we have \(\Vert x - y \Vert \approx 0.191 \le \varepsilon \), so \(y \in B_\varepsilon (x)\) and
$$\begin{aligned} \partial _\varepsilon f_1(x)&\supseteq \partial f_1(y) = \left\{ \begin{pmatrix} -0.12 \\ -2.04 \end{pmatrix} \right\} =: \{ \xi _1 \},\\ \partial _\varepsilon f_2(x)&\supseteq \partial f_2(y) = \left\{ \begin{pmatrix} 1.88 \\ -1 \end{pmatrix} \right\} =: \{ \xi _2 \}. \end{aligned}$$
Let \(W := \{ \xi _1, \xi _2 \}\) and \({{\,\mathrm{conv}\,}}(W)\) be the approximation of \(F_\varepsilon (x)\), shown as the solid line in Fig. 3a. Let \(\tilde{v}\) be the solution of (7) for this W and choose \(c = 0.25\). Checking (8), we have
$$\begin{aligned} f_2 \left( x + \frac{\varepsilon }{\Vert \tilde{v} \Vert } \tilde{v} \right)&\approx 0.6101 > 0.4748 \approx f_2(x) - c \varepsilon \Vert \tilde{v} \Vert . \end{aligned}$$
By Lemma 3.3, this means that there is some \(t' \in \ ]0, \frac{\varepsilon }{\Vert \tilde{v} \Vert }]\) and \(\xi ' \in \partial f_2(x + t' \tilde{v})\) such that
$$\begin{aligned} \langle \tilde{v}, \xi ' \rangle > -c \Vert \tilde{v} \Vert ^2. \end{aligned}$$
In this case, we can take for example \(t' = \frac{1}{2} \frac{\varepsilon }{\Vert \tilde{v} \Vert }\), resulting in
$$\begin{aligned} \partial f_2(x + t'\tilde{v}) \approx \left\{ \begin{pmatrix} 1.4077 \\ 1 \end{pmatrix} \right\} =: \{ \xi ' \}, \\ \langle \tilde{v}, \xi ' \rangle \approx 0.4172 > -0.7696 \approx -c \Vert \tilde{v} \Vert ^2. \end{aligned}$$
Figure 3b shows the improved approximation \(W \cup \{ \xi ' \}\) and the resulting descent direction \(\tilde{v}\). By checking (8) for this new descent direction, we see that \(\tilde{v}\) is acceptable. (Note that in general, a single improvement step like this will not be sufficient to reach an acceptable direction.)
Note that Lemma 3.3 only shows the existence of \(t'\) and \(\xi '\) without providing a way to actually compute them. To this end, let \(i \in \{1,\ldots ,k\}\) be the index of an objective function for which (8) is not satisfied and define
$$\begin{aligned} h_i : \mathbb {R}\rightarrow \mathbb {R}, \quad t \mapsto f_i(x + t \tilde{v}) - f_i(x) + c t \Vert \tilde{v} \Vert ^2 \end{aligned}$$
(11)
(cf. [12]) and consider Algorithm 1. If \(f_i\) is continuously differentiable around \(x + t' \tilde{v}\), then (9) is equivalent to \(h_i'(t') > 0\), i.e., to \(h_i\) being monotonically increasing around \(t'\). Thus, the idea of Algorithm 1 is to find some t such that \(h_i\) is monotonically increasing around t, while checking whether (9) is satisfied for a subgradient \(\xi \in \partial f_i(x + t \tilde{v})\).
Although we cannot guarantee in the general setting that Algorithm 1 terminates after finitely many iterations with a subgradient satisfying (9), we can at least show that it either does so, or generates a sequence of step sizes converging to some \(\bar{t}\) for which \(\partial f_i(x + \bar{t} \tilde{v})\) contains a subgradient satisfying (9) or which is a critical point of \(h_i\).
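Since Algorithm 1 itself is not reproduced in this excerpt, the following bisection-type sketch is only an inference from the description above and from the roles of a, b and \(t_j\) in the proof of Lemma 3.4; the interface (a subgradient oracle `subgrad_i` returning one Clarke subgradient) and the iteration cap are assumptions and may differ from the published pseudocode.

```python
import numpy as np

def find_new_subgradient(fi, subgrad_i, x, v, eps, c, max_iter=50):
    """Bisection-type search for some t in ]0, eps/||v||] and a Clarke subgradient
    xi of f_i at x + t*v with <v, xi> > -c*||v||^2, cf. (9). 'fi' evaluates f_i,
    'subgrad_i(y)' is assumed to return one element of the Clarke subdifferential."""
    nv = np.linalg.norm(v)
    h = lambda t: fi(x + t * v) - fi(x) + c * t * nv ** 2   # h_i from (11)
    a, b = 0.0, eps / nv
    xi = subgrad_i(x + b * v)
    for _ in range(max_iter):
        t = 0.5 * (a + b)                # t_j strictly between a_j and b_j
        xi = subgrad_i(x + t * v)
        if v @ xi > -c * nv ** 2:        # (9) holds: new eps-subgradient found
            return xi
        if h(t) >= h(b):                 # h_i must increase somewhere in [a, t] ...
            b = t
        else:                            # ... or in [t, b], since h_i(a) < h_i(b)
            a = t
    return xi                            # no guarantee in this case (cf. Lemma 3.4)
```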
Lemma 3.4
Let \(i \in \{1,\ldots ,k\}\) be an index for which (8) is violated. Let \((t_j)_j\) be the sequence generated in Algorithm 1.
(a) If \((t_j)_j\) is finite, then some \(\xi '\) was found satisfying (9).
(b) If \((t_j)_j\) is infinite, then it converges to some \(\bar{t} \in \ ]0,\frac{\varepsilon }{\Vert \tilde{v} \Vert }]\) with \(h_i(\bar{t}) \ge h_i(\frac{\varepsilon }{\Vert \tilde{v} \Vert })\) such that
(i) there is some \(\xi ' \in \partial f_i(x + \bar{t} \tilde{v})\) satisfying (9), or
(ii) \(0 \in \partial h_i(\bar{t})\), i.e., \(\bar{t}\) is a critical point of \(h_i\).
Proof
Let \((t_j)_j\) be finite with last element \(\bar{t} \in \ ]0,\frac{\varepsilon }{\Vert \tilde{v} \Vert }[\). Then, Algorithm 1 must have stopped in step 3, i.e., some \(\xi ' \in \partial f_i(x + \bar{t} \tilde{v})\) satisfying (9) was found.
Now let \((t_j)_j\) be infinite. By construction, \((t_j)_j\) is a Cauchy sequence in the compact set \([0,\frac{\varepsilon }{\Vert \tilde{v} \Vert }]\), so it has to converge to some \(\bar{t} \in [0,\frac{\varepsilon }{\Vert \tilde{v} \Vert }]\). Additionally, since (8) is violated for the index i by assumption, we have
$$\begin{aligned} h_i(0) = 0 \quad \text {and} \quad h_i \left( \frac{\varepsilon }{\Vert \tilde{v} \Vert } \right) > 0. \end{aligned}$$
Let \((a_j)_j\) and \((b_j)_j\) be the sequences corresponding to a and b in Algorithm 1 (at the start of each iteration in step 2).
Assume that \(\bar{t} = 0\). Then, we must have \(h_i(t_j) \ge h_i(b_j)\) in step 4 for all \(j \in \mathbb {N}\). Thus,
$$\begin{aligned} h_i(t_j) \ge h_i(b_j) = h_i(t_{j-1}) \ge h_i(b_{j-1}) = \cdots = h_i(t_1) \ge h_i(b_1) = h_i \left( \frac{\varepsilon }{\Vert \tilde{v} \Vert } \right) > 0 \end{aligned}$$
for all \(j \in \mathbb {N}\). This contradicts the continuity of \(h_i\) at 0. So we must have \(\bar{t} > 0\). Additionally, \(\lim _{j \rightarrow \infty } b_j = \bar{t}\) and \((h_i(b_j))_j\) is non-decreasing, so
$$\begin{aligned} h_i(\bar{t}) = \lim _{j \rightarrow \infty } h_i(b_j) \ge h_i(b_1) = h_i \left( \frac{\varepsilon }{\Vert \tilde{v} \Vert } \right) . \end{aligned}$$
By construction we have \(h_i(a_j) < h_i(b_j)\) for all \(j \in \mathbb {N}\). Thus, by the mean value theorem, there has to be some \(r_j \in \ ]a_j,b_j[\) such that
$$\begin{aligned} 0 < h_i(b_j) - h_i(a_j) \in \langle \partial h_i(r_j), b_j - a_j \rangle = \partial h_i(r_j) (b_j - a_j). \end{aligned}$$
In particular, \(\lim _{j \rightarrow \infty } r_j = \bar{t}\) and, since \(a_j < b_j\), \(\partial h_i(r_j) \cap \mathbb {R}^{> 0} \ne \emptyset \) for all \(j \in \mathbb {N}\). By upper semicontinuity of \(\partial h_i\), there must be some \(\theta \in \partial h_i(\bar{t})\) with \(\theta \ge 0\). By the chain rule, we have
$$\begin{aligned} 0 \le \theta \in \partial h_i(\bar{t}) \subseteq \langle \tilde{v}, \partial f_i(x + \bar{t} \tilde{v}) \rangle + c \Vert \tilde{v} \Vert ^2. \end{aligned}$$
(12)
Thus, there must be some \(\xi \in \partial f_i(x + \bar{t} \tilde{v})\) with
$$\begin{aligned} 0 \le \langle \tilde{v}, \xi \rangle + c \Vert \tilde{v} \Vert ^2 \quad \Leftrightarrow \quad \langle \tilde{v}, \xi \rangle \ge - c \Vert \tilde{v} \Vert ^2. \end{aligned}$$
If there exists \(\xi ' \in \partial f_i(x + \bar{t} \tilde{v})\) such that this inequality is strict, i.e., such that \(\langle \tilde{v}, \xi ' \rangle > - c \Vert \tilde{v} \Vert ^2\), then b)(i) holds. Otherwise, if we have \(\langle \tilde{v}, \xi ' \rangle \le - c \Vert \tilde{v} \Vert ^2\) for all \(\xi ' \in \partial f_i(x + \bar{t} \tilde{v})\), then (12) implies \(\partial h_i(\bar{t}) \subseteq \mathbb {R}^{\le 0}\). This means that \(0 = \theta \in \partial h_i(\bar{t})\), so b)(ii) holds. \(\square \)
In the following remark, we will briefly discuss the implication of Lemma 3.4 for practical use of Algorithm 1.
Remark 3.1
Let \(i \in \{1,\ldots ,k\}\) and \((t_j)_j\) be as in Lemma 3.4. Assume that \((t_j)_j\) is infinite with limit \(\bar{t} \in \ ]0, \frac{\varepsilon }{\Vert \tilde{v} \Vert }]\). In the proof of Lemma 3.4, we showed that there is a sequence \((r_j)_j\) in \(]0,\frac{\varepsilon }{\Vert \tilde{v} \Vert }[\) with limit \(\bar{t}\) such that \(\partial h_i(r_j) \cap \mathbb {R}^{>0} \ne \emptyset \) for all \(j \in \mathbb {N}\). By the definition of the Clarke subdifferential, this implies that there is a sequence \((s_j)_j\) in \(]0,\frac{\varepsilon }{\Vert \tilde{v} \Vert }[\) with \(\lim _{j \rightarrow \infty } s_j = \bar{t}\) such that \(h_i\) is differentiable in \(s_j\) and \(h'_i(s_j) > 0\) for all \(j \in \mathbb {N}\). Due to the upper semicontinuity of the Clarke subdifferential, each \(s_j\) has an open neighborhood \(U_j\) with
$$\begin{aligned} \partial h_i(s) \subseteq \mathbb {R}^{>0} \quad \forall s \in U_j, j \in \mathbb {N}. \end{aligned}$$
As in the proof of Lemma 3.4, it follows that for all \(s \in U_j\) with \(j \in \mathbb {N}\), there is some \(\xi ' \in \partial f_i(x + s \tilde{v})\) satisfying (9). Thus, roughly speaking, even if we are in case b)(ii) of Lemma 3.4, there are open sets arbitrarily close to \(\bar{t}\) on which we can find new \(\varepsilon \)-subgradients (as in case b)(i)).
Motivated by the previous remark, we will from now on assume that Algorithm 1 stops after finitely many iterations and thus yields a new subgradient satisfying (9). We can use this method of finding new subgradients to construct an algorithm that computes descent directions of nonsmooth MOPs, namely Algorithm 2.
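To make the structure of such a direction-finding loop concrete, here is a condensed Python sketch. It is an illustration only and not the published Algorithm 2: the oracles `f[i]` and `subgrads[i]` as well as the routine `line_search` (which plays the role of Algorithm 1, e.g., the bisection sketch above) are assumed interfaces, and details such as the construction of the initial set may differ.

```python
import numpy as np
from scipy.optimize import minimize

def compute_descent_direction(f, subgrads, x, eps, delta, c, line_search, max_iter=100):
    """Maintain a finite set W of eps-subgradients, solve (7) over conv(W), and
    enrich W via line_search(i, x, v) for every objective violating (8).
    line_search is assumed to return a new eps-subgradient satisfying (9)."""
    W = [subgrads[i](x) for i in range(len(f))]      # one subgradient per objective
    for _ in range(max_iter):
        xi = np.array(W)
        Q = xi @ xi.T                                 # solve (7) as a QP over the simplex
        res = minimize(lambda lam: lam @ Q @ lam, np.full(len(W), 1.0 / len(W)),
                       jac=lambda lam: 2.0 * Q @ lam, bounds=[(0.0, 1.0)] * len(W),
                       constraints=({'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0},),
                       method='SLSQP')
        v = -xi.T @ res.x
        nv = np.linalg.norm(v)
        if nv <= delta:                               # x is treated as (almost) critical
            return v
        y = x + (eps / nv) * v
        I = [i for i in range(len(f)) if f[i](y) > f[i](x) - c * eps * nv]  # (8) violated
        if not I:
            return v                                  # acceptable descent direction
        W += [line_search(i, x, v) for i in I]        # enrich the approximation of F_eps(x)
    return v
```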
The following theorem shows that Algorithm 2 stops after a finite number of iterations and produces an acceptable descent direction (cf. (8)).
Theorem 3.1
Algorithm 2 terminates. In particular, if \(\tilde{v}\) is the last element of \((v_l)_l\), then either \(\Vert \tilde{v} \Vert \le \delta \) or \(\tilde{v}\) is an acceptable descent direction, i.e.,
$$\begin{aligned} f_i \left( x + \frac{\varepsilon }{\Vert \tilde{v} \Vert } \tilde{v} \right) \le f_i(x) - c \varepsilon \Vert \tilde{v} \Vert \quad \forall i \in \{1,...,k\}. \end{aligned}$$
Proof
Assume that Algorithm 2 does not terminate, i.e., \((v_l)_{l \in \mathbb {N}}\) is an infinite sequence. Let \(l > 1\) and \(j \in I_{l-1}\). Since \(\xi ^j_{l-1} \in W_l\) and \(-v_{l-1} \in {{\,\mathrm{conv}\,}}(W_{l-1}) \subseteq {{\,\mathrm{conv}\,}}(W_l)\), we have
$$\begin{aligned} \Vert v_l \Vert ^2&\le \Vert -v_{l-1} + s (\xi ^j_{l-1} + v_{l-1}) \Vert ^2 \nonumber \\&= \Vert v_{l-1} \Vert ^2 - 2 s \langle v_{l-1}, \xi ^j_{l-1} + v_{l-1} \rangle + s^2 \Vert \xi ^j_{l-1} + v_{l-1} \Vert ^2 \nonumber \\&= \Vert v_{l-1} \Vert ^2 - 2 s \langle v_{l-1}, \xi ^j_{l-1} \rangle - 2 s \Vert v_{l-1} \Vert ^2 + s^2 \Vert \xi ^j_{l-1} + v_{l-1} \Vert ^2 \end{aligned}$$
(13)
for all \(s \in [0,1]\). Since \(j \in I_{l-1}\), we must have
$$\begin{aligned} \langle v_{l-1}, \xi ^j_{l-1} \rangle > -c \Vert v_{l-1} \Vert ^2 \end{aligned}$$
(14)
by step 5. Let L be a common Lipschitz constant of all \(f_i\), \(i \in \{1,...,k\}\), on the closed \(\varepsilon \)-ball \(B_\varepsilon (x)\) around x. Then by [24], Prop. 2.1.2, and the definition of the \(\varepsilon \)-subdifferential, we must have \(\Vert \xi \Vert \le L\) for all \(\xi \in F_\varepsilon (x)\). So in particular,
$$\begin{aligned} \Vert \xi ^j_{l-1} + v_{l-1} \Vert \le 2 L. \end{aligned}$$
(15)
Combining (13) with (14) and (15) yields
$$\begin{aligned} \Vert v_l \Vert ^2&< \Vert v_{l-1} \Vert ^2 + 2 s c \Vert v_{l-1} \Vert ^2 - 2 s \Vert v_{l-1} \Vert ^2 + 4 s^2 L^2 \\&= \Vert v_{l-1} \Vert ^2 - 2 s (1-c) \Vert v_{l-1} \Vert ^2 + 4 s^2 L^2. \end{aligned}$$
Let \(s := \frac{1-c}{4 L^2} \Vert v_{l-1} \Vert ^2\). Since \(1 - c \in \ ]0,1[\) and \(\Vert v_{l-1} \Vert \le L\) we have \(s \in \ ]0,1[\). We obtain
$$\begin{aligned} \Vert v_l \Vert ^2&< \Vert v_{l-1} \Vert ^2 - 2 \frac{(1-c)^2}{4 L^2} \Vert v_{l-1} \Vert ^4 + \frac{(1-c)^2}{4 L^2} \Vert v_{l-1} \Vert ^4 \\&= \left( 1 - \frac{(1-c)^2}{4 L^2} \Vert v_{l-1} \Vert ^2 \right) \Vert v_{l-1} \Vert ^2. \end{aligned}$$
Since Algorithm 2 did not terminate, we have \(\Vert v_{l-1} \Vert > \delta \). It follows that
$$\begin{aligned} \Vert v_l \Vert ^2 < \left( 1 - \left( \frac{1-c}{2 L} \delta \right) ^2 \right) \Vert v_{l-1} \Vert ^2. \end{aligned}$$
Let \(r = 1 - \left( \frac{1-c}{2 L} \delta \right) ^2\). Note that we have \(\delta < \Vert v_l \Vert \le L\) for all \(l \in \mathbb {N}\), so \(r \in \ ]0,1[\). Additionally, r does not depend on l, so we have
$$\begin{aligned} \Vert v_l \Vert ^2< r \Vert v_{l-1} \Vert ^2< r^2 \Vert v_{l-2} \Vert ^2< \cdots < r^{l-1} \Vert v_1 \Vert ^2 \le r^{l-1} L^2. \end{aligned}$$
In particular, there is some l such that \(\Vert v_l \Vert \le \delta \), which is a contradiction. \(\square \)
Remark 3.2
The proof of Theorem 3.1 shows that for convergence of Algorithm 2, it would be sufficient to consider only a single \(j \in I_l\) in step 5. Similarly, for the initial approximation \(W_1\), a single element of \(\partial _\varepsilon f_i(x)\) for any \(i \in \{1,\ldots ,k\}\) would be enough. A modification of either step can potentially reduce the number of executions of step 5 (i.e., Algorithm 1) in Algorithm 2 when the \(\varepsilon \)-subdifferentials of multiple objective functions are similar. Nonetheless, we will restrict the discussion in this article to Algorithm 2 as stated, since both modifications also introduce a bias toward certain objective functions, which we want to avoid.
To highlight the strengths of Algorithm 2, we will consider an example where standard gradient sampling approaches can fail to obtain a useful descent direction.
Example 3.3
For \(a, b \in \mathbb {R}{\setminus } \{ 0 \}\) consider the locally Lipschitz function
$$\begin{aligned} f : \mathbb {R}^2 \rightarrow \mathbb {R}^2, \quad x \mapsto \begin{pmatrix} (x_1 - 1)^2 + (x_2 - 1)^2 \\ \left| x_2 - a | x_1 | \right| + b x_2 \end{pmatrix}. \end{aligned}$$
The set of nondifferentiable points is
$$\begin{aligned} \Omega _f = (\{ 0 \} \times \mathbb {R}) \cup \{ (\lambda , a | \lambda | )^\top : \lambda \in \mathbb {R}\}, \end{aligned}$$
separating \(\mathbb {R}^2\) into four smooth areas (cf. Fig. 4a). For large \(a > 0\), the two areas above the graph of \(\lambda \mapsto a | \lambda |\) become small, making it difficult to compute the subdifferential close to \((0,0)^\top \).
Let \(a = 10\), \(b = 0.5\), \(\varepsilon = 10^{-3}\) and \(x = (10^{-4}, 10^{-4})^\top \). In this case, \((0,0)^\top \) is the minimal point of \(f_2\) and
$$\begin{aligned} \partial _\varepsilon f_2(x)&= {{\,\mathrm{conv}\,}}\left\{ \begin{pmatrix} -a \\ b - 1 \end{pmatrix}, \begin{pmatrix} a \\ b + 1 \end{pmatrix}, \begin{pmatrix} a \\ b - 1 \end{pmatrix}, \begin{pmatrix} -a \\ b + 1 \end{pmatrix} \right\} \\&= {{\,\mathrm{conv}\,}}\left\{ \begin{pmatrix} -10 \\ -0.5 \end{pmatrix}, \begin{pmatrix} 10 \\ 1.5 \end{pmatrix}, \begin{pmatrix} 10 \\ -0.5 \end{pmatrix}, \begin{pmatrix} -10 \\ 1.5 \end{pmatrix} \right\} . \end{aligned}$$
In particular, \(0 \in \partial _\varepsilon f_2(x)\), so the descent direction from (5) with the exact \(\varepsilon \)-subdifferentials is zero. When applying Algorithm 2 in x, after two iterations we obtain
$$\begin{aligned} \tilde{v} = v_2 \approx (0.118, 1.185) \cdot 10^{-9}, \end{aligned}$$
i.e., \(\Vert \tilde{v} \Vert \approx 1.191 \cdot 10^{-11}\). Thus, x is correctly identified as (potentially) Pareto optimal. The final approximation \(W_2\) of \(F_\varepsilon (x)\) is
$$\begin{aligned} W_2 = \left\{ \xi ^1_1, \xi ^2_1, \xi ^2_2 \right\} = \left\{ \begin{pmatrix} -1.9998 \\ -1.9998 \end{pmatrix}, \begin{pmatrix} 10 \\ -0.5 \end{pmatrix}, \begin{pmatrix} -10 \\ 1.5 \end{pmatrix} \right\} . \end{aligned}$$
The first two elements of \(W_2\) are the gradients of \(f_1\) and \(f_2\) in x from the first iteration of Algorithm 2, and the last element is the gradient of \(f_2\) in the point \(x' = x + tv_1 = (0.038, 0.596)^\top \cdot 10^{-3} \in B_\varepsilon (x)\) from the second iteration. The result is visualized in Fig. 4.
Building on Algorithm 2, it is now straightforward to construct the descent method for locally Lipschitz continuous MOPs given in Algorithm 3. In step 4, we use the classical Armijo backtracking line search (cf. [4]) for the sake of simplicity. Note that it is well defined due to step 4 in Algorithm 2.
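A condensed sketch of the resulting descent loop is given below. It is again only an illustration, not the published Algorithm 3: the step length rule is one plausible reading of the backtracking used in the proof of Theorem 3.2, `direction` stands for a routine like the direction-finding sketch above, and the iteration cap is a practical safeguard introduced here.

```python
import numpy as np

def descent_method(f, x0, eps, delta, c, t0, direction, max_iter=1000):
    """Compute a direction with the Algorithm-2-style routine 'direction(x)',
    stop if its norm is at most delta, otherwise take an Armijo-type
    backtracking step whose length never falls below eps/||v||."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        v = direction(x)                       # e.g. compute_descent_direction(...)
        nv = np.linalg.norm(v)
        if nv <= delta:
            return x                           # x is (eps, delta)-critical
        armijo = lambda t: all(fi(x + t * v) <= fi(x) - c * t * nv ** 2 for fi in f)
        t = t0
        while t > eps / nv and not armijo(t):  # halve until sufficient decrease holds ...
            t *= 0.5
        t = max(t, eps / nv)                   # ... but never step shorter than eps/||v||,
        x = x + t * v                          #     which is covered by (8)/Theorem 3.1
    return x
```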
Since we introduced the two tolerances \(\varepsilon \) (for the \(\varepsilon \)-subdifferential) and \(\delta \) (as a threshold for when we consider \(\varepsilon \)-subgradients to be zero), we cannot expect that Algorithm 3 computes points which satisfy the optimality condition (1). This is why we introduce the following definition, similar to the definition of \(\varepsilon \)-stationarity from [11].
Definition 3.2
Let \(x \in \mathbb {R}^n\), \(\varepsilon > 0\) and \(\delta > 0\). Then, x is called \((\varepsilon , \delta )\)-critical, if
$$\begin{aligned} \min _{v \in -F_\varepsilon (x)} \Vert v \Vert \le \delta . \end{aligned}$$
It is easy to see that \((\varepsilon , \delta )\)-criticality is a necessary optimality condition for Pareto optimality, but a weaker one than (1). The following theorem shows that convergence in the sense of \((\varepsilon , \delta )\)-criticality is what we can expect from our descent method.
Theorem 3.2
Let \((x_j)_j\) be the sequence generated by Algorithm 3. Then, either the sequence \((f_i(x_j))_j\) is unbounded below for each \(i \in \{1,\ldots ,k\}\), or \((x_j)_j\) is finite with the last element being \((\varepsilon ,\delta )\)-critical.
Proof
Assume that \((x_j)_j\) is infinite. Then, \(\Vert v_j \Vert > \delta \) for all \(j \in \mathbb {N}\). Let \(j \in \mathbb {N}\). If \(\bar{t} = 2^{-\bar{s}} t_0 \ge \frac{\varepsilon }{\Vert v_j \Vert }\) in step 4 of Algorithm 3, then
$$\begin{aligned} f_i(x_j + \bar{t} v_j) - f_i(x_j)&= f_i(x_j + 2^{-\bar{s}} t_0 v_j) - f_i(x_j) \\&\le - 2^{-\bar{s}} t_0 c \Vert v_j \Vert ^2 \le - c \varepsilon \Vert v_j \Vert< -c \varepsilon \delta < 0 \end{aligned}$$
for all \(i \in \{1,\ldots ,k\}\). If instead \(\bar{t} = \frac{\varepsilon }{\Vert v_j \Vert }\) in step 4, then the same inequality holds due to Theorem 3.1. This implies that \((f_i(x_j))_j\) is unbounded below for each \(i \in \{1,\ldots ,k\}\).
Now assume that \((x_j)_j\) is finite, with \(\bar{x}\) and \(\bar{v}\) being the last elements of \((x_j)_j\) and \((v_j)_j\), respectively. Since the algorithm stopped, we must have \(\Vert \bar{v} \Vert \le \delta \). From the application of Algorithm 2 in step 2, we know that there must be some \(\overline{W} \subseteq F_\varepsilon (\bar{x})\) such that \(\bar{v} = \mathop {\mathrm{arg min}}\limits _{v \in -{{\,\mathrm{conv}\,}}(\overline{W})} \Vert v \Vert ^2\). This implies
$$\begin{aligned} \min _{v \in -F_\varepsilon (\bar{x})} \Vert v \Vert \le \min _{v \in -{{\,\mathrm{conv}\,}}(\overline{W})} \Vert v \Vert = \Vert \bar{v} \Vert \le \delta , \end{aligned}$$
which completes the proof. \(\square \)
Finally, before considering numerical examples in the following section, we will discuss the influence of the parameters of Algorithm 3 on its results and performance:
- The tolerance \(\varepsilon > 0\) is the radius of the closed ball that we use for the \(\varepsilon \)-subdifferential (cf. Definition 3.1). On the one hand, by definition, we can only expect Algorithm 3 to compute the actual Pareto critical set up to a distance of \(\varepsilon \), which motivates choosing \(\varepsilon \) relatively small. On the other hand, \(\varepsilon \) controls how early the algorithm notices nondifferentiable points, so a small \(\varepsilon \) means that the algorithm might take longer until the nondifferentiability of the objective vector is detected. These two properties should be kept in mind when choosing \(\varepsilon \) in practice.
- The tolerance \(\delta > 0\) is the threshold for when we consider the norm of the descent direction (i.e., of a convex combination of \(\varepsilon \)-subgradients) to be zero. The smaller we choose \(\delta \), the closer the result of Algorithm 3 will be to the Pareto critical set; however, more iterations are needed, so it also takes longer until the algorithm stops.
- The parameters \(c \in \ ]0,1[\) and \(t_0 > 0\) are used for the step length, where \(t_0\) is the initial step length and c is the fraction of the predicted descent (based on the \(\varepsilon \)-subgradients) that we require as actual descent. The closer c is chosen to 1, the steeper the descent of the computed direction will be and the closer the direction computed by Algorithm 2 will be to the theoretical descent direction from (5). However, this also increases the number of iterations required in Algorithm 2 and therefore the runtime of Algorithm 3.
Examples for the choice of these parameters can be found in the following section. Furthermore, we will introduce a modification of Algorithm 3 with a dynamic tolerance \(\varepsilon \).