1 Introduction

In optimization, regularization is one of the basic tools for dealing with irregular solutions. For an objective function \(f : {\mathbb {R}}^n \rightarrow {\mathbb {R}}\), the idea is to add a regularization term \(g : {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) to f which enforces regularity, and to weight g with a regularization parameter \(\lambda \ge 0\) to control the extent to which this regularity is enforced. So instead of optimizing f, the regularized problem

$$\begin{aligned} \min _{x \in {\mathbb {R}}^n} f(x) + \lambda g(x) \end{aligned}$$

with \(\lambda \ge 0\) is solved. For \(\lambda = 0\) the original problem is recovered. Increasing \(\lambda \) leads to successively more regular solutions, at the cost of an increased objective value of f.

Depending on the application, the term “regularity” above can have many different meanings: In sparse regression, regularity of the solution means sparsity, and a prominent example of a regularization term is the \(\ell ^1\)-norm [1, 2]. In hyperplane separation for data classification (also known as support-vector machines), regularity is related to robustness of the derived classifier, and a possible regularization term can be derived from the scalar product of the data points with the hyperplane (known as the hinge loss) [2, 3]. In image denoising, regularity means the absence of noise in the reconstructed image, which can be measured using the total variation [4]. In (exact) penalty methods for constrained optimization problems, regularity refers to feasibility, and the sum of the individual constraint violations can be used as a regularization term [5, 6]. Finally, in deep learning, regularization is used to avoid overfitting, which is usually related to the \(\ell ^2\)- or \(\ell ^1\)-norm of the weights [3, 7].

Clearly, the choice of the regularization parameter \(\lambda \) has a large impact on the solution of the regularized problem. If \(\lambda \) is chosen too small, then solutions are almost optimal for f but irregular. If it is chosen too large, then solutions are highly regular but have an unacceptably large objective value with respect to f. One way of dealing with this issue is to not only compute a regularized solution for a single \(\lambda \), but to compute the entire so-called regularization path R, which is the set of all regularized solutions for all \(\lambda \ge 0\). The properties and features of R (e.g., knee points [8]) can then be used to better choose a desirable solution. Obviously, simply solving the regularized problem for many \(\lambda \ge 0\) to obtain a discretization of R is inefficient. Instead, so-called path-following methods (also known as continuation methods, homotopy methods or predictor–corrector methods) can be used, which iteratively compute new points on the regularization path close to already known points until the complete path is explored. By exploiting the smoothness properties of the path, the computation of each new point tends to be cheap. For the development of such methods, it is crucial to have a good understanding of the structure of the regularization path. In [9, 10], it was shown that for sparse regression, the regularization path R is piecewise linear and a path-following method was proposed for its computation. Similar results were shown in [11] for support-vector machines. In a more general setting in [12], it was shown that if f is piecewise quadratic and g is piecewise linear, then R is always piecewise linear. In case of the exact penalty method in constrained optimization, it was shown in [13] that if the constrained problem is convex (and the equality constraints are affinely linear), then R is piecewise smooth. Recently, in [14], the structure of the regularization path was analyzed for the case where f is twice continuously differentiable and g is the \(\ell ^1\)-norm, with the results suggesting that R is piecewise smooth.

The goal of this article is to analyze the structure of the regularization path in a more general setting. Note that in the applications above, we have the pattern that f is always smooth while g is always nonsmooth. Thus, in this article, we will also assume that f is smooth. For g, we will assume that it is merely piecewise differentiable (as defined in [15]). Compared to weaker assumptions in nonsmooth analysis like local Lipschitz continuity, this has the advantage that the Clarke subdifferential of g is easy to compute and that the set of nonsmooth points of g can essentially be described as a level set of certain smooth functions. Since all of the regularization terms in the above applications (except for the \(\ell ^2\)-norm) are in fact piecewise differentiable, our setting generalizes many of the existing approaches. We will analyze the structure of R by approximating it with the critical regularization path \(R_c\), which is based on the first-order optimality conditions of the regularized problem, and then identifying sufficient conditions for \(R_c\) to be smooth around a given point. More precisely, our main result will be that if these conditions are met, then \(R_c\) is locally the projection of a higher-dimensional smooth manifold onto \({\mathbb {R}}^n\) (cf. Theorem 2). In particular, all points violating these conditions are potential “kinks” (or “nonsmooth points”) of \(R_c\). Depending on which condition is violated, this allows for a classification of nonsmooth features of the regularization path. Furthermore, the nature of our sufficient conditions suggests that \(R_c\) (and R) is still piecewise smooth.

From a theoretical point of view, the core idea of this article is the application of the level set theorem (cf. [16], Theorem 5.12) to a smooth function h whose projected zero level set locally coincides with \(R_c\). For h to be smooth, we have to carefully construct it by considering the so-called smooth selection functions that g consists of. Compared to the previous results in [9,10,11,12,13], this general approach has to be followed since, apart from smoothness, no other properties of the selection functions can be exploited. For the case where g is the \(\ell ^1\)-norm, this methodology reduces to the approach in [14], which is significantly easier to handle due to the simplicity of the \(\ell ^1\)-norm. In the more general case that is considered in this article, more care has to be taken when working with the selection functions.

The remainder of this article is structured as follows. In Sect. 2, we begin by briefly introducing the basic concepts that we use in our theoretical results (a more detailed introduction can be found in the electronic supplementary material). Besides piecewise differentiability, these are multiobjective optimization and affine geometry. The former can be used to obtain an (almost) equivalent formulation of the regularization problem as a multiobjective optimization problem, while the latter is required for working with the subdifferential of g. In Sect. 3, we will analyze the structure of the regularization path R. We will do this by expressing \(R_c\) as a union of intersections of certain sets, whose structure we can analyze by applying standard results from differential geometry. We will also formulate an abstract algorithm for a path-following method based on our results. In Sect. 4, we will apply our results to two problem classes, which are support-vector machines and the exact penalty method. Finally, we draw a conclusion and discuss possible future work in Sect. 5.

2 Basic concepts

In this section, we will introduce the basic ideas of piecewise differentiable functions, multiobjective optimization and affine geometry. As these topics may not be common in the optimization community, we also compiled a more detailed introduction with the specific results that we use in this article and included it in the electronic supplementary material.

For the regularization term we will assume piecewise differentiability [15] in the following sense.

Definition 1

Let \(U \subseteq {\mathbb {R}}^n\) be open. Let \(g : U \rightarrow {\mathbb {R}}\) be continuous and \(g_i : U \rightarrow {\mathbb {R}}\), \(i \in \{1,\dots ,k\}\), be a set of r-times continuously differentiable (or \(C^r\)) functions for \(r \in {\mathbb {N}}\cup \{\infty \}\). If \(g(x) \in \{ g_1(x), \dots , g_k(x) \}\) for all \(x \in U\), then g is piecewise r-times differentiable (or a \(PC^r\)-function). In this case, \(\{ g_1, \dots , g_k \}\) is called a set of selection functions of g.

If \(g : U \rightarrow {\mathbb {R}}\) is a \(PC^r\)-function with selection functions \(\{g_1, \ldots , g_k\}\), then the Clarke subdifferential [17] of g is given by

$$\begin{aligned} \partial g(x) = \mathop {\textrm{conv}}\limits (\{ \nabla g_i(x) : i \in I^e(x) \}) \quad \forall x \in U, \end{aligned}$$
(1)

where \(\mathop {\textrm{conv}}\limits ( \cdot )\) denotes the convex hull and \(I^e(x)\) is the set of essentially active selection functions in x. In particular, \(\partial g(x)\) is a polytope and, assuming the essentially active selection functions are known, easy to compute.
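To make (1) concrete, the following minimal Python sketch computes the vertex set \(\{ \nabla g_i(x) : i \in I^e(x) \}\) of \(\partial g(x)\) for \(g(x) = \Vert x \Vert _1\), written as the maximum of four affine selection functions (the same selection functions reappear in Example 1 below). For this particular g the active and essentially active sets coincide, so the maximizing indices can be used directly; the code is only an illustrative sketch, not part of the theory.

```python
import numpy as np

# Selection functions of g(x) = ||x||_1 = max(x1+x2, x1-x2, -x1+x2, -x1-x2);
# the gradient of each affine selection function is its sign pattern.
SIGNS = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)

def essentially_active(x, tol=1e-12):
    """Indices of the maximizing selection functions; for this g they are
    exactly the essentially active ones."""
    vals = SIGNS @ x
    return np.flatnonzero(vals >= vals.max() - tol)

def subdifferential_vertices(x):
    """Gradients of the essentially active selection functions; their convex
    hull is the Clarke subdifferential (1)."""
    return SIGNS[essentially_active(x)]

print(subdifferential_vertices(np.array([0.5, 0.0])))  # rows (1,1) and (1,-1)
print(subdifferential_vertices(np.array([0.0, 0.0])))  # all four sign patterns
```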

For the derivation of our theoretical results, we will interpret regularization problems as multiobjective optimization problems (MOPs) [18,19,20]. For general functions \(f : {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) and \(g : {\mathbb {R}}^n \rightarrow {\mathbb {R}}\), the MOP minimizing f and g is denoted by

$$\begin{aligned} \min _{x \in {\mathbb {R}}^n} \begin{pmatrix} f(x) \\ g(x) \end{pmatrix} \end{aligned}$$

and its solution is defined in the following.

Definition 2

A point \(x \in {\mathbb {R}}^n\) is called Pareto optimal if there is no \(y \in {\mathbb {R}}^n\) with

$$\begin{aligned} f(y)< f(x) \text { and } g(y) \le g(x) \quad \quad \text {or} \quad \quad f(y) \le f(x) \text { and } g(y) < g(x). \end{aligned}$$

The set of all Pareto optimal points is the Pareto set. Its image under the objective vector \((f,g)\), i.e., the set \(\{ (f(x),g(x))^\top : x \text { is Pareto optimal} \} \subseteq {\mathbb {R}}^2\), is the Pareto front.

If both f and g are at least locally Lipschitz continuous and x is Pareto optimal, then

$$\begin{aligned} 0 \in \mathop {\textrm{conv}}\limits (\partial f(x) \cup \partial g(x)) \end{aligned}$$
(2)

or, equivalently,

$$\begin{aligned} \exists \alpha _1, \alpha _2 \ge 0, \xi ^1 \in \partial f(x), \xi ^2 \in \partial g(x) : \alpha _1 \xi ^1 + \alpha _2 \xi ^2 = 0, \alpha _1 + \alpha _2 = 1. \end{aligned}$$
(3)

Points that satisfy this optimality condition are called Pareto critical and the set of all such points is the Pareto critical set \(P_c\). The quantities \(\alpha _1\) and \(\alpha _2\) in (3) will be referred to as KKT multipliers of f and g in x, respectively.
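Numerically, checking this optimality condition is a small linear feasibility problem: with \(\partial f(x) = \{ \nabla f(x) \}\) for smooth f and \(\partial g(x)\) given by its vertices from (1), condition (2) asks for a vanishing convex combination of finitely many vectors. The following Python sketch (one possible implementation, using scipy.optimize.linprog) performs this check; the data in the example are \(f(x) = (x_1-2)^2 + (x_2-1)^2\) and \(g = \Vert \cdot \Vert _1\) at \(x = (0.5,0)^\top \), which reappear in Example 1 below.

```python
import numpy as np
from scipy.optimize import linprog

def is_pareto_critical(grad_f, subdiff_vertices):
    """Check condition (2): 0 in conv({grad f(x)} union subdiff g(x)), where
    the subdifferential of g is given by its vertices (cf. (1))."""
    W = np.vstack([grad_f, subdiff_vertices])   # rows: generators of the convex hull
    m, n = W.shape
    # feasibility LP:  W^T lam = 0,  sum(lam) = 1,  lam >= 0
    A_eq = np.vstack([W.T, np.ones((1, m))])
    b_eq = np.concatenate([np.zeros(n), [1.0]])
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * m, method="highs")
    return res.success

x = np.array([0.5, 0.0])
grad_f = 2 * (x - np.array([2.0, 1.0]))         # f(x) = ||x - (2,1)||^2
V = np.array([[1.0, 1.0], [1.0, -1.0]])         # vertices of the l1-norm subdifferential at x
print(is_pareto_critical(grad_f, V))            # True
```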

Finally, the structure of the condition (3) will make it possible to use affine geometry [21,22,23] to relate properties of the Pareto critical set \(P_c\) to properties of the subdifferentials \(\partial f(x)\) and \(\partial g(x)\).

Definition 3

  1. (a)

    Let \(k \in {\mathbb {N}}\) and \(a^i \in {\mathbb {R}}^n\), \(i \in \{1,\dots ,k\}\). Let \(\lambda \in {\mathbb {R}}^k\) with \(\sum _{i = 1}^k \lambda _i = 1\). Then \(\sum _{i = 1}^k \lambda _i a^i\) is an affine combination of \(\{ a^1, \dots , a^k \}\).

  2. (b)

    Let \(E \subseteq {\mathbb {R}}^n\). Then \(\textrm{aff}(E)\) is the set of all affine combinations of elements of E, called the affine hull of E. Formally,

    $$\begin{aligned} \textrm{aff}(E) := \left\{ \sum _{i = 1}^k \lambda _i a^i : k \in {\mathbb {N}}, a^i \in E, \lambda _i \in {\mathbb {R}}, i \in \{1,\dots ,k\}, \sum _{i = 1}^k \lambda _i = 1 \right\} . \end{aligned}$$
  3. (c)

    Let \(E \subseteq {\mathbb {R}}^n\). If \(\textrm{aff}(E) = E\), then E is called an affine space.

Analogously to linear algebra, it is possible to define the affine dimension \(\mathop {\textrm{affdim}}\limits (A)\) and affine bases of an affine space A. An important result about affine spaces (and convex sets) is Carathéodory’s theorem:

Theorem 1

Let A be a finite subset of \({\mathbb {R}}^n\). Then every element in \(\mathop {\textrm{conv}}\limits (A)\) can be written as a convex combination of \(\mathop {\textrm{affdim}}\limits (\textrm{aff}(A)) + 1\) elements of A.

3 The structure of the regularization path

Let \(f : {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be continuously differentiable and \(g : {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be \(PC^1\). For a regularization parameter \(\lambda \ge 0\), consider the parameter-dependent problem

$$\begin{aligned} \min _{x \in {\mathbb {R}}^n} f(x) + \lambda g(x). \end{aligned}$$
(4)

The set

$$\begin{aligned} R := \left\{ {\bar{x}} \in {\mathbb {R}}^n : \exists \lambda \ge 0 \ \text {with} \ {\bar{x}} \in \mathop {\mathrm {arg\,min}}\limits _{x \in {\mathbb {R}}^n} f(x) + \lambda g(x) \right\} \end{aligned}$$
(5)

is known as the regularization path of (4) [11, 24, 25] and the goal of this article is to analyze its structure.

We will do this by not analyzing R directly, but by analyzing the (potentially larger) set that is defined by the first-order optimality condition of (4): If \({\bar{x}}\) is a solution of (4) for some \(\lambda \ge 0\), then it is a critical point of \(f + \lambda g\), i.e., \(0 \in \partial (f + \lambda g)({\bar{x}})\) (cf. Theorem 4.1 in [6]). This is the motivation for defining the critical regularization path

$$\begin{aligned} R_c := \left\{ {\bar{x}} \in {\mathbb {R}}^n : \exists \lambda \ge 0 \ \text {with} \ 0 \in \partial (f + \lambda g)({\bar{x}}) \right\} . \end{aligned}$$
(6)

In general we have \(R \subseteq R_c\). If \(f + \lambda g\) is convex (e.g., if both f and g are convex), then criticality is sufficient for optimality (cf. Theorem 4.2 in [6]), so \(R = R_c\). For example, this is the case for the Lasso problem [1] (where f contains some least squares error and g is the \(\ell ^1\)-norm) and total variation denoising [4] (where f contains some least squares error and g is the total variation). The extent to which structural results about \(R_c\) apply to R in the general nonconvex case will be discussed in Remark 5.

Our main result in this section will be that \(R_c\) has a piecewise smooth structure. More precisely, we will derive five conditions (Assumptions A1 to A5) for a point \(x^0 \in R_c\) which, when combined, ensure that locally around \(x^0\), \(R_c\) is the projection of a smooth manifold from a higher-dimensional space onto \({\mathbb {R}}^n\). In turn, these assumptions allow for a classification of kinks of \(R_c\) by checking which assumption is violated. Throughout this article, we will use the term kinks to loosely refer to points in \(R_c\) around which \(R_c\) is not a smooth manifold.

In order to analyze the structure of \(R_c\), we first show that \(R_c\) is related to the Pareto critical set \(P_c\) of the MOP

$$\begin{aligned} \min _{x \in {\mathbb {R}}^n} \begin{pmatrix} f(x) \\ g(x) \end{pmatrix}. \end{aligned}$$
(7)

More precisely, we have the following lemma.

Lemma 1

It holds:

  1. (a)

    \(R_c = \{ {\bar{x}} \in {\mathbb {R}}^n : \exists \xi \in \partial g({\bar{x}}), \alpha _1 > 0, \alpha _2 \ge 0 \ \text {with} \ \alpha _1 \nabla f({\bar{x}}) + \alpha _2 \xi = 0 \ \text {and} \ \alpha _1 + \alpha _2 = 1 \} \subseteq P_c\).

  2. (b)

    \(R_c \cup \{ x \in {\mathbb {R}}^n : 0 \in \partial g(x) \} = P_c\).

Proof

  1. (a)

    Since f is continuously differentiable we have \(\partial f(x) = \{ \nabla f(x) \}\) for all \(x \in {\mathbb {R}}^n\). Furthermore, from basic calculus for subdifferentials (cf. Corollary 1 in [17], Section 2.3) it follows that \({\bar{x}} \in R_c\) is equivalent to

    $$\begin{aligned} \begin{aligned}&\exists \lambda \ge 0 : 0 \in \partial (f + \lambda g)({\bar{x}}) = \partial f({\bar{x}}) + \lambda \partial g({\bar{x}}) = \nabla f({\bar{x}}) + \lambda \partial g({\bar{x}}) \\&\quad \Leftrightarrow \ \exists \lambda \ge 0 : 0 \in \frac{1}{1 + \lambda } \nabla f({\bar{x}}) + \frac{\lambda }{1 + \lambda } \partial g({\bar{x}}) \\&\quad \Leftrightarrow \ \exists \xi \in \partial g({\bar{x}}), \lambda \ge 0 : \frac{1}{1 + \lambda } \nabla f({\bar{x}}) + \frac{\lambda }{1 + \lambda } \xi = 0 \\&\quad \Leftrightarrow \ \exists \xi \in \partial g({\bar{x}}), \alpha _1 > 0, \alpha _2 \ge 0 : \alpha _1 \nabla f({\bar{x}}) + \alpha _2 \xi = 0 \ \text {and} \ \alpha _1 + \alpha _2 = 1. \end{aligned} \end{aligned}$$
    (8)

    By (3) this implies \({\bar{x}} \in P_c\).

  2. (b)

    Due to (a) we only have to show the implication “\(\supseteq \)”, so let \({\bar{x}} \in P_c\). By (3) there are \(\xi \in \partial g({\bar{x}})\) and \(\alpha _1, \alpha _2 \ge 0\) with \(\alpha _1 + \alpha _2 = 1\) and \(\alpha _1 \nabla f({\bar{x}}) + \alpha _2 \xi = 0\). If \(\alpha _1 = 0\) then \(\alpha _2 = 1\), so \(0 = \xi \in \partial g({\bar{x}})\). Otherwise, \(\alpha _1 > 0\) and from (8) it follows that \({\bar{x}} \in R_c\) (with \(\lambda =\frac{\alpha _2}{\alpha _1}\)). \(\square \)

By the previous lemma, \(R_c\) and \(P_c\) coincide up to critical points of g in which all KKT multipliers corresponding to f are zero. Roughly speaking, these points correspond to “\(\lambda = \infty \)” in (4).

Remark 1

It is important to note that Lemma 1 does not imply that critical points of g are not contained in \(R_c\), i.e., that \(R_c \cap \{ x \in {\mathbb {R}}^n : 0 \in \partial g(x) \} = \emptyset \). For example, if \(0 \in \mathop {\textrm{int}}\limits (\partial g(x))\), then it is possible to show that there is some \({\bar{\lambda }}\) with \(0 \in \partial (f + \lambda g)(x)\) for all \(\lambda \ge {\bar{\lambda }}\).

By Lemma 1, structural results about Pareto critical sets can be used to analyze the structure of the critical regularization path \(R_c\). For example, under some mild regularity assumptions on f and g, Theorem 5.1 in [26] shows that in areas where g is (twice continuously) differentiable, the set of Pareto critical points with non-vanishing KKT multipliers is the projection of a 1-dimensional manifold from \({\mathbb {R}}^{n+2}\) onto \({\mathbb {R}}^n\). To derive our main result, we will extend the ideas in [26] to the whole Pareto critical set up to certain kinks.

We begin by taking a closer look at the Pareto critical set \(P_c\) of (7). By definition, \(P_c\) is characterized by the optimality condition (2). Since f is continuously differentiable and g is \(PC^1\), the subdifferential of f is simply its gradient, and the subdifferential of g is the convex hull of all essentially active selection functions (cf. (1)). Thus, for a fixed \(x \in {\mathbb {R}}^n\), (2) is equivalent to the existence of a vanishing convex combination of a finite number of elements. This is the same type of condition as in the smooth case, except that there is now no continuous dependency of these elements on x. Furthermore, the number of elements is not constant. Nonetheless, by iterating over all possible essentially active sets, \(P_c\) can at least be written as the union of sets that behave similarly to Pareto critical sets in the smooth case. Let \(\{ g_1, \dots , g_k \}\) be a set of selection functions of g. Then formally, these considerations lead to the following decomposition of \(P_c\):

$$\begin{aligned} P_c&= \{ x \in {\mathbb {R}}^n : 0 \in \mathop {\textrm{conv}}\limits (\{ \nabla f(x) \} \cup \partial g(x) ) \} \nonumber \\&= \{ x \in {\mathbb {R}}^n : 0 \in \mathop {\textrm{conv}}\limits (\{ \nabla f(x) \} \cup \{ \nabla g_i(x) : i \in I^e(x) \} ) \} \nonumber \\&= \bigcup _{I \subseteq \{1,\dots ,k\}} P_c^I \cap \Omega ^I, \end{aligned}$$
(9)

where

$$\begin{aligned} \begin{aligned} P_c^I&:= \{ x \in {\mathbb {R}}^n : 0 \in \mathop {\textrm{conv}}\limits (\{ \nabla f(x) \} \cup \{ \nabla g_i(x) : i \in I \} ) \}, \\ \Omega ^I&:= \{ x \in {\mathbb {R}}^n : I^e(x) = I \}. \end{aligned} \end{aligned}$$
(10)

In words, \(P_c^I\) is the Pareto critical set of the (smooth) MOP with objective vector \((f,g_{i_1}, \dots , g_{i_{|I|}})^\top \) (for \(I = \{i_1, \ldots , i_{|I|} \}\)) and \(\Omega ^I\) is the set of points in \({\mathbb {R}}^n\) in which precisely the selection functions with an index in I are essentially active. Thus, (9) expresses \(P_c\) as the union of Pareto critical sets of smooth MOPs that are intersected with the sets of points with constant essentially active sets. A visualization of this decomposition is shown in the following example.

Example 1

Consider problem (7) for \(f : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}\), \(x \mapsto (x_1 - 2)^2 + (x_2 - 1)^2\), and

$$\begin{aligned}&g_1 : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}, \quad x \mapsto x_1 + x_2, \\&g_2 : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}, \quad x \mapsto x_1 - x_2, \\&g_3 : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}, \quad x \mapsto -x_1 + x_2, \\&g_4 : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}, \quad x \mapsto -x_1 - x_2, \\&g : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}, \quad x \mapsto \max (\{g_1(x),g_2(x),g_3(x),g_4(x)\}) = \Vert x \Vert _1. \end{aligned}$$

It is possible to show that the Pareto critical (and in this case Pareto optimal) set is given by

$$\begin{aligned} P_c&= \{ (0,0)^\top \} \cup ((0,1] \times \{ 0 \}) \cup \{ x \in {\mathbb {R}}^2: x_1 \in (1,2], x_2 = x_1 - 1 \} \\&= (P_c^{\{ 1,2,3,4 \}} \cap \Omega ^{\{ 1,2,3,4 \}}) \cup (P_c^{\{ 1,2 \}} \cap \Omega ^{\{ 1,2 \}}) \cup (P_c^{\{ 1 \}} \cap \Omega ^{\{ 1 \}}) . \end{aligned}$$

Figure 1 shows the decomposition of \(P_c\) into the sets \(P_c^I \cap \Omega ^I\) as in (9).

Fig. 1  Decomposition of \(P_c\) into the sets \(P_c^I \cap \Omega ^I\) as in (9)
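As a numerical complement to Example 1, the segment \((0,1] \times \{ 0 \}\) of \(P_c\) can be verified directly: for \(x = (t,0)^\top \) with essentially active set \(\{1,2\}\), the criticality condition (3) reduces to a \(3 \times 3\) linear system for the multipliers. The following sketch (assuming numpy; the function name is illustrative) solves this system and checks the sign conditions.

```python
import numpy as np

# Example 1 on the segment x = (t, 0): f(x) = (x1-2)^2 + (x2-1)^2 and the
# essentially active selection functions g1, g2 with gradients (1,1), (1,-1).
def kkt_multipliers(t):
    grad_f = np.array([2.0 * t - 4.0, -2.0])
    grads = np.column_stack([grad_f, [1.0, 1.0], [1.0, -1.0]])
    A = np.vstack([grads, np.ones(3)])   # alpha*grad f + b1*grad g1 + b2*grad g2 = 0, sum = 1
    return np.linalg.solve(A, np.array([0.0, 0.0, 1.0]))

for t in [0.25, 0.5, 1.0]:
    alpha, b1, b2 = kkt_multipliers(t)
    # alpha > 0 and b1, b2 >= 0, so (t,0) lies in P_c^{1,2}; at t = 1 the
    # coefficient of grad g2 is exactly zero.
    print(t, alpha, b1, b2)
```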

We will analyze the piecewise smooth structure of \(P_c\) via (9) by first analyzing \(\Omega ^I\), then the intersection \(P_c^I \cap \Omega ^I\) and finally the union over all \(P_c^I \cap \Omega ^I\). Furthermore, as we expect \(P_c\) to possess kinks, we will only consider its local structure around a given point. In other words, for \(x^0 \in P_c\), we will only consider the structure of \(P_c \cap U\) for open neighborhoods \(U \subseteq {\mathbb {R}}^n\) of \(x^0\).

The strategy for our analysis in this section is to derive assumptions for \(x^0\) which are sufficient for \(P_c\) to have a smooth structure locally around \(x^0\). These assumptions represent different sources and types of nonsmoothness of \(P_c\) and will allow for a classification of nonsmooth points.

3.1 The structure of \(\Omega ^I\)

By definition, the set \(\Omega ^I\) only depends on g. For \(I = \{i\} \subseteq \{1,\ldots ,k\}\), \(\Omega ^{\{ i \}}\) is the set of points where only the selection function \(g_i\) is essentially active. From Lemma SM1 it follows that \(\Omega ^{\{ i \}}\) is an open subset of \({\mathbb {R}}^n\) in this case. For \(I \subseteq \{1,\ldots ,k\}\) with \(|I|> 1\), \(\Omega ^I\) is the set of points where precisely the selection functions corresponding to the elements of I are essentially active. Typically (but not necessarily), these are points where g is nonsmooth, which by Rademacher’s Theorem ( [27], Theorem 3.2) form a null set. In the following, we will analyze its structure.

Since we are only interested in the structure of \(\Omega ^I\) in a local sense, we also only have to consider restrictions \(g|_U\) of g to open neighborhoods of a point \(x^0 \in {\mathbb {R}}^n\). In terms of the open neighborhood U of \(x^0\) and the set of selection functions of \(g|_U\), we introduce the following assumption:

Assumption A1

For \(x^0 \in {\mathbb {R}}^n\) there is an open neighborhood \(U \subseteq {\mathbb {R}}^n\) of \(x^0\) and a set of selection functions \(\{ g_1, \dots , g_k \}\) of \(g|_U\) such that

  1. (i)

    \(I(x^0) = \{1,\dots ,k\}\),

  2. (ii)

    \(I^e(x) = I(x) \quad \forall x \in U\),

  3. (iii)

    \(\mathop {\textrm{affdim}}\limits (\textrm{aff}(\{\nabla g_i(x) : i \in \{1,\dots ,k \} \})) = \mathop {\textrm{affdim}}\limits (\textrm{aff}(\{ \nabla g_i(x^0) : i \in \{1,\dots ,k \} \})) \ \forall x \in U\).

Assumption A1 can be interpreted as follows: A1(i) ensures that all selection functions we consider are actually relevant for the representation of g in U. The condition A1(ii) ensures that it does not matter if we consider the active or the essentially active set in U, which allows for an easier representation of \(\Omega ^I\). Finally, A1(iii) makes sure that the representation of \(\partial g(x^0)\) via the gradients of our selection functions is “stable” on U with respect to its affine dimension.

In the following, we will discuss the restrictiveness of Assumption A1. By (SM1), A1(i) can always be satisfied by choosing U sufficiently small. For A1(ii) and (iii), we consider the following example.

Example 2

  1. (a)

    Let

    $$\begin{aligned}&g_1 : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}, \quad x \mapsto x_2^2 - x_1, \\&g_2 : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}, \quad x \mapsto {\left\{ \begin{array}{ll} x_1^2 - x_1, &{}\quad x_1 \le 0, \\ -x_1, &{}\quad x_1 > 0, \end{array}\right. } \\&g : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}, \quad x \mapsto \max (\{ g_1(x), g_2(x) \}). \end{aligned}$$

    Then g is \(PC^1\) with selection functions \(g_1\) and \(g_2\). The graph and the level sets of g are shown in Fig. 2. For the activity of \(g_2\) we have

    $$\begin{aligned} 2 \in I(x) \ \Leftrightarrow \ g(x) = g_2(x) \ \Leftrightarrow \ {\left\{ \begin{array}{ll} x_2 \in [x_1,-x_1], &{}\quad x_1 \le 0, \\ x_2 = 0, &{}\quad x_1 > 0, \end{array}\right. } \end{aligned}$$

    and

    $$\begin{aligned} 2 \in I^e(x) \ {}&\Leftrightarrow \ x \in \mathop {\textrm{cl}}\limits (\mathop {\textrm{int}}\limits (\{ y \in {\mathbb {R}}^2 : g(y) = g_2(y) \} )) \\ {}&\Leftrightarrow \ x_1 \le 0,\ x_2 \in [x_1,-x_1]. \end{aligned}$$

    Thus, for any open neighborhood \(U \subseteq {\mathbb {R}}^2\) of \(x^0 = (0,0)^\top \), there is some \(x \in U\) with \(I^e(x) \ne I(x)\). In other words, A1(ii) does not hold in \(x^0\) for this set of selection functions. But note that in this case, this can easily be fixed by modifying the behavior of \(g_2\) for \(x_1 > 0\). For example, replacing \(g_2\) by

    $$\begin{aligned} {\tilde{g}}_2 : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}, \quad x \mapsto {\left\{ \begin{array}{ll} x_1^2 - x_1, &{}\quad x_1 \le 0, \\ -x_1^2 - x_1, &{}\quad x_1 > 0. \end{array}\right. } \end{aligned}$$

    solves the issue.

  2. (b)

    For the selection functions \(g_1\) and \({\tilde{g}}_2\) of g as in a), we have

    $$\begin{aligned} \nabla g_1(x) = \begin{pmatrix} -1 \\ 2 x_2 \end{pmatrix} \ \text {and} \ \nabla {\tilde{g}}_2(x) = {\left\{ \begin{array}{ll} (2 x_1 - 1,0)^\top , &{}\quad x_1 \le 0, \\ (-2 x_1 - 1,0)^\top , &{}\quad x_1 > 0. \end{array}\right. } \end{aligned}$$

    In particular, in \(x^0 = (0,0)^\top \) we have \(\nabla g_1(x^0) = \nabla {\tilde{g}}_2(x^0) = (-1,0)^\top \), so

    $$\begin{aligned} \mathop {\textrm{affdim}}\limits (\textrm{aff}(\{ \nabla g_1(x^0), \nabla {\tilde{g}}_2(x^0) \})) = 0. \end{aligned}$$

    But it is easy to see that

    $$\begin{aligned} \mathop {\textrm{affdim}}\limits (\textrm{aff}(\{ \nabla g_1(x), \nabla {\tilde{g}}_2(x) \})) = 1 \quad \forall x \in {\mathbb {R}}^2 \setminus \{ 0\}. \end{aligned}$$

    In particular, A1(iii) does not hold in \(x^0\) (for this set of selection functions).

Fig. 2  a The graph of the \(PC^1\)-function g in Example 2a. b The level sets of g

By Lemma SM1, for a given \(x^0 \in {\mathbb {R}}^n\), we can always choose the open neighborhood U of \(x^0\) such that all selection functions of the local restriction \(g|_U\) of g are essentially active in \(x^0\). In particular, we can assume that \(I^e(x^0) = I(x^0)\). While this does not imply that (ii) holds in Assumption A1, the previous example shows how A1(ii) may be satisfied through modifications of the selection functions in areas where they are active, but not essentially active. Although we will not prove that this is always possible, it motivates us to believe that A1(ii) is not a strong assumption in practice.

In contrast to A1(ii), modifying the selection functions will have less impact on A1(iii). The reason for this is the fact that if A1(i) and A1(ii) hold, then the right-hand side of A1(iii) is the dimension of the affine hull of the subdifferential of g in \(x^0\) (cf. (1)). In particular, the right-hand side does not depend on the choice of selection functions. In light of this, A1(iii) implies that the dimension of the affine hull of the subdifferential of g is constant in all \(x \in U\) with \(I^e(x) = I^e(x^0)\), i.e., in all \(x \in \Omega ^{I^e(x^0)}\) (cf. (10)). Thus, A1(iii) is more related to the function g and less related to the choice of selection functions. In Example 2 a), we see that the set \(\Omega ^{\{1,2\}}\) (in blue) has a kink in \(x^0 = (0,0)^\top \). The following lemma suggests that this is caused by A1(iii) being violated. Thus, by assuming A1(iii), we limit ourselves to local restrictions \(g|_U\) for which \(\Omega ^{I^e(x^0)}\) has a smooth structure.
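Condition A1(iii) can be checked numerically, since \(\mathop {\textrm{affdim}}\limits (\textrm{aff}(\{ v_1, \dots , v_m \}))\) is the rank of the matrix with rows \(v_i - v_1\). The following sketch (assuming numpy; the function names are illustrative) reproduces the violation of A1(iii) at \(x^0 = (0,0)^\top \) from Example 2(b).

```python
import numpy as np

def affdim(vectors, tol=1e-10):
    """Affine dimension of aff({v_1,...,v_m}) = rank of the differences v_i - v_1."""
    V = np.asarray(vectors, dtype=float)
    return np.linalg.matrix_rank(V[1:] - V[0], tol=tol)

# gradients of g1 and g~2 from Example 2(b)
def grad_g1(x):
    return np.array([-1.0, 2.0 * x[1]])

def grad_g2_tilde(x):
    return np.array([2.0 * x[0] - 1.0, 0.0]) if x[0] <= 0 else np.array([-2.0 * x[0] - 1.0, 0.0])

x0 = [0.0, 0.0]
x1 = [0.1, 0.1]
print(affdim([grad_g1(x0), grad_g2_tilde(x0)]))   # 0 at x^0 ...
print(affdim([grad_g1(x1), grad_g2_tilde(x1)]))   # ... but 1 nearby, so A1(iii) fails at x^0
```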

Lemma 2

Let \(x^0 \in {\mathbb {R}}^n\). Let \(U \subseteq {\mathbb {R}}^n\) be an open neighborhood of \(x^0\) and let \(\{ g_1, \dots , g_k \}\) be a set of selection functions of \(g|_U\) as in Assumption A1. Let \(d = \mathop {\textrm{affdim}}\limits (\textrm{aff}(\partial g(x^0)))\) and let \(\{ i_1, \dots , i_{d+1} \} \subseteq \{ 1, \dots , k \}\) such that \(\{ \nabla g_i(x^0) : i \in \{ i_1, \dots , i_{d+1} \} \}\) is an affine basis of \(\textrm{aff}(\{ \nabla g_i(x^0) : i \in \{ 1, \dots , k \} \})\). Then there is an open neighborhood \(U' \subseteq U\) of \(x^0\) such that

$$\begin{aligned} g_i(x) - g_1(x) = 0 \ \forall i \in \{2,\dots ,k\} \quad \Leftrightarrow \quad g_i(x) - g_{i_1}(x) = 0 \ \forall i \in \{i_2,\dots ,i_{d+1}\} \end{aligned}$$

for all \(x \in U'\) and \(\Omega ^{\{1,\dots ,k\}} \cap U'\) is an embedded \((n-d)\)-dimensional submanifold of \(U'\). In particular,

$$\begin{aligned} \Omega ^{\{1,\dots ,k\}} \cap U' = \{ x \in U' : g_i(x) - g_{i_1}(x) = 0 \ \forall i \in \{i_2,\dots ,i_{d+1}\} \}. \end{aligned}$$

Proof

The direction "\(\Rightarrow \)" is obvious, so consider the converse. By A1(iii) and since the gradients \(\nabla g_i\), \(i \in \{i_1,\dots ,i_{d+1}\}\), are continuous, there is an open neighborhood \(U' \subseteq U\) of \(x^0\) such that \(\{ \nabla g_i(x) : i \in \{i_1,\dots ,i_{d+1}\} \}\) is an affine basis of \(\{ \nabla g_i(x) : i \in \{1,\dots ,k\} \}\) for all \(x \in U'\). Let

$$\begin{aligned} \varphi : U' \rightarrow {\mathbb {R}}^{k-1}, \quad x \mapsto \begin{pmatrix} g_2(x) - g_1(x) \\ \vdots \\ g_k(x) - g_1(x) \end{pmatrix}. \end{aligned}$$

By A1(iii) the Jacobian \(D\varphi (x)\) has constant rank d for all \(x \in U'\). By A1(i) we have \(\varphi (x^0) = 0\), so the level set \(L := \varphi ^{-1}(0) = \Omega ^{\{1,\dots ,k\}} \cap U'\) is nonempty. Thus, by Theorem 5.12 in [16], L is an embedded \((n-d)\)-dimensional submanifold of \(U'\). Additionally, let

$$\begin{aligned} \varphi ' : U' \rightarrow {\mathbb {R}}^{d}, \quad x \mapsto \begin{pmatrix} g_{i_2}(x) - g_{i_1}(x) \\ \vdots \\ g_{i_{d+1}}(x) - g_{i_1}(x) \end{pmatrix}. \end{aligned}$$

By construction, \(D\varphi '(x)\) has constant rank d for all \(x \in U'\). With the same argument as above, it follows that \(L' := \varphi '^{-1}(0)\) is an embedded \((n-d)\)-dimensional submanifold of \(U'\) as well. Since \(L \subseteq L'\), L is also an embedded \((n-d)\)-dimensional submanifold of \(L'\) (cf. [16], Proposition 4.22). By Proposition 5.1 in [16], this implies that L is an open subset of \(L'\). As \(L'\) is endowed with the subspace topology of \(U' \subseteq {\mathbb {R}}^n\), this means that we can assume w.l.o.g. that \(U'\) is an open neighborhood of \(x^0\) with \(U' \cap L' = L\), completing the proof. \(\square \)

By the previous lemma, Assumption A1 allows us to assume w.l.o.g. that for the restriction \(g|_U\), the set of points with a constant active set \(\Omega ^{I^e(x^0)}\) is a smooth manifold around \(x^0 \in U\) of dimension \(n - \mathop {\textrm{affdim}}\limits (\textrm{aff}(\partial g(x^0)))\). Furthermore, it shows that for the representation of \(\Omega ^{I^e(x^0)}\) as a level set, it is sufficient to only consider a subset of the set of selection functions whose gradients form an affine basis of \(\partial g(x^0)\).

3.2 The structure of \(P_c^I \cap \Omega ^I\)

After analyzing the structure of \(\Omega ^I\), we will now turn towards the structure of the intersection \(P_c^I \cap \Omega ^I\) in (9). First of all, as for \(\Omega ^I\), we will show that not all selection functions of g are required for the representation of \(P_c^I \cap \Omega ^I\). More precisely, a simple application of Carathéodory’s theorem (Theorem 1) to the definition of \(P_c^I\) yields the following result.

Lemma 3

Let \(x^0 \in P_c\) and let \(\{g_1,\dots ,g_k\}\) be a set of selection functions of g. If \(x^0\) is not a critical point of g, then there is an index set \(\{ i_1, \dots , i_r \} \subseteq \{ 1, \dots , k \}\) with \(r = \mathop {\textrm{affdim}}\limits (\textrm{aff}(\{ \nabla f(x^0) \} \cup \partial g(x^0)))\) such that

  1. (a)

    \(0 \in \mathop {\textrm{conv}}\limits (\{ \nabla f(x^0) \} \cup \{ \nabla g_i(x^0) : i \in \{i_1,\dots ,i_r \}\})\),

  2. (b)

    \(\{ \nabla f(x^0) \} \cup \{ \nabla g_i(x^0) : i \in \{i_1,\dots ,i_r \}\}\) is affinely independent.

Proof

By Theorem 1, there is an affinely independent subset of

$$\begin{aligned} \{ \nabla f(x^0) \} \cup \{ \nabla g_i(x^0) : i \in \{1,\dots ,k\} \} \end{aligned}$$

of size \(r+1\) with zero in its convex hull. Since \(x^0\) is not a critical point of g, \(\nabla f(x^0)\) must be contained in that subset. \(\square \)

With Lemma 2 and Lemma 3, we have ways to simplify \(\Omega ^I\) and \(P_c^I\), respectively, by only considering certain selection functions of g. But note that we cannot necessarily choose the same selection functions for both results: Although the set \(\{ \nabla g_i(x^0) : i \in \{i_1,\dots ,i_r\}\}\) in Lemma 3 is affinely independent, the index set \(\{i_1,\dots ,i_r\}\) cannot necessarily be used in Lemma 2 since we might have \(r < d+1\), i.e.,

$$\begin{aligned} \begin{aligned}&\mathop {\textrm{affdim}}\limits (\textrm{aff}(\{ \nabla f(x^0) \} \cup \partial g(x^0))) < \mathop {\textrm{affdim}}\limits (\textrm{aff}(\partial g(x^0))) + 1 \\&\quad \Leftrightarrow \ \textrm{aff}(\{ \nabla f(x^0) \} \cup \partial g(x^0)) = \textrm{aff}(\partial g(x^0)) \\&\quad \Leftrightarrow \ \nabla f(x^0) \in \textrm{aff}(\partial g(x^0)). \end{aligned} \end{aligned}$$
(11)

In particular, since \(x^0\) is Pareto critical, this would imply that \(0 \in \textrm{aff}(\partial g(x^0))\) (even though \(x^0\) is not critical for g, i.e., \(0 \notin \mathop {\textrm{conv}}\limits (\partial g(x^0))\)). The following lemma shows that this scenario is related to the uniqueness of the KKT multiplier corresponding to f in \(x^0\).

Lemma 4

Let \(x^0 \in P_c\) such that \(x^0\) is not a critical point of g.

  1. (a)

    If the KKT multiplier \(\alpha _1\) of f in \(x^0\) (cf. (3)) is not unique, then \(\nabla f(x^0) \in \textrm{aff}(\partial g(x^0))\).

  2. (b)

    If \(\nabla f(x^0) \in \textrm{aff}(\partial g(x^0))\) and 0 is contained in the relative interior (cf. Definition SM9) of \(\mathop {\textrm{conv}}\limits (\{ \nabla f(x^0) \} \cup \partial g(x^0))\), then the KKT multiplier \(\alpha _1\) of f in \(x^0\) is not unique.

Proof

See “Appendix A.1”. \(\square \)

Remark 2

In [26], Section 4.3, it was shown that in the smooth case and under certain regularity assumptions on f and g, the coefficient vector of the vanishing convex combination in the KKT condition in a point \(x \in P_c\), i.e., the vector \((\alpha _1, \alpha _2)^\top \) in (3), is orthogonal to the tangent space of the image of the Pareto critical set at \((f(x),g(x))^\top \). Thus, roughly speaking, non-uniqueness of \((\alpha _1, \alpha _2)^\top \) suggests that this tangent space is “degenerate”, i.e., that the Pareto front possesses a kink at \((f(x),g(x))^\top \).

The following example shows what behavior may occur if the KKT multiplier of f is not unique.

Example 3

Consider problem (7) for \(f : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}\), \(x \mapsto x_1^2 + x_2^2\), and

$$\begin{aligned}&g_1 : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}, \quad x \mapsto x_1^2 + (x_2 - 1)^2, \\&g_2 : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}, \quad x \mapsto x_1^2 + (x_2 - 1)^2 - \left( x_2 - \frac{1}{2} \right) , \\&g : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}, \quad x \mapsto \max (\{g_1(x),g_2(x)\}). \end{aligned}$$

Then g is \(PC^1\) with selection functions \(g_1\) and \(g_2\). It is easy to see that

$$\begin{aligned} \Omega ^{\{1,2\}} = \{ x \in {\mathbb {R}}^n : I^e(x) = \{1,2\} \} = {\mathbb {R}}\times \left\{ \frac{1}{2} \right\} , \end{aligned}$$

as depicted in Fig. 3a.

Fig. 3  a Pareto critical set \(P_c\) and \(\Omega ^I\), \(I \subseteq \{1,2\}\), in Example 3. b Pointwise discretization of the image \(\{ (f(x),g(x))^\top : x \in {\mathbb {R}}^2 \}\) of the objective vector \((f,g)\) and the image of the Pareto critical set under \((f,g)\)

The Pareto critical (and in this case Pareto optimal) set is given by \(P_c = \{ 0 \} \times [0,1]\). In particular, \(x^0 = (0,\frac{1}{2})^\top \) is the only Pareto critical point where more than one selection function is active, i.e., \(P_c^{\{1,2\}} \cap \Omega ^{{\{1,2\}}} = \{ x^0 \}\). By computing the gradients in \(x^0\), we obtain

$$\begin{aligned} \nabla f(x^0) = (0,1)^\top , \ \nabla g_1(x^0) = (0,-1)^\top , \ \nabla g_2(x^0) = (0,-2)^\top . \end{aligned}$$

We see that

$$\begin{aligned} \frac{1}{2} \nabla f(x^0) + \frac{1}{2} \nabla g_1(x^0) = 0 \quad \text {and} \quad \frac{2}{3} \nabla f(x^0) + \frac{1}{3} \nabla g_2(x^0) = 0, \end{aligned}$$

so the KKT multiplier of f is not unique. By Lemma 4 this implies \(\nabla f(x^0) \in \textrm{aff}(\partial g(x^0))\). More explicitly, for this example, it is easy to check that

$$\begin{aligned} \nabla f(x^0) = 3 \nabla g_1(x^0) - 2 \nabla g_2(x^0). \end{aligned}$$

Figure 3b shows an approximation of the image of (fg) and the image of the Pareto critical set. As discussed in Remark 2, we see that the image of \(P_c\) has a kink at \((f(x^0),g(x^0))^\top = (\frac{1}{4},\frac{1}{4})^\top \).

As the previous example suggests, a scenario where the KKT multiplier of f is not unique may occur if the Pareto critical set goes transversally through the set of nonsmooth points instead of moving tangentially along it. In other words, it may occur if arbitrarily close to \(x^0 \in P_c\), there are Pareto critical points with essentially active sets \(I_1\) and \(I_2\) such that \(I_1 \ne I_2\) and \(I_1 \ne I^e(x^0) \ne I_2\). Due to continuity of the gradients, the KKT multipliers for both sets \(I_1\) and \(I_2\) have accumulation points that are KKT multipliers of \(x^0\). Since \(I_1 \ne I_2\), these accumulation points may not coincide, such that the KKT multipliers in \(x^0\) are not unique. In terms of the structure of \(P_c^I \cap \Omega ^I\), we see that it is a 0-dimensional set in Example 3 (for \(I = \{1,2\}\)) as it is just a single point.

Although Pareto critical points \(x^0\) with \(\nabla f(x^0) \in \textrm{aff}(\partial g(x^0))\) may not necessarily cause nonsmoothness of \(P_c\), we will still exclude them from our consideration of the local structure of \(P_c\) around \(x^0\) to avoid the irregularities discussed above. So formally, we introduce the following assumption:

Assumption A2

For \(x^0 \in P_c\) we have

$$\begin{aligned} \nabla f(x^0) \notin \textrm{aff}(\partial g(x^0)). \end{aligned}$$

Roughly speaking, since \(\mathop {\textrm{affdim}}\limits (\textrm{aff}(\partial g(x^0))) < n\) in most cases, we expect that the set of points that violate Assumption A2 is small compared to \(P_c\) (or even empty). By (11), Assumption A2 implies that there is an index set as in Lemma 3 that satisfies the requirements of Lemma 2. In particular, \(P_c^I \cap \Omega ^I\) can then be expressed using only a subset of the selection functions of g.

The discussion of \(P_c^I \cap \Omega ^I\) so far was mainly focused on the removal of redundant information in the subdifferential of g to simplify our analysis. We will now turn towards its actual geometrical structure. To this end, we again consider Example 1.

Example 4

Let f and g be as in Example 1 (The corresponding Pareto critical set is shown in Fig. 1). Let \(x^0 = (1,0)^\top \) and \(U \subseteq {\mathbb {R}}^2\) be the open ball with radius one around \(x^0\). Then a set of selection functions of \(g|_U\) is given by \(\{g_1, g_2\}\) and we have \(P_c^{\{ 1,2 \}} \cap \Omega ^{\{ 1,2 \}} = (0,1] \times \{ 0 \}\). In particular, \(x^0\) is a boundary point of \(P_c^{\{ 1,2 \}} \cap \Omega ^{\{ 1,2 \}}\), such that \(P_c^{\{ 1,2 \}} \cap \Omega ^{\{ 1,2 \}}\) is not smooth around \(x^0\) (in the sense of smooth manifolds). The gradients of f, \(g_1\) and \(g_2\) are shown in Fig. 4.

Fig. 4  The gradients of f, \(g_1\) and \(g_2\) in \(x^0 = (1,0)^\top \) in Example 4. The dashed line shows the (relative) boundary of the convex hull \(\mathop {\textrm{conv}}\limits (\{ \nabla f(x^0) \} \cup \partial g(x^0))\)

We see that there is a unique convex combination

$$\begin{aligned} \frac{1}{3} \nabla f(x^0) + \frac{2}{3} \nabla g_1(x^0) + 0 \nabla g_2(x^0) = 0 \end{aligned}$$
(12)

where the coefficient of \(\nabla g_2(x^0)\) is zero.

Note that in the previous example, there is still a vanishing affine combination of the gradients of f, \(g_1\) and \(g_2\) for \(x = (x_1,0)^\top \), \(x_1 > 1\). But it is not a convex combination, as the coefficient corresponding to \(\nabla g_2(x)\) is negative. Due to the continuity of the gradients, this can only happen if one of the coefficients in \(x^0\) is already zero (as in (12)). To exclude the type of nonsmoothness caused by this, we introduce the following assumption.

Assumption A3

For \(x^0 \in P_c\) and a set of selection functions \(\{ g_1, \dots , g_k \}\) of g, there is an index set \(\{ i_1, \dots , i_r \} \subseteq \{1,\dots ,k\}\) as in Lemma 3 and positive coefficients \(\alpha ^0 > 0\), \(\beta ^0 \in ({\mathbb {R}}^{>0})^r\) with \(\alpha ^0 + \sum _{j = 1}^r \beta ^0_j = 1\) and \(\alpha ^0 \nabla f(x^0) + \sum _{j = 1}^r \beta ^0_j \nabla g_{i_j}(x^0) = 0\).

The following lemma yields a necessary condition for Assumption A3 to hold, which is related to the relative interior (cf. Definition SM9) of \(\mathop {\textrm{conv}}\limits (\{ \nabla f(x^0) \} \cup \partial g(x^0))\). In particular, it is independent of the choice of selection functions.

Lemma 5

Let \(x^0 \in P_c\). If there is a set of selection functions such that Assumption A3 holds, then

$$\begin{aligned} 0 \in \mathop {\textrm{ri}}\limits (\mathop {\textrm{conv}}\limits (\{ \nabla f(x^0) \} \cup \partial g(x^0))). \end{aligned}$$

Proof

See “Appendix A.2”. \(\square \)

After introducing the Assumptions A1, A2 and A3, we are now able to show the first structural result about \(P_c^I \cap \Omega ^I\). The following lemma shows that \(P_c^I \cap \Omega ^I\) is the projection of a level set from a higher-dimensional space onto the variable space \({\mathbb {R}}^n\).

Lemma 6

Let \(x^0 \in P_c\). Let \(U \subseteq {\mathbb {R}}^n\) be an open neighborhood of \(x^0\) and let \(\{ g_1, \dots , g_k \}\) be a set of selection functions of \(g|_U\) satisfying Assumptions A1 and A3. Assume that Assumption A2 holds. Then there is an index set \(\{ i_1, \dots , i_r \} \subseteq \{1,\dots ,k\}\) and an open neighborhood \(U' \subseteq U\) of \(x^0\) such that

$$\begin{aligned} P_c^{\{1,\dots ,k\}} \cap \Omega ^{\{1,\dots ,k\}} \cap U' = {\mathop {\textrm{pr}}\limits }_x(h^{-1}(0)) \cap U', \end{aligned}$$
(13)

where \({\mathop {\textrm{pr}}\limits }_x : {\mathbb {R}}^n \times {\mathbb {R}}\times {\mathbb {R}}^r \rightarrow {\mathbb {R}}^n\) is the projection onto the first n components and

$$\begin{aligned} h: {\mathbb {R}}^n \! \times \! {\mathbb {R}}^{>0} \! \times \! ({\mathbb {R}}^{>0})^r \rightarrow {\mathbb {R}}^n \! \times \! {\mathbb {R}}\! \times \! {\mathbb {R}}^{r-1}, (x,\alpha ,\beta ) \mapsto \begin{pmatrix} \alpha \nabla f(x) + \sum _{j = 1}^r \beta _j \nabla g_{i_j}(x) \\ \alpha + \sum _{j = 1}^r \beta _j - 1 \\ (g_{i_j}(x) - g_{i_1}(x))_{j \in \{2,\dots ,r\}} \end{pmatrix}. \end{aligned}$$

Proof

Let \(\{ i_1, \dots , i_r \} \subseteq \{1,\dots ,k\}\) be an index set as in A3. Since the gradients \(\nabla f\) and \(\nabla g_{i_j}\), \(j \in \{1,\dots ,r\}\), are continuous and \(\{ \nabla f(x^0) \} \cup \{ \nabla g_{i_j}(x^0) : j \in \{1,\dots ,r\} \}\) is affinely independent, there is an open neighborhood \(U' \subseteq U\) of \(x^0\) such that \(\{ \nabla f(x) \} \cup \{ \nabla g_{i_j}(x) : j \in \{1,\dots ,r\} \}\) is affinely independent for all \(x \in U'\). In particular,

$$\begin{aligned} \begin{aligned} r&\le \mathop {\textrm{affdim}}\limits (\textrm{aff}(\{ \nabla f(x) \} \cup \{ \nabla g_i(x) : i \in \{1,\dots ,k\} \})) \\&\le \mathop {\textrm{affdim}}\limits (\textrm{aff}(\{ \nabla g_i(x) : i \in \{1,\dots ,k\} \})) + 1 \quad \forall x \in U'. \end{aligned} \end{aligned}$$
(14)

By A1, A2 and A3, we have

$$\begin{aligned} \begin{aligned} r&{\mathop {=}\limits ^{A3}} \mathop {\textrm{affdim}}\limits (\textrm{aff}(\{ \nabla f(x^0) \} \cup \partial g(x^0))) {\mathop {=}\limits ^{A2}} \mathop {\textrm{affdim}}\limits (\textrm{aff}(\partial g(x^0))) + 1 \\&{\mathop {=}\limits ^{A1 (i),(ii)}} \mathop {\textrm{affdim}}\limits (\textrm{aff}(\{ \nabla g_i(x^0) : i \in \{1,\dots ,k\} \})) + 1 \\&{\mathop {=}\limits ^{A1 (iii)}} \mathop {\textrm{affdim}}\limits (\textrm{aff}(\{ \nabla g_i(x) : i \in \{1,\dots ,k\} \})) + 1 \quad \forall x \in U'. \end{aligned} \end{aligned}$$
(15)

Combining (14) and (15), we obtain

$$\begin{aligned} \mathop {\textrm{affdim}}\limits (\textrm{aff}(\{ \nabla f(x) \} \cup \{ \nabla g_i(x) : i \in \{1,\dots ,k\} \})) = r \quad \forall x \in U', \end{aligned}$$

so \(\{ \nabla f(x) \} \cup \{ \nabla g_{i_j}(x) : j \in \{1,\dots ,r\} \}\) is an affine basis of \(\{ \nabla f(x) \} \cup \{ \nabla g_i(x) : i \in \{1,\dots ,k\} \}\) for all \(x \in U'\).

Let \(x \in P_c^{\{1,\dots ,k\}} \cap \Omega ^{\{1,\dots ,k\}} \cap U'\). By Lemma SM4, every element of \(\textrm{aff}(\{ \nabla f(x) \} \cup \{ \nabla g_i(x) : i \in \{1,\dots ,k\} \})\) can be uniquely written as an affine combination of elements of \(\{ \nabla f(x) \} \cup \{ \nabla g_{i_j}(x) : j \in \{1,\dots ,r\} \}\). Let \(\alpha ^0\) and \(\beta ^0\) as in A3. Since \(\alpha ^0 > 0\), \(\beta ^0 \in ({\mathbb {R}}^{>0})^r\) and the gradients \(\nabla f\), \(\nabla g_{i_j}\), \(j \in \{1,\dots ,r\}\), are continuous, we can assume w.l.o.g. that \(U'\) is small enough such that there are \(\alpha > 0\), \(\beta \in ({\mathbb {R}}^{>0})^r\) with \(\alpha + \sum _{j = 1}^r \beta _j = 1\) and

$$\begin{aligned} \alpha \nabla f(x) + \sum _{j = 1}^r \beta _j \nabla g_{i_j}(x) = 0. \end{aligned}$$

Furthermore, \(g_{i_j}(x) - g_{i_1}(x) = 0\) holds for all \(j \in \{2,\dots ,r\}\) since \(x \in \Omega ^{\{1,\dots ,k\}}\). Thus, \(h(x,\alpha ,\beta ) = 0\), i.e., \(x \in {\mathop {\textrm{pr}}\limits }_x(h^{-1}(0)) \cap U'\).

Now let \(x \in {\mathop {\textrm{pr}}\limits }_x(h^{-1}(0)) \cap U'\). Then \(x \in P_c^{\{1,\dots ,k\}}\) trivially holds since \(\{ i_1, \dots , i_r \} \subseteq \{1,\dots ,k\}\). By A1 and Lemma 2, we can assume w.l.o.g. that \(U'\) is small enough such that \(g_{i_j}(x) - g_{i_1}(x) = 0\) for all \(j \in \{2,\dots ,r\}\) implies \(x \in \Omega ^{\{1,\dots ,k\}}\), completing the proof. \(\square \)

Up to this point, we assumed f to be continuously differentiable and g to be \(PC^1\). This means that the map h in the previous lemma is at least continuous. If h is actually continuously differentiable, then standard results from differential geometry can be used to analyze the structure of its level sets on the right-hand side of (13). To this end, we will assume for the remainder of this section that f is twice continuously differentiable and g is \(PC^2\).

Theorem 2

In the setting of Lemma 6 it holds:

  1. (a)

    If \(Dh(x,\alpha ,\beta )\) has full rank for all \((x,\alpha ,\beta ) \in h^{-1}(0)\), then \(h^{-1}(0)\) is a 1-dimensional submanifold of \({\mathbb {R}}^n \times {\mathbb {R}}^{>0} \times ({\mathbb {R}}^{>0})^r\).

  2. (b)

    If \(Dh(x,\alpha ,\beta )\) has constant rank \(m \in {\mathbb {N}}\) for all \((x,\alpha ,\beta ) \in {\mathbb {R}}^n \times {\mathbb {R}}^{>0} \times ({\mathbb {R}}^{>0})^r\), then \(h^{-1}(0)\) is an \((n+r+1-m)\)-dimensional submanifold of \({\mathbb {R}}^n \times {\mathbb {R}}^{>0} \times ({\mathbb {R}}^{>0})^r\).

In both cases, the tangent space of \(h^{-1}(0)\) is given by

$$\begin{aligned} T_{(x,\alpha ,\beta )} (h^{-1}(0)) = \ker (Dh(x,\alpha ,\beta )). \end{aligned}$$
(16)

Proof

Part a) follows from Corollary 5.14 and part b) follows from Theorem 5.12 in [16]. The formula for the tangent space follows from Proposition 5.38 in [16]. \(\square \)

Remark 3

Equation (16) in the previous theorem can be used to compute tangent vectors of the regularization path in practice by computing elements of \({\mathop {\textrm{pr}}\limits }_x(\ker (Dh(x,\alpha ,\beta )))\). Thus, it is an essential result for the construction of path-following methods.

The previous theorem is the main result in this section. It shows that the structure of \(h^{-1}(0)\) (and thus the structure of \(P_c^I \cap \Omega ^I\) due to (13)) is related to the rank of the Jacobian Dh, given by

$$\begin{aligned} \begin{pmatrix} \alpha \nabla ^2 f(x) + \sum _{j = 1}^r \beta _j \nabla ^2 g_{i_j}(x) &{} \nabla f(x) &{} \nabla g_{i_1}(x) &{} \ldots &{} \nabla g_{i_r}(x) \\ 0 &{} 1 &{} 1 &{} \ldots &{} 1 \\ (\nabla g_{i_2}(x) - \nabla g_{i_1}(x))^\top &{} 0 &{} 0 &{} \ldots &{} 0 \\ \vdots &{} \vdots &{} \vdots &{} &{} \vdots \\ (\nabla g_{i_r}(x) - \nabla g_{i_1}(x))^\top &{} 0 &{} 0 &{} \ldots &{} 0 \end{pmatrix} \in {\mathbb {R}}^{(n+r) \times (n+r+1)} \end{aligned}$$

for \((x,\alpha ,\beta ) \in {\mathbb {R}}^n \times {\mathbb {R}}^{>0} \times ({\mathbb {R}}^{>0})^r\). Note that in Theorem 2 b), the assumption on the rank has to hold for all \((x,\alpha ,\beta ) \in {\mathbb {R}}^n \times {\mathbb {R}}^{>0}\times ({\mathbb {R}}^{>0})^{r}\) whereas in a), it only has to hold for all \((x,\alpha ,\beta ) \in h^{-1}(0)\). The following remark shows how the structure of Dh can be used to analyze its rank.

Remark 4

In the setting of Lemma 6, let \((v^x,v^\alpha ,v^\beta ) \in \ker (Dh(x,\alpha ,\beta )) \subseteq {\mathbb {R}}^n \times {\mathbb {R}}^{>0} \times ({\mathbb {R}}^{>0})^r\), i.e.,

$$\begin{aligned} \begin{aligned}&\left( \alpha \nabla ^2 f(x) + \sum _{j = 1}^r \beta _j \nabla ^2 g_{i_j}(x) \right) v^x + v^\alpha \nabla f(x) + \sum _{j = 1}^r v^\beta _j \nabla g_{i_j}(x) = 0, \\&v^\alpha + \sum _{j = 1}^r v^\beta _j = 0, \\&(\nabla g_{i_j}(x) - \nabla g_{i_1}(x))^\top v^x = 0 \quad \forall j \in \{2,\dots ,r\}. \end{aligned} \end{aligned}$$
(17)

Since \(\{ \nabla f(x), \nabla g_{i_1}(x), \dots , \nabla g_{i_r}(x) \}\) is affinely independent by construction (cf. proof of Lemma 6), the set

$$\begin{aligned} W := \left\{ v^\alpha \nabla f(x) + \sum _{j = 1}^r v^\beta _j \nabla g_{i_j}(x) : v^\alpha \in {\mathbb {R}}, v^\beta \in {\mathbb {R}}^r, v^\alpha + \sum _{j = 1}^r v^\beta _j = 0 \right\} \end{aligned}$$

is an r-dimensional linear subspace of \({\mathbb {R}}^n\). Similar to Lemma SM4, it is possible to show that for each element of W, the corresponding coefficients \(v^\alpha \) and \(v^\beta \) are unique. If \(\alpha \nabla ^2 f(x) + \sum _{j = 1}^r \beta _j \nabla ^2 g_{i_j}(x)\) is regular, then the first two lines of (17) are equivalent to

$$\begin{aligned} v^x \in -\left( \alpha \nabla ^2 f(x) + \sum _{j = 1}^r \beta _j \nabla ^2 g_{i_j}(x)\right) ^{-1} W =: V_1, \end{aligned}$$

where \(V_1\) is an r-dimensional linear subspace of \({\mathbb {R}}^n\). In particular, \(v^\alpha \) and \(v^\beta \) are uniquely determined by \(v^x\). Furthermore, if we denote by \(V^\bot \) the orthogonal complement of a subspace V, then the last line of (17) is equivalent to

$$\begin{aligned} v^x \in \mathop {\textrm{span}}\limits (\{ \nabla g_{i_j}(x) - \nabla g_{i_1}(x) : j \in \{2,\dots ,r\} \})^\bot =: V_2, \end{aligned}$$

where \(V_2\) is an \((n - (r - 1))\)-dimensional subspace of \({\mathbb {R}}^n\) since \(\{ \nabla g_{i_1}(x), \dots , \nabla g_{i_r}(x) \}\) is affinely independent. Thus, the dimension of \(\ker (Dh(x,\alpha ,\beta ))\) is given by the dimension of the intersection \(V_1 \cap V_2\). If we assume that \(V_1\) and \(V_2\) are generic subspaces, then we can apply a basic result from linear algebra to see that

$$\begin{aligned} \dim (\ker (Dh(x,\alpha ,\beta )))&= \dim (V_1 \cap V_2) = \dim (V_1) + \dim (V_2) - \dim (V_1 + V_2) \\&= r + (n - (r - 1)) - n = 1, \end{aligned}$$

i.e., the rank of \(Dh(x,\alpha ,\beta )\) is full and Theorem 2(a) can be applied.
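As an illustration of Remark 3 and the preceding discussion, the following sketch assembles \(Dh\) for the data of Example 1 on the segment \((0,1) \times \{ 0 \}\), where f is quadratic and the active selection functions \(g_1, g_2\) are affine (so their Hessians vanish), and computes \(\ker (Dh)\) via a singular value decomposition. The kernel is 1-dimensional, so Theorem 2(a) applies, and the projection of a kernel vector onto the x-components is a tangent direction of the path (cf. (16)). The point and multipliers used are one solution of \(h = 0\); the code is an illustrative sketch, not part of the theory.

```python
import numpy as np

# Data of Example 1 on the segment x = (t, 0), t in (0, 1): f(x) = (x1-2)^2 + (x2-1)^2,
# active selection functions g1, g2 with gradients (1,1), (1,-1) and vanishing Hessians.
def Dh(x, alpha, beta):
    grad_f = np.array([2.0 * x[0] - 4.0, 2.0 * x[1] - 2.0])
    g1, g2 = np.array([1.0, 1.0]), np.array([1.0, -1.0])
    H = alpha * 2.0 * np.eye(2)        # alpha*Hess f + sum_j beta_j*Hess g_ij (affine g_ij => 0)
    top = np.hstack([H, grad_f[:, None], g1[:, None], g2[:, None]])
    mid = np.array([[0.0, 0.0, 1.0, 1.0, 1.0]])
    bot = np.hstack([(g2 - g1)[None, :], np.zeros((1, 3))])
    return np.vstack([top, mid, bot])  # (n+r) x (n+r+1) = 4 x 5

# a point of h^{-1}(0): x = (0.5, 0), alpha = 1/4, beta = (5/8, 1/8)
J = Dh([0.5, 0.0], 0.25, [0.625, 0.125])
_, s, Vt = np.linalg.svd(J)
kernel = Vt[np.sum(s > 1e-10):]        # rows spanning ker(Dh), cf. (16)
print(kernel.shape[0])                 # 1, so Dh has full rank and Theorem 2(a) applies
print(kernel[0, :2] / np.linalg.norm(kernel[0, :2]))   # tangent direction of the path: +-(1, 0)
```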

The previous remark suggests that \(h^{-1}(0)\) is typically a 1-dimensional manifold such that we expect its projection \(P_c^I \cap \Omega ^I\) to be “1-dimensional” as well by (13). Nonetheless, we will see later that there are applications where \(h^{-1}(0)\) is a higher-dimensional manifold. Furthermore, there are cases where \(h^{-1}(0)\) is not a manifold at all (Note that this is not necessarily caused by the nonsmoothness of g and can also happen for smooth objective functions (cf. Example 1 in [28])). Thus, for \(P_c^I \cap \Omega ^I\) to have a smooth structure around a (corresponding) \(x^0 \in P_c\), we have to make the following assumption:

Assumption A4

In the setting of Lemma 6, Theorem 2 can be applied, i.e.,

  1. (a)

    \(\mathop {\textrm{rk}}\limits (Dh(x,\alpha ,\beta )) = n + r \quad \forall (x,\alpha ,\beta ) \in h^{-1}(0)\) or

  2. (b)

    \(\mathop {\textrm{rk}}\limits (Dh(x,\alpha ,\beta )) \text { is constant } \quad \forall (x,\alpha ,\beta ) \in {\mathbb {R}}^n \times {\mathbb {R}}^{>0} \times ({\mathbb {R}}^{>0})^r.\)

We conclude the discussion of the structure of \(P_c^I \cap \Omega ^I\) by considering the special case where f is quadratic and g is piecewise (affinely) linear. Remark A.3 in the “Appendix” shows that in this case, \(P_c^I \cap \Omega ^I\) is (locally) an affinely linear set around points that satisfy the assumptions of Lemma 6. This coincides with the results in [12].

3.3 The structure of \(P_c\)

After analyzing the structure of \(P_c^I \cap \Omega ^I\), we are now in the position to analyze the structure of the Pareto critical set \(P_c\) of (7). By (9), \(P_c\) can be written as the union of \(P_c^I \cap \Omega ^I\) for all possible combinations I of selection functions. Since we already discussed the structure of the individual \(P_c^I \cap \Omega ^I\), the only additional nonsmooth points in \(P_c\) may arise by taking their union. More precisely, nonsmooth points may arise where the different \(P_c^I \cap \Omega ^I\) touch, i.e., where the set of (essentially) active selection functions changes. The following lemma yields a necessary condition for identifying such points.

Lemma 7

Let \(x^0 \in P_c\) and let \(\{g_1, \dots , g_k\}\) be a set of selection functions of g with \(I^e(x^0) = \{ i_1, \dots , i_l \}\), \(l \in {\mathbb {N}}\). If for all open neighborhoods \(U \subseteq {\mathbb {R}}^n\) of \(x^0\), there is some \(x \in P_c \cap U\) with \(I^e(x) \ne I^e(x^0)\), then there are \(\alpha \ge 0\) and \(\beta \in ({\mathbb {R}}^{\ge 0})^{l}\) such that \(\alpha + \sum _{j = 1}^{l} \beta _j = 1\),

$$\begin{aligned} \alpha \nabla f(x^0) + \sum _{j = 1}^{l} \beta _j \nabla g_{i_j}(x^0) = 0 \end{aligned}$$

and \(\beta _j = 0\) for some \(j \in \{1,\dots ,l\}\).

Proof

See “Appendix A.4”. \(\square \)

A visualization of the previous lemma can be seen in Example 1: In \(x^0 = (1,0)^\top \), the sets \(P_c^{\{1,2\}} \cap \Omega ^{\{1,2\}}\) and \(P_c^{\{1\}} \cap \Omega ^{\{1\}}\) touch and there is a convex combination with a zero component (cf. (12)). In this case, this causes a kink in \(P_c\).

Note that in general, the existence of a coefficient vector with a zero component as in Lemma 7 is not a useful criterion to find points in \(P_c\) where the active set changes. For example, by Lemma 3, if the number of essentially active selection functions in \(x^0\) is larger than \(\mathop {\textrm{affdim}}\limits (\textrm{aff}(\{ \nabla f(x^0) \} \cup \partial g(x^0)))\), then there is always a coefficient vector with a zero component. A stricter condition would be that every coefficient vector has a zero component, i.e., that zero is located on the relative boundary of \(\mathop {\textrm{conv}}\limits (\{ \nabla f(x^0) \} \cup \partial g(x^0))\) (cf. Definition SM9). By Lemma 5, this would imply that Assumption A3 cannot hold, such that \(P_c^I \cap \Omega ^I\) may be nonsmooth around \(x^0\). Although the theory suggests (and we will later explicitly see this in Example 6) that this need not be the case at points where the active set changes, we believe it may be a useful criterion in practice.

Nonetheless, from a theoretical point of view, the only reliable assumption we can make to exclude points where the essentially active set changes is the following:

Assumption A5

For \(x^0 \in P_c\) and a set of selection functions \(\{ g_1, \dots , g_k \}\) of g, there is an open neighborhood \(U \subseteq {\mathbb {R}}^n\) of \(x^0\) such that

$$\begin{aligned} I^e(x) = I^e(x^0) \quad \forall x \in P_c \cap U. \end{aligned}$$
Table 1 An overview of the five assumptions required to have a smooth structure of \(P_c\) around \(x^0 \in P_c\)

From our considerations up to this point it follows that if \(x^0 \in P_c\) is a point in which Assumptions A1 to A5 hold (for the same set of selection functions), then \(P_c\) is the projection of a smooth manifold around \(x^0\) as in Theorem 2. An overview of all five assumptions is shown in Table 1. Unfortunately, in contrast to Assumptions A1 to A4, Assumption A5 is only an a posteriori condition, i.e., we already have to know \(P_c\) around \(x^0\) to be able to check whether Assumption A5 holds.

Remark 5

  (a) For the development of path-following methods, it is crucial to be able to detect nonsmooth points during the computation of the regularization path. If the different sets \(P_c \cap \Omega ^I\) are computed separately, then typically (but not necessarily), the nonsmooth points of the path are the end points of these sets (in case the path is “1-dimensional”, cf. Remark 4). Thus, since path-following methods compute a pointwise approximation of the path, these end points roughly appear as points where the method fails to continue with the currently active set \(I \subseteq \{1,\ldots ,k\}\). To find the exact nonsmooth point, one could try to find the closest point where one of the Assumptions A1 to A5 is violated. While it is not clear how this can be done numerically in our general setting, it is easier in specific applications like \(\ell ^1\)-regularization [14] (where more structure can be exploited).

  (b) If Assumption A5 is violated in \(x^0 \in P_c\), then there are Pareto critical points arbitrarily close to \(x^0\) with a different (essentially) active set \(I' \ne I^e(x^0)\). In practice, it may be of interest to find \(I'\) (see the enumeration sketch after this remark). For example, in path-following methods, \(I'\) could be used to compute the direction in which \(P_c\) continues once the nonsmoothness in \(x^0\) has been detected. To this end, let \(\{ g_1, \dots , g_k \}\) be the set of selection functions which are all essentially active at \(x^0\). While it is not possible to determine \(I'\) solely from the set \(\mathop {\textrm{conv}}\limits (\{ \nabla f(x^0) \} \cup \partial g(x^0)) = \mathop {\textrm{conv}}\limits (\{ \nabla f(x^0) \} \cup \{ \nabla g_1(x^0), \dots , \nabla g_k(x^0) \})\), we can at least determine all potential candidates for \(I'\) by finding all subsets \(\{ i_1, \dots , i_m \} \subseteq \{ 1, \dots , k \}\) with

    $$\begin{aligned} 0 \in \mathop {\textrm{conv}}\limits (\{ \nabla f(x^0) \} \cup \mathop {\textrm{conv}}\limits (\{ \nabla g_{i_1}(x^0), \dots , \nabla g_{i_m}(x^0) \})). \end{aligned}$$
  (c) As the union of different \(P_c^I \cap \Omega ^I\) for \(I \subseteq \{1,\ldots ,k\}\), we expect that \(P_c\) (and thus \(R_c\) by Lemma 1) is typically a “1-dimensional” set. In this case, as long as the actual regularization path R (cf. (5)) is not discrete, both \(R_c\) and R have the same “dimension”. Thus, outside of kinks, we expect that \(R_c\) and R coincide locally (more precisely, we expect that for \(x \in R\) there is some open set \(U \subseteq {\mathbb {R}}^n\) with \(x \in U\) such that \(R \cap U = R_c \cap U\)). In this way, structural results about \(R_c\) could also be applied to R in the general nonconvex case.
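Building on the hull test sketched above, the candidates from part (b) could be enumerated by brute force over all index subsets (a hedged sketch; the names are ours and the membership test is passed in as a parameter):

```python
from itertools import combinations
import numpy as np

def candidate_active_sets(grad_f, grads_g, contains_zero):
    """Enumerate all subsets {i_1, ..., i_m} of {0, ..., k-1} such that
    0 lies in conv({grad_f} ∪ {grads_g[i] : i in subset}), cf. Remark 5(b).
    `contains_zero(V)` is any test for 0 in conv(columns of V), e.g.
    `lambda V: hull_position(V) != "outside"` with the LP sketch above."""
    candidates = []
    for m in range(1, len(grads_g) + 1):
        for subset in combinations(range(len(grads_g)), m):
            V = np.column_stack([grad_f] + [grads_g[i] for i in subset])
            if contains_zero(V):
                candidates.append(subset)
    return candidates
```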

We conclude this section with Algorithm 1, which is an abstract path-following method for \(P_c\) based on our results (for the case where \(P_c\) is connected and “1-dimensional”).

[Algorithm 1: Abstract path-following method for \(P_c\)]

Note that this algorithm is purely motivated by the structure of \(P_c\), without taking computational aspects into account. As such, to obtain a practical method for specific cases, ways to implement steps 3 to 6 have to be investigated further; a rough sketch of how such a loop might be organized is given below.
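Since the listing of Algorithm 1 is only abstract, the following is a hedged sketch of one possible predictor-corrector loop; all routines passed as arguments are hypothetical placeholders for the problem-specific ingredients (predictor, corrector, active-set handling) and are not prescribed by this article:

```python
def follow_path(x0, I0, step, stop, tangent_direction, corrector,
                essentially_active_set, switch_active_set):
    """Hedged sketch of a predictor-corrector loop for tracing P_c.

    Placeholder routines (problem-specific, not defined here):
      tangent_direction(x, I)     tangent of P_c^I ∩ Ω^I at x (cf. Theorem 2),
      corrector(y, I)             pulls a predictor point y back onto
                                  P_c^I ∩ Ω^I (returns None on failure),
      essentially_active_set(x)   computes I^e(x),
      switch_active_set(x, I)     proposes a new active set near a
                                  nonsmooth point (cf. Remark 5(b)),
      stop(x)                     stopping criterion, e.g. lambda large enough.
    """
    x, I = x0, I0
    path = [x]
    while not stop(x):
        t = tangent_direction(x, I)            # predictor step along the path
        x_new = corrector(x + step * t, I)     # corrector step
        if x_new is None or essentially_active_set(x_new) != I:
            I = switch_active_set(x, I)        # kink detected: change active set
            continue
        x = x_new
        path.append(x)
    return path
```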

4 Examples

In this section, we will show how our results from Sect. 3 can be used to analyze the structure of regularization paths in two common applications. These are support vector machines (SVMs) in data classification [2] and the exact penalty method in constrained optimization [5, 29].

4.1 Support vector machine

Given a data set \(\{ (x^i, y^i) : x^i \in {\mathbb {R}}^l, y^i \in \{-1,1\}, i \in \{1,\dots ,N\} \}\), the goal of the support vector machine (SVM) is to find \(w \in {\mathbb {R}}^l\) and \(b \in {\mathbb {R}}\) such that

$$\begin{aligned} \mathop {\textrm{sign}}\limits (w^\top x^i + b) = y^i \quad \forall i \in \{1,\dots ,N\}. \end{aligned}$$

In other words, the goal is to find a hyperplane \(\{ x \in {\mathbb {R}}^l : w^\top x + b = 0 \}\) such that all \(x^i\) with \(y^i = 1\) lie on one side and all \(x^i\) with \(y^i = -1\) lie on the other side of the hyperplane. Since such a hyperplane may not be unique, an additional goal is to find the one where the minimal distance of the \(x^i\) to the hyperplane, also known as the margin, is as large as possible. One way of solving this problem is the penalization approach

$$\begin{aligned} \min _{(w,b) \in {\mathbb {R}}^l \times {\mathbb {R}}} f(w,b) + \lambda g(w,b) \end{aligned}$$
(18)

for \(\lambda \ge 0\) and

$$\begin{aligned}&f : {\mathbb {R}}^l \times {\mathbb {R}}\rightarrow {\mathbb {R}}, \quad (w,b) \mapsto \frac{1}{2} \Vert w \Vert _2^2, \\&g : {\mathbb {R}}^l \times {\mathbb {R}}\rightarrow {\mathbb {R}}, \quad (w,b) \mapsto \sum _{i = 1}^N \max \{0, 1 - y^i (w^\top x^i + b) \}. \end{aligned}$$

Roughly speaking, minimizing g ensures that the hyperplane separates the data, while minimizing f maximizes the margin. In theory, the most favorable hyperplane would be the one with \(g(w,b) = 0\) (if existent) and \(f(w,b)\) as small as possible. But in practice, when working with large and noisy data sets, an imperfect separation where only a few points violate the separation may be more desirable. The balance between the margin and the quality of the separation can be controlled via the parameter \(\lambda \) in (18), yielding a regularization path \(R_{\text {SVM}}\) as in (5) (for \(n = l+1\)).
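For concreteness, f and g from (18) can be evaluated as follows (a minimal NumPy sketch; the function and variable names are ours):

```python
import numpy as np

def svm_objectives(w, b, X, y):
    """f and g from (18): X holds the data points x^i as rows,
    y the labels in {-1, 1}."""
    f = 0.5 * np.dot(w, w)                 # squared norm of w (controls the margin)
    margins = 1.0 - y * (X @ w + b)        # 1 - y^i (w^T x^i + b)
    g = np.sum(np.maximum(margins, 0.0))   # hinge loss
    # Indices with margins[i] > 0 contribute to g; indices with margins[i] == 0
    # mark points where the active selection function may change.
    return f, g
```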

Remark 6

In the literature, the roles of f and g in problem (18) are typically reversed. The resulting problem is equivalent to our formulation with the regularization parameter \(\frac{1}{\lambda }\) (except for critical points of f and g) (cf. Section 12.3.2 in [2]). Nonetheless, when the regularization path of the SVM is considered, \(\lambda \) in (18) is more commonly used for its parametrization.

The structure of the regularization path of the SVM was already considered in earlier works. In [11], it was shown that \(R_{\text {SVM}}\) is 1-dimensional and piecewise linear up to certain degenerate points, and a path-following method was proposed that exploits this structure. It was conjectured (without proof) that the existence of these degenerate points is related to certain properties of the data points \((x^i,y^i)\), like having duplicates of the same point or having multiple points with the same margin. In [30], these degeneracies were analyzed further and a modified path-following method was proposed, specifically taking degenerate data sets into account. Other methods for degenerate data sets were proposed in [31,32,33]. In the following, we will analyze how these degeneracies relate to the nonsmooth points we characterized in our results.

Obviously, f is twice continuously differentiable and g is \(PC^2\) with selection functions

$$\begin{aligned} \left\{ (w,b) \mapsto \sum _{i \in I} \left( 1 - y^i (w^\top x^i + b) \right) : I \subseteq \{1,\dots ,N\} \right\} . \end{aligned}$$

Furthermore, both f and g are convex, so \(R_{\text {SVM}}\) coincides with the critical regularization path (cf. (6)). Thus, we can apply our results from Sect. 3 to analyze the structure of \(R_{\text {SVM}}\). Since f is quadratic and all selection functions are linear, Remark A.3 shows that the regularization path is piecewise linear up to points violating Assumptions A1 to A5. Due to the properties of g, Assumption A1 always holds for the SVM, as shown in Remark A.5 in the “Appendix”.

In the following, we will consider the remaining Assumptions A2 to A5 in the context of the SVM and relate them to the degeneracies reported in [11]. We will do this by considering Example 1 from [30], which was specifically constructed to have a degenerate regularization path.

Example 5

Consider the data set

$$\begin{aligned} \left\{ ((0.7,0.3)^\top ,1),\ ((0.5,0.5)^\top ,1),\ ((2,2)^\top ,-1),\ ((1,3)^\top ,-1),\ ((0.75,0.75)^\top ,1),\ ((1.75,1.75)^\top ,-1) \right\} . \end{aligned}$$

The regularization path for this data set can be computed analytically and is shown in Fig. 5a. In the following, we will analyze the points \(x^1\), \(x^2\), \(x^3\) and \(x^4\) highlighted in Fig. 5a with respect to the Assumptions A2 to A5.
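As a rough numerical cross-check of Fig. 5a (not part of the original analysis, and far less efficient than a path-following method), one could approximate points on the path by solving (18) on a grid of \(\lambda \) values with a general-purpose solver; a hedged SciPy sketch:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.7, 0.3], [0.5, 0.5], [2.0, 2.0],
              [1.0, 3.0], [0.75, 0.75], [1.75, 1.75]])
y = np.array([1, 1, -1, -1, 1, -1])

def regularized_objective(z, lam):
    w, b = z[:2], z[2]
    f = 0.5 * np.dot(w, w)
    g = np.sum(np.maximum(1.0 - y * (X @ w + b), 0.0))
    return f + lam * g

# Pointwise approximation of the regularization path on an illustrative grid.
lambdas = np.linspace(0.05, 5.0, 40)
path = [minimize(regularized_objective, np.zeros(3), args=(lam,),
                 method="Nelder-Mead").x for lam in lambdas]
```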

Fig. 5 a Regularization path of the SVM in Example 5 and the points \(x^1 = \frac{1}{372} (-35,-65,137)^\top \), \(x^2 = \frac{1}{93} (-35, -65, 137)^\top \), \(x^3 = \frac{1}{3} (-2,-2,5)^\top \) and \(x^4 = \frac{1}{5} (-4,-4,11)^\top \). b Image of the regularization path with \(y^i = (f(x^i),g(x^i))^\top \), \(i \in \{1,\dots ,4\}\), and the same coloring as in (a)

The point \(x^1\) lies in one of the 2-dimensional parts of the regularization path, and it is possible to show that g is smooth around \(x^1\). It is easy to verify that Assumptions A2, A3 and A5 are satisfied. With regard to Assumption A4, we have \(r = \mathop {\textrm{affdim}}\limits (\textrm{aff}(\{ \nabla f(x^1) \} \cup \partial g(x^1))) = 1\) (cf. Lemma 3) and

$$\begin{aligned} Dh(x,\alpha ,\beta ) = \begin{pmatrix} 2 \alpha & 0 & 0 & -\frac{35}{372} & \frac{14}{5} \\ 0 & 2 \alpha & 0 & -\frac{65}{372} & \frac{26}{5} \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 \end{pmatrix} \end{aligned}$$

with \(\mathop {\textrm{rk}}\limits (Dh(x,\alpha ,\beta )) = 3\) for all \((x,\alpha ,\beta ) \in {\mathbb {R}}^n \times {\mathbb {R}}^{>0} \times {\mathbb {R}}^{>0}\). Thus, A4(b) holds, which by Theorem 2 implies that the regularization path is the projection of an \(n + r + 1 - m = 3 + 1 + 1 - 3 = 2\)-dimensional manifold around \(x^1\), as expected.
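The stated rank can be verified numerically (a quick NumPy check; \(\beta \) does not enter the displayed Jacobian, and the \(\alpha \) values below are arbitrary positive numbers):

```python
import numpy as np

def Dh(alpha):
    # Jacobian at x^1 as displayed above; only alpha enters the entries.
    return np.array([[2 * alpha, 0.0, 0.0, -35 / 372, 14 / 5],
                     [0.0, 2 * alpha, 0.0, -65 / 372, 26 / 5],
                     [0.0, 0.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0, 1.0]])

for alpha in (0.1, 1.0, 10.0):
    assert np.linalg.matrix_rank(Dh(alpha)) == 3
```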

The point \(x^2\) lies in a kink in the regularization path. The subdifferential of g in \(x^2\) can be computed analytically and is shown in Fig. 6(a).

Fig. 6 Gradient of f, subdifferential of g and the (relative) boundary of the convex hull (dashed) in \(x^2\), \(x^3\) and \(x^4\) in Example 5

In this case, we have \(\mathop {\textrm{affdim}}\limits (\textrm{aff}(\partial g(x^2))) = 2\) and \(\nabla f(x^2) \notin \textrm{aff}(\partial g(x^2))\), so Assumption A2 holds. We see that zero lies on the relative boundary of \(\mathop {\textrm{conv}}\limits (\{ \nabla f(x^2) \} \cup \partial g(x^2))\), so Assumption A3 must be violated (by Lemma 5). Furthermore, it is possible to show that the active set changes in \(x^2\), so Assumption A5 is violated as well.

The point \(x^3\) lies in another kink of the regularization path. The corresponding subdifferential of g is shown in Fig. 6b. As for \(x^2\), Assumptions A3 and A5 are violated in \(x^3\). But in contrast to \(x^2\), we have \(\mathop {\textrm{affdim}}\limits (\textrm{aff}(\partial g(x^3))) = 3\), so \(\nabla f(x^3) \in \textrm{aff}(\partial g(x^3)) = {\mathbb {R}}^3\) trivially holds and Assumption A2 is violated. As discussed in Remark 2, this results in a kink in the Pareto front in the image of \(x^3\) under the objective vector \((f,g)\), as can be seen in Fig. 5b.

Finally, \(x^4\) marks a corner of one of the 2-dimensional parts of the regularization path and the corresponding subdifferential is shown in Fig. 6c. As for \(x^3\), Assumptions A2, A3 and A5 are violated in \(x^4\). But unlike \(x^3\), when we consider the image of \(x^4\) in Fig. 5b, we see that there is no kink in \(y^4\). This suggests that the KKT multiplier of f is unique even though Assumption A2 is violated. Note that this is not a contradiction to Lemma 4(b), as 0 lies on the relative boundary of \(\mathop {\textrm{conv}}\limits (\{ \nabla f(x^4) \} \cup \partial g(x^4))\).

4.2 Exact penalty method

Consider the constrained optimization problem

$$\begin{aligned} \begin{aligned} \min _{x \in {\mathbb {R}}^n} f(x)&\\ \text {s.t.} \quad c^1_i(x)&\le 0, \quad i \in \{ 1,\dots ,p \}, \\ c^2_j(x)&= 0, \quad j \in \{ 1,\dots ,q \}, \end{aligned} \end{aligned}$$
(19)

where \(f:{\mathbb {R}}^n \rightarrow {\mathbb {R}}\), \(c^1_i:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\), \(i \in \{ 1,\dots ,p \}\), and \(c^2_j:{\mathbb {R}}^n \rightarrow {\mathbb {R}}\), \(j \in \{ 1,\dots ,q \}\), are continuously differentiable. In order to solve (19) the so-called exact penalty method can be used, where the idea is to solve the (nonsmooth) problem

$$\begin{aligned} \min _{x \in {\mathbb {R}}^n} f(x) + \lambda g(x) \end{aligned}$$
(20)

with a penalty weight \(\lambda \ge 0\) and

$$\begin{aligned} g : {\mathbb {R}}^n \rightarrow {\mathbb {R}}, \quad x \mapsto \left( \sum _{i=1}^p \max (c^1_i(x),0) + \sum _{j=1}^q |c^2_j(x)|\right) . \end{aligned}$$

It is easy to see that g is \(PC^1\) and that a set of selection functions is given by

$$\begin{aligned} \left\{ g_{\theta ,\sigma }:{\mathbb {R}}^n\rightarrow {\mathbb {R}}, \ x \mapsto \sum _{i=1}^p \theta _i c_i^1(x) + \sum _{j =1}^{q} \sigma _j c_j^2(x) : \theta \in \{0,1\}^p, \sigma \in \{-1,1\}^q \right\} . \end{aligned}$$
(21)

The method is based on the theoretical result that there is some \({\bar{\lambda }}>0\) such that every strict local minimizer of (19) is a local minimizer of (20) for every \(\lambda > {\bar{\lambda }}\), i.e., if \(\lambda \) is large enough, then the constrained problem (19) can be solved via the unconstrained problem (20) (cf. [5], Theorem 17.3). In practice, problem (20) will become ill-conditioned if \(\lambda \) is large compared to \({\bar{\lambda }}\). Thus, it is instead solved for multiple, increasing values of \(\lambda \) until a feasible solution is found. This results in a regularization path R as in (5). Note that all feasible points of (19) are critical points of g and the minimizer of (19) is typically the first intersection of the regularization path with the feasible set (when starting in the minimizer of f). In particular, the existence of \({\bar{\lambda }}\) as above implies that the minimizer of (19) is contained in R.
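A minimal sketch of this increasing-\(\lambda \) loop (assuming SciPy; the inner solver, the update factor and the feasibility tolerance are illustrative choices of ours, not prescribed by the method):

```python
import numpy as np
from scipy.optimize import minimize

def exact_penalty(f, cs_ineq, cs_eq, x0, lam0=1.0, factor=10.0, tol=1e-8):
    """Solve (20) for increasing lam until the iterate is feasible for (19)."""
    def g(x):  # sum of constraint violations as in the text
        return (sum(max(c(x), 0.0) for c in cs_ineq)
                + sum(abs(c(x)) for c in cs_eq))
    lam, x = lam0, np.asarray(x0, dtype=float)
    for _ in range(20):                     # cap on penalty increases
        x = minimize(lambda z: f(z) + lam * g(z), x, method="Nelder-Mead").x
        if g(x) <= tol:                     # feasible for (19): done
            break
        lam *= factor                       # otherwise increase the penalty weight
    return x, lam

# For Example 6 below, cs_ineq would hold c^1_1, c^1_2, c^1_3 and cs_eq = [].
```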

In [34], R is analyzed for the case where f is quadratic (and strictly convex) and all \(c^1_i\) and \(c^2_j\) are affinely linear. In this case, R coincides with the critical regularization path \(R_c\) (cf. (6)). It is shown that R is piecewise linear, which coincides with our results in Remark A.3. In [13], the more general case where f and all \(c_i^1\) are convex and all \(c_j^2\) are affinely linear is considered. There, it still holds \(R = R_c\) and it is shown that R is piecewise smooth with kinks occurring where the constraints become satisfied or violated.

Here, we want to use our theory to analyze the critical regularization path \(R_c\) in the more general setting where f, \(c_i^1\) and \(c_j^2\) are merely continuously differentiable. By our results in Sect. 3, we know that \(R_c\) is piecewise smooth up to points where the Assumptions A1 to A5 are violated. In Remark A.6 in the “Appendix”, it is shown that if all \(x \in {\mathbb {R}}^n\) satisfy the linear independence constraint qualification (LICQ), i.e., if

$$\begin{aligned} \{ \nabla c_i^1(x) : c_i^1(x) = 0 \} \cup \{ \nabla c_j^2(x) : c_j^2(x) = 0 \} \end{aligned}$$
(22)

is linearly independent for all \(x \in {\mathbb {R}}^n\), then Assumption A1 always holds and only Assumptions A2 to A5 may cause nonsmoothness in \(R_c\). For these remaining assumptions we consider the following example, where the feasible set is given by continuously differentiable but nonconvex inequality constraints. It is inspired by problem (15) in [13].
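The LICQ in (22) can be checked pointwise by testing whether the active constraint gradients are linearly independent (a hedged NumPy sketch; the interface is ours):

```python
import numpy as np

def licq_holds(x, ineq, eq, tol=1e-8):
    """Check (22) at x: `ineq` and `eq` are lists of pairs (c, grad_c)
    of a constraint function and its gradient; the gradients of all
    constraints active at x must be linearly independent."""
    active = [grad(x) for c, grad in ineq if abs(c(x)) <= tol]
    active += [grad(x) for c, grad in eq if abs(c(x)) <= tol]
    if not active:
        return True                         # no active constraints at x
    G = np.column_stack(active)
    return np.linalg.matrix_rank(G) == G.shape[1]
```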

Example 6

Consider the constrained optimization problem (19) with

$$\begin{aligned} \begin{aligned} f(x)&= \frac{1}{2} x_1^2 + x_2^2 - x_1 x_2 + \frac{1}{2} x_1 - 2 x_2,\\ c^1_1(x)&= - \left( \left( x_1-\frac{1}{2} \right) ^2 + x_2^2 - 1 \right) ,\\ c^1_2(x)&= \left( x_1+\frac{1}{2} \right) ^2 + x_2^2 - 1,\\ c^1_3(x)&= - \left( x_1^2 + \left( x_2 - \frac{1}{2} \right) ^2 - 1 \right) . \end{aligned} \end{aligned}$$

The corresponding critical regularization path \(R_c\) of (20) can be computed analytically and is shown in black in Fig. 7a, consisting of two disconnected paths. The feasible set of the constrained problem coincides with the critical set of g, excluding the three isolated critical points of g. Since \(c_1^1\) and \(c_3^1\) are nonconvex, g is nonconvex as well, which is why \(R_c\) does not coincide with the actual regularization path R in this case. More precisely, R is merely the union of the path from the minimal point of f to \(x^2\) and the intersection of \(R_c\) with the feasible set (cf. Fig. 7).

Fig. 7 a R (black, solid) and \(R_c\) (black) for the exact penalty method in Example 6 and the points \(x^1\approx (0.1614,0.9409)^\top \), \(x^2=(0,\frac{\sqrt{3}}{2})^\top \), \(x^3\approx (-0.8027, 0.9531)^\top \), \(x^4\approx (0.4631, -0.2691)^\top \), with a zoom of the intersection of \(c_3^1(x) = 0\) and \(R_c\). b Image of \(R_c\) with \(y^i = (f(x^i),g(x^i))\), \(i \in \{1,\dots ,4\}\), and the same coloring as in (a). Furthermore, a zoom of the image around \(y^3\)

In the following we will analyze the kinks of \(R_c\), which are located in \(x^1\) to \(x^4\) and between the minimal point of f and \(x^1\) (cf. Fig. 7a). First of all, it is easy to see that kinks occur precisely where constraints become satisfied or violated along \(R_c\). Due to the construction of the selection functions (cf. (21)), this causes Assumption A5 to be violated in these points.

For \(x^1\), the gradient of f and the subdifferential of g are shown in Fig. 8a. We see that Assumption A2 holds and that Assumption A3 is violated since zero lies on the relative boundary of \(\mathop {\textrm{conv}}\limits (\{ \nabla f(x^1) \} \cup \partial g(x^1))\) (cf. Lemma 5). The same behavior occurs in all other kinks except for \(x^2\). For \(x^2\), \(\nabla f(x^2)\) and \(\partial g(x^2)\) are shown in Fig. 8b. In contrast to the other points, Assumption A2 is clearly violated since \(\dim (\textrm{aff}(\partial g(x^2))) = 2 = n\). As discussed in Remark 2, this causes a kink in the image of \(R_c\), which can be seen in Fig. 7b. Moreover, zero lies in the relative interior of \(\mathop {\textrm{conv}}\limits (\{ \nabla f(x^2) \} \cup \partial g(x^2))\) and it is easy to see that Assumption A3 holds.

In addition to the features described so far, the image of \(R_c\) possesses so-called turning points. If we treat the image of \(R_c\) as an actual (continuous) path, then these are points where the direction of the path abruptly turns around. For example, this can be observed in \(y^3\) and \(y^4\) in Fig. 7b. These points were already discussed in [14], and in Example 3.4 therein it was highlighted that they are not necessarily caused by any nonsmoothness of the objectives. Since we are mainly interested in the structure of \(R_c\) in this article, we will leave their analysis for future work.

Fig. 8 Gradient of f, subdifferential of g and the corresponding (relative) boundary of the convex hull (dashed) in \(x^1\) and \(x^2\) of Example 6

Note that all kinks in the previous examples were points where constraints become satisfied or violated, which suggests that the structural results from [13] also hold in our more general nonconvex case, at least for the critical regularization path \(R_c\). Furthermore, \(R_c\) still connects the minimum of f to the solution of the constrained problem (19) (which is the intersection of \(R_c\) with the feasible set). Thus, it might be possible to apply a path-following method similar to the one in [13] to nonconvex problems as well.

5 Conclusion

In this article, we have presented results about the structure of regularization paths for piecewise differentiable regularization terms. We did this by first showing that the critical regularization path is related to the Pareto critical set \(P_c\) of the multiobjective optimization problem consisting of the objective function f and the regularization term g. Afterwards, we analyzed \(P_c\) by reformulating it as a union of intersections of certain sets, which allowed us to apply differential geometry to obtain structural results. During this derivation, we identified five assumptions (A1 to A5) which, when combined, are sufficient for \(P_c\) to have a smooth structure locally around a given \(x^0 \in P_c\). In turn, nonsmooth features of \(P_c\) (like “kinks”) can be classified depending on which of these five assumptions is violated. We demonstrated this by analyzing the regularization paths of the support-vector machine and the exact penalty method.

Based on our results in this article, there are multiple possible directions for future work:

  • We believe that most of our theoretical results would still hold (with only minor adjustments) if we assumed f to be merely piecewise differentiable as well (in this case, the regularized objective function \(f + \lambda g\) would still be piecewise differentiable).

  • Although the MOP (7) considered in this article has only two objectives, multiobjective optimization can handle any number of objectives. In particular, (7) could be formulated for arbitrarily many regularization terms. We believe that results similar to ours (with a higher-dimensional regularization path) could be obtained for this case. This would allow regularization methods such as the elastic net [35] to be incorporated into our framework.

  • While we focused on regularization in this article, our results can also be used in the context of general multiobjective optimization to construct path-following methods for the solution of nonsmooth MOPs, extending [12,13,14, 26].

  • Although we provided the main ingredients for the construction of path-following methods, i.e., a way to compute the tangent space in smooth areas and a characterization of nonsmooth points, their development and actual implementation are still non-trivial. For example, other important ingredients are the computation of new points on \(R_c\) after taking a step along the tangent direction (also known as a corrector), the detection of kinks in the path and the computation of the correct tangent direction after a kink has been found. Treating these problems in our general framework could greatly simplify the development of new path-following methods.