Efficient scalarization in multiobjective optimal control of a nonsmooth PDE

This work deals with the efficient numerical characterization of Pareto stationary fronts for multiobjective optimal control problems with a moderate number of cost functionals and a mildly nonsmooth, elliptic, semilinear PDE-constraint. When “ample” controls are considered, strong stationarity conditions that can be used to numerically characterize the Pareto stationary fronts are known for our problem. We show that for finite dimensional controls, a sufficient adjoint-based stationarity system remains obtainable. It turns out that these stationarity conditions remain useful when numerically characterizing the fronts, because they correspond to strong stationarity systems for problems obtained by application of weighted-sum and reference point techniques to the multiobjective problem. We compare the performance of both scalarization techniques using quantifiable measures for the approximation quality. The subproblems of either method are solved with a line-search globalized pseudo-semismooth Newton method that appears to remove the degenerate behavior of the local version of the method employed previously. We apply a matrix-free, iterative approach to deal with the memory and complexity requirements when solving the subproblems of the reference point method and compare several preconditioning approaches.


Problem formulation
The prevailing notion of optimality in multiobjective optimization is that of optimal compromises or Pareto optimal points, see Definition 1. Evaluating the quality of Pareto optimal points typically involves interpreting the set of function values of all Pareto optimal points, the Pareto front. The aim of this paper is the efficient numerical characterization of the Pareto fronts of nonsmooth multiobjective optimal control problems with few objectives and an elliptic PDE-constraint with max-type nonsmoothness using generalized stationarity conditions, and the comparison of the numerical performance of a weighted-sum approach and a reference point method in terms of efficiency and discretization quality. For ease of presentation, we restrict this exposition to the case of two objectives only, though the scope of this paper can readily be extended to moderately many objective functions using hierarchical approaches, see [3,4,15]. Hence, we consider bicriterial problems of the form

(P)   min over (y, u) of  J(y, u) = (J₁(y, u), J₂(y, u))ᵀ = (j₁(y) + σ₁∕2 ‖u‖²_U , j₂(y) + σ₂∕2 ‖u‖²_U)ᵀ
      s.t.  −Δy + μ max(0, y) = Bu in Ω,   y = 0 on ∂Ω.

In (P), the symbols y and u denote the state and control variables in the corresponding state space V and (possibly finite dimensional) control space U, respectively, the jᵢ denote suitably well-behaved scalar cost functionals, the σᵢ are nonnegative regularization parameters and B : U → L²(Ω) denotes a control-to-right-hand-side mapping. For the detailed assumptions on the problem, we refer to Assumption 1.
Multicriterial optimization problems with nonsmooth PDE-constraints like (P) arise in various physical applications with conflicting objectives, see, e.g., [13,16,19,21]. The combination of generally only (Hadamard) directionally differentiable Nemytskii operators in the constraint and the inherently nonsmooth structure of multiobjective optimization makes the sensitivity and stationarity analysis as well as the numerical solution of these problems rather delicate. Their treatment typically requires specialized stationarity concepts and approaches that do not follow the standard procedure for Gâteaux differentiable problems. The particular case of the PDE-constraint in (P) is rather well understood in terms of existence and regularity of solutions and differentiability properties of the solution operator and has previously been addressed as a constraint in optimization problems in, e.g., [6,7]. The specific structure of the PDE even allows for the derivation of strong stationarity systems when the control space is sufficiently rich, which has been considered in both scalar and multiobjective optimization with an arbitrary number of objectives, see [6,7]. Stationarity conditions of intermediate strength based on the characterization of the subdifferentials of the solution operator to the constraining PDE have been addressed in [6], and the considerations in [7] show that C-stationarity and strong stationarity in fact coincide for ample controls.
Numerically, the Pareto stationary front of problem (P) has been characterized for two and three cost functionals and L²(Ω)-controls in [7] by combining a first-optimize-then-discretize approach and a pseudo-semismooth Newton (PSN) method, and by using a regularization approach. With multiple objectives, computation times can quickly become large with increased fineness of the discretizations of the domain and the Pareto front, making efficient numerics a vital component of these simulations. In [5], the authors discussed a standard offline/online greedy reduced-basis approach for (P) with a single scalar objective function and both low- and high-dimensional control/parameter spaces and compared the results to an adaptive way of generating the reduced basis along the solution process of the PSN. In this paper, our goal is the efficient characterization of the Pareto stationary front of the bicriterial problem (P) for both L²(Ω)-controls and finite-dimensional controls with a specific structure that allows for a combination with the reduced-basis approaches from [5] mentioned above. In the case of L²(Ω)-controls, where strong stationarity conditions were derived in [7], this is essentially a refinement of the numerical techniques employed in that paper. For finite-dimensional controls, which are typically insufficiently "rich" to obtain strong stationarity conditions, however, this requires revisiting the optimality conditions and how they can be used to characterize the Pareto stationary fronts. Following the approaches in [7], we show that, in the case of finite-dimensional controls, a sufficient stationarity system can still be obtained. Though a sufficient system for a necessary optimality condition is unfavorable in general, it turns out that the system we obtain is essentially a strong stationarity system for the optimization problems corresponding to the weighted-sum method and the Euclidean reference point method and is therefore a reasonable system to use for characterizing the Pareto stationary fronts.
The nonlinear structure of the sets involved there, introduced by the nonsmoothness of the PDE, however, does not lend itself directly to a characterization of the front. Instead, a linearization of the same system will be solved by a line-search stabilized PSN approach and combined with a sign condition that is checked a posteriori. The line search essentially eliminates the rare, mesh-dependent non-convergence issue observed for the undamped PSN in [5,7]. We will compare the efficiency and the approximation quality of the front for the weighted-sum and the reference point approach using quantifiable quality measures.
The structure of this paper is as follows: We briefly comment on the assumptions for this paper in Subsect. 1.1. Then, we recall the required notions of Pareto optimality and Pareto stationarity and state the respective first-order systems for (non-)ample controls in Sect. 2. In the case U = L²(Ω), the analytical results are analogous to those in [7] and the strong stationarity results are simply restated, whereas similar conditions in the case U = ℝᵖ are shown to remain sufficient only. In Sect. 3, we briefly recall the weighted-sum method and the reference point method, and show that the sufficient system from the multiobjective setting is equivalent to strong stationarity systems for the scalarized problems. We explain how these methods can therefore be used to characterize the Pareto stationary front. The numerical implementation is explained in detail in Sect. 4, where we present a matrix-free preconditioned limited-memory generalized minimal residual (L-GMRES) method and a line-search globalized pseudo-semismooth Newton (gPSN) method, which will be used to handle the density of the discretization matrices in the reference point method and the convergence issues arising from the nonsmoothness of the PDE-constraint, respectively. We further present two numerical examples, one with (FE-discretized) L²(Ω)-controls and one with inherently finite dimensional controls, in Sect. 5. The interpretation of the numerical results focuses specifically on the effects of different preconditioning strategies for the reference point method. We introduce two quantities to measure the approximation quality of the two methods and use them as a basis for a performance comparison of the two scalarization approaches.

Notation and assumptions on the data
We also set H = L²(Ω). For functions in Y, the Laplacian is understood in the nonvariational sense and the Dirichlet Laplacian Δ : H → Y′ is understood in the very weak sense (see [12, Section 1.9]). Our assumptions on the data are as follows:
1) Ω is a bounded domain that is convex or possesses a C¹,¹-boundary (cf. [10, Section 6.2]),
2) j₁, j₂ : Y → ℝ are weakly lower semicontinuous, twice continuously Fréchet-differentiable and bounded from below,
3) σ₁ ≥ 0, σ₂ > 0, μ ≥ 0,
4) U = ℝᵖ for p ∈ ℕ∖{0} and ‖·‖_U denotes the Euclidean norm, or U = H and ‖·‖_U denotes the L²-norm,
5) B : U → H possesses the following property:
5a) If U = ℝᵖ, the operator B : U → H is linear (and therefore automatically bounded), the pairwise intersections of the sets {bᵢ ≠ 0} ⊂ Ω of bᵢ := B(eᵢ) ∈ H, where eᵢ, i = 1, …, p, denote the unit vectors in U, are Lebesgue null sets, and none of the bᵢ are zero.
5b) If U = H, the operator B : U → H is unitary.
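On a grid, property 5a) can be checked numerically by testing that the discretized columns of B are nonzero and have pairwise disjoint supports (shared grid nodes standing in for a non-null intersection). The following sketch is illustrative; all names are ours, not the paper's:

```python
import numpy as np

def check_ample_structure(B_cols, tol=0.0):
    """Check a discrete analogue of assumption 5a: the supports {b_i != 0}
    of the columns b_i = B(e_i) must be pairwise disjoint and none of the
    b_i may vanish identically."""
    p = len(B_cols)
    for i in range(p):
        if not np.any(np.abs(B_cols[i]) > tol):
            return False  # b_i == 0 is excluded by 5a)
    for i in range(p):
        for j in range(i + 1, p):
            overlap = (np.abs(B_cols[i]) > tol) & (np.abs(B_cols[j]) > tol)
            if np.any(overlap):
                return False  # supports intersect on shared grid nodes
    return True

# Two indicator-type columns supported on disjoint halves of a 1D grid:
x = np.linspace(0.0, 1.0, 101)
b1 = (x < 0.5).astype(float)
b2 = (x >= 0.5).astype(float)
```

For instance, `check_ample_structure([b1, b2])` accepts the disjointly supported pair above, while a repeated or vanishing column is rejected.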

A sufficient condition for Pareto stationarity
Structurally, this section follows [7] closely, where the case U = H = L²(Ω) is dealt with; the analysis there carries over immediately, yielding strong stationarity conditions for the Pareto stationary points. For the case U = ℝᵖ, however, the same reasoning does not hold up: specifically, the system obtained by similar arguments will only be sufficient in general. The main issue in the analysis is the well-known fact that strong stationarity conditions typically require ample controls. Note that the results presented in this section can readily be generalized to an arbitrary finite number of objective functionals.

We start by summarizing the main properties of the solution operator S to the PDE-constraint in the next lemma as a minor extension of [7, Lemma 4.2]. Note that, since we will mostly focus on the case U = ℝᵖ, we do not state analogues of all the results in [7, Lemma 4.2] for the case U = H.

Lemma 1 (Properties of the solution operator S) Let u ∈ U be a control with associated state y = S(u). Then:

Proof Let u ∈ U be arbitrarily given and y = S(u) ∈ Y. Notice that the Y-regularity also follows from Proposition 2.1 in [6].
1) The linearity and boundedness of B imply Lipschitz continuity and Hadamard differentiability analogously to Proposition 2.1 and Theorem 2.2 in [6], and due to the chain rule, the directional derivative w = S′(u;h) of the solution operator solves
2) Note that the operator B† is well defined due to

for every w ∈ Y. Consequently, for every w ∈ Y. ◻

As usual, we denote the well-defined reduced cost functional by Ĵ : U → ℝ², Ĵ(u) = J(S(u), u). Having established the properties of the solution operator, we are ready to review the different notions of Pareto optimality and the optimality conditions that will play a role later on.
Definition 1 (Pareto Optimality) Let ū ∈ U with associated state ȳ = S(ū). The control ū is called:
1) a local weak Pareto optimal point of (P) if an r > 0 exists such that there is no u ∈ U satisfying
2) a local Pareto optimal point of (P) if an r > 0 exists such that there is no u ∈ U satisfying
where the latter inequality is strict for at least one i;
3) a local proper Pareto optimal point of (P) if there are r, C > 0 such that for every u ∈ U satisfying ‖u − ū‖_U < r and

4) a global (weak/proper) Pareto optimal point of (P) if the previous conditions hold with r = ∞.
The image sets under the cost functional J of all controls that are (local/global) (weak/proper) Pareto optimal are called the Pareto fronts corresponding to the respective sense of optimality.
Analogously to [7], we obtain the following corresponding primal optimality conditions.
Theorem 1 If ū ∈ U is a local weak Pareto optimal point of (P), then there exists no direction h ∈ U satisfying

Accordingly, we obtain the corresponding notions of Pareto stationarity.
Definition 2 (Pareto Stationarity) Let ū ∈ U and ȳ = S(ū). The control ū is called:
where the latter inequality is strict for at least one i;

Remark 1 By definition, all proper Pareto optima are Pareto optima, which in turn are all weak Pareto optima. The same holds locally. As a consequence of Theorem 1, all local weak Pareto optima are weakly Pareto stationary and all local proper Pareto optima are properly Pareto stationary. However, Pareto stationarity is generally not necessary for local Pareto optimality. ◊

In the case of L²(Ω)-controls, the version of Tucker's/Motzkin's theorem of the alternative in infinite dimensions in [7, Lemma 4.4] provides the existence of the multipliers appearing in the system of strong stationarity conditions. In the case of finite dimensional controls, we unfortunately have to deal with generally nonlinear

subsets of the spaces involved, and, as it turns out, an extension of the existence results to arbitrary subsets does not hold. Hence, we can only recover one of the implications when the result is generalized to potentially nonlinear subsets of Hilbert spaces.
with the inequality holding strictly for at least one i.
, N, with the inequality holding strictly for at least one i, which shows the claim. ◻

It follows that the technique used to show [7, Theorem 4.5] now yields a sufficient system only, i.e., we obtain the following sufficient adjoint-based optimality system.

Theorem 2 (Sufficient adjoint-based system) 1. Assume that there exist an adjoint state p and a multiplier ᾱ such that ū, ȳ, p, ᾱ satisfy the coupled system (5). Then ū ∈ U is a weak Pareto stationary point of (P). 2. Assume that ū, ȳ, p, ᾱ satisfy the system (5), where the inequality in (5b) is strict, i.e., ᾱᵢ > 0 for i = 1, 2. Then ū is a proper Pareto stationary point of (P) (and thus also a Pareto stationary point).

Corollary 1 Consider the setting of Theorem 2. If the variational inequality in Theorem 2-1) or -2) is modified accordingly and the sign condition is added, then the resulting system is sufficient (but generally not necessary) for weak/proper Pareto stationarity.

Connection to scalarization methods
As shown in Sect. 2, we can obtain an adjoint-based system that is sufficient for Pareto stationarity. When we base numerical characterizations of the Pareto stationary front on this system, we may miss points on the front, since the system is potentially not necessary for Pareto stationarity. In this section, we show that the adjoint system from Theorem 2 is a strong stationarity system for the problems arising in two well-known scalarization methods, i.e., using the adjoint multiobjective stationarity system is no worse an approach than straightforward scalarization. We will briefly explain the scalarization methods, the weighted-sum method (cf., e.g., [8]) and the reference point method (cf., e.g., [14,17]), and how we intend to use them to characterize the Pareto stationary front.

Weighted-sum method (WSM)
For weights α₁, α₂ ≥ 0 with α₁ + α₂ = 1, the optimization problem

(P_α)   min over (y, u) of  α₁J₁(y, u) + α₂J₂(y, u)   s.t. the PDE-constraint of (P)

is called the weighted-sum problem (with non-negative weights α₁, α₂) corresponding to (P). The weighted-sum method is based on solving (P_α) for varying α. The primal optimality conditions for the WSM are given in the following theorem.
Proof The corollary follows analogously to the proof of Theorem 2. However, since the problem is inherently scalar, the reverse implication of Lemma 2 is obtained for free and we obtain equivalence. ◻

Corollary 2 especially implies that the adjoint system (5) is a strong stationarity system for the weighted-sum problem (P_α), i.e., that it is equivalent to the primal necessary stationarity conditions (8) of the weighted-sum problem. Accordingly, it is reasonable to characterize the stationary points of the weighted-sum problems using the system (5). However, system (5) may have multiple solutions, and (in the case of finite dimensional controls) we cannot solve it numerically because of the variational inequality (5c) on the possibly unknown and nonlinear set Im(S′(ū;·)). In practice, we therefore modify (5c)-(5d) and instead consider the system (9), which coincides with the strong stationarity system in the case U = H. If U = ℝᵖ, we additionally check the sign condition (10) for p a posteriori. If the condition is satisfied, then the solution is still a solution to (5) and therefore a weak Pareto stationary point, see Corollary 2. In the weighted-sum algorithm, we can now set α₁ = 1 − α₂ and solve the stationarity system of the WSM for varying α₂ ∈ [0, 1], where α₂ ≠ 0 is required to guarantee well-posedness of the problems. Specifically, we introduce an additional small parameter ε_tol > 0 and choose α₂ in [ε_tol, 1 − ε_tol]. The final procedure of the WSM is summarized in Algorithm 1. The result of Algorithm 1 is a discrete approximation of the set of weakly Pareto stationary points and the corresponding front.
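The structure of the weighted-sum loop can be sketched on a smooth toy problem in which the PDE solve is replaced by closed-form reduced objectives; the functions `J1`/`J2`, the weight grid and the warm start below are illustrative stand-ins, not the paper's gPSN subproblem solves:

```python
import numpy as np
from scipy.optimize import minimize

# Toy reduced objectives standing in for the reduced costs J_1 and J_2:
def J1(u):  # tracking-type surrogate
    return 0.5 * np.sum((u - 1.0) ** 2)

def J2(u):  # control-cost surrogate
    return 0.5 * np.sum(u ** 2)

def weighted_sum_front(n_weights=11, eps_tol=1e-3, dim=2):
    """Solve the weighted-sum problem for alpha_2 on [eps_tol, 1 - eps_tol]
    and collect the resulting objective values (the discrete front)."""
    front = []
    u0 = np.zeros(dim)
    for a2 in np.linspace(eps_tol, 1.0 - eps_tol, n_weights):
        a1 = 1.0 - a2
        res = minimize(lambda u: a1 * J1(u) + a2 * J2(u), u0)
        u0 = res.x  # warm start for the next weight
        front.append((J1(res.x), J2(res.x)))
    return np.array(front)
```

As α₂ grows, the control cost is weighted more strongly, so the computed front traces the trade-off from small J₁ to small J₂.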

Reference point method (RPM)
The reference point problem with the Euclidean norm ‖·‖₂ for a reference point z ∈ ℝ² is given by

(P_z)   min over (y, u) of  F_z(y, u) := ½ ‖J(y, u) − z‖₂²   s.t. the PDE-constraint of (P).

The reference point method is based on solving (P_z) for varying reference points. As the next theorem shows, optimizers of the reference point problems are Pareto optimal.
Theorem 4 Every local (global) solution ū to (P_z) with z ∈ ℝ² such that Ĵ(ū) − z > 0 holds is also a local (global) Pareto optimal point of (P).
Proof We assume that ū ∈ U with ȳ = S(ū) is a local solution to (P_z), i.e., there exists an r₁ > 0 such that for all u ∈ U with ‖ū − u‖_U < r₁ the inequality F_z(ȳ, ū) ≤ F_z(S(u), u) is satisfied. Now, we assume that ū is not locally Pareto optimal, which implies that for every r₂ > 0 there exists u_end ∈ U with ‖u_end − ū‖_U < r₂ and Jᵢ(S(u_end), u_end) ≤ Jᵢ(ȳ, ū) for i = 1, 2, where the latter inequality is strict for at least one i. Since we can choose r₂ arbitrarily small, S and Jᵢ are continuous and Jᵢ(ȳ, ū) − zᵢ > 0 holds for i = 1, 2 by assumption, this implies that
where the second inequality is strict for at least one i. Since the Euclidean norm is strictly monotone [8, Definition 4.19], this implies the contradiction F_z(S(u_end), u_end) < F_z(ȳ, ū) and proves the first statement. The statement for global solutions follows immediately from the first statement by choosing r₁ = ∞. ◻

Primal stationarity conditions for (P_z) are shown in the next theorem. A control ū ∈ U with associated state ȳ = S(ū) is called a stationary point of (P_z) for z ∈ ℝ² if (11) is satisfied.

Corollary 3
Let the control ū ∈ U with associated state ȳ = S(ū) be given.
Proof Part 1) follows analogously to the proof of Corollary 2.
To show part 2), let ū be a stationary point of (P_z) with associated state ȳ = S(ū) such that 0 ≠ J(ȳ, ū) − z ≥ 0. Then part 1) implies that there exists an adjoint state p such that ū, ȳ, p solve (12) for all w ∈ Im(S′(ū;·)). Accordingly, with the weight normalized and the adjoint p rescaled accordingly, system (5) is also satisfied. Thus Corollary 2 implies that ū is a stationary point of (P_α) with α ≥ 0 and α₁ + α₂ = 1. The other implication follows analogously without normalization by choosing the reference point z = J(ȳ, ū) − α, since then 0 ≠ J(ȳ, ū) − z = α ≥ 0. The cases with strict inequalities follow analogously. ◻

Corollary 3 shows that primal stationarity in the two scalarization methods is essentially equivalent. Also note that, except for rather degenerate choices of reference points, system (12) is equivalent to the adjoint multiobjective system (5), i.e., (5) is a strong stationarity condition for the reference point problem (P_z). Again, we generally cannot solve (12) directly in the implementation when U = ℝᵖ because of the variational inequality on a possibly unknown and nonlinear image set. We proceed analogously to the modifications in the WSM, cf. (9), and instead solve (13). For U = L²(Ω), this again coincides with the strong stationarity system from [7]. For U = ℝᵖ we test for the sign condition (10) a posteriori. If it is satisfied as well, ū is a weak Pareto stationary point of (P).
Another central question for the RPM is how suitable reference points can be chosen in the numerical implementation. To this end, we follow the approach presented in [2]. Let k_max denote the maximal number of Pareto stationary points in the numerical implementation and let (y₁, u₁) denote an initial starting point with u₁ being a stationary point of the weighted-sum problem with weights α₁ = 1 − ε_tol and α₂ = ε_tol ≪ 1. Then the first reference point z₂ (corresponding to the second point on the front) is chosen as

z₂ = J(y₁, u₁) + h⟂ η⟂ + h∥ η∥,

where h⟂, h∥ > 0 are scaling parameters. For i = 2, …, k_max − 2, the reference point z_{i+1} is chosen as

z_{i+1} = J(yᵢ, uᵢ) + h⟂ η⟂∕‖η⟂‖₂ + h∥ η∥∕‖η∥‖₂

with η⟂ = zᵢ − J(yᵢ, uᵢ) and η∥ = (−η⟂,₂, η⟂,₁)ᵀ. Note that due to the strong weighting of J₁ at (y₁, u₁), the Pareto front is approximately vertical in the area of the first reference point. This motivates the initial choice η∥ = (0, −1)ᵀ and η⟂ = (−1, 0)ᵀ.
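Assuming the update uses normalized direction vectors (the original display is not reproduced above, so the exact scaling is our assumption), one reference-point step can be sketched as:

```python
import numpy as np

def next_reference_point(J_curr, z_prev, h_perp, h_par):
    """One reference-point update in the style described above: step away
    from the front (eta_perp) and along it (eta_par, a 90-degree rotation).
    All names here are illustrative."""
    eta_perp = z_prev - J_curr                       # from the front toward the old reference point
    eta_par = np.array([-eta_perp[1], eta_perp[0]])  # rotated by 90 degrees
    z_next = (J_curr
              + h_perp * eta_perp / np.linalg.norm(eta_perp)
              + h_par * eta_par / np.linalg.norm(eta_par))
    return z_next
```

With J(y₁, u₁) = (1, 1)ᵀ and the previous reference point (0, 1)ᵀ, the step directions reduce to η⟂ = (−1, 0)ᵀ and η∥ = (0, −1)ᵀ, matching the initial choice above.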
Using this update technique, we end up with the reference point method stated in Algorithm 2.
Note that the stopping criterion implies that if k_max is large enough, then the upper left as well as the lower right corner points of the Pareto front coincide with those of the WSM. If 0 ≠ J(S(ū), ū) − z ≥ 0 holds for all computed stationary points ū, then the result of Algorithm 2 is a discrete approximation of the set of weak Pareto stationary points and the corresponding Pareto front. If one wants to ensure this condition a priori, it is possible, e.g., to choose fixed reference points on shifted coordinate axes. The shift has to be performed such that all reference points are below the lower bounds on Jᵢ, cf. [3].

Numerical implementation
For the numerical realization and tests of the algorithms, we will assume that j₁(y) = ½‖y − y_d‖²_H, j₂(y) = 0 and σ₁ = 0. We fix the domain Ω = (0, 1)² and consider P₁-type finite elements (FE) on a Friedrichs-Keller triangulation of the domain. The measure of fineness of the grids will be h > 0, which denotes the inverse number of square cells per dimension, i.e., the grid will have 2∕h² triangles. We write the coefficient vector of the piecewise linear interpolant of a function w : Ω → ℝ on the grid vertices in sans-serif font and use the same font for the matrices in the discretized settings. We resort to mass lumping for the nonlinear max-term in order to be able to evaluate it componentwise. Inevitably, this introduces a numerical discretization error. Its effects decrease with increasing fineness of the discretization but increase with the coefficient μ that scales the nonlinearity. The corresponding stiffness matrix A ∈ ℝ^{N×N}, mass matrix M ∈ ℝ^{N×N} and lumped mass matrix M̃ ∈ ℝ^{N×N} are given from the FE ansatz functions φᵢ, i = 1, …, N, as

A_{ij} = ∫_Ω ∇φᵢ · ∇φⱼ dx,   M_{ij} = ∫_Ω φᵢφⱼ dx,   M̃_{ii} = Σ_{j=1}^N M_{ij}.

Thus the FE approximation (16) of (9c, e) introduced for the WSM is obtained for some given α ∈ ℝ² that satisfies (9a, b), where B also denotes the FE-discretized version of the linear operator B, and D_x : ℝᴺ → ℝ^{N×N} maps a vector to the diagonal matrix whose diagonal entries are the Heaviside function with value x at 0 evaluated at each entry of the vector. The matrix U = I_p ∈ ℝ^{p×p} is the identity if U = ℝᵖ, and U = M is the mass matrix if U = H. Note that this means that, depending on the space U, sometimes sans-serif notation would be appropriate for the (discretized) control u. To avoid any misunderstandings, we will always denote u without sans-serif style. These finite dimensional systems are solved with a globalized version of a pseudo-semismooth Newton (PSN) method (which essentially ignores the indicator functions' dependence on the state in the continuous system, i.e.
the Heaviside functions' dependence in the discretized system, when the linearization of the systems is computed). For more details on the PSN without globalization, we refer to [5,7]. The FE system matrix (17) at the iterates (y_h, p_h, u) reads as: We proceed analogously for the RPM and discretize (13b, d) using finite elements, which yields (18). These discretized systems are solved with a PSN method as well. The FE system matrix (19) at the iterates (y_h, p_h, u) reads as:

with

Remark 2 For both methods, the sign condition (10) is not added to the discretized stationarity system and is instead verified a posteriori if U = ℝᵖ. ◊

Compared to the system matrix of the single-objective case presented in [5], the system matrix of the RPM in particular is more complicated and possesses a nonsparse substructure. This is due to the matrices that contain the dense rank-1 terms M(y_h − y_d)(M(y_h − y_d))ᵀ and (σ₂Uu)(σ₂Uu)ᵀ. These can cause severe memory and runtime problems when the reference point problem is solved on fine finite element grids with a direct linear solver. Due to these restrictions, the subproblems of the reference point method will generally take longer to solve than those of the WSM, even on coarser grids.
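The row-sum mass lumping and the Heaviside-type diagonal matrix used in the discretization above can be sketched as follows (function names are ours; the value of the Heaviside function at 0 is kept as a parameter):

```python
import numpy as np
import scipy.sparse as sp

def lump_mass(M):
    """Row-sum mass lumping: replace M by diag(M @ 1), so that the
    nonlinear max-term can be evaluated componentwise."""
    return sp.diags(np.asarray(M.sum(axis=1)).ravel())

def heaviside_diag(y, value_at_zero=0.0):
    """Diagonal matrix with the Heaviside function (with a prescribed
    value at 0) applied entrywise to the coefficient vector y."""
    d = np.where(y > 0.0, 1.0, np.where(y < 0.0, 0.0, value_at_zero))
    return sp.diags(d)
```

The lumped matrix is diagonal, so applying the nonlinearity or its Heaviside-type linearization reduces to entrywise operations on the coefficient vector.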
One thing to keep in mind when applying the PSN method is that there is no guarantee of convergence, as seen in the numerical examples in [7]. The method does in fact occasionally fail to converge in practice, with the rate of failed attempts over the subproblems decaying as the fineness of the grid discretization is increased. This suggests some sort of degeneration of the undamped search directions that could be countered with a globalization mechanism.
Accordingly, the two questions that we will address in the remainder of this section are the following: 1) Can the non-convergence issue of the PSN method itself be removed? 2) Is it possible to reduce computation times and memory problems of the (linear) PSN steps in the reference point method to make it competitive in terms of computation times?

Globalized PSN
The numerical experiments in [7] indicate that non-convergence of the PSN is an issue that strongly depends on (insufficiently fine) discretizations. As a stabilization approach that is independent of the grid fineness, we will present and test a line-search globalization of the PSN based on results for semismooth Newton methods as in, e.g., [9] and [11], which will be referred to as the gPSN method. Let us assume that we want to find a root of a function F : ℝ^{2N+p} → ℝ^{2N+p} with system matrix G : ℝ^{2N+p} → ℝ^{(2N+p)×(2N+p)}.
This will either be (16) with system matrix (17) for the subproblems in the WSM or (18) with system matrix (19) for the subproblems in the RPM. We will employ a linesearch globalization with the merit function Λ ∶ ℝ 2N+p → ℝ, x ↦ 1 2 ‖F(x)‖ 2 .

Remark 3
Note that the norm in the merit function Λ is the discrete equivalent of the norm in V′ × V′ × U. Hence it is generally expensive to evaluate, since we need to compute a Riesz representative. However, we will precompute the necessary factorizations to speed up the computations to some extent. ◊

The gPSN method is summarized in the following algorithm. Note that we cannot conclude, e.g., from the theory on globalized semismooth Newton methods, that the algorithm converges without introducing a maximum number k_max of PSN steps and a minimum admissible step length.
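A minimal sketch of such a line-search globalized Newton-type iteration with the merit function Λ(x) = ½‖F(x)‖² can look as follows (here with the Euclidean norm instead of the discrete dual norms, and with illustrative Armijo and safeguard parameters):

```python
import numpy as np

def gpsn(F, G, x0, tol=1e-10, k_max=50, beta=0.5, gamma=1e-4, t_min=1e-12):
    """Line-search globalized Newton-type iteration for F(x) = 0 with
    merit function Lambda(x) = 0.5 * ||F(x)||^2; G(x) returns a
    (pseudo-)Jacobian of F at x. Illustrative sketch, not the paper's code."""
    x = x0.astype(float)
    for _ in range(k_max):
        Fx = F(x)
        lam = 0.5 * Fx @ Fx
        if np.sqrt(2.0 * lam) < tol:
            return x, True
        d = np.linalg.solve(G(x), -Fx)   # (pseudo-)Newton direction
        t = 1.0
        while t > t_min:                  # backtracking Armijo line search
            Fn = F(x + t * d)
            if 0.5 * Fn @ Fn <= (1.0 - 2.0 * gamma * t) * lam:
                break
            t *= beta
        x = x + t * d
    return x, bool(np.linalg.norm(F(x)) < tol)
```

The Armijo test uses the standard sufficient-decrease condition for the squared-residual merit function along a Newton-type direction.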
It turns out that there are examples where Algorithm 3 converges while a refinement of the grid does not yield convergence, see Sect. 5. However, it can still happen that the gPSN method does not converge. Obviously, it would not be a good idea to nonetheless add the final iterate of the gPSN method to the Pareto set. Instead, if the PSN does not converge within the RPM, we update the reference point as follows: If no previous solution to the reference point problem is available, we choose
If previous solutions to the reference point problem are available, we choose
Essentially, previous information is used repeatedly to find a new reference point by moving further in the same parallel direction. If the gPSN is used within the WSM and does not converge, we can simply proceed to the next discretized weight.

A preconditioned matrix-free L-GMRES method
We will now focus on how to speed up the computation and how to overcome the difficulties arising from the dense terms in the RPM. Notice that both dense terms are rank-1 matrices. Therefore it is easy to implement the matrix-vector product without ever forming them. This motivates the use of an iterative solver that relies only on matrix-vector products in each PSN step for solving the reference point subproblem. Since the system is not symmetric positive definite, the CG method is not an option and we will use L-GMRES instead. As the performance heavily depends on the condition number of the system matrix (see [18]), which might be very large, especially for very small values of the regularization parameter σ₂, we will precondition the method with one of the following preconditioners:
1. a: The dense terms are omitted in an approximated system matrix that is used as a preconditioner.
2. aBJ: A block Jacobi preconditioner is applied with the approximation described for the preconditioner a.
3. aBGS: A block Gauss-Seidel preconditioner is applied with the approximation described for the preconditioner a.
4. aILU: An incomplete LU factorization together with the approximation described for the preconditioner a is applied.
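In SciPy, a matrix-free operator for a sparse matrix plus a rank-1 term, A + vvᵀ, together with an approximate preconditioner in the spirit of a (the rank-1 term dropped before factorizing), can be sketched as follows (a small symmetric toy matrix; names and sizes are illustrative):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, lgmres, splu

n = 50
# Sparse part of the system (a 1D Laplacian stencil as a stand-in):
A = sp.diags([2.0 * np.ones(n), -np.ones(n - 1), -np.ones(n - 1)],
             [0, -1, 1], format="csc")
v = np.ones(n) / np.sqrt(n)  # rank-1 factor, kept as a vector

def matvec(x):
    # (A + v v^T) x without ever forming the dense rank-1 matrix
    return A @ x + v * (v @ x)

op = LinearOperator((n, n), matvec=matvec)

# Preconditioner in the spirit of "a": factorize only the sparse
# approximation with the dense rank-1 term omitted.
A_lu = splu(A)
prec = LinearOperator((n, n), matvec=A_lu.solve)

b = np.ones(n)
x, info = lgmres(op, b, M=prec, atol=1e-12)
```

The rank-1 product costs only two vector operations per application, so memory stays linear in the problem size.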
Of course, the same iterative, preconditioned approach can be implemented for the WSM, with the only difference being that no approximation by ignoring dense terms is necessary for the preconditioner. Note that standard Jacobi, Gauss-Seidel, block Jacobi, block Gauss-Seidel and incomplete LU factorization preconditioners were also tested. The first two did not give any speed-up, and the remaining ones were still significantly less effective than the preconditioners above due to the dense terms still remaining. As expected, it is quite important to exploit the block structure of the problem as much as possible and to avoid the dense terms.

Remark 4
As long as σ₂uᵀUu∕2 − z₂ ≠ 0, the invertibility of the aBJ preconditioner is ensured for the RPM, since the first two block diagonal elements are symmetric positive definite and the last block diagonal element is symmetric and either positive or negative definite (see (19)). Analogously, in the case of the WSM, the invertibility is always ensured. ◊

For the four different preconditioners above, we propose three different update strategies:
1. Never update: Only one preconditioner is generated for the first iteration of the first subproblem; this preconditioner is then used for all subproblems and all gPSN iterations.
2. Update once: One preconditioner is generated for each subproblem and then used for all gPSN iterations.
3. Always update: The preconditioner is generated anew for each gPSN iteration of each subproblem.
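A block Jacobi preconditioner in the spirit of aBJ can be sketched by factorizing the diagonal blocks of the approximate system matrix once and reusing the factorizations for every application (the block partition here is illustrative):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, splu

def block_jacobi_prec(blocks):
    """Block Jacobi preconditioner: factorize each diagonal block once and
    apply blockwise solves. 'blocks' are the diagonal blocks of the
    approximate system matrix (dense rank-1 terms omitted, as for a)."""
    lus = [splu(sp.csc_matrix(b)) for b in blocks]
    sizes = [b.shape[0] for b in blocks]
    offsets = np.cumsum([0] + sizes)
    n = int(offsets[-1])

    def solve(r):
        out = np.empty_like(r)
        for lu, a, b in zip(lus, offsets[:-1], offsets[1:]):
            out[a:b] = lu.solve(r[a:b])
        return out

    return LinearOperator((n, n), matvec=solve)
```

Under the never update strategy, `block_jacobi_prec` would be called once and the returned operator reused across all subproblems and gPSN iterations.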

Numerical examples
In this section, we present numerical results for two examples, one with a finite and one with an infinite dimensional control space. First, the focus of our exposition will be on the performance of the RPM and the different preconditioning strategies. After the best update strategy is identified, we will compare the RPM and the WSM.
In order to reasonably quantify the quality of the approximation of the respective Pareto (stationary) fronts, we employ two quality measures. The maximal distance between neighboring points on the Pareto front will be our first measure. As a second measure of approximation quality, we will consider the maximum over all points on the front of the shortest distance to any other point, divided by the average of these shortest distances; this quantity is therefore bounded from below by one. If this quantity is small, this indicates that the approximation quality is somewhat uniform across the entire Pareto front, while a large value indicates that some parts of the Pareto front are approximated better than others, i.e., a localized clustering.
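For a front stored as an ordered array of points, the two quality measures can be sketched as follows (names are ours):

```python
import numpy as np

def front_quality(front):
    """front: (k, 2) array of Pareto-front points, ordered along the front.
    Returns the maximal neighbor distance and the uniformity ratio
    (max shortest distance / average shortest distance, >= 1)."""
    diffs = np.linalg.norm(np.diff(front, axis=0), axis=1)
    max_neighbor = diffs.max()

    # shortest distance from each point to any other point on the front
    d = np.linalg.norm(front[:, None, :] - front[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    shortest = d.min(axis=1)
    uniformity = shortest.max() / shortest.mean()
    return max_neighbor, uniformity
```

A perfectly equidistant front attains the lower bound 1 for the uniformity ratio, while clustering inflates it.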
Note that in all results presented here, the sign condition (see Remark 2) is satisfied and the gPSN method always converges.
Our code is implemented in Python3 and uses FEniCS [1] for the matrix assembly. Sparse memory management and computations (especially L-GMRES) are implemented with SciPy [20]. All computations below were run on an Ubuntu 20.04 notebook with 32 GB main memory and an Intel Core i7-8565U CPU.

The numerical examples
First we introduce the two examples. The parameters listed in Table 1 are fixed for the rest of this work.

Example 1 – Infinite dimensional controls
For the first numerical example, the desired state is chosen to promote nonsmoothness. The control space U is chosen as H = L²(Ω), the operator B as the identity on U, and the weighting matrix is the mass matrix.

Example 2 – Finite dimensional controls
For the second numerical example, we choose y_d(x) = 1∕2 − x₁ sin(x₁) sin(x₂). The space U is chosen as ℝ², the weighting matrix in ℝ²ˣ² is the identity, and the operator B maps into L²-functions on Ω taking values in ℝ, which are embedded into V′. For plots of the operator B and the desired state we refer to [5, Fig. 1 and 4].

Preconditioning the reference point method
In this section, we consider the different preconditioning approaches for the RPM. To this end, we additionally choose the regularization parameter σ₂ = 5 ⋅ 10⁻³, the parameters h⟂ = 10 and h∥ = 0.1 in the RPM, and a tolerance of 1 ⋅ 10⁻⁴ for the gPSN. Note that this arguably large tolerance is necessary for the L-GMRES method to converge without a preconditioner, because the (sometimes badly conditioned) problem is numerically difficult to solve. The results for Example 1 are given in Table 2.
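For reference, the following sketch shows how a preconditioner enters SciPy's L-GMRES solver. The system here is a small random SPD stand-in, and the exact sparse LU solve merely mimics the role of the aBJ preconditioner; neither is the actual Newton system of the paper.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 50
A = sp.random(n, n, density=0.1, random_state=0)
A = (A @ A.T + n * sp.identity(n)).tocsc()  # SPD stand-in for the Newton system
b = np.ones(n)

# Stand-in preconditioner: an exact sparse LU solve plays the role of aBJ here
M = spla.LinearOperator((n, n), matvec=spla.splu(A).solve)

# info == 0 signals convergence; M is applied matrix-free in every iteration
x, info = spla.lgmres(A, b, M=M, atol=1e-10)
```

Because `M` is a LinearOperator, only its action on a vector is ever needed, which is what makes the matrix-free approach feasible for the memory-critical subproblems of the RPM.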
First of all, we can see that the computation time decreases for all preconditioning approaches. Moreover, the average number of L-GMRES iterations is very small (between 2 and 3.5) and increases only slightly for smaller step sizes h. The latter observation is in contrast to the non-preconditioned solves, which already start with a large average number of L-GMRES iterations that moreover increases significantly for smaller step sizes h. Furthermore, cheaper preconditioners lead to a larger speed-up if the preconditioner is updated more often: with the preconditioning strategy always update, the preconditioner aBJ still gives a significant speed-up of about 35, while none of the other preconditioners achieves a speed-up above 9. Nonetheless, the best preconditioning approach is, perhaps surprisingly, the never update strategy combined with the preconditioner aBJ. This indicates that the problem structure does not change significantly with the current reference point, so that a computationally expensive update of the preconditioner is unnecessary.

Next, we consider the results for Example 2, which can be found in Table 3. We basically observe the same behavior as before, and again never update combined with aBJ is the best preconditioning approach. Note that this finite dimensional example is inherently better conditioned; thus, combinations of more expensive preconditioners such as a, aBGS and aILU with expensive preconditioning strategies such as update once and always update often lead to larger computation times than solving without a preconditioner. Furthermore, the preconditioner aILU seems to be unstable, since its average number of L-GMRES iterations increases significantly for smaller step sizes h. Note that aILU includes some parameters which could be tuned and might improve this behavior, but we do not go into details here.
Next, we consider a fixed step size h = 1∕100 and investigate the behavior of the different preconditioning approaches for varying σ₂. We expect larger condition numbers for smaller values of σ₂ and therefore problems that are numerically harder to solve. Since the strategies update once and always update and the preconditioner aILU did not prove useful, they are excluded from these considerations. The results can be found in Table 4 for Example 1 and in Table 5 for Example 2.
In both examples, the average number of L-GMRES iterations increases slightly for smaller values of σ₂, with a stronger increase in the first example. If, on the other hand, no preconditioner is used, there is a significant increase for smaller values of σ₂. This is especially true for the first example, which starts with an average of 18.75 iterations for σ₂ = 1 and ends with an average of 501.07 iterations for σ₂ = 10⁻³. In the second example there is only an increase from 10.48

Comparison of the RPM and the WSM
In this section, we compare the performance of the reference point method and the weighted-sum method both in terms of computation times and discretization quality. We choose a step size of h = 1∕100, a tolerance of 10⁻⁵ in the gPSN, and h⟂ = 1, h∥ = 0.2 in the RPM. We consider σ₂ = 1 for both examples. The problems are then relatively well conditioned, but the Pareto fronts are harder to approximate than for smaller values of σ₂.
In order to make the results of the RPM and the WSM qualitatively comparable, we first run the RPM, which yields a number of discretization points on the front. Afterwards, we run the WSM with k_max (the number of Pareto points) chosen as the number of discretization points generated by the RPM, so that both approximated fronts initially contain the same number of points. However, the WSM tends to cluster the discretization points. In order to obtain a comparable approximation quality, we therefore double the number of points in the WSM until the maximal distance between points on the Pareto front (see (22)) falls below the maximal distance obtained by the RPM. Afterwards, the parameter h∥ is halved as often as k_max was doubled in the WSM before, in order to compare the evolution of the two quality measures for an increasing size of the approximated Pareto front.
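The comparison protocol above can be sketched as follows; `run_rpm` and `run_wsm` are hypothetical drivers returning approximated fronts as ordered point lists, and `quality_measure` returns the maximal gap on a front.

```python
def compare(run_rpm, run_wsm, quality_measure):
    """Double the number of WSM points until its maximal gap on the front
    is at most that of the RPM; return the number of doublings needed and
    the final WSM front."""
    d_max_rpm = quality_measure(run_rpm())
    k_max = len(run_rpm())  # start with as many points as the RPM produced
    doublings = 0
    while True:
        front_wsm = run_wsm(k_max)
        if quality_measure(front_wsm) <= d_max_rpm:
            return doublings, front_wsm
        k_max *= 2
        doublings += 1
```

With a clustering WSM (points accumulating at one end of the front), several doublings can be needed before the largest gap matches a uniform RPM front of far fewer points, which is exactly the effect observed in the experiments below.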
The resulting Pareto fronts for step size 1∕h = 100 and σ₂ = 1 are shown in Fig. 1.
We can see that both methods approximate the same curve. However, the WSM shows a clustering behavior in the lower right corner whilst giving only a poor approximation on the remainder of the Pareto front. This already indicates that some refinement of the weight distribution is generally advisable for the WSM. Fig. 2 shows the evolution of the quality measures for the WSM and the RPM under the procedure described above.
In the left plots, we can see that for the WSM the number of points on the Pareto front needs to be doubled seven times in order to reach a maximal distance between points on the front that is smaller than that of the RPM. Furthermore, the approximation quality deteriorates every time the size of the Pareto front is doubled, which is due to the clustering behavior in the lower right corner. As a result, an unnecessarily large number of points on the Pareto front is needed to reach a desired maximal distance. With the RPM, on the other hand, the quality measures even decrease whilst h∥ is halved. This behavior can be seen in the right plots and is even better than expected. Note that the size of the Pareto front approximately doubles when h∥ is halved.
The question remains which method performs better in terms of computational cost. A comparison of the results from the RPM with the first and last results from the WSM in the procedure of doubling the number of discretization points is shown in Table 6.
The observations for both examples are similar with respect to the sizes of the Pareto fronts and the measures of approximation quality. Moreover, whilst the RPM is slightly slower than the WSM if the same size of the Pareto front is used, it is about 28 times faster than the WSM when an at least equally good approximation quality is desired.

Conclusion
If the controls on the right-hand side of the PDE constraining (P) are finite dimensional, then conditions that imply primal stationarity can be found. These conditions can be interpreted as strong stationarity systems of scalarization methods. They are, however, not directly usable numerically, because they contain unknown nonlinearities in the spaces in which the conditions are formulated. Modifying the conditions, we ended up with linear systems as in the case of ample controls. We have shown that both the WSM and the RPM can be applied to characterize the front of Pareto stationary points for this nonsmooth problem. The reference point method performs significantly better when both approximation quality and computation time are considered, provided that the L-GMRES solver is preconditioned intelligently. In our tests, the preconditioner aBJ without updates performed best. We also saw that the line-search globalized version of the PSN method leads to better performance and convergence compared to the basic version with uncontrolled step lengths; a simple reduction of the step size does not appear to guarantee this behavior numerically.

Conflict of interest
The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.