Deep Learning the Efficient Frontier of Convex Vector Optimization Problems

In this paper, we design a neural network architecture to approximate the weakly efficient frontier of convex vector optimization problems (CVOP) satisfying Slater's condition. The proposed machine learning methodology provides both an inner and outer approximation of the weakly efficient frontier, as well as an upper bound to the error at each approximated efficient point. In numerical case studies we demonstrate that the proposed algorithm is effectively able to approximate the true weakly efficient frontier of CVOPs. This remains true even for large problems (i.e., many objectives, variables, and constraints), thus overcoming the curse of dimensionality.


Introduction
Vector and multiobjective optimization is the field of maximizing or minimizing multiple objective functions simultaneously. As with traditional (scalar) optimization, these problems can be classified as, e.g., linear, convex, or nonconvex. Within this work we focus on convex vector optimization problems (CVOP). Such problems are widely applicable in practice. Specifically, multiobjective optimization has been applied to fields as diverse as finance (for the mean-risk problem of [23] in [21] as well as for set-valued risk measures [12]), electric and power systems [6], structural design ([24] and references therein), and chemical engineering [3].
Convex vector optimization problems are those with two or more convex objectives and convex constraints. Multiple algorithms already exist to approximate the solutions of these problems (or of specific subclasses of such problems). We refer the interested reader to [26] for an overview of some algorithms for CVOPs. For numerical tractability, these methods result in (polyhedral) approximations of the efficient frontier. That is, rather than finding the exact solution, only an (inner and/or outer) approximation of the efficient frontier can be found. We wish to highlight algorithms based on Benson's approximation algorithm [2] for linear vector optimization problems, which have been generalized to the convex case in [9,22,8,17]. While powerful, these algorithms suffer from the curse of dimensionality as the computational complexity of such problems grows exponentially in the dimension of the problem; e.g., the vertex enumeration subproblem is NP-hard [18]. Typically in practice, when the number of objective functions is five or higher, those algorithms cannot solve the vector optimization problem anymore; this computational bound holds for convex or even linear vector optimization problems. Such computational constraints similarly hold true for other types of algorithms, e.g., the box coverage algorithm of [11] for computing approximate solutions of multiobjective optimization problems. In Section 5.2 we will revisit a test instance from [11] that could not be solved (in a reasonable amount of time, or at all for higher dimensions) for dimension four and higher. In contrast, with the machine learning methodology that we propose in this paper, we can still solve that problem for a 20 dimensional objective space in under a minute on a personal machine. Similarly, the sandwich algorithm from [4] provides an inner and outer approximation of the efficient frontier, but also suffers from the curse of dimensionality as it involves vertex enumeration. In [4] problems with up to 14 objectives have been solved. We will reuse Example 7.2 from [4] and solve it with up to 5000 objectives instead.
Within this work we construct a new approximation method for CVOPs using neural networks. The primary contributions and innovations of this work are two-fold.
1. We propose a machine learning methodology to (approximately) solve CVOPs by considering the weighted-sum scalarizations of the multiobjective problem. Due to the use of neural networks, the proposed algorithm can tractably be applied to large vector optimization problems without the need for vertex enumeration (as required by, e.g., [2,4,22]). That is, we are able to overcome the curse of dimensionality in solving vector optimization problems. In fact, as demonstrated in numerical case studies, the computational time required for the machine learning approach undertaken herein grows (roughly) linearly with the problem size.
2. By proposing two neural networks that are jointly trained, the proposed machine learning methodology provides both inner and outer approximations of the entire weakly efficient frontier of the relevant CVOP. In constructing the algorithm in this way, we are able to quantify (an upper bound to) the error of these approximations at each point on the weakly efficient frontier. By having inner and outer approximations with a known error, this machine learning methodology can be applied in practice without the typical risk of machine learning, namely the uncertain approximation error. In comparison to a naive application of machine learning for vector optimization problems, these neural networks were motivated by the vertex enumeration approaches (e.g., [4,22,11]), which also provide inner and outer approximations.
The organization of this paper is as follows. In Section 2, the background, notation, and structure of both CVOPs and neural networks are provided. In Section 3, the proposed neural network architecture for approximating the weakly efficient frontier of a strictly convex vector optimization problem is presented. Specifically, a primal neural network is proposed in Section 3.1, a dual neural network in Section 3.2, and the joint loss function for these neural networks in Section 3.3. Then, in Section 4, inner and outer approximations of the upper image of the problem are constructed from these neural networks. Within this section, we additionally provide a function which gives an upper bound to the errors of the inner approximation at each point on the weakly efficient frontier. We implement the proposed machine learning approach for four strictly convex optimization problems in Section 5. We use these problems both to validate the proposed approach and to demonstrate its applicability to large-scale problems, thus overcoming the curse of dimensionality. Finally, in Section 6, we extend the machine learning framework of Section 3 to allow for convex, but not strictly convex, vector optimization problems. This is demonstrated on two small problems, a linear vector optimization problem and the mean-risk portfolio optimization problem, that permit simple visualizations.

Background
Within this section, we provide the background material necessary for this work. First, in Section 2.1, the vector optimization setting that is considered throughout this work is presented along with basic definitions and results. Second, in Section 2.2, some background on neural networks and their usage for approximating, i.e., learning, a target function is given.

Vector optimization problems
Consider a decision maker attempting to simultaneously minimize P objective functions f_i : X → R for i = 1, ..., P over the set of decisions X ⊆ R^N. Throughout, we assume that the decision space X ≠ ∅ is nonempty and that the ordering cone in the objective space is R^P_+. To simplify notation, we denote f(x) := (f_1(x), ..., f_P(x))^⊤ for any x ∈ X. That is, throughout this work we consider a vector optimization problem (VOP) of the form

min{f(x) | x ∈ X}.    (1)

Before continuing to the main results of this work, we provide some basic definitions and results on vector optimization. We refer the interested reader to, e.g., [16] for more details on these problems and definitions. We denote by P := cl(f(X) + R^P_+) the upper image (or upper set) of (1). For the scalarization weights we write W := {w ∈ R^P_+ | Σ_{i=1}^P w_i = 1} and, for w ∈ W, consider the weighted-sum scalarization problem

min{w^⊤ f(x) | x ∈ X}.    (wP)

Within this work we are specifically interested in problems with feasible space X = {x ∈ R^N | g_j(x) ≤ 0, j = 1, ..., M} for g : R^N → R^M. In particular, we are interested in convex vector optimization problems (CVOP), i.e., problems of the form

min{f(x) | g_j(x) ≤ 0, j = 1, ..., M}    (P)

such that f_i, g_j are convex functions for every i = 1, ..., P and j = 1, ..., M. It is well known that (weak) minimizers of (P) are related to solutions of the weighted-sum scalarization problem.

Proposition 2.3. Consider the CVOP (P).
1. x* is a weak minimizer if it solves the weighted-sum scalarization problem (wP) for some w ∈ W.
2. x* is a minimizer if it solves the weighted-sum scalarization problem (wP) for some w ∈ W ∩ R^P_++.

Proof. The first statement follows from Corollary 5.29 of [16]. The second statement follows from Theorem 5.18(b) of [16].
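To illustrate how the weighted-sum scalarization (wP) traces the weakly efficient frontier, the following sketch solves the scalarized problem on a toy strictly convex bi-objective instance; the objectives, solver choice, and grid of weights are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative strictly convex bi-objective instance (not from the paper):
# f1(x) = ||x - e1||^2 and f2(x) = ||x - e2||^2.
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
f = [lambda x: float(np.sum((x - e1) ** 2)),
     lambda x: float(np.sum((x - e2) ** 2))]

def solve_wP(w):
    """Solve the weighted-sum scalarization min_x w^T f(x) numerically."""
    return minimize(lambda x: w[0] * f[0](x) + w[1] * f[1](x),
                    x0=np.zeros(2)).x

# Tracing w over the unit simplex W yields weakly efficient points; for
# this instance the scalarized minimizer is x*(w) = w1*e1 + w2*e2.
ws = [np.array([t, 1.0 - t]) for t in np.linspace(0.05, 0.95, 5)]
frontier = [f[0](solve_wP(w)) for w in ws]  # first objective along the frontier
```

For this toy instance the minimizer is available analytically, which makes it convenient for checking a numerical scalarization routine before applying it to harder problems.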
Assumption 2.4. Throughout this work we will make the following assumptions:
• f_i are strictly convex and continuously differentiable;
• g_j are convex and continuously differentiable;
• Slater's condition holds, i.e., there exists some x̄ ∈ R^N such that g_j(x̄) < 0 for every j = 1, ..., M.
Remark 2.5. Let us remark on some possible relaxations of these assumptions.
1. Note that the strict convexity of the objective functions f_i as encoded in Assumption 2.4 could be relaxed for all results presented within this work so long as the weighted-sum scalarization problem (wP) is strictly convex for all scalarizations w ∈ W of interest. This will be used in the example presented in Section 5.4. It will also prove useful in Section 6 to allow for general convex, but not strictly convex, objective functions f_i by augmenting the problem with an additional strictly convex objective function. This augmentation ensures that the new objective has strictly convex scalarizations for all w ∈ W of interest.
2. As presented above, we focus solely on inequality constrained CVOPs within this work. We wish to note that linear equality constraints can be introduced within this setting as well. Consider the CVOP

min{f(x) | g_j(x) ≤ 0, j = 1, ..., M; Ax = b}

for A ∈ R^{O×N} and b ∈ R^O. Let A^⊥ ∈ R^{N×(N−O)} provide a basis for the null space of A; that is, AA^⊥ z = 0 for any z ∈ R^{N−O}. Furthermore, let x̂ ∈ R^N define a particular solution of the linear constraints, i.e., Ax̂ = b. The equality constrained CVOP can then be reformulated in the form (P) as

min{f(x̂ + A^⊥ z) | g_j(x̂ + A^⊥ z) ≤ 0, j = 1, ..., M}

with N − O variables. Provided this reformulated problem satisfies Assumption 2.4, the methodology considered within this work can be applied directly to this reformulation. This will be used in the examples considered in Sections 5.4 and 6.
3. Herein we take the ordering cone R^P_+. All results can be considered in which the minimization is taken with respect to an ordering cone C ⊆ R^P with nonempty interior instead. The only modification necessary is that the scalarizations are taken with respect to weights w in (a normalized base of) the dual cone C^+ := {w ∈ R^P | w^⊤ c ≥ 0 for all c ∈ C}.

The weakly efficient frontier can be traced through considerations of w ∈ W ↦ f(x*(w)), where x*(w) is a solution to the scalarization problem (wP) in Proposition 2.3 for every w ∈ W. We recall that the goal of this work is to determine (an approximation of) the weakly efficient frontier. As such, we seek to approximate the mapping w ∈ W ↦ x*(w) rather than just finding its value at a finite number of scalarizations.
In addition to the primal scalarization problem (wP) we will also consider its Lagrange dual problem. For that, consider the dual function d : R^M × W → R defined by

d(λ, w) := inf_{x ∈ R^N} [w^⊤ f(x) + λ^⊤ g(x)].    (3)

The dual problem is then to maximize this dual function subject to the dual feasibility constraints, i.e.,

max{d(λ, w) | λ ∈ R^M_+}.    (wD)

By Assumption 2.4, strong duality is satisfied for these scalar problems, i.e., the optimal values of (wP) and (wD) are equal, and a dual solution exists. The method presented in this paper to approximate the weakly efficient frontier by machine learning methods will rely heavily on the following well-known KKT conditions. Note that the KKT conditions (4)-(7) below are the usual KKT conditions of the scalarization problem (wP), but can also be obtained using the multiobjective KKT conditions for the CVOP (P) directly (without the need to go through the scalarization first), see [10].
Theorem 2.6. (x*(w), λ*(w)) ∈ R^N × R^M are a pair of primal (x*(w)) and dual (λ*(w)) solutions of the scalarization problems (wP), respectively (wD), w.r.t. w ∈ W if and only if they jointly satisfy the following KKT conditions:

J f(x*(w))^⊤ w + J g(x*(w))^⊤ λ*(w) = 0,    (4)
g_j(x*(w)) ≤ 0 for every j = 1, ..., M,    (5)
λ*(w) ∈ R^M_+,    (6)
λ*_j(w) g_j(x*(w)) = 0 for every j = 1, ..., M,    (7)

where J denotes the Jacobian of the corresponding function.
Proof. Due to the assumption that Slater's condition holds (Assumption 2.4), this result can be found in, e.g., [5, Chapter 5.5.3].
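The KKT conditions can be checked numerically by evaluating their residuals. The following sketch does so for a hypothetical two-variable instance with a single inequality constraint; the functions and the candidate primal-dual pair below are illustrative assumptions, not from the paper.

```python
import numpy as np

# KKT residuals for a scalarized toy problem (illustrative, not from the
# paper): f1(x) = ||x - e1||^2, f2(x) = ||x - e2||^2, one constraint
# g(x) = x1 + x2 - 1 <= 0.
def kkt_residuals(x, lam, w):
    Jf = np.array([[2 * (x[0] - 1), 2 * x[1]],
                   [2 * x[0], 2 * (x[1] - 1)]])  # Jacobian of f at x
    g = np.array([x[0] + x[1] - 1.0])            # constraint value g(x)
    Jg = np.array([[1.0, 1.0]])                  # Jacobian of g at x
    stationarity = Jf.T @ w + Jg.T @ lam         # first-order condition
    return (np.linalg.norm(stationarity),        # residual of (4)
            float(np.max(np.maximum(g, 0.0))),   # primal feasibility
            float(np.max(np.maximum(-lam, 0.0))),  # dual feasibility
            float(np.abs(lam @ g)))              # complementary slackness

# For this instance, x*(w) = w with lam*(w) = 0 satisfies all conditions.
w = np.array([0.3, 0.7])
residuals = kkt_residuals(np.array([0.3, 0.7]), np.array([0.0]), w)
```

All four residuals vanish exactly at a primal-dual solution; any positive residual flags which condition a candidate pair violates.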

Neural networks
Consider, now, the task of approximating a function y* : T → R^m for some compact set of input variables T ⊆ R^n. Such an approximation problem is, fundamentally, a regression problem. Within this work, we will focus on feed-forward neural networks to regress the function y*. These functions can be viewed as a multi-stage regression model. We refer the interested reader to, e.g., [14, Chapter 6] for more details on feed-forward neural networks.
Definition 2.7. An ℓ-hidden layer neural network with h_l nodes in layer l ∈ {0, 1, ..., ℓ+1} (such that h_0 := n and h_{ℓ+1} := m for the input and output layers respectively) is a mapping y_{ℓ+1} : T × Θ → R^m that can be decomposed as

y_l(t, θ) := Φ_l(θ_{l,b} + θ_{l,w} y_{l−1}(t, θ)) for l = 1, ..., ℓ+1,

with y_0(t, θ) := t, for activation functions Φ_l : R^{h_l} → R^{h_l} for every l = 1, ..., ℓ+1 and parameter space Θ of biases and weights θ = (θ_{l,b}, θ_{l,w})_{l=1,...,ℓ+1} with θ_{l,b} ∈ R^{h_l} and θ_{l,w} ∈ R^{h_l × h_{l−1}}. The parameter θ_{l,b} is often called the bias and θ_{l,w} is often called the weights of the l-th hidden layer. Jointly, the number of hidden layers ℓ, the number of nodes within each hidden layer h_l for l ∈ {1, ..., ℓ}, the activation functions Φ_l for l ∈ {1, ..., ℓ+1}, and the restrictions on the parameter space Θ are called the hyperparameters of the neural network.
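The recursion in Definition 2.7 can be sketched directly. For illustration we use NumPy rather than a deep learning framework, with tanh hidden activations and a linear output layer (matching the activation choices used later in the paper); the layer sizes and initialization scheme are arbitrary choices for this example.

```python
import numpy as np

# A minimal sketch of Definition 2.7: a dense feed-forward network
# y_l(t) = Phi_l(theta_{l,b} + theta_{l,w} y_{l-1}(t)), with y_0(t) = t,
# tanh hidden activations, and a linear output layer.
rng = np.random.default_rng(0)

def init_params(sizes):
    """Random weights and zero biases for layer widths h_0, ..., h_{l+1}."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

def forward(t, params):
    y = t
    for i, (W, b) in enumerate(params):
        z = W @ y + b
        y = np.tanh(z) if i < len(params) - 1 else z  # linear output layer
    return y

params = init_params([2, 16, 16, 3])  # n = 2 inputs, two hidden layers, m = 3 outputs
out = forward(np.array([0.4, 0.6]), params)
```

In practice one would use a framework such as PyTorch (as the paper does), which implements exactly this recursion together with automatic differentiation for training.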
Remark 2.8. The connection structure of a neural network in layer l is defined through the sparsity structure of the weights θ_{l,w} imposed within the parameter space Θ. A dense neural network, which we consider throughout the remainder of this work, is one in which no sparsity structure is imposed, i.e., every node in layer l − 1 may feed into every node in layer l.

Before continuing, we wish to consider some common choices of activation functions used within the neural network literature.
1. Linear: Consider the identity mapping φ(z) := z. This activation function is often denoted as the linear activation function as applying it along with the bias and weights creates a linear regression within that layer of the neural network.
Because the composition of linear functions is again linear, this activation function is typically only used for the output layer ℓ + 1 in practice.

As noted within the statement of the universal approximation theorem (Theorem 2.10), there only exist parameters θ ∈ Θ such that this approximation holds. To attempt to find these parameters, we undertake a training regimen. This is accomplished by minimizing some loss function L : T × R^m × R^m → R_+ such that L(t, y, y) = 0 for any t ∈ T and y ∈ R^m (see, e.g., Definition 3.1 of [27]), i.e., we seek the parameters θ* ∈ Θ such that

θ* ∈ argmin_{θ ∈ Θ} E[L(t, y*(t), y(t, θ))],    (8)

where the expectation is taken over the input t ∈ T (jointly with the true output y*(t), which may include noise). The expected loss y(·, θ) ↦ E[L(t, y*(t), y(t, θ))] is often referred to as the risk functional associated with the loss function L. (The loss L(t, y*(t), y(t, θ)) can be thought of as a distance the neural network y(t, θ) is from its target value y*(t). In particular, we use the stronger convention that L(t, y*(t), y(t, θ)) = 0 if and only if y(t, θ) = y*(t).) However, this minimizing argument need not exist in general and, furthermore, this may be a nonconvex optimization problem; we remark on these issues further within Remark 2.11 below. This process of minimizing the expected loss function is often called training the neural network and it is how the network learns. Typically the risk functional is evaluated at some finite training points {t_1, ..., t_K} ⊆ T so that it is defined via the empirical measure, i.e.,

L(y(·, θ)) := (1/K) Σ_{k=1}^K L(t_k, y*(t_k), y(t_k, θ))

for the loss function L.
Remark 2.11. As a note of caution, the universal approximation theorem only guarantees there exists some neural network which can uniformly approximate the target. However, there is no guarantee that we actually construct that neural network in practice when setting hyperparameters and training the parameters. In particular, optimizing the parameters θ by minimizing the risk functional often requires solving a highly nonlinear optimization problem. Though this problem is typically high-dimensional and nonlinear, multiple algorithms exist to efficiently approximate θ*. Within this work, we use the Adam optimizer [20], which is a gradient descent-based approach developed for large-scale problems. As a brief preview of the results of this work, we are essentially substituting the hard problem of convex vector optimization (Problem (P)) with the scalar optimization problem (8) encoded within these neural network optimization methodologies.
For the remainder of this work, to ease notation, we will drop the dependence of neural networks on their parameters θ.

A primal-dual neural network
Within this section our goal is to construct two neural networks which approximate the primal and dual solutions (x*(·), λ*(·)) to the scalarization problems (wP) so as to trace approximations of the entire weakly efficient frontier. These neural networks will be constructed to guarantee primal and dual feasibility as encoded in (5) and (6), i.e., a primal feasible neural network x : W → X and a dual neural network λ : W → R^M_+. In addition, we propose a loss function to jointly train these neural networks based on the KKT conditions of Theorem 2.6.

Primal feasible neural network
Herein we define a neural network with P inputs (given by w ∈ W) and N outputs that guarantees primal feasibility (5). In order to accomplish this, we want to construct the final layer of the neural network x : W → R^N so that the constraints encoded within X are satisfied.

Within the following proposition, we construct a projection mapping which, if used as the final activation function for an arbitrary neural network, guarantees primal feasibility. As such, we use the general recursive structure of a neural network to guarantee that the primal neural network x : W → R^N is feasible for every w ∈ W.
Proposition 3.1. Let z : W → R^N be an arbitrary neural network and define x : W → R^N via the final activation function

x(w) := (1 − t*(z(w))) z(w) + t*(z(w)) x̄,

where

t*(z) := max_{j=1,...,M} [g_j(z)]_+ / ([g_j(z)]_+ − g_j(x̄))    (10)

and x̄ is as defined in Assumption 2.4. Then x(w) ∈ X for every w ∈ W.
Proof. Recall by construction of the feasible set that x(w) ∈ X if and only if g_j(x(w)) ≤ 0 for every j = 1, ..., M. Consider the j-th inequality constraint. Note that t*(z) ∈ [0, 1) for any z by construction of x̄ strictly feasible. Therefore, by convexity and the definition of t*,

g_j(x(w)) ≤ (1 − t*(z(w))) g_j(z(w)) + t*(z(w)) g_j(x̄) ≤ 0.

We often refer to the final activation function z ↦ (1 − t*(z))z + t*(z)x̄ provided in Proposition 3.1 as a projection operator as it projects the neural network into the feasible region X.

Remark 3.2. Due to rounding errors in practice, it is recommended to add a tolerance level to the mapping t* defined in (10). That is, define the neural network x : W → X using t*(·; ε) : R^N → [0, 1) provided by

t*(z; ε) := max_{j=1,...,M} [g_j(z) + ε]_+ / ([g_j(z) + ε]_+ − g_j(x̄) − ε)

for fixed tolerance ε ∈ (0, − max_j g_j(x̄)).
For the remainder of this work we assume the primal neural network is constructed so as to be feasible, i.e., as in Proposition 3.1 by applying the final activation function z ↦ (1 − t*(z))z + t*(z)x̄ to a generic neural network z : W → R^N.

Remark 3.3. We propose the projection operator as a general formulation for a final activation function which guarantees primal feasibility. If the feasible set X takes a standard form, then a more direct approach can be utilized instead. For instance, within Section 5.4 below, X is the unit simplex; therein one could instead guarantee primal feasibility by taking the normalized rectified linear units (i.e., x_i(w) := [z_i(w)]_+ / Σ_{j=1}^N [z_j(w)]_+ for every i if Σ_{j=1}^N [z_j(w)]_+ > 0 and x(w) := x̄ otherwise) without needing to directly eliminate the equality constraint through the use of A^⊥ and changing the input space as described in Remark 2.5(2).
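The projection-style final activation can be sketched as follows. Here t* is built from the convexity argument of Proposition 3.1 (scaling toward a strictly feasible point until all constraints hold); the box constraints and the choice x̄ = 0 are illustrative assumptions for the example, not from the paper.

```python
import numpy as np

# Projection-style final activation (a sketch): with a strictly feasible
# xbar, take t*(z) = max_j [g_j(z)]_+ / ([g_j(z)]_+ - g_j(xbar)), which
# lies in [0, 1); by convexity of each g_j, (1 - t*) z + t* xbar is feasible.
def g(x):
    # X = [-1, 1]^3 encoded as g(x) <= 0 componentwise
    return np.concatenate([x - 1.0, -x - 1.0])

xbar = np.zeros(3)  # strictly feasible: g_j(xbar) = -1 < 0 for all j

def project(z):
    pos = np.maximum(g(z), 0.0)        # [g_j(z)]_+, zero for satisfied rows
    t = np.max(pos / (pos - g(xbar)))  # denominator >= -g_j(xbar) > 0
    return (1.0 - t) * z + t * xbar

x_feas = project(np.array([3.0, -0.5, 10.0]))  # pulled back into the box
```

Note that already-feasible inputs are left unchanged (t* = 0), so the operator only intervenes when a constraint is violated.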

Dual feasible neural networks
Herein we define the dual neural network with P inputs (given by w ∈ W) and M outputs which guarantees dual feasibility (6). Similar to the approach taken for the primal neural network, we want to construct the final layer of this neural network so that feasibility is guaranteed. Specifically, feasibility for the dual variables associated with the inequality constraints requires non-negativity (6). This can be accomplished by simply applying the rectified linear unit as introduced in Example 2.9 (or a smooth approximation of the positive part) as a final activation function to a neural network with M outputs, i.e., λ(w) := [λ̃(w)]_+ for a generic neural network λ̃ : W → R^M.
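A minimal sketch of this final activation: rectifying an arbitrary pre-activation output componentwise yields a vector in R^M_+ (the pre-activation values below are hypothetical).

```python
import numpy as np

# Dual feasibility via a ReLU final activation: componentwise rectification
# of any pre-activation output guarantees lambda(w) in R^M_+.
relu = lambda z: np.maximum(z, 0.0)
pre_activation = np.array([-0.3, 0.0, 1.2])  # hypothetical network output
lam = relu(pre_activation)                   # nonnegative dual multipliers
```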

Loss function
In order to train the parameters of our two neural networks, we need to also construct a loss function which we seek to minimize. In particular, we design a single loss function to jointly train both neural networks simultaneously. To accomplish this, we will use the KKT conditions presented in Theorem 2.6.
Specifically, to find the parameters for our neural networks, we select K (representative) choices of the scalarization weights w_k ∈ W for k = 1, ..., K and seek to minimize the deviation our neural networks have from the KKT conditions. Due to the design of our neural networks (x(·), λ(·)) as presented above, the primal and dual feasibility conditions (5) and (6) are automatically satisfied and, as such, do not need to be considered in our loss function. Therefore, the deviation from the KKT conditions can be fully characterized by the norms of the right-hand sides of the first order condition (4) and complementary slackness condition (7). As these errors may be of different orders of magnitude, we choose to weight the error in the complementary slackness condition by η > 0. That is, our empirical risk functional L for jointly training the two neural networks is provided by

L(x(·), λ(·)) := (1/K) Σ_{k=1}^K [‖J f(x(w_k))^⊤ w_k + J g(x(w_k))^⊤ λ(w_k)‖² + η ‖diag(λ(w_k)) g(x(w_k))‖²].    (11)

Remark 3.5.
1. The loss associated with the complementary slackness conditions may be modified to account for the rounding errors as discussed in Remark 3.2. For instance, for tolerance ε > 0, we can consider the complementary slackness error to be diag(λ(w)) g(x(w)) 𝟙{g(x(w)) < −ε1}, with the (componentwise) indicator function taking value 1 if the argument is true and 0 otherwise. In this way, we treat any solution within ε of the boundary as sitting on the boundary when considering complementary slackness.
2. The risk functional L requires an extra hyperparameter η > 0 to provide a relative weight for the complementary slackness condition. In practice, we choose

η := E[‖J f(x_0(w))^⊤ w + J g(x_0(w))^⊤ λ_0(w)‖²] / E[‖diag(λ_0(w)) g(x_0(w))‖²]

with expectations taken over the scalarizations w and with (x_0(w), λ_0(w)) the neural network outputs following the initialization of, e.g., the stochastic gradient descent. In this way, the initial optimization directions for the neural networks will attempt to minimize both sources of error.
Remark 3.6. We wish to note that the KKT condition based risk functional and loss function (11) introduced above, and used throughout the numerical examples herein, is not the only loss function that can be taken for this problem. As Assumption 2.4 implies strong duality, the duality gap can be used as a loss function instead via

L_gap(x(·), λ(·)) := (1/K) Σ_{k=1}^K [w_k^⊤ f(x(w_k)) − d(λ(w_k), w_k)],

where d : R^M × W → R is the dual function provided in (3). Notably, this construction removes the direct need to jointly train the primal and dual neural networks as the primal neural network could be trained with risk functional L_p(x(·)) := (1/K) Σ_{k=1}^K w_k^⊤ f(x(w_k)) and the dual neural network could be trained with risk functional L_d(λ(·)) := −(1/K) Σ_{k=1}^K d(λ(w_k), w_k) to achieve the same effect. Though separating the risk functionals so that the primal and dual neural networks can be trained separately is tempting due to the intuitive construction of these objectives, we find that this construction performs worse than the joint KKT based risk functional. Due to the use of the first order condition (4) in (11), the KKT based risk functional inherently incorporates some elements of sensitivity analysis to improve performance locally around the training data; in contrast, L_p and L_d only seek to optimize at the training data without any additional information to interpolate between the training data.
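The joint KKT-based empirical risk can be sketched as below. The toy instance, its Jacobians, and the stand-in "networks" x_net and lam_net are hypothetical, chosen so that the exact primal-dual solution yields zero loss; primal and dual feasibility are assumed built into the networks, as in the construction above.

```python
import numpy as np

# Joint KKT-based empirical risk (a sketch): squared first-order residual
# plus eta times the squared complementary-slackness residual, averaged
# over the training scalarizations w_k.
def kkt_loss(ws, x_net, lam_net, Jf, g, Jg, eta=10.0):
    total = 0.0
    for w in ws:
        x, lam = x_net(w), lam_net(w)
        stat = Jf(x).T @ w + Jg(x).T @ lam  # first-order condition residual
        slack = lam * g(x)                  # complementary-slackness residual
        total += stat @ stat + eta * (slack @ slack)
    return total / len(ws)

# Illustrative instance (not from the paper), with exact scalarized
# solution x*(w) = w and lambda*(w) = 0 on the simplex.
Jf = lambda x: np.array([[2 * (x[0] - 1), 2 * x[1]],
                         [2 * x[0], 2 * (x[1] - 1)]])
g = lambda x: np.array([x[0] + x[1] - 1.0])
Jg = lambda x: np.array([[1.0, 1.0]])

ws = [np.array([0.3, 0.7]), np.array([0.5, 0.5])]
loss_exact = kkt_loss(ws, lambda w: w, lambda w: np.zeros(1), Jf, g, Jg)
```

In the actual methodology this scalar would be minimized over the network parameters with a gradient-based optimizer (the paper uses Adam); here we only evaluate the risk at fixed candidate mappings.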

Approximating the weakly efficient frontier
Within this section we present a method to formally use the neural networks of Section 3 to construct inner and outer approximations of the weakly efficient frontier.
Before formally providing the results on the inner and outer approximations, we wish to recall the (Lagrange) dual function (3) to the scalarization problem (wP). That is, we consider the mapping d : R^M × W → R defined by

d(λ, w) := inf_{x ∈ R^N} [w^⊤ f(x) + λ^⊤ g(x)].

The dual problem is then to maximize this dual function subject to the dual feasibility constraints (6) as provided in (wD).
The first main result of this section, which proves that the primal neural network can be utilized to provide an inner approximation and the dual neural network an outer approximation of the upper set, is formalized in Lemma 4.1. Notably, this result relies entirely on, and trivially follows from, the feasibility of these neural networks as guaranteed by the final activation functions presented in Section 3 above.
Lemma 4.1. The primal neural network x : W → X provides an inner approximation of the upper set P and the dual neural network λ : W → R^M_+ provides an outer approximation of the upper set P via the relation

cl co ⋃_{w ∈ W} [f(x(w)) + R^P_+] ⊆ P ⊆ ⋂_{w ∈ W} {y ∈ R^P | w^⊤ y ≥ d(λ(w), w)}.

Proof. We will prove this result by showing, first, the inner approximation and, second, the outer approximation. Let y ∈ ⋃_{w ∈ W} [f(x(w)) + R^P_+]. There exists some w̃ ∈ W such that y − f(x(w̃)) ∈ R^P_+. Therefore, as x(w̃) ∈ X by Proposition 3.1, it holds that y ∈ f(X) + R^P_+ ⊆ P. By Proposition 2.3, and noting that the upper set P is closed and convex by assumption, the inner approximation holds. Now, let y ∈ P. There exists some w̃ ∈ W such that y − f(x*(w̃)) ∈ R^P_+ for solution x* : W → X to (wP). Therefore, for any w ∈ W,

w^⊤ y ≥ w^⊤ f(x*(w̃)) ≥ w^⊤ f(x*(w)) = d(λ*(w), w) ≥ d(λ(w), w),

where the equality holds by strong duality. (Though we utilize strong duality here, in fact only weak duality is required.) As a direct consequence, the outer approximation holds.
Beyond defining the inner and outer approximations as done in Lemma 4.1, we want to also quantify the error of these approximations. To do this, we first wish to define a variable approximation error. This is accomplished by quantifying the approximation error in any scalarization direction. As opposed to the standard, constant, approximation error (which we discuss more in Proposition 4.3), this variable approximation error allows us to more accurately assess the local errors coming from our machine learning approach. As far as the authors are aware, this definition is novel to the literature.

Definition 4.2. P̃ ⊆ R^P is an ε(·)-inner approximation of the upper set P for the function ε : W → R_+ if

P̃ ⊆ P ⊆ ⋂_{w ∈ W} [P̃ − ε(w)1 + G(w)],

where G(w) := {y ∈ R^P | w^⊤ y ≥ 0} for any w ∈ W.
Notably, the quantification of the errors in the ε(·)-inner approximation is variable. In contrast, typically (see, e.g., [22]) a constant error ε ∈ R_++ is taken instead; in such a setting an ε-inner approximation P̃ ⊆ R^P of P is defined by the relations P̃ ⊆ P ⊆ P̃ − ε1.
In the following proposition we show that when the error function is constant, Definition 4.2 reduces to the classical definition of an approximation.

Proposition 4.3. Let P̃ ⊆ R^P be a closed and convex upper set and let ε : W → R_+ be a constant function, i.e., ε(w) ≡ ε_0 > 0 for every w ∈ W. Then P̃ is an ε(·)-inner approximation of the upper set P ⊆ R^P if and only if P̃ ⊆ P ⊆ P̃ − ε_0 1.
Proof. As P̃ is a closed and convex upper set, it trivially follows (by the separating hyperplane theorem) that

⋂_{w ∈ W} [P̃ − ε_0 1 + G(w)] = P̃ − ε_0 1,

and the result is proven.

Within Corollary 4.4 below, we provide an analytical form for the functional approximation error ε(·) for the inner approximation defined within Lemma 4.1. Specifically, this function depends on both the primal and dual neural networks to quantify an upper bound to the "distance" between the inner and outer approximations at any point on the weakly efficient frontier.

Corollary 4.4. The inner approximation cl co ⋃_{w ∈ W} [f(x(w)) + R^P_+] of Lemma 4.1 is an ε(·)-inner approximation of the upper set P with

ε(w) := w^⊤ f(x(w)) − d(λ(w), w)

for any w ∈ W based, additionally, on the dual neural network λ : W → R^M_+.

Proof.
The result trivially follows from Lemma 4.1.
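The error bound ε(w) = w^⊤ f(x(w)) − d(λ(w), w) can be sketched on an unconstrained toy bi-objective instance, where the dual function reduces to the scalarized optimal value and is available in closed form; the instance and candidate points below are illustrative assumptions, not from the paper.

```python
import numpy as np

# eps(w) = w^T f(x(w)) - d(lambda(w), w): the gap between the primal value
# at a candidate point and the dual lower bound. On this unconstrained toy
# instance the dual bound is d(w) = inf_x w^T f(x) = 2 w1 w2 in closed form.
f = lambda x: np.array([np.sum((x - np.array([1.0, 0.0])) ** 2),
                        np.sum((x - np.array([0.0, 1.0])) ** 2)])
d = lambda w: 2.0 * w[0] * w[1]  # closed-form dual bound for this instance

def eps(w, x_hat):
    """Upper bound on the local approximation error at scalarization w."""
    return float(w @ f(x_hat) - d(w))

w = np.array([0.5, 0.5])
gap_exact = eps(w, np.array([0.5, 0.5]))     # exact minimizer: zero gap
gap_approx = eps(w, np.array([0.45, 0.55]))  # approximate point: positive gap
```

The gap vanishes exactly at the scalarized minimizer and is strictly positive otherwise, which is what makes ε(·) usable as a certificate for the trained networks.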

Numerical case studies
Within this section we will consider four multiobjective problems to numerically study the neural network approach presented above. The first problem (Section 5.1) has 2 objective functions but many decision variables and constraints. In the second problem (Section 5.2), we investigate the performance of the proposed methodology as the number of objectives grows. The third problem (Section 5.3) provides an additional high-dimensional problem so that we can investigate the impact of training time on the performance of the methodology. The final problem (Section 5.4) presents the mean-variance optimization problem [23] over a large number of assets. All computations of the neural networks were completed with PyTorch on a local machine using the Adam optimizer [20] with learning rate 10^{-4}. Throughout these examples, to generate the loss function, we weight the complementary slackness condition (7) with η = 10. We wish to note that all hyperparameters, except the terminal activation functions as presented in Section 3, are chosen arbitrarily and were not found via, e.g., a grid search. All intermediate activation functions (i.e., all but the terminal activation functions) are chosen to be the hyperbolic tangent function with linear linking between layers.

Two objective problem
Consider the following problem with P = 2 objectives; this problem was considered as "Test Instance 2" in [11]. Due to the symmetric structure of this problem, the true unique minimizer x* : W → [0, 1]^N for any scalarization w ∈ W can trivially be deduced in closed form. As this problem has a closed form set of minimizers, the efficient frontier can be provided exactly. This permits us to present an exact comparison between the machine learning methodology and the ground truth.
In order to fully consider our outer approximation as presented in Lemma 4.1, we need to also discuss the Lagrange dual problem of the weighted-sum scalarizations. Let us encode the box constraints through the linear inequalities Ax ≤ b for A := (I_N, −I_N)^⊤ and b := (1_N^⊤, 0_N^⊤)^⊤. In such a way we consider M = 2N linear inequality constraints. Due to the quadratic structure of the scalarizations, the Lagrange dual function d : R^M × W → R can be computed directly in closed form. With this setup, we can consider a specific instantiation of the problem. In particular, for test purposes, we will simply consider this bi-objective problem with N = 40 decision variables. As a direct consequence of the box constraints, there are M = 80 inequality constraints. Furthermore, to demonstrate the power of the neural network methodology proposed above, we train the primal and dual neural networks with only 4 scalarizations: w ∈ {(0, 1)^⊤, (1/3, 2/3)^⊤, (2/3, 1/3)^⊤, (1, 0)^⊤}. Testing will be undertaken on the dense grid w ∈ {(i/1000, 1 − i/1000) | i ∈ {0, 1, ..., 1000}}. The primal neural network x : W → X is designed with three hidden layers, each with 800 hidden nodes; the terminal (projection) activation function is considered with a tolerance of 5 × 10^{-5} to guarantee primal feasibility as discussed in Remark 3.2 and with strictly feasible point x̄ := (1/2)1. The dual neural network λ : W → R^{80}_+ is designed with three hidden layers, each with 1600 hidden nodes. All intermediate activation functions (i.e., all but the terminal activation functions) are chosen to be the hyperbolic tangent function with linear linking between layers. Finally, we train these networks over 1000 epochs.
Figure 1a displays the true weak efficient frontier along with two approaches for approximation. The true weak efficient frontier is plotted as a solid black curve. The inner approximation provided by the primal neural network (i.e., the boundary bd cl co ⋃_{w∈W} [f(x(w)) + R^P_+]) is plotted as a solid blue line and the outer approximation provided by the dual neural network (i.e., the boundary bd ⋂_{w∈W} {y ∈ R^P | w^⊤ y ≥ d(λ(w), w)}) is plotted as a solid red line. These machine learning approximations are compared with the direct computation of the primal and dual solutions for the 4 training scalarizations, shown as dashed black lines. Immediately noticeable, the neural network approximations almost completely overlap with each other except in the southeast corner of the figure (i.e., for scalarizations w ∈ W with w_1 low). The approximations can be seen more clearly in Figure 1b, in which the area around the training scalarization w = (2/3, 1/3)^⊤ is highlighted. The errors ε(w) can be directly seen in Figures 1c and 1d; these figures display the exact same data but on a linear and logarithmic scale respectively. For the training scalarizations the error of the exact computation drops to 0, whereas the neural network has errors growing to around 0.2. These errors are further improved (see the red line "NN Realized Error" in Figures 1c and 1d) when using the realizations of the neural network approximation, i.e., the convex hull in the inner approximation and the intersection in the outer approximation as provided in Lemma 4.1. However, as is clear from all figures, for (nearly) any choice of scalarization outside "low" w_1, the neural network outperforms the direct computation. Further, we wish to remind the reader that the neural networks were trained without hyperparameter tuning and, as such, the errors ε(w) for the machine learning approach can be further improved.

Many objective problem
We now wish to consider how the neural networks can handle a problem with many objectives. In particular we will consider a problem with P = M ≤ N objectives f_i for every i = 1, ..., P and with M = P constraints g_j(x) = f_j(x) − 1 for every j = 1, ..., M. Note that this problem was considered as "Test Instance 4" in [11]. Unlike in the prior example of Section 5.1, we do not have a closed-form solution for the minimizers of this problem. However, the neural network methodology presented above can still be applied to deduce inner and outer approximations of the weakly efficient frontier. To consider the outer approximation, we need to consider the Lagrange dual function d : R^M × W → R of the weighted-sum scalarizations, which for this problem can be computed in closed form for any λ ∈ R^M and w ∈ W. For test purposes we will vary P = M ∈ {2, ..., 20} with N = 100 within this case study. With this setup, we can consider specific instantiations of the problem at the different choices of P = M (always with fixed N = 100). Herein, we train each of these problems with 50 uniformly chosen random scalarizations w ∈ W of the correct dimension. Both primal and dual neural networks x : W → X and λ : W → R^M_+ are designed with 2 hidden layers, each hidden layer with 500 hidden nodes. The terminal (projection) activation function for the primal neural network is considered with a tolerance of 5 × 10^{-5} to guarantee primal feasibility as discussed in Remark 3.2 and with strictly feasible point x := Σ_{i=1}^P (1/P) e_i for unit vectors e_i ∈ R^N. All intermediate activation functions (i.e., all but the terminal activation functions) are chosen to be the hyperbolic tangent function with linear linking between layers. Finally, we train these networks over 200 epochs.
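Sampling the training and test scalarizations uniformly from the simplex W can be done, for instance, by normalizing i.i.d. exponentials; this is a standard construction and not necessarily the authors' implementation:

```python
import numpy as np

def sample_scalarizations(n, P, rng):
    """Draw n scalarizations uniformly from the unit simplex
    W = {w in R^P_+ : sum_i w_i = 1} by normalizing i.i.d. exponentials
    (equivalent to a flat Dirichlet(1, ..., 1) distribution)."""
    e = rng.exponential(size=(n, P))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
W_train = sample_scalarizations(50, 20, rng)   # 50 training weights for P = 20
```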
In order to compare the performance of the inner and outer approximations as the size of the problem grows, we compute ε(w) for 5000 uniformly sampled scalarizations w ∈ W of the correct dimension. These errors (in logarithmic scale) are displayed in Figure 2a. Note that the errors ε(w) become more consistent as the dimension of the problem increases; this is noticeable in the shrinking of the error bars of the box plot. Additionally, as can be seen in Figure 2b, both the mean and median of these errors (tend to) decrease as the dimension of the problem grows. This improvement in the mean and median of the errors is primarily due to the shrinking size of the feasible region as the dimension grows. For instance, as the problem size P = M grows, the feasible region can be proven to lie within a box of shrinking volume, i.e., X ⊆ [0, 1]^P × [−1, 1]^{N−P}. This is made clear in Figure 2b, as the inner approximation from selecting a random primal feasible point (uniformly selected in [0, 1]^P × {0}^{N−P} and projected onto X via the projection activation function proposed in Proposition 3.1) and a random dual feasible point (deterministically selecting λ := 0) improves as the size of the problem increases. Importantly, the neural networks uniformly outperform the random approximations for all tested problems. Finally, we wish to highlight that the reported approximation errors ε(w) in Figure 2 are computed without consideration of the convex hull for the inner approximation and the intersection for the outer approximation as in Lemma 4.1. For this reason, we view the reported errors as an upper bound on those that would be realized by the neural networks constructed in this example.
Beyond demonstrating that the proposed neural network approach is able to approximate the weak efficient frontier, we also use this many objective problem to comment on the computational requirements for training the primal and dual neural networks. The computational runtimes (on a local machine) for training and testing the neural networks are displayed in Figure 2c. Though this methodology does take longer as the problem size P = M grows, it does so in a relatively predictable manner. The increased time is due to the growth in the size of both the primal and dual neural networks. Though a one-to-one comparison is not possible, as the times reported in [11] were obtained on a different machine than our neural network approach and were computed for fixed approximation errors, we conclude this discussion by directly comparing computation times and approximation errors with the reported results in [11]. Specifically, we find that the neural network approach has a longer runtime for P = M ∈ {2, 3} (neural network training required 9.2 and 10.69 seconds respectively) than the box coverage approach of [11] (2.46 and 4.64 seconds) for fixed approximation error ε = 0.2; with that same approximation error, already at P = M ∈ {4, 5}, the neural network approach outperforms the box coverage approach of [11] in runtime (12.09 and 13.73 seconds against 15.05 and 122.25 seconds respectively) with improved accuracy as well, with P(ε(w) < 0.2) ≈ 98% for uniformly sampled test scalarizations. Improving the approximation error in the box coverage approach of [11] to fixed error ε = 0.1 (which notably is still worse than the proposed neural network approach on average in all test problems), the neural network approach is faster already for P = M = 3 (a box coverage runtime of 22.90 seconds compared to 12.09 seconds for the neural network approach). In fact, demonstrating the curse of dimensionality in the box coverage approach, that method could not complete its computations within an hour even for P = M = 4 (compared to a runtime of 13.73 seconds for our neural network approach). Even for P = M = 20, the neural network approach still completes all computations in under 37 seconds on a local machine.

High dimensional problem
We now wish to consider how the neural networks can handle a problem with an arbitrarily large number of objectives and variables. In particular we will consider a problem with P = N objectives f_i for every i = 1, ..., P and with M = 1 constraint g_1(x) = ‖x − (1 + ε)1‖_2 − 1 for fixed ε > 0. Throughout this section, we take ε = 0.01 chosen arbitrarily. Similar to the many objective problem in Section 5.2, we do not have a closed-form solution for the minimizers of this problem. However, the neural network methodology presented above can still be applied to deduce inner and outer approximations of the weakly efficient frontier. To consider the outer approximation, we need to consider the Lagrange dual function d : R^M × W → R of the weighted-sum scalarizations. For this problem, the Lagrange dual can be computed in closed form for any λ ∈ R^M and w ∈ W. This problem was previously formulated as "Problem 7.2" in [4]. For test purposes, in both case studies, we will vary P = N ∈ {2, ..., 10, 15, ..., 50, 60, ..., 100, 500, 1000, 5000}.
We train all of these problems with 50 uniformly chosen random scalarizations w ∈ W of the correct dimension. Both primal and dual neural networks x : W → X and λ : W → R^M_+ are designed with 2 hidden layers, each hidden layer with 300 hidden nodes. The terminal (projection) activation function for the primal neural network is considered with a tolerance of 5 × 10^{-5} to guarantee primal feasibility as discussed in Remark 3.2 and with strictly feasible point x := (1 + ε)1 for the one vector 1 ∈ R^N (for the implementation we follow the modified problem as discussed in Remark 3.4). In contrast to the prior numerical examples, here we use the softplus activation function (i.e., log(1 + exp(·))) to guarantee dual feasibility. All intermediate activation functions (i.e., all but the terminal activation functions) are chosen to be the hyperbolic tangent function with linear linking between layers. In order to compare the performance of the inner and outer approximations as the size of the problem grows, we compute ε(w) for 5000 uniformly sampled scalarizations w ∈ W of the correct dimension over a varying number of epochs of the Adam optimizer.
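A sketch of a dual network with a terminal softplus, assuming the architecture stated above (the class and parameter names are ours):

```python
import torch
import torch.nn as nn

class DualNet(nn.Module):
    """Dual network lambda : W -> R^M_+.  A terminal softplus,
    log(1 + exp(.)), maps the last linear layer into the positive
    orthant, so dual feasibility lambda >= 0 holds by construction
    for every input w."""
    def __init__(self, P, M, hidden=300, layers=2):
        super().__init__()
        dims = [P] + [hidden] * layers
        body = []
        for a, b in zip(dims[:-1], dims[1:]):
            body += [nn.Linear(a, b), nn.Tanh()]   # tanh hidden activations
        body += [nn.Linear(dims[-1], M), nn.Softplus()]
        self.net = nn.Sequential(*body)

    def forward(self, w):
        return self.net(w)

lam = DualNet(P=10, M=1)
assert (lam(torch.rand(7, 10)) >= 0).all()   # feasible for any batch of weights
```

Unlike a projection, the softplus enforces nonnegativity without any auxiliary feasible point, which is convenient here since λ = 0 is always dual feasible.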
In Figure 3a, the test approximation errors are plotted as the size of the problem P = N ≤ 100 varies after training for 2500 epochs of the Adam optimizer. Note that the errors are (approximately) an order of magnitude lower for P = N = 2 than for P = N = 3 and beyond, i.e., on the order of 10^{-3} growing to 10^{-2}. These errors stabilize around 10^{-2} until P = N = 15, after which the errors grow (but level off) as the dimensions increase. This behavior can be seen in the mean, the median, the interquartile range, and the 95% confidence interval. We can benchmark these errors in two ways: 1. If, in comparison, we naively chose the feasible points x(w) := (1 + ε)1 and λ(w) := 0, then the errors ε(w) would be uniformly constant. As such, these neural networks vastly outperform the naive choice for all tested problems. Notably, randomly selecting primal feasible points uniformly results in comparable errors.
2. Alternatively, if we chose x(w) := (1 + ε − 1/√N)1 (so that we select a point on the efficient frontier) and λ(w) := 0, then the errors would depend on the dimension N of the problem. For N = 2 with our chosen ε = 0.01, this results in an error of approximately 0.0917, which is nearly 2 orders of magnitude larger than the test errors found by our neural network approach; as N grows, so do these potential errors, always remaining significantly above the 95% confidence interval as displayed in Figure 3a. As with the prior example of Section 5.2, the neural networks could be improved further, as the reported approximation errors ε(w) in Figure 3a are computed without consideration of the convex hull for the inner approximation and the intersection for the outer approximation as in Lemma 4.1. For this reason, we view the reported errors as an upper bound on those that would be realized by the neural networks constructed in this example.
In Figure 3b, we explore the impact of training the neural networks on the test performance. In particular, we vary the number of epochs over which we train both the primal and dual neural networks between 100 and 2500. Two key results are observable. First, as we increase the number of epochs over which we run the Adam optimizer, the test errors decrease. Note that these errors are displayed on a log-scale, so that the errors decrease by a significant amount as we increase the training time. Second, errors of around 10^{-2} can be obtained by training over more epochs; this error level can be obtained for larger dimensions if we increase the training time further. Notably, this improved test performance in higher dimensions comes only from lengthening the number of training epochs while retaining the constant 50 (random) training scalarizations. Beyond demonstrating that the proposed neural network approach is able to approximate the weak efficient frontier, we also use this problem to comment on the computational requirements for training the primal and dual neural networks. The computational runtimes (on a local machine) for training and testing the neural networks (with P = N ≤ 100) are displayed in Figure 3c. Though this methodology does take longer as the problem size P = N grows, it does so at a nearly linear growth rate. If we expand beyond P = N ≥ 100, we find that the linear rate remains up to the 500-dimensional problem (47.86 seconds); however, the runtime grows superlinearly after around P = N = 500. For P = N = 1000, training the primal and dual neural networks took approximately 3 minutes and 43 seconds for 1000 epochs; for P = N = 5000, training these neural networks required 39 minutes and 16 seconds for 1000 epochs. The increased time is due to the growth in the size of the primal and dual neural networks.
As in Section 5.2 above, a one-to-one comparison is not possible, as the times reported in [4] were obtained on a different machine than our neural network approach; nevertheless, we conclude this discussion by directly comparing computation times and approximation errors with the reported results in [4]. Specifically, we find that the neural network approach has similar runtimes for P = N ∈ {2, 3} (approximately 10 seconds following [4]). However, whereas the neural network methodology has nearly linear growth in runtime (for P = N ≤ 500), the runtime for the methodology proposed within [4] grows exponentially in the problem dimensions. In fact, already at only 14 dimensions, the methodology proposed within [4] required over an hour in order to reach an error of approximately 0.25 (i.e., a longer time than required for the neural networks in the 5000-dimensional setting).

Mean-variance optimization
Consider now the mean-variance portfolio optimization problem proposed originally by Markowitz [23] as a bi-objective optimization problem. Specifically, this problem consists of a portfolio optimizer who is simultaneously attempting to maximize her expected return and minimize her risk (as encoded by the variance of the portfolio returns). We will present this problem with no short selling allowed, i.e., the investor can only purchase assets for her portfolio. As such, we consider P = 2 objectives, the (negative) expected portfolio return and the portfolio variance, over S ≥ 2 assets in which the agent can invest, where r ∈ R^S denotes the mean returns for each asset and C ∈ R^{S×S} is the covariance structure. The investor is constrained by M = S inequality constraints g_j(x) = −x_j for every j = 1, ..., M (encoding the no-short-selling constraints). In contrast to the prior examples, herein the optimizer is also constrained so that she uses all funds that she has available (i.e., an equality constraint that all wealth is invested). To guarantee that this problem is strictly convex, we will restrict the set of scalarizations under consideration to W_+ := {w ∈ W | w_2 > 0} with positive definite covariance matrix C, as is discussed in Remark 2.5(1).
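A minimal sketch of the two Markowitz objectives written as minimization criteria, using hypothetical 3-asset data in place of the S&P 500 estimates of the case study:

```python
import numpy as np

def mean_variance_objectives(x, r, C):
    """Markowitz objectives, both to be minimized:
    f_1(x) = -r^T x   (negative expected return, so minimizing maximizes return)
    f_2(x) =  x^T C x (portfolio variance)."""
    return np.array([-r @ x, x @ C @ x])

# hypothetical 3-asset data (stand-in for the empirical r and C of the case study)
r = np.array([0.08, 0.05, 0.02])
C = np.array([[0.10, 0.02, 0.00],
              [0.02, 0.05, 0.01],
              [0.00, 0.01, 0.02]])
x = np.array([0.5, 0.3, 0.2])   # long-only and fully invested: x >= 0, weights sum to 1
f1, f2 = mean_variance_objectives(x, r, C)   # f1 = -0.059, f2 = 0.0375
```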
To consider the outer approximation, we need to consider the Lagrange dual function d : R^M × W_+ → R of the weighted-sum scalarizations. For this mean-variance problem, the Lagrange dual can be computed in closed form for any λ ∈ R^M and w ∈ W_+. We will implement this mean-variance optimization problem on S = 492 stocks in the S&P 500 index. Empirical means and covariances of daily log returns from January 1, 2018 through December 31, 2021 are used to construct the mean asset returns r and covariance matrix C. Herein, we train this problem with 5 scalarizations. We then test this problem over 1001 scalarizations w ∈ {(i/1000, 1 − i/1000) | i ∈ {0, 1, ..., 1000}}. The primal neural network x : W_+ → X is designed with three hidden layers, each with 800 hidden nodes. Due to the construction of the feasible region, we apply a ReLU activation function prior to the projection activation function, i.e., such that the neural network z in Proposition 3.1 has a terminal ReLU activation function. The terminal (projection) activation function for this primal neural network is considered with a tolerance of 5 × 10^{-5} to guarantee primal feasibility as discussed in Remark 3.2 and with strictly feasible point x := 7.5 × 10^{-5} × 1. The dual neural network λ : W_+ → R^{492}_+ is designed with three hidden layers, each with 800 hidden nodes. As with the high-dimensional example in Section 5.3, here we use the softplus activation function (i.e., log(1 + exp(·))) to guarantee dual feasibility. All intermediate activation functions (i.e., all but the terminal activation functions) are chosen to be the hyperbolic tangent function with linear linking between layers. Finally, we train these networks over 5000 epochs.
Figure 4a displays the true weak efficient frontier along with the neural network inner and outer approximations. The true weak efficient frontier is plotted with a solid black curve. The inner approximation provided by the primal neural network (i.e., the boundary bd cl co ∪_{w∈W_+} [f(x(w)) + R²_+]) is plotted as a solid blue line and the outer approximation provided by the dual neural network (i.e., the boundary bd ∩_{w∈W_+} {y ∈ R² | w^⊤y ≥ d(λ(w), w)}) is plotted as a solid red line. Using the usual transformation of the mean-variance problem, the Markowitz bullet approximations, providing the relation between the standard deviation and mean of the portfolio return, are displayed in Figure 4b. As shown in Figure 4c, the realized approximation errors are on the order of 10^{-4} throughout the space of scalarizations w ∈ W_+. Note that the approximation errors jump up for w_2 ≈ 0, i.e., when the problem approaches a linear objective (rather than strictly convex as imposed in Assumption 2.4).
Extending to non-strictly convex vector optimization problems

As stated within Remark 2.5(1), within this section we wish to relax the strict convexity assumption of Assumption 2.4 so that we only require f_i to be convex for every objective function i = 1, ..., P. That is, consider the CVOP (P) such that:
• f_i are convex and continuously differentiable;
• g_j are convex and continuously differentiable;
• (P) is bounded;
• Slater's condition holds with strictly feasible point x ∈ R^N.
With this set of assumptions, which differ from Assumption 2.4 only through the non-strict convexity of the objective functions, we again consider the weighted-sum scalarization problem (wP) and its dual (wD) in order to learn the weakly efficient frontier.
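For reference, the weighted-sum scalarization (wP) and its Lagrange dual (wD) take the standard forms below, written in the notation of the earlier sections (whose displays are not reproduced in this excerpt):

```latex
\begin{align*}
(\mathrm{wP})\quad & \min_{x \in \mathbb{R}^N} \; w^\top f(x)
    \quad \text{s.t.} \quad g(x) \le 0,\\
(\mathrm{wD})\quad & \max_{\lambda \in \mathbb{R}^M_+} \; d(\lambda, w),
    \qquad d(\lambda, w) := \inf_{x \in \mathbb{R}^N}
    \left[\, w^\top f(x) + \lambda^\top g(x) \,\right].
\end{align*}
```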
The impact of the potential non-strict convexity of the objective functions on the machine learning approach can be twofold.
• In contrast to the simple dual feasibility conditions (6) under Assumption 2.4, we might need to consider additional implicit dual feasibility conditions to guarantee finiteness of the Lagrange dual function (3). As this depends on the particular structure of the considered problem, we will discuss this in more detail in the applications where this issue occurs. The case of linear vector optimization problems is, for instance, discussed in Section 6.1.2.
• The approach considered in Section 3 needs to be modified, as neural networks can be unstable for discontinuous functions (e.g., the universal approximation theorem (Theorem 2.10) only applies to continuous functions). Specifically, due to the possible multiplicity of optimizers for convex problems, there may not exist a continuous selector of optimizers.
To address the second point above, the approach taken herein consists of constructing feasible primal and dual neural networks for the original problem (P), but where the risk functional (11) is based on the KKT conditions of the scalarizations of an augmented CVOP with scalarization min_x (1 − δ) w^⊤f(x) + δ f̄(x), where w ∈ W with δ ∈ (0, 1) small and f̄ : R^N → R strictly convex and continuously differentiable. In particular, setting f̄(x) := ‖x‖²₂ implies that the objective of the augmented scalarization is strongly convex and thus also strictly convex.
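Assuming the f_i are additionally twice differentiable, the strong convexity claim follows from a one-line Hessian computation:

```latex
\nabla_x^2\Big[(1-\delta)\, w^\top f(x) + \delta\, \|x\|_2^2\Big]
  = (1-\delta)\sum_{i=1}^{P} w_i\, \nabla^2 f_i(x) + 2\delta I
  \;\succeq\; 2\delta I \;\succ\; 0,
```

since each ∇²f_i(x) ⪰ 0 by convexity of f_i and w ≥ 0; the augmented objective is therefore 2δ-strongly convex.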
As demonstrated by Proposition 6.1 below, this augmentation provides a reliable approximation to the solution of the original weighted-sum scalarization (wP).

Proposition 6.1. Assume X ⊆ R^N is compact and fix w ∈ W. Let X*(w) ⊆ X be the set of primal solutions of the scalarization problem (wP) w.r.t. w. Additionally, let x*(w, δ) be the primal solution of the augmented scalarization problem w.r.t. w and δ ∈ (0, 1). Then there exists at least one accumulation point of {x*(w, δ) | δ ∈ (0, 1)} as δ ↘ 0, and any such accumulation point is an element of X*(w), i.e., lim_{δ↘0} x*(w, δ) ⊆ X*(w).
To demonstrate this approach, we will first illustrate the exact construction for a generic linear vector optimization problem (LVOP) in Section 6.1, along with a simple numerical example in that setting. We will then consider in Section 6.2 a version of the mean-risk problem to present a separate, practical problem from financial applications.

Linear vector optimization problems
Consider the generic inequality-constrained LVOP min Cx subject to Ax ≤ b, for C ∈ R^{P×N}, A ∈ R^{M×N}, and b ∈ R^M. By construction, this is a CVOP that is not strictly convex. We will use this problem to illustrate the modifications that are necessary to formalize the neural network approach provided within Section 3 and to guarantee the inner and outer approximation results provided within Section 4. Briefly, the idea is to construct primal and dual feasible neural networks for the LVOP that approximate the solution to a closely related strictly convex CVOP.
Using this LVOP, let us construct our primal and dual neural networks, followed by a discussion of the necessary changes to the loss function for training purposes. We wish to note that the inner and outer approximation results provided within Section 4 can be applied to the constructions provided within this section without modification. Due to the similarity of the corresponding proposition to Proposition 3.1, we omit its proof here.

Loss function
Consider now the modified weighted-sum scalarization problem with fixed weight δ ∈ (0, 1). An application of Theorem 2.6 implies that (x*(w, δ), λ*(w, δ)) ∈ R^N × R^M are a pair of primal (x*(w, δ)) and dual (λ*(w, δ)) solutions of the modified scalarization problem, respectively its Lagrange dual problem, w.r.t. w ∈ W and δ ∈ (0, 1) if and only if they jointly satisfy the KKT conditions. Note that the primal and dual feasibility constraints of the original, unmodified, scalarization problem (wP) are included in those of the modified scalarization problem. Thus, the neural networks (x(·), λ(·)) constructed in Sections 6.1.1 and 6.1.2 from the original, unmodified, scalarization problem (wP) are automatically also feasible for the modified problem. Hence, the only difference in methodology is to use the risk functional of the modified problem instead of the original risk functional (11) of problem (wP). The risk functional based on the KKT conditions of the modified problem can be constructed in analogy to Section 3.3. Recall from Proposition 6.1 that, for δ ∈ (0, 1) small enough, if the loss of the neural networks is small enough, we can confidently conclude that the obtained solutions (x*(·, δ), λ*(·, δ)) are "close" to the true solutions of (wP).
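A sketch of such a KKT-based risk functional, assuming the LVOP is written in the standard form min Cx subject to Ax ≤ b; this is our illustrative reading of the construction, not the paper's exact risk functional (11):

```python
import torch

def kkt_risk(x, lam, w, C, A, b, delta=1e-4, eta=1e-4):
    """KKT residual of the modified scalarization
    min_x (1 - delta) w^T C x + delta ||x||_2^2  s.t.  A x <= b.
    Primal and dual feasibility are already enforced by the terminal
    activation functions, so only stationarity and the (eta-weighted)
    complementary slackness terms enter the loss."""
    stationarity = (1 - delta) * C.T @ w + 2 * delta * x + A.T @ lam
    comp_slack = lam * (A @ x - b)
    return stationarity.pow(2).sum() + eta * comp_slack.pow(2).sum()

# sanity check at a hand-computable optimum of min (1-d) w^T x + d ||x||^2 s.t. x >= 0:
C, A, b = torch.eye(2), -torch.eye(2), torch.zeros(2)
w, delta = torch.tensor([0.5, 0.5]), 1e-4
x_opt, lam_opt = torch.zeros(2), (1 - delta) * w
assert kkt_risk(x_opt, lam_opt, w, C, A, b, delta=delta).item() < 1e-12
```

The loss is zero exactly at a primal-dual solution pair of the modified scalarization, so driving it down in training pushes the networks toward those solutions.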

Inner and outer approximations
Following the construction of the primal and dual neural networks for the LVOP, we want to comment on the inner and outer approximations as presented in Section 4. Formally, the approximation results of Section 4 follow directly from feasibility (both primal and dual feasibility) and strong duality. As we recover feasibility of our neural networks, and strong duality still holds due to Slater's condition, no modifications are necessary to recover these approximations. In particular, the primal neural network x : W → R^N provides an inner approximation and the dual neural network λ : W → R^M provides an outer approximation of the upper set P as in Lemma 4.1.

Discussion and numerical results
Consider now the bi-objective (P = 2) LVOP with N = 2 decision variables and M = 5 inequality constraints. To place it within the strictly convex setting, we introduce the augmented form with f̄(x) := ‖x‖²₂ included with constant weight δ = 10^{-4}. As in Section 5, the neural networks for this example were computed with PyTorch on a local machine using the Adam optimizer with learning rate 10^{-4}. Here, we weight the complementary slackness condition (7) with η = 10^{-4} so that the first-order condition error (4) and the complementary slackness error are, initially, of the same order of magnitude. We train these neural networks for 500 epochs. We wish to note that all other hyperparameters, except the terminal activation functions as presented within this section, are chosen arbitrarily and were not found via, e.g., a grid search.
Specifically, we have two neural networks to design: the primal (x : W → R²) and dual (λ : W → R⁵) neural networks. The primal neural network x is designed with three hidden layers, each with 800 hidden nodes; the terminal (projection) activation function is considered with a tolerance of 5 × 10^{-5} to guarantee primal feasibility as discussed in Remark 3.2 and with strictly feasible point x := 1. The dual neural network λ is designed with three hidden layers, each with 800 hidden nodes. The terminal (projection) activation function is considered with a tolerance of 5 × 10^{-5} to guarantee dual feasibility as discussed in Remark 3.2 and with strictly feasible points λ : W → R^M constructed via the linear programming approach outlined within Remark 6.4 with ε̄ := 5 × 10^{-3} × 1. All intermediate activation functions (i.e., all but the terminal activation functions) are chosen to be the hyperbolic tangent function with linear linking between layers.
Figure 5a displays the true weak efficient frontier along with the neural network inner and outer approximations on test scalarizations w ∈ {(i/500, 1 − i/500) | i ∈ {0, 1, ..., 500}} after training on w ∈ {(i/19, 1 − i/19) | i ∈ {0, 1, ..., 19}} with δ = 10^{-4}. The true weak efficient frontier is plotted with a solid black curve. The inner approximation provided by the primal neural network (i.e., the boundary bd cl co ∪_{w∈W} [Cx(w) + R²_+]) is plotted as a solid blue line and the outer approximation provided by the dual neural network (i.e., the boundary bd ∩_{w∈W} {y ∈ R² | w^⊤y ≥ −b^⊤λ(w)}) is plotted as a solid red line. Immediately noticeable, the neural network approximations almost completely overlap with each other and, in particular, capture the vertices of the true weak efficient frontier. The errors ε(w) can be directly seen in Figure 5b, which displays the data on a logarithmic scale. For these scalarizations, the original neural network approaches provide relatively high errors (see the blue line "NN Error"). However, when using the realizations of the neural network approximation, i.e., the convex hull in the inner approximation and the intersection in the outer approximation as provided in Lemma 4.1, these errors drop significantly, with most scalarizations having errors around 10^{-2}. Further, we wish to remind the reader that the neural networks were trained without hyperparameter tuning and, as such, the errors ε(w) for the machine learning approach can be improved further.

Mean-risk problem
Herein we want to revisit a practical example by studying the mean-risk problem in finance with dynamic trading. Similar to the mean-variance problem presented within Section 5.4, this is a bi-objective problem that allows for easy visualization of the optimal portfolio. However, in contrast to the mean-variance problem, herein we consider risk as measured by the entropic risk measure [13, Example 4.34]. We will present this problem with no short selling allowed, i.e., the investor can only purchase assets for her portfolio. To simplify this setting, we will consider a market with S = 2 assets only: asset 0 is the cash asset and is worth $1 throughout; asset 1 is a risky stock whose value fluctuates (with equal up or down probability) according to the Cox-Ross-Rubinstein binomial tree [7] with T = 2 time steps and volatility σ² = 0.05, as depicted in Figure 6a. Within this setting, the investor is optimizing over 6 variables: investments in each asset at time 0 and in each asset in either state at time 1. Let x = (x_{0,0}, x_{0,1}, x^U_{1,0}, x^U_{1,1}, x^D_{1,0}, x^D_{1,1}) ∈ R⁶ provide the investments at times 0, 1 and in assets 0, 1. The investor is seeking to maximize her expected return and minimize her entropic risk at time T = 2 with risk-aversion parameter α > 0. Herein we will consider α = 1/2 chosen arbitrarily.
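The two-step binomial tree of Figure 6a can be generated as follows; the parametrization u = exp(σ√Δt), d = 1/u is the standard one from [7] and is stated here as an assumption, since the paper reads the node values off the figure:

```python
import numpy as np

def crr_tree(s0, sigma, steps, dt=1.0):
    """Recombining Cox-Ross-Rubinstein tree: up factor u = exp(sigma*sqrt(dt)),
    down factor 1/u; level t of the returned list holds the t + 1 attainable
    prices, ordered from the highest (all up moves) to the lowest."""
    u = np.exp(sigma * np.sqrt(dt))
    return [s0 * u ** np.arange(t, -t - 1, -2) for t in range(steps + 1)]

tree = crr_tree(s0=1.0, sigma=np.sqrt(0.05), steps=2)   # T = 2, sigma^2 = 0.05
# tree[0] = [1], tree[1] = [u, 1/u], tree[2] = [u^2, 1, 1/u^2]
```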
As this problem is convex, but not strictly convex, we augment this problem with a third objective function f_3(x) = ‖x‖²₂, which is always included with weight δ = 10^{-4}. Due to the no-short-selling constraint, the investor is constrained by M = 6 inequality constraints g(x) = −x ∈ R⁶. In addition to these inequality constraints, this constructed dynamic portfolio is required to satisfy self-financing conditions with initial wealth W = 10. As discussed within Sections 2.1 and 5.4, we consider the modification of this problem to guarantee that these equality constraints are satisfied. Denote these self-financing constraints by Ax = (W, 0, 0)^⊤; then we consider N = 3 variables with x := A^⊥x̃ + x̄ for x̃ ∈ R^N, with A^⊥ ∈ R^{6×3} and x̄ := (W/2) × 1 ∈ R⁶. To consider the outer approximation, we need to consider the Lagrange dual function d : R⁶ × W → R of the weighted-sum scalarizations (prior to augmentation). For this mean-risk problem, the Lagrange dual function does not have a closed-form solution; rather, it is computed by solving an unconstrained minimization problem for any λ ∈ R⁶ and w ∈ W.
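The entropic risk objective can be sketched directly (α = 1/2 as above); the payoff vector below is hypothetical and stands in for the terminal wealth generated by a trading strategy x:

```python
import numpy as np

def entropic_risk(payoff, probs, alpha=0.5):
    """Entropic risk measure rho(X) = (1/alpha) log E[exp(-alpha X)]
    ([13, Example 4.34]); it is convex, and by Jensen's inequality
    rho(X) >= -E[X], with equality iff the payoff is constant."""
    return float(np.log(probs @ np.exp(-alpha * payoff)) / alpha)

# hypothetical terminal wealths over the four equally likely T = 2 paths
payoff = np.array([12.0, 10.5, 10.5, 9.0])
probs = np.full(4, 0.25)
rho = entropic_risk(payoff, probs, alpha=0.5)   # about -10.23 > -E[X] = -10.5
```

The gap between ρ(X) and −E[X] quantifies the penalty for the dispersion of the payoff, which is exactly the trade-off against expected return in the bi-objective problem.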
With this setting, we have two neural networks to design: the primal (x : W → R³) and dual (λ : W → R⁶) neural networks. The primal neural network x is designed with three hidden layers, each with 800 hidden nodes; the terminal (projection) activation function is considered with a tolerance of 5 × 10^{-5} to guarantee primal feasibility as discussed in Remark 3.2 and with strictly feasible point x̃ := 0 ∈ R³. The dual neural network λ is designed with three hidden layers, each with 800 hidden nodes. A terminal ReLU activation function is used to guarantee dual feasibility. All intermediate activation functions (i.e., all but the terminal activation functions) are chosen to be the hyperbolic tangent function with linear linking between layers. As with the prior examples, the neural networks herein are computed with PyTorch on a local machine using the Adam optimizer with learning rate 10^{-4}. Here, we weight the complementary slackness condition (7) with η = 10. We train these neural networks for 2000 epochs over 4 training points {(0, 1)^⊤, (1/3, 2/3)^⊤, (2/3, 1/3)^⊤, (1, 0)^⊤}. We wish to note that all other hyperparameters, except the terminal activation functions discussed previously, are chosen arbitrarily and were not found via, e.g., a grid search.
Figure 6b displays the true weak efficient frontier along with the neural network inner and outer approximations on 501 regularly spaced test scalarizations w ∈ {(i/500, 1 − i/500) | i ∈ {0, 1, ..., 500}}.

We did not explore the impact of different intermediate layer and activation function structures. Optimizing this structure can reduce the size of the realized errors and improve the approximations. We propose investigating such a hyperparameter optimization for specific classes of problems, e.g., for quadratic vector optimization, as analytical structures are most likely to be tractable.

Figure 2: Section 5.2: Visualizations of approximation errors ε(w) over 5000 (uniformly random) test scalarizations w ∈ W as the size of the problem P = M changes with fixed number of primal variables N = 100, and associated runtimes when trained over 50 (uniformly random) scalarizations. (a) Box plot of approximation errors ε(w) from the machine learning approach: the central line provides the median; the box shows the inter-quartile range; outliers (red plus signs) are those further from the median than 1.5 times the inter-quartile range.
Runtime for training the neural networks and computing the test scalarizations for varying number of epochs.

Figure 3: Section 5.3: Visualizations of approximation errors ε(w) over 5000 (uniformly random) test scalarizations w ∈ W as the size of the problem P = N ≤ 100 and the number of epochs of the Adam optimizer change, as well as the associated runtimes when trained over 50 (uniformly random) scalarizations.
Plot of weak efficient frontier (black) and approximations through neural networks (blue and red).

Figure 4: Section 5.4: Plot of the efficient frontier, Markowitz bullet, and approximation errors over 1001 regularly spaced test scalarizations. Neural networks are computed with 5 regularly spaced training scalarizations.
Plot of weak efficient frontier (black) and approximations through neural networks (blue and red).
Plot of approximation errors ε(·) in log scale, so that the neural network (blue) and realized approximations (red) provide ε(·) inner approximations.

Figure 5: Section 6.1: Plot of the efficient frontier and approximation errors for 501 regularly spaced test scalarizations. Neural networks are computed with 20 regularly spaced training scalarizations.
Theorem 2.10 ([25, Theorem 3.1]). Let y* : T → R^m be a continuous function with compact domain and φ : R → R be a continuous activation function (applied component-wise to vectors as in Example 2.9). For every ε > 0 there exists a 1-layer neural network y : T × Θ → R^m with h_1 hidden nodes such that sup_{t∈T} ‖y*(t) − y(t, θ)‖ < ε for some parameters θ ∈ Θ.