Convex Optimization with an Interpolation-based Projection and its Application to Deep Learning

Convex optimizers have known many applications as differentiable layers within deep neural architectures. One application of these convex layers is to project points into a convex set. However, both forward and backward passes of these convex layers are significantly more expensive to compute than those of a typical neural network. We investigate in this paper whether an inexact, but cheaper projection, can drive a descent algorithm to an optimum. Specifically, we propose an interpolation-based projection that is computationally cheap and easy to compute given a convex, domain defining, function. We then propose an optimization algorithm that follows the gradient of the composition of the objective and the projection and prove its convergence for linear objectives and arbitrary convex and Lipschitz domain defining inequality constraints. In addition to the theoretical contributions, we demonstrate empirically the practical interest of the interpolation projection when used in conjunction with neural networks in a reinforcement learning and a supervised learning setting.


Introduction
Several recent research has investigated the integration of a 'convex optimization layer' within the computational graph of machine learning architectures in applications such as optimal control [de Avila Belbute-Peres et al., 2018, Amos et al., 2018, computer vision [Bertinetto et al., 2019, Lee et al., 2019 or filtering . Within this line of research, we distinguish two use cases for convex optimization. In the first use case, the output of the 'convex optimization layer' is a convex problem by definition. For example, a node can compute the maximum a posteriori of an image model [de Avila Belbute-Peres et al., 2018, Amos et al., 2018. In the second use case, a node restricts-by means of a projection-its input to a convex set and becomes a convex optimization problem by choice. For example, a node can restrict its input to the set of physically plausible vertex deformations [Geng et al., 2019].
In the second use case, it was shown in Geng et al. [2019] that the projection step benefits from being fully integrated to the learning process in both the forward and backward passes. Letting x be the input of the projection layer, g be the projection, and f be the ensuing computations-e.g. a loss function; integrating the projection into the backward pass amounts to differentiating through f • g(x). There have been several advances in differentiating through convex programs [Agrawal et al., 2019]. However, the forward and backward passes on g remain significantly more expensive than the typical matrix multiplications that would precede or succeed g [Amos and Kolter, 2017]. We investigate in this paper an alternative projection that is more lightweight to compute and differentiate than solving a convex program. Even if sub-optimal, in the sense that the proposed projection will not return the closest point to the input within the admissible set, the rationale behind the proposed algorithm is that since we are differentiating through f and g, a sub-optimal projection could still drive the optimization process to an optimal point.
The proposed projection maps any input x to a feasible point g(x) by simply interpolating x with a point x 0 satisfying the convex inequality constraints. The interpolation parameter is computed in closed form by exploiting the convexity of the domain defining function. We first show in this paper that the interpolationbased projection when used as in Projected Gradient Descent [Rosen, 1960, Nocedal andWright, 2006]by projecting the iterate after each gradient step-does not converge to an optimum. However, when differentiating through both the objective and the projection, we show that the resulting algorithm converges for a linear objective and arbitrary convex and Lipschitz domain defining functions. Finally, we provide in addition to the theoretical analysis, empirical results using the projection in conjunction with neural network models in reinforcement and supervised learning. Our results show that the proposed projection can be used to tackle constrained policy optimization or to provide an inductive bias improving generalization while being significantly cheaper to compute than an orthogonal, 'optimal' projection.
This work generalizes and formally analyzes previous interpolation-based projections we developed in the context of reinforcement learning (RL) in Akrour et al. [2019]. Several RL algorithms add informationtheoretic constraints to the policy optimization problem, such as a minimal entropy or a maximal Kullback-Leibler (KL) divergence to the data generating policy [Deisenroth et al., 2013]. We proposed in Akrour et al. [2019] differentiable policy parameterizations that comply with these constraints by construction, allowing the policy optimization problem to be solved by standard gradient descent algorithms. These parameterizations were based on interpolating any input parameterization of a distribution with a constraint satisfying parameterization. For example, interpolating an input discrete distribution with the uniform distribution, that satisfies any reasonable minimal entropy constraint. Interestingly, albeit these projections were not 'optimal' in the sense that they do not minimize a distance to the admissible set, we noted empirically (see Akrour et al. [2019], Fig. 1 and surrounding text) that such parameterization would always drive the descent algorithm to an optimum on a toy problem with a linear objective and a convex, entropy constraint. The main contribution of this paper is to generalize the idea of interpolation projections to arbitrary convex domain defining functions and to prove convergence of a descent algorithm leveraging this projection. From a practical point of view, in addition to the previously discussed RL application, we provide an example usage of the interpolation projection in a supervised learning context. The interpolation projection can be used as an inexpensive and differentiable operator to add convex constraints to the output of a neural network model, while being significantly cheaper than norm minimizing projections [Agrawal et al., 2019].
Computationally frugal projections were previously studied in the context of feasibility problems [Combettes, 1997], where the goal is to find a point inside a convex set. The approximate projection in Combettes [1997] uses the gradient of a violated inequality constraint to find a half-space that is a superset of the feasible set. Then an orthogonal projection on this hyper-plane is performed resulting in a point outside of the feasible set, but closer to the set than the input point. In contrast, our projection is not based on the gradient of the constraint but on its convexity and results in a point inside the feasible set. Moreover, the optimization setting we consider is more general than the feasibility setting and our assumption of an initial feasible x 0 would already solve the problem of Combettes [1997]. As such, our work and that of Combettes [1997] differ both in their objectives and their methods. In Xu [2018], Lan and Zhou [2016], approximate projections are derived when the number of constraints is large, but these algorithms still rely on expensive orthogonal projections. To the best of our knowledge, no other work previously showed convergence of a convex optimizer with non-orthogonal projections. The practical implications being a cheap way of adding convex constraints to machine learning models as shown in the experimental validation section.

Preliminaries
Let us first introduce and analyse the ideas in a convex optimization setting. Let f : R d → R and h : R d → R be convex and differentiable functions. We consider the following convex program For clarity of exposition, we initially only consider a single inequality constraint with differentiable h. Our results will be straightforwardly extended to multiple inequality constraints in Sec. 4.1 with sub-differentiable functions. Letting the convex set C ⊆ R d be defined by C = {x ∈ R d : h(x) ≤ 0}, the optimization problem (P) can be reformulated as min x∈C f (x). To solve this problem, one approach is to use the Projected Gradient Descent (PGD) algorithm [Rosen, 1960, Nocedal andWright, 2006] which is given by the following equation where g is a mapping that projects points from R d to C. The projection g is defined by the minimization g(x) = arg min y∈C x − y 2 of the Euclidean norm . 2 on R d . Mirror descent [Bubeck, 2014], an alternative for solving (P), can be seen as a generalization of PGD to other distances. These projection-based methods are most efficient when a closed form expression of the projection exists. Otherwise, a nested optimization problem needs to be solved after every gradient update of the iterate.
Other approaches such as the Frank-Wolfe method or the interior-point method also solve series of optimization problems. The Frank-Wolfe method [Frank and Wolfe, 1956] solves a series of linear approximations of the problem, x k+1 = arg min x∈C ∇f (x k ) T x; and the interior-point method [Karmarkar, 1984, Nesterov andNemirovskii, 1994] introduces a slack variable s for the inequality constraint and solves f (x) − µ k ln s under an equality constraint, for a series of values of µ k going to 0.
In contrast to all these methods, our algorithm takes a simpler and more direct approach by performing gradient descent on the composition of the objective and a projection. The proposed interpolation-based projection will transform the constrained problem (P) into an unconstrained one. The projection is readily defined without any other assumption than the convexity of h. Unlike previous algorithms, the interpolation projection is not defined as the minimization of a norm. To alleviate any ambiguity, from here on the term projection is understood as the more general following definition.
Definition 2.1. A projection g is a mapping from a set to a subset thereof.
Specifically, in this paper the superset is R d and the subset is C.

Interpolation-based projection and gradient descent
To solve the optimization problem min x∈C f (x) described in (P), we use a projection g that will ensure that for all x ∈ R d , g(x) ∈ C, i.e. h(g(x)) ≤ 0. The projection g is defined for any convex function h, provided there exists some point x 0 strictly satisfying the constraint, i.e. h(x 0 ) < 0. In which case, g is given by . When h(x) > 0, g simply interpolates between the violating point x and the point x 0 in C; otherwise, it returns x itself. Figure 1: Sequence of points generated by algorithms (a) and (b) with interpolation projection g. Since g is not a projection in the 2 minimizing sense, it cannot be used as in PGD (a). However, taking the derivative of the projection into account as in (b), drives the algorithm to the optimum.
Proposition 3.1. g is a projection from R n to C.
Proof. We will demonstrate that g( Even though g is a projection in the sense of Def. 2.1, it is not a projection in the usual sense that it minimizes a norm between x and elements of C. As a result, this projection cannot be used as in projected gradient descent (Sec. 2). To illustrate this, Fig. 1 shows a simple convex problem with a quadratic objective-the sphere function-and a linear constraint. When used as in the projected gradient descent update of Eq. (1), the resulting algorithm stales along the line with which it first exits C. Indeed, when optimizing the sphere function in an unconstrained way, gradient descent follows a straight line from x 0 to the origin. As it first exits C, the interpolation projection puts the iterate back on the same line and the algorithm keeps going back and forth indefinitely. In contrast, when optimizing the composition of the projection and the objective by gradient descent the iterate is pushed back to C in such a way that it moves towards the optimum. In fact, a simple computation shows us that when x k is not in C, the update in Eq.
(2) is linearly mixing the gradient of the objective f and the constraint h. Formally, when h(x k ) > 0, then g is differentiable at x k -from the assumption that h is-and the gradient ∇f • g(x k ) is given by Algorithm 1 Interpolation-based projection convex optimizer 1: Input: linear function f , convex function h, Lipschitz constants L and H (A1-A2), domain bound R (A4), initial point x 0 satisfying A3 and iteration count K. if h(x k ) ≤ 0 then 6: 10: end if 12: end for 13: return 1 Here J k is the Jacobian of g at x k , η k is short for η x k and I is the identity matrix. The expression of J k is obtained by straightforward computation, while Eq. (3) is obtained from the identity g( (3) shows that the gradient of f • g(x k ), when x k / ∈ C, is a linear mixing between the gradient of f at the projected point g(x k ) and the gradient of h at x k . Since h(x 0 ) < 0, the mixing term in Eq. (3) is positive iff ∇f (g(x k )) T (g(x k ) − x 0 ) ≤ 0. In fact, the first step in our convergence analysis is to show that the previous quantity is indeed always negative.
The mixing between the gradient of f and h is reminiscent of the conditional subgradient descent of Larsson et al. [1996]. This algorithm is an acceleration of PGD, that restricts the definition of a sub-gradient as a linear under estimator of f only within C. In this case, it is shown in Larsson et al. [1996] that when h(x k ) = 0, the set of conditional sub-gradients of f can be extended by adding any sub-gradient of f to a sub-gradient of h. Here however, the projection g(x k ) is not on the boundary of C-for example if h is strictly convex then h(g(x k )) < 0 and hence Eq. (3) is not necessarily a conditional subgradient of f , and the convergence analysis of our algorithm has to be carried out using different tools.
Alg. 1 summarises the optimization algorithm for constrained optimization using the interpolation-based projection. Alg. 1 starts by renormalizing h such that h(x 0 ) = −1, then defines the optimal step-size β w.r.t. an upper bound derived given assumptions A1 to A4 defined in the next section. Alg. 1 then follows a gradient descent (Eq. (2)), selecting a different step-size α k , as a function of a constant β, whether the iterate is inside or outside C. When x / ∈ C, the gradient is given by Eq.
(3). Alg. 1 then returns the average of the projected points. The algorithm operates a first order gradient descent on f • g, which as per Eq. (3), is of linear time and memory complexity.

Convergence analysis
The first step in the convergence analysis of Alg. 1 is a lemma showing that for an appropriate choice of the step-size α k , the quantity ∇f (g(x k )) T (g(x k ) − x 0 ) is always negative for k ≥ 0. As a consequence, the gradient of f • g will always mix gradients of objective and constraint with opposing directions when the iterate exits the valid set. We prove the lemma under the assumption of a linear objective function f , a Lipschitz continuous domain defining function h, in addition to the previously discussed assumption of an initial strictly feasible point x 0 .
Lemma 4.1. Under A1-A3, the sequence of x k produced by Alg. 1 verifies, for all k ≥ 0 and for β ≤ 1 LH , Proof. Let us prove the lemma by induction. For k = 0 the inequality is trivially true. Now assuming the inequality holds for some k ≥ 0. It implies that c T (g(x k ) − x 0 ) ≤ 0. We distinguish in the following two cases, whether x k is feasible or not. However, we treat both cases of feasibility of x k+1 jointly by writing By adding and subtracting x k inside the parentheses, and since for h( which from the induction hypothesis is the sum of two negative numbers and is thus negative. Now if h(x k ) > 0 then by again adding and subtracting x k , and by replacing x k+1 − x k with the gradient update following Eq. (3), we obtain From the induction hypothesis, it is sufficient that ≤ 1 for the last quantity to be negative. Using the fact that and using the Cauchy-Schwarz inequality as well as assumption A1 and A2, we obtain Since β ≤ 1 LH by assumption, the last quantity is ≤ 1 as desired. As such, we conclude that The assumption of the linearity of f is used in the induction step and allows several simplifications since for f linear, ∇f (x k+1 ) = ∇f (x k ). Extending the convergence analysis of Alg. 1 to non-linear objectives could be achieved by extending Lem. 4.1 to this case. However, as discussed in Sec. 4.1, since the assumptions on h are mild, many constrained convex optimization algorithms can be recast as problems solvable by Alg. 1.
To prove convergence of Alg. 1, we need an additional assumption on the boundedness of the distance to an optimum.
The convergence result for Alg. 1 is as follows From now on, and without loss of generality, we assume that h(x 0 ) = −1 and h is H-Lipschitz. We revert to the general case where h(x 0 ) < 0 at the end of the proof.
Following standard proofs of subgradient descent algorithms, our proof begins by estimating the distance of the iterate to the optimum As in Lem. 4.1, we study separately the case where x k ∈ C and x k / ∈ C. In each case, we derive an upper bound of x k+1 − x * 2 2 and then pick the largest of the two. Starting with x k / ∈ C, we replace ∇f • g(x k ) by its definition in Eq. (3), and by expanding the quadratic expression we obtain Adding and subtracting g(x k ) in ∇f (g(x k )) T (x k − x * ) and by expanding the definition of g(x k ) and η k when But from convexity of h, we know that h( In addition, α k and η k are always positive and from Lem. As a result the last term of Eq. (5) is always negative and x k+1 − x * 2 2 can be bounded by In the upper bound of Inq. (6), we will now bound the term α k ∇f • g(x k ) 2 2 that is specific to the case h(x k ) > 0. By replacing the gradient with its definition and using the fact that we have rescaled h such that h(x) = −1, we obtain Using the Cauchy-Schwarz inequality as well as assumption A1, A2 and A4 we obtain Replacing Eq. (7) into Eq. (6), using the definition of α k and since h(x 0 ) = −1 we have Now for the simpler case x k ∈ C we have Using assumption A1 and since x k = g(x k ) and α k = β when x k ∈ C, we obtain the following bound Clearly the upper bound of x k+1 − x * 2 2 in Inq. (8) is always larger than the one in Inq. (9). As such, we can use the upper bound of x k+1 − x * 2 2 in Inq. (8) for all iterates of Alg. 1. Letting A = L 2 (1 + HR) 2 , and averaging over the first K terms of both sides of Inq. (9) yields From the convexity of f we have that as well as 1 . Using these two properties yields Rearranging terms and cancelling telescoping sums yields Using A1, A2 and A4 and after replacing A we obtain Minimizing this upper bound w.r.t. to β gives the optimal fixed step-size β = This gives us a first condition on β, but to achieve the bound in Inq. (10), we made use of Lem. 4.1 which requires that β ≤ 1 LH , yielding an additional condition on K R L(1 + HR) Now the only remaining operation is to express the step-size, the condition on K in Inq. (11) and the error upper bound in Inq. (10) in terms of the original Lipschitz constant which is achieved simply by replacing H with H |h(x0)| in these inequalities.
The O( 1 √ K ) convergence rate is typical of sub-gradient descent on non-smooth convex functions [Nocedal and Wright, 2006], which is expected since f • g is non-smooth. Compared to projected gradient descent (PGD), the bound now shows an explicit dependence on the Lipschitz constant of h. This is also expected since in PGD the projection is assumed to be computable at no cost. As a result, the error bound of PGD does not depend on the gradient of h in any way, whereas in our algorithm this dependence is made explicit. Because of the non-smoothness of f • g and the resulting O( 1 √ K ) convergence rate, we do not expect the general formulation of Alg. 1 to be competitive with specialized convex optimizers developed for specific convex problem classes. However, the versatility and cheap computational cost of the interpolation projection offers large gains compared to convex optimizers when integrated into (non-convex) machine learning models, as shown in the experimental validation section.

Subgradients, multiple constraints and non-linear objectives
So far we have only considered a single inequality constraint. Alg. 1 and its theoretical guaranties can easily be extended to tackle multiple inequality constraints and an affine equality constraint To tackle constrained optimization in C , we define Alg. 1' that replaces Line 10 of Alg. 1. Specifically, the gradient ∇h in Eq. (3) is simply replaced by a sub-gradient of h. Under A1, A3-A5, this new algorithm has the same convergence properties of Alg. 1. Indeed, h being convex, the projection is still valid and will be given with interpolation weight η x = min i∈{1..
In summary, the differentiablity requirement of h can be relaxed to only require sub-differentiability, and multiple constraints are treated as a single constraint using the max over these sub-differentiable constraints. As for the affine equality constraint, it can be eliminated by replacing x with F z + x 0 as shown in Boyd and Vandenberghe [2004], where F is a matrix whose range is the null space of A under the condition that x 0 is a solution of Ax = b. Note that the objective function remains linear after the aforementioned change of variable, and hence the convergence guarantees still apply.
As for non-linear objectives, we note that most convex programs can be written as cone programs of the form min x∈K c T x, for a closed convex cone K and a linear objective [Nesterov and Nemirovskii, 1994]. In fact, there exists automated tools [Grant et al., 2006, Grant andBoyd, 2008] that perform this rewriting by replacing non-linear functions in the computational graph with their graph implementation-a generic epigraph-based representation. These tools are used by existing solvers such as CVX [Grant and Boyd, 2014], and for our algorithm to be applicable to these cone programs, one has to provide a domain defining function h equivalent to the constraint x ∈ K for all cones supported by the tool. In the next section, we provide numerical examples for the semi-definite cone, the second order cone and the linear cone.

Experimental validation
We first conduct numerical evaluations on toy convex problems to validate the theoretical analysis. The broader usage of the interpolation projection in machine learning is then evaluated in both a reinforcement and supervised learning setting.

Constrained convex optimization
Alg. 1 defines the step-size as a function of the domain bounds and the Lipschitz constants which are typically unknown in practice. We thus investigate on a wide range of convex optimization problems the robustness of the interpolation projection to the choice of (a potentially wrong) step-size. We compare our algorithm to Projected Gradient Descent (PGD, Rosen [1960], Nocedal and Wright [2006]) and subgradient descent (SubGD, Shor et al. [1985], Bertsekas [2015]). Subgradient descent is a converging descent algorithm that in our constrained setting operates by i) following the gradient of f if x ∈ C ii) following the (sub-)gradient of h otherwise. This algorithm is very simple and another objective of these numerical experiments is to investigate whether the mixing of the gradients ∇f and ∇h, obtained from differentiating through f • g in Eq.
(3), provides any practical advantage compared to the simpler scheme of subgradient descent. In the following, we denote our algorithm by IGD, where the 'I' stands for interpolation. We consider five problem classes comprising linear programs, semi-definite programs, second order cone programs, problems with a bounded 2 norm or with an exponential form constraint. Exact definition of each problem and their random generation process is deferred to the appendix.
Results. For each of the five problem classes, 100 random instances are generated and we compute at each iteration the smallest f (x k )−f (x * ) f (x0)−f (x * ) achieved so far. We compared the gradient descent algorithms with step-sizes from 10 −4 to 10 −1 . Experiments for each step-size are conducted on the same 100 problem instances, and although we plot the results for each step-size separately, one can easily extract the best performing step-size for each method from the same plots. The plots (deferred to the appendix) show that in 17 out of the 20 problems and step-sizes combinations, IGD outperforms SubGD, sometimes with several order of magnitudes. On semi-definite programs, SubGD performs better with larger step-sizes, although best results are still obtained overall by IGD with the smallest step-size. On the bounded norm problem where PGD is applicable, our algorithm is able to match PGD up until a precision ranging from 10 −2 to 10 −5 depending on the step-size, before tracking behind. In contrast, SubGD is distanced at a significantly lower precision. These results both demonstrate a certain robustness to the choice of step-size and a practical interest in the mixing of gradients obtained by differentiating through f • g. Thanks to the generality of the projection and the simplicity of performing unconstrained gradient descent on f • g, we expect the interpolation projection to find many usages in machine learning, two of which are presented in the next subsections.

Reinforcement learning in continuous action spaces
We consider in this section policy optimization updates that occur at each iteration of the approximate policy iteration (API) scheme [Bertsekas, 2011, Scherrer, 2014. To formalize the policy update in API we briefly introduce key concepts of reinforcement learning (RL). A Markov Decision Process (MDP) is a quintuple (S, A, R, P, γ) where S and A are state and action spaces, that are in our experiment R ds and R da respectively. P : S × A → P(S) and R : S × A → R determine the next state transition probability and reward upon the execution of a given action in a given state. We denote by q(a|s) the probability density of executing a ∈ A in s ∈ S according to the stochastic policy q. Additionally, for policy q we define the Q-function Q q (s, a) = E [ ∞ t=0 γ t R(s t , a t ) | s 0 = s, a 0 = a], where the expectation is taken w.r.t. random variables a t+1 ∼ q(.|s t ) and s t+1 ∼ p(.|s t , a t ) for t > 0; the value function V q (s) = E a∼q(.|s) [Q q (s, a)] and the advantage function A q (s, a) = Q q (s, a) − V q (s). The goal in API is to find the policy maximizing the policy return J(q) = V q (s 0 ) for some starting state s 0 .
API iterates three steps, generating data from the current policy q, evaluating A q and updating the policy q using A q . To update the policy we consider the maximization of A q under a KL divergence constraint between the current and next policies-establishing a 'step-size' in probability space-as is done in Schulman et al. [2015], Rajeswaran et al. [2017], Peters and Schaal [2008]. The policy update is given by subject to E s∼q [KL(p(.|s) q(.|s))] ≤ .
We specifically consider the setting where p and q are Gaussian policies. A Gaussian policy has density p(.|s) = N (µ(s), Σ), for co-variance matrix Σ and mean function µ(.). In our set-up we consider diagonal co-variance matrices as in Schulman et al. [2015], Rajeswaran et al. [2017] and linear-in-features or neural network based mean functions. The linear-in-feature mean function is given by µ(s) = φ(s) T M using the same random Fourier features φ of Rajeswaran et al. [2017] with 2000 entries, whereas the neural network mean function is given by a neural network following the architecture in Schulman et al. [2015] with 2 hidden layers with 64 neurons each. For estimating A q we follow Rajeswaran et al. [2017] and use a neural network to learn V q and estimate A q from trajectories. For both cases we use = 10 −2 as in Schulman et al. [2015]. To solve the aforementioned problems, both natural approaches with linear-in-features [Rajeswaran et al., 2017] and neural network mean functions [Schulman et al., 2015] follow the same approach: a second order approximation of the constraint (13) is computed, as well as a linear approximation of the objective function (12). The resulting problem is then solved in closed form resulting in the natural gradient update of the policy parameters. However, as the constraint satisfaction is not guaranteed-since the problem is solved by approximating the constraint-both approaches [Schulman et al., 2015, Rajeswaran et al., 2017 add a linesearch routine, interpolating between the new parameters and the parameters of q, to ensure that Inq. (13) holds.
To compare to natural gradient, we employ first a naive algorithm that optimizes objective (12) in an unconstrained way, with the Adam algorithm [Kingma and Ba, 2015], before calling the line-search routine used by the natural gradient approaches to ensure constraint satisfaction. Secondly, we augment the naive algorithm by adding an interpolation projection 'layer' to the output of the policy. The projection layer, as depicted in Fig. 2-left, takes as input a set of action means-given by evaluating the current mean function over a mini-batch of input states-and a covariance matrix and returns a new set of means and a covariance matrix that comply with the constraint. To formalize, let us define h and x 0 , the two elements needed to perform the interpolation projection. Given a finite set of states {s 1 , . . . , s K }, we define where µ q and Σ q are respectively the mean function and covariance matrix of q. h is convex and we use as x 0 for the interpolation projection the means and covariance matrix of q. The projection that returns a set of means and a covariance matrix compying with the KL divergence constraint is then given by g as in Sec. 3, from the definition of h and x 0 . To illustrate the algorithm, assume for a mini-batch of states {s 1 , . . . , s K } the mean and covariance functions return a mini-batch of means µ(s 1 ), . . . , µ(s K ) and a covariance matrix Σ. If the constraint, estimated for this mini-batch is violated, we use the projection g as in Sec. 3 to obtain a new set of means µ η (s 1 ), . . . , µ η (s K ) and covariance matrix where p η (.|s) = N (µ η (s), Σ η ). Once the objective is computed, we backpropagate throughout the whole computational graph which backpropagates through the interpolation projection.
In the linear-in-feature case, we note that the KL divergence is not only convex in the mean and covariance of the Gaussian but also in the policy parameters. Specifically, we have that is a convex function in M and Σ, and from linearity of the mean function interpolating the means or the parameter M directly are equivalent. Moreover, the η obtained using h(M, Σ) or h(µ(s 1 ), . . . , µ(s K ), Σ) will be identical for a given mini-batch since the value of h will be the same in both cases. The optimization process can thus be seen as performing gradient descent on (f • g)(M, Σ), where f is the objective (12). This is similar to the convex optimization setting studied theoretically, except f is now non-linear non-convexbecause A q is not necessarily convex. However, the empirical results show that the optimization scheme still performs well despite f • g being non-convex. This is not entirely surprising since gradient descent is widely used and well behaved for non-convex problems too.
To generate real RL optimization problems, we run natural gradient on the BipedalWalker-v2 environment [Brockman et al., 2016] for one million steps with a policy update after a minimum of 3000 steps. We run 11 of such independent runs, generating over 3000 optimization problems for each of the linear and non-linear cases. Both the naive algorithm and the projection augmented algorithm use the same hyperparameters for the update, by performing 30 epochs with a step-size 1 of 5 × 10 −5 ). For each of the 3000 optimization problems, we record the ratio between the objective value when solving the problem with gradient descent, divided by the value when solving the problem following the natural gradient baselines in each of the linear [Rajeswaran et al., 2017] and non-linear [Schulman et al., 2015] case. A value larger than 1 indicates that the method solved the constrained problem better than the state-of-the-art.   Fig. 2 shows the distribution of such ratios for the linear and non-linear mean function cases. In both cases, without the projection, the unconstrained optimization with a final line-search step performs significantly worse than natural gradient descent. In contrast, adding the interpolation projection of the Gaussian distributions' parameters while using the same optimization scheme, results in a median improvement over natural gradient of 31% and 57% for the linear and non-linear mean function cases respectively. Note that in the linear case, the optimization setting resembles the earlier convex optimization experiments as the constraint is convex in the input means of h but also directly on the parameters of the mean function M . When the mean function is a neural network, the interpolation projection still seems to guide the gradient descent algorithm towards regions of the parameter space that better trade off objective maximization and constraint satisfaction than the naive algorithm.
We also evaluated replacing the interpolation layer with an orthogonal projection using a differentiable convex solver [Agrawal et al., 2019]. The orthogonal projection receives the same input means and covariance matrix as the interpolation projection but returns instead the parameters that minimize the Euclidean distance to the inputs while complying with the KL divergence constraint. This is a convex problem and we used the tools of [Agrawal et al., 2019] to both compute the forward pass-solve the convex problem-and the backward pass-differentiate around the solution of the convex problem-of this computational graph. The computational cost of this model is more than 300 times that of the vanilla neural network model, while our model with the interpolation projection is only about 1.5 more expensive. Due to the increased computational costs, we performed only 6 independent runs for this comparison totaling about 1700 optimization problems. Comparison between the two optimization schemes are shown in Fig. 3. Surprisingly, the interpolation projection performs better than the more accurate projection, perhaps because of a better interplay between the interpolation projection and the subsequent line-search routine, while being significantly cheaper to compute.

Supervised learning of dynamics models
In the previous experiment we have shown how the interpolation projection can be used to tackle constrained optimization problems in the context of RL. In this experiment, we provide an example of an inductive bias in the form of a convex constraint on the outputs of a neural network, and we show how the interpolation projection can be used to comply with these constraints. The task consists in predicting the position, for several steps in the future, of 7 circular rigid bodies connected in 3 different configurations with respectively 6, 9 and 12 strings of the same length as shown in Fig. 4. The considered inductive bias constrains the distance between predicted positions of connected rigid bodies to be at most the length of the string. To comply with the constraint, we add after the prediction of the neural network y t , an interpolation projection that returns g(y t ), such that the constraints imposed by the strings are respected. To compute g, we define h as the maximum distance between linked bodies, which is convex, and use as 'x 0 '-the anchor point of the interpolation projection-an imaginary configuration that places all rigid bodies in the average of their position according to y t−1 . This point has thus zero distance between all circular bodies and strictly satisfies the constraints. Given h and 'x 0 ', the interpolation projection g follows as in Sec. 3.
To predict the next position we use a neural network with 4 hidden layers having 256 nodes each. The network takes as input the last three positions of all 7 circular bodies and outputs the change to the current set of positions. We train this neural network as a recursive neural network (RNN), using backpropagation through time, as the predicted position in the next time-step is fed back to its input. In addition to the base RNN model, we evaluate the same RNN with the inductive bias in the form of convex constraints as described above. Ground truth trajectories are generated by letting the object fall from a distance of 400 units of measure (u.m.), after applying an initial force generated by selecting a node uniformly at random then applying a force with constant norm sampled uniformly at random on an upper half circle. The diameter of the circular rigid body is 1 u.m. Box2d [Catto, 2007] is used to simulate 200 of such trajectories, 50 of which are used for training, 75 for validation and 75 for test. Each trajectory contains 485 time-steps and the train set alone contains circa 24K time-steps. We train both the RNN and RNN with convex constraints for a fixed time of 3 days on a single core of an AMD 3900x.
The generalization results in Tab. 1 show that both models can synthesize relatively close trajectories to the original ones for an extended period of time (485 time-steps at 60Hz) from only the first three time-steps of the test trajectories. The results also show that the additional interpolation projection layer, enforcing compliance with the physical constraints imposed by the strings, reduces the prediction error for the two shapes with the most strings; while for the simpler chain shape, the vanilla model performs better. The worse performance in this setup might be the result of the additional non-smoothness introduced by the interpolation projection. Yet, even when it under-performs quantitatively with the chain shape, the trajectories generated by the projection augmented model can look qualitatively better since the vanilla model sometimes exhibits large violations of the constraints as shown in Fig. 5. In conclusion, introducing an inductive bias through additional constraints and using the interpolation projection to comply with the constraints showed promising results both quantitatively and qualitatively, with little computational overhead-the training procedure becoming only about 1.2 times slower. In comparison, we were unable to run the baseline with the optimal projection layer that solves a convex problem for every forward pass. Compared to the RL setting, the combined effect of a larger dataset (more than 10x) and the increased number of convex problems to solve per gradient update (up to 240x du to the back-propagation through time) would require several months for the training procedure to complete on the same AMD 3900x processor.

Conclusion
We introduced in this paper an interpolation-based projection onto a convex set that can be readily computed for any convex domain defining function. We then derived a descent algorithm based on the composition of the objective and the projection and showed that this surprisingly yields a convergent algorithm when the objective is linear, despite the 'sub-optimality' of the projection. From a practical point of view, we have shown that this projection when added as a layer to computational models, allows to tackle constrained optimization in reinforcement learning or adds an inductive bias to predictive models. Because the projection is general and computationally frugal, we think this work can find many other applications in machine learning where intermediary nodes of a computational graph are constrained to be in a convex set.

Appendix A. Convex optimization numerical illustration
We describe in more details the experimental setting of the convex optimization comparisons. We consider five problem classes comprising linear programs, semi-definite programs, second order cone programs, problems with a bounded 2 norm and problems with an exponential form constraint. The form of the domain defining function h for each of these problems is trivial except for the semi-definite cone, where we used h(A) = −λ min , the negative of the smallest eigenvalue of the symmetric real valued matrix A. The subgradient of h w.r.t. A is given in this case by −vv T , where v is the eigenvector associated with λ min . We now detail each problem class and its random instance generation.
Linear program (Lin). The problem is We generate instances such that the optimum is at (0, . . . , 0) T and the constraints are active at the optimum. The objective is generated by sampling a c uniformly at random on the hyper-sphere. Following the idea in Hansen et al. [2016Hansen et al. [ , 2019, we define the constraints of such problems by setting the gradient of the first constraint to a 1 = −c to ensure the Karush-Kuhn-Tucker optimality conditions Kuhn and Tucker [1951], Nocedal and Wright [2006] hold at (0, . . . , 0) T . At this point, the point x = c is feasible and we generate the remaining M − 1 constraints randomly while making sure that x remains feasible. Specifically, each a i , for i ∈ {2 . . . M }, is sampled on the hypersphere uniformly at random and redefined as Semi-definite program (SDP). The dual of the problem is given by The constraint implies that i x i A i − C is a positive semi-definite matrix. We generate the problem data following the code of Malick et al. [2009] to obtain problems where strong duality holds. There is one difference in the generation of the matrices A i , that are made sparse in the original code, while we use A i = 1 2 (B i + B T i ) with entries of B i sampled from the Normal distribution. Second order cone program (SOC). The problem is The objective is generated by sampling a c uniformly at random on the hyper-sphere. Then an x 0 is generated following the same procedure. All other problem data are then sampled from the normal distribution except d i that is computed such that h( x 2 ≤ 1. A random instance of the problem is generated by sampling a vector c uniformly at random on the hypersphere such that the optimum x * is −c with value f (x * ) = −1. Figure 6: Comparison of first order descent algorithms with different step-sizes on linear programs (leftmost column), semidefinite programs, second order cone programs, programs with bounded norm or exponential shaped constraint (rightmost column).
Step-size β ranges from 0.1 on the first row to 10 −4 on the forth row. All plots averaged over 100 runs.
Exponential constraint (Exp) The problem is min x c T x, where b is a vector that has on each entry W (1), the Lambert W function evaluated at 1. It is designed such that the minimum of the constraint is attained at (0, . . . , 0) T , facilitating the generation of feasible points. c is generated by sampling uniformly at random on the hyper-sphere.
Obtaining x 0 and f (x * ). For Lin, Norm and Exp, x 0 is generated by uniformly sampling at random in the unit ball, and resampling if the point is not feasible. For SDP we use x 0 as in the code of Malick et al. [2009]. For SOC, our algorithm cannot use the x 0 described in the problem definition, since h(x 0 ) = 0. To obtain a valid x 0 for our algorithm, starting from the aforementioned x 0 , we perform 100 optimization steps with Adam Kingma and Ba [2015] and a step-size of 10 −2 on the maximum over the constraints, and use the newly obtained point as the x 0 for all algorithms. For Lin and Norm, f (x * ) is known whereas we estimate it for the remaining problems using CVXPY Diamond and Boyd [2016] with the highest precision available.
Performance metrics. For every optimization problem we randomly generate an instance and run all optimizers for 10000 iterations. We repeat this procedure 100 times for every problem. For each run, and at each iteration k, we compute min t∈{1..k} f (g(x t )) where g is the norm minimizing projection for PGD or the interpolation projection for our algorithm. For subgradient descent we use instead min t∈{i∈{1..k}s.t.h(xi)≤0} f (x t ), i.e. we pick the best point so far that is in C. We consider the min instead of f ( 1 k k t=0 g(x t )) as an evaluation metric for our algorithm in order to allow for comparisons with the subgradient descent method in which the average point so far, is not necessarily in C. Note that the theoretical guarantees given by Thm. 4.2 are exactly the same for this min criterion since min t∈{1..k} f (g(x t )) ≤ 1 k k t=0 f (g(x t )) can be used in a similar way in the proof in lieu of the average point. In order to allow for meaningful averaging between the several randomly generated instances, we normalize the performance between 0 and 1 for each run by subtracting f (x * ) and dividing by f (x 0 ) − f (x * ). Instances of Lin, SDP, SOC and Norm and Exp are of dimensionality 10, 10, 20, 100 and 2 respectively. For each problem, we evaluated all algorithms with step-sizes β of 10 −4 , 10 −3 , 10 −2 and 10 −1 . Random instances across different step-sizes are identical and results are therefore directly comparable. Finally, the performance plots in Fig. 6 are obtained by plotting the median and the upper and lower quantiles.
Results. On the plots of Fig. 6, one can notice on all problems that the performance of all algorithms perfectly overlaps in initial iterations. That is due to the fact that all compared algorithms are similar up to the point where an iterate first exits the feasible set C. The plots also show that in 17 out of the 20 problem and step-size combination, IGD outperforms SubGD, sometimes with several order of magnitude. On semi-definite programs, SubGD performs better with larger step-sizes, although best results are still obtained overall by IGD with the smallest step-size. On the Norm problem where PGD is applicable and with β = 0.001, we observe that both PGD and IGD perform very similarly despite the simplicity and the linear nature of the projection used by our algorithm, and both algorithms perform better than the more naive SubGD baseline. On these problems, our algorithm is able to match PGD up until a precision ranging from 10 −2 to 10 −5 for different step-sizes, before tracking behind. In contrast SubGD is distanced at a significantly lower precision. All combined, these results both demonstrate a certain robustness to the choice of step-size and a practical interest in the mixing of gradients obtained by differentiating through f • g. Thanks to the generality of the projection and the simplicity of performing unconstrained gradient descent on f • g, we expect the interpolation projection to find many usages in machine learning.