1 Introduction

Several recent works have investigated the integration of a ‘convex optimization layer’ within the computational graph of machine learning architectures, in applications such as optimal control (de Avila Belbute-Peres et al. 2018; Amos et al. 2018), computer vision (Bertinetto et al. 2019; Lee et al. 2019) or filtering (Barratt and Boyd 2019). Within this line of research, we distinguish two use cases for convex optimization. In the first use case, the computation performed by the ‘convex optimization layer’ is a convex problem by definition. For example, a node can compute the maximum a posteriori of an image model (de Avila Belbute-Peres et al. 2018; Amos et al. 2018). In the second use case, a node restricts its input, by means of a projection, to a convex set, and is a convex optimization problem by choice. For example, a node can restrict its input to the set of physically plausible vertex deformations (Geng et al. 2019).

In the second use case, it was shown in Geng et al. (2019) that the projection step benefits from being fully integrated into the learning process, in both the forward and backward passes. Let x be the input of the projection layer, g be the projection, and f be the ensuing computations, e.g. a loss function. Integrating the projection into the backward pass amounts to differentiating through \(f\circ g (x)\). There have been several advances in differentiating through convex programs (Agrawal et al. 2019). However, the forward and backward passes on g remain significantly more expensive than the typical matrix multiplications that would precede or succeed g (Amos and Kolter 2017). We investigate in this paper an alternative projection that is more lightweight to compute and differentiate than solving a convex program. Even though the proposed projection is sub-optimal, in the sense that it does not return the closest point to its input within the admissible set, the rationale behind the proposed algorithm is that, since we differentiate through both f and g, a sub-optimal projection can still drive the optimization process to an optimal point.

The proposed projection maps any input x to a feasible point g(x) by simply interpolating x with a point \(x_0\) satisfying the convex inequality constraints. The interpolation parameter is computed in closed form by exploiting the convexity of the domain defining function. We first show in this paper that the interpolation-based projection, when used as in projected gradient descent (Rosen 1960; Nocedal and Wright 2006) by projecting the iterate after each gradient step, does not converge to an optimum. However, when differentiating through both the objective and the projection, we show that the resulting algorithm converges for a linear objective and arbitrary convex and Lipschitz domain defining functions. Finally, in addition to the theoretical analysis, we provide empirical results using the projection in conjunction with neural network models in reinforcement and supervised learning. Our results show that the proposed projection can be used to tackle constrained policy optimization or to provide an inductive bias improving generalization, while being significantly cheaper to compute than an orthogonal, ‘optimal’ projection.

This work generalizes and formally analyzes previous interpolation-based projections we developed in the context of reinforcement learning (RL) in Akrour et al. (2019). Several RL algorithms add information-theoretic constraints to the policy optimization problem, such as a minimal entropy or a maximal Kullback–Leibler (KL) divergence to the data generating policy (Deisenroth et al. 2013). We proposed in Akrour et al. (2019) differentiable policy parameterizations that comply with these constraints by construction, allowing the policy optimization problem to be solved by standard gradient descent algorithms. These parameterizations were based on interpolating any input parameterization of a distribution with a constraint satisfying parameterization. For example, an input discrete distribution can be interpolated with the uniform distribution, which satisfies any reasonable minimal entropy constraint. Interestingly, although these projections were not ‘optimal’, in the sense that they do not minimize a distance to the admissible set, we noted empirically (see Akrour et al. (2019), Fig. 1 and surrounding text) that such parameterizations would always drive the descent algorithm to an optimum on a toy problem with a linear objective and a convex entropy constraint. The main contribution of this paper is to generalize the idea of interpolation projections to arbitrary convex domain defining functions and to prove convergence of a descent algorithm leveraging this projection. From a practical point of view, in addition to the previously discussed RL application, we provide an example usage of the interpolation projection in a supervised learning context. The interpolation projection can be used as an inexpensive and differentiable operator to add convex constraints to the output of a neural network model, while being significantly cheaper than norm minimizing projections (Agrawal et al. 2019).

Computationally frugal projections were previously studied in the context of feasibility problems (Combettes 1997), where the goal is to find a point inside a convex set. The approximate projection of Combettes (1997) uses the gradient of a violated inequality constraint to find a half-space that is a superset of the feasible set. Then an orthogonal projection onto this hyper-plane is performed, resulting in a point outside of the feasible set but closer to the set than the input point. In contrast, our projection is not based on the gradient of the constraint but on its convexity, and results in a point inside the feasible set. Moreover, the optimization setting we consider is more general than the feasibility setting, and our assumption of an initial feasible \(x_0\) would already solve the problem of Combettes (1997). As such, our work and that of Combettes (1997) differ both in their objectives and their methods. In Xu (2018) and Lan and Zhou (2016), approximate projections are derived for the case where the number of constraints is large, but these algorithms still rely on expensive orthogonal projections. To the best of our knowledge, no other work has previously shown convergence of a convex optimizer with non-orthogonal projections. The practical implication is a cheap way of adding convex constraints to machine learning models, as shown in the experimental validation section.

2 Preliminaries

Let us first introduce and analyse the ideas in a convex optimization setting. Let \(f : {{\mathbb {R}}}^d \rightarrow {{\mathbb {R}}}\) and \(h: {{\mathbb {R}}}^d \rightarrow {{\mathbb {R}}}\) be convex and differentiable functions. We consider the following convex program

$$\begin{aligned} \min _{x \in {{\mathbb {R}}}^d} \quad&f(x),\\ \text {s.t.} \quad&h(x) \le 0. \end{aligned}$$
(P)

For clarity of exposition, we initially consider only a single inequality constraint with differentiable h. Our results extend straightforwardly to multiple, sub-differentiable inequality constraints in Sect. 4.1. For the convergence analysis in Sect. 4, we only consider the case of a linear function \(f(x) =c^Tx\). However, we also discuss in Sect. 4.1 how several convex problems can be rewritten in this form. For now, let us assume that f is an arbitrary differentiable convex function.

Letting the convex set \(\mathcal{C} \subseteq {{\mathbb {R}}}^d\) be defined by \(\mathcal{C}=\{x\in {{\mathbb {R}}}^d: h(x) \le 0\}\), the optimization problem (P) can be reformulated as \(\min _{x\in \mathcal C}f(x)\). To solve this problem, one approach is to use the Projected Gradient Descent (PGD) algorithm (Rosen 1960; Nocedal and Wright 2006) which is given by the following equation

$$\begin{aligned} x_{k+1}&= g\left( x_k -\alpha \nabla f(x_k)\right) , \end{aligned}$$
(1)

where g is a mapping that projects points from \({{\mathbb {R}}}^d\) to \(\mathcal C\). The projection g is defined by the minimization \(g(x) = {\arg \min }_{y\in \mathcal{C}} \Vert x-y\Vert _2\) of the Euclidean norm \(\Vert .\Vert _2\) on \({{\mathbb {R}}}^d\). Mirror descent (Bubeck 2014), an alternative for solving (P), can be seen as a generalization of PGD to other distances. These projection-based methods are most efficient when a closed form expression of the projection exists. Otherwise, a nested optimization problem needs to be solved after every gradient update of the iterate.
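As an illustration, the following is a minimal sketch of PGD when \(\mathcal C\) is the Euclidean ball of radius r, one of the few cases where the orthogonal projection has a closed form (the problem instance, step-size and iteration count below are illustrative):

```python
import numpy as np

def pgd_ball(grad_f, x, r, alpha, iters):
    """Projected gradient descent for min f(x) s.t. ||x||_2 <= r."""
    for _ in range(iters):
        x = x - alpha * grad_f(x)           # gradient step
        norm = np.linalg.norm(x)
        if norm > r:                        # closed-form Euclidean projection onto the ball
            x = x * (r / norm)
    return x

# Example: minimize f(x) = ||x - [2, 0]||^2 over the unit ball.
x_star = pgd_ball(lambda x: 2 * (x - np.array([2.0, 0.0])),
                  np.zeros(2), r=1.0, alpha=0.1, iters=200)   # converges to [1, 0]
```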

Other approaches such as the Frank–Wolfe method or the interior-point method also solve series of optimization problems. The Frank-Wolfe method (Frank and Wolfe 1956) solves a series of linear approximations of the problem, \(x_{k+1} = \arg \min _{x \in \mathcal C}\nabla f(x_k)^Tx\); and the interior-point method (Karmarkar 1984; Nesterov and Nemirovskii 1994) introduces a slack variable s for the inequality constraint and solves \(f(x) - \mu _k \ln s\) under an equality constraint, for a series of values of \(\mu _k\) going to 0.

In contrast to all these methods, our algorithm takes a simpler and more direct approach by performing gradient descent on the composition of the objective and a projection. The proposed interpolation-based projection will transform the constrained problem (P) into an unconstrained one. The projection is readily defined without any other assumption than the convexity of h and the availability of a strictly admissible point. Unlike previous algorithms, the interpolation projection is not defined as the minimization of a norm. To alleviate any ambiguity, from here on the term projection is understood in the sense of the following, more general definition.

Definition 1

A projection g is a mapping from a set to a subset thereof.

Specifically, in this paper the superset is \({{\mathbb {R}}}^d\) and the subset is \(\mathcal C\).

3 Interpolation-based projection and gradient descent

To solve the optimization problem \(\min _{x\in \mathcal{C}}f(x)\) described in (P), we use a projection g that will ensure that for all \(x \in {{\mathbb {R}}}^d\), \(g(x) \in \mathcal C\), i.e. \(h(g(x)) \le 0\). The projection g is defined for any convex function h, provided there exists some point \(x_0\) strictly satisfying the constraint, i.e. \(h(x_0) < 0\), in which case g is given by

$$\begin{aligned} g(x) = {\left\{ \begin{array}{ll} x &{}\text {if } h(x) \le 0,\\ \eta _x x + (1-\eta _x) x_0 &{} \text {else}, \end{array}\right. } \end{aligned}$$

with \(\eta _x = \frac{h(x_0)}{h(x_0)-h(x)}\). When \(h(x) > 0\), g simply interpolates between the violating point x and the point \(x_0\) in \(\mathcal C\); otherwise, it returns x itself. We would like to emphasize that knowing an initially feasible point \(x_0\) can be a strong assumption for some applications and finding such an \(x_0\) can be a costly procedure in itself. However, in many applications such as the reinforcement and supervised learning ones considered in the experiments section, a trivial feasible point is readily available.
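As an illustration, a minimal NumPy sketch of g, assuming h is available as a callable and \(x_0\) is a known strictly feasible point (the unit-ball example at the end is illustrative):

```python
import numpy as np

def interp_project(x, x0, h):
    """Interpolation-based projection onto {x : h(x) <= 0}, given h(x0) < 0."""
    hx = h(x)
    if hx <= 0:
        return x                                 # already feasible
    eta = h(x0) / (h(x0) - hx)                   # closed-form interpolation weight in (0, 1)
    return eta * x + (1.0 - eta) * x0

# Example with the unit ball: h(x) = ||x||^2 - 1 and x0 = 0.
h = lambda x: x @ x - 1.0
x = np.array([3.0, 4.0])                         # h(x) = 24 > 0
print(interp_project(x, np.zeros(2), h))         # [0.12, 0.16], inside the ball
# Note: not the closest feasible point (the orthogonal projection would be [0.6, 0.8]).
```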

Fig. 1 Sequence of points generated by algorithms (a) and (b) with interpolation projection g. Since g is not a projection in the \(\ell _2\) minimizing sense, it cannot be used as in PGD (a). However, taking the derivative of the projection into account as in (b) drives the algorithm to the optimum

Proposition 1

g is a projection from \({{\mathbb {R}}}^d\) to \(\mathcal C\).

Proof

We will demonstrate that \(g(x) \in \mathcal C\) for all \(x \in {{\mathbb {R}}}^d\). If \(h(x) \le 0\), g(x) is in \(\mathcal C\) by definition. If \(h(x) > 0\) then \(\eta _x \in (0,1)\) since \(h(x_0) - h(x)< h(x_0) < 0\) and

$$\begin{aligned} h(g(x))&= h(\eta _x x + (1-\eta _x) x_0),\\&\le \eta _x h(x) + (1-\eta _x) h(x_0), \\&= h(x_0) - \eta _x (h(x_0) - h(x)),\\&= 0. \end{aligned}$$
(h convex)

\(\square\)

Algorithm 1 (listing not reproduced; the algorithm is summarized in the last paragraph of this section, with a code sketch following it)

Even though g is a projection in the sense of Definition 1, it is not a projection in the usual sense of minimizing a norm between x and elements of \(\mathcal C\). As a result, this projection cannot be used as in projected gradient descent (Sect. 2). To illustrate this, Fig. 1 shows a simple convex problem with a quadratic objective (the sphere function) and a linear constraint. When used as in the projected gradient descent update of Eq. (1), the resulting algorithm stalls along the line with which it first exits \(\mathcal C\). Indeed, when optimizing the sphere function in an unconstrained way, gradient descent follows a straight line from \(x_0\) to the origin. As it first exits \(\mathcal C\), the interpolation projection puts the iterate back on the same line and the algorithm keeps going back and forth indefinitely. In contrast, when optimizing the composition of the projection and the objective by gradient descent

$$\begin{aligned} x_{k+1} = x_k - \alpha _k \nabla f\circ g (x_k), \end{aligned}$$
(2)

the iterate is pushed back to \(\mathcal C\) in such a way that it moves towards the optimum. In fact, a simple computation shows that when \(x_k\) is not in \(\mathcal C\), the update in Eq. (2) linearly mixes the gradients of the objective f and of the constraint h. Formally, when \(h(x_k) > 0\), g is differentiable at \(x_k\) (from the assumption that h is) and the gradient \(\nabla f\circ g (x_k)\) is given by

$$\begin{aligned} \nabla f\circ g (x_k)&= J_k(x_{k})^T \nabla f(g(x_k)),\nonumber \\&= \eta _k \left( I + \frac{\nabla h(x_k)(x_k-x_0)^T}{h(x_0)-h(x_k)}\right) \nabla f(g(x_k)),\nonumber \\&= \eta _k \left( \nabla f(g(x_k)) + \frac{\nabla f(g(x_k))^T(g(x_k)-x_0)}{h(x_0)}\nabla h(x_k)\right) . \end{aligned}$$
(3)

Here \(J_k\) is the Jacobian of g at \(x_k\), \(\eta _k\) is short for \(\eta _{x_k}\) and I is the identity matrix. The expression of \(J_k\) is obtained by straightforward computation, while Eq. (3) is obtained from the identity \(g(x_k)-x_0=\eta _k(x_k - x_0)\). Equation (3) shows that the gradient of \(f\circ g(x_k)\), when \(x_k\notin \mathcal C\), is a linear mixing between the gradient of f at the projected point \(g(x_k)\) and the gradient of h at \(x_k\). Since \(h(x_0) < 0\), the mixing term in Eq. (3) is positive iff \(\nabla f(g(x_k))^T(g(x_k)-x_0)\le 0\). In fact, the first step in our convergence analysis is to show that the previous quantity is indeed always negative.
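The following sketch illustrates Eq. (3) by checking the gradient obtained by automatic differentiation through \(f\circ g\) against the closed form, for the illustrative choices \(h(x)=\Vert x\Vert _2^2-1\), \(f(x)=c^Tx\) and \(x_0=0\):

```python
import torch

def h(x):
    return x.dot(x) - 1.0                      # illustrative constraint: unit ball

def g(x, x0):
    hx, hx0 = h(x), h(x0)
    if hx <= 0:
        return x
    eta = hx0 / (hx0 - hx)
    return eta * x + (1 - eta) * x0

c = torch.tensor([1.0, -2.0, 0.5])             # linear objective f(x) = c^T x
x0 = torch.zeros(3)                            # h(x0) = -1 < 0
x = torch.tensor([3.0, 0.0, 4.0], requires_grad=True)   # h(x) = 24 > 0, infeasible

# Gradient of f∘g by automatic differentiation.
c.dot(g(x, x0)).backward()

# Closed-form gradient from Eq. (3), with ∇h(x) = 2x for this choice of h.
with torch.no_grad():
    eta = h(x0) / (h(x0) - h(x))
    gx = eta * x + (1 - eta) * x0
    grad_eq3 = eta * (c + (c.dot(gx - x0) / h(x0)) * (2 * x))

print(torch.allclose(x.grad, grad_eq3))        # expected: True
```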

The mixing between the gradients of f and h is reminiscent of the conditional subgradient descent of Larsson et al. (1996). This algorithm is an acceleration of PGD that restricts the definition of a sub-gradient to be a linear under-estimator of f only within \(\mathcal C\). In this case, it is shown in Larsson et al. (1996) that when \(h(x_k) = 0\), the set of conditional sub-gradients of f can be extended by adding any sub-gradient of f to a sub-gradient of h. Here, however, the projection \(g(x_k)\) is not necessarily on the boundary of \(\mathcal C\) (for example, if h is strictly convex then \(h(g(x_k))<0\)); hence Eq. (3) is not necessarily a conditional subgradient of f, and the convergence analysis of our algorithm has to be carried out using different tools.

Algorithm 1 summarises the procedure for constrained optimization using the interpolation-based projection. Algorithm 1 starts by renormalizing h such that \(h(x_0) = -1\), then defines the step-size \(\beta\) that is optimal w.r.t. an upper bound derived under assumptions A1 to A4, defined in the next section. Algorithm 1 then follows a gradient descent (Eq. (2)), selecting a different step-size \(\alpha _k\), as a function of the constant \(\beta\), depending on whether the iterate is inside or outside \(\mathcal C\). When \(x \notin \mathcal C\), the gradient is given by Eq. (3). Algorithm 1 then returns the average of the projected points. The algorithm performs a first-order gradient descent on \(f\circ g\) which, as per Eq. (3), has linear time and memory complexity. The definition of the step-size \(\beta\) requires two problem-specific quantities that are generally not known in advance. While these quantities are necessary for the convergence analysis of the algorithm, we show in the experiments section that Algorithm 1 is robust to a broader range of step-sizes.
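The following NumPy sketch gives our reading of Algorithm 1 for \(f(x)=c^Tx\). The constants L, H and R are the problem-specific quantities of assumptions A1 to A4, and the step-size rule outside \(\mathcal C\), \(\alpha _k = \beta /\eta _k\), is our reconstruction, chosen to be consistent with Eqs. (3) and (7), not a verbatim transcription of the listing:

```python
import numpy as np

def interp_gradient_descent(c, h, grad_h, x0, K, L, H, R):
    """Sketch of Algorithm 1: minimize c^T x s.t. h(x) <= 0, with h(x0) < 0."""
    scale = abs(h(x0))                      # rescale h so that h(x0) = -1
    H0 = H / scale
    beta = R / (L * (1.0 + H0 * R) * np.sqrt(K))   # step-size of Theorem 1

    x, avg = x0.copy(), np.zeros_like(x0)
    for _ in range(K):
        hx = h(x) / scale
        if hx <= 0:                         # feasible iterate: plain gradient step on f
            gx, grad, alpha = x, c, beta
        else:                               # infeasible: differentiate through g (Eq. (3))
            eta = 1.0 / (1.0 + hx)          # = h(x0)/(h(x0) - h(x)) with h(x0) = -1
            gx = eta * x + (1.0 - eta) * x0
            grad = eta * (c - (c @ (gx - x0)) * grad_h(x) / scale)
            alpha = beta / eta              # assumed step-size rule outside C
        avg += gx / K                       # the average projected point is returned
        x = x - alpha * grad
    return avg
```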

4 Convergence analysis

The first step in the convergence analysis of Algorithm 1 is a lemma showing that for an appropriate choice of the step-size \(\alpha _k\), the quantity \(\nabla f(g(x_k))^T(g(x_k)-x_0)\) is always negative for \(k \ge 0\). As a consequence, the gradient of \(f\circ g\) will always mix gradients of the objective and the constraint with opposing directions when the iterate exits \(\mathcal C\). We prove the lemma under the assumption of a linear objective function f and a Lipschitz continuous domain defining function h, in addition to the previously discussed assumption of an initial strictly feasible point \(x_0\).

A1.:

\(f(x) = c^T x\) is a linear function in \({{\mathbb {R}}}^d\) and \(\Vert c\Vert _2 \le L\).

A2.:

h is convex, everywhere differentiable in \({{\mathbb {R}}}^d\) and H-Lipschitz w.r.t. \(\Vert .\Vert _2\).

A3.:

There exists \(x_0\) such that \(h(x_0) < 0\).

Lemma 1

Under A1–A3, the sequence of \(x_k\) produced by Algorithm 1 verifies, for all \(k\ge 0\) and for \(\beta \le \frac{1}{LH}\), \(\nabla f(g(x_k))^T(g(x_k)-x_0) \le 0\).

Proof

Let us prove the lemma by induction. For \(k = 0\) the inequality is trivially true. Now assume the inequality holds for some \(k \ge 0\); it implies that \(c^T(g(x_k) - x_0) \le 0\). In the following, we distinguish two cases, depending on whether \(x_k\) is feasible or not. However, we treat both cases of feasibility of \(x_{k+1}\) jointly by writing \(g(x_{k+1}) - x_0 = \eta _{k+1}\left( x_{k+1}-x_0\right)\), which holds by setting \(\eta _{k+1} = 1\) when \(x_{k+1}\) is feasible. First, assume \(h(x_k) \le 0\); then

$$\begin{aligned} \nabla f(g(x_{k+1}))^T(g(x_{k+1})-x_0)&= \eta _{k+1}c^T(x_{k+1}-x_0). \end{aligned}$$

By adding and subtracting \(x_k\) inside the parentheses, and since for \(h(x_k)\le 0\), \(x_{k+1}-x_k = -\alpha _k c\), we arrive at

$$\begin{aligned} \nabla f(g(x_{k+1}))^T(g(x_{k+1})-x_0)&= \eta _{k+1}\big (-\alpha _k c^Tc +c^T (x_k-x_0)\big ), \end{aligned}$$

which from the induction hypothesis is the sum of two negative numbers and is thus negative. Now if \(h(x_k) > 0\) then by again adding and subtracting \(x_k\), and by replacing \(x_{k+1}-x_k\) with the gradient update following Eq. (3), we obtain

$$\begin{aligned} \nabla f(g(x_{k+1}))^T(g(x_{k+1})-x_0)&= \eta _{k+1}\Bigg (-\alpha _k \eta _k c^Tc + c^T (x_k-x_0)\\&\quad \bigg (1 - \frac{\alpha _k \eta _k}{h(x_0)-h(x_k)}c^T\nabla h(x_k)\bigg )\Bigg ). \end{aligned}$$

From the induction hypothesis, it is sufficient for the last quantity to be negative, that \(\frac{\alpha _k \eta _k}{h(x_0)-h(x_k)}c^T\nabla h(x_k) \le 1\). Using the fact that

$$\begin{aligned} \frac{\alpha _k \eta _k}{h(x_0)-h(x_k)}c^T\nabla h(x_k) \le \left| \frac{\alpha _k \eta _k}{h(x_0)-h(x_k)}c^T\nabla h(x_k)\right| , \end{aligned}$$

and using the Cauchy–Schwarz inequality as well as assumption A1 and A2, we obtain

$$\begin{aligned} \frac{\alpha _k \eta _k}{h(x_0)-h(x_k)}c^T\nabla h(x_k)&\le \left| \frac{\alpha _k\eta _k}{h(x_0)-h(x_k)}\right| LH, \\&\le \beta LH, \quad \quad \eta _k < 1 \end{aligned}$$

Since \(\beta \le \frac{1}{LH}\) by assumption, the last quantity is \(\le 1\) as desired. As such, we conclude that \(\nabla f(g(x_{k+1}))^T(g(x_{k+1})-x_0) \le 0\) for \(h(x_k) > 0\). \(\square\)

The assumption of the linearity of f is used in the induction step and allows several simplifications since for f linear, \(\nabla f(x_{k+1}) = \nabla f(x_k)\). Extending the convergence analysis of Algorithm 1 to non-linear objectives could be achieved by extending Lemma 1 to this case. However, as discussed in Sect. 4.1, since the assumptions on h are mild, many constrained convex optimization problems can be recast in a form solvable by Algorithm 1.

To prove convergence of Algorithm 1, we need an additional assumption on the boundedness of the distance to an optimum.

A4.:

\(\exists x^* \in \mathcal{C}\) such that \(\forall x \in \mathcal{C}, f(x^*)\le f(x)\) and \(\Vert x_0 - x^*\Vert \le R\), for some \(R \ge 0\).

The convergence result for Algorithm 1 is as follows

Theorem 1

Under A1–A4 and for \(H_0 = \frac{H}{|h(x_0)|}\), the returned value of Algorithm 1 verifies \(f\left( \frac{1}{K}\sum _{k=0}^{K-1} g(x_k)\right) - f(x^*) \le \frac{RL(1+ H_0R)}{\sqrt{K}}\) for \(K \ge \frac{R^2H_0^2}{(1+ H_0R)^2}\) and for \(\beta = \frac{R}{L(1+ H_0R)\sqrt{K}}\).

Proof

As A3 ensures that \(h(x_0)\) is non-zero, an equivalent optimization problem, in which \(h(x_0) = -1\), can be obtained by rescaling h by \(|h(x_0)|\). Letting \(H_0 = \frac{H}{|h(x_0)|}\), the only difference is that if h is H-Lipschitz then \(h/|h(x_0)|\) is \(H_0\)-Lipschitz. From now on, and without loss of generality, we assume that \(h(x_0) = -1\) and h is H-Lipschitz. We revert to the general case where \(h(x_0) < 0\) at the end of the proof.

Following standard proofs of subgradient descent algorithms, our proof begins by estimating the distance of the iterate to the optimum

$$\begin{aligned} \Vert x_{k+1}-x^*\Vert _2^2&= \Vert x_k - \alpha _k \nabla f\circ g (x_k) - x^*\Vert _2^2. \end{aligned}$$

As in Lemma 1, we study separately the case where \(x_k\in \mathcal C\) and \(x_k\notin \mathcal C\). In each case, we derive an upper bound of \(\Vert x_{k+1}-x^*\Vert _2^2\) and then pick the largest of the two. Starting with \(x_k\notin \mathcal C\), we replace \(\nabla f\circ g (x_k)\) by its definition in Eq. (3), and by expanding the quadratic expression we obtain

$$\begin{aligned} \Vert x_{k+1}-x^*\Vert _2^2&= \Vert x_k-x^*\Vert _2^2 + \Vert \alpha _k \nabla f\circ g (x_k)\Vert _2^2 - 2\alpha _k \eta _k \nabla f(g(x_k))^T(x_k-x^*)\nonumber \\&\quad - 2\alpha _k \eta _k \frac{ \nabla f(g(x_k))^T(x_k-x_0)\nabla h(x_k)^T(x_k-x^*)}{h(x_0)-h(x_k)}. \end{aligned}$$
(4)

Adding and subtracting \(g(x_k)\) in \(\nabla f(g(x_k))^T(x_k-x^*)\) and by expanding the definition of \(g(x_k)\) and \(\eta _k\) when \(h(x_k) > 0\) we obtain

$$\begin{aligned} \nabla f(g(x_k))^T(x_k-x^*)&= \nabla f(g(x_k))^T(g(x_k)-x^*) \\&\quad -\frac{h(x_k)}{h(x_0)-h(x_k)}\nabla f(g(x_k))^T(x_k - x_0). \end{aligned}$$

Replacing \(\nabla f(g(x_k))^T(x_k-x^*)\) in Eq. (4) gives

$$\begin{aligned} \Vert x_{k+1}-x^*\Vert _2^2&= \Vert x_k-x^*\Vert _2^2 + \Vert \alpha _k \nabla f\circ g (x_k)\Vert _2^2 - 2\alpha _k \eta _k \nabla f(g(x_k))^T(g(x_k)-x^*)\nonumber \\&\quad \ + 2\alpha _k \eta _k \left( \frac{h(x_k) + \nabla h(x_k)^T(x^*-x_k)}{h(x_0)}\right) \nabla f(g(x_k))^T(g(x_k)-x_0). \end{aligned}$$
(5)

But from convexity of h, we know that \(h(x_k) + \nabla h(x_k)^T(x^*-x_k) \le h(x^*) \le 0\) implying

$$\begin{aligned} \frac{h(x_k) + \nabla h(x_k)^T(x^*-x_k)}{h(x_0)}\ge \frac{h(x^*)}{h(x_0)}\ge 0. \end{aligned}$$

In addition, \(\alpha _k\) and \(\eta _k\) are always positive and from Lemma 1, \(\nabla f(g(x_k))^T(g(x_k)-x_0)\) is negative for all \(k\ge 0\) provided \(\beta \le \frac{1}{LH}\). As a result the last term of Eq. (5) is always negative and \(\Vert x_{k+1}-x^*\Vert _2^2\) can be bounded by

$$\begin{aligned} \Vert x_{k+1}-x^*\Vert _2^2&\le \Vert x_k-x^*\Vert _2^2 + \Vert \alpha _k \nabla f\circ g (x_k)\Vert _2^2\nonumber \\&\quad - 2\alpha _k \eta _k \nabla f(g(x_k))^T(g(x_k)-x^*). \end{aligned}$$
(6)

In the upper bound of Inq. (6), we will now bound the term \(\Vert \alpha _k \nabla f\circ g (x_k)\Vert _2^2\) that is specific to the case \(h(x_k) > 0\). By replacing the gradient with its definition and using the fact that we have rescaled h such that \(h(x_0) = -1\), we obtain

$$\begin{aligned} \beta ^{-2}||\alpha _k \nabla&f\circ g (x_k)||_2^2 =||\nabla f(g(x_k)) - \nabla f(g(x_k))^T(g(x_k)-x_0)\nabla h(x_k)||_2^2. \end{aligned}$$

Using the Cauchy-Schwarz inequality as well as assumption A1, A2 and A4 we obtain

$$\begin{aligned} \beta ^{-2}\Vert \alpha _k \nabla f\circ g (x_k)\Vert _2^2 \le L^2 (1+ HR)^2. \end{aligned}$$
(7)

Substituting Inq. (7) into Inq. (6), using the definition of \(\alpha _k\), and since \(h(x_0) = -1\), we have

$$\begin{aligned} \Vert x_{k+1}-x^*\Vert _2^2 \le&\Vert x_k-x^*\Vert _2^2 + \beta ^2 L^2 (1+ HR)^2 - 2\beta \nabla f(g(x_k))^T(g(x_k)-x^*). \end{aligned}$$
(8)

Now for the simpler case \(x_k \in \mathcal C\) we have

$$\begin{aligned} \Vert x_{k+1}-x^*\Vert _2^2&= \Vert x_{k}-x^*\Vert _2^2 + \Vert \alpha _k\nabla f(x_k)\Vert _2^2-2\alpha _k\nabla f(x_k)^T(x_k-x^*). \end{aligned}$$

Using assumption A1 and since \(x_k = g(x_k)\) and \(\alpha _k = \beta\) when \(x_k \in \mathcal C\), we obtain the following bound

$$\begin{aligned} \Vert x_{k+1}-x^*\Vert _2^2 \le&\Vert x_{k}-x^*\Vert _2^2 + \beta ^2L^2 -2\beta \nabla f(g(x_k))^T(g(x_k)-x^*). \end{aligned}$$
(9)

Clearly the upper bound of \(\Vert x_{k+1}-x^*\Vert _2^2\) in Inq. (8) is always larger than the one in Inq. (9). As such, we can use the upper bound of \(\Vert x_{k+1}-x^*\Vert _2^2\) in Inq. (8) for all iterates of Algorithm 1. Letting \(A = L^2 (1+ HR)^2\), and averaging over the first K terms of both sides of Inq. (8) yields

$$\begin{aligned} \frac{1}{K}\sum _{k=0}^{K-1}\Vert x_{k+1}-x^*\Vert _2^2&\le \frac{1}{K}\sum _{k=0}^{K-1}\Vert x_k-x^*\Vert _2^2 + \beta ^2 A \\&\quad - \frac{2\beta }{K} \sum _{k=0}^{K-1}\nabla f(g(x_k))^T(g(x_k)-x^*). \end{aligned}$$

From the convexity of f we have that

$$\begin{aligned} \nabla f(g(x_k))^T(g(x_k)-x^*)&\ge f(g(x_k))-f(x^*), \end{aligned}$$

as well as \(\frac{1}{K}\sum _{k=0}^{K-1} f(g(x_k))\ge f\left( \frac{1}{K}\sum _{k=0}^{K-1}g(x_k)\right)\). Using these two properties yields

$$\begin{aligned} \frac{1}{K}\sum _{k=0}^{K-1}\Vert x_{k+1}-x^*\Vert _2^2&\le \frac{1}{K}\sum _{k=0}^{K-1}\Vert x_k-x^*\Vert _2^2 + \beta ^2 A \\&\quad - 2\beta \left( f\left( \frac{1}{K}\sum _{k=0}^{K-1}g(x_k)\right) -f(x^*)\right) . \end{aligned}$$

Rearranging terms and cancelling telescoping sums yields

$$\begin{aligned} f\left( \frac{1}{K}\sum _{k=0}^{K-1}g(x_k)\right) -f(x^*)&\le \frac{1}{2\beta K}\Big (\Vert x_0 - x^*\Vert _2^2 -\Vert x_{K} - x^*\Vert _2^2 + K\beta ^2A\Big ). \end{aligned}$$

Using A1, A2 and A4 and after replacing A we obtain

$$\begin{aligned} f\left( \frac{1}{K}\sum _{k=0}^{K-1}g(x_k)\right) -f(x^*)\le \frac{R^2}{2\beta K}+\frac{\beta L^2(1+ HR)^2}{2}. \end{aligned}$$

Minimizing this upper bound w.r.t. \(\beta\) gives the optimal fixed step-size \(\beta = \frac{R}{L(1+ HR)\sqrt{K}}\) with error

$$\begin{aligned} f\left( \frac{1}{K}\sum _{k=0}^{K-1}g(x_k)\right) -f(x^*)&\le \frac{RL(1+ HR)}{\sqrt{K}}. \end{aligned}$$
(10)

This gives us a first condition on \(\beta\), but to achieve the bound in Inq. (10), we made use of Lemma 1 which requires that \(\beta \le \frac{1}{LH}\), yielding an additional condition on K

$$\begin{aligned} \frac{R}{L(1+ HR)\sqrt{K}} \le \frac{1}{LH}, \Leftrightarrow K \ge \frac{R^2H^2}{(1+ HR)^2}. \end{aligned}$$
(11)

Now the only remaining operation is to express the step-size, the condition on K in Inq. (11) and the error upper bound in Inq. (10) in terms of the original Lipschitz constant which is achieved simply by replacing H with \(\frac{H}{|h(x_0)|}\) in these inequalities. \(\square\)

The \(\mathcal {O}(\frac{1}{\sqrt{K}})\) convergence rate is typical of sub-gradient descent on non-smooth convex functions (Nocedal and Wright 2006), which is expected since \(f\circ g\) is non-smooth. Compared to projected gradient descent (PGD), the bound now shows an explicit dependence on the Lipschitz constant of h. This is also expected since in PGD the projection is assumed to be computable at no cost. As a result, the error bound of PGD does not depend on the gradient of h in any way, whereas in our algorithm this dependence is made explicit. Because of the non-smoothness of \(f\circ g\) and the resulting \(\mathcal {O}(\frac{1}{\sqrt{K}})\) convergence rate, we do not expect the general formulation of Algorithm 1 to be competitive with specialized convex optimizers developed for specific convex problem classes. However, the versatility and cheap computational cost of the interpolation projection offer large gains compared to convex optimizers when it is integrated into (non-convex) machine learning models, as shown in the experimental validation section.

4.1 Subgradients, multiple constraints and non-linear objectives

So far we have only considered a single inequality constraint. Algorithm 1 and its theoretical guarantees can easily be extended to tackle multiple inequality constraints and an affine equality constraint

$$\begin{aligned} \min _{x \in {{\mathbb {R}}}^d} \quad&f(x),\\ \text {s.t.} \quad&h_i(x)\le 0, \ \text {for all } i \in \{1\dots M\},\\&Ax = b, \end{aligned}$$

where \(h_i\) are convex functions in \({{\mathbb {R}}}^d\), A a matrix and b a vector. Let \(\mathcal{C'}=\{x\in {{\mathbb {R}}}^d: h_i(x) \le 0\, \text {for all } i \in \{1\dots M\}\}\). We define h as \(h(x) = \max _{i\in \{1\dots M\}} h_i(x)\). Then h is sub-differentiable if all \(h_i\) are (sub-)differentiable. Moreover, we assume that all \(h_i\) are Lipschitz with constant at most H, resulting in the following assumption

A5.:

h is convex, sub-differentiable in \({{\mathbb {R}}}^d\) and H-Lipschitz w.r.t. \(\Vert .\Vert _2\).

To tackle constrained optimization in \(\mathcal C'\), we define Algorithm 1’, which replaces Line 10 of Algorithm 1. Specifically, the gradient \(\nabla h\) in Eq. (3) is simply replaced by a sub-gradient of h. Under A1, A3–A5, this new algorithm has the same convergence properties as Algorithm 1. Indeed, h being convex, the projection is still valid and is given by the interpolation weight \(\eta _x = \min _{i \in \{1\dots M\}}\frac{h(x_0)}{h(x_0)-h_i(x)}\), i.e. the smallest interpolation weight, given by the constraint \(h_i\) with the highest violation. Additionally, the only property of h used in the proof of Theorem 1 is \(\nabla h(x_k)^T(x^*-x_k) \le h(x^*)-h(x_k)\), which is also fulfilled by a sub-gradient of h.
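As an illustration, a sketch of the projection with multiple constraints, applying the formula of Sect. 3 directly to \(h(x)=\max _i h_i(x)\) (hs is a list of callables, and x0 must satisfy \(h_i(x_0)<0\) for all i):

```python
import numpy as np

def interp_project_multi(x, x0, hs):
    """Interpolation projection onto {x : max_i h_i(x) <= 0}."""
    hx = max(h_i(x) for h_i in hs)
    hx0 = max(h_i(x0) for h_i in hs)        # strictly negative by assumption
    if hx <= 0:
        return x
    eta = hx0 / (hx0 - hx)                  # weight driven by the most violated constraint
    return eta * x + (1.0 - eta) * x0
```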

In summary, the differentiability requirement on h can be relaxed to only require sub-differentiability, and multiple constraints are treated as a single constraint using the \(\max\) over these sub-differentiable constraints. As for the affine equality constraint, it can be eliminated by replacing x with \(Fz + x_0\) as shown in Boyd and Vandenberghe (2004), where F is a matrix whose range is the null space of A, under the condition that \(x_0\) is a solution of \(Ax = b\). Note that the objective function remains linear after this change of variable, and hence the convergence guarantees still apply.
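A small sketch of this change of variables, with an illustrative A, b and \(x_0\); the columns of F, computed here from the SVD, span the null space of A:

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0]])               # one equality constraint in R^3
b = np.array([1.0])
x0 = np.array([1.0, 0.0, 0.0])                # a particular solution of A x0 = b

rank = np.linalg.matrix_rank(A)
_, _, Vt = np.linalg.svd(A)
F = Vt[rank:].T                               # range(F) = null space of A

z = np.random.randn(F.shape[1])               # any z yields an equality-feasible x
x = F @ z + x0
assert np.allclose(A @ x, b)

c = np.array([1.0, 2.0, 3.0])                 # the reduced objective stays linear:
c_z = F.T @ c                                 # c^T (F z + x0) = c_z^T z + c^T x0
```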

As for non-linear objectives, we note that most convex programs can be written as cone programs of the form \(\min _{x \in \mathcal {K}} c^T x\), for a closed convex cone \(\mathcal {K}\) and a linear objective (Nesterov and Nemirovskii 1994). In fact, there exist automated tools (Grant et al. 2006; Grant and Boyd 2008) that perform this rewriting by replacing non-linear functions in the computational graph with their graph implementation, a generic epigraph-based representation. These tools are used by existing solvers such as CVX (Grant and Boyd 2014), and for our algorithm to be applicable to these cone programs, one has to provide a domain defining function h equivalent to the constraint \({x \in \mathcal {K}}\) for all cones supported by the tool. In the next section, we provide numerical examples for the semi-definite cone, the second order cone and the linear cone.

5 Experimental validation

We first conduct numerical evaluations on toy convex problems to validate the theoretical analysis. The broader usage of the interpolation projection in machine learning is then evaluated in both a reinforcement and supervised learning setting.

5.1 Constrained convex optimization

Algorithm 1 defines the step-size as a function of the domain bounds and the Lipschitz constants, which are typically unknown in practice. We thus investigate, on a wide range of convex optimization problems, the robustness of the interpolation projection to the choice of (a potentially wrong) step-size. We compare our algorithm to Projected Gradient Descent (PGD, Rosen (1960); Nocedal and Wright (2006)) and subgradient descent (SubGD, Shor et al. (1985); Bertsekas (2015)). Subgradient descent is a converging descent algorithm that, in our constrained setting, operates by (i) following the gradient of f if \(x \in \mathcal C\), (ii) following the (sub-)gradient of h otherwise. This algorithm is very simple, and another objective of these numerical experiments is to investigate whether the mixing of the gradients \(\nabla f\) and \(\nabla h\), obtained from differentiating through \(f\circ g\) in Eq. (3), provides any practical advantage compared to the simpler scheme of subgradient descent. In the following, we denote our algorithm by IGD, where the ‘I’ stands for interpolation. We consider five problem classes comprising linear programs, semi-definite programs, second order cone programs, problems with a bounded \(\ell _2\) norm, and problems with an exponential form constraint. The exact definition of each problem and its random generation process is deferred to the appendix.
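For reference, a minimal sketch of this SubGD baseline with a fixed step-size (the signature is illustrative):

```python
import numpy as np

def subgd(grad_f, h, grad_h, x0, alpha, iters):
    """Subgradient descent baseline: follow ∇f inside C, a (sub)gradient of h outside."""
    x = x0.copy()
    for _ in range(iters):
        step = grad_f(x) if h(x) <= 0 else grad_h(x)
        x = x - alpha * step
    return x
```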

Results. For each of the five problem classes, 100 random instances are generated and we compute at each iteration the smallest \(\frac{f(x_k) - f(x^*)}{f(x_0) - f(x^*)}\) achieved so far. We compared the gradient descent algorithms with four different step-sizes ranging from \(10^{-4}\) to \(10^{-1}\). Experiments for each step-size are conducted on the same 100 problem instances, and although we plot the results for each step-size separately, one can easily extract the best performing step-size for each method from the same plots. The plots (deferred to the appendix) show that in 17 out of the 20 problem and step-size combinations, IGD outperforms SubGD, sometimes by several orders of magnitude. On semi-definite programs, SubGD performs better with larger step-sizes, although the best results overall are still obtained by IGD with the smallest step-size. On the bounded norm problem where PGD is applicable, our algorithm is able to match PGD up to a precision ranging from \(10^{-2}\) to \(10^{-5}\) depending on the step-size, before falling behind. In contrast, SubGD falls behind at a significantly lower precision. These results demonstrate both a certain robustness to the choice of step-size and a practical interest in the mixing of gradients obtained by differentiating through \(f\circ g\). Thanks to the generality of the projection and the simplicity of performing unconstrained gradient descent on \(f\circ g\), we expect the interpolation projection to find many usages in machine learning, two of which are presented in the next subsections.

5.2 Reinforcement learning in continuous action spaces

We consider in this section the policy optimization updates that occur at each iteration of the approximate policy iteration (API) scheme (Bertsekas 2011; Scherrer 2014). To formalize the policy update in API, we briefly introduce key concepts of reinforcement learning (RL). A Markov Decision Process (MDP) is a quintuple \((\mathcal{S}, \mathcal{A}, R, P, \gamma )\) where \(\mathcal{S}\) and \(\mathcal{A}\) are state and action spaces, which in our experiments are \({{\mathbb {R}}}^{d_s}\) and \({{\mathbb {R}}}^{d_a}\) respectively. \(P: \mathcal{S}\times \mathcal{A}\mapsto \mathcal{P}(\mathcal{S})\) and \(R: \mathcal{S}\times \mathcal{A}\mapsto {\mathbb {R}}\) determine the next state transition probability and the reward upon the execution of a given action in a given state. We denote by q(a|s) the probability density of executing \(a \in \mathcal{A}\) in \(s \in \mathcal{S}\) according to the stochastic policy q. Additionally, for policy q we define the Q-function \(Q_{q}(s,a) = {{\mathbb {E}}}\left[ \sum _{t=0}^\infty \gamma ^t R(s_t, a_t)\mid s_0 = s, a_0=a\right]\), where the expectation is taken w.r.t. the random variables \(a_{t+1}\sim q(.|s_{t+1})\) and \(s_{t+1}\sim P(.|s_t, a_t)\) for \(t \ge 0\); the value function \(V_{q}(s) = {{\mathbb {E}}}_{a\sim q(.|s)}\left[ Q_{q}(s,a)\right]\) and the advantage function \(A_{q}(s,a)=Q_{q}(s,a) - V_{q}(s)\). The goal in API is to find the policy maximizing the policy return \(J(q) = V_{q}(s_0)\) for some starting state \(s_0\).

API iterates three steps: generating data from the current policy q, evaluating \(A_q\), and updating the policy q using \(A_q\). To update the policy, we consider the maximization of \(A_q\) under a KL divergence constraint between the current and next policies (establishing a ‘step-size’ in probability space), as is done in Schulman et al. (2015); Rajeswaran et al. (2017); Peters and Schaal (2008). The policy update is given by

$$\begin{aligned}&\underset{p}{\arg \max } \quad {{\mathbb {E}}}_{s,a\sim q}\left[ \frac{p(a|s)}{q(a|s)}A_q(s,a)\right] , \end{aligned}$$
(12)
$$\begin{aligned}&\text {subject to}\ \ \ \ \ \quad {{\mathbb {E}}}_{s \sim q}\left[ \mathrm {KL}(p(.|s)\Vert q(.|s))\right] \le \epsilon . \end{aligned}$$
(13)

We will benchmark algorithms on a continuous action task and specifically consider the case where p and q are Gaussian policies. A Gaussian policy has density \(p(.|s) = {\mathcal {N}}(\mu (s), \varSigma )\), for co-variance matrix \(\varSigma\) and mean function \(\mu (.)\). In our set-up we consider diagonal co-variance matrices as in Schulman et al. (2015); Rajeswaran et al. (2017) and linear-in-features or neural network based mean functions. The linear-in-feature mean function is given by \(\mu (s) = \phi (s)^TM\) using the same random Fourier features \(\phi\) of Rajeswaran et al. (2017) with 2000 entries, whereas the neural network mean function is given by a neural network following the architecture in Schulman et al. (2015) with 2 hidden layers with 64 neurons each. For estimating \(A_q\) we follow Rajeswaran et al. (2017) and use a neural network to learn \(V_q\) and estimate \(A_q\) from trajectories. For both cases we use \(\epsilon = 10^{-2}\) as in Schulman et al. (2015).

To solve the aforementioned problem, the natural gradient approaches with linear-in-features (Rajeswaran et al. 2017) and neural network mean functions (Schulman et al. 2015) proceed identically: a second order approximation of the constraint (13) is computed, as well as a linear approximation of the objective function (12). The resulting problem is then solved in closed form, yielding the natural gradient update of the policy parameters. However, as constraint satisfaction is not guaranteed (the problem is solved with an approximation of the constraint), both approaches (Schulman et al. 2015; Rajeswaran et al. 2017) add a line-search routine, interpolating between the new parameters and the parameters of q, to ensure that Inq. (13) holds.

Fig. 2 From left to right: a The computational graph of an RL policy with the projection layer taking as input the intermediate values \(\mu (s)\) and \(\varSigma\) and returning a new mean and covariance complying with the KL-divergence constraint. b, c Distributions of the improvement ratio over the natural gradient baseline for gradient descent on the policy parameters with and without the interpolation projection. The thick vertical black bars in the violin plot span the lower and upper quartiles

To compare to natural gradient, we first employ a naive algorithm that optimizes objective (12) in an unconstrained way with the Adam algorithm (Kingma and Ba 2015), before calling the line-search routine used by the natural gradient approaches to ensure constraint satisfaction. Second, we augment the naive algorithm by adding an interpolation projection ’layer’ to the output of the policy. The projection layer, as depicted in Fig. 2-left, takes as input a set of action means (given by evaluating the current mean function over a mini-batch of input states) and a covariance matrix, and returns a new set of means and a covariance matrix that comply with the constraint. To formalize, let us define h and \(x_0\), the two elements needed to perform the interpolation projection. Given a finite set of states \(\{s_1, \dots , s_K\}\), we define

$$\begin{aligned} h(\mu (s_1),\dots ,\mu (s_K), \varSigma ) = \frac{1}{K}\sum _{k}\text {KL}({\mathcal {N}}(\mu (s_k),\varSigma )|{\mathcal {N}}(\mu _q(s_k), \varSigma _q)) - \epsilon , \end{aligned}$$

where \(\mu _q\) and \(\varSigma _q\) are respectively the mean function and covariance matrix of q. h is convex, and we use as \(x_0\) for the interpolation projection the means and covariance matrix of q. The projection that returns a set of means and a covariance matrix complying with the KL divergence constraint is then given by g as in Sect. 3, from the definition of h and \(x_0\).

To illustrate the algorithm, assume that for a mini-batch of states \(\{s_1, \dots , s_K\}\) the mean and covariance functions return a mini-batch of means \(\mu (s_1),\dots ,\mu (s_K)\) and a covariance matrix \(\varSigma\). If the constraint, estimated on this mini-batch, is violated,

$$\begin{aligned} \frac{1}{K}\sum _{k}\text {KL}({\mathcal {N}}(\mu (s_k),\varSigma )|{\mathcal {N}}(\mu _q(s_k), \varSigma _q)) > \epsilon , \end{aligned}$$

we use the projection g as in Sect. 3 to obtain a new set of means \(\mu _\eta (s_1),\dots ,\mu _\eta (s_K)\) and covariance matrix \(\varSigma _\eta\), where \(\mu _\eta (s_k) = \eta \mu (s_k) + (1-\eta ) \mu _q(s_k)\) and \(\varSigma _\eta = \eta \varSigma + (1-\eta ) \varSigma _q\), and then evaluate the objective for \(p_\eta\)

$$\begin{aligned} \frac{1}{K}\sum _{k}\frac{p_\eta (a_k|s_k)}{q(a_k|s_k)}A_q(s_k,a_k), \end{aligned}$$

where \(p_\eta (.|s) = {\mathcal {N}}(\mu _\eta (s),\varSigma _\eta )\). Once the objective is computed, we backpropagate through the whole computational graph, including the interpolation projection.
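A PyTorch sketch of this projection layer for diagonal Gaussian policies; the function names, the variance parameterization of the covariances, and the shapes (a mini-batch of K means and a shared variance vector) are illustrative choices, not the exact implementation used in the experiments:

```python
import torch

def mean_kl_diag_gauss(mu, var, mu_q, var_q):
    """Average KL(N(mu_k, diag(var)) || N(mu_q_k, diag(var_q))) over a mini-batch."""
    kl = 0.5 * ((var / var_q).sum(-1)
                + ((mu_q - mu) ** 2 / var_q).sum(-1)
                - mu.shape[-1]
                + (var_q.log() - var.log()).sum(-1))
    return kl.mean()

def kl_projection(mu, var, mu_q, var_q, eps):
    """Interpolate (means, covariance) with the parameters of q if the constraint is violated."""
    kl = mean_kl_diag_gauss(mu, var, mu_q, var_q)
    if kl <= eps:
        return mu, var
    eta = eps / kl                        # = h(x0) / (h(x0) - h(x)) with h(x0) = -eps
    return eta * mu + (1 - eta) * mu_q, eta * var + (1 - eta) * var_q
```

Both outputs are differentiable functions of the inputs, so backpropagating the objective through the layer mixes the gradients of the objective and of the KL constraint as in Eq. (3).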

In the linear-in-feature case, we note that the KL divergence is not only convex in the mean and covariance of the Gaussian but also in the policy parameters. Specifically, we have that

$$\begin{aligned} h(M, \varSigma ) = \frac{1}{K}\sum _{k}\text {KL}({\mathcal {N}}(\phi (s_k)^TM,\varSigma )|{\mathcal {N}}(\phi (s_k)^TM_q, \varSigma _q)) - \epsilon , \end{aligned}$$

is a convex function of M and \(\varSigma\), and from the linearity of the mean function, interpolating the means or interpolating the parameter M directly are equivalent. Moreover, the \(\eta\) obtained using \(h(M, \varSigma )\) or \(h(\mu (s_1),\dots ,\mu (s_K), \varSigma )\) will be identical for a given mini-batch, since the value of h is the same in both cases. The optimization process can thus be seen as performing gradient descent on \((f\circ g)(M, \varSigma )\), where f is the objective (12). This is similar to the convex optimization setting studied theoretically, except that f is now non-linear and non-convex, because \(A_q\) is not necessarily convex. However, the empirical results show that the optimization scheme still performs well despite \(f\circ g\) being non-convex. This is not entirely surprising, since gradient descent is widely used and well behaved for non-convex problems too.

Fig. 3 Distributions of the improvement ratio, over SGD with a norm minimizing projection, of SGD with and without the interpolation projection. The thick black bars in the violin plot span the lower and upper quartiles. Each violin plot is obtained after solving circa 1700 optimization problems

To generate real RL optimization problems, we run natural gradient on the BipedalWalker-v2 environment (Brockman et al. 2016) for one million steps, with a policy update after a minimum of 3000 steps. We run 11 such independent runs, generating over 3000 optimization problems for each of the linear and non-linear cases. Both the naive algorithm and the projection augmented algorithm use the same hyper-parameters for the update, performing 30 epochs with a step-size of \(5\times 10^{-5}\). For each of the 3000 optimization problems, we record the ratio of the objective value obtained when solving the problem with gradient descent to the value obtained with the natural gradient baseline, in each of the linear (Rajeswaran et al. 2017) and non-linear (Schulman et al. 2015) cases. A value larger than 1 indicates that the method solved the constrained problem better than the state-of-the-art.

Fig. 4 The three considered objects with 7 rigid bodies and 6, 9 and 12 strings respectively, from left to right

Figure 2 shows the distribution of such ratios for the linear and non-linear mean function cases. In both cases, without the projection, the unconstrained optimization with a final line-search step performs significantly worse than natural gradient descent. In contrast, adding the interpolation projection of the Gaussian distributions’ parameters, while using the same optimization scheme, results in a median improvement over natural gradient of \(31\%\) and \(57\%\) for the linear and non-linear mean function cases respectively. Note that in the linear case, the optimization setting resembles the earlier convex optimization experiments, as the constraint is convex not only in the means input to h but also directly in the parameters M of the mean function. When the mean function is a neural network, the interpolation projection still seems to guide the gradient descent algorithm towards regions of the parameter space that better trade off objective maximization and constraint satisfaction than the naive algorithm.

We also evaluated replacing the interpolation layer with an orthogonal projection using a differentiable convex solver (Agrawal et al. 2019). The orthogonal projection receives the same input means and covariance matrix as the interpolation projection but returns instead the parameters that minimize the Euclidean distance to the inputs while complying with the KL divergence constraint. This is a convex problem, and we used the tools of Agrawal et al. (2019) to compute both the forward pass (solving the convex problem) and the backward pass (differentiating around the solution of the convex problem) of this computational graph. The computational cost of this model is more than 300 times that of the vanilla neural network model, while our model with the interpolation projection is only about 1.5 times more expensive. Due to the increased computational costs, we performed only 6 independent runs for this comparison, totaling about 1700 optimization problems. Comparisons between the two optimization schemes are shown in Fig. 3. Surprisingly, the interpolation projection performs better than the more accurate projection, perhaps because of a better interplay between the interpolation projection and the subsequent line-search routine, while being significantly cheaper to compute.

5.3 Supervised learning of dynamics models

Table 1 Mean Euclidean distance and std. dev. between test trajectories and model generated trajectories, obtained by unrolling 485 time-steps from the first three time-steps of each of the 75 test trajectories. First row shows the vanilla neural network model, and the second row adds an interpolation projection layer to respect physical constraints imposed by the strings

In the previous experiment we have shown how the interpolation projection can be used to tackle constrained optimization problems in the context of RL. In this experiment, we provide an example of an inductive bias in the form of a convex constraint on the outputs of a neural network, and we show how the interpolation projection can be used to comply with these constraints. The task consists in predicting the position, for several steps in the future, of 7 circular rigid bodies connected in 3 different configurations with respectively 6, 9 and 12 strings of the same length as shown in Fig. 4. We would like to emphasize that even though there are constraints on the output of the neural network, we impose no constraints on its parameters.

The considered inductive bias constrains the distance between the predicted positions of connected rigid bodies to be at most the length of the string. To comply with the constraint, we add, after the prediction \(y_t\) of the neural network, an interpolation projection that returns \(g(y_t)\) such that the constraints imposed by the strings are respected. To compute g, we define h as the maximum distance between linked bodies minus the string length, which is convex, and use as ‘\(x_0\)’ (the anchor point of the interpolation projection) an imaginary configuration that places all rigid bodies at the average of their positions according to \(y_{t-1}\). This configuration thus has zero distance between all circular bodies and strictly satisfies the constraints. Given h and ‘\(x_0\)’, the interpolation projection g follows as in Sect. 3.
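A PyTorch sketch of this projection layer; the tensor shapes (positions of shape [batch, 7, 2]), the links argument listing string endpoints, and the function name are illustrative:

```python
import torch

def string_projection(pred, prev, links, string_len):
    """Project predicted positions so that linked bodies stay within string length."""
    i, j = zip(*links)
    dists = (pred[:, list(i)] - pred[:, list(j)]).norm(dim=-1)   # [batch, n_links]
    h = dists.max(dim=-1).values - string_len                    # most violated string, per sample
    # Anchor 'x0': every body placed at the mean of the previous positions,
    # so all pairwise distances are zero and h(x0) = -string_len < 0.
    anchor = prev.mean(dim=1, keepdim=True).expand_as(pred)
    eta = string_len / (string_len + h.clamp(min=0.0))           # equals 1 when the constraint holds
    eta = eta.view(-1, 1, 1)                                     # broadcast over bodies and coordinates
    return eta * pred + (1.0 - eta) * anchor
```

As in the RL experiment, the layer is differentiable with respect to the predicted positions, so training backpropagates through it.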

To predict the next set of positions \(y_t\), we use a neural network with 4 hidden layers of 256 nodes each. The network takes as input the last three positions of each of the 7 circular bodies and outputs the change to the current set of positions. We train this neural network as a recurrent neural network (RNN), using backpropagation through time, as the predicted position at the next time-step is fed back to its input. We used Adam (Kingma and Ba 2015) for the optimization procedure, with a step-size of \(10^{-4}\). Because of the computational complexity of this task, we did not perform full and rigorous experimental comparisons with different step-sizes, but only compared step-sizes on partial runs before settling on the value of \(10^{-4}\).

In addition to the base RNN model, we evaluate the same RNN with the inductive bias in the form of convex constraints as described above. Ground truth trajectories are generated by letting the object fall from a distance of 400 units of measure (u.m.), after applying an initial force of constant norm to a node selected uniformly at random, with a direction sampled uniformly at random on an upper half circle. The diameter of each circular rigid body is 1 u.m. Box2d (Catto 2007) is used to simulate 200 such trajectories, 50 of which are used for training, 75 for validation and 75 for test. Each trajectory contains 485 time-steps, and the train set alone contains circa 24K time-steps. We train both the RNN and the RNN with convex constraints for a fixed time of 3 days on a single core of an AMD 3900x.

Fig. 5 Predicted trajectories vs ground truth. As errors compound, the RNN model without shape constraints exhibits large violations of the physical structure of the chain, as highlighted in red. In contrast, the model with the projection layer maintains physical consistency with the original shape at all times. An animated version of Fig. 5 is provided here

The generalization results in Table 1 show that both models can synthesize trajectories relatively close to the original ones for an extended period of time (485 time-steps at 60Hz), from only the first three time-steps of the test trajectories. The results also show that the additional interpolation projection layer, enforcing compliance with the physical constraints imposed by the strings, reduces the prediction error for the two shapes with the most strings; for the simpler chain shape, the vanilla model performs better. The worse performance in this setup might be the result of the additional non-smoothness introduced by the interpolation projection. Yet, even when it under-performs quantitatively on the chain shape, the trajectories generated by the projection augmented model can look qualitatively better, since the vanilla model sometimes exhibits large violations of the constraints, as shown in Fig. 5. In conclusion, introducing an inductive bias through additional constraints and using the interpolation projection to comply with these constraints showed promising results both quantitatively and qualitatively, with little computational overhead: the training procedure becomes only about 1.2 times slower. In comparison, we were unable to run the baseline with the optimal projection layer that solves a convex problem for every forward pass. Compared to the RL setting, the combined effect of a larger dataset (more than 10x) and of the increased number of convex problems to solve per gradient update (up to 240x due to the back-propagation through time) would require several months for the training procedure to complete on the same AMD 3900x processor.

6 Conclusion

We introduced in this paper an interpolation-based projection onto a convex set that can be readily computed for any convex domain defining function. We then derived a descent algorithm based on the composition of the objective and the projection and showed that, despite the ‘sub-optimality’ of the projection, this surprisingly yields a convergent algorithm when the objective is linear. From a practical point of view, we have shown that this projection, when added as a layer to computational models, makes it possible to tackle constrained optimization in reinforcement learning or to add an inductive bias to predictive models. Because the projection is general and computationally frugal, we think this work can find many other applications in machine learning where intermediary nodes of a computational graph are constrained to be in a convex set.