1 Introduction

We consider the problem of minimizing a nonlinear function over a polyhedron:

$$\begin{aligned} \min ~ \displaystyle f(\textbf{x}) ~~ \text{ subject } \text{ to } ~~ {\textbf{x}} \in \varOmega := \{{\textbf{x}} \in \mathbb {R}^n : \displaystyle A\,\textbf{x}= \textbf{b}, ~~ \displaystyle \textbf{l}\le \textbf{x}\le \textbf{u}\}, \end{aligned}$$
(1)

where f is continuously differentiable, \( A = (\textbf{a}_1, \ldots , \textbf{a}_m)^T \in \mathbb {R}^{m\times n}\), \(\textbf{b}\in \mathbb {R}^m\), \(\textbf{l}\in \! \left\{ \mathbb {R}\cup \{ -\infty \} \right\} ^{n}\), and \(\textbf{u}\in \! \left\{ \mathbb {R}\cup \{ +\infty \} \right\} ^{n}\). Many different Gradient Projection (GP) methods for problem (1) have been proposed in recent years that can be seen as extensions of the gradient method for unconstrained minimization. An overview of convergence results for general constrained problems can be found in [4, 8, 9, 13, 26, 36], and in [10], where the linearly constrained case is analyzed in more detail. The design of new GP methods was stimulated by the surprisingly good performance of the spectral gradient methods based on the Barzilai and Borwein formulas [2].

Moving to constrained problems, a critical issue arising in GP methods is the projection of trial steps onto the feasible set, which is in general a rather expensive task. To deal with this drawback, a rather popular strategy is to combine GP with a phase in which an unconstrained minimization is performed over faces of the polyhedron. In such a two-phase algorithm, the GP phase has the dual role of promoting global convergence and identifying “promising active constraints” (i.e. constraints likely to be active at the solution). The unconstrained minimization phase, on the other hand, aims to speed up the convergence by using a superlinearly convergent method. Under nondegeneracy assumptions, only the second phase is eventually performed, and the asymptotic convergence speed coincides with that of the algorithm used in the second phase.

A suitable branching rule between the two phases is a key ingredient for practical efficiency. A sequence of GP steps should be stopped when either the active set settles down or stagnation in the sequence is detected. Likewise, the minimization phase needs a reliable criterion to decide whether continued exploration of the current face of the polyhedron is worthwhile; this essentially boils down to comparing the stationarity conditions restricted to the current face of the polyhedron with the stationarity conditions for problem (1).

For box-constrained optimization, effective branching rules can be stated for a feasible point \(\textbf{x}\) componentwise in terms of binding variables [38] or in terms of the relative magnitude of gradient components associated with inactive and active bound constraints. The latter approach finds its roots in [29, 30] and in the subsequent works [3, 31], where the optimality conditions are expressed in terms of the free gradient and the chopped gradient; their relative magnitude is used to determine whether a face is worth further exploration. Inspired by the work in [28], the author of [22] introduced the concept of proportional iterates and proportioning step for the case of bound constrained quadratic programs, which was then further exploited in [23, 24] and, more recently, in [25, 37]. Finally, we recall that similar ideas were also exploited in the case of general, nonquadratic bound-constrained problems by the authors in [32], where the switch between the two phases is based on a comparison between the optimality with respect to the full problem and the optimality with respect to the subspace defined by the active bound constraints. In the last few years, these ideas have been extended to more general constrained optimization problems. In [1] the authors analyze an active-set strategy for linearly constrained optimization problems. The idea of proportioning has been extended to problems with quadratic separable constraints [6, 7], to the case of problems with an additional single linear constraint [18], and to the case of \(\ell _1\)-regularized problems [11, 12, 14, 16]. Finally, the work in [32] has been generalized to the case of problems subject to polyhedral constraints [33, 35] and very recently to general nonlinearly constrained optimization problems [19, 20].

In this paper we show that the projection of the negative gradient onto the tangent cone at \(\textbf{x}\) can be expressed as \(-\left( \varvec{\beta }({\textbf{x}}) + \varvec{\varphi }({\textbf{x}})\right) \), where \(\varvec{\beta }(\textbf{x})\) and \(\varvec{\varphi }(\textbf{x})\) are orthogonal and generalize the definitions of the free gradient and the chopped gradient [22, 28, 29] used within the framework of two-phase GP algorithms for box-constrained problems. We show that \(\varvec{\beta }(\textbf{x})\) vanishes only when the multipliers associated with the active constraints have the correct signs, while \(\varvec{\varphi }(\textbf{x})\) vanishes only when \(\textbf{x}\) is a stationary point in the active manifold. This decomposition allows us to generalize the definition of binding variables in bound-constrained optimization to the case of polyhedral constraints, and to measure complementary aspects of stationarity that can be exploited in order to design effective switching rules in two-phase algorithms. As an example, we present an algorithm for solving convex quadratic programs, which generalizes the P2GP algorithm introduced in [18] and employs a switching rule based on the theory that we develop.

The paper is organized as follows. Section 2 includes some basics about the stationarity conditions for problem (1) and about the projected gradient as defined in [10] and its properties. In Sect. 3 the free and the chopped gradients are defined and some of their properties are established. Section 4 deals with the convex quadratic programming problem, gives the definition of proportional iterate and shows that disproportionality at a point guarantees that such a point does not belong to the face identified by the active set at the solution. Moreover, we introduce PSAQP, a two-phase algorithm for convex quadratic programs which uses a proportionality-based branching rule between the two phases. Section 5 provides some insights about the implementation of PSAQP and presents the results of some numerical experiments. Finally, in Sect. 6 we draw some conclusions.

1.1 Notation

Throughout the paper \(\Vert \cdot \Vert \) denotes the Euclidean norm, \(\textbf{g}(\textbf{x}):= \nabla f (\textbf{x})\) (a column vector), and \(\textbf{e}_i\) is the i-th column of the identity matrix. The dimension of \(\textbf{e}_i\) should be clear from context. Given \( \textbf{v}\in \mathbb {R}^n \) and , we define

where is the number of elements in and \(v_i\) is the ith entry of \(\textbf{v}\). In a similar manner, given a matrix \( M\in \mathbb {R}^{m\times n}\) and the index subsets and , we define

where \(m_{ij}\) is the (i, j)th entry of M. The notation denotes . When a matrix has just one subscript, it refers to the columns of the matrix; that is, denotes the submatrix of M corresponding to the columns . The null space of M is denoted N(M). The 2-norm condition number of a matrix is denoted \(\kappa (M)\). If M is symmetric and positive definite, then \(\kappa (M)\) is the ratio between the largest and smallest eigenvalues of M.

\(P_\varOmega \) denotes the Euclidean norm projection onto \(\varOmega \), i.e.

$$\begin{aligned} P_{\varOmega } (\textbf{x}):= \textrm{argmin}_{\textbf{y}\in \varOmega } \Vert \textbf{y}- \textbf{x}\Vert . \end{aligned}$$
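Projections of this type are used repeatedly in the rest of the paper. As a concrete illustration, the following sketch evaluates \(P_\varOmega \) with a general-purpose solver; it assumes NumPy/SciPy, the helper name project_onto_polyhedron is ours, and dedicated codes such as the PPROJ solver used in Sect. 5 are far more efficient on large problems.

```python
import numpy as np
from scipy.optimize import minimize

def project_onto_polyhedron(x, A, b, l, u):
    """Euclidean projection of x onto {y : A y = b, l <= y <= u}.

    Minimal sketch based on a generic solver (SLSQP); it is meant only to
    make the definition of P_Omega concrete, not to be efficient.
    """
    obj = lambda y: 0.5 * np.dot(y - x, y - x)
    jac = lambda y: y - x
    cons = [{"type": "eq", "fun": lambda y: A @ y - b, "jac": lambda y: A}]
    res = minimize(obj, np.clip(x, l, u), jac=jac, bounds=list(zip(l, u)),
                   constraints=cons, method="SLSQP")
    return res.x

# example: project (2, 2) onto {y1 + y2 = 1, 0 <= y <= 1}; the result is (0.5, 0.5)
A = np.array([[1.0, 1.0]]); b = np.array([1.0])
print(project_onto_polyhedron(np.array([2.0, 2.0]), A, b, np.zeros(2), np.ones(2)))
```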

Definition 1

We define the following index sets:

and are called respectively the active set and the free set at \(\textbf{x}\), while and are abbreviations for and .

Definition 2

For any \(\textbf{x}\in \varOmega \), we define the following affine spaces:

(2)
(3)

Note that the equalities associated with \(E(\textbf{x})\) correspond to the affine closure of the face determined by the active set at \(\textbf{x}\) and \(E_0(\textbf{x}) \) is its support.

2 Stationarity conditions

Definition 3

\(\textbf{x}^* \in \varOmega \) is a stationary point for problem (1) if and only if there exist Lagrange multiplier vectors \(\varvec{\theta }^*\in \mathbb {R}^m\) and \(\varvec{\lambda }^*\in \mathbb {R}^n\) such that

$$\begin{aligned}{} & {} \textbf{g}({\textbf{x}}^*) = \sum _{i=1}^n \lambda _i^* \textbf{e}_i + \sum _{j=1}^m \theta _j^* \textbf{a}_j = \sum _{i=1}^n \lambda _i^* \textbf{e}_i + A^T\varvec{\theta }^*, \end{aligned}$$
(4)
(5)

The stationary point \( \textbf{x}^* \) is nondegenerate with respect to the bound constraints if the inequalities in (5) are strict. Otherwise, the stationary point is degenerate.

Since we are interested in building an estimate for the Lagrange multipliers, we make the following assumption which guarantees their uniqueness.

Assumption 1

(Linear Independence Constraint Qualification-LICQ) At the stationary point \(\textbf{x}^*\) of (1), the active constraint normals

are linearly independent.

By Assumption 1, the rows of are linearly independent, where .

Since \( \varvec{\lambda }_i^* = 0\) for all , the KKT conditions (4) and (5) are equivalent to

(6)
(7)
(8)

When Assumption 1 holds, , or equivalently, . Since has full row-rank, the matrix

is invertible.

Condition (6) can be rewritten as

Premultiplying by yields

(9)

which reduces to when .

Given any point \( \textbf{x}\in \varOmega \), we could utilize a modification of (6) to estimate \(\varvec{\theta }^*\) in which is replaced by . Even when Assumption 1 holds, the m by m matrix could be singular. Hence, to estimate \(\varvec{\theta }^*\) we use the minimum norm solution \(\varvec{\theta }({\textbf{x}})\) of the least squares problem

(10)

If is interpreted as the pseudoinverse of , then

(11)

The first-order optimality condition for a minimizer of (10) is given by

(12)

If the rows of are linearly independent, then . It follows from (9) that \(\varvec{\theta }({\textbf{x}}^*) = \varvec{\theta }^*\).

Let us introduce the vectors

$$\begin{aligned} \textbf{h}^{\varvec{\omega }} (\textbf{x}) := \textbf{g}(\textbf{x}) - A^T(\varvec{\theta }({\textbf{x}}) + \varvec{\omega }) , \quad \textbf{h}(\textbf{x}) := \textbf{h}^{\textbf{0}} ({\textbf{x}}), \end{aligned}$$
(13)

where \(\varvec{\omega }\) denotes any vector in the null space . For , is independent of \(\varvec{\omega }\) since . As in (6)–(8), \(\textbf{x}\) is a stationary point for the optimization problem (1) if there exists \(\varvec{\omega }\in \mathbb {R}^m\) such that

(14)

Motivated by (14), we define the binding set at an arbitrary \(\textbf{x}\in \varOmega \) as follows:

Definition 4

If \( \textbf{x}\in \varOmega \), then the binding set at \( \textbf{x}\) associated with is given by

(15)

The binding collection at \(\textbf{x}\) consists of all these sets:

Note that in the definition of , we do not require that \(h_i^{\varvec{\omega }} (\textbf{x}) = 0\) for . In the case of bound constrained problems, \(\textbf{h}^{\varvec{\omega }} (\textbf{x}) = \textbf{h}(\textbf{x}) = \textbf{g}(\textbf{x})\) and (15) corresponds to the standard definition of binding set. It is also possible to show that the quantities introduced above provide a good estimate of the Lagrange multipliers in the case of non-degenerate stationary points. Observe that in this case , hence one has \(\textbf{h}^{\varvec{\omega }} (\textbf{x}^*) \equiv \textbf{h}(\textbf{x}^*)\).
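As a complement to the formulas above, the sketch below (Python/NumPy; the function name and the boolean-mask representation of the free set are ours) computes the multiplier estimate and \(\textbf{h}(\textbf{x})\). It assumes that the least-squares problem (10) reads \(\min _{\varvec{\theta }} \Vert \textbf{g}_F(\textbf{x}) - A_F^T\varvec{\theta }\Vert \) over the free index set F, a reading consistent with (13), (26) and the proof of Lemma 1.

```python
import numpy as np

def multiplier_estimate(g, A, free_mask):
    """Minimum-norm multiplier estimate theta(x) and the vector h(x) of (13).

    Assumes (10) is the least-squares problem  min_theta ||g_F - A_F^T theta||
    over the free index set F (our reading); the pseudoinverse then returns
    its minimum-norm solution, cf. (11).
    """
    AF = A[:, free_mask]                      # columns of A indexed by F
    theta = np.linalg.pinv(AF.T) @ g[free_mask]
    h = g - A.T @ theta                       # h(x) = g(x) - A^T theta(x)
    return theta, h
```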

Theorem 1

If \(\left\{ \textbf{x}^k \right\} \) is a sequence in \(\varOmega \) which converges to a non-degenerate stationary point \(\textbf{x}^*\) where Assumption 1 holds and for all \( k>\overline{k} \), then

and \(h_i(\textbf{x}^k) = 0\) for all \(k > \overline{k}\) and .

Proof

For all \( k > \overline{k}\), and the matrix has full rank by Assumption 1. By observing that \(\textbf{h}^{\varvec{\omega }} (\textbf{x}^k) \equiv \textbf{h}(\textbf{x}^k)\), the claim follows from the continuity of \( \nabla f(\textbf{x}) \). \(\square \)

Another way to express stationarity for problem (1) is through the projected negative gradient of f at a point \(\textbf{x}\in \varOmega \). In Calamai and Moré [10], this projection is expressed as

$$\begin{aligned} \nabla _{\varOmega } f(\textbf{x}) := \textrm{argmin}\left\{ \Vert \textbf{v}+ \nabla f(\textbf{x})\Vert \; : \; \textbf{v}\in T_\varOmega (\textbf{x}) \right\} , \end{aligned}$$
(16)

where

is the tangent cone to \(\varOmega \) at \(\textbf{x}\); that is, \(T_\varOmega (\textbf{x})\) is the closure of the cone of feasible directions at \(\textbf{x}\). If \(\nabla _{\varOmega } f(\textbf{x}^*) = \textbf{0}\), then the KKT conditions for (16) imply that \(\textbf{x}^*\) is a stationary point, and conversely, if \(\textbf{x}^*\) is a stationary point of (1), then \(\nabla _{\varOmega } f(\textbf{x}^*) = \textbf{0}\). Observe that the condition \(\nabla _{\varOmega } f(\textbf{x}^*) = \textbf{0}\) can be equivalently stated as

$$\begin{aligned} -\nabla f(\textbf{x}^*) \in T_\varOmega (\textbf{x}^*)^\circ , \end{aligned}$$

where \(T_\varOmega (\textbf{x})^\circ = \left\{ \textbf{w}\in \mathbb {R}^n \,: \, \textbf{w}^T\textbf{v}\le 0, \; \forall \, \textbf{v}\in T_\varOmega (\textbf{x}) \right\} \) is the polar of the tangent cone at \(\textbf{x}\), also known as the normal cone to \(\varOmega \) at \(\textbf{x}\).

The drawback of using (16) as a measure of stationarity is that \(\nabla _{\varOmega } f(\textbf{x})\) could be very large even for \(\textbf{x}\in \varOmega \) very close to a stationary point of (1), since the projected gradient is only lower semicontinuous (see [10, Lemma 3.3]). On the other hand, the projected gradient enjoys a remarkable feature, summarized in the following theorem, whose proof follows the lines of that of [18, Theorem 2.3] by extending [10, Theorem 4.1] to (possibly degenerate) stationary points that satisfy Assumption 1.

Theorem 2

If \(\left\{ \textbf{x}^k \right\} \) is a sequence in \(\varOmega \) which converges to a point \(\textbf{x}^*\) and if

$$\begin{aligned} \lim _{k \rightarrow \infty } \Vert \nabla _{\varOmega } f(\textbf{x}^k)\Vert = 0 \end{aligned}$$
(17)

then

  • \(\textbf{x}^*\) is a stationary point for problem (1);

  • if Assumption  1 holds, then for all k sufficiently large, where and \(\lambda _i^*\) is the Lagrange multiplier associated with the i-th bound constraint.

Because of Theorem 2, an algorithm producing a sequence \(\{ \textbf{x}^k \}\) that satisfies (17) is able to identify the active variables that are nondegenerate at the solution in a finite number of steps.

Calamai and Moré [10] prove that for a sequence \(\{ \textbf{x}^k \}\) generated by a GP algorithm, (17) holds if the step lengths satisfy some appropriate sufficient decrease conditions; this is true also for two-phase algorithms that fit the very general framework of Algorithm 5.3 in [10].

In a two-phase algorithm for problem (1) a critical issue is splitting the violation of the optimality conditions between free and active variables. To this end, in the next section we extend to problems of the form (1) the definition of free and chopped gradient originally proposed in [22] for the case of problems with box constraints.

In Sect. 4, in the context of a two-phase algorithm for quadratic problems, we show how the free and the chopped gradient can be used to set a suitable stopping criterion for the minimization in the reduced space, and to establish finite convergence in the strictly convex case, even in case of degeneracy at the solution.

3 The free and the chopped gradient

We start by defining the free gradient \(\varvec{\varphi }(\textbf{x})\) at \(\textbf{x}\in \varOmega \) for problem (1).

Definition 5

For any \(\textbf{x}\in \varOmega \), the free gradient \(\varvec{\varphi }(\textbf{x})\in \mathbb {R}^n\) is defined by

(18)

The following lemmas state some properties of \(\varvec{\varphi }(\textbf{x})\), including its relationship with the projected gradient.

Lemma 1

If \(\textbf{x}\in \varOmega \), then \(\varvec{\varphi }(\textbf{x}) = \textbf{0}\) if and only if \(\textbf{x}\) is a stationary point for

$$\begin{aligned} \begin{array}{rl} \min &{} f(\textbf{u}), \\ \text{ s.t. } &{} \textbf{u}\in E(\textbf{x}). \end{array} \end{aligned}$$
(19)

Proof

The point \(\textbf{x}\) is a stationary point for (19) if there exist \(\varvec{\mu }\in \mathbb {R}^m\) and such that

(20)

By (13) and (18), \(\varvec{\varphi }(\textbf{x})=\textbf{0}\) if and only if

(21)

Thus by taking \(\varvec{\mu }= \varvec{\theta }({\textbf{x}})\) and \(\lambda _i = g_i({\textbf{x}}) - A_i^T\varvec{\theta }(\textbf{x})\) for all , we see that (20) is satisfied and \({\textbf{x}}\) is a stationary point for (19).

On the other hand, if \(\textbf{x}\) is a stationary point for (19), then (20) implies that lies in the range of , which implies that (21) holds. \(\square \)

Remark 1

Lemma 1 shows that \(\varvec{\varphi }(\textbf{x})\) can be considered as a measure of optimality within the reduced space determined by the active variables at \(\textbf{x}\).

Lemma 2

For any \(\textbf{x}\in \varOmega \), \(\varvec{\varphi }(\textbf{x})\) is the orthogonal projection of \(-\nabla _\varOmega f(\textbf{x})\) onto \(E_0 (\textbf{x})\), where \(E_0 (\textbf{x})\) is given in (3). Furthermore,

$$\begin{aligned} \Vert \varvec{\varphi }(\textbf{x}) \Vert ^2 = - ( \nabla _\varOmega f(\textbf{x}))^T\varvec{\varphi }(\textbf{x}). \end{aligned}$$
(22)

Proof

The definition (16) of projected gradient and the associated KKT conditions imply that

$$\begin{aligned} \textbf{v}= \nabla _\varOmega f (\textbf{x}) = -\textbf{g}({\textbf{x}}) + A^T\varvec{\nu }+ \varvec{\mu }, \end{aligned}$$
(23)

for some \( \varvec{\nu }\in \mathbb {R}^m\) and \(\varvec{\mu }\in \mathbb {R}^n\) with

Recall that \(\textbf{h}(\textbf{x}):= \textbf{g}(\textbf{x}) - A^T \varvec{\theta }(\textbf{x})\). Substituting for \(\textbf{g}(\textbf{x})\) using (23), the equation for \(\textbf{h}\) can be rewritten as

$$\begin{aligned} \textbf{h}(\textbf{x}) = - \nabla _{\varOmega } f(\textbf{x}) + A^T\varvec{\sigma }+ \varvec{\mu }, \quad \varvec{\sigma }= \varvec{\nu }- \varvec{\theta }(\textbf{x}) . \end{aligned}$$
(24)

Since and , (24) can be expressed

$$\begin{aligned} \varvec{\varphi }(\textbf{x}) = - \nabla _{\varOmega } f(\textbf{x}) + A^T\varvec{\sigma }+ \varvec{\tau }, \end{aligned}$$
(25)

where \(\tau _i = \mu _i = 0\) for and \(\tau _i = \mu _i - h_i({\textbf{x}})\) otherwise. Also, note that the condition (12) is equivalent to

$$\begin{aligned} A \varvec{\varphi }(\textbf{x}) = \textbf{0}\end{aligned}$$
(26)

since and . Thus by (25) and (26), the KKT conditions for the problem

$$\begin{aligned} \min \left\{ \Vert \textbf{v}+ \nabla _\varOmega f(\textbf{x}) \Vert : \; \textbf{v}\in E_0(\textbf{x}) \right\} \end{aligned}$$

are satisfied by \({\textbf{v}} = \varvec{\varphi }(\textbf{x})\). This establishes the first part of the lemma. Equation (22) follows from (24), (26), and the fact that . \(\square \)

Lemma 3

If \(\textbf{x}\in \varOmega \), then if and only if

$$\begin{aligned} \varvec{\varphi }(\textbf{x})= - \nabla _\varOmega f(\textbf{x}). \end{aligned}$$
(27)

Proof

Condition (27) is equivalent to

$$\begin{aligned} - \varvec{\varphi }(\textbf{x}) = \textrm{argmin}\left\{ \Vert \textbf{v}+ \textbf{g}\Vert \; : \; \textbf{v}\in T_\varOmega (\textbf{x}) \right\} . \end{aligned}$$
(28)

The KKT conditions associated with the minimizer on the right side of (28) are

$$\begin{aligned} A\textbf{v}= \textbf{0}, \quad \textbf{v}= -\textbf{g}({\textbf{x}}) + A^T\varvec{\nu }+ \varvec{\mu }, \end{aligned}$$
(29)

for some \(\varvec{\nu }\in \mathbb {R}^m\) and \(\varvec{\mu }\in \mathbb {R}^n\) with \(\mu _i=0 \) if , \(\mu _i \ge 0 \) if , and \(\mu _i \le 0 \) if . By (26), \(\textbf{v}= -\varvec{\varphi }({\textbf{x}})\) satisfies the first equality in (29). Moreover, if , then there exists such that \(h_i^{\varvec{\omega }} ({\textbf{x}}) \ge 0\) if , and \(h_i^{\varvec{\omega }} ({\textbf{x}}) \le 0 \) if . Hence, the second equality in (29) is satisfied by taking \(\varvec{\nu }= \varvec{\theta }(\textbf{x}) + \varvec{\omega }\), \(\mu _i = h_i^{\varvec{\omega }}({\textbf{x}})\) for , and \(\mu _i = 0\) for .

Conversely, suppose that (27) holds, which is equivalent to (29) with \(\textbf{v}= -\varvec{\varphi }(\textbf{x})\). By definition, we have \(\varphi _i ({\textbf{x}}) = g_i({\textbf{x}}) - A_i\varvec{\theta }({\textbf{x}})\) for all . Hence, by (29) with \(\textbf{v}= -\varvec{\varphi }(\textbf{x})\) it follows that . This shows that \(\varvec{\omega }=\) . We claim that . For , we have

$$\begin{aligned} h_i^{\varvec{\omega }} (\textbf{x}) = g_i (\textbf{x}) - A_i^T (\varvec{\theta }(\textbf{x}) + \varvec{\omega }) = g_i(\textbf{x}) - A_i^T \varvec{\nu }= \textbf{0}, \end{aligned}$$

where the last equality is from (29). For , (29) also gives

$$\begin{aligned} h_i^{\varvec{\omega }} (\textbf{x}) = g_i(\textbf{x}) - A_i^T \varvec{\nu }= \mu _i + \varphi _i(\textbf{x}) = \mu _i \end{aligned}$$

since . Since \(\mu _i \ge 0 \) if , and \(\mu _i \le 0 \) if , it follows that . \(\square \)

Inspired by the previous lemma, we give the following definition.

Definition 6

For any \(\textbf{x}\in \varOmega \), the chopped gradient \(\varvec{\beta }(\textbf{x})\) is defined as

$$\begin{aligned} \varvec{\beta }(\textbf{x}) := - \nabla _\varOmega f(\textbf{x}) - \varvec{\varphi }(\textbf{x}). \end{aligned}$$
(30)

Remark 2

Lemma 3 implies that \(\varvec{\beta }(\textbf{x}) = \textbf{0}\) if and only if . Therefore, \(\Vert \varvec{\beta }({\textbf{x}})\Vert \) could be used to assess whether the KKT multipliers for the inequality constraints have the correct sign.

Some properties of \(\varvec{\beta }(\textbf{x})\) are given next.

Lemma 4

For any \(\textbf{x}\in \varOmega \), we have \(\text{(a) } \varvec{\beta }(\textbf{x})^T\varvec{\varphi }(\textbf{x}) = 0\), \(\text{(b) } \varvec{\beta }(\textbf{x}) \in N (A)\), \(\text{(c) } -\varvec{\beta }(\textbf{x}) \in T_\varOmega (\textbf{x})\), and \(\text{(d) } \textbf{g}(\textbf{x})^T\varvec{\beta }(\textbf{x}) = \Vert \varvec{\beta }(\textbf{x})\Vert ^2\).

Proof

By Lemma 2, it follows that

$$\begin{aligned} \varvec{\beta }(\textbf{x})^T \varvec{\varphi }(\textbf{x})= & {} - \left( \nabla _\varOmega f (\textbf{x})+ \varvec{\varphi }(\textbf{x}) \right) ^T \! \varvec{\varphi }(\textbf{x}) \\= & {} -\nabla _\varOmega f(\textbf{x})^T \varvec{\varphi }(\textbf{x}) - \varvec{\varphi }(\textbf{x})^T\varvec{\varphi }(\textbf{x}) = 0, \end{aligned}$$

which gives (a). By (16), \(\nabla _\varOmega f(\textbf{x}) \in N(A)\) and by (26), \(\varvec{\varphi }(\textbf{x}) \in N(A)\); hence, \(\varvec{\beta }(\textbf{x}) = -(\nabla _\varOmega f(\textbf{x}) + \varvec{\varphi }(\textbf{x})) \in N(A)\) and (b) holds. Since \(\nabla _\varOmega f(\textbf{x}) \in T_\varOmega (\textbf{x})\) and \(\varvec{\varphi }(\textbf{x}) \in E_0(\textbf{x})\), (c) holds. By [10, Lemma 3.1], we have \( - \textbf{g}(\textbf{x})^T \nabla _\varOmega f(\textbf{x}) =\) \(\Vert \nabla _\varOmega f (\textbf{x}) \Vert ^2\). Substituting \(-\nabla _\varOmega f(\textbf{x}) =\) \(\varvec{\beta }(\textbf{x}) + \varvec{\varphi }(\textbf{x})\) into this identity and exploiting (a), we obtain

$$\begin{aligned} \textbf{g}(\textbf{x})^T \left( \varvec{\varphi }(\textbf{x}) + \varvec{\beta }(\textbf{x}) \right) = \Vert \varvec{\varphi }(\textbf{x}) \Vert ^2 + \Vert \varvec{\beta }(\textbf{x}) \Vert ^2, \end{aligned}$$

which is rearranged into

$$\begin{aligned} \textbf{g}(\textbf{x})^T \varvec{\beta }(\textbf{x}) = \Vert \varvec{\beta }(\textbf{x}) \Vert ^2 + \varvec{\varphi }(\textbf{x})^T(\varvec{\varphi }(\textbf{x}) - \textbf{g}(\textbf{x})) . \end{aligned}$$
(31)

Recall that by its definition, \(\varphi _i(\textbf{x}) = 0\) for all and \(\varphi _i (\textbf{x}) - g_i(\textbf{x}) =\) \(-A_i^T \varvec{\theta }(\textbf{x})\) for all . It follows that

where the last equality is by (26). Since the trailing term in (31) vanishes, the proof of (d) is complete. \(\square \)

4 Proportionality-based algorithm for quadratic programs

In the previous sections it has been shown that for polyhedral constrained optimization the projected gradient has an orthogonal decomposition of the form

$$\begin{aligned} -\nabla _\varOmega f(\textbf{x})= \varvec{\beta }(\textbf{x})+\varvec{\varphi }(\textbf{x}). \end{aligned}$$
(32)

\(\varvec{\varphi }(\textbf{x})\) provides a measure of stationarity within the reduced space determined by the active variables at \(\textbf{x}\), while \(\varvec{\beta }(\textbf{x})\) measures to what extent the active variables at \(\textbf{x}\) are binding.

In the remainder of this paper we restrict our attention to the Quadratic Program (QP):

$$\begin{aligned} \begin{array}{c} \text{ minimize } ~ \displaystyle f(\textbf{x}):=\frac{1}{2} \textbf{x}^T\,H\,\textbf{x}- \textbf{c}^T\textbf{x}~~ \\ \text{ subject } \text{ to } ~~ {\textbf{x}} \in \varOmega := \{{\textbf{x}} \in \mathbb {R}^n : \displaystyle A\,\textbf{x}= \textbf{b}, ~~ \displaystyle \textbf{l}\le \textbf{x}\le \textbf{u}\}, \end{array} \end{aligned}$$
(33)

where \(H \in \mathbb {R}^{n \times n}\) is symmetric, \(\textbf{c}\in \mathbb {R}^{n}\), \( A \in \mathbb {R}^{m\times n}\), \(\textbf{b}\in \mathbb {R}^m\), \(\textbf{l}\in \! \left\{ \mathbb {R}\cup \{ -\infty \} \right\} ^{n}\), and \(\textbf{u}\in \! \left\{ \mathbb {R}\cup \{ +\infty \} \right\} ^{n}\).

Based on the theory developed in the previous sections, for problem (33) we propose an algorithm that alternates a gradient projection (GP) phase and a subspace minimization (SM) phase. The switch from GP to SM is based on a heuristic criterion that monitors the change of the active set and the decrease of the objective value. The switch from SM to GP is, instead, based on the “proportionality criterion”. By using the projected gradient decomposition (32) we can extend to problem (33) the definition of proportional iterates, introduced for bound-constrained quadratic problems by Dostál in [22]. An iterate \( \textbf{x}^k \) is called proportional if, for a suitable constant \(\varGamma >0\),

$$\begin{aligned} \Vert \varvec{\beta }(\textbf{x}^k) \Vert _\infty \le \varGamma \Vert \varvec{\varphi }(\textbf{x}^k) \Vert . \end{aligned}$$
(34)
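To make the use of the decomposition and of the test (34) concrete, the sketch below (Python/NumPy; the function names are ours) assembles \(\varvec{\varphi }(\textbf{x})\) and \(\varvec{\beta }(\textbf{x})\) from \(\textbf{h}(\textbf{x})\) and the projected gradient. It assumes the componentwise reading of Definition 5 in which \(\varphi _i(\textbf{x}) = h_i(\textbf{x})\) on the free variables and \(\varphi _i(\textbf{x}) = 0\) on the active ones (consistent with the proofs of Lemmas 1 and 2), and that \(\nabla _\varOmega f(\textbf{x})\) has already been computed, e.g. by projecting \(-\textbf{g}(\textbf{x})\) onto the tangent cone.

```python
import numpy as np

def decompose_projected_gradient(grad_proj, h, free_mask):
    """Free/chopped decomposition -nabla_Omega f(x) = phi(x) + beta(x).

    grad_proj : projected gradient nabla_Omega f(x) of (16);
    h         : h(x) = g(x) - A^T theta(x) of (13);
    free_mask : boolean mask of the free variables at x (assumed reading
                of Definition 5).
    """
    phi = np.zeros_like(h)
    phi[free_mask] = h[free_mask]   # free gradient phi(x)
    beta = -grad_proj - phi         # chopped gradient beta(x), Definition 6
    return phi, beta

def is_proportional(beta, phi, Gamma):
    """Proportionality test (34): ||beta||_inf <= Gamma * ||phi||."""
    return np.linalg.norm(beta, np.inf) <= Gamma * np.linalg.norm(phi)
```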

Although the algorithm can be defined independently of the convexity of the problem, some of the results which follow apply only to the case of strictly convex problems, i.e. problems in which the restriction of H to the null space of A is positive definite. In this case disproportionality of \(\textbf{x}^k\) guarantees that the solution of (33) does not belong to the face identified by the active variables at \(\textbf{x}^k\), as shown in Theorem 3.

Before proceeding we introduce the following lemma.

Lemma 5

Let us consider the problem

$$\begin{aligned} \begin{array}{rl} \min &{} w(\textbf{z}) := \frac{1}{2} \textbf{z}^T\,K\,\textbf{z}- \textbf{p}^T\textbf{z}, \\ \text{ s.t. } &{} R\,\textbf{z}= \textbf{q}, \end{array} \end{aligned}$$
(35)

where \(K \in \mathbb {R}^{t \times t}\), \(\textbf{p}\in \mathbb {R}^{t}\), \(R\in \mathbb {R}^{s\times t}\), and \(\textbf{q}\in \mathbb {R}^s\).

Let \(P_{N(R)}\) be the orthogonal projection onto N(R), and U a matrix with orthonormal columns spanning N(R). Assume that \(U^T K \, U\) is positive definite, and let \(\overline{\textbf{z}}\) be the solution of (35). Then for any solution \(\textbf{z}\) of \(R\textbf{z}= \textbf{q}\), we have

$$\begin{aligned} w(\textbf{z}) - w(\overline{\textbf{z}}) \le \frac{1}{2} \, \Vert B \Vert \Vert P_{N(R)} \nabla w(\textbf{z}) \Vert ^2, \end{aligned}$$
(36)

where \(B = (U^T K U)^{-1}\).

Proof

Any solution \(\textbf{z}\) of \(R\,\textbf{z}= \textbf{q}\) can be expressed as the sum of a particular solution, denoted \(\textbf{r}\), and an element of the null space N(R). Since the columns of U are an orthonormal basis for N(R), this is equivalent to \(\textbf{z}= \textbf{r}+ U\,\textbf{y}\) for some \(\textbf{y}\).

Thus, (35) can be reduced to

$$\begin{aligned} \min \; {\widetilde{w}}(\textbf{y}) := \frac{1}{2} \textbf{y}^T U^T K \, U \textbf{y}- (\textbf{p}^T - \textbf{r}^T K) U \textbf{y}. \end{aligned}$$
(37)

If \(\overline{\textbf{y}}\) denotes the solution of (37), then \(\overline{\textbf{z}} = \textbf{r}+ U\,\overline{\textbf{y}}\). Since (37) is unconstrained, it follows that \(\nabla {\widetilde{w}}(\overline{\textbf{y}})= 0\). Since \({\widetilde{w}} (\textbf{y})\) and \(w(\textbf{z})\) differ by a constant when \(\textbf{z}= \textbf{r}+ U\,\textbf{y}\), it follows that

$$\begin{aligned} w(\textbf{z}) - w (\overline{\textbf{z}}) = {\widetilde{w}}(\textbf{y}) - {\widetilde{w}} (\overline{\textbf{y}}). \end{aligned}$$
(38)

We expand the right side of (38) in a Taylor series around \(\overline{\textbf{y}}\) and utilize the identity \(\nabla {\widetilde{w}}(\overline{\textbf{y}})= 0\) to obtain

$$\begin{aligned} w(\textbf{z}) - w(\overline{\textbf{z}})= & {} \frac{1}{2} (\textbf{y}- \overline{\textbf{y}})^T U^T KU (\textbf{y}- \overline{\textbf{y}}) \nonumber \\= & {} \frac{1}{2} (\textbf{y}- \overline{\textbf{y}})^T (U^T KU) (U^TKU)^{-1} (U^TKU) (\textbf{y}- \overline{\textbf{y}}). \end{aligned}$$
(39)

The identity \(\nabla {\widetilde{w}}(\overline{\textbf{y}})= 0\) can be expressed

$$\begin{aligned} U^T K (U\overline{\textbf{y}} + \textbf{r}) = U^T \textbf{p}. \end{aligned}$$

Exploiting this relation yields

$$\begin{aligned} U^TKU(\textbf{y}- \overline{\textbf{y}}) = U^TK(U\textbf{y}+ \textbf{r}- U\overline{\textbf{y}} - \textbf{r}) = U^TK\textbf{z}- U^T\textbf{p}= U^T \nabla w(\textbf{z}). \end{aligned}$$

Since \(\Vert U^T \nabla w(\textbf{z})\Vert = \Vert P_{N(R)} \nabla w (\textbf{z})\Vert \), the bound (36) follows from (39), which completes the proof. \(\square \)
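As a quick numerical illustration of the bound (36), the following sketch (Python/NumPy/SciPy) builds a small random instance of (35) and checks the inequality at a random feasible point.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
t, s = 6, 2
K = rng.standard_normal((t, t)); K = K @ K.T + np.eye(t)    # positive definite
p = rng.standard_normal(t)
R = rng.standard_normal((s, t)); q = rng.standard_normal(s)

w = lambda z: 0.5 * z @ K @ z - p @ z
grad_w = lambda z: K @ z - p

U = null_space(R)                          # orthonormal basis of N(R)
r = np.linalg.lstsq(R, q, rcond=None)[0]   # a particular solution of R z = q
B = np.linalg.inv(U.T @ K @ U)

ybar = B @ (U.T @ (p - K @ r))             # reduced minimizer of (37)
zbar = r + U @ ybar                        # solution of (35)

z = r + U @ rng.standard_normal(U.shape[1])   # any feasible point
lhs = w(z) - w(zbar)
rhs = 0.5 * np.linalg.norm(B, 2) * np.linalg.norm(U @ (U.T @ grad_w(z)))**2
assert lhs <= rhs + 1e-10                  # inequality (36)
```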

We are now in a position to state the main result of this section, which extends to problems of the form (33) the results shown for simpler quadratic programs in [22, Theorem 3.2] (bound constraints only) and in [18, Theorem 3.8] (bound constraints plus a single linear equality constraint).

Theorem 3

Let us consider problem (33), and suppose that \(V^THV\) is positive definite, where \(V \in \mathbb {R}^{n \times (n-m)}\) is a matrix with orthonormal columns spanning the null space N(A). Let \(\textbf{x}\in \varOmega \) be such that , and let \(\overline{\textbf{x}}\) be the solution of

$$\begin{aligned} \begin{array}{rl} \min &{} f(\textbf{u}),\\ \text{ s.t. } &{} \textbf{u}\in E(\textbf{x}), \end{array} \end{aligned}$$
(40)

where \(E(\textbf{x})\) is defined in (2). If \( \overline{\textbf{x}}\in \varOmega \), then \(\varvec{\beta }(\overline{\textbf{x}})\ne 0\).

Proof

The proof follows the lines of that of Theorem 3.8 in [18], which refers to the simpler case \(m=1\). For the sake of completeness, we summarize the main steps of the proof.

Let . By Lemma 4 part (d), the assumption that \(\Vert \varvec{\beta }(\textbf{x}) \Vert _\infty>\) , and the identity \(\varvec{\beta }(\textbf{x}) = V V^T\varvec{\beta }(\textbf{x})\) (since \(\varvec{\beta }(\textbf{x})\in N(A)\)), one can prove that

(41)

Since \(\overline{\textbf{x}}\) satisfies the KKT conditions of problem (40), i.e.,

(42)

where \(\eta _i\) and \(\varvec{\gamma }\) are the Lagrange multipliers, one can write

(43)
(44)

where and . It follows that

(45)

By applying Lemma 5 with , , , , , and \(w(\textbf{z})\) defined as in (35), we obtain

(46)

where and W has orthonormal columns spanning . By (12) and (18), . Combining (45) and (46) gives

(47)

If \(\zeta _{min} (M)\) denotes the smallest eigenvalue of the symmetric matrix M, then .

Observe that

(48)

where the last inequality holds since is positive definite. By (47), one gets

(49)

The latter inequality, together with (41) yields

$$\begin{aligned} 0>f(\textbf{y})-f(\overline{\textbf{x}})>\textbf{g}(\overline{\textbf{x}})^T\,(\textbf{y}-\overline{\textbf{x}}). \end{aligned}$$

By (43) and the definition of \(\textbf{y}\), one gets ; hence,

$$\begin{aligned} \textbf{g}(\overline{\textbf{x}})^T\,\varvec{\beta }(\textbf{x})>0 . \end{aligned}$$
(50)

For the remainder of the proof we assume that \( \overline{\textbf{x}}\in \varOmega \) and we set . From (44) and it follows that , moreover, by Lemma 1 we have that

$$\begin{aligned} \varvec{\varphi }(\overline{\textbf{x}}) = \textbf{0}. \end{aligned}$$
(51)

We proceed by contradiction and suppose that \(\varvec{\beta }(\overline{\textbf{x}}) = 0\). Observe that, together with this assumption, (51) implies that \(\overline{\textbf{x}}\) is the optimal solution of problem (33); hence

$$\begin{aligned} -\textbf{g}(\overline{\textbf{x}}) \in T_\varOmega (\overline{\textbf{x}})^\circ . \end{aligned}$$

We consider two possible cases: (a) , (b) . Case (a) leads immediately to a contradiction of (50) since \(-\varvec{\beta }(\textbf{x})\in T_\varOmega (\textbf{x})\) and \(-\overline{\textbf{g}} \in T_\varOmega (\textbf{x})^\circ \).

In case (b), the optimality of \(\overline{\textbf{x}}\) for problem (33) yields

(52)

Since , by comparing (42) and (52) we get

and hence

Because of Assumption 1, matrix is such that , therefore and \(\varvec{\gamma }= \varvec{\nu }\). This implies that \(\eta _i=\lambda _i\) for , whereas \(\lambda _i = 0\) for . Hence, we can write

This, by the definition of normal cone and by Farkas’ Lemma, yields \(- \textbf{g}(\overline{\textbf{x}})\in T_\varOmega (\textbf{x})^\circ \), which leads to a contradiction as in case (a). \(\square \)

Before introducing our method, let us briefly recall the GP algorithmic framework of Calamai and Moré [10], which we report in Algorithm 1.

Algorithm 1 (gradient projection framework from [10])

The framework aims to combine gradient projection steps with steps leading to better theoretical or practical performance while preserving the original identification properties of the GP method. The key idea is that the alternative steps, apart from leading to a decrease of the objective function, must preserve the variables currently considered active. If a ‘suitable’ active set has been identified, one could define a reduced problem on the complementary set of free variables, i.e., focus on the solution of

$$\begin{aligned} \begin{array}{rl} \min &{} f(\textbf{x}), \\ \text{ s.t. } &{} \textbf{x}\in E(\textbf{x}^k). \end{array} \end{aligned}$$
(54)

We introduce a general framework for the solution of problem (33) which we will call Proportionality-based Subspace Accelerated algorithm for Quadratic Programs (PSAQP), which generalizes the P2GP algorithm in [18]. The framework is outlined in Algorithm 2, where \(\varvec{\varphi }(\textbf{x}^i), \; \varvec{\beta }(\textbf{x}^i)\), and \(f(\textbf{x}^i)\) are denoted by \(\varvec{\varphi }^i, \; \varvec{\beta }^i\) and \( f^i\), respectively. PSAQP alternates between GP phases, referred to as ‘identification’ phases, and SM phases, where an approximate solution to (54) is sought, with \(\textbf{x}^k\) inherited from the last identification phase. The identification phase stops if the active set remains fixed in two consecutive iterations or no reasonable progress is made in reducing the objective function, i.e., if

$$\begin{aligned} f^k - f^{k+1} \le \eta \, \max _{m \le l < k} (f^l - f^{l+1}), \end{aligned}$$
(55)

where \(\eta \) is a suitable constant and m is the first iteration of the current identification phase. Once the SM phase has started, the possible return to the identification phase is determined by exploiting the proportionality criterion (34). To allow PSAQP to fit into the general framework of Algorithm 1, a projected line-search is performed on the direction coming from the unconstrained minimization phase, forcing the new iterate to be in \(\varOmega ^k = \varOmega \,\cap \,E(\textbf{x}^k)\). This allows the minimization phase to add variables to the active set, but not to remove them. Finally, the algorithm is stopped if

$$\begin{aligned} \Vert \nabla _\varOmega f^k\Vert = \Vert \varvec{\varphi }^k + \varvec{\beta }^k \Vert \le tol. \end{aligned}$$
(56)
Algorithm 2 (PSAQP)
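Although Algorithm 2 itself is not reproduced here, the tests driving its switching logic are simple to state in code. The sketch below (Python; the indexing is 0-based and the helper names are ours) implements the identification-phase test (55) and the stopping rule (56); the switch from the SM phase back to the GP phase is the proportionality test (34), coded as is_proportional in the sketch following (34).

```python
import numpy as np

def identification_phase_stalled(f_hist, m_start, k, eta):
    """Test (55), for k > m_start: stop the GP phase when the latest decrease
    is at most eta times the best decrease observed in the current phase.
    f_hist[i] holds f(x^i); m_start is the first iteration of the phase."""
    last_drop = f_hist[k] - f_hist[k + 1]
    best_drop = max(f_hist[l] - f_hist[l + 1] for l in range(m_start, k))
    return last_drop <= eta * best_drop

def converged(phi, beta, tol):
    """Stopping rule (56): ||nabla_Omega f^k|| = ||phi^k + beta^k|| <= tol."""
    return np.linalg.norm(phi + beta) <= tol
```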

In the following we report some results on the identification and convergence properties of PSAQP which easily follow from results previously established in the literature. For this reason we will omit their proofs.

The first result is a consequence of Theorem 5.2 in [10]. It was previously stated as Theorem 4.2 in [18] for quadratic programs subject to bound constraints and a single linear constraint. It is also stated here since it applies to general, polyhedral constrained quadratic programs.

Theorem 4

Let \(\left\{ \textbf{x}^k\right\} \) be a sequence generated by applying PSAQP to problem (33). Assume that the set

$$\begin{aligned} K_{GP} = \left\{ k\in \mathbb {N} :\; \textbf{x}^{k+1} \text{ is generated by step 6 of Algorithm 2} \right\} , \end{aligned}$$

i.e., the set of GP iterations, is infinite. If some subsequence \(\left\{ \textbf{x}^k\right\} _{k\in K}\), with \(K\subseteq K_{GP}\), is bounded, then

$$\begin{aligned} \lim \limits _{k\in K,\,k\rightarrow \infty } \left\| \nabla _\varOmega f(\textbf{x}^{k+1})\right\| = 0. \end{aligned}$$
(57)

Moreover, any limit point of \(\left\{ \textbf{x}^k\right\} _{k\in K_{GP}}\) is a stationary point for problem (33).

Since the minimization steps can only add variables to the active set, the identification properties of the GP steps extend to the whole sequence generated by PSAQP. We state the following result, which extends Lemma 4.3 in [18] to polyhedral constrained QPs. It is worth noting that we fixed the statement by adding the assumption that the set of GP iterations is infinite, so that Theorem 4 is applicable. Furthermore, we note that the second part of the statement, which we added to the Lemma for the sake of conciseness, can be derived from the discussion following the related result in [18].

Lemma 6

Let us assume that problem (33) is strictly convex with optimal solution \(\textbf{x}^*\). If \(\left\{ \textbf{x}^k\right\} \) is a sequence in \(\varOmega \) generated by PSAQP applied to (33) and the set of GP iterations is infinite, then for all k sufficiently large

where is defined in Theorem 2.

Moreover, if \(\textbf{x}^*\) is a nondegenerate stationary point, then for all k sufficiently large

The latter result implies that, in case of nondegeneracy, the solution of (33) reduces to the solution of an unconstrained problem in a finite number of iterations. In case of degeneracy, Lemma 6 implies that only the nondegenerate active constraints at the solution are identified in a finite number of steps. Nevertheless, in both cases finite convergence can be proved for the PSAQP algorithm, in case of exact minimization steps, provided that a sufficiently large value for \(\varGamma \) in (34) is chosen. These results are stated in the following theorem, which extends to the PSAQP algorithm for problems of the form (33) the properties proved in [18, Theorem 4.4] for P2GP, suited for problems subject to bound constraints and a single linear constraint. The proof follows the lines of that of Theorem 4.4 in [18]. We report the main steps for the sake of completeness.

Theorem 5

Let us assume that problem (33) is strictly convex with optimal solution \(\textbf{x}^*\). Let \(\left\{ \textbf{x}^k\right\} \) be a sequence in \(\varOmega \) generated by PSAQP applied to (33), in which the minimization phase is performed by any algorithm that is exact for strictly convex quadratic programming. If one of the following conditions holds:

  1. (i)

    \(\textbf{x}^*\) is nondegenerate,

  2. (ii)

    \(\textbf{x}^*\) is degenerate and , where is defined in Theorem 3,

then \(\textbf{x}^k=\textbf{x}^*\) for k sufficiently large.

Proof

Case (i) trivially follows from Lemma 6.

For case (ii) we first observe that, according to Lemma 6, the nondegenerate active variables and the free variables are identified by PSAQP in a finite number of steps. Hence, there exists \(\overline{k}\) such that for \(k \ge \overline{k}\) the solution of

$$\begin{aligned} \begin{array}{rl} \min &{} f(\textbf{x}), \\ \text{ s.t. } &{} \textbf{x}\in E(\textbf{x}^k), \end{array} \end{aligned}$$
(58)

coincides with \(\textbf{x}^*\), the solution of (33).

Thanks to Theorem 3, it is easy to prove that this implies the proportionality of all the iterates from \(\overline{k}\) onward. Hence, after iteration \(\overline{k}\) PSAQP will always use the (exact) minimization phase to determine the next iterate. The finite convergence result follows by observing that at each iteration either the new point is the solution of (33) or nondegenerate variables are added to the active set, and the latter can happen only a finite number of times. \(\square \)

For Algorithm 2, a critical issue is how the approximate solution \(\textbf{d}^k\) is computed. The approximation should be generated so that the iterates do not get trapped in the SM phase, which would jeopardize the convergence of the algorithm. On the other hand, when the objective is strongly convex over N(A) and \(\textbf{d}^k\) is the minimizer of the SM-phase quadratic, finite convergence to the minimum occurs under the assumptions of Theorem 5.

5 Numerical experiments

We implemented in MATLAB a version of the PSAQP algorithm introduced in the previous section. In order to assess its performance we tested it on synthetic problems, generated with the aim of building test cases with varying characteristics, such as the Hessian condition number, the number of equality constraints, the number of active constraints at the solution, and the degree of degeneracy.

5.1 Implementation details

Here we discuss some implementation details for Algorithm 2.

5.1.1 Projections

An efficient way of computing projections onto polyhedra is essential for the performance of PSAQP. Indeed, apart from the standard projections onto \(\varOmega \) (see Line 6), the algorithm also requires at each iteration projections onto \(\varOmega ^k = \varOmega \,\cap \,E(\textbf{x}^k)\) (see Line 16) and onto the tangent cone \(T_\varOmega (\textbf{x}^k)\) for the computation of \( \nabla _{\varOmega }f(\textbf{x}) \). We perform these projections by means of the PPROJ algorithm proposed in [34].

It is worth noting that the backtracking linesearch procedures at Line 6 and Line 16 could be replaced by linesearches on the line segment connecting \( \textbf{x}^k \) and \( P_{S}(\textbf{x}^k-\overline{\alpha } \textbf{d}^k)\) (with \(S=\varOmega \) and \(S=\varOmega ^k\), respectively, and \(\textbf{d}^k\) the moving direction), where \( \overline{\alpha } \) is the starting steplength in the original linesearch; this would require fewer projections. Nevertheless, an advantage of the linesearch considered in this paper over the above-mentioned linesearch on the line segment is that the active constraints are identified in accordance with Theorem 2 and Lemma 6.
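For illustration, a generic Armijo-type backtracking along the projection arc could look as follows (Python; the exact sufficient-decrease conditions and parameter values of Algorithm 2 are not reproduced here, and project stands for the projection onto the relevant set, \(\varOmega \) or \(\varOmega ^k\)).

```python
import numpy as np

def projected_backtracking(f, g, x, d, alpha0, project,
                           mu=1e-4, shrink=0.5, max_tries=30):
    """Armijo-type backtracking along the projection arc x(a) = P_S(x - a d).

    f and g evaluate the objective and its gradient, d is the search
    direction (the gradient in the GP phase, or the SM direction), and
    `project` computes P_S. Parameter values are illustrative only.
    """
    fx, gx = f(x), g(x)
    alpha, x_new = alpha0, x
    for _ in range(max_tries):
        x_new = project(x - alpha * d)
        if f(x_new) <= fx + mu * gx @ (x_new - x):
            return x_new, alpha
        alpha *= shrink
    return x_new, alpha   # last trial point if the test never passed
```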

5.1.2 Minimization phase

Setting \( \textbf{d}=\textbf{u}-\textbf{x}\), the solution to (40) can be found by solving

(59)

which can be equivalently rewritten as

(60)

Problem (60) can be recast as an unconstrained QP problem as follows. Suppose that and a Rank-Revealing QR factorization of is available. This means that, by assuming without loss of generality that the first r rows of are a maximal set of linearly independent rows, one can write

(61)

with orthogonal, \( R_{11}\in \mathbb {R}^{r\times r} \) upper triangular, and \( {R_{12}\in \mathbb {R}^{r\times (m-r)}} \). By applying the variable change \( \textbf{v}= Q\textbf{y}\), problem (60) becomes

(62)

with and . Observe that, since \( R^T_{11} \) is lower triangular, the constraints

$$\begin{aligned} R^T\textbf{y}= \left( \begin{array}{cc} R^T_{11} &{}\quad 0\\ R^T_{12} &{}\quad 0 \end{array}\right) \textbf{y}= \textbf{0}\end{aligned}$$

are satisfied if and only if \( y_i = 0 \) for \( i=1,\ldots ,r \). This allows one to find a solution to (62) by solving the unconstrained quadratic problem

(63)

where, by defining , we set

Remark 3

Since the matrix \( \widetilde{G} \) is derived from H by an orthogonal similarity transformation (Q being orthogonal) and the extraction of a principal submatrix, it can be proved that

$$\begin{aligned} \kappa (\widetilde{G})\le \kappa (H). \end{aligned}$$

It is worth noting that the previous discussion implies the availability of a full QR factorization which, however, would require a computational cost of and a storage cost of , which may be prohibitive when is large. Although one could opt for a thin QR factorization and reduce both costs by the factor , the latter does not allow one to control the subproblem conditioning. Indeed, although a thin factor \(\hat{Q}\) can be used to build an orthogonal projection operator onto the null space of , is not invertible and the Hessian spectrum will not be preserved. In our implementation we use a modified thin QR factorization in which, instead of building \(\hat{Q}\), we store in a matrix the m vectors used to generate the Householder transformations producing R, hence storing Q in factorized form. This allows us to preserve the theoretical advantages of using a full QR factorization with the computational and storage complexity of a thin one.
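The factored-form idea can be sketched in a few self-contained lines: compute and store the Householder vectors once, and apply \(Q^T\) implicitly whenever a product is needed. The snippet below is a generic Python illustration of this technique, not the authors' MATLAB code.

```python
import numpy as np

def householder_qr(M):
    """Householder QR of M (p x m, p >= m): returns the m x m upper-triangular
    factor and the Householder vectors that define Q implicitly."""
    R = np.array(M, dtype=float)
    p, m = R.shape
    vs = []
    for j in range(m):
        x = R[j:, j].copy()
        v = x.copy()
        v[0] += (1.0 if x[0] >= 0 else -1.0) * np.linalg.norm(x)
        nv = np.linalg.norm(v)
        if nv > 0:
            v = v / nv
        vs.append(v)
        R[j:, :] -= 2.0 * np.outer(v, v @ R[j:, :])
    return np.triu(R[:m, :]), vs

def apply_Qt(vs, x):
    """Apply Q^T to a vector using only the stored Householder vectors."""
    y = np.array(x, dtype=float)
    for j, v in enumerate(vs):
        y[j:] -= 2.0 * v * (v @ y[j:])
    return y

# quick check: the first m entries of Q^T M reproduce the triangular factor
M = np.random.default_rng(1).standard_normal((8, 3))
R, vs = householder_qr(M)
QtM = np.column_stack([apply_Qt(vs, M[:, j]) for j in range(3)])
assert np.allclose(QtM[:3, :], R)
```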

Thanks to the presence of the proportionality criterion, one can relax the stopping criterion in the solution of problem (63). In our experiments we monitor the progress in reducing the objective function, i.e. we stop the minimization-phase solver whenever

$$\begin{aligned} \widetilde{p}(\widetilde{\textbf{z}}^j)- \widetilde{p}(\widetilde{\textbf{z}}^{j+1}) \le \xi \, \max _{1 \le l < j} \left\{ \widetilde{p}(\widetilde{\textbf{z}}^l) - \widetilde{p}(\widetilde{\textbf{z}}^{l+1}) \right\} , \end{aligned}$$
(64)

where \(\xi \in (0,1)\). This choice follows [38] and [18]. Observe that if PSAQP uses the minimization phase also in the subsequent step and the active set has not changed, the minimization method continues its iterations as if the previous call had not been stopped.

5.1.3 Lagrange multipliers estimate

We observe that the computation of Lagrange multiplier estimates is only needed at Line 18 of Algorithm 2 to decompose the projected gradient into \(\varvec{\varphi }^{k+1}\) and \(\varvec{\beta }^{k+1}\). Therefore, assuming again that the first r rows of constitute a maximal independent set, one can exploit the QR factorization (61) computed for the minimization phase and consider the matrix \(\widetilde{A} = R_{11}^T Q^T\). A multiplier estimate \(\varvec{\theta }(\textbf{x})\) can then be computed by setting

(65)

with , and setting the other Lagrange multipliers to 0.

5.2 Test problems and results

We compared PSAQP with the basic gradient projection method (referred to as GP), to show the effectiveness of the proposed acceleration strategy. Moreover, we compared it with the IP-PMM interior point method introduced in [39], which has been shown to be highly efficient in the solution of quadratic and nonquadratic problems arising from various data analysis tasks [15]. All the tests were run with MATLAB R2021b on the magicbox server operated by the Department of Mathematics and Physics at the University of Campania “L. Vanvitelli”. We ran the tests using an Intel Xeon Platinum 8168 CPU with 192 GB of RAM. The elapsed times reported for the MATLAB codes were measured by using the tic and toc commands.

To test the three algorithms we built convex random QP problems by modifying the procedure for generating problems of the same form with \(m=1\) used in [18]. We first compute a point \(\textbf{x}^*\) and then build a problem of type (33) having \(\textbf{x}^*\) as stationary point. We built a set of 270 problems with the following parameters:

  • n, number of variables, in \(\lbrace 10000,\,20000\rbrace \);

  • m, number of constraints, in \(\lbrace 2,\,5,\,10,\,20,\,50\rbrace \);

  • ncond, \(\log _{10}\) of the Hessian condition number, in \(\lbrace 4,\,5,\,6\rbrace \);

  • naxsol, fraction of active variables at \(\textbf{x}^*\), in \(\lbrace 0.1,\,0.5,\,0.9\rbrace \);

  • ndeg, \(-\log _{10}\) of the magnitude of the Lagrange multipliers associated with the bound constraints (near-degeneracy measure), in \(\lbrace 0,\,1,\,3\rbrace \).

For all the problems we generated a random feasible starting point to be used by PSAQP and GP. The methods were compared by using the performance profiles proposed by Dolan and Moré [21].

5.2.1 Comparison with GP

Both in GP and in the identification phase of PSAQP, the initial steplength for the linesearch is determined by the Alternate BB rule \(\textrm{ABB}_\textrm{min}\) proposed in [27] and analyzed in [17], i.e. at each step the steplength is chosen as

$$\begin{aligned} \alpha ^k_{\textrm{ABB}_\textrm{min}}= \left\{ \!\! \begin{array}{ll} \min \left\{ \alpha ^j_\textrm{BB2} \, : \, j=\max \{1,k-s\},\ldots ,k\right\} , &{} \displaystyle \text{ if } \, \frac{ \alpha ^k_\textrm{BB2}}{ \alpha ^k_\textrm{BB1}} <\tau , \\ \alpha ^k_\textrm{BB1}, &{} \text{ otherwise }, \end{array} \right. \end{aligned}$$
(66)

where s is a non-negative integer, \(\tau \in (0,1)\), and \(\alpha ^k_\textrm{BB1}\) and \(\alpha ^k_\textrm{BB2}\) are the well-known Barzilai–Borwein steplengths introduced in [2]. We point out that, while for PSAQP we used \(\tau =0.2\) and \(s=3\), in GP we obtained better performance when using the alternative procedure described in [5], in which \(\tau \) is adapted at each iteration, with \(\tau =0.5\) as starting value. Regarding the proportionality condition (34), choosing \(\varGamma \) according to Theorem 3 requires knowledge of the spectrum of H, which is usually unavailable. We used the adaptive strategy for updating \(\varGamma \) introduced in [18], with starting value equal to 1.
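For reference, rule (66) can be coded compactly as follows (Python; the 0-based indexing and the safeguards are our additions, while the defaults \(\tau =0.2\) and a memory of 3 steps match the values used for PSAQP).

```python
import numpy as np

def abb_min_steplength(s_vecs, y_vecs, k, tau=0.2, mem=3,
                       alpha_min=1e-10, alpha_max=1e10):
    """ABB_min rule (66) built from the Barzilai-Borwein steplengths.

    s_vecs[j] = x^{j+1} - x^j and y_vecs[j] = g^{j+1} - g^j; the safeguards
    alpha_min/alpha_max are our addition, not part of (66).
    """
    def bb1(j):
        s, y = s_vecs[j], y_vecs[j]
        return (s @ s) / max(s @ y, 1e-16)
    def bb2(j):
        s, y = s_vecs[j], y_vecs[j]
        return max(s @ y, 1e-16) / (y @ y)

    if bb2(k) / bb1(k) < tau:
        alpha = min(bb2(j) for j in range(max(0, k - mem), k + 1))
    else:
        alpha = bb1(k)
    return float(np.clip(alpha, alpha_min, alpha_max))
```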

We set \(tol = 10^{-5} \Vert \varvec{\varphi }^0 + \varvec{\beta }^0 \Vert \); each algorithm was allowed to run for at most 120 s per problem, with at most 30000 matrix–vector products and 30000 projections. We declared a failure whenever these limits were reached without satisfying the tolerance.

Fig. 1 Performance profiles of PSAQP and GP on strictly convex QPs in terms of elapsed time. Top left: all problems; top right: naxsol\(=0.9\); bottom left: naxsol\(=0.5\); bottom right: naxsol\(=0.1\)

Figure 1 shows the performance profiles, \(\pi (\chi )\), of the two methods on the whole set of problems. The profiles corresponding to all the problems (top left) and to those with naxsol\(=0.9\) (top right), naxsol\(=0.5\) (bottom left), and naxsol\(=0.1\) (bottom right) are reported. We note that PSAQP failed to satisfy the tolerance only once, on a problem with naxsol\(=0.1\). GP had, instead, 41 failures: 9 on problems with naxsol\(=0.1\), 16 on problems with naxsol\(=0.5\), and 16 on problems with naxsol\(=0.9\). We see that PSAQP has the best performance in all the cases. The plots show that the gap between PSAQP and GP is reduced as the number of active constraints at the solution decreases. This suggests that the advantages of the use of the minimization phase are reduced when the size of subproblem (63) is larger.

Fig. 2 Performance profiles of PSAQP and GP on strictly convex QPs in terms of number of Hessian-vector products (left) and number of projections (right). First row: all problems; second row: naxsol\(=0.9\); third row: naxsol\(=0.5\); fourth row: naxsol\(=0.1\)

To have a clearer overview of the comparison between the two algorithms, we report in Fig. 2 the performance profiles for the number of matrix–vector products and the number of projections. We see that PSAQP always performs the smallest number of projections and is in general more efficient and robust in terms of the number of Hessian-vector products. By looking at the plots one can see that the performance of GP with respect to PSAQP improves as the number of active constraints at the solution decreases (for naxsol\(=0.1\) – last row – it is more efficient in terms of Hessian-vector products). Since in this case matrix–vector products can be computed with a cost, however, the performance of the methods in terms of elapsed time appears to be mainly affected by the number of calls to the projection routine. It is also worth pointing out that the average number of QR factorizations performed by PSAQP is around 118 and it increases as naxsol increases (100 for naxsol\(=0.1\), 120 for naxsol\(=0.5\), 136 for naxsol\(=0.9\)). As for the number of Hessian-vector products, this does not appear to have an effect on the comparison in terms of elapsed time, suggesting that the cost of computing the QR decompositions is negligible with respect to the cost of the projections.

5.2.2 Comparison with IP-PMM

We here report the results of the comparison performed between our implementation of PSAQP and the MATLAB implementation of IP-PMM.Footnote 1 To deal with the matrix-free nature of the synthetic test problems, we equipped IP-PMM with the MATLAB MINRES implementation by Paige and Saunders,Footnote 2 with a block-diagonal preconditioner similar to the one used in [15, Section 5.1] to solve the KKT system for both the predictor and the corrector step at each iteration. Since for each problem we know the solution \(\textbf{x}^*\), we ran both PSAQP and IP-PMM with a stopping condition of the form

$$\begin{aligned} f(\textbf{x}^*)-f(\textbf{x}^k) \le tol\_f\, |f(\textbf{x}^*)| \end{aligned}$$

with \(tol\_f=10^{-6}\). To obtain comparable solutions, for IP-PMM we coupled the stopping condition with a condition on the primal feasibility of \(\textbf{x}^k\), i.e., an absolute tolerance of \(10^{-4}\) on the satisfaction of the problem constraints (we observe that in all the cases the absolute feasibility error for PSAQP is below \(10^{-8}\)). IP-PMM was run using 2 different tolerance values \(\tau _{MR}\) for MINRES, \(10^{-5}\) and \(10^{-8}\), with a mixed absolute/relative stopping criterion; in both cases MINRES was allowed to perform a maximum of 200 iterations at each call. PSAQP was able to reach the desired tolerances in all the cases, whereas the two versions of IP-PMM were able to satisfy the required tolerance only for 94 (35% – \(\tau _{MR}=10^{-5}\)) and 167 (62% – \(\tau _{MR}=10^{-8}\)) problems, respectively. It is worth mentioning that the average number of iterations for IP-PMM in the successful instances was around 21 and that the algorithm performed an average number of 51 and 110 MINRES iterations at each call, respectively for the case \(\tau _{MR}=10^{-5}\) and the case \(\tau _{MR}=10^{-8}\).

Fig. 3 Performance profiles of PSAQP and IP-PMM (\(\tau _{MR} = 10^{-5}\)) on 97 strictly convex QPs in terms of elapsed time. Top: all problems; bottom left: m\(=2,5,10,20\); bottom right: m\(=50\)

Fig. 4 Performance profiles of PSAQP and IP-PMM (\(\tau _{MR} = 10^{-8}\)) on 167 strictly convex QPs in terms of elapsed time. Top: all problems; bottom left: m\(=2,5,10,20\); bottom right: m\(=50\)

We report in Figs. 3 and 4 the performance profiles for the comparison between PSAQP and IP-PMM, restricted to the set of problems for which IP-PMM was able to reach the desired solution. The plots show that PSAQP outperforms IP-PMM for problems with a number of linear constraints between 2 and 20, while IP-PMM is always faster on the instances with 50 linear equality constraints. This suggests that the proposed PSAQP implementation is competitive for problems with a small number of linear constraints, while it struggles in the solution of problems with larger constraint matrices. This is potentially due to the use of QR factorizations (which can become expensive for larger matrices) for the construction of the reduced problems and the computation of the Lagrange multiplier estimate. This also suggests that, to expand the range of applicability of the proposed strategy, a future efficient C-based implementation, which is beyond the scope of this paper, should resort to different strategies for the multiplier estimate and the minimization phase, and exploit quantities computed in the identification phase.

6 Conclusions

We have shown that for polyhedral constrained optimization, at a feasible point \(\textbf{x}\) the projection \(\nabla _\varOmega f(\textbf{x})\) of the negative gradient onto the tangent cone admits an orthogonal decomposition of the form \( -\nabla _\varOmega f(\textbf{x})= \varvec{\beta }(\textbf{x})+\varvec{\varphi }(\textbf{x}), \) where \(\varvec{\varphi }(\textbf{x})\) and \(\varvec{\beta }(\textbf{x})\) measure different aspects of stationarity. From a practical point of view, within active-set algorithms that alternate two phases, one to identify “promising faces” of the polyhedron to be explored and one to accelerate the function reduction, \(\varvec{\varphi }(\textbf{x})\) and \(\varvec{\beta }(\textbf{x})\) can be used to give suitable rules for switching between the two phases.

As an example of this, we have introduced an active-set algorithm for the solution of convex quadratic programs, and proved its finite convergence in the case of possibly degenerate strictly convex problems. Numerical experiments on synthetic QP problems with a small number of dense constraints show the efficiency of the proposed strategy over the gradient projection method and, when the number of constraints is between 2 and 20, over a tailored interior point method. Future work will deal with the implementation of an efficient C-based version of the PSAQP method and its extension to the non-quadratic case.