1 Introduction

Consider the following regularized optimization problem:

$$\begin{aligned} \min _{x \in \mathcal {E}}\quad F(x) {:}{=}f(x) + \varPsi (x), \end{aligned}$$
(1)

where \(\mathcal {E}\) is a Euclidean space with an inner product \(\langle \cdot ,\,\cdot \rangle \) and its induced norm \({\left\| {\cdot }\right\| }\), the regularizer \(\varPsi \) is extended-valued, convex, proper, and lower-semicontinuous, f is continuously differentiable with Lipschitz-continuous gradients, and the solution set \(\varOmega \) is non-empty. This type of problem is ubiquitous in applications such as machine learning and signal processing (see, for example, [8, 9, 47]). One widely-used method for (1) is inexact successive quadratic approximation (ISQA). At the tth iteration with iterate \(x^t\), ISQA obtains the update direction \(p^t\) by approximately solving

$$\begin{aligned}{} & {} p^t \approx \mathop {{{\,\mathrm{\arg \,min}\,}}}\limits _{p \in \mathcal {E}}\quad Q_{H_t}\left( p; x^t \right) , \end{aligned}$$
(2)
$$\begin{aligned}{} & {} Q_{H_t}\left( p; x^t \right) {:}{=}\langle \nabla f\left( x^t \right) ,\,p\rangle + \frac{1}{2} \langle p,\, H_t p\rangle + \varPsi \left( x^t + p \right) - \varPsi \left( x^t \right) , \end{aligned}$$
(3)

where \(H_t\) is a self-adjoint positive-semidefinite linear endomorphism of \(\mathcal {E}\). The iterate is then updated along \(p^t\) with a step size \(\alpha _t > 0\).
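To make (2)–(3) concrete, the following minimal sketch (our illustration, not an algorithm from this paper) takes \(\varPsi = \lambda \Vert \cdot \Vert _1\), evaluates \(Q_{H_t}\), and computes an approximate \(p^t\) with proximal gradient as the inner solver; all names are our choices.

```python
import numpy as np

def soft_threshold(z, tau):
    # prox of tau * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def Q(p, grad_f, H, x, lam):
    # Subproblem objective (3) for Psi = lam * ||.||_1
    return grad_f @ p + 0.5 * p @ (H @ p) + lam * (np.abs(x + p).sum() - np.abs(x).sum())

def solve_subproblem(grad_f, H, x, lam, iters=50):
    # Approximately solve (2) by proximal gradient on p (one common inner solver)
    step = 1.0 / np.linalg.norm(H, 2)   # step size 1 / ||H||
    p = np.zeros_like(x)
    for _ in range(iters):
        g = grad_f + H @ p              # gradient of the smooth part of Q
        p = soft_threshold(x + p - step * g, step * lam) - x
    return p
```

When \(H_t\) is the identity the inner update reaches the exact subproblem solution in one step; for non-diagonal \(H_t\), as the text notes, no closed form exists and the loop must iterate.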

ISQA is among the most efficient methods for (1). Its variants differ in the choice of \(H_t\) and \(\alpha _t\), and in how accurately (2) is solved. In this class, proximal Newton (PN) [23, 28] and proximal quasi-Newton (PQN) [45] are popular for their fast convergence in iterations. Regrettably, their subproblem has no closed-form solution because \(H_t\) is non-diagonal, so an iterative solver must be applied to (2), and the running time to reach the required accuracy can hence be lengthy. For example, to attain the same superlinear convergence as truncated Newton for smooth optimization, PN with \(H_t = \nabla ^2 f\) requires increasingly accurate subproblem solutions, implying a growing and unbounded number of inner iterations of the subproblem solver. Its superlinear convergence thus gives little practical advantage in running time. In contrast, in smooth optimization one can solve (2) at a bounded cost by either conjugate gradient (CG) or matrix factorizations since \(\varPsi \equiv 0\). The advantage of second-order methods over first-order ones is therefore less significant in regularized optimization than in smooth optimization.

A possible remedy when \(\varPsi \) is partly smooth [26] is to switch to smooth optimization after identifying an active manifold \(\mathcal {M}\) that contains a solution \(\hat{x}\) to (1) and on which \(\varPsi \) restricted is smooth. We say an algorithm can identify \(\mathcal {M}\) if there is a neighborhood U of \(\hat{x}\) such that \(x^t \in U\) implies \(x^{t+1} \in \mathcal {M}\), and say it possesses the manifold identification property. Unfortunately, for ISQA, this property in general holds only when (2) is always solved exactly. Indeed, even if each \(p^t\) is arbitrarily close to the corresponding exact solution, it is possible that no iterate lies in the active manifold, as shown below.

Example 1

Consider the following simple example of (1) with \(\varPsi (\cdot ) = \Vert \cdot \Vert _1\):

$$\begin{aligned} \min _{x \in \mathbb {R}^2}\, (x_1 - 2.5)^2 + (x_2 - 0.3)^2 + \Vert x\Vert _1, \end{aligned}$$

whose only solution is \({\hat{x}} = (2,0)\), and \(\Vert x\Vert _1\) is smooth relative to \(\mathcal {M}= \{x \mid x_2 = 0\}\) around \({\hat{x}}\). Consider \(\{x^t\}\) with \(x^t_1 = 2 + h(t), x^t_2 = h(t)\), for some \(h(t) > 0\) with \( h(t) \downarrow 0\), and let \(H_t \equiv I, \alpha _t \equiv 1\), and \(p^t = x^{t+1} - x^t\). The optimum of (2) is \(p^{t*} = {\hat{x}} - x^t\), so \(\Vert x^t- {\hat{x}}\Vert = O(h(t))\) and \(\Vert p^t - p^{t*}\Vert =O(h(t))\). As h is arbitrary, both the subproblem approximate solutions and their corresponding objectives converge to the optimum arbitrarily fast, but \(x^t \notin \mathcal {M}\) for all t. \(\square \)

Moreover, some versions of inexact PN generalize the stopping condition of CG for truncated Newton to require \(\Vert r^t\Vert \rightarrow 0\), where

$$\begin{aligned} r^t {:}{=}\mathop {{{\,\mathrm{\arg \,min}\,}}}\limits _{r} \Vert r\Vert , \quad \text { subject to } \quad r \in \partial _{p} Q_{H_t}(p^t;x^t), \end{aligned}$$
(4)

but Example 1 gives a sequence \(\{\Vert r^t\Vert \}\) that stays bounded away from zero, hinting that such a condition might have an implicit relation with manifold identification.
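Example 1 can be verified numerically. The sketch below (our construction, with the arbitrary decreasing sequence of the example instantiated as \(2^{-t}\)) evaluates \(r^t\) of (4) in closed form and confirms that its norm stays bounded away from zero.

```python
import numpy as np

# Example 1: f(x) = (x1 - 2.5)^2 + (x2 - 0.3)^2, Psi = ||.||_1,
# H_t = I, alpha_t = 1; the decreasing sequence is taken as 2^{-t} (our choice).
grad_f = lambda x: np.array([2 * (x[0] - 2.5), 2 * (x[1] - 0.3)])

def r_norm(t, h=lambda s: 2.0 ** (-s)):
    # Both coordinates of x^{t+1} = x^t + p^t are positive, so the l1
    # subdifferential there is the singleton (1, 1) and r^t of (4) is unique.
    xt = np.array([2 + h(t), h(t)])
    pt = np.array([2 + h(t + 1), h(t + 1)]) - xt
    rt = grad_f(xt) + pt + np.array([1.0, 1.0])
    return np.linalg.norm(rt)
```

The residual norm approaches a positive constant even though the iterates converge to \({\hat{x}}\).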

Interestingly, in our numerical experience in [21, 22, 29], ISQA with approximate subproblem solutions, even without increasing solution precision and on problems that are not strongly convex, often identifies the active manifold rapidly. We thus aim to provide theoretical support for this phenomenon and to utilize it to devise more efficient and practical methods that match the superior performance of second-order methods in smooth optimization.

In this work, we show that ISQA essentially possesses the manifold identification property: we give a sufficient condition for inexact solutions of (2) in ISQA to identify the active manifold, and this condition is satisfied by the output of most widely-used subproblem solvers even if (2) is solved arbitrarily roughly. We also show that \({\left\| {r^t}\right\| } \downarrow 0\) is indeed sufficient for manifold identification, so PN can achieve superlinear convergence more efficiently through this property. When the iterates do not lie in a compact set, they might fail to converge, in which case even algorithms possessing the manifold identification property can fail to identify the active manifold, because the iterates never enter a neighborhood that enables identification. Therefore, we also show convergence of the iterates under a sharpness condition widely seen in real-world problems that generalizes both the quadratic growth condition and the weak sharp minima. Under convexity, this sharpness condition is equivalent to a type of Kurdyka–Łojasiewicz (KL) condition [19, 32], but convergence of general ISQA methods under the KL condition is unknown because the inexactness condition can easily violate the relative-error condition needed in [2, 4]; our analysis thus provides a novel approach to obtaining iterate convergence for this family of algorithms. Based on these results, we propose an improved, practical algorithm, ISQA \(^{+}\), that switches to smooth optimization after the active manifold is presumably identified. We show that ISQA \(^{+}\) is superior to existing PN-type methods as it possesses the same superlinear and even quadratic rates in iterations but has bounded per-iteration cost. ISQA \(^{+}\) hence also converges superlinearly in running time, which, to our best knowledge, is the first result of its kind.
Our analysis is also more general than existing ones in guaranteeing superlinear convergence for a broader class of degenerate problems. Numerical results further confirm ISQA \(^{+}\)'s much improved efficiency over PN and PQN.

1.1 Related work

ISQA for (1), or the special case of constrained optimization, has been well-studied, and we refer the readers to [21] for a detailed review of related methods. We mention here in particular the works [7, 23, 28, 54] that provided superlinear convergence results. Lee et al. [23] first analyzed the superlinear convergence of PN and PQN. Their analysis considers only strongly convex f, so both the convergence of the iterates and the positive definiteness of the Hessian are guaranteed. Their inexact version requires \(\Vert r^t\Vert \downarrow 0\), which might not happen when the solutions to (2) are only approximate, as illustrated in Example 1. With the same requirement on \(\Vert r^t\Vert \) as [23], Li et al. [28] showed that superlinear convergence of inexact PN can be achieved when f is self-concordant. Byrd et al. [7] focused on \(\varPsi (\cdot ) = \Vert \cdot \Vert _1\) and showed superlinear convergence of PN under the subproblem stopping condition (10) defined in Sect. 2, which is achievable as long as \(p^t\) is close enough to the optimum of (2). To cope with degenerate cases in which the Hessian is only positive semidefinite, Yue et al. [54] used the stopping condition of [7] to propose a damped PN for general \(\varPsi \) and showed that its iterates converge and achieve superlinear convergence under convexity and the error-bound (EB) condition [33], even if F is not coercive. A common drawback of [7, 23, 28, 54] is that they all require increasing precision in solving the subproblem, so the superlinear rate is observed in their experiments only in iterations but not in running time. In contrast, by switching to smooth optimization after identifying the active manifold, ISQA \(^{+}\) achieves superlinear convergence not only in iterations but also in time, and is thus much more efficient in practice. Our superlinear convergence result also allows a broader range of degeneracy than that in [54].

Although ISQA is intensively studied, its ability to identify the active manifold is barely discussed, because this property does not hold in general, as noted in Example 1. Hare [14] showed that ISQA identifies the active manifold under the assumptions that (2) is always solved exactly and that the iterates converge, and his analysis cannot be extended to inexact versions. Our observation in [21, 22, 29] that ISQA empirically identifies the active manifold motivated this work to provide theoretical guarantees for this phenomenon.

Manifold identification requires the iterates, or at least a subsequence, to converge to a point of partial smoothness. In most existing analyses for (1), iterate convergence is proven under one of the following: (i) f is convex and the algorithm is a first-order one, (ii) F is strongly convex, or (iii) the Kurdyka–Łojasiewicz (KL) condition holds. Analyses for the first scenario rely on the implicit regularization of first-order methods that keeps their iterates in a bounded region [31], but this is not applicable to ISQA. Under the second condition, convergence of the objective directly implies that of the iterates. For the third case, convergence of the full iterate sequence is usually proven under a relative-error assumption of the form

$$\begin{aligned} \min _{v \in \partial F(x^{t+1})} {\left\| {v}\right\| } \le b {\left\| {x^{t+1} - x^t}\right\| }, \quad \forall t \end{aligned}$$

for some \(b > 0\), as done in [2, 5, 10], but this condition can easily be violated when inexactness enters ISQA, as argued by Bonettini et al. [5]. To work around this issue, [5] further assumed that the forward-backward envelope [46] of F satisfies the KL condition and obtained iterate convergence under such a situation, but whether the KL condition on F implies that of its forward-backward envelope is unclear. The only exception that obtains convergence under the KL condition on F for a specific type of SQA method is [54], which shows convergence of the iterates for its specific algorithm under EB and convexity of f, but it requires \(H_t\) in (2) to be the Hessian of f plus a multiple of the identity, and its analysis cannot be extended to general \(H_t\). On the other hand, our analysis for iterate convergence is novel and more general in covering a much broader algorithmic framework and requiring only a general sharpness condition on F that contains both EB and the weak sharp minima [6] as special cases.

Our design of the two-stage ISQA \(^{+}\) is inspired by [24, 29] in conjecturing that the active manifold has been identified after the current manifold remains unchanged, but the design of the first stage is quite different, and we also add additional safeguards in the second stage. Lee and Wright [24] used dual averaging in the first stage for optimizing the expected value of an objective function involving random variables, so their algorithm is more suitable for stochastic settings. Li et al. [29] focused on distributed optimization, and their usage of manifold identification is for reducing the communication cost instead of accelerating general regularized optimization as considered in this work.

1.2 Outline

This work is outlined as follows. In Sect. 2, we describe the algorithmic framework and give preliminary properties. Technical results in Sect. 3 prove the manifold identification property of ISQA and the convergence of the iterates. We then describe the proposed ISQA \(^{+}\) and show its superlinear convergence in running time in Sect. 4. The effectiveness of ISQA \(^{+}\) is then illustrated through extensive numerical experiments in Sect. 5. Section 6 finally concludes this work. Our implementation of the described algorithms is available at https://www.github.com/leepei/ISQA_plus/.

2 Preliminaries

We denote the minimum of (1) by \(F^*\); the domain of \(\varPsi \) by \(\text {dom}(\varPsi )\); and the set of convex, proper, and lower semicontinuous functions by \(\varGamma _0\). For any set C, \(\textrm{relint}(C)\) denotes its relative interior. We will frequently use the following notations.

$$\begin{aligned} \delta _t {:}{=}F(x^t) - F^*,\quad P_{\varOmega }(x) {:}{=}\mathop {{{\,\mathrm{\arg \,min}\,}}}\limits _{y \in \varOmega } \Vert x - y\Vert ,\quad \textrm{dist}(x,\varOmega ) {:}{=}\Vert x - P_\varOmega (x)\Vert . \end{aligned}$$

The level set \(\text {Lev}\left( {\xi }\right) {:}{=}\left\{ x\mid F(x) - F^* \le \xi \right\} \) for any \(\xi \ge 0\) is closed but not necessarily bounded. A function is L-smooth if it is differentiable with its gradient L-Lipschitz continuous. We denote the identity operator by I. For self-adjoint linear endomorphisms A, B of \(\mathcal {E}\), \(A \succ B\) (\(A \succeq B\)) means \(A-B\) is positive definite (positive semidefinite). We abbreviate \(A \succ \tau I\) to \(A \succ \tau \) for \(\tau \in \mathbb {R}\). The set of A with \(A \succ 0\) is denoted by \({\mathcal {S}_{++}}\). The subdifferential \(\partial \varPsi (x)\) of \(\varPsi \) at x is well-defined as \(\varPsi \in \varGamma _0\), and hence so is the generalized gradient \(\partial F(x) = \nabla f(x) + \partial \varPsi (x)\). For any \(g \in \varGamma _0\), \(\tau \ge 0\), and \(\varLambda \in {\mathcal {S}_{++}}\), the proximal mapping

$$\begin{aligned} \textrm{prox}^{\varLambda }_{\tau g}(x) {:}{=}\mathop {{{\,\mathrm{\arg \,min}\,}}}\limits _{y \in \mathcal {E}}\, \frac{1}{2} \langle x - y,\,\varLambda (x-y)\rangle + \tau g\left( y \right) \end{aligned}$$
(5)

is continuous and finite on \(\mathcal {E}\), even outside \(\text {dom}(g)\). When \(\varLambda = I\), (5) is shortened to \(\textrm{prox}_{\tau g}(x)\). For (2), we denote its optimal solution by \(p^{t*}\). When there is no ambiguity, we abbreviate \(Q_{H_t}(\cdot ; x^t)\) to \(Q_t(\cdot )\), \(Q_t(p^t)\) to \(\hat{Q}_t\), and \(Q_t(p^{t*})\) to \(Q_t^*\). Some notations used in this paper are summarized in Table 1.
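For many common regularizers, (5) has a closed form. A minimal sketch (ours; \(\ell _1\) chosen for illustration) for \(\varPsi = \Vert \cdot \Vert _1\), including a diagonal \(\varLambda \) for which the coordinates decouple:

```python
import numpy as np

def prox_l1(x, tau):
    # prox_{tau ||.||_1}(x): closed form (soft-thresholding), i.e., Lambda = I in (5)
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_l1_diag(x, tau, lam_diag):
    # (5) with a diagonal Lambda: coordinates decouple, each with
    # its own threshold tau / Lambda_ii
    return np.sign(x) * np.maximum(np.abs(x) - tau / lam_diag, 0.0)
```

Both maps are finite and continuous everywhere, matching the statement above.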

Table 1 Summary of some notations

2.1 Algorithmic framework

We now detail the ISQA framework by discussing the choice of \(H_t\), the subproblem solver and its stopping condition, and how sufficient objective decrease is ensured. We view the algorithm as a two-level loop procedure, where the outer loop updates the iterate \(x^t\) and the iterations of the subproblem solver form the inner loop.

After obtaining \(p^t\) from (2), we need to find a step size \(\alpha _t > 0\) for it to ensure sufficient objective decrease. Given \(\gamma ,\beta \in (0,1)\), we take \(\alpha _t\) as the largest value in \(\{\beta ^0,\beta ^1,\dotsc \}\) satisfying the following Armijo-like condition:

$$\begin{aligned} F\left( x^t + \alpha _t p^t \right) \le F\left( x^t \right) + \gamma \alpha _t Q_t(p^t). \end{aligned}$$
(6)

This condition is satisfied by all \(\alpha _t\) small enough as long as \(Q_t(p^t) < 0\) and \(Q_t(\cdot )\) is strongly convex; see [21, Lemma 3].
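The backtracking loop implied by (6) can be sketched as follows (a minimal illustration; the defaults for \(\gamma \), \(\beta \), and the trial cap are our choices):

```python
import numpy as np

def backtracking(F, Q_t, x, p, gamma=0.25, beta=0.5, max_back=50):
    # Largest alpha in {beta^0, beta^1, ...} satisfying (6):
    #   F(x + alpha p) <= F(x) + gamma * alpha * Q_t(p)
    q_p = Q_t(p)        # negative whenever p is a descent direction
    Fx = F(x)
    alpha = 1.0
    for _ in range(max_back):
        if F(x + alpha * p) <= Fx + gamma * alpha * q_p:
            return alpha
        alpha *= beta
    raise RuntimeError("line search failed; check that Q_t(p) < 0")
```

As the cited lemma states, the loop terminates whenever \(Q_t(p) < 0\), since the condition holds for all sufficiently small \(\alpha \).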

For the choice of \(H_t\), we only make the following blanket assumption without further specification to make our analysis more general.

$$\begin{aligned} \exists M,m > 0,\quad \text { such that } \quad M \succeq H_t \succeq m, \;\forall t \ge 0. \end{aligned}$$
(7)

For (2), any suitable solver for regularized optimization, such as (accelerated) proximal gradient, (variance-reduced) stochastic gradient methods, and their variants, can be used, and the following are common for their inner loop termination:

$$\begin{aligned}{} & {} \hat{Q}_t - Q_t^* \le \epsilon _t, \end{aligned}$$
(8)
$$\begin{aligned}{} & {} \left\| r^t \right\| \le \epsilon _t, \quad \text { or } \end{aligned}$$
(9)
$$\begin{aligned}{} & {} G^\tau _{H_t} \left( p^t;x^t \right) {:}{=}{\left\| {p^t - \bar{p}^t_\tau }\right\| }\le \epsilon _t, \end{aligned}$$
(10)

for some given \(\epsilon _t\ge 0\) and \(\tau > 0\), where

$$\begin{aligned} \bar{p}^t_\tau&{:}{=}\textrm{prox}_{\tau \varPsi }\left( \left( x^t + p^t \right) - \tau \left( \nabla f(x^t) + H_t p^t\right) \right) - x^t. \end{aligned}$$
(11)

The point \(\bar{p}^t_\tau \) in (11) is computed by taking a proximal gradient step of the subproblem (2) from \(p^t\), and thus \(\bar{p}^t_\tau - p^t\) is the proximal gradient (with step size \(\tau \)) of \(Q_t\) at \(p^t\). Because the subproblem is strongly convex, the norm of this proximal gradient is zero at \(p^t\) if and only if \(p^t\) is the unique solution to the subproblem. We will see in the next subsection that the squared norm of this proximal gradient is also equivalent, up to constants, to the distance of the objective to the optimum of the subproblem. We summarize this framework in Algorithm 1.

Algorithm 1: The ISQA framework
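The quantities in (10)–(11) are cheap to compute once \(\varPsi \) has a closed-form proximal mapping. A sketch for \(\varPsi = \lambda \Vert \cdot \Vert _1\) (our illustration; names are ours):

```python
import numpy as np

def soft_threshold(z, tau):
    # prox of tau * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def G_measure(p, x, grad_f, H, lam, tau):
    # bar{p}^t_tau of (11) for Psi = lam * ||.||_1: one proximal gradient
    # step of the subproblem taken from p; G of (10) is ||p - bar{p}^t_tau||
    p_bar = soft_threshold((x + p) - tau * (grad_f + H @ p), tau * lam) - x
    return np.linalg.norm(p - p_bar), p_bar
```

The measure vanishes exactly at the subproblem solution, matching the fixed-point characterization above.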

2.2 Basic properties

Under (7), (3) is m-strongly convex with respect to p, so the following standard results hold for any \(\tau \in (0, 1/M]\) and any \(p^t \in \mathcal {E}\) [12, 39].

$$\begin{aligned}{} & {} \Vert r^t\Vert ^2 \ge 2 m \left( Q_t(p^t) - Q_t^* \right) \ge \frac{m}{M} G^\tau _{H_t}\left( p^t;x^t \right) ^2, \end{aligned}$$
(12)
$$\begin{aligned}{} & {} G^\tau _{H_t}\left( p^t;x^t \right) ^2 \ge \tau \left( \frac{(2m^{-1} + \tau )(1 + M \tau )}{\tau } - \frac{1}{2} \right) ^{-1} \left( Q_t\left( \bar{p}^t_\tau \right) - Q_t^* \right) . \end{aligned}$$
(13)

Therefore, (10) and (8) are almost equivalent, and both are implied by (9), while Example 1 has shown that (9) is a stronger condition not implied by (8). Although (13) does not show that (10) directly implies (8), once (10) is satisfied, we can use \(p^t\) to find a point \(\bar{p}^t_\tau \) satisfying (8).
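The chain of inequalities in (12) can be checked numerically. The sketch below (our construction; the min-norm subgradient of (4) has a closed form for the \(\ell _1\) regularizer) builds a small instance of (2) and evaluates all three quantities:

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def Q(p, g, H, x, lam):
    # subproblem objective (3) with Psi = lam * ||.||_1
    return g @ p + 0.5 * p @ (H @ p) + lam * (np.abs(x + p).sum() - np.abs(x).sum())

def min_norm_subgrad(p, g, H, x, lam):
    # r of (4): gradient of the smooth part plus the minimum-norm
    # element of lam * d||x + p||_1 (closed form for the l1 norm)
    v = g + H @ p
    return np.where(x + p != 0, v + lam * np.sign(x + p), soft_threshold(v, lam))

# a small strongly convex instance of (2)-(3)
x, g, lam = np.array([0.5, -0.2]), np.array([1.0, -2.0]), 1.0
H = np.array([[2.0, 0.3], [0.3, 1.0]])
m, M = 0.9, 2.1                  # valid bounds (7) for this H
tau = 1.0 / M

p = np.array([0.1, 0.3])         # an arbitrary inexact solution
p_bar = soft_threshold((x + p) - tau * (g + H @ p), tau * lam) - x
G = np.linalg.norm(p - p_bar)    # the measure in (10)-(11)
r = min_norm_subgrad(p, g, H, x, lam)

p_star = np.zeros(2)             # solve (2) to high accuracy by proximal gradient
for _ in range(5000):
    p_star = soft_threshold((x + p_star) - tau * (g + H @ p_star), tau * lam) - x
Q_star = Q(p_star, g, H, x, lam)
```

On this instance, \(\Vert r\Vert ^2 \ge 2m(Q(p) - Q^*) \ge (m/M) G^2\) as (12) asserts.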

A central focus of this work is manifold identification, so we first formally define manifolds, following [49]. A set \(\mathcal {M}\subset \mathbb {R}^n\) is a p-dimensional \(\mathcal {C}^k\) manifold around \(x \in \mathbb {R}^n\) if there is a \(\mathcal {C}^k\) function \(\varPhi : \mathbb {R}^n \rightarrow \mathbb {R}^{n-p}\) whose derivative at x is surjective such that for y close enough to x, \(y \in \mathcal {M}\) if and only if \(\varPhi (y) = 0\). Through the implicit function theorem, we can also use a \(\mathcal {C}^k\) parameterization \(\phi : \mathbb {R}^p \rightarrow \mathcal {M}\), with \(\phi (y) = x\) and the derivative injective at y, to describe a neighborhood of x on \(\mathcal {M}\). We are now ready for the definition of partial smoothness [26], which we assume for the regularizer when discussing manifold identification.

Definition 1

(Partly smooth) A convex function \(\varPsi \) is partly smooth at a point \(x^*\) relative to a set \(\mathcal {M}\) containing \(x^*\) if \(\partial \varPsi (x^*) \ne \emptyset \) and:

  1. 1.

    Around \(x^*\), \(\mathcal {M}\) is a \(\mathcal {C}^2\)-manifold and \(\varPsi |_{\mathcal {M}}\) is \(\mathcal {C}^2\).

  2. 2.

The affine span of \(\partial \varPsi (x^*)\) is a translate of the normal space to \(\mathcal {M}\) at \(x^*\).

  3. 3.

    \(\partial \varPsi \) is continuous at \(x^*\) relative to \(\mathcal {M}\).

Intuitively, this means \(\varPsi \) is smooth around \(x^*\) within \(\mathcal {M}\) but changes drastically along directions leaving the manifold. We also call this \(\mathcal {M}\) the active manifold.
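As a concrete check of Definition 1 (our worked example, continuing Example 1), take \(\varPsi = \Vert \cdot \Vert _1\) and \(x^* = (2,0)\), with

$$\begin{aligned} \mathcal {M}= \{x \mid x_2 = 0\},\qquad \varPsi |_{\mathcal {M}}(x) = |x_1|,\qquad \partial \varPsi (x^*) = \{1\} \times [-1,1]. \end{aligned}$$

Near \(x^*\), \(\varPsi |_{\mathcal {M}}(x) = x_1\) is \(\mathcal {C}^2\) since \(x_1 > 0\) there; the affine span of \(\partial \varPsi (x^*)\) is \((1,0) + \{0\} \times \mathbb {R}\), a translate of the normal space to \(\mathcal {M}\) at \(x^*\); and \(\partial \varPsi \) varies continuously along \(\mathcal {M}\) near \(x^*\). Moving off \(\mathcal {M}\), however, changes the term \(|x_2|\) non-smoothly.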

As the original identification results in [16, Theorem 5.3] and [27, Theorem 4.10] require the sum F to be partly smooth but our setting does not require so for f, we first provide a result relaxing the conditions to ensure identification in our scenario.

Lemma 1

Consider (1) with \(f\in \mathcal {C}^1\) and \(\varPsi \) convex, proper, closed, and partly smooth at a point \(x^*\) relative to a \(\mathcal {C}^2\)-manifold \(\mathcal {M}\). If at \(x^*\) the nondegeneracy condition

$$\begin{aligned} 0 \in \textrm{relint}\left( \partial F(x^*)\right) = \nabla f(x^*) + \textrm{relint}\left( \partial \varPsi (x^*)\right) \end{aligned}$$
(14)

holds, and there is a sequence \(\{x^t\}\) converging to \(x^*\) with \(F(x^t) \rightarrow F(x^*)\), then

$$\begin{aligned} \textrm{dist}\left( 0, \partial F(x^t) \right) \rightarrow 0 \quad \Leftrightarrow \quad x^t \in \mathcal {M}\text { for all t large}. \end{aligned}$$

Proof

We first observe that as \(f \in \mathcal {C}^1\), \(x^t \rightarrow x^*\) implies \(f(x^t) \rightarrow f(x^*)\), whose combination with \(F(x^t) \rightarrow F(x^*)\) further implies \(\varPsi (x^t) \rightarrow \varPsi (x^*)\). Moreover, convex functions are prox-regular everywhere. Therefore, the premises of [27, Theorem 4.10] on \(\varPsi \) are satisfied. We then note that

$$\begin{aligned} \textrm{dist}\left( 0, \partial F(x^t)\right) \rightarrow 0 \quad \Leftrightarrow \quad \textrm{dist}\left( -\nabla f(x^t), \partial \varPsi (x^t)\right) \rightarrow 0. \end{aligned}$$
(15)

Again from that \(f \in \mathcal {C}^1\), \(x^t \rightarrow x^*\) implies \(\nabla f(x^t) \rightarrow \nabla f(x^*)\), so by (14), (15) is further equivalent to

$$\begin{aligned} \textrm{dist}\left( -\nabla f(x^*), \partial \varPsi (x^t)\right) \rightarrow 0, \end{aligned}$$

which is the necessary and sufficient condition for \(x^t \in \mathcal {M}\) for all t large in [27, Theorem 4.10], because (14) indicates that \(- \nabla f(x^*) \in \textrm{relint}(\partial \varPsi (x^*))\). We then apply that theorem to obtain the desired result.

\(\square \)

Using Lemma 1, we further state an identification result for (1) under our setting without the need to check whether \(\{F(x^t)\}\) converges to \(F(x^*)\). This will be useful in our later theoretical development.

Lemma 2

Consider (1) with \(f \in \mathcal {C}^1\) and \(\varPsi \in \varGamma _0\). If \(\varPsi \) is partly smooth relative to a manifold \(\mathcal {M}\) at a point \(x^*\) satisfying (14), and there is a sequence \(\{x^t\}\) converging to \(x^*\), then we have

$$\begin{aligned} \textrm{dist}\left( 0, \partial F(x^t) \right) \rightarrow 0 \quad \Rightarrow \quad x^t \in \mathcal {M}\text { for all t large}. \end{aligned}$$

Proof

As \(\varPsi \in \varGamma _0\), it is subdifferentially continuous at \(x^* \in \text {dom}(\varPsi )\) by [44, Example 13.30]. Thus, \(\textrm{dist}(0, \partial F(x^t)) \rightarrow 0 \in \partial F(x^*)\) and \(x^t \rightarrow x^*\) imply that \(F(x^t) \rightarrow F(x^*)\). The desired result is then obtained by applying Lemma 1. \(\square \)

We note that the requirement of the above two lemmas is partial smoothness of \(\varPsi \), instead of F, at \(x^*\). Therefore, it is possible that \(F\mid _{\mathcal {M}}\) is not \(\mathcal {C}^2\), as we only require \(\varPsi \mid _{\mathcal {M}} \in \mathcal {C}^2\) and f being L-smooth.

3 Manifold identification of ISQA

Our first major result is the manifold identification property of Algorithm 1. We start by showing that the strong condition (9) with \(\epsilon _t \downarrow 0\) is sufficient.

Theorem 1

Consider a point \(x^*\) satisfying (14) with \(\varPsi \in \varGamma _0\) partly smooth at \(x^*\) relative to some manifold \(\mathcal {M}\). Assume f is locally L-smooth for \(L > 0\) around \(x^*\). If Algorithm 1 is run with the condition (9) and (7) holds, then there exist \(\epsilon , \delta > 0\) such that \(\Vert {x^t} - x^* \Vert \le \delta , \epsilon _t \le \epsilon \), and \(\alpha _t = 1\) imply \(x^{t+1} \in \mathcal {M}\).

Proof

Since each iteration of Algorithm 1 is independent of the previous ones, we abuse the notation to let \(x^t\) be the input of Algorithm 1 at the tth iteration and \(p^t\) the corresponding inexact solution to (2), while \(x^{t+1}\) is taken independently of \(p^t\) and \(\alpha _t\). Assume for contradiction that the statement is false. Then there exist a sequence \(\{x^t\}\subset \mathcal {E}\) converging to \(x^*\), a nonnegative sequence \(\{\epsilon _t\}\) converging to 0, a sequence \(\{H_t\} \subset {\mathcal {S}_{++}}\) satisfying (7), and a sequence \(\{p^t\}\subset \mathcal {E}\) such that the \(r^t\) of (4) satisfies (9) for all t, yet \(x^t + p^t \notin \mathcal {M}\) for all t. From (4),

$$\begin{aligned} r^t - \nabla f\left( x^t \right) - H_t p^t \in \partial \varPsi \left( x^{t} + p^t \right) , \quad \forall t \ge 0. \end{aligned}$$
(16)

Therefore, we get from (16) that

$$\begin{aligned} \textrm{dist}\left( 0, \partial F\left( x^t + p^t \right) \right)&= \textrm{dist}\left( -\nabla f\left( x^t + p^t \right) , \partial \varPsi \left( x^t + p^t \right) \right) \nonumber \\&\le {\left\| {-\nabla f\left( x^t + p^t\right) - \left( r^{t} - \nabla f\left( x^{t} \right) - H_{t} p^t \right) }\right\| }\nonumber \\&\le {\left\| {\nabla f\left( x^t + p^t\right) - \nabla f\left( x^{t} \right) }\right\| } + {\left\| {r^{t}}\right\| } + {\left\| {H_{t}}\right\| } {\left\| {p^t}\right\| }\nonumber \\&\le \left( L + M \right) {\left\| {p^t}\right\| } + {\left\| {r^{t}}\right\| }. \end{aligned}$$
(17)

From (7) and the convexity of \(\varPsi \), \(Q_{H_t}(\cdot ;x^t)\) is m-strongly convex, which implies

$$\begin{aligned} Q_{H_t}\left( p;x^t \right) - Q_{H_t}\left( p^{t*};x^t \right) \ge \frac{m}{2} {\left\| {p - p^{t*}}\right\| }^2, \quad \forall p \in \mathcal {E}. \end{aligned}$$
(18)

Combining (18) and (12) shows that

$$\begin{aligned} m^{-1} \Vert r^t\Vert + \Vert p^{t*}\Vert \ge \Vert p^t\Vert . \end{aligned}$$
(19)

Since \(\{x^t\}\) converges to \(x^*\), by the argument in [50, Lemma 3.2], we get

$$\begin{aligned} \Vert p^{t*}\Vert = O(\Vert x^t - x^*\Vert ) \end{aligned}$$
(20)

whenever \(x^t\) is close enough to \(x^*\). Thus (19), (20), (9), and that \(\epsilon _t \downarrow 0\) imply

$$\begin{aligned} \lim _{t \rightarrow \infty } \Vert p^t\Vert = 0. \end{aligned}$$
(21)

Substituting (9) and (21) into (17) gives \(\textrm{dist}(0, \partial F(x^t + p^t)) \rightarrow 0\), and by (21) we also get \( x^t + p^t \rightarrow x^* + 0 = x^*\). Therefore, Lemma 2 implies that \(x^{t} + p^{t} \in \mathcal {M}\) for all t large enough, giving the desired contradiction. \(\square \)
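As a quick numerical illustration of identification (our construction, not an algorithm from this paper): running proximal gradient, i.e., Algorithm 1 with \(H_t\) a fixed multiple of the identity and (2) solved exactly, on the problem of Example 1 lands the iterates on \(\mathcal {M}= \{x \mid x_2 = 0\}\) after three iterations and keeps them there.

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

# Example 1's problem: f(x) = (x1 - 2.5)^2 + (x2 - 0.3)^2, Psi = ||.||_1.
# Proximal gradient with step 0.25 solves (2) exactly with H_t = 4 I.
grad_f = lambda x: np.array([2 * (x[0] - 2.5), 2 * (x[1] - 0.3)])
x, step = np.array([3.0, 1.0]), 0.25
on_manifold = []
for t in range(30):
    x = soft_threshold(x - step * grad_f(x), step)
    on_manifold.append(x[1] == 0.0)    # has the iterate reached x_2 = 0?
```

Here identification happens in finitely many steps because the exact proximal step zeroes \(x_2\) once the iterate is close enough to \(x^*\), in line with the exact-solve assumption under which identification is known to hold.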

Theorem 1 shows that if a variant of PN or PQN needs \(\Vert r^t\Vert \downarrow 0\) to achieve superlinear convergence, \(\mathcal {M}\) will be identified along the way, so one can reduce the running time by switching to smooth optimization, which can be conducted more efficiently while retaining the same superlinear convergence. Moreover, although Theorem 1 shows that (9) is sufficient for identifying the active manifold, this condition might never be satisfied, as Example 1 showed. We therefore provide another sufficient condition for ISQA to identify the active manifold that is satisfied by most widely-used solvers for (2), showing that ISQA essentially possesses the manifold identification property. This result uses the condition (8), which is weaker than (9), and we follow [21] to define \(\{\epsilon _t\}\) using a given \(\eta \in [0,1)\):

$$\begin{aligned} \epsilon _t = \eta \left( Q_t(0) - Q_t^* \right) = -\eta Q_t^*. \end{aligned}$$
(22)

As argued in [21] and practically adopted in various implementations including [13, 20, 22], it is easy to ensure that (8) with (22) holds for some \(\eta <1\) under (7) (although the explicit value might be unknown) if we apply a linearly convergent subproblem solver with at least a pre-specified number of iterations to (2). In the following result, we introduce an operator \(\varLambda _t\in {\mathcal {S}_{++}}\) for computing generalized proximal steps in the subproblem, and require that the eigenvalues of \(\varLambda _t\) be upper-bounded by some value fixed over all t. This operator generalizes the case of taking a multiple of the identity in the quadratic term of a proximal problem (applied in proximal gradient as the step size), and can be treated as a sort of preconditioner that allows for more possibilities of subproblem solvers. After the proof of the following theorem, we will also see examples of existing algorithms that fit into this framework by specifying different choices of \(\varLambda _t\).

Theorem 2

Consider the setting of Theorem 1. If Algorithm 1 is run with (8) and (22) for some \(\eta \in [0,1)\), (7) holds, and the update direction \(p^{t}\) satisfies

$$\begin{aligned} x^t + p^{t} = \textrm{prox}^{\varLambda _{t}}_{\varPsi }\left( y^{t} - \varLambda _{t}^{-1} \left( \nabla f\left( x^t \right) + H_t \left( y^{t} - x^t \right) + s^{t} \right) \right) , \end{aligned}$$
(23)

where \(s^{t}\) satisfies \(\Vert s^{t}\Vert \le R\left( {\left\| {y^{t} - (x^t + p^{t*})}\right\| } \right) \) for some \(R:[0,\infty ) \rightarrow [0,\infty )\) continuous in its domain with \(R(0) = 0\), \(\varLambda _{t} \in {\mathcal {S}_{++}}\) with \(M_1 \succeq \varLambda _{t}\) for \(M_1 > 0\), and \(y^{t}\) satisfies

$$\begin{aligned} {\left\| {\left( y^{t} - x^t\right) - p^{t*}}\right\| }&\le \eta _1 \left( Q_t(0) - Q_t^*\right) ^\nu \end{aligned}$$
(24)

for some \(\nu > 0\) and \(\eta _1 \ge 0\), then there exist \(\epsilon , \delta > 0\) such that \(\Vert x^t - x^*\Vert \le \epsilon \), \(|Q_t^*| \le \delta \), and \(\alpha _t = 1\) imply \(x^{t+1} \in \mathcal {M}\).

Proof

Suppose for contradiction that the statement is not true. Then there exist a continuous function \(R: [0, \infty ) \rightarrow [0, \infty )\) with \(R(0) = 0\), \(\eta _1 \ge 0\), \(M_1 > 0\), a sequence \(\{x^t\} \subset \mathcal {E}\) converging to \(x^*\), a sequence \(\{H_t\} \subset {\mathcal {S}_{++}}\) satisfying (7) and

$$\begin{aligned} \lim _{t \rightarrow \infty }\, \min _{p}\quad Q_{H_t}\left( p;x^t \right) = 0, \end{aligned}$$
(25)

three sequences \(\{p^t\}, \{y^t\}, \{s^t\} \subset \mathcal {E}\), and a sequence \(\{\varLambda _t\}\subset {\mathcal {S}_{++}}\) with \(M_1 \succeq \varLambda _t\) such that (8) with (22) and (23)–(24) hold, yet \(x^t + p^t \notin \mathcal {M}\) for all t. We abuse the notation to let \(p^{t*}\) and \(Q^*_t\) respectively denote the optimal solution and objective value of \(\min _p Q_{H_t}(p;x^t)\), while \(x^{t+1}\) is taken independently of \(p^t\) and \(\alpha _t\).

The optimality condition of (5) applied to (23) indicates that

$$\begin{aligned} -\varLambda _{t}\left( x^{t} - y^{t}\right) - \left( \nabla f\left( x^t \right) + H_t \left( y^{t} - x^t \right) + s^{t} \right) \in \partial \varPsi \left( x^{t} + p^t \right) . \end{aligned}$$
(26)

Thus, we have

$$\begin{aligned}&\textrm{dist}\left( 0, \partial F\left( x^t + p^t \right) \right) \nonumber \\&\quad \le \left\| \nabla f\left( x^t + p^t \right) -\varLambda _{t}\left( x^t + p^t - y^{t}\right) - \nabla f\left( x^{t} \right) \right. \left. -H_{t} \left( y^{t} - x^{t} \right) - s^{t} \right\| \nonumber \\&\quad \le {\left\| {\nabla f\left( x^t + p^t \right) - \nabla f\left( x^{t} \right) }\right\| } + M_1 \left( {\left\| {x^t - y^{t}}\right\| } + {\left\| {p^t}\right\| }\right) + M {\left\| {y^{t} - x^{t}}\right\| } + {\left\| {s^{t}}\right\| }\nonumber \\&\quad \le ~(L + M_1) {\left\| {p^t}\right\| } + (M + M_1) {\left\| {y^{t} - x^{t}}\right\| } + {\left\| {s^{t}}\right\| }. \end{aligned}$$
(27)

For the first two terms in (27), the triangle inequality and (18) imply

$$\begin{aligned}{} & {} {\left\| {p^{t}}\right\| } \le {\left\| {p^{t} - p^{t*}}\right\| } + {\left\| {p^{t*}}\right\| } \le \sqrt{2m^{-1}}\sqrt{ Q_{t}\left( p^{t} \right) - Q^*_{t} } + \sqrt{2m^{-1}}\sqrt{ 0 - Q^*_{t}}\nonumber \\{} & {} \quad {\mathop {\le }\limits ^{(22)}}~ \left( \sqrt{\eta } + 1 \right) \sqrt{2m^{-1}}\sqrt{ - Q^*_{t}}, \text { and } \end{aligned}$$
(28)
$$\begin{aligned}{} & {} {\left\| {y^{t}- x^{t}}\right\| } \le {\left\| {\left( y^{t}- x^{t}\right) - p^{t*}}\right\| } + {\left\| {p^{t*}}\right\| } {\mathop {\le }\limits ^{(24)}} \eta _1 \left( - Q^*_{t}\right) ^\nu + \sqrt{2m^{-1}}\sqrt{ - Q^*_{t}}. \end{aligned}$$
(29)

For the last term in (27), we have from our definition of \(s^t\) that

$$\begin{aligned} {\left\| {s^{t}}\right\| }&\le R \left( {\left\| {y^{t} - \left( x^{t} + p^{t*}\right) }\right\| }\right) . \end{aligned}$$
(30)

By substituting (28)–(30) back into (27), clearly there are \(C_1, C_2 > 0\) such that

$$\begin{aligned} \textrm{dist}\left( 0, \partial F\left( x^t + p^t \right) \right) \le C_1 \sqrt{ - Q^*_t} + C_2 \left( -Q^*_t \right) ^\nu {+ R \left( {\left\| {y^{t} - \left( x^{t} + p^{t*}\right) }\right\| }\right) .} \end{aligned}$$
(31)

Note that \(Q_{H_t}(0;x^t) \equiv 0\), so \(-Q^*_t \ge 0\) from its optimality and the right-hand side of (31) is well-defined. Next, we see from (25), (24), and the continuity of R that

$$\begin{aligned} \lim _{t \rightarrow \infty } {\left\| {\left( y^{t} - x^t\right) - p^{t*}}\right\| } = 0, \quad \Rightarrow \quad \lim _{t \rightarrow \infty } R\left( {\left\| {\left( y^{t} - x^t\right) - p^{t*}}\right\| }\right) = 0. \end{aligned}$$
(32)

Applying (25) and (32) to (31) and letting t approach infinity then yield

$$\begin{aligned} \lim _{t \rightarrow \infty } \textrm{dist}\left( 0, \partial F\left( x^t + p^t \right) \right) = 0. \end{aligned}$$
(33)

Next, from (28) and (25), it is also clear that \({\left\| {p^t}\right\| } \rightarrow 0\), so from the convergence of \(x^t\) to \(x^*\) we have

$$\begin{aligned} x^t + p^t \rightarrow x^* + 0 = x^*. \end{aligned}$$
(34)

Now (34) and (33) allow us to apply Lemma 2 so \(x^t + p^t \in \mathcal {M}\) for all t large enough, leading to the desired contradiction. \(\square \)

The function R can be seen as a general residual function: all we need from it is that \(s^t\) approaches 0 together with \({\left\| {y^{t} - (x^t + p^{t*}{)}}\right\| }\), so Theorem 2 applies as long as such an R can be shown to exist, even if its exact form is unknown. Condition (24) is deliberately chosen to exclude the objective \(Q_t(y^{t} - x^t)\) so that broader algorithmic choices, such as those with \(y^{t} \notin \text {dom}(\varPsi )\), can be included.

One concern for Theorem 2 is the requirement \(|Q_t^*| \le \delta \). Fortunately, for Algorithm 1 with (8) and (22), if the \(\alpha _t\) are lower-bounded by some \(\bar{\alpha }> 0\) (which is true under (7) by [21, Corollary 1]) and F is lower-bounded, then (6) together with (8) and (22) shows that \(-Q_t^*\) is summable and thus converges to 0.

We now provide several examples satisfying (23)–(24) to demonstrate the usefulness of Theorem 2. In our description below, \(p^{t,i}\) denotes the ith iterate of the subproblem solver at the tth outer iteration and \(x^{t,i}{:}{=}x^t + p^{t,i}\).

  • Using \(x^{t+1} = x^t + \bar{p}^t_\tau \) from (11): Assume we have a tentative \(p^t\) that satisfies (10) for some \(\hat{\epsilon }_t\), and we use (11) to generate \(\bar{p}^t_\tau \) as the output satisfying (8) for a corresponding \(\epsilon _t\) calculated by (13). We see that it is of the form (23) with \(s^{t} = 0\), \(y^{t} = x^t + p^t\), and \(\varLambda _{t} = I / \tau \). From (13) we know that (8) is satisfied, while [12, Corollary 3.6] and (12) guarantee (24) with \(\nu = 1/2\).

  • Proximal gradient (PG): These methods generate the inner iterates by

    $$\begin{aligned} x^{t,0} = x^t, \, x^{t,i+1} = \textrm{prox}^{\lambda _{t,i}I}_{\varPsi }\left( x^{t,i} - \lambda _{t,i}^{-1} \left( \nabla f\left( x^t \right) + H_t \left( x^{t,i} - x^t \right) \right) \right) , \forall i \ge 0, \end{aligned}$$

    for some \(\{\lambda _{t,i}\}\) bounded in a positive range that guarantees \(\{Q_t(p^{t,i})\}_{i}\) is a decreasing sequence for all t (decided through pre-specification, line search, or other mechanisms). Therefore, for any t, no matter which i is the last inner iteration, (23) is satisfied with \(y^{t} = x^{t,i-1}, s^{t} = 0, \varLambda _{t} = \lambda _{t,i-1} I\). The condition (22) holds for some \(\eta < 1\) because proximal-gradient-type methods are descent methods with Q-linear convergence on strongly convex problems, and (24) holds with \(\eta _1 = \sqrt{2 \eta / m}\) and \(\nu = 1/2\) from (18).

  • Accelerated proximal gradient (APG): The iterates are generated by

    $$\begin{aligned}{} & {} {\left\{ \begin{array}{ll} y^{t,1} &{}= x^{t,0} = x^t,\\ y^{t,i} &{}= x^{t,i-1} + \left( 1 - \frac{2}{\sqrt{\kappa (H_t)} + 1} \right) \left( x^{t,i-1} - x^{t,i-2}\right) , \forall i > 1, \end{array}\right. } \end{aligned}$$
    (35)
    $$\begin{aligned}{} & {} x^{t,i} = \textrm{prox}^{{\left\| {H_t}\right\| } I}_{\varPsi }\left( y^{t,i} - {\left\| {H_t}\right\| }^{-1} \left( \nabla f\left( x^t \right) + H_t \left( y^{t,i} - x^t \right) \right) \right) , \forall i \ge 1,\nonumber \\ \end{aligned}$$
    (36)

    where \(\kappa (H_t) \ge 1\) is the condition number of \(H_t\). APG satisfies [39]:

    $$\begin{aligned} Q_t\left( x^{t,i} - x^t \right) - Q_t^* \le -2\left( 1 - \kappa (H_t)^{-1/2} \right) ^i Q_t^*,\quad \forall i > 0, \quad \forall t \ge 0. \end{aligned}$$
    (37)

    Since \(\kappa (H_t) \ge 1\), \(p^{t,i}\) satisfies (8) with (22) for all \(i \ge \sqrt{\kappa (H_t)} \ln (2/\eta )\). If such a \(p^{t,i}\) is output as our \(p^t\), we see from (36) that (23) holds with \(s^t = 0\) and \(\varLambda _{t} = \Vert H_t\Vert I\).

    The only condition to check is hence whether \(y^{t,i}\) satisfies (24). The case of \(i=1\) holds trivially with \(\eta _1 = \sqrt{2/m}\). For \(i > 1\), (24) holds with \(\eta _1 = 3\sqrt{2 \eta m^{-1}}\) and \(\nu = 1/2\) because

    $$\begin{aligned}&~{\left\| {y^{t,i} - \left( x^t + p^{t*} \right) }\right\| }\\&\quad {\mathop {\le }\limits ^{(35)}} ~ \left( 1 - \frac{2}{\sqrt{\kappa (H_t)} + 1} \right) {\left\| { x^{t,i-1} - x^{t,i-2}}\right\| }+ {\left\| {x^{t,i-1} - \left( x^t + p^{t*} \right) }\right\| } \\&\quad {\mathop {\le }\limits ^{(18),(37)}}~ \left( {\left\| {x^{t,i-1} - x^t - p^{t*} }\right\| } + {\left\| {x^{t,i-2} - x^t - p^{t*}}\right\| }\right) + \sqrt{2\eta m^{-1}} \sqrt{-Q_t^*} \\&\quad \le ~ 3 \sqrt{2\eta m^{-1}} \sqrt{-Q_t^*}. \end{aligned}$$
  • Prox-SAGA/SVRG: These methods update the iterates by

    $$\begin{aligned} x^{t,i} = \textrm{prox}^{\lambda _t I}_{\varPsi }\left( x^{t,i-1} - \lambda _t^{-1} \left( \nabla f\left( x^t \right) + H_t \left( x^{t,i-1} - x^t \right) + s^{t,i} \right) \right) , \end{aligned}$$

    with \(x^{t,0} = x^t\), \(\{\lambda _t\}\) bounded in a positive range, and \(\{s^{t,i}\}\) are random variables converging to 0 as \(x^{t,i} - x^t\) approaches \(p^{t*}\). (For a detailed description, see, for example, [43].) It is shown in [52] that for prox-SVRG, \(Q_t(x^{t,i} - x^t) - Q_t^*\) converges linearly to 0 with respect to i if \(\lambda _t\) is small enough, so (8) with (22) is satisfied. A similar but more involved bound for prox-SAGA can be derived from the results in [11]. When \(p^{t,i} = x^{t,i} - x^t\) for some \(i > 0\) is output as \(p^t\), we get \(y^{t} = x^{t,i-1}\), \(\varLambda _{t} = \lambda _t I\), and \(s^t = s^{t,i}\) in (23), so the requirements of Theorem 2 hold.
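To make the PG and APG examples above concrete, the following sketch (ours, not part of the paper) implements both inner solvers for the special case \(\varPsi = \lambda {\left\| {\cdot }\right\| }_1\), where the proximal operation reduces to soft-thresholding; all function names and the step-size choices shown are our own illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, tau):
    # proximal operator of tau * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def Q(p, grad, H, x, lam):
    # subproblem objective (3) specialized to Psi = lam * ||.||_1
    return (grad @ p + 0.5 * p @ H @ p
            + lam * (np.abs(x + p).sum() - np.abs(x).sum()))

def pg_inner(grad, H, x, lam, n_iter=200):
    # proximal-gradient inner iterations with the fixed step lambda_{t,i} = ||H_t||
    step = np.linalg.norm(H, 2)
    xi = x.copy()
    for _ in range(n_iter):
        g_i = grad + H @ (xi - x)
        xi = soft_threshold(xi - g_i / step, lam / step)
    return xi - x  # the update direction p^t

def apg_inner(grad, H, x, lam, n_iter=200):
    # accelerated variant (35)-(36): constant momentum 1 - 2/(sqrt(kappa)+1)
    L = np.linalg.norm(H, 2)
    kappa = L / np.linalg.eigvalsh(H).min()
    beta = 1.0 - 2.0 / (np.sqrt(kappa) + 1.0)
    x_prev, x_cur = x.copy(), x.copy()
    for i in range(1, n_iter + 1):
        y = x_cur + (beta * (x_cur - x_prev) if i > 1 else 0.0)
        g_y = grad + H @ (y - x)
        x_prev, x_cur = x_cur, soft_threshold(y - g_y / L, lam / L)
    return x_cur - x
```

On a small separable instance, both solvers drive \(Q\) below zero and toward the subproblem optimum, consistent with the linear rates cited above.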

If \(\varPsi \) is block-separable and \(\mathcal {M}\) is decomposable into a product of submanifolds conforming with the blocks of \(\varPsi \), (23) can easily be modified to suit block-wise solvers such as block-coordinate descent. This extension simply adapts the analysis above to the block-wise setting, so the proof is straightforward and omitted.

3.1 Iterate convergence

Theorems 1 and 2 both indicate that for Algorithm 1 to identify \(\mathcal {M}\), we need \(x^t\) (or at least a subsequence) to converge to a point \(x^*\) of partial smoothness. We thus complement our analysis by showing convergence of the iterates under convexity of f and a local sharpness condition, a special case of the KL condition that holds widely in real-world problems, without any additional requirement on the algorithm such as the relative-error condition in [2, 4, 5]. In particular, we assume that F satisfies the following for some \(\zeta , \xi > 0\) and \(\theta \in (0,1]\):

$$\begin{aligned} \zeta \textrm{dist}(x,\varOmega ) \le \left( F(x) - F^*\right) ^{\theta },\quad \forall x \in \text {Lev}\left( {\xi }\right) . \end{aligned}$$
(38)

This becomes the well-known quadratic growth condition when \(\theta = 1/2\), and \(\theta = 1\) corresponds to weak sharp minima [6]. As discussed in Sect. 1.1, under convexity of f, [54] showed that the iterates of their PN variant converge to some \(x^* \in \varOmega \) if F satisfies EB, which is equivalent to the quadratic growth condition in their setting [12]. Our analysis allows broader choices of \(H_t\), and (strong) iterate convergence is proven for \(\theta \in (1/4,1]\).
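As a quick numerical illustration of (38) (our own toy examples, not from the paper): \(F(x) = x^2\) satisfies it with \(\theta = 1/2\) (quadratic growth) and \(F(x) = |x|\) with \(\theta = 1\) (weak sharp minima), both with \(\varOmega = \{0\}\), \(F^* = 0\), and \(\zeta = 1\).

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 2001)  # sample points in a level set around {0}

# Quadratic growth (theta = 1/2): zeta * dist(x, {0}) = |x| and (x^2)^{1/2} = |x|.
assert np.all(np.abs(xs) <= np.sqrt(xs ** 2) + 1e-12)

# Weak sharp minima (theta = 1): (|x| - 0)^1 = |x|.
assert np.all(np.abs(xs) <= np.abs(xs) + 1e-12)
```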

Theorem 3

Consider Algorithm 1 with any \(x^0 \in \mathcal {E}\). Assume that \(\varOmega \ne \emptyset \), f is convex and L-smooth for \(L > 0\), \(\varPsi \in \varGamma _0\), (7) holds, there is \(\eta \in [0,1)\) such that \(p^t\) satisfies (8) with (22) for all t, and that (38) holds for \(\xi , \zeta > 0\) and \(\theta \in (1/4,1]\). Then \(x^t \rightarrow x^*\) for some \(x^* \in \varOmega \).

The convergence in Theorem 3 holds in infinite-dimensional real Hilbert spaces with strong convergence (which coincides with weak convergence in the finite-dimensional case), and the proof in “Appendix B” is written in this general setting. The key to its proof is the following improved convergence rate, which might be of independent interest. Except that the case \(\theta = 1/2\) has been proven by Peng et al. [41] and that \(\theta = 0\) reduces to the general convex case analyzed in [21], this faster convergence rate is, to our knowledge, new for ISQA.

Theorem 4

Consider the settings of Theorem 3 but with \(\theta \in [0,1]\), \(\mathcal {E}\) a real Hilbert space and \(x^0 \in \text {Lev}\left( {\xi }\right) \). Then there is \(\bar{\alpha } > 0\) such that \(\alpha _t \ge \bar{\alpha }\) for all t and the following hold.

  1. 1.

    For \(\theta \in (1/2,1]\): When \(\delta _t\) is large enough such that

    $$\begin{aligned} \delta _t > \left( \zeta ^2M^{-1} \right) ^{\frac{1}{2\theta - 1}}, \end{aligned}$$
    (39)

    we have

    $$\begin{aligned} \delta _{t+1}&\le \delta _t \left( 1 - \frac{(1 - \eta )\alpha _t \gamma \zeta ^2 \xi ^{1 - 2\theta }}{2M} \right) . \end{aligned}$$
    (40)

    Next, let \(t_0\) be the first index failing (39), then for all \(t \ge t_0\) we have

    $$\begin{aligned} \delta _{t+1} \le \delta _t \left( 1 - \frac{(1 - \eta ) \alpha _t \gamma }{2} \right) . \end{aligned}$$
    (41)
  2. 2.

    For \(\theta = 1/2\), we have global Q-linear convergence of \(\delta _t\) in the form

    $$\begin{aligned} \frac{\delta _{t+1}}{\delta _t}\le 1 - \left( 1 - \eta \right) \alpha _t \gamma \cdot {\left\{ \begin{array}{ll} \frac{\zeta ^2}{2 \Vert H_t\Vert }, &{}\text { if } \zeta ^2 \le \Vert H_t\Vert ,\\ \left( 1 - \frac{\Vert H_t\Vert }{2 \zeta ^2}\right) , &{}\text { else,} \end{array}\right. } \quad \forall t \ge 0. \end{aligned}$$
    (42)
  3. 3.

    For \(\theta \in {[}0,1/2)\), (41) holds when \(\delta _t\) is large enough to satisfy (39). Letting \(t_0\) be the first index at which (39) fails, we have

    $$\begin{aligned} \frac{\delta _t}{ \delta _{t_0}} \le \left( \left( 1 - 2\theta \right) \sum _{i=t_0}^{t-1} \alpha _i \right) ^{-\frac{1}{1-2\theta }} \le \left( \left( 1 - 2\theta \right) (t - t_0)\bar{\alpha }\right) ^{-\frac{1}{1-2\theta }}, \,\forall t \ge t_0. \end{aligned}$$
    (43)

For the range of \(\theta \) in Theorem 3, convergence of the iterates generated by inexact SQA is a new result. Moreover, even when the additional conditions on the forward-backward envelope in [5] hold, their analysis, which also uses the subproblem stopping condition (8), requires a much stricter stopping tolerance of the form

$$\begin{aligned} \epsilon _t = - \tau _t Q_t^*,\quad \sum \sqrt{\tau _t} < \infty , \end{aligned}$$

which clearly requires \(\tau _t\) to converge to 0 fast enough and thus costs much more time in the subproblem solves. In contrast, we only need \(\epsilon _t\) to be a constant factor of \(Q_t^*\) in (22), so the number of inner iterations, and thus the cost of each subproblem solve, can be kept constant.
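To see this cost difference concretely, suppose (hypothetically) that the inner solver converges Q-linearly at a rate r, so reaching a tolerance \(\tau |Q_t^*|\) needs roughly \(\log \tau / \log r\) inner iterations; a constant factor keeps this bounded, while a summable \(\tau _t\) makes it grow without bound. The rate and schedules below are our own illustrative choices.

```python
import math

r = 0.5  # assumed Q-linear rate of the inner subproblem solver (illustrative)

def inner_iters(tau):
    # smallest i with r**i <= tau, i.e. iterations to reach tolerance tau * |Q*|
    return max(1, math.ceil(math.log(tau) / math.log(r)))

constant = [inner_iters(0.5) for t in range(1, 101)]           # eta = 0.5 fixed
summable = [inner_iters(1.0 / t ** 2) for t in range(2, 101)]  # tau_t = t^{-2}

assert max(constant) == 1          # constant factor: bounded inner cost
assert summable[-1] > summable[0]  # summable tau_t: growing inner cost
```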

4 An efficient inexact SQA method with superlinear convergence in running time

Now that it is clear ISQA is able to identify the active manifold, we utilize the fact that the optimization problem reduces to a smooth one after the manifold is identified to devise more efficient approaches, with safeguards to ensure that the correct manifold is really identified. The improved algorithm, ISQA \(^{+}\), is presented in Algorithm 2 and we explain the details below.

[Algorithm 2 (ISQA \(^{+}\)) is displayed as a figure here.]

ISQA \(^{+}\) has two stages, separated by the event of identifying the active manifold \(\mathcal {M}\) of a cluster point \(x^*\). Our analysis showed that iterates converging to \(x^*\) will eventually identify \(\mathcal {M}\), but since neither \(x^*\) nor \(\mathcal {M}\) is known a priori, the conjecture of identification can only be made when \(\mathcal {M}\) remains unchanged for \(S > 0\) iterations.

Most parts of the first stage are the same as Algorithm 1, although we have added specifications for the subproblem solver according to Theorem 2. The only major difference is that instead of a line search, ISQA \(^{+}\) adjusts \(H_t\) and re-solves (2) whenever (6) with \(\alpha _t = 1\) fails. This trust-region-like approach has guaranteed global convergence from [21] and ensures \(\alpha _t = 1\), so that Theorem 2 is applicable.

In the second stage, we alternate between a standard proximal gradient (PG) step (44) and a manifold optimization (MO) step. PG is equivalent to solving (2) with \(H_t = {\hat{L}} I\) to optimality, so Theorem 2 applies. When \(\mathcal {M}\) is not correctly identified, a PG step thus prevents us from getting stuck at a wrong manifold, while once the superlinear-convergence phase of the MO step is reached, using PG instead of solving (2) with a sophisticated \(H_t\) avoids redundant computation.
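For instance, with \(\varPsi = \lambda {\left\| {\cdot }\right\| }_1\) (our illustrative choice, not fixed by the paper), solving (2) with \(H_t = \hat{L} I\) to optimality reduces to a single soft-thresholded gradient step:

```python
import numpy as np

def pg_full_step(x, grad_f_x, L_hat, lam):
    # Exact solution of (2) with H_t = L_hat * I when Psi = lam * ||.||_1:
    # x + p = prox_{(lam / L_hat) ||.||_1}(x - grad_f(x) / L_hat).
    z = x - grad_f_x / L_hat
    return np.sign(z) * np.maximum(np.abs(z) - lam / L_hat, 0.0)
```

For example, with \(f(x) = \Vert x\Vert ^2/2\) (so the gradient is x itself), \(\hat{L} = 2\), and \(\lambda = 0.5\), the point \((1, -0.2)\) is mapped to \((0.25, 0)\), whose second coordinate lands exactly on the manifold of sparse points.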

When the objective is partly smooth relative to a manifold \(\mathcal {M}\), optimizing it within \(\mathcal {M}\) can be cast as a manifold optimization problem, and efficient algorithms of this type are abundant (see, for example, [1] for an overview). The difference between applying MO methods and sticking to (2) is that in the former, we can obtain the exact solution to the subproblem for generating the update direction in finite time, because the subproblem in the MO step is simply an unconstrained quadratic optimization problem whose solution can be found by solving a linear system; in the latter, computing the exact solution takes indefinitely long. The former is therefore preferred in practice for better running time. Although we did not assume that f is twice-differentiable, its generalized Hessian (denoted by \(\nabla ^2 f\)) exists everywhere since the gradient is Lipschitz-continuous [17]. As discussed in Sect. 2, we can find a \(\mathcal {C}^2\) parameterization \(\phi \) of \(\mathcal {M}\) around \(x^*\), and we use this \(\phi \) to describe a truncated semismooth Newton (TSSN) approach. Since \(\mathcal {M}\) might change between iterations, when we conduct MO at the tth iteration, we find a parameterization \(\phi _t\) of the current \(\mathcal {M}\) and a point \(y^t\) such that \(\phi _t(y^t) = x^t\); if \(\mathcal {M}\) remains fixed, we also retain the same \(\phi _t\). The TSSN step \(q^t\) is then obtained by using preconditioned CG (PCG, see for example [40, Chapter 5]) to find an approximate solution for

$$\begin{aligned} {\begin{aligned} q^t&\approx \mathop {{{\,\mathrm{\arg \,min}\,}}}\limits _{q}\quad \langle g^t,\,q\rangle + \frac{1}{2} \langle q,\,H_t q\rangle , \quad \text { or equivalently } \quad H_t q^t \approx -g^t,\\ g^t&{:}{=}\nabla F(\phi _t(y^t)), \; H_t {:}{=}\nabla ^2 F\left( \phi _t(y^t) \right) + \mu _t I, \; \mu _t {:}{=}c {\left\| {g^t}\right\| }^{\rho }, \end{aligned}} \end{aligned}$$
(45)

that satisfies

$$\begin{aligned} {\left\| {H_t q^t + g^t}\right\| } \le 0.1 \min \left\{ {\left\| {g^t}\right\| }, {\left\| {g^t}\right\| }^{1 + \rho } \right\} \end{aligned}$$
(46)

with pre-specified \(c > 0\) and \(\rho \in (0,1]\). We then run a backtracking line search to find a suitable step size \(\alpha _t > 0\). To achieve superlinear convergence, we should accept the unit step size whenever possible, so we only require the objective not to increase. If \(q^t\) is not a descent direction or \(\alpha _t\) is too small, we consider the MO step failed and return to the first stage. If \(\alpha _t < 1\), the superlinear-convergence phase has not yet been entered, and likely \(\mathcal {M}\) has not been correctly identified, so we also switch back to the first stage. This procedure is summarized in Algorithm 3. When products between \(\nabla ^2 F(\phi _t(y^t))\) and arbitrary vectors, required by PCG, cannot be computed easily, one can adopt Riemannian quasi-Newton approaches like [18] instead.
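A minimal sketch of the TSSN computation (45)–(46), using plain CG rather than PCG for simplicity; the function names and the test problem are our own assumptions, not the paper's implementation.

```python
import numpy as np

def tssn_step(grad_F, hess_F, y, c=1.0, rho=0.5, max_cg=500):
    """Inexactly solve (H + mu I) q = -g by CG until the residual test (46) holds."""
    g = grad_F(y)
    mu = c * np.linalg.norm(g) ** rho               # damping mu_t = c * ||g^t||^rho
    H = hess_F(y) + mu * np.eye(len(y))
    tol = 0.1 * min(np.linalg.norm(g), np.linalg.norm(g) ** (1 + rho))
    q = np.zeros_like(g)
    r = -g - H @ q                                  # residual of H q = -g
    d = r.copy()
    for _ in range(max_cg):
        if np.linalg.norm(r) <= tol:                # stopping test (46)
            break
        Hd = H @ d
        a = (r @ r) / (d @ Hd)
        q = q + a * d
        r_new = r - a * Hd
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return q
```

Since \(H_t \succeq \mu _t I \succ 0\) whenever \(g^t \ne 0\), the CG iterations are well-defined, and on a problem of dimension n they terminate in at most n iterations in exact arithmetic, reflecting the finite-termination guarantee discussed below.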

From (45), it is clear that our earlier claim that the exact solution of the subproblem can be found in finite time holds. In particular, the second form of the subproblem indicates that the exact solution can be obtained by solving a linear system, which can be done in time at most cubic in the manifold dimension by first finding a matrix representation for \(H_t\) and then inverting it. If the subproblem has no solution, such a numerical linear algebra approach can also detect the infeasibility of the linear system within the same time cost. Similarly, Riemannian quasi-Newton approaches have the same upper bound on the time cost for computing the update direction. The PCG procedure we adopt also has such a finite-termination guarantee with the same cost bound, and it additionally guarantees that the objective value (the first form in (45)) converges to the optimum at the same R-linear rate as the accelerated gradient method. In practice, however, we often observe that PCG converges much faster than this worst-case guarantee and can satisfy (46) with a time cost significantly smaller than either applying a (proximal) first-order method to (2) or obtaining an exact solution of (45) through matrix decompositions.

The parameter S gating the entry to the second stage indicates our confidence in the currently identified manifold. If we deem the identification reliable, we can set S to a lower value so that we enter the manifold optimization stage earlier. In general, S should be set in an appropriate range so that the algorithm neither spends too much time on a wrong manifold nor, once the right manifold is found, stays in the first stage for long. In practice, more sophisticated strategies that adaptively change S might lead to better efficiency. For example, we can start with a small S and restrict the running time of the MO step to be proportional to that of S steps of the first stage. When the unit step size is not accepted in an MO step, or when the manifold changes in a proximal gradient step, indicating that the fast local superlinear convergence has not yet been achieved, we can switch back to the first stage and increase S.

[Algorithm 3 (the MO step) is displayed as a figure here.]

4.1 Global convergence

This section provides global convergence guarantees for Algorithm 2. Because MO steps do not increase the objective value, global convergence of Algorithm 2 follows from the analysis in [21] by treating (44) as solving (2) with \(H_t = \hat{L} I\) and noting that this update always satisfies (6) with \(\alpha _t = 1\). For completeness, we still state these results, and provide a proof in “Appendix C”.

First, we restate a result in [21] to bound the number of steps spent in the while-loop for enlarging \(H_t\) in Algorithm 2.

Lemma 3

([21, Lemma 4]) Given an initial choice \(H_t^0\) for \(H_t\) at the tth iteration of Algorithm 2 (so initially we start with \(H_t = H_t^0\) and modify it when (6) fails with \(\alpha _t = 1\)) and a parameter \(\beta \in (0,1)\), consider the following two variants for enlarging \(H_t\), starting with \(\sigma = 1\):

$$\begin{aligned} \sigma \leftarrow \beta \sigma , \quad H_t \leftarrow H_t^0 / \sigma , \end{aligned}$$
(Variant 1)
$$\begin{aligned} H_t \leftarrow H_t^0 + \sigma ^{-1} I,\quad \sigma \leftarrow \beta \sigma {.} \end{aligned}$$
(Variant 2)

We then have the following bounds if the approximate solution to (2) always satisfies \(Q_{H_t}(p^t;x^t) \le 0\).

  1. 1.

    If \(H_t^0\) satisfies \(M_0^t \succeq H_t^0 \succeq m_0^t\) for some \(M_0^t \ge m_0^t > 0\), then the final \(H_t\) from (Variant 1) satisfies \(\Vert H_t\Vert \le M \max \left\{ 1, L / (\beta m)\right\} \), and the while-loop terminates within \(\lceil \log _{\beta ^{-1}} L/m \rceil \) rounds.

  2. 2.

    If \(H_t^0 \succeq 0\), then the final \(H_t\) from (Variant 2) satisfies \({\left\| {H_t}\right\| } \le M + \max \left\{ 1, L / \beta \right\} \) and the while-loop terminates within \(1 + \lceil \log _{\beta ^{-1}}{L} \rceil \) rounds.
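A scalar toy run of (Variant 2) illustrates the bound in the second item. Here we assume (our assumption, for illustration) that the acceptance test (6) with \(\alpha _t = 1\) takes the form \(F(x^t + p^t) \le F(x^t) + \gamma \, Q_{H_t}(p^t;x^t)\), and we take \(\varPsi \equiv 0\) so the subproblem is solved exactly.

```python
import math

# Toy run of (Variant 2): f(x) = 0.5 * L_f * x^2, Psi = 0, exact subproblem
# solutions, and the (assumed) acceptance test F(x+p) <= F(x) + gamma * Q(p).
L_f, gamma, beta = 10.0, 0.25, 0.5

def f(x):
    return 0.5 * L_f * x * x

x, H, sigma, rounds = 1.0, 1.0, 1.0, 0   # H_t^0 = 1
while True:
    g = L_f * x                           # gradient of f at x
    p = -g / H                            # exact minimizer of Q_H(p; x) when Psi = 0
    Q = g * p + 0.5 * H * p * p
    if f(x + p) <= f(x) + gamma * Q:      # acceptance test with alpha_t = 1
        break
    H, sigma = H + 1.0 / sigma, beta * sigma   # (Variant 2) enlargement
    rounds += 1

# Lemma 3 bounds the number of enlargements by 1 + ceil(log_{1/beta} L).
assert rounds <= 1 + math.ceil(math.log(L_f) / math.log(1.0 / beta))
```

On this instance the loop stops after three enlargements (final H = 8), well within the bound of five rounds.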

Now we provide global convergence guarantees for Algorithm 2 without the need for manifold identification. From Lemma 3, we can assume without loss of generality that (7) holds for the final \(H_t\) that leads to the final update direction satisfying (6).

Theorem 5

Consider (1) with f L-smooth for \(L > 0\), \(\varPsi \in \varGamma _0\), and \(\varOmega \ne \emptyset \). Assume Algorithm 2 is applied with an initial point \(x^0\), the estimate \(\hat{L}\) satisfies \(\hat{L} \ge L\), (7) holds for the final \(H_t\) after exiting the while-loop, and (8) is satisfied with (22) for some \(\eta \in [0,1)\) fixed over t. Let \(\{k_t\}\) be the iterations at which the MO step is not attempted (so either (2) is solved approximately or (44) is conducted); then \(k_t \le 2t\) for all t. Denoting \(\tilde{M} {:}{=}\max \left\{ \hat{L}, M \right\} \) and \(\tilde{m} {:}{=}\min \left\{ \hat{L}, m \right\} \), we have the following convergence rate guarantees.

  1. 1.

    Let \(G_t {:}{=}\mathop {{{\,\mathrm{\arg \,min}\,}}}\limits _{p} Q_I(p;x^t)\), then \({\left\| {G_{k_t}}\right\| } \rightarrow 0\), and for all \(t\ge 0\), we have

    $$\begin{aligned} \min _{0 \le i \le t} {\left\| {G_{k_i}}\right\| }^2 \le \frac{F(x^0) - F^*}{\gamma (t+1)} \frac{\tilde{M}^2 \left( 1 + \tilde{m}^{-1} + \sqrt{1 - 2 \tilde{M}^{-1} + \tilde{m}^{-2}}\right) ^2}{2(1 - \eta ) \tilde{m}}. \end{aligned}$$

    Moreover, \(G_t = 0\) if and only if \(0 \in \partial F(x^t)\), and therefore any limit point of \(\{x^{k_t}\}\) is a stationary point of (1).

  2. 2.

    If in addition f is convex and there exists \(R_0 \in [0, \infty )\) such that

    $$\begin{aligned} \sup _{x: F\left( x \right) \le F\left( x^0 \right) }\left\| x - P_\varOmega (x)\right\| = R_0 \end{aligned}$$
    (47)

    (in other words, (38) holds with \(\theta = 0\), \(\zeta = R_0^{-1}\), and \(\xi = F(x^0) - F^*\)), then:

    1. 2.1

      When \(F(x^{k_t}) - F^* \ge \tilde{M} R_0^2\), we have

      $$\begin{aligned} F\left( x^{k_t+1} \right) - F^* \le \left( 1 - \frac{ \gamma (1 - \eta ) }{2} \right) \left( F\left( x^{k_t} \right) - F^*\right) . \end{aligned}$$
    2. 2.2

      Letting \(t_0 {:}{=}\mathop {{{\,\mathrm{\arg \,min}\,}}}\limits _{t} \{t: F\left( x^{k_t} \right) - F^* < \tilde{M} R_0^2\}\), we have for all \(t \ge t_0\) that

      $$\begin{aligned} F\left( x^{k_t} \right) - F^* \le \frac{2 \tilde{M} R_0^2}{ \gamma \left( 1 - \eta \right) \left( t - t_0 \right) + 2}. \end{aligned}$$

      Moreover, we have

      $$\begin{aligned} t_0 \le \max \left\{ 0, 1 + \frac{2}{\gamma \left( 1 - \eta \right) } \log \frac{F\left( x^0 \right) - F^*}{\tilde{M} R_0^2}\right\} . \end{aligned}$$

    In summary, we have \(F(x^t) - F^* = O\left( t^{-1} \right) \).

  3. 3.

    The results of Theorem 4 hold, with M and \({\left\| {H_t}\right\| }\) replaced by \(\tilde{M}\), \(\alpha _t\) and \(\bar{\alpha }\) by 1, \(\delta _t\) by \(\delta _{k_t}\), \(\delta _{t+1}\) by \(\delta _{k_t+1}\), and \(\delta _{t_0}\) by \(\delta _{k_{t_0}}\).

4.2 Superlinear and quadratic convergence

Following the argument in the previous subsection to treat (44) as solving (2) exactly, the manifold identification property of (44) also follows from Theorem 2. We thus focus on its local convergence in this subsection. In what follows, we will show that \(x^t\) converges to a stationary point \(x^*\) satisfying (14) superlinearly or even quadratically in the second stage.

Let \(\mathcal {M}\) be the active manifold of \(x^*\) and \(\phi \) be a parameterization of \(\mathcal {M}\) with \(\phi (y^*) = x^*\) for some point \(y^*\). We can thus assume without loss of generality \(\phi _t = \phi \) for all t that identified \(\mathcal {M}\). We denote \(F_{\phi } (y) {:}{=}F(\phi (y))\). For simplicity, we assume that \(F_{\phi }\) is twice-differentiable with its Hessian locally Lipschitz continuous around \(y^*\). In particular, we just need the following property to hold locally in a neighborhood \(U_0\) of \(y^*\):

$$\begin{aligned} \nabla F_\phi (y_1) - \nabla F_\phi (y_2) - \nabla ^2 F_\phi (y_2) \left( y_1 - y_2 \right) = O\left( \Vert y_1 - y_2\Vert ^2 \right) , \quad \forall y_1, y_2 \in U_0. \end{aligned}$$
(48)

We do not assume \(\nabla ^2 F_\phi (y^*) \succ 0\) like existing analyses for Newton’s method, but consider a degenerate case in which there is a neighborhood \(U_1\) of \(y^*\) such that

$$\begin{aligned} \nabla ^2 F_\phi (y) {\succeq } 0, \quad \forall y \in U_1. \end{aligned}$$
(49)

Note that (49) implies that \(F_\phi \) is convex within \(U_1\). We can decompose \(\mathcal {E}\) into the direct sum of the tangent and the normal spaces of \(\mathcal {M}\) at \(x^*\), and thus its stationarity implies \(\nabla F_{\phi }(y^*) = 0\). This and (49) mean \(y^*\) is a local optimum of \(F_{\phi }\), and hence \(x^*\) is a local minimum of F when f is L-smooth, following the argument of [50, Theorem 2.5]. We also assume that \(F_\phi \) satisfies a sharpness condition similar to (38) in a neighborhood \(U_2\) of \(y^*\):

$$\begin{aligned} \zeta ^{\hat{\theta }} {\left\| {y-y^*}\right\| } \le \left( F_\phi (y) - F_\phi \left( y^* \right) \right) ^{\hat{\theta }},\quad \forall y \in U_2, \end{aligned}$$
(50)

for some \(\zeta > 0\) and \({\hat{\theta }} \in (0,1/2]\). By shrinking the neighborhoods if necessary, we assume without loss of generality that \(U_0 = U_1 = U_2\) and denote it by U. Note that the conventional assumption of a positive-definite Hessian at \(y^*\) is a special case that satisfies (49) and (50) with \(\hat{\theta } = 1/2\).
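As a degenerate example of our own: \(F_\phi (y) = y^4\) has \(\nabla ^2 F_\phi (0) = 0\), so the conventional positive-definiteness assumption fails, yet (49) holds everywhere and (50) holds at \(y^* = 0\) with \(\hat{\theta } = 1/4\) and \(\zeta = 1\), since \((y^4)^{1/4} = |y|\).

```python
import numpy as np

ys = np.linspace(-0.5, 0.5, 1001)  # a neighborhood U of y* = 0

# (49): the Hessian 12 y^2 is nonnegative, though it vanishes at y* = 0.
assert np.all(12.0 * ys ** 2 >= 0.0)

# (50) with theta_hat = 1/4 and zeta = 1: 1 * |y| <= (y^4 - 0)^{1/4}.
assert np.all(np.abs(ys) <= (ys ** 4) ** 0.25 + 1e-12)
```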

We define \(d_t {:}{=}{\left\| {y^t - y^*}\right\| }\) and use it to bound \({\left\| {y^t + q^t - y^*}\right\| }\) and \({\left\| {\nabla F_\phi (y^t + q^t)}\right\| }\).

Lemma 4

Consider a stationary point \(x^*\) of (1) with \(\varPsi \) partly smooth at it relative to a manifold \(\mathcal {M}\) with a parameterization \(\phi \) and a point \(y^*\) such that \(\phi (y^*) = x^*\), and assume that within a neighborhood U of \(y^*\), \(F_{\phi }\) is twice-differentiable and (48)–(49) hold. Then \(y^t \in U\) implies that any \(q^t\) satisfying (46) is bounded by

$$\begin{aligned} {\left\| {q^t}\right\| } \le 2 d_t + \mu _t^{-1} O\left( d_t^{2} \right) + 0.1 \mu _t^{-1} {\left\| {g^t}\right\| }^{1 + \rho }. \end{aligned}$$
(51)

Proof

From (46), we can find \(\psi _t \in \mathcal {E}\) such that

$$\begin{aligned} H_t q^t + g^t = \psi _t, \quad {\left\| {\psi _t}\right\| } \le 0.1 {\left\| {g^t}\right\| }^{1 + \rho }. \end{aligned}$$
(52)

From (49) and (45), we have

$$\begin{aligned} H_t \succeq \mu _t I \succ 0, \end{aligned}$$
(53)

so \(H_t\) is invertible. We then get

$$\begin{aligned} {\left\| {y^t + q^t - y^*}\right\| } {\mathop {=}\limits ^{(52)}}&~ {\left\| {H_t^{-1} \left( \psi _t - g^t + H_t\left( y^t - y^* \right) \right) }\right\| }\nonumber \\ {\mathop {\le }\limits ^{(45)}}&~ {\left\| {H_t^{-1}}\right\| } \left( {\left\| {\psi _t}\right\| } + {\left\| {g^t - \nabla ^2 F\left( \phi _t\left( y^t \right) \right) \left( y^t - y^* \right) }\right\| } + \mu _t d_t\right) \nonumber \\ {\mathop {\le }\limits ^{(53),(52),(48)}}&~ 0.1 \mu _t^{-1} {\left\| {g^t}\right\| }^{1 + \rho } + \mu _t^{-1} O\left( d_t^{2} \right) + d_t. \end{aligned}$$
(54)

From the triangle inequality, we have \({\left\| {q^t}\right\| } \le {\left\| {y^t - y^*}\right\| } + {\left\| {y^t + q^t - y^*}\right\| }\), whose combination with (54) proves (51). \(\square \)

Lemma 5

Consider the setting of Lemma 4 and further assume that \(\varPsi \in \varGamma _0\) and f is L-smooth. The following hold.

  1. 1.

    If \(\rho \in (0,1]\) and \(F_\phi \) satisfies (50) with \(\hat{\theta } = 1/2\) for some \(\zeta > 0\), then

    $$\begin{aligned} {\left\| { y^t + q^t - y^*}\right\| } = O\left( d_t^{1+\rho } \right) , \quad {\left\| {\nabla F_\phi (y^t + q^t)}\right\| } = O\left( {\left\| {g^t}\right\| }^{1+\rho } \right) . \end{aligned}$$
    (55)
  2. 2.

    If \(\rho = 0.69\) and \(F_\phi \) satisfies (50) for some \(\zeta > 0\) and \(\hat{\theta } \ge 3/8\), then

    $$\begin{aligned} {\left\| { y^t + q^t - y^*}\right\| } = o\left( d_t \right) , \quad {\left\| {\nabla F_\phi (y^t + q^t)}\right\| } = o\left( {\left\| {g^t}\right\| } \right) . \end{aligned}$$
    (56)

Proof

From the assumptions on \(\varPsi \), \(\phi \), and f, \(F_\phi \) is twice-differentiable almost everywhere, and within any compact set K containing \(y^*\), any \(\nabla ^2 F_\phi \in \partial \nabla F_\phi \) is upper-bounded by some \(L_K > 0\) (\(f(\phi (y))\) is differentiable with a Lipschitz-continuous gradient, and \(\nabla ^2 (\varPsi \circ \phi )(y)\) is upper-bounded), so \( F_\phi \) is \(L_K\)-smooth within K. Since K is arbitrary, we let \(K \supset U\) and obtain

$$\begin{aligned} {\left\| {g^t}\right\| } \le L_K d_t. \end{aligned}$$
(57)

Since (49) implies that \(F_\phi \) is convex in U, (50) leads to

$$\begin{aligned} {\left\| {\nabla F_{\phi } \left( y \right) }\right\| } \ge \zeta {\left\| {y - y^*}\right\| } ^{(1 - \hat{\theta })/\hat{\theta }}, \quad \forall y \in U. \end{aligned}$$
(58)

For the first case, (57)–(58) show \({\left\| {g^t}\right\| } = \varTheta (d_t)\), so Lemma 4 implies

$$\begin{aligned} {\left\| {q^t}\right\| } = O(d_t) = O\left( {\left\| {g^t}\right\| }\right) . \end{aligned}$$
(59)

Thus, the second equation in (55) is proven by the chain below, and the first one then follows by combining it with (58) and \({\left\| {g^t}\right\| } = \varTheta (d_t)\):

$$\begin{aligned}&{\left\| {\nabla F_\phi \left( y^t + q^t\right) }\right\| } \nonumber \\ {}&= ~{\left\| {\nabla F_\phi \left( y^t + q^t\right) + \psi _t - \psi _t}\right\| }\nonumber \\&{\mathop {\le }\limits ^{(52)}} ~ {\left\| {\nabla F_\phi \left( y^t + q^t\right) - g^t - H_t q^t}\right\| } + 0.1 {\left\| {g^t}\right\| }^{1+\rho } \nonumber \\&{\mathop {\le }\limits ^{(48),(45)}} ~ O\left( {\left\| {q^t}\right\| }^2 \right) + \mu _t {\left\| {q^t}\right\| } + 0.1 {\left\| {g^t}\right\| }^{1+\rho } {\mathop {=}\limits ^{(59),(45)}} O\left( {\left\| {g^t}\right\| }^{1+\rho } \right) . \end{aligned}$$
(60)

In the second case, \(\hat{\theta }/(1 - \hat{\theta }) \ge 3/5\), so (45), Lemma 4 and (57)–(58) imply

$$\begin{aligned} {\left\| {q^t}\right\| } = O\left( d_t^{0.85}\right) . \end{aligned}$$
(61)

We then get from \(\rho = 0.69\) that

$$\begin{aligned} {\left\| {\nabla F_\phi \left( y^t + q^t\right) }\right\| }&{\mathop {\le }\limits ^{(60)}} ~ O\left( {\left\| {q^t}\right\| }^2 \right) + \mu _t {\left\| {q^t}\right\| } + 0.1 {\left\| {g^t}\right\| }^{1.69}\nonumber \\&{\mathop {=}\limits ^{(57),(45),(61)}} ~ O\left( d_t^{1.7} \right) + 2\mu _t d_t + O(d_t^2) + O\left( d_t^{1.69}\right) = O(d_t^{1.69}). \end{aligned}$$
(62)

From (58) we get that

$$\begin{aligned} {\left\| {y^t + q^t - y^*}\right\| } = O\left( {\left\| {\nabla F_{\phi } \left( y^t + q^t \right) }\right\| }^{0.6}\right) {\mathop {\le }\limits ^{(62)}} O\left( (d_t^{1.69})^{0.6} \right) = O\left( d_t^{1.014} \right) , \end{aligned}$$

proving the first equation in (56). The second one is then proven by

$$\begin{aligned} {\left\| {\nabla F_{\phi } \left( y^t + q^t \right) }\right\| } {\mathop {\le }\limits ^{(62)}} O\left( d_t^{1.69}\right) {\mathop {=}\limits ^{(58)}} O\left( {\left\| {g^t}\right\| }^{1.014} \right) . \end{aligned}$$

\(\square \)

Now we are able to show the two-step superlinear convergence of \({\left\| {x^t - x^*}\right\| }\).

Theorem 6

Consider the setting of Lemma 5, and assume in addition that \(x^*\) satisfies (14). Then there is a neighborhood V of \(x^*\) such that if at the \(t_0\)th iteration of Algorithm 2 for some \(t_0 > 0\) we have \(x^{t_0} \in V\), Unchanged \(\ge S\), \(\mathcal {M}\) correctly identified with a parameterization \(\phi \) satisfying \(\phi (y^*) = x^*\), and \(\alpha _t = 1\) taken in Algorithm 3 for all \(t \ge t_0\), then the following hold for all \(t \ge t_0\).

  1.

    For \(\rho \in (0,1]\) and \(F_\phi \) satisfying (50) with \(\hat{\theta } = 1/2\) for some \(\zeta > 0\):

    $$\begin{aligned} {\left\| {x^{t +2} - x^*}\right\| } = O\left( {\left\| {x^{t}-x^*}\right\| }^{1+\rho }\right) , {\left\| {\nabla F_\phi \left( {y}^{t +2}\right) }\right\| } = O\left( {\left\| {\nabla F_\phi \left( {y}^{t}\right) }\right\| }^{1+\rho }\right) . \end{aligned}$$
    (63)
  2.

    For \(\rho = 0.69\) and \(F_\phi \) satisfying (50) for some \(\zeta > 0\) and \(\hat{\theta } \ge 3/8\),

    $$\begin{aligned} {\left\| {x^{t +2} - x^*}\right\| } = o\left( {\left\| {x^{t}-x^*}\right\| }\right) . \end{aligned}$$

Proof

In our discussion below, \(V_i\) and \(U_i\) for \(i \in \mathbb {N}\) are respectively neighborhoods of \(x^*\) and \(y^*\).

Since \(\phi \) is \(\mathcal {C}^2\), there is \(U_1\) of \(y^*\) such that

$$\begin{aligned} \phi \left( y \right) - \phi \left( y^* \right) = \langle \nabla \phi (y^*),\,y-y^*\rangle + O\left( {\left\| {y - y^*}\right\| }^2 \right) , \forall y \in U_1. \end{aligned}$$
(64)

Because the derivative of \(\phi \) at \(y^*\) is injective, (64) implies

$$\begin{aligned} {\left\| {\phi \left( y \right) - \phi \left( y^* \right) }\right\| } = \varTheta \left( {\left\| {y-y^*}\right\| } \right) , \forall y \in U_1. \end{aligned}$$
(65)

If the tth iteration is a TSSN step, we define \(y^t\) to be the point such that \(\phi (y^t) = x^t\). If either case in Lemma 5 holds and \(q^t\) satisfies (46), then from \({\left\| {y^t + q^t - y^*}\right\| } = o({\left\| {y^t - y^*}\right\| })\) we can find \(U_2 \subset U\) such that \(y^t\in U_2\) implies \(y^t + q^t \in U_2\). Taking \(U_3 {:}{=}U_1\cap U_2 \subset U\), for \(y^t \in U_3\) and \(x^{t+1} = \phi (y^t + q^t)\), we get \(y^t+q^t \in U_3 \subset U_1\), and hence the following from (65).

$$\begin{aligned} {\left\| {x^{t+1} - x^*}\right\| } = \varTheta \left( {\left\| {y^t+q^t - y^*}\right\| }\right) , \quad {\left\| {x^t-x^*}\right\| } = \varTheta \left( {\left\| {y^t-y^*}\right\| } \right) . \end{aligned}$$
(66)

On the other hand, consider the case in which the tth iteration is a PG step. As \(\phi \) is \(\mathcal {C}^2\) and \(\phi (y^*) = x^*\), we can find \(V_1\) such that \(\phi (U_3) \supset V_1 \cap \mathcal {M}\). From [50, Lemma 3.2], there is \(V_2\) such that \(x^t\in V_2\) implies

$$\begin{aligned} {\left\| {x^{t+1} - x^*}\right\| } = O({\left\| {x^t-x^*}\right\| }), \end{aligned}$$
(67)

so there is \(V_3\subset V_2\) such that \(x^{t+1} \in V_1\) if \(x^{t} \in V_3\). Therefore, from Theorem 2 (applicable because we have assumed (14)), there is \(V_4\) such that \(x^t \in V_4\) implies \(x^{t+1} \in \mathcal {M}\). Taking \(V_5 {:}{=}V_4 \cap V_3\), we see that \(x^t\in V_5\) implies \(x^{t+1} \in V_1 \cap \mathcal {M}\), so we can find \(y^{t+1}\in U_3\) with \(\phi (y^{t+1}) = x^{t+1}\).

Now consider the first case in the statement. If at the tth iteration we have \(x^t\in V_5\) and have taken (44), then \(x^{t+1} \in V_1\cap \mathcal {M}\) with \(x^{t+1} = \phi (y^{t+1})\) for some \(y^{t+1} \in U_3\), so we can take a TSSN step at the \((t+1)\)th iteration and

$$\begin{aligned} {\left\| {x^{t+2} - x^*}\right\| }&{\mathop {=}\limits ^{(66)}} \varTheta \left( {\left\| {y^{t+1} + q^{t+1} - y^*}\right\| }\right) {\mathop {=}\limits ^{Lemma 5}} O\left( {\left\| {y^{t+1} - y^*}\right\| }^{1+\rho } \right) \nonumber \\&{\mathop {=}\limits ^{(66)}} O\left( {\left\| {x^{t+1} - x^*}\right\| }^{1+\rho } \right) {\mathop {=}\limits ^{(67)}} O\left( {\left\| {x^t - x^*}\right\| }^{1+\rho } \right) , \end{aligned}$$
(68)

so there is \(V_6 \subset V_5\) such that \(x^t \in V_6\) implies \(x^{t+2} \in V_6\) as well, and the superlinear convergence in (68) propagates to \(t+2,t+4,\dotsc \). We therefore see that \({\left\| {x^{t+2i+1} - x^*}\right\| } = O({\left\| {x^{t+2i-1} - x^*}\right\| }^{1+\rho })\) for \(i \in \mathbb {N}\) as well, so there is \(V_7 \subset V_1\) such that \(x^{t+2i-1} \in V_7 \cap \mathcal {M}\) implies \(x^{t+2i+1} \in V_7 \cap \mathcal {M}\) as well.

Letting \(V = V_7 \cap V_6\), we see that \(x^t \in V \cap \mathcal {M}\) implies \(x^{t+2} \in V \cap \mathcal {M}\) regardless of whether a PG or a TSSN step is taken first, proving the first equation in (63). The convergence of \(\nabla F_\phi \) then follows from (57) and (58). The superlinear convergence in the second case follows from the same argument. \(\square \)

Note that when \(\rho =1\) in the first case, we obtain quadratic convergence.

The analysis in [54] directly assumed (58) instead of (50), together with a Lipschitz-continuous Hessian for f, under the setting of regularized optimization to obtain a superlinear rate. Even in the context of smooth optimization, our analysis is more general in allowing a wider range of exponents for superlinear convergence. In particular, for (58), [54] only allowed \(\tilde{\theta } = 1\), where \(\tilde{\theta } {{:}{=}} \hat{\theta } / (1 - \hat{\theta })\), whereas our result extends the range of superlinear convergence to \(\tilde{\theta } \ge 0.6\).
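As a quick arithmetic check (ours, using only quantities already defined above) of how the threshold \(\hat{\theta } \ge 3/8\) in Theorem 6 translates into \(\tilde{\theta } \ge 0.6\) and into an exponent exceeding one when \(\rho = 0.69\):

```latex
% theta-hat >= 3/8 is equivalent to theta-tilde >= 3/5, since
% theta-tilde = theta-hat / (1 - theta-hat) is increasing in theta-hat:
\[
  \hat{\theta} \ge \tfrac{3}{8}
  \;\Longleftrightarrow\;
  \tilde{\theta} = \frac{\hat{\theta}}{1-\hat{\theta}}
  \ge \frac{3/8}{1 - 3/8} = \frac{3}{5} = 0.6,
\]
% and the resulting two-step exponent is
\[
  \tilde{\theta}\,(1+\rho) \ge 0.6 \times 1.69 = 1.014 > 1,
\]
% matching the O(d_t^{1.014}) rate obtained in the proof of Lemma 5.
```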

Remark 1

PCG returns the exact solution of (45) in d steps, where d is the dimension of \(\mathcal {M}\), and each step involves only a bounded number of basic linear algebra operations, so the running time of Algorithm 3 is upper-bounded. Therefore, the superlinear convergence of Algorithm 2 in iterations established in Theorem 6 implies superlinear convergence in running time as well. This contrasts with existing PN approaches, which all require applying an iterative subproblem solver to (2) with increasing precision, and hence take increasing time per iteration because (2) has no closed-form solution.
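To illustrate the finite-termination property behind this remark, the sketch below (ours, not the paper's implementation, and with plain CG rather than a preconditioned variant) solves a d-dimensional symmetric positive-definite system and relies on CG reaching the exact solution in at most d iterations, up to round-off:

```python
import numpy as np

def cg(A, b, max_iter):
    # Plain conjugate gradient for Ax = b with A symmetric positive
    # definite; in exact arithmetic it terminates in at most d steps.
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    for _ in range(max_iter):
        rr = r @ r
        if rr < 1e-24:
            break
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        p = r + (r @ r) / rr * p
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)   # SPD with d = 4
b = rng.standard_normal(4)
x = cg(A, b, max_iter=4)      # d iterations suffice (up to round-off)
assert np.linalg.norm(A @ x - b) < 1e-8
```

This bounded per-iteration cost is what makes the superlinear rate carry over from iteration counts to wall-clock time.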

4.3 Availability of parameterization for the manifold

The proposed Algorithm 2 relies on the existence of the parameterization \(\phi \). There are many widely-used regularizers in machine learning and signal processing that are known to be partly smooth, and the corresponding manifold can also be described by some simple parameterizations. A summary of some of the most popular regularizers, including the \(\ell _1\)-norm, \(\ell _\infty \)-norm, \(\ell _{2,1}\)-norm, \(\ell _0\) pseudo-norm, nuclear norm (for matrix variables), and total variation, and their associated manifolds can be found in [30, Chapter 5.2].

Beyond the regularizers listed above, following the discussion in [15, 37], several well-developed approaches can be leveraged to search for parameterizations of other partly smooth regularizers. The first is the Riemannian optimization approach (see, for example, [1]), whose workhorses are a projection from \(\mathcal {E}\) onto the tangent space of a manifold and a retraction map from the tangent space back to the manifold. Such mappings for various manifolds are continually being developed, and the composition of the tangent-space projection and the retraction can serve as a reasonable parameterization for our purpose. The second approach relies on the \(\mathcal{V}\mathcal{U}\)-decomposition [25, 34,35,36]. It has been shown in [15, 37] that the so-called fast track in the \(\mathcal{V}\mathcal{U}\)-decomposition of a convex and partly smooth regularizer is equivalent to its active manifold, so existing descriptions of the fast track for various functions provide another source of parameterizations; this corresponds to the tangential parameterization discussed in [37]. These are also exactly the two parameterizations suggested by [31] for applying high-order acceleration steps to a family of forward-backward splitting algorithms. Another possibility suggested by [37] is the projection parameterization, which uses the Euclidean projection to map a point in the tangent space back to the manifold.

5 Numerical results

We conduct numerical experiments on \(\ell _1\)-regularized logistic regression, a problem of the form (1) with \(\mathcal {E}= \mathbb {R}^d\) for some \(d \in \mathbb {N}\), to support our theory:

$$\begin{aligned} \varPsi (x) = \lambda {\left\| {x}\right\| }_1,\; f(x) = \sum _{i=1}^n \log \left( 1 + \exp \left( -b_i \langle a_i,\,x\rangle \right) \right) , \end{aligned}$$
(69)

where \(\lambda > 0\) controls the weight of the regularization and \((a_i,b_i) \in \mathbb {R}^d \times \{-1,1\}\) for \(i=1,\dotsc ,n\) are the data points. Note that \(\lambda {\left\| {x}\right\| }_1\) is partly smooth at every \(x \in \mathbb {R}^d\) relative to \(\mathcal {M}_{x} {:}{=}\left\{ y \mid y_i = 0, \forall i \in {J}_x \right\} \), where \({J_x} {:}{=}\{i \mid x_i = 0\}\). Let I be the identity matrix and \(J_x^C{:}{=}\{1,\dotsc ,d\} \setminus J_x\). In view of the definition of \(\mathcal {M}_x\) here, the parameterization we use at each iteration simply embeds \(y_t \in \mathbb {R}^{|J_{x_t}^C|}\) back into \(\mathbb {R}^d\). Namely,

$$\begin{aligned} {\phi _t(y_t) {:}{=}I_{:,J_{x_t}^C}y_t.} \end{aligned}$$
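For concreteness, a minimal sketch (ours; the paper's implementation is in C++) of this embedding for the \(\ell _1\) case, which maps the free coordinates of the active manifold back into the full space:

```python
import numpy as np

def make_phi(x):
    """Parameterization of M_x = {y : y_i = 0 for all i in J_x} for the
    l1-norm, where J_x = {i : x_i = 0}: phi embeds the free coordinates
    J_x^C back into R^d, i.e. phi(y) = I[:, J_x^C] @ y."""
    free = np.flatnonzero(x != 0)  # J_x^C, the nonzero pattern of x
    d = x.size

    def phi(y):
        out = np.zeros(d)
        out[free] = y
        return out

    return phi, free

# A point with J_x = {0, 2}; the manifold has dimension 2.
x = np.array([0.0, 1.5, 0.0, -2.0])
phi, free = make_phi(x)
assert list(free) == [1, 3]
assert list(phi(np.array([7.0, 8.0]))) == [0.0, 7.0, 0.0, 8.0]
```

The smooth problem on \(\mathcal {M}_{x_t}\) then has dimension \(|J_{x_t}^C|\), which is typically much smaller than d once the active manifold is identified.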

We use publicly available real-world data sets listed in Table 2. All experiments are conducted with \(\lambda = 1\) in (69). All methods are implemented in C++, and we set \(\gamma = 10^{-4}, \beta = 0.5, T = 5\) throughout for ISQA and ISQA \(^{+}\).

Table 2 Data statistics

5.1 Manifold identification of different subproblem solvers

We start by examining the manifold identification ability of different subproblem solvers. We run both ISQA and the first stage of ISQA \(^{+}\) (by setting \(S = \infty \)) and consider two settings for \(H_t\). The first is the L-BFGS approximation with a safeguard in [22], where we set \(m=10\) and \(\delta = 10^{-10}\) in their notation, following their experimental setting. The second is the PN approach of [54] that uses \(H_t = \nabla ^2 f\left( x^t \right) + c {\left\| {x^t - \textrm{prox}_{\varPsi }\left( x^t - \nabla f\left( x^t \right) \right) }\right\| }^\rho I\), where we set \(\rho = 0.5\) and \(c= 10^{-6}\) following their suggestion. In both cases, we enlarge \(H_t\) in Algorithm 2 through \(H_t \leftarrow 2 H_t\).

We compare the following subproblem solvers.

  • SpaRSA [51]: a backtracking PG-type method with the initial step sizes estimated by the Barzilai-Borwein method.

  • APG [38]: See (35)–(36). We use a simple heuristic of restarting whenever the objective is not decreasing.

  • Random-permutation cyclic coordinate descent (RPCD): Cyclic proximal coordinate descent with the order of coordinates reshuffled every epoch.
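All three solvers build on the proximal operator of \(\varPsi = \lambda {\left\| {\cdot }\right\| }_1\), which is coordinate-wise soft-thresholding. A minimal sketch (ours) of one proximal-gradient step; the exact zeros it produces are what make manifold identification possible:

```python
import numpy as np

def soft_threshold(z, tau):
    # prox of tau * ||.||_1: coordinate-wise shrinkage toward zero.
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def pg_step(x, grad, step, lam):
    # One proximal-gradient step on F = f + lam * ||.||_1. Coordinates
    # whose forward step lands within [-step*lam, step*lam] are set
    # exactly to zero, landing the iterate on the active manifold.
    return soft_threshold(x - step * grad, step * lam)

x = np.array([0.3, -2.0, 0.05])
g = np.array([1.0, 0.0, 0.0])
x_new = pg_step(x, g, step=0.5, lam=1.0)
assert np.array_equal(x_new, [0.0, -1.5, 0.0])
```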

The results presented in Table 3 show that all subproblem solvers for ISQA \(^{+}\) can identify the active manifold, verifying Theorem 2. Because the step sizes are mostly one in this experiment, even the solvers for ISQA can identify the active manifold. Among the solvers, RPCD is the most efficient and stable in identifying the active manifold, so we use it in all subsequent experiments.

Table 3 Outer iterations and time (s) for different subproblem solvers to identify the active manifold

5.2 Comparing ISQA \(^{+}\) with existing algorithms

We proceed to compare ISQA \(^{+}\) with the following state-of-the-art methods for (1), using the relative objective value \((F(x) - F^*) / F^*\).

  • LHAC [45]: an inexact proximal L-BFGS method with RPCD for (2) and a trust-region-like approach. As in our L-BFGS variant, we set \(m=10\) for constructing \(H_t\) in this experiment.

  • NewGLMNET [53]: a line-search PN with an RPCD subproblem solver.

IRPN [54] is another PN method that performs slightly faster than NewGLMNET, but their algorithmic frameworks are similar and the experiment in [54] showed that the running time of NewGLMNET is competitive. We thus use NewGLMNET as the representative because its code is open-sourced.

For ISQA \(^{+}\), we set \(S=10\) and use both PN and L-BFGS variants with RPCD in the first stage and Algorithm 3 with \(\rho = 0.5, c = 10^{-6}\) in the second. For PCG, we use the diagonals of \(H_t\) as the preconditioner. We use a heuristic to let PCG start with an iteration bound \(T_0 = 5\), double it whenever \(\alpha _t = 1\) until reaching the dimension of \(\mathcal {M}\), and reset it to 5 when \(\alpha _t < 1\). For the value of S, although tuning it properly might lead to even better performance, we observe that the current setting already suffices to demonstrate the improved performance of the proposed algorithm.
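The iteration-cap heuristic described above can be sketched as follows (a hypothetical helper of ours mirroring the stated rule: start at \(T_0 = 5\), double after a unit step up to the dimension of \(\mathcal {M}\), and reset to 5 after a damped step):

```python
def update_pcg_cap(cap, step_size, manifold_dim, base=5):
    # After a full step (alpha_t = 1), double the PCG iteration cap,
    # but never beyond the manifold dimension; after a damped step
    # (alpha_t < 1), reset the cap to its base value.
    if step_size == 1.0:
        return min(2 * cap, manifold_dim)
    return base

assert update_pcg_cap(5, 1.0, 100) == 10    # doubled after a unit step
assert update_pcg_cap(80, 1.0, 100) == 100  # capped at dim(M)
assert update_pcg_cap(40, 0.5, 100) == 5    # reset after a damped step
```

The cap keeps early TSSN iterations cheap while still allowing exact subproblem solves, and hence the finite-termination guarantee, near the solution.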

Results in Fig. 1 show superlinear convergence in running time for ISQA \(^{+}\), while LHAC and NewGLMNET exhibit only linear convergence. We observe that for data with \(n \gg d\), including a9a, ijcnn1, covtype.scale, and epsilon, L-BFGS approaches are faster because \(H_t p\) can be evaluated cheaply (LHAC failed on covtype.scale due to implementation issues), while PN approaches are faster otherwise, so no algorithm is always superior. Nonetheless, for the same type of \(H_t\), ISQA \(^{+}\)-LBFGS and ISQA \(^{+}\)-Newton greatly improve upon the state-of-the-art algorithms LHAC and NewGLMNET, respectively, because of their fast local convergence, especially when the base method converges slowly.

Fig. 1
figure 1

Comparison of different algorithms. For covtype.scale, LHAC is not shown because it failed to converge

6 Conclusions

In this paper, we showed that for regularized problems with a partly smooth regularizer, inexact successive quadratic approximation is essentially able to identify the active manifold, because a mild sufficient condition is satisfied by most commonly-used subproblem solvers. We proposed an efficient algorithm, ISQA \(^{+}\), that utilizes this property to attain superlinear convergence in running time on a wide class of degenerate problems, greatly improving upon the state of the art for regularized problems, which exhibits superlinear convergence only in outer iterations. Numerical evidence illustrated that ISQA \(^{+}\) improves upon the running time of state-of-the-art methods for regularized optimization.