1 Introduction

The results presented in this paper apply to a pseudo-online optimization algorithm based on solving a regularized unconstrained minimization problem under the assumptions of strong convexity and Lipschitz continuity. Unlike first-order methods such as stochastic gradient descent (SGD) [1, 2] and its variants [3,4,5,6], which only use first-order information through the function gradients, second-order methods [7,8,9,10,11,12,13] incorporate, in some way, second-order information through the Hessian matrix or the Fisher information matrix (FIM). It is well known that this generally gives second-order methods better (quadratic) convergence than a typical first-order method, which only converges linearly in the neighbourhood of the solution [14].

Despite their convergence advantage over first-order methods, second-order methods involve highly prohibitive computations, namely, inverting an \(n_w\times n_w\) matrix at each iteration, where \(n_w\) is the number of optimization variables. The most commonly used second-order methods (the natural gradient, Gauss–Newton, and sub-sampled Newton [15], or its regularized version [16]), which incorporate second-order information while maintaining desirable convergence properties, compute the Hessian matrix (or FIM) \(\varvec{H}\) by using the approximation \(\varvec{H}\approx \varvec{J}^T\varvec{J}\), where \(\varvec{J}\) is the Jacobian matrix. This approximation is still an \(n_w\times n_w\) matrix and remains computationally demanding for large problems. Recent works [17,18,19,20] on second-order methods for overparameterized neural network models bypass this difficulty by applying a matrix identity and instead only compute the matrix \(\varvec{J}\varvec{J}^T\), which is a \(d\cdot N\times d\cdot N\) matrix, where d is the model output dimension and N is the number of data points. This approach significantly reduces the computational overhead when \(d\cdot N\) is much smaller than \(n_w\) (overparameterized models) and helps to accelerate convergence [17]. Nevertheless, for an objective function with a twice-differentiable convex regularizer, this simplification requires closer attention and special modifications for a general problem with a large number of variables.

The idea of exploiting desirable regularization properties to improve the convergence of the (Gauss–)Newton scheme has been around for decades, and most published works on the topic combine, in different ways, the ideas of Levenberg–Marquardt (LM) regularization, line search, and trust-region methods [21]. For example, the recent work [22] combines the idea of cubic regularization (originally proposed in [21]) with a particular variant of the adaptive LM penalty that uses the Euclidean gradient norm of the output-fit loss function (see [23] for a comprehensive list and complexity bounds). Their proposed scheme achieves a global \({\mathcal {O}}(k^{-2})\) rate, where k is the iteration number. A similar idea is considered in [24] using Bregman distances, extending the approach to an accelerated variant that achieves a \({\mathcal {O}}(k^{-3})\) convergence rate.

In this paper, we propose a new self-concordant regularization (SCORE) scheme for efficiently choosing optimal model variables in problems with smooth, strongly convex optimization objectives, where one of the objective functions regularizes the model’s variable vector, thereby avoiding overfitting and ultimately improving the model’s ability to generalize. Under the additional assumption that the regularization function is self-concordant, we propose the GGN-SCORE algorithm (see Algorithm 4 below), which updates the minimization variables within the framework of the local norm \(\Vert \cdot \Vert _x\) (a.k.a. the Newton decrement) of a self-concordant function f(x), as in [25]. Our proposed scheme does not require the output-fit loss function to be self-concordant, a condition that does not hold in many applications [22]. Instead, we exploit the greedy descent provision of self-concordant functions, via regularization, to achieve a fast convergence rate while maintaining feasible assumptions on the combined objective function (from an application point of view). Although this paper assumes a convex optimization problem, we also provide experiments that show promising results for the non-convex problems that arise when training neural networks. Our experimental results present an interesting opportunity for future investigation and scaling of the proposed method to large-scale machine learning problems, as one of the non-convex problems considered in the experiments involves training an overparameterized neural network. We remark that overparameterization is an interesting and desirable property, and a topic of interest in the machine learning community [26,27,28,29].

This paper is organized as follows: First, we introduce some notations, formulate the optimization problem with basic assumptions, and present an initial motivation for the optimization method in Sect. 2. In Sect. 3, we derive a new generalized method for reducing the computational overhead associated with the mini-batch Newton-type updates. The idea of SCORE is introduced in Sect. 4 and our GGN-SCORE algorithm is presented thereafter. Experimental results that show the efficiency and fast convergence of the proposed method are presented in Sect. 5.

2 Preliminaries

2.1 Notation and basic assumptions

Let \(\{(\varvec{x}_n,\varvec{y}_n)\}_{n=1}^N\) be a sequence of N input and output sample pairs, \(\varvec{x}_n\in \mathbb {R}^{n_p}, \varvec{y}_n\in \mathbb {R}^{d}\), where \(n_p\) is the number of features and d is the number of targets. We assume a model \(f(\varvec{\theta };\varvec{x}_n)\), defined by \(f:\mathbb {R}^{n_w} \times \mathbb {R}^{n_p} \rightarrow \mathbb {Y}\) and parameterized by the vector of variables \(\varvec{\theta }\in \mathbb {R}^{n_w}\). We denote by \(\partial _{\varvec{a}} \varvec{b} \equiv \partial \varvec{b}\) the gradient (or first derivative) of \(\varvec{b}\) (with respect to \(\varvec{a}\)) and \(\partial _{\varvec{a}\varvec{a}}^2 \varvec{b} \equiv \partial ^2\varvec{b}\) the second derivative of \(\varvec{b}\) with respect to \(\varvec{a}\). We write \(\Vert \cdot \Vert \) to denote the 2-norm. The set \(\{\textrm{diag}(\varvec{v}):\varvec{v}\in {{\mathbb {R}}}^n\}\), where \(\textrm{diag}:{{\mathbb {R}}}^n \rightarrow {{\mathbb {R}}}^{n\times n}\), denotes the set of all diagonal matrices in \({{\mathbb {R}}}^{n\times n}\). Throughout the paper, bold-face letters denote vectors and matrices.

Suppose that \(f(\varvec{\theta };\varvec{x}_n)\) outputs the value \(\hat{\varvec{y}}_n \in \mathbb {R}^{d}\). The regularized minimization problem we want to solve is

$$\begin{aligned} \min _{\varvec{\theta }}{\mathcal {L}}(\varvec{\theta })&:=\underbrace{\sum _{n=1}^N\ell (\varvec{y}_n,\hat{\varvec{y}}_n)}_{g(\varvec{\theta })} + \lambda \underbrace{\sum _{j=1}^{n_w} r_j(\varvec{\theta }_j)}_{h(\varvec{\theta })}, \end{aligned}$$
(1)

where \(\ell :{{\mathbb {R}}}^d\times {{\mathbb {R}}}^d\rightarrow {{\mathbb {R}}}\) is a (strongly) convex, twice-differentiable output-fit loss function, the functions \(r_j:{{\mathbb {R}}}\rightarrow {{\mathbb {R}}}\), \(j=1,\ldots ,n_w\), define a separable regularization term on \(\varvec{\theta }\), and \(g,h:{{\mathbb {R}}}^{n_w}\rightarrow {{\mathbb {R}}}\) denote the two resulting terms in (1). We assume that the regularization function \(h(\varvec{\theta })\), scaled by the parameter \(\lambda >0\), is twice differentiable and strongly convex. The following preliminary conditions define the regularity of the Hessian of \({\mathcal {L}}(\varvec{\theta })\) and are assumed to hold only locally in this work:

Assumption 1

The functions g and h are twice-differentiable with respect to \(\varvec{\theta }\) and are respectively \(\gamma _l\)- and \(\gamma _a\)-strongly convex.

Assumption 2

\(\exists \gamma _u\), \(\gamma _b\) with \(0\le \gamma _l\le \gamma _u < \infty \), \(0<\gamma _a\le \gamma _b < \infty \), such that the gradient of \({\mathcal {L}}(\varvec{\theta })\) is \((\gamma _u+\lambda \gamma _b)\)-Lipschitz continuous \(\forall \varvec{x} \in {{\mathbb {R}}}^{n_p}, \varvec{y} \in {{\mathbb {R}}}^d\). That is, \(\forall \varvec{x} \in {{\mathbb {R}}}^{n_p}, \forall \varvec{y} \in {{\mathbb {R}}}^d\), the gradient \(\partial _{\varvec{\theta }}{\mathcal {L}}(\varvec{\theta })=\partial _{\varvec{\theta }}g(\varvec{\theta }) + \lambda \partial _{\varvec{\theta }}h(\varvec{\theta })\) satisfies

$$\begin{aligned} \left\| \partial _{\varvec{\theta }} {\mathcal {L}}(\varvec{y}, f(\varvec{\theta _1};\varvec{x})) - \partial _{\varvec{\theta }} {\mathcal {L}}(\varvec{y}, f(\varvec{\theta _2};\varvec{x}))\right\| \le (\gamma _u+\lambda \gamma _b) \left\| \varvec{\theta _1} - \varvec{\theta _2}\right\| , \end{aligned}$$
(2)

for any \(\varvec{\theta _1},\varvec{\theta _2} \in {{\mathbb {R}}}^{n_w}\).

Assumption 3

\(\exists \gamma _g, \gamma _h\) with \(0<\gamma _g,\gamma _h<\infty \) such that \(\forall \varvec{x} \in {{\mathbb {R}}}^{n_p}, \forall \varvec{y} \in {{\mathbb {R}}}^d\), the second derivatives of \(g(\varvec{\theta })\) and \(h(\varvec{\theta })\) respectively satisfy

$$\begin{aligned}{} & {} \left\| \partial ^2 g(\varvec{y}, f(\varvec{\theta _1};\varvec{x})) - \partial ^2 g(\varvec{y}, f(\varvec{\theta _2};\varvec{x}))\right\| \le \gamma _g\left\| \varvec{\theta _1} - \varvec{\theta _2}\right\| , \end{aligned}$$
(3a)
$$\begin{aligned}{} & {} \left\| \partial ^2 h(\varvec{y}, f(\varvec{\theta _1};\varvec{x})) - \partial ^2 h(\varvec{y}, f(\varvec{\theta _2};\varvec{x}))\right\| \le \gamma _h\left\| \varvec{\theta _1} - \varvec{\theta _2}\right\| , \end{aligned}$$
(3b)

for any \(\varvec{\theta _1},\varvec{\theta _2} \in {{\mathbb {R}}}^{n_w}\).

Commonly used loss functions such as the squared loss and the sigmoidal cross-entropy loss are twice differentiable and (strongly) convex in the model variables. Certain smoothed sparsity-inducing penalties, such as the (pseudo-)Huber function presented later in this paper, are examples of functions that may be well-suited for the regularizer \(h(\varvec{\theta })\) defined above.
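As a concrete illustration, the following minimal sketch evaluates the objective (1) with a squared fit loss \(g(\varvec{\theta })\) and a separable pseudo-Huber regularizer \(h(\varvec{\theta })\) for a simple linear model; the model choice, the data, and all names are illustrative assumptions rather than the setup used in the experiments.

```python
# Minimal sketch of the objective (1): squared fit loss g(theta) plus a separable
# pseudo-Huber regularizer h(theta), for an illustrative linear model
# f(theta; x) = <x, theta> (d = 1). Model choice, data, and names are assumptions.
import numpy as np

def objective(theta, X, Y, lam=0.1, mu=1.0):
    Y_hat = X @ theta                               # model outputs y_hat_n, n = 1..N
    g = 0.5 * np.sum((Y - Y_hat) ** 2)              # sum_n l(y_n, y_hat_n)
    h = np.sum(np.sqrt(mu**2 + theta**2) - mu)      # sum_j r_j(theta_j) (pseudo-Huber)
    return g + lam * h

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))                  # N = 100 samples, n_p = n_w = 20
Y = X @ rng.standard_normal(20)
print(objective(np.zeros(20), X, Y))
```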

The assumptions of strong convexity and smoothness of the objective \({\mathcal {L}}(\varvec{\theta })\) are standard in many optimization problems, as they help to characterize the convergence properties of the underlying solution method [30, 31]. However, the smoothness assumption is sometimes not satisfied in multi-objective (or regularized) problems where a non-smooth (penalty-inducing) function \(h(\varvec{\theta })\) is used (see the recent work [32]). In such a case, and when the need to incorporate second-order information arises, a well-known approach in the optimization literature is either to approximate the non-smooth objectives by a smooth quadratic function (when such an approximation is available) or to use a “proximal splitting” method and replace the 2-norm in this setting with the Q-norm, where Q is the Hessian matrix or its approximation [33]. In [33], the authors propose two techniques that help to avoid the complexity that the latter approach often introduces in the subproblems. While proposing new approaches, [34] highlights some popular techniques for handling non-differentiability. Each of these works highlights the importance of incorporating second-order information in the solution techniques of optimization problems. By conveniently solving the optimization problem (1) under the assumptions made above, our method ensures that the full curvature information is captured while reducing computational overhead.

2.2 Approximate Newton scheme

Given the current value of \(\varvec{\theta }\), the (Gauss–)Newton method computes an update to \(\varvec{\theta }\) via

$$\begin{aligned} \varvec{\theta } \leftarrow \varvec{\theta } - \rho \varvec{G}, \end{aligned}$$
(4)

where \(\rho \) is the step size, \(\varvec{G}=\varvec{H}^{-1}\partial {\mathcal {L}}(\varvec{\theta })\), and \(\varvec{H}\) is the Hessian of \({\mathcal {L}}\) or its approximation. In this work, we consider the generalized Gauss–Newton (GGN) approximation of \(\varvec{H}\), which we now define in terms of the function \(g(\varvec{\theta })\). This approximation and its detailed expression motivate the modified version introduced in the next section, which includes the regularization function \(h(\varvec{\theta })\).

Definition 1

(Generalized Gauss–Newton Hessian) Let \(\textrm{diag}(\varvec{q}_n)\in \mathbb {R}^{d\times d}\) be the second derivative of the loss function \(\ell (\varvec{y}_n,\hat{\varvec{y}}_n)\) with respect to the predictor \(\hat{\varvec{y}}_n\), \(\varvec{q}_n = \partial _{\hat{\varvec{y}}_n\hat{\varvec{y}}_n}^2 \ell (\varvec{y}_n,\hat{\varvec{y}}_n)\) for \(n=1,2,\ldots ,N\), and let \(\varvec{Q}_g\in \mathbb {R}^{dN\times dN}\) be a block diagonal matrix with \(\varvec{q}_n\) being the n-th diagonal block. Let \(\varvec{J}_n\in \mathbb {R}^{d\times n_w}\) denote the Jacobian of \(\hat{\varvec{y}}_n\) with respect to \(\varvec{\theta }\) for \(n=1,2,\ldots ,N\), and let \(\varvec{J}_g\in \mathbb {R}^{dN\times n_w}\) be the vertical concatenation of all \(\varvec{J}_n\)’s. Then, the generalized Gauss–Newton (GGN) approximation of the Hessian matrix \(\varvec{H}_g\in \mathbb {R}^{n_w\times n_w}\) associated with the fit loss \(\ell (\varvec{y}_n,\hat{\varvec{y}}_n)\) with respect to \(\varvec{\theta }\) is defined by

$$\begin{aligned} \varvec{H}_g\approx \varvec{J}_g^T\varvec{Q}_g \varvec{J}_g = \sum _{n=1}^{N}\varvec{J}_n^T\textrm{diag}(\varvec{q}_n)\varvec{J}_n. \end{aligned}$$
(5)

Let \(\varvec{e}_n\in \mathbb {R}^{d}\) be the Jacobian of the fit loss defined by \(\varvec{e}_n=\partial _{\hat{\varvec{y}}_n}\ell (\varvec{y}_n,\hat{\varvec{y}}_n)\) for \(n=1,2,\ldots ,N\). For example, in the case of the squared loss \(\ell (\varvec{y}_n,\hat{\varvec{y}}_n)=\frac{1}{2}\Vert \varvec{y}_n-\hat{\varvec{y}}_n\Vert ^2\), \(\varvec{e}_n\) is the residual \(\varvec{e}_n=\hat{\varvec{y}}_n-\varvec{y}_n\). Let \(\varvec{e}_g\in \mathbb {R}^{dN}\) be the vertical concatenation of all \(\varvec{e}_n\)’s. Then, using the chain rule, we write

$$\begin{aligned} \varvec{J}_n^T\varvec{e}_n&= \begin{bmatrix} \partial _{\theta _1}\hat{y}^{(1)} &{} \partial _{\theta _1}\hat{y}^{(2)} &{} \cdots &{} \partial _{\theta _1}\hat{y}^{(d)}\\ \partial _{\theta _2}\hat{y}^{(1)} &{} \partial _{\theta _2}\hat{y}^{(2)} &{} \cdots &{} \partial _{\theta _2}\hat{y}^{(d)}\\ \vdots &{}\vdots &{}&{}\vdots \\ \partial _{\theta _{n_w}}\hat{y}^{(1)} &{} \partial _{\theta _{n_w}}\hat{y}^{(2)} &{} \cdots &{} \partial _{\theta _{n_w}}\hat{y}^{(d)} \end{bmatrix} \begin{bmatrix} \partial _{\hat{y}}\ell ^{(1)}\\ \partial _{\hat{y}}\ell ^{(2)}\\ \vdots \\ \partial _{\hat{y}}\ell ^{(d)} \end{bmatrix} \nonumber \\&= \begin{bmatrix} \partial _{\theta _1}\ell ^{(1)}+\partial _{\theta _1}\ell ^{(2)}+\cdots +\partial _{\theta _1}\ell ^{(d)}\\ \partial _{\theta _2}\ell ^{(1)}+\partial _{\theta _2}\ell ^{(2)}+\cdots +\partial _{\theta _2}\ell ^{(d)}\\ \vdots \\ \partial _{\theta _{n_w}}\ell ^{(1)}+\partial _{\theta _{n_w}}\ell ^{(2)}+\cdots +\partial _{\theta _{n_w}}\ell ^{(d)} \end{bmatrix} \nonumber \\&= \left[ \sum _{i=1}^{d}\partial _{\theta _1}\ell ^{(i)} \quad \sum _{i=1}^{d}\partial _{\theta _2}\ell ^{(i)} \quad \cdots \quad \sum _{i=1}^{d}\partial _{\theta _{n_w}}\ell ^{(i)} \right] ^T, \end{aligned}$$
(6)

and

$$\begin{aligned} \varvec{g}_g(\varvec{\theta })=\varvec{J}_g^T\varvec{e}_g=\sum _{n=1}^{N} \varvec{J}_n^T\varvec{e}_n. \end{aligned}$$
(7)

As noted in [35], the GGN approximation has the advantage of capturing the curvature information of \(\ell \) in \(g(\varvec{\theta })\) through the term \(\varvec{Q}_g\), as opposed to the FIM, for example, which ignores the full second-order interactions. While it may be obvious, say when training a deep neural network with certain loss functions, how the GGN approximation can be exploited to simplify expressions for \(\varvec{H}_g\) (see e.g. [36]), a modification is required to account for a twice-differentiable convex regularization function while retaining some degree of simplicity and elegance. We derive such a modification for the mini-batch scheme presented in the next section, which includes the derivatives of \(h(\varvec{\theta })\) in the GGN approximation of \(\varvec{H}_g\). This modification leads to our GGN-SCORE algorithm in Sect. 4.
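Before introducing that modification, a minimal sketch of the unregularized GGN quantities of Definition 1 and (7) is given below for a linear multi-output model with the squared fit loss (so that \(\varvec{Q}_g\) is the identity and \(\varvec{e}_n=\hat{\varvec{y}}_n-\varvec{y}_n\)); the model, data, and names are illustrative assumptions.

```python
# Sketch of the GGN Hessian (5) and gradient (7), assuming a linear model
# y_hat_n = W x_n with theta = vec(W) (row-major) and the squared fit loss.
import numpy as np

rng = np.random.default_rng(0)
N, d, n_p = 50, 3, 8
n_w = d * n_p
X = rng.standard_normal((N, n_p))
Y = rng.standard_normal((N, d))
W = rng.standard_normal((d, n_p))

# Per-sample Jacobians J_n = d y_hat_n / d theta; for y_hat = W x, J_n = I_d (x) x_n^T.
J_list = [np.kron(np.eye(d), x_n) for x_n in X]
J_g = np.vstack(J_list)                      # (d*N) x n_w
e_g = (X @ W.T - Y).ravel()                  # stacked residuals e_n, length d*N
Q_g = np.eye(d * N)                          # q_n = 1 for the squared loss

H_g = J_g.T @ Q_g @ J_g                      # GGN approximation (5)
g_g = J_g.T @ e_g                            # gradient of g(theta), eq. (7)
print(H_g.shape, g_g.shape)                  # (n_w, n_w), (n_w,)
```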

3 Second-order (pseudo-online) optimization

Suppose that at each mini-batch step k we uniformly select a random index set \({\mathcal {I}}_k\subseteq \{1,2,\ldots ,N\}\), \(\vert {\mathcal {I}}_k\vert =m \le N\) (usually \(m\ll N\)), to access a mini-batch of m samples from the training set. The loss derivatives used for the recursive update of the variables \(\varvec{\theta }\) are computed at each step k and are estimated as running averages over the batch-wise computations. This leads to a stochastic approximation of the true derivatives at each iteration, for which we assume unbiased estimates.

The problem of finding the optimal adjustment \(\delta \varvec{\theta }_{mk} :=\varvec{\theta }_{k+1}-\varvec{\theta }_k\) that solves (1) results in solving either an overdetermined or an underdetermined linear system, depending on whether \(dm\ge n_w\) or \(dm < n_w\), respectively. Consider, for example, the squared fit loss and the penalty-inducing squared norm as the scalar-valued functions \(g(\varvec{\theta })\) and \(h(\varvec{\theta })\), respectively, in (1). Then \(\varvec{Q}_g\) is the identity matrix, and the LM adjustment \(\delta \varvec{\theta }\) [37, 38] is estimated at each iteration k according to the rule:

$$\begin{aligned} \delta \varvec{\theta } = - (\varvec{H}_g+\lambda \varvec{I})^{-1}\varvec{g}_g = -(\varvec{J}_g^T\varvec{J}_g+\lambda \varvec{I})^{-1}\varvec{J}_g^T\varvec{e}_g. \end{aligned}$$
(8)

If \(dm < n_w\) (possibly \(dm\ll n_w\)), then by using the Searle identity \((\varvec{A}\varvec{B} + \lambda \varvec{I})\varvec{A} = \varvec{A}(\varvec{B}\varvec{A} + \lambda \varvec{I})\) [39], we can conveniently update the adjustment \(\delta \varvec{\theta }\) by

$$\begin{aligned} \delta \varvec{\theta } = -\varvec{J}_g^T(\varvec{J}_g\varvec{J}_g^T+\lambda \varvec{I})^{-1}\varvec{e}_g. \end{aligned}$$
(9)

Clearly, this provides a more computationally efficient way of solving for \(\delta \varvec{\theta }\). In what follows, we formulate a generalized solution method for the regularized problem (1) that similarly exploits the structure of the Hessian matrix, thereby computing the adjustment \(\delta \varvec{\theta }\) conveniently and efficiently.
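The computational saving can be checked numerically: on random illustrative data with \(dm < n_w\), the step (8) obtained from an \(n_w\times n_w\) solve coincides with the step (9) obtained from a \(dm\times dm\) solve. The sketch below uses synthetic data and illustrative names only.

```python
# Sketch checking that the LM step (8) and its Searle-identity form (9) agree
# when d*m < n_w; only random illustrative data is used here.
import numpy as np

rng = np.random.default_rng(1)
dm, n_w, lam = 30, 200, 0.1
J = rng.standard_normal((dm, n_w))
e = rng.standard_normal(dm)

step_8 = -np.linalg.solve(J.T @ J + lam * np.eye(n_w), J.T @ e)  # n_w x n_w solve
step_9 = -J.T @ np.linalg.solve(J @ J.T + lam * np.eye(dm), e)   # dm x dm solve
print(np.allclose(step_8, step_9))                               # True up to round-off
```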

Taking the second-order approximation of \({\mathcal {L}}(\varvec{\theta })\), we have

$$\begin{aligned} {\mathcal {L}}(\varvec{\theta } + \delta \varvec{\theta })&\approx {\mathcal {L}}(\varvec{\theta }) + \varvec{g}^T\delta \varvec{\theta } + \frac{1}{2}\delta \varvec{\theta }^T \varvec{H}\delta \varvec{\theta }, \end{aligned}$$
(10)

where \(\varvec{H}\in {{\mathbb {R}}}^{n_w\times n_w}\) is the Hessian of \({\mathcal {L}}(\varvec{\theta })\) and \(\varvec{g}\in {{\mathbb {R}}}^{n_w}\) is its gradient. Let \(M=dm+1\) and define the Jacobian \(\varvec{J}\in {{\mathbb {R}}}^{M\times n_w}\):

$$\begin{aligned} \varvec{J}^T&= \begin{bmatrix} \partial _{\theta _1}\hat{\varvec{y}}_1 &{} \partial _{\theta _1}\hat{\varvec{y}}_2 &{} \cdots &{} \partial _{\theta _1}\hat{\varvec{y}}_m &{} \lambda \partial _{\theta _1}r_1\\ \partial _{\theta _2}\hat{\varvec{y}}_1 &{} \partial _{\theta _2}\hat{\varvec{y}}_2 &{} \cdots &{} \partial _{\theta _2}\hat{\varvec{y}}_m &{} \lambda \partial _{\theta _2}r_2\\ \vdots &{}\vdots &{}&{}\vdots &{}\vdots \\ \partial _{\theta _{n_w}}\hat{\varvec{y}}_1 &{} \partial _{\theta _{n_w}}\hat{\varvec{y}}_2 &{} \cdots &{} \partial _{\theta _{n_w}}\hat{\varvec{y}}_m &{} \lambda \partial _{\theta _{n_w}}r_{n_w} \end{bmatrix}. \end{aligned}$$
(11)

Let \(\varvec{e}_n\in {{\mathbb {R}}}^d\), \(n=1,2,\ldots ,m\), be as defined above, let the last entry be \(\varvec{e}_M=1\), and denote by \(\varvec{e}\in {{\mathbb {R}}}^M\) the vertical concatenation of these quantities. Then by using the chain rule as in (6) and (7), we obtain

$$\begin{aligned} \varvec{g}(\varvec{\theta }) = \partial _{\varvec{\theta }}{\mathcal {L}}(\varvec{\theta }) = \varvec{J}^T\varvec{e}. \end{aligned}$$
(12)

Let \(\varvec{q}_n=\partial _{\hat{\varvec{y}}_n\hat{\varvec{y}}_n}^2\ell (\varvec{y}_n,\hat{\varvec{y}}_n)\), \(\varvec{q}_n\in {{\mathbb {R}}}^d\) for \(n=1,2,\ldots ,m\) (clearly \(\varvec{q}_n=\varvec{1}\) in the case of squared fit loss terms), and set the last entry, corresponding to the regularizer column of \(\varvec{J}\), to zero. Define \(\varvec{Q}\in {{\mathbb {R}}}^{M\times M}\) as the diagonal matrix with these entries on its diagonal, where \(\varvec{Q}=\varvec{Q}^T\succeq \varvec{0}\) by convexity of \(\ell \). Consider the following slightly modified GGN approximation of the Hessian \(\varvec{H}\in {{\mathbb {R}}}^{n_w\times n_w}\) associated with \({\mathcal {L}}(\varvec{\theta })\):

$$\begin{aligned} \varvec{H}\approx \varvec{J}^T\varvec{Q}\varvec{J}+\lambda \varvec{H}_h, \end{aligned}$$
(13)

where \(\varvec{H}_h\) is the Hessian of the regularization term \(h(\varvec{\theta })\), \(\varvec{H}_h\in {{\mathbb {R}}}^{n_w \times n_w}\), and is a diagonal matrix whose diagonal terms are

$$\begin{aligned} {\varvec{H}_h}_{jj}=\frac{d^2r_j(\theta _j)}{d\theta _j^2},\ j=1,\ldots ,n_w. \end{aligned}$$

We retain the notation \(\varvec{H}\) for the modified GGN approximation of the full Hessian matrix \(\partial ^2{\mathcal {L}}\). By differentiating (10) with respect to \(\delta \varvec{\theta }\) and equating to zero, we obtain the optimal adjustment

$$\begin{aligned} \delta \varvec{\theta } = -(\varvec{J}^T\varvec{Q}\varvec{J} + \lambda \varvec{H}_h )^{-1}\varvec{J}^T\varvec{e}. \end{aligned}$$
(14)

Remark 1

The inverse matrix in (14) exists due to the strong convexity assumption on the regularization function h, which gives \(\varvec{H}_h\succeq (\min _{j}{\varvec{H}_h}_{jj})\varvec{I}\succ \varvec{0}\); therefore, the matrix \(\varvec{J}^T\varvec{Q}\varvec{J} + \lambda \varvec{H}_h\) is symmetric positive definite and hence invertible.

Let \(\varvec{U}=\varvec{J}^T\varvec{Q}\). Using the identity [40, 41]

$$\begin{aligned} (\varvec{D}-\varvec{V}\varvec{A}^{-1}\varvec{B})^{-1}\varvec{V}\varvec{A}^{-1} = \varvec{D}^{-1}\varvec{V}(\varvec{A}-\varvec{B}\varvec{D}^{-1}\varvec{V})^{-1} \end{aligned}$$
(15)

with \(\varvec{D}=\lambda \varvec{H}_h\), \(\varvec{V}=-\varvec{J}^T\), \(\varvec{A}=\varvec{I}_M\), and \(\varvec{B}=\varvec{U}^T\), and recalling that \(\varvec{Q}\) is symmetric, from (14) we get

$$\begin{aligned} \delta \varvec{\theta }&= \left( \lambda \varvec{H}_h-(-\varvec{J}^T)\varvec{I}_M \varvec{U}^T\right) ^{-1}(-\varvec{J}^T)\varvec{I}^{-1}_M\varvec{e}\nonumber \\&=-\frac{1}{\lambda }\varvec{H}_h^{-1}\varvec{J}^T\left( \varvec{I}_M+\varvec{U}^T\frac{1}{\lambda }\varvec{H}_h^{-1} \varvec{J}^T\right) ^{-1}\varvec{e}\nonumber \\&=-\varvec{H}_h^{-1}\varvec{J}^T\left( \lambda \varvec{I}_M+\varvec{Q}\varvec{J} \varvec{H}_h^{-1} \varvec{J}^T\right) ^{-1}\varvec{e}. \end{aligned}$$
(16)

Remark 2

When combined with a second identity, namely \(\varvec{V}\varvec{A}^{-1}(\varvec{A}-\varvec{B}\varvec{D}^{-1}\varvec{V}) = (\varvec{D} -\varvec{V}\varvec{A}^{-1}\varvec{B})\varvec{D}^{-1}\varvec{V}\), one can directly derive from (15) the Woodbury identity [42] \( (\varvec{A}+\varvec{U}\varvec{B}\varvec{V})^{-1} = \varvec{A}^{-1} - \varvec{A}^{-1}\varvec{U}(\varvec{B}^{-1}+\varvec{V}\varvec{A}^{-1}\varvec{U})^{-1}\varvec{V}\varvec{A}^{-1}\), or, in the special case \(\varvec{B} = -\varvec{D}^{-1}\), \( (\varvec{A}-\varvec{U}\varvec{D}^{-1}\varvec{V})^{-1} = \varvec{A}^{-1} + \varvec{A}^{-1}\varvec{U}(\varvec{D}-\varvec{V}\varvec{A}^{-1}\varvec{U})^{-1}\varvec{V}\varvec{A}^{-1}\). Using the Woodbury identity would not, in fact, structurally yield the closed-form update step (16) in an exact sense. Our construction involves a regularization function that is more general than the commonly used squared norm, whose Hessian is a multiple of the identity matrix and for which the Woodbury identity would be equally useful.

Compared to (14), the clear advantage of the form (16) is that it requires the factorization of an \(M\times M\) matrix rather than an \(n_w\times n_w\) matrix, while the term \(\varvec{H}_h^{-1}\) can be obtained conveniently by exploiting its diagonal structure. Given these modifications, we proceed by making an assumption that characterizes the residual \(\varvec{e}\) and the Jacobian \(\varvec{J}\) in the region of convergence in which we assume the starting point \(\varvec{\theta }_0\) of the process (16) lies.
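The equivalence of (14) and (16), and the saving obtained by solving only an \(M\times M\) system while inverting the diagonal \(\varvec{H}_h\) elementwise, can be verified on random data as in the following sketch (all data and names are illustrative assumptions).

```python
# Sketch verifying that the update (16) matches the direct form (14) while only
# requiring an M x M solve and the elementwise inverse of the diagonal H_h.
import numpy as np

rng = np.random.default_rng(2)
M, n_w, lam = 40, 500, 0.1
J = rng.standard_normal((M, n_w))
e = rng.standard_normal(M)
q = np.abs(rng.standard_normal(M))              # diagonal of Q >= 0 (convex loss)
h_diag = 1.0 + np.abs(rng.standard_normal(n_w)) # diagonal of H_h > 0 (strong convexity)

# (14): n_w x n_w solve
d14 = -np.linalg.solve(J.T @ np.diag(q) @ J + lam * np.diag(h_diag), J.T @ e)

# (16): elementwise H_h^{-1} and an M x M solve
Hinv_JT = J.T / h_diag[:, None]                 # H_h^{-1} J^T
d16 = -Hinv_JT @ np.linalg.solve(lam * np.eye(M) + (q[:, None] * J) @ Hinv_JT, e)

print(np.allclose(d14, d16))                    # True up to round-off
```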

Let \(\varvec{\theta }^*\) be a nondegenerate minimizer of \({\mathcal {L}}\), and define \({\mathcal {B}}_\epsilon (\varvec{\theta }^*):=\{\varvec{\theta }_k\in {{\mathbb {R}}}^{n_w}:\Vert \varvec{\theta }_k - \varvec{\theta }^*\Vert \le \epsilon \}\), a closed ball of a sufficiently small radius \(\epsilon \ge 0\) about \(\varvec{\theta }^*\). We denote by \({\mathcal {N}}_{\epsilon }(\varvec{\theta }^*)\) an open neighbourhood of the sublevel set \(\Gamma ({\mathcal {L}}):=\{\varvec{\theta }_k:{\mathcal {L}}(\varvec{\theta }_k)\le {\mathcal {L}}(\varvec{\theta }_0)\}\), so that \({\mathcal {B}}_\epsilon (\varvec{\theta }^*)=\textrm{cl}({\mathcal {N}}_{\epsilon }(\varvec{\theta }^*))\). We then have \({\mathcal {N}}_{\epsilon }(\varvec{\theta }^*):=\{\varvec{\theta }_k\in {{\mathbb {R}}}^{n_w}:\Vert \varvec{\theta }_k - \varvec{\theta }^*\Vert < \epsilon \}\).

Assumption 4

  1. (i)

    Each \(\varvec{e}_n(\varvec{\theta }_k)\) and each \(\varvec{q}_n(\varvec{\theta }_k)\) is Lipschitz smooth, and \(\forall \varvec{\theta }_k\in {\mathcal {N}}_{\epsilon }(\varvec{\theta }^*)\) there exists \(\nu >0\) such that \(\Vert \varvec{J}(\varvec{\theta }_k)\varvec{z}\Vert \ge \nu \Vert \varvec{z}\Vert \).

  2. (ii)

    \(\lim _{k\rightarrow \infty }\left\| \mathbb {E}_m[\varvec{H}(\varvec{\theta }_k)] - \partial ^2{\mathcal {L}}(\varvec{\theta }_k)\right\| =0\) almost surely whenever \(\lim _{k\rightarrow \infty } \Vert \varvec{g}_{{\mathcal {L}}}(\varvec{\theta }_k)\Vert = 0\), \(\forall \varvec{\theta }_k\in {\mathcal {N}}_{\epsilon }(\varvec{\theta }^*)\), where \(\mathbb {E}_m[\cdot ]\) denotes expectation with respect to m.

Remark 3

Assumption 4(i) implies that the singular values of \(\varvec{J}\) are uniformly bounded away from zero and that \(\exists \beta ,\tilde{\beta }>0\) such that \(\Vert \varvec{e}\Vert \le \beta \) and \(\Vert \varvec{J}\Vert =\Vert \varvec{J}^T\Vert \le \tilde{\beta }\). Then, since \(\varvec{Q}\succeq \varvec{0}\), \(\exists K_1\) such that \(\varvec{Q}\preceq K_1\varvec{I}\), and hence \(\Vert \lambda \varvec{I}+\varvec{Q}\varvec{J}\varvec{H}_h^{-1}\varvec{J}^T\Vert \le \lambda +(K/\gamma _a)\), where \(K=K_1\tilde{\beta }^2\). Note that although we use limits in Assumption 4(ii), the assumption similarly holds in expectation by unbiasedness. Also, a sufficiently large sample size m may be required for Assumption 4(ii) to hold, by the law of large numbers; see e.g. [43, Lemma 1, Lemma 2].

Remark 4

\(\varvec{H}_g(\varvec{\theta })\) and \(\varvec{H}_h(\varvec{\theta })\) satisfy

$$\begin{aligned}{} & {} \gamma _l\varvec{I}_{n_w}\preceq \varvec{H}_g(\varvec{\theta }_k)\preceq \gamma _u \varvec{I}_{n_w}, \quad \left\| \varvec{H}_g(\varvec{y}, f(\varvec{\theta _1};\varvec{x})) - \varvec{H}_g(\varvec{y}, f(\varvec{\theta _2};\varvec{x}))\right\| \le \gamma _g\left\| \varvec{\theta _1} - \varvec{\theta _2}\right\| , \\{} & {} \gamma _a\varvec{I}_{n_w}\preceq \varvec{H}_h(\varvec{\theta }_k)\preceq \gamma _b \varvec{I}_{n_w}, \quad \left\| \varvec{H}_h(\varvec{y}, f(\varvec{\theta _1};\varvec{x})) - \varvec{H}_h(\varvec{y}, f(\varvec{\theta _2};\varvec{x}))\right\| \le \gamma _h\left\| \varvec{\theta _1} - \varvec{\theta _2}\right\| , \end{aligned}$$

at any point \(\varvec{\theta }_k\in {{\mathbb {R}}}^{n_w}\), for any \(\varvec{x}\in {{\mathbb {R}}}^{n_p},\varvec{y}\in {{\mathbb {R}}}^d\), and for any \(\varvec{\theta _1}, \varvec{\theta _2}\in {\mathcal {N}}_\epsilon (\varvec{\theta }^*)\) where Assumption 4 holds.

We now state a convergence result for the update type (14) (and hence (16)). First, we define the second-order optimality condition and state two useful lemmas.

Definition 2

(Second-order sufficiency condition (SOSC)) Let \(\varvec{\theta }^*\) be a local minimum of a twice-differentiable function \({\mathcal {L}}(\cdot )\). The second-order sufficiency condition (SOSC) holds if

$$\begin{aligned} \partial {\mathcal {L}}(\varvec{\theta }^*) = 0, \quad \partial ^2 {\mathcal {L}}(\varvec{\theta }^*) \succ 0. \end{aligned}$$
(SOSC)

Lemma 1

( [14, Theorem 1.2.3]) Suppose that Assumption 3 holds. Let \(\mathbb {W}\subseteq {{\mathbb {R}}}^{n_w}\) be a closed and convex set on which \({\mathcal {L}}(\varvec{\theta })\) is twice-continuously differentiable. Let \(S\subset \mathbb {W}\) be an open set containing some \(\varvec{\theta }^*\), and suppose that \({\mathcal {L}}(\varvec{\theta }^*)\) satisfies (SOSC). Then, there exists \({\mathcal {L}}^* = {\mathcal {L}}(\varvec{\theta }^*)\) satisfying

$$\begin{aligned} {\mathcal {L}}(\varvec{\theta }_k)>{\mathcal {L}}^* \quad \forall \varvec{\theta }_k \in S. \end{aligned}$$
(18)

Lemma 2

The adjustment \(\delta \varvec{\theta }\) given by (14) (and hence (16)) provides a descent direction for the total loss \({\mathcal {L}}(\varvec{\theta }_k)\) in (1) at the k-th oracle call.

Remark 5

Lemma 1, Lemma 2 and the first part of (SOSC) ensure that the second part of Assumption 4(ii) always holds. In essence, it holds at every point \(\varvec{\theta }_k\) of the sequence \(\{\varvec{\theta }_k\}\) generated by the process (16) as long as we choose a starting point \(\varvec{\theta _0}\in {\mathcal {B}}_\epsilon (\varvec{\theta }^*)\).

Theorem 3

Suppose that Assumptions 123 and 4 hold, and that \(\varvec{\theta }^*\) is a local minimizer of \({\mathcal {L}}(\varvec{\theta })\) for which the assumptions in Lemma 1 hold. Let \(\{\varvec{\theta }_k\}\) be the sequence generated by the process (16). Then starting from a point \(\varvec{\theta }_0\in {\mathcal {N}}_{\epsilon }(\varvec{\theta }^*)\), \(\{\varvec{\theta }_k\}\) converges at a Q-quadratic rate. Namely:

$$\begin{aligned} \left\| \varvec{\theta }_{k+1}-\varvec{\theta }^*\right\| \le \xi _k\left\| \varvec{\theta }_k-\varvec{\theta }^*\right\| ^2, \end{aligned}$$

where

$$\begin{aligned} \xi _k = \frac{1}{2}\frac{\gamma _g + b\gamma _h}{\left( \gamma _l+a\gamma _a - (\gamma _g+b\gamma _h)\left\| \varvec{\theta }_k-\varvec{\theta }^*\right\| \right) }. \end{aligned}$$

The proofs of these results are reported in “Appendix A” and “Appendix B”.

4 Self-concordant regularization (SCORE)

In the case \(dm < n_w\), one could deduce by mere visual inspection that the matrix \(\varvec{H}_h^{-1}\) in (16) plays an important role in the perturbation of \(\varvec{\theta }\) within the region of convergence, owing to its “dominating” structure in the equation. This may well be true. But beyond this naive view, let us develop a more technical intuition about the update step (16), using an analogy similar to that given in [14, Chapter 5]. We have

$$\begin{aligned} \varvec{\theta }_{k+1}=\varvec{\theta }_k-\varvec{H}_h^{-1}\varvec{J}^T\left( \lambda \varvec{I}+\varvec{Q}\varvec{J} \varvec{H}_h^{-1} \varvec{J}^T\right) ^{-1}\varvec{e}, \end{aligned}$$
(19)

where \(\varvec{I}\) is the identity matrix of suitable dimension. By some simple algebra (see e.g., [14, Lemma 5.1.1]), one can show that indeed the update method (19) is affine-invariant. In other words, the region of convergence \({\mathcal {N}}_{\epsilon }(\varvec{\theta }^*)\) does not depend on the problem scaling but on the local topological structure of the minimization functions \(g(\varvec{\theta })\) and \(h(\varvec{\theta })\) [14].

Consider the non-augmented form of (19): \(\varvec{\theta }_{k+1}=\varvec{\theta }_k-\varvec{Q}_g\varvec{J}_g^T\varvec{G}_g^{-1}\varvec{e}_g\), where \(\varvec{G}_g = \varvec{J}_g\varvec{J}_g^T\) is the so-called Gram matrix. It has been shown that, when learning an overparameterized model (sample size smaller than the number of variables), and as long as we start from an initial point \(\varvec{\theta }_0\) close enough to the minimizer \(\varvec{\theta }^*\) of \({\mathcal {L}}(\varvec{\theta })\) (assuming such a minimizer exists), both \(\varvec{J}_g\) and \(\varvec{G}_g\) remain stable throughout the learning process (see e.g., [17, 44, 45]). The term \(\varvec{Q}_g\) may have little to no effect on this behaviour; for example, in the case of the squared fit loss, \(\varvec{Q}_g\) is simply the identity. However, the original equation (19) involves the Hessian matrix \(\varvec{H}_h\) which, together with its bounds (namely, \(\gamma _a\), \(\gamma _b\) characterizing the region of convergence), changes rapidly when we scale the problem [14]. It therefore suffices to impose an additional assumption on the function \(h(\varvec{\theta })\) that helps to control the rate at which its second derivative \(\varvec{H}_h\) changes, thereby enabling us to characterize an affine-invariant structure for the region of convergence. Namely:

Assumption 5

(SCORE) The scaled regularization function \(\lambda h(\varvec{\theta })\) is three times differentiable and \(M_h\)-self-concordant. That is, the inequality

$$\begin{aligned} \left| \left<\varvec{u},\left( \partial ^3h(\varvec{\theta })[\varvec{u}]\right) \varvec{u}\right>\right| \le 2M_h\left<\varvec{u},\partial ^2h(\varvec{\theta })\varvec{u}\right>^{3/2}, \end{aligned}$$
(20)

where \(M_h\ge 0\), holds for any \(\varvec{\theta }\) in the closed convex set \(\mathbb {W}\subseteq {{\mathbb {R}}}^{n_w}\) and \(\varvec{u}\in {{\mathbb {R}}}^{n_w}\).

Here, \(\partial ^3h(\varvec{\theta })[\varvec{u}]\in {{\mathbb {R}}}^{n_w \times n_w}\) denotes the limit

$$\begin{aligned} \partial ^3h(\varvec{\theta })[\varvec{u}] :=\lim \limits _{t\rightarrow 0}\frac{1}{t}\left( \partial ^2h(\varvec{\theta }+t\varvec{u})-\partial ^2h(\varvec{\theta })\right) , \quad t \in {{\mathbb {R}}}. \end{aligned}$$

For a detailed analysis of self-concordant functions, we refer the interested reader to [14, 25].
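For intuition, inequality (20) can be checked numerically via the finite-difference definition of \(\partial ^3h(\varvec{\theta })[\varvec{u}]\) above. The sketch below does this for the classical self-concordant function \(h(\varvec{\theta })=-\sum _i\log \theta _i\), for which \(M_h=1\); this function is used only as an illustration and is not one of the regularizers employed later.

```python
# Sketch: finite-difference check of the self-concordance inequality (20) for
# h(theta) = -sum_i log(theta_i), a standard self-concordant function with M_h = 1.
import numpy as np

def hess(theta):                      # Hessian of -sum(log(theta)) is diag(1/theta^2)
    return np.diag(1.0 / theta**2)

rng = np.random.default_rng(3)
theta = 0.5 + rng.random(5)           # a point with theta > 0 (inside the domain)
u = rng.standard_normal(5)
t = 1e-6

D3_u = (hess(theta + t * u) - hess(theta)) / t        # ~ d^3 h(theta)[u]
lhs = abs(u @ D3_u @ u)                               # |<u, (d^3 h[u]) u>|
rhs = 2.0 * (u @ hess(theta) @ u) ** 1.5              # 2 M_h <u, d^2 h(theta) u>^{3/2}
print(lhs <= rhs)                                     # True (with M_h = 1)
```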

Given this extra assumption, and following the idea of Newton-decrement in [25], we propose to update \(\varvec{\theta }\) by

$$\begin{aligned} \delta \varvec{\theta }=-\frac{\alpha }{1+M_h\eta _k}\varvec{H}_h^{-1}\varvec{J}^T\left( \lambda \varvec{I}+\varvec{Q}\varvec{J} \varvec{H}_h^{-1} \varvec{J}^T\right) ^{-1}\varvec{e}, \end{aligned}$$
(21)

where \(\alpha >0\), \(\eta _k = \left<\varvec{g}_h,\varvec{H}_h^{-1}\varvec{g}_h\right>^{1/2}\) and \(\varvec{g}_h\) is the gradient of \(h(\varvec{\theta })\). The proposed method which we call GGN-SCORE is summarized, for one oracle call, in Algorithm 4.

Algorithm 4 GGN-SCORE (one oracle call)
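As the original listing of Algorithm 4 is not reproduced above, the following minimal sketch shows how one oracle call can implement the update (21): \(\varvec{J}\), \(\varvec{e}\) and the diagonal entries of \(\varvec{Q}\) are the mini-batch quantities of Sect. 3, and \(\varvec{g}_h\), \(\varvec{H}_h\) are the gradient and (diagonal) Hessian of the regularizer. The function signature is an assumption made here for illustration, and the factor \(\alpha /(1+M_h\eta _k)\) only approximates the adaptive step-size rule referenced later as Line 5 of Algorithm 4.

```python
# Minimal sketch (not the authors' code) of one GGN-SCORE oracle call, eq. (21).
import numpy as np

def ggn_score_step(theta, J, e, q, g_h, h_diag, lam=0.1, alpha=0.5, M_h=1.0):
    """theta: variables; J: M x n_w Jacobian (11); e: residual vector (12);
    q: diagonal of Q; g_h, h_diag: gradient and Hessian diagonal of h(theta)."""
    M = J.shape[0]
    Hinv_JT = J.T / h_diag[:, None]                    # H_h^{-1} J^T (diagonal H_h)
    A = lam * np.eye(M) + (q[:, None] * J) @ Hinv_JT   # lam*I_M + Q J H_h^{-1} J^T
    direction = -Hinv_JT @ np.linalg.solve(A, e)       # unscaled step as in (16)
    eta = np.sqrt(g_h @ (g_h / h_diag))                # eta_k = <g_h, H_h^{-1} g_h>^{1/2}
    return theta + (alpha / (1.0 + M_h * eta)) * direction   # scaled update (21)
```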

There is a wide class of self-concordant functions that meet practical requirements, either in their scaled form [14, Corollary 5.1.3] or in their original form (see e.g., [46] for a comprehensive list). Two of these are used in our experiments in Sect. 5.

In the following, we state a local convergence result for the process (21). We introduce the notation \(\Vert \Delta \Vert _{\varvec{\theta }}\) to represent the local norm of a direction \(\Delta \) taken with respect to the local Hessian of a self-concordant function h, \(\Vert \Delta \Vert _{\varvec{\theta }}:=\left<\Delta ,\partial _{\varvec{\theta }\varvec{\theta }}^2\,h(\varvec{\theta })\Delta \right>^{1/2} =\left\| \left[ \partial _{\varvec{\theta }\varvec{\theta }}^2\,h(\varvec{\theta })\right] ^{1/2}\Delta \right\| \). Hence, without loss of generality, a ball \({\mathcal {B}}_\epsilon (\varvec{\theta }^*)\) of radius \(\epsilon \) about \(\varvec{\theta }^*\) is taken with respect to the local norm for the function h in the result that follows.

Theorem 4

Let Assumptions 1234 and 5 hold, and let \(\varvec{\theta }^*\) be a local minimizer of \({\mathcal {L}}(\varvec{\theta })\) for which the assumptions in Lemma 1 hold. Let \(\{\varvec{\theta }_k\}\) be the sequence generated by Algorithm 4 with \(\alpha = \frac{\sqrt{\gamma _a}}{\beta _1}(K+\lambda \gamma _a)\), \(\beta _1=\beta \tilde{\beta }\). Then starting from a point \(\varvec{\theta }_0\in {\mathcal {N}}_{M_h^{-1}}(\varvec{\theta }^*)\), \(\{\varvec{\theta }_k\}\) converges to \(\varvec{\theta }^*\) according to the following descent properties:

$$\begin{aligned}&\mathbb {E}[{\mathcal {L}}(\varvec{\theta }_{k+1})] \le {\mathcal {L}}(\varvec{\theta }_k) - \left( \frac{\lambda }{M_h^2}\omega (\zeta _k)+\frac{\gamma _l}{2\gamma _a}\omega ''(\tilde{\zeta }_k)-\xi \right) ,\\&\mathbb {E}\Vert \varvec{\theta }_{k+1}-\varvec{\theta }^*\Vert _{\varvec{\theta }_{k+1}} \le \vartheta \Vert \varvec{\theta }_k-\varvec{\theta }^*\Vert _{\varvec{\theta }_k} + \frac{\gamma _u}{\beta _1}\Vert \varvec{\theta }_k-\varvec{\theta }^*\Vert + \frac{\gamma _g}{2}\Vert \varvec{\theta }_k-\varvec{\theta }^*\Vert ^2, \end{aligned}$$

where \(\omega (\cdot )\) is an auxiliary univariate function defined by \(\omega (t):=t-\ln (1+t)\) and has a second derivative \(\omega ''(t)=1/(1+t)^2\), and

$$\begin{aligned}&\zeta _k :=\frac{M_h}{1+M_h\eta _k}, \quad \tilde{\zeta }_k :=M_h\eta _k, \quad \xi :=\frac{2(\gamma _u+\lambda \gamma _b)}{\sqrt{\gamma _a}},\\&\vartheta :=1+\frac{\lambda }{\sqrt{\gamma _a}\beta _1(1-M_h\Vert \varvec{\theta }_k-\varvec{\theta }^*\Vert _{\varvec{\theta }_k})}. \end{aligned}$$

The proof is provided in “Appendix B”. The results of Theorem 4 combine the strong convexity and smoothness properties of both \(g(\varvec{\theta })\) and \(h(\varvec{\theta })\), and require only that \(h(\varvec{\theta })\) be self-concordant.

5 Experiments

In this section, we validate the efficiency of GGN-SCORE (Algorithm 4) in solving the problem of minimizing a regularized strongly convex quadratic function and in solving binary classification tasks. For the binary classification tasks, we use real datasets: \(\texttt {ijcnn1}\) and \(\texttt {w2a}\) from the LIBSVM repository [47] and \(\texttt {dis}\), \(\texttt {hypothyroid}\) and \(\texttt {coil2000}\) from the PMLB repository [48], with an 80:20 train:test split each. The datasets are summarized in Table 1. In each classification task, the model with a sigmoidal output \(\hat{\varvec{y}}_n\) is trained using the cross-entropy fit loss function \(\ell (\varvec{y}_n,\hat{\varvec{y}}_n) = \frac{1}{2}\sum _{n=1}^N \varvec{y}_n\log \left( \frac{1}{\hat{\varvec{y}}_n}\right) + (\varvec{1}-\varvec{y}_n)\log \left( \frac{1}{\varvec{1}-\hat{\varvec{y}}_n}\right) \), and the “deviance” residual [49] \(\varvec{e}_n = (-1)^{\varvec{y}_n+1}\sqrt{-2\left[ \varvec{y}_n\log (\hat{\varvec{y}}_n) + (\varvec{1}-\varvec{y}_n)\log (\varvec{1}-\hat{\varvec{y}}_n)\right] }\), \(\varvec{y}_n, \hat{\varvec{y}}_n\in \{\varvec{0},\varvec{1}\}\).
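For concreteness, a short sketch of the cross-entropy fit loss and the deviance residual, written directly from the formulas above, is given below; the vectorized form and names are assumptions made for illustration.

```python
# Sketch of the cross-entropy fit loss and the "deviance" residual e_n, following
# the formulas above; y holds binary labels and y_hat sigmoidal outputs in (0, 1).
import numpy as np

def cross_entropy(y, y_hat):
    return 0.5 * np.sum(y * np.log(1.0 / y_hat)
                        + (1.0 - y) * np.log(1.0 / (1.0 - y_hat)))

def deviance_residual(y, y_hat):
    sign = np.where(y > 0.5, 1.0, -1.0)          # (-1)^(y_n + 1)
    return sign * np.sqrt(-2.0 * (y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat)))

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])
print(cross_entropy(y, y_hat), deviance_residual(y, y_hat))
```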

Table 1 Real datasets: \(N =\) number of data samples, \(n_p=\) number of features, \(d=\) number of targets

We map the input space to a higher-dimensional feature space using the radial basis function (RBF) kernel \(K(\varvec{x}_n,\varvec{x}_n')=\exp (-\gamma \Vert \varvec{x}_n-\varvec{x}_n'\Vert ^2)\) with \(\gamma =0.1\). In each experiment, we use the penalty value \(\lambda =0.1\) with both the pseudo-Huber regularization \(h_\mu (\varvec{\theta })\) [50], parameterized by \(\mu >0\) [51, 52], and the \(\ell ^2\) regularization \(h_2(\varvec{\theta })\), defined respectively as

$$\begin{aligned} h_\mu (\varvec{\theta }) :=\sqrt{\mu ^2 +\varvec{\theta }^2}-\mu , \qquad h_2(\varvec{\theta }) = \Vert \varvec{\theta }\Vert _2^2 :=\sum _{i=1}^{n_w}\left| \theta _i\right| ^2, \end{aligned}$$

with coefficient \(\mu =1.0\). Throughout, we choose a batch size m of 512 for \(\texttt {w2a}\), \(\texttt {dis}\) and \(\texttt {hypothyroid}\), 2048 for \(\texttt {coil2000}\), and 4096 for \(\texttt {ijcnn1}\), unless otherwise stated. We assume a scaled self-concordant regularization so that \(M_h=1\) [14, Corollary 5.1.3].
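A minimal sketch of the two regularizers, together with the elementwise gradient and the Hessian diagonal that GGN-SCORE exploits through \(\varvec{H}_h\), is given below (names are illustrative).

```python
# Sketch of the pseudo-Huber and squared l2 regularizers with their elementwise
# gradient and Hessian diagonal (the diagonal of H_h used by GGN-SCORE).
import numpy as np

def pseudo_huber(theta, mu=1.0):
    s = np.sqrt(mu**2 + theta**2)
    return np.sum(s - mu), theta / s, mu**2 / s**3        # value, grad, Hessian diag

def l2_squared(theta):
    return np.sum(theta**2), 2.0 * theta, 2.0 * np.ones_like(theta)

theta = np.array([-2.0, 0.0, 0.5])
print(pseudo_huber(theta))
print(l2_squared(theta))
```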

5.1 GGN-SCORE for different values of \(\alpha \)

To illustrate the behaviour of GGN-SCORE for different values of \(\alpha \) in Algorithm 4 versus its value indicated in Theorem 4, we consider the problem of minimizing a regularized strongly convex quadratic function:

$$\begin{aligned} \min _{\varvec{\theta }}{\mathcal {L}}(\varvec{\theta }) :=\frac{1}{2}\varvec{\theta }^\top \hat{\varvec{Q}}\varvec{\theta } - \varvec{p}^\top \varvec{\theta } + \lambda h(\varvec{\theta }) \equiv g(\varvec{\theta }) + \lambda h(\varvec{\theta }), \end{aligned}$$
(22)

where \(\hat{\varvec{Q}}\in {{\mathbb {R}}}^{n_w\times n_w}\) is symmetric positive definite, \(\varvec{p}\in {{\mathbb {R}}}^{n_w}\), and g is \(\gamma _a\)-strongly convex with a \(\gamma _u\)-Lipschitz gradient, the smallest and largest eigenvalues of \(\hat{\varvec{Q}}\) corresponding to \(\gamma _a\) and \(\gamma _u\), respectively. For this function, provided the gradient and Hessian of \(h(\varvec{\theta })\) are known, for example when we choose \(h=h_\mu \) or \(h=h_2\), we have \(\partial {\mathcal {L}}(\varvec{\theta }) = \hat{\varvec{Q}}\varvec{\theta } - \varvec{p} + \lambda \partial h(\varvec{\theta })\) and \(\partial ^2{\mathcal {L}}(\varvec{\theta }) = \hat{\varvec{Q}} + \lambda \partial ^2\,h(\varvec{\theta })\). The coefficients \(\hat{\varvec{Q}}\) and \(\varvec{p}\) form our data: with \(\hat{\varvec{Q}}\equiv 0.1\times \varvec{M}^\top \varvec{M}\) and \(N=n_w=1000\), we generate \(\varvec{M}\in {{\mathbb {R}}}^{N\times N}\) randomly from a uniform distribution on [0, 1] and consider the case \({\mathcal {L}}^* = \varvec{0}\), in which \(\varvec{p}\) is the zero vector. The optimization variable \(\varvec{\theta }\) is initialized to a random value drawn from a normal distribution with mean 0 and standard deviation 0.01. Figure 1 shows the behaviour of GGN-SCORE on this problem for different values of \(\alpha \) in (0, 1] and for the value \(\alpha = \frac{\sqrt{\gamma _a}}{\beta _1}(K+\lambda \gamma _a)\) indicated in Theorem 4. We experiment with different batch sizes \(m\in \{64, 128, 256, 512\}\). One observes from Fig. 1 that a larger batch size yields better convergence speed when we choose \(\alpha = \frac{\sqrt{\gamma _a}}{\beta _1}(K+\lambda \gamma _a)\), validating the recommendation in Remark 3. Figure 1 also compares GGN-SCORE with the first-order Adam [5] algorithm and the quasi-Newton limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) [53] method using optimally tuned learning rates. While choosing \(\alpha = \frac{\sqrt{\gamma _a}}{\beta _1}(K+\lambda \gamma _a)\) yields the kind of convergence shown in Theorem 4, Fig. 1 shows that by choosing \(\alpha \) in (0, 1] we can similarly achieve good convergence that scales well with the problem.
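For reference, the data generation just described can be sketched as follows (the random seed and helper names are assumptions; the regularizer and its derivatives can be taken from the sketch of \(h_\mu \) and \(h_2\) given in the previous subsection).

```python
# Sketch of the strongly convex quadratic test problem (22): Q_hat = 0.1 * M^T M
# with M ~ U[0, 1]^(N x N), p = 0, and theta_0 ~ N(0, 0.01^2).
import numpy as np

rng = np.random.default_rng(4)
N = n_w = 1000
M = rng.uniform(0.0, 1.0, size=(N, N))
Q_hat = 0.1 * (M.T @ M)
p = np.zeros(n_w)
lam = 0.1

eigvals = np.linalg.eigvalsh(Q_hat)
gamma_a, gamma_u = eigvals[0], eigvals[-1]   # smallest/largest eigenvalues of Q_hat

def grad_g(theta):
    return Q_hat @ theta - p                 # gradient of the quadratic part g

theta0 = rng.normal(0.0, 0.01, size=n_w)     # random initialization of theta
```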

Fig. 1: Numerical behaviour of GGN-SCORE for different values of \(\alpha \) in the strongly convex quadratic test problem (22)

Fig. 2: Numerical behaviour of GGN-SCORE for different values of \(\alpha \) using real datasets. \(m = 512\) for w2a, dis and hypothyroid, and \(m=2048\) for coil2000

Strictly speaking, the value of \(\alpha \) indicated in Theorem 4 is not of practical interest, as it contains terms that may not be straightforward to obtain in practice. In practice, we treat \(\alpha \) as a hyperparameter that takes a fixed positive value in (0, 1]. For an adaptive step-size selection rule, such as that in Line 5 of Algorithm 4, choosing a suitable scaling constant such as \(\alpha \) is often straightforward, as the main step-size selection task is accomplished by the rule itself. We show the behaviour of GGN-SCORE on the real datasets for different values of \(\alpha \) in (0, 1] in Fig. 2. In general, a suitable scaling factor \(\alpha \) should be selected based on the demands of the application.

Fig. 3: Convergence curves for GGN-SCORE, SGD, Adam and L-BFGS in the proposed convex problem. \(m = 512\) for w2a, dis and hypothyroid, \(m=2048\) for coil2000, and \(m=4096\) for ijcnn1

Fig. 4: Convergence curves for GGN-SCORE, SGD, Adam and L-BFGS in the non-convex neural network training problem: overparameterized for dis, hypothyroid, w2a and coil2000. \(m = 512\) for w2a, dis and hypothyroid, \(m=2048\) for coil2000, and \(m=4096\) for ijcnn1

5.2 Comparison with SGD, Adam, and L-BFGS methods on real datasets

Using the real datasets, we compare GGN-SCORE for solving (1) with the SGD, Adam, and L-BFGS algorithms using optimally tuned learning rates. We also consider the problem of training a neural network with two hidden layers of sizes (2, 128) for the \(\texttt {coil2000}\) dataset, one hidden layer of size 1 for the \(\texttt {ijcnn1}\) dataset, and two hidden layers of sizes (4, 128) for the remaining datasets. We use ReLU activation functions in the hidden layers, and the network is overparameterized for \(\texttt {dis}\), \(\texttt {hypothyroid}\), \(\texttt {w2a}\) and \(\texttt {coil2000}\), with 25425, 21529, 23497 and 16229 trainable parameters, respectively. We choose \(\alpha \in \{0.2, 0.5\}\) for GGN-SCORE. Minimization variables are initialized to the zero vector for all methods. The neural network training problems are solved under the same settings. The results are displayed in Fig. 3 and Fig. 4 for the convex and non-convex cases, respectively.

Fig. 5: Classification task results using the method of this paper, logistic regression (LR), k-nearest neighbours (KNN), linear support vector machine (L-SVM), RBF-SVM, Gaussian process (GP), decision tree (DT), random forest (RF) and adaptive boosting (AdaBoost) techniques for dis, hypothyroid, coil2000 and w2a datasets. LR and SVM methods use the \(\ell _2\) penalty function with parameter \(\lambda = 1.0\). Solutions with GGN-SCORE, SGD, Adam, and L-BFGS are obtained over one data pass with \(m=128\)

To investigate how well the learned model generalizes, we use the binary accuracy metric, which measures how often the model predictions match the true labels when presented with new, previously unseen data: \(Accuracy = \frac{1}{N}\sum _{n=1}^{N}\left( 2\varvec{y}_n\hat{\varvec{y}}_n-\varvec{y}_n-\hat{\varvec{y}}_n+1\right) \). While GGN-SCORE converges faster than the SGD, Adam and L-BFGS methods, it also generalizes comparatively well. The results are further compared with other well-known binary classification techniques to measure the quality of our solutions. The accuracy scores for the dis, hypothyroid, coil2000 and w2a datasets, with a 60:40 train:test split each, are computed on the test set. The mean scores are compared with those from the different classification techniques in Fig. 5. The CPU runtimes are also compared, indicating that on average GGN-SCORE solves each of the problems within one second. This compares well with the other techniques; we note that while GGN-SCORE solves each of the problems in high dimensions, the success of most of the other techniques is limited to relatively small problem dimensions. The results from the classification techniques used for comparison are obtained directly from the respective scikit-learn [54] implementations.
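A short sketch of the accuracy computation, with sigmoidal outputs thresholded to \(\{0,1\}\), is given below; the expression inside the sum equals 1 exactly when the prediction matches the label (data shown are illustrative).

```python
# Sketch of the binary accuracy metric above; 2*y*yp - y - yp + 1 is 1 if y == yp
# and 0 otherwise, so the mean recovers the usual accuracy.
import numpy as np

y = np.array([1, 0, 1, 1, 0])
y_hat = np.array([0.8, 0.3, 0.4, 0.9, 0.1])   # illustrative sigmoidal outputs
yp = (y_hat >= 0.5).astype(int)               # thresholded predictions

print(np.mean(2 * y * yp - y - yp + 1), np.mean(y == yp))   # identical values
```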

The L-BFGS experiments are implemented with PyTorch [55] (v. 1.10.1+cu102) in full-batch mode. The GGN-SCORE, Adam and SGD methods are implemented using the open-source Keras API with the TensorFlow [56] backend (v. 2.7.0). All experiments are performed on a laptop with a dual (2.30GHz + 2.30GHz) Intel Core i7-11800H CPU and 16GB RAM.

In summary, GGN-SCORE converges considerably faster (in terms of the number of epochs) than SGD, Adam, and L-BFGS, and generalizes comparatively well. The experimental results show the computational convenience and elegance achieved by our “augmented” approach for including regularization functions in the GGN approximation. Although GGN-SCORE has a higher computational cost per iteration (in terms of wall-clock time) than first-order methods on average, this is not a significant issue when per-iteration time is not a bottleneck, since in our experiments the proposed optimizer needs only a few passes (epochs) over the dataset to obtain superior function approximation and relatively high-quality solutions.

6 Conclusion

In this paper, we have proposed GGN-SCORE, a generalized Gauss–Newton-type algorithm for solving unconstrained regularized minimization problems in which the regularization function is self-concordant. In this generalized setting, we employed a matrix approximation scheme that significantly reduces the computational overhead associated with the method. Unlike existing techniques that impose self-concordance on the problem’s objective function, our analysis involves a less restrictive condition from a practical point of view but similarly benefits from the idea of self-concordance, by considering scaled optimization step lengths that depend on the self-concordant parameter of the regularization function. We proved a quadratic local convergence rate for our method under certain conditions and validated its efficiency in numerical experiments involving both strongly convex problems and the general non-convex problems that arise when training neural networks. In both cases, our method compares favourably with the Adam, SGD, and L-BFGS methods in terms of per-iteration convergence speed, as well as with several machine learning techniques used in binary classification in terms of solution quality.

In future research, it would be interesting to relax some conditions on the problem and analyze a global convergence rate for the proposed method. We would also like to analyze our method in a general non-convex setting, even though numerically we have observed a convergence speed similar to that of the strongly convex case.