1 Introduction

The results presented in this paper apply to a pseudo-online optimization algorithm based on solving a regularized unconstrained minimization problem under the assumptions of strong convexity and Lipschitz continuity. Unlike first-order methods such as stochastic gradient descent (SGD) [1, 2] and its variants [3,4,5,6], which only use first-order information through the function gradients, second-order methods [7,8,9,10,11,12,13] incorporate, in some way, second-order information through the Hessian matrix or the Fisher information matrix (FIM). It is well known that this generally gives second-order methods better (quadratic) convergence than a typical first-order method, which only converges linearly in the neighbourhood of the solution [14].

Despite their convergence advantage over first-order methods, second-order methods involve highly prohibitive computations, namely, inverting an \(n_w\times n_w\) matrix at each iteration, where \(n_w\) is the number of optimization variables. The most commonly used second-order methods (the natural gradient, Gauss–Newton, and sub-sampled Newton [15], or its regularized version [16]), which incorporate second-order information while maintaining desirable convergence properties, compute the Hessian matrix (or FIM) \(\varvec{H}\) by using the approximation \(\varvec{H}\approx \varvec{J}^T\varvec{J}\), where \(\varvec{J}\) is the Jacobian matrix. This approximation is still an \(n_w\times n_w\) matrix and remains computationally demanding for large problems. Recent works [17,18,19,20] on second-order methods for overparameterized neural network models bypass this difficulty by applying a matrix identity and instead only compute the matrix \(\varvec{J}\varvec{J}^T\), which is a \(d\cdot N\times d\cdot N\) matrix, where d is the model output dimension and N is the number of data points. This approach significantly reduces the computational overhead when \(d\cdot N\) is much smaller than \(n_w\) (overparameterized models) and helps to accelerate convergence [17]. Nevertheless, for an objective function with a twice-differentiable convex regularizer, this simplification requires closer attention and special modifications for a general problem with a large number of variables.

The idea of exploiting desirable regularization properties to improve the convergence of the (Gauss–)Newton scheme has been around for decades, and most published works on the topic combine, in different ways, the ideas of Levenberg–Marquardt (LM) regularization, line search, and trust-region methods [21]. For example, the recent work [22] combines the idea of cubic regularization (originally proposed in [21]) with a particular variant of the adaptive LM penalty that uses the Euclidean gradient norm of the output-fit loss function (see [23] for a comprehensive list and complexity bounds). Their proposed scheme achieves a global \({\mathcal {O}}(k^{-2})\) rate, where k is the iteration number. A similar idea is considered in [24] using Bregman distances, extending the approach to an accelerated variant that achieves a \({\mathcal {O}}(k^{-3})\) convergence rate.

In this paper, we propose a new self-concordant regularization (SCORE) scheme for efficiently choosing optimal model variables in problems with smooth, strongly convex optimization objectives, where one of the objective functions regularizes the model’s variable vector, thereby avoiding overfitting and ultimately improving the model’s ability to generalize. Under the additional assumption that the regularization function is self-concordant, we propose the GGN-SCORE algorithm (see Algorithm 4 below), which updates the minimization variables within the framework of the local norm \(\Vert \cdot \Vert _x\) (a.k.a. the Newton decrement) of a self-concordant function f(x), as in [25]. Our proposed scheme does not require the output-fit loss function to be self-concordant, a condition that does not hold in many applications [22]. Instead, we exploit the greedy descent provision of self-concordant functions, via regularization, to achieve a fast convergence rate while maintaining feasible assumptions on the combined objective function (from an application point of view). Although this paper assumes a convex optimization problem, we also provide experiments that show promising results for the non-convex problems that arise when training neural networks. Our experimental results present an interesting opportunity for future investigation and scaling of the proposed method to large-scale machine learning problems, as one of the non-convex problems considered in the experiments involves training an overparameterized neural network. We remark that overparameterization is an interesting and desirable property, and a topic of interest in the machine learning community [26,27,28,29].

This paper is organized as follows: First, we introduce some notations, formulate the optimization problem with basic assumptions, and present an initial motivation for the optimization method in Sect. 2. In Sect. 3, we derive a new generalized method for reducing the computational overhead associated with the mini-batch Newton-type updates. The idea of SCORE is introduced in Sect. 4 and our GGN-SCORE algorithm is presented thereafter. Experimental results that show the efficiency and fast convergence of the proposed method are presented in Sect. 5.

2 Preliminaries

2.1 Notation and basic assumptions

Let \(\{(\varvec{x}_n,\varvec{y}_n)\}_{n=1}^N\) be a sequence of N input and output sample pairs, \(\varvec{x}_n\in \mathbb {R}^{n_p}, \varvec{y}_n\in \mathbb {R}^{d}\), where \(n_p\) is the number of features and d is the number of targets. We assume a model \(f(\varvec{\theta };\varvec{x}_n)\), defined by \(f:\mathbb {R}^{n_w} \times \mathbb {R}^{n_p} \rightarrow \mathbb {Y}\) and parameterized by the vector of variables \(\varvec{\theta }\in \mathbb {R}^{n_w}\). We denote by \(\partial _{\varvec{a}} \varvec{b} \equiv \partial \varvec{b}\) the gradient (or first derivative) of \(\varvec{b}\) (with respect to \(\varvec{a}\)) and \(\partial _{\varvec{a}\varvec{a}}^2 \varvec{b} \equiv \partial ^2\varvec{b}\) the second derivative of \(\varvec{b}\) with respect to \(\varvec{a}\). We write \(\Vert \cdot \Vert \) to denote the 2-norm. The set \(\{\textrm{diag}(\varvec{v}):\varvec{v}\in {{\mathbb {R}}}^n\}\), where \(\textrm{diag}:{{\mathbb {R}}}^n \rightarrow {{\mathbb {R}}}^{n\times n}\), denotes the set of all diagonal matrices in \({{\mathbb {R}}}^{n\times n}\). Throughout the paper, bold-face letters denote vectors and matrices.

Suppose that \(f(\varvec{\theta };\varvec{x}_n)\) outputs the value \(\hat{\varvec{y}}_n \in \mathbb {R}^{d}\). The regularized minimization problem we want to solve is

$$\begin{aligned} \min _{\varvec{\theta }}{\mathcal {L}}(\varvec{\theta })&:=\underbrace{\sum _{n=1}^N\ell (\varvec{y}_n,\hat{\varvec{y}}_n)}_{g(\varvec{\theta })} + \lambda \underbrace{\sum _{j=1}^{n_w} r_j(\varvec{\theta }_j)}_{h(\varvec{\theta })}, \end{aligned}$$
(1)

where \(\ell :{{\mathbb {R}}}^d\times {{\mathbb {R}}}^d\rightarrow {{\mathbb {R}}}\) is a (strongly) convex, twice-differentiable output-fit loss function, the functions \(r_j:{{\mathbb {R}}}\rightarrow {{\mathbb {R}}}\), \(j=1,\ldots ,n_w\), define a separable regularization term on \(\varvec{\theta }\), and \(g,h:{{\mathbb {R}}}^{n_w}\rightarrow {{\mathbb {R}}}\) denote the two resulting terms in (1). We assume that the regularization function \(h(\varvec{\theta })\), scaled by the parameter \(\lambda >0\), is twice differentiable and strongly convex. The following preliminary conditions define the regularity of the Hessian of \({\mathcal {L}}(\varvec{\theta })\) and are assumed to hold only locally in this work:

Assumption 1

The functions g and h are twice-differentiable with respect to \(\varvec{\theta }\) and are respectively \(\gamma _l\)- and \(\gamma _a\)-strongly convex.

Assumption 2

\(\exists \gamma _u\), \(\gamma _b\) with \(0\le \gamma _l\le \gamma _u < \infty \), \(0<\gamma _a\le \gamma _b < \infty \), such that the gradient of \({\mathcal {L}}(\varvec{\theta })\) is \((\gamma _u+\lambda \gamma _b)\)-Lipschitz continuous \(\forall \varvec{x} \in {{\mathbb {R}}}^{n_p}, \varvec{y} \in {{\mathbb {R}}}^d\). That is, \(\forall \varvec{x} \in {{\mathbb {R}}}^{n_p}, \forall \varvec{y} \in {{\mathbb {R}}}^d\), the gradient \(\partial _{\varvec{\theta }}{\mathcal {L}}(\varvec{\theta })=\partial _{\varvec{\theta }}g(\varvec{\theta }) + \lambda \partial _{\varvec{\theta }}h(\varvec{\theta })\) satisfies

$$\begin{aligned} \left\| \partial _{\varvec{\theta }} {\mathcal {L}}(\varvec{y}, f(\varvec{\theta _1};\varvec{x})) - \partial _{\varvec{\theta }} {\mathcal {L}}(\varvec{y}, f(\varvec{\theta _2};\varvec{x}))\right\| \le (\gamma _u+\lambda \gamma _b) \left\| \varvec{\theta _1} - \varvec{\theta _2}\right\| , \end{aligned}$$
(2)

for any \(\varvec{\theta _1},\varvec{\theta _2} \in {{\mathbb {R}}}^{n_w}\).

Assumption 3

\(\exists \gamma _g, \gamma _h\) with \(0<\gamma _g,\gamma _h<\infty \) such that \(\forall \varvec{x} \in {{\mathbb {R}}}^{n_p}, \forall \varvec{y} \in {{\mathbb {R}}}^d\), the second derivatives of \(g(\varvec{\theta })\) and \(h(\varvec{\theta })\) respectively satisfy

$$\begin{aligned}{} & {} \left\| \partial ^2 g(\varvec{y}, f(\varvec{\theta _1};\varvec{x})) - \partial ^2 g(\varvec{y}, f(\varvec{\theta _2};\varvec{x}))\right\| \le \gamma _g\left\| \varvec{\theta _1} - \varvec{\theta _2}\right\| , \end{aligned}$$
(3a)
$$\begin{aligned}{} & {} \left\| \partial ^2 h(\varvec{y}, f(\varvec{\theta _1};\varvec{x})) - \partial ^2 h(\varvec{y}, f(\varvec{\theta _2};\varvec{x}))\right\| \le \gamma _h\left\| \varvec{\theta _1} - \varvec{\theta _2}\right\| , \end{aligned}$$
(3b)

for any \(\varvec{\theta _1},\varvec{\theta _2} \in {{\mathbb {R}}}^{n_w}\).

Commonly used loss functions such as the squared loss and the sigmoidal cross-entropy loss are twice differentiable and (strongly) convex in the model variables. Certain smoothed sparsity-inducing penalties, such as the (pseudo-)Huber function presented later in this paper, are examples of functions that may be well-suited for the regularizer \(h(\varvec{\theta })\) defined above.
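As a concrete illustration, the following minimal sketch evaluates the objective (1) with a squared fit loss \(g(\varvec{\theta })\) and a separable pseudo-Huber regularizer \(h(\varvec{\theta })\) for a simple linear model; the model choice, the data, and all names are illustrative assumptions rather than the setup used in the experiments.

```python
# Minimal sketch of the objective (1): squared fit loss g(theta) plus a separable
# pseudo-Huber regularizer h(theta), for an illustrative linear model
# f(theta; x) = <x, theta> (d = 1). Model choice, data, and names are assumptions.
import numpy as np

def objective(theta, X, Y, lam=0.1, mu=1.0):
    Y_hat = X @ theta                               # model outputs y_hat_n, n = 1..N
    g = 0.5 * np.sum((Y - Y_hat) ** 2)              # sum_n l(y_n, y_hat_n)
    h = np.sum(np.sqrt(mu**2 + theta**2) - mu)      # sum_j r_j(theta_j) (pseudo-Huber)
    return g + lam * h

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))                  # N = 100 samples, n_p = n_w = 20
Y = X @ rng.standard_normal(20)
print(objective(np.zeros(20), X, Y))
```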

The assumptions of strong convexity and smoothness of the objective \({\mathcal {L}}(\varvec{\theta })\) are standard in many optimization problems, as they help to characterize the convergence properties of the underlying solution method [30, 31]. However, the smoothness assumption is sometimes not satisfied in multi-objective (or regularized) problems where a non-smooth (penalty-inducing) function \(h(\varvec{\theta })\) is used (see the recent work [32]). In such a case, and when the need to incorporate second-order information arises, a well-known approach in the optimization literature is either to approximate the non-smooth objectives by a smooth quadratic function (when such an approximation is available) or to use a “proximal splitting” method and replace the 2-norm in this setting with the Q-norm, where Q is the Hessian matrix or its approximation [33]. In [33], the authors propose two techniques that help to avoid the complexity that the latter approach often introduces in the subproblems. While proposing new approaches, [34] highlights some popular techniques for handling non-differentiability. Each of these works highlights the importance of incorporating second-order information in the solution techniques of optimization problems. By conveniently solving the optimization problem (1) under the assumptions made above, our method ensures that the full curvature information is captured while reducing computational overhead.

2.2 Approximate Newton scheme

Given the current value of \(\varvec{\theta }\), the (Gauss–)Newton method computes an update to \(\varvec{\theta }\) via

$$\begin{aligned} \varvec{\theta } \leftarrow \varvec{\theta } - \rho \varvec{G}, \end{aligned}$$
(4)

where \(\rho \) is the step size, \(\varvec{G}=\varvec{H}^{-1}\partial {\mathcal {L}}(\varvec{\theta })\), and \(\varvec{H}\) is the Hessian of \({\mathcal {L}}\) or its approximation. In this work, we consider the generalized Gauss–Newton (GGN) approximation of \(\varvec{H}\), which we now define in terms of the function \(g(\varvec{\theta })\). This approximation and its detailed expression motivate the modified version introduced in the next section, which includes the regularization function \(h(\varvec{\theta })\).

Definition 1

(Generalized Gauss–Newton Hessian) Let \(\textrm{diag}(\varvec{q}_n)\in \mathbb {R}^{d\times d}\) be the second derivative of the loss function \(\ell (\varvec{y}_n,\hat{\varvec{y}}_n)\) with respect to the predictor \(\hat{\varvec{y}}_n\), \(\varvec{q}_n = \partial _{\hat{\varvec{y}}_n\hat{\varvec{y}}_n}^2 \ell (\varvec{y}_n,\hat{\varvec{y}}_n)\) for \(n=1,2,\ldots ,N\), and let \(\varvec{Q}_g\in \mathbb {R}^{dN\times dN}\) be a block diagonal matrix with \(\varvec{q}_n\) being the n-th diagonal block. Let \(\varvec{J}_n\in \mathbb {R}^{d\times n_w}\) denote the Jacobian of \(\hat{\varvec{y}}_n\) with respect to \(\varvec{\theta }\) for \(n=1,2,\ldots ,N\), and let \(\varvec{J}_g\in \mathbb {R}^{dN\times n_w}\) be the vertical concatenation of all \(\varvec{J}_n\)’s. Then, the generalized Gauss–Newton (GGN) approximation of the Hessian matrix \(\varvec{H}_g\in \mathbb {R}^{n_w\times n_w}\) associated with the fit loss \(\ell (\varvec{y}_n,\hat{\varvec{y}}_n)\) with respect to \(\varvec{\theta }\) is defined by

$$\begin{aligned} \varvec{H}_g\approx \varvec{J}_g^T\varvec{Q}_g \varvec{J}_g = \sum _{n=1}^{N}\varvec{J}_n^T\textrm{diag}(\varvec{q}_n)\varvec{J}_n. \end{aligned}$$
(5)

Let \(\varvec{e}_n\in \mathbb {R}^{d}\) be the Jacobian of the fit loss defined by \(\varvec{e}_n=\partial _{\hat{\varvec{y}}_n}\ell (\varvec{y}_n,\hat{\varvec{y}}_n)\) for \(n=1,2,\ldots ,N\). For example, in the case of the squared loss \(\ell (\varvec{y}_n,\hat{\varvec{y}}_n)=\frac{1}{2}\Vert \varvec{y}_n-\hat{\varvec{y}}_n\Vert ^2\), \(\varvec{e}_n\) is the residual \(\varvec{e}_n=\hat{\varvec{y}}_n-\varvec{y}_n\). Let \(\varvec{e}_g\in \mathbb {R}^{dN}\) be the vertical concatenation of all \(\varvec{e}_n\)’s. Then, using the chain rule, we write

$$\begin{aligned} \varvec{J}_n^T\varvec{e}_n&= \begin{bmatrix} \partial _{\theta _1}\hat{y}^{(1)} &{} \partial _{\theta _1}\hat{y}^{(2)} &{} \cdots &{} \partial _{\theta _1}\hat{y}^{(d)}\\ \partial _{\theta _2}\hat{y}^{(1)} &{} \partial _{\theta _2}\hat{y}^{(2)} &{} \cdots &{} \partial _{\theta _2}\hat{y}^{(d)}\\ \vdots &{}\vdots &{}&{}\vdots \\ \partial _{\theta _{n_w}}\hat{y}^{(1)} &{} \partial _{\theta _{n_w}}\hat{y}^{(2)} &{} \cdots &{} \partial _{\theta _{n_w}}\hat{y}^{(d)} \end{bmatrix} \begin{bmatrix} \partial _{\hat{y}}\ell ^{(1)}\\ \partial _{\hat{y}}\ell ^{(2)}\\ \vdots \\ \partial _{\hat{y}}\ell ^{(d)} \end{bmatrix} \nonumber \\&= \begin{bmatrix} \partial _{\theta _1}\ell ^{(1)}+\partial _{\theta _1}\ell ^{(2)}+\cdots +\partial _{\theta _1}\ell ^{(d)}\\ \partial _{\theta _2}\ell ^{(1)}+\partial _{\theta _2}\ell ^{(2)}+\cdots +\partial _{\theta _2}\ell ^{(d)}\\ \vdots \\ \partial _{\theta _{n_w}}\ell ^{(1)}+\partial _{\theta _{n_w}}\ell ^{(2)}+\cdots +\partial _{\theta _{n_w}}\ell ^{(d)} \end{bmatrix} \nonumber \\&= \left[ \sum _{i=1}^{d}\partial _{\theta _1}\ell ^{(i)} \quad \sum _{i=1}^{d}\partial _{\theta _2}\ell ^{(i)} \quad \cdots \quad \sum _{i=1}^{d}\partial _{\theta _{n_w}}\ell ^{(i)} \right] ^T, \end{aligned}$$
(6)

and

$$\begin{aligned} \varvec{g}_g(\varvec{\theta })=\varvec{J}_g^T\varvec{e}_g=\sum _{n=1}^{N} \varvec{J}_n^T\varvec{e}_n. \end{aligned}$$
(7)

As noted in [35], the GGN approximation has the advantage of capturing the curvature information of \(\ell \) in \(g(\varvec{\theta })\) through the term \(\varvec{Q}_g\), as opposed to the FIM, for example, which ignores the full second-order interactions. While it may be obvious, say when training a deep neural network with certain loss functions, how the GGN approximation can be exploited to simplify expressions for \(\varvec{H}_g\) (see e.g. [36]), a modification is required to account for a twice-differentiable convex regularization function while retaining some degree of simplicity and elegance. We derive such a modification for the mini-batch scheme presented in the next section, which includes the derivatives of \(h(\varvec{\theta })\) in the GGN approximation of \(\varvec{H}_g\). This modification leads to our GGN-SCORE algorithm in Sect. 4.
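Before introducing that modification, a minimal sketch of the unregularized GGN quantities of Definition 1 and (7) is given below for a linear multi-output model with the squared fit loss (so that \(\varvec{Q}_g\) is the identity and \(\varvec{e}_n=\hat{\varvec{y}}_n-\varvec{y}_n\)); the model, data, and names are illustrative assumptions.

```python
# Sketch of the GGN Hessian (5) and gradient (7), assuming a linear model
# y_hat_n = W x_n with theta = vec(W) (row-major) and the squared fit loss.
import numpy as np

rng = np.random.default_rng(0)
N, d, n_p = 50, 3, 8
n_w = d * n_p
X = rng.standard_normal((N, n_p))
Y = rng.standard_normal((N, d))
W = rng.standard_normal((d, n_p))

# Per-sample Jacobians J_n = d y_hat_n / d theta; for y_hat = W x, J_n = I_d (x) x_n^T.
J_list = [np.kron(np.eye(d), x_n) for x_n in X]
J_g = np.vstack(J_list)                      # (d*N) x n_w
e_g = (X @ W.T - Y).ravel()                  # stacked residuals e_n, length d*N
Q_g = np.eye(d * N)                          # q_n = 1 for the squared loss

H_g = J_g.T @ Q_g @ J_g                      # GGN approximation (5)
g_g = J_g.T @ e_g                            # gradient of g(theta), eq. (7)
print(H_g.shape, g_g.shape)                  # (n_w, n_w), (n_w,)
```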

3 Second-order (pseudo-online) optimization

Suppose that at each mini-batch step k we uniformly select a random index set \({\mathcal {I}}_k\subseteq \{1,2,\ldots ,N\}\), \(\vert {\mathcal {I}}_k\vert =m \le N\) (usually \(m\ll N\)), to access a mini-batch of m samples from the training set. The loss derivatives used for the recursive update of the variables \(\varvec{\theta }\) are computed at each step k and are estimated as running averages over the batch-wise computations. This leads to a stochastic approximation of the true derivatives at each iteration, for which we assume unbiased estimates.

The problem of finding the optimal adjustment \(\delta \varvec{\theta }_{mk} :=\varvec{\theta }_{k+1}-\varvec{\theta }_k\) that solves (1) results in solving either an overdetermined or an underdetermined linear system, depending on whether \(dm\ge n_w\) or \(dm < n_w\), respectively. Consider, for example, the squared fit loss and the penalty-inducing squared norm as the scalar-valued functions \(g(\varvec{\theta })\) and \(h(\varvec{\theta })\), respectively, in (1). Then \(\varvec{Q}_g\) is the identity matrix, and the LM adjustment \(\delta \varvec{\theta }\) [37, 38] is estimated at each iteration k according to the rule:

$$\begin{aligned} \delta \varvec{\theta } = - (\varvec{H}_g+\lambda \varvec{I})^{-1}\varvec{g}_g = -(\varvec{J}_g^T\varvec{J}_g+\lambda \varvec{I})^{-1}\varvec{J}_g^T\varvec{e}_g. \end{aligned}$$
(8)

If \(dm < n_w\) (possibly \(dm\ll n_w\)), then by using the Searle identity \((\varvec{A}\varvec{B} + \lambda \varvec{I})\varvec{A} = \varvec{A}(\varvec{B}\varvec{A} + \lambda \varvec{I})\) [39], we can conveniently update the adjustment \(\delta \varvec{\theta }\) by

$$\begin{aligned} \delta \varvec{\theta } = -\varvec{J}_g^T(\varvec{J}_g\varvec{J}_g^T+\lambda \varvec{I})^{-1}\varvec{e}_g. \end{aligned}$$
(9)

Clearly, this provides a more computationally efficient way of solving for \(\delta \varvec{\theta }\). In what follows, we formulate a generalized solution method for the regularized problem (1) that similarly exploits the structure of the Hessian matrix, thereby computing the adjustment \(\delta \varvec{\theta }\) conveniently and efficiently.
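The computational saving can be checked numerically: on random illustrative data with \(dm < n_w\), the step (8) obtained from an \(n_w\times n_w\) solve coincides with the step (9) obtained from a \(dm\times dm\) solve. The sketch below uses synthetic data and illustrative names only.

```python
# Sketch checking that the LM step (8) and its Searle-identity form (9) agree
# when d*m < n_w; only random illustrative data is used here.
import numpy as np

rng = np.random.default_rng(1)
dm, n_w, lam = 30, 200, 0.1
J = rng.standard_normal((dm, n_w))
e = rng.standard_normal(dm)

step_8 = -np.linalg.solve(J.T @ J + lam * np.eye(n_w), J.T @ e)  # n_w x n_w solve
step_9 = -J.T @ np.linalg.solve(J @ J.T + lam * np.eye(dm), e)   # dm x dm solve
print(np.allclose(step_8, step_9))                               # True up to round-off
```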

Taking the second-order approximation of \({\mathcal {L}}(\varvec{\theta })\), we have

$$\begin{aligned} {\mathcal {L}}(\varvec{\theta } + \delta \varvec{\theta })&\approx {\mathcal {L}}(\varvec{\theta }) + \varvec{g}^T\delta \varvec{\theta } + \frac{1}{2}\delta \varvec{\theta }^T \varvec{H}\delta \varvec{\theta }, \end{aligned}$$
(10)

where \(\varvec{H}\in {{\mathbb {R}}}^{n_w\times n_w}\) is the Hessian of \({\mathcal {L}}(\varvec{\theta })\) and \(\varvec{g}\in {{\mathbb {R}}}^{n_w}\) is its gradient. Let \(M=dm+1\) and define the Jacobian \(\varvec{J}\in {{\mathbb {R}}}^{M\times n_w}\):

$$\begin{aligned} \varvec{J}^T&= \begin{bmatrix} \partial _{\theta _1}\hat{\varvec{y}}_1 &{} \partial _{\theta _1}\hat{\varvec{y}}_2 &{} \cdots &{} \partial _{\theta _1}\hat{\varvec{y}}_m &{} \lambda \partial _{\theta _1}r_1\\ \partial _{\theta _2}\hat{\varvec{y}}_1 &{} \partial _{\theta _2}\hat{\varvec{y}}_2 &{} \cdots &{} \partial _{\theta _2}\hat{\varvec{y}}_m &{} \lambda \partial _{\theta _2}r_2\\ \vdots &{}\vdots &{}&{}\vdots &{}\vdots \\ \partial _{\theta _{n_w}}\hat{\varvec{y}}_1 &{} \partial _{\theta _{n_w}}\hat{\varvec{y}}_2 &{} \cdots &{} \partial _{\theta _{n_w}}\hat{\varvec{y}}_m &{} \lambda \partial _{\theta _{n_w}}r_{n_w} \end{bmatrix}. \end{aligned}$$
(11)

Let \(\varvec{e}_n\in {{\mathbb {R}}}^d\), \(n=1,2,\ldots ,m\), be as defined above, let the last entry be \(\varvec{e}_M=1\), and denote by \(\varvec{e}\in {{\mathbb {R}}}^M\) the vertical concatenation of these quantities. Then by using the chain rule as in (6) and (7), we obtain

$$\begin{aligned} \varvec{g}(\varvec{\theta }) = \partial _{\varvec{\theta }}{\mathcal {L}}(\varvec{\theta }) = \varvec{J}^T\varvec{e}. \end{aligned}$$
(12)

Let \(\varvec{q}_n=\partial _{\hat{\varvec{y}}_n\hat{\varvec{y}}_n}^2\ell (\varvec{y}_n,\hat{\varvec{y}}_n)\), \(\varvec{q}_n\in {{\mathbb {R}}}^d\) for \(n=1,2,\ldots ,m\) (clearly \(\varvec{q}_n=\varvec{1}\) in the case of squared fit loss terms), and set the last entry, corresponding to the regularizer column of \(\varvec{J}\), to zero. Define \(\varvec{Q}\in {{\mathbb {R}}}^{M\times M}\) as the diagonal matrix with these entries on its diagonal, where \(\varvec{Q}=\varvec{Q}^T\succeq \varvec{0}\) by convexity of \(\ell \). Consider the following slightly modified GGN approximation of the Hessian \(\varvec{H}\in {{\mathbb {R}}}^{n_w\times n_w}\) associated with \({\mathcal {L}}(\varvec{\theta })\):

$$\begin{aligned} \varvec{H}\approx \varvec{J}^T\varvec{Q}\varvec{J}+\lambda \varvec{H}_h, \end{aligned}$$
(13)

where \(\varvec{H}_h\) is the Hessian of the regularization term \(h(\varvec{\theta })\), \(\varvec{H}_h\in {{\mathbb {R}}}^{n_w \times n_w}\), and is a diagonal matrix whose diagonal terms are

$$\begin{aligned} {\varvec{H}_h}_{jj}=\frac{d^2r_j(\theta _j)}{d\theta _j^2},\ j=1,\ldots ,n_w. \end{aligned}$$

We retain the notation \(\varvec{H}\) for the modified GGN approximation of the full Hessian matrix \(\partial ^2{\mathcal {L}}\). By differentiating (10) with respect to \(\delta \varvec{\theta }\) and equating to zero, we obtain the optimal adjustment

$$\begin{aligned} \delta \varvec{\theta } = -(\varvec{J}^T\varvec{Q}\varvec{J} + \lambda \varvec{H}_h )^{-1}\varvec{J}^T\varvec{e}. \end{aligned}$$
(14)

Remark 1

The inverse matrix in (14) exists due to the strong convexity assumption on the regularization function h, which gives \(\varvec{H}_h\succeq (\min _{j}{\varvec{H}_h}_{jj})\varvec{I}\succ \varvec{0}\); therefore, the matrix \(\varvec{J}^T\varvec{Q}\varvec{J} + \lambda \varvec{H}_h\) is symmetric positive definite and hence invertible.

Let \(\varvec{U}=\varvec{J}^T\varvec{Q}\). Using the identity [40, 41]

$$\begin{aligned} (\varvec{D}-\varvec{V}\varvec{A}^{-1}\varvec{B})^{-1}\varvec{V}\varvec{A}^{-1} = \varvec{D}^{-1}\varvec{V}(\varvec{A}-\varvec{B}\varvec{D}^{-1}\varvec{V})^{-1} \end{aligned}$$
(15)

with \(\varvec{D}=\lambda \varvec{H}_h\), \(\varvec{V}=-\varvec{J}^T\), \(\varvec{A}=\varvec{I}_M\), and \(\varvec{B}=\varvec{U}^T\), and recalling that \(\varvec{Q}\) is symmetric, from (14) we get

$$\begin{aligned} \delta \varvec{\theta }&= \left( \lambda \varvec{H}_h-(-\varvec{J}^T)\varvec{I}_M \varvec{U}^T\right) ^{-1}(-\varvec{J}^T)\varvec{I}^{-1}_M\varvec{e}\nonumber \\&=-\frac{1}{\lambda }\varvec{H}_h^{-1}\varvec{J}^T\left( \varvec{I}_M+\varvec{U}^T\frac{1}{\lambda }\varvec{H}_h^{-1} \varvec{J}^T\right) ^{-1}\varvec{e}\nonumber \\&=-\varvec{H}_h^{-1}\varvec{J}^T\left( \lambda \varvec{I}_M+\varvec{Q}\varvec{J} \varvec{H}_h^{-1} \varvec{J}^T\right) ^{-1}\varvec{e}. \end{aligned}$$
(16)

Remark 2

When combined with a second identity, namely \(\varvec{V}\varvec{A}^{-1}(\varvec{A}-\varvec{B}\varvec{D}^{-1}\varvec{V}) = (\varvec{D} -\varvec{V}\varvec{A}^{-1}\varvec{B})\varvec{D}^{-1}\varvec{V}\), one can directly derive from (15) the Woodbury identity [42] \( (\varvec{A}+\varvec{U}\varvec{B}\varvec{V})^{-1} = \varvec{A}^{-1} - \varvec{A}^{-1}\varvec{U}(\varvec{B}^{-1}+\varvec{V}\varvec{A}^{-1}\varvec{U})^{-1}\varvec{V}\varvec{A}^{-1}\), or, in the special case \(\varvec{B} = -\varvec{D}^{-1}\), \( (\varvec{A}-\varvec{U}\varvec{D}^{-1}\varvec{V})^{-1} = \varvec{A}^{-1} + \varvec{A}^{-1}\varvec{U}(\varvec{D}-\varvec{V}\varvec{A}^{-1}\varvec{U})^{-1}\varvec{V}\varvec{A}^{-1}\). Using the Woodbury identity would not, in fact, structurally yield the closed-form update step (16) in an exact sense. Our construction involves a regularization function that is more general than the commonly used squared norm, whose Hessian is a multiple of the identity matrix and for which the Woodbury identity would be equally useful.

Compared to (14), the clear advantage of the form (16) is that it requires the factorization of an \(M\times M\) matrix rather than an \(n_w\times n_w\) matrix, while the term \(\varvec{H}_h^{-1}\) can be obtained conveniently by exploiting its diagonal structure. Given these modifications, we proceed by making an assumption that characterizes the residual \(\varvec{e}\) and the Jacobian \(\varvec{J}\) in the region of convergence in which we assume the starting point \(\varvec{\theta }_0\) of the process (16) lies.
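The equivalence of (14) and (16), and the saving obtained by solving only an \(M\times M\) system while inverting the diagonal \(\varvec{H}_h\) elementwise, can be verified on random data as in the following sketch (all data and names are illustrative assumptions).

```python
# Sketch verifying that the update (16) matches the direct form (14) while only
# requiring an M x M solve and the elementwise inverse of the diagonal H_h.
import numpy as np

rng = np.random.default_rng(2)
M, n_w, lam = 40, 500, 0.1
J = rng.standard_normal((M, n_w))
e = rng.standard_normal(M)
q = np.abs(rng.standard_normal(M))              # diagonal of Q >= 0 (convex loss)
h_diag = 1.0 + np.abs(rng.standard_normal(n_w)) # diagonal of H_h > 0 (strong convexity)

# (14): n_w x n_w solve
d14 = -np.linalg.solve(J.T @ np.diag(q) @ J + lam * np.diag(h_diag), J.T @ e)

# (16): elementwise H_h^{-1} and an M x M solve
Hinv_JT = J.T / h_diag[:, None]                 # H_h^{-1} J^T
d16 = -Hinv_JT @ np.linalg.solve(lam * np.eye(M) + (q[:, None] * J) @ Hinv_JT, e)

print(np.allclose(d14, d16))                    # True up to round-off
```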

Let \(\varvec{\theta }^*\) be a nondegenerate minimizer of \({\mathcal {L}}\), and define \({\mathcal {B}}_\epsilon (\varvec{\theta }^*):=\{\varvec{\theta }_k\in {{\mathbb {R}}}^{n_w}:\Vert \varvec{\theta }_k - \varvec{\theta }^*\Vert \le \epsilon \}\), a closed ball of a sufficiently small radius \(\epsilon \ge 0\) about \(\varvec{\theta }^*\). We denote by \({\mathcal {N}}_{\epsilon }(\varvec{\theta }^*)\) an open neighbourhood of the sublevel set \(\Gamma ({\mathcal {L}}):=\{\varvec{\theta }_k:{\mathcal {L}}(\varvec{\theta }_k)\le {\mathcal {L}}(\varvec{\theta }_0)\}\), so that \({\mathcal {B}}_\epsilon (\varvec{\theta }^*)=\textrm{cl}({\mathcal {N}}_{\epsilon }(\varvec{\theta }^*))\). We then have \({\mathcal {N}}_{\epsilon }(\varvec{\theta }^*):=\{\varvec{\theta }_k\in {{\mathbb {R}}}^{n_w}:\Vert \varvec{\theta }_k - \varvec{\theta }^*\Vert < \epsilon \}\).

Assumption 4

  1. (i)

    Each \(\varvec{e}_n(\varvec{\theta }_k)\) and each \(\varvec{q}_n(\varvec{\theta }_k)\) is Lipschitz smooth, and \(\forall \varvec{\theta }_k\in {\mathcal {N}}_{\epsilon }(\varvec{\theta }^*)\) there exists \(\nu >0\) such that \(\Vert \varvec{J}(\varvec{\theta }_k)\varvec{z}\Vert \ge \nu \Vert \varvec{z}\Vert \).

  2. (ii)

    \(\lim _{k\rightarrow \infty }\left\| \mathbb {E}_m[\varvec{H}(\varvec{\theta }_k)] - \partial ^2{\mathcal {L}}(\varvec{\theta }_k)\right\| =0\) almost surely whenever \(\lim _{k\rightarrow \infty } \Vert \varvec{g}_{{\mathcal {L}}}(\varvec{\theta }_k)\Vert = 0\), \(\forall \varvec{\theta }_k\in {\mathcal {N}}_{\epsilon }(\varvec{\theta }^*)\), where \(\mathbb {E}_m[\cdot ]\) denotes expectation with respect to m.

Remark 3

Assumption 4(i) implies that the singular values of \(\varvec{J}\) are uniformly bounded away from zero and that \(\exists \beta ,\tilde{\beta }>0\) such that \(\Vert \varvec{e}\Vert \le \beta \) and \(\Vert \varvec{J}\Vert =\Vert \varvec{J}^T\Vert \le \tilde{\beta }\). Then, since \(\varvec{Q}\succeq \varvec{0}\), \(\exists K_1\) such that \(\varvec{Q}\preceq K_1\varvec{I}\), and hence \(\Vert \lambda \varvec{I}+\varvec{Q}\varvec{J}\varvec{H}_h^{-1}\varvec{J}^T\Vert \le \lambda +(K/\gamma _a)\), where \(K=K_1\tilde{\beta }^2\). Note that although we use limits in Assumption 4(ii), the assumption similarly holds in expectation by unbiasedness. Also, a sufficiently large sample size m may be required for Assumption 4(ii) to hold, by the law of large numbers; see e.g. [43, Lemma 1, Lemma 2].

Remark 4

\(\varvec{H}_g(\varvec{\theta })\) and \(\varvec{H}_h(\varvec{\theta })\) satisfy

$$\begin{aligned}{} & {} \gamma _l\varvec{I}_{n_w}\preceq \varvec{H}_g(\varvec{\theta }_k)\preceq \gamma _u \varvec{I}_{n_w}, \quad \left\| \varvec{H}_g(\varvec{y}, f(\varvec{\theta _1};\varvec{x})) - \varvec{H}_g(\varvec{y}, f(\varvec{\theta _2};\varvec{x}))\right\| \le \gamma _g\left\| \varvec{\theta _1} - \varvec{\theta _2}\right\| , \\{} & {} \gamma _a\varvec{I}_{n_w}\preceq \varvec{H}_h(\varvec{\theta }_k)\preceq \gamma _b \varvec{I}_{n_w}, \quad \left\| \varvec{H}_h(\varvec{y}, f(\varvec{\theta _1};\varvec{x})) - \varvec{H}_h(\varvec{y}, f(\varvec{\theta _2};\varvec{x}))\right\| \le \gamma _h\left\| \varvec{\theta _1} - \varvec{\theta _2}\right\| , \end{aligned}$$

at any point \(\varvec{\theta }_k\in {{\mathbb {R}}}^{n_w}\), for any \(\varvec{x}\in {{\mathbb {R}}}^{n_p},\varvec{y}\in {{\mathbb {R}}}^d\), and for any \(\varvec{\theta _1}, \varvec{\theta _2}\in {\mathcal {N}}_\epsilon (\varvec{\theta }^*)\) where Assumption 4 holds.

We now state a convergence result for the update type (14) (and hence (16)). First, we define the second-order optimality condition and state two useful lemmas.

Definition 2

(Second-order sufficiency condition (SOSC)) Let \(\varvec{\theta }^*\) be a local minimum of a twice-differentiable function \({\mathcal {L}}(\cdot )\). The second-order sufficiency condition (SOSC) holds if

$$\begin{aligned} \partial {\mathcal {L}}(\varvec{\theta }^*) = 0, \quad \partial ^2 {\mathcal {L}}(\varvec{\theta }^*) \succ 0. \end{aligned}$$
(SOSC)

Lemma 1

( [14, Theorem 1.2.3]) Suppose that Assumption 3 holds. Let \(\mathbb {W}\subseteq {{\mathbb {R}}}^{n_w}\) be a closed and convex set on which \({\mathcal {L}}(\varvec{\theta })\) is twice-continuously differentiable. Let \(S\subset \mathbb {W}\) be an open set containing some \(\varvec{\theta }^*\), and suppose that \({\mathcal {L}}(\varvec{\theta }^*)\) satisfies (SOSC). Then, there exists \({\mathcal {L}}^* = {\mathcal {L}}(\varvec{\theta }^*)\) satisfying

$$\begin{aligned} {\mathcal {L}}(\varvec{\theta }_k)>{\mathcal {L}}^* \quad \forall \varvec{\theta }_k \in S. \end{aligned}$$
(18)

Lemma 2

The adjustment \(\delta \varvec{\theta }\) given by (14) (and hence (16)) provides a descent direction for the total loss \({\mathcal {L}}(\varvec{\theta }_k)\) in (1) at the k-th oracle call.

Remark 5

Lemma 1, Lemma 2 and the first part of (SOSC) ensure that the second part of Assumption 4(ii) always holds. In essence, it holds at every point \(\varvec{\theta }_k\) of the sequence \(\{\varvec{\theta }_k\}\) generated by the process (16) as long as we choose a starting point \(\varvec{\theta _0}\in {\mathcal {B}}_\epsilon (\varvec{\theta }^*)\).

Theorem 3

Suppose that Assumptions 123 and 4 hold, and that \(\varvec{\theta }^*\) is a local minimizer of \({\mathcal {L}}(\varvec{\theta })\) for which the assumptions in Lemma 1 hold. Let \(\{\varvec{\theta }_k\}\) be the sequence generated by the process (16). Then starting from a point \(\varvec{\theta }_0\in {\mathcal {N}}_{\epsilon }(\varvec{\theta }^*)\), \(\{\varvec{\theta }_k\}\) converges at a Q-quadratic rate. Namely:

$$\begin{aligned} \left\| \varvec{\theta }_{k+1}-\varvec{\theta }^*\right\| \le \xi _k\left\| \varvec{\theta }_k-\varvec{\theta }^*\right\| ^2, \end{aligned}$$

where

$$\begin{aligned} \xi _k = \frac{1}{2}\frac{\gamma _g + b\gamma _h}{\left( \gamma _l+a\gamma _a - (\gamma _g+b\gamma _h)\left\| \varvec{\theta }_k-\varvec{\theta }^*\right\| \right) }. \end{aligned}$$

The proofs of these results are reported in “Appendix A” and “Appendix B”.

4 Self-concordant regularization (SCORE)

In the case \(dm < n_w\), one could deduce by mere visual inspection that the matrix \(\varvec{H}_h^{-1}\) in (16) plays an important role in the perturbation of \(\varvec{\theta }\) within the region of convergence, owing to its “dominating” structure in the equation. This may well be true. But beyond this naive view, let us develop a more technical intuition about the update step (16), using an analogy similar to that given in [14, Chapter 5]. We have

$$\begin{aligned} \varvec{\theta }_{k+1}=\varvec{\theta }_k-\varvec{H}_h^{-1}\varvec{J}^T\left( \lambda \varvec{I}+\varvec{Q}\varvec{J} \varvec{H}_h^{-1} \varvec{J}^T\right) ^{-1}\varvec{e}, \end{aligned}$$
(19)

where \(\varvec{I}\) is the identity matrix of suitable dimension. By some simple algebra (see e.g., [14, Lemma 5.1.1]), one can show that indeed the update method (19) is affine-invariant. In other words, the region of convergence \({\mathcal {N}}_{\epsilon }(\varvec{\theta }^*)\) does not depend on the problem scaling but on the local topological structure of the minimization functions \(g(\varvec{\theta })\) and \(h(\varvec{\theta })\) [14].

Consider the non-augmented form of (19): \(\varvec{\theta }_{k+1}=\varvec{\theta }_k-\varvec{Q}_g\varvec{J}_g^T\varvec{G}_g^{-1}\varvec{e}_g\), where \(\varvec{G}_g = \varvec{J}_g\varvec{J}_g^T\) is the so-called Gram matrix. It has been shown that, when learning an overparameterized model (sample size smaller than the number of variables), and as long as we start from an initial point \(\varvec{\theta }_0\) close enough to the minimizer \(\varvec{\theta }^*\) of \({\mathcal {L}}(\varvec{\theta })\) (assuming such a minimizer exists), both \(\varvec{J}_g\) and \(\varvec{G}_g\) remain stable throughout the learning process (see e.g., [17, 44, 45]). The term \(\varvec{Q}_g\) may have little to no effect on this behaviour; for example, in the case of the squared fit loss, \(\varvec{Q}_g\) is simply the identity. However, the original equation (19) involves the Hessian matrix \(\varvec{H}_h\) which, together with its bounds (namely, \(\gamma _a\), \(\gamma _b\) characterizing the region of convergence), changes rapidly when we scale the problem [14]. It therefore suffices to impose an additional assumption on the function \(h(\varvec{\theta })\) that helps to control the rate at which its second derivative \(\varvec{H}_h\) changes, thereby enabling us to characterize an affine-invariant structure for the region of convergence. Namely:

Assumption 5

(SCORE) The scaled regularization function \(\lambda h(\varvec{\theta })\) is three times differentiable and \(M_h\)-self-concordant. That is, the inequality

$$\begin{aligned} \left| \left<\varvec{u},\left( \partial ^3h(\varvec{\theta })[\varvec{u}]\right) \varvec{u}\right>\right| \le 2M_h\left<\varvec{u},\partial ^2h(\varvec{\theta })\varvec{u}\right>^{3/2}, \end{aligned}$$
(20)

where \(M_h\ge 0\), holds for any \(\varvec{\theta }\) in the closed convex set \(\mathbb {W}\subseteq {{\mathbb {R}}}^{n_w}\) and \(\varvec{u}\in {{\mathbb {R}}}^{n_w}\).

Here, \(\partial ^3h(\varvec{\theta })[\varvec{u}]\in {{\mathbb {R}}}^{n_w \times n_w}\) denotes the limit

$$\begin{aligned} \partial ^3h(\varvec{\theta })[\varvec{u}] :=\lim \limits _{t\rightarrow 0}\frac{1}{t}\left( \partial ^2h(\varvec{\theta }+t\varvec{u})-\partial ^2h(\varvec{\theta })\right) , \quad t \in {{\mathbb {R}}}. \end{aligned}$$

For a detailed analysis of self-concordant functions, we refer the interested reader to [14, 25].
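For intuition, inequality (20) can be checked numerically via the finite-difference definition of \(\partial ^3h(\varvec{\theta })[\varvec{u}]\) above. The sketch below does this for the classical self-concordant function \(h(\varvec{\theta })=-\sum _i\log \theta _i\), for which \(M_h=1\); this function is used only as an illustration and is not one of the regularizers employed later.

```python
# Sketch: finite-difference check of the self-concordance inequality (20) for
# h(theta) = -sum_i log(theta_i), a standard self-concordant function with M_h = 1.
import numpy as np

def hess(theta):                      # Hessian of -sum(log(theta)) is diag(1/theta^2)
    return np.diag(1.0 / theta**2)

rng = np.random.default_rng(3)
theta = 0.5 + rng.random(5)           # a point with theta > 0 (inside the domain)
u = rng.standard_normal(5)
t = 1e-6

D3_u = (hess(theta + t * u) - hess(theta)) / t        # ~ d^3 h(theta)[u]
lhs = abs(u @ D3_u @ u)                               # |<u, (d^3 h[u]) u>|
rhs = 2.0 * (u @ hess(theta) @ u) ** 1.5              # 2 M_h <u, d^2 h(theta) u>^{3/2}
print(lhs <= rhs)                                     # True (with M_h = 1)
```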

Given this extra assumption, and following the idea of Newton-decrement in [25], we propose to update \(\varvec{\theta }\) by

$$\begin{aligned} \delta \varvec{\theta }=-\frac{\alpha }{1+M_h\eta _k}\varvec{H}_h^{-1}\varvec{J}^T\left( \lambda \varvec{I}+\varvec{Q}\varvec{J} \varvec{H}_h^{-1} \varvec{J}^T\right) ^{-1}\varvec{e}, \end{aligned}$$
(21)

where \(\alpha >0\), \(\eta _k = \left<\varvec{g}_h,\varvec{H}_h^{-1}\varvec{g}_h\right>^{1/2}\) and \(\varvec{g}_h\) is the gradient of \(h(\varvec{\theta })\). The proposed method which we call GGN-SCORE is summarized, for one oracle call, in Algorithm 4.

Algorithm 4 GGN-SCORE (one oracle call)
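As the original listing of Algorithm 4 is not reproduced above, the following minimal sketch shows how one oracle call can implement the update (21): \(\varvec{J}\), \(\varvec{e}\) and the diagonal entries of \(\varvec{Q}\) are the mini-batch quantities of Sect. 3, and \(\varvec{g}_h\), \(\varvec{H}_h\) are the gradient and (diagonal) Hessian of the regularizer. The function signature is an assumption made here for illustration, and the factor \(\alpha /(1+M_h\eta _k)\) only approximates the adaptive step-size rule referenced later as Line 5 of Algorithm 4.

```python
# Minimal sketch (not the authors' code) of one GGN-SCORE oracle call, eq. (21).
import numpy as np

def ggn_score_step(theta, J, e, q, g_h, h_diag, lam=0.1, alpha=0.5, M_h=1.0):
    """theta: variables; J: M x n_w Jacobian (11); e: residual vector (12);
    q: diagonal of Q; g_h, h_diag: gradient and Hessian diagonal of h(theta)."""
    M = J.shape[0]
    Hinv_JT = J.T / h_diag[:, None]                    # H_h^{-1} J^T (diagonal H_h)
    A = lam * np.eye(M) + (q[:, None] * J) @ Hinv_JT   # lam*I_M + Q J H_h^{-1} J^T
    direction = -Hinv_JT @ np.linalg.solve(A, e)       # unscaled step as in (16)
    eta = np.sqrt(g_h @ (g_h / h_diag))                # eta_k = <g_h, H_h^{-1} g_h>^{1/2}
    return theta + (alpha / (1.0 + M_h * eta)) * direction   # scaled update (21)
```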

There is a wide class of self-concordant functions that meet practical requirements, either in their scaled form [14, Corollary 5.1.3] or in their original form (see e.g., [46] for a comprehensive list). Two of these are used in our experiments in Sect. 5.

In the following, we state a local convergence result for the process (21). We introduce the notation \(\Vert \Delta \Vert _{\varvec{\theta }}\) to represent the local norm of a direction \(\Delta \) taken with respect to the local Hessian of a self-concordant function h, \(\Vert \Delta \Vert _{\varvec{\theta }}:=\left<\Delta ,\partial _{\varvec{\theta }\varvec{\theta }}^2\,h(\varvec{\theta })\Delta \right>^{1/2} =\left\| \left[ \partial _{\varvec{\theta }\varvec{\theta }}^2\,h(\varvec{\theta })\right] ^{1/2}\Delta \right\| \). Hence, without loss of generality, a ball \({\mathcal {B}}_\epsilon (\varvec{\theta }^*)\) of radius \(\epsilon \) about \(\varvec{\theta }^*\) is taken with respect to the local norm for the function h in the result that follows.

Theorem 4

Let Assumptions 1234 and 5 hold, and let \(\varvec{\theta }^*\) be a local minimizer of \({\mathcal {L}}(\varvec{\theta })\) for which the assumptions in Lemma 1 hold. Let \(\{\varvec{\theta }_k\}\) be the sequence generated by Algorithm 4 with \(\alpha = \frac{\sqrt{\gamma _a}}{\beta _1}(K+\lambda \gamma _a)\), \(\beta _1=\beta \tilde{\beta }\). Then starting from a point \(\varvec{\theta }_0\in {\mathcal {N}}_{M_h^{-1}}(\varvec{\theta }^*)\), \(\{\varvec{\theta }_k\}\) converges to \(\varvec{\theta }^*\) according to the following descent properties:

$$\begin{aligned}&\mathbb {E}[{\mathcal {L}}(\varvec{\theta }_{k+1})] \le {\mathcal {L}}(\varvec{\theta }_k) - \left( \frac{\lambda }{M_h^2}\omega (\zeta _k)+\frac{\gamma _l}{2\gamma _a}\omega ''(\tilde{\zeta }_k)-\xi \right) ,\\&\mathbb {E}\Vert \varvec{\theta }_{k+1}-\varvec{\theta }^*\Vert _{\varvec{\theta }_{k+1}} \le \vartheta \Vert \varvec{\theta }_k-\varvec{\theta }^*\Vert _{\varvec{\theta }_k} + \frac{\gamma _u}{\beta _1}\Vert \varvec{\theta }_k-\varvec{\theta }^*\Vert + \frac{\gamma _g}{2}\Vert \varvec{\theta }_k-\varvec{\theta }^*\Vert ^2, \end{aligned}$$

where \(\omega (\cdot )\) is an auxiliary univariate function defined by \(\omega (t):=t-\ln (1+t)\) and has a second derivative \(\omega ''(t)=1/(1+t)^2\), and

$$\begin{aligned}&\zeta _k :=\frac{M_h}{1+M_h\eta _k}, \quad \tilde{\zeta }_k :=M_h\eta _k, \quad \xi :=\frac{2(\gamma _u+\lambda \gamma _b)}{\sqrt{\gamma _a}},\\&\vartheta :=1+\frac{\lambda }{\sqrt{\gamma _a}\beta _1(1-M_h\Vert \varvec{\theta }_k-\varvec{\theta }^*\Vert _{\varvec{\theta }_k})}. \end{aligned}$$

The proof is provided in “Appendix B”. The results of Theorem 4 combine the strong convexity and smoothness properties of both \(g(\varvec{\theta })\) and \(h(\varvec{\theta })\), and require only that \(h(\varvec{\theta })\) be self-concordant.

5 Experiments

In this section, we validate the efficiency of GGN-SCORE (Algorithm 4) in solving the problem of minimizing a regularized strongly convex quadratic function and in solving binary classification tasks. For the binary classification tasks, we use real datasets: \(\texttt {ijcnn1}\) and \(\texttt {w2a}\) from the LIBSVM repository [47] and \(\texttt {dis}\), \(\texttt {hypothyroid}\) and \(\texttt {coil2000}\) from the PMLB repository [48], with an 80:20 train:test split each. The datasets are summarized in Table 1. In each classification task, the model with a sigmoidal output \(\hat{\varvec{y}}_n\) is trained using the cross-entropy fit loss function \(\ell (\varvec{y}_n,\hat{\varvec{y}}_n) = \frac{1}{2}\sum _{n=1}^N \varvec{y}_n\log \left( \frac{1}{\hat{\varvec{y}}_n}\right) + (\varvec{1}-\varvec{y}_n)\log \left( \frac{1}{\varvec{1}-\hat{\varvec{y}}_n}\right) \), and the “deviance” residual [49] \(\varvec{e}_n = (-1)^{\varvec{y}_n+1}\sqrt{-2\left[ \varvec{y}_n\log (\hat{\varvec{y}}_n) + (\varvec{1}-\varvec{y}_n)\log (\varvec{1}-\hat{\varvec{y}}_n)\right] }\), \(\varvec{y}_n, \hat{\varvec{y}}_n\in \{\varvec{0},\varvec{1}\}\).
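For concreteness, a short sketch of the cross-entropy fit loss and the deviance residual, written directly from the formulas above, is given below; the vectorized form and names are assumptions made for illustration.

```python
# Sketch of the cross-entropy fit loss and the "deviance" residual e_n, following
# the formulas above; y holds binary labels and y_hat sigmoidal outputs in (0, 1).
import numpy as np

def cross_entropy(y, y_hat):
    return 0.5 * np.sum(y * np.log(1.0 / y_hat)
                        + (1.0 - y) * np.log(1.0 / (1.0 - y_hat)))

def deviance_residual(y, y_hat):
    sign = np.where(y > 0.5, 1.0, -1.0)          # (-1)^(y_n + 1)
    return sign * np.sqrt(-2.0 * (y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat)))

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])
print(cross_entropy(y, y_hat), deviance_residual(y, y_hat))
```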

Table 1 Real datasets: \(N =\) number of data samples, \(n_p=\) number of features, \(d=\) number of targets

We map the input space to a higher-dimensional feature space using the radial basis function (RBF) kernel \(K(\varvec{x}_n,\varvec{x}_n')=\exp (-\gamma \Vert \varvec{x}_n-\varvec{x}_n'\Vert ^2)\) with \(\gamma =0.1\). In each experiment, we use the penalty value \(\lambda =0.1\) with both the pseudo-Huber regularization \(h_\mu (\varvec{\theta })\) [50], parameterized by \(\mu >0\) [51, 52], and the \(\ell ^2\) regularization \(h_2(\varvec{\theta })\), defined respectively as

$$\begin{aligned} h_\mu (\varvec{\theta }) :=\sqrt{\mu ^2 +\varvec{\theta }^2}-\mu , \qquad h_2(\varvec{\theta }) = \Vert \varvec{\theta }\Vert _2^2 :=\sum _{i=1}^{n_w}\left| \theta _i\right| ^2, \end{aligned}$$

with coefficient \(\mu =1.0\). Throughout, we choose a batch size m of 512 for \(\texttt {w2a}\), \(\texttt {dis}\) and \(\texttt {hypothyroid}\), 2048 for \(\texttt {coil2000}\), and 4096 for \(\texttt {ijcnn1}\), unless otherwise stated. We assume a scaled self-concordant regularization so that \(M_h=1\) [14, Corollary 5.1.3].
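A minimal sketch of the two regularizers, together with the elementwise gradient and the Hessian diagonal that GGN-SCORE exploits through \(\varvec{H}_h\), is given below (names are illustrative).

```python
# Sketch of the pseudo-Huber and squared l2 regularizers with their elementwise
# gradient and Hessian diagonal (the diagonal of H_h used by GGN-SCORE).
import numpy as np

def pseudo_huber(theta, mu=1.0):
    s = np.sqrt(mu**2 + theta**2)
    return np.sum(s - mu), theta / s, mu**2 / s**3        # value, grad, Hessian diag

def l2_squared(theta):
    return np.sum(theta**2), 2.0 * theta, 2.0 * np.ones_like(theta)

theta = np.array([-2.0, 0.0, 0.5])
print(pseudo_huber(theta))
print(l2_squared(theta))
```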

5.1 GGN-SCORE for different values of \(\alpha \)

To illustrate the behaviour of GGN-SCORE for different values of \(\alpha \) in Algorithm 4 versus its value indicated in Theorem 4, we consider the problem of minimizing a regularized strongly convex quadratic function:

$$\begin{aligned} \min _{\varvec{\theta }}{\mathcal {L}}(\varvec{\theta }) :=\frac{1}{2}\varvec{\theta }^\top \hat{\varvec{Q}}\varvec{\theta } - \varvec{p}^\top \varvec{\theta } + \lambda h(\varvec{\theta }) \equiv g(\varvec{\theta }) + \lambda h(\varvec{\theta }), \end{aligned}$$
(22)

where \(\hat{\varvec{Q}}\in {{\mathbb {R}}}^{n_w\times n_w}\) is symmetric positive definite, \(\varvec{p}\in {{\mathbb {R}}}^{n_w}\), and g is \(\gamma _a\)-strongly convex with a \(\gamma _u\)-Lipschitz gradient, the smallest and largest eigenvalues of \(\hat{\varvec{Q}}\) corresponding to \(\gamma _a\) and \(\gamma _u\), respectively. For this function, provided the gradient and Hessian of \(h(\varvec{\theta })\) are known, for example when we choose \(h=h_\mu \) or \(h=h_2\), we have \(\partial {\mathcal {L}}(\varvec{\theta }) = \hat{\varvec{Q}}\varvec{\theta } - \varvec{p} + \lambda \partial h(\varvec{\theta })\) and \(\partial ^2{\mathcal {L}}(\varvec{\theta }) = \hat{\varvec{Q}} + \lambda \partial ^2\,h(\varvec{\theta })\). The coefficients \(\hat{\varvec{Q}}\) and \(\varvec{p}\) form our data: with \(\hat{\varvec{Q}}\equiv 0.1\times \varvec{M}^\top \varvec{M}\) and \(N=n_w=1000\), we generate \(\varvec{M}\in {{\mathbb {R}}}^{N\times N}\) randomly from a uniform distribution on [0, 1] and consider the case \({\mathcal {L}}^* = \varvec{0}\), in which \(\varvec{p}\) is the zero vector. The optimization variable \(\varvec{\theta }\) is initialized to a random value drawn from a normal distribution with mean 0 and standard deviation 0.01. Figure 1 shows the behaviour of GGN-SCORE on this problem for different values of \(\alpha \) in (0, 1] and for the value \(\alpha = \frac{\sqrt{\gamma _a}}{\beta _1}(K+\lambda \gamma _a)\) indicated in Theorem 4. We experiment with different batch sizes \(m\in \{64, 128, 256, 512\}\). One observes from Fig. 1 that a larger batch size yields better convergence speed when we choose \(\alpha = \frac{\sqrt{\gamma _a}}{\beta _1}(K+\lambda \gamma _a)\), validating the recommendation in Remark 3. Figure 1 also compares GGN-SCORE with the first-order Adam [5] algorithm and the quasi-Newton limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) [53] method using optimally tuned learning rates. While choosing \(\alpha = \frac{\sqrt{\gamma _a}}{\beta _1}(K+\lambda \gamma _a)\) yields the kind of convergence shown in Theorem 4, Fig. 1 shows that by choosing \(\alpha \) in (0, 1] we can similarly achieve good convergence that scales well with the problem.
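For reference, the data generation just described can be sketched as follows (the random seed and helper names are assumptions; the regularizer and its derivatives can be taken from the sketch of \(h_\mu \) and \(h_2\) given in the previous subsection).

```python
# Sketch of the strongly convex quadratic test problem (22): Q_hat = 0.1 * M^T M
# with M ~ U[0, 1]^(N x N), p = 0, and theta_0 ~ N(0, 0.01^2).
import numpy as np

rng = np.random.default_rng(4)
N = n_w = 1000
M = rng.uniform(0.0, 1.0, size=(N, N))
Q_hat = 0.1 * (M.T @ M)
p = np.zeros(n_w)
lam = 0.1

eigvals = np.linalg.eigvalsh(Q_hat)
gamma_a, gamma_u = eigvals[0], eigvals[-1]   # smallest/largest eigenvalues of Q_hat

def grad_g(theta):
    return Q_hat @ theta - p                 # gradient of the quadratic part g

theta0 = rng.normal(0.0, 0.01, size=n_w)     # random initialization of theta
```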

Fig. 1: Numerical behaviour of GGN-SCORE for different values of \(\alpha \) in the strongly convex quadratic test problem (22)

Fig. 2: Numerical behaviour of GGN-SCORE for different values of \(\alpha \) using real datasets. \(m = 512\) for w2a, dis and hypothyroid, and \(m=2048\) for coil2000

Strictly speaking, the value of \(\alpha \) indicated in Theorem 4 is not of practical interest, as it contains terms that may not be straightforward to obtain in practice. In practice, we treat \(\alpha \) as a hyperparameter that takes a fixed positive value in (0, 1]. For an adaptive step-size selection rule, such as that in Line 5 of Algorithm 4, choosing a suitable scaling constant such as \(\alpha \) is often straightforward, as the main step-size selection task is accomplished by the rule itself. We show the behaviour of GGN-SCORE on the real datasets for different values of \(\alpha \) in (0, 1] in Fig. 2. In general, a suitable scaling factor \(\alpha \) should be selected based on the demands of the application.

Fig. 3: Convergence curves for GGN-SCORE, SGD, Adam and L-BFGS in the proposed convex problem. \(m = 512\) for w2a, dis and hypothyroid, \(m=2048\) for coil2000, and \(m=4096\) for ijcnn1

Fig. 4: Convergence curves for GGN-SCORE, SGD, Adam and L-BFGS in the non-convex neural network training problem: overparameterized for dis, hypothyroid, w2a and coil2000. \(m = 512\) for w2a, dis and hypothyroid, \(m=2048\) for coil2000, and \(m=4096\) for ijcnn1

5.2 Comparison with SGD, Adam, and L-BFGS methods on real datasets

Using the real datasets, we compare GGN-SCORE for solving (1) with the SGD, Adam, and L-BFGS algorithms using optimally tuned learning rates. We also consider the problem of training a neural network with two hidden layers of sizes (2, 128) for the \(\texttt {coil2000}\) dataset, one hidden layer of size 1 for the \(\texttt {ijcnn1}\) dataset, and two hidden layers of sizes (4, 128) for the remaining datasets. We use ReLU activation functions in the hidden layers, and the network is overparameterized for \(\texttt {dis}\), \(\texttt {hypothyroid}\), \(\texttt {w2a}\) and \(\texttt {coil2000}\), with 25425, 21529, 23497 and 16229 trainable parameters, respectively. We choose \(\alpha \in \{0.2, 0.5\}\) for GGN-SCORE. Minimization variables are initialized to the zero vector for all methods. The neural network training problems are solved under the same settings. The results are displayed in Fig. 3 and Fig. 4 for the convex and non-convex cases, respectively.

Fig. 5: Classification task results using the method of this paper, logistic regression (LR), k-nearest neighbours (KNN), linear support vector machine (L-SVM), RBF-SVM, Gaussian process (GP), decision tree (DT), random forest (RF) and adaptive boosting (AdaBoost) techniques for dis, hypothyroid, coil2000 and w2a datasets. LR and SVM methods use the \(\ell _2\) penalty function with parameter \(\lambda = 1.0\). Solutions with GGN-SCORE, SGD, Adam, and L-BFGS are obtained over one data pass with \(m=128\)

To investigate how well the learned model generalizes, we use the binary accuracy metric, which measures how often the model predictions match the true labels when presented with new, previously unseen data: \(Accuracy = \frac{1}{N}\sum _{n=1}^{N}\left( 2\varvec{y}_n\hat{\varvec{y}}_n-\varvec{y}_n-\hat{\varvec{y}}_n+1\right) \). While GGN-SCORE converges faster than the SGD, Adam and L-BFGS methods, it also generalizes comparatively well. The results are further compared with other well-known binary classification techniques to measure the quality of our solutions. The accuracy scores for the dis, hypothyroid, coil2000 and w2a datasets, with a 60:40 train:test split each, are computed on the test set. The mean scores are compared with those from the different classification techniques in Fig. 5. The CPU runtimes are also compared, indicating that on average GGN-SCORE solves each of the problems within one second. This compares well with the other techniques; we note that while GGN-SCORE solves each of the problems in high dimensions, the success of most of the other techniques is limited to relatively small problem dimensions. The results from the classification techniques used for comparison are obtained directly from the respective scikit-learn [54] implementations.
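A short sketch of the accuracy computation, with sigmoidal outputs thresholded to \(\{0,1\}\), is given below; the expression inside the sum equals 1 exactly when the prediction matches the label (data shown are illustrative).

```python
# Sketch of the binary accuracy metric above; 2*y*yp - y - yp + 1 is 1 if y == yp
# and 0 otherwise, so the mean recovers the usual accuracy.
import numpy as np

y = np.array([1, 0, 1, 1, 0])
y_hat = np.array([0.8, 0.3, 0.4, 0.9, 0.1])   # illustrative sigmoidal outputs
yp = (y_hat >= 0.5).astype(int)               # thresholded predictions

print(np.mean(2 * y * yp - y - yp + 1), np.mean(y == yp))   # identical values
```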

The L-BFGS experiments are implemented with PyTorch [55] (v. 1.10.1+cu102) in full-batch mode. The GGN-SCORE, Adam and SGD methods are implemented using the open-source Keras API with the TensorFlow [56] backend (v. 2.7.0). All experiments are performed on a laptop with a dual (2.30GHz + 2.30GHz) Intel Core i7-11800H CPU and 16GB RAM.

In summary, GGN-SCORE converges considerably faster (in terms of the number of epochs) than SGD, Adam, and L-BFGS, and generalizes comparatively well. The experimental results show the computational convenience and elegance achieved by our “augmented” approach for including regularization functions in the GGN approximation. Although GGN-SCORE has a higher computational cost per iteration (in terms of wall-clock time) than first-order methods on average, this is not a significant issue when per-iteration time is not a bottleneck, since in our experiments the proposed optimizer needs only a few passes (epochs) over the dataset to obtain superior function approximation and relatively high-quality solutions.

6 Conclusion

In this paper, we have proposed GGN-SCORE, a generalized Gauss–Newton-type algorithm for solving unconstrained regularized minimization problems in which the regularization function is self-concordant. In this generalized setting, we employed a matrix approximation scheme that significantly reduces the computational overhead associated with the method. Unlike existing techniques that impose self-concordance on the problem’s objective function, our analysis involves a less restrictive condition from a practical point of view but similarly benefits from the idea of self-concordance, by considering scaled optimization step lengths that depend on the self-concordant parameter of the regularization function. We proved a quadratic local convergence rate for our method under certain conditions and validated its efficiency in numerical experiments involving both strongly convex problems and the general non-convex problems that arise when training neural networks. In both cases, our method compares favourably with the Adam, SGD, and L-BFGS methods in terms of per-iteration convergence speed, as well as with several machine learning techniques used in binary classification in terms of solution quality.

In future research, it would be interesting to relax some conditions on the problem and analyze a global convergence rate for the proposed method. We would also like to analyze our method in a general non-convex setting, even though numerically we have observed a convergence speed similar to that of the strongly convex case.