Hessian averaging in stochastic Newton methods achieves superlinear convergence

We consider minimizing a smooth and strongly convex objective function using a stochastic Newton method. At each iteration, the algorithm is given an oracle access to a stochastic estimate of the Hessian matrix. The oracle model includes popular algorithms such as Subsampled Newton and Newton Sketch, which can efficiently construct stochastic Hessian estimates for many tasks, e.g., training machine learning models. Despite using second-order information, these existing methods do not exhibit superlinear convergence, unless the stochastic noise is gradually reduced to zero during the iteration, which would lead to a computational blow-up in the per-iteration cost. We propose to address this limitation with Hessian averaging: instead of using the most recent Hessian estimate, our algorithm maintains an average of all the past estimates. This reduces the stochastic noise while avoiding the computational blow-up. We show that this scheme exhibits local Q-superlinear convergence with a non-asymptotic rate of (Υlog(t)/t)t\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(\varUpsilon \sqrt{\log (t)/t}\,)^{t}$$\end{document}, where Υ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varUpsilon $$\end{document} is proportional to the level of stochastic noise in the Hessian oracle. A potential drawback of this (uniform averaging) approach is that the averaged estimates contain Hessian information from the global phase of the method, i.e., before the iterates converge to a local neighborhood. This leads to a distortion that may substantially delay the superlinear convergence until long after the local neighborhood is reached. To address this drawback, we study a number of weighted averaging schemes that assign larger weights to recent Hessians, so that the superlinear convergence arises sooner, albeit with a slightly slower rate. Remarkably, we show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage, and still exhibits a superlinear convergence rate nearly (up to a logarithmic factor) matching that of uniform Hessian averaging.


Introduction
We consider minimizing a smooth and strongly convex objective function: where f : R d → R is twice continuously differentiable with H(x) = ∇ 2 f (x) ∈ R d×d being its Hessian matrix.We suppose the Hessian H(x) satisfies for some constants 0 < λ min ≤ λ max , and we let κ = λ max /λ min denote its condition number.We also let x = arg min x∈R d f (x) be the unique global solution.
Problem (1) is arguably the most basic optimization problem, which nevertheless arises in many applications in machine learning and statistics (Shalev-Shwartz and Ben-David, 2009;Bubeck, 2015;Bottou et al., 2018;Lan, 2020).There is a plethora of methods for solving (1), and each (type of) method has its own benefits under favorable settings.This paper particularly focuses on solving (1) via second-order methods, where a (approximated) Hessian matrix is involved in the scheme.Consider the classical Newton's method of the form where ∇f t = ∇f (x t ), H t = H(x t ), and µ t is selected by passing a line search condition.Classical results indicate that Newton's method in (2) exploits a global Q-linear convergence in the error of function value f (x t ) − f (x ); and a local Q-quadratic convergence in the iterate error x t − x .More precisely, Newton's method has two phases, separated by a neighborhood for some ν > 0.
When x t ∈ N ν , ( 2) is in the damped Newton phase, where the objective f (x t ) is decreased by at least a fixed amount in each iteration, and converges linearly.In this phase, x t − x may converge slowly (e.g., it provably converges R-linearly using the fact that x t − x 2 ≤ 2/λ min (f (x t ) − f (x ))).When x t ∈ N ν , (2) transits to the quadratically convergent phase, where the unit stepsize is accepted and x t − x converges quadratically.See (Boyd and Vandenberghe, 2004, Section 9.5) for the analysis.
Compared to first-order methods, although second-order methods often exploit a faster convergence rate and behave more robustly to tuning parameters, they hinge on a high computational cost of forming the exact Hessian matrix H t at each step.To resolve such a computational bottleneck, which is particularly pressing in large-scale data applications, a variety of deterministic and stochastic methods have been proposed for constructing different alternatives of the Hessian matrix.When a deterministic alternative of H t , say Ȟt , is employed in (2), which may come from a finite difference approximation of the second-order derivative, or from a quasi-Newton update such as BFGS or DFP, the convergence behavior is well understood.Specifically, if { Ȟt } t are positive definite with uniform upper and lower bounds, the damped phase is preserved by the same analysis as Newton's method.Furthermore, the quadratically convergent phase is weakened to a superlinearly convergent phase, and the superlinear convergence occurs if and only if the celebrated Dennis-Moré condition (Dennis and Moré, 1974) holds, i.e., lim t→∞ See (Nocedal and Wright, 2006, Theorem 3.7) for the analysis.Recently, a deeper understanding of quasi-Newton methods for minimizing smooth and strongly convex objectives has been reported in Jin and Mokhtari (2020); Rodomanov and Nesterov (2021a,b,c).These works enhanced the analyses based on Dennis-Moré condition in (3) by performing a local, non-asymptotic convergence analysis, and provided explicit superlinear rates for different potential functions of quasi-Newton methods.The non-asymptotic superlinear results are more informative than asymptotic superlinear results established via (3), i.e., x t+1 − x / x t − x → 0 as t → ∞.However, the analyses in Jin and Mokhtari (2020); Rodomanov and Nesterov (2021a,b,c) highly rely on specific properties of quasi-Newton updates in the Broyden class, and do not apply to general Hessian approximations (e.g., finite difference approximation).

Stochastic Newton methods
A parallel line of research explores stochastic Newton methods, where a stochastic approximation H t is used in place of H t in (2).The stochasticity of H t may come from evaluating the Hessian on a random subset of data points (i.e., subsampling), or from projecting the Hessian onto a random subspace to achieve the dimension reduction (i.e., sketching).To unify different approaches, we consider in this paper a general Hessian approximation given by a stochastic oracle.In particular, we express the approximation H(x) at any point x by where E(x) ∈ S d×d is a symmetric random noise matrix following a certain distribution (conditional on x) with mean zero.At iterate x t , we query the oracle to obtain an approximation H t = H(x t ), which (implicitly) comes from generating a realization of the random matrix E t = E(x t ), and then adding E t to the true Hessian H t .We do not assume E t and H t are accessible to us.
As mentioned, the popular specializations of stochastic oracle include subsampling and sketching.For Hessian subsampling, a finite-sum objective is considered (5) In the t-th iteration, a subset ξ t ⊆ {1, 2, . . ., n} is sampled uniformly at random, and the subsampled Hessian is defined as We note that the components f i in (5) may not be convex even if the function f is strongly convex. 1 This is the so-called sum-of-non-convex setting (e.g., see Garber and Hazan, 2015;Garber et al., 2016;Shalev-Shwartz, 2016;Allen-Zhu and Yuan, 2016).Our oracle model (4) allows for this, since H t is not required to be positive semidefinite, while the existing Subsampled Newton methods generally do not allow it.See Section 1.3 and Example 2.3 for further discussion.
For Hessian sketching, one first forms the square-root Hessian matrix M t ∈ R n×d satisfying H t = M t M t , where n is the number of data points.Then, one generates a random sketch matrix S t ∈ R s×n with the sketch size s and the property E[S t S t ] = I, and the sketched Hessian is defined as In some cases, M t can be easily obtained.One particular example is a generalized linear model, where the objective has the form with {a i } n i=1 ∈ R d being n data points.In this case, M t = 1 √ n • diag(f 1 (a 1 x t ) 1/2 , . . ., f n (a n x t ) 1/2 )A where A = (a 1 , . . ., a n ) ∈ R n×d is the data matrix.
1 A concrete example is finding the leading eigenvector of a covariance matrix A = 1 n n i=1 aia i .Here, the objective can be structured as f b,ν (x) = 1 n n i=1 (x (νI − aia i )x − b x), where ν > A and b are given.See Garber and Hazan (2015) for details.
Another family of stochastic Newton methods is based on the so-called Sketch-and-Project framework (e.g., Gower and Richtárik, 2015), where a low-rank Hessian estimate is used to construct a Newton-like step (with the Moore-Penrose pseudoinverse instead of the matrix inverse).For example, in one approach (Gower et al., 2019), the Newton-like step in the t-th iteration is obtained by generating a sketching matrix S t ∈ R s×d to construct a rank-s approximate of the inverse Hessian, resulting in the update: x t+1 = x t − µ t S t (S t H t S t ) † S t ∇f (x t ).While linear convergence rates have been derived for these methods (e.g., Dereziński and Rebrova, 2022), they do not fit into our stochastic oracle framework due to an intrinsic bias arising in the Hessian estimates (see Section 1.3 for further discussion).
The majority of literature studies the convergence of stochastic Newton by establishing a high probability error recursion.For example, Erdogdu and Montanari (2015); Roosta-Khorasani and Mahoney (2018); Pilanci and Wainwright (2017); Agarwal et al. (2017) all showed, roughly speaking, that stochastic Newton methods exploit a local linear-quadratic recursion: for some constants c 1 , c 2 > 0 and δ ∈ (0, 1).These constants depend on the sample/sketch size at each step.Based on (8), one often applies the result for t = 0, 1, . . ., T recursively, uses the union bound, and establishes local convergence with probability 1 − T δ.This approach has the following key drawbacks.
(a) The presence of the linear term with coefficient c 1 > 0 in the recursion means that, once x t is sufficiently close to x , the algorithm can only achieve the linear convergence, as opposed to the quadratic or superlinear convergence achieved by deterministic methods.Prior works (e.g., Roosta-Khorasani and Mahoney, 2018;Bollapragada et al., 2018) discussed how to achieve local superlinear convergence by diminishing the coefficient c 1 = c 1,t gradually as t increases.However, since c 1 is proportional to the magnitude of stochastic noise in the Hessian estimates, diminishing it requires increasing the sample size for estimating the Hessian, which results in a blow-up of the per-iteration computational cost.To be specific, c 1 is proportional to the reciprocal of the square root of the sample size; thus this blow-up in terms of the sample size can be as fast as exponential if we wish to attain the quadratic convergence rate, and linear if we wish to attain the superlinear convergence rate established in this paper later.
(b) The presence of the failure probability δ in (8) means that after T iterations, a convergence guarantee may only hold with probability 1 − T δ.Thus, the failure probability explodes when T → ∞.To resolve this issue one can gradually diminish δ, e.g., δ = δ t = O(1/t 2 ) such that t δ t < ∞.However, once again, such adjustment on δ leads to an increasing sample/sketch size as the algorithm proceeds, although not as drastically as in (a) (it suffices to increase the sample size logarithmically).Thus, in our stochastic oracle model, where the noise level (and hence the sample size) remains constant, it is problematic to show any convergence with high probability from (8) as T → ∞.We note that some prior works (Bollapragada et al., 2018;Meng et al., 2020) showed the convergence in expectation guarantees.Although the explosion of the failure probability in (b) is suppressed by the direct characterization of the expectation, the drawback (a) still remains.In addition, the convergence in expectation results require stronger assumptions.For example, (Bollapragada et al., 2018, (2.17)) and (Meng et al., 2020, Theorem 2) showed similar recursions to (8) for E[ x t − x ].However, these works assumed a bounded moments of iterates condition, i.e., E[ ) 2 for some constant γ > 0, and assumed each component f i in (5) to be strongly convex.Both conditions are not needed in high probability analyses.Note that the majority of high probability analyses assume that each component f i is convex though not necessarily strongly convex (Agarwal et al., 2017;Roosta-Khorasani and Mahoney, 2018), while we do not impose any conditions on f i , which significantly weakens the conditions in the existing literature.Furthermore, the stepsize in Bollapragada et al. (2018); Meng et al. (2020) was prespecified without any adaptivity.Bellavia et al. (2019) introduced a non-monotone line search to address the adaptivity issue, while extra conditions such as the compactness of {x t } t were imposed for their expectation analysis.

Main results: Stochastic Newton with Hessian averaging
Concerned by the above limitations of stochastic Newton methods, we ask the following question: Can we design a stochastic Newton method that exploits global linear and local superlinear convergence in high probability, even for an infinite iteration sequence, and without increasing the computational cost in each iteration?
We provide an affirmative answer to this question by studying a class of stochastic Newton methods with Hessian averaging.A simple intuition is that the approximation error E t can be de-noised by aggregating all the errors {E i } t i=0 , inspired by the central limit theorem for martingale differences.Thus, if we could reuse the past samples and replace E t by Compared to H t , such a matrix does not preserve the information of the true Hessian H t .Thus, we observe a trade-off between de-noising the error E t and preserving the Hessian information H t : on one hand, we want to assign equal weights to all the past errors to achieve the fastest concentration for the error average (see Remark 2.7); on the other hand, we want to assign all weights to H t to fully preserve the most recent Hessian information.In this paper, we investigate this trade-off, and show how the Hessian averaging resolves the drawbacks of existing stochastic Newton methods.

3:
Obtain a stochastic Hessian approximation H t = H t + E t ;

4:
Compute the Hessian average via (9): Compute the Newton direction p t by solving: H t p t = −∇f t ; 6: if H t p t = −∇f t is not solvable or ∇f t p t ≥ 0, then skip iteration with x t+1 = x t ; 7: Compute a stepsize µ t = ρ jt , where j t is the smallest nonnegative integer such that Armijo condition : Update x t+1 = x t + µ t p t ; 9: end for where {w t } ∞ t=−1 is a prespecified increasing non-negative weight sequence starting with w −1 = 0.By online we mean that we only keep an average Hessian H t in each iteration, and update it as (9) when we obtain a new Hessian estimate.This scheme is in contrast to keeping all the past Hessian estimates.By the scheme (9), we note that, at iteration t, we re-scale the weights of { H i } t−1 t=0 equally by a factor of w t−1 /w t , instead of completely redefining all the weights.We note that such an online averaging scheme is memory and computation efficient: compared to stochastic Newton methods without averaging, we only require an extra O(d 2 ) space to save H t and O(d 2 ) flops to update it, which is negligible in the setting of stochastic Newton methods where the Hessian vector product requires O(d 2 ) flops and solving the exact Newton system requires O(d 3 ) flops.In (9), we use the ratio factors w t−1 /w t instead of direct weights merely to simplify our later presentation.Furthermore, (9) can be reformulated as a general weighted average as follows: In particular, by setting w t = t + 1, we obtain z i,t = 1/(t + 1) and further have H t = 1 t+1 t i=0 H i .Thus, we recover simple uniform averaging.If we let the sequence w t grow faster-than-linearly, this results in a weighted average that is skewed towards the more recent Hessian estimates.
Our proposed method replaces H t by H t when computing the Newton direction (2).The detailed procedure is displayed in Algorithm 1.We make two comments about the algorithm.
First, Algorithm 1 supposes that, unlike the Hessian, the function values and gradients are known deterministically.As a result, the method generates a monotonically decreasing sequence of f (x t ).If, on the other hand, f (x t ) and ∇f (x t ) were known with random noise, we would have to relax the Armijo condition by adding extra error terms (hence, f (x t ) would be potentially non-monotonic), and analyze the resulting method under a random model framework such as in Blanchet et al. (2019); Chen et al. (2017).We leave these non-trivial extensions to future work.
Second, for early iterates, the average Hessian H t may not be a good approximate of H t .For example, H t may be indefinite (or even singular) so that p t is not a descent direction.In that case, we skip the line search step and let x t+1 = x t (see Line 6 of Algorithm 1).Further, even if H t is positive definite, it may not be properly bounded, and thus leads to an extremely small stepsize µ t .However, as analyzed later in Lemmas 3.1 and 4.
Table 1: Three transitions of two averaging schemes: uniform averaging (w t = t + 1, see Theorem 1.1) and our proposed weighted averaging scheme (w t = (t + 1) log(t+1) , see Theorem 1.2).For each weight sequence, the three transitions occur with high probability.
concentrated for large t with high probability, so that H t is properly bounded from above and below.Thus, Algorithm 1 is well defined and can be performed for any iteration t, while it behaves like the classical Newton method only after a few iterations (the threshold is provided in Lemmas 3.1 and 4.3).

Main results
We conduct convergence analysis for Algorithm 1 with different weight sequences {w t } ∞ t=−1 .Throughout the analysis, we only assume the oracle noise E(x) has a sub-exponential tail, which in particular includes Hessian subsampling and Hessian sketching as special cases.Our convergence guarantees rely on the quality of the average Hessian approximation H t ; thus, we do not require H t to be a good approximation of H t .In other words, our scheme is applicable even if we generate a single sample for forming the subsampled Hessian estimator, and applicable even if some components f i are non-convex.This is because the noise E t can always be reduced by averaging (ensured by the central limit theorem) even if it has a large variance.
We show that, with high probability, Algorithm 1 has four convergence phases with three transition points: (a) Global phase: x t converges from any initial iterate x 0 to a local neighborhood of x , in which the unit stepsize starts being accepted.
(b) Steady phase: x t stays in the neighborhood.
(c) Slow superlinear phase: x t starts converging superlinearly with a rate gradually increasing.
(d) Fast superlinear phase: the superlinear acceleration reaches its full potential and is maintained for all the following iterations.We mention that the superlinear rate is measured with respect to the error x t − x H := (x t − x ) H (x t − x 0 ) where H = H(x ).Before introducing the main results in Theorems 1.1 and 1.2, we summarize them in Table 1, showing the transition points for two weight sequences.The transitions for general weights are provided in Section 4. We use Υ to denote the noise level of stochastic Hessian oracle (see Assumption 2.1 for a formal definition; typically Υ = O(κ)).We also use O(•) to suppress the logarithmic factors and the dependence on other constants except Υ and κ.We emphasize that all results of the paper require that the (weighted) average of the errors {E i } t i=0 is sufficiently concentrated; thus, they hold with high probability.
The first main result studies the uniform averaging scheme, which is informally stated below.We refer to Theorem 3.8 for a formal statement.
First, we observe that our convergence guarantee holds with high probability for the entire infinite iteration sequence, which addresses the issue of the blow-up of failure probability associated with the existing stochastic Newton methods (see part (b) in Section 1.1).
Second, the parameter Υ characterizes the noise level of the stochastic Hessian oracle.When the Hessian is generated by subsampling or sketching, Υ depends on the adopted sample/sketch sizes.As illustrated in Examples 2.3 and 2.4, Υ = O(κ) for popular Hessian estimators when sample/sketch sizes are independent of κ.In this case, T = O(κ 2 ).We require T ≥ O(Υ 2 ) only to ensure that { H t } t≥T are positive definite with condition numbers scaling as κ, so that the Newton step based on H t leads to a usual decrease of the objective.On the other hand, popular stochastic Newton methods often generate a larger sample size (which depends on κ) to enforce E t ≤ λ min for any t ≥ 0 with an ∈ (0, 1) (e.g., see Lemma 2 and Equation 3.10 in Roosta-Khorasani and Mahoney, 2018;Pilanci and Wainwright, 2017, respectively).In that case, { H t } t≥0 are positive definite and so are { H t } t≥0 .Importantly, our method does not require such well-conditioned Hessian oracles.
Third, we notice that the uniform averaging approach has three transitions outlined in Theorem 1.1.For the iterations before T , the error in function value decreases linearly, implying that x t converges R-linearly.The first transition occurs after T iterations when x t reaches the local neighborhood, where second-order information starts being useful.T is also where the exact Newton methods would reach quadratic convergence.However, the averaged Hessian estimates still carry inaccurate Hessian information from the global phase, which is only gradually forgotten.From T to O(T κ) iterations, the algorithm gradually forgets the Hessian estimates in the global phase, while x t still converges R-linearly.The second transition occurs after O(T κ) iterations, when x t starts converging superlinearly (with a slow rate).The third transition occurs after O(T 2 κ 2 /Υ 2 ), when the superlinear rate is accelerated to (Υ log(t)/t ) t and stabilized.This rate comes from the central limit theorem of averaging out the oracle noise; thus, this rate cannot be further improved.
Note that if we were to reset the averaged Hessian estimate H t after the first transition (so that the Hessian average does not include information from the global phase), then the algorithm would immediately reach the superlinear rate (Υ log(t)/t ) t after T iterations (i.e., all transitions would occur at once).However, the algorithm does not know a priori when this transition will occur.As a result, the uniform Hessian averaging incurs a potentially significant delay of up to O(T 2 κ 2 /Υ 2 ) iterations before reaching the desired superlinear rate.
In our second main result, we address the delayed transition to superlinear convergence that occurs in the uniform averaging.Specifically, we ask: Does there exist a universal weighted averaging scheme that achieves superlinear convergence without a delayed transition, and without any prior knowledge about the objective?
Remarkably, the answer to this question is affirmative.The weighted/non-uniform averaging scheme we present below puts more weight on recent Hessian estimates, so that the second-order information from the global phase is forgotten more quickly, and the transition to a fast superlinear rate occurs after only O(T ) iterations.Thus, the superlinear convergence occurs without any delay (up to constant factors).Such a scheme will necessarily have a slightly weaker superlinear rate than the uniform averaging as t → ∞, but we show that this difference is merely an additional O( √ log t) factor (see Theorem 4.4 and Example 4.8 for a formal statement).
We note that, given some knowledge about the global/local transition points (e.g., if the algorithm knows κ, or if some convergence criterion is used for estimating the transition point), it is possible to switch from the more conservative weighted averaging to the asymptotically more effective uniform averaging within one run of the algorithm.However, since knowing transition points is difficult and rare in practice, we leave such considerations to future work, and only focus on problem-independent averaging schemes.
It is also worth mentioning that this paper only considers a basic stochastic Newton scheme based on (2), where we suppose exact function and gradient information and solutions to the Newton systems are known exactly.Some literature allows one to access inexact function values and/or gradients, and/or apply Newton-CG or MINRES to solve the linear systems inexactly (Fong and Saunders, 2012;Roosta-Khorasani and Mahoney, 2018;Liu and Roosta, 2021;Yao et al., 2021).Applying our sample aggregation technique under these setups is promising, but we defer it to future work.The basic scheme purely reflects the benefits of Hessian averaging, which is the main interest of this work.Additionally, some literature deals with non-convex objectives via stochastic trust region methods (Chen et al., 2017;Blanchet et al., 2019) or stochastic Levenberg-Marquardt methods (Ma et al., 2019).The averaging scheme may not directly apply for these methods due to potential bias brought by Hessian modifications for addressing non-convexity, while our sample aggregation idea is still inspiring.We leave the generalization to non-convex objectives to future work as well.Some literature addressed the superlinearity of stochastic Newton methods under distributed or federated learning settings (Islamov et al., 2021;Safaryan et al., 2021;Qian et al., 2022).These works are not fully compatible with our Hessian oracle framework, since they exploit some distributed nature of problem to produce Hessian estimates with noise diminishing to zero (as opposed to the bounded noise in this paper).

Literature review
Stochastic Newton methods have recently received much attention.The popular Hessian approximation methods include subsampling and sketching.
For subsampled Newton methods, aside from extensive empirical studies on different problems (Martens, 2010;Martens and Sutskever, 2011;Kylasa et al., 2019;Xu et al., 2020), the pioneering work in Byrd et al. (2011) established the very first asymptotic global convergence by showing that ∇f t → 0 as t → ∞, while the quantitative rate is unknown.Furthermore, Byrd et al. (2012); Friedlander and Schmidt (2012) studied Newton-type algorithms with subsampled gradients and/or subsampled Hessians, and established global Q-linear convergence in the error of function value f (x t ) − f (x ).However, the above analyses neglected the underlying probabilistic nature of the subsampled Hessian H t , and required H t to be lower bounded away from zero deterministically.Such a condition holds only if each f i in (5) is strongly convex, which is restrictive in general.Erdogdu and Montanari (2015) relaxed such a condition by developing a novel algorithm, where the subsampled Hessian is adjusted by a truncated eigenvalue decomposition.With the exact gradient information and properly prespecified stepsizes, the authors showed a linear-quadratic error recursion for x t −x in high probability.Arguably, the convergence of standard subsampled Newton methods is originally analyzed in Roosta-Khorasani and Mahoney (2018) and Bollapragada et al. (2018) from different perspectives.In particular, for both sampling and not sampling the gradient, Roosta-Khorasani and Mahoney (2018) showed a global Q-linear convergence for f (x t ) − f (x ) and a local linear-quadratic convergence for x t − x in high probability.Under some additional conditions, Bollapragada et al. (2018) derived a global R-linear convergence for the expected function value E[f (x t ) − f (x )] and a (similar) local linear-quadratic convergence for the expected iterate error E[ x t − x ].For both works, the authors also discussed how to gradually increase the sample size for Hessian approximation to achieve a local Q-superlinear convergence with high probability and in expectation, respectively.Building on the two studies, various modifications of subsampled Newton methods have been reported with similar convergence guarantees.We refer to Xu et al. (2016); Ye et al. (2017); Bellavia et al. (2019); Li et al. (2020) and references therein.We note that Kovalev et al. (2019) designed a scheme that allows for a single sample in each iteration of subsampled Newton.That work established a local linear convergence in expectation, while we obtain a superlinear convergence in high probability.
As a parallel approach to subsampling, Newton sketch has also been broadly investigated.Pilanci and Wainwright (2017) proposed a generic Newton sketch method that approximates the Hessian via a Johnson-Lindenstrauss (JL) transform (e.g., the Hadamard transform), and the gradient is exact.Furthermore, Agarwal et al. (2017); Dereziński and Mahoney (2019); Dereziński et al. (2020a,b) proposed different Newton sketch methods with debiased or unbiased Hessian inverse approximations.Dereziński et al. (2021a) relied on a novel sketching technique called Leverage Score Sparsified (LESS) embeddings (Dereziński et al., 2021b) to construct a sparse sketch matrix, and studied the trade-off between the computational cost of H t and the convergence rate of the algorithm.Similar to subsampled Newton methods, the aforementioned literature established a local linear-quadratic (or linear) recursion for x t − x in high probability.A recent work Lacotte et al. (2021) adaptively increased the sketch size to let the linear coefficient be proportional to the iterate error, which leads to a quadratic convergence.However, the per-iteration computational cost is larger than typical methods.See Berahas et al. (2020) for a review of subsampled and sketched Newton, and their connections to, and empirical comparisons with, first-order methods.
Sketching has also been used to construct low-rank Hessian approximations through the Sketchand-Project framework, originally developed by Gower and Richtárik (2015) for solving linear systems, and extended to general convex/nonconvex optimization by Luo et al. (2016); Doikov et al. (2018); Gower et al. (2019); Na and Mahoney (2022).The convergence properties of this family of methods have been thoroughly studied: they achieve linear convergence in expectation, with the rate controlled by a so-called stochastic condition number, which is defined as the smallest eigenvalue of the expectation of the low-rank projection matrix defined by the sketch (Gower and Richtárik, 2015).While the per-iteration cost of stochastic Newton methods based on Sketch-and-Project is generally lower than that of the aforementioned Newton sketch methods, their convergence rates are more sensitive to the spectral properties of the Hessian.The precise characterizations of the convergence rates are given in Mutný et al. (2020); Dereziński et al. (2020b); Dereziński and Rebrova (2022).Moreover, the Sketch-and-Project estimates are generally biased, so they are not appropriate for averaging.
In summary, none of the aforementioned existing works achieve superlinear convergence with a fixed per-iteration computational cost.Additionally, high probability convergence guarantees generally fail as t → ∞, with potent exceptions of certain stochastic trust-region methods (Chen et al., 2017) that enjoy almost sure convergence.However, the per-iteration computation of the exceptions is not fixed and the local rate is unknown.Further, for finite-sum objectives, the existing literature on stochastic Newton methods assumes each f i to be strongly convex (Bollapragada et al., 2018) or convex (Roosta-Khorasani and Mahoney, 2018).However, f i needs not be convex even if f is strongly convex.See Garber and Hazan (2015); Garber et al. (2016); Shalev-Shwartz (2016); Allen-Zhu and Yuan (2016) and references therein for first-order algorithms designed under such a setting.We address the above limitations of stochastic Newton methods by reusing all the past samples to average Hessian estimates.Our scheme is especially preferable when we have a limited budget for per-iteration computation (e.g., when we use very few samples in subsampled Newton, resulting in an ill-conditioned Hessian estimate).Our established non-asymptotic superlinear rates are stronger than the existing results, and our numerical experiments demonstrate the superiority of Hessian averaging.Notation: Throughout the paper, we use I to denote the identity matrix, and 0 to denote the zero vector or matrix.Their dimensions are clear from the context.We use • to denote the Recall that we reserve the notation λ min , λ max to denote the lower and upper bounds of the true Hessian, and κ = λ max /λ min is the condition number.

Structure of the paper:
In Section 2 we present the preliminaries on matrix concentration that are needed for our results.Then, we establish convergence results for the uniform averaging scheme in Section 3. Section 4 establishes convergence for general weight sequences.Numerical experiments and conclusions are provided in Sections 5 and 6, respectively.

Preliminaries on Matrix Concentration
In this section, we present the key results on the concentration of sums of random matrices, which we then use to bound the noise of the averaged Hessian estimates.The Hessian estimates constructed by our algorithm are not independent, and hence standard matrix concentration results do not apply.However, they do satisfy a martingale difference condition, which we will exploit to derive useful concentration results.
Given a sequence of stochastic iterates {x t } ∞ t=0 , we let F 0 ⊆ F 1 ⊆ F 2 ⊆ . . .be a filtration where F t = σ(x 0:t ), ∀t ≥ 0, is the σ-algebra generated by the randomness from x 0 to x t .With such a filtration, we denote the conditional expectation by and the conditional probability by P t (•) = P(• | F t ).We suppose x 0 is deterministic, so that F 0 is a trivial σ-algebra.
For a given weight sequence {w t } ∞ t=−1 with w −1 = 0, the scheme (9) leads to where z i,t = (w i −w i−1 )/w t .Note that t i=0 z i,t = 1 and z i,t ∝ z i := w i −w i−1 , i.e., z i,t is proportional to an un-normalized weight z i .We see from ( 11) that H t consists of the Hessian averaging and noise averaging with the same weights.In principle, the Hessian averaging t i=0 z i,t H i → H as x t → x , while the noise averaging t i=0 z i,t E i → 0 due to the central limit theorem.We will show that the Hessian averaging (eventually) converges faster than the noise averaging.
To study the concentration of noise averaging Ēt , we use the fact that {E t } ∞ t=0 is a martingale difference sequence, and rely on concentration inequalities for matrix martingales.These concentration inequalities require a sub-exponential tail condition on the noise.We say that a random variable 2 for all p = 2, 3, ... , which is consistent (up to constants) with all standard notions of sub-exponentiality (see Section 2.7 in Vershynin, 2018).
Assumption 2.1 (Sub-exponential noise).We assume that E(x) is mean zero and E(x) is Υ E -sub-exponential for all x.Also, we define Υ := Υ E /λ min to be the scale-invariant noise level.
Remark 2.2.The sub-exponentiality of E(x) implies that E(x) has sub-exponential matrix moments: • I for p = 2, 3, ... .In fact, our analysis immediately applies under this slightly weaker condition.We impose the moment condition on E(x) purely because it is easier to check in practice.Also, we note that sometimes the noise H(x) −1/2 E(x)H(x) −1/2 is more natural to study, e.g., for sketching-based oracles where we additionally have H(x) 0. Thus, we can alternatively impose Υ-sub-exponentiality on H(x) −1/2 E(x)H(x) −1/2 .Our analysis can also be adapted to this alternate condition, and leads to tighter convergence rates (in terms of the dependence on κ) for particular sketching-based oracles.However, this adaptation loses certain generality, and thus we prefer to impose conditions directly on the oracle noise E(x).
Assumption 2.1 is weaker than assuming E(x) to be uniformly bounded by Υ E , and it is satisfied by all of the popular Hessian subsampling and sketching methods.For example, when using Gaussian sketching, the noise is not bounded but sub-exponential.Further, the sub-exponential constant Υ E (and hence Υ) reflects how the stochastic noise depends on the sample/sketch size, as illustrated in the examples below (see Appendix A for rigorous proofs).
Example 2.3.Consider subsampled Newton as in ( 6) with sample size s = |ξ t |, and suppose ∇ 2 f i (x) ≤ λ max R for some R > 0 and for all i.Then, we have Υ = O(κR log(d)/s).If we additionally assume that all f i (x) are convex, then Υ is improved to Υ = O( κR log(d)/s + κR log(d)/s ).
From the above two examples, we observe that Υ scales as O(κ) when holding everything else fixed.Also, Example 2.3 illustrates that our Hessian oracle model applies to subsampled Newton even when some components f i (x) are non-convex (while f (x) is still convex), although this adversely affects the sub-exponential constant.For Gaussian sketch in Example 2.4, we can show that Υ = O( d/s + d/s) (where Υ was defined in Remark 2.2).Thus, the dependence on κ can be avoided for the sub-exponential constant of noise H(x) −1/2 E(x)H(x) −1/2 .Analogous noise bounds can be proved for other sketching matrices S, including sparse sketches and randomized orthogonal transforms.
We now show concentration inequalities for Ēt in (11) under Assumption 2.1.We state the following preliminary lemma, which is a variant of Freedman's inequality for matrix martingales.
Lemma 2.5 (Adapted from Theorem 2.3 in Tropp (2011a)).Let t ≥ 0 be a fixed integer.Consider a d-dimensional martingale difference {E i } t i=0 (i.e., E i [E i ] = 0).Suppose there exists a function g t : Θ t → [0, ∞] with Θ t ⊆ (0, ∞), and a sequence of matrices {U i } t i=0 , such that for any Then, we have for any scalars η ≥ 0 and σ2 > 0, The function g t in (Tropp, 2011a, Theorem 2.3) is defined on the full positive set (0, ∞), but the proof applies to any subset Θ t .We use Lemma 2.5 to show the next result.
The martingale concentration in Theorem 2.6 matches the matrix Bernstein results for independent noises {E i } t i=0 (cf.Theorems 6.1, 6.2 in Tropp, 2011b).For any δ ∈ (0, 1), if we let the right hand side of (13) be δ/(t + 1) 2 , then we obtain that, with probability at least 1 We provide the following remark to discuss the fastest concentration rate.
Remark 2.7.To achieve the fastest concentration rate, we minimize the right hand side of ( 14) under the restriction t i=0 z i,t = 1.Note that the minimum of both t i=0 z 2 i,t and z (max) t is attained with equal weights, that is z i,t = 1/(t + 1), for i = 0, 1, . . ., t.Thus, the fastest concentration rate is attained with equal weights.Furthermore, a union bound over t leads to We note that the square root term log(d(t + 1)/δ)/t + 1 dominates the error bound for large t.
Recalling from (11) that z i,t = (w i − w i−1 )/w t , we know w t = t + 1 for the equal weights.If fact, the concentration rate of Ēt relates to the superlinear convergence rate of x t (see Theorem 1.1), because, as shown in the following sections, the convergence rate of x t is proportional to Ēt when x t is sufficiently close to x .

Convergence of Uniform Hessian Averaging
We now study the convergence of stochastic Newton with Hessian averaging.We consider the uniform averaging scheme, i.e., w t = t + 1, ∀t ≥ 0. Our first result suggests that, with high probability, H t 0 for all large t.This implies that the Newton direction p t = −( H t ) −1 ∇f t will be employed from some t onwards (cf.Line 6 of Algorithm 1).We recall that Υ = Υ E /λ min , and e denotes the natural base.
Lemma 3.1.Consider Algorithm 1 with w t = t + 1, ∀t ≥ 0. Under Assumption 2.1, we let δ, ∈ (0, 1) with d/δ ≥ e.We also let Then, with probability 1 − δπ 2 /6, the event By Lemma 3.1, we initialize the convergence analysis from the iteration t = T 1 , and condition on the event E. For 0 ≤ t < T 1 , the Newton system may or may not be solvable and the lower and upper bounds of H t may or may not scale as λ min and λ max (cf.Line 6 of Algorithm 1).Thus, for the iterates x 0:T 1 , we do not generally have guarantees on the convergence rate, but only know that the objective value is non-increasing, that is, f We next provide a Q-linear convergence for the objective value f (x t ) − f (x ) for t ≥ T 1 .
Lemma 3.2.Conditioning on the event (17), we let , ∀t ≥ T 1 , which implies R-linear convergence of the iterate error, We next show that x t stays in a neighborhood around x for all large t.For this, we need a Lipschitz continuity condition.
Combining Lemma 3.2 with Assumption 3.3 leads to the following corollary.
Combining ( 16) and ( 19), and using O(•) to neglect logarithmic factors and all constants except κ and Υ, we have T = O(Υ 2 + κ 2 ) with high probability.Building on Corollary 3.4, we then show that the unit stepsize is accepted locally.
The unit stepsize enables us to show a linear-quadratic error recursion.
Lemma 3.6 suggests that x t − x exhibits local Q-linear convergence.
Given all the presented lemmas, we state the final convergence guarantee.
Theorem 3.8.Consider Algorithm 1 with w t = t + 1, ∀t ≥ 0. Under Assumptions 2.1, 3.3, we let δ ∈ (0, 1) satisfy d/δ ≥ e, and let , ν ∈ (0, 1) satisfy Define the neighborhood N ν as in ( 18), and define T = T 1 + T 2 with T 1 given by ( 16) and T 2 given by ( 19).We also let J = 4T κ/ν.Then, with probability 1 − δπ 2 /6, we have that x T 1 :T +J converges R-linearly, x T :T +J ∈ N ν , and We note from Theorem 3.8 that the condition on and ν does not depend on unknown quantities of objective function and noise level.The convergence rate ρ t consists of two terms.The first term is due to the fact that T = O(Υ 2 + κ 2 ) Hessians are accumulated in the global phase, and each of them contributes an error (in norm • H ) as large as κ.Given these imprecise Hessians, the method cannot immediately converge superlinearly after T iterations.That is, ρ t 1 if J = t = 0. We need J = O(T κ) iterates to suppress the effect of these imprecise Hessians.The second term is due to the noise averaging, i.e., Ēt , which decays slower than the first term.Thus, for sufficiently large t, the noise averaging will finally dominate the convergence rate.
We present the above observation in the following corollary.It suggests that the averaging scheme has three transition points; thus four convergence phases.
Corollary 3.9.Under the setup of Theorem 3.8, Algorithm 1 has three transitions: (a): From x 0 to x T : the algorithm converges to a local neighborhood N ν from any initial point x 0 .(b): From x T to x T +J : the sequence x t stays in the neighborhood N ν .
(Starting from x T 1 , the sequence x t exhibits R-linear convergence) (c): From x T +J to x T +J +K : the algorithm converges Q-superlinearly with (d): From x T +J +K : the algorithm converges Q-superlinearly with where t ≥ T + J + K.
For an infinite iteration sequence {x t } ∞ t=0 and with high probability, Corollary 3.9(a) suggests that the first transition is T = O(κ 2 + Υ 2 ); Corollary 3.9(b) suggests that the second transition is J = O(T κ); Corollary 3.9(c) suggests that the third transition is K = O(T 2 κ 2 /Υ 2 ); and Corollary 3.9(d) suggests that the final rate is ρ (2) t = O(Υ log t/t).This recovers Theorem 1.1.Recalling that Υ typically does not exceed κ, in this case, we have T = O(κ 2 ), J = O(κ 3 ), and K = O(κ 6 /Υ 2 ).This suggests a trade-off between the final superlinear rate and the final transition.When the oracle noise level Υ is small, a faster superlinear rate is finally attained, but K also increases, meaning that the time to attain the final rate is further delayed.By Examples 2.3 and 2.4, Υ decays as sample/sketch size increases.Thus, the final superlinear rate is improved by increasing the sample/sketch size s, however the effect of this change may be delayed due to the rate/transition trade-off.Fortunately, as we will see in the following section, this trade-off can be optimized via weighted Hessian averaging.

Convergence of Averaging with General Weights
Although the superlinear convergence of stochastic Newton with uniform Hessian averaging (Corollary 3.9) is promising, since the scheme eventually attains the optimal superlinear rate implied by the central limit theorem, a clear drawback is the delayed transition-the scheme spends quite a long time before attaining the final rate.In this section, we study the relationship between transitions and general weight sequences.We consider performing Algorithm 1 with a weight sequence w t that satisfies the following general condition.
By the above assumption, we specialize the result of noise averaging in (14) as follows.
Naturally, to have Ēt concentrate, we require lim t→∞ log( d(t+1) δ ) w (t) w(t) = 0.It is easy to see that for some weight sequences, such as w(t) = exp(t) − exp(−1), such a requirement cannot be satisfied, which makes convergence fail.On the other hand, this is reasonable since (i), which means that we assign an exponentially large weight to the current Hessian estimate.Such an assignment preserves recent Hessian information better, but it diminishes the previous estimates too quickly to let the noise averaging concentrate.
Given the above concentration results, we have a similar result to Lemma 3.1.
We note that Lemma 3.2 and Corollary 3.4 still hold for general weight sequences.Thus, we let and know that x t ∈ N ν for all t ≥ I. Lemmas 3.5, 3.6 and Corollary 3.7 also carry over to the setting of general weight sequences.Building on these results, we state the final convergence guarantee.
The proof of Theorem 4.4 is more involved than the one of Theorem 3.8.In particular, when dealing with a critical term I+U +t j=I (w(j) − w(j − 1)) x j − x H , the analysis of Theorem 3.8 uses the facts that w(j) − w(j − 1) = 1 and x j converges linearly (cf.Corollary 3.7).However, that analysis does not apply for general weights, since plugging the setup w(j) = j + 1 masks some critical properties of w(•), such as w (j) ≥ 0 and w(j + 1)/w(j) ≤ Ψ, ∀j ≥ 0 (Ψ = 2 for w(j) = j + 1).In contrast, for Theorem 4.4, we separate the above term into two terms, I+U j=I (w(j) − w(j − 1)) x j − x H and I+U +t j=I+U +1 (w(j) − w(j − 1)) x j − x H .The first term is bounded by w(I + U)ν since x j ∈ N ν for I ≤ j ≤ I + U.The second term, which is proven to have the same bound as the first term, not only utilizes the linear convergence of x j with a rate depending on Ψ (proved by induction), but also utilizes general properties of w(•) in Assumption 4.1.
We arrive at the following corollary for the iteration transitions.The proof is the same as for Corollary 3.9, and is omitted.(Starting from x I 1 , the sequence x t exhibits R-linear convergence) (c): From x I+U to x I+U +V : the algorithm converges Q-superlinearly with for θ (1) (d): From x I+U +V : the algorithm converges Q-superlinearly with where t ≥ I + U + V.
Given Corollary 4.5, we provide the following examples.Again, we use O(•) to neglect the logarithmic factors and the dependence on all constants except κ and Υ.We emphasize that for an iteration sequence, the three transition points occur only with high probability.For sake of presentation, we will not repeat "with high probability" for each O(•).As mentioned, typically, Υ = O(κ) in practice.We note that for all considered w t sequences, Assumption 4.1 is satisfied and I 1 has the same order as T 1 (cf.( 16)), so that I = O(T ) = O(Υ 2 + κ 2 ).In other words, the weighted averaging and uniform averaging take the same order of iterations to get into the same local neighborhood.In the following examples, we show how the second and third transition points U and V, as well as the superlinear rates θ (1) t and θ (2) t , change for different weight sequences (see Table 2 for a summary of the case Υ ≤ O(κ), i.e.I = O(κ 2 )).
The above example recovers Corollary 3.9 (up to constant scaling factors).It achieves the best final rate θ (2) t ensured by the central limit theorem, but the initial rate θ (1) t is sub-optimal.Thus, this scheme takes a long time for the iterates to reach the final rate.
Example 4.7 (Accelerated initial rate).Let w t = (t + 1) p for p ≥ 1.The superlinear rates are: Moreover, the second and third transition points are given by

Initial superlinear phase
Final superlinear phase U rate θ (1) t V rate θ (2) t t + 1 (Ex.4.6) Table 2: Comparison of different averaging schemes, in terms of how many iterations it takes to transition to each superlinear phase, and the convergence rates achieved.We drop constant factors as well as logarithmic dependence on d and 1/δ, and assume that 1/poly(κ) ≤ Υ ≤ O(κ).
In the above example, we substantially accelerate the initial superlinear rate θ (1) t , while preserving the final rate θ (2) t up to a constant factor p. We observe that the transitions U and V are both improved upon Example 4.6.In particular, the dependence of κ in U is improved from κ to κ 1/p , and the dependence of I in V is improved from I 2 to I 2p/(2p−1) .An important observation is that both U and V contract to I as p goes to ∞, which inspires us to consider the following sequence.
Example 4.8 (Optimal transition points).Let w t = (t + 1) log(t+1) .Then, the superlinear rates are: We now derive the second and third transition points.In particular, U can be obtained from: where the first implication uses the fact y = x log x ⇐⇒ x = exp( √ log y), as well as the fact log κ ≤ (log I) 2 .The third transition V can be obtained from: where the implication uses κ 2 ≤ I. To let the right hand side hold, we let V = ξ • I and require Thus, the final transition point V can be chosen as Note that, as long as the oracle noise Υ is bounded below as Υ ≥ 1/poly(κ), we have V = O(I).
Since, as we mentioned, in practice the oracle noise actually grows proportionally with κ, this is an extremely mild condition.However, it is technically necessary, since, if the Hessian oracle always returned the true Hessian, i.e., Υ = 0, then θ (2) t = 0, so we would necessarily have V = ∞ (i.e., Hessian averaging is not helpful when we have the exact Hessian).In the above example, all of the transition points I, U, V are within constant factors of one another (not even logarithmic factors), while we sacrifice only log(t) in the final superlinear rate.The above results recover Theorem 1.2.This suggests that, using the weight sequence w t = (t + 1) log(t+1) , the averaging scheme transitions from the global phase to the local superlinearly convergent phase smoothly, and the local neighborhood that separates the two phases is consistent with the neighborhood of classical Newton methods that separate the global phase (i.e., the damped Newton phase) and the local quadratically convergent phase (cf.Lemma 3.6 with H t = H t ).

Numerical Experiments
We implement the proposed Hessian averaging schemes and compare them with popular baselines including BFGS, subsampled Newton methods and sketched Newton methods3 .We focus on the regularized logistic regression problem min where {(b i , a i )} n i=1 with b i ∈ {−1, 1} are input-label pairs.We let A = (a 1 , a 2 , . . ., a n ) ∈ R n×d be the data matrix.

Data generating process:
We generate several data matrices with (n, d, ν) = (1000, 100, 10 −3 ), varying their properties.First, we vary the coherence of A, since higher coherence leads to higher variance for subsampled Hessian estimates, compared to sketched Hessian estimates.We use τ A to denote the coherence of A, which is defined as where A = UΣV is the reduced singular value decomposition of A with U ∈ R n×d and V ∈ R d×d .Second, we vary the condition number κ A of A, since (as suggested by our theory), higher condition number leads to slower convergence and delayed transitions to superlinear rate.The detailed generalization of A is as follows.We fix V = I ∈ R d×d .For low coherence, we generate a n × d random matrix with each entry being independently generated from N (0, 1), and let U be the left singular vectors of that matrix.For high coherence, we divide each row of U by √ z i , where z i is independently sampled from Gamma distribution with shape 0.5 and scale 2. We observe that the coherence of the low coherence matrix is ≈ 1 (minimal), while the coherence of the high coherence matrix is ≈ 10 = n/d (maximal).For either low or high coherence, we vary κ A = {d 0.5 , d, d 1.5 }.For each κ A , we let singular values in Σ vary from 1 to κ A with equal spacing, and finally form the matrix A = UΣ.We also generate a random vector x ∼ N (0, 1/d • I), and the response b , where "-" indicates exceeding 999 iterations.In the case of stochastic methods, we provide three numbers for (the best one in bold): NoAvg, UnifAvg, and WeightAvg, which correspond to the standard method (without Hessian averaging), the method with uniform averaging (w t = t + 1, as in Theorem 1.1), and the method with weighted averaging (w t = (t + 1) log(t+1) , as in Theorem 1.2), respectively.For the setup, we use τ A to denote the coherence number of A; κ A to denote the condition number of A; and s to denote the sample/sketch size for the stochastic methods.
Newton methods.Given a batch size s, the subsampled Newton method computes the Hessian as where the index set ξ has size s and is generated uniformly from [1, n] without replacement.The sketched Newton method instead computes the Hessian as where S ∈ R s×n is the sketch matrix.We consider three sketching methods, one dense (slow and accurate) and two sparse variants (fast but less accurate).
(i) Gaussian sketch: we let S i,j iid ∼ N (0, 1).(ii) CountSketch (Clarkson and Woodruff, 2017): for each column of S, we randomly pick a row and realize a Rademacher variable.(iii) LESS-Uniform (Dereziński et al., 2021a): for each row of S, we randomly pick a fixed number of nonzero entries and realize a Rademacher variable in each nonzero entry.We let the number of nonzero entries in each row to be 0.1d.For (i)-(iii), the sketches are scaled appropriately so that E[ H(x)] = H(x).
For each of the above Hessian oracles (both sketched and subsampled), we consider three variants of the stochastic Newton methods, depending on how the final Hessian estimate is constructed.
The overall results are summarized in Table 3, where we show the number of iterations until convergence for each experiment (we use x t − x H ≤ 10 −6 as the criterion).We display the median over 50 independent runs.The first general takeaway is that our proposed WeightAvg performs best overall, and it is the most robust to different problem settings, when varying the coherence τ A , condition number κ A , sample size s, and sketch/sample type.In particular, among the three variants of NoAvg/UnifAvg/WeightAvg, the latter is the only one to beat BFGS in all settings with sample size as small as s = 0.5d, as well as the only one to successfully converge within the iteration budget (999 iterations) for all Hessian oracles.Nevertheless, there are a few problem instances where either NoAvg or UnifAvg perform somewhat better than WeightAvg, which can also be justified by our theory.In the cases where NoAvg performs best, the oracle noise is particularly small (due to the use of a dense Gaussian sketch and/or a large sketch size), which means that averaging out the noise is not helpful until long after the method has converged very close to the optimum.On the other hand, the cases where UnifAvg performs best are characterized by a well-conditioned objective function, where the superlinear phase of the optimization is reached almost instantly, so the slightly weaker superlinear rate of WeightAvg compared to UnifAvg manifests itself before reaching convergence (i.e., the additional factor of log(t) in Theorem 1.2).
To investigate the performance of Hessian averaging more closely, we present selected convergence plots in Figure 1.The figure shows decay of the error in the log-scale, so that a linear slope indicates linear convergence whereas a concave slope implies superlinear rate.Here, we used subsampling as the Hessian oracle, varying the coherence τ A , the condition number κ A , and sample size s, and compared the Hessian averaging schemes UnifAvg and WeightAvg against the baselines of standard Subsampled Newton (i.e., NoAvg) and BFGS.We make the following observations: (a) Subsampled Newton with Hessian averaging (UnifAvg and WeightAvg) exhibits a clear superlinear rate, observable in how its error plot curves away from the linear convergence of NoAvg.
We note that BFGS also exhibits a superlinear rate, but only much later in the convergence process.
(b) The gain from Hessian averaging (relative to NoAvg) is more significant both for highly ill-conditioned problems (large condition number κ) and for noisy Hessian oracles (small sample size s).For example, in the setting of (κ, s) = (d 1.5 , d) for both low and high coherence, standard Subsampled Newton (NoAvg) converges orders of magnitude slower than BFGS, and yet after introducing weighted averaging (WeightAvg), it beats BFGS by a factor of at least 2.5.
(c) For small condition number, the two Hessian averaging schemes (UnifAvg and WeightAvg) Figure 1: Convergence plots for Subsampled Newton with several Hessian averaging schemes, compared to the standard method without averaging (NoAvg) and to BFGS.For all of the plots, we have (n, d, ν) = (1000, 100, 10 −3 ).We vary the data coherence τ A , condition number κ A and the subsample size s.
perform similarly, although in the setting of (κ, s) = ( √ d, d), the superlinear rate of UnifAvg is slightly better than that of WeightAvg (cf. Figure 1(a)), which aligns with a slightly better rate in Theorem 1.1 compared to Theorem 1.2.
(d) For highly ill-conditioned problems, WeightAvg converges much faster than UnifAvg.This is because, as suggested by Theorem 1.1, it takes much longer for UnifAvg to transition to its fast rate.In fact, in the settings of (τ A , κ A , s) = (1, d 1.5 , 5d) and (τ A , κ A , s) = (10, d, 5d), UnifAvg initially trails behind NoAvg, which is a consequence of the distortion of the averaged Hessian by the noisy estimates from the early global convergence.We truncate iterations at 20 to highlight the superlinear rate of UnifAvg and WeightAvg.
We next investigate the convergence rate of different methods, and aim to show that Hessian averaging leads to Q-superlinear convergence, while subsampled/sketched Newton (with fixed sample/sketch size) only exhibit Q-linear convergence.Figure 2 From the figure, we indeed observe that our method with different weight sequences (UnifAvg and WeightAvg) always exhibits a Q-superlinear convergence, and the superlinear rate exhibits the trend of (1/t) t/2 matching the theory up to logarithmic factors.We also observe that subsampled/sketched Newton methods with different sketch matrices (NoAvg) always exhibit a Q-linear convergence.These observations are all consistent with our theory.

Conclusions
This paper investigated the stochastic Newton method with Hessian averaging.In each iteration, we obtain a stochastic Hessian estimate and then average it with all the past Hessian estimates.We proved that the proposed method exhibits a local Q-superlinear convergence rate, although the non-asymptotic rate and the transition points are different for different weight sequences.In particular, we proved that uniform Hessian averaging finally achieves (Υ log t/t) t superlinear rate, which is faster than the other weight sequences, but it may take as many as O(Υ 2 + κ 6 /Υ 2 ) iterations with high probability to get to this rate.We also observe that using weighted averaging w t = (t + 1) log(t+1) , the averaging scheme transitions from the global phase to the superlinear phase smoothly after O(Υ 2 + κ 2 ) iterations with high probability, with only a slightly slower rate of (Υ log t/ √ t) t .One of the future works is to apply our Hessian averaging technique on constrained nonlinear optimization problems.Na et al. (2022Na et al. ( , 2021) ) designed various stochastic second-order methods based on sequential quadratic programming (SQP) for solving constrained problems, where the Hessian of the Lagrangian function was estimated by subsampling.These works established the global convergence for stochastic SQP methods, while the local convergence rate of these methods remains unknown.On the other hand, based on our analysis, the local rate of Hessian averaging schemes is induced by the central limit theorem, which we can also apply on the noise of the Lagrangian Hessian oracles.Thus, it is possible to design a stochastic SQP method with Hessian averaging and show a similar local superlinear rate for solving constrained nonlinear optimization problems.We note that a recent work Na and Mahoney (2022) utilized the Hessian averaging to prove a local sublinear rate for stochastic SQP methods, while that result is not as strong as the one in this paper.Further, for unconstrained convex optimization problems, generalizing our analysis to enable stochastic gradients and function evaluations as well as inexact Newton directions is also an important future work, which can further improve the applicability of the algorithm.Finally, given any iteration threshold T , establishing the probability that the first/second/third transition occurs before T is an interesting open question.
Proof.The argument follows from the matrix Chernoff and Hoeffding inequalities (Tropp, 2011b, Theorems 1.1 and 1.3).Note that the standard version of these concentration inequalities applies to sampling with replacement, namely where indices I 1 , I 2 , ..., I s are sampled from an index set uniformly at random.Alternate matrix concentration inequalities for sampling without replacement are also known, e.g., see Gross and Nesme (2010); Wang et al. (2019).We thus take sampling with replacement as an example.
We start with the case where f i may not be convex.By the matrix Hoeffding inequality (Theorem 1.3, Tropp, 2011b), we have for some small constants c, c , c > 0 and any η > 0: where the last step follows because if the expression in the exponent is larger than − log(2), then the bound is vacuous, since the probability must lie in [0, 1].Thus, it follows that H(x) − H(x) = E(x) is a sub-exponential random variable with parameter Υ E = O(λ max R log(d)/s) (see Section 2.7 of Vershynin, 2018).The claim now easily follows.
Next, suppose that f i are convex.This means that all ∇ 2 f i (x) are positive semidefinite, and so is H(x).Thus we can use the matrix Chernoff inequality (Theorem 1. 1 Tropp, 2011b), which provides a sharper guarantee: Proof.In fact, the result holds as long as √ s S has independent mean zero unit variance sub-Gaussian entries.From a standard result on the concentration of sub-Gaussian matrices (Theorem 4.6.1 in Vershynin, 2018), there exists a constant c ≥ 1 such that, with probability 1 − 2 exp(−t 2 ), and have Plugging δ = 2(η ∧ √ η) into (A.1),we obtain for a small constant c > 0 that This further implies that, for a small constant c > 0, Thus, H(x) −1/2 E(x)H(x) where the last step is due to the fact that x ≥ 2 log(x) for all x > 0. Now, we consider the function f (t) = log(dt/δ)/t.For any t ≥ 1, its derivative is bounded as where N ν is defined in (18).
Lemma B.3.Suppose Assumption 3.3 holds, then we have Proof.We note that This completes the proof.

C.2 Proof of Lemma 3.1
Let us define the event By Theorem 2.6 and (15), we know P( Ē) ≥ 1 − δπ 2 /6.We also note that Therefore, with probability at least 1 − δπ 2 /6, we have for any t ≥ T 1 that Ēt ≤ λ min .This shows that the event E defined in (17) happens, and This completes the proof.

C.3 Proof of Lemma 3.2
By Lemma 3.1, we know that p t = −( H t ) −1 ∇f t for t ≥ T 1 .We apply Taylor's expansion: Thus, the Armijo condition is satisfied if Therefore, the backtracking line search on Line 7 leads to a stepsize satisfying Moreover, we apply Lemma 3.1 and strong convexity of f , and have which shows the first part of the statement.Furthermore, we apply (C.4) recursively, apply strong convexity, and have The last statement follows from x t − x 2 H ≤ λ max x t − x 2 .This completes the proof.

C.4 Proof of Corollary 3.4
We condition on the event (17), which happens with probability 1 − δπ 2 /6.By Lemma 3.2, we know that, for t By Lemma B.2, to have x t ∈ N ν , it suffices to have x t ∈ Nν 2 /3 .Thus, we let where T 2 is defined in ( 19) and the implication uses the fact that log(1/(1 − φ)) ≥ φ.Furthermore, since f (x t ) is always decreasing, we know after T 1 + T 2 iterations x t stays in Nν 2 /3 , and hence stays in N ν (cf.Lemma B.2).This completes the proof.
C.5 Proof of Lemma 3.5 By Assumption 3.3, we have for some η ∈ (0, 1), and have Combining (C.6), (C.7), and (C.8), we finally obtain Thus, it suffices to let as implied by the conditions stated in the lemma.This completes the proof.
C.6 Proof of Lemma 3.6 Since x t ∈ N ν with ν ≤ 2/3 • (0.5 − β), we apply Lemma B.3 and have Thus, we obtain (0.5 Since p t = −( H t ) −1 ∇f t and µ t = 1, we have For the second term on the right hand side, we have Combining (C.10) and (C.11), Thus, we have Combining the above inequality with (C.12), we obtain Combining the above inequality with (C.13), we complete the proof.
This completes the proof.
C.9 Proof of Lemma 4.2

C.11 Proof of Theorem 4.4
We follow the same proof structure as Theorem 3.8.We suppose the event (24) happens, which occurs with probability 1 − δπ 2 /6.We only need to study the convergence after I + U where U is chosen such that w(I + U) = 2w(I − 1)κ/ν.We claim that The first result is implied by ( 26) and the second result is implied by Assumption 4.1(v).For t = 0, we characterize the difference between H I+U and H I+U .We have as implied by (26).Thus, Lemma 3.6 leads to where the last inequality is due to x I+U ∈ N ν and 2ν ≤ θ 0 .Furthermore, we can see that Thus, (C.21) holds for t = 0. Suppose (C.21) holds for t − 1 with t ≥ 1, we prove (C.21) for t.We still characterize the difference between H I+U +t and H I+U +t .We have This completes the induction and finishes the proof.

Corollary 4. 5 .
Under the setup of Theorem 4.4, Algorithm 1 has three transitions: (a): From x 0 to x I : the algorithm converges to a local neighborhood N ν from any initial point x 0 .(b): From x I to x I+U : the sequence x t stays in the neighborhood N ν .
T +J +t+1 − x H +J +t − I • x T +J +t − x H +J +t − x 2 H + ρ t x T +J +t − x H .
• 12ρ t x T +J +t − x H (C.18) in(14), where z i,t = (w i −w i−1 )/w t and z where the first two inequalities use the fact that w (t) is non-negative and non-decreasing; the second last inequality uses the fact that w(t)w (t) ≥ 0, ∀t ≥ −1; and the last inequality uses Assumption 4.1(v).By the same derivation, we have z By Lemma 4.2, we know (22) holds for any t.By the definition of I 1 in (23), we know which implies Ēt ≤ λ min .Using (11) and noting that λ min • I t i=0 (w i − w i−1 )H i /w t λ max • I, we complete the proof.