An Overview of Stochastic Quasi-Newton Methods for Large-Scale Machine Learning

Numerous intriguing optimization problems arise as a result of the advancement of machine learning. The stochastic first-order method is the predominant choice for those problems due to its high efficiency. However, the negative effects of noisy gradient estimates and high nonlinearity of the loss function result in a slow convergence rate. Second-order algorithms have their typical advantages in dealing with highly nonlinear and ill-conditioning problems. This paper provides a review on recent developments in stochastic variants of quasi-Newton methods, which construct the Hessian approximations using only gradient information. We concentrate on BFGS-based methods in stochastic settings and highlight the algorithmic improvements that enable the algorithm to work in various scenarios. Future research on stochastic quasi-Newton methods should focus on enhancing its applicability, lowering the computational and storage costs, and improving the convergence rate.


Introduction
In essence, artificial intelligence is a process of optimization. The intelligence we are trying to develop almost always comes back to the optimization problem in the end. Whichever traditional machine learning, deep learning or reinforcement learning, their core concepts can be attributed to optimization problems. As a result, optimization theory and algorithms constitute a fundamental component of machine learning and become one of its main pillar. Additionally, the objective function in machine learning always takes on a special form, making it possible to create an optimization algorithm that can effectively handle massive amounts of data.
In machine learning, it is common to encounter optimization problems involving tens of millions of training examples and variables. Consider the following general optimization problem in an expectation form, where f : R d → R, ξ ∈ R d ξ denotes a random variable with distribution P, and E [·] represents the expectation with respect to ξ . In machine learning, the function f (·, ξ) is implicitly given or the distribution P is unknown, making it difficult to calculate the function value and gradient. A special case of (1) that arises frequently in machine learning is the empirical risk minimization problem, where the training set is assumed to be a collection of independent and identically distributed (i.i.d.) samples ξ i = (x i , y i ) with i ∈ {1, 2, · · · , n} according to P via certain observations. And n is the number of data sample which is always extremely large. Gradient descent (GD) is a popular approach to solve (2); it usually employs the following updates as shown in (3). But the computation cost of full gradient is expensive, so it is imperative to employ stochastic approximation (SA) algorithms to solve large-scale optimization problems, which can be traced back to the seminal work by [1]. The predominant methodology in machine learning advocates the use of stochastic gradient descent (SGD) methods [2]. In the k-th iteration, SGD chooses a subset N k ⊂ {1, 2, · · · , n} randomly and then calculates the stochastic gradient ∇ F N k (w k ) as shown in (3), where α k > 0 is the k-th step size. ∇ F N k (w k ) is an unbiased estimate of the gradient of F at w k , namely E[∇ F N k (w k )] = ∇ F(w k ).
SGD methods are widely used in machine learning [3][4][5][6]. There are also several variants, including momentum [7,8], Nesterov-accelerated gradient [9], adaptive learning-method [10][11][12], etc. In practice, SGD with momentum is widely used, especially in deep learning [13,14]. When deterministic risk components predominate in the beginning of the process, Nesterov's acceleration can speed up the rate of convergence of SGD [15,16]. Adaptive learning methods, such as Adagrad [10], Adadelta [11], Adam [12], compute adaptive learning rates for each parameter, where Adam algorithm outperforms other adaptive learning methods. However, since α k must decay to zero for convergence, SGD methods suffer from the adverse effect of noisy gradient estimates and slower convergence rate. To address this limitation, methods endowed with variance reduction [17] capabilities have been developed. Stochastic variance reduced methods [18] can converge linearly on strongly convex problems, including SAG [19,20], SAGA [21], SVRG [22], SARAH [23]. SAG and SAGA need to store the auxiliary vectors for every sample, which amounts to an O(nd) storage. The SVRG method only requires O(d) memory and consists of two loops: an outer loop where the reference gradientg k = ∇ F(w k ) is calculated, and an inner loop where the stochastic gradient steps are updated using g t = ∇ F N t (w t ) − ∇ F N t (w k ) +g k .
SGD methods, which take into account only gradient information, have a number of drawbacks, including relatively slow convergence and sensitivity to hyperparameter settings. Additionally, SGD methods suffer from ill-conditioning problem and often offer limited opportunities for parallelism. Second-order optimization methods have the potential to address these well-known shortcomings by integrating curvature information [24]. As a result, some methods employ second-order derivative information; the update rule is presented as where B k , H k represent the Hessian and the inverse Hessian of object function or the corresponding approximate matrix in the first k-th iteration, respectively. Compared with the first-order methods, the second-order methods have many advantages. The most crucial one is scale-invariant [17], without it, poorly scaled parameters will be much harder to optimize. Another advantages of second-order algorithm are that each iteration, whatever based on gradient or curvature, chooses the subsequent iterate by first computing the minimizer of a second-order Taylor series approximation as follows: The GD iteration takes B = I, while Newton's method takes B = H k or B = B −1 k , and quasi-Newton's method chooses the approximate H k or B k which is constructed using only gradient information.

Stochastic Second-Order Methods in Machine Learning
Numerous stochastic second-order algorithms have been proposed in recent years for finite sum functions, primarily employing random sampling techniques to approximate the true gradient and Hessian. Byrd et al. [25,26] proposed and analyzed the complexity and sample size of a subsampled Newton-CG algorithm. Erdogdu et al. [27] explored a subsampled Newton algorithm combined with low-rank approximation when the sample size is much larger than the number of parameters, that is n d, and analyzed its convergence rate. Utilizing random matrix concentration inequalities, Mooney et al. [28] analyzed the global and local convergence rate of subsampled Newton methods when n, d 1. Xu et al. [29] exploited the convergence rate of subsampled Newton algorithm in nonuniform sampling schemes. Ballapragada et al. [30] analyzed the convergence rate of the subsampled Newton algorithm in expectation. Zhang et al. [31] proposed a technique combining variance reduction with subsampling to improve the convergence rate and analyzed the convergence rate when the subproblem is a quadratic function. In [32], Pilanci et al. proposed a new Newton sketch algorithm; they calculated the approximate Newton step by randomly projected Hessian and established its superlinear convergence under certain conditions. The numerical performance of Newton sketch algorithm and subsampled Newton algorithm was compared by Berahas et al. [33].
These methods we mentioned above are of the form where A k 0 is a stochastic approximation of the Hessian and the conjugate gradient (CG) [34] method is always used by investigators to solve it inexactly. In subsampled Newton method, where S H ⊂ {1, 2, · · · , n} is chosen at random either uniformly or in a nonuniform manner [29]. While Newton sketch algorithm defines a random sketch matrix D k ∈ R q×n with the property that E[(D k ) D k /q] = I n (q < n) at each iteration and then calculates a square-root decomposition of the Hessian ∇ 2 F(w k ), denoted by ∇ 2 F(w k ) 1/2 ∈ R n×d , and defines the Hessian approximation as Agarwal et al. [35] proposed LiSSA, a new Newton-type algorithm for generalized linear models, with the complexity increasing linearly with the amount of parameters.
In addition to the direct approximation of Hessian, the quasi-Newton approach has also been used in stochastic settings. Bordes et al. [36] proposed an SGD algorithm based on diagonal curvature estimation matrix. Byrd et al. [37] combined SGD and limited memory BFGS (L-BFGS). Stochastic L-BFGS algorithm with variance reduction was proposed by Moritz et al. [38]. Gower et al. [39] propose a stochastic block L-BFGS method with variance reduction and demonstrate its linear convergence rate.
Although the majority of these works focus on smooth objective functions, nonsmooth regularizations like the L 1 norm are frequently taken into account in practice. How to create nonsmooth stochastic second-order algorithms is a crucial challenge.
Deep learning has been shown to be resistant to algorithms with certain optimization errors in real applications. As a result, some sophisticated traditional optimization methods can be simplified to some extent, allowing the algorithm to be applied to the neural network and the training speed to be greatly accelerated. Many second-order methods for neural network, including the generalized Gaussian Newton (GGN) and the Kronecker-factored approximate curvature (K-FAC) method, have been proposed recently.
The Hessian-free (HF) and Gauss-Newton methods are two examples of GGN methods. HF [40] solves a sub-optimization problem using the conjugate gradient (CG) algorithm, which does not require explicitly forming the curvature matrix but instead uses Hessian-vector products. It was demonstrated in [41] that Hessian-free works very well for training deep neural networks (DNNs) and recurrent neural networks (RNN) if it is correctly planned and implemented. In [42], the Gauss-Newton matrix is investigated to approximate the Hessian matrix, and [43] studies a practical blockdiagonal approximation to the Gauss-Newton matrix.
The Kronecker-factored approximate curvature (K-FAC) [44,45] is an effective method for simulating natural gradient descent (NGD), with a high-quality approximation of the Fisher information matrix. NGD [46][47][48] is an information geometry-based second-order optimization method. It employs the Fisher information matrix (FIM) as the curvature, which provides the overall perspective of the loss function and converges faster than the first-order method. Given the cross-entropy loss function , where x, y are the input and label, p(y|x, w) represents the density function of a predictive distribution P y|x . The FIM is formulated as F = E[∇ w log p(y|x, w)∇ w log p(y|x, w) ]. Consider a deep neural network with layers and denote the outputs of the i-th layer as b i , the inputs of the i-th as a i−1 and a weight matrix of the i-th layer as W i . The precise computation performed at each layer: where φ is an element-wise nonlinear function. Define w = vec (W 1 ) vec (W 2 ) · · · vec (W ) , Dv = − d log p(y|x,w) dv and g i = Db i . In the first step, K-FAC approximates FIM into block matrix: Noting that DW i = g i a i−1 , that is vec(DW i ) = vec(g i a i−1 ) = a i−1 ⊗ g i , then each block of FIM can be rewritten as Since (A ⊗ B) −1 = A −1 ⊗ B −1 for any matrices A and B, we can compute the block-diagonal FIM easily as K-FAC was first proposed for MLPs [44] and then extended to CNNs [45]. An eigenvalue-corrected Kronecker factorization (EKFAC), which can approximate FIM significantly better than K-FAC, was developed by George et al. [49]. Shampoo [50] extends adaptive learning rate methods by using a block-diagonal Kroneckerfactored preconditioning matrix. However, in fact, the exact strategies used by the present approximation scheme for the inverse of FIM still demand a lot of processing resources. Recently, Chen et al. [51] proposed a novel Trace-based Hardware-driven layer-ORiented natural gradient descent computation method (THOR), in which the update interval is gradually increased and the matrix trace is used to determine which blocks of FIM need to be updated. The K-FAC algorithm simplifies the natural gradient method and obtains a Kronecker product approximation. In practical implementation, it uses block-diagonal or block-tridiagonal approximation to obtain the second-order optimization approach for neural network, which significantly speeds up neural network training.

The Deterministic Quasi-Newton Methods
Numerous efforts have been done to develop quasi-Newton methods that incorporate the curvature information of the objective function without computing second derivatives, including the symmetric rank-1 method (SR1) [67][68][69], DFP rule of Davidon [70] and Fletcher and Powell [71], the Greenstadt [72] rule, and others. The most well-known one is the BFGS method, which is now a fundamental component of many modern optimization techniques and is named after Broyden, Fletcher, Goldfarb, and Shanno [73][74][75][76] who proposed the algorithm independently at nearly the same time.
The BFGS update has a number of exceptional qualities, including "self-correcting" qualities [69]. As a result, BFGS updating is currently regarded as the most effective of all quasi-Newton updating formulas.

The BFGS and L-BFGS
The BFGS method iteratively updates an estimate of the inverse Hessian, H k = B −1 k . When the curvature condition is met, the secant equation (8) always has a solution We require H k+1 to be, in certain ways, the closest to the current matrix H k in order to determine H k+1 uniquely. In other words, we solve the problem below where s k = w k+1 − w k , y k = ∇ F(w k+1 ) − ∇ F(w k ). Different matrix norms can be used in (9), with the weighted Frobenius norm; we can derive the unique solution, that is where ρ k = 1 y k s k and V k = I − ρ k y k s k , applying the Sherman-Morrison-Woodbury formula [77] to (10), we obtain Nocedal and Liu [69,78,79] proposed the idea of only using the most recent iterates and gradients in forming the inverse Hessian approximation to handle large-scale optimization problems. By storing only a small number of vectors of length d which implicitly represent the approximations rather than fully dense d × d approximations, these methods retain straightforward and compact approximations of Hessian matrices.

Algorithm 1 Deterministic BFGS Method
By saving a certain amount (m) of the vector pair {s i , y i } used in formulas (10) and (11), they implicitly store a modified version of H k . A sequence of inner products and vector summations involving ∇ F k and the pairs {s i , y i } for i = k − m, · · · , k − 1 can be used to obtain the product H k ∇ F k . The L-BFGS approximation H k satisfies the following formula when choosing an initial Hessian approximation H 0 k : The oldest vector pair in the collection of pairs {s i , y i } is replaced by the new pair {s k , y k } acquired from the current step after the new iteration has been computed. The limited-memory BFGS algorithm can be stated formally as Algorithm 2 with the two-loop recursion [69].
Discard the vector pair {s k−m , y k−m } from storage; 9 end if 10 Compute and save k ← k + 1; 12 end while

The Challenges of Quasi-Newton Methods in Stochastic Settings
SGD methods are widely used to solve (2), but they may not be suitable for illconditioned and nonconvex problems, which are better treated with second-order information. Although quasi-Newton algorithms have been extensively studied in the deterministic case, can we directly apply them to solve optimization problems in machine learning? When we investigate the extension of a quasi-Newton method from the deterministic setting to the stochastic setting, the answer is not encouraging, there are some difficulties: (1) Gradient measurements are noisy. The noise in the gradients in the stochastic setup may taint the correction pairs {s i , y i }. Let g k = ∇ F(w k ) + e k , if e k is larger than the g k or if s i is too small, then the dominant of "difference of two successive gradients" is just the difference of noise, which indicates that the vector y i can be very inaccurate.
(2) Line search is highly problematic. In the stochastic context, neither F(w) nor ∇ F(w) is available to us; instead, we only have access to noisy versions of them, and the global validity of the criteria they employ cannot be established from local subsamples. For instance, we may accept a step size α k in the case where the observed cost has sufficiently decreased, but the true objective function may have increased [80]. However, without line search, there is no guarantee that the curvature condition s i y i > 0 holds. The purpose of this paper is to provide a review on the stochastic quasi-Newton methods that have been demonstrated in both theory and practice to be effective at counteracting the negative impacts of challenging curvature in stochastic optimization. This paper attempts to demonstrate how to overcome the aforementioned difficulties in order to design more widely applicable stochastic quasi-Newton methods.

Basic Stochastic Quasi-Newton Methods
To solve (1) or (2), a number of stochastic quasi-Newton methods have been presented in recent years. In this section, we introduce the basic SQNM which draws lessons from the ideas of [59].

Sublinear Convergence
Online BFGS (oBFGS) [52] is a pioneering work in stochastic adaptations of the BFGS method. The oBFGS algorithm is a direct generalization of the BFGS algorithm that employs stochastic gradients rather than full gradients. It took consistent gradient measurements on the same data set N k , i.e., , which is named "gradient displacements". The update of B k is then modified by scaling down the last term of the update (10) by a factor 0 < σ 1 and selecting the step size without line search. To deal with the small eigenvalues, they add λs k (λ 0) to y k (which means modifying the BFGS update to estimate the inverse of H + λI ), that is
The online L-BFGS (oL-BFGS) was also proposed in [52] for solving large-scale optimization problems when the O(d 2 ) cost of storing and updating H k would be prohibitively expensive. oL-BFGS reduces the memory requirements as well as the computational cost of each iteration to O(md). Under the assumption that the objective function is strongly convex and smooth, and the second moment of the stochastic gradient is bounded, oL-BFGS converges to the optimal solution at a rate of O(1/k) in expectation [81].
To construct the correction pairs, stochastic quasi-Newton (SQN) [37] decouples the computation of the stochastic gradient from the curvature estimate. This method calculates y k using a Hessian-vector product [25] as (17), which is referred to as "Hessian action" [82]. The results of numerical experiments show that SQN is reliable and effective, although it does not improve convergence rate. In Algorithm 3, we formalize the sublinear convergent stochastic BFGS algorithms [37,52,56,81].

Linear Convergence
Despite being successful in extending the use of quasi-Newton methods to stochastic settings, the convergence rate of the aforementioned methods is sublinear. This is not better than the SGD convergence rate. The negative impact of noisy gradient estimations results in a slow, sublinear rate of SGD convergence [17]. Both in theory and in practice, stochastic variance reduced methods produce a faster convergence than SGD [18]. A stochastic L-BFGS algorithm [38] that extensively draws on SQN and incorporates variance reduction [22] was proposed. This is the first linearly convergent stochastic quasi-Newton algorithm (Algorithm 4). To calculate the search direction, this algorithm computes a variance-reduced gradient as shown below whereg k is the full gradient atw. Variance-reduced gradient is also used by the VITE algorithm [83], which combines regularized stochastic BFGS (RES) [56]/oBFGS with semi-stochastic gradient descent (S2GD) [84]. Using a coordinate transform framework, Zhao et al. [85] revisited the algorithms and enhanced the results of its convergence rate.

Algorithm 4 Stochastic Quasi-Newton Methods with Variance-Reduced Gradient
1 initial parameterw 0 , step size α > 0, parameters m, M and C; Compute s r =w r −w r −1 and y r by Hessian action; 13 Compute H r 14 end if 15 end for 16 Setw k+1 = w m 17 end for Gao [86] extends the standard BFGS by incorporating information of ∇ 2 F(w) along multiple directions in each update to improve the accuracy of the local Hessian approximation. Gower et al. [39] then presented a new quasi-Newton approach that uses stochastic block BFGS updates [86] in combination with SVRG.

Superlinear Convergence
The incremental quasi-Newton (IQN) [58] method is the first stochastic quasi-Newton method to achieve superlinear convergence. The information corresponding to the i-th function f i at step k is referred to as z k i , ∇ f i (z k i ) and B k i (for simplifying notation, let f i (w) = f (w; ξ i ) ). Consider the second-order approximation of the objective function f i (w), which is centered around the current iterate z i , then the function F(w) can be approximated with As a result, the updated iterate w k+1 can be defined as the minimizer of the quadratic programming in (15), which is explicitly given by It should be noted that the update in (16) demonstrates that the variable w k+1 is a function of the stored information of all functions f 1 , · · · , f n . As a result, they only update the local information of one function, chosen in a cyclic manner, in each iteration of the IQN method. Let i k be the index of the function selected at time k. Then, the update of BFGS can be used to compute the Hessian approximation B k i k corresponding to f i k while leaving all other the same, i.e., B k+1 i = B k i for i = i k . It is obviously that the update of IQN cannot be implemented at a low computational cost because it necessitates computation of the sums 1 The disadvantage of incremental second-order methods like IQN and stochastic Newton method (SNM) [87] is that they have computational and memory costs per iteration that are greater than or equal to O(d 2 ). This is prohibitive in situations where the number of model parameters is large. Chen [88] creates a new stochastic average Newton (SAN) method that is inexpensive to implement when solving regularized generalized linear models, with an iteration cost of the order of O(d).

Practical Considerations
The algorithmic advancements made by various SQNM are outlined in this section, along with a number of implementation-related topics including selecting the step size and how to generate the correction pairs.

The Correction Pairs
The updating rules for the correction pairs represent the primary distinction between various SQNM. Both oBFGS and RES compute y on the same subset N k , i.e., y k = ∇ F N k (w k+1 ) − ∇ F N k (w k ) as opposed to the variation ∇ F N k+1 (w k+1 ) − ∇ F N k (w k ), even though the former requires twice as many stochastic gradient evaluations and the latter is insufficient in the stochastic scenario to ensure convergence [56,81].
However, if the size of N k is too small, the gradient estimates will be extremely noisy, which will have a negative impact on the final Hessian approximation. Berahas et al. [89,90] proposed to construct the correction pairs by computing gradients based on the overlap of successive batches O k = N k ∩ N k+1 = ∅. The only restriction is that this overlap should not be too small. However, a nonuniform sampling strategy can produce a sample size that is considerably less than what uniform sampling calls for gradient displacements Another popular way to construct the correction pairs is proposed in SQN [37], which decouples the stochastic gradient and curvature estimate calculation. This approach calculates y k via a Hessian-vector product [25], named "Hessian action" [82], where s k is the difference of disjoint average between the 2C most recent iter-ations which means that they compute correction pairs every C iterations,w k = is a subsampled Hessian defined as (7). We can use the directional derivative [91] to calculate the Hessian-vector product at a low cost without explicitly constructing ∇ 2 F S H (w). Hessian actions can prevent the potential negative consequences of differencing noisy gradients, but at the cost of more computation because the batch size |S H | must be selected to be large enough to get useful curvature estimates. In a similar vein, AdaQN [92] computed y by matrix-vector product for training RNNs rather than using the subsampled Hessian but instead used the empirical Fisher information matrix [46,48]. Hessian action offers a number of interesting qualities. First, since this cost is spread out over C iterations, more computation may be done for the curvature computation. Additionally, using the Hessian action allows for a more robust estimation of curvature, especially when s is small and the gradients are noisy.
Berahas et al. [93] propose that new curvature pairs be sampled at each iteration. They choose random directions (τ i ) to sample points around the current iteration and establish the iteration displacement s = rτ i ( where r is the sampling radius) before generating y using gradient displacements ∇ F N (w) − ∇ F N (w + rτ i ) or Hessian action (17). By utilizing parallel/distributed computing settings, this approach can capture more recent and local information to improve the Hessian approximations. Another noteworthy point is that the convergence result for nonconvex problems can be proven because we can force the curvature pairs to meet the curvature requirement s y > ε s 2 > 0 by eliminating those that do not.

The Choice of H 0
The initial approximation H 0 has no universally effective magic formula in all cases. Most stochastic BFGS methods choose a small multiple of identity matrix I as H 0 , i.e., ε I. Stochastic L-BFGS methods always set H 0 k = γ k I, γ k = (s k−1 y k−1 )/(y k−1 y k−1 ) as the standard L-BFGS. Byrd et al. [25] propose to employ a conjugate gradient iteration as subsampled Newton methods to indirectly define H 0 k . According to some works [94,95], H 0 k is defined as follows: where 0 < γ k <γ k can be constants or iteration-dependent to prevent H 0 k from becoming nearly singular or nonpositive-definite.
AdaQN [92] was proposed for training RNNs, in which the inverse-Hessian matrix is initialized using accumulated gradient information. It is the same as the scaling matrix used by Adagrad [10] during each iteration

Hessian or Inverse Hessian Approximation
As is well known, B k+1 or H k+1 cannot be determined just by the secant equation. The quasi-Newton methods require the matrix B k+1 to be close to the matrix B k in order to resolve this indeterminacy. Mokhtari et al. [56] proposed that in BFGS, the matrix B k+1 is selected as the one that satisfies the secant condition while being closest to B k in terms of the Gaussian differential entropy, In the stochastic setting, they replace B with B − δ I in (18) to prevent the smallest eigenvalue of B k to from approaching 0 over time. Furthermore, RES establishes the corrected variationỹ k = y k − δs k to ensure that all eigenvalues of B k+1 exceed δ > 0, in other words, ifỹ k s k = (y k − δs k ) s k is positive, then all eigenvalues of B k+1 are larger than δ. B k+1 can be explicitly given by the expression In order to determine H k , Adrian et al. [96,97] employ a method similar to standard BFGS (9) and solve the regularized least-square problem as follows: where Y k y k−m+1 , · · · , y k , S k s k−m+1 , · · · , s k . The regulator matrixH k acts as a prior on H and can be modified at each iteration k. It was confirmed that the solution to Eq. (19) is given by

Stocastic block BFGS [39] constructs an update which satisfies a sketched version of the equation, namely
where D k ∈ R d×q is a randomly generated matrix, which has relatively few columns (q d). Then, the stochastic block BFGS update is defined by the weighted projection where H 2 k

Tr(H ∇ 2 F S H (w k )H ∇ 2 F S H (w k )), and tr denotes the trace. The solution is
where Δ k (D k Y k ) −1 and Y k ∇ 2 F S H (w k )D k , the same solution was given in [98][99][100] for multiple secant equations. The numerical results show that stochastic bock BFGS is more flexible. Similarly, Ma et al. [101] utilize the weak secant equation [102] to build an adaptive parameter-wise diagonal quasi-Newton for training DNNs. All existing quasi-Newton algorithms, according to Hennig et al. [99], can be reformulated and extended into a probabilistic interpretation. By utilizing this discovery, known as the Gaussian prior Hessian approximation, Wills et al. [103] provide a probabilistic quasi-Newton approach; more details are available in [80,103].

The Step Size
The necessity to choose an appropriate step size is still another significant obstacle to the development of stochastic quasi-Newton algorithms. The groundbreaking work of Robbins and Monro [1] made the following conclusion: the step size in stochastic situations should satisfy the conditions It is always necessary to use a "sufficiently small" constant or a diminishing step size, but it might be challenging to choose the appropriate constant step size. Therefore, the most popular option is that where α 0 , k 0 > 0 are tuning parameters. SQNM, when combined with a variancereduced gradient, always employs a constant step size. However, we concentrate on adaptive step size and stochastic line search (SLS). Zhou et al. [104] develop a stochastic adaptive BFGS (SA-BFGS) for selfconcordant function [105]. They examine whether the Wolfe condition is met for the adaptive step size α k in the method. Otherwise, SA-BFGS will switch back to taking an SGD step. Since we only have access to a noisy version of the gradient in the stochastic setting, it is difficult to ensure that the update direction is a descent direction. In [106], an available SLS algorithm was proposed, which employs the framework of Gaussian processes and Bayesian optimization. The step length that best satisfies a probabilistic measure combining reduction in the cost function with satisfaction of the Armijo condition is chosen [80].
The key to designing a SLS is ensuring that there is a decrease in the true function value with a high probability when only stochastic approximations can be observed. Based on variance estimates, sample size selection makes the angle between search direction p k and ∇ F(w k ) is an acute angle with high probability, that is E[ p k ∇ F(w k )] < 0. Sample size selection is always used in the design of SLS and is critical in both the evaluation of gradients and the incorporation of curvature information [26,[107][108][109][110]. A practical inner product quasi-Newton (IPQN) test was proposed in [111] to ensure the stochastic quasi-Newton search direction makes an acute angle with the true quasi-Newton direction in expectation where θ > 0. Progressive Batching L-BFGS [111] performs a backtracking line search that aims to satisfy the Armijo condition where c > 0. If the condition is not satisfied, set α k = α k /2. The line search proposed by Wills et al. [80] is conceptually more similar to this procedure. All existing quasi-Newton algorithms, according to Hennig et al. [99], can be interpreted as maximizing a posteriori estimates. Based on this, Wills et al. [80] proposed a Gaussian prior quasi-Newton approximation and a mechanism to ensure that the search direction is a descent direction in expectation, E[ p k ∇ F(w k )] < 0. Then, they developed a stochastic line search procedure that satisfies an Armijo condition in expectation for early iterations wheref (w k ) is subsampled function. However, this method assumes that the gradient is tainted by additive Gaussian noise, which is inappropriate for some problem classes, particularly when the noise distribution has heavy tails [80].
Xie et al. [112] demonstrated that there is a step size that satisfies the Armijo-Wolfe conditions for both the noisy and true objective functions under the assumption of the errors in function and gradient are bounded for all w. However, in order to guarantee that the condition number of the Hessian approximations is bounded, they need to be aware of the strong convexity parameter, which is typically unknown. A new noisetolerant quasi-Newton algorithm was proposed by Shi et al. [61], which has no need for exogenous function information. The noise-tolerant L-BFGS is linearly convergent to a neighborhood of the solution determined by the noise level if the objective function F is μ-strongly convex and has L-Lipschitz continuous gradients.

Diagonal Approximation
Rapid training progress is possible with diagonally scaled stochasic first-order algorithms like Adagrad [10], Adadelta [11], RMSprop, Adam [12]. Numerous SQNM also use a diagonal matrix to approximate the Hessian. Bordes et al. [36] studied SGD with a diagonal rescaling matrix based on the secant condition, named SGD-QN. The approach is highly effective and successful enough to be named the winner of the first PASCAL Large Scale Learning Challenge [113] because it just requires scalar computation. The corrected SGD-QN algorithm was then proposed in [114] to reap the full benefits of a diagonal scaling strategy. The Hessian is also approximated by Apollo [101] via a diagonal matrix, the elements of which are obtained by the variational method under the constraints of the parameter-wise weak secant equation. In terms of convergence speed and generalization performance, Apollo outperforms SGD and different variants of Adam significantly.

Regularization
oBFGS may produce divergent sequences when the noise of stochastic gradients is catastrophically amplified in curvature estimation. Regularization is incorporated into RES [56] to address this problem. It redefines B k+1 as the solution of the problem as follows: The term B − δ I rather than B in (24) is to preserve the smallest eigenvalue of B k from 0. The identity bias term Γ I is added to the update of RES as follows: which is always used to ensure the positive definiteness of the Hessian approximation. Under the same supposition as oBFGS, the RES method converges to the optimal solution at a rate of O(1/k) in expectation [56]. To address the merely convex or non-Lipschitz problem, Yousefian et al. [60,[115][116][117][118] construct a set of SQNM with regularization terms, in which the regularization parameter is updated iteratively and decays to zero.

Parallel Implementations
Stochastic quasi-Newton methods are more memory intensive and more expensive (per iteration) than SGD. As a result, it is critical to effectively scale and parallelize SQNM in a distributed system. By avoiding the pricey dot product operations in the two-loop recursion, vector-free L-BFGS [119] significantly increases computing performance while using MapReduce. On the High-Performance Computing Cluster (HPCC) Systems platform, Najafabadi et al. [120] provide a parallelized version of the L-BFGS algorithm. Some of the computing nodes responsible for evaluating the function and gradient in parallel implementations are unable to deliver results on schedule. By performing quasi-Newton update depending on the overlap between succeeding batches, the multi-batch L-BFGS [89,90] is intended to operate in a distributed setting to address this. They also proposed sampled L-BFGS and sampled L-SR1 methods [93], both of which have enough concurrency to benefit from parallel/distributed com-puting environments. The sampled L-SR1 approach is then described by Jahani et al. [121] in a scalable distributed implementation that is communication-efficient by applying sketching techniques to reduce the quantity of data communicated at each iteration.

Advanced Algorithms
We discuss extensions to the fundamental SQNM in this section. Some of these modifications make the techniques more general to deal with more challenging situations, like problems that are not smooth and/or strongly convex. Other expansions develop quicker algorithms by using extra algorithmic techniques or problem structure.

Accelerated Variants
The momentum term [7,8,13,122] and Nesterov's accelerated method [9] can accelerate the convergence of SGD and sometimes provide a distinct improvement in performance for neural network training [123]. The approach for SQNM acceleration is described in several works [124][125][126][127][128][129]. Nesterov's accelerated quasi-Newton (NAQ) method [124] is a recently proposed BFGS acceleration method. A stochastic variant of the NAQ technique called o(L)NAQ [125] was also devised, and it was found to perform better than the o(L)BFGS method. Then, Yasuda et al. [127] introduced the stochastic variance reduced Nesterov's accelerated quasi-Newton (SVR-NAQ) approach, which is superior to oNAQ since it updates the Hessian matrix with less noise. Chang et al. [126] propose a new sL-BFGS algorithm by importing a proper momentum, which is based on SQN [37] and stochastic L-BFGS with nonuniform sampling [85]. The complexity went from O (n + κ H κ) d log 1 ε to O n + κ H √ κ d log 1 ε , where κ and κ H represent the condition number of F and the Hessian approximation, respectively. A faster stochastic L-BFGS was proposed by Gao and Huang [128], and the oracle complexity was improved to O n + κκ H 1+ √ κκ H /n log 1 ε . For training neural networks, Indrapriyadarsini et al. [129] propose a new limited memory Nesterov's accelerated symmetric rank-1 (L-SR1-N) method. It is demonstrated that the proposed L-SR1-N converges to a stationary point both theoretically and experimentally.

Relaxing Smoothness
For classical (L)BFGS, the objective function must be smooth since the quasi-Newton direction produced at a nonsmooth point is not always a descent direction. Yu et al. [130] suggested a subgradient (L)BFGS for which global convergence can be restored by determining a descent direction and applying a line search. Yousefian and Nedić [116] modified the update rule as w k+1 = w k − α k H k ∇ F N k (w k + z k ), where z k is a uniform random variable drawn from a ball centered at the origin with radius r > 0. They establish the convergence properties of the scheme in the absence of the Lipschitzian property with this modification and a few acceptable assumptions. Jalilzadeh et al. [60,117] drew on the Moreau proximal smoothing [131], and they only reduce the smoothing parameter η after an odd number of iterations as the right-hand side of (26). Then, they utilized the η k -smoothed function F η k to , where 0 < δ 1 is scalar that controls the level of smoothing.
Numerous studies take into account the following optimization problem: where h is a nonsmooth convex function, such as 1 regularization. Stochastic proximal Newton-type approaches for these kinds of nonsmooth problems are investigated in [132][133][134].

Relaxing Strong Convexity
Convex functions are the focus of the majority of theories for stochastic quasi-Newton methods. In the absence of strong convexity, there are few convergence rate statements. For faster convergence and better theory, the objective function in machine learning is always regularized with the term 1 2 μ w 2 . However, the optimal solution to the F(w) + 1 2 μ w 2 is not the optimal solution to the original problem. To deal with the merely convex objective function, Yousefian et al. [60,115,117,118] develop a number of regularized SQN methods. They allow regularization parameters to be updated and decay to zero as iterations progress. Cyclic regularized stochastic BFGS (CR-SQN) [115] generates a sequence {w k } for any k 0 where μ k > 0 is the regularization parameter of the gradient mapping, and they only update μ k if k is even. The corresponding update rule of B k+1 is following, whereỹ k := ∇ F N k (w k+1 ) − ∇ F N k (w k ) + (1 − ρ)μ k s k for an even k, 0 < ρ < 1 is the regularization factor. In this way, it is easy to prove that s kỹ k Iteratively regularized stochastic limited-memory BFGS (IRS-L-BFGS) updates the parameter as (29) where the update policy for H k and the update rule for the regularization parameter μ k are based on the following general procedure, The {s i , y i } is defined for any odd k 1, where τ > 0 and 0 < δ 1 are parameters to control the level of regularization in the matrix H k , and the perturbation term μ δ k s k/2 → 0, as k → ∞. In terms of the value of the objective function, IRS-L-BFGS shows a convergence rate O(k −( 1 3 −ε) ), where ε is an arbitrary small positive scalar.

Nonconvex Problems
The main difficulty in developing quasi-Newton methods for nonconvex problems is that the Hessian of the objective function is not positive definite. In the deterministic scenario, several techniques, including cautious updating [135] and damping [136], have been proposed to establish convergence of the BFGS method in the nonconvex setting. Berahas and Takáč [89,90] skip the Hessian update if the curvature condition y k s k ε s k 2 is not satisfied. Similarly, sampled L-BFGS [93] updates the (inverse) Hessian approximation using only the set of curvature pairs satisfying y k s k ε s k 2 . In stochastic optimization, damping is more frequently used. Several works [57,94,95] use damping to make sure that H k or B k are positive definite. They definē where otherwise. However, the self-correcting properties of BFGS-type updates could be ruined by damping. Self-correcting BFGS (SC-BFGS), which uses a damping procedure intended to maintain these properties, was proposed in [137]. In SC-BFGS,ȳ = β k s k + (1 − β k )α k y k , where β k is the smallest value in the interval [0, 1] such that the following crucial bounds are satisfied: s kȳ k η 2 , η 1 ∈ (0, 1) and η 2 ∈ (1, ∞).
The sequence of inverse Hessian approximations must satisfy this inequality in order to ensure that the global convergence guarantees are met. Some SQNM approximate the Hessian via a diagonal matrix [101,138]; these methods always rectify the absolute value of B k with a convexity hyper-parameter σ to handle nonconvexity, i.e., rectify (B k , σ ) = max (|B k | , σ ), where the function rectify(·, σ ) is similar to the rectified linear unit (ReLU) [139] with a threshold of σ .

Hybrid Stochastic Second-Order Methods
To improve practical performance, a number of hybrid stochastic second-order algorithms have been proposed. The term "hybrid" refers to the elaborate fusion of these stochastic first-or second-order methods. Incorporating quasi-Newton techniques with SGD methods could shorten the optimization process and eliminate the need to finetune optimization hyperparameters.
Symmetric blOckwise truNcated optimIzation Algorithm (SONIA) [140] bridges the gap between first-and second-order methods by computing a search direction in one subspace that incorporates curvature information and performing a scaled gradient descent step in the orthogonal complement.
Yang et al. [141] discovered an intriguing fact: the true Hessian matrix is frequently a combination of a cheap part and a costly part, i.e., ∇ 2 F(w) = H (w) + Π(w), where H (w) is relatively cheap and accessible, while Π(w) is expensive or even not computable. For example, the subsampled Hessian matrix, the GGN/FIM, or any other approximation matrices are all examples of H(w), which stands for a matrix with partial Hessian information. We can employ SQNM to update Π(w). For training neural network, the Kronecker-factored quasi-Newton method [82,142] was proposed. They approximate the Hessian by a block-diagonal matrix that is the Kronecker product of two considerably small matrices A −1 i and G −1 i , then they use BFGS update to approximate them with innovative damping and Hessian-action [37,39,100] techniques.

Discussion and the Future
In this paper, we studied stochastic second-order approaches, particularly stochastic quasi-Newton methods, for large-scale machine learning. We began by briefly outlining the optimization problem in machine learning, and then, we summarized the existing stochastic first-order and second-order methods. The SQNM based on BFGS, which is currently regarded as the most effective quasi-Newton updating formula, was the main topic of discussion in this paper. Although there have been some other intelligent works presented that we do not mention, the majority of the algorithms we cover are representative. Following that, we briefly discuss some future research directions in stochastic second-order methods.
More complex optimization techniques have been used as computing hardware has advanced. The second-order optimization method can speed up the training of deep neural networks, but more effective optimization algorithms cannot be applied directly because they frequently require higher-order function information, such as Hessian, and the acquisition and calculation of this information are not feasible due to the excessive number of parameters. Therefore, the practical application of secondorder method for training neural networks is limited by the enormous calculation cost. Deep learning is robust to optimization algorithms; therefore, some advanced classic optimization techniques can be reduced and applied to neural networks to significantly speed up training.
Although no theory exists to clearly analyze or explain the precise effects of optimization models and algorithms on the generalization of neural networks, relevant research is still being conducted in this area, and optimization methods for deep learning have become the mainstream, with many extraordinary algorithms emerging. Theoretical evaluations of algorithm performance currently lag considerably behind the practical application. The efficiency analysis of traditional optimization methods often stops at the analysis of convergence order, while various improvements based on SGD methods are the same convergence order in theory in deep learning. However, even the progress of constant level can significantly increase training efficiency in practice. This demonstrates that traditional theoretical analysis methods cannot fully accommodate machine learning, but there is currently no reasonable additional theoretical analysis standard. In fact, the majority of the present approaches are difficult to theoretically analyze in terms of performance and can only be evaluated through tests with large amounts of data. Although the Adagrad method has been shown to have good convergence and lower regret accumulation, practice shows that Adagrad reduces the learning rate too quickly on many problems, resulting in unsatisfactory training efficiency.
In addition to the significant theoretical challenges, the development of more effective optimization methods like K-FAC shows that there is still a large research space to be discovered regarding the training of deep neural networks. Stochastic secondorder algorithms still have numerous limitations in comparison with the well-studied first-order method. Although the current second-order algorithms utilized for deep neural network training are significantly more effective, they have more restrictions than SGD approaches in terms of the types of problems that may be addressed. For instance, the implementation of K-FAC is complicated and heavily dependent on the unique neural network structure, making it more challenging than other approaches. The K-FAC approach can only be employed in the context of traditional convolution networks, such as image recognition, whereas Transformer model, which is frequently used in natural language processing, requires significant modification. Future research will focus heavily on finding ways to get around these restrictions, create second-order algorithms that are applicable to many applications, or enhance the training effect of second-order algorithms.
Under some conditions, gradient descent can converge linearly in the deterministic setting, whereas quasi-Newton methods can converge superlinearly. Typically, SGD can only achieve the sublinear convergence rate, but variance reduction can enable linear convergence. As shown in Sect. 4, some SQNM achieve sublinear convergence. SQNM with variance-reduced gradient can converge linearly; it does not recover a superlinear convergence rate. Although the IQN can converge superlinearly by using a proper quadratic approximation of individual functions, the memory required by the IQN is O(nd 2 ), which is unacceptable for high-dimensional or large-scale problems. As a result, a superlinear convergent stochastic quasi-Newton approach with lower costs is a promising area for further study [59]. Another point worth noting is that most SQNM avoid possible difficulties with BFGS or L-BFGS updating by assuming that the quality of gradient differences is sufficiently controlled. This is something that can be addressed.
Dynamic sample size schemes [26,107] have served as the foundation for many variance reduction schemes in machine learning [60,117]. The sample size used to approximate the Hessian strongly influences the performance of the subsampled Newton method. In addition, the construction of gradient displacement y within allowable error bounds also required careful consideration of sample size selection. The basic idea behind dynamic sample size is to increase the size of the training sample used in the gradient and Hessian evaluation by a factor. SGD methods converge linearly for strongly convex problems by utilizing dynamic sample size, and this scheme is used in stochastic second-order algorithms [143,144]. If the initial sample size is sufficiently large, Jin and Mokhtari [144] demonstrate that the subproblems can be solved superlinearly fast by applying quasi-Newton methods. But only convex environments are covered by their assurances. Another disadvantage is that the sample size needs to be increased geometrically, and the literature offers no direction on how to select the factor (AdaQN choose 2 [144]). As a result, investigating the use of dynamic sample size strategies in quasi-Newton methods is an intriguing line of research that merits further research. In addition, sampling is also a major concern in machine learning stochastic optimization techniques.
The step size is a crucial hyperparameter in the optimization process that has a direct impact on how well it works in practical applications. In deterministic settings, line search can be used to determine the step size. However, stochastic line search is computationally prohibited in machine learning since we only have subsampled data on function value, gradient, and Hessian approximation. How to make sure that the search direction is a descent direction in expectation is the first difficulty in constructing the SLS. Another issue is that we may reject an appropriate α k if the observed objective function value increases, even if the true cost has decreased sufficiently. Some studies proposed the IPQN test with a dynamic sample size. Another class of SLS is being developed that is based on a probabilistic framework for the Hessian; these methods are derived from the fact that all existing quasi-Newton algorithms can be interpreted as maxima of Gaussian posterior probability distributions [99]. There has always been a need for study in the area of stochastic line search.
In nonconvex settings, an essential challenge in designing quasi-Newton methods is that the eigenvalues of (inverse) Hessian approximations are not uniformly bounded above and away from zero. The positive-definite Hessian approximations were preserved in several studies via damping or regularization. Simultaneously, a more careful calculation method of correction pairs {s i , y i } is used to keep the gradient noise within an acceptable range. However, the self-correcting property of updates of the BFGS type could be ruined by damping techniques. Moreover, BFGS has some disadvantages when trying to enforce the positive definiteness of the approximated Hessian matrices in a nonconvex setting. As a result, developing a convergent quasi-Newton method for nonconvex settings is a very active direction of research.
Limited memory quasi-Newton methods form the basis of the majority of SQNM. A significant challenge with L-BFGS is that, while memory size m can have a significant effect on performance, it is difficult to predict which memory size will work best. Boggs and Byrd [145] recently proposed that each iteration choose a different memory size adaptively. Additionally, a displacement aggregation strategy [146] is suggested for the curvature pairs stored in an L-BFGS. These adaptive L-BFGS, however, cannot be used directly in stochastic scenarios. Therefore, stochastic L-BFGS or L-SR1 with adaptive memory size is a promising approach.
In the stochastic contexts, the intelligent fusion of second-order methods has already demonstrated its special attractiveness. On some machine learning problems, the structured stochastic quasi-Newton methods (S2QN) [141] framework is fairly competitive with the state-of-the-art approaches. If the sample size is large enough, a local superlinear convergence result is assured. On a number of CNN tasks, a new class of Kronecker-factored quasi-Newton methods (KF-QN-CNN) [142] outperformed stateof-the-art first-and second-order methods. Such methods have numerous obvious advantages, including comparable memory requirements, lower per-iteration time complexity, and less expensive computation. An intriguing study area would be to create a better framework that combines the advantages of stochastic first-order and second-order optimization methods.
Learning to optimize (L2O) has gained popularity in recent years, and it is based on "learning to learn" or "meta learning" [147,148]. The primary goal of L2O is to develop optimization methods by leveraging neural networks. L2O has begun to show great promise in a few applications and optimization domains. Egidio et al. [149] proposed using the truncated backpropagation through time algorithm to learn the stepsize policy for L-BFGS in a constrained domain. On some problems, the suggested step-size strategy is competitive with heuristically tuned optimization procedures. As a result, we may be able to use objective function and gradient information to learn an approximate "quasi-Newton" optimizer for machine learning. However, L2O research is still in its infancy, with numerous open challenges and research opportunities ranging from practice to theory. and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.