Limited-memory common-directions method for large-scale optimization: convergence, parallelization, and distributed optimization

In this paper, we present a limited-memory common-directions method for smooth optimization that interpolates between first- and second-order methods. At each iteration, a subspace of limited dimension is constructed using first-order information from previous iterations, and an efficient Newton method is deployed to find an approximate minimizer within this subspace. With a properly selected subspace of dimension as small as two, the proposed algorithm achieves the optimal convergence rates for first-order methods while remaining a descent method, and it also possesses fast convergence on nonconvex problems. Since the major operations of our method are dense matrix-matrix operations, it can be efficiently parallelized in multicore environments even for sparse problems. By judiciously utilizing historical information, our method is also communication-efficient in distributed optimization across multiple machines, as the Newton steps can be calculated with little communication. Numerical studies show that our method has superior empirical performance on real-world large-scale machine learning problems.


Introduction
We consider the following unconstrained smooth optimization problem:

min_{w ∈ R^n} f(w), (1)
where f is ρ-Lipschitz-continuously differentiable and the solution set Ω of (1) is nonempty. We propose and study a limited-memory common-directions method for (1). At each iteration, our method constructs a manifold or a subspace of a limited dimension m in which we find the update direction. In the manifold parameterized by a smooth function G, whose domain is usually of a much smaller dimension, we may apply many efficient algorithms such as the Newton method to find an update direction. That is, given the current w and the manifold parameterized by a smooth function G : R^m → R^n, where m ≪ n, we approximately optimize

min_{t ∈ R^m} f(w + G(t)). (2)

This can be done through a Newton method that, with h(t) := f(w + G(t)), iteratively solves the quadratic problem

min_{Δt} ∇h(t^{(T)})^⊤ Δt + (1/2) Δt^⊤ ∇²h(t^{(T)}) Δt (3)

and conducts a line search, where t^{(T)} is the iterate obtained at the T-th Newton iteration. We call the iterations of constructing G and solving (2) the outer iterations, and the Newton steps (3) the inner iterations. Problem (2) reduces the original optimization problem to a much lower-dimensional one, and thus this subproblem can be solved efficiently in many cases. In particular, if we assume that G is a linear function such that

G(t) = P t, ∀t ∈ R^m, (4)

for some P ∈ R^{n×m}, then (2) becomes

min_{t ∈ R^m} f(w + P t). (5)

This can be easily minimized by solving the following Newton linear system when P^⊤∇²f(w + P t^{(T)})P is positive definite.
P^⊤∇²f(w + P t^{(T)}) P t = −P^⊤∇f(w + P t^{(T)}). (6)

Equivalently, each column of P is considered as a possible direction for updating w, and solving the subproblem (2) finds the best linear combination of the columns of P as the update direction. This explains the nomenclature: each column of P is a "common direction," as the columns can be commonly used to construct G in multiple outer iterations.
Notice that ∇²f(w + G(t^{(T)}))P can be calculated at the cost of O(m) gradient evaluations of f through automatic differentiation (see, for example, [21, Chapter 8]), or at an even lower cost when the problem structure can be utilized. Thereafter, obtaining the coefficients in (6) takes O(m²) inner products of n-dimensional vectors. When m is small, the cost of calculating the gradient and ∇²f(w + G(t^{(T)}))P dominates; thus the cost per inner iteration is roughly asymptotically equivalent to m iterations of gradient descent when m is a fixed constant not affected by the problem parameters. Moreover, in many situations where the problem possesses specific structures that can be utilized, as described in Sect. 5.2, multiple inner iterations can even be cheaper than a single gradient evaluation for f, making one outer iteration of our method cost the same as one iteration of gradient descent.
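To make the linear-algebra cost concrete, here is a minimal numpy sketch of forming and solving the m × m system (6) at t^{(T)} = 0. The `hvp` callable stands in for the Hessian-vector products that would come from automatic differentiation, and the quadratic test function at the end is purely illustrative:

```python
import numpy as np

def subspace_newton_step(grad_f, hvp, w, P):
    """Solve (P^T H P) t = -P^T g, i.e. the system (6) at t = 0.

    grad_f : callable returning the n-dimensional gradient of f
    hvp    : callable (w, v) -> Hessian-vector product (m calls, one per column)
    P      : n x m matrix whose columns are the common directions
    """
    g = grad_f(w)
    HP = np.column_stack([hvp(w, P[:, j]) for j in range(P.shape[1])])  # n x m
    A = P.T @ HP          # m x m left-hand side: O(m^2) inner products
    b = -P.T @ g          # m-dimensional right-hand side
    return np.linalg.solve(A, b)

# Example on a convex quadratic f(w) = 0.5 w^T Q w, where the hvp is exact.
Q = np.diag([1.0, 2.0, 3.0, 4.0])
grad_f = lambda w: Q @ w
hvp = lambda w, v: Q @ v
w = np.array([1.0, 1.0, 1.0, 1.0])
P = np.column_stack([-grad_f(w), w])  # two common directions
t = subspace_newton_step(grad_f, hvp, w, P)
w_new = w + P @ t  # minimizer of f over the affine set w + span(P)
```

At the minimizer over the subspace, the gradient is orthogonal to the columns of P, i.e. P^⊤∇f(w_new) = 0, which is what the system (6) enforces.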
In the special case in which the column space of P is a subspace of the space spanned by the gradients of f in the previous and current outer iterations, our method can be seen as a first-order method. However, our method is more general in allowing other possible directions, such as quasi-Newton or stochastic Newton ones, to be included. On the other hand, as Newton's method is applied in subspaces, our method can also be considered to incorporate curvature information. For example, when G(t) = ∇f(w)t, our method with only one inner Newton step is similar to the Barzilai-Borwein method [2] in deciding the step size of steepest descent using spectral information; when G includes recent update directions and the corresponding gradient differences, our method is in spirit similar to quasi-Newton methods in using these directions to approximate the Newton step. Therefore, the proposed method is expected to enjoy advantages of both first- and second-order methods. We will show that in theory, our method enjoys low global iteration complexity just like (accelerated) first-order methods, while in practice, it converges even faster than second-order methods.
For problems with complicated structure, evaluations of the gradient, the objective, and the Hessian may not be highly parallelizable, especially when the data in the problem are sparse, so that the bottleneck becomes memory access. In this case, traditional batch methods do not enjoy parallelism as much as expected. On the other hand, in the proposed limited-memory common-directions method, the major operations, namely the m Hessian-vector products and the O(m²) vector-vector inner products in constructing the left-hand side matrix of (3), are by nature embarrassingly parallel, at least up to m processors. This means that our method can enjoy better parallelism than traditional batch methods in multicore environments.

Related work and our contributions
Limited-memory methods have been extensively studied in the context of quasi-Newton methods. When the initial Hessian estimate is a multiple of the identity, it can be easily seen that quasi-Newton methods belonging to the Broyden class find the update directions from the span of the current and previous gradients. Therefore, they can be seen as a type of first-order methods that finds the linear combination coefficients of the gradients using inner products between the historical gradients and the update steps. Among quasi-Newton methods, the limited-memory BFGS (L-BFGS) method [14] is one of the most popular, thanks to its economical spatial and computational costs per iteration and superior empirical convergence.
Recently, [26] proposed the common-directions method for smooth and strongly convex empirical risk minimization (ERM) problems. Their algorithm maintains all gradients up to the current iteration and finds the update step as a linear combination of them, with the combination determined by approximately finding the best possible one through multiple iterations of the Newton method. The key ingredient of their algorithm is utilizing the ERM problem structure to compute the Hessian at low cost. They show that by accumulating the gradients, this method possesses both the optimal global linear convergence rate for first-order methods and the local quadratic convergence of the Newton method, and empirically it outperforms both optimal first-order methods and second-order methods on small- and medium-scale problems. The major disadvantage of the common-directions method is its use of all gradients accumulated from the first iteration onward, so that for a difficult problem requiring many iterations to solve, both the storage and the subproblem solves become expensive.
In this paper, we consider a fusion of the common-directions method and limited-memory quasi-Newton methods to develop a limited-memory common-directions method. In particular, instead of storing and using all gradients accumulated from the first iteration onward, our method uses only information from the most recent m̂ iterations for a pre-specified value of m̂. To retain information from the discarded gradients for possibly better convergence, we also include the possibility of using as common directions the momentum terms or previous update steps adopted by optimal first-order methods and limited-memory quasi-Newton methods. In comparison with our initial presentation of the limited-memory common-directions method, which focused only on distributed optimization for a specific problem class [10], this paper provides a more general treatment for smooth optimization with extended theoretical analysis for optimal convergence rates, broader applicability, improved algorithms, and extensive experiments. In particular, our Sect. 5.2 is a generalization of the algorithmic description in [10], Theorems 2 and 5 are adapted from [10] to our more general setting, and Theorem 3 improves upon the result in the same work; the other parts are newly developed.
Through this limited-memory setting, we obtain controllable spatial and computational per-iteration costs, extend the applicability from ERM to general optimization problems with the help of automatic differentiation, and better parallelize the computation. We develop convergence results using techniques different from those in [26], because we no longer include all previous gradients in our search space. We show that the optimal linear convergence rate for first-order methods on strongly convex problems is still attainable even when the number of common directions at each iteration is as small as two. In addition, we also cover the case not shown in [26] and prove that the optimal convergence rate of O(1/k²) for first-order methods on general convex problems can be obtained with two properly selected common directions (k is the outer iteration counter of our method). Unlike other optimal methods, which are non-monotone, our method is a strictly descent one. Moreover, other optimal methods only possess R-linear convergence on strongly convex problems, whereas our method also achieves global Q-linear convergence (with a different rate). Another theoretical contribution of this work is showing that for a broad class of choices of the common directions, even if (2) is solved as coarsely as with only one Newton iteration, global sublinear convergence rates of O(1/k) on both convex and nonconvex problems (to stationarity in the latter case), and global Q-linear convergence on problems satisfying the Polyak-Łojasiewicz condition, can be ensured.
We further discuss how our algorithm suits multicore parallelization and distributed optimization for practically solving large-scale problems with high efficiency, and show through numerical results that the empirical behavior of the proposed algorithm is indeed suitable for these scenarios and can outperform the state of the art in multicore parallel optimization and distributed optimization.
Empirical studies also show that our method outperforms optimal first-order methods in a single-core setting, and state-of-the-art methods in multicore and distributed environments; we have therefore included the distributed version of the proposed algorithm in the open-source package MPI-LIBLINEAR.

Notations and assumptions
We denote f* := min_w f(w). Given any ε ≥ 0, we say that a point w is an ε-accurate solution for (1) if f(w) ≤ f* + ε. For a given set of vectors a_1, …, a_t, span(a_1, …, a_t) denotes the subspace spanned by them; that is, span(a_1, …, a_t) = {Σ_{i=1}^t c_i a_i : c_i ∈ R}. For any function G, Im(G) and dom(G) denote its image and effective domain, respectively. When not specified otherwise, ‖·‖ signifies ‖·‖_2. Given two symmetric matrices A and B of the same dimension, A ⪰ B means that A − B is positive semidefinite and A ≻ B means that A − B is positive definite. I denotes the identity matrix, and σ_min(A) the smallest eigenvalue of A. We denote the kth iterate by w_k for all k ≥ 0.
The following is assumed throughout this work.

Assumption 1 The objective f in (1) is ρ-Lipschitz-continuously differentiable for some ρ > 0. Moreover, the solution set Ω of (1) is non-empty.
We note that restricting the domain in (1) to R n is just for the ease of description, and our algorithm and analysis apply directly to any Euclidean spaces.

Organization
This work is organized as follows. Section 2 describes and analyzes a version of our algorithmic framework that achieves the optimal convergence rates for first-order methods with a carefully selected G. A more general version is then given in Sect. 3, with its convergence rates analyzed in Sect. 4. Section 5.1 studies the choice of the common directions for improving the empirical convergence, and Sect. 5.2 shows how to utilize special problem structures to make our method highly efficient. Numerical results are presented in Sect. 6 to examine the empirical performance of our method. We further apply the proposed algorithm to parallel and distributed optimization in Sects. 7 and 8, respectively. Section 9 provides some concluding remarks. All proofs are in the "Appendix".

Limited-memory common-directions method with optimal convergence rates
We start with a case in which our method can be viewed as a first-order method, in the sense that at the kth outer iteration, for all k ≥ 0, the update direction lies in the span of the current and previous gradients. It is known that the optimal iteration complexity for first-order methods to reach an ε-accurate solution under Assumption 1 is O(1/√ε) when f is convex (or equivalently, the best convergence rate is O(1/k²)), and O(√(ρ/σ) log(1/ε)), from an R-linear convergence rate, when f is σ-strongly convex with σ > 0 [17,19]. We will show that for a properly chosen G with m as small as two, our method achieves such optimal iteration complexities if (2) is approximately solved to sufficient accuracy. We describe our method as a meta-algorithm in Algorithm 1, in which only outer iterations are considered while the details of solving (2) are skipped. There are many possible algorithms to efficiently optimize (2) when G is linear with a small dimension, and we postpone this discussion to later sections.
The choice of G can be quite arbitrary as long as it is linear. To achieve the optimal convergence speed of first-order methods on σ-strongly convex problems for some σ ≥ 0 (σ = 0 indicates that f is only convex), we only require that at the kth outer iteration, span(∇f(w_k), v_{k+1} − w_k) ⊆ Im(G), with v_k defined in (7) and α_k being the positive solution to (9). When σ > 0, Eqs. (8) and (9) imply (10); when σ = 0, (8) and (9) in combination give (11). Therefore from (9), and from the quadratic formula applied to (12) for k > 0 and to (8)-(9) for k = 0, we obtain (13). By induction, it is clear that α_k > 0 for all k, so (10) and (11) imply that γ_{k+1} > 0 for all k. We can therefore see that regardless of whether σ > 0 or σ = 0, v_k in (7) is always well-defined.
Using simple induction, we can see that v_{k+1} − w_k lies in the span of the current and previous gradients, so after including it in Im(G), our method can still be considered a first-order method. Notice that at the beginning of the kth iteration, w_k and v_k are already known, so we can obtain v_{k+1} before starting the kth iteration of the algorithm. For the case in which f is nonconvex, we are able to show that min_{0≤T≤k} ‖∇f(w_T)‖² converges to 0 at a rate of o(1/k), which is the same speed as gradient descent and the accelerated gradient method on nonconvex problems, as long as Im(G) includes a vector whose direction does not deviate too much from the negative gradient direction. Interestingly, unlike the result for the convex cases, which requires sufficiently accurate subproblem solutions for (2), this rate for stationarity can be achieved even if the subproblem is solved very roughly, as we shall see in Theorem 3 in Sect. 4.

Algorithm 1: Limited-memory Common-directions Method
Given w_0 ∈ R^n
for k = 0, 1, … do
    Pick a positive integer m and a linear function G : R^m → R^n
    Approximately solve (2) with w = w_k to obtain the next iterate w_{k+1}
end

We prove below the optimality of Algorithm 1 in terms of the convergence speed.
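A schematic Python rendering of this outer loop, under simplifying assumptions: the subproblem (2) is solved by a single exact Newton step, and the common directions are the current negative gradient plus the previous update step (one natural choice, not the specific G analyzed below):

```python
import numpy as np

def limited_memory_common_directions(grad_f, hvp, w0, iters=100, tol=1e-10):
    """Schematic outer loop of Algorithm 1: build a linear G(t) = P t from
    the current gradient and the previous update step, then take one exact
    Newton step for the subproblem min_t f(w + P t)."""
    w, prev_step = w0.astype(float).copy(), None
    for _ in range(iters):
        g = grad_f(w)
        if np.linalg.norm(g) < tol:      # (near-)stationary: stop
            break
        cols = [-g] + ([prev_step] if prev_step is not None else [])
        P = np.column_stack(cols)
        HP = np.column_stack([hvp(w, P[:, j]) for j in range(P.shape[1])])
        t = np.linalg.solve(P.T @ HP, -P.T @ g)   # the system (6) at t = 0
        prev_step = P @ t
        w = w + prev_step                # unit step, for illustration only
    return w

# On a strongly convex quadratic, the two-direction subspace steps drive the
# gradient to zero quickly (on quadratics this recursion behaves like CG).
Q = np.diag([1.0, 5.0, 10.0])
w = limited_memory_common_directions(lambda w: Q @ w,
                                     lambda w, v: Q @ v,
                                     np.array([3.0, -2.0, 1.0]))
```

Because the subspace contains the negative gradient, each iteration decreases the objective at least as much as exact-line-search steepest descent, which is what the analysis below formalizes.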

Theorem 1 Consider (1) with f σ-strongly convex for some σ ≥ 0 (σ = 0 implies that f is only convex). For any initial point w_0, the following results hold.
1. If σ > 0 and the subproblem solution for (2) at the kth iteration generates the next iterate w_{k+1} that satisfies the following conditions for all k ≥ 0, then Algorithm 1 generates objective values R-linearly convergent to the optimum, and it takes O(√(ρ/σ) log(1/ε)) iterations to obtain an ε-accurate solution.
2. If σ = 0 and there exists a sequence {ψ_i} (not necessarily all nonnegative) such that the following holds for all k ≥ 0, then the iterates generated by Algorithm 1 satisfy the bound below for all k ≥ 0 and any w* in the solution set. Therefore, if there exists a constant A_∞ < ∞ such that

Algorithm 1 takes O(1/√ε) iterations to get an ε-accurate solution.
In Theorem 1, if ψ_k converges to zero at a rate of O(α_k k^{−(1+δ)}) for some δ > 0, then (19) is satisfied. (We can simply set ψ_k = α_k/k^{1+δ} for any δ > 0.) When σ = 0, one can also replace the α_k defined by (9) with simpler choices such as α_k = (q − 1)/(k + q − 1) with q > 2, proposed by [25]. In this case, the optimal O(k^{−2}) rate is still obtained, and the convergence-rate requirement on ψ_k can be simplified to O(k^{−(2+δ)}) for any δ > 0. The proof is essentially the same, noting that in this case, instead of the equality in (12), we have the corresponding inequality. We next show that when G is selected properly, (15) and (17) can be met by solving (2) approximately.
Proposition 1 Assume f is σ-strongly convex for some σ > 0. Consider (2) at the kth iteration with w = w_k, and let ṽ_{k+1} be defined accordingly. If span(∇f(w_k), ṽ_{k+1}) ⊆ Im(G) and G is a linear function, then either (21) holds for all t within some neighborhood of the solution set of (2), or either G(t*) = v_{k+1} or G(t*) = −∇f(w_k)/ρ is satisfied by all optimal solutions t*.
Notice that (21) and (22) respectively imply (15) and (17) in the next iteration. The above propositions therefore imply that either any approximate solution of (2) close enough to the solution set, or some easily calculable vectors, satisfy the conditions required to ensure the optimal convergence rates in Theorem 1. Thus, we can apply any convergent iterative subproblem solver to (2) and have condition (21) or (22) satisfied within finitely many iterations. More specifically, when f is strongly convex, the minimizer is unique, and thus any algorithm ensuring convergence of the objective to the minimum will generate iterates that eventually reach the neighborhood satisfying (21). When f is convex, any algorithm that produces an iterate sequence converging to a minimizer, such as gradient descent [4,11], accelerated gradient with suitable parameters [1,5], or randomized coordinate descent with probability one [23], will reach the neighborhood satisfying (22) in finitely many iterations. In fact, in our experiments in Sect. 6 for strongly convex f, we observe that condition (21) is always satisfied after one Newton step.
Including v k+1 − w k in Im(G) as suggested by Propositions 1 and 2 requires prior knowledge of the parameters σ and ρ. When these values are unknown, one can use the line search techniques in [20] to obtain similar rates. We omit details for this case as the analysis is very similar to that in Theorem 1.
We can also obtain an o(1/k) sublinear convergence rate for ‖∇f(w_k)‖² even when f is nonconvex; see Theorem 3 in Sect. 4.

Practical limited-memory common-directions method
In this section, we discuss an efficient solver for the subproblem (2) with G in the form (4). More specifically, we show how to apply a line-search Newton method to (5) on the selected subspace. For the subproblem, G can be chosen freely and is not limited to the choices suggested by Propositions 1 and 2.
We describe how one iteration of the Newton method for the subproblem is done; running multiple inner iterations is just a repetition of this procedure. Because dom(G) is usually of very low dimension, we use a full Newton step without truncation. This means that given the current iterate w, we obtain an update direction t for the subproblem (2) by solving the following m-by-m linear system, constructed from (6) with t^{(0)} = 0.
Computation of the coefficients in this linear system is easy. First, given P, we use automatic differentiation to compute the matrix ∇²f(w)P. Then the computation of both P^⊤∇²f(w)P and P^⊤∇f(w) is straightforward. When the problem structure is known, we may also utilize it to further lower the cost of constructing the linear system, as we will discuss in Sect. 5.2. The final update direction for w is then constructed as p := G(t) = Pt. Two pitfalls of directly using the p obtained above require extra care. The first is that when the matrix on the left-hand side of (23) is not positive definite, the resulting t may not be a descent direction for (2), and therefore Pt may be a non-descent direction for f. To address this problem, we add a sufficiently large multiple of the identity to the matrix on the left-hand side of (23) whenever its smallest eigenvalue is below a given threshold τ > 0.
The second pitfall is that the full Newton step does not always ensure sufficient objective value decrease. Therefore, to ensure convergence, we conduct a line search procedure to find a suitable step size θ and update the iterate as w ← w + θp. Because we want to use the unit step size whenever possible in second-order methods, we use a simple backtracking line search: given parameters β, c_1 ∈ (0, 1), we find the smallest nonnegative integer i such that the sufficient decrease condition f(w + β^i p) ≤ f(w) + c_1 β^i ∇f(w)^⊤p holds, and let θ = β^i. The overall algorithm is summarized in Algorithm 2. As a side note, there are many possible options for approximately solving (2), and the described method of Newton steps with line search is just one of them. For example, we can replace the matrix on the left-hand side of (23) with any positive definite matrix and convergence is still guaranteed, as we will see in Sect. 4. We will also discuss in Sect. 5.2 a class of problems for which we can evaluate ∇f(G(t)) and ∇²f(G(t)) cheaply for multiple t with a fixed G; in that case, running multiple inner Newton iterations can be much cheaper than updating G, so that running multiple Newton iterations and running a single Newton iteration have almost the same cost per outer iteration.
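The two safeguards can be sketched as follows, with the damping threshold τ and the backtracking parameters β, c_1 as in the text; the m × m matrix `H` and the matrix `P` of common directions are assumed given, and the nonconvex example at the end is our own illustration:

```python
import numpy as np

def safeguarded_step(f, g, w, P, H, tau=1e-3, beta=0.5, c1=1e-4, max_ls=50):
    """Apply both safeguards: damp H by a multiple of the identity when its
    smallest eigenvalue is below tau, then backtrack from the unit step until
    the sufficient-decrease condition holds."""
    lam_min = np.linalg.eigvalsh(H).min()
    if lam_min < tau:
        H = H + (tau - lam_min) * np.eye(H.shape[0])  # shift spectrum up to tau
    t = np.linalg.solve(H, -P.T @ g)
    p = P @ t                                  # final update direction
    fw, slope = f(w), g @ p                    # slope <= 0 once H is damped
    theta = 1.0
    for _ in range(max_ls):
        if f(w + theta * p) <= fw + c1 * theta * slope:
            break
        theta *= beta
    return w + theta * p

# Nonconvex example: the Hessian is indefinite at w0, so damping kicks in.
f = lambda v: v[0]**4 - v[0]**2 + v[1]**2
w0 = np.array([0.1, 1.0])
g0 = np.array([4 * w0[0]**3 - 2 * w0[0], 2 * w0[1]])
H0 = np.diag([12 * w0[0]**2 - 2, 2.0])   # with P = I, H0 = P^T Hess P
w1 = safeguarded_step(f, g0, w0, np.eye(2), H0)
```

After damping, the raw Newton step is huge along the nearly flat direction, so the backtracking loop shrinks θ until sufficient decrease holds.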

Cost per iteration
Let the cost of calculating the gradient ∇f(w) be denoted by T_grad. By using automatic differentiation, it takes at most O(m T_grad) to evaluate ∇²f(w_k)P_k, and the total cost of forming the coefficients on the left-hand side of (23) is therefore O(m T_grad + m²n), where the second term is for the matrix-matrix product. The cost of computing the right-hand side of (23) is simply O(mn) for the matrix-vector product. The cost of computing the smallest eigenvalue of H_k is O(m³), the same as that of inverting H_k; this cost is usually negligible in comparison to the O(m²n) cost of the matrix-matrix product. The cost of the function value evaluation in the line search is no greater than O(T_grad) and usually much smaller. In addition, we will see in Sect. 4 that the number of backtracking steps is bounded by a constant, and in practice θ_k = 1 usually produces sufficient function decrease. Therefore, the cost per iteration of Algorithm 2 is O(m T_grad + m²n + m³).

Algorithm 2: Practical Limited-memory Common-directions Method
Given w_0, m > 0, M_2 > 0, and β, c_1 ∈ (0, 1)
Compute ∇f(w_0) and pick an initial P_0 that has no more than m columns
for k = 0, 1, … do
    Compute the update direction p_k and step size θ_k, and set w_{k+1} = w_k + θ_k p_k
    Pick some P_{k+1} that has no more than m columns
end

Convergence analysis
We now analyze the convergence speed of Algorithm 2, treating separately the three cases in which f is nonconvex, f is convex, and f satisfies the Polyak-Łojasiewicz condition [9,15,22].
Instead of the specific matrix H_k considered in Algorithm 2, we prove convergence rates for a more general way of finding the update direction p_k. In particular, we consider solving (25), where Ĥ_k is a given symmetric positive definite matrix such that there exist M_1 ≥ M_2 > 0 satisfying (26). We write P_k in terms of its columns, whose number m_k can change with k without any restriction except m_k > 0 for all k.
For ease of description, we assume without loss of generality that P_k is orthonormal, which can always be achieved through the Gram-Schmidt process at a cost of O(m_k² n), no larger than the costs discussed in Sect. 3.1. In this case, Algorithm 2 is a special case of the framework described here, because orthonormality of P_k shows that the lower bound in (26) is satisfied automatically, and the ρ-Lipschitz continuity of ∇f ensures the upper bound. The following theorems show the finite termination of the backtracking and present the iteration complexity of our method with only minimal restrictions on the choice of the vectors in P_k.
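The orthonormalization step can be sketched as a modified Gram-Schmidt pass that also drops numerically dependent columns, so the resulting P_k satisfies P_k^⊤P_k = I; the tolerance below is an illustrative choice:

```python
import numpy as np

def orthonormalize(P, tol=1e-10):
    """Modified Gram-Schmidt over the columns of P; columns that are
    (numerically) dependent on the earlier ones are dropped, so the result
    B satisfies B^T B = I while preserving the column span."""
    basis = []
    for j in range(P.shape[1]):
        v = P[:, j].astype(float).copy()
        for q in basis:
            v -= (q @ v) * q      # remove components along earlier directions
        nrm = np.linalg.norm(v)
        if nrm > tol:
            basis.append(v / nrm)
    return np.column_stack(basis)

# A dependent column (2*v) is removed; the span of the rest is preserved.
v = np.array([1.0, 2.0, 0.0]); u = np.array([0.0, 1.0, 1.0])
B = orthonormalize(np.column_stack([v, 2 * v, u]))
```

Dropping dependent columns also keeps the reduced matrix in (25) well conditioned, which is in line with the damping safeguard of Algorithm 2.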
Theorem 2 If condition (28) holds and the update direction p_k is the same as that defined in (25) with H_k satisfying (26), then the backtracking line search for (24) with any given β, c_1 ∈ (0, 1) terminates in finitely many steps, and the final step size satisfies the lower bound in (29).

The condition (28) does not require the existence of a descent direction. Instead, as long as there is a direction whose angle with the gradient is bounded away from orthogonality, we can ensure that the update direction p_k is a descent one. We now discuss the convergence rates of the proposed algorithm, starting with the nonconvex case.
Theorem 3 Assume f satisfies Assumption 1. If an algorithm iteratively solves (25) under the conditions (26) and (28) to obtain p_k, uses p_k as the update direction, and finds a step size satisfying (24) through backtracking line search, then the minimum of the squared gradient norms vanishes at an o(1/k) rate, and ‖∇f(w_k)‖² converges to zero.
Next, we consider the convex case.

Theorem 4 Consider the same conditions as in Theorem 3, and assume additionally that f is convex and that, for any given w_0, the value R_0 in (30) is finite.
The first inequality in (31) explains why we conduct a line search instead of directly applying θ̄ as the step size. Note that the assumption of a finite R_0 in (30) is needed only in the convex case. When f satisfies the Polyak-Łojasiewicz condition, the function values converge globally Q-linearly to f*, as shown in (32), where θ̄ is the lower bound of the step size in (29).
Notice that (32) does not require convexity, and hence even on some nonconvex problems, we can get linear convergence to the global optimum.
To obtain global convergence rates, we used the parameters M_1, M_2, R_0, and ρ, which are global quantities. However, these values tend to be extreme bounds that rarely occur in practice, so we often observe much better convergence rates locally. Moreover, when the curvature information of f is properly included in H_k, we tend to observe step sizes far from the lower bound θ̄, and the unit step size is often accepted.
The theorems above provide convergence guarantees not only for our algorithm but also for many others, such as the Barzilai-Borwein method [2] with a line search ensuring sufficient objective decrease, and line-search Newton methods. Similarly, the algorithm of [26] that combines past gradients can also be treated as a special case of our framework.

Implementation details
In this section, we discuss the selection of P_k to improve the empirical performance of our algorithm, and then describe a general problem class whose structure we can utilize to make the implementation even more efficient.

Choices of the common directions
The convergence analysis in Sect. 4 only suggested that one should include some vectors that are gradient-related. Theorem 1 and Propositions 1-2 suggest that including the current iterate w k and a momentum vector related to v k defined in (7) might achieve better theoretical convergence speed. However, when we do not know ρ and σ , we are unable to calculate v k accurately. In many accelerated first-order methods, instead of this deliberately chosen v k , the update directions from previous iterations are used as "momentum terms" to be combined with the current gradient to form the new update step. Thus previous update directions and the current gradient are the most natural choices for the common directions.
When all previous gradients from the first iteration on are included in P k , [26] show that we can even achieve asymptotic quadratic convergence. This result suggests that the previous gradients are also good choices to include. When we have a fixed m, how to balance the number of previous gradients and previous update steps is a question. We observe that for quasi-Newton methods belonging to the Broyden class, when the initial estimate for the Hessian is a multiple of the identity, each quasi-Newton step is a linear combination of the current gradient, the difference of the previous gradients in the form ∇ f (w i+1 ) − ∇ f (w i ), and the previous update steps; see, for example, [21,Chapter 6]. For their limited-memory versions such as L-BFGS [14], the same number of previous updates and previous gradients are used together with the current gradient. As L-BFGS is quite popular in practice, we adopt this choice to use the same number of previous steps and gradient differences. We also take the current iterate as one column of P k because it is already available.
Assume m is even. For the kth iteration, we have chosen to use w_k, ∇f(w_k), and the pairs of previous update steps and gradient differences. From (25), we see that when H_k in Algorithm 2 is not damped by a multiple of the identity, using the previous update steps is equivalent to directly using the previous iterates, as they result in the same span and therefore the same p_k. Similarly, using the differences of the gradients is equivalent to directly using the gradients. Therefore, our construction of P_k simplifies to (33), consisting of the most recent iterates and gradients. This makes the update of P_k straightforward: we just add the current iterate-gradient pair, and then discard the oldest pair when the number of columns in P_k exceeds m. It has been shown in our preliminary report [10] that using previous gradient differences and update steps simultaneously gives better empirical performance than using only one of them.
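The sliding-window update of P_k can be sketched as below; `DirectionBuffer` is a hypothetical helper that keeps the most recent iterate-gradient pairs and discards the oldest pair once more than m columns would be stored:

```python
from collections import deque
import numpy as np

class DirectionBuffer:
    """Sliding-window store mirroring the construction (33): keep the most
    recent (iterate, gradient) pairs, never more than m columns in total;
    the oldest pair is discarded automatically once the buffer is full."""
    def __init__(self, m):
        assert m % 2 == 0 and m > 0
        self.pairs = deque(maxlen=m // 2)
    def update(self, w, g):
        self.pairs.append((w.copy(), g.copy()))
    def P(self):
        return np.column_stack([c for pair in self.pairs for c in pair])

buf = DirectionBuffer(m=4)
for k in range(5):  # five iterations, but only the last two pairs survive
    w_k = np.full(3, float(k)); g_k = -w_k
    buf.update(w_k, g_k)
P = buf.P()
```

Using `deque(maxlen=...)` makes the discard of the oldest pair automatic, matching the add-then-drop update described above.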
Another choice is to include in P_k some cheaply obtainable approximation of the Newton step, to span a subspace that might be closer to the solution set. One such case takes the diagonal entries of the (generalized) Hessian to form the diagonal matrix D(w_k), and then uses the vector D(w_k)^{−1}∇f(w_k) as an approximate Newton step. In this case, assuming m is a multiple of three, we take the analogous construction (34). We will compare the empirical performance of (33) and (34) in Sect. 6.
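A minimal sketch of the diagonal-scaling direction, assuming the diagonal of the (generalized) Hessian is available as a vector; the `eps` safeguard against tiny diagonal entries is our own illustrative addition:

```python
import numpy as np

def diag_scaled_direction(hess_diag, g, eps=1e-8):
    """Approximate Newton step D(w)^{-1} grad f(w) built from the diagonal
    of the (generalized) Hessian; eps keeps the scaling bounded when an
    entry is tiny or zero (an illustrative safeguard, not from the paper)."""
    return g / np.maximum(np.abs(hess_diag), eps)

d = diag_scaled_direction(np.array([2.0, 4.0, 0.0]), np.array([2.0, 2.0, 1e-9]))
```

Since D(w_k) is diagonal, this direction costs only O(n), so adding it to P_k is cheap compared to the rest of the iteration.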

Problem structure for efficient implementation
Forming the linear system (23) is the major additional cost of our algorithm in comparison to first-order algorithms. We will show that when the problem is of the form (35), where Q is a real quadratic function, X = (x_1, ⋯, x_l) ∈ R^{n×l}, and g is separable in the sense that g(z) = Σ_{i=1}^{l} g_i(z_i), a highly efficient implementation is possible for the choices of P_k in (33) or (34). We further write Q in the form

Q(w) = (1/2) w^⊤A w + b^⊤w (36)

for some symmetric matrix A ∈ R^{n×n} and vector b ∈ R^n. Problems of the form (35) are widely seen in applications including machine learning and signal processing.
The key is to note that for any k > 0, P_k and P_{k+1} have overlapping columns and only a few columns are updated. We can thus write the decomposition (37), where V̄_k denotes the columns that also appear in P_k and V_{k+1} the columns that appear only in P_{k+1}, excluding w_{k+1}. Using this notation, our discussion below covers any choice of P_k as long as existing columns are reused (and the discussion involving w_{k+1} can be skipped if w_{k+1} is not a column of P_{k+1}). Throughout the iterations, we maintain X^⊤P_k, X^⊤w_k, P_k^⊤AP_k, and P_k^⊤b, and discuss their usage and update below. We assume that computing X^⊤v for any v ∈ R^n and Xu for any u ∈ R^l both cost O(T_X) for some T_X (most commonly, T_X = ln when X is dense, and T_X equals the number of nonzero elements in X when X is sparse), and similarly that computing Av for v ∈ R^n costs O(T_A). Under Assumption 1, (35) is twice-differentiable almost everywhere; its gradient and (generalized) Hessian [16] are given in (39) and (40) respectively, where D_w is a diagonal matrix. The main computation at each iteration of Algorithms 1 and 2 is to construct and solve the linear system (23). For problems of the form (35), if at the kth iteration the iterate is w_k and the linear function (4) is defined by P_k, then the matrix on the left-hand side of (23) is given in (41). For large-scale problems, one should maintain X^⊤P_k and calculate the second term in (41) through it, rather than explicitly forming the Hessian of the g term, which would require O(n²) storage and O(n²l) computation. Further, X^⊤P_k does not need to be calculated from scratch (which has an expensive O(lmn) cost), because we can exploit the property that P_k and P_{k−1} share most columns to efficiently update X^⊤P_{k−1} to X^⊤P_k. We should also maintain X^⊤w_k (which is actually a column of X^⊤P_k for our choice of P_k) so that the cost of computing D_{w_k} is reduced from the original O(T_X) to O(l).
The second term of (41) can thus be efficiently computed in O(m²l) time, which is often even cheaper than computing the gradient of g(X^T w_k), which costs O(T_X).
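A minimal sketch of this O(m²l) computation, assuming the l × m matrix X^T P_k and the diagonal of D_{w_k} are already maintained (function and variable names are ours):

```python
import numpy as np

def data_term(XP, D):
    """(X^T P_k)^T D_{w_k} (X^T P_k) in O(m^2 l): XP is the maintained
    l x m matrix X^T P_k and D the length-l diagonal of D_{w_k}."""
    # Scaling the rows of XP by D avoids ever forming the n x n Hessian.
    return XP.T @ (XP * D[:, None])
```

The broadcasted product `XP * D[:, None]` scales the rows of X^T P_k, so the n × n matrix X D_w X^T is never formed.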
For maintaining X^T P_k, we have from (37) that

X^T P_k = (X^T w_k, X^T V̄_{k-1}, X^T V_k). (42)

The term X^T V̄_{k-1} is directly available because it is a submatrix of the X^T P_{k-1} maintained at the (k−1)th iteration, and X^T w_k is obtained from the line search procedure of the previous iteration, which we will explain when discussing (48). Finally, from (33) and (34), V_k in (37) has only one or two columns for all k, so computing X^T V_k, and therefore maintaining X^T P_k, costs only O(T_X).

Next, we consider the first term of (41) and use (37) to deduce that

P_k^T A P_k = (P_k^T A w_k, P_k^T A V̄_{k-1}, P_k^T A V_k). (43)

We compute A V_k at O(T_A) cost, after which P_k^T (A V_k) costs only O(mn). To compute P_k^T A w_k, in addition to (38), from the previous iteration we keep track of θ_{k-1}, t_{k-1}, the relation (44), and the update direction p_{k-1} = P_{k-1} t_{k-1}. Then the calculation in (45) is conducted,
where, where possible, we use parentheses to specify the order of operations, which we now explain. For the first row of (45), both V̄_{k-1}^T A w_{k-1} and V̄_{k-1}^T A P_{k-1} are entries of the maintained P_{k-1}^T A P_{k-1}, and the inner product between the latter and t_{k-1} costs only O(m). For the third row of (45), which involves V_k^T A P_{k-1} t_{k-1} and V_k^T A w_{k-1}, we use the A V_k computed above and the p_{k-1} = P_{k-1} t_{k-1} available from (44), so the remaining inner products cost only O(n). For the second row of (45), we note that it involves two entries of P_{k-1}^T A P_{k-1}, and the computation costs only O(m) for the inner product between t_{k-1} and P_{k-1}^T A w_{k-1}, which is exactly the w_k^T A P_{k-1} t_{k-1} needed in (45). Finally, the P_k^T A V̄_{k-1} in (43) can be decomposed in (46) into the blocks V̄_{k-1}^T A V̄_{k-1}, w_k^T A V̄_{k-1}, and V_k^T A V̄_{k-1}. The first is available from P_{k-1}^T A P_{k-1}, and the other two have been calculated above (w_k^T A V̄_{k-1} from the first row of (45), and V_k^T A V̄_{k-1} as entries of P_k^T (A V_k) in (43)), so these entries are obtained at no additional cost. Therefore, the cost of maintaining P_k^T A P_k using the information from P_{k-1}^T A P_{k-1} is O(T_A + mn).
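The two incremental updates just described, reusing the overlapping columns of X^T P_{k-1} inside X^T P_k and rebuilding P_k^T A P_k from P_{k-1}^T A P_{k-1}, can be sketched as follows. This is a simplified sketch under our own column-layout conventions: each routine assumes its own ordering of kept and new columns, and the extra recursion the paper uses for the w_k column in (45) is omitted.

```python
import numpy as np

def update_XP(XP_prev, keep_idx, X, V_new, Xw_new):
    # X^T P_k per (42): reuse columns keep_idx of the maintained
    # X^T P_{k-1}, take X^T w_k from the previous line search, and
    # compute only X^T V_k from scratch at O(T_X) cost.
    return np.column_stack([Xw_new, XP_prev[:, keep_idx], X.T @ V_new])

def update_gram(G_prev, keep_idx, P_new, A):
    # P_k^T A P_k from P_{k-1}^T A P_{k-1}: copy the kept-kept block and
    # compute only A V_k (O(T_A)) plus P_k^T (A V_k) (O(mn)).  Kept
    # columns are assumed to come first in P_new, followed by V_k.
    m_keep = len(keep_idx)
    AV = A @ P_new[:, m_keep:]
    cross = P_new.T @ AV
    m = P_new.shape[1]
    G = np.empty((m, m))
    G[:m_keep, :m_keep] = G_prev[np.ix_(keep_idx, keep_idx)]
    G[:, m_keep:] = cross
    G[m_keep:, :] = cross.T      # valid because A is symmetric
    return G
```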
Next is the right-hand side of (23), which from (39) can be calculated via

P_k^T ∇f(w_k) = P_k^T A w_k + P_k^T b + (X^T P_k)^T u_{w_k}. (47)

The cost is low because X^T P_k and P_k^T A w_k have already been calculated in (42) and (45) respectively, so we only need to compute u_{w_k}, (X^T P_k)^T u_{w_k}, and P_k^T b. As indicated in (38), P_k^T b should be maintained, because we can then reuse elements of P_{k-1}^T b and only need to calculate V_k^T b.³ The cost of maintaining P_k^T b is thus only O(n). The calculation of u_{w_k} costs O(l) because X^T w_k is obtained from the previous iteration in (48), and computing (X^T P_k)^T u_{w_k} costs only O(ml). This means that the right-hand side of (23) is not the bottleneck.

When we conduct the line search in (24), the information calculated above can be utilized to reduce the cost. The term ∇f(w_k)^T p_k can be obtained as the inner product between (47) and t_k at O(m) cost. All other terms are already known (the previous objective value is maintained), so we only need to discuss how to evaluate f(w_k + β^i p_k) efficiently for multiple values of i. For g(X^T w), we can use the X^T P_k obtained in (42) to calculate

X^T (w_k + β^i p_k) = X^T w_k + β^i (X^T P_k) t_k, (48)

so the argument for each i can be obtained in O(l) time. Therefore, for each i, g(X^T (w_k + β^i p_k)) can be evaluated in O(l) time as well. For the quadratic part, we see that

(w_k + β^i p_k)^T A (w_k + β^i p_k) = w_k^T A w_k + 2 β^i w_k^T A P_k t_k + (β^i)² t_k^T (P_k^T A P_k) t_k. (49)

As P_k^T A P_k is maintained, w_k^T A w_k is one of its entries and w_k^T A P_k t_k is obtained from the quantities in (45), so we only need an O(m²) overhead for the last term; each line search step then costs only O(1) on this part. Note that the sum in (48) with the finally accepted β^i is exactly the X^T w_{k+1} needed for the next iteration; see (38). The new function value f(w_{k+1}) is also obtained. Interestingly, the line search procedure via (48) and (49) does not generate the next iterate w_{k+1} itself.
Thus, at the end of the iteration we calculate the update direction

p_k = P_k t_k, (50)

which, as shown in (44) and (45), will be used in the next iteration, and finally obtain

w_{k+1} = w_k + θ_k p_k. (51)

We summarize in Algorithm 3 how we maintain additional information to make the implementation for (35) more efficient, and show the corresponding cost of each computational step.
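As an illustration of the cheap line-search evaluations via (48) and (49), the following hypothetical routine evaluates f(w_k + β p_k) for several β at O(l) each from cached quantities (names are ours; g is the vectorized separable loss):

```python
import numpy as np

def linesearch_objectives(betas, z, XPt, wAw, wAPt, tGt, bw, bPt, g, C):
    """Evaluate f(w_k + beta p_k) for several step sizes beta, per
    (48)-(49), from cached quantities: z = X^T w_k, XPt = (X^T P_k) t_k,
    wAw = w_k^T A w_k, wAPt = w_k^T A P_k t_k, tGt = t_k^T (P_k^T A P_k) t_k,
    bw = b^T w_k, bPt = b^T P_k t_k.  No pass over X or A is needed."""
    vals = []
    for beta in betas:
        quad = 0.5 * (wAw + 2.0 * beta * wAPt + beta ** 2 * tGt)  # (49)
        lin = bw + beta * bPt
        vals.append(quad + lin + C * np.sum(g(z + beta * XPt)))   # (48)
    return vals
```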
The problem structure (35) also allows us to conduct multiple inner iterations at a cost much lower than that of the remaining parts of an outer iteration, such as updating P_k and calculating the full gradient. We now explain why, as long as P_k remains unchanged, the coefficients of the Newton linear systems can be evaluated quickly. From (6), (40), and (41), if w is changed in an inner iteration, all we need to recompute for the coefficients of the Newton linear system are D_w and (X^T P_k)^T D_w (X^T P_k) for the matrix, and u_w, (P_k^T A) w, and (X^T P_k)^T u_w for P_k^T ∇f(w). For the data-related part g(z) with z = X^T w, the objective value, u_w, and D_w can all be quickly calculated in O(l) time through the z maintained by (48). Notice that in this case, the column w_k in (37) is the iterate from the latest outer iteration, not the latest inner iterate (and so are the other columns of P_k, such as ∇f(w_k)), so the updates of w and z are disentangled from that of P_k: the former two change at every inner iteration, while the latter is updated once per outer iteration. Thus, when P_k, and therefore X^T P_k, remains unchanged, we can compute (47) cheaply: from (45), P_k^T A w can be updated at O(m + n) cost, and with both u_w and X^T P_k available, (X^T P_k)^T u_w costs only O(lm) to compute. Constructing and solving the linear system is thus much cheaper than, for example, calculating X^T V_k when P_k is updated in an outer iteration. This explains why (35) allows us to solve (2) efficiently with multiple inner Newton iterations. For updating (38) after one outer iteration, we accumulate the values of t over all inner iterations within the same outer iteration and conduct the updates in the manner discussed above.

Algorithm 3: Efficient implementation of Algorithm 2 for (35)
Given w_0 ∈ R^n and matrices A and X such that computing Av and X^T v respectively cost O(T_A) and O(T_X)
Compute X^T w_0, A w_0, and f(w_0)
for k = 0, 1, . . . do
  Use X^T w_k to calculate (40)
  Construct the linear system (23) by (41) and (47) ..... O(m²l + n)
  Solve the system (23) for t_k ..... O(m³)
  Compute t_k^T (P_k^T A P_k) t_k and (X^T P_k) t_k respectively for (49) and (48) ..... O(l + m²)
  Conduct backtracking line search to find θ_k that satisfies (24) using (49)
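The inner-iteration reuse described above can be sketched as follows, with a logistic-style loss standing in for g and a unit step in place of the line search (all names are ours); the state z = X^T w and P_k^T A w is updated without any pass over X or A:

```python
import numpy as np

def inner_newton_step(z, PAw, XP, Gq, Pb, C):
    """One inner Newton iteration with P_k fixed.  State kept across
    inner iterations: z = X^T w and PAw = P_k^T A w.  Computed once per
    outer iteration: XP = X^T P_k, Gq = P_k^T A P_k, Pb = P_k^T b."""
    s = 1.0 / (1.0 + np.exp(z))
    u = -C * s                               # u_w (logistic-style loss)
    D = C * s * (1.0 - s)                    # diagonal of D_w
    M = Gq + XP.T @ (XP * D[:, None])        # left-hand side of (23)
    t = np.linalg.solve(M, -(PAw + Pb + XP.T @ u))
    z_new = z + XP @ t                       # X^T (w + P t), O(lm)
    PAw_new = PAw + Gq @ t                   # P^T A (w + P t), O(m^2)
    return t, z_new, PAw_new
```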

Summary of cost analysis
We summarize the cost of the major steps of Algorithm 3 in Table 1. In total, the computational cost per iteration is given in (52).

Table 1 Summary of the cost of the major computational steps of Algorithm 3
Step ..... Cost
Compute ∇f(w_k) using A w_k and X^T w_k ..... O(T_X + T_A + n + l)
Compute (41) ..... O(m²l + T_A + mn)

When m and the number of backtracking steps are not large, the last three terms of (52) are clearly dominated by T_X + T_A. If X or A is sparse, then unless the matrix is so highly sparse that each row or column has only a few nonzero elements, the dominant term is still T_X + T_A + m²l.
In comparison with the general cost analysis given in Sect. 3.1, from (39) and (40), we note that T_X + T_A is exactly T_grad (assuming l + n = O(T_X + T_A), as argued above). Therefore, while the plain implementation involves a cost of m(T_X + T_A), Algorithm 3, by utilizing the problem structure, requires only two to three times (T_X + T_A), as can be seen from (52). As T_X + T_A is usually dominant, the efficient implementation described above indeed leads to a significant improvement in the computational cost.

Numerical experiments
We present numerical results of Algorithms 1 and 2 running with a single core of a machine with 16GB memory. Code for reproducing our results in this section and Sects. 7-8 is available at https://www.csie.ntu.edu.tw/~cjlin/papers/l-commdir/lcommdir-journal-exp.tar.gz.
Throughout the experiments, we consider ℓ2-regularized logistic regression and ℓ2-regularized squared-hinge loss SVM problems, which are respectively of the forms

min_w (1/2) w^T w + C Σ_{i=1}^l log(1 + exp(−y_i w^T x_i)) (53)

and

min_w (1/2) w^T w + C Σ_{i=1}^l max(0, 1 − y_i w^T x_i)², (54)

where C > 0 is a pre-specified parameter and x_i, i = 1, ..., l, are the feature vectors of the training data with corresponding labels y_i ∈ {−1, 1}. Notice that both problems are of the form (35), with A = I and b = 0 in the quadratic part, and with g_i(z_i) being the corresponding summand times C. Both problems satisfy Assumption 1, and the ℓ2 regularization makes them strongly convex. Thus Theorem 5 and the first case of Theorem 1 apply, because σ-strong convexity implies (32) with the same σ. The datasets we use are summarized in Table 2.⁴ We compare the relative difference to the optimal objective value,

(f(w) − f*) / f*, (55)

where f* is obtained by running our algorithm long enough. All methods are implemented in C/C++.

We first examine the optimal convergence speed described in Theorem 1 and compare Algorithm 1 with related approaches. To apply Theorem 1, we consider Proposition 1 and take P_k as in (56) at each iteration, so that its span includes ∇f(w_k) and ṽ_{k+1}, as required by Proposition 1. We use this choice instead of directly including ṽ_{k+1} because w_k and v_k are maintained throughout for calculating v_{k+1}, so we can utilize these two vectors easily. The subproblem (14) is solved by the Newton method described in Sect. 3, with the inner stopping condition being (21). Notice that the classical analysis of line-search Newton methods guarantees that the inner iterates approach the solution set of (2), and Proposition 1 ensures that when the iterate is close enough to the solution set, this condition is satisfied; therefore, (21) is a valid stopping condition. We use the (smaller) first four datasets listed in Table 2 in this experiment. For parameter estimation, it is clear that for (53) and (54) the quadratic term is 1-strongly convex, so we always have σ = 1.
For ρ, the gradient of the quadratic term is clearly 1-Lipschitz continuous, and in the data-related term, each g_i(x_i^T w) has a (C ρ̃ ||x_i||_2^2)-Lipschitz continuous gradient, with ρ̃ = 0.25 for (53) and ρ̃ = 2 for (54). We thus set ρ to the upper bound given in (57).
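For illustration, one simple valid choice is the bound below: since each g_i has a (C ρ̃ ||x_i||_2^2)-Lipschitz gradient, the data term's Hessian C X D_w X^T is bounded in spectral norm by C ρ̃ Σ_i ||x_i||_2^2. Whether (57) uses exactly this Frobenius-type bound is our assumption; any valid upper bound suffices for the theory.

```python
import numpy as np

def lipschitz_upper_bound(X, C, rho_tilde):
    """Upper bound on the gradient-Lipschitz constant of (53)/(54):
    1 from the quadratic term plus C * rho_tilde * sum_i ||x_i||_2^2
    (= C * rho_tilde * ||X||_F^2) from the data term.  Illustrative;
    the exact form of (57) is an assumption here."""
    return 1.0 + C * rho_tilde * float(np.sum(X * X))
```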
We compare the following methods. -L-CommDir-Optimal: Algorithm 1 with the settings described above.
-Accelerated gradient (AG) [18,20]: we use the same σ = 1 and the ρ estimate in (57).

We use the suffix "-t" for L-CommDir-BFGS and L-CommDir-Diag to indicate that information from the latest t iterations (including the current one) is used. We mainly use t = 5, making m = 10 in (33) for L-CommDir-BFGS and m = 15 in (34) for L-CommDir-Diag. To compare with L-CommDir-Optimal, we also consider t = 1 (information from the current iteration only), making m = 2 and m = 3 for L-CommDir-BFGS and L-CommDir-Diag, respectively. For all methods that require a line search to satisfy (24), we use the fixed parameters β ≡ 0.5 and c1 ≡ 10⁻² throughout, in both this section and all following experiments.
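The backtracking procedure with these fixed parameters can be sketched as follows (the signature is ours; f(w_k) and ∇f(w_k)^T p_k are passed in precomputed, matching how the cached quantities of (48) and (49) are used):

```python
import numpy as np

def backtracking_line_search(f, fw, grad_dot_p, w, p, beta=0.5, c1=1e-2,
                             max_backtracks=60):
    """Armijo backtracking with the fixed parameters beta = 0.5 and
    c1 = 1e-2: return the largest theta = beta^i satisfying (24), i.e.
    f(w + theta p) <= f(w) + c1 * theta * grad_f(w)^T p."""
    theta = 1.0
    for _ in range(max_backtracks):
        if f(w + theta * p) <= fw + c1 * theta * grad_dot_p:
            return theta
        theta *= beta
    raise RuntimeError("line search failed to satisfy (24)")
```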
The results are shown in Table 3. We see that L-CommDir-Optimal always needs only one Newton step to satisfy (22), making Algorithm 1 with the inner stopping condition (15) equivalent to Algorithm 2 under the P_k in (56) in this experiment. Except on a9a, L-CommDir-Optimal always outperforms AG significantly, and when AG is faster, L-CommDir-Optimal takes at most four times as many iterations, confirming that our method converges at least at the same rate. We also see that with t = 1, L-CommDir-BFGS and L-CommDir-Diag tend to be worse than L-CommDir-Optimal, indicating that a momentum term is indeed helpful. However, with t ≥ 5, L-CommDir-BFGS, L-CommDir-Diag, and CommDir all converge much faster than these optimal first-order methods empirically, in terms of both the running time and the number of iterations. Note that L-CommDir-BFGS always takes no fewer iterations than CommDir, as the subspace considered by the former is always contained in that of the latter, but L-CommDir-Diag sometimes converges faster. Notice that L-CommDir-BFGS and CommDir can still be considered first-order methods, in the sense that p_k ∈ span(w_0, ∇f(w_0), ..., ∇f(w_k)), but they show much faster empirical performance.⁵ Hence, we focus on using (33) and (34) for P in the sequel.

Table 3 Single-core comparison of different methods on (53). We present the required time (in seconds) and number of iterations to make (55) no larger than 10⁻⁸. The names Optimal, BFGS, and Diag omit the prefix L-CommDir.
The row "largest inner iter." of L-CommDir-Optimal shows the largest number of inner iterations taken over the outer iterations to solve (2) to satisfy (21).

We continue to compare CommDir and L-CommDir further using all datasets in Table 2. We also try C ∈ {10⁻³, 1, 10³} to test how the algorithms perform when the condition number changes. We compare CommDir with both L-CommDir-BFGS and L-CommDir-Diag using information from the latest t iterations, with t ∈ {5, 10, 15, 20}. The results are shown in Tables 4 and 5.
In all situations, CommDir requires fewer iterations than L-CommDir-BFGS, but even with only t = 5, the convergence speed of L-CommDir-BFGS is competitive. When it comes to the actual running time, however, on difficult problems that take more iterations, especially when C = 10³, CommDir can be much slower in later iterations as m grows large, while L-CommDir does not suffer from this issue and is therefore much faster.
We also observe that L-CommDir-Diag tends to be faster than L-CommDir-BFGS in terms of the iteration count, but its running time is not always as advantageous, because computing the diagonal entries of the Hessian is as costly as computing the gradient. Also, from (34), V_{k+1} in (37) has two columns in L-CommDir-Diag, so updating X^T P_k is more costly than in L-CommDir-BFGS.

Multicore parallelization
In comparison to other first-order and even second-order methods, one advantage of our method is that it can be better parallelized. When the problem is of the form (35) and X is sparse, the bottleneck of first-order methods is the computation of X^T w in (40) for calculating the gradient. This calculation mainly consists of reading through all the entries of X, so the bottleneck is usually the memory bandwidth rather than the computational power, and its parallelism is therefore limited. A similar situation applies to second-order methods that repeatedly calculate Hessian-vector products, as that is also a memory-bound procedure. Therefore, although theoretically all non-stochastic first-order and second-order methods are inherently parallel, as their major operations are matrix-vector products, in practice we do not experience much speedup from using more cores.
On the other hand, the additional calculation of (41) in our method is a matrix-matrix calculation, and its most expensive part, (X^T P_k)^T D_{w_k} (X^T P_k), involves dense matrix-matrix operations, so there is more data reuse and the memory bandwidth is no longer the bottleneck. Better parallelism of our method in multicore environments can therefore be expected.
Moreover, in the multicore setting, except for the step of solving the linear system, each of the other steps in Table 1 involves a set of independent operations such as matrix-vector products, so they can be easily parallelized. Assuming the O(m³) part is relatively small and is not parallelized, with K cores the computational cost of one iteration of Algorithm 3 is reduced from (52) accordingly, with the parallelizable terms divided by K. We also note that in a shared-memory multicore environment, communication between cores is usually extremely fast and not an issue for the overall running time.

(Tables 4 and 5: required time and number of iterations to make (55) no larger than 10⁻⁸.)

Multicore experiments
To verify our discussion, we conduct numerical experiments to show the empirical speedup of our method with different numbers of cores, and compare it with Multicore-LIBLINEAR 2.30 [12],⁶ a state-of-the-art multicore package for (53). In particular, we compare our method with the trust-region Newton method with preconditioned conjugate gradient implemented in this package. What we intend to see is how the algorithms scale with the number of cores, so we compare the respective running-time speedups of the algorithms. In this experiment, we present the results of (53) with C = 1, and take historical information from the latest ten iterations for L-CommDir. In addition to the datasets in Table 2, we also consider some larger ones listed in Table 6.⁷ For each algorithm, the speedup is computed as

Speedup of using k cores = (Running time using 1 core) / (Running time using k cores).
In this experiment, all solvers are run on an Intel multicore dual-socket machine with 180 GB memory. Each socket has 20 cores, and we force all threads to use cores from the same socket. Parallelization is achieved through OpenMP and the Intel Math Kernel Library. The results are in Fig. 1. We see that on a9a, real-sim, and news20, L-CommDir has much better speedup than Multicore-LIBLINEAR. Moreover, L-CommDir-Diag achieves significant parallelism on epsilon, while on the other datasets the speedups of L-CommDir and Multicore-LIBLINEAR are similar. Together with the high speedup, this shows that multicore parallelization is indeed very useful for reducing the running time of L-CommDir.
Distributed optimization

We now consider distributed optimization, in which the objective is a sum of local functions f_k, where each f_k is exclusively available on the kth machine. Directly communicating the Hessian requires transmitting O(n²) elements, which is prohibitively expensive. Therefore, conjugate gradient (CG) methods that compute Hessian-vector products are adopted, as each Hessian-vector product requires communicating only an n-dimensional vector. The state of the art for distributed optimization [13,27,28] consists of Newton-CG methods with different preconditioners to reduce the overall rounds of communication needed, as Newton methods are asymptotically fast-convergent and preconditioners can reduce the number of CG iterations per Newton step; in practice, however, these approaches can still take many CG iterations.
In distributed optimization, trading computation for communication can help reduce the running time, and our method, which interpolates between first- and second-order methods, is a perfect fit for this purpose. The linear system in this setting can be constructed by letting each machine compute its local part and then aggregating the results, and the obtained step is a non-truncated full Newton step in a subspace. Thus, when the subspace is close to a solution, our method can be much more communication-efficient than Newton-CG methods.

Distributed experiments
We proceed to examine the empirical performance of our method for distributed optimization. The distributed environment is a cluster of ten machines running MPI, connected through a 1 Gbps network. We compare the following methods.

We first examine the communication cost. For (53), L-CommDir-BFGS is among the most communication-efficient methods, performing worse only on KDD2010-a with C = 1 and C = 10³. Similar trends are observed for (54), while for (54) with C = 10³, L-CommDir-BFGS always outperforms the other methods in communication efficiency. This is likely because the generalized Hessian of (54) is not as useful as the real Hessian of (53), especially on difficult problems, so L-CommDir-Diag and MPI-LIBLINEAR do not perform as well. On the other hand, L-CommDir-Diag is in general not as communication-efficient as L-CommDir-BFGS, and usually performs closer to the existing methods, but on difficult problems like KDD2010-a and url, it sometimes becomes the best method in terms of rounds of communication.
Next, regarding the running time: although it is positively correlated with the rounds of communication, it also depends on the computational power of the machines and on the distributed environment. In particular, we see that L-CommDir-Diag is relatively faster in running time because its amount of computation per communication round is smaller (each of its iterations takes two communication rounds, while L-CommDir-BFGS takes only one). On the other hand, MPI-LIBLINEAR is much faster than L-CommDir on epsilon: epsilon has a lower problem dimension, which makes the communication cost less significant in the overall running time.

Conclusions
In this work, we present an efficient smooth optimization algorithm that interpolates between first- and second-order methods by utilizing information from previous iterations. Theoretical results show that our method possesses the optimal convergence rates of first-order methods while being strictly descending in the objective value. Empirical results show that our method outperforms optimal first-order methods and second-order methods on real-world empirical risk minimization problems in single-core, multicore, and distributed settings. Future work includes extending our method to general regularized problems by adding the regularization term to the subproblem, and considering nonlinear manifolds in the construction of G, possibly through Riemannian optimization or partly smooth functions. Based on this work, we have expanded the package MPI-LIBLINEAR, available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/, to include the proposed method.
which can be written as the following two equivalent forms.
Next, we consider the case σ = 0. From the recursion (8), we have Further, in this case, (7) is reduced to We obtain the following recurrence relation for φ*_{k+1}.

A.2 Proof of Proposition 1
Proof From the strong convexity of f, even though there might be multiple t that are optimal for (2), they all result in the same iterate w*_{k+1} after the mapping w_k + G(t). We first show that (21) holds at any optimal t* that maps to w*_{k+1}, and then discuss its neighborhood. As G is linear, there is P ∈ R^{n×m} such that G(t) = P t for all t ∈ R^m. Therefore, the optimality condition of the convex problem (2) gives the stated relation, and R_0 is finite. By noting that Δ_0 ≥ 0 and that the derivative of the right-hand side of (97) with respect to Δ_0 is nonnegative when Δ_0 ≥ 0, we can use (98) to simplify (97) to