Distributed block-diagonal approximation methods for regularized empirical risk minimization

In recent years, there has been a growing need to train machine learning models on huge volumes of data. Therefore, designing efficient distributed optimization algorithms for empirical risk minimization (ERM) has become an active and challenging research topic. In this paper, we propose a flexible framework for distributed ERM training through solving the dual problem, which provides a unified description and comparison of existing methods. Our approach requires only approximate solutions of the sub-problems involved in the optimization process, and is versatile enough to be applied to many large-scale machine learning problems, including classification, regression, and structured prediction. We show that our framework enjoys global linear convergence for a broad class of non-strongly-convex problems, and that some specific choices of the sub-problems can achieve much faster convergence than existing approaches through a refined analysis. This improved convergence rate is also reflected in the superior empirical performance of our method.


Introduction
With the rapid growth of data volume and model complexity, designing efficient learning algorithms has become increasingly important. Distributed optimization techniques, which decompose a large optimization problem into sub-problems and distribute the computational burden across multiple machines, have gained a great amount of interest. This type of approach is particularly useful when the optimization problem involves massive computation and/or the data set is stored on multiple machines because it cannot fit in the capacity of a single node. However, the communication cost and the asynchronous nature of the process challenge the design of efficient optimization algorithms in distributed environments.
In this paper, we study distributed optimization algorithms for training machine learning models that can be represented under the regularized empirical risk minimization (ERM) framework. These models include binary/multi-class classification, regression, and structured prediction models, covering a variety of learning tasks. We specifically focus on linear models, which have been shown successful in dealing with large-scale data thanks to their efficiency and interpretability. Given a set of training data {X_i}_{i=1}^l, X_i ∈ R^{n×c_i}, c_i ∈ N, regularized ERM models solve the following optimization problem:

min_{w ∈ R^n}  f^P(w) := g(w) + Σ_{i=1}^l ξ_i(X_i^T w),   (1)

where g and ξ_i are the regularization term and the loss term, respectively. We assume that f^P is a proper, closed convex function.
Note that the definition of ERM problems in problem (1) is general and covers a variety of types of learning problems. To unify the definitions of different learning problems, we encode the true labels (i.e., y_i ∈ Y_i) in the loss term ξ_i and the input data X_i. For some learning problems, the space of X_i is spanned by a set of variables whose size may vary for different i. Therefore, we represent X_i as an n × c_i matrix. For example, in the part-of-speech tagging task, where each input sentence consists of several words, c_i represents the number of words in the i-th sentence. We provide a detailed discussion of the loss terms for different types of learning problems in Section 6.
Regarding the regularization term, common choices of g include the squared ℓ2 norm, the ℓ1 norm, and the elastic net (Zou and Hastie, 2005) that combines both.
In many applications, it is preferable to solve the Lagrange dual problem of problem (1), because the dual problem has better mathematical properties that make optimization easier. The Lagrange dual problem of (1) is

min_{α ∈ Ω}  f(α) := g*(Xα) + Σ_{i=1}^l ξ_i*(−α_i),   (2)

where α_i ∈ R^{c_i} is the dual variable associated with X_i, X := [X_1, …, X_l], and for any function h(·), h*(·) is its convex conjugate: h*(w) := max_{z ∈ dom(h)} z^T w − h(z), ∀w.
The domain of α, Ω, is defined to be Π_{i=1}^l dom(ξ_i*) ⊆ R^{Σ_{i=1}^l c_i}. We use the following notations to simplify our description. In a single-core setting, optimization methods for the dual ERM (2) have been widely studied (see, for example, Yuan et al. (2012) for a review). However, most of the state-of-the-art algorithms for the dual ERM are inherently sequential and cannot be parallelized easily. On the other hand, in a distributed setting, the cost of communication is relatively high and is usually the bottleneck of distributed optimization. Therefore, it is important to reduce the communication cost to make the optimization efficient. Consequently, algorithms with faster convergence rates are desirable for distributed optimization, because fewer iterations imply fewer communication rounds and thus lower communication cost. We observe that without careful consideration of this issue, existing distributed dual solvers do not achieve satisfactory training speed. As a result, existing distributed algorithms like Chen et al. (2014); Zhuang et al. (2015); Lin et al. (2014); Lee et al. (2017) often focus on using the gradient and/or the Hessian information to solve the primal form of ERM (1), whose computation either can be parallelized naturally or requires only some modification in the implementation to adapt to distributed environments. However, in this paper, we show that with a careful design, dual approaches can be competitive with primal approaches and enjoy sound theoretical properties.
Note that some methods, such as Zhang and Lin (2015), reduce the number of communication rounds significantly; however, their computational cost per iteration is excessively high. These methods, though theoretically appealing in terms of communication efficiency, are impractical because their total running time is dominated by the computation. In contrast, our focus in this work is a practical algorithm in which both the computation and the communication concerns are properly dealt with, so that the overall running time is shorter.
In this work, we propose an efficient distributed framework for solving (2). At each iteration, our framework minimizes a problem consisting of a positive semidefinite block-diagonal second-order approximation of the non-separable g*(Xα) term in (2), plus the original separable loss conjugate ξ*(−α). We consider the setting in which the training instances are distributed across K machines, where the instances on machine k are {X_i}_{i ∈ J_k}. In our setting, the J_k are disjoint index sets such that

∪_{k=1}^K J_k = {1, …, l},  J_i ∩ J_k = ∅, ∀i ≠ k.   (3)

Without loss of generality, we assume that

j_0 := 0,  j_K := l,  J_k := {j_{k−1} + 1, …, j_k},  k = 1, …, K.
For any vector v ∈ R l , v J k denotes the sub-vector in R |J k | that contains the coordinates of v indexed by J k .
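As a small illustration of the partition (3) and the sub-vector notation v_{J_k}, the following sketch (hypothetical helper names, 0-based indices) builds K disjoint contiguous blocks and extracts one block of a vector:

```python
import numpy as np

def make_partition(l, K):
    """Split indices {0, ..., l-1} into K disjoint, contiguous blocks
    J_1, ..., J_K (a 0-based analogue of the index sets in (3))."""
    bounds = np.linspace(0, l, K + 1).astype(int)
    return [np.arange(bounds[k], bounds[k + 1]) for k in range(K)]

def subvector(v, J_k):
    """v_{J_k}: the coordinates of v indexed by the block J_k."""
    return v[J_k]

J = make_partition(10, 3)   # three disjoint blocks covering {0, ..., 9}
v = np.arange(10.0)
```

The blocks are contiguous only for convenience; any disjoint cover of the indices satisfies (3).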
Our framework iteratively solves a second-order approximation problem of (2) in a distributed environment. We discuss how to choose the approximation to make our method efficient in terms of convergence and communication. After solving the approximated problem, we conduct a line search that requires only negligible computational cost and O(1) communication. By utilizing a relaxed condition, our method is able to achieve global linear convergence for many non-strongly convex dual problems, including the popular support vector machine (SVM) model proposed by Boser et al. (1992); Vapnik (1995), and the structured SVM (SSVM) problem (Tsochantaridis et al., 2005; Taskar et al., 2004). In other words, our algorithm requires only O(log(1/ε)) iterations, or equivalently, rounds of communication, to obtain an ε-accurate solution for (2). We also discuss the main differences between our approach and existing distributed algorithms for (2). Experiments show that our algorithm is significantly faster than existing distributed solvers for (2).
Special cases of our framework were published earlier as conference and workshop papers (Lee and Roth, 2015; Lee et al., 2015). In this work we unify the results and extend the previous work to a general setting. We also provide more thorough theoretical and empirical analysis. We note that recently, Zheng et al. (2017) proposed an accelerated method for solving the dual ERM problem distributedly, similar to the catalyst framework for convex optimization (Lin et al., 2015). Like the catalyst framework, which can be combined with any convex optimization algorithm, the acceleration in Zheng et al. (2017) can also be applied on top of our framework easily. Therefore, we focus our comparison on methods that directly solve the original optimization problem, since methods that are faster for the original problem are expected to perform better as well when combined with the acceleration framework.
This paper is organized as follows. We describe the algorithm overview in Section 2. The implementation details for distributed environments are shown in Section 3. In Section 4, we analyze the convergence properties of our algorithm. Section 5 discusses related works for distributed ERM optimization. Details of applying our algorithm to specific ERM problems are described in Section 6. Numerical experiments are provided in Section 7. Some possible extensions and limitations of our work are discussed in Section 8. We then make some concluding remarks in Section 9. The program used in this work is available at http://github.com/leepei/blockERM.

Algorithm
Throughout this work, we make the following assumption.
Assumption 2.1 There exists σ > 0 such that the regularizer g in the primal problem (1) is σ-strongly convex. Namely, for any w_1, w_2 and any λ ∈ [0, 1],

g(λw_1 + (1 − λ)w_2) ≤ λg(w_1) + (1 − λ)g(w_2) − (σλ(1 − λ)/2)‖w_1 − w_2‖².   (4)

Note that the goal of solving the dual problem is to obtain a solution to the original primal problem (1). It can easily be shown that strong duality between (1) and (2) holds, which means that any pair of primal and dual optimal solutions (w*, α*) satisfies the relation w* = ∇g*(Xα*).
By (Hiriart-Urruty and Lemaréchal, 2001, Part E, Theorem 4.2.2), since g is σ-strongly convex, g* is differentiable and has a (1/σ)-Lipschitz continuous gradient everywhere. This also indicates that even if g is extended-valued, g* is still finite everywhere, hence the feasible region of α is restricted only to Ω. From the KKT optimality conditions, we further have

X_i^T w* ∈ ∂ξ_i*(−α_i*),  i = 1, …, l.   (5)

Although (5) only holds at the optimum, we can still define w(α) as the primal solution associated with α when α is a sub-optimal solution to (2):

w(α) := ∇g*(Xα).   (6)

Our algorithm is an iterative descent method for solving (2). Starting from an arbitrary feasible α^0, it generates a sequence of iterates {α^1, α^2, …} with the property that f(α^i) ≤ f(α^j) if i > j. The iterates are updated by an update direction ∆α^t and a step size η^t ≥ 0 via

α^{t+1} = α^t + η^t ∆α^t,   (7)

where η^t will be specified later.
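To make the primal solution associated with a dual iterate concrete, the following sketch assumes the common special case g(w) = (σ/2)‖w‖², for which g*(v) = ‖v‖²/(2σ) and ∇g*(v) = v/σ; the function name is hypothetical:

```python
import numpy as np

sigma = 1.0  # strong-convexity constant of g (assumption for this sketch)

def w_of_alpha(X, alpha):
    """Primal point associated with a (possibly sub-optimal) dual alpha:
    w(alpha) = grad g*(X alpha).  For g(w) = (sigma/2) ||w||^2 we have
    g*(v) = ||v||^2 / (2 sigma), so grad g*(v) = v / sigma."""
    return X @ alpha / sigma

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))      # n = 4 features, N = 6 dual coordinates
alpha = rng.standard_normal(6)
w = w_of_alpha(X, alpha)             # primal vector in R^4
</test>```

For other strongly convex regularizers, only the gradient map of g* changes; the pattern of mapping α to w through Xα stays the same.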
In the objective function in (2), each element in ξ* usually can be computed separately. Therefore, it can be optimized directly in a coordinate-wise manner. However, g* is complex and hard to optimize. Thus, we use its approximation as a surrogate such that the resulting sub-problems are easier to solve. As we mentioned above, g*, and thus G*(α) := g*(Xα), are differentiable and have Lipschitz continuous gradients. Given the current iterate α^t, we solve

min_{∆α}  Q^{α^t}_{B^t}(∆α) := ∇G*(α^t)^T ∆α + (1/2) ∆α^T B^t ∆α + ξ*(−α^t − ∆α) − ξ*(−α^t)   (8)

for some symmetric B^t, and then take ∆α^t as the update direction. The matrix B^t can vary over iterations and has a wide range of choices, depending on the goal. One usually wants to pick B^t such that (8) is easy to solve, to reduce the cost per iteration, or such that Q^{α^t}_{B^t}(∆α^t) is a tighter approximation of f(α^t + ∆α^t) − f(α^t), so that fewer iterations are needed to obtain a good enough solution. We will show in Section 4 that as long as the whole objective of (8) is strongly convex, the obtained ∆α^t will be a descent direction. We then consider two different line search possibilities to obtain the step size η^t. The first is exact line search.
However, exact line search can be conducted only when the objective function has a specific structure that allows it. In the general case, we consider a backtracking line search with the Armijo rule. Given γ ≥ 1, β ∈ (0, 1), τ ∈ (0, 1), this procedure finds the smallest i ≥ 0 such that η = γβ^i satisfies

f(α^t + η∆α^t) ≤ f(α^t) + ητ∆_t,   (9)

where

∆_t := ∇G*(α^t)^T ∆α^t + ξ*(−α^t − ∆α^t) − ξ*(−α^t).   (10)
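The backtracking rule above can be sketched as follows; `backtracking` and the toy objective are hypothetical stand-ins, and the model decrease ∆_t (a negative number, as guaranteed later by Lemma 4.1) is passed in precomputed:

```python
def backtracking(f, alpha, d_alpha, Delta, gamma=1.0, beta=0.5, tau=0.25,
                 max_iter=50):
    """Armijo backtracking (sketch): find the smallest i >= 0 such that
    eta = gamma * beta**i satisfies
        f(alpha + eta * d_alpha) <= f(alpha) + eta * tau * Delta,
    where Delta < 0 is the model decrease of the sub-problem."""
    f0 = f(alpha)
    eta = gamma
    for _ in range(max_iter):
        if f(alpha + eta * d_alpha) <= f0 + eta * tau * Delta:
            return eta
        eta *= beta
    return eta  # give up after max_iter halvings

# toy 1-D objective f(a) = a^2 at a = 1, direction -f'(1) = -2,
# with Delta = f'(1) * d = -4
eta = backtracking(lambda a: a * a, 1.0, -2.0, Delta=-4.0)
```

Here the first trial η = 1 overshoots and is rejected, while η = 0.5 satisfies the rule, so the procedure returns after one halving.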

Distributed Implementation for Dual ERM
We now discuss how to adapt the algorithmic framework in Section 2 to make distributed optimization for the dual ERM efficient. In particular, we will discuss the choice of B in (8), the trick of making the line search efficient, and how to solve (8) with the minimum amount of communication.
For ease of describing the algorithm, we will use the following representation of X and α:

X := [X_1, …, X_l] = [x_1, …, x_N],  α := (α_1, …, α_N)^T,   (11)

where N := Σ_{i=1}^l c_i is the number of columns of X as well as the dimension of α. The index sets for the representation (11) corresponding to J_k are denoted by J̃_k ⊆ {1, …, N}, k = 1, …, K. We define π(i) = k if i ∈ J̃_k.

Update Direction
In this section, we discuss how to select B^t such that the objective of (8) is strongly convex and the update direction can be uniquely decided. Notice that we distribute the data across multiple machines, and the k-th machine only stores and handles {X_i}_{i ∈ J_k} and the corresponding dual variables α_i associated with X_i. Therefore, in order to reduce the communication cost, we need to pick B^t in a way such that (8) can be decomposed into independent sub-problems, each of which involves only a subset of the data. In this way, each sub-problem can be solved locally on one machine without communicating with other machines. According to the partition (3), we should consider a block-diagonal B^t such that

(B^t)_{i,j} = 0, if π(i) ≠ π(j).   (12)
Any B^t satisfying (12) falls within the block-diagonal approximation framework of our method. In most cases, we would like to incorporate as much higher-order information as possible to obtain fast convergence, which will then reduce the number of rounds of expensive communication. A natural choice would then be the Hessian matrix.
However, the Hessian matrix is usually dense and does not satisfy (12). Therefore, we consider its block-diagonal approximation H_{α^t}:

(H_{α^t})_{i,j} := x_i^T ∇²g*(Xα^t) x_j if π(i) = π(j), and 0 otherwise.   (13)
When the function g* is complicated such that ∇²g* cannot be calculated easily, one can use the identity matrix, or some diagonal approximation, to substitute for it in (13). Note that if each machine maintains the information of Xα^t, the entries of (14) can be decomposed into parts that each require information only from the data in one partition J̃_k. Thus the sub-problems can indeed be solved independently on the corresponding machines without communication. Another concern in using the Hessian matrix is that it might be only positive semidefinite, so that the requirement on the objective of (8) is not satisfied when ξ* is not strongly convex. In this case, we can add a damping term to B^t to ensure strong convexity. Therefore, our choice of B^t can be generalized to the following formulation:

B^t := a_1^t H_{α^t} + a_2^t I.   (15)
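To make the block-diagonal construction concrete, here is a minimal sketch for the simple case g*(v) = ‖v‖²/(2σ), where the full Hessian of G*(α) = g*(Xα) is X^T X/σ and the approximation keeps only the within-machine blocks; function and variable names are hypothetical:

```python
import numpy as np

def block_diag_hessian(X, blocks, sigma=1.0):
    """Block-diagonal approximation of the Hessian of G*(alpha) = g*(X alpha)
    for g*(v) = ||v||^2 / (2 sigma): the full Hessian is X^T X / sigma, and we
    zero out all entries that couple columns owned by different machines."""
    N = X.shape[1]
    B = np.zeros((N, N))
    for J in blocks:
        Xk = X[:, J]                       # columns owned by one machine
        B[np.ix_(J, J)] = Xk.T @ Xk / sigma
    return B

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 6))
blocks = np.array_split(np.arange(6), 2)   # two machines, disjoint columns
B = block_diag_hessian(X, blocks)
```

Each diagonal block depends only on that machine's columns, which is exactly what allows the sub-problems to be formed without communication.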
The choices of a_1^t and a_2^t might depend on the problem structure and the application. In most cases, a_2^t = 0 should be considered, especially when it is known that Q^{α^t}_{H_{α^t}} is already strongly convex.
For a_1^t, practical results (Pechyony et al., 2011; Yang, 2013) suggest that a_1^t ∈ [1, K] leads to fast convergence, while we prefer a_1^t ≡ 1 as it gives a tighter approximation of the Hessian matrix. On the other hand, for the special case a_1^t = 0, our framework reduces to proximal gradient, and when K = 1 and a_1^t = 1, a_2^t = 0, it turns into the proximal Newton method. In solving (8) using our choice (15), we notice that each machine needs the information of

v^t := Xα^t   (16)

to calculate both (B^t)_{J_k,J_k} and the corresponding part of the gradient. Therefore, after each iteration of updating α^t, we need to synchronize the vector Σ_j x_j α_j^t through communication. This communication of an O(n) vector is more suitable than communicating either the Hessian or the whole X together with α^t. However, because we need to use the update direction to conduct the line search, the better approach is to synchronize the following vector instead:

∆v^t := X∆α^t = Σ_{j=1}^N x_j ∆α_j^t,   (17)
and then update

v^{t+1} = v^t + η^t ∆v^t   (18)

locally after the step size η^t is determined. The details of communicating this vector will be explained in the next section, where we discuss how to conduct the line search.
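A minimal simulation of this synchronization pattern, assuming each machine owns a disjoint set of columns of X: every machine computes its local share of ∆v, the shares are summed (a stand-in for an all-reduce), and v is then updated locally; all names are hypothetical:

```python
import numpy as np

def local_delta_v(X_cols, d_alpha_block):
    """Machine k's share of Delta v = X Delta alpha, computed from the
    columns it owns; summing the K shares simulates the all-reduce."""
    return X_cols @ d_alpha_block

rng = np.random.default_rng(1)
n, N, K = 5, 9, 3
X = rng.standard_normal((n, N))
d_alpha = rng.standard_normal(N)
blocks = np.array_split(np.arange(N), K)   # disjoint column blocks, one per machine

# all-reduce: sum of K local O(n) vectors
dv = sum(local_delta_v(X[:, J], d_alpha[J]) for J in blocks)

v = np.zeros(n)
eta = 0.7
v_next = v + eta * dv                      # local update, no further communication
```

Only the n-dimensional shares cross the network; the columns of X and the blocks of ∆α never leave their machines.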

How to Conduct Line Search Efficiently
After the update direction ∆α^t is decided by solving (8), we need to conduct a line search to ensure sufficient function decrease. For the backtracking variant, the first term on the right-hand side of (9) is already available from the previous iteration, and we thus only need to calculate (10). From (16), this value can be calculated by

∆_t = ∇g*(v^t)^T ∆v^t + ξ*(−α^t − ∆α^t) − ξ*(−α^t).   (19)

Notice that to obtain the terms related to ξ*, we need an O(1) communication. Because each machine has full information of v^t and hence g*(v^t), the calculation of the first term can also be conducted distributedly, and we can combine the partial sums of all terms in (19) into one scalar to communicate, avoiding redundant communication. One can also see from this calculation that synchronizing ∆v^t is inevitable to compute the required value efficiently.
For the left-hand side of (9), we use the original formulation in (2) because this simplifies the computation. The ξ_i* terms are calculated distributedly in nature, while for the g*(v) term, if this term is separable with respect to v, then the computation is also parallelizable. Further, in some special cases we may even have a closed-form solution to calculate it in O(1) time for different η if some values are precomputed. For example, when

g*(v) = ‖v‖²/2, we have g*(v + η∆v) = (‖v‖² + 2η v^T ∆v + η²‖∆v‖²)/2.   (20)

In this case we can compute ‖∆v‖² and v^T ∆v distributedly in advance; then the calculation of (20) with different η requires only O(1) computation and does not need any communication. For the general case, though the computation might not be this low, by maintaining both the current v and ∆v, the computation of g*(v + η∆v) requires no further communication but only at most O(n) local computation, and this cost is negligible as other parts of the algorithm already involve computational cost more expensive than O(n). Another advantage of using v^t and ∆v^t is that when the backtracking procedure terminates by satisfying (9), the value of v^{t+1} can be easily obtained and we do not need to conduct (18) again. The line search procedure is summarized in Algorithm 1 (backtracking line search).
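A sketch of the O(1) evaluation along the line for the quadratic case above, assuming g*(v) = ‖v‖²/2; the three scalars are computed once (their partial sums are what would be combined in one communication) and then reused for every trial η:

```python
import numpy as np

rng = np.random.default_rng(2)
v, dv = rng.standard_normal(8), rng.standard_normal(8)

# precomputed once per direction (distributable as partial sums)
a = float(dv @ dv)          # ||dv||^2
b = float(v @ dv)           # v^T dv
g_v = 0.5 * float(v @ v)    # g*(v) for g*(v) = ||v||^2 / 2

def g_star_along(eta):
    """O(1) evaluation of g*(v + eta * dv) from the precomputed scalars,
    expanding the square as in (20)."""
    return g_v + eta * b + 0.5 * eta * eta * a
```

Each backtracking trial then costs a handful of arithmetic operations instead of an O(n) vector pass.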
The exact line search variant is possible only when ∂f(α + η∆α)/∂η = 0 has an analytic solution, which requires both g* and ξ* to be differentiable, at least in some open set. For example, when the objective f is quadratic, the optimal step size can be obtained by

η^t = −∇f(α^t)^T ∆α^t / ((∆α^t)^T ∇²f ∆α^t).   (21)

Another way to approximate the exact line search variant is to consider a bisection method if the above assumption of an analytic solution does not hold. In this case, we can utilize the trick of maintaining both v and ∆v mentioned above to reduce the communication cost of re-evaluating the objective value.

Local Solver at Each Machine
When we pick B^t satisfying (12), problem (8) can be decomposed into K independent sub-problems of the following form:

min_{∆α_{J_k}}  ∇G*(α^t)_{J_k}^T ∆α_{J_k} + (1/2) ∆α_{J_k}^T (B^t)_{J_k,J_k} ∆α_{J_k} + Σ_{i ∈ J_k} (ξ_i*(−α_i^t − ∆α_i) − ξ_i*(−α_i^t)).   (22)

Note that since all the information needed in (22) is available on machine k, the sub-problems can be solved without any inter-machine communication.
Our framework does not restrict the solver for (8). Instead, one can take an arbitrary local solver that is suitable for the specific problem. For example, we can consider (block) coordinate descent, projected gradient, proximal methods, and so on. Because each sub-problem is very close to the original dual problem of regularized ERM shown in (2), the cost is also similar, and one can consider methods that are efficient in the single-machine setting. In our experiments, we will use random-permuted cyclic coordinate descent (Hsieh et al., 2008; Yu et al., 2011; Chang and Yih, 2013) as our local solver, as this approach has been proven empirically efficient for optimizing dual ERM problems in the single-core setting. Note that although the theoretical convergence rate of cyclic coordinate descent can be as much as O(l²) times slower than that of its randomized, non-cyclic counterpart (Sun and Ye, 2016), its random-permuted version is known to behave similarly to the randomized, non-cyclic version while still retaining the deterministic convergence guarantee, and is thus the most favorable version in practice. Other options can be adopted at one's discretion for specific problems or data sets, but such discussion is beyond the scope of this work.
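A minimal sketch of a random-permuted cyclic coordinate descent local solver, on a simplified unconstrained quadratic stand-in for the sub-problem (the constraints and the ξ_i* terms of the actual sub-problem (22) are omitted here; all names are hypothetical):

```python
import numpy as np

def rpcd_quadratic(Q, p, epochs=20, seed=0):
    """Random-permuted cyclic coordinate descent (sketch) for
    min_d 0.5 d^T Q d + p^T d with Q symmetric positive definite.
    Each epoch visits every coordinate once, in a fresh random order."""
    rng = np.random.default_rng(seed)
    d = np.zeros(len(p))
    for _ in range(epochs):
        for j in rng.permutation(len(p)):
            grad_j = Q[j] @ d + p[j]
            d[j] -= grad_j / Q[j, j]   # exact minimization along coordinate j
    return d

Q = np.array([[3.0, 1.0], [1.0, 2.0]])
p = np.array([-1.0, -1.0])
d = rpcd_quadratic(Q, p)               # approaches the solution of Q d = -p
```

Reshuffling the visiting order every epoch is what distinguishes the random-permuted variant from plain cyclic coordinate descent while keeping each coordinate visited exactly once per epoch.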

Output the Best Primal Solution
The proposed algorithm is a descent method for the dual problem (2). In other words, it is guaranteed that f(α^{t_1}) < f(α^{t_2}) provided t_1 > t_2. However, when we obtain the primal iterates w^t from (6), there is no guarantee that the primal objective is strictly decreasing. Although we are able to prove in Section 4 that the primal objective converges R-linearly, we still cannot guarantee that the latest primal iterate is the one with the lowest primal objective. This situation happens for all methods that solve the dual problem. To deal with this problem, we keep track of the primal objective of each iterate, and when the algorithm terminates, we take the iterate with the lowest primal objective as the final model. This is known as the pocket approach in the literature on the perceptron algorithm (Gallant, 1990).
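The pocket bookkeeping amounts to a few lines; `Pocket` is a hypothetical helper:

```python
import numpy as np

class Pocket:
    """Keep the primal iterate with the lowest primal objective seen so far,
    since descent in the dual does not imply monotone primal objectives."""
    def __init__(self):
        self.best_obj = np.inf
        self.best_w = None

    def update(self, w, primal_obj):
        if primal_obj < self.best_obj:
            self.best_obj = primal_obj
            self.best_w = w.copy()

pocket = Pocket()
# a non-monotone sequence of primal objectives: the middle iterate wins
for w_val, obj in [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0)]:
    pocket.update(np.array([w_val]), obj)
```

At termination, `pocket.best_w` is returned as the final model instead of the last iterate.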

Stopping Criterion
If we considered only the sub-optimality of f(α^t), then we could use some simple indicators of sub-optimality, such as the norm of the update direction or the norm of the projected gradient, whose values are zero at the optima, as the practical stopping condition. However, since we already compute the primal objective at each iteration as discussed above, we can directly use it in the stopping criterion with no additional cost, and this criterion should be a more useful indicator of the model quality than the others. Therefore, we consider the following stopping criterion based on the duality gap:

f^P(w(α^t)) + f(α^t) ≤ ε,

with some user-specified ε ≥ 0. The overall procedure for distributedly optimizing (2) discussed in this section is described in Algorithm 2.

Cost per Iteration
For ease of description and analysis, we assume that the data entries are distributed evenly across machines. Namely, we assume that each machine has O(#nnz/K) non-zero entries, where #nnz is the number of non-zero elements in X, and that the number of data points on each machine is O(l/K). We will separate the cost for computation and for communication in our discussion. In our cost analysis below, we do not make the assumptions of low computational cost for general f and g* resulting from the specific structures discussed above. Instead, we only assume that the computations of g*(v) and ∇g*(v) are non-parallelizable and cost O(n), as further acceleration is a case-by-case possibility given special problem structure. The cost of the ∇²g*(v) part is assumed to be negligible, for otherwise we will simply replace it with a multiple of the identity matrix as discussed in Section 3.1.

[Algorithm 2: Block-diagonal approximation method for the dual ERM problem (2). Main steps per iteration: each machine obtains ∆α^t_{J_k} by solving the corresponding block of (8) independently and in parallel using the local data, with B decided by (15); the step size is determined by the line search; f^P(w(α^t)) is computed via (6); upon satisfying the stopping criterion, w(α^t) is output and the algorithm terminates.]
We first check the cost of forming problem (8). The calculation of ∆v^t through (17) costs O(#nnz/K) computation time and O(n) communication time. Calculating ∇g*(v^t) using the known v^t costs O(n), and computing ∇G*(α^t)_{J_k}^T ∆α_{J_k} then costs O((n + N)/K) computationally, as we never form ∇G*(α^t) explicitly, but only compute ∇G*(α^t)^T ∆α^t through ∇g*(v^t)^T ∆v^t.
Next, the cost of solving (8) is O(#nnz/K), as in most state-of-the-art single-core optimization methods for the dual ERM. For the line search part, calculating ∆_t has O((n + N)/K) computational cost and O(1) communication cost. Each line search iteration costs O(n + N/K) computation and O(1) communication. Finally, evaluating the primal objective costs O(n) for g(w(α^t)) and O(#nnz/K) for X^T w(α^t) in computation. It also takes O(1) communication to gather the summation over the ξ_i.
In summary, each iteration of Algorithm 2 costs O((#nnz + N)/K + n) in computation and O(n) in communication. We will show in the next section that the number of line search steps is upper-bounded by a constant, so the overall cost per iteration remains O((#nnz + N)/K + n) computationally and O(n) in communication.

Analysis
We now show that the update direction is indeed a descent one, and the backtracking line search procedure terminates within a bounded number of steps.
Lemma 4.1 Given the current iterate α^t, if B^t is chosen so that Q^{α^t}_{B^t} is c_2-strongly convex for some c_2 > 0 and c_1 + c_2 > 0, where c_1 is the smallest eigenvalue of B^t, then the solution ∆α^t obtained by solving (8) is a descent direction for f at α^t, namely ∆_t < 0. In particular,

∆_t ≤ −((c_1 + c_2)/2) ‖∆α^t‖² ≤ 0.   (23)

Moreover, the backtracking line search procedure in Algorithm 1 terminates to satisfy (9) within max(0, ⌈log_β((1 − τ)σ(c_1 + c_2)/(‖X^T X‖γ))⌉) steps, and the step size is lower bounded by

η^t ≥ min{γ, β(1 − τ)σ(c_1 + c_2)/‖X^T X‖}.   (24)

Proof For any λ ∈ [0, 1), we have

Q^{α^t}_{B^t}(∆α^t) ≤ Q^{α^t}_{B^t}(λ∆α^t) ≤ λQ^{α^t}_{B^t}(∆α^t) + (1 − λ)Q^{α^t}_{B^t}(0) − (c_2/2)λ(1 − λ)‖∆α^t‖²,

where in the first inequality we used that ∆α^t is the minimizer of (8), and in the second one we used the strong convexity of Q^{α^t}_{B^t}. Since Q^{α^t}_{B^t}(0) = 0, rearranging this inequality gives us

(1 − λ) Q^{α^t}_{B^t}(∆α^t) ≤ −(c_2/2)λ(1 − λ)‖∆α^t‖².

Dividing both sides by (1 − λ), we get

Q^{α^t}_{B^t}(∆α^t) ≤ −(c_2 λ/2)‖∆α^t‖².

Letting λ → 1⁻ and using ∆_t = Q^{α^t}_{B^t}(∆α^t) − (1/2)(∆α^t)^T B^t ∆α^t ≤ Q^{α^t}_{B^t}(∆α^t) − (c_1/2)‖∆α^t‖², we obtain (23). Note that in the last inequality of (23), equality holds if and only if α^t is the optimum of (2). Therefore (23) indicates that ∆α^t is a descent direction when the current iterate is not optimal, because from the Taylor expansion of the smooth term we have

f(α^t + η∆α^t) − f(α^t) ≤ η∆_t + o(η),

which implies that when η is small enough and positive, f(α^t + η∆α^t) will be smaller than f(α^t).
Regarding the backtracking termination, from the (1/σ)-Lipschitz continuity of ∇g* and the convexity of ξ*, we have that for any η ∈ [0, 1] and any vector d,

f(α^t + ηd) − f(α^t) ≤ η(∇G*(α^t)^T d + ξ*(−α^t − d) − ξ*(−α^t)) + (η²/(2σ))‖Xd‖².   (25)

Taking d = ∆α^t in (25), we see that it suffices to have η ∈ [0, 1] such that

η∆_t + (η²/(2σ))‖X∆α^t‖² ≤ ητ∆_t   (26)

to satisfy the Armijo rule (9). We directly see from (23) that (26) holds for any η ∈ [0, 1] satisfying

η ≤ (1 − τ)σ(c_1 + c_2)/‖X^T X‖.   (27)

Therefore, under Algorithm 1, the generated step size is guaranteed to satisfy (24).

We will then establish convergence in the dual, and use this result to deduce the convergence in (1) of the primal iterates obtained by (6), which is the convergence we really care about for solving ERM problems. We assume that in addition to Assumption 2.1, either of the following assumptions holds.
Assumption 4.2 The function ξ is differentiable and its gradient is ρ-Lipschitz continuous for some ρ > 0.

Assumption 4.3 The function ξ is L-Lipschitz continuous for some L > 0.
These assumptions made on the sum ξ are less strict than demanding each ξ i to satisfy certain properties.
We note that the conditions in Lemma 4.1 are weaker than those of most proximal methods such as Tseng and Yun (2009); Lee et al. (2014), as we do not need B^t to be positive definite: when Assumption 4.2 holds, Q^{α^t}_{B^t} is still strongly convex even for such B^t. In this situation we have a broader choice of B^t.
We first show that Assumption 4.2 also implies that the following quadratic growth condition holds for the dual problem:

f(α) − f* ≥ (µ/2) dist(α, A*)², ∀α ∈ Ω,   (29)

where f* is the optimal objective of (2), A* is its solution set, and dist(α, A*) denotes the distance from α to A*.
Lemma 4.4 If f is µ-strongly convex as defined in (4), then it satisfies (29) with the same µ.
To prove the convergence of the whole algorithm, we need the following definition in our convergence discussion.
Definition 4.5 Given any optimization problem min_{x ∈ X} f(x) whose minimum is attainable and denoted by f*, we say that x ∈ X is an ε-accurate solution if f(x) ≤ f* + ε.

Now we are ready to show the global linear convergence of our algorithm for solving (2).
Theorem 4.6 If (29) holds with µ > 0 for the objective of (2), there exists c_3 such that ‖B^t‖ ≤ c_3 for all t, and the conditions in Lemma 4.1 are also satisfied at all iterations for some c_1, c_2, then the iterates generated by Algorithm 2 converge Q-linearly, and obtaining an ε-accurate solution for (2) takes at most O(log(1/ε)) iterations.

Proof From (7), (23) and (9), we have that

f(α^{t+1}) − f(α^t) ≤ η^t τ∆_t ≤ −η^t τ((c_1 + c_2)/2)‖∆α^t‖².   (32)

From the optimality of ∆α^t in (8), we get that

∇G*(α^t) + B^t ∆α^t − s^{t+1} = 0   (33)

for some s^{t+1} ∈ ∂ξ*(−α^t − ∆α^t). By convexity, the fact that the step size lies in [0, 1] from Lemma 4.1, and condition (29), we obtain a bound (34) on the distance to the optimum. Now, to relate the first term to the decrease, we use (33) to obtain (35), where in the second inequality we used (a + b)² ≤ 2(a² + b²) for all a, b, and in the last inequality we used the Lipschitz continuity of ∇G*. We therefore get (36) by combining (34), (35), and (32).
Defining c_4 accordingly and rearranging (36) gives (37). Combining this result with the lower bound on η^t from (24) proves the global Q-linear convergence, and the number of iterations taken to get an ε-accurate solution can be directly deduced from (37).
From Lemma 4.4, both Assumption 4.2 and Assumption 4.3 lead to problems satisfying the conditions required in Theorem 4.6. We note that in the final expression (37), a larger step size suggests faster convergence for a fixed choice of B^t. Therefore, it makes sense to conduct a line search to try to find a larger η^t instead of using a fixed and conservative one that merely ensures (9) is satisfied, while c_4 can be tuned through the choice of B^t. However, we note that these are just worst-case iteration complexities, and empirically we often observe much faster convergence.
The next two results link the above linear convergence result for the dual problem to the convergence rate of the primal problem (1).
Theorem 4.7 Let α be an ε-accurate solution for the dual problem (2). If Assumption 4.2 holds, then w(α) is an (ε(σ + ρ‖X‖²)/σ)-accurate solution for the primal problem (1). If Assumption 4.3 holds, then w(α) is an O(√ε)-accurate solution for (1).

Proof Our proof consists of using α as the initial point, applying one step of some primal-dual algorithm, and then utilizing the algorithm-specific relation between the decrease in one iteration and the duality gap to obtain the bound. Finally, we notice that the decrease of the dual objective in one iteration of any algorithm is upper bounded by the distance from the current objective to the optimum, and that the primal sub-optimality is upper bounded by the duality gap. Therefore we obtain an algorithm-independent result from algorithm-specific results.
When Assumption 4.2 holds, the primal problem is of the type considered in Shalev-Shwartz and Zhang (2012), and we have that ξ* is (1/ρ)-strongly convex. If we take α as the initial point and apply one step of their method to obtain the next iterate α⁺, then from (Shalev-Shwartz and Zhang, 2012, Lemma 1) we obtain a bound (38) involving the optimal solution w* of (1) and u_i ∈ −∂ξ_i(X_i^T w(α)), and we set the algorithm parameters so as to remove the second term in (38). Note that in (Shalev-Shwartz and Zhang, 2012, Lemma 1), the result is for the expected dual objective decrease at the current iteration and the expected duality gap of the previous iterations. However, for the initial point, the expected duality gap is a constant, and the expected function decrease cannot exceed the distance from the current objective to the optimum. When Assumption 4.3 holds, the primal problem falls into the type of problems discussed in Bach (2015). If we take α as the initial point and apply one step of their method to obtain the next iterate α⁺, then from the last inequality in the proof of Proposition 4.2 in Bach (2015) and weak duality we obtain (39). In the last equality we used (Rockafellar, 1970, Corollary 13.3.3), which implies that if φ(·) is L-Lipschitz continuous, then the radius of dom(φ*) is no larger than L. The right-hand side of (39) is concave with respect to s; hence we can obtain its maximum by setting the partial derivative with respect to s to zero. Denoting the maximizer by ŝ, this gives (40) when ŝ < 1; on the other hand, if ŝ = 1, we get the corresponding bound directly. These conditions and (40) indicate the claimed result.

Corollary 4.8 If we apply Algorithm 2 to solve a regularized ERM problem that satisfies either Assumption 4.2 or Assumption 4.3, then the primal iterates w^t obtained from the dual iterates α^t via (6) converge R-linearly.
Proof If Assumption 4.2 holds, Theorem 4.6 shows that the dual objective converges Q-linearly. Therefore, given any ε > 0, from Theorem 4.7, it suffices to have an (εσ/(σ + ρ‖X‖²))-accurate dual solution to obtain an ε-accurate primal solution, which takes O(log(1/ε)) iterations, showing R-linear convergence of the primal iterates. Note that here we omit all factors independent of ε to show the linear rate with respect to ε, while the exact number of iterations needed can be calculated by combining Theorems 4.6 and 4.7.
On the other hand, if Assumption 4.3 holds, we assume ε < 4‖X‖²L/σ, for otherwise it takes no iterations to get an ε-accurate solution for the primal problem. In this case, we need an O(ε²)-accurate dual solution to make the corresponding primal solution ε-accurate; thus by Theorem 4.6, we need O(log(1/ε²)) = O(log(1/ε)) iterations. We note that the results in Theorem 4.7 and Corollary 4.8 are implied by existing works, and the calculations take very little effort. It is also not too difficult to obtain sublinear rates following similar techniques for problems that do not satisfy (29), but we omit them to simplify and focus our description.

Related Works
Our algorithm can be viewed from two different aspects. If we simply consider solving (8), then our method is similar to proximal quasi-Newton methods, with a specific choice of the second-order approximation. A generalization of these appears as the block coordinate descent method (Tseng and Yun, 2009), of which the proximal quasi-Newton method is the special case with only one block of variables. One thing worth noticing is that Tseng and Yun (2009) require the matrix in (8) to be positive definite, with a lower bound on the smallest eigenvalue over all iterations. We relax this condition to allow B_t to be indefinite or positive semidefinite when the ξ* term is strongly convex. This relaxation is used when Assumption 4.2 holds, in which case we do not need to add a damping term to our second-order approximation; namely, we can set a_2^t ≡ 0 in (15). On the other hand, our focus is on how to devise a good approximation of the Hessian matrix of the smooth term that works efficiently for distributed optimization. Works in this direction for dual ERM problems include Pechyony et al. (2011); Yang (2013); Ma et al. (2017). Pechyony et al. (2011) discuss how to solve the SVM dual problem distributedly. This problem is a special case of (2); see Section 6.1 for more details. They proposed a method called DSVM-AVE that iteratively solves (8), with B_t defined in (15) with a_1^t ≡ 1, a_2^t ≡ 0, to obtain the update direction, while the step size η_t is fixed to 1/K. Though they did not provide a theoretical convergence guarantee in Pechyony et al. (2011), we can see the reasoning behind this choice from (25). First, one can easily see that

with equality holding when all X_i are identical and d is a multiple of the vector of ones. Therefore, taking λ = 1/K in (25) and plugging in the bound in (41), we can see that minimizing (8) with a step size of 1/K leads to a decrease of the objective value.
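To make the block-diagonal idea concrete, the following minimal numpy sketch forms only the diagonal blocks of the Hessian of the smooth dual term, assuming g = ½‖·‖² so that this Hessian is XᵀX (as in the SVM dual); the scaling a_1^t and damping a_2^t terms of (15) are omitted, and all names are illustrative.

```python
import numpy as np

def block_diagonal_hessian(X_blocks):
    """Keep only the diagonal blocks X_k^T X_k of the Hessian X^T X;
    the discarded off-diagonal blocks are the cross-machine terms."""
    sizes = [Xk.shape[1] for Xk in X_blocks]
    B = np.zeros((sum(sizes), sum(sizes)))
    off = 0
    for Xk, sz in zip(X_blocks, sizes):
        B[off:off + sz, off:off + sz] = Xk.T @ Xk
        off += sz
    return B

rng = np.random.default_rng(0)
X_blocks = [rng.standard_normal((5, 3)) for _ in range(2)]  # K = 2 machines
H = np.hstack(X_blocks).T @ np.hstack(X_blocks)             # full dual Hessian
B = block_diagonal_hessian(X_blocks)
```

Each machine can form its own diagonal block from local data only, which is exactly why this approximation avoids communication.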
In Yang (2013), the algorithm DisDCA is proposed to solve (2) under the same assumption that g is strongly convex. They consider the case c_i ≡ 1 for all i, but the algorithm can be directly generalized to c_i > 1. This method uses a specific algorithm, the stochastic dual coordinate descent (SDCA) method (Shalev-Shwartz and Zhang, 2016), to solve the local sub-problems, while B_t is picked according to the algorithm parameters. To solve the sub-problem on machine k, each time SDCA samples one entry i_k from J_k with replacement and minimizes the local objective with respect to α_{i_k}. If each time machine k samples m_k entries, and we denote

then the first variant of DisDCA, called the basic variant in Yang (2013), picks the entries of B_t in (8) to be

when x_i and x_j are from the same X_k for some k that is sampled, and 0 otherwise, and the step size is fixed to 1. In this case, it is equivalent to splitting the data into l blocks and conducting the minimization only with respect to the sampled blocks. If we let I be the set of indices not sampled, then following the same reasoning as (41), we have

where equality holds when all X_i are identical and |I| = l − m. Therefore, by combining (43) and (25), it is not hard to see that minimizing Q_{B_t}^α in this case directly results in a certain amount of function value decrease. The analysis in Yang (2013) then shows that the primal iterates {w_t}, obtained by substituting the dual iterates {α_t} into (6), converge linearly to the optimum when all ξ_i have Lipschitz continuous gradients, and converge with a rate of O(1/ε) when all ξ_i are Lipschitz continuous, using proof techniques similar to those in Shalev-Shwartz and Zhang (2012); as we noted in Section 4, this is actually the same as showing the decrease of the dual objective and then relating it to the primal objective. The key difference to our analysis is that they do not assume the additional structure (29) of the dual problem when ξ* is not strongly convex, ergo the sublinear rate.
The second choice of B_t in Yang (2013), called the practical variant, considers

and takes unit step sizes. Similar to our discussion above for their basic variant, Q_{B_t}^α in this case is also an upper bound of the function value decrease when the step size is fixed to 1, and we can expect this method to work better than the basic variant, as the approximation is closer to the real Hessian while the scaling factor is closer to one. Empirical results show that this variant is, as expected, faster than the basic variant, despite the lack of a theoretical convergence guarantee for it in Yang (2013).
Both DSVM-AVE and the practical variant of DisDCA are generalized by a framework proposed in Ma et al. (2017) that discusses the relation between the second-order approximation and the step size. In particular, their theoretical analysis for a fixed step size starts from (25) and (41). They considered solving (8) with B_t defined as

and showed that for a ∈ [1, K], a step size of a/K is enough to ensure convergence of the dual objective. As we discussed above for Pechyony et al. (2011) and Yang (2013), this choice can be proven to ensure objective value decrease from (25). Unlike DisDCA, which is tied to SDCA, their framework allows an arbitrary solver for the local sub-problems, and relates how precisely the local sub-problems are solved to the convergence rate. If we ignore the part on local precision, which can be derived from the solution precision of the right-hand side of (25), the convergence rates of their framework shown in Ma et al. (2017) are similar to those of Yang (2013) for the basic variant of DisDCA.
This work therefore provides theoretical convergence rates for both DSVM-AVE and the practical variant of DisDCA that are similar to the basic variant of DisDCA, and their experimental results on the local solver and the step size choices suggest that the practical variant of DisDCA is indeed the most efficient one.
When we take the same B_t as these methods in (8), the major difference is that we do not take a pre-specified safe step size. Instead, we dynamically find a possibly larger step size that still guarantees function decrease. We can see that the choice of a = 1 in (44) gives a step size that is too conservative, while the choice of a = K might make the quadratic approximation in (8) deviate too far from the real Hessian. In particular, the case a = 1 makes the Frobenius norm of the difference between ∇²G* and B_t the smallest, while other choices increase this value. This suggests that a = 1 should be the best approximation one can get, but even directly using ∇²G* might not guarantee a decrease of the objective value. Our method thus provides a way to deal with this problem by adding a low-cost line search step. Moreover, by adding an assumption that holds true for most ERM losses, we are able to show linear convergence for a broader class of problems.
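The low-cost line search step can be realized, for instance, by a standard Armijo-style backtracking scheme. The sketch below is generic and not the exact condition of Algorithm 1; the parameters β and σ are illustrative.

```python
def backtracking_step_size(f, alpha, d, model_decrease,
                           beta=0.5, sigma=0.1, max_backtracks=30):
    """Shrink eta geometrically until the actual objective change is at
    most sigma * eta * model_decrease (an Armijo-type condition);
    `model_decrease` is the (negative) decrease predicted by the model,
    e.g. the gradient-direction inner product."""
    f0 = f(alpha)
    eta = 1.0
    for _ in range(max_backtracks):
        if f(alpha + eta * d) <= f0 + sigma * eta * model_decrease:
            return eta
        eta *= beta
    return eta

# toy quadratic f(x) = x^2 from x = 1 along d = -2 (model_decrease = g*d = -4)
eta = backtracking_step_size(lambda x: x * x, 1.0, -2.0, -4.0)
```

Since only objective evaluations are needed, each trial step is cheap when the auxiliary vectors (e.g. Xα and Xd) are maintained.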
Most other practical distributed ERM solvers directly optimize (1). Primal batch solvers for ERM that require computing the full gradient or Hessian-vector products are inherently parallelizable and can be easily extended to distributed environments, because the main computations are matrix-vector products like Xw; it mostly takes some implementation modification to make these approaches efficient in distributed environments, so the main algorithmic framework remains the same. Among them, it has been shown that distributed truncated Newton (Zhuang et al., 2015; Lin et al., 2014), distributed limited-memory BFGS (Chen et al., 2014), and the limited-memory common-directions method (Lee et al., 2017) are the most efficient in practice. These methods have the advantage that their convergence is invariant to the data partition, but they require the additional assumption that the primal objective is differentiable or even twice-differentiable. However, there are important ERM problems whose primal objective is not differentiable. In these cases, one still needs to consider dual approaches, for otherwise the convergence might be slower. Another popular algorithm for distributedly solving (1) without requiring differentiability is the alternating direction method of multipliers (ADMM) (Boyd et al., 2011), which is widely used in consensus problems. However, ADMM is known to converge slowly in practice, its convergence also relies on the parameter weighting the penalty term in the augmented Lagrangian, and there is no known way to tune this parameter easily. It has been shown in Yang (2013) that DisDCA outperforms ADMM on SVM problems, while Zhuang et al. (2015) showed that distributed truncated Newton is faster than ADMM on logistic regression problems.
Recently, there have also been many methods that focus on communication efficiency. By adding more computation per iteration, they are able to use fewer communication rounds to achieve the same objective value. However, these approaches either rely on the stronger assumption that data points across machines are independent and identically distributed, which may not hold in practice because each server might gather data from a specific region, or have a higher computational dependency on the dimensionality of the problem, resulting in communication-efficient but computationally inefficient and thus impractical methods. We therefore exclude these methods from our discussion.

Applications
In this section, we apply the algorithm proposed in Section 2 to solve various regularized ERM problems, and discuss some techniques that utilize the problem structures to improve efficiency. We will show the empirical performance of these algorithms in Section 7.
For what follows, we first show that a class of problems satisfies the condition (29).
Lemma 6.1 Given any initial point α_0, a problem of the following form

where g is strongly convex, satisfies the condition (29). If the constraint set is a polytope, then the condition (29) holds for any feasible α.
Proof From (Wang and Lin, 2014, Theorem 4.6), problem (45) satisfies an error-bound condition with a parameter µ_{α_0} that is continuous with respect to α_0 within the level set, and (Bolte et al., 2015, Theorem 5) shows that this condition is equivalent to (29) with µ_{α_0}. This proves the first part of the lemma. When the constraint set is a polytope, the feasible region is a compact set. Therefore, we can take the maximum of µ_{α_0} over this compact set; letting this value be µ, (29) holds with the same µ for all feasible α.

Binary Classification and Regression
The first case we consider is the support vector machine (SVM) problem proposed by Boser et al. (1992); Vapnik (1995), such that c_i ≡ 1 and, given C > 0,

where y_i ∈ {−1, 1} for all i. Obviously, Assumption 2.1 holds with σ = 1 in this case. After some simple calculation, we have that

where I is the characteristic function of convex sets. It is clear that the ξ_i are C-Lipschitz continuous. By using results in Wang and Lin (2014), one can show that (29) is satisfied. We can also replace the hinge loss in SVM with the squared-hinge loss:

and then ξ_i becomes differentiable with Lipschitz continuous gradient, so Assumption 4.2 is satisfied. We have that

One can observe that the dual objectives of both the hinge-loss and squared-hinge-loss SVMs are quadratic functions, hence we can apply the exact line search approach in (21) at very low cost by utilizing the ∆v and v vectors. Moreover, since the dual problems are box-constrained, it is clear that both cases are of the form described in (45), so our analysis of linear convergence applies.
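Since the dual objective is quadratic along any direction, the exact line search in (21) reduces to minimizing a one-dimensional quadratic. The sketch below assumes the inner products gᵀd and dᵀQd are given (in an implementation they can be assembled from the maintained v = Xα and ∆v = Xd vectors), and leaves box-constraint feasibility to be handled separately.

```python
import numpy as np

def exact_line_search_quadratic(g_dot_d, d_Qd, eta_max=1.0):
    """Minimize f(alpha + eta d) = f(alpha) + eta g^T d + 0.5 eta^2 d^T Q d
    over eta in (0, eta_max], where Q is the Hessian of the quadratic dual."""
    if d_Qd <= 0.0:            # degenerate: objective linear or concave along d
        return eta_max
    return min(eta_max, -g_dot_d / d_Qd)

# toy check on f(alpha) = 0.5 alpha^T alpha from alpha = (1, 1) along d = -alpha
Q = np.eye(2)
alpha = np.ones(2)
d = -alpha
eta = exact_line_search_quadratic((Q @ alpha) @ d, d @ Q @ d)
```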
Another widely used classification model is logistic regression, where

Table 1: Summary of popular ERM problems for binary classification (the range of y is {1, −1}) and for regression (the range of y is R) to which our approach is applicable. Our approach also applies to the extensions of these methods for multi-class classification and structured prediction.
It can then be shown that the logistic loss is infinitely differentiable and its gradient is Lipschitz continuous, so Assumption 4.2 is satisfied. An analogue of SVM for regression is support vector regression (SVR) by Boser et al. (1992); Vapnik (1995), such that given C > 0 and ε ≥ 0,

with y_i ∈ R for all i. Similar to the case of SVM, the first case satisfies Assumption 4.3 (see footnote 2) and the latter satisfies Assumption 4.2. Note that the degenerate case ε = 0 corresponds to the absolute deviation loss and the least-squares loss, respectively. In the case of the least-squares loss, we can again use the exact line search approach because the objective is a quadratic function.
Note that one can also replace g with other strongly convex functions, but then it is possible that (29) is not satisfied. In this case, one can establish sublinear convergence rates by applying techniques similar to those in our analysis, but we omit these results to keep the description straightforward.
The various ξ_i discussed in this section are summarized in Table 1.
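For reference, the losses of Table 1 can be written directly as functions of the model output z = wᵀx. The exact scaling constants (and the default ε below) are illustrative and may differ from the formulation in the text.

```python
import math

# Binary-classification and regression losses xi(z; y) with z = w^T x.
def hinge(z, y):                    return max(0.0, 1.0 - y * z)
def squared_hinge(z, y):            return max(0.0, 1.0 - y * z) ** 2
def logistic(z, y):                 return math.log1p(math.exp(-y * z))
def eps_insensitive(z, y, eps=0.1): return max(0.0, abs(z - y) - eps)
def least_squares(z, y):            return (z - y) ** 2
```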

Multi-class Classification
For multi-class classification models, we assume without loss of generality that c_i ≡ T for some T > 1 and y_i ∈ {1, . . ., T} for all i. The first model we consider is the multi-class SVM proposed by Crammer and Singer (2002). Given an original feature vector x_i ∈ R^ñ, the data matrix X_i is defined as (I_T − e_{y_i} 1^T) ⊗ x_i, where I_T is the T × T identity matrix, e_i is the unit vector of the i-th coordinate, 1 is the vector of ones, and ⊗ denotes the Kronecker product. We then get that n = T ñ, and the multi-class SVM model uses

2. Up to a reformulation of the dual problem by setting α = α⁺ − α⁻ with α⁺, α⁻ ≥ 0.
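The construction of X_i can be sketched with numpy's Kronecker product; labels are 0-based here purely for indexing convenience.

```python
import numpy as np

def multiclass_feature_matrix(x, y, T):
    """X_i = (I_T - e_y 1^T) kron x for the Crammer-Singer multi-class SVM;
    x is the original feature vector, y the (0-based) label."""
    M = np.eye(T) - np.outer(np.eye(T)[y], np.ones(T))  # columns e_j - e_y
    return np.kron(M, x.reshape(-1, 1))                 # shape (T * len(x), T)

x = np.array([1.0, 2.0])
X = multiclass_feature_matrix(x, y=0, T=3)
```

One can check that X_iᵀw stacks the per-class score differences wⱼᵀx − w_{y_i}ᵀx, which is exactly what the multi-class loss consumes; in particular the column for the true label is identically zero.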
At first glance, this ξ does not even seem to be Lipschitz continuous. However, its dual formulation is

showing the boundedness of the dual variables α. Thus the primal variable w(α) = Xα also lies in a bounded set. Therefore, ξ_i(X_i^T w(α)) has a bounded domain, indicating that we can find L ≥ 0 such that this continuous function is Lipschitz continuous within this domain. Moreover, the formulation (48) is of the form in (45), so (29) holds. Therefore, Assumption 4.3 is satisfied. Note that in this case the objective of (48) is a quadratic function, so once again we can apply the exact line search method for this problem.
In analogy to the binary case, one can also consider the squared-hinge loss for multi-class SVM (Lee and Lin, 2013).
The key difference to the binary case is that the squared-hinge-loss version of multi-class SVM does not possess a differentiable objective function, so we need a similar argument as above to establish the Lipschitzness of ξ. The dual formulation from the derivation of Lee and Lin (2013) is

subject to α_i^T 1 = 0, i = 1, . . ., l, (α_i)_j ≤ 0, ∀j ≠ y_i, i = 1, . . ., l,

suggesting that each coordinate of α is bounded on only one side, so α does not explicitly lie in a compact set. However, from the constraints and the objective, we can see that given any initial point α_0, the level set {α | f(α) ≤ f(α_0)} is compact. Because our algorithm is a descent method, throughout the optimization process all iterates α implicitly lie within this compact set. This again indicates that w(α) and X_i^T w(α) are within a bounded set, proving the Lipschitzness of the ξ_i when our algorithm is applied. The condition (29) is also satisfied, following the same argument as in the hinge-loss case. Therefore, Assumption 4.3 still holds, and it is clear that we can use the exact line search method for this problem as well.
We can also extend the logistic regression model to the multi-class scenario. The loss function, usually termed multinomial logistic regression or maximum entropy, is defined as
It is not hard to see that Assumption 4.2 holds for this problem. For more details of its dual problem and an efficient local sub-problem solver, interested readers are referred to Yu et al. (2011).

Structured Prediction Models
In many real-world applications, the decision process involves making multiple predictions over a set of interdependent output variables, whose mutual dependencies can be modeled as a structured object such as a linear chain, a tree, or a graph. As an example, consider recognizing a handwritten word, where the characters are output variables that together form a sequence structure. It is important to consider the correlations between the predictions of adjacent characters to aid the individual predictions. The family of models designed for such problems are called structured prediction models. In the following, we discuss how to apply Algorithm 2 to solve the structured SVM (SSVM), a popular structured prediction model. Different from binary and multi-class classification, the output in a structured prediction problem is a set of output variables y_i ∈ Y_i, where Y_i is the set of all feasible output structures. The sizes of the input and output variables often differ from one instance to another. For example, in the handwriting recognition example, each element in y represents a character and Y is the set of all possible words. Depending on the number of characters in a word, the sizes of the input and output vary.
Given a set of observations, the SSVM problem is

where C > 0 is a predefined parameter, φ(y, y_i, x_i) = Φ(x_i, y_i) − Φ(x_i, y), and Φ(x, y) is a feature vector depending on both the input x and the output structure y. By defining features depending on the output, one can encode the output structure into the model and learn parameters to model the correlation between output variables. The constraints in problem (50) specify that the score assigned to the correct output structure should be higher than the score of any other structure y by a margin determined by a predefined metric ∆(y, y_i) ≥ 0 that represents the distance between output structures. If the constraints are not satisfied, a penalty term ψ_i is introduced into the objective function, where ℓ(ψ) defines the loss term. Similar to the binary and multi-class classification cases, common choices of the loss function are the L2 loss, ℓ(z) = max(0, z)², and the L1 loss, ℓ(z) = max(0, z). The SSVM problem in (50) fits into our framework; depending on the definition of the features, one can define X_i to encode the output y. One example is to set every column of X_i as a vector of the form φ(y, y_i, x_i) with a different y ∈ Y_i, so that c_i = |Y_i|, and let ξ_i(X_i^T w) in problem (1) be

Here, we use the order in which y appears in the columns of X_i as the enumeration order for the coordinates of z.
We consider solving problem (50) in its dual form (Tsochantaridis et al., 2005). One can clearly see the similarity between (51) and the multi-class losses (47)-(49), where the major difference is that the value 1 in the multi-class losses is replaced by the value of ∆(y_i, y). It can thus be expected that the dual problem of SSVM is similar to that of multi-class SVM. With the L1 loss, the dual problem of (50) can be written as

(α_i)_y ≤ 0, ∀y ≠ y_i, i = 1, . . ., l.

And with the L2 loss, the dual of (50) is

(α_i)_y ≤ 0, ∀y ≠ y_i, i = 1, . . ., l.

As the dual forms are almost identical to those shown in Section 6.2, all the analysis and discussion there can be directly applied here.
The key challenge in solving problems (52) and (53) is that for most applications, the size of Y_i, and thus the dimension of α, is exponentially large (with respect to the length of x_i), so optimizing over all variables is unrealistic. Efficient dual methods (Tsochantaridis et al., 2005; Lacoste-Julien et al., 2013; Chang and Yih, 2013) maintain a small working set of dual variables to optimize, with the remaining variables fixed to zero. These methods then iteratively enlarge the working set until the problem is well-optimized.³ The working set is selected using the sub-gradient of (50) with respect to the current iterate. Specifically, for each training instance x_i, we add to the working set the dual variable α_{i,ŷ} corresponding to the structure

ŷ = arg max_{y∈Y_i} w^T φ(y, y_i, x_i) − ∆(y_i, y). (54)

Once α is updated, we update w accordingly. We call the step of computing (54) "inference", and the part of optimizing (52) or (53) over a fixed working set "learning". When training SSVM distributedly, the learning step involves communication across machines; therefore, the inference and learning steps are both expensive. SSVM is an extension of multi-class SVM to the structured prediction case. Similarly, conditional random fields (CRF) (Lafferty et al., 2001) extend multinomial logistic regression. The loss function in CRF is defined as the negative log-likelihood:
Similar to multinomial logistic regression, Assumption 4.2 holds for this loss function.
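The alternating inference/learning procedure for SSVM described above can be sketched schematically as follows; `inference` and `learn` are hypothetical placeholders for computing (54) and for optimizing (52)/(53) over the fixed working set, respectively.

```python
def working_set_ssvm(instances, inference, learn, rounds=10):
    """Schematic working-set loop: alternate enlarging the working set
    via inference with optimizing over the (fixed) working set."""
    W = set()            # working set of (instance, structure) pairs
    w = None             # model before the first learning step
    for _ in range(rounds):
        for i, (x, y) in enumerate(instances):
            W.add((i, inference(w, x, y)))  # "inference": most violated y
        w = learn(W)                        # "learning" over fixed W
    return w

# dummy callables just to exercise the control flow
insts = [((1.0,), "a"), ((2.0,), "b")]
w = working_set_ssvm(insts, lambda w, x, y: "yhat", lambda W: len(W), rounds=2)
```

In the distributed setting, only the `learn` step requires communication, which is why the ratio of inference to learning rounds matters in the experiments of Section 7.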

Experiments
We conduct experiments on different ERM problems to examine the efficiency of various realizations of our framework. The problems range from simple binary classification and regression (i.e., c_i ≡ 1) to problems with complex output structures (i.e., the c_i differ across instances), and from problems where exact line search can be conducted to those where backtracking using Algorithm 1 is required. For each problem, we compare our method with the state of the art for that specific problem. For the case c_i ≡ 1, we consider two linear classification tasks. For the situation of larger c_i, we take L2-regularized SSVM as the exemplifying application.

Binary Linear Classification
For binary linear classification, existing studies for the single-machine case show that solving the dual problem is more favorable when the feature dimension is larger than the number of instances, because the number of variables is smaller. We therefore consider data sets with l < n, shown in Table 2, for this task. We consider both linear SVM and L2-regularized logistic regression, discussed in Section 6.1. The comparison criterion is the relative primal or dual objective distance to the optimum, respectively defined as

where f* is the optimum, obtained approximately by running our algorithm with a tight stopping condition. Note that the optima of the dual and the primal problems are identical by strong duality. We examine the relation between these values and the running time. We fix C = 1 in this experiment. The distributed environment is a local cluster with 16 nodes running MPI.
We compare the methods below whenever they are applicable for the problem.
• DisDCA (Yang, 2013): we consider the practical variant because it is shown to be faster than the basic variant. Moreover, experimental results in Ma et al. (2017) showed that this algorithm is faster than DSVM-AVE, and that the best local solver for the per-machine sub-problem is indeed the coordinate descent method used in Yang (2013).
• BDA: the Block-Diagonal Approximation method proposed in this paper.For the dual SVM problem, we utilize the fact that it has a quadratic objective to conduct exact line search efficiently.For logistic regression, backtracking line search is used.
• L-CommDir (Lee et al., 2017): a state-of-the-art distributed primal ERM solver that has been shown to empirically outperform existing distributed primal approaches. This algorithm requires differentiability of the primal objective, so we only apply it to the squared-hinge-loss SVM problem and logistic regression. We use the implementation in the package MPI-LIBLINEAR 2.1.⁴

• TRON (Zhuang et al., 2015): a distributed implementation of the trust-region truncated Newton method for unconstrained optimization proposed by Steihaug (1983). This algorithm also requires differentiability of the primal objective, so it is likewise applied only to the squared-hinge-loss SVM and logistic regression problems. The implementation of this method is also from MPI-LIBLINEAR 2.1.
We implement both DisDCA and BDA in C++, and the local solver considered is random permutation cyclic coordinate descent (RPCD) for the dual SVM (Hsieh et al., 2008) and dual logistic regression (Yu et al., 2011) problems. For the choice of B_t in (15) for our method, we use a_1^t ≡ 1 for all problems, as it is the closest approximation of the real Hessian, while a_2^t is set to 10⁻³ for the hinge-loss SVM problem and 0 for the other two. The other two algorithms are also implemented in C++. At every iteration, we run one epoch of RPCD on each machine, namely one pass through the whole local data set, before communication. For L-CommDir, we follow the experimental setting in Lee et al. (2017) and use historical information from the previous five iterations.
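As an illustration of the local solver, the following is a minimal single-machine RPCD sketch for the hinge-loss SVM dual in the style of Hsieh et al. (2008); shrinking and the distributed bookkeeping are omitted.

```python
import numpy as np

def rpcd_svm_dual(X, y, C=1.0, epochs=10, seed=0):
    """Random-permutation cyclic coordinate descent for
        min_alpha 0.5 ||w(alpha)||^2 - sum_i alpha_i,
        w(alpha) = sum_i alpha_i y_i x_i,  0 <= alpha_i <= C."""
    rng = np.random.default_rng(seed)
    l, _ = X.shape
    alpha = np.zeros(l)
    w = np.zeros(X.shape[1])              # maintained primal vector
    Qii = (X ** 2).sum(axis=1)            # diagonal entries y_i^2 x_i^T x_i
    for _ in range(epochs):
        for i in rng.permutation(l):      # fresh permutation every epoch
            G = y[i] * (X[i] @ w) - 1.0   # partial gradient w.r.t. alpha_i
            new = min(max(alpha[i] - G / Qii[i], 0.0), C)
            w += (new - alpha[i]) * y[i] * X[i]
            alpha[i] = new
    return alpha, w

# toy separable problem: two opposite points
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha, w = rpcd_svm_dual(X, y, C=1.0, epochs=20)
```

Maintaining w incrementally makes each coordinate update O(n) rather than O(ln), which is what makes one epoch of RPCD per communication round cheap.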
The comparison of the dual objective is shown in Figure 1. We can see that our approach is always better than the state of the art for the dual problem. The difference is more significant on the SVM problems, showing that exact line search has an advantage over backtracking, especially because its cost is low, while backtracking is still better than the fixed step-size scheme. The reason is that although the approach of DisDCA provides a safe upper-bound model of the objective difference, so that the local updates can be directly applied while ensuring objective decrease, this upper bound might be too conservative, and more aggressive upper-bound models might be computationally impractical to obtain. On the other hand, our approach provides an efficient way to estimate how aggressive the current update can be, according to the current iterate. Therefore, the objective can decrease faster, as the update is more aggressive but still safe enough to guarantee sufficient objective value decrease. Investigating the step sizes at different iterations of the line search approaches shows that the step size is not fixed throughout the optimization procedure, indicating that a fixed step-size scheme might not be an ideal choice.

4. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/
The comparison of the primal objective is provided in Figure 2. Note that the step-like behavior of BDA comes from using the best primal objective obtained so far, as discussed in Section 3.4. For hinge-loss SVM, our method is always the best; note that only dual approaches are feasible for the hinge loss, as it is not differentiable. For squared-hinge-loss SVM, in which case exact line search for the dual problem can still be conducted, our method outperforms all approaches, regardless of whether the primal or the dual problem is solved. The dual problem of logistic regression is not a simple quadratic one, so we cannot easily implement exact line search and need to resort to the backtracking approach. We can see that for this problem, L-CommDir has an advantage in the later stage of optimization, while our method and DisDCA are competitive in the early stage, which is usually enough for linear classification tasks. In most cases, TRON is the slowest method.

Structured Learning
We perform experiments on two benchmark tasks for structured prediction: part-of-speech tagging (POS) and dependency parsing (DEP). For both tasks, we use the Wall Street Journal portion of the Penn Treebank (Marcus et al., 1993) with the standard split for training (sections 02-21), development (section 22), and test (section 23). POS is a sequential labeling task, where we aim at learning the part-of-speech tag assigned to each word in a sentence. Each tag assignment (there are 45 possible tags) depends on the associated word, the surrounding words, and their part-of-speech tags. The inference in POS is solved by the Viterbi algorithm (Viterbi, 1967). We evaluate the model by per-word tag accuracy. For DEP, the goal is to learn, for each sentence, a tree structure describing the syntactic dependencies between words. We use the graph-based parsing formulation and the features described in McDonald et al. (2005), where we find the highest-scoring parse using the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967). We evaluate the parsing accuracy by the unlabeled attachment score, which measures the fraction of words with correctly assigned parents.
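For concreteness, the Viterbi inference used in POS can be sketched as a max-sum dynamic program over per-position tag scores and tag-transition scores; the score matrices below are illustrative toys.

```python
import numpy as np

def viterbi(emit, trans):
    """Return the highest-scoring tag sequence.
    emit:  (seq_len, n_tags) per-position tag scores,
    trans: (n_tags, n_tags) transition scores trans[prev, cur]."""
    L, T = emit.shape
    score = emit[0].copy()
    back = np.zeros((L, T), dtype=int)
    for t in range(1, L):
        cand = score[:, None] + trans + emit[t][None, :]  # (prev, cur)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(L - 1, 0, -1):      # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# three positions, two tags; zero transitions reduce to per-position argmax
emit = np.array([[2.0, 0.0], [0.0, 1.0], [3.0, 0.0]])
trans = np.zeros((2, 2))
```

With nonzero transition scores the decoded sequence can differ from the per-position argmax, which is exactly the point of structured inference.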
We compare the following algorithms using eight nodes in a local cluster.All algorithms are implemented in JAVA, and the distributed platform is MPI.Data sets are uniformly randomly split into eight partitions.
• BDA: the proposed algorithm. We take a_1^t ≡ K and a_2^t ≡ 10⁻³, as a_1^t ≡ 1 is less stable in the primal objectives, which is essential for the sub-problem solver in this application.
• Simple average: each machine trains a separate model using the local data.The final model is obtained by averaging all local models.
For BDA and ADMM, the problem considered is SSVM in (50) with the L2 loss. Perceptron, on the other hand, solves a similar but different problem in which no regularization is involved. We set C = 0.1 for SSVM. Empirical experience suggests that structured SVM is not sensitive to C, and a model trained with C = 0.1 often attains reasonable test performance.
Both ADMM and BDA decompose the original optimization problem into sub-problems and solve them with the dual coordinate descent solver proposed in Chang and Yih (2013), which has been shown to be empirically efficient compared to other existing methods. By solving the sub-problems with the same optimizer, we can investigate the algorithmic difference between ADMM and BDA. For all algorithms, we fix the number of inference passes through all instances between any two rounds of communication to one, so that the number of inference rounds is identical to the number of communication rounds. Although it is possible to alter the number of inference passes between two rounds of communication to obtain better convergence, fine-tuning this parameter is not realistic for users. For BDA and ADMM, each time the local sub-problem with a fixed working set is solved, we let the local RPCD solver pass through the local instances ten times. We note that this number of iterations may also affect the convergence speed, but we do not fine-tune this parameter for the same reason as above. For ADMM, the weight of the penalty term in the augmented Lagrangian also affects the convergence speed.⁵ Instead of fine-tuning it, a fixed value of 1.0 is used. Note that since Perceptron and BDA/ADMM consider different problems, instead of showing objective function values, we compare the test performance of these methods along the training time.
Figure 3 shows the results. The x-axis is in log scale. Although averaging local classifiers achieves reasonable performance, all methods improve the models with multiple rounds of communication. This indicates that training models jointly on all parts of the data is necessary. Among the different algorithms, BDA performs the best on both tasks. It achieves its final accuracy (indicated by when the accuracy stops improving) with shorter training time compared to the other approaches. This result confirms that BDA enjoys a fast convergence rate.

Discussion
Our analysis discusses the situation of exactly minimizing (8) at each round, while the setting in the experiments does not exactly satisfy this. However, by the linear convergence of cyclic coordinate descent on composite problems (see, for example, Wang and Lin (2014); Hong et al. (2013)), we can ensure that the sub-problem is solved at least to some precision. Therefore, one can ensure that in this case ∆_t is still bounded by some amount related to the norm of the obtained update direction. Hence, we can find another matrix B̃_t such that the update direction is the solution of (8) with this matrix, and the conditions in Lemma 4.1 and Theorem 4.6 are still satisfied because the degrees of freedom of the matrix are large enough. The key requirement here is that we control how exactly or how loosely (8) is solved at different iterations. As long as we require that each iteration solves it at least to a fixed multiplicative precision, the values of c_1, c_2, c_3 from B̃_t will be bounded over the iterations. We omit the proof details, as the convergence theorems only provide a worst-case guarantee, while we observe from the experiments that the empirical convergence is much faster than the worst-case analysis suggests.
As mentioned above, if the block-diagonal B^t is closer to the Hessian of f, then faster empirical convergence is expected. One way to achieve this is a better data partition in which the off-diagonal-block entries are as small as possible. However, repartitioning the data across machines involves a significant amount of data transmission and is thus impractical in general. A feasible scenario is when data points are streamed in: when a new point arrives, we can decide which machine to place it on so as to approximately achieve this goal. However, this may make the partition imbalanced, so that the computational burden is not well parallelized. How to obtain a data partition that is simultaneously balanced and has small off-diagonal-block entries is an interesting future research direction.
As we noted in (37), one always wants a larger step size together with a smaller c4, and balancing these two factors is not easy in practice because the step size is not fixed. One possible heuristic is to adjust a_1^t and a_2^t dynamically according to the step size of the previous iteration: when the step size falls below a threshold, enlarge a_1^t and a_2^t by a factor; when it is large enough, move (a_1^t, a_2^t) closer to (1, 0) than in the previous iteration.

One limitation of our method is that its strong scaling with the number of machines is not promising. If we fix the data set but increase the number of machines, B^t contains more zero entries, so our algorithm behaves more like the proximal gradient method and converges more slowly in practice. This is inevitable for all the distributed dual optimization methods discussed in Section 5. In practice, however, one applies distributed optimization to (1) or (2) not to accelerate training but because of the nature of the data storage: the data set is usually stored in a distributed manner from the start, and, as just discussed, repartitioning it is costly in communication. Therefore, strong scaling is not the typical scenario in distributed optimization; practitioners care more about making the optimization procedure efficient given a fixed number of machines and a fixed data partition.
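The heuristic above for adjusting (a_1^t, a_2^t) from the previous step size could be sketched as follows; the threshold and the growth/shrink factors are illustrative assumptions, not tuned values from our experiments.

```python
def adjust_scaling(a1, a2, prev_step, threshold=0.5, grow=2.0, shrink=0.9):
    """Hypothetical schedule: if the previous step size was small, enlarge
    a1 and a2 by a factor; otherwise pull (a1, a2) back toward (1, 0)."""
    if prev_step < threshold:
        return a1 * grow, a2 * grow
    # geometric move toward (1, 0)
    return 1.0 + (a1 - 1.0) * shrink, a2 * shrink
```

A geometric move toward (1, 0) is used rather than a fixed decrement so that the pair never overshoots past (1, 0), which would invert the role of the two factors.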

Conclusions
In this work, we propose a framework for distributed optimization of the dual problem of regularized empirical risk minimization. Our theoretical results show linear convergence of both the dual problem and the corresponding primal problem for a broad class of problems, under a condition weaker than strong convexity of the dual problem. Our approach is most useful when the primal problem is difficult to solve directly. Experimental results show that our method outperforms state-of-the-art dual approaches for regularized empirical risk minimization and is competitive with cutting-edge distributed primal methods.

Figure 1: Comparison of different algorithms for optimizing the dual ERM problem with C = 1. We show training time versus the relative difference of the dual objective to the optimal function value. Running time is in log scale.

Figure 3: Comparison between different algorithms for structured learning using eight nodes. Training time is in log scale.