Bregman primal–dual first-order method and application to sparse semidefinite programming

We present a new variant of the Chambolle–Pock primal–dual algorithm with Bregman distances, analyze its convergence, and apply it to the centering problem in sparse semidefinite programming. The novelty in the method is a line search procedure for selecting suitable step sizes. The line search obviates the need for estimating the norm of the constraint matrix and the strong convexity constant of the Bregman kernel. As an application, we discuss the centering problem in large-scale semidefinite programming with sparse coefficient matrices. The logarithmic barrier function for the cone of positive semidefinite completable sparse matrices is used as the distance-generating kernel. For this distance, the complexity of evaluating the Bregman proximal operator is shown to be roughly proportional to the cost of a sparse Cholesky factorization. This is much cheaper than the standard proximal operator with Euclidean distances, which requires an eigenvalue decomposition.


Introduction
Optimization methods based on Bregman distances offer the possibility of matching the Bregman distance to the structure in the problem, with the goal of reducing the complexity per iteration. In this paper, we apply this idea to the centering problem in sparse semidefinite programming. The paper is motivated by the difficulty of exploiting sparsity in large-scale semidefinite programming in general and, for proximal methods, the need for eigendecompositions to compute Euclidean projections on the positive semidefinite matrix cone. By replacing the Euclidean projection with a generalized Bregman projection, we take advantage of the efficiency and scalability of algorithms for sparse Cholesky factorization and several related computations [3, 54].
We consider semidefinite programs (SDPs) in the standard form

primal: minimize tr(CX) subject to A(X) = b, X ⪰ 0;
dual: maximize b^T y subject to A*(y) + S = C, S ⪰ 0,     (1)

with primal variable X ∈ S^n and dual variables S ∈ S^n, y ∈ R^m, where S^n is the set of symmetric n × n matrices. The linear operator A : S^n → R^m is defined as A(X) = (tr(A_1 X), …, tr(A_m X)), and A*(y) = ∑_{i=1}^m y_i A_i is its adjoint operator. The coefficients C, A_1, …, A_m are symmetric n × n matrices. The notation S^n_+ is used for the cone of positive semidefinite (PSD) matrices in S^n.
In many large-scale applications of semidefinite programming, the coefficient matrices are sparse. The sparsity pattern of a symmetric n × n matrix can be represented by an undirected graph G = (V, E) with vertex set V = {1, 2, …, n} and edge set E. The set of symmetric matrices with sparsity pattern E is then defined as

S^n_E = {X ∈ S^n | X_ij = X_ji = 0 if i ≠ j and {i, j} ∉ E}.

In this paper, E will denote the common (or aggregate) sparsity pattern of the coefficient matrices in the SDP, i.e., we assume that C, A_1, …, A_m ∈ S^n_E. Note that the sparsity pattern E is not uniquely defined (unless it is dense, i.e., the sparsity graph G is complete): if the coefficients are in S^n_E, then they are also in S^n_{E′} where E ⊂ E′. In particular, E can always be extended to make the graph G = (V, E) chordal or triangulated [14, 54]. Without loss of generality, we will assume that this is the case.
The primal variable X in (1) generally needs to be dense to be feasible. However, the cost function and the linear equality constraints only depend on the diagonal entries X_ii and the off-diagonal entries X_ij = X_ji for {i, j} ∈ E. For the other entries, the only requirement is to make the matrix positive semidefinite. In the dual problem, S ∈ S^n_E holds at all dual feasible points. These observations imply that the SDPs (1) can be equivalently rewritten as a pair (2) of primal and dual conic linear programs with sparse matrix variables X, S ∈ S^n_E and a vector variable y ∈ R^m. The primal cone K in this problem is the set of matrices in S^n_E that have a positive semidefinite completion, i.e., K = Π_E(S^n_+), where Π_E stands for projection on S^n_E. The dual cone K* of K is the set of positive semidefinite matrices with sparsity pattern E, i.e., K* = S^n_+ ∩ S^n_E. The formulation (2) is attractive when the aggregate sparsity pattern E is very sparse, in which case S^n_E is a much lower-dimensional space than S^n.

The centering problem for the sparse SDP (2) is

minimize tr(CX) + μ φ(X) subject to A(X) = b,     (3)

where φ is the logarithmic barrier function for the cone K, defined in Sect. 2.3. The centering parameter μ > 0 controls the duality gap at the solution. Since the barrier function is n-logarithmically homogeneous, the optimal solution of the centering problem is a (μn)-suboptimal solution for the original SDP (2). The centering problem (3) is useful as an approximation to the original problem, because it yields more easily computed suboptimal solutions, with an accuracy that can be controlled by the choice of barrier parameter. The centering problem is also a key component of barrier methods, in which a sequence of centering problems with decreasing values of the barrier parameter is solved. Traditionally, the centering problem in interior-point methods is solved by Newton's
algorithm, possibly accelerated via the preconditioned conjugate gradient method [10,55], but recent work has started to examine the use of proximal methods such as the alternating direction method of multipliers (ADMM) or the proximal method of multipliers for this purpose [37,48].
Contributions The contribution of this paper is twofold. First, we formulate a non-Euclidean (Bregman) proximal method for the centering problem of the sparse SDP. In the proposed method, the proximal operators are replaced by generalized proximal operators defined in terms of a Bregman generalized distance or divergence. We show that if the Bregman divergence generated by the barrier function for the cone K is used, the generalized projections can be computed very efficiently, with a complexity dominated by the cost of a sparse Cholesky factorization with sparsity pattern E. This is much cheaper than the eigenvalue decomposition needed to compute a Euclidean projection on the positive semidefinite cone. Hence, while the method only solves an approximation of the SDP (2), it can handle problem sizes that are orders of magnitude larger than the problems solved by standard interior-point and proximal first-order methods.
For the solution of the centering problem, we apply a variant of the primal-dual method proposed by Chambolle and Pock [22]. The version of the algorithm described in [22] requires careful tuning of primal and dual step size parameters. Acceptable values of the step sizes depend on the norm of the linear operator A and the strong convexity constants of the distance function. These parameters are often difficult to estimate in practice. As a second contribution, we propose a new version of the algorithm in which the step sizes are not fixed parameters, but are selected using an easily implemented line search procedure. We give a detailed convergence analysis of the algorithm with line search and show an O(1/k) ergodic convergence rate, which is consistent with previous results in [22, 39].
Related work Sparse structure in semidefinite programming has been extensively studied by many authors. The scalability of interior-point methods is limited by the need to form and solve a set of m linear equations in m variables, known as the Schur complement system, at each iteration. This system is usually dense. Sparsity in the coefficients A_i can be exploited to reduce the cost of assembling the Schur complement equations. This process is especially efficient in extremely sparse problems, where the coefficients A_i may also have low rank. In dual barrier methods, one can also take advantage of sparsity of dual feasible variables S. These properties are leveraged in the dual interior-point methods described in [9–13].
In another line of research, techniques based on properties and algorithms for chordal sparsity patterns have been applied to semidefinite programming since the late 1990s [3, 13, 18, 29, 30, 34, 35, 42, 46, 50, 51, 58]; see [54, 60] for recent surveys. An important tool from this literature is the conversion or clique decomposition method proposed by Fukuda et al. [30, 42]. It is based on a fundamental result from linear algebra, stating that for a chordal pattern E, a matrix X ∈ S^n_E has a positive semidefinite completion if and only if X_{γ_k γ_k} ⪰ 0 for k = 1, …, r, where γ_1, …, γ_r are the maximal cliques in the graph [31]. In the conversion method, the large sparse variable matrix X in (2) is replaced with smaller dense matrix variables X_k = X_{γ_k γ_k}. Each of these new variables is constrained to be positive semidefinite. Linear equality constraints need to be added to couple the variables X_k, as they represent overlapping subblocks of a single matrix X. Thus, a large sparse SDP is converted into an equivalent problem with several smaller, dense variables X_k and additional sparse equality constraints. This equivalent problem may be considerably easier to solve by interior-point methods than the original SDP (1). Recent examples where the clique decomposition is applied to solve large sparse SDPs can be found in [27, 58].
Proximal splitting methods, such as (accelerated) proximal gradient methods [7, 8, 43], ADMM [16], and the primal-dual hybrid gradient (PDHG) or Chambolle-Pock method [20, 28, 47], are perhaps the most popular alternatives to interior-point methods in machine learning, image processing, and other applications involving large-scale convex programming. When applied to the SDPs (1), they require at each iteration a Euclidean projection on the positive semidefinite cone S^n_+ and, hence, a symmetric eigenvalue decomposition of order n. This contributes an order-n³ term to the per-iteration complexity. In the nonsymmetric formulation (2) of the sparse SDP, the projections on K* or (equivalently) K cannot be computed directly, and must be handled by introducing splitting variables and alternating projection on S^n_E, which is trivial, and on S^n_+, which requires an eigenvalue decomposition. The clique decomposition used in the conversion method described above, which was originally developed for interior-point methods, lends itself naturally to splitting algorithms as well. It allows us to replace the matrix constraint X ∈ K with several smaller dense inequalities X_k ⪰ 0, one for each maximal clique in the sparsity graph. In a proximal method, this means that projection on the n × n positive semidefinite cone can be replaced by less expensive projections on lower-dimensional positive semidefinite cones [38, 52, 59, 61]. This advantage of the conversion method is tempered by the large number of consistency constraints that must be introduced to link the splitting variables X_k. First-order methods typically do not compute very accurate solutions, and if the residual error in the consistency constraints is not small, it may be difficult to convert the computed solution of the decomposed problem back to an accurate solution of the original SDP [27].
Outline The rest of the paper is organized as follows. In Sect. 2 we describe the Bregman distance generated by the barrier function and show how generalized projections can be computed efficiently, without expensive eigenvalue decompositions. The primal-dual proximal algorithm and its convergence are discussed in Sect. 3. Section 4 contains results of numerical experiments.
2 Barrier proximal operator for sparse PSD matrix cone

2.1 Centering problem
We will assume that the equality constraints in (2) include an equality constraint tr(NX) = 1, with N ∈ S^n_++ ∩ S^n_E. To make this explicit we write the centering problem as

minimize tr(CX) + μ φ(X) subject to A(X) = b, tr(NX) = 1.     (4)

For N = I, the normalized cone {X ∈ K | tr(NX) = 1} is a matrix extension of the probability simplex {x ⪰ 0 | 1^T x = 1}, sometimes referred to as the spectraplex. With minor changes, the techniques we discuss extend to a normalization in the inequality form tr(NX) ≤ 1, with N ∈ S^n_++ ∩ S^n_E. However, we will discuss (4) to retain the standard form of the centering problem.
The constraints tr(NX) = 1 and tr(NX) ≤ 1 guarantee the boundedness of the primal feasible set, a common assumption in first-order methods. The added constraint does not diminish the generality of our approach. In many applications an equality tr(NX) = 1 is implied by the constraints A(X) = b and easily derived from the problem data (see Sect. 4 for two typical examples). When an equality constraint of this form is not readily available, one can add a bounding inequality tr(NX) ≤ 1 with N sufficiently small to ensure that the optimal solution is not modified.
To apply first-order proximal methods, we view the problem (4) as a linearly constrained optimization problem

minimize f(X) subject to A(X) = b,     (5)

where f is defined as

f(X) = tr(CX) + μ φ(X) + δ_H(X),     (6)

and δ_H is the indicator function of the hyperplane H = {X | tr(NX) = 1}. The algorithm we apply to (5) is the Bregman primal-dual iteration (7): a dual extrapolation step (7a), a generalized proximal X-update (7b), and a dual ascent step (7c), where d is the Bregman distance generated by the barrier function φ:

d(X, Y) = φ(X) − φ(Y) − tr(∇φ(Y)(X − Y)).

The choices of the parameters τ_k, σ_k, and θ_k, together with the details and origins of the algorithm, will be discussed in Sect. 3. In the remainder of this section we focus on the most expensive step in the algorithm, the optimization problem in the X-update (7b).
In Sects. 2.2 and 2.3 we first review some facts from the theory of generalized distances and the logarithmic barrier functions for the primal and dual cones K and K*. Sections 2.4 and 2.5 describe the details of the barrier kernel and the associated generalized proximal operator applied in (7b).

2.2 Bregman distance
Let h be a convex function, defined on a domain dom h that has nonempty interior, and suppose h is continuously differentiable on int(dom h). The generalized distance generated by h is defined as the function

d(x, y) = h(x) − h(y) − ⟨∇h(y), x − y⟩,

with domain dom d = dom h × int(dom h). The function h is called the kernel function that generates the generalized distance d. For h(x) = ‖x‖²₂/2 and the standard inner product ⟨u, v⟩ = u^T v, we obtain d(x, y) = ‖x − y‖²₂/2. The best known non-quadratic example is the relative entropy

d(x, y) = ∑_i (x_i log(x_i/y_i) − x_i + y_i).

This generalized distance is generated by the kernel h(x) = ∑_i x_i log x_i, if we use the standard inner product.
Generalized distances are not necessarily symmetric (d(x, y) ≠ d(y, x) in general) but share some other important properties with the squared Euclidean norm. An important example is the triangle identity [23, Lemma 3.1]

⟨∇h(y) − ∇h(z), x − y⟩ = d(x, z) − d(x, y) − d(y, z),     (8)

which holds for all x ∈ dom h and y, z ∈ int(dom h). This generalizes the identity

(y − z)^T (x − y) = ‖x − z‖²₂/2 − ‖x − y‖²₂/2 − ‖y − z‖²₂/2

for the squared Euclidean distance. Additional conditions may have to be imposed on the kernel function h, depending on the application and the algorithm in which the generalized distance is used [19].
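As a quick sanity check, the three-point identity (8) can be verified numerically for the relative entropy kernel. In the sketch below, the function names `breg` and `grad_h` are ours, not from the paper:

```python
import numpy as np

# Numerical check of the three-point identity (8) for the relative
# entropy distance generated by h(x) = sum_i x_i log x_i.

def grad_h(x):
    # gradient of the entropy kernel: (log x_i + 1)
    return np.log(x) + 1.0

def breg(x, y):
    # relative entropy d(x, y) = sum(x log(x/y) - x + y)
    return np.sum(x * np.log(x / y) - x + y)

rng = np.random.default_rng(0)
x, y, z = rng.uniform(0.1, 2.0, (3, 5))

lhs = (grad_h(y) - grad_h(z)) @ (x - y)
rhs = breg(x, z) - breg(x, y) - breg(y, z)
assert np.isclose(lhs, rhs)
```

The identity holds exactly for any kernel, since both sides reduce to ⟨∇h(y) − ∇h(z), x − y⟩ after expanding the three distances.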
For now we only assume convexity and continuous differentiability on the interior of the domain. Other properties will be mentioned when needed.
The proximal operator of a closed convex function f is defined as

prox_f(y) = argmin_x ( f(x) + (1/2)‖x − y‖²₂ ).

If f is closed and convex, then the minimizer in the definition exists and is unique for all y [40]. We will use the following extension to generalized distances. Suppose f is a convex function with the property that for every a and every y ∈ int(dom h), the optimization problem

minimize f(x) + ⟨a, x⟩ + d(x, y)     (9)

has a unique solution x̂ in int(dom h). Then we denote the minimizer x̂ by

x̂ = prox^d_f(y, a) = argmin_x ( f(x) + ⟨a, x⟩ + d(x, y) )     (10)

and call the mapping prox^d_f the generalized proximal operator of f. From the second expression we see that, for the squared Euclidean distance d(x, y) = ‖x − y‖²₂/2, we have prox^d_f(y, a) = prox_f(y − a), where prox_f is the standard proximal operator.
In contrast to the Euclidean case, it is difficult to give simple general conditions that guarantee that for every a and every y ∈ int(dom h) the problem (9) has a unique solution in int(dom h). However, we will use the definition only for specific combinations of f and d, for which problem (9) is particularly easy to solve. In those applications, existence and uniqueness of the solution follow directly from the availability of a fast algorithm for computing it. A classical example is the relative entropy distance with f given by the indicator function of the hyperplane {x | 1^T x = 1}. Problem (9) can be written as

minimize ⟨a, x⟩ + ∑_i (x_i log(x_i/y_i) − x_i + y_i) subject to 1^T x = 1.

For any a and any positive y, the solution of (9) is unique and equal to the positive vector with entries

x̂_i = y_i e^{−a_i} / ∑_j y_j e^{−a_j},  i = 1, …, n.

Research on proximal methods for semidefinite programming has been largely based on the standard Euclidean proximal operators and the distance defined by the matrix entropy [6]. For these distances, projections on the positive semidefinite cone require eigenvalue decompositions, which limits the size of the variables that can be handled and precludes applications to large sparse SDPs. In the following sections, we introduce a generalized proximal operator designed for sparse semidefinite programming. The generalized proximal operator can be evaluated via a simple iterative algorithm with a complexity dominated by the cost of a sparse Cholesky factorization.
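The closed-form entropy prox can be sketched in a few lines. The helper name `entropy_prox` is ours, and the stability shift by `a.min()` is an implementation detail, not part of the paper:

```python
import numpy as np

# Entropy prox over the hyperplane 1^T x = 1:
# minimize <a, x> + sum(x log(x/y) - x + y)  s.t.  1^T x = 1.
# The solution is a weighted softmax of -a.

def entropy_prox(y, a):
    # shift for numerical stability; the shift cancels in the ratio
    w = y * np.exp(-(a - a.min()))
    return w / w.sum()

rng = np.random.default_rng(1)
y = rng.uniform(0.5, 1.5, 6)
a = rng.normal(size=6)
x = entropy_prox(y, a)

# the solution lies on the hyperplane ...
assert np.isclose(x.sum(), 1.0)
# ... and satisfies stationarity: a_i + log(x_i/y_i) equals the same
# constant (the Lagrange multiplier) for every i
g = a + np.log(x / y)
assert np.allclose(g, g[0])
```

The two assertions check exactly the optimality conditions of the constrained problem (9) for this choice of f and d.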

2.3 Primal and dual barrier
The logarithmic barrier functions for the cones K* and K are defined as

φ*(S) = −log det S,  φ(X) = sup_{S ∈ int K*} ( −tr(XS) + log det S ),

with domains dom φ* = int K* and dom φ = int K, respectively. Note that φ(X) is the conjugate of φ* evaluated at −X.
In [3, 54] efficient algorithms are presented for evaluating the two barrier functions, their gradients, and their directional second derivatives, when the sparsity pattern E is chordal. The value of the dual barrier φ*(S) = −log det S is easily computed from the diagonal entries in a sparse Cholesky factor of S. The gradient and Hessian are given by

∇φ*(S) = −Π_E(S^{−1}),  ∇²φ*(S)[V] = Π_E(S^{−1} V S^{−1}).     (13)

Given a Cholesky factorization of S, these expressions can be evaluated via one or two recursions on the elimination tree [3, 54], without explicitly computing the entire inverse S^{−1} or the matrix product S^{−1} V S^{−1}. The cost of these recursions is roughly the same as the cost of a sparse Cholesky factorization with the sparsity pattern E [3, 54].
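For intuition, the Cholesky-based evaluation of the dual barrier can be illustrated on a dense matrix (the paper uses a sparse factorization with pattern E; the function name `dual_barrier` is ours):

```python
import numpy as np

# The dual barrier value phi*(S) = -log det S from the Cholesky factor
# S = L L^T: det S = prod(L_ii)^2, so -log det S = -2 sum(log L_ii).
# Dense illustration; a sparse factorization would be used in practice.

def dual_barrier(S):
    L = np.linalg.cholesky(S)
    return -2.0 * np.sum(np.log(np.diag(L)))

rng = np.random.default_rng(2)
M = rng.normal(size=(8, 8))
S = M @ M.T + 8 * np.eye(8)   # positive definite test matrix

sign, logabsdet = np.linalg.slogdet(S)
assert sign > 0
assert np.isclose(dual_barrier(S), -logabsdet)
```

Working from the factor diagonal avoids forming det S explicitly, which would overflow or underflow for large n.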
The primal barrier function and its gradient can be evaluated by solving the optimization problem in the definition of φ(X). The optimal solution Ŝ_X is the matrix in S^n_++ ∩ S^n_E that satisfies

Π_E(Ŝ_X^{−1}) = X.     (14)

Its inverse Ŝ_X^{−1} is also the maximum-determinant positive definite completion of X, i.e., Z = Ŝ_X^{−1} is the solution of

maximize log det Z subject to Π_E(Z) = X     (15)

(where we take S^n_++ as the domain of the cost function). From Ŝ_X, one obtains

φ(X) = log det Ŝ_X − n,  ∇φ(X) = −Ŝ_X.     (16)

Comparing the expressions for the gradients of φ and φ* in (16) and (13), and using (14), we see that ∇φ and ∇φ* are inverse mappings, up to a change in sign:

−∇φ*(−∇φ(X)) = X for X ∈ int K,  −∇φ(−∇φ*(S)) = S for S ∈ int K*.

For general sparsity patterns, the determinant maximization problem (15) or the convex optimization problem in the definition of φ must be solved by an iterative optimization algorithm. If the pattern is chordal, these optimization problems can be solved by finite recursive algorithms, again at a cost that is comparable with the cost of a sparse Cholesky factorization for the same pattern [3, 54].

2.4 Barrier kernel
The primal barrier function φ is convex, continuously differentiable on the interior of the cone, and strongly convex on the domain of the function f defined in (6), i.e., on the set {X ∈ int K | tr(NX) = 1}.

It generates the Bregman divergence
d(X, Y) = φ(X) − φ(Y) − tr(∇φ(Y)(X − Y))
        = φ(X) + tr(Ŝ_Y X) − log det Ŝ_Y.

On the second line we used the properties (16) to express φ(Y) and ∇φ(Y), together with the identity tr(Ŝ_Y Y) = n. The generalized proximal operator (10) for the function f defined in (6), which is the key step in the X-update (7b) of algorithm (7), then reduces to an optimization problem of the form

minimize tr(BX) + φ(X) subject to tr(NX) = 1,     (17)

where B ∈ S^n_E and N ∈ S^n_++ ∩ S^n_E. If we introduce a Lagrange multiplier ν for the equality constraint in (17), the optimality conditions can be written as

∇φ(X) + B + νN = 0,  tr(NX) = 1,

i.e., X = Π_E((B + νN)^{−1}) by (14) and (16). Eliminating X we obtain a nonlinear equation in ν:

tr(N Π_E((B + νN)^{−1})) = 1.     (18)

(The projection in tr(N Π_E((B + νN)^{−1})) can be omitted because the matrix N has the sparsity pattern E.) The unique solution ν that satisfies B + νN ≻ 0 defines the solution X = Π_E((B + νN)^{−1}) of (17). Equation (18) is also the optimality condition for the Lagrange dual of (17), which is a smooth unconstrained convex optimization problem in the scalar variable ν:

minimize ν − log det(B + νN).     (19)

2.5 Newton method for barrier proximal operator
In this section we discuss in detail Newton's method applied to the dual problem (19) and the equivalent nonlinear equation (18). We write the equation as ψ(ν) = 1, where

ψ(ν) = tr(N(B + νN)^{−1}).     (20)

The function and its derivative can be expressed in terms of the generalized eigenvalues λ_i of (B, N) as

ψ(ν) = ∑_{i=1}^n 1/(λ_i + ν),  ψ′(ν) = −∑_{i=1}^n 1/(λ_i + ν)².

Figure 1 shows an example with n = 4, N = I, and eigenvalues 10, 5, 0, −5.
We are interested in computing the solution of ψ(ν) = 1 that satisfies B + νN ≻ 0, i.e., ν > −λ_min, where λ_min = min_i λ_i is the smallest generalized eigenvalue of (B, N). We denote this interval by J = (−λ_min, ∞). The equation ψ(ν) = 1 is guaranteed to have a unique solution in J because ψ is monotonic and continuous on this interval, with

lim_{ν→−λ_min} ψ(ν) = +∞,  lim_{ν→∞} ψ(ν) = 0.

Furthermore, on the interval J, the function and its derivative can be expressed as

ψ(ν) = −tr(N ∇φ*(B + νN)),  ψ′(ν) = −tr(N ∇²φ*(B + νN)[N]).     (21)

Therefore ψ(ν) and ψ′(ν) can be evaluated by taking the inner product of N with Π_E((B + νN)^{−1}) and with −Π_E((B + νN)^{−1} N (B + νN)^{−1}). Since B, N ∈ S^n_E, these quantities can be computed by the efficient algorithms for computing the gradient and directional second derivative of φ* described in [3, 54].
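The eigenvalue expression for ψ can be checked numerically on a small dense example. The script below computes the generalized eigenvalues of (B, N) through a Cholesky factor of N; all matrices are made up for illustration:

```python
import numpy as np

# Check the formula psi(nu) = sum_i 1/(lambda_i + nu), where lambda_i
# are the generalized eigenvalues of (B, N), i.e. B v = lambda N v.

rng = np.random.default_rng(3)
n = 6
M = rng.normal(size=(n, n))
B = (M + M.T) / 2                       # symmetric, possibly indefinite
P = rng.normal(size=(n, n))
N = P @ P.T + n * np.eye(n)             # positive definite

# generalized eigenvalues of (B, N) are the eigenvalues of
# L^{-1} B L^{-T}, where N = L L^T
L = np.linalg.cholesky(N)
Linv = np.linalg.inv(L)
lam = np.linalg.eigvalsh(Linv @ B @ Linv.T)

nu = 1.0 - lam.min()                    # a point in J = (-lambda_min, inf)
psi_direct = np.trace(N @ np.linalg.inv(B + nu * N))
psi_eig = np.sum(1.0 / (lam + nu))
assert np.isclose(psi_direct, psi_eig)
```

The congruence by L^{-1} turns tr(N(B + νN)^{-1}) into tr((Ā + νI)^{-1}) with Ā = L^{-1}BL^{-T}, which is exactly the sum over 1/(λ_i + ν).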
We note a few other properties of ψ. First, the expressions in (21) show that ψ is convex, decreasing, and positive on J. Second, if ν ∈ J, then ν̃ ∈ J for all ν̃ that satisfy

(ν̃ − ν)² (−ψ′(ν)) < 1.     (22)

This follows from the eigenvalue expression for ψ′ and is also a simple consequence of the Dikin ellipsoid theorem for self-concordant functions [44, Theorem 2.1.1.b].
The Newton iteration for the equation ψ(ν) = 1 is

ν⁺ = ν + α (1 − ψ(ν))/ψ′(ν),     (23)

where α ∈ (0, 1] is a step size. The same iteration can be interpreted as a damped Newton method for the unconstrained problem (19). If ν⁺ ∈ J for a unit step α = 1, then

ψ(ν⁺) ≥ ψ(ν) + ψ′(ν)(ν⁺ − ν) = 1

from convexity of ψ. Hence after one full Newton step, the Newton iteration with unit steps approaches the solution monotonically from the left. If ψ(ν) < 1, then in general a non-unit step size must be taken to keep the iterates in J. From the Dikin ellipsoid inequality (22), we see that ν⁺ ∈ J for all positive α that satisfy α λ(ν) < 1, where λ(ν) = |1 − ψ(ν)|/√(−ψ′(ν)) is the Newton decrement of (19). The theory of self-concordant functions provides a step size rule that satisfies this condition and guarantees convergence:

α = β/(1 + λ(ν)),

where β is a constant in (0, 1). As an alternative to this fixed step size rule, a standard backtracking line search can be used to determine a suitable step size α in (23).
Checking whether ν⁺ ∈ J can be done by attempting a sparse Cholesky factorization of B + ν⁺N.
Figure 1 shows that the function ψ can be quite nonlinear around the solution of the equation if the solution is near −λ_min. Instead of applying Newton's method directly to (20), it is useful to rewrite the nonlinear equation as r(ν) = 0, where

r(ν) = 1 − 1/ψ(ν).     (24)

The negative smallest eigenvalue −λ_min is a pole of ψ(ν), but a zero of 1/ψ(ν). Also, the derivative of 1/ψ changes slowly near this zero; in Fig. 1, the function 1/ψ is almost linear in the region of interest. This implies that Newton's method applied to (24), i.e.,

ν⁺ = ν + α ψ(ν)(1 − ψ(ν))/ψ′(ν),

should be extremely efficient in this case. Starting the line search at α = 1 in this iteration is equivalent to starting at α = ψ(ν) in (23). This often requires fewer backtracking steps than starting at α = 1 in (23).
Newton's method requires a feasible initial point ν₀ ∈ J. Suppose we know a positive lower bound μ_N on the smallest eigenvalue of N, and a lower bound l_B on the smallest eigenvalue of B. Then ν̂₀ ∈ J, where

ν̂₀ = max{0, −l_B/μ_N}.

A lower bound on λ_min(B) can be obtained from the Gershgorin circle theorem, which states that the eigenvalues of B are contained in the disks

|λ − B_ii| ≤ ∑_{j≠i} |B_ij|,  i = 1, …, n.

Thus, λ_min(B) ≥ min_i (B_ii − ∑_{j≠i} |B_ij|). Apart from the above initialization, we find another practically useful initial point ν̃₀ = n − tr(B)/tr(N), which is the solution of tr(N(B + νN)^{−1}) = 1 when B happens to be a multiple of N. This choice is efficient in many practical examples but, unfortunately, not guaranteed to be feasible. Thus, in the implementation, we use ν̃₀ if it is feasible and ν̂₀ otherwise.
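The complete procedure of this section can be sketched as follows. This is a dense illustration under our own naming, with ψ and ψ′ evaluated by explicit inversion instead of the sparse recursions of [3, 54], Newton steps taken on the reformulation (24), feasibility tested by an attempted Cholesky factorization, and the two initializations described above:

```python
import numpy as np

def psi_and_deriv(B, N, nu):
    # psi(nu) = tr(N (B + nu N)^{-1}) and its derivative (dense version)
    W = np.linalg.inv(B + nu * N)
    NW = N @ W
    return np.trace(NW), -np.trace(NW @ NW)

def is_feasible(B, N, nu):
    # nu is in J iff B + nu N is positive definite (Cholesky succeeds)
    try:
        np.linalg.cholesky(B + nu * N)
        return True
    except np.linalg.LinAlgError:
        return False

def solve_prox_multiplier(B, N, tol=1e-12, max_iter=100):
    # Gershgorin lower bounds on lambda_min(B) and lambda_min(N)
    l_B = np.min(np.diag(B) - (np.sum(np.abs(B), axis=1) - np.abs(np.diag(B))))
    mu_N = max(np.min(np.diag(N) - (np.sum(np.abs(N), axis=1)
                                    - np.abs(np.diag(N)))), 1e-12)
    nu = max(0.0, -l_B / mu_N) + 1e-8      # guaranteed point in J
    n0 = B.shape[0] - np.trace(B) / np.trace(N)
    if is_feasible(B, N, n0):              # cheaper heuristic start
        nu = n0
    for _ in range(max_iter):
        p, dp = psi_and_deriv(B, N, nu)
        if abs(p - 1.0) < tol:
            break
        step = p * (1.0 - p) / dp          # Newton step for 1 - 1/psi = 0
        alpha = 1.0
        while not is_feasible(B, N, nu + alpha * step):
            alpha *= 0.5                   # backtrack into J
        nu += alpha * step
    return nu

rng = np.random.default_rng(4)
n = 5
M = rng.normal(size=(n, n))
B = (M + M.T) / 2                          # indefinite test matrix
N = np.eye(n)
nu = solve_prox_multiplier(B, N)
p, _ = psi_and_deriv(B, N, nu)
assert abs(p - 1.0) < 1e-8
assert is_feasible(B, N, nu)
```

In the sparse setting, each evaluation of ψ and ψ′ costs roughly one sparse Cholesky factorization, so the whole prox evaluation is a small number of factorizations.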

3 Bregman primal-dual method
The proposed algorithm (7) is applicable not only to sparse SDPs, but to more general optimization problems. To emphasize its generality and to simplify notation, we switch in this section to the vector form of the optimization problem

minimize f(x) subject to Ax = b,     (25)

where f is a closed convex function. Most of the discussion in this section extends to the more general standard form

minimize f(x) + g(Ax),     (26)

where f and g are closed convex functions. Problem (25) is a special case with g = δ_{b}, the indicator function of the singleton {b}. While the standard form (26) offers more flexibility, it should be noted that methods for the equality constrained problem (25) also apply to (26) if this problem is reformulated as

minimize f(x) + g(y) subject to Ax − y = 0.     (27)

We also note that (25) includes conic optimization problems in standard form if we define f(x) = c^T x + δ_C(x), where δ_C is the indicator function of the cone C. In Sect. 3.1 we review some facts from convex duality theory. Section 3.2 describes the algorithm we propose for solving (25), and in Sect. 3.3 we analyze its convergence.

3.1 Duality theory
The Lagrangian for problem (25) will be denoted by

L(x, z) = f(x) + z^T(Ax − b).

This function is convex in x and affine in z, and satisfies

inf_x L(x, z) = −b^T z − f*(−A^T z),

where f*(y) = sup_x (y^T x − f(x)) is the conjugate of f. The function −b^T z − f*(−A^T z) is the objective in the dual problem

maximize −b^T z − f*(−A^T z).     (28)

Throughout this section we assume that there exists a saddle point (x⋆, z⋆), i.e., a pair that satisfies

L(x⋆, z) ≤ L(x⋆, z⋆) ≤ L(x, z⋆) for all x, z.     (29)

Some of the convergence results in Sect. 3.3 are expressed in terms of the merit function

f(x) + ρ‖Ax − b‖₂ − f(x⋆).     (30)

It is well known that for sufficiently large ρ, the term ρ‖Ax − b‖₂ is an exact penalty. Specifically, if ρ > ‖z⋆‖₂, where z⋆ is a solution of the dual problem (28), then minimizers of (30) are also optimal for (25).

3.2 Algorithm
The algorithm for (25) presented in this section involves a generalized distance d in the primal space, generated by a kernel function φ. It will be assumed that φ is strongly convex on dom f. This property can be expressed as

d(x, y) ≥ (1/2)‖x − y‖²     (31)

for all x ∈ dom φ ∩ dom f and y ∈ int(dom φ) ∩ dom f, where ‖·‖ is a norm, scaled so that the strong convexity constant in (31) is one. (More generally, if φ is μ-strongly convex with respect to ‖·‖, then the factor 1/2 is replaced with μ/2. By scaling the norm, one can assume μ = 1.) We denote by ‖A‖ the matrix norm

‖A‖ = max_{‖x‖ ≤ 1} ‖Ax‖₂.     (32)

The algorithm is summarized as follows. Select starting points z^{−1} = z^0 and x^0 ∈ int(dom φ) ∩ dom f. For k = 0, 1, …, repeat the following steps:

z̄^{k+1} = z^k + θ_k(z^k − z^{k−1}),     (33a)
x^{k+1} = prox^d_{τ_k f}(x^k, τ_k A^T z̄^{k+1}),     (33b)
z^{k+1} = z^k + σ_k(Ax^{k+1} − b).     (33c)

Step (33b) can be written more explicitly as

x^{k+1} = argmin_x ( f(x) + (z̄^{k+1})^T Ax + (1/τ_k) d(x, x^k) ).     (34)

The parameters τ_k, σ_k, θ_k are determined by one of two methods.

- Constant parameters. We take θ_k = 1 and constant step sizes τ_k = τ, σ_k = σ with

  στ‖A‖² ≤ δ².     (35)

  The parameter δ satisfies 0 < δ ≤ 1. In practice, δ = 1 can be used, but some convergence results will require δ < 1; see Sect. 3.3.4.

- Varying parameters. The parameters τ_k, σ_k, θ_k are determined by a backtracking search. At the start of the algorithm, we set τ_{−1} and σ_{−1} to some positive values. To start the search in iteration k we choose θ̄_k ≥ 1. For i = 0, 1, 2, …, we set

  θ_k = 2^{−i} θ̄_k,  τ_k = θ_k τ_{k−1},  σ_k = θ_k σ_{k−1},

  and compute z̄^{k+1}, x^{k+1}, z^{k+1} using (33). If the line search condition (36) holds, we accept the computed iterates z̄^{k+1}, x^{k+1}, z^{k+1} and step sizes τ_k, σ_k, and terminate the backtracking search. If (36) does not hold, we increment i and continue the backtracking search.

The constant parameter choice is simple, but it is often overly pessimistic. Moreover, it requires an estimate or tight upper bound for ‖A‖, which is difficult to obtain in large-scale problems. Using a loose bound for ‖A‖ in (35) may result in unnecessarily small values of τ and σ, and can dramatically slow down the convergence. The definition of ‖A‖ further depends on the strong convexity constant for the kernel φ; see (31) and (32). This quantity is also difficult to estimate for most kernels.
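For concreteness, here is a minimal sketch of iteration (33) with constant step sizes on a toy instance of (25): a linear objective over the probability simplex with one equality constraint, using the entropy kernel, so that the X-update (33b) is the weighted softmax of Sect. 2.2. All problem data are made up:

```python
import numpy as np

# minimize c^T x  s.t.  Ax = b,  1^T x = 1,  x >= 0,
# solved with iteration (33), theta_k = 1 and constant tau, sigma.

A = np.array([[1.0, 2.0, 3.0]])
x_true = np.array([0.2, 0.3, 0.5])
b = A @ x_true                       # b = [2.3]
c = np.array([0.0, 1.0, 0.0])

# here ||A|| = max_{||x||_1 <= 1} ||Ax||_2 = 3, and the entropy kernel
# is 1-strongly convex w.r.t. ||.||_1 on the simplex, so tau = sigma
# = 1/3 satisfies the step size condition with delta = 1
tau = sigma = 1.0 / 3.0

x = np.ones(3) / 3.0
z = z_prev = np.zeros(1)
x_avg = np.zeros(3)
K = 20000
for k in range(K):
    z_bar = z + (z - z_prev)                 # (33a), extrapolation
    a = tau * (c + A.T @ z_bar)              # linear term in (33b)
    w = x * np.exp(-(a - a.min()))
    x = w / w.sum()                          # entropy prox, (33b)
    z_prev = z
    z = z + sigma * (A @ x - b)              # (33c), dual ascent
    x_avg += x / K                           # ergodic average

# the ergodic iterate is nearly feasible
assert abs((A @ x_avg - b)[0]) < 0.05
assert abs(x_avg.sum() - 1.0) < 1e-6
```

The simplex constraint is handled exactly inside the prox (every iterate sums to one), while the residual of Ax = b decays at the ergodic O(1/k) rate established in Sect. 3.3.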
The varying parameters option does not require estimates or bounds on ‖A‖ or the strong convexity constant of the kernel. It is more expensive, because in each backtracking iteration the three updates in (33) are computed. However, the extra cost is well justified in practice. If the line search process takes more than a few backtracking iterations, it indicates that the inequality (36) is much weaker than the conservative step size condition (35), and the algorithm with line search takes much larger steps than would be used by the constant parameter algorithm. In practice, the parameter θ̄_k can be set to one in most iterations. The backtracking search then first checks whether the previous step sizes τ_{k−1} and σ_{k−1} are acceptable, and decreases them only when needed to satisfy (36). The option of choosing θ̄_k > 1 allows one to occasionally increase the step sizes.
Algorithm (33) is related to several existing algorithms. With constant parameters, it is a special case of the primal-dual algorithm in [22, Algorithm 1], which solves the more general problem (26) and uses generalized distances for the primal and dual variables. Here we take g = δ_{b} and use a generalized distance only in the primal update (33b). The line search condition (36) for selecting step sizes does not appear in [22]. With standard proximal operators (for squared Euclidean distances), the primal-dual algorithm of [22] is also known as the primal-dual hybrid gradient (PDHG) algorithm, and has been extensively studied as a versatile and efficient algorithm for large-scale convex optimization; see [20, 21, 24, 25, 28, 33, 45, 47, 49, 56, 57] for applications, analysis, and extensions. The line search technique for the primal-dual algorithm proposed by Malitsky and Pock [39] is similar to the one described above, but not identical, even when squared Euclidean distances are used.
The algorithm can also be interpreted as a variation on the Bregman proximal point algorithm [19, 26, 32], applied to the optimality conditions

0 ∈ ∂f(x) + A^T z,  Ax = b.

In each iteration of the proximal point algorithm the iterates x^{k+1}, z^{k+1} are defined by the inclusion

0 ∈ ( ∂f(x^{k+1}) + A^T z^{k+1}, b − Ax^{k+1} ) + ∇φ_pd(x^{k+1}, z^{k+1}) − ∇φ_pd(x^k, z^k),     (37)

where φ_pd(x, z) is a Bregman kernel. If we choose a kernel of the form

φ_pd(x, z) = (1/τ)φ(x) + (1/(2σ))‖z‖²₂,

then (37) reduces to

0 ∈ ∂f(x^{k+1}) + A^T z^{k+1} + (1/τ)(∇φ(x^{k+1}) − ∇φ(x^k)),
z^{k+1} = z^k + σ(Ax^{k+1} − b).

In the generalized proximal operator notation defined in (10), this condition can be expressed as two equations

x^{k+1} = prox^d_{τf}(x^k, τA^T z^{k+1}),  z^{k+1} = z^k + σ(Ax^{k+1} − b).

These two equations are coupled and difficult to solve because x^{k+1} and z^{k+1} each appear on the right-hand side of an equality. The updates (33b) and (33c) are almost identical, but replace z^{k+1} with z̄^{k+1} in the primal update. The iterate z̄^{k+1} can therefore be interpreted as a prediction of z^{k+1}. This interpretation also provides some intuition for the step size condition (36). If z̄^{k+1} happens to be equal to z^{k+1}, then (36) imposes no upper bound on the step sizes τ_k and σ_k. This makes sense because when z̄^{k+1} = z^{k+1} the update is equal to the proximal point update, and the convergence theory for the proximal point method does not impose upper bounds on the step size.
He and Yuan [33] have given an interesting interpretation of the primal-dual algorithm of [20] as a "pre-conditioned" proximal point algorithm. For the algorithm considered here, their interpretation corresponds to choosing

φ_pd(x, z) = (1/τ)φ(x) + (1/(2σ))‖z‖²₂ + z^T Ax     (38)

as the kernel of the generalized distance in (37). It can be shown that, under the strong convexity assumptions for φ mentioned at the beginning of the section, the function (38) is convex if √(στ)‖A‖ ≤ 1. With this choice of Bregman kernel, the inclusion (37) reduces to

z^{k+1} = z^k + σ(Ax^k − b),  x^{k+1} = prox^d_{τf}(x^k, τA^T(2z^{k+1} − z^k)).

Except for the indexing of the iterates, this is identical to (33) with constant step sizes (θ_k = 1, τ_k = τ, σ_k = σ).

3.3 Convergence analysis
In this section we analyze the convergence of the algorithm, following the ideas in [22, 39, 49]. The main result is an ergodic convergence rate, given in (49).

3.3.1 Algorithm parameters
We first prove two facts about the step sizes in the two versions of the algorithm.
Constant parameters With constant step sizes satisfying (35), the line search exit condition (36) holds in every iteration.

Proof We use the definition of the matrix norm ‖A‖, the arithmetic-geometric mean inequality, and the strong convexity of the Bregman kernel; the last inequality in the resulting chain follows from (35). ◻

The result implies that we can restrict the analysis to the algorithm with varying parameters. The constant parameter variant is a special case with θ̄_k = 1, σ_{−1} = σ, and τ_{−1} = τ.

Varying parameters In the varying parameter variant of the algorithm the step sizes are bounded below by

τ_k ≥ τ_min = min{τ_{−1}, δ/(2‖A‖√ω)},  σ_k ≥ σ_min = min{σ_{−1}, δ√ω/(2‖A‖)},     (39)

where ω = σ_{−1}/τ_{−1}.
Proof We proved in the previous paragraph that the exit condition (36) in the backtracking search certainly holds if

√(τ_k σ_k)‖A‖ = θ_k √(τ_{k−1} σ_{k−1})‖A‖ ≤ δ.

From this observation one can use induction to prove the lower bounds (39). Suppose τ_{k−1} ≥ τ_min and σ_{k−1} ≥ σ_min. This holds at k = 0 by definition of τ_min and σ_min. The first value of θ_k tested in the search is θ_k = θ̄_k ≥ 1. If this value is accepted, then τ_k ≥ τ_{k−1} ≥ τ_min and σ_k ≥ σ_{k−1} ≥ σ_min. If θ_k = θ̄_k is rejected, one or more backtracking steps are taken. Denote by θ̃_k the last rejected value. Then θ̃_k √(τ_{k−1} σ_{k−1})‖A‖ > δ, and the accepted θ_k = θ̃_k/2 satisfies

√(τ_k σ_k)‖A‖ = θ_k √(τ_{k−1} σ_{k−1})‖A‖ > δ/2.

Therefore

τ_k = √(τ_k σ_k/ω) > δ/(2‖A‖√ω) ≥ τ_min,  σ_k = √(ω τ_k σ_k) > δ√ω/(2‖A‖) ≥ σ_min. ◻

3.3.2 Analysis of one iteration
We now analyze the progress in one iteration of the varying parameter variant of algorithm (33).
for all x ∈ dom f ∩ dom φ and all z.
Proof The second step (33b) defines x^{k+1} as the minimizer of the prox subproblem. By assumption the solution is uniquely defined and in the interior of dom φ. Therefore x^{k+1} satisfies the optimality condition of this subproblem. Equivalently, the following holds for all x ∈ dom φ ∩ dom f. (The triangle identity (8) is used on the second line.) The dual update (33c) implies the equality (42). We evaluate this at z = z^i and add it to the equality at z = z^{i−2} multiplied by θ_{i−1}. Now we combine (41) for k = i − 1 with (43) and (44). For i ≥ 1, the first inequality follows from (41). In the last step we substitute (43) and (44).
Next we note that the line search exit condition (36) implies a bound that, substituted in (45), gives the bound (40). ◻

Monotonicity properties Suppose x⋆ ∈ dom φ, and x⋆, z⋆ satisfy the saddle point property (29). Then the inequalities (46) and (47) below hold.

this gives the inequality
Since the left-hand side is nonnegative, the inequality (46) follows. Summing from i = 1 to k gives (47). ◻

Ergodic convergence
We define averaged primal and dual sequences. We first show that the averaged sequences satisfy the bound (48) for all x ∈ dom f ∩ dom φ and all z. This holds for every choice of δ ∈ (0, 1] in (36).
Proof From (40), and since L is convex in x and affine in z, (48) follows. ◻

If we substitute in (48) an optimal x = x⋆ (which satisfies Ax⋆ = b), we obtain a bound that holds for all z. Maximizing both sides over z subject to ‖z‖_2 ≤ γ shows that the first two terms on the left-hand side form the merit function (30). For γ > ‖z⋆‖_2, the penalty term in the merit function is exact, so f(x) + γ‖Ax − b‖_2 − f(x⋆) ≥ 0 with equality only if x is optimal. (The use of an exact penalty function to express a convergence result is inspired by [49, page 287].) Since σ_i ≥ σ_min, the inequality shows that the merit function decreases as O(1∕k).
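The averaged sequences can be formed on the fly. The sketch below assumes weights proportional to the step sizes σ_i, which is one plausible reading of the displayed definition lost in this extraction:

```python
import numpy as np

def ergodic_averages(xs, zs, sigmas):
    """Weighted ergodic averages of primal and dual iterates.
    The choice of weights sigma_i is an assumption standing in for the
    paper's displayed definition of the averaged sequences."""
    w = np.asarray(sigmas, dtype=float)
    w = w / w.sum()                                 # normalized weights
    x_avg = sum(wi * np.asarray(x, float) for wi, x in zip(w, xs))
    z_avg = sum(wi * np.asarray(z, float) for wi, z in zip(w, zs))
    return x_avg, z_avg
```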

Convergence of the iterates
We now make two additional assumptions about the Bregman kernel [19].
1. For fixed x, the sublevel sets {y | d(x, y) ≤ γ} are closed. In other words, the distance d(x, y) is a closed function of y.
These two assumptions are not restrictive, and in particular, they are satisfied by the logarithmic barrier (12). We also make the (minor) assumptions that δ < 1 in (36) and that θ_k is bounded above (which is easily satisfied, since the user chooses θ_k). With these additional assumptions it can be shown that the sequences x^k, z^k converge to optimal solutions.

Proof
The inequality (46) and strong convexity of the kernel show that the sequences x^k, z^k are bounded. Let (x^{k_i}, z^{k_i}) be a convergent subsequence with limit (x̂, ẑ). With δ < 1, (47) shows that d(x^{k_i+1}, x^{k_i}) converges to zero. By strong convexity of the kernel, x^{k_i+1} − x^{k_i} → 0 and therefore the subsequence x^{k_i+1} also converges to x̂. The dual update (33c) can be rearranged so that its left-hand side is (z^{k_i+1} − z^{k_i})∕σ_{k_i}. Since z^{k_i+1} − z^{k_i} → 0 and σ_{k_i} ≥ σ_min, the left-hand side converges to zero, so Ax̂ = b. From (46), d(x⋆, x^{k_i}) is bounded above. Since the sublevel sets {y | d(x⋆, y) ≤ γ} are closed subsets of int dom φ, the limit x̂ is in int dom φ. The left-hand side of the optimality condition converges to −A^T ẑ, because τ_k ≥ τ_min and ∇φ is continuous on int dom φ. By maximal monotonicity of ∂f, this implies that −A^T ẑ ∈ ∂f(x̂) (see [17, page 27], [53, lemma 3.2]). We conclude that x̂, ẑ satisfy the optimality conditions Ax̂ = b and −A^T ẑ ∈ ∂f(x̂).
To show that the entire sequence converges, we substitute x = x̂, z = ẑ in (40). The left-hand side is nonnegative by the saddle point property (29). Therefore the resulting bound holds for all k. This shows that (50) holds for all k ≥ k_i. By the second additional kernel property mentioned above, the right-hand side converges to zero. Therefore d(x̂, x^k) → 0 and z^k → ẑ. If d(x̂, x^k) → 0, then the strong convexity property of the kernel implies that x^k → x̂. ◻

Numerical experiments
In this section we evaluate the performance of algorithm (7), the Bregman PDHG algorithm (33) applied to the centering problem (5). The numerical results illustrate that the cost of evaluating the Bregman proximal operator (17) is comparable to the cost of a sparse Cholesky factorization with sparsity pattern E. This prox-evaluation dominates the computational cost in each iteration of (7), since A and A* are usually easy to evaluate for large-scale problems with sparse or other types of structure. In particular, the proposed method does not need to solve linear equations involving A or A*, an important advantage over ADMM and interior-point methods.
In this section we consider the centering problem for two sets of sparse SDPs, the maximum cut problem and the graph partitioning problem. The experiments are carried out in Python 3.6 on a laptop with an Intel Core i5 2.4 GHz CPU and 8 GB RAM. The Python library for chordal matrix computations CHOMPACK [4] is used to compute chordal extensions (with the AMD reordering [1]), sparse Cholesky factorizations, the primal barrier, and the gradient and directional second derivative of the dual barrier. Other sparse matrix computations are implemented using CVXOPT [2].
In the experiments, we terminate the iteration (33) when the relative primal and dual residuals are less than 10^{−6}. These two stopping conditions are sufficient for our algorithm, as suggested by the convergence proof, in particular, Eqs. (50) and (51). The two residuals are defined in terms of the entrywise maximum norm ‖Y‖_max = max_{i,j} |Y_{ij}|.
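A minimal sketch of this stopping test, with ‖Y‖_max as defined above (the exact residual formulas did not survive extraction, so only the norm and the tolerance are taken from the text):

```python
import numpy as np

def max_norm(Y):
    """Entrywise maximum norm ||Y||_max = max_{i,j} |Y_ij|."""
    return float(np.max(np.abs(Y)))

def converged(primal_res, dual_res, tol=1e-6):
    """Terminate when both relative residuals are below tol (10^-6 in the text)."""
    return primal_res < tol and dual_res < tol
```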

Maximum cut problem
Given an undirected graph G = (V, E), the maximum cut problem is to partition the set of vertices into two sets in order to maximize the total number of edges between the two sets. (If every edge {i, j} ∈ E is associated with a nonnegative weight w_ij, then the maximum cut problem is to maximize the total weight of the edges between the two sets.) One can show that the maximum cut problem can be represented as a binary quadratic optimization problem, where L ∈ S^n is the Laplacian of an undirected graph G = (V, E) with vertices V = {1, 2, …, n}. The SDP relaxation of the maximum cut problem is (52), with variable X ∈ S^n. The operator diag: S^n → R^n returns the diagonal elements of the input matrix as a vector: diag(X) = (X_{11}, X_{22}, …, X_{nn}). If moderate accuracy is allowed, we can solve the centering problem of the SDP relaxation with optimization variable X ∈ S^n_{E′}, where E′ is a chordal extension of E. Note that tr(X) = n for all feasible X. The centering problem has the form of (5). The Lagrangian of (53) is in the form of (27), where f is defined in (6), and z is the Lagrange multiplier associated with the equality constraint diag(X) = 1. Thus we have (54), where X⋆ and z⋆ are the primal and dual optimal solutions of the centering problem (53), and p⋆_sdp is the optimal value of the SDP (52).

Numerical results We first collect four MAXCUT problems of moderate size from SDPLIB [15]. The SDP relaxation (52) is solved using MOSEK [41] and the optimal value computed by MOSEK is denoted by p⋆_sdp. (Note that the source file for the graph of maxG55 was unfortunately incorrectly converted into SDPA sparse format. Thus the objective value for the maxG55 problem obtained from the original data file is 1.1039 × 10^4 instead of 9.9992 × 10^3 as reported in SDPLIB.)
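The identity behind the binary quadratic formulation is that, for a labeling x ∈ {−1, +1}^n, the quantity (1∕4) x^T L x counts the edges cut by the corresponding partition. A small self-contained check (graph and labeling chosen purely for illustration):

```python
import numpy as np

def laplacian(n, edges):
    """Laplacian L = D - A of an unweighted undirected graph on n vertices."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return L

def cut_value(L, x):
    """For a labeling x in {-1,+1}^n, (1/4) x^T L x counts the cut edges."""
    return 0.25 * float(x @ L @ x)
```

For the triangle graph on vertices {0, 1, 2} with labeling (1, 1, −1), two of the three edges are cut.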
In (53), we set μ = 0.001∕n, and report in column 4 of Table 1 the difference between p⋆_sdp and the cost function (1∕4) tr(LX) at the suboptimal solution returned by the algorithm.
The last two columns of Table 1 give the relative primal and dual residuals. These results show that the proposed algorithm is able to solve the centering SDP (53) with the desired accuracy. A comparison of the third and fourth columns of Table 1 confirms (54), i.e., the objective value of the SDP at X is within nμ = 10^{−3} of the optimal value. Considering the values of p⋆_sdp, we see that the computed points on the central path are close to the optimal solutions of the SDPs.
To test the scalability of algorithm (33), we add four larger graphs from the SuiteSparse collection [36]. In Table 2 we report the time per Cholesky factorization, the number of Newton steps per iteration, the time per PDHG iteration, and the number of iterations in the primal-dual (PDHG) algorithm for the eight test problems.
As can be seen from the table, the number of Newton iterations per prox-evaluation remains small even when the size of the problem increases. Also, we observe that the time per PDHG iteration is roughly the cost of a sparse Cholesky factorization times the number of Newton steps. This means that the backtracking in Newton's method does not cause a significant overhead. Since the evaluations of A and A* in this problem are very cheap, the cost per prox-evaluation is the dominant term in the per-iteration complexity.
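The observation in this paragraph amounts to a simple per-iteration cost model, sketched below (the function and argument names are illustrative, not from the paper):

```python
def estimated_iteration_time(chol_time, newton_steps, linmap_time=0.0):
    """Back-of-envelope cost model suggested by Table 2: one PDHG iteration
    costs roughly newton_steps sparse Cholesky factorizations, plus the
    (cheap) evaluations of A and A* captured by linmap_time."""
    return newton_steps * chol_time + linmap_time
```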
Table 1 Results for four instances of the MAXCUT problem from SDPLIB [15]. Column 3 is the optimal value computed by MOSEK. Column 4 is the difference with the optimal value of the centering problem computed by algorithm (33). The last two columns give the relative primal and dual residuals in the computed solution

Graph partitioning
The problem of partitioning the vertices of a graph G = (V, E) in two subsets of equal size (here we assume an even number of vertices), while minimizing the number of edges between the two subsets, can be expressed as a quadratic optimization problem, where L is the graph Laplacian. The ith entry of the n-vector x indicates the set that vertex i is assigned to. To obtain an SDP relaxation we introduce a matrix variable Y = xx^T, write the problem in the equivalent form, and then relax the constraint Y = xx^T as Y ⪰ 0. This gives the SDP (55). The dual SDP has variables ν ∈ R and z ∈ R^n. The aggregate sparsity pattern of the SDP (55) is completely dense, because the equality constraint 1^T Y1 = 0 has a coefficient matrix of all ones. We therefore eliminate the dense constraint using the technique described in [30, page 668]. Let P be the n × (n − 1) matrix whose columns form a sparse basis for the orthogonal complement of the multiples of the vector 1. Suppose Y is feasible in (55) and define u, v, and X accordingly.
From 1^T Y1 = 0, we see that v = 0. Since the matrix (56) is positive semidefinite, we also have u = 0. Hence every feasible Y can be expressed as Y = PXP^T, with X ⪰ 0. If we make this substitution in (55) we obtain an equivalent SDP. The (n − 1) × (n − 1) matrix P^T LP has a sparsity pattern E′ that is denser than E, i.e., E ⊆ E′. The n constraints diag(PXP^T) = 1 reduce to constraints on X. To apply algorithm (33), we first rewrite the graph partitioning problem as (57), where E′′ is a chordal extension of the aggregate sparsity pattern E′. Note that tr(P^T PX) = n − 1 for all feasible X. The centering problem for this sparse SDP is of the form (5).

Numerical results Table 3 shows the numerical results for four problems from SDPLIB [15].
The SDP relaxation (55) is solved by MOSEK and its optimal value is denoted by p⋆_sdp. In solving (57), we set μ = 0.001∕n, and report in Table 3 the value (1∕4) tr(P^T LPX), where X is the solution returned by the algorithm (33). As in the first experiment, the numerical results show that the algorithm is able to solve the centering SDP (57) with the desired accuracy. In addition, we test the algorithm on four additional graphs from the SuiteSparse collection [36]. Table 4 reports the time per Cholesky factorization, the number of Newton steps per iteration, the time per PDHG iteration, and the number of iterations in the primal-dual algorithm.
The same observations as in Sect. 4.1 apply: the number of Newton steps remains moderate as the size of the problem increases, and the cost per iteration is roughly linear in the cost of a Cholesky factorization.
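Returning to the elimination step earlier in this subsection: the displayed definition of P did not survive extraction, but one standard sparse basis for the orthogonal complement of 1 uses difference columns e_i − e_{i+1}. A sketch (this specific choice of P is an assumption, not necessarily the paper's) that verifies the two claimed properties, columns orthogonal to 1 and full column rank:

```python
import numpy as np

def difference_basis(n):
    """An n x (n-1) matrix whose columns e_i - e_{i+1} form a sparse basis
    for the orthogonal complement of span{1}. One standard choice; the
    paper's exact P is not reproduced here."""
    P = np.zeros((n, n - 1))
    for i in range(n - 1):
        P[i, i] = 1.0
        P[i + 1, i] = -1.0
    return P
```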

Conclusions
We presented a Bregman proximal algorithm for the centering problem in sparse semidefinite programming. The Bregman distance used in the proximal operator is generated by the logarithmic barrier function for the cone of sparse matrices with a positive semidefinite completion. With this choice of Bregman distance, the per-iteration complexity of the algorithm is dominated by the cost of a Cholesky factorization with the aggregate sparsity pattern of the SDP, plus the cost of evaluating the linear mapping in the constraints and its adjoint.
The proximal algorithm we used is based on the primal-dual method proposed by Chambolle and Pock [22]. An important addition to the algorithm is a new procedure for selecting the primal and dual step sizes, without knowledge of the norm of the linear mapping or the strong convexity constant of the Bregman kernel. In the current implementation the ratio of the primal and dual step sizes is kept fixed throughout the iteration. An interesting further improvement would be to relax this condition, choosing the ratio ρ = σ_k∕τ_k adaptively [5, 39].
The standard primal-dual hybrid gradient algorithm is known to include several important algorithms as special cases. The Bregman extension of the algorithm is equally versatile. We mention one interesting example. Suppose the matrix A in (25) is a product of two matrices A = CB. Then (25) is equivalent to a reformulated problem with g(y) = δ_{{b}}(Cy), the indicator of the set {b} evaluated at Cy. The standard (Euclidean) proximal operator of g is a least norm mapping. The PDHG algorithm applied to the reformulated problem requires in each iteration an evaluation of the Bregman proximal operator of f, matrix-vector products with B and B^T, and the solution of the least norm problem in the definition of prox_g. For C = A, B = I, this can be interpreted as a Bregman extension of the Douglas-Rachford algorithm, or of Spingarn's method for convex optimization with equality constraints.
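The least norm problem in the definition of prox_g is a Euclidean projection onto the affine set {y | Cy = b}. A minimal sketch, assuming C has full row rank so that CC^T is invertible (a dense solve is used here purely for illustration):

```python
import numpy as np

def prox_indicator_affine(C, b, v):
    """Euclidean projection of v onto {y : C y = b}, i.e. the prox of
    g(y) = indicator of {b} composed with C. Assumes full row rank C."""
    r = C @ v - b                      # residual of the affine constraint
    return v - C.T @ np.linalg.solve(C @ C.T, r)
```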
(x⋆, z⋆) is a saddle point of the Lagrangian if the two equalities in (29) hold. Existence of a saddle point is equivalent to the property that the primal and dual optimal values are equal and attained. The left-hand equality in (29) holds if and only if Ax⋆ = b. The right-hand equality holds if and only if −A^T z⋆ ∈ ∂f(x⋆). Hence (x⋆, z⋆) is a saddle point if and only if it satisfies the optimality conditions.

Table 2
The four MAXCUT problems from SDPLIB plus four larger graphs from the SuiteSparse collection [36]. The last column ('PDHG iterations') gives the number of iterations in the primal-dual algorithm. Columns 3-5 describe the complexity of one iteration of the algorithm. The CPU time is measured in seconds

Table 3
Results for four graph partitioning problems from SDPLIB [15]. Column 3 is the optimal value computed by MOSEK. Column 4 is the difference with the optimal value of the centering problem computed by algorithm (33). The last two columns give the primal and dual residuals in the computed solution

Table 4
The four graph partitioning problems from SDPLIB plus four larger graphs from the SuiteSparse collection [36]. The last column gives the number of iterations in the primal-dual algorithm. Columns 3-5 describe the complexity of one iteration of the algorithm. The CPU time is measured in seconds