1 Introduction

Optimization methods based on Bregman distances offer the possibility of matching the Bregman distance to the structure in the problem, with the goal of reducing the complexity per iteration. In this paper, we apply this idea to the centering problem in sparse semidefinite programming. The paper is motivated by the difficulty of exploiting sparsity in large-scale semidefinite programming in general and, for proximal methods, the need for eigendecompositions to compute Euclidean projections on the positive semidefinite matrix cone. By replacing the Euclidean projection with a generalized Bregman projection, we take advantage of the efficiency and scalability of algorithms for sparse Cholesky factorization and several related computations [3, 54].

We consider semidefinite programs (SDPs) in the standard form

$$\begin{aligned} \begin{array}{llllll} \text{ primal: } & \text{ minimize } & \mathop \mathbf{tr}(CX) & \text{ dual: } & \text{ maximize } & b^T y \\ &\text{ subject } \text{ to } &{\mathcal {A}}(X)=b \;\;\;\;\qquad \quad & &\text{ subject } \text{ to } & {\mathcal {A}}^*(y)+S=C \\ & & X \in \mathbf {S}^{n}_{+} & & & S \in \mathbf {S}^{n}_{+}, \end{array} \end{aligned}$$
(1)

with primal variable \(X\in \mathbf {S}^n\) and dual variables \(S\in \mathbf {S}^n\), \(y\in {\mathbf{R}}^m\), where \(\mathbf {S}^n\) is the set of symmetric \(n\times n\) matrices. The linear operator \({\mathcal {A}} :\mathbf {S}^n \rightarrow {\mathbf{R}}^m\) is defined as

$$\begin{aligned} {\mathcal {A}}(X) = \big (\mathop \mathbf{tr}(A_1X), \mathop \mathbf{tr}(A_2X), \ldots , \mathop \mathbf{tr}(A_mX) \big ) \end{aligned}$$

and \({\mathcal {A}}^*(y) = \sum _{i=1}^m y_iA_i\) is its adjoint operator. The coefficients \(C, A_1,\ldots ,A_m\) are symmetric \(n \times n\) matrices. The notation \(\mathbf {S}^{n}_{+}\) is used for the cone of positive semidefinite (PSD) matrices in \(\mathbf {S}^n\).
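As a concrete illustration, the operator \({\mathcal {A}}\) and its adjoint can be realized in a few lines of dense NumPy code (a sketch only; the function names are ours, and an implementation exploiting sparsity would store only the entries in the pattern E):

```python
import numpy as np

def A_op(mats, X):
    """Apply the linear map A(X) = (tr(A_1 X), ..., tr(A_m X))."""
    return np.array([np.trace(Ai @ X) for Ai in mats])

def A_adj(mats, y):
    """Apply the adjoint A*(y) = sum_i y_i A_i."""
    return sum(yi * Ai for yi, Ai in zip(y, mats))

# Adjoint check: <A(X), y> = <X, A*(y)> for random symmetric data
rng = np.random.default_rng(0)
n, m = 5, 3
mats = [(M + M.T) / 2 for M in rng.standard_normal((m, n, n))]
X = rng.standard_normal((n, n)); X = (X + X.T) / 2
y = rng.standard_normal(m)
lhs = A_op(mats, X) @ y
rhs = np.trace(A_adj(mats, y) @ X)
assert abs(lhs - rhs) < 1e-8
```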

In many large-scale applications of semidefinite programming, the coefficient matrices are sparse. The sparsity pattern of a symmetric \(n\times n\) matrix can be represented by an undirected graph \(G= (V,E)\) with vertex set \(V=\{1,2,\ldots ,n\}\) and edge set E. The set of matrices with sparsity pattern E is then defined as

$$\begin{aligned} \mathbf {S}^{n}_{E} = \{ Y \in \mathbf {S}^n \mid Y_{ij} = Y_{ji} = 0 \;\; \text{ if } i\ne j \text{ and } \{i,j\}\not \in E\}. \end{aligned}$$

In this paper, E will denote the common (or aggregate) sparsity pattern of the coefficient matrices in the SDP, i.e., we assume that \(C, A_1,\ldots , A_m\in \mathbf {S}^{n}_{E}\). Note that the sparsity pattern E is not uniquely defined (unless it is dense, i.e., the sparsity graph G is complete): if the coefficients are in \(\mathbf {S}^{n}_{E}\) then they are also in \(\mathbf {S}^{n}_{E'}\) where \(E \subset E'\). In particular, E can always be extended to make the graph \(G=(V,E)\) chordal or triangulated [14, 54]. Without loss of generality, we will assume that this is the case.
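Such a chordal extension can be computed with standard elimination heuristics. The following sketch uses the elimination game (our own minimal helper, not the specific methods of [14, 54]): eliminating the vertices in some order and connecting all higher-ordered neighbors of each eliminated vertex always produces a chordal supergraph.

```python
def chordal_extension(n, edges, order=None):
    """Extend an edge set E so that the graph G = (V, E') is chordal,
    using the elimination game: eliminate vertices in the given order,
    connecting all higher-ordered neighbors of each eliminated vertex."""
    order = list(range(n)) if order is None else order
    pos = {v: k for k, v in enumerate(order)}
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    fill = {frozenset(e) for e in edges}
    for v in order:
        later = [w for w in adj[v] if pos[w] > pos[v]]
        for a in range(len(later)):
            for b in range(a + 1, len(later)):
                u, w = later[a], later[b]
                if w not in adj[u]:
                    adj[u].add(w)
                    adj[w].add(u)
                    fill.add(frozenset((u, w)))
    return sorted(tuple(sorted(e)) for e in fill)

# Example: a 4-cycle (not chordal) is triangulated by one added chord
ext = chordal_extension(4, [(0, 1), (1, 2), (2, 3), (0, 3)])
assert (1, 3) in ext and len(ext) == 5
```

In practice the elimination order would be chosen by a fill-reducing heuristic, since the amount of fill-in determines the cost of the sparse Cholesky factorizations used later.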

The primal variable X in (1) generally needs to be dense to be feasible. However, the cost function and the linear equality constraints only depend on the diagonal entries \(X_{ii}\) and the off-diagonal entries \(X_{ij}=X_{ji}\) for \(\{i,j\} \in E\). For the other entries the only requirement is to make the matrix positive semidefinite. In the dual problem, \(S\in \mathbf {S}^{n}_{E}\) holds at all dual feasible points. These observations imply that the SDPs (1) can be equivalently rewritten as a pair of primal and dual conic linear programs

$$\begin{aligned} \begin{array}{llllll} \text{ primal: } & \text{ minimize } & \mathop \mathbf{tr}(CX) & \text{ dual: } & \text{ maximize } & b^Ty \\ &\text{ subject } \text{ to } &{\mathcal {A}}(X)=b \;\;\;\;\qquad \quad & &\text{ subject } \text{ to } & {\mathcal {A}}^*(y)+S=C \\ & & X \in K & & & S \in K^*, \end{array} \end{aligned}$$
(2)

with sparse matrix variables \(X,S\in \mathbf {S}^{n}_{E}\), and a vector variable \(y \in {\mathbf{R}}^m\). The primal cone K in this problem is the set of matrices in \(\mathbf {S}^{n}_{E}\) which have a positive semidefinite completion, i.e., \(K = \Pi _{E}(\mathbf {S}_+^{n})\) where \(\Pi _E\) stands for projection on \(\mathbf {S}^{n}_{E}\). The dual cone \(K^*\) of K is the set of positive semidefinite matrices with sparsity pattern E, i.e., \(K^*=\mathbf {S}_+^{n} \cap \mathbf {S}^{n}_{E}\). The formulation (2) is attractive when the aggregate sparsity pattern E is very sparse, in which case \(\mathbf {S}^{n}_{E}\) is a much lower-dimensional space than \(\mathbf {S}^n\).

The centering problem for the sparse SDP (2) is

$$\begin{aligned} \begin{array}{ll} \text{ minimize } & \mathop \mathbf{tr}(CX) + \mu \phi (X) \\ \text{ subject } \text{ to } & {\mathcal {A}}(X) = b, \end{array} \end{aligned}$$
(3)

where \(\phi\) is the logarithmic barrier function for the cone K, defined as

$$\begin{aligned} \phi (X)= \sup _{S\in \mathop \mathbf{int}K^*} {(-\mathop \mathbf{tr}(XS) +\log \det S)}. \end{aligned}$$

The centering parameter \(\mu >0\) controls the duality gap at the solution. Since the barrier function \(\phi\) is n-logarithmically homogeneous, the optimal solution of the centering problem is a \((\mu n)\)-suboptimal solution for the original SDP (2). The centering problem (3) is useful as an approximation to the original problem, because it yields more easily computed suboptimal solutions, with an accuracy that can be controlled by the choice of barrier parameter. The centering problem is also a key component of barrier methods, in which a sequence of centering problems with decreasing values of the barrier parameter is solved. Traditionally, the centering problem in interior-point methods is solved by Newton’s method, possibly accelerated via the preconditioned conjugate gradient method [10, 55], but recent work has started to examine the use of proximal methods such as the alternating direction method of multipliers (ADMM) or the proximal method of multipliers for this purpose [37, 48].

Contributions The contribution of this paper is two-fold. First, we formulate a non-Euclidean (Bregman) proximal method for the centering problem of the sparse SDP. In the proposed method, the proximal operators are replaced by generalized proximal operators defined in terms of a Bregman generalized distance or divergence. We show that if the Bregman divergence generated by the barrier function \(\phi\) for the cone K is used, the generalized projections can be computed very efficiently, with a complexity dominated by the cost of a sparse Cholesky factorization with sparsity pattern E. This is much cheaper than the eigenvalue decomposition needed to compute a Euclidean projection on the positive semidefinite cone. Hence, while the method only solves an approximation of the SDP (2), it can handle problem sizes that are orders of magnitude larger than the problems solved by standard interior-point and proximal first-order methods.

For the solution of the centering problem, we apply a variant of the primal–dual method proposed by Chambolle and Pock [22]. The version of the algorithm described in [22] requires careful tuning of primal and dual step size parameters. Acceptable values of the step sizes depend on the norm of the linear operator \({\mathcal {A}}\) and the strong convexity constants for the distance function. These parameters are often difficult to estimate in practice. As a second contribution, we propose a new version of the algorithm, in which the step sizes are not fixed parameters, but are selected using an easily implemented line search procedure. We give a detailed convergence analysis of the algorithm with line search and show an O(1/k) ergodic convergence rate, which is consistent with previous results in [22, 39].

Related work Sparse structure in semidefinite programming has been extensively studied by many authors. The scalability of interior-point methods is limited by the need to form and solve a set of m linear equations in m variables, known as the Schur complement system, at each iteration. This system is usually dense. Sparsity in the coefficients \(A_i\) can be exploited to reduce the cost of assembling the Schur complement equations. This is especially efficient in extremely sparse problems, in which the coefficients \(A_i\) may also have low rank. In dual barrier methods, one can also take advantage of the sparsity of dual feasible variables S. These properties are leveraged in the dual interior-point methods described in [9,10,11,12,13].

In another line of research, techniques based on properties and algorithms for chordal sparsity patterns have been applied to semidefinite programming since the late 1990s [3, 13, 18, 29, 30, 34, 35, 42, 46, 50, 51, 58]; see [54, 60] for recent surveys. An important tool from this literature is the conversion or clique decomposition method proposed by Fukuda et al. [30, 42]. It is based on a fundamental result from linear algebra, stating that for a chordal pattern E, a matrix \(X\in \mathbf {S}^{n}_{E}\) has a positive semidefinite completion if and only if \(X_{\gamma _k\gamma _k} \succeq 0\) for \(k=1,\ldots ,r\), where \(\gamma _1\), ..., \(\gamma _r\) are the maximal cliques in the graph [31]. In the conversion method, the large sparse variable matrix X in (2) is replaced with smaller dense matrix variables \(X_k = X_{\gamma _k\gamma _k}\). Each of these new variables is constrained to be positive semidefinite. Linear equality constraints need to be added to couple the variables \(X_k\), as they represent overlapping subblocks of a single matrix X. Thus, a large sparse SDP is converted into an equivalent problem with several smaller, dense variables \(X_k\), and additional sparse equality constraints. This equivalent problem may be considerably easier to solve by interior-point methods than the original SDP (1). Recent examples where the clique decomposition is applied to solve large sparse SDPs can be found in [27, 58].

Proximal splitting methods, such as (accelerated) proximal gradient methods [7, 8, 43], ADMM [16], and the primal–dual hybrid gradient (PDHG) or Chambolle–Pock method [20, 28, 47], are perhaps the most popular alternatives to interior-point methods in machine learning, image processing, and other applications involving large-scale convex programming. When applied to the SDPs (1), they require at each iteration a Euclidean projection on the positive semidefinite cone \(\mathbf {S}^n_+\), hence, a symmetric eigenvalue decomposition of order n. This contributes an order \(n^3\) term to the per-iteration complexity. In the nonsymmetric formulation (2) of the sparse SDP, the projections on \(K^*\) or (equivalently) K cannot be computed directly, and must be handled by introducing splitting variables and alternating projection on \(\mathbf {S}^{n}_{E}\), which is trivial, and on \(\mathbf {S}^n_+\), which requires an eigenvalue decomposition. The clique decomposition used in the conversion method described above, which was originally developed for interior-point methods, lends itself naturally to splitting algorithms as well. It allows us to replace the matrix constraint \(X\in K\) with several smaller dense inequalities \(X_k\succeq 0\), one for each maximal clique in the sparsity graph. In a proximal method, this means that projection on the \(n\times n\) positive semidefinite cone can be replaced by less expensive projections on lower-dimensional positive semidefinite cones [38, 52, 59, 61]. This advantage of the conversion method is tempered by the large number of consistency constraints that must be introduced to link the splitting variables \(X_k\). First-order methods typically do not compute very accurate solutions and if the residual error in the consistency constraints is not small, it may be difficult to convert the computed solution of the decomposed problem back to an accurate solution of the original SDP [27].

Outline The rest of the paper is organized as follows. In Sect. 2 we describe the Bregman distance generated by the barrier function and show how generalized projections can be efficiently computed without expensive eigenvalue decomposition. The primal–dual proximal algorithm and its convergence are discussed in Sect. 3. Section 4 contains results of numerical experiments.

2 Barrier proximal operator for sparse PSD matrix cone

2.1 Centering problem

We will assume that the equality constraints in (2) include an equality constraint \(\mathop \mathbf{tr}(NX)=1\), where \(N\in \mathbf {S}_{++}^{n} \cap \mathbf {S}^{n}_{E}\). To make this explicit we write the centering problem (3) as

$$\begin{aligned} \begin{array}{ll} \text{ minimize } & \mathop \mathbf{tr}(CX) + \mu \phi (X) \\ \text{ subject } \text{ to } & {\mathcal {A}}(X) = b, \\ & \mathop \mathbf{tr}(NX) = 1. \end{array} \end{aligned}$$
(4)

For \(N=I\), the normalized cone \(\{X\in K \mid \mathop \mathbf{tr}(NX) = 1\}\) is a matrix extension of the probability simplex \(\{x \succeq 0 \mid \varvec{1}^Tx=1\}\), sometimes referred to as the spectraplex. With minor changes, the techniques we discuss extend to a normalization in the inequality form \(\mathop \mathbf{tr}(NX) \le 1\), with \(N\in \mathbf {S}_{++}^{n} \cap \mathbf {S}^{n}_{E}\). However, we will discuss (4) to retain the standard form of the centering problem.

The constraints \(\mathop \mathbf{tr}(NX) = 1\) and \(\mathop \mathbf{tr}(NX)\le 1\) guarantee the boundedness of the primal feasible set, a common assumption in first-order methods. The added constraint does not diminish the generality of our approach. In many applications an equality \(\mathop \mathbf{tr}(NX)=1\) is implied by the constraints \({\mathcal {A}}(X) = b\) and easily derived from the problem data (see Sect. 4 for two typical examples). When an equality constraint of this form is not readily available, one can add a bounding inequality \(\mathop \mathbf{tr}(NX) \le 1\) with N sufficiently small to ensure that the optimal solution is not modified.

To apply first-order proximal methods, we view the problem (4) as a linearly constrained optimization problem

$$\begin{aligned} \begin{array}{ll} \text{ minimize } & f(X) \\ \text{ subject } \text{ to } & {\mathcal {A}}(X)=b, \end{array} \end{aligned}$$
(5)

where f is defined as

$$\begin{aligned} f(X)=\mathop \mathbf{tr}(CX)+\mu \phi (X) + \delta _{{\mathcal {H}}}(X), \qquad {\mathcal {H}} = \{X \in \mathbf {S}^{n}_{E}\mid \mathop \mathbf{tr}(NX) = 1\}, \end{aligned}$$
(6)

and \(\delta _{{\mathcal {H}}}\) is the indicator function of the hyperplane \({\mathcal {H}}\). The algorithm we apply to (5) can be summarized as

$$\begin{aligned} {\bar{z}}_{k+1}= & z_k + \theta _k (z_k - z_{k-1}) \end{aligned}$$
(7a)
$$\begin{aligned} X_{k+1}= & \mathop \mathrm{argmin}_X {\big (f(X)+{\bar{z}}_{k+1}^T{\mathcal {A}}(X) +\frac{1}{\tau _k} d(X,X_k)\big )} \end{aligned}$$
(7b)
$$\begin{aligned} z_{k+1}= & z_k + \sigma _k ({\mathcal {A}}(X_{k+1})-b) \end{aligned}$$
(7c)

where d is the Bregman distance generated by the barrier function \(\phi\):

$$\begin{aligned} d(X,Y) = \phi (X) - \phi (Y) - \mathop \mathbf{tr}(\nabla \phi (Y)(X-Y)). \end{aligned}$$

The choices of \(\theta _k\), \(\sigma _k\), and \(\tau _k\), together with the details and origins of the algorithm, will be discussed in Sect. 3. In the remainder of this section we focus on the most expensive step in the algorithm, the optimization problem in the X-update (7b).

In Sects. 2.2 and 2.3 we first review some facts from the theory of generalized distances and the logarithmic barrier functions for the primal and dual cones K and \(K^*\). Sections 2.4 and 2.5 describe the details of the barrier kernel and the associated generalized proximal operator applied in (7b).

2.2 Bregman distance

Let h be a convex function, defined on a domain that has nonempty interior, and suppose h is continuously differentiable on \(\mathop \mathbf{int}{(\mathop \mathbf{dom}h)}\). The generalized distance generated by h is defined as the function

$$\begin{aligned} d(x,y) = h(x) - h(y) - \langle \nabla h(y), x-y\rangle , \end{aligned}$$

with domain \(\mathop \mathbf{dom}d = \mathop \mathbf{dom}h \times \mathop \mathbf{int}{(\mathop \mathbf{dom}h)}\). The function h is called the kernel function that generates the generalized distance d. For \(h(x) = \Vert x\Vert _2^2/2\) and the standard inner product \(\langle u,v\rangle = u^Tv\), we obtain \(d(x,y)= \Vert x-y\Vert _2^2/2\). The best known non-quadratic example is the relative entropy

$$\begin{aligned} d(x,y) = \sum _{i=1}^n (x_i \log (x_i/y_i) - x_i + y_i), \qquad \mathop \mathbf{dom}d = {\mathbf{R}}^n_+ \times {\mathbf{R}}^n_{++}. \end{aligned}$$

This generalized distance is generated by the kernel \(h(x) = \sum _i x_i \log x_i\), if we use the standard inner product.
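As a numerical sanity check (with helper names of our choosing), applying the Bregman definition to this kernel reproduces the relative entropy:

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """d(x, y) = h(x) - h(y) - <grad h(y), x - y>."""
    return h(x) - h(y) - grad_h(y) @ (x - y)

h = lambda x: np.sum(x * np.log(x))      # negative entropy kernel
grad_h = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.25, 0.25, 0.5])
d = bregman(h, grad_h, x, y)
kl = np.sum(x * np.log(x / y) - x + y)   # closed-form relative entropy
assert abs(d - kl) < 1e-10
```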

Generalized distances are not necessarily symmetric (\(d(x,y) \ne d(y,x)\) in general) but share some other important properties with the squared Euclidean norm. An important example is the triangle identity [23, Lemma 3.1]

$$\begin{aligned} \langle \nabla h(y) - \nabla h(z), x-y \rangle = d(x,z) - d(x,y) - d(y,z) \end{aligned}$$
(8)

which holds for all \(x\in \mathop \mathbf{dom}h\) and \(y, z\in \mathop \mathbf{int}{(\mathop \mathbf{dom}h)}\). This generalizes the identity

$$\begin{aligned} (y-z)^T(x-y) = \frac{1}{2} \left( \Vert x-z\Vert _2^2 - \Vert x-y\Vert _2^2 - \Vert y-z\Vert _2^2\right) . \end{aligned}$$
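The identity (8) is easy to verify numerically, for instance for the entropy kernel (a small sketch):

```python
import numpy as np

def d_kl(x, y):
    """Relative entropy, the Bregman distance of h(x) = sum x_i log x_i."""
    return np.sum(x * np.log(x / y) - x + y)

grad_h = lambda v: np.log(v) + 1.0

# Three random positive points; check the triangle identity (8)
rng = np.random.default_rng(1)
x, y, z = rng.uniform(0.1, 1.0, (3, 4))
lhs = (grad_h(y) - grad_h(z)) @ (x - y)
rhs = d_kl(x, z) - d_kl(x, y) - d_kl(y, z)
assert abs(lhs - rhs) < 1e-10
```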

Additional conditions may have to be imposed on the kernel function h, depending on the application and the algorithm in which the generalized distance is used [19]. For now we only assume convexity and continuous differentiability on the interior of the domain. Other properties will be mentioned when needed.

The proximal operator of a closed convex function f is defined as

$$\begin{aligned} \mathrm {prox}_f(y) = \mathop \mathrm{argmin}_x{(f(x) + \frac{1}{2} \Vert x-y\Vert _2^2)}. \end{aligned}$$

If f is closed and convex, then the minimizer in the definition exists and is unique for all y [40]. We will use the following extension to generalized distances. Suppose f is a convex function with the property that for every a and every \(y\in \mathop \mathbf{int}{(\mathop \mathbf{dom}h)}\), the optimization problem

$$\begin{aligned} \text{ minimize } \quad f(x) + \langle a, x \rangle + d(x,y) \end{aligned}$$
(9)

has a unique solution \({\hat{x}}\) in \(\mathop \mathbf{int}{(\mathop \mathbf{dom}h)}\). Then we denote the minimizer \({\hat{x}}\) by

$$\begin{aligned} \mathrm {prox}_f^d(y,a)= & \mathop \mathrm{argmin}_x{(f(x) + \langle a, x \rangle + d(x,y))} \nonumber \\= & \mathop \mathrm{argmin}_x{(f(x) + \langle a, x \rangle + h(x) - \langle \nabla h(y), x\rangle )} \end{aligned}$$
(10)

and call the mapping \(\mathrm {prox}_f^d\) the generalized proximal operator of f. From the second expression we see that \({\hat{x}}=\mathrm {prox}_f^d(y,a)\) satisfies

$$\begin{aligned} \nabla h(y) - \nabla h({\hat{x}}) - a \in \partial f({\hat{x}}). \end{aligned}$$
(11)

If \(d(x,y) = \Vert x-y\Vert _2^2/2\), it is easily verified that \(\mathrm {prox}_f^d(y,a) = \mathrm {prox}_f (y-a)\), where \(\mathrm {prox}_f\) is the standard proximal operator.

In contrast to the Euclidean case, it is difficult to give simple general conditions that guarantee that for every a and every \(y\in \mathop \mathbf{int}{(\mathop \mathbf{dom}h)}\) the problem (9) has a unique solution in \(\mathop \mathbf{int}{(\mathop \mathbf{dom}h)}\). However, we will use the definition only for specific combinations of f and d, for which problem (9) is particularly easy to solve. In those applications, existence and uniqueness of the solution follow directly from the availability of a fast algorithm for computing it. A classical example is the relative entropy distance with f given by the indicator function of the hyperplane \(\{ x\mid \varvec{1}^Tx = 1\}\). Problem (9) can be written as

$$\begin{aligned} \begin{array}{ll} \text{ minimize } & a^Tx + \sum \limits _{i=1}^n (x_i\log (x_i/y_i) - x_i) \\ \text{ subject } \text{ to } &\varvec{1}^T x= 1. \end{array} \end{aligned}$$

For any a and any positive y, the solution of (9) is unique and equal to the positive vector

$$\begin{aligned} \mathrm {prox}_f^d(y,a) = \frac{1}{\sum _{i=1}^n y_i e^{-a_i}} \left[ \begin{array}{c} y_1 e^{-a_1} \\ \vdots \\ y_n e^{-a_n} \end{array}\right] . \end{aligned}$$
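This closed form can be checked numerically (a sketch; `entropy_prox` is our name). The result lies on the simplex, and the stationarity condition for (9) holds: \(a + \log (x/y)\) is constant across the components.

```python
import numpy as np

def entropy_prox(y, a):
    """Generalized prox of the simplex indicator under relative entropy:
    argmin_{1^T x = 1} a^T x + sum_i (x_i log(x_i / y_i) - x_i)."""
    w = y * np.exp(-a)
    return w / w.sum()

rng = np.random.default_rng(2)
y = rng.uniform(0.1, 1.0, 5)
a = rng.standard_normal(5)
x = entropy_prox(y, a)
assert abs(x.sum() - 1.0) < 1e-12 and np.all(x > 0)

# Stationarity: a + log(x/y) must be constant on the simplex
g = a + np.log(x / y)
assert np.allclose(g, g[0])
```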

Research on proximal methods for semidefinite programming has been largely based on the standard Euclidean proximal operators and the distance defined by the matrix entropy [6]. For these distances, projections on the positive semidefinite cone require eigenvalue decompositions, which limits the size of the variables that can be handled and precludes applications to large sparse SDPs. In the following sections, we introduce a generalized proximal operator designed for sparse semidefinite programming. The generalized proximal operator can be evaluated via a simple iterative algorithm with a complexity dominated by the cost of a sparse Cholesky factorization.

2.3 Primal and dual barrier

The logarithmic barrier functions for the cones \(K^*=\mathbf {S}_+^{n} \cap \mathbf {S}^{n}_{E}\) and \(K=\Pi _{E}(\mathbf {S}_+^{n})\) are defined as

$$\begin{aligned} \phi _*(S) = -\log \det S, \qquad \phi (X) = \sup _S \big ({-\mathop \mathbf{tr}(XS)}-\phi _*(S)\big ), \end{aligned}$$
(12)

with domains \(\mathop \mathbf{dom}\phi _*= \mathop \mathbf{int}K^*\) and \(\mathop \mathbf{dom}\phi = \mathop \mathbf{int}K\), respectively. Note that \(\phi (X)\) is the conjugate of \(\phi _*\) evaluated at \(-X\).

In [3, 54] efficient algorithms are presented for evaluating the two barrier functions, their gradients, and their directional second derivatives, when the sparsity pattern E is chordal. The value of the dual barrier \(\phi _*(S)=-\log \det S\) is easily computed from the diagonal entries in a sparse Cholesky factor of S. The gradient and Hessian are given by

$$\begin{aligned} \nabla \phi _*(S) = -\Pi _E(S^{-1}), \qquad \nabla ^2 \phi _*(S)[V] = \frac{d}{dt} \nabla \phi _*(S+tV) = \Pi _E(S^{-1} V S^{-1}). \end{aligned}$$
(13)

Given a Cholesky factorization of S, these expressions can be evaluated via one or two recursions on the elimination tree [3, 54], without explicitly computing the entire inverse \(S^{-1}\) or the matrix product \(S^{-1}VS^{-1}\). The cost of these recursions is roughly the same as the cost of a sparse Cholesky factorization with the sparsity pattern E [3, 54].
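In the dense case, the gradient expression in (13) can be verified directly against a finite difference (an illustration only; the point of the algorithms in [3, 54] is precisely to avoid forming \(S^{-1}\) explicitly):

```python
import numpy as np

def proj_E(M, mask):
    """Projection on S^n_E: zero the entries outside the pattern."""
    return np.where(mask, M, 0.0)

def phi_star(S):
    """Dual barrier -log det S from the diagonal of a Cholesky factor."""
    L = np.linalg.cholesky(S)
    return -2.0 * np.sum(np.log(np.diag(L)))

n = 5
mask = np.eye(n, dtype=bool)
for i in range(n - 1):                    # chain pattern (chordal)
    mask[i, i + 1] = mask[i + 1, i] = True
S = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # PD, pattern E

G = -proj_E(np.linalg.inv(S), mask)       # claimed gradient (13)

# Finite-difference check along a random direction V in S^n_E
rng = np.random.default_rng(4)
V = rng.standard_normal((n, n)); V = proj_E((V + V.T) / 2, mask)
t = 1e-6
fd = (phi_star(S + t * V) - phi_star(S - t * V)) / (2 * t)
assert abs(fd - np.trace(G @ V)) < 1e-5
```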

The primal barrier function \(\phi\) and its gradient can be evaluated by solving the optimization problem in the definition of \(\phi (X)\). The optimal solution \({\hat{S}}_X\) is the matrix in \(\mathbf {S}_{++}^{n} \cap \mathbf {S}^{n}_{E}\) that satisfies

$$\begin{aligned} \Pi _E({\hat{S}}_X^{-1}) = X. \end{aligned}$$
(14)

Its inverse \({\hat{S}}_X^{-1}\) is also the maximum determinant positive definite completion of X, i.e., \(Z={\hat{S}}_X^{-1}\) is the solution of

$$\begin{aligned} \begin{array}{ll} \text{ maximize } &\log \det Z \\ \text{ subject } \text{ to } & \Pi _E(Z)=X \end{array} \end{aligned}$$
(15)

(where we take \(\mathbf {S}^n_{++}\) as the domain of the cost function). From \({\hat{S}}_X\), one obtains

$$\begin{aligned} \phi (X) = \log \det {\hat{S}}_X -n, \quad \nabla \phi (X) = -{\hat{S}}_X, \quad \nabla ^2 \phi (X) = \nabla ^2 \phi _*({\hat{S}}_X)^{-1}. \end{aligned}$$
(16)

Comparing the expressions for the gradients of \(\phi\) and \(\phi _*\) in (16) and (13), and using (14), we see that \(\nabla \phi\) and \(\nabla \phi _*\) are inverse mappings, up to a change in sign:

$$\begin{aligned} \nabla \phi (X) = -{\hat{S}}_X = -(\nabla \phi _*)^{-1}(-X), \qquad \nabla \phi _*(S) = - (\nabla \phi )^{-1}(-S). \end{aligned}$$

For general sparsity patterns, the determinant maximization problem (15) or the convex optimization problem in the definition of \(\phi\) must be solved by an iterative optimization algorithm. If the pattern is chordal, these optimization problems can be solved by finite recursive algorithms, again at a cost that is comparable with the cost of a sparse Cholesky factorization for the same pattern [3, 54].
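For the complete pattern, \({\hat{S}}_X = X^{-1}\) and both gradient maps reduce to matrix inverses up to sign, which makes the inverse relation easy to check numerically (a dense illustration only, not the sparse chordal computation):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
A = rng.standard_normal((n, n))
X = A @ A.T + np.eye(n)                  # positive definite

# Dense case: phi(X) = -log det X - n, so grad phi(X) = -X^{-1},
# and grad phi_*(S) = -S^{-1}
grad_phi = lambda X: -np.linalg.inv(X)
grad_phi_star = lambda S: -np.linalg.inv(S)

# Inverse mappings up to sign: grad phi_*(-grad phi(X)) = -X
assert np.allclose(grad_phi_star(-grad_phi(X)), -X)
```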

2.4 Barrier kernel

The primal barrier function \(\phi\) is convex, continuously differentiable on the interior of the cone, and strongly convex on \(\mathop \mathbf{int}K \cap \{X \mid \mathop \mathbf{tr}(NX) = 1\}\). It generates the Bregman divergence

$$\begin{aligned} d(X,Y)= & \phi (X) - \phi (Y) - \mathop \mathbf{tr}{(\nabla \phi (Y)(X-Y))} \\= & \phi (X) - \log \det {\hat{S}}_Y + n + \mathop \mathbf{tr}{({\hat{S}}_Y (X-Y))} \\= & \phi (X) - \log \det {\hat{S}}_Y + \mathop \mathbf{tr}{({\hat{S}}_Y X)}. \end{aligned}$$

On the second line we used the properties (16) to express \(\phi (Y)\) and \(\nabla \phi (Y)\). The generalized proximal operator (10) for the function f defined in (6), which is the key step in the X-update (7b) of algorithm (7), then becomes

$$\begin{aligned} {\hat{X}}= & \mathrm {prox}^d_f (Y,A) \\= & \mathop \mathrm{argmin}_{\mathop \mathbf{tr}(NX)=1} {\left( \mathop \mathbf{tr}(CX) + \mu \phi (X) + \mathop \mathbf{tr}(AX) + d(X,Y)\right) } \\= & \mathop \mathrm{argmin}_{\mathop \mathbf{tr}(NX)=1} {\left( \mathop \mathbf{tr}\big ((C + A-\nabla \phi (Y))X\big ) +(\mu +1) \phi (X) \right) } \\= & \mathop \mathrm{argmin}_{\mathop \mathbf{tr}(NX)=1} {(\mathop \mathbf{tr}(BX) + \phi (X))} \end{aligned}$$

where

$$\begin{aligned} B = \frac{1}{1+\mu } (C + A + {\hat{S}}_Y). \end{aligned}$$

To compute \({\hat{X}}\) we therefore need to solve an optimization problem

$$\begin{aligned} \begin{array}{ll} \text{ minimize } & \mathop \mathbf{tr}(BX) + \phi (X) \\ \text{ subject } \text{ to } & \mathop \mathbf{tr}(NX)=1, \end{array} \end{aligned}$$
(17)

where \(B\in \mathbf {S}^{n}_{E}\) and \(N\in \mathbf {S}_{++}^{n} \cap \mathbf {S}^{n}_{E}\). If we introduce a Lagrange multiplier \(\nu\) for the equality constraint in (17), the optimality condition can be written as

$$\begin{aligned} \nabla \phi (X) + B+\nu N = 0, \qquad \mathop \mathbf{tr}(NX) = 1. \end{aligned}$$

Equivalently, since \(\nabla \phi _*(S)= -(\nabla \phi )^{-1}(-S)\),

$$\begin{aligned} X = -\nabla \phi _*(B+\nu N) = \Pi _E((B+\nu N)^{-1}), \qquad \mathop \mathbf{tr}(NX)=1. \end{aligned}$$

Eliminating X we obtain a nonlinear equation in \(\nu\):

$$\begin{aligned} \mathop \mathbf{tr}(N (B+\nu N)^{-1}) = 1. \end{aligned}$$
(18)

(The projection in \(\mathop \mathbf{tr}(N\Pi _E((B+\nu N)^{-1}))\) can be omitted because the matrix N has the sparsity pattern E.) The unique solution \(\nu\) that satisfies \(B+\nu N \succ 0\) defines the solution \(X = \Pi _E ((B+\nu N)^{-1})\) of (17).

Equation (18) is also the optimality condition for the Lagrange dual of (17), which is a smooth unconstrained convex optimization problem in the scalar variable \(\nu\):

$$\begin{aligned} \begin{array}{ll} \text{ maximize }&-\phi _*(B+\nu N) - \nu . \end{array} \end{aligned}$$
(19)

2.5 Newton method for barrier proximal operator

In this section we discuss in detail Newton’s method applied to the dual problem (19) and the equivalent nonlinear equation (18). We write the equation as \(\zeta (\nu ) = 1\) where

$$\begin{aligned} \zeta (\nu ) = \mathop \mathbf{tr}(N (B+\nu N)^{-1}), \quad \zeta '(\nu ) = -\mathop \mathbf{tr}(N(B+\nu N)^{-1}N(B+\nu N)^{-1}). \end{aligned}$$
(20)

The function \(\zeta\) and its derivative can be expressed in terms of the generalized eigenvalues \(\lambda _i\) of (BN) as

$$\begin{aligned} \zeta (\nu ) = \sum _{i=1}^n \frac{1}{\nu +\lambda _i}, \qquad \zeta '(\nu ) = -\sum _{i=1}^n \frac{1}{(\nu +\lambda _i)^2}. \end{aligned}$$
(21)
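The equivalence of the trace expressions (20) and the eigenvalue expressions (21) can be verified numerically (dense sketch; we use the fact that the generalized eigenvalues of (B, N) are the eigenvalues of \(N^{-1}B\)):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
B = rng.standard_normal((n, n)); B = (B + B.T) / 2
Q = rng.standard_normal((n, n))
N = Q @ Q.T + n * np.eye(n)              # positive definite N

# Generalized eigenvalues of (B, N) = eigenvalues of N^{-1} B (real here)
lam = np.linalg.eigvals(np.linalg.solve(N, B)).real

nu = -lam.min() + 1.0                    # a point in J = (-lambda_min, inf)
M = np.linalg.inv(B + nu * N)
assert abs(np.trace(N @ M) - np.sum(1.0 / (nu + lam))) < 1e-8
assert abs(np.trace(N @ M @ N @ M) - np.sum(1.0 / (nu + lam) ** 2)) < 1e-8
```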

Figure 1 shows an example with \(n=4\), \(N=I\), and eigenvalues \(10, 5, 0, -5\).

Fig. 1 Left: the function \(\zeta (\nu ) = \sum _i 1/(\nu + \lambda _i)\) for \(\lambda = (-5,0,5,10)\). We are interested in the solution of \(\zeta (\nu ) = 1\) larger than \(-\lambda _\mathrm {min}=5\). Right: the function \(1/\zeta (\nu )-1\).

We are interested in computing the solution of \(\zeta (\nu ) = 1\) that satisfies \(B+\nu N \succ 0\), i.e., \(\nu > -\lambda _\mathrm {min}\), where \(\lambda _\mathrm {min} = \min _i \lambda _i\) is the smallest generalized eigenvalue of (BN). We denote this interval by \(J = (-\lambda _\mathrm {min}, \infty )\). The equation \(\zeta (\nu )=1\) is guaranteed to have a unique solution in J because \(\zeta\) is monotonic and continuous on this interval, with

$$\begin{aligned} \lim _{\nu \rightarrow -\lambda _\mathrm {min}} \zeta (\nu ) = \infty , \qquad \lim _{\nu \rightarrow \infty } \zeta (\nu ) = 0. \end{aligned}$$

Furthermore, on the interval J, the function \(\zeta\) and its derivative can be expressed as

$$\begin{aligned} \zeta (\nu ) = -\mathop \mathbf{tr}(N\nabla \phi _*(B+\nu N)), \qquad \zeta '(\nu ) = -\mathop \mathbf{tr}(N (\nabla ^2\phi _*(B+\nu N)[N])). \end{aligned}$$

Therefore \(\zeta (\nu )\) and \(\zeta '(\nu )\) can be evaluated by taking the inner product of N with

$$\begin{aligned} \nabla \phi _*(B+\nu N)= & -\Pi _E\left( (B+\nu N)^{-1}\right) \\ \nabla ^2\phi _*(B+\nu N)[N]= & -\Pi _E\left( (B+\nu N)^{-1} N(B+\nu N)^{-1}\right) . \end{aligned}$$

Since \(B, N\in \mathbf {S}^{n}_{E}\), these quantities can be computed by the efficient algorithms for computing the gradient and directional second derivative of \(\phi _*\) described in [3, 54].

We note a few other properties of \(\zeta\). First, the expressions in (21) show that \(\zeta\) is convex, decreasing, and positive on J. Second, if \(\nu \in J\), then \({\tilde{\nu }} \in J\) for all \({\tilde{\nu }}\) that satisfy

$$\begin{aligned} {\tilde{\nu }} > \nu - \frac{1}{\sqrt{|\zeta '(\nu )|}}. \end{aligned}$$
(22)

This follows from

$$\begin{aligned} |\zeta '(\nu )| = \sum _{i=1}^n \frac{1}{(\nu +\lambda _i)^2} \ge \frac{1}{(\nu +\lambda _\mathrm {min})^2}, \end{aligned}$$

and is also a simple consequence of the Dikin ellipsoid theorem for self-concordant functions [44, Theorem 2.1.1.b].

The Newton iteration for the equation \(\zeta (\nu )-1 = 0\) is

$$\begin{aligned} \nu ^+ = \nu + \alpha \frac{1-\zeta (\nu )}{\zeta '(\nu )}, \end{aligned}$$
(23)

where \(\alpha\) is a step size. The same iteration can be interpreted as a damped Newton method for the unconstrained problem (19). If \(\nu ^+ \in J\) for a unit step \(\alpha =1\), then

$$\begin{aligned} \zeta (\nu ^+) > \zeta (\nu ) + \zeta '(\nu ) (\nu ^+ - \nu ) = 1, \end{aligned}$$

from strict convexity of \(\zeta\). Hence after one full Newton step, the Newton iteration with unit steps approaches the solution monotonically from the left. If \(\zeta (\nu ) < 1\) then in general a non-unit step size must be taken to keep the iterates in J. From the Dikin ellipsoid inequality (22), we see that \(\nu ^+ \in J\) for all positive \(\alpha\) that satisfy

$$\begin{aligned} \alpha < \frac{\sqrt{|\zeta '(\nu )|}}{1-\zeta (\nu )}. \end{aligned}$$

The theory of self-concordant functions provides a step size rule that satisfies this condition and guarantees convergence:

$$\begin{aligned} \alpha = \frac{\sqrt{|\zeta '(\nu )|}}{\sqrt{|\zeta '(\nu )|} + 1- \zeta (\nu )} \quad \text{ if } \quad \frac{1-\zeta (\nu )}{\sqrt{|\zeta '(\nu )|}} < \eta , \qquad \alpha = 1 \quad \text{ otherwise }, \end{aligned}$$

where \(\eta\) is a constant in (0, 1). As an alternative to this fixed step size rule, a standard backtracking line search can be used to determine a suitable step size \(\alpha\) in (23). Checking whether \(\nu ^+ \in J\) can be done by attempting a sparse Cholesky factorization of \(B+\nu ^+ N\).

Figure 1 shows that the function \(\zeta\) can be quite nonlinear around the solution of the equation if the solution is near \(-\lambda _\mathrm {min}\). Instead of applying Newton’s method directly to (20), it is useful to rewrite the nonlinear equation as \(\psi (\nu ) = 0\) where

$$\begin{aligned} \psi (\nu ) = \frac{1}{\zeta (\nu )} - 1. \end{aligned}$$
(24)

The negative smallest eigenvalue \(-\lambda _\mathrm {min}\) is a pole of \(\zeta (\nu )\), but a zero of \(1/\zeta (\nu )\). Moreover, the derivative of \(\psi\) varies slowly near this zero; in Fig. 1, the function \(\psi\) is almost linear in the region of interest. This suggests that Newton’s method applied to (24), i.e.,

$$\begin{aligned} \nu ^+ = \nu - \beta \frac{\psi (\nu )}{\psi ^\prime (\nu )} = \nu + \beta \frac{\zeta (\nu ) (1-\zeta (\nu ))}{\zeta ^\prime (\nu )}, \end{aligned}$$

should be extremely efficient in this case. Starting the line search at \(\beta =1\) is equivalent to starting at \(\alpha = \zeta (\nu )\) in (23). This often requires fewer backtracking steps than starting at \(\alpha =1\).

Newton’s method requires a feasible initial point \(\nu _0 \in J\). Suppose we know a positive lower bound \(\gamma\) on the smallest eigenvalue of N. Then \({\hat{\nu }}_0 \in J\) where

$$\begin{aligned} {\hat{\nu }}_0 > \max {\Big \{0, \frac{-\lambda _\mathrm {min}(B)}{\gamma }\Big \}}. \end{aligned}$$

A lower bound on \(\lambda _\mathrm {min}(B)\) can be obtained from the Gershgorin circle theorem, which states that the eigenvalues of B are contained in the disks

$$\begin{aligned} \Big \{s \Bigm \vert |s-B_{ii}| \le \sum _{j \ne i}|B_{ij}|\Big \}, \quad i=1,\ldots ,n. \end{aligned}$$

Thus, \(\lambda _\mathrm {min}(B) \ge \min _i{(B_{ii}-\sum _{j \ne i} |B_{ij}|)}\). In addition to this initialization, another practically useful initial point is \({\tilde{\nu }}_0 = n - \mathop \mathbf{tr}B/\mathop \mathbf{tr}N\), which solves \(\mathop \mathbf{tr}(N(B+\nu N)^{-1})=1\) exactly when B is a multiple of N. This choice works well in many practical examples but is not guaranteed to be feasible. In the implementation, we therefore use \({\tilde{\nu }}_0\) if it is feasible and \({\hat{\nu }}_0\) otherwise.
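The complete procedure, initialization at \({\tilde{\nu }}_0\) with fallback to the Gershgorin-based \({\hat{\nu }}_0\), followed by Newton iterations on \(\psi\) with feasibility checked by a Cholesky factorization, can be sketched as follows. This is a minimal dense-matrix illustration (the implementation described in the paper uses sparse Cholesky factorizations with pattern E); the function names are ours, and the backtracking here only enforces feasibility rather than a full line search.

```python
import numpy as np

def solve_nu(B, N, tol=1e-10, max_iter=50):
    """Solve tr(N (B + nu*N)^{-1}) = 1 over J = {nu : B + nu*N > 0}
    by Newton's method applied to psi(nu) = 1/zeta(nu) - 1."""
    n = B.shape[0]

    def factor(nu):
        # Feasibility test: Cholesky succeeds iff B + nu*N is positive definite.
        try:
            return np.linalg.cholesky(B + nu * N)
        except np.linalg.LinAlgError:
            return None

    # Initial point: nu_tilde = n - tr(B)/tr(N) if feasible, otherwise the
    # Gershgorin-based bound (gamma is assumed positive in this sketch).
    nu = n - np.trace(B) / np.trace(N)
    L = factor(nu)
    if L is None:
        lam_lb = min(B[i, i] - sum(abs(B[i, j]) for j in range(n) if j != i)
                     for i in range(n))
        gamma = min(N[i, i] - sum(abs(N[i, j]) for j in range(n) if j != i)
                    for i in range(n))
        nu = max(0.0, -lam_lb / gamma) + 1e-3  # margin for strict inequality
        L = factor(nu)

    for _ in range(max_iter):
        # With M = (B + nu*N)^{-1} N: zeta = tr(M) and zeta' = -tr(M @ M).
        M = np.linalg.solve(L.T, np.linalg.solve(L, N))
        zeta, dzeta = np.trace(M), -np.trace(M @ M)
        if abs(1.0 / zeta - 1.0) < tol:
            break
        step = zeta * (1.0 - zeta) / dzeta  # psi-Newton step with beta = 1
        beta = 1.0
        while True:  # backtrack only until the trial point is feasible
            Lnew = factor(nu + beta * step)
            if Lnew is not None:
                break
            beta /= 2.0
        nu, L = nu + beta * step, Lnew
    return nu
```

In exact arithmetic the iteration inherits the fast local convergence of Newton's method on the nearly linear function \(\psi\); a handful of factorizations typically suffices.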

3 Bregman primal–dual method

The proposed algorithm (7) is applicable not only to sparse SDPs, but to more general optimization problems. To emphasize its generality and to simplify notation, we switch in this section to the vector form of the optimization problem

$$\begin{aligned} \begin{array}{ll} \text{ minimize } & f(x) \\ \text{ subject } \text{ to } &Ax=b \end{array} \end{aligned}$$
(25)

where f is a closed convex function. Most of the discussion in this section extends to the more general standard form

$$\begin{aligned} \begin{array}{ll} \text{ minimize }&f(x) + g(Ax), \end{array} \end{aligned}$$
(26)

where f and g are closed convex functions. Problem (25) is a special case with \(g = \delta _{\{b\}}\), the indicator function of the singleton \(\{b\}\). While the standard form (26) offers more flexibility, it should be noted that methods for the equality constrained problem (25) also apply to (26) if this problem is reformulated as

$$\begin{aligned} \begin{array}{ll} \text{ minimize } & f(x) + g(y) \\ \text{ subject } \text{ to } & Ax - y = 0. \end{array} \end{aligned}$$

We also note that (25) includes conic optimization problems in standard form

$$\begin{aligned} \begin{array}{ll} \text{ minimize } & c^Tx \\ \text{ subject } \text{ to } & Ax=b \\ & x \in C \end{array} \end{aligned}$$

if we define \(f(x) = c^Tx + \delta _C(x)\), where \(\delta _C\) is the indicator function of the cone C.

In Sect. 3.1 we review some facts from convex duality theory. Section 3.2 describes the algorithm we propose for solving (25), and in Sect. 3.3 we analyze its convergence.

3.1 Duality theory

The Lagrangian for problem (25) will be denoted by

$$\begin{aligned} L(x,z) = f(x) + z^T(Ax-b). \end{aligned}$$
(27)

This function is convex in x and affine in z, and satisfies

$$\begin{aligned} \sup _z {L(x,z)}&= f(x) + \delta _{\{b\}}(Ax) = \left\{ \begin{array}{ll} f(x) & Ax=b \\ +\infty & \text{ otherwise, } \end{array}\right. \\ \inf _x {L(x,z)}&= -f^*(-A^Tz)+b^Tz, \end{aligned}$$

where \(f^*(y) = \sup _x{(y^Tx - f(x))}\) is the conjugate of f. The function \(-f^*(-A^Tz)+b^Tz\) is the objective of the dual problem

$$\begin{aligned} \begin{array}{ll} \text{ maximize }&{-f^*(-A^Tz)+b^Tz}. \end{array} \end{aligned}$$
(28)

A point \((x^\star , z^\star )\) is a saddle point of the Lagrangian if

$$\begin{aligned} \sup _z {L(x^\star , z)} = L(x^\star , z^\star ) = \inf _x {L(x, z^\star )}. \end{aligned}$$
(29)

Existence of a saddle point is equivalent to the property that the primal and dual optimal values are equal and attained. The left-hand equality in (29) holds if and only if \(Ax^\star = b\). The right-hand equality holds if and only if \(-A^Tz^\star \in \partial f(x^\star )\). Hence \((x^\star , z^\star )\) is a saddle point if and only if it satisfies the optimality conditions

$$\begin{aligned} Ax^\star = b, \qquad -A^Tz^\star \in \partial f(x^\star ). \end{aligned}$$

Throughout this section we assume that there exists a saddle point \((x^\star , z^\star )\).

Some of the convergence results in Sect. 3.3 are expressed in terms of the merit function

$$\begin{aligned} f(x) + \gamma \Vert Ax-b\Vert _2. \end{aligned}$$
(30)

It is well known that for sufficiently large \(\gamma\), the term \(\gamma \Vert Ax-b\Vert _2\) is an exact penalty. Specifically, if \(\gamma > \Vert z^\star \Vert _2\), where \(z^\star\) is a solution of the dual problem (28), then every minimizer of the merit function (30) is also optimal for (25).
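To make the exactness argument explicit: for any x, the Cauchy–Schwarz inequality and the saddle point property (29) give

$$\begin{aligned} f(x) + \gamma \Vert Ax-b\Vert _2 \;\ge\; f(x) + \Vert z^\star \Vert _2 \Vert Ax-b\Vert _2 \;\ge\; f(x) + (z^\star )^T(Ax-b) \;=\; L(x,z^\star ) \;\ge\; L(x^\star ,z^\star ) = f(x^\star ). \end{aligned}$$

If \(\gamma > \Vert z^\star \Vert _2\) and \(Ax \ne b\), the first inequality is strict, so the merit function value exceeds \(f(x^\star )\), which is the value attained at \(x^\star\). A minimizer of (30) must therefore satisfy \(Ax=b\) and hence be optimal for (25).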

3.2 Algorithm

The algorithm for (25) presented in this section involves a generalized distance d in the primal space, generated by a kernel function \(\phi\). It will be assumed that \(\phi\) is strongly convex on \(\mathop \mathbf{dom}f\). This property can be expressed as

$$\begin{aligned} d(x,y) \ge \frac{1}{2} \Vert x-y\Vert ^2 \end{aligned}$$
(31)

for all \(x \in \mathop \mathbf{dom}\phi \cap \mathop \mathbf{dom}f\) and \(y\in \mathop \mathbf{int}{(\mathop \mathbf{dom}\phi )} \cap \mathop \mathbf{dom}f\), where \(\Vert \cdot \Vert\) is a norm, scaled so that the strong convexity constant in (31) is one. (More generally, if \(\phi\) is \(\rho\)-strongly convex with respect to \(\Vert \cdot \Vert\), then the factor 1/2 is replaced with \(\rho /2\). By scaling the norm, one can assume \(\rho =1\).) We denote by \(\Vert A\Vert\) the matrix norm

$$\begin{aligned} \Vert A\Vert = \sup _{x\ne 0} \frac{\Vert Ax\Vert _2}{\Vert x\Vert } = \sup _{z\ne 0, \, x\ne 0} \frac{z^T Ax}{\Vert z\Vert _2 \Vert x\Vert }. \end{aligned}$$
(32)

The algorithm is summarized as follows. Select starting points \(z_{-1} = z_0\) and \(x_0 \in \mathop \mathbf{int}(\mathop \mathbf{dom}\phi ) \cap \mathop \mathbf{dom}f\). For \(k=0,1,\ldots\), repeat the following steps:

$$\begin{aligned} {\bar{z}}_{k+1}= & z_k + \theta _k (z_k - z_{k-1}) \end{aligned}$$
(33a)
$$\begin{aligned} x_{k+1}= & \mathrm {prox}_{\tau _k f}^d (x_k, \tau _k A^T{\bar{z}}_{k+1}) \end{aligned}$$
(33b)
$$\begin{aligned} z_{k+1}= & z_k + \sigma _k(Ax_{k+1} -b). \end{aligned}$$
(33c)

Step (33b) can be written more explicitly as

$$\begin{aligned} x_{k+1} = \mathop \mathrm{argmin}_x {(f(x) + {\bar{z}}_{k+1}^TA x + \frac{1}{\tau _k} d(x,x_k))}. \end{aligned}$$
(34)

The parameters \(\tau _k\), \(\sigma _k\), \(\theta _k\) are determined by one of two methods.

  • Constant parameters: \(\theta _k=1\), \(\tau _k= \tau\), \(\sigma _k=\sigma\), where

    $$\begin{aligned} \sqrt{\sigma \tau } \Vert A\Vert \le \delta . \end{aligned}$$
    (35)

    The parameter \(\delta\) satisfies \(0 < \delta \le 1\). In practice, \(\delta =1\) can be used, but some convergence results will require \(\delta < 1\); see Sect. 3.3.4.

  • Varying parameters. The parameters \(\tau _k\), \(\sigma _k\), \(\theta _k\) are determined by a backtracking search. At the start of the algorithm, we set \(\tau _{-1}\) and \(\sigma _{-1}\) to some positive values. To start the search in iteration k we choose \({\bar{\theta }}_k \ge 1\). For \(i=0,1,2,\ldots\), we set \(\theta _k = 2^{-i}{\bar{\theta }}_k\), \(\tau _k=\theta _k\tau _{k-1}\), \(\sigma _k=\theta _k\sigma _{k-1}\), and compute \({\bar{z}}_{k+1}\), \(x_{k+1}\), \(z_{k+1}\) using (33). If

    $$\begin{aligned} (z_{k+1}-{\bar{z}}_{k+1})^T A(x_{k+1}-x_k) \le \frac{\delta ^2}{\tau _k} d(x_{k+1}, x_k) + \frac{1}{2\sigma _k} \Vert {\bar{z}}_{k+1} - z_{k+1}\Vert _2^2, \end{aligned}$$
    (36)

    we accept the computed iterates \({\bar{z}}_{k+1}\), \(x_{k+1}\), \(z_{k+1}\) and step sizes \(\tau _k\), \(\sigma _k\), and terminate the backtracking search. If (36) does not hold, we increment i and continue the backtracking search.

The constant parameter choice is simple, but it is often overly pessimistic. Moreover it requires an estimate or tight upper bound for \(\Vert A\Vert\), which is difficult to obtain in large-scale problems. Using a loose bound for \(\Vert A\Vert\) in (35) may result in unnecessarily small values of \(\tau\) and \(\sigma\), and can dramatically slow down the convergence. The definition of \(\Vert A\Vert\) further depends on the strong convexity constant for the kernel \(\phi\); see (31) and (32). This quantity is also difficult to estimate for most kernels.

The varying parameters option does not require estimates or bounds on \(\Vert A\Vert\) or the strong convexity constant of the kernel. It is more expensive because in each backtracking iteration the three updates in (33) are computed. However, the extra cost is well justified in practice. If the line search process takes more than a few backtracking iterations, it indicates that the inequality (36) is much weaker than the conservative step size condition (35), and the algorithm with line search takes much larger steps than would be used by the constant parameter algorithm. In practice, the parameter \({\bar{\theta }}_k\) can be set to one in most iterations. The backtracking search then first checks whether the previous step sizes \(\tau _{k-1}\) and \(\sigma _{k-1}\) are acceptable, and decreases them only when needed to satisfy (36). The option of choosing \({\bar{\theta }}_k > 1\) allows one to occasionally increase the step sizes.

Algorithm (33) is related to several existing algorithms. With constant parameters, it is a special case of the primal–dual algorithm in [22, Algorithm 1], which solves the more general problem (26) and uses generalized distances for the primal and dual variables. Here we take \(g(y) = \delta _{\{b\}}\) and use a generalized distance only in the primal space. The line search condition (36) for selecting step sizes does not appear in [22].

With standard proximal operators (for squared Euclidean distances), the primal–dual algorithm of [22] is also known as the primal–dual hybrid gradient (PDHG) algorithm, and has been extensively studied as a versatile and efficient algorithm for large-scale convex optimization; see [20, 21, 24, 25, 28, 33, 45, 47, 49, 56, 57] for applications, analysis, and extensions. The line search technique for the primal–dual algorithm proposed by Malitsky and Pock [39] is similar to the one described above, but not identical, even when squared Euclidean distances are used.

The algorithm can also be interpreted as a variation on the Bregman proximal point algorithm [19, 26, 32], applied to the optimality conditions

$$\begin{aligned} 0 \in \left[ \begin{array}{cc} 0 &{}\quad A^T \\ -A &{}\quad 0 \end{array}\right] \left[ \begin{array}{c} x \\ z\end{array}\right] + \left[ \begin{array}{c} \partial f(x) \\ b \end{array}\right] . \end{aligned}$$

In each iteration of the proximal point algorithm the iterates \(x_{k+1}\), \(z_{k+1}\) are defined by the inclusion

$$\begin{aligned} 0\in & \left[ \begin{array}{cc} 0 &{}\quad A^T \\ -A &\quad 0 \end{array}\right] \left[ \begin{array}{c} x_{k+1} \\ z_{k+1} \end{array}\right] + \left[ \begin{array}{c} \partial f(x_{k+1}) \\ b \end{array}\right] \nonumber \\&+ \nabla \phi _\mathrm {pd}(x_{k+1},z_{k+1}) - \nabla \phi _\mathrm {pd}(x_k,z_k), \end{aligned}$$
(37)

where \(\phi _\mathrm {pd}(x,z)\) is a Bregman kernel. If we choose a kernel of the form

$$\begin{aligned} \phi _\mathrm {pd}(x,z) = \frac{1}{\tau } \phi (x) + \frac{1}{2\sigma } \Vert z\Vert _2^2, \end{aligned}$$

then (37) reduces to

$$\begin{aligned} 0 \in \left[ \begin{array}{cc} 0 &\quad A^T \\ -A &\quad 0 \end{array}\right] \left[ \begin{array}{c} x_{k+1} \\ z_{k+1} \end{array}\right] + \left[ \begin{array}{c} \partial f(x_{k+1}) \\ b \end{array}\right] + \left[ \begin{array}{c} (\nabla \phi (x_{k+1}) - \nabla \phi (x_k))/\tau \\ (z_{k+1} - z_k)/\sigma \end{array}\right] . \end{aligned}$$

In the generalized proximal operator notation defined in (10) and (11), this condition can be expressed as two equations

$$\begin{aligned} x_{k+1} = \mathrm {prox}_{\tau f}^d (x_k, \tau A^Tz_{k+1}), \qquad z_{k+1} = z_k + \sigma (Ax_{k+1}-b). \end{aligned}$$

These two equations are coupled and difficult to solve because \(x_{k+1}\) and \(z_{k+1}\) each appear on the right-hand side of an equality. The updates (33b) and (33c) are almost identical but replace \(z_{k+1}\) with \({\bar{z}}_{k+1}\) in the primal update. The iterate \({\bar{z}}_{k+1}\) can therefore be interpreted as a prediction of \(z_{k+1}\). This interpretation also provides some intuition for the step size condition (36). If \({\bar{z}}_{k+1}\) happens to be equal to \(z_{k+1}\), then (36) imposes no upper bound on the step sizes \(\tau _k\) and \(\sigma _k\). This makes sense because when \({\bar{z}}_{k+1}=z_{k+1}\) the update is equal to the proximal point update, and the convergence theory for the proximal point method does not impose upper bounds on the step size.

He and Yuan [33] have given an interesting interpretation of the primal–dual algorithm of [20] as a “pre-conditioned” proximal point algorithm. For the algorithm considered here, their interpretation corresponds to choosing

$$\begin{aligned} \phi _\mathrm {pd}(x,z) = \frac{1}{\tau } \phi (x) + \frac{1}{2\sigma } \Vert z\Vert _2^2 + z^TAx \end{aligned}$$
(38)

as the generalized distance in (37). It can be shown that under the strong convexity assumptions for \(\phi\) mentioned at the beginning of the section, the function (38) is convex if \(\sqrt{\sigma \tau } \Vert A\Vert \le 1\). With this choice of Bregman kernel, the inclusion (37) reduces to

$$\begin{aligned} 0 \in \left[ \begin{array}{cc} 0 &\quad A^T \\ -A &\quad 0 \end{array}\right] \left[ \begin{array}{c} x_k \\ 2z_{k+1} -z_k \end{array}\right] + \left[ \begin{array}{c} \partial f(x_{k+1}) \\ b \end{array}\right] + \left[ \begin{array}{c} (\nabla \phi (x_{k+1}) - \nabla \phi (x_k))/\tau \\ (z_{k+1} - z_k)/\sigma \end{array}\right] , \end{aligned}$$

which can be written as

$$\begin{aligned} z_{k+1} = z_k + \sigma (Ax_k-b), \qquad x_{k+1} = \mathrm {prox}_{\tau f}^d(x_k, \tau A^T(2z_{k+1}-z_k)). \end{aligned}$$

Except for the indexing of the iterates, this is identical to (33) with constant step sizes (\(\theta _k=1\), \(\tau _k=\tau\), \(\sigma _k=\sigma\)).

3.3 Convergence analysis

In this section we analyze the convergence of the algorithm following the ideas in [22, 39, 49]. The main result is an ergodic convergence rate, given in Eq. (49).

3.3.1 Algorithm parameters

We first prove two facts about the step sizes in the two versions of the algorithm.

Constant parameters If \(\theta _k=1\), \(\tau _k=\tau\), \(\sigma _k=\sigma\), where \(\tau\) and \(\sigma\) satisfy (35), then the iterates \({\bar{z}}_{k+1}\), \(x_{k+1}\), \(z_{k+1}\) satisfy (36).

Proof

We use the definition of the matrix norm \(\Vert A\Vert\), the arithmetic–geometric mean inequality, and strong convexity of the Bregman kernel:

$$\begin{aligned}&{(z_{k+1}-{\bar{z}}_{k+1})^TA(x_{k+1}-x_k)} \\&\quad \le \Vert A\Vert \Vert x_{k+1} - x_k\Vert \Vert z_{k+1}-{\bar{z}}_{k+1} \Vert _2 \\&\quad = \frac{\sqrt{\sigma _k\tau _k} \Vert A\Vert }{\delta } \left( \frac{\delta ^2 \Vert x_{k+1} - x_k\Vert ^2}{\tau _k} \, \frac{\Vert z_{k+1}-{\bar{z}}_{k+1} \Vert _2^2}{\sigma _k}\right) ^{1/2} \\&\quad \le \frac{\sqrt{\sigma _k\tau _k} \Vert A\Vert }{\delta } \left( \frac{\delta ^2\Vert x_{k+1} - x_k\Vert ^2}{2\tau _k} + \frac{\Vert z_{k+1}-{\bar{z}}_{k+1} \Vert _2^2}{2\sigma _k} \right) \\&\quad \le \frac{\sqrt{\sigma _k\tau _k} \Vert A\Vert }{\delta } \left( \frac{\delta ^2 d(x_{k+1},x_k)}{\tau _k} \, + \frac{\Vert z_{k+1} - {\bar{z}}_{k+1}\Vert _2^2}{2\sigma _k} \right) \\&\quad \le \frac{\delta ^2 d(x_{k+1},x_k)}{\tau _k} \, + \frac{\Vert {\bar{z}}_{k+1} - z_{k+1}\Vert _2^2}{2\sigma _k}. \end{aligned}$$

The last inequality follows from (35). \(\square\)

The result implies that we can restrict the analysis to the algorithm with varying parameters. The constant parameter variant is a special case with \({\bar{\theta }}_k=1\), \(\tau _{-1}=\tau\), and \(\sigma _{-1}=\sigma\).

Varying parameters In the varying parameter variant of the algorithm the step sizes are bounded below by

$$\begin{aligned} \tau _k \ge \tau _\mathrm {min} \triangleq \min {\{ \tau _{-1}, \frac{\delta }{2\sqrt{\beta }\Vert A\Vert }\}}, \qquad \sigma _k \ge \sigma _\mathrm {min} \triangleq \beta \tau _\mathrm {min}, \end{aligned}$$
(39)

where \(\beta = \sigma _{-1}/\tau _{-1}\).

Proof

We proved in the previous paragraph that the exit condition (36) in the backtracking search certainly holds if

$$\begin{aligned} \sqrt{\sigma _k\tau _k} \Vert A\Vert \le \delta . \end{aligned}$$

From this observation one can use induction to prove the lower bounds (39). Suppose \(\tau _{k-1} \ge \tau _\mathrm {min}\) and \(\sigma _{k-1} \ge \sigma _\mathrm {min}\). This holds at \(k=0\) by definition of \(\tau _\mathrm {min}\) and \(\sigma _\mathrm {min}\). The first value of \(\theta _k\) tested in the search is \(\theta _k = {\bar{\theta }}_k\ge 1\). If this value is accepted, then

$$\begin{aligned} \tau _k = {\bar{\theta }}_k \tau _{k-1} \ge \tau _{k-1} \ge \tau _\mathrm {min}, \qquad \sigma _k = {\bar{\theta }}_k \sigma _{k-1} \ge \sigma _{k-1} \ge \sigma _\mathrm {min}. \end{aligned}$$

If \(\theta _k= {\bar{\theta }}_k\) is rejected, one or more backtracking steps are taken. Denote by \({\tilde{\theta }}_k\) the last rejected value. Then \({\tilde{\theta }}_k \sqrt{\sigma _{k-1}\tau _{k-1}} \Vert A\Vert > \delta\), and the accepted \(\theta _k\) satisfies

$$\begin{aligned} \theta _k = \frac{{\tilde{\theta }}_k}{2} > \frac{\delta }{2\sqrt{\sigma _{k-1}\tau _{k-1}}\Vert A\Vert } = \frac{\delta }{2\tau _{k-1}\sqrt{\beta }\Vert A\Vert }. \end{aligned}$$

Therefore

$$\begin{aligned} \tau _k = \theta _k\tau _{k-1} > \frac{\delta }{2\sqrt{\beta }\Vert A\Vert } \ge \tau _\mathrm {min}, \qquad \sigma _k = \beta \tau _k \ge \beta \tau _\mathrm {min}. \end{aligned}$$

\(\square\)

3.3.2 Analysis of one iteration

We now analyze the progress in one iteration of the varying parameter variant of algorithm (33).

Duality gap For \(i\ge 1\), the iterates \(x_i\), \(z_i\), \({\bar{z}}_i\) satisfy

$$\begin{aligned}&{L(x_i,z) - L(x, {\bar{z}}_i)} \nonumber \\&\quad \le \frac{1}{\tau _{i-1}} \left( d(x,x_{i-1}) - d(x,x_i) - (1-\delta ^2) d(x_i,x_{i-1})\right) \nonumber \\&\qquad + \frac{1}{2\sigma _{i-1}} \left( \Vert z-z_{i-1}\Vert _2^2 - \Vert z-z_i\Vert _2^2 - \Vert {\bar{z}}_i - z_{i-1}\Vert _2^2 \right) \end{aligned}$$
(40)

for all \(x \in \mathop \mathbf{dom}f \cap \mathop \mathbf{dom}\phi\) and all z.

Proof

The second step (33b) defines \(x_{k+1}\) as the minimizer of

$$\begin{aligned}&{f(x) + {\bar{z}}_{k+1}^T Ax + \frac{1}{\tau _k} d(x, x_k)} \\&\quad = f(x) + {\bar{z}}_{k+1}^T Ax + \frac{1}{\tau _k} \big (\phi (x) -\phi (x_k) - \langle \nabla \phi (x_k), x-x_k\rangle \big ). \end{aligned}$$

By assumption the solution is uniquely defined and in the interior of \(\mathop \mathbf{dom}\phi\). Therefore \(x_{k+1}\) satisfies the optimality condition

$$\begin{aligned} \frac{1}{\tau _k} (\nabla \phi (x_k) - \nabla \phi (x_{k+1})) - A^T {\bar{z}}_{k+1} \in \partial f(x_{k+1}). \end{aligned}$$

Equivalently, the following holds for all \(x\in \mathop \mathbf{dom}\phi \cap \mathop \mathbf{dom}f\):

$$\begin{aligned}&{f(x) - f(x_{k+1})} \nonumber \\&\quad \ge -{\bar{z}}_{k+1}^T A(x-x_{k+1}) + \frac{1}{\tau _k} \langle \nabla \phi (x_k) - \nabla \phi (x_{k+1}), x-x_{k+1}\rangle \nonumber \\&\quad = -{\bar{z}}_{k+1}^T A(x - x_{k+1}) - \frac{1}{\tau _k} (d(x,x_k) - d(x,x_{k+1}) - d(x_{k+1}, x_k)). \end{aligned}$$
(41)

(The triangle identity (8) is used on the second line.) The dual update (33c) implies that

$$\begin{aligned} (z-z_{k+1})^T (Ax_{k+1}-b) = \frac{1}{\sigma _k} (z-z_{k+1})^T (z_{k+1}-z_k) \quad \text{ for } \text{ all } z. \end{aligned}$$
(42)

This equality at \(k=i-1\) is

$$\begin{aligned} (z-z_i)^T (Ax_i-b)= & {} \frac{1}{\sigma _{i-1}} (z-z_i)^T (z_i-z_{i-1}) \nonumber \\= & {} \frac{1}{2\sigma _{i-1}} \left( \Vert z-z_{i-1}\Vert _2^2 - \Vert z-z_i\Vert _2^2 - \Vert z_i - z_{i-1} \Vert _2^2 \right) . \end{aligned}$$
(43)

The equality (42) at \(k=i-2\) is

$$\begin{aligned} (z-z_{i-1})^T (Ax_{i-1}-b)= & \frac{1}{\sigma _{i-2}} (z-z_{i-1})^T (z_{i-1}-z_{i-2}) \\= & \frac{\theta _{i-1}}{\sigma _{i-1}} (z-z_{i-1})^T (z_{i-1}-z_{i-2}) \\= & \frac{1}{\sigma _{i-1}} (z-z_{i-1})^T ({\bar{z}}_i-z_{i-1}). \end{aligned}$$

We evaluate this equality at \(z=z_i\) and add \(\theta _{i-1}\) times the same equality evaluated at \(z=z_{i-2}\):

$$\begin{aligned}&{(z_i-{\bar{z}}_i)^T (Ax_{i-1}-b)} \nonumber \\&\quad = \frac{1}{\sigma _{i-1}} (z_i-{\bar{z}}_i)^T({\bar{z}}_i - z_{i-1}) \nonumber \\&\quad =\frac{1}{2\sigma _{i-1}} \left( \Vert z_i-z_{i-1}\Vert _2^2 - \Vert z_i - {\bar{z}}_i\Vert _2^2 - \Vert {\bar{z}}_i - z_{i-1}\Vert _2^2 \right) . \end{aligned}$$
(44)

Now we combine (41) for \(k=i-1\), with (43) and (44). For \(i\ge 1\),

$$\begin{aligned}&{L(x_i,z) - L(x, {\bar{z}}_i)} \nonumber \\&\quad = f(x_i) + z^T(A x_i-b) - f(x) - {\bar{z}}_i^T(Ax-b)\nonumber \\&\quad \le \frac{1}{\tau _{i-1}} \left( d(x,x_{i-1}) - d(x,x_i) - d(x_i,x_{i-1})\right) \nonumber \\&\qquad + {\bar{z}}_i^T A(x-x_i) + z^T (Ax_i-b) - {\bar{z}}_i^T (Ax-b) \nonumber \\&\quad = \frac{1}{\tau _{i-1}} \left( d(x,x_{i-1}) - d(x,x_i) - d(x_i,x_{i-1})\right) + (z-{\bar{z}}_i)^T(Ax_i-b)\nonumber \\&\quad = \frac{1}{\tau _{i-1}} \left( d(x,x_{i-1}) - d(x,x_i) - d(x_i,x_{i-1})\right) + (z_i-{\bar{z}}_i)^T A(x_i-x_{i-1}) \nonumber \\&\qquad + (z-z_i)^T(Ax_i-b) + (z_i-{\bar{z}}_i)^T (Ax_{i-1}-b) \nonumber \\&\quad = \frac{1}{\tau _{i-1}} (d(x,x_{i-1}) - d(x,x_i) - d(x_i,x_{i-1})) + (z_i-{\bar{z}}_i)^T A(x_i-x_{i-1}) \nonumber \\&\qquad + \frac{1}{2\sigma _{i-1}} (\Vert z-z_{i-1}\Vert _2^2 - \Vert z-z_i\Vert _2^2 - \Vert {\bar{z}}_i - z_{i-1}\Vert _2^2 - \Vert {\bar{z}}_i - z_i \Vert _2^2). \end{aligned}$$
(45)

The first inequality follows from (41). In the last step we substitute (43) and (44). Next we note that the line search exit condition (36) implies that

$$\begin{aligned} (z_i-{\bar{z}}_i)^T A(x_i-x_{i-1}) \le \frac{\delta ^2}{\tau _{i-1}} d(x_i,x_{i-1}) + \frac{1}{2\sigma _{i-1}} \Vert {\bar{z}}_i - z_i\Vert _2^2. \end{aligned}$$

Substituting this in (45) gives the bound (40). \(\square\)

Monotonicity properties Suppose \(x^\star \in \mathop \mathbf{dom}\phi\), and \(x^\star\), \(z^\star\) satisfy the saddle point property (29). Then

$$\begin{aligned} d(x^\star ,x_i) + \frac{1}{2\beta } \Vert z^\star -z_i\Vert _2^2 \le d(x^\star ,x_{i-1}) + \frac{1}{2\beta } \Vert z^\star -z_{i-1}\Vert _2^2 \end{aligned}$$
(46)

where \(\beta = \sigma _{-1}/\tau _{-1}\). Moreover

$$\begin{aligned} \sum _{i=1}^k \left( (1-\delta ^2) d(x_i,x_{i-1}) + \frac{1}{2\beta } \Vert {\bar{z}}_i - z_{i-1} \Vert _2^2\right) \le d(x^\star , x_0) + \frac{1}{2\beta } \Vert z^\star - z_0\Vert _2^2. \end{aligned}$$
(47)

These inequalities hold for any value \(\delta \in (0,1]\) in the line search condition (36). The second inequality implies that \({\bar{z}}_i-z_{i-1} \rightarrow 0\). If \(\delta < 1\) it also implies that \(d(x_i, x_{i-1}) \rightarrow 0\) and, by the strong convexity assumption on \(\phi\), that \(x_i-x_{i-1} \rightarrow 0\).

Proof

We substitute \(x=x^\star\), \(z=z^\star\) in (40) and note that \(L(x_i, z^\star ) - L(x^\star , {\bar{z}}_i) \ge 0\) (from the saddle-point property (29)):

$$\begin{aligned} 0\le & L(x_i, z^\star ) - L(x^\star , {\bar{z}}_i) \\\le & \frac{1}{\tau _{i-1}} (d(x^\star ,x_{i-1}) - d(x^\star , x_i) - (1-\delta ^2) d(x_i, x_{i-1})) \\&+ \frac{1}{2\sigma _{i-1}} \left( \Vert z^\star - z_{i-1}\Vert _2^2 - \Vert z^\star - z_i\Vert _2^2 - \Vert {\bar{z}}_i - z_{i-1}\Vert _2^2\right) . \end{aligned}$$

With \(\beta =\sigma _{i-1}/\tau _{i-1} = \sigma _{-1}/\tau _{-1}\), this gives the inequality

$$\begin{aligned}&{ (1-\delta ^2) d(x_i,x_{i-1}) + \frac{1}{2\beta } \Vert {\bar{z}}_i - z_{i-1}\Vert _2^2} \\&\quad \le d(x^\star , x_{i-1}) - d(x^\star ,x_i) + \frac{1}{2\beta } (\Vert z^\star -z_{i-1}\Vert _2^2 - \Vert z^\star -z_i\Vert _2^2). \end{aligned}$$

Since the left-hand side is nonnegative, the inequality (46) follows. Summing from \(i=1\) to k gives (47). \(\square\)

3.3.3 Ergodic convergence

We define averaged primal and dual sequences

$$\begin{aligned} x^\mathrm {avg}_k = \frac{1}{\sum _{i=1}^k \tau _{i-1}} \sum _{i=1}^k \tau _{i-1}x_i, \qquad z^\mathrm {avg}_k = \frac{1}{\sum _{i=1}^k \tau _{i-1}} \sum _{i=1}^k \tau _{i-1} {\bar{z}}_i. \end{aligned}$$

We first show that the averaged sequences satisfy

$$\begin{aligned} L(x^\mathrm {avg}_k,z) - L(x,z^\mathrm {avg}_k) \le \frac{1}{\sum _{i=1}^k \tau _{i-1}} (d(x,x_0) + \frac{1}{2\beta } \Vert z-z_0\Vert _2^2) \end{aligned}$$
(48)

for all \(x\in \mathop \mathbf{dom}f\cap \mathop \mathbf{dom}\phi\) and all z. This holds for every choice of \(\delta \in (0,1]\) in (36).

Proof

From (40),

$$\begin{aligned}&{L(x_i,z) - L(x, {\bar{z}}_i)} \\&\quad \le \frac{1}{\tau _{i-1}} \left( d(x,x_{i-1}) - d(x,x_i) + \frac{1}{2\beta } \Vert z-z_{i-1}\Vert _2^2 - \frac{1}{2\beta } \Vert z-z_i\Vert _2^2 \right) . \end{aligned}$$

Since L is convex in x and affine in z,

$$\begin{aligned}&{\left( \sum _{i=1}^k \tau _{i-1}\right) (L(x^\mathrm {avg}_k, z) - L(x, z^\mathrm {avg}_k))} \\&\quad \le \sum _{i=1}^k \tau _{i-1} (L(x_i, z) - L(x, {\bar{z}}_i)) \\&\quad \le d(x,x_0) - d(x,x_k) + \frac{1}{2\beta } (\Vert z-z_0\Vert _2^2 - \Vert z-z_k\Vert _2^2) \\&\quad \le d(x,x_0) + \frac{1}{2\beta } \Vert z-z_0\Vert _2^2. \end{aligned}$$

Dividing by \(\sum _{i=1}^k \tau _{i-1}\) gives (48). \(\square\)

If we substitute in (48) an optimal \(x=x^\star\) (which satisfies \(Ax^\star = b\)), we obtain that

$$\begin{aligned} f(x^\mathrm {avg}_k) + z^T(Ax^\mathrm {avg}_k-b) - f(x^\star ) \le \frac{1}{\sum _{i=1}^k\tau _{i-1}} \left( d(x^\star ,x_0) + \frac{1}{2\beta } \Vert z-z_0\Vert _2^2\right) \end{aligned}$$

for all z. Maximizing both sides over z subject to \(\Vert z\Vert _2\le \gamma\) shows that

$$\begin{aligned}&{f(x^\mathrm {avg}_k) + \gamma \Vert Ax^\mathrm {avg}_k-b \Vert _2 - f(x^\star )} \nonumber \\&\quad \le \frac{1}{\sum _{i=1}^k\tau _{i-1}} \left( d(x^\star ,x_0) + \frac{1}{2\beta } \sup _{\Vert z\Vert _2 \le \gamma } \Vert z-z_0\Vert _2^2\right) \nonumber \\&\quad = \frac{1}{\sum _{i=1}^k\tau _{i-1}} \left( d(x^\star ,x_0) + \frac{1}{2\beta } (\gamma + \Vert z_0\Vert _2)^2\right) . \end{aligned}$$
(49)

The first two terms on the left-hand side form the merit function (30). For \(\gamma > \Vert z^\star \Vert _2\), the penalty function in the merit function is exact, so \(f(x) + \gamma \Vert Ax-b\Vert _2 - f(x^\star ) \ge 0\) with equality only if x is optimal. (The use of an exact penalty function to express a convergence result is inspired by [49, page 287].) Since \(\tau _i \ge \tau _\mathrm {min}\), we have \(\sum _{i=1}^k \tau _{i-1} \ge k\tau _\mathrm {min}\), so the inequality shows that the merit function converges to its optimal value \(f(x^\star )\) at rate O(1/k).

3.3.4 Convergence of the iterates

We now make two additional assumptions about the Bregman kernel \(\phi\) [19].

  1. 1.

    For fixed x, the sublevel sets \(\{y \mid d(x,y) \le \alpha \}\) are closed. In other words, the distance d(xy) is a closed function of y.

  2. 2.

    If \(y_k\in \mathop \mathbf{int}{(\mathop \mathbf{dom}\phi )}\) converges to \(x\in \mathop \mathbf{dom}\phi\), then \(d(x,y_k)\rightarrow 0\).

These two assumptions are not restrictive, and in particular, they are satisfied by the logarithmic barrier \(\phi\) (12). We also make the (minor) assumptions that \(\delta < 1\) in (36) and that \(\theta _k\) is bounded above (which is easily satisfied, since the user chooses \({\bar{\theta }}_k\)). With these additional assumptions it can be shown that the sequences \(x_k\), \(z_k\) converge to optimal solutions.

Proof

The inequality (46) and strong convexity of \(\phi\) show that the sequences \(x_k\), \(z_k\) are bounded. Let \((x_{k_i}, z_{k_i})\) be a convergent subsequence with limit \(({\hat{x}}, {\hat{z}})\). With \(\delta < 1\), (47) shows that \(d(x_{k_i+1}, x_{k_i})\) converges to zero. By strong convexity of the kernel, \(x_{k_i+1} - x_{k_i} \rightarrow 0\) and therefore the subsequence \(x_{k_i+1}\) also converges to \({\hat{x}}\). Since \(z_{k_i+1}- z_{k_i} \rightarrow 0\), the subsequence \(z_{k_i+1}\) converges to \({\hat{z}}\). Since \(\theta _k\) is bounded above, \({\bar{z}}_{k_i+1} = z_{k_i} + \theta _{k_i} (z_{k_i} - z_{k_i-1})\) also converges to \({\hat{z}}\).

The dual update (33c) can be written as

$$\begin{aligned} Ax_{k_i+1}- b = \frac{1}{\sigma _{k_i}} (z_{k_i+1} - z_{k_i}). \end{aligned}$$
(50)

Since \(z_{k_i+1} -z_{k_i} \rightarrow 0\) and \(\sigma _{k_i} \ge \sigma _\mathrm {min}\), the left-hand side converges to zero, so \(A{\hat{x}}= b\).

From (46), \(d(x^\star , x_{k_i})\) is bounded above. Since the sublevel sets \(\{ y \mid d(x^\star , y)\le \alpha \}\) are closed subsets of \(\mathop \mathbf{int}{(\mathop \mathbf{dom}\phi )}\), the limit \({\hat{x}}\) is in \(\mathop \mathbf{int}{(\mathop \mathbf{dom}\phi )}\). The left-hand side of the optimality condition

$$\begin{aligned} \frac{1}{\tau _{k_i}} (\nabla \phi (x_{k_i}) - \nabla \phi (x_{k_i+1})) - A^T {\bar{z}}_{k_i+1} \in \partial f(x_{k_i+1}) \end{aligned}$$
(51)

converges to \(-A^T{\hat{z}}\), because \(\tau _k\ge \tau _\mathrm {min}\) and \(\nabla \phi\) is continuous on \(\mathop \mathbf{int}{(\mathop \mathbf{dom}\phi )}\). By maximal monotonicity of \(\partial f\), this implies that \(-A^T {\hat{z}} \in \partial f({\hat{x}})\) (see [17, page 27] [53, lemma 3.2]). We conclude that \({\hat{x}}\), \({\hat{z}}\) satisfy the optimality conditions \(A{\hat{x}}=b\) and \(-A^T{\hat{z}} \in \partial f({\hat{x}})\).

To show that the entire sequence converges, we substitute \(x ={\hat{x}}\), \(z={\hat{z}}\) in (40):

$$\begin{aligned}&{L(x_k, {\hat{z}}) - L({\hat{x}}, {\bar{z}}_k)} \\&\quad \le \frac{1}{\tau _{k-1}} (d({\hat{x}}, x_{k-1}) - d({\hat{x}}, x_k)) + \frac{1}{2\beta \tau _{k-1}} ( \Vert {\hat{z}} - z_{k-1}\Vert _2^2 - \Vert {\hat{z}} - z_k\Vert _2^2). \end{aligned}$$

The left-hand side is nonnegative by the saddle point property (29). Therefore

$$\begin{aligned} d({\hat{x}}, x_k) + \frac{1}{2\beta } \Vert {\hat{z}}-z_k\Vert _2^2 \le d({\hat{x}}, x_{k-1}) + \frac{1}{2\beta } \Vert {\hat{z}}-z_{k-1}\Vert _2^2 \end{aligned}$$

for all k. This shows that

$$\begin{aligned} d({\hat{x}}, x_k) + \frac{1}{2\beta } \Vert {\hat{z}}-z_k\Vert _2^2 \le d({\hat{x}}, x_{k_i}) + \frac{1}{2\beta } \Vert {\hat{z}}-z_{k_i}\Vert _2^2 \end{aligned}$$

for all \(k\ge k_i\). By the second additional kernel property mentioned above, the right-hand side converges to zero as \(i\rightarrow \infty\). Therefore \(d({\hat{x}}, x_k) \rightarrow 0\) and \(z_k \rightarrow {\hat{z}}\), and by the strong convexity property of the kernel, \(d({\hat{x}}, x_k) \rightarrow 0\) implies \(x_k \rightarrow {\hat{x}}\). \(\square\)

4 Numerical experiments

In this section we evaluate the performance of algorithm (7), the Bregman PDHG algorithm (33) applied to the centering problem (5). The numerical results illustrate that the cost for evaluating the Bregman proximal operator (17) is comparable to the cost of a sparse Cholesky factorization with sparsity pattern E. This prox-evaluation dominates the computational cost in each iteration of (7), since \({\mathcal {A}}\) and \({\mathcal {A}}^*\) are usually easy to evaluate for large-scale problems with sparse or other types of structure. In particular, the proposed method does not need to solve linear equations involving \({\mathcal {A}}\) or \({\mathcal {A}}^*\), an important advantage over ADMM and interior-point methods.

In this section we consider the centering problem for two sets of sparse SDPs, the maximum cut problem and the graph partitioning problem. The experiments are carried out in Python 3.6 on a laptop with an Intel Core i5 2.4GHz CPU and 8GB RAM. The Python library for chordal matrix computations CHOMPACK [4] is used to compute chordal extensions (with the AMD reordering [1]), sparse Cholesky factorizations, the primal barrier \(\phi\), and the gradient and directional second derivative of the dual barrier \(\phi _*\). Other sparse matrix computations are implemented using CVXOPT [2].

In the experiments, we terminate the iteration (33) when the relative primal and dual residuals are less than \(10^{-6}\). These two stopping conditions are sufficient for our algorithm, as suggested by the convergence proof, in particular, Eqs. (50) and (51). The two residuals are defined as

$$\begin{aligned} \text{ primal } \text{ residual }&= \frac{\Vert z_k-z_{k-1}\Vert _2}{\sigma _k \max \{1,\Vert z_k\Vert _\infty \}}, \\ \text{ dual } \text{ residual }&= \frac{\Vert \nabla \phi (X_k)-\nabla \phi (X_{k-1})\Vert _2}{\tau _k \max \{1,\Vert X_k\Vert _\mathrm {max}\}}, \end{aligned}$$

where \(\Vert Y\Vert _\mathrm {max}=\max _{i,j} |Y_{ij}|\).
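As a concrete illustration, the stopping test can be sketched in a few lines of NumPy (a minimal sketch, not the CHOMPACK-based implementation used in the experiments; `grad_phi` stands for the gradient of the barrier \(\phi\), and \(\Vert \cdot \Vert _2\) of a matrix argument is taken as the Frobenius norm):

```python
import numpy as np

def residuals(X_k, X_prev, z_k, z_prev, grad_phi, tau_k, sigma_k):
    """Relative primal and dual residuals used in the stopping test."""
    # primal residual: change in the dual iterate z, scaled by sigma_k
    r_pri = np.linalg.norm(z_k - z_prev) / (sigma_k * max(1.0, np.abs(z_k).max()))
    # dual residual: change in the barrier gradient, scaled by tau_k
    # (np.linalg.norm gives the Frobenius norm for matrix arguments)
    diff = grad_phi(X_k) - grad_phi(X_prev)
    r_dual = np.linalg.norm(diff) / (tau_k * max(1.0, np.abs(X_k).max()))
    return r_pri, r_dual

# example with the log-det barrier, for which grad phi(X) = -X^{-1}
r_pri, r_dual = residuals(2 * np.eye(2), np.eye(2),
                          np.array([1.0, 1.0]), np.zeros(2),
                          lambda X: -np.linalg.inv(X), 1.0, 1.0)
stop = (r_pri < 1e-6) and (r_dual < 1e-6)
```

The iteration terminates once both residuals fall below the tolerance \(10^{-6}\).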

4.1 Maximum cut problem

Given an undirected graph \(G=(V,E)\), the maximum cut problem is to partition the set of vertices into two sets in order to maximize the total number of edges between the two sets. (If every edge \(\{i,j\} \in E\) is associated with a nonnegative weight \(w_{ij}\), then the maximum cut problem is to maximize the total weight of the edges between the two sets.) One can show that the maximum cut problem can be represented as a binary quadratic optimization problem

$$\begin{aligned} \begin{array}{ll} \text{ maximize } &{} (1/4)x^TLx \\ \text{ subject } \text{ to } &{} x \in \{\pm 1\}^n, \end{array} \end{aligned}$$

where \(L \in \mathbf {S}^n\) is the Laplacian of an undirected graph \(G=(V,E)\) with vertices \(V=\{1,2,\ldots ,n\}\). The SDP relaxation of the maximum cut problem is

$$\begin{aligned} \begin{array}{ll} \text{ maximize } & (1/4)\mathop \mathbf{tr}(LX) \\ \text{ subject } \text{ to } & \mathop \mathbf{diag}(X)=\varvec{1}\\ & X \succeq 0, \end{array} \end{aligned}$$
(52)

with variable \(X \in \mathbf {S}^n\). The operator \(\mathop \mathbf{diag}:\mathbf {S}^n \rightarrow {\mathbf{R}}^n\) returns the diagonal elements of the input matrix as a vector: \(\mathop \mathbf{diag}(X)=(X_{11}, X_{22}, \ldots , X_{nn})\). If moderate accuracy is allowed, we can solve the centering problem of the SDP relaxation

$$\begin{aligned} \begin{array}{ll} \text{ minimize } & {-(1/4)} \mathop \mathbf{tr}(LX) + \mu \phi (X) \\ \text{ subject } \text{ to } & \mathop \mathbf{diag}(X) = \varvec{1}\\ & X \in \Pi _{E'}(\mathbf {S}_+^{n}) \end{array} \end{aligned}$$
(53)

with optimization variable \(X \in \mathbf {S}^{n}_{E^\prime }\), where \(E^\prime\) is a chordal extension of E. Note that \(\mathop \mathbf{tr}(X) = n\) for all feasible X, since \(\mathop \mathbf{diag}(X)=\varvec{1}\). The centering problem has the form of (5) with

$$\begin{aligned} C=-\frac{1}{4}L, \qquad N=\frac{1}{n}I, \qquad {\mathcal {A}}(X) = \mathop \mathbf{diag}(X). \end{aligned}$$

The Lagrangian of (53) is in the form of (27) where f is defined in (6), and z is the Lagrange multiplier associated with the equality constraint \(\mathop \mathbf{diag}(X)=\varvec{1}\). Thus we have

$$\begin{aligned} \frac{1}{4} \mathop \mathbf{tr}(LX^\star ) \le p^\star _\mathrm {sdp} \le \varvec{1}^T z^\star , \qquad -\frac{1}{4} \mathop \mathbf{tr}(LX^\star ) + \varvec{1}^T z^\star = \mu n, \end{aligned}$$
(54)

where \(X^\star\) and \(z^\star\) are the primal and dual optimal solutions of the centering problem (53), and \(p^\star _\mathrm {sdp}\) is the optimal value of the SDP (52).
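For a concrete instance of this data, the following NumPy sketch builds the Laplacian of a small made-up graph (a 4-cycle) and the corresponding centering data, and verifies that \(\mathop \mathbf{tr}(NX)=1\) for a feasible X with unit diagonal:

```python
import numpy as np

# Laplacian L = D - W of a 4-cycle (example graph)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
n = 4
L = np.zeros((n, n))
for i, j in edges:
    L[i, i] += 1; L[j, j] += 1      # degrees on the diagonal
    L[i, j] -= 1; L[j, i] -= 1      # minus the adjacency matrix

# centering-problem data for the maxcut relaxation
C = -0.25 * L
N = np.eye(n) / n

# rows of a Laplacian sum to zero
assert np.allclose(L @ np.ones(n), 0.0)

# any feasible X has diag(X) = 1, hence tr(X) = n and tr(N X) = 1
X = np.eye(n)                        # a trivially feasible point
assert np.isclose(np.trace(N @ X), 1.0)
```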

Numerical results We first collect four MAXCUT problems of moderate size from SDPLIB [15]. The SDP relaxation (52) is solved using MOSEK [41], and the optimal value computed by MOSEK is denoted by \(p^\star _\mathrm {sdp}\). (Note that the source file for the graph used in maxG55 was incorrectly converted into SDPA sparse format. The objective value for the maxG55 problem obtained from the original data file is therefore \({1.1039 \times 10^4}\) instead of \({9.9992 \times 10^3}\) as reported in SDPLIB.)

In (53), we set \(\mu =0.001/n\), and report in column 4 of Table 1 the difference between \(p^\star _\mathrm {sdp}\) and the cost function \((1/4)\mathop \mathbf{tr}(LX)\) at the suboptimal solution returned by the algorithm.

Table 1 Results for four instances of the MAXCUT problem from SDPLIB [15]

The last two columns of Table 1 give the relative primal and dual residuals. These results show that the proposed algorithm solves the centering SDP (53) with the desired accuracy. A comparison of the third and fourth columns of Table 1 confirms (54): the objective value of the SDP at X is within \(\mu n = 10^{-3}\) of the optimal value. Since this gap is small relative to the values of \(p^\star _\mathrm {sdp}\), the computed points on the central path are close to the optimal solutions of the SDPs.

To test the scalability of algorithm (33), we add four larger graphs from the SuiteSparse collection [36]. In Table 2 we report the time per Cholesky factorization, the number of Newton steps per iteration, the time per PDHG iteration, and the number of iterations in the primal–dual (PDHG) algorithm for the eight test problems.

Table 2 The four MAXCUT problems from SDPLIB plus four larger graphs from the SuiteSparse collection [36]

As can be seen from the table, the number of Newton iterations per prox-evaluation remains small even when the size of the problem increases. Also, we observe that the time per PDHG iteration is roughly the cost of a sparse Cholesky factorization times the number of Newton steps. This means that the backtracking in Newton’s method does not cause a significant overhead. Since the evaluations of \({\mathcal {A}}\) and \({\mathcal {A}}^*\) in this problem are very cheap, the cost per prox-evaluation is the dominant term in the per-iteration complexity.

4.2 Graph partitioning

The problem of partitioning the vertices of a graph \(G=(V,E)\) into two subsets of equal size (here we assume an even number of vertices), while minimizing the number of edges between the two subsets, can be expressed as

$$\begin{aligned} \begin{array}{ll} \text{ minimize } &{} (1/4)x^TLx \\ \text{ subject } \text{ to } &{} \varvec{1}^T x =0 \\ &{} x\in \{-1,1\}^n, \end{array} \end{aligned}$$

where L is the graph Laplacian. The ith entry of the n-vector x indicates the set that vertex i is assigned to. To obtain an SDP relaxation we introduce a matrix variable \(Y=xx^T\) and write the problem in the equivalent form

$$\begin{aligned} \begin{array}{ll} \text{ minimize } &{} (1/4)\mathop \mathbf{tr}(LY) \\ \text{ subject } \text{ to } &{} \varvec{1}^TY\varvec{1}= 0 \\ &{} \mathop \mathbf{diag}(Y) = \varvec{1}\\ &{} Y = xx^T, \end{array} \end{aligned}$$

and then relax the constraint \(Y=xx^T\) as \(Y\succeq 0\). This gives the SDP

$$\begin{aligned} \begin{array}{ll} \text{ minimize } & (1/4) \mathop \mathbf{tr}(LY) \\ \text{ subject } \text{ to } & \varvec{1}^TY\varvec{1}= 0 \\ & \mathop \mathbf{diag}(Y) = \varvec{1}\\ & Y \succeq 0. \end{array} \end{aligned}$$
(55)

The dual SDP is

$$\begin{aligned} \begin{array}{ll} \text{ maximize } & \varvec{1}^T z \\ \text{ subject } \text{ to } & \mathop \mathbf{diag}(z) + \xi \varvec{1}\varvec{1}^T \preceq (1/4) L, \end{array} \end{aligned}$$

with variables \(\xi \in {\mathbf{R}}\) and \(z \in {\mathbf{R}}^n\), where \(\mathop \mathbf{diag}(z)\) denotes the diagonal matrix with the entries of z on its diagonal.

The aggregate sparsity pattern of the SDP (55) is completely dense, because the equality constraint \(\varvec{1}^TY\varvec{1}=0\) has a coefficient matrix of all ones. We therefore eliminate the dense constraint using the technique described in [30, page 668]. Let P be the \(n\times (n-1)\) matrix

$$\begin{aligned} P = \left[ \begin{array}{rrrrr} 1 &{}\quad 0 &{}\quad \cdots &{}\quad 0 &{}\quad 0 \\ -1 &{}\quad 1 &{}\quad \cdots &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad -1 &{}\quad \cdots &{}\quad 0 &{}\quad 0 \\ \vdots &{}\quad \vdots &{}\quad &{}\quad \vdots &{}\quad \vdots \\ 0 &{}\quad 0 &{}\quad \cdots &{}\quad 1 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad \cdots &{}\quad -1 &{}\quad 1 \\ 0 &{}\quad 0 &{}\quad \cdots &{}\quad 0 &{}\quad -1 \end{array} \right] . \end{aligned}$$

The columns of P form a sparse basis for the orthogonal complement of the multiples of the vector \(\varvec{1}\). Suppose Y is feasible in (55) and define

$$\begin{aligned} \left[ \begin{array}{cc} X &{}\quad u \\ u^T &{}\quad v \end{array}\right] = \left[ \begin{array}{cc} P&\varvec{1}\end{array}\right] ^{-1} Y \left[ \begin{array}{cc} P&\varvec{1}\end{array}\right] ^{-T}. \end{aligned}$$
(56)

From \(\varvec{1}^TY\varvec{1}= 0\), we see that

$$\begin{aligned} 0 = \varvec{1}^TY\varvec{1}= \varvec{1}^T \left[ \begin{array}{cc} P&\varvec{1}\end{array}\right] \left[ \begin{array}{cc} X &{} \quad u \\ u^T &{} \quad v \end{array}\right] \left[ \begin{array}{cc} P&\varvec{1}\end{array}\right] ^T \varvec{1}= n^2 v, \end{aligned}$$

and therefore \(v=0\). Since the matrix (56) is positive semidefinite and its last diagonal entry v is zero, its last row and column must vanish, so \(u = 0\). Hence every feasible Y can be expressed as \(Y = P XP^T\), with \(X\succeq 0\). If we make this substitution in (55) we obtain

$$\begin{aligned} \begin{array}{ll} \text{ minimize } &{} (1/4) \mathop \mathbf{tr}(P^TLP X) \\ \text{ subject } \text{ to } &{} \mathop \mathbf{diag}(PXP^T) = \varvec{1}\\ &{} X\succeq 0. \end{array} \end{aligned}$$

The \((n-1)\times (n-1)\) matrix \(P^TLP\) has elements

$$\begin{aligned} (P^TLP)_{ij} = \left\{ \begin{array}{ll} L_{ii} - 2L_{i,i+1} + L_{i+1,i+1} &{} i=j \\ L_{ij} - L_{i+1,j} - L_{i,j+1} + L_{i+1,j+1} &{} i \ne j. \end{array}\right. \end{aligned}$$

Thus the sparsity pattern \(E^\prime\) of the matrix \(P^TLP\) contains E but is generally denser, i.e., \(E \subseteq E^\prime\). The n constraints \(\mathop \mathbf{diag}(PXP^T) = \varvec{1}\) reduce to

$$\begin{aligned} X_{11} = 1, \quad X_{i-1,i-1} + X_{ii} - 2X_{i,i-1} = 1, \;\; i=2,\ldots ,n-1, \quad X_{n-1,n-1} = 1. \end{aligned}$$
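The reduction above is easy to verify numerically. The NumPy sketch below (with a made-up example graph, a 6-cycle) constructs P, checks that its columns are orthogonal to \(\varvec{1}\), and confirms the entrywise formula for \(P^TLP\) and the reduced form of the constraints:

```python
import numpy as np

n = 6
# P: n x (n-1) matrix with columns e_i - e_{i+1}
P = np.zeros((n, n - 1))
for i in range(n - 1):
    P[i, i] = 1.0
    P[i + 1, i] = -1.0

# the columns of P are orthogonal to the all-ones vector
assert np.allclose(np.ones(n) @ P, 0.0)

# Laplacian of a 6-cycle (example graph)
L = 2.0 * np.eye(n)
for i in range(n):
    L[i, (i + 1) % n] -= 1.0
    L[(i + 1) % n, i] -= 1.0

# entrywise formula for P^T L P; by symmetry of L the off-diagonal
# expression also covers the diagonal entries
M = P.T @ L @ P
for i in range(n - 1):
    for j in range(n - 1):
        assert np.isclose(M[i, j],
                          L[i, j] - L[i + 1, j] - L[i, j + 1] + L[i + 1, j + 1])

# diag(P X P^T) = 1 reduces to the n scalar constraints displayed above
rng = np.random.default_rng(0)
B = rng.standard_normal((n - 1, n - 1))
X = B + B.T                       # random symmetric X
d = np.diag(P @ X @ P.T)
assert np.isclose(d[0], X[0, 0])
assert np.isclose(d[n - 1], X[n - 2, n - 2])
for i in range(1, n - 1):
    assert np.isclose(d[i], X[i - 1, i - 1] + X[i, i] - 2.0 * X[i, i - 1])
```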

To apply algorithm (33), we first rewrite the graph partitioning problem as

$$\begin{aligned} \begin{array}{ll} \text{ minimize } &{} (1/4)\mathop \mathbf{tr}(P^TLP X) \\ \text{ subject } \text{ to } &{} \mathop \mathbf{diag}(PXP^T) = \varvec{1}\\ &{} X \in \Pi _{E^{\prime \prime }}(\mathbf {S}_+^{n-1}) \end{array} \end{aligned}$$
(57)

where \(E^{\prime \prime }\) is a chordal extension of the aggregate sparsity pattern \(E^\prime\). Note that \(\mathop \mathbf{tr}(P^TPX) = n-1\) for all feasible X. The centering problem for this sparse SDP is of the form (5) with

$$\begin{aligned} \begin{array}{ll} \displaystyle C = \frac{1}{4} P^TLP, &{} \qquad {\mathcal {A}}(X) = \mathop \mathbf{diag}(PXP^T), \\ \displaystyle N = \frac{1}{n-1} P^TP, &{} \qquad {\mathcal {A}}^*(y)=P^T \mathop \mathbf{diag}(y) P. \end{array} \end{aligned}$$
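A quick NumPy check (with the matrix P defined earlier, for a small made-up n) confirms that \({\mathcal {A}}^*(y)=P^T \mathop \mathbf{diag}(y) P\) is indeed the adjoint of \({\mathcal {A}}(X)=\mathop \mathbf{diag}(PXP^T)\), i.e. that \(y^T{\mathcal {A}}(X) = \mathop \mathbf{tr}({\mathcal {A}}^*(y)X)\):

```python
import numpy as np

n = 5
P = np.zeros((n, n - 1))
for i in range(n - 1):
    P[i, i], P[i + 1, i] = 1.0, -1.0

A = lambda X: np.diag(P @ X @ P.T)            # A : S^{n-1} -> R^n
Astar = lambda y: P.T @ np.diag(y) @ P        # A* : R^n -> S^{n-1}

rng = np.random.default_rng(1)
B = rng.standard_normal((n - 1, n - 1))
X = B + B.T                                   # random symmetric X
y = rng.standard_normal(n)

# adjoint identity <A(X), y> = <X, A*(y)>
assert np.isclose(y @ A(X), np.trace(Astar(y) @ X))
```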

Numerical results Table 3 shows the numerical results for four problems from SDPLIB [15].

Table 3 Results for four graph partitioning problems from SDPLIB

The SDP relaxation (55) is solved by MOSEK and its optimal value is denoted by \(p^\star _\mathrm {sdp}\). In solving (57), we set \(\mu =0.001/n\), and report in Table 3 the value \((1/4)\mathop \mathbf{tr}(P^TLPX)\), where X is the solution returned by algorithm (33). As in the first experiment, the numerical results show that the algorithm solves the centering SDP (57) with the desired accuracy.

In addition, we test the algorithm for four additional graphs from the SuiteSparse collection [36]. Table 4 reports the time per Cholesky factorization, the number of Newton steps per iteration, the time per PDHG iteration, and the number of iterations in the primal–dual algorithm.

Table 4 The four graph partitioning problems from SDPLIB plus four larger graphs from the SuiteSparse collection

The same observations as in Sect. 4.1 apply: the number of Newton steps remains moderate as the size of the problem increases, and the cost per iteration is roughly linear in the cost of a Cholesky factorization.

5 Conclusions

We presented a Bregman proximal algorithm for the centering problem in sparse semidefinite programming. The Bregman distance used in the proximal operator is generated by the logarithmic barrier function for the cone of sparse matrices with a positive semidefinite completion. With this choice of Bregman distance, the per-iteration complexity of the algorithm is dominated by the cost of a Cholesky factorization with the aggregate sparsity pattern of the SDP, plus the cost of evaluating the linear mapping in the constraints and its adjoint.

The proximal algorithm we used is based on the primal–dual method proposed by Chambolle and Pock [22]. An important addition to the algorithm is a new procedure for selecting the primal and dual step sizes, without knowledge of the norm of the linear mapping or the strong convexity constant of the Bregman kernel. In the current implementation the ratio of the primal and dual step sizes is kept fixed throughout the iteration. An interesting further improvement would be to relax this condition, choosing \(\beta =\sigma _k / \tau _k\) adaptively [5, 39].

The standard primal–dual hybrid gradient algorithm is known to include several important algorithms as special cases. The Bregman extension of the algorithm is equally versatile. We mention one interesting example. Suppose the matrix A in (25) is a product of two matrices \(A= CB\). Then (25) is equivalent to

$$\begin{aligned} \begin{array}{ll} \text{ minimize } &{} f(x) + g(y) \\ \text{ subject } \text{ to } &{} Bx = y \end{array} \end{aligned}$$
(58)

where \(g(y) = \delta _{\{b\}}(Cy)\). The standard (Euclidean) proximal operator of g is the mapping

$$\begin{aligned} \mathrm {prox}_g(u) = \mathop \mathrm{argmin}_{Cy=b} \Vert y-u\Vert _2^2. \end{aligned}$$
(59)

The PDHG algorithm applied to the reformulated problem requires in each iteration an evaluation of the Bregman proximal operator of f, matrix–vector products with B and \(B^T\), and the solution of the projection problem (59) in the definition of \(\mathrm {prox}_g\). For \(C=A\), \(B=I\), this can be interpreted as a Bregman extension of the Douglas–Rachford algorithm, or of Spingarn's method for convex optimization with equality constraints.
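For completeness, when C has full row rank the projection (59) has the familiar closed form \(u - C^T(CC^T)^{-1}(Cu-b)\). A minimal NumPy sketch (the full-row-rank assumption and the example set are ours, for illustration):

```python
import numpy as np

def prox_g(u, C, b):
    """Euclidean projection of u onto the affine set {y : C y = b}.

    Assumes C has full row rank, so C C^T is invertible and the
    projection is u - C^T (C C^T)^{-1} (C u - b).
    """
    w = np.linalg.solve(C @ C.T, C @ u - b)
    return u - C.T @ w

# example: project the origin onto {y in R^3 : sum(y) = 1}
C = np.ones((1, 3))
b = np.array([1.0])
y = prox_g(np.zeros(3), C, b)
assert np.allclose(C @ y, b)              # feasibility
assert np.allclose(y, np.full(3, 1 / 3))  # the projection is (1/3, 1/3, 1/3)
```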