A Weighted Mirror Descent Algorithm for Nonsmooth Convex Optimization Problem

Large-scale nonsmooth convex optimization is a common problem for a range of computational areas including machine learning and computer vision. Problems in these areas contain special domain structures and characteristics. Special treatment of such problem domains, exploiting their structures, can significantly reduce the computational burden. In this paper, we consider a Mirror Descent method with a special choice of distance function for solving nonsmooth optimization problems over a Cartesian product of convex sets. We propose to use a nonlinear weighted distance in the projection step. The convergence analysis identifies optimal weighting parameters that, eventually, lead to the optimally weighted step-size strategy for every projection on a corresponding convex set. We show that the optimality bound of the Mirror Descent algorithm using the weighted distance is either an improvement to, or in the worst case as good as, the optimality bound of the Mirror Descent using unweighted distances. We demonstrate the efficiency of the algorithm by solving the Markov Random Fields optimization problem. In order to exploit the domain of the problem, we use a weighted log-entropy distance and a weighted Euclidean distance. Promising experimental results demonstrate the effectiveness of the proposed method.


Introduction
It is well known that convex optimization problems can be solved in polynomial time, with a low iteration count, using interior point methods. However, most of these methods do not scale well with the dimension of the optimization problem. The cost of a single iteration of an interior point method grows nonlinearly with the problem size. As a result, even a low iteration count can become expensive in terms of overall computation. Since what matters most in practice is the total time needed to solve the problem, first-order methods with computationally cheap iterations become a viable choice for large-scale optimization problems. In this paper, we present an efficient first-order method for solving a general large-scale nonsmooth optimization problem over a Cartesian product of convex sets. The proposed method is the Mirror Descent (MD) algorithm [1][2][3][4], an iterative first-order approach for nonsmooth optimization problems, equipped with a special choice of distance function. The main idea of MD is to utilize a suitable Bregman distance [5] and to identify the optimal step-size for the projection step over the feasible domain. In the case where the domain is a Cartesian product of convex sets, we propose to use an optimal step-size strategy for each projection onto the corresponding subset instead of a common step-size for the projection onto the entire domain. To achieve this, we employ a weighted distance function in the projection scheme. The weighted distance function exploits the 'disjoint' structure of the problem's domain by assigning a suitable weight to every subset. By assessing the optimality bound of the proposed algorithm, we establish the optimal weighting parameter for the distance function of each subset. These weighting parameters enter the projection step as scaling factors of the common step-size. Thus, the step-size is scaled appropriately for each subset projection.
As an illustration, we demonstrate the performance of the proposed algorithm, hereafter referred to as the weighted MD, by solving the Markov Random Fields (MRF) optimization problem [6,7]. This problem often arises from the areas of image analysis and machine learning [8]. We employ the weighted MD with log-entropy distances and optimal subset-dependent step-sizes to initialize the starting point. Subsequently, we use the weighted MD with Euclidean distances and incorporate the duality gap in the step-size computation. Experimental results confirm the superiority of the weighted MD over the MD algorithm with unweighted distance.
The remainder of this paper focuses on analyzing and describing the proposed weighted MD algorithm and its application to the MRF optimization problem. In the next section, we review the MD algorithm with a general distance function. Section 3 derives the optimality bound for solving a nonsmooth convex optimization problem over a Cartesian product of convex sets using MD. In addition, Sect. 3 introduces the definitions required for developing the weighted MD algorithm. In Sect. 3.3, we derive the optimality bound of the proposed weighted MD algorithm and show that it is either an improvement to, or in the worst case as good as, the MD algorithm as described in Sect. 2. In Sect. 4, we consider the dual of the MRF optimization problem. The MRF dual belongs to the class of large-scale nonsmooth optimization problems over a Cartesian product of convex sets. We can therefore employ the weighted MD to solve it. We report very promising computational results in the online supplementary material provided.

Mirror Descent Algorithm
Consider the following nonsmooth convex optimization problem:

max f(x)  s.t.  x ∈ X := X_1 × X_2 × ⋯ × X_N,   (1)

where X is the Cartesian product of N closed and convex sets X_i, and X ⊂ R^n. In this problem, the decision variable x can be decomposed into N disjoint blocks, where each block x_i ∈ X_i. In addition, we assume the following for (1):

- The objective function f : X → R is concave and Lipschitz continuous.
Problem (1) can be solved by the Mirror Descent algorithm. The MD algorithm [1][2][3][4] is a generalization of the projected subgradient method. The standard subgradient approach employs the Euclidean distance function with a suitable step-size in the projection step. Mirror Descent extends the standard projected subgradient method by employing a nonlinear distance function with an optimal step-size in the nonlinear projection step. In this section, we review the Mirror Descent algorithm for solving problem (1) without considering the domain geometry. Let D(·,·) denote the distance between any two points in the set X; the MD algorithm employs a sequence of nonlinear projections:

x^{k+1} = argmax_{x ∈ X} { μ ⟨f'_{x^k}, x⟩ − D(x, x^k) },   (2)

where f'_{x^k} is a subgradient at the point x^k and μ is the optimal step-size. The setup of Mirror Descent requires D(·,·) to be compatible with a norm ‖·‖ on the space embedding X, whose dual norm is ‖s‖_* := max { ⟨s, x⟩ : ‖x‖ ≤ 1 }. The maximum distance is given by Θ := max_{x,y ∈ X} D(x, y). Suppose f(x) is Lipschitz continuous on X with the Lipschitz constant L := max_{x ∈ X} ‖f'_x‖_* < ∞; then we have the following convergence property for the MD algorithm.
Theorem 2.1 Let f^* denote the global optimal objective value and x̄ := argmax_{x ∈ {x^1,...,x^K}} f(x). Then, using the optimal step-size

μ = √( 2Θ / (L² K) ),

we have the following optimality bound after K iterations:

f^* − f(x̄) ≤ L √( 2Θ / K ).   (4)

Theorem 2.1 is a well-known result, and its proof can be found in [2,4]. In the following section, we derive explicitly the optimality bound when the domain X is the Cartesian product of subsets X_i, i = 1, 2, ..., N. After that, we introduce a new distance function that improves the derived optimality bound. The proposed parameterised distance naturally assigns weighting parameters to the projection step (2) on each subset X_i.
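To make the generic method concrete, the following is a minimal sketch of Mirror Descent with a log-entropy distance on the probability simplex, using the fixed step-size of Theorem 2.1. It is an illustration only, not the implementation used in the paper; the subgradient oracle, the simplex domain, and the function name are assumptions chosen for the example.

```python
import numpy as np

def md_entropy_simplex(subgrad, x0, L, Theta, K):
    """Mirror Descent on the probability simplex with the log-entropy
    Bregman distance (exponentiated-gradient update) and the fixed
    step-size mu = sqrt(2*Theta / (L**2 * K)) of Theorem 2.1."""
    mu = np.sqrt(2.0 * Theta / (L ** 2 * K))
    x = np.asarray(x0, dtype=float).copy()
    iterates = [x]
    for _ in range(K):
        g = subgrad(x)            # subgradient of the concave objective f at x
        x = x * np.exp(mu * g)    # ascent step taken in the mirror (dual) space
        x = x / x.sum()           # Bregman projection back onto the simplex
        iterates.append(x)
    # The bound of Theorem 2.1 applies to the best iterate, argmax_k f(x^k).
    return iterates
```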

Mirror Descent Algorithm with Weighted Distance
We consider a distance measurement on the given domain (the Cartesian product of many subsets) as a sum of weighted subset distances.In this setting, each subset is equipped with a specific distance function and a weighting parameter.We subsequently utilize this weighted distance in the projection step to develop a weighted Mirror Descent algorithm.

Weighted Distance Function
The distance function D(x, y) is defined as the Bregman distance:

D(x, y) := ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩,   (6)

where ψ(·) is σ-strongly convex w.r.t. a compatible norm ‖·‖, i.e.,

ψ(x) ≥ ψ(y) + ⟨∇ψ(y), x − y⟩ + (σ/2) ‖x − y‖²,  ∀x, y ∈ X.

Without any loss of generality, we assume σ = 1 throughout the paper. A compatible norm ‖·‖ depends on the choice of distance function. For example, the l1-norm is chosen for the log-entropy distance [4], and the l2-norm for the Euclidean distance. Instead of using one distance function over the entire domain, let us consider separate choices of Bregman distance D_i for each subset X_i, i ∈ {1, 2, ..., N}:

D_i(x_i, y_i) := ψ_i(x_i) − ψ_i(y_i) − ⟨∇ψ_i(y_i), x_i − y_i⟩.

Each subset distance D_i(x_i, y_i) is equipped with a compatible norm ‖·‖_i. Various choices of distance functions and compatible norms are discussed in [5,9,10]. Two examples that are relevant to the MRF application we consider later are the log-entropy distance, compatible with the l1-norm, and the Euclidean distance, compatible with the l2-norm. With these subset distances, the distance between x and y is the sum of the distances D_i(x_i, y_i). Using this definition, we can now state a corollary to Theorem 2.1.
Corollary 3.1 The optimality bound (4) for solving problem (1) by the Mirror Descent algorithm is given by:

f^* − f(x̄) ≤ √( 2 (Σ_{i=1}^N Θ_i) (Σ_{i=1}^N L_i²) / K ).   (7)

Proof When X is the Cartesian product of N convex sets X_i, i ∈ {1, 2, ..., N}, the distance between two vectors x, y ∈ X is the sum of the distances between the blocks x_i, y_i ∈ X_i. As a result, the maximum distance is also the sum of the maximum distances on the subsets X_i:

Θ = Σ_{i=1}^N Θ_i,  where Θ_i := max_{x_i, y_i ∈ X_i} D_i(x_i, y_i).   (8)

Since the subsets X_i and X_j are independent, i ≠ j; i, j ∈ {1, 2, ..., N}, we have:

L² = Σ_{i=1}^N L_i²,  where L_i := max_{x_i ∈ X_i} ‖f'_{x_i}‖_{i*}.   (9)

Substituting Θ and L in the optimality bound (4) yields (7).
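As a small illustration, the snippet below evaluates the unweighted constants Θ = Σ_i Θ_i and L = √(Σ_i L_i²) from (8) and (9) and the resulting bound (7); the numerical values are arbitrary placeholders chosen for the example.

```python
import numpy as np

L_blocks = np.array([1.0, 4.0, 0.5])      # local Lipschitz constants L_i (placeholders)
Theta_blocks = np.array([2.0, 0.2, 1.0])  # maximum subset distances Theta_i (placeholders)
K = 500                                   # number of iterations

Theta = Theta_blocks.sum()                         # unweighted maximum distance (8)
L = np.sqrt((L_blocks ** 2).sum())                 # unweighted Lipschitz constant (9)
bound_unweighted = L * np.sqrt(2.0 * Theta / K)    # optimality bound (7)
print(bound_unweighted)
```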
We now propose a weighted distance function in order to improve the optimality bound (7). For each subset distance D_i, let us introduce a weighting parameter α_i > 0. The new distance function is then defined as a weighted combination of subset distances:

D(x, y) := Σ_{i=1}^N α_i D_i(x_i, y_i).   (10)

This yields the definition of ψ(x) as a weighted sum of the convex functions ψ_i(x_i):

ψ(x) := Σ_{i=1}^N α_i ψ_i(x_i).   (11)

Substituting (10) in the projection step (2) naturally yields:

x^{k+1} = argmax_{x ∈ X} { μ ⟨f'_{x^k}, x⟩ − Σ_{i=1}^N α_i D_i(x_i, x_i^k) }.   (12)

Essentially, the structure of X makes it possible to compute the projection (12) independently on each subset X_i. In other words, if we consider the optimality condition of the optimization problem (12) w.r.t. each block x_i ∈ X_i, then (12) is separable and is equivalent to:

∀i ∈ {1, ..., N} :  x_i^{k+1} = argmax_{x_i ∈ X_i} { μ ⟨f'_{x_i^k}, x_i⟩ − α_i D_i(x_i, x_i^k) },   (13)

i.e., each block is updated with the scaled step-size μ/α_i. As a result, we hope to achieve better performance by using suitable (or optimal) weighting parameters α_i for the corresponding subset X_i.
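The separability of (12) means that, in an implementation, each block is updated with its own scaled step-size μ/α_i. The sketch below illustrates this for blocks equipped with log-entropy distances over simplices; the block layout and the subgradient representation are assumptions made for the example.

```python
import numpy as np

def weighted_md_step(x_blocks, g_blocks, mu, alphas):
    """One weighted MD step: the projection (12) splits into independent
    block projections (13), each using the scaled step-size mu / alpha_i.
    Log-entropy distances over simplices are assumed for illustration."""
    new_blocks = []
    for x_i, g_i, alpha_i in zip(x_blocks, g_blocks, alphas):
        z = x_i * np.exp((mu / alpha_i) * g_i)   # exponentiated-gradient step
        new_blocks.append(z / z.sum())           # projection onto the simplex X_i
    return new_blocks
```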

Compatible Norm, Dual Norm, Weighted Lipschitz Constant and Maximum Weighted Distance
In order to analyze the convergence of the sequence generated by (12), we need to establish the Lipschitz constant. This can be computed as the upper bound of the dual norm of the subgradients. To this end, we propose a compatible norm ‖·‖ associated with the weighted distance.
Lemma 3.1 Suppose each ψ_i is 1-strongly convex w.r.t. ‖·‖_i over X_i. Then the weighted function ψ(x) = Σ_{i=1}^N α_i ψ_i(x_i) is 1-strongly convex w.r.t. the weighted norm:

‖x‖ := √( Σ_{i=1}^N α_i ‖x_i‖_i² ).   (14)

Proof We have, ∀x, y ∈ X:

ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩ = Σ_{i=1}^N α_i [ ψ_i(x_i) − ψ_i(y_i) − ⟨∇ψ_i(y_i), x_i − y_i⟩ ] ≥ (1/2) Σ_{i=1}^N α_i ‖x_i − y_i‖_i² = (1/2) ‖x − y‖².

The dual norm ‖·‖_* of the proposed weighted norm (14) can be derived using the definition of the dual norm (see Sect. 2 and [11]):

‖s‖_* = √( Σ_{i=1}^N ‖s_i‖_{i*}² / α_i ),   (15)

where ‖·‖_{i*} is the dual norm of ‖·‖_i over the subset X_i. Let L_i = max_{x_i ∈ X_i} ‖f'_{x_i}‖_{i*} denote the local Lipschitz constant w.r.t. a subset X_i; then, the weighted Lipschitz constant is given by:

L = √( Σ_{i=1}^N L_i² / α_i ).   (16)

In addition, the maximum weighted distance becomes:

Θ = Σ_{i=1}^N α_i Θ_i.   (17)

Remark 3.1 The unweighted quantities (8) and (9) can be viewed as a special case of the above weighted quantities, where α_i = 1, ∀i = 1, 2, ..., N.
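The weighted constants (16) and (17) are simple functions of the per-subset quantities. Below is a small helper that evaluates them under the σ = 1 convention used above; it is a sketch of the formulas, not code from the paper.

```python
import numpy as np

def weighted_constants(L_blocks, Theta_blocks, alphas):
    """Weighted Lipschitz constant (16) and maximum weighted distance (17):
    L = sqrt(sum_i L_i^2 / alpha_i),  Theta = sum_i alpha_i * Theta_i."""
    L_blocks = np.asarray(L_blocks, dtype=float)
    Theta_blocks = np.asarray(Theta_blocks, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    L = np.sqrt(np.sum(L_blocks ** 2 / alphas))
    Theta = float(np.sum(alphas * Theta_blocks))
    return L, Theta
```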

Convergence Properties
We first show a general optimality bound for the weighted MD algorithm.
Lemma 3.2 Let f^* denote the global optimal objective value, x̄ := argmax_{x ∈ {x^1,...,x^K}} f(x), and μ be the step-size. We have the following optimality bound after K iterations:

f^* − f(x̄) ≤ Θ / (K μ) + μ L² / 2.   (18)

Similar results can be found in [1,2,4]. The initial bound (18) depends on the three terms μ, L and Θ, where the last two are themselves functions of the weighting parameters α_i. Therefore, we can tighten the bound (18) by considering its minimization w.r.t. μ and α_i.
Theorem 3.1 For each subset X_i, let L_i = max_{x_i ∈ X_i} ‖f'_{x_i}‖_{i*} be the local Lipschitz constant and Θ_i = max_{x_i, y_i ∈ X_i} D_i(x_i, y_i) be the maximum subset distance. Then, the optimal weighting parameters are given by:

α_i = L_i / ( √Θ_i  Σ_{j=1}^N L_j √Θ_j ),  ∀i ∈ {1, ..., N}.   (19)

In addition, these parameters yield the optimal step-size:

μ = √(2/K) / ( Σ_{j=1}^N L_j √Θ_j ).   (20)

Proof Minimizing the RHS of (18) w.r.t. μ yields the result of Theorem 2.1, f^* − f(x̄) ≤ L √(2Θ/K). This optimality bound is a function of α := [α_1, α_2, ..., α_N]. The best optimality bound can be achieved by considering a minimization of:

φ(α) := L² Θ = ( Σ_{j=1}^N L_j² / α_j ) ( Σ_{j=1}^N α_j Θ_j ).

The optimizer of φ(α) needs to satisfy the following optimality condition:

∂φ/∂α_i = −(L_i² / α_i²) Σ_{j=1}^N α_j Θ_j + Θ_i Σ_{j=1}^N L_j² / α_j = 0,  ∀i ∈ {1, ..., N}.   (21)

Now, let us rewrite the optimality bound Θ/(Kμ) + μL²/2 in (18) as:

Θ/(Kμ) + μL²/2 = Σ_{i=1}^N [ α_i Θ_i / (K μ) + μ L_i² / (2 α_i) ].

Minimizing the RHS of the above equality w.r.t. α_i and substituting the optimal step-size μ = √(2Θ/(L²K)) shows that the bound is invariant under a positive scaling of α. Suppose the weighted distance is normalized by the weighting parameters, i.e., Θ = 1; then the weighted Lipschitz constant is given by:

L = Σ_{j=1}^N L_j √Θ_j.   (22)

Using the above weighted Lipschitz constant and the normalized maximum distance Θ = 1 yields the optimal weighting parameters (19). We can verify that the optimal α_i normalize the maximum distance, i.e., Θ = 1, generate the weighted Lipschitz constant (22) using the definition (16), and satisfy the optimality condition (21) of the optimality bound function φ(α).

Theorem 3.2 Let f^* denote the global optimal objective value and x̄ := argmax_{x ∈ {x^1,...,x^K}} f(x). The weighted MD algorithm with the optimal step-size (20) and the optimal weighting parameters (19) has the following optimality bound after K iterations:

f^* − f(x̄) ≤ √(2/K) Σ_{i=1}^N L_i √Θ_i.   (23)

Proof Substituting the optimal step-size (20) and the optimal weighting parameters (19) into (18) directly yields the result.
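The optimal weights (19) and step-size (20) depend only on the per-subset constants L_i and Θ_i. The following helper computes them as derived above; it is a sketch of the formulas under the reconstruction given here, not code from the paper.

```python
import numpy as np

def optimal_weights_and_stepsize(L_blocks, Theta_blocks, K):
    """Optimal weighting parameters (19) and step-size (20):
    alpha_i = L_i / (sqrt(Theta_i) * sum_j L_j * sqrt(Theta_j)),
    mu      = sqrt(2 / K)   /  sum_j L_j * sqrt(Theta_j)."""
    L_blocks = np.asarray(L_blocks, dtype=float)
    Theta_blocks = np.asarray(Theta_blocks, dtype=float)
    S = np.sum(L_blocks * np.sqrt(Theta_blocks))
    alphas = L_blocks / (np.sqrt(Theta_blocks) * S)
    mu = np.sqrt(2.0 / K) / S
    return alphas, mu
```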
The following corollary establishes the relative performance of the proposed weighted MD algorithm compared to the MD algorithm with unweighted distance: the algorithm with weighted distance is never worse than the algorithm with unweighted distance. Numerical experiments discussed in the next section and in the supplementary material underline this promising result.

Corollary 3.2
The optimality bound (23) of the proposed weighted MD algorithm is either an improvement to, or in the worst case as good as, the optimality bound (7) of the MD algorithm with unweighted distance:

√(2/K) Σ_{i=1}^N L_i √Θ_i ≤ √( 2 (Σ_{i=1}^N Θ_i) (Σ_{i=1}^N L_i²) / K ).   (24)

Proof By the Cauchy-Schwarz inequality, we have:

Σ_{i=1}^N L_i √Θ_i ≤ √( Σ_{i=1}^N L_i² ) √( Σ_{i=1}^N Θ_i ).

The above inequality directly yields (24).
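The comparison in Corollary 3.2 can be checked numerically: for any per-subset constants, the weighted bound (23) never exceeds the unweighted bound (7). The snippet below uses random placeholder values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
L_blocks = rng.uniform(0.1, 5.0, size=20)      # local Lipschitz constants L_i
Theta_blocks = rng.uniform(0.1, 3.0, size=20)  # maximum subset distances Theta_i
K = 1000

bound_weighted = np.sqrt(2.0 / K) * np.sum(L_blocks * np.sqrt(Theta_blocks))      # (23)
bound_unweighted = np.sqrt(2.0 * Theta_blocks.sum() * (L_blocks ** 2).sum() / K)  # (7)
assert bound_weighted <= bound_unweighted + 1e-12   # Cauchy-Schwarz inequality (24)
print(bound_weighted, bound_unweighted)
```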

Weighted Mirror Descent Algorithm for MRF Optimization
Markov Random Fields [8] are an important class of graph-structured models in image processing and machine learning. In general, the MRF model aims to reveal hidden quantities ξ based on some observations of available input data. Various discussions of MRF modeling and MRF optimization methods in image analysis and machine learning can be found in [6,8,12,13]. In this paper, we focus on the dual of the linear programming (LP) relaxation of the MRF optimization problem. The detailed description of the MRF model and the construction of the dual problem can be found in the supplementary material provided (see also [6]). Let us consider the LP relaxation of the MRF problem:

min_{ξ ∈ Ξ} ⟨θ, ξ⟩,   (25)

where θ is the cost vector and Ξ is the feasible set of the relaxed indicator variables ξ. Decomposing the underlying graph into a set T of subgraphs and applying the dual decomposition technique yields the dual objective function:

f(λ) := Σ_{t ∈ T} min_{ξ^t ∈ Ξ^t} ⟨θ^t + λ^t, ξ^t⟩.

In this setting, the sum of the data costs θ^t must equal the original θ (see [6] or the supplementary material):

Σ_{t ∈ T} θ^t = θ,   (26)

and the Lagrangian vectors λ^t become the decision variables of the dual optimization problem:

max_{λ ∈ Λ} f(λ),  where Λ := { λ : Σ_{t ∈ T} λ^t = 0 }.   (27)

The domain Λ is a Cartesian product of subsets {Λ_i}_{∀i ∈ I}, where I := {(a, l)}_{∀a ∈ V, ∀l ∈ L} ∪ {(ab, lk)}_{∀ab ∈ E, ∀l,k ∈ L}. Each subset is defined as Λ_i := { λ_i : Σ_{t ∈ T} λ_i^t = 0 }, so that Λ = Λ_1 × Λ_2 × ⋯ × Λ_{|I|}, where |I| is the cardinality of I. It is well known that the optimal value of (27) is a lower bound on that of the LP problem (25); by strong duality, the optimal value of (27) equals the optimal value of the LP (25). Problem (27) is a nonsmooth convex optimization problem over a Cartesian product of convex subsets, i.e., it has the form (1).
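Evaluating the dual objective (27) amounts to solving one MAP subproblem per subgraph and summing the resulting values. A minimal sketch under this decomposition, where solve_subgraph is an assumed oracle (e.g., max-product belief propagation on a tree) that returns the minimizer of one subproblem:

```python
import numpy as np

def dual_value(theta_blocks, lam_blocks, solve_subgraph):
    """Dual objective of the decomposed MRF problem: the sum over subgraphs
    of min_{xi^t} <theta^t + lambda^t, xi^t>.  `solve_subgraph(cost)` is an
    assumed oracle returning the minimizing indicator vector xi^t."""
    value, minimizers = 0.0, []
    for theta_t, lam_t in zip(theta_blocks, lam_blocks):
        cost = theta_t + lam_t
        xi_t = solve_subgraph(cost)              # exact subgraph MAP solution
        value += float(np.dot(cost, xi_t))
        minimizers.append(xi_t)
    return value, minimizers
```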
There have been several approaches for solving the nonsmooth problem (27). One approach is by Savchynskyy et al. [7], who use Nesterov's smoothing technique. Their method relaxes the nonsmooth objective function by a smoothing parameter. As a result, the algorithm only computes a suboptimal solution of the dual problem and does not yield the optimal solution of the LP problem (25). In addition, this algorithm requires computations for all dual variables at every iteration, while the weighted MD requires fewer dual updates as the algorithm converges (as we will see in Remark 4.1). Schmidt et al. [14] proposed a primal-dual method for solving the LP (25); however, their paper shows that the primal-dual method is inferior to the dual decomposition technique for large-scale problems. The weighted MD algorithm is a generalization of the projected subgradient algorithm, which was also proposed for solving the dual (27) by Komodakis et al. [6] and Jancsary et al. [15].

Weighted MD for the MRF Problem
Problem (27) requires an initialization of θ^t that satisfies (26). The standard initialization θ^t = θ/|T| might not give a good starting point for subgradient-type methods. A better initialization is one whose objective function value is closer to the optimal objective value. If we have such an initialization θ^{t*}, we can reduce the computational effort for solving for λ significantly. To this end, let us introduce the following optimization problem:

max_{ρ ∈ P} f(ρ) := Σ_{t ∈ T} min_{ξ^t ∈ Ξ^t} ⟨ρ^t ∘ θ, ξ^t⟩,   (28)

where ∘ denotes the element-wise product and

P := P_1 × P_2 × ⋯ × P_{|I|},  P_i := { ρ_i : Σ_{t ∈ T} ρ_i^t = 1, ρ_i^t ≥ 0 }.   (29)

Problem (28) also has the same form as (1) and can be solved using the weighted MD algorithm. After obtaining the optimal initialization {ρ^{t*} ∘ θ, ∀t ∈ T}, where ρ* = argmax_{ρ ∈ P} f(ρ), we can proceed to solve for λ:

max_{λ ∈ Λ} f(λ) := Σ_{t ∈ T} min_{ξ^t ∈ Ξ^t} ⟨ρ^{t*} ∘ θ + λ^t, ξ^t⟩,   (30)

where Λ = Λ_1 × Λ_2 × ⋯ × Λ_{|I|} is the product set of the linear subsets:

Λ_i := { λ_i : Σ_{t ∈ T} λ_i^t = 0 }.   (31)

The two problems (28) and (30) can be combined into one problem:

max_{ρ ∈ P, λ ∈ Λ} f(ρ, λ) := Σ_{t ∈ T} min_{ξ^t ∈ Ξ^t} ⟨ρ^t ∘ θ + λ^t, ξ^t⟩.   (32)

By setting λ = 0, we have (32) ≡ (28). Similarly, if we set ρ = ρ* = argmax_{ρ ∈ P} f(ρ), then we have (32) ≡ (30). The weighted MD algorithm for solving the MRF problem is described in Algorithm 1. As we will see later (equation (40)), the exact and optimal step-size τ can be computed, while the exact η is not available. A heuristic based on the difference between the current objective value and the optimal solution will be used to approximate η. The smaller this difference is, the less error accumulates in approximating λ. Therefore, the solution of problem (28) yields a starting point for λ whose objective value is closer to the optimal solution than an objective value corresponding to a random starting point.

Algorithm 1: Weighted Mirror Descent for the MRF Problem
Step 1: Choose two nonnegative numbers K_1, K_2;
Step 2: Initialize ρ^1 = (1/|T|)·1 and λ^1 = 0;
Step 3: For k = 1, ..., K_1: compute the subgradient f'_ρ at (ρ^k, λ^1) and update ρ^{k+1} by the log-entropy recurrence in (35) with the subset-dependent step-sizes τ_{α_i};
Step 4: Fix ρ = ρ^{K_1 + 1};
Step 5: For k = 1, ..., K_2: compute the subgradient f'_λ at (ρ, λ^k) and update λ^{k+1} by the Euclidean recurrence in (35) with the subset-dependent step-sizes η_{α'_i};
Step 6: Return (ρ, λ^{K_2 + 1}).

We clarify the various aspects of the vector ρ (similarly for λ):
- Without any index, ρ ∈ P denotes a full vector corresponding to all subgraphs of the set T.
- With superscript t, ρ^t denotes a vector corresponding to subgraph t ∈ T.
- With subscript i, ρ_i denotes a collection of scalars ρ_i^t across all subgraphs that cover the index i, and ρ_i ∈ P_i.
- When superscripts t and k are used together, we separate them by a comma: ρ^{t,k} is a vector, and ρ_i^{t,k} is a scalar.
The two weighted distances, D_P over P and D_Λ over Λ, yield the corresponding subset projections for (33), ∀i ∈ I:

ρ_i^{k+1} = argmax_{ρ_i ∈ P_i} { τ ⟨f'_{ρ_i^k}, ρ_i⟩ − α_i D_i(ρ_i, ρ_i^k) },
λ_i^{k+1} = argmax_{λ_i ∈ Λ_i} { η ⟨f'_{λ_i^k}, λ_i⟩ − α'_i D'_i(λ_i, λ_i^k) }.   (34)

To this end, we choose the log-entropy distance function for each subset P_i and the Euclidean distance function for each subset Λ_i. Let us consider:

- For each P_i: Let ψ_i(ρ_i) = Σ_{t ∈ T} ρ_i^t log ρ_i^t. Then, ψ_i is 1-strongly convex w.r.t. ‖·‖_1 over P_i, and the dual norm of ‖·‖_1 is ‖·‖_∞.
- For each Λ_i: Let ψ'_i(λ_i) = (1/2) Σ_{t ∈ T} (λ_i^t)². Then, ψ'_i is 1-strongly convex w.r.t. ‖·‖_2. The dual norm of ‖·‖_2 is itself.
By using the Bregman distance (6), we obtain the log-entropy distance function and the Euclidean distance function for the corresponding subsets. As a result, each iteration of the recurrences (34) can be solved in a closed form, ∀i ∈ I, ∀t ∈ T:

ρ_i^{t,k+1} = ρ_i^{t,k} exp( (τ/α_i) f'_{ρ_i^{t,k}} ) / Σ_{s ∈ T} ρ_i^{s,k} exp( (τ/α_i) f'_{ρ_i^{s,k}} ),
λ_i^{t,k+1} = λ_i^{t,k} + (η/α'_i) ( f'_{λ_i^{t,k}} − (1/|T|) Σ_{s ∈ T} f'_{λ_i^{s,k}} ).   (35)

We note that the MD algorithm with unweighted distance also uses the above recurrences, with the constant choice α_i = α'_i = 1, ∀i ∈ I. Using the definitions of the optimal step-size (20) and the weighting parameters (19), the two subset-dependent step-sizes τ_{α_i} := τ/α_i and η_{α'_i} := η/α'_i can be written as:

τ_{α_i} = √( 2 Θ_i / (L_i² K_1) ),   η_{α'_i} = √( 2 Θ'_i / ((L'_i)² K_2) ),   (36)

where L_i, Θ_i and L'_i, Θ'_i are, respectively, the local Lipschitz constants and maximum distances of the subsets P_i and Λ_i. The above subset-dependent step-sizes improve the performance of the weighted MD because they use the optimal values of α_i and α'_i instead of the constant 1. It thus remains to show how to compute the subgradients f'_ρ and f'_λ at any feasible ρ ∈ P and λ ∈ Λ.
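The closed-form block updates (35) reduce to an exponentiated-gradient step on each simplex P_i and a gradient step followed by an exact projection onto the sum-zero subspace Λ_i. A minimal sketch of these two updates, written under the reconstruction above and with assumed array layouts:

```python
import numpy as np

def entropy_block_update(rho_i, g_i, step):
    """Log-entropy (exponentiated-gradient) update of one block rho_i on the
    simplex P_i, with the subset-dependent step-size step = tau / alpha_i."""
    z = rho_i * np.exp(step * g_i)
    return z / z.sum()

def euclidean_block_update(lam_i, g_i, step):
    """Euclidean update of one block lam_i with step = eta / alpha'_i,
    followed by exact projection onto the subspace sum_t lam_i^t = 0."""
    z = lam_i + step * g_i
    return z - z.mean()
```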
Lemma 4.1 Let ξ̄^t = argmin_{ξ^t ∈ Ξ^t} ⟨ρ^t ∘ θ + λ^t, ξ^t⟩ be the optimal solution of the MRF subproblem of the corresponding subgraph t ∈ T. Then, the subgradients of f(ρ, λ) w.r.t. the corresponding decision vectors are given by:

f'_{ρ_i^t} = θ_i ξ̄_i^t,   f'_{λ_i^t} = ξ̄_i^t,   ∀i ∈ I, ∀t ∈ T.

Proof Let x, y be arbitrary vectors such that x ∈ P and y ∈ Λ. By definition, ξ̄^t is not necessarily optimal for min_{ξ^t ∈ Ξ^t} ⟨x^t ∘ θ + y^t, ξ^t⟩, i.e., ∀t ∈ T:

min_{ξ^t ∈ Ξ^t} ⟨x^t ∘ θ + y^t, ξ^t⟩ ≤ ⟨x^t ∘ θ + y^t, ξ̄^t⟩.

In addition,

f(x, y) = Σ_{t ∈ T} min_{ξ^t ∈ Ξ^t} ⟨x^t ∘ θ + y^t, ξ^t⟩ ≤ Σ_{t ∈ T} ⟨x^t ∘ θ + y^t, ξ̄^t⟩ = f(ρ, λ) + Σ_{t ∈ T} ⟨(x^t − ρ^t) ∘ θ, ξ̄^t⟩ + Σ_{t ∈ T} ⟨y^t − λ^t, ξ̄^t⟩,

which is exactly the subgradient inequality for the concave function f at (ρ, λ).

Remark 4.1 The above choices of subgradient rely on the exact solution ξ̄^t for each subgraph t (which can be computed very efficiently by a dynamic programming algorithm, e.g., max-product belief propagation, or by graph cut). Using these subgradients, we can verify that the updates (35) are only needed at disagreement nodes. As a result, we can utilize this property to define a stopping criterion by counting the number of disagreement nodes. Let L_k be the number of disagreement nodes at iteration k. Essentially, as L_k → 0, the algorithm converges to a stationary point, i.e., the optimal solution.
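Under Lemma 4.1, the subgradients are read off from the subgraph minimizers, and the disagreement count L_k of Remark 4.1 is obtained by comparing those minimizers. The sketch below assumes, for simplicity, that every subgraph covers every index; the array shapes and the solve_subgraph oracle are assumptions of the example.

```python
import numpy as np

def subgradients(theta, rho, lam, solve_subgraph):
    """Subgradients of f(rho, lambda) from the subgraph minimizers (Lemma 4.1).
    theta: (n,) cost vector; rho, lam: (num_subgraphs, n) arrays."""
    rho, lam = np.asarray(rho, dtype=float), np.asarray(lam, dtype=float)
    xi = np.array([solve_subgraph(rho[t] * theta + lam[t])
                   for t in range(rho.shape[0])])
    g_rho = theta[None, :] * xi     # component-wise: df/d rho_i^t = theta_i * xi_i^t
    g_lam = xi                      # component-wise: df/d lam_i^t = xi_i^t
    return g_rho, g_lam, xi

def num_disagreements(xi):
    """Number of indices on which the subgraph minimizers disagree,
    used as the stopping quantity L_k of Remark 4.1."""
    return int(np.sum(np.any(xi != xi[0], axis=0)))
```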
By using the above subgradients and the fact that ξ̄_i^t ∈ [0, 1], we can derive the local Lipschitz constants corresponding to the subsets, ∀i ∈ I:

L_i = max_{ρ_i ∈ P_i} ‖f'_{ρ_i}‖_∞ ≤ |θ_i|,   L'_i = max_{λ_i ∈ Λ_i} ‖f'_{λ_i}‖_2 ≤ √|T|.   (37)

To specify the maximum subset distances, we need to find an upper bound on the distance from any feasible point to the starting points ρ_i^1 and λ_i^1.

Lemma 4.2 Let the starting points be ρ_i^{t,1} = 1/|T|, ∀t ∈ T, ∀i ∈ I. Then the upper bound on the distance between any feasible vector and ρ_i^1 is given by:

Θ_i = max_{ρ_i ∈ P_i} D_i(ρ_i, ρ_i^1) ≤ log |T|.   (38)

Proof Using the Bregman distance (6) with the log-entropy function ψ_i(ρ_i) = Σ_{t ∈ T} ρ_i^t log ρ_i^t for every subset P_i, i ∈ I, we have:

D_i(ρ_i, ρ_i^1) = Σ_{t ∈ T} ρ_i^t log ρ_i^t − Σ_{t ∈ T} ρ_i^t log(1/|T|) = Σ_{t ∈ T} ρ_i^t log ρ_i^t + log |T| ≤ log |T|.

The last equality uses Σ_{t ∈ T} ρ_i^t = 1, and the inequality follows from 0 ≤ ρ_i^t ≤ 1, hence log ρ_i^t ≤ 0.
Similarly to the above, the Bregman distance with ψ'_i(λ_i) = (1/2) Σ_{t ∈ T} (λ_i^t)² yields the Euclidean distance corresponding to the subset Λ_i; thus, the quantity Θ'_i is given by (with λ_i^1 = 0):

Θ'_i = max_{λ_i ∈ Λ_i} D'_i(λ_i, λ_i^1) = max_{λ_i ∈ Λ_i} (1/2) Σ_{t ∈ T} (λ_i^t)².

The subset Λ_i defined in (31) does not allow an exact computation of Θ'_i. For example, assume the index i ∈ I is covered by two subgraphs t_1, t_2 ∈ T; then any λ_i with λ_i^{t_1} = c and λ_i^{t_2} = −c, c ∈ R, is feasible, and the quantity (1/2) Σ_{t ∈ T} (λ_i^t)² = c² can be infinitely large. Thus, the step-size η_{α'_i} also becomes infinitely large. In this problem, we assume the subset Λ_i to be bounded and nonempty. Therefore, we estimate Θ'_i by a quantity that is proportional to the distance between the solution λ_i^* and the starting point λ_i^1 = 0. Given the primal problem (25) and the dual problem (32), we use the approximate duality gap (since the primal solutions cannot always be computed exactly from the dual solutions) as a heuristic estimate of the distance between the current iterate and the optimal solution.
In order to estimate the duality gap at iteration k, we need to compute (approximately) the primal value P(ξ^k) = ⟨θ, ξ^k⟩. Several approaches to estimate the primal variables are discussed in [6]. We employ the ergodic sequence of dual subgradients f'_{λ^k} to estimate the primal variables. Ergodic convergence analysis [16] has been used by many authors to bridge the primal-dual gap in convex optimization. In this approach, the primal variables ξ^k are estimated by considering the weighted average of the dual subgradients over all iterations:

ξ̂^k := Σ_{j=1}^k η^j f'_{λ^j} / Σ_{j=1}^k η^j.

The approximate duality gap is given by |P(ξ̂^k) − f(ρ, λ^k)|, which can be used as a heuristic to estimate Θ'_i at iteration k:

Θ'_i ≈ |P(ξ̂^k) − f(ρ, λ^k)| / L_k,   (39)

where L_k is the number of disagreement nodes (see Remark 4.1). Substituting the local Lipschitz constants (37) and the subset distances (38), (39) into the subset-dependent step-sizes (36) yields:

τ_{α_i} = √( 2 log |T| / (θ_i² K_1) ),   η_{α'_i} = √( 2 |P(ξ̂^k) − f(ρ, λ^k)| / (L_k |T| K_2) ).   (40)

Relating the step-size η_{α'_i} to the duality gap allows the algorithm to admit large step-sizes when the duality gap is large (far from the optimum). As the duality gap reduces, so does the step-size. This choice of step-size is consistent with the diminishing step-size approach that guarantees convergence for subgradient methods [17].
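The ergodic primal estimate and the resulting duality-gap heuristic can be sketched as follows; the step-size-weighted averaging follows the ergodic-averaging idea cited above, and the exact weighting scheme, array layout, and function names are assumptions of the example.

```python
import numpy as np

def ergodic_primal_estimate(dual_subgrads, stepsizes):
    """Step-size-weighted ergodic average of the dual subgradients,
    used as an approximate primal point xi_hat^k."""
    G = np.asarray(dual_subgrads, dtype=float)   # shape (k, n): one subgradient per iteration
    w = np.asarray(stepsizes, dtype=float)       # shape (k,): step-sizes used so far
    return (w[:, None] * G).sum(axis=0) / w.sum()

def approx_duality_gap(theta, xi_hat, dual_value):
    """|P(xi_hat) - f(rho, lambda^k)|: the heuristic distance estimate that
    scales the Euclidean step-size as the iterates approach the optimum."""
    return abs(float(np.dot(theta, xi_hat)) - dual_value)
```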

Numerical Experiments
Experimental results are discussed in the supplementary material provided and published online along with this paper.

Conclusions
An efficient algorithm is presented for solving a large-scale nonsmooth convex problem. The method is based on the Mirror Descent algorithm employing a suitable weighted distance function. By assessing the optimality bound of the proposed algorithm, we are able to compute the optimal subset-dependent step-sizes. This yields a convergence rate that is not worse than that of the MD algorithm with unweighted distance. The experimental results for MRF optimization problems confirm the improved performance.