We define a distance on the given domain (the Cartesian product of several subsets) as a weighted sum of subset distances. In this setting, each subset is equipped with its own distance function and a weighting parameter. We then use this weighted distance in the projection step to develop a weighted Mirror Descent algorithm.
Weighted Distance Function
The distance function \({{\mathrm{D}}}(x,y)\) is defined as the Bregman distance:
$$\begin{aligned} {{\mathrm{D}}}(x,y) = \psi (x) - \psi (y) - {{\mathrm{\langle }}}\nabla \psi (y), x-y {{\mathrm{\rangle }}}, \end{aligned}$$
where \(\psi (.)\) is \(\sigma \)-strongly convex w.r.t. a compatible norm \(\Vert .\Vert \), i.e.,
$$\begin{aligned} \langle \nabla \psi (x) - \nabla \psi (y), x - y \rangle \ge \sigma \Vert x - y\Vert ^2 , \quad \forall x,y \in \mathcal X . \end{aligned}$$
(5)
Without loss of generality, we assume \(\sigma = 1\) throughout the paper. The compatible norm \(\Vert .\Vert \) depends on the choice of distance function; for example, the \(l_1\)-norm is chosen for the log-entropy distance [4] and the \(l_2\)-norm for the Euclidean distance. Instead of using one distance function over the entire domain, let us consider a separate choice of Bregman distance \({{\mathrm{D_i}}}\) for each subset \({{\mathrm{\mathcal X_i}}}, i \in \{1,2,\ldots ,N\}\):
$$\begin{aligned} {{\mathrm{D_i}}}(x_i,y_i) = \psi ^i(x_i) - \psi ^i(y_i) - {{\mathrm{\langle }}}\nabla \psi ^i(y_i) , x_i-y_i {{\mathrm{\rangle }}}, \forall x_i,y_i \in {{\mathrm{\mathcal X_i}}}. \end{aligned}$$
(6)
Each subset distance \({{\mathrm{D_i}}}(x_i,y_i)\) is equipped with a compatible norm \(\Vert .\Vert _i\). Various choices of distance functions and compatible norms are discussed in [5, 9, 10]. Two examples that are relevant to the MRF application we consider later are:
-
Euclidean distance: \({{\mathrm{D_i}}}(x_i,y_i) = \frac{1}{2}\Vert x_i - y_i\Vert ^2_2\). In this case, \(\psi ^i(x_i) = \frac{1}{2}\Vert x_i\Vert ^2_2\), and it is straightforward to show that \(\psi ^i(.)\) is 1-strongly convex w.r.t. \(\Vert .\Vert _2\).
-
Log-entropy distance: \({{\mathrm{D_i}}}(x_i,y_i) = \sum _j x^j_i \log (x^j_i / y^j_i) + y^j_i - x^j_i\). In this case, \(\psi ^i(x_i) = \sum _j x^j_i \log x^j_i - x^j_i\) is shown to be 1-strongly convex w.r.t. \(\Vert .\Vert _1\) [4].
When \(x,y \in \mathcal X\) and the domain is the Cartesian product \(\mathcal {X} = \mathcal {X}_1 \times \mathcal {X}_2 \times \cdots \times \mathcal {X}_N\), the distance between \(x\) and \(y\) is simply the sum of the subset distances \({{\mathrm{D_i}}}(x_i,y_i)\). Using this definition, we can now state a corollary to Theorem 2.1.
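To make the two example distances concrete, the following minimal sketch (Python with NumPy; the block sizes and sample points are arbitrary illustrative choices, not taken from the text) evaluates both subset distances and their sum over a two-block product domain:
```python
import numpy as np

def euclidean_dist(x_i, y_i):
    # D_i(x_i, y_i) = 1/2 * ||x_i - y_i||_2^2, generated by psi^i(x) = 1/2 * ||x||_2^2
    return 0.5 * np.sum((x_i - y_i) ** 2)

def log_entropy_dist(x_i, y_i):
    # D_i(x_i, y_i) = sum_j x_i^j log(x_i^j / y_i^j) + y_i^j - x_i^j,
    # generated by psi^i(x) = sum_j x^j log x^j - x^j (entries must be positive)
    return np.sum(x_i * np.log(x_i / y_i) + y_i - x_i)

# Two blocks of a product domain X = X_1 x X_2 (sizes and values are arbitrary).
x = [np.array([0.2, 0.5, 0.3]), np.array([1.0, -2.0])]
y = [np.array([0.1, 0.6, 0.3]), np.array([0.5, -1.0])]

# Unweighted distance on X: the sum of the subset distances.
print(log_entropy_dist(x[0], y[0]) + euclidean_dist(x[1], y[1]))
```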
Corollary 3.1
Let \({{\mathrm{\Omega _i}}}\) denote the maximum distance of a subset \(\mathcal X_i\), i.e., \({{\mathrm{\Omega _i}}}= \max _{x_i,y_i \in \mathcal X_i} {{\mathrm{D_i}}}(x_i,y_i)\), and let \({{\mathrm{\mathcal L_i}}}= \max _{x_i \in {{\mathrm{\mathcal X_i}}}} \Vert f'_{x_i}\Vert _{i*}\) denote the local Lipschitz constant w.r.t. a subset \({{\mathrm{\mathcal X_i}}}\). The optimality bound (4) for solving problem (1) by the Mirror Descent algorithm is given by:
$$\begin{aligned} f^* - f(\bar{x}) \le \frac{\sqrt{\sum ^N_{i=1} {{\mathrm{\mathcal L_i}}}^2} \sqrt{2 \sum ^N_{i=1} {{\mathrm{\Omega _i}}}}}{\sqrt{K}}. \end{aligned}$$
(7)
Proof
When \(\mathcal X\) is the Cartesian product of N convex sets \(\mathcal X_i, i \in \{1,2,\ldots ,N\}\), the distance between two vectors \(x,y \in \mathcal X\) is the sum of the distances between the corresponding blocks \(x_i,y_i \in {{\mathrm{\mathcal X_i}}}\). As a result, the maximum distance \({{\mathrm{\Omega }}}\) is also the sum of the maximum distances on the subsets \(\mathcal X_i\):
$$\begin{aligned} {{\mathrm{\Omega }}}= \sum ^N_{i=1} {{\mathrm{\Omega _i}}}. \end{aligned}$$
(8)
Since the subsets \(\mathcal X_i\) and \(\mathcal X_j\), \(i\ne j\), \(i,j \in \{1,2,\ldots ,N\}\), are independent, we have:
$$\begin{aligned} {{\mathrm{\mathcal L}}}= \max _{x \in \mathcal X} \Vert f'_x\Vert _* = \max _{x\in \mathcal X} \sqrt{\sum ^N_{i=1} \Vert f'_{x_i}\Vert ^2_{i*}} = \sqrt{\sum ^N_{i=1} \max _{x_i\in \mathcal X_i} \Vert f'_{x_i}\Vert ^2_{i*}} = \sqrt{\sum ^N_{i=1} {{\mathrm{\mathcal L_i}}}^2} . \end{aligned}$$
(9)
Substituting \({{\mathrm{\Omega }}}\) and \({{\mathrm{\mathcal L}}}\) in the optimality bound (4) yields (7). \(\square \)
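As a small numerical illustration of the corollary, the sketch below assembles \({{\mathrm{\Omega }}}\), \({{\mathrm{\mathcal L}}}\) and the unweighted bound (7) from illustrative per-subset constants (the values of \({{\mathrm{\mathcal L_i}}}\), \({{\mathrm{\Omega _i}}}\) and K are made up and reused in the later sketches):
```python
import numpy as np

# Illustrative per-subset constants (made up, reused in later sketches).
L_i = np.array([4.0, 1.0, 0.5])      # local Lipschitz constants, one per subset
Omega_i = np.array([0.1, 2.0, 1.0])  # maximum subset distances
K = 1000                             # number of iterations

L = np.sqrt(np.sum(L_i ** 2))        # (9): Lipschitz constant on the product domain
Omega = np.sum(Omega_i)              # (8): maximum distance on the product domain
print(L * np.sqrt(2 * Omega) / np.sqrt(K))   # unweighted optimality bound (7)
```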
We now propose a weighted distance function in order to improve the optimality bound (7). For each subset distance \({{\mathrm{D_i}}}\), let us introduce a weighting parameter \({{\mathrm{\alpha _i}}}>0\). The new distance function is then defined as a weighted combination of subset distances:
$$\begin{aligned} {{\mathrm{D}}}(x,y) := \sum ^N_{i=1} {{\mathrm{\alpha _i}}}{{\mathrm{D_i}}}(x_i,y_i) = \sum ^N_{i=1} {{\mathrm{\alpha _i}}}\left[ \psi ^i(x_i) - \psi ^i(y_i) - {{\mathrm{\langle }}}\nabla \psi ^i(y_i) , x_i-y_i {{\mathrm{\rangle }}}\right] . \end{aligned}$$
(10)
This yields the definition of \({{\mathrm{\psi }}}(x)\) as a weighted sum of the convex functions \(\psi ^i(x_i)\):
$$\begin{aligned} {{\mathrm{\psi }}}(x) = \sum ^N_{i=1} {{\mathrm{\alpha _i}}}\psi ^i(x_i). \end{aligned}$$
(11)
Substituting (10) in the projection step (2) naturally yields:
$$\begin{aligned} x^{k+1} = \underset{x \in \mathcal {X}}{{{\mathrm{argmax}}}} \left\langle f'_{x^k}, x \right\rangle - \frac{1}{\mu } \sum ^N_{i=1} {{\mathrm{\alpha _i}}}{{\mathrm{D_i}}}(x_i,x_i^k). \end{aligned}$$
(12)
Essentially, the Cartesian-product structure of \(\mathcal X\) makes it possible to compute the projection (12) independently on each subset \({{\mathrm{\mathcal X_i}}}\). In other words, considering the optimality conditions of the optimization problem (12) w.r.t. each block \(x_i \in {{\mathrm{\mathcal X_i}}}\) shows that (12) is separable and equivalent to:
$$\begin{aligned} \forall i \in \{1,\ldots ,N\}: \quad x^{k+1}_i = \underset{x_i \in {{\mathrm{\mathcal X_i}}}}{{{\mathrm{argmax}}}} \left\langle f'_{x^k_i}, x_i \right\rangle - \frac{{{\mathrm{\alpha _i}}}}{\mu } {{\mathrm{D_i}}}(x_i,x^k_i). \end{aligned}$$
(13)
As a result, we can hope to achieve better performance by choosing suitable (ideally optimal) weighting parameters \({{\mathrm{\alpha _i}}}\) for the corresponding subsets \({{\mathrm{\mathcal X_i}}}\).
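To illustrate the separability of (13), the following sketch performs one weighted MD iteration under the illustrative assumption that every block \({{\mathrm{\mathcal X_i}}}\) is a probability simplex equipped with the log-entropy distance; the block update then has the familiar exponentiated-gradient closed form. The blocks, (super)gradient, weights and step-size are made-up values.
```python
import numpy as np

def weighted_md_step(x_blocks, grad_blocks, alphas, mu):
    """One pass over the separable projection (13), assuming every block X_i is a
    probability simplex with the log-entropy distance: each block update is then a
    weighted exponentiated-gradient step followed by renormalization."""
    new_blocks = []
    for x_i, g_i, a_i in zip(x_blocks, grad_blocks, alphas):
        z = x_i * np.exp(mu * g_i / a_i)  # larger alpha_i => smaller effective step on block i
        new_blocks.append(z / z.sum())    # project back onto the simplex X_i
    return new_blocks

# Made-up data: two simplex blocks, a supergradient at x^k, weights and a step-size.
x_k = [np.array([0.3, 0.3, 0.4]), np.array([0.5, 0.5])]
g_k = [np.array([1.0, -0.5, 0.2]), np.array([0.1, 0.3])]
print(weighted_md_step(x_k, g_k, alphas=[1.5, 0.5], mu=0.1))
```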
Compatible Norm, Dual Norm, Weighted Lipschitz Constant and Maximum Weighted Distance
In order to analyze the convergence of the sequence generated by (12), we need to establish the Lipschitz constant, which is obtained as an upper bound on the dual norm of the subgradients. To this end, we propose a compatible norm \(\Vert .\Vert \) associated with the weighted distance.
Lemma 3.1
For all \(i \in \{1,\ldots ,N\}\), let \({{\mathrm{\alpha _i}}}>0\) and let \(\psi ^i(x_i)\) be 1-strongly convex w.r.t. \(\Vert x_i\Vert _i\). Then the weighted function \({{\mathrm{\psi }}}(x) = \sum ^N_{i=1} {{\mathrm{\alpha _i}}}\psi ^i(x_i)\) is 1-strongly convex w.r.t. the weighted norm:
$$\begin{aligned} \Vert x\Vert := \sqrt{\sum ^N_{i=1} {{\mathrm{\alpha _i}}}\Vert x_i\Vert ^2_i} . \end{aligned}$$
(14)
Proof
We have, \(\forall x,y \in \mathcal X\):
$$\begin{aligned} \langle \nabla {{\mathrm{\psi }}}(x) - \nabla {{\mathrm{\psi }}}(y), x - y \rangle = \sum ^N_{i=1} {{\mathrm{\alpha _i}}}\langle \nabla \psi ^i(x_i) - \nabla \psi ^i(y_i), x_i - y_i \rangle \ge \sum ^N_{i=1} {{\mathrm{\alpha _i}}}\Vert x_i - y_i \Vert ^2_i = \Vert x - y\Vert ^2. \end{aligned}$$
\(\square \)
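As a sanity check of Lemma 3.1, the following sketch samples random points and verifies the strong convexity inequality for a weighted combination of a Euclidean block and a simplex block with the log-entropy distance (the block geometries, weights and sample counts are illustrative choices, not taken from the text):
```python
import numpy as np

rng = np.random.default_rng(0)
alpha = [2.0, 0.5]  # illustrative weights

for _ in range(1000):
    # Block 1: Euclidean geometry, psi^1(x) = 1/2 ||x||_2^2, so grad psi^1(x) = x.
    x1, y1 = rng.normal(size=4), rng.normal(size=4)
    # Block 2: simplex points, psi^2(x) = sum_j x^j log x^j - x^j, so grad psi^2(x) = log x.
    x2, y2 = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))

    # <grad psi(x) - grad psi(y), x - y> for the weighted psi of Lemma 3.1 ...
    lhs = alpha[0] * np.dot(x1 - y1, x1 - y1) \
        + alpha[1] * np.dot(np.log(x2) - np.log(y2), x2 - y2)
    # ... versus the squared weighted norm (14), with the l2 and l1 block norms.
    rhs = alpha[0] * np.sum((x1 - y1) ** 2) + alpha[1] * np.sum(np.abs(x2 - y2)) ** 2
    assert lhs >= rhs - 1e-9

print("weighted strong convexity inequality held on all samples")
```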
The dual norm \(\Vert .\Vert _*\) of the proposed weighted norm (14) can be derived using the definition of dual norm (see Sect. 2 and [11]):
$$\begin{aligned} \Vert \xi \Vert _* = \sqrt{\sum ^N_{i=1} \frac{\Vert \xi _i \Vert ^2_{i*}}{{{\mathrm{\alpha _i}}}}} , \end{aligned}$$
(15)
where \(\Vert .\Vert _{i*}\) is the dual norm of \(\Vert .\Vert _i\) over the subset \({{\mathrm{\mathcal X_i}}}\). Let \({{{\mathrm{\mathcal L_i}}}= \max _{x_i \in {{\mathrm{\mathcal X_i}}}} \Vert f'_{x_i}\Vert _{i*}}\) denote the local Lipschitz constant w.r.t. a subset \({{\mathrm{\mathcal X_i}}}\); then the weighted Lipschitz constant is given by:
$$\begin{aligned} {{\mathrm{\mathcal L}}}= \max _{x \in \mathcal X} \Vert f'_x\Vert _* = \sqrt{\sum ^N_{i=1} \frac{{{\mathrm{\mathcal L_i}}}^2}{{{\mathrm{\alpha _i}}}}} . \end{aligned}$$
(16)
In addition, the maximum weighted distance \({{\mathrm{\Omega }}}\) becomes:
$$\begin{aligned} {{\mathrm{\Omega }}}= \max _{x,y \in \mathcal X} {{\mathrm{D}}}(x,y) = \sum ^N_{i=1} {{\mathrm{\alpha _i}}}{{\mathrm{\Omega _i}}}, \end{aligned}$$
(17)
where \({{\mathrm{\Omega _i}}}= \max _{x_i,y_i \in \mathcal {X}_i} {{\mathrm{D_i}}}(x_i,y_i)\).
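For completeness, (15) can be verified directly from the definition of the dual norm, \(\Vert \xi \Vert _* = \max _{\Vert x\Vert \le 1} {{\mathrm{\langle }}}\xi , x {{\mathrm{\rangle }}}\). Writing \(t_i := \sqrt{{{\mathrm{\alpha _i}}}}\, \Vert x_i\Vert _i\), we have:
$$\begin{aligned} \Vert \xi \Vert _* = \max _{\sum ^N_{i=1} {{\mathrm{\alpha _i}}}\Vert x_i\Vert ^2_i \le 1} \sum ^N_{i=1} {{\mathrm{\langle }}}\xi _i , x_i {{\mathrm{\rangle }}}= \max _{\sum ^N_{i=1} t_i^2 \le 1} \sum ^N_{i=1} \frac{\Vert \xi _i\Vert _{i*}}{\sqrt{{{\mathrm{\alpha _i}}}}} t_i = \sqrt{\sum ^N_{i=1} \frac{\Vert \xi _i\Vert ^2_{i*}}{{{\mathrm{\alpha _i}}}}} , \end{aligned}$$
where the second equality maximizes each block over its direction at fixed \(\Vert x_i\Vert _i\) (the definition of \(\Vert .\Vert _{i*}\)) and the last equality is the equality case of the Cauchy–Schwarz inequality; (16) and (17) then follow by maximizing block-wise over the independent subsets \({{\mathrm{\mathcal X_i}}}\).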
Remark 3.1
The unweighted quantities (8) and (9) can be viewed as a special case of the weighted quantities above with \({{\mathrm{\alpha _i}}}= 1 \; , \; \forall i = 1,2,\ldots ,N\).
Convergence Properties
We first establish the optimality bound of the weighted MD algorithm.
Lemma 3.2
Let \(f^*\) denote the globally optimal objective value, let \(\bar{x} = {{{\mathrm{argmax}}}}_{x \in \{x^1,\ldots ,x^K\}} \, f(x)\), and let \(\mu \) be the step-size. We have the following optimality bound after K iterations:
$$\begin{aligned} f^* - f(\bar{x}) \le \frac{{{\mathrm{\Omega }}}}{K\mu } + \frac{\mu {{\mathrm{\mathcal L}}}^2}{2} . \end{aligned}$$
(18)
Similar results can be found in [1, 2, 4]. The initial bound (18) depends on three quantities, \(\mu \), \({{\mathrm{\mathcal L}}}\) and \({{\mathrm{\Omega }}}\), where the latter two are themselves functions of the weighting parameters \({{\mathrm{\alpha _i}}}\). Therefore, we can tighten the bound (18) by minimizing it w.r.t. \(\mu \) and \({{\mathrm{\alpha _i}}}\).
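The trade-off in (18) is easy to check numerically. The sketch below (with made-up values for \({{\mathrm{\mathcal L}}}\), \({{\mathrm{\Omega }}}\) and K) evaluates the RHS of (18) over a grid of step-sizes and compares the grid minimum with the analytic minimizer \(\mu = \frac{\sqrt{2{{\mathrm{\Omega }}}}}{{{\mathrm{\mathcal L}}}\sqrt{K}}\) of Theorem 2.1:
```python
import numpy as np

# Illustrative constants (made up).
L, Omega, K = 3.0, 1.5, 1000

def rhs_of_18(mu):
    # Right-hand side of (18): Omega / (K * mu) + mu * L^2 / 2
    return Omega / (K * mu) + mu * L ** 2 / 2

mu_grid = np.logspace(-4, 0, 400)
mu_star = np.sqrt(2 * Omega) / (L * np.sqrt(K))  # analytic minimizer (Theorem 2.1)

print(rhs_of_18(mu_grid).min())      # grid search over step-sizes
print(rhs_of_18(mu_star))            # value at the analytic minimizer
print(L * np.sqrt(2 * Omega / K))    # closed-form minimum, L * sqrt(2*Omega/K)
```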
Theorem 3.1
For each subset \({{\mathrm{\mathcal X_i}}}\), let \({{\mathrm{\mathcal L_i}}}= \max _{x_i \in {{\mathrm{\mathcal X_i}}}} \Vert f'_{x_i}\Vert _{i*}\) be the local Lipschitz constant and \({{\mathrm{\Omega _i}}}= \max _{x_i,y_i \in \mathcal {X}_i} {{\mathrm{D_i}}}(x_i,y_i)\) be the maximum subset distance. Then, the optimal weighting parameters are given by:
$$\begin{aligned} {{\mathrm{\alpha _i}}}= \frac{{{\mathrm{\mathcal L_i}}}}{\sqrt{{{\mathrm{\Omega _i}}}}\left( \sum ^N_{j=1} \mathcal L_j\sqrt{\Omega _j} \right) }, \quad \forall i = 1,2,\ldots ,N . \end{aligned}$$
(19)
In addition, these parameters yield the optimal step-size:
$$\begin{aligned} \mu = \frac{\sqrt{2}}{\sqrt{K}\left( \sum ^N_{i=1} {{\mathrm{\mathcal L_i}}}\sqrt{{{\mathrm{\Omega _i}}}}\right) }. \end{aligned}$$
(20)
Proof
Minimizing the RHS of (18) w.r.t. \(\mu \) yields the result of Theorem 2.1, \({f^* - f(\bar{x}) \le \frac{{{\mathrm{\mathcal L}}}\sqrt{2 {{\mathrm{\Omega }}}}}{\sqrt{K}}}\). This optimality bound is a function of \(\alpha := [\alpha _1,\alpha _2,\ldots ,\alpha _N]^\top \). The tightest bound is obtained by minimizing:
$$\begin{aligned} \phi (\alpha ) = {{\mathrm{\mathcal L}}}^2(\alpha ) {{\mathrm{\Omega }}}(\alpha ) = \sum ^N_{i=1} \frac{{{\mathrm{\mathcal L_i}}}^2}{{{\mathrm{\alpha _i}}}}\sum ^N_{i=1} {{\mathrm{\alpha _i}}}{{\mathrm{\Omega _i}}}. \end{aligned}$$
The optimizer of \(\phi (\alpha )\) needs to satisfy the following optimality condition:
$$\begin{aligned} \frac{{{\mathrm{\alpha _i}}}^2{{\mathrm{\Omega _i}}}}{{{\mathrm{\mathcal L_i}}}^2} \sum ^N_{j=1,j \ne i} \frac{\mathcal L_j^2}{\alpha _j} = \sum ^N_{j=1,j \ne i} \alpha _j \Omega _j , \quad \forall i = 1,2,\ldots ,N. \end{aligned}$$
(21)
Now, let us rewrite the optimality bound \(\frac{{{\mathrm{\Omega }}}}{K\mu } + \frac{\mu {{\mathrm{\mathcal L}}}^2}{2}\) in (18) as:
$$\begin{aligned} \frac{{{\mathrm{\Omega }}}}{K\mu } + \frac{\mu {{\mathrm{\mathcal L}}}^2}{2} = \frac{\sum ^N_{i=1} {{\mathrm{\alpha _i}}}{{\mathrm{\Omega _i}}}}{K\mu } + \frac{\mu }{2} \sum ^N_{i=1} \frac{{{\mathrm{\mathcal L_i}}}^2}{{{\mathrm{\alpha _i}}}} . \end{aligned}$$
Minimizing the RHS of the above equality w.r.t. \({{\mathrm{\alpha _i}}}\) and substituting \(\mu = \frac{\sqrt{2{{\mathrm{\Omega }}}}}{{{\mathrm{\mathcal L}}}\sqrt{K}}\) (Theorem 2.1) into the minimizer gives \({{\mathrm{\alpha _i}}}= \frac{{{\mathrm{\mathcal L_i}}}\sqrt{{{\mathrm{\Omega }}}}}{{{\mathrm{\mathcal L}}}\sqrt{{{\mathrm{\Omega _i}}}}} , \forall i = 1,2,\ldots ,N\). Substituting these weighting parameters into the maximum distance, \({{\mathrm{\Omega }}}= \sum ^N_{i=1} {{\mathrm{\alpha _i}}}{{\mathrm{\Omega _i}}}\), yields \(\sqrt{{{\mathrm{\Omega }}}} = \frac{\sum ^N_{i=1} {{\mathrm{\mathcal L_i}}}\sqrt{{{\mathrm{\Omega _i}}}}}{{{\mathrm{\mathcal L}}}}\). Suppose the weighted distance is normalized by the weighting parameters, i.e., \({{\mathrm{\Omega }}}= 1\); then the weighted Lipschitz constant is given by:
$$\begin{aligned} {{\mathrm{\mathcal L}}}= \sum ^N_{i=1} {{\mathrm{\mathcal L_i}}}\sqrt{{{\mathrm{\Omega _i}}}} . \end{aligned}$$
(22)
Using this weighted Lipschitz constant together with the normalized maximum distance, \({{\mathrm{\Omega }}}=1\), yields the optimal weighting parameters (19). One can verify that the optimal \({{\mathrm{\alpha _i}}}\) normalize the maximum distance, i.e., \({{\mathrm{\Omega }}}= 1\), generate the weighted Lipschitz constant (22) through the definition (16), and satisfy the optimality condition (21) of the optimality bound function \(\phi (\alpha )\). \(\square \)
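The claims at the end of the proof can also be checked numerically. The sketch below (reusing the illustrative constants from the earlier sketches) computes the weights (19) and verifies that they normalize the maximum distance, reproduce (22) through the definition (16), and satisfy the optimality condition (21):
```python
import numpy as np

# Illustrative per-subset constants (same as in the earlier sketches).
L_i = np.array([4.0, 1.0, 0.5])
Omega_i = np.array([0.1, 2.0, 1.0])

S = np.sum(L_i * np.sqrt(Omega_i))
alpha = L_i / (np.sqrt(Omega_i) * S)     # optimal weighting parameters (19)

Omega = np.sum(alpha * Omega_i)          # maximum weighted distance (17); should be 1
L = np.sqrt(np.sum(L_i ** 2 / alpha))    # weighted Lipschitz constant (16); should equal S, cf. (22)
print(Omega, L, S)

# Optimality condition (21), checked separately for each i.
for i in range(len(L_i)):
    mask = np.arange(len(L_i)) != i
    lhs = alpha[i] ** 2 * Omega_i[i] / L_i[i] ** 2 * np.sum(L_i[mask] ** 2 / alpha[mask])
    rhs = np.sum(alpha[mask] * Omega_i[mask])
    print(np.isclose(lhs, rhs))
```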
Theorem 3.2
Let \(f^*\) denote the globally optimal objective value and \(\bar{x} = {{{\mathrm{argmax}}}}_{x \in \{x^1,\ldots ,x^K\}} \, f(x)\). The weighted MD algorithm with the optimal step-size (20) and the optimal weighting parameters (19) has the following optimality bound after K iterations:
$$\begin{aligned} f^* - f(\bar{x}) \le \frac{\sqrt{2}\sum ^N_{i=1} {{\mathrm{\mathcal L_i}}}\sqrt{{{\mathrm{\Omega _i}}}}}{\sqrt{K}} . \end{aligned}$$
(23)
Proof
Substituting the optimal step-size (20) and the optimal weighting parameters (19) into (18) directly yields the result. \(\square \)
The following result establishes the relative performance of the proposed weighted MD algorithm compared to the MD algorithm with the unweighted distance: the optimality bound of the weighted algorithm is never worse than that of the unweighted one. Numerical experiments discussed in the next section and in the supplementary material underline this promising result.
Corollary 3.2
The optimality bound (23) of the proposed weighted MD algorithm is either an improvement to, or in the worst case as good as, the optimality bound (7) of the MD algorithm with unweighted distance:
$$\begin{aligned} \frac{\sqrt{2}\sum ^N_{i=1} {{\mathrm{\mathcal L_i}}}\sqrt{{{\mathrm{\Omega _i}}}}}{\sqrt{K}} \le \frac{\sqrt{\sum ^N_{i=1} {{\mathrm{\mathcal L_i}}}^2} \sqrt{2 \sum ^N_{i=1} {{\mathrm{\Omega _i}}}}}{\sqrt{K}} . \end{aligned}$$
(24)
Proof
By the Cauchy–Schwarz inequality, we have:
$$\begin{aligned} \left( \sum ^N_{i=1} {{\mathrm{\mathcal L_i}}}\sqrt{{{\mathrm{\Omega _i}}}}\right) ^2\le \left( \sum ^N_{i=1} {{\mathrm{\mathcal L_i}}}^2\right) \left( \sum ^N_{i=1} {{\mathrm{\Omega _i}}}\right) . \end{aligned}$$
The above inequality directly yields (24). \(\square \)
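A final numerical check of (24), again with the illustrative constants used above (chosen to be heterogeneous across subsets so that the gap between the two bounds is visible):
```python
import numpy as np

# Illustrative per-subset constants (same as above) and iteration count.
L_i = np.array([4.0, 1.0, 0.5])
Omega_i = np.array([0.1, 2.0, 1.0])
K = 1000

weighted = np.sqrt(2) * np.sum(L_i * np.sqrt(Omega_i)) / np.sqrt(K)                  # bound (23)
unweighted = np.sqrt(np.sum(L_i ** 2)) * np.sqrt(2 * np.sum(Omega_i)) / np.sqrt(K)   # bound (7)
print(weighted, unweighted, weighted <= unweighted)
```
The two bounds coincide only in the Cauchy–Schwarz equality case, i.e., when \({{\mathrm{\mathcal L_i}}}\propto \sqrt{{{\mathrm{\Omega _i}}}}\) across all subsets, in which case the optimal weights (19) are constant over i.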