In this section we provide a complexity analysis of NSync.
Assumptions
Our analysis of NSync is based on two assumptions. The first generalizes, to nonuniform samplings, the ESO concept introduced in [21] and later used in [5, 6, 22, 27, 28]. The second assumption requires that \(\phi \) be strongly convex.
Notation. For \(x,y,u \in \mathbf {R}^n\) we write \(\Vert x\Vert _u^2 :=\sum _i u_i x_i^2\), \(\langle x, y\rangle _u :=\sum _{i=1}^n u_i x_i y_i\), \(x \bullet y :=(x_1 y_1, \dots , x_n y_n)\) and \(u^{-1} :=(1/u_1,\dots ,1/u_n)\). For \(S\subseteq [n]\) and \(h\in \mathbf {R}^n\), let \(h_{[S]} :=\sum _{i\in S} h_i e^i\).
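As a quick illustration of this notation (with arbitrary numbers), take \(u=(1,2)^T\) and \(x=(3,4)^T\): then \(\Vert x\Vert _u^2 = 1\cdot 3^2 + 2\cdot 4^2 = 41\), \(u \bullet x = (3,8)\) and \(u^{-1}=(1,\tfrac{1}{2})\).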
Assumption 1
(Nonuniform ESO: Expected Separable Overapproximation) Assume that \(p=(p_1,\dots ,p_n)^T>0\) and that for some positive vector \(w\in \mathbf {R}^n\) and all \(x,h \in \mathbf {R}^n\), the following inequality holds:
$$\begin{aligned} \mathbf {E}[ \phi (x+h_{[\hat{S}]})] \le \phi (x) + \langle \nabla \phi (x), h \rangle _p + \frac{1}{2} \Vert h\Vert _{p \bullet w}^2. \end{aligned}$$
(2)
Whenever \(\phi \) has a Lipschitz continuous gradient, for every random sampling \(\hat{S}\) there exist positive weights \(w_1,\dots ,w_n\) such that Assumption 1 holds. In this sense, the assumption is not restrictive. Inequalities of the type (2) in the uniform case (\(p_i=p_j\) for all i, j) were studied in [6, 21, 22, 27]. Motivated by the introduction of the nonuniform ESO assumption in this paper, and by the development in Sect. 3 of our work, an entire paper dedicated to the study of nonuniform ESO inequalities was recently written [16].
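To give a concrete instance (this example is ours and easy to verify, not a claim made above): if \(\hat{S}\) is a serial sampling, i.e., \(|\hat{S}|=1\) with probability 1 and \(\mathbf {Prob}(\hat{S}=\{i\})=p_i\), and \(\phi \) has a coordinate-wise Lipschitz gradient with constants \(L_1,\dots ,L_n\), then Assumption 1 holds with \(w=L\):
$$\begin{aligned} \mathbf {E}[\phi (x+h_{[\hat{S}]})] = \sum _{i=1}^n p_i \,\phi (x+h_i e^i) \le \sum _{i=1}^n p_i \left( \phi (x) + \nabla _i \phi (x) h_i + \frac{L_i}{2} h_i^2\right) = \phi (x) + \langle \nabla \phi (x), h \rangle _p + \frac{1}{2} \Vert h\Vert _{p \bullet L}^2. \end{aligned}$$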
We now turn to the second and final assumption.
Assumption 2
(Strong convexity) We assume that \(\phi \) is \(\gamma \)-strongly convex with respect to the norm \(\Vert \cdot \Vert _{v}\), where \(v=(v_1,\dots ,v_n)^T>0\) and \(\gamma >0\). That is, we require that for all \(x,h \in \mathbf {R}^n\),
$$\begin{aligned} \phi (x+h) \ge \phi (x) + \langle \nabla \phi (x),h \rangle + \frac{\gamma }{2} \Vert h\Vert _{v}^2. \end{aligned}$$
(3)
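For illustration (this example is ours, not part of the assumption), consider the quadratic \(\phi (x) = \tfrac{1}{2} x^T A x - b^T x\) with \(A \succ 0\). Since \(\phi (x+h) = \phi (x) + \langle \nabla \phi (x), h \rangle + \tfrac{1}{2} h^T A h\), Assumption 2 holds for any \(\gamma >0\) and \(v>0\) with \(A \succeq \gamma \, {{\mathrm{Diag}}}(v)\); in particular, one may take \(v=(1,\dots ,1)^T\) and \(\gamma = \lambda _{\min }(A)\).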
Complexity
We can now establish a bound on the number of iterations sufficient for NSync to approximately solve (1) with high probability. Remarkably, the proof is very concise.
Theorem 3
Let Assumptions 1 and 2 be satisfied. Choose \(x^0 \in \mathbf {R}^n\), \(0 < \epsilon < \phi (x^0)-\phi ^*\) and \(0< \rho < 1\), where \(\phi ^* :=\min _x \phi (x)\). Let
$$\begin{aligned} \Lambda :=\max _{i} \frac{w_i}{p_i v_i}. \end{aligned}$$
(4)
If \(\{x^k\}\) are the random iterates generated by NSync, then
$$\begin{aligned} K \ge \frac{\Lambda }{\gamma }\log \left( \frac{\phi (x^0)-\phi ^*}{\epsilon \rho }\right) \quad \Rightarrow \quad \mathbf {Prob}(\phi (x^{K})-\phi ^* \le \epsilon ) \ge 1-\rho . \end{aligned}$$
(5)
Moreover, we have the lower bound
$$\begin{aligned} \Lambda \ge \left( \sum _{i=1}^n \frac{w_i}{v_i}\right) \Bigg / \mathbf {E}[|\hat{S}| ]. \end{aligned}$$
(6)
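To get a feel for the bound (5), consider the hypothetical values \(\Lambda /\gamma = 100\), \(\phi (x^0)-\phi ^* = 10\), \(\epsilon = 10^{-3}\) and \(\rho = 10^{-2}\): then \(K \ge 100 \log (10^{6}) \approx 1382\) iterations suffice to guarantee \(\phi (x^K)-\phi ^* \le 10^{-3}\) with probability at least 0.99.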
Proof
We first claim that \(\phi \) is \(\mu \)-strongly convex with respect to the norm \(\Vert \cdot \Vert _{w\bullet p^{-1}}\), i.e.,
$$\begin{aligned} \phi (x+h) \ge \phi (x) + \langle \nabla \phi (x),h \rangle + \frac{\mu }{2} \Vert h\Vert _{w \bullet p^{-1}}^2, \end{aligned}$$
(7)
where \(\mu :=\gamma /\Lambda \). Indeed, this follows by comparing (3) and (7) in the light of (4). Let \(x^*\) be such that \(\phi (x^*) = \phi ^*\). Using (7) with \(h=x^*-x\), and bounding the right-hand side from below by its minimum over all \(h'\in \mathbf {R}^n\), we obtain
$$\begin{aligned} \phi ^* - \phi (x) \overset{(7)}{\ge } \min _{h' \in \mathbf {R}^n} \langle \nabla \phi (x), h'\rangle + \frac{\mu }{2} \Vert h'\Vert _{ w \bullet p^{-1}}^2 = -\frac{1}{2\mu } \Vert \nabla \phi (x)\Vert ^2_{p \bullet w^{-1}}. \end{aligned}$$
(8)
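For completeness, the minimum in (8) can be computed coordinate-wise: setting the derivative with respect to \(h_i'\) to zero gives \(h_i' = -p_i \nabla _i \phi (x)/(\mu w_i)\), and substituting this back yields the value \(-\frac{1}{2\mu } \Vert \nabla \phi (x)\Vert ^2_{p \bullet w^{-1}}\).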
Let \(h^k :=-({{\mathrm{Diag}}}(w))^{-1}\nabla \phi (x^k)\). Then \(x^{k+1}=x^k + (h^k)_{[\hat{S}]}\), and utilizing Assumption 1, we get
$$\begin{aligned} \mathbf {E}[\phi (x^{k+1}) \,|\, x^k] \overset{(2)}{\le } \phi (x^k) + \langle \nabla \phi (x^k), h^k \rangle _p + \frac{1}{2} \Vert h^k\Vert _{p \bullet w}^2 = \phi (x^k) - \frac{1}{2} \Vert \nabla \phi (x^k)\Vert ^2_{p \bullet w^{-1}} \overset{(8)}{\le } \phi (x^k) - \mu \left( \phi (x^k)-\phi ^*\right) . \end{aligned}$$
Taking expectations in the last inequality and rearranging the terms, we obtain
$$\begin{aligned} \mathbf {E}[\phi (x^{k+1}) -\phi ^*]\le (1-\mu ) \mathbf {E}[\phi (x^k)-\phi ^*] \le (1-\mu )^{k+1} (\phi (x^0)-\phi ^*). \end{aligned}$$
Using this, the Markov inequality, and the definition of K, we finally get
$$\begin{aligned} \mathbf {Prob}(\phi (x^K)-\phi ^* \ge \epsilon ) \le \frac{\mathbf {E}[\phi (x^K)-\phi ^*]}{\epsilon } \le \frac{ (1-\mu )^{K} (\phi (x^0)-\phi ^*)}{\epsilon } \le \rho . \end{aligned}$$
Let us now establish the last claim.
First, note that (see [21, Sec 3.2] for more results of this type),
$$\begin{aligned} \sum _i p_i = \sum _i \sum _{S: i \in S}p_S = \sum _S \sum _{i : i \in S} p_S = \sum _S p_S |S| = \mathbf {E}[|\hat{S}|]. \end{aligned}$$
(9)
Letting \(\Delta :=\{p'\in \mathbf {R}^n : p'\ge 0, \sum _i p_i' = \mathbf {E}[|\hat{S}|]\}\), we have
$$\begin{aligned} \Lambda \overset{(4)+(9)}{\ge } \min _{p' \in \Delta } \max _i \frac{w_i}{p_i' v_i} = \frac{1}{\mathbf {E}[|\hat{S}|]}\sum _{i=1}^n \frac{w_i}{v_i}, \end{aligned}$$
where the last equality follows since the optimal \(p_i'\) is proportional to \(w_i/v_i\). \(\square \)
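To see the last step explicitly: at the minimum all ratios \(w_i/(p_i' v_i)\) are equal to a common constant c (otherwise probability mass could be shifted from a coordinate with a smaller ratio to one with a larger ratio, decreasing the maximum), so \(p_i' = w_i/(c v_i)\); the constraint \(\sum _i p_i' = \mathbf {E}[|\hat{S}|]\) then gives \(c = \left( \sum _i w_i/v_i\right) / \mathbf {E}[|\hat{S}|]\), which is the claimed value of the minimum.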
Theorem 3 is generic in the sense that we do not say when Assumption 1 is satisfied, nor how one should go about choosing the stepsizes \(\{w_i\}\) and the probabilities \(\{p_S\}\). We address these issues in the next section. On the other hand, this abstract setting allowed us to write a brief complexity proof.
The quantity \(\Lambda \), defined in (4), can be interpreted as a condition number associated with the problem and our method. Hence, as we vary the distribution of \(\hat{S}\), \(\Lambda \) will vary. Intuitively, it is clear that \(\Lambda \) can be arbitrarily large. Indeed, by choosing a sampling \(\hat{S}\) which “nearly” ignores one or more of the coordinates (by setting \(p_i\approx 0\) for some i), we should expect the number of iterations to grow, as the method will necessarily be very slow in updating these coordinates.
In the light of this, inequality (6) is useful as it bounds \(\Lambda \) from below.
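As an illustration of how \(w\), \(p\) and \(\Lambda \) enter the method, the following is a minimal sketch (ours, not the authors' implementation) of the NSync update \(x^{k+1} = x^k + (h^k)_{[\hat{S}]}\), \(h^k = -({{\mathrm{Diag}}}(w))^{-1}\nabla \phi (x^k)\), applied to a simple separable quadratic with a serial nonuniform sampling; all variable names and constants are hypothetical.

```python
import numpy as np

# Minimal sketch of the NSync-style update
#   x^{k+1} = x^k + (h^k)_{[S_k]},   h^k = -Diag(w)^{-1} grad phi(x^k),
# on the separable quadratic phi(x) = 0.5 * sum_i a_i x_i^2 (a_i > 0), phi* = 0.
# Serial sampling: each iteration updates one coordinate i, drawn with probability p_i.
# For this phi one may take ESO weights w_i = a_i and v_i = a_i (gamma = 1),
# so Lambda = max_i w_i / (p_i v_i) = max_i 1 / p_i.

rng = np.random.default_rng(0)
n = 5
a = rng.uniform(1.0, 10.0, size=n)      # curvature of each coordinate
w = a.copy()                            # ESO weights for serial sampling
p = a / a.sum()                         # an example nonuniform probability vector
Lam = np.max(w / (p * a))               # Lambda = max_i w_i / (p_i v_i)

def grad(x):
    return a * x                        # gradient of phi

x = rng.standard_normal(n)              # starting point x^0
for k in range(200):
    i = rng.choice(n, p=p)              # sample one coordinate (serial sampling)
    x[i] -= grad(x)[i] / w[i]           # NSync step on the sampled coordinate

phi = lambda z: 0.5 * np.sum(a * z**2)
print(f"Lambda = {Lam:.2f}, phi(x^K) = {phi(x):.3e}")
```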
Change of variables
Consider the change of variables \(y={{\mathrm{Diag}}}(d) x\), where \(d>0\). Defining \(\phi ^d(y):=\phi (x)\), we get \(\nabla \phi ^d(y) = ({{\mathrm{Diag}}}(d))^{-1}\nabla \phi (x)\). It can be seen that (2) and (3) can equivalently be written in terms of \(\phi ^d\), with w replaced by \(w^d :=w \bullet d^{-2}\) and v replaced by \(v^d :=v \bullet d^{-2}\). By choosing \(d_i=\sqrt{v_i}\), we obtain \(v^d_i=1\) for all i, recovering the standard notion of strong convexity.
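To verify the claim for (2), substitute \(h = ({{\mathrm{Diag}}}(d))^{-1} g\) and note that
$$\begin{aligned} \langle \nabla \phi (x), ({{\mathrm{Diag}}}(d))^{-1} g \rangle _p = \langle \nabla \phi ^d(y), g \rangle _p, \qquad \Vert ({{\mathrm{Diag}}}(d))^{-1} g\Vert _{p \bullet w}^2 = \Vert g\Vert _{p \bullet w \bullet d^{-2}}^2, \end{aligned}$$
so (2) holds for \(\phi ^d\) with \(w^d = w \bullet d^{-2}\); the argument for (3) is analogous. Observe also that \(\Lambda \) is unaffected by this rescaling, since \(w^d_i/(p_i v^d_i) = w_i/(p_i v_i)\).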