1 Introduction

Rotations are fundamental objects in robotics and vision, and the problem of learning rotations, that is, finding the underlying rotation from a given set of examples, has numerous applications [see Arora (2009) for a summary of application areas, including computer vision, face recognition, robotics, crystallography, and physics]. Besides their practical importance, rotations have been shown to be powerful enough to capture seemingly more general mappings. Rotations can represent arbitrary Euclidean transformations via a conformal embedding by adding two special dimensions (Wareham et al. 2005). Doran et al. (1993) also showed that the rotation group provides a universal representation for all Lie groups.

In the batch setting, the problem of learning a rotation was originally posed as the problem of estimating the attitude of satellites by Wahba (1966). The related problem of learning orthogonal, rather than rotation, matrices is known as the “orthogonal Procrustes problem” (see Schonemann 1966). Algorithms for optimizing a static cost function over the set of unitary matrices were proposed by Abrudan et al. (2008a, b) using descent over Riemannian manifolds.

The question of whether there are online algorithms for this problem was explicitly posed as an open problem in COLT 2008 (Smith and Warmuth 2008). An online algorithm for learning rotations was given by Arora (2009). This algorithm exploits the Lie group/Lie algebra relationship between rotation matrices and skew symmetric matrices, respectively, and the matrix exponential and matrix logarithm that map between these matrix classes. However, this algorithm deterministically predicts with a single rotation matrix in each trial. In this paper, we prove that any such deterministic algorithm can be forced to have regret at least \(\varOmega (T)\), where T is the number of trials. In contrast, we give an algorithm that produces a random rotation in each trial and has expected regret at most \(\sqrt{nT}\), where n is the dimension of the matrices.

1.1 Technical contributions

The main technique used in this paper is a new variant of online gradient descent with Euclidean projections (see e.g. Herbster and Warmuth 2001; Zinkevich 2003) that uses what we call “lazy projections”. A straightforward implementation of the original algorithm requires \(O(n^3)\) time per iteration, because in each round we need to perform a Euclidean projection of the parameter matrix onto the convex hull of orthogonal matrices, and this projection requires the computation of a singular value decomposition. The crucial new idea is to project the parameter matrix onto a convex set determined by the current instance that contains the convex hull of all orthogonal matrices as a subset. The projection onto this larger set can be done easily in \(O(n^2)\) time and needs to be performed only when the current parameter matrix is outside of the set. Surprisingly, our new algorithm based on this idea of “lazy projections” has the same optimal regret bound but requires only \(O(n^2)\) time per iteration.

The loss function for learning rotations can be expressed as a linear function. Such linear losses are special because they are the least convex losses. The main case where such linear losses have been investigated is in connection with the Hedge and the Matrix Hedge algorithm (Freund and Schapire 1997; Warmuth and Kuzmin 2011). However for the latter algorithms the parameter space is one-norm or trace norm bounded, respectively. In contrast, the implicit parameter space of the main algorithm of this paper is essentially infinity norm bounded, i.e. the convex hull of orthogonal matrices consists of all square matrices with singular values at most one.

1.2 Outline of paper

We begin in the next section with some preliminaries: the precise online learning problem for rotations and basic properties of rotations. We also show how to solve the corresponding off-line problem exactly. We then give in Sect. 3 our main probabilistic algorithm and prove a \(\sqrt{nT}\) regret bound for it. This bound cannot be improved by more than a constant factor, since we can prove a lower bound of essentially \(\sqrt{\frac{nT}{2}}\) (for the case when \(T\ge n\)).

For the sake of completeness we also consider deterministic algorithms in Sect. 4. In particular we show that any deterministic algorithm can be forced to have regret at least T in general. In Sect. 5 we then present the optimal randomized and deterministic algorithms for the special case when there is a rotation that is consistent with all examples. A number of open problems are discussed in the final Sect. 6. In “Appendix 1” we prove a lemma that characterizes the solution to the batch optimization problem for learning rotations and in “Appendix 2” we show that the convex hull of orthogonal matrices consists of all matrices with maximum singular value one.

1.3 Historical note

A preliminary version of this paper appeared in COLT 2010 (Hazan et al. 2010b). Unfortunately the algorithm we presented in the conference (based on the Follow the Perturbed Leader paradigm) was flawed: the true regret bound it obtains is \(O(n\sqrt{T})\) as opposed to the claimed bound of \(O(\sqrt{nT})\). After noticing this we published a corrigendum [posted on the conference website, Hazan et al. (2010a)] with a completely different technique based on online gradient descent that obtained the optimal regret \(O(\sqrt{nT})\). The algorithm presented in this paper is similar, but much more efficient, taking \(O(n^2)\) time per iteration rather than \(O(n^3)\).

2 Preliminaries and problem statement

2.1 Notation

In this paper, all vectors lie in \(\mathbb {R}^n\) and all matrices in \(\mathbb {R}^{n \times n}\). We use \(\det (\mathbf {M})\) to denote the determinant of matrix \(\mathbf {M}\). An orthogonal matrix is a matrix \(\mathbf {R}\) whose columns and rows are orthogonal unit vectors, i.e. \(\mathbf {R}^\top \mathbf {R}= \mathbf {R}\mathbf {R}^\top = \mathbf {I}\), where \(\mathbf {I}\) is the identity matrix. We let \(\mathcal{O}(n)\) denote the set of all orthogonal matrices of dimension \(n\times n\) and \(\mathcal{SO}(n)\) denote the special orthogonal group of rotation matrices, which are all orthogonal matrices of determinant one. Since for \(n = 1\) there is exactly one rotation matrix (i.e. \(\mathcal{SO}(1)=\{1\}\)), the problem becomes trivial, so throughout this paper we assume that \(n > 1\).

For any vector \(\mathbf{x}\), \(\Vert \mathbf{x}\Vert \) denotes the \(\ell _2\) norm, and for any matrix \(\mathbf {A}\), \(\Vert \mathbf {A}\Vert _F=\sqrt{{{\mathrm{tr}}}(\mathbf {A}\mathbf {A}^\top )}\) is the Frobenius norm. All regret bounds of this paper immediately carry over to the complex domain: replace orthogonal by unitary matrices and rotation matrices by unitary matrices of determinant one.

2.2 Online learning of rotations problem

Learning proceeds in a series of trials. In each trial \(t = 1,2,\ldots , T\):

  1.

    The online learner is given a unit instance vector \(\mathbf x_t\) (i.e. \(\Vert \mathbf x_t\Vert = 1\)).

  2.

    The learner is then required to predict (deterministically or randomly) with a unit vector \(\hat{\mathbf {y}}_t\).

  3.

    Finally, the algorithm obtains the “true” result, which is also a unit vector \(\mathbf y_t\).

  4.

    The loss to the learner then is half the squared norm of the difference between her predicted vector \(\hat{\mathbf {y}}_t\) and the “true” result vector \(\mathbf y_t\):

    $$\begin{aligned} \tfrac{1}{2}\Vert \hat{\mathbf {y}}_t - \mathbf y_t \Vert ^2 \ =\ 1 - \mathbf{y}_t^\top \hat{\mathbf {y}}_t \end{aligned}$$
    (1)
  5.

    If \(\hat{\mathbf {y}}_t\) is chosen probabilistically, then we define the expected loss as

    $$\begin{aligned} \text{ E }\left[ \tfrac{1}{2}\left\| \mathbf y_t-\hat{\mathbf{y}}_t\right\| ^2\right] \ =\ \tfrac{1}{2}\text{ E }\left[ \Vert \hat{\mathbf {y}}_t - \mathbf{y}_t \Vert ^2\right] \ =\ 1 - \mathbf{y}_t^\top \text{ E }[\hat{\mathbf {y}}_t]. \end{aligned}$$

Note that the loss is linear in the prediction vector \(\hat{\mathbf {y}}_t\) or the expected prediction vector \(\text{ E }[\hat{\mathbf {y}}_t]\), respectively. The goal of the learner is to minimize the regret on all T examples against the best fixed rotation \(\mathbf {R}\) chosen in hindsight:

$$\begin{aligned} \text {Regret}_T\ =\ \sum _{t=1}^T \text{ E }\left[ \tfrac{1}{2}\left\| \mathbf y_t-\hat{\mathbf{y}}_t\right\| ^2\right] - \min _{\mathbf {R}\in \mathcal{SO}(n)} \sum _{t=1}^T \tfrac{1}{2}\left\| \mathbf y_t-\mathbf {R}\mathbf x_t\right\| ^2. \end{aligned}$$
(2)

In this paper we give a probabilistic online algorithm with worst-case regret at most \(\sqrt{nT}\) and an adversary strategy that forces any probabilistic algorithm to have regret essentially \(\sqrt{\tfrac{1}{2}nT}\). Note that since our loss is linear in the rotation \(\mathbf {R}\), the best total loss \(\min _{\mathbf {R}\in \mathcal{SO}(n)} \sum _{t=1}^T \tfrac{1}{2}\left\| \mathbf y_t-\mathbf {R}\mathbf x_t\right\| ^2\) cannot be decreased by allowing mixtures of rotations.
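
Indeed, for unit vectors the per-trial loss of any fixed rotation \(\mathbf {R}\) expands into a linear function of \(\mathbf {R}\):

$$\begin{aligned} \tfrac{1}{2}\left\| \mathbf y_t-\mathbf {R}\mathbf x_t\right\| ^2 = \tfrac{1}{2}\left( \Vert \mathbf y_t\Vert ^2 - 2\,\mathbf y_t^\top \mathbf {R}\mathbf x_t + \Vert \mathbf {R}\mathbf x_t\Vert ^2\right) = 1 - \mathbf y_t^\top \mathbf {R}\mathbf x_t = 1 - {{\mathrm{tr}}}\left( \mathbf {R}\mathbf x_t\mathbf y_t^\top \right) , \end{aligned}$$

since \(\Vert \mathbf y_t\Vert =\Vert \mathbf {R}\mathbf x_t\Vert =1\); the same expansion with \(\hat{\mathbf {y}}_t\) in place of \(\mathbf {R}\mathbf x_t\) yields identity (1).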

When \(n>1\), any unit vector can be rotated into any other unit vector. Namely, one can always produce an explicit rotation matrix \(\mathbf {R}_t\) in Step 2 that rotates \(\mathbf x_t\) to \(\hat{\mathbf {y}}_t\) (that is, \(\hat{\mathbf {y}}_t\) in the definition of regret (2) is replaced by \(\mathbf {R}_t\mathbf x_t\)). Such a rotation matrix can be computed in \(O(n^2)\) time, as the following lemma shows.

Lemma 1

Let \(\mathbf x\) and \(\hat{\mathbf {y}}\) be two unit vectors. Then we can find an explicit rotation matrix \(\mathbf {R}\) that rotates \(\mathbf x\) onto \(\hat{\mathbf {y}}\), i.e. \(\mathbf {R}\mathbf x= \hat{\mathbf {y}}\), in \(O(n^2)\) time.

Proof

We first take care of the case when \(\hat{\mathbf {y}}=\pm \mathbf x\): If \(\hat{\mathbf {y}}= \mathbf x\) we can simply let \(\mathbf {R}= \mathbf {I}\); if \(\hat{\mathbf {y}}= -\mathbf x\) and n is even, then we can use \(\mathbf {R}= -\mathbf {I}\); finally, if \(\hat{\mathbf {y}}= -\mathbf x\) and n is odd, then choose \(\mathbf {R}=-\mathbf {I}+2 \mathbf z\mathbf z^\top \), where \(\mathbf z\) is an arbitrary unit vector orthogonal to \(\mathbf x\). In all these cases, \(\mathbf {R}\mathbf x=\hat{\mathbf {y}}\) and \(\det (\mathbf {R})=1\).

For the remaining case (\(\hat{\mathbf {y}}\ne \pm \mathbf x\)), let d denote the dot product \(\mathbf x\cdot \hat{\mathbf {y}}\) and let \(\hat{\mathbf {y}}^\bot \) be the normalized component of \(\hat{\mathbf {y}}\) that is perpendicular to \(\mathbf x\), i.e. \(\hat{\mathbf {y}}^\bot =\frac{\hat{\mathbf {y}}-d\mathbf x}{\Vert \hat{\mathbf {y}}-d\mathbf x\Vert }\). Let \(\mathbf {U}\) be an orthogonal matrix with \(\mathbf x\) and \(\hat{\mathbf {y}}^\bot \) as its first two columns. Now define \(\mathbf {R}\) as \(\mathbf {U}\mathbf {C}\mathbf {U}^\top ,\) where

$$\begin{aligned} \mathbf {C}= \begin{pmatrix} d & -\sqrt{1-d^2} & 0 & 0 & \cdots & 0\\ \sqrt{1-d^2} & d & 0 & 0 & \cdots & 0\\ 0 & 0 & 1 & 0 & \cdots & 0\\ 0 & 0 & 0 & 1 & \cdots & 0\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 \end{pmatrix}. \end{aligned}$$

All unspecified off-diagonal entries are 0, and all diagonal entries starting from the third row are 1. Now \(\mathbf {R}\) is a rotation matrix that satisfies the requirements because

$$\begin{aligned} \mathbf {R}\mathbf x= & {} \mathbf {U}\mathbf {C}\mathbf {U}^\top \mathbf x= \mathbf {U}\mathbf {C}\,(1,0,0, \ldots ,0)^\top \\= & {} \mathbf {U}\,(d,\sqrt{1-d^2},0, 0,\ldots ,0)^\top \\= & {} d\,\mathbf x+ \sqrt{1-d^2} \,\hat{\mathbf {y}}^\bot = \hat{\mathbf {y}},\\ \mathbf {R}\mathbf {R}^\top= & {} \mathbf {U}\mathbf {C}\mathbf {C}^\top \mathbf {U}^\top = \mathbf {U}\mathbf {I}\mathbf {U}^\top =\mathbf {I}\quad \text {and}\quad \det (\mathbf {R})=\det (\mathbf {U})\det (\mathbf {C})\det (\mathbf {U})=\det (\mathbf {C})=1. \end{aligned}$$

Note that \(\mathbf {R}\) can be computed in \(O(n^2)\) time by rewriting it as

$$\begin{aligned} \mathbf {R}=\mathbf {I}+ \mathbf {U}(\mathbf {C}-\mathbf {I})\mathbf {U}^\top= & {} \mathbf {I}+ (d-1)\left( \mathbf x\mathbf x^\top +\hat{\mathbf {y}}^\bot \left( \hat{\mathbf {y}}^\bot \right) ^\top \right) \\&+\sqrt{1-d^2}\left( \hat{\mathbf {y}}^\bot \mathbf x^\top -\mathbf x\left( \hat{\mathbf {y}}^\bot \right) ^\top \right) . \end{aligned}$$

\(\square \)
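
As a concrete illustration of this construction (a minimal sketch; the tolerance and the choice of the auxiliary vector \(\mathbf z\) when \(\hat{\mathbf {y}}=-\mathbf x\) are our own implementation choices), the rotation of Lemma 1 can be assembled from outer products in \(O(n^2)\) time:

```python
import numpy as np

def rotation_from_to(x, y_hat, tol=1e-12):
    """Build a rotation R with R @ x = y_hat for unit vectors x, y_hat (Lemma 1)."""
    n = len(x)
    if np.linalg.norm(y_hat - x) < tol:              # case y_hat = +x
        return np.eye(n)
    if np.linalg.norm(y_hat + x) < tol:              # case y_hat = -x
        if n % 2 == 0:
            return -np.eye(n)
        z = np.zeros(n)                              # n odd: pick a unit z orthogonal to x
        z[int(np.argmin(np.abs(x)))] = 1.0
        z -= (z @ x) * x
        z /= np.linalg.norm(z)
        return -np.eye(n) + 2.0 * np.outer(z, z)
    d = float(x @ y_hat)                             # generic case y_hat != +-x
    y_perp = y_hat - d * x                           # component of y_hat orthogonal to x
    y_perp /= np.linalg.norm(y_perp)
    s = np.sqrt(max(0.0, 1.0 - d * d))
    return (np.eye(n)
            + (d - 1.0) * (np.outer(x, x) + np.outer(y_perp, y_perp))
            + s * (np.outer(y_perp, x) - np.outer(x, y_perp)))
```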

2.3 Solving the offline problem

Before describing our online algorithm, we need to understand how to solve the optimization problem of the offline (batch) algorithm:

$$\begin{aligned} \mathop {{\text {argmin}}}\limits _{\mathbf {R}\in \mathcal{SO}(n)} {\textstyle \sum }_{t=1}^T \tfrac{1}{2}\left\| \mathbf y_t-\mathbf {R}\mathbf x_t\right\| ^2= & {} \mathop {{\text {argmin}}}\limits _{\mathbf {R}\in \mathcal{SO}(n)} T- \left( {\textstyle \sum }_{t=1}^T \mathbf{y}_t^\top \mathbf {R}\mathbf{x}_t\right) \end{aligned}$$
(3)
$$\begin{aligned}= & {} {\text {argmax}}_{\mathbf {R}\in \mathcal{SO}(n)} {{\mathrm{tr}}}\left( \left( {\textstyle \sum }_{t=1}^T \mathbf{x}_t\mathbf{y}_t^\top \right) \mathbf {R}\right) . \end{aligned}$$
(4)

The first equality follows from rewriting the loss function \(\tfrac{1}{2}\left\| \mathbf y_t-\mathbf {R}\mathbf x_t\right\| ^2\) as in (1). Note that the T examples \((\mathbf x_t,\mathbf y_t)\) only enter into the optimization problem via the matrix \(\mathbf {S}:= \sum _{t=1}^T \mathbf x_t\mathbf y_t^\top \). This matrix functions as a “sufficient statistic” of the examples. As we shall see later, our randomized online algorithm will also be based on this sufficient statistic.

In general, an optimization problem of the form

$$\begin{aligned} {\text {argmax}}_{\mathbf {R}\in \mathcal{SO}(n)} {{\mathrm{tr}}}(\mathbf {S}\mathbf {R}) \end{aligned}$$
(5)

for some matrix \(\mathbf {S}\), is a classical problem known as Wahba’s problem (Wahba 1966). Figure 1 gives a simple example of the challenges in solving Wahba’s problem: the optimum rotation matrix may not be unique (any rotation matrix \(\mathbf {R}\) has the same loss on the two examples given in Fig. 1).

Fig. 1

For two examples \((\mathbf{x}, \mathbf{y})\) and \((\mathbf{x}, -\mathbf{y})\), the matrix \(\mathbf {S}= \mathbf{x}\mathbf{y}^\top - \mathbf{x}\mathbf{y}^\top = 0\). Thus, any rotation matrix \(\mathbf {R}\) has the same total loss. In particular, if \(\hat{\mathbf {y}}= \mathbf {R}\mathbf x\) for any rotation matrix \(\mathbf {R}\), then the total loss is exactly 2: \(\tfrac{1}{2}\Vert \mathbf y- \hat{\mathbf {y}}\Vert ^2+\tfrac{1}{2}\Vert -\mathbf y- \hat{\mathbf {y}}\Vert ^2 = 2 - \left( \mathbf y^\top - \mathbf y^\top \right) \hat{\mathbf {y}}= 2-0 = 2.\) Thus the loss is at least 1 for at least one of the examples regardless of how \(\hat{\mathbf {y}}\) was chosen. We exploit this kind of argument in our lower bounds.

Nevertheless, the value of Wahba’s problem (5) has a very elegant solution expressed in terms of the singular value decomposition (SVD) of \(\mathbf {S}\).

Lemma 2

Let \(\mathbf {S}= \mathbf {U}\varvec{\Sigma }\mathbf {V}^\top \) be any SVD of \(\mathbf {S}\), i.e. \(\mathbf {U}\) and \(\mathbf {V}\) are orthogonal matrices, and \(\varvec{\Sigma }= \text {diag}(\sigma _1, \sigma _2, \ldots , \sigma _n)\) is the diagonal matrix of non-negative singular values. Assume that \(\sigma _n\) is the smallest singular value and let \(s := \det (\mathbf {U})\det (\mathbf {V})\). Since \(\mathbf {U}\) and \(\mathbf {V}\) are orthogonal, \(s\in \{+1,-1\}\). Now if \(\mathbf {W}:= \text{ diag }(1, 1, \ldots , 1, s)\), then

$$\begin{aligned}\mathbf {V}\mathbf {W}\mathbf {U}^\top \in {\text {argmax}}_{\mathbf {R}\in \mathcal{SO}(n)} {{\mathrm{tr}}}(\mathbf {S}\mathbf {R}), \end{aligned}$$

and the value of the optimal solutions is \(\sum _{i=1}^{n-1}\sigma _i + s\sigma _n\), which is always non-negative.

By (4), the solution to Wahba’s problem is obtained by solving the \({\text {argmax}}\) problem of the above lemma with \(\mathbf {S}=\sum _{t=1}^T \mathbf x_t\mathbf y_t^\top \) and the value of the original optimization problem (3) is \(T-\sum _{i=1}^{n-1}\sigma _i - s\sigma _n\).

Note that the solution \(\mathbf {V}\mathbf {W}\mathbf {U}^\top \) given in the lemma is a rotation matrix since it is a product of three orthogonal matrices, and its determinant equals \(\det (\mathbf {U})\det (\mathbf {V})\det (\mathbf {W}) = 1\). The solution can be found in \(O(n^3)\) time by computing an SVD of \(\mathbf {S}\).

We have been unable to find a complete proof of the above lemma for dimensions more than 3 in the literature and therefore, for the sake of completeness, we give a self-contained proof in “Appendix 1”.
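
For illustration, the formula of Lemma 2 can be evaluated as in the following sketch, which assumes numpy's SVD convention of returning the singular values in decreasing order (so the last one is \(\sigma _n\)); the function name is ours:

```python
import numpy as np

def wahba_solution(S):
    """Return a maximizer of tr(S R) over rotations R in SO(n), and its value (Lemma 2)."""
    n = S.shape[0]
    U, sigma, Vt = np.linalg.svd(S)              # S = U @ diag(sigma) @ Vt, sigma decreasing
    s = 1.0 if np.linalg.det(U) * np.linalg.det(Vt) > 0 else -1.0
    W = np.eye(n)
    W[-1, -1] = s                                # W = diag(1, ..., 1, s)
    R = Vt.T @ W @ U.T                           # the rotation V W U^T of the lemma
    value = sigma[:-1].sum() + s * sigma[-1]     # optimal value of tr(S R)
    return R, value
```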

Note that if we are simply optimizing over all orthogonal matrices \(\mathbf {R}\in \mathcal{O}(n)\), with no condition on \(\det (\mathbf {R})\), then we arrive at another classical problem known as the “orthogonal Procrustes problem” (first solved by Schonemann 1966). The solution for this simpler problem is also given in “Appendix 2” for completeness:

Lemma 3

Let \(\mathbf {S}= \mathbf {U}\varvec{\Sigma }\mathbf {V}^\top \) be a SVD of \(\mathbf {S}\) as in Lemma 2. Then

$$\begin{aligned} \mathbf {V}\mathbf {U}^\top \in {\text {argmax}}_{\mathbf {R}\in \mathcal{O}(n)} {{\mathrm{tr}}}(\mathbf {S}\mathbf {R}) \end{aligned}$$

and the value of the optimum solutions is \(\sum _{i=1}^n \sigma _i\).
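
Under the same assumptions as the sketch above, the Procrustes solution simply drops the correction matrix \(\mathbf {W}\):

```python
import numpy as np

def procrustes_solution(S):
    """Return a maximizer of tr(S R) over all orthogonal R, and its value (Lemma 3)."""
    U, sigma, Vt = np.linalg.svd(S)
    return Vt.T @ U.T, sigma.sum()               # V U^T and the sum of the singular values
```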

3 The randomized algorithm and main theorem

3.1 The algorithm

We begin by presenting our main randomized algorithm, called “Lazy Projection GD” (see Algorithm 1).

Algorithm 1 (Lazy Projection GD)

The algorithm maintains a parameter matrix \(\mathbf {W}_{t-1}\), which intuitively models a distribution over rotation matrices that transform \(\mathbf x_t\) onto \(\hat{\mathbf {y}}_t\). The goal will be to predict with a unit vector \(\hat{\mathbf {y}}_t\) such that \(\text{ E }(\hat{\mathbf {y}}_t)=\mathbf {W}_{t-1}\mathbf x_t\). This is not quite possible if \(\Vert \mathbf {W}_{t-1}\mathbf x_t\Vert > 1\). Therefore during the t-th trial in Step 6 of the algorithm, \(\mathbf {W}_{t-1}\) is updated to an intermediate parameter matrix \(\mathbf {W}_t^m\) which makes it possible for the algorithm to predict with a unit vector \(\hat{\mathbf {y}}_t\) such that \(\text{ E }(\hat{\mathbf {y}}_t)=\mathbf {W}_t^m\mathbf x_t\). As shown in Lemma 4, the update rule for obtaining \(\mathbf {W}_t^m\) from \(\mathbf {W}_{t-1}\) in Step 6 is a Bregman projection with respect to the squared Frobenius norm onto the set of matrices \(\mathbf {W}\) for which \(\Vert \mathbf {W}\mathbf x\Vert \le 1\):

$$\begin{aligned} \mathbf {W}_t^m = \mathop {{\text {argmin}}}\limits _{\mathbf {W}:\Vert \mathbf {W}\mathbf x_t\Vert ^2 \le 1} \tfrac{1}{2}\Vert \mathbf {W}-\mathbf {W}_{t-1}\Vert ^2_F. \end{aligned}$$

This ensures that \(\Vert \mathbf {W}_t^m \mathbf x_t\Vert \le 1\), making it possible to predict with unit vector \(\hat{\mathbf {y}}_t\) such that \(\text{ E }(\hat{\mathbf {y}}_t)=\mathbf {W}_t^m\mathbf x_t\).

The prediction \(\hat{\mathbf {y}}_t\) is computed as follows. There are two main cases depending on the length of \(\mathbf z_t=\mathbf {W}_{t-1}\mathbf x_t\). When \(\Vert \mathbf z_t\Vert \le 1\), then \(\mathbf z_t\) is not too long and the unit vector \(\widetilde{\mathbf z}_t:= \frac{\mathbf z_t}{\Vert \mathbf z_t\Vert }\) in the direction of \(\mathbf z_t\) is used for choosing the prediction. More precisely, the algorithm predicts with \(\hat{\mathbf {y}}_t=\pm \widetilde{\mathbf z}_t\) with probability \(\frac{1\pm \Vert \mathbf z_t\Vert }{2}\). In this case the intermediate parameter matrix \(\mathbf {W}_t^m\) is set to be equal to the parameter matrix \(\mathbf {W}_{t-1}\) and

$$\begin{aligned} \text{ E }(\hat{\mathbf {y}}_t)= \Vert \mathbf z_t\Vert \;\widetilde{\mathbf z}_t=\mathbf z_t=\mathbf {W}_{t-1}\mathbf x_t=\mathbf {W}_t^m\mathbf x_t. \end{aligned}$$
(6)

In the degenerate case when \(\Vert \mathbf z_t\Vert =0\), then the direction \(\widetilde{\mathbf z}_t=\frac{\mathbf z_t}{\Vert \mathbf z_t\Vert }\) of \(\mathbf z_t\) is undefined and the algorithm arbitrarily sets \(\widetilde{\mathbf z}_t\) to the unit vector \(\mathbf {e}_1=(1,0,\ldots ,0)^\top \). Observe that the above equalities (6) remain valid in this case.

When \(\Vert \mathbf z_t\Vert > 1\), then \(\mathbf z_t\) is too long and the algorithm deterministically sets the prediction \(\hat{\mathbf {y}}_t\) to the shortened unit direction \(\widetilde{\mathbf z}_t=\frac{\mathbf z_t}{\Vert \mathbf z_t\Vert }\). Now the parameter matrix \(\mathbf {W}_{t-1}\) also needs to be “shortened” or “projected” as we shall see in a moment. More precisely, \(\mathbf {W}_t^m\) is set to equal \(\mathbf {W}_{t-1}(\mathbf {I}-(1-\frac{1}{\Vert \mathbf z_t\Vert })\mathbf x_t\mathbf x_t^\top )\) which ensures that

$$\begin{aligned} \hat{\mathbf {y}}_t=\widetilde{\mathbf z}_t= \frac{\mathbf {W}_{t-1} \mathbf x_t}{\Vert \mathbf z_t\Vert }=\mathbf {W}_t^m\mathbf x_t. \end{aligned}$$

We conclude that in all cases the expected loss of the algorithm is

$$\begin{aligned} \text{ E }\left( \tfrac{1}{2}\left\| \mathbf y_t-\hat{\mathbf{y}}_t\right\| ^2\right) = 1 - \mathbf y_t^\top \text{ E }(\hat{\mathbf {y}}_t) = 1 - \mathbf y_t^\top \mathbf {W}_t^m\mathbf x_t. \end{aligned}$$
(7)

This means that the update of \(\mathbf {W}_t\) from \(\mathbf {W}_t^m\) in Step 8 of the algorithm is a standard gradient descent update with respect to the squared Frobenius norm and the above expected loss:

$$\begin{aligned}\mathbf {W}_t = \mathop {{\text {argmin}}}\limits _{\mathbf {W}} \left( \tfrac{1}{2}\left\| \mathbf {W}-\mathbf {W}_t^m\right\| ^2_F +\eta (1-\mathbf y_t^\top \mathbf {W}\mathbf x_t) \right) .\end{aligned}$$
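
Putting the pieces together, one trial of the method can be sketched as follows (a minimal sketch, assuming unit-vector examples given as numpy arrays, the initialization \(\mathbf {W}_0=0\) and the learning rate \(\eta =\sqrt{n/T}\) used in the analysis below; all names are illustrative):

```python
import numpy as np

def lazy_projection_gd(examples, n, T, rng=None):
    """Minimal sketch of Lazy Projection GD on a stream of unit-vector pairs (x_t, y_t)."""
    rng = rng or np.random.default_rng()
    eta = np.sqrt(n / T)                      # learning rate eta = sqrt(n/T)
    W = np.zeros((n, n))                      # parameter matrix W_0
    total_loss = 0.0
    for x, y in examples:
        z = W @ x
        norm_z = np.linalg.norm(z)
        if norm_z <= 1.0:                     # Step 6 does nothing: W_t^m = W_{t-1}
            W_m = W
            z_tilde = z / norm_z if norm_z > 0 else np.eye(n)[0]
            sign = 1.0 if rng.random() < (1.0 + norm_z) / 2.0 else -1.0
            y_hat = sign * z_tilde            # random unit prediction with E[y_hat] = W_m x
        else:                                 # lazy projection onto {W : ||W x|| <= 1}
            c = 1.0 - 1.0 / norm_z
            W_m = W - c * np.outer(z, x)      # = W_{t-1}(I - c x x^T) in O(n^2) time
            y_hat = z / norm_z                # deterministic prediction, equals W_m x
        total_loss += 0.5 * np.linalg.norm(y - y_hat) ** 2
        W = W_m + eta * np.outer(y, x)        # gradient descent step (Step 8) on the linear loss
    return total_loss
```

Each trial uses only a few matrix-vector products and rank-one updates, which is the source of the \(O(n^2)\) per-trial running time.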

We now prove that the update of \(\mathbf {W}_t^m\) from \(\mathbf {W}_{t-1}\) is a Bregman projection with respect to the squared Frobenius norm:

Lemma 4

Let \(\mathbf {V}\) be a matrix and \(\mathbf x\) be a unit vector. Then the projection of \(\mathbf {V}\) on the set \(\{\mathbf {W}:\Vert \mathbf {W}\mathbf x\Vert ^2 \le 1\}\) is given by

$$\begin{aligned} \mathop {{\text {argmin}}}\limits _{\mathbf {W}:\Vert \mathbf {W}\mathbf x\Vert ^2 \le 1} \tfrac{1}{2}\Vert \mathbf {W}-\mathbf {V}\Vert ^2_F \;=\; {\left\{ \begin{array}{ll} \mathbf {V}&{}\quad \text { if } \Vert \mathbf {V}\mathbf x\Vert \le 1 \\ \mathbf {V}\left( \mathbf {I}- \left( 1 - \tfrac{1}{\Vert \mathbf {V}\mathbf x\Vert }\right) \mathbf x\mathbf x^\top \right) &{}\quad \text { otherwise.} \end{array}\right. } \end{aligned}$$

The projection can be computed in \(O(n^2)\) time.

Proof

If \(\Vert \mathbf {V}\mathbf x\Vert \le 1\) then \(\mathbf {V}\) is already in the set \(\{\mathbf {W}:\Vert \mathbf {W}\mathbf x\Vert ^2 \le 1\}\) and hence the projection is \(\mathbf {V}\) itself. So assume that \(\Vert \mathbf {V}\mathbf x\Vert > 1\). Now note that the set \(\{\mathbf {W}:\Vert \mathbf {W}\mathbf x\Vert ^2 \le 1\}\) is convex. Thus, the projection \(\mathbf {W}\) of \(\mathbf {V}\) onto this set lies on the boundary and satisfies \(\Vert \mathbf {W}\mathbf{x}\Vert = 1\). The theory of Lagrange multipliers tells us that the optimum solution satisfies the following equation for some constant \(\alpha \):

$$\begin{aligned} \mathbf {W}-\mathbf {V}+\alpha \mathbf {W}\mathbf x\mathbf x^\top = 0. \end{aligned}$$
(8)

By right multiplying the matrices on both sides of the equation with the vector \(\mathbf x\), moving \(-\mathbf {V}\mathbf x\) to the right and squaring the lengths of both sides, we get

$$\begin{aligned} \Vert \mathbf {W}\mathbf x\Vert ^2(1+\alpha )^2 = \Vert \mathbf {V}\mathbf x\Vert ^2. \end{aligned}$$

Since \(\Vert \mathbf {W}\mathbf x\Vert ^2=1\), \(\alpha =\pm \Vert \mathbf {V}\mathbf x\Vert -1\). From Eq. (8) it follows that

$$\begin{aligned} \mathbf {W}=\mathbf {V}\left( \mathbf {I}+\alpha \mathbf x\mathbf x^\top \right) ^{-1} =\mathbf {V}\left( \mathbf {I}- \tfrac{\alpha }{1+\alpha } \mathbf x\mathbf x^\top \right) =\mathbf {V}\left( \mathbf {I}- \left( 1\mp \frac{1}{\Vert \mathbf {V}\mathbf x\Vert }\right) \mathbf x\mathbf x^\top \right) . \end{aligned}$$

The second equality follows from the Sherman–Morrison–Woodbury formula (see e.g. Bernstein 2009). Note that the inverse is defined because we are in the case \(\Vert \mathbf {V}\mathbf x\Vert >1\) and therefore both solutions for \(\alpha \) are not equal to \(-1\). The result now follows from the fact that for the \(\alpha =+\Vert \mathbf {V}\mathbf x\Vert -1\) solution, \(\Vert \mathbf {W}-\mathbf {V}\Vert _F^2\) is smaller. \(\square \)
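
As a quick numerical sanity check (illustrative only; the test matrix, the unit vector and the seed are arbitrary choices), the closed form lands on the boundary of the constraint set and satisfies the stationarity condition (8) with \(\alpha =\Vert \mathbf {V}\mathbf x\Vert -1\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
V = 3.0 * rng.normal(size=(n, n))                          # a matrix with ||V x|| > 1
x = rng.normal(size=n); x /= np.linalg.norm(x)             # a unit instance vector
norm_Vx = np.linalg.norm(V @ x)
assert norm_Vx > 1.0
W = V @ (np.eye(n) - (1.0 - 1.0 / norm_Vx) * np.outer(x, x))
alpha = norm_Vx - 1.0
print(np.linalg.norm(W @ x))                               # approximately 1
print(np.linalg.norm(W - V + alpha * np.outer(W @ x, x)))  # approximately 0
```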

We call our algorithm “Lazy Projection GD”, because for the sake of efficiency we “project as little as necessary” while keeping the relationship \(\text{ E }[\hat{\mathbf {y}}_t]=\mathbf {W}_t^m\mathbf x_t\). The algorithm takes \(O(n^2)\) time per trial since all steps can be reduced to a small number of matrix additions and matrix vector multiplications. In the corrigendum to the conference version of this paper, an alternate but computationally more expensive projection is used which projects \(\mathbf {W}_{t-1}\) onto the convex hull of all orthogonal matrices. As sketched below, this is more involved because it requires us to maintain an SVD of the parameter matrix.

Let B denote the convex hull of all orthogonal matrices. In “Appendix 2” we show that B is the set of all square matrices with singular values at most one. Projecting onto B consists of computing the SVD of \(\mathbf {W}_{t-1}\) and capping all singular values larger than one at one (see corrigendum to Hazan et al. 2010b). Next, the projected matrix \(\mathbf {W}_t^m\) is randomly rounded to an orthogonal matrix \(\mathbf {U}_t\) s.t. \(\text{ E }[\mathbf {U}_t]=\mathbf {W}_t^m\) (see Lemma 6 for details), and then the prediction made is \(\hat{\mathbf {y}}_t = \mathbf {U}_t\mathbf x_t\). Overall, the algorithm takes \(O(n^3)\) time per iteration.

As shown in “Appendix 2”, the convex hull B of all orthogonal matrices can be written as an intersection of convex constraints:

$$\begin{aligned} B=\bigcap _{\mathbf x:||\mathbf x||=1} \{\mathbf {W}:||\mathbf {W}\mathbf x|| \le 1\}. \end{aligned}$$

So for the sake of efficiency we only project onto \(\{\mathbf {W}:||\mathbf {W}\mathbf x_t|| \le 1\}\), where \(\mathbf x_t\) is the current instance, and not onto the full intersection B. This new “lazy projection” method is simpler, has update time \(O(n^2)\), and avoids the need to maintain an SVD of the parameter matrix. The possibility of using delayed projections was discussed in Section 5.5 of Helmbold and Warmuth (2009). What is unique in our setting is that the convex set we project onto depends on the instance \(\mathbf x_t\).

3.2 Analysis and main theorem

We are now ready to prove our main regret bound for our on-line algorithm based on lazy projections. Note that the learning rate depends on the number of trials T. However, it is easy to run the algorithms in stages based on a geometric series of upper bounds on T [see for example Algorithm G1 of Cesa-Bianchi et al. (1996)]. This increases the regret bound by at most a constant factor.

Theorem 1

If Algorithm 1 is run with \(\eta = \sqrt{\frac{n}{T}}\) on any sequence of T examples, then

$$\begin{aligned}\sum _{t=1}^T \text{ E }\left[ \tfrac{1}{2}\left\| \mathbf y_t-\hat{\mathbf{y}}_t\right\| ^2\right] - \min _{\mathbf {R}\in \mathcal{SO}(n)} \sum _{t=1}^T \tfrac{1}{2}\left\| \mathbf y_t-\mathbf {R}\mathbf x_t\right\| ^2\le \ \sqrt{nT}. \end{aligned}$$

Proof

For any rotation matrix \(\mathbf {R}\),

$$\begin{aligned} \tfrac{1}{2}\left\| \mathbf {W}_t-\mathbf {R}\right\| _F^2&= \tfrac{1}{2}\left\| \mathbf {W}_t^m-\mathbf {R}+\eta \;\mathbf{y}_t\mathbf{x}_t^\top \right\| _F^2\\&= \tfrac{1}{2}\left\| \mathbf {W}_t^m-\mathbf {R}\right\| _F^2 + \eta {{\mathrm{tr}}}\left( \left( \mathbf {W}_t^m-\mathbf {R}\right) \mathbf x_t\mathbf y_t^\top \right) + \tfrac{\eta ^2}{2} \Vert \mathbf{y}_t\mathbf{x}_t^\top \Vert _F^2\\&= \tfrac{1}{2}\left\| \mathbf {W}_t^m-\mathbf {R}\right\| _F^2 + \eta \left( \tfrac{1}{2}\left\| \mathbf y_t-\mathbf {R}\mathbf x_t\right\| ^2-\text{ E }\left[ \tfrac{1}{2}\left\| \mathbf y_t-\hat{\mathbf{y}}_t\right\| ^2\right] \right) + \tfrac{\eta ^2}{2}\\&\le \tfrac{1}{2}\left\| \mathbf {W}_{t-1}-\mathbf {R}\right\| _F^2 + \eta \left( \tfrac{1}{2}\left\| \mathbf y_t-\mathbf {R}\mathbf x_t\right\| ^2-\text{ E }\left[ \tfrac{1}{2}\left\| \mathbf y_t-\hat{\mathbf{y}}_t\right\| ^2\right] \right) + \tfrac{\eta ^2}{2}. \end{aligned}$$

The first equality uses the update in Step 8 of Algorithm 1. In the second equality we expand the square, and in the third we use \(\tfrac{1}{2}\left\| \mathbf y_t-\mathbf {R}\mathbf x_t\right\| ^2=1-{{\mathrm{tr}}}(\mathbf {R}\mathbf x_t\mathbf y_t^\top )\) and \(\text{ E }\left[ \tfrac{1}{2}\left\| \mathbf y_t-\hat{\mathbf{y}}_t\right\| ^2\right] =1-{{\mathrm{tr}}}(\mathbf {W}_t^m\mathbf x_t\mathbf y_t^\top )\), which is Eq. (7) above. The last inequality is the most interesting. It follows from the generalized Pythagorean theorem for Bregman divergences and the fact that the update from \(\mathbf {W}_{t-1}\) to \(\mathbf {W}_t^m\) is a Bregman projection onto the set \(\{\mathbf {W}:\Vert \mathbf {W}\mathbf x_t\Vert ^2 \le 1\}= \{\mathbf {W}:\Vert \mathbf {W}\mathbf x_t\Vert \le 1\}\). The crucial fact is that this is a closed convex set that contains all rotation matrices. See Herbster and Warmuth (2001) for an extended discussion of the application of Bregman projections for obtaining regret bounds. By rearranging we get the following inequality for each trial:

$$\begin{aligned} \text{ E }\left[ \tfrac{1}{2}\left\| \mathbf y_t-\hat{\mathbf{y}}_t\right\| ^2\right] - \tfrac{1}{2}\left\| \mathbf y_t-\mathbf {R}\mathbf x_t\right\| ^2\ \le \ \frac{\Vert \mathbf {W}_{t-1} -\mathbf {R}\Vert _F^2 - \Vert \mathbf {W}_t-\mathbf {R}\Vert _F^2}{2\eta } + \frac{\eta }{2}. \end{aligned}$$

By summing over all T trials we get

$$\begin{aligned} \sum _t \text{ E }\left[ \tfrac{1}{2}\left\| \mathbf y_t-\hat{\mathbf{y}}_t\right\| ^2\right] - \sum _t\tfrac{1}{2}\left\| \mathbf y_t-\mathbf {R}\mathbf x_t\right\| ^2\le & {} \frac{\Vert \mathbf {W}_0-\mathbf {R}\Vert _F^2 - \Vert \mathbf {W}_T-\mathbf {R}\Vert _F^2}{2\eta } +\frac{\eta T}{2}\\\le & {} \frac{n}{2\eta }+\frac{\eta T}{2}, \end{aligned}$$

since \(\Vert \mathbf {W}_0-\mathbf {R}\Vert _F^2 =\Vert \mathbf {R}\Vert _F^2={{\mathrm{tr}}}\left( \mathbf {R}\mathbf {R}^\top \right) ={{\mathrm{tr}}}(\mathbf {I})=n\) and \(\Vert \mathbf {W}_{T}-\mathbf {R}\Vert _F^2 \ge 0\). The RHS is minimized at \(\sqrt{nT}\) by choosing \(\eta =\sqrt{\frac{n}{T}}\). \(\square \)

As we shall see, the above upper bound of \(\sqrt{nT}\) on the regret of our randomized algorithm is rather weak in the noise-free case, i.e. when there is a rotation that has loss zero on the entire sequence. It is an open problem whether the upper bound on the regret can be strengthened to \(O(\sqrt{nL}+n)\) where L is the loss of the best rotation on the entire sequence of trials. An \(O(\sqrt{nL})\) regret bound was erroneously claimed in the conference paper of Hazan et al. (2010b).

3.3 Lower bounds on the regret

We now show a lower bound against any algorithm (including randomized ones).

Theorem 2

For any integer \(T \ge n\ge 2\), there is a fixed instance vector sequence on which any randomized online algorithm for learning rotations can be forced to have regret at least \(\sqrt{\frac{(T-1)(n-1)}{4}} =\varOmega ( \sqrt{nT}).\)

Proof

We first define the fixed instance sequence. Let \(\mathbf{e}_i\) denote the i-th standard basis vector, i.e. the vector with 1 in its i-th coordinate and 0 everywhere else. In trial \(t < T\), set \(\mathbf{x}_t = \mathbf{e}_{f(t)}\), where \(f(t) = (t \text { mod } n-1) + 1\) (i.e. we cycle through the coordinates \(1, 2, \ldots , n-1\)). The last instance is \(\mathbf {e}_n\).

We will now show that for any online algorithm for the rotations problem, there is a sequence of result vectors \(\mathbf y_t\) of length T for which this algorithm has regret at least \(\sqrt{\frac{(T-1)(n-1)}{4}} =\varOmega ( \sqrt{nT}).\) Recall that \(\mathbf {e}_{f(t)}\) is the instance at trial \(1\le t<T\). To achieve our lower bound on the regret we set \(\mathbf y_t\) equal to the instance at trial t or its reverse with equal probability, i.e. \(\mathbf{y}_t = \sigma _t \mathbf{e}_{f(t)}\), where \(\sigma _t \in \{-1, 1\}\) uniformly at random. For any coordinate \(i \in 1, 2, \ldots , n-1\), let \(X_i = \sum _{t:\ f(t) = i} \sigma _{t}\). For the final trial T, the instance \(\mathbf{x}_T\) is \(\mathbf{e}_{n}\), and we set the result vector to \(\mathbf{y}_T = \sigma _T \mathbf{e}_{n}\), where \(\sigma _T \in \{-1, 1\}\) is chosen in a certain way specified momentarily. First, note that

$$\begin{aligned} \mathbf {S}_T\ =\ \sum _{t=1}^T \mathbf{x}_t\mathbf{y}_t^\top \ =\ \text {diag}(X_1, X_2, \ldots , X_{n-1}, \sigma _T). \end{aligned}$$

We choose \(\sigma _T\) so that

$$\begin{aligned} \det (\mathbf {S}_T)\ =\ \sigma _T\prod _{i=1}^{n-1} X_i\ >\ 0. \end{aligned}$$

In other words, \(\sigma _T = \text {sgn}(\prod _{i=1}^{n-1} X_i)\).

By Lemma 2, the solution to the offline problem is the rotation matrix \(\mathbf {R}^\star = {\text {argmax}}_{\mathbf {R}\in \mathcal{SO}(n)} {{\mathrm{tr}}}(\mathbf {S}_T \mathbf {R})\), where

$$\begin{aligned} \mathbf {R}^\star = \text{ diag }(\text {sgn}(X_1), \text {sgn}(X_2), \ldots , \text {sgn}(X_{n-1}), \sigma _T), \end{aligned}$$

and the loss of this matrix is

$$\begin{aligned} \sum _{t=1}^T \tfrac{1}{2}\left\| \mathbf y_t-\mathbf {R}^{\star }\mathbf x_t\right\| ^2\ =\ T - {{\mathrm{tr}}}(\mathbf {S}_TR^\star )\ =\ T - \sum _{i=1}^{n-1} |X_i| - 1. \end{aligned}$$

Since each \(X_i\) is a sum of at least \(\lfloor \frac{T-1}{n-1}\rfloor \) Rademacher variables, Khintchine’s inequality (Haagerup 1982) implies that \(\text{ E }_{\sigma _t}[|X_i|] \ge \sqrt{\frac{1}{2}\lfloor \frac{T-1}{n-1}\rfloor }\) where the expectation is taken over the choice of the \(\sigma _t\)’s. Thus, the expected loss of the optimal rotation is bounded as follows:

$$\begin{aligned} \text{ E }_{\sigma _t}\left[ \sum _{t=1}^T \tfrac{1}{2}\left\| \mathbf y_t-\mathbf {R}^{\star }\mathbf x_t\right\| ^2\right]&= T - \text{ E }_{\sigma _t}[{{\mathrm{tr}}}(\mathbf {S}_T\mathbf {R}^\star )]\nonumber \\&\le T-1 - (n-1)\sqrt{\frac{1}{2}\left\lfloor \frac{T-1}{n-1}\right\rfloor }. \end{aligned}$$
(9)

We now show that for \(t<T\) the expected loss of the algorithm is one, where the expectation is with respect to \(\sigma _t\) as well as the internal randomization of the algorithm. Note that both types of randomizations are independent. If \(\text{ E }_{\mathrm {alg}} [\hat{\mathbf {y}}_t]\) denotes the expected prediction vector of the algorithm at trial t after receiving instance vector \(\mathbf {e}_{f(t)}\) and before receiving the result vector \(\mathbf y_t=\sigma _t\mathbf {e}_{f(t)}\), then

$$\begin{aligned} \text{ E }_{\sigma _t,\mathrm {alg}}\left[ \tfrac{1}{2}\left\| \mathbf y_t-\hat{\mathbf{y}}_t\right\| ^2\right] \ =\ 1 - \text{ E }_{\sigma _t,\mathrm {alg}}\left[ \mathbf{y}_t^\top \hat{\mathbf {y}}_t\right] \ =\ 1 - \text{ E }_{\sigma _t}[\sigma _t]\, \mathbf {e}_{f(t)}^\top \text{ E }_{\mathrm {alg}} \left[ \hat{\mathbf {y}}_t\right] \ =\ 1. \end{aligned}$$

In trial T, the algorithm might at best have an expected loss of 0. Thus, the expected loss of the algorithm is at least \(T - 1\), and hence by subtracting (9), its expected regret (with respect to both randomizations) is at least \((n-1)\sqrt{\frac{1}{2}\left\lfloor \frac{T-1}{n-1}\right\rfloor } \). Since for \(a\ge 1\), \(\lfloor a \rfloor \ge \max (a-1,1) \ge (a-1+1)/2=a/2\), the expected regret is lower bounded by

$$\begin{aligned} (n-1)\sqrt{\frac{1}{4}\left( \frac{T-1}{n-1}\right) } = \sqrt{\frac{(T-1)(n-1)}{4}}. \end{aligned}$$

This implies that for any algorithm there is a choice of the \(\sigma _t\)’s for which the algorithm has expected regret (with respect to its internal randomization only) at least \(\sqrt{\frac{(T-1)(n-1)}{4}}\), as required. Since \(T\ge n\ge 2\), this lower bound is \(\varOmega (\sqrt{nT})\). \(\square \)
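
For concreteness, the adversary's sequence of examples can be generated as in the following sketch (illustrative only; the helper name and the tie-breaking when some \(X_i=0\) are our own choices):

```python
import numpy as np

def adversary_examples(n, T, rng=None):
    """Generate the instance/result sequence used in the proof of Theorem 2."""
    rng = rng or np.random.default_rng()
    I = np.eye(n)
    xs, ys = [], []
    for t in range(1, T):                         # trials t = 1, ..., T-1
        i = t % (n - 1)                           # zero-based version of f(t)
        sigma = rng.choice([-1.0, 1.0])           # result is +- the instance
        xs.append(I[i])
        ys.append(sigma * I[i])
    X = np.array([sum(y[i] for x, y in zip(xs, ys) if x[i] == 1.0)
                  for i in range(n - 1)])         # X_i = sum of signs on coordinate i
    sigma_T = 1.0 if np.prod(X) >= 0 else -1.0    # sgn(prod X_i), ties broken as +1
    xs.append(I[n - 1])
    ys.append(sigma_T * I[n - 1])                 # last example on coordinate n
    return xs, ys
```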

As can be seen from the proof, the above lower bound is essentially \(\sqrt{\frac{nT}{2}}\) for large n and T, s.t. \(T\ge n\). Recall that the upper bound of our Algorithm 1 is \(\sqrt{nT}\).

We now generalize the above lower bound on the worst-case regret that depends on the number of trials T to a lower bound that depends on the loss L of the best rotation in hindsight. Applying Theorem 6 of Sect. 5 below with n orthogonal instances shows that any randomized on-line algorithm can be forced to incur loss n on sequences of examples for which there is a consistent rotation (i.e. \(L=0\)). Combined with the above lower bound, this immediately generalizes to the following lower bound for any \(L\ge 0\):

Corollary 1

For any \(L\ge 0\) and \(T \ge \lfloor L/2 \rfloor \), any randomized on-line algorithm can be forced to have regret \(\varOmega (\sqrt{nL}+n)\) for some sequence of T examples for which the loss of the best rotation is at most L.

Proof

If \(L< 2n\), then the lower bound of n for the noise free case already implies a lower bound of \(\varOmega (\sqrt{nL}+n)\). If \(L\ge 2n\), then we apply the lower bound of the above theorem for the first \(T_0=\lfloor L/2 \rfloor \) rounds. Note that \(T \ge T_0 \ge n\). We then achieve a regret lower bound of \(\varOmega (\sqrt{\lfloor L/2\rfloor n})=\varOmega (\sqrt{nL}+n)\) for the first \(T_0\) rounds. Let \(\mathbf {R}_0\) be the best rotation for the first \(T_0\) rounds. Since the per trial loss of any rotation is at most 2, the loss of the best rotation on the sequence of the \(T_0\) examples used in the construction of the above theorem is at most \(2T_0 \le L\).

For the remaining \(T - T_0\) rounds, we simply use arbitrary pairs of unit vectors \((\mathbf x_t, \mathbf y_t)\) that are consistent with \(\mathbf {R}_0\) (i.e. \(\mathbf y_t = \mathbf {R}_0 \mathbf x_t\) for \(T_0 < t \le T\)). Thus, the rotation \(\mathbf {R}_0\) incurs 0 additional loss, and hence its total loss is bounded by L; the algorithm, on the other hand, also incurs 0 additional loss at best. Thus we have an \(\varOmega (\sqrt{nL}+n)\) lower bound on the regret for the sequence of T examples in this construction, and the best rotation has loss at most L.

\(\square \)

Note that the constant factors needed for the \(\varOmega (.)\) notation in the proofs of the lower bounds are independent of the algorithm and the values of n and L.

4 Deterministic online algorithms

We begin by showing that any deterministic algorithm can be forced to have regret T in T trials. We then show how to construct a deterministic algorithm from any probabilistic one with at most twice the loss.

Theorem 3

For any deterministic algorithm, there is a sequence of T examples (which may be fixed before running the algorithm) such that the algorithm has regret at least T on it.

Proof

The adversary sets all instances \(\mathbf x_t\) to \(\mathbf {e}_1\). Since the algorithm is deterministic, the adversary can compute the prediction \(\hat{\mathbf {y}}_t\), and then set the result vector \(\mathbf y_t = -\hat{\mathbf {y}}_t\). The per trial loss is therefore

$$\begin{aligned} \tfrac{1}{2}\left\| \mathbf y_t-\hat{\mathbf{y}}_t\right\| ^2\ =\ \tfrac{1}{2}\Vert 2\hat{\mathbf{y}}_t\Vert ^2\ =\ 2, \end{aligned}$$

amounting to a loss of 2T over all T trials.

The loss of the optimum rotation on the produced sequence of examples is

$$\begin{aligned} \min _{\mathbf {R}\in \mathcal{SO}(n)} \sum _{t=1}^T \tfrac{1}{2}\left\| \mathbf y_t-\mathbf {R}\mathbf x_t\right\| ^2\ =\ T - \max _{\mathbf {R}\in \mathcal{SO}(n)}{{\mathrm{tr}}}(\mathbf {S}_T \mathbf {R}), \end{aligned}$$

where \(\mathbf {S}_T = \sum _{t=1}^T \mathbf {e}_1\mathbf{y}_t^\top \). By Lemma 2, \(\max _{\mathbf {R}\in \mathcal{SO}(n)}{{\mathrm{tr}}}(\mathbf {S}_T\mathbf {R}) \ge 0\). Thus, the loss of the optimum rotation is at most T. Hence, the algorithm has regret at least \(2T - T = T\). \(\square \)

A simple deterministic algorithm would be to predict with an optimal rotation matrix for the data seen so far using the algorithm of Lemma 2. However, this Follow the Leader or Incremental Off-Line type algorithm, along with the elegant deterministic algorithms based on Lie group mathematics (Arora 2009), can all be forced to have linear worst-case regret, whereas probabilistic algorithms can achieve worst-case regret at most \(\sqrt{nT}\). We don’t know the optimal worst-case regret achievable by deterministic algorithms. In some cases the worst-case regret of a deterministic algorithm is at least F times the regret of the best probabilistic algorithm, where F grows with the size of the problem (see e.g. Warmuth et al. 2011). We show that for our rotation problem, the factor is at most 2.

Theorem 4

Any randomized algorithm for learning rotations can be converted into a deterministic algorithm with at most twice the loss.

Proof

We construct a deterministic algorithm from the randomized one that in each trial has at most twice the loss. Let \(\mathbf z_t\) be the expected prediction \(\text{ E }[\hat{\mathbf {y}}_t]\) of the randomized algorithm. If \(\mathbf z_t=\mathbf {0}\), we let the deterministic algorithm predict with \(\mathbf {e}_1\). In this case the randomized algorithm has expected loss \(1-\mathbf y_t^\top \mathbf z_t=1\) and the deterministic algorithm has loss at most 2.

If \(\mathbf z_t\ne \mathbf {0}\), then the deterministic algorithm predicts with \(\hat{\mathbf {y}}_t=\frac{\mathbf z_t}{\Vert \mathbf z_t\Vert }\). We need to show that

$$\begin{aligned} 1-\mathbf y_t^\top \hat{\mathbf {y}}_t&\le 2(1- \mathbf y_t^\top \mathbf z_t) \nonumber \\ \Leftrightarrow 0&\le 1-2\mathbf y_t^\top \mathbf z_t + \mathbf y_t^\top \frac{\mathbf z_t}{\Vert \mathbf z_t\Vert }. \end{aligned}$$
(10)

Next observe that \(\Vert \mathbf z_t\Vert \) lies in the range \([|\mathbf y_t^\top \mathbf z_t|\,,1]\): the upper bound holds because \(\mathbf z_t\) lies in the unit ball and the lower bound follows from the fact that \(|\mathbf y_t^\top \mathbf z_t|\) is the length of the projection of \(\mathbf z_t\) onto unit vector \(\mathbf y_t\). Now if \(\mathbf y_t^\top \mathbf z_t \ge 0\), then we use the upper bound on the range to show that \(1-2\mathbf y_t^\top \mathbf z_t + \mathbf y_t^\top \frac{\mathbf z_t}{\Vert \mathbf z_t\Vert } \ge 1-2\mathbf y_t^\top \mathbf z_t + \mathbf y_t^\top \mathbf z_t = 1-\mathbf y_t^\top \mathbf z_t \ge 0\). If \(\mathbf y_t^\top \mathbf z_t \le 0\), then we use the lower bound on the range to show that \(1-2\mathbf y_t^\top \mathbf z_t + \mathbf y_t^\top \frac{\mathbf z_t}{\Vert \mathbf z_t\Vert } \ge 1- 2\mathbf y_t^\top \mathbf z_t -1 \ge 0\). \(\square \)
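
A minimal sketch of this conversion, mapping the expected prediction of the randomized algorithm to the deterministic prediction used in the proof (the function name is ours):

```python
import numpy as np

def derandomized_prediction(expected_prediction):
    """Predict with the unit vector in the direction of E[y_hat], or e_1 if it is zero."""
    z = np.asarray(expected_prediction, dtype=float)
    norm = np.linalg.norm(z)
    if norm == 0.0:
        e1 = np.zeros_like(z)
        e1[0] = 1.0
        return e1
    return z / norm
```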

5 Learning when there is a consistent rotation

In Sect. 3 we gave a randomized algorithm with at most \(\sqrt{nT}\) worst-case regret (when \(T\ge n\)) and showed that this regret bound is tight within a constant factor. However regret bounds that are expressed as a function of the number of trials T only guard against the high-noise case and are weak when there is a comparator with small loss. Ideally we want bounds that grow with the loss of the best comparator chosen in hindsight as was done in the original online learning papers (see e.g. Cesa-Bianchi et al. 1996a; Kivinen and Warmuth 1997). In an attempt to understand what regret bounds are possible when there is a comparator with little noise, we carefully analyze the case when there is a rotation \(\mathbf {R}\) consistent with all T examples, i.e. \(\mathbf {R}\mathbf x_t = \mathbf y_t\) for all \(t = 1, 2, \ldots , T\). Even in this case, the online learner incurs loss until a unique consistent rotation is determined and the question is which algorithm has the smallest regret against such sequences of examples.

There is a randomized and a deterministic variant of the algorithm, but they only disagree in the first trial. The deterministic variant predicts with \(\hat{\mathbf {y}}_1=\mathbf {e}_1\) and the randomized variant with \(\hat{\mathbf {y}}_1=\pm \mathbf {e}_1\) with probability \(\tfrac{1}{2}\) each [i.e. \(\text{ E }(\hat{\mathbf {y}}_1)=\mathbf {0}\)]. In later trials the algorithm (both variants) predicts deterministically. At trial \(t\ge 2\), it first decomposes the instance \(\mathbf x_t\) into the part \(\mathbf x^{\Vert }_t\) that lies in the span of the previous instances and the part \(\mathbf x^{\bot }_t\) that is perpendicular to that span. By definition, \(\mathbf x^{\Vert }_t\) is a linear combination of the past instances \(\mathbf x_q\) for \(q=1,\ldots ,t-1\). Replacing the instance vectors \(\mathbf x_q\) in the linear combination by the result vectors \(\mathbf y_q\), we arrive at \(\mathbf y^{\Vert }_t\) (see Step 5). The algorithm essentially predicts with the direction vector of \(\mathbf y^{\Vert }_t\) until and including the first trial S in which the matrix \([\mathbf x_1,\ldots ,\mathbf x_S]\) has rank \(n-1\). From trial \(S+1\) to T, the algorithm predicts with the unique consistent rotation.

Algorithm 2
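
A minimal sketch of the deterministic variant of Algorithm 2 (illustrative only: all names are ours, and recovering the consistent rotation by solving Wahba's problem on the examples seen so far is one possible way to realize the step that the paper bases on Lemma 5):

```python
import numpy as np

def algorithm2_predictions(instances, results):
    """Sketch of the deterministic variant, assuming a rotation consistent with all examples."""
    n = len(instances[0])
    Xs, Ys, preds = [], [], []
    R = None                                      # the unique consistent rotation, once known
    for t, (x, y) in enumerate(zip(instances, results)):
        if R is not None:
            y_hat = R @ x                         # rank n-1 reached: no more loss
        elif t == 0:
            y_hat = np.eye(n)[0]                  # deterministic first prediction e_1
        else:
            A = np.array(Xs).T                    # columns are the past instances
            c, *_ = np.linalg.lstsq(A, x, rcond=None)
            y_par = np.array(Ys).T @ c            # same combination of the past results
            norm = np.linalg.norm(y_par)
            y_hat = y_par / norm if norm > 1e-12 else Ys[0]
        preds.append(y_hat)
        Xs.append(x)
        Ys.append(y)
        if R is None and np.linalg.matrix_rank(np.array(Xs).T) >= n - 1:
            S = sum(np.outer(xq, yq) for xq, yq in zip(Xs, Ys))
            U, _, Vt = np.linalg.svd(S)           # recover the consistent rotation via Lemma 2
            W = np.eye(n)
            W[-1, -1] = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
            R = Vt.T @ W @ U.T
    return preds
```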

Theorem 5

On any sequence of examples \((\mathbf x_1,\mathbf y_1),\ldots ,(\mathbf x_T,\mathbf y_T)\) that is consistent with a rotation, the randomized version of Algorithm 2 has expected loss \(\sum _{t=1}^S (1-\Vert \mathbf x^{\Vert }_t\Vert )\). The deterministic version has loss at most \(\left( \sum _{t=1}^S \left( 1-\Vert \mathbf x^{\Vert }_t\Vert \right) \right) + 1\).

Proof

For proving the upper bound for the randomized algorithm, first note that in trial 1 the expected loss is \(1-\mathbf y_1^\top \text{ E }(\hat{\mathbf {y}}_1)=1-\mathbf y_1^\top \mathbf {0}=1=1-\Vert \mathbf x_1^\Vert \Vert \). In trials \(2\le t\le S\), the deterministic choice \(\hat{\mathbf {y}}_t=\frac{\mathbf y^{\Vert }_t}{\Vert \mathbf y^{\Vert }_t\Vert }\) ensures that the loss is

$$\begin{aligned} \tfrac{1}{2}\left\| \mathbf y_t-\hat{\mathbf{y}}_t\right\| ^2=1-\mathbf y_t^\top \hat{\mathbf {y}}_t=1- \Vert \mathbf y^{\Vert }_t\Vert = 1- \Vert \mathbf x^{\Vert }_t\Vert . \end{aligned}$$
(11)

In the special case when \(\mathbf x^{\Vert }_t=\mathbf {0}\) then \(\mathbf x_t^\top \mathbf x_1 = 0\), and so \(\mathbf y_t^\top \mathbf y_1 = 0\) since we have a consistent rotation and rotations preserve angles. Thus the deterministic prediction \(\hat{\mathbf {y}}_t=\mathbf y_1\) has loss \(1=1-\Vert \mathbf x^{\Vert }_t\Vert \) as well. If \(S<T\), then since \(\text {rank}([\mathbf x_1, \mathbf x_2, \ldots , \mathbf x_S]) = n-1\), the vectors \(\mathbf x_1^\bot , \mathbf x_2^\bot , \ldots , \mathbf x_S^\bot \) are mutually orthogonal, with exactly \(n - 1\) non-zero vectors. This gives two sequences of mutually orthogonal vectors \(\mathbf x_1^\bot , \mathbf x_2^\bot , \ldots , \mathbf x_S^\bot \) and \(\mathbf y_1^\bot , \mathbf y_2^\bot , \ldots , \mathbf y_S^\bot \) such that \(\Vert \mathbf x_t^\bot \Vert = \Vert \mathbf y_t^\bot \Vert \) for \(t \le S\), and exactly \(n-1\) non-zero vectors in each sequence. By Lemma 5 below we conclude that there is a unique rotation matrix \(\mathbf {R}\) such that \(\mathbf y_t^\bot = \mathbf {R}\mathbf x_t^\bot \) for \(t \le S\). Since the result vectors \(\mathbf y_t\) are linear combinations of the orthogonal components \(\mathbf y_q^\bot \) for \(q\le t\), we conclude that \(\mathbf {R}\) is the unique rotation consistent with the first S examples. Hence we find this rotation and incur no more loss in any trial \(t > S\). Overall, the expected loss of the randomized algorithm is

$$\begin{aligned} \sum _{t=1}^S (1-\Vert \mathbf x_t^{\Vert }\Vert ). \end{aligned}$$

The deterministic algorithm may have loss at most 2 in the first trial. For the rest of the trials the deterministic algorithm is the same as the randomized one and has the same bound on the loss, leading to an upper bound of \(\left( \sum _{t=1}^S \left( 1-\Vert \mathbf x_t^{\Vert }\Vert \right) \right) + 1.\) \(\square \)

We now give a simple lemma which states that a rotation is essentially determined by its action on \(n-1\) mutually orthogonal vectors. The proof is straightforward, but we include it for the sake of completeness.

Lemma 5

Let \(\mathbf x_1^{\bot }, \mathbf x_2^{\bot }, \ldots , \mathbf x_T^{\bot }\) and \(\mathbf y_1^{\bot }, \mathbf y_2^{\bot }, \ldots , \mathbf y_T^{\bot }\) be two orthogonal sequences of vectors in \(\mathbb {R}^n\) such that \(\Vert \mathbf x_t^{\bot }\Vert =\Vert \mathbf y_t^{\bot }\Vert \) and both sequences have at most \(r\le n-1\) non-zero vectors. Then there is a rotation matrix \(\mathbf {R}\) such that \(\mathbf y_t^{\bot } = \mathbf {R}\mathbf x_t^{\bot }\). If \(r=n-1\), then \(\mathbf {R}\) is unique.

Proof

Pairs \((\mathbf x_t^{\bot },\mathbf y_t^{\bot })\) of length zero can be omitted and without loss of generality we can rescale the remaining vectors so that their length is one, since rotations are linear transformations and preserve the scaling. Also we can always add more orthonormal vectors to both sequences and therefore it suffices to find a rotation when \(r=n-1=T\). Let \(\mathbf x_n^\bot \) and \(\mathbf y_n^\bot \) be unit vectors orthogonal to the span of \(\{\mathbf x_1^\bot , \mathbf x_2^\bot , \ldots , \mathbf x_{n-1}^\bot \}\) and \(\{\mathbf y_1^\bot , \mathbf y_2^\bot , \ldots , \mathbf y_{n-1}^\bot \}\), respectively. Now, since rotations preserve angles and lengths, if there is a rotation \(\mathbf {R}\) such that \(\mathbf {R}\mathbf x_t^\bot = \mathbf y_t^\bot \) for \(t = 1, 2, \ldots , n-1\), then we must have \(\mathbf {R}\mathbf x_n^\bot = \pm \mathbf y_n^\bot \). We now determine the right sign of \(\mathbf y_n^\bot \) as follows. Let \(\mathbf {X}= [\mathbf x_1^\bot , \mathbf x_2^\bot , \ldots , \mathbf x_{n-1}^\bot , \mathbf x_n^\bot ]\), \(\mathbf {Y}^+ = [\mathbf y_1^\bot , \mathbf y_2^\bot , \ldots , \mathbf y_{n-1}^\bot , \mathbf y_n^\bot ]\), and \(\mathbf {Y}^- = [\mathbf y_1^\bot , \mathbf y_2^\bot , \ldots , \mathbf y_{n-1}^\bot , -\mathbf y_n^\bot ]\). Note that \(\mathbf {X}\), \(\mathbf {Y}^+\) and \(\mathbf {Y}^-\) are all orthogonal matrices. The desired rotation matrix \(\mathbf {R}\) is a solution to one of the following two linear systems \(\mathbf {R}\mathbf {X}= \mathbf {Y}^+\) or \(\mathbf {R}\mathbf {X}= \mathbf {Y}^-\), i.e. \(\mathbf {R}= \mathbf {Y}^+\mathbf {X}^\top \) or \(\mathbf {R}= \mathbf {Y}^-\mathbf {X}^\top \), since \(\mathbf {X}^{-1} = \mathbf {X}^\top \). Note that both \(\mathbf {Y}^+\mathbf {X}^\top \) and \(\mathbf {Y}^-\mathbf {X}^\top \) are orthogonal matrices since they are products of orthogonal matrices, and the determinant of one is the negative of the other. The desired rotation matrix is then the solution with determinant 1. \(\square \)

We now prove a lower bound which shows that Algorithm 2 given above has the strong property of being instance-optimal: viz., for any sequence of input vectors and for any algorithm, there is a sequence of result vectors on which the loss achieved by the algorithm is at least that of Algorithm 2:

Theorem 6

For any online algorithm and for every instance sequence \(\mathbf x_1,\ldots ,\mathbf x_T\), there is an adversary strategy for choosing the result sequence \(\mathbf y_1,\ldots ,\mathbf y_T\) for which there is a consistent rotation, and the algorithm incurs expected loss at least \(\sum _{t=1}^S (1-\Vert \mathbf x^{\Vert }_t\Vert )\). Furthermore, if the algorithm is deterministic, then the adversary can force a loss of at least \(\sum _{t=1}^S (1-\Vert \mathbf x^{\Vert }_t\Vert ) + 1\).

Proof

Let us begin with the lower bound for randomized algorithms. At trials \(t\le S\), the adversary finds some vector \(\mathbf y^{\bot }_t\) of length \(\Vert \mathbf x^\bot _t\Vert =\sqrt{1-\Vert \mathbf x^{\Vert }_t\Vert ^2}\) that is perpendicular to the span of \(\{\mathbf y_1,\ldots ,\mathbf y_{t-1}\}\) and chooses the result vector \(\mathbf y_t=\mathbf y^{\Vert }_t\pm \mathbf y^{\bot }_t\) depending on which of the two examples \((\mathbf x_t,\mathbf y^{\Vert }_t\pm \mathbf y^{\bot }_t)\) has larger expected loss. The sum of the expected losses of these two examples is

$$\begin{aligned} \text{ E }\left[ \tfrac{1}{2}\Vert \mathbf y^{\Vert }_t+\mathbf y^{\bot }_t-\hat{\mathbf {y}}_t\Vert ^2\right] +\text{ E }\left[ \tfrac{1}{2}\Vert \mathbf y^{\Vert }_t-\mathbf y^{\bot }_t-\hat{\mathbf {y}}_t\Vert ^2\right]&=2-\left( \mathbf y^{\Vert }_t+\mathbf y^{\bot }_t+\mathbf y^{\Vert }_t-\mathbf y^{\bot }_t\right) ^\top \text{ E }\left[ \hat{\mathbf {y}}_t\right] \\&=2-2\left( \mathbf y_t^\Vert \right) ^\top \text{ E }\left[ \hat{\mathbf {y}}_t\right] \\&\ge 2-2\Vert \mathbf y^{\Vert }_t\Vert = 2-2\Vert \mathbf x^{\Vert }_t\Vert ,\end{aligned}$$

where the last inequality holds because \(\Vert \text{ E }[\hat{\mathbf {y}}_t]\Vert \le \text{ E }[\Vert \hat{\mathbf {y}}_t\Vert ] \le 1\), and therefore \((\mathbf y_t^{\Vert })^\top \text{ E }[\hat{\mathbf {y}}_t] \le \Vert \mathbf y_t^{\Vert }\Vert \). It follows that one of the result vectors \(\mathbf y_t=\mathbf y^{\Vert }_t\pm \mathbf y^{\bot }_t\) incurs expected loss at least \(1-\Vert \mathbf x^{\Vert }_t\Vert \) and the adversary makes that choice. Overall, the loss for any algorithm is at least

$$\begin{aligned} \sum _{t=1}^S \left( 1-\Vert \mathbf x_t^{\Vert }\Vert \right) .\end{aligned}$$

For deterministic algorithms, the adversary can further force a loss of 2 on the first trial by choosing \(\mathbf y_1 = -\hat{\mathbf {y}}_1\), leading to a lower bound of \(1 + \sum _{t=1}^S (1-\Vert \mathbf x_t^{\Vert }\Vert )\).

We next show that for the examples \((\mathbf x_t,\mathbf y_t)\) produced by the adversary, there always is a consistent rotation \(\mathbf {R}\) such that \(\mathbf y_t=\mathbf {R}\mathbf x_t\). Note that in the construction the vectors \(\mathbf y_1, \mathbf y_2, \ldots , \mathbf y_t\) lie in the span of \(\mathbf y_1^\bot , \mathbf y_2^\bot , \ldots , \mathbf y_t^\bot \). So it suffices to show that there is a rotation matrix \(\mathbf {R}\) such that \(\mathbf y_q^\bot = \mathbf {R}\mathbf x_q^\bot \) for all \(q \le t\). To show this we use Lemma 5. First, we note that since the rank of \([\mathbf x_1, \mathbf x_2, \ldots , \mathbf x_t]\) is at most \(n-1\), and since \(\mathbf x_1^\bot , \mathbf x_2^\bot , \ldots , \mathbf x_t^\bot \) are mutually orthogonal and span the same subspace, there can be at most \(n-1\) non-zero \(\mathbf x_q^\bot \) vectors. Further, by construction \(\Vert \mathbf y_q^\bot \Vert = \Vert \mathbf x_q^\bot \Vert \) for all \(q \le t\). Thus, by Lemma 5, the desired rotation exists and we are done. \(\square \)

The following algorithm seems more natural than Algorithm 2 above: Given \(\mathbf x_t\), decompose it into \(\mathbf x_t^{\Vert }\) and \(\mathbf x_t^{\bot }\) as before. Then predict with a rotation that is consistent with the previous \(t-1\) examples (i.e. takes \(\mathbf x_t^{\Vert }\) to \(\mathbf y_t^{\Vert }\)) and rotates \(\mathbf x_t^\bot \) to an arbitrary vector \(\pm \mathbf y_t^\bot \) of the same length that is orthogonal to the previous result vectors and whose sign is chosen uniformly. Thus \(\hat{\mathbf {y}}_t=\mathbf y_t^{\Vert }\pm \mathbf y_t^\bot \) and the expected loss of this algorithm is

$$\begin{aligned}1-\mathbf y_t^\top \text{ E }\left[ \hat{\mathbf {y}}_t\right] = 1-\left( \mathbf y_t^{\Vert }+\mathbf y_t^\perp \right) ^\top \mathbf y_t^{\Vert } = 1-\Vert \mathbf y_t^{\Vert }\Vert ^2 =1-\Vert \mathbf x_t^{\Vert }\Vert ^2,\end{aligned}$$

since \(\text{ E }[\hat{\mathbf {y}}_t]=\mathbf y_t^{\Vert }\).

However the above lower bound shows that when \(\mathbf y^{\Vert }_t\ne \mathbf {0}\), then only the deterministic prediction \(\hat{\mathbf {y}}_t=\frac{\mathbf y^{\Vert }_t}{\Vert \mathbf y^{\Vert }_t\Vert }\) has the optimal loss \(1-\Vert \mathbf x_t^{\Vert }\Vert \) [i.e. it makes the equalities (11) tight]. This is also in contrast to Algorithm 1, which often employs non-optimal random predictions in trials when \(\mathbf y^{\Vert }_t\ne \mathbf {0}\). It seems that less randomization should be used when the loss of the best offline rotation is small because a deterministic algorithm is essentially minimax optimal in the noise-free case.

The deterministic variant of Algorithm 2 also does not predict with a consistent rotation. It is easy to show that any deterministic algorithm that predicts with a consistent rotation can be forced to have loss \(2(n-1)\) instead of the optimum n: when such an algorithm receives an instance vector that is orthogonal to all the previous instance vectors, the algorithm deterministically predicts with a unit vector \(\hat{\mathbf {y}}_t\) that is orthogonal to all the previous result vectors. Since the adversary knows \(\hat{\mathbf {y}}_t\), it can simply choose the result vector \(\mathbf y_t\) as \(-\hat{\mathbf {y}}_t\). This forces the algorithm to incur loss 2, and this can be repeated \(n-1\) times, at which point the optimal rotation is completely determined.

6 Conclusions

We have presented a randomized online algorithm for learning rotations with a regret bound of \(\sqrt{nT}\) and proved a lower bound of \(\varOmega (\sqrt{nT})\) for \(T\ge n\). We also proved a lower bound of \(\varOmega (\sqrt{nL}+n)\) for the regret of any algorithm on sequences on which the best rotation has loss at most L. Recently, an upper bound was shown that matches this lower bound within a constant factor (Nie 2015). However the algorithm that achieves this upper bound on the regret is not the GD algorithm of this paper based on lazy projection but the GD algorithm that projects onto the convex hull of orthogonal matrices at the end of each trial (Hazan et al. 2010a). The latter algorithm requires \(O(n^3)\) time per trial (Hazan et al. 2010a). It is open whether the \(O(n^2)\) lazy projection version of GD has the same \(O(\sqrt{nL}+n)\) regret bound.

In general, we don’t know the best way to maintain uncertainty over rotation matrices. The convex hull of orthogonal matrices seems to be a good candidate (as used in Hazan et al. 2010a). For the sake of efficiency, we used an even larger hypothesis space in this paper: we only require in Algorithm 1 that \(\Vert \mathbf {W}_t^m\mathbf x_t\Vert \le 1\) at trial t. The algorithm regularizes with the squared Frobenius norm and uses Bregman projections with respect to the inequality constraint \(\Vert \mathbf {W}_t^m\mathbf x_t\Vert \le 1\). However such projections “forget” information about past examples and the squared Frobenius norm does not take the group structure of \(\mathcal{SO}(n)\) into account.

One way to resolve some of these questions is to develop the minimax optimal algorithm for rotations and the linear loss discussed in this paper. This was recently posted as an open problem in Kotłowski and Warmuth (2011). In the special case of \(n=2\), the minimax regret for randomized algorithms and T trials was determined to be \(\sqrt{T}\), whereas for our randomized algorithm we were only able to prove the larger regret bound of \(\sqrt{2T}\). We have shown in the previous section that in the noise-free case, the minimax regret is \(n-1\) for randomized algorithms and \(2(n-1)\) for deterministic ones. Curiously enough, the minimax algorithm for \(n=2\) and T trials (Kotłowski and Warmuth 2011) makes heavy use of randomness, whereas in the noise-free case the optimal randomized algorithm only requires randomness in trials when \(\mathbf x_t^{\Vert }=\mathbf {0}\).

Another direction is to make the problem harder and study learning rotations when the loss is non-linear. The question is whether for non-linear loss functions the GD algorithm that projects onto the convex hull of orthogonal matrices still achieves a worst-case regret that is within a constant factor of optimal.