## Abstract

We describe online algorithms for learning a rotation from pairs of unit vectors in \(\mathbb {R}^n\). We show that the expected regret of our online algorithm compared to the best fixed rotation chosen offline over *T* iterations is \(\sqrt{nT}\). We also give a lower bound that proves that this expected regret bound is optimal within a constant factor. This resolves an open problem posed in COLT 2008. Our online algorithm for choosing a rotation matrix is essentially an incremental gradient descent algorithm over the set of all matrices, with specially tailored projections. We also show that any deterministic algorithm for learning rotations has \(\varOmega (T)\) regret in the worst case.

## Keywords

Rotations · Online learning · Regret bounds · Bregman projection · Minimax

## 1 Introduction

Rotations are fundamental objects in robotics and vision, and the problem of learning rotations, or finding the underlying rotation from a given set of examples, has numerous applications [see Arora (2009) for a summary of application areas, including computer vision, face recognition, robotics, crystallography, and physics]. Besides their practical importance, rotations have been shown to be powerful enough to capture seemingly more general mappings. Rotations can represent arbitrary Euclidean transformations via a conformal embedding by adding two special dimensions (Wareham et al. 2005). Also Doran et al. (1993) showed that the rotation group provides a universal representation for all Lie groups.

In the batch setting, the problem of learning a rotation was originally posed as the problem of estimating the attitude of satellites by Wahba (1966). The related problem of learning orthogonal, rather than rotation, matrices is known as the “orthogonal Procrustes problem” (see Schonemann 1966). Algorithms for optimizing a static cost function over the set of unitary matrices were proposed by Abrudan et al. (2008a, b) using descent over Riemannian manifolds.

The question of whether there are online algorithms for this problem was explicitly posed as an open problem in COLT 2008 (Smith and Warmuth 2008). An online algorithm for learning rotations was given by Arora (2009). This algorithm exploits the Lie group/Lie algebra relationship between rotation matrices and skew symmetric matrices respectively, and the matrix exponential and matrix logarithm that map between these matrix classes. However, this algorithm deterministically predicts with a single rotation matrix in each trial. In this paper, we prove that any such deterministic algorithm can be forced to have regret at least \(\varOmega (T)\), where *T* is the number of trials. In contrast, we give an algorithm that produces a random rotation in each trial and has expected regret \(\sqrt{nT}\), where *n* is the dimension of the matrices.

### 1.1 Technical contributions

The main technique used in this paper is a new variant of online gradient descent with Euclidean projections (see e.g. Herbster and Warmuth 2001; Zinkevich 2003) that uses what we call “lazy projections”. A straightforward implementation of the original algorithm requires \(O(n^3)\) time per iteration, because in each round we need to perform a Euclidean projection of the parameter matrix onto the convex hull of orthogonal matrices, and this projection requires the computation of a singular value decomposition. The crucial new idea is to project the parameter matrix onto a convex set determined by the current instance that contains the convex hull of all orthogonal matrices as a subset. The projection onto this larger set can be done easily in \(O(n^2)\) time and needs to be performed only when the current parameter matrix is outside of the set. Surprisingly, our new algorithm based on this idea of “lazy projections” has the same optimal regret bound but requires only \(O(n^2)\) time per iteration.

The loss function for learning rotations can be expressed as a linear function. Such linear losses are special because they are the least convex losses. The main case where such linear losses have been investigated is in connection with the Hedge and the Matrix Hedge algorithm (Freund and Schapire 1997; Warmuth and Kuzmin 2011). However for the latter algorithms the parameter space is one-norm or trace norm bounded, respectively. In contrast, the implicit parameter space of the main algorithm of this paper is essentially infinity norm bounded, i.e. the convex hull of orthogonal matrices consists of all square matrices with singular values at most one.

### 1.2 Outline of paper

We begin with some preliminaries in the next section, the precise online learning problem for rotations, and basic properties of rotations. We also show how to solve the corresponding off-line problem exactly. We then give in Sect. 3 our main probabilistic algorithm and prove a \(\sqrt{nT}\) regret bound for it. This bound cannot be improved by more than a constant factor, since we can prove a lower bound of essentially \(\sqrt{\frac{nT}{2}}\) (for the case when \(T\ge n\)).

For the sake of completeness we also consider deterministic algorithms in Sect. 4. In particular we show that any deterministic algorithm can be forced to have regret at least *T* in general. In Sect. 5 we then present the optimal randomized and deterministic algorithms for the special case when there is a rotation that is consistent with all examples. A number of open problems are discussed in the final Sect. 6. In “Appendix 1” we prove a lemma that characterizes the solution to the batch optimization problem for learning rotations and in “Appendix 2” we show that the convex hull of orthogonal matrices consists of all matrices with maximum singular value one.

### 1.3 Historical note

A preliminary version of this paper appeared in COLT 2010 (Hazan et al. 2010b). Unfortunately the algorithm we presented in the conference (based on the Follow the Perturbed Leader paradigm) was flawed: the true regret bound it obtains is \(O(n\sqrt{T})\) as opposed to the claimed bound of \(O(\sqrt{nT})\). After noticing this we published a corrigendum [posted on the conference website, Hazan et al. (2010a)] with a completely different technique based on online gradient descent that obtained the optimal regret \(O(\sqrt{nT})\). The algorithm presented in this paper is similar, but much more efficient, taking \(O(n^2)\) time per iteration rather than \(O(n^3)\).

## 2 Preliminaries and problem statement

### 2.1 Notation

In this paper, all vectors lie in \(\mathbb {R}^n\) and all matrices in \(\mathbb {R}^{n \times n}\). We use \(\det (\mathbf {M})\) to denote the determinant of matrix \(\mathbf {M}\). An *orthogonal matrix* is a matrix \(\mathbf {R}\) whose columns and rows are orthogonal unit vectors, i.e. \(\mathbf {R}^\top \mathbf {R}= \mathbf {R}\mathbf {R}^\top = \mathbf {I}\), where \(\mathbf {I}\) is the identity matrix. We let \(\mathcal{O}(n)\) denote the set of all orthogonal matrices of dimension \(n\times n\) and \(\mathcal{SO}(n)\) denote the special orthogonal group of *rotation matrices*, which are all orthogonal matrices of determinant one. Since for \(n = 1\) there is exactly one rotation matrix (i.e. \(\mathcal{SO}(1)=\{1\}\)), the problem becomes trivial, so throughout this paper we assume that \(n > 1\).

For any vector \(\mathbf{x}, \Vert \mathbf{x}\Vert \) denotes the \(\ell _2\) norm and for any matrix \(\mathbf {A}, \Vert \mathbf {A}\Vert _F=\sqrt{{{\mathrm{tr}}}(\mathbf {A}\mathbf {A}^\top )}\) is the Frobenius norm. All regret bounds of this paper immediately carry over to the complex domain: replace orthogonal by unitary matrices and rotation matrices by unitary matrices of determinant one.

### 2.2 Online learning of rotations problem

1. The online learner is given a unit instance vector \(\mathbf x_t\) (i.e. \(\Vert \mathbf x_t\Vert = 1\)).

2. The learner is then required to predict (deterministically or randomly) with a unit vector \(\hat{\mathbf {y}}_t\).

3. Finally, the algorithm obtains the “true” result, which is also a unit vector \(\mathbf y_t\).

4. The loss to the learner then is half the squared norm of the difference between her predicted vector and the “true” rotated vector \(\mathbf y_t\):
   $$\begin{aligned} \tfrac{1}{2}\Vert \hat{\mathbf {y}}_t - \mathbf y_t \Vert ^2 \ =\ 1 - \mathbf{y}_t^\top \hat{\mathbf {y}}_t \end{aligned}$$
   (1)

5. If \(\hat{\mathbf {y}}_t\) is chosen probabilistically, then we define the expected loss as
   $$\begin{aligned} \text{ E }\left[ \tfrac{1}{2}\left\| \mathbf y_t-\hat{\mathbf{y}}_t\right\| ^2\right] \ =\ \tfrac{1}{2}\text{ E }\left[ \Vert \hat{\mathbf {y}}_t - \mathbf{y}_t \Vert ^2\right] \ =\ 1 - \mathbf{y}_t^\top \text{ E }[\hat{\mathbf {y}}_t]. \end{aligned}$$
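The identity in (1) follows by expanding the square, since both vectors have unit norm: \(\tfrac{1}{2}\Vert \hat{\mathbf {y}}-\mathbf y\Vert ^2 = \tfrac{1}{2}(\Vert \hat{\mathbf {y}}\Vert ^2 + \Vert \mathbf y\Vert ^2) - \mathbf y^\top \hat{\mathbf {y}} = 1 - \mathbf y^\top \hat{\mathbf {y}}\). A minimal numerical check (the helper name is ours):

```python
import numpy as np

def half_squared_loss(y_hat, y):
    """Loss (1): half the squared distance between the predicted and true unit vectors.
    For unit vectors this equals 1 - y^T y_hat (expand the square, ||y|| = ||y_hat|| = 1)."""
    return 0.5 * np.linalg.norm(y_hat - y) ** 2
```

For random unit vectors, `half_squared_loss(y_hat, y)` agrees with `1 - y @ y_hat` to machine precision.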

The goal of the learner is to minimize her regret: her total loss on the *T* examples against the best fixed rotation \(\mathbf {R}\) chosen in hindsight:

When \(n>1\), then any unit vector can be rotated into any other unit vector. Namely one can always produce an explicit rotation matrix \(\mathbf {R}_t\) in Step 2 that rotates \(\mathbf x_t\) to \(\hat{\mathbf {y}}_t\) (that is, \(\hat{\mathbf {y}}_t\) in the definition of regret (2) is replaced by \(\mathbf {R}_t\mathbf x_t\)). Such a rotation matrix can be computed in \(O(n^2)\) time, as the following lemma shows.

### **Lemma 1**

Let \(\mathbf x\) and \(\hat{\mathbf {y}}\) be two unit vectors. Then we can find an explicit rotation matrix \(\mathbf {R}\) that rotates \(\mathbf x\) onto \(\hat{\mathbf {y}}\), i.e. \(\mathbf {R}\mathbf x= \hat{\mathbf {y}}\), in \(O(n^2)\) time.

### *Proof*

We first take care of the case when \(\hat{\mathbf {y}}=\pm \mathbf x\): If \(\hat{\mathbf {y}}= \mathbf x\) we can simply let \(\mathbf {R}= \mathbf {I}\); if \(\hat{\mathbf {y}}= -\mathbf x\) and *n* is even, then we can use \(\mathbf {R}= -\mathbf {I}\); finally, if \(\hat{\mathbf {y}}= -\mathbf x\) and *n* is odd, then choose \(\mathbf {R}=-\mathbf {I}+2 \mathbf z\mathbf z^\top \), where \(\mathbf z\) is an arbitrary unit vector orthogonal to \(\mathbf x\). In all these cases, \(\mathbf {R}\mathbf x=\hat{\mathbf {y}}\) and \(\det (\mathbf {R})=1\).

Otherwise, let *d* denote the dot product \(\mathbf x\cdot \hat{\mathbf {y}}\) and let \(\hat{\mathbf {y}}^\bot \) be the normalized component of \(\hat{\mathbf {y}}\) that is perpendicular to \(\mathbf x\), i.e. \(\hat{\mathbf {y}}^\bot =\frac{\hat{\mathbf {y}}-d\mathbf x}{\Vert \hat{\mathbf {y}}-d\mathbf x\Vert }\). Let \(\mathbf {U}\) be an orthogonal matrix with \(\mathbf x\) and \(\hat{\mathbf {y}}^\bot \) as its first two columns. Now define \(\mathbf {R}\) as \(\mathbf {U}\mathbf {C}\mathbf {U}^\top ,\) where \(\mathbf {C}\) rotates the first two coordinates by the angle \(\theta \) with \(\cos \theta = d\) and \(\sin \theta = \sqrt{1-d^2}\), and acts as the identity on the remaining coordinates.
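The construction in the proof of Lemma 1 can be sketched in code. We use the equivalent rank-two form \(\mathbf {R}= \mathbf {I}+ (d-1)\big (\mathbf x\mathbf x^\top + \hat{\mathbf {y}}^\bot (\hat{\mathbf {y}}^\bot )^\top \big ) + \sqrt{1-d^2}\,\big (\hat{\mathbf {y}}^\bot \mathbf x^\top - \mathbf x(\hat{\mathbf {y}}^\bot )^\top \big )\), a plane rotation in \(\mathrm {span}\{\mathbf x, \hat{\mathbf {y}}^\bot \}\) that fixes the orthogonal complement, so no explicit \(\mathbf {U}\) is needed; function name and tolerance are our own choices:

```python
import numpy as np

def rotation_from_to(x, y, tol=1e-12):
    """O(n^2) sketch of Lemma 1: build a rotation R with R @ x = y for unit x, y."""
    n = len(x)
    d = float(x @ y)                      # cosine of the angle between x and y
    if abs(d - 1.0) < tol:                # y = x: the identity suffices
        return np.eye(n)
    if abs(d + 1.0) < tol:                # y = -x
        if n % 2 == 0:
            return -np.eye(n)             # -I has determinant 1 for even n
        # odd n: use -I + 2 z z^T for a unit z orthogonal to x (determinant 1)
        z = np.zeros(n); z[np.argmin(np.abs(x))] = 1.0
        z = z - (z @ x) * x; z /= np.linalg.norm(z)
        return -np.eye(n) + 2.0 * np.outer(z, z)
    # generic case: rotate by theta (cos theta = d) in the plane span{x, y_perp},
    # acting as the identity on the orthogonal complement
    y_perp = (y - d * x) / np.linalg.norm(y - d * x)
    s = np.sqrt(1.0 - d * d)
    return (np.eye(n)
            + (d - 1.0) * (np.outer(x, x) + np.outer(y_perp, y_perp))
            + s * (np.outer(y_perp, x) - np.outer(x, y_perp)))
```

Only outer products and vector operations are used, so the construction indeed takes \(O(n^2)\) time.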

### 2.3 Solving the offline problem

Note that the *T* examples \((\mathbf x_t,\mathbf y_t)\) only enter into the optimization problem via the matrix \(\mathbf {S}:= \sum _{t=1}^T \mathbf x_t\mathbf y_t^\top \). This matrix functions as a “sufficient statistic” of the examples. As we shall see later, our randomized online algorithm will also be based on this sufficient statistic.

Nevertheless, the value of Wahba’s problem (5) has a very elegant solution expressed in terms of the singular value decomposition (SVD) of \(\mathbf {S}\).

### **Lemma 2**

By (4), the solution to Wahba’s problem is obtained by solving the \({\text {argmax}}\) problem of the above lemma with \(\mathbf {S}=\sum _{t=1}^T \mathbf x_t\mathbf y_t^\top \) and the value of the original optimization problem (3) is \(T-\sum _{i=1}^{n-1}\sigma _i - s\sigma _n\).

Note that the solution \(\mathbf {W}\) given in the lemma is a rotation matrix: it is a product of three orthogonal matrices, and the sign *s* is chosen so that the determinants of the three factors multiply to one. The solution can be found in \(O(n^3)\) time by computing an SVD of \(\mathbf {S}\).
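As a concrete illustration, the following is a hedged sketch of the standard SVD solution, consistent with the value \(T-\sum _{i=1}^{n-1}\sigma _i - s\sigma _n\) stated above, where \(s=\det (\mathbf {U})\det (\mathbf {V})\) is the sign that makes the determinant one (the function name and array layout are our own choices):

```python
import numpy as np

def best_rotation(X, Y):
    """Offline (Wahba) solution sketch.  Rows of X, Y are the unit vectors x_t, y_t.
    Returns the rotation R minimizing sum_t 0.5 * ||R x_t - y_t||^2, i.e.
    maximizing tr(R S) over SO(n) for the sufficient statistic S."""
    S = X.T @ Y                               # S = sum_t x_t y_t^T
    U, sigma, Vt = np.linalg.svd(S)           # S = U diag(sigma) Vt
    s = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    if s == 0:                                # degenerate S: either sign is optimal
        s = 1.0
    D = np.ones(S.shape[0]); D[-1] = s        # flip only the smallest singular direction
    return Vt.T @ (D[:, None] * U.T)          # R = V diag(D) U^T, det(R) = +1
```

Plugging in \(\mathbf {R}= \mathbf {V}\mathbf {D}\mathbf {U}^\top \) gives \({{\mathrm{tr}}}(\mathbf {R}\mathbf {S}) = \sum _{i=1}^{n-1}\sigma _i + s\sigma _n\), matching the stated optimal value.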

We have been unable to find a complete proof of the above lemma for dimensions more than 3 in the literature and therefore, for the sake of completeness, we give a self-contained proof in “Appendix 1”.

Note that if we are simply optimizing over all orthogonal matrices \(\mathbf {R}\in \mathcal{O}(n)\), with no condition on \(\det (\mathbf {R})\), then we arrive at another classical problem known as the “orthogonal Procrustes problem” (first solved by Schonemann 1966). The solution for this simpler problem is also given in “Appendix 2” for completeness:

### **Lemma 3**

## 3 The randomized algorithm and main theorem

### 3.1 The algorithm

In the *t*-th trial, in Step 6 of the algorithm, \(\mathbf {W}_{t-1}\) is updated to an intermediate parameter matrix \(\mathbf {W}_t^m\), which makes it possible for the algorithm to predict with a unit vector \(\hat{\mathbf {y}}_t\) such that \(\text{ E }[\hat{\mathbf {y}}_t]=\mathbf {W}_t^m\mathbf x_t\). As shown in Lemma 4, the update rule for obtaining \(\mathbf {W}_t^m\) from \(\mathbf {W}_{t-1}\) in Step 6 is a Bregman projection with respect to the squared Frobenius norm onto the set of matrices \(\mathbf {W}\) for which \(\Vert \mathbf {W}\mathbf x\Vert \le 1\):

More precisely, the algorithm predicts with \(\hat{\mathbf {y}}_t=\pm \widetilde{\mathbf z}_t\) with probability \(\frac{1\pm \Vert \mathbf z_t\Vert }{2}\), respectively.<sup>1</sup> In this case the intermediate parameter matrix \(\mathbf {W}_t^m\) is set to be equal to the parameter matrix \(\mathbf {W}_{t-1}\) and
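This randomized rounding is straightforward to implement: given any \(\mathbf z\) with \(\Vert \mathbf z\Vert \le 1\), predicting \(\pm \widetilde{\mathbf z}\) with probabilities \(\frac{1\pm \Vert \mathbf z\Vert }{2}\) gives \(\text{ E }[\hat{\mathbf {y}}] = \widetilde{\mathbf z}\,\Vert \mathbf z\Vert = \mathbf z\). A sketch (the handling of \(\mathbf z= \mathbf {0}\), where any antipodal pair works, is our own arbitrary choice):

```python
import numpy as np

def round_to_unit(z, rng):
    """Sample a unit vector y_hat with E[y_hat] = z, for any z with ||z|| <= 1."""
    norm = np.linalg.norm(z)
    if norm == 0.0:
        # E[y_hat] = 0: a uniformly random sign on a fixed unit vector suffices
        z_tilde = np.zeros(len(z)); z_tilde[0] = 1.0
        return z_tilde if rng.random() < 0.5 else -z_tilde
    z_tilde = z / norm                         # the normalized direction
    p_plus = (1.0 + norm) / 2.0                # probability of predicting +z_tilde
    return z_tilde if rng.random() < p_plus else -z_tilde
```

Note that when \(\Vert \mathbf z\Vert = 1\) the prediction is deterministic, in agreement with the footnote.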

### **Lemma 4**

### *Proof*

We call our algorithm “Lazy Projection GD”, because for the sake of efficiency we “project as little as necessary” while keeping the relationship \(\text{ E }[\hat{\mathbf {y}}_t]=\mathbf {W}_t^m\mathbf x_t\). The algorithm takes \(O(n^2)\) time per trial since all steps can be reduced to a small number of matrix additions and matrix vector multiplications. In the corrigendum to the conference version of this paper, an alternative but more expensive projection is used, which projects \(\mathbf {W}_{t-1}\) onto the convex hull of all orthogonal matrices. As sketched below, this is more involved because it requires us to maintain the SVD decomposition of the parameter matrix.

Let *B* denote the convex hull of all orthogonal matrices. In “Appendix 2” we show that *B* is the set of all square matrices with singular values at most one. Projecting onto *B* consists of computing the SVD of \(\mathbf {W}_{t-1}\) and capping all singular values larger than one at one (see corrigendum to Hazan et al. 2010b). Next, the projected matrix \(\mathbf {W}_t^m\) is randomly rounded to an orthogonal matrix \(\mathbf {U}_t\) s.t. \(\text{ E }[\mathbf {U}_t]=\mathbf {W}_t^m\) (see Lemma 6 for details), and then the prediction made is \(\hat{\mathbf {y}}_t = \mathbf {U}_t\mathbf x_t\). Overall, the algorithm takes \(O(n^3)\) time per iteration.
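In code, this \(O(n^3)\) projection onto *B* is a one-liner over the SVD (a sketch; the function name is ours):

```python
import numpy as np

def project_onto_hull(W):
    """Euclidean projection onto B, the convex hull of orthogonal matrices:
    compute the SVD of W and cap all singular values larger than one at one."""
    U, s, Vt = np.linalg.svd(W)
    return U @ (np.minimum(s, 1.0)[:, None] * Vt)   # U diag(min(s,1)) Vt
```

A matrix already in *B* (e.g. any orthogonal matrix) is left unchanged, since all its singular values are at most one.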

The convex hull *B* of all orthogonal matrices can be written as an intersection of convex constraints:

Our algorithm instead projects only with respect to the single constraint determined by the current instance, rather than onto all of *B*. This new “lazy projection” method is simpler, has update time \(O(n^2)\), and avoids the need to maintain the SVD decomposition. The possibility of using delayed projections was discussed in Section 5.5 of Helmbold and Warmuth (2009). What is unique in our setting is that the convex set we project onto depends on the instance \(\mathbf x_t\).
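The instance-dependent projection can be sketched directly (our function name). One can verify that the Euclidean (squared-Frobenius) projection onto \(\{\mathbf {W}' : \Vert \mathbf {W}'\mathbf x\Vert \le 1\}\) for a unit instance \(\mathbf x\) only rescales the component of \(\mathbf {W}\) along \(\mathbf x\), leaving \(\mathbf {W}\mathbf v\) unchanged for every \(\mathbf v\) orthogonal to \(\mathbf x\):

```python
import numpy as np

def lazy_project(W, x):
    """O(n^2) lazy projection: Euclidean projection of W onto {W' : ||W' x|| <= 1}
    for a unit instance vector x, performed only when the constraint is violated."""
    Wx = W @ x
    norm = np.linalg.norm(Wx)
    if norm <= 1.0:
        return W                      # already feasible: no projection needed
    # rescale the component of W along x so that ||W' x|| = 1; this rank-one
    # correction leaves W v unchanged for every v orthogonal to x
    return W + (1.0 / norm - 1.0) * np.outer(Wx, x)
```

The update is a single matrix-vector product plus a rank-one correction, hence \(O(n^2)\) time.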

### 3.2 Analysis and main theorem

We are now ready to prove our main regret bound for our on-line algorithm based on lazy projections. Note that the learning rate depends on the number of trials *T*. However, it is easy to run the algorithms in stages based on a geometric series of upper bounds on *T* [see for example Algorithm G1 of Cesa-Bianchi et al. (1996)]. This increases the regret bound by at most a constant factor.

### **Theorem 1**

On any sequence of *T* examples, the expected regret of Algorithm 1 is at most \(\sqrt{nT}\).

### *Proof*

Summing over all *T* trials we get

As we shall see, the above upper bound of \(\sqrt{nT}\) on the regret of our randomized algorithm is rather weak in the noise-free case, i.e. when there is a rotation that has loss zero on the entire sequence. It is an open problem whether the upper bound on the regret can be strengthened to \(O(\sqrt{nL}+n)\), where *L* is the loss of the best rotation on the entire sequence of trials. An \(O(\sqrt{nL})\) regret bound was erroneously claimed in the conference paper of Hazan et al. (2010b).

### 3.3 Lower bounds on the regret

We now show a lower bound against any algorithm (including randomized ones).

### **Theorem 2**

For any integer \(T \ge n\ge 2\), there is a fixed instance vector sequence on which any randomized online algorithm for learning rotations can be forced to have regret at least \(\sqrt{\frac{(T-1)(n-1)}{4}} =\varOmega ( \sqrt{nT}).\)

### *Proof*

We first define the fixed instance sequence. Let \(\mathbf{e}_i\) denote the *i*-th standard basis vector, i.e. the vector with 1 in its *i*-th coordinate and 0 everywhere else. In trial \(t < T\), set \(\mathbf{x}_t = \mathbf{e}_{f(t)}\), where \(f(t) = (t \text { mod } n-1) + 1\) (i.e. we cycle through the coordinates \(1, 2, \ldots , n-1\)). The last instance is \(\mathbf {e}_n\).

Now fix any randomized online algorithm; we show that there is a result sequence of length *T* for which this algorithm has regret at least \(\sqrt{\frac{(T-1)(n-1)}{4}} =\varOmega ( \sqrt{nT}).\) Recall that \(\mathbf {e}_{f(t)}\) is the instance at trial \(1\le t<T\). To achieve our lower bound on the regret we set \(\mathbf y_t\) equal to the instance at trial *t* or its reverse with equal probability, i.e. \(\mathbf{y}_t = \sigma _t \mathbf{e}_{f(t)}\), where \(\sigma _t \in \{-1, 1\}\) uniformly at random. For any coordinate \(i \in \{1, 2, \ldots , n-1\}\), let \(X_i = \sum _{t:\ f(t) = i} \sigma _{t}\). For the final trial *T*, the instance \(\mathbf{x}_T\) is \(\mathbf{e}_{n}\), and we set the result vector to \(\mathbf{y}_T = \sigma _T \mathbf{e}_{n}\), where \(\sigma _T \in \{-1, 1\}\) is chosen in a certain way specified momentarily. First, note that

Consider the algorithm’s prediction at any trial *t*, made after receiving instance vector \(\mathbf {e}_{f(t)}\) but before receiving the result vector \(\mathbf y_t=\sigma _t\mathbf {e}_{f(t)}\); then

In the final trial *T*, the algorithm might at best have an expected loss of 0. Thus, the expected loss of the algorithm is at least \(T - 1\), and hence by subtracting (9), its expected regret (with respect to both randomizations) is at least \((n-1)\sqrt{\frac{1}{2}\left\lfloor \frac{T-1}{n-1}\right\rfloor } \). Since for \(a\ge 1\), \(\lfloor a \rfloor \ge \max (a-1,1) \ge (a-1+1)/2=a/2\), the expected regret is lower bounded by

As can be seen from the proof, the above lower bound is essentially \(\sqrt{\frac{nT}{2}}\) for large *n* and *T* such that \(T\ge n\). Recall that the upper bound for our Algorithm 1 is \(\sqrt{nT}\).

We now generalize the above lower bound on the worst-case regret that depends on the number of trials *T* to a lower bound that depends on the loss *L* of the best rotation in hindsight. Applying Theorem 6 of Sect. 5 below with *n* orthogonal instances shows that any randomized on-line algorithm can be forced to incur loss *n* on sequences of examples for which there is a consistent rotation (i.e. \(L=0\)). Combined with the above lower bound, this immediately generalizes to the following lower bound for any \(L\ge 0\):

### **Corollary 1**

For any \(L\ge 0\) and \(T \ge \lfloor L/2 \rfloor \), any randomized on-line algorithm can be forced to have regret \(\varOmega (\sqrt{nL}+n)\) for some sequence of *T* examples for which the loss of the best rotation is at most *L*.

### *Proof*

If \(L< 2n\), then the lower bound of *n* for the noise free case already implies a lower bound of \(\varOmega (\sqrt{nL}+n)\). If \(L\ge 2n\), then we apply the lower bound of the above theorem for the first \(T_0=\lfloor L/2 \rfloor \) rounds. Note that \(T \ge T_0 \ge n\). We then achieve a regret lower bound of \(\varOmega (\sqrt{\lfloor L/2\rfloor n})=\varOmega (\sqrt{nL}+n)\) for the first \(T_0\) rounds. Let \(\mathbf {R}_0\) be the best rotation for the first \(T_0\) rounds. Since the per trial loss of any rotation is at most 2, the loss of the best rotation on the sequence of the \(T_0\) examples used in the construction of the above theorem is at most \(2T_0 \le L\).

For the remaining \(T - T_0\) rounds, we simply use arbitrary pairs of unit vectors \((\mathbf x_t, \mathbf y_t)\) that are consistent with \(\mathbf {R}_0\) (i.e. \(\mathbf y_t = \mathbf {R}_0 \mathbf x_t\) for \(T_0 < t \le T\)). Thus, the rotation \(\mathbf {R}_0\) incurs 0 additional loss, and hence its total loss is bounded by *L*; the algorithm, on the other hand, also incurs 0 additional loss at best. Thus we have an \(\varOmega (\sqrt{nL}+n)\) lower bound on the regret for the sequence of *T* examples in this construction, and the best rotation has loss at most *L*.

\(\square \)

Note that the constant factors needed for the \(\varOmega (.)\) notation in the proofs of the lower bounds are independent of the algorithm and the values of *n* and *L*.

## 4 Deterministic online algorithms

We begin by showing that any deterministic algorithm can be forced to have regret *T* in *T* trials. We then show how to construct a deterministic algorithm from any probabilistic one with at most twice the loss.

### **Theorem 3**

For any deterministic algorithm, there is a sequence of *T* examples (which may be fixed before running the algorithm) such that the algorithm has regret at least *T* on it.

### *Proof*

Since the algorithm is deterministic, the adversary knows each prediction \(\hat{\mathbf {y}}_t\) in advance and can choose the result vector \(\mathbf y_t=-\hat{\mathbf {y}}_t\), which by (1) forces loss 2 on the algorithm. Thus the algorithm incurs total loss 2*T* in all trials.

On the other hand, a rotation drawn uniformly at random has expected loss exactly 1 per trial (since the expectation of a uniformly rotated unit vector is \(\mathbf {0}\) for \(n>1\)), so the best fixed rotation has total loss at most *T*. Hence, the algorithm has regret at least \(2T - T = T\). \(\square \)

A simple deterministic algorithm would be to predict with an optimal rotation matrix for the data seen so far using the algorithm of Lemma 2. However this Follow the Leader or Incremental Off-Line type algorithm, along with the elegant deterministic algorithms based on Lie group mathematics (Arora 2009), can all be forced to have linear worst-case regret, whereas probabilistic algorithms can achieve worst-case regret at most \(\sqrt{nT}\). We don’t know the optimal worst-case regret achievable by deterministic algorithms. In some cases the worst-case regret of a deterministic algorithm is at least *F* times the regret of the best probabilistic algorithm, where *F* grows with the size of the problem (see e.g. Warmuth et al. 2011). We show that for our rotation problem, the factor is at most 2.

### **Theorem 4**

Any randomized algorithm for learning rotations can be converted into a deterministic algorithm with at most twice the loss.

### *Proof*

We construct a deterministic algorithm from the randomized one that in each trial has at most twice the loss. Let \(\mathbf z_t\) be the expected prediction \(\text{ E }[\hat{\mathbf {y}}_t]\) of the randomized algorithm. If \(\mathbf z_t=\mathbf {0}\), we let the deterministic algorithm predict with \(\mathbf {e}_1\). In this case the randomized algorithm has expected loss \(1-\mathbf y_t^\top \mathbf z_t=1\) and the deterministic algorithm has loss at most 2.
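The remaining case \(\mathbf z_t\ne \mathbf {0}\) is handled by a natural choice, which we assume here: the deterministic algorithm predicts with the normalized expectation \(\mathbf z_t/\Vert \mathbf z_t\Vert \). A numerical spot check of the factor-2 claim under that assumption:

```python
import numpy as np

def deterministic_loss_at_most_twice(y, z, tol=1e-12):
    """Check the factor-2 claim for one example: unit y and 0 < ||z|| <= 1,
    assuming the deterministic algorithm predicts with z / ||z||."""
    det_loss = 1.0 - y @ (z / np.linalg.norm(z))   # loss of the deterministic prediction
    rand_loss = 1.0 - y @ z                        # expected loss of the randomized algorithm
    return det_loss <= 2.0 * rand_loss + tol

# random spot checks of the claim across dimensions and scales
rng = np.random.default_rng(0)
for _ in range(10000):
    n = int(rng.integers(2, 6))
    y = rng.standard_normal(n); y /= np.linalg.norm(y)
    z = rng.standard_normal(n)
    z *= (0.999 * rng.random() + 0.001) / np.linalg.norm(z)   # 0 < ||z|| <= 1
    assert deterministic_loss_at_most_twice(y, z)
```

The worst case is approached when \(\mathbf z_t\) points away from \(\mathbf y_t\); even then the deterministic loss of at most 2 is matched by a randomized expected loss of at least 1.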

## 5 Learning when there is a consistent rotation

In Sect. 3 we gave a randomized algorithm with at most \(\sqrt{nT}\) worst-case regret (when \(T\ge n\)) and showed that this regret bound is tight within a constant factor. However regret bounds that are expressed as a function of number of trials *T* only guard against the high-noise case and are weak when there is a comparator with small loss. Ideally we want bounds that grow with the loss of the best comparator chosen in hindsight as was done in the original online learning papers (see e.g. Cesa-Bianchi et al. 1996a; Kivinen and Warmuth 1997). In an attempt to understand what regret bounds are possible when there is a comparator of low noise, we carefully analyze the case when there is a rotation \(\mathbf {R}\) consistent with all *T* examples, i.e. \(\mathbf {R}\mathbf x_t = \mathbf y_t\) for all \(t = 1, 2, \ldots , T\). Even in this case, the online learner incurs loss until a unique consistent rotation is determined and the question is which algorithm has the smallest regret against such sequences of examples.

Let *S* be the first trial in which the matrix \([\mathbf x_1,\ldots ,\mathbf x_S]\) has rank \(n-1\). From trial \(S+1\) to *T*, the algorithm predicts with the unique consistent rotation.

### **Theorem 5**

On any sequence of examples \((\mathbf x_1,\mathbf y_1),\ldots ,(\mathbf x_T,\mathbf y_T)\) that is consistent with a rotation, the randomized version of Algorithm 2 has expected loss \(\sum _{t=1}^S (1-\Vert \mathbf x^{\Vert }_t\Vert )\). The deterministic version has loss at most \(\left( \sum _{t=1}^S \left( 1-\Vert \mathbf x^{\Vert }_t\Vert \right) \right) + 1\).

### *Proof*

After trial *S*, the consistent rotation is uniquely determined by the first *S* examples. Hence we find this rotation and incur no more loss in any trial \(t > S\). Overall, the expected loss of the randomized algorithm is

We now give a simple lemma which states that a rotation is essentially determined by its action on \(n-1\) mutually orthogonal vectors. The proof is straightforward, but we include it for the sake of completeness.

### **Lemma 5**

Let \(\mathbf x_1^{\bot }, \mathbf x_2^{\bot }, \ldots , \mathbf x_T^{\bot }\) and \(\mathbf y_1^{\bot }, \mathbf y_2^{\bot }, \ldots , \mathbf y_T^{\bot }\) be two orthogonal sequences of vectors in \(\mathbb {R}^n\) such that \(\Vert \mathbf x_t^{\bot }\Vert =\Vert \mathbf y_t^{\bot }\Vert \) and both sequences have at most \(r\le n-1\) non-zero vectors. Then there is a rotation matrix \(\mathbf {R}\) such that \(\mathbf y_t^{\bot } = \mathbf {R}\mathbf x_t^{\bot }\) for all *t*. If \(r=n-1\), then \(\mathbf {R}\) is unique.

### *Proof*

Pairs \((\mathbf x_t,\mathbf y_t)\) of length zero can be omitted and without loss of generality we can rescale the remaining vectors so that their length is one, since rotations are linear and preserve lengths. Also we can always add more orthonormal vectors to both sequences and therefore it suffices to find a rotation when \(r=n-1=T\). Let \(\mathbf x_n^\bot \) and \(\mathbf y_n^\bot \) be unit vectors orthogonal to the span of \(\{\mathbf x_1^\bot , \mathbf x_2^\bot , \ldots , \mathbf x_{n-1}^\bot \}\) and \(\{\mathbf y_1^\bot , \mathbf y_2^\bot , \ldots , \mathbf y_{n-1}^\bot \}\), respectively. Now, since rotations preserve angles and lengths, if there is a rotation \(\mathbf {R}\) such that \(\mathbf {R}\mathbf x_t^\bot = \mathbf y_t^\bot \) for \(t = 1, 2, \ldots , n-1\), then we must have \(\mathbf {R}\mathbf x_n^\bot = \pm \mathbf y_n^\bot \). We now determine the right sign of \(\mathbf y_n^\bot \) as follows. Let \(\mathbf {X}= [\mathbf x_1^\bot , \mathbf x_2^\bot , \ldots , \mathbf x_{n-1}^\bot , \mathbf x_n^\bot ]\), \(\mathbf {Y}^+ = [\mathbf y_1^\bot , \mathbf y_2^\bot , \ldots , \mathbf y_{n-1}^\bot , \mathbf y_n^\bot ]\), and \(\mathbf {Y}^- = [\mathbf y_1^\bot , \mathbf y_2^\bot , \ldots , \mathbf y_{n-1}^\bot , -\mathbf y_n^\bot ]\). Note that \(\mathbf {X}\), \(\mathbf {Y}^+\) and \(\mathbf {Y}^-\) are all orthogonal matrices. The desired rotation matrix \(\mathbf {R}\) is a solution to one of the following two linear systems \(\mathbf {R}\mathbf {X}= \mathbf {Y}^+\) or \(\mathbf {R}\mathbf {X}= \mathbf {Y}^-\), i.e. \(\mathbf {R}= \mathbf {Y}^+\mathbf {X}^\top \) or \(\mathbf {R}= \mathbf {Y}^-\mathbf {X}^\top \), since \(\mathbf {X}^{-1} = \mathbf {X}^\top \). Note that both \(\mathbf {Y}^+\mathbf {X}^\top \) and \(\mathbf {Y}^-\mathbf {X}^\top \) are orthogonal matrices since they are products of orthogonal matrices, and the determinant of one is the negative of the other. 
The desired rotation matrix is then the solution with determinant 1. \(\square \)
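The proof is constructive; a sketch for the case \(r=n-1\) with orthonormal vectors (the function name is ours, and completing the last column via the SVD null space is our own implementation choice):

```python
import numpy as np

def rotation_from_orthonormal_pairs(Xcols, Ycols):
    """Lemma 5 construction: given n-1 orthonormal x-vectors and n-1 orthonormal
    y-vectors (as columns of n x (n-1) arrays), return the unique rotation R
    with R x_i = y_i for all i."""
    # unit vectors orthogonal to each span: the last left-singular vectors
    xn = np.linalg.svd(Xcols)[0][:, -1]
    yn = np.linalg.svd(Ycols)[0][:, -1]
    X = np.column_stack([Xcols, xn])          # orthogonal completion of the x's
    Yp = np.column_stack([Ycols, yn])         # candidate completion with +yn
    R = Yp @ X.T                              # solves R X = Y^+, since X^{-1} = X^T
    if np.linalg.det(R) < 0:                  # wrong orientation: flip the sign of yn
        Yp[:, -1] = -yn
        R = Yp @ X.T
    return R
```

Both candidate solutions are orthogonal; the determinant check selects the one in \(\mathcal{SO}(n)\).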

We now prove a lower bound which shows that Algorithm 2 given above has the strong property of being *instance-optimal*: viz., for any sequence of input vectors and for any algorithm, there is a sequence of result vectors on which the loss achieved by the algorithm is at least that of Algorithm 2:

### **Theorem 6**

For any online algorithm and for every instance sequence \(\mathbf x_1,\ldots ,\mathbf x_T\), there is an adversary strategy for choosing the result sequence \(\mathbf y_1,\ldots ,\mathbf y_T\) for which there is a consistent rotation, and the algorithm incurs loss \(\sum _{t=1}^S (1-\Vert \mathbf x^{\Vert }_t\Vert )\). Furthermore, if the algorithm is deterministic, then the adversary can force a loss of at least \(\sum _{t=1}^S (1-\Vert \mathbf x^{\Vert }_t\Vert ) + 1\).

### *Proof*

We next show that for the examples \((\mathbf x_t,\mathbf y_t)\) produced by the adversary, there always is a consistent rotation \(\mathbf {R}\) such that \(\mathbf y_t=\mathbf {R}\mathbf x_t\). Note that in the construction the vectors \(\mathbf y_1, \mathbf y_2, \ldots , \mathbf y_t\) lie in the span of \(\mathbf y_1^\bot , \mathbf y_2^\bot , \ldots , \mathbf y_t^\bot \). So it suffices to show that there is a rotation matrix \(\mathbf {R}\) such that \(\mathbf y_q^\bot = \mathbf {R}\mathbf x_q^\bot \) for all \(q \le t\). To show this we use Lemma 5. First, we note that since the rank of \([\mathbf x_1, \mathbf x_2, \ldots , \mathbf x_t]\) is at most \(n-1\), and since \(\mathbf x_1^\bot , \mathbf x_2^\bot , \ldots , \mathbf x_t^\bot \) are mutually orthogonal and span the same subspace, there can be at most \(n-1\) non-zero \(\mathbf x_q^\bot \) vectors. Further, by construction \(\Vert \mathbf y_q^\bot \Vert = \Vert \mathbf x_q^\bot \Vert \) for all \(q \le t\). Thus, by Lemma 5, the desired rotation exists and we are done. \(\square \)

The deterministic variant of Algorithm 2 also does not predict with a consistent rotation. It is easy to show that any deterministic algorithm that predicts with a consistent rotation can be forced to have loss \(2(n-1)\) instead of the optimum *n*: when such an algorithm receives an instance vector that is orthogonal to all the previous instance vectors, the algorithm deterministically predicts with a unit vector \(\hat{\mathbf {y}}_t\) that is orthogonal to all the previous result vectors. Since the adversary knows \(\hat{\mathbf {y}}_t\), it can simply choose the result vector \(\mathbf y_t\) as \(-\hat{\mathbf {y}}_t\). This forces the algorithm to incur loss 2 and this can be repeated \(n-1\) times, at which point the optimal rotation is completely determined.

## 6 Conclusions

We have presented a randomized online algorithm for learning rotations with a regret bound of \(\sqrt{nT}\) and proved a lower bound of \(\varOmega (\sqrt{nT})\) for \(T>n\). We also proved a lower bound of \(\varOmega (\sqrt{nL}+n)\) for the regret of any algorithm on sequences on which the best rotation has loss at most *L*. Recently, an upper bound was shown that matches this lower bound within a constant factor (Nie 2015). However the algorithm that achieves this upper bound on the regret is not the GD algorithm of this paper based on lazy projection but the GD algorithm that projects onto the convex hull of orthogonal matrices at the end of each trial (Hazan et al. 2010a). The latter algorithm requires \(O(n^3)\) time per trial (Hazan et al. 2010a). It is open whether the \(O(n^2)\) lazy projection version of GD has the same \(\varOmega (\sqrt{nL}+n)\) regret bound.

In general, we don’t know the best way to maintain uncertainty over rotation matrices. The convex hull of orthogonal matrices seems to be a good candidate (as used in Hazan et al. 2010a). For the sake of efficiency, we used an even larger hypothesis space in this paper: we only require in Algorithm 1 that \(\Vert \mathbf {W}_t^m\mathbf x_t\Vert \le 1\) at trial *t*. The algorithm regularizes with the squared Frobenius norm and uses Bregman projections with respect to the inequality constraint \(\Vert \mathbf {W}_t^m\mathbf x_t\Vert \le 1\). However such projections “forget” information about past examples and the squared Frobenius norm does not take the group structure of \(\mathcal{SO}(n)\) into account.

One way to resolve some of these questions is to develop the minimax optimal algorithm for rotations and the linear loss discussed in this paper. This was recently posted as an open problem in Kotłowski and Warmuth (2011). In the special case of \(n=2\), the minimax regret for randomized algorithms and *T* trials was determined to be \(\sqrt{T}\), whereas for our randomized algorithm we were only able to prove the larger regret bound of \(\sqrt{2T}\). We have shown in the previous section that in the noise-free case, the minimax regret is \(n-1\) for randomized algorithms and \(2(n-1)\) for deterministic ones. Curiously enough the minimax algorithm for \(n=2\) and *T* trials (Kotłowski and Warmuth 2011) makes heavy use of randomness, whereas in the noise-free case the optimal randomized algorithm only requires randomness in trials when \(\mathbf x_t^{\Vert }=\mathbf {0}\).

Another direction is to make the problem harder and study learning rotations when the loss is non-linear. The question is whether, for non-linear loss functions, the GD algorithm that projects onto the convex hull of orthogonal matrices still achieves a worst-case regret within a constant factor of optimal.

## Footnotes

- 1.
In the special case when \(\Vert \mathbf z_t\Vert =1\), we have \(\widetilde{\mathbf z}_t=\mathbf z_t\) and \(\hat{\mathbf {y}}_t\) is deterministically chosen as \(\widetilde{\mathbf z}_t\).

## Notes

### Acknowledgments

This work was motivated by preliminary research done with Adam Smith (Smith and Warmuth 2008, 2010), and we greatly benefited from discussions with him. We thank Abhishek Kumar for allowing us to include a simplified proof for Wahba’s problem, which was worked out in collaboration with him. Our work also benefited greatly from discussions with Raman Arora and Wojciech Kotłowski. In particular, Wojciech showed us the \(O(n^2)\) algorithm for rotating a unit vector onto another. Finally, thanks to Christfried Webers for pointing out a number of subtle inaccuracies in an earlier draft and to Nie Jiazhong for many helpful discussions.

## References

- Abrudan, T., Eriksson, J., & Koivunen, V. (2008a). Efficient Riemannian algorithms for optimization under unitary matrix constraint. In *IEEE international conference on acoustics, speech and signal processing (ICASSP ’08)* (pp. 2353–2356). IEEE.
- Abrudan, T. E., Eriksson, J., & Koivunen, V. (2008b). Steepest descent algorithms for optimization under unitary matrix constraint. *IEEE Transactions on Signal Processing*, *56*(3), 1134–1146.
- Arora, R. (2009). On learning rotations. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), *Advances in neural information processing systems* (Vol. 22, pp. 55–63). Cambridge: MIT Press.
- Bernstein, D. S. (2009). *Matrix mathematics: Theory, facts, and formulas*. Princeton: Princeton University Press.
- Cesa-Bianchi, N., Long, P., & Warmuth, M. K. (1996a). Worst-case quadratic loss bounds for on-line prediction of linear functions by gradient descent. *IEEE Transactions on Neural Networks*, *7*(2), 604–619.
- Cesa-Bianchi, N., Long, P. M., & Warmuth, M. K. (1996b). Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. *IEEE Transactions on Neural Networks and Learning Systems*, *7*(3), 604–619.
- Doran, C., Hestenes, D., Sommen, F., & Van Acker, N. (1993). Lie groups as spin groups. *Journal of Mathematical Physics*, *34*(8), 3642–3669.
- Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. *Journal of Computer and System Sciences*, *55*, 119–139.
- Haagerup, U. (1982). The best constants in the Khintchine inequality. *Studia Mathematica*, *70*(3), 427–485.
- Hazan, E., Kale, S., & Warmuth, M. K. (2010a). Corrigendum to “Learning rotations with little regret”. http://www.colt2010.org/papers/rotfixfinal.pdf.
- Hazan, E., Kale, S., & Warmuth, M. K. (2010b). Learning rotations with little regret. In *COLT ’10*. A corrigendum can be found at the conference website.
- Herbster, M., & Warmuth, M. K. (2001). Tracking the best linear predictor. *Journal of Machine Learning Research*, *1*, 281–309.
- Helmbold, D. P., & Warmuth, M. K. (2009). Learning permutations with exponential weights. *Journal of Machine Learning Research*, *10*, 1705–1736.
- Kivinen, J., & Warmuth, M. K. (1997). Additive versus exponentiated gradient updates for linear prediction. *Information and Computation*, *132*(1), 1–64.
- Kotłowski, W., & Warmuth, M. K. (2011). Minimax algorithm for learning rotations. In *Proceedings of the 24th annual conference on learning theory (COLT ’11)*.
- Nie, J. (2015). Optimal learning with matrix parameters. Ph.D. thesis, University of California, Santa Cruz.
- Schonemann, P. (1966). A generalized solution of the orthogonal Procrustes problem. *Psychometrika*, *31*(1), 1–10.
- Smith, A., & Warmuth, M. K. (2008). Learning rotations. In *Proceedings of the 21st annual conference on learning theory (COLT ’08)* (p. 517).
- Smith, A. M., & Warmuth, M. K. (2010). Learning rotations online. Technical Report UCSC-SOE-10-08, Department of Computer Science, University of California, Santa Cruz.
- Wahba, G. (1966). Problem 65–1, a least squares estimate of satellite attitude. *SIAM Review*, *8*(3), 384–386.
- Wareham, R., Cameron, J., & Lasenby, J. (2005). Applications of conformal geometric algebra in computer vision and graphics. In *6th international workshop IWMM 2004* (pp. 329–349).
- Warmuth, M. K., & Kuzmin, D. (2011). Online variance minimization. *Machine Learning*, *87*(1), 1–32.
- Warmuth, M. K., Koolen, W. M., & Helmbold, D. P. (2011). Combining initial segments of lists. In *Proceedings of the 22nd international conference on algorithmic learning theory (ALT ’11)* (pp. 219–233). Springer-Verlag.
- Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In *ICML* (pp. 928–936).