An algorithm to compute the polar decomposition of a 3 × 3 matrix

We propose an algorithm for computing the polar decomposition of a 3 × 3 real matrix that is based on the connection between orthogonal matrices and quaternions. An important application is to 3D transformations in the level 3 Cascading Style Sheets specification used in web browsers. Our algorithm is numerically reliable and requires fewer arithmetic operations than the alternative of computing the polar decomposition via the singular value decomposition.


Introduction
In designing algorithms for dense numerical linear algebra problems we are usually driven by the need to make the algorithms efficient for large values of the dimension, n.This leads to certain design principles, such as to prefer an O(n 3 ) flops algorithm to an O(n 4 ) flops one and to aim to develop algorithms that are rich in matrix-matrix operations.However, there is another class of problems that is rarely considered, though practically important: to solve a large number of very small problems.
For small problems (2 ≤ n ≤ 8, say), the exponent and constant of the highest order term in the flop count are not necessarily indicative of the cost, since that term may not dominate, and the prevalence of matrix-matrix or matrix-vector operations is also of little relevance.These problems therefore need a different approach.
In this work we are concerned with computing the polar decomposition of a 3 × 3 matrix.This problem arises in a number of applications, but the main motivation for our work comes from 3D graphics transformations, where the polar decomposition provides a convenient way to parametrize and interpolate transformations [22,28].In particular, the polar decomposition is of interest for the level 3 specification of the Cascading Style Sheets (CSS) language that is under development by the World Wide Web Consortium (W3C) [31, sect. 20.3].The requirements are for an implementation of the polar decomposition that can run in any web browser on any computing platform (desktop and mobile), runs fast enough to support animation, never fails, is numerically stable, and has minimal dependence on libraries [27].
Other applications requiring the polar decomposition of a 3 × 3 matrix are in classical continuum mechanics [3], in orthogonalization of approximate direction cosine matrices in aerospace systems [1,10,18], and for modeling soft tissue deformation in virtual surgery [32].
An obvious question is whether we can compute the polar decomposition of a 3×3 matrix explicitly.Formulae are available for the 2 × 2 case [14,29], and for companion matrices [30].In [16] it is explained how to obtain analytic formulae for the eigendecomposition of a symmetric 3 × 3 matrix.These yield complicated formulae for the singular value decomposition (SVD), and hence the polar decomposition via the eigendecompositions of A T A and AA T .However, this procedure suffers from numerical instability.
We are not the first to treat efficient solution of linear algebra problems of small dimension.An early investigation is that of Pitsianis and Van Loan [24], who focus on multiple instances of small problems and stack the problems in order to exploit vector hardware.More recently, Dong et al. [7] have investigated how to implement LU factorizations of many small matrices on GPUs.Previous work exists on computing the polar decomposition or SVD of 3×3 matrices A, but none of the algorithms of which we are aware satisfies the requirements listed above.In particular, some algorithms (such as that in [19]) begin by computing A T A and are numerically unstable.
In Section 2 we use an argument based on quaternions to explain how the orthogonal polar factor of a 3×3 matrix can be obtained from an eigenvector corresponding to the dominant eigenvalue of a related 4×4 symmetric matrix.This relation is the basis of the algorithm that we develop in Section 3. The algorithm of Section 3 is presented in stages, beginning with the simplest and most practically importance case in which A has a sufficiently large second largest singular value.The operation counts given in Section 4 show that the new algorithm generally requires fewer operations than the alternative that computes the polar decomposition via the SVD.Numerical experiments presented in Section 5 show that the new algorithm has an advantage in speed over its competitors without sacrificing numerical stability.Concluding remarks are given in Section 6.

Polar decomposition of 3 × 3 matrices and quaternions
We recall that a polar decomposition of a matrix A ∈ R n×n is a factorization A = QH , where Q is orthogonal and H is symmetric positive semidefinite [13,Chap. 8].
Clearly, H = (A T A) 1/2 is always unique, and when A is nonsingular H is positive definite and Q = AH −1 is unique.
The polar decomposition can be obtained from the SVD and vice versa.Indeed if We will denote by O(3) the group of 3 × 3 orthogonal matrices and by SO(3) the subgroup of O(3) of orthogonal matrices with determinant 1.The orthogonal polar factor Q of A ∈ R 3×3 has the trace maximizing property This can be shown using the SVD by noting that trace(Z T A) = trace Z T U V T = trace V T Z T U ≤ trace , and equality is attained at Z = UV T = Q.Now we consider the sign of the determinant of Q.Note that if det A > 0 then det Q = 1.If det A = 0 it is not hard to show that there is a choice of the polar decomposition satisfying det Q = 1.If det A < 0 then det(−A) > 0 (since n is odd), so we have the polar decomposition we have This means that we can look for the matrix Q ∈ SO(3) that maximizes the trace if det A ≥ 0 and minimizes the trace if det A < 0. The polar factor is equal to ηQ.
Throughout this paper we use the 2-norm on R n , x 2 = (x T x) 1/2 , and the Frobenius norm on R n×n , A F = trace(A T A) 1/2 .
In the rest of this section we develop a relation between the the orthogonal polar factor of a 3 × 3 matrix and the eigensystem of a related 4 × 4 symmetric matrix.Most of the results given are essentially known but have not previously been collected together in this unified and self-contained way.

Quaternions and rotations
Our approach to computing a polar decomposition of a 3 × 3 real matrix is based on the trace maximization property, and it is best explained using quaternions.The ideas in this subsection about the relation between quaternions and rotations were mostly known to Cayley [4] and Hamilton [8].They have also reappeared several times in the modern literature, for example [5,15,17], and are known to experts of graphic animation: see [19,22,28] and the references therein.An excellent reference on quaternions in linear algebra is the book by Rodman [25].
Let H be the ring of quaternions, with the standard basis {1, i, j, k}.H is a vector space over the real numbers, so any q ∈ H can be written It is known that algebraically H is an associative normed division algebra.All the rules for multiplication can be deduced from the defining equations In particular, multiplication is not commutative since, for example, 3), we can define a Euclidean norm on There is a conjugation defined by (w If w = 0 we say that the quaternion xi + yj + zk is pure imaginary.Moreover, if there is some 0 = α ∈ R such that two quaternions q 1 , q 2 satisfy q 1 = αq 2 , we write q 1 ∼ q 2 .It is easy to check that ∼ is an equivalence relation.Recall that the quotient space Q := H/ ∼ is defined as the set of equivalence classes of H by ∼.There are precisely two unimodular quaternions, equal to each other up to sign, in each equivalence class (except for the equivalence class {0}).Hence, Q \ {0} is oneto-one with the unit hemisphere in R 4 , and the latter is precisely the set of all possible eigenvectors of a 4 × 4 symmetric matrix up to normalization.
The embedding can be applied column-by-column to matrices, inducing a mapping Moreover we have the mapping (2.4) Here are some properties of the mapping ψ in (2.4) that will be useful below.They are also discussed in [4], [5, sect. 3], [17, sect. 7.1], for example.

Theorem 2.1
The following properties hold.
Note that Theorem 21 (d) implies that there is a bijection between the set of 3 × 3 orthogonal matrices of unit determinant and the set of all unit vectors in R 4 (up to the sign).Indeed, both sets are one-to-one with Q \ {0}.Furthermore, the bijection is explicitly given by (2.4).
The latter equation defines a function τ : R 4 → R. Since ψ in (2.4) maps q ∈ H with |q| = 1 to some Q ∈ SO(3) by Theorem 2.1 (d), we can now use a Lagrange multiplier method to solve our constrained maximum problem, by looking for the stationary points of T  (2.5)Only a few simple algebraic manipulations are needed to see that Therefore the condition for a stationary point, ∇f = 0, is equivalent to the equations Now we derive some properties of B. Recall that the adjugate of a matrix A, denoted adj A, is the matrix whose (i, j ) element is (−1) i+j times the minor obtained by deleting the j th row and the ith column of A. Here, and throughout, Proof The proof is by direct verification.
For the following results we denote the singular values of A by σ 1 , σ 2 , σ 3 , with σ 1 ≥ σ 2 ≥ σ 3 ≥ 0. Recall that η is defined in (2.1).Expressing the various norms in terms of the singular values yields an alternative expression for the determinant.

Corollary 2.3 det(xI
The next theorem gives the spectrum of B in terms of the SVD of A. It has been previously proved by Mackey [17,Prop. 6], also using the connection with quaternions.

In particular, if rank
Proof The result can be obtained directly, by solving the quartic equation of Corollary 2.3.
Recall that a dominant eigenvalue is an eigenvalue whose absolute value is maximal.Theorem 2.4 shows that the dominant eigenvalue of B is always (up to sign) the trace (or nuclear) norm of A, which we denote by Any eigenvector v of B, normalized so that it has unit norm, gives a stationary point for the trace, via the mapping ψ.However, it must be a dominant eigenvector in order to give a global maximum for η trace(Q T A) = ητ .Here and below, by a "dominant eigenvector" we mean an eigenvector, normalized to have unit norm, associated with a dominant eigenvalue μ.If there is a unique dominant eigenvalue, we will correspondingly speak about "the" dominant eigenvector.
To check the statement above, observe that as long as w, x, y, z lie on the unit sphere v = 1 then τ (w, x, y, z) is equal to the Rayleigh quotient v T Bv/(v T v), by (2.6).Therefore, if λ 1 ≥ λ 2 ≥ λ 3 ≥ λ 4 are the eigenvalues of B we have λ 4 ≤ τ (x, y, w, z) ≤ λ 1 .Hence, when η = 1 the maximum of τ (w, x, y, z) is λ 1 = μ, and it is achieved when v = [w x y z] T is a corresponding (dominant) eigenvector.Similarly if η = −1 then the maximum of −τ (w, x, y, z) is equal to −λ 4 = μ, and this maximum is achieved when v is a corresponding (dominant) eigenvector.Note that the maximum is achieved at a unique vector (up to sign) if and only if det A = 0.
We summarize the analysis in the following theorem, where we simplify the formula for Q by using the normalization Recall also the transpose in (2.2), which corresponds to a conjugation, by Theorem 2.1 (a).

Theorem 2.5
Let A ∈ R 3×3 and let B ∈ R 4×4 be defined as in (2.5).Let v be the dominant eigenvector of B when A is full rank, or any vector in any dominant eigenspace when A is rank deficient, normalized to have unit norm.Then a polar decomposition A = QH is given by ) The next result has not previously appeared, and will be needed below.
any possible choice of the sign is allowed by the statement) and p Finally, if rank A ≤ 1, then μ = ±σ 1 , and Finally we note a bound for the eigenvalue of B of largest absolute value, derived via standard inequalities on matrix norms.We may assume without loss of generality that η = 1.Then the largest eigenvalue λ 1 is nonnegative and satisfies (2.8)

Algorithms for the polar decomposition
The theoretical analysis in Section 2 suggests an algorithm to compute the polar decomposition of A ∈ R 3×3 : form the matrix B, approximate its dominant eigenvector v, form the polar factor Q via Theorem 2.5, and finally compute H = Q T A. The most basic implementation of the idea is to compute a complete eigendecomposition of B via any reliable eigensolver, such as the QR algorithm.
This idea is very simple, but it is clearly overkill: only one eigenvector of B is needed, but Algorithm 3.1 computes all four of them.A method that computes only a dominant eigenvector, such as the power method, inverse iteration, or the Rayleigh quotient iteration, could be cheaper.However, the linear convergence of the power method and inverse iteration can make them too slow.The cubically convergent Rayleigh quotient method, or the conjugate gradient method used for instance in [3], have a different disadvantage: they do not guarantee convergence to the dominant eigenvector, at least if we do not have a good initial guess.
To overcome these difficulties, we will first compute a cheap estimate λ 1 of λ 1 , the dominant eigenvalue of B. Then we will apply inverse iteration to the shifted matrix Both for the analysis, and in the algorithm, it is convenient to scale A to have unit Frobenius norm; thus the algorithm preprocesses A ← A/ A F .Hence throughout this section we assume that σ 2 1 + σ 2 2 + σ 2 3 = 1.We will see below that there are three different regimes, depending on the estimated magnitude of σ 2 , the second largest singular value of A. However, our algorithm turns out to be very simple when σ 2 is large enough, and so we discuss this case first.

Main branch of the algorithm
For simplicity of the exposition, we begin by presenting Algorithm 3.2, which is intended for the case where σ 2 is large enough.In practice, most inputs satisfy this criterion.Algorithm 3.2 is the main branch of Algorithm 3.5, which handles the general case.
In the factorization on line 7 the pivot at each stage is chosen as the largest element in absolute value on the diagonal.The factorization exists in view of B s having three positive eigenvalues.
In the next subsections we explain in more detail how Algorithm 3.2 works.As indicated above, the basic strategy is to first approximate λ 1 and then to apply inverse iteration to the shifted matrix B s .

Computing the characteristic polynomial of B
The cheapest way to estimate the dominant eigenvalue λ 1 is via the characteristic polynomial of B. Of course, this is not always a numerically stable approach, but it provides a cheap estimate that in some circumstances can be reasonably good.The first ingredient is the computation of the constant term in the characteristic polynomial of B, which is equal to the determinant of B. The latter can be computed with various methods: Leibniz's formula (for B ∈ R n×n : det B = σ ∈S sign σ n i=1 a i,σ i , where the sum is over all permutations of the set {1, 2, . . ., n}), or as 1−4 adj(A) 2 F (by Lemma 2.2), with each element of adj(A) computed as a determinant of a 2 × 2 matrix, or via an LU factorization.As the last method requires the fewest flops, this is our choice.
We now assume that b = det B (on line 2) has been computed with an absolute forward error cu with u the machine precision and c a moderate constant.
We then need to compute det A, which is done via an LU factorization of A with partial pivoting.In principle, a computation via Leibniz's formula would be less expensive. 1However, it is important to obtain the correct sign of det A, as the code uses this information to decide whether the dominant eigenvalue of B is positive or negative: if det A < 0 then we preprocess B ← −B.A careless evaluation of sign det A could result in instability in the computation of the polar decomposition.However, by the analysis in [23] the computation of a determinant via Gaussian elimination yields the correct sign unless the smallest singular value is a moderate multiple of u, and in this case it can be shown that either sign is acceptable.Therefore, from now on we may assume without loss of generality that det A ≥ 0 and that d = det A has been computed with a small absolute forward error.

Estimating λ 1
At this point, we need to compute λ 1 .As described by the pseudocode below, the default strategy is to use an exact formula to solve the quartic equation in Lemma 2.2.The algorithm below is based on a slight modification of the general method given in [26, sect.1].
We now the analytic formulae to spot potential dangers.One comes from the potential ill conditioning of the scalar root-finding problem.In the approach discussed above, a potential danger is obvious a priori: Algorithm 3.3 computes the dominant eigenvalue of a symmetric matrix (whose condition number is 1) as the dominant root of a scalar quartic equation (whose condition number can potentially be large).This may be inadvisable when λ 1 and λ 2 are close, i.e., λ 1 is a near multiple root.The bigger the gap between λ 1 and λ 2 the more we can trust our cheap estimate of λ 1 and hence the less work is needed to obtain the dominant eigenvector.
In Section 3.1.3,we will see that the simple logical test 1 − b > τ 2 , where τ 2 is a tolerance, implies (2.9) In the more sophisticated Algorithm 3.5, this condition is actually checked.Although we will not explicitly use this fact, it is worth noting that, in view of Theorem 2.4 or Lemma 2.6, λ 1 and λ 2 are close precisely when σ 2 + σ 3 is small, i.e., the numerical rank of A is 1.
At first sight, one might think that Newton's method is a good idea in general.In fact, by (2.8) the dominant eigenvalue of B must lie in the interval I := [1, √ 3], and since det(xI − B) is a convex function on I, the method is guaranteed to converge quadratically and monotonically from above if we use √ 3 as a starting point.However, Newton's method is much more expensive than the analytic formula when many more than a couple of iterations are required for convergence.Fortunately, the scenario when the analytic formula might suffer from numerical instability is favorable to Newton's method.Indeed, (θ, φ) ≈ (arctan √ 2, π/4) implies that λ 1 = cos θ + sin θ(cos φ + sin φ) ≈ √ 3, and, since the starting point is √ 3, Newton's method is guaranteed to converge within only a few iterations (with the current choice of tolerance τ 1 in Algorithm 3.3, one iteration suffices).

Obtaining the dominant eigenvector
Finally, we need to use the approximation of the dominant eigenvalue λ 1 of B to obtain the dominant eigenvector v. Suppose that 1 − b > τ 2 .Then, 3 ) 2 = 4 sin 2 θ −3 sin 4 θ , where we have used the fact that for any x, y > 0, (x + y) 2 − 4xy = (x − y) 2 ≥ 0 and the parametrization of the previous subsection.This yields Therefore So we are guaranteed that the dominant eigenvalue is well separated, and hence inverse iteration will be fast to converge.
Furthermore, 1 − b > τ 2 implies by Lemma 2.6 that the absolute value of the derivative of the characteristic polynomial of B, evaluated at λ 1 , is at least β = τ 2 , giving the upper bound 1/β for the condition number of λ 1 as a root of the characteristic polynomial.In Algorithm 3.5, to be described in the next section, we check the condition 1 − b > τ 2 , with a current choice of tolerance τ 2 = 10 −4 .Here and below, we denote by λ1 := λ 1 + 2 the computed value, as opposed to the exact value λ 1 , of the dominant eigenvalue of B. We assume that the forward error, 2| |, is of order the unit roundoff u times a moderate constant.(The factor 2 in the definition is just for notational convenience in the following).
At this point, Algorithm 3.2 computes an LDL T factorization with diagonal pivoting of a 4 × 4 symmetric matrix B s of numerical rank 3 whose smallest eigenvalue is bounded below by −2 .Indeed, recall from (2.9) that the third largest eigenvalue of B s is The analysis in [11,12,Chap. 10] can be applied to argue that the computed L and D satisfy B s − L D LT F ≤ θu, with θ a moderate constant.In particular θ is the sum of two terms: the bound in the error analysis of the Cholesky factorization for a semidefinite matrix (note that the right hand side of [12, eq. (10.19)] is √ 21 for n = 4 and rank 3), plus another term coming from the approximation of λ 1 .
From Theorem 2.4, we see that inverse iteration converges linearly with rate 2 ) ≤ 2θu.The vector v can be easily computed directly, using the exact formula for the inverse of a unit lower triangular 4 × 4 matrix.

Complete algorithm for the general case
Now we give the complete algorithm, which is applicable to any A ∈ R 3×3 .
We now explain the computations in Algorithm 3.5 when b ≥ 1 − τ 2 .In this case we need to estimate σ 2 in order to decide whether to apply inverse iteration or subspace iteration on a 4 × 2 matrix.
Recall that we have defined = 1 2 ( λ1 − λ 1 ).Then the convergence ratio of inverse iteration is On the other hand the convergence ratio for subspace iteration with a twodimensional subspace is However, if subspace iteration is chosen, we do not really care that our ansatz 4 × 2 matrix V completes its convergence towards an invariant subspace (corresponding to the dominant and second dominant eigenvectors): for our goals it suffices that a good approximation of the dominant eigenvector lies in V .This happens with a potentially faster rate, We want to make a decision on what choice to make according to the maximum number of iterations that we expect from either method.Let ω = − log 10 σ 2 and suppose 2 < ω < 8 so that σ 2 .The absolute condition number of the polynomial root finding problem p(x) = 0 is |p (λ)| −1 for the root λ.Observe that .
and therefore σ 2 1 > 0.99997 (for τ 2 = 10 −4 ).Hence, we can estimate the forward error in λ1 via Lemma 2.6: Hence, setting u = 1.1×10 −16 , i.e., assuming an implementation in double precision floating point arithmetic, we get log 10 | | ω − 16.86.Assume without loss of generality that > 0. A pessimistic bound for the convergence ratio of inverse iteration is ρ I = (σ 2 − )/ ; similarly, a pessimistic bound for the convergence of the dominant eigenvector within subspace iteration is ρ S = (σ 1 − )/ .Subspace iteration costs at least twice as much per iteration, plus some extra fixed costs.Heuristically, we estimate the cost of subspace iteration as about three times that of inverse iteration.Hence, we do not wish to employ subspace iteration unless ρ 3 I < ρ S .From the above, we have log 10 ρ I 16.86 − 2ω and log 10 ρ S 16.86 − ω.Hence, 3 log 10 ρ I < log 10 ρ S corresponds (approximately) to ω > 6.74.In practice, as we have no access to ω, we use complete pivoting in P 1 AP 2 = LU and evaluate ω = − log 10 |u 22 |.A discussion of why LU with complete pivoting is a reliable rank revealing factorization can be found in [12, sect.9.12], and we note that the key constant n2 n−1 appearing there is 12 for n = 3.
If ω < 7.18, we use inverse iteration, otherwise we use subspace iteration.(7.18 is approximately the value corresponding to 6 inverse iterations).If ω < 7.18 we also estimate the number of inverse iterations needed for convergence, with the formula n it = 15/(16.86− 2 ω) .Note that this is equivalent to setting a tolerance 10 −15 , and that it yields 1 ≤ n it ≤ 6.The reason to estimate a priori the number of iterations is to avoid having to evaluate a stopping criterion, thus saving flops.
Assuming now ω < 7.18, we compute the LDL T factorization of B s .We then apply inverse iteration for n it steps, with starting vector L−T e 4 .
It remains to complete the analysis to include the possibility ω ≥ 7.18.In this case, given the analysis above, due to efficiency issues we resort to subspace iteration, with starting guess V = L −T [e 3 e 4 ].Within the subspace iteration we orthonormalize at each step, for two steps.We also expect potential loss of accuracy in approximating λ 1 , because the latter is close to being a double root of the characteristic polynomial.We can expect an accuracy of order √ u, and now we are in the regime σ 2 .Hence a lower bound for the convergence rate is Thus, we expect convergence after about −15/ log( √ u) ≈ 2 steps.This justifies the fact that we implement a fixed number of steps (equal to 2) without checking against any stopping criterion.
At this point, we project onto the 2 × 2 symmetric matrix H = V T BV and compute the dominant eigenvector w of H via an analytic expression.The dominant eigenvector of B is then computed as v = V w.

Flop counts
Here we give a brief discussion of the operation count of Algorithm 3.5.A detailed breakdown of the operation count can be found in the Supplementary Materials.
The most advantageous situation occurs when Algorithm 3.5 calls Algorithm 3.2.It is also the most common scenario in graphics applications, where the input matrices are typically well conditioned [27].Assuming that the analytic formula is used to estimate λ 1 in Algorithm 3.3 the total cost of Algorithm 3.5 is 250 flops (if Newton's method is used instead then not many iterations are needed and hence this has little effect on the flop count).This is comparable to the number of flops needed to compute the polar decomposition by applying the analytic method in [16] to diagonalize AA T .If Algorithm 3.2 cannot be applied then within Algorithm 3.5 the LU factorization of A is performed with complete pivoting in order to obtain an order of magnitude estimate of σ 2 .As explained in Section 3.2, this is done by testing whether the quantity ω is large enough or smaller than a certain threshold.In the former case, inverse iteration is applied.The cost for this case is between 314 and 524 flops, according to how many iterations are needed.Finally, if ω is small we employ subspace iteration, for which the cost is at most 540 flops.Hence the proposed algorithm typically requires about 250 flops and in the worst case about 550 flops.
This should be compared against the standard method employing an SVD.With this approach, one first computes an SVD (including the singular vectors) of A. Then the polar factor is computed via one matrix multiplication (costing 27M +18A, where M denotes a multiplication and A an addition) and the Hermitian factor is recovered from its eigendecomposition (cost 36M + 18A).So the total cost is that of an SVD, which clearly dominates, plus a further 99 flops for postprocessing.
What is the cost of an SVD?It costs 119 flops to bidiagonalize a 3 × 3 matrix (employing Householder reflectors), including the cost of storing the information that is needed to recover the singular vectors.Concerning the iterative part, LAPACK implements the Demmel-Kahan QR iteration [6] until one singular value (usually the smallest) has converged.Then the SVD of the deflated 2 × 2 bidiagonal is computed directly.The cost of a step of the Demmel-Kahan iteration, excluding the computation of a shift (if any), is 108 flops (36 for the singular values plus 72 to update the singular vectors).Since shifted QR is eventually cubically convergent, one can argue that a "typical" sequence for a superdiagonal element is O(10 −1 ) → O(10 −3 ) → O(10 −9 ) → converged, which means 3 iterations.Indeed, we have run the iteration on a sample of random bidiagonal matrices of Frobenius norm 1, observing on average about 2.7 iterations for convergence.As a result of the above observation and of the experiments, we feel that it is fair to assume that at least 1 and at most 4 iterations will be required.Finally, the SVD of the 2 × 2 matrix is computed via a direct method, and it costs 35 flops to determine the singular values and further 36 flops for updating the singular vectors.In total the SVD approach thus costs about 400 flops in the best case and about 700 flops in the worst case, depending on the number of iterations required for convergence.

Timings
For timing purposes, we have implemented in Julia [2] the main branch of Algorithm 3.5, i.e., Algorithm 3.2, and compared it with several alternative algorithms, also implemented in Julia.The other algorithms are the standard method of obtaining the polar decomposition from the SVD (the implementation makes calls to the LAPACK algorithms for the SVD), the Newton iteration [13,Alg. 8.20], and the QDWH algorithm [20,21].The experiments were done on a MacBook Pro with a 2.5 GHz Intel Core i5 processor.Table 1 gives the timings for 10 5 calls to the different algorithms for normally distributed random inputs.For reproducibility, we have also tested the algorithm for a specific (well conditioned) input: The experiments show a clear advantage in execution time for Algorithm 3.2 over the other algorithms.We observe that the ratio of timings between Algorithm 3.2 and the method based on the SVD is very close to the theoretical value expected from the flop count.

Accuracy
Here we report some tests of accuracy based on a MATLAB implementation of Algorithm 3.5.We compare with the code polar svd from the Matrix Function toolbox [9], which applies the standard method of obtaining the polar decomposition from the SVD.
First, we provide a simple test to illustrate that Algorithm 3. (5.2)  We generated this example as A = Q 1 diag(1, y, y)Q 2 for certain rational orthogonal matrices Q 1 and Q 2 .Hence, for this input, 1 − b = 4y 2 .Therefore, we expect Algorithm 3.2 to struggle when y 2 τ 2 .(For y 2 ≥ τ 2 , Algorithm 3.5 calls Algorithm 3.2, and hence the experimental results coincide.)From [13, Thm.8.9], the relative condition number of the polar factor U is which for this particular input is Table 2 reports the relative forward errors Û −U F / U F = Û −U F / √ 3, where Û is the computed orthogonal factor in the polar decomposition of A while U is a reference result obtained by running the SVD algorithm in high precision arithmetic.The results show that, unlike Algorithm 3.2, Algorithm 3.5 and polar-SVD exhibit forward stable behaviour for all y.
We have performed further experiments comparing the accuracy of Algorithm 3.1, Algorithm 3.5, and polar svd on random matrices with specified singular value distributions.Each experiment was repeated on a sample of 10000 matrices.The results are shown in Tables 3, 4, 5, 6.In each case the first line in the table reports the worst-case relative error Ĥ − H F / H F in the computed symmetric positive semidefinite factor Ĥ for the three codes, where H is a reference result.Note that H is always a well conditioned function of A [13, Thm.8.9].The second line (included in all but the last experiment) reports the worst-case relative error for the computed orthogonal polar factor Û , Û − U F / U F = Û − U F / √ 3. Depending on the conditioning of the problem, given by (5.3), the forward error in U may be large even  for a backward stable method.The last line reports the worst-case backward error A − Û Ĥ F / A F .We observed that, for each of the measures that we analyzed, the average case error was typically 5 to 10 times smaller than the worst case error for each code.
From the results of the experiments we may conclude that the proposed approach yields results of comparable accuracy to polar svd, with an advantage in computation time in the most common situation of an input with large σ 2 .

Conclusions
The standard algorithms in numerical linear algebra are designed to be efficient for large matrices and are not necessarily optimal for small ones.Moreover, standard software relies on libraries such the Basic Linear Algebra Subprograms and LAPACK that are not necessary for small dimensions.Our new algorithm for computing the polar decomposition of a 3 × 3 real matrix, Algorithm 3.5, exploits a connection between orthogonal matrices and quaternions to reduce the problem to computing the dominant eigenvector of a 4 × 4 symmetric matrix.The algorithm has a flop count generally lower than the alternative of obtaining the polar decomposition from the SVD and it is faster than the SVD approach, as well as several other alternatives, in our numerical experiments.Moreover, the algorithm does not rely on library routines for computing eigenvalues or singular values.The algorithm therefore satisfies all the requirements of the Web CSS application that motivated this work, and in the usual case in this application in which the matrix is well conditioned the algorithm reduces to the simpler form of Algorithm 3.2.

Lemma 2 . 6
Let p(x) = det(xI − B) be the characteristic polynomial of B and μ an eigenvalue of B with largest absolute value.Then |p (μ Had we computed both λ 1 and the LDL T factorization of B s in exact arithmetic then we would have had d 44 = 0. (In floating point arithmetic, we have |d 44 | ≤ θu, and in practice |d 44 | is usually much smaller).This suggests that v := L−T e 4 / L−T e 4 2 is already a good estimate of the dominant eigenvector of B. Indeed, for the residual we have 2 can be inaccurate if the condition 1 − b > τ 2 , where b = det B, is not satisfied.Consider the matrix, parametrized by y ∈ [0, 1], 11 a 12 a 13 a 21 a 22 a 23 a 31 a 32 a 33 T .

Table 1
Timings (seconds) for 10 5 calls to Julia implementations of several algorithms

Table 2
Relative forward errors in U for various algorithms, for input(5.2)

Table 3
Singular values prescribed to be equal to 1, 10 −1 , and 10 −2 .Here we are sure that Algorithm 3.5 calls Algorithm 3.2.The condition number of U is κ U ≈ 10.6.

Table 4
Singular values prescribed to be equal to 1, 10 −5 , and 10 −12 .Here, σ 2 is too small for Algorithm 3.2 to work properly.However, ω is small enough that inverse iteration will be applied by Algorithm 3.5.The condition number of U is κ U ≈ 1.15 × 10 5

Table 6
Singular values prescribed to be equal to 1, 0, 0 (rank 1 matrix).Here, U is not uniquely defined so we do not report any forward error for its computation